Introduction
This article presents CDataFile, a class that provides an interface for reading and manipulating CSV (comma-separated values) and other delimited, tabular text files. CDataFile was first published in November 2002. It has since undergone several improvements and enhancements, and was eventually redesigned completely. This implementation has no dependence on MFC and makes heavy use of the Standard Template Library (STL). Throughout the class, you will find several interesting tricks and some creative use of predicates, function objects, and the STL algorithms. Most importantly, you will find a powerful interface designed to be usable by novice and experienced programmers alike.
Why CDataFile?
When I set out to search for a class to read CSV and other text data, I had
these criteria:
- The class had to be very fast in parsing, lookups, and data manipulation.
- The class had to provide a simple (column, row) type interface.
- The class had to be able to efficiently handle data files of 100 MB or more.
To my surprise, I was hard pressed to find anything that met my criteria. I found some solutions that used ODBC and even DOM, but nothing lightweight, simple to use, or very fast. So I decided to write my own, and CDataFile is the result.
Using CDataFile
In order to use CDataFile, you need to add DataFile.cpp and DataFile.h to your project and include the following header file:
#include "DataFile.h"
I have listed the member functions and operators of CDataFile in the tables below. Note that the term variable in this class is synonymous with field or column, and the term sample is synonymous with record or row.
CDataFile Member Functions
Function | Description
AppendData | Appends data to a specified variable.
ClearData | Clears all data from the CDataFile.
CreateVariable | Creates a new variable in the CDataFile.
DeleteVariable | Deletes a variable from a CDataFile.
FromVector | A static function to create a CDataFile from a vector.
GetData | Gets data from the CDataFile.
GetLastError | Gets the last error encountered by the CDataFile.
GetNumberOfSamples | Gets the number of samples contained in a variable.
GetNumberOfVariables | Gets the number of variables in the CDataFile.
GetReadFlags | Gets the current read flags.
GetVariableIndex | Gets the lookup index of a specified variable.
GetVariableName | Gets the name of the variable at a specified location.
ReadFile | Reads the contents of a file and stores it in the CDataFile.
SetData | Sets data in the CDataFile.
SetDelimiter | Sets the delimiter to use for parsing files.
SetReadFlags | Sets the read flags of the CDataFile.
SetReserve | Sets the capacity of contiguous memory for the CDataFile before a reallocation is needed.
WriteFile | Writes the contents of the CDataFile to a file.
CDataFile Operators
CDataFile(void);
CDataFile(const int& dwReadFlags);
CDataFile(const char* szFilename,
const int& dwReadFlags = DF::RF_READ_AS_DOUBLE);
CDataFile(const DF::DF_SELECTION& df_selection);
CDataFile(const CDataFile& df);
Parameters
dwReadFlags
Determines how the data is read and stored in the CDataFile. Can be any combination of the following:
- RF_READ_AS_DOUBLE (takes priority if RF_READ_AS_STRING is also set)
- RF_READ_AS_STRING
- RF_APPEND_DATA (takes priority if RF_REPLACE_DATA is also set)
- RF_REPLACE_DATA
szFilename
The fully qualified path to the data file to read upon construction.
df_selection
A subset of another CDataFile from which the constructor will set initial values.
df
Another CDataFile from which the constructor will set initial values.
Remarks
None of the CDataFile constructors will throw an exception; any exceptions are caught internally. Optionally, the user can call CDataFile::GetLastError() to retrieve error information after a construction failure.
Example
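// Default construction
CDataFile df;
// Construct from a file (reads as double by default)
CDataFile df("C:\\MyData.csv");
// Construct from a file, reading values as strings
CDataFile df("C:\\MyStringData.csv", DF::RF_READ_AS_STRING);
// Construct with read flags only
CDataFile df(DF::RF_READ_AS_STRING | DF::RF_APPEND_DATA);
// Construct from a selection of another CDataFile
CDataFile df1("C:\\MyData.csv");
CDataFile df2(df1(iLeftColumn, iTopRow, iRightColumn, iBottomRow));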
The AppendData() member function appends data to the end of a specified variable. You can append values one at a time, or append the contents of an entire vector. AppendData() will return true if it was successful, or false if an error was encountered.
bool AppendData(const int& iVariable, const char* szValue);
bool AppendData(const int& iVariable, const double& value);
bool AppendData(const int& iVariable, const std::vector<double>& vData);
bool AppendData(const int& iVariable, const std::vector<std::string>& vszData);
Parameters
iVariable
The index of the variable to which the data will be appended.
szValue
The value to append. (Assumes DF::RF_READ_AS_STRING.)
value
The value to append. (Assumes DF::RF_READ_AS_DOUBLE.)
vData
A const reference to a vector containing the values to append. (Assumes DF::RF_READ_AS_DOUBLE.)
vszData
A const reference to a vector containing the values to append. (Assumes DF::RF_READ_AS_STRING.)
Return Value
Returns true if the function was successful, false if an error was encountered.
Remarks
If the user appends string data, it is appended to the internal string vector of the CDataFile, whereas if the user appends values of type double, the data is stored in the class's internal double vector.
Example
if(!df.AppendData(0, 0.2123))
    cout << df.GetLastError();
// v is assumed to be a std::vector<double> filled elsewhere
if(!df.AppendData(4, v))
    cout << df.GetLastError();
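The storage rule described in the remarks above can be pictured with a small stand-in. This is only a sketch of the overload-per-type idea, not the CDataFile source: the Column type, its members, and the Append name are hypothetical.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for one variable (column); not part of CDataFile.
struct Column {
    std::vector<double>      numbers;  // receives values of type double
    std::vector<std::string> strings;  // receives string values

    // Overload chosen when the caller passes a double.
    bool Append(double value) {
        numbers.push_back(value);
        return true;
    }

    // Overload chosen when the caller passes a C string.
    bool Append(const char* szValue) {
        if (szValue == nullptr) return false;  // reject a null pointer
        strings.push_back(szValue);
        return true;
    }

    // Overload that appends every element of a vector of doubles.
    bool Append(const std::vector<double>& vData) {
        numbers.insert(numbers.end(), vData.begin(), vData.end());
        return true;
    }
};
```

The compiler picks the destination vector from the argument type, so the caller never has to say which store is meant.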
The ClearData() member function clears all the data from a CDataFile, reclaims any allocated memory, and zeroes the class's internal buffer size.
void ClearData(void);
Remarks
The class will reclaim a block of contiguous memory equal to the size specified by CDataFile::SetReserve() when reading a new data file; otherwise the internal buffer capacity will be zero. The class will allocate internal buffer storage as needed, but may exhibit the behavior studied in a linked article if the class is required to read excessively large files. Calling CDataFile::SetReserve() with an adequate capacity before CDataFile::ReadFile() will eliminate this behavior.
CDataFile::ClearData() is called automatically when a CDataFile is destroyed.
Example
df.ClearData();
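The benefit of reserving capacity before a large read can be demonstrated with std::vector alone. The sketch below assumes, based on the STL emphasis in the introduction, that the class stores its data in vectors; CountReallocations is a hypothetical helper and reserveCount merely plays the role of the argument to SetReserve().

```cpp
#include <cstddef>
#include <vector>

// Counts how often the vector's capacity changes while n values are appended.
// A capacity change means the storage was reallocated and the data copied.
std::size_t CountReallocations(std::size_t n, std::size_t reserveCount) {
    std::vector<double> data;
    data.reserve(reserveCount);                 // one up-front allocation
    std::size_t reallocations = 0;
    std::size_t lastCapacity = data.capacity();
    for (std::size_t i = 0; i < n; ++i) {
        data.push_back(static_cast<double>(i));
        if (data.capacity() != lastCapacity) {  // storage was reallocated
            ++reallocations;
            lastCapacity = data.capacity();
        }
    }
    return reallocations;
}
```

With an adequate reserve the count is zero. Without one, the vector has to grow repeatedly, and each growth step copies everything stored so far.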
The CreateVariable() member function appends a new variable to the end of your CDataFile. Think of it as adding a new column to the right of your data in Excel. You can create a variable of a predetermined size with an initial value, or you can create a variable from an existing vector.
bool CreateVariable(const char* szName, const double& initial_value,
const int& iSize = 0);
bool CreateVariable(const char* szName, const std::string& initial_value,
const int& iSize = 0);
bool CreateVariable(const char* szName, const std::vector<double>& vData);
bool CreateVariable(const char* szName,
const std::vector<std::string>& vszData);
Parameters
szName
The name you want to assign to the variable. Think of it as a column label in
Excel or a field name in a database.
initial_value
The value you want all of the new data to contain. Think of it as assigning
all the rows in a column this initial value.
iSize
The number of samples (rows or records) you want your variable to
contain.
vData
A vector of type double containing the data to assign to the new variable.
vszData
A vector of type string containing the data to assign to the new variable.
Return Value
Returns true if the function was successful, false if an error was encountered.
Remarks
The default size of a new variable is 0. The user must then call CDataFile::AppendData() to add values to the variable. If the user attempts to call CDataFile::SetData() on a variable of size 0, an error will occur.
Example
if(!df.CreateVariable("MyVar", 0.0, 27))
    cout << df.GetLastError();
// v is assumed to be a std::vector<double> filled elsewhere
if(!df.CreateVariable("My Vector Var", v))
    cout << df.GetLastError();
The DeleteVariable() member function deletes a variable from a CDataFile. Think of it as deleting a column in Excel, or an entire field in a database.
bool DeleteVariable(const int& iVariable);
Parameters
Return Value
Returns true if the function was successful, false if an error was encountered.
Remarks
Use CDataFile::GetLastError() to retrieve any error information.
Example
if(!df.DeleteVariable(3))
cout << df.GetLastError();
The static member function FromVector() creates a CDataFile object from an existing data vector. This function is particularly useful when using CDataFile::operator+ and CDataFile::operator+=.
static CDataFile FromVector(const char* szVariableName,
const std::vector<double>& vData);
static CDataFile FromVector(const char* szVariableName,
const std::vector<std::string>& vData);
Parameters
Return Value
Returns a CDataFile containing the resultant variable.
Remarks
Use this static function when you need to convert a vector of data to a CDataFile.
Example
df += CDataFile::FromVector("MyVectorData", v);
df = CDataFile::FromVector("My V1 Data", v1) +
CDataFile::FromVector("My V2 Data", v2);
Several overloads are provided for the GetData() member function, which allows great flexibility in how the user retrieves the data. You can get data by variable index or by variable name, and obtain either a single sample or all samples.
double GetData(const int& iVariable, const int& iSample);
double GetData(const char* szVariableName, const int& iSample);
int GetData(const int& iVariable, std::vector<double>& rVector);
int GetData(const char* szVariableName, std::vector<double>& rVector);
int GetData(const int& iVariable, const int& iSample, char* lpStr);
int GetData(const char* szVariableName, const int& iSample, char* lpStr);
int GetData(const int& iVariable, const int& iSample, std::string& rStr);
int GetData(const char* szVariableName,
const int& iSample, std::string& rStr);
int GetData(const int& iVariable, std::vector<std::string>& rVector);
int GetData(const char* szVariableName,
std::vector<std::string>& rVector);
Parameters
iVariable
The index of the variable from which to retrieve the data.
iSample
The sample number (record or row, 0-indexed) to retrieve.
rVector
A reference to a vector containing the proper data type to receive the
data.
lpStr
A pointer to a string buffer to receive the data.
rStr
A reference to a std::string that will receive the data.
Return Value
double GetData(const int& iVariable, const int& iSample);
double GetData(const char* szVariableName, const int& iSample);
Returns a value of type double equal to the data at the specified location if successful, DF::ERRORVALUE if an error is encountered.
int GetData(const int& iVariable, std::vector<double>& rVector);
int GetData(const char* szVariableName, std::vector<double>& rVector);
int GetData(const int& iVariable, std::vector<std::string>& rVector);
int GetData(const char* szVariableName,
std::vector<std::string>& rVector);
Returns the new size of rVector if successful, -1 if an error is encountered.
int GetData(const int& iVariable,
const int& iSample, char* lpStr);
int GetData(const char* szVariableName,
const int& iSample, char* lpStr);
int GetData(const int& iVariable,
const int& iSample, std::string& rStr);
int GetData(const char* szVariableName,
const int& iSample, std::string& rStr);
Returns the new length of lpStr or rStr if successful, -1 if an error is encountered.
Remarks
When GetData() is called with parameters of type double, DF::RF_READ_AS_DOUBLE is assumed. When it is called with parameters of type char* or std::string, DF::RF_READ_AS_STRING is assumed.
Example
double d = df.GetData("MyVar", 7);
if(d == DF::ERRORVALUE)
cout << df.GetLastError();
CString szValue = "";
int iLength = 0;
iLength = df.GetData(9, 3, szValue.GetBuffer(0));
szValue.ReleaseBuffer();
if(iLength == -1)
cout << df.GetLastError();
else
{
}
std::vector<double> vData;
int iSize = df.GetData("My Variable 2", vData);
if(iSize == -1)
cout << df.GetLastError();
else
{
}
The GetLastError() member function retrieves information about the last error encountered by a CDataFile.
const char* GetLastError(void) const;
Return Value
Returns a const char* containing information about the last error encountered by the class.
Example
cout << df.GetLastError();
The GetNumberOfSamples() member function reports how many samples are contained in a given variable.
int GetNumberOfSamples(const int& iVariable) const;
Parameters
Return Value
Returns the number of samples if the function was successful, -1 if an error was encountered.
Remarks
Use CDataFile::GetLastError() to retrieve any error information.
Example
int nSamps = df.GetNumberOfSamples(3);
if(nSamps==-1)
cout << df.GetLastError();
The GetNumberOfVariables() member function returns the number of variables currently contained in the CDataFile.
int GetNumberOfVariables(void) const;
Return Value
Returns the number of variables contained in the CDataFile.
Example
int nVars = df.GetNumberOfVariables();
The GetReadFlags() member function returns the current state of the read flags.
int GetReadFlags(void) const;
Return Value
Returns the current read flags.
Remarks
The function returns an int that contains the flags encoded within it. Use the bitwise & operator to determine which flags are actually set.
Example
cout << "Append mode is "
<< (df.GetReadFlags() & DF::RF_APPEND_DATA ? "" : "NOT " )
<< "set!";
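The flag test and the documented priority rules can be sketched as follows. The numeric values are assumptions made for illustration only; the real constants live in namespace DF in DataFile.h and may differ. HasFlag, ReadsAsDouble, and AppendsData are hypothetical helpers.

```cpp
// Hypothetical flag values for illustration; the real DF:: constants may differ.
namespace SketchFlags {
    const int RF_READ_AS_DOUBLE = 1 << 0;
    const int RF_READ_AS_STRING = 1 << 1;
    const int RF_APPEND_DATA    = 1 << 2;
    const int RF_REPLACE_DATA   = 1 << 3;
}

// True if every bit of `flag` is set in `flags`.
inline bool HasFlag(int flags, int flag) {
    return (flags & flag) == flag;
}

// RF_READ_AS_DOUBLE takes priority, so only that bit needs to be tested:
// if it is set, RF_READ_AS_STRING is ignored.
inline bool ReadsAsDouble(int flags) {
    return HasFlag(flags, SketchFlags::RF_READ_AS_DOUBLE);
}

// RF_APPEND_DATA takes priority over RF_REPLACE_DATA in the same way.
inline bool AppendsData(int flags) {
    return HasFlag(flags, SketchFlags::RF_APPEND_DATA);
}
```

Because each flag occupies its own bit, flags combine with | and are tested with & without interfering with one another.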
The GetVariableIndex() member function looks up the index of a variable, given its name and, optionally, other information.
int GetVariableIndex(const char* szVariableName,
const int& iStartingIndex = 0);
int GetVariableIndex(const char* szVariableName,
const char* szSourceFilename, const int& iStartingIndex = 0);
Parameters
szVariableName
The name of the variable for which to find the index.
iStartingIndex
The index from which to begin the search.
szSourceFilename
The name of the file from which the variable originated.
Return Value
Returns the index (0-based) of the variable if the function was
successful, -1 if an error was encountered.
Remarks
GetVariableIndex() will return the first instance of the variable name. If you have variables in your data with the same name, and you need an instance of the variable name other than the one first encountered, you will want to offset your search with iStartingIndex.
You may have data in a CDataFile that comes from different source files (for example, when using DF::RF_APPEND_DATA). In these cases, you may want an instance of a variable from a particular source file, and you would use the overload that takes szSourceFilename.
Example
int iVar = df.GetVariableIndex("MyVar");
if(iVar == -1)
cout << df.GetLastError();
int iVar = df.GetVariableIndex("My Var 2", "C:\\data.csv", 3);
if(iVar == -1)
cout << df.GetLastError();
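The "first instance at or after a starting index" behavior described in the remarks can be sketched with std::find. FindVariableIndex is a hypothetical helper that works on a plain vector of names; it is not the CDataFile implementation.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Returns the index of the first name equal to `name` at or after `start`,
// or -1 if there is no such name or `start` is out of range.
int FindVariableIndex(const std::vector<std::string>& names,
                      const std::string& name, int start = 0) {
    if (start < 0 || static_cast<std::size_t>(start) > names.size())
        return -1;
    std::vector<std::string>::const_iterator it =
        std::find(names.begin() + start, names.end(), name);
    if (it == names.end())
        return -1;
    return static_cast<int>(it - names.begin());
}
```

To reach a later duplicate, pass the previous result plus one as the starting index.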
The GetVariableName() member function looks up the name of a variable (or field) given its index (or 0-based column number).
int GetVariableName(const int& iVariable, char* lpStr);
int GetVariableName(const int& iVariable, std::string& rStr);
Parameters
iVariable
The index of the variable for which to obtain the name.
lpStr
A pointer to a string buffer to receive the name.
rStr
A string to receive the name.
Return Value
Returns the length of the variable name if the function was successful, -1 if an error was encountered.
Remarks
Use CDataFile::GetLastError() to retrieve any error information.
Example
CString szVarName = "";
int iLength = df.GetVariableName(3, szVarName.GetBuffer(0));
szVarName.ReleaseBuffer();
std::string szVar = "";
iLength = df.GetVariableName(3, szVar);
if(iLength==-1)
cout << df.GetLastError();
else
cout << szVar.c_str();
The ReadFile() member function provides an easy way to read a data file by wrapping up all of the file I/O.
bool ReadFile(const char* szFilename);
bool ReadFile(const char* szFilename, const unsigned& dwReadFlags);
Parameters
Return Value
Returns true if the function was successful, false if an error was encountered.
Remarks
Use CDataFile::GetLastError() to retrieve any error information.
Example
if(!df.ReadFile("C:\\test.csv"))
cout << df.GetLastError();
if(!df.ReadFile("C:\\test2.csv", DF::RF_APPEND_DATA))
cout << df.GetLastError();
The SetData() member function sets data values stored in a CDataFile.
bool SetData(const int& iVariable, const int& iSample, const double& value);
bool SetData(const int& iVariable, const int& iSample, const char* szValue);
Parameters
iVariable
The variable for which to set the data.
iSample
The sample for which to set the data.
value
The value that the data will be set to. (Assumes DF::RF_READ_AS_DOUBLE.)
szValue
The value that the data will be set to. (Assumes DF::RF_READ_AS_STRING.)
Return Value
Returns true if the function was successful, false if an error was encountered.
Remarks
Use CDataFile::GetLastError() to retrieve any error information.
Example
if(!df.SetData(3,0,2.121))
cout << df.GetLastError();
The SetDelimiter() member function sets the delimiter that separates values in a CDataFile.
void SetDelimiter(const char* delim);
Parameters
Example
df.SetDelimiter("\t");
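To make the role of the delimiter concrete, here is a minimal sketch of splitting one line of text into fields. SplitLine is a hypothetical helper, not the parser used by CDataFile, and it deliberately ignores quoted fields.

```cpp
#include <string>
#include <vector>

// Splits `line` on every occurrence of `delim`, keeping empty fields.
// Quoted fields are NOT handled; this is a simplification.
std::vector<std::string> SplitLine(const std::string& line,
                                   const std::string& delim) {
    std::vector<std::string> fields;
    if (delim.empty()) {                 // nothing to split on
        fields.push_back(line);
        return fields;
    }
    std::string::size_type begin = 0;
    std::string::size_type end = line.find(delim);
    while (end != std::string::npos) {
        fields.push_back(line.substr(begin, end - begin));
        begin = end + delim.size();      // skip past the delimiter
        end = line.find(delim, begin);
    }
    fields.push_back(line.substr(begin));  // last (possibly empty) field
    return fields;
}
```

Passing "\t" instead of "," is all it takes to handle tab-delimited text with the same routine.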
The SetReserve() member function sets the capacity of a CDataFile.
void SetReserve(const int& nReserve);
Parameters
Example
df.SetReserve(1000000);
The WriteFile() member function simplifies writing a CDataFile to disk.
bool WriteFile(const char* szFilename, const char* szDelim = ",");
Parameters
Return Value
Returns true if the function was successful, false if an error was encountered.
Remarks
Use CDataFile::GetLastError() to retrieve any error information.
Example
if(!df.WriteFile("C:\\test.csv"))
cout << df.GetLastError();
The () operator is provided to easily extract data from a CDataFile.
double operator()(const int& iVariable, const int& iSample);
int operator()(const int& iVariable, const int& iSample, char* lpStr);
DF::DF_SELECTION operator()(const int& left, const int& top,
const int& right, const int& bottom);
Parameters
iVariable
The variable for which to obtain the data.
iSample
The sample for which to obtain the data.
lpStr
A string buffer to receive the data.
left, top, right, bottom
The coordinates from which to obtain the resulting selection. Think of it as highlighting a range of cells in Excel.
Return Value
double operator()(const int& iVariable, const int& iSample);
Returns the value at the specified location if successful, DF::ERRORVALUE if an error was encountered.
int operator()(const int& iVariable, const int& iSample, char* lpStr);
Returns the new length of lpStr if successful, -1 if an error is encountered.
DF::DF_SELECTION operator()(...);
Returns a selection (DF_SELECTION) containing the values in the specified range.
Remarks
Use CDataFile::GetLastError() to retrieve any error information.
Example
double d = df(2,4);
CString szData = "";
int iLength = df(1,7,szData.GetBuffer(0));
szData.ReleaseBuffer();
CDataFile dfNew(df(0,0,9,120));
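The range form of operator() can be pictured as copying a rectangular block out of column-major storage. The sketch below is hypothetical: Table and Select are stand-ins, and treating all four coordinates as inclusive is an assumption, since the text does not say how the bounds are interpreted.

```cpp
#include <cstddef>
#include <vector>

typedef std::vector<std::vector<double> > Table;  // table[column][row]

// Copies columns left..right and rows top..bottom (inclusive, by assumption).
// Any out-of-range request yields an empty table.
Table Select(const Table& table, int left, int top, int right, int bottom) {
    Table result;
    if (left < 0 || top < 0 || right < left || bottom < top)
        return result;
    if (static_cast<std::size_t>(right) >= table.size())
        return result;
    for (int col = left; col <= right; ++col) {
        const std::vector<double>& source = table[col];
        if (static_cast<std::size_t>(bottom) >= source.size())
            return Table();              // row range does not fit this column
        result.push_back(std::vector<double>(source.begin() + top,
                                             source.begin() + bottom + 1));
    }
    return result;
}
```

The result is itself a small table, which mirrors how a selection can be used to construct or assign a new CDataFile.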
The [][] operator is provided to easily extract data of type double from a CDataFile.
Return Value
Returns a reference to the data at the specified location.
Remarks
This operator is only valid for DF::RF_READ_AS_DOUBLE. Since an actual reference is returned, you are able to assign values as well as read them.
Example
double d = df[2][4];
df[9][0] = d;
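One common way to support the df[variable][sample] syntax with a writable result is to have the first [] return a reference to an inner vector, whose own [] then returns a double&. The Grid class below is a hypothetical stand-in that illustrates the idea; how CDataFile implements the operator internally is not stated in this article.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in showing how grid[variable][sample] can yield a
// reference that supports both reading and assignment.
class Grid {
public:
    Grid(std::size_t variables, std::size_t samples)
        : columns_(variables, std::vector<double>(samples, 0.0)) {}

    // The first [] selects the variable; std::vector's own [] then selects
    // the sample and returns double&. No bounds checking is performed.
    std::vector<double>& operator[](std::size_t variable) {
        return columns_[variable];
    }
    const std::vector<double>& operator[](std::size_t variable) const {
        return columns_[variable];
    }

private:
    std::vector<std::vector<double> > columns_;
};
```

Because a reference is returned rather than a copy, an expression such as grid[9][0] = d writes directly into the stored data.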
The + operator is provided to combine the contents of multiple CDataFile objects.
CDataFile operator+ (const CDataFile&) const;
Example
df = df2 + df3 + df4;
The += operator is provided to append a CDataFile to another CDataFile.
CDataFile& operator+=(const CDataFile&);
Example
df += df2;
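The relationship between +, +=, and FromVector() can be sketched with a small stand-in. Columns is hypothetical, and the sketch assumes that combining two objects means appending the variables (columns) of the right-hand side after those of the left; the article does not spell out the exact semantics.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in: a set of named columns.
struct Columns {
    std::vector<std::string>          names;
    std::vector<std::vector<double> > data;

    // Builds a one-column set from a vector, in the spirit of FromVector().
    static Columns FromVector(const std::string& name,
                              const std::vector<double>& values) {
        Columns c;
        c.names.push_back(name);
        c.data.push_back(values);
        return c;
    }

    // Appends the columns of `other` after the columns of this set.
    Columns& operator+=(const Columns& other) {
        if (&other == this) {            // self-append: work from a copy
            Columns copy(other);
            return *this += copy;
        }
        names.insert(names.end(), other.names.begin(), other.names.end());
        data.insert(data.end(), other.data.begin(), other.data.end());
        return *this;
    }
};

// operator+ expressed through operator+=, a common C++ idiom.
inline Columns operator+(Columns lhs, const Columns& rhs) {
    lhs += rhs;
    return lhs;
}
```

Writing + in terms of += keeps the two operators consistent and lets chained expressions such as a + b + c work without extra code.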
The = operator sets the internals of one CDataFile equal to those of another CDataFile.
CDataFile& operator =(const CDataFile&);
CDataFile& operator =(const DF::DF_SELECTION&);
Example
df2 = df1;
df3 = df(0,0,4,120);
The << operator writes the contents of a CDataFile to a stream.
std::ostream& operator << (std::ostream&, const CDataFile&);
std::ostream& operator << (std::ostream&, const DF::DF_SELECTION&);
Example
outStream << df;
outStream << df(0,0,5,15);
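As a rough illustration of what streaming tabular data involves, the sketch below writes column-major data as one delimited text line per row. Table, WriteDelimited, and ToDelimitedString are hypothetical names, and the output layout (no header line, empty cells for short columns) is an assumption rather than the format CDataFile produces.

```cpp
#include <cstddef>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

typedef std::vector<std::vector<double> > Table;  // table[column][row]

// Writes one text line per row; a column that is too short leaves its
// cell empty.
void WriteDelimited(std::ostream& out, const Table& table,
                    const std::string& delim) {
    std::size_t rows = 0;
    for (std::size_t c = 0; c < table.size(); ++c)
        if (table[c].size() > rows) rows = table[c].size();

    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < table.size(); ++c) {
            if (c > 0) out << delim;
            if (r < table[c].size()) out << table[c][r];
        }
        out << '\n';
    }
}

// Convenience wrapper that captures the output in a string.
std::string ToDelimitedString(const Table& table, const std::string& delim) {
    std::ostringstream out;
    WriteDelimited(out, table, delim);
    return out.str();
}
```

Taking a std::ostream& means the same routine can target a file, a string, or the console.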
The >> operator reads the contents of a CDataFile from a stream.
std::istream& operator >> (std::istream&, CDataFile&);
Example
inStream >> df;