For C/C++ programmers, the standard ways to write and read files are the C file API and C++ file streams. Because of their cryptic and unintuitive class and function names, I have always found C++ file streams too difficult to use, whereas the C file API is not type-safe. In this article, I introduce my own type-safe file library, which is based on the C file API and unifies the text file and binary file APIs in an almost seamless way. Some differences between the text and binary APIs remain where it makes no sense to make them similar. The library is meant to write and read structured data, that is, integers, booleans, floats, strings and so on. It can also write and read unstructured data (for example, C++ source files) with its lower-level classes, but that is not the focus of the library or of this article. This article is meant to teach readers how to easily access structured data in files.
For .NET people who happened to chance upon this article, you can stop reading now. This article is about native C++, not .NET. I tried to write a C# version of my file library but failed, because C# does not allow developers to keep a copy of a passed-by-reference POD argument after the method returns.
In this section, we are going to look at how to write and read text files. Let us begin by learning how to write an integer and a double to a text file.
using namespace Elmax;
xTextWriter writer;
std::wstring file = L"Unicode.txt";
if(writer.Open(file, FT_UNICODE, NEW))
{
int i = 25698;
double d = 1254.69;
writer.Write(L"{0},{1}", i, d);
writer.Close();
}
The code above tries to open a new Unicode file and, upon success, writes an integer and a double value and closes the file. Other text file types supported are ASCII, Big Endian Unicode and UTF-8. Though not shown in the code, the user should check the boolean return value of Write. xTextWriter delegates its file work to AsciiWriter, UnicodeWriter, BEUnicodeWriter and UTF8Writer. Likewise, xTextReader delegates its file work to AsciiReader, UnicodeReader, BEUnicodeReader and UTF8Reader. These file writers write the BOM on their first write, while the readers read the BOM automatically if it is present. For readers who are not familiar with the term, BOM is an acronym for byte order mark: a Unicode character used to signal the endianness (byte order) of a text file or stream. The BOM is optional, but writing it is generally accepted as good practice. The reader might ask why I wrote a Unicode file library instead of picking one up from CodeProject. I decided to write my own Unicode file classes because most of those featured on CodeProject make use of the MFC CStdioFile class, which does not work on other platforms. Let us now look at how to read the same data we have just written.
using namespace Elmax;
xTextReader reader;
std::wstring file = L"Unicode.txt";
if(reader.Open(file))
{
if(reader.IsEOF()==false)
{
int i2 = 0;
double d2 = 0.0;
StrtokStrategy strat(L",");
reader.SetSplitStrategy(&strat);
size_t totalRead = reader.ReadLine(i2, d2);
}
reader.Close();
}
The reader opens the same file and sets its text split strategy. In this case, it is set to use strtok with a comma as its delimiter. Other split strategies include Boost and Regex, but strtok is highly recommended because it is fast. We have seen how to write and read an integer and a double. Writing and reading strings is no different, but special care must be taken with delimiters that may appear inside the string. That means we must escape the string when writing and unescape it when reading. There is a function, ReplaceAll, in the StrUtil class which users can use to escape and unescape their strings. Note: Setting a split strategy no longer applies to version 2.0.2, which uses streams internally: you need not set a splitter strategy but you must call SetDelimiter instead.
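The escaping step described above can be sketched in standalone C++. This is only an illustration: ReplaceAllSketch, EscapeComma and UnescapeComma are hypothetical names, the escape sequences "&comma;" and "&amp;" are an arbitrary choice, and the signature of the library's own StrUtil::ReplaceAll is not shown in this article, so it is not used here.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper for this sketch (not the library's StrUtil::ReplaceAll):
// replaces every occurrence of 'from' in 'text' with 'to'.
inline std::wstring ReplaceAllSketch(std::wstring text,
                                     const std::wstring& from,
                                     const std::wstring& to)
{
    assert(!from.empty());
    std::size_t pos = 0;
    while ((pos = text.find(from, pos)) != std::wstring::npos)
    {
        text.replace(pos, from.size(), to);
        pos += to.size(); // skip past the replacement so it is never rescanned
    }
    return text;
}

// Escapes the comma delimiter before the string is written.
// The escape character '&' is escaped first so that unescaping is unambiguous.
inline std::wstring EscapeComma(const std::wstring& s)
{
    return ReplaceAllSketch(ReplaceAllSketch(s, L"&", L"&amp;"), L",", L"&comma;");
}

// Reverses EscapeComma after the string is read: the steps run in the opposite order.
inline std::wstring UnescapeComma(const std::wstring& s)
{
    return ReplaceAllSketch(ReplaceAllSketch(s, L"&comma;", L","), L"&amp;", L"&");
}
```

An escaped string contains no comma, so it can sit safely between comma delimiters, and UnescapeComma(EscapeComma(s)) gives back the original text.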
There is an overloaded Open function which takes the Unicode file type as an additional parameter. However, it always respects the BOM if one is detected. Only in the absence of a BOM will xTextReader open the file according to the Unicode file type which the user specified.
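To make the BOM check concrete, here is a minimal standalone sketch of how the leading bytes of a file can be inspected. BomKind and DetectBom are hypothetical names made up for this example and are not the library's API. The sketch covers only the UTF-8 and the two UTF-16 byte order marks and ignores UTF-32.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical names for this sketch; not the library's own types.
enum class BomKind { None, Utf8, Utf16LE, Utf16BE };

// Inspects the first bytes of a buffer and reports which BOM, if any, is present.
// bomLength receives the number of bytes the BOM occupies (0 when it is absent).
inline BomKind DetectBom(const unsigned char* data, std::size_t size, std::size_t& bomLength)
{
    assert(data != nullptr || size == 0);
    bomLength = 0;
    if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
    {
        bomLength = 3;
        return BomKind::Utf8;      // EF BB BF
    }
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE)
    {
        bomLength = 2;
        return BomKind::Utf16LE;   // FF FE: little endian
    }
    if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF)
    {
        bomLength = 2;
        return BomKind::Utf16BE;   // FE FF: big endian
    }
    return BomKind::None;          // caller falls back to the user-specified type
}
```

When DetectBom returns BomKind::None, a reader would fall back to the file type supplied by the caller, which mirrors the behaviour described above.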
Writing a binary file is similar to writing a text file, except that the user does not have to write delimiters between the data.
using namespace Elmax;
xBinaryWriter writer;
std::wstring file = L"Binary.bin";
if(writer.Open(file))
{
int i = 25698;
double d = 1254.69;
writer.Write(i, d);
writer.Close();
}
Write returns the number of values successfully written. As shown below, reading is very similar to writing.
using namespace Elmax;
xBinaryReader reader;
std::wstring file = L"Binary.bin";
if(reader.Open(file))
{
if(reader.IsEOF()==false)
{
int i2 = 0;
double d2 = 0.0;
size_t totalRead = reader.Read(i2, d2);
}
reader.Close();
}
Writing strings in binary usually involves writing the string length beforehand; before reading the string, we need to read the length and allocate the array first.
using namespace Elmax;
xBinaryWriter writer;
std::wstring file = GetTempPath(L"Binary.bin");
if(writer.Open(file))
{
std::string str = "Coding Monkey";
double d = 1254.69;
writer.Write(str.size(), str, d);
writer.Close();
}
xBinaryReader reader;
if(reader.Open(file))
{
if(reader.IsEOF()==false)
{
size_t len = 0;
double d2 = 0.0;
StrArray arr;
size_t totalRead = reader.Read(len);
totalRead = reader.Read(arr.MakeArray(len), d2);
std::string str2 = arr.GetPtr();
}
reader.Close();
}
We use StrArray to read a char array. We read its length first and use the length to allocate the array through the MakeArray method. It is possible to read the length and make the array at the same time, using DeferredMake. Unlike MakeArray, DeferredMake does not allocate the array immediately: the allocation is delayed until it is the array's turn to be read from the file. DeferredMake captures the address of len, so when len is updated with the length read from the file, the array sees the new length as well. See below.
xBinaryReader reader;
if(reader.Open(file))
{
if(reader.IsEOF()==false)
{
size_t len = 0;
double d2 = 0.0;
StrArray arr;
size_t totalRead = reader.Read(len, arr.DeferredMake(len), d2);
std::string str2 = arr.GetPtr();
}
reader.Close();
}
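The deferred allocation idea can be shown in isolation. The class below is a hypothetical stand-in, not the library's StrArray: it only demonstrates how keeping the address of the length variable lets the allocation happen later, after the length has been filled in.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the idea behind StrArray::DeferredMake.
class DeferredCharArray
{
public:
    DeferredCharArray() : m_pLen(nullptr) {}

    // Remembers where the length will be stored; allocates nothing yet.
    DeferredCharArray& DeferredMake(const std::size_t& len)
    {
        m_pLen = &len;
        return *this;
    }

    // Called when it is this array's turn to be read: by now the length
    // variable has been updated, so the right amount can be allocated.
    char* Allocate()
    {
        assert(m_pLen != nullptr);
        m_data.assign(*m_pLen + 1, '\0'); // one extra element keeps the text null-terminated
        return m_data.data();
    }

    std::size_t Capacity() const { return m_data.size(); }
    const char* GetPtr() const { return m_data.data(); }

private:
    const std::size_t* m_pLen; // address of the caller's length variable
    std::vector<char> m_data;  // owns the memory, released automatically
};
```

The caller's length variable must outlive the read call, because only its address is stored.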
It is possible to write a structure as an array. This is, however, not advisable because different platforms may pad an unknown number of bytes between structure members for performance reasons. For portability, it is recommended to write out every structure member rather than writing the structure as a flat array. If you still want to do it, specify no padding for your structure (below).
#pragma pack(push, 1) // exact fit - no padding
struct MyStruct
{
char b;
int a;
};
#pragma pack(pop)
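The effect of padding can be checked directly with sizeof. The sketch below compares a packed and an unpacked copy of the same structure and shows the recommended member-by-member alternative, writing into a memory buffer instead of a file. PaddedStruct, PackedStruct and SerializeMembers are names made up for this example, and the sketch assumes a compiler that supports #pragma pack (Visual C++, GCC and Clang do).

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Same members as MyStruct, laid out with the compiler's default alignment.
struct PaddedStruct
{
    char b;
    int a;
};

#pragma pack(push, 1) // exact fit - no padding
struct PackedStruct
{
    char b;
    int a;
};
#pragma pack(pop)

static_assert(sizeof(PackedStruct) == sizeof(char) + sizeof(int),
              "a packed struct should contain no padding bytes");

// The portable alternative: copy each member individually, so the layout
// of the output does not depend on how the compiler pads the struct.
inline std::vector<unsigned char> SerializeMembers(const PaddedStruct& s)
{
    std::vector<unsigned char> out(sizeof(s.b) + sizeof(s.a));
    std::memcpy(out.data(), &s.b, sizeof(s.b));
    std::memcpy(out.data() + sizeof(s.b), &s.a, sizeof(s.a));
    return out;
}
```

On common desktop compilers, sizeof(PaddedStruct) is larger than sizeof(PackedStruct) because padding is inserted after b. The serialized form has the packed size either way.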
WStrArray is available to read a wchar_t array. However, it is not recommended to write a std::wstring and use WStrArray to read it if you want to keep your file format portable across different OSes, because the size of wchar_t differs between Windows (2 bytes) and Linux and Mac OSX (4 bytes). We will explore this issue in a later section. Note: The text file API does not have this problem, as conversions are in place to handle it automatically. The workaround, if the user needs to write Unicode strings, is to write UTF-8 strings. Another option is to use the BaseArray class to write 16-bit strings. There are two 16-bit encodings for Unicode, namely UCS-2 and UTF-16. A UCS-2 unit is always 16 bits and can only represent code points in the Basic Multilingual Plane (U+0000 to U+FFFF). UTF-16 can encode all Unicode code points, but a code point may take one or two 16-bit units. For some use cases, UCS-2 is sufficient to store text in the language of choice. UTF-16 is able to store everything that is Unicode, but the tradeoff is the conversion time and the need to take note of the potential difference in text length before and after conversion.
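The difference in length between code points and UTF-16 units comes from surrogate pairs. The function below is a minimal sketch of the standard UTF-16 encoding rule and is not taken from the library. AppendUtf16 is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Appends one Unicode code point to 'out' as UTF-16 code units.
// Returns the number of 16-bit units written (1 or 2), or 0 if the
// code point is not encodable (a surrogate value or above U+10FFFF).
inline std::size_t AppendUtf16(char32_t cp, std::vector<char16_t>& out)
{
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0; // reserved for surrogates, not a valid code point
    if (cp > 0x10FFFF)
        return 0; // outside the Unicode range
    if (cp < 0x10000)
    {
        // Basic Multilingual Plane: one unit, identical to UCS-2.
        out.push_back(static_cast<char16_t>(cp));
        return 1;
    }
    // Supplementary planes: split the 20-bit offset into a surrogate pair.
    cp -= 0x10000;
    out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));   // high surrogate
    out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF))); // low surrogate
    return 2;
}
```

A string of N code points can therefore take anywhere between N and 2N UTF-16 units, which is why a stored length must state clearly which of the two it counts.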
xBinaryWriter and xBinaryReader also provide Seek and GetCurrPos for file seeking (a common operation in binary file parsing).
xTextWriter and xTextReader make use of DataType and DataTypeRef respectively to do the conversion between data types and strings. Basically, this library depends on the implicit conversion of Plain Old Data (POD) to a DataType object to work. xTextWriter has many overloaded Write and WriteLine functions which differ by the number of DataType parameters. WriteLine simply adds a linefeed (LF) after writing the string. The Write below has five DataType parameters.
bool xTextWriter::Write( const wchar_t* fmt, DataType D1, DataType D2,
DataType D3, DataType D4, DataType D5 )
{
if(pWriter!=NULL)
{
std::wstring str = StrUtilRef::Format(fmt, D1, D2, D3, D4, D5);
return pWriter->Write(str);
}
return false;
}
DataType consists of many overloaded constructors which convert the Plain Old Data (POD) to a string and store it in a string member (m_str).
namespace Elmax
{
class DataType
{
public:
~DataType(void);
DataType( int i );
DataType( unsigned int ui );
DataType( const ELMAX_INT64& i64 );
DataType( const unsigned ELMAX_INT64& ui64 );
DataType( float f );
DataType( const double& d );
DataType( const std::string& s );
DataType( const std::wstring& ws );
DataType( const char* pc );
DataType( const wchar_t* pwc );
DataType( char c );
DataType( unsigned char c );
DataType( wchar_t wc );
std::wstring& ToString() { return m_str; }
protected:
std::wstring m_str;
};
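The conversion trick can be tried out in miniature. The sketch below is not the library's code: TextValue, Collect and FormatSketch are hypothetical names, only three argument types are supported, and placeholders are limited to the single-digit forms {0} to {9}. It shows how non-explicit converting constructors turn each argument into a string before the format placeholders are substituted.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical miniature of the DataType idea: every supported type has a
// converting constructor that stores its textual form.
class TextValue
{
public:
    TextValue(int i) : m_str(std::to_wstring(i)) {}
    TextValue(const wchar_t* s) : m_str(s) {}
    TextValue(const std::wstring& s) : m_str(s) {}
    const std::wstring& ToString() const { return m_str; }
private:
    std::wstring m_str;
};

// Base overload: nothing left to convert.
inline void Collect(std::vector<std::wstring>&) {}

// Converts the first argument to text, then handles the remaining ones.
template<typename T, typename... Args>
void Collect(std::vector<std::wstring>& values, const T& first, const Args&... rest)
{
    values.push_back(TextValue(first).ToString());
    Collect(values, rest...);
}

// Replaces {0}..{9} in fmt with the textual form of the matching argument.
// Placeholders without a matching argument are copied through unchanged.
template<typename... Args>
std::wstring FormatSketch(const std::wstring& fmt, const Args&... args)
{
    std::vector<std::wstring> values;
    Collect(values, args...);
    std::wstring result;
    for (std::size_t i = 0; i < fmt.size(); ++i)
    {
        if (i + 2 < fmt.size() && fmt[i] == L'{' &&
            fmt[i + 1] >= L'0' && fmt[i + 1] <= L'9' && fmt[i + 2] == L'}')
        {
            const std::size_t index = static_cast<std::size_t>(fmt[i + 1] - L'0');
            if (index < values.size())
            {
                result += values[index];
                i += 2; // skip the rest of the placeholder
                continue;
            }
        }
        result += fmt[i];
    }
    return result;
}
```

For example, FormatSketch(L"{0},{1}", 25698, L"abc") produces the text 25698,abc.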
Here is the C++11 variadic template version of Write, which supports an arbitrary number of arguments. You need to download and install the Visual C++ Compiler November 2012 CTP to compile the code. Note: There is much less code because all those overloaded functions no longer have to be written.
bool Write( const wchar_t* str )
{
if(pWriter!=nullptr)
{
return pWriter->Write(std::wstring(str));
}
return false;
}
template<typename... Args>
bool Write( const wchar_t* fmt, Args&... args )
{
std::wstring str = StrUtilRef::Format(std::wstring(fmt), 0, args...);
if(pWriter!=nullptr)
{
return pWriter->Write(str);
}
return false;
}
As mentioned earlier, xTextReader makes use of DataTypeRef to do the conversion from string to Plain Old Data (POD). xTextReader has 10 overloaded Read and ReadLine functions which differ only by the number of DataTypeRef parameters. The ReadLine shown below has 5 DataTypeRef parameters.
size_t xTextReader::ReadLine( DataTypeRef D1, DataTypeRef D2, DataTypeRef D3, DataTypeRef D4,
DataTypeRef D5 )
{
if(pReader!=NULL)
{
std::wstring text;
bool b = pReader->ReadLine(text);
if(b)
{
StrUtilRef strUtil;
strUtil.SetSplitStrategy(m_pSplitStrategy);
return strUtil.Split(text.c_str(), D1, D2, D3, D4, D5);
}
}
return 0;
}
size_t StrUtilRef::Split( const std::wstring& StrToExtract,
DataTypeRef& D1, DataTypeRef& D2, DataTypeRef& D3,
DataTypeRef& D4, DataTypeRef& D5 )
{
std::vector<DataTypeRef*> vecDTR;
vecDTR.push_back(&D1);
vecDTR.push_back(&D2);
vecDTR.push_back(&D3);
vecDTR.push_back(&D4);
vecDTR.push_back(&D5);
assert( m_pSplitStrategy );
return m_pSplitStrategy->Extract( StrToExtract, vecDTR );
}
size_t StrtokStrategy::Extract(
const std::wstring& StrToExtract,
std::vector<Elmax::DataTypeRef*> vecDTR )
{
std::vector<std::wstring> vecSplit;
const size_t size = StrToExtract.size()+1;
wchar_t* pszToExtract = new wchar_t[size];
wmemset( pszToExtract, 0, size );
Wcscpy( pszToExtract, StrToExtract.c_str(), size );
wchar_t *pszContext = 0;
wchar_t *pszSplit = 0;
pszSplit = wcstok( pszToExtract, m_sDelimit.c_str() );
while( NULL != pszSplit )
{
size_t len = wcslen(pszSplit);
if(pszSplit[len-1]==65535&&vecSplit.size()==vecDTR.size()-1) pszSplit[len-1] = L'\0';
vecSplit.push_back(std::wstring( pszSplit ) );
pszSplit = wcstok( NULL, m_sDelimit.c_str() );
}
delete [] pszToExtract;
size_t fail = 0;
for( size_t i=0; i<vecDTR.size(); ++i )
{
if( i < vecSplit.size() )
{
if( false == vecDTR[i]->ConvStrToType( vecSplit[i] ) )
++fail;
}
else
break;
}
return vecSplit.size()-fail;
}
DataTypeRef keeps a big union to store the address of each POD parameter as the destination for the result.
namespace Elmax
{
class DataTypeRef
{
public:
~DataTypeRef(void);
union UNIONPTR
{
int* pi;
unsigned int* pui;
short* psi;
unsigned short* pusi;
ELMAX_INT64* pi64;
unsigned ELMAX_INT64* pui64;
float* pf;
double* pd;
std::string* ps;
std::wstring* pws;
char* pc;
unsigned char* puc;
wchar_t* pwc;
};
enum DTR_TYPE
{
DTR_INT,
DTR_UINT,
DTR_SHORT,
DTR_USHORT,
DTR_INT64,
DTR_UINT64,
DTR_FLOAT,
DTR_DOUBLE,
DTR_STR,
DTR_WSTR,
DTR_CHAR,
DTR_UCHAR,
DTR_WCHAR
};
DataTypeRef( int& i ) { m_ptr.pi = &i; m_type = DTR_INT; }
DataTypeRef( unsigned int& ui ) { m_ptr.pui = &ui; m_type = DTR_UINT; }
DataTypeRef( short& si ) { m_ptr.psi = &si; m_type = DTR_SHORT; }
DataTypeRef( unsigned short& usi ) { m_ptr.pusi = &usi; m_type = DTR_USHORT;}
DataTypeRef( ELMAX_INT64& i64 ) { m_ptr.pi64 = &i64; m_type = DTR_INT64; }
DataTypeRef( unsigned ELMAX_INT64& ui64 ){ m_ptr.pui64 = &ui64; m_type = DTR_UINT64;}
DataTypeRef( float& f ) { m_ptr.pf = &f; m_type = DTR_FLOAT; }
DataTypeRef( double& d ) { m_ptr.pd = &d; m_type = DTR_DOUBLE;}
DataTypeRef( std::string& s ) { m_ptr.ps = &s; m_type = DTR_STR; }
DataTypeRef( std::wstring& ws ) { m_ptr.pws = &ws; m_type = DTR_WSTR; }
DataTypeRef( char& c ) { m_ptr.pc = &c; m_type = DTR_CHAR; }
DataTypeRef( unsigned char& uc ) { m_ptr.puc = &uc; m_type = DTR_UCHAR; }
DataTypeRef( wchar_t& wc ) { m_ptr.pwc = &wc; m_type = DTR_WCHAR; }
bool ConvStrToType( const std::string& Str );
bool ConvStrToType( const std::wstring& Str );
DTR_TYPE m_type;
UNIONPTR m_ptr;
};
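The tagged-pointer technique can be demonstrated with a two-type miniature. ValueRef and SplitInto below are hypothetical names, and the conversion uses std::wistringstream purely for brevity, so this is a sketch of the idea rather than the library's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical two-type miniature of DataTypeRef: it remembers the address
// and the type of the destination variable.
class ValueRef
{
public:
    ValueRef(int& i) : m_type(Int) { m_ptr.pi = &i; }
    ValueRef(double& d) : m_type(Double) { m_ptr.pd = &d; }

    // Parses the text into the destination variable; returns false on failure.
    bool ConvStrToType(const std::wstring& text) const
    {
        std::wistringstream is(text);
        switch (m_type)
        {
        case Int:    is >> *m_ptr.pi; break;
        case Double: is >> *m_ptr.pd; break;
        }
        return !is.fail();
    }

private:
    enum Type { Int, Double };
    union { int* pi; double* pd; } m_ptr; // only the member named by m_type is valid
    Type m_type;
};

// Splits 'line' on 'delimiter' and converts each token into the matching
// destination. Returns the number of successful conversions.
inline std::size_t SplitInto(const std::wstring& line, wchar_t delimiter,
                             const std::vector<ValueRef>& targets)
{
    std::size_t converted = 0;
    std::wistringstream is(line);
    std::wstring token;
    for (std::size_t i = 0; i < targets.size() && std::getline(is, token, delimiter); ++i)
    {
        if (targets[i].ConvStrToType(token))
            ++converted;
    }
    return converted;
}
```

Because each ValueRef stores only a pointer and a tag, the destinations must stay alive until the split has finished, which is the same constraint the library places on its callers.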
The C++11 variadic template version below calls ReadArg. The first ReadArg is the base function which terminates the recursion of its variadic sibling. Please note that this is not recursion in the traditional sense, because the function is not actually calling itself: it is calling a different function with the same name but a different number of arguments.
void ReadArg(std::vector<DataTypeRef*>& vec)
{
}
template<typename T, typename... Args>
void ReadArg(std::vector<DataTypeRef*>& vec, T& t, Args&... args)
{
vec.push_back(new DataTypeRef(t));
ReadArg(vec, args...);
}
template<typename... Args>
size_t Read( size_t len, Args&... args )
{
if(pReader!=nullptr)
{
std::wstring text;
bool b = pReader->Read(text, len);
if(b)
{
std::vector<DataTypeRef*> vec;
ReadArg(vec, args...);
size_t ret = m_pSplitStrategy->Extract(text, vec);
for(size_t i=0; i<vec.size(); ++i)
{
delete vec[i];
}
vec.clear();
return ret;
}
}
return 0;
}
xBinaryWriter makes use of BinaryTypeRef. The overloaded Write functions differ by the number of parameters. xBinaryWriter has no WriteLine function. The Write function shown below has two BinaryTypeRef parameters.
size_t xBinaryWriter::Write( BinaryTypeRef D1, BinaryTypeRef D2 )
{
size_t totalWritten = 0;
if(fp!=NULL)
{
if(D1.m_type != BinaryTypeRef::DTR_STR &&
D1.m_type != BinaryTypeRef::DTR_WSTR && D1.m_type != BinaryTypeRef::DTR_BASEARRAY)
{
size_t len = fwrite(D1.GetAddress(), D1.size, 1, fp);
if(len==1)
++totalWritten;
}
else
{
size_t len = fwrite(D1.GetAddress(), D1.elementSize, D1.arraySize, fp);
if(len==D1.arraySize)
++totalWritten;
}
if(D2.m_type != BinaryTypeRef::DTR_STR && D2.m_type
!= BinaryTypeRef::DTR_WSTR && D2.m_type != BinaryTypeRef::DTR_BASEARRAY)
{
size_t len = fwrite(D2.GetAddress(), D2.size, 1, fp);
if(len==1)
++totalWritten;
}
else
{
size_t len = fwrite(D2.GetAddress(), D2.elementSize, D2.arraySize, fp);
if(len==D2.arraySize)
++totalWritten;
}
}
if(totalWritten != 2)
{
errNum = ELMAX_WRITE_ERROR;
err = StrUtil::Format(L"{0}: Less than 2 elements are written! ({1} elements written)", GetErrorMsg(errNum), totalWritten);
if(enableException)
throw new std::runtime_error(StrUtil::ConvToString(err));
}
return totalWritten;
}
BinaryTypeRef keeps a union to store the address of the POD. No conversion to text is necessary: the POD is written as it is into the binary file.
namespace Elmax
{
class BinaryTypeRef
{
public:
~BinaryTypeRef(void);
union UNIONPTR
{
const int* pi;
const unsigned int* pui;
const short* psi;
const unsigned short* pusi;
const ELMAX_INT64* pi64;
const unsigned ELMAX_INT64* pui64;
const float* pf;
const double* pd;
std::string* ps;
const std::wstring* pws;
const char* pc;
const unsigned char* puc;
const wchar_t* pwc;
const char* arr;
};
enum DTR_TYPE
{
DTR_INT,
DTR_UINT,
DTR_SHORT,
DTR_USHORT,
DTR_INT64,
DTR_UINT64,
DTR_FLOAT,
DTR_DOUBLE,
DTR_STR,
DTR_WSTR,
DTR_CHAR,
DTR_UCHAR,
DTR_WCHAR,
DTR_BASEARRAY
};
BinaryTypeRef( const int& i )
{ m_ptr.pi = &i; m_type = DTR_INT; size=sizeof(i); }
BinaryTypeRef( const unsigned int& ui )
{ m_ptr.pui = &ui; m_type = DTR_UINT; size=sizeof(ui); }
BinaryTypeRef( const short& si )
{ m_ptr.psi = &si; m_type = DTR_SHORT; size=sizeof(si); }
BinaryTypeRef( const unsigned short& usi )
{ m_ptr.pusi = &usi; m_type = DTR_USHORT; size=sizeof(usi); }
BinaryTypeRef( const ELMAX_INT64& i64 )
{ m_ptr.pi64 = &i64; m_type = DTR_INT64; size=sizeof(i64); }
BinaryTypeRef( const unsigned ELMAX_INT64& ui64 )
{ m_ptr.pui64 = &ui64; m_type = DTR_UINT64; size=sizeof(ui64); }
BinaryTypeRef( const float& f )
{ m_ptr.pf = &f; m_type = DTR_FLOAT; size=sizeof(f); }
BinaryTypeRef( const double& d )
{ m_ptr.pd = &d; m_type = DTR_DOUBLE; size=sizeof(d); }
BinaryTypeRef( std::string& s )
{ m_ptr.ps = &s; m_type = DTR_STR; elementSize=sizeof(char);size=s.length();
arraySize=s.length();}
BinaryTypeRef( const std::wstring& ws )
{ m_ptr.pws = &ws; m_type = DTR_WSTR; elementSize=sizeof(wchar_t);
size=ws.length()*sizeof(wchar_t); arraySize=ws.length();}
BinaryTypeRef( const char& c )
{ m_ptr.pc = &c; m_type = DTR_CHAR; size=sizeof(c); }
BinaryTypeRef( const unsigned char& uc )
{ m_ptr.puc = &uc; m_type = DTR_UCHAR; size=sizeof(uc); }
BinaryTypeRef( const wchar_t& wc )
{ m_ptr.pwc = &wc; m_type = DTR_WCHAR; size=sizeof(wc); }
BinaryTypeRef( const BaseArray& arr )
{ m_ptr.arr = arr.GetPtr(); m_type = DTR_BASEARRAY;
size=arr.GetTotalSize(); elementSize=arr.GetElementSize();
arraySize=arr.GetArraySize(); }
char* GetAddress();
DTR_TYPE m_type;
UNIONPTR m_ptr;
size_t size;
size_t elementSize;
size_t arraySize;
};
This is the C++11 variadic template version of the binary Write. The first Write is the base function which stops the recursive calls. It also makes use of the BinaryTypeRef class.
size_t Write()
{
return 0;
}
template<typename T, typename... Args>
size_t Write( T t, Args... args )
{
BinaryTypeRef dt(t);
size_t totalWritten = 0;
if(fp!=nullptr)
{
if(dt.m_type != BinaryTypeRef::DTR_STR &&
dt.m_type != BinaryTypeRef::DTR_WSTR && dt.m_type != BinaryTypeRef::DTR_BASEARRAY)
{
size_t len = fwrite(dt.GetAddress(), dt.size, 1, fp);
if(len==1)
++totalWritten;
}
else
{
size_t len = fwrite(dt.GetAddress(), dt.elementSize, dt.arraySize, fp);
if(len==dt.arraySize)
++totalWritten;
}
}
return totalWritten + Write(args...);
}
Lastly, we come to xBinaryReader. xBinaryReader makes use of BinaryTypeReadRef to do data conversion. Like xTextReader, xBinaryReader has overloaded Read functions to do its work, but it has no ReadLine.
size_t xBinaryReader::Read( BinaryTypeReadRef D1, BinaryTypeReadRef D2 )
{
size_t totalRead = 0;
if(fp!=NULL)
{
if(D1.m_type != BinaryTypeReadRef::DTR_STRARRAY &&
D1.m_type != BinaryTypeReadRef::DTR_WSTRARRAY &&
D1.m_type != BinaryTypeReadRef::DTR_BASEARRAY)
{
size_t cnt = fread(D1.GetAddress(), D1.size, 1, fp);
if(cnt==1)
++totalRead;
}
else
{
D1.DeferredMake();
size_t cnt = fread(D1.GetAddress(), D1.elementSize, D1.arraySize, fp);
if(cnt == D1.arraySize)
++totalRead;
}
if(D2.m_type != BinaryTypeReadRef::DTR_STRARRAY &&
D2.m_type != BinaryTypeReadRef::DTR_WSTRARRAY &&
D2.m_type != BinaryTypeReadRef::DTR_BASEARRAY)
{
size_t cnt = fread(D2.GetAddress(), D2.size, 1, fp);
if(cnt==1)
++totalRead;
}
else
{
D2.DeferredMake();
size_t cnt = fread(D2.GetAddress(), D2.elementSize, D2.arraySize, fp);
if(cnt==D2.arraySize)
++totalRead;
}
}
if(totalRead != 2)
{
errNum = ELMAX_READ_ERROR;
err = StrUtil::Format(L"{0}: Less than 2 elements are read! ({1} elements read)", GetErrorMsg(errNum), totalRead);
if(enableException)
throw new std::runtime_error(StrUtil::ConvToString(err));
}
return totalRead;
}
For simplicity, I do not show the BinaryTypeReadRef class here because the code is quite complicated, as it supports DeferredMake of the array class.
This is the C++11 variadic template version of the binary Read. As with the binary Write before, the first function is the base function which ends the recursive calls. Like the previous Read, it too makes use of BinaryTypeReadRef.
size_t Read()
{
return 0;
}
template<typename T, typename... Args>
size_t Read( T& t, Args&... args )
{
BinaryTypeReadRef dt(t);
size_t totalRead = 0;
if(fp!=nullptr)
{
if(dt.m_type != BinaryTypeReadRef::DTR_STRARRAY &&
dt.m_type != BinaryTypeReadRef::DTR_WSTRARRAY &&
dt.m_type != BinaryTypeReadRef::DTR_BASEARRAY)
{
size_t cnt = fread(dt.GetAddress(), dt.size, 1, fp);
if(cnt==1)
++totalRead;
}
else
{
dt.DeferredMake();
size_t cnt = fread(dt.GetAddress(), dt.elementSize, dt.arraySize, fp);
if(cnt == dt.arraySize)
++totalRead;
}
}
return totalRead + Read(args...);
}
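The recursion pattern shared by the variadic binary Write and Read can be exercised without a file. The sketch below uses an in-memory byte buffer in place of the FILE pointer. Buffer, WriteAll and ReadAll are hypothetical names, and only trivially copyable types are handled (no strings or arrays).

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>
#include <vector>

typedef std::vector<unsigned char> Buffer;

// Base overload: ends the chain of calls when no values remain.
inline std::size_t WriteAll(Buffer&) { return 0; }

// Appends the raw bytes of the first value, then handles the rest.
// Returns the number of values written.
template<typename T, typename... Args>
std::size_t WriteAll(Buffer& out, const T& value, const Args&... rest)
{
    static_assert(std::is_trivially_copyable<T>::value, "raw copy needs a trivially copyable type");
    const unsigned char* p = reinterpret_cast<const unsigned char*>(&value);
    out.insert(out.end(), p, p + sizeof(T));
    return 1 + WriteAll(out, rest...);
}

// Base overload for reading.
inline std::size_t ReadAll(const Buffer&, std::size_t&) { return 0; }

// Copies the next sizeof(T) bytes into the first destination, then handles
// the rest. Stops early, returning the count so far, when data runs out.
template<typename T, typename... Args>
std::size_t ReadAll(const Buffer& in, std::size_t& offset, T& value, Args&... rest)
{
    static_assert(std::is_trivially_copyable<T>::value, "raw copy needs a trivially copyable type");
    if (offset > in.size() || in.size() - offset < sizeof(T))
        return 0;
    std::memcpy(&value, in.data() + offset, sizeof(T));
    offset += sizeof(T);
    return 1 + ReadAll(in, offset, rest...);
}
```

As in the library, the return value counts the values transferred, so a caller can compare it with the number of arguments to detect a short read.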
When I was writing the Windows code, I took special care to separate the Windows and non-Windows code with a _MICROSOFT macro. The _WIN32 macro is not used because MinGW defines it as well. The main difference between Windows and non-Windows code at that point is that, on Windows, a linefeed ("\n") is converted to a combination of carriage return and linefeed ("\r\n") during file writing and the reverse process is applied during file reading; on non-Windows platforms, a linefeed ("\n") remains a linefeed: no conversion is done.
I downloaded and installed Orwell Dev-C++ to test my code on MinGW and GCC on Windows. Orwell Dev-C++ is a continuation of the work of the (currently inactive) Bloodshed Dev-C++. It comes bundled with MinGW and a fairly recent GCC 4.6.x. During compilation, Orwell Dev-C++ complained about unavailable secure C functions (typically with names ending in _s) such as _itow_s. So I changed them to the non-secure versions for the non-Windows implementation, while the Windows implementation still uses the secure versions. Dev-C++ also complained that it could not find a std::exception constructor which takes a string. It turned out that std::exception is meant to be derived from and not used directly. I changed the uses of std::exception to proper exception types, such as logic_error, runtime_error and so on. With these changes done, I assumed most of my Linux work was done. I estimated that, excluding the time to learn G++ and write the makefile, it would take me at most 1 hour to get the code working. That was when I found out I had grossly underestimated the time it would take me to resolve the errors on Ubuntu Linux 12.04.
After converting the Orwell Dev-C++ makefile to work on Ubuntu Linux and GCC 4.6.3, the first error G++ reported was that it did not understand the include paths. So I changed the backslashes to forward slashes.
#include "..\\..\\Common\\Common.h"
The above path is changed to below:
#include "../../Common/Common.h"
This was an easy change, though I had to update most of the 66 source files. The next G++ complaint was that it could not find the data conversion functions (typically with names starting with an underscore) such as _ultow. It turned out that the Microsoft conversion functions were not standard after all. I had to use stringstream to replace _ultow and its cousins. All compilation errors were resolved at this point, and I ran the unit tests. They crashed at the first Unicode test! Upon some investigation, I discovered, to my dismay, that the size of wchar_t on Linux and Mac OSX is 4 bytes instead of 2 bytes! That meant all the wchar_t related functions did not work correctly on Linux and Mac OSX. It was clearly a showstopper! It took me three laborious days to implement UTF-16 conversions and handle all the instances where the wchar_t size was 4. Unicode files are essentially UTF-16 files. On Windows, UTF-16 is supported natively. On Ubuntu Linux, I have to convert the 4-byte wchar_t (UTF-32) to UTF-16 before writing to a Unicode file. The reverse conversion applies during reading.
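The reading direction of that conversion, from UTF-16 units back to full code points, can be sketched as follows. DecodeUtf16 is a hypothetical name and this is not the library's conversion code. Substituting U+FFFD for an unpaired surrogate is a choice made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Converts UTF-16 code units into Unicode code points (UTF-32).
// An unpaired surrogate is replaced by U+FFFD, the replacement character.
inline std::vector<char32_t> DecodeUtf16(const std::vector<char16_t>& in)
{
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < in.size(); ++i)
    {
        const char32_t unit = in[i];
        const bool isHigh = unit >= 0xD800 && unit <= 0xDBFF;
        if (isHigh && i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF)
        {
            // Valid surrogate pair: recombine the two 10-bit halves.
            const char32_t low = in[i + 1];
            out.push_back(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00));
            ++i; // the low surrogate has been consumed as well
        }
        else if (unit >= 0xD800 && unit <= 0xDFFF)
        {
            out.push_back(0xFFFD); // surrogate without its partner
        }
        else
        {
            out.push_back(unit); // BMP code point, copied as is
        }
    }
    return out;
}
```

On a platform where wchar_t is 4 bytes, the decoded code points can be stored directly in a std::wstring. On Windows, no conversion is needed because wchar_t already holds UTF-16 units.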
If you are interested in running the Linux tests, you can run the commands below to build the library (FileLib.a) and the test application (UnitTest.exe) and execute it:
cd FileLib
cd FileIO
make all
cd ..
cd PreVS2012UnitTest
make all
./UnitTest.exe
In total, there are 55 unit tests for Windows and 65 unit tests for Linux. Whenever I make a change or fix a bug for either OS, I run the unit tests for both to make sure I have not broken anything on either side.
Clang 3.1 on Ubuntu 12.04 is able to compile the library using the GCC 4.7 standard library. However, Clang compilation failed on Mac OSX 10.8 because it could not find an overloaded constructor with a size_t parameter. size_t is typically a 32-bit unsigned integer on 32-bit platforms and a 64-bit unsigned integer on 64-bit platforms. Apparently, Clang on Mac OSX sees size_t as a distinct type from those covered by the existing constructors. The attempt to add that constructor failed on the Microsoft compiler, which complained that a similar constructor already exists, so the fix is to hide it under the __APPLE__ check.
#ifdef __APPLE__
DataType( size_t ui );
#endif
In order to compile under Clang successfully, remove Microsoft specific files, like stdafx.h, WinOperation.h/cpp and Boost files like BoostStrategy.h/cpp and RegExStrategy.h/cpp. For unit testing, LinuxUnitTest.cpp can be used.
Version 2.0.2 of the text file library (the variadic template version, not the 1.0.x C++98 version) uses custom streams internally, so users can write non-intrusive insertion and extraction operators for arbitrary data types, including enums. The istream and ostream classes make use of Boost lexical_cast to perform the data conversion, so they should perform better than STL stringstream. With istream, there is no need to set the splitter strategy when reading, but the delimiter needs to be specified through SetDelimiter. Your overloaded << and >> operators can use the same delimiter as the rest of the file or a different one. Let us first look at overloading using the same delimiter as the rest of the file format.
This is the structure, MyStruct.
struct MyStruct
{
int a;
int b;
};
These are the overloaded << and >> operators placed in your source files.
Elmax::ostream operator <<(Elmax::ostream& os, const MyStruct& val)
{
os << val.a;
os << L",";
os << val.b;
os << L",";
return os;
}
Elmax::istream operator >>(Elmax::istream& is, MyStruct& val)
{
is >> val.a;
is >> val.b;
return is;
}
Now we can write and read MyStruct objects as follows:
xTextWriter writer;
std::wstring file = L"...";
writer.Open(file, FT_UTF8, NEW);
int i = 25698;
double d = 1254.5;
MyStruct my = { 22, 33 };
writer.Write(L"{0},{1},{2}", i, my, d);
writer.Close();
xTextReader reader;
reader.Open(file);
int i2 = 0;
double d2 = 0.0;
MyStruct my2 = { 0, 0 };
reader.SetDelimiter(L",");
size_t totalRead = reader.ReadLine(i2, my2, d2);
In the next example, we use the pipe character, "|", as the delimiter for our structure while the rest of the document uses a comma.
struct DiffDelimiterStruct
{
int a;
float b;
};
Elmax::ostream operator <<(Elmax::ostream& os, const DiffDelimiterStruct& val)
{
os << val.a;
os << L"|";
os << val.b;
os << L"|";
return os;
}
Elmax::istream operator >>(Elmax::istream& is, DiffDelimiterStruct& val)
{
std::wstring old_delimiter = is.set_delimiter(L"|");
is >> val.a;
is >> val.b;
is.set_delimiter(old_delimiter);
return is;
}
As shown below, writing and reading is the same as the previous example.
xTextWriter writer;
std::wstring file = L"...";
writer.Open(file, FT_UTF8, NEW);
int i = 25698;
double d = 1254.5;
DiffDelimiterStruct my = { 22, 33 };
writer.Write(L"{0},{1},{2}", i, my, d);
writer.Close();
xTextReader reader;
reader.Open(file);
int i2 = 0;
double d2 = 0.0;
DiffDelimiterStruct my2 = { 0, 0 };
reader.SetDelimiter(L",");
size_t totalRead = reader.ReadLine(i2, my2, d2);
This is a list of issues that users need to be aware of when using this file library.
- Do not use the size_t type for binary files: size_t is a 32-bit unsigned integer on 32-bit platforms and a 64-bit unsigned integer on 64-bit platforms. The automatic promotion to 64 bits on a 64-bit OS is sometimes desired but is wrong in a file format. When a value is 32 bits in a binary file, we always want it to remain 32 bits to be consistent.
- Non-Windows implementation uses fopen: Windows provides a _wfopen function to open a file with a Unicode name. Unfortunately, Linux and GCC (or rather the C Standard Library) do not have such a function. The C and C++ Standards do not say how to open a file with a Unicode name. The workaround on other platforms is that, when your user is about to open a file whose name contains a Unicode code point (> 255), the application should copy the file to another ASCII name and open that file instead.
- Put the file code in try/catch: The exceptions that could be thrown by the library are logic_error, runtime_error, overflow_error and underflow_error. Exceptions are enabled by default. Although exceptions can be disabled through the EnableException function, exceptions will still be thrown when there are data conversion errors. These are considered serious errors, because the file could be corrupted, so silent failure is not acceptable. When exceptions are disabled, the user has to check the return value of each function call and call GetLastError.
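The first and last points of the list can be combined in one small sketch: convert an in-memory size_t length to a fixed 32-bit value before it goes into a file, report an overflow through an exception, and catch that exception by const reference. ToFileLength and TryToFileLength are hypothetical helper names and are not part of the library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <stdexcept>

// Converts an in-memory length to the fixed 32-bit width used in the file.
// Throws std::overflow_error when the value does not fit.
inline std::uint32_t ToFileLength(std::size_t len)
{
    if (len > std::numeric_limits<std::uint32_t>::max())
        throw std::overflow_error("length does not fit in 32 bits");
    return static_cast<std::uint32_t>(len);
}

// Non-throwing wrapper: shows the try/catch shape recommended for file code.
// Returns false, leaving 'out' untouched, when the conversion fails.
inline bool TryToFileLength(std::size_t len, std::uint32_t& out)
{
    try
    {
        out = ToFileLength(len);
        return true;
    }
    catch (const std::overflow_error&) // catch by const reference
    {
        return false;
    }
}
```

The value written to the file is then always 4 bytes wide, whether the program was built for a 32-bit or a 64-bit platform.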
Up until now, we have mainly discussed source code portability. Let us discuss some data portability issues. We did not encounter any data issues because the platforms used are all based on Intel x86. Other platforms may have a different endianness (Little Endian versus Big Endian); the file format should have a field to store the byte ordering, not unlike the TIFF image format, so that the bytes can be flipped as needed during reading. Due to different alignment, it is best to write out individual struct members instead of writing the struct as a flat array. Do not use size_t, as its size depends on the processor width (32 bits versus 64 bits). Not all platforms use 2's complement for negative numbers; 1's complement or sign-magnitude could be used; you may need to store that information as well. If you rely on -1 having all bits set (e.g., 0xFFFF), you are better off using ~0 for portability. Enums may have different values and data sizes. You can assign numerical values and force the enum to be a certain size. However, it is recommended that a switch-case be used instead of casting the enum to an integer; switch-case works for both enum and C++11 enum class.
enum MYCOLORS
{
RED = 0,
YELLOW = 1,
....
NO_USED = 0xFFFFFFFF
};
For floating point portability, we should check for IEEE 754 compliance using numeric_limits<float>::is_iec559. IEC 60559 is synonymous with the IEEE 754 standard for floating point; IEC 60559 is also sometimes referred to as IEC 559.
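The byte order and floating point advice can be combined in a short sketch. Instead of storing a byte-order field and flipping on read, this example takes the alternative route of always serializing in little-endian order with shifts, which gives the same bytes on any host. The function names are hypothetical and the code is independent of the library.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <limits>
#include <vector>

static_assert(std::numeric_limits<float>::is_iec559,
              "this sketch assumes IEEE 754 (IEC 60559) floats");
static_assert(sizeof(float) == sizeof(std::uint32_t),
              "this sketch assumes a 32-bit float");

// Appends v in little-endian byte order regardless of the host's endianness.
inline void AppendUint32LE(std::vector<unsigned char>& out, std::uint32_t v)
{
    for (int shift = 0; shift < 32; shift += 8)
        out.push_back(static_cast<unsigned char>((v >> shift) & 0xFF));
}

// Reads four little-endian bytes starting at p.
inline std::uint32_t ReadUint32LE(const unsigned char* p)
{
    assert(p != nullptr);
    return static_cast<std::uint32_t>(p[0])
         | (static_cast<std::uint32_t>(p[1]) << 8)
         | (static_cast<std::uint32_t>(p[2]) << 16)
         | (static_cast<std::uint32_t>(p[3]) << 24);
}

// Writes the IEEE 754 bit pattern of f in little-endian byte order.
inline void AppendFloatLE(std::vector<unsigned char>& out, float f)
{
    std::uint32_t bits = 0;
    std::memcpy(&bits, &f, sizeof(bits)); // copy the bit pattern without aliasing issues
    AppendUint32LE(out, bits);
}
```

Because the shifts operate on values rather than on memory, the same code is correct on big-endian and little-endian hosts.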
The original source code uploaded with this article was tested to have no memory leaks during correct program operation, using Visual Studio 2010/2012 and Valgrind (Linux). There could still be leaks when an exception was thrown, because the deallocation was skipped. Another problem was that the exceptions were allocated on the heap and not freed in the catch handler (an oversight). All these have been rectified: Resource Acquisition Is Initialization (RAII) is now used for all arrays to free the memory, and an exception, if thrown, is now allocated on the stack.
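A minimal sketch of the two fixes, with made-up names (TrackedBuffer, ParseOrThrow, RunParse) that do not come from the library: the buffer is owned by an RAII object, so it is released even when an exception propagates, and the exception is thrown by value and caught by const reference.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <stdexcept>

// Owns a wchar_t array and counts live instances, so the sketch can show
// that nothing is leaked when an exception is thrown.
struct TrackedBuffer
{
    static int& LiveCount() { static int count = 0; return count; }

    explicit TrackedBuffer(std::size_t n) : data(new wchar_t[n]())
    {
        assert(n > 0);
        ++LiveCount();
    }
    ~TrackedBuffer() { --LiveCount(); }

    std::unique_ptr<wchar_t[]> data; // frees the array in its destructor
};

inline void ParseOrThrow(bool fail)
{
    TrackedBuffer buffer(64); // released automatically, even when we throw
    if (fail)
        throw std::runtime_error("conversion failed"); // thrown by value, not with new
}

// Returns true when parsing succeeded, false when an exception was caught.
inline bool RunParse(bool fail)
{
    try
    {
        ParseOrThrow(fail);
        return true;
    }
    catch (const std::exception&) // nothing to delete in the handler
    {
        return false;
    }
}
```

After RunParse returns, TrackedBuffer::LiveCount() is back to zero on both the success path and the failure path.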
There are plans to move the library to C++11 features like variadic templates, nullptr and move semantics. It is much clearer to use standard integer types like uint32_t as opposed to unsigned int. A preliminary C++11 version is already available for download. The C++98 version will still be maintained on a different Git branch.
The table below shows the number of lines of code (loc) of each class for C++98 and C++11. The percentage of loc reduced after applying C++11 variadic templates is greater than 50%.
Class | C++98 loc | C++11 loc | Reduced by % |
xTextWriter | 437 | 186 | 57.4% |
xTextReader | 636 | 259 | 59.3% |
xBinaryWriter | 1067 | 180 | 83.1% |
xBinaryReader | 1123 | 181 | 83.9% |
The reader may or may not have noticed the Elmax namespace used in the code snippets. As anyone would have guessed, the file library is for the future cross-platform Elmax XML Library, but why include a binary file API as well? The reason is that there will be a version of Elmax which can save XML in binary form. Let us briefly recall the Elmax syntax to write a value to an XML element.
using namespace Elmax;
Element elem;
elem[L"Price"] = 30.65f;
elem[L"Qty"] = 1200;
As the reader can see from the above sample code, an Elmax element is aware of the data type before it converts the data to textual form. By using the data type information, Elmax can build a metadata section about the XML. The metadata can be separate from or embedded inside the Binary XML. If the XML contains mainly recurring elements, the metadata can be concise and small. However, if the XML file consists of free-form XML like SOAP XML, HTML or XAML, the metadata can be quite big with respect to the Binary XML. A binary file has the advantage of being fast because the data-type conversion from textual form is out of the picture.
I have modified an old OpenGL demo to read a binary file to showcase the file library. Set the global variable, g_bLoadBinary, according to which file type you want the demo to load. Please note that the OpenGL code is not cross-platform and only runs on Windows. Previously, I uploaded an OpenGL demo for another article. Since I only have access to an NVidia graphics card, I was not aware that the code did not run correctly on Intel graphics chipsets. This demo should not have the same problem. Please let me know if you have any problem running the OpenGL demo. The demo is written in OpenGL 2.0. An OpenGL 4.0 version is being developed for a future OpenGL article. Stay tuned if you are interested in OpenGL 4.0!
This is the wood clip model loaded. The model is modelled using very old Milkshape shareware.
This is the screenshot of the demo.
In this article, we have seen a new file API which makes writing and reading structured data intuitive and productive. By keeping the text and binary APIs similar, the user can maintain both file formats with minimal effort. The file library will be used by the new Elmax XML library to save textual and binary XML files. The XML work is an ongoing effort. The estimated date of completion is unknown. The source code is currently hosted at Github.
- Microsoft Visual C++ 8.0, 9.0, 10.0 and 11.0
- MinGW 4.7.x
- GCC 4.6 and 4.7 (Ubuntu 12.04)
- Clang 3.1 (Ubuntu and Mac OSX 10.8)
Elmax C++ File Library is available on NuGet Gallery for VS2010 and VS2012! Remember to update your Nuget to latest version 2.5 first.
- Write Portable Code by Brian Hook
- 28th June, 2022: Removed Boost lexical_cast.
- 26th November, 2013: Added Streams section. The source code is updated to use Boost lexical_cast.
- 31st October, 2013: Added Data Portability discussion. Important! Please read.
- 4th May, 2013: Changed the file open functions not to throw exception because file open failure is common error, not exceptional error. Added Nuget section.
- 2nd January, 2013: Added a table to show the lines of code reduced after changing to C++11 variadic template.
- 23rd December, 2012: Added C++11 variadic versions of the functions
- 14th December, 2012: Added Preventing Memory Leak section
- 12th December, 2012: Added Clang support
- 25th September, 2012: Initial release