Introduction
As Unicode becomes more popular, programmers will find themselves performing more file-based operations using Unicode. Currently, familiar MFC classes such as CFile and CStdioFile do not properly handle reading and writing of a Unicode file. The class presented here addresses the need to read and write files as UTF-16 Unicode files.
Using the Code
During construction, or when the Open() member function is called, the class examines the first two bytes of the file after appropriate size checking. A byte order mark (BOM, U+FEFF) in those two bytes indicates that the file is UTF-16 encoded: it is stored on disk as 0xFF, 0xFE in a little-endian file and as 0xFE, 0xFF in a big-endian file. If either form is found, m_bIsUnicode is set to TRUE. If the bytes are not present, the class performs a CStdioFile::Seek( 0, CFile::begin ) so that the two consumed bytes are read again as data.
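// Read the first two bytes as a single WCHAR.
CStdioFile::Read( &wcBOM, sizeof( WCHAR ) );
// BOM seen in native byte order: no swapping needed.
if( wcBOM == UNICODE_BOM ) {
    m_bIsUnicode = TRUE;
    m_bByteSwapped = FALSE;
}
// BOM seen with its bytes reversed: every character must be swapped.
if( wcBOM == UNICODE_RBOM ) {
    m_bIsUnicode = TRUE;
    m_bByteSwapped = TRUE;
}
// No BOM: rewind so the two bytes are treated as data.
if( FALSE == m_bIsUnicode ) {
    CStdioFile::Seek( 0, CFile::begin );
}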
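The same check can be sketched without MFC by looking at the raw bytes instead of a WCHAR, which makes the on-disk order explicit. This is only an illustration: the Utf16Bom enum and the DetectUtf16Bom function are hypothetical names and are not part of CUTF16File.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical names for illustration; not part of CUTF16File.
enum class Utf16Bom { None, LittleEndian, BigEndian };

// Classify the first two bytes of a buffer.
// U+FEFF is stored as FF FE (little-endian) or FE FF (big-endian).
inline Utf16Bom DetectUtf16Bom(const std::uint8_t* data, std::size_t size)
{
    if (size < 2) {
        return Utf16Bom::None;  // too short to hold a BOM
    }
    assert(data != nullptr);
    if (data[0] == 0xFF && data[1] == 0xFE) {
        return Utf16Bom::LittleEndian;
    }
    if (data[0] == 0xFE && data[1] == 0xFF) {
        return Utf16Bom::BigEndian;
    }
    return Utf16Bom::None;  // caller should treat the bytes as data
}
```

A caller that receives Utf16Bom::None would rewind to the start of the file, just as the class does with Seek().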
ReadString(...) works as follows. If m_bIsUnicode is FALSE, the class returns the result of the appropriate CStdioFile::ReadString(...) operation. If the file is UTF-16 encoded, CUTF16File::ReadString( CString& rString ) draws from an internal accumulator until a "\r" or "\n" is encountered. The CUTF16File::ReadString( LPWSTR lpsz, UINT nMax ) overload duplicates the behavior of CStdioFile::ReadString(). See the underlying comment from fgets().
The read described above is accomplished through an accumulator, which is an STL list of WCHARs. When the accumulator is filled, each character is byte-swapped if the stream is big-endian (BOM stored on disk as 0xFE, 0xFF).
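A minimal sketch of such an accumulator, using only the standard library, might look like the following. FillAccumulator and ReadLine are hypothetical helpers, not the class's actual members. The sketch assembles each code unit from its two bytes rather than swapping a WCHAR in place, and it assumes that a "\r\n" pair should count as one line terminator.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <list>
#include <string>

// Hypothetical helper: append raw UTF-16 bytes to the accumulator.
// bigEndian == true means the high byte of each code unit comes first.
inline void FillAccumulator(std::list<char16_t>& acc,
                            const std::uint8_t* bytes, std::size_t size,
                            bool bigEndian)
{
    assert(size % 2 == 0);  // sketch assumes whole code units only
    for (std::size_t i = 0; i + 1 < size; i += 2) {
        const char16_t unit = bigEndian
            ? static_cast<char16_t>((bytes[i] << 8) | bytes[i + 1])
            : static_cast<char16_t>(bytes[i] | (bytes[i + 1] << 8));
        acc.push_back(unit);
    }
}

// Hypothetical helper: drain one line from the accumulator.
// Stops at '\r' or '\n'; "\r\n" is consumed as a single terminator
// (an assumption of this sketch). Returns false if nothing was left.
inline bool ReadLine(std::list<char16_t>& acc, std::u16string& line)
{
    line.clear();
    if (acc.empty()) {
        return false;
    }
    while (!acc.empty()) {
        const char16_t ch = acc.front();
        acc.pop_front();
        if (ch == u'\n') {
            break;
        }
        if (ch == u'\r') {
            if (!acc.empty() && acc.front() == u'\n') {
                acc.pop_front();
            }
            break;
        }
        line.push_back(ch);
    }
    return true;
}
```

A list makes removal from the front cheap, which suits this drain-from-the-head pattern; a std::deque would serve equally well.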
Writing to a file is accomplished by extending the normal function with WriteString( LPCTSTR lpsz, BOOL bAsUnicode ). For ANSI output (bAsUnicode is FALSE), CStdioFile handles the ANSI conversion internally, so CUTF16File simply yields to CStdioFile. If bAsUnicode is TRUE, the class writes the BOM (if the file position is 0) and then calls CFile::Write(...).
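The "BOM only at position 0" rule can be illustrated with an in-memory buffer standing in for the file. WriteUtf16Le is a hypothetical helper, not the class's WriteString(), and the sketch always emits little-endian output.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: append text as UTF-16LE to an in-memory "file".
// The BOM is emitted only while the output is still empty (position 0).
inline void WriteUtf16Le(std::vector<std::uint8_t>& out,
                         const std::u16string& text)
{
    assert(out.size() % 2 == 0);  // sketch assumes whole code units so far
    if (out.empty()) {
        out.push_back(0xFF);  // U+FEFF, low byte first
        out.push_back(0xFE);
    }
    for (const char16_t unit : text) {
        out.push_back(static_cast<std::uint8_t>(unit & 0xFF));  // low byte
        out.push_back(static_cast<std::uint8_t>(unit >> 8));    // high byte
    }
}
```

Calling the helper twice on the same buffer produces a single BOM followed by both strings, which mirrors the position check described above.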
The driver program opens two files on the hard drive, writes out both Unicode and ANSI text files, then reads the files back in. It then uses OutputDebugString(...) to write messages to the debugger's output window.
CUTF16File output1( L"unicode_write.txt",
                    CFile::modeWrite | CFile::modeCreate );
output1.WriteString( L"Hello World from Unicode land!", TRUE );
output1.Close();
...
CString szInput;
CUTF16File input1( L"unicode_write.txt", CFile::modeRead );
input1.ReadString( szInput );
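Figure 1 shows the result of writing a test file with the provided driver program. Notice that the BOM bytes are swapped on the disk: the BOM character 0xFEFF is stored low byte first, as 0xFF, 0xFE, because the file is written in little-endian order.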
Figure 1: Result of test program.
Figure 2 shows a similar file created with Notepad on Windows 2000 and saved as Unicode.
Figure 2: A Unicode sample created in Notepad.
Additional Reading
- http://www.unicode.org/
- International Programming for Microsoft Windows by D. Schmitt, ISBN 1-57231-956-9
- Programming Windows with MFC by J. Prosise, ISBN 1-57231-695-0
- Programming Server-Side Applications for Microsoft Windows 2000 by J. Richter and J. Clark, ISBN 0-73560-753-2
Revisions
- 10 Feb 2005 Original release
- 23 Dec 2006 Added Jordan Walters' improvements and bug fixes
- 23 Dec 2006 Added Jordan Walters as an author
- 17 Sep 2008 Fixed long-standing bug in 2nd constructor
- 13 Jul 2009 Correct handling of Unicode characters. If UNICODE/_UNICODE project settings specified, writing ANSI still produces a Unicode output file.