Introduction
This article is about converting text documents between ANSI and Unicode. With the presented code, you will be able to load or save a text document from your project in either ANSI or Unicode format.
Background
You can read Chris Maunder's article on enabling Unicode compilation for a project. David Pritchard also has an interesting article on extending the CStdioFile class to support Unicode when reading from and writing to a file.
Using the code
You can use code fragments from this article and adjust them to your needs, or you can download the sample project and see how it deals with Unicode and ANSI files. The only thing that is important to know is that your project must define the _UNICODE preprocessor symbol to enable Unicode compilation. See the articles above for an explanation.
Loading Unicode or ANSI text document
Reading the most important part of a Unicode text file, the byte-order mark (BOM), looks like this:
_TCHAR buffer[1024];
_TCHAR bom = 0;
CFile file;
file.Open( strFile, CFile::modeRead );
file.Read( &bom, sizeof(_TCHAR) );
file.Close();
If there is a byte-order mark at the beginning of the text file and its value is 0xFEFF, you certainly have a Unicode text document to worry about. So, the question is: how do you read it into a simple CString object? Like this:
if ( bom == _TCHAR(0xFEFF) )
{
    CFile file;
    file.Open( strFile, CFile::modeRead );
    file.Read( &bom, sizeof(_TCHAR) );              // skip the BOM
    UINT ret = file.Read( buffer,
        sizeof(buffer) - sizeof(_TCHAR) );          // leave room for the terminator
    buffer[ret / sizeof(_TCHAR)] = _T('\0');        // ret is a byte count
    file.Close();
    strText = buffer;
}
Now, you have your file in a CString object. If you are wondering about the division by sizeof(_TCHAR), do know that CFile::Read returns the number of bytes read, while the buffer is indexed in _TCHAR units; placing the terminator at the correct character position means no stray characters from the double-byte encoding are left at the end of the string.
But, what if your file isn't a Unicode file, that is, if the byte-order mark is not equal to 0xFEFF? Then it is possible that you are dealing with an ANSI file. I say possible because the file may instead be encoded in some other way (as UTF-8, as Unicode big endian, or as something else). If the text file is ANSI encoded, then you should do the following:
{
    CStdioFile stdioFile;
    stdioFile.Open( strFile, CFile::modeRead );
    CString strLine;
    while ( stdioFile.ReadString( strLine ) )   // ReadString reads a single line,
    {                                           // so loop until the end of the file
        strText += strLine;
        strText += _T('\n');
    }
    stdioFile.Close();
}
As a result, an ANSI text file is loaded to a CString
object.
Saving Unicode or ANSI text document
Saving a Unicode text file goes like this:
_TCHAR bom = (_TCHAR)0xFEFF;
CFile file;
file.Open( strFile, CFile::modeCreate | CFile::modeWrite );
file.Write( &bom, sizeof(_TCHAR) );                 // BOM first
file.Write( LPCTSTR(strText), strText.GetLength()*sizeof(_TCHAR) );
file.Close();
If you would like to save the file as ANSI, do the following:
CStdioFile stdioFile;
stdioFile.Open( strFile, CFile::modeCreate | CFile::modeWrite );
stdioFile.WriteString( strText );
stdioFile.Close();
What to do with loaded text?
You can use this CString object further in your source, for example to display it on the screen (you will see the exact characters you typed, just as in the MS Word application). To do this, use the simple TextOut method of the CDC class, passing the CString object and the number of characters (that is, the length of the string). But do know that you won't see the correct result on the screen with just any font installed on your system: the font you use must have glyph mappings for the selected Unicode characters.
This is how I would do it in the OnDraw method:
CFont font;
font.CreateFont( 15, 8, 0, 0, FW_BOLD, FALSE, FALSE, FALSE, DEFAULT_CHARSET,
OUT_DEFAULT_PRECIS, CLIP_DEFAULT_PRECIS, DEFAULT_QUALITY,
DEFAULT_PITCH | FF_DONTCARE, _T("Times New Roman") );
CFont* pOldFont = pDC->SelectObject( &font );
pDC->TextOut( 100, 100, strText, strText.GetLength() );
pDC->SelectObject( pOldFont );
font.DeleteObject();
Points of Interest
While I was analyzing the bytes of text documents written in Notepad, I found that Unicode, Unicode big endian, and UTF-8 encodings differ in their first few bytes, so a simple, universal text document reader/writer may be within reach from this point.