Introduction
This article is about converting text documents between ANSI and Unicode. With the presented code, you will be able to load or save a text document from your project in either ANSI or Unicode format.
Background
You can read Chris Maunder's article on enabling Unicode compilation for a project. David Pritchard also has an interesting article on extending the CStdioFile class to support Unicode when reading from and writing to a file.
Using the code
You can use code fragments from this article and adjust them to your needs, or you can download the sample project and see how it deals with Unicode and ANSI files. The only thing that is important to know is that your project must define the _UNICODE preprocessor symbol to enable Unicode compilation. See the articles above for an explanation.
Loading Unicode or ANSI text document
Reading the most important part of a Unicode text file, the byte-order mark (BOM), looks like this:
_TCHAR buffer[1024];
_TCHAR bom = 0;
CFile file;
file.Open( strFile, CFile::modeRead );
file.Read( &bom, sizeof(_TCHAR) );
file.Close();
If there is a byte-order mark at the beginning of the text file and its value is 0xFEFF, you certainly have a Unicode text document to worry about. So, the question is: how do you read it into a simple CString object? Like this:
if ( bom == _TCHAR(0xFEFF) )
{
    CFile file;
    file.Open( strFile, CFile::modeRead );
    file.Read( &bom, sizeof(_TCHAR) );              // skip the BOM
    UINT ret = file.Read( buffer,
        sizeof(buffer) - sizeof(_TCHAR) );          // leave room for the terminator
    buffer[ret / sizeof(_TCHAR)] = _T('\0');        // ret is a byte count
    file.Close();
    strText = buffer;
}
Now, you have your file in a CString object. If you are wondering about the division by sizeof(_TCHAR), do know that CFile::Read returns the number of bytes read, while the buffer is indexed in _TCHAR units; placing the terminator at the correct character position means no stray characters from the double-byte encoding are left at the end of the string.
But, what if your file isn't a Unicode file, that is, if the byte-order mark is not equal to 0xFEFF? Then it is possible that you are dealing with an ANSI file. I say possible because the file may instead be encoded in some other way (as UTF-8, as Unicode big endian, or as something else). If the text file is ANSI encoded, then you should do the following:
{
    CStdioFile stdioFile;
    stdioFile.Open( strFile, CFile::modeRead );
    CString strLine;
    while ( stdioFile.ReadString( strLine ) )   // ReadString reads a single line,
    {                                           // so loop until the end of the file
        strText += strLine;
        strText += _T('\n');
    }
    stdioFile.Close();
}
As a result, an ANSI text file is loaded to a CString
object.
Saving Unicode or ANSI text document
Saving a Unicode text file goes like this:
_TCHAR bom = (_TCHAR)0xFEFF;
CFile file;
file.Open( strFile, CFile::modeCreate | CFile::modeWrite );
file.Write( &bom, sizeof(_TCHAR) );                 // BOM first
file.Write( LPCTSTR(strText), strText.GetLength()*sizeof(_TCHAR) );
file.Close();
If you would like to save the file as ANSI, do the following:
CStdioFile stdioFile;
stdioFile.Open( strFile, CFile::modeCreate | CFile::modeWrite );
stdioFile.WriteString( strText );
stdioFile.Close();
What to do with loaded text?
You can use this CString object further in your source, for example to display it on the screen (you will see the exact characters you typed, just as in the MS Word application). To do this, use the simple TextOut method of the CDC class, passing the CString object and the number of characters (that is, the length of the string). But do know that you won't see the correct result on the screen with just any font installed on your system: the font you use must have glyph mappings for the selected Unicode characters.
This is how I would do it in the OnDraw method:
CFont font;
font.CreateFont( 15, 8, 0, 0, FW_BOLD, FALSE, FALSE, FALSE, DEFAULT_CHARSET,
OUT_DEFAULT_PRECIS, CLIP_DEFAULT_PRECIS, DEFAULT_QUALITY,
DEFAULT_PITCH | FF_DONTCARE, _T("Times New Roman") );
CFont* pOldFont = pDC->SelectObject( &font );
pDC->TextOut( 100, 100, strText, strText.GetLength() );
pDC->SelectObject( pOldFont );
font.DeleteObject();
Points of Interest
While I was analyzing the bytes of text documents written in Notepad, I found that Unicode, Unicode big endian, and UTF-8 encodings differ in their first few bytes, so a simple, universal text document reader/writer may be within reach from this point.