It seems to be a trivial and obvious task, but in fact it is not, and I nearly slipped up on it. I wanted to dynamically load some shader resources and my app crashed at runtime. I didn’t know why, so I started debugging and found the culprit. It hides in two places: the philosophy of Windows string comparisons and the MultiByteToWideChar function.
Code Snippet 1
Not working C++ code that throws an exception:
#include &lt;windows.h&gt;
#include &lt;cstring&gt;
#include &lt;stdexcept&gt;

wchar_t* ConvertToWchar_t(const char* txt)
{
    size_t argLength = strlen(txt);
    wchar_t* result = new wchar_t[argLength + 1];
    // Note: ZeroMemory takes a byte count, so multiply by sizeof(wchar_t)
    ZeroMemory(result, (argLength + 1) * sizeof(wchar_t));
    // MB_COMPOSITE cannot be combined with CP_UTF8,
    // so this call fails with ERROR_INVALID_FLAGS
    int convertedCharacters = MultiByteToWideChar(
        CP_UTF8, MB_COMPOSITE, txt, -1, result, (int)argLength + 1);
    if (!convertedCharacters)
    {
        delete[] result;
        DWORD lastErr = GetLastError();
        switch (lastErr)
        {
        case ERROR_INSUFFICIENT_BUFFER:
            throw std::runtime_error("insufficient buffer");
        case ERROR_NO_UNICODE_TRANSLATION:
            throw std::runtime_error("invalid unicode char");
        case ERROR_INVALID_FLAGS:
            throw std::runtime_error("invalid flags");
        case ERROR_INVALID_PARAMETER:
            throw std::runtime_error("invalid parameter");
        }
        throw std::runtime_error("conversion failed");
    }
    return result;
}
As you know, Windows works with Unicode internally, and everything should go through it. This requires programmers to know a few things, for example, how to convert from char* to wchar_t* and when two strings are equal. Don’t forget that the same Unicode text can be written in several different ways. Bear in mind that if you want to compare two strings, they have to be normalized before comparison! Normalization is the process of bringing the characters’ byte representation into a canonical form so that equivalent strings are interpreted interchangeably.
In many cases, Unicode allows multiple representations of what is, linguistically, the same string. For example:
- Capital A with dieresis (umlaut) can be represented either as a single Unicode code point “Ä” (U+00C4) or the combination of Capital A and the combining Dieresis character (“A” + “¨”, that is, U+0041 U+0308). Similar considerations apply for many other characters with diacritic marks.
- Capital A itself can be represented either in the usual manner (Latin Capital Letter A, U+0041) or by Fullwidth Latin Capital Letter A (U+FF21). Similar considerations apply for the other simple Latin letters (both uppercase and lowercase) and for the katakana characters used in writing Japanese.
- The string “fi” can be represented either by the characters “f” and “i” (U+0066 U+0069) or by the ligature “ﬁ” (U+FB01). Similar considerations apply for many other combinations of characters for which Unicode defines ligatures.
Which binary form you get depends on the normalization form. See the full description of the available normalization forms here. To check whether your wchar_t* string is normalized, use the IsNormalizedString function, and to normalize it, use the NormalizeString function. But what if you’re programming in .NET rather than against the unmanaged Windows API? You have to take care of this there too, just in a different way. Look at the C# code below, a small program I wrote for myself to check it.
Code Snippet 2
class Program
{
    static void Main(string[] args)
    {
        string s = "A\u0308\uFB03n";   // decomposed: A + combining dieresis, ffi ligature, n
        string s2 = "Äffin";           // precomposed Ä, then f f i n
        String sm1 = new String(s.ToCharArray());
        String sm2 = new String(s2.ToCharArray());

        if (String.Compare(s, s2) == 0)
        {
            Console.WriteLine("equal");
        }

        if (s == s2)
        {
            Console.WriteLine("equal");
        }
        else
        {
            Console.WriteLine("not equal");
        }

        if (sm1.Equals(sm2))
        {
            Console.WriteLine("equal");
        }
        else
        {
            Console.WriteLine("not equal");
        }

        if (String.Compare(sm1, sm2) == 0)
        {
            Console.WriteLine("equal");
        }
        else
        {
            Console.WriteLine("not equal");
        }
    }
}
It turns out that not every comparison above prints “equal”, as it should. Only String.Compare gives the linguistically correct result, because it performs a culture-sensitive comparison; == and Equals compare the raw char sequences ordinally. Even so, this code isn’t perfect: to satisfy security and globalization rules, always perform string comparisons with an explicit CultureInfo (or StringComparison); then the result is well defined. C# programming isn’t as easy as it seems if you want to do your job right, not merely get it working in one particular runtime environment.
In the Win32 unmanaged environment, you had better use functions like lstrcmpi to compare two strings. Microsoft provides an article on how to perform safe string comparisons and on internationalization features. Windows has the win32 API MultiByteToWideChar function with two flags that do not work in every combination: MB_PRECOMPOSED
and MB_COMPOSITE
(in particular, neither may be combined with CP_UTF8). See the post here to get more knowledge about it. If you want to convert a char*
to a wchar_t*
, please use the mbstowcs_s function. There is an MSDN article that shows some basic string
format conversions here.
Code Snippet 3
A correctly coded, safe char* to wchar_t* conversion function:
#include &lt;windows.h&gt;
#include &lt;cstring&gt;
#include &lt;cerrno&gt;
#include &lt;stdexcept&gt;

wchar_t* ConvertToWchar_t(const char* txt)
{
    size_t argLength = strlen(txt);
    wchar_t* result = new wchar_t[argLength + 1];
    ZeroMemory(result, (argLength + 1) * sizeof(wchar_t));
    size_t convertedChars = 0;
    // mbstowcs_s reports errors through its errno_t return value,
    // not through GetLastError
    errno_t err = mbstowcs_s(&convertedChars, result, argLength + 1, txt, _TRUNCATE);
    if (err != 0)
    {
        delete[] result;
        switch (err)
        {
        case EINVAL:
            throw std::runtime_error("invalid parameter");
        case EILSEQ:
            throw std::runtime_error("invalid multibyte character");
        default:
            throw std::runtime_error("conversion failed");
        }
    }
    return result;
}
If you want to know more, take a look at this article. There is also a sample on MSDN showing how to correctly perform the normalization process.
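As a rough sketch of how the two Win32 normalization APIs mentioned above fit together (assuming Windows Vista or later and linking against Normaliz.lib; NormalizeToNfc is my own helper name, not a Windows API, and the error handling is abbreviated):

```cpp
#include <windows.h>
#include <cwchar>
#include <stdexcept>

// Returns a newly allocated NFC-normalized copy of `src`;
// the caller owns the returned buffer.
wchar_t* NormalizeToNfc(const wchar_t* src)
{
    if (IsNormalizedString(NormalizationC, src, -1))
    {
        // Already normalized: return a plain copy
        size_t len = wcslen(src) + 1;
        wchar_t* copy = new wchar_t[len];
        wcscpy_s(copy, len, src);
        return copy;
    }

    // First call: ask for the required destination size (in wchar_t units)
    int needed = NormalizeString(NormalizationC, src, -1, NULL, 0);
    if (needed <= 0)
        throw std::runtime_error("NormalizeString failed to estimate size");

    wchar_t* dst = new wchar_t[needed];
    int written = NormalizeString(NormalizationC, src, -1, dst, needed);
    if (written <= 0)
    {
        delete[] dst;
        throw std::runtime_error("NormalizeString failed");
    }
    return dst;
}
```

With both inputs passed through a helper like this, a plain wcscmp becomes a meaningful equality test.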
Filed under: C#, C/C++, CodeProject
Tagged: C++