It seems to be a trivial and obvious task, but in fact it is not, and I nearly slipped up on it. I wanted to dynamically load some shader resources and my app crashed at runtime. I didn’t know why, so I started debugging and found the culprit. It hides in two places: the philosophy of Windows string comparisons and the MultiByteToWideChar function.
Code Snippet 1
Not working C++ code that throws an exception:
#include &lt;windows.h&gt;
#include &lt;cstring&gt;
#include &lt;stdexcept&gt;

wchar_t* ConvertToWchar_t(const char* txt)
{
    size_t argLength = strlen(txt);
    wchar_t* result = new wchar_t[argLength + 1];
    // Note: ZeroMemory takes a byte count, so multiply by sizeof(wchar_t)
    ZeroMemory(result, (argLength + 1) * sizeof(wchar_t));
    // MB_COMPOSITE cannot be combined with CP_UTF8,
    // so this call fails with ERROR_INVALID_FLAGS
    int convertedCharacters = MultiByteToWideChar(
        CP_UTF8, MB_COMPOSITE, txt, -1, result, (int)argLength + 1);
    if (!convertedCharacters)
    {
        delete[] result;
        DWORD lastErr = GetLastError();
        switch (lastErr)
        {
        case ERROR_INSUFFICIENT_BUFFER:
            throw std::runtime_error("insufficient buffer");
        case ERROR_NO_UNICODE_TRANSLATION:
            throw std::runtime_error("invalid unicode char");
        case ERROR_INVALID_FLAGS:
            throw std::runtime_error("invalid flags");
        case ERROR_INVALID_PARAMETER:
            throw std::runtime_error("invalid parameter");
        }
        throw std::runtime_error("conversion failed");
    }
    return result;
}
As you know, Windows works with Unicode internally, and everything should go through it. This requires programmers to know a few things, for example, how to convert from char* to wchar_t* and when two strings are equal. Don’t forget that the same Unicode text can be written in several different ways. Bear in mind that if you want to compare two strings, they have to be normalized before comparison! Normalization is the process of bringing the characters’ byte representation into a canonical form so that equivalent strings are interpreted interchangeably.
In many cases, Unicode allows multiple representations of what is, linguistically, the same string. For example:
- Capital A with dieresis (umlaut) can be represented either as a single Unicode code point “Ä” (U+00C4) or the combination of Capital A and the combining Dieresis character (“A” + “¨”, that is, U+0041 U+0308). Similar considerations apply for many other characters with diacritic marks.
- Capital A itself can be represented either in the usual manner (Latin Capital Letter A, U+0041) or by Fullwidth Latin Capital Letter A (U+FF21). Similar considerations apply for the other simple Latin letters (both uppercase and lowercase) and for the katakana characters used in writing Japanese.
- The string “fi” can be represented either by the characters “f” and “i” (U+0066 U+0069) or by the ligature “ﬁ” (U+FB01). Similar considerations apply for many other combinations of characters for which Unicode defines ligatures.
Which binary form you get depends on the normalization form. See the full description of the available normalization forms here. To check whether your wchar_t* string is normalized, use the IsNormalizedString function, and to normalize it, use the NormalizeString function. But what if you’re programming in .NET rather than against the unmanaged Windows API? You have to take care of this there too, just in a different way. Look at the C# code below, a small program I wrote for myself to check it.
Code Snippet 2
class Program
{
    static void Main(string[] args)
    {
        string s = "A\u0308\uFB03n";   // decomposed: A + combining dieresis, ffi ligature, n
        string s2 = "Äffin";           // precomposed Ä, then f f i n
        String sm1 = new String(s.ToCharArray());
        String sm2 = new String(s2.ToCharArray());

        if (String.Compare(s, s2) == 0)
        {
            Console.WriteLine("equal");
        }

        if (s == s2)
        {
            Console.WriteLine("equal");
        }
        else
        {
            Console.WriteLine("not equal");
        }

        if (sm1.Equals(sm2))
        {
            Console.WriteLine("equal");
        }
        else
        {
            Console.WriteLine("not equal");
        }

        if (String.Compare(sm1, sm2) == 0)
        {
            Console.WriteLine("equal");
        }
        else
        {
            Console.WriteLine("not equal");
        }
    }
}
It turns out that not every comparison above prints “equal”, as it should. Only String.Compare gives the linguistically correct result, because it performs a culture-sensitive comparison; == and Equals compare the raw char sequences ordinally. Even so, this code isn’t perfect: to satisfy security and globalization rules, always perform string comparisons with an explicit CultureInfo (or StringComparison); then the result is well defined. C# programming isn’t as easy as it seems if you want to do your job right, not merely get it working in one particular runtime environment.
In the Win32 unmanaged environment, you had better use functions like lstrcmpi to compare two strings. Microsoft provides an article on how to perform safe string comparisons and on internationalization features. Windows has the win32 API MultiByteToWideChar function with two flags that do not work in every combination: MB_PRECOMPOSED
and MB_COMPOSITE
(in particular, neither may be combined with CP_UTF8). See the post here to get more knowledge about it. If you want to convert a char*
to a wchar_t*
, please use the mbstowcs_s function. There is an MSDN article that shows some basic string
format conversions here.
Code Snippet 3
A correctly coded, safe char* to wchar_t* conversion function:
#include &lt;windows.h&gt;
#include &lt;cstring&gt;
#include &lt;cerrno&gt;
#include &lt;stdexcept&gt;

wchar_t* ConvertToWchar_t(const char* txt)
{
    size_t argLength = strlen(txt);
    wchar_t* result = new wchar_t[argLength + 1];
    ZeroMemory(result, (argLength + 1) * sizeof(wchar_t));
    size_t convertedChars = 0;
    // mbstowcs_s reports errors through its errno_t return value,
    // not through GetLastError
    errno_t err = mbstowcs_s(&convertedChars, result, argLength + 1, txt, _TRUNCATE);
    if (err != 0)
    {
        delete[] result;
        switch (err)
        {
        case EINVAL:
            throw std::runtime_error("invalid parameter");
        case EILSEQ:
            throw std::runtime_error("invalid multibyte character");
        default:
            throw std::runtime_error("conversion failed");
        }
    }
    return result;
}
If you want to know more, take a look at this article. There is also a sample on MSDN showing how to correctly perform the normalization process.
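As a rough sketch of how the two Win32 normalization APIs mentioned above fit together (assuming Windows Vista or later and linking against Normaliz.lib; NormalizeToNfc is my own helper name, not a Windows API, and the error handling is abbreviated):

```cpp
#include <windows.h>
#include <cwchar>
#include <stdexcept>

// Returns a newly allocated NFC-normalized copy of `src`;
// the caller owns the returned buffer.
wchar_t* NormalizeToNfc(const wchar_t* src)
{
    if (IsNormalizedString(NormalizationC, src, -1))
    {
        // Already normalized: return a plain copy
        size_t len = wcslen(src) + 1;
        wchar_t* copy = new wchar_t[len];
        wcscpy_s(copy, len, src);
        return copy;
    }

    // First call: ask for the required destination size (in wchar_t units)
    int needed = NormalizeString(NormalizationC, src, -1, NULL, 0);
    if (needed <= 0)
        throw std::runtime_error("NormalizeString failed to estimate size");

    wchar_t* dst = new wchar_t[needed];
    int written = NormalizeString(NormalizationC, src, -1, dst, needed);
    if (written <= 0)
    {
        delete[] dst;
        throw std::runtime_error("NormalizeString failed");
    }
    return dst;
}
```

With both inputs passed through a helper like this, a plain wcscmp becomes a meaningful equality test.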
Filed under: C#, C/C++, CodeProject
Tagged: C++