In the previous tutorial, we passed a single string
to a native C/C++ function by using P/Invoke.
This function was defined like this:
void print_line(const char* str);
[DllImport("NativeLib.dll")]
private static extern void print_line(string str);
However, there exists a hidden pitfall here:
What happens when the user passes a non-ASCII character to this function?
ASCII and Unicode: A Historical Overview
Historically, there was ASCII, which defined 128 characters, numbered 0 to 127 (i.e., everything that fits into 7 bits). However, these 128 characters contained only the letters used in English. Umlauts (like ä, ö, ü) and other characters were not present. So, the 8th bit was used to map these characters, but the mapping was not standardized. Basically, each country or region had its own mapping of the range 128 – 255. These different mappings were called code pages.
For example, on code page 850 (MS-DOS Latin 1), the character number 154 is Ü (German Umlaut) while on code page 855 (MS-DOS Cyrillic), the very same character number represents џ (Cyrillic small letter DZHE).
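To make the ambiguity concrete, here is a minimal sketch of how the same byte value decodes to two different characters. The names CodePage and decode_byte are hypothetical, and only the single table entry from the example above is filled in; a real code page table would cover all 256 byte values.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical identifiers for the two code pages mentioned above.
enum class CodePage { Cp850, Cp855 };

// Returns the Unicode code point for a byte under the given code page.
// Only the ASCII range and byte 154 are handled in this sketch;
// every other byte returns 0 ("not covered here").
char32_t decode_byte(CodePage page, std::uint8_t byte) {
    if (byte < 128) {
        return byte; // the ASCII range is identical in both code pages
    }
    if (byte == 154) {
        return page == CodePage::Cp850 ? U'\u00DC'   // Ü
                                       : U'\u045F';  // џ
    }
    return 0; // not covered by this sketch
}

// Usage: one byte, two different characters.
void demo_same_byte_two_characters() {
    assert(decode_byte(CodePage::Cp850, 154) == U'\u00DC');
    assert(decode_byte(CodePage::Cp855, 154) == U'\u045F');
}
```

Without knowing the code page, the receiver of byte 154 cannot tell which of the two characters was meant.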
To unify these different mappings, the Unicode standard was established in 1991. The idea was (and is) to give each existing character a unique id. These ids are called code points. So basically, the Unicode standard is “just” a much bigger version of the ASCII standard. The latest version as of writing is Unicode version 6.1 which covers over 110,000 characters.
Along with the Unicode standard, several encodings were developed. Each encoding describes how to convert Unicode code points into bytes. The most famous ones are UTF-8 and UTF-16.
Please note that all of these UTF encodings can encode every Unicode code point. They differ only in the way they do this.
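As an illustration of what "converting code points into bytes" means, the following sketch encodes a single code point as UTF-8. The function name encode_utf8 is made up for this example; it is not part of any library discussed in this tutorial.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Encodes one Unicode code point as UTF-8 (1 to 4 bytes).
// Illustrative sketch; the name encode_utf8 is hypothetical.
std::vector<std::uint8_t> encode_utf8(char32_t cp) {
    // Valid code points end at U+10FFFF and exclude the surrogate range.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));

    auto byte = [](char32_t v) { return static_cast<std::uint8_t>(v); };
    std::vector<std::uint8_t> out;

    if (cp < 0x80) {                 // 7 bits: plain ASCII, 1 byte
        out.push_back(byte(cp));
    } else if (cp < 0x800) {         // up to 11 bits: 2 bytes
        out.push_back(byte(0xC0 | (cp >> 6)));
        out.push_back(byte(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {       // up to 16 bits: 3 bytes
        out.push_back(byte(0xE0 | (cp >> 12)));
        out.push_back(byte(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(byte(0x80 | (cp & 0x3F)));
    } else {                         // up to 21 bits: 4 bytes
        out.push_back(byte(0xF0 | (cp >> 18)));
        out.push_back(byte(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(byte(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(byte(0x80 | (cp & 0x3F)));
    }
    return out;
}
```

For Ü (code point U+00DC), this yields the two bytes 0xC3 0x9C, whereas UTF-16 stores the same code point as the single 16 bit value 0x00DC. Same code point, different bytes.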
If you want to experiment a little bit with Unicode, there is a Unicode Explorer I’ve written. Go ahead and give it a try.
P/Invoke String Conversions
Back to the actual problem. With the parameter of print_line() defined as const char* (and char being 8 bit), it’s not clear which code page to use for the strings passed to this function.
Instead, let’s change the parameter type to Unicode (also sometimes referred to as “wide characters”):
void print_line(const wchar_t* str);
Now, let’s also adapt the C# mapping:
[DllImport("NativeLib.dll", CharSet = CharSet.Unicode)]
private static extern void print_line(string str);
The only difference here is that we specified the CharSet to be Unicode.
With this, C# will pass strings as UTF-16 encoded strings to the C++ function.
UTF-16 is, as said before, an encoding that converts Unicode code points into bytes and the other way around. In UTF-16, each code point is encoded with either one or two WORDs (16 bit values). The most frequently used code points fit into one WORD, the less frequently used code points need two WORDs (called a “surrogate pair”).
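The split into one or two 16 bit values can be sketched as follows. The function names are hypothetical, and char16_t stands in for the Windows WORD type so that the example does not depend on any Windows header.

```cpp
#include <cassert>
#include <vector>

// Encodes one code point as UTF-16: one 16 bit unit for code points
// below U+10000, otherwise two units (a surrogate pair).
// Illustrative sketch; the name encode_utf16 is hypothetical.
std::vector<char16_t> encode_utf16(char32_t cp) {
    // Valid code points end at U+10FFFF and exclude the surrogate range.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));

    if (cp < 0x10000) {
        return { static_cast<char16_t>(cp) };
    }
    const char32_t v = cp - 0x10000;  // 20 bits remain after the offset
    return {
        static_cast<char16_t>(0xD800 + (v >> 10)),   // high surrogate: top 10 bits
        static_cast<char16_t>(0xDC00 + (v & 0x3FF))  // low surrogate: bottom 10 bits
    };
}

// Reverses the split: combines a surrogate pair into one code point.
char32_t decode_surrogate_pair(char16_t high, char16_t low) {
    assert(high >= 0xD800 && high <= 0xDBFF);
    assert(low >= 0xDC00 && low <= 0xDFFF);
    return 0x10000
         + ((static_cast<char32_t>(high) - 0xD800) << 10)
         + (static_cast<char32_t>(low) - 0xDC00);
}
```

For example, Ü (U+00DC) stays a single unit 0x00DC, while the emoji U+1F600 becomes the surrogate pair 0xD83D 0xDE00. This is also why the length of a UTF-16 string in 16 bit units is not necessarily its number of characters.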
Important: There is no ISO C way to print Unicode characters to the console. wprintf() won’t work, at least on Windows.