Unicode with Surrogate Pairs in .Net

Question

As I understand it, .NET represents 32-bit characters using a pair (or "surrogate pair") of 16-bit characters. However, I haven't been able to find any functions that deal with these pairs as a single character. For example, Windows Forms is capable of displaying this surrogate pair as a single character:

C#

// This displays the character as I expect.
MessageBox.Show(char.ConvertFromUtf32(int.Parse("2A601", NumberStyles.HexNumber)));

However, when I get the length of that string, it is 2 (I would expect it to be 1):

C#

// Shows 2 rather than 1.
MessageBox.Show(char.ConvertFromUtf32(int.Parse("2A601", NumberStyles.HexNumber)).Length.ToString());

Also, when I get the first character, some block character is shown rather than the character I expect:

C#

// Shows �� rather than 𪘁.
MessageBox.Show(char.ConvertFromUtf32(int.Parse("2A601", NumberStyles.HexNumber)).Substring(0, 1));

FYI, you may need something installed to see the special characters above, but you should get the point even if you can't see them.
Basically, I would like to know if there are any string functions to handle surrogate pairs properly (e.g., index them correctly, count them as a single character rather than two). Or, if I'm looking at the concept of surrogate pairs wrong, feel free to correct me.

Posted 16-Feb-11 11:48am

AspDotNetDev

Comments

AspDotNetDev 16-Feb-11 17:50pm

Note that the editor is messing something up and the two blocks shown after "Shows " (in the code comment) should only be one block.

Sergey Alexandrovich Kryukov 16-Feb-11 18:29pm

Good question, my 5.
--SA

3 solutions

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Espen Harlinn · Accepted Answer · 2011-02-16T12:26:00

"As I understand it, .Net represents 32-bit characters using a pair (or "surrogate pair") of 16-bit characters"

.NET uses UTF-16, and you may find this interesting:
http://www.unicode.org/notes/tn12/

and this: http://www.yoda.arachsys.com/csharp/unicode.html

Libraries like http://site.icu-project.org/ take surrogate pairs into account, using an iterator approach, while the .NET string API seems to treat UTF-16 as UCS-2, that is, as a sequence of independent 16-bit code units. I suspect that the underlying OS features implement and use UTF-16 more in line with the standard.

As SAKryukov mentions, UnicodeEncoding actually takes these things into account, but the usual practice seems to be to consider only the length of the string. That tends to work out nicely anyway, unless you are doing character-by-character processing.

To get more than a box, you need to use a font that supports the characters you want to display.

Regards
Espen Harlinn

Sergey Alexandrovich Kryukov · Accepted Answer · 2011-02-16T12:28:00

There are a number of issues here. There is no need to add your own support for surrogate pairs; they are supported automatically by the OS (Windows 2000 needs a tweak to support them, while later versions of Windows are bundled with surrogate support).

The notion of a surrogate pair is only relevant to the two UTF-16 encodings (UTF-16LE and UTF-16BE); UTF-32 and UTF-8 support characters beyond the BMP (Basic Multilingual Plane) directly or through the UTF-8 algorithm, respectively. In application memory, UTF-16LE is used, and the character type does not really represent a Unicode code point: some code points are represented as two characters, as you correctly point out, so some care is needed to index characters (see below).
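
To make the encoding concrete, here is a minimal sketch of the UTF-16 surrogate arithmetic for a code point above the BMP. The class and method names (SurrogateDemo, ToSurrogatePair, FromSurrogatePair) are made up for illustration; only char.ConvertToUtf32 is a real .NET call. In practice char.ConvertFromUtf32 already does the split for you.

```csharp
using System;

public static class SurrogateDemo
{
    // Splits a supplementary code point (0x10000..0x10FFFF) into its
    // UTF-16 high and low surrogate code units.
    public static char[] ToSurrogatePair(int codePoint)
    {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF)
            throw new ArgumentOutOfRangeException("codePoint");

        int offset = codePoint - 0x10000;               // 20-bit value
        char high = (char)(0xD800 + (offset >> 10));    // top 10 bits
        char low = (char)(0xDC00 + (offset & 0x3FF));   // bottom 10 bits
        return new char[] { high, low };
    }

    // Recombines a surrogate pair into a code point using the built-in
    // char.ConvertToUtf32(char, char).
    public static int FromSurrogatePair(char high, char low)
    {
        return char.ConvertToUtf32(high, low);
    }
}
```

For the code point from the question, SurrogateDemo.ToSurrogatePair(0x2A601) yields the two chars U+D869 and U+DE01, which is why Length reports 2 for that string.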

One can use characters above the BMP in the UI directly, without any re-coding. The text should be placed in XML resources. As XML files can declare the UTF-8 charset, anyone can type such text directly using any editor capable of saving data in UTF-8 format:

XML

<?xml version="1.0" encoding="utf-8"?>

The XML file will be embedded as a resource in the .NET assembly; at run time, the text will be loaded and converted into the UTF-16 memory representation, with the code points above the BMP represented as surrogate pairs. In principle, such UTF-8 text can even be entered in C# code in the form of hard-coded string literals, but I would strongly recommend avoiding that. Hard-coded string literals, even ASCII-only ones, are best avoided in code, with rare exceptions.

The biggest concern is deployment of fonts implementing code point ranges above the BMP. From what I know, no such fonts are bundled with Windows. However, I tested the Unicode implementation above the BMP using some open source fonts and had no problems with them.

The mixed-size nature of character strings is reflected in the members of the abstract class System.Text.Encoding. For example, look at the following methods of this class: GetByteCount, GetBytes, GetCharCount, GetChars. They reflect the fact that there is no one-to-one correspondence between bytes and chars: GetByteCount and GetBytes accept a string or char[] parameter on input, while GetCharCount and GetChars accept a byte[].
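
As a rough illustration of those methods, the sketch below compares byte and char counts for the same string in different encodings. The wrapper class and method names (EncodingSizes, ByteCount, CharCountAfterRoundTrip) are hypothetical; Encoding.UTF8, Encoding.Unicode (UTF-16LE), Encoding.UTF32, GetByteCount, GetBytes and GetCharCount are the standard System.Text members.

```csharp
using System;
using System.Text;

public static class EncodingSizes
{
    // Number of bytes the string occupies in the given encoding
    // (the byte order mark is not included).
    public static int ByteCount(Encoding encoding, string text)
    {
        return encoding.GetByteCount(text);
    }

    // Encodes the string to bytes, then asks the encoding how many
    // .NET chars (UTF-16 code units) those bytes decode to.
    public static int CharCountAfterRoundTrip(Encoding encoding, string text)
    {
        byte[] bytes = encoding.GetBytes(text);
        return encoding.GetCharCount(bytes);
    }
}
```

For the single code point U+2A601, all three encodings need 4 bytes, and decoding always produces 2 chars, because the decoded result is again a UTF-16 surrogate pair. For "A", the byte counts are 1 (UTF-8), 2 (UTF-16) and 4 (UTF-32).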

There is no direct access to character indexing, though. I would guess this is because the information is rarely used and needs a lot of redundant data (see below). Controls process surrogate pairs automatically. If necessary, anyone can build such an index in code. To do that, one needs to create a separate index map, represented, for example, as an array of integers.

Traverse the string's "characters" (in the .NET sense, not code points) in a loop and examine each one using the predicates (static methods) System.Char.IsLowSurrogate(char), IsHighSurrogate or IsSurrogate, incrementing the code point index correspondingly: by 1 per "real" character (representing a code point) or by 1 per two "surrogate" characters representing a surrogate pair.
Once you have the index map, you can index a string by code points and implement other functions with code point semantics.

The implementation would look like this (not tested):

C#

public class CodePointIndexer {

    public CodePointIndexer(string value) {
        this.value = value;
        indexMap = BuildCodePointMap(value);
    } //CodePointIndexer

    public string Value { get { return this.value; } }

    // Returns the one or two chars that make up the code point at the given index.
    public char[] this[int index] { //may throw out-of-range exception
        get {
            int charIndex = this.indexMap[index];
            char start = value[charIndex];
            if (System.Char.IsHighSurrogate(start)
                && charIndex + 1 < value.Length
                && System.Char.IsLowSurrogate(value[charIndex + 1]))
                return new char[] { start, value[charIndex + 1] }; // surrogate pair
            else
                return new char[] { start }; // BMP character or unpaired surrogate
        } //get this as code point
    } //this

    string value;
    int[] indexMap;

    #region implementation

    // Maps each code point index to the char index where that code point starts.
    static int[] BuildCodePointMap(string source) {
        if (source == null) return null;
        System.Collections.Generic.List<int> list =
            new System.Collections.Generic.List<int>();
        int currentIndex = 0;
        while (currentIndex < source.Length) {
            list.Add(currentIndex);
            bool isPair = System.Char.IsHighSurrogate(source[currentIndex])
                && currentIndex + 1 < source.Length
                && System.Char.IsLowSurrogate(source[currentIndex + 1]);
            currentIndex += isPair ? 2 : 1; // skip both halves of a pair
        } //loop
        return list.ToArray();
    } //BuildCodePointMap

    #endregion implementation

} //class CodePointIndexer

Sorry if I did not list a comprehensive set of relevant .NET APIs; working above the BMP is quite an exotic requirement. At the same time, the methods I already mentioned are enough to implement any Unicode computing task.

—SA

yetibrain · Accepted Answer · 2014-02-17T03:44:00

If your character is beyond the BMP (and 2A601 is > 0xFFFF, i.e. decimal 173569), then you will have a high surrogate as well as a low surrogate within your string that together encode your code point. This means that TWO elements, i.e. TWO 16-bit words, encode ONE character. Length always returns the number of array elements, not the number of characters/code points! This is because a code point greater than 0xFFFF appears in a UTF-16 stream as a dword. Because a high and a low surrogate are TWO words, the length of 2 is appropriate. Length means "number of elements" of an array, not, as you expect, "CharCount" or "CodepointCount".

There is a class called StringInfo that should do the job you are looking for. It checks for surrogate pairs (and hopefully skips orphaned surrogates) and reports the number of text elements (one per surrogate pair), not array elements. Try it.

If the control that you want to display the code point with is surrogate-aware, it will decode the code point that is encoded within the high and low surrogate pair and query the configured font for the glyph. Be sure you have configured a font that has the proper glyph for your code point (e.g. Arial Unicode MS has many glyphs but not all).
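
A minimal sketch of how StringInfo might be used for this follows. The wrapper class and method names (TextElements, Count, Split) are made up for illustration; StringInfo, LengthInTextElements and GetTextElementEnumerator are the real System.Globalization members. Note that StringInfo works in text elements, which also group a base character with its combining marks, so the count is not strictly a code point count.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class TextElements
{
    // Number of text elements; a surrogate pair counts as one.
    public static int Count(string text)
    {
        return new StringInfo(text).LengthInTextElements;
    }

    // Splits the string into text elements, keeping each surrogate
    // pair together as a single two-char string.
    public static List<string> Split(string text)
    {
        List<string> result = new List<string>();
        TextElementEnumerator e = StringInfo.GetTextElementEnumerator(text);
        while (e.MoveNext())
            result.Add(e.GetTextElement());
        return result;
    }
}
```

For the string "a" + char.ConvertFromUtf32(0x2A601) + "b", Length is 4 while TextElements.Count returns 3, and the second item returned by Split is the complete surrogate pair.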

kind regards,
yb
