There are a number of issues here. There is no need to add special support for surrogate pairs: the OS supports them automatically (Windows 2000 needs a tweak to enable them; later versions of Windows ship with surrogate support).
The notion of a surrogate pair is only relevant to the two UTF-16 encodings (UTF-16LE and UTF-16BE); UTF-32 and UTF-8 represent characters beyond the BMP (Basic Multilingual Plane) directly or through the UTF-8 multi-byte algorithm, respectively. In application memory, UTF-16LE is used, and the character type does not really represent a Unicode code point: some code points are represented as two characters, as you correctly point out, so some care is needed when indexing characters; see below.
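To make the difference between the encodings concrete, here is a minimal sketch. The class name SurrogateDemo and its members are made up for illustration; only System.Char.ConvertFromUtf32 and System.Text.Encoding.GetByteCount are real .NET APIs.

```csharp
using System;
using System.Text;

// Hypothetical helper class, for illustration only.
public static class SurrogateDemo {
    // U+1D11E MUSICAL SYMBOL G CLEF lies above the BMP.
    public const int GClef = 0x1D11E;

    // Number of .NET chars (UTF-16 code units) used for one code point.
    public static int Utf16Units(int codePoint) {
        return char.ConvertFromUtf32(codePoint).Length;
    }

    // Number of bytes the same code point takes in the given encoding.
    public static int ByteCount(Encoding encoding, int codePoint) {
        return encoding.GetByteCount(char.ConvertFromUtf32(codePoint));
    }
}
```

With these helpers, SurrogateDemo.Utf16Units(SurrogateDemo.GClef) gives 2 (a surrogate pair), while the same code point takes 4 bytes in both Encoding.UTF8 and Encoding.UTF32. A BMP character such as 'A' (0x41) takes one char, 1 byte in UTF-8 and 4 bytes in UTF-32.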
One can use characters above the BMP in the UI directly, without any re-coding. The text should be placed in XML resources. As XML files can declare the UTF-8 charset, anyone can type such text directly using any editor capable of saving data in UTF-8 format:
<?xml version="1.0" encoding="utf-8"?>
The XML file will be embedded as a resource in the .NET assembly; at run time, the text will be loaded and converted into the UTF-16 memory representation, with the code points above the BMP represented as surrogate pairs. In principle, such UTF-8 text can even be entered in C# code in the form of hard-coded string literals, but I would strongly recommend avoiding it. Any hard-coded string literals, even ASCII-only ones, are best avoided in code, with rare exceptions.
The biggest concern is the deployment of fonts that implement code point ranges above the BMP. As far as I know, no such fonts are bundled with Windows. However, I tested the Unicode implementation above the BMP using some open-source fonts and had no problems with them.
The mixed-size nature of a character string is reflected in the members of the abstract class System.Text.Encoding. For example, look at the following methods of this class: GetByteCount, GetBytes, GetCharCount, GetChars. They reflect the fact that there is no one-to-one correspondence between bytes and chars: GetByteCount and GetBytes accept a string or char[] parameter as input, while GetCharCount and GetChars accept byte[].
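As an illustration of how these four methods fit together, here is a minimal round-trip sketch. The class EncodingDemo and its method names are hypothetical; the Encoding calls are the ones listed above.

```csharp
using System;
using System.Text;

// Hypothetical helper class, for illustration only.
public static class EncodingDemo {
    // Encodes text to bytes, sizing the buffer with GetByteCount first.
    public static byte[] ToBytes(Encoding encoding, string text) {
        byte[] buffer = new byte[encoding.GetByteCount(text)];
        encoding.GetBytes(text, 0, text.Length, buffer, 0);
        return buffer;
    }

    // Decodes bytes back to chars, sizing the buffer with GetCharCount first.
    public static string FromBytes(Encoding encoding, byte[] bytes) {
        char[] buffer = new char[encoding.GetCharCount(bytes)];
        encoding.GetChars(bytes, 0, bytes.Length, buffer, 0);
        return new string(buffer);
    }
}
```

For a string made of "a" followed by U+1D11E, the string length is 3 chars, the UTF-8 form is 5 bytes, and decoding those bytes gives back the original string. Neither count equals the number of code points, which is 2.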
There is no direct access to indexing by code point, though. I would guess this is because the information is rarely used and requires a lot of redundant data (see below). Controls process surrogate pairs automatically. If necessary, anyone can build such an index in code. To do that, one needs to create a separate index map, represented, for example, as an array of integers.
Traverse the string's "characters" (in the .NET sense, not code points) in a loop and examine each one using the predicates (static methods) System.Char.IsLowSurrogate(char), IsHighSurrogate or IsSurrogate, incrementing the code point index accordingly: by 1 per "real" character (representing a code point on its own) or by 1 per two "surrogate" characters representing a surrogate pair.
When you obtain the indexing map, you can index a string by code points and use other functions in code point semantics.
The implementation would look like this (not tested):
public class CodePointIndexer {
    public CodePointIndexer(string value) {
        this.value = value;
        indexMap = BuildCodePointMap(value);
    }
    public string Value { get { return this.value; } }
    // Returns the UTF-16 characters that make up the code point at the given
    // code point index: two for a surrogate pair, one otherwise.
    public char[] this[int index] {
        get {
            int start = this.indexMap[index];
            if (System.Char.IsSurrogatePair(value, start))
                return new char[] { value[start], value[start + 1] };
            else
                return new char[] { value[start] };
        }
    }
    string value;
    int[] indexMap;
    #region implementation
    // Maps each code point index to the index of its first UTF-16 character.
    static int[] BuildCodePointMap(string source) {
        if (source == null) return null;
        if (source.Length < 1) return new int[] { };
        System.Collections.Generic.List<int> list =
            new System.Collections.Generic.List<int>();
        for (int charIndex = 0; charIndex < source.Length; charIndex++) {
            list.Add(charIndex);
            // A surrogate pair occupies two characters; skip the second one.
            if (System.Char.IsSurrogatePair(source, charIndex))
                charIndex++;
        }
        return list.ToArray();
    }
    #endregion implementation
}
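If the numeric code point values are needed rather than the chars, System.Char.ConvertToUtf32 can do the conversion during the same kind of traversal. The following is only a sketch; the class CodePointReader and its method are hypothetical names, and it assumes an unpaired surrogate should simply be passed through as its own value.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper class, for illustration only.
public static class CodePointReader {
    // Returns the Unicode code point values of a string, in order.
    // Assumption: an unpaired surrogate is returned as its own char value.
    public static List<int> ToCodePoints(string text) {
        var result = new List<int>();
        for (int i = 0; i < text.Length; i++) {
            if (char.IsSurrogatePair(text, i)) {
                // Combine the high and low surrogate into one code point.
                result.Add(char.ConvertToUtf32(text, i));
                i++; // skip the low surrogate
            } else {
                result.Add(text[i]);
            }
        }
        return result;
    }
}
```

For the string "A", U+1D11E, "B" this yields three values, 0x41, 0x1D11E and 0x42, although the string itself holds four chars.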
Sorry if I did not list a comprehensive set of relevant .NET APIs; working above the BMP is quite an exotic requirement. At the same time, the methods I have already mentioned are enough to implement any Unicode computing task.
—SA