Introduction
This article shows a technique for typing Unicode characters directly into a text-box without a dedicated IME (Input Method Editor) or the Character Map tool. It also discusses surrogate pair encoding and the implementation of a fun tool that creates simple web pages displaying fanciful fonts.
Background
Some preliminary concepts:
Unicode code point
A Unicode code point is referred to by writing "U+" followed by its hexadecimal number. For code points in the Basic Multilingual Plane (BMP), four digits are used. For example, U+222B is the code point for the mathematical integration symbol "∫". Code points in the other planes need 5 or 6 hexadecimal digits. For example, the Aegyptus font used later in this article places its Egyptian hieroglyphs at U+F3000 - U+F4B92, which lies in a plane reserved for private use.
Unicode encoding
Every Unicode code point can be encoded in any of the standard encoding forms. The two discussed here are UTF-16 and UTF-8.
UTF-16 uses one 16-bit code unit (2 bytes) for every code point in the BMP and two code units (a surrogate pair, 4 bytes) for every other code point. The encoding for U+222B is hexadecimal 22 2B if the byte order is big endian and 2B 22 if it is little endian. A code point outside the BMP is written as two groups of 4 hexadecimal digits. See Surrogate Support in Microsoft Products for more details on how to do the encoding.
UTF-8 is an encoding standard that uses 1 to 4 bytes to encode each Unicode code point.
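The byte sequences quoted above can be checked with the standard .NET encoders. The snippet below is only an illustrative sketch: the class name EncodingDemo and the helper ToHex are made up for this example and are not part of the article's download.

```csharp
using System;
using System.Text;

public static class EncodingDemo
{
    // Returns the bytes of 'text' in the given encoding as space-separated hex.
    // (Hypothetical helper, introduced only for this illustration.)
    public static string ToHex(string text, Encoding encoding)
    {
        return BitConverter.ToString(encoding.GetBytes(text)).Replace("-", " ");
    }

    public static void Main()
    {
        string integral = "\u222B"; // U+222B, the integration sign

        Console.WriteLine(ToHex(integral, Encoding.BigEndianUnicode)); // 22 2B
        Console.WriteLine(ToHex(integral, Encoding.Unicode));          // 2B 22 (little endian)
        Console.WriteLine(ToHex(integral, Encoding.UTF8));             // E2 88 AB
    }
}
```

Encoding.Unicode is the little endian UTF-16 encoder, which is why the two bytes appear swapped relative to Encoding.BigEndianUnicode.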
Glyphs
These are the graphics used to render a Unicode code point on a display. Note that in a script such as Arabic, the glyph used for the same code point differs depending on the neighbouring characters.
Fonts
A font is a collection of glyphs, normally grouped by language or usage. Each glyph in the font file is mapped to a Unicode code point. For some interesting font files, you may want to visit this site: Unicode Fonts for Ancient Scripts
IME (Input Method Editor)
A language-specific tool for efficiently entering Unicode code points into a text input interface that supports Unicode. In Windows 7, you can install a new IME via Control Panel -> Region and Language -> Keyboards and Languages.
Character Map Tool
A generic tool provided by Microsoft that can display every Unicode code point in the Basic Multilingual Plane, so that you can copy and paste characters into a text input interface that supports Unicode. You can access the tool via Start -> All Programs -> Accessories -> System Tools -> Character Map.
Private Character Editor
A little-known tool for creating and editing characters in the Private Use Area, U+E000 - U+F8FF. This area holds 6400 code points and is reserved for private use. The Private Character Editor can be run as c:\windows\system32\eudcedit.exe.
The glyphs created are stored in the files c:\windows\fonts\eudc.euf and c:\windows\fonts\eudc.tte. These files are hidden when you browse the folder in Explorer, but you can copy them out from the command prompt.
To view the glyphs created, you can use Character Map and select the font "All Fonts (Private Characters)". Alternatively, you can use the program developed here.
Using the code
The code below performs the main task of generating the character for a Unicode code point. When the user types into the bottom text-box (textBox3), the code kicks in. It checks whether the key typed is a <space> and whether the preceding characters are in the format "U+####" or "U+#####", and if so replaces those characters with the character they represent.
Note that the code works for Basic Multilingual Plane "U+####", as well as all the other planes "U+#####" where each Unicode code point is represented by 5 hexadecimal digits.
You would also need a Unicode font for the text-box. I use Arial Unicode MS, 14.25pt font that comes with Windows 7.
// Requires: using System; using System.Text; using System.Windows.Forms;
private void HandleKeyPress(object sender, KeyPressEventArgs e)
{
    TextBox textbox = (TextBox)sender;
    // React only to <space> typed with at least 6 characters ("U+####") before the caret.
    if (e.KeyChar == ' ' && textbox.SelectionStart >= 6)
    {
        textbox.SelectedText = "";
        // Examine the last 7 characters ("U+#####"), or 6 when only 6 are available.
        int n = (textbox.SelectionStart == 6) ? 6 : 7;
        string s = textbox.Text.Substring(textbox.SelectionStart - n, n);
        int n1 = s.ToUpper().IndexOf("U+");
        if (n1 >= 0)
        {
            string s1 = s.Substring(n1 + 2);   // hexadecimal digits after "U+"
            string s2 = "";
            unicodepoint2utf16(s1, ref s2);    // 4 or 8 hex digits, or "" on failure
            if (s2 != "")
            {
                // Split the UTF-16 value into its four bytes, most significant first.
                uint d = Convert.ToUInt32(s2, 16);
                byte b0 = (byte)((d >> 24) & 0xFF);
                byte b1 = (byte)((d >> 16) & 0xFF);
                byte b2 = (byte)((d >> 8) & 0xFF);
                byte b3 = (byte)(d & 0xFF);
                // UnicodeEncoding is little endian by default, so each
                // 16-bit code unit is stored low byte first.
                byte[] bytes;
                if (b0 == 0 && b1 == 0)
                    bytes = new byte[] { b3, b2 };            // single code unit (BMP)
                else
                    bytes = new byte[] { b1, b0, b3, b2 };    // surrogate pair
                UnicodeEncoding u = new UnicodeEncoding();
                string s3 = u.GetString(bytes);
                // Replace the typed "U+..." text with the character and swallow the space.
                textbox.SelectionStart = textbox.SelectionStart - (n - n1);
                textbox.SelectionLength = (n - n1);
                textbox.SelectedText = s3;
                e.Handled = true;
            }
        }
    }
}
The unicodepoint2utf16() function takes a Unicode code point string and a ref string that is set to the resulting UTF-16 encoding. The result has 4 hexadecimal digits (for U+####) or 8 (for U+#####). In an 8-digit result, the first 4 digits and the last 4 digits form the surrogate pair of the UTF-16 encoding. For example, U+2040A is encoded as the pair D841, DC0A. To build the pair, subtract hexadecimal 10000 from the code point, add the upper 10 bits of the result to D800, and add the lower 10 bits to DC00. A surrogate pair consists of 2 code units, whose ranges are:
High: U+D800 - U+DBFF
Low: U+DC00 - U+DFFF
This encoding scheme allows for (DBFF - D800 + 1) * (DFFF - DC00 + 1) = 1024 * 1024 = 1048576 code points!
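The surrogate pair arithmetic can be written out in a few lines and checked against the framework's own conversion. This is a minimal sketch rather than the article's implementation: SurrogateDemo, ToSurrogates and FromSurrogates are names invented for this example, and FromSurrogates assumes its arguments are already a valid high/low pair.

```csharp
using System;

public static class SurrogateDemo
{
    // Splits a supplementary code point (U+10000 - U+10FFFF) into its surrogate pair.
    public static void ToSurrogates(int codePoint, out char high, out char low)
    {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF)
            throw new ArgumentOutOfRangeException("codePoint");
        int offset = codePoint - 0x10000;          // 20-bit value
        high = (char)(0xD800 + (offset >> 10));    // upper 10 bits
        low = (char)(0xDC00 + (offset & 0x3FF));   // lower 10 bits
    }

    // Recombines a surrogate pair into the code point it encodes.
    // No validation is done here: the caller must pass a real high/low pair.
    public static int FromSurrogates(char high, char low)
    {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void Main()
    {
        char high, low;
        ToSurrogates(0x2040A, out high, out low);
        Console.WriteLine("{0:X4} {1:X4}", (int)high, (int)low);   // D841 DC0A
        Console.WriteLine("{0:X}", FromSurrogates(high, low));     // 2040A

        // Cross-check against char.ConvertFromUtf32 from the framework.
        string viaFramework = char.ConvertFromUtf32(0x2040A);
        Console.WriteLine(viaFramework == new string(new[] { high, low })); // True
    }
}
```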
private void unicodepoint2utf16(string unp, ref string utf16)
{
    utf16 = "";
    uint testint = 0;
    try
    {
        // Throws if unp is not a valid hexadecimal number.
        testint = Convert.ToUInt32(unp, 16);
    }
    catch
    {
        return;
    }
    // Normalise the input: lower case, no leading zeros.
    string simplified_unp = testint.ToString("x");
    if (simplified_unp.Length == 5)
    {
        // Supplementary code point U+10000 - U+FFFFF: build the surrogate pair.
        // Code points with 6 digits (U+100000 - U+10FFFF) are not handled here.
        uint d2 = testint - 0x10000;        // 20-bit offset
        uint s1 = 0xD800 + (d2 >> 10);      // high surrogate: upper 10 bits
        uint s2 = 0xDC00 + (d2 & 0x3FF);    // low surrogate: lower 10 bits
        utf16 = s1.ToString("x4") + s2.ToString("x4");
        return;
    }
    if (unp.Length == 4 && unp.TrimStart(' ').Length == 4)
    {
        // BMP code point typed as exactly 4 digits: a single code unit.
        utf16 = testint.ToString("x4");
    }
}
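As an alternative to assembling the UTF-16 bytes by hand, the framework method char.ConvertFromUtf32 already returns the right string for any valid code point, including 6-digit ones. The sketch below shows that approach under the assumption that the input is a whole "U+..." token. CodePointParser and TryParse are hypothetical names for this example, not part of the article's code.

```csharp
using System;
using System.Globalization;

public static class CodePointParser
{
    // Tries to turn text such as "U+265B" or "u+2040b" into the character it names.
    public static bool TryParse(string text, out string result)
    {
        result = "";
        if (text == null || text.Length < 3 ||
            !text.StartsWith("U+", StringComparison.OrdinalIgnoreCase))
            return false;

        int codePoint;
        if (!int.TryParse(text.Substring(2), NumberStyles.AllowHexSpecifier,
                          CultureInfo.InvariantCulture, out codePoint))
            return false;

        // Reject values outside U+0000 - U+10FFFF and lone surrogates,
        // which char.ConvertFromUtf32 would refuse with an exception.
        if (codePoint < 0 || codePoint > 0x10FFFF ||
            (codePoint >= 0xD800 && codePoint <= 0xDFFF))
            return false;

        result = char.ConvertFromUtf32(codePoint);
        return true;
    }

    public static void Main()
    {
        string s;
        Console.WriteLine(TryParse("U+265B", out s));              // True
        Console.WriteLine(TryParse("U+2040a", out s) && s.Length == 2); // True: a surrogate pair
        Console.WriteLine(TryParse("U+D800", out s));              // False: lone surrogate
    }
}
```

Unlike unicodepoint2utf16(), this version accepts any number of hexadecimal digits, so the caller decides how many characters before the caret count as part of the token.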
Basic Demo
When the demo starts, the bottom text-box's content will be : ....U+265b<press space to get the character for this unicode>
Press the <space> bar and the code U+265b will be replaced by the character it represents. Guess what that is?
You can select "Help" from the combo-box to get help on using the top left text-box.
Below are some of the sample Unicode points that you may like to test out:
CJK (Simplified Chinese character meaning East): U+4E1C. Type U+4e1c followed by a space
Greek (Pi): U+03C0. Type U+03c0 followed by a space
Symbols (White Spade): U+2664. Type U+2664 followed by a space
A Unicode code point with 5 hexadecimal digits: U+2040B. Type U+2040b followed by a space
If you have downloaded the Aegyptus font from Unicode Fonts for Ancient Scripts, you can install it by copying the font file to the Windows Fonts directory at c:\windows\fonts. Then change the font of the text-box to Aegyptus. Double-clicking any of the text-boxes pops up the Font dialog for selecting the font to assign to that text-box.
You may want to try out the Unicode code points shown in the picture below. For example, to get the character of the owl (top row, 9th item after the first item), the Unicode code point would be U+10980 + 9 = U+10989. For the double wave (10th item after the first item), it would be U+10980 + A (hex for 10) = U+1098A. You should be able to work out the Unicode code points for the rest of the figures quite easily.
Type U+#####<space>. For example, typing U+10989 followed by a space puts the owl into the text-box.
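The offset arithmetic above is plain hexadecimal addition, so it can be checked in code. The helper below is a small illustrative sketch. BlockOffset and Label are invented names, and the block start 0x10980 is the Meroitic range used in this article.

```csharp
using System;

public static class BlockOffset
{
    // Formats the code point 'index' positions after 'blockStart' in U+ notation.
    public static string Label(int blockStart, int index)
    {
        return "U+" + (blockStart + index).ToString("X4");
    }

    public static void Main()
    {
        Console.WriteLine(Label(0x10980, 9));    // U+10989 (the owl)
        Console.WriteLine(Label(0x10980, 10));   // U+1098A (the double wave)
    }
}
```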
To use the other 2 text-boxes:
Type into the top left text-box using the keyboard or an IME. You can also copy characters from the Character Map tool and paste them into this text-box. To find out the Unicode code point of any character, click just to the right of the character to place the cursor there, and a tool-tip will pop up showing the code point. For more features, select Help from the combo-box.
Click the -> button next to this text-box to display the UTF-16 encoding of its content in the right text-box.
Similarly, you can type space-separated groups of 4 hexadecimal digits (UTF-16 code units) into the right text-box and click the <-- button to see the characters in the left text-box.
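The two conversions behind the -> and <-- buttons can be sketched as follows. This is not the demo's actual code: Utf16Hex, ToHex and FromHex are hypothetical names, and the sketch assumes the hex groups are separated by single or multiple spaces and are all valid.

```csharp
using System;
using System.Globalization;
using System.Linq;

public static class Utf16Hex
{
    // Writes one 4-digit hex group per UTF-16 code unit, separated by spaces.
    public static string ToHex(string text)
    {
        return string.Join(" ", text.Select(c => ((int)c).ToString("x4")));
    }

    // Parses space-separated 4-digit hex groups back into a string.
    // Throws FormatException if a group is not valid hexadecimal.
    public static string FromHex(string hex)
    {
        char[] units = hex
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(h => (char)ushort.Parse(h, NumberStyles.AllowHexSpecifier,
                                            CultureInfo.InvariantCulture))
            .ToArray();
        return new string(units);
    }

    public static void Main()
    {
        string text = "A" + char.ConvertFromUtf32(0x2040A);
        string hex = ToHex(text);
        Console.WriteLine(hex);                    // 0041 d841 dc0a
        Console.WriteLine(FromHex(hex) == text);   // True
    }
}
```

A character outside the BMP shows up as two groups, which is the surrogate pair described earlier.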
These are the Unicode groups that you can select from the combo-box. Each entry gives the group name, the code point range, and the font, size and style used:
Meroitic U+10980 - U+109ff Aegyptus,36,BOLD
Hieroglyphs U+f3000 - U+f4b92 Aegyptus,36,BOLD
Chinese U+4e00 - U+9fa5 Arial Unicode MS,14,REGULAR
Phaistos Disc1 U+F01D0 - U+F01E7 Aegean,36,REGULAR
Phaistos Disc2 U+F0200 - U+F0247 Aegean,36,REGULAR
Cypro-Minoan U+F1000 - U+F1136 Aegean,36,REGULAR
Cypriot Syllabary U+F1700 - U+F1853 Aegean,36,REGULAR
A whole list of other groups has been added. See the Top picture.
Advanced Demo
For this demo, you need to download both the Aegyptus and the Aegean fonts. Both can be downloaded via the links at the top of this article.
After you have downloaded and installed these fonts, you should be able to display all the glyphs for each of the Unicode ranges above.
However, a Windows text-box can only be assigned one font at a time, and currently no single font supports every Unicode code point.
If you type U+10980<space>U+F1000<space> in the bottom text-box, at least one of the two characters will not be displayed correctly. This is because the glyph for U+10980 is found in the Aegyptus font and the glyph for U+F1000 is found in the Aegean font. If you assign the Aegean font to the text-box, U+10980 will not display correctly, and if you assign the Aegyptus font, U+F1000 will not display correctly. Unless you can find a font that supports both of these code points, a plain text-box offers no solution.
Ah... but we can use a rich text-box control, right? No. The current version of the rich text-box control supports multiple fonts but does not support surrogate pair encoding. Both U+10980 and U+F1000 are encoded using surrogate pairs, so the rich text-box cannot display these characters.
One solution to this problem is to use a web browser control. The current version of the web browser control supports surrogate pair encoding. To display each character correctly, we put the characters within <div> or <span> tags and assign the correct font in the CSS style of these tags.
<span style="font-family:@font@;color:@color@;font-size:@font-size@px"><b>&#x@unicode@;</b></span>
<div style="font-family:@font@;color:@color@;font-size:@font-size@px"><b>@block@</b></div>
The above are the templates we use to generate the <span> and <div> tags. We replace the placeholders (the @xx@ items) using the getHTMLformatEntry(string s1, string font) function below.
string getHTMLformatEntry(string s1, string font)
{
    // font is "<family>,<point size>,<BOLD|REGULAR>", e.g. "Aegean,36,REGULAR".
    string s = Resource1.sSpan_Template;
    string[] vf = font.Split(',');
    s = s.Replace("@font@", vf[0]);
    int font_size = (int.Parse(vf[1]) * 3) / 2;
    s = s.Replace("@font-size@", font_size + "");
    string[] colors = new string[] { "red", "green", "blue", "magenta",
                                     "cyan", "black", "orange", "pink" };
    // Random.Next(max) excludes max, so passing the array length
    // makes every colour reachable, including the last one.
    Random r = new Random();
    string color = colors[r.Next(colors.Length)];
    s = s.Replace("@color@", color);
    s = s.Replace("@unicode@", s1);
    if (vf[2] != "BOLD")
    {
        // Non-bold entries drop the <b> wrapper from the template.
        s = s.Replace("<b>", "");
        s = s.Replace("</b>", "");
    }
    return s;
}
For instance, to display U+F1000, we pass these parameters:
s1: "f1000"
font: "Aegean,36,REGULAR"
The output would be:
<span style="font-family:Aegean;color:magenta;font-size:54px">&#xf1000;</span>
The color is randomly assigned, but the rest of the placeholders are replaced by data from the input parameters. The font size is 36 * 3 / 2 = 54, and the <b> tags are removed because the font style is not BOLD.
Similarly we can replace the placeholders in the <div> template.
The main difference between the <div> tag and the <span> tag is that a <div> tag takes up the entire line in the web page (unless we use a table and cells). If we want 2 characters with different fonts to sit side by side, we use the <span> tag. The <div> tag is used for a block of characters that all have the same font.
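To make the placeholder substitution easy to check, the sketch below fills the <span> template with the colour passed in as a parameter instead of chosen at random. It is an illustration only: SpanTemplate and Fill are hypothetical names, and the template string is copied from the article rather than loaded from Resource1.

```csharp
using System;

public static class SpanTemplate
{
    // The <span> template from the article, inlined for this example.
    const string Template =
        "<span style=\"font-family:@font@;color:@color@;font-size:@font-size@px\">" +
        "<b>&#x@unicode@;</b></span>";

    // fontSpec is "<family>,<point size>,<BOLD|REGULAR>", e.g. "Aegean,36,REGULAR".
    public static string Fill(string hexCodePoint, string fontSpec, string color)
    {
        string[] parts = fontSpec.Split(',');
        int fontSize = int.Parse(parts[1]) * 3 / 2;   // same scaling as the article

        string html = Template
            .Replace("@font@", parts[0])
            .Replace("@font-size@", fontSize.ToString())
            .Replace("@color@", color)
            .Replace("@unicode@", hexCodePoint);

        if (parts[2] != "BOLD")
            html = html.Replace("<b>", "").Replace("</b>", "");
        return html;
    }

    public static void Main()
    {
        Console.WriteLine(Fill("f1000", "Aegean,36,REGULAR", "magenta"));
        Console.WriteLine(Fill("10980", "Aegyptus,36,BOLD", "black"));
    }
}
```

Passing the colour in keeps the function deterministic, which makes its output easy to compare against the expected tag. The caller can still pick the colour at random.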
Steps for this Demo
1) Type some message in the bottom text-box on the left
2) Click the -> button next to this text-box
3) Select "Cypro-Minoan U+F1000 - U+F1136" Unicode range from the combo-box
4) Hold the Alt key and mouse left click at the first character in the top left text-box
5) Select "Meroitic U+10980 - U+109ff" Unicode range from the combo-box
6) Hold the Alt key and mouse left click at the first character in the top left text-box
Analysis and Explanation
In step 2 when the -> button is clicked, we make use of the <div> template to generate the <div> tag as shown below:
<div style="font-family:Arial Unicode MS;color:black;font-size:14.25px"><b>Demo:
Putting "U+F1000" Aegean font with
"U+10980" side by side</b></div>
In step 4, the mouse click places the cursor behind the intended character, which gives us the Unicode code point of that character, in this case "f1000". Holding the Alt key indicates that we also want to paste the character into the web browser control. We call the getHTMLformatEntry() function, passing in this code point and the current font ("Aegean,36,REGULAR"), to create the tag below:
<span style="font-family:Aegean;color:cyan;font-size:54px">&#xf1000;</span>
Similarly, step 6 also generates a <span> tag, but now the font is different, and the tag below is generated:
<span style="font-family:Aegyptus;color:black;font-size:54px"><b>&#x10980;</b></span>
Besides these 2 templates, we have another template that we use to create the entire HTML page. The placeholder @@ is replaced by the concatenation of all of the previously generated <div> and <span> tags. As tags are generated, we store them in the global variable htmlelements. Replacing @@ with the content of htmlelements gives us a well-formed HTML page that we can use to update the web browser control.
<!DOCTYPE html><html><body>@@</body></html>
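The page assembly step can be sketched as below. The article does not show the type of htmlelements, so this example assumes it is a list of strings. PageBuilder, Build and RemoveLast are hypothetical names used only for this illustration.

```csharp
using System;
using System.Collections.Generic;

public static class PageBuilder
{
    // The page template from the article, inlined for this example.
    const string PageTemplate = "<!DOCTYPE html><html><body>@@</body></html>";

    // Concatenates the generated tags in order and drops them into the template.
    public static string Build(IEnumerable<string> elements)
    {
        return PageTemplate.Replace("@@", string.Concat(elements));
    }

    // Removes the most recently added tag, if there is one.
    // (Assumed to be roughly what a "remove last insert" feature needs.)
    public static void RemoveLast(List<string> elements)
    {
        if (elements.Count > 0)
            elements.RemoveAt(elements.Count - 1);
    }

    public static void Main()
    {
        var htmlelements = new List<string>
        {
            "<div>Demo</div>",
            "<span>&#xf1000;</span>",
            "<span>&#x10980;</span>"
        };
        Console.WriteLine(Build(htmlelements));
        RemoveLast(htmlelements);
        Console.WriteLine(Build(htmlelements));
    }
}
```

Keeping the tags in a list rather than one long string makes removing the last insert a matter of dropping the final entry and rebuilding the page.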
After you have completed all 6 steps, click the "View Source" button to view the HTML page in Notepad. The file is created in the current directory and its default name is temp.html.txt. Rename it to temp.html to view the page in any web browser.
Alternatively, you can just click the "View in External Browser" button to open the page directly in your system's default web browser.
I have tested the generated page successfully on IE 8 and Chrome. If the referenced fonts are installed on your Windows system, the page should be rendered correctly, as most newer browsers support surrogate pair encoding.
You can also click "Remove Last Insert" to remove the last item you inserted into the web browser.
Finally click "Clear" to clear the content of the web browser control.
Points of Interest
1) The code needed to enable direct Unicode typing in a text-box is small and simple enough to include easily in your own project. To enable this feature in any text-box:
private UnicodeProcessing uniprocessing = new UnicodeProcessing();
textBox3.KeyPress += new KeyPressEventHandler(uniprocessing.HandleKeyPress);
2) With Version 3, you can create fanciful web pages that show all those interesting glyphs.
Have fun!
History
19 May 2014: Version 1
21 May 2014: Version 2: Add support for surrogate pairs.
23 May 2014: Version 2d: Add in a combo box to select Unicode Range
24 May 2014: Version 3: Add in web browser control to allow for multiple fonts support
26 May 2014: Version 3b: Encapsulate all Unicode processing functions, making it easier to reuse these features. Add more features to HTML processing in the demo, allowing deletion of the last insert. Fix bug to handle leading spaces and commas in the HTML page
28 May 2014: Version 3c: Added in extensive list of character groupings, including the private area U+e000 - U+f8ff. Also include discussion on the Private Character Editor, eudcedit.exe.
Reference
Wikipedia: Unicode
Wikipedia: UTF-16