Class to read Unicode character names and a tool to display/search them

Eric Lynch

3 Aug 2011, CPOL, 3 min read

A class to read Unicode character names and a tool to display/search them.

Introduction

This article describes a class to read Unicode character names. The names are read from the file UnicodeData.txt, which is one of the files that make up the Unicode Character Database. A copy of the file is included with the demo project, but it can also be downloaded from the Unicode Consortium's website.

A demo application is also provided. The application allows the user to enter a decimal or hexadecimal code point, type a character, or search for a character name. While the application is useful for browsing the available names, its source is more complex than necessary for learning how to use the classes themselves.

Background

The UnicodeNames class, which this article describes, relies on some supporting classes that I described in previous articles. Specifically, it uses the CsvReader class described in the article Flexible CSV reader/writer with progress reporting. It also uses the PreviewTextBox control described in the article WPF TextBox with PreviewTextChanged event for filtering.

You do not need to read either of those articles in order to use the UnicodeNames class described here.

There are two name fields available in UnicodeData.txt. The Name field, located in the second column of the file, is used preferentially. However, for control characters, this field always contains <control>. In these instances, the alternative name field, Unicode_1_Name, is used when it is available.

For example, code point 10 (decimal) has a name of <control> and a Unicode_1_Name of LINE FEED (LF). In this case, the latter is used.
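The fallback rule can be sketched in a few lines. This is only an illustration of the rule described above, not the code inside UnicodeNames. The helper name SelectName is hypothetical, and the sketch assumes the standard UnicodeData.txt layout, in which fields are separated by semicolons and Unicode_1_Name is the eleventh column.

```csharp
using System;

public static class UnicodeDataLine
{
    // Hypothetical helper, not part of the article's UnicodeNames class.
    // Picks the display name from one semicolon-delimited line of UnicodeData.txt.
    public static string SelectName(string line)
    {
        string[] fields = line.Split(';');

        // Name is the second column (index 1).
        string name = fields[1];

        // Unicode_1_Name is the eleventh column (index 10), if present.
        string oldName = fields.Length > 10 ? fields[10] : string.Empty;

        // Control characters all share the name "<control>", so prefer
        // the older name whenever one is available.
        if (name == "<control>" && oldName.Length > 0)
        {
            return oldName;
        }

        return name;
    }
}
```

Given the line for code point 10, `000A;<control>;Cc;0;B;;;;;N;LINE FEED (LF);;;;`, this helper returns LINE FEED (LF). For a control character with an empty Unicode_1_Name field, it falls back to <control>.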

Additionally, there are large ranges of code points that are not listed individually in the file. Instead, each range is represented by two entries: one for its first code point and one for its last. The names of these two entries end with the suffixes First and Last. The UnicodeNames class returns the same name (without the suffix) for every code point from the first through the last.

Examples of these code points (in hexadecimal) are 100000 (First), 100001, and 10FFFD (Last). All are within the range <Plane 16 Private Use>.
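One way to handle such ranges is sketched below. It is an assumption about how this could be done, not a copy of the UnicodeNames implementation, and the names UnicodeRanges, Build, and Find are hypothetical. The sketch relies on the form the marker entries take in UnicodeData.txt, for example `<Plane 16 Private Use, First>` and `<Plane 16 Private Use, Last>`.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class UnicodeRanges
{
    private const string FirstSuffix = ", First>";
    private const string LastSuffix = ", Last>";

    // Hypothetical helper: collects (first, last, name) triples from
    // the lines of UnicodeData.txt.
    public static List<Tuple<int, int, string>> Build(IEnumerable<string> lines)
    {
        var ranges = new List<Tuple<int, int, string>>();
        int first = -1; // start of the range currently open, or -1 if none

        foreach (string line in lines)
        {
            string[] fields = line.Split(';');
            if (fields.Length < 2) continue;

            int codePoint = int.Parse(fields[0], NumberStyles.HexNumber,
                                      CultureInfo.InvariantCulture);
            string name = fields[1];

            if (name.EndsWith(FirstSuffix, StringComparison.Ordinal))
            {
                first = codePoint;
            }
            else if (first >= 0 && name.EndsWith(LastSuffix, StringComparison.Ordinal))
            {
                // Drop ", Last>" and restore the closing bracket.
                string baseName =
                    name.Substring(0, name.Length - LastSuffix.Length) + ">";
                ranges.Add(Tuple.Create(first, codePoint, baseName));
                first = -1;
            }
        }

        return ranges;
    }

    // Returns the range name that covers codePoint, or null if none does.
    public static string Find(List<Tuple<int, int, string>> ranges, int codePoint)
    {
        foreach (Tuple<int, int, string> range in ranges)
        {
            if (codePoint >= range.Item1 && codePoint <= range.Item2)
            {
                return range.Item3;
            }
        }
        return null;
    }
}
```

A linear search is enough here because the file contains only a small number of such ranges. With the two Plane 16 entries loaded, looking up hexadecimal 100001 yields <Plane 16 Private Use>.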

Using the Code

The code could not be much simpler to use. Creating an instance of the class is done as follows:

string path = "UnicodeData.txt";
UnicodeNames names = new UnicodeNames(path);
names.LoadFile();

Getting the name for a character is as simple as the following:

int codePoint = 10; // LINE FEED (LF)
string name = names[codePoint];

You should also plan on disposing of the UnicodeNames instance when you are no longer going to use it, as follows:

names.Dispose();

Finally, if you have concerns about the time required to load the file into memory, there are methods and properties that make it easy to integrate with a BackgroundWorker component.

The RowEnd event is raised each time a row of the file is read. To further assist, the ProgressPercentage property describes the percentage of the file that has been loaded. Lastly, the CancelAsync method is available to abort the load operation.
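The sketch below shows one way these members might be wired to a BackgroundWorker. The exact signatures in UnicodeNames are not given above, so the interface IRowLoader is a hypothetical stand-in. It assumes that RowEnd is a plain EventHandler, that ProgressPercentage is an int, and that RowEnd is raised on the thread that calls LoadFile. Adjust these to match the actual class.

```csharp
using System;
using System.ComponentModel;

// Hypothetical stand-in for the members named in the article.
// The signatures are assumptions, not the actual UnicodeNames API.
public interface IRowLoader
{
    event EventHandler RowEnd;        // assumed signature
    int ProgressPercentage { get; }   // assumed type
    void LoadFile();
    void CancelAsync();
}

public static class BackgroundLoad
{
    // Creates a worker that loads the file and forwards progress.
    public static BackgroundWorker Create(IRowLoader loader)
    {
        var worker = new BackgroundWorker
        {
            WorkerReportsProgress = true,
            WorkerSupportsCancellation = true
        };

        worker.DoWork += (sender, e) =>
        {
            int lastReported = -1;

            EventHandler onRowEnd = (s, args) =>
            {
                // Pass a pending cancellation request on to the loader.
                if (worker.CancellationPending)
                {
                    loader.CancelAsync();
                    return;
                }

                // Report only when the percentage changes, to avoid
                // raising ProgressChanged for every row.
                int percent = loader.ProgressPercentage;
                if (percent != lastReported)
                {
                    lastReported = percent;
                    worker.ReportProgress(percent);
                }
            };

            loader.RowEnd += onRowEnd;
            try
            {
                loader.LoadFile();
            }
            finally
            {
                loader.RowEnd -= onRowEnd;
            }

            e.Cancel = worker.CancellationPending;
        };

        return worker;
    }
}
```

The caller would subscribe to the worker's ProgressChanged and RunWorkerCompleted events, call RunWorkerAsync to start the load, and call the worker's own CancelAsync to request that it stop.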

Points of Interest

This was my first full-fledged foray into WPF development. I did cheat a little bit and use WinForms for its AboutBox. To the WPF purists out there, I apologize.

History

3 Aug 2011 - The original version was uploaded.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL).