Removing or replacing non-printable Unicode characters

Eric Lynch

16 Mar 2007

A class for removing or replacing non-printable Unicode characters.

Introduction

Have you ever required better control over how non-printable characters are displayed? Would you like to remove or replace non-printable characters? Would you like to do this for Unicode text? Recently, I ran into these problems.

I expected an easy solution, but found none. Windows GDI provides an unmanaged function, GetFontUnicodeRanges, that does a lot of the work. I wrote a C# wrapper class, FontGlyphSet, to do the rest.

The FontGlyphSet class provides methods for replacing or removing characters that do not have representations (glyphs) in a specified font. The remainder of the article describes this class and how it was implemented. The class is freely available for use in both commercial and non-commercial applications, provided the copyright notice is retained.

Background

Recently, I needed to remove or replace non-printable characters in one of my applications. Initially, I considered doing what I have always done...treating characters outside the range of ASCII 32 (space) through ASCII 126 (tilde) as non-printable.
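For comparison, the ASCII-range technique described above can be sketched in a few lines. This is only an illustration: the AsciiFilter class and its method names are hypothetical and are not part of the FontGlyphSet download.

```csharp
using System.Text;

public static class AsciiFilter
{
    // Treats only ASCII 32 (space) through ASCII 126 (tilde) as printable.
    public static bool IsPrintableAscii(char c)
    {
        return c >= ' ' && c <= '~';
    }

    // Replaces every character outside that range with the replacement.
    public static string ReplaceNonPrintable(string text, char replacement)
    {
        StringBuilder result = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            result.Append(IsPrintableAscii(c) ? c : replacement);
        }
        return result.ToString();
    }
}
```

The weakness shows up immediately with accented text: a precomposed "ã" (U+00E3) is replaced just like a tab character, even though nearly every font can draw it.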

The very term "non-printable" is a bit of a misnomer that doesn't apply well to modern systems. In a world where printing is increasingly rare, Unicode dominates, and a dizzying array of fonts is available, it is hopelessly outdated.

What I really needed was a way to remove or replace characters that lacked a glyph in the font I was using to draw. So, armed with an updated vocabulary, I set out to search for a solution.

I started out with high hopes that the Font class would provide something like a bool HasGlyph(thisChar) method. But, no such luck. The search continued.

After a little more hunting around MSDN, it seemed that two unmanaged Windows GDI functions, GetFontUnicodeRanges and SelectObject, could be combined to get the information I needed.

The first of these acquires a data structure, GLYPHSET, that includes an array of WCRANGE structures. Each element of the array describes a range of characters that have representations (or glyphs).

So, now it was time to build a friendly wrapper class.

The Sample Program

The sample program is easy to use. Click on the "Remove" button, and the program will remove all of the non-printable characters in the corresponding text box. Click on the "Replace" button, and non-printable characters will be replaced by the character specified as a "Replacement character". You can experiment by editing the text or changing the replacement character. Clicking the "Font" button allows you to try out different fonts. The "Reset" button will set the text and replacement character back to their original values.

The following image shows what happens after the "Remove" and "Replace" buttons are clicked...

The next paragraph contains a brief description of the features of the sample text. If you are unfamiliar with Unicode, don't panic; some explanation and references are provided later in the article.

In the sample text, used by default, the "ã" in the word "amanhã" is actually formed from two characters: an "a" followed by a combining character for the tilde U+0303. Following this, the Unicode code point U+10000 is encoded as a surrogate pair, U+D800 and U+DC00. The other non-printable characters are all control characters with values less than U+0020 (the space character). Interestingly, in .NET 1.1, the TextBox control appears to incorrectly render the surrogate pair as two separate characters.
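To make the encoding concrete, here is a small sketch of a string with the same three features. The exact default text of the sample program is not reproduced here; the Sample constant and the SampleTextDemo class are illustrative assumptions.

```csharp
using System;

public static class SampleTextDemo
{
    // "amanha" + combining tilde (U+0303), then U+10000 written as the
    // surrogate pair U+D800 U+DC00, then a control character (tab, U+0009).
    public const string Sample = "amanha\u0303 \uD800\uDC00 \u0009";

    // Counts Unicode code points, treating each surrogate pair as one.
    public static int CountCodePoints(string text)
    {
        int count = 0;
        for (int i = 0; i < text.Length; i++)
        {
            if (char.IsSurrogatePair(text, i))
            {
                i++; // skip the low surrogate of the pair
            }
            count++;
        }
        return count;
    }
}
```

The string holds 12 UTF-16 words but only 11 code points, and char.ConvertToUtf32(SampleTextDemo.Sample, 8) recovers the value 0x10000 from the pair.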

Using the Code

The class is also easy to use. First, it is necessary to create a FontGlyphSet object from the Font used to draw the text...

// Create a font glyph set for the font used to draw the text
FontGlyphSet glyphSet = new FontGlyphSet(font);

When replacing or removing characters with FontGlyphSet, it is worth enabling the fast lookup first. It uses approximately 8K of memory, but speeds up the Contains method (used by every other method). If you skip this step, the code will still work, but it may take longer to search through all of the separate ranges for a specific character.

// Enable fast Contains method
glyphSet.IsFastContains = true;

Then, to replace or remove characters, just use the ReplaceAbsent or RemoveAbsent methods...

// Replace non-printable characters in "myString" with '.'
myString = glyphSet.ReplaceAbsent(myString, '.');

// Remove non-printable characters from "myString"
myString = glyphSet.RemoveAbsent(myString);

Methods

The FontGlyphSet class includes the following public methods...

// Test if character has representation in font
bool Contains(char charValue);

// Remove characters with no representation from string
string RemoveAbsent(string inText);

// Remove characters with no representation from char array
int RemoveAbsent(char [] inText, int inStart, int inLength,
    char [] outText, int outStart);

// Replace characters with no representation in string
string ReplaceAbsent(string inText, char replacement);

// Replace characters with no representation in char array
int ReplaceAbsent(char [] inText, int inStart, int inLength,
    char [] outText, int outStart, char replacement);

Properties

The FontGlyphSet class includes the following public properties...

// Gets number of code points (characters) with representations
uint CodePointCount { get; }

// Gets GS_8BIT_INDICES value from GLYPHSET flAccel flags
bool Is8BitIndices { get; }

// Gets or sets state enabling fast Contains method
bool IsFastContains { get; set; }

// Gets value of GLYPHSET flAccel flags
uint Flags { get; }

// Gets number of character ranges for font
uint RangeCount { get; }

// Gets array of character ranges for font
FontRange [] Ranges { get; }

The Implementation

The first step is to import GetFontUnicodeRanges and SelectObject so that they can be used within the class...

using System.Runtime.InteropServices;

[DllImport("gdi32.dll")]
private static extern uint GetFontUnicodeRanges (
    IntPtr hdc, IntPtr lpgs);

[DllImport("gdi32.dll")]
private static extern IntPtr SelectObject (
    IntPtr hdc, IntPtr hgdiobj);

The next step is to get the information from these unmanaged functions. To do this, get a device context, associate the font with it, invoke GetFontUnicodeRanges to get the size of the GLYPHSET structure, and then call the same function a second time to get the data for the structure...

// Get a GDI+ drawing surface (any will do)
Graphics g = Graphics.FromHwnd(IntPtr.Zero);

// Get handle to device context for the graphics object
IntPtr hdc = g.GetHdc();

// Get a handle to our font
IntPtr hFont = font.ToHfont();

// Replace currently selected font with our font
IntPtr savedFont = SelectObject(hdc, hFont);

// Get the size (in bytes) of the GLYPHSET structure
uint size = GetFontUnicodeRanges(hdc, IntPtr.Zero);

// Allocate memory to receive GLYPHSET structure
IntPtr glyphSetData = Marshal.AllocHGlobal((int) size);

// Copy the GLYPHSET structure into allocated memory
GetFontUnicodeRanges(hdc, glyphSetData);

After this, the Marshal.ReadInt16 and Marshal.ReadInt32 methods are used to get the data from the structure. The data is loaded into an array of FontRange objects. The FontRange class is another class included with this solution. However, since it is unlikely one would call it directly, I will leave its documentation to the code comments.
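As a rough sketch of that reading step, the following walks the documented GLYPHSET layout: four 32-bit fields (cbThis, flAccel, cGlyphsSupported, cRanges) followed by cRanges WCRANGE entries of two 16-bit fields each (wcLow, cGlyphs). The GlyphSetReader class is hypothetical, and it returns plain int pairs because the shape of the real FontRange class is not shown in this article.

```csharp
using System;
using System.Runtime.InteropServices;

public static class GlyphSetReader
{
    // Size of the four 32-bit header fields that precede the ranges.
    private const int HeaderSize = 16;
    // Size of one WCRANGE: 16-bit wcLow followed by 16-bit cGlyphs.
    private const int RangeSize = 4;

    // Returns one { first, last } pair of character values per WCRANGE.
    public static int[][] ReadRanges(IntPtr glyphSetData)
    {
        // cRanges is the fourth 32-bit field, at byte offset 12.
        int rangeCount = Marshal.ReadInt32(glyphSetData, 12);
        int[][] ranges = new int[rangeCount][];
        for (int i = 0; i < rangeCount; i++)
        {
            int offset = HeaderSize + i * RangeSize;
            // ReadInt16 is signed, so mask to recover the unsigned value.
            int low = Marshal.ReadInt16(glyphSetData, offset) & 0xFFFF;
            int count = Marshal.ReadInt16(glyphSetData, offset + 2) & 0xFFFF;
            ranges[i] = new int[] { low, low + count - 1 };
        }
        return ranges;
    }
}
```

The masking matters for ranges that start at 0x8000 or above, which would otherwise come back as negative numbers.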

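Following this, because many of the resources are unmanaged, there is a bit of clean up required. The handle returned by Font.ToHfont is itself a GDI object that must be released with DeleteObject, so one more import is needed alongside the two shown earlier...

[DllImport("gdi32.dll")]
private static extern bool DeleteObject(IntPtr hObject);

// Free the memory used by the GLYPHSET structure
Marshal.FreeHGlobal(glyphSetData);

// Restore the previously selected font
SelectObject(hdc, savedFont);

// Delete the GDI font handle created by ToHfont
DeleteObject(hFont);

// Release handle to device context for graphics object
g.ReleaseHdc(hdc);

// Dispose of GDI+ graphics surface
g.Dispose();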

The removal and replacement of characters is pretty obvious with a couple of small exceptions. While debugging, I discovered that some fonts have quite a lot of character ranges. I became concerned that the lookup time to test if a character exists in one of these ranges might be prohibitive.

So, I added the IsFastContains property. By building a BitArray with bits for each of the 65,536 possible character values, the lookup time could be reduced considerably.
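The trade-off can be sketched as follows. This is not the FontGlyphSet source: the GlyphLookup class, its EnableFast method, and the use of int pairs for ranges are assumptions made for illustration.

```csharp
using System.Collections;

public sealed class GlyphLookup
{
    private readonly int[][] ranges;   // each entry is { first, last }
    private BitArray fastTable;        // null until EnableFast is called

    public GlyphLookup(int[][] ranges)
    {
        this.ranges = ranges;
    }

    // Builds a 65,536-bit table (8,192 bytes) with one bit per UTF-16 value.
    public void EnableFast()
    {
        BitArray table = new BitArray(65536);
        foreach (int[] range in ranges)
        {
            for (int value = range[0]; value <= range[1]; value++)
            {
                table[value] = true;
            }
        }
        fastTable = table;
    }

    public bool Contains(char c)
    {
        if (fastTable != null)
        {
            return fastTable[c];   // single indexed lookup
        }
        foreach (int[] range in ranges)   // linear scan of every range
        {
            if (c >= range[0] && c <= range[1])
            {
                return true;
            }
        }
        return false;
    }
}
```

Without the table, the cost of each test grows with the number of ranges in the font; with it, every test is one indexed read, at the price of building the table once.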

Another small problem arises with the way strings are encoded internally. They are encoded as a series of 16-bit words, a scheme called UTF-16. The most commonly used characters fall in a range known as the BMP (Basic Multilingual Plane). Each of these characters has a value less than 65,536 and can be encoded in a single word.

However, Unicode also allows for characters with values greater than 65,535. These values are encoded as surrogate pairs: 0x10000 is subtracted from the value, and each 16-bit word of the pair carries 10 bits of the result. All such characters are treated as non-printable by FontGlyphSet.

The problem arises during replacement of these characters. Since the Unicode character actually takes up two 16-bit words in the string, it is important to replace both words of this surrogate pair with a single character. Changes were made to ReplaceAbsent to properly handle this situation.
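A minimal sketch of that idea follows. It is not the actual ReplaceAbsent implementation; the SurrogateAwareReplace class is hypothetical, and the hasGlyph delegate stands in for FontGlyphSet.Contains.

```csharp
using System;
using System.Text;

public static class SurrogateAwareReplace
{
    // Replaces characters without a glyph, collapsing each surrogate
    // pair into a single replacement character.
    public static string ReplaceAbsent(
        string text, char replacement, Predicate<char> hasGlyph)
    {
        StringBuilder result = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            if (char.IsSurrogatePair(text, i))
            {
                // Two 16-bit words form one character: emit one replacement.
                result.Append(replacement);
                i++; // skip the second word of the pair
            }
            else if (hasGlyph(text[i]))
            {
                result.Append(text[i]);
            }
            else
            {
                result.Append(replacement);
            }
        }
        return result.ToString();
    }
}
```

Without the pair check, a single character such as U+10000 would show up as two replacement characters in the output.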

Removal of characters is less problematic. Each word of the surrogate pair uses a range of values that uniquely identifies it as such (0xD800-0xDBFF for the first word and 0xDC00-0xDFFF for the second). ISO/IEC 10646 reserves these values, so they are never assigned to characters of their own. For this reason, one would not expect glyphs for these values in a properly designed font. The code simply depends on the fact that both words of the surrogate pair will be removed independently as a consequence of their values.
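In sketch form, removal needs no pairing logic at all. As before, this is an illustration rather than the FontGlyphSet source: SimpleRemove is a hypothetical class and hasGlyph stands in for FontGlyphSet.Contains.

```csharp
using System;
using System.Text;

public static class SimpleRemove
{
    // Keeps only the characters for which hasGlyph returns true.
    public static string RemoveAbsent(string text, Predicate<char> hasGlyph)
    {
        StringBuilder result = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            // Each surrogate half fails the test by itself, so the
            // whole pair disappears without any special handling.
            if (hasGlyph(c))
            {
                result.Append(c);
            }
        }
        return result.ToString();
    }
}
```

This relies on the font reporting no glyphs in the 0xD800-0xDFFF range; a font that did would leave stray surrogate halves in the output.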

Conclusions

The FontGlyphSet class that I have implemented works well enough for my needs. It is certainly an improvement over the technique of removing/replacing characters outside the printable ASCII range. However, the Unicode character set is very large and my time is very limited. So, it is quite likely that the class will need further refinements. I will do my best to keep the code here up-to-date with any new discoveries.

Although I am a long-time reader, this is my first submission to Code Project. I hope others find it helpful.

References

GetFontUnicodeRanges (MSDN), http://msdn2.microsoft.com/en-us/library/ms533944.aspx

Mapping of Unicode characters (Wikipedia), http://en.wikipedia.org/wiki/Mapping_of_Unicode_characters

GLYPHSET (MSDN), http://msdn2.microsoft.com/en-us/library/ms533995.aspx

SelectObject (MSDN), http://msdn2.microsoft.com/en-us/library/ms533272.aspx

Unicode (Wikipedia), http://en.wikipedia.org/wiki/Unicode

UTF-16 (Wikipedia), http://en.wikipedia.org/wiki/UTF-16

WCRANGE (MSDN), http://msdn2.microsoft.com/en-us/library/ms534022.aspx

Revision History

03-12-2007

Original article.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.
