
A Simple GB2312/GBK Encoder for Silverlight

31 Mar 2011

Introduction

This article describes a GBK encoder designed for Silverlight. It is implemented as a single static class, GBKEncoder, whose public methods encode a string into GBK bytes and decode a GBK byte array back into a string. The class is very simple, so you can customize it easily.

Background

In Silverlight 4, you can save data to a file on the local disk, which makes it possible to export a DataGrid to a local CSV file. However, if the CSV file contains Chinese characters and is UTF-8 encoded, Microsoft Excel 2003 does not display it correctly when you open it by double-clicking. For the file to open correctly that way, it must be exported in GB2312/GBK encoding. Unfortunately, the only encodings supported by Silverlight 4 are UTF-8 and UTF-16. There are several ways around this problem:

  1. Send the data being exported back to the server, generate the CSV file on the server, and redirect the user to download it.
  2. Send the data back to the server, return the encoded bytes of the CSV file to the Silverlight app, and save them to a local file.
  3. Tell your customers: "Excel 2007, please!" This is the easiest way, but I'm quite sure we would not do that :)
  4. Export the GBK-encoded file directly within Silverlight, until Microsoft provides a solution. This is the approach this article describes.

GBKEncoder Class

public static class GBKEncoder
{
    /// <summary>
    /// Writes a string to the stream.
    /// </summary>
    /// <param name="s">The stream to write into.</param>
    /// <param name="str">The string to write. </param>
    static public void Write(System.IO.Stream s, string str);

    /// <summary>
    /// Encodes string into a sequence of bytes.
    /// </summary>
    /// <param name="str">The string to encode.</param>
    /// <returns>A byte array containing the results of encoding 
    /// the specified set of characters.</returns>
    static public byte[] ToBytes(string str);

    /// <summary>
    /// Decodes a sequence of bytes into a string.
    /// </summary>
    /// <param name="buffer">The byte array containing the sequence of bytes to decode. 
    /// </param>
    /// <returns>A string containing the results of decoding 
    /// the specified sequence of bytes.
    /// </returns>
    static public string Read(byte[] buffer);

    /// <summary>
    /// Decodes a sequence of bytes into a string.
    /// </summary>
    /// <param name="buffer">The byte array containing the sequence of bytes to decode. 
    /// </param>
    /// <param name="iStartIndex">The index of the first byte to decode.</param>
    /// <param name="iLength">The number of bytes to decode. </param>
    /// <returns>A string containing the results of decoding 
    /// the specified sequence of bytes.
    /// </returns>
    static public string Read(byte[] buffer, int iStartIndex, int iLength);

    /// <summary>
    /// Read string from stream.
    /// </summary>
    /// <param name="s">The stream to read from.</param>
    /// <returns>A string containing all characters in the stream.</returns>
    static public string Read(Stream s);
}

Using the Code

The GBKEncoder class is straightforward to use: download GBKEncoder, add GBKEncoder.cs to your project, and rename the namespace if you like. Although the class is designed for Silverlight, you can test it in any type of C# project. For example, in a console application:

using (System.IO.FileStream fs = new System.IO.FileStream(
    "test.txt", System.IO.FileMode.Create))
{
    GBKEncoder.Write(
        fs, "This is a test for GBKEncoder.这是一段用来测试GBKEncoder类的代码。 ");
}

Run the code and open the resulting test.txt; you'll find that it is encoded as GBK.

In Silverlight, usage might look like this:

SaveFileDialog dlg = new SaveFileDialog() { 
    DefaultExt = "csv", 
    Filter = "CSV Files (*.csv)|*.csv|All files (*.*)|*.*", 
    FilterIndex = 1 
};
StringBuilder sb = new StringBuilder();
// some code to fill sb ...
if (dlg.ShowDialog() == true)
{
    using (Stream s = dlg.OpenFile())
    {
        GBKEncoder.Write(s, sb.ToString());
    }
}
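
The "some code to fill sb" step is left open in the snippet above. As a rough sketch of one way to fill it, assuming the rows to export are already available as string arrays, the text could be built as follows. The CsvBuilder class and its method names are hypothetical and are not part of the GBKEncoder download.

```csharp
using System.Collections.Generic;
using System.Text;

// Hypothetical helper for illustration; not part of the article's download.
public static class CsvBuilder
{
    // Quotes a field when it contains a comma, a quote, or a line break,
    // doubling any embedded quotes (the usual CSV convention).
    public static string EscapeField(string field)
    {
        if (field == null) return "";
        bool needsQuotes = field.IndexOfAny(new[] { ',', '"', '\r', '\n' }) >= 0;
        if (!needsQuotes) return field;
        return "\"" + field.Replace("\"", "\"\"") + "\"";
    }

    // Builds CSV text with CRLF line endings from rows of fields.
    public static string Build(IEnumerable<string[]> rows)
    {
        StringBuilder sb = new StringBuilder();
        foreach (string[] row in rows)
        {
            for (int i = 0; i < row.Length; i++)
            {
                if (i > 0) sb.Append(',');
                sb.Append(EscapeField(row[i]));
            }
            sb.Append("\r\n");
        }
        return sb.ToString();
    }
}
```

The resulting string can then be passed to GBKEncoder.Write in place of sb.ToString().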

Performance

The following code was designed to test the performance of GBKEncoder class:

static void PerformanceTest()
{
    StringBuilder sb = new StringBuilder();
    Random r = new Random();
    for (int i = 0; i < 200; i++)
    {
        for (int u = 0x4E00; u < 0x9Fa0; u++)
        {
            sb.Append((char)u);
            if (r.Next(0, 5) == 0)
            {
                sb.Append((char)r.Next(32, 126));
            }
        }
    }

    string str = sb.ToString();
    Console.WriteLine("Total character count : {0}", str.Length);

    HighPrecisionTimer timer = new HighPrecisionTimer();

    timer.Start();
    using (System.IO.FileStream fs = new System.IO.FileStream(
        "test1.txt", System.IO.FileMode.Create))
    {
         GBKEncoder.Write(fs, str);
    }
    timer.Stop();
    timer.ShowDurationToConsole();

    timer.Start();
    using (StreamWriter sw = new StreamWriter("test2.txt", false,
        Encoding.GetEncoding("GBK")))
    {
        sw.Write(str);
    }
    timer.Stop();
    timer.ShowDurationToConsole();

    timer.Start();
    string str2 = "";
    using (FileStream fs = new FileStream("test1.txt", FileMode.Open))
    {
        str2 = GBKEncoder.Read(fs);
    }
    timer.Stop();
    timer.ShowDurationToConsole();

    timer.Start();
    string str3 = File.ReadAllText("test2.txt", Encoding.GetEncoding("GBK"));
    timer.Stop();
    timer.ShowDurationToConsole();

    if (str == str2 && str2 == str3)
    {
        Console.WriteLine("Success!");
    }
    else
    {
        Console.WriteLine("Error!!!");
    }
}

Test environment: Vista 32-bit, Q6600 overclocked to 3.0 GHz, 4 GB memory

Result:

Unit: milliseconds
Character count: 5014060

               Encode   Decode
GBKEncoder     39.9     62.4
.NET native    75.6     63.6

Because the implementation of GBKEncoder is very simple and straightforward, and does not handle the more complex situations a general-purpose encoder must consider, it encodes faster than the native .NET encoder in this test. Decoding speed is about the same.

GBKEncoder adds about 50 KB to your XAP file and consumes about 260 KB of memory at runtime.
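
The HighPrecisionTimer type used in the test is not shown in the article. If you want to rerun the test, a minimal stand-in built on System.Diagnostics.Stopwatch might look like the sketch below. Only the three member names called in the test come from the article; the body and the DurationMilliseconds property are assumptions.

```csharp
using System;
using System.Diagnostics;

// Hypothetical stand-in for the HighPrecisionTimer used in PerformanceTest;
// the original type is not part of the article text.
public class HighPrecisionTimer
{
    private readonly Stopwatch m_watch = new Stopwatch();

    // Clears any previous measurement and starts timing, so one instance
    // can be reused for each phase of the test.
    public void Start()
    {
        m_watch.Reset();
        m_watch.Start();
    }

    public void Stop()
    {
        m_watch.Stop();
    }

    // Elapsed time of the most recent Start/Stop pair, in milliseconds.
    public double DurationMilliseconds
    {
        get { return m_watch.Elapsed.TotalMilliseconds; }
    }

    public void ShowDurationToConsole()
    {
        Console.WriteLine("Duration: {0:F1} ms", DurationMilliseconds);
    }
}
```

Start resets the stopwatch because the test calls Start four times on the same instance; without the reset, each reading would include the earlier phases.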

Implementation Details  

GBK is an extension of the GB2312 character set for Simplified Chinese. Each character is encoded as 1 or 2 bytes: 1 byte for standard ASCII characters, 2 bytes for Chinese ideographs and punctuation characters.
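
To make the 1-or-2-byte structure concrete, here is a small sketch that walks a GBK byte sequence and counts characters. It relies on the fact that a GBK lead byte lies in the range 0x81 to 0xFE, while ASCII bytes are below 0x80. The GbkBytes class is hypothetical and is not taken from GBKEncoder.

```csharp
// Hypothetical helper for illustration; not part of GBKEncoder.
public static class GbkBytes
{
    // A byte in 0x81..0xFE starts a two-byte GBK character.
    public static bool IsLeadByte(byte b)
    {
        return b >= 0x81 && b <= 0xFE;
    }

    // Counts the characters in a GBK byte sequence. A lead byte at the
    // very end of the buffer (truncated input) is counted as one character.
    public static int CountCharacters(byte[] buffer)
    {
        int count = 0;
        int i = 0;
        while (i < buffer.Length)
        {
            // Skip the trail byte as well when a lead byte is followed by one.
            i += (IsLeadByte(buffer[i]) && i + 1 < buffer.Length) ? 2 : 1;
            count++;
        }
        return count;
    }
}
```

For example, the five bytes 41 D6 D0 CE C4 (the GBK encoding of "A中文") are three characters: one ASCII byte followed by two byte pairs.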

GBKEncoder is implemented using two arrays: sm_mapUnicode2GBK for the Unicode-to-GBK mapping and sm_mapGBK2Unicode for the GBK-to-Unicode mapping. The mappings are based on the table at http://www.cs.nyu.edu/~yusuke/tools/unicode_to_gb2312_or_gbk_table.html. The contents of sm_mapUnicode2GBK are hardcoded in the source code; the contents of sm_mapGBK2Unicode are built from sm_mapUnicode2GBK at runtime.

The following method was used to generate the contents of sm_mapUnicode2GBK:
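
The two-table idea can be sketched in miniature. The code below is not the GBKEncoder source: the TinyGbkTables class, its field names, the three sample entries, and the '?' fallback are all assumptions made for illustration. It does follow the layout the generator code suggests, where two bytes per Unicode code unit hold the GBK code and a zero high byte marks a single-byte character.

```csharp
using System.Collections.Generic;
using System.Text;

// Hypothetical miniature of the two-table design; not the article's source.
public static class TinyGbkTables
{
    // Forward table: bytes [u*2] and [u*2+1] hold the GBK code of U+u.
    static readonly byte[] s_unicodeToGbk = BuildForward();
    // Reverse table: the index is the GBK code, the value is the character.
    static readonly char[] s_gbkToUnicode = BuildReverse(s_unicodeToGbk);

    static byte[] BuildForward()
    {
        byte[] map = new byte[0x10000 * 2];
        // ASCII maps to itself as a single byte (the high byte stays 0).
        for (int u = 1; u < 0x80; u++) map[u * 2 + 1] = (byte)u;
        Set(map, '\u4E2D', 0xD6D0); // 中
        Set(map, '\u6587', 0xCEC4); // 文
        Set(map, '\u554A', 0xB0A1); // 啊
        return map;
    }

    static void Set(byte[] map, char c, int gbk)
    {
        map[c * 2] = (byte)(gbk >> 8);
        map[c * 2 + 1] = (byte)(gbk & 0xFF);
    }

    // Derives the reverse table from the forward table at startup.
    static char[] BuildReverse(byte[] forward)
    {
        char[] reverse = new char[0x10000];
        for (int u = 1; u < 0x10000; u++)
        {
            int gbk = (forward[u * 2] << 8) | forward[u * 2 + 1];
            if (gbk != 0) reverse[gbk] = (char)u;
        }
        return reverse;
    }

    public static byte[] ToBytes(string str)
    {
        List<byte> bytes = new List<byte>(str.Length * 2);
        foreach (char c in str)
        {
            byte hi = s_unicodeToGbk[c * 2], lo = s_unicodeToGbk[c * 2 + 1];
            if (hi != 0) bytes.Add(hi);                      // two-byte character
            bytes.Add(hi == 0 && lo == 0 ? (byte)'?' : lo);  // unmapped becomes '?'
        }
        return bytes.ToArray();
    }

    public static string Read(byte[] buffer)
    {
        StringBuilder sb = new StringBuilder(buffer.Length);
        for (int i = 0; i < buffer.Length; i++)
        {
            int code = buffer[i];
            // A lead byte combines with the following byte into one code.
            if (code >= 0x81 && i + 1 < buffer.Length)
                code = (code << 8) | buffer[++i];
            sb.Append(s_gbkToUnicode[code]);
        }
        return sb.ToString();
    }
}
```

Both directions are plain array lookups with no per-character branching beyond the lead-byte check. Keeping only the forward table in source and deriving the reverse table at startup trades a little startup work for a smaller download.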

/// <summary>
/// Generate unicode to GBK mapping according to
/// http://www.cs.nyu.edu/~yusuke/tools/unicode_to_gb2312_or_gbk_table.html
/// </summary>
static private void GenUnicode2GBKMapping()
{
    XmlDocument doc = new XmlDocument();
    doc.Load("Unicode2GBK.xml");
    int iCount = 0;
    byte[] sm_mapUnicode2GBK = new byte[0xFFFF * 2];
    foreach (XmlNode n in doc.DocumentElement.ChildNodes)
    {
        if (n.ChildNodes.Count == 18)
        {
            string strUnicode = n.ChildNodes[0].InnerText;
            if (!strUnicode.StartsWith("U+", StringComparison.Ordinal))
                throw new ApplicationException(
                    string.Format("{0} is not a valid Unicode code point", strUnicode));

            int u = int.Parse(strUnicode.Substring(2),
                System.Globalization.NumberStyles.HexNumber);

            for (int i = 2; i < 18; i++)
            {
                int j = (i - 2 + u) * 2;

                foreach (XmlNode subNode in n.ChildNodes[i])
                {
                     if (subNode.Name.ToLower() == "small")
                     {
                         string str = subNode.InnerText.Trim().Trim('*');
                         if (str.Length == 2)
                         {
                             // Single-byte code: the high byte stays 0.
                             sm_mapUnicode2GBK[j] = 0;
                             sm_mapUnicode2GBK[j + 1] = byte.Parse(
                                 str, System.Globalization.NumberStyles.HexNumber);
                             iCount++;
                         }
                         else if (str.Length == 4)
                         {
                             // Two-byte code: high byte first, then low byte.
                             sm_mapUnicode2GBK[j] = byte.Parse(str.Substring(0, 2),
                                 System.Globalization.NumberStyles.HexNumber);
                             sm_mapUnicode2GBK[j + 1] = byte.Parse(str.Substring(2),
                                 System.Globalization.NumberStyles.HexNumber);
                             iCount++;
                         }
                         else
                         {
                             throw new ApplicationException(
                                 string.Format("{0} is not a valid code", n.ChildNodes[i].OuterXml));
                         }
                     }
                }
            }
        }
    }
    Console.WriteLine("Converted {0} characters in total", iCount);

    StringBuilder sb = new StringBuilder();
    sb.AppendLine("static byte[] sm_mapUnicode2GBK = new byte[] {");

    for (int i = 0; i < sm_mapUnicode2GBK.Length; i++)
    {
        if (i != 0 && i % 16 == 0) sb.AppendLine();
        sb.Append(sm_mapUnicode2GBK[i]);
        if (i < sm_mapUnicode2GBK.Length - 1) sb.Append(", ");
    }
    sb.AppendLine("};");

    File.WriteAllText("sm_mapUnicode2GBK.cs", sb.ToString());
}

Unicode2GBK.xml is an XML file containing the Unicode-to-GBK mapping information, generated from http://www.cs.nyu.edu/~yusuke/tools/unicode_to_gb2312_or_gbk_table.html.

History

  • 2011-03-29 Initial, just encode 
  • 2011-03-31 Improve performance, encode and decode 

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.
