Introduction
This article describes a GBK encoder designed for Silverlight. It is implemented as a single class, GBKEncoder, which provides public static methods to encode a string and to decode a byte array using GBK encoding. The class is very simple, so you can customize it easily.
Background
In Silverlight 4, you can save data to a file on the local disk, and you can use this feature to export a DataGrid to a local CSV file. However, if the CSV file contains Chinese characters and is encoded as UTF-8, Microsoft Excel 2003 does not display it correctly when you open it by double-clicking. For that to work, you must export the CSV file using GB2312/GBK encoding. Unfortunately, the only encodings supported by Silverlight 4 are UTF-8 and UTF-16. There are several solutions to this problem:
- Send the data being exported back to the server, generate the CSV file on the server, and redirect the user to download it.
- Send the data back to the server, return the encoded binary content of the CSV file to the Silverlight app, and then save it to a local file.
- Tell your customers: "Excel 2007, please!" This is the easiest way, but I'm quite sure we would not do that :)
- Implement methods that export a GBK-encoded file directly within Silverlight until Microsoft provides a solution. This is what this article is about.
GBKEncoder Class
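public static class GBKEncoder
{
    // Encode str as GBK and write the bytes to the stream.
    public static void Write(System.IO.Stream s, string str);
    // Encode str as GBK and return the bytes.
    public static byte[] ToBytes(string str);
    // Decode GBK bytes (a whole buffer, a slice of it, or a stream) into a string.
    public static string Read(byte[] buffer);
    public static string Read(byte[] buffer, int iStartIndex, int iLength);
    public static string Read(System.IO.Stream s);
}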
Using the Code
The GBKEncoder class is straightforward to use. Download GBKEncoder, add GBKEncoder.cs to your project, and rename the namespace if you wish. Although this class is designed for Silverlight, you can test it in any type of C# project. For example, in a console application:
using (System.IO.FileStream fs = new System.IO.FileStream("test.txt", System.IO.FileMode.Create))
{
    GBKEncoder.Write(fs, "This is a test for GBKEncoder.这是一段用来测试GBKEncoder类的代码。 ");
}
Run the code and open the generated test.txt; you'll find that it is encoded as GBK.
In Silverlight, the code might look like this:
SaveFileDialog dlg = new SaveFileDialog()
{
    DefaultExt = "csv",
    Filter = "CSV Files (*.csv)|*.csv|All files (*.*)|*.*",
    FilterIndex = 1
};
StringBuilder sb = new StringBuilder();
// Append the CSV content to sb here before saving.
if (dlg.ShowDialog() == true)
{
    using (Stream s = dlg.OpenFile())
    {
        GBKEncoder.Write(s, sb.ToString());
    }
}
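The StringBuilder in the snippet above is left empty. As a minimal sketch of how the CSV text could be assembled before it is passed to GBKEncoder.Write, the helper below joins rows with commas and quotes fields where needed. The CsvBuilder class and its method names are hypothetical and are not part of GBKEncoder.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper (not part of GBKEncoder): builds CSV text from rows.
public static class CsvBuilder
{
    // Quote a field when it contains a comma, a quote, or a line break;
    // embedded quotes are doubled, which is the convention Excel reads.
    public static string EscapeField(string field)
    {
        if (field == null) return "";
        bool needsQuotes = field.IndexOfAny(new[] { ',', '"', '\r', '\n' }) >= 0;
        if (!needsQuotes) return field;
        return "\"" + field.Replace("\"", "\"\"") + "\"";
    }

    // Join each row with commas and terminate it with CRLF.
    public static string Build(IEnumerable<string[]> rows)
    {
        StringBuilder sb = new StringBuilder();
        foreach (string[] row in rows)
        {
            for (int i = 0; i < row.Length; i++)
            {
                if (i > 0) sb.Append(',');
                sb.Append(EscapeField(row[i]));
            }
            sb.Append("\r\n");
        }
        return sb.ToString();
    }
}
```

The resulting string can then be written with GBKEncoder.Write(s, csvText) in place of sb.ToString().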
Performance
The following code was designed to test the performance of the GBKEncoder class:
static void PerformanceTest()
{
    // Build a test string: 200 passes over the CJK ideograph range,
    // with a random ASCII character mixed in about one time in five.
    StringBuilder sb = new StringBuilder();
    Random r = new Random();
    for (int i = 0; i < 200; i++)
    {
        for (int u = 0x4E00; u < 0x9Fa0; u++)
        {
            sb.Append((char)u);
            if (r.Next(0, 5) == 0)
            {
                sb.Append((char)r.Next(32, 126));
            }
        }
    }
    string str = sb.ToString();
    Console.WriteLine("Total character count : {0}", str.Length);

    // Encode with GBKEncoder.
    HighPrecisionTimer timer = new HighPrecisionTimer();
    timer.Start();
    using (System.IO.FileStream fs = new System.IO.FileStream("test1.txt", System.IO.FileMode.Create))
    {
        GBKEncoder.Write(fs, str);
    }
    timer.Stop();
    timer.ShowDurationToConsole();

    // Encode with the .NET native GBK encoding.
    timer.Start();
    using (StreamWriter sw = new StreamWriter("test2.txt", false, Encoding.GetEncoding("GBK")))
    {
        sw.Write(str);
    }
    timer.Stop();
    timer.ShowDurationToConsole();

    // Decode with GBKEncoder.
    timer.Start();
    string str2 = "";
    using (FileStream fs = new FileStream("test1.txt", FileMode.Open))
    {
        str2 = GBKEncoder.Read(fs);
    }
    timer.Stop();
    timer.ShowDurationToConsole();

    // Decode with the .NET native GBK encoding.
    timer.Start();
    string str3 = File.ReadAllText("test2.txt", Encoding.GetEncoding("GBK"));
    timer.Stop();
    timer.ShowDurationToConsole();

    // Both round trips must reproduce the original string.
    if (str == str2 && str2 == str3)
    {
        Console.WriteLine("Success!");
    }
    else
    {
        Console.WriteLine("Error!!!");
    }
}
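The test relies on a HighPrecisionTimer helper whose source is not shown in this article. If you want to run the test yourself, a minimal stand-in could look like the sketch below. It assumes the timer simply wraps System.Diagnostics.Stopwatch and that Start() resets the elapsed time; the ElapsedMilliseconds property is an addition for illustration and may not exist in the original helper.

```csharp
using System;
using System.Diagnostics;

// Hypothetical stand-in for the HighPrecisionTimer used by PerformanceTest.
public class HighPrecisionTimer
{
    private readonly Stopwatch _watch = new Stopwatch();

    // Reset the elapsed time to zero and begin measuring,
    // so the same instance can time several phases in a row.
    public void Start()
    {
        _watch.Reset();
        _watch.Start();
    }

    // Stop measuring; the elapsed time stays available until the next Start().
    public void Stop()
    {
        _watch.Stop();
    }

    // Elapsed time of the last measured phase, in milliseconds.
    public double ElapsedMilliseconds
    {
        get { return _watch.Elapsed.TotalMilliseconds; }
    }

    // Print the elapsed time of the last measured phase.
    public void ShowDurationToConsole()
    {
        Console.WriteLine("Duration: {0:F1} ms", ElapsedMilliseconds);
    }
}
```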
Test environment: Windows Vista 32-bit, Intel Q6600 overclocked to 3.0 GHz, 4 GB RAM
Result:
Unit: milliseconds
Character count: 5014060
|             | Encode | Decode |
| GBKEncoder  | 39.9   | 62.4   |
| .NET Native | 75.6   | 63.6   |
Because the implementation of GBKEncoder is simple and straightforward and does not handle the complex cases that a general-purpose encoder must consider, its encoding speed in this test is better than that of the .NET native encoder, while its decoding speed is about the same.
GBKEncoder adds about 50 KB to your XAP file and consumes about 260 KB of memory at runtime.
Implementation Details
GBK is an extension of the GB2312 character set for Simplified Chinese. Each character is encoded as 1 or 2 bytes: 1 byte for standard ASCII characters, and 2 bytes for Chinese ideographs and punctuation characters.
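To make the 1-or-2-byte rule concrete, here is a small sketch that walks a GBK byte buffer and counts characters. The byte ranges (lead byte 0x81 to 0xFE, trail byte 0x40 to 0xFE excluding 0x7F) come from the GBK code layout rather than from this article, and the GbkByteScanner class is hypothetical and is not part of GBKEncoder.

```csharp
using System;

// Hypothetical helper illustrating the GBK byte layout.
public static class GbkByteScanner
{
    // The first byte of a two-byte GBK character lies in 0x81..0xFE.
    public static bool IsLeadByte(byte b)
    {
        return b >= 0x81 && b <= 0xFE;
    }

    // The second byte lies in 0x40..0xFE, excluding 0x7F.
    public static bool IsTrailByte(byte b)
    {
        return b >= 0x40 && b <= 0xFE && b != 0x7F;
    }

    // Count the characters in a GBK-encoded buffer.
    public static int CountCharacters(byte[] buffer)
    {
        int count = 0;
        int i = 0;
        while (i < buffer.Length)
        {
            bool twoByte = IsLeadByte(buffer[i])
                && i + 1 < buffer.Length
                && IsTrailByte(buffer[i + 1]);
            // ASCII bytes (and stray bytes) advance by one; valid pairs by two.
            i += twoByte ? 2 : 1;
            count++;
        }
        return count;
    }
}
```

For example, the six bytes 41 D6 D0 CE C4 42 hold four characters: "A", two ideographs, and "B".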
GBKEncoder is implemented using two arrays: sm_mapUnicode2GBK for the Unicode-to-GBK mapping and sm_mapGBK2Unicode for the GBK-to-Unicode mapping. These mappings are generated according to http://www.cs.nyu.edu/~yusuke/tools/unicode_to_gb2312_or_gbk_table.html. The value of sm_mapUnicode2GBK is hardcoded in the source code, and the value of sm_mapGBK2Unicode is generated from sm_mapUnicode2GBK at runtime.
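The two-array design can be sketched as follows. This is not the actual GBKEncoder source: it assumes the forward table stores two bytes per UTF-16 code unit (lead byte, or 0 for a single-byte character, followed by the trail byte), uses a toy table with only ASCII and two ideographs, and replaces unmapped characters with '?'. All names in it are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical sketch of a table-driven GBK encoder and decoder.
public static class MappingSketch
{
    // Forward table: [2u] = lead byte (0 if single-byte), [2u + 1] = trail or only byte.
    static readonly byte[] Unicode2Gbk = BuildToyTable();
    // Reverse table, derived from the forward table at startup.
    static readonly char[] Gbk2Unicode = BuildReverse(Unicode2Gbk);

    static byte[] BuildToyTable()
    {
        byte[] map = new byte[0x10000 * 2];
        for (int u = 0; u < 0x80; u++) map[u * 2 + 1] = (byte)u; // ASCII maps to itself
        map['中' * 2] = 0xD6; map['中' * 2 + 1] = 0xD0;           // U+4E2D -> D6 D0
        map['文' * 2] = 0xCE; map['文' * 2 + 1] = 0xC4;           // U+6587 -> CE C4
        return map;
    }

    static char[] BuildReverse(byte[] map)
    {
        char[] reverse = new char[0x10000];
        for (int i = 0; i < reverse.Length; i++) reverse[i] = '?'; // default for unknown codes
        for (int u = 0; u < 0x10000; u++)
        {
            int code = (map[u * 2] << 8) | map[u * 2 + 1];
            if (code != 0 || u == 0) reverse[code] = (char)u;      // skip unmapped entries
        }
        return reverse;
    }

    public static byte[] Encode(string s)
    {
        List<byte> bytes = new List<byte>(s.Length * 2);
        foreach (char c in s)
        {
            byte hi = Unicode2Gbk[c * 2], lo = Unicode2Gbk[c * 2 + 1];
            if (hi != 0) { bytes.Add(hi); bytes.Add(lo); }          // two-byte character
            else if (lo != 0 || c == '\0') bytes.Add(lo);           // single-byte character
            else bytes.Add((byte)'?');                              // unmapped (assumption)
        }
        return bytes.ToArray();
    }

    public static string Decode(byte[] buffer)
    {
        StringBuilder sb = new StringBuilder(buffer.Length);
        int i = 0;
        while (i < buffer.Length)
        {
            int code = buffer[i++];
            // A lead byte combines with the following byte into a 16-bit GBK code.
            if (code >= 0x81 && code <= 0xFE && i < buffer.Length)
                code = (code << 8) | buffer[i++];
            sb.Append(Gbk2Unicode[code]);
        }
        return sb.ToString();
    }
}
```

Deriving the reverse table at startup keeps only one table in the source file, at the cost of a one-time pass over the forward table.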
The following method generates the value of sm_mapUnicode2GBK:
private static void GenUnicode2GBKMapping()
{
    XmlDocument doc = new XmlDocument();
    doc.Load("Unicode2GBK.xml");
    int iCount = 0;
    // Two bytes per code point; 0xFFFF entries cover U+0000 through U+FFFE.
    byte[] sm_mapUnicode2GBK = new byte[0xFFFF * 2];
    foreach (XmlNode n in doc.DocumentElement.ChildNodes)
    {
        // A data row has 18 cells: the row's starting code point in cell 0
        // and the 16 characters of the row in cells 2 through 17.
        if (n.ChildNodes.Count == 18)
        {
            string strUnicode = n.ChildNodes[0].InnerText;
            if (strUnicode.Substring(0, 2) != "U+")
                throw new ApplicationException(
                    string.Format("{0} is not a valid Unicode code point", strUnicode));
            int u = int.Parse(strUnicode.Substring(2),
                System.Globalization.NumberStyles.HexNumber);
            for (int i = 2; i < 18; i++)
            {
                // Byte offset of code point (u + column) in the mapping array.
                int j = (i - 2 + u) * 2;
                foreach (XmlNode subNode in n.ChildNodes[i])
                {
                    if (subNode.Name.ToLower() == "small")
                    {
                        string str = subNode.InnerText.Trim().Trim('*');
                        if (str.Length == 2)
                        {
                            // Single-byte code: the lead byte is 0.
                            sm_mapUnicode2GBK[j] = 0;
                            sm_mapUnicode2GBK[j + 1] = byte.Parse(str,
                                System.Globalization.NumberStyles.HexNumber);
                            iCount++;
                        }
                        else if (str.Length == 4)
                        {
                            // Double-byte code: lead byte, then trail byte.
                            sm_mapUnicode2GBK[j] = byte.Parse(str.Substring(0, 2),
                                System.Globalization.NumberStyles.HexNumber);
                            sm_mapUnicode2GBK[j + 1] = byte.Parse(str.Substring(2),
                                System.Globalization.NumberStyles.HexNumber);
                            iCount++;
                        }
                        else
                        {
                            throw new ApplicationException(
                                string.Format("{0} is not a valid code", n.ChildNodes[i].OuterXml));
                        }
                    }
                }
            }
        }
    }
    Console.WriteLine("Converted {0} characters in total", iCount);

    // Emit the table as a C# array initializer, 16 values per line.
    StringBuilder sb = new StringBuilder();
    sb.AppendLine("static byte[] sm_mapUnicode2GBK = new byte[] {");
    for (int i = 0; i < sm_mapUnicode2GBK.Length; i++)
    {
        if (i != 0 && i % 16 == 0) sb.AppendLine();
        sb.Append(sm_mapUnicode2GBK[i]);
        if (i < sm_mapUnicode2GBK.Length - 1) sb.Append(", ");
    }
    sb.AppendLine("};");
    File.WriteAllText("sm_mapUnicode2GBK.cs", sb.ToString());
}
Unicode2GBK.xml is an XML file containing the Unicode-to-GBK mapping information, generated according to http://www.cs.nyu.edu/~yusuke/tools/unicode_to_gb2312_or_gbk_table.html.
History
- 2011-03-29 Initial version, encoding only
- 2011-03-31 Improved performance; supports both encoding and decoding