Updated tip after comments (alternative 4 below) of jocko3d that the test function was not correct due to the difference between the Chr() and ChrW() functions.
Introduction
In device communications, data is usually sent as bytes, and commands are stored as either byte arrays or strings. Strings make commands and responses easy to manipulate. For example, if you want to split incoming data on a terminator sequence (such as a CR LF pair), splitting a byte array requires some coding, while splitting a string is easily done with the String.Split method.
All in all, proper translation between the two types is required. From the title of the article you might think: that's easy, isn't that just what the encoders do? Well, no, they don't. The biggest issue is the encoding used: each encoding maps characters to codes, and if not all 256 byte values have a character defined, the round trip won't work (or might stop working in future .NET versions).
The goal: Transform a byte array into a string and back (and fast). Simple as that.
Using the Code
Writing a loop or using character arrays to do the transformation is rather slow. The .NET encoders are much faster, so that's where to look. If we're going to use the encoders, then we need an encoding that has a character representation for each of the 256 possible byte values in the array. The ASCII encoding, for example, fails because it only has 7 bits (0 - 127 decimal). If you use it, the encoder has no character for any value above 127 and inserts a '?' (question mark) character instead. Additionally, many encodings use multi-byte sequences to represent a single character, so grabbing the first 10 characters of a string might not give the same result as grabbing the first 10 bytes of an array.
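To make the ASCII problem concrete, here is a minimal sketch that sends three arbitrary bytes through the ASCII encoder and back. The module name and the sample bytes are made up for illustration. On the .NET Framework this article targets, the two bytes above 127 are expected to come back as 3F (the '?' character).

```vbnet
Imports System
Imports System.Text

Module AsciiLossDemo
    Sub Main()
        ' Three raw bytes; 200 and 255 are outside the 7-bit ASCII range.
        Dim data() As Byte = {65, 200, 255}

        ' Decode to a string and encode back, both with the ASCII encoder.
        Dim decoded As String = Encoding.ASCII.GetString(data)
        Dim back() As Byte = Encoding.ASCII.GetBytes(decoded)

        ' Print both arrays as hex so the lost values are visible.
        Console.WriteLine("in : " & BitConverter.ToString(data))
        Console.WriteLine("out: " & BitConverter.ToString(back))
    End Sub
End Module
```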
From the documentation, the Latin1 (or ISO 8859-1) set should be a good one, but if you look closer (see this article), the ISO 8859-1 standard itself leaves the control characters (00 to 1F and 7F to 9F in hex) undefined. The character set named ISO-8859-1 (mind the extra dash!) adds the control characters to this set, and this is the set that .NET exposes as codepage 28591. UTF8 is often thought to be the solution, and it may be so for text manipulation, but not for the purpose of manipulating bytes using string methods (because UTF8 is a multi-byte encoding).
Testing the .NET Supported Codepages
By running a simple test, you can check which codepages support the translation. Using the MSDN page with encodings supported, I created the following code, which tests them all and displays the results. The list of codepages from the MSDN page:
Dim list() As Integer = {37, 437, 500, 708, 720, 737, 775, 850, 852, 855, 857, _
858, 860, 861, 862, 863, 864, 865, 866, 869, 870, 874, 875, 932, 936, _
949, 950, 1026, 1047, 1140, 1141, 1142, 1143, 1144, 1145, 1146, 1147, _
1148, 1149, 1200, 1201, 1250, 1251, 1252, 1253, 1254, 1255, 1256, 1257, _
1258, 1361, 10000, 10001, 10002, 10003, 10004, 10005, 10006, 10007, 10008, _
10010, 10017, 10021, 10029, 10079, 10081, 10082, 12000, 12001, 20000, _
20001, 20002, 20003, 20004, 20005, 20105, 20106, 20107, 20108, 20127, _
20261, 20269, 20273, 20277, 20278, 20280, 20284, 20285, 20290, 20297, _
20420, 20423, 20424, 20833, 20838, 20866, 20871, 20880, 20905, 20924, _
20932, 20936, 20949, 21025, 21866, 28591, 28592, 28593, 28594, 28595, _
28596, 28597, 28598, 28599, 28603, 28605, 29001, 38598, 50220, 50221, _
50222, 50225, 50227, 51932, 51936, 51949, 52936, 54936, 57002, 57003, _
57004, 57005, 57006, 57007, 57008, 57009, 57010, 57011, 65000, 65001}
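To run the test, the following test data is created:
' Shared test data: bytes 0..255 and the matching Chr()/ChrW() strings
Dim expected(255) As Byte
Dim strChr As String = ""
Dim strChrW As String = ""

For n As Integer = 0 To 255
    expected(n) = CByte(n)
    strChr = strChr & Chr(n)     ' decoded via the thread's codepage
    strChrW = strChrW & ChrW(n)  ' Unicode code point n
Next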
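The difference between Chr() and ChrW() is that the former decodes the value using the codepage of the current thread (which is usually defined by the system locale), whereas the latter always returns the character with that Unicode code point. For the values 0 to 255, those code points are exactly the characters of codepage 28591 (ISO-8859-1).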
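A small sketch of that difference, using value 128 as an arbitrary example. The first line is locale-independent. The second depends on the thread's codepage: on a Windows-1252 system, byte 128 is the euro sign, so a value other than 128 is expected there, but treat that as an assumption about your machine.

```vbnet
Imports System
Imports Microsoft.VisualBasic

Module ChrVersusChrW
    Sub Main()
        ' ChrW(n) is the character whose Unicode code point is n, on any system.
        Console.WriteLine("ChrW(128) -> code point " & AscW(ChrW(128)))

        ' Chr(n) decodes byte n with the thread's ANSI codepage, so the
        ' resulting code point depends on the system locale.
        Console.WriteLine("Chr(128)  -> code point " & AscW(Chr(128)))
    End Sub
End Module
```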
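Here's the actual testing code:
Public Sub TestConversion(ByVal enc As System.Text.Encoding)
    ' Test 1: bytes -> string -> bytes round trip
    arrayequal(enc.GetBytes(enc.GetString(expected)))
    ' Test 2: string built with Chr() -> bytes
    arrayequal(enc.GetBytes(strChr))
    ' Test 3: string built with ChrW() -> bytes
    arrayequal(enc.GetBytes(strChrW))
    ' Length check: multi-byte encodings yield fewer than 256 characters
    If enc.GetString(expected).Length <> 256 Then
        Console.Write(" Length failure test 1 (" & enc.GetString(expected).Length & ")")
    End If
End Sub

Public Function arrayequal(ByVal result() As Byte) As Boolean
    ' A shorter result can never match (and would overrun the loop below)
    If result.Length < expected.Length Then
        Console.Write(" - |")
        Return False
    End If
    For n As Integer = 0 To 255
        If result(n) <> expected(n) Then
            Console.Write(" - |")
            Return False
        End If
    Next
    Console.Write(" Ok |")
    Return True
End Function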
The code runs 3 tests:
1. Convert a byte array, with 0 - 255 as bytes, to a string, then convert it back to a byte array. Verify that the byte array OUT is identical to the byte array IN. Finally there is a length test, to determine whether the generated intermediate string actually has the same length as the byte array (256 bytes). This will catch multi-byte encodings that return 1 character for multiple bytes.
2. Create a string with chars 0 - 255, using the Chr() function, and convert it to a byte array. Verify that the byte values in the array match the expected 0 - 255 values.
3. Create a string with chars 0 - 255, using the ChrW() function, and convert it to a byte array. Verify that the byte values in the array match the expected 0 - 255 values.
The difference between tests 2 and 3 is small but important: Chr() depends on the current thread's codepage, while ChrW() does not.
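The loop that feeds each codepage into the test is not shown above. The following is a minimal, self-contained sketch of how such a sweep could look, not the code from the downloadable project. It only performs the round trip of test 1, uses a short hypothetical subset of the codepage list, and skips codepages that the runtime does not provide.

```vbnet
Imports System
Imports System.Linq
Imports System.Text

Module CodepageSweep
    Sub Main()
        ' A short, hypothetical subset of the full codepage list.
        Dim codepages() As Integer = {437, 1252, 20127, 28591, 65001}

        ' Bytes 0..255 in order, as in the test data.
        Dim expected(255) As Byte
        For n As Integer = 0 To 255
            expected(n) = CByte(n)
        Next

        For Each cp As Integer In codepages
            Try
                Dim enc As Encoding = Encoding.GetEncoding(cp)
                Dim decoded As String = enc.GetString(expected)
                ' Round trip: the re-encoded bytes must equal the input.
                Dim roundTrip As Boolean = enc.GetBytes(decoded).SequenceEqual(expected)
                Console.WriteLine("{0,6} | {1,-5} | length {2}", cp, roundTrip, decoded.Length)
            Catch ex As Exception When TypeOf ex Is ArgumentException OrElse TypeOf ex Is NotSupportedException
                ' GetEncoding rejects codepages that are not installed or registered.
                Console.WriteLine("{0,6} | not available on this system", cp)
            End Try
        Next
    End Sub
End Module
```

Catching the exception per codepage keeps one unsupported entry from aborting the whole sweep, which matters because the set of available codepages differs between systems and runtimes.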
If you run the test code, this will be the result (on my system)
+----------+--------+--------+--------+
| Codepage | Test 1 | Test 2 | Test 3 |
+----------+--------+--------+--------+
| 37 | Ok | - | - |
| 437 | Ok | - | - |
| 500 | Ok | - | - |
| 708 | Ok | - | - |
| 720 | Ok | - | - |
| 737 | Ok | - | - |
| 775 | Ok | - | - |
| 850 | Ok | - | - |
| 852 | Ok | - | - |
| 855 | Ok | - | - |
| 857 | Ok | - | - |
| 858 | Ok | - | - |
| 860 | Ok | - | - |
| 861 | Ok | - | - |
| 862 | Ok | - | - |
| 863 | Ok | - | - |
| 864 | Ok | - | - |
| 865 | Ok | - | - |
| 866 | Ok | - | - |
| 869 | Ok | - | - |
| 870 | Ok | - | - |
| 874 | Ok | - | - |
| 875 | - | - | - |
| 932 | - | - | - | Length failure test 1 (225)
| 936 | Ok | - | - | Length failure test 1 (193)
| 949 | - | - | - | Length failure test 1 (193)
| 950 | - | - | - | Length failure test 1 (193)
| 1026 | Ok | - | - |
| 1047 | Ok | - | - |
| 1140 | Ok | - | - |
| 1141 | Ok | - | - |
| 1142 | Ok | - | - |
| 1143 | Ok | - | - |
| 1144 | Ok | - | - |
| 1145 | Ok | - | - |
| 1146 | Ok | - | - |
| 1147 | Ok | - | - |
| 1148 | Ok | - | - |
| 1149 | Ok | - | - |
| 1200 | - | - | - | Length failure test 1 (128)
| 1201 | - | - | - | Length failure test 1 (128)
| 1250 | Ok | - | - |
| 1251 | Ok | - | - |
| 1252 | Ok | Ok | - | Thread codepage.
| 1253 | Ok | - | - |
| 1254 | Ok | - | - |
| 1255 | Ok | - | - |
| 1256 | Ok | - | - |
| 1257 | Ok | - | - |
| 1258 | Ok | - | - |
| 1361 | - | - | - | Length failure test 1 (199)
| 10000 | Ok | - | - |
| 10001 | - | - | - | Length failure test 1 (225)
| 10002 | - | - | - | Length failure test 1 (194)
| 10003 | Ok | - | - | Length failure test 1 (211)
| 10004 | Ok | - | - |
| 10005 | - | - | - |
| 10006 | Ok | - | - |
| 10007 | Ok | - | - |
| 10008 | - | - | - | Length failure test 1 (215)
| 10010 | Ok | - | - |
| 10017 | Ok | - | - |
| 10021 | - | - | - |
| 10029 | Ok | - | - |
| 10079 | Ok | - | - |
| 10081 | Ok | - | - |
| 10082 | Ok | - | - |
| 12000 | - | - | - | Length failure test 1 (64)
| 12001 | - | - | - | Length failure test 1 (65)
| 20000 | - | - | - | Length failure test 1 (209)
| 20001 | - | - | - | Length failure test 1 (203)
| 20002 | - | - | - | Length failure test 1 (215)
| 20003 | - | - | - | Length failure test 1 (203)
| 20004 | - | - | - | Length failure test 1 (209)
| 20005 | - | - | - | Length failure test 1 (201)
| 20105 | - | - | - |
| 20106 | - | - | - |
| 20107 | - | - | - |
| 20108 | - | - | - |
| 20127 | - | - | - |
| 20261 | - | - | - | Length failure test 1 (248)
| 20269 | - | - | - |
| 20273 | Ok | - | - |
| 20277 | Ok | - | - |
| 20278 | Ok | - | - |
| 20280 | Ok | - | - |
| 20284 | Ok | - | - |
| 20285 | Ok | - | - |
| 20290 | - | - | - |
| 20297 | Ok | - | - |
| 20420 | - | - | - |
| 20423 | - | - | - |
| 20424 | - | - | - |
| 20833 | - | - | - |
| 20838 | - | - | - |
| 20866 | Ok | - | - |
| 20871 | Ok | - | - |
| 20880 | Ok | - | - |
| 20905 | - | - | - |
| 20924 | - | - | - |
| 20932 | - | - | - | Length failure test 1 (208)
| 20936 | - | - | - | Length failure test 1 (215)
| 20949 | Ok | - | - | Length failure test 1 (211)
| 21025 | Ok | - | - |
| 21866 | Ok | - | - |
| 28591 | Ok | - | Ok |
| 28592 | Ok | - | - |
| 28593 | Ok | - | - |
| 28594 | Ok | - | - |
| 28595 | Ok | - | - |
| 28596 | Ok | - | - |
| 28597 | Ok | - | - |
| 28598 | Ok | - | - |
| 28599 | Ok | - | - |
| 28603 | Ok | - | - |
| 28605 | Ok | - | - |
| 29001 | Ok | - | - |
| 38598 | Ok | - | - |
| 50220 | - | - | - | Length failure test 1 (254)
| 50221 | - | - | - | Length failure test 1 (254)
| 50222 | - | - | - | Length failure test 1 (254)
| 50225 | - | - | - | Length failure test 1 (254)
| 50227 | Ok | - | - | Length failure test 1 (193)
| 51932 | - | - | - | Length failure test 1 (208)
| 51936 | Ok | - | - | Length failure test 1 (193)
| 51949 | Ok | - | - | Length failure test 1 (211)
| 52936 | - | - | - |
| 54936 | - | - | - | Length failure test 1 (193)
| 57002 | - | - | - |
| 57003 | - | - | - |
| 57004 | - | - | - |
| 57005 | - | - | - |
| 57006 | - | - | - |
| 57007 | - | - | - |
| 57008 | - | - | - |
| 57009 | - | - | - |
| 57010 | - | - | - |
| 57011 | - | - | - |
| 65000 | - | - | - | Length failure test 1 (255)
| 65001 | - | - | - |
+----------+--------+--------+--------+
Press any key...
Anything that fails Test 1 does not have a complete mapping for byte values 0 through 255. These codepages are useless for our purpose: using any of them will result in data loss, unless the application you're building never uses the missing characters/bytes. Even among those marked 'Ok', some still fail the length test (last column; the number shown is the length of the resulting string, which should have been 256) because they are multi-byte encodings.
Test 2 only passes for the one codepage that is the current thread's codepage (as shown in the last column), whereas Test 3 only passes on codepage 28591 (ISO-8859-1), whose 256 characters coincide with Unicode code points 0 through 255.
Conclusion
If you only use encoders to transform strings/characters to bytes and back, then any encoding that passes Test 1 will do the job (but see the remark on the length test under Points of Interest below).
If you also use the Chr(), ChrW(), Asc() and AscW() functions, then you should be more careful. Chr() and Asc() should only be used if you can fix/set the current thread's codepage, and that still requires a codepage that passes Test 1.
Probably the best solution is to use codepage 28591 (ISO-8859-1), which maps every byte value to the Unicode code point with the same number, combined with ChrW() and AscW().
The tricky part of the last two options is that you cannot mix Chr()/Asc() with ChrW()/AscW(): if you pick one pair, you can't use the other.
A generic utility class to do the conversion looks like this:
Public Class Utility
    ' Codepage 28591 (ISO-8859-1) maps byte n to Unicode code point n
    Private Shared ReadOnly Latin1 As Text.Encoding = Text.Encoding.GetEncoding(28591)

    Public Shared Function String2Bytes(ByVal str As String) As Byte()
        Return Latin1.GetBytes(str)
    End Function

    Public Shared Function Bytes2String(ByVal bytes As Byte()) As String
        Return Latin1.GetString(bytes)
    End Function

    Public Shared Function Byte2Char(ByVal b As Byte) As Char
        Return ChrW(b)
    End Function

    Public Shared Function Char2Byte(ByVal c As Char) As Byte
        ' CByte makes the narrowing explicit (required under Option Strict On)
        Return CByte(AscW(c))
    End Function
End Class
This will always work, just don't mix it with the Chr() and Asc() functions.
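As a usage sketch, here is the CR LF splitting scenario from the introduction. It calls the 28591 encoder directly so that the example is self-contained. The sample bytes are a made-up device response, not data from a real protocol.

```vbnet
Imports System
Imports System.Text
Imports Microsoft.VisualBasic

Module SplitDemo
    Sub Main()
        Dim latin1 As Encoding = Encoding.GetEncoding(28591)

        ' Hypothetical device response: two frames, each terminated by CR LF.
        Dim incoming() As Byte = {&H1, &HFF, 13, 10, &H80, &H7F, 13, 10}

        ' Bytes -> string, split on the terminator, then string -> bytes per frame.
        Dim asText As String = latin1.GetString(incoming)
        Dim frames() As String = asText.Split(New String() {vbCrLf}, StringSplitOptions.RemoveEmptyEntries)

        For Each frame As String In frames
            ' Prints 01-FF, then 80-7F
            Console.WriteLine(BitConverter.ToString(latin1.GetBytes(frame)))
        Next
    End Sub
End Module
```

Note that the byte values above 127 survive the trip through the string unchanged, which is the whole point of picking this codepage.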
Points of Interest
Passing the length test does not prove that the codepage is OK, because the sequence tested is a sequential 0 to 255. A byte sequence such as 100, 120, 32 is not in the test set and might produce fewer than 3 characters with a multi-byte encoding.
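One way to narrow that gap is to test every possible two-byte sequence instead of a single ascending run. The sketch below is an addition of mine, not part of the original test project, and the function name SurvivesAllPairs is made up. It still does not cover longer sequences, so it strengthens the test rather than proving anything.

```vbnet
Imports System
Imports System.Text

Module PairTest
    ' Returns True if every two-byte sequence decodes to exactly two
    ' characters and encodes back to the same two bytes.
    Function SurvivesAllPairs(ByVal enc As Encoding) As Boolean
        Dim pair(1) As Byte
        For hi As Integer = 0 To 255
            For lo As Integer = 0 To 255
                pair(0) = CByte(hi)
                pair(1) = CByte(lo)
                Dim decoded As String = enc.GetString(pair)
                If decoded.Length <> 2 Then Return False
                Dim back() As Byte = enc.GetBytes(decoded)
                If back.Length <> 2 OrElse back(0) <> pair(0) OrElse back(1) <> pair(1) Then Return False
            Next
        Next
        Return True
    End Function

    Sub Main()
        ' Codepage 28591 is single-byte: prints True
        Console.WriteLine(SurvivesAllPairs(Encoding.GetEncoding(28591)))
        ' Codepage 1200 (UTF-16) turns two bytes into one character: prints False
        Console.WriteLine(SurvivesAllPairs(Encoding.Unicode))
    End Sub
End Module
```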
Doesn't UTF8 do bidirectional and lossless transformations? No, it does not. UTF8 is codepage 65001 (see the mentioned MSDN link) and failed the test.
The common mistake here is to think of UTF8 as the all-encompassing encoding. It does work when you take a string with a piece of text, encode it to a byte array and decode it back. But in this case the byte array was not created by the encoder; it's just a bunch of data that I want to convert to a string for easy handling. There is no guarantee that this set of bytes is valid UTF8. The test proved the point.
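A minimal sketch of that failure, with three arbitrary sample bytes. Byte 80 (hex) on its own is not a valid UTF8 sequence, so the decoder substitutes the Unicode replacement character, which encodes back as three different bytes.

```vbnet
Imports System
Imports System.Text

Module Utf8LossDemo
    Sub Main()
        ' &H80 is a lone continuation byte and therefore invalid UTF8.
        Dim data() As Byte = {&H41, &H80, &H42}

        Dim decoded As String = Encoding.UTF8.GetString(data)
        Dim back() As Byte = Encoding.UTF8.GetBytes(decoded)

        Console.WriteLine(BitConverter.ToString(data))  ' 41-80-42
        Console.WriteLine(BitConverter.ToString(back))  ' 41-EF-BF-BD-42
    End Sub
End Module
```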
Some of the codepages passing Test 1 still have unmapped characters; the Windows-1252 codepage, for example, still has 5 characters unmapped. So if in any future version of .NET the encoders get stricter in their behaviour (as they did in the migration from 1.1 to 2.0), then all of a sudden these 5 unmapped characters might also be translated into '?' (question mark) characters and the 1-on-1 conversion will fail. That's something to be aware of when picking any of these codepages.
Source
Based on this discussion and alternative 4 below.
The source code (VS2010 project) is available as well.