String to byte conversion using .NET encoders (8bit 1-on-1)

Tieske8

29 Feb 2012 | CPOL | 6 min read

Fast conversion between strings and byte arrays, while bypassing the codepage limitations

Tip updated after comments by jocko3d (see alternative 4 below): the test function was not correct because of the difference between the Chr() and ChrW() functions.

Introduction

In device communications, data is usually sent as bytes, and commands are stored as either byte arrays or strings. Strings are used because they make commands and responses easy to manipulate. If you want to split incoming data on a terminator sequence (for example a CR LF pair), splitting a byte array requires some coding, while splitting a string is easily done with the String.Split method.
All in all, a proper translation between the two types is required. From the title of the article you might think that's easy, isn't that just what the encoders do? Well, no, they don't. The biggest issue is the encoding used: each encoding maps characters to codes, and if not all 256 codes have been defined, the conversion won't work (or might stop working in future .NET versions).
The goal: Transform a byte array into a string and back (and fast). Simple as that.
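As a minimal sketch of the string-based splitting described above: the bytes take a detour through a string so that String.Split can be used. It assumes codepage 28591 (ISO-8859-1), the one the tests further down single out; the module name, the function name and the sample data are made up for illustration.

```vbnet
Imports System
Imports System.Text
Imports Microsoft.VisualBasic

Module SplitDemo
    ' Splits raw device data on CR LF by converting it to a string first.
    ' Codepage 28591 is an assumption; any codepage that maps all 256
    ' byte values one-to-one would behave the same way.
    Function SplitOnCrLf(ByVal data As Byte()) As String()
        Dim s As String = Encoding.GetEncoding(28591).GetString(data)
        Return s.Split(New String() {vbCrLf}, StringSplitOptions.None)
    End Function

    Sub Main()
        ' hypothetical response: "OK" CR LF, one data byte &HFF, CR LF, "END"
        Dim data As Byte() = {&H4F, &H4B, 13, 10, &HFF, 13, 10, &H45, &H4E, &H44}
        For Each part As String In SplitOnCrLf(data)
            Console.WriteLine(part.Length)   ' prints 2, 1 and 3
        Next
    End Sub
End Module
```

The middle part has length 1 because the byte &HFF became exactly one character, which is the property the rest of the article is looking for.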

Using the Code

Writing a loop or using character arrays to do the transformation is rather slow. The .NET encoders are much faster, so that's where to look. If we're going to use the encoders, we need an encoding that has a character representation for each of the 256 possible byte values in the array. The ASCII encoding, for example, fails because it only has 7 bits (0 - 127 dec). If you use it, the encoder has no character for any value above 127 and inserts a '?' (question mark) character instead. Additionally, many encodings use multi-byte sequences to represent a single character, so grabbing the first 10 characters of a string might not give the same result as grabbing the first 10 bytes of an array.
From the documentation, Latin-1 (ISO 8859-1) looks like a good candidate, but if you look closer (see this article), the ISO 8859-1 standard itself does not define the control characters (00 to 1F and 7F to 9F in hex). The closely related character set ISO-8859-1 (mind the extra dash!) adds the control characters to this set, and that is the set .NET exposes as codepage 28591, as the test results below confirm. UTF8 is often thought to be the solution, and it may be so for text manipulation, but not for the purpose of manipulating bytes using string methods (due to UTF8 being a multi-byte encoding).
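The two pitfalls mentioned above can be made visible with a short sketch. The module and function names are illustrative only; the encodings used are the standard Encoding.ASCII and Encoding.UTF8 properties.

```vbnet
Imports System
Imports System.Text
Imports Microsoft.VisualBasic

Module EncodingPitfalls
    ' Round-trips a single byte through the given encoding: byte --> string --> byte()
    Function RoundTrip(ByVal enc As Encoding, ByVal value As Byte) As Byte()
        Return enc.GetBytes(enc.GetString(New Byte() {value}))
    End Function

    Sub Main()
        ' ASCII has no character for 200, so the byte comes back as 63 ('?')
        Console.WriteLine(RoundTrip(Encoding.ASCII, 200)(0))
        ' UTF-8 needs two bytes for the single character U+00E9,
        ' so character positions and byte positions no longer line up
        Console.WriteLine(Encoding.UTF8.GetBytes(ChrW(233).ToString()).Length)
    End Sub
End Module
```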

Testing the .NET Supported Codepages

By running a simple test, you can check which codepages support the translation. Using the MSDN page listing the supported encodings, I created the following code, which tests them all and displays the results. The list of codepages from the MSDN page:

' list of codepages to test
Dim list() As Integer = {37, 437, 500, 708, 720, 737, 775, 850, 852, 855, 857, _
        858, 860, 861, 862, 863, 864, 865, 866, 869, 870, 874, 875, 932, 936, _
        949, 950, 1026, 1047, 1140, 1141, 1142, 1143, 1144, 1145, 1146, 1147, _
        1148, 1149, 1200, 1201, 1250, 1251, 1252, 1253, 1254, 1255, 1256, 1257, _
        1258, 1361, 10000, 10001, 10002, 10003, 10004, 10005, 10006, 10007, 10008, _
        10010, 10017, 10021, 10029, 10079, 10081, 10082, 12000, 12001, 20000, _
        20001, 20002, 20003, 20004, 20005, 20105, 20106, 20107, 20108, 20127, _
        20261, 20269, 20273, 20277, 20278, 20280, 20284, 20285, 20290, 20297, _
        20420, 20423, 20424, 20833, 20838, 20866, 20871, 20880, 20905, 20924, _
        20932, 20936, 20949, 21025, 21866, 28591, 28592, 28593, 28594, 28595, _
        28596, 28597, 28598, 28599, 28603, 28605, 29001, 38598, 50220, 50221, _
        50222, 50225, 50227, 51932, 51936, 51949, 52936, 54936, 57002, 57003, _
        57004, 57005, 57006, 57007, 57008, 57009, 57010, 57011, 65000, 65001}

To run the test, the following test data is created:

' initialize values
For n As Integer = 0 To 255
    expected(n) = CByte(n)
    strChr = strChr & Chr(n)
    strChrW = strChrW & ChrW(n)
Next

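The difference between Chr() and ChrW() is that the former uses the codepage of the current thread (which is usually defined by the system locale), while the latter always returns the character with the given Unicode code point. For the values 0 to 255, those code points are the same as the characters of codepage 28591 (ISO-8859-1).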
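A small sketch to see the difference on your own machine. The first line is the same everywhere; the second depends on the thread codepage, so the value in its comment is only an example for a Windows-1252 system running the .NET Framework, not a guaranteed output.

```vbnet
Imports System
Imports Microsoft.VisualBasic

Module ChrVersusChrW
    Sub Main()
        ' ChrW(n) is always the character with Unicode code point n
        Console.WriteLine(AscW(ChrW(128)))   ' 128 on every system
        ' Chr(n) decodes n with the current thread's codepage, so for n above 127
        ' the resulting code point depends on the system locale
        Console.WriteLine(AscW(Chr(128)))    ' for example 8364 (euro sign) under Windows-1252
    End Sub
End Module
```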

Here's the actual testing code:

Public Sub TestConversion(ByVal enc As System.Text.Encoding)
    ' Test 1; decode and encode byte array, then compare;  byte() --> string --> byte
    arrayequal(enc.GetBytes(enc.GetString(expected)))
    ' Test 2; Chr() build string
    arrayequal(enc.GetBytes(strChr))
    ' Test 3; ChrW() build string
    arrayequal(enc.GetBytes(strChrW))
    ' check string length
    If enc.GetString(expected).Length <> 256 Then
        Console.Write("  Length failure test 1 (" & enc.GetString(expected).Length & ")")
    End If
End Sub

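Public Function arrayequal(ByVal result() As Byte) As Boolean
    ' a result of a different length can never match (a shorter one would overrun the loop)
    If result.Length <> expected.Length Then
        Console.Write("   -    |")
        Return False
    End If
    For n As Integer = 0 To 255
        If result(n) <> expected(n) Then
            ' Test failed
            Console.Write("   -    |")
            Return False
        End If
    Next
    ' Test passed
    Console.Write("   Ok   |")
    Return True
End Function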

The code runs 3 tests:

1. Convert a byte array, with 0 - 255 as bytes, to a string, then convert it back to a byte array. Verify that the byte array OUT is identical to the byte array IN. Finally there is a length test, to determine if the generated intermediate string actually has the same length as the byte array (256 bytes). This will catch multi-byte encodings that return 1 character for multiple bytes.
2. Create a string with chars 0 - 255, using the Chr() function, and convert it to a byte array. Verify that the byte values in the array match the expected 0 - 255 values.
3. Create a string with chars 0 - 255, using the ChrW() function, and convert it to a byte array. Verify that the byte values in the array match the expected 0 - 255 values.

The difference between 2 and 3 is small but important: Chr() depends on the codepage of the current thread, while ChrW() does not.
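For readers who want to try Test 1 without the full project, here is a condensed, self-contained sketch of it, including the length check. It is not the author's test code; the module and function names are made up, and it only tries three encodings that are available through the standard Encoding class.

```vbnet
Imports System
Imports System.Text

Module RoundTripCheck
    ' Condensed Test 1 plus the length check:
    ' byte() --> string --> byte() must return the input, with one character per byte.
    Function IsOneToOne(ByVal enc As Encoding) As Boolean
        Dim original(255) As Byte
        For n As Integer = 0 To 255
            original(n) = CByte(n)
        Next
        Dim s As String = enc.GetString(original)
        If s.Length <> original.Length Then Return False
        Dim roundTripped As Byte() = enc.GetBytes(s)
        If roundTripped.Length <> original.Length Then Return False
        For n As Integer = 0 To 255
            If roundTripped(n) <> original(n) Then Return False
        Next
        Return True
    End Function

    Sub Main()
        Console.WriteLine(IsOneToOne(Encoding.GetEncoding(28591)))  ' True  (ISO-8859-1)
        Console.WriteLine(IsOneToOne(Encoding.ASCII))               ' False (codepage 20127)
        Console.WriteLine(IsOneToOne(Encoding.UTF8))                ' False (codepage 65001)
    End Sub
End Module
```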

If you run the test code, this will be the result (on my system):

+----------+--------+--------+--------+
| Codepage | Test 1 | Test 2 | Test 3 |
+----------+--------+--------+--------+
|      37  |   Ok   |   -    |   -    |
|     437  |   Ok   |   -    |   -    |
|     500  |   Ok   |   -    |   -    |
|     708  |   Ok   |   -    |   -    |
|     720  |   Ok   |   -    |   -    |
|     737  |   Ok   |   -    |   -    |
|     775  |   Ok   |   -    |   -    |
|     850  |   Ok   |   -    |   -    |
|     852  |   Ok   |   -    |   -    |
|     855  |   Ok   |   -    |   -    |
|     857  |   Ok   |   -    |   -    |
|     858  |   Ok   |   -    |   -    |
|     860  |   Ok   |   -    |   -    |
|     861  |   Ok   |   -    |   -    |
|     862  |   Ok   |   -    |   -    |
|     863  |   Ok   |   -    |   -    |
|     864  |   Ok   |   -    |   -    |
|     865  |   Ok   |   -    |   -    |
|     866  |   Ok   |   -    |   -    |
|     869  |   Ok   |   -    |   -    |
|     870  |   Ok   |   -    |   -    |
|     874  |   Ok   |   -    |   -    |
|     875  |   -    |   -    |   -    |
|     932  |   -    |   -    |   -    |  Length failure test 1 (225)
|     936  |   Ok   |   -    |   -    |  Length failure test 1 (193)
|     949  |   -    |   -    |   -    |  Length failure test 1 (193)
|     950  |   -    |   -    |   -    |  Length failure test 1 (193)
|    1026  |   Ok   |   -    |   -    |
|    1047  |   Ok   |   -    |   -    |
|    1140  |   Ok   |   -    |   -    |
|    1141  |   Ok   |   -    |   -    |
|    1142  |   Ok   |   -    |   -    |
|    1143  |   Ok   |   -    |   -    |
|    1144  |   Ok   |   -    |   -    |
|    1145  |   Ok   |   -    |   -    |
|    1146  |   Ok   |   -    |   -    |
|    1147  |   Ok   |   -    |   -    |
|    1148  |   Ok   |   -    |   -    |
|    1149  |   Ok   |   -    |   -    |
|    1200  |   -    |   -    |   -    |  Length failure test 1 (128)
|    1201  |   -    |   -    |   -    |  Length failure test 1 (128)
|    1250  |   Ok   |   -    |   -    |
|    1251  |   Ok   |   -    |   -    |
|    1252  |   Ok   |   Ok   |   -    |  Thread codepage.
|    1253  |   Ok   |   -    |   -    |
|    1254  |   Ok   |   -    |   -    |
|    1255  |   Ok   |   -    |   -    |
|    1256  |   Ok   |   -    |   -    |
|    1257  |   Ok   |   -    |   -    |
|    1258  |   Ok   |   -    |   -    |
|    1361  |   -    |   -    |   -    |  Length failure test 1 (199)
|   10000  |   Ok   |   -    |   -    |
|   10001  |   -    |   -    |   -    |  Length failure test 1 (225)
|   10002  |   -    |   -    |   -    |  Length failure test 1 (194)
|   10003  |   Ok   |   -    |   -    |  Length failure test 1 (211)
|   10004  |   Ok   |   -    |   -    |
|   10005  |   -    |   -    |   -    |
|   10006  |   Ok   |   -    |   -    |
|   10007  |   Ok   |   -    |   -    |
|   10008  |   -    |   -    |   -    |  Length failure test 1 (215)
|   10010  |   Ok   |   -    |   -    |
|   10017  |   Ok   |   -    |   -    |
|   10021  |   -    |   -    |   -    |
|   10029  |   Ok   |   -    |   -    |
|   10079  |   Ok   |   -    |   -    |
|   10081  |   Ok   |   -    |   -    |
|   10082  |   Ok   |   -    |   -    |
|   12000  |   -    |   -    |   -    |  Length failure test 1 (64)
|   12001  |   -    |   -    |   -    |  Length failure test 1 (65)
|   20000  |   -    |   -    |   -    |  Length failure test 1 (209)
|   20001  |   -    |   -    |   -    |  Length failure test 1 (203)
|   20002  |   -    |   -    |   -    |  Length failure test 1 (215)
|   20003  |   -    |   -    |   -    |  Length failure test 1 (203)
|   20004  |   -    |   -    |   -    |  Length failure test 1 (209)
|   20005  |   -    |   -    |   -    |  Length failure test 1 (201)
|   20105  |   -    |   -    |   -    |
|   20106  |   -    |   -    |   -    |
|   20107  |   -    |   -    |   -    |
|   20108  |   -    |   -    |   -    |
|   20127  |   -    |   -    |   -    |
|   20261  |   -    |   -    |   -    |  Length failure test 1 (248)
|   20269  |   -    |   -    |   -    |
|   20273  |   Ok   |   -    |   -    |
|   20277  |   Ok   |   -    |   -    |
|   20278  |   Ok   |   -    |   -    |
|   20280  |   Ok   |   -    |   -    |
|   20284  |   Ok   |   -    |   -    |
|   20285  |   Ok   |   -    |   -    |
|   20290  |   -    |   -    |   -    |
|   20297  |   Ok   |   -    |   -    |
|   20420  |   -    |   -    |   -    |
|   20423  |   -    |   -    |   -    |
|   20424  |   -    |   -    |   -    |
|   20833  |   -    |   -    |   -    |
|   20838  |   -    |   -    |   -    |
|   20866  |   Ok   |   -    |   -    |
|   20871  |   Ok   |   -    |   -    |
|   20880  |   Ok   |   -    |   -    |
|   20905  |   -    |   -    |   -    |
|   20924  |   -    |   -    |   -    |
|   20932  |   -    |   -    |   -    |  Length failure test 1 (208)
|   20936  |   -    |   -    |   -    |  Length failure test 1 (215)
|   20949  |   Ok   |   -    |   -    |  Length failure test 1 (211)
|   21025  |   Ok   |   -    |   -    |
|   21866  |   Ok   |   -    |   -    |
|   28591  |   Ok   |   -    |   Ok   |
|   28592  |   Ok   |   -    |   -    |
|   28593  |   Ok   |   -    |   -    |
|   28594  |   Ok   |   -    |   -    |
|   28595  |   Ok   |   -    |   -    |
|   28596  |   Ok   |   -    |   -    |
|   28597  |   Ok   |   -    |   -    |
|   28598  |   Ok   |   -    |   -    |
|   28599  |   Ok   |   -    |   -    |
|   28603  |   Ok   |   -    |   -    |
|   28605  |   Ok   |   -    |   -    |
|   29001  |   Ok   |   -    |   -    |
|   38598  |   Ok   |   -    |   -    |
|   50220  |   -    |   -    |   -    |  Length failure test 1 (254)
|   50221  |   -    |   -    |   -    |  Length failure test 1 (254)
|   50222  |   -    |   -    |   -    |  Length failure test 1 (254)
|   50225  |   -    |   -    |   -    |  Length failure test 1 (254)
|   50227  |   Ok   |   -    |   -    |  Length failure test 1 (193)
|   51932  |   -    |   -    |   -    |  Length failure test 1 (208)
|   51936  |   Ok   |   -    |   -    |  Length failure test 1 (193)
|   51949  |   Ok   |   -    |   -    |  Length failure test 1 (211)
|   52936  |   -    |   -    |   -    |
|   54936  |   -    |   -    |   -    |  Length failure test 1 (193)
|   57002  |   -    |   -    |   -    |
|   57003  |   -    |   -    |   -    |
|   57004  |   -    |   -    |   -    |
|   57005  |   -    |   -    |   -    |
|   57006  |   -    |   -    |   -    |
|   57007  |   -    |   -    |   -    |
|   57008  |   -    |   -    |   -    |
|   57009  |   -    |   -    |   -    |
|   57010  |   -    |   -    |   -    |
|   57011  |   -    |   -    |   -    |
|   65000  |   -    |   -    |   -    |  Length failure test 1 (255)
|   65001  |   -    |   -    |   -    |
+----------+--------+--------+--------+

Press any key...

Anything that fails Test 1 does not have a complete mapping for character codes 0 through 255. Those codepages are useless for our purpose: using any of them will result in data loss, unless the application you're building doesn't use the missing characters/bytes. Even among the ones marked 'Ok', some still fail the length test because they are multi-byte encodings (last column; the number shown is the length of the resulting string, which should have been 256).
Test 2 passes only for the current thread's codepage (marked in the last column), whereas Test 3 passes only for codepage 28591 (ISO-8859-1), whose 256 characters coincide with the first 256 Unicode code points.

Conclusion

If you only use encoders to transform the strings/characters to bytes and back, then any of them that passes Test 1 will do the job (note the remark on the length test under Points of Interest below).
If you also use the Chr(), ChrW(), Asc() and AscW() functions, you should be more careful. Chr() and Asc() should only be used if you can fix/set the current thread's codepage, and that still requires a codepage that passes Test 1.
Probably the best solution is codepage 28591 (ISO-8859-1), combined with ChrW() and AscW(). .NET strings are stored as UTF-16, and the first 256 Unicode code points are identical to ISO-8859-1, so ChrW(n) and the 28591 encoder agree for every n from 0 to 255.
The tricky part of the last two options is that you cannot mix the two pairs of functions: if you pick Chr() and Asc(), you can't use ChrW() and AscW(), and vice versa.

A generic utility class to do the conversion looks like this:

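Public Class Utility
    ' ISO-8859-1: maps byte values 0-255 one-to-one onto Unicode code points 0-255
    Private Shared ReadOnly Latin1 As System.Text.Encoding = System.Text.Encoding.GetEncoding(28591)

    Public Shared Function String2Bytes(ByVal str As String) As Byte()
        Return Latin1.GetBytes(str)
    End Function

    Public Shared Function Bytes2String(ByVal bytes As Byte()) As String
        Return Latin1.GetString(bytes)
    End Function

    Public Shared Function Byte2Char(ByVal b As Byte) As Char
        Return ChrW(b)
    End Function

    Public Shared Function Char2Byte(ByVal c As Char) As Byte
        ' explicit conversion (required with Option Strict On);
        ' throws an OverflowException for characters above 255
        Return CByte(AscW(c))
    End Function
End Class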

This works for every byte value and for every string that only contains characters 0 to 255; just don't mix it with the Chr() and Asc() functions.
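The claim that the 28591 encoder and ChrW()/AscW() belong together can be checked with a few lines. This is a sketch with made-up names, not part of the original project.

```vbnet
Imports System
Imports System.Text
Imports Microsoft.VisualBasic

Module Latin1AndChrW
    ' Returns True when codepage 28591 and ChrW() agree for every byte value.
    Function AgreesWithChrW() As Boolean
        Dim latin1 As Encoding = Encoding.GetEncoding(28591)
        For n As Integer = 0 To 255
            ' byte --> character must give the same character as ChrW(n)
            Dim s As String = latin1.GetString(New Byte() {CByte(n)})
            If s.Length <> 1 OrElse s(0) <> ChrW(n) Then Return False
            ' character --> byte must give back n
            If latin1.GetBytes(ChrW(n).ToString())(0) <> n Then Return False
        Next
        Return True
    End Function

    Sub Main()
        Console.WriteLine(AgreesWithChrW())   ' True
    End Sub
End Module
```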

Points of Interest

Passing the length test does not prove that a codepage is OK. The test data is just the sequence 0 to 255, so a byte sequence such as 100, 120, 32 is not part of it, and a multi-byte encoding might still turn it into fewer than 3 characters.

Doesn't UTF8 do bidirectional and lossless transformations? Not for this purpose. UTF8 is codepage 65001 (see the mentioned MSDN link) and it failed the test.
The common mistake is to think of UTF8 as the all-encompassing encoding. It does work when you take a string containing a piece of text, encode it to a byte array and decode it again. But in this case the byte array was not created by the encoder; it's just a bunch of data that I want to convert to a string for easy handling. There is no guarantee that this set of bytes is valid UTF8, and the test proves the point.
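A minimal sketch of what goes wrong, using four made-up bytes of device data. The names are illustrative; Encoding.UTF8 is the standard .NET property, which replaces invalid input instead of throwing.

```vbnet
Imports System
Imports System.Text

Module Utf8RawBytes
    ' Decodes arbitrary bytes with the given encoding and encodes the result again.
    Function RoundTrip(ByVal enc As Encoding, ByVal data As Byte()) As Byte()
        Return enc.GetBytes(enc.GetString(data))
    End Function

    Sub Main()
        ' hypothetical device data: two bytes that are never valid in UTF-8, then CR LF
        Dim data As Byte() = {&HFF, &HFE, 13, 10}
        ' each invalid byte is decoded to U+FFFD, which is encoded as three bytes
        Console.WriteLine(RoundTrip(Encoding.UTF8, data).Length)                ' 8
        ' codepage 28591 gives back the original four bytes
        Console.WriteLine(RoundTrip(Encoding.GetEncoding(28591), data).Length)  ' 4
    End Sub
End Module
```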

Some of the codepages that pass Test 1 still have unmapped characters; the Windows-1252 codepage, for example, has 5 byte values without a character (81, 8D, 8F, 90 and 9D in hex). If a future version of .NET makes the encoders stricter (as happened in the migration from 1.1 to 2.0), these 5 unmapped values might suddenly be translated into '?' (question mark) characters as well, and the 1-on-1 conversion will fail. That's something to be aware of when picking any of these codepages.

Source

Based on this discussion and alternative 4 below.
The source code (VS2010 project) is available as well.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)