During the writing of my last post, I did the due diligence thing and considered alternative implementations and algorithms to solve the problem at hand (converting a string
representation of an 8-bit hexadecimal value to an unsigned 8-bit integer value). Because I was, in effect, documenting code written some years ago, I can't recall exactly what other options, if any, I tried at the time.
I think I first tried using a std::stringstream, but gave up on that as being too slow, and went with strtoul instead. I might also have played around with using a std::map lookup table, with all the headaches that brought in terms of storage and initialization, and decided against it.
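Roughly, the strtoul route looks like this (a reconstruction for illustration, not the exact code from the earlier post; hex_pair_to_byte is just an illustrative name):

#include <cstdlib>
#include <stdexcept>
#include <string>

// Copy two characters into a small string, parse it in base 16, and check
// that the whole pair was consumed.
inline unsigned char hex_pair_to_byte(const std::string& str, std::string::size_type pos)
{
    const std::string pair = str.substr(pos, 2);    // e.g. "7f"
    char* end = 0;
    const unsigned long value = std::strtoul(pair.c_str(), &end, 16);
    if (end != pair.c_str() + pair.length())
        throw std::invalid_argument("Invalid character in hex string: " + pair);
    return static_cast<unsigned char>(value);
}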
What I didn't try was a straight, non-clever switch-based lookup table to find the integer value of a hexadecimal character digit:
inline unsigned char hex_digit_to_nybble(char ch)
{
    switch (ch)
    {
        case '0': return 0x0;
        case '1': return 0x1;
        case '2': return 0x2;
        ...
        case 'f': return 0xf;
        case 'F': return 0xf;
        default: throw std::invalid_argument("invalid hex digit");
    }
}
I did when writing that post, though, and found that this was twice as fast as using strtoul. That in itself isn’t surprising, as it’s a much simpler functional requirement to fulfill (the standard strtoul is a very capable parser).
As it happens, using this digit-to-nybble function is not only faster, but also simplifies the main conversion function calling it. For one thing, this function handles all sanity checking and validation of the characters given, and for another, it removes the need to copy the two characters into a hex string before doing the conversion.
Faster, neater, tidier code – brilliant!
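A minimal sketch of that kind of comparison, assuming the hex_digit_to_nybble above is in scope (the iteration count and the use of std::clock are illustrative choices here, not the exact harness):

#include <cstdlib>
#include <ctime>
#include <iostream>
#include <string>

int main()
{
    const std::string sample = "7f";
    const long iterations = 10000000;
    volatile unsigned char sink = 0;    // keep the compiler from optimizing the loops away

    std::clock_t start = std::clock();
    for (long i = 0; i < iterations; ++i)
        sink = static_cast<unsigned char>(std::strtoul(sample.c_str(), 0, 16));
    std::cout << "strtoul: " << (std::clock() - start) << " ticks\n";

    start = std::clock();
    for (long i = 0; i < iterations; ++i)
        sink = static_cast<unsigned char>(
            (hex_digit_to_nybble(sample[0]) << 4) | hex_digit_to_nybble(sample[1]));
    std::cout << "switch:  " << (std::clock() - start) << " ticks\n";

    std::cout << "(last result: " << static_cast<int>(sink) << ")\n";
    return 0;
}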
However, I was troubled by one question: what character set does C++ use?
See, I was using literal characters, and we've all been warned about making assumptions about character sets. I even referred to that kind of dangerous assumption in the post I wrote, without, if I'm honest, thinking much about it.
Looking at the code again, I realized that I was already assuming things about the character set, since I use an array of characters when encoding integer values into hex strings:
static const std::string hex_char = "0123456789abcdef";
...
char high_nybble_char = hex_char[(data_char >> 4) & 0x0f];
char low_nybble_char = hex_char[data_char & 0x0f];
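The full encoding function isn’t shown in this post, but built around that table it looks something like this (a sketch; the real bytes_to_hex may differ in its details):

#include <string>
#include <vector>

void bytes_to_hex(const std::vector<unsigned char>& data, std::string& str)
{
    static const std::string hex_char = "0123456789abcdef";
    str.clear();
    str.reserve(data.size() * 2);
    for (std::vector<unsigned char>::size_type i = 0; i < data.size(); ++i)
    {
        const unsigned char data_char = data[i];
        str += hex_char[(data_char >> 4) & 0x0f];   // high nybble first
        str += hex_char[data_char & 0x0f];          // then the low nybble
    }
}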
In other words, if this was safe, the switch was safe. If it wasn't, neither was the switch, and while I already had a portable, safe, and working version for the decoding (using strtoul), I might have to come up with an alternative for the encoding.
Essentially, for this code to be guaranteed to be portable, standards-compliant, and safe, three potentially different character sets need to agree that the sixteen characters used to represent hexadecimal numbers are the same. Those three are:
- The character set used to store the source code files
- The character set used by the compiler when creating executable code
- The character set used by the data sent into the decoding function, and accepted as output by the caller of the encoding function
The problem is that a character is only a character when printed or written or displayed. It’s a visual shape, a representation. This A (probably) looks like a character to you, but it isn’t stored as one. It’s stored in some sort of binary representation, using some defined rule set and look-up that tells your computer (or mobile, iPad, text-to-speech system, etc.) that it should be rendered in a manner that represents the concept of the letter “A” in one of the many Latin alphabets. Probably.
So the compiler needs to be able to read the source file, which is a stream of bytes (which, of course, may be of any number of lengths, although 8 bits is the most common), and interpret those into a representation where, for instance, the bytes {0x69, 0x66, 0x20, 0x28} are parsed as “if (” and not “%ö”.
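A concrete (and ASCII-assuming) illustration of those four bytes:

#include <iostream>

int main()
{
    // Under an ASCII-compatible encoding these four bytes spell "if (";
    // under a different encoding (EBCDIC, say) the very same bytes mean
    // something else entirely.
    const char bytes[] = { 0x69, 0x66, 0x20, 0x28, 0 };
    std::cout << bytes << '\n';    // prints: if (
    return 0;
}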
Of course, this is what a compiler is required to do, and regardless of how the source code is stored (well, provided it’s in a form the compiler supports), the compiler will read it and convert it to the “basic source character set”. This character set includes all hexadecimal digits, so we're safe there. (It actually includes most of your basic printable 7-bit ASCII. There are a couple of good explanations of this on StackOverflow.)
Next, the compiler has to worry about the “basic execution character set”, which determines what characters are – at minimum – available during execution. This is defined as the “basic source character set” plus a few extra characters, so we're good there, too.
Finally, the data sent into the function is safe as well, because the “basic source character set” covers the calling functions too; in the case of data read from disk or otherwise externally sourced, it is the responsibility of the calling function to provide it in that form.
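One related guarantee worth noting: the standard does require the decimal digits '0' through '9' to have consecutive values in the execution character set, so arithmetic like the little helper below is portable (the helper itself is just illustrative, not part of the original code). No such guarantee exists for 'a' to 'f' or 'A' to 'F' (EBCDIC, for instance, has gaps between the letters), which is a good reason to spell out every letter case in the switch rather than compute from ch.

// Portable for '0'..'9' only: the standard guarantees these are contiguous.
// The letters are not guaranteed contiguous, hence the explicit switch cases.
inline unsigned char decimal_digit_to_value(char ch)
{
    return static_cast<unsigned char>(ch - '0');
}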
For more information on these character sets, see the C++ standard, or these two articles, which have been very helpful to me:
So, anyway, it’s safe to use character literals to represent hexadecimal digits, which means that the bytes_to_hex function can remain unchanged, and the hex_to_bytes function can be rewritten to use the much faster switch lookup:
inline unsigned char hex_digit_to_nybble(char ch)
{
    switch (ch)
    {
        case '0': return 0x0;
        case '1': return 0x1;
        ...
        case 'f': return 0xf;
        case 'F': return 0xf;
        default:
        {
            std::string e = "Invalid character in hex string: '";
            e += ch;
            e += "'";
            throw std::invalid_argument(e);
        }
    }
}
void hex_to_bytes2(const std::string& str,
                   std::vector<unsigned char>& data)
{
    static_assert(8 == CHAR_BIT, "hex_to_bytes2 assumes 8-bit chars");
    data.clear();
    if (str.empty())
        return;
    size_t lengthOverflow = str.length() % 2;
    const size_t length = lengthOverflow + str.length() / 2;
    data.resize(length);
    size_t i = 0;
    size_t c = 0;
    if (1 == lengthOverflow)
    {
        data[i++] = hex_digit_to_nybble(str[c++]);
    }
    while (i < length)
    {
        // Read the two digits in separate statements: with both c++'s in one
        // expression, the order of the increments is not guaranteed.
        const unsigned char high = hex_digit_to_nybble(str[c++]);
        const unsigned char low = hex_digit_to_nybble(str[c++]);
        data[i++] = static_cast<unsigned char>((high << 4) | low);
    }
}
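And a quick usage sketch (assuming hex_to_bytes2 and its helpers from above are in scope; the expected values are simply what the function should produce for these inputs):

#include <cassert>
#include <string>
#include <vector>

int main()
{
    std::vector<unsigned char> data;

    hex_to_bytes2("1a2b3c", data);
    assert(data.size() == 3);
    assert(data[0] == 0x1a && data[1] == 0x2b && data[2] == 0x3c);

    // An odd-length string puts the lone leading digit in its own byte:
    hex_to_bytes2("abc", data);
    assert(data.size() == 2);
    assert(data[0] == 0x0a && data[1] == 0xbc);

    return 0;
}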
So there we are. I spent an evening learning a bit more about C++, improving some old code, and writing this post saying “You can always improve”.