Here’s a problem that tends to crop up in a lot of communication domains: how do you transfer binary data in a protocol which limits what characters are permitted? The answer is to encode it into permissible characters (for historical reasons often 7-bit printable ASCII), and because there are few things this wonderful industry likes more than re-inventing the wheel, there’s a plethora of binary-to-text encoding schemes around. Each has its own trade-offs in terms of speed and space efficiency, and almost every one has a more or less glorious history of being the favoured scheme on some platform, or in some protocol or application.
The simplest encoding is (in my opinion) the “hexadecimal text” encoding. It’s so simple, it doesn’t even have a fancy or clever name. You simply take each byte and write its value as a hexadecimal number. Working on the assumption that a byte is 8 bits, its value can be expressed in two characters – 0x00-0xff. Assuming that a character occupies one byte, we see that the size of the data will double by writing it as hexadecimal text, so it’s not very efficient space-wise. But it is simple to understand and implement, and quite useful, so I wrote a pair of encoding/decoding functions.
Let’s start with the encoding function, as that’s the simplest. I’ll use std::string to store the resulting string here, with the above assumptions. (Those assumptions – 8-bit memory bytes, 8-bit characters – are quite reasonable, in that if you’re working on a platform where they’re not true, you probably know about it.)
#include <climits> // For CHAR_BIT
#include <cstddef> // For size_t
#include <string>
#include <vector>

void bytes_to_hex(const unsigned char* data, size_t length,
                  std::string& str)
{
    static_assert(8 == CHAR_BIT, "8-bit bytes assumed");
    str.clear();
    if (0 == length)
        return;
    // Two output characters per input byte.
    str.resize(length * 2, ' ');
    static const std::string hex_char = "0123456789abcdef";
    for (size_t i = 0; i < length; ++i)
    {
        // High nybble first, then low nybble.
        str[i << 1] = hex_char[(data[i] >> 4) & 0x0f];
        str[(i << 1) + 1] = hex_char[data[i] & 0x0f];
    }
}

void bytes_to_hex(const std::vector<unsigned char>& data,
                  std::string& str)
{
    // data.data() is valid even for an empty vector, unlike &data[0].
    bytes_to_hex(data.data(), data.size(), str);
}
As you see, it’s very simple. Given a buffer of bytes {7, 233, 57, 42, 198}, the string “07e9392ac6” is generated into the output parameter.
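For illustration, here’s a minimal usage sketch, assuming the functions above are in scope:

#include <iostream>

int main()
{
    const std::vector<unsigned char> bytes = {7, 233, 57, 42, 198};
    std::string hex;
    bytes_to_hex(bytes, hex);
    std::cout << hex << '\n'; // Prints "07e9392ac6"
}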
While there is a standard way of turning a number into a string – with std::stringstream – I elected to write the code to do it myself here, since it’s a very simple and safe algorithm.
First, I set up an array of the sixteen hexadecimal digits, and then simply use the high and low nybble as an index into this array to get the character corresponding to the value of the nybble. The high nybble has to be shifted down to get the correct range, and that’s all there is to it.
Since I had already resized the output string, I can write directly into the correct position, instead of appending (which would likely be significantly slower).
It’s tempting to write the decoding function as a straight reverse, but this stumbles on the character-to-value lookup. How do you get from a character to its corresponding nybble? The naive solution looks as follows:
std::string hex = "f3";
// ...
char hi_nybble = hex[0];
char lo_nybble = hex[1];
unsigned char result = 0;
if (hi_nybble > '9')
    result |= (hi_nybble - 'a' + 0xa) << 4;
else
    result |= (hi_nybble - '0') << 4;
if (lo_nybble > '9')
    result |= (lo_nybble - 'a' + 0xa);
else
    result |= (lo_nybble - '0');
// ...
Ignoring for the moment that there’s no sanity checking of the input data, assuming only the characters [0-9, a-f] will be present, this snippet still fails the “good engineering” test by making assumptions about character ordering and values. The standard guarantees that the digits ‘0’-‘9’ are contiguous, but gives no such guarantee for the letters (EBCDIC, famously, has gaps in its alphabet), so it’s not safe to use, and may stop working when used with a different character set.
An alternative would be to make a proper lookup table, mapping characters to nybble values, with separate entries for upper and lower case characters (a-f), either by populating a std::map or a big switch:
#include <stdexcept> // For std::invalid_argument

inline unsigned char hex_digit_to_nybble(char ch)
{
    switch (ch)
    {
    case '0': return 0x0;
    case '1': return 0x1;
    case '2': return 0x2;
    // ... cases '3' through '9', 'a'/'A' through 'e'/'E' ...
    case 'f': return 0xf;
    case 'F': return 0xf;
    default: throw std::invalid_argument("not a hexadecimal digit");
    }
}
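The std::map variant mentioned above could look something like this – a sketch of mine, not part of the original code, and the name hex_digit_to_nybble_map is made up for the example:

#include <map>
#include <stdexcept>

inline unsigned char hex_digit_to_nybble_map(char ch)
{
    // Populate the table once; lookups then work regardless of the
    // character set's ordering.
    static const std::map<char, unsigned char> lookup = {
        {'0', 0x0}, {'1', 0x1}, {'2', 0x2}, {'3', 0x3},
        {'4', 0x4}, {'5', 0x5}, {'6', 0x6}, {'7', 0x7},
        {'8', 0x8}, {'9', 0x9},
        {'a', 0xa}, {'b', 0xb}, {'c', 0xc}, {'d', 0xd},
        {'e', 0xe}, {'f', 0xf},
        {'A', 0xa}, {'B', 0xb}, {'C', 0xc}, {'D', 0xd},
        {'E', 0xe}, {'F', 0xf},
    };
    const auto it = lookup.find(ch);
    if (it == lookup.end())
        throw std::invalid_argument("not a hexadecimal digit");
    return it->second;
}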
Then, after I have the nybbles, I could shift the high one up, and do a bitwise OR to join them (see the sketch below). But frankly, while this works, it feels clunky.
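Joining the two nybbles is a one-liner; a minimal sketch, where the helper name hex_pair_to_byte is my own:

// Shift the high nybble up and OR in the low nybble to form one byte.
unsigned char hex_pair_to_byte(char hi, char lo)
{
    return static_cast<unsigned char>(
        (hex_digit_to_nybble(hi) << 4) | hex_digit_to_nybble(lo));
}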
And besides, there are lots of standard ways to convert a string to a number: from the standard C library functions atoi and strtol, to the standard C++ std::stringstream (and even boost::lexical_cast, which isn’t standard, but fairly popular). However, only two of those can handle numbers in bases other than decimal – strtol and std::stringstream – and of those, the latter is much more powerful, and therefore likely to be slower.
The strtol function expects a character string, so I’ll have to copy each pair of characters into a zero-terminated buffer, and use that as input to get a byte. That’s simple enough, but what do I do if there isn’t a pair of characters, but a single one?
In other words, if I have the hex string “3da” to convert, the function should treat it like “03da”, and produce {03, da} rather than {3d, a0}. Rather than making a copy of the string with an extra “0” prepended, I’ll treat this as a special case.
Any other potential problems with using strtol? Well, yes, the matter of what characters count as valid input. I’ll use the unsigned version, strtoul, since I’m expecting an unsigned output, but even this is far too lenient in what it accepts: [whitespace][{+|-}][0[{x|X}]][digits]. Since I’m not converting numbers, but decoding bytes, I can’t accept any whitespace, signs, or anything that isn’t a hexadecimal digit. Looking into this further, it’s less of a problem than it would first appear, as strtoul will let me know if there’s an un-parsed character at the end. In other words, “-3” is fully parsed, but for “3-” it will only parse the first digit and then stop, so that’s a simple thing to check for. (Note that even with the base given explicitly as 16, strtoul still accepts an optional “0x” prefix; but since my buffer only ever holds two characters, a pair like “0x” parses as just “0”, and the end-pointer check catches the stray ‘x’.)
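A small sketch of that end-pointer behaviour; the example strings and output are just for illustration:

#include <cstdlib>
#include <iostream>

int main()
{
    char* end = nullptr;

    // Whitespace and a sign are accepted, so " -3" parses fully
    // (the negated value wraps around to a huge unsigned number).
    unsigned long a = strtoul(" -3", &end, 16);
    std::cout << a << ", stopped at: '" << *end << "'\n"; // *end == '\0'

    // "3-" stops at the '-', which the end pointer reveals.
    unsigned long b = strtoul("3-", &end, 16);
    std::cout << b << ", stopped at: '" << *end << "'\n"; // *end == '-'
}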
Still, we need to make sure the initial character is valid hex, which means breaking out isxdigit to make sure we only accept hexadecimal digits. Now, this is a function that is available both from the standard C library, via the <cctype> header, and from the standard C++ library, via the <locale> header. The difference is that std::isxdigit takes a std::locale, which I assume is just for completeness’ sake, as the C isxdigit is not affected by any changes to locale. At best, it’s just a pass-through call only adding a level of indirection, and at worst, it adds more unnecessary computing, so I’ll just stick to the C version.
#include <cctype>    // For isxdigit
#include <climits>   // For CHAR_BIT
#include <cstdlib>   // For strtoul
#include <stdexcept> // For std::invalid_argument
#include <string>
#include <vector>

void hex_to_bytes(const std::string& str,
                  std::vector<unsigned char>& data)
{
    static_assert(8 == CHAR_BIT, "8-bit bytes assumed");
    data.clear();
    if (str.empty())
        return;
    const size_t lengthOverflow = str.length() % 2;
    const size_t length = lengthOverflow + str.length() / 2;
    data.resize(length);
    char buf[3]; // One digit pair plus terminating zero
    buf[2] = 0;
    char* pend = &buf[2];
    size_t i = 0;
    size_t c = 0;
    // An odd-length string is treated as having a leading zero:
    // "3da" decodes as "03da".
    if (1 == lengthOverflow)
    {
        buf[0] = '0';
        buf[1] = str[c++];
        unsigned char x = static_cast<unsigned char>(
            strtoul(buf, &pend, 16));
        if (pend != &buf[2])
        {
            std::string e = "Invalid character in hex string: '";
            e += *pend;
            e += "'";
            throw std::invalid_argument(e);
        }
        data[i++] = x;
    }
    for (; i < length; ++i)
    {
        buf[0] = str[c++];
        // Check the first digit ourselves, since strtoul would happily
        // skip leading whitespace or accept a sign here. (isxdigit must
        // be given a value representable as unsigned char.)
        if (!isxdigit(static_cast<unsigned char>(buf[0])))
        {
            std::string e = "Invalid character in hex string: '";
            e += buf[0];
            e += "'";
            throw std::invalid_argument(e);
        }
        buf[1] = str[c++];
        unsigned char x = static_cast<unsigned char>(
            strtoul(buf, &pend, 16));
        if (pend != &buf[2])
        {
            std::string e = "Invalid character in hex string: '";
            e += *pend;
            e += "'";
            throw std::invalid_argument(e);
        }
        data[i] = x;
    }
}
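A quick round-trip sketch of the pair in use, odd-length behaviour included (again assuming both functions above are in scope):

#include <iostream>

int main()
{
    std::vector<unsigned char> bytes;
    hex_to_bytes("3da", bytes); // Treated as "03da"
    std::cout << std::hex;
    for (unsigned char b : bytes)
        std::cout << static_cast<int>(b) << ' '; // Prints "3 da"
    std::cout << '\n';

    std::string hex;
    bytes_to_hex(bytes, hex);
    std::cout << hex << '\n'; // Prints "03da"
}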
(As it happens, when writing this post I came across a very interesting blog entry by Tino Didriksen testing different methods of converting strings to integers in C++, using decimal strings, so all the methods I mention above are timed. Looking at those results, there’s little to recommend std::stringstream in terms of speed, which was a nice validation of this code, the first version of which I wrote years ago. I also used Tino’s benchmark code to compare the use of strtol with the hex_digit_to_nybble approach outlined above, and found that the latter was almost twice as fast. I’m currently pondering what drawbacks there might be in using what was quickly conceived as a rhetorical strawman while writing this article.)