The C++ source files for the stand-alone base64 encoder and decoder discussed in this post can be found here:
http://coolcowstudio.co.uk/source/cpp/coding.zip.
There is a quote that goes “Standards are great! Everyone should have one.” or something along those lines. (Somewhat ironically, this quote, too, has many different variations, and has many attributions. The earliest I’ve found attributes it to George Morrow in InfoWorld 21 Oct 1985).
A case in point is the base64 encoding. Put simply, it’s a method of encoding an array of 8-bit bytes using an alphabet consisting of 64 different printable characters from the ASCII character set. This is done by taking three 8-bit bytes of source data, arranging them into a 24-bit word, and converting that into four 6-bit values that map onto the 64-character alphabet (since 6 bits can represent the values 0 to 63).
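To make the arithmetic concrete, here is a minimal sketch of that three-bytes-to-four-characters step. The names `kAlphabet` and `encode_triplet` are made up for this illustration and are not taken from the linked source. It assumes the standard RFC 4648 alphabet and handles only a complete triplet, with no padding or line breaks.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Standard RFC 4648 alphabet; indices 62 and 63 are '+' and '/'.
static const char kAlphabet[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Hypothetical helper: encodes exactly three bytes as four base64 characters.
std::string encode_triplet(unsigned char b0, unsigned char b1, unsigned char b2)
{
    // Arrange the three 8-bit bytes into one 24-bit word.
    const std::uint32_t word = (std::uint32_t(b0) << 16) |
                               (std::uint32_t(b1) << 8) |
                                std::uint32_t(b2);
    std::string out;
    // Peel off four 6-bit groups, most significant group first.
    for (int shift = 18; shift >= 0; shift -= 6)
    {
        const unsigned index = (word >> shift) & 0x3F;
        assert(index < 64);  // a 6-bit value always fits the alphabet
        out += kAlphabet[index];
    }
    return out;
}
```

Calling `encode_triplet('f', 'o', 'o')` yields "Zm9v", which matches the first four characters of the "fooba" example further down.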
The original implementation was for privacy-enhanced e-mail (RFC 1421), then altered slightly for MIME (RFC 2045), and again in its own standard (RFC 4648).
When I was looking at base64, I was interested in three different varieties or flavours, namely the MIME version, the (per RFC 4648) standard base64, and base64url. These differ in how they handle line breaks and other illegal characters, what characters are used in the 64-character alphabet, and the use of padding at the end to make up an even triplet of bytes.
That’s three mostly similar but slightly different algorithms, so how to design an implementation? A number of designs are possible, of course, like:
- three copy-pasted and tweaked sets of functions
- hugely parameterised functions where every variation in algorithm or data can be altered in the call
- an inheritance tree, where virtual overridden functions alter behaviour
I chose to use a design that in my opinion takes the best from all of those, with none of the disadvantages.
The Wikipedia article has a handy table giving a summary of the differences between the variants, which gives eight possible differences of data or algorithm. I’ve elected to ignore the last of those, which is the addition of a checksum only used for OpenPGP (RFC 4880), since that is easily added outside the base64 coding. The sixth difference on Wikipedia’s list – line separator – can also be ignored since it can be inferred from the maximum line length. The remaining areas of difference are:
- Character for index 62
- Character for index 63
- Character to use for padding
- Whether fixed line length is used
- Maximum line length
- Handling of illegal characters
Fortunately, these are all integral types (bool and char are both integral types), so they can be used as template parameters. In other words, I can define a type to hold all possible variants:
template<char Tchar62, char Tchar63, char TcharPad,
         bool TfixLineLength, int TmaxLineLength, bool TignoreIllegal>
struct base64_variants
{
static char char62()
{ return Tchar62; }
static char char63()
{ return Tchar63; }
static char charPad()
{ return TcharPad; }
static bool pad()
{ return TcharPad > 0; }
static bool fixLineLength()
{ return TfixLineLength; }
static int maxLineLength()
{ return TmaxLineLength; }
static bool lineBreaks()
{ return TmaxLineLength>0; }
static bool ignoreIllegal()
{ return TignoreIllegal; }
};
What is the benefit of this? Well, it lets us express distinct sets of parameters to correspond to particular standards, like this:
typedef base64_variants<'+', '/', '=', false, 76, true> base64MIME;
typedef base64_variants<'+', '/', '=', false, 0, false> base64;
typedef base64_variants<'-', '_', 0, false, 0, false> base64url;
This means that there is no risk of mixing up parameters in function calls – once a set is defined (and tested and verified to be correct) that can be used everywhere. The actual coding functions then get a very simple interface, with an input, an output, and variant type as only parameters:
template<typename T>
void bytes_to_base64(const unsigned char* data,
size_t length,
std::string& str);
template<typename T>
void bytes_to_base64(const std::vector<unsigned char>& data,
std::string& str);
template<typename T>
void base64_to_bytes(const std::string& str,
std::vector<unsigned char>& data);
And this is how the use of these would look in code:
const char* src = "fooba";
std::vector<unsigned char> data(src, src+5);
std::string result;
bytes_to_base64<base64>(data, result);
assert(result == "Zm9vYmE=");
base64_to_bytes<base64>(result, data);
result = std::string(data.begin(), data.end());
assert(result == "fooba");
Below is a short extract of the implementation, illustrating how the variant specification is used:
template<typename T>
void bytes_to_base64(const unsigned char* data, size_t length,
std::string& str)
{
str.clear();
str.reserve((length * 4 ) / 3 + 3 +
(T::lineBreaks() ? 2 * length / T::maxLineLength() : 0) );
(I won’t post the whole implementation here. While it’s only 200 airy and thoroughly commented lines, it’s better to give you a link to an archive with the files.)
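As a further illustration of how a variant type can steer the algorithm, here is a rough sketch of an alphabet lookup. It is not an excerpt from the archive: `alphabet_char` is a hypothetical helper, and the variant type is a trimmed copy that keeps only the accessors this sketch needs.

```cpp
#include <cassert>

// Trimmed copy of the variant type from this post (fewer accessors).
template<char Tchar62, char Tchar63, char TcharPad,
         bool TfixLineLength, int TmaxLineLength, bool TignoreIllegal>
struct base64_variants
{
    static char char62() { return Tchar62; }
    static char char63() { return Tchar63; }
    static char charPad() { return TcharPad; }
    static bool pad() { return TcharPad > 0; }
};

typedef base64_variants<'+', '/', '=', false, 0, false> base64;
typedef base64_variants<'-', '_', 0, false, 0, false> base64url;

// Hypothetical helper: maps a 6-bit value to its character in variant T.
// Only indices 62 and 63 depend on the variant.
template<typename T>
char alphabet_char(unsigned index)
{
    assert(index < 64);
    if (index < 26) return static_cast<char>('A' + index);
    if (index < 52) return static_cast<char>('a' + (index - 26));
    if (index < 62) return static_cast<char>('0' + (index - 52));
    return index == 62 ? T::char62() : T::char63();
}
```

With this, `alphabet_char<base64>(62)` gives '+' while `alphabet_char<base64url>(62)` gives '-'. The variant is picked at compile time, so there are no run-time parameters to mix up.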
Now, the astute reader will note that I only declared the coding functions earlier, and talked about the implementation as if it was separate. That’s the normal way of doing things, but that won’t work with templates, right?
Here’s the thing with template functions and classes: they are not just the one thing. A normal, non-template function is one single thing, fully defined once and once only. Therefore, it can be compiled in one compilation unit, and then linked to. It can be declared elsewhere, and that declaration is essentially a promise that somewhere this thing is defined, which is all the compiler cares about.
A template function is effectively a new function for each template parameter (or combination of parameters) it is used with. (As far as the compiler is concerned, a template class is just a way of saying that all member functions have the same set of template parameters.) To the compiler, there’s no such thing as a “template function”. There is only “template function with these template parameters”.
This means that there can’t be a single compiled variant that can be linked to, only specialisations that use this or that set of template parameters. Even if only one variant, only one specialisation, is used in your project, the compiler can’t know that.
So instead, the compiler instantiates the template on demand: wherever a template function is used, it generates a copy of that function in which the template parameters are those used in that particular call. That is why it normally needs to see the full definition, not just a declaration, at the point of use.
In other words, C++ templates are just a way of bullying the compiler into doing the copy-paste programming you are ashamed of doing yourself.
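A small sketch can make the "one function per set of template parameters" point visible. The function `call_count` below is invented for this illustration. It relies on the fact that a static local variable belongs to one instantiation, so each instantiation keeps its own counter.

```cpp
#include <cassert>

// Hypothetical demo function: returns how many times this particular
// instantiation has been called so far.
template<typename T>
int call_count()
{
    static int count = 0;  // one counter per instantiation, not per template
    return ++count;
}

// Usage sketch: call_count<char> and call_count<long> are separate functions.
void demo_call_count()
{
    const int before = call_count<char>();
    call_count<long>();
    call_count<long>();
    // The two <long> calls did not touch the <char> counter.
    assert(call_count<char>() == before + 1);
}
```

If the template were a single function, the `<long>` calls would have advanced the same counter and the assertion would fail.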
As it happens, though, that also provides the solution to separating definitions and declarations, provided you know what template parameters you’ll be using. All you need to do is declare (explicitly instantiate) the function with the parameters you want, in the same compilation unit as the template function definition.
Here’s a simple example:
-------------------------------------
-- In my_template.h file
...
template <typename T>
T my_temp_func(const T& t);
-------------------------------------
-- In my_template.cpp file
#include "my_template.h"
template <typename T>
T my_temp_func(const T& t)
{
return t;
}
template int my_temp_func<int>(const int& t);
-------------------------------------
-- In using_my_template.cpp
#include "my_template.h"
...
int i = my_temp_func<int>(4);
string s = my_temp_func<string>("Abob");
In the example above, the declaration in the last line of my_template.cpp tells the compiler that there’ll be a variant of the template function that uses int as template parameter. Okay, says the compiler, I’ll put a copy there. Since the generic definition is right there in the same compilation unit (i.e., my_template.cpp), this is something the compiler can do – it has all the information it needs.
The result of that is that in the compiled file (probably called my_template.obj) there is now a function that has the signature int my_temp_func(const int& t). This is a fully defined specialization of a template function, so to the linker it looks just like a normal function.
However, the linker won’t be able to find a string specialisation, so this will generate a linker error.
This illustrates both how to use this trick, and its limitation. It only works if you list all specializations you are going to use, which makes it unfeasible for generic libraries.
In this case, though, it’s ideal. I have my three variants of base64 defined (base64MIME, base64 and base64url) and those are the only ones I’ll need.
Actually, I might as well add a definition for the original variant:
typedef base64_variants<'+', '/', '=', true, 64, false> base64PEM;
So I have my four variations defined, and they are the only variations of base64 I’m interested in, so I’ll just have to declare them in the same .cpp file as the function definitions are in:
template void bytes_to_base64<base64>(const unsigned char* data,
size_t length, std::string& str);
template void bytes_to_base64<base64>(const std::vector<unsigned char>& data,
std::string& str);
template void base64_to_bytes<base64>(const std::string& str,
std::vector<unsigned char>& data);
template void bytes_to_base64<base64MIME>(const unsigned char* data,
size_t length, std::string& str);
...
This lets the linker find compiled varieties for all those base64 definitions.
Should you want to implement a slightly different base64 coder, you could use the code I’ve written. Declaring a new variant type would not be enough on its own; you would also have to add declarations using that type to the .cpp file. But the source code is both open and free, so help yourself.
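As a rough single-file sketch of what adding a variant involves, the example below uses a deliberately simplified stand-in for the real code. `mini_variant`, `encoded_length` and the two typedefs are hypothetical names, not part of the archive. In a real project the declaration would live in the header, and the definition and explicit instantiations in the .cpp file.

```cpp
#include <cassert>
#include <cstddef>

// Simplified stand-in for the variant type: only the padding character.
template<char TcharPad>
struct mini_variant
{
    static bool pad() { return TcharPad > 0; }
};

typedef mini_variant<'='> mini_base64;     // padded
typedef mini_variant<0>   mini_base64url;  // unpadded

// Header part: declaration only.
template<typename T>
std::size_t encoded_length(std::size_t length);

// .cpp part: the definition...
template<typename T>
std::size_t encoded_length(std::size_t length)
{
    if (T::pad())
        return ((length + 2) / 3) * 4;  // always whole groups of four
    return (length * 4 + 2) / 3;        // round up, no padding characters
}

// ...followed by one explicit instantiation per supported variant.
// A new variant needs both a new typedef and a new line here.
template std::size_t encoded_length<mini_base64>(std::size_t);
template std::size_t encoded_length<mini_base64url>(std::size_t);

// Usage sketch: five input bytes, as in the "fooba" example.
void demo_encoded_length()
{
    assert(encoded_length<mini_base64>(5) == 8);     // "Zm9vYmE="
    assert(encoded_length<mini_base64url>(5) == 7);  // "Zm9vYmE"
}
```

The length formulas ignore line breaks, so they only describe variants without a maximum line length.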
(I should note that, as with so much else, there are lots and lots of base64 implementations available on the net, but most of the ones I’ve found tended to be very lax and lack strict checking of syntactical correctness, or implement just one flavour. Hence, writing my own.)
As always, if you found this interesting or useful, or have suggestions for improvements, please let me know.