Introduction
Base64 encoding can encode binary data to text but has some shortcomings. First of all, encoded values can contain "/" (a slash) which makes Base64 encoded values unusable in file names unless you apply another level of encoding. Second, the encoded values when sorted as string
s are sorted differently than the original values when sorted as numbers.
Background
On my current project, we use Azure blob storage a lot. This has the nice feature that when you list blobs, it always returns them sorted by name. We rely on this feature for versioning so the newer version is before the latest version in the listing. This way, you can get the latest version of the entity by listing "top 1" blobs with a given prefix.
At one point, we needed to extend our versioning scheme. I liked the idea of encoding the binary value using Base64
but very soon, I learned about the shortcomings described in the introduction. In the end, we decided to go in a different direction but I thought it would be fun to implement this "better" Base64
anyway.
Base64
There are several ways how to encode binary data to text. You can use decimal or hexadecimal digits for example. Base64
as the name suggests uses 64 characters: besides 10 decimal digits also uppercase and lowercase letters and two additional characters "+" and "/". The "=" sign is only used for padding and does not encode any information. But different encodings have different ratio between the length of input and output. Let us encode 16 bytes, which is the length of a GUID.
Encoding | Output length | Example |
Decimal digits | 39 | 83010580618813770707986403520966229089 |
Hexadecimal digits | 32 | f3cefc61311240f9a9dc6c519c41733e |
Base64 | 24 (22 without padding) | YfzO8xIx+UCp3GxRnEFzPg== |
Binary | 16 |
How did we get these numbers? One GUID is 16 bytes which is 128 bits - this is 2128 combinations. We use logarithm to get number of digits or characters in output. Base of the logarithm is the number of available characters in the encoding. So for decimal digits we use log10, for hexadecimal digits log16 and for Base64 encoding log64. As you can see, the more characters available in the encoding, the shorter the resulting text.
You can read more about binary to text encoding on Wikipedia.
Choice of Characters
The .NET implementation of Base64
encoding uses following characters: uppercase letters "A" to "Z", then lowercase letters "a" to "z", then decimal digits "0" to "9" and finally two additional characters "+" and "/".
Character | ASCII code |
A .. Z | 65 .. 90 |
a .. z | 97 .. 122 |
0 .. 9 | 48 .. 57 |
+ (plus) | 42 |
/ (slash) | 47 |
This choice of characters has two problems. First obviously using "/" is problematic for file names. Then ordering of characters does not preserve sorting because for higher values in encoding, it uses lower ASCII codes. To fix these problems, we change the order of encoding characters and pick different character instead of "/".
Character | ASCII code |
+ (plus) | 42 |
0 .. 9 | 48 .. 57 |
A .. Z | 65 .. 90 |
_ (underscore) | 95 |
a .. z | 97 .. 122 |
Underscore is a natural choice as it is commonly used in identifiers and is supported in file names. The new ordering also matches the ordering of ASCII codes. This is how it looks in the code:
private const string Base64Characters =
"+0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz";
Padding
In .NET implementation, the output is always padded with "=
" at the end in order to have the output length a multiple of 4. This, of course, breaks the sorting story. To preserve sorting, we would need to add padding in front of the string
. Also it is completely unnecessary and we will not use it.
We still need to insert some padding bits to get a whole number of bytes in the output. There is a simple rule for this:
Input length | Output length | Extra padding bits |
3n | 4n | 0 bits |
3n + 1 | 4n + 2 | 4 bits |
3n + 2 | 4n + 3 | 2 bits |
In the above table, "n" is a positive integer. We can encode 3 bytes into 4 Base64
characters without any padding. Input size that is not a multiple of 3 will result in output size in bits that is not divisible by 8. In this case, we will insert zero bits in front of the most significant character in the output.
You can verify the above in following table:
Input length | Output bit length |
3n | 8 * 3n = 24n = 6 * 4n |
3n + 1 | 8 * (3n + 1) = 24n + 8 = 24n + 6 + 2 = 6 * (4n + 1) + 2 |
3n + 2 | 8 * (3n + 2) = 24n + 16 = 24n + 12 + 4 = 6 * (4n + 2) + 4 |
Endianness
Before we dive into the actual code, we need to make a short stop to explain the data layout. .NET as a platform is little endian. This means that for arrays, the more significant value the higher index in the array. So the least significant value is stored at the beginning of the array and the most significant value is stored at the end. This is important because when encoding as Base64
, we want to put the most significant value at the beginning of the string
and the least significant value at the end. So the order is vice-versa.
To learn more, read about Endianness on Wikipedia.
Encoding
First, we need to determine how many characters will be in output. We will follow the above table. We encode each group of 3 bytes using 4 Base64
characters. If we have 1 extra byte, we add 2 more characters with 4 padding bits. If we have 2 extra bytes, we add 3 more characters with 2 padding bits.
var numberOfChars = 4 * Math.DivRem(input.Length, 3, out int extraBytes);
if (extraBytes > 0)
{
numberOfChars += extraBytes + 1;
}
var output = new char[numberOfChars];
switch (extraBytes)
{
case 0:
Encode0(input, output, input.Length - 1, 0);
break;
case 1:
Encode1(input, output);
break;
case 2:
Encode2(input, output);
break;
default:
throw new InvalidOperationException("Unreachable code");
}
return new string(output);
Encoding Without Padding
This is the simplest case for encoding. We get 3 bytes from input and generate 4 characters in output. Please remember we read the input from the end!
private void Encode0(byte[] input, char[] output, int inputIndex, int outputIndex)
{
while (inputIndex > 0)
{
var i0 = input[inputIndex];
var i1 = input[inputIndex - 1];
var i2 = input[inputIndex - 2];
var o0 = i0 >> 2;
var o1 = ((i0 & 0x03) << 4) | (i1 >> 4);
var o2 = ((i1 & 0x0F) << 2) | (i2 >> 6);
var o3 = i2 & 0x3F;
output[outputIndex] = Base64Characters[o0];
output[outputIndex + 1] = Base64Characters[o1];
output[outputIndex + 2] = Base64Characters[o2];
output[outputIndex + 3] = Base64Characters[o3];
inputIndex -= 3;
outputIndex += 4;
}
}
Encoding With 4 Padding Bits
In this case, we have an extra byte and we encode it into two Base64
characters with 4 padding bits. After we process this extra byte, we can continue the same as with no padding.
private void Encode1(byte[] input, char[] output)
{
var len = input.Length;
var i0 = input[len - 1];
var o0 = i0 >> 6;
var o1 = i0 & 0x3F;
output[0] = Base64Characters[o0];
output[1] = Base64Characters[o1];
Encode0(input, output, len - 2, 2);
}
Encoding With 2 Padding Bits
In this case, we have two extra bytes and we encode them into 3 Base64
characters with 2 padding bits. Again, after we process these two bytes, we can continue the same as with no padding.
private void Encode2(byte[] input, char[] output)
{
var len = input.Length;
var i0 = input[len - 1];
var i1 = input[len - 2];
var o0 = i0 >> 4;
var o1 = ((i0 & 0x0F) << 2) | (i1 >> 6);
var o2 = i1 & 0x3F;
output[0] = Base64Characters[o0];
output[1] = Base64Characters[o1];
output[2] = Base64Characters[o2];
Encode0(input, output, len - 3, 3);
}
Decoding
We use reverse process to determine how many bytes will be in output. Each group of 4 characters encodes 3 bytes. If we have 2 extra characters, we add 1 more byte and decode with 4 padding bits. If we have 3 extra characters, we add 2 more bytes and decode with 2 padding bits.
var numberOfBytes = 3 * Math.DivRem(input.Length, 4, out int extraChars);
if (extraChars > 0)
{
numberOfBytes += extraChars - 1;
}
var output = new byte[numberOfBytes];
switch (extraChars)
{
case 0:
Decode0(input, output, 0, output.Length - 1);
break;
case 2:
Decode1(input, output);
break;
case 3:
Decode2(input, output);
break;
default:
throw new InvalidOperationException("Unreachable code");
}
return output;
Decoding Without Padding
This is the simplest case for decoding. We get 4 characters from input and generate 3 bytes in output. Please remember we need to put bytes to output from the end!
private void Decode0(string input, byte[] output, int inputIndex, int outputIndex)
{
while (inputIndex < input.Length)
{
var i0 = Base64Characters.IndexOf(input[inputIndex]);
var i1 = Base64Characters.IndexOf(input[inputIndex + 1]);
var i2 = Base64Characters.IndexOf(input[inputIndex + 2]);
var i3 = Base64Characters.IndexOf(input[inputIndex + 3]);
var o0 = (byte)((i0 << 2) | (i1 >> 4));
var o1 = (byte)((i1 << 4) | (i2 >> 2));
var o2 = (byte)((i2 << 6) | i3);
output[outputIndex] = o0;
output[outputIndex - 1] = o1;
output[outputIndex - 2] = o2;
inputIndex += 4;
outputIndex -= 3;
}
}
Decoding With 4 Padding Bits
In this case, we have 2 extra Base64
characters and we decode it into 1 byte with 4 padding bits. After we process these extra characters, we can continue the same as with no padding.
private void Decode1(string input, byte[] output)
{
var i0 = Base64Characters.IndexOf(input[0]);
var i1 = Base64Characters.IndexOf(input[1]);
var o0 = (byte)((i0 << 6) | i1);
var len = output.Length;
output[len - 1] = o0;
Decode0(input, output, 2, len - 2);
}
Decoding With 2 Padding Bits
In this case, we have 3 extra Base64
characters and we decode it into 2 bytes with 2 padding bits. After we process these extra characters, we can continue the same as with no padding.
private void Decode2(string input, byte[] output)
{
var i0 = Base64Characters.IndexOf(input[0]);
var i1 = Base64Characters.IndexOf(input[1]);
var i2 = Base64Characters.IndexOf(input[2]);
var o0 = (byte)((i0 << 4) | (i1 >> 2));
var o1 = (byte)((i1 << 6) | i2);
var len = output.Length;
output[len - 1] = o0;
output[len - 2] = o1;
Decode0(input, output, 3, len - 3);
}
Points of Interest
Our new encoding is sortable but there are some limitations worth mentioning. First, it only works for string
s of the same length, shorter string
s are not always placed before longer string
s. You can either pad string
s to be of the same size or implement your own comparer.
Default sorting | Sorting with prefix |
aa | +aa |
aaa | +ab |
aab | aaa |
ab | aab |
aba | aba |
abb | abb |
The other problem is that the sorting works in .NET only with ordinal string comparison. Azure blob storage lists blobs as expected but you need to specify the comparer explicitly in List.Sort() for example. When you sort the Base64
characters using the default comparer, it looks like this:
_+0123456789aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ
History
- 3rd November, 2019 - Initial release