Introduction
Base64 is used to store binary data, such as images, as text. It can be used in web pages, but is more commonly seen in email applications and in XML files.
The .NET Framework includes a base64 encoder in its Convert class. Since base64 encoding involves a fair amount of bit manipulation that can be done in parallel, is it perhaps a task better handled by a graphics processing unit (GPU)?
Background
Base64 uses 64 characters as its base (decimal uses 10, hex uses 16). The 64 characters are A-Z, a-z, 0-9, '+' and '/'. Three bytes can be represented
by four base64 digits. If the total number of bytes is not a multiple of three, then padding characters are used; the padding character is '='. The total length
of the base64 string for n input bytes is therefore 4 * ceil(n / 3). Each block of three bytes contains 24 bits in total (3 * 8 bits). Since 6 bits can represent
64 different values, four 6-bit values are needed to contain the 24 bits. Base64 encoding consists of processing these blocks of three bytes, and since each block
is independent of every other block, it would appear to be a suitable task for a GPU.
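As a quick worked illustration (the standard RFC 4648 example, not taken from the original article), the three ASCII bytes of "Man" encode to "TWFu":

// 'M' = 0x4D, 'a' = 0x61, 'n' = 0x6E
// 24 bits:          01001101 01100001 01101110
// as 6-bit groups:  010011 010110 000101 101110  =  19, 22, 5, 46
// indexed into the 64-character alphabet: 'T', 'W', 'F', 'u'
Console.WriteLine(Convert.ToBase64String(new byte[] { 0x4D, 0x61, 0x6E })); // prints TWFu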
Though it is becoming more common, the use of GPUs for non-graphics, general purpose programming remains a niche. Typical business and web programming still rarely
makes use of GPUs and there are a number of good reasons for this. Finding suitable performance bottlenecks and then porting an algorithm or business logic to make use
of a GPU can be time consuming, and the pay-off can be uncertain. Furthermore, the target hardware platform may not have a suitable GPU. Despite this, given the right task, GPUs can deliver massive performance gains.
NVIDIA still leads the way in general purpose GPU computing (GPGPU). Their CUDA language and associated tools and libraries make it relatively easy to get started in C/C++.
In .NET, it is a different story, requiring the use of PInvoke and the so-called CUDA driver API. This is rather cumbersome. CUDA.NET simplified matters a little,
however the programming model remained that of the driver API, and kernel (GPU device) code still needed to be written in CUDA C. CUDAfy.NET goes a stage further
and hides much of the complexity of the driver API, allows kernel code to also be written in .NET, and uses standard .NET types (e.g., arrays of value types).
Configuring your PC
First, be sure you have a relatively recent NVIDIA graphics card, one that supports CUDA. You'll then need to go to the NVIDIA CUDA website and download
the CUDA 5.5 Toolkit. Install this in the default location. Next, ensure that you have the latest
NVIDIA drivers. These can be obtained through NVIDIA Update or from the NVIDIA website.
The project here was built using Visual Studio 2010 and the language is C#, though VB and other .NET languages should be fine. Visual Studio Express can also be used
without problem, though bear in mind that it can only create 32-bit apps. To get this working with Express, you need to visit the
Visual Studio Express website:
- Download and install Visual C++ 2010 Express (the NVIDIA compiler requires this)
- Download and install Visual C# 2010 Express
- Download and install CUDA 5.5 Toolkit (32-bit and/or 64-bit)
- Ensure that the C++ compiler (cl.exe) is on the search path (Environment variables)
This set-up of the CUDA compiler (NVCC) is actually the toughest stage of the whole process, so please persevere. Read any errors you get carefully - most likely they
relate to cl.exe not being found, or to not having the right 32-bit or 64-bit CUDA Toolkit. Try to get NVCC working on its own, without CUDAfy (running nvcc --version from a command prompt is a quick sanity check). If you still have problems, then please let me know.
The Code
The simplest way of converting to base64 in .NET is to use the Convert.ToBase64String
method. However, let's look at how we can implement such a method ourselves.
Below is a method that takes a byte array and converts it to a base64 string. The code is based on C code by René Nyffenegger.
// The 64-character alphabet used for encoding (referenced below).
private const string base64_chars =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

public static string base64_encode(byte[] bytes_to_encode)
{
    int i = 0;
    int j = 0;
    int ctr = 0;
    // Output length is four characters per (started) group of three input bytes.
    StringBuilder sb = new StringBuilder(4 * ((bytes_to_encode.Length + 2) / 3));
    int[] char_array_3 = new int[3];
    int[] char_array_4 = new int[4];
    int in_len = bytes_to_encode.Length;
    while (in_len-- > 0)
    {
        char_array_3[i++] = bytes_to_encode[ctr++];
        if (i == 3)
        {
            // Split 3 x 8 bits into 4 x 6-bit alphabet indices.
            char_array_4[0] = (char_array_3[0] & 0xfc) >> 2;
            char_array_4[1] = ((char_array_3[0] & 0x03) << 4) + ((char_array_3[1] & 0xf0) >> 4);
            char_array_4[2] = ((char_array_3[1] & 0x0f) << 2) + ((char_array_3[2] & 0xc0) >> 6);
            char_array_4[3] = char_array_3[2] & 0x3f;
            for (i = 0; i < 4; i++)
                sb.Append(base64_chars[char_array_4[i]]);
            i = 0;
        }
    }
    // Handle the final group of one or two remaining bytes.
    if (i != 0)
    {
        for (j = i; j < 3; j++)
            char_array_3[j] = 0;
        char_array_4[0] = (char_array_3[0] & 0xfc) >> 2;
        char_array_4[1] = ((char_array_3[0] & 0x03) << 4) + ((char_array_3[1] & 0xf0) >> 4);
        char_array_4[2] = ((char_array_3[1] & 0x0f) << 2) + ((char_array_3[2] & 0xc0) >> 6);
        char_array_4[3] = char_array_3[2] & 0x3f;
        for (j = 0; j < i + 1; j++)
            sb.Append(base64_chars[char_array_4[j]]);
        // Pad the output to a multiple of four characters.
        while (i++ < 3)
            sb.Append('=');
    }
    return sb.ToString();
}
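As a quick check (my own example, not from the article), the method agrees with the framework's encoder:

byte[] data = { 1, 2, 3, 4, 5 };
Console.WriteLine(base64_encode(data));          // AQIDBAU=
Console.WriteLine(Convert.ToBase64String(data)); // AQIDBAU=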
The GPU Kernel Code
We will use this code as a basis for our GPU kernel. This is the code that will actually run on the GPU. Methods that can run on the GPU must be static and decorated
with the Cudafy
attribute. Thousands of instances of device methods can run in parallel, so we need a means of identifying threads from within the method.
The GThread
parameter is how this can be done. The two other parameters are the input and output arrays. The method Gettid
returns the unique
ID of the thread. Threads are launched in a grid of blocks, where each block contains threads. Hence we calculate the ID by multiplying the block ID by
the block size (dimension) and adding the thread ID within the block. Grids and blocks can be multidimensional, but here we only use one dimension, signified by x.
Since each thread will process a group of three bytes, the offset into the input array is three times the thread ID. The offset into the output array is four times
the thread ID. What follows is a fairly straightforward port of the code discussed above. Since String
is not supported in kernel code, we simply place
the string as a constant in the code.
[Cudafy]
public static void ToBase64String(GThread thread, byte[] input, char[] output)
{
    int tid = Gettid(thread);    // unique thread ID
    int itid = tid * 3;          // each thread reads three input bytes
    int otid = tid * 4;          // and writes four output characters
    // input.Length is the full device buffer length; the guard stops
    // threads from reading past the end of the buffer.
    if (itid + 2 < input.Length)
    {
        byte a0 = input[itid];
        byte a1 = input[itid + 1];
        byte a2 = input[itid + 2];
        // Split 3 x 8 bits into 4 x 6-bit alphabet indices.
        byte b0 = (byte)((a0 & 0xfc) >> 2);
        byte b1 = (byte)(((a0 & 0x03) << 4) + ((a1 & 0xf0) >> 4));
        byte b2 = (byte)(((a1 & 0x0f) << 2) + ((a2 & 0xc0) >> 6));
        byte b3 = (byte)(a2 & 0x3f);
        // String variables are not supported in kernel code, so the
        // alphabet appears as a literal constant.
        output[otid] = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"[b0];
        output[otid + 1] = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"[b1];
        output[otid + 2] = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"[b2];
        output[otid + 3] = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"[b3];
    }
}
[Cudafy]
public static int Gettid(GThread thread)
{
    // Global thread ID: block index * block size + thread index within the block.
    int tid = thread.blockIdx.x * thread.blockDim.x + thread.threadIdx.x;
    return tid;
}
The GPU Host Code
The code for interacting with the GPU is implemented in the class named GConvert
.
A reference is made to Cudafy.NET.dll and the following namespaces are included:
using Cudafy;
using Cudafy.Host;
using Cudafy.Translator;
There are two methods implemented for performing the conversion. Both do the same thing, taking the byte array to be encoded and a Stream
for the output.
We'll look at the more straightforward of the two first.
Writing to the output stream is done asynchronously. Since StreamWriter
does not contain Begin and End methods for asynchronous writing, a delegate is used.
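The delegate itself is not shown in the listings here; a minimal sketch consistent with how it is invoked in ToBase64Naive below might look like this (the signature is inferred from the BeginInvoke call, so treat it as an assumption):

private delegate void BeginWriteDelegate(StreamWriter sw, char[] buffer, int index, int count);

// Wraps a synchronous StreamWriter.Write; BeginInvoke/EndInvoke on the
// delegate give us the Begin/End pattern that StreamWriter lacks.
private static void BeginWrite(StreamWriter sw, char[] buffer, int index, int count)
{
    sw.Write(buffer, index, count);
}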
The class member _gpu
refers to the CUDAfy GPGPU
class and represents a single GPU device. It is passed into the GConvert
constructor.
The processing works as follows:
1. Transfer a chunk of the input data to the GPU device.
2. Process the chunk (Launch).
3. Transfer the resultant chunk back to the host.
4. Write the results asynchronously to the output stream.
5. If more data is available, then return to step 1.
6. Once all chunks have been processed, check for any remaining bytes and encode them on the host.
public void ToBase64Naive(byte[] inArray, Stream outStream)
{
    int totalBytes = inArray.Length;
    int ctr = 0;
    int chunkIndex = 0;
    int threadsPerBlock = 256;
    StreamWriter sw = new StreamWriter(outStream);
    BeginWriteDelegate beginWrite = new BeginWriteDelegate(BeginWrite);
    IAsyncResult res = null;
    while (totalBytes > 0)
    {
        int chunkSize = Math.Min(totalBytes, MAXCHUNKSIZE);
        int outChunkSize = (chunkSize * 4) / 3;
        // Step 1: transfer a chunk of the input data to the GPU device.
        _gpu.CopyToDevice(inArray, ctr, _inArrays_dev[chunkIndex], 0, chunkSize);
        // Step 2: launch one thread per 3-byte group.
        int blocksPerGrid =
            (chunkSize + (threadsPerBlock * 3) - 1) / (threadsPerBlock * 3);
        _gpu.Launch(blocksPerGrid, threadsPerBlock, "ToBase64String",
            _inArrays_dev[chunkIndex], _outArrays_dev[chunkIndex]);
        // Step 3: transfer the resultant chunk back to the host.
        _gpu.CopyFromDevice(_outArrays_dev[chunkIndex], 0,
            _outArrays[chunkIndex], 0, outChunkSize);
        // Step 4: write the results asynchronously, first waiting for any
        // previous write to complete.
        if (res != null)
            beginWrite.EndInvoke(res);
        res = beginWrite.BeginInvoke(sw, _outArrays[chunkIndex], 0,
            outChunkSize, null, null);
        chunkIndex++;
        totalBytes -= chunkSize;
        ctr += chunkSize;
        if (chunkIndex == MAXCHUNKS)
            chunkIndex = 0;
    }
    if (res != null)
        beginWrite.EndInvoke(res);
    // Encode any final partial group on the host. The GPU has already
    // produced the first 'remainder' characters of this group (they depend
    // only on the valid leading bytes of the group), so they are dropped
    // from the host-encoded string before writing.
    int remainder = inArray.Length % 3;
    if (remainder != 0)
    {
        string s = Convert.ToBase64String(inArray,
            inArray.Length - remainder,
            remainder).Remove(0, remainder);
        sw.Write(s);
    }
    sw.Flush();
}
Of interest here are the following variables:
- MAXCHUNKSIZE: A constant that must be a multiple of 3.
- MAXCHUNKS: The maximum number of chunks.
- _inArrays_dev: On the GPU, MAXCHUNKS arrays each of MAXCHUNKSIZE bytes.
- _outArrays_dev: On the GPU, MAXCHUNKS arrays each of MAXCHUNKSIZE * 4/3 chars.
- _outArrays: On the host, MAXCHUNKS arrays each of MAXCHUNKSIZE * 4/3 chars.
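The article does not list the actual definitions; plausible values might be (both values are assumptions for illustration only):

private const int MAXCHUNKSIZE = 3 * 1024 * 1024; // must be a multiple of 3
private const int MAXCHUNKS = 2;                  // enough to double-buffer chunks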
These arrays are instantiated in the GConvert
constructor. The process of Cudafying is detailed in an earlier article,
Using Cudafy for GPGPU Programming in .NET.
Basically, the GPU code is contained in the class GPUConvertCUDA
and we create a module with the name of that class. Using the IsModuleLoaded
method,
we check if the module is already loaded. If not, we try to deserialize the module from an XML file named 'GPUConvertCUDA.cdfy'. If the file does not exist,
or the serialized module was created from a different version of the assembly, then we Cudafy again. We pass the auto platform (x86 or x64) and
CUDA 1.2 architecture parameters (newer NVIDIA devices support higher compute capabilities), along with the type of the class we want to Cudafy. To save time the next time the program
runs, we serialize the module to an XML file.
public GConvert(GPGPU gpu)
{
    _gpu = gpu;
    string moduleName = typeof(GPUConvertCUDA).Name;
    if (!_gpu.IsModuleLoaded(moduleName))
    {
        // Try to load a previously serialized module; re-Cudafy if it is
        // missing or was built from a different version of the assembly.
        var mod = CudafyModule.TryDeserialize(moduleName);
        if (mod == null || !mod.TryVerifyChecksums())
        {
            Debug.WriteLine("Cudafying...");
            mod = CudafyTranslator.Cudafy(ePlatform.Auto,
                eArchitecture.sm_12, typeof(GPUConvertCUDA));
            mod.Serialize(moduleName);
        }
        _gpu.LoadModule(mod);
    }
    _inArrays_dev = new byte[MAXCHUNKS][];
    _outArrays_dev = new char[MAXCHUNKS][];
    _outArrays = new char[MAXCHUNKS][];
    _inStages = new IntPtr[MAXCHUNKS];
    _outStages = new IntPtr[MAXCHUNKS];
    for (int c = 0; c < MAXCHUNKS; c++)
    {
        // Device buffers for the input bytes and output characters.
        _inArrays_dev[c] = _gpu.Allocate<byte>(MAXCHUNKSIZE);
        _outArrays_dev[c] = _gpu.Allocate<char>((MAXCHUNKSIZE * 4) / 3);
        // Pinned (page-locked) host staging buffers used by the asynchronous version.
        _inStages[c] = _gpu.HostAllocate<byte>(MAXCHUNKSIZE);
        _outStages[c] = _gpu.HostAllocate<char>((MAXCHUNKSIZE * 4) / 3);
        _outArrays[c] = new char[(MAXCHUNKSIZE * 4) / 3];
    }
}
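Since HostAllocate returns pinned memory that the garbage collector does not manage, the class needs matching cleanup. A minimal sketch, assuming the CUDAfy HostFree and Free methods (this Dispose is hypothetical and not shown in the article):

public void Dispose()
{
    for (int c = 0; c < MAXCHUNKS; c++)
    {
        _gpu.HostFree(_inStages[c]);   // release pinned host buffers
        _gpu.HostFree(_outStages[c]);
        _gpu.Free(_inArrays_dev[c]);   // release device buffers
        _gpu.Free(_outArrays_dev[c]);
    }
}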
_inStages and _outStages are used by ToBase64, the optimized version of the ToBase64Naive method. It makes use of pinned
memory for faster transfers and for performing asynchronous transfers and kernel launches. Use HostAllocate
for allocating pinned memory.
Remember that pinned memory is limited compared to unpinned memory and that you will need to free it manually using HostFree
. The performance increase
gained by using pinned memory for transfers can be around 50%; however, this is offset by the need to copy from managed memory into pinned memory.
Ultimately, the speed increase can be insignificant compared to the normal transfers used in the ToBase64Naive
method. However, pinned memory does bring another
benefit: asynchronous transfers and kernel launches. Transfers do not occur in parallel with each other; instead, they are placed in a queue.
Transfers can, however, run in parallel with kernel launches. Correct usage of stream IDs is required. Do not use stream ID 0 for this purpose - stream 0 synchronizes all streams
irrespective of ID. All operations with the same stream ID are guaranteed to be performed in sequence.
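As a rough sketch of queuing one chunk on a non-zero stream (illustrative only; the asynchronous method names CopyToDeviceAsync, LaunchAsync, CopyFromDeviceAsync, and SynchronizeStream should be verified against the CUDAfy version you are using):

int streamId = 1; // never stream 0: it synchronizes all streams
_gpu.CopyToDeviceAsync(_inStages[0], 0, _inArrays_dev[0], 0, chunkSize, streamId);
_gpu.LaunchAsync(blocksPerGrid, threadsPerBlock, streamId, "ToBase64String",
    _inArrays_dev[0], _outArrays_dev[0]);
_gpu.CopyFromDeviceAsync(_outArrays_dev[0], 0, _outStages[0], 0, outChunkSize, streamId);
_gpu.SynchronizeStream(streamId); // block until this stream's queue has drained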
Using the GConvert Class
Get a GPU device using the static CudafyHost.GetDevice
method. The device is passed to the GConvert
constructor.
Conversion takes two arguments: the byte array and the output stream.
GPGPU gpu = CudafyHost.GetDevice(eGPUType.Cuda);
GConvert gc = new GConvert(gpu);
// ba is the byte array to encode; gpuStream is the destination output stream.
gc.ToBase64(ba, gpuStream);
Benchmarks
We want to compare a number of different ways of doing base64 encoding. The following benchmarks were run on an Intel Core i5-450M CPU with an NVIDIA GeForce GT 540M GPU:
- Our own base64 encoder
- GPU Naive
- GPU Asynchronous
- .NET Convert
Test | 1 MByte (ms) | 12 MBytes (ms) | 24 MBytes (ms) | 36 MBytes (ms) |
Base64 encoder | 52 | 612 | 1232 | 1826 |
GPU Naive | 5 | 43 | 85 | 141 |
GPU Asynchronous | 5 | 39 | 76 | 113 |
.NET Convert | 3 | 53 | 106 | 349 |
And now as a graph; lower is better.
Discussion
For small amounts of binary data, the differences in speed are relatively small. For large data sets, the GPU provided a massive performance gain over our
own base64 implementation. The difference between the simpler, naive GPU implementation and the asynchronous pinned-memory version was useful but not earth-shattering;
if you need to squeeze every last bit of performance out of the GPU, it is worthwhile. What is most significant, however, is the excellent result
of the out-of-the-box .NET implementation. In the vast majority of cases, it will be almost as fast as the GPU, only being beaten for much larger arrays.
If, when you run the application, your GPU does much worse relative to the .NET version than the results shown here, there is likely a good reason
for this. These benchmarks were run on the NVIDIA Fermi architecture. This design gives far better performance when accessing small, irregular blocks
of global memory, as we do here when reading three individual bytes per thread. There are ways of circumventing this issue through clever use of shared memory;
however, as in any programming task, one has to draw limits to the time spent implementing. The code here was a simple port of the CPU version.
So, why is the .NET base64 encoder so much faster than our own CPU version? It is extremely unlikely that it is implemented in straightforward
C# as ours is. Most likely, it makes use of special low-level instructions such as SSE, much the same as FFTW (the Fastest Fourier Transform in the West) can perform an FFT faster
than we can step through an array of the same length doing a simple multiplication on each element. Due to the simple, sequential, in-order nature of base64 encoding,
an optimized CPU version can match a GPU. Note that the .NET version could likely be made even faster by explicitly breaking the data up and encoding it in multiple threads,
say one per core. Take care also that writing to the stream does not become the bottleneck; remove that step if necessary to ensure a true comparison.
Writing with a StreamWriter is not as fast as writing with a BinaryWriter.
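To illustrate the multi-threading point, here is a minimal sketch (my own, not from the article) of splitting the work across cores with Parallel.For; the chunk size must be a multiple of 3 so that every chunk except the last encodes independently, without padding:

static string ParallelToBase64(byte[] data)
{
    int chunkSize = 3 * 1024 * 1024;                  // any multiple of 3 works
    int chunks = (data.Length + chunkSize - 1) / chunkSize;
    var parts = new string[chunks];
    System.Threading.Tasks.Parallel.For(0, chunks, i =>
    {
        int offset = i * chunkSize;
        int count = Math.Min(chunkSize, data.Length - offset);
        parts[i] = Convert.ToBase64String(data, offset, count);
    });
    return string.Concat(parts);                      // only the final part may carry '='
}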
Can base64 encoding still be useful on a GPU? Yes, for very large data sets, and particularly if the data to be encoded is already on the GPU. This would remove the need
to copy the data to the GPU first. Furthermore, it would reduce the load on valuable CPU resources.
License
The Cudafy.NET SDK includes two large example projects featuring, amongst others, ray tracing, ripple effects, and fractals. It is available as a dual license software library.
The LGPL version is suitable for the development of proprietary or Open Source applications if you can comply with the terms and conditions contained
in the GNU LGPL version 2.1. Visit the Cudafy website for more information.
History
- 2nd November, 2011
- 17th September, 2013: Updated to use CUDAfy V1.26 and CUDA 5.5