Ultra High Quality Image Rotation on a GPU

16 Sep 2013 | LGPL3 | 8 min read
Ultra high quality frequency domain image rotation on a GPU.

Introduction

In this article we will explore how to rotate an image in the frequency domain on a Graphics Processing Unit (GPU). The quality of the rotation is quite staggering. We will rotate an image, then rotate the rotated image, and so on. Naturally we'd expect the image to lose quality, but we'll see that the degradation is minimal. You will also witness something surreal: when rotating the image by, say, a thousandth of a degree at 100fps you can still notice the image 'creeping' at sub-pixel level. We will be developing in .NET and targeting NVIDIA GPUs, and will therefore make use of CUDAfy.NET, a CUDA wrapper for .NET.

Frequency Domain Image Rotation on GPU

Lena - still looking good despite having been rotated 0.01 degrees 1000 times.

Background

Back at the end of the last century three guys got together to commercialize a rather unusual chip they were working on at a large Dutch research organization. The chip was going to become the world's fastest floating point Fast Fourier Transform processor: the PowerFFT. That may not say a lot to you, but if you work in areas such as radar processing or medical imaging it's quite a big deal, especially since it used only 3 Watts. A Fourier Transform converts a function of time or space into a function of frequency. If you do a Fourier Transform on a pure sine wave you'll get a graph with a single spike representing the frequency of the sine wave. Anyway, the company was named doubleBW and I worked with Laurens, Wout and Peter as a software engineer responsible for devising software to program this bizarre chip. One of the demonstrations we made was the frequency domain image rotator which is the subject of this article. While the PowerFFT chip was successfully produced, the business side never really took off and the company refocused on new areas. However, in the last few years the European Space Agency bought the IP and set to work creating a space qualified version. I was hired as a consultant on this, which caused me to revisit my old work and decide to try to port the old image rotator to the GPU. I dedicate this article to Laurens, Wout, Peter and all the other talented engineers and all round good blokes I worked with in my 5 years there.

doubleBW PowerFFT licensed by ESA.

Image Rotation in Frequency Domain

Broadly speaking there are three main phases in performing the rotation. In the first phase we perform a forward Discrete Fourier Transform (DFT) on each line of the image. In this example we have a grayscale image of size 512x512, so that is 512 DFTs of 512 points each. Since 512 is a power of 2 we could have used a radix-2 Fast Fourier Transform, but the CUDA FFT library handles transforms of pretty much any length and they appear to be equally fast. Once that is done we multiply every point by a coefficient. These coefficients are specific to the desired angle of rotation and there are three sets of coefficients, each set being the same size as the image and specific to one of the three phases. The name for these sets is twiddle vectors or twiddle factors. I'm not sure why. After this we return briefly to the spatial domain by using an inverse (or reverse) DFT. The CUDA FFT library does not normalize its transforms, so the result of the inverse DFT must be scaled by multiplying every value by 1/Width = 1/512.

Lena after one of the required three phases.

Lena - not looking so good after the first phase.
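To make a single phase concrete, here is a minimal CPU sketch of what happens to one line: forward DFT, multiply by twiddle factors, inverse DFT, scale by 1/N. It is not the article's GPU code. The class and method names (RowShiftSketch, Dft, ShiftTwiddles, ShiftRow) are hypothetical, the DFT is a naive O(N^2) loop rather than a library call, and it assumes that the twiddle factors of a phase implement a sub-pixel shift of each line via the Fourier shift theorem.

```csharp
using System;
using System.Numerics;

// CPU sketch of one phase applied to a single line (hypothetical names).
public static class RowShiftSketch
{
    // Naive O(N^2) DFT. sign = -1 is the forward transform, sign = +1 the
    // inverse. Neither direction is scaled, so the caller applies 1/N.
    public static Complex[] Dft(Complex[] input, int sign)
    {
        int n = input.Length;
        var output = new Complex[n];
        for (int k = 0; k < n; k++)
        {
            Complex sum = Complex.Zero;
            for (int j = 0; j < n; j++)
            {
                double phase = sign * 2.0 * Math.PI * k * j / n;
                sum += input[j] * new Complex(Math.Cos(phase), Math.Sin(phase));
            }
            output[k] = sum;
        }
        return output;
    }

    // Twiddle factors that shift a line by 'shift' pixels (may be fractional).
    public static Complex[] ShiftTwiddles(int n, double shift)
    {
        var twiddles = new Complex[n];
        for (int k = 0; k < n; k++)
        {
            // Bins above n/2 represent negative frequencies.
            int freq = k <= n / 2 ? k : k - n;
            double phase = -2.0 * Math.PI * freq * shift / n;
            twiddles[k] = new Complex(Math.Cos(phase), Math.Sin(phase));
        }
        return twiddles;
    }

    // Forward DFT, multiply by twiddles, inverse DFT, scale by 1/N.
    public static Complex[] ShiftRow(Complex[] row, double shift)
    {
        int n = row.Length;
        Complex[] spectrum = Dft(row, -1);
        Complex[] twiddles = ShiftTwiddles(n, shift);
        for (int k = 0; k < n; k++) spectrum[k] *= twiddles[k];
        Complex[] result = Dft(spectrum, +1);
        for (int j = 0; j < n; j++) result[j] *= 1.0 / n;
        return result;
    }
}
```

Calling RowShiftSketch.ShiftRow(row, 1.0) moves every sample one place along the line with wrap-around, while a fractional value such as 0.3 interpolates between pixels. Because the data stays complex, shifting by 0.3 and then by -0.3 returns the original line to within floating point error, which is the property that lets repeated rotations keep their quality.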

In the second phase we are going to perform the same process as phase one but in the vertical direction. It brings us a step closer to our desired result.

Lena after two of the required three phases.

Lena - looking better but something is still not quite right.

The third and final phase is as per phase one but, as with all phases, it uses its own twiddle vectors. Lena appears in all her rotated glory (well actually she doesn't, but this is a website for all ages so I will leave it to you to Google for the full uncropped image). You will also notice that the corners of the image are full of apparent garbage from the rounding off. The full image information is still present here, however, and if you rotate backwards by the same amount as you went forward you will fully recreate the original image.

Lena after all three of the required phases.
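The three phases (rows, columns, rows) match the well-known decomposition of a rotation into three shears, where each shear becomes a per-line phase multiplication through the Fourier shift theorem. The article does not give the formulae, so the following is a sketch under that assumption. The sign conventions and the choice of centre (x_c, y_c) are assumptions, and the article's own twiddle sets may be generated differently.

```latex
% Rotation by \theta as three shears (horizontal, vertical, horizontal)
R(\theta) =
\begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}
=
\begin{pmatrix} 1 & -\tan(\theta/2) \\ 0 & 1 \end{pmatrix}
\begin{pmatrix} 1 & 0 \\ \sin\theta & 1 \end{pmatrix}
\begin{pmatrix} 1 & -\tan(\theta/2) \\ 0 & 1 \end{pmatrix}

% Shift theorem for one line f[n] of length N with DFT F[k]
f[n - \delta] \;\longleftrightarrow\; F[k]\, e^{-2\pi i k \delta / N}

% Per-line shifts implied by the shears
\delta_{1,3}(y) = -\tan(\theta/2)\,(y - y_c), \qquad
\delta_{2}(x) = \sin\theta\,(x - x_c)
```

Each shear only moves pixels along one axis, which is why every phase can be done as independent 1D transforms, and a shear is undone exactly by the opposite phase multiplication.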

Using the code

You will need NVIDIA CUDA 5.5 64-bit installed on your PC. A deployment machine will also need the CUDA CUFFT and CUBLAS libraries accessible in either the executing folder or somewhere on the search path. If you do not have experience of CUDA or CUDAfy.NET then it is highly recommended that you take a look at my other articles on the subject.

Rotating an image in the frequency domain is a rather processing intensive operation. It is also an operation ideally suited to the GPU. Since we are working in .NET, CUDAfy.NET is a convenient wrapper for NVIDIA CUDA. CUDA is a means of using NVIDIA GPUs for compute, rather than only graphics. CUDAfy simplifies the use of CUDA in .NET applications. The methods and structures which will be used on the GPU are marked with the attribute Cudafy.

The example image is stored as a resource. This is accessed and displayed and the raw data extracted by calling GetBytes. All GPU functionality is encapsulated in the class GPUWrapper. Some interesting things happen in its constructor.

C#
public GPUWrapper(int deviceId)
{
    // Check if we already have a serialized CUDAfy module. 
    var mod = CudafyModule.TryDeserialize(GetType().Name);
    // If we do not have a serialized module or if the checksum
    // of the module does not match the assembly we recreate.
    if (mod == null || !mod.TryVerifyChecksums())
    {
        // We want to translate (cudafy) the specified types.
        // The result is stored in a CUDAfy module.
        mod = CudafyTranslator.Cudafy(typeof(TwiddleSettings), 
                        typeof(GPUWrapper), typeof(TwiddleGeneration));
        // Serialize the cudafy module to an xml file.
        mod.Serialize(GetType().Name);
    }
    // Get the CUDA device with index deviceId.
    _gpu = CudafyHost.GetDevice(eGPUType.Cuda, deviceId);
    // Load the CUDAfy module.
    _gpu.LoadModule(mod);
    // Instantiate object for storing pointers to GPU memory.
    _gdata = new GPUData(_gpu);
    // Instantiate object for CUDA FFT and BLAS libraries.
    _maths = new GPUMaths(_gpu);
}

The most important action is the line that calls CudafyTranslator.Cudafy(...). This routine translates the marked .NET code into CUDA C and then calls the NVIDIA C compiler to generate intermediate language for the GPU (called PTX). All this reflection info and other output (i.e., the PTX) is stored in the CUDAfy module object. TwiddleSettings is a simple struct that holds parameters for the twiddle vectors. GPUWrapper and TwiddleGeneration are classes that have some methods marked with the Cudafy attribute. These methods are translated into CUDA C functions which we can call from the host later.

The next interesting piece of code is for uploading the image and initializing helper classes. CopyToDevice and CopyFromDevice are used to transfer data between the host (CPU/system memory) and the GPU (global) memory. Launch calls a function on the GPU. The values we pass to Launch, such as Width and Height, state how many blocks and threads we want to launch in parallel: the first parameter is the number of blocks and the second is the number of threads per block.

C#
public void UploadImage(byte[] image, float angle)
{
    // Initialize the helper classes.
    _gdata.Set(Width, Height);
    _maths.Set(Width, Height);

    // Copy the image to the GPU and then launch function ConvertByteToComplex.
    _gpu.CopyToDevice(image, _gdata.SourceImage);
    _gpu.Launch(Width, Height).ConvertByteToComplex(
      _gdata.SourceImage, _gdata.SourceImageCplx, Width, Height);

    UpdateTwiddles(angle);
}

public void UpdateTwiddles(float angle)
{
    // Set the struct with our required settings
    TwiddleSettings ts = new TwiddleSettings()
    {
        angle = angle,
        height = Height,
        width = Width
    };
    // Launch function GenerateTwiddlesOnDevice to create the three sets of twiddle vectors.
    _gpu.Launch(Width, Height).GenerateTwiddlesOnDevice(ts, _gdata.TwiddleBuffers[0], 
                    _gdata.ComplexFBuffers[0], _gdata.TwiddleBuffers[2]);
    // Launch function Transpose to transpose (corner turn) the second twiddle vector set.
    _gpu.Launch(new dim3(Width / BLOCK_DIM, Height / BLOCK_DIM),
        new dim3(BLOCK_DIM, BLOCK_DIM)).Transpose(_gdata.ComplexFBuffers[0], 
               _gdata.TwiddleBuffers[1], Width, Height);
}
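With Launch(Width, Height) on a 512x512 image we start 512 blocks of 512 threads, that is one thread per pixel. The kernels themselves are not shown in the article, so the following CPU emulation is only a sketch of how such a launch is commonly mapped onto a flat array. The class name LaunchSketch and the index formula (block index times block size plus thread index) are assumptions, not the author's code.

```csharp
using System;
using System.Numerics;

// CPU emulation of a 1D launch: 'gridSize' blocks of 'blockSize' threads,
// each thread handling one element of a point-wise complex multiply.
public static class LaunchSketch
{
    public static void Multiply(Complex[] a, Complex[] b, Complex[] c,
                                int gridSize, int blockSize)
    {
        // On the GPU these two loops do not exist: every (block, thread)
        // pair runs in parallel and computes its own index.
        for (int blockIdx = 0; blockIdx < gridSize; blockIdx++)
        {
            for (int threadIdx = 0; threadIdx < blockSize; threadIdx++)
            {
                // Assumed mapping from block and thread to array element.
                int i = blockIdx * blockSize + threadIdx;
                // Guard against threads that fall outside the data.
                if (i < c.Length) c[i] = a[i] * b[i];
            }
        }
    }
}
```

Note that CUDA devices of compute capability 2.0 and later allow at most 1024 threads per block, so a launch shaped like this presumably only works while Height stays at or below that limit.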

The actual processing loop is performed in a BackgroundWorker. This causes a small headache for the GPU code because it means we are doing work on a thread other than the one in which the GPU was initialized. Think of this in much the same way as needing to use Invoke to interact with the UI from a thread other than the main UI one. The project makes liberal use of this too, to update the UI from the BackgroundWorker. To handle this we need to call EnableMultithreading() from the main thread and then SetCurrentContext() on the child thread before interacting with the GPU. Here is the code for the Process method. You can see how a combination of the CUDA FFT library, our own Multiply GPU function and the CUDA Basic Linear Algebra Subprograms (BLAS) library are used. We effectively launch 12 GPU functions in order to perform one rotation, which gives a hint at how intensive this operation is. The last line of code copies the newly rotated image on top of the source image, so the next time we call Process we perform the rotation on the previously rotated image. This is a good test of the quality of frequency domain image rotation.

C#
public void Process(eProcessType type = eProcessType.Full)
{
    float scale = 1.0F / (float)Width;

    // All the calls below start functions on the GPU.
    // 1st pass - forward FFT, multiply with twiddle vectors, inverse FFT and scale the result.
    _maths.fwdPlan.Execute(_gdata.SourceImageCplx, _gdata.ComplexFBuffers[0]);
    _gpu.Launch(Width, Height).Multiply(_gdata.ComplexFBuffers[0], 
      _gdata.TwiddleBuffers[0], _gdata.ComplexFBuffers[1], Width, Height);
    _maths.invPlan.Execute(_gdata.ComplexFBuffers[1], _gdata.ComplexFBuffers[2], true);
    _maths.blas.SCAL(scale, _gdata.ComplexFBuffers[2]);

    if (type != eProcessType.OnePass)
    {
        // 2nd pass - corner turned forward FFT, multiply
        // with twiddle vectors, corner turned inverse FFT and scale.
        _maths.fwdPlanT.Execute(_gdata.ComplexFBuffers[2], _gdata.ComplexFBuffers[0]);
        _gpu.Launch(Width, Height).Multiply(_gdata.ComplexFBuffers[0], 
          _gdata.TwiddleBuffers[1], _gdata.ComplexFBuffers[1], Width, Height);
        _maths.invPlanT.Execute(_gdata.ComplexFBuffers[1], _gdata.ComplexFBuffers[2], true);
        _maths.blas.SCAL(scale, _gdata.ComplexFBuffers[2]);
    }

    if (type == eProcessType.Full)
    {
        // 3rd pass - forward FFT, multiply with twiddle vectors, inverse FFT and scale.
        _maths.fwdPlan.Execute(_gdata.ComplexFBuffers[2], _gdata.ComplexFBuffers[0]);
        _gpu.Launch(Width, Height).Multiply(_gdata.ComplexFBuffers[0], 
          _gdata.TwiddleBuffers[2], _gdata.ComplexFBuffers[1], Width, Height);
        _maths.invPlan.Execute(_gdata.ComplexFBuffers[1], _gdata.ComplexFBuffers[2], true);
        _maths.blas.SCAL(scale, _gdata.ComplexFBuffers[2]);
    }

    // Copy to source image location.
    _gpu.CopyOnDevice(_gdata.ComplexFBuffers[2], _gdata.SourceImageCplx);
}

To visualize the image we first convert it to 8-bit grayscale on the GPU and then copy the result from the GPU to the host.

C#
public void DownloadImage(byte[] buffer)
{
    CheckIsSet();
    // Convert the complex float array to byte array.
    _gpu.Launch(Width, Height).ConvertComplexToByte(
             _gdata.SourceImageCplx, _gdata.ResultImage, Width, Height);
    _gpu.CopyFromDevice(_gdata.ResultImage, buffer);
}
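The ConvertComplexToByte kernel is not shown in the article. As a rough idea of what such a conversion involves, here is a CPU sketch that assumes the pixel value is carried in the real part and that values outside 0 to 255 are clamped, since sub-pixel interpolation can overshoot the original range. The class and method names are hypothetical and the real kernel may well do something different.

```csharp
using System;
using System.Numerics;

// CPU sketch of converting complex pixels back to 8-bit grayscale.
public static class ConvertSketch
{
    public static byte[] ComplexToByte(Complex[] pixels)
    {
        var bytes = new byte[pixels.Length];
        for (int i = 0; i < pixels.Length; i++)
        {
            // Assumption: the real part holds the intensity.
            double value = Math.Round(pixels[i].Real);
            // Clamp to the valid byte range before casting.
            bytes[i] = (byte)Math.Max(0.0, Math.Min(255.0, value));
        }
        return bytes;
    }
}
```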

License

The Cudafy.NET SDK includes two large example projects featuring, amongst others, ray tracing, ripple effects, and fractals. Many of the examples are fully supported on both CUDA and OpenCL. It is available as a dual-licensed software library. The LGPL version is suitable for the development of proprietary or Open Source applications if you can comply with the terms and conditions contained in GNU LGPL version 2.1. Visit the Cudafy website for more information. If using the LGPL version we ask you to please consider making a donation to Harmony through Education. This small charity is helping handicapped children in developing countries. Read more on the charity page of the Cudafy website.

Points of Interest

The original rotator on a PowerFFT PCI card in 2002 could manage 45fps including the PCI transfers and graphics display. A laptop from 2011 with an NVIDIA GeForce GT 540M can manage 90fps. Nine years is an age in computer terms, so the doubleBW PowerFFT was way ahead of its time, and in terms of speed vs. power consumption it may still have the latest GPUs beat. This is probably why the chip may still go into orbit in a future European Space Agency mission.

You can find out more about CUDAfy.NET, get support and access the latest release from either the Hybrid DSP Systems website or from the CodePlex site.

History

  • First release.

License

This article, along with any associated source code and files, is licensed under The GNU Lesser General Public License (LGPLv3).