Introduction
In this article, we investigate a simple, common framework for single-input, single-output image filters. Filters can be implemented in plain C++ running on the CPU, in nVidia CUDA running on the GPU, in C++ AMP (requires at least VS 2012 RC), or in a mix of these.
We are going to capture webcam input with OpenCV and process the captured frames using the implemented image filters.
Requirements
- Patched OpenCV 2.2 for Windows.
- CUDA 6.5 SDK which can work with Visual Studio 2013 compilers.
- Visual Studio 2013
- An nVidia graphics card supporting CUDA with recent drivers installed.
Background
This article is not about introducing OpenCV, CUDA or C++ AMP. Only short descriptions are provided.
What is Image Processing?
An image processing filter represents a mathematical operation applied to an image, such as color negation, arbitrary rotation, blurring, etc.
What is OpenCV?
OpenCV is an open-source computer vision library which, among other things, helps us capture webcam frames for this article.
What is CUDA?
CUDA is a game-changer in parallel programming: it is a paradigm shifter. It is not multicore computing, it is many-core computing. Why many? There are so many cores that you do not have to know exactly how many, and this lack of knowledge caters for transparent scalability.
Using CUDA, you can write C/C++ code that runs on the GPU, using hardware-optimized parallel processing.
CUDA is a proprietary technology of nVidia; to use it, you need an nVidia graphics card and a recent graphics driver with CUDA support.
What is C++ AMP?
C++ Accelerated Massive Parallelism is a library, provided by Microsoft and available in VS 2012 and up, which uses DirectX 11 under the hood for computations on the GPU and falls back to the CPU when no capable accelerator is present. More Info (MSDN)
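As a quick taste (a minimal standalone sketch, not part of the filter framework; the names are illustrative), an AMP computation is expressed as a parallel_for_each over an array_view:

// Minimal standalone C++ AMP example: doubling every element of a vector
// on the accelerator.
#include <amp.h>
#include <vector>
using namespace concurrency;

void doubleAll(std::vector<int>& v)
{
    array_view<int, 1> av((int)v.size(), v); // wrap host data for the GPU
    parallel_for_each(av.extent, [=](index<1> idx) restrict(amp)
    {
        av[idx] *= 2; // runs on the accelerator, or on the CPU fallback
    });
    av.synchronize(); // copy results back to the vector
}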
Using the Code
OpenCV Frame Grabbing Distilled
- Open webcam device.
- Grab first frame.
- If first frame is not OK, exit.
- Enter main loop
- Grab next frame.
- Resize frame if necessary.
- Call image filter with grabbed frame.
- Show filtered image on screen.
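A minimal sketch of how these steps might look with the OpenCV C++ API (the actual main.cpp may differ; the filter pointer and window name are illustrative, and the resize step is omitted for brevity):

#include <opencv2/opencv.hpp>

void runCaptureLoop(ISingleImageFilter* filter)
{
    cv::VideoCapture capture(0); // open webcam device
    cv::Mat frame;

    if (!capture.read(frame))    // grab first frame; exit if it is not OK
        return;

    filter->InitFilter(frame.cols, frame.rows);
    for (;;)
    {
        if (!capture.read(frame))                // grab next frame
            break;
        filter->FilterImage((char*)frame.data); // filter in place
        cv::imshow("Filtered", frame);           // show filtered image
        if (cv::waitKey(1) == 27)                // ESC quits the loop
            break;
    }
    filter->ReleaseFilter();
}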
Interfaces, Classes, Inheritance and More
The point we are building on is that the operation of a filter is independent of how the image is created. We could write the filtering method in main.cpp as a global function, or we can use an interface and call our filter method on an instance of a custom filter class.
In its simplest form, an interface for filtering an image needs at least a method to filter the image and a method to pass in the dimensions of the image; we also throw in a method to release the resources held by the implementer. Granted, we could pass the image dimensions in the filter method and clean up the memory before returning, but we need a way to allocate and release memory on the GPU only once, because memory operations on the GPU are expensive. That is why the ISingleImageFilter interface pairs the FilterImage method with InitFilter and ReleaseFilter methods.
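A sketch of the interface implied by the discussion above (the declaration in the accompanying source may differ in details):

class ISingleImageFilter
{
public:
    virtual void InitFilter(int width, int height) = 0; // allocate resources once
    virtual void FilterImage(char* imageData)      = 0; // process one frame in place
    virtual void ReleaseFilter()                   = 0; // free held resources
    virtual ~ISingleImageFilter() { }
};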
Simplest Filter of All: IdentityFilter
IdentityFilter returns the image as is, implementing the ISingleImageFilter interface.
class IdentityFilter : public SingleImageFilter
{
public:
    IdentityFilter()
    {
    }

    virtual void FilterImage(char* imageData)
    {
        return;
    }
};
To use any ISingleImageFilter, here is what you need to do. First, call InitFilter with the image dimensions. In the frame-grabbing loop, call FilterImage with the image data. FilterImage methods perform processing in place, meaning that the buffer you pass in will be overwritten with the result. When you are done with the filter and will not use it again, call the ReleaseFilter method.
Notice that we have only overridden the FilterImage method in IdentityFilter. Image dimensions are preserved in the base class SingleImageFilter, which is an abstract class that leaves only the FilterImage method pure virtual.
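The base class is then likely little more than storage for the dimensions; a sketch, assuming the source follows the description above:

class SingleImageFilter : public ISingleImageFilter
{
protected:
    int width;
    int height;

public:
    virtual void InitFilter(int width, int height)
    {
        this->width  = width;
        this->height = height;
    }
    virtual void FilterImage(char* imageData) = 0; // the only pure virtual member
    virtual void ReleaseFilter() { }
};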
Color Negation Filter on CPU
In this article, images are grabbed as three-channel BGR images with 8 bits per color channel. Color negation means that every channel of every pixel is inverted such that:
value + inverted value = max value
In the 8-bit case, the max value is 255, so here is the code for the color negation filter:
virtual void FilterImage(char* imageData)
{
    for(int i = 0; i < 3 * width * height; i++)
    {
        *( imageData + i ) = ( unsigned char ) ( 255 - *( imageData + i ) );
    }
}
Using CUDA in Filters
To be able to compile the CUDA samples, you need to install the CUDA 6.5 SDK.
Code that runs on the GPU is called a kernel. To mark a function as a kernel, you apply the __global__ qualifier.
__global__
void gpuInvert(
float* image,
int width,
int height
)
{
}
To launch a kernel, we use execution configuration parameters that tell the GPU how many blocks and threads we want our kernel to run on. Note that the simple grid calculation below assumes width and height are multiples of BLOCK_SIZE; otherwise the integer division would leave the remainder pixels unprocessed.
dim3 dimBlock( BLOCK_SIZE, BLOCK_SIZE );
dim3 dimGrid( width / dimBlock.x, height / dimBlock.y );
gpuInvert<<< dimGrid, dimBlock >>>( d_Image, width, height );
Image processing is a fertile field for CUDA because many image processing algorithms are a perfect fit for parallel processing, both in their memory access patterns and in their mathematical complexity.
We have mentioned that CUDA programs run on the GPU itself, so where should we put the data? We transfer the image data to the graphics card memory on every frame. We allocate the device memory in InitFilter and deallocate it in ReleaseFilter.
Base Class For CUDA-Enabled Filters
In the FilterImage method, we need to copy the image data to GPU memory, process the data and fetch the result back to our same-old RAM. Since we are using single-input, single-output image processing filters, we can use the same method signature among the GPU kernels too.
Since we are going to use kernel launcher function pointers a lot, we define a type for them.
typedef void (*ptKernelLauncher)(float*, int, int);
ptKernelLauncher kernelLauncher;
kernelLauncher = deviceInvertLaunch;
kernelLauncher( d_Image, width, height );
We do not launch the kernels directly from the SingleCudaFilter class. C/C++ code containing CUDA kernels needs to pass through nvcc, the nVidia compiler driver. We abstract the kernel code behind so-called launcher functions and call these launcher functions from our filter classes.
The SingleCudaFilter class thus has no kernel or launcher dependencies: we pass the function pointer to the launcher in the constructor of this class. When the FilterImage method is called in the image processing loop, SingleCudaFilter invokes the kernel launcher, which in turn launches the kernel on the GPU.
In this code fragment, the InitFilter, FilterImage and ReleaseFilter methods of the SingleCudaFilter class are shown.
class SingleCudaFilter : public SingleImageFilter
{
protected:
    float* h_Image; // page-locked host buffer holding the normalized image
    float* d_Image; // device buffer the kernel operates on

    ptKernelLauncher kernelLauncher;

public:
    explicit SingleCudaFilter( ptKernelLauncher kernelLauncher )
        : kernelLauncher(kernelLauncher)
    {
    }

    virtual void InitFilter(int width, int height)
    {
        SingleImageFilter::InitFilter(width, height);

        cudaMalloc( (void**) &d_Image, 3 * sizeof(float) * width * height );
        checkCUDAError("malloc device image");

        cudaMallocHost( (void**) &h_Image, 3 * sizeof(float) * width * height );
        checkCUDAError("malloc host image");
    }

    virtual void FilterImage(char* imageData)
    {
        // Normalize the 8-bit channels into [0, 1] floats.
        for(int i = 0; i < 3 * width * height; i++)
        {
            *(h_Image + i) = (unsigned char)*(imageData + i) / 255.0f;
        }

        cudaMemcpy( d_Image, h_Image, 3 * sizeof(float) * width * height, cudaMemcpyHostToDevice );
        checkCUDAError("FilterImage: memcpy");

        kernelLauncher( d_Image, width, height );

        cudaMemcpy( h_Image, d_Image, 3 * sizeof(float) * width * height, cudaMemcpyDeviceToHost );
        checkCUDAError("FilterImage: memcpy2");

        // Convert back to 8-bit channels, saturating to [0, 255].
        for(int i = 0; i < 3 * width * height; i++)
        {
            *(imageData + i) = satchar(*(h_Image + i) * 255);
        }
    }

    virtual void ReleaseFilter()
    {
        SingleImageFilter::ReleaseFilter();

        cudaFree( d_Image );
        checkCUDAError("free device image");

        cudaFreeHost( h_Image );
        checkCUDAError("free host image");
    }
};
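The checkCUDAError and satchar helpers used above are not listed in the article; plausible implementations (assumptions, the accompanying source may differ) look like this:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Print the last CUDA error, if any, and abort.
void checkCUDAError(const char* msg)
{
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
    {
        fprintf(stderr, "CUDA error: %s: %s\n", msg, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

// Saturate a float to [0, 255] before narrowing to a byte.
char satchar(float value)
{
    if (value < 0.0f)   return (char)0;
    if (value > 255.0f) return (char)255;
    return (char)value;
}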
Color Negation Filter on GPU
The CudaInvertFilter class is nothing but boilerplate code that passes the kernel launcher to the SingleCudaFilter class.
class CudaInvertFilter : public SingleCudaFilter
{
public:
    CudaInvertFilter()
        : SingleCudaFilter(deviceInvertLaunch)
    {
    }
};
And in the kernel, we do nothing fancy either. The only point to consider is that we invert the channels by subtracting them from 1 instead of 255, because we passed the normalized image to the kernel launcher in the FilterImage method.
__global__
void gpuInvert(
float* image,
int width,
int height
)
{
    int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
    int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    int cIdx = ( row * width + col ) * 3;

    *( image + cIdx )     = 1 - *( image + cIdx );
    *( image + cIdx + 1 ) = 1 - *( image + cIdx + 1 );
    *( image + cIdx + 2 ) = 1 - *( image + cIdx + 2 );
}
To launch the kernels, we have used the following arrangement:
void deviceInvertLaunch(
float *d_Image,
int width,
int height
)
{
dim3 dimBlock( BLOCK_SIZE, BLOCK_SIZE );
dim3 dimGrid( width / dimBlock.x, height / dimBlock.y );
#if ENABLE_TIMING_CODE
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
#endif
gpuInvert<<< dimGrid, dimBlock >>>( d_Image, width, height);
#if ENABLE_TIMING_CODE
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
float elapsedTime;
cudaEventElapsedTime(&elapsedTime, start, stop);
cudaThreadSynchronize();
printf("gpuInvert kernel time: %.3f ms\n", elapsedTime);
#endif
cudaThreadSynchronize();
checkCUDAError("kernel invocation");
}
Using Textures with CUDA
Filters implemented using the previously discussed method use the global memory of the graphics card, which is the slowest memory type available. We can use texture memory instead: it is faster and it is spatially cached, but it is read-only. The read-only restriction does not hurt us here, since we only read the source image through the texture, so texture memory it is.
After getting the texture support up and running, there was a problem: the nvcc compiler driver scopes texture references to the code file (translation unit) they are declared in. That means you cannot declare a texture in a header file and use it from different files. If you reference the texture from another code file, your code will compile, but boom! You have a zero (all-black) texture waiting for you.
To load textures from our base class and provide the kernel with the current frame in the texture, we need to access the texture by reference, which can be obtained through the CUDA runtime API.
Our texture woes are not over yet. It might be that I am doing something wrong, but it should not be this hard to use a texture in a kernel and load it from another file. When you get the texture reference from CUDA with the following code, you face the challenge that the API expects a const pointer. But if you declare the pointer const, how can you configure the texture parameters in the structure it points to? const_cast to the rescue.
const textureReference* constTexRefPtr;
textureReference* texRefPtr;
...
cudaGetTextureReference(&constTexRefPtr, textureSymbolName);
checkCUDAError("get texture reference");
texRefPtr = const_cast<textureReference*>( constTexRefPtr );
channelDesc = cudaCreateChannelDesc<float4>();
cudaMallocArray( &cu_array, &texRefPtr->channelDesc, width, height );
checkCUDAError("malloc device image");
...
cudaMemcpyToArray( cu_array, 0, 0, h_Image, sizeof(float4) * width * height, cudaMemcpyHostToDevice);
checkCUDAError("FilterImage: memcpy");
cudaBindTextureToArray( texRefPtr, cu_array, &texRefPtr->channelDesc );
You can dig into SingleCudaTexFilter.cu and SingleCudaTexFilter.h for the gruesome details of texture usage in CUDA.
Color Negation Filter on GPU using Textures
Using textures, we read the image data at any pixel with the CUDA function tex2D. Calling the kernel launcher is done within SingleCudaTexFilter.
CudaTexInvertFilter Class
class CudaTexInvertFilter : public SingleCudaTexFilter
{
public:
CudaTexInvertFilter()
: SingleCudaTexFilter(deviceTexInvertLaunch, "texInvert1")
{
}
};
gpuTexInvert Kernel
__global__
void gpuTexInvert(
float* image,
int width,
int height
)
{
int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
int cIdx = ( row * width + col ) * 3;
float tu = (float)col / width;
float tv = (float)row / height;
float4 texVal = tex2D( texInvert1, tu, tv );
*( image + cIdx ) = 1 - texVal.x;
*( image + cIdx + 1 ) = 1 - texVal.y;
*( image + cIdx + 2 ) = 1 - texVal.z;
}
Using C++ AMP in Filters
A trick for working with CUDA and C++ AMP in the same solution
The latest code on GitHub uses CUDA 6.5; the following paragraph is left intact on purpose.
CUDA 4.2 requires using the VS2008 (v90) toolset but, a big but, C++ AMP requires the VS2012 (v110) toolset. So when you try to compile a single project containing both CUDA and C++ AMP code, either the CUDA or the C++ AMP code won't compile. The solution is to create a new project for the C++ AMP code and convert the project containing the CUDA code to a DLL project. This way, our compiled CUDA filters are available to the AMP project.
The main project is now the AmpFilters project. The CUDA filters are instantiated by factory methods in FilterFactory.cpp.
Color Negation Filter on C++ AMP
In this filter, the color negation operation is implemented as a parallel_for_each kernel. The only drawback for the filter library is its use of char as the pixel color datatype: the char datatype is not supported within a restrict(amp) block, so we have to convert the data to int or one of the other AMP-supported types.
void AmpInvertFilter::FilterImage(char* imageData)
{
    // char is not usable inside restrict(amp) code, so copy the pixels into
    // a temporary unsigned int buffer first. Casting through unsigned char
    // avoids sign extension of channel values above 127.
    unsigned int* iImageData = (unsigned int*)malloc(3 * width * height * sizeof(int));
    for(int i = 0; i < 3 * width * height; i++)
    {
        *( iImageData + i ) = ( unsigned int )( unsigned char ) *( imageData + i );
    }

    const int size = 3 * width * height;
    array_view<unsigned> img(size, iImageData);

    parallel_for_each(
        img.extent,
        [=](index<1> idx) restrict(amp)
        {
            img[idx] = 255 - img[idx];
        }
    );
    img.synchronize();

    for(int i = 0; i < 3 * width * height; i++)
    {
        *( imageData + i ) = ( char ) *( iImageData + i );
    }
    free(iImageData); // release the temporary buffer to avoid leaking it every frame
}
Filter Chains
All these filters won't be much fun if we cannot run them one after another. Deriving from SingleImageFilter, the SingleImageFilterChain class calls its filters sequentially.
After creating an instance, use the AppendFilter method to add filters to the queue. When the FilterImage method is called, all the filters get a chance to process the data, one after another.
You can mix CPU, CUDA, CUDA texture and C++ AMP filters in a SingleImageFilterChain.
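A plausible sketch of the chain itself, assuming filters are appended before InitFilter is called (the actual class in the source may handle more cases):

#include <vector>

class SingleImageFilterChain : public SingleImageFilter
{
protected:
    std::vector<ISingleImageFilter*> filters;

public:
    void AppendFilter(ISingleImageFilter* filter)
    {
        filters.push_back(filter);
    }
    virtual void InitFilter(int width, int height)
    {
        SingleImageFilter::InitFilter(width, height);
        for (size_t i = 0; i < filters.size(); i++)
            filters[i]->InitFilter(width, height);
    }
    virtual void FilterImage(char* imageData)
    {
        // In-place filtering lets the output of one filter feed the next.
        for (size_t i = 0; i < filters.size(); i++)
            filters[i]->FilterImage(imageData);
    }
    virtual void ReleaseFilter()
    {
        for (size_t i = 0; i < filters.size(); i++)
            filters[i]->ReleaseFilter();
        SingleImageFilter::ReleaseFilter();
    }
};

For example, mixing a CUDA texture filter with a CPU filter looks like this: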
ISingleImageFilter* myFilter1 = new SingleCudaTexFilter(deviceTexAbsDiffLaunch, "texAbsDiff1");
ISingleImageFilter* myFilter2 = new CpuInvertFilter();
SingleImageFilterChain* myFilter = new SingleImageFilterChain();
myFilter->AppendFilter( myFilter1 );
myFilter->AppendFilter( myFilter2 );
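The chain is then used like any other filter:

myFilter->InitFilter( width, height );  // once, before the loop
// in the frame-grabbing loop:
myFilter->FilterImage( imageData );
// when done:
myFilter->ReleaseFilter();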
UML Schema of Implemented Classes
Sample Outputs
For full-size images, please visit http://dissipatedheat.com/2011/05/29/cuda-ile-opencv-kullanarak-webcam-goruntu-isleme/
Points of Interest
This has been a nice way to learn about webcam capture using OpenCV and about using CUDA with textures. There were some gotchas in the texture handling routines, but all are resolved now.
History
- Third version published 08.30.2014 - Now filtering can be done using C++ AMP on VS 2013 and CUDA 6.5
- Second version published 08.07.2012 - Now filtering can be done using C++ AMP on VS 2012 RC
- First version published 03.06.2011 - This article is also available on my blog: Dissipated Heat