Introduction
Note: This article is still relevant, but I have changed my approach to GPU programming. I now use CUDA with Java and JCuda from an Eclipse IDE. See my new approach at CodeProject Article 513265.
Getting NVidia Cuda up and running on a Visual Studio Express budget can be frustrating, particularly if you want to access Cuda functions from managed code. There are plenty of resources online to help you on your way, but you have to combine information from different sources while avoiding certain dead ends. It’s a little hit and miss. I hope you can benefit from my journey so far.
For now, I decided to keep it simple: use VS 2008 Express, write my own wrappers, and stick to the x86 platform. Here’s how I succeeded:
Background
- I have not configured Cuda for VS 2010 Express. I understand that part of the process requires configuring your 2010 project to use the VS 2008 (VC90) compiler instead of the VS 2010 (VC100) compiler. Most likely a few other hacks are required to get things going. There appear to be some resources that provide direction on doing this. In particular, one article that looks promising is at http://blog.cuvilib.com/2011/02/24/how-to-run-cuda-in-visual-studio-2010/
- Running managed code using configurations other than x86 did not work for me. There are several convoluted posts on the web concerning this configuration with VS Express. Google search “Visual C++ 2008 Express Edition And 64-Bit Targets” for some entertaining ways to break your VS Express install.
- Working out the install in a virtual machine first is a good idea, but it was unclear to me how to access the host’s GPU hardware directly from my guest machine. My VBox virtual graphics adapter is not Cuda enabled and, as best I can tell, Cuda no longer easily supports the emulator mode. So I used the standard technique: make mistakes, break the install, reinstall, and follow the smoke.
- I am particularly interested in Fourier transforms on the GPU. Only a few of the canned wrappers sport CUFFT functionality. Cudafy (CodePlex) seemed the most promising, but it’s not (yet) an out-of-the-box setup when you have VS Express.
First time setup
- Be sure you have a Cuda enabled card. NVidia has an exhaustive list of compatible GPUs on their Developer Zone web site: http://developer.nvidia.com/cuda-gpus (I have a GeForce GTX 560 GPU.) If you are not sure, have a look at the GPU Caps Viewer. I am usually hesitant to download many utilities like this, but I have used this application for a few years now, it is widely recognized, and it has a solid green WOT rating. It will fairly reliably identify your GPU and report its OpenGL and Cuda capabilities.
Take Note: From release notes in Toolkit (Start -> Programs -> NVidia):
The Win7 environment variables need to be fixed on the v4.1 RC2 installation for
Windows7-x64: Environment variables written by the installer may have mistakenly
included an extra slash in the path specification.
- CUDA_BIN_PATH %CUDA_PATH%\bin
- CUDA_INC_PATH %CUDA_PATH%\include
- CUDA_LIB_PATH %CUDA_PATH%\lib\x64
- From a command window run: nvcc -V (You should see a message reporting the Cuda compilation tools release.)
- Find bandwidthTest.exe (C:\ProgramData\NVIDIA Corporation\NVIDIA GPU Computing SDK 4.1\C\bin\win64\Release) and run it.
- Also try oceanFFT.exe
Creating projects
Example: A simple bare-bones wrapper for FFT:
- Create a new, empty Win32 project named BareBonesCuda. Check the “DLL” checkbox on the next page.
- Add a source file of type cpp, but name it with a .cu extension, e.g. test.cu
- Right-click the project and choose Custom Build Rules. Tick the box for CUDA Runtime API. There will be two. I use the one that does not have the version number after the name.
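- Right-click the project and choose Properties. Since test.cu calls CUFFT, the linker will most likely need cufft.lib (and cudart.lib) listed under Linker -> Input -> Additional Dependencies.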
Paste the following into test.cu:
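#include <stdlib.h>
#include <cuda_runtime.h>
#include "cufft.h"

// Forward complex-to-complex FFT on the GPU, computed in place on the caller's arrays.
// Returns 1 on success and 0 on failure.
extern "C" int __declspec(dllexport) __stdcall _Fft(float real[], float imaginary[], int N, int batchSize)
{
    cufftComplex *a_h = NULL, *a_d = NULL;
    cufftHandle plan;
    int i, ok = 0;
    size_t nBytes = sizeof(cufftComplex) * (size_t)N * (size_t)batchSize;

    // Pack the separate real and imaginary arrays into interleaved host memory.
    a_h = (cufftComplex *)malloc(nBytes);
    if (a_h == NULL)
        return 0;
    for (i = 0; i < N * batchSize; i++) {
        a_h[i].x = real[i];
        a_h[i].y = imaginary[i];
    }

    // Allocate device memory. No plan exists yet, so there is no plan to destroy on failure.
    if (cudaMalloc((void **)&a_d, nBytes) != cudaSuccess) {
        free(a_h);
        return 0;
    }

    // Destroy the plan only if cufftPlan1d actually created it.
    if (cudaMemcpy(a_d, a_h, nBytes, cudaMemcpyHostToDevice) == cudaSuccess &&
        cufftPlan1d(&plan, N, CUFFT_C2C, batchSize) == CUFFT_SUCCESS) {
        // Transform in place on the device, then copy the result back to the host.
        if (cufftExecC2C(plan, a_d, a_d, CUFFT_FORWARD) == CUFFT_SUCCESS &&
            cudaDeviceSynchronize() == cudaSuccess &&
            cudaMemcpy(a_h, a_d, nBytes, cudaMemcpyDeviceToHost) == cudaSuccess) {
            for (i = 0; i < N * batchSize; i++) {
                real[i] = a_h[i].x;
                imaginary[i] = a_h[i].y;
            }
            ok = 1;
        }
        cufftDestroy(plan);
    }

    free(a_h);
    cudaFree(a_d);
    return ok;
}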
Build it. (I hope it works for you too.)
Use the dll in C#
In the example above, a file named BareBonesCuda.dll was created in the Debug folder for the solution. Make note of it.
Create a new C# console application. Change the configuration to x86, then debug the empty solution once. This will create a folder in your solution called \bin\x86\Debug. Copy your BareBonesCuda.dll into this folder.
Paste the following into Program.cs:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Runtime.InteropServices;

namespace MyTestSharp
{
    class Program
    {
        static void Main(string[] args)
        {
            test();
        }

        // Native entry point exported by BareBonesCuda.dll. Returns 1 on success, 0 on failure.
        [DllImport("BareBonesCuda.dll", CallingConvention = CallingConvention.StdCall, EntryPoint = "_Fft")]
        public static extern int _Fft(float[] real, float[] imaginary, int N, int batchSize);

        // Transforms both arrays in place and returns them as { real, imaginary }.
        private static List<float[]> fftFloat(float[] real, float[] imaginary, int N)
        {
            int ok = _Fft(real, imaginary, N, 1);
            if (ok != 1)
                throw new InvalidOperationException("_Fft reported a failure.");
            List<float[]> fftResult = new List<float[]>();
            fftResult.Add(real);
            fftResult.Add(imaginary);
            return fftResult;
        }

        private static void test()
        {
            int N = 32768;
            float[] real = new float[N];
            float[] imaginary = new float[N];

            // Build and print the input: a real ramp 1, 2, ..., N with zero imaginary part.
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < N; i++)
            {
                real[i] = (float)i + 1;
                imaginary[i] = 0;
                sb.Append(real[i].ToString());
                sb.Append(" + ");
                sb.AppendLine(imaginary[i].ToString());
            }
            Console.WriteLine(sb.ToString());

            // Run the transform on the GPU and print the result.
            sb = new StringBuilder();
            List<float[]> result = fftFloat(real, imaginary, N);
            for (int i = 0; i < N; i++)
            {
                sb.Append(result[0][i].ToString());
                sb.Append(" + ");
                sb.AppendLine(result[1][i].ToString());
            }
            Console.WriteLine(sb.ToString());
        }
    }
}
Run it. (Again, I hope it works for you too.)
References
Some references I found useful:
http://developer.download.nvidia.com/compute/cuda/3_1/docs/GettingStartedWindows.pdf
http://www.programmerfish.com/how-to-run-cuda-on-visual-studio-2008-vs08/
http://www.isnull.com.ar/2010/12/tutorial-cuda-32-and-visual-studio-2008.html
Syntax coloring:
http://www.c-sharpcorner.com/uploadfile/rafaelwo/cuda-integration-with-C-Sharp/
http://developer.download.nvidia.com/compute/cuda/1_1/CUFFT_Library_1.1.pdf
http://www.codeproject.com/Messages/4106223/Re-unmanaged-returning-arrays.aspx
Some results
Now that I am up and running, I am very happy with my Cuda performance. Using the CUFFT 1-D, forward, complex Fourier transform with double precision numbers as an example, I see a GPU/CPU performance advantage approaching 270x. For the CPU side of my test I am using a simple recursive radix-2 implementation based on the Sedgewick/Wayne Java procedure. The transforms from the GPU and CPU versions agree exactly (to machine precision)! My GPU handles vectors up to length N = 16777216 (2^24)… and does it in 0.5 seconds.
History