Introduction
The article discusses programming your Graphics Card (GPU) with Java & OpenCL.
Your computer most likely has a 3D-accelerated graphics card. This is especially true if your computer is a desktop, and accelerated video cards are becoming very common even in laptops. Your graphics card, or graphics processing unit (GPU), packs a fair amount of processing power, and you can harness this computational power for regular Java programs. This article is the first in a series that I am going to write on GPU programming, and it deals only with executing a very simple application on your GPU. There are a number of details that you must be concerned with, so the first step is simply making sure everything is set up properly. That is the purpose of this article.
What can actually be done with multiple GPUs? The article below shows how a university research team used multiple GPUs to build a “desktop supercomputer”:
You should also consider what types of applications might benefit from GPU acceleration. If your program spends long periods of time performing computations and general number crunching, GPU acceleration might be very beneficial. GPU acceleration can be especially useful if your task can be processed in a parallel manner. GPUs are typically made up of a large number of stream processors that are not terribly fast by themselves. Many GPU cards come with over 100 stream processors. Obviously, if you are to make any use of the GPU, you must make your task parallel. If you found it hard to get your application to perform well on a quad-core CPU, GPU acceleration is not likely to help. GPU programming is a glimpse into the future. One day, in the not too distant future, we will have 100+ core CPUs.
Another limitation of GPU processing is that you cannot execute Java code on the GPU. The code must be written in a C-like language called OpenCL. The OpenCL code will not have direct access to your Java data. You must package data for the OpenCL code to work on. In many ways, using OpenCL is like using SQL. You start by creating a string that holds your OpenCL routine. You then bind parameters from your Java application to the OpenCL routine. Finally, you integrate the results from your OpenCL routine back into your application. This entire cycle is very similar to using a database and SQL.
I became interested in GPU programming because I am the founder and primary programmer for the Encog project for Java Neural Networks. The Encog project is an open source LGPL neural network framework for both Java and .NET. Neural networks are mathematically intense. If implemented correctly, their processing can be done in parallel. This makes neural networks an ideal choice for GPU acceleration. Encog makes use of the GPU to accelerate training of the neural networks. My articles are meant to be about GPU programming, independent of neural network programming. However, the final article will demonstrate training a neural network with the GPU. These articles could be applied to any number crunching task.
Using the JOCL OpenCL Binding
OpenCL is packaged as a DLL that is installed on your machine. We will discuss installing these drivers in the next section. For now, we will look at how to install the binding. Because OpenCL is inside of a DLL, you will have to use the Java Native Interface (JNI) to communicate with it. You won't have to deal directly with JNI, as this is what the binding does. The binding that we will use is called JOCL. JOCL can be downloaded from here.
JOCL is very much platform specific. You will have a JAR interface that is platform independent; however, you will need to download the correct DLL for the platform you are going to use. There is also a distinction made between 64-bit and 32-bit. It is very important that you download the correct binding. The computer that I am using to write this article is a Windows 7 64-bit machine. As a result, that is the binding that I will demonstrate. Once I open the Windows 64-bit archive, I see two files in addition to some license files.
- JOCL-0.1.3a-beta.jar
- JOCL-windows-x86_64.dll
You can see both the platform-independent JAR file and the platform-dependent DLL file. You must include the JAR file in your classpath. This is no different than any other Java application. However, the JAR file will make use of the correct platform-dependent DLL. This DLL must be located on the system path, or in the current directory of your Java application.
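To check whether the DLL is somewhere Java will look, a small diagnostic along the following lines may help. This is only a sketch: the class and method names (NativeLibraryLocator, findOnPath) are made up for illustration and are not part of JOCL. It assumes the JVM searches the directories listed in the java.library.path system property, which on Windows typically includes the system path and the current directory.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

// Hypothetical helper, not part of JOCL.
final class NativeLibraryLocator {

    /** Returns the first location in searchPath that contains a file named fileName. */
    static Optional<Path> findOnPath(String fileName, String searchPath) {
        for (String dir : searchPath.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue; // skip empty entries such as a trailing separator
            }
            Path candidate = Paths.get(dir, fileName);
            if (Files.isRegularFile(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String dll = "JOCL-windows-x86_64.dll"; // file name from the archive listed above
        String searchPath = System.getProperty("java.library.path", "");
        Optional<Path> hit = findOnPath(dll, searchPath);
        System.out.println(hit.map(p -> "Found " + p)
                              .orElse(dll + " is not on java.library.path"));
    }
}
```

If the file is not found, copying the DLL next to your application or adding its directory to the system path should be enough.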
Installing Drivers
The OpenCL drivers may already be installed on your computer system. You could skip to the next section and see if you get a driver error. If you do, you will need to come back and install drivers. A driver error is usually in the form of not being able to locate OpenCL.dll.
The exact driver you install will depend on your GPU. We are really only concerned with your GPU’s chipset. There are three GPU chipset vendors widely in use, as of the writing of this article: AMD (ATI), nVidia, and Intel.
There are many different GPU makers; however, they will usually make use of one of the above chipsets. If you have a different chipset, check the vendor’s driver page. If you have an Intel GPU, then you are done right now! As of the writing of this article, Intel has not yet made OpenCL drivers available for their cards.
AMD and nVidia both provide drivers on the following pages.
AMD currently has better drivers than nVidia. AMD can actually use your CPU as an OpenCL device, creating a truly heterogeneous computation environment.
Creating the Example Application
The sample application available for download with this article is based on an example provided with JOCL. It is a relatively short application, but it does show all of the basics of how to use OpenCL. The example download contains an Eclipse project already setup. However, you can easily compile it from the command line with the following instruction:
javac OpenCLPart1.java -classpath .;./JOCL-0.1.3a-beta.jar
To execute the application, use the following instruction:
java -classpath .;./JOCL-0.1.3a-beta.jar OpenCLPart1
If the program executes successfully, you will see the following output:
Obtaining platform...
Test PASSED
Result: [0.0, 1.0, 4.0, 9.0, 16.0, 25.0, 36.0, 49.0, 64.0, 81.0]
If you did not, then stop. Do not pass go. Go directly to the troubleshooting section at the end of this article.
Using OpenCL from Java
Now we will step inside of this very simple application and see how it works. OpenCL is designed to be massively parallel, from the ground up. This is inherently built into the kernels, or small pieces of OpenCL code, that you execute. The kernel that we are executing is shown here.
private static String programSource =
"__kernel void "+
"sampleKernel(__global const float *a,"+
" __global const float *b,"+
" __global float *c)"+
"{"+
" int gid = get_global_id(0);"+
" c[gid] = a[gid] + b[gid];"+
"}";
This is the kernel, embedded in a Java string. The actual kernel, with no Java, is shown here:
__kernel void sampleKernel(__global const float *a,
__global const float *b,
__global float *c)
{
int gid = get_global_id(0);
c[gid] = a[gid] + b[gid];
}
This is OpenCL. It is based on C99, and as a result looks very C-like. Notice the kernel accepts three parameters. All three are pointers. A pointer is somewhat like a Java reference, though it is really much more. A pointer can also double as an array. This is the purpose they are being used for here. If I were to translate the above kernel into pure Java, it would look something like:
void sampleKernel(int index, float[] a,float[] b, float[] c)
{
c[index] = a[index] + b[index];
}
Essentially, we are summing the contents of the “a” and “b” arrays and saving the result into the “c” array. Notice the “index” variable. This is obtained from the OpenCL get_global_id(0) function, which returns the global ID of the work-item (roughly, the thread) that is currently executing. Even this simple example is multithreaded. Each element in the array could potentially be added by a different stream processor on your GPU. We start multithreaded from the beginning. There really is no point in creating a single-threaded OpenCL application. The individual stream processors are too slow. If you are only using one stream processor, then it is very likely not worth the overhead of sending data to the stream processor to process.
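The idea of one kernel call per array index can be tried out in plain Java before involving the GPU at all. The sketch below is an illustration only, not how JOCL or OpenCL works internally: it reuses the pure-Java sampleKernel shown above and uses a parallel stream to hand each index to whichever CPU thread is free. The class name KernelSimulation and the input values are made up for this example.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

// Hypothetical CPU-only stand-in for launching an OpenCL kernel.
final class KernelSimulation {

    /** Java stand-in for the OpenCL kernel: one call handles one array element. */
    static void sampleKernel(int gid, float[] a, float[] b, float[] c) {
        c[gid] = a[gid] + b[gid];
    }

    /** Calls the "kernel" once per index; the JVM spreads the calls over CPU threads. */
    static float[] run(float[] a, float[] b) {
        float[] c = new float[a.length];
        IntStream.range(0, a.length)
                 .parallel()
                 .forEach(gid -> sampleKernel(gid, a, b, c));
        return c;
    }

    public static void main(String[] args) {
        float[] a = {0f, 1f, 2f, 3f};
        float[] b = {10f, 20f, 30f, 40f};
        System.out.println(Arrays.toString(run(a, b))); // prints [10.0, 21.0, 32.0, 43.0]
    }
}
```

Because every call writes to a different element of the output array, the calls never interfere with each other. That independence is what makes a task a good fit for the GPU.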
Now that we have examined the OpenCL of this application, we will look at the Java code necessary to execute it. I will warn you, JOCL is a thin layer over OpenCL. This is not going to be the prettiest Java code in the world. Often the very first thing I do is bury the details of OpenCL in some support classes. This was definitely the approach used with Encog.
We begin by obtaining a platform. This is part of the OpenCL hierarchy. It is easy to gloss over at first, but this hierarchy is very important. Platforms are at the top level. If you have only one vendor’s drivers installed, you will have only one platform. One case where multiple platforms are very handy is if you have an nVidia card. nVidia has yet to release a CPU driver. So you can actually run nVidia OpenCL drivers for your GPU, and then run AMD drivers for the CPU. The AMD drivers will work on Intel CPUs. The following code obtains the first platform.
System.out.println("Obtaining platform...");
cl_platform_id platforms[] = new cl_platform_id[1];
clGetPlatformIDs(platforms.length, platforms, null);
cl_context_properties contextProperties = new cl_context_properties();
contextProperties.addProperty(CL_CONTEXT_PLATFORM, platforms[0]);
This code will only access one platform. It will simply take the first platform it finds. If you wanted to support multiple platforms, you would need to modify the above code to take a bigger array.
Next we create a context. Each platform must have its own context. A context can have multiple devices. For example, if you had dual ATI GPUs, you would have one platform, but two devices. The context is created as follows. Here, we request only GPUs.
cl_context context = clCreateContextFromType(
contextProperties, CL_DEVICE_TYPE_GPU, null, null, null);
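We obtain the list of devices in the context. The first GPU device in this list is the one we will use:
// Ask how many bytes the device list occupies.
long numBytes[] = new long[1];
clGetContextInfo(context, CL_CONTEXT_DEVICES, 0, null, numBytes);
// Fetch the device list itself.
int numDevices = (int) numBytes[0] / Sizeof.cl_device_id;
cl_device_id devices[] = new cl_device_id[numDevices];
clGetContextInfo(context, CL_CONTEXT_DEVICES, numBytes[0],
Pointer.to(devices), null);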
Commands will be given to the OpenCL device using the command queue. If you were using dual graphics cards, you would need dual command queues. OpenCL automatically sends tasks to the stream processors on a card. But it is up to you to break tasks over the multiple command queues that are necessary with dual GPUs.
cl_command_queue commandQueue =
clCreateCommandQueue(context, devices[0], 0, null);
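Breaking a task over several command queues mostly comes down to deciding which slice of the array each queue gets. The following is a minimal sketch of that arithmetic only. It makes no OpenCL calls, and the names WorkSplitter, Slice and split are hypothetical. Each slice would then be enqueued on its own command queue.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: decides which part of the array each command queue handles.
final class WorkSplitter {

    /** A contiguous slice of the array: start offset and element count. */
    static final class Slice {
        final int offset;
        final int length;

        Slice(int offset, int length) {
            this.offset = offset;
            this.length = length;
        }

        @Override
        public String toString() {
            return "[offset=" + offset + ", length=" + length + "]";
        }
    }

    /** Divides n elements as evenly as possible over queueCount queues. */
    static List<Slice> split(int n, int queueCount) {
        if (queueCount <= 0) {
            throw new IllegalArgumentException("queueCount must be positive");
        }
        List<Slice> slices = new ArrayList<>();
        int base = n / queueCount;       // every queue gets at least this many elements
        int remainder = n % queueCount;  // the first 'remainder' queues get one extra
        int offset = 0;
        for (int q = 0; q < queueCount; q++) {
            int length = base + (q < remainder ? 1 : 0);
            slices.add(new Slice(offset, length));
            offset += length;
        }
        return slices;
    }

    public static void main(String[] args) {
        System.out.println(split(10, 2)); // prints [[offset=0, length=5], [offset=5, length=5]]
    }
}
```

An even split like this assumes the cards are about equally fast. With mismatched cards you would probably want to weight the slices instead.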
Next, we must allocate the memory for the three arrays. The first two arrays are used to send data to the kernel. The third is used to write data back from the kernel to the main application.
cl_mem memObjects[] = new cl_mem[3];
memObjects[0] = clCreateBuffer(context,
CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
Sizeof.cl_float * n, srcA, null);
memObjects[1] = clCreateBuffer(context,
CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
Sizeof.cl_float * n, srcB, null);
memObjects[2] = clCreateBuffer(context,
CL_MEM_READ_WRITE,
Sizeof.cl_float * n, null, null);
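The size passed to each buffer is simply the number of elements times the size of one float. An OpenCL float is 32 bits wide, the same as a Java float. The sketch below shows that arithmetic in plain Java, along with one way of packing a float array into off-heap memory in the machine's native byte order. It does not use JOCL, and the names BufferSizing, bytesFor and pack are assumptions made for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

// Hypothetical helper illustrating buffer sizes; it makes no OpenCL calls.
final class BufferSizing {

    /** Bytes needed for n single-precision floats (4 bytes each). */
    static long bytesFor(int n) {
        return (long) Float.BYTES * n;
    }

    /** Copies the array into off-heap memory laid out in native byte order. */
    static FloatBuffer pack(float[] data) {
        FloatBuffer packed = ByteBuffer
                .allocateDirect(Float.BYTES * data.length)
                .order(ByteOrder.nativeOrder())
                .asFloatBuffer();
        packed.put(data);
        packed.rewind(); // reset the position so readers start at element 0
        return packed;
    }

    public static void main(String[] args) {
        System.out.println(bytesFor(10));                          // prints 40
        System.out.println(pack(new float[]{1f, 2f, 3f}).get(2)); // prints 3.0
    }
}
```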
We now create a program to hold the kernel we created earlier.
cl_program program = clCreateProgramWithSource(context,
1, new String[]{ programSource }, null, null);
We must compile the OpenCL kernel.
clBuildProgram(program, 0, null, null, null, null);
We now bind the kernel to the name specified earlier.
cl_kernel kernel = clCreateKernel(program, "sampleKernel", null);
Now that the buffers have been created, we must assign the three arguments.
clSetKernelArg(kernel, 0,
Sizeof.cl_mem, Pointer.to(memObjects[0]));
clSetKernelArg(kernel, 1,
Sizeof.cl_mem, Pointer.to(memObjects[1]));
clSetKernelArg(kernel, 2,
Sizeof.cl_mem, Pointer.to(memObjects[2]));
We specify how many threads we want the task broken into. This is the total number of elements in the array. Further subdivision can be done in the form of work-groups. We simply specify a local work size of 1. This means each item is a separate work-group. Sometimes it is useful to place several work-items in the same work-group, as they can share local memory.
long global_work_size[] = new long[]{n};
long local_work_size[] = new long[]{1};
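If you choose a local size larger than 1, the global size has to be a whole multiple of it in this version of OpenCL, so the global size is often rounded up. The sketch below shows that rounding in plain Java, along with how a global ID maps to a work-group and a position within it. It assumes no global offset is used, which matches the null passed for the offset in this article's example. The class and method names are hypothetical.

```java
// Hypothetical helper showing work-size arithmetic; it makes no OpenCL calls.
final class WorkSizes {

    /** Smallest multiple of localSize that is at least n. */
    static long roundUpGlobalSize(long n, long localSize) {
        long groups = (n + localSize - 1) / localSize; // ceiling division
        return groups * localSize;
    }

    /** Work-group that a work-item belongs to, assuming a global offset of zero. */
    static long groupId(long globalId, long localSize) {
        return globalId / localSize;
    }

    /** Position of a work-item inside its work-group, assuming a global offset of zero. */
    static long localId(long globalId, long localSize) {
        return globalId % localSize;
    }

    public static void main(String[] args) {
        long n = 10;
        long local = 4;
        System.out.println(roundUpGlobalSize(n, local)); // prints 12
        System.out.println(groupId(9, local));           // prints 2
        System.out.println(localId(9, local));           // prints 1
    }
}
```

When the global size is padded like this, the kernel needs a guard such as checking that the global ID is less than the real element count, so the extra work-items do not touch memory beyond the end of the arrays.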
We are finally ready to execute the kernel. It will process each array element in parallel.
clEnqueueNDRangeKernel(commandQueue, kernel, 1, null,
global_work_size, local_work_size, 0, null, null);
Finally, we read the result back from the “c” parameter.
clEnqueueReadBuffer(commandQueue, memObjects[2], CL_TRUE, 0,
n * Sizeof.cl_float, dst, 0, null, null);
We now have the results from the kernel.
This is a very simple application; however, it shows all of the basics. You saw how to structure data to be sent to the kernel. You saw how to read data back from the kernel. In the next part, we will see how to expand the kernel for more useful processing.
If you are interested in neural networks with the GPU, you should have a look at the Encog Java Neural Networks project.
Troubleshooting
It is very important that you are using the correct DLL for your system. Even if your computer supports 64-bit, you may not be running Java in 64-bit mode. The following error shows an example of this. Notice that it specifies the architecture bit size as 32. My computer is running in 64-bit mode. However, Java is not running in that mode. Adjusting the Eclipse IDE’s JRE settings can fix this.
Error while loading native library "JOCL-windows-x86" with base name "JOCL"
Operating system name: Windows 7
Architecture : x86
Architecture bit size: 32
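A quick way to confirm which mode Java itself is running in is to print the relevant system properties. This is a small sketch with a made-up class name, JvmBitness. The os.arch property is standard, while sun.arch.data.model is set by HotSpot-based JVMs and may be missing on others, so the code falls back to "unknown".

```java
// Hypothetical diagnostic; prints what the running JVM reports about itself.
final class JvmBitness {

    /** Describes the architecture and data model of the running JVM. */
    static String describe() {
        // os.arch typically reflects the JVM's own architecture, not the operating system's.
        String arch = System.getProperty("os.arch", "unknown");
        // sun.arch.data.model is HotSpot-specific; other JVMs may not define it.
        String dataModel = System.getProperty("sun.arch.data.model", "unknown");
        return "JVM architecture: " + arch + ", data model: " + dataModel + "-bit";
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Run it with the same JRE that Eclipse or your command line uses for the example. If it reports a 32-bit data model, you need the 32-bit JOCL DLL or a 64-bit JRE.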
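You also might not have OpenCL.dll, or the JOCL DLL itself, where Java can find it. This can result in the following error. Notice that the stack trace reports the JOCL native library as missing from java.library.path.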
Error while loading native library "JOCL-windows-x86" with base name "JOCL"
Operating system name: Windows 2003
Architecture : x86
Architecture bit size: 32
Stack trace:
java.lang.UnsatisfiedLinkError: no JOCL-windows-x86 in java.library.path
at java.lang.ClassLoader.loadLibrary(Unknown Source)
at java.lang.Runtime.loadLibrary0(Unknown Source)
at java.lang.System.loadLibrary(Unknown Source)
at org.jocl.LibUtils.loadLibrary(LibUtils.java:67)
at org.jocl.CL.assertInit(CL.java:1694)
at org.jocl.CL.clGetPlatformIDs(CL.java:1744)
at org.encog.util.cl.EncogCL.<init>(EncogCL.java:53)
at org.encog.Encog.initCL(Encog.java:150)
at org.encog.examples.neural.opencl.SimpleCLTest.main(SimpleCLTest.java:10)
Exception in thread "main" org.encog.EncogError: java.lang.UnsatisfiedLinkError:
Could not load the native library
at org.encog.Encog.initCL(Encog.java:155)
at org.encog.examples.neural.opencl.SimpleCLTest.main(SimpleCLTest.java:10)
Caused by: java.lang.UnsatisfiedLinkError: Could not load the native library
at org.jocl.LibUtils.loadLibrary(LibUtils.java:78)
at org.jocl.CL.assertInit(CL.java:1694)
at org.jocl.CL.clGetPlatformIDs(CL.java:1744)
at org.encog.util.cl.EncogCL.<init>(EncogCL.java:53)
at org.encog.Encog.initCL(Encog.java:150)
... 1 more
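When chasing an UnsatisfiedLinkError like the one above, it can help to try loading the library yourself and print the search path Java actually used. The sketch below does this with the standard System.loadLibrary call. The class name NativeLoadCheck and the method tryLoad are made up for illustration, and this is a standalone probe, not what JOCL does internally.

```java
import java.util.Optional;

// Hypothetical diagnostic; a standalone probe that is not part of JOCL.
final class NativeLoadCheck {

    /** Tries to load a native library by base name; returns a diagnostic message on failure. */
    static Optional<String> tryLoad(String libraryName) {
        try {
            System.loadLibrary(libraryName);
            return Optional.empty(); // loaded successfully
        } catch (UnsatisfiedLinkError e) {
            return Optional.of("Could not load '" + libraryName + "': " + e.getMessage()
                    + System.lineSeparator()
                    + "java.library.path = " + System.getProperty("java.library.path"));
        }
    }

    public static void main(String[] args) {
        // Base name without the .dll extension, matching the file in the JOCL archive.
        Optional<String> problem = tryLoad("JOCL-windows-x86_64");
        System.out.println(problem.orElse("Native library loaded."));
    }
}
```

If the printed path does not include the directory holding the DLL, either move the DLL or start Java with the -Djava.library.path option pointing at that directory.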
History
- 9th June, 2010: Initial post