
Part 2: OpenCL™ – Memory Spaces

27 Oct 2010
In his second tutorial, GPGPU expert Rob Farber discusses OpenCL™ memory spaces and the OpenCL memory hierarchy, and how to start thinking in terms of work items and work groups. This tutorial also provides a general example to facilitate experimentation with a variety of OpenCL kernels.

This second article in a series on portable multithreaded programming using OpenCL™ will discuss memory spaces. The previous article introduced the thought behind the OpenCL standard and demonstrated how to download and use the ATI Stream software development kit (SDK) to build and run an OpenCL program.

This tutorial also provides a general example to facilitate experimentation with a variety of OpenCL kernels.

The OpenCL relaxed memory consistency model

Implicit in the OpenCL memory model is the idea that the kernel (the code for a work item that runs on the device) resides in a separate memory space. Data is only accessible after some form of data transfer has been used to move data into the device memory. Each work item can use private memory, local memory, constant memory, and global memory.

In essence, OpenCL uses what is called a relaxed memory consistency model (Khronos OpenCL Working Group, 2008a, p.25) that:

  • Allows work items to access data within private memory.
  • Permits sharing of local memory by work items during the execution of a work-group. However, memory is only guaranteed to be consistent after various synchronization points, such as barriers within kernels (which can only synchronize the view of local memory between work items of a single work-group) and queue events. (A minimal kernel sketch follows this list.)
  • Does not guarantee consistency of variables across work items from different work-groups, so different work-groups cannot communicate or synchronize with one another.
  • Allows data dependencies to be defined and satisfied via the work queues and atomic operations, as will be discussed in a later article.
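
To make the synchronization point concrete, the following is a minimal kernel sketch (illustrative only; the names are not from this article's example code). Each work item stages one value in local memory, and only after the barrier is every work item guaranteed to see the writes of the other work items in its work-group:

__kernel void reverse_within_group(__global const int *in,
                                   __global int *out,
                                   __local int *scratch)
{
  size_t lid = get_local_id(0);
  size_t lsz = get_local_size(0);

  scratch[lid] = in[get_global_id(0)];  // each work item writes one slot
  barrier(CLK_LOCAL_MEM_FENCE);         // local writes now visible group-wide
  // Safe to read another work item's slot: reverse within the work-group.
  out[get_global_id(0)] = scratch[lsz - 1 - lid];
}

Note that no barrier can order memory operations between different work-groups; that is exactly what the relaxed model leaves undefined.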

For a good explanation of memory consistency models, I suggest reading:

The beauty inherent in the OpenCL memory design is that the programmer localizes the data and associates it with a work item, or with a work-group. These work items can then be queued on one or more devices to achieve very high performance and scalability. Internally, each device can potentially support large numbers of concurrent threads of execution. (We use the terms thread and work item interchangeably.)

A big challenge for most developers is that they are not used to thinking in terms of partitioning data and splitting computational tasks into separate work items. Writing single-threaded applications, or applications for multi-core SMP systems, gives the programmer an intuitively simple cache-coherent memory model: all data residing in the local processor caches is guaranteed to be consistent across processors. As a result, programmers are free to ignore the data partitioning problem because they can assume that any core can access any byte within a common address space. Unfortunately, preserving cache coherency introduces significant communications overhead and limits the ability of the programmer and hardware designer to optimize performance. As a result, SMP systems cost more, and scaling bottlenecks quickly limit their performance as the number of cores increases. I have personally seen a significant drop in performance, attributed to maintaining cache coherency, on small SMP systems containing only four quad-core chips, or 16 total cores.

Succinctly, the OpenCL relaxed consistency model means that different work-items may see a different view of global memory as the computation progresses. Synchronization is required to ensure data consistency within the work items of a work group, while reads and writes to all memory spaces are consistently ordered within work-items. 

The “carrot” OpenCL offers application developers is the ability to create applications that scale extremely well and achieve very high performance on SMP systems, and potentially orders of magnitude higher performance on inexpensive GPU hardware. Hybrid CPU/GPU systems are now very common, and the need to develop applications for these platforms has motivated the portable parallelism of OpenCL: GPUs support hundreds to thousands of concurrent threads of execution. The numbers have changed for mass-market multi-core SMP systems as well, as products with tens to hundreds of cores are either on the market or will be in the near future. To exploit this hardware capability, OpenCL programmers must learn to think about computational problems in terms of data partitions and localized calculations - in other words, in terms of work items and work groups.

Rules to achieve high-performance

As discussed in the first article in this series, high-performance applications follow three general rules. As can be seen, appropriate use of the device memory spaces and hierarchy is critical:

  1. Get and keep the data on the GPU to eliminate PCI bus data transfer bottlenecks. The performance difference shown by the SDK examples tells the story on an ATI Radeon™ HD 5870 GPU:
    • PCIeBandwidth: 2.4 GB/s
    • GlobalMemoryBandwidth: 118 – 169 GB/s
  2. Give the GPU enough work to do.

    Starting a kernel does incur a small amount of overhead. However, modern GPUs are so fast that they can perform a significant amount of work in the time it takes to start a kernel, so each kernel should be given enough work to make the launch overhead worthwhile.

  3. Optimize the calculation to minimize the bottleneck in accessing the GPU memory.

For example, the best read memory bandwidth an ATI Radeon HD 5870 GPU can deliver is 169 GB/s. A kernel that performs one floating-point operation per element, where each element requires a single 32-bit (4-byte) read, would deliver at best 169 GB/s ÷ 4 bytes ≈ 42 GFLOP/s, which is dramatically slower than the claimed peak for this GPU of 2,720 GFLOP/s.

The first point highlights the need to partition the data appropriately so that it can be moved to the memory space of one or more devices and kept there. As can be seen in point 3, data reuse is a key to performance. We will address the second point in a later article.
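
The arithmetic behind point 3 is worth making explicit. The following back-of-the-envelope sketch simply restates the calculation in code (the 169 GB/s figure is the article's quoted peak read bandwidth, not a measured value):

#include <iostream>

int main()
{
  const double peakReadBandwidth = 169.0e9; // bytes/s, quoted peak
  const double bytesPerFlop = 4.0;          // one 32-bit read per FLOP

  // A kernel doing one FLOP per 32-bit read can sustain at most:
  std::cout << (peakReadBandwidth / bytesPerFlop) * 1.0e-9
            << " GFLOP/s" << std::endl;     // prints 42.25
  return 0;
}

Reusing each value fetched from global memory several times (for example, by staging it in local or private memory) raises this ceiling proportionally.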

The OpenCL memory hierarchy

OpenCL defines a memory hierarchy of types, as illustrated in the following figure from the AMD OpenCL Introduction. Proper usage can deliver significant performance improvements. However, the OpenCL standard only specifies the access levels of the different types of memory; many of the important performance details may vary from vendor to vendor. This flexibility allows each vendor to deliver the highest-performance products, but it creates a challenge for the developer who wishes to exploit the portability of OpenCL while also achieving the highest performance across devices. Fortunately, vendor implementations appear to be strongly influenced by existing hardware architectures - especially graphics processors. Thus, to create high-performance portable OpenCL applications, it seems prudent to think in terms of GPU architectures.

[Figure: The OpenCL memory hierarchy, from the AMD OpenCL Introduction]

Global memory: Global memory is generally the largest-capacity memory subsystem on the compute device. For most devices, global memory is measured in gigabytes of capacity. While large and visible to all threads on a GPU or SMP system, global memory is also the slowest memory subsystem, and it has some restrictions on use. These restrictions vary according to device, which complicates code design:

  • To determine the amount of global memory on a device, use clGetDeviceInfo passing CL_DEVICE_GLOBAL_MEM_SIZE for param_name, or examine the output of the CLInfo sample in the AMD SDK.
  • Global memory should be considered as streaming memory. This means that the best performance will be achieved when streaming contiguous memory addresses or memory access patterns that can exploit the full bandwidth of the memory subsystem. Such memory operations are said to be coalesced.
    • The AMD OpenCL Programming Guide notes, “For optimally coalesced memory patterns, a common access pattern for GPU-optimized algorithms is for work-items in the same wavefront to access memory locations from the same cache line.” (A wavefront describes N hardware threads that execute the same instruction in parallel. This is distinct from a work group that is a software abstraction for grouping threads.)
    • The rules for coalesced memory accesses vary according to device and can be quite complex as discussed here. (A simple illustration follows this list.)
  • Some memory access patterns can cause a bank conflict, which can dramatically slow application performance. (For performance reasons, memory subsystems are arranged in banks to increase streaming bandwidth by a factor related to the number of banks. A bank conflict occurs when multiple threads simultaneously access the same memory bank. In such cases, each bank can only service requests one at a time, which causes a sequential slowdown that can severely impact performance.) Be aware that bank conflicts are very device dependent:
    • Some hardware greatly speeds special access cases, but can result in poor performance on other devices.
      • Broadcast reads from global memory can perform well or poorly depending on the device. For example, slide 69 of this presentation notes that the ATI Radeon HD 5870 GPU has special hardware to perform a broadcast when many threads attempt to read from a common location in global memory. Such a memory access will perform poorly on pre-Fermi NVIDIA GPUs.
        • When possible, use constant memory to achieve good performance on broadcast operations across most, if not all GPU types.
      • Reduction operations are common in many applications. For this reason, high-performance reductions are the subject of much interest. The following is one study discussing how AMD has designed their hardware to support fast reductions from global memory on the ATI Radeon HD 5870 GPU.
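
As a concrete illustration of coalescing, consider two hypothetical copy kernels (illustrative only; they are not part of this article's example code). The first streams contiguous addresses; the second forces adjacent work items to touch widely separated words. On most GPUs the strided version delivers only a small fraction of the bandwidth of the coalesced one:

__kernel void copy_coalesced(__global const float *in,
                             __global float *out)
{
  size_t tid = get_global_id(0);
  out[tid] = in[tid];  // adjacent work items access adjacent words
}

__kernel void copy_strided(__global const float *in,
                           __global float *out,
                           int stride, int n)
{
  size_t tid = get_global_id(0);
  size_t idx = (tid * stride) % n;  // adjacent work items access distant words
  out[idx] = in[idx];
}

Timing both kernels with the framework developed later in this article is a simple way to observe the coalescing behavior of a particular device.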

Private memory: This is memory used within a work item that is similar to registers in a GPU multiprocessor or CPU core.

  • Private memory is fast and can be used without need for synchronization primitives. It is allocated and partitioned at compile time by the JIT compiler for the given kernel and card.
  • As is discussed in this thread, the location and size of private memory are not defined in the OpenCL specification. This ambiguity makes it difficult to decide how much private memory to use, beyond “as little as possible,” and benchmarks must be used to determine appropriate amounts for a device.
  • Application performance can plummet when too much private memory is used on some devices, such as GPUs, because it spills to slower memory. Depending on the device, private memory can spill to cache memory; GPUs that do not have cache memory will spill to global memory, causing significant performance drops.

Local memory: OpenCL local memory is much faster than global memory because it is generally on-chip.

  • To determine the amount of local memory on a device, use clGetDeviceInfo with CL_DEVICE_LOCAL_MEM_SIZE passed for param_name, or examine the output of the CLInfo sample in the AMD SDK.
  • Local memory is used to enable coalesced accesses, to share data between work items in a work group, and to reduce accesses to lower bandwidth global memory. (A host-side sketch follows this list.)
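
Local memory can also be allocated by the host at kernel-launch time. The following fragment is a sketch using the C++ bindings; it assumes a kernel whose third parameter is declared __local int *scratch (as in the barrier example earlier) and reuses hypothetical d_in and d_out buffers. cl::Local() is the cl.hpp helper that creates a local allocation of the requested size in bytes:

size_t workGroupSize = 64;  // illustrative choice; tune per device
kernel.setArg(0, d_in);     // __global const int *
kernel.setArg(1, d_out);    // __global int *
kernel.setArg(2, cl::Local(workGroupSize * sizeof(int))); // __local int *
// vecLen is assumed to be a multiple of workGroupSize
queue.enqueueNDRangeKernel(kernel, cl::NullRange,
                           cl::NDRange(vecLen),
                           cl::NDRange(workGroupSize));

Each work-group receives its own copy of the local allocation; nothing persists between work-groups.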

Constant memory: Constant memory is exactly that: a read-only section of memory.

  • On GPU devices from NVIDIA, constant memory is a specialized region of memory that is good for broadcast operations. On AMD devices, this is a region of global memory that exploits hardware optimizations to broadcast data.
  • To determine the size of the constant memory buffer, use clGetDeviceInfo with CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE, or examine the output of the CLInfo sample in the AMD SDK. (A query sketch follows this list.)
  • Portable code must conservatively assume that each variable declared inside a function or in program scope with the __constant qualifier counts as a separate constant argument. The online Khronos documentation for OpenCL 1.1 notes that implementations are not required to aggregate these declarations into the fewest number of constant arguments.
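
The clGetDeviceInfo queries mentioned above map directly onto the C++ bindings' getInfo template. A small sketch (reusing the devices vector from the example later in this article; error handling omitted):

// Query the memory sizes discussed above via the C++ bindings.
cl_ulong globalSize = devices[0].getInfo<CL_DEVICE_GLOBAL_MEM_SIZE>();
cl_ulong localSize  = devices[0].getInfo<CL_DEVICE_LOCAL_MEM_SIZE>();
cl_ulong constSize  = devices[0].getInfo<CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE>();
cout << "global: " << globalSize
     << "  local: " << localSize
     << "  constant buffer: " << constSize << " bytes" << endl;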

OpenCL address space qualifiers

Following are the OpenCL address space qualifiers (which are distinct from access qualifiers):

  • __global: memory allocated from the global address space; images are global by default.
  • __constant: a region of read-only memory.
  • __local: memory shared by the work items of a work-group.
  • __private: private per-work-item memory.

Note: kernel pointer arguments must be declared __global, __constant, or __local. Casting a pointer to a different address space qualifier results in undefined behavior.
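
The following toy kernel (illustrative only) touches all four address spaces in one place:

__kernel void qualifiers_demo(__global float *data,     // global buffer
                              __constant float *coeff,  // read-only constant
                              __local float *scratch)   // per-work-group
{
  float tmp = data[get_global_id(0)];  // tmp is __private by default
  scratch[get_local_id(0)] = tmp * coeff[0];
  barrier(CLK_LOCAL_MEM_FENCE);
  data[get_global_id(0)] = scratch[get_local_id(0)];
}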

Example

This article will utilize cl.hpp – the OpenCL 1.1 C++ bindings header file that implements the C++ wrapper API to create shorter, more succinct programs. In contrast, the first example in this series used cl.h, which defines the OpenCL 1.1 Core API.

The following code, second.cpp, reads the OpenCL kernel source code from a file, builds it, runs it, and then double-checks the results. The default behavior is to display the time it took the kernel to run, which makes comparisons possible. Since the OpenCL kernel source code is in a file, it is easy to experiment with different kernels to understand the performance implications of the various memory types on both CPU and GPU hardware. In addition, the example provides a usable and easily generalized framework for experimenting with other data sets and computational kernels.

Upon entry into main(), the command-line arguments are parsed.

int main(int argc, char* argv[])
{
  int seed=4;
 
  if( argc < 5) {
    cerr 
      << "Use: {cpu|gpu} kernelFile n_iter vectorSize (k)" 
      << endl;
    exit(EXIT_FAILURE);
  }
 
  // handle command-line arguments
  const string platformName(argv[1]);
  int deviceType = platformName.compare("cpu")?
    CL_DEVICE_TYPE_GPU:CL_DEVICE_TYPE_CPU;
 
  const char* kernelFile = argv[2];
  unsigned int n_iter = atoi(argv[3]);
  unsigned int vecLen = 1000 * atoi(argv[4]);

The remaining code in main() performs all the work needed to build and run the kernel within the scope of a try { … } catch block that handles any exceptions. Note that:

  • PROFILING is defined by default.
  • The work to generate the data, load and build the kernel, and perform the golden check is performed by the class OclTest.
  • The C++ wrapper API is used to perform more work in fewer lines of source code than first.cpp, the example discussed in the previous tutorial, which used the core OpenCL API. The source code is also more readable.
try {
    cl::vector< cl::Platform > platformList;
    cl::Platform::get(&platformList);
    
    displayPlatformInfo(platformList, deviceType);
    
    cl_context_properties cprops[3] = 
    {CL_CONTEXT_PLATFORM, 
     (cl_context_properties)(platformList[0])(), 0};
 
    cl::Context context(deviceType, cprops);
    
    cl::vector<cl::Device> devices = 
      context.getInfo<CL_CONTEXT_DEVICES>();
    
#ifdef PROFILING
    cl::CommandQueue queue(context, devices[0], 
			   CL_QUEUE_PROFILING_ENABLE);
#else
    cl::CommandQueue queue(context, devices[0], 0);
#endif
 
    OclTest test(context, devices, kernelFile, n_iter, vecLen);
 
    cl::Event event;
    test.initData(queue, event, seed);
 
    queue.enqueueNDRangeKernel(test.getKernel(), 
			       cl::NullRange, cl::NDRange(vecLen),
			       cl::NDRange(1), NULL, &event);
 
    if(test.goldenTest(queue, event) == 0) {
      cout << "test passed" << endl;
    } else {
      cout << "TEST FAILED!" << endl;
    }
    
  } catch (cl::Error error) {
    cerr << "caught exception: " << error.what() 
	 << '(' << error.err() << ')' << endl;
  }
  return EXIT_SUCCESS;
}

The class OclTest breaks out the work needed to perform the test. As can be seen:

  1. The constructor. OclTest(…):
    • Builds the kernel from the source file.
    • Allocates the three host vectors, creates the device vectors, and assigns the device vectors to the kernel.
  2. The method initData(…):
    • Fills the host vectors with random data.
    • Queues the transfer of the h_vecA and h_vecB from the host to the device.
  3. The method getKernel() returns the kernel for queuing on the CommandQueue.
  4. After main(…) queues the kernel, the method goldenTest(…) is called. This method waits for the command queue to complete and then double-checks that the computation on the device was correct.
class OclTest {
private:
  int nIter, vLen,vSize;
  cl::Kernel kernel;
  int  *h_vecA, *h_vecB, *h_vecC;
  cl::Buffer d_vecA, d_vecB, d_vecC;
  
public:
  OclTest( cl::Context& context, cl::vector<cl::Device>& devices,
	   const char* kernelFile, int n_iter, int vecLen) {
    nIter = n_iter;
    vLen = vecLen;
    vSize = vLen * sizeof(int);
    
    // build the program from the source in the file
    ifstream file(kernelFile);
    string prog(istreambuf_iterator<char>(file),
		(istreambuf_iterator<char>()));
    cl::Program::Sources source( 1, make_pair(prog.c_str(),
					      prog.length()+1));
    cl::Program program(context, source);
    file.close();
 
    try {
      program.build(devices);
    } catch(cl::Error& err) {
      // Get the build log
      cerr << "Build failed! " << err.what() 
	   << '(' << err.err() << ')' << endl;
      cerr << "retrieving  log ... " << endl;
      cerr 
	<< program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(devices[0])
	<< endl;
      exit(-1);
    }
    kernel = cl::Kernel(program, "vec_iii_1d");
    
    // set up the kernel inputs
    h_vecA = new int[vLen];
    h_vecB = new int[vLen];
    h_vecC = new int[vLen];
    
    d_vecA = cl::Buffer(context, CL_MEM_READ_ONLY, vSize);
    d_vecB = cl::Buffer(context, CL_MEM_READ_WRITE, vSize);
    d_vecC = cl::Buffer(context, CL_MEM_READ_WRITE, vSize);
    
    kernel.setArg(0, d_vecA);
    kernel.setArg(1, d_vecB);
    kernel.setArg(2, d_vecC);
  }
  
  inline void initData(cl::CommandQueue& queue, cl::Event& event, 
		       int seed) 
  {
    srand(seed);
    for(int i=0; i < vLen; i++) h_vecA[i] = rand();
    for(int i=0; i < vLen; i++) h_vecB[i] = abs(rand()) % nIter;
    queue.enqueueWriteBuffer(d_vecA, CL_TRUE, 0, vSize, h_vecA);
    queue.enqueueWriteBuffer(d_vecB, CL_TRUE, 0, vSize, h_vecB);
  }
 
  inline cl::Kernel& getKernel() { return(kernel); }
 
  inline int goldenTest(cl::CommandQueue& queue, cl::Event& event) 
  {
    event.wait();
#ifdef PROFILING
    cl_ulong start=
      event.getProfilingInfo<CL_PROFILING_COMMAND_START>();
    cl_ulong end=
      event.getProfilingInfo<CL_PROFILING_COMMAND_END>();
    double time = 1.e-9 * (end-start);
    cout << "Time for kernel to execute " << time << endl;
#endif
 
    // bring data back to the host via a blocking read
    queue.enqueueReadBuffer(d_vecC, CL_TRUE, 0, vSize, h_vecC);
    for(int i=0; i < vLen; i++) 
      if(h_vecC[i] != h_vecA[i] + h_vecB[i]) 
	return(1);
    return(0);
  }
};

The complete source code for second.cpp follows:

#define PROFILING // Define to see the time the kernel takes
#define __NO_STD_VECTOR // Use cl::vector instead of STL version
#define __CL_ENABLE_EXCEPTIONS // needed for exceptions
#include <CL/cl.hpp>
#include <cstdlib>   // for exit, atoi, srand, rand
#include <fstream>
#include <iostream>
#include <string>

using namespace std;
 
void displayPlatformInfo(cl::vector< cl::Platform > platformList,
			 int deviceType)
{
  // print out some device specific information
  cout << "Platform number is: " << platformList.size() << endl;
    
  string platformVendor;
  platformList[0].getInfo((cl_platform_info)CL_PLATFORM_VENDOR, 
			  &platformVendor);
 
  cout << "device Type " 
       << ((deviceType==CL_DEVICE_TYPE_GPU)?"GPU":"CPU") << endl;
  cout << "Platform is by: " << platformVendor << "\n";
}
 
class OclTest {
private:
  int nIter, vLen,vSize;
  cl::Kernel kernel;
  int  *h_vecA, *h_vecB, *h_vecC;
  cl::Buffer d_vecA, d_vecB, d_vecC;
  
public:
  OclTest( cl::Context& context, cl::vector<cl::Device>& devices,
	   const char* kernelFile, int n_iter, int vecLen) {
    nIter = n_iter;
    vLen = vecLen;
    vSize = vLen * sizeof(int);
    
    // build the program from the source in the file
    ifstream file(kernelFile);
    string prog(istreambuf_iterator<char>(file),
		(istreambuf_iterator<char>()));
    cl::Program::Sources source( 1, make_pair(prog.c_str(),
					      prog.length()+1));
    cl::Program program(context, source);
    file.close();
 
    try {
      program.build(devices);
    } catch(cl::Error& err) {
      // Get the build log
      cerr << "Build failed! " << err.what() 
	   << '(' << err.err() << ')' << endl;
      cerr << "retrieving  log ... " << endl;
      cerr 
	<< program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(devices[0])
	<< endl;
      exit(-1);
    }
    kernel = cl::Kernel(program, "vec_iii_1d");
    
    // set up the kernel inputs
    h_vecA = new int[vLen];
    h_vecB = new int[vLen];
    h_vecC = new int[vLen];
    
    d_vecA = cl::Buffer(context, CL_MEM_READ_ONLY, vSize);
    d_vecB = cl::Buffer(context, CL_MEM_READ_WRITE, vSize);
    d_vecC = cl::Buffer(context, CL_MEM_READ_WRITE, vSize);
    
    kernel.setArg(0, d_vecA);
    kernel.setArg(1, d_vecB);
    kernel.setArg(2, d_vecC);
  }
  
  inline void initData(cl::CommandQueue& queue, cl::Event& event, 
		       int seed) 
  {
    srand(seed);
    for(int i=0; i < vLen; i++) h_vecA[i] = rand();
    for(int i=0; i < vLen; i++) h_vecB[i] = abs(rand()) % nIter;
    queue.enqueueWriteBuffer(d_vecA, CL_TRUE, 0, vSize, h_vecA);
    queue.enqueueWriteBuffer(d_vecB, CL_TRUE, 0, vSize, h_vecB);
  }
 
  inline cl::Kernel& getKernel() { return(kernel); }
 
  inline int goldenTest(cl::CommandQueue& queue, cl::Event& event) 
  {
    event.wait();
#ifdef PROFILING
    cl_ulong start=
      event.getProfilingInfo<CL_PROFILING_COMMAND_START>();
    cl_ulong end=
      event.getProfilingInfo<CL_PROFILING_COMMAND_END>();
    double time = 1.e-9 * (end-start);
    cout << "Time for kernel to execute " << time << endl;
#endif
 
    // bring data back to the host via a blocking read
    queue.enqueueReadBuffer(d_vecC, CL_TRUE, 0, vSize, h_vecC);
    for(int i=0; i < vLen; i++) 
      if(h_vecC[i] != h_vecA[i] + h_vecB[i]) 
	return(1);
    return(0);
  }
};
 
int main(int argc, char* argv[])
{
  int seed=4;
 
  if( argc < 5) {
    cerr 
      << "Use: {cpu|gpu} kernelFile n_iter vectorSize (k)" 
      << endl;
    exit(EXIT_FAILURE);
  }
 
  // handle command-line arguments
  const string platformName(argv[1]);
  int deviceType = platformName.compare("cpu")?
    CL_DEVICE_TYPE_GPU:CL_DEVICE_TYPE_CPU;
 
  const char* kernelFile = argv[2];
  unsigned int n_iter = atoi(argv[3]);
  unsigned int vecLen = 1000 * atoi(argv[4]);
 
  try {
    cl::vector< cl::Platform > platformList;
    cl::Platform::get(&platformList);
    
    displayPlatformInfo(platformList, deviceType);
    
    cl_context_properties cprops[3] = 
    {CL_CONTEXT_PLATFORM, 
     (cl_context_properties)(platformList[0])(), 0};
 
    cl::Context context(deviceType, cprops);
    
    cl::vector<cl::Device> devices = 
      context.getInfo<CL_CONTEXT_DEVICES>();
    
#ifdef PROFILING
    cl::CommandQueue queue(context, devices[0], 
			   CL_QUEUE_PROFILING_ENABLE);
#else
    cl::CommandQueue queue(context, devices[0], 0);
#endif
 
    OclTest test(context, devices, kernelFile, n_iter, vecLen);
 
    cl::Event event;
    test.initData(queue, event, seed);
 
    queue.enqueueNDRangeKernel(test.getKernel(), 
			       cl::NullRange, cl::NDRange(vecLen),
			       cl::NDRange(1), NULL, &event);
 
    if(test.goldenTest(queue, event) == 0) {
      cout << "test passed" << endl;
    } else {
      cout << "TEST FAILED!" << endl;
    }
    
  } catch (cl::Error error) {
    cerr << "caught exception: " << error.what() 
	 << '(' << error.err() << ')' << endl;
  }
  return EXIT_SUCCESS;
}

Building the executable is straightforward under Linux:

  1. Copy the source and place it in a file second.cpp.
  2. Set the environment variable for the OpenCL home. Following is one possible example when using a bash shell:
    export OCL_HOME=../ati-stream-sdk-v2.2-lnx64
  3. Compile second.cpp:
    g++ -I $OCL_HOME/include -L $OCL_HOME/lib/x86_64 \
        second.cpp -lOpenCL -o second
  4. Specify where to search for shared libraries:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$OCL_HOME/lib/x86_64

The OpenCL kernels for the integer addition tests performed in this tutorial are fairly straightforward: basically, d_vecA is added to d_vecB and the result is placed in d_vecC. However, to expose the performance difference between private and global memory, the tests perform the addition with a loop that increments the result by 1, b[tid] times, rather than with a single add. Integer operations are used so the results are exact.

Following is the try_slow_iii.cl kernel. (The ‘iii’ notation indicates that this kernel uses three integer vectors). This kernel is slow because the processor must decrement the b array in place in global memory.

__kernel void vec_iii_1d(__global int *a, __global int *b, 
                         __global int *c)
{
  size_t tid = get_global_id(0);
  c[tid] = a[tid];
  while(b[tid] > 0) {
    b[tid]--;
    c[tid]++;
  }
}

In comparison, the try_fast_iii.cl kernel uses private memory to speed the increment and decrement operations. Specifically, the private variable tmp contains the value of the a vector for each value of tid. Similarly, the private loop variable i is loaded with the value of b.

__kernel void vec_iii_1d(__global int *a, __global int *b,
                         __global int *c)
{
  size_t tid = get_global_id(0);
  int tmp = a[tid];
  for(int i=b[tid]; i > 0; i--) tmp++;
  c[tid] = tmp;
}

The following shows how to use the executable, along with some timings produced using an ATI Radeon HD 5870 GPU and an AMD Phenom™ II X6 1055T processor:

./second gpu try_slow_iii.cl 100 10000
Platform number is: 1
device Type GPU
Platform is by: Advanced Micro Devices, Inc.
Time for kernel to execute 11.1207
test passed
 
./second gpu try_fast_iii.cl 100 10000
Platform number is: 1
device Type GPU
Platform is by: Advanced Micro Devices, Inc.
Time for kernel to execute 0.0445558
test passed
 
./second cpu try_fast_iii.cl 100 10000
Platform number is: 1
device Type CPU
Platform is by: Advanced Micro Devices, Inc.
Time for kernel to execute 23.7571
test passed
 
./second cpu try_slow_iii.cl 100 10000
Platform number is: 1
device Type CPU
Platform is by: Advanced Micro Devices, Inc.
Time for kernel to execute 23.5225
test passed

Timing results show the significant performance difference between private and global memory on the GPU (roughly a 250x speedup in this test). The CPU runs demonstrate that an AMD Phenom II X6 1055T processor is not significantly affected by the difference between global and private memory in this test. Overall, the GPU is much faster.

Kernel execution time (seconds):

                   try_fast_iii.cl   try_slow_iii.cl
  GPU                    0.0445558           11.1207
  CPU (all cores)          23.7571           23.5225

Summary

It is exciting that OpenCL is becoming ever more prevalent in the technical community. For example, the latest Ubuntu Linux distribution, Maverick Meerkat (version 10.10), added OpenCL support. (Users with AMD hardware might need to apply a hotfix to correct for a recent Linux security update. Note that the driver download is nearly 100 MB.)

Expertly using OpenCL memory spaces to partition data and localize computational work is among the top three characteristics, and is probably the defining characteristic, that separate excellent OpenCL developers from everyone else. These are the programmers who can create high-performance applications that scale to large numbers of threads and across many devices. They are also the individuals who can create portable high-performance applications through their understanding of OpenCL device performance for a given application domain. As pointed out in the first article in this series, balance ratios help quantify which device characteristics a particular application or job mix requires. As is abundantly clear, the bandwidth of the link between the host and the OpenCL device, along with the memory bandwidth of the device, are key performance metrics for most applications.

Finally, this tutorial has provided a quick and basic introduction to the OpenCL C++ wrapper API, along with several example kernels that can run on both CPU and GPU device types. The framework is general enough to allow testing a variety of other OpenCL kernels.

OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos.
