Part 6
in this series on portable parallelism with OpenCL™ showed how to mix OpenCL™
computation and OpenGL rendering within a single application. Primitive
restart, an addition to the OpenGL 3.1 standard, was used in the example source
code to greatly accelerate rendering by computing and rendering data on the
GPU, which avoided transfers across the PCIe bus and highlighted GPU performance.
This article will demonstrate how to create C/C++ plugins that can be dynamically loaded at runtime to add
massively parallel OpenCL capabilities to an already running application. Dynamically
loaded modules, delivered via shared objects or DLLs (Dynamic-Link Libraries), are a
popular design pattern for many applications – especially in the commercial
marketplace. Developers who understand how to use OpenCL in a dynamically
loaded runtime environment have the ability to create plugins that accelerate
the performance of existing applications by an order of magnitude or more –
simply by writing a new plugin that uses OpenCL.
As discussed in part 1
of this series, OpenCL application kernels are written in a variant of the ISO
C99 C-language specification. These kernels are compiled at runtime for the
destination device via the runtime OpenCL compiler. We know from previous
articles in this series that OpenCL already creates and dynamically loads
device-dependent code for use in an already running application. This dynamic
compilation capability is perfect for use in a plugin environment except when
some sequential operation is required to support a massively parallel OpenCL
kernel, to work within a legacy plugin framework, or when the developer wishes
to use multiple OpenCL devices. For these reasons, this tutorial will
demonstrate how to create C/C++ plugins that are dynamically loaded into an
application. These plugins can then create and load the massively parallel
OpenCL kernels.
Looking ahead, the next article in this series will extend
this plugin capability to incorporate OpenCL into heterogeneous workflows via a
general-purpose "click together tools" framework that can stream arbitrary
messages (vectors, arrays, and arbitrary, complex nested structures) within a
single workstation, across a network of machines, or within a cloud computing
framework. The ability to create scalable workflows is important because, for
many problems, data handling and transformation can be as complex as
the computation used to produce the desired result.
The reader should note that dynamically compiled OpenCL plugins
and kernels also open up the possibility of highly optimized kernel
generation based on problem parameters. A number of papers and examples can
be found on the Internet. Two examples are the presentation "Automatic
OpenCL Optimization for Locality and Parallelism Management" and the paper "Automatically
Generating and Tuning GPU Code for Sparse Matrix-Vector Multiplication from a
High-Level Representation".
OpenCL for Libraries and Plug-ins
Most programmers are familiar with using libraries. A
library is a collection of methods and functions that are contained in a single
file. By convention, libraries are usually denoted by the library name followed
by .a under Linux or.lib under Windows. Libraries are regularly
used by most programmers because they allow code to be shared and changed in a
modular fashion. During the compilation phase when building an executable, the
compiler must note calls to any external library methods or functions. It is up
to the linker, in a follow-on step, to complete the resolution of any unresolved
references to create an executable that can run on the computer.
Linking to
external methods can occur:
Statically
during the creation of the executable: Static linking means that all references
are resolved when the executable is built. Further, the executable contains the
explicit machine code to run all library functions used by the program.
Dynamically during load-time: Load time dynamic linkage
happens when the executable is loaded into memory. Just like static linking,
all symbols in the executable are resolved by linking with one or more .dll
(Windows) or .so files (Linux) at program startup. This form of linkage
provides fixed functionality (think of the C runtime library and other commonly
used libraries). A big advantage of load-time linking is that all
applications that link a library at load time will benefit from bug fixes and
performance improvements just by installing a revised library file at the
shared location. No applications need to be recompiled or relinked to use the
improved library code. Further, shared libraries keep individual executable
sizes small – a cost savings that is multiplied many times for libraries that
are commonly used.
Dynamically
during run-time: Run-time
linking is used to provide functionality to load plugins, which allows
generic functionality to be added without recompiling the application. Thus an
application can call a generic external function, func(),
whose
functionality depends entirely on the plug-in that is loaded by the application.
As mentioned previously, an application can literally write and compile the
plugin (when a compiler is available) by generating a problem-specific source
code, which is then compiled and linked into the already running application.
This trick allows application developers to create very highly optimized
functions across a general problem domain when provided specific problem
parameters. Many scientific applications utilize this capability to
significantly improve performance.
For more information about the benefits of using libraries,
DLLs (Dynamic-Link Libraries), and shared object files, see "Static,
Shared Dynamic, and Loadable Linux Libraries" or the general Wikipedia discussion
of DLLs.
Following is a simple C-language program that calls a
generic external function, func(), and prints the value of x
created by the generic function. An init(), func(), fini()
framework (similar to
a C++ object constructor, computational method, and destructor) is demonstrated
in this simple source code to provide additional generality. This is a common
design pattern in plugin programming, as the init() method lets the
programmer perform any initialization while the fini() method gives
the programmer the ability to perform any final processing and cleanup. In
addition, the C library printf() function is called.
#include <stdio.h>
extern int init();
extern int func(int *);
extern int fini();
int main()
{
int x;
init();
func(&x);
printf("Example of static linking\n");
printf("Valx=%d\n",x);
fini();
return 0;
}
Example 1: prog.c
The source code for dynCompile.cc extends this
generic behavior and adds the capability to dynamically compile the .so (Shared
Object) at runtime. The compiled method is then loaded and linked to the
running executable. The name of the source file is specified by the user on the
command-line. It is not hard to see how this application can be extended to
generate the source code that is then compiled to create the shared object plugin.
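A minimal sketch of such an extension follows. The helper name generatePluginSource() and the generated plugin body are hypothetical, chosen only to match the init()/func()/fini() convention used in this article; the point is that a problem parameter is baked directly into the source text before compilation.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper (not part of dynCompile.cc): generate the source for a
// problem-specific plugin. The run-time parameter 'value' is baked directly
// into the generated code, so the compiler can optimize for it.
std::string generatePluginSource(int value)
{
    std::ostringstream src;
    src << "extern \"C\" int init() { return 0; }\n";
    src << "extern \"C\" int func(int *x) { *x = " << value << "; return 1; }\n";
    src << "extern \"C\" int fini() { return 0; }\n";
    return src.str();
}
```

Writing this string to a .cc file and passing the filename to dynCompile would then compile and load the specialized plugin.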
The code walk-through of dynCompile.cc starts with
the specification of the include files needed to build dynCompile.cc.
#include <cstdlib>
#include <sys/types.h>
#include <dlfcn.h>
#include <string>
#include <iostream>
using namespace std;
Example 2: Part 1 of dynCompile.cc
Some global handles and pointer-to-function
types are defined.
void *lib_handle;
typedef int (*initFini_t)();
typedef int (*func_t)(int*);
Example 3: Part 2 of dynCompile.cc
The main()
method begins by parsing the command-line
argument, which contains the filename of the source to be built. The command to
build the .so is created and performed with a system()
call. For
a Linux environment, g++ is used. Windows users can call the Visual
Studio cl.exe compiler.
int main(int argc, char **argv)
{
if(argc < 2) {
cerr << "Use: sourcefilename" << endl;
return -1;
}
string base_filename(argv[1]);
base_filename = base_filename.substr(0,base_filename.find_last_of("."));
string buildCommand("g++ -fPIC -shared ");
buildCommand += string(argv[1])
+ string(" -o ") + base_filename + string(".so ");
cerr << buildCommand << endl;
if(system(buildCommand.c_str())) {
cerr << "compile command failed!" << endl;
cerr << "Build command " << buildCommand << endl;
return -1;
}
Example 4: Part 3 of dynCompile.cc
Assuming no errors occurred during the compilation phase,
the next step is to load the library created in the previous step. If there is
an error, the program exits.
string nameOfLibToLoad("./");
nameOfLibToLoad += base_filename;
nameOfLibToLoad += ".so";
lib_handle = dlopen(nameOfLibToLoad.c_str(), RTLD_LAZY);
if (!lib_handle) {
cerr << "Cannot load library: " << dlerror() << endl;
return -1;
}
Example 5: Part 4 of dynCompile.cc
Finally, the symbols are loaded and the pointers to the init(),
func(), and fini() methods are resolved.
initFini_t dynamicInit= NULL;
func_t dynamicFunc= NULL;
initFini_t dynamicFini= NULL;
dlerror();
dynamicFunc= (func_t) dlsym(lib_handle, "func");
const char* dlsym_error = dlerror();
if (dlsym_error) { cerr << "sym load: " << dlsym_error << endl; return -1;}
dynamicInit= (initFini_t) dlsym(lib_handle, "init");
dlsym_error = dlerror();
if (dlsym_error) { cerr << "sym load: " << dlsym_error << endl; return -1;}
dynamicFini= (initFini_t) dlsym(lib_handle, "fini");
dlsym_error = dlerror();
if (dlsym_error) { cerr << "sym load: " << dlsym_error << endl; return -1;}
Example 6: Part 5 of dynCompile.cc
Each function pointer is checked to verify that its symbol has been
resolved. If so, the function is called. As a convenience to the plugin author,
any of the calls can be made optional – meaning the method does not need to be
included in the compiled source file. All that is required is to modify the
logic in the previous step so a failure to resolve a reference does not cause
the application to exit.
if( (*dynamicInit)() < 0) return -1;
int x;
(*dynamicFunc)(&x);
cout << "Valx " << x << endl;
if( (*dynamicFini)() < 0) return -1;
Example 7: Part 6 of dynCompile.cc
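The relaxed, optional-method logic described above might look like the following sketch. The wrapper callOptional() and the function sampleInit() are hypothetical: the former stands in for the modified calling logic, the latter for a plugin method that dlsym() successfully resolved (a NULL pointer stands in for an unresolved one).

```cpp
#include <cstddef>

typedef int (*initFini_t)();

// Hypothetical wrapper: treat a plugin method as optional. A null pointer
// stands in for a symbol that dlsym() failed to resolve; calling it is
// silently skipped instead of aborting the application.
int callOptional(initFini_t fn)
{
    if (fn == NULL) return 0;   // method not provided by the plugin
    return (*fn)();             // method present: invoke it
}

int sampleInit() { return 1; }  // stands in for a resolved plugin method
```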
Finally, the libraries are unloaded and the application
exits.
dlclose(lib_handle);
}
Example 8: Part 7 of dynCompile.cc
Following is the source for a simple C++ plugin, cctest1.cc.
This source code is very straightforward.
#include <iostream>
using namespace std;
extern "C" int init() {
cerr << "Hello from Init" << endl;
return(0);
}
extern "C" int func(int *i)
{
cerr << "Hello from Func" << endl;
*i=100;
return(1);
}
extern "C" int fini()
{
cerr << "Hello from Fini" << endl;
return(0);
}
Example 9: cctest1.cc
The following script demonstrates how to build dynCompile.cc
and run the cctest1.cc source code:
echo "------ building dynCompile -----"
g++ -o dynCompile dynCompile.cc -ldl
echo "------ dynamic version of cctest1.cc -----"
./dynCompile cctest1.cc
Example 10: Linux commands to build and run dynCompile
It produces the following output:
$ ./dynCompile cctest1.cc
g++ -fPIC -shared cctest1.cc -o cctest1.so
Hello from Init
Hello from Func
Valx 100
Hello from Fini
Example 11: Output from dynCompile
The
application dynCompile.cc demonstrates how a sequential C/C++ plugin can
be built and loaded into a running application. Further, it opens up the
possibility for highly optimized automatic plugin generation based on problem
parameters.
Using OpenCL in Plugins
The following source code, testStatic.cc, calls a
shared object function myOCLfunction()
that creates and runs an OpenCL
kernel. Walking through the code, we see that the device context and queue are
set up as described in part 5 of
this article series. The user can specify whether the plugin runs on the CPU
or on a GPU.
#define PROFILING // Define to see the time the kernel takes
#define __NO_STD_VECTOR // Use cl::vector instead of STL version
#define __CL_ENABLE_EXCEPTIONS // needed for exceptions
#include <CL/cl.hpp>
#include <fstream>
#include <iostream>
using namespace std;
extern "C" int myOCLfunction(cl::CommandQueue&, const char*, int, char **);
int main(int argc, char* argv[])
{
if( argc < 3) {
cerr << "Use: {cpu|gpu} kernelFile" << endl;
exit(EXIT_FAILURE);
}
const string platformName(argv[1]);
const char* kernelFile = argv[2];
int ret= -1;
cl::vector<int> deviceType;
cl::vector< cl::CommandQueue > contextQueues;
if(platformName.compare("cpu")==0)
deviceType.push_back(CL_DEVICE_TYPE_CPU);
else if(platformName.compare("gpu")==0)
deviceType.push_back(CL_DEVICE_TYPE_GPU);
else { cerr << "Invalid device type!" << endl; return(1); }
try {
cl::vector< cl::Platform > platformList;
cl::Platform::get(&platformList);
cl::vector<cl::Device> devices;
for(int i=0; i < deviceType.size(); i++) {
cl::vector<cl::Device> dev;
platformList[0].getDevices(deviceType[i], &dev);
for(int j=0; j < dev.size(); j++) devices.push_back(dev[j]);
}
cl_context_properties cprops[] = {CL_CONTEXT_PLATFORM, (cl_context_properties)(platformList[0])(), 0};
cl::Context context(devices, cprops);
cout << "Using the following device(s) in one context" << endl;
for(int i=0; i < devices.size(); i++) {
cout << " " << devices[i].getInfo<CL_DEVICE_NAME>() << endl;
}
for(int i=0; i < devices.size(); i++) {
#ifdef PROFILING
cl::CommandQueue queue(context, devices[i],CL_QUEUE_PROFILING_ENABLE);
#else
cl::CommandQueue queue(context, devices[i],0);
#endif
contextQueues.push_back( queue );
}
ret = myOCLfunction(contextQueues[0], kernelFile, argc-3, argv+3);
} catch (cl::Error error) {
cerr << "caught exception: " << error.what()
<< '(' << error.err() << ')' << endl;
}
return ret;
}
Example 12: Source code for testStatic.cc
As the source code shows, the function myOCLfunction()
is called using the first device. If desired,
the reader can add an additional command-line argument to specify a device
number or change the code to run on multiple devices as shown in part 5 of
this tutorial series. The generic function is passed the queue of the device as
well as the name of the OpenCL kernel source file.
Following is a simple C++ plugin file to build and run the
OpenCL kernel. Walking through the code we see that the appropriate
preprocessor defines and includes are provided. Also, the oclBuildProgram()
method
has been adapted from part 5 to build the OpenCL kernel for the device. Note
that the device context and device information can be retrieved
from the OpenCL queue.
#define PROFILING // Define to see the time the kernel takes
#define __NO_STD_VECTOR // Use cl::vector instead of STL version
#define __CL_ENABLE_EXCEPTIONS // needed for exceptions
#include <CL/cl.hpp>
#include <fstream>
#include <iostream>
using namespace std;
#ifndef _OCL_BUILD
#define _OCL_BUILD
cl::Program oclBuildProgram( cl::CommandQueue& queue,
const char *kernelFile,
const char* myType)
{
cl::Context context = queue.getInfo<CL_QUEUE_CONTEXT>();
cl::Device device = queue.getInfo<CL_QUEUE_DEVICE>();
string buildOptions;
{ char buf[256];
sprintf(buf,"-D TYPE1=%s ", myType);
buildOptions += string(buf);
}
ifstream file(kernelFile);
string prog(istreambuf_iterator<char>(file),
(istreambuf_iterator<char>()));
cl::Program::Sources source( 1, make_pair(prog.c_str(),
prog.length()+1));
cl::Program oclProg(context, source);
file.close();
try {
cerr << " buildOptions " << buildOptions << endl;
cl::vector<cl::Device> foo;
foo.push_back(device);
oclProg.build(foo, buildOptions.c_str() );
} catch(cl::Error& err) {
cerr << "Build failed! " << err.what()
<< '(' << err.err() << ')' << endl;
cerr << "retrieving log ... " << endl;
cerr << oclProg.getBuildInfo<CL_PROGRAM_BUILD_LOG>(device)
<< endl;
exit(-1);
}
return(oclProg);
}
#endif
Example 13: Part 1 of myOCLfunction.cc
Also note that the function myOCLfunction() is
declared extern "C" to prevent C++ name mangling, which allows
this method to be called from a C-language program. This plugin parses additional
command line arguments passed from the running application. It then calls the oclBuildProgram()
method to compile the OpenCL kernel code for the device. In this example, the
type of vector the kernel operates on is defined to be an unsigned int.
Through the use of preprocessor defines, this OpenCL kernel can be compiled to
support any type (int, float, double) as discussed in part 4 of
this series.
extern "C" int myOCLfunction( cl::CommandQueue& queue, const char* kernelFile,
int argc, char *argv[])
{
if(argc < 1) {
cerr << "myOCLfunction requires a vector size on the command-line" << endl;
return -1;
}
int vecsize = atoi(argv[0]);
unsigned int* vec = new uint[vecsize];
int vecBytes = vecsize*sizeof(uint);
cl::Context context = queue.getInfo<CL_QUEUE_CONTEXT>();
cl::Program oclProg = oclBuildProgram(queue, kernelFile, "uint");
cl::Kernel funcKernel = cl::Kernel(oclProg, "func");
Example 14: Part 2 of myOCLfunction.cc
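The -D TYPE1 mechanism just described can be isolated in a small host-side sketch. The helper makeBuildOptions() is hypothetical; it simply mirrors the sprintf() call inside oclBuildProgram(), showing how substituting a different type name retargets the same kernel source.

```cpp
#include <string>

// Hypothetical helper mirroring the sprintf() call in oclBuildProgram():
// the TYPE1 preprocessor symbol selects the element type the kernel
// operates on at OpenCL compile time.
std::string makeBuildOptions(const std::string& typeName)
{
    return "-D TYPE1=" + typeName + " ";
}
```

Passing makeBuildOptions("uint").c_str() to cl::Program::build() would compile the kernel for unsigned int data; "float" or "double" retargets the identical source.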
Astute readers will recognize that this plugin is adapted
from the testSum.hpp code from part 5. A
vector is filled with random numbers and passed to an OpenCL kernel. The simple
double check on the host verifies that the OpenCL code correctly added each
random number in the vector to itself. If the OpenCL and host results agree, a
message "test passed" is printed. If incorrect results are found, the message
"TEST FAILED!" will be printed.
srand(0);
for(int i=0; i < vecsize; i++) vec[i] = (rand()&0xffffff);
cl::Buffer d_vec;
d_vec = cl::Buffer(context, CL_MEM_READ_WRITE, vecBytes);
funcKernel.setArg(0,vecsize); funcKernel.setArg(1,d_vec); funcKernel.setArg(2,0);
queue.enqueueWriteBuffer(d_vec, CL_TRUE,0, vecBytes, &vec[0]);
cl::Event event;
queue.enqueueNDRangeKernel(funcKernel,
cl::NullRange, cl::NDRange( vecsize ), cl::NDRange(1, 1), NULL, &event);
queue.enqueueReadBuffer(d_vec, CL_TRUE, 0, vecBytes, &vec[0]);
queue.finish();
{
int i;
srand(0);
for(i=0; i < vecsize; i++) {
unsigned int r = (rand()&0xffffff);
r += r;
if(r != vec[i]) break;
}
if(i == vecsize) {
cout << "test passed" << endl;
} else {
cout << "TEST FAILED!" << endl;
}
}
delete [] vec;
return EXIT_SUCCESS;
}
Example 15: Part 3 of myOCLfunction.cc
These two example codes can be compiled with the following
commands under Linux:
echo "---------------"
g++ -c -I $AMDAPPSDKROOT/include myOCLfunction.cc
g++ -I $AMDAPPSDKROOT/include -fopenmp testStatic.cc myOCLfunction.o -L $AMDAPPSDKROOT/lib/x86_64 -lOpenCL -o testStatic
This program can be adapted to dynamically load the plugin
and call myOCLfunction(), as seen in the source code for testDynamic.cc below.
The modified code performs the dynamic load and link just as dynCompile.cc did.
Note that the name of the shared object file is passed via a command-line
argument.
#define PROFILING // Define to see the time the kernel takes
#define __NO_STD_VECTOR // Use cl::vector instead of STL version
#define __CL_ENABLE_EXCEPTIONS // needed for exceptions
#include <CL/cl.hpp>
#include <fstream>
#include <iostream>
#include <dlfcn.h>
using namespace std;
typedef int (*func_t)(cl::CommandQueue&, const char*, int, char **);
int main(int argc, char* argv[])
{
if( argc < 4) {
cerr << "Use: {cpu|gpu} kernelFile sharedObjectFile" << endl;
exit(EXIT_FAILURE);
}
const string platformName(argv[1]);
const char* kernelFile = argv[2];
const char* soFile = argv[3];
int ret= -1;
cl::vector<int> deviceType;
cl::vector< cl::CommandQueue > contextQueues;
if(platformName.compare("cpu")==0)
deviceType.push_back(CL_DEVICE_TYPE_CPU);
else if(platformName.compare("gpu")==0)
deviceType.push_back(CL_DEVICE_TYPE_GPU);
else { cerr << "Invalid device type!" << endl; return(1); }
try {
cl::vector< cl::Platform > platformList;
cl::Platform::get(&platformList);
cl::vector<cl::Device> devices;
for(int i=0; i < deviceType.size(); i++) {
cl::vector<cl::Device> dev;
platformList[0].getDevices(deviceType[i], &dev);
for(int j=0; j < dev.size(); j++) devices.push_back(dev[j]);
}
cl_context_properties cprops[] = {CL_CONTEXT_PLATFORM, (cl_context_properties)(platformList[0])(), 0};
cl::Context context(devices, cprops);
cout << "Using the following device(s) in one context" << endl;
for(int i=0; i < devices.size(); i++) {
cout << " " << devices[i].getInfo<CL_DEVICE_NAME>() << endl;
}
for(int i=0; i < devices.size(); i++) {
#ifdef PROFILING
cl::CommandQueue queue(context, devices[i],CL_QUEUE_PROFILING_ENABLE);
#else
cl::CommandQueue queue(context, devices[i],0);
#endif
contextQueues.push_back( queue );
}
string nameOfLibToLoad = string("./") + soFile;
void* lib_handle = dlopen(nameOfLibToLoad.c_str(), RTLD_LAZY);
if (!lib_handle) {
cerr << "Cannot load library: " << dlerror() << endl;
return -1;
}
func_t dynamicFunc = (func_t) dlsym(lib_handle, "myOCLfunction" );
const char* dlsym_error = dlerror();
if (dlsym_error) { cerr << "sym load: " << dlsym_error << endl; return -1;}
ret = (*dynamicFunc)(contextQueues[0], kernelFile, argc-4, argv+4);
} catch (cl::Error error) {
cerr << "caught exception: " << error.what()
<< '(' << error.err() << ')' << endl;
}
return ret;
}
Example 16: Source code for testDynamic.cc
The testDynamic executable and myOCLfunction.so file
are created under Linux with the following commands:
echo "---------------"
g++ -fPIC -shared -I $AMDAPPSDKROOT/include myOCLfunction.cc -o myOCLfunction.so
g++ -I $AMDAPPSDKROOT/include testDynamic.cc -L $AMDAPPSDKROOT/lib/x86_64 -lOpenCL -o testDynamic
Example 17: Commands to build testDynamic.cc under Linux
Running testStatic and testDynamic shows that
the plugin works correctly on both the CPU and a GPU in both applications. The
"test passed" messages demonstrate that the device and host results agree.
$ sh RUN
------- static test CPU ------
Using the following device(s) in one context
AMD Phenom(tm) II X6 1055T Processor
buildOptions -D TYPE1=uint
test passed
------- static test GPU ------
Using the following device(s) in one context
Cypress
buildOptions -D TYPE1=uint
test passed
------- dynamic test CPU ------
Using the following device(s) in one context
AMD Phenom(tm) II X6 1055T Processor
buildOptions -D TYPE1=uint
test passed
------- dynamic test GPU ------
Using the following device(s) in one context
AMD Phenom(tm) II X6 1055T Processor
buildOptions -D TYPE1=uint
test passed
Example 18: Output of testStatic and testDynamic
Including and Protecting the OpenCL Source
In many cases, the developer might not want to include the
OpenCL source code for the plugin in a separate file. Using multiple files
complicates the installation and can introduce hard-to-diagnose installation
and upgrade errors. Using a single file simplifies installation and can make
OpenCL plugin deployments more robust. In particular, commercial developers
will not want to release their OpenCL source code in a form that anyone can
easily read.
The simplest way to include the OpenCL source in the plugin
file is to use the Linux xxd
hexdump tool to create a string that
contains the OpenCL source code. The OpenCL method clCreateProgramWithSource()
used in part 1
of this article series can then be used to compile the source code from a
string.
The following command demonstrates how to create a
C-language string from a file, foo.txt, that contains the following:
This is a test file.
It spans multiple
lines.
Example 19: An example file, foo.txt, to be converted with xxd
Running "xxd -i foo.txt" results in the following output:
unsigned char foo_txt[] = {
0x54, 0x68, 0x69, 0x73, 0x20, 0x69, 0x73, 0x20, 0x61, 0x20, 0x74, 0x65,
0x73, 0x74, 0x20, 0x66, 0x69, 0x6c, 0x65, 0x2e, 0x0a, 0x49, 0x74, 0x20,
0x73, 0x70, 0x61, 0x6e, 0x73, 0x20, 0x6d, 0x75, 0x6c, 0x74, 0x69, 0x70,
0x6c, 0x65, 0x20, 0x6c, 0x69, 0x6e, 0x65, 0x73, 0x2e, 0x0a
};
unsigned int foo_txt_len = 46;
Example 20: Output of "xxd -i foo.txt"
The string foo_txt can be compiled into the plugin and passed
to clCreateProgramWithSource(). A disadvantage of using xxd is that
the OpenCL source string is no longer human-readable. However, this does not
protect the OpenCL kernel source, as it can be easily recovered from the .so or
executable file by simply running the UNIX strings command.
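Using the embedded array from a C++ plugin can be sketched as follows. The byte array is the xxd output reproduced from Example 20; the helper embeddedSource() is hypothetical, and an embedded OpenCL kernel source would be handled the same way.

```cpp
#include <string>

// xxd -i output from Example 20, compiled directly into the plugin.
unsigned char foo_txt[] = {
    0x54, 0x68, 0x69, 0x73, 0x20, 0x69, 0x73, 0x20, 0x61, 0x20, 0x74, 0x65,
    0x73, 0x74, 0x20, 0x66, 0x69, 0x6c, 0x65, 0x2e, 0x0a, 0x49, 0x74, 0x20,
    0x73, 0x70, 0x61, 0x6e, 0x73, 0x20, 0x6d, 0x75, 0x6c, 0x74, 0x69, 0x70,
    0x6c, 0x65, 0x20, 0x6c, 0x69, 0x6e, 0x65, 0x73, 0x2e, 0x0a
};
unsigned int foo_txt_len = 46;

// Wrap the embedded bytes in a std::string -- the form that can then be
// handed to clCreateProgramWithSource().
std::string embeddedSource()
{
    return std::string(reinterpret_cast<const char*>(foo_txt), foo_txt_len);
}
```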
Commercial developers can make it more difficult to get the
source code for their OpenCL kernels by encrypting and decrypting the source text
with a package like Keyczar
on Google Code. The encrypted source string can still be created using xxd.
Still, the OpenCL source code will reside in a buffer after decryption and
during the call to clCreateProgramWithSource(). While encryption does
make it harder, a motivated hacker can still find and print the source code
from that buffer.
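The scramble-then-recover round trip can be sketched with a toy XOR scrambler. To be clear, xorScramble() is purely illustrative and offers none of the protection of a real encryption package such as Keyczar; it only demonstrates that the recovered plaintext ends up in an ordinary buffer.

```cpp
#include <string>

// Toy XOR scrambling -- NOT real encryption, only an illustration of the
// decrypt-then-compile flow. Applying the same key twice restores the text.
std::string xorScramble(const std::string& text, char key)
{
    std::string out(text);
    for (std::string::size_type i = 0; i < out.size(); ++i)
        out[i] ^= key;   // flip bits; a second pass with the key undoes this
    return out;
}
```

After the second xorScramble() call, the kernel source sits in a plain std::string, ready for clCreateProgramWithSource() and visible to anyone who can inspect process memory.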
As mentioned
in part
1, offline compilation can create the OpenCL device binary for specific
devices. Just as the application binary provides protection against reverse
engineering, so do OpenCL binaries obfuscate the OpenCL kernel code. Again, this
binary can be included in the source code using the output from xxd. AMD
provides a knowledge base article explaining how to
perform offline kernel compilation. A technical limitation of this method
is that only pre-compiled devices can be supported by the plugin. Depending on
the business model, this could be an advantage or drawback as plugins delivered
to the customer will only be able to support specific devices for which they
have pre-compiled OpenCL binaries.
Summary
With the ability to create OpenCL plugins, application
programmers can write and support generic applications that
deliver accelerated performance when a GPU is present and CPU-based performance
when a GPU is not available. These plugin architectures are well understood and
a convenient way to leverage existing applications and code bases. They also
help preserve existing software investments.
The ability to dynamically compile OpenCL source code and
link it into a running application opens a host of opportunities for optimizing
code generators. This capability is part of the foundation upon which the
portable parallelism of OpenCL resides. As mentioned in this article and
utilized in scientific computation, dynamically generating optimized code for
specific parameter sets in a general problem domain can achieve very high
performance – far beyond what a single "one size fits all" generic code can
deliver.
The next article in this series will extend this capability to
exploit hybrid CPU/GPU computation in heterogeneous workflows so developers can
capitalize on GPU acceleration and CPU capabilities in their production workflows.