Introduction
CUDA is Nvidia's proprietary parallel computing platform and programming model for running software on GPUs. It is widely used among researchers and industry practitioners to accelerate computationally heavy workloads without requiring a wholly unfamiliar workflow compared to traditional software development. Adopting CUDA also grants immediate access to a wide array of existing libraries, as well as a number of tools to debug and visualize CUDA code.
In this article, we'll be explaining how one might port CUDA code to Intel's oneAPI toolkits, and in particular, port a CUDA kernel to Intel's DPC++ compiler. The "oneAPI" toolkits refer to the Data Parallel C++ (or DPC++ for short) programming model along with a number of APIs intended to support high-performance computing applications. DPC++ is a compiler built on LLVM's Clang compiler, extending modern C++ capabilities with SYCL, an open standard designed to allow C++ applications to target heterogeneous systems.
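To give a feel for the model before we start the port, here is a minimal, self-contained DPC++ sketch. It is not part of the CUDA sample we'll port, and it assumes a reasonably recent DPC++ compiler with the SYCL 2020 headers and namespace; it simply fills a small buffer on whatever device the default selector picks:
#include <sycl/sycl.hpp>
#include <cstdio>

int main()
{
    sycl::queue q; // Default selector: a GPU, an FPGA emulator, or the host CPU
    int* data = sycl::malloc_shared<int>(16, q); // Unified shared memory, visible to host and device

    // Enqueue a kernel as an ordinary C++ lambda; each work-item fills one element
    q.parallel_for(sycl::range<1>{16}, [=](sycl::id<1> i) {
        data[i] = static_cast<int>(i[0]) * 2;
    }).wait();

    std::printf("ran on %s, data[7] = %d\n",
                q.get_device().get_info<sycl::info::device::name>().c_str(), data[7]);
    sycl::free(data, q);
    return 0;
}
Note that the kernel is just a lambda submitted to a queue; there is no separate kernel language, which is a recurring theme throughout the rest of this article.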
Why Port CUDA to oneAPI?
You might be wondering why we'd want to do such a port, given CUDA's widespread usage in the community for image analysis, machine learning, and more. In short, there are a few compelling advantages to Intel's platform worth considering.
First, DPC++ can target FPGA accelerators as easily as it can target GPUs.
Second, DPC++ is built on top of Clang and open standards produced by the Khronos Group. Intel is keen on upstreaming its DPC++ work to the LLVM project, which would make SYCL-based acceleration, including the various parallel STL algorithms, available from a stock Clang toolchain.
Third, it's worth porting code to DPC++ to at least understand how the general programming model works, which may translate to new insights into how best to architect code that requires acceleration in the future.
Perhaps the greatest potential benefit is the ability to deploy oneAPI software to the Intel DevCloud, a cloud environment that puts CPUs, GPUs, and FPGAs at your disposal. Much of the hardware available there is cutting edge and perhaps impractical to experiment on at home or in the office. For example, with a few commands, you can easily benchmark your application against both an Arria 10 FPGA and a Xeon Platinum processor. There are more subjective reasons why one might prefer to write DPC++ code as well: DPC++ programs read as standard C++, without the foreign kernel-launch syntax and attributes you might be accustomed to coming from CUDA.
The CUDA Application
The first order of business is to select a CUDA application to port for demonstration purposes. Here, we'll be porting the venerable Mandelbrot fractal generator, since we're more interested in learning the DPC++ programming model than in the application itself. Let's start with a quick scan of the CUDA code. First, we need routines to multiply two complex numbers, add two complex numbers, and compute the squared magnitude of a complex number:
struct complex
{
    float r;
    float i;
};

__device__ complex operator*(complex a, complex b)
{
    return {a.r * b.r - a.i * b.i, a.r * b.i + a.i * b.r};
}

__device__ complex operator+(complex a, complex b)
{
    return {a.r + b.r, a.i + b.i};
}

__device__ float sqr_magnitude(complex c)
{
    return c.r * c.r + c.i * c.i;
}
In CUDA, functions we intend to invoke on the accelerator device require the __device__ attribute. Next, we'll write the function that computes the Mandelbrot "value" associated with each pixel:
constexpr static uint32_t max_iterations = 12000u;

__device__ uint32_t mandelbrot_pixel(complex c)
{
    complex z = {};
    uint32_t i = 0;
    for (; i != max_iterations; ++i) {
        complex z_next = z * z + c;
        if (sqr_magnitude(z_next) > 4.0) {
            return i;
        } else {
            z = z_next;
        }
    }
    return i;
}
Briefly, this function accepts a constant c, initializes a variable z to 0, then repeatedly evaluates z_next = z^2 + c; z = z_next until the magnitude of z_next exceeds 2 (equivalently, until its squared magnitude exceeds 4, which is what the code checks). The function returns the number of iterations needed for this event to occur, capped at max_iterations. Next, we need the kernel function which will evaluate and write out the color of the pixel corresponding to each invocation.
__global__ void mandelbrot(uint8_t* output, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    if (x >= width || y >= height)
    {
        return;
    }

    complex c = {static_cast<float>(x) / width * 3.f - 2.f,
                 static_cast<float>(y) / height * 3.f - 1.5f};

    uint32_t iterations = mandelbrot_pixel(c);
    uint32_t color = iterations * 6;
    output[y * width + x] = color >= 256 ? 0 : color;
}
The mandelbrot function uses the __global__ attribute to mark it as a kernel: a function invoked from the host but executed on the device. The pattern employed in this kernel is fairly common, namely, the block size, block index, and thread index are used to associate a specific invocation of the kernel with a pixel in the output raster. The coordinates of this pixel are used to evaluate a color, which is then written out to the output buffer. As every invocation dispatched targets a unique pixel in the output raster, each invocation can operate independently of all other invocations, without needing locks, atomics, or any other synchronization primitive.
Finally, we need a main function to allocate device memory to write the output to, dispatch our kernel, allocate host memory, read back the output, and finally write the output to an image. For emitting the image, we'll use the single-header stb_image_write.h from the venerable stb library collection for simplicity.
int main(int argc, char* argv[])
{
    constexpr static int width = 512;
    constexpr static int height = 512;
    constexpr static size_t buffer_size = width * height;

    // Allocate one byte per pixel on the device
    uint8_t* buffer;
    cudaMalloc(&buffer, buffer_size);

    // Dispatch one 8x8 block of threads per workgroup, with enough workgroups to cover the image
    dim3 workgroup_dim{8, 8};
    dim3 workgroup_count{width / workgroup_dim.x, height / workgroup_dim.y};
    mandelbrot<<<workgroup_count, workgroup_dim>>>(buffer, width, height);
    cudaDeviceSynchronize();

    // Read the result back to the host and release the device allocation
    uint8_t* host_buffer = reinterpret_cast<uint8_t*>(std::malloc(buffer_size));
    cudaMemcpy(host_buffer, buffer, buffer_size, cudaMemcpyDeviceToHost);
    cudaFree(buffer);

    // Emit a single-channel (grayscale) PNG
    stbi_write_png("mandelbrot.png", width, height, 1, host_buffer, width);
    std::free(host_buffer);
    return 0;
}
Finally, if you're following along, please be sure to include the following needed headers at the top of the file:
#include <cmath>
#include <cstdint>
#include <cstdlib>
#include <iostream>
#define STB_IMAGE_WRITE_IMPLEMENTATION
#include "stb_image_write.h"
After compiling and running the code above, you should be able to produce a grayscale PNG image of the Mandelbrot set.
Porting to DPC++
To perform a port from CUDA to DPC++, we could painstakingly "translate" the CUDA code by hand. Fortunately, Intel provides the DPC++ Compatibility Tool to streamline the porting process. While the tool was still in "beta" status at the time of this writing, I had no issues porting the CUDA Mandelbrot code.
First, you'll need to ensure that you have both DPC++ and the compatibility tool installed on your machine. The simplest way to do this is to install the oneAPI toolkits. Both the compiler and compatibility tool are provided in the base toolkit. If you need to target the CUDA backend, you may need to build the toolchain yourself with CUDA support, as the CUDA-enabled toolchain is still experimental. To learn how to do this, please consult the documentation here. Additionally, if you are currently working with CUDA 11, you must have CUDA 10 or 9 installed alongside it for the compatibility tool to run.
Next, after opening a shell in the operating system of your choice, you'll need to invoke a shell script to locally modify various environment variables needed to ensure the Intel oneAPI libraries and executables are locatable. On UNIX-like platforms, the script is called setvars.sh and is located in the installation root of the toolkit (typically something like /opt/intel/oneapi or ~/intel/oneapi). On Windows, the script is provided as setvars.bat and is similarly located in the installation root.
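For example, on Linux with a system-wide installation in the default location, configuring the current shell looks like this:
source /opt/intel/oneapi/setvars.sh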
After verifying that the PATH is set correctly, the dpct compatibility tool should be available. For our simple example with a single main.cu file, the following command is enough to perform the conversion and emit the output to the dpct_output folder in the same directory:
dpct --extra-arg="-std=c++17" main.cu --out-root dpct_output
The dpct_output directory will typically contain C++ source files with the .dp.cpp extension. In addition, you may see various YAML files which enumerate the code replacements made to the files in your project. While they do not participate in compilation, they can be handy for understanding what operations were performed and for troubleshooting any issues that arise.
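The generated code is fairly verbose and leans on the tool's dpct helper headers, but the heart of the ported program boils down to something like the hand-written SYCL sketch below. Treat it as an illustration of the resulting programming model rather than the literal dpct output; it reuses width, height, buffer_size, and host_buffer as declared in the CUDA main above, and complex and mandelbrot_pixel become plain C++ functions with the __device__ attribute simply removed:
// #include <sycl/sycl.hpp> at the top of the file
sycl::queue q; // Selects a default device (GPU, FPGA emulator, or CPU)

uint8_t* buffer = sycl::malloc_device<uint8_t>(buffer_size, q);

// 8x8 work-items per work-group mirror the CUDA block dimensions
sycl::range<2> local{8, 8};
sycl::range<2> global{height, width};

q.parallel_for(sycl::nd_range<2>{global, local}, [=](sycl::nd_item<2> item) {
    int y = static_cast<int>(item.get_global_id(0));
    int x = static_cast<int>(item.get_global_id(1));
    if (x >= width || y >= height) {
        return;
    }
    complex c = {static_cast<float>(x) / width * 3.f - 2.f,
                 static_cast<float>(y) / height * 3.f - 1.5f};
    uint32_t color = mandelbrot_pixel(c) * 6;
    buffer[y * width + x] = color >= 256 ? 0 : color;
}).wait();

q.memcpy(host_buffer, buffer, buffer_size).wait();
sycl::free(buffer, q);
In short, the <<<...>>> launch becomes a parallel_for over an nd_range, the blockIdx/threadIdx arithmetic becomes nd_item queries, and cudaMalloc/cudaMemcpy map to USM allocation and memcpy calls on the queue.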
To compile the code and test it, invoke the following commands:
mkdir build
cd build
dpcpp ../main.dp.cpp -o mandelbrot -lsycl -lOpenCL
On Windows, you'll want to emit an executable with the .exe extension instead. In the same terminal, executing the mandelbrot program should generate an identical image to what we produced above with CUDA.
One small gotcha you may find is that invoking the executable produced above in a new terminal or from your file explorer may result in runtime errors complaining that various shared libraries could not be located. This is because, by default, dpcpp links the sycl library dynamically, which is useful for the program to receive passive updates should the oneAPI installation be upgraded in the future. To remedy this issue, you may wish to either ship the library in the same directory as the executable, or modify the library load path.
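On Linux, the quickest fixes are to re-source the environment script in the new shell, or to extend the load path yourself; the runtime directory below is illustrative and varies by oneAPI version and installation location:
# Either re-run the environment script in the new shell...
source /opt/intel/oneapi/setvars.sh
# ...or point the loader at the DPC++ runtime libraries directly
export LD_LIBRARY_PATH=/opt/intel/oneapi/compiler/latest/linux/lib:$LD_LIBRARY_PATH
./mandelbrot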
Deploying to the Intel DevCloud Platform
To wrap up our port, let's deploy our application to Intel's DevCloud. This will allow us to experiment with hardware provided by Intel. To begin, create an account via the DevCloud sign-up page. Afterwards, follow the unique login link that is subsequently emailed to you and SSH into the provisioned DevCloud instance. The page you are redirected to immediately after sign-in should contain instructions on how to perform this connection on your OS. For the most part, this amounts to a Host entry in your SSH configuration, remapping devcloud to a proxy connection with your credentials.
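As an illustration, the resulting entry in ~/.ssh/config looks roughly like the following; the user name, key file, and proxy host here are placeholders, so use the exact values from your sign-up instructions:
Host devcloud
    User uXXXXX
    IdentityFile ~/.ssh/devcloud-access-key-XXXXX.txt
    ProxyCommand ssh -T -i ~/.ssh/devcloud-access-key-XXXXX.txt guest@ssh.devcloud.intel.com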
Afterwards, we can use scp to transfer our ported source files to the DevCloud instance:
scp -r dpct_output devcloud:~/mandelbrot
In addition, you'll need a Makefile and a script to run your application. The following Makefile can be used to compile our example (note that the recipe lines must be indented with tabs):
CXX = dpcpp
CXXFLAGS = -o
LDFLAGS = -lOpenCL -lsycl
EXE_NAME = mandelbrot
SOURCES = main.dp.cpp
BINDIR = bin

all: main

main:
	[ -d $(BINDIR) ] || mkdir $(BINDIR)
	$(CXX) $(CXXFLAGS) $(BINDIR)/$(EXE_NAME) $(SOURCES) $(LDFLAGS)

run:
	$(BINDIR)/$(EXE_NAME)

clean:
	rm -rf $(BINDIR)/$(EXE_NAME)
A script to invoke make and execute the compiled program is also needed (here, we name it run.sh, but you can choose your own name and adapt the qsub command below accordingly):
source /opt/intel/inteloneapi/setvars.sh
make run
With this, we're now able to submit jobs to various hardware queues in the DevCloud. The full documentation for interfacing with the job queues is provided here. As a demonstration, the following commands dispatch our job, run it, and read back the result.
# ON DEVCLOUD
# Queue submission with a job label, working directory, and script to run
qsub -N mandelbrot -d . run.sh
# Show job submission status
qstat
# ON HOST
scp devcloud:~/mandelbrot/mandelbrot.png .
# Verify the image looks correct
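The same run.sh can also be pointed at a specific class of hardware by attaching a node property to the submission. The property name below (gpu) is illustrative; pbsnodes lists the properties actually available to your account:
# List available nodes and their properties
pbsnodes | grep properties | sort -u
# Submit the same job, requesting a node with a particular property
qsub -N mandelbrot -d . -l nodes=1:gpu:ppn=2 run.sh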
Conclusion
In this article, we've demonstrated how to port an existing CUDA application to DPC++, compile it, and run it on the DevCloud. Given familiarity with the commands and the simplicity of the original CUDA program, such a port and deployment to cutting-edge hardware can be completed in minutes. More sophisticated projects may require additional steps not covered here: for example, invoking the compatibility tool on a Visual Studio project or on a compilation database produced by a tool such as CMake. Also not covered are various features of the DevCloud, such as the ability to target specific classes of hardware or compute nodes, as well as scripts that time execution. To leverage these features and learn more about Intel's DPC++ compiler, please consult the documentation on the Intel Developer Zone.
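As a parting pointer, porting a Makefile- or CMake-based project with the compatibility tool typically follows the compilation-database route sketched below; the flags for your project will differ, and this assumes the intercept-build wrapper that ships alongside dpct:
# Capture the compile commands of the existing CUDA build
intercept-build make
# Feed the resulting compile_commands.json to the compatibility tool
dpct -p compile_commands.json --out-root dpct_output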
History
- 9th November, 2020: Initial version