Modern workloads demand a diverse set of architectures to operate effectively and efficiently. Each architecture exposes us to specific libraries, tools, and programming models. As a result, to run our application on a different device, we need to rewrite parts of our code. Rewriting code is a time-consuming task that prevents us from focusing exclusively on improving our algorithms.
We need a productive, performant, and heterogeneous programming model that crosses architectural boundaries. In other words, we want a single code base that runs transparently on all devices, regardless of the hardware’s architecture, while still delivering the best performance. This is the vision of oneAPI, an industry initiative. Intel provides a product implementation of the oneAPI specification as a set of comprehensive developer toolkits. A key component of Intel’s oneAPI product is the Intel® oneAPI DPC++/C++ Compiler, which aims to bring devices to some common ground so that software can run seamlessly on heterogeneous systems.
Intel has iteratively refined the compiler in line with industry standards and provided a reference implementation.
SYCL is a royalty-free, single-source embedded domain-specific language based on the C++17 standard. It specifies an abstraction layer for programming heterogeneous architectures. Building on SYCL, the compiler compiles modern C++ and SYCL code contained in a single source file, both for the CPU and for a wide range of accelerators such as graphics processing units (GPUs) and field-programmable gate arrays (FPGAs).
This article demonstrates how to compile a simple SYCL application with the compiler, showing how even a simple SYCL program can run on multiple devices of our choice with minimal effort. We’ll also review:
- The SYCL implementation
- A Dockerfile including OpenCV library, Intel oneAPI Base Toolkit, and other necessary dependencies to run your code
- Two scripts to one-click build and run your Docker container on a Linux system
Before starting this tutorial, learn about getting started with the Intel oneAPI DPC++/C++ Compiler. This documentation provides a basic understanding of compiling and executing an even simpler SYCL program on different platforms.
Using SYCL Code with the Intel oneAPI DPC++/C++ Compiler
To provide a practical overview of how to compile SYCL code with the compiler, we’ll build an application to enhance a slightly under-exposed image of the oneAPI logo below. We’ll use SYCL to brighten this image:
Let’s begin by understanding our application. Our image consists of 611×611 pixels, with three channels (red, blue, and green) for each pixel. In total, it contains more than one million values. To brighten the image, we’ll add 25 to each value. A sequential implementation would iterate over each value in the image.
But is this a sequential problem by definition?
The answer is no. Rather, this operation is highly parallel. Each pixel represents a separate computing operation that can be executed independently and in arbitrary order. So, we can manipulate all pixel values at once. This tutorial shows both the C++ serial implementation and the SYCL parallel version.
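To make the independence claim concrete, here is a minimal plain C++ sketch, separate from the SYCL program. The helper names `brighten`, `brighten_forward`, and `brighten_backward` are hypothetical and exist only for this illustration. Because each output value depends only on its own input value, visiting the values in a different order produces the same result, which is what allows a parallel device to process them in any order.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: brighten one 8-bit value by 25, saturating at 255.
inline std::uint8_t brighten(std::uint8_t v) {
    return static_cast<std::uint8_t>(std::clamp(v + 25, 0, 255));
}

// Visit the values front to back.
inline std::vector<std::uint8_t> brighten_forward(std::vector<std::uint8_t> data) {
    for (std::size_t i = 0; i < data.size(); ++i) data[i] = brighten(data[i]);
    return data;
}

// Visit the same values back to front.
inline std::vector<std::uint8_t> brighten_backward(std::vector<std::uint8_t> data) {
    for (std::size_t i = data.size(); i-- > 0;) data[i] = brighten(data[i]);
    return data;
}
```

Calling both functions on the same input vector should yield identical outputs, since no iteration reads a value written by another iteration.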
But first, let’s load our image using the OpenCV library to access pixel values. This code loads the image into memory with the cv::imread function, given a path to our image:
std::string image_path = cv::samples::findFile("img_lowexposure.jpg");
cv::Mat img = cv::imread(image_path, cv::IMREAD_COLOR);
Now, let’s sequentially manipulate pixels:
for (int i = 0; i < img.rows; ++i) {
    for (int j = 0; j < img.cols; ++j) {
        for (int c = 0; c < img.channels(); ++c) {
            // Add 25 and saturate to the valid 8-bit range (std::clamp needs <algorithm>).
            img.at<cv::Vec3b>(i, j)[c] = std::clamp(img.at<cv::Vec3b>(i, j)[c] + 25, 0, 255);
        }
    }
}
The C++ implementation is serial: three nested for loops visit every pixel and every channel one value at a time, and we cannot rely on the compiler to parallelize them automatically. Note that we clamp each result to the 0–255 range so that bright values saturate at 255 instead of overflowing the 8-bit channel.
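The reason for clamping can be shown with a small plain C++ sketch. The function names `add_wrapping` and `add_saturating` are hypothetical and used only for illustration. Without clamping, a sum above 255 is reduced modulo 256 when it is stored back into an 8-bit channel, so a bright value would turn dark instead of staying at the maximum.

```cpp
#include <algorithm>
#include <cstdint>

// Unclamped: the int sum is converted back to 8 bits and wraps modulo 256.
inline std::uint8_t add_wrapping(std::uint8_t v, int delta) {
    return static_cast<std::uint8_t>(v + delta);
}

// Clamped: the int sum is limited to [0, 255] before the conversion.
inline std::uint8_t add_saturating(std::uint8_t v, int delta) {
    return static_cast<std::uint8_t>(std::clamp(v + delta, 0, 255));
}
```

For example, `add_wrapping(240, 25)` gives 9 (265 modulo 256), while `add_saturating(240, 25)` gives 255. Values that stay in range, such as 100 + 25, are unaffected by the clamp.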
The equivalent implementation of our algorithm in SYCL is simple and takes advantage of parallel devices under the hood. We can easily target a specific architecture using the device selectors:
device d;
try {
    d = device(gpu_selector());
} catch (exception const& e) {
    std::cout << "Cannot select a GPU\n" << e.what() << "\n";
    std::cout << "Using a CPU device\n";
    d = device(cpu_selector());
}
// Query the device directly; the queue is only created in the next step.
std::cout << "Device: " << d.get_info<info::device::name>() << std::endl;
First, we attempt to select a GPU device. If none is available, the selection throws an exception and we fall back to the CPU. We then create a queue for the selected device; the queue is our connection to the device for submitting work.
Now, we can submit work to the queue with the submit function. But before that, we need to encapsulate our image in a buffer.
queue q(d);
buffer<uint8_t, 3> frame_buffer(img.data, range<3>(img.rows, img.cols, 3));
We then submit a command group function object to the queue. It takes a command group handler as its argument and encapsulates the accessor to the buffer (in our case, a read-write accessor) and the kernel.
q.submit([&](handler& cgh) {
    auto pixels = frame_buffer.get_access<access::mode::read_write>(cgh);
    cgh.parallel_for(range<3>(img.rows, img.cols, 3), [=](item<3> item) {
        uint8_t p = pixels[item];
        pixels[item] = sycl::clamp(p + 25, 0, 255);
    });
});
In the code above, we enqueue a parallel_for task, passing a function that each work item executes. The range class defines the iteration space, while each item identifies one instance of the kernel function within that space. Each instance operates on a single channel value of a single pixel, so we launch as many work items as there are values in the image (rows × columns × 3), and the device is free to execute them in parallel.
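The relationship between the three-dimensional iteration space and the image data in memory can be sketched in plain C++. The helpers `work_item_count` and `linear_index` are hypothetical names for this illustration, and the sketch assumes the image is stored contiguously in row-major order with interleaved channels (for a cv::Mat, that is the case when `isContinuous()` returns true).

```cpp
#include <cassert>
#include <cstddef>

// Total number of work items for a rows x cols image with `channels` values per pixel.
inline std::size_t work_item_count(std::size_t rows, std::size_t cols, std::size_t channels) {
    return rows * cols * channels;
}

// Row-major offset of (row, col, ch) in a contiguous, channel-interleaved image.
// Assumption: no padding between rows.
inline std::size_t linear_index(std::size_t row, std::size_t col, std::size_t ch,
                                std::size_t rows, std::size_t cols, std::size_t channels) {
    assert(row < rows && col < cols && ch < channels);
    return (row * cols + col) * channels + ch;
}
```

For the 611×611 image with three channels, `work_item_count(611, 611, 3)` is 1,119,963, and the last work item, at (610, 610, 2), maps to offset 1,119,962, the final byte of the image data.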
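Finally, we wait until the kernel’s work is complete before writing the final image. Keep in mind that a SYCL buffer is only guaranteed to copy its contents back to host memory (here, img.data) when the buffer is destroyed, so in the full program frame_buffer should go out of scope before cv::imwrite runs: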
q.wait_and_throw();
cv::imwrite("img_sycl.jpg", img);
We’re ready to compile our code. But first, we must set the environment variables by sourcing the setvars.sh script:
source /opt/intel/oneapi/setvars.sh
To compile our code, we use the Intel oneAPI DPC++/C++ Compiler through its dpcpp command, as follows. Note that newer oneAPI releases deprecate the dpcpp driver in favor of icpx -fsycl:
dpcpp src.cpp -std=c++17 -I/usr/local/include/opencv4 -lopencv_core -lopencv_imgcodecs -lopencv_highgui -o src
It’s important to include the OpenCV directory of header files and specify all the required libraries during compilation.
After executing your application, check the results. They should be the same as the images below. On the left is an under-exposed image of the oneAPI logo, and on the right is the same image with increased brightness.
Congratulations! You’ve implemented a SYCL application to brighten your pictures.
Use Cases and Benefits
SYCL aims to enable full heterogeneous programming in the face of rapidly growing hardware diversity. It minimizes boilerplate code and makes parallelism explicit, which simplifies code migration and increases our productivity: we can focus on improving our algorithms instead of rewriting code to run on new hardware.
The Intel oneAPI DPC++/C++ Compiler enables fast, productive heterogeneous programming across multiple architectures. Heterogeneous programming allows developers to easily select dedicated devices to accelerate specific parts of their workflow. For instance, we can accelerate image processing in our application using GPUs and FPGAs while still using the CPU for serial tasks. This programming model makes our application ready to run on a new accelerator with minimal changes in our codebase while ensuring the best performance.
Conclusion
Using the Intel oneAPI DPC++/C++ Compiler, we built a single SYCL program that takes advantage of heterogeneous computing to brighten an image. We can now deploy the same code to a CPU, a GPU, or another supported accelerator to use whatever computing power is available.
First, we implemented our application in C++. Then, with SYCL, we enabled heterogeneous computing, highlighting how our SYCL code intuitively expresses parallelism.
After setting the compiler’s environment variables, we compiled our SYCL program using the compiler and included some OpenCV functionality for image processing. Now, we can transparently target multiple devices to accelerate our application.
Interested in learning more? Explore the optimized libraries for specific workloads, debugging tools, and advanced analysis capabilities that the oneAPI toolkit provides. And, for a quick start without having to download the toolkit, try Intel DevCloud.