In this article, we'll write a small program using oneAPI and DPC++ from scratch and subsequently deploy and run our program on hardware in the DevCloud.
Introduction
Intel's oneAPI is an open, standards-based programming model that empowers C++ developers to target different accelerator architectures in a uniform manner. This is in contrast to specialized APIs such as Vulkan, CUDA, or D3D12, which target GPUs in a bespoke manner, or Verilog and VHDL, which target FPGAs. When programming with oneAPI, the same code can be used to target CPUs, GPUs, and FPGAs alike. The only thing you need to get started is a working installation of the oneAPI toolkit, which ships with DPC++, Intel's extension to the Clang compiler, used to compile C++ code for the various supported accelerator types.
An additional benefit of using oneAPI is the ability to deploy applications to Intel's DevCloud, a sandbox environment providing access to compute clusters provisioned with powerful CPUs and FPGAs. Practitioners familiar with the complexity of managing FPGA toolchains will likely be pleased to deploy software in a nearly turnkey environment, targeting FPGAs such as the Arria 10 without needing an immediate investment in hardware. In this article, we'll write a small program using oneAPI and DPC++ from scratch, then deploy and run it on hardware in the DevCloud.
A Sobel Convolution
The Sobel operator is a simple edge-detection filter that, when convolved with an image, produces a corresponding image with the edges emphasized. It is used primarily in computer vision or rendering applications as a precursor to downstream pipeline stages that perform inference or render various visual effects. To apply the Sobel filter, we need to convolve the image twice, once with each of the following kernels:
|1 0 -1|
|2 0 -2|
|1 0 -1|
and:
| 1 2 1|
| 0 0 0|
|-1 -2 -1|
Intuitively, the first kernel "detects" edges in the x-direction, and the second kernel detects edges in the y-direction. The application of a kernel at a single pixel is, in reality, a directional gradient computed using finite differences. The two gradients produced in this way can then be combined into the norm of the gradient at each pixel, sqrt(dx^2 + dy^2); larger norms correspond to edges.
For a 3x3 kernel, application to a single pixel requires 9 multiplies and 8 additions to accumulate the result, so 17 operations in total (not accounting for SIMD, MADD, or any other operation fusing; this is just a rough cost estimate). If an image has dimensions w by h, then we expect the two convolutions to require on the order of 34wh operations. There is a trick, however, which leverages the fact that the convolution operator is separable whenever the convolution matrix is rank 1. In the case of the x-derivative finite difference kernel, we can factor it as the following outer product:
|1 0 -1|   |1|
|2 0 -2| = |2| * [1 0 -1]
|1 0 -1|   |1|
Subsequently, the image can be convolved in two steps, first with the 3x1 horizontal kernel, then with the 1x3 vertical kernel. These two kernels require 10 operations per pixel, meaning that the entire Sobel convolution can be done in on the order of 20wh operations instead of 34wh (at the cost of requiring some additional intermediate memory).
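To make the arithmetic concrete, here is a rough scalar reference for the separable x-gradient, written as plain host code (the function and variable names here are our own and not part of the program built below; assume <vector> is included). It uses the same zero padding at the image borders that the SYCL kernels later in this article use:
// Scalar sketch of the separable x-gradient: pass 1 applies [1 0 -1] along
// each row, pass 2 applies the [1 2 1] column kernel to that intermediate.
std::vector<float> sobel_x_reference(const std::vector<float>& grey, int w, int h)
{
    std::vector<float> tmp(w * h);
    std::vector<float> dx(w * h);
    for (int y = 0; y < h; ++y)
    {
        for (int x = 0; x < w; ++x)
        {
            float left = x == 0 ? 0.f : grey[y * w + x - 1];
            float right = x == w - 1 ? 0.f : grey[y * w + x + 1];
            tmp[y * w + x] = left - right;
        }
    }
    for (int y = 0; y < h; ++y)
    {
        for (int x = 0; x < w; ++x)
        {
            float up = y == 0 ? 0.f : tmp[(y - 1) * w + x];
            float down = y == h - 1 ? 0.f : tmp[(y + 1) * w + x];
            dx[y * w + x] = up + 2.f * tmp[y * w + x] + down;
        }
    }
    return dx;
}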
The DPC++ program we write will perform the following steps:
- Load an image
- Initialize a device queue
- Convert the image to greyscale
- In parallel, perform the horizontal and vertical convolutions
- Combine the two gradient results into the output image
- Write the image to disk
Loading the Image
To read and write images, we'll be using the single-file libraries stb_image and stb_image_write respectively. After putting these files in your source tree, create a main.cpp file with the following headers included at the top.
#include <CL/sycl.hpp>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#define STB_IMAGE_WRITE_IMPLEMENTATION
#include "stb_image_write.h"
#define STB_IMAGE_IMPLEMENTATION
#include "stb_image.h"
Then, we can load an image like so:
int channels;
int width;
int height;
// The final argument forces 3 components per pixel in the returned data;
// `channels` receives the component count of the original file.
uint8_t* image = stbi_load(path, &width, &height, &channels, 3);
cl::sycl::buffer<uint8_t, 1> image_buffer{image, width * height * 3};
The memory is already available on the host after the stbi_load function executes. However, we also construct a buffer type to manage read and write barriers to the image later.
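For completeness, here is a hedged sketch of how the image path might be obtained and a failed load handled; the article's listing assumes path is already defined, and the argument handling below is our own addition:
// Hypothetical argument handling and load check (not part of the original
// listing): obtain the image path from the command line and bail out early
// if stb_image cannot decode the file.
int main(int argc, char** argv)
{
    if (argc < 2)
    {
        std::fprintf(stderr, "Usage: %s <image>\n", argv[0]);
        return EXIT_FAILURE;
    }
    const char* path = argv[1];

    int channels;
    int width;
    int height;
    uint8_t* image = stbi_load(path, &width, &height, &channels, 3);
    if (image == nullptr)
    {
        std::fprintf(stderr, "Failed to load %s: %s\n", path, stbi_failure_reason());
        return EXIT_FAILURE;
    }
    // ... the rest of the program follows ...
}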
Initializing a Device Queue
In order to submit hardware-accelerated commands, we need to construct a cl::sycl::queue. Queues are needed to marshal commands and memory to a device.
cl::sycl::queue queue{cl::sycl::default_selector{}};
In addition to the default selector (which attempts to pick the "best" device according to a set of heuristics), you can specify a gpu_selector, fpga_selector, host_selector, or even a custom selector you define. Without the cl::sycl::queue abstraction, we would need to implement custom logic to interact with different device drivers for each accelerator type. The chief benefit of writing SYCL code is the ability to target all the above with a single unified interface.
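For example, a minimal sketch of requesting a GPU explicitly and reporting which device was chosen might look like the following (note that device selection throws if no matching device is available, so a GPU-equipped system is an assumption here, not something the article's program requires):
// Sketch: explicitly request a GPU and print the name of the device the
// runtime selected. Construction throws if no GPU is available.
cl::sycl::gpu_selector selector;
cl::sycl::queue gpu_queue{selector};
std::printf("Running on: %s\n",
            gpu_queue.get_device().get_info<cl::sycl::info::device::name>().c_str());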
Converting the Image to Greyscale
To evaluate the edge gradients, we need to first convert the 3-channel color image to greyscale. On an individual pixel, we perform this mapping with the following function:
// Rec. 709 luminance coefficients applied to the normalized RGB channels
float luminance(uint8_t r, uint8_t g, uint8_t b)
{
    float r_lin = static_cast<float>(r) / 255;
    float g_lin = static_cast<float>(g) / 255;
    float b_lin = static_cast<float>(b) / 255;
    return 0.2126f * r_lin + 0.7152f * g_lin + 0.0722f * b_lin;
}
If you aren't familiar with color theory, just remember that green is perceptually brighter than red and blue, with blue being the least perceptually bright of the three.
Then, we can simply compute the luminance of each pixel in parallel like so:
cl::sycl::buffer<float, 1> greyscale_buffer{width * height};
queue.submit([&greyscale_buffer, &image_buffer, image, width, height](
cl::sycl::handler& h) {
auto data = greyscale_buffer.get_access<cl::sycl::access::mode::discard_write>(h);
auto image_data = image_buffer.get_access<cl::sycl::access::mode::read>(h);
h.parallel_for(cl::sycl::range<1>(width * height),
[image_data, data](cl::sycl::id<1> idx) {
int offset = 3 * idx[0];
data[idx[0]] = luminance(image_data[offset],
image_data[offset + 1],
image_data[offset + 2]);
});
});
The get_access<M> method of the SYCL buffer allows us to advertise that code submitted to the queue will access memory in a particular way. The SYCL runtime, in turn, will sequence queue submissions along with any needed memory synchronization. Furthermore, note that while we wrote the luminance function in vanilla C++, the compiler is able to compile it to executable code suitable for the device we target.
Horizontal and Vertical Convolutions
Next, we need to perform the convolutions needed to compute the edge gradients which we'll store in two buffers:
cl::sycl::buffer<float, 1> dx{width * height};
cl::sycl::buffer<float, 1> dy{width * height};
Now, the horizontal convolution:
{
cl::sycl::buffer<float, 1> dx_tmp{width * height};
queue.submit([&greyscale_buffer, &dx_tmp, width, height](cl::sycl::handler& h) {
auto data = greyscale_buffer.get_access<cl::sycl::access::mode::read>(h);
auto out = dx_tmp.get_access<cl::sycl::access::mode::discard_write>(h);
h.parallel_for(cl::sycl::range<2>(width, height),
[data, width, out](cl::sycl::id<2> idx) {
int offset = idx[1] * width + idx[0];
float left = idx[0] == 0 ? 0 : data[offset - 1];
float right = idx[0] == width - 1 ? 0 : data[offset + 1];
out[offset] = left - right;
});
});
queue.submit([&dx, &dx_tmp, width, height](cl::sycl::handler& h) {
auto data = dx_tmp.get_access<cl::sycl::access::mode::read>(h);
auto out = dx.get_access<cl::sycl::access::mode::discard_write>(h);
h.parallel_for(
cl::sycl::range<2>(width, height),
[data, width, height, out](cl::sycl::id<2> idx) {
int offset = idx[1] * width + idx[0];
float up = idx[1] == 0 ? 0 : data[offset - width];
float down = idx[1] == height - 1 ? 0 : data[offset + width];
float center = data[offset];
out[offset] = up + 2 * center + down;
});
});
}
The most important thing to note is that the dependency graph of operations is implicitly defined by the memory access modes we declare in each queue submission. For example, while we have no explicit synchronization between the first 3x1 convolution and the greyscale conversion, SYCL guarantees that the convolution happens after the greyscale pass because the greyscale pass produces data that the convolution reads.
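As an aside, queue::submit also returns a cl::sycl::event that the host can wait on if you ever want explicit ordering instead. The snippet below is only a sketch of that mechanism (the single_task body is a placeholder) and is not part of the Sobel program:
// Sketch: each submission returns an event; waiting on it blocks the host
// until that command group (and everything it depends on) has completed.
cl::sycl::event done = queue.submit([&dx](cl::sycl::handler& h) {
    auto data = dx.get_access<cl::sycl::access::mode::read>(h);
    h.single_task([data]() {
        // Placeholder kernel: reading through `data` here would observe the
        // finished horizontal gradient.
        (void)data;
    });
});
done.wait();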
The vertical convolution is then performed in the same way, except with different kernels:
{
cl::sycl::buffer<float, 1> dy_tmp{width * height};
queue.submit([&greyscale_buffer, &dy_tmp, width, height](
cl::sycl::handler& h) {
auto data = greyscale_buffer.get_access<cl::sycl::access::mode::read>(h);
auto out = dy_tmp.get_access<cl::sycl::access::mode::discard_write>(h);
h.parallel_for(cl::sycl::range<2>(width, height),
[data, width, out](cl::sycl::id<2> idx) {
int offset = idx[1] * width + idx[0];
float left = idx[0] == 0 ? 0 : data[offset - 1];
float right = idx[0] == width - 1 ? 0 : data[offset + 1];
float center = data[offset];
out[offset] = left + 2 * center + right;
});
});
queue.submit([&dy, &dy_tmp, width, height](cl::sycl::handler& h) {
auto data = dy_tmp.get_access<cl::sycl::access::mode::read>(h);
auto out = dy.get_access<cl::sycl::access::mode::discard_write>(h);
h.parallel_for(
cl::sycl::range<2>(width, height),
[data, width, height, out](cl::sycl::id<2> idx) {
int offset = idx[1] * width + idx[0];
float up = idx[1] == 0 ? 0 : data[offset - width];
float down = idx[1] == height - 1 ? 0 : data[offset + width];
out[offset] = up - down;
});
});
}
Notice that the vertical and horizontal gradients have no dependence on one another, so SYCL may execute them in parallel.
Combining the Gradients
For each pixel, we now have the gradient projected onto the x and y axes, so it's a simple matter to compute the magnitude of the gradient.
// Allocate a memory region shared between the host and device queue
uint8_t* out = reinterpret_cast<uint8_t*>(
cl::sycl::malloc_shared(width * height, queue));
queue.submit([&dx, &dy, width, height, out](cl::sycl::handler& h) {
auto dx_data = dx.get_access<cl::sycl::access::mode::read>(h);
auto dy_data = dy.get_access<cl::sycl::access::mode::read>(h);
h.parallel_for(cl::sycl::range<1>(width * height),
[dx_data, dy_data, out](cl::sycl::id<1> idx) {
float dx_val = dx_data[idx[0]];
float dy_val = dy_data[idx[0]];
// NOTE: if deploying to an accelerated device, math
// functions MUST be used from the sycl namespace
out[idx[0]] = cl::sycl::sqrt(dx_val * dx_val + dy_val * dy_val) * 255;
});
});
Here, we write the data back to the unified memory allocation after rescaling to an 8-bit greyscale format. Notice that because we read from both the dx and dy buffers, this work will be sequenced after both the horizontal and vertical edge convolutions finish.
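One caveat worth noting (our observation, not something the original listing handles): the two separable passes can produce gradient components with magnitude up to 4, so the combined magnitude can exceed 1.0 and the scaled value can overflow the 8-bit output. If that matters for your images, the write inside the kernel could be clamped, for example:
// Clamp the magnitude before converting so that strong edges saturate to
// white instead of overflowing the 8-bit output (sketch; the original
// listing writes the unclamped value).
float magnitude = cl::sycl::sqrt(dx_val * dx_val + dy_val * dy_val);
out[idx[0]] = static_cast<uint8_t>(cl::sycl::fmin(magnitude, 1.0f) * 255.0f);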
Writing Out the Result
Finally, we're ready to read out the result and write it to disk.
queue.wait();
stbi_write_png("edges.png", width, height, 1, out, width);
stbi_image_free(image);
cl::sycl::free(out, queue);
The wait is needed here because, unlike before, we don't have a get_access request to create an implicit barrier, and we are reading from the shared host-visible memory directly.
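As an alternative sketch (not what the article's listing does), if the result were written to a cl::sycl::buffer rather than a USM allocation, a host accessor would provide that barrier for us, blocking until the device work on the buffer had finished:
// Sketch: with a buffer-based result, constructing a host accessor blocks
// until any device kernels writing the buffer have completed, so no explicit
// queue.wait() is needed before reading the data on the host.
cl::sycl::buffer<uint8_t, 1> out_buffer{width * height};
// ... the gradient-combining kernel would write out_buffer via a
//     discard_write accessor instead of the USM pointer ...
auto host_view = out_buffer.get_access<cl::sycl::access::mode::read>();
stbi_write_png("edges.png", width, height, 1, host_view.get_pointer(), width);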
Deploying to DevCloud
Now that we have a working program, we can deploy the source code and any build scripts to the Intel oneAPI DevCloud. After signing up for the program, you should receive an email with instructions on how to obtain your SSH credentials to log in to the DevCloud platform. After following the instructions, you should be able to upload your source files and test image with the following commands:
scp main.cpp devcloud:~
scp stb_image.h devcloud:~
scp stb_image_write.h devcloud:~
scp peppers.png devcloud:~
This uploads your source files and test image to the home directory of your assigned DevCloud user account. Then, you can log in to deploy your program:
ssh devcloud
Create a script that compiles and runs your program like so:
#!/usr/bin/env bash
# run.sh
# Ensure the correct environment variables are set
source /opt/intel/inteloneapi/setvars.sh
# Compile our program
dpcpp main.cpp -o sobel -std=c++17 -fsycl -lOpenCL
# Invoke the program with our test image
./sobel peppers.png
Then, we can use the qsub (queue submission) command to invoke our script on the various compute nodes available on the DevCloud. To see a list of available hosts, run the pbsnodes command. This will list the nodes by ID with additional information, such as the type of processor and any accelerators available to each node.
To submit a job to a host with a GPU for example, we can run the following command:
qsub -l nodes=1:gpu:ppn=2 -d . run.sh
The options, in short, indicate that we want to run our script on a single node with a GPU, we want to occupy the node fully (the ppn=2 option), we want the working directory to be the current directory, and we want the node to invoke the run.sh script.
To see the status of your job, you can run qstat, which will produce output similar to the following:
Job ID Name User Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
681510.v-qsvr-1 run.sh u47956 0 R batch
The Job ID can be used as an argument to the qdel command in order to cancel a pending job.
Once the job completes, you'll have files like run.sh.o681510 and run.sh.e681510 in your current directory, corresponding to the script's stdout and stderr output respectively. If your program ran successfully, you should also have an edges.png image you can check for correctness. Log out using the exit command, and transfer the image back to your host with scp:
scp devcloud:~/edges.png .
Conclusion
In this article, we developed a C++ application using the SYCL runtime with additional extensions provided by the Intel DPC++ compiler. The application demonstrates how a unified programming model can target different architectures, as well as the abstractions the SYCL runtime provides for coordinating access to memory and authoring parallel code with implicit dependency graphs.
Finally, we showed how to deploy and test our code against different hardware profiles provided by the Intel DevCloud. To learn more about the SYCL runtime, the Intel DevCloud, or the Intel DPC++ compiler, you are encouraged to read more at the Intel DevZone.
History
- 9th November, 2020: Initial version