Device discovery is an important aspect of SYCL or any cross-architecture, heterogeneous parallel programming approach. My previous oneAPI articles focused on using SYCL and the oneMKL and oneDPL libraries to offload computations to an accelerator device; in other words, to control where the code executes. This article focuses on device discovery because writing portable code for heterogeneous systems requires the ability to query the system for information about the available hardware. For example, if we hardcode the SYCL device selector to use a GPU, but there’s no GPU in the system, the following statement will fail:
...
sycl::queue Q(sycl::gpu_selector_v);
...
terminate called after throwing an instance of 'sycl::_V1::runtime_error'
what(): No device of requested type available. Please check https://software.intel.com/content/www/us/en/develop/articles/intel-oneapi-dpcpp-system-requirements.html -1 (PI_ERROR_DEVICE_NOT_FOUND)
Aborted
The code isn’t portable to systems without a GPU. Instantiating the SYCL queue with the default selector instead of the GPU selector is guaranteed to work, but we lose control over where the queue submits work. The SYCL runtime chooses the device, e.g.:
...
sycl::queue Q(sycl::default_selector_v);
std::cout << "Running on: "
<< Q.get_device().get_info<sycl::info::device::name>()
<< std::endl;
...
Running on: Intel(R) Xeon(R) Gold 6128 CPU @ 3.40GHz
To write more robust heterogeneous parallel programs, let’s take a closer look at SYCL device discovery to answer the following questions:
- What accelerator devices are available?
- What device is a SYCL queue using?
- What device is a oneDPL execution policy using?
Robust Device Discovery
Let’s run some examples on the Intel® DevCloud for oneAPI because it has a variety of Intel hardware options and the latest Intel® oneAPI toolkits are already installed. The hardware is refreshed periodically, but the following compute nodes are available at the time of writing (December 2, 2022):
$ pbsnodes | grep properties | sort | uniq -c | sort -nr
79 properties = xeon,skl,gold6128,ram192gb,net1gbe,jupyter,batch
78 properties = xeon,cfl,e-2176g,ram64gb,net1gbe,gpu,gen9
26 properties = xeon,skl,gold6128,ram192gb,net1gbe,jupyter,batch,fpga_compile
25 properties = core,tgl,i9-11900kb,ram32gb,net1gbe,gpu,gen11
12 properties = xeon,skl,ram384gb,net1gbe,renderkit
12 properties = xeon,skl,gold6128,ram192gb,net1gbe,fpga_runtime,fpga,arria10
6 properties = xeon,icx,gold6348,ramgb,net1gbe,jupyter,batch
4 properties = xeon,icx,plat8380,ram2tb,net1gbe,batch
4 properties = xeon,clx,ram192gb,net1gbe,batch,extended,fpga,stratix10,fpga_runtime
As you can see, we have plenty of CPU, GPU, and FPGA options. (Users who have nondisclosure agreements with Intel can access prerelease hardware in the NDA partition of the DevCloud.) Let’s request a node and see what devices are available:
$ qsub -I -l nodes=1:gen11:ppn=2
This command requests interactive access to a single node with Intel® Processor Graphics Gen11. SYCL provides several built-in selectors in addition to the two you’ve already seen: default_selector_v, gpu_selector_v, cpu_selector_v, and accelerator_selector_v. Note that Intel also provides fpga_selector and fpga_emulator_selector extensions for FPGA development. They are in the sycl/ext/intel/fpga_device_selector.hpp header. See the chapter on FPGA Flow in the Intel oneAPI Programming Guide for more information about using SYCL on FPGAs.
The built-in selectors are mainly for convenience, but they can be robust when combined with exception handling, e.g.:
sycl::device d;
try {
    d = sycl::device(sycl::gpu_selector_v);
}
catch (sycl::exception const &e) {
    d = sycl::device(sycl::cpu_selector_v);
}
However, the SYCL runtime is still selecting the device. We may want more control, especially if multiple devices are available.
The following program lists the platforms and devices that are available in our compute node (note that you can also get this information using the sycl-ls command-line utility):
#include <sycl/sycl.hpp>
#include <iostream>

int main()
{
    for (auto platform : sycl::platform::get_platforms())
    {
        std::cout << "Platform: "
                  << platform.get_info<sycl::info::platform::name>()
                  << std::endl;
        for (auto device : platform.get_devices())
        {
            std::cout << "\tDevice: "
                      << device.get_info<sycl::info::device::name>()
                      << std::endl;
        }
    }
}
$ icpx -fsycl show_platforms.cpp -o show_platforms
$ ./show_platforms
Platform: Intel(R) FPGA Emulation Platform for OpenCL(TM)
Device: Intel(R) FPGA Emulation Device
Platform: Intel(R) OpenCL
Device: 11th Gen Intel(R) Core(TM) i9-11900KB @ 3.30GHz
Platform: Intel(R) OpenCL HD Graphics
Device: Intel(R) UHD Graphics [0x9a60]
Platform: Intel(R) Level-Zero
Device: Intel(R) UHD Graphics [0x9a60]
SYCL platforms are based on the OpenCL platform model, in which a host is connected to accelerator devices. This is apparent in the example output above: this system exposes OpenCL and oneAPI Level Zero platforms, and each platform has a device where a SYCL program can submit work. The GPU appears in two platforms, depending on whether we want to use the OpenCL or oneAPI Level Zero backend, and we also have CPU and FPGA emulation platforms. This information allows us to create a queue to submit work to any of these devices, e.g.:
#include <sycl/sycl.hpp>
#include <iostream>

int main()
{
    auto platforms = sycl::platform::get_platforms();
    sycl::queue Q1(platforms[1].get_devices()[0]);
    sycl::queue Q2(platforms[3].get_devices()[0]);
    std::cout << "Q1 mapped to "
              << Q1.get_device().get_info<sycl::info::device::name>()
              << std::endl;
    std::cout << "Q2 mapped to "
              << Q2.get_device().get_info<sycl::info::device::name>()
              << std::endl;
}
$ icpx -fsycl map_queues.cpp -o map_queues
$ ./map_queues
Q1 mapped to 11th Gen Intel(R) Core(TM) i9-11900KB @ 3.30GHz
Q2 mapped to Intel(R) UHD Graphics [0x9a60]
Note that the devices are hardcoded in the previous example, so remember to update the platform indices if you try this program on one of your systems.
Querying the SYCL Queue and Device
Queue creation is visible in the previous example codes, but this may not always be the case. For example, a SYCL queue is typically passed to oneAPI library functions. Therefore, it may be necessary to query the queue for information. What’s the target device that the queue is mapped to? What’s the backend API for the device? Is it an in-order queue (i.e., kernels must be executed in the order that they were submitted)? What are the vector widths or maximum work-item dimensions of the target device?
A library developer may use this information to select an optimal code path. Consequently, the SYCL queue class provides several member functions to query information: get_backend(), get_context(), get_device(), is_in_order(), etc. Likewise, the SYCL device class provides member functions to query device characteristics [e.g., the is_cpu(), is_gpu(), and get_info() functions]. The get_info() function in particular can be used to gather detailed information about the target device: vendor, vector widths, maximum work-item and image dimensions, memory characteristics, etc. The SYCL 2020 Specification contains the complete lists of device information descriptors and device aspects that can be queried.
Using such information to optimize code is beyond the scope of this article, but it will be the subject of a future article.
Custom Selectors
Each of the device selectors we've seen so far has been a built-in selector provided by the SYCL implementation. Behind the scenes, each device selector is implemented as a C++ callable that accepts a device and returns a score. The SYCL implementation calls the device selector to score every available device in the system, finally selecting the device with the highest score.
We can write our own custom device selectors by writing callables of the same form. As a simple first example, we can write a device selector with the same behavior as the built-in CPU selector by assigning a positive score to all CPU devices and a negative score to all other devices:
...
auto my_cpu_selector = [](const sycl::device& d)
{
    if (d.is_cpu())
    {
        return 1;
    }
    else
    {
        return -1;
    }
};
sycl::queue Q(my_cpu_selector);
...
Running on: Intel(R) Xeon(R) Gold 6128 CPU @ 3.40GHz
Because a device selector is just a function, we're free to use any properties of a device (e.g., aspects or device information descriptors) in conjunction with other variables in our program (e.g., command-line arguments) to score devices. This is very powerful, giving us complete control over how devices are scored and selected, and allowing us to ensure that the selected device meets our application's requirements. The example below gives a taste of what's possible, showing a device selector that ignores devices that do not support double-precision floating-point arithmetic, and which can be configured to prefer GPUs via a Boolean variable:
...
bool prefer_gpus = true;
auto my_selector = [=](const sycl::device& d)
{
    if (not d.has(sycl::aspect::fp64))
    {
        return -1;
    }
    if (prefer_gpus and d.is_gpu())
    {
        return 1;
    }
    else
    {
        return 0;
    }
};
sycl::queue Q(my_selector);
...
SYCL recently added the aspect_selector function to help select devices that meet the programmer’s requirements. For example, the following statement selects a GPU device that supports half-precision while excluding emulated, fixed-function devices:
auto dev = sycl::device{sycl::aspect_selector(
    std::vector{sycl::aspect::fp16, sycl::aspect::gpu},
    std::vector{sycl::aspect::custom, sycl::aspect::emulated})};
At the time of writing, aspect_selector is not yet supported by the Intel® oneAPI DPC++/C++ Compiler, but it should be available soon.
Changing the oneDPL Execution Policy
The article The Maxloc Reduction in oneAPI (The Parallel Universe, Issue 48) showed how oneDPL uses the execution policy to offload functions to accelerators. The code examples simply used the oneapi::dpl::execution::dpcpp_default policy, so let’s see how to use SYCL queues to modify the execution policy to explicitly control where oneDPL functions run:
#include <oneapi/dpl/execution>
#include <sycl/sycl.hpp>
#include <iostream>

int main()
{
    sycl::queue Q1(sycl::gpu_selector_v);
    auto gpu_policy = oneapi::dpl::execution::make_device_policy(Q1);
    std::cout << "GPU execution policy runs oneDPL functions on "
              << gpu_policy.queue().get_device()
                     .get_info<sycl::info::device::name>()
              << std::endl;

    sycl::queue Q2(sycl::cpu_selector_v);
    auto cpu_policy = oneapi::dpl::execution::make_device_policy(Q2);
    std::cout << "CPU execution policy runs oneDPL functions on "
              << cpu_policy.queue().get_device()
                     .get_info<sycl::info::device::name>()
              << std::endl;
}
$ icpx -fsycl onedpl_policy_example.cpp -o onedpl_example
$ ./onedpl_example
GPU execution policy runs oneDPL functions on Intel(R) UHD Graphics [0x9a60]
CPU execution policy runs oneDPL functions on 11th Gen Intel(R) Core(TM) i9-11900KB @ 3.30GHz
The previous program creates queues using the built-in CPU and GPU selectors, then uses these queues to set the oneDPL execution policy. We could just as easily have queried the platforms and devices and instantiated the queues as shown previously:
...
auto platforms = sycl::platform::get_platforms();
sycl::queue Q1(platforms[3].get_devices()[0]);
auto gpu_policy = oneapi::dpl::execution::make_device_policy(Q1);
sycl::queue Q2(platforms[1].get_devices()[0]);
auto cpu_policy = oneapi::dpl::execution::make_device_policy(Q2);
...
Once again, the devices are hardcoded in the previous example, so remember to update the platform indices for your system.
We’ve only scratched the surface of what SYCL provides for device discovery, and how programs can use platform and device information. Expect to see more about this in future issues of The Parallel Universe, especially as multi-device systems become more prevalent, and programmers begin targeting algorithms to specific devices.
Additional Resources