
Parallel Heterogeneous Programming In C++17/2x0 Using CL/SYCL-Model Specification On Raspberry Pi 4B+ IoT-Boards

10 Dec 2020, CPOL license, 24 min read
A practical guide to building parallel CL/SYCL code in C++17, targeting Arm/AArch64 architectures, and running it on Raspberry Pi 4B+ IoT-boards
This article provides guidelines, tips and a tutorial for building modern parallel code in C++17/2x0 that uses the Khronos CL/SYCL heterogeneous programming model specification, and for running it on the Raspberry Pi 4B+ IoT-boards. Software developers, system engineers and IoT enthusiasts will learn how to use the Khronos triSYCL and Aksel Alpay’s hipSYCL open-source library projects to deliver parallel code targeting the Arm/Aarch64 hardware architectures. The code is built with the latest GNU Compiler Collection (GCC) and LLVM/Clang toolchains, either compiled “natively” on a Raspberry Pi board or cross-compiled on an x86_64 Debian/Ubuntu-based development machine.

Automated Linux bash scripts (*.sh) for installing and configuring the GNU and LLVM/Clang toolchains and the open-source triSYCL/hipSYCL libraries from sources:


A pre-installed Ubuntu 20.04 LTS x86_64 host development virtual machine (VBox6-VHD) and a Raspbian Buster 10.6 SD-card image (*.img), available on Microsoft OneDrive:


Parallel code samples in C++17/2x0 for evaluating the open-source triSYCL/hipSYCL libraries, building the sources with the GNU and LLVM/Clang Arm/Aarch64 toolchains, and running them on Raspberry Pi 4B+ IoT-boards:


Readers can also evaluate the execution of parallel code in C++17 that implements the Max-Miner association rules learning (ARL) algorithm, developed with Aksel Alpay’s hipSYCL library and running on a Raspberry Pi 4B+ 4GB IoT-board with the Arm/Aarch64 hardware architecture:


Parallel Computing On Scale With IoT (An Idea...)

"SYCL (pronounced ‘sickle’) is a royalty-free, cross-platform abstraction layer of OpenCL 2.0, that enables code for heterogeneous processors to be written using standard ISO C++ with the host and kernel code for an application contained in the same source file..." - Khronos® Group, Inc., 2020.

Imagine a logistics company that delivers shipments and freight nationwide, from suppliers to customers that are typically located a large distance apart. The company performs the deliveries with the “heavy” cargo trucks it owns. In most cases, a truck's fuel consumption is one of the most essential factors, critically affecting the duration of transportation as well as the delivery costs. To achieve high-quality logistics and reduced delivery costs, the company uses one or more cloud-based software solutions that monitor fuel consumption in real time and provide the data analytics used in the logistics optimization process.

To monitor fuel consumption in real time, the logistics company installed IoT boards, each with multiple connected sensors, in every cargo truck used for delivery. Each IoT-board collects a variety of data on the amount of fuel consumed by its truck and sends it to servers within the solution’s IoT-cluster, which collect and pre-process these data before most of the processing is performed in the data center, on exa-scale. After timely processing, the data center provides various analytics and other inference used for optimization, such as fuel consumption and delivery cost estimates, optimal route recommendations, etc.

However, the exponentially growing number of shipments causes the company to expand its business by increasing the number of trucks used for delivery. As a result, the amount of data sent by the trucks to the IoT-cluster and to the data center is relentlessly growing, pushing the network bandwidth and the system resource utilization of the compute nodes close to their limits.

Upgrading the IoT-cluster’s hardware to the latest IoT-boards and devices, based on powerful symmetric multi-core CPUs and larger amounts of RAM, helps to overcome the problem of exhausted system resources and network bandwidth in many existing cloud-based solutions designed for massively processing big data in real time, and thus significantly improves both the quality of data analytics and the productivity of the entire cloud-based solution.

In 2016, ARM® Holdings released the powerful ARM® Cortex-A72, a 64-bit Armv8-A RISC CPU core. Quad-core Cortex-A72 processors are used by IoT-board vendors, such as Broadcom and the Raspberry Pi Foundation, for manufacturing tiny-sized nano-computers capable of performing complex, time-critical and “heavy” computations, on scale.

This, in turn, made it possible to deliver modern parallel code in C/C++ and other programming languages, implementing a vast number of computational processes with widespread HPC libraries and frameworks, such as OpenMP, TBB or MPI, and to execute it on the latest IoT-boards and nano-computers used as essential constituents of IoT-clusters and embedded systems.

At the same time, in 2020, Khronos® Group announced a new revision of its CL/SYCL programming model specification, an OpenCL abstraction layer for the heterogeneous compute platform (XPU). It improves the performance and productivity of parallel computations by offloading the execution of complex and “heavy” workloads to hardware acceleration targets, such as GPGPUs and FPGAs, rather than to host CPUs only.

Meanwhile, developers have contributed a number of open-source library distributions, such as Khronos triSYCL and Aksel Alpay’s hipSYCL, that implement the CL/SYCL programming model specification and can be used for designing parallel CL/SYCL code in C++17/2x0 and running it on IoT-boards and nano-computers with the Arm/Aarch64 hardware architectures.

This practical guide provides everything you need to know to get started with developing software for IoT-clusters that perform complex and time-critical computations in parallel, scale "heavy" execution workloads and, thus, significantly increase the performance and productivity of real-time data processing at the edge:

  • Setting up a Raspberry Pi 4B+ IoT-board, out-of-the-box
  • Developing a parallel code, using CL/SYCL programming model
  • Configuring a Debian/Ubuntu-based development machine (x86_64)
  • Installing and configuring the GNU's GCC/G++-10.x.x toolchain
  • Downloading and installing the Khronos CL/triSYCL library
  • Installing and configuring the Arm/Aarch64 LLVM/Clang-9.x.x "native" toolchain
  • Building and installing the Aksel Alpay's hipSYCL library, from its sources
  • Running a parallel code, implemented by using CL/triSYCL and hipSYCL libraries, on Raspberry Pi

Along with the tutorial and walkthroughs listed above, this article also demonstrates and explains several code samples in C++17, implemented using the CL/SYCL programming model specification, that perform parallel matrix multiplication or raise floating-point numbers to a specific power.

An Innovative Raspberry PI 4B+ IoT Boards Overview

The Raspberry Pi 4B+ IoT boards, based on ARM's symmetric multi-core 64-bit Armv8-A CPUs, provide high performance and, thus, productive parallel computing. Using the latest Raspberry Pi boards can substantially improve the speed-up of computational processes at the edge, such as collecting and pre-processing data in real time, prior to delivering it to a data center for processing, on exa-scale. Running these processes in parallel significantly increases the efficiency of cloud-based solutions that serve billions of client requests or provide data analytics and other inference.

Before we discuss building and running parallel code in C++17, designed using the CL/SYCL heterogeneous programming model specification for Raspberry Pi boards with the Arm/Aarch64 architecture, let’s take a short glance at the Raspberry Pi 4B+ boards and their technical specs:

Image 1

Raspberry Pi 4B+ IoT boards are manufactured with the Broadcom BCM2711B0 SoC chips, based on quad-core ARM® Cortex-A72 @ 1.5GHz 64-bit Armv8-A CPUs, providing the performance and scalability needed for parallel computing at the edge.

The Raspberry Pi boards are known as “reliable” and “fast” tiny-sized nano-computers, suitable for data mining and parallel computing. The hardware architectural features of ARM's symmetric multi-core 64-bit CPUs, such as DSP, SIMD, VFPv4 and hardware virtualization support, are capable of bringing a significant performance speed-up to IoT-clusters that massively process data at the edge.

Specifically, one of the most important advantages of the latest Raspberry Pi 4B+ boards is the low-power LPDDR4-3200 memory with 2, 4 or 8 GiB of RAM capacity, providing a large memory bandwidth that positively affects the performance of parallel computing in general. Boards with 4 GiB of RAM or more are strongly recommended for data mining and parallel computing. Also, the BCM2711B0 SoC chips are bundled with a variety of integrated devices and peripherals, such as Broadcom VideoCore VI @ 500MHz GPUs, PCI-E Gigabit Ethernet adapters, etc.

To build and run parallel code in C++17, implemented using the CL/SYCL heterogeneous programming model, the first thing we need is a Raspberry Pi 4B+ IoT-board with the latest Raspbian Buster 10.6 OS installed and configured for first use.

Here is a brief checklist of the hardware and software requirements that must be met beforehand:

Hardware

  • Raspberry Pi 4 Model B0, 4GB IoT Board
  • Micro-SD Card 16GB For Raspbian OS And Data Storage
  • DC Power Supply: 5.0V/2-3A via USB Type-C connector
    (minimum 3A - for data mining and parallel computing)

Software

  • Raspbian Buster 10.6.0 Full OS
  • Raspberry Pi Imager 1.4
  • MobaXterm 20.3 build 4396, or any other SSH-client

Now that we have a Raspberry Pi 4B+ IoT board, we can proceed with setting it up, out-of-the-box...

Setting Up a Raspberry Pi 4B IoT Board

Before we begin, we must download the latest release of the Raspbian Buster 10.6.0 Full OS image from the official Raspberry Pi repository. To write the Raspbian OS image to the SD-card, we will also need to download the Raspberry Pi Imager 1.4 application, available for a variety of platforms, such as Windows, Linux or macOS:

Additionally, we must download and install the MobaXterm application for connecting to the Raspberry Pi board remotely, via the SSH- or FTP-protocols:

Once the Raspbian Buster OS image and the Imager application have been downloaded and installed, we use the Imager application to do the following:

  1. Erase the SD-card, formatting it to the FAT32 filesystem, by default
  2. Extract the pre-installed Raspbian Buster OS image (*.img) to the SD-card

Once the steps above have been completed, remove the SD-card from the card-reader and plug it into the Raspberry Pi board’s SD-card slot. Afterwards, attach the micro-HDMI and Ethernet cables. Finally, plug in the DC power supply cable's connector and turn on the board. The system will boot into the Raspbian Buster OS installed on the SD-card, prompting you to perform several post-installation steps to configure it for first use.

Once the board has been powered on, make sure that all of the following post-installation steps have been completed:

  1. Open the bash-console and set the ‘root’ password:
    pi@raspberrypi4:~ $ sudo passwd root
  2. Login to the Raspbian bash-console with 'root' privileges:
    pi@raspberrypi4:~ $ sudo -s
  3. Upgrade the Raspbian's Linux base system and firmware, using the following commands:
    root@raspberrypi4:~# sudo apt update
    root@raspberrypi4:~# sudo apt full-upgrade
    root@raspberrypi4:~# sudo rpi-update
  4. Reboot the system, for the first time:
    root@raspberrypi4:~# sudo shutdown -r now
  5. Install the latest Raspbian's bootloader and reboot the system, once again:
    root@raspberrypi4:~# sudo rpi-eeprom-update -d -a
    root@raspberrypi4:~# sudo shutdown -r now
  6. Launch the 'raspi-config' setup tool:
    root@raspberrypi4:~# sudo raspi-config
  7. Complete the following steps, using the 'raspi-config' tool:

* Update the 'raspi-config' tool

Image 2

* Disable the Raspbian's Desktop GUI on boot:

System Options >> Boot / Autologin >> Console Autologin:

Image 3

* Expand the root ‘/’ partition size on the SD-card:

Image 4

After performing the Raspbian post-install configuration, reboot the system. After rebooting, you will be prompted to log in. Use the ‘root’ username and the password set previously to log in to the bash-console with root privileges.

Once you have logged in, install the following packages from the APT-repositories by using this command in the bash-console:

root@raspberrypi4:~# sudo apt install -y net-tools openssh-server

These two packages are required for configuring the Raspberry Pi's network interface and the OpenSSH-server, used for connecting to the board remotely, via the SSH-protocol, with MobaXterm.

Configure the board’s network interface ‘eth0’ by modifying /etc/network/interfaces, for example:

auto eth0
iface eth0 inet static
address 192.168.87.100
netmask 255.255.255.0
broadcast 192.168.87.255
gateway 192.168.87.254
dns-nameservers 192.168.87.254

Next, perform a basic configuration of the OpenSSH-server by uncommenting and setting these lines in /etc/ssh/sshd_config:

PermitRootLogin yes
StrictModes no

PasswordAuthentication yes
PermitEmptyPasswords yes

These settings enable the 'root' login into the bash-console, via the SSH-protocol, with password authentication. PermitEmptyPasswords additionally admits accounts that have no password set, so this configuration is only suitable for a trusted, isolated network.

Finally, try to connect to the board over the network, using the MobaXterm application and opening a remote SSH-session to the host with IP-address 192.168.87.100. You should be able to log in to the Raspbian bash-console with the credentials set previously:

Image 5

Developing a Parallel Code in C++17 using CL/SYCL Programming Model

Here's a tiny example, illustrating the code in C++17, implemented using the CL/SYCL-model abstraction layer:

C++
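#include <CL/sycl.hpp>

#include <cstdint>

constexpr std::uint32_t N = 1000;

int main() {
    // Create a queue bound to the default acceleration target
    cl::sycl::queue q{};

    // Submit a command group that launches N parallel loop iterations
    q.submit([&](cl::sycl::handler &cgh) {
        cgh.parallel_for<class Kernel>(cl::sycl::range<1>{N},
            [=](cl::sycl::id<1> idx) {
                // Do some work in parallel
            });
    });

    // Block until the submitted kernel has completed
    q.wait();

    return 0;
}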

The C++17 code fragment shown above is entirely based on the CL/SYCL programming model. It instantiates a cl::sycl::queue{} object with the default initializer list, for submitting SYCL-kernels for execution to the host CPU acceleration target, used by default. Next, it invokes the cl::sycl::queue::submit(...) method, passing a lambda-function that takes a single argument, a reference to the cl::sycl::handler{} object. The handler gives access to the methods that provide the basic kernel functionality, based on a variety of parallel algorithms, including the cl::sycl::handler::parallel_for(...) method.

This method implements a tight parallel loop, launched as a kernel. Each iteration of the loop is executed in parallel, conceptually by its own thread (a work-item, in SYCL terms). The cl::sycl::handler::parallel_for(...) method accepts two main arguments: a cl::sycl::range<>{} object and a lambda-function, invoked for each loop iteration. The cl::sycl::range<>{} object defines the number of parallel loop iterations executed in each dimension, which is useful when multiple nested loops are collapsed while processing multi-dimensional data.

In the code above, a cl::sycl::range<1>{N} object is used for scheduling N iterations of the parallel loop in a single dimension. The lambda-function passed to cl::sycl::handler::parallel_for(...) accepts a single argument, a cl::sycl::id<>{} object. Like cl::sycl::range<>{}, this object is a small vector-like container, each element of which is the index value of the current iteration in the corresponding dimension. Passed as an argument into the lambda-function's scope, it is used for retrieving the specific index values. The lambda-function's body contains the code that does the data processing, in parallel.

After the kernel has been submitted to the queue for execution, the code invokes the cl::sycl::queue::wait() method with no arguments to set a synchronization barrier, ensuring that no further host code is executed until the submitted kernel has completed its parallel work.

The CL/SYCL heterogeneous programming model is highly efficient and can be used for a number of applications.

However, Intel Corp. and CodePlay Software Inc. soon deprecated the support of CL/SYCL for hardware architectures other than the "native" x86_64. This made it impossible to deliver parallel C++ code targeting Arm/Aarch64 and other architectures by using their specific CL/SYCL libraries.

Presently, there are a number of open-source CL/SYCL library projects, developed by independent developers and enthusiasts, that support more hardware architectures than x86_64 only.

In 2019, Aksel Alpay, at Heidelberg University (Germany), implemented a library conforming to the latest CL/SYCL programming model specification and targeting a variety of hardware architectures, including the Raspberry Pi's Arm/Aarch64 architecture, and contributed the hipSYCL open-source library project distribution to GitHub (https://github.com/illuhad/hipSYCL).

Further in this article, we will discuss how to install and configure the LLVM/Clang-9.x.x compilers and toolchains and the hipSYCL library distribution, for delivering modern parallel code in C++17 based on this library.

Configuring a Debian/Ubuntu-Based Development Machine (x86_64)

There are basically two methods of building the CL/SYCL code in C++17, introduced above: by using the GNU's GCC/G++-10.x.x cross-platform toolchain on an x86_64 Debian/Ubuntu-based development machine, or "natively", on a Raspberry Pi IoT-board with LLVM/Clang-9.x.x for Arm/Aarch64 hardware architectures installed.

The first method builds C++17/2x0 sources, implemented using the Khronos triSYCL library, with the GNU's cross-platform Arm/Aarch64-toolchain on the Debian/Ubuntu-based x86_64 development machine, prior to running the resulting executable on a Raspberry Pi.

Deploying the x86_64 development machine requires installing the latest Debian Buster 10.6.0 or Ubuntu 20.04 LTS:


To use the development machine on a host computer running Microsoft Windows 10, any virtualization environment of choice (e.g., Oracle VirtualBox or VMware Workstation) can be used:

To get started with the development machine deployment, first set up a virtualization environment, create a virtual machine and launch the Debian or Ubuntu installation.

Once the virtual machine has been created and Debian/Ubuntu has been installed, we can proceed with installing and configuring the GNU's GCC/G++-10.x.x cross-platform compilers, the development tools and the Khronos triSYCL library, required for building code that targets the Raspberry Pi's Arm/Aarch64-architectures.

Prior to installing and configuring the GCC/G++ compiler toolchain and runtime libraries, make sure that the following prerequisite steps have been completed:

  • Upgrade the Debian/Ubuntu’s Linux base system:
    root@uarmhf64-dev:~# sudo apt update
    root@uarmhf64-dev:~# sudo apt upgrade -y
    root@uarmhf64-dev:~# sudo apt full-upgrade -y

    This step ensures that the Debian/Ubuntu installation on the x86_64 host development machine is up-to-date, with the latest kernel and packages installed.

  • Install ‘net-tools’ and OpenSSH-server packages from APT-repository:
    root@uarmhf64-dev:~# sudo apt install -y net-tools openssh-server

    The ‘net-tools’ and ‘openssh-server’ packages are installed for configuring the development machine's network interface and connecting to the running development machine remotely, over the SSH- and FTP-protocols.

Once the system has been upgraded and all required packages have been installed, we can proceed with installing and configuring the specific compilers and toolchains.

Installing and Configuring GNU's GCC/G++-10.x.x Toolchain

  1. Install the GNU Compilers Collection (GCC)’s toolchain, for x86_64 platform:
    root@uarmhf64-dev:~# sudo apt install -y build-essential
  2. Install the GNU’s cross-platform Arm64/Armhf toolchains:
    root@uarmhf64-dev:~# sudo apt install -y crossbuild-essential-arm64
    root@uarmhf64-dev:~# sudo apt install -y crossbuild-essential-armhf

    The cross-platform toolchains for the Arm64/Armhf hardware architectures are required for building, on the x86_64 development machine, parallel code in C++17 that uses the triSYCL library.

  3. Install the GNU's GCC/G++, OpenMP 5.0, Boost, Range-v3, POSIX Threads and C/C++ standard runtime libraries required:
    root@uarmhf64-dev:~# sudo apt install -y g++-10 libomp-dev libomp5 libboost-all-dev \
    librange-v3-dev libc++-dev libc++1 libc++abi-dev libc++abi1 libpthread-stubs0-dev \
    libpthread-workqueue-dev
  4. Install the GNU’s GCC/G++-10.x.x cross-platform compilers, for building code that targets the Arm64/Armhf architectures:
    root@uarmhf64-dev:~# sudo apt install -y gcc-10-arm-linux-gnueabi \
    g++-10-arm-linux-gnueabi gcc-10-arm-linux-gnueabihf g++-10-arm-linux-gnueabihf
  5. Select the GCC/G++-10.x.x “native” x86_64-compilers, used by default, updating the alternatives:
    sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-9 1
    sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-10 2
    sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-9 1
    sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-10 2
    sudo update-alternatives --install /usr/bin/cc cc /usr/bin/gcc 3
    sudo update-alternatives --set cc /usr/bin/gcc
    sudo update-alternatives --install /usr/bin/c++ c++ /usr/bin/g++ 3
    sudo update-alternatives --set c++ /usr/bin/g++
  6. Select the GCC/G++-10.x.x cross-platform Arm/Aarch64-compilers, used by default, updating the alternatives:
    sudo update-alternatives --install /usr/bin/arm-linux-gnueabihf-gcc \
    arm-linux-gnueabihf-gcc /usr/bin/arm-linux-gnueabihf-gcc-9 1
    sudo update-alternatives --install /usr/bin/arm-linux-gnueabihf-gcc \
    arm-linux-gnueabihf-gcc /usr/bin/arm-linux-gnueabihf-gcc-10 2
    sudo update-alternatives --install /usr/bin/arm-linux-gnueabihf-g++ \
    arm-linux-gnueabihf-g++ /usr/bin/arm-linux-gnueabihf-g++-9 1
    sudo update-alternatives --install /usr/bin/arm-linux-gnueabihf-g++ \
    arm-linux-gnueabihf-g++ /usr/bin/arm-linux-gnueabihf-g++-10 2
    sudo update-alternatives --install /usr/bin/arm-linux-gnueabihf-cc \
    arm-linux-gnueabihf-cc /usr/bin/arm-linux-gnueabihf-gcc 3
    sudo update-alternatives --set arm-linux-gnueabihf-cc /usr/bin/arm-linux-gnueabihf-gcc
    sudo update-alternatives --install /usr/bin/arm-linux-gnueabihf-c++ \
    arm-linux-gnueabihf-c++ /usr/bin/arm-linux-gnueabihf-g++ 3
    sudo update-alternatives --set arm-linux-gnueabihf-c++ /usr/bin/arm-linux-gnueabihf-g++
  7. Finally, check if the correct versions of the GNU’s “native” and cross-platform toolchains are installed:
    root@uarmhf64-dev:~# gcc --version && g++ --version
    root@uarmhf64-dev:~# arm-linux-gnueabihf-gcc --version
    root@uarmhf64-dev:~# arm-linux-gnueabihf-g++ --version

Downloading and Using Khronos CL/triSYCL Library

  1. Navigate to the /opt directory and clone the Khronos triSYCL library distribution from the GitHub repository:
    root@uarmhf64-dev:~# cd /opt
    root@uarmhf64-dev:~# git clone --recurse-submodules https://github.com/triSYCL/triSYCL

    The commands above will create the /opt/triSYCL sub-directory, containing the sources of the triSYCL library distribution.

  2. Copy the triSYCL library’s C++ header files from the /opt/triSYCL/include directory to its default location /usr/include/c++/10/, on the development machine, by using ‘rsync’ command:
    root@uarmhf64-dev:~# cd /opt/triSYCL
    root@uarmhf64-dev:~# sudo rsync -r ./include/ /usr/include/c++/10/
  3. Set the environment variables, required for using the triSYCL library with GNU’s cross-platform toolchain, previously installed:
    export CPLUS_INCLUDE_PATH=/usr/include/c++/10
    env CPLUS_INCLUDE_PATH=/usr/include/c++/10
    sudo echo "export CPLUS_INCLUDE_PATH=/usr/include/c++/10" >> /root/.bashrc
  4. Perform a simple clean-up, by removing the /opt/triSYCL sub-directory:
    root@uarmhf64-dev:~# rm -rf /opt/triSYCL
  5. Build the ‘hello.cpp’ code sample using the “native” x86_64 GNU’s GCC/G++ compiler:
    root@uarmhf64-dev:~# g++ -std=c++17 -o hello hello.cpp -lpthread -lstdc++

    Building C++17/2x0 code that uses the Khronos triSYCL library requires linking against the POSIX threads and C++ standard runtime libraries.

  6. Build the ‘hello.cpp’ code sample using the GNU’s cross-platform GCC/G++ compiler:
    root@uarmhf64-dev:~# arm-linux-gnueabihf-g++ -std=c++17 \
    -o hello_rpi4b hello.cpp -lpthread -lstdc++

Once the executable for the Arm/Aarch64-architecture has been generated, download it from the development machine via the FTP- or SSH-protocol, using the MobaXterm application. After that, upload the 'hello_rpi4b' executable file to the Raspberry Pi board, by using another SSH-session.

How to Run a Parallel Code, Delivered using C++17/2x0 And CL/triSYCL Library, on a Raspberry Pi 4B+ Board

To run the 'hello_rpi4b' executable, use the following command in the Raspbian's bash-console, for example:

root@raspberrypi4:~# chmod +rwx hello_rpi4b
root@raspberrypi4:~# ./hello_rpi4b > output.txt && cat output.txt

This will write the program's output to the 'output.txt' file, creating it if necessary, and print its contents to the bash-console:

Hello from triSYCL on Rasberry Pi 4B+!!!
Hello from triSYCL on Rasberry Pi 4B+!!!
Hello from triSYCL on Rasberry Pi 4B+!!!
Hello from triSYCL on Rasberry Pi 4B+!!!
Hello from triSYCL on Rasberry Pi 4B+!!!

Note: Normally, the first method does not require building the Khronos triSYCL library distribution from its sources, unless you plan to use triSYCL together with other HPC libraries, such as OpenCL, OpenMP or TBB. For more information on using triSYCL along with other libraries, refer to the following guidelines and documentation: https://github.com/triSYCL/triSYCL/blob/master/doc/cmake.rst.

The second method uses Aksel Alpay's hipSYCL open-source library distribution and the LLVM/Clang-9.x.x "native" compiler toolchain, targeting the Arm/Aarch64-architecture, to build CL/SYCL code in C++17/2x0 for running on Raspberry Pi boards. Building the code natively is only possible when both the LLVM/Clang-9.x.x toolchain and the hipSYCL library distribution are installed on the Raspberry Pi board itself, and not on the x86_64 development machine.

Further, we will discuss everything you need to know for installing and configuring the LLVM/Clang-9.x.x compiler toolchain on a Raspberry Pi board, as well as building Aksel Alpay's hipSYCL library from sources.

Installing and Configuring LLVM/Clang-9.x.x Toolchain

Before using Aksel Alpay's hipSYCL library distribution, the LLVM/Clang-9.x.x compilers and the Arm/Aarch64 toolchains must be properly installed and configured. To do that, complete the steps listed below:

  1. Update the Raspbian's APT-repositories and install the following prerequisite packages:
    root@raspberrypi4:~# sudo apt update
    root@raspberrypi4:~# sudo apt install -y bison flex python python3 snap snapd git wget

    The command above installs the alternative 'snap' package manager, required for installing a proper version of the 'cmake' utility (>= 3.18.0), as well as the 'python' and 'python3' distributions and the 'bison' and 'flex' utilities, needed for building the hipSYCL open-source project from "scratch" with 'cmake'.

  2. Install the 'cmake' >= 3.18.0 utility and the LLVM/Clang language server (clangd) by using the 'snap' package manager:
    root@raspberrypi4:~# sudo snap install cmake --classic
    root@raspberrypi4:~# sudo snap install clangd --classic

    After installing the 'cmake' utility, check that it works and that the correct version has been installed from the 'snap'-repository, by using the command below:

    root@raspberrypi4:~# sudo cmake --version

    You should see the following output after running this command:

    cmake version 3.18.4
    
    CMake suite maintained and supported by Kitware (kitware.com/cmake).
  3. Install the latest Boost, POSIX-Threads and C/C++ standard runtime libraries for the LLVM/Clang toolchain:
    root@raspberrypi4:~# sudo apt install -y libc++-dev libc++1 \
    libc++abi-dev libc++abi1 libpthread-stubs0-dev libpthread-workqueue-dev
    
    root@raspberrypi4:~# sudo apt install -y clang-format clang-tidy \
    clang-tools clang libc++-dev libc++1 libc++abi-dev libc++abi1 libclang-dev \
    libclang1 liblldb-dev libllvm-ocaml-dev libomp-dev libomp5 lld lldb llvm-dev \
    llvm-runtime llvm python-clang libboost-all-dev
  4. Download and add the LLVM/Clang's APT-repositories security key:
    root@raspberrypi4:~# wget -O - https://apt.llvm.org/llvm-snapshot.gpg.key | sudo apt-key add -
  5. Append the LLVM/Clang's repository URLs to the APT's sources.list:
    root@raspberrypi4:~# echo "deb http://apt.llvm.org/buster/ llvm-toolchain-buster main" >> /etc/apt/sources.list.d/raspi.list
    
    root@raspberrypi4:~# echo "deb-src http://apt.llvm.org/buster/ llvm-toolchain-buster main" >> /etc/apt/sources.list.d/raspi.list

    Steps 4 and 5 are necessary for installing the LLVM/Clang-9.x.x compilers and toolchains from the LLVM APT-repository.

  6. Remove the existing symlinks to the previous versions of the LLVM/Clang, installed:
    root@raspberrypi4:~# cd /usr/bin && rm -f clang clang++
  7. Update the APT-repositories, once again, and install the LLVM/Clang's compilers, debugger and linker:
    root@raspberrypi4:~# sudo apt update
    root@raspberrypi4:~# sudo apt install -y clang-9 lldb-9 lld-9
  8. Create the corresponding symlinks to the 'clang-9' and 'clang++-9' compilers, installed:
    root@raspberrypi4:~# cd /usr/bin && ln -s clang-9 clang
    root@raspberrypi4:~# cd /usr/bin && ln -s clang++-9 clang++
  9. Finally, verify that the 'clang' and 'clang++' commands are available in the bash-console by checking the installed LLVM/Clang version:
    root@raspberrypi4:~# clang --version && clang++ --version

    After running the command, you should see the following output:

    clang version 9.0.1-6+rpi1~bpo10+1
    Target: armv6k-unknown-linux-gnueabihf
    Thread model: posix
    InstalledDir: /usr/bin
    clang version 9.0.1-6+rpi1~bpo10+1
    Target: armv6k-unknown-linux-gnueabihf
    Thread model: posix
    InstalledDir: /usr/bin
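
The version check in step 9 can also be scripted, so that a later build step fails early when the wrong compiler is picked up. The following is only a minimal sketch: the helper name `clang_major` and the `REQUIRED_MAJOR` threshold are assumptions made for illustration, not part of the article's scripts.

```shell
#!/bin/sh
# clang_major: read `clang --version` output on stdin and print the major
# version number found on the first line, e.g. "9" for
# "clang version 9.0.1-6+rpi1~bpo10+1".
clang_major() {
    sed -n '1s/.*clang version \([0-9][0-9]*\)\..*/\1/p'
}

# REQUIRED_MAJOR is a hypothetical threshold chosen for this sketch.
REQUIRED_MAJOR=9

if command -v clang >/dev/null 2>&1; then
    major=$(clang --version | clang_major)
    if [ "${major:-0}" -ge "$REQUIRED_MAJOR" ]; then
        echo "clang $major found: OK"
    else
        echo "clang ${major:-unknown} is older than $REQUIRED_MAJOR" >&2
    fi
else
    echo "clang is not on PATH" >&2
fi
```

The script only reads the version banner; it does not install or change anything.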

Downloading and Installing Aksel Alpay's hipSYCL Library from Sources

Another essential step is downloading the open-source hipSYCL library staging distribution from GitHub and building it from sources.

This is typically done by completing the following steps:

  1. Download the hipSYCL project's distribution, cloning it from GitHub:
    root@raspberrypi4:~# git clone https://github.com/llvm/llvm-project llvm-project
    root@raspberrypi4:~# git clone --recurse-submodules https://github.com/illuhad/hipSYCL

    Aksel Alpay's hipSYCL project depends on another open-source project, LLVM/Clang. That is why both distributions are cloned before building the hipSYCL runtime library from scratch.

  2. Set a number of environment variables, required for building the hipSYCL project from sources, by using the 'export' command, and append the same lines to the .bashrc profile script so that they persist across sessions. The 'env' commands that follow do not change the current shell; each one only prints the environment with the given variable set:
    export LLVM_INSTALL_PREFIX=/usr
    export LLVM_DIR=~/llvm-project/llvm
    export CLANG_EXECUTABLE_PATH=/usr/bin/clang++
    export CLANG_INCLUDE_PATH=$LLVM_INSTALL_PREFIX/include/clang/9.0.1/include
    
    echo "export LLVM_INSTALL_PREFIX=/usr" >> /root/.bashrc
    echo "export LLVM_DIR=~/llvm-project/llvm" >> /root/.bashrc
    echo "export CLANG_EXECUTABLE_PATH=/usr/bin/clang++" >> /root/.bashrc
    echo "export CLANG_INCLUDE_PATH=$LLVM_INSTALL_PREFIX/include/clang/9.0.1/include" >> /root/.bashrc
    
    env LLVM_INSTALL_PREFIX=/usr
    env LLVM_DIR=~/llvm-project/llvm
    env CLANG_EXECUTABLE_PATH=/usr/bin/clang++
    env CLANG_INCLUDE_PATH=$LLVM_INSTALL_PREFIX/include/clang/9.0.1/include
  3. Create and change to the ~/hipSYCL/build sub-directory under the hipSYCL project's main directory:
    root@raspberrypi4:~# mkdir ~/hipSYCL/build && cd ~/hipSYCL/build
  4. Configure the hipSYCL project's sources using 'cmake' utility:
    root@raspberrypi4:~# cmake -DCMAKE_INSTALL_PREFIX=/opt/hipSYCL ..
  5. Build and install the hipSYCL runtime library using the GNU's 'make' command:
    root@raspberrypi4:~# make -j $(nproc) && make install -j $(nproc)
  6. Copy the libhipSYCL-rt.so runtime library to Raspbian's default libraries location:
    root@raspberrypi4:~# cp /opt/hipSYCL/lib/libhipSYCL-rt.so /usr/lib/libhipSYCL-rt.so
  7. Set the environment variables required for using the hipSYCL runtime library and the LLVM/Clang compilers to build source code. The 'echo' lines use single quotes, so that $PATH and the other variables are expanded when .bashrc is read, rather than when the line is written:
    export PATH=$PATH:/opt/hipSYCL/bin
    export C_INCLUDE_PATH=$C_INCLUDE_PATH:/opt/hipSYCL/include
    export CPLUS_INCLUDE_PATH=$CPLUS_INCLUDE_PATH:/opt/hipSYCL/include
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/hipSYCL/lib
    
    echo 'export PATH=$PATH:/opt/hipSYCL/bin' >> /root/.bashrc
    echo 'export C_INCLUDE_PATH=$C_INCLUDE_PATH:/opt/hipSYCL/include' >> /root/.bashrc
    echo 'export CPLUS_INCLUDE_PATH=$CPLUS_INCLUDE_PATH:/opt/hipSYCL/include' >> /root/.bashrc
    echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/hipSYCL/lib' >> /root/.bashrc
    
    env PATH=$PATH:/opt/hipSYCL/bin
    env C_INCLUDE_PATH=$C_INCLUDE_PATH:/opt/hipSYCL/include
    env CPLUS_INCLUDE_PATH=$CPLUS_INCLUDE_PATH:/opt/hipSYCL/include
    env LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/hipSYCL/lib
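
Because steps 2 and 7 append lines to .bashrc, running them a second time leaves duplicate entries in the profile. One way to avoid this is sketched below. The helper `append_once` and the `PROFILE` variable are hypothetical names introduced for this example; `PROFILE` defaults to a temporary file so that the sketch can be tried without touching /root/.bashrc.

```shell
#!/bin/sh
# append_once FILE LINE: append LINE to FILE unless an identical line
# is already present.
append_once() {
    file=$1
    line=$2
    touch "$file"
    # -q: quiet, -x: match the whole line, -F: treat LINE as a fixed string
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# PROFILE is a placeholder; the article appends to /root/.bashrc.
PROFILE=${PROFILE:-$(mktemp)}

# Single quotes keep $PATH literal, so it is expanded when the profile is read.
append_once "$PROFILE" 'export PATH=$PATH:/opt/hipSYCL/bin'
append_once "$PROFILE" 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/hipSYCL/lib'
append_once "$PROFILE" 'export PATH=$PATH:/opt/hipSYCL/bin'   # no-op: already present

cat "$PROFILE"
```

To apply it to the real profile, set `PROFILE=/root/.bashrc` before running the script.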

Now that LLVM/Clang and the hipSYCL library are installed and configured, it's strongly recommended to build and run the 'matmul_hipsycl' sample executable, to make sure that everything works:

Here are the steps for building the sample from sources:

rm -rf ~/sources
mkdir ~/sources && cd ~/sources
cp ~/matmul_hipsycl.tar.gz ~/sources/matmul_hipsycl.tar.gz
tar -xvf matmul_hipsycl.tar.gz
rm -f matmul_hipsycl.tar.gz

The set of commands above will create the ~/sources sub-directory and extract the sample's sources from the matmul_hipsycl.tar.gz archive.
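
Since the sequence above starts with 'rm -rf', it is worth confirming that the archive is readable before anything is deleted. The sketch below is one possible way to do that. The function name `unpack_fresh` is an assumption for illustration, and unlike the original commands it extracts the archive directly into the destination instead of copying it there first.

```shell
#!/bin/sh
# unpack_fresh ARCHIVE DEST: recreate DEST and extract ARCHIVE into it,
# but only after confirming that ARCHIVE is a readable tarball.
unpack_fresh() {
    archive=$1
    dest=$2
    if ! tar -tf "$archive" >/dev/null 2>&1; then
        echo "cannot read archive: $archive" >&2
        return 1
    fi
    rm -rf "$dest"
    mkdir -p "$dest"
    tar -xf "$archive" -C "$dest"
}

# Usage with the article's paths; skipped when the archive is not present.
if [ -f "$HOME/matmul_hipsycl.tar.gz" ]; then
    unpack_fresh "$HOME/matmul_hipsycl.tar.gz" "$HOME/sources"
fi
```

If the archive is missing or damaged, the function returns a non-zero status and leaves an existing destination directory untouched.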

To build the sample's executable, simply use the GNU's 'make' command:

root@raspberrypi4:~# make all

This will invoke the 'syclcc-clang' compiler wrapper to build the executable:

syclcc-clang -O3 -std=c++17 -o matrix_mul_rpi4 src/matrix_mul_rpi4b.cpp -lstdc++

This command compiles the C++17 code with the highest level of optimization (-O3) enabled, and links it with the C++ standard library runtime.

Note: Along with the runtime library, the hipSYCL build also provides the 'syclcc' and 'syclcc-clang' tools, used for building parallel C++17 code implemented with the hipSYCL library. Using these tools differs slightly from the regular usage of the 'clang' and 'clang++' commands. However, 'syclcc' and 'syclcc-clang' accept the same compiler and linker options as the original 'clang' and 'clang++' commands.
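
If 'make all' fails with a "command not found" error, the wrappers are most likely not reachable through PATH. The sketch below checks for them and prints the compile command without running it. The flags and file names are copied from the command line shown above, while the helper names `build_cmd` and `first_available` are assumptions made for this example.

```shell
#!/bin/sh
# build_cmd COMPILER: print the sample's compile command for COMPILER.
# Flags and file names are copied from the article's command line.
build_cmd() {
    printf '%s -O3 -std=c++17 -o matrix_mul_rpi4 src/matrix_mul_rpi4b.cpp -lstdc++\n' "$1"
}

# first_available NAME...: print the first NAME found on PATH; fail if none is.
first_available() {
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            return 0
        fi
    done
    return 1
}

if cxx=$(first_available syclcc-clang syclcc); then
    build_cmd "$cxx"    # prints the command only; it is not executed here
else
    echo "no hipSYCL compiler wrapper on PATH; /opt/hipSYCL/bin may be missing from it" >&2
fi
```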

How to Run a Parallel Code, Delivered using C++17/2x0 and hipSYCL Library, on a Raspberry Pi 4B+ Board

After the compilation, grant execution privileges to the 'matrix_mul_rpi4' file generated by the compiler, using the command below:

root@raspberrypi4:~# chmod +rwx matrix_mul_rpi4

Then run the executable in the bash-console:

root@raspberrypi4:~# ./matrix_mul_rpi4

The run produces the following output:

* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *

Multiplication C = A x B:

Matrix C:

323 445 243 343 363 316 495 382 463 374
322 329 328 388 378 395 392 432 470 326
398 357 337 366 386 407 478 457 520 374
543 531 382 470 555 520 602 534 639 505
294 388 277 314 278 330 430 319 396 372
447 445 433 485 524 505 604 535 628 509
445 468 349 432 511 391 552 449 534 470
434 454 339 417 502 455 533 498 588 444
470 340 416 364 401 396 485 417 496 464
431 421 325 325 272 331 420 385 419 468

Execution time: 5 ms
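
A single 5 ms measurement is easily distorted by scheduling noise, so averaging several runs gives a steadier figure. The sketch below assumes the "Execution time: N ms" line format shown above. The helper names, and the `RUNS` and `BIN` variables, are hypothetical and chosen for illustration.

```shell
#!/bin/sh
# exec_time_ms: read program output on stdin and print the number from
# each "Execution time: N ms" line.
exec_time_ms() {
    awk '/^Execution time:/ { print $3 }'
}

# average: read one number per line and print the arithmetic mean.
average() {
    awk '{ sum += $1; n++ } END { if (n) printf "%.1f\n", sum / n }'
}

# RUNS and BIN are placeholders for this sketch.
RUNS=${RUNS:-5}
BIN=${BIN:-./matrix_mul_rpi4}

if [ -x "$BIN" ]; then
    i=0
    while [ "$i" -lt "$RUNS" ]; do
        "$BIN" | exec_time_ms
        i=$((i + 1))
    done | average
fi
```

The loop is skipped when the executable is not present in the current directory.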

Optionally, we can evaluate the performance of the parallel code while it is being executed, by installing and using the following utilities ('top' is provided by the 'procps' package):

root@raspberrypi4:~# sudo apt install -y procps htop

The 'htop' utility visualizes the CPU and system memory utilization while the parallel code executable is running:

Image 6
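
Alongside the interactive view, the number of threads a process is running can be read from /proc without any extra tools. This is only a sketch for Linux systems: `thread_count` and `BIN` are hypothetical names, and a run as short as the sample above may finish before the first reading is taken.

```shell
#!/bin/sh
# thread_count PID: print the number of threads of a process, read from
# the "Threads:" field of /proc/PID/status (Linux only).
thread_count() {
    awk '/^Threads:/ { print $2 }' "/proc/$1/status"
}

# BIN is a placeholder for the executable to observe.
BIN=${BIN:-./matrix_mul_rpi4}

if [ -x "$BIN" ]; then
    "$BIN" >/dev/null &
    pid=$!
    # Take one reading per second for as long as the process is alive.
    while kill -0 "$pid" 2>/dev/null; do
        echo "threads: $(thread_count "$pid" 2>/dev/null)  online CPUs: $(nproc)"
        sleep 1
    done
    wait "$pid"
fi
```

Comparing the thread count with the number of online CPUs gives a rough indication of whether the runtime is using all of the available cores.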

Points of Interest

Micro-FPGAs and pocket-sized GPGPUs with compute capabilities, connected externally to an IoT-board via GPIO- or USB-interfaces, are the next step of parallel computing with IoT. Tiny FPGAs and GPGPUs provide an opportunity to perform even more complex and “heavy” computations in parallel, drastically increasing the performance speed-up when processing huge amounts of big-data in real-time.

Another essential aspect of parallel computing with IoT is the continued development of libraries and frameworks that provide the CL/SYCL-model layer specification and, thus, heterogeneous compute platform (XPU) support. Presently, the latest versions of these libraries support offloading parallel code execution to host CPU acceleration targets only, since other acceleration hardware, such as small-sized GPGPUs and FPGAs for nano-computers, has not yet been designed and manufactured by vendors.

In fact, parallel computing with Raspberry Pi and other IoT boards based on the innovative ARM Cortex-A72 multi-core, 64-bit RISC CPUs is a special point of interest for software developers and hardware technicians who assess the performance of existing computational processes when running them in parallel with IoT.

In conclusion, IoT-based parallel computing generally benefits the overall performance of cloud-based solutions intended for collecting and massively processing big-data in real-time and, as a result, positively impacts the quality of machine learning (ML) and data analytics.

Acknowledgements

ARM® Holdings Corp., in 2016, designed the latest Cortex-A72 ARMv8 quad-core, 64-bit RISC CPUs, providing an ability to perform parallel computing at the edge. The Raspberry Pi (RPi) Foundation Team released the next generation of Raspberry Pi 4B+ IoT boards, based on the BCM2711 (SoC) chip, in June 2019, increasing the performance of the Raspberry Pi IoT-boards. The BCM2711 (SoC) chip is bundled not only with the powerful, next generation ARM Cortex-A72 CPUs, but also with a variety of peripheral devices, such as high-speed LPDDR4-3200 RAM in a choice of 2, 4 or 8 GiB, the latest Broadcom® VideoCore™ VI 500 MHz GPUs, compact Gigabit Ethernet cards (PCI-E), and a low-energy USB-C 5.0V/3A DC power supply, ideal for data mining IoT-boards.

The latest version 10.x.x of the GNU Compiler Collection (GCC) cross-platform toolchain was released in June 2020. The LLVM Development Group released the latest “stable” version of the LLVM/Clang-11.x.x C/C++ cross-platform compilers in October 2020. In 2020, Khronos Group released the CL/triSYCL library project’s open-source staging distribution, intended as a testbed for evaluating the CL/SYCL wrapper library and providing feedback to Khronos Group and the ISO committee. Aksel Alpay, at Heidelberg University (Germany), released the most “stable” version of the CL/hipSYCL library project’s staging distribution, targeting a variety of hardware architectures, including the Raspberry Pi’s Arm/Aarch64.

In 2020, Arthur V. Ratz, at CodeProject and Intel® DevMesh, developed several projects of his own, intended for an IoT-cluster, whose implementation is mainly based on the Khronos CL/triSYCL and Aksel Alpay's hipSYCL open-source library distributions. Throughout the development cycle, he evaluated the GNU and LLVM/Clang toolchains for building the projects from sources. Later on, in November 2020, Arthur V. Ratz published a practical tutorial for getting started with parallel heterogeneous programming in C++17/2x0 using the CL/triSYCL and hipSYCL libraries, as well as building and running CL/SYCL-code on the latest Raspberry Pi 4B+ rev. B0 IoT-boards with the Arm/Aarch64 architecture, rather than on the native x86_64 PC platform.

History

  • 12th November, 2020 - First revision of article and CL/SYCL-code samples have been published

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)