
C++ Virtual-Array Implementation Backed by the Combined Video Memory of the User's System

A header-only C++ tool that supports a basic array-like usage pattern and uses multiple graphics cards in the system as storage, with LRU caching.
If you have big arrays to compute but not enough RAM and a very slow HDD, you can use your graphics cards as storage with this.

Introduction

All popular modern OSes have some paging technique that lets us use more data than our RAM can hold. This is crucial for testing/learning algorithms on larger-scale data without buying extra RAM, and for keeping thousands of apps running on the same machine without an overflow.

But they don't (yet) have an option to use the spare VRAM in the system, which is connected through reasonably high-throughput PCIe bridges. This header-only C++ tool (VirtualMultiArray) supports a basic array-like usage pattern and is backed by a paging system that uses multiple graphics cards through OpenCL.

Github page: https://github.com/tugrul512bit/VirtualMultiArray/wiki/How-it-is-used

Background

The basic idea behind this tool is to waste fewer CPU cycles in the swap-death that comes with over-allocating and over-using big buffers. Wasting some VRAM is sometimes better than wasting all your CPU cycles for the next 10 minutes trying to Ctrl-C the app (and it saves the HDD from permanent damage in stormy weather, the SSD from wear, etc.).

Image 1

With this simple array tool, access to all indices of the array is seamless, while paging is handled on demand only by the threads that access the array. It is thread-safe, even on the same elements.

Performance depends on the access pattern, but for simple sequential multi-threaded access on a cheap low-memory system, it turns this:

Image 2

into this:

Image 3

Even after the system is saved from near swap death, it takes several minutes to roll those pages back from the HDD, so the real time loss is much more than 10m33s. When otherwise unused VRAM is given the same work, it is both quick and fully responsive.

Using the Code

Some environment preparations:

  • Since it uses OpenCL and was written under Ubuntu 18.04 with an Nvidia graphics card, its OpenCL library path is "lib64". This can be different for other OSes and graphics card vendors (like AMD/Intel).
  • For multi-threaded testing only, there is some OpenMP usage (any other parallel task library can be used instead).
  • Tested with g++-7 (c++1y dialect) and g++-10.
  • For a short main.cpp source, compiling and linking looks like this:
g++ -std=c++1y -O3 -g3 -Wall -c -fmessage-length=0 -march=native -mavx -fopenmp -pthread 
-fPIC -MMD -MP -MF"src/foo.d" -MT"src/foo.d" -o "src/foo.o" "../src/foo.cpp"

g++ -L/usr/local/cuda/lib64 -o "your_app_name" ./src/foo.o -lOpenCL -lpthread -lgomp

Usage is simple:

C++
GraphicsCardSupplyDepot depot;
const size_t n = 1000000000ull;           // number of elements in the array (stored in VRAM)
const size_t pageSize=1000;               // elements per page (determines RAM usage per access)
const int activePagesPerVirtualGpu = 5;   // pages cached in RAM per virtual GPU

VirtualMultiArray<Bar> data(n,depot.requestGpus(),pageSize, activePagesPerVirtualGpu);
data.set(100,Bar());        // or: data[100]=Bar();
Bar value = data.get(100);  // or: Bar value = data[100];
  • Create a GraphicsCardSupplyDepot.
  • Request GPUs from it.
  • Give them to the VirtualMultiArray<T> templated virtual array, along with the array size, page size and active pages per GPU.
  • Use the get/set methods (or the operator [] overload with its proxy class: T var=data[i]; data[j]=x;).
  • The array size needs to be an integer multiple of the page size.
    • Don't use too many active pages; they consume RAM.
C++
#include "GraphicsCardSupplyDepot.h"
#include "VirtualMultiArray.h"

// for testing
#include <iostream>
#include <numeric>
#include <random>

// a big enough object with plain data
template<typename T>
class ArrayData
{
public:
    ArrayData(){
        std::random_device rd;
        std::mt19937 rng(rd());
        std::uniform_real_distribution<float> rnd(-1,1000);
        for(int i=0;i<1000;i++)
        {
            data[i]=rnd(rng);
        }
    }
    T calc(){ return std::accumulate(data,data+1000,0); }
private:
    T data[1000];
};

int main(int argC, char ** argV)
{
    // gets all gpus. use depot(true) for some info about graphics cards in system
    GraphicsCardSupplyDepot depot; 

    // n needs to be integer multiple of pageSize !!!!
    const size_t n = 800000;
    const size_t pageSize=10;
    const int maxActivePagesPerGpu = 5;

    VirtualMultiArray<ArrayData<int>> data(n,depot.requestGpus(),pageSize,
                                        maxActivePagesPerGpu,{2,4,4});
    //std::vector<ArrayData<int>> data;
    //data.resize(n);

    #pragma omp parallel for
    for(size_t i=0;i<n;i++)
    {
        data.set(i,ArrayData<int>());
        //data[i]=ArrayData<int>();
    }

    #pragma omp parallel for
    for(size_t i=0;i<n;i++)
    {
        if(data.get(i).calc()<0){ std::cout<<"error: doctor"<<std::endl;};
        //if(data[i].calc()<0){ std::cout<<"error: doctor"<<std::endl;};
    }

    return 0;
}

Run it on a system with these idle stats:

Image 4

(10 minutes with vector vs 11 seconds with virtual array)

The {2,4,4} parameter chooses the ratio of VRAM usage (both amount and maximum CPU threads) per card. The test system has 3 cards; the first one has less free memory, hence the value 2. The other cards are fully dedicated to the program, so they are given 4. This value is also the number of virtual graphics cards per physical graphics card: 2 means there are 2 channels working as 2 virtual GPUs, available for 2 CPU threads to access array elements concurrently; 4 means 4 threads in flight for that card alone. {2,4,4} therefore lets 10 CPU threads access the array concurrently. If a card needs to be excluded, simply give it zero: {0,4,4} disables the first card. If the second card has 24GB of memory and the others only 4GB, then {4,24,4} should work (not tested on a high-end system). If the parameter is not given, its default gives every card 4 virtual cards, equivalent to {4,4,...,4}, which may need a lot of CPU cores to be fully used.

{2,4,4} also means there are maxActivePagesPerGpu = 5 actively computed / kept-in-RAM pages per channel. The first card has 2 clones, so it holds 10 pages; the other cards hold 20 each, for a total of 50 pages. Each page holds 10 elements in the same sample code, which means a total of 500 elements in RAM for 10-way threading. Since each element costs 4000 bytes, RAM holds only about 2MB of data. Much bigger pages give better sequential access performance, up to 3.3GB/s on the test machine below:
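For clarity, here is that RAM footprint as a back-of-the-envelope calculation in code. The formula is only an illustration derived from the description above, not a library API:

C++
// Approximate RAM held by the cache, using the sample's values:
// memMult = {2,4,4}, maxActivePagesPerGpu = 5, pageSize = 10, sizeof(ArrayData<int>) = 4000
const int virtualGpus      = 2 + 4 + 4;                 // sum of memMult
const int activePages      = 5;                         // maxActivePagesPerGpu
const size_t pageElements  = 10;                        // pageSize
const size_t elementBytes  = sizeof(ArrayData<int>);    // 4000 bytes in this sample
const size_t cachedBytes   = virtualGpus * activePages * pageElements * elementBytes;
// 10 * 5 * 10 * 4000 = 2,000,000 bytes, i.e. roughly the 2MB mentioned above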

On a PC with an FX8150 (2.1GHz), a GT1030 (2GB, PCIe v2.0 4 lanes), 2x K420 (2GB, PCIe v2.0 4 lanes and 8 lanes) and 4GB of single-channel 1333MHz DDR RAM, performance is measured as follows:

testing method       object size   throughput     page size      cpu threads   n objects   active pages/gpu
random               44 bytes      3.1    MB/s    128 objects    8             100k        4
random               4kB           238.3  MB/s    1   object     8             100k        4
serial per thread    4kB           496.4  MB/s    1   object     8             100k        4
serial per thread    4kB           2467.0 MB/s    32  objects    8             100k        4
serial per thread    44 bytes      142.9  MB/s    32  objects    8             100k        4
serial per thread    44 bytes      162.3  MB/s    32  objects    8             1M          4
serial per thread    44 bytes      287.0  MB/s    1k  objects    8             1M          4
serial per thread    44 bytes      140.8  MB/s    10k objects    8             10M         4
serial per thread    44 bytes      427.1  MB/s    10k objects    8             10M         100
serial per thread    44 bytes      299.9  MB/s    10k objects    8             100M        100
serial per thread    44 bytes      280.5  MB/s    10k objects    8             100M        50
serial per thread    44 bytes      249.1  MB/s    10k objects    8             100M        25
serial per thread    44 bytes      70.8   MB/s    100k objects   8             100M        8
serial per thread    44 bytes      251.1  MB/s    1k  objects    8             100M        1000
interleaved thread   44 bytes      236.1  MB/s    1k  objects    8             100M        1000
interleaved thread   44 bytes      139.5  MB/s    32  objects    8             100M        1000
interleaved thread   44 bytes      153.6  MB/s    32  objects    8             100M        100
interleaved thread   4kB           2474.0 MB/s    32  objects    8             1M          5

It handles 4.5GB of data while consuming only 100MB-1GB of RAM (depending on the total number of active pages).

Another example that actually does something (Mandelbrot set generation code from here):

C++
#include "GraphicsCardSupplyDepot.h"
#include "VirtualMultiArray.h"

// testing
#include <iostream>
#include <fstream>
#include <complex>

class Color
{
public:
	Color(){}
	Color(int v){val=v;}
	int val;
};

const int width  = 1024;
const int height = 1024;

int value(int x, int y) {
    std::complex<float> point((float)x/width - 1.5, (float)y/height - 0.5);
    std::complex<float> z(0, 0);
    unsigned int nb_iter = 0;
    while (abs(z) < 2 && nb_iter <= 34) {
        z = z * z + point;
        nb_iter++;
    }
    if (nb_iter < 34) return (255*nb_iter)/33;
    else return 0;
}

int main(int argC, char ** argV)
{
	GraphicsCardSupplyDepot depot;

	// n needs to be integer multiple of pageSize !!!!
	const size_t n = width*height;
	const size_t pageSize=width;
	const int maxActivePagesPerGpu = 10;

	VirtualMultiArray<Color> img(n,depot.requestGpus(),pageSize,maxActivePagesPerGpu);

	std::cout<<"Starting cpu-compute on gpu-array"<<std::endl;

	#pragma omp parallel for
	for (int i = 0; i < width; i++) {
	    for (int j = 0; j < height; j++)  {
	        img.set(j+i*width,Color(value(i, j)));
	    }
	}

	std::cout<<"mandelbrot compute finished"<<std::endl;

	std::ofstream my_Image ("mandelbrot.ppm");
	if (my_Image.is_open ()) {
	     my_Image << "P3\n" << width << " " << height << " 255\n";
	     for (int i = 0; i < width; i++) {
	          for (int j = 0; j < height; j++)  {
	               int val = img.get(j+i*width).val;
	               my_Image << val<< ' ' << 0 << ' ' << 0 << "\n";
	          }
	     }
	     my_Image.close();
	}
	else std::cout << "Could not open the file";
	std::cout<<"File saved"<<std::endl;
	return 0;
}

Verification:

Image 5

Produced with the set method in an OpenMP-parallelized for loop, then read back serially (single-threaded) through the get method.

Benefits of Overlapping I/O

CPU caches have only a few cycles of overhead, which amounts to nanoseconds. RAM has tens of nanoseconds of latency. PCIe latency is anywhere between 1 and 20 microseconds, even if only 1 byte of data is requested from the graphics card. So doing nothing while waiting for the data is a heavy loss of CPU cycles, which works against the whole point of this tool (not losing cycles waiting in HDD swap death). Luckily, a CPU can run more threads than it has logical cores. For this, some threads are required to yield() whenever they are in a waiting state of unknown duration, such as waiting microseconds for PCIe data to arrive (a microsecond is an eternity for a CPU).

The default OpenCL way to wait (without any idle/busy wait policy defined):

C++
clFinish(commandQueue);

This is simple, but it can also be very CPU-unfriendly, especially when the GPU vendor has chosen busy-waiting as the default policy for its OpenCL implementation. At that point, the CPU thread cannot go anywhere else and has to wait for clFinish() to return; a lot of CPU cycles are wasted without using the floating-point pipelines, the integer pipelines or the L1 cache.

A solution that scales up to 64 threads on an 8-core/8-thread CPU (for simplicity, error handling has been removed):

C++
// this is the OpenCL data transmission command to be idle-waited
cl_event evt;
clEnqueueReadBuffer(q->getQueue(),...,.., &evt);

clFlush(q->getQueue());

const cl_event_info evtInf = CL_EVENT_COMMAND_EXECUTION_STATUS;
cl_int evtStatus0 = 0;
clGetEventInfo(evt, evtInf,sizeof(cl_int), &evtStatus0, nullptr);

while (evtStatus0 != CL_COMPLETE)
{
   clGetEventInfo(evt, evtInf,sizeof(cl_int), &evtStatus0, nullptr);            
   std::this_thread::yield();
}

The clFlush() call is just an opening signal to the OpenCL driver to let it start issuing the commands in the command queue asynchronously. The call returns immediately and the driver begins sending commands to the graphics card.

The event object is populated through the last parameter of clEnqueueReadBuffer() (which reads data from VRAM into RAM), and event tracking starts.

clGetEventInfo() is the query command used to check whether the event has reached the desired state. When the event status becomes CL_COMPLETE, the command has finished and the data is ready in the target buffer.

Until the event becomes CL_COMPLETE, the while loop keeps yielding so that some other thread can do useful work. That could be an nbody algorithm's compute loop, some data copying inside the CPU's L1 cache (independent of the DMA work happening outside the CPU), another I/O thread starting a second DMA operation in parallel, or hiding the LRU-cache latency of another array access (LRU implementation details below).

Here is a 64-thread random-access benchmark for data sets ranging from 120kB to 4.6GB, on the FX8150 (8 cores / 8 threads):


Image 6

When the randomly accessed data set is larger than the LRU capacity, I/O is overlapped with other I/O (both between multiple graphics cards and within the same card).

When random access stays inside an L3-sized data set, any I/O is overlapped with L1 data transfers as objects are copied from the get() method into an object variable. The test CPU has 8MB of L3 capacity, so the performance increase is steep around the 8MB data set size on the graph. The probability of I/O is very low once the data set is smaller than the LRU and the LRU has cached nearly all of the content, but until it is full, it keeps fetching data from the VRAM of the graphics cards. Thanks to the overlapped I/O, this is faster than the 8-thread version and much faster than the single-threaded one.

One issue in this graph is that performance drops again on the left side. This is caused by page-lock contention: the fewer objects exist in the array, the more threads race for the same lock. Each object in that benchmark is 60kB, so the whole 4.6GB array is made of only about 75,000 objects.

Least Recently Used (LRU) Cache Implementation

Depending on the number of cache lines requested by the user, one of three different data paths is followed:

1 cache line (numActivePage = 1) per virtual GPU: direct eviction on demand. Whenever the required virtual index is not cached in the active page (cache line), it is fetched from VRAM (after the old page's contents are written back, if it had edited data).

C++
fImplementation=[&](const size_t & ind)
{
  if(directCache->getTargetGpuPage()!=ind) // if virtual index not found
  {
     updatePage(directCache, ind);         // update the cache line
     directCache->reset();                 // uncheck the "edited" boolean
  }
  return directCache;
};

The get/set method then simply uses the returned page pointer to do any read/write operation on it (and sets the edited flag on writes).
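As an illustration only (the real Page<T> interface may differ), a setter built on the same access pattern could look like the following sketch; the edit() and markAsEdited() names are hypothetical placeholders:

C++
// Minimal sketch of a setter using the page returned by the LRU.
// szp (elements per page) and pageCache are the members also used by get() further below.
template<typename T>
void setSketch(const size_t & index, const T & value)
{
    const size_t selectedPage = index / szp;
    Page<T> * sel = pageCache->access(selectedPage);   // may evict + fetch from VRAM
    sel->edit(index - selectedPage * szp, value);      // hypothetical: write inside the page
    sel->markAsEdited();                               // hypothetical: flag it for write-back
}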

2-12 cache lines (numActivePage = 2..12) per virtual GPU: this is a direct, fast mapping of a "priority queue" onto the code, with a cache-friendly way of moving data.

C++
Page<T> * const accessFast(const size_t & index)
{
   Page<T> * result=nullptr;
   auto it = std::find_if(usage.begin(),usage.end(),
             [index](const CacheNode<T>& n){ return n.index == index; });
   if(it == usage.end())
   {
       if(usage[0].page->getTargetGpuPage()!=index)
       {
           updatePage(usage[0].page, index);
           usage[0].page->reset();
       }
       usage[0].index=index;
       usage[0].used=ctr++;
       result = usage[0].page;
   }
   else
   {
       it->used=ctr++;
       result = it->page;
   }
   insertionSort(usage.data(),usage.size());
   return result;
}

First, it checks whether the required element (virtual index) is in the std::vector. If it is not found, it simply replaces the first element with the required virtual index so that it will be cached for future accesses. Here, a counter is used to give it the highest usage frequency directly, before evicting the page contents. This part can be tuned for other eviction policies. (For example, if LFU (least frequently used) eviction is required, the counter can be removed and just usage[0].index = usage[usage.size()/2].index + 1; can be used to send it to the middle of the vector. When it goes to the middle of the vector, the elements on the left keep fighting for survival so they don't become the next victim, while the elements on the right become privileged and stay in the cache more easily; more like a "segmented" version of LFU.)

If it is already found in the cache, its counter value is updated to the highest value in the vector.

Finally, the insertion sort does the necessary reordering (shifting the right-hand-side elements, in this version).
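For reference, a minimal sketch of such an insertion sort is shown below. It only assumes that CacheNode<T> carries the used counter seen in accessFast() and that the vector is kept ordered by it, least recently used at index 0 (the next victim). It is an illustration, not the library's exact code:

C++
template<typename T>
void insertionSort(CacheNode<T> * arr, const size_t n)
{
    for (size_t i = 1; i < n; i++)
    {
        CacheNode<T> key = arr[i];
        size_t j = i;
        // shift nodes with a higher usage counter to the right
        while (j > 0 && arr[j - 1].used > key.used)
        {
            arr[j] = arr[j - 1];
            j--;
        }
        arr[j] = key; // least recently used nodes settle toward index 0
    }
}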

13 or more cache lines (numActivePage = 13+) per virtual GPU: this version uses std::unordered_map and std::list to scale up to many more cache lines (such as 500) without slowing down.

C++
Page<T> * const accessScalable(const size_t & index)
{
    Page<T> * result=nullptr;
    typename std::unordered_map<size_t,
    typename std::list<Page<T>*>::iterator>::iterator it =
                                  scalableMapping.find(index);
    if(it == scalableMapping.end())
    {
        // not found in cache
        Page<T> * old = scalableCounts.back();

        size_t oldIndex = old->getTargetGpuPage();
        if(old->getTargetGpuPage()!=index)
        {
            updatePage(old, index);
            old->reset();
        }

        scalableCounts.pop_back();
        scalableMapping.erase(oldIndex);

        // add a new
        scalableCounts.push_front(old);
        typename std::list<Page<T>*>::iterator iter = scalableCounts.begin();
        scalableMapping[index]=iter;

        result = old;
    }
    else
    {
        // found in cache
        // remove
        Page<T> * old = *(it->second);
        scalableCounts.erase(it->second);

        // add a new
        scalableCounts.push_front(old);
        auto iter = scalableCounts.begin();
        scalableMapping[index]=iter;

        result = old;
    }

    return result;
}

This is basically the same algorithm, just more efficient for a larger number of cache lines. The unordered map holds the mapping from virtual index (VRAM page) to active page (physical page / RAM buffer), but the page pointer itself is held inside a list, because on a list it is O(1) to remove a node from the back and add it to the front. The unordered map also resolves the mapping in O(1), thanks to its hash-table implementation. Instead of linearly searching a vector, the element is found directly. If it is not found, the "least recently used" page (cache line) is evicted; if it is found, there is no eviction, just an access to the buffer in RAM (or L1/L2/L3).

Taking the back node of the list and moving it to the head completes the LRU cycle: the next time something is not found in the cache, the next victim is ready at the new back node, and the newest front node is safe from eviction for a long time (or a short time, if the user's access pattern causes cache thrashing).

Once an implementation is chosen and assigned to the fImplementation function member, it is called whenever a page is requested:

C++
// same access for both reading and writing
Page<T> * const access(const size_t & index)
{
    return fImplementation(index);
}

which is called after the frozen (VRAM) page index is computed in VirtualArray.h:

C++
T get(const size_t & index)
{
     const size_t selectedPage = index/szp;
     Page<T> * sel = pageCache->access(selectedPage);
     return sel->get(index - selectedPage * szp);
}

which in turn runs after the user calls the get() method of VirtualMultiArray.h:

C++
// get data at index
// index: minimum value=0, maximum value=size-1 
// but not checked for overflowing/underflowing
T get(const size_t & index) const{
     const size_t selectedPage = index/pageSize;
     const size_t numInterleave = selectedPage/numDevice;
     const size_t selectedVirtualArray = selectedPage%numDevice;
     const size_t selectedElement = numInterleave*pageSize + (index%pageSize);

     std::unique_lock<std::mutex> lock(pageLock.get()[selectedVirtualArray].m);
     return va.get()[selectedVirtualArray].get(selectedElement);
}

It does interleaved access to the LRUs. Every K-th page belongs to another virtual graphics card, and every virtual graphics card has its own cache. So, accessing pages 1, 5, 9, 13, 17, 21, ... hits the same LRU if the virtual graphics card setting is std::vector<int> memMult = { 1,1,1,1 };. The associativity of each LRU creates a strided pattern, and multiple LRUs combine into continuous parallel caching with increased throughput because of the independent page locks. Normally, a single LRU serves only 1 thread at a time and causes high lock contention, but with N parallel LRUs, N threads can run at a time and help overlap I/O with math operations.
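A small worked example of that mapping, using the formulas from the get() method above (the concrete numbers are only illustrative):

C++
// Assume pageSize = 1000 and 4 virtual graphics cards (numDevice = 4), i.e. memMult = {1,1,1,1}
const size_t pageSize  = 1000;
const size_t numDevice = 4;

const size_t index                = 5300;
const size_t selectedPage         = index / pageSize;              // 5
const size_t numInterleave        = selectedPage / numDevice;      // 1
const size_t selectedVirtualArray = selectedPage % numDevice;      // 1 -> second virtual card / LRU
const size_t selectedElement      = numInterleave * pageSize
                                  + (index % pageSize);            // 1*1000 + 300 = 1300
// Pages 1, 5, 9, 13, ... all give selectedVirtualArray == 1, i.e. they share the same LRU.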

For a simple image-processing benchmark (computing a Gaussian blur on a Mandelbrot image), throughput increased by 300% after the addition of the LRU (combined with tiled processing). Similarly, overlapped sequential reading/writing benefited from the LRU, and up to a point even random-access patterns gained performance.

Features

Bulk Reading and Writing

These methods greatly decrease the average number of page locks per element and achieve much higher bandwidth than the basic get/set methods (sometimes up to 50x).

Writing to array:

C++
std::vector<Obj> vec;
vec.push_back(Obj(1));
vec.push_back(Obj(1));

// write 2 elements to vArray starting from index i
vArray.writeOnlySetN(i,vec);

Reading from array:

C++
// read 2 elements from vArray starting from index i
std::vector<Obj> result = vArray.readOnlyGetN(i,2);

Mapping

This method takes a user function and serves it a raw buffer pointer; it is a better alternative for bulk operations involving SIMD/vector instructions, faster data copies, or anything that needs an aligned buffer. When pages are big enough, it has a slight advantage over the bulk read/write methods.

In short, it optionally reads the whole region into a buffer, runs the user function, then optionally writes the data back into the virtual array.

C++
// T = int
// map region [303,803]
arr.mappedReadWriteAccess(303,501,[](int * buf){
    // use the same indexing base as the virtual array
    for(int i=303;i<303+501;i++)
        buf[i]=i;
}, false /*pinned*/, true /*read*/, true /*write*/, nullptr /*userPtr*/);

If the pinned parameter is true, Linux's mlock/munlock calls are used to stop the OS from paging the buffer out.

If the read parameter is true, the buffer is filled with the latest data bits from the virtual array before the user function runs.

If the write parameter is true, the buffer contents are written back to the virtual array after the user function runs.

If the userPtr parameter is not nullptr, the internal allocation is skipped and userPtr is used for the user function and the read/write copies instead. The user function receives a negatively offset version of userPtr so that the algorithm inside the function uses the same indexing base as the virtual array: if index=100 and range=100, then inside the function the user can access buf as buf[100]...buf[199].

Since the temporary buffer used by mapping is aligned better than std::vector and is allocated all at once (unlike bulk read's K allocations, where K = number of pages locked), it tends to complete quicker. If the user supplies a dedicated pointer, allocation is avoided altogether, for lower latency before the user function runs.
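A hedged sketch of that reuse pattern is shown below. It relies only on the parameter order shown earlier (pinned, read, write, userPtr); the element type, the array name arr, the element count n and the posix_memalign-based allocation are illustrative choices, not part of the library:

C++
#include <cstdlib> // posix_memalign, free (Linux)

const size_t range = 1000;
float * reusable = nullptr;
posix_memalign(reinterpret_cast<void**>(&reusable), 4096, range * sizeof(float));

for (size_t i = 0; i < n; i += range)
{
    arr.mappedReadWriteAccess(i, range, [i, range](float * buf){
        // same indexing base as the virtual array: buf[i] .. buf[i + range - 1]
        for (size_t j = i; j < i + range; j++)
            buf[j] *= 2.0f;
    }, false /*pinned*/, true /*read*/, true /*write*/, reusable /*userPtr*/);
}

free(reusable);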

Uncached (streaming) Read/Write Operations

For some algorithms, there is no performance gain from any page size (cache-line size) nor from any number of active pages (cache lines). Uncached versions of the get/set methods (currently only the scalar versions) lower the access latency. They work like get/set, except that any block of uncached accesses has to be decorated with streamStart/streamStop commands.

C++
arr.streamStart();   // flushes any edited data to vram

arr.setUncached(..); // just like set but uncached, 2x lower latency
arr.getUncached(..); // just like get but uncached, ~10% lower latency

arr.streamStop();    // brings updated data from vram into LRU cache

If there will be no cached access until the end of the scope, these extra start/stop commands are not required. Since a cached write needs an eviction plus bringing new data back, it costs twice as much as a cached read; an uncached write only moves data one way, so it has half the latency.

Thread Safety Of Bulk R/W and Mapping

Both bulk and map operations are thread-safe within each page (pages are always thread-safe), but when their regions span multiple pages, the user has to ensure data consistency through explicit synchronization. Multiple pages are never locked at the same time; only 1 page is locked and processed at a time. Because of this, the user needs to stop any concurrent accesses that overlap the same region: a single big bulk/map operation performs multiple per-page read/write accesses that may conflict with other threads' reads/writes until it has completed.

It is safe to use the read-only versions (of map / bulk read / get) concurrently, even on overlapping regions. Data is always preserved intact unless a writing method is used on the same region; a concurrent write may or may not reach the graphics cards before another thread reads that region.

It is safe to do concurrent bulk write / write / map write operations on non-overlapping regions.

It is safe to do any concurrent write (bulk or not) together with any concurrent read (bulk or not) on non-overlapping regions.

Why Bulk Read/Write?

Some algorithms are optimized for speed by loading more data at once (such as "tiling") or simply by explicitly caching a region, then computing on that faster memory. For example, in an nbody algorithm, the data in RAM can be tiled into smaller, cache-sized chunks to finish groups of force calculations faster than by always streaming data from RAM. Sometimes the chunks are as small as a few CPU registers (register tiling). The same rule applies to virtual memory: the "paging" of this virtual array is just implicit caching. The user can improve on it with a custom-sized "tile", loading from "pages" into "tiles" that have both proper alignment for the C++ compiler's SIMD/vectorization and the lower latency of locking once per page instead of once per element. A sketch of this pattern is shown below.
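The sketch uses only the bulk methods shown earlier; the array name vArr, element type float, element count n and tile size are illustrative assumptions:

C++
// Tiling with the bulk methods: one page-lock batch per tile instead of one lock per element.
// Assumes a VirtualMultiArray<float> named vArr with n elements and a pageSize that divides tile.
const size_t tile = 4000;
for (size_t i = 0; i < n; i += tile)
{
    std::vector<float> chunk = vArr.readOnlyGetN(i, tile);  // bulk read into fast RAM

    for (size_t j = 0; j < tile; j++)                       // compute on the local tile
        chunk[j] = chunk[j] * 0.5f + 1.0f;

    vArr.writeOnlySetN(i, chunk);                           // bulk write-back
}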

Why Map?

Mapping offers a raw pointer that does not have the awkward side effects of the SetterGetter proxy class used for the array subscript operator [] overload.

The raw pointer given by mapping is also aligned to 4096 bytes, which helps data-copy performance between devices (such as graphics cards) and RAM, and allows the aligned versions of SIMD instructions.

To optimize further, it optionally allows pinning the buffer and passing a user-defined pointer to avoid an allocation on every mapping.

GPU-accelerated find()

To find one or more elements inside the virtual array in roughly constant time,

C++
// obj: object instance
// member: member of object to be used for searching value
// indexListMaxSize: maximum number of indices found
template<typename S>
std::vector<size_t> find(T & obj, S & member, const int indexListMaxSize=1);

is used. The first parameter is an object instance that contains the search value as a member. The second parameter is the member to search for; it needs to be a direct member of "obj" so that the method can compute its byte offset inside the object and use that offset in all GPUs' "find" kernels to compare the member value in all objects. The last parameter is the maximum number of indices returned per OpenCL data channel: if 1000 is given and there are 10 channels in total, then up to 10,000 indices may be returned. All elements at those indices have the same member value. Usage is simple:

Plain Data

C++
char searchObj = 'X';
std::vector<size_t> found = charArray.find(searchObj,searchObj);

This searches for at most 1 element per OpenCL channel that contains the value 'X'.

Object

C++
MyObject obj;
obj.id=450;
std::vector<size_t> found = objArray.find(obj,obj.id,100);

This searches for all objects that contain id=450; a maximum of 100 results per OpenCL channel are returned.

The order of the results (indices) in the vector is not guaranteed to be the same between multiple calls to find().

Since it uses GPUs, the search performance depends on VRAM + GPU. On a system with a single GTX 1070, searching 1GB of data for ~200 elements takes at most 45 milliseconds, while a GT1030 + 2x K420 system takes 250 milliseconds to find the same elements containing the required member value.

If the member value to find is at the end of the array, this completes much quicker than std::find; if it is at the beginning of the array, std::find is faster. But during the search, no RAM bandwidth is used, only VRAM bandwidth. Only the resulting index array is copied from VRAM to RAM, which is negligible compared to std::find's bandwidth requirement. If the array is small enough to fit into L1, there is no use for a virtual array anyway; this virtual array is for big data that does not fit inside RAM.

Since the search time is roughly constant and it returns an array of indices for the matching elements, it is something like a std::map augmented with a one-to-many relation, for read-only access.
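Put together, a hedged usage sketch (reusing the MyObject/objArray names from the example above) shows that read-only one-to-many lookup:

C++
MyObject key;
key.id = 450;

// up to 100 matching indices per OpenCL channel
std::vector<size_t> indices = objArray.find(key, key.id, 100);

for (const size_t idx : indices)
{
    MyObject m = objArray.get(idx);   // every result has m.id == 450
    // ... use m ...
}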

Currently, this is the only GPU-accelerated method. Maybe a std::transform-like method can be added in the future, but it may not be needed at all if the user's project already has an OpenCL/CUDA-accelerated compute library.

During the method call, before the "find" kernel runs, all active pages in RAM are flushed to the VRAM of the graphics cards so that find() actually returns up-to-date results. Concurrent write operations on the virtual array during a find() call are not advised; concurrent reading is OK.

A Vectorized & Tiled N-body Compute Benchmark Using Virtual Array

The nbody algorithm computes the movements of masses under the gravitational field produced by those masses. This example focuses on its biggest bottleneck, the force calculation between N masses.

Since CPUs have more than one floating-point lane per core (namely SIMD), it would be a waste not to use all of them in a math-heavy algorithm like nbody. The easy way to do this is the mipp library (a header-only library that can use SSE/AVX/AVX512): all a developer needs to do is use its vector classes to do the computations in bigger chunks, 4 at a time, 8 at a time, 16, ...

Vectorization is just one part of the optimization; the other part is tiling. Tiling exploits repeated access to a smaller region of memory to use caching better:

  • L1 cache tiling (assuming just 4000 x,y,z coordinates can fit inside L1 of CPU)
  • RAM tiling (getting big chunks of virtual array data to RAM at once, working on them efficiently)

Cache tiling is simple: reusing a small set of variables (the coordinates of some particles) again and again is enough to use the cache, and many CPUs have at least tens of kB of L1 cache, which is enough to hold thousands of mass coordinates (x = 4 bytes, y = 4 bytes, z = 4 bytes: 12 bytes per particle). To do this, a tile of particles is prepared and re-scanned for every particle in the particle array. Actually, there is no particle object here, just arrays: X, Y and Z arrays for 3D position data and VX, VY and VZ arrays for 3D velocity data.

RAM tiling is just for optimizing the x,y,z reading efficiency of the virtual array. Reading 4000 particles at once means the page-lock overhead and PCIe latency are divided by 4000 on average per particle. Mapped access to the virtual array does this and serves a raw pointer to work on. Since mapped access also aligns the temporary buffer to 4096 bytes, the aligned versions of AVX/SSE instructions can be used on it; aligned SSE/AVX loads are generally faster than unaligned loads.

Benchmark code:

C++
#include "GraphicsCardSupplyDepot.h"
#include "VirtualMultiArray.h"
#include "PcieBandwidthBenchmarker.h"
#include "CpuBenchmarker.h"

// testing
#include <iostream>
#include "omp.h"

// a simd tool from github
#define MIPP_ALIGNED_LOADS
#include "mipp/mipp.h"

int main()
{
    const int simd = mipp::N<float>();
    std::cout<<"simd width: "<<simd<<std::endl;

    const int n = simd * 40000;
    std::cout<<"particles: "<< n <<std::endl;

    GraphicsCardSupplyDepot depot;
    VirtualMultiArray<float> xVir(n, depot.requestGpus(), 4000, 1,{5,15,10});
    VirtualMultiArray<float> yVir(n, depot.requestGpus(), 4000, 1,{5,15,10});
    VirtualMultiArray<float> zVir(n, depot.requestGpus(), 4000, 1,{5,15,10});
    VirtualMultiArray<float> vxVir(n, depot.requestGpus(), 4000, 1,{5,15,10});
    VirtualMultiArray<float> vyVir(n, depot.requestGpus(), 4000, 1,{5,15,10});
    VirtualMultiArray<float> vzVir(n, depot.requestGpus(), 4000, 1,{5,15,10});

    // if you don't initialize data, 
    // floating point NaN values slow down the speed too much
    {
        CpuBenchmarker init(0,"init");
        #pragma omp parallel for
        for(int i=0;i<n;i+=1000)
        {
            xVir.mappedReadWriteAccess(i,1000,[i](float * ptr)
                 { for(int j=i;j<i+1000;j++) ptr[j]=j; },false,false,true);
            yVir.mappedReadWriteAccess(i,1000,[i](float * ptr)
                 { for(int j=i;j<i+1000;j++) ptr[j]=j; },false,false,true);
            zVir.mappedReadWriteAccess(i,1000,[i](float * ptr)
                 { for(int j=i;j<i+1000;j++) ptr[j]=j; },false,false,true);
            vxVir.mappedReadWriteAccess(i,1000,[i](float * ptr)
                 { for(int j=i;j<i+1000;j++) ptr[j]=j; },false,false,true);
            vyVir.mappedReadWriteAccess(i,1000,[i](float * ptr)
                 { for(int j=i;j<i+1000;j++) ptr[j]=j; },false,false,true);
            vzVir.mappedReadWriteAccess(i,1000,[i](float * ptr)
                 { for(int j=i;j<i+1000;j++) ptr[j]=j; },false,false,true);
        }
    }

    mipp::Reg<float> smoothing = 0.0001f;
    // mapped array access
    {
        CpuBenchmarker bench(((size_t)n) * n * sizeof(float) * 3, 
                       "mapped access for force-calc",((size_t)n)*n);
        const int tileSize = 4000;
        const int regTile = 500;
        #pragma omp parallel for num_threads(32) schedule(dynamic)
        for (int i = 0; i < n; i += regTile)
        {
            std::vector<float> x0 = xVir.readOnlyGetN(i,regTile);
            std::vector<float> y0 = yVir.readOnlyGetN(i,regTile);
            std::vector<float> z0 = zVir.readOnlyGetN(i,regTile);

            mipp::Reg<float> fma1[regTile];
            mipp::Reg<float> fma2[regTile];
            mipp::Reg<float> fma3[regTile];
            for(int j=0;j<regTile;j++)
            {
                fma1[j]=0.0f;
                fma2[j]=0.0f;
                fma3[j]=0.0f;
            }

            for (int ii = 0; ii < n; ii += tileSize)
            {

                xVir.mappedReadWriteAccess(ii, tileSize, [&,ii](float* ptrX1) {
                    const float* __restrict__ const ptrX = ptrX1 + ii;
                    yVir.mappedReadWriteAccess(ii, tileSize, [&,ii](float* ptrY1) {
                        const float* __restrict__ const ptrY = ptrY1 + ii;
                        zVir.mappedReadWriteAccess(ii, tileSize, [&, ii](float* ptrZ1) {
                            const float* __restrict__ const ptrZ = ptrZ1 + ii;

                            for (int ld = 0; ld < tileSize; ld += simd)
                            {
                                mipp::Reg<float> x = mipp::load(ptrX + ld);
                                mipp::Reg<float> y = mipp::load(ptrY + ld);
                                mipp::Reg<float> z = mipp::load(ptrZ + ld);

                                for(int reg0 = 0; reg0 < regTile; reg0++)
                                {
                                    const int reg = reg0 ;
                                    const mipp::Reg<float> x0r = x0[reg];
                                    const mipp::Reg<float> y0r = y0[reg];
                                    const mipp::Reg<float> z0r = z0[reg];

                                    const mipp::Reg<float> dx = mipp::sub(x,x0r);
                                    const mipp::Reg<float> dy = mipp::sub(y,y0r);
                                    const mipp::Reg<float> dz = mipp::sub(z,z0r);
                                    const mipp::Reg<float> dx2 = mipp::mul(dx,dx);
                                    const mipp::Reg<float> dy2 = mipp::mul(dy,dy);
                                    const mipp::Reg<float> dz2 = mipp::mul(dz,dz);
                                    const mipp::Reg<float> dxy2 = mipp::add(dx2,dy2);
                                    const mipp::Reg<float> dz2s = 
                                                mipp::add(dz2,smoothing);
                                    const mipp::Reg<float> smoothed = 
                                                mipp::add(dxy2,dz2s);
                                    const mipp::Reg<float> r = mipp::rsqrt(smoothed);
                                    const mipp::Reg<float> r3 = 
                                                mipp::mul(mipp::mul(r, r), r);


                                    fma1[reg] = mipp::fmadd(dx, r3, fma1[reg]);
                                    fma2[reg] = mipp::fmadd(dy, r3, fma2[reg]);
                                    fma3[reg] = mipp::fmadd(dz, r3, fma3[reg]);
                                }
                            }
                        }, false, true, false);
                    }, false, true, false);
                }, false, true, false);
            }

            for(int j=0;j<regTile;j++)
            {
                vxVir.set(i+j, vxVir.get(i+j) + mipp::hadd(fma1[j]));
                vyVir.set(i+j, vyVir.get(i+j) + mipp::hadd(fma2[j]));
                vzVir.set(i+j, vzVir.get(i+j) + mipp::hadd(fma3[j]));
            }
        }
    }

    return 0;
}

First, the virtual arrays are allocated (which may take several seconds on slow computers).

C++
VirtualMultiArray<float> xVir(n, depot.requestGpus(), 4000, 1,{5,15,10});

Here, 4000 (elements) is the page size, 5,15,10 are the parallel OpenCL data channels, and 1 is the page cache per channel. This means a total of (5+15+10)*1*4000 = 120,000 elements are cached in RAM, which is a bit more than 1/3 of all particles; so at most about 1/3 of the particles have RAM speed if all of them are used concurrently. For a system where all PCIe bridges have the same number of PCIe lanes, {5,5,5}, {10,10,10} or a similar arbitrary multiplier set can be used to maximize PCIe usage efficiency. On the development computer, the second card has higher PCIe bandwidth and the first card has some latency issues due to OS functions. To automate finding the best bandwidth, PcieBandwidthBenchmarker().bestBandwidth(5 /* or 10 */ ) can be used instead of {...}.

Then the arrays are initialized to some arbitrary values to avoid the slowdown of denormalized floating-point operations (without initialization, it gets 10x slower on the development machine). Lastly, forces are computed in a brute-force way (O(N^2) force calculations). The SIMD length (on the FX8150 with the -mavx -march=native g++ options) is 8, which means AVX instructions of the CPU compute 8 values at once. With SIMD length = 8, there are 320,000 particles to simulate; on an SSE-only CPU, it would create 160,000 particles instead.

Due to the scalability of the I/O operations, 32 threads are used in the OpenMP pragma directive. This is for 3 graphics cards on PCIe v2.0, an 8-thread CPU and a 64-bit build; it may give a different performance gain on other systems. Each "mapped access" to the virtual array means a VRAM-RAM data movement that takes some time and is asynchronous to the CPU thanks to the DMA engines. So, when a thread is blocked waiting for VRAM data, another thread can context-switch into its place and continue doing other work. This way, some (or all) of the I/O latency is hidden behind that new task and overall performance increases. Normally, an nbody algorithm would work just fine with 8 threads on an 8-core CPU, since it is pure math, but the virtual-array access adds I/O.

The same many-threads scalability can happen for any other I/O-bottlenecked algorithm, not just nbody, especially when the system has more asynchronous graphics-card DMA engines than CPU cores. For example, some Nvidia Tesla cards have 5 engines per card, so even a dual-core CPU could use 10-20 threads on a dual-Tesla system.

The mapping part:

C++
xVir.mappedReadWriteAccess(ii, tileSize, [&,ii](float* ptrX1) { 
    const float* __restrict__ const ptrX = ptrX1 + ii;
    ...
},false,true,false);

alone is not enough to extract full performance out of SIMD instructions. The compiler needs to know that the pointers do not alias each other. This is achieved by adding the __restrict__ keyword to the pointer; then the compiler can do the optimizations that lead to vectorization of the math code.

If this nbody algorithm were optimized with Barnes-Hut, it could handle millions of particles with ease; with a Fast-Fourier-Transform approach, it could handle billions. In those bigger simulations, bigger chunks of data would be read from the virtual array, and the OS might page some of them in/out during computation. To stop this, the first "false" parameter could be switched to "true". Since this simple example is unlikely to be affected by OS paging, that is not needed here and could even degrade performance, because pinning an array takes some time too.

The vectorization part is easy thanks to the mipp library; it would also work fine with any other vectorization library:

C++
// for all elements in virtual array's RAM tile
for (int ld = 0; ld < tileSize; ld += simd) 
{ 
   // aligned data load from memory
   // unaligned version: mipp::loadu(ptr) is slower
   mipp::Reg<float> x = mipp::load(ptrX + ld); 
   mipp::Reg<float> y = mipp::load(ptrY + ld); 
   mipp::Reg<float> z = mipp::load(ptrZ + ld); 

   // for all elements in cache tile
   for(int reg0 = 0; reg0 < regTile; reg0++) 
   { 
       // select a particle id in cache tile
       const int reg = reg0 ; 

       // broadcast particle data to all lanes of AVX vector (8 element copy)
       const mipp::Reg<float> x0r = x0[reg]; 
       const mipp::Reg<float> y0r = y0[reg]; 
       const mipp::Reg<float> z0r = z0[reg]; 

       // find 3D relative position between cache tile particle and RAM tile particle
       // in 8-element chunks
       const mipp::Reg<float> dx = mipp::sub(x,x0r); 
       const mipp::Reg<float> dy = mipp::sub(y,y0r); 
       const mipp::Reg<float> dz = mipp::sub(z,z0r); 

       // compute squares of relative positions
       // in 8-element chunks
       const mipp::Reg<float> dx2 = mipp::mul(dx,dx); 
       const mipp::Reg<float> dy2 = mipp::mul(dy,dy); 
       const mipp::Reg<float> dz2 = mipp::mul(dz,dz); 

       // find sum of squared distances in all dimensions + add a smoothing
       // in 8-element chunks
       const mipp::Reg<float> dxy2 = mipp::add(dx2,dy2); 
       const mipp::Reg<float> dz2s = mipp::add(dz2,smoothing); 
       const mipp::Reg<float> smoothed = mipp::add(dxy2,dz2s); 

       // find r (distance) between 2 selected particles
       // in 8-element chunks
       const mipp::Reg<float> r = mipp::rsqrt(smoothed); 

       // we need 3rd power of it because (dx+dy+dz)/r^3 equals (unit vector) *(1/r^2)
       // in 8-element chunks
       const mipp::Reg<float> r3 = mipp::mul(mipp::mul(r, r), r); 

       // multiply with dx,dy,dz and add to force components fx,fy,fz
       // no mass value used, all particles with same mass assumed
       // in 8-element chunks
       fma1[reg] = mipp::fmadd(dx, r3, fma1[reg]); 
       fma2[reg] = mipp::fmadd(dy, r3, fma2[reg]); 
       fma3[reg] = mipp::fmadd(dz, r3, fma3[reg]); 
   } 
}

The trick for optimizing the force-storage (fmadd) performance is not using "store" commands, just a vector of CPU registers in the fmadd operation (the last 3 instructions there). This lets the compiler do loads and stores only where necessary and use registers directly where it can; giving the compiler some freedom helps performance. For an FX8150 at 3.6GHz (no turbo), this achieves 90 GFLOPS of compute performance. At 3.6GHz, the absolute peak of the CPU is 230 GFLOPS, but the nbody algorithm does not have an equal number of "+" and "*" operations, so the achievable maximum is actually about 70% of peak, or 161 GFLOPS. To reach 161 GFLOPS, there would have to be a register-tiling scheme that uses registers more often than memory accesses. For simplicity, only cache tiling is used here, and it still achieves more than 50% of the achievable peak compute performance. On a Ryzen R9 3950X CPU, it should get close to 1 TFLOPS.

After a particle's force registers have accumulated the force components from all other particles, a horizontal-add operation sums their 8 lanes and writes the result to the velocity (virtual) array as 1 element.

Basic Error Handling

Currently, the constructor of VirtualMultiArray checks for inconsistencies between page size and array length, and between number of GPUs and number of pages, and throws relevant error messages.

C++
try
{          
     // throws error when number of pages (n/pageSize) < 20+20+20
     // throws error when n/pageSize division is not integer result
     VirtualMultiArray<char> img(n,gpus,pageSize,maxActivePagesPerGpu,{20,20,20});
}
catch(std::exception & e)
{
	std::cout<<e.what()<<std::endl;
}

Data copies between active pages and graphics cards (during get/set/map/read/write operations) also throw exceptions that tell which part is failing.
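So the same try/catch pattern as above can wrap element accesses too. A minimal sketch, reusing the data/Bar names from the first example and assuming the thrown types derive from std::exception as in the constructor example:

C++
try
{
    data.set(100, Bar());
    Bar b = data.get(100);   // a failed VRAM<->RAM copy surfaces here as an exception
}
catch(std::exception & e)
{
    std::cout << e.what() << std::endl;   // the message tells which part failed
}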

Points of Interest

Edit: the LRU caching overhead was optimized away by making all I/O wait in an idle loop. Now an 8-logical-core CPU can run 64 threads, 48 of them idle-waiting on I/O while 8 of them do math.

Feel free to add any ideas like making it an in-memory database, a gpu-accelerated science tool, bug-fixes, etc.

Optimization

Since the user can change the ratio of memory usage and bandwidth usage per card with the "memMult" parameter (vector<int>, {2,2,3,..}), computers with NVMe drives can offload some of the work from a low-end graphics card or an integrated GPU (which shares RAM with its host CPU) to the NVMe, if the swap file is on that drive. This could enable different load-balancing patterns for element accesses. Currently, the balancing is simply interleaved GPU access per element: virtual page 1 goes to virtual GPU 1, page 2 to virtual GPU 2, and so on; possibly a hundred virtual GPUs can serve 100-element-wide threaded page fetching.

If the physical cards are not on the same type of PCIe bridge (1 on a 16x slot, 1 on an 8x slot, etc.), VirtualMultiArray's "memMult" (std::vector<int>) constructor parameter can be used to put more pressure on the higher-bandwidth slot by giving it something like {2,1}, {3,1} or any required ratio. One can even prepare a mini-benchmark to pick the best values for the local computer.

Using small objects is not efficient unless pages contain enough objects to make up for the PCIe latency. A slow PCIe link may do only 100k read/write operations per second (~10 microseconds of latency), so the more bytes per object, the higher the efficiency. One potential optimization for image processing is converting pixels into 16x16 tiles (256 pixels, 3 floats each, ~3kB per tile object) for spatial locality and PCIe efficiency; a sketch of such a tile object is shown below.
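A hedged sketch of such a tile object (the struct layout is only an illustration of the idea, not something the library requires):

C++
// 16x16 pixels, 3 floats per pixel -> 256*3*4 = 3072 bytes per object (~3kB),
// so each page transfer carries a meaningful payload and neighboring pixels stay together.
struct PixelTile
{
    static constexpr int edge = 16;
    float r[edge * edge];
    float g[edge * edge];
    float b[edge * edge];
};

// e.g. a 1024x1024 image becomes a 64x64 grid of tiles:
// VirtualMultiArray<PixelTile> tiles(64*64, depot.requestGpus(), 64, 4);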

Currently, all active pages are pinned buffers that the OS paging system cannot move. Future versions will have an option to disable this, because pinned memory is a rather scarce resource and may not work for very big pages. If "error: buffer read" or similar output appears on std::cout, then pageSize is probably too high, there are too many pages, or the object is too big to be pinned by the OS/OpenCL (or there simply is not enough memory left).

Edit: some new features

To auto-tune the data sharing ratio for maximum bandwidth, there is an optional benchmark class:

C++
#include "PcieBandwidthBenchmarker.h"

// by default, uses 128MB vram of each card to test their pcie-copy performance
PcieBandwidthBenchmarker bench;

// slowest pcie card receives 2 channels, 
// others get higher "integer" number of channels
// that depends on their pcie performance
std::vector<int> ratios = bench.bestBandwidth(2);

// which can be adjusted for better benchmark results
PcieBandwidthBenchmarker bench(250);

// slowest card gets 10 channels, 
// better chance of overlapping multiple data-copies in multithreaded element access
std::vector<int> ratios = bench.bestBandwidth(10);

// in development computer, this had best results (3500MB/s) for sizeof(Obj)=1MB
VirtualMultiArray<Obj> data1(..,..,..,..,bench.bestBandwidth(2));

To auto-tune the data sharing ratio for maximum size, there is a new parameter (mem) in the constructor of VirtualMultiArray:

C++
// array data shared between cards in tune with their vram size
// 2 GB card serves 1GB, if 24 GB card serves 12 GB
auto maximizeMem = VirtualMultiArray<Type>::MemMult::UseVramRatios;

// this is default value, just keeps what user gives in "memMult" parameter {1,2,3}
auto customMem   = VirtualMultiArray<Type>::MemMult::UseDefault;

VirtualMultiArray<Obj> data1(..,..,..,..,bench.bestBandwidth(2),customMem);
VirtualMultiArray<Obj> data1(..,..,..,..,overridenParameter,maximizeMem);

If the mem parameter is given "UseVramRatios", it overrides the ratios in the memMult parameter, except for zero-valued elements:

C++
// array data shared between cards in tune with their vram size
// 2 GB card serves 1GB, if 24 GB card serves 12 GB
auto maximizeMem = VirtualMultiArray<Type>::MemMult::UseVramRatios;

VirtualMultiArray<Obj> data1(..,..,..,..,{0,overridden,overridden},maximizeMem);
VirtualMultiArray<Obj> data2(..,..,..,..,{0,0,overridden},         maximizeMem);

This way, the combined memory size is still maximized, but only the cards with non-zero ratios are used.

The benchmark method uses both device-to-host and host-to-device copies concurrently to measure combined bandwidth instead of just one-way performance. Read-optimized or write-optimized benchmark options may be added later; for now, the benchmark measures read+write performance at once.

Optimizing the memMult parameter for maximum size may reduce the combined bandwidth below the equal-data-distribution case (which is the default when the user does not use the last two constructor parameters of VirtualMultiArray). For example, a card with 24GB of VRAM would effectively throttle the throughput of all the other 2GB cards with "MemMult::UseVramRatios".

The benchmarking class also does not take the combined size into consideration and may cause capacity issues on a very asymmetrical GPU system.

Virtual array=5GB

             PCIE lanes  bandwidth  maximize bandwidth   maximize capacity     default

Gtx 1080ti:  1x          1GB/s      1GB/s                1GB/s                 1.0GB/s
Gtx 1070  :  4x          2GB/s      2GB/s                0.7GB/s               1.0GB/s
Gt1030    :  4x          2GB/s      overflow error       0.2GB/s               1.0GB/s

             PCIE lanes  vram      maximize bandwidth    maximize capacity     default

Gtx 1080ti:  1x          11GB      550MB                  2.6GB                1.7GB
Gtx 1070  :  4x          8 GB      2.2GB                  1.9GB                1.7GB
Gt1030    :  4x          2 GB      overflow error         0.5GB                1.7GB

Currently, the algorithm does not take these points into consideration; users may need to manually change the ratios in the benchmark results to avoid buffer errors. In particular, the main card of the system (the one that sends frames to the screen) has constant memory usage by the OS, which can be as high as hundreds of MB. That card's multiplier (probably the 1st element in the ratio array) can be manually reduced, for example to {1,3,3} instead of directly using the {2,3,3} output of a benchmark, as sketched below.
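A small sketch of that manual adjustment, using the benchmark and constructor calls shown earlier ({2,3,3} is the example output mentioned above; Obj, n, pageSize and maxActivePagesPerGpu stand for whatever values the rest of the program uses):

C++
std::vector<int> ratios = PcieBandwidthBenchmarker().bestBandwidth(2); // e.g. {2,3,3}

// leave some headroom on the main card that also drives the screen
if (!ratios.empty() && ratios[0] > 1)
    ratios[0] -= 1;                                                    // e.g. {1,3,3}

VirtualMultiArray<Obj> data(n, depot.requestGpus(), pageSize, maxActivePagesPerGpu, ratios);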

History

  • 5th February, 2021: Currently, this is a very basic version with very little documentation in itself, but it already beats a slow HDD in random-access performance by at least 15x and beats an old SSD in sequential read/write. When combined with big enough objects, it also beats an NVMe drive.
  • 8th February, 2021: Added an auto-tuning parameter option to the constructor of the VirtualMultiArray class. It adjusts memory consumption on each card to maximize the allowed array size (but does not check whether other apps use the same video card, so the main card may overflow when approaching the maximum too closely). Also added a benchmarking class that fine-tunes data sharing ratios to maximize bandwidth (instead of size).
  • 13th February, 2021: Added bulk read/write and mapping operations to greatly increase throughput
  • 6th March, 2021: Added gpu-accelerated find() method and an example for nbody algorithm
  • 15th March, 2021: Added LRU caching, how its overhead was balanced by overlapped I/O and benchmark results

License

This article, along with any associated source code and files, is licensed under The GNU General Public License (GPLv3)