Introduction
This article introduces GPGPU via DirectX 11 Compute Shaders.
GPGPU (General-Purpose Computing on Graphics Processing Units) involves using graphics processing units to perform repeated calculation, utilizing the vast array of processing elements available on the GPU.
This article will demonstrate a very simple trigonometric calculation executed on the GPU.
The additional attached code shows a classic use of GPGPU: squaring a square matrix (multiplying it by itself) by spawning nrow*nrow GPU threads. This example was chosen because each output element can be calculated independently of the others.
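To make the "one thread per output element" idea concrete, here is a minimal CPU-side sketch in C++. It is not the attached shader: the names squareElement and squareMatrix, the row-major layout, and the mapping from thread id to row and column are all assumptions chosen for illustration. The loop in squareMatrix stands in for the nrow*nrow GPU threads, which would run in parallel on real hardware.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU stand-in for ONE GPU thread (hypothetical helper): thread `id`
// computes exactly one element of out = m * m. Row-major storage and the
// id -> (row, col) mapping are assumptions, not taken from the attached code.
void squareElement(const std::vector<float>& m, std::vector<float>& out,
                   std::size_t nrow, std::size_t id)
{
    const std::size_t row = id / nrow;
    const std::size_t col = id % nrow;
    float sum = 0.0f;
    for (std::size_t k = 0; k < nrow; ++k)
        sum += m[row * nrow + k] * m[k * nrow + col];
    out[id] = sum;  // no other thread writes this slot, so no synchronization is needed
}

// Emulates launching nrow*nrow threads one after another.
std::vector<float> squareMatrix(const std::vector<float>& m, std::size_t nrow)
{
    assert(m.size() == nrow * nrow);
    std::vector<float> out(nrow * nrow, 0.0f);
    for (std::size_t id = 0; id < nrow * nrow; ++id)
        squareElement(m, out, nrow, id);
    return out;
}
```

Because each call to squareElement only reads the input and writes its own output slot, the calls can be executed in any order, which is what makes the problem a good fit for the GPU.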
Background
GPGPU has been around for more than a year, with NVIDIA introducing CUDA, AMD introducing Close to Metal and AMD Stream, and many enthusiasts using DirectX 9 pixel shaders to achieve GPGPU.
Using the code
The attached code is compiled using VS2010 Beta 1 with libraries from the DirectX SDK (August 2009) on Windows 7 RC. This code will not run on Windows XP, since DirectX 11 is not available for Windows XP. Some parts of the source code are taken from the DirectX SDK (August 2009) samples and adapted to suit the program.
The code's starting point is Start(void*). The program is divided into the following parts:
Creation of a device (the easiest part)
Use D3D_DRIVER_TYPE_REFERENCE for emulation, and D3D_DRIVER_TYPE_HARDWARE to run the code on the GPU (you will require hardware support for this).
D3D11CreateDevice( NULL,D3D_DRIVER_TYPE_REFERENCE,
NULL, D3D11_CREATE_DEVICE_SINGLETHREADED|D3D11_CREATE_DEVICE_DEBUG,
NULL, 0,D3D11_SDK_VERSION, &pDeviceOut, &flOut, &pContextOut );
Load the GPU
The tough part is that the programmer must create the buffers on the GPU and load the data into them for processing. The attached source code sheds a lot more light on this:
HRESULT CreateStructuredBufferOnGPU( ID3D11Device* pDevice,
    UINT uElementSize, UINT uCount, VOID* pInitData,
    ID3D11Buffer** ppBufOut )
{
    *ppBufOut = NULL;
    D3D11_BUFFER_DESC desc;
    ZeroMemory( &desc, sizeof(desc) );
    desc.BindFlags = D3D11_BIND_UNORDERED_ACCESS | D3D11_BIND_SHADER_RESOURCE;
    desc.ByteWidth = uElementSize * uCount;
    desc.MiscFlags = D3D11_RESOURCE_MISC_BUFFER_STRUCTURED;
    desc.StructureByteStride = uElementSize;
    if ( pInitData )
    {
        D3D11_SUBRESOURCE_DATA InitData;
        InitData.pSysMem = pInitData;
        return pDevice->CreateBuffer( &desc, &InitData, ppBufOut );
    }
    else
        return pDevice->CreateBuffer( &desc, NULL, ppBufOut );
}
HRESULT CreateBufferSRV( ID3D11Device* pDevice, ID3D11Buffer* pBuffer,
    ID3D11ShaderResourceView** ppSRVOut )
{
    D3D11_BUFFER_DESC descBuf;
    ZeroMemory( &descBuf, sizeof(descBuf) );
    pBuffer->GetDesc( &descBuf );
    D3D11_SHADER_RESOURCE_VIEW_DESC desc;
    ZeroMemory( &desc, sizeof(desc) );
    desc.ViewDimension = D3D11_SRV_DIMENSION_BUFFEREX;
    desc.BufferEx.FirstElement = 0;
    if ( descBuf.MiscFlags & D3D11_RESOURCE_MISC_BUFFER_ALLOW_RAW_VIEWS )
    {
        desc.Format = DXGI_FORMAT_R32_TYPELESS;
        desc.BufferEx.Flags = D3D11_BUFFEREX_SRV_FLAG_RAW;
        desc.BufferEx.NumElements = descBuf.ByteWidth / 4;
    }
    else if ( descBuf.MiscFlags & D3D11_RESOURCE_MISC_BUFFER_STRUCTURED )
    {
        desc.Format = DXGI_FORMAT_UNKNOWN;
        desc.BufferEx.NumElements =
            descBuf.ByteWidth / descBuf.StructureByteStride;
    }
    else
    {
        return E_INVALIDARG;
    }
    return pDevice->CreateShaderResourceView( pBuffer, &desc, ppSRVOut );
}
HRESULT CreateBufferUAV( ID3D11Device* pDevice, ID3D11Buffer* pBuffer,
    ID3D11UnorderedAccessView** ppUAVOut )
{
    D3D11_BUFFER_DESC descBuf;
    ZeroMemory( &descBuf, sizeof(descBuf) );
    pBuffer->GetDesc( &descBuf );
    D3D11_UNORDERED_ACCESS_VIEW_DESC desc;
    ZeroMemory( &desc, sizeof(desc) );
    desc.ViewDimension = D3D11_UAV_DIMENSION_BUFFER;
    desc.Buffer.FirstElement = 0;
    if ( descBuf.MiscFlags & D3D11_RESOURCE_MISC_BUFFER_ALLOW_RAW_VIEWS )
    {
        desc.Format = DXGI_FORMAT_R32_TYPELESS;
        desc.Buffer.Flags = D3D11_BUFFER_UAV_FLAG_RAW;
        desc.Buffer.NumElements = descBuf.ByteWidth / 4;
    }
    else if ( descBuf.MiscFlags & D3D11_RESOURCE_MISC_BUFFER_STRUCTURED )
    {
        desc.Format = DXGI_FORMAT_UNKNOWN;
        desc.Buffer.NumElements =
            descBuf.ByteWidth / descBuf.StructureByteStride;
    }
    else
    {
        return E_INVALIDARG;
    }
    return pDevice->CreateUnorderedAccessView( pBuffer, &desc, ppUAVOut );
}
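The NumElements arithmetic in the two view helpers above can be checked in isolation. The following sketch restates it in plain C++ with no Direct3D dependency; BufferInfo and viewElementCount are hypothetical names introduced only for this illustration, standing in for the few D3D11_BUFFER_DESC fields that the helpers read.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the D3D11_BUFFER_DESC fields used by the helpers.
struct BufferInfo
{
    std::uint32_t byteWidth;            // total size of the buffer in bytes
    std::uint32_t structureByteStride;  // size of one element; unused for raw views
    bool          raw;                  // true if the buffer allows raw views
};

// Mirrors the branches above: a raw view counts 4-byte words,
// a structured view counts whole elements.
std::uint32_t viewElementCount(const BufferInfo& b)
{
    if (b.raw)
        return b.byteWidth / 4;
    assert(b.structureByteStride != 0);
    return b.byteWidth / b.structureByteStride;
}
```

For example, assuming an 8-byte element (say, one int and one float) and 1024 elements, the buffer is 8192 bytes wide; a structured view over it would report 1024 elements, while a raw view over a buffer of the same size would report 2048.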
Run
This command launches the compute shader on the processing elements of the GPU as X * Y * Z thread groups. Its performance depends directly on the hardware and driver support (this applies to a device created using D3D_DRIVER_TYPE_HARDWARE).
pd3dImmediateContext->Dispatch( X, Y, Z );
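Choosing the arguments to Dispatch usually comes down to a ceiling division: enough thread groups must be launched so that every element gets a thread. The sketch below shows that calculation under stated assumptions: kThreadsPerGroup and groupCount are hypothetical names, and the group size of 64 is assumed to match a [numthreads(64, 1, 1)] attribute in the shader, which may differ from the attached code.

```cpp
#include <cassert>
#include <cstdint>

// Assumed group size; it must match the numthreads attribute of the shader.
const std::uint32_t kThreadsPerGroup = 64;

// Number of thread groups needed to cover elementCount elements
// (integer ceiling division, written to avoid overflow).
std::uint32_t groupCount(std::uint32_t elementCount,
                         std::uint32_t threadsPerGroup = kThreadsPerGroup)
{
    assert(threadsPerGroup > 0);
    return elementCount / threadsPerGroup
         + (elementCount % threadsPerGroup != 0 ? 1u : 0u);
}
```

With this helper, a one-dimensional launch over n elements could be written as Dispatch( groupCount(n), 1, 1 ). When n is not a multiple of the group size, the last group contains surplus threads, so the shader would need to skip any thread whose index is n or greater.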
Read output buffer
With DirectX 9, this part used to be the most painful, but with DirectX 11 Compute Shaders it has become a lot easier.
First, create a temporary read-back buffer with D3D11_USAGE_STAGING usage and the CPU access flag set to D3D11_CPU_ACCESS_READ. Then copy the output buffer into it and map it to a pointer as shown below:
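D3D11_MAPPED_SUBRESOURCE MappedResource;
pd3dImmediateContext->CopyResource( debugbuf, pBuffer );  // GPU output -> staging buffer
BufType *p;
pd3dImmediateContext->Map( debugbuf, 0, D3D11_MAP_READ, 0, &MappedResource );
p = (BufType*)MappedResource.pData;
// ... read the results through p, then release the mapping ...
pd3dImmediateContext->Unmap( debugbuf, 0 );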
Points of interest
With Compute Shaders, we can implement physics-based simulations involving liquids (probably my next project).
I have also implemented a compute shader using Vulkan: https://bitbucket.org/asif_bahrainwala/matrix-multiply/src/master/