DirectCompute (DirectX 11.1 API)

Stephen Marz (23 May 2013)

Introduction

DirectCompute is Microsoft's counterpart to OpenCL (the Open Computing Language). It allows general-purpose computation, which is typically highly parallel vector math, to be performed on the GPU (Graphics Processing Unit).

The code and explanations provided are from my own personal experience with DirectCompute 5.0. I am not a Microsoft representative, nor do I work for Microsoft.

Data Types

To set up a DirectCompute context, the GPU must first be enumerated. This is done in much the same way that DirectX 11.1 sets up a GPU for rendering (DirectCompute shares the same API).
The Microsoft documentation is at: http://msdn.microsoft.com/en-us/library/windows/desktop/hh309466(v=vs.85).aspx

The data types used with the DirectX 11.1 API are:
  • ID3D11Device1 - A class that defines the actual GPU device (the 1 means DirectX 11.1)
  • ID3D11DeviceContext1 - A class that connects CPU-side programs with the GPU-side programs
  • ID3D11ComputeShader - A class that stores the compute shader (program)
  • ID3D11Buffer - A class that stores a block of memory on the GPU
  • ID3D11UnorderedAccessView - A class that gives a compute shader read/write access to a buffer's memory

A note about the Component Object Model (COM), which all of the above data types follow (hence the "I" for "interface" prefixing each type name):
The Component Object Model is a create-and-destroy type of model. All of these data types are pointers to a created object in memory. Freeing the resources on both the CPU and the GPU side requires a call to:

object->Release()
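
Since every interface must be released exactly once, a small helper can make cleanup less error-prone. Below is a minimal sketch (the SafeRelease name is my own, not part of the API):

template <typename T>
void SafeRelease(T *&object) {
   if (object) {
      object->Release(); // Drop our COM reference
      object = nullptr; // Null the pointer so a second SafeRelease() call is harmless
   }
}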

The libraries you will need to link in are:

  1. dxgi.lib
  2. d3dcompiler.lib
  3. d3d11.lib NOTE: d3d11.lib is for both DirectX 11.1 and DirectX 11

The header files you will need are:

  1. d3d11_1.h NOTE: This is for DirectX 11.1; for DirectX 11, use d3d11.h
  2. d3dcompiler.h
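
Assuming you are building with Visual C++, a typical preamble that pulls in these headers and libraries might look like the following (the #pragma comment lines can be replaced by linker settings in your project):

#include <d3d11_1.h> // DirectX 11.1 API (ID3D11Device1, ID3D11DeviceContext1, ...)
#include <d3dcompiler.h> // D3DCompileFromFile and the blob interfaces
#include <vector> // Used below for the adapter list

#pragma comment(lib, "dxgi.lib")
#pragma comment(lib, "d3dcompiler.lib")
#pragma comment(lib, "d3d11.lib")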

Initialization

To initialize a DirectCompute shader, you basically have to find the device you want to run it on, compile the shader, allocate the memory, and then dispatch the program.

Enumerating Adapters
IDXGIFactory2 *dxgiFactory;
IDXGIAdapter2 *dxgiAdapter;
HRESULT hresult;
int i = 0;

std::vector<IDXGIAdapter2*> adapterList;

hresult = CreateDXGIFactory1(__uuidof(IDXGIFactory2), (void**)&dxgiFactory);
if (FAILED(hresult)) {
   return false; // This typically means an argument is wrong (like if you only had Windows 7 and tried to create a Factory2)
}
// Note: EnumAdapters1 returns IDXGIAdapter1 pointers; the cast below works because IDXGIAdapter2 derives from
// IDXGIAdapter1, but you should QueryInterface for IDXGIAdapter2 before calling any of its newer methods.
while (dxgiFactory->EnumAdapters1(i++, (IDXGIAdapter1**)&dxgiAdapter) != DXGI_ERROR_NOT_FOUND) {
   adapterList.push_back(dxgiAdapter);
}

//Device creation code goes here

//Always ensure you free the resources!
for (i = 0;i < adapterList.size();i++) {
   adapterList[i]->Release();
}
dxgiFactory->Release();
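
The device creation below simply uses adapterList[0], but you may want to inspect each adapter before choosing one. A sketch of how that could look, using GetDesc1() to read the adapter's name and video memory (wprintf requires <cstdio>):

for (size_t j = 0; j < adapterList.size(); j++) {
   DXGI_ADAPTER_DESC1 desc;
   if (SUCCEEDED(adapterList[j]->GetDesc1(&desc))) {
      //desc.Description is the adapter name, desc.DedicatedVideoMemory is its VRAM in bytes
      wprintf(L"Adapter %u: %s (%llu MB)\n", (unsigned)j, desc.Description,
         (unsigned long long)(desc.DedicatedVideoMemory / (1024 * 1024)));
   }
}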
Creating the Device Object
ID3D11Device *oldStyleDevice;
ID3D11DeviceContext *oldStyleContext;
D3D_FEATURE_LEVEL afl, fl[] = { D3D_FEATURE_LEVEL_11_1, D3D_FEATURE_LEVEL_11_0, D3D_FEATURE_LEVEL_10_1 };
hresult = D3D11CreateDevice(
   adapterList[0], // (Adapter) This can be NULL to let Windows choose, or you can specify a device from the enumerated devices from above
   D3D_DRIVER_TYPE_UNKNOWN, // (Adapter Type) This must be UNKNOWN if you specify the previous argument, can also be D3D_DRIVER_TYPE_{SOFTWARE,HARDWARE,WARP}
   NULL, // (Software Module) If you specified D3D_DRIVER_TYPE_SOFTWARE, this supplies the module that handles the software renderer
   0, // (Flags) An OR'ed list of D3D11_CREATE_DEVICE flags (for example, D3D11_CREATE_DEVICE_DEBUG)
   fl, // (Feature Level) This specifies a list of feature levels. If the feature level is not compatible, it will go down the list until it finds one that is.
   ARRAYSIZE(fl), // This specifies the number of feature levels in the array. Windows has a macro called ARRAYSIZE to help.
   &oldStyleDevice, // (Device) This is the device object that will be created.
   &afl, // (Feature Level) This is the actual feature level that is compatible with the device above.
   &oldStyleContext // (Context) This is the context object that will be created.
);
if (FAILED(hresult)) {
   return false; // The device could not be created successfully
}
//Typically you would stop there; however, to get a DirectX 11.1 device, you have to query the DirectX 11 objects for their 11.1 interfaces.
//To do that, we use the device's and context's QueryInterface() method (you may want to check the returned HRESULTs; these calls fail on systems without DirectX 11.1).


ID3D11Device1 *newStyleDevice;
ID3D11DeviceContext1 *newStyleContext;
oldStyleDevice->QueryInterface(__uuidof(ID3D11Device1), (void**)&newStyleDevice);
oldStyleContext->QueryInterface(__uuidof(ID3D11DeviceContext1), (void**)&newStyleContext);
//Always remember to free the resources of an object you will no longer use
oldStyleDevice->Release();
oldStyleContext->Release();
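
The cs_5_0 profile used in the next section requires feature level 11_0. On 10.x-level hardware only the cs_4_x profiles are available, and only if the driver supports them, which can be queried through CheckFeatureSupport(). A quick check might look like this (using the afl value returned above):

if (afl < D3D_FEATURE_LEVEL_11_0) {
   //Feature level 10.x: compute shader support is optional, so ask the driver
   D3D11_FEATURE_DATA_D3D10_X_HARDWARE_OPTIONS opts = {};
   newStyleDevice->CheckFeatureSupport(D3D11_FEATURE_D3D10_X_HARDWARE_OPTIONS, &opts, sizeof(opts));
   if (!opts.ComputeShaders_Plus_RawAndStructuredBuffers_Via_Shader_4_x) {
      return false; //No compute shader support on this device
   }
}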
Compiling the Shader
ID3DBlob *CSBlob, *ErrBlob; // Blob is just a storage mechanism in D3D
hresult = D3DCompileFromFile(
   L"myshader.cs", //This argument typically uses UNICODE, thus the L in front is necessary
   NULL, // (Macros) If you have macros inside of your shader, this will populate them, NULL if you don't
   NULL, // (Includes) If you have any includes, specify them here, otherwise NULL.
   "main", // (Entry Point) The name of the entry function (I called mine main here).
   "cs_5_0", // (Profile) The text name of the profile (cs_5_0 is DirectX 11, cs_4_1 is DirectX 10.1, etc.)
   0, // (Flags1) Any flags to be specified here
   0, // (Flags2) Any more flags to be specified.
   &CSBlob, // (Blob) Where to store the compiled shader (program)
   &ErrBlob // (Error blob) If there is an error, it will detail it in this location
);

if (FAILED(hresult)) {
   OutputDebugStringA((LPCSTR)ErrBlob->GetBufferPointer());
   ErrBlob->Release();
   return false;
}

ID3D11ComputeShader *computeShader;
hresult = newStyleDevice->CreateComputeShader(CSBlob->GetBufferPointer(), CSBlob->GetBufferSize(), 0, &computeShader);
if (FAILED(hresult)) {
   return false;
}
CSBlob->Release();
newStyleContext->CSSetShader(computeShader, 0, 0);
computeShader->Release(); //As of DirectX 11, CSSetShader holds a reference, so we can release it at our end. DirectX 10 does NOT, so you can't do this here
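
D3DCompileFromFile() compiles the shader every time the program runs. Alternatively, you can compile it ahead of time with the fxc.exe compiler from the Windows SDK (for example: fxc /T cs_5_0 /E main /Fo myshader.cso myshader.cs) and load the resulting bytecode instead. A sketch of that approach (the .cso file name is just a placeholder):

ID3DBlob *csoBlob;
hresult = D3DReadFileToBlob(L"myshader.cso", &csoBlob); //Reads the precompiled bytecode into a blob
if (SUCCEEDED(hresult)) {
   newStyleDevice->CreateComputeShader(csoBlob->GetBufferPointer(), csoBlob->GetBufferSize(), 0, &computeShader);
   csoBlob->Release();
}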
Setting up the Buffers
D3D11_BUFFER_DESC shaderDataBufferDesc; // This buffer will contain the input and output data for the shader
D3D11_BUFFER_DESC shaderCopyBufferDesc; // Since the CPU can't directly read GPU memory, this is a copy buffer to copy output data to for the CPU.
D3D11_UNORDERED_ACCESS_VIEW_DESC shaderAccessViewDesc; // An unordered access view gives the shader the ability to "view" the data buffer.

ID3D11Buffer *shaderDataBuffer;
ID3D11Buffer *shaderCopyBuffer;
ID3D11UnorderedAccessView *shaderAccessView;

// It is always a good idea to zero out the memory. The ZeroMemory() function is a macro for memset(dst, 0, len);
ZeroMemory(&shaderDataBufferDesc, sizeof(D3D11_BUFFER_DESC));
ZeroMemory(&shaderCopyBufferDesc, sizeof(D3D11_BUFFER_DESC));
ZeroMemory(&shaderAccessViewDesc, sizeof(D3D11_UNORDERED_ACCESS_VIEW_DESC)); // NOTE: the access view uses its own descriptor type, not D3D11_BUFFER_DESC

// This is typical of DirectX. You fill a descriptor and pass it to some create function
shaderDataBufferDesc.Usage = D3D11_USAGE_DEFAULT; // There are several usage types, default allows the shader to read/write to this buffer
shaderDataBufferDesc.ByteWidth = sizeof(float)*4; // The ByteWidth field is the size of the buffer. We're making it the size of 4 floating points
shaderDataBufferDesc.MiscFlags = D3D11_RESOURCE_MISC_BUFFER_STRUCTURED; // This flag marks the buffer as a structured buffer (matching the RWStructuredBuffer in the shader)
shaderDataBufferDesc.BindFlags = D3D11_BIND_UNORDERED_ACCESS | D3D11_BIND_SHADER_RESOURCE; // These are OR'd flags saying that this buffer will be used for unordered access (and as a shader resource) inside of the shader
shaderDataBufferDesc.StructureByteStride = sizeof(float); // Since this is a structured buffer, it needs to know how to step from one element to the next.
hresult = newStyleDevice->CreateBuffer(&shaderDataBufferDesc, 0, &shaderDataBuffer);
if (FAILED(hresult)) {
   return false; // The buffer could not be created. Could be due to incorrect specifications in the descriptor or the GPU itself.
}

shaderCopyBufferDesc.Usage = D3D11_USAGE_STAGING; // The staging usage allows the CPU to interact with this buffer.
shaderCopyBufferDesc.ByteWidth = shaderDataBufferDesc.ByteWidth; // This buffer will copy the GPU memory to CPU readable memory, so this is the same size as above.
shaderCopyBufferDesc.CPUAccessFlags = D3D11_CPU_ACCESS_READ; // This buffer will be used by the CPU to read the results, we have to give it permission to do so.
shaderCopyBufferDesc.StructureByteStride = shaderDataBufferDesc.StructureByteStride; // Same as the data buffer's stride
hresult = newStyleDevice->CreateBuffer(&shaderCopyBufferDesc, 0, &shaderCopyBuffer);
if (FAILED(hresult)) {
   return false; // Could not create the copy buffer
}

shaderAccessViewDesc.Format = DXGI_FORMAT_UNKNOWN; // For a structured buffer the format must be DXGI_FORMAT_UNKNOWN; the element layout comes from the structure stride
shaderAccessViewDesc.Buffer.NumElements = 4; // The unordered access view can be accessed like an array using [], so this describes the size of this array
shaderAccessViewDesc.ViewDimension = D3D11_UAV_DIMENSION_BUFFER; // Basically this descriptor describes it will be attached to a buffer
hresult = newStyleDevice->CreateUnorderedAccessView(shaderDataBuffer, &shaderAccessViewDesc, &shaderAccessView);
if (FAILED(hresult)) {
   return false; //For some reason the GPU did not allocate an unordered access view
}
newStyleContext->CSSetUnorderedAccessViews(0, 1, &shaderAccessView, 0); //Arguments: StartSlot, Number of UAVs, The Access View Object, and Initial Counts
shaderAccessView->Release(); // CSSetUnorderedAccessViews holds a reference, so we can release ours here
Writing/Updating the Shader's Input Buffer
float data[] = { 1.0f, 2.0f, 3.0f, 4.0f }; //Put your initial data here, this will be overwritten with the shader's results
newStyleContext->UpdateSubresource(shaderDataBuffer, 0, 0, data, sizeof(float)*4, 0); //Arguments: The buffer, The subresource index (0), A destination box (NULL means the whole buffer), The data to write, the source row pitch (the size of our data), and the source depth pitch
//Now the buffer contains the data 1.0, 2.0, 3.0, and 4.0 for the shader
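
UpdateSubresource() is the way to go when the input changes between dispatches. If the data only needs to be uploaded once, it can instead be supplied when the buffer is created, through the second parameter of CreateBuffer() (which we passed as 0 above). A sketch of that alternative:

D3D11_SUBRESOURCE_DATA initialData;
ZeroMemory(&initialData, sizeof(D3D11_SUBRESOURCE_DATA));
initialData.pSysMem = data; //Pointer to the CPU-side data that will fill the buffer at creation time
hresult = newStyleDevice->CreateBuffer(&shaderDataBufferDesc, &initialData, &shaderDataBuffer);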
Making the Shader Do Work
newStyleContext->Dispatch(1, 1, 1); //Dispatch runs the shader. The arguments are the number of thread GROUPS to launch (X*Y*Z). Each group runs numthreads(2,2,1) = 4 threads, so a single group covers our 4-element buffer.
Copy the Shader's Output
// After the Dispatch() command, the shader will run and write its results back into the buffer.
// The CPU can't access this buffer, so we have to copy it into a buffer the CPU can access.


D3D11_MAPPED_SUBRESOURCE msr;
float dataFromShader[4];

newStyleContext->CopyResource(shaderCopyBuffer, shaderDataBuffer); //This will copy the GPU's buffer into the CPU readable buffer (called shaderCopyBuffer)
//To read the buffer, we have to map it and then use the CPU to copy it into accessible data
hresult = newStyleContext->Map(shaderCopyBuffer, 0, D3D11_MAP_READ, 0, &msr); // Arguments: The buffer to map, The subresource to map, The type of mapping, The mapping flags, The mapped resource
if (FAILED(hresult)) {
   return false; //For some reason, the CPU could not map the copy buffer
}
// Now msr.pData points to the mapped, CPU-readable contents of the copy buffer
memcpy(dataFromShader, msr.pData, sizeof(float)*4); //This copies the data from the GPU into the CPU data
newStyleContext->Unmap(shaderCopyBuffer, 0); // ALWAYS UNMAP AFTER A MAP!!
//Now you can read from the dataFromShader[] array to get the data from the shader
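
With the input 1.0, 2.0, 3.0, 4.0 and the squaring shader below, dataFromShader should now hold 1.0, 4.0, 9.0, and 16.0. A quick way to check (printf requires <cstdio>):

for (int j = 0; j < 4; j++) {
   printf("Result[%d] = %f\n", j, dataFromShader[j]); //Expect 1.000000, 4.000000, 9.000000, 16.000000
}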
The Actual Compute Shader (in HLSL)
// myshader.cs
//HLSL is very much like C or C++ except you're writing it for the GPU, not CPU
RWStructuredBuffer<float> Result : register (u0); //This is the actual data to and from this shader. Register u0 means it is bound to UAV slot 0 (the StartSlot we passed to CSSetUnorderedAccessViews)

struct CSInput {
   //uint3 means unsigned int 3-vector
   uint3 Gid : SV_GroupID; //The 3D ID of this thread's group within the dispatch
   uint3 DTid : SV_DispatchThreadID; //The 3D ID of this thread within the entire dispatch
   uint3 GTid : SV_GroupThreadID; //The 3D ID of this thread within its group
   uint GI : SV_GroupIndex; // The flattened (1D) index of this thread within its group

};

//The following numthreads(2,2,1) sets the size of one thread group: 2*2*1 = 4 threads. Dispatch() above specifies how many of these groups to launch.
[numthreads(2,2,1)]
//All this shader does below is multiply the value at the array by itself (squaring it)
void main(in CSInput csvalues) {
    float modValue = Result[csvalues.GI];
    Result[csvalues.GI] = modValue*modValue;
}
/*
To explain why we use csvalues.GI as the array subscript:
Dispatch(1,1,1) launches a single thread group, and numthreads(2,2,1) gives that group 2*2*1 = 4 concurrently running threads. SV_GroupIndex (GI) is the flattened index of a thread within its group (GTid.z*2*2 + GTid.y*2 + GTid.x), so here it ranges from 0 to 3, exactly one index per element of our 4-float buffer. Since the threads run concurrently, we have to make sure each thread reads and writes only the element assigned to it. Otherwise, you'll have a race condition.
*/
Cleaning Up and Exiting
//Remember to Release() everything and then exit
shaderDataBuffer->Release();
shaderCopyBuffer->Release();
newStyleContext->Release();
newStyleDevice->Release();