Cuda Tutorial #1

Code:project1.tar.gz

Here is a since simple Hello World program taken from http://computer-graphics.se/hello-world-for-cuda.html. This tutorial will c explain the process of converting from a c implementation to a cuda implementation.
CPU version

#include <stdio.h>
 
#define N 7

void add_arrays(int i,char *a, int *b) 
{
	a[i] += b[i];
}
 
int main()
{
	// Setup the arrays
	char a[N] = "Hello ";
	int b[N] = {15, 10, 6, 0, -11, 1,0};

	// print the contents of a[]
	printf("%s", a);

	// Perform the array addition
	int i;
	for ( i = 0; i< N;i++)
		add_arrays(i,a,b);

	// Display the results
	printf("%s\n", a);

	return 1;
}


Cuda version

#include <stdio.h>

#define N 7
 
__global__ 
void add_arrays(char *a, int *b) 
{
	a[threadIdx.x] += b[threadIdx.x];
}
 
int main()
{
	// Setup the arrays
	char a[N] = "Hello ";
	int b[N] = {15, 10, 6, 0, -11, 1,0};
  
	char *ad;
	int *bd;
	const int csize = N*sizeof(char);
	const int isize = N*sizeof(int);
 
	// print the contents of a[]
	printf("%s", a);
 
	// Allocate and Transfer memory to the device
	cudaMalloc( (void**)&ad, csize ); 
	cudaMalloc( (void**)&bd, isize ); 
	
	cudaMemcpy( ad, a, csize, cudaMemcpyHostToDevice ); 
	cudaMemcpy( bd, b, isize, cudaMemcpyHostToDevice ); 
	
	// Perform the array addition
	dim3 dimBlock( N  );  
	dim3 dimGrid ( 1  );
	add_arrays<<<dimGrid, dimBlock>>>(ad, bd);
	
	// Copy the Contents from the GPU
	cudaMemcpy( a, ad, csize, cudaMemcpyDeviceToHost ); 
	cudaFree( ad );
	
	// Display the results
	printf("%s\n", a);
	return EXIT_SUCCESS;
}

Overview of the program

  1. A string with the contents "Hello " is created and printed out
  2. An array of integers is created. This array when added to the string in a piecewise manner will change it from "Hello " to "World!"
  3. The contents of the string is printed out
The result of the program is the printing out of "Hello World". This cuda version program shows how to setup a kernel which runs on a block with 7 different threads, one for each element in the string and array. The next tutorial will explain more about gpu compute units, and gpu threads on a compute unit schedule multiple blocks

Defining the kernel

The kernel is the program that will be running on the gpu. The kernel is a function denote by prepending it with compiler directive __global__. In this example, the kernel will running on 7 different threads on a cpu compute unit. Unlike in the cpu version, in the cuda kernel each thread will share the same set of arguments so each thread needs an way to identify itself so it know what part of the array to work on. In the cuda kernel, since this is setup as a 1 dimensional problem each thread has a different threadIdx.x value starting with one and the first thread has a threadIdx.x value of 0, the second is 1, the third is 2 and so forth. The next tutorial will cover setting up 2 dimension problems.
Cpu version 

void add_arrays(int i,char *a, int *b) 
{
	a[i] += b[i];
}
Cuda Version

__global__ 
void add_arrays(char *a, int *b) 
{
	a[threadIdx.x] += b[threadIdx.x];
}

Allocating and Transfering Memory to the GPU

Before any work can happen on the gpu, any varibles that will need to be accessed by the kernel will need to be copied to the gpu. The function cudaMalloc creates space on the gpu for the varibles to reside.
	char *ad;
	int *bd;
	const int csize = N*sizeof(char);
	const int isize = N*sizeof(int);
			
	cudaMalloc( (void**)&ad, csize ); 
	cudaMalloc( (void**)&bd, isize ); 
	
Once memory is allocated, we will have to copy the contents of the local memory to the GPU. This is done with cudaMemcpy. This method takes a pointer to the local memory, a pointer to the GPU memory being copied to, the number of bytes that will be copied, and a flag which determines the direction of the memory transfer.
	cudaMemcpy( ad, a, csize, cudaMemcpyHostToDevice ); 
	cudaMemcpy( bd, b, isize, cudaMemcpyHostToDevice ); 

Scheduling the kernel

Now that the program has been converted to a Cuda program, you will have to change the way it is called. Rather than calling it with a for loop, the dimBlock function is used to define the number of threads the kernel will run on. It also describes the dimensions of the problem. This example is only using 1 dimension, but up to three dimensions are defined by dimBlock in the format dimBlock( xSize, ySize, zSize) but the total size is limit to the number of threads on the gpu compute unit.

The function dimGrid defines the number of blocks that will be scheduled on the different compute units on the gpu. In this example, only one block is being scheduled to run. Just like dimBlock, this can be defined in up to 3 dimensions.
Cpu version 

  int i;
  for ( i = 0; i< N;i++)
    add_arrays(i,a,b);
Cuda Version

  dim3 dimBlock( N  );  
  dim3 dimGrid ( 1  );
  add_arrays<<<dimGrid, dimBlock>>>(ad, bd);
	
Enqueuing the program to run is done so call calling the function as normal, but the two cuda parameters, gridsize and blocksize are added after the function name and is denoted in between <<< >>>.

Copy the results from the GPU

In the first line, the contents of a, is copied to location specified by ad on the GPU. This will tranfer for csize bytes. Next we run our program. Finally, once all the threads complete, we need to copy the contents from the GPU memory, and put it into local memory free the allocated memory, and print out the contents.
	cudaMemcpy( a, ad, csize, cudaMemcpyDeviceToHost ); 
	cudaFree( ad );
	
	printf("%s\n", a);

Compiling

The cpu version is compiled with
gcc cpu.c -o cpuVersion
The cuda version is compiled with nvcc cuda.cu -o cudaVersion