CUDA Programming

Presentation Transcript


  1. CUDA Programming David Monismith CS599 Based on notes from the Udacity Parallel Programming (cs344) Course

  2. CUDA Guarantees • All threads in a block run on the same SM (streaming multiprocessor) at the same time. • All blocks in a kernel will finish before any blocks of the next kernel start.
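  A minimal sketch of the second guarantee (the kernels squareAll and addOne and the array size are illustrative, not from the course): two kernels launched back to back on the same stream run in launch order, so every block of the first finishes before any block of the second starts.

      __global__ void squareAll(int *a) {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          a[i] = a[i] * a[i];
      }
      __global__ void addOne(int *a) {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          a[i] = a[i] + 1;
      }
      int main(void) {
          int *d_a;
          cudaMalloc((void **) &d_a, sizeof(int) * 256);
          cudaMemset(d_a, 0, sizeof(int) * 256);
          squareAll<<<4, 64>>>(d_a);   // every block of squareAll finishes...
          addOne<<<4, 64>>>(d_a);      // ...before any block of addOne starts
          cudaDeviceSynchronize();
          cudaFree(d_a);
          return 0;
      }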

  3. Memory Model • Every thread has local memory (e.g. local variables) • Threads in a block have access to a “per block” shared memory. • All threads can read from and write to global memory. • CPU memory is separate from GPU memory and is called host memory.
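  A small sketch of these spaces inside one kernel (the kernel and variable names are made up for illustration):

      __global__ void memSpacesEx(float *globalArr) {   // globalArr lives in global memory
          int idx = threadIdx.x;                        // idx is local (a register)
          __shared__ float blockBuf[256];               // blockBuf is shared by the block
          blockBuf[idx] = globalArr[idx];               // stage a global value in shared memory
          globalArr[idx] = blockBuf[idx] * 2.0f;        // write the result back to global memory
      }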

  4. Synchronization • Warning: threads can access and modify each other’s results in shared and global memory. • But what happens if one thread modifies another thread’s data before that thread has read it? • We need a tool to synchronize memory accesses and thread operations.

  5. Barriers • Similar to MPI and OpenMP, once a thread reaches a barrier it must wait until all other threads reach the barrier. • Then all threads may continue. • See the next slide for example code.

  6. Barriers • Need for barriers: without a barrier, a thread may read arr[idx-1] before its neighbor has finished writing it.
      int idx = threadIdx.x;
      __shared__ int arr[128];
      arr[idx] = threadIdx.x;
      if(idx > 0 && idx <= 127)
          arr[idx] = arr[idx-1];

  7. Barriers Continued • Should be rewritten as
      int idx = threadIdx.x;
      __shared__ int arr[128];
      arr[idx] = threadIdx.x;
      __syncthreads();                               // all writes finish before any reads
      int temp = (idx > 0) ? arr[idx-1] : arr[idx];  // every thread reads, so all threads reach the next barrier
      __syncthreads();                               // all reads finish before any overwrites
      if(idx > 0 && idx <= 127)
          arr[idx] = temp;

  8. __syncthreads() • __syncthreads() creates a barrier for all threads within a block; no thread continues past it until every thread in the block has reached it. • Implicit barriers also exist between kernel function calls. • So, CUDA is a hierarchy of computation, memory, and synchronization primitives.

  9. Efficient CUDA Programs • High Level Strategies • Modern GPUs can perform about 3 trillion math operations per second (3 TFLOPS) • Maximize arithmetic intensity: useful math operations per unit of memory accessed • Maximize the number of useful compute operations per thread • Minimize the time spent on memory access per thread

  10. Minimize Time Spent On Memory • Move frequently accessed data to shared memory. • Memory Speed: Local > Shared >> Global >> Host • Local – registers/L1 cache • Local Memory Example
      __global__ void locMemEx(double f) {
          double local_f;    // local variable, private to each thread
          local_f = f;
      }
      int main(int argc, char ** argv) {
          locMemEx<<<1,256>>>(10.2);
          cudaDeviceSynchronize();
      }

  11. Global Memory Example
      //Global memory
      __global__ void globalMemEx(double * myArr) {
          myArr[threadIdx.x] = 8.3 + myArr[threadIdx.x];   // myArr is in global memory
      }
      int main(int argc, char ** argv) {
          double * myHostArr = (double *) malloc(sizeof(double)*256);
          double * devArr;
          cudaMalloc((void **) &devArr, sizeof(double)*256);
          for(int i = 0; i < 256; i++)
              myHostArr[i] = i;
          cudaMemcpy((void *) devArr, (void *) myHostArr, sizeof(double)*256, cudaMemcpyHostToDevice);
          globalMemEx<<<1,256>>>(devArr);
          cudaMemcpy((void *) myHostArr, (void *) devArr, sizeof(double)*256, cudaMemcpyDeviceToHost);
      }

  12. Shared Memory Example
      __global__ void shmemEx(double * arr) {
          int i, idx = threadIdx.x;
          double avg, sum = 0.0;
          __shared__ double shArr[256];
          shArr[idx] = arr[idx];       // each thread stages one element in shared memory
          __syncthreads();
          for(i = 0; i < idx; i++) {
              sum += shArr[i];
          }
          avg = sum / (idx + 1.0);
          if(arr[idx] > avg)
              arr[idx] = avg;
          // This code does not affect the results visible in global memory.
          shArr[idx] += shArr[idx];
      }

  13. Code from Main Function
      shmemEx<<<1,256>>>(devArr);
      cudaMemcpy((void *) hostArr, (void *) devArr, sizeof(double)*256, cudaMemcpyDeviceToHost);

  14. Memory Access • We want threads to make contiguous memory accesses • The GPU is most efficient when threads read or write the same area of memory at the same time, because those accesses coalesce into a few transactions • When a thread accesses global memory it pulls in a whole chunk (transaction) of memory, not just a single data item • Contiguous is good • Strided, not so good • Random, bad • In class exercise, we will draw pictures of each type of memory access; a code sketch contrasting the first two patterns follows.
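  A hedged sketch of those two patterns (the kernel names are made up): adjacent threads touching adjacent elements coalesce into few memory transactions, while a stride between threads forces many more.

      // Contiguous (coalesced): thread i touches element i.
      __global__ void contiguousAccess(float *a) {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          a[i] = a[i] + 1.0f;                        // good: one transaction serves the warp
      }

      // Strided: thread i touches element i * stride.
      __global__ void stridedAccess(float *a, int stride) {
          int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
          a[i] = a[i] + 1.0f;                        // poor: the warp's accesses are scattered
      }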

  15. Memory Conflicts • Many threads may try to access the same memory location. • Ex: 1,000,000 threads accessing 10 array elements • Solve with atomics (a short sketch follows) • atomicAdd() • atomicMin() • atomicXor() • atomicCAS() - compare and swap
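  A minimal sketch of the many-threads, few-elements case (the kernel and variable names are illustrative, and the input values are assumed non-negative): atomicAdd() serializes the conflicting updates so the counts come out correct.

      // n threads count their data values into just 10 bins.
      __global__ void countIntoTen(int *bins, const int *data, int n) {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i < n)
              atomicAdd(&bins[data[i] % 10], 1);   // safe read-modify-write, but serialized per bin
      }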

  16. Atomic Limitations • Only certain operations and data types are supported • No mod or exponentiation • Mostly integer types • Any atomic op can be built from atomicCAS(), though it is quite complicated (see the sketch below) • Still no ordering constraints • Floating point arithmetic is non-associative • Ex: (a + b) + c != a + (b + c) • Atomics serialize memory access • This makes atomic ops very slow
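  A sketch of that CAS technique, following the well-known pattern from the CUDA C Programming Guide for building a double-precision atomic add out of atomicCAS() (note that GPUs of compute capability 6.0 and above provide atomicAdd() for double natively):

      __device__ double atomicAddDouble(double *address, double val) {
          // Reinterpret the 64-bit double as an unsigned long long so atomicCAS can operate on it.
          unsigned long long int *address_as_ull = (unsigned long long int *) address;
          unsigned long long int old = *address_as_ull, assumed;
          do {
              assumed = old;
              old = atomicCAS(address_as_ull, assumed,
                              __double_as_longlong(val + __longlong_as_double(assumed)));
          } while (assumed != old);   // retry if another thread changed the value in between
          return __longlong_as_double(old);
      }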

  17. In Class Exercise • Try each of the following: • 10^6 threads incrementing 10^6 elements • 10^5 threads atomically incrementing 10^5 elements • 10^6 threads incrementing 1000 elements • 10^6 threads atomically incrementing 1000 elements • 10^7 threads atomically incrementing 1000 elements • Time your results (a timing sketch using CUDA events follows)
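  One possible way to time the fourth case, 10^6 threads atomically incrementing 1000 elements, using CUDA events (the kernel name, grid shape, and sizes are assumptions for the exercise):

      #include <cstdio>

      // Placeholder kernel: n threads atomically increment `elems` counters.
      __global__ void incrementAtomic(int *counters, int elems, int n) {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i < n)
              atomicAdd(&counters[i % elems], 1);
      }

      int main(void) {
          const int n = 1000000, elems = 1000;        // 10^6 threads, 1000 elements
          int *d_counters;
          cudaMalloc((void **) &d_counters, sizeof(int) * elems);
          cudaMemset(d_counters, 0, sizeof(int) * elems);

          cudaEvent_t start, stop;
          cudaEventCreate(&start);
          cudaEventCreate(&stop);

          cudaEventRecord(start);
          incrementAtomic<<<(n + 255) / 256, 256>>>(d_counters, elems, n);
          cudaEventRecord(stop);
          cudaEventSynchronize(stop);

          float ms = 0.0f;
          cudaEventElapsedTime(&ms, start, stop);
          printf("10^6 atomic increments into 1000 elements: %f ms\n", ms);

          cudaFree(d_counters);
          return 0;
      }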

  18. Avoiding Control Structures • In CUDA we want to avoid thread divergence because the threads of a warp execute the same kernel instructions at the same time. • Threads that take different branches are forced to wait on one another (e.g., if one thread needs to execute the else path while another executes the if path, the two paths run one after the other). • This means we should avoid if statements in GPU code whenever possible; a sketch appears below.
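  A hedged sketch of the problem and of one branch-free alternative (both kernels are illustrative, not from the course):

      // Divergent version: even and odd threads in the same warp take different
      // branches, so the if path and the else path execute one after the other.
      __global__ void divergentBranch(float *a) {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i % 2 == 0)
              a[i] = a[i] * 2.0f;    // even threads
          else
              a[i] = a[i] + 1.0f;    // odd threads wait, then run
      }

      // One way to express the same computation without an explicit branch.
      __global__ void branchFree(float *a) {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          float isEven = (i % 2 == 0) ? 1.0f : 0.0f;         // typically compiles to a select, not a branch
          a[i] = a[i] * (1.0f + isEven) + (1.0f - isEven);   // even: a*2, odd: a+1
      }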

  19. Divergence • Divergence (in terms of threads) means threads doing different things. • This can happen in both loops and if statements. • It often occurs when loops run for different numbers of iterations in different threads (see the sketch below). • Keep in mind that the other threads in the warp have to wait until all divergent threads finish.
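  A small illustrative kernel (not from the course) in which the loop trip count depends on the thread index, so the threads of a warp diverge inside the loop; it assumes one block of 256 threads and an array of at least 256 elements.

      __global__ void divergentLoop(float *a) {
          int idx = threadIdx.x;
          float sum = 0.0f;
          for (int i = 0; i <= idx; i++)    // thread 0 does 1 iteration, thread 255 does 256
              sum += a[i];
          a[idx] = sum;                     // early finishers idle until the slowest thread is done
      }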
