
GPU Tutorial: How To Program for GPUs



  1. GPU Tutorial: How To Program for GPUs Krešimir Ćosić (1), (1) University of Split, Croatia

  2. Overview
  • CUDA
  • Hardware architecture
  • Programming model
  • Convolution on GPU

  3. CUDA
  • 'Compute Unified Device Architecture'
  • Hardware and software architecture for issuing and managing computations on the GPU
  • Massively parallel architecture: over 8,000 threads in flight is common
  • C for CUDA (C++ for CUDA): the C/C++ language with some additions and restrictions
  • Enables GPGPU: 'General Purpose Computing on GPUs'

  4. GPU: a multithreaded coprocessor
  • SP: scalar processor ('CUDA core'); executes one thread
  • SM: streaming multiprocessor; 32 x SP (or 16, 48 or more)
  • Fast local 'shared memory' (shared between the SPs of an SM); 16 KiB (or 64 KiB)
  [Figure: one SM built from SPs with its shared memory, sitting above the global memory on the device]

  5. GPU: SMs
  • A GPU consists of SMs: 30 x SM on GT200, 14 x SM on Fermi
  • For example, GTX 470: 14 SMs x 32 cores = 448 cores on a GPU
  • Global memory (on device): GDDR memory, 512 MiB - 6 GiB
  [Figure: several SMs, each with its own shared memory, connected to the global device memory]

  6. How To Program For GPUs
  • Parallelization: decomposition to threads
  • Memory: shared memory, global memory

  7. Important Things To Keep In Mind
  • Avoid divergent branches: threads of a single SM must be executing the same code, so code that branches heavily and unpredictably will execute slowly
  • Threads should be independent as much as possible
  • Synchronization and communication can be done efficiently only for threads of a single multiprocessor
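  To see what divergence costs in practice, here is a minimal sketch (hypothetical kernels, illustrative only): threads execute in lockstep groups of 32, so when threads of the same group take different sides of a branch, both paths are executed serially.

      // Bad: even and odd threads diverge inside every warp,
      // so each warp executes both branches one after the other.
      __global__ void divergent(float* data)
      {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i % 2 == 0)
              data[i] = data[i] * 2.0f;
          else
              data[i] = data[i] + 1.0f;
      }

      // Better: the condition is uniform across each group of 32 threads,
      // so every warp takes a single path.
      __global__ void uniform(float* data)
      {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if ((i / 32) % 2 == 0)
              data[i] = data[i] * 2.0f;
          else
              data[i] = data[i] + 1.0f;
      }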

  8. How To Program For GPUs
  • Parallelization: decomposition to threads
  • Memory: shared memory, global memory
  • Enormous processing power: avoid divergence
  • Thread communication: synchronization, no interdependencies

  9. Programming model

  10. Thread blocks
  • Threads are grouped in thread blocks: 128, 192 or 256 threads in a block
  • One thread block executes on one SM; all of its threads share the 'shared memory'
  • 32 threads are executed simultaneously (a 'warp')
  [Figure: a block as a 2D arrangement of threads, THREAD (0,0) through THREAD (1,2)]
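  A minimal sketch of what sharing the 'shared memory' within one block looks like (hypothetical kernel, assuming blocks of exactly 128 threads); __syncthreads() is the barrier that keeps the block's threads in step:

      __global__ void reverseBlock(float* data)
      {
          __shared__ float tile[128];        // fast per-block 'shared memory'; assumes 128 threads per block

          int i = blockIdx.x * blockDim.x + threadIdx.x;
          tile[threadIdx.x] = data[i];       // each thread loads one element

          __syncthreads();                   // wait until the whole block has loaded

          // Threads of the same block can now read each other's elements.
          data[i] = tile[blockDim.x - 1 - threadIdx.x];
      }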

  11. Thread blocks
  • Blocks execute on SMs: they execute in parallel, and they execute independently!
  • Blocks form a GRID
  • Thread ID: unique within a block
  • Block ID: unique within the grid
  [Figure: a grid of nine blocks, BLOCK 0 through BLOCK 8, each a 2D arrangement of threads]
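  A sketch of how such a layout is declared at launch (the kernel name demo is a placeholder, and the exact dimensions are taken from the figure purely for illustration):

      __global__ void demo() { }      // empty placeholder kernel

      int main()
      {
          dim3 block(3, 2);           // 3 x 2 threads per block, as in the figure
          dim3 grid(9);               // 9 blocks: BLOCK 0 .. BLOCK 8
          demo<<<grid, block>>>();    // launches 9 x 6 = 54 threads in total
          cudaDeviceSynchronize();    // wait for the kernel to finish
          return 0;
      }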

  12. Code that executes on GPU: Kernels
  • Kernel: a simple C function that executes on the GPU
  • Executes in parallel, as many times as there are threads
  • The keyword __global__ tells the compiler to make a function a kernel (and compile it for the GPU, instead of the CPU)
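  For example, a minimal kernel might look like this (a sketch with hypothetical names; each of the launched threads runs the function body once, on its own element):

      __global__ void scale(float* v, float a)      // runs once per thread
      {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          v[i] = a * v[i];                          // each thread handles one element
      }

      // Launched from the CPU with, e.g., 4 blocks of 128 threads:
      //   scale<<<4, 128>>>(devPtr, 2.0f);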

  13. Convolution
  • To get one pixel of the output image:
  - multiply (pixelwise) the mask with the image at the corresponding position
  - sum the products
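  As a reference point, here is a plain CPU sketch of the same operation (not from the original slides; the names are chosen to match the GPU kernel on the next slide, and only the 'valid' region with no padding is computed):

      void ConvolveCPU(const float* img, int imgW, int imgH,
                       const float* filt, int filtW, int filtH,
                       float* out)
      {
          int outW = imgW - filtW + 1;
          int outH = imgH - filtH + 1;

          for (int y = 0; y < outH; y++)
              for (int x = 0; x < outW; x++)
              {
                  float sum = 0;        // multiply mask with image, then sum
                  for (int fy = 0; fy < filtH; fy++)
                      for (int fx = 0; fx < filtW; fx++)
                          sum += img[(y + fy) * imgW + (x + fx)]
                               * filt[fy * filtW + fx];
                  out[y * outW + x] = sum;
              }
      }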

  14. Kernel - Example code pt 1
  The grid-stride loop below replaces the CPU version's outer loops
  (for (int y = 0; y < outH; y++) for (int x = 0; x < outW; x++) { ... }):

      __global__ void Convolve(float* img, int imgW, int imgH,
                               float* filt, int filtW, int filtH,
                               float* out)
      {
          const int nThreads = blockDim.x * gridDim.x;
          const int idx = blockIdx.x * blockDim.x + threadIdx.x;

          const int outW = imgW - filtW + 1;
          const int outH = imgH - filtH + 1;
          const int nPixels = outW * outH;

          for (int curPixel = idx; curPixel < nPixels; curPixel += nThreads)
          {
              int x = curPixel % outW;
              int y = curPixel / outW;

              float sum = 0;
              for (int filtY = 0; filtY < filtH; filtY++)
                  for (int filtX = 0; filtX < filtW; filtX++)
                  {
                      int sx = x + filtX;
                      int sy = y + filtY;
                      sum += img[sy * imgW + sx] * filt[filtY * filtW + filtX];
                  }
              out[y * outW + x] = sum;
          }
      }

  15. Setup and data transfer
  • cudaMemcpy: transfer data to and from the GPU (global memory)
  • cudaMalloc: allocate memory on the GPU (global memory)
  • The GPU is the 'device', the CPU is the 'host'
  • Kernel call syntax: Kernel<<<grid, block>>>(arguments); see the concrete call on slide 17

  16. Example setup and data transfer, pt 1

      int main()
      {
          ...
          float* img ...
          int imgW, imgH ...

          float* imgGPU;
          cudaMalloc((void**)&imgGPU, imgW * imgH * sizeof(float));
          cudaMemcpy(
              imgGPU,                            // Destination
              img,                               // Source
              imgW * imgH * sizeof(float),       // Size in bytes
              cudaMemcpyHostToDevice             // Direction
          );

          float* filter ...
          int filterW, filterH ...

          float* filterGPU;
          cudaMalloc((void**)&filterGPU, filterW * filterH * sizeof(float));
          cudaMemcpy(
              filterGPU,                         // Destination
              filter,                            // Source
              filterW * filterH * sizeof(float), // Size in bytes
              cudaMemcpyHostToDevice             // Direction
          );

  17. Example setup and data transfer, pt 2

          int resultW = imgW - filterW + 1;
          int resultH = imgH - filterH + 1;

          float* result = (float*)malloc(resultW * resultH * sizeof(float));

          float* resultGPU;
          cudaMalloc((void**)&resultGPU, resultW * resultH * sizeof(float));

          /* Call the GPU kernel */
          dim3 block(128);
          dim3 grid(30);
          Convolve<<<grid, block>>>(imgGPU, imgW, imgH,
                                    filterGPU, filterW, filterH,
                                    resultGPU);

          cudaMemcpy(
              result,                            // Destination
              resultGPU,                         // Source
              resultW * resultH * sizeof(float), // Size in bytes
              cudaMemcpyDeviceToHost             // Direction
          );

          cudaThreadExit();
          ...
      }

  18. Speedup
  • Linear combination of 3 filters sized 15x15
  • Image size: 2k x 2k
  • CPU: Core 2 @ 2.0 GHz (1 core)
  • GPU: Tesla S1070 (GT200): 30 x SM, 240 CUDA cores, 1.3 GHz
  • CPU: 6.58 s (0.89 Mpixels/s)
  • GPU: 0.21 s (27.99 Mpixels/s)
  • 31 times faster!

  19. CUDA capabilities
  • 1.0: GeForce 8800 Ultra/GTX/GTS
  • 1.1: GeForce 9800 GT, GTX, GTS 250; + atomic instructions …
  • 1.2: GeForce GT 220
  • 1.3: Tesla S1070, C1060, GeForce GTX 275, 285; + double precision (slow) …
  • 2.0: Tesla C2050, GeForce GTX 480, 470; + ECC, L1 and L2 cache, faster IMUL, faster atomics, faster double precision on Tesla cards …

  20. CUDA essentials
  • developer.nvidia.com/object/cuda_3_1_downloads.html
  • Download:
  - Driver
  - Toolkit (the nvcc compiler)
  - SDK (examples) (recommended)
  - CUDA Programming Guide

  21. Other tools
  • 'Emulator': executes on the CPU; slow
  • Simple profiler
  • cuda-gdb (Linux)
  • Parallel Nsight (Vista): simple profiler, on-device debugger


  23. Logical thread hierarchy
  • Thread ID: unique within a block
  • Block ID: unique within the grid
  • To get a globally unique thread ID, combine the block ID and the thread ID
  • Threads can access both shared and global memory
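  A minimal sketch of that combination for a 1D launch (the kernel name is hypothetical; the index expression is the same one the Convolve kernel above uses):

      __global__ void whoAmI(int* out)
      {
          // threadIdx.x is unique within the block, blockIdx.x within the grid;
          // together they give a globally unique thread index.
          int globalId = blockIdx.x * blockDim.x + threadIdx.x;
          out[globalId] = globalId;
      }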
