CS179: GPU Programming

CS179: GPU Programming Lecture 11: Lab 5 Recitation

Today • Monte-Carlo Integration • Recap on CUBLAS/CURAND • Reductions • Optimizing a reduction

Monte-Carlo Integration • Integration is a common tool is computational math • Oftentimes used for finding areas • Integration is hard on a computer • Difficult to do analytically • Integration is sometimes analytically impossible • Can’t integrate exp(x2) analytically..

Monte-Carlo Integration • Could use discrete Riemann sum

Monte-Carlo Integration • What if there’s no predefined function? • Ex.: Area of union of shapes

Monte-Carlo Integration • Solution: Monte-Carlo Integration • Saturate bounded space with sample points • Check if each point is in any shape • Area = # of points in a shape / # of points total * area of space

Monte-Carlo Integration • Lab 5: Given N spheres in a bounded space, find the volume of their union • Possible to do analytically… • But very difficult! • Spheres have random positions, area of intersections, etc. • Makes good use of Monte-Carlo integration • Easy to check if a point is in any of the spheres • Easy to use CURAND to generate lots of points!

Lab 5 • Remember: CURAND has host API and device API • You will use both! • volumeCUBLAS: uses host API with CUBLAS • volumeCUDA: uses device API with reduction kernel

Lab 5volumeCUBLAS • Allocate necessary memory • Need memory for points • Need memory for 1 bool per point • Is point in any sphere? • Use CURAND host API to generate lots of points • Create, seed, generate, destroy • Use CheckPointsK kernel to see if each point is in a sphere • You must write this kernel! • Get total # of points in a sphere using cublasDasum • cublasDasum(int n, double *src, int stride) • Free initialized memory

Lab 5volumeCUDA • Allocate memory for data • Now, we also need memory for curandStates! • Generate lots of points using CURAND device API • Call GenerateRandom3K kernel -- but you must fill in the kernel! • Check if points are in sphere • Same as volumeCUBLAS • Use reduction to sum vector • More on this later… • Free memory

Lab 5Kernels • PointInSphere: Checks if a point is in a given sphere • Do this first! • Should be easy geometry • CheckPointsK: Checks if a point is in any sphere • Copy spheres to shared memory, then iterate through spheres • Remember to make sure array entry is non-NULL • GenerateRandom3K: Generates lots of float3 points • Use CURAND device API

Reduction • Iteratively reduces array via reduce function (ex. addition)

Reduction • Start with size = nPts / 2 • Repeatedly call reduction on block size, halving it each time • With main loop in host, device code is very simple… • Just need to add element i and element i + size for each thread • Alternatively, could build loop into device code, and call kernel only once • Once size == 1, we should have summed up all elements

Reduction • Lots of optimizations to make! • Avoiding thread divergence • Contiguous memory accesses • Avoiding shared memory bank conflicts • More we haven’t discussed yet… • Unrolling loops • Templates • And more!

Optimizations • Avoiding thread divergence • Avoid calls that make different calls to threads in same warp • if(threadIdx.x % 2 == 0) • Instead, group by warps • if(threadIdx.x / WARP_SIZE == 0) 1 2 3 4 1 2 3 4 0 2 0 1 3 2 7 4 3 2 7 4 0 1 10 2 7 4 10 2 7 4

Optimizations • Contiguous memory accesses • Memory is linear, can’t swap dimensions • Need to address non-sequential accesses… • Shared memory banks • Also solved by sequential addressing! 1 2 3 4 1 2 3 4 3 2 7 4 3 2 7 4 10 2 7 4 10 2 7 4

Optimizations • Example in reduction kernel: • Reversed loop indexing • for (inti = 1; i < max_size; i *= 2) { … } • for (inti = max_size / 2; i > 0; i /= 2) { … } 1 2 3 4 1 2 3 4 3 2 7 4 3 2 7 4 10 2 7 4 10 2 7 4

Optimizations • Unrolling loops • Basic idea: when reduction size < 32, threads are wasting space due to warps • Unrolling last iteration of loop saves useless work

Optimizations • Unrolling loops example: for (inti = max_size / 2; i > 0; i /= 2) { sdata[tid] += sdata[tid + i]; } for (inti = max_size / 2; i > 0; i /= 2) { sdata[tid] += sdata[tid + i]; if (tid < 32) { sdata[tid] += sdata[tid + 32]; sdata[tid] += sdata[tid + 16]; sdata[tid] += sdata[tid + 8]; // etc… } }

Optimizations • Advanced unrolling: templates • Exploit compiler to handle some conditions at compile-time • Use templated functions (like in C++) • Ex.: template<unsigned intblockSize> __global__ void kernel(…) { if (blockSize >= 512) // some reduction code; else if (blockSize >= 256) // some reduction code; // etc… } • Then, call templated function on host: • kernel<blockSize><<<gridSize, blockSize>>>(…);

Optimizations • Works well with a switch statement: switch (numThreads) { case 512: kernel<512><<<gridSize, blockSize>>>(…); case 256: kernel<256><<<gridSize, blockSize>>>(…); case 128: kernel<128><<<gridSize, blockSize>>>(…); // etc… }

CS179: GPU Programming