
CS179: GPU Programming



  1. CS179: GPU Programming Lecture 7: Lab 3 Recitation

  2. Today • Miscellaneous CUDA syntax • Recap on CUDA and buffers • Shared memory for an N-body simulation • Flocking simulations • Integrators

  3. CUDA Kernels • Launching the kernel: • kernel<<<gridDim, blockDim, sMemSize>>>(args); • Need to know gridDim, blockDim, and sMemSize (as well as args) • If sMemSize isn't given, it defaults to 0
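A minimal launch sketch, assuming an elementwise kernel (the kernel name and the size computation are illustrative, not from the lab):

    // Hypothetical elementwise kernel, shown only to illustrate launch syntax.
    __global__ void addKernel(float *out, const float *a, const float *b, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)                                      // guard: grid may overshoot n
            out[i] = a[i] + b[i];
    }

    void launchAdd(float *d_out, const float *d_a, const float *d_b, int n) {
        int blockDim = 512;                             // threads per block
        int gridDim = (n + blockDim - 1) / blockDim;    // round up to cover all n
        // Third <<< >>> argument omitted, so shared memory defaults to 0 bytes.
        addKernel<<<gridDim, blockDim>>>(d_out, d_a, d_b, n);
    }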

  4. CUDA Kernels • Grid and block architecture: • Grids can be 1D, 2D, or, on CUDA 2.x+, 3D • Blocks can be 1D, 2D, or 3D • 1024 threads per block maximum (512 on older systems) • Dimensionality is only for convenience; choose what's best for you • Most applications are fine in 1D • Image processing lends itself more intuitively to a 2D block/grid • Shared memory size: • Requirement is application-dependent • Limited by CUDA version (probably 48 KB for you)
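For instance, a 2D launch for image work might look like this sketch (the kernel and its per-pixel operation are assumptions for illustration):

    // Hypothetical 2D kernel: one thread per pixel.
    __global__ void brighten(float *img, int width, int height) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;  // column index
        int y = blockIdx.y * blockDim.y + threadIdx.y;  // row index
        if (x < width && y < height)
            img[y * width + x] += 0.1f;
    }

    void launchBrighten(float *d_img, int width, int height) {
        dim3 block(16, 16);   // 256 threads per block, well under the 1024 cap
        dim3 grid((width + block.x - 1) / block.x,
                  (height + block.y - 1) / block.y);
        brighten<<<grid, block>>>(d_img, width, height);
    }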

  5. CUDA Functions • Three different kinds of CUDA functions: • __host__: runs on CPU (__host__ keyword is superfluous) • __device__: runs on GPU, only called from GPU • Think of these as helper functions • __global__: runs on GPU, only called from CPU • These are our kernel functions
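A sketch showing all three qualifiers together (the function names and the distance math are illustrative):

    // __device__ helper: compiled for the GPU, callable only from GPU code.
    __device__ float squareDist(float3 a, float3 b) {
        float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
        return dx * dx + dy * dy + dz * dz;
    }

    // __global__ kernel: launched from the CPU, must return void.
    __global__ void distToOrigin(const float3 *pts, float *d2, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            d2[i] = squareDist(pts[i], make_float3(0.0f, 0.0f, 0.0f));
    }

    // __host__ function: ordinary CPU code; the keyword could be omitted.
    __host__ void runDistToOrigin(const float3 *d_pts, float *d_d2, int n) {
        distToOrigin<<<(n + 511) / 512, 512>>>(d_pts, d_d2, n);
    }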

  6. CUDA Functions • Things to be aware of: • On older CUDA versions, __device__ and __global__ functions don't support recursion • Cannot take function pointers to __device__ functions • Restrictions on __global__ functions: • Must return void • 4 KB maximum size for parameters

  7. CUDA Functions • Error checking for memory calls: • Can check a call's status using cudaGetErrorString() • For lab 3, we provide a macro:

    #define gpuErrchk(ans) { gpuAssert((ans), (char *)__FILE__, __LINE__); }

    inline void gpuAssert(cudaError_t code, char *file, int line, bool abort = true) {
        if (code != cudaSuccess) {
            fprintf(stderr, "GPUassert: %s %s %d\n",
                    cudaGetErrorString(code), file, line);
            if (abort)
                exit(code);
        }
    }

• Call as gpuErrchk(cudaMemcpy(…));

  8. CUDA Variables • Like functions, variables have a few different types: • __device__/__constant__ • Stored in global/constant memory, respectively • Accessible by all threads and blocks • Set using cudaMalloc, cudaMemset, cudaMemcpy, etc. • We can also write to __device__ memory on the GPU • __shared__ • Lives in shared memory • Accessible only by threads within the associated block • Requires __syncthreads() calls to guarantee correctness
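A minimal shared-memory sketch, assuming a 1D block that cooperatively loads a tile of positions (all names are illustrative):

    __global__ void tileDemo(const float4 *positions, int n) {
        extern __shared__ float4 tile[];  // sized by the launch's sMemSize

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        // Each thread loads one element into the block's shared tile.
        tile[threadIdx.x] = (i < n) ? positions[i]
                                    : make_float4(0.0f, 0.0f, 0.0f, 0.0f);
        __syncthreads();                  // wait until every load has landed

        // ...all threads in the block can now read any tile[j] safely...
    }

    // Launched with the shared allocation as the third <<< >>> argument:
    // tileDemo<<<gridDim, blockDim, blockDim * sizeof(float4)>>>(d_pos, n);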

  9. CUDA Variables • Some CUDA vector variable types: • char1, uchar1, char2, uchar2, char3, uchar3, char4, uchar4, short1, ushort1, short2, ushort2, short3, ushort3, short4, ushort4, int1, uint1, int2, uint2, int3, uint3, int4, uint4, long1, ulong1, long2, ulong2, long3, ulong3, long4, ulong4, float1, float2, float3, float4, double2, … • Vector components available via .x, .y, .z, .w • var.x • Make vectors with make_<type>(args) • var = make_float3(1.0, 2.0, 3.0); • dim3: used for assigning block/grid size • Essentially just a uint3 • Each component of a dim3 must be at least 1!

  10. CUDA and Buffers • Need to know how to link buffers into CUDA • Nothing conceptually new, just some functions: • cudaGLRegisterBufferObject(bufferObj) • Used to first register the buffer into CUDA -- done once • cudaGLUnregisterBufferObject(bufferObj) • Once we’re done with it, we unregister -- done once • cudaGLMapBufferObject((void**)&devPtr, bufferObj); • Associates CUDA memory with the buffer -- done once per kernel call • cudaGLUnmapBufferObject(bufferObj); • Disassociates the buffer so OpenGL can read -- done once per kernel call after kernel finishes • Remember to include <cuda_gl_interop.h>
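Putting those calls together, a per-frame sketch might look like this (error checking omitted; updateKernel is a placeholder for the real simulation kernel):

    #include <cuda_gl_interop.h>

    __global__ void updateKernel(float4 *pos) { /* placeholder update */ }

    GLuint vbo;  // buffer object created and filled on the OpenGL side

    void registerOnce()   { cudaGLRegisterBufferObject(vbo); }    // at startup
    void unregisterOnce() { cudaGLUnregisterBufferObject(vbo); }  // at shutdown

    void stepOnce(int gridDim, int blockDim) {
        float4 *devPtr;
        cudaGLMapBufferObject((void **)&devPtr, vbo);  // hand the buffer to CUDA
        updateKernel<<<gridDim, blockDim>>>(devPtr);   // write new vertex data
        cudaGLUnmapBufferObject(vbo);                  // hand it back to OpenGL
    }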

  11. N-Body Simulation • 1 thread = 1 particle • Each kernel call handles one step of the simulation • Calculate acceleration, then velocity, then position* *not quite, as we'll see in a few slides • 1 block won't be enough for all of the particles • How do we share all the positions? • Load as much of global memory as fits into shared memory • Calculate acceleration based on those positions • Update velocity, then load new global memory and repeat
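A sketch of that tiling pattern, in the spirit of the classic N-body kernel (the softened inverse-square force and all names here are illustrative, and the WRAP trick from slide 20 is omitted for clarity):

    __global__ void accelStep(const float4 *oldPos, float4 *accelOut, int n) {
        extern __shared__ float4 shPos[];               // one tile of positions
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's particle
        float4 myPos = oldPos[min(i, n - 1)];
        float3 acc = make_float3(0.0f, 0.0f, 0.0f);

        // Sweep over all particles, one block-sized tile at a time.
        for (int tile = 0; tile * blockDim.x < n; tile++) {
            int j = tile * blockDim.x + threadIdx.x;
            shPos[threadIdx.x] = oldPos[min(j, n - 1)]; // cooperative load
            __syncthreads();

            for (int k = 0; k < blockDim.x; k++) {      // read the shared tile
                float3 d = make_float3(shPos[k].x - myPos.x,
                                       shPos[k].y - myPos.y,
                                       shPos[k].z - myPos.z);
                float r2 = d.x * d.x + d.y * d.y + d.z * d.z + 1e-4f; // softened
                float invR3 = rsqrtf(r2 * r2 * r2);     // 1 / r^3
                acc.x += d.x * invR3;
                acc.y += d.y * invR3;
                acc.z += d.z * invR3;
            }
            __syncthreads();  // don't overwrite the tile while others still read
        }
        if (i < n)
            accelOut[i] = make_float4(acc.x, acc.y, acc.z, 0.0f);
    }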

  12. N-Body Simulation

  13. Flocking • First, a video: • https://www.youtube.com/watch?v=ctMty7av0jc

  14. Flocking • 2 main ideas (3 for bird flocking) • Separation: bugs will try to stay away from other bugs • Cohesion: bugs will try to stay near the center of the swarm • Alignment: birds will try to head toward the average heading • Not present in bug flocking algorithms

  15. Flocking • Separation: think repelling magnets • An inverse-square law works pretty well: • accel ~= 1/d², where d is the distance between two particles (see the combined sketch after slide 17)

  16. Flocking • Cohesion: move toward the average position • Cohesion fights separation; try to find weighting factors that balance the two out well

  17. Flocking • Alignment: steer toward the average heading of neighbors • Dependent on both positions AND velocities! • Requires you to make more buffers to store velocities • A fair amount more work… a good candidate for extra credit!
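A device-side sketch of how the three terms might combine for one boid (the weights, the all-pairs neighbor loop, and the function name are assumptions for illustration, not the lab's required structure):

    // Hypothetical per-boid accumulation of separation, cohesion, alignment.
    // Assumes n > 1 boids.
    __device__ float3 flockingAccel(int i, const float3 *pos, const float3 *vel, int n) {
        const float wSep = 1.0f, wCoh = 0.01f, wAli = 0.05f;  // made-up weights
        float3 sep = make_float3(0.0f, 0.0f, 0.0f);
        float3 center = make_float3(0.0f, 0.0f, 0.0f);
        float3 heading = make_float3(0.0f, 0.0f, 0.0f);

        for (int j = 0; j < n; j++) {
            if (j == i) continue;
            float3 d = make_float3(pos[i].x - pos[j].x,
                                   pos[i].y - pos[j].y,
                                   pos[i].z - pos[j].z);
            float d2 = d.x * d.x + d.y * d.y + d.z * d.z + 1e-6f;
            float invD3 = rsqrtf(d2 * d2 * d2);  // separation magnitude ~ 1/d^2
            sep.x += d.x * invD3;  sep.y += d.y * invD3;  sep.z += d.z * invD3;
            center.x += pos[j].x;  center.y += pos[j].y;  center.z += pos[j].z;
            heading.x += vel[j].x; heading.y += vel[j].y; heading.z += vel[j].z;
        }
        float inv = 1.0f / (float)(n - 1);  // averages over the other boids
        return make_float3(
            wSep * sep.x + wCoh * (center.x * inv - pos[i].x) + wAli * (heading.x * inv - vel[i].x),
            wSep * sep.y + wCoh * (center.y * inv - pos[i].y) + wAli * (heading.y * inv - vel[i].y),
            wSep * sep.z + wCoh * (center.z * inv - pos[i].z) + wAli * (heading.z * inv - vel[i].z));
    }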

  18. Integrators • After acceleration is calculated, update • Simple (forward) Euler is easiest… • But it's a bad integrator! • Symplectic Euler works better: • Basic idea: update velocity based on the old position, then update the position based on the new velocity • new_vel = old_vel + dt * accel(old_pos) • new_pos = old_pos + dt * new_vel • More complex integrators can be even more accurate, but are also harder to implement • If you have time, try implementing a different one (Runge-Kutta, maybe?) for EC
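The slide's update order, written out as a sketch (accel here stands in for whatever force computation the simulation uses; forward Euler would advance the position with the old velocity instead):

    // Symplectic Euler: velocity first (from the old position's acceleration),
    // then position from the *new* velocity.
    __device__ void symplecticEuler(float3 *pos, float3 *vel, float3 accel, float dt) {
        vel->x += dt * accel.x;  vel->y += dt * accel.y;  vel->z += dt * accel.z;
        pos->x += dt * vel->x;   pos->y += dt * vel->y;   pos->z += dt * vel->z;
    }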

  19. Pingponging • Two sets of buffers: one for the new state, one for the old. • Why? • Suppose one block finishes while another block is still reading • The finished block's new positions would be read as if they were old ones! • Solution: pingponging with 2 buffers • Both buffers are already made for you • Use one set for the old state, one for the new state, then flip when done
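Host-side, the flip can be as simple as swapping pointers; a sketch under assumed names (the lab already provides the two buffer sets):

    __global__ void updateKernel(const float4 *oldPos, float4 *newPos, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            newPos[i] = oldPos[i];  // placeholder for the real update
    }

    float4 *d_posOld, *d_posNew;    // the two position buffers

    void stepAndFlip(int gridDim, int blockDim, int n) {
        // Reads come only from d_posOld and writes go only to d_posNew, so a
        // slow block can never read positions another block already updated.
        updateKernel<<<gridDim, blockDim>>>(d_posOld, d_posNew, n);

        float4 *tmp = d_posOld;     // flip: new state becomes next step's old
        d_posOld = d_posNew;
        d_posNew = tmp;
    }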

  20. Final Notes • When loading from shared memory, be sure not to access out-of-bounds memory • Can use %, i.e., mod by the shared memory size • Problem: % is slow! • Solution: we provide a WRAP macro for you:

    #define WRAP(x, m) ((x) < (m) ? (x) : ((x) - (m)))
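Note that WRAP subtracts at most once, so it only matches % when x < 2*m; that holds for index arithmetic like this assumed usage:

    // Hypothetical read from the shared tile of the earlier N-body sketch.
    // offset is assumed < blockDim.x, so threadIdx.x + offset < 2 * blockDim.x:
    // exactly the single-wrap case WRAP handles.
    __device__ float4 readWrapped(const float4 *shTile, int offset) {
        return shTile[WRAP(threadIdx.x + offset, blockDim.x)];
    }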

  21. Final Notes • You will also need to set initial positions and velocities • This can be done however you’d like! • Idea: have a few initial clusters with semi-random velocities • Don’t feel restricted to this!

  22. Final Notes • gluPerspective: controls the camera • Based on your simulation, the current setup might not fit everything in view • Feel free to adjust! • gluPerspective(GLdouble fovy, GLdouble aspect, GLdouble zNear, GLdouble zFar)
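For example (the values and window-size names here are just illustrative):

    // 60° vertical field of view, window aspect ratio, near/far clip planes.
    gluPerspective(60.0, (double)windowWidth / (double)windowHeight, 0.1, 1000.0);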

  23. Final Notes • Due Wednesday, 5 PM • OH at the regular posted times • Important note: this lab will NOT work remotely! • sshing in and compiling will work fine, but running will throw strange errors! • 2 new CUDA-capable computers coming to 104ANB soon… • For now, get work done early if you need to use minuteman
