
nBody Simulation with CUDA


Presentation Transcript


  1. nBody Simulation with CUDA
  Joshua Brunner, Alexander Zdun, Erik Stadler

  2. Thread Synchronization
  • __syncthreads()
  • Keeps the threads in a block synchronized.
  • Calling it too often can hurt performance, especially when a block contains many threads.
  • One way to optimize: eliminate unneeded synchronization calls.
  • cudaThreadSynchronize()
  • Called from the main function.
  • Blocking call that ensures results are printed only after N simulation steps have completed (see the host-side sketch below).
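A minimal host-side sketch of where that blocking call sits, assuming a loop that launches one kernel per step. The kernel name stepBodies, the step count N, and the launch configuration are illustrative, not taken from the slides.

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical per-step kernel; a concrete tiled kernel that uses
// __syncthreads() is sketched under "Shared Memory" below.
__global__ void stepBodies(float4 *pos, float4 *vel, int n, float dt) { }

int main()
{
    const int n = 1024, N = 100;
    const float dt = 0.01f;
    float4 *d_pos, *d_vel;
    cudaMalloc((void **)&d_pos, n * sizeof(float4));
    cudaMalloc((void **)&d_vel, n * sizeof(float4));

    // Kernel launches are asynchronous: this loop only queues N steps.
    for (int step = 0; step < N; ++step)
        stepBodies<<<(n + 255) / 256, 256>>>(d_pos, d_vel, n, dt);

    // Blocking call from main: wait for every queued step to finish before
    // results are copied back and printed.
    cudaThreadSynchronize();   // renamed cudaDeviceSynchronize() in later CUDA releases
    printf("%d steps completed\n", N);

    cudaFree(d_pos);
    cudaFree(d_vel);
    return 0;
}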

  3. Shared Memory
  • Reduce the cost of memory accesses by staging data from global device memory into on-chip shared memory.
  • Making lots of global-memory accesses is expensive, but don't overuse registers either (a tiled kernel is sketched below).
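A sketch of that staging pattern for the n-body force loop: each block copies one tile of body positions into shared memory, synchronizes, and then every thread reads the whole tile from fast on-chip memory instead of global memory. The kernel name, the float4 layout with mass in .w, the softening term eps2, and the block size of 256 are assumptions, not details from the slides.

#include <cuda_runtime.h>

// Illustrative tiled n-body force kernel (gravitational constant omitted).
// Assumes blockDim.x <= 256 and that pos[i].w holds the mass of body i.
__global__ void bodyForce(const float4 *pos, float3 *acc, int n, float eps2)
{
    __shared__ float4 tile[256];                 // one tile of positions per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float4 myPos = (i < n) ? pos[i] : make_float4(0.f, 0.f, 0.f, 0.f);
    float3 a = make_float3(0.f, 0.f, 0.f);

    for (int base = 0; base < n; base += blockDim.x) {
        int j = base + threadIdx.x;
        tile[threadIdx.x] = (j < n) ? pos[j] : make_float4(0.f, 0.f, 0.f, 0.f);
        __syncthreads();                         // tile is fully loaded

        for (int k = 0; k < blockDim.x; ++k) {   // interactions read shared memory only
            float dx = tile[k].x - myPos.x;
            float dy = tile[k].y - myPos.y;
            float dz = tile[k].z - myPos.z;
            float inv = rsqrtf(dx * dx + dy * dy + dz * dz + eps2);
            float s = tile[k].w * inv * inv * inv;   // m_j / r^3 (padded bodies have mass 0)
            a.x += dx * s;  a.y += dy * s;  a.z += dz * s;
        }
        __syncthreads();                         // don't overwrite the tile while others still read it
    }
    if (i < n) acc[i] = a;
}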

  4. Register Optimization
  • Local variables are stored in registers.
  • Registers are faster than shared memory, but the register file is very limited.
  • Using fewer registers per thread lets more thread blocks be resident at once.
  • That raises warp occupancy (100% occupancy is not always necessary).
  • Higher occupancy helps hide memory latency: other warps keep executing while some wait on memory I/O (see the sketch below).
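Two common ways to cap register usage, with illustrative values rather than the project's actual settings: a per-kernel __launch_bounds__ qualifier, or nvcc's -maxrregcount flag for the whole compilation unit.

// Per-kernel hint: at most 256 threads per block, and enough free registers
// for at least 4 resident blocks per SM. If the bound is too tight the
// compiler spills values to (slow) local memory instead.
__global__ void __launch_bounds__(256, 4)
bodyForceBounded(const float4 *pos, float3 *acc, int n, float eps2)
{
    // ... same body as the tiled bodyForce kernel above ...
}

// Whole-file alternative on the nvcc command line:
//   nvcc -maxrregcount=32 nbody.cu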

  5. Performance Evaluation
  • Benchmarked CUDA against OpenMP (OMP).
  • OMP was benchmarked on an AMD Phenom 9950 BE (4x512 KB L2, 2 MB L3) @ 3 GHz and on University Linux systems (Data, Mikey).
  • CUDA averaged a 3x speedup over the 4-core AMD OMP version.
  • CUDA ran on a 2-SM ION (not benchmarked on G80/GT200).

  6. Performance Chart

  7. Performance Chart (2)

  8. ION vs G80 vs GF100
  • NVIDIA ION has 8 cores per SM; with 2 SMs it has 16 cores.
  • G80 also has 8 cores per SM; with 16 SMs it has 128 SPs (cores).
  • GF100 (Fermi): 512 CUDA cores, 32 cores per SM.
  • GF100 also has much higher memory bandwidth (GDDR5).
  • These figures can be queried at run time (see the sketch below).
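A small run-time check using cudaGetDeviceProperties; the cores-per-SM mapping below is simplified to the parts named on this slide and would need extending for other chips.

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // 8 cores per SM on G80/GT200-class parts (including ION),
    // 32 per SM on GF100 (Fermi, compute capability 2.0).
    int coresPerSM = (prop.major >= 2) ? 32 : 8;

    printf("%s: %d SMs, ~%d CUDA cores (compute %d.%d)\n",
           prop.name, prop.multiProcessorCount,
           prop.multiProcessorCount * coresPerSM, prop.major, prop.minor);
    return 0;
}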

  9. Architecture, Continued
  • Double precision vs. single precision.
  • The serial reference program used double precision.
  • We opted for single precision for increased performance (GT200).
  • Side note: G80 demotes the double type to 32-bit floats in software.
  • The GT200 architecture takes a large performance hit on double-precision calculations (about 1/8 of peak throughput).
  • GF100 only takes a 1/2 performance hit (a precision-switch sketch follows).
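One way to keep the precision choice in a single place is a typedef selected at compile time; the real typedef and the invDistCubed helper below are illustrative, not taken from the original code.

#include <cuda_runtime.h>

// Compile with -DUSE_DOUBLE for double precision; the default is float,
// which avoids the large double-precision penalty on GT200-class parts.
#ifdef USE_DOUBLE
typedef double real;
#define RSQRT(x) rsqrt(x)
#else
typedef float real;
#define RSQRT(x) rsqrtf(x)
#endif

// 1 / (r^2 + eps^2)^(3/2), the term each pairwise interaction needs.
__device__ real invDistCubed(real dx, real dy, real dz, real eps2)
{
    real inv = RSQRT(dx * dx + dy * dy + dz * dz + eps2);
    return inv * inv * inv;
}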
