
SHOC: Overview and Kernel Walkthrough

SHOC: Overview and Kernel Walkthrough. Kyle Spafford, Keeneland Tutorial, April 14, 2011. The Scalable Heterogeneous Computing Benchmark Suite (SHOC) focuses on scientific computing workloads, including common kernels like SGEMM, FFT, and stencils.


Presentation Transcript


  1. SHOC: Overview and Kernel Walkthrough Kyle Spafford Keeneland Tutorial April 14, 2011

  2. The Scalable Heterogeneous Computing Benchmark Suite (SHOC) • Focus on scientific computing workloads, including common kernels like SGEMM, FFT, and stencils • Parallelized with MPI, with support for multi-GPU and cluster-scale comparisons • Implemented in both CUDA and OpenCL for a 1:1 comparison • Includes system and stability tests • SHOC Results Browser (beta): http://ft.ornl.gov/~kspafford/shoctown/Shoctown.html

  3. Download SHOC • Source code • http://ft.ornl.gov/doku/shoc/downloads • Build and Run • http://ft.ornl.gov/doku/shoc/gettingstarted • sh ./conf/config-keeneland.sh • make • cd tools • perl driver.pl -cuda -s 4 • Includes example output for Keeneland • FAQ • http://ft.ornl.gov/doku/shoc/faq

  4. SHOC Categories • Performance: • Level 0 • Speeds and feeds: raw FLOPS rates, bandwidths, latencies • Level 1 • Algorithms: FFT, matrix multiply, stencil, sort, etc. • Level 2: • Application kernels: S3D (chemistry), molecular dynamics • System: • PCIe Contention, MPI latency vs. host-device bandwidth, NUMA • Stability: • FFT-based, error detection

  5. (Level 0 Example): DeviceMemory • Motivation • Determine sustainable device memory bandwidth • Benchmark local, global, and image memory • Basic design • Test different memory access patterns, i.e. coalesced vs. uncoalesced • Measure both read and write bandwidth • Vary number of threads in a block • [Figure: coalesced vs. uncoalesced (thread-sequential) access patterns across threads]

  6. SHOC: Level 0 Tests • BusSpeedDownload/Readback • Measures bandwidth/latency of the PCIe bus • DeviceMemory • Measures global/constant/shared memory • KernelCompilation • Measures OpenCL JIT kernel compilation speeds • MaxFlops • Measures achievable FLOPS (synthetic, not bandwidth-bound) • QueueDelay • Measures OpenCL queueing system overhead

  7. (Level 1 Example): Stencil2D • Motivation • Supports investigation of accelerator usage within a parallel application context • Serial and True Parallel versions • Basic design • 9-point stencil operation applied to 2D data set • MPI uses 2D Cartesian data distribution, with periodic halo exchanges • Applies stencil to data in local memory • OpenCL/CUDA observations • Runtime dominated by data movement • Between host and card • Between MPI processes

  8. SHOC: Level 1 Tests • FFT • Reduction • Scan • SGEMM • Sort • SpMV • Stencil2D • Triad

  9. (Level 2 Example): S3D • Motivation • Measure performance of an important DOE application • S3D solves the Navier-Stokes equations for a regular 3D domain, used to simulate combustion • Basic design • Assign each grid point to a device thread • Highly parallel, as grid points are independent • OpenCL/CUDA observations • CUDA outperforms OpenCL • Big factor: native transcendentals (sin, cos, tan, etc.) • [Figure: 3D regular domain decomposition – each thread handles a grid point, blocks handle regions]

  10. SHOC: Other Tests • Stability • FFTs are sensitive to small errors • Repeated simultaneous FFT/iFFT • Parallel for testing large systems • System • MPI contention • Impact of GPU usage on MPI latency • Chipset contention • Impact of MPI communication on GPU performance • NUMA • Multi-socket, multiple PCIe slot, multiple RAM banks

  11. Compare OpenCL and CUDA • OpenCL improving, but still trailing CUDA • Tesla C2050, CUDA/OpenCL 3.2 RC2

  12. Example Results

  13. Reduction Walkthrough

  14. Reduction Walkthrough • Fundamental kernel in almost all programs • Easy to implement, but hard to get right • We’ll walk through the optimization process. • Code for these kernels is at http://ft.ornl.gov/~kspafford/tutorial.tgz • Graphics from a similar presentation by Mark Harris, NVIDIA

  15. Reduction Walkthrough • Start with the well-known, tree-based approach:

  16. Algorithm Sketch • Launch 1 thread per element • Each thread loads an element from global memory into shared memory • Each block reduces its shared memory into 1 value • This value is written back out to global memory

  17. Algorithm Sketch with Code • Main steps • Each thread loads a value from global memory into shared memory: extern __shared__ float sdata[]; unsigned int tid = threadIdx.x; unsigned int i = blockIdx.x*blockDim.x + threadIdx.x; sdata[tid] = g_idata[i]; • Synchronize threads: __syncthreads(); • Reduce shared memory into a single value • Write value out to global memory

  18. Reduction of Shared Memory for(int s = 1; s < blockDim.x; s *= 2) { if (tid % (2*s) == 0) { sdata[tid] += sdata[tid + s]; } __syncthreads(); }
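
Putting slides 17 and 18 together, a minimal sketch of the v0 kernel; the names g_idata/g_odata and the final write-back guard follow the NVIDIA reduction example rather than the slides themselves:

      // Reduction v0: naive tree-based reduction with interleaved addressing.
      __global__ void reduce0(const float* g_idata, float* g_odata)
      {
          extern __shared__ float sdata[];

          unsigned int tid = threadIdx.x;
          unsigned int i   = blockIdx.x * blockDim.x + threadIdx.x;

          // Each thread loads one element from global into shared memory.
          sdata[tid] = g_idata[i];
          __syncthreads();

          // Reduce the block's shared memory to a single value.
          for (int s = 1; s < blockDim.x; s *= 2) {
              if (tid % (2 * s) == 0) {
                  sdata[tid] += sdata[tid + s];
              }
              __syncthreads();
          }

          // Thread 0 writes the block's partial sum back to global memory.
          if (tid == 0) g_odata[blockIdx.x] = sdata[0];
      }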

  19. Reduction v0 • Problem 1: Divergent Warps • Problem 2: Modulo operator is expensive on GPUs

  20. Recursive Invocation • Problem: We want a single value, but blocks can’t communicate • Solution: Recursive kernel invocation
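
A minimal host-side sketch of the recursive invocation using the reduce0 kernel above; the buffer names d_input, d_partial, and d_result are hypothetical, and the launch assumes n = blocks * blockSize with blocks <= blockSize, all power-of-two sizes:

      unsigned int blocks = n / blockSize;

      // Pass 1: each block writes one partial sum to d_partial.
      reduce0<<<blocks, blockSize, blockSize * sizeof(float)>>>(d_input, d_partial);

      // Pass 2: a single block reduces the partial sums to the final value.
      reduce0<<<1, blocks, blocks * sizeof(float)>>>(d_partial, d_result);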

  21. Performance

  22. Reduction v1 • Get rid of divergent branch and modulo operator

  23. Reduction v1
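
The v1 code itself appeared as an image on this slide; a sketch of the usual fix (same shape as the NVIDIA reduction example) replaces the modulo test with a strided index so that the active threads stay contiguous within warps:

      // Reduction v1: interleaved addressing with a strided index.
      // Consecutive threads remain active, so there is no divergent branch or modulo.
      for (int s = 1; s < blockDim.x; s *= 2) {
          int index = 2 * s * tid;
          if (index < blockDim.x) {
              sdata[index] += sdata[index + s];
          }
          __syncthreads();
      }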

  24. Problem – Bank Conflicts • Shared memory is composed of 32 banks. • When multiple threads access *different* words in the *same* bank, access is serialized

  25. Performance

  26. Reduction v2 – Sequential Addressing
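
This slide also showed only an image; a sketch of the sequential-addressing loop (assumes a power-of-two block size):

      // Reduction v2: sequential addressing.
      // Each active thread reads from a contiguous upper half of shared memory,
      // which eliminates the bank conflicts of the strided version.
      for (int s = blockDim.x / 2; s > 0; s >>= 1) {
          if (tid < s) {
              sdata[tid] += sdata[tid + s];
          }
          __syncthreads();
      }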

  27. Performance

  28. Reduction v3 – Unrolling the Last Warp • We know threads execute in a warp-synchronous fashion • For the last few steps, we can get rid of extra __syncthreads() calls
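
A sketch of the unrolled tail in the 2011-era warp-synchronous style the slide relies on (the volatile qualifier keeps the compiler from caching shared-memory values in registers; on newer architectures this pattern needs __syncwarp() or warp shuffles, so treat it as illustrative and assume blockDim.x >= 64):

      // Reduction v3: once s <= 32 only one warp is active, so the remaining
      // steps can run without __syncthreads().
      __device__ void warpReduce(volatile float* sdata, unsigned int tid)
      {
          sdata[tid] += sdata[tid + 32];
          sdata[tid] += sdata[tid + 16];
          sdata[tid] += sdata[tid +  8];
          sdata[tid] += sdata[tid +  4];
          sdata[tid] += sdata[tid +  2];
          sdata[tid] += sdata[tid +  1];
      }

      // In the kernel body, stop the main loop at s = 32:
      for (int s = blockDim.x / 2; s > 32; s >>= 1) {
          if (tid < s) sdata[tid] += sdata[tid + s];
          __syncthreads();
      }
      if (tid < 32) warpReduce(sdata, tid);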

  29. Performance

  30. Reduction v4 – Multiple Elements Per Thread • Still have some instruction overhead • Can use templates to totally unroll the loop • Can have threads handle multiple elements from global memory • Bonus: reduces any array size to 2 kernel invocations • This is a useful optimization for most kernels
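
A compressed sketch of the v4 ideas: the block size becomes a template parameter so the reduction tree is fully unrolled at compile time, and each thread strides through global memory to accumulate several elements before the shared-memory reduction. This follows the structure of the NVIDIA reduction example; the tutorial code may differ in detail. It reuses the warpReduce helper from the v3 sketch and assumes blockSize >= 64:

      template <unsigned int blockSize>
      __global__ void reduce4(const float* g_idata, float* g_odata, unsigned int n)
      {
          extern __shared__ float sdata[];
          unsigned int tid      = threadIdx.x;
          unsigned int i        = blockIdx.x * blockSize * 2 + tid;
          unsigned int gridSize = blockSize * 2 * gridDim.x;

          // Each thread accumulates multiple elements from global memory.
          float sum = 0.0f;
          while (i < n) {
              sum += g_idata[i];
              if (i + blockSize < n) sum += g_idata[i + blockSize];
              i += gridSize;
          }
          sdata[tid] = sum;
          __syncthreads();

          // Tree reduction; blockSize is a compile-time constant, so dead branches vanish.
          if (blockSize >= 512) { if (tid < 256) sdata[tid] += sdata[tid + 256]; __syncthreads(); }
          if (blockSize >= 256) { if (tid < 128) sdata[tid] += sdata[tid + 128]; __syncthreads(); }
          if (blockSize >= 128) { if (tid <  64) sdata[tid] += sdata[tid +  64]; __syncthreads(); }
          if (tid < 32) warpReduce(sdata, tid);

          if (tid == 0) g_odata[blockIdx.x] = sdata[0];
      }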

  31. Performance

  32. More about Reduction • http://developer.download.nvidia.com/assets/cuda/files/reduction.pdf • Demo wget http://ft.ornl.gov/~kspafford/tutorial.tgz

  33. Programming Problem - Scan • Now let's think about how this extends to Scan (aka prefix sum). Scan takes a binary associative operator ⊕ and an array of n elements [a0, a1, …, an-1], and returns the array [a0, (a0 ⊕ a1), …, (a0 ⊕ a1 ⊕ … ⊕ an-1)]. • Example: if ⊕ is addition, [3 1 7 0 4] → [3 4 11 11 15]
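
For the correctness side of the exercise, a trivial host-side reference (not from the slides) that computes the inclusive scan sequentially can be used to check GPU results:

      // Sequential inclusive scan on the host, used only to verify device output.
      void scanCPU(const float* in, float* out, int n)
      {
          float sum = 0.0f;
          for (int i = 0; i < n; i++) {
              sum += in[i];    // the binary associative operator here is addition
              out[i] = sum;
          }
      }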

  34. Reduce-then-scan Strategy • Input: 7 3 8 5 5 1 2 6 • Kernel 1: Reduce (one partial sum per block) → 10 13 6 8

  35. Reduce-then-scan Strategy • Input: 7 3 8 5 5 1 2 6 • Kernel 1: Reduce → 10 13 6 8 • Kernel 2: Exclusive top-level scan of the block sums → 0 10 23 29

  36. Reduce-then-scan Strategy • Input: 7 3 8 5 5 1 2 6 • Kernel 1: Reduce → 10 13 6 8 • Kernel 2: Exclusive top-level scan → 0 10 23 29 • Kernel 3: Bottom-level scan, seeded with the top-level results → 7 10 18 23 28 29 31 37

  37. Fast Scan Kernel • Use twice as much shared memory as there are elements; set the first half to 0 and the second half to the input. • Step 0: 0 0 0 0 | 10 13 6 8 • Step 1: 0 0 0 0 | 10 23 19 14 • Step 2: 0 0 0 0 | 10 23 29 37 • for (i = 0; i < log2(blockSize); i++) smem[idx] += smem[idx - 2^i];
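
A minimal sketch of the zero-padded shared-memory scan described above; the temporary plus the two barriers avoid the read/write race within a step. SHOC's scanLocalMem is organized differently, so this is only illustrative:

      // Inclusive scan of one value per thread using 2*blockDim.x floats of shared memory.
      // Layout matches the slide: [ 0 ... 0 | x0 ... x(n-1) ].
      __device__ float scanLocal(volatile float* smem, float val)
      {
          unsigned int idx = blockDim.x + threadIdx.x;
          smem[threadIdx.x] = 0.0f;   // zero padding keeps idx - d in bounds
          smem[idx] = val;
          __syncthreads();

          for (unsigned int d = 1; d < blockDim.x; d *= 2) {
              float t = smem[idx] + smem[idx - d];   // read this step's operands
              __syncthreads();
              smem[idx] = t;                         // then write, avoiding a race
              __syncthreads();
          }
          return smem[idx];   // inclusive prefix sum for this thread's position
      }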

  38. Example Code • The kernel is found in SHOC (src/level1/scan/scan_kernel.h) in the scanLocalMem function • You can adapt this function for the top-level exclusive scan and the bottom-level inclusive scans. • Problems: • Determine how the reduction should stride across global memory • Figure out how to make it exclusive/inclusive (hint: remember the first half of smem is 0) • Figure out how to use the scan kernel for the bottom-level scan

  39. Good Luck! Further reading on Scan: • Examples in the CUDA SDK • http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/scan/doc/scan.pdf • http://back40computing.googlecode.com/svn/wiki/documents/ParallelScanForStreamArchitecturesTR.pdf

  40. MD Walkthrough • Motivation • Classic n-body pairwise computation, important to all MD codes such as GPU-LAMMPS, AMBER, NAMD, Gromacs, and CHARMM • Basic design • Computation of the Lennard-Jones potential force • 3D domain, random distribution • Neighbor list algorithm

  41. Algorithm Sketch
      for each atom, i {
          force = 0;
          for each neighbor, j {
              dist = distance(pos[i], pos[j]);
              if (dist < cutoff)
                  force += interaction(i, j);
          }
          forces[i] = force;
      }

  42. Performance Observations • Neighbors are data-dependent • Results in an uncoalesced read on pos[j]. • Uncoalesced reads kill performance • But sometimes the texture cache can help

  43. (in cuda/level1/md/MD.cu)
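
The slide showed the actual kernel from cuda/level1/md/MD.cu; below is a simplified sketch of a neighbor-list Lennard-Jones force loop. The names, the float4 position layout, the neighbor-list stride, and the unit epsilon/sigma are assumptions for illustration, not SHOC's exact code:

      // One thread per atom: accumulate Lennard-Jones forces over the atom's neighbor list.
      __global__ void ljForces(const float4* pos, float4* forces,
                               const int* neighborList, int maxNeighbors,
                               int nAtoms, float cutoffSq)
      {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i >= nAtoms) return;

          float4 ipos = pos[i];
          float3 f = make_float3(0.0f, 0.0f, 0.0f);

          for (int n = 0; n < maxNeighbors; n++) {
              int j = neighborList[n * nAtoms + i];   // data-dependent index
              float4 jpos = pos[j];                   // uncoalesced read; texture cache can help

              float dx = ipos.x - jpos.x;
              float dy = ipos.y - jpos.y;
              float dz = ipos.z - jpos.z;
              float r2 = dx * dx + dy * dy + dz * dz;

              if (r2 < cutoffSq) {
                  // Lennard-Jones scalar force divided by r (epsilon = sigma = 1 assumed).
                  float r2inv = 1.0f / r2;
                  float r6inv = r2inv * r2inv * r2inv;
                  float fscal = r2inv * r6inv * (48.0f * r6inv - 24.0f);
                  f.x += dx * fscal;
                  f.y += dy * fscal;
                  f.z += dz * fscal;
              }
          }
          forces[i] = make_float4(f.x, f.y, f.z, 0.0f);
      }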

  44. Performance on Keeneland

  45. For the Hands-On Session • Scan Competition • Goal: Best performance on 16 MiB input of floats. • Use this slide deck and knowledge from the other presentations • Test harness with timing and correctness check provided in scan.cu • Download it from ft.ornl.gov/~kspafford/tutorial.tgz • Email submissions (scan.cu) to kys@ornl.gov • I will announce the winner at the end of the hands-on session.

  46. Thanks! kys@ornl.gov
