
100M CUDA GPUs


Presentation Transcript


  1. 100M CUDA GPUs • GPU, CPU, CUDA, Heterogeneous Computing • Joy Lee, Senior SW Engineer, Development & Technology • Application domains: Oil & Gas, Finance, Medical, Biophysics, Numerics, Audio, Video, Imaging

  2. Optimization

  3. 3 Steps to Port your C/C++ code to CUDA • Step 1: Single Thread • Port your C/C++ code to a single-thread CUDA kernel and make sure the output is correct. • Focus on data movement between device & host memory. • Step 2: Single Block • Port the single-thread kernel into a single-block kernel and make sure the output is still correct. • Focus on parallelizing with the thread index. • Step 3: Multiple Blocks & Threads • Port the single-block kernel into a multi-block kernel and make sure the output is still correct. • Focus on combining the two-layer index system (block index and thread index) and determining the best index utilization (a sketch of this final step follows below).
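
  As an illustration of Step 3, here is a minimal sketch of an element-wise add using the two-layer index; the kernel name, sizes, and launch configuration are illustrative, not from the slides.

  __global__ void vec_add(const float* a, const float* b, float* c, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;   // block index × block size + thread index
      if (i < n)                                       // guard the tail block
          c[i] = a[i] + b[i];
  }

  // Host side (the focus of Step 1): move data in, launch enough blocks to cover n, move data out.
  // cudaMemcpy(d_a, h_a, n * sizeof(float), cudaMemcpyHostToDevice);
  // vec_add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
  // cudaMemcpy(h_c, d_c, n * sizeof(float), cudaMemcpyDeviceToHost);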

  4. 3 Steps to optimize your CUDA kernels • Step 1: set up timers to measure the kernel time • Use CUDA events to measure kernel execution time (see the sketch below). • Use clock() inside the kernel to measure in detail how much time each part takes. • Step 2: kernel parts & bottleneck division • Analyze your kernel and divide it into multiple parts. • Determine the bottleneck of each part. • Use the profiler to help find bottlenecks. • Step 3: per-part optimization • Optimize each part one by one, starting from the most time-consuming one. • Make sure the output is still correct after optimizing. • Make sure the kernel execution time actually becomes shorter after optimizing.
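
  A minimal sketch of the CUDA event timing from Step 1; my_kernel, grid, and block are placeholders for your own kernel and launch configuration.

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  cudaEventRecord(start);
  my_kernel<<<grid, block>>>(/* kernel arguments */);
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);                // wait until the kernel has finished

  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);    // elapsed time in milliseconds
  cudaEventDestroy(start);
  cudaEventDestroy(stop);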

  5. Bottlenecks Division I • PCIe bound • Suffers from too many cudaMemcpy transfers between host and device memory. • Memory bound • Limited by global (device) memory bandwidth or a non-coalesced memory access pattern. • Computing bound • Limited by arithmetic throughput (Flops). • Branch bound • Suffers from too many branch conditions. • Thread bound • Suffers from too few threads. • Register bound • Suffers from too few available registers (coupled with the thread bound).

  6. Bottlenecks Division II • PCIe bound • Try keeping all data in device memory as long as possible. • Try using CUDA streams to make data movement asynchronous and overlap it with computation (see the sketch below). • Memory bound • Try using texture, shared memory, constant memory, or the cache (Fermi and later) to reduce direct global memory I/O. • Computing bound • Try reducing the number of operations in your algorithm. • Try using intrinsic functions. • Try turning on the --use_fast_math compiler option. • After trying all possible ways you hit the hardware limit, which means this part is already close to optimal; switch to a faster card.
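
  A minimal sketch of overlapping copies and compute with two CUDA streams; the kernel name, chunk size, and buffers are illustrative assumptions, and the host buffers must be pinned (cudaMallocHost) for the copies to be truly asynchronous.

  cudaStream_t s[2];
  for (int i = 0; i < 2; ++i)
      cudaStreamCreate(&s[i]);

  for (int i = 0; i < 2; ++i) {
      cudaMemcpyAsync(d_in  + i * chunk, h_in  + i * chunk,
                      chunk * sizeof(float), cudaMemcpyHostToDevice, s[i]);
      process<<<grid, block, 0, s[i]>>>(d_in + i * chunk, d_out + i * chunk, chunk);
      cudaMemcpyAsync(h_out + i * chunk, d_out + i * chunk,
                      chunk * sizeof(float), cudaMemcpyDeviceToHost, s[i]);
  }
  cudaDeviceSynchronize();   // wait for both streams to drain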

  7. Bottlenecks Division III • Branch bound • Reduce the number of branches, especially diverged branches. • Thread bound • Use the compiler option -Xptxas -v to see how many registers each thread uses and how much shared memory (smem) each block uses (see the example below). • If the total register count per block exceeds the hardware limit, try --maxrregcount <N> to cap register usage per thread; the excess spills to local memory (DRAM), which costs performance. • Register bound • Note: the number of variables declared in a kernel is not the number of registers used; the compiler optimizes it down to fewer registers and drops unused variables. • Try reducing the number of variables in your algorithm. • Try changing the computation order; this can shorten the lifetime of some variables.
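
  A minimal sketch putting the two compiler options from this slide next to a trivial kernel; the file name and the cap of 32 registers are arbitrary examples, not recommendations.

  // Compile with:
  //   nvcc -Xptxas -v kernel.cu           -> prints registers per thread and smem per block
  //   nvcc --maxrregcount 32 kernel.cu    -> caps registers per thread (excess spills to local memory)
  __global__ void kernel(float* data)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      data[i] = data[i] * 2.0f;    // trivial body, only here so the compiler has something to report
  }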

  8. Warp & Branches I • Warp = 32 threads • SP: MUL & MAD • SFU: sin(x), cos(x), … • 1 block is divided into multiple warps • An SM can execute 1 warp in 4 clocks

  9. Warp & Branches II • A branch makes the warp diverge; each diverged part is executed one after another, in time order. • The more diverged branches, the slower the warp runs. (figure: non-diverged vs. 2-fold vs. 3-fold diverged warps)

  10. Warp & Branches III • Fat & slim diverged warps • If the diverged paths share some common instructions, moving them out of the branch saves execution time. • Generally speaking, making the branch as slim as possible saves time (see the sketch below). • Data loads/stores, common instructions, etc. left inside the branch make the diverged warp fatter. (figure: slim vs. fat diverged warp)
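
  A minimal sketch of slimming a diverged branch by hoisting the common load and store out of it; the kernel and the condition are illustrative assumptions.

  __global__ void slim_branch(const float* in, float* out, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i >= n) return;

      float v = in[i];        // common load hoisted out of the branch

      if (v > 0.0f)           // only the truly divergent work stays inside
          v *= 2.0f;
      else
          v = -v;

      out[i] = v;             // common store hoisted out of the branch
  }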

  11. Estimate the computing throughput • Computing weight • Isolate the computing parts & measure their share of the kernel time with the GPU clock() (see the sketch below). • Kernel execution time • Measure the kernel execution time and compute the time spent in those computing parts. • Used computing throughput in Flops • Count the total arithmetic operations and divide by that computing time.
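
  A minimal sketch of timing one part of a kernel with the device-side clock(); the work being measured and the per-block cycles output array are illustrative assumptions.

  __global__ void timed_kernel(const float* in, float* out, long long* cycles, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;

      clock_t start = clock();                 // per-SM cycle counter

      float v = 0.0f;
      if (i < n) {
          v = in[i];
          for (int k = 0; k < 100; ++k)        // the computing part being measured
              v = v * 1.000001f + 0.5f;
          out[i] = v;
      }

      clock_t stop = clock();
      if (threadIdx.x == 0)                    // record one sample per block
          cycles[blockIdx.x] = (long long)(stop - start);
  }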

  12. Example of intrinsic functions • __mul24(x, y) • faster than a full 32-bit integer product • computes the product of the 24 least significant bits of the integer arguments • __sinf(x), __cosf(x), __expf(x) • very fast, single precision • less precise, but often still acceptable • __sincosf(x, sptr, cptr) • computes sine and cosine in one call (a usage sketch follows below)
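
  A minimal sketch using the intrinsics from this slide; the kernel and its inputs are illustrative assumptions.

  __global__ void intrinsics_demo(const float* angle, float* out, int n)
  {
      int i = __mul24(blockIdx.x, blockDim.x) + threadIdx.x;   // 24-bit integer multiply
      if (i >= n) return;

      float s, c;
      __sincosf(angle[i], &s, &c);              // fast sine and cosine in one call
      out[i] = s * c + __expf(-angle[i]);       // fast single-precision exponential
  }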

  13. Memory system • Thread scope • Register: on-die, fastest, the default • Local memory: DRAM, not cached, slow (400~600 clocks) • Block scope • Shared memory: on-die, fast (4~6 clocks), qualifier __shared__ (see the sketch below) • Grid scope • Global memory: DRAM, not cached, slow (400~600 clocks) • Constant memory: cached on-die, small (64 KB total), qualifier __constant__ • Texture memory: read-only, DRAM + cache, fast on a cache hit • Cache (Fermi only): R/W cache
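
  A minimal sketch of block-scope shared memory: a block-wide sum reduction that assumes 256 threads per block; all names and sizes are illustrative.

  __global__ void block_sum(const float* in, float* block_out, int n)
  {
      __shared__ float smem[256];                     // one slot per thread in the block

      int i = blockIdx.x * blockDim.x + threadIdx.x;
      smem[threadIdx.x] = (i < n) ? in[i] : 0.0f;     // global -> shared
      __syncthreads();

      for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
          if (threadIdx.x < stride)
              smem[threadIdx.x] += smem[threadIdx.x + stride];
          __syncthreads();
      }

      if (threadIdx.x == 0)
          block_out[blockIdx.x] = smem[0];            // one partial sum per block
  }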

  14. Count memory bandwidth (exercise) • Memory access weight • Isolate the memory access parts & measure their share of the kernel time with the GPU clock(). • Kernel execution time • Measure the kernel execution time and compute the memory access time inside the kernel. • Used memory bandwidth • Count the total bytes accessed and divide by that access time (see the arithmetic sketch below).
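
  A minimal sketch of the final division for an element-wise kernel that reads and writes one float per element; n and access_ms (the memory access time measured above) are illustrative inputs.

  #include <cstddef>
  #include <cstdio>

  void report_bandwidth(size_t n, float access_ms)
  {
      size_t bytes = 2 * n * sizeof(float);              // 1 read + 1 write per element
      double gbps  = bytes / (access_ms * 1e-3) / 1e9;   // bytes / seconds, reported in GB/s
      printf("effective bandwidth: %.1f GB/s\n", gbps);
  }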

  15. Coalesced global memory I/O • Threads in a ½ warp share the same memory controller. • If the memory access pattern of the ½ warp is densely localized in memory, it is served by a single transaction and performs well. We call this coalesced I/O (see the sketch below). • If the accesses diverge into different memory segments, performance drops because multiple transactions are needed. We call this non-coalesced I/O. • How many threads of a warp share the same memory controller can differ between hardware generations.
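
  A minimal sketch contrasting a coalesced copy with a strided, non-coalesced one; both kernels and the stride are illustrative assumptions (the strided version assumes in[] holds at least n * stride elements).

  __global__ void copy_coalesced(const float* in, float* out, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n)
          out[i] = in[i];               // neighbouring threads read neighbouring addresses
  }

  __global__ void copy_strided(const float* in, float* out, int n, int stride)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n)
          out[i] = in[i * stride];      // threads in a ½ warp scatter across memory segments
  }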

  16. Wrap kernels as standard C/C++ functions • This compiles the kernels into standard object files or a library. • Link them to other languages: Java, Fortran, MATLAB, … • There is no need to rewrite all non-C code in CUDA; we can call kernels from any other language.

  17. Example: Wrap Kernel to C

  __global__ void ker_xxx(int* a, int* b){
      // some CUDA kernel
      // …
  }

  extern "C" {    // export as standard C format
      void xxx(int* a, int* b);
  }

  void xxx(int* a, int* b){
      // wrap the kernel into a C function
      // …
      ker_xxx<<<grid, block>>>(a, b);
      // …
  }

  18. Multi-GPU operations • Before CUDA 4.0, or with non-Tesla cards • One CUDA context can control only one GPU: it sends/receives data and launches kernels (one CPU thread owns one CUDA context, so one CPU thread controls one GPU). • We can use MPI, OpenMP, or pthreads to create multiple CPU threads, then call cudaSetDevice() to bind each CPU thread to its own GPU (see the sketch below). • Data communication: copy data from global memory back to system memory and exchange it through MPI, OpenMP, or pthread mechanisms. • CUDA 4.0 • UVA: unified virtual addressing (all GPUs & the CPU can see each other's data).
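
  A minimal sketch of one CPU thread per GPU using OpenMP and cudaSetDevice(); the per-GPU work and buffer layout are illustrative assumptions, and error checking is omitted.

  #include <omp.h>
  #include <cuda_runtime.h>

  void run_on_all_gpus(float** h_chunks, size_t chunk_bytes)
  {
      int ngpu = 0;
      cudaGetDeviceCount(&ngpu);

      #pragma omp parallel num_threads(ngpu)
      {
          int dev = omp_get_thread_num();
          cudaSetDevice(dev);                    // bind this CPU thread to one GPU

          float* d_buf = nullptr;
          cudaMalloc(&d_buf, chunk_bytes);
          cudaMemcpy(d_buf, h_chunks[dev], chunk_bytes, cudaMemcpyHostToDevice);
          // ... launch kernels for this GPU's share of the work ...
          cudaMemcpy(h_chunks[dev], d_buf, chunk_bytes, cudaMemcpyDeviceToHost);
          cudaFree(d_buf);
      }
  }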

  19. Hardware (SPA, Streaming Processors Array) (figure: an SPA built from multiple TPCs)

  20. Hardware (TPC, Texture/Processor Cluster)

  21. Hardware (SM, Streaming Multiprocessor) • Warp = 32 threads • SP: MUL & MAD • SFU: sin(x), cos(x), … • 1 block is divided into multiple warps • An SM can execute 1 warp in 4 clocks

  22. SPA, Streaming Processors Array (figure labels: Special Function Unit (SFU), Double Precision, TP Array, Shared Memory)

  23. How to use so many cores? • 240 SP thread processors • 30 DP thread processors • Full scalar processor • IEEE 754 double-precision floating point (figure labels: Thread Processor Array (TPA), Thread Processor (TP), Special Function Unit (SFU), Double Precision, Multi-banked Register File, SpcOps ALUs, FP/Int, TP Array, Shared Memory)
