
100M CUDA GPUs



Presentation Transcript


  1. 100M CUDA GPUs. GPU, CPU, CUDA, Heterogeneous Computing. Joy LEE @ NVIDIA. Application areas: Oil & Gas, Finance, Medical, Biophysics, Numerics, Audio, Video, Imaging

  2. GPU & CUDA Features

  3. Comparing GPU & CPU systems • Memory bandwidth • Host memory (CPU): ~10 GB/s • Device memory (GPU): ~100 GB/s • Bottleneck: PCIe at 3~5 GB/s • More computing cores • CPU: 4~32 cores (peak perf: 10~100 GFLOPS) • GPU: 32~512 cores (peak perf: >1 TFLOPS) • Large-scale parallelism • Traditional (MPI, OpenMP, pthreads, …): about 10~1K threads • CUDA: about 1K~10M threads (so the algorithmic concerns may differ)

  4. Graphics Card (diagram): GPU plus GDDR3/5 DRAM (device memory), connected to the host over PCIe

  5. Memory Bandwidth (diagram): the GPU (240 cores, thread manager) sits behind a 512-bit GDDR3 interface delivering ~102 GB/s, while the CPU (Core 1, Core 2, L2 cache) reaches main memory over a 64-bit FSB at ~12.8 GB/s, making the GPU memory interface roughly 8x faster

  6. Highway: device memory, ~100 GB/s. PCIe gen 2 x16: 3~5 GB/s. Street: host memory, ~10 GB/s

  7. NVIDIA’s GPUs: Ever-Increasing Performance (chart). T10 = Tesla 10-series, G9x = GeForce 9800 GTX, G80 = GeForce 8800 GTX, G71 = GeForce 7900 GTX, G70 = GeForce 7800 GTX, NV40 = GeForce 6800 Ultra, NV35 = GeForce FX 5950 Ultra, NV30 = GeForce FX 5800

  8. Hierarchy of concurrent threads • Grid: a CUDA kernel is a function executing on the GPU; one grid contains many blocks (practically 64~1024 blocks for good performance) • Block: one block contains many threads (practically 32~512 threads for good performance); blockIdx is the block index within the grid (0, 1, 2, 3, …); threads in the same block can cooperate: they can synchronize and share data through fast on-die memory called “Shared Memory” • Thread: threadIdx is the thread index within the block (0, 1, 2, 3, …)
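
A minimal kernel sketch (not from the slides; the kernel name and launch are illustrative assumptions) showing how blockIdx, blockDim, and threadIdx combine into a global thread index:

    __global__ void fill_with_global_index(int *out, int n)
    {
        // Global index = block offset plus thread offset within the block.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = i;                  // each thread writes its own global index
    }

    // Launch with enough 256-thread blocks to cover n elements:
    // fill_with_global_index<<<(n + 255) / 256, 256>>>(d_out, n);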

  9. IDs and Dimensions • Built-in variables: threadIdx, blockIdx, blockDim, gridDim • Dimensions are set at launch time and can be unique for each kernel launch • Blocks: 2D IDs, unique within a grid • Threads: 3D IDs, unique within a block • (Figure: Grid 1 contains a 3x2 arrangement of blocks; Block (1,1) contains a 5x3 arrangement of threads)

  10. CUDA Environment

  11. CUDA Environments • Driver • runs CUDA on specific cards and determines the supported CUDA version • Toolkits • compiler, libraries, debugger, profiler, … • SDK • CUDA examples & some useful tools • All are FREE downloads from the NVIDIA developer zone

  12. CUDA Toolkits • Compiler: CUDA based on C/C++ (available now for all supported OSs) • Libraries: BLAS, FFT, RAND, SPARSE • Profiler: samples signals on the GPU as a reference for optimization (memory access parameters; execution: serialization, divergence) • Debugger: runs on the GPU (GDB, Nexus, … etc) • Documents: CUDA Programming Guide, CUDA Best Practices Guide, API & library reference, …

  13. CUDA SDK • Useful tools • deviceQuery – queries the specs of the devices available on this machine • bandwidthTest – tests the bandwidth of PCIe & device memory • Valuable materials for developers • Matrix Multiplication, Transpose • Tree Reduction • Simple BLAS • Multi-GPU sample • … • Valuable middleware for developers • Video codecs (MPEG, H.264, …) • Radix Sort • …

  14. CUDA Programming Model

  15. Simple 5 steps to program the GPU with CUDA • Step 1: Allocate device memory • Step 2: Upload input data from host to device memory • Step 3: Call CUDA kernel(s) … (many kernels can be called to manipulate the data in device memory) • Step 4: Download output data from device to host memory • Step 5: Free device memory • (Figure: the CPU handles init/alloc, functions and library calls; the GPU executes operations 1, 2, 3)

  16. Heterogeneous Programming CUDA = serial program with parallel kernels, all in C • Serial C code executes in a CPU thread • Parallel kernel C code executes in thread blocks across multiple processing elements • (Figure: execution alternates between serial code on the CPU and parallel kernels on the GPU)

  17. GPU Memory Allocation / Release Host (CPU) manages GPU memory: • cudaMalloc(void **pointer, size_t nbytes) • cudaFree(void *pointer)

    int m = 1024;
    int nbytes = m * sizeof(int);
    int *devPtr = 0;
    cudaMalloc((void**)&devPtr, nbytes);
    cudaFree(devPtr);

  18. Data Copies cudaMemcpy(void *dst, void *src, size_t nbytes, enum cudaMemcpyKind direction); • returns after the copy is complete (blocks the CPU thread) • doesn't start copying until previous CUDA calls complete • enum cudaMemcpyKind: cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice

  19. Launching kernels on the GPU • Launch parameters: grid dimensions (up to 2D), thread-block dimensions (up to 3D) • Launch a 1D kernel: kernel<<<1024, 512>>>(...); • Launch a 2D kernel: dim3 grid(16, 16); dim3 block(16, 16); kernel<<<grid, block>>>(...);

  20. Sample code

    // Step 1: Allocate device memory
    float *devPtr = 0;
    cudaMalloc((void**)&devPtr, sizeInBytes);
    // Step 2: Copy data from host to device
    cudaMemcpy(devPtr, hostPtr, numOfBytes, cudaMemcpyHostToDevice);
    // Step 3: Call CUDA kernel(s)
    kernel_function<<<gridDim, blockDim>>>(devicePtr);
    // Step 4: Copy data from device back to host
    cudaMemcpy(hostPtr, devPtr, numOfBytes, cudaMemcpyDeviceToHost);
    // Step 5: Free device memory
    cudaFree(devPtr);
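
Putting the five steps together, a minimal self-contained sketch (not from the slides; the kernel scale_by_two and the array size are illustrative assumptions):

    #include <cuda_runtime.h>
    #include <stdio.h>

    __global__ void scale_by_two(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] *= 2.0f;                                      // each thread scales one element
    }

    int main(void)
    {
        const int n = 1024;
        const size_t nbytes = n * sizeof(float);
        float host[1024];
        for (int i = 0; i < n; ++i) host[i] = (float)i;

        float *devPtr = 0;
        cudaMalloc((void**)&devPtr, nbytes);                      // Step 1: allocate device memory
        cudaMemcpy(devPtr, host, nbytes, cudaMemcpyHostToDevice); // Step 2: upload input data
        scale_by_two<<<(n + 255) / 256, 256>>>(devPtr, n);        // Step 3: call the kernel
        cudaMemcpy(host, devPtr, nbytes, cudaMemcpyDeviceToHost); // Step 4: download the result
        cudaFree(devPtr);                                         // Step 5: free device memory

        printf("host[10] = %f\n", host[10]);                      // expect 20.0
        return 0;
    }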

  21. Code executed on the GPU A C function with some restrictions: • scalar input variables can be passed directly as parameters • array variables must reside in device memory • the return type is void • Must be declared with a qualifier: • __global__: launched by the CPU, cannot be called from the GPU, must return void • __device__: called from other GPU functions, cannot be launched by the CPU • __host__: executed by the CPU • Built-in variables: gridDim, blockDim, blockIdx, threadIdx
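
A short sketch (illustrative, not from the slides; the function names are hypothetical) showing the three qualifiers side by side:

    // __device__: callable only from GPU code.
    __device__ float square(float v) { return v * v; }

    // __global__: a kernel, launched from the CPU, must return void.
    __global__ void square_all(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] = square(data[i]);
    }

    // __host__ (the default for ordinary functions): runs on the CPU.
    __host__ void launch_square_all(float *d_data, int n)
    {
        square_all<<<(n + 255) / 256, 256>>>(d_data, n);
    }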

  22. Sample kernel code

Standard C code (serial):

    void saxpy_serial(int n, float a, float *x, float *y)
    {
        for (int i = 0; i < n; ++i)          // must loop n times (serial)
            y[i] = a*x[i] + y[i];
    }
    // Invoke serial SAXPY kernel
    saxpy_serial(n, 2.0, x, y);

Parallel CUDA code:

    __global__ void saxpy_parallel(int n, float a, float *x, float *y)
    {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        if (i < n)                           // each thread runs only once (parallel)
            y[i] = a*x[i] + y[i];
    }
    // Invoke parallel SAXPY kernel with 256 threads/block
    int nblocks = (n + 255) / 256;
    saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);

  23. Memory scope • Thread scope • Each thread has its own local storage • Block scope • Each thread block has its own shared memory • Accessible only by threads within that block • Grid scope (global scope) • Accessible by all threads as well as the host (CPU)

  24. Memory system • Thread scope • Registers: on die, fastest, default • Local memory: DRAM, not cached, slow (400~600 clocks) • Block scope • Shared memory: on die, fast (4~6 clocks), qualifier __shared__ • Grid scope • Global memory: DRAM, not cached, slow (400~600 clocks) • Constant memory: on die, small (64 KB total), qualifier __constant__ • Texture memory: read only, DRAM + cache, fast on a cache hit • Cache (Fermi only): R/W cache
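
A sketch (illustrative assumptions: the kernel, buffer names, and the 16-entry coeff table are invented for this example) showing where the qualifiers from this slide appear in code:

    __constant__ float coeff[16];          // grid scope: constant memory, filled by the host
                                           // (e.g. via cudaMemcpyToSymbol(coeff, h_coeff, sizeof(coeff)))

    __global__ void weighted_copy(const float *in, float *out, int n)
    {
        __shared__ float tile[256];        // block scope: fast on-die shared memory
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float w;                           // thread scope: lives in a register

        if (i < n) {
            tile[threadIdx.x] = in[i];       // stage a global-memory value in shared memory
            w = coeff[threadIdx.x % 16];
            out[i] = w * tile[threadIdx.x];  // write the result back to global memory
        }
    }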

  25. Memory: Location & Scope (figure): each thread uses registers, each block uses shared memory, and the whole kernel (grid) uses global memory

  26. Thread Synchronization Function: void __syncthreads(); • Synchronizes all threads in a BLOCK • Once all threads have reached this point, execution resumes normally • Used to avoid RAW / WAR / WAW hazards when accessing shared memory • Should be used in conditional code only if the conditional is uniform across the entire thread block
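
A small sketch (illustrative, not from the slides) where __syncthreads() separates the shared-memory writes from the reads; the kernel assumes a block size of 256:

    __global__ void reverse_each_block(float *data)
    {
        __shared__ float tile[256];
        int t = threadIdx.x;
        int base = blockIdx.x * blockDim.x;

        tile[t] = data[base + t];                    // every thread writes one element
        __syncthreads();                             // wait until the whole tile is filled (avoid a RAW hazard)
        data[base + t] = tile[blockDim.x - 1 - t];   // now it is safe to read another thread's element
    }

    // Launch, e.g.: reverse_each_block<<<n / 256, 256>>>(d_data);   // n assumed to be a multiple of 256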

  27. Host Synchronization • All kernel launches are asynchronous • control returns to the CPU immediately • the kernel starts executing once all previous CUDA calls have completed • cudaThreadSynchronize() syncs CPU & GPU • blocks until all previous CUDA calls complete • Memcopies are synchronous • control returns to the CPU once the copy is complete • the copy starts once all previous CUDA calls have completed • Asynchronous CUDA calls provide: • non-blocking memcopies • the ability to overlap memcopies and kernel execution

  28. Synchronization Level • Block Sync • sync all threads in the block • Method: Call __syncthreads() in kernel • Grid Sync • sync all threads in the grid (kernel) • Method: Launch another kernel • CPU-GPU Sync • sync CPU & GPU • Method: Call cudaThreadSynchronize() in host C code
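
A hedged host-side fragment (the kernels pass1 and pass2 are hypothetical) showing the grid-level and CPU-GPU synchronization methods from this slide:

    // Grid sync: launching a second kernel acts as a grid-wide barrier,
    // since pass2 only starts after every thread of pass1 has finished.
    pass1<<<nblocks, 256>>>(d_data, n);
    pass2<<<nblocks, 256>>>(d_data, n);

    // CPU-GPU sync: block the CPU thread until all preceding CUDA calls complete.
    cudaThreadSynchronize();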

  29. Device Management CPU can query and select GPU devices cudaGetDeviceCount( int* count ) cudaSetDevice( int device ) cudaGetDevice( int *current_device ) cudaGetDeviceProperties( cudaDeviceProp* prop, int device ) cudaChooseDevice( int *device, cudaDeviceProp* prop ) Multi-GPU setup: device 0 is used by default one CPU thread can control one GPU
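
A minimal deviceQuery-style sketch (illustrative) built from these calls:

    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void)
    {
        int count = 0;
        cudaGetDeviceCount(&count);                 // how many CUDA devices are present
        for (int d = 0; d < count; ++d) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, d);      // fill in the properties of device d
            printf("Device %d: %s, %d multiprocessors, %zu bytes of global memory\n",
                   d, prop.name, prop.multiProcessorCount, prop.totalGlobalMem);
        }
        cudaSetDevice(0);                           // device 0 is the default; select it explicitly here
        return 0;
    }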

  30. CUDA Error Reporting to the CPU All CUDA calls return an error code: • except for kernel launches • cudaError_t type • cudaError_t cudaGetLastError(void) • returns the code for the last error (“no error” also has a code) • char* cudaGetErrorString(cudaError_t code) • returns a null-terminated character string describing the error • printf("%s\n", cudaGetErrorString(cudaGetLastError()));
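
A common pattern built on these two calls, as a hedged sketch (the CUDA_CHECK macro name is an assumption, not part of the API):

    #include <cuda_runtime.h>
    #include <stdio.h>

    // Check the result of a CUDA runtime call and print a readable message on failure.
    #define CUDA_CHECK(call)                                              \
        do {                                                              \
            cudaError_t err = (call);                                     \
            if (err != cudaSuccess)                                       \
                printf("CUDA error at %s:%d: %s\n",                       \
                       __FILE__, __LINE__, cudaGetErrorString(err));      \
        } while (0)

    // Usage (kernel launches return no error code, so query it afterwards):
    //   CUDA_CHECK(cudaMalloc((void**)&devPtr, nbytes));
    //   kernel<<<grid, block>>>(devPtr);
    //   CUDA_CHECK(cudaGetLastError());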

  31. Compiling CUDA • NVCC splits a C/C++ CUDA application into CPU code and PTX code (virtual) • A PTX-to-target compiler (the driver's JIT compiler) then turns the PTX into physical target code for the hardware (G80, G200, …)

  32. Onhand practice • SDK • Install & make • deviceQuery, bandwidthTest • Hello CUDA • Vector Add • Tree Reduction
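
As a starting point for the Tree Reduction exercise, a minimal shared-memory reduction sketch (illustrative; the SDK sample contains several progressively optimized variants):

    // Sums the input in 256-element chunks; each block writes one partial sum.
    __global__ void reduce_sum(const float *in, float *partial, int n)
    {
        __shared__ float tile[256];
        int t = threadIdx.x;
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        tile[t] = (i < n) ? in[i] : 0.0f;            // load, padding with zeros past the end
        __syncthreads();

        // Tree reduction: halve the number of active threads at each step.
        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
            if (t < stride)
                tile[t] += tile[t + stride];
            __syncthreads();
        }

        if (t == 0)
            partial[blockIdx.x] = tile[0];           // one result per block; combine on the host or with another kernel
    }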

  33. Thinking in parallel

  34. Example: Vector Addition • k = blockIdx.x * blockDim.x + threadIdx.x • a[k] + b[k] → c[k]

  35. Example: Vector Addition
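
The code for this slide is not in the transcript; a hedged reconstruction of the vector-addition kernel and its launch, following the indexing on the previous slide:

    __global__ void vecAdd(const float *a, const float *b, float *c, int n)
    {
        int k = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per element
        if (k < n)
            c[k] = a[k] + b[k];
    }

    // Host-side launch: 256 threads per block, enough blocks to cover n elements.
    // vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);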

  36. Backup slides

  37. PTX Example (SAXPY code)

    cvt.u32.u16 $blockid, %ctaid.x;      // Calculate i from thread/block IDs
    cvt.u32.u16 $blocksize, %ntid.x;
    cvt.u32.u16 $tid, %tid.x;
    mad24.lo.u32 $i, $blockid, $blocksize, $tid;
    ld.param.u32 $n, [N];                // Nothing to do if n ≤ i
    setp.le.u32 $p1, $n, $i;
    @$p1 bra $L_finish;
    mul.lo.u32 $offset, $i, 4;           // Load y[i]
    ld.param.u32 $yaddr, [Y];
    add.u32 $yaddr, $yaddr, $offset;
    ld.global.f32 $y_i, [$yaddr+0];
    ld.param.u32 $xaddr, [X];            // Load x[i]
    add.u32 $xaddr, $xaddr, $offset;
    ld.global.f32 $x_i, [$xaddr+0];
    ld.param.f32 $alpha, [ALPHA];        // Compute and store alpha*x[i] + y[i]
    mad.f32 $y_i, $alpha, $x_i, $y_i;
    st.global.f32 [$yaddr+0], $y_i;
    $L_finish: exit;

  38. Scalability • Thread blocks can run in any order • concurrently or sequentially • This facilitates scaling the same code across many devices

  39. Hardware multi-threads (figure): a kernel launched by the host runs on the device processor array; each multiprocessor pairs an MT issue unit with its SPs and shared memory, all backed by device memory

  40. Example: Matrix Transpose • Coalesced read/write of device memory gives the best memory bandwidth utilization • Use shared memory to share data between threads in a block • Use 2D blockDim & gridDim • Over 20x faster than the host for large matrices (e.g. 3000x4000) • Host invocation: dim3 grid(n/16+1, m/16+1, 1); dim3 block(16, 16, 1); Transpose<<<grid, block>>>(b, a, m, n);

  41. Example: Matrix Transpose

    __global__ void Transpose(float *b, float *a, int m, int n)
    {
        __shared__ float s[256];                            // shared memory (visible to all threads in the block)
        int x = blockIdx.x*blockDim.x + threadIdx.x;        // compute index (x,y)
        int y = blockIdx.y*blockDim.y + threadIdx.y;
        if (y < m && x < n) {
            int i = y*n + x;                                // input address (lowest index follows threadIdx.x)
            int t = threadIdx.y*blockDim.x + threadIdx.x;   // shared-memory address
            s[t] = a[i];                                    // coalesced input (read a 16x16 matrix block)
        }
        __syncthreads();                                    // synchronize threads in the block
        x = blockIdx.x*blockDim.x + threadIdx.y;            // exchange the threadIdx (x,y) usage
        y = blockIdx.y*blockDim.y + threadIdx.x;            // compute index (x,y)
        if (y < m && x < n) {
            int o = x*m + y;                                // output address (lowest index follows threadIdx.x)
            int t = threadIdx.x*blockDim.y + threadIdx.y;   // shared-memory address
            b[o] = s[t];                                    // coalesced output (write a 16x16 matrix block)
        }
    }

  42. Hardware (SPA, Streaming Processors Array): an array of TPCs (figure)

  43. Hardware (TPC, Texture/Processor Cluster)

  44. Hardware (SM, Streaming Multiprocessor) • Warp = 32 threads • SP: MUL & MAD • SFU: sin(x), cos(x), … • 1 block is divided into multiple warps • An SM can execute 1 warp in 4 clocks

  45. SPA, Streaming Processors Array (figure): each multiprocessor contains a TP array, Special Function Units (SFU), a double-precision unit, and shared memory

  46. How to use so many cores? • 240 SP thread processors, 30 DP thread processors • Full scalar processors • IEEE 754 double-precision floating point • (Figure: Thread Processor Array (TPA) with Thread Processors (TP), Special Function Unit (SFU), double precision unit, multi-banked register file, SpcOps ALUs, FP/Int units, and shared memory)
