
100M CUDA GPUs



Presentation Transcript


  1. 100M CUDA GPUs. GPU, CPU, CUDA, Heterogeneous Computing. Joy LEE @ NVIDIA. Application areas: Oil & Gas, Finance, Medical, Biophysics, Numerics, Audio, Video, Imaging

  2. GPU & CUDA Features

  3. Comparing GPU & CPU systems • Memory bandwidth • Host memory (CPU): ~10 GB/s • Device memory (GPU): ~100 GB/s • Bottleneck: PCIe at 3~5 GB/s • More computing cores • CPU: 4~32 cores (peak perf: 10~100 GFLOPS) • GPU: 32~512 cores (peak perf: >1 TFLOPS) • Large-scale parallelism • Traditional (MPI, OpenMP, pthreads, …): about 10~1K threads • CUDA: about 1K~10M threads (so the algorithmic concerns may differ)

  4. Graphics Card (diagram): GPU plus GDDR3/5 DRAM (device memory), connected to the host over PCIe

  5. Memory Bandwidth (diagram): the GPU (240 cores, thread manager) sits behind a 512-bit GDDR3 interface delivering ~102 GB/s, while the CPU (Core 1, Core 2, L2 cache) reaches main memory over a 64-bit FSB at ~12.8 GB/s, making the GPU memory interface roughly 8x faster

  6. Highway: device memory, ~100 GB/s. PCIe gen 2 x16: 3~5 GB/s. Street: host memory, ~10 GB/s

  7. NVIDIA’s GPUs: Ever-Increasing Performance (chart). T10 = Tesla 10-series, G9x = GeForce 9800 GTX, G80 = GeForce 8800 GTX, G71 = GeForce 7900 GTX, G70 = GeForce 7800 GTX, NV40 = GeForce 6800 Ultra, NV35 = GeForce FX 5950 Ultra, NV30 = GeForce FX 5800

  8. Hierarchy of concurrent threads • Grid: a CUDA kernel is a function executing on the GPU; one grid contains many blocks (practically 64~1024 blocks for good performance) • Block: one block contains many threads (practically 32~512 threads for good performance); blockIdx is the block index within the grid (0, 1, 2, 3, …); threads in the same block can cooperate: they can synchronize and share data through fast on-die memory called “Shared Memory” • Thread: threadIdx is the thread index within the block (0, 1, 2, 3, …)
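
A minimal kernel sketch (not from the slides; the kernel name and launch are illustrative assumptions) showing how blockIdx, blockDim, and threadIdx combine into a global thread index:

    __global__ void fill_with_global_index(int *out, int n)
    {
        // Global index = block offset plus thread offset within the block.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = i;                  // each thread writes its own global index
    }

    // Launch with enough 256-thread blocks to cover n elements:
    // fill_with_global_index<<<(n + 255) / 256, 256>>>(d_out, n);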

  9. IDs and Dimensions • Built-in variables: threadIdx, blockIdx, blockDim, gridDim • Dimensions are set at launch time and can be unique for each kernel launch • Blocks: 2D IDs, unique within a grid • Threads: 3D IDs, unique within a block • (Figure: Grid 1 contains a 3x2 arrangement of blocks; Block (1,1) contains a 5x3 arrangement of threads)

  10. CUDA Environment

  11. CUDA Environments • Driver • runs CUDA on specific cards and determines the supported CUDA version • Toolkits • compiler, libraries, debugger, profiler, … • SDK • CUDA examples & some useful tools • All are FREE downloads from the NVIDIA developer zone

  12. CUDA Toolkits • Compiler: CUDA based on C/C++ (available now for all supported OSs) • Libraries: BLAS, FFT, RAND, SPARSE • Profiler: samples signals on the GPU as a reference for optimization (memory access parameters; execution: serialization, divergence) • Debugger: runs on the GPU (GDB, Nexus, … etc) • Documents: CUDA Programming Guide, CUDA Best Practices Guide, API & library reference, …

  13. CUDA SDK • Useful tools • deviceQuery – queries the specs of the devices available on this machine • bandwidthTest – tests the bandwidth of PCIe & device memory • Valuable materials for developers • Matrix Multiplication, Transpose • Tree Reduction • Simple BLAS • Multi-GPU sample • … • Valuable middleware for developers • Video codecs (MPEG, H.264, …) • Radix Sort • …

  14. CUDA Programming Model

  15. Simple 5 steps to program the GPU with CUDA • Step 1: Allocate device memory • Step 2: Upload input data from host to device memory • Step 3: Call CUDA kernel(s) … (many kernels can be called to manipulate the data in device memory) • Step 4: Download output data from device to host memory • Step 5: Free device memory • (Figure: the CPU handles init/alloc, functions and library calls; the GPU executes operations 1, 2, 3)

  16. Heterogeneous Programming CUDA = serial program with parallel kernels, all in C • Serial C code executes in a CPU thread • Parallel kernel C code executes in thread blocks across multiple processing elements • (Figure: execution alternates between serial code on the CPU and parallel kernels on the GPU)

  17. GPU Memory Allocation / Release Host (CPU) manages GPU memory: • cudaMalloc(void **pointer, size_t nbytes) • cudaFree(void *pointer)

    int m = 1024;
    int nbytes = m * sizeof(int);
    int *devPtr = 0;
    cudaMalloc((void**)&devPtr, nbytes);
    cudaFree(devPtr);

  18. Data Copies cudaMemcpy(void *dst, void *src, size_t nbytes, enum cudaMemcpyKind direction); • returns after the copy is complete (blocks the CPU thread) • doesn't start copying until previous CUDA calls complete • enum cudaMemcpyKind: cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice

  19. Launching kernels on the GPU • Launch parameters: grid dimensions (up to 2D), thread-block dimensions (up to 3D) • Launch a 1D kernel: kernel<<<1024, 512>>>(...); • Launch a 2D kernel: dim3 grid(16, 16); dim3 block(16, 16); kernel<<<grid, block>>>(...);

  20. Sample code

    // Step 1: Allocate device memory
    float *devPtr = 0;
    cudaMalloc((void**)&devPtr, sizeInBytes);
    // Step 2: Copy data from host to device
    cudaMemcpy(devPtr, hostPtr, numOfBytes, cudaMemcpyHostToDevice);
    // Step 3: Call CUDA kernel(s)
    kernel_function<<<gridDim, blockDim>>>(devicePtr);
    // Step 4: Copy data from device back to host
    cudaMemcpy(hostPtr, devPtr, numOfBytes, cudaMemcpyDeviceToHost);
    // Step 5: Free device memory
    cudaFree(devPtr);
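
Putting the five steps together, a minimal self-contained sketch (not from the slides; the kernel scale_by_two and the array size are illustrative assumptions):

    #include <cuda_runtime.h>
    #include <stdio.h>

    __global__ void scale_by_two(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] *= 2.0f;                                      // each thread scales one element
    }

    int main(void)
    {
        const int n = 1024;
        const size_t nbytes = n * sizeof(float);
        float host[1024];
        for (int i = 0; i < n; ++i) host[i] = (float)i;

        float *devPtr = 0;
        cudaMalloc((void**)&devPtr, nbytes);                      // Step 1: allocate device memory
        cudaMemcpy(devPtr, host, nbytes, cudaMemcpyHostToDevice); // Step 2: upload input data
        scale_by_two<<<(n + 255) / 256, 256>>>(devPtr, n);        // Step 3: call the kernel
        cudaMemcpy(host, devPtr, nbytes, cudaMemcpyDeviceToHost); // Step 4: download the result
        cudaFree(devPtr);                                         // Step 5: free device memory

        printf("host[10] = %f\n", host[10]);                      // expect 20.0
        return 0;
    }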

  21. Code executed on the GPU A C function with some restrictions: • scalar input variables can be passed directly as parameters • array variables must reside in device memory • the return type is void • Must be declared with a qualifier: • __global__: launched by the CPU, cannot be called from the GPU, must return void • __device__: called from other GPU functions, cannot be launched by the CPU • __host__: executed by the CPU • Built-in variables: gridDim, blockDim, blockIdx, threadIdx
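
A short sketch (illustrative, not from the slides; the function names are hypothetical) showing the three qualifiers side by side:

    // __device__: callable only from GPU code.
    __device__ float square(float v) { return v * v; }

    // __global__: a kernel, launched from the CPU, must return void.
    __global__ void square_all(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] = square(data[i]);
    }

    // __host__ (the default for ordinary functions): runs on the CPU.
    __host__ void launch_square_all(float *d_data, int n)
    {
        square_all<<<(n + 255) / 256, 256>>>(d_data, n);
    }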

  22. Sample kernel code

Standard C code (serial):

    void saxpy_serial(int n, float a, float *x, float *y)
    {
        for (int i = 0; i < n; ++i)          // must loop n times (serial)
            y[i] = a*x[i] + y[i];
    }
    // Invoke serial SAXPY kernel
    saxpy_serial(n, 2.0, x, y);

Parallel CUDA code:

    __global__ void saxpy_parallel(int n, float a, float *x, float *y)
    {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        if (i < n)                           // each thread runs only once (parallel)
            y[i] = a*x[i] + y[i];
    }
    // Invoke parallel SAXPY kernel with 256 threads/block
    int nblocks = (n + 255) / 256;
    saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);

  23. Memory scope • Thread scope • Each thread has its own local storage • Block scope • Each thread block has its own shared memory • Accessible only by threads within that block • Grid scope (global scope) • Accessible by all threads as well as the host (CPU)

  24. Memory system • Thread scope • Registers: on die, fastest, default • Local memory: DRAM, not cached, slow (400~600 clocks) • Block scope • Shared memory: on die, fast (4~6 clocks), qualifier __shared__ • Grid scope • Global memory: DRAM, not cached, slow (400~600 clocks) • Constant memory: on die, small (64 KB total), qualifier __constant__ • Texture memory: read only, DRAM + cache, fast on a cache hit • Cache (Fermi only): R/W cache
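
A sketch (illustrative assumptions: the kernel, buffer names, and the 16-entry coeff table are invented for this example) showing where the qualifiers from this slide appear in code:

    __constant__ float coeff[16];          // grid scope: constant memory, filled by the host
                                           // (e.g. via cudaMemcpyToSymbol(coeff, h_coeff, sizeof(coeff)))

    __global__ void weighted_copy(const float *in, float *out, int n)
    {
        __shared__ float tile[256];        // block scope: fast on-die shared memory
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float w;                           // thread scope: lives in a register

        if (i < n) {
            tile[threadIdx.x] = in[i];       // stage a global-memory value in shared memory
            w = coeff[threadIdx.x % 16];
            out[i] = w * tile[threadIdx.x];  // write the result back to global memory
        }
    }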

  25. Memory: Location & Scope (figure): each thread uses registers, each block uses shared memory, and the whole kernel (grid) uses global memory

  26. Thread Synchronization Function: void __syncthreads(); • Synchronizes all threads in a BLOCK • Once all threads have reached this point, execution resumes normally • Used to avoid RAW / WAR / WAW hazards when accessing shared memory • Should be used in conditional code only if the conditional is uniform across the entire thread block
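
A small sketch (illustrative, not from the slides) where __syncthreads() separates the shared-memory writes from the reads; the kernel assumes a block size of 256:

    __global__ void reverse_each_block(float *data)
    {
        __shared__ float tile[256];
        int t = threadIdx.x;
        int base = blockIdx.x * blockDim.x;

        tile[t] = data[base + t];                    // every thread writes one element
        __syncthreads();                             // wait until the whole tile is filled (avoid a RAW hazard)
        data[base + t] = tile[blockDim.x - 1 - t];   // now it is safe to read another thread's element
    }

    // Launch, e.g.: reverse_each_block<<<n / 256, 256>>>(d_data);   // n assumed to be a multiple of 256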

  27. Host Synchronization • All kernel launches are asynchronous • control returns to the CPU immediately • the kernel starts executing once all previous CUDA calls have completed • cudaThreadSynchronize() syncs CPU & GPU • blocks until all previous CUDA calls complete • Memcopies are synchronous • control returns to the CPU once the copy is complete • the copy starts once all previous CUDA calls have completed • Asynchronous CUDA calls provide: • non-blocking memcopies • the ability to overlap memcopies and kernel execution

  28. Synchronization Level • Block Sync • sync all threads in the block • Method: Call __syncthreads() in kernel • Grid Sync • sync all threads in the grid (kernel) • Method: Launch another kernel • CPU-GPU Sync • sync CPU & GPU • Method: Call cudaThreadSynchronize() in host C code
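
A hedged host-side fragment (the kernels pass1 and pass2 are hypothetical) showing the grid-level and CPU-GPU synchronization methods from this slide:

    // Grid sync: launching a second kernel acts as a grid-wide barrier,
    // since pass2 only starts after every thread of pass1 has finished.
    pass1<<<nblocks, 256>>>(d_data, n);
    pass2<<<nblocks, 256>>>(d_data, n);

    // CPU-GPU sync: block the CPU thread until all preceding CUDA calls complete.
    cudaThreadSynchronize();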

  29. Device Management CPU can query and select GPU devices cudaGetDeviceCount( int* count ) cudaSetDevice( int device ) cudaGetDevice( int *current_device ) cudaGetDeviceProperties( cudaDeviceProp* prop, int device ) cudaChooseDevice( int *device, cudaDeviceProp* prop ) Multi-GPU setup: device 0 is used by default one CPU thread can control one GPU
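
A minimal deviceQuery-style sketch (illustrative) built from these calls:

    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void)
    {
        int count = 0;
        cudaGetDeviceCount(&count);                 // how many CUDA devices are present
        for (int d = 0; d < count; ++d) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, d);      // fill in the properties of device d
            printf("Device %d: %s, %d multiprocessors, %zu bytes of global memory\n",
                   d, prop.name, prop.multiProcessorCount, prop.totalGlobalMem);
        }
        cudaSetDevice(0);                           // device 0 is the default; select it explicitly here
        return 0;
    }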

  30. CUDA Error Reporting to the CPU All CUDA calls return an error code: • except for kernel launches • cudaError_t type • cudaError_t cudaGetLastError(void) • returns the code for the last error (“no error” also has a code) • char* cudaGetErrorString(cudaError_t code) • returns a null-terminated character string describing the error • printf("%s\n", cudaGetErrorString(cudaGetLastError()));
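
A common pattern built on these two calls, as a hedged sketch (the CUDA_CHECK macro name is an assumption, not part of the API):

    #include <cuda_runtime.h>
    #include <stdio.h>

    // Check the result of a CUDA runtime call and print a readable message on failure.
    #define CUDA_CHECK(call)                                              \
        do {                                                              \
            cudaError_t err = (call);                                     \
            if (err != cudaSuccess)                                       \
                printf("CUDA error at %s:%d: %s\n",                       \
                       __FILE__, __LINE__, cudaGetErrorString(err));      \
        } while (0)

    // Usage (kernel launches return no error code, so query it afterwards):
    //   CUDA_CHECK(cudaMalloc((void**)&devPtr, nbytes));
    //   kernel<<<grid, block>>>(devPtr);
    //   CUDA_CHECK(cudaGetLastError());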

  31. Compiling CUDA • NVCC splits a C/C++ CUDA application into CPU code and PTX code (virtual) • A PTX-to-target compiler (the driver's JIT compiler) then turns the PTX into physical target code for the hardware (G80, G200, …)

  32. Onhand practice • SDK • Install & make • deviceQuery, bandwidthTest • Hello CUDA • Vector Add • Tree Reduction
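
As a starting point for the Tree Reduction exercise, a minimal shared-memory reduction sketch (illustrative; the SDK sample contains several progressively optimized variants):

    // Sums the input in 256-element chunks; each block writes one partial sum.
    __global__ void reduce_sum(const float *in, float *partial, int n)
    {
        __shared__ float tile[256];
        int t = threadIdx.x;
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        tile[t] = (i < n) ? in[i] : 0.0f;            // load, padding with zeros past the end
        __syncthreads();

        // Tree reduction: halve the number of active threads at each step.
        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
            if (t < stride)
                tile[t] += tile[t + stride];
            __syncthreads();
        }

        if (t == 0)
            partial[blockIdx.x] = tile[0];           // one result per block; combine on the host or with another kernel
    }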

  33. Thinking in parallel

  34. Example: Vector Addition • k = blockIdx.x * blockDim.x + threadIdx.x • a[k] + b[k] → c[k]

  35. Example: Vector Addition
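
The code for this slide is not in the transcript; a hedged reconstruction of the vector-addition kernel and its launch, following the indexing on the previous slide:

    __global__ void vecAdd(const float *a, const float *b, float *c, int n)
    {
        int k = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per element
        if (k < n)
            c[k] = a[k] + b[k];
    }

    // Host-side launch: 256 threads per block, enough blocks to cover n elements.
    // vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);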

  36. Backup slides

  37. PTX Example (SAXPY code)

    cvt.u32.u16 $blockid, %ctaid.x;      // Calculate i from thread/block IDs
    cvt.u32.u16 $blocksize, %ntid.x;
    cvt.u32.u16 $tid, %tid.x;
    mad24.lo.u32 $i, $blockid, $blocksize, $tid;
    ld.param.u32 $n, [N];                // Nothing to do if n ≤ i
    setp.le.u32 $p1, $n, $i;
    @$p1 bra $L_finish;
    mul.lo.u32 $offset, $i, 4;           // Load y[i]
    ld.param.u32 $yaddr, [Y];
    add.u32 $yaddr, $yaddr, $offset;
    ld.global.f32 $y_i, [$yaddr+0];
    ld.param.u32 $xaddr, [X];            // Load x[i]
    add.u32 $xaddr, $xaddr, $offset;
    ld.global.f32 $x_i, [$xaddr+0];
    ld.param.f32 $alpha, [ALPHA];        // Compute and store alpha*x[i] + y[i]
    mad.f32 $y_i, $alpha, $x_i, $y_i;
    st.global.f32 [$yaddr+0], $y_i;
    $L_finish: exit;

  38. Scalability • Thread blocks can run in any order • concurrently or sequentially • This facilitates scaling the same code across many devices

  39. Hardware multi-threads (figure): a kernel launched by the host runs on the device processor array; each multiprocessor pairs an MT issue unit with its SPs and shared memory, all backed by device memory

  40. Example: Matrix Transpose • Coalesced read/write of device memory gives the best memory bandwidth utilization • Use shared memory to share data between threads in a block • Use 2D blockDim & gridDim • Over 20x faster than the host for large matrices (e.g. 3000x4000) • Host invocation: dim3 grid(n/16+1, m/16+1, 1); dim3 block(16, 16, 1); Transpose<<<grid, block>>>(b, a, m, n);

  41. Example: Matrix Transpose

    __global__ void Transpose(float *b, float *a, int m, int n)
    {
        __shared__ float s[256];                            // shared memory (visible to all threads in the block)
        int x = blockIdx.x*blockDim.x + threadIdx.x;        // compute index (x,y)
        int y = blockIdx.y*blockDim.y + threadIdx.y;
        if (y < m && x < n) {
            int i = y*n + x;                                // input address (lowest index follows threadIdx.x)
            int t = threadIdx.y*blockDim.x + threadIdx.x;   // shared-memory address
            s[t] = a[i];                                    // coalesced input (read a 16x16 matrix block)
        }
        __syncthreads();                                    // synchronize threads in the block
        x = blockIdx.x*blockDim.x + threadIdx.y;            // exchange the threadIdx (x,y) usage
        y = blockIdx.y*blockDim.y + threadIdx.x;            // compute index (x,y)
        if (y < m && x < n) {
            int o = x*m + y;                                // output address (lowest index follows threadIdx.x)
            int t = threadIdx.x*blockDim.y + threadIdx.y;   // shared-memory address
            b[o] = s[t];                                    // coalesced output (write a 16x16 matrix block)
        }
    }

  42. Hardware (SPA, Streaming Processors Array): an array of TPCs (figure)

  43. Hardware (TPC, Texture/Processor Cluster)

  44. Hardware (SM, Streaming Multiprocessor) • Warp = 32 threads • SP: MUL & MAD • SFU: sin(x), cos(x), … • 1 block is divided into multiple warps • An SM can execute 1 warp in 4 clocks

  45. SPA, Streaming Processors Array (figure): each multiprocessor contains a TP array, Special Function Units (SFU), a double-precision unit, and shared memory

  46. How to use so many cores? • 240 SP thread processors, 30 DP thread processors • Full scalar processors • IEEE 754 double-precision floating point • (Figure: Thread Processor Array (TPA) with Thread Processors (TP), Special Function Unit (SFU), double precision unit, multi-banked register file, SpcOps ALUs, FP/Int units, and shared memory)
