
GPU Computing


Presentation Transcript


  1. GPU Computing Dr. Bo Yuan E-mail: yuanb@sz.tsinghua.edu.cn

  2. Overview

  3. What is GPU? • Graphics Processing Unit • First GPU: GeForce 256 (1999) • Connected to the motherboard via PCI Express • High computational density and memory bandwidth • Massively multithreaded many-core chips • Traditionally used for real-time rendering • Several million units are sold each year.

  4. Graphics Cards

  5. GPU Pipeline

  6. GPU Pipeline Rasterization

  7. Anti-Aliasing (figure: the same triangle geometry rendered aliased vs. anti-aliased)

  8. GPGPU • General-Purpose Computing on GPUs • Massively Parallel, Simple Operations • Suitable for compute-intensive engineering problems • The original problem needs to be cast into native graphics operations. • Launched through OpenGL or DirectX API calls • Input data are stored in texture images and issued to the GPU by submitting triangles. • Highly restricted access to input/output • Very tedious; limited success despite painstaking effort

  9. Trend of Computing

  10. CPU vs. GPU (diagram: a CPU devotes most of its die to control logic, cache and DRAM with a few large ALUs, while a GPU devotes most of its die to many small ALUs; Multi-Core vs. Many-Core, Number of ALUs, Memory Bandwidth)

  11. Power of the Crowd • SM (Streaming Multiprocessor) • Multi-threaded processor core • Processing unit for a thread block • Contains SPs (Streaming Processors) and SFUs (Special Function Units) • SP (Streaming Processor) • Scalar ALU for a single CUDA thread • SIMT (Single-Instruction, Multiple-Thread) • Shared instruction fetch per 32 threads (warp) (diagram: a Streaming Multiprocessor with an instruction L1 cache, instruction fetch/dispatch unit, shared memory, eight SPs and two SFUs)
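  The slide has no code; as a hedged illustration of the SIMT grouping (not from the slides), the kernel below reports which warp and lane each thread falls into, assuming a device of compute capability 2.0 or later so that device-side printf is available.

  #include <stdio.h>

  __global__ void whoAmI(void)
  {
      // Threads are issued in groups of 32 (a warp); warpSize is a built-in device variable.
      int lane = threadIdx.x % warpSize;   // position within the warp (0..31)
      int warp = threadIdx.x / warpSize;   // warp index within the block
      printf("block %d, warp %d, lane %d\n", blockIdx.x, warp, lane);
  }

  // Launch example: whoAmI<<<2, 64>>>(); gives 2 blocks, each with 2 warps of 32 lanes.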

  12. Need For Speed

  13. Green Computing (chart: GFLOPS per Watt for the GTX 750 Ti, GTX 680, GTX 580 and Intel Core i7-980XE)

  14. Supercomputing • TITAN, Oak Ridge National Laboratory • Speed: 24.8 PFLOPS (Theory), 17.6 PFLOPS (Real) • CPU: AMD Opteron 6274 (18,688 × 16 cores) • GPU: NVIDIA Tesla K20 (18,688 × 2496 cores) • Cost: US$ 97 Million • Power: 9 MW

  15. Personal Supercomputer

  16. What is CUDA? • Compute Unified Device Architecture • Introduced by NVIDIA in 2007 • Scalable Parallel Programming Model • Small extensions to standard C/C++ • Enable general-purpose GPU computing • Straightforward APIs to manage devices, memory etc. • Only supports NVIDIA GPUs. http://developer.nvidia.com/category/zone/cuda-zone

  17. CUDA-Enabled GPU (block diagram: the Host feeds an Input Assembler and Thread Execution Manager; arrays of streaming processors with parallel data caches and texture units are connected through load/store units to Global Memory)

  18. CUDA GPUs

  19. Fermi Architecture

  20. Kepler Architecture • GeForce GTX 680 (Mar. 22, 2012) • GK104, 28 nm process • 3.5 billion transistors on a 294 mm² die • CUDA Cores: 1536 (8 SMs × 192 SPs) • Memory Bandwidth: 192 GB/s • Peak Performance: 3090 GFLOPS • TDP: 195 W • Release Price: $499

  21. Maxwell Architecture • GeForce GTX 750 Ti (Feb. 18, 2014) • GM107, 28 nm process • 1.87 billion transistors on a 148 mm² die • CUDA Cores: 640 (5 SMs × 128 Cores) • Memory Bandwidth: 86.4 GB/s • Peak Performance: 1306 GFLOPS • TDP: 60 W • Release Price: $149
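  The peak figures on slides 20 and 21 follow from cores × 2 floating-point operations per clock (one fused multiply-add) × clock rate. The clock rates are not on the slides; assuming the published base clocks of roughly 1006 MHz for the GTX 680 and 1020 MHz for the GTX 750 Ti:

  GTX 680:    1536 × 2 × 1.006 GHz ≈ 3090 GFLOPS
  GTX 750 Ti:  640 × 2 × 1.020 GHz ≈ 1306 GFLOPS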

  22. CUDA Teaching Lab • GTX 750 (GM107): Compute Capability 5.0, 512 CUDA Cores, 1 GB 128-bit GDDR5, 80 GB/s, 1044 GFLOPS, TDP 55 W, RMB 799 • GT 630 (GK208): Compute Capability 3.5, 384 CUDA Cores, 2 GB 64-bit GDDR3, 14.4 GB/s, 692.7 GFLOPS, TDP 25 W, RMB 419

  23. CUDA Installation https://developer.nvidia.com/cuda-downloads

  24. CUDA: deviceQuery
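  The deviceQuery output itself is not reproduced in the transcript; a minimal sketch of what the sample does, using the standard runtime calls cudaGetDeviceCount and cudaGetDeviceProperties (the choice of fields printed is ours), could look like this:

  #include <stdio.h>

  int main(void)
  {
      int count = 0;
      cudaGetDeviceCount(&count);               // number of CUDA-capable GPUs
      for (int i = 0; i < count; i++) {
          cudaDeviceProp prop;
          cudaGetDeviceProperties(&prop, i);    // fill in the property struct for device i
          printf("Device %d: %s\n", i, prop.name);
          printf("  Compute capability: %d.%d\n", prop.major, prop.minor);
          printf("  Multiprocessors:    %d\n", prop.multiProcessorCount);
          printf("  Global memory:      %zu MB\n", prop.totalGlobalMem >> 20);
      }
      return 0;
  }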

  25. CUDA: bandwidthTest
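  Similarly, the core of bandwidthTest can be approximated by timing a cudaMemcpy with CUDA events; this sketch measures only a single pageable host-to-device copy, whereas the real sample also covers pinned memory and device-to-device transfers:

  #include <stdio.h>
  #include <stdlib.h>

  int main(void)
  {
      const size_t size = 64 << 20;                    // 64 MB transfer
      char *h = (char *)malloc(size);
      char *d;
      cudaMalloc((void **)&d, size);

      cudaEvent_t start, stop;
      cudaEventCreate(&start);
      cudaEventCreate(&stop);

      cudaEventRecord(start, 0);
      cudaMemcpy(d, h, size, cudaMemcpyHostToDevice);  // host-to-device copy
      cudaEventRecord(stop, 0);
      cudaEventSynchronize(stop);

      float ms = 0.0f;
      cudaEventElapsedTime(&ms, start, stop);          // elapsed time in milliseconds
      printf("Host to device: %.2f GB/s\n", (size / 1e9) / (ms / 1e3));

      cudaEventDestroy(start);
      cudaEventDestroy(stop);
      cudaFree(d);
      free(h);
      return 0;
  }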

  26. CUDA Applications

  27. CUDA Showcase

  28. Heterogeneous Computing Host Device

  29. Heterogeneous Computing

  30. Grids, Blocks and Threads (diagram: the Host launches Kernel 1 on Grid 1, a 3 × 2 array of blocks, and Kernel 2 on Grid 2; Block (1, 1) is expanded into a 5 × 3 array of threads)

  31. Thread Block • Threads have thread ID numbers within their block. • Threads use their thread ID to select work. • Threads are assigned to SMs at block granularity. • Each GT200 SM can hold a maximum of 8 blocks. • Each GT200 SM can hold a maximum of 1024 threads. • Threads in the same block can share data and synchronize. • Threads in different blocks cannot cooperate. • Each block can execute in any order relative to other blocks. (figure: a block of threads with IDs 0, 1, 2, 3, …, m all running the same thread program)

  32. Code Example

  33. Transparent Scalability (diagram: the same kernel grid of eight blocks runs on a small device as four waves of two blocks and on a larger device as two waves of four blocks; blocks may be scheduled in any order)

  34. Memory Space (diagram: each block has its own shared memory and per-thread registers; all blocks and the host access global and constant memory) • Each thread can: • Read/write per-thread registers • Read/write per-block shared memory • Read/write per-grid global memory • Read-only per-grid constant memory • GeForce GTX 680: Memory Bandwidth 192 GB/s, Single-Precision Float 4 B, Peak Performance 3090 GFLOPS, Practical Performance 48 GFLOPS. If every operand has to come from global memory, 192 GB/s ÷ 4 B per float delivers only about 48 billion operands per second, so a purely memory-bound kernel reaches roughly 48 GFLOPS, far below the 3090 GFLOPS peak.
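  The slide does not show how these memory spaces appear in source code; a minimal sketch of the corresponding declarations (all names are ours, and the kernel assumes blocks of at most 128 threads):

  __constant__ float coeff[16];       // per-grid constant memory, read-only inside kernels
  __device__   float table[256];      // per-grid global memory, read/write

  __global__ void memorySpaces(float *out)
  {
      __shared__ float buffer[128];   // per-block shared memory
      int i = threadIdx.x;            // i lives in a per-thread register

      buffer[i] = table[i % 256] * coeff[i % 16];
      __syncthreads();                // wait until the whole block has filled buffer

      out[blockIdx.x * blockDim.x + i] = buffer[i];   // write back to global memory
  }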

  35. Hello World!

  #include <stdio.h>

  int main(void) {
      printf("Hello World!\n");
      return 0;
  }

  The same program with an empty kernel launched on the device:

  __global__ void mykernel(void) { }

  int main(void) {
      mykernel<<<1,1>>>();
      printf("Hello World!\n");
      return 0;
  }

  Your first CUDA code!

  36. Device Code • The CUDA keyword __global__ indicates a kernel function that: • Runs on the device. • Is called from the host. • The CUDA keyword __device__ indicates a device function that: • Runs on the device. • Is called from a kernel function or another device function. • Triple angle brackets <<< >>> indicate a call from host code to device code. • Kernel launch • nvcc separates source code into two components: • Device functions are processed by the NVIDIA compiler. • Host functions are processed by the standard host compiler. • $ nvcc hello.cu
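  Slide 36 mentions __device__ functions without showing one; a hedged sketch (the function names are ours) of a device helper called from a kernel:

  __device__ int square(int x)            // runs on the device, callable only from device code
  {
      return x * x;
  }

  __global__ void squareAll(int *data)    // runs on the device, launched from the host
  {
      data[threadIdx.x] = square(data[threadIdx.x]);
  }

  // Host side, for N elements in one block: squareAll<<<1, N>>>(d_data);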

  37. Addition on Device

  __global__ void add(int *a, int *b, int *c) {
      *c = *a + *b;
  }

  • add() will execute on the device. • add() will be called from the host. • a, b, c must point to device memory. • We need to allocate memory on the GPU.

  38. Memory Management • Host and device memories are separate entities. • Device pointers point to GPU memory. • May be passed to/from host code. • May not be dereferenced in host code. • Host pointers point to CPU memory. • May be passed to/from device code. • May not be dereferenced in device code. • CUDA APIs for handling device memory • cudaMalloc(), cudaFree(), cudaMemcpy() • C equivalents: malloc(), free(), memcpy()
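  These runtime calls all return a cudaError_t, which the slides never check; a common pattern is a small wrapper macro (CHECK here is our own name, not part of the CUDA API):

  #include <stdio.h>
  #include <stdlib.h>

  // Hypothetical convenience macro: abort with a message if a CUDA runtime call fails.
  #define CHECK(call)                                                  \
      do {                                                             \
          cudaError_t err = (call);                                    \
          if (err != cudaSuccess) {                                    \
              printf("CUDA error: %s\n", cudaGetErrorString(err));     \
              exit(1);                                                 \
          }                                                            \
      } while (0)

  // Usage:
  //   CHECK(cudaMalloc((void **)&d_a, size));
  //   CHECK(cudaMemcpy(d_a, &a, size, cudaMemcpyHostToDevice));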

  39. Addition on Device: main()

  int main(void) {
      int a, b, c;              // host copies
      int *d_a, *d_b, *d_c;     // device copies
      int size = sizeof(int);

      // Allocate space for device copies of a, b, c
      cudaMalloc((void **)&d_a, size);
      cudaMalloc((void **)&d_b, size);
      cudaMalloc((void **)&d_c, size);

      a = 2; b = 7;

      // Copy inputs to device
      cudaMemcpy(d_a, &a, size, cudaMemcpyHostToDevice);
      cudaMemcpy(d_b, &b, size, cudaMemcpyHostToDevice);

  40. Addition on Device: main()

      // Launch add() kernel on GPU
      add<<<1,1>>>(d_a, d_b, d_c);

      // Copy result back to host
      cudaMemcpy(&c, d_c, size, cudaMemcpyDeviceToHost);

      // Cleanup
      cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
      return 0;
  }

  41. Moving to Parallel • Each call to add() adds two integers. • With add() running in parallel, we can do vector addition in parallel. • add<<<nblocks, 1>>>(d_a, d_b, d_c); • Each parallel invocation of add() is referred to as a block. • By using blockIdx.x to index into the array, each block handles a different index. • Blocks can be 2D (see the sketch below): • dim3 nblocks(M, N) • blockIdx.x, blockIdx.y
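  As a hedged sketch of the 2D case (the kernel and variable names are ours), each block could handle one element of an N-row by M-column matrix stored in row-major order:

  __global__ void addMatrix(float *a, float *b, float *c, int ncols)
  {
      // One block per element: blockIdx.x is the column, blockIdx.y the row.
      int index = blockIdx.y * ncols + blockIdx.x;
      c[index] = a[index] + b[index];
  }

  // Host side:
  //   dim3 nblocks(M, N);                    // M columns of blocks, N rows of blocks
  //   addMatrix<<<nblocks, 1>>>(d_a, d_b, d_c, M);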

  42. Vector Addition on Device

  __global__ void add(int *a, int *b, int *c) {
      c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
  }

  (figure: Block 0 computes c[0]=a[0]+b[0], Block 1 computes c[1]=a[1]+b[1], Block 2 computes c[2]=a[2]+b[2], Block 3 computes c[3]=a[3]+b[3])

  43. Vector Addition on Device: main()

  #define N 512
  int main(void) {
      int *a, *b, *c;           // host copies
      int *d_a, *d_b, *d_c;     // device copies
      int size = N * sizeof(int);

      // Allocate space for device copies of a, b, c
      cudaMalloc((void **)&d_a, size);
      cudaMalloc((void **)&d_b, size);
      cudaMalloc((void **)&d_c, size);

      // Allocate space for host copies of a, b, c and set up initial values
      a = (int *)malloc(size); rand_ints(a, N);
      b = (int *)malloc(size); rand_ints(b, N);
      c = (int *)malloc(size); rand_ints(c, N);

  44. Vector Addition on Device: main()

      // Copy inputs to device
      cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
      cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

      // Launch add() kernel on GPU with N blocks
      add<<<N, 1>>>(d_a, d_b, d_c);

      // Copy results back to host
      cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

      // Cleanup
      free(a); free(b); free(c);
      cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
      return 0;
  }
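  rand_ints() on slide 43 is not defined anywhere in the transcript; it is presumably just a helper that fills an array with random integers, along these lines (the body is our guess):

  #include <stdlib.h>

  // Hypothetical reconstruction of the helper used on slide 43.
  void rand_ints(int *p, int n)
  {
      for (int i = 0; i < n; i++)
          p[i] = rand() % 100;    // any small range is fine for the demo
  }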

  45. CUDA Threads • Each block can be split into parallel threads. • Threads can be up to 3D: • dim3 nthreads(M, N, P) • threadIdx.x, threadIdx.y, threadIdx.z

  __global__ void add(int *a, int *b, int *c) {
      c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];
  }

  add<<<1, N>>>(d_a, d_b, d_c);

  46. Combining Blocks and Threads • We have seen parallel vector addition using: • Many blocks with one thread each • One block with many threads • Let’s adapt vector addition to use both blocks and threads. • Why bother?

  47. Indexing

  With M = 8 threads per block:
      int index = threadIdx.x + blockIdx.x * M;
  Using the built-in block size instead of a hard-coded constant:
      int index = threadIdx.x + blockIdx.x * blockDim.x;

  __global__ void add(int *a, int *b, int *c) {
      int index = threadIdx.x + blockIdx.x * blockDim.x;
      c[index] = a[index] + b[index];
  }

  48. Indexing

  #define N (2048*2048)
  #define M 512    // THREADS_PER_BLOCK
  …
  add<<<N/M, M>>>(d_a, d_b, d_c);

  Adding a bounds check so that N need not be a multiple of M:

  __global__ void add(int *a, int *b, int *c, int n) {
      int index = threadIdx.x + blockIdx.x * blockDim.x;
      if (index < n)
          c[index] = a[index] + b[index];
  }

  add<<<(N+M-1)/M, M>>>(d_a, d_b, d_c, N);

  49. Data Access Pattern (figure: a 1D stencil with the given radius reads radius elements on each side of every input element to produce one output element; how many times is each input element read?)

  50. Sharing Data Between Threads • Each thread generates one output element. • blockDim.x elements per block • Each input element needs to be read several times. • High I/O cost • Within a block, threads can share data via shared memory. • Data are not visible to threads in other blocks. • Extremely fast on-chip memory • Declared using the keyword __shared__, allocated per block. • Read (blockDim.x + 2 * radius) input elements from global memory into shared memory (see the sketch below).
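  The transcript stops before the stencil kernel itself; a minimal sketch, assuming the input array is padded with RADIUS halo elements on each side and a fixed block size (RADIUS and BLOCK_SIZE are our names):

  #define RADIUS     3
  #define BLOCK_SIZE 256

  __global__ void stencil_1d(int *in, int *out)
  {
      __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
      int gindex = threadIdx.x + blockIdx.x * blockDim.x;
      int lindex = threadIdx.x + RADIUS;

      // Each thread loads its own element into shared memory...
      temp[lindex] = in[gindex];
      // ...and the first RADIUS threads also load the halo on both sides
      // (assumes in[] is padded with RADIUS extra elements at each end).
      if (threadIdx.x < RADIUS) {
          temp[lindex - RADIUS]     = in[gindex - RADIUS];
          temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
      }
      __syncthreads();   // make sure the whole tile is loaded before anyone reads it

      int result = 0;
      for (int offset = -RADIUS; offset <= RADIUS; offset++)
          result += temp[lindex + offset];
      out[gindex] = result;
  }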
