Presentation Transcript

  1. Graphics Processing Unit (GPU) Architecture and Programming. TU/e 5kk73. Zhenyu Ye, Bart Mesman, Henk Corporaal. 2010-11-08.

  2. Today's Topics • GPU architecture • GPU programming • GPU micro-architecture • Performance optimization and model • Trends

  3. Today's Topics • GPU architecture • GPU programming • GPU micro-architecture • Performance optimization and model • Trends

  4. System Architecture

  5. GPU Architecture NVIDIA Fermi, 512 Processing Elements (PEs)

  6. What Can It Do? Render triangles. NVIDIA GTX480 can render 1.6 billion triangles per second!

  7. General-Purpose Computing ref: http://www.nvidia.com/object/tesla_computing_solutions.html

  8. The Vision of NVIDIA • "Within the next few years, there will be single-chip graphics devices more powerful and versatile than any graphics system that has ever been built, at any price."  • -- David Kirk, NVIDIA, 1998

  9. Single-Chip GPU vs. Fastest Supercomputers ref: http://www.llnl.gov/str/JanFeb05/Seager.html

  10. Top500 Supercomputer List in June 2010

  11. GPU Will Top the List in Nov 2010

  12. The Gap Between CPU and GPU ref: Tesla GPU Computing Brochure

  13. GPU Has 10x Computational Density Given the same chip area, the achievable performance of a GPU is about 10x that of a CPU.

  14. Evolution of Intel Pentium Chip area breakdown of Pentium I, Pentium II, Pentium III, and Pentium IV. Q: What can you observe? Why?

  15. Extrapolation of Single-Core CPU If we extrapolate the trend, in a few generations the Pentium would look like this. Of course, we know it did not happen. Q: What happened instead? Why?

  16. Evolution of Multi-core CPUs Chip area breakdown of Penryn, Bloomfield, Gulftown, and Beckton. Q: What can you observe? Why?

  17. Let's Take a Closer Look Less than 10% of the total chip area is used for actual execution. Q: Why?

  18. The Memory Hierarchy Notes on energy at 45nm: a 64-bit integer ADD takes about 1 pJ; a 64-bit FP FMA takes about 200 pJ. It seems we cannot further increase the computational density.

  19. The Brick Wall -- UC Berkeley's View Power Wall: power expensive, transistors free. Memory Wall: memory slow, multiplies fast. ILP Wall: diminishing returns on more ILP HW. David Patterson, "Computer Architecture is Back - The Berkeley View of the Parallel Computing Research Landscape", Stanford EE Computer Systems Colloquium, Jan 2007, link

  20. The Brick Wall -- UC Berkeley's View Power Wall: power expensive, transistors free. Memory Wall: memory slow, multiplies fast. ILP Wall: diminishing returns on more ILP HW. Power Wall + Memory Wall + ILP Wall = Brick Wall. David Patterson, "Computer Architecture is Back - The Berkeley View of the Parallel Computing Research Landscape", Stanford EE Computer Systems Colloquium, Jan 2007, link

  21. How to Break the Brick Wall? Hint: how to exploit the parallelism inside the application?

  22. Step 1: Trade Latency with Throughput Hide the memory latency through fine-grained interleaved threading.

  23. Interleaved Multi-threading

  24. Interleaved Multi-threading • The granularity of interleaved multi-threading: • 100 cycles: hide off-chip memory latency • 10 cycles: + hide cache latency • 1 cycle: + hide branch latency, instruction dependency

  25. Interleaved Multi-threading • The granularity of interleaved multi-threading: • 100 cycles: hide off-chip memory latency • 10 cycles: + hide cache latency • 1 cycle: + hide branch latency, instruction dependency Fine-grained interleaved multi-threading: Pros: ? Cons: ?

  26. Interleaved Multi-threading • The granularity of interleaved multi-threading: • 100 cycles: hide off-chip memory latency • 10 cycles: + hide cache latency • 1 cycle: + hide branch latency, instruction dependency Fine-grained interleaved multi-threading: Pros: remove branch predictor, OOO scheduler, large cache Cons: register pressure, etc.

  27. Fine-Grained Interleaved Threading Without and with fine-grained interleaved threading. Pros: reduce cache size, no branch predictor, no OOO scheduler. Cons: register pressure, thread scheduler, require huge parallelism.

  28. HW Support The register file supports zero-overhead context switches between interleaved threads.

  29. Can We Make Further Improvement? • Reducing the large cache gives 2x computational density. • Q: Can we make further improvements? Hint: we have only utilized thread-level parallelism (TLP) so far.

  30. Step 2: Single Instruction Multiple Data GPU uses wide SIMD: 8/16/24/... processing elements (PEs). CPU uses short SIMD: usually a vector width of 4. SSE has 4 data lanes; a GPU has 8/16/24/... data lanes.

  31. Hardware Support Supporting interleaved threading + SIMD execution

  32. Single Instruction Multiple Thread (SIMT) Hide vector width using scalar threads.

  33. Example of SIMT Execution Assume 32 threads are grouped into one warp.

  34. Step 3: Simple Core The Streaming Multiprocessor (SM) is a lightweight core compared to an IA core. Lightweight PE: Fused Multiply-Add (FMA). SFU: Special Function Unit.

  35. NVIDIA's Motivation of Simple Core "This [multiple IA-core] approach is analogous to trying to build an airplane by putting wings on a train." --Bill Dally, NVIDIA

  36. Review: How Do We Reach Here? NVIDIA Fermi, 512 Processing Elements (PEs)

  37. Throughput-Oriented Architectures • Fine-grained interleaved threading (~2x comp density) • SIMD/SIMT (>10x comp density) • Simple core (~2x comp density) These are the key architectural features of a throughput-oriented processor. ref: Michael Garland and David B. Kirk, "Understanding throughput-oriented architectures", CACM 2010. (link)

  38. Today's Topics • GPU architecture • GPU programming • GPU micro-architecture • Performance optimization and model • Trends

  39. CUDA Programming Massive numbers (>10,000) of lightweight threads.

  40. Express Data Parallelism in Threads • Compare a thread program with a vector program.

  41. Vector Program (vector width of 8) The vector width is exposed to the programmer.

Scalar program:

    float A[4][8];
    do-all(i=0;i<4;i++){
        do-all(j=0;j<8;j++){
            A[i][j]++;
        }
    }

Vector program:

    float A[4][8];
    do-all(i=0;i<4;i++){
        movups xmm0, [ &A[i][0] ]
        incps xmm0
        movups [ &A[i][0] ], xmm0
    }
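For reference, the slide's pseudo-assembly can be written with real SSE intrinsics (the function name below is mine). Two caveats: SSE has no vector-increment instruction, so "incps" becomes an addps with a register of 1.0f; and one xmm register holds 4 floats, not 8, so each 8-float row takes two load/add/store steps:

```c
#include <xmmintrin.h>   /* SSE intrinsics */

/* Increment every element of A[4][8] using 4-wide SSE operations. */
static void inc_rows_sse(float A[4][8]) {
    const __m128 ones = _mm_set1_ps(1.0f);
    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 8; j += 4) {
            __m128 v = _mm_loadu_ps(&A[i][j]);   /* movups (load)  */
            v = _mm_add_ps(v, ones);             /* the "incps"    */
            _mm_storeu_ps(&A[i][j], v);          /* movups (store) */
        }
    }
}
```

The point of the slide survives the translation: the 4-lane register width leaks into the source code, which is exactly what the CUDA model on the next slide hides.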

  42. CUDA Program • A CUDA program expresses data-level parallelism (DLP) in terms of thread-level parallelism (TLP). • Hardware converts TLP into DLP at run time.

    float A[4][8];
    kernelF<<<(4,1),(8,1)>>>(A);
    __global__ kernelF(A){
        i = blockIdx.x;
        j = threadIdx.x;
        A[i][j]++;
    }

(A kernel launched with <<<...>>> must be declared __global__, not __device__.)

Scalar program for comparison:

    float A[4][8];
    do-all(i=0;i<4;i++){
        do-all(j=0;j<8;j++){
            A[i][j]++;
        }
    }
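What the launch configuration means can be sketched as a serial CPU program (an illustrative emulation with invented names, not how the GPU actually schedules): the runtime enumerates 4 blocks of 8 threads each and runs the kernel body once per (blockIdx, threadIdx) pair; on the GPU these iterations run as parallel scalar threads.

```c
/* CPU emulation of kernelF<<<(4,1),(8,1)>>>(A): the launch bounds
   become loop bounds, and the kernel body runs once per thread. */
static void launch_kernelF(float A[4][8]) {
    for (int blockIdx_x = 0; blockIdx_x < 4; blockIdx_x++)          /* grid  (4,1) */
        for (int threadIdx_x = 0; threadIdx_x < 8; threadIdx_x++) { /* block (8,1) */
            int i = blockIdx_x;   /* i = blockIdx.x  */
            int j = threadIdx_x;  /* j = threadIdx.x */
            A[i][j]++;
        }
}
```

This is the sense in which "hardware converts TLP into DLP": nothing in the kernel body mentions a vector width, so the same source maps onto machines with 4, 8, or more lanes.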

  43. Two Levels of Thread Hierarchy

    kernelF<<<(4,1),(8,1)>>>(A);
    __global__ kernelF(A){
        i = blockIdx.x;
        j = threadIdx.x;
        A[i][j]++;
    }

  44. Multi-dimension Thread and Block ID Both the grid and the thread block can have a two-dimensional index.

    kernelF<<<(2,2),(4,2)>>>(A);
    __global__ kernelF(A){
        i = gridDim.x * blockIdx.y
            + blockIdx.x;
        j = blockDim.x * threadIdx.y
            + threadIdx.x;
        A[i][j]++;
    }

(Note: CUDA's built-in variables are gridDim and blockDim; there is no threadDim.)
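The 2-D index arithmetic can be checked on the CPU (a small verification sketch; it is written with gridDim.x and blockDim.x, the CUDA built-ins, and the function name is mine): the 2x2 grid of 4x2 blocks must touch each element of A[4][8] exactly once.

```c
#include <string.h>

/* Enumerate every (block, thread) pair of kernelF<<<(2,2),(4,2)>>> and
   count how often each A[i][j] is written; return 1 if exactly once. */
static int covers_each_element_once(void) {
    int count[4][8];
    memset(count, 0, sizeof count);
    const int gridDim_x = 2, gridDim_y = 2;    /* <<<(2,2), ...   */
    const int blockDim_x = 4, blockDim_y = 2;  /* ..., (4,2)>>>   */
    for (int by = 0; by < gridDim_y; by++)
        for (int bx = 0; bx < gridDim_x; bx++)
            for (int ty = 0; ty < blockDim_y; ty++)
                for (int tx = 0; tx < blockDim_x; tx++) {
                    int i = gridDim_x * by + bx;   /* block index -> row  */
                    int j = blockDim_x * ty + tx;  /* thread index -> col */
                    count[i][j]++;
                }
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 8; j++)
            if (count[i][j] != 1) return 0;
    return 1;
}
```

This kind of exhaustive coverage check is a cheap way to validate index arithmetic before running a kernel on real hardware.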

  45. Scheduling Thread Blocks on SM Example: Scheduling 4 thread blocks on 3 SMs.

  46. Executing Thread Block on SM

    kernelF<<<(2,2),(4,2)>>>(A);
    __global__ kernelF(A){
        i = gridDim.x * blockIdx.y
            + blockIdx.x;
        j = blockDim.x * threadIdx.y
            + threadIdx.x;
        A[i][j]++;
    }

Executed on a machine with SIMD width of 4, and on a machine with SIMD width of 8. Note: the number of Processing Elements (PEs) is transparent to the programmer.

  47. Multiple Levels of Memory Hierarchy

    Name     | Cached? | Cycles                | Access
    ---------+---------+-----------------------+----------
    Global   | L1/L2   | 200~400 (cache miss)  | R/W
    Shared   | No      | 1~3                   | R/W
    Constant | Yes     | 1~3                   | Read-only
    Texture  | Yes     | ~100                  | Read-only
    Local    | L1/L2   | 200~400 (cache miss)  | R/W

  48. Explicit Management of Shared Mem Shared memory is frequently used to exploit locality.

  49. Shared Memory and Synchronization Example: average filter with a 3x3 window on the image (image data in DRAM).

    kernelF<<<(1,1),(16,16)>>>(A);
    __global__ kernelF(A){
        __shared__ float smem[16][16]; // allocate shared memory
        i = threadIdx.y;
        j = threadIdx.x;
        smem[i][j] = A[i][j];
        __syncthreads();
        A[i][j] = ( smem[i-1][j-1]
                  + smem[i-1][j]
                  ...
                  + smem[i+1][j+1] ) / 9;
    }

  50. Shared Memory and Synchronization Example: average filter over a 3x3 window. Stage the data in shared memory.

    kernelF<<<(1,1),(16,16)>>>(A);
    __global__ kernelF(A){
        __shared__ float smem[16][16];
        i = threadIdx.y;
        j = threadIdx.x;
        smem[i][j] = A[i][j];   // load to shared memory
        __syncthreads();        // threads wait at barrier
        A[i][j] = ( smem[i-1][j-1]
                  + smem[i-1][j]
                  ...
                  + smem[i+1][j+1] ) / 9;
    }
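The staging pattern above can be sketched on the CPU (a minimal illustration; the function name and interior-only loop bounds are mine). The tile is first copied into a scratch array playing the role of shared memory, then every 3x3 average reads from the scratch copy; it only processes interior pixels, since the slide's kernel would read out of bounds at the tile edge, which a real kernel must guard against.

```c
enum { N = 16 };

/* Stage a 16x16 tile into a scratch "shared memory" array, then
   compute each interior pixel as the average of its 3x3 neighborhood. */
static void average3x3_staged(const float img[N][N], float out[N][N]) {
    float smem[N][N];                   /* plays the role of __shared__ */
    for (int i = 0; i < N; i++)         /* stage: smem[i][j] = A[i][j]  */
        for (int j = 0; j < N; j++)
            smem[i][j] = img[i][j];
    /* (in CUDA, a __syncthreads() barrier goes here) */
    for (int i = 1; i < N - 1; i++)
        for (int j = 1; j < N - 1; j++) {
            float s = 0.0f;
            for (int di = -1; di <= 1; di++)
                for (int dj = -1; dj <= 1; dj++)
                    s += smem[i + di][j + dj];
            out[i][j] = s / 9.0f;
        }
}
```

The payoff of staging is reuse: each pixel is loaded from DRAM once but read nine times from the fast scratch copy, which is exactly the locality the slide's shared-memory version exploits.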