
Adaptive Input-aware Compilation for Graphics Engines


Presentation Transcript


  1. Adaptive Input-aware Compilation for Graphics Engines Mehrzad Samadi1, Amir Hormati2, Mojtaba Mehrara3, Janghaeng Lee1 and Scott Mahlke1 — 1University of Michigan - Ann Arbor, 2Microsoft Research, 3NVIDIA Research

  2. GPU Performance Gap • High performance at low cost • Peak performance is difficult to achieve [Figure: peak vs. "In Practice" performance across GPU generations — GeForce 7800 GTX, 8800 GTX, GTX 280, GTX 480, GTX 590, GTX 680]

  3. TMV Performance on Various Inputs [Figure: transposed matrix-vector multiplication (TMV) performance for a square matrix and two rectangular matrices]

  4. GPU Execution Model [Figure: the thread blocks of Grid 1 are distributed across SM 0–SM 7; each SM executes threads 0–7 and has its own registers and shared memory]
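The execution model above can be sketched in plain Python (not the talk's code): each block runs the same kernel, and a thread's global index is derived from its block and thread IDs, mirroring CUDA's `blockIdx`, `blockDim`, and `threadIdx` built-ins.

```python
# Sketch of the CUDA indexing scheme; names mirror CUDA's built-ins.
def global_thread_ids(num_blocks, threads_per_block):
    """A thread's global id is blockIdx * blockDim + threadIdx,
    so a 1-D grid covers the iteration space exactly once."""
    return [b * threads_per_block + t
            for b in range(num_blocks)
            for t in range(threads_per_block)]

# A grid of 4 blocks x 8 threads covers elements 0..31, each exactly once.
assert global_thread_ids(4, 8) == list(range(32))
```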

  5. Transposed Matrix Vector Multiplication (4 x 1M) [Figure: only four blocks (Block 0–3, threads 0–15 each) are generated, occupying SM 0–SM 3 while SM 4–SM 7 sit idle]

  6. Transposed Matrix Vector Multiplication (1M x 4) [Figure: 1,000,000 blocks are generated — 125,000 blocks per SM — spread across SM 0–SM 7]
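The contrast between the two matrix shapes on slides 5 and 6 is simple occupancy arithmetic; the numbers below follow the slides, assuming 8 SMs and one block per output row.

```python
# Sketch: how input shape changes SM utilization (8 SMs, as on the slides).
def blocks_per_sm(num_blocks, num_sms=8):
    """Return (busy SMs, average blocks queued per SM)."""
    busy = min(num_blocks, num_sms)
    return busy, num_blocks / num_sms

# 4 x 1M transposed matrix: only 4 blocks -> half the SMs idle.
assert blocks_per_sm(4) == (4, 0.5)
# 1M x 4: 1,000,000 blocks -> 125,000 blocks queued per SM.
assert blocks_per_sm(1_000_000) == (8, 125_000.0)
```

Neither extreme is ideal: too few blocks under-utilize the SMs, too many add scheduling overhead, which is what motivates the input-aware actor segmentation and integration later.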

  7. GPU Programming Challenge - Portability GPU architectures keep changing: 240 cores (2008), 512 cores (2011), 1536 cores (2012). The goal: the fastest matrix-vector multiplication for any GPU and for any input size.

  8. Adaptic • Adaptive Input-aware Compilation for GPUs • Device-portable • Input-portable • Programmers can focus on the algorithm without worrying about low-level details • Streaming language • Higher level of abstraction • Separates memory access from the algorithm • e.g., StreamIt

  9. StreamIt • Higher level of abstraction • Decouples computation and memory accesses • Coarse-grain exposed parallelism, exposed communication • Streaming actors use buffers to communicate • Much recent work extends the portability of streaming applications [Figure: example stream graph — Actor 1, a Splitter fanning out to Actors 2–5, a Joiner, then Actor 6]
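The streaming model above can be sketched in a few lines of Python (illustrative only, not StreamIt syntax): actors communicate solely through FIFO buffers, so computation is decoupled from how data is laid out and moved.

```python
# Sketch of the streaming model: actors talk only through FIFO buffers.
from collections import deque

def run_pipeline(source, actors):
    """Feed a stream through a pipeline of actors; each actor consumes
    one buffer and produces the next (pop 1 / push 1 per firing here)."""
    buf = deque(source)
    for actor in actors:
        buf = deque(actor(x) for x in buf)
    return list(buf)

scale = lambda x: 2 * x      # actor: pops one item, pushes one item
offset = lambda x: x + 1
assert run_pipeline([1, 2, 3], [scale, offset]) == [3, 5, 7]
```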

  10. Compilation Flow in Adaptic StreamIt code is compiled offline into several CUDA kernels, each tuned for a different input range; at run time, the target GPU and the input size select which kernel to launch. • Memory Access Optimization — Why? Global memory accesses have large access latency. Optimizations: memory restructuring, coalesced access, neighboring access, data reuse. • Actor Segmentation (large inputs) — splits actors so more blocks are generated, alleviating resource under-utilization. Optimizations: stream reduction, intra-actor parallelization. • Actor Integration (small inputs) — merges several actors into one, alleviating high resource contention. Optimizations: vertical integration, horizontal integration. • A performance model maps each input range, from smallest to largest, to one of the generated kernels (Kernel 0–3) in the executable.

  11. Memory Optimization • Global memory has large access latency • Threads do not access words in sequence, so there is no coalescing • Notation: A[i, j] means actor A has i pops and j pushes [Figure: four threads each running A[4,4]; every thread reads its own four elements contiguously, so the threads' simultaneous accesses are scattered across global memory and cannot be coalesced]

  12. Memory Optimization After memory restructuring, consecutive threads access consecutive words at every step. [Figure: the same four A[4,4] threads after restructuring — in each step, threads 0–3 touch adjacent global-memory addresses, so the accesses coalesce]
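The restructuring on slides 11–12 can be checked with a small index-math sketch (illustrative layouts, not Adaptic's actual code): in the natural layout each thread's elements are contiguous, so the threads' simultaneous accesses are strided; after restructuring, element k of every thread sits at `k*num_threads + t`, so each step's accesses are consecutive.

```python
# Sketch: addresses touched by all threads in one step, for an actor
# with `pops` pops per thread, before vs. after memory restructuring.
def addresses(step, num_threads, pops, restructured):
    if restructured:
        # element k of thread t lives at k*num_threads + t  -> coalesced
        return [step * num_threads + t for t in range(num_threads)]
    # element k of thread t lives at t*pops + k  -> strided, uncoalesced
    return [t * pops + step for t in range(num_threads)]

def coalesced(addrs):
    """True if the addresses of one step are consecutive words."""
    return all(b - a == 1 for a, b in zip(addrs, addrs[1:]))

assert not coalesced(addresses(0, 4, 4, restructured=False))  # 0, 4, 8, 12
assert coalesced(addresses(0, 4, 4, restructured=True))       # 0, 1, 2, 3
```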

  13. Actor Segmentation [Figure: for the 4 x 1M transposed matrix-vector multiplication, Actor 0 — originally four blocks (Block 0–3) — is segmented so its work spans Blocks 0–31, with Actors 1–3 (Blocks 32, 64, 96) combining the partial results]
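A minimal sketch of the segmentation idea (illustrative, using a plain sum as the reduction): one long-running actor is split so that many blocks each reduce a chunk, and a small follow-up actor combines the partials, trading one block's work for enough blocks to fill every SM.

```python
# Sketch of actor segmentation: split one reduction across many blocks.
def segmented_sum(data, num_segments):
    """Each of num_segments 'blocks' reduces one chunk; a small
    combining actor then reduces the partial results."""
    chunk = len(data) // num_segments          # assume even division
    partials = [sum(data[i * chunk:(i + 1) * chunk])
                for i in range(num_segments)]
    return sum(partials)

row = list(range(1024))
# Same result as a single sequential actor, but now 32 blocks of work.
assert segmented_sum(row, 32) == sum(row)
```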

  14. Actor Integration • Merges several actors or threads to balance thread workloads • Vertical integration: reduces off-chip memory traffic by storing intermediate results in shared memory • Horizontal integration: reduces synchronization overhead and lets the merged actors share instructions [Figure: before/after stream graphs in which several actors are merged into Fused Actor 0 and Fused Actor 1 around a Splitter/Joiner]
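Vertical integration can be sketched as ordinary function fusion (a simplified model with made-up actors, where a local variable stands in for shared memory): running two actors back-to-back per element keeps the intermediate value in fast storage instead of an intermediate global buffer.

```python
# Sketch of vertical integration: fuse two producer/consumer actors.
def actor_a(x):
    return x * x          # first actor

def actor_b(y):
    return y + 1          # second actor, consumes actor_a's output

def unfused(stream):
    tmp = [actor_a(x) for x in stream]   # intermediate buffer ("global memory")
    return [actor_b(y) for y in tmp]

def fused(stream):
    # Intermediate value never leaves the "thread" -> no off-chip traffic.
    return [actor_b(actor_a(x)) for x in stream]

data = [1, 2, 3]
assert fused(data) == unfused(data) == [2, 5, 10]
```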

  15. Experimental Setup • CPU: Intel Xeon X5650 • GPUs: NVIDIA Tesla C2050 (3GB GDDR5) and NVIDIA GTX 285 (2GB GDDR3) • Benchmarks: CUBLAS Library 3.2, NVIDIA SDK 3.1

  16. Results (Matrix-Vector Multiplication) [Figure: matrix-vector multiplication results]

  17. Results (Speedup) [Figure: speedup across input sizes]

  18. Results (BiCGSTAB) [Figure: BiCGSTAB results compared against the input-unaware version]

  19. Summary • GPU performance is affected by the GPU model and the input • The CUDA/OpenCL programming model lacks architecture and input portability • Scientific applications often use irregular inputs, so optimized performance is hard to achieve • Proposed Adaptic: architecture- and input-portable compilation with a streaming language • Showed speedups over CUBLAS / SDK across various input ranges

  20. Q & A
