1 / 28

GPUDet: A Deterministic GPU Architecture

GPUDet: A Deterministic GPU Architecture. Hadi Jooybar 1 , Wilson Fung 1 , Mike O’Connor 2, Joseph Devietti 3 , Tor M. Aamodt 1. 1 The University of British Columbia 2 AMD Research 3 University of Washington . GPUs are … Fast Energy efficient Commodity hardware. But…

fritzi
Télécharger la présentation

GPUDet: A Deterministic GPU Architecture

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. GPUDet: A Deterministic GPU Architecture Hadi Jooybar1, Wilson Fung1, Mike O’Connor2, Joseph Devietti3,Tor M. Aamodt1 1The University of British Columbia 2AMD Research 3University of Washington

  2. GPUs are … • Fast • Energy efficient • Commodity hardware • But… • Mostly use for certain range of applications • Why? • Communication among concurrent threads 1000s of Threads

  3. V0 V0 V0 Motivation 0 __global__ voidBFS_step_kernel(...) { 1 if( active[tid] ) { 2 active[tid] = false; 3 visited[tid] = true; 4 foreach (int id = neighbour_nodes){ 5 if( visited[id] == false ){ 6 cost[id] = cost[tid] + 1; 7 active[id] = true; 8 *over = true; 9 } } } } V1 V1 V1 V2 V2 V2 Cost = - Active = - Cost = 1 Active = 1 Cost = 1 Active = 1 Cost = - Active = - Cost = 1 Active = 1 Cost = 2 Active = 1 BFS algorithm Published in HiPC 2007

  4. Motivation • What about debuggers?! • The bug may appear occasionally or in different places in each run. I will debug it this time OMG! Where was that bug?!

  5. GPUDet • Strong Determinism (hardware proposal) • Same Outputs • Same Execution Path • Makes the program easier to • Debug • Test

  6. V0 Motivation 0 __global__ voidBFS_step_kernel(...) { 1 if( active[tid] ) { 2 active[tid] = false; 3 visited[tid] = true; 4 foreach (int id = neighbour_nodes){ 5 if( visited[id] == false ){ 6 cost[id] = cost[tid] + 1; 7 active[id] = true; 8 *over = true; 9 } } } } V1 V2 Cost = 1 Active = 1 Cost = 2 Active = 1 BFS algorithm Published in HiPC 2007

  7. GPUDet • Strong Determinism • Same Outputs • Same Execution Path • Makes the program easier to • Debug • Test • There is no free lunch • Performance Overhead • Our goal is to provide Deterministic Execution on GPU architectures with acceptable performance overhead

  8. GPU Architecture DRAM CPU DRAM L2 Cache Kernel launch x = input[threadID]; y= func(x); output[threadID] = y; Compute Unit L1 Cache workgroup 0 workgroup 2 workgroup 1 Memory Unit ALU ALU ALU Workgroups

  9. Outline • Introduction • GPU Architecture • Challenges • Deterministic Execution with GPUDet • GPUDet Optimizations • Workgroup-Aware Quantum Formation • Deterministic parallel commit using Z-Buffer Unit • Compute Unit level serialization • Results and Conclusion

  10. Deterministic GPU Execution Challenges … Isolation Isolation Quantum 0 Quantum n T0 … T1 • Isolation mechanism • Provide method to pause execution of a thread T2 T0 T0 T0 T0 T3 T1 T1 T1 T1 Normal Execution T2 T2 T2 T2 T3 T3 T3 T3 Communication Communication

  11. Deterministic GPU Execution Challenges • Isolation mechanism • Lack of private caches • Lack of cache coherency • Provide method to pause execution of a thread • Single Instruction Multiple Threads (SIMT) • Potential deadlock condition • Major changes in control flow hardware • Performance overhead … workgroupn wavefront

  12. Deterministic GPU Execution Challenges • Very large number of threads • Expensive global synchronization • Expensive serialization • Different program properties • Large number of short running threads • Frequent workgroup synchronization • Less locality in intra thread memory accesses

  13. Outline • Introduction • GPU Architecture • Challenges • Deterministic Execution with GPUDet • GPUDet Optimizations • Workgroup-Aware Quantum Formation • Deterministic parallel commit using Z-Buffer Unit • Compute Unit level serialization • Results and Conclusion

  14. Deterministic Execution of a Wavefront T15 T0 T2 T1 if (tid < 16) x[tid%2] = tid; … x[0] = 0 x[1] = 1 x[0] = 2 x[1] = 15 Execution of one wavefront is deterministic Coalescing Unit Address x Mask v v - - - - - - … - Data Race Data 14 15 - - - - - - … - x[0] = 14 Not modified x[1] = 15 To memory

  15. Deterministic GPU Execution Challenges Isolation Isolation … wavefront granularity • Isolation mechanism • Provide method to pause execution of a thread T0 T0 T1 T1 not a challenge anymore T2 T2 T3 T3 Communication Communication

  16. Read Only • GPUDet-Basic Global Memory Commit Reaching Quantum Boundary Load Op Atomic Op Store Buffers Wavefronts Instruction Count Atomic Operations Memory Fences Workgroup Barriers Execution Complete … Local Memory

  17. Outline • Introduction • GPU Architecture • Challenges • Deterministic Execution with GPUDet • GPUDet Optimizations • Workgroup-Aware Quantum Formation • Deterministic parallel commit using Z-Buffer Unit • Compute Unit level serialization • Results and Conclusion

  18. Workgroup-Aware Quantum Formation • Extra global synchronizations Reducing number of synchronizations Avoid unnecessary quantum termination Load Imbalance

  19. Workgroup-Aware Quantum Formation All reach a workgroup barrier Continue execution in the parallel mode Workgroup-Aware Decision Making Quanta are finished by workgroup barriers

  20. Workgroup-Aware Quantum Formation Workgroup-Aware Decision Making Finish execution of the Kernel function Deterministic workgroup partitioning

  21. Deterministic Parallel Commit using the Z-Buffer Unit • Z-Buffer Unit Depth Buffer Store Buffer Contents ≈ Color Values Wavefront ID ≈ Depth Values

  22. Compute Unit Level Serialization • GPUs preserve Point to Point Ordering Serialization is only among compute units

  23. Outline • Introduction • GPU Architecture • Challenges • Deterministic Execution with GPUDet • GPUDet Optimizations • Workgroup-Aware Quantum Formation • Deterministic parallel commit using Z-Buffer Unit • Compute Unit level serialization • Results and Conclusion

  24. Results • GPGPU-Sim 3.0.2 Applications with atomic operations 2x Slowdown

  25. Quantum Formation 19% Performance Improvementfor application with small kernel functions 20% Performance Improvementfor application with barriers

  26. Deterministic Parallel Commit using the Z-Buffer Unit 60% Performance Improvement on Average

  27. Compute Unit Level Serialization 6.1x Performance Improvement in Serial Mode

  28. Conclusion • Encourages programmers to use GPUs in broader range of applications • Exploits GPU characteristics to reduce performance overhead • Deterministic execution within a wavefront • Workgroup-aware quantum formation • Deterministic parallel commit using Z-Buffer Unit • Compute Unit level serialization Questions?

More Related