
Cache Coherence for GPU Architectures


Presentation Transcript


  1. Cache Coherence for GPU Architectures. Inderpreet Singh¹, Arrvindh Shriraman², Wilson Fung¹, Mike O’Connor³, Tor Aamodt¹. ¹University of British Columbia, ²Simon Fraser University, ³AMD Research. Image source: www.forces.gc.ca

  2. What is a GPU? [Diagram: the CPU spawns workgroups of wavefronts onto GPU cores; each GPU core has a private L1 data cache (L1D) and reaches shared L2 banks over an interconnect; when the work is done, control returns to the CPU.]
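To ground this execution model in code, here is a minimal CUDA sketch (the kernel and launch geometry are illustrative, not from the talk): the CPU spawns a grid of workgroups onto the GPU, each workgroup runs as wavefronts on a GPU core, and the CPU waits until the GPU is done.

    // Minimal CUDA sketch of the CPU-spawns-workgroups model (illustrative).
    #include <cstdio>

    __global__ void scale(float *x, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // one element per thread
        if (i < n) x[i] *= 2.0f;                         // data served by the core's L1D and the L2 banks
    }

    int main()
    {
        const int n = 1 << 20;
        float *x;
        cudaMallocManaged(&x, n * sizeof(float));
        for (int i = 0; i < n; ++i) x[i] = 1.0f;

        scale<<<n / 256, 256>>>(x, n);    // CPU spawns 4096 workgroups of 256 threads
        cudaDeviceSynchronize();          // CPU waits until the GPU reports done

        printf("x[0] = %.1f\n", x[0]);    // prints 2.0
        cudaFree(x);
        return 0;
    }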

  3. Evolution of GPUs. From the graphics pipeline (vertex shaders and pixel shaders driven through OpenGL/DirectX) to general-purpose compute (OpenCL, CUDA), e.g. matrix multiplication.

  4. Evolution of GPUs. Future: a coherent memory space across workgroups, enabling efficient critical sections (lock shared structure … computation … unlock), load balancing, and stencil computation; a sketch of such a critical section follows below.
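Below is a minimal CUDA sketch of that lock / computation / unlock pattern, assuming a single global lock word and a trivial shared counter (both hypothetical). On GPUs without coherent L1 caches this kind of inter-workgroup critical section is exactly what is hard to make efficient, so treat it as an illustration rather than a recommended idiom.

    // Sketch: inter-workgroup critical section guarded by a global spinlock.
    #include <cstdio>

    __device__ int mutex = 0;                              // assumed global lock word

    __global__ void update_shared(volatile int *shared_count)
    {
        if (threadIdx.x == 0) {                            // one thread per workgroup enters
            while (atomicCAS(&mutex, 0, 1) != 0) { }       // lock shared structure
            *shared_count = *shared_count + 1;             // ... computation ...
            __threadfence();                               // publish the update before releasing
            atomicExch(&mutex, 0);                         // unlock
        }
    }

    int main()
    {
        int *count;
        cudaMallocManaged(&count, sizeof(int));
        *count = 0;
        update_shared<<<64, 64>>>(count);                  // 64 workgroups contend for the lock
        cudaDeviceSynchronize();
        printf("count = %d (expect 64)\n", *count);
        cudaFree(count);
        return 0;
    }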

  5. GPU Coherence Challenges. Challenge 1: coherence traffic. [Chart: interconnect traffic for applications that do not require coherence; MESI and GPU-VI generate substantially more traffic than no coherence, much of it from recalls.] [Diagram: cores C1-C4 load blocks A and B into their L1Ds; when the L2/directory needs to make room for block C, it sends recall (rcl) messages for A to every L1 holding it and collects their acknowledgements (ack).]
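A rough host-side C++ model of where that recall traffic comes from (block addresses, message costs, and structure names are assumptions for illustration, not the simulated protocol): evicting a directory entry costs one recall plus one acknowledgement per sharer, even though the sharers only ever read the block.

    // Sketch: recall traffic caused by evictions from an L2/directory that tracks L1 copies.
    #include <cstdint>
    #include <cstdio>
    #include <map>
    #include <set>

    struct Directory {
        std::map<uint64_t, std::set<int>> sharers;   // block -> cores holding an L1 copy
        long messages = 0;                           // interconnect traffic counter

        void load(int core, uint64_t addr) {
            sharers[addr].insert(core);
            messages += 2;                           // request + data reply
        }

        void evict(uint64_t addr) {                  // entry needed for another block
            messages += 2 * (long)sharers[addr].size();   // rcl + ack per sharer
            sharers.erase(addr);
        }
    };

    int main() {
        Directory dir;
        for (int core = 0; core < 4; ++core)
            dir.load(core, 0xA);                     // C1-C4 each load block A
        dir.evict(0xA);                              // directory must make room for block C
        printf("messages = %ld\n", dir.messages);    // 16: 8 from the loads, 8 from the recall
        return 0;
    }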

  6. GPU Coherence Challenges. Challenge 2: tracking in-flight requests. [Diagram: coherence introduces transient states (e.g. Shared, S_M, Modified) that must be tracked in the L2/directory MSHRs, and this tracking storage amounts to a significant % of the L2.]

  7. GPU Coherence Challenges. Challenge 3: complexity. [Diagram: the MESI L1 and L2 state/event transition tables are far larger than those of the non-coherent L1 and L2 caches.]

  8. GPU Coherence Challenges. All three challenges result from introducing coherence messages on a GPU: traffic from transferring them, storage from tracking them, and complexity from managing them. Can we have GPU cache coherence without coherence messages? Yes, by using global time.

  9. Temporal Coherence (TC). All cores and L2 banks share a synchronized global time. An L1 block whose Local Timestamp > Global Time is VALID; once a block's Global Timestamp at the L2 < Global Time, no unexpired L1 copies of it remain. [Diagram: Core 1 and Core 2, each with an L1D, connect through the interconnect to L2 banks; block A=0 is held in the L1s with local timestamps and in the L2 with a global timestamp.]

  10. Temporal Coherence (TC): example. [Animation: at T=0, Core 1 loads A and the L2 grants a copy with timestamp 10; once global time passes 10 the copy has self-invalidated, so when Core 2 later issues Store A=1 the L2 completes the write without sending any coherence messages.]
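The host-side C++ sketch below models the timestamp rules of slides 9-10 under simplifying assumptions (one L1, one L2 bank, an abstract cycle counter, made-up names); in this TC-with-stalling form the store's effect only becomes visible once every handed-out copy has expired.

    // Sketch of Temporal Coherence: self-invalidating L1 copies governed by global time.
    #include <algorithm>
    #include <cstdint>
    #include <unordered_map>

    struct L1Line { uint32_t data; uint64_t local_ts; };    // valid while local_ts > now
    struct L2Line { uint32_t data; uint64_t global_ts; };   // max timestamp handed to any L1

    struct TCModel {
        uint64_t now = 0;                                   // synchronized global time
        std::unordered_map<uint64_t, L1Line> l1;            // one core's L1D
        std::unordered_map<uint64_t, L2Line> l2;            // the L2 bank

        // Load: an unexpired L1 copy is used with no coherence messages;
        // otherwise the L2 returns data valid for the requested lifetime.
        uint32_t load(uint64_t addr, uint64_t lifetime) {
            auto it = l1.find(addr);
            if (it != l1.end() && it->second.local_ts > now)
                return it->second.data;                     // self-invalidating copy still valid
            L2Line &line = l2[addr];
            line.global_ts = std::max(line.global_ts, now + lifetime);
            l1[addr] = { line.data, now + lifetime };
            return line.data;
        }

        // Store (TC with stalling): the write only becomes visible once every
        // L1 copy has self-expired, i.e. once global time exceeds global_ts.
        uint64_t store(uint64_t addr, uint32_t value) {
            L2Line &line = l2[addr];
            uint64_t completes_at = std::max(now, line.global_ts + 1);
            line.data = value;                              // conceptually applied at completes_at
            return completes_at;                            // caller stalls until this time
        }
    };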

  11. Temporal Coherence (TC). What lifetime values should be requested on loads? Use a predictor to predict lifetime values. What about stores to unexpired blocks? Stall them at the L2?

  12. TC Stalling Issues. Should stores stall at the L2? Problem #1: sensitive to lifetime mispredictions. Problem #2: impedes other accesses. Problem #3: hurts existing GPU applications. Solution: TC-Weak.

  13. TC-Weak. Stores return a Global Write Completion Time (GWCT) instead of stalling at the L2. [Animation: GPU Core 1 executes (1) data=NEW, (2) FENCE, (3) flag=SET; each store completes at the L2 immediately and returns its GWCT (e.g. 30 for the data block), which is saved in the core's per-wavefront GWCT table (W0, W1); the FENCE stalls the wavefront until global time reaches the recorded GWCT, so there is no stalling at the L2.]
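In CUDA source, the slide's three steps correspond to the familiar data/flag publication pattern below (variable names and launch setup assumed for illustration); under TC-Weak the stores complete immediately and it is the __threadfence() that waits out the recorded GWCTs. The sketch assumes the two single-thread kernels can run concurrently in separate streams.

    // Sketch: data/flag publication with a fence, as on slide 13.
    #include <cstdio>

    __device__ volatile int data = 0;    // OLD
    __device__ volatile int flag = 0;    // NULL

    __global__ void producer()
    {
        data = 1;            // (1) Store data=NEW: returns a GWCT under TC-Weak
        __threadfence();     // (2) FENCE: wavefront waits until its recorded GWCTs have passed
        flag = 1;            // (3) Store flag=SET: safe to publish
    }

    __global__ void consumer(int *out)
    {
        while (flag == 0) { }    // spin until the flag becomes visible
        __threadfence();         // order the flag read before the data read
        *out = data;             // observes data=NEW
    }

    int main()
    {
        int *out;
        cudaMallocManaged(&out, sizeof(int));
        *out = 0;
        cudaStream_t s1, s2;
        cudaStreamCreate(&s1);
        cudaStreamCreate(&s2);
        consumer<<<1, 1, 0, s1>>>(out);   // assumes both blocks are co-resident
        producer<<<1, 1, 0, s2>>>();
        cudaDeviceSynchronize();
        printf("out = %d (expect 1)\n", *out);
        cudaFree(out);
        return 0;
    }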

  14. TC-Weak. [Figure-only slide.]

  15. Methodology. GPGPU-Sim v3.1.2 for the GPU core model; GEMS Ruby v2.1.1 for the memory system; all protocols written in SLICC. Model a generic NVIDIA Fermi-based GPU (see paper for details). Applications: 6 do not require coherence and 6 require coherence: Barnes Hut, Cloth Physics, Versatile Place and Route, Max-Flow Min-Cut, 3D Wave Equation Solver, and Octree Partitioning, which use locks, stencil communication, and load balancing.

  16. Interconnect Traffic. TC-Weak reduces traffic by 53% over MESI and 23% over GPU-VI for intra-workgroup applications, and generates lower traffic than a 16x-sized 32-way directory. [Chart: normalized interconnect traffic for NO-COH, MESI, GPU-VI, and TC-Weak on the applications that do not require coherence.]

  17. Performance. TC-Weak with a simple predictor performs 85% better than disabling the L1 caches and 28% better than TC with stalling; larger directory sizes do not improve performance. [Chart: speedup for NO-L1, MESI, GPU-VI, and TC-Weak on the applications that require coherence.]

  18. Complexity. [Diagram: the TC-Weak L1 and L2 state/event tables are comparable in size to those of non-coherent caches, and far smaller than the MESI tables.]

  19. Summary. First work to characterize GPU coherence challenges. TC-Weak saves traffic and energy by using global time, reduces protocol complexity, and delivers an 85% performance improvement over no coherence. Questions?

  20. Backup Slides

  21. Lifetime Predictor. One prediction value per L2 bank; events local to the L2 bank update it. An expired load increases the prediction; an unexpired store decreases it; an unexpired eviction decreases it. [Animation: at T=0 a Load A is granted a lifetime from the prediction value; at T=20 a Store A finds the block still unexpired and decrements the prediction.]
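A minimal C++ sketch of that per-bank predictor; the starting value and step sizes here are assumptions for illustration, not the tuned constants from the paper.

    // Sketch: one lifetime-prediction value per L2 bank, nudged by local events.
    #include <cstdint>

    struct LifetimePredictor {
        uint64_t prediction = 8;                    // lifetime granted to each load (assumed start)
        static constexpr uint64_t step = 4;         // assumed adjustment step

        void on_expired_load()       { prediction += step; }                              // lifetimes too short
        void on_unexpired_store()    { if (prediction >= step) prediction -= step; }      // writes had to wait
        void on_unexpired_eviction() { if (prediction >= step) prediction -= step; }      // lifetimes too long

        uint64_t lifetime() const { return prediction; }   // used when the L2 bank serves a load
    };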

  22. TC-Strong vs TC-Weak. [Charts: speedup across all applications, comparing a single fixed lifetime for all applications against the best lifetime per application, for the TC-Strong variants (TCSUO, TCS, TCSOO), TC-Weak (TCW), and TCW with the predictor.]

  23. Interconnect Power and Energy
