
Cache Coherence for GPU Architectures


Presentation Transcript


  1. Cache Coherence for GPU Architectures. Inderpreet Singh¹, Arrvindh Shriraman², Wilson Fung¹, Mike O’Connor³, Tor Aamodt¹. ¹University of British Columbia, ²Simon Fraser University, ³AMD Research. Image source: www.forces.gc.ca

  2. What is a GPU? [Diagram: the CPU spawns workgroups of wavefronts onto GPU cores; each GPU core has a private L1 data cache (L1D) and reaches shared L2 banks over an interconnect; when the work is done, control returns to the CPU.]
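To ground this execution model in code, here is a minimal CUDA sketch (the kernel and launch geometry are illustrative, not from the talk): the CPU spawns a grid of workgroups onto the GPU, each workgroup runs as wavefronts on a GPU core, and the CPU waits until the GPU is done.

    // Minimal CUDA sketch of the CPU-spawns-workgroups model (illustrative).
    #include <cstdio>

    __global__ void scale(float *x, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // one element per thread
        if (i < n) x[i] *= 2.0f;                         // data served by the core's L1D and the L2 banks
    }

    int main()
    {
        const int n = 1 << 20;
        float *x;
        cudaMallocManaged(&x, n * sizeof(float));
        for (int i = 0; i < n; ++i) x[i] = 1.0f;

        scale<<<n / 256, 256>>>(x, n);    // CPU spawns 4096 workgroups of 256 threads
        cudaDeviceSynchronize();          // CPU waits until the GPU reports done

        printf("x[0] = %.1f\n", x[0]);    // prints 2.0
        cudaFree(x);
        return 0;
    }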

  3. Evolution of GPUs. From the graphics pipeline (vertex shaders and pixel shaders driven through OpenGL/DirectX) to general-purpose compute (OpenCL, CUDA), e.g. matrix multiplication.

  4. Evolution of GPUs. Future: a coherent memory space across workgroups, enabling efficient critical sections (lock shared structure … computation … unlock), load balancing, and stencil computation; a sketch of such a critical section follows below.
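Below is a minimal CUDA sketch of that lock / computation / unlock pattern, assuming a single global lock word and a trivial shared counter (both hypothetical). On GPUs without coherent L1 caches this kind of inter-workgroup critical section is exactly what is hard to make efficient, so treat it as an illustration rather than a recommended idiom.

    // Sketch: inter-workgroup critical section guarded by a global spinlock.
    #include <cstdio>

    __device__ int mutex = 0;                              // assumed global lock word

    __global__ void update_shared(volatile int *shared_count)
    {
        if (threadIdx.x == 0) {                            // one thread per workgroup enters
            while (atomicCAS(&mutex, 0, 1) != 0) { }       // lock shared structure
            *shared_count = *shared_count + 1;             // ... computation ...
            __threadfence();                               // publish the update before releasing
            atomicExch(&mutex, 0);                         // unlock
        }
    }

    int main()
    {
        int *count;
        cudaMallocManaged(&count, sizeof(int));
        *count = 0;
        update_shared<<<64, 64>>>(count);                  // 64 workgroups contend for the lock
        cudaDeviceSynchronize();
        printf("count = %d (expect 64)\n", *count);
        cudaFree(count);
        return 0;
    }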

  5. GPU Coherence Challenges. Challenge 1: coherence traffic. [Chart: interconnect traffic for applications that do not require coherence; MESI and GPU-VI generate substantially more traffic than no coherence, much of it from recalls.] [Diagram: cores C1-C4 load blocks A and B into their L1Ds; when the L2/directory needs to make room for block C, it sends recall (rcl) messages for A to every L1 holding it and collects their acknowledgements (ack).]
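A rough host-side C++ model of where that recall traffic comes from (block addresses, message costs, and structure names are assumptions for illustration, not the simulated protocol): evicting a directory entry costs one recall plus one acknowledgement per sharer, even though the sharers only ever read the block.

    // Sketch: recall traffic caused by evictions from an L2/directory that tracks L1 copies.
    #include <cstdint>
    #include <cstdio>
    #include <map>
    #include <set>

    struct Directory {
        std::map<uint64_t, std::set<int>> sharers;   // block -> cores holding an L1 copy
        long messages = 0;                           // interconnect traffic counter

        void load(int core, uint64_t addr) {
            sharers[addr].insert(core);
            messages += 2;                           // request + data reply
        }

        void evict(uint64_t addr) {                  // entry needed for another block
            messages += 2 * (long)sharers[addr].size();   // rcl + ack per sharer
            sharers.erase(addr);
        }
    };

    int main() {
        Directory dir;
        for (int core = 0; core < 4; ++core)
            dir.load(core, 0xA);                     // C1-C4 each load block A
        dir.evict(0xA);                              // directory must make room for block C
        printf("messages = %ld\n", dir.messages);    // 16: 8 from the loads, 8 from the recall
        return 0;
    }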

  6. GPU Coherence Challenges. Challenge 2: tracking in-flight requests. [Diagram: coherence introduces transient states (e.g. Shared, S_M, Modified) that must be tracked in the L2/directory MSHRs, and this tracking storage amounts to a significant % of the L2.]

  7. GPU Coherence Challenges. Challenge 3: complexity. [Diagram: the MESI L1 and L2 state/event transition tables are far larger than those of the non-coherent L1 and L2 caches.]

  8. GPU Coherence Challenges. All three challenges result from introducing coherence messages on a GPU: traffic from transferring them, storage from tracking them, and complexity from managing them. Can we have GPU cache coherence without coherence messages? Yes, by using global time.

  9. Temporal Coherence (TC). All cores and L2 banks share a synchronized global time. An L1 block whose Local Timestamp > Global Time is VALID; once a block's Global Timestamp at the L2 < Global Time, no unexpired L1 copies of it remain. [Diagram: Core 1 and Core 2, each with an L1D, connect through the interconnect to L2 banks; block A=0 is held in the L1s with local timestamps and in the L2 with a global timestamp.]

  10. Temporal Coherence (TC): example. [Animation: at T=0, Core 1 loads A and the L2 grants a copy with timestamp 10; once global time passes 10 the copy has self-invalidated, so when Core 2 later issues Store A=1 the L2 completes the write without sending any coherence messages.]
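The host-side C++ sketch below models the timestamp rules of slides 9-10 under simplifying assumptions (one L1, one L2 bank, an abstract cycle counter, made-up names); in this TC-with-stalling form the store's effect only becomes visible once every handed-out copy has expired.

    // Sketch of Temporal Coherence: self-invalidating L1 copies governed by global time.
    #include <algorithm>
    #include <cstdint>
    #include <unordered_map>

    struct L1Line { uint32_t data; uint64_t local_ts; };    // valid while local_ts > now
    struct L2Line { uint32_t data; uint64_t global_ts; };   // max timestamp handed to any L1

    struct TCModel {
        uint64_t now = 0;                                   // synchronized global time
        std::unordered_map<uint64_t, L1Line> l1;            // one core's L1D
        std::unordered_map<uint64_t, L2Line> l2;            // the L2 bank

        // Load: an unexpired L1 copy is used with no coherence messages;
        // otherwise the L2 returns data valid for the requested lifetime.
        uint32_t load(uint64_t addr, uint64_t lifetime) {
            auto it = l1.find(addr);
            if (it != l1.end() && it->second.local_ts > now)
                return it->second.data;                     // self-invalidating copy still valid
            L2Line &line = l2[addr];
            line.global_ts = std::max(line.global_ts, now + lifetime);
            l1[addr] = { line.data, now + lifetime };
            return line.data;
        }

        // Store (TC with stalling): the write only becomes visible once every
        // L1 copy has self-expired, i.e. once global time exceeds global_ts.
        uint64_t store(uint64_t addr, uint32_t value) {
            L2Line &line = l2[addr];
            uint64_t completes_at = std::max(now, line.global_ts + 1);
            line.data = value;                              // conceptually applied at completes_at
            return completes_at;                            // caller stalls until this time
        }
    };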

  11. Temporal Coherence (TC). What lifetime values should be requested on loads? Use a predictor to predict lifetime values. What about stores to unexpired blocks? Stall them at the L2?

  12. TC Stalling Issues. Should stores stall at the L2? Problem #1: sensitive to lifetime mispredictions. Problem #2: impedes other accesses. Problem #3: hurts existing GPU applications. Solution: TC-Weak.

  13. TC-Weak. Stores return a Global Write Completion Time (GWCT) instead of stalling at the L2. [Animation: GPU Core 1 executes (1) data=NEW, (2) FENCE, (3) flag=SET; each store completes at the L2 immediately and returns its GWCT (e.g. 30 for the data block), which is saved in the core's per-wavefront GWCT table (W0, W1); the FENCE stalls the wavefront until global time reaches the recorded GWCT, so there is no stalling at the L2.]
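In CUDA source, the slide's three steps correspond to the familiar data/flag publication pattern below (variable names and launch setup assumed for illustration); under TC-Weak the stores complete immediately and it is the __threadfence() that waits out the recorded GWCTs. The sketch assumes the two single-thread kernels can run concurrently in separate streams.

    // Sketch: data/flag publication with a fence, as on slide 13.
    #include <cstdio>

    __device__ volatile int data = 0;    // OLD
    __device__ volatile int flag = 0;    // NULL

    __global__ void producer()
    {
        data = 1;            // (1) Store data=NEW: returns a GWCT under TC-Weak
        __threadfence();     // (2) FENCE: wavefront waits until its recorded GWCTs have passed
        flag = 1;            // (3) Store flag=SET: safe to publish
    }

    __global__ void consumer(int *out)
    {
        while (flag == 0) { }    // spin until the flag becomes visible
        __threadfence();         // order the flag read before the data read
        *out = data;             // observes data=NEW
    }

    int main()
    {
        int *out;
        cudaMallocManaged(&out, sizeof(int));
        *out = 0;
        cudaStream_t s1, s2;
        cudaStreamCreate(&s1);
        cudaStreamCreate(&s2);
        consumer<<<1, 1, 0, s1>>>(out);   // assumes both blocks are co-resident
        producer<<<1, 1, 0, s2>>>();
        cudaDeviceSynchronize();
        printf("out = %d (expect 1)\n", *out);
        cudaFree(out);
        return 0;
    }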

  14. TC-Weak. [Figure-only slide.]

  15. Methodology. GPGPU-Sim v3.1.2 for the GPU core model; GEMS Ruby v2.1.1 for the memory system; all protocols written in SLICC. Model a generic NVIDIA Fermi-based GPU (see paper for details). Applications: 6 do not require coherence and 6 require coherence: Barnes Hut, Cloth Physics, Versatile Place and Route, Max-Flow Min-Cut, 3D Wave Equation Solver, and Octree Partitioning, which use locks, stencil communication, and load balancing.

  16. Interconnect Traffic. TC-Weak reduces traffic by 53% over MESI and 23% over GPU-VI for intra-workgroup applications, and generates lower traffic than a 16x-sized 32-way directory. [Chart: normalized interconnect traffic for NO-COH, MESI, GPU-VI, and TC-Weak on the applications that do not require coherence.]

  17. Performance. TC-Weak with a simple predictor performs 85% better than disabling the L1 caches and 28% better than TC with stalling; larger directory sizes do not improve performance. [Chart: speedup for NO-L1, MESI, GPU-VI, and TC-Weak on the applications that require coherence.]

  18. Complexity. [Diagram: the TC-Weak L1 and L2 state/event tables are comparable in size to those of non-coherent caches, and far smaller than the MESI tables.]

  19. Summary. First work to characterize GPU coherence challenges. TC-Weak saves traffic and energy by using global time, reduces protocol complexity, and delivers an 85% performance improvement over no coherence. Questions?

  20. Backup Slides

  21. Lifetime Predictor. One prediction value per L2 bank; events local to the L2 bank update it. An expired load increases the prediction; an unexpired store decreases it; an unexpired eviction decreases it. [Animation: at T=0 a Load A is granted a lifetime from the prediction value; at T=20 a Store A finds the block still unexpired and decrements the prediction.]
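A minimal C++ sketch of that per-bank predictor; the starting value and step sizes here are assumptions for illustration, not the tuned constants from the paper.

    // Sketch: one lifetime-prediction value per L2 bank, nudged by local events.
    #include <cstdint>

    struct LifetimePredictor {
        uint64_t prediction = 8;                    // lifetime granted to each load (assumed start)
        static constexpr uint64_t step = 4;         // assumed adjustment step

        void on_expired_load()       { prediction += step; }                              // lifetimes too short
        void on_unexpired_store()    { if (prediction >= step) prediction -= step; }      // writes had to wait
        void on_unexpired_eviction() { if (prediction >= step) prediction -= step; }      // lifetimes too long

        uint64_t lifetime() const { return prediction; }   // used when the L2 bank serves a load
    };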

  22. TC-Strong vs TC-Weak. [Charts: speedup across all applications, comparing a single fixed lifetime for all applications against the best lifetime per application, for the TC-Strong variants (TCSUO, TCS, TCSOO), TC-Weak (TCW), and TCW with the predictor.]

  23. Interconnect Power and Energy
