
Managing Wire Delay in CMP Caches


Presentation Transcript


  1. Managing Wire Delay in CMP Caches. Brad Beckmann, Dissertation Defense. Multifacet Project, http://www.cs.wisc.edu/multifacet/, University of Wisconsin-Madison, 8/15/06

  2. Current CMP: AMD Athlon 64 X2. 2 CPUs, 2 L2 cache banks. [Diagram: CPU 0 and CPU 1, each paired with an L2 bank]

  3. CMP Cache Trends: today's technology (~90 nm) vs. future technology (< 45 nm). [Diagram: today, two CPUs with L1 I$/D$ caches and two L2 banks; in the future, an 8-CPU CMP where each CPU has its own L1 I$ and D$ and shares eight L2 banks]

  4. Baseline: CMP-Shared. Maximizes cache capacity, but suffers slow access latency (40+ cycles). [Diagram: 8 CPUs with private L1 I$/D$ caches surrounding 8 shared L2 banks]

  5. Baseline: CMP-Private. Fast access latency, but lower effective capacity. Thesis: achieve both fast access and high capacity. [Diagram: 8 CPUs, each with private L1 I$/D$ caches and its own private L2]

  6. Thesis Contributions
  • Characterizing CMP workloads by sharing type: single requestor, shared read-only, shared read-write
  • Techniques to manage wire delay:
    • Migration (previously discussed)
    • Selective replication (this talk's focus)
    • Transmission lines (previously discussed)
  • The combination outperforms the isolated techniques

  7. Outline
  • Introduction
  • Characterization of CMP working sets
    • L2 requests
    • L2 cache capacity
    • Sharing behavior
    • L2 request locality
  • ASR: Adaptive Selective Replication
  • Cache block migration
  • TLC: Transmission Line Caches
  • Combination of techniques

  8. Characterizing CMP Working Sets
  • 8-processor CMP, 16 MB shared L2 cache, 64-byte block size, 64 KB L1 I&D caches
  • Profile L2 blocks during their on-chip lifetime
  • Three L2 block sharing types (see the sketch below):
    • Single requestor: all requests by a single processor
    • Shared read-only: read-only requests by multiple processors
    • Shared read-write: read and write requests by multiple processors
  • Workloads
    • Commercial: apache, jbb, oltp, zeus
    • Scientific: apsi & art (SpecOMP); barnes & ocean (Splash)
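To illustrate, here is a minimal sketch of how L2 blocks could be classified into these three sharing types from a request trace. The trace format, function name, and tuple layout are my own assumptions for illustration; the actual profiling hooks into the simulator's L2 controller.

```python
from collections import defaultdict

def classify_blocks(trace):
    """Classify L2 blocks by sharing type over their on-chip lifetime.

    `trace` is a hypothetical time-ordered list of
    (block_address, cpu_id, is_write) tuples.
    """
    readers, writers = defaultdict(set), defaultdict(set)
    for block, cpu, is_write in trace:
        (writers if is_write else readers)[block].add(cpu)

    types = {}
    for block in set(readers) | set(writers):
        requestors = readers[block] | writers[block]
        if len(requestors) == 1:
            types[block] = "single requestor"
        elif not writers[block]:
            types[block] = "shared read-only"
        else:
            types[block] = "shared read-write"
    return types
```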

  9. Request Types [Chart: percent of L2 cache requests by sharing type] The majority of commercial workload requests are for shared blocks.

  10. [Chart: percent of L2 cache capacity by sharing type] The majority of capacity is consumed by single-requestor blocks.

  11. Costs of Replication
  • Decreased effective cache capacity: replicas are stored instead of unique blocks
    • Analyze the average number of sharers during a block's on-chip lifetime
  • Increased store latency: stores must invalidate remote read-only copies
    • Run length [Eggers & Katz ISCA 88]: the average number of intervening remote reads between writes from the same processor, plus intervening reads between writes from different processors, measured over L2 requests (see the sketch below)
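A minimal sketch of the run-length measurement as the slide defines it. The request format is a hypothetical (cpu_id, is_write) tuple stream for a single block; this is my reading of the Eggers & Katz metric, not verbatim dissertation code.

```python
def run_length(requests):
    """Average intervening reads between writes to one L2 block.

    Between consecutive writes, count only remote reads when both
    writes come from the same processor, and all reads when they
    come from different processors.
    """
    runs, reads, last_writer = [], [], None
    for cpu, is_write in requests:
        if is_write:
            if last_writer is not None:
                if cpu == last_writer:
                    runs.append(sum(1 for r in reads if r != cpu))
                else:
                    runs.append(len(reads))
            reads, last_writer = [], cpu
        else:
            reads.append(cpu)
    return sum(runs) / len(runs) if runs else 0.0
```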

  12. Sharing Behavior [Chart: breakdown of requests] Commercial workloads show few intervening requests between writes; across all workloads, blocks are widely shared.

  13. Locality Graphs [Graphs: example request-distribution curves illustrating high, intermediate, low, and no locality]

  14. Request-to-Block Distribution: Single Requestor Blocks [Chart] These blocks show lower locality.

  15. Request-to-Block Distribution: Shared Read-Only Blocks [Chart: L2 cache MRU hit ratio] These blocks show high locality.

  16. Request-to-Block Distribution: Shared Read-Write Blocks [Chart] These blocks show intermediate locality.

  17. Workload Characterization: Summary
  • Commercial workloads show significant shared read-only activity:
    • Most requests: 42-71%
    • Little capacity consumed without replication: 9-21%
    • Highly shared: 3.0-4.5 average processors
    • High request locality: 3% of blocks account for 70% of requests
  • Shared read-only data is a great candidate for selective replication

  18. Outline
  • Introduction
  • Characterization of CMP working sets
  • ASR: Adaptive Selective Replication
    • Replication's effect on memory performance
    • SPR: Selective Probabilistic Replication
    • Monitoring and adapting to workload behavior
    • Evaluation
  • Cache block migration
  • TLC: Transmission Line Caches
  • Combination of techniques

  19. Replication and Memory Cycles
  Memory cycles / instruction = (L1 misses / instruction) × (average cycles per L1 cache miss), where
  average cycles per L1 cache miss = (P_localL2 × L_localL2) + (P_remoteL2 × L_remoteL2) + (P_miss × L_miss).
  Each P is the fraction of L1 misses serviced by the local L2, a remote L2, or off-chip memory, and each L is the corresponding latency. This slide highlights the L2 hit terms.

  20. Replication Benefit: L2 Hit Cycles [Graph: L2 hit cycles vs. replication capacity; hit cycles fall as replication converts remote L2 hits into local ones]

  21. Replication and Memory Cycles
  The same equation, now highlighting the miss term: replication also changes (P_miss × L_miss), the off-chip component of the average cycles per L1 cache miss.
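To make the trade-off concrete, here is a small worked example of the formula with assumed, purely illustrative probabilities and latencies; none of these numbers come from the dissertation.

```python
# Average cycles per L1 cache miss, per the slide's equation.
def l1_miss_cycles(p_local, l_local, p_remote, l_remote, p_miss, l_miss):
    return p_local * l_local + p_remote * l_remote + p_miss * l_miss

# Assumed latencies: 12-cycle local L2 hit, 40-cycle remote L2 hit,
# 300-cycle off-chip miss. Replication shifts remote hits to local
# hits (benefit) but raises the miss probability (cost).
no_replication   = l1_miss_cycles(0.30, 12, 0.60, 40, 0.10, 300)  # 57.6
with_replication = l1_miss_cycles(0.55, 12, 0.33, 40, 0.12, 300)  # 55.8
```

Under these made-up numbers replication wins slightly; whether it wins at all depends on the balance, which is exactly why the following slides seek the optimal replication capacity.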

  22. Replication Cost: L2 Miss Cycles [Graph: L2 miss cycles vs. replication capacity; miss cycles rise as replicas displace unique blocks]

  23. Optimal Replication Effectiveness: Total Cycles [Graph: total cycle curve vs. replication capacity; the sum of hit and miss cycles reaches a minimum at the optimal replication capacity]

  24. Outline
  • Wires and CMP caches
  • Characterization of CMP working sets
  • ASR: Adaptive Selective Replication
    • Replication's effect on memory performance
    • SPR: Selective Probabilistic Replication
    • Monitoring and adapting to workload behavior
    • Evaluation
  • Cache block migration
  • TLC: Transmission Line Caches
  • Combination of techniques

  25. Identifying and Replicating Shared Read-Only Blocks
  • Minimal coherence impact; per-cache-block identification
  • A heuristic, not perfect (see the sketch below):
    • Dirty bit: indicates written data; leverages a current bandwidth-reduction optimization
    • Shared bit: indicates multiple sharers; set for blocks with multiple requestors
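A minimal sketch of the heuristic, assuming the coherence protocol maintains the two bits described above; the type and function names are mine, for illustration.

```python
from dataclasses import dataclass

@dataclass
class BlockState:
    dirty: bool    # set when the block has been written
    shared: bool   # set when multiple processors have requested it

def should_replicate(block: BlockState) -> bool:
    """Treat a block as shared read-only if multiple processors have
    requested it and it has never been written. Imperfect: a block
    written once and then only read is never replicated."""
    return block.shared and not block.dirty
```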

  26. SPR: Selective Probabilistic Replication
  • A mechanism for selective replication: controls duplication between the L2 caches in CMP-Private
  • Relaxed L2 inclusion property: L2 evictions do not force L1 evictions (a non-exclusive cache hierarchy)
  • Ring writebacks: L1 writebacks pass clockwise between private L2 caches and merge with existing L2 copies
  • Probabilistically choose between (see the sketch below):
    • Local writeback: allow replication
    • Ring writeback: disallow replication
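A sketch of the probabilistic writeback choice. The per-level probabilities are illustrative placeholders, not the dissertation's tuned constants, and `local_l2` / `ring` are hypothetical callbacks into the cache controller.

```python
import random

# Replication probability for each SPR replication level (0-5);
# assumed values for illustration only.
REPLICATION_PROB = [0.0, 0.05, 0.25, 0.50, 0.75, 1.0]

def on_l1_writeback(block, level, local_l2, ring):
    """Route an L1 writeback: a local L2 writeback creates a replica,
    while a ring writeback passes the block clockwise to merge with
    an existing L2 copy.
    """
    # Reuse the shared-read-only heuristic from slide 25.
    if block.shared and not block.dirty and \
            random.random() < REPLICATION_PROB[level]:
        local_l2(block)   # allow replication
    else:
        ring(block)       # disallow replication
```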

  27. SPR: Selective Probabilistic Replication [Diagram: 8 CPUs, each with L1 I$/D$ caches and a private L2, connected in a ring for writebacks]

  28. SPR: Replication Levels [Graph: replication capacity of levels 0-5 on real workloads; the current level determines how aggressively writebacks replicate]

  29. Outline
  • Introduction
  • Characterization of CMP working sets
  • ASR: Adaptive Selective Replication
    • Replication's effect on memory performance
    • SPR: Selective Probabilistic Replication
    • Implementing ASR
    • Evaluation
  • Cache block migration
  • TLC: Transmission Line Caches
  • Combination of techniques

  30. Implementing ASR
  • Four mechanisms estimate the cost and benefit deltas:
    • Decrease-in-replication benefit
    • Increase-in-replication benefit
    • Decrease-in-replication cost
    • Increase-in-replication cost
  • Plus a mechanism for triggering a cost-benefit analysis

  31. ASR: Decrease-in-Replication Benefit [Graph: L2 hit cycles vs. replication capacity, marking the current level and the next lower level]

  32. ASR: Decrease-in-Replication Benefit
  • Goal: determine the replication benefit lost by moving to the next lower level
  • Mechanism: a current replica bit per L2 cache block (see the sketch below)
    • Set for replications made at the current level; not set for replications the lower level would also make
    • Hits on current replicas would be remote hits at the next lower level
  • Overhead: 1 bit × 256 K L2 blocks = 32 KB
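A sketch of how the current replica bit translates into a benefit estimate: each local hit on a current-level-only replica would have paid a remote L2 latency one level down. The latencies are assumed placeholder values.

```python
def decrease_benefit_delta(block, local_lat=12, remote_lat=40):
    """Cycles that would be lost, per hit, if replication dropped one
    level. `block.current_replica` is the per-block bit: set only for
    replicas the next lower level would not have created."""
    return (remote_lat - local_lat) if block.current_replica else 0
```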

  33. ASR: Increase-in-Replication Benefit [Graph: L2 hit cycles vs. replication capacity, marking the current level and the next higher level]

  34. ASR: Increase-in-Replication Benefit
  • Goal: determine the replication benefit gained by moving to the next higher level
  • Mechanism: Next Level Hit Buffers (NLHBs), 8-bit partial tag buffers that track the replicas the next higher level would create (see the sketch below)
    • NLHB hits would be local L2 hits at the next higher level
  • Overhead: 8 bits × 16 K entries × 8 processors = 128 KB
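A sketch of an NLHB, using the slide's sizing (16 K entries of 8-bit partial tags per processor). The direct-mapped organization and the index/tag split of the address are my assumptions.

```python
class NLHB:
    """Next Level Hit Buffer: partial tags of replicas that the next
    *higher* replication level would have created locally. A hit here
    means a remote L2 hit would have been a local hit one level up."""

    def __init__(self, entries=16 * 1024):
        self.entries = entries
        self.tags = [None] * entries

    def _index_tag(self, addr):
        # Assumed split: low bits index, next 8 bits form a partial tag.
        return addr % self.entries, (addr // self.entries) & 0xFF

    def insert(self, addr):
        """Record a replica that the current level rejected."""
        i, t = self._index_tag(addr)
        self.tags[i] = t

    def hit(self, addr):
        """Probe on a remote L2 hit."""
        i, t = self._index_tag(addr)
        return self.tags[i] == t
```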

  35. ASR: Decrease-in-Replication Cost [Graph: L2 miss cycles vs. replication capacity, marking the current level and the next lower level]

  36. ASR: Decrease-in-Replication Cost
  • Goal: determine the replication cost avoided by moving to the next lower level
  • Mechanism: Victim Tag Buffers (VTBs), 16-bit partial tags of blocks recently evicted at the current replication level (see the sketch below)
    • VTB hits would be on-chip hits at the next lower level
  • Overhead: 16 bits × 1 K entries × 8 processors = 16 KB
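A sketch of a VTB with the slide's sizing (1 K entries of 16-bit partial tags per processor). FIFO replacement and the tag hash are assumptions for illustration.

```python
from collections import OrderedDict

class VictimTagBuffer:
    """Partial tags of blocks recently evicted at the current
    replication level. An off-chip miss that matches here would have
    stayed on chip at the next *lower* level."""

    def __init__(self, entries=1024):
        self.entries = entries
        self.tags = OrderedDict()   # FIFO replacement (an assumption)

    def insert(self, addr):
        self.tags[addr & 0xFFFF] = True      # 16-bit partial tag
        if len(self.tags) > self.entries:
            self.tags.popitem(last=False)    # evict the oldest entry

    def hit(self, addr):
        return (addr & 0xFFFF) in self.tags
```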

  37. ASR: Increase-in-Replication Cost [Graph: L2 miss cycles vs. replication capacity, marking the current level and the next higher level]

  38. ASR: Increase-in-Replication Cost
  • Goal: determine the replication cost added by moving to the next higher level
  • Mechanism: way and set counters [Suh et al. HPCA 2002] identify soon-to-be-evicted blocks (see the sketch below)
    • 16-way pseudo-LRU, 256 set groups
    • On-chip hits to these blocks would be off-chip misses at the next higher level
  • Overhead: 255-bit pseudo-LRU tree × 8 processors = 255 B
  • Overall storage overhead: 212 KB, or 1.2% of total storage
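A sketch of the way-counter test, under the assumption that moving up a replication level effectively consumes some number of extra ways per set with replicas; the `extra_replica_ways` parameter is mine, not from the slides.

```python
def hit_would_become_miss(lru_stack_pos, ways=16, extra_replica_ways=2):
    """True if a hit lands near the LRU end of a 16-way set, so the
    block would have been evicted had the next higher replication
    level occupied `extra_replica_ways` more ways with replicas.
    `lru_stack_pos` is 0 for MRU up to ways-1 for LRU."""
    return lru_stack_pos >= ways - extra_replica_ways
```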

  39. ASR: Triggering a Cost-Benefit Analysis
  • Goal: dynamically adapt to workload behavior while avoiding unnecessary replication-level changes
  • Mechanism:
    • Evaluation trigger: local replications or NLHB allocations exceed 1 K
    • Replication-level change: four consecutive evaluations in the same direction

  40. ASR: Adaptive Algorithm
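The adaptive algorithm on this slide combines the four delta estimates with the trigger and hysteresis from slide 39. Below is a sketch of that control loop; the decision rule (compare benefit against cost in each direction) is my reading of slides 30-39, not verbatim dissertation code.

```python
class ASRController:
    """Sketch of ASR's trigger plus hysteresis."""

    def __init__(self, levels=6):
        self.level, self.levels = 0, levels
        self.events, self.streak, self.last = 0, 0, 0

    def on_event(self, dec_benefit, inc_benefit, dec_cost, inc_cost):
        """Call on each local replication or NLHB allocation, passing
        the four accumulated cycle estimates from slides 31-38."""
        self.events += 1
        if self.events < 1024:          # trigger: events exceed 1 K
            return
        self.events = 0
        # Direction: +1 if more replication saves more than it costs,
        # -1 if less replication saves more than it loses, else 0.
        d = +1 if inc_benefit > inc_cost else \
            (-1 if dec_cost > dec_benefit else 0)
        if d != 0 and d == self.last:
            self.streak += 1
        else:
            self.streak = 1 if d != 0 else 0
        self.last = d
        # Change level only after four consecutive agreeing evaluations.
        if self.streak >= 4:
            self.level = max(0, min(self.levels - 1, self.level + d))
            self.streak = 0
```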

  41. Outline
  • Introduction
  • Characterization of CMP working sets
  • ASR: Adaptive Selective Replication
    • Replication's effect on memory performance
    • SPR: Selective Probabilistic Replication
    • Implementing ASR
    • Evaluation
  • Cache block migration
  • TLC: Transmission Line Caches
  • Combination of techniques

  42. Methodology
  • Full-system simulation: Simics with Wisconsin's GEMS timing simulator (out-of-order processor and memory system models)
  • Workloads
    • Commercial: apache, jbb, oltp, zeus
    • Scientific: not shown here; see the dissertation

  43. System Parameters [Table: 8-core CMP, 45 nm technology]

  44. Replication Benefit, Cost, & Effectiveness Curves [Graphs: measured benefit and cost curves]

  45. Replication Benefit, Cost, & Effectiveness Curves [Graph: effectiveness curves for a 4 MB cache and 150-cycle memory latency]

  46. ASR: Adapting to Workload Behavior [Graph: replication level over time, OLTP, all CPUs]

  47. ASR: Adapting to Workload Behavior [Graph: replication level over time, Apache, all CPUs]

  48. ASR: Adapting to Workload Behavior [Graph: replication level over time, Apache, CPU 0]

  49. ASR: Adapting to Workload Behavior [Graph: replication level over time, Apache, CPUs 1-7]

  50. Comparison of Replication Policies
  • SPR admits multiple possible policies; I evaluated four shared read-only replication policies:
    • VR: Victim Replication, previously proposed [Zhang ISCA 05]; disallows replicas from evicting shared owner blocks
    • NR: CMP-NuRapid, previously proposed [Chishti ISCA 05]; replicates upon the second request
    • CC: Cooperative Caching, previously proposed [Chang ISCA 06]; replaces replicas first and spills singlets to remote caches, with a tunable parameter (100%, 70%, 30%, 0%)
    • ASR: Adaptive Selective Replication, my proposal; monitors and adjusts to workload demand
  • The previously proposed policies lack dynamic adaptation
