This lecture focuses on fair caching mechanisms in Chip-Multiprocessor (CMP) architectures, emphasizing the challenges posed by unfair cache sharing. The presentation highlights the impact of unfair sharing under uniprocessor and CMP scheduling, including priority inversion leading to performance degradation. Various metrics, including uniform slowdown and miss-rate profiling, are discussed in detail, providing insights into the design of partitionable caches. Understanding these mechanisms helps improve overall system throughput and mitigate starvation in concurrent processing environments.
ECE8833 Polymorphous and Many-Core Computer Architecture Lecture 6 Fair Caching Mechanisms for CMP Prof. Hsien-Hsin S. Lee School of Electrical and Computer Engineering
Cache Sharing in CMP [Kim, Chandra, Solihin, PACT'04]
Processor Core 1 and Processor Core 2 each have a private L1 $ and share the L2 $. When threads t1 and t2 run together on the two cores, t2's throughput is significantly reduced due to unfair cache sharing.
Slide courtesy: Seongbeom Kim, D. Chandra and Y. Solihin @NCSU
Shared L2 Cache Space Contention Slide courtesy: Seongbeom Kim, D. Chandra and Y. Solihin @NCSU
Impact of Unfair Cache Sharing
• Uniprocessor scheduling: one core runs time slices t1, t2, t3, t1, t4 in turn
• 2-core CMP scheduling: P1 runs t1, t1, t1, t1, t1 while P2 runs t3, t3, t2, t2, t4
• If gzip is set to run at higher priority, it will get more time slices than the others, yet it could still run slower than them due to unfair cache sharing (priority inversion)
• It could further slow down the other processes (starvation)
• Thus the overall throughput is reduced; the desired behavior is a uniform slowdown across threads
Stack Distance Profiling Algorithm [Qureshi+, MICRO-39]
The cache tags in a set are ordered from MRU to LRU, with one hit counter per recency position (CTR Pos 0 … CTR Pos 3).
Example hit counter values: CTR Pos 0 = 30, CTR Pos 1 = 20, CTR Pos 2 = 15, CTR Pos 3 = 10; Misses = 25.
Stack Distance Profiling
• A counter for each cache way; C>A counts the misses (accesses beyond the associativity A)
• Shows the reuse frequency for each way in a cache
• Can be used to predict the misses for any associativity smaller than A
• Misses for a 2-way cache for gzip = C>A + Σ Ci, where i = 3 to 8
• art does not need all the space, due to its likely poor temporal locality
• If art's space is halved and given to gzip, what happens?
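The prediction above can be sketched in a few lines. This is an illustrative sketch, not code from the paper; the counter values are made up for an 8-way profile.

```python
def predicted_misses(hit_counters, misses_full, target_ways):
    """Misses for a target_ways-way cache = misses at full associativity
    plus all hits that occurred at stack depths >= target_ways
    (those hits would become misses in the smaller cache)."""
    return misses_full + sum(hit_counters[target_ways:])

# hit_counters[i] = hits at LRU stack position i (0 = MRU), 8-way profile
hits = [30, 20, 15, 10, 8, 6, 4, 2]
full_misses = 25

# For a 2-way cache, hits at depths 2..7 turn into misses.
print(predicted_misses(hits, full_misses, 2))  # -> 70
```

Because LRU is a stack algorithm, this single profile predicts misses for every smaller associativity at once.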
Fairness Metrics [Kim et al. PACT'04]
• Uniform slowdown: every thread should be slowed down by the same factor when sharing the cache, i.e., for each pair of threads i and j,
  T_shared(i) / T_alone(i) = T_shared(j) / T_alone(j)
  where T_alone(i) is the execution time of thread i when it runs alone and T_shared(i) is its execution time when it shares the cache with others.
• We want to minimize the difference between these per-thread slowdown ratios.
• Ideally the ratios are equal; since execution time is hard to measure online, miss rates stand in for it: try to equalize the ratio of miss increase of each thread.
Slide courtesy: Seongbeom Kim, D. Chandra and Y. Solihin @NCSU
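A miss-rate-based evaluation of this kind (the "M3"-style check used in the interval walkthrough later) can be sketched as follows. This is an illustrative sketch, assuming the metric is the sum of pairwise differences between per-thread miss-rate ratios; the input values come from the walkthrough slides.

```python
def fairness_m3(miss_rate_shared, miss_rate_alone):
    """Approximate each thread's slowdown by its shared-mode miss rate over
    its alone-mode miss rate; fairness = spread between the ratios."""
    ratios = [s / a for s, a in zip(miss_rate_shared, miss_rate_alone)]
    # Sum of pairwise differences; 0 means perfectly uniform slowdown.
    return sum(abs(ratios[i] - ratios[j])
               for i in range(len(ratios))
               for j in range(i + 1, len(ratios)))

# From the walkthrough: P1 at 20%/20% (ratio 1.0), P2 at 15%/5% (ratio 3.0)
print(fairness_m3([0.20, 0.15], [0.20, 0.05]))  # metric is about 2.0
```

The repartitioning algorithm then shifts capacity toward the thread with the larger ratio to shrink this spread.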
Partitionable Cache Hardware [G. E. Suh et al., HPCA 2002]
• Modified LRU cache replacement policy with a per-thread counter
• Example: Current Partition P1: 448B / P2: 576B; Target Partition P1: 384B / P2: 640B. On a P2 miss, P2 is below its target, so the victim is taken from P1's LRU lines; the partition converges until Current matches Target (P1: 384B, P2: 640B)
• Partition granularity could be as coarse as one entire cache way
Slide courtesy: Seongbeom Kim, D. Chandra and Y. Solihin @NCSU
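The victim-selection rule above can be sketched as follows. This is a minimal illustrative sketch, assuming byte-granularity counters as in the example; line ids and sizes are made up.

```python
def choose_victim(lru_order, owner, current, target, miss_thread):
    """lru_order: line ids from LRU to MRU; owner: line id -> thread id.
    If the missing thread is at or over its target size, evict one of its
    own lines; otherwise evict from an over-allocated thread."""
    if current[miss_thread] >= target[miss_thread]:
        pool = [l for l in lru_order if owner[l] == miss_thread]
    else:
        pool = [l for l in lru_order if owner[l] != miss_thread]
    return pool[0] if pool else lru_order[0]   # LRU-most eligible line

lines = ["a", "b", "c", "d"]               # "a" is the LRU line
owner = {"a": 1, "b": 2, "c": 1, "d": 2}
# P2 misses while below its 640B target, so a P1 line is evicted.
print(choose_victim(lines, owner, {1: 448, 2: 576}, {1: 384, 2: 640}, 2))
```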
Dynamic Fair Caching Algorithm
• MissRate alone: counters that keep each process's miss rate when running alone (from stack distance profiling)
• MissRate shared: counters that keep the dynamic miss rates while running with a shared cache
• Target Partition: counters that keep the target partition size per process
• Example: optimizing the M3 metric; a repartitioning interval of 10K accesses was found to be the best
Slide courtesy: Seongbeom Kim, D. Chandra and Y. Solihin @NCSU
Dynamic Fair Caching Algorithm: 1st Interval
• MissRate alone: P1: 20%, P2: 5%
• MissRate shared (measured this interval): P1: 20%, P2: 15%
• Target Partition: P1: 256KB, P2: 256KB
Slide courtesy: Seongbeom Kim, D. Chandra and Y. Solihin @NCSU
Repartition after the 1st interval
• Evaluate M3: P1: 20% / 20% = 1.0; P2: 15% / 5% = 3.0
• P2 suffers the larger slowdown, so repartition: Target moves from P1: 256KB / P2: 256KB to P1: 192KB / P2: 320KB
• Partition granularity: 64KB
Slide courtesy: Seongbeom Kim, D. Chandra and Y. Solihin @NCSU
2nd Interval
• MissRate alone: P1: 20%, P2: 5%; Target Partition: P1: 192KB, P2: 320KB
• MissRate shared is measured again over the new interval
Slide courtesy: Seongbeom Kim, D. Chandra and Y. Solihin @NCSU
Repartition after the 2nd interval
• New MissRate shared: P1: 20%, P2: 10%
• Evaluate M3: P1: 20% / 20% = 1.0; P2: 10% / 5% = 2.0
• P2 is still slowed more, so repartition again: Target moves from P1: 192KB / P2: 320KB to P1: 128KB / P2: 384KB
Slide courtesy: Seongbeom Kim, D. Chandra and Y. Solihin @NCSU
3rd Interval
• MissRate alone: P1: 20%, P2: 5%; Target Partition: P1: 128KB, P2: 384KB
• New MissRate shared: P1: 25%, P2: 9% (previously P1: 20%, P2: 10%)
Slide courtesy: Seongbeom Kim, D. Chandra and Y. Solihin @NCSU
Repartition with rollback
• After the 3rd interval, P1's miss rate rose (20% → 25%) while P2's barely improved (10% → 9%)
• Do rollback if Δ < T_rollback, where Δ = MR_old − MR_new; the best T_rollback threshold was found to be 20%
• P2's improvement is below the threshold, so the partition rolls back from P1: 128KB / P2: 384KB to P1: 192KB / P2: 320KB
Slide courtesy: Seongbeom Kim, D. Chandra and Y. Solihin @NCSU
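One repartitioning step with rollback can be sketched as below. This is an illustrative sketch following the interval walkthrough above; it assumes the rollback threshold is relative to the old miss rate, and the sizes and ratios mirror the slides rather than any real measurement.

```python
GRANULE_KB = 64      # partition granularity from the walkthrough
T_ROLLBACK = 0.20    # roll back if the gainer's miss rate dropped < 20%

def repartition(target, ratios):
    """Move one granule from the least-slowed to the most-slowed thread.
    ratios: thread -> MissRate_shared / MissRate_alone (higher = more hurt).
    target: thread -> partition size in KB (mutated and returned)."""
    gainer = max(ratios, key=ratios.get)   # most unfairly slowed thread
    donor = min(ratios, key=ratios.get)
    target[gainer] += GRANULE_KB
    target[donor] -= GRANULE_KB
    return target

def should_rollback(mr_old, mr_new):
    # Undo the last move if the beneficiary's miss rate barely improved.
    return (mr_old - mr_new) < T_ROLLBACK * mr_old

# First interval: P1 at 20%/20% (ratio 1.0), P2 at 15%/5% (ratio 3.0)
print(repartition({1: 256, 2: 256}, {1: 1.0, 2: 3.0}))  # P2 gains 64KB
print(should_rollback(0.10, 0.09))  # 10% -> 9% is too small a gain: True
```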
Generic Repartitioning Algorithm
• Pick the processes with the largest and the smallest slowdown as a pair for repartitioning
• Repeat for all candidate processes
Running Processes on Dual-Core [Qureshi & Patt, MICRO-39]
• LRU: in real runs, on average 7 ways were allocated to equake and 9 to vpr
• UTIL: how much you use (in a set) is how much you will get
• Ideally: 3 ways to equake and 13 to vpr (x-axis of the plots: # of ways given, 1 to 16)
Defining Utility
Utility U(a,b) = Misses with a ways − Misses with b ways
Plotting misses per 1000 instructions against the number of ways (16-way 1MB L2) shows three cases: low utility, high utility, and saturating utility.
Slide courtesy: Moin Qureshi, MICRO-39
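Given stack-distance hit counters, this utility is just the hits gained by the extra ways. A tiny illustrative sketch (the counter values are made up, not from the paper):

```python
def utility(hit_counters, a, b):
    """U(a,b) = misses with a ways - misses with b ways, which for an LRU
    stack profile equals the hits landing at positions a..b-1."""
    return sum(hit_counters[a:b])

hits = [50, 20, 10, 4, 2, 1, 1, 0]   # illustrative 8-way profile
print(utility(hits, 1, 2))   # high utility for an early way: 20
print(utility(hits, 6, 8))   # saturating utility for the last ways: 1
```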
Framework for UCP
Each core (Core1, Core2) has private I$ and D$ and its own utility monitor (UMON1, UMON2); both share the L2 cache, backed by main memory, with the partitioning algorithm (PA) sitting between the UMONs and the L2.
Three components:
• Utility Monitors (UMON) per core
• Partitioning Algorithm (PA)
• Replacement support to enforce partitions
Slide courtesy: Moin Qureshi, MICRO-39
Utility Monitors (UMON)
• For each core, simulate the LRU policy using an Auxiliary Tag Directory (ATD) covering the sets (A–H in the figure)
• UMON-global: one way-counter for all sets; hit counters H0 (MRU) … H15 (LRU) in the ATD count hits per recency position
• LRU is a stack algorithm, so hit counts give utility directly, e.g. hits(2 ways) = H0 + H1
Utility Monitors (UMON): reducing overhead
• The extra ATD tags incur hardware and power overhead
• Dynamic Set Sampling (DSS) reduces this overhead [Qureshi et al. ISCA'06]: 32 sampled sets are sufficient based on Chebyshev's inequality
• The paper samples every 32nd set (simple static sampling)
• Storage < 2KB per UMON (about 0.17% of the L2)
Partitioning Algorithm (PA)
• Evaluate all possible partitions and select the best
• With a ways to core1 and (16 − a) ways to core2:
  Hits_core1 = H0 + H1 + … + H(a−1) ---- from UMON1
  Hits_core2 = H0 + H1 + … + H(16−a−1) ---- from UMON2
• Select the a that maximizes (Hits_core1 + Hits_core2)
• Partitioning is done once every 5 million cycles
• After each partitioning interval, the hit counters in all UMONs are halved to retain some past information
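The exhaustive dual-core search above is small enough to write out directly. A minimal sketch; the UMON counter values are invented for illustration (one core saturating quickly, the other gaining steadily), not taken from the paper.

```python
def best_partition(h1, h2, total_ways=16):
    """Try every split of total_ways ways between two cores and return the
    (core1, core2) allocation that maximizes the combined hit count."""
    best_a, best_hits = 1, -1
    for a in range(1, total_ways):          # at least one way per core
        hits = sum(h1[:a]) + sum(h2[:total_ways - a])
        if hits > best_hits:
            best_a, best_hits = a, hits
    return best_a, total_ways - best_a

umon1 = [40, 10, 2, 1] + [0] * 12           # saturates after a few ways
umon2 = [30, 25, 20, 15, 10, 8, 6, 5, 4, 3, 2, 2, 1, 1, 1, 1]
print(best_partition(umon1, umon2))         # most ways go to the second core
```

For more cores the search space grows combinatorially, which is why the paper discusses cheaper approximations; for dual-core the brute-force scan over 15 splits is trivial.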
Replacement Policy to Reach Desired Partition
Use way partitioning [Suh+ HPCA'02, Iyer ICS'04]:
• Each line contains core-id bits
• On a miss, count ways_occupied in the set by the miss-causing app
• If ways_occupied < ways_given, the victim is the LRU line of the other app; otherwise the victim is the LRU line of the miss-causing app
• A binary decision for dual-core (in this paper)
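The binary decision can be sketched as below. This is an illustrative sketch of the dual-core case only; the four-way set and line names are made up.

```python
def victim(lru_order, line_core, miss_core, ways_given):
    """lru_order: line ids from LRU to MRU; line_core: line id -> core id.
    If the missing core occupies fewer ways than its quota, steal the other
    core's LRU line; otherwise recycle the missing core's own LRU line."""
    occupied = sum(1 for l in lru_order if line_core[l] == miss_core)
    take_from_other = occupied < ways_given[miss_core]
    for line in lru_order:                       # walk from LRU toward MRU
        if (line_core[line] != miss_core) == take_from_other:
            return line

order = ["w0", "w1", "w2", "w3"]                 # w0 = LRU
cores = {"w0": 0, "w1": 1, "w2": 0, "w3": 0}
# Core 1 holds 1 way but is given 2, so it takes core 0's LRU line.
print(victim(order, cores, 1, {0: 2, 1: 2}))     # -> w0
```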
UCP Performance (Weighted Speedup) UCP improves average weighted speedup by 11% (Dual Core)
UCP Performance (Throughput) UCP improves average throughput by 17%
Conventional LRU
An incoming block is inserted at the MRU position and marches down to LRU; a block that is never reused occupies a cache block for a long time with no benefit!
Slide Source: Yuejian Xie
LIP: LRU Insertion Policy [Qureshi et al. ISCA'07]
The incoming block is inserted at the LRU position instead of MRU, and is promoted only if it is actually reused. LIP is not entirely new: Intel tried this in 1998 when designing "Timna" (which integrated the CPU and a graphics accelerator sharing the L2).
Slide Source: Yuejian Xie
BIP: Bimodal Insertion Policy [Qureshi et al. ISCA'07]
• LIP may not age out older lines, so BIP infrequently inserts lines at the MRU position
• Let ε = bimodal throttle parameter:

  if ( rand() < ε )
      Insert at MRU position;   // as in the LRU replacement policy
  else
      Insert at LRU position;

• Promote to MRU if reused
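The pseudocode above can be made runnable as follows. A minimal sketch; the ε value of 1/32 is an assumption for illustration (the paper evaluates small throttle values of this order), and the 4-entry set is made up.

```python
import random

EPSILON = 1 / 32   # assumed bimodal throttle parameter

def bip_insert(stack, block):
    """stack: list ordered MRU -> LRU. Evict the LRU block, then insert the
    new block at MRU rarely (probability EPSILON) and at LRU otherwise."""
    stack.pop()                          # evict the LRU block
    if random.random() < EPSILON:
        stack.insert(0, block)           # rare: MRU insert (LRU behavior)
    else:
        stack.append(block)              # common: LRU insert (LIP behavior)
    return stack

def promote_on_hit(stack, block):
    stack.remove(block)
    stack.insert(0, block)               # promote to MRU only when reused
    return stack

random.seed(0)                           # deterministic for the demo
print(bip_insert(["a", "b", "c", "d"], "e"))   # "e" usually lands at LRU
```

The rare MRU inserts are what let BIP keep a trickle of fresh blocks aging through the set, fixing LIP's staleness problem.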
DIP: Dynamic Insertion Policy [Qureshi et al. ISCA'07]
• DIP dynamically chooses between LRU and BIP (which inserts at LRU with probability 1−ε and at MRU with probability ε)
• Two types of workloads: LRU-friendly or BIP-friendly
• DIP can be implemented by monitoring both policies (LRU and BIP), choosing the best-performing one, and applying that policy to the cache
• Needs a cost-effective implementation: "Set Dueling"
Set Dueling for DIP [Qureshi et al. ISCA'07]
Divide the cache in three:
• Dedicated LRU sets
• Dedicated BIP sets
• Follower sets (use the winner of LRU vs. BIP)
A single n-bit saturating counter monitors both: misses to the LRU sets increment it, misses to the BIP sets decrement it. The counter's MSB decides the policy for the follower sets:
• MSB = 0: use LRU
• MSB = 1: use BIP
Slide Source: Moin Qureshi
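The single-counter mechanism can be sketched as below. An illustrative sketch: the counter width and the synthetic "thrashing" load are assumptions, not values from the paper.

```python
N_BITS = 10
MAX = (1 << N_BITS) - 1

class SetDuel:
    def __init__(self):
        self.psel = MAX // 2             # n-bit saturating policy selector

    def miss_in_lru_set(self):           # LRU dedicated set missed
        self.psel = min(self.psel + 1, MAX)

    def miss_in_bip_set(self):           # BIP dedicated set missed
        self.psel = max(self.psel - 1, 0)

    def follower_policy(self):
        # MSB clear -> LRU is winning; MSB set -> LRU misses dominate, use BIP
        return "BIP" if self.psel >> (N_BITS - 1) else "LRU"

d = SetDuel()
for _ in range(600):                     # LRU sets thrash on this workload
    d.miss_in_lru_set()
print(d.follower_policy())               # -> BIP
```

Only the handful of dedicated sets pay the cost of running the losing policy; the follower sets always get the current winner.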
PIPP [Xie & Loh ISCA'09]
• What's PIPP? Promotion/Insertion Pseudo-Partitioning
• Achieves both capacity management (like UCP) and dead-time management (like DIP)
• Eviction: the LRU block is the victim
• Insertion: a new block is inserted the core's quota worth of positions away from LRU (e.g., insert position = 3 when the target allocation is 3)
• Promotion: on a hit, a block moves toward MRU by only one position
Slide Source: Yuejian Xie
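The insertion and promotion rules can be sketched as below. An illustrative sketch: the 8-way set contents and quotas mirror the example that follows, and the list-based stack is a simplification of real hardware.

```python
def pipp_insert(stack, block, quota):
    """stack: list ordered MRU -> LRU. Evict the LRU block, then insert the
    new block so it sits quota positions from the LRU end."""
    stack.pop()                              # evict the LRU block
    pos = max(len(stack) - quota + 1, 0)     # quota-th slot from LRU end
    stack.insert(pos, block)
    return stack

def pipp_promote(stack, block):
    """On a hit, promote the block toward MRU by a single position."""
    i = stack.index(block)
    if i > 0:
        stack[i - 1], stack[i] = stack[i], stack[i - 1]
    return stack

s = ["1", "A", "2", "3", "4", "B", "5", "C"]  # 8-way set, MRU -> LRU
s = pipp_insert(s, "D", quota=3)              # Core1's quota is 3
print(s)                                      # D lands 3 slots from LRU
s = pipp_promote(s, "D")                      # a hit moves D up by one
print(s)
```

Inserting near LRU means a never-reused block is evicted quickly (dead-time management), while the per-core insert depth approximates a capacity partition without strict way reservation.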
PIPP Example (Core0 quota: 5 blocks, Core1 quota: 3 blocks; set contents shown MRU to LRU, with Core0's blocks numbered and Core1's lettered)
• Start: 1 A 2 3 4 B 5 C. Core1 requests D (miss): evict LRU block C, insert D three positions (Core1's quota) from LRU → 1 A 2 3 4 D B 5
• Core0 requests 6 (miss): evict LRU block 5, insert 6 five positions (Core0's quota) from LRU → 1 A 2 6 3 4 D B
• Core0 requests 7 (miss): evict LRU block B, insert 7 the same way → 1 A 2 7 6 3 4 D
• Core1 requests D (hit): promote D by one position toward MRU
Slide Source: Yuejian Xie
How PIPP Does Both Kinds of Management
Inserting closer to the LRU position limits how long a never-reused block can occupy the cache (dead-time management), while each core's quota-based insertion depth approximates its capacity partition.
Slide Source: Yuejian Xie