Utility-Based Cache Partitioning: High-Performance Runtime Mechanism

MoinuddinK.Qureshi, Univ of Texas at Austin MICRO’2006 Utility-Based Partitioning :A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches 2007, 12, 05 PAK, EUNJI

Outline • Introduction and Motivation • Utility-Based Cache Partitioning • Evaluation • Scalable Partitioning Algorithm • Related Work and Summary

Introduction • CMP and shared caches are common • Applications compete for the shared cache • Partitioning policies critical for high performance • Traditional policies: • Equal (half-and-half)  Performance isolation. No adaptation • LRU  Demand based. Demand ≠ benefit (e.g. streaming)

Background Utility Uab = Misses with a ways – Misses with b ways Low Utility High Utility Saturating Utility

Motivation Improve performance by giving more cache to the application that benefits more from cache

PA UMON2 UMON1 Framework for UCP • Three components: • Utility Monitors (UMON) per core • Partitioning Algorithm (PA) • Replacement support to enforce partitions Shared L2 cache I$ I$ Core1 Core2 D$ D$ Main Memory

(MRU)H0 H1 H2…H15(LRU) + + + + Set A Set A Set B Set B Set C Set C ATD Set D Set D Set E Set E Set F Set F Set G Set G Set H Set H Utility Monitors (UMON) • For each core, simulate LRU policy using ATD • Hit counters in ATD to count hits per recency position • LRU is a stack algorithm: hit counts  utility E.g. hits(2 ways) = H0+H1 MTD

Set A Set B Set B Set E Set C Set G Set D Set E Set F Set G Set H (MRU)H0 H1 H2…H15(LRU) + + + + Set A Set A Set B Set B Set C Set C Set D Set D Set E Set E Set F Set F Set G Set G Set H Set H Dynamic Set Sampling (DSS) • Extra tags incur hardware and power overhead • DSS reduces overhead [Qureshi,ISCA’06] • 32 sets sufficient (analytical bounds) • Storage < 2kB/UMON MTD ATD UMON (DSS)

DSS Bounds with Analytical Model Us = Sampled mean (Num ways allocated by DSS) Ug= Global mean (Num ways allocated by Global) P = P(Us within 1 way of Ug) By Cheb. inequality:P ≥ 1 – variance/n n = number of sampled sets In general, variance ≤ 3

Partitioning algorithm • Evaluate all possible partitions and select the best • With aways to core1 and (16-a) ways to core2: Hitscore1 = (H0 + H1 + … + Ha-1) ---- from UMON1 Hitscore2 = (H0 + H1 + … + H16-a-1) ---- from UMON2 • Select a that maximizes (Hitscore1 + Hitscore2) • Partitioning done once every 5 million cycles

ways_occupied < ways_given Yes No Victim is the LRU line from miss-causing app Victim is the LRU line from other app Way Partitioning • Way partitioning support: [Suh+ HPCA’02, Iyer ICS’04] • Each line has core-id bits • On a miss, count ways_occupied in set by miss-causing application

Evaluation Methodology • Configuration • Two cores: 8-wide, 128-entry window • Private L1s • L2: Shared, unified, 1MB, 16-way • LRU-based Memory: 400 cycles, 32 banks • Benchmarks • Two-threaded workloads divided into 5 categories • Used 20 workloads (four from each type) 1.0 1.2 1.6 1.4 1.8 2.0 Weighted speedup for the baseline

Metrics • Weighted Speedup (default metric) • perf = IPC1/SingleIPC1 + IPC2/SingleIPC2 • correlates with reduction in execution time • Throughput • perf = IPC1 + IPC2 • can be unfair to low-IPC application • Hmean-fairness • perf = hmean(IPC1/SingleIPC1, IPC2/SingleIPC2) • balances fairness and performance

Results for weighted speedup UCP improves average weighted speedup by 11%

Results for throughput UCP improves average throughput by 17%

Results for hmean-fairness UCP improves average hmean-fairness by 11%

Effect of Number of Sampled Sets 8 sets 16 sets 32 sets All sets Dynamic Set Sampling (DSS) reduces overhead, not benefits

Scalability issues • Time complexity of partitioning low for two cores(number of possible partitions ≈ number of ways) • Possible partitions increase exponentially with cores • For a 32-way cache, possible partitions: • 4 cores  6545 • 8 cores  15.4 million • Problem NP hard  need scalable partitioning algorithm

Greedy Algorithm [Stone+ ToC ’92] • GA allocates 1 block to the app that has the max utility for one block. Repeat till all blocks allocated • Optimal partitioning when utility curves are convex • Pathological behavior for non-convex curves

Problem with Greedy Algorithm In each iteration, the utility for 1 block: U(A) = 10 misses U(B) = 0 misses Misses All blocks assigned to A, even if B has same miss reduction with fewer blocks Blocks assigned Problem: GA considers benefit only from the immediate block. Hence it fails to exploit huge gains from ahead

Lookahead Algorithm • Marginal Utility (MU) = Utility per cache resource • MUab = Uab/(b-a) • GA considers MU for 1 block. LA considers MU for all possible allocations • Select the app that has the max value for MU. Allocate it as many blocks required to get max MU • Repeat till all blocks assigned

Lookahead Algorithm (example) Iteration 1: MU(A) = 10/1 block MU(B) = 80/3 blocks B gets 3 blocks Misses Next five iterations: MU(A) = 10/1 block MU(B) = 0 A gets 1 block Blocks assigned Result: A gets 5 blocks and B gets 3 blocks (Optimal) Time complexity ≈ways2/2 (512 ops for 32-ways)

Results for partitioning algorithms Four cores sharing a 2MB 32-way L2 LRU UCP(Greedy) UCP(Lookahead) UCP(EvalAll) Mix1 (gap-applu-apsi-gzp) Mix4 (mcf-art-eqk-wupw) Mix3 (mcf-applu-art-vrtx) Mix2 (swm-glg-mesa-prl) LA performs similar to EvalAll, with low time-complexity

Summary • CMP and shared caches are common • Partition shared caches based on utility, not demand • UMON estimates utility at runtime with low overhead • UCP improves performance: • Weighted speedup by 11% • Throughput by 17% • Hmean-fairness by 11% • Lookahead algorithm is scalable to many cores sharing a highly associative cache

Utility-Based Cache Partitioning: High-Performance Runtime Mechanism

Utility-Based Cache Partitioning: High-Performance Runtime Mechanism

Presentation Transcript

INTRODUCTION TO DIGITAL SIGNAL PROCESSORS

Understanding Fourth Graders’ Mathematical Thinking: Issues and Insights for Teaching Fractions

Forensic Science and Fire Investigation

Mapping Groundwater Vulnerability to Contamination in Texas

(Mostly) micro-hydro (mostly) in Thailand

Corporations

TexasOnline e-Filing Overview eFiling 101: Upcoming Changes and Important Operational Dates Austin, Texas August 19 , 2

Cardiovascular Disease (CVD) in Texas

Musings: 25 Micro-Presentations.

Christina Kasprzak Austin, Texas March 2011

The Local Community as Story and Culture

Working Texas Style: The Changing Jobs Economy in Austin and Texas

MICRO TEACHING

John Moore, Director Regulatory Integrity Division Texas Workforce Commission Austin, Texas

Diabetes Prevention and Control

Redo Logs and Recovery

Generation in the Context of MT

Futurcast Web EASY Tutorial

Christina Kasprzak Austin, Texas November 2010

T exas H omeless E ducation O ffice

The Republic of Texas

April 7, 2008 Austin, Texas IAIABC All Committee Conference