
Achieving Non-Inclusive Cache Performance with Inclusive Caches

Achieving Non-Inclusive Cache Performance with Inclusive Caches. MICRO 2010. Aamer Jaleel, Eric Borch, Malini Bhandaru, Simon Steely Jr., Joel Emer. Intel Corporation, VSSAD. Presented by Soon-Won Hong.


Presentation Transcript


  1. Achieving Non-Inclusive Cache Performance with Inclusive Caches. MICRO 2010. Aamer Jaleel, Eric Borch, Malini Bhandaru, Simon Steely Jr., Joel Emer. Intel Corporation, VSSAD. Presented by Soon-Won Hong

  2. Motivation
  • Factors making caching important
    • CPU speed >> Memory speed
    • Chip Multi-Processors (CMPs)
  • Goal:
    • High-performing LLC
  [Figure: two-core CMP, each core with iL1/dL1 and a private L2, sharing a Last Level Cache (LLC)]

  3. Motivation
  • Factors making caching important
    • CPU speed >> Memory speed
    • Chip Multi-Processors (CMPs)
  • Goal:
    • High-performing LLC
    • High-performing cache hierarchy
  [Figure: the same two-core hierarchy with iL1/dL1, private L2s, and a shared LLC]

  4. Cache Hierarchy
  • Inclusive Hierarchy: L1 is a subset of the LLC
  [Figure: core request flows through L1 to the LLC and memory; fills populate both levels; an LLC victim triggers a back-invalidate that also evicts the line from L1]

  5. Cache Hierarchy
  • Inclusive Hierarchy: L1 is a subset of the LLC
  • Exclusive Hierarchy: L1 is NOT in the LLC
  [Figure: the two hierarchies side by side; in the exclusive hierarchy, fills go into L1 and L1 victims are inserted into the LLC, with no back-invalidates]

  6. Cache Hierarchy
  • Inclusive Hierarchy: L1 is a subset of the LLC
  • Non-Inclusive Hierarchy: L1 is not necessarily a subset of the LLC
  • Exclusive Hierarchy: L1 is NOT in the LLC
  [Figure: the three hierarchies side by side; only the inclusive hierarchy sends back-invalidates on LLC evictions]

  7. Cache Hierarchy: In a Nutshell
  • Inclusive Hierarchy (L1 subset of LLC)
    • Total capacity: LLC
    • Back-invalidate: YES
    • Coherence: LLC acts as a directory
    • (+) simplifies cache coherence
    • (−) wastes cache capacity
    • (−) back-invalidates limit performance
  • Non-Inclusive Hierarchy (L1 not necessarily a subset of LLC)
    • Total capacity: >= LLC and <= (L1 + LLC)
    • Back-invalidate: NO
    • Coherence: LLC miss snoops ALL L1 caches (or use a snoop filter)
    • (+) does not waste cache capacity
    • (−) complicates cache coherence
    • (−) extra hardware for snoop filtering
  • Exclusive Hierarchy (L1 is NOT in the LLC)
    • Total capacity: L1 + LLC
    • Back-invalidate: NO
    • Coherence: LLC miss snoops ALL L1 caches (or use a snoop filter)
    • (+) does not waste cache capacity
    • (−) complicates cache coherence
    • (−) extra hardware for snoop filtering
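The capacity arithmetic in the table above can be expressed as a small helper. This is an illustrative sketch, not from the paper; the 32KB/256KB sizes in the usage lines are only examples:

```python
# Effective capacity of a two-level hierarchy under each inclusion policy.
# (Illustrative helper; the sizes used below are hypothetical examples.)
def effective_capacity(l1_kb, llc_kb, policy):
    if policy == "inclusive":        # L1 contents are duplicated in the LLC
        return llc_kb
    if policy == "exclusive":        # no duplication at all
        return l1_kb + llc_kb
    if policy == "non-inclusive":    # duplication possible but not enforced
        return (llc_kb, l1_kb + llc_kb)   # (lower bound, upper bound)
    raise ValueError(f"unknown policy: {policy}")

print(effective_capacity(32, 256, "inclusive"))      # 256
print(effective_capacity(32, 256, "exclusive"))      # 288
print(effective_capacity(32, 256, "non-inclusive"))  # (256, 288)
```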

  8. Performance of Non-Inclusive and Exclusive LLCs
  [Chart: performance relative to baseline inclusion for AMD-like and Intel-like configurations; 2-core CMP with 32KB L1, 256KB L2, LLC sized by ratio]
  • Enforcing inclusion is bad when the LLC is not significantly larger than the MLC
  • Why do non-inclusive (NI) and exclusive LLCs perform better?
    • They make use of extra cache capacity by avoiding duplication
    • They avoid the problems caused by harmful back-invalidates

  9. Back-Invalidate Problem with Inclusive Caches
  • Inclusion victims: lines evicted from the core caches due to an LLC eviction
    • Small caches filter temporal locality
    • Small-cache hits do not update LLC LRU state
    • "Hot" small-cache lines drift to the LRU position in the LLC
  • Example reference pattern: … a, b, a, c, a, d, a, e, a, f …
  [Figure: L1, L2, and LLC LRU stacks replaying the pattern; repeated L1 hits to 'a' never update the LLC, so 'a' reaches the LLC's LRU position; the reference to 'e' misses and evicts 'a' from the entire hierarchy, and the next reference to 'a' misses]
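The slide's reference pattern can be replayed with a toy two-level LRU model. This is a sketch with assumed sizes (a 2-entry L1 and a 4-entry LLC, chosen only to reproduce the effect); as in the baseline design described above, L1 hits do not update LLC replacement state:

```python
class LRUCache:
    """Minimal LRU stack: index 0 is MRU, the last index is LRU."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = []

    def access(self, addr):
        hit = addr in self.lines
        if hit:
            self.lines.remove(addr)
        self.lines.insert(0, addr)                 # promote to MRU
        victim = self.lines.pop() if len(self.lines) > self.capacity else None
        return hit, victim

def inclusive_access(l1, llc, addr):
    l1_hit, _ = l1.access(addr)
    if l1_hit:
        return "hit"          # served by L1: LLC replacement state untouched
    _, llc_victim = llc.access(addr)
    if llc_victim is not None and llc_victim in l1.lines:
        l1.lines.remove(llc_victim)   # back-invalidate -> inclusion victim
    return "miss"

l1, llc = LRUCache(2), LRUCache(4)
for addr in "abacada":                     # ... a, b, a, c, a, d, a ...
    inclusive_access(l1, llc, addr)
# 'a' is hot in L1 but has drifted to the LLC's LRU position:
assert llc.lines[-1] == "a"
inclusive_access(l1, llc, "e")             # evicts 'a' from the whole hierarchy
assert "a" not in l1.lines and "a" not in llc.lines
assert inclusive_access(l1, llc, "a") == "miss"   # next reference to 'a' misses
```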

  10. Inclusion Problem Exacerbated on CMPs!
  • Types of applications:
    • Core Cache Fitting (CCF) apps: working set fits in the core caches
    • LLC Fitting (LLCF) apps: working set fits in the LLC
    • LLC Thrashing (LLCT) apps: working set is larger than the LLC
  [Figure: two-core hierarchy; a CCF app runs on one core and an LLCF/LLCT app on the other, sharing the LLC]

  11. Inclusion Problem Exacerbated on CMPs!
  [Figure: two-core hierarchy with per-core iL1/dL1 and L2, sharing the LLC]

  12. Inclusion Problem Exacerbated on CMPs!
  • CCF apps are serviced from the L2 cache and rarely from the LLC
  • The replacement state of CCF apps' lines becomes LRU at the LLC
  [Figure: two-core hierarchy; the CCF app's LLC lines sit at LRU positions]

  13. Inclusion Problem Exacerbated on CMPs!
  • CCF apps are serviced from the L2 cache and rarely from the LLC
  • The replacement state of CCF apps' lines becomes LRU at the LLC
  • The LLCF app replaces the CCF working set in the LLC
  • Inclusion mandates removing the CCF working set from the entire hierarchy
  [Figure: two-core hierarchy; back-invalidates strip the CCF app's working set out of its core caches]

  14. Main Ideas of Temporal Locality Aware (TLA) Cache Management
  • Temporal Locality Hints (TLH)
  • Early Core Invalidation (ECI)
  • Query Based Selection (QBS)

  15. Eliminate "Inclusion Victims" Using Temporal Hints
  • Baseline policies only update replacement state at the level of the hit
  • Proposal: convey the temporal locality seen in the small caches to the LLC
  • Temporal Locality Hints (TLH): non-data requests sent to update LLC replacement state
  [Figure: a core request that hits in L1 updates L1 LRU state; a TLH propagates through L2 to the LLC so its LRU state is updated as well]
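A temporal locality hint can be bolted onto the same kind of toy model. This is a sketch: the hint is modeled as a zero-cost LRU update at the LLC, ignoring the bandwidth cost the authors discuss later, and the 2-entry L1 / 4-entry LLC sizes are assumed:

```python
class LRUCache:
    """Minimal LRU stack: index 0 is MRU, the last index is LRU."""
    def __init__(self, capacity):
        self.capacity, self.lines = capacity, []

    def access(self, addr):
        hit = addr in self.lines
        if hit:
            self.lines.remove(addr)
        self.lines.insert(0, addr)
        victim = self.lines.pop() if len(self.lines) > self.capacity else None
        return hit, victim

def access_with_tlh(l1, llc, addr):
    l1_hit, _ = l1.access(addr)
    if l1_hit:
        llc.access(addr)     # TLH: non-data request refreshes LLC LRU state
        return "hit"
    _, victim = llc.access(addr)
    if victim is not None and victim in l1.lines:
        l1.lines.remove(victim)          # back-invalidate
    return "miss"

l1, llc = LRUCache(2), LRUCache(4)
for addr in "abacadae":   # the slide-9 pattern that evicted 'a' without hints
    access_with_tlh(l1, llc, addr)
assert "a" in l1.lines and "a" in llc.lines   # 'a' survives the 'e' fill
```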

  16. Improving Inclusive Cache Performance
  • Eliminate back-invalidates (i.e., build non-inclusive caches)
    • Increases coherence complexity
  • Goal: retain the benefits of inclusion yet avoid inclusion victims
  • Solution: ensure the LLC DOES NOT evict "hot" lines from the core caches
    • Must identify LLC lines that have high temporal locality in the core caches

  17. Early Core Invalidate (ECI)
  • Main idea: derive temporal locality by removing a line early from the core caches
  • Early Core Invalidate (ECI):
    • On an LLC miss, send an early invalidate for the next victim in the same set
    • If the line is "hot", it will be "rescued" from the LLC
    • The "rescue" updates LLC replacement state as a side effect
  [Figure: miss flow through L1, L2, and L3 to memory; the next victim is the line at the L3 set's LRU position, and its invalidate is sent to the core caches early]
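ECI's extra step on the LLC miss path can be sketched as a small helper (a hypothetical function; the "rescue" itself is simply the core's next request to that line hitting the LLC and refreshing its LRU state, which is not modeled here):

```python
def early_core_invalidate(llc_lines, core_lines):
    """Sketch of ECI: on an LLC miss, additionally invalidate the LLC's
    *next* victim (the line at the LRU position) from the core caches.
    If the line is hot, the core will re-request it before the next miss
    to this set; that 'rescue' hits the LLC and updates its LRU state.
    llc_lines / core_lines: lists with index 0 = MRU, last index = LRU."""
    next_victim = llc_lines[-1]
    if next_victim in core_lines:
        core_lines.remove(next_victim)   # dropped from the core cache early
        return next_victim               # a rescue may now follow
    return None                          # nothing to invalidate early

core, llc = ["a", "d"], ["e", "d", "c", "b", "a"]
assert early_core_invalidate(llc, core) == "a"
assert core == ["d"]      # 'a' must now be rescued from the LLC to survive
```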

  18. Query Based Selection (QBS)
  • Main idea: replace lines that are NOT resident in the core caches
  • Query Based Selection (QBS):
    • The LLC sends a back-invalidate request for its intended victim
    • The core REJECTS the back-invalidate if the line is resident in its caches
  [Figure: miss flow through L1, L2, and L3 to memory; the core rejects the back-invalidate for the LRU line 'a' because it is core-resident]

  19. Query Based Selection (QBS)
  • Main idea: replace lines that are NOT resident in the core caches
  • Query Based Selection (QBS):
    • The LLC sends a back-invalidate request for its intended victim
    • The core rejects the back-invalidate if the line is resident in its caches
    • If the core rejects, the line is updated to MRU in the LLC
    • The LLC repeats the back-invalidate process until a core ACCEPTS the request (or a timeout expires)
  [Figure: the rejected line 'a' is promoted to MRU; the next candidate 'b' is accepted and evicted]
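QBS victim selection can be sketched as a query loop over LLC candidates. This is illustrative code; the `max_queries` parameter is an assumed way of modeling the slide's timeout:

```python
def qbs_select_victim(llc_lines, core_resident, max_queries=4):
    """Pick an LLC victim under QBS (sketch). Walk from the LRU position;
    for each candidate the LLC 'queries' the core. A rejected
    (core-resident) line is promoted to MRU instead of being evicted;
    after max_queries rejections (timeout) the plain LRU line is evicted.
    llc_lines: list with index 0 = MRU, last index = LRU."""
    for _ in range(max_queries):
        candidate = llc_lines[-1]
        if candidate not in core_resident:   # core accepts the back-inval
            return llc_lines.pop()
        llc_lines.remove(candidate)          # core rejects:
        llc_lines.insert(0, candidate)       # ...promote to MRU in the LLC
    return llc_lines.pop()                   # timeout: evict LRU anyway

llc = ["e", "d", "c", "b", "a"]              # 'a' is LRU but core-resident
assert qbs_select_victim(llc, {"a"}) == "b"  # next non-resident line evicted
assert llc[0] == "a"                         # rejected line is now MRU
```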

  20. Example of TLH, ECI and QBS

  21. Performance of L1 Temporal Locality Hints
  [Chart: 2T workloads on a 1:4 hierarchy; L1 hints gain 5.2% over baseline inclusion vs. 6.1% for non-inclusion]
  • L1 hints close 85% of the gap between inclusion and non-inclusion
  • Limitations of L1 hints:
    • Very high bandwidth: number of messages = number of L1 hits
  • *Our studies do not model TLH bandwidth

  22. Performance of Early Core Invalidate (ECI)
  [Chart: 2T workloads on a 1:4 hierarchy; ECI gains 3.4% over baseline inclusion vs. 6.1% for non-inclusion]
  • ECI closes 55% of the gap between inclusion and non-inclusion
  • Pros:
    • No hardware overhead; low bandwidth: number of messages = number of LLC misses
  • Limitations:
    • Short window to rescue: the rescue must occur BEFORE the next miss to the set

  23. Performance of Query Based Selection (QBS)
  [Chart: 2T workloads on a 1:4 hierarchy; QBS gains 6.6% over baseline inclusion vs. 6.1% for non-inclusion]
  • QBS outperforms non-inclusion
  • Pros:
    • No hardware overhead; low bandwidth: number of messages = number of LLC misses
    • No rescue-time constraint (unlike ECI)

  24. Summary of TLA Cache Management (2-core CMP)
  [Chart: performance of the TLA proposals relative to baseline inclusion]

  25. Summary
  • Problem: the inclusive cache problem becomes WORSE on CMPs
    • E.g., a Core Cache Fitting app running alongside an LLC Fitting/Thrashing app
  • Conventional wisdom: the primary benefit of a non-inclusive cache is its higher capacity
  • We show: the primary benefit is NOT capacity but avoiding back-invalidates
  • Proposal: Temporal Locality Aware (TLA) cache management
    • Retains the benefit of inclusion while minimizing the back-invalidate problem
    • A TLA-managed inclusive cache matches the performance of a non-inclusive cache
