Adaptive Insertion Policies for High-Performance Caching
Presentation Transcript
Adaptive Insertion Policies for High-Performance Caching
Aamer Jaleel, Simon C. Steely Jr., Joel Emer
Moinuddin K. Qureshi, Yale N. Patt
International Symposium on Computer Architecture (ISCA) 2007
Background
Fast processor + slow memory → cache hierarchy: L1 (~2 cycles), L2 (~10 cycles), memory (~300 cycles).
L1 misses: short latency, can be hidden.
L2 misses: long latency, hurt performance.
Important to reduce Last Level (L2) cache misses.
Motivation
• L1 for latency, L2 for capacity
• Traditionally L2 is managed like L1 (typically LRU)
• L1 filters temporal locality → poor locality at L2
• LRU causes thrashing when working set > cache size: most lines remain unused between insertion and eviction
Dead on Arrival (DoA) Lines
DoA lines: lines unused between insertion and eviction.
[Figure: % of DoA lines per benchmark]
• For the 1MB 16-way L2, 60% of lines are DoA
• Ineffective use of cache space
Why DoA Lines?
[Figure: misses per 1000 instructions vs. cache size (MB), for art and mcf]
• Streaming data: never reused, so L2 caches don't help.
• Working set of the application is greater than the cache size.
Solution: if working set > cache size, retain some of the working set.
Overview
Problem: LRU replacement is inefficient for L2 caches.
Goal: a replacement policy that has:
1. Low hardware overhead
2. Low complexity
3. High performance
4. Robustness across workloads
Proposal: a mechanism that reduces misses by 21% with total storage overhead of less than two bytes.
Outline • Introduction • Static Insertion Policies • Dynamic Insertion Policies • Summary
Cache Insertion Policy
Two components of cache replacement:
• Victim selection: which line to replace for the incoming line? (e.g. LRU, Random, FIFO, LFU)
• Insertion policy: where is the incoming line placed in the replacement list? (e.g. insert the incoming line at the MRU position)
Simple changes to the insertion policy can greatly improve cache performance for memory-intensive workloads.
LRU-Insertion Policy (LIP)
Recency stack (MRU → LRU): a b c d e f g h
Reference to 'i' with traditional LRU policy: i a b c d e f g
Reference to 'i' with LIP: a b c d e f g i
Choose the LRU victim, but do NOT insert at MRU. Lines do not enter non-LRU positions unless reused.
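The two insertion behaviors can be sketched in a few lines. This is my illustrative model of a single set's recency stack, not the paper's hardware; `access` and its parameters are names I chose.

```python
# Minimal sketch of one cache set's recency stack (my code, not the paper's).
# Traditional LRU inserts a missing line at the MRU end; LIP inserts it at the
# LRU end, so it is the next victim unless it is reused first.

def access(stack, line, ways, insert_at_mru):
    """Access `line` in a recency stack (index 0 = MRU); return True on a hit."""
    if line in stack:
        stack.remove(line)
        stack.insert(0, line)      # hits are always promoted to MRU
        return True
    if len(stack) == ways:
        stack.pop()                # victim selection: evict the LRU position
    if insert_at_mru:
        stack.insert(0, line)      # traditional LRU insertion
    else:
        stack.append(line)         # LIP: insert at the LRU position
    return False

# The slide's example: the set holds a..h, then 'i' is referenced.
lru_stack = list("abcdefgh")
access(lru_stack, "i", 8, insert_at_mru=True)
print(lru_stack)   # → ['i', 'a', 'b', 'c', 'd', 'e', 'f', 'g']

lip_stack = list("abcdefgh")
access(lip_stack, "i", 8, insert_at_mru=False)
print(lip_stack)   # → ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'i']
```

Note that both policies pick the same victim (the LRU position); only the insertion point differs.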
Bimodal-Insertion Policy (BIP)
LIP does not age older lines. BIP infrequently inserts lines in the MRU position.
Let e = bimodal throttle parameter:
if ( rand() < e )
    insert at MRU position;
else
    insert at LRU position;
For small e, BIP retains the thrashing protection of LIP while responding to changes in the working set.
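A minimal sketch of BIP's insertion decision (the e = 1/32 value is the one used in the talk's results; the function name and `rng` hook are mine):

```python
import random

# Illustrative BIP insertion for one set's recency stack (index 0 = MRU).
EPSILON = 1 / 32          # bimodal throttle parameter e

def bip_insert(stack, line, ways, epsilon=EPSILON, rng=random.random):
    """Insert a missing line: MRU with probability epsilon, else LRU."""
    if len(stack) == ways:
        stack.pop()                   # evict the line in the LRU position
    if rng() < epsilon:
        stack.insert(0, line)         # rare MRU insert lets the stack age
    else:
        stack.append(line)            # common case: LRU insert, as in LIP
```

Setting `epsilon=0` recovers LIP and `epsilon=1` recovers traditional LRU insertion, which is what makes the policy "bimodal".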
Circular Reference Model [Smith & Goodman, ISCA '84]
The reference stream has T blocks and repeats N times; the cache has K blocks (K < T and N >> T). Under this model, LRU insertion gets zero hits after the first pass (each block is evicted just before it is needed again), while LIP retains K-1 of the T blocks across passes.
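The circular-reference behavior is easy to check with a small simulation. This is my sketch, not the paper's simulator; the thrashing (LRU) and retention (LIP) results follow from the model.

```python
# Circular reference model: T blocks referenced round-robin, N passes,
# fully-associative cache of K blocks (K < T). Count total hits.

def run(T, K, N, insert_at_mru):
    stack, hits = [], 0            # recency stack, index 0 = MRU
    for _ in range(N):
        for line in range(T):
            if line in stack:
                stack.remove(line)
                stack.insert(0, line)      # promote on hit
                hits += 1
            else:
                if len(stack) == K:
                    stack.pop()            # evict the LRU position
                if insert_at_mru:
                    stack.insert(0, line)  # traditional LRU insertion
                else:
                    stack.append(line)     # LIP insertion
    return hits

T, K, N = 32, 16, 100
print(run(T, K, N, insert_at_mru=True))    # → 0     (LRU thrashes)
print(run(T, K, N, insert_at_mru=False))   # → 1485  (= (N-1)*(K-1) for LIP)
```

After the cold-start pass, LIP keeps blocks 0..K-2 pinned near the MRU end and cycles the remaining blocks through the single LRU position, so it hits on K-1 of the T references in every later pass.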
Results for LIP and BIP
[Figure: % reduction in L2 MPKI per benchmark, for LIP and BIP (e = 1/32)]
Changing the insertion policy increases misses for LRU-friendly workloads.
Outline • Introduction • Static Insertion Policies • Dynamic Insertion Policies • Summary
Dynamic-Insertion Policy (DIP)
• Two types of workloads: LRU-friendly or BIP-friendly
• DIP can be implemented by:
  • Monitoring both policies (LRU and BIP)
  • Choosing the best-performing policy
  • Applying the best policy to the cache
Need a cost-effective implementation → "Set Dueling"
DIP via "Set Dueling"
Divide the cache sets into three groups:
• Dedicated LRU sets
• Dedicated BIP sets
• Follower sets (use the winner of LRU vs. BIP)
A single n-bit saturating counter monitors, chooses, and applies the policy:
• Miss to an LRU set: counter++
• Miss to a BIP set: counter--
The counter's MSB decides the policy for the follower sets:
• MSB = 0: use LRU
• MSB = 1: use BIP
Bounds on Dedicated Sets
How many dedicated sets are required for "Set Dueling"?
Let μ_LRU, σ_LRU, μ_BIP, σ_BIP be the average and standard deviation of misses for LRU and BIP, n the number of dedicated sets, and Z a standard Gaussian variable. With
r = |μ_LRU - μ_BIP| / √(σ_LRU² + σ_BIP²)
the probability of selecting the best policy is
P(Best) = P(Z < r·√n)
For the majority of workloads r > 0.2, so 32-64 dedicated sets are sufficient.
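The bound is quick to evaluate numerically. A back-of-envelope check (my code; it only assumes P(Best) = Φ(r·√n) with Φ the standard normal CDF, as on the slide):

```python
import math

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def p_best(r, n):
    """Probability that n dedicated sets pick the better policy."""
    return phi(r * math.sqrt(n))

print(round(p_best(0.2, 32), 3))   # ≈ 0.87
print(round(p_best(0.2, 64), 3))   # ≈ 0.95
```

So at r = 0.2, already 32 dedicated sets pick the better policy about 87% of the time, and 64 sets about 95% of the time, matching the "32-64 sets sufficient" conclusion.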
Results for DIP
[Figure: % reduction in L2 MPKI per benchmark, for BIP and DIP (32 dedicated sets)]
DIP reduces average MPKI by 21% and requires less than two bytes of storage overhead.
DIP vs. Other Policies
[Figure: % reduction in L2 MPKI for DIP, OPT, Double-size (2MB), (LRU+RND), (LRU+LFU), and (LRU+MRU)]
DIP bridges two-thirds of the gap between LRU and OPT.
IPC Improvement
Processor: 4-wide, 32-entry window. Memory: 270 cycles. L2: 1MB 16-way LRU.
[Figure: % IPC improvement with DIP per benchmark]
DIP improves IPC by 9.3% on average.
Outline • Introduction • Static Insertion Policies • Dynamic Insertion Policies • Summary
Summary
LRU is inefficient for L2 caches: most lines remain unused between insertion and eviction.
The proposed change to the cache insertion policy (DIP) has:
1. Low hardware overhead: requires less than two bytes of storage
2. Low complexity: trivial to implement; no changes to the cache structure
3. High performance: reduces misses by 21%, two-thirds as good as OPT
4. Robustness across workloads: almost as good as LRU for LRU-friendly workloads
Questions?
Source code: www.ece.utexas.edu/~qk/dip
DIP vs. LRU Across Cache Sizes
[Figure: MPKI relative to 1MB LRU (smaller is better) for DIP and LRU at 1MB, 2MB, 4MB, and 8MB, on art, mcf, swim, health, equake, and the 16-benchmark average (Avg_16)]
MPKI decreases until the workload fits in the cache.
DIP with 1MB 8-way L2 Cache
[Figure: % reduction in L2 MPKI per benchmark]
MPKI reduction with 8-way (19%) is similar to 16-way (21%).
Interaction with Prefetching (PC-based stride prefetcher)
[Figure: % reduction in L2 MPKI for DIP-NoPref, LRU-Pref, and DIP-Pref]
DIP also works well in the presence of prefetching.
Random Replacement (Success Function)
The cache contains K blocks and the reference stream contains T blocks.
Probability that a block in the cache survives one eviction = (1 - 1/K).
Number of evictions between consecutive references to a block = (T-1)·P_miss.
Therefore:
P_hit = (1 - 1/K)^((T-1)·P_miss) = (1 - 1/K)^((T-1)(1 - P_hit))
Iterative solution, starting at P_hit = 0:
Iteration 1: P_hit = (1 - 1/K)^(T-1) ≈ (1 - 1/K)^T
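The recurrence can be solved numerically by fixed-point iteration, exactly as the slide suggests. A small sketch (my code; the K and T values are examples, not from the talk):

```python
# Solve P_hit = (1 - 1/K)^((T-1) * (1 - P_hit)) by fixed-point iteration.

def phit_random(K, T, iters=100):
    """Hit probability of random replacement under the circular model."""
    p = 0.0                                      # start at P_hit = 0
    for _ in range(iters):
        p = (1.0 - 1.0 / K) ** ((T - 1) * (1.0 - p))
    return p

p = phit_random(K=16, T=32)
print(round(p, 3))   # ≈ 0.203 for a stream twice the cache size
```

The iteration converges because the update map is a contraction for these parameters; a few dozen iterations are more than enough.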