
Adaptive Line Placement with the Set Balancing Cache


Presentation Transcript


  1. Adaptive Line Placement with the Set Balancing Cache Dyer Rolan, Basilio B. Fraguela, and Ramon Doallo Proceedings of the International Symposium on Microarchitecture (MICRO’09) Dec. 2009

  2. Abstract
  • Efficient memory hierarchy design is critical due to the increasing gap between processor and memory speeds. One of the sources of inefficiency in current caches is the non-uniform distribution of memory accesses across the cache sets. As a consequence, while some cache sets may have working sets that are far from fitting in them, other sets may be underutilized because their working sets have fewer lines than the set can hold.
  • This paper presents a technique that aims to balance the pressure on the cache sets by detecting when it may be beneficial to associate sets, displacing lines from stressed sets to underutilized ones. This technique, called the Set Balancing Cache (SBC), achieved an average reduction of 13% in the miss rate of ten benchmarks from the SPEC CPU2006 suite, resulting in an average IPC improvement of 5%.

  3. What's the Problem
  • Non-uniform distribution of memory accesses across the cache sets
  • Some cache sets have working sets larger than the associativity of the cache
  • Other sets, however, may be underutilized
  • An intuitive answer is to increase the associativity
  • But this impacts access latency and power consumption: more tags have to be read and compared
  [Figure: a 2-way cache in which one set suffers conflict misses (a large local miss rate) while another set has a good local miss rate but is underutilized.]

  4. Related Work: efficient cache architectures for performance improvement
  • Increase the possible places where a memory block can be placed: place blocks in a second associated line [1, 3, 19]
  • Selective victim caching: a block that was highly missed in the past has a higher probability of causing more misses in the future, so only the most frequently missed blocks of the conventional part of the cache are retained in the victim cache [2]
  • Identify underutilized cache frames: more than half the lines installed in a cache are never reused before being evicted, so lines to be replaced are retained in underutilized frames of a direct-mapped cache [12]
  • Novel replacement policies: dynamically switch the insertion policy between marking the most recently inserted lines of a set as MRU or as LRU [13]
  • Balance the non-uniform distribution of accesses across cache sets: allow different sets to hold different numbers of lines according to program demand [14]
  • This paper: displace lines from sets with high local miss rates to sets with underutilized lines, improving the miss rate while providing better performance and lower hardware cost

  5. Set Balancing Cache
  • Shift lines from sets with high local miss rates to sets with underutilized lines
  • Goal: balance the saturation level among sets and avoid misses
  • Move lines from highly saturated sets to lightly saturated ones
  [Figure: instead of being dropped, the block evicted from the highly saturated set (conflict miss, large local miss rate) is displaced into the underutilized set.]

  6. The Involved Techniques
  • A mechanism to measure the saturation level of each cache set
  • A saturation counter per cache set: incremented on a cache miss, decremented on a cache hit (see the sketch below)
  • The association algorithm: decides which set(s) are to be associated for displacement
  • The displacement algorithm: decides when to displace lines to an associated set
  • The search algorithm: how to find a displaced line
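
A minimal behavioral sketch of the per-set saturation counter described above, assuming it saturates at 2K-1 for a K-way cache (the value that triggers displacement later in the talk). The class and method names are illustrative, not taken from the paper.

```python
# Sketch of a per-set saturation counter (names are illustrative).
class SaturationCounter:
    def __init__(self, associativity):
        self.max_value = 2 * associativity - 1  # saturates at 2K - 1
        self.value = 0

    def on_miss(self):
        """A miss in the set increases the measured pressure."""
        self.value = min(self.value + 1, self.max_value)

    def on_hit(self):
        """A hit in the set decreases the measured pressure."""
        self.value = max(self.value - 1, 0)

    def is_saturated(self):
        """True when the set is under maximum pressure (counter == 2K - 1)."""
        return self.value == self.max_value
```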

  7. Static Set Balancing Cache
  • The association algorithm
  • Each cache set is statically associated with another specific set: complement the most significant bit of the index (e.g., indices "00"↔"10" and "01"↔"11" in a 4-set cache)
  • The displacement algorithm (requires two conditions)
  • Displacement is performed when a line is evicted from a set whose saturation counter is at its maximum value 2K-1 (K: associativity)
  • The saturation counter of the receiving set must be below the "displacement limit" K
  • The search algorithm
  • Every cache line has an additional displaced bit (d bit): d = 0 means the line is native to the set; d = 1 means it was displaced from its associated set
  • Every set has an additional second search bit (sc bit): sc = 1 means the associated set holds displaced lines
  • 1st search: tag equality and d = 0
  • 2nd search in the associated set (only if not found and sc = 1): tag equality and d = 1
  (A behavioral sketch of these three decisions follows below.)
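
A behavioral sketch of the Static SBC decisions described above: MSB-complement association, the two displacement conditions, and the two-step search. The data layout (`CacheLine`, `CacheSet`) and function names are illustrative; this is not a full cache model.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class CacheLine:
    valid: bool = False
    tag: int = 0
    displaced: bool = False   # the per-line d bit

@dataclass
class CacheSet:
    lines: List[CacheLine]
    sc: bool = False          # second-search bit: the associated set holds our displaced lines
    counter: int = 0          # saturation counter (0 .. 2K - 1)

def associated_set(index: int, num_sets: int) -> int:
    """Static association: complement the most significant index bit."""
    return index ^ (num_sets >> 1)

def should_displace(source: CacheSet, dest: CacheSet, associativity: int) -> bool:
    """Displace an evicted line only when the source set is fully saturated
    (counter == 2K - 1) and the destination is below the displacement limit K."""
    return source.counter == 2 * associativity - 1 and dest.counter < associativity

def lookup(sets: List[CacheSet], index: int, tag: int) -> Tuple[str, Optional[CacheLine]]:
    """First search the native set (d = 0); on a miss, search the associated
    set for displaced lines (d = 1) only if the native set's sc bit is set."""
    native = sets[index]
    for line in native.lines:
        if line.valid and line.tag == tag and not line.displaced:
            return "hit", line
    if native.sc:
        partner = sets[associated_set(index, len(sets))]
        for line in partner.lines:
            if line.valid and line.tag == tag and line.displaced:
                return "secondary_hit", line
    return "miss", None
```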

  8. Example: Static SBC Operation
  • Presume a 2-way cache with 4 sets; the upper tag in each set is the most recently used; the saturation counters operate in the range 0 to 3
  • 1st reference: miss and sc = 0 -> no second search, but the saturation counter reaches its maximum value "3"
  • The evicted line (tag 10010) is displaced from set 0 to set 2
  • 2nd reference: miss but sc = 1 -> a second search is performed in set 2
  • The tag is found with d = 1 -> secondary hit
  (The pairing of set 0 with set 2 follows from the MSB-complement rule; see the quick check below.)
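
A quick check of the static pairing used in this example, applying the MSB-complement rule from the previous sketch to a 4-set cache: set 0 pairs with set 2 and set 1 with set 3, which is why the line evicted from set 0 lands in set 2.

```python
# MSB-complement pairing for a 4-set cache.
num_sets = 4
for index in range(num_sets):
    print(index, "->", index ^ (num_sets >> 1))
# prints: 0 -> 2, 1 -> 3, 2 -> 0, 3 -> 1
```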

  9. Dynamic Set Balancing Cache - 1
  • Goal: associate a set whose saturation counter reaches the maximum value with the set that has the smallest saturation level
  • The association algorithm
  • Identify the least saturated set using the Destination Set Selector (DSS), which keeps data on the least saturated sets
  • Each DSS entry holds a valid bit, a set index, and that set's saturation level; a "min" pointer identifies the entry with the smallest saturation level (providing the index of the best set for association) and a "max" pointer the entry with the largest level
  • When the saturation counter of a free set is updated:
  • The index of the set is compared with the indices of all valid DSS entries
  • On a hit, the corresponding entry is updated with the new saturation level
  • On a miss, if the set's saturation level is smaller than max.level, the set index and its saturation level are stored in the DSS entry pointed to by max.entry
  (A behavioral sketch of this update rule follows below.)
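
A behavioral sketch of the DSS update rule above, assuming a small fully associative table (the paper's evaluation uses 4 entries). In hardware the min/max pointers would be maintained incrementally; the sketch simply recomputes them, and the class and method names are illustrative.

```python
class DestinationSetSelector:
    """Tracks a few of the least-saturated sets; best_candidate() plays the
    role of the 'min' pointer, the replacement victim that of 'max'."""

    def __init__(self, num_entries: int = 4):
        # each entry: (set_index, saturation_level); invalid entries are None
        self.entries = [None] * num_entries

    def update(self, set_index: int, level: int) -> None:
        """Called when the saturation counter of a free (unassociated) set changes."""
        # Hit: the set is already tracked, refresh its saturation level.
        for i, e in enumerate(self.entries):
            if e is not None and e[0] == set_index:
                self.entries[i] = (set_index, level)
                return
        # Miss: fill an invalid entry first (assumption), otherwise replace the
        # max-level entry if the new level is smaller.
        for i, e in enumerate(self.entries):
            if e is None:
                self.entries[i] = (set_index, level)
                return
        max_i = max(range(len(self.entries)), key=lambda i: self.entries[i][1])
        if level < self.entries[max_i][1]:
            self.entries[max_i] = (set_index, level)

    def best_candidate(self):
        """Index of the least-saturated tracked set, or None if empty."""
        valid = [e for e in self.entries if e is not None]
        return min(valid, key=lambda e: e[1])[0] if valid else None

    def invalidate(self, set_index: int) -> None:
        """Drop a set once it is chosen for association or reaches the displacement limit."""
        self.entries = [None if e is not None and e[0] == set_index else e
                        for e in self.entries]
```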

  10. Dynamic Set Balancing Cache - 2
  • The association algorithm (cont.)
  • Invalidation of a DSS entry happens in two scenarios:
  • Scenario 1: when the saturation level reaches the "displacement limit"
  • Scenario 2: when the entry pointed to by "min" is used for an association
  • Association Table (AT): one entry per set, recording the association relation
  • AT(i).index: index of the set associated with set i (initially the entry stores its own index)
  • AT(i).s/d: 1 if set i is associated, 0 if not associated yet
  (A behavioral sketch of the AT follows below.)
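
A minimal behavioral sketch of the Association Table. The slide specifies the per-entry fields (index and s/d bit) and the self-pointing initialization; recording the pairing in both entries so either set can locate its partner is an assumption made here for the sketch.

```python
class AssociationTable:
    def __init__(self, num_sets: int):
        # AT(i).index starts as i itself; AT(i).s/d starts as 0 (not associated yet)
        self.index = list(range(num_sets))
        self.associated = [False] * num_sets

    def associate(self, source: int, destination: int) -> None:
        """Record the association between a saturated source set and the
        destination chosen by the DSS (written in both entries; assumption)."""
        self.index[source] = destination
        self.index[destination] = source
        self.associated[source] = True
        self.associated[destination] = True

    def partner(self, i: int) -> int:
        """Where the second search for set i goes (itself if unassociated)."""
        return self.index[i]
```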

  11. Dynamic Set Balancing Cache - 3
  • The displacement algorithm
  • Just as in SSBC, displacement happens when a line is evicted from a set whose saturation counter is at the maximum value 2K-1 (K: associativity)
  • The LRU line of the source set is moved to the destination set, replacing that set's LRU line (which is evicted)
  • The search algorithm
  • Just as in SSBC, every cache line has an additional displaced bit (d bit)
  • 1st search: tag equality and d = 0
  • If not found and AT(i).s/d = 0, the access results in a miss
  • If not found and AT(i).s/d = 1, a 2nd search is performed in set AT(i).index: tag equality and d = 1
  (A sketch of this lookup path follows below.)
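
A sketch of the DSBC lookup path, kept self-contained by modelling sets and Association Table entries as plain Python lists and dicts; the field names mirror the bullets above and are otherwise illustrative.

```python
def dsbc_lookup(sets, at_index, at_assoc, index, tag):
    """sets: list of {'lines': [{'valid': ..., 'tag': ..., 'displaced': ...}, ...]}
    at_index[i] / at_assoc[i]: the Association Table entry (index, s/d) for set i."""
    # 1st search: native lines only (d = 0)
    for line in sets[index]['lines']:
        if line['valid'] and line['tag'] == tag and not line['displaced']:
            return 'hit'
    # AT(i).s/d = 0: no partner yet, the miss is final
    if not at_assoc[index]:
        return 'miss'
    # 2nd search: displaced lines (d = 1) in the set named by AT(i).index
    for line in sets[at_index[index]]['lines']:
        if line['valid'] and line['tag'] == tag and line['displaced']:
            return 'secondary_hit'
    return 'miss'
```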

  12. Example: Dynamic SBC Operation
  • Presume a 2-way cache with 4 sets; the upper tag in each set is the most recently used; the saturation counters operate in the range 0 to 3
  • 1st reference: miss and AT(i).s/d = 0 -> not associated yet, but the saturation counter reaches its maximum value "3"
  • The evicted line (tag 10010) is displaced from set 0 to set 3 (assuming the DSS provides set 3 as the candidate)
  • 2nd reference: miss but AT(i).s/d = 1 -> a second search is performed in set AT(0).index = 3
  • The tag is found with d = 1 -> secondary hit

  13. Experimental Setup
  • Uses the SESC simulator [15] with two baseline configurations
  • Two-level on-chip cache hierarchy
  • L2 (unified) cache: 2MB/8-way/64B/LRU
  • Both SSBC and DSBC are applied in the L2 cache
  • Three-level on-chip cache hierarchy
  • L2 (unified) cache: 256KB/8-way/64B/LRU
  • L3 (unified) cache: 2MB/16-way/64B/LRU
  • Both SSBC and DSBC are applied in the two lowest-level caches (i.e., L2 and L3)
  • The Destination Set Selector used in the dynamic SBC has four entries
  • Benchmarks: 10 representative benchmarks from the SPEC CPU2006 suite
  (The two hierarchies are summarized as plain data below.)
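
A summary of the two baseline hierarchies listed above as plain Python data, for quick reference; this is not SESC's own configuration syntax, and the key names are illustrative.

```python
# Baseline cache hierarchies evaluated in the paper (parameters from the slide).
baselines = {
    "two_level": {
        "L2": {"size": "2MB", "ways": 8, "line": "64B", "repl": "LRU", "sbc": True},
    },
    "three_level": {
        "L2": {"size": "256KB", "ways": 8, "line": "64B", "repl": "LRU", "sbc": True},
        "L3": {"size": "2MB", "ways": 16, "line": "64B", "repl": "LRU", "sbc": True},
    },
}
```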

  14. Performance Comparison: Miss, Hit, and Secondary Hit Rates
  • Both SBCs keep the same ratio of first-access hits as the standard cache
  • But they turn misses into secondary hits
  • DSBC achieves better results than SSBC, as expected
  [Figures: miss, first-hit, and secondary-hit rates (a smaller miss rate is better) for (a) the L2 cache in the two-level configuration, (b) the L2 cache in the three-level configuration, and (c) the L3 cache in the three-level configuration.]

  15. Performance Comparison: Average Access Time Improvement (Reduction)
  • Average access-time improvement
  • Two-level configuration: 4% and 8% reduction in L2 for SSBC and DSBC respectively
  • Three-level configuration: 3% and 6% reduction in L2, and 10% and 12% reduction in L3, for SSBC and DSBC respectively
  • A small 1% slowdown appears where the baseline miss rate is already small
  • DSBC achieves a better average access time than SSBC
  [Figures: access-time reduction for the L2 cache in the two-level configuration, and for the L2 and L3 caches in the three-level configuration.]

  16. Performance Comparison: IPC Improvement
  • In the two-level configuration
  • SBCs have a positive effect on performance
  • However, two kinds of benchmarks get no benefit from SBCs:
  • 1. Those whose baseline miss rate is small (like 444.namd or 445.gobmk)
  • 2. Those with few accesses to L2 (like 458.sjeng)
  • In the three-level configuration, the improvement is larger and applies to all benchmarks
  • SBCs achieve better results than doubling the associativity (of L2 in the two-level configuration, or of both L2 and L3 in the three-level configuration)
  [Figures: IPC improvement for the two-level and three-level configurations.]

  17. Cost of SBC: Storage Requirement and Energy
  • Storage requirement of SBC (based on a 2MB/8-way/64B/LRU cache and a 42-bit address)
  • A d bit per line to identify displaced lines, plus a saturation counter per set
  • SSBC additionally requires an sc bit per set: about 7KB in total (0.31% overhead)
  • DSBC instead uses one Association Table entry per set, plus a 4-entry Destination Set Selector to choose the best set for association: about 13KB in total (0.58% overhead)
  • Energy overhead
  • Less than 1% for SBC
  • But 79% overhead for the baseline with double associativity
  (A back-of-the-envelope check of the storage figures follows below.)
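
A back-of-the-envelope check of the storage numbers above for the 2MB/8-way/64B cache. The per-field widths assumed here (a 4-bit saturation counter per set, consistent with the 0..2K-1 range for K = 8, and a 12-bit set index per AT entry for 4096 sets) are this sketch's assumptions, not figures quoted from the paper; the totals land near the ~7KB and ~13KB reported on the slide.

```python
# Storage estimate for SBC structures on a 2 MB / 8-way / 64 B cache.
lines = 2 * 1024 * 1024 // 64          # 32768 cache lines
sets = lines // 8                      # 4096 sets (8-way)

d_bits   = lines * 1                   # one d bit per line
counters = sets * 4                    # counter range 0..2K-1 = 0..15 -> 4 bits (assumed)
sc_bits  = sets * 1                    # SSBC: one sc bit per set
at_bits  = sets * (12 + 1)             # DSBC: 12-bit index + s/d bit per set (assumed widths)

ssbc_kb = (d_bits + counters + sc_bits) / 8 / 1024
dsbc_kb = (d_bits + counters + at_bits) / 8 / 1024  # 4-entry DSS is negligible
print(round(ssbc_kb, 1), round(dsbc_kb, 1))         # ~6.5 KB and ~12.5 KB
```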

  18. Cost of SBC: Area
  • The area overhead of SBC is less than 1% for both SSBC and DSBC
  • But it is more than 3% for the baseline with double associativity
  • SBC offers more performance but requires less energy and area than doubling the associativity

  19. Conclusions
  • This paper proposed the Set Balancing Cache (SBC)
  • It balances the saturation level among sets and avoids misses
  • It displaces lines from cache sets with a high saturation level to underutilized sets
  • Two designs have been presented:
  • Static Set Balancing Cache: displacement between pre-established pairs of sets
  • Dynamic Set Balancing Cache: tries to associate each highly saturated set with the least saturated set available
  • Experimental results show that SBC achieves an average miss-rate reduction of 9.2% and 12.8% for SSBC and DSBC respectively

  20. Comments on This Paper
  • The paper extensively reviews emerging techniques in cache architecture for performance improvement
  • However, the hardware implementation for placing less-saturated cache sets into the Destination Set Selector (DSS) is not clear
  • The paper only explains the table used by the DSS, not how suitable candidates are chosen for its limited number of entries (4)
