Single-pass Cache Optimization Clive Butler and Ruofan Yang
Introduction of Problem • Embedded systems execute a single application or a class of applications repeatedly. • An emerging methodology for designing embedded systems uses configurable processors. • Configurable cache parameters: size, associativity, and line size. • An energy model and an execution-time model are developed to find the best cache configuration for a given embedded application. • Current processor design methodologies rely on reserving a large enough chip area for caches while conforming to area, performance, and energy cost constraints. • A customized cache allows designers to meet tighter energy consumption, performance, and cost constraints.
Introduction of Problem • In existing low-power processors, cache memory is known to consume a large portion of the on-chip energy. • Caches can consume as much as 43% to 50% of the total system power of a processor. • In embedded systems where a single application or a class of applications is repeatedly executed on a processor, the memory hierarchy can be customized so that an optimal configuration is achieved. • The right choice of cache configuration for a given application can have a significant impact on overall performance and energy consumption.
Introduction of Problem • Estimating hit and miss rates is fairly easy using tools such as Dinero. • Doing so for many cache sizes, associativities, and line sizes can be enormously time consuming. • Using Dinero to estimate the cache miss rate for a number of cache configurations means that a large program trace must be read and evaluated repeatedly.
Dinero • Dinero is a trace-driven cache simulator. • Simulations are repeatable. • One can simulate either a unified cache (mixed: data and instructions cached together) or separate instruction and data caches. • Cheaper than hardware monitoring.
Dinero • A din record is a two-tuple: label, address. • Labels: 0 read data, 1 write data, 2 instruction fetch, 3 escape record, 4 escape record (causes cache flush). • Cache parameters are set by command-line options. • Dinero uses the priority stack method of memory hierarchy simulation to increase flexibility and improve simulator performance in highly associative caches.
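The record format above can be read with a few lines of Python. This is a minimal sketch, not part of Dinero itself: the names `LABELS` and `parse_din` are hypothetical, the sample records are made up, and the code assumes each record is one whitespace-separated text line whose address is written in hexadecimal.

```python
# Access-type labels as listed on the slide.
LABELS = {
    0: "read data",
    1: "write data",
    2: "instruction fetch",
    3: "escape record",
    4: "escape record (cache flush)",
}

def parse_din(lines):
    """Yield (label, address) tuples from din-style text lines."""
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        label = int(fields[0])
        address = int(fields[1], 16)  # assumption: address is hexadecimal
        yield label, address

# Made-up sample records, for illustration only.
sample = ["2 0", "0 1f8", "1 1fc", "2 4"]
records = list(parse_din(sample))
```

Each tuple in `records` can then be fed to a cache model; `LABELS[label]` gives the access type for reporting.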
Introduction Tree-based Method (Method 1) • Presents a methodology to rapidly and accurately explore the cache design space. • It estimates cache miss rates for many different cache configurations simultaneously and investigates the effect of different cache configurations on the energy and performance of a system. • Simultaneous evaluation can be performed rapidly by taking advantage of the high correlation between the cache behavior of different cache configurations.
ASP-DAC paper General Simulation Process • Cache address fields: tag, then bit positions m(max) ... m(min) ... 0 • Structures: array (stores tree addresses), tree, linked list, cache miss table • Step 1: Index the array. • Step 2: Go to the tree address and traverse the list. • Step 3: Find the node and go to its linked list. • Step 4: Look for a match.
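ASP-DAC paper Tree example • Cache address: 1010 (parentheses below mark the bits used as the index; the rest is the tag) • Cache Size 2: 101(0), 101(1) • Cache Size 4: 10(00), 10(10), 10(01), 10(11) • Cache Size 8: 1(000), 1(100), 1(010), 1(110), 1(001), 1(101), 1(011), 1(111) • Assume each forest has a fixed line size. • The address bits (k) are used to find the path through the tree.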
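The path that one address takes through the tree can be sketched as follows. This is an illustrative reading of the slide, not the paper's code: `tree_path` is a hypothetical name, and the sketch assumes the line-offset bits have already been stripped so that the low k bits of the address select the node at the level with 2^k entries.

```python
def tree_path(address, max_index_bits):
    """Return (size, index, tag) for each tree level, smallest cache first."""
    path = []
    for k in range(1, max_index_bits + 1):
        size = 1 << k                  # number of entries at this level
        index = address & (size - 1)   # low k bits pick the node
        tag = address >> k             # remaining bits are the tag
        path.append((size, index, tag))
    return path

# The address from the slide: 1010.
path = tree_path(0b1010, 3)
```

For 1010 this visits index 0 at size 2, index 10 at size 4, and index 010 at size 8, which corresponds to the nodes 101(0), 10(10), and 1(010) in the example.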
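ASP-DAC paper Linked list, set associative • Each list is ordered from the most recently used element to the least recently used element. • [Figure: list positions covered by Assoc. = 1, Assoc. = 2, and Assoc. = 4, with Hit/Miss labels for a lookup] • Table for Miss Count (columns L, N, A, # of Cache Misses): rows (1, 4, 1, 0) and (1, 4, 1, 1) • The rest of the address is used as the tag.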
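ASP-DAC paper Linked list LRU update • [Figure: the list after an LRU update, ordered from the most recently used element to the least recently used element, for Assoc. = 1, Assoc. = 2, and Assoc. = 4] • Table for Miss Count (columns L, N, A, # of Cache Misses): rows (1, 4, 1, 0) and (1, 4, 1, 1)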
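One way to see why a single list serves several associativities at once: under LRU, a tag found at position p (0 = most recently used) would still be resident in any cache with more than p ways. The sketch below illustrates that idea with a plain Python list; `lru_touch` and `hits_by_assoc` are hypothetical names, and the list contents are made up.

```python
def lru_touch(lru, tag, max_len=4):
    """Move tag to the front of lru; return its old position, or None if absent."""
    try:
        pos = lru.index(tag)
        lru.pop(pos)
    except ValueError:
        pos = None
    lru.insert(0, tag)
    del lru[max_len:]   # never keep more entries than the largest associativity
    return pos

def hits_by_assoc(pos, assocs=(1, 2, 4)):
    """A tag found at position pos hits in every cache with more than pos ways."""
    return {a: pos is not None and pos < a for a in assocs}

lru = ["C", "B", "A"]          # most recently used first
pos = lru_touch(lru, "B")      # B was second in the list
result = hits_by_assoc(pos)
```

Here the lookup of B misses for associativity 1 and hits for associativities 2 and 4, and B moves to the head of the list.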
Detailed Trace Example Example Specifications: • Cache Size (N) will vary from 32 bits max to 2 bits min • Associativity (A) will vary from 4 max to 1 min • Cache Set Size (M) will vary from 8 max to 1 min • Assume a fixed line size (L)
Detailed Trace Example • Instruction trace (address bits split as k | m): 1. 000000 => 0; 2. 001000 => 8; 3. 010000 => 16; 4. 000000 => 0; 5. 001000 => 8; 6. 000000 => 0; 7. 010000 => 16 • [Figure: tree and linked-list contents for M=1, M=2, M=4, and M=8 with Assoc. = 1]
Detailed Trace Example • Instruction trace (address bits split as k | m): 1. 000000 => 0; 2. 001000 => 8; 3. 010000 => 16; 4. 000000 => 0; 5. 001000 => 8; 6. 000000 => 0; 7. 010000 => 16 • [Figure: tree and linked-list contents for M=1, M=2, M=4, and M=8 with Assoc. = 2]
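The seven-reference trace can be replayed for every (M, A) pair in one pass. The following is a simplified sketch of the idea using plain lists rather than the paper's tree and linked-list structures. `single_pass_misses` is a hypothetical name, and `OFFSET_BITS = 3` is an assumption read off the k | m split of the six-bit addresses (the low three bits treated as the line offset).

```python
from itertools import product

TRACE = [0, 8, 16, 0, 8, 0, 16]   # addresses from the slide
OFFSET_BITS = 3                   # assumption: low 3 bits are the line offset
SETS = (1, 2, 4, 8)               # values of M
ASSOCS = (1, 2, 4)                # values of A

def single_pass_misses(trace):
    """Count misses for every (M, A) pair while reading the trace once."""
    misses = {cfg: 0 for cfg in product(SETS, ASSOCS)}
    # One LRU list per set, per set count; shared by all associativities.
    lru = {m: [[] for _ in range(m)] for m in SETS}
    for addr in trace:
        block = addr >> OFFSET_BITS
        for m in SETS:
            ways = lru[m][block % m]
            pos = ways.index(block) if block in ways else None
            for a in ASSOCS:
                if pos is None or pos >= a:
                    misses[(m, a)] += 1
            if pos is not None:
                ways.pop(pos)
            ways.insert(0, block)
            del ways[max(ASSOCS):]   # keep at most the largest associativity
    return misses

table = single_pass_misses(TRACE)
```

Under these assumptions the three blocks fall into separate sets once M reaches 4, so only the three cold misses remain for M=4 and M=8, while the single-set, direct-mapped case misses on all seven references.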
ASP-DAC Results • Evaluated using benchmarks from Mediabench. • On average, this method explores the design space 45 times faster than Dinero IV. • It still has 100% accuracy.
Introduction Table-based Method • Two cache evaluation techniques for exploring the design space are analytical modeling and execution-based evaluation. • SPCE presents a simplified yet efficient way to extract locality properties for an entire cache configuration design space in a single pass. • This part covers related work, an overview of SPCE, the properties used in addressing-behavior analysis to estimate the cache miss rate, the experiment, and the results.
Related Work • Much existing research in this area needs multiple passes to explore all configurable parameters or employs large and complex data structures, which restricts its applicability. • Algorithms for single-pass cache simulation examine a set of caches concurrently: Mattson; Hill and Smith; Sugumar and Abraham; Cascaval and Padua. • Janapsatya et al. present a technique to evaluate all different cache parameters simultaneously, but it was not designed with a hardware implementation in mind. • This paper's methodology uses simple array structures, which are more amenable to a lightweight hardware implementation.
Definitions • A time-ordered sequence of referenced addresses is T[t] (t is a positive integer) with length |T|, such that T[t] is the t-th address referenced. • If T[ti] >> b = T[ti + d] >> b (>> is a right shift by b bits), then the addresses T[ti] and T[ti + d] are references to the same cache block of 2^b words. • Define d as the delay: the number of unique cache references occurring between any two references where T[ti] >> b = T[ti + d] >> b.
Definitions • Evaluate the locality in the sequence of addresses T[ti] of a running application ai by counting the occurrences where both addresses fall in the same block after a right shift by b bits, T[ti] >> b = T[ti + d] >> b, and registering each occurrence in the cell L(b, d) of the locality table (2^b is the block size, d is the delay).
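A locality table of this kind can be filled in one pass over the trace. The sketch below is one reading of the definition, not the SPCE implementation: `locality_table` is a hypothetical name, d is taken to be the number of distinct other blocks referenced since the previous reference to the same block, and the sample trace is illustrative.

```python
from collections import defaultdict

def locality_table(trace, b_values):
    """Return L[(b, d)] = number of re-references to a 2**b block with delay d."""
    L = defaultdict(int)
    # One stack of distinct blocks (most recent first) per block size.
    stacks = {b: [] for b in b_values}
    for addr in trace:                  # the trace is read only once
        for b in b_values:
            stack = stacks[b]
            block = addr >> b
            if block in stack:
                d = stack.index(block)  # distinct blocks touched since last use
                L[(b, d)] += 1
                stack.pop(d)
            stack.insert(0, block)
    return dict(L)

L = locality_table([0, 8, 16, 0, 8, 0, 16], b_values=(3, 4))
```

First-time references register nothing in the table; they are the cold misses and can be recovered later as the trace length minus the sum of a row.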
Fully-Associative • A fully-associative cache configuration is defined by the notation cj(b, n), where b defines the line size in terms of words and n the total number of lines in the cache. • The locality table L(b, d) provides an efficient way to estimate the cache miss rate of fully-associative caches.
Fully-Associative Example • [Figure: a sequence of addresses T and the locality table L(b, d) for the trace T, with example delays d=2 and d=3 marked]
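Given a locality table, a miss count for a fully-associative cache c(b, n) can be read off without another pass over the trace. The sketch below assumes LRU replacement and assumes that a re-reference with delay d hits whenever d < n. `fully_associative_misses` is a hypothetical name and the table values are made up for illustration.

```python
def fully_associative_misses(L, b, n, trace_len):
    """Estimate misses of a fully-associative LRU cache c(b, n) from table L."""
    hits = sum(count for (bb, d), count in L.items() if bb == b and d < n)
    return trace_len - hits

# Hypothetical locality table for a 7-reference trace with b = 3.
L = {(3, 1): 1, (3, 2): 3}
misses = {n: fully_associative_misses(L, 3, n, 7) for n in (1, 2, 4)}
```

Dividing each miss count by the trace length gives the miss rate for that number of lines.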
Set-Associative • Most real-world cache devices are built as direct-mapped or set-associative structures. • Because of mapping conflicts, L alone cannot be used to estimate misses, so define s as the number of sets, independent of the associativity; for a direct-mapped cache, set size = 1 and s = n. • To analyze the cache conflicts, we build the conflict table Kα(b, s) (2^b is the block size, s is the number of sets), which is composed of α layers, one for each associativity explored.
Set-Associative • The value stored in each element of the table Kα(b, s) indicates how many times the same block (size 2^b) is repeatedly referenced and results in a hit. • A given cache configuration with level of associativity w is capable of overcoming no more than w − 1 mapping conflicts. • The number of cache hits is determined by summing up the cache hits from layer α = 1 up to its respective layer α = w, where w refers to the associativity.
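The layer-summing rule can be illustrated with a small sketch. This is one possible reading of the conflict table, not the SPCE implementation: `conflict_table` and `hits` are hypothetical names, a conflict is taken to be a distinct block that maps to the same set and was referenced since the previous use of the block, and a re-reference that met c conflicts is recorded in layer α = c + 1. The trace is illustrative.

```python
from collections import defaultdict

def conflict_table(trace, b, s, max_assoc):
    """Return K[alpha] for one (b, s): hits that needed exactly alpha ways."""
    K = defaultdict(int)
    stack = []                                  # distinct blocks, most recent first
    for addr in trace:
        block = addr >> b
        if block in stack:
            pos = stack.index(block)
            # Conflicts: blocks touched since the last use that share the set.
            conflicts = sum(1 for other in stack[:pos] if other % s == block % s)
            if conflicts < max_assoc:
                K[conflicts + 1] += 1
            stack.pop(pos)
        stack.insert(0, block)
    return dict(K)

def hits(K, w):
    """Hits for associativity w: sum of layers 1..w."""
    return sum(K.get(alpha, 0) for alpha in range(1, w + 1))

trace = [0, 8, 16, 0, 8, 0, 16]
K = conflict_table(trace, b=3, s=2, max_assoc=4)
```

Misses for associativity w are then `len(trace) - hits(K, w)`, which reflects the point that w ways can absorb at most w − 1 conflicts.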
Experiment Setup • SPCE is implemented as a standalone C++ program that processes an instruction address trace file. Instruction address traces were gathered with SimpleScalar for 9 arbitrarily chosen benchmarks from Motorola's PowerStone benchmark suite. • Since 64 bytes is the largest block size in the design space, bmax=3; smax is defined by the configuration with the maximum number of sets in the design space. • Performance is examined for this suite of benchmarks with SPCE and with a very popular trace-driven cache simulator (DineroIV).
Results • Compare performance of SPCE and DineroIV for the 45 cache configurations.
Conclusion • Both the tree-based method and the table-based method (SPCE) make cache miss rate estimation easier and reduce simulation time. • Compared to DineroIV, the average speedup is around 30 times. • Future work includes extending the design space exploration to consider a second level of cache.