Clive Butler

Single-pass Cache Optimization Clive Butler and Ruofan Yang

Introduction of Problem • Embedded systems execute a single application or a class of applications repeatedly. • An emerging methodology for designing embedded systems utilizes configurable processors. • Configurable cache parameters: size, associativity, and line size. • An energy model and an execution time model are developed to find the best cache configuration for the given embedded application. • Current processor design methodologies rely on reserving a large enough chip area for caches while conforming to area, performance, and energy cost constraints. • A customized cache allows designers to meet tighter energy consumption, performance, and cost constraints.

Introduction of Problem • In existing low-power processors, cache memory is known to consume a large portion of the on-chip energy. • Cache can consume 43% to 50% of the total system power of a processor. • In embedded systems where a single application or a class of applications is repeatedly executed on a processor, the memory hierarchy can be customized so that an optimal configuration is achieved. • The right choice of cache configuration for a given application can have a significant impact on overall performance and energy consumption.

Introduction of Problem • Estimating the hit and miss rates is fairly easy using tools such as Dinero. • However, doing so for many cache sizes, associativities, and line sizes can be enormously time consuming. • Using Dinero to estimate the cache miss rate for a number of cache configurations means that a large program trace must be read and evaluated repeatedly, once per configuration.
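The cost described above can be illustrated with a minimal sketch of a one-configuration-per-pass sweep. This is not Dinero's implementation; the function names `count_misses` and `sweep`, the LRU replacement policy, and the `(num_sets, assoc, line_bytes)` tuple layout are assumptions made for illustration only.

```python
def count_misses(trace, num_sets, assoc, line_bytes):
    """Simulate ONE cache configuration with LRU replacement (assumed policy)."""
    sets = [[] for _ in range(num_sets)]  # each set is a list, most recent first
    misses = 0
    for addr in trace:
        block = addr // line_bytes
        lru = sets[block % num_sets]
        if block in lru:
            lru.remove(block)          # hit: will be moved to the front below
        else:
            misses += 1
            if len(lru) == assoc:
                lru.pop()              # evict the least recently used block
        lru.insert(0, block)
    return misses


def sweep(trace, configs):
    """Brute-force exploration: the whole trace is re-read for every configuration."""
    return {cfg: count_misses(trace, *cfg) for cfg in configs}
```

With C configurations and a trace of length |T|, this sweep performs C full passes over the trace, which is the overhead the single-pass methods in the following slides try to avoid.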

Dinero • Dinero is a trace-driven cache simulator. • Simulations are repeatable. • One can simulate either a unified cache (data and instructions cached together) or separate instruction and data caches. • Cheaper than hardware-based measurement.

Dinero • A din record is a two-tuple: label and address. • Label values: 0 read data, 1 write data, 2 instruction fetch, 3 escape record, 4 escape record (causes cache flush). • Cache parameters are set by command-line options. • Dinero uses the priority stack method of memory hierarchy simulation to increase flexibility and improve simulator performance in highly associative caches.
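As a small illustration of the record format above, the following sketch reads din-style lines into (label, address) pairs. The function name `parse_din` is hypothetical, and treating the address field as hexadecimal and whitespace-separated is an assumption about the trace file, not something stated on the slide.

```python
# Label codes as listed on the slide.
DIN_LABELS = {
    0: "read data",
    1: "write data",
    2: "instruction fetch",
    3: "escape record",
    4: "escape record (causes cache flush)",
}


def parse_din(lines):
    """Yield (label, address) tuples from din-style text lines.

    Assumption: each record is "<label> <address>" with the address in hex.
    Lines with fewer than two fields are skipped.
    """
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue
        label = int(fields[0])
        address = int(fields[1], 16)
        yield label, address
```

For example, `list(parse_din(["2 0", "0 1f"]))` gives one instruction fetch at address 0 and one data read at address 31.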

Introduction: Tree-based Method • Presents a methodology to rapidly and accurately explore the cache design space. • This is done by estimating cache miss rates for many different cache configurations simultaneously and by investigating the effect of different cache configurations on the energy and performance of a system. • Simultaneous evaluation can be performed rapidly by taking advantage of the high correlation between the cache behavior of different cache configurations.

ASP-DAC paper General Simulation Process • Cache address fields: tag | m(max) ….. m(min) ….. 0 • Step 1: Index into the array (stores tree addresses) • Step 2: Go to tree addr. and traverse the list • Step 3: Find the node and go to its linked list • Step 4: Look for a match • [Figure: array, tree, linked list, and cache miss table]

ASP-DAC paper Tree example • Cache address (tag): 1010 • Cache Size 2: 101(0), 101(1) • Cache Size 4: 10(00), 10(10), 10(01), 10(11) • Cache Size 8: 1(000), 1(100), 1(010), 1(110), 1(001), 1(101), 1(011), 1(111) • Assume each forest has a fixed line size • The bits in parentheses (k) are used to find the path

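ASP-DAC paper Linked list for set-associative caches • [Figure: a linked list ordered from the most recently used element to the least recently used element, with markers for Assoc. = 1, Assoc. = 2, and Assoc. = 4 and the hit/miss outcome at each marker] • Table for Miss Count (columns L, N, A, # of Cache Miss): (1, 4, 1, 0), then (1, 4, 1, 1) • The rest of the address is used as the tag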

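ASP-DAC paper Linked List LRU update • [Figure: the linked list after the LRU update, ordered from the most recently used element to the least recently used element, with markers for Assoc. = 1, Assoc. = 2, and Assoc. = 4] • Table for Miss Count (columns L, N, A, # of Cache Miss): (1, 4, 1, 0), then (1, 4, 1, 1)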
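The linked-list lookup and LRU update can be sketched as follows. This is an illustrative reading of the slides, not the paper's code: a Python list stands in for the linked list (most recently used element first), and the function name `access` and the dictionary of per-associativity miss counters are assumptions. The idea it shows is that one list of length equal to the largest associativity answers hit or miss for every smaller associativity at once, because an element found at depth p is present in any LRU set with more than p ways.

```python
def access(lru, tag, misses):
    """Look up `tag` in one set's LRU list and update per-associativity miss counts.

    lru    -- list of tags, most recently used first (stands in for the linked list)
    misses -- dict mapping associativity -> miss count, updated in place
    """
    depth = lru.index(tag) if tag in lru else None
    for assoc in misses:
        # A cache with `assoc` ways holds only the first `assoc` list elements.
        if depth is None or depth >= assoc:
            misses[assoc] += 1
    if depth is not None:
        lru.pop(depth)
    lru.insert(0, tag)            # LRU update: move the element to the front
    del lru[max(misses):]         # keep at most max-associativity elements
```

Calling `access` once per reference fills in the miss counts for associativities 1, 2, and 4 in a single traversal of the trace.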

Detailed Trace Example • Example specifications: • Cache size (N) will vary from 32 bits max to 2 bits min • Associativity (A) will vary from 4 max to 1 min • Number of cache sets (M) will vary from 8 max to 1 min • Assume a fixed line size (L)

Detailed Trace Example (Assoc. = 1) • Instruction trace (k | m): 1. 000000 => 0, 2. 001000 => 8, 3. 010000 => 16, 4. 000000 => 0, 5. 001000 => 8, 6. 000000 => 0, 7. 010000 => 16 • M=1: all references map to the single set • M=2: set 1 receives 8, 8; set 0 receives 0, 16, 0, 0, 16 • M=4: set 10 receives 16, 16; set 01 receives 8, 8; set 00 receives 0, 0, 0 • M=8: set 010 receives 16, 16; set 001 receives 8, 8; set 000 receives 0, 0, 0

Detailed Trace Example (Assoc. = 2) • Instruction trace (k | m): 1. 000000 => 0, 2. 001000 => 8, 3. 010000 => 16, 4. 000000 => 0, 5. 001000 => 8, 6. 000000 => 0, 7. 010000 => 16 • [Figure: two-entry LRU lists for M=1, M=2 (sets 1 and 0), M=4 (sets 11 to 00), and M=8 (sets 111 to 000) as the trace is processed]
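The trace example can be reproduced with a short single-pass sketch. It is a simplification under stated assumptions, not the paper's tree-based algorithm: plain per-set lists replace the tree and linked lists, the name `single_pass_misses` is hypothetical, and the 3-bit line offset (`line_bits=3`) is inferred from the way addresses 0, 8, and 16 fall into sets 00, 01, and 10 on the slide.

```python
def single_pass_misses(trace, line_bits, set_counts, assocs):
    """Count LRU misses for every (set count M, associativity A) in one trace pass."""
    max_assoc = max(assocs)
    # One LRU list per set, for every set count in the design space.
    lists = {m: [[] for _ in range(m)] for m in set_counts}
    misses = {(m, a): 0 for m in set_counts for a in assocs}
    for addr in trace:
        block = addr >> line_bits
        for m in set_counts:
            lru = lists[m][block % m]
            # Depth in the list decides hit or miss for each associativity.
            pos = lru.index(block) if block in lru else max_assoc
            for a in assocs:
                if pos >= a:
                    misses[(m, a)] += 1
            if block in lru:
                lru.remove(block)
            lru.insert(0, block)      # most recently used goes to the front
            del lru[max_assoc:]       # keep only the max_assoc most recent blocks
    return misses
```

For the seven-reference trace on the slide, `single_pass_misses([0, 8, 16, 0, 8, 0, 16], 3, [1, 2, 4, 8], [1, 2, 4])` reads the trace once and fills in all twelve configurations; with M=4 or M=8 every block has its own set, so only the three first-time references miss.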

ASP-DAC Results • Using benchmarks from Mediabench • On average, this method explores the design space 45 times faster than Dinero IV • while retaining 100% accuracy.

Introduction: Table-based Method • Two techniques for evaluating the cache design space are analytical modeling and execution-based evaluation • SPCE presents a simplified, yet efficient way to extract locality properties for an entire cache configuration design space in a single pass • This part covers related work, an overview of SPCE, the properties used in addressing behavior analysis to estimate the cache miss rate, the experiment, and the results

Related Work • Much existing research in this area needs multiple passes to explore all configurable parameters or employs large and complex data structures, which restricts its applicability • Algorithms for single-pass cache simulation examine a set of caches concurrently (Mattson; Hill and Smith; Sugumar and Abraham; Cascaval and Padua) • Janapsatya et al. present a technique to evaluate all different cache parameters simultaneously, but it is not designed with a hardware implementation in mind • This paper’s methodology uses simple array structures, which are more amenable to a lightweight hardware implementation

SPCE Overview

Definitions • Time-ordered sequence of referenced addresses: T[t] (t is a positive integer), of length |T|, such that T[t] is the t-th address referenced • If T[ti] >> b = T[ti + d] >> b, then the addresses T[ti] and T[ti + d] are references to the same cache block of 2^b words • Define d as the delay, or the number of unique cache references occurring between any two references where T[ti] >> b = T[ti + d] >> b

Definitions • Evaluate the locality in the sequence of addresses T[ti] of a running application ai by counting the occurrences where T[ti] >> b = T[ti + d] >> b and registering each one in the cell L(b, d) of the locality table (2^b is the block size, d is the delay).
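The locality table can be sketched in a few lines. This is an interpretation rather than the SPCE implementation: the function name `locality_table` is hypothetical, d is taken here to be the number of distinct other blocks referenced since the previous reference to the same block (the original paper may count the delay slightly differently), and a per-b list ordered by recency is used to obtain that count.

```python
from collections import defaultdict


def locality_table(trace, b_max):
    """Build L(b, d) for b = 0..b_max in one pass over a word-address trace.

    Assumption: d = number of distinct other blocks referenced since the last
    reference to the same block of 2**b words.
    """
    table = defaultdict(int)
    stacks = [[] for _ in range(b_max + 1)]   # per b: blocks, most recent first
    for addr in trace:
        for b, stack in enumerate(stacks):
            block = addr >> b                 # same block iff shifted addresses match
            if block in stack:
                d = stack.index(block)        # distinct blocks seen in between
                table[(b, d)] += 1
                stack.remove(block)
            stack.insert(0, block)
    return dict(table)
```

First-time references to a block have no earlier match, so they do not appear in any cell of the table.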

Fully-Associative • A fully-associative cache configuration is defined by the notation cj(b, n), where b defines the line size (2^b words) and n is the total number of lines in the cache • The locality table L(b, d) provides an efficient way to estimate the cache miss rate of fully-associative caches

Fully-Associative Example • [Figure: a sequence of addresses T and the locality table for the trace T, with delays d=2 and d=3 marked]
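One way to read a miss count out of the locality table is sketched below. It assumes LRU replacement and assumes d counts the distinct other blocks referenced between two references to the same block, so that a fully-associative cache with n lines hits exactly when d < n. The function name `fully_assoc_misses` and the dictionary representation of L(b, d) are illustrative choices, not taken from the paper.

```python
def fully_assoc_misses(table, trace_len, b, n):
    """Estimate misses of a fully-associative LRU cache c(b, n) from L(b, d).

    table     -- dict mapping (b, d) -> count, i.e. the locality table
    trace_len -- |T|, the number of references in the trace
    """
    hits = sum(count for (tb, d), count in table.items() if tb == b and d < n)
    return trace_len - hits
```

For instance, with the table {(0, 1): 1, (0, 2): 3} from a seven-reference trace, a one-line cache (b=0, n=1) sees no hits and therefore seven misses, while a four-line cache captures all four recorded reuses and misses only on the three first-time references.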

Set-Associative • Most real-world cache devices are built as direct-mapped or set-associative structures • Because of mapping conflicts, L alone cannot be used to estimate misses, so define s as the number of sets, independent of the associativity; for a direct-mapped cache each set holds one line, so s = n • To analyze the cache conflicts, we build the conflict table Kα(b, s) (2^b is the block size, s is the number of sets), which is composed of α layers, one for each associativity explored

Set-Associative

Set-Associative • The value stored in each element of the table Kα(b, s) indicates how many times the same block (size 2^b) is repeatedly referenced and results in a hit. • A given cache configuration with level of associativity w is capable of overcoming no more than w − 1 mapping conflicts. • The number of cache hits is determined by summing up the cache hits from layer α = 1 up to its respective layer α = w, where w refers to the associativity.
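The summation described above can be written compactly. The first identity restates the slide; the second, which converts hits into misses, is an added assumption that every one of the |T| references is either a hit or a miss, with |T| the trace length from the Definitions slide.

```latex
\mathrm{hits}(b, s, w) = \sum_{\alpha=1}^{w} K_{\alpha}(b, s),
\qquad
\mathrm{misses}(b, s, w) = |T| - \mathrm{hits}(b, s, w)
```

Here w is the associativity of the configuration being evaluated, so a direct-mapped cache (w = 1) uses only layer 1 of the conflict table.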

Algorithm Implementation

Experiment Setup • SPCE is implemented as a standalone C++ program that processes an instruction address trace file • Instruction address traces were gathered with SimpleScalar for 9 arbitrarily chosen benchmarks from Motorola’s Powerstone benchmark suite • Since 64 bytes is the largest block size in the design space utilized, bmax=3; smax is defined by the configuration with the maximum number of sets in the design space • Performance is examined for this suite of benchmarks with SPCE and also with a very popular trace-driven cache simulator (DineroIV)

Results • Compare performance of SPCE and DineroIV for the 45 cache configurations.

Conclusion • Both the tree-based method and the table-based method (SPCE) make cache miss rate estimation easier and reduce simulation time. • Compared to DineroIV, the average speedup is around 30 times. • Future work includes extending the design space exploration to consider a second level of cache.
