A One-Shot Configurable-Cache Tuner for Improved Energy and Performance

Ann Gordon-Ross1, Pablo Viana2, Frank Vahid1, Walid Najjar1, and Edna Barros3
1Dept. of Computer Science & Engineering, University of California, Riverside, USA
2Campus Arapiraca, Federal University of Alagoas, Brazil
3Centro de Informática, Federal University of Pernambuco, Brazil
This work was supported by the U.S. National Science Foundation and by the Semiconductor Research Corporation.

Introduction
• Memory access: 50% of an embedded processor's system power
• Caches are power hungry
  • ARM920T (Segars 01)
  • M*CORE (Lee/Moyer/Arends 99)
• Thus, caches are a good candidate for optimization
[Figure labels: Processor; L1 I Cache; L1 D Cache; Main Mem; 53%]

Introduction
• Different applications have vastly different cache requirements
  • Total size, line size, and associativity
• Cache parameters that don't match an application's behavior can waste over 60% of energy (Gordon-Ross 05)
• Cache tuning is the process of determining the appropriate cache parameters for an application
[Figure labels: 2KB, 32 byte, direct-mapped; 4KB, 16 byte, 2-way; 8KB, 64 byte, 4-way]

Runtime Cache Tuning
• The best cache configuration can be determined by searching the design space during runtime
• Runtime cache tuning is transparent to the designer and end user, but incurs runtime overhead in terms of energy and performance
[Figure labels: Download application; Executing in base configuration; Cache Tuning; Energy; Tunable cache (TC); Tuning hw]

Contribution
• We introduce specialized hardware for non-intrusive runtime cache evaluation
  • Temporary energy overhead and no performance overhead
• Single-pass multi-cache evaluation (SPCE)
  • Special hardware simultaneously evaluates all cache configurations
  • Enables switching to the best configuration in one shot
[Figure labels: Download application; Executing in base configuration; Energy; Tunable cache (TC); SPCE; "SPCE causes an increase in energy but no performance overhead"; "Switch to best config in 'one-shot'"]
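The "one-shot" switch can be pictured as a single selection over per-configuration hit and miss counts. The sketch below is only an illustration: the slides do not give an energy model, so the function name `pick_best_config`, the per-hit and per-miss energy inputs, and all numbers in the usage lines are hypothetical placeholders rather than values from this work.

```python
def pick_best_config(stats, energy_per_hit, energy_per_miss):
    """Return the configuration with the lowest estimated energy.

    stats maps a configuration to a (hits, misses) pair, as an evaluator
    such as SPCE might report. The two energy dictionaries are assumed
    inputs (hypothetical; the slides do not specify an energy model).
    """
    def total_energy(config):
        hits, misses = stats[config]
        return hits * energy_per_hit[config] + misses * energy_per_miss[config]

    return min(stats, key=total_energy)


# Usage with made-up numbers (size, line size in bytes, associativity).
small = ("2KB", 32, 1)
large = ("8KB", 64, 4)
stats = {small: (90, 10), large: (98, 2)}
best = pick_best_config(
    stats,
    energy_per_hit={small: 1.0, large: 3.0},
    energy_per_miss={small: 20.0, large: 25.0},
)
print(best)
```

Because every configuration's counts are available after one pass, the selection is a single minimum rather than a sequence of trial configurations.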

SPCE Key Points
• Contributions compared to previous methods
  • Evaluates a highly configurable cache
    • Previous methods offer little configurability
  • Little hardware overhead
    • Simple data structures
    • Elementary operations

SPCE
• Monitors the address stream to extract cache hit information for all configurations
  • 24 different configs
• Fully-associative cache example (64-bit architecture)
  • Address stream: t0 = 0, t1 = 8, t2 = 16, t3 = 0, t4 = 8, t5 = 0, t6 = 16
  • For each line size: >> 2^0*8, >> 2^1*8, >> 2^2*8
  • Number of conflicts determines cache sizes that would result in a hit
  • A cache with 2 lines and 2^1 words per line (32 bytes) will have 5 hits and 7-5=2 misses
[Figure: address stream feeding a table of stored hit info, indexed by line size (number of words) b and number of lines]
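The fully-associative example can be replayed in software. The following is a minimal sketch, assuming LRU replacement and 8-byte words (the 64-bit architecture of the example); the function name `stack_hits` and the list-based stack are illustrative choices, not the hardware design. For each access it counts the conflicts, meaning the distinct lines touched since the previous access to the same line; a cache with more lines than conflicts would hit.

```python
from collections import Counter

WORD_BYTES = 8  # 64-bit architecture, as in the slide's example


def stack_hits(addresses, line_words, max_lines):
    """Return {n: hits} for fully associative LRU caches with n = 1..max_lines lines."""
    line_bytes = line_words * WORD_BYTES
    stack = []                  # line addresses, most recently used last
    conflict_counts = Counter() # conflicts -> number of accesses with that count
    for addr in addresses:
        line = addr // line_bytes
        if line in stack:
            # Distinct lines touched since the last access to this line.
            conflicts = len(stack) - 1 - stack.index(line)
            conflict_counts[conflicts] += 1
            stack.remove(line)
        stack.append(line)      # first-time accesses are misses for every size
    hits, running = {}, 0
    for n in range(1, max_lines + 1):
        running += conflict_counts[n - 1]  # n lines hit when conflicts < n
        hits[n] = running
    return hits


# Address stream from the slide: t0..t6.
trace = [0, 8, 16, 0, 8, 0, 16]
print(stack_hits(trace, line_words=2, max_lines=2))  # → {1: 3, 2: 5}
```

With 2 words per line and 2 lines, the sketch reports 5 hits out of 7 accesses, which agrees with the 5 hits and 2 misses stated on the slide. One pass over the trace yields the hit count for every number of lines at once, which is the property that makes single-pass evaluation possible.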

SPCE
• SPCE determines hits for other set-associativities by counting the number of unique conflicts in the address trace
[Figure: tables (multiple layers) of stored hit info for direct-mapped, 2-way, and 4-way caches, indexed by line size (number of words) b and number of sets s]
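The same conflict counting extends to set-associative caches if only lines that map to the same set are counted as conflicts. The sketch below is a software approximation under stated assumptions: LRU replacement within each set, 8-byte words, and a set index taken as the line address modulo the number of sets. The name `set_assoc_hits` and the per-set stacks are hypothetical; the slides describe multi-layer tables in hardware, not this data structure.

```python
from collections import Counter

WORD_BYTES = 8  # assumed 64-bit words, matching the earlier example


def set_assoc_hits(addresses, line_words, num_sets, max_ways):
    """Return {ways: hits} for LRU caches with num_sets sets and 1..max_ways ways."""
    line_bytes = line_words * WORD_BYTES
    stacks = {}                 # set index -> lines in that set, most recent last
    conflict_counts = Counter()
    for addr in addresses:
        line = addr // line_bytes
        stack = stacks.setdefault(line % num_sets, [])
        if line in stack:
            # Unique conflicting lines in this set since the last access to `line`.
            conflicts = len(stack) - 1 - stack.index(line)
            conflict_counts[conflicts] += 1
            stack.remove(line)
        stack.append(line)
    hits, running = {}, 0
    for ways in range(1, max_ways + 1):
        running += conflict_counts[ways - 1]  # hit when conflicts < ways
        hits[ways] = running
    return hits


# Same 7-access trace as the fully-associative example, 1 word per line, 2 sets.
trace = [0, 8, 16, 0, 8, 0, 16]
print(set_assoc_hits(trace, line_words=1, num_sets=2, max_ways=2))  # → {1: 2, 2: 4}
```

Here the direct-mapped cache (1 way) gets 2 hits and the 2-way cache gets 4, so one pass again covers several associativities for a given line size and set count.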

SPCE - Hardware
• Designed and evaluated in synthesizable VHDL
[Figure label: stack]

Results - Energy Savings
• Energy savings compared to exploring the design space using a state-of-the-art intrusive heuristic (Zhang 03)
• Values less than 1 denote an energy increase
[Figure annotation: 4.6x less energy expended]

Results - Tuning Speedup
• Tuning speedup obtained compared to a state-of-the-art intrusive heuristic
[Figure annotation: 7.7x faster]

Overheads
• Evaluated SPCE compared to the ARM920T
• Area
  • 12% area overhead
  • Due in large part to the TCAM stack structure
• Power
  • Temporary 2.2x increase in power during the short tuning cycle
  • The application need only iterate 4 times for the average power overhead to reduce to 1%

Conclusions
• SPCE is a specialized hardware structure to evaluate all cache configurations simultaneously
  • Enables non-intrusive runtime cache evaluation
  • Enables switching to the best cache configuration in one shot
• Compared to a state-of-the-art intrusive cache tuning heuristic
  • 4.6x less energy expended
  • 7.7x speedup in tuning time
• 12% area overhead compared to ARM920T
• Temporary 2.2x increase in power during the short tuning time
  • Only 4 application iterations to recoup power
