Automatic Tuning of Two-Level Caches to Embedded Applications

Nikil Dutt
Center for Embedded Computer Systems, School for Information and Computer Science, University of California, Irvine
Ann Gordon-Ross and Frank Vahid*
Department of Computer Science and Engineering, University of California, Riverside
*Also with the Center for Embedded Computer Systems, UC Irvine
This work was supported by the U.S. National Science Foundation and by the Semiconductor Research Corporation.

Introduction
• Memory access: 50% of an embedded processor’s system power
• Caches are power hungry
  • ARM920T (Segars 01)
  • M*CORE (Lee/Moyer/Arends 99)
• Thus, the cache is a good candidate for optimization
[Figure: memory hierarchy of processor, L1 cache, L2 cache, and main memory; one label reads 53%]

Motivation
• Tuning cache parameters to an application can save energy: 60% on average
  • Balasubramonian’00, Zhang’03
• Each application has different cache requirements
• One predetermined cache configuration can’t be best for all applications
• L1 cache size
  • Excess fetch and static energy if too large
  • Excess thrashing energy if too small

Motivation: L1 cache line size
• Excess fetch energy if line size too large
• Excess stall energy if line size too small

Motivation: L1 cache associativity
• Excess fetch energy per access if too high
• Excess miss energy if too low

Motivation
• By tuning these parameters, the cache can be customized to a particular application
[Figure: microprocessor, L1 cache, L2 cache, and main memory, with tuning applied to both caches; plot of energy over the possible cache configurations, with the callout "Choose lowest energy configuration"]

Related Work
• Configurable caches
  • Soft cores (ARM, MIPS, Tensilica, etc.)
  • Even for hard processors (Motorola M*Core - Malik ISLPED’00; Albonesi MICRO’00; Zhang ISCA’03)
• Configurable cache tuning
  • Mostly manually in practice: sub-optimal, time-consuming
  • L1 automated methods: Platune (Givargis TCAD’02, Palesi CODES’02), Zhang RSP’03
• Two-level caches becoming popular
  • More transistors on-chip available
  • Bigger gap between on-chip and off-chip accesses
• Need automated tuning for L1+L2

Challenge for Two-Level Cache Tuning
• One level: 10s of configurations
• Two levels: 100s/1000s of configurations
• Need efficient heuristic
  • Especially if used with simulation-based search
[Figure: Level 1 (total size, line size, associativity), say 50 configs, times Level 2 (total size, line size, associativity), 50 configs, gives 2500 configs]
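The growth in the search space can be checked with a few lines of Python. The figure of 50 configurations per level is the slide's own round number. The function names are hypothetical, and the per-level count is only a rough indication of what searching each level in isolation would cost.

```python
def joint_configs(l1_configs: int, l2_configs: int) -> int:
    """Exhaustive two-level search: every L1 choice pairs with every L2 choice."""
    return l1_configs * l2_configs


def separate_configs(l1_configs: int, l2_configs: int) -> int:
    """Searching each level exhaustively on its own, ignoring interactions."""
    return l1_configs + l2_configs


print(joint_configs(50, 50))     # 2500
print(separate_configs(50, 50))  # 100
```

The gap between the product and the sum is why a heuristic is attractive when every configuration costs a full simulation run.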

Two-Level Cache Tuning Goal
• Develop fast, good-quality heuristic for tuning two-level caches to embedded applications for reduced energy consumption
• Presently focus on separate I and D cache in both levels
[Figure: microprocessor, level 1 I-cache and D-cache, level 2 I-cache and D-cache, and main memory; labels "Tune Instruction Cache Hierarchy" and "Tune Data Cache Hierarchy"]

Configurable Cache Architecture
• Our target configurable cache architecture is based on Zhang/Vahid/Najjar’s “Highly-Configurable Cache Architecture for Embedded Systems,” ISCA 2003
• Base level one cache: 8 KB cache consisting of four 2 KB banks that can operate as 4 ways
• Way concatenation offers a 2-way or a direct-mapped variation
• Way shutdown offers a 2-way 4 KB cache or a direct-mapped 2 KB cache
• Way shutdown and way concatenation can be combined to offer a direct-mapped 4 KB cache

Configuration Space
• Cache parameters
  • Size - L1 cache: 2, 4, and 8 KBytes. L2 cache: 16, 32, and 64 KBytes
  • Line size (L1 or L2) - 16, 32, and 64 Bytes
    • 16 byte physical base line size
  • Associativity (L1 or L2) - Direct-mapped, 2-way, and 4-way
• 432 possible configurations
  • For two levels, with separate I and D
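The slides do not say how the figure of 432 is reached. The sketch below enumerates the space under one set of assumptions that happens to reproduce it: each cache is built from four equal banks (2 KB for L1, 16 KB for L2), each way needs at least one active bank, the L2 line size is never smaller than the L1 line size, and the instruction and data hierarchies are counted separately. The line-size restriction in particular is an assumption, not something stated in the source, and all function names are hypothetical.

```python
from itertools import product

LINE_SIZES = (16, 32, 64)  # bytes, as listed on the slide


def size_assoc_options(bank_kb):
    """(size_kb, ways) pairs for a cache built from four equal banks.

    Assumption: 1, 2, or 4 banks stay active, and each way needs at
    least one active bank, so ways <= active banks.
    """
    return [(bank_kb * active, ways)
            for active in (1, 2, 4)
            for ways in (1, 2, 4)
            if ways <= active]


def hierarchy_configs(l1_bank_kb=2, l2_bank_kb=16):
    """All (L1, L2) pairs for one hierarchy (instruction or data).

    Assumption (not stated in the slides): the L2 line size is never
    smaller than the L1 line size.
    """
    l1 = list(product(size_assoc_options(l1_bank_kb), LINE_SIZES))
    l2 = list(product(size_assoc_options(l2_bank_kb), LINE_SIZES))
    return [(a, b) for a, b in product(l1, l2) if b[1] >= a[1]]


per_hierarchy = len(hierarchy_configs())
print(len(size_assoc_options(2)))  # 6 size/associativity pairs per level
print(per_hierarchy)               # 216
print(2 * per_hierarchy)           # 432, instruction plus data hierarchy
```

Without the bank and line-size restrictions, three values for each of three parameters would give 27 configurations per level and 729 per hierarchy, so some restriction of this kind is needed to arrive at 432.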

Experimental Environment
• Benchmarks: MediaBench, EEMBC
• SimpleScalar: hit and miss ratios for each configuration
• Energy models
  • Cache energy - Cacti
  • Main memory energy - Samsung memory
  • CPU stall energy - 0.18 micron MIPS uP
• Cache exploration heuristic: selects the chosen cache configuration
• Exhaustive search: took days; run for comparison purposes

First Heuristic: Tune Levels One-at-a-Time
• Tune L1, then L2
  • Initial L2: 64 KByte, 4-way, 64 byte line size
  • For best L1 found, tune L2 cache
• Tuned each cache using Zhang’s heuristic for one-level cache tuning (RSP’03)
[Figure: microprocessor, L1 cache, L2 cache, main memory]

First Heuristic: Tune Levels One-at-a-Time
• Zhang’s heuristic: Search parameters in order of importance (RSP’03)
• First search size
  • Begin with a 2 KByte, direct-mapped cache with a 16 Byte line size
  • Increase size to 4 KB
  • If the size increase yields energy improvements, increase the cache size to 8 KB
• Next search line size
  • For the lowest energy cache size, increase the line size to 32 Bytes
  • If the increase in line size yields a decrease in energy, increase the line size to 64 Bytes
• Finally, search associativity
  • For the lowest energy line size, increase the associativity to 2
  • If increasing the associativity yields a decrease in energy, increase the associativity to 4
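The search order described on this slide can be sketched as a greedy climb over one parameter at a time. This is a minimal illustration, not the authors' implementation: `climb`, `tune_single_level`, and `toy_energy` are hypothetical names, and `toy_energy` is an invented stand-in for the simulated energy of a configuration. The cap on associativity assumes each way needs at least one 2 KB bank.

```python
def climb(energy, config, key, values):
    """Raise one parameter step by step; stop at the first step that fails to lower energy."""
    best = dict(config, **{key: values[0]})
    best_energy = energy(best)
    for value in values[1:]:
        trial = dict(best, **{key: value})
        trial_energy = energy(trial)
        if trial_energy >= best_energy:
            break
        best, best_energy = trial, trial_energy
    return best


def tune_single_level(energy):
    """Search size, then line size, then associativity, starting from the smallest cache."""
    cfg = {"size_kb": 2, "line_b": 16, "ways": 1}
    cfg = climb(energy, cfg, "size_kb", [2, 4, 8])
    cfg = climb(energy, cfg, "line_b", [16, 32, 64])
    # Assumption: each way needs at least one 2 KB bank, so the size caps associativity.
    ways = [w for w in (1, 2, 4) if 2 * w <= cfg["size_kb"]]
    return climb(energy, cfg, "ways", ways)


def toy_energy(cfg):
    """Hypothetical stand-in for simulation: lowest at 4 KB, 32 B lines, 2-way."""
    return (abs(cfg["size_kb"] - 4)
            + abs(cfg["line_b"] - 32) / 16
            + abs(cfg["ways"] - 2))


print(tune_single_level(toy_energy))
# {'size_kb': 4, 'line_b': 32, 'ways': 2}
```

In a real flow, `energy` would wrap a simulator run plus the energy models, so each call is expensive and stopping early on each parameter is what keeps the search short.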

Results of First Heuristic
• Base cache configuration
  • Level 1 - 8 KByte, 4-way, 32 byte line
  • Level 2 - 64 KByte, 4-way, 64 byte line

First Heuristic
• Did not find optimal in most cases
  • Sometimes 200% or 300% worse
• The two levels should not be explored separately
  • Too much interdependence among L1 and L2 cache parameters
  • E.g., high L1 associativity decreases misses and thus reduces need for large L2
  • Dozens of other such interdependencies

Improved Heuristic – Basic Interlacing
• To more fully explore the dependencies between the two levels, we interlaced the exploration of the level one and level two caches
• Determine the best size of level one cache
• Determine the best size of level two cache

Improved Heuristic – Basic Interlacing (continued)
• Determine the best line size of level one cache
• Determine the best line size of level two cache

Improved Heuristic – Basic Interlacing (continued)
• Determine the best associativity of level one cache
• Determine the best associativity of level two cache
• Basic interlacing performed better than the initial heuristic, but there was still much room for improvement
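The interlaced order (L1 size, L2 size, L1 line size, L2 line size, L1 associativity, L2 associativity) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the starting point (smallest value of every parameter) is assumed, bank constraints on associativity are ignored for brevity, and `TARGET` and `toy_energy` are invented so the example runs without a simulator.

```python
import math

PARAMS = ("size_kb", "line_b", "ways")  # order of importance, per the slides
VALUES = {
    "l1": {"size_kb": [2, 4, 8], "line_b": [16, 32, 64], "ways": [1, 2, 4]},
    "l2": {"size_kb": [16, 32, 64], "line_b": [16, 32, 64], "ways": [1, 2, 4]},
}


def climb(energy, config, slot, values):
    """Raise one (level, parameter) slot until energy stops improving."""
    best = {**config, slot: values[0]}
    best_energy = energy(best)
    for value in values[1:]:
        trial = {**best, slot: value}
        trial_energy = energy(trial)
        if trial_energy >= best_energy:
            break
        best, best_energy = trial, trial_energy
    return best


def tune_interlaced(energy):
    """For each parameter in turn, tune it for L1 and then immediately for L2."""
    cfg = {(lvl, p): VALUES[lvl][p][0] for lvl in ("l1", "l2") for p in PARAMS}
    for param in PARAMS:
        for level in ("l1", "l2"):
            cfg = climb(energy, cfg, (level, param), VALUES[level][param])
    return cfg


# Hypothetical energy surface with a known minimum, standing in for simulation.
TARGET = {("l1", "size_kb"): 4, ("l1", "line_b"): 32, ("l1", "ways"): 2,
          ("l2", "size_kb"): 32, ("l2", "line_b"): 32, ("l2", "ways"): 1}


def toy_energy(cfg):
    return sum(abs(math.log2(cfg[k] / TARGET[k])) for k in TARGET)


print(tune_interlaced(toy_energy) == TARGET)  # True
```

The toy surface is separable, so it does not exercise the L1/L2 interdependence that motivates interlacing; it only shows the visiting order. With a real energy function, the L2 size is chosen with the tuned L1 size already in place, which is the point of the interleaving.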

Final Heuristic: Interlaced with Local Search
• Basic interlacing performed well, but some cases were sub-optimal
  • Manually examined those cases
  • Determined small local search needed
• Because of the bank arrangements, if a 16 KB cache is determined to be the best size, the only associativity option is direct-mapped
• However, the application may require the increased associativity
• During the associativity search step, the cache size is allowed to increase so that larger associativities may be explored
• Final heuristic called: TCaT - The Two Level Cache Tuner

TCaT Results: Energy
• Energy consumption (normalized to the base cache configuration)
• 53% energy savings in cache/memory access sub-system vs. base cache

TCaT Results: Performance
• Execution time for the TCaT cache configuration and the optimal cache configuration (normalized to the execution time of the benchmark running with the base cache configuration)
• TCaT finds near-optimal configuration, nearly 30% improvement over base cache

TCaT Exploration Time Improvements
• Searches only 28 of 432 possible configurations
  • 6% of space
• Simulation-based approach
  • 500 MHz Sparc
  • 50 hrs vs. 3 hrs
• Hardware-based approach
  • 434 sec vs. 28 sec
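The ratios behind these figures are easy to verify. The sketch below only redoes the arithmetic on the numbers quoted on the slide. It assumes the larger time in each pair is the exhaustive search and the smaller is TCaT, which the slide implies but does not state.

```python
searched, total = 28, 432

fraction_searched = searched / total
print(f"{fraction_searched:.1%} searched")    # 6.5% searched
print(f"{1 - fraction_searched:.1%} pruned")  # 93.5% pruned

sim_speedup = 50 / 3    # hours, simulation-based approach
hw_speedup = 434 / 28   # seconds, hardware-based approach
print(f"{sim_speedup:.1f}x, {hw_speedup:.1f}x")  # 16.7x, 15.5x
```

Both speedups are close to the 432/28 ratio of configurations, about 15.4, which is what one would expect if every configuration costs roughly the same time to evaluate.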

TCaT in Presence of Hw/Sw Partitioning
• Hardware/software partitioning may become common in SOC platforms
  • On-chip FPGA
  • Program kernels moved to FPGA
  • Greatly reduces temporal and spatial locality of program
• Does TCaT still work well on programs with very low locality?

TCaT With Hardware/Software Partitioning
• Energy consumption (normalized to the base cache configuration)
• 55% energy savings in cache/memory access sub-system vs. base cache

Conclusions
• TCaT is an effective heuristic for two-level cache tuning
  • Prunes 94% of search space for a given two-level configurable cache architecture
  • Near-optimal performance results, 30% improvement vs. base cache
  • Near-optimal energy results, 53% improvement vs. base cache
  • Robust in presence of hw/sw partitioning
• Future work
  • More cache parameters, unified 2L cache
    • Even larger search space
  • Dynamic in-system tuning
    • Must avoid cache flushes
