 
                
A First Look at the Interplay of Code Reordering and Configurable Caches
Nikil Dutt, Center for Embedded Computer Systems, School for Information and Computer Science, University of California, Irvine
Ann Gordon-Ross and Frank Vahid*, Department of Computer Science and Engineering, University of California, Riverside
*Also with the Center for Embedded Computer Systems, UC Irvine
This work was supported by the U.S. National Science Foundation and by the Semiconductor Research Corporation

Optimizations
• Optimization is an important part of the design of an application or system
• Optimization targets: area, performance, and power and/or energy

Instruction Cache Optimizations
• The instruction cache is a good candidate for optimization
• Instruction caches have predictable spatial and temporal locality: 90% of execution time is spent in 10% of the code (Gordon-Ross '04)
• Instruction caches are power hungry: 29% of power consumption in the ARM920T (Segars '01)

Instruction Cache Tuning - Code Reordering
• Tune the instruction stream for increased cache utilization and thus increased performance
• Reorder the code so that infrequently executed regions of code do not pollute the instruction cache
• Code reordering is typically applied at link time; runtime methods exist but incur undesirable runtime overhead
[Figure: application tool flow with stages labeled Compile, Link (obj file), Download, and Execute]

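Instruction Cache Tuning - Code Reordering
[Figure: flowchart of an input-processing loop before and after code reordering. Blocks: while (input), Read input, Is the input valid? (yes/no), Process input, Error handling routine, Done. Edge counts of 100 and 1 mark the frequently and rarely taken branches of the validity check; after reordering, the rarely executed error handling routine is placed after Done, out of the hot path.]
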
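The effect of moving a cold block out of the hot path can be illustrated with a toy direct-mapped cache model. This is a minimal sketch, not the simulator used in this work: the block names follow the flowchart, while the 16-byte block sizes, the 4-line cache, both layouts, and the trace are hypothetical values chosen so that two hot blocks conflict in the original layout.

```python
def count_misses(layout, sizes, trace, line_bytes=16, num_lines=4):
    """Count instruction fetch misses in a direct-mapped cache.

    layout: block names in memory order (hypothetical placement).
    sizes:  block name -> size in bytes.
    trace:  sequence of executed block names.
    """
    # Assign each block a start address by packing blocks back to back.
    start, addr = {}, 0
    for name in layout:
        start[name] = addr
        addr += sizes[name]

    cache = [None] * num_lines  # memory line currently held by each cache line
    misses = 0
    for name in trace:
        first = start[name] // line_bytes
        last = (start[name] + sizes[name] - 1) // line_bytes
        for line in range(first, last + 1):
            idx = line % num_lines  # direct-mapped: one possible slot per line
            if cache[idx] != line:
                cache[idx] = line
                misses += 1
    return misses


# Hypothetical example: every block is exactly one 16-byte cache line.
sizes = {name: 16 for name in ("loop", "read", "valid", "error", "process", "done")}
original = ["loop", "read", "valid", "error", "process", "done"]
reordered = ["loop", "read", "valid", "process", "error", "done"]
hot_path = ["loop", "read", "valid", "process"]
trace = hot_path * 100 + ["loop", "read", "valid", "error", "loop", "done"]

print("original layout misses: ", count_misses(original, sizes, trace))
print("reordered layout misses:", count_misses(reordered, sizes, trace))
```

In this toy setup the reordered layout misses far less often, because the four hot blocks map to distinct cache lines instead of two of them evicting each other on every iteration.
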
Instruction Cache Tuning - Configurable Cache Tuning
• Tune the cache to the instruction stream for decreased energy and/or increased performance
• Cache tuning can be performed during application/platform design, or in-system at runtime with no runtime overhead (Zhang, DATE'04)

Instruction Cache Tuning - Configurable Cache Tuning
• Tunable parameters of the L1 cache include cache associativity, cache line size, and total cache size

Motivation - Code Reordering + Cache Configuration
• Code reordering tunes the instruction stream for the cache
• Cache configuration tunes the cache to the instruction stream
• How do these optimizations affect each other? Complement? Obviate? Degrade?

Pettis and Hansen Code Reordering
• Many current code reordering techniques are based heavily on the Pettis and Hansen code reordering algorithm (1990)
• Reorders basic blocks using edge profiling to increase locality
• Orders basic blocks so that the most frequently executed path through them is laid out as straight-line code

Pettis and Hansen Bottom-up Positioning Algorithm
• Process arcs in decreasing order of weight
• For each arc, merge the basic blocks at the source and destination of the arc to form a chain
• If one of the blocks is already in the middle of a chain, the two cannot be joined end to end, so form a new chain
[Figure: control flow graph of basic blocks annotated with execution frequencies, and the resulting reordered basic block chains]

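The chain-building steps can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation used in this work: two chains are joined only when the arc's source is the tail of one chain and its destination is the head of another, and the final step that orders the finished chains relative to each other is omitted. The function name `build_chains` and the example graph with its weights are hypothetical.

```python
def build_chains(blocks, arcs):
    """Greedy bottom-up chaining of basic blocks.

    blocks: list of basic block names.
    arcs:   dict mapping (source, destination) -> execution count.
    Returns a list of chains, each a list of block names in layout order.
    """
    # Every block starts out as a chain of length one.
    chain_of = {b: [b] for b in blocks}

    # Visit arcs from heaviest to lightest (sorted() is stable for ties).
    for (src, dst), _weight in sorted(arcs.items(), key=lambda kv: -kv[1]):
        a, b = chain_of[src], chain_of[dst]
        if a is b:
            continue  # both blocks already sit in the same chain
        # Join only tail-to-head; a block in the middle of a chain blocks the merge.
        if a[-1] == src and b[0] == dst:
            a.extend(b)
            for blk in b:
                chain_of[blk] = a

    # Collect the distinct chains in order of first appearance.
    chains = []
    for blk in blocks:
        chain = chain_of[blk]
        if not any(chain is c for c in chains):
            chains.append(chain)
    return chains


# Hypothetical edge profile: the error path is taken once and then exits.
blocks = ["loop", "read", "valid", "error", "process", "done"]
arcs = {
    ("loop", "read"): 100,
    ("read", "valid"): 100,
    ("valid", "process"): 99,
    ("process", "loop"): 99,
    ("valid", "error"): 1,
    ("error", "done"): 1,
}
for chain in build_chains(blocks, arcs):
    print(" -> ".join(chain))
```

With this profile the hot path ends up in one straight-line chain and the rarely executed blocks in a second chain, since the arc into the error block starts from a block that is already in the middle of the hot chain.
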
Configurable Cache Architecture
• We used the configurable cache architecture proposed by Zhang (ISCA'03)

Configurable Cache Architecture
• The base cache consists of four 2 KByte banks that may be individually shut down to configure the cache size
• Way concatenation allows for configurable associativity
[Figure: example configurations labeled Way shutdown, 8 KBytes, 4 KBytes, and 8 KBytes 2-way]

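The resulting design space can be enumerated with a short sketch. It assumes that each way must occupy at least one active 2 KByte bank, so associativity cannot exceed the number of banks left on, and that sizes are restricted to 1, 2, or 4 active banks; both constraints and the name `valid_configs` are assumptions made for illustration, not details stated on the slide. The line sizes of 16, 32, and 64 bytes are the ones used in the tuning heuristic.

```python
from itertools import product

BANK_KB = 2               # each bank holds 2 KBytes
ACTIVE_BANKS = (1, 2, 4)  # assumed shutdown settings: 2, 4, or 8 KBytes total
WAYS = (1, 2, 4)          # direct-mapped, 2-way, 4-way
LINE_BYTES = (16, 32, 64)


def valid_configs():
    """Return (size_kb, ways, line_bytes) tuples for the assumed design space."""
    configs = []
    for banks_on, ways, line in product(ACTIVE_BANKS, WAYS, LINE_BYTES):
        # Assumption: a way needs at least one active bank to live in.
        if ways <= banks_on:
            configs.append((banks_on * BANK_KB, ways, line))
    return configs


configs = valid_configs()
print(len(configs), "configurations")
for size_kb, ways, line in configs:
    print(f"{size_kb} KB, {ways}-way, {line}-byte lines")
```

Under these assumptions an 8 KByte cache can be direct-mapped, 2-way, or 4-way, a 4 KByte cache direct-mapped or 2-way, and a 2 KByte cache direct-mapped only.
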
Configurable Cache Heuristic
• First tune cache size: 2, 4, and 8 KBytes
• Then tune cache line size: 16, 32, and 64 bytes
• Finally tune cache associativity: direct-mapped, 2-way, and 4-way

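The size, line size, associativity ordering can be sketched as a one-parameter-at-a-time search. This is a hedged illustration, not the heuristic's actual implementation: the rule of stopping as soon as a larger value no longer lowers energy, the starting point of 16-byte lines and direct-mapped, the restriction that associativity cannot exceed the number of 2 KByte banks in use, and the names `tune_parameter` and `tune_cache` are all assumptions. The `energy` callback stands in for whatever measurement or model supplies the energy of a configuration.

```python
SIZES_KB = (2, 4, 8)
LINE_BYTES = (16, 32, 64)
WAYS = (1, 2, 4)


def tune_parameter(values, evaluate):
    """Walk values in increasing order and stop when energy stops improving."""
    best_value, best_energy = values[0], evaluate(values[0])
    for value in values[1:]:
        e = evaluate(value)
        if e >= best_energy:
            break  # assumed stopping rule: no improvement, keep previous value
        best_value, best_energy = value, e
    return best_value


def tune_cache(energy):
    """Tune size, then line size, then associativity.

    energy(size_kb, line_bytes, ways) -> float is supplied by the caller.
    """
    size = tune_parameter(SIZES_KB, lambda s: energy(s, LINE_BYTES[0], 1))
    line = tune_parameter(LINE_BYTES, lambda l: energy(size, l, 1))
    # Assumption: each way needs its own 2 KByte bank.
    allowed_ways = [w for w in WAYS if w * 2 <= size]
    ways = tune_parameter(allowed_ways, lambda w: energy(size, line, w))
    return size, line, ways


def toy_energy(size_kb, line_bytes, ways):
    """Made-up energy surface for demonstration only."""
    return abs(size_kb - 4) + abs(line_bytes - 32) / 16 + abs(ways - 2)


print(tune_cache(toy_energy))
```

A search of this shape evaluates only a handful of configurations rather than the whole design space, which is why an exhaustive search is useful as a comparison point.
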
Evaluation Framework
• Benchmarks: Powerstone, MediaBench, EEMBC
• Without code reordering: execute the application to obtain hit and miss ratios for each configuration
• With code reordering: instrument the executable to gather edge profiles, execute the application to gather the edge profiles, then provide the edge profiles to PLTO* (Pentium Link Time Optimizer) to perform code reordering and produce a code reordered executable
• The cache exploration heuristic produces the chosen cache configuration; an exhaustive search is run for comparison purposes
• Energy models: cache energy from Cacti, main memory energy from Samsung memory
*Provided by the University of Arizona

Results - Energy Savings
• Base cache = 2 KB, direct-mapped, 16 byte line size
[Figure: energy for four cases: base cache without code reordering, base cache with code reordering, configured cache without code reordering, configured cache with code reordering]
• Code reordering alone = 3.5% energy reduction
• Cache configuration alone = 15% energy reduction
• Cache configuration + code reordering = 17% energy reduction

Results - Performance Benefits
[Figure: performance for four cases: base cache without code reordering, base cache with code reordering, configured cache without code reordering, configured cache with code reordering]
• Code reordering alone = 3.5% performance benefit
• Cache configuration alone = 17% performance benefit
• Cache configuration + code reordering = 18.5% performance benefit
• On average, code reordering gives little additional benefit over cache configuration alone. However, a few benchmarks see added benefits.

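Change in Cache Requirements Due to Code Reordering
[Figure: per-benchmark change in cache requirements due to code reordering. Benchmark suites are marked *Powerstone, **MediaBench, ***EEMBC. Legend entries: reduction in cache area (marked x), larger line size, smaller cache size.]
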
Conclusions
• We explore the interplay of two instruction cache optimization techniques: code reordering and cache configuration
• Cache configuration largely obviates the need for code reordering with respect to energy and performance
• Cache configuration applied dynamically at runtime eliminates the need for designer-applied code reordering
• Code reordering improved cache utilization in 52% of the benchmarks
• Code reordering reduced instruction cache size by an average of 13% and by as much as 90%, which is beneficial for small custom synthesized embedded systems where area is critical

Future Work
• We plan to use a more advanced code reordering methodology that takes into account set associativity or multiple levels of cache
• We plan to study the iterative interplay of code reordering and cache configuration using a code reordering technique that takes the cache configuration into consideration