Power Savings in Embedded Processors through Decode Filter Cache


Presentation Transcript


  1. Power Savings in Embedded Processors through Decode Filter Cache
     Weiyu Tang, Rajesh Gupta, Alexandru Nicolau
     Presented by Fei Hong

  2. Overview
     • Introduction
     • Related Work
     • Decode Filter Cache
     • Results and Conclusion

  3. Introduction
     • Instruction delivery is a major power consumer in embedded systems
       • Instruction fetch: 27% of processor power in the StrongARM
       • Instruction decode: 18% of processor power in the StrongARM
       • Together, fetch and decode account for roughly 45% of processor power
     • Goal
       • Reduce power in instruction delivery with minimal performance penalty

  4. Related Work
     • Architectural approaches to reduce instruction fetch power
       • Store instructions in small, power-efficient storage structures
       • Examples:
         • Line buffers
         • Instruction filter cache

  5. Related Work
     • Architectural approaches to reduce instruction decode power
       • Avoid unnecessary decoding by saving decoded instructions in a separate cache
     • Trace cache
       • Stores decoded instructions in execution order
       • Fixed cache access order: the instruction cache is accessed on trace cache misses
       • Targeted at high-performance processors
         • Increases fetch bandwidth
         • Requires sophisticated branch prediction mechanisms
       • Drawbacks
         • Not power efficient, as the cache size is large

  6. Related Work
     • Micro-operation cache
       • Stores decoded instructions in program order
       • Instruction cache and micro-op cache are accessed in parallel to minimize the micro-op cache miss penalty
     • Drawbacks
       • Needs an extra pipeline stage, which increases the misprediction penalty
       • Requires a branch predictor
       • Per-access power is large, as the micro-op cache size is large
       • Power is consumed by both the micro-operation cache and the instruction cache

  7. Decode Filter Cache
     • Targeted processors
       • Single-issue, in-order execution
     • Research goals
       • Use a small, power-efficient cache to save decoded instructions
       • Reduce instruction fetch power and decode power simultaneously
       • Reduce power without sacrificing performance
     • Problems to deal with
       • What kind of cache organization to use
       • Where to fetch instructions from, as instructions can be provided from multiple sources
       • How to minimize the decode filter cache miss latency

  8. Decode Filter Cache
     Figure 1. Processor pipeline

  9. Decode Filter Cache
     • Decode filter cache organization
       • Problems with a traditional cache organization
         • The decoded instruction width varies
         • Saving all decoded instructions would waste cache space
     • Our approach (sketched in the code below)
       • Instruction classification
         • Classify instructions as cacheable or uncacheable depending on the instruction width distribution
         • Use a "cacheable ratio" to balance cache utilization against the number of instructions that can be cached
       • Sectored cache organization
         • Each instruction can be cached independently of neighboring lines
         • Neighboring lines share a tag to reduce the tag store cost
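The following is a minimal behavioral sketch of this sectored organization. The sector geometry (4 instructions per sector, 16 sectors), the fixed 64-bit width threshold standing in for the cacheable-ratio classification, and all identifiers are illustrative assumptions, not values or names from the paper.

```python
# Behavioral sketch of a sectored decode filter cache: one shared tag per
# sector, one valid bit per decoded instruction. Parameters are assumptions.
from dataclasses import dataclass, field
from typing import List, Optional

SECTOR_SIZE = 4          # instructions sharing one tag (assumption)
CACHEABLE_WIDTH = 64     # decoded instructions wider than this are uncacheable (assumption)

@dataclass
class Sector:
    tag: Optional[int] = None
    valid: List[bool] = field(default_factory=lambda: [False] * SECTOR_SIZE)
    data: List[Optional[int]] = field(default_factory=lambda: [None] * SECTOR_SIZE)

class DecodeFilterCache:
    def __init__(self, num_sectors=16):
        self.sectors = [Sector() for _ in range(num_sectors)]

    def _index_tag_offset(self, addr):
        word = addr >> 2                      # 4-byte instructions assumed
        offset = word % SECTOR_SIZE
        index = (word // SECTOR_SIZE) % len(self.sectors)
        tag = word // (SECTOR_SIZE * len(self.sectors))
        return index, tag, offset

    def fill(self, addr, decoded, width):
        """Insert a decoded instruction if it is classified as cacheable."""
        if width > CACHEABLE_WIDTH:           # uncacheable: storing it would waste space
            return
        index, tag, offset = self._index_tag_offset(addr)
        sector = self.sectors[index]
        if sector.tag != tag:                 # new sector: reset the shared tag
            sector.tag = tag
            sector.valid = [False] * SECTOR_SIZE
        sector.valid[offset] = True           # each instruction validated independently
        sector.data[offset] = decoded

    def lookup(self, addr):
        index, tag, offset = self._index_tag_offset(addr)
        sector = self.sectors[index]
        if sector.tag == tag and sector.valid[offset]:
            return sector.data[offset]        # hit: decoded instruction, no decode needed
        return None                           # miss: fetch from the I-cache and decode
```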

  10. Decode Filter Cache
     Table 1. Decode width frequency table
     Figure 2. Sector format

  11. Decode Filter Cache
     • Where to fetch instructions from
       • Instructions can be provided from one of the following sources
         • Decode filter cache
         • Line buffer
         • Instruction cache
     • Predictive order for instruction fetch
       • For power efficiency, either the decode filter cache or the line buffer is accessed first when an instruction is likely to hit
       • To minimize the decode filter cache miss penalty, the instruction cache is accessed directly when the decode filter cache is likely to miss

  12. Decode Filter Cache
     • Prediction mechanism
       • When the next fetch address and the current address map to the same cache line
         • If the current fetch source is the line buffer, the next fetch source remains the same
         • If the current fetch source is the decode filter cache and the corresponding instruction is valid, the next fetch source remains the same
         • Otherwise, the next fetch source is the instruction cache

  13. Decode Filter Cache
     • When the next fetch address and the current address map to different cache lines
       • Predict based on the next fetch prediction table (NFPT), which exploits control flow predictability
       • If the tag of the current fetch address and the tag of the predicted next fetch address are the same, the next fetch source is the decode filter cache
       • Otherwise, the next fetch source is the instruction cache
     • A sketch of this selection logic follows below
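A minimal sketch of the fetch-source selection described on slides 11-13, assuming a simplified interface: the line size, the helper names, and the predicate nfpt_predicts_hit are introduced here for illustration; the NFPT itself is sketched after slide 16.

```python
# Sketch of next-fetch-source selection (slides 11-13). All identifiers and
# parameters are assumptions; nfpt_predicts_hit stands in for the NFPT lookup.
DFC, LINE_BUFFER, ICACHE = "decode filter cache", "line buffer", "I-cache"

LINE_BYTES = 16  # bytes per cache line (assumption)

def same_line(addr_a, addr_b):
    return (addr_a // LINE_BYTES) == (addr_b // LINE_BYTES)

def predict_next_source(cur_addr, next_addr, cur_source, dfc_instr_valid, nfpt_predicts_hit):
    """Return the structure the next instruction should be fetched from."""
    if same_line(cur_addr, next_addr):
        # Intra-line case (slide 12): keep using the current power-efficient source.
        if cur_source == LINE_BUFFER:
            return LINE_BUFFER
        if cur_source == DFC and dfc_instr_valid(next_addr):
            return DFC
        return ICACHE
    # Cross-line case (slide 13): trust the NFPT only when its prediction matches;
    # otherwise go straight to the I-cache to avoid a DFC miss penalty.
    return DFC if nfpt_predicts_hit(cur_addr, next_addr) else ICACHE

# Example: stay in the decode filter cache while the next address hits the same line.
src = predict_next_source(0x100, 0x104, DFC,
                          dfc_instr_valid=lambda a: True,
                          nfpt_predicts_hit=lambda c, n: False)
assert src == DFC
```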

  14. Next Fetch Prediction Table
     Figure 3. The predictor

  15. Next Fetch Prediction Table
     Updating the NFPT:
     • The entry to update in the NFPT is pointed to by last_table_entry.
     • If last_decode_addr and decode_addr map to different lines in the DFC, last_table_entry is updated using last_decode_addr.
     • last_decode_addr is set to decode_addr.

  16. Next Fetch Prediction Table
     If fetch_addr and fetch_addr + 4 map to different cache lines, the NFPT is consulted and its stored partial_tag is compared with the tag of the next fetch address:
     • Equal: sector_valid of the matching entry is copied into cur_sector_valid. If the valid bit corresponding to fetch_addr + 4 is 1, the next fetch source is the DFC; otherwise the next fetch source is the I-cache.
     • Not equal: the predicted next fetch source is the I-cache. However, next_fetch_src is set to the line buffer, as the line fetched from the I-cache will be forwarded to the line buffer.
     A behavioral sketch of the table follows below.
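A behavioral sketch of the NFPT, loosely following slides 15-16. The field names partial_tag, sector_valid, last_decode_addr, and last_table_entry come from the slides; the table size, partial-tag width, indexing scheme, and the exact ordering of the update steps are assumptions made to produce a runnable illustration.

```python
# Illustrative NFPT sketch (slides 15-16). Geometry and update ordering are
# assumptions; only the field names are taken from the slides.
LINE_BYTES = 16            # bytes per DFC line (assumption)
INSTR_PER_LINE = 4
NFPT_ENTRIES = 32          # assumption
PARTIAL_TAG_BITS = 8       # assumption

class NFPTEntry:
    def __init__(self):
        self.partial_tag = None                 # predicted next line (partial tag)
        self.sector_valid = [False] * INSTR_PER_LINE

class NFPT:
    def __init__(self):
        self.table = [NFPTEntry() for _ in range(NFPT_ENTRIES)]
        self.last_decode_addr = None
        self.last_table_entry = 0               # index of the entry to update

    @staticmethod
    def _line(addr):
        return addr // LINE_BYTES

    def _index(self, addr):
        return self._line(addr) % NFPT_ENTRIES

    def _ptag(self, addr):
        return self._line(addr) % (1 << PARTIAL_TAG_BITS)

    def update(self, decode_addr, sector_valid):
        """Slide 15: called as decoded instructions are written into the DFC."""
        if (self.last_decode_addr is not None
                and self._line(self.last_decode_addr) != self._line(decode_addr)):
            # Crossed a line boundary: record the new line as the predicted
            # successor of the previous line (interpretation of the slide text).
            self.last_table_entry = self._index(self.last_decode_addr)
            entry = self.table[self.last_table_entry]
            entry.partial_tag = self._ptag(decode_addr)
            entry.sector_valid = list(sector_valid)
        self.last_decode_addr = decode_addr

    def lookup(self, fetch_addr):
        """Slide 16: predict the fetch source for fetch_addr + 4."""
        next_addr = fetch_addr + 4
        if self._line(fetch_addr) == self._line(next_addr):
            return None                          # intra-line case: slide-12 rules apply
        entry = self.table[self._index(fetch_addr)]
        if entry.partial_tag == self._ptag(next_addr):
            offset = (next_addr // 4) % INSTR_PER_LINE
            # Valid bit for fetch_addr + 4 decides between DFC and I-cache.
            return "DFC" if entry.sector_valid[offset] else "I-cache"
        # Partial tags differ: fetch from the I-cache; the fetched line is then
        # forwarded to the line buffer, so next_fetch_src becomes the line buffer.
        return "I-cache"
```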

  17. Next Fetch Prediction Table
     Mispredictions:
     • Conflicting access – the partial_tag and sector_valid fields in the NFPT entry have been overwritten by a conflicting sector.
     • Taken branch – if a taken branch is not in the sector, the prediction for the target address is not available.

  18. Results
     • Simulation setup
       • Media benchmarks
       • Cache sizes: 512 B decode filter cache, 16 KB instruction cache, 8 KB data cache
     • Configurations investigated

  19. Results: % reduction in I-cache fetches

  20. Results: % reduction in instruction decodes

  21. Results: normalized delay

  22. Results: % reduction in processor power

  23. Conclusion & Results
     • The results show
       • An average 34% reduction in processor power
       • 50% more effective power savings than an instruction filter cache
       • Less than 1% performance degradation, due to the effective prediction mechanism

  24. Questions?
