Addressing Instruction Fetch Bottlenecks by Using an Instruction Register File

Addressing Instruction Fetch Bottlenecksby Using an Instruction Register File Stephen Hines, Gary Tyson, and David Whalley Computer Science Dept. Florida State University June 8-16, 2007

Instruction Packing • Store frequently occurring instructions as specified by the compiler in a small, low-power Instruction Register File (IRF) • Allow multiple instruction fetches from the IRF by packing instruction references together • Tightly packed – multiple IRF references • Loosely packed – piggybacks an IRF reference onto an existing instruction • Facilitate parameterization of some instructions using an Immediate Table (IMM) Addressing Instruction Fetch Bottlenecks by Using an Instruction Register File

insn3 insn4 insn2 insn3 insn2 insn4 imm3 imm3 IRF IMM Instruction Cache insn1 insn1 Execution of IRF Instructions Instruction Fetch Stage First Half of Instruction Decode Stage IF/ID PC packed instruction packed instruction To Instruction Decoder IRWP Executing a Tightly Packed Param4c Instruction Addressing Instruction Fetch Bottlenecks by Using an Instruction Register File

Outline • Introduction • IRF and Instruction Packing Overview • Integrating an IRF with an L0 I-Cache • Decoupling Instruction Fetch • Experimental Evaluation • Related Work • Conclusions & Future Work Addressing Instruction Fetch Bottlenecks by Using an Instruction Register File

6 bits 5 bits 5 bits 5 bits 5 bits 1 bit 5 bits opcode inst4param inst5param 6 bits 5 bits 5 bits 5 bits 6 bits 5 bits opcode rs shamt 6 bits 5 bits 5 bits 11 bits 5 bits opcode 6 bits 2 bits 24 bits opcode MIPS+IRF Instruction Formats inst1 inst2 inst3 s T-type rt rd function inst R-type rs rt immediate inst I-type win immediate J-type Addressing Instruction Fetch Bottlenecks by Using an Instruction Register File

Previous Work in IRF • Register Windowing + Loop Cache (MICRO 2005) • Compiler Optimizations (CASES 2006) • Instruction Selection • Register Renaming • Instruction Scheduling Addressing Instruction Fetch Bottlenecks by Using an Instruction Register File

Integrating an IRF with an L0 I-Cache • L0 or Filter Caches • Small and direct-mapped • Fast hit time • Low energy per access • Higher miss rate than L1 • 256B L0 I-cache 8B line size [Kin97] • Fetch energy reduced 68% • Cycle time increased 46%!!! • IRF reduces code size, while L0 only focuses on energy reduction at the cost of performance • IRF can alleviate performance penalty associated with L0 cache misses, due to overlapping fetch Addressing Instruction Fetch Bottlenecks by Using an Instruction Register File

L0 Cache Miss Penalty L0 Cache Miss Cycle 1 2 3 4 5 6 7 8 9 Insn1 IF ID EX M WB Insn2 IF ID EX M WB Insn3 IF ID M EX WB Insn4 IF ID EX M WB Addressing Instruction Fetch Bottlenecks by Using an Instruction Register File

Overlapping Fetch with an IRF L0 Cache Miss Cycle 1 2 3 4 5 6 7 8 9 Insn1 IF ID EX M WB Pack2a IFab IDa EXa Ma WBa Pack2b IDb EXb Mb WBb Insn3 IF ID EX M WB Addressing Instruction Fetch Bottlenecks by Using an Instruction Register File

Decoupling Instruction Fetch • Instruction bandwidth in a pipeline is usually uniform (fetch, decode, issue, commit, …) • Artificially limits the effective design space • Front-end throttling improves energy utilization by reducing the fetch bandwidth in areas of low ILP • IRF can provide virtual front-end throttling • Fetch fewer instructions every cycle, but allow multiple issue of packed instructions • Areas of high ILP are often densely packed • Lower ILP for infrequently executed sections of code Addressing Instruction Fetch Bottlenecks by Using an Instruction Register File

Out-of-order Pipeline Configurations Addressing Instruction Fetch Bottlenecks by Using an Instruction Register File

Experimental Evaluation • MiBench embedded benchmark suite – 6 categories representing common tasks for various domains • SimpleScalar MIPS/PISA architectural simulator • Wattch/Cacti extensions for modeling energy consumption (inactive portions of pipeline only dissipate 10% of normal energy when using cc3 clock gating) • VPO – Very Portable Optimizer targeted for SimpleScalar MIPS/PISA Addressing Instruction Fetch Bottlenecks by Using an Instruction Register File

L0 Study Configuration Data Addressing Instruction Fetch Bottlenecks by Using an Instruction Register File

Execution Efficiency for L0 I-Caches Addressing Instruction Fetch Bottlenecks by Using an Instruction Register File

Energy Efficiency for L0 I-Caches Addressing Instruction Fetch Bottlenecks by Using an Instruction Register File

Decoupled Fetch Configurations Addressing Instruction Fetch Bottlenecks by Using an Instruction Register File

Execution Efficiency for Asymmetric Pipeline Bandwidth Addressing Instruction Fetch Bottlenecks by Using an Instruction Register File

Energy Efficiency for Asymmetric Pipeline Bandwidth Addressing Instruction Fetch Bottlenecks by Using an Instruction Register File

Energy-Delay2 for Asymmetric Pipeline Bandwidth Addressing Instruction Fetch Bottlenecks by Using an Instruction Register File

Related Work • L-caches – subdivide instruction cache, such that one portion contains the most frequently accessed code • Loop Caches – capture simple loop behaviors and replay instructions • Zero Overhead Loop Buffers (ZOLB) • Pipeline gating / Front-end throttling – stall fetch when in areas of low IPC Addressing Instruction Fetch Bottlenecks by Using an Instruction Register File

Conclusions and Future Work • Future Topics • Can we pack areas where L0 is likely to miss? • IRF + encrypted or compressed I-Caches • IRF + asymmetric frequency clustering (of pipeline backend functional units) • IRF can alleviate fetch bottlenecks from L0 I-Cache misses or branch mispredictions • Increased IPC of L0 system by 6.75% • Further decreased energy of L0 system by 5.78% • Decoupling fetch provides a wider spectrum of design points to be evaluated (energy/performance) Addressing Instruction Fetch Bottlenecks by Using an Instruction Register File

The End Questions ??? Addressing Instruction Fetch Bottlenecks by Using an Instruction Register File

Addressing Instruction Fetch Bottlenecks by Using an Instruction Register File

Energy Consumption Addressing Instruction Fetch Bottlenecks by Using an Instruction Register File

Static Code Size Addressing Instruction Fetch Bottlenecks by Using an Instruction Register File

Conclusions & Future Work • Compiler optimizations targeted specifically for IRF can further reduce energy (12.2%15.8%), code size (16.8%28.8%) and execution time • Unique transformation opportunities exist due to IRF, such as code duplication for code size reduction and predication • As processor designs become more idiosyncratic, it is increasingly important to explore the possibility of evolving existing compiler optimizations • Register targeting and loop unrolling should also be explored with instruction packing • Enhanced parameterization techniques Addressing Instruction Fetch Bottlenecks by Using an Instruction Register File

Instruction Redundancy • Profiled largest benchmark in each of six MiBench categories • Most frequent 32 instructions comprise 66.5% of total dynamic and 31% of total static instructions Addressing Instruction Fetch Bottlenecks by Using an Instruction Register File

Compilation Framework Addressing Instruction Fetch Bottlenecks by Using an Instruction Register File

Addressing Instruction Fetch Bottlenecks by Using an Instruction Register File