Distributed Reorder Buffer Schemes for Low Power *

Gurhan Kucuk, Oguz Ergin, Dmitry Ponomarev, Kanad Ghose
Department of Computer Science, State University of New York, Binghamton, NY 13902-6000
http://www.cs.binghamton.edu/~lowpower
21st International Conference on Computer Design (ICCD’03), October 14th 2003
* Supported in part by DARPA through the PAC-C program and NSF

Outline
• Reorder Buffer (ROB) complexities
• Motivation for the low-complexity ROB
• Low-complexity ROB designs
• Fully Distributed ROB
• Retention Latches (RLs) revisited (ICS’02)
• Combined Scheme
• Results
• Concluding remarks

P6-style Superscalar Datapath
[Figure: pipeline diagram with Fetch (F1, F2), Decode/Dispatch (D1, D2), Instruction Issue (IQ), EX (function units FU1, FU2, ..., FUm), the ROB and the Architectural Register File (ARF), connected by the instruction dispatch path and the result/status forwarding buses.]

PPC 620-style Superscalar Datapath
[Figure: pipeline diagram with Fetch (F1, F2), Decode/Dispatch (D1, D2), Instruction Issue (IQ), EX (function units FU1, FU2, ..., FUm), the RB, the ROB and the Architectural Register File (ARF), connected by the instruction dispatch path and the result/status forwarding buses.]

ROB Port Requirements for a W-way CPU
• Decode/Dispatch: W write ports to set up entries
• Writeback: W write ports to write results
• Dispatch/Issue: 2W read ports to read the source operands
• Commit: W read ports for instruction commitment
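As a quick check on the port budget listed above, the counts can be tallied for a given machine width. This is only a minimal sketch: the function name `rob_port_counts` and the dictionary keys are illustrative and do not come from the slides.

```python
def rob_port_counts(width):
    """Port counts of a centralized ROB for a W-way machine, as listed on the slide."""
    return {
        "dispatch_write": width,     # W write ports to set up entries
        "writeback_write": width,    # W write ports to write results
        "source_read": 2 * width,    # 2W read ports to read source operands
        "commit_read": width,        # W read ports for instruction commitment
    }


ports = rob_port_counts(4)           # example: W = 4
total_ports = sum(ports.values())    # 5W in general
print(ports, total_ports)
```

For W = 4 the tally is 20 ports, 8 of which are source-operand read ports.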

What This Work is All About
• ROB complexity reduction is important for reducing power and improving performance
• ROB dissipates a non-trivial fraction of the total chip power
• ROB accesses stretch over several cycles
• Goal of this work: reduce the complexity and power dissipation of the ROB without sacrificing performance

Comparison of ROB Bitcells (0.18µ, TSMC)
[Figure: layout of a 32-ported SRAM bitcell next to the layout of a 16-ported SRAM bitcell.]
• Area reduction: 71%
• Shorter bitlines and wordlines

P6-style Superscalar Datapath
[Figure: the P6-style datapath diagram shown again, with the centralized ROB.]

Reorder Buffer Distribution
[Figure: the P6-style datapath with the ROB storage split into ROB Components (ROBCs), ROBC 1 ... ROBC m, one next to each function unit FU1 ... FUm. The centralized ROB holds pointers to entries within the ROBCs.]

Impact of Distributing the ROB
• Each ROBC is effectively a small Rename Buffer
• Smaller read/write access energy
• Faster access time
• Distributing physical storage in this manner allows FUs to use shorter buses to write their respective ROBCs
• Lower energy dissipation on the wires (we have NOT accounted for energy savings from using shorter wires)
• Fits in naturally with a multi-clustered datapath design

Problems with the earlier Multi-banked RF Schemes
• Port conflicts result in performance penalty
• Interconnection network is more complex

Problems with the earlier Multi-banked RF Schemes and some good news!
• Port conflicts result in performance penalty
• Good news: totally avoid write port conflicts
• Good news: minimize read port conflicts at commitment
• Good news: totally avoid source read port conflicts
• Interconnection network is more complex
• Good news: completely remove source read ports

ROBCs Assigned to Each Function Unit
[Figure: each entry of the centralized ROB holds an FU_id and an offset; FU #1 ... FU #m each write to their own ROBC #1 ... ROBC #m.]

Good News: Write port conflicts are avoided
[Figure: the same diagram, with each ROBC labeled as having 1 write port, used only by its own FU.]

Round Robin Scheduling at Dispatch Time
[Figure, animated over several slides: a centralized ROB whose entries hold (instruction, FU_id, offset), next to the distributed ROBCs Int ADD ROBC #1 to #4.]
• ADD is dispatched: an entry is reserved in Int ADD ROBC #1, and centralized ROB entry 1 records FU_id 1, offset 1
• SUB is dispatched: an entry is reserved in Int ADD ROBC #2, and centralized ROB entry 2 records FU_id 2, offset 1
• AND is dispatched: an entry is reserved in Int ADD ROBC #3, and centralized ROB entry 3 records FU_id 3, offset 1
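The dispatch sequence on these slides can be sketched in a few lines of Python. This is an illustrative model, not the authors' implementation: the class name `DistributedROB`, the FU class labels, the uniform ROBC size, and the choice to block (rather than skip to another ROBC) when the selected ROBC is full are all assumptions. Entries are never freed here, so the offset is simply the next unused slot.

```python
class DistributedROB:
    """Centralized ROB of (FU_id, offset) pointers plus one small ROBC per FU."""

    def __init__(self, fu_ids_by_class, robc_size):
        self.fu_ids_by_class = fu_ids_by_class   # e.g. {"int_add": [1, 2, 3, 4]}
        self.robc_size = robc_size               # entries per ROBC (assumed uniform)
        self.used = {fu: 0 for ids in fu_ids_by_class.values() for fu in ids}
        self.next_rr = {cls: 0 for cls in fu_ids_by_class}
        self.central = []                        # program order: (opcode, fu_id, offset)

    def dispatch(self, opcode, fu_class):
        """Reserve a ROBC entry; return (fu_id, offset), or None if dispatch blocks."""
        ids = self.fu_ids_by_class[fu_class]
        fu_id = ids[self.next_rr[fu_class]]      # round-robin choice within the class
        if self.used[fu_id] == self.robc_size:
            return None                          # assumption: block, do not skip ahead
        self.used[fu_id] += 1
        offset = self.used[fu_id]                # 1-based, as drawn on the slides
        self.next_rr[fu_class] = (self.next_rr[fu_class] + 1) % len(ids)
        self.central.append((opcode, fu_id, offset))
        return fu_id, offset


rob = DistributedROB({"int_add": [1, 2, 3, 4], "int_muldiv": [5]}, robc_size=8)
pointers = [rob.dispatch(op, "int_add") for op in ("ADD", "SUB", "AND")]
print(pointers)  # [(1, 1), (2, 1), (3, 1)]
```

The three pointers match the (FU_id, offset) pairs recorded for ADD, SUB and AND in the centralized ROB on the slides.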

Good News: Avoiding Read Port Conflicts
[Figure: ADD, SUB and AND occupy entries in three different ROBCs; each ROBC has 1 read port leading to commitment.]

Round Robin Scheduling at Dispatch Time
[Figure, animated over several slides: the same centralized ROB, now next to Int MUL/DIV ROBC #5.]
• MUL is dispatched: an entry is reserved in Int MUL/DIV ROBC #5, and centralized ROB entry 4 records FU_id 5, offset 1
• DIV is dispatched: a second entry is reserved in Int MUL/DIV ROBC #5, and centralized ROB entry 5 records FU_id 5, offset 2

Read Port Conflicts at Commitment
[Figure: MUL and DIV both occupy entries in Int MUL/DIV ROBC #5, which has 1 read port leading to commitment.]
CONFLICT: if MUL and DIV want to commit in the same cycle
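A rough way to see when this conflict arises is to count how many of the oldest instructions can commit in one cycle when each ROBC has a single read port. The sketch below is hypothetical: the function name is made up, every instruction at the head is assumed to have completed, and commitment is assumed to be in order and to stop at the first conflicting instruction.

```python
def commits_this_cycle(head_entries, width, read_ports_per_robc=1):
    """Count how many of the oldest (fu_id, offset) entries can commit in one cycle."""
    reads = {}                                   # read ports used so far, per ROBC
    committed = 0
    for fu_id, _offset in head_entries[:width]:
        if reads.get(fu_id, 0) == read_ports_per_robc:
            break                                # port conflict: younger instructions wait
        reads[fu_id] = reads.get(fu_id, 0) + 1
        committed += 1
    return committed


no_conflict = commits_this_cycle([(1, 1), (2, 1), (3, 1)], width=4)  # ADD, SUB, AND
conflict = commits_this_cycle([(5, 1), (5, 2)], width=4)             # MUL, DIV
print(no_conflict, conflict)  # 3 1
```

ADD, SUB and AND sit in different ROBCs and commit together, while MUL and DIV share ROBC #5, so only MUL commits in that cycle.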

Distributed ROB Design 1: with source read ports
Ports per ROBC:
• Writeback: 1 write port to write results
• Dispatch/Issue: 1 read port to read the source operands
• Commit: 1 read port for instruction commitment

Experimental Setup: the AccuPower (DATE’02)
[Figure: tool flow. Compiled SPEC benchmarks and datapath specs feed a microarchitectural simulator (rooted in SimpleScalar), which produces performance stats along with transition counts and context information. SPICE decks built from VLSI layout data are run through SPICE to obtain measures of energy per transition. The energy/power estimator uses both to produce power/energy stats.]

Configuration of the Simulated System
Machine width: 4-way
Issue Queue: 32 entries
Reorder Buffer: 96 entries
Load/Store Queue: 32 entries
Simulated the execution of SPEC2000 benchmarks

Peak/Average demands on the number of ROBC entries
[Figure: chart of peak and average demand on the number of entries, in five peak/avg. pairs.]
Number of entries assigned to each ROBC: 8, 8, 8, 8, 4, 4, 4, 4, 4, 4, 16
8 + 8 + 8 + 8 + 4 + 4 + 4 + 4 + 4 + 4 + 16 = 72 entries: the 8_4_4_4_16 configuration

Percentage of cycles when dispatch blocks for 8_4_4_4_16
[Figure: chart of the percentage of cycles when dispatch blocks.]
Number of entries assigned to each ROBC: 8 + 8 + 8 + 8 + 4 + 4 + 4 + 4 + 4 + 4 + 16 = 72 entries
Average IPC drop with the 8_4_4_4_16 configuration = 4.8%

Reducing performance penalty: 12_6_4_6_20 Configuration
Number of entries assigned to each ROBC: 12, 12, 12, 12, 6, 4, 4, 4, 4, 6, 20
12 + 12 + 12 + 12 + 6 + 4 + 4 + 4 + 4 + 6 + 20 = 96 entries: the 12_6_4_6_20 configuration
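The two configuration names can be expanded back into the per-ROBC entry lists and checked against the stated totals. In this sketch the group sizes (4, 1, 4, 1, 1) are an assumption inferred from the per-ROBC lists on the slides, and `expand_config` is a hypothetical helper name.

```python
# Assumption: number of ROBCs in each of the five groups, inferred from the
# per-ROBC entry lists on the slides (four, one, four, one, one).
ROBCS_PER_GROUP = (4, 1, 4, 1, 1)


def expand_config(name):
    """Turn a name such as '8_4_4_4_16' into the list of entries per ROBC."""
    sizes = [int(s) for s in name.split("_")]
    return [size for size, n in zip(sizes, ROBCS_PER_GROUP) for _ in range(n)]


small = expand_config("8_4_4_4_16")
large = expand_config("12_6_4_6_20")
print(sum(small), sum(large))  # 72 96
```

The larger configuration uses 96 ROBC entries in total, the same number as the 96-entry Reorder Buffer of the simulated baseline.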

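Performance Results for 12_6_4_6_20 Configuration
[Figure: IPC chart for the integer benchmarks gap, gcc, gzip, parser, perl, twolf, vortex and vpr (with Int Avg.) and the floating-point benchmarks applu, art, mesa, mgrid, swim and wupwise (with FP Avg.).]
Average IPC drop with the 12_6_4_6_20 configuration = 2.4%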

Eliminating All Source Read Ports
Ports per ROBC once the Dispatch/Issue read port for source operands is removed:
• Writeback: 1 write port to write results
• Commit: 1 read port for instruction commitment

Where are the Source Values Coming From?
[Figure: the P6-style datapath (Fetch, Decode/Dispatch, IQ, function units, ROB, ARF, result/status forwarding buses) with three locations marked 1, 2 and 3.]
