Energy-efficient Instruction Dispatch Buffer Design for Superscalar Processors*

Gurhan Kucuk, Kanad Ghose, Dmitry V. Ponomarev
Department of Computer Science, State University of New York, Binghamton, NY 13902-6000
{gurhan, ghose, dima}@cs.binghamton.edu
Peter M. Kogge
Dept. of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN 46556
kogge@cse.nd.edu
International Symposium on Low Power Electronics and Design (ISLPED’01)
* supported in part by DARPA through the PAC-C program and NSF

Motivation/Goals
Current Trends in Microarchitecture:
• Aggressive out-of-order execution, register renaming, multiple FUs, sizable on-chip caches, large register files, ROB, etc.
Impact on Energy/Power Dissipation:
• Absolute power dissipation of the processor is high
• Areal energy/power density of high-end superscalar processors is becoming an immediate, serious concern and will soon become comparable to that of nuclear reactors
Consequences:
• Intermittent and permanent failures on the die, and serious challenges for cooling facilities and packaging
Goal:
• Limit energy dissipation through technology-independent techniques with no impact on performance

Typical Superscalar Datapath

The Dispatch Buffer
• The Instruction Dispatch Buffer (a.k.a. Issue Queue) is one of the major sources of power dissipation in modern superscalar processors: up to 22% of total chip power
• Major components of power dissipation in the Dispatch Buffer are:
  • Dispatch (Entry Setup = locating free entries + writing to them)
  • Issue (FU arbitration + reading selected instructions from the DB)
  • Forwarding (tag comparison + latching)
[Figure: DB fed from the Decode/Dispatch stage, issuing to the function units and receiving forwarded results from them. Pie charts for SPECint 95 and SPECfp 95 split DB power into Dispatch (D), Issue (I) and Forwarding (F): D 17.3% / I 53.8% / F 28.9%, and D 24.2% / I 50.1% / F 25.7%]

Main Results
• 60%-plus energy savings within the DB, achieved using three relatively independent techniques:
  • Replacing traditional comparators with dissipate-on-match comparators
  • Not reading or writing leading zero bytes
  • Using bit-line segmentation to reduce bit-line capacitance and dissipation during reads and writes
• No impact on cycle time
• Only a 12% increase in the layout area of the DB for a 4-metal-layer, 0.5 micron layout; the increase is smaller with additional metal layers


Low Power Comparator
• Traditional comparators dissipate power on mismatches
• Only 5% of all comparisons result in a match
• Mismatches are therefore a major source of power dissipation in the Dispatch Buffer
[Figure: forwarding bus carrying an 8-bit physical register number to the DB, with slots marked as inactive, waiting, or matching. Chart: Dispatch Buffer Comparator Statistics (LSB = least significant bits)]

Low Power Comparator
Idea: Design a new comparator that dissipates power only on matches
New dissipate-on-match comparator: domino logic with a pass-transistor front end
[Chart: Dispatch Buffer Comparator Statistics (LSB = least significant bits)]
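The benefit of moving dissipation from mismatches to matches can be illustrated with a rough expected-energy model. This is only a sketch: the 5% match rate is taken from the slides, while the unit energy `E_UNIT` and the function name are hypothetical, and the model ignores how the real energy depends on which bits differ.

```python
def expected_energy(match_rate, e_on_match, e_on_mismatch):
    """Expected energy per tag comparison for a given match probability."""
    return match_rate * e_on_match + (1.0 - match_rate) * e_on_mismatch


MATCH_RATE = 0.05  # from the slides: only 5% of comparisons match
E_UNIT = 1.0       # hypothetical energy of one dissipating comparison

# Traditional comparator: dissipates on a mismatch, not on a match.
traditional = expected_energy(MATCH_RATE, 0.0, E_UNIT)

# Dissipate-on-match comparator: the opposite behavior.
dissipate_on_match = expected_energy(MATCH_RATE, E_UNIT, 0.0)

print(f"traditional: {traditional:.2f}")
print(f"dissipate-on-match: {dissipate_on_match:.2f}")
```

The model is deliberately idealized and is not meant to reproduce the measured savings.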

Zero-Byte Encoding
Observation:
• The simulated execution of the SPEC 95 benchmarks shows that about half of the byte fields within operands are all zeros
Reasons:
• Use of small integer literals (address offsets, literal operands, flags, byte ops, etc.)
• Byte packing and unpacking operations, and the use of bit or byte masks to isolate parts of the operands
• Some floating point operands may not use all of the bits allowed in the mantissa field
• Lower-precision data may not make use of the full datapath width
[Chart: 32- and 64-bit integer operands (90% of all operands)]

Zero-Byte Encoding
Idea:
• Instead of driving a byte of all zeroes, encode it with a ZI (Zero Indicator) bit and drive only this bit, thus saving power during writes.
Associated Circuit Techniques: Readout Logic

Zero-Byte Encoding
• The stored ZI bit disables reading of the associated byte, avoiding bitline discharge and sense-amp dissipation
Associated Circuit Techniques: Readout Logic

Zero-Byte Encoding Associated Circuit Techniques: Encoding Logic for bytes of all zeroes
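A behavioral sketch of the ZI encoding and readout described on these slides is given below. It assumes one ZI bit per byte and a 4-byte operand; the function names are hypothetical, and the code models only which bits get driven, not the circuit itself (in hardware the byte cells keep their fixed positions and are simply left undriven).

```python
def zi_encode(value, width_bytes=4):
    """Return (zi_bits, nonzero_bytes), least significant byte first.

    zi_bits[i] is 1 when byte i of the operand is all zeroes.
    """
    data = value.to_bytes(width_bytes, "little")
    zi_bits = [1 if b == 0 else 0 for b in data]
    nonzero_bytes = [b for b in data if b != 0]
    return zi_bits, nonzero_bytes


def zi_decode(zi_bits, nonzero_bytes):
    """Rebuild the operand: a set ZI bit yields a zero byte without a read."""
    stored = iter(nonzero_bytes)
    data = bytes(0 if zi else next(stored) for zi in zi_bits)
    return int.from_bytes(data, "little")


def bits_driven(zi_bits):
    """One ZI bit per byte, plus 8 data bits for every nonzero byte."""
    return len(zi_bits) + 8 * zi_bits.count(0)


zi, payload = zi_encode(0x0000002A)
print(zi, payload, bits_driven(zi))   # 12 bits driven instead of 32
print(zi_decode(zi, payload) == 0x2A)
```

For a small literal such as 0x2A, three of the four bytes are zero, so only the four ZI bits and one data byte are driven.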

Bitline Segmentation
• The DB is essentially a register file with additional associative logic for data forwarding.
• For each instruction dispatched in a cycle, a write port is needed for the entry setup process.
• For each instruction issued in a cycle, a read port is needed to move the instruction from the DB to an FU.
• The bitlines associated with each read and write port present a high capacitive load, which includes a component that varies linearly with the number of rows in the DB.
• This component is due to the wire capacitance of the bitlines and the diffusion capacitance of the pass transistors that connect the bitcells to the bitlines.
[Figure: DB with write ports fed from the Decode/Dispatch stage and read ports leading to the function units (issue)]

Bitline Segmentation
Idea: The DB is restructured into segments.
• Capacitive loading on each segment is lowered: each segment is connected to only 16 pass devices
• Wire capacitance is lowered: the wire length of a bitline segment is one fourth of the original bitline
[Figure: Bitline segmented DB]
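The effect of segmentation on the row-dependent part of the bitline load can be sketched with a simple linear model. The 16 devices per segment and the one-fourth wire length come from the slide, which together imply 4 segments and 64 rows. The per-row capacitance values and the overhead term for the segment switch and global line are hypothetical placeholders, not measured values.

```python
def bitline_capacitance(rows, c_diff, c_wire_per_row, c_fixed=0.0):
    """Linear model: per-row diffusion and wire capacitance plus a fixed term."""
    return rows * (c_diff + c_wire_per_row) + c_fixed


ROWS = 64          # inferred from the slide: 4 segments x 16 pass devices
SEGMENTS = 4
C_DIFF = 1.0       # hypothetical diffusion capacitance per pass device
C_WIRE = 0.5       # hypothetical wire capacitance per row of bitline
C_OVERHEAD = 4.0   # hypothetical cost of the segment switch and global line

unsegmented = bitline_capacitance(ROWS, C_DIFF, C_WIRE)
segmented = bitline_capacitance(ROWS // SEGMENTS, C_DIFF, C_WIRE, C_OVERHEAD)

print(f"unsegmented load: {unsegmented}")
print(f"load seen per access with segmentation: {segmented}")
print(f"ratio: {segmented / unsegmented:.2f}")
```

With these placeholder values the accessed load drops to well under a third of the original, and the overhead term shows why the reduction is less than the ideal factor of four.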

Evaluation Methodology
• Used a true cycle-by-cycle, register-level simulator for a typical superscalar pipeline. SimpleScalar was substantially modified for this purpose to mimic real superscalar datapaths
• Simulated the execution of the SPEC 95 benchmarks
• Collected transition counts for each major datapath component
• Used SPICE measurements of the VLSI layouts of the dispatch buffer and reorder buffer in a 0.5 micron, 4-metal-layer process to estimate the power dissipated for each type of transition within each major component (migrating to 0.18 micron soon!)

Evaluation Methodology Datapath Power Estimator
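The estimation flow described above, transition counts from the simulator combined with per-transition energies from SPICE, can be sketched as follows. The transition names, counts, energies, and cycle time are all made-up illustrative values, and the function is a hypothetical stand-in rather than the actual estimator.

```python
def estimate_power_mw(transition_counts, energy_pj, cycles, cycle_time_ns):
    """Average power in mW: total energy (pJ) over elapsed time (ns).

    1 pJ / 1 ns = 1 mW, so no further unit conversion is needed.
    """
    total_pj = sum(count * energy_pj[kind]
                   for kind, count in transition_counts.items())
    return total_pj / (cycles * cycle_time_ns)


# Hypothetical simulator output: how often each transition type occurred.
counts = {"dispatch_write": 1000, "issue_read": 800, "tag_mismatch": 5000}

# Hypothetical SPICE-derived energy per transition, in picojoules.
energies = {"dispatch_write": 2.0, "issue_read": 1.5, "tag_mismatch": 0.5}

power = estimate_power_mw(counts, energies, cycles=1000, cycle_time_ns=2.0)
print(f"estimated DB power: {power:.2f} mW")
```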

Results: Traditional vs. New Comparator
[Chart: power dissipation within the DB during forwarding (y-axis: power dissipation, mW). Annotated savings: 49%, 44%, 45%]

Results: Traditional vs. New Comparator
[Chart: total power dissipation within the DB (y-axis: power dissipation, mW). Annotated savings: 14%, 11%, 12%]
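As a rough consistency check, the total DB savings should be close to the forwarding savings scaled by forwarding's share of DB power. The sketch below assumes that the 28.9% forwarding share goes with the 49% forwarding saving and the 25.7% share with the 44% saving; that pairing is an assumption, as is the premise that dispatch and issue power are unaffected by the new comparator.

```python
def total_saving(component_share, component_saving):
    """Saving in total power when only one component's power is reduced."""
    return component_share * component_saving


# (forwarding share of DB power, forwarding saving, reported total saving)
# The pairing of shares with savings is an assumption, not stated on the slides.
cases = [
    (0.289, 0.49, 0.14),
    (0.257, 0.44, 0.11),
]

estimates = [total_saving(share, saving) for share, saving, _ in cases]
for (share, saving, reported), est in zip(cases, estimates):
    print(f"estimated {est:.3f} vs reported {reported:.2f}")
```

Under that pairing, both estimates land within a fraction of a percentage point of the reported totals.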

Results: Zero-Byte Encoding and Bitline Segmentation
[Chart: power dissipation within the DB during instruction dispatch (y-axis: power dissipation, mW). Annotated savings: 54%, 26%, 61%; 53%, 21%, 59%; 53%, 23%, 60%]

Results: Zero-Byte Encoding and Bitline Segmentation
[Chart: power dissipation within the DB during instruction issue (y-axis: power dissipation, mW). Annotated savings: 40%, 41%, 62%; 41%, 32%, 58%; 41%, 35%, 60%]

Results: Zero-Byte Encoding, Bitline Segmentation and New Comparators
[Chart: total power dissipation within the DB (y-axis: power dissipation, mW). Annotated savings: 33%, 50%, 61%; 31%, 44%, 59%; 32%, 46%, 60%]

Related Work
• Zero-byte encoding of function unit results (Brooks and Martonosi, 1999)
• Zero-byte compression on buses, register files, DB and ROB in superscalar datapath (Ponomarev, Ghose, Kucuk, Kogge and Toomarian, 2000)
• Zero-byte compression for I-caches (Villa, Zhang, Asanovic, 2000)
• Zero-byte compression in simple scalar datapath (Canal, Gonzalez and Smith, 2000)
• Dynamic resizing of issue queue (Buyuktosunoglu, Albonesi, Schuster, Brooks, Bose and Cook, 2001; Folegnani and Gonzalez, 2001)
• Dynamic resizing of dispatch buffer and reorder buffer (Ponomarev, Kucuk and Ghose, 2001)

Conclusion
• We studied three relatively independent techniques to reduce the energy dissipation in the instruction dispatch buffers of modern superscalar processors:
  • New comparators that dissipate energy mainly on tag matches
  • Zero-byte encoding to reduce the number of bitlines that have to be driven during instruction dispatch and issue, as well as during forwarding of results to the waiting instructions in the DB
  • Bitline segmentation to reduce the length of the bitlines (and so reduce wire and diffusion capacitances)
• Total power reduction is about 60%
• The DB power reductions are achieved without compromising the cycle time and with only a modest growth in the area of the DB (about 12%)
• Our studies also show that all of the techniques that reduce DB power can be used to achieve reductions of a similar scale in other datapath artifacts that use associative addressing (such as the Reorder Buffer and LOAD/STORE Queue).
