Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

J. Nelson Amaral

Tomasulo Algorithm

IBM 360/91 Floating Point Arithmetic Unit
Tomasulo Algorithm:
• A reservation station for each functional unit.
• Each reservation station has a Free/Occupied bit.
• Each operand has a flag and a data field: Flag = on → Data = value; Flag = off → Data = tag.
• Each station also holds a tag (pointer) to the ROB entry that will store the result.
Baer, p. 97

Decode-rename Stage
• Reservation station available? No → structural hazard: stall incoming instructions.
• Free ROB entry? No → structural hazard: stall incoming instructions.
• Yes to both → assign a reservation station and the tail of the ROB to the instruction.
Baer p. 97

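Dispatch Stage
• For each source operand, check the register map:
  • Mapped to a logical register → forward the value to the Reservation Station (RS); ReadyBit(RS) ← 1.
  • Mapped to a ROB entry → check the ROB entry flag:
    • Flag on (value) → forward the value to the RS; ReadyBit(RS) ← 1.
    • Flag off (tag) → forward the ROB tag to the RS; ReadyBit(RS) ← 0.
• Map the result register to the tag and enter the tag into the RS.
• Enter the instruction at the tail of the ROB; ResultFlag(tail of ROB) ← 0.
Baer p. 98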
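The decode-rename and dispatch rules can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the names `Machine`, `RobEntry`, and `dispatch` are hypothetical, ROB tags are generated as E1, E2, ..., and a single pool of reservation stations is modeled by a counter.

```python
from dataclasses import dataclass, field

@dataclass
class RobEntry:
    dest: str             # logical destination register
    ready: bool = False   # ResultFlag: set once the result has been broadcast
    value: object = None

@dataclass
class Machine:
    rob_size: int
    rs_free: int                                  # number of free reservation stations
    regs: dict = field(default_factory=dict)      # logical register values
    reg_map: dict = field(default_factory=dict)   # logical register -> ROB tag
    rob: dict = field(default_factory=dict)       # ROB tag -> RobEntry
    next_tag: int = 1

def dispatch(m, dest, srcs):
    """Return the reservation-station contents, or None on a structural hazard."""
    if m.rs_free == 0 or len(m.rob) == m.rob_size:
        return None                                # stall incoming instruction
    operands = []
    for s in srcs:
        tag = m.reg_map.get(s)
        if tag is None:
            operands.append((1, m.regs[s]))        # value from the logical register
        elif m.rob[tag].ready:
            operands.append((1, m.rob[tag].value)) # value already in the ROB
        else:
            operands.append((0, tag))              # wait for this tag on the CDB
    tag = f"E{m.next_tag}"
    m.next_tag += 1
    m.rob[tag] = RobEntry(dest)                    # enter at tail, ResultFlag = 0
    m.reg_map[dest] = tag                          # map result register to tag
    m.rs_free -= 1
    return {"ops": operands, "tag": tag}
```

Each operand is a (flag, data) pair: flag 1 means data is a value, flag 0 means data is a ROB tag. Dispatching R4 ← R0 * R2 followed by R6 ← R4 * R8 gives the second instruction the operand (0, "E1") for R4.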

Issue Stage
• Both flags in the RS on? No → wait.
• Functional unit stalled (waiting for the CDB)? Yes → wait.
• Otherwise, issue the instruction to the functional unit to start execution.
• If multiple functional units of the same type are available, use a scheduling algorithm.
CDB = Common Data Bus
Baer p. 98

Execute
• Last cycle of execution? No → keep executing.
• Got ownership of the CDB? No → wait. Yes → broadcast the result and the associated tag.
• If multiple functional units request ownership of the Common Data Bus (CDB) in the same cycle, a hardwired priority protocol picks the winner.
• The ROB stores the result in the entry identified by the tag and sets the corresponding ReadyBit.
• RSs waiting on the same tag store the result and set the corresponding flag.
Baer p. 98
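A minimal sketch of CDB arbitration and broadcast follows. The fixed priority order is passed in as a list, and the dictionary layout for ROB entries and reservation stations is an assumption made for illustration, not the layout used in the hardware.

```python
def arbitrate(requests, priority):
    """Pick the CDB owner: the requesting unit that comes first in `priority`."""
    for unit in priority:
        if unit in requests:
            return unit
    return None

def broadcast(tag, value, rob, stations):
    """Deliver (tag, value) to the ROB entry and to every operand waiting on tag."""
    rob[tag]["flag"] = 1
    rob[tag]["data"] = value
    for rs in stations:
        for op in rs["ops"]:
            if op["flag"] == 0 and op["data"] == tag:
                op["flag"], op["data"] = 1, value

# Two units finish in the same cycle; the adder has priority.
rob = {"E1": {"flag": 0, "data": None}, "E3": {"flag": 0, "data": None}}
stations = [{"ops": [{"flag": 0, "data": "E1"}, {"flag": 1, "data": 5}]}]
requests = {"mult": ("E1", 12), "add": ("E3", 7)}
winner = arbitrate(requests, priority=["add", "mult"])
broadcast(*requests.pop(winner), rob, stations)
```

The losing unit keeps its request and tries again in the next cycle, which is why a functional unit can be stalled waiting for the CDB.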

Commit Stage
• Is there a result at the head of the ROB? No → wait.
• Yes → store the result in the logical register and delete the ROB entry.
Baer p. 97
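In-order commit can be sketched as follows. The ROB is modeled as a `deque` of dictionaries, which is an assumption for illustration. The step that clears the register map when it still points at the retiring entry is not on the slide; it is added here so that later readers of the register fall back to the logical register file.

```python
from collections import deque

def commit(rob, regs, reg_map):
    """Retire at most one instruction from the head of the ROB."""
    if not rob or not rob[0]["flag"]:
        return False                      # no result at the head: nothing commits
    head = rob.popleft()                  # delete the ROB entry
    regs[head["reg"]] = head["data"]      # store the result in the logical register
    # Assumption: drop the mapping only if no younger instruction renamed it.
    if reg_map.get(head["reg"]) == head["tag"]:
        del reg_map[head["reg"]]
    return True
```

Because only the head is examined, a finished instruction behind an unfinished one has to wait, which keeps the logical register state in program order.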

Operation Timings (assuming no dependencies)
[Figure: timelines over cycles 0 to 7 for an addition and for a multiplication. Each operation passes through the stages decoded, dispatched, issued, finish execution, broadcast, and commit (if at the head of the ROB).]
Baer, p. 98

Example
i1: R4 ← R0 * R2 # use reservation station 1 of multiplier
i2: R6 ← R4 * R8 # use reservation station 2 of multiplier
i3: R8 ← R2 + R12 # use reservation station 1 of adder
i4: R4 ← R14 + R16 # use reservation station 2 of adder

ROB snapshot
• i1: R4 ← R0 * R2 (executing)
• i2: R6 ← R4 * R8 (dispatched)
• i3: R8 ← R2 + R12
• i4: R4 ← R14 + R16
ROB entries (Flag, Data, Log. Reg): E1 = (0, E1, R4), head; E2 = (0, E2, R6), tail.
Register Map: R4 → E1, R6 → E2.
Multiplier Reservation Stations: i2 is in the second station with Flag1 = 0, Oper1 = E1, Flag2 = 1, Oper2 = (R8), Tag = E2.

ROB snapshot
• i1: R4 ← R0 * R2 (executing)
• i2: R6 ← R4 * R8 (dispatched)
• i3: R8 ← R2 + R12 (ready to broadcast)
• i4: R4 ← R14 + R16 (dispatched)
ROB entries (Flag, Data, Log. Reg): E1 = (0, E1, R4), head; E2 = (0, E2, R6); E3 = (0, E3, R8); E4 = (0, E4, R4), tail.
Register Map: R4 → E4, R6 → E2, R8 → E3.
“register R4, which was renamed as ROB entry E1 and tagged as such in the reservation station Mult2, is now mapped to ROB entry E4.” (Baer, p. 102)
Multiplier Reservation Stations: i2 with Flag1 = 0, Oper1 = E1, Flag2 = 1, Oper2 = (R8), Tag = E2.
Adder Reservation Stations: i4 with Flag1 = 1, Oper1 = (R14), Flag2 = 1, Oper2 = (R16), Tag = E4.

ROB snapshot
• i1: R4 ← R0 * R2 (ready to broadcast)
• i2: R6 ← R4 * R8 (dispatched)
• i3: R8 ← R2 + R12 (broadcast)
• i4: R4 ← R14 + R16 (ready to broadcast)
Assume the adder has priority to broadcast.
ROB entries (Flag, Data, Log. Reg): E1 = (0, E1, R4), head; E2 = (0, E2, R6); E3 = (1, (i3), R8); E4 = (0, E4, R4), tail.
Register Map: R4 → E4, R6 → E2, R8 → E3.
Multiplier Reservation Stations: i2 with Flag1 = 0, Oper1 = E1, Flag2 = 1, Oper2 = (R8), Tag = E2.

ROB snapshot
• i1: R4 ← R0 * R2 (ready to broadcast)
• i2: R6 ← R4 * R8 (dispatched)
• i3: R8 ← R2 + R12
• i4: R4 ← R14 + R16 (broadcast)
Assume the adder has priority to broadcast.
ROB entries (Flag, Data, Log. Reg): E1 = (0, E1, R4), head; E2 = (0, E2, R6); E3 = (1, (i3), R8); E4 = (1, (i4), R4), tail.
Register Map: R4 → E4, R6 → E2, R8 → E3.
Multiplier Reservation Stations: i2 with Flag1 = 0, Oper1 = E1, Flag2 = 1, Oper2 = (R8), Tag = E2.

ROB snapshot
• i1: R4 ← R0 * R2 (broadcast)
• i2: R6 ← R4 * R8 (dispatched)
• i3: R8 ← R2 + R12
• i4: R4 ← R14 + R16
ROB entries (Flag, Data, Log. Reg): E1 = (1, (i1), R4), head; E2 = (0, E2, R6); E3 = (1, (i3), R8); E4 = (1, (i4), R4), tail.
Register Map: R4 → E4, R6 → E2, R8 → E3.
Multiplier Reservation Stations: i2 with Flag1 = 1, Oper1 = (i1), Flag2 = 1, Oper2 = (R8), Tag = E2.

ROB snapshot
• i1: R4 ← R0 * R2 (commit)
• i2: R6 ← R4 * R8 (executing)
• i3: R8 ← R2 + R12
• i4: R4 ← R14 + R16
ROB entries (Data, Log. Reg): the entry of i1 holds (i1) for R4 and is committed; E2 for R6 is now the head; the entry of i3 holds (i3) for R8; the entry of i4 holds (i4) for R4 and is the tail.
Register Map: R4 → E4, R6 → E2, R8 → E3.

IBM 360/91 – unveiled in 1966

Some variant of the Tomasulo algorithm is the basis for the design of all out-of-order processors. Baer p. 97

Data dependencies between instructions: Several instructions get to the end of the front end and have to wait for operands. Where should these instructions wait? How do they become ready for issue? Baer p. 177

Wakeup Stage: Detects instruction readiness. In an m-way superscalar, we hope for m instructions to be woken up on each cycle. Baer p. 177

Select Step • or Scheduling step: Arbitrates between multiple instructions vying for the same functional unit. • Variations of first-come-first-served (or FIFO) • Bypassing (or forwarding) of operands to units allows earlier selection. • Critical instructions may have preference for selection. Baer p. 177

Out-of-Order Architectures Key idea: allow instructions following a stalled one to start execution out of order. A FIFO schedule is not a good idea! Where to store stalled instructions? Baer p. 178

Two Extreme Solutions
• Tomasulo: a separate reservation station for each functional unit (distributed window). Example: IBM PowerPC series.
• Instruction Window: a centralized reservation station for all functional units (centralized window). Example: Intel P6 architecture.
Baer p. 178

A Hybrid Solution
• MIPS R10000: 3 sets of reservation stations:
  • address calculations
  • floating-point units
  • load-store units
Reservation stations are shared among groups of functional units (hybrid window).
Baer p. 178

How does a design team select between a centralized, distributed, or hybrid window? What are the compromises? Baer p. 179

Window design • Resource allocation: centralized is better • static partitioning of resources is worse than dynamic allocation • Large windows: speed and power come into play Baer p. 179

Two-Step Instruction Issue Wakeup: instruction is ready for execution Select: instruction is assigned to an execution unit.

Wakeup Step
[Figure: w window entries connected to f functional units.]
Baer p. 180

Window entry with buses from 8 exec units

Wakeup Step
• f functional units, w window entries.
• We need one bus from each functional unit to each window entry.
• We also need two comparators for each functional unit in each window entry. Thus we need 2fw comparators.
• If we separate the functional units and window slots into two equal-size groups, each group needs only 2(f/2)(w/2) = fw/2 comparators.
• We will also need fewer (shorter) buses from units to slots.
Baer p. 180
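The comparator counts can be checked with a short calculation. The function below is a sketch under two assumptions: every window entry has two source operands, and results are broadcast only within a group. The name `comparators` is hypothetical.

```python
def comparators(f, w, groups=1):
    """Total wakeup comparators for f functional units and w window entries
    split into `groups` equal, independent groups (2 source operands per entry)."""
    assert f % groups == 0 and w % groups == 0
    per_group = 2 * (f // groups) * (w // groups)
    return groups * per_group

centralized = comparators(8, 64)          # 2fw
split = comparators(8, 64, groups=2)      # two groups of fw/2 each
print(centralized, split)
```

With f = 8 and w = 64, the centralized window needs 2fw = 1024 comparators, while each of the two groups needs fw/2 = 256, for 512 in total.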

Select Step • Priority encoder: a circuit that receives several requests and issues one grant • woken up instructions vying for the same unit send requests. • priority related to position in window • Smaller window → smaller priority encoder Baer p. 181
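A priority encoder for the select step can be sketched as below. Position 0 is assumed to have the highest priority, and the window layout (a list of dictionaries with `ready` and `unit` fields) is hypothetical.

```python
def priority_encode(requests):
    """Grant the lowest-index asserted request; return None if none is asserted."""
    for position, asserted in enumerate(requests):
        if asserted:
            return position
    return None

def select(window, unit):
    """Woken-up entries vying for `unit` send requests; one gets the grant."""
    requests = [e["ready"] and e["unit"] == unit for e in window]
    return priority_encode(requests)

window = [
    {"ready": False, "unit": "add"},
    {"ready": True, "unit": "mul"},
    {"ready": True, "unit": "add"},
    {"ready": True, "unit": "add"},
]
granted = select(window, "add")
```

The loop stands in for a chain of gates whose depth grows with the number of window entries, which is why a smaller window gives a smaller and faster encoder.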

When should a centralized window be replaced by a distributed or hybrid one? When the wakeup-select steps are on the critical path. The threshold appears to be windows with around 64 entries on a 4-wide superscalar processor. Baer p. 182

• Intel Pentium 4: 2 large windows, 2 schedulers per window
• Intel Pentium III and Intel Core: smaller centralized window
• AMD Opteron: 4 sets of reservation stations
Baer p. 182

Relation between Select and Wake Up Example: i: R51 ← R22 + R33 i+1: R43 ← R27 – R51 The name given to the result of instruction i (R51) must be broadcast as soon as instruction i is selected. Broadcasting the tag of R51 wakes up instruction i+1. For single-cycle latency instructions, the start of the execution is too late to broadcast the tag. Baer p. 183

Speculative Wake Up and Select Example: i: R51 ← load(R22) i+1: R43 ← R27 – R51 i+2: R35 ← R51 + R28 In this case the tag of the destination of instruction i is broadcast. Instructions i+1 and i+2 are speculatively woken up and selected based on a cache-hit latency. In the case of a cache miss all dependent instructions that have been woken up and selected must be aborted. Baer p. 183
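Speculative wakeup and the abort on a cache miss can be sketched as follows. The entry fields (`waiting_on`, `speculative_on`, `selected`) and both function names are hypothetical, and the sketch ignores instructions that depend indirectly on the load.

```python
def speculative_wakeup(window, load_tag):
    """Wake every entry waiting on load_tag, assuming the load hits in the cache."""
    woken = []
    for e in window:
        if load_tag in e["waiting_on"]:
            e["waiting_on"].discard(load_tag)
            e["speculative_on"].add(load_tag)   # remember the speculation
            if not e["waiting_on"]:
                e["selected"] = True            # all operands (speculatively) ready
            woken.append(e["name"])
    return woken

def load_resolved(window, load_tag, hit):
    """On a hit, confirm the speculation; on a miss, abort selected dependents."""
    aborted = []
    for e in window:
        if load_tag in e["speculative_on"]:
            e["speculative_on"].discard(load_tag)
            if not hit:
                e["waiting_on"].add(load_tag)   # operand is not ready after all
                if e["selected"]:
                    e["selected"] = False
                    aborted.append(e["name"])
    return aborted
```

An aborted entry can go back to waiting only if it is still in the window, so an entry must not be freed while its `speculative_on` set is non-empty.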

Speculative Selection and the Reservation Stations • An instruction must remain in a reservation station after it is scheduled • A bit indicates that the instruction has been selected • Station is free once it is sure that the instruction selection is not speculative anymore • Windows are large in comparison with the number of functional units • accommodate many instructions in flight, some speculatively. Baer p. 183

What happens upon selection of an instruction?
• Tomasulo reservation stations: the opcode and the operands go from the reservation station to the functional unit.
• Instruction window with an integrated register file: the opcode goes from the instruction window to the functional unit, and the operands are read from the physical register file.
Baer p. 183

The Complexity of Bypassing
Example:
i: R51 ← R22 + R33
i+1: R43 ← R27 – R51
Functional Unit A computes i; Functional Unit B computes i+1.
The output of A must be forwarded to B, bypassing storage.
Baer p. 183

The Complexity of Bypassing
Example:
i: R51 ← R22 + R33
i+1: R43 ← R27 – R51
Functional Unit A computes both i and i+1.
Now the bypass must forward the output of A to the input of A.
But the hardware has to implement both buses.
Baer p. 183

The Complexity of Bypassing
Example:
i: R51 ← R22 + R33
i+1: R43 ← R27 – R51
• Also, we need buses to forward the output of B.
• In general, given k functional units we may need k^2 buses.
• Buses become long to avoid crossing each other.
• Forwarding may need more than one cycle to complete.
• Forwarding may limit the number of functional units in a processor.
Baer p. 184
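The k^2 figure comes from counting every (producer, consumer) pair of functional units, including a unit forwarding to itself. A minimal sketch, with the hypothetical helper `bypass_paths`:

```python
from itertools import product

def bypass_paths(units):
    """Every (producer, consumer) pair that may need a forwarding bus."""
    return list(product(units, repeat=2))

paths = bypass_paths(["A", "B"])
print(len(paths), paths)
```

For the two units A and B this lists A→A, A→B, B→A, and B→B, and the count grows quadratically as units are added.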

Load Speculation • Load Address Speculation • Used for data prefetching • Memory dependence prediction • Used to speculate data flow from a store to a subsequent load. Baer p. 185

Store Buffer • Store Buffer: A circular queue • Entry allocated when store instruction is decoded • Entry removed when store is committed • Keep data for stores that have not yet committed Baer p. 185

States of a Store Buffer Entry
• AV: Available
• AD: Address is known (data to be stored is still to be computed by another instruction)
• RE: Result and Address known
• CO: Committed
Transitions: AV → AD on address computation; AD → RE when the data to be stored is computed; RE → CO when the store instruction reaches the top of the ROB; CO → AV when the data is written to cache.
What happens with the store buffer on a branch misprediction?
Baer p. 185

Handling Store Buffer on Branch Misprediction and Exceptions • Entries preceding the mispredicted branch: • are in COMMIT state • must be written to cache • Entries following the misprediction • become AVAILABLE • Exceptions: similar • Must write the COMMIT entries to cache before handling the exception Baer p. 186
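The store buffer states and the misprediction rule can be sketched as a small state machine. The event names are hypothetical labels for the transitions, and the flush assumes, as the slide does, that every store older than the mispredicted branch is already in the CO state.

```python
from enum import Enum

class State(Enum):
    AV = "available"
    AD = "address known"
    RE = "result and address known"
    CO = "committed"

# Hypothetical event names for the transitions between states.
TRANSITIONS = {
    (State.AV, "address_computed"): State.AD,
    (State.AD, "data_computed"): State.RE,
    (State.RE, "reached_rob_head"): State.CO,
    (State.CO, "written_to_cache"): State.AV,
}

def step(state, event):
    """Advance one store buffer entry; raises KeyError on an illegal event."""
    return TRANSITIONS[(state, event)]

def flush_on_misprediction(entries):
    """entries: states in program order. Committed stores survive and must
    still be written to cache; all other entries become available."""
    return [s if s is State.CO else State.AV for s in entries]
```

Exceptions can reuse `flush_on_misprediction`, provided the surviving CO entries are written to cache before the handler runs.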

Load Instructions and Load Speculation Baer p. 187

Load/Store Window Implementation – Most Restricted Loads/Stores inserted in program order. Load/Store Window (FIFO) Single window for loads and stores. Loads/Stores removed in same order – at most one per cycle. Baer p. 187

Load Bypassing • Compare address of load with all addresses in store buffer • Load bypassing: If there is no match → load can proceed • What happens if the operand address of any entry in store buffer is not yet computed? • load cannot proceed • What happens if there is a match to an entry that is not committed? • load cannot access cache • “match” is the last match in program order. • Need associative search of operand addresses in store buffer Baer p. 187

Load Forwarding • If these conditions are true: • A load matches a store buffer entry AND • The result is available for the entry (entry is in RE or CO state) • Then the result can be sent to the register specified by the load • If the match is with an entry in AD state then: • Load waits for entry to reach RE state
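Load bypassing and load forwarding can be combined into one decision function. The sketch below assumes the store buffer is given as a list of older stores in program order, with `addr` set to `None` while the address is not yet computed; the function name and the returned action labels are hypothetical.

```python
def handle_load(load_addr, older_stores):
    """Decide what a load may do given the older stores in the store buffer.

    Returns ("cache", None), ("forward", data), or ("wait", None).
    """
    # Any unknown store address could match: the load cannot proceed.
    if any(s["addr"] is None for s in older_stores):
        return ("wait", None)
    matches = [s for s in older_stores if s["addr"] == load_addr]
    if not matches:
        return ("cache", None)            # load bypassing
    last = matches[-1]                    # last match in program order
    if last["state"] in ("RE", "CO"):
        return ("forward", last["data"])  # load forwarding
    return ("wait", None)                 # AD: wait for the entry to reach RE

stores = [
    {"addr": 0x100, "state": "CO", "data": 1},
    {"addr": 0x200, "state": "RE", "data": 2},
    {"addr": 0x100, "state": "RE", "data": 3},
]
action = handle_load(0x100, stores)
```

The list comprehension plays the role of the associative search over operand addresses; in hardware all comparisons happen in parallel.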

Load Speculation in Out-of-Order Architectures Dynamic Memory Disambiguation Problem: Loads are issued speculatively ahead of preceding stores in program order. How to ensure that data dependences are not violated?

Three approaches Pessimistic: Wait until certain that load can proceed. (like load forwarding and bypassing) Optimistic: Load always proceeds speculatively. Need a recovery mechanism. Dependence prediction: use a predictor to decide to speculate or not. Try to have fewer recoveries.
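The third approach can be illustrated with a very small predictor. This is a hypothetical sketch, not a scheme described in the slides: it keeps one saturating counter per load address (PC) and stops speculating on a load once it has caused enough dependence violations. The class name, threshold, and counter width are all assumptions.

```python
class DependencePredictor:
    """Per-load saturating counters: high count means 'do not speculate'."""

    def __init__(self, threshold=2, max_count=3):
        self.threshold = threshold
        self.max_count = max_count
        self.counters = {}                 # load PC -> counter

    def should_speculate(self, load_pc):
        return self.counters.get(load_pc, 0) < self.threshold

    def update(self, load_pc, violated):
        """Train on whether issuing this load early violated a dependence."""
        count = self.counters.get(load_pc, 0)
        if violated:
            count = min(count + 1, self.max_count)
        else:
            count = max(count - 1, 0)
        self.counters[load_pc] = count
```

A predictor that always answers True behaves like the optimistic approach, and one that always answers False behaves like the pessimistic approach; the counters sit between the two and aim to reduce the number of recoveries.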
