Tomasulo Algorithm and Dynamic Branch Prediction

Tomasulo Algorithm and Dynamic Branch Prediction Ali Azarpeyvand Advanced Computer Architecture

Review: Summary • Instruction Level Parallelism (ILP) in SW or HW • HW exploiting ILP • Works when can’t know dependence at run time • Code for one machine runs well on another • Key idea of Scoreboard: Allow instructions behind stall to proceed (Decode => Issue instr & read operands) • Enables out-of-order execution => out-of-order completion • ID stage checked both for structural and WAW

Review: Three Parts of the Scoreboard 1.Instruction status—which of 4 steps the instruction is in 2. Functional unit status—Indicates the state of the functional unit (FU). 9 fields for each functional unit Busy—Indicates whether the unit is busy or not Op—Operation to perform in the unit (e.g., + or –) Fi—Destination register Fj, Fk—Source-register numbers Qj, Qk—Functional units producing source registers Fj, Fk Rj, Rk—Flags indicating when Fj, Fk are ready 3. Register result status—Indicates which functional unit will write each register, if one exists. Blank when no pending instructions will write that register

Review: Scoreboard Example Cycle 3 • Issue MULT? No, stall on structural hazard

Review: Scoreboard Example Cycle 9 • Read operands for MULT & SUBD? Issue ADDD?

Review: Scoreboard Example Cycle 17 • Write result of ADDD? No, WAR hazard

Review: Scoreboard Example Cycle 62 • In-order issue; out-of-order execute & commit

Different ILP Techniques

If Loop Basic Blocks & ILP • A basic block is a straight-line code segment with no branches in or out of it. • Tend to be small: • 4-7 instructions on average. • ILP within a basic block is limited. • Need ways to parallelize execution across multiple basic blocks!

  Loop-Level Parallelism (LLP) • Perform multiple loop iterations in parallel. • Works for some loops, but not others. • Examples: • for (I=1; I<=1000; I++) x[I] = x[I] + y[I]; • for (I=1; I<=1000; I++) sum = sum + x[I] • LLP  ILP is possible by loop unrolling • Statically by compiler • Dynamically by the hardware • Early vector-based supercomputers (e.g. Cray) relied on this technique extensively. • Compiling FORTRAN loops to vector operations.

Dependences • A dependence is a way in which one instruction can depend on (be impacted by) another for scheduling purposes. • Three major dependence types: • Data dependence: RAW • Name dependence: WAR, WAW • Control dependence: branch, jump, etc. • A dependency (or dependence) is a particular instance of one instruction depending on another. • The instructions can’t be effectively fully parallelized, or reordered. • Dependencies may cause hazards

A A B C B Data Dependence • Recursive definition: • Instruction B is data dependent on instruction A iff: • B uses a data result produced by instruction A, or • There is another instruction C such that B is data dependent on C, and C is data dependent on A. • Potential for a RAW hazard • Loop: LD F0,0(R1) • ADDD F4,F0,F2 • SD 0(R1),F4 • SUBI R1,R1,#8 • BNEZ R1,Loop • Overcoming the limitations caused by dependencies: • Maintaining the dependency but avoiding hazard • Scheduling the code • Removing the dependency Data dependenciesin loop example

A time B A time B Name Dependence • Occurs when two instructions both access the same data storage location due to reuse the storage (Also called storage dependence, at least one of the accesses must be a write.) • Two sub-types (for inst. B after inst. A): • Anti-dependence: A reads, then B writes. • Potential for a WAR hazard. • Output dependence: A writes, then B writes. • Potential for a WAW hazard. • Note: Name dependencies can be avoided by changing instructions to use different locations (rather than reusing a location).

Instruction Scheduling Schemes • Point: To reduce data hazards. Two types: • Static scheduling: (ch.4) • Done by compiler • Instructions reordered at compile-time to fill delay slots • Problems: • Some data dependences not known till run-time • Program binary code is more tied to architecture • Dynamic scheduling: (ch.3) • Done by the processor • Reorder instructions at execution time

Run-Time Data Dependencies • Why compiler didn’t take care of it? • Are there any data dependences in this code? • SW 100(R1),R6 • LW R7,36(R2) • Answer: It Depends • 100+R1 = 36+R2? • Cannot detect this at compile time! • Values of R1 and R2 may only be computable dynamically. • Processor could stall the LW after effective-address calculation, if addr. matches that of a previously-issued store not yet completed.

Issue Queue Read Operand Instruction Decode Review: Splitting Instruction Decode • Single “Instruction Decode” stage split into 2 parts: • Instruction Issue or dispatch (in-order) • Determine instruction type • Check for structural hazards • Read Operands (can be out-of-order) • Stall instruction until no data hazards • Read operands • Release instruction to begin execution • Need some sort of queue or buffer to hold instructions till their operands are ready. • Note: Out-of-order completion makes precise exception handling difficult!

Imprecise Exceptions • Out-of-order execution/completion may generate imprecise exceptions • Later instruction has been completed and a previous instruction has generated an exception • ADD  exception • SUB  completed • Later instruction has generated an exception when previous instructions are in pipeline • ADD  in the pipeline • SUB  exception • How to handle these imprecise exceptions? • Later discussions: using in-order completion.

Where are the name dependencies? 1 Loop: LD F0,0(R1) 2 ADDD F4,F0,F2 3 SD 0(R1),F4 ;drop SUBI & BNEZ 4 LD F0,-8(R1) 2 ADDD F4,F0,F2 3 SD -8(R1),F4 ;drop SUBI & BNEZ 7 LD F0,-16(R1) 8 ADDD F4,F0,F2 9 SD -16(R1),F4 ;drop SUBI & BNEZ 10 LD F0,-24(R1) 11 ADDD F4,F0,F2 12 SD -24(R1),F4 13 SUBI R1,R1,#32 ;alter to 4*8 14 BNEZ R1,LOOP 15 NOP How can remove them?

Where are the name dependencies? 1 Loop: LD F0,0(R1) 2 ADDD F4,F0,F2 3 SD 0(R1),F4 ;drop SUBI & BNEZ 4 LD F6,-8(R1) 5 ADDD F8,F6,F2 6 SD -8(R1),F8 ;drop SUBI & BNEZ 7 LD F10,-16(R1) 8 ADDD F12,F10,F2 9 SD -16(R1),F12 ;drop SUBI & BNEZ 10 LD F14,-24(R1) 11 ADDD F16,F14,F2 12 SD -24(R1),F16 13 SUBI R1,R1,#32 ;alter to 4*8 14 BNEZ R1,LOOP 15 NOP Called “register renaming”

Perspectives on Code Movement • Again Name Dependencies are Hard for Memory Accesses • Does 100(R4) = 20(R6)? • From different loop iterations, does 20(R6) = 20(R6)? • Our example required compiler to know that if R1 doesn’t change then:0(R1) ≠ -8(R1) ≠ -16(R1) ≠ -24(R1) There were no dependencies between some loads and stores so they could be moved by each other

Control Dependence • Final kind of dependence called control dependence • Example if p1 {S1;}; if p2 {S2;}; S1 is control dependent on p1 and S2 is control dependent on p2 but not on p1. • Occurs when the execution of an instruction depends on a conditional branch instruction. • Program control flow must follow for correct executions.

Perspectives on Code Movement • Two (obvious) constraints on control dependences: • An instruction that is control dependent on a branch cannot be moved before the branch so that its execution is no longer controlled by the branch. • An instruction that is not control dependent on a branch cannot be moved to after the branch so that its execution is controlled by the branch. • Control dependencies relaxed to get parallelism; • get same effect of exceptions if preserve order • And data flow (value in register depends on branch) • Exception: • BEQZ R2, L1 • LW R1, 0(R2) (if LW is moved before BEQ should not generate exceptions) • Dataflow: • ADD R1,R2,R3 • BEQZ R4,L • SUB R1,R5,R6 (change on R1 valid only if branch not taken)

Control Dependence – Another Example • Example: (for data flow) DADDU R2, R3, R4 BEQZ R5, L1 DSUBU R2, R6, R7 L1: OR R8, R2, R9 • OR depends DADDU and DSUBU. Maintaining data dependences is not enough • Control flow decides where the correct R2 comes from

Control Dependence vs. Correctness • Violation of control dependence may not mean incorrect execution • Cannot affect exception behavior or data flow • Moving DSUBU before the branch • if R4 is not used after skipnext • Control dependence is preserved by • Control hazard detection that causes stalls • Stalls can be reduced by delayed branches DADDU R1, R2, R3 BEQZ R12, skipnext DSUBU R4, R5, R6 DADDU R5, R4, R9 skipnext: OR R7, R8, R9

Relaxing Control Dependence • Only two things must really be preserved: • Data flow (how a given result is produced) • Exception behavior • Some techniques permit removing control dependence from instruction execution, by dependently ignoring instruction results instead. • Speculation (betting on branches, to fill delay slots) • Make instructions unconditional if no harm done • Speculative multiple-execution • Take both paths, invalidate results of one later

Where are the control dependencies? 1 Loop: LD F0,0(R1) 2 ADDD F4,F0,F2 3 SD 0(R1),F4 4 SUBI R1,R1,8 5 BEQZ R1,exit 6 LD F0,0(R1) 7 ADDD F4,F0,F2 8 SD 0(R1),F4 9 SUBI R1,R1,8 10 BEQZ R1,exit 11 LD F0,0(R1) 12 ADDD F4,F0,F2 13 SD 0(R1),F4 14 SUBI R1,R1,8 15 BEQZ R1,exit ....

Review: Scoreboard Summary • Speedup 1.7 from compiler; 2.5 by hand BUT slow memory (no cache) • Limitations of 6600 scoreboard • No forwarding (First write regsiter then read it) • Limited to instructions in basic block (small window) • Number of functional units(structural hazards) • Wait for WAR hazards • Prevent WAW hazards

Tomasulo’s Algorithm • Goal: High Performance without special compilers • Tomasulo’s algorithm: • Another approach for dynamic scheduling • First used in IBM 360/91 FPU, many years ago, 3 years after CDC 6600 (1966) • Based on concept of dynamic register renaming • Like static renaming we used in loop-unroll example • Differences between IBM 360 & CDC 6600 ISA • IBM has only 2 register specifiers /instr vs. 3 in CDC 6600 • IBM has 4 FP registers vs. 8 in CDC 6600 • Some features: • Copes with long-latency operations (FPU or mem.) • Eliminates WAR & WAW hazards without stalling • Instructions issue as soon as their operands are ready, direct forwarding, bypass register • Distributed hazard detection and execution control • Why Study? lead to Alpha 21264, HP 8000, MIPS 10000, Pentium II, PowerPC 604, …

Tomasulo Algorithm vs. Scoreboard • Control & buffers distributed with Function Units (FU) vs. centralized in scoreboard; • FU buffers called “reservation stations”; have pending operands • Registers in instructions replaced by values or pointers to reservation stations(RS); called registerrenaming; • Renaming all destination registers • avoids WAR, WAW hazards • More reservation stations than registers, so can do optimizations compilers can’t • Results to FU from RS, not through registers, over Common Data Bus that broadcasts results to all FUs • Load and Stores treated as FUs with RSs as well

Register Renaming • Rename the registers using temporaries • DIV.D F0 F2 F4 DIV.D F0 F2 F4 • ADD.D F6 F0 F8 ADD.D S F0 F8 • S.D F6 0(R1) S.D S 0(R1) • SUB.D F8, F10, F14 SUB.D T, F10, F14 • MUL.D F6, F10, F8 MUL.D F6, F10, T • Remember these are not true dependences • RAW is a true data dependence • WAW (output dependence) and WAR (anti-dependence) are not. WAR WAW

Components of a Tomasulo Unit • Reservation stations (RSs) • Buffer the operands to pending instructions while they are waiting for operands to enter the execution units. • Issue logic • Redirects (renames) instructions’ register outputs to reservation-station slots. • Results go directly to RSs rather than thru reg. file. • Distributed hazard detection • Handled separately by each functional unit • Load & store buffers (can be combined with RS) • Queue up memory access requests

Tomasulo Organization

Major Steps in Tomasulo • Issue • Get instruction from FP op queue • If a slot in appropriate RS (or load-store buffer) is available, send instruction there; else stall it (structural hazard). • Send operand values to RS if already available, otherwise, just note the names (RS) of the operands • Execute • While operands not yet available, monitor CDB for them. • When all operands are in RS, begin executing instruction. • Write result • When result available & CDB is free, write result to CDB, then to registers • mark reservation station available • Normal data bus: data + destination (“go to” bus). • Common data bus: data + source (“come from” bus). • 64 bits of data + 4 bits of Functional Unit source address. • Does the broadcast.

Example for Tomasulo’s Algorithm • We will go through the same code fragment that was used in the scoreboarding example: • LD F6,34(R2) • LD F2,45(R3) • MULTD F0,F2,F4 • SUBD F8,F6,F2 • DIVD F10,F0,F6 • ADDD F6,F8,F2 DataDependence Anti-Dependence OutputDependence

Reservation Station Fields • In each slot: • Op - The operation to perform on operands S1 & S2 • Qj, Qk - The RS slots that will produce S1, S2 • Vj, Vk - The values of S1 & S2. • Busy - RS & its execution unit are occupied • In register file entries & store buffer slots: • Qi - The RS slot containing the op whose result should be stored here. • In load and store buffers (combined in RS): • Address – for load and store.

Details of The Algorithm D/S1/S2=dest./srcs; r/x=station of instruction/any; Register/Store=register/store data structs; Import source operands Tell register it shouldexpect to hear from us later Write dest. reg. (if still expecting) Write to expectant RS slots Write to expectant Store buffer slots

Tomasulo Example Cycle 0

Tomasulo Example Cycle 1 Yes

Tomasulo Example Cycle 2 Note: Unlike 6600, can have multiple loads outstanding

Tomasulo Example Cycle 3 • Note: registers names are removed (“renamed”) in Reservation Stations; • Load1 completing; what is waiting for Load1?

Tomasulo Example Cycle 4 • Load2 completing; what is waiting for it?

Tomasulo Example Cycle 6 • Issue ADDD here vs. scoreboard?

Tomasulo Example Cycle 7 • Add1 completing; what is waiting for it?

Tomasulo Example Cycle 11 • Write result of ADDD here vs. scoreboard?

Tomasulo Example Cycle 12 Instruction status Execution Write Instruction j k Issue complete Result Busy Address LD F6 34+ R2 1 3 4 Load1 No LD F2 45+ R3 2 4 5 Load2 No MULTD F0 F2 F4 3 Load3 No SUBD F8 F6 F2 4 7 8 DIVD F10 F0 F6 5 ADDD F6 F8 F2 6 10 11 Reservation Stations S1 S2 RS for j RS for k Time Name Busy Op Vj Vk Qj Qk 0 Add1 No 0 Add2 No 0 Add3 No 3 Mult1 Yes MULTD M(45+R3) R(F4) 0 Mult2 Yes DIVD M(34+R2) Mult1 Register result status F0 F2 F4 F6 F8 F10 F12 ... F30 Clock 12 FU Mult1 M(45+R3) (M-M)+M() M()–M() Mult2 • Note: all quick instructions complete already

Tomasulo Algorithm and Dynamic Branch Prediction