Out-of-Order Execution Scheduling
ECE1773 - Fall ‘07 ECE Toronto
Instruction Level Parallel Processing
• Sequential Execution Semantics
• Out-of-Order Execution
  • How it can help
  • Issues:
    • Maintaining Sequential Semantics
    • Scheduling
      • Scoreboard
      • Register Renaming
• Initially, we’ll focus on Registers; Memory comes later
Sequential Semantics - Review
• Instructions appear as if they executed:
  • In the order they appear in the program
  • One after the other
[Figure: Program Order compared under Pipelining, Superscalar, and Out-of-Order execution]
Out-of-Order Execution
do { sum += a[++m]; i--; } while (i != 0);
loop: add r4, r4, 1
      ld  r2, 10(r4)
      add r3, r3, r2
      sub r1, r1, 1
      bne r1, r0, loop
[Figure: fetch / decode / execute timelines for add, ld, add, sub, bne, shown for Superscalar and for out-of-order]
Sequential Semantics?
• Execution does NOT adhere to sequential semantics
  • To be precise: eventually it may
• Simplest solution: define the problem away
  • Not acceptable today: e.g., Virtual Memory
• Three-phase instruction execution
  • In-Progress, Completed and Committed
[Figure: out-of-order fetch / decode / execute timeline for add, ld, add, sub, bne, with regions labeled inconsistent and consistent]
Out-of-order Execution Issues
• Preserving Sequential Semantics
• Stalling Instructions w/ dependences
• Issuing Instructions when dependences are satisfied
Back to Sequential Semantics
• Instr. exec. in 3 phases:
  • In-progress, Completed, Committed
• OOO for In-progress and Completed
• In-order Commits
• Completed - out-of-order: “Visible only inside”
  • Results visible to subsequent instructions
  • Results not visible to outsiders
  • On interrupts completed results are discarded
• Committed - in-order: “Visible to all”
  • Results visible to subsequent instructions
  • Results visible to outsiders
  • On interrupt committed results are preserved
How Completes Help w/ Performance
DIV R3, _, _
ADD R1, _, _
ADD _, R1, _
[Figure: timelines over Time comparing in-order completes with out-of-order completes; in both cases commits are in order]
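The commit rule can be sketched in a few lines. This is only an illustration: the completion cycles are hypothetical (the slide gives none), and the sketch assumes any number of instructions may commit in the same cycle.

```python
def commit_cycles(complete_cycles):
    """Return the commit cycle of each instruction, listed in program order."""
    commits = []
    last_commit = 0
    for done in complete_cycles:
        # An instruction commits once it has completed AND every older
        # instruction has committed.
        last_commit = max(last_commit, done)
        commits.append(last_commit)
    return commits


# Hypothetical completion cycles for: DIV R3 ; ADD R1 ; ADD _, R1
completes = [10, 2, 3]
print(commit_cycles(completes))
```

The two ADDs complete long before the DIV, so the second ADD can already use R1, but neither result becomes visible to outsiders until the DIV commits.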
Implementing Completes/Commits
• Key idea:
  • Maintain sufficient state around to be able to roll back when necessary
• Roll-back:
  • Discard (aka Squash) all instructions not yet committed
• One solution (conceptual):
  • Upon Complete, the instruction records the previous value of its target register
  • Upon Discard, the instruction restores the target value
  • Upon Commit, nothing to do
• We will return to this shortly
• Focus on scheduling mechanisms
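The conceptual solution above can be sketched as a log of overwritten values. This is a minimal sketch, not the mechanism the course settles on later: the class name `HistoryFile` and its methods are made up for illustration, and it assumes `complete` is called in program order.

```python
class HistoryFile:
    """Conceptual roll-back: log the old value of each overwritten register."""

    def __init__(self, regs):
        self.regs = dict(regs)   # register state visible inside the core
        self.log = []            # (register, previous value), oldest first

    def complete(self, target, value):
        # Upon Complete: record the previous value, then update the register.
        self.log.append((target, self.regs[target]))
        self.regs[target] = value

    def commit_oldest(self):
        # Upon Commit: nothing to restore, so just drop the oldest entry.
        self.log.pop(0)

    def squash(self):
        # Upon Discard: undo all uncommitted writes, youngest first.
        while self.log:
            target, old = self.log.pop()
            self.regs[target] = old


hf = HistoryFile({"r1": 5, "r2": 7})
hf.complete("r1", 10)   # oldest instruction
hf.complete("r2", 20)
hf.complete("r1", 30)
hf.commit_oldest()      # the first write to r1 is now permanent
hf.squash()             # interrupt: discard the two uncommitted writes
print(hf.regs)
```

Undoing youngest-first matters when two uncommitted instructions write the same register: the older saved value must be the one that survives.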
Out-of-Order Execution Overview
[Figure: Program Form (static program, dynamic inst. stream (trace), execution window, completed instructions) shown against Processing Phase (Dispatch / inst. dependences, inst. Issue, inst. execution, inst. Reorder & commit); instructions are marked In-Progress, Completed, Committed]
Out-of-Order Execution: Stages
• Fetch: get instruction from memory
• Decode/Dispatch: what is it? What are the dependences?
• Issue: Go – all dependences satisfied
• Execute: perform operation
• Complete: result available to other insts.
• Commit: result available to outsiders
• We’ll start w/ Decode/Dispatch
• Then we’ll consider Issue
OOO Scheduling
• Instruction @ Decode:
  • Do I have dependences yet to be satisfied?
    • Yes: stall until they are
    • No: clear to issue
• Wake up stalled instructions:
  • Dependences satisfied
  • Allow instruction to issue
• Dependence:
  • (later instruction, earlier instruction) & type
• We’ll first consider RAW and then move on to WAW and WAR
Stalling @ Decode for RAW
• Are there unsatisfied dependences?
• RAW: have to wait for register value
  • We don’t really care who is producing the value
  • Only whether it is available
• Can use the Register Availability Vector as in pipelining/superscalar
  • Also known as scoreboard
• At Decode
  • Reset bit corresponding to your target
• At writeback, set the bit again
• Check all bits for source regs: if any is 0, stall
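The Register Availability Vector is small enough to sketch directly. The class and method names below are illustrative, not from the slides; the sketch assumes an instruction checks its sources before resetting its own target bit, so that something like add r4, r4, 4 does not stall on itself.

```python
class RegisterAvailabilityVector:
    def __init__(self, names):
        self.ready = {r: 1 for r in names}   # 1 = value available

    def must_stall(self, sources):
        # RAW check: stall if any source bit is 0.
        return any(self.ready[s] == 0 for s in sources)

    def decode(self, target):
        # Reset the bit of the target register: a write is now pending.
        if target is not None:
            self.ready[target] = 0

    def writeback(self, target):
        # The value has been produced: set the bit again.
        self.ready[target] = 1


rav = RegisterAvailabilityVector(["r1", "r2", "r3", "r4"])
rav.decode("r2")                       # ld r2, 10(r4) is in flight
print(rav.must_stall(["r3", "r2"]))    # add r3, r3, r2 has to wait
rav.writeback("r2")
print(rav.must_stall(["r3", "r2"]))    # r2 is available now
```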
Issuing Instructions: Scheduling
• Determine when an instruction can issue
  • Ignore resources for the time being
  • Stalled because of RAW w/ preceding instruction
• Concept:
  • Producer (write) notifies consumers (read)
• Requirements:
  • Consumers need to be able to identify producer
  • The register name is one possible link
• Mechanism
  • Consumer placed in a reservation station
  • Producer, on complete, broadcasts its identity
  • Waiting instructions observe
    • Update Operand Availability
    • Issue if all operands now available
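A minimal sketch of the wakeup mechanism, using the register name as the link between producer and consumers. `ReservationStation`, `observe`, `refresh` and `broadcast` are hypothetical names chosen for this illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ReservationStation:
    op: str
    sources: Dict[str, bool]      # source register -> is it available?
    target: Optional[str]
    status: str = "Wait"

    def observe(self, produced):
        # A producer broadcast the name of the register it just wrote.
        if produced in self.sources:
            self.sources[produced] = True
        self.refresh()

    def refresh(self):
        # Ready to issue once every operand is available.
        if self.status == "Wait" and all(self.sources.values()):
            self.status = "Rdy"


def broadcast(stations, produced):
    for rs in stations:
        rs.observe(produced)


add = ReservationStation("add", {"r3": True, "r2": False}, "r3")
sub = ReservationStation("sub", {"r1": True}, "r1")
for rs in (add, sub):
    rs.refresh()
print(add.status, sub.status)   # add waits for r2; sub has nothing pending
broadcast([add, sub], "r2")     # the load completes and announces r2
print(add.status, sub.status)
```

Every waiting station inspects every broadcast, which is why the identity of the producer has to be something all consumers can recognize.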
Reservation Station
• State pertaining to an instruction
  • What registers it reads
    • Whether they are available
  • What is the destination register
  • What state the instruction is in
    • Waiting
    • Executing
Out-Of-Order Exec. Example
loop: add r4, r4, 4
      ld  r2, 10(r4)      (4 cycles lat)
      add r3, r3, r2
      sub r1, r1, 1
      bne r1, r0, loop
Cycle 0
RAV:  r1=1  r2=1  r3=1  r4=1
op    src1  src2  tgt   status
(no entries yet)
Out-Of-Order Exec. Example: Cycle 0
loop: add r4, r4, 4
      ld  r2, 10(r4)      (4 cycles lat)
      add r3, r3, r2
      sub r1, r1, 1
      bne r1, r0, loop
RAV:  r1=1  r2=1  r3=1  r4=0
op    src1  src2  tgt   status  note
add   r4/1  NA/1  r4/0  Rdy     Ready to be executed
Cycle 1 (same loop as on the previous slides)
RAV:  r1=1  r2=0  r3=1  r4=1
op    src1  src2  tgt   status  note
add   r4/1  NA/1  r4    Exec    R4 gets produced now; notify those waiting for R4
ld    r4/1  NA/1  r2    Rdy
Cycle 2 (same loop as on the previous slides)
RAV:  r1=1  r2=0  r3=0  r4=1
op    src1  src2  tgt   status  note
add   r4/1  NA/1  r4    Cmtd
ld    r4/1  NA/1  r2    Exec    Result available @ cycle 6
add   r3/1  r2/0  r3    Wait    Wait for r2
Cycle 3 (same loop as on the previous slides)
RAV:  r1=0  r2=0  r3=0  r4=1
op    src1  src2  tgt   status  note
add   r4/1  NA/1  r4    Cmtd
ld    r4/1  NA/1  r2    Exec    Result available @ cycle 6
add   r3/1  r2/0  r3    Wait    Wait for r2
sub   r1/1  NA/1  r1    Rdy     No dependences
Cycle 4 (same loop as on the previous slides)
RAV:  r1=1  r2=0  r3=0  r4=1
op    src1  src2  tgt   status  note
add   r4/1  NA/1  r4    Cmtd
ld    r4/1  NA/1  r2    Exec    Result available @ cycle 6
add   r3/1  r2/0  r3    Wait    Wait for r2
sub   r1/1  NA/1  r1    Exec    r1 produced now; notify consumers
bne   r1/1  r0/1  NA    Rdy     r1 will be available next cycle
Cycle 5 (same loop as on the previous slides)
RAV:  r1=1  r2=0  r3=0  r4=1
op    src1  src2  tgt   status  note
add   r4/1  NA/1  r4    Cmtd
ld    r4/1  NA/1  r2    Exec    Result available @ cycle 6
add   r3/1  r2/0  r3    Wait    Wait for r2
sub   r1/1  NA/1  r1    Compl   Completed
bne   r1/1  r0/1  NA    Exec    Executing
Cycle 6 (same loop as on the previous slides)
RAV:  r1=1  r2=1  r3=0  r4=1
op    src1  src2  tgt   status  note
add   r4/1  NA/1  r4    Cmtd
ld    r4/1  NA/1  r2    Exec    Result available @ cycle 6; notify consumers
add   r3/1  r2/1  r3    Rdy     r2 is now available
sub   r1/1  NA/1  r1    Compl   Completed
bne   r1/1  r0/1  NA    Exec    Executing
Cycle 7 (same loop as on the previous slides)
RAV:  r1=1  r2=1  r3=1  r4=1
op    src1  src2  tgt   status  note
add   r4/1  NA/1  r4    Cmtd
ld    r4/1  NA/1  r2    Cmtd
add   r3/1  r2/1  r3    Exec    Executing; notify consumers
sub   r1/1  NA/1  r1    Compl
bne   r1/1  r0/1  NA    Compl   Completed
Cycle 8 (same loop as on the previous slides)
RAV:  r1=1  r2=1  r3=1  r4=1
op    src1  src2  tgt   status
add   r4/1  NA/1  r4    Cmtd
ld    r4/1  NA/1  r2    Cmtd
add   r3/1  r2/1  r3    Cmtd
sub   r1/1  NA/1  r1    Cmtd
bne   r1/1  r0/1  NA    Cmtd
Notifying Consumers
• Identity of Producer
  • Uniquely identifies the instruction
  • Easily retrievable @ decode by others
• Target Register
  • Recall we stall on WAR or WAW
• Functional Unit
  • If not pipelined
• Place in instruction window
• PC? No. Why not?
Name Dependences and OOO
• WAW or WAR: we need to update a register but others are still using it
  • add r1, r1, 10
  • sw r1, 20(r2)
  • add r1, r3, 30
  • sub r2, r1, 40
• There is only one r1
  • sw needs to see the value of 1st add
  • sub needs to wait for 2nd add and not 1st
• Solution: Stall decode when WAW or WAR
Detecting WAW and WAR
• WAW? Look at Scoreboard
  • If bit is 0 then there is a pending write
  • Stall
• WAR? Need to know whether all preceding consumers have read the value
  • Keep a count per register
  • Increase at decode for all reads
  • Decrease on issue
• More elegant solution via register renaming
  • Soon
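Both checks fit in one small structure: the scoreboard bit answers WAW, the per-register reader count answers WAR. A sketch under stated assumptions: `NameHazardDetector` and its methods are illustrative names, RAW stalls are left out, and the instruction sequence is the add / sw / add example from the Name Dependences slide.

```python
class NameHazardDetector:
    def __init__(self, names):
        self.ready = {r: 1 for r in names}          # 0 = write pending
        self.pending_reads = {r: 0 for r in names}  # decoded but not yet issued readers

    def can_decode(self, target):
        if target is None:
            return True
        waw = self.ready[target] == 0          # an earlier write is still pending
        war = self.pending_reads[target] > 0   # earlier readers have not read yet
        return not (waw or war)

    def decode(self, sources, target):
        for s in sources:
            self.pending_reads[s] += 1         # increase at decode for all reads
        if target is not None:
            self.ready[target] = 0

    def issue(self, sources):
        for s in sources:
            self.pending_reads[s] -= 1         # decrease on issue

    def writeback(self, target):
        self.ready[target] = 1


d = NameHazardDetector(["r1", "r2", "r3"])
d.decode(["r1"], "r1")          # add r1, r1, 10
d.issue(["r1"])                 # the first add reads r1 and starts executing
d.decode(["r1", "r2"], None)    # sw r1, 20(r2) waits for the first add
print(d.can_decode("r1"))       # add r1, r3, 30: blocked (WAW, and WAR with sw)
d.writeback("r1")               # first add writes r1
print(d.can_decode("r1"))       # still blocked: sw has not read r1 yet (WAR)
d.issue(["r1", "r2"])           # sw reads its operands
print(d.can_decode("r1"))       # now the second add may decode
```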
Window vs. Scheduler
• Window
  • Distance between oldest and youngest instruction that can co-exist inside the CPU
  • Larger window: potential for more ILP
  • Instructions enter at Fetch
  • Exit at Commit
• Scheduler
  • Number of instructions that are waiting to be issued
  • Instructions enter at Decode
  • Leave at writeback
• Window >= Scheduler
  • Can be the same structure
  • In window but not in scheduler: completed instructions
Scoreboarding
• Schedule based on RAW dependences
• WAW and WAR cause stalls
  • WAW at decode
  • WAR at writeback
    • Optimization: Why is this OK?
• Implemented in the CDC 6600 in ‘64
• 18 non-pipelined FUs
  • 4 FP: 2 mul, 1 add, 1 div
  • 7 MEM: 5 load, 2 store
  • 7 INT: add, shift, logical etc.
• Centralized Control Scheme
  • Controls all Instruction Issue
  • Detects all hazards
MIPS/DLX w/ Scoreboarding
[Figure: Register File and functional units (FP mul, FP mul, FP divide, FP add, integer) under control of the scoreboard]
Scoreboarding Overview
• Ignore IF and MEM for simplicity
• 4-stage execution
  • Issue: check for structural hazards; check for WAW hazards; stall until all clear
  • ReadOp: check for RAW hazards; wait until all operands ready; read registers
  • Execute: execute operation; notify scoreboard when complete
  • Write: check for WAR hazards; stall write until all clear
• A completing instruction cannot write dest if an earlier instruction has not read dest.
Scoreboarding Optimizations/Tricks
• WAW as in original OOO
• WAR is optimized
  • Second producer is allowed to execute up to complete
  • It is stalled there until preceding consumers complete
• No Commit
  • No precise interrupts
• Window is implemented in the scoreboard
  • One entry per Functional Unit
  • Recall not pipelined
  • Instructions identified by FU id
Scoreboarding Organization
• Three structures
  • Instruction Status
  • Functional Unit Status
  • Register Result Status
• Instruction Status
  • Which stage the instruction is currently in
• Functional Unit Status: scheduling
  • Busy
  • OP: Operation
  • Fi: Dest. Reg.
  • Fj, Fk: Source Regs
  • Qj, Qk: FUs producing sources
  • Rj, Rk: Ready bits for sources
• Register Result Status: dep. determination
  • Which FU will produce a register
Scoreboarding explained
• Register status reg:
  • Which FU produces the register
  • Use at decode
  • Source reg match is a RAW
  • Target reg match is a WAW: stall
Functional Unit Status
• Busy:
  • Resource allocation
• OP:
  • What to do once issued (e.g., add, sub)
• Dest. Reg.:
  • Where to write result
  • To find WAR
• Fj, Fk: Source Regs
  • For WAR: can’t write if consumers pending for previous value of register (if FU not the same)
• Qj, Qk: FUs producing sources
  • To wait for appropriate producer
• Rj, Rk: Ready bits for sources
  • To determine when ready: all ready
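The Functional Unit Status fields can be written down as a record, with the Issue and Write checks as small predicates. A minimal sketch, assuming the textbook convention that Rj/Rk are cleared once the operands have been read; `FUStatus`, `can_issue`, `can_write` and the unit and register names in the usage are illustrative, not taken from the slides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FUStatus:
    busy: bool = False
    op: Optional[str] = None
    fi: Optional[str] = None   # destination register
    fj: Optional[str] = None   # source registers
    fk: Optional[str] = None
    qj: Optional[str] = None   # FUs producing the sources (None: no pending producer)
    qk: Optional[str] = None
    rj: bool = False           # source ready and not yet read (assumed convention)
    rk: bool = False


def can_issue(fus, result_status, unit, dest):
    # Issue stage: the unit must be free (structural hazard) and no active
    # instruction may already target dest (WAW).
    return not fus[unit].busy and result_status.get(dest) is None


def can_write(fus, unit):
    # Write stage: stall while another unit still has to read the old
    # value of our destination register (WAR).
    dest = fus[unit].fi
    return not any(
        (f.fj == dest and f.rj) or (f.fk == dest and f.rk)
        for name, f in fus.items() if name != unit
    )


fus = {
    "Mult1": FUStatus(True, "MULF", fi="F0", fj="F2", fk="F4"),
    "Mult2": FUStatus(),
    "Add": FUStatus(True, "ADDF", fi="F6", fj="F8", fk="F2"),
    "Divide": FUStatus(True, "DIVF", fi="F10", fj="F0", fk="F6", qj="Mult1", rk=True),
}
result_status = {"F0": "Mult1", "F6": "Add", "F10": "Divide"}

print(can_issue(fus, result_status, "Mult2", "F6"))   # WAW on F6
print(can_issue(fus, result_status, "Mult2", "F12"))  # free unit, no pending write
print(can_write(fus, "Add"))     # Divide has not read F6 yet
fus["Divide"].rj = fus["Divide"].rk = False            # Divide has read its operands
print(can_write(fus, "Add"))
```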
Scoreboarding Example
Example: Cycle 0
Example, contd.
• The rest you’ll find on the web site
  • Go through it
  • Source: Patterson
• Summary:
  • Execution proceeds in an order dictated by dependences
  • RAW, WAR and WAW force ordering
  • Tricks may be possible
Beyond Simple OoO
A: LF   F6, 34(R2)
B: LF   F2, 45(R3)
C: MULF F0, F2, F4
D: SUBF F8, F2, F6
E: ADDF F2, F7, F4
[Figure: dependence graph over A, B, C, D, E]
• E will wait for B, C and D.
  • WAR w/ C and D
  • WAW w/ B
• Can we do better?
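The dependences listed for E can be checked mechanically from the register names. The helper below is a hypothetical sketch: it compares every pair of instructions, so it over-reports when an intervening instruction rewrites the same register (that does not happen in this example).

```python
def dependences(program):
    """program: list of (label, target, sources) in program order.

    Returns (later, earlier, type) triples.
    """
    found = []
    for j, (later, tgt_j, srcs_j) in enumerate(program):
        for earlier, tgt_i, srcs_i in program[:j]:
            if tgt_i is not None and tgt_i in srcs_j:
                found.append((later, earlier, "RAW"))   # later reads what earlier writes
            if tgt_j is not None and tgt_j in srcs_i:
                found.append((later, earlier, "WAR"))   # later overwrites what earlier reads
            if tgt_j is not None and tgt_j == tgt_i:
                found.append((later, earlier, "WAW"))   # both write the same register
    return found


program = [
    ("A", "F6", ["R2"]),          # LF   F6, 34(R2)
    ("B", "F2", ["R3"]),          # LF   F2, 45(R3)
    ("C", "F0", ["F2", "F4"]),    # MULF F0, F2, F4
    ("D", "F8", ["F2", "F6"]),    # SUBF F8, F2, F6
    ("E", "F2", ["F7", "F4"]),    # ADDF F2, F7, F4
]
for dep in dependences(program):
    print(dep)
```

All of E's entries are name dependences (WAR or WAW); E has no RAW dependence on any earlier instruction.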
What if we had infinite registers
Before:
A: LF   F6, 34(R2)
B: LF   F2, 45(R3)
C: MULF F0, F2, F4
D: SUBF F8, F2, F6
E: ADDF F2, F7, F4
After (E writes F9 instead of reusing F2):
A: LF   F6, 34(R2)
B: LF   F2, 45(R3)
C: MULF F0, F2, F4
D: SUBF F8, F2, F6
E: ADDF F9, F7, F4
• No false dependences anymore
• Since we do not reuse a name we can’t have WAW and WAR
Why we can’t have Infinite Registers
• False/Name dependences (WAR and WAW)
  • Artifact of having finite registers
• There is no such thing as infinite
• There is no such thing as large enough
  • Well there is (in a sec.)
  • Computers execute billions of instructions per sec. Even a multi-billion register file would soon be exhausted
• Want to exploit parallelism across several instances of the same code
  • Loops, recursive functions (most frequent part)
Yes, there is “large enough”
• At any given point there will be a finite number of instructions in the window
  • If each instruction has a single register target
  • If there are N instructions
• How many registers do we need?
  • N?
  • N + X?
Register Renaming
• Register Version
  • Every Write creates a new version
  • Uses read the last version
  • Need to keep a version until all uses have read it.
• Register Renaming:
  • Architectural vs. Physical Registers
  • More phys. than arch.
  • Maintain a map of arch. to phys. regs.
  • Use in-order decoding to properly identify dependences.
  • Instructions wait only for input op. availability.
  • Only last version is written to reg. file.
Register Renaming
   Instruction           Renamed (tgt, src1, src2)
A: DIVF F3, F1, F0       r1, -, -
B: SUBF F2, F1, F0       r2, -, -
C: MULF F0, F2, F4       r3, r2, -
D: SUBF F6, F2, F3       r4, r2, r1
E: ADDF F2, F5, F4       r5, -, -
F: ADDF F0, F0, F2       r6, r3, r5
• Need more physical registers than architectural
• Ignore control flow for the time being.
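The renamed column can be reproduced with a map from architectural register to its most recent producer. A minimal sketch: the function name `rename` is illustrative, a simple counter stands in for the free list, and "-" marks a source with no in-flight producer, as on the slide.

```python
def rename(program):
    """program: list of (label, target, src1, src2) using architectural names."""
    rename_table = {}          # architectural register -> physical register
    next_free = 1              # stand-in for a free list: r1, r2, ...
    renamed = []
    for label, tgt, s1, s2 in program:
        # Sources first: look up the most recent producer of each source.
        p1 = rename_table.get(s1, "-")
        p2 = rename_table.get(s2, "-")
        # Then declare this instruction the newest producer of its target.
        ptgt = f"r{next_free}"
        next_free += 1
        rename_table[tgt] = ptgt
        renamed.append((label, ptgt, p1, p2))
    return renamed


program = [
    ("A", "F3", "F1", "F0"),   # DIVF F3, F1, F0
    ("B", "F2", "F1", "F0"),   # SUBF F2, F1, F0
    ("C", "F0", "F2", "F4"),   # MULF F0, F2, F4
    ("D", "F6", "F2", "F3"),   # SUBF F6, F2, F3
    ("E", "F2", "F5", "F4"),   # ADDF F2, F5, F4
    ("F", "F0", "F0", "F2"),   # ADDF F0, F0, F2
]
for row in rename(program):
    print(row)
```

Looking up the sources before updating the target matters for F, which reads the old F0 (r3) and writes a new version of it (r6).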
Register Renaming Process
• Only need to remember last producer of each architectural register
  • Vector
• At decode
  • Find the most recent producers for all source registers
  • After: declare self as most recent producer of target register
• Complication:
  • May have to retract
    • Speculative Execution, e.g., interrupts
  • Need to be able to restore the mapping state
Register Renaming Support Structures
• Register Rename Table
  • f(aR) = pR
  • One entry per architectural register
• Free Register List
  • Lists unused physical registers
• At Decode
  • Grab a new register from the free list
  • Change mapping in rename table
• At Commit
  • Release Register? No… Why not?
  • Could release previous version
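The two structures and the decode/commit actions can be sketched together. This is one possible arrangement, not necessarily the one the course adopts: `Renamer` is an illustrative name, physical registers are plain integers, and each architectural register is assumed to start out mapped to a physical register of its own.

```python
from collections import deque


class Renamer:
    def __init__(self, arch_regs, num_phys):
        # Rename table f(aR) = pR, one entry per architectural register.
        self.table = {a: p for p, a in enumerate(arch_regs)}
        # Free list: the physical registers not used by the initial mapping.
        self.free = deque(range(len(arch_regs), num_phys))

    def decode(self, target):
        if not self.free:
            return None                      # no free register: decode must stall
        new = self.free.popleft()            # grab a new register from the free list
        previous = self.table[target]
        self.table[target] = new             # change mapping in rename table
        return new, previous                 # previous version is remembered for commit

    def commit(self, previous):
        # The register written by the committing instruction still holds a
        # live value, so only the previous version of the target is released.
        self.free.append(previous)


r = Renamer(["F0", "F2"], num_phys=3)
first = r.decode("F2")
print(first)              # (new physical register, previous version)
print(r.decode("F0"))     # free list is empty
r.commit(first[1])        # the F2 writer commits and frees the old F2
print(r.decode("F0"))
```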
How Many Physical Registers?
• Correctness:
  • At least as many as architectural, plus?
• Performance:
  • As many as possible
  • Not correctness
  • Recall not all instructions produce register results
    • Stores and branches
Dynamic Scheduling
   Instruction           Renamed (tgt, src1, src2)
A: DIVF F3, F1, F0       r1, -, -
B: SUBF F2, F1, F0       r2, -, -
C: MULF F0, F2, F4       r3, r2, -
D: SUBF F6, F2, F3       r4, r2, r1
E: ADDF F2, F5, F4       r5, -, -
F: ADDF F0, F0, F2       r6, r3, r5
[Figure labels: Name, Value]
• Values and Names flow together
• Writeback specifies both value and name
• A waiting instruction inspects all results
• It is allowed to execute when all inputs are available
Physical Registers
• Physical register file is just one option
• What we need is separate storage
• Consumers could keep values in their reservation station
• Tomasulo’s next