Lecture slides from ECE1773 (Fall ‘07, ECE Toronto) on how out-of-order execution can improve performance and on the issues it raises: maintaining sequential semantics and scheduling. The slides also cover register renaming and the phases of instruction execution.
Out-of-Order Execution: Scheduling
ECE1773 - Fall ‘07, ECE Toronto

Instruction Level Parallel Processing
• Sequential Execution Semantics
• Out-of-Order Execution
  • How it can help
  • Issues:
    • Maintaining Sequential Semantics
    • Scheduling
      • Scoreboard
      • Register Renaming
• Initially, we’ll focus on registers; memory later on

Sequential Semantics - Review
• Instructions appear as if they executed:
  • In the order they appear in the program
  • One after the other
[Figure labels: Program Order, Pipelining, Superscalar, Out-of-Order]

Out-of-Order Execution
    do { sum += a[++m]; i--; } while (i != 0);

    loop: add r4, r4, 1
          ld  r2, 10(r4)
          add r3, r3, r2
          sub r1, r1, 1
          bne r1, r0, loop
[Figure: fetch/decode/execute timelines for add, ld, add, sub, bne; panels: Superscalar, out-of-order]

Sequential Semantics?
• Execution does NOT adhere to sequential semantics
  • To be precise: eventually it may
• Simplest solution: define the problem away
  • Not acceptable today: e.g., virtual memory
• Three-phase instruction execution
  • In-Progress, Completed and Committed
[Figure: fetch/decode/execute timelines for add, ld, add, sub, bne; labels: inconsistent, consistent]

Out-of-order Execution Issues
• Preserving Sequential Semantics
• Stalling Instructions w/ dependences
• Issuing Instructions when dependences are satisfied

Back to Sequential Semantics
• Instr. exec. in 3 phases:
  • In-progress, Completed, Committed
  • OOO for In-progress and Completed
  • In-order Commits
• Completed - out-of-order: “Visible only inside”
  • Results visible to subsequent instructions
  • Results not visible to outsiders
  • On interrupts completed results are discarded
• Committed - in-order: “Visible to all”
  • Results visible to subsequent instructions
  • Results visible to outsiders
  • On interrupt committed results are preserved

How Completes Help w/ Performance
    DIV R3, _, _
    ADD R1, _, _
    ADD _, R1, _
[Figure: timelines over time comparing in-order completes against out-of-order completes with in-order commits]

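The split between completing and committing can be sketched in a few lines. This is only an illustration under assumptions: the `window` list, its dictionary fields and the function name `commit_in_order` are made up for this sketch, and the instruction names mirror the DIV/ADD/ADD example on the slide.

```python
def commit_in_order(window):
    """Commit completed instructions, starting from the oldest.

    `window` is a list of dicts in program order, each with a "name" and a
    "status" of "In-Progress", "Completed" or "Committed".
    Returns the names committed by this call.
    """
    committed = []
    for inst in window:
        if inst["status"] == "Committed":
            continue                  # already visible to outsiders
        if inst["status"] != "Completed":
            break                     # oldest unfinished instruction blocks younger ones
        inst["status"] = "Committed"
        committed.append(inst["name"])
    return committed


window = [
    {"name": "DIV R3", "status": "In-Progress"},  # long latency, still executing
    {"name": "ADD R1", "status": "Completed"},    # completed out of order
    {"name": "ADD _", "status": "Completed"},     # consumed R1 before DIV finished
]
first = commit_in_order(window)     # DIV is not done, so nothing may commit
window[0]["status"] = "Completed"
second = commit_in_order(window)    # now all three commit in program order
```

The two ADDs finish early and can feed each other, but nothing becomes visible to outsiders until the older DIV has completed.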
Implementing Completes/Commits
• Key idea:
  • Maintain sufficient state around to be able to roll back when necessary
• Roll-back:
  • Discard (aka squash) all instructions that have not committed
• One solution (conceptual):
  • Upon Complete, the instruction records the previous value of its target register
  • Upon Discard, the instruction restores the target value
  • Upon Commit, nothing to do
• We will return to this shortly
  • Focus on scheduling mechanisms

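A minimal sketch of the conceptual roll-back scheme above, assuming a register file modelled as a Python dict and instructions identified by a sequence number. The class name `UndoLog` and its methods are hypothetical, and the sketch assumes at most one uncommitted write per register at a time or, failing that, that writes to the same register complete in program order.

```python
class UndoLog:
    """Record previous register values on Complete so they can be restored."""

    def __init__(self, regs):
        self.regs = regs      # register name -> current value
        self.saved = {}       # seq -> (target, previous value), in completion order

    def complete(self, seq, target, value):
        # Upon Complete: remember the old value, then update the register.
        self.saved[seq] = (target, self.regs[target])
        self.regs[target] = value

    def commit(self, seq):
        # Upon Commit: nothing to restore, so just drop the saved value.
        del self.saved[seq]

    def squash(self):
        # Upon Discard: undo every uncommitted write, newest completion first.
        for seq in reversed(list(self.saved)):
            target, previous = self.saved[seq]
            self.regs[target] = previous
        self.saved.clear()


regs = {"r1": 5, "r2": 7}
log = UndoLog(regs)
log.complete(0, "r1", 10)   # older instruction completes
log.complete(1, "r2", 20)   # younger instruction completes
log.commit(0)               # only the older one commits
log.squash()                # interrupt: the younger result is discarded
```

After the squash, `r1` keeps the committed value 10 while `r2` is back to 7.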
Out-of-Order Execution Overview
[Figure: columns "Program Form" and "Processing Phase". Program form labels: static program, dynamic inst. stream (trace), execution window, completed instructions. Processing phase labels: dispatch/dependences, inst. issue, inst. execution, inst. reorder & commit. Phases: In-Progress, Completed, Committed]

Out-of-Order Execution: Stages
• Fetch: get instruction from memory
• Decode/Dispatch: what is it? What are the dependences?
• Issue: go, all dependences satisfied
• Execute: perform operation
• Complete: result available to other insts.
• Commit: result available to outsiders
• We’ll start w/ Decode/Dispatch
• Then we’ll consider Issue

OOO Scheduling
• Instruction @ Decode:
  • Do I have dependences yet to be satisfied?
    • Yes: stall until they are
    • No: clear to issue
• Wakeup Instructions Stalled:
  • Dependences satisfied
  • Allow instruction to issue
• Dependence:
  • (later instruction, earlier instruction) & type
• We’ll first consider RAW and then move on to WAW and WAR

Stalling @ Decode for RAW
• Are there unsatisfied dependences?
• RAW: have to wait for register value
  • We don’t really care who is producing the value
  • Only whether it is available
• Can use the Register Availability Vector as in pipelining/superscalar
  • Also known as scoreboard
• At decode
  • Check the bits of all source regs: if any is 0, stall
  • Reset the bit corresponding to your target
• At writeback
  • Set the bit of your target

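The Register Availability Vector described above can be sketched as one bit per register. This is an illustrative model only; the class name `RegisterAvailabilityVector` and its method names are assumptions, and the sources are checked before the target bit is reset so that an instruction such as add r4, r4, 4 does not stall on itself.

```python
class RegisterAvailabilityVector:
    """One availability bit per register: 1 = value available, 0 = write pending."""

    def __init__(self, regs):
        self.avail = {r: 1 for r in regs}

    def decode(self, srcs, tgt):
        """Return True if the instruction may proceed, False if it must stall."""
        if any(self.avail[s] == 0 for s in srcs):
            return False               # RAW: some source is not available yet
        if tgt is not None:
            self.avail[tgt] = 0        # our result is now pending
        return True

    def writeback(self, tgt):
        self.avail[tgt] = 1            # value produced


rav = RegisterAvailabilityVector(["r1", "r2", "r3", "r4"])
add_ok = rav.decode(["r4"], "r4")      # add r4, r4, 4: proceeds, r4 becomes pending
ld_stalled = rav.decode(["r4"], "r2")  # ld r2, 10(r4): r4 not available, stall
rav.writeback("r4")                    # the add writes back
ld_ok = rav.decode(["r4"], "r2")       # retry: the load can now proceed
```

The vector only records whether a value is available, not who produces it, which is all a RAW stall needs.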
Issuing Instructions: Scheduling
• Determine when an instruction can issue
  • Ignore resources for the time being
  • Stalled because of RAW w/ preceding instruction
• Concept:
  • Producer (write) notifies consumers (read)
• Requirements:
  • Consumers need to be able to identify the producer
  • The register name is one possible link
• Mechanism
  • Consumer is placed in a reservation station
  • Producer broadcasts its identity on complete
  • Waiting instructions observe
    • Update operand availability
    • Issue if all operands are now available

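The wakeup mechanism above might look roughly like this, using the register name as the link between producer and consumers. The `ReservationStation` fields and the `broadcast` helper are assumptions made for this sketch, not the structures of any particular machine.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ReservationStation:
    op: str
    src_ready: Dict[str, bool]   # source register -> operand available?
    tgt: Optional[str]
    status: str = "Wait"

    def update_status(self):
        # A waiting instruction becomes ready once all operands are available.
        if self.status == "Wait" and all(self.src_ready.values()):
            self.status = "Rdy"


def broadcast(stations: List[ReservationStation], produced_reg: str) -> List[str]:
    """Producer announces its target register; return the ops that woke up."""
    woken = []
    for rs in stations:
        if rs.status == "Wait" and produced_reg in rs.src_ready:
            rs.src_ready[produced_reg] = True   # update operand availability
            rs.update_status()
            if rs.status == "Rdy":
                woken.append(rs.op)
    return woken


stations = [
    ReservationStation("add r3, r3, r2", {"r3": True, "r2": False}, "r3"),
    ReservationStation("bne r1, r0, loop", {"r1": True, "r0": True}, None),
]
stations[1].update_status()             # no pending operands: ready at once
woken_r4 = broadcast(stations, "r4")    # nobody is waiting for r4
woken_r2 = broadcast(stations, "r2")    # the load completes: the add wakes up
```

Every waiting station observes every broadcast, which is why the link (here the register name) must identify the producer unambiguously.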
Reservation Station
• State pertaining to an instruction
  • What registers it reads
    • Whether they are available
  • What is the destination register
  • What state is the instruction in
    • Waiting
    • Executing

Out-Of-Order Exec. Example
    loop: add r4, r4, 4
          ld  r2, 10(r4)      (4 cycles lat)
          add r3, r3, r2
          sub r1, r1, 1
          bne r1, r0, loop
Cycle 0
RAV:  r1=1  r2=1  r3=1  r4=1
op    src1   src2   tgt    status

Out-Of-Order Exec. Example: Cycle 0
    loop: add r4, r4, 4
          ld  r2, 10(r4)      (5 cycles lat)
          add r3, r3, r2
          sub r1, r1, 1
          bne r1, r0, loop
RAV:  r1=1  r2=1  r3=1  r4=0
op    src1   src2   tgt    status
add   r4/1   NA/1   r4/0   Rdy      Ready to be executed

Cycle 1
RAV:  r1=1  r2=0  r3=1  r4=1
op    src1   src2   tgt    status
add   r4/1   NA/1   r4     Exec     R4 gets produced now; notify those waiting for R4
ld    r4/1   NA/1   r2     Rdy

Cycle 2
RAV:  r1=1  r2=0  r3=0  r4=1
op    src1   src2   tgt    status
add   r4/1   NA/1   r4     Cmtd
ld    r4/1   NA/1   r2     Exec     Result available @ cycle 6
add   r3/1   r2/0   r3     Wait     Wait for r2

Cycle 3
RAV:  r1=0  r2=0  r3=0  r4=1
op    src1   src2   tgt    status
add   r4/1   NA/1   r4     Cmtd
ld    r4/1   NA/1   r2     Exec     Result available @ cycle 6
add   r3/1   r2/0   r3     Wait     Wait for r2
sub   r1/1   NA/1   r1     Rdy      No dependences

Cycle 4
RAV:  r1=1  r2=0  r3=0  r4=1
op    src1   src2   tgt    status
add   r4/1   NA/1   r4     Cmtd
ld    r4/1   NA/1   r2     Exec     Result available @ cycle 6
add   r3/1   r2/0   r3     Wait     Wait for r2
sub   r1/1   NA/1   r1     Exec     r1 produced now; notify consumers
bne   r1/1   r0/1   NA     Rdy      r1 will be available next cycle

Cycle 5
RAV:  r1=1  r2=0  r3=0  r4=1
op    src1   src2   tgt    status
add   r4/1   NA/1   r4     Cmtd
ld    r4/1   NA/1   r2     Exec     Result available @ cycle 6
add   r3/1   r2/0   r3     Wait     Wait for r2
sub   r1/1   NA/1   r1     Compl    Completed
bne   r1/1   r0/1   NA     Exec     Executing

Cycle 6
RAV:  r1=1  r2=1  r3=0  r4=1
op    src1   src2   tgt    status
add   r4/1   NA/1   r4     Cmtd
ld    r4/1   NA/1   r2     Exec     Result available @ cycle 6; notify consumers
add   r3/1   r2/1   r3     Rdy
sub   r1/1   NA/1   r1     Compl    Completed
bne   r1/1   r0/1   NA     Exec     Executing

Cycle 7
RAV:  r1=1  r2=1  r3=1  r4=1
Notify consumers
op    src1   src2   tgt    status
add   r4/1   NA/1   r4     Cmtd
ld    r4/1   NA/1   r2     Cmtd
add   r3/1   r2/1   r3     Exec     Executing
sub   r1/1   NA/1   r1     Compl
bne   r1/1   r0/1   NA     Compl    Completed

Cycle 8
RAV:  r1=1  r2=1  r3=1  r4=1
op    src1   src2   tgt    status
add   r4/1   NA/1   r4     Cmtd
ld    r4/1   NA/1   r2     Cmtd
add   r3/1   r2/1   r3     Cmtd
sub   r1/1   NA/1   r1     Cmtd
bne   r1/1   r0/1   NA     Cmtd

Notifying Consumers
• Identity of Producer
  • Uniquely identify the instruction
  • Easily retrievable @ decode by others
• Target Register
  • Recall we stall on WAR or WAW
• Functional Unit
  • If not pipelined
• Place in instruction window
• PC? No. Why not?

Name Dependences and OOO
• WAW or WAR: we need to update a register but others are still using it
    add r1, r1, 10
    sw  r1, 20(r2)
    add r1, r3, 30
    sub r2, r1, 40
• There is only one r1
  • sw needs to see the value of 1st add
  • sub needs to wait for 2nd add and not 1st
• Solution: stall decode when WAW or WAR

Detecting WAW and WAR
• WAW? Look at Scoreboard
  • If bit is 0 then there is a pending write
  • Stall
• WAR? Need to know whether all preceding consumers have read the value
  • Keep a count per register
  • Increase at decode for all reads
  • Decrease on issue
• More elegant solution via register renaming
  • Soon

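The two checks above can be combined in one small model: a pending-write bit per register for WAW and a pending-read count per register for WAR. The class `NameHazardDetector` and its methods are hypothetical names for this sketch; RAW handling is deliberately left out, and the example replays the add/sw/add sequence from the name-dependence slide.

```python
class NameHazardDetector:
    """Detect WAW (pending write) and WAR (pending reads) hazards at decode."""

    def __init__(self, regs):
        self.avail = {r: 1 for r in regs}          # 0 = a write is pending
        self.pending_reads = {r: 0 for r in regs}  # decoded but not yet issued readers

    def decode(self, srcs, tgt):
        """Return None if the instruction may proceed, else "WAW" or "WAR"."""
        if tgt is not None:
            if self.avail[tgt] == 0:
                return "WAW"                       # an earlier write is still pending
            if self.pending_reads[tgt] > 0:
                return "WAR"                       # earlier readers have not read yet
        for s in srcs:
            self.pending_reads[s] += 1             # increase at decode for all reads
        if tgt is not None:
            self.avail[tgt] = 0
        return None

    def issue(self, srcs):
        for s in srcs:
            self.pending_reads[s] -= 1             # decrease on issue

    def writeback(self, tgt):
        self.avail[tgt] = 1


d = NameHazardDetector(["r1", "r2", "r3"])
h1 = d.decode(["r1"], "r1")         # add r1, r1, 10
h2 = d.decode(["r1", "r2"], None)   # sw  r1, 20(r2)
h3 = d.decode(["r3"], "r1")         # add r1, r3, 30: first add still pending
d.issue(["r1"])
d.writeback("r1")                   # first add issues and writes r1
h4 = d.decode(["r3"], "r1")         # retry: sw has not read r1 yet
d.issue(["r1", "r2"])               # sw issues and reads r1
h5 = d.decode(["r3"], "r1")         # retry: no hazard left
```

The hazard check runs before the instruction's own reads are counted, so an instruction that reads and writes the same register does not block itself.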
Window vs. Scheduler
• Window
  • Distance between oldest and youngest instruction that can co-exist inside the CPU
  • Larger window: potential for more ILP
• Scheduler
  • Number of instructions that are waiting to be issued
• Window
  • Instructions enter at Fetch
  • Exit at Commit
• Scheduler
  • Instructions enter at Decode
  • Leave at writeback
• Window >= Scheduler
  • Can be the same structure
  • In window but not in scheduler: completed instructions

Scoreboarding
• Schedule based on RAW dependences
• WAW and WAR cause stalls
  • WAW at decode
  • WAR at writeback
    • Optimization: Why is this OK?
• Implemented in the CDC 6600 in ‘64
  • 18 non-pipelined FUs
    • 4 FP: 2 mul, 1 add, 1 div
    • 7 MEM: 5 load, 2 store
    • 7 INT: add, shift, logical etc.
• Centralized Control Scheme
  • Controls all Instruction Issue
  • Detects all hazards

MIPS/DLX w/ Scoreboarding
[Figure: register file and functional units (FP mul, FP mul, FP divide, FP add, integer) under the control of the scoreboard]

Scoreboarding Overview
• Ignore IF and MEM for simplicity
• 4-stage execution
  • Issue
    • Check for structural hazards
    • Check for WAW hazards
    • Stall until all clear
  • ReadOp
    • Check for RAW hazards
    • Wait until all operands ready
    • Read Registers
  • Execute
    • Execute Operations
    • Notify scoreboard when complete
  • Write
    • Check for WAR hazards
    • Stall Write until all clear
• A completing instruction cannot write dest if an earlier instruction has not read dest.

Scoreboarding Optimizations/Tricks
• WAW as in original OOO
• WAR is optimized
  • Second Producer is allowed to execute up to complete
  • It is stalled there until preceding consumers complete
• No Commit
  • No precise interrupts
• Window is implemented in the scoreboard
  • One entry per Functional Unit
  • Recall not pipelined
  • Instructions identified by FU id

Scoreboarding Organization
• Three structures
  • Instruction Status
  • Functional Unit Status
  • Register Result Status
• Instruction Status
  • Which stage the instruction is currently in
• Functional Unit Status: scheduling
  • Busy
  • OP: Operation
  • Fi: Dest. Reg.
  • Fj, Fk: Source Regs
  • Qj, Qk: FUs producing sources
  • Rj, Rk: Ready bits for sources
• Register Result Status: dep. determination
  • Which FU will produce a register

Scoreboarding explained
• Register status reg:
  • Which FU produces the register
  • Use at decode
    • Source reg match is a RAW
    • Target reg match is a WAW stall

Functional Unit Status
• Busy:
  • Resource allocation
• OP:
  • What to do once issued (e.g., add, sub)
• Dest. Reg.:
  • Where to write result
  • To find WAR
• Fj, Fk: Source Regs
  • For WAR: can’t write if consumers pending for previous value of register (if FU not the same)
• Qj, Qk: FUs producing sources
  • To wait for appropriate producer
• Rj, Rk: Ready bits for sources
  • To determine when ready: all ready

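One way to picture how these fields are used is the sketch below. It assumes the common textbook convention that a ready bit is cleared once the operand has been read; the names `FUStatus`, `issue`, `read_operands` and `can_write`, the FU names and the example instructions are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class FUStatus:
    busy: bool = False
    op: Optional[str] = None
    fi: Optional[str] = None   # destination register
    fj: Optional[str] = None   # source registers
    fk: Optional[str] = None
    qj: Optional[str] = None   # FUs producing the sources (None = no producer pending)
    qk: Optional[str] = None
    rj: bool = False           # source ready and not yet read
    rk: bool = False


def issue(fus: Dict[str, FUStatus], result: Dict[str, str], fu, op, fi, fj, fk):
    """Issue stage: stall on a structural hazard (FU busy) or WAW (fi pending)."""
    if fus[fu].busy or fi in result:
        return False
    qj, qk = result.get(fj), result.get(fk)
    fus[fu] = FUStatus(True, op, fi, fj, fk, qj, qk, qj is None, qk is None)
    result[fi] = fu            # register result status: this FU will produce fi
    return True


def read_operands(fus, fu):
    """ReadOp stage: proceed only when both sources are ready, then clear the bits."""
    s = fus[fu]
    if not (s.rj and s.rk):
        return False           # RAW: wait for the producers
    s.rj = s.rk = False
    return True


def can_write(fus, fu):
    """Write stage: WAR check against every other busy FU that has yet to read fi."""
    fi = fus[fu].fi
    return not any(
        o.busy and ((o.fj == fi and o.rj) or (o.fk == fi and o.rk))
        for name, o in fus.items() if name != fu
    )


fus = {name: FUStatus() for name in ("div", "add", "mul")}
result: Dict[str, str] = {}
ok_div = issue(fus, result, "div", "DIVF", "F10", "F0", "F6")
ok_add = issue(fus, result, "add", "ADDF", "F6", "F8", "F2")
waw = issue(fus, result, "mul", "MULF", "F6", "F1", "F3")   # F6 already pending
war_before = can_write(fus, "add")   # div has not read F6 yet
read_operands(fus, "div")
war_after = can_write(fus, "add")    # div has read F6: the write may go ahead
```

The sketch omits the execute stage and the clearing of entries after a write, so it shows only the hazard checks, not a full scoreboard.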
Scoreboarding Example
Example: Cycle 0
Example, contd.
• The rest you’ll find on the web site
  • Go through it
  • Source: Patterson
• Summary:
  • Execution proceeds in an order dictated by dependences
  • RAW, WAR and WAW force ordering
  • Tricks may be possible

Beyond Simple OoO
    A: LF   F6, 34(R2)
    B: LF   F2, 45(R3)
    C: MULF F0, F2, F4
    D: SUBF F8, F2, F6
    E: ADDF F2, F7, F4
[Figure: dependence graph over A, B, C, D, E]
• E will wait for B, C and D.
  • WAR w/ C and D
  • WAW w/ B
• Can we do better?

What if we had infinite registers
Original:
    A: LF   F6, 34(R2)
    B: LF   F2, 45(R3)
    C: MULF F0, F2, F4
    D: SUBF F8, F2, F6
    E: ADDF F2, F7, F4
With E writing the unused name F9 instead of F2:
    A: LF   F6, 34(R2)
    B: LF   F2, 45(R3)
    C: MULF F0, F2, F4
    D: SUBF F8, F2, F6
    E: ADDF F9, F7, F4
• No false dependences anymore
• Since we do not reuse a name we can’t have WAW and WAR

Why we can’t have Infinite Registers
• False/Name dependences (WAR and WAW)
  • Artifact of having finite registers
• There is no such thing as infinite
• There is no such thing as large enough
  • Well there is (in a sec.)
  • Computers execute billions of instructions per sec. Even a multi-billion register file would soon be exhausted
• Want to exploit parallelism across several instances of the same code
  • Loops, recursive functions (most frequent part)

Yes, there is “large enough”
• At any given point there will be a finite number of instructions in the window
  • If each instruction has a single register target
  • If there are N instructions
• How many registers do we need?
  • N?
  • N + X?

Register Renaming
• Register Version
  • Every Write creates a new version
  • Uses read the last version
  • Need to keep a version until all uses have read it.
• Register Renaming:
  • Architectural vs. Physical Registers
  • More phys. than arch.
  • Maintain a map of arch. to phys. regs.
  • Use in-order decoding to properly identify dependences.
  • Instructions wait only for input op. availability.
  • Only last version is written to reg. file.

Register Renaming
Instruction               Renamed (tgt, src1, src2)
A: DIVF F3, F1, F0        r1, -, -
B: SUBF F2, F1, F0        r2, -, -
C: MULF F0, F2, F4        r3, r2, -
D: SUBF F6, F2, F3        r4, r2, r1
E: ADDF F2, F5, F4        r5, -, -
F: ADDF F0, F0, F2        r6, r3, r5
• Need more physical registers than architectural
• Ignore control flow for the time being.

Register Renaming Process
• Only need to remember last producer of each architectural register
  • Vector
• At decode
  • Find the most recent producers for all source registers
  • After: declare self as most recent producer of target register
• Complication:
  • May have to retract
  • Speculative Execution, e.g., interrupts
  • Need to be able to restore the mapping state

Register Renaming Support Structures
• Register Rename Table
  • f(aR) = pR
  • One entry per architectural register
• Free Register List
  • Lists the physical registers not in use
• At Decode
  • Grab a new register from the free list
  • Change mapping in rename table
• At Commit
  • Release register? No… Why not?
  • Could release previous version

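A rename table plus free list can be sketched as follows. The `Renamer` class is a hypothetical illustration; as a simplifying assumption, a source with no in-flight producer is shown as "-" (matching the notation of the renaming example slide) rather than being given an initial physical register.

```python
from collections import deque


class Renamer:
    """Rename table (architectural -> physical) plus a free list."""

    def __init__(self, free_regs):
        self.table = {}                 # f(aR) = pR for registers with a producer
        self.free = deque(free_regs)    # physical registers not in use

    def rename(self, tgt, srcs):
        # Look up the sources first, so an instruction never reads its own target.
        phys_srcs = [self.table.get(s, "-") for s in srcs]
        previous = self.table.get(tgt)          # version being superseded
        # An empty free list would mean a stall in hardware; here it raises IndexError.
        self.table[tgt] = self.free.popleft()
        return self.table[tgt], phys_srcs, previous

    def commit(self, previous):
        # At commit the previous version of the target can be released.
        if previous is not None:
            self.free.append(previous)


renamer = Renamer([f"r{i}" for i in range(1, 9)])
program = [
    ("F3", ["F1", "F0"]),   # A: DIVF F3, F1, F0
    ("F2", ["F1", "F0"]),   # B: SUBF F2, F1, F0
    ("F0", ["F2", "F4"]),   # C: MULF F0, F2, F4
    ("F6", ["F2", "F3"]),   # D: SUBF F6, F2, F3
    ("F2", ["F5", "F4"]),   # E: ADDF F2, F5, F4
    ("F0", ["F0", "F2"]),   # F: ADDF F0, F0, F2
]
renamed = [renamer.rename(tgt, srcs) for tgt, srcs in program]
```

The register released at commit is the superseded version, not the one the committing instruction just wrote, because later instructions may still need to read the new value.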
How Many Physical Registers?
• Correctness:
  • At least as many as architectural, plus?
• Performance:
  • As many as possible
  • Not correctness
• Recall not all instructions produce register results
  • Stores and branches

Dynamic Scheduling
Instruction               Renamed (tgt, src1, src2)
A: DIVF F3, F1, F0        r1, -, -
B: SUBF F2, F1, F0        r2, -, -
C: MULF F0, F2, F4        r3, r2, -
D: SUBF F6, F2, F3        r4, r2, r1
E: ADDF F2, F5, F4        r5, -, -
F: ADDF F0, F0, F2        r6, r3, r5
[Figure labels: Name, Value]
- Values and Names flow together
- Writeback specifies both value and name
- A waiting instruction inspects all results
- It is allowed to execute when all inputs are available

Physical Registers
• Physical register file is just one option
• What we need is separate storage
• Consumers could keep values in their reservation station
  • Tomasulo’s next