Instruction Level Parallel Processing

Instruction Level Parallel Processing • Sequential Execution Semantics • Superscalar Execution • Interruptions • Out-of-Order Execution • How it can help • Issues: • Maintaining Sequential Semantics • Scheduling • Scoreboard • Register Renaming • Initially, we’ll focus on Registers, Memory later on ECE1773 - Spring ‘02 ECE Toronto

Sequential Semantics - Review • Instructions appear as if they executed: • In the order they appear in the program • One after the other • Pipelining: Partial Overlap of Instructions • Initiate one instruction per cycle • Subsequent instructions overlap partially • Commit one instruction per cycle ECE1773 - Spring ‘02 ECE Toronto

Can we do better than pipelining? loop: ld r2, 10(r1) add r3, r3, r2 sub r1, r1, 1 bne r1, r0, loop Pipelining: sum += a[i--] time fetch decode ld fetch decode add fetch decode sub fetch decode bne Superscalar: fetch decode ld fetch decode add fetch decode sub fetch decode bne ECE1773 - Spring ‘02 ECE Toronto

Superscalar - In-order (initial def.) • Two or more consecutive instructions (in the original program order) can execute in parallel • Is this much better than pipelining? • What if all instructions were dependent? • Superscalar buys us nothing • Again key is typical program behavior • Some parallelism exists • Pipelining “drains” on dependences • Superscalar consumes “fill-up” time ECE1773 - Spring ‘02 ECE Toronto

Practicalities • Issue mechanism • At decode check: • Dependences • Input operand availability • Check against Instructions: • Simultaneously Decoded • In-progress in the pipeline (i.e., previously issued) • Recall the register vector from pipelining • Increasingly Complex with degree of superscalarity • 2-way, 3-way, …, n-way ECE1773 - Spring ‘02 ECE Toronto

Issue Rules • Stall at decode if: • RAW dependence and no data available • WAR or RAW dependence • No resource available • This check is done in program order ECE1773 - Spring ‘02 ECE Toronto

Issue Mechanism • Assume 2 source & 1 target max per instr. • comparators for 2-way: • 3 for tgt and 2 for src • comparators for 4-way: • 2nd instr: 3 tgt and 2 src • 3rd instr: 6 tgt and 4 src • 4th instr: 9 tgt and 6 src tgt src1 src1 • simplifications • may be possible • resource checking • not shown tgt src1 src1 Program order tgt src1 src1 ECE1773 - Spring ‘02 ECE Toronto

Implications • Need to multiport some structures • Register File • Multiple Reads and Writes per cycle • Register Availability Vector • Multiple Reads and Writes per cycle • From Decode and Commit! • Also need to worry about WAR and WAW • Resource tracking • Additional issue conditions • Many Superscalars had additional restrictions • E.g., execute one integer and one floating point op • one branch, or one store/load ECE1773 - Spring ‘02 ECE Toronto

Preserving Sequential Semantics • In principle not much different than pipelining • Program order is preserved in the pipeline • Some instructions proceed in parallel • But order is clearly defined • Defer interrupts to commit stage (i.e., writeback) • Flush all subsequent instructions • may include instructions committing simultaneously • Allow all preceding instructions to commit • Recall comparisons are done in program order • Must have sufficient time in clock cycle to handle these ECE1773 - Spring ‘02 ECE Toronto

Interrupts Example Exception raised Exception taken fetch decode ld fetch decode add fetch decode div fetch decode bne fetch decode bne Exception raised Exception raised Exception taken fetch decode ld fetch decode add fetch decode div fetch decode bne fetch decode bne ECE1773 - Spring ‘02 ECE Toronto

Superscalar vs. Pipelining • In principle they are orthogonal • Superscalar non-pipelined machine • Pipelined non-superscalar • Superscalar and Pipelined (common) • Additional functionality needed by Superscalar: • Another bound on clock cycle • At some point it limits the number of pipeline stages ECE1773 - Spring ‘02 ECE Toronto

Superscalar vs. Superpipelining • Superpipelining: • Vaguely defined as deep pipelining, i.e., lots of stages • Superscalar issue may prevent superpipelining • How do they compare? fetch decode inst fetch decode inst fetch decode inst fetch decode inst fetch decode inst fetch decode inst fetch decode inst fetch decode inst ECE1773 - Spring ‘02 ECE Toronto

Superscalar vs. Superpipelining fetch decode inst fetch decode inst fetch decode inst superscalar fetch decode inst fetch decode inst fetch decode inst fetch decode inst fetch decode inst superpipelining fetch decode inst fetch decode inst fetch decode inst fetch decode inst Limit: All instructions independent Difference is initiation interval ECE1773 - Spring ‘02 ECE Toronto

fetch decode inst fetch decode inst fetch decode inst fetch decode inst fetch decode inst fetch decode inst fetch decode inst fetch decode inst fetch decode inst fetch decode inst fetch decode inst fetch decode inst Superscalar vs. Superpipelining Every stall introduces an additional “initiation interval” difference between the two ECE1773 - Spring ‘02 ECE Toronto

Case Study: Alpha 21164 ECE1773 - Spring ‘02 ECE Toronto

21164: Int. Pipe ECE1773 - Spring ‘02 ECE Toronto

21164: Memory Pipeline ECE1773 - Spring ‘02 ECE Toronto

21164: Floating-Point Pipe ECE1773 - Spring ‘02 ECE Toronto

80486 Pipeline • Fetch • Load 16-bytes from into prefetch buffer • Decode 1 • Determine instruction length and type • Decode 2 • Computer memory address • Generate immediate operands • Execute • Register Read • ALU • Memory read/write • Write-back • Update register file • (source: CS740 CMU, ’97) ECE1773 - Spring ‘02 ECE Toronto

80486 Pipeline detail • Fetch • Moves 16 bytes of instruction stream into code queue • Not required every time • About 5 instructions fetched at once (avg. length 2.5 bytes) • Only useful if don’t branch • Avoids need for separate instruction cache • D1 • Determine total instruction length • Signals code queue aligner where next instruction begins • May require two cycles • When multiple operands must be decoded • About 6% of “typical” DOS program ECE1773 - Spring ‘02 ECE Toronto

80486 Pipeline • D2 • Extract memory displacements and immediate operands • Compute memory addresses • Add base register, and possibly scaled index register • May require two cycles • If index register involved, or both address & immediate operand • Approx. 5% of executed instructions • EX • Read register operands • Compute ALU function • Read or write memory (data cache) • WB • Update register result ECE1773 - Spring ‘02 ECE Toronto

Out-of-Order Execution • Also known as dynamic scheduling • Compilers do static scheduling • We will start by considering register only • Register interface helps a lot • Later on we will expand to memory • Tricky: Memory interface is more powerful than registers • Makes it harder to figure out dependences • In principle the same method will be used for both ECE1773 - Spring ‘02 ECE Toronto

Beyond Superscalar Execution loop: ld r2, 10(r4) add r3, r3, r2 sub r1, r1, 1 bne r1, r0, loop Superscalar: sum += a[m); m-- fetch decode ld ld fetch decode add add fetch decode sub sub fetch decode bne bne out-of-order fetch decode ld ld fetch decode add add fetch decode sub sub fetch decode bne bne ECE1773 - Spring ‘02 ECE Toronto

Sequential Semantics? • Execution does NOT adhere to sequential semantics • To be precise: Eventually it may • Simplest solution: Define problem away • Imprecise interrupts • On interrupt some instr. committed some not • software we’ll have to figure out what is going on • Horrible for debugging and programming inconsistent fetch decode ld ld fetch decode add add fetch decode sub sub fetch decode bne bne ECE1773 - Spring ‘02 ECE Toronto

Interrupt • Recall we use the term interrupt to signify the need to observe the machine’s state after any instruction • This can be indeed the result of interrupt in the classical sense • But, it could be a debugger (still uses interrupts) ECE1773 - Spring ‘02 ECE Toronto

Out-of-Order vs. Pipelining and Superscalar • Definition: two or more instructions can execute in any order if they have no dependences (RAW, WAW, WAR) • beware of transitive dependences • Is this better than pipelining or superscalar exec? • If all are independent: not • if all dependent: not • Programs have some parallelism • Pipelining “drain” and “fill-up” overheads • Superscalar, parallelism only when adjacent • OoO exploits par. even when not adjacent • OoO Orthogonal to pipelining and Superscalar ECE1773 - Spring ‘02 ECE Toronto

Out-of-order Execution Issues • Preserving Sequential Semantics • Stalling Instructions w/ dependences • Issuing Instructions when dependences are satisfied ECE1773 - Spring ‘02 ECE Toronto

Back to Sequential Semantics • Instr. exec. in 3 phases: • In-progress, Completed (NEW), Committed • OOO for in-progress and Completed • In-order Commits • Completed - out-of-order: • Results visible to subsequent instructions • Results not visible to outsiders • On interrupts completed results are discarded • Committed - in-order: • Results visible to subsequent instructions • Results visible to outsiders • On interrupt committed results are preserved ECE1773 - Spring ‘02 ECE Toronto

How Completes Help w/ Performance in-order completes out-of-order completes in-order commits DIV R3, _, _ ADD R1, _, _ ADD _, R1, _ Time In-order commits commit fetch decode ld ld fetch decode add add fetch decode sub sub fetch decode bne bne complete ECE1773 - Spring ‘02 ECE Toronto

Implementing Completes/Commits • Key idea: • Maintain sufficient state around to be able to roll-back when necessary • Roll-back: • Discard (aka Squash) all not committed • One solution (conceptual): • Upon Complete instruction records previous value of target register • Upon Discard, instruction restores target value • Upon Commit, nothing to do • We will return to this shortly • Focus on scheduling mechanisms ECE1773 - Spring ‘02 ECE Toronto

Out-of-Order Execution the big Picture Program Form Processing Phase Static program dynamic inst. Stream (trace) execution window completed instructions Dispatch/ dependences inst. Issue inst execution inst. Reorder & commit ECE1773 - Spring ‘02 ECE Toronto

Out-of-Order Execution: Stages • Fetch: get instruction from memory • Decode/Dispatch: what is it? What are the dependences • Issue: Go – all dependences satisfied • Execute: perform operation • Complete: result available to other insts. • Commit: result available to outsiders • We’ll start w/ Decode/Dispatch • Then we’ll consider Issue ECE1773 - Spring ‘02 ECE Toronto

OOO Scheduling • Instruction @ Decode: • Do I have dependences yet to be satisfied? • Yes, stall until they are • No, clear to issue • Wakeup Instructions Stalled: • Dependences satisfied • Allow instruction to issue • Dependence: • (later instruction, earlier instruction) & type • We’ll first consider RAW and then move on to WAW and WAR ECE1773 - Spring ‘02 ECE Toronto

Stalling @ Decode for RAW • Are there unsatisfied dependences? • RAW: have to wait for register value • We don’t really care who is producing the value • Only whether it is available • Can use the Register Availability Vector as in pipelining/superscalar • Also known as scoreboard • On decode • Reset bit corresponding to your target • At writeback set • Check all bits for source regs: if any is 0 stall ECE1773 - Spring ‘02 ECE Toronto

Issuing Instructions: Scheduling • Determine when an instruction can issue • Ignore Resources for the time being • Stalled because of RAW w/ preceding instruction • Concept: • Producer (write) notifies consumers (read) • Requirements: • Consumers need to be able to identify producer • The register name is one possible link • Mechanism • Consumers placed in a reservation station • Producers on complete broadcasts identity • Waiting instructions observe • Update Operand Availability • Issue if all operands now available ECE1773 - Spring ‘02 ECE Toronto

Reservation Station • State pertaining to an instruction • What registers it reads • Whether they are available • What is the destination register • What state is the instruction in • Waiting • Executing ECE1773 - Spring ‘02 ECE Toronto

1 1 1 1 Out-Of-Order Exec. Example loop: ld r2, 10(r4) 4 cycles lat add r3, r3, r2 sub r1, r1, 1 add r4, r4, 4 bne r1, r0, loop RAV op src1 src2 tgt status r1 r2 r3 r4 Cycle 0 ECE1773 - Spring ‘02 ECE Toronto

1 0 1 1 Out-Of-Order Exec. Example loop: ld r2, 10(r4) add r3, r3, r2 sub r1, r1, 1 add r4, r4, 4 bne r1, r0, loop RAV op src1 src2 tgt status r1 r2 r3 r4 ld r4/1 NA/1 r2/0 Rdy Cycle 0 ECE1773 - Spring ‘02 ECE Toronto

1 0 0 1 Out-Of-Order Exec. Example loop: ld r2, 10(r4) add r3, r3, r2 sub r1, r1, 1 add r4, r4, 4 bne r1, r0, loop RAV op src1 src2 tgt status r1 r2 r3 r4 ld r4/1 NA/1 r2/0 Exec add r3/1 r2/0 r3/0 Wait Cycle 1 ECE1773 - Spring ‘02 ECE Toronto

0 0 0 1 Out-Of-Order Exec. Example loop: ld r2, 10(r4) add r5, r3, r2 sub r1, r1, 1 add r4, r4, 4 bne r1, r0, loop RAV op src1 src2 tgt status r1 r2 r3 r4 ld r4/1 NA/1 r2/0 Exec add r3/1 r2/0 r3/0 Wait sub r1/1 NA/1 r1/0 Rdy Cycle 2 ECE1773 - Spring ‘02 ECE Toronto

0 0 0 0 Out-Of-Order Exec. Example loop: ld r2, 10(r4) 5 cycles lat add r3, r3, r2 sub r1, r1, 1 add r4, r4, 4 bne r1, r0, loop RAV op src1 src2 tgt status r1 r2 r3 r4 ld r4/1 NA/1 r2/0 Exec add r3/1 r2/0 r3/0 Wait sub r1/1 NA/1 r1/0 Exec Cycle 3 add r4/1 NA/1 r4/0 Rdy ECE1773 - Spring ‘02 ECE Toronto

1 1 0 1 Out-Of-Order Exec. Example loop: ld r2, 10(r4) add r3, r3, r2 sub r1, r1, 1 add r4, r4, 4 bne r1, r0, loop RAV op src1 src2 tgt status r1 r2 r3 r4 ld r4/1 NA/1 r2/1 Comp add r3/1 r2/1 r3/0 Exec sub r1/1 NA/1 r1/1 Comp Cycle 4 add r4/1 NA/1 r4/0 Exec bne r1/1 NA/1 NA/1 Rdy ECE1773 - Spring ‘02 ECE Toronto

1 1 1 1 Out-Of-Order Exec. Example loop: ld r2, 10(r4) add r3, r3, r2 sub r1, r1, 1 add r4, r4, 4 bne r1, r0, loop RAV op src1 src2 tgt status r1 r2 r3 r4 ld r4/1 NA/1 r2/1 Comm add r3/1 r2/1 r3/1 Comp sub r1/1 NA/1 r1/1 Comp Cycle 5 add r4/1 NA/1 r4/1 Comp bne r1/1 NA/1 NA/1 exec ECE1773 - Spring ‘02 ECE Toronto

1 1 1 0 Out-Of-Order Exec. Example loop: ld r2, 10(r4) add r3, r3, r2 sub r1, r1, 1 add r4, r4, 4 bne r1, r0, loop RAV op src1 src2 tgt status r1 r2 r3 r4 ld r4/1 NA/1 r2/1 Comm add r3/1 r2/1 r3/1 Comm sub r1/1 NA/1 r1/1 Comm Cycle 7 add r4/1 NA/1 r4/1 Comm bne r1/1 NA/1 NA/1 Comp ECE1773 - Spring ‘02 ECE Toronto

1 1 1 1 Out-Of-Order Exec. Example loop: ld r2, 10(r4) add r3, r3, r2 sub r1, r1, 1 add r4, r4, 4 bne r1, r0, loop RAV op src1 src2 tgt status r1 r2 r3 r4 ld r4/1 NA/1 r2/1 Comm add r3/1 r2/1 r3/1 Comm sub r1/1 NA/1 r1/1 Comm Cycle 8 add r4/1 NA/1 r4/1 Comm bne r1/1 NA/1 NA/1 Comm ECE1773 - Spring ‘02 ECE Toronto

Notifying Consumers • Identity of Producer • Uniquely Identify the Instruction • Easily retrievable @ decode by others • Target Register • Recall we stall on WAR or WAW • Functional Unit • If not pipelined • Place in instruction window • PC? not. Why? ECE1773 - Spring ‘02 ECE Toronto

Name Dependences and OOO • WAW or WAR: We need to update register but others are still using it • add r1, r1, 10 • sw r1, 20(r2) • add r1, r3, 30 • sub r2,r1, 40 • There is only one r1 • sw needs to see the value of 1st add • Sub needs to wait for 2nd add and not 1st • Solution: Stall decode when WAW or WAR ECE1773 - Spring ‘02 ECE Toronto

Detecting WAW and WAR • WAW? Look at Scoreboard • If bit is 0 then there is a pending write • Stall • WAR? Need to know whether all preceding consumers have read the value • Keep a count per register • Increase at decode for all reads • Decrease on issue • More elegant solution via register renaming • soon ECE1773 - Spring ‘02 ECE Toronto

Scoreboarding • Schedule based on RAW dependences • WAW and WAR cause stalls • WAW at decode • WAR at writeback • Implemented in the CDC 6600 in ‘64 • 18 non-pipelined FUs • 4 FP: 2 mul, 1 add, 1 div • 7 MEM: 5 load, 2 store • 7 INT: add, shift, logical etc. • Centralized Control Scheme • Controls all Instruction Issue • Detects all hazards ECE1773 - Spring ‘02 ECE Toronto

FP mul FP mul FP divide FP add FP integer Register File scoreboard DLX w/ Scoreboarding ECE1773 - Spring ‘02 ECE Toronto

Instruction Level Parallel Processing