Dynamic ILP Multiple Issue - Speculation
Outline • Multiple Issue • Superscalar • VLIW • EPIC • Speculation • Re-order buffers • Limits to ILP
Getting CPI < 1: Issuing Multiple Instructions/Cycle
• Two variations
• Superscalar: varying number of instructions/cycle (1 to 8), scheduled by compiler or by HW (Tomasulo)
  – IBM PowerPC, Sun UltraSparc, DEC Alpha, HP 8000
• (Very) Long Instruction Words (V)LIW: fixed number of instructions (4-16) scheduled by the compiler; put ops into wide templates
  – Joint HP/Intel agreement in 1999/2000?
  – Intel Architecture-64 (IA-64) 64-bit address
  – Style: “Explicitly Parallel Instruction Computer (EPIC)”
• Anticipated success led to use of Instructions Per Clock cycle (IPC) vs. CPI
Limits to Multi-issue Machines
• Inherent limitations of ILP
  – 1 branch in 5: How to keep a 5-way VLIW busy?
  – Latencies of units: many operations must be scheduled
  – Need about Pipeline Depth x No. Functional Units of independent operations to keep machines busy, e.g. 5 x 4 = 20, so about 15–20 independent instructions?
• Difficulties in building HW
  – Easy: More instruction bandwidth
  – Easy: Duplicate FUs to get parallel execution
  – Hard: Increase ports to Register File (bandwidth)
    VLIW example needs 7 read and 3 write ports for Int. Reg. & 5 read and 3 write ports for FP Reg.
  – Harder: Increase ports to memory (bandwidth)
  – Decoding superscalar and impact on clock rate, pipeline depth?
• Statically scheduled superscalar
• Dynamically scheduled superscalar
• Statically scheduled VLIW
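The rule of thumb on this slide (pipeline depth times number of functional units) can be written as a one-line calculation. This is only a sketch; the function name `independent_ops_needed` is made up for illustration.

```python
def independent_ops_needed(pipeline_depth: int, functional_units: int) -> int:
    """Rough number of independent operations needed to keep every
    stage of every functional-unit pipeline busy (slide's rule of thumb)."""
    return pipeline_depth * functional_units


if __name__ == "__main__":
    # The slide's example: 5-deep pipelines, 4 functional units.
    print(independent_ops_needed(5, 4))
```

With roughly one branch every five instructions, a requirement of this size spans several basic blocks, which is why the slide asks how a wide machine can be kept busy.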
The Challenge
• Hazard detection in issue packet
  – Within the issue packet
  – With previously issued instructions
• Split and pipelined
  – Not in one clock cycle
  – Division?
  – First stage: how many instructions in the issue packet can issue
  – Second stage: hazard detection with previously issued instructions
• Fetching multiple instructions if not from a block in cache?
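The two checks named on this slide can be sketched in software. The code below is a minimal illustration, not a description of any real issue logic: the `Instr` record, the function `issuable_prefix`, and the choice to check only RAW and WAW hazards on register names are all assumptions made for the example.

```python
from typing import NamedTuple, Optional, Sequence, Set, Tuple


class Instr(NamedTuple):
    op: str
    dest: Optional[str]          # destination register, or None
    srcs: Tuple[str, ...]        # source registers


def issuable_prefix(packet: Sequence[Instr], in_flight_dests: Set[str]) -> int:
    """Return how many instructions at the front of the packet can issue
    together, stopping at the first one that has a hazard."""
    written_in_packet: Set[str] = set()
    for count, instr in enumerate(packet):
        # Hazard with previously issued instructions (RAW on a pending result).
        if any(src in in_flight_dests for src in instr.srcs):
            return count
        # Hazard within the issue packet (RAW on an earlier packet member).
        if any(src in written_in_packet for src in instr.srcs):
            return count
        if instr.dest is not None:
            # WAW against the packet or against pending results.
            if instr.dest in written_in_packet or instr.dest in in_flight_dests:
                return count
            written_in_packet.add(instr.dest)
    return len(packet)


if __name__ == "__main__":
    packet = [Instr("LD", "F0", ("R1",)), Instr("ADDD", "F4", ("F0", "F2"))]
    print(issuable_prefix(packet, set()))
```

In hardware these comparisons run in parallel, and the number of comparators grows quickly with issue width, which is the motivation for splitting the work across pipeline stages.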
Getting CPI < 1: Issuing Multiple Instructions/Cycle
• Superscalar MIPS: 2 instructions, 1 FP & 1 integer
  – Fetch 64 bits/clock cycle; Int on left, FP on right (static scheduling)
  – Can only issue 2nd instruction if 1st instruction issues
  – More ports for FP registers to do FP load & FP op in a pair

Type               Pipe Stages
Int. instruction   IF  ID  EX  MEM WB
FP instruction     IF  ID  EX  EX  EX  WB
Int. instruction       IF  ID  EX  MEM WB
FP instruction         IF  ID  EX  EX  EX  WB
Int. instruction           IF  ID  EX  MEM WB
FP instruction             IF  ID  EX  EX  EX  WB

• 1 cycle load delay expands to 3 instructions in SS
  – instruction in right half can’t use it, nor instructions in next slot
• Also the branch delay becomes two or three instructions
Dynamic Scheduling in Superscalar
• How to issue two instructions and keep in-order instruction issue for Tomasulo?
  – Assume 1 integer + 1 floating point
  – 1 Tomasulo control for integer, 1 for floating point
• Issue at 2X clock rate, so that issue remains in order
• Only FP loads might cause a dependency between integer and FP issue:
  – Operands must be read in the order they are fetched
  – Load checks addresses in Store Queue to avoid RAW violation
  – Store checks addresses in Load Queue to avoid WAR, WAW
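The address checks in the last two bullets can be sketched as simple membership tests. This is an illustrative model only: the function names are hypothetical, each queue is assumed to hold the addresses of earlier, still-pending memory operations, and checking the store queue for WAW is an assumption added here (the slide only mentions the load queue).

```python
from typing import Iterable


def load_may_proceed(load_addr: int, pending_store_addrs: Iterable[int]) -> bool:
    """A load must wait while an earlier pending store targets the same
    address; otherwise it would read a stale value (RAW violation)."""
    return load_addr not in set(pending_store_addrs)


def store_may_proceed(store_addr: int,
                      pending_load_addrs: Iterable[int],
                      pending_store_addrs: Iterable[int]) -> bool:
    """A store must wait for earlier pending loads to the same address (WAR)
    and, in this sketch, for earlier pending stores to it as well (WAW)."""
    return (store_addr not in set(pending_load_addrs)
            and store_addr not in set(pending_store_addrs))


if __name__ == "__main__":
    print(load_may_proceed(0x100, [0x200]))        # no earlier store to 0x100
    print(store_may_proceed(0x100, [0x100], []))   # earlier load still pending
```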
Performance of Dynamic SS

Iteration  Instruction      Issues  Executes  Writes result
no.                         (clock-cycle number)
1          LD F0,0(R1)      1       2         4
1          ADDD F4,F0,F2    1       5         8
1          SD 0(R1),F4      2       9
1          SUBI R1,R1,#8    3       4         5
1          BNEZ R1,LOOP     4       5
2          LD F0,0(R1)      5       6         8
2          ADDD F4,F0,F2    5       9         12
2          SD 0(R1),F4      6       13
2          SUBI R1,R1,#8    7       8         9
2          BNEZ R1,LOOP     8       9

• 4 clocks per iteration; only 1 FP instr/iteration
• Branches, decrements still take 1 clock cycle to issue
• How to get more performance?
  – Eliminating loop overhead
  – More integer operations per cycle
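The "4 clocks per iteration" figure can be read off the issue column: the LD of iteration 1 issues in cycle 1 and the LD of iteration 2 issues in cycle 5. A small sketch of that arithmetic follows; the variable and function names are invented for the example.

```python
# Issue cycle of the first instruction (LD) of each iteration, from the table.
ld_issue_cycle = {1: 1, 2: 5}
INSTRUCTIONS_PER_ITERATION = 5  # LD, ADDD, SD, SUBI, BNEZ


def steady_state_cpi(issue_cycles: dict, instrs_per_iter: int) -> float:
    """Clocks between the starts of consecutive iterations, per instruction."""
    clocks_per_iter = issue_cycles[2] - issue_cycles[1]
    return clocks_per_iter / instrs_per_iter


if __name__ == "__main__":
    cpi = steady_state_cpi(ld_issue_cycle, INSTRUCTIONS_PER_ITERATION)
    print(f"CPI = {cpi:.2f}, IPC = {1 / cpi:.2f}")
```

Five instructions in four clocks gives a CPI just below 1, so the two-issue machine is far from its peak of two instructions per clock on this loop.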
Speculation Branch Prediction – Out of Order Execution
Control Dependence Ignored
• If CPU stalls on branches, how much would CPI increase?
• Control dependence need not be preserved throughout execution
  – We are willing to execute instructions that should not have been executed, thereby violating the control dependences, if we can do so without affecting the correctness of the program
• Two properties critical to program correctness are:
  – data flow
  – exception behavior
Branch Prediction and Speculative Execution
• Speculation runs instructions based on a prediction; predictions could be wrong
• Branch prediction: cannot be avoided, could be very accurate
• Misprediction is a less frequent event, but can we ignore it?
• Example:
    for (i=0; i<1000; i++) C[i] = A[i]+B[i];
• Branch prediction: predict the execution as accurately as possible (frequent cases)
• Speculative execution recovery: if the prediction is wrong, roll the execution back
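To see why prediction on the example loop "could be very accurate", here is a small simulation of a 2-bit saturating-counter predictor. The slide does not name a predictor, so the 2-bit scheme, the initial counter value of 0, and the assumption that the loop compiles to one backward branch (taken 999 times, then not taken once) are choices made only for this sketch.

```python
from typing import Iterable


def count_mispredictions(outcomes: Iterable[bool], counter: int = 0) -> int:
    """Simulate one 2-bit saturating counter (0..3).
    The branch is predicted taken when the counter is 2 or 3."""
    misses = 0
    for taken in outcomes:
        predicted_taken = counter >= 2
        if predicted_taken != taken:
            misses += 1
        # Move toward 3 on taken, toward 0 on not taken, saturating.
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return misses


if __name__ == "__main__":
    # Assumed branch behavior of the 1000-iteration loop.
    loop_branch = [True] * 999 + [False]
    misses = count_mispredictions(loop_branch)
    print(f"{misses} mispredictions out of {len(loop_branch)} branches")
```

The few mispredictions come from warming up the counter and from the loop exit. They are rare but not zero, which is why a recovery mechanism is still required.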
Exception Behavior
• Preserving exception behavior: exceptions must be raised exactly as in sequential execution
  – Same sequence as sequential
  – No “extra” exceptions
• Example:
        DADDU R2,R3,R4
        BEQZ  R2,L1
        LW    R1,0(R2)
    L1:
• Problem with moving LW before BEQZ?
• Again, a dynamic execution must look like a sequential execution at any time it is stopped
Exceptions in Order
• Solutions:
  – Early detection of FP exceptions
  – The use of software mechanisms to restore a precise exception state before resuming execution
  – Delaying instruction completion until we know an exception is impossible
Precise Interrupts
• An interrupt is precise if the saved process state corresponds with a sequential model of program execution where one instruction completes before the next begins.
• Tomasulo had: in-order issue, out-of-order execution, and out-of-order completion
• Need to “fix” the out-of-order completion aspect so that we can find a precise breakpoint in the instruction stream.
Branch Prediction vs. Precise Interrupt
• A misprediction is an “exception” on the branch instruction
• Execution “branches out” on exceptions
  – Every instruction is “predicted” not to take the “branch” to the interrupt handler
• Same technique for handling both issues:
  – in-order completion or commit: change register/memory only in program order (sequential)
• How does it ensure correctness?
HW Support for More ILP
• Speculation: allow an instruction that depends on a branch predicted to be taken to issue, with no consequences (including exceptions) if the branch is not actually taken (“HW undo”)
• Combine branch prediction with dynamic scheduling to execute before branches are resolved
• Separate speculative bypassing of results from real bypassing of results
  – When an instruction is no longer speculative, write boosted results (instruction commit) or discard boosted results
  – Execute out of order but commit in order to prevent any irrevocable action (update state or exception) until the instruction commits
HW Support for More ILP
• Need HW buffer for results of uncommitted instructions: reorder buffer
  – 3 fields: instr, destination, value
  – Reorder buffer can be operand source => more registers like RS
  – Use reorder buffer number instead of reservation station when execution completes
  – Supplies operands between execution complete & commit
  – Once the instruction commits, result is put into register
  – Instructions commit in order
  – As a result, it’s easy to undo speculated instructions on mispredicted branches or on exceptions
Result Shift Register
• The “Result Shift Register” is used to control the result bus
• It has N stages, where N is the length of the longest functional unit pipeline
• An instruction that takes i clock periods reserves stage i
• If the stage already contains valid control information, then issue is held until the next clock period
• An issuing instruction places control information in the result shift register:
  – the functional unit that will be supplying the result
  – the destination register
• This control information is also marked “valid”
• Each clock period, the control information is shifted down one stage toward stage one
• When it reaches stage one, it is used during the next clock period to control the result bus
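The reservation and shifting behavior described on this slide can be modeled in a few lines. This is a behavioral sketch only: the class `ResultShiftRegister` and its method names are invented, and a stage is treated as "valid" when it holds an entry.

```python
from typing import List, Optional, Tuple

Entry = Tuple[str, str]  # (functional unit, destination register)


class ResultShiftRegister:
    def __init__(self, n_stages: int) -> None:
        # Index 0 is unused so that stage i lives at index i (1..N).
        self.stages: List[Optional[Entry]] = [None] * (n_stages + 1)

    def try_issue(self, latency: int, unit: str, dest: str) -> bool:
        """Reserve stage `latency`; return False (hold issue) if it is taken."""
        if self.stages[latency] is not None:
            return False
        self.stages[latency] = (unit, dest)
        return True

    def tick(self) -> Optional[Entry]:
        """Advance one clock period. Return the entry that was in stage one
        (it controls the result bus this period), then shift everything down."""
        leaving = self.stages[1]
        self.stages[1:] = self.stages[2:] + [None]
        return leaving


if __name__ == "__main__":
    rsr = ResultShiftRegister(4)
    rsr.try_issue(3, "FPADD", "F4")
    for cycle in range(1, 4):
        print(cycle, rsr.tick())
```

Because each stage holds at most one entry, two instructions can never claim the result bus in the same clock period; the conflict is detected at issue time instead.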
The Hardware: Reorder Buffer
[Figure: block diagram with IM, Fetch Unit, Decode, Rename, Regfile, Reorder Buffer, S-buf, L-buf, two RS, DM, FU1, FU2]
• If instructions write results in program order, registers/memory always get the correct values
• Reorder buffer (ROB): reorders out-of-order instructions into program order at the time of writing registers/memory (commit)
• If some instruction goes wrong, handle it at the time of commit: just flush the instructions after it
• Instructions cannot write registers/memory immediately after execution, so the ROB also buffers the results
  – The original Tomasulo design has no such place
Four Steps of Speculative Tomasulo Algorithm
1. Issue: get instruction from FP Op Queue
   If reservation station and reorder buffer slot free, issue instr & send operands & reorder buffer no. for destination (this stage sometimes called “dispatch”)
2. Execution: operate on operands (EX)
   When both operands ready then execute; if not ready, watch CDB for result; when both in reservation station, execute; checks RAW (sometimes called “issue”)
3. Write result: finish execution (WB)
   Write on Common Data Bus to all awaiting FUs & reorder buffer; mark reservation station available.
4. Commit: update register with reorder result
   When instr. at head of reorder buffer & result present, update register with result (or store to memory) and remove instr from reorder buffer. Mispredicted branch flushes reorder buffer (sometimes called “graduation”)
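The commit step is what turns out-of-order completion into in-order state updates. The sketch below models only steps 1, 3 and 4 for register results; reservation stations, the CDB and memory are left out. The names `RobEntry` and `ReorderBuffer`, and the rule that a mispredicted branch at the head discards every younger entry, are assumptions made for this illustration.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Dict, Optional


@dataclass
class RobEntry:
    instr: str
    dest: Optional[str]
    value: object = None
    ready: bool = False
    mispredicted: bool = False


class ReorderBuffer:
    def __init__(self, size: int) -> None:
        self.size = size
        self.entries: Deque[RobEntry] = deque()
        self.regs: Dict[str, object] = {}  # architectural register state

    def issue(self, instr: str, dest: Optional[str]) -> Optional[RobEntry]:
        """Step 1: allocate a slot in program order; None means stall."""
        if len(self.entries) == self.size:
            return None
        entry = RobEntry(instr, dest)
        self.entries.append(entry)
        return entry

    def write_result(self, entry: RobEntry, value: object = None,
                     mispredicted: bool = False) -> None:
        """Step 3: results may arrive in any order; they only fill the ROB."""
        entry.value, entry.ready, entry.mispredicted = value, True, mispredicted

    def commit(self) -> Optional[str]:
        """Step 4: retire the head only when its result is present."""
        if not self.entries or not self.entries[0].ready:
            return None
        head = self.entries.popleft()
        if head.mispredicted:
            self.entries.clear()          # flush all younger instructions
        elif head.dest is not None:
            self.regs[head.dest] = head.value
        return head.instr


if __name__ == "__main__":
    rob = ReorderBuffer(4)
    ld = rob.issue("LD F0,0(R1)", "F0")
    add = rob.issue("ADDD F4,F0,F2", "F4")
    rob.write_result(add, 3.0)            # finishes first, cannot commit yet
    print(rob.commit(), rob.regs)
    rob.write_result(ld, 1.0)
    print(rob.commit(), rob.commit(), rob.regs)
```

Since registers change only inside `commit`, discarding speculative work is just a matter of dropping buffer entries.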
Reorder Buffer Details
[Figure: reorder buffer entry fields: Program Counter, Branch or L/W?, Exceptions?, Dest reg, Ready?, Result]
• Holds instruction type: branch, store, ALU register operation
• Holds branch valid and exception bits
  – Flush pipeline when any bit is set
• Holds dest, result and PC
  – Write results to dest at the time of commit
• A ready bit indicates if the instruction has completed execution and the value is ready
• Supplies operands between execution complete and commit
• ROB also replaces the Store Buffer
Speculative Execution Recovery
• Flush the pipeline on misprediction
  – MIPS 5-stage pipeline used flushing on taken branches
  – Where is the flush signal from? When to flush?
[Figure: block diagram with IM, Fetch Unit, Decode, Rename, Regfile, Reorder Buffer, S-buf, L-buf, two RS, DM, FU1, FU2]
Summary
• Reservation stations: implicit register renaming to larger set of registers + buffering source operands
  – Prevents registers as bottleneck
  – Avoids WAR, WAW hazards of Scoreboard
• Not limited to basic blocks when compared to static scheduling (integer unit gets ahead, beyond branches)
• Today, helps cache misses as well
  – Don’t stall for L1 Data cache miss
  – Can support memory-level parallelism
• Lasting Contributions
  – Dynamic scheduling
  – Register renaming
  – Load/store disambiguation
• 360/91 descendants are Pentium III; PowerPC 604; MIPS R10000; HP-PA 8000; Alpha 21264
Dynamic Scheduling: The Only Choice?
• Most high-performance processors today are dynamically scheduled superscalar processors
  – With deeper and n-way issue pipelines
• Other alternatives to exploit instruction-level parallelism
  – Statically scheduled superscalar
  – VLIW
  – Mixed effort: EPIC (Explicitly Parallel Instruction Computing)
  – Example: Intel Itanium processors
• Why is dynamic scheduling so popular today?
  – Technology trends: increasing transistor budget, deeper pipeline, wide issue
Limits to ILP
• Conflicting studies of the amount of parallelism available in the late 1980s and early 1990s
• How much ILP is available using existing mechanisms with increasing HW budgets?
• Do we need to invent new HW/SW mechanisms to keep on the processor performance curve?
  – Intel MMX, SSE (Streaming SIMD Extensions): 64 bit ints
  – Intel SSE2: 128 bit, including 2 64-bit FP per clock
  – Motorola AltiVec: 128 bit ints and FPs
  – Supersparc Multimedia ops, etc.
Limits to ILP
• Assumptions for ideal/perfect machine to start:
  1. Register renaming: infinite virtual registers => all register WAW & WAR hazards are avoided
  2. Branch prediction: perfect; no mispredictions
  3. Jump prediction: all jumps perfectly predicted
     2 & 3 => machine with perfect speculation & an unbounded buffer of instructions available
  4. Memory-address alias analysis: addresses are known & a load can be moved before a store provided the addresses are not equal
• Also: unlimited number of instructions issued/clock cycle; perfect caches; 1 cycle latency for all instructions (FP *,/)
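Under these assumptions only true (RAW) data dependences constrain the schedule, so the ideal IPC of a trace is its instruction count divided by the length of its longest dependence chain. The sketch below shows that calculation on a register-only trace; the function `ideal_ipc` and the trace format of `(dest, sources)` pairs are invented for illustration, and memory dependences are ignored.

```python
from typing import Dict, Optional, Sequence, Tuple

TraceEntry = Tuple[Optional[str], Tuple[str, ...]]  # (dest, source registers)


def ideal_ipc(trace: Sequence[TraceEntry]) -> float:
    """Dataflow-limited IPC: unlimited issue width, perfect renaming and
    prediction, and 1-cycle latency for every instruction."""
    if not trace:
        return 0.0
    ready_cycle: Dict[str, int] = {}  # cycle at which each value is available
    last_finish = 0
    for dest, srcs in trace:
        # An instruction starts as soon as all of its producers have finished.
        start = max((ready_cycle.get(src, 0) for src in srcs), default=0)
        finish = start + 1
        if dest is not None:
            # Renaming: a new write never waits for older readers or writers.
            ready_cycle[dest] = finish
        last_finish = max(last_finish, finish)
    return len(trace) / last_finish


if __name__ == "__main__":
    trace = [("r1", ()), ("r2", ()), ("r3", ("r1", "r2")), ("r1", ())]
    print(ideal_ipc(trace))
```

In this usage example the second write to `r1` neither waits for nor delays the read by `r3`, which is the effect of assumption 1.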
Study Strategy
• First, observe ILP on the ideal machine using simulation
• Then, observe how ideal ILP decreases when we
  – Add branch impact
  – Add register impact
  – Add memory address alias impact
• More restrictions in practice
  – Functional unit latency: floating point
  – Memory latency: cache hit more than one cycle, cache miss penalty
Upper Limit to ILP: Ideal Machine (Figure 3.35, page 242)
[Figure: bar chart of instruction issues per cycle (IPC) by program]

Program    IPC
gcc        54.8
espresso   62.6
li         17.9
fpppp      75.2
doduc      118.7
tomcatv    150.1

• Integer: 18 - 60
• FP: 75 - 150
More Realistic HW: Branch Impact
• Change from infinite window to a window of 2000 instructions and a maximum issue of 64 instructions per clock cycle
• FP: 15 - 45
• Integer: 6 - 12
[Figure: IPC by program for each branch predictor: Perfect, Tournament, BHT (512), Profile, No prediction]
More Realistic HW: Renaming Register Impact
• Change: 2000 instr window, 64 instr issue, 8K 2-level prediction
• FP: 5 - 49
• Integer: 5 - 15
[Figure: IPC by program for each number of renaming registers: Infinite, 256, 128, 64, 32, None]