Learn about pipelining challenges in computer architecture, including structural, data, and control hazards. Discover solutions for stalls and the impact of branching on CPI.
Lecture 15: Pipelining: Branching & Complications Michael B. Greenwald Computer Architecture CIS 501 Fall 1999
Administration • Solutions to midterm up on web page. • HW #4 posted on web page
Visualizing Pipelining (Figure 3.3, page 133) [Figure: pipeline diagram with time in clock cycles across and instruction order down; each instruction passes through Instruction Memory, Register Bank, ALU, Data Memory, Register Bank] • Instruction Level Parallelism (ILP) • Which resources are active during each clock cycle?
It's Not That Easy for Computers • Decreasing cycle time increases demands on components • Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle • Structural hazards: HW cannot support this combination of instructions • Data hazards: instruction depends on the result of a prior instruction still in the pipeline • Control hazards: pipelining of branches & other instructions • Common solution is to stall the pipeline until the hazard is resolved, inserting one or more “bubbles” in the pipeline
Visualizing Stalls [Figure: pipeline diagram, time in clock cycles vs. instruction order: Load, Instr 1, Instr 2, stall, Instr 3] • Simplest way to view a stall is as an “empty instruction”. This delays the start of the next real instruction. • Point is to detect resource conflicts • Inaccurate when the stall isn’t introduced at the beginning of the pipeline
Visualizing Stalls [Figure: pipeline diagram, time in clock cycles vs. instruction order: Load, Instr 1, Instr 2, stall, Instr 3] • Alternative is a bubble propagating through all later instructions (stop all earlier stages).
Visualizing Stalls [Figure: pipeline diagram, time in clock cycles vs. instruction order, showing a pipeline interlock for the sequence below]
lw r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
or r8,r1,r9
Visualizing Stalls • What if multiple cycle stall? CC1 CC2 CC3 CC4 CC5 CC6 CC7 CC8 CC9 CC10 CC11 IF ID1 ID2 EX MEM WB IF ID1 stall stall IF ID1 ID2 EX MEM WB IF stall stall stall IF ID1 ID2 EX MEM IF ID1 ID2 EX IF ID1 ID2
Branch Stall Impact • Recall: if CPI = 1, 30% of instructions are branches, and each branch stalls 3 cycles, then new CPI = 1 + 0.30 x 3 = 1.9! • Key point: the penalty to throughput is charged only to the instruction that first caused the stall, even though all subsequent instructions are delayed equally • We only care about throughput: instruction completion time relative to the previous instruction • If we are interested in average instruction time (latency), then the penalty applies to all subsequent instructions, and we must know at which stage the stall begins to calculate how many subsequent instructions it affects.
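The CPI figure on this slide can be checked with a few lines of Python. This is only a sketch; the function name `cpi_with_branch_stalls` is made up for illustration, and it assumes every branch pays the full stall.

```python
def cpi_with_branch_stalls(base_cpi, branch_fraction, stall_cycles):
    """Average CPI when every branch adds a fixed number of stall cycles."""
    return base_cpi + branch_fraction * stall_cycles

# Numbers from the slide: CPI = 1, 30% branches, 3-cycle stall.
new_cpi = cpi_with_branch_stalls(1.0, 0.30, 3)
print(f"CPI with branch stalls: {new_cpi:.2f}")
```

Only the branches themselves contribute stall cycles to the average, which is the throughput view described in the bullets above.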
Speed Up Equation for Pipelining
$$\text{Speedup from pipelining} = \frac{\text{Avg Instr Time}_{\text{unpipelined}}}{\text{Avg Instr Time}_{\text{pipelined}}} = \frac{\text{CPI}_{\text{unpipelined}} \times \text{Clock Cycle}_{\text{unpipelined}}}{\text{CPI}_{\text{pipelined}} \times \text{Clock Cycle}_{\text{pipelined}}} = \frac{\text{CPI}_{\text{unpipelined}}}{\text{CPI}_{\text{pipelined}}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}$$
Basic speedup equation. Two ways of looking at this: • For a given clock cycle, pipelining decreases CPI • Treat CPI as 1 for both machines; pipelining decreases clock cycle time. Yields 2 different speedup equations, depending on the definition of cycle. Always be consistent within a problem!
Speed Up Equation for Pipelining [Figure: six overlapped instructions, each passing through stages F D X M W, with cycles numbered 1 to 5]
$$\text{Speedup from pipelining} = \frac{\text{CPI}_{\text{unpipelined}}}{\text{CPI}_{\text{pipelined}}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}$$
First view: for a given clock cycle, pipelining decreases CPI. • In what sense is CPI reduced? • Each cycle, pipelineDepth instructions are active, so if each instruction takes n cycles, CPI = n/pipelineDepth. • CPI counts cycles per instruction completion.
Speed Up Equation for Pipelining: Impact of Stalls • First view: for a given clock cycle, pipelining decreases CPI.
$$\text{CPI}_{\text{unpipelined}} = \text{Ideal CPI} \times \text{Pipeline depth}$$
$$\text{CPI}_{\text{pipelined}} = \text{Ideal CPI} + \text{Pipeline stall clock cycles per instr}$$
$$\text{Speedup} = \frac{\text{Ideal CPI} \times \text{Pipeline depth}}{\text{Ideal CPI} + \text{Pipeline stall CPI}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}$$
• Simplification: all instructions have the same length. Cycles per stage: may as well be 1.
Speed Up Equation for Pipelining: Impact of Stalls • First view: for a given clock cycle, pipelining decreases CPI.
$$\text{Speedup} = \frac{\text{Ideal CPI} \times \text{Pipeline depth}}{\text{Ideal CPI} + \text{Pipeline stall CPI}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}$$
With Ideal CPI = 1:
$$\text{Speedup} = \frac{1 \times \text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}$$
• Assume balance • Clock Cycle (unpipelined) + penalty
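The CPI-based form of the speedup equation can be evaluated directly. The sketch below is illustrative only: the function name, the `clock_ratio` parameter (unpipelined clock cycle divided by pipelined clock cycle), and the example depth of 5 are assumptions, not part of the lecture.

```python
def speedup_cpi_view(pipeline_depth, stall_cpi, ideal_cpi=1.0, clock_ratio=1.0):
    """Speedup = (ideal CPI * depth) / (ideal CPI + stall CPI) * clock ratio.

    clock_ratio is Clock Cycle(unpipelined) / Clock Cycle(pipelined);
    1.0 means the two machines are assumed to have the same cycle time.
    """
    return (ideal_cpi * pipeline_depth) / (ideal_cpi + stall_cpi) * clock_ratio

# Assumed example: 5-stage pipeline, no stalls vs. 0.9 stall cycles per instruction.
ideal = speedup_cpi_view(pipeline_depth=5, stall_cpi=0.0)
stalled = speedup_cpi_view(pipeline_depth=5, stall_cpi=0.9)
print(f"ideal speedup:   {ideal:.2f}")
print(f"with stalls:     {stalled:.2f}")
```

With no stalls and equal clocks the speedup equals the pipeline depth; every stall cycle per instruction grows the denominator and pulls the speedup below that bound.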
Speed Up Equation for Pipelining [Figure: five overlapped instructions, each passing through stages F D X M W, with cycles numbered 1 to 5]
$$\text{Speedup from pipelining} = \frac{\text{CPI}_{\text{unpipelined}}}{\text{CPI}_{\text{pipelined}}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}$$
Second view: treat CPI as 1 for both machines; pipelining decreases clock cycle time. • In what sense is CPI 1 for both machines? • Amount of time to complete an instruction • Pipelining’s main effect is to decrease clock cycle time.
Speed Up Equation for Pipelining: Impact of Stalls • Second view: treat CPI as 1 for both machines; pipelining decreases clock cycle time.
$$\text{Clock Cycle}_{\text{pipelined}} = \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Pipeline depth}} + \text{penalty}$$
$$\text{Speedup} = \frac{\text{CPI}_{\text{unpipelined}}}{\text{CPI}_{\text{pipelined}}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Pipeline depth}} + \text{penalty}}$$
$$\text{Speedup} = \frac{1}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Pipeline depth}} + \text{penalty}}$$
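The clock-cycle form of the equation can be sketched the same way. The function name and the example numbers (a 10 ns unpipelined cycle, depth 5, 0.5 ns penalty) are hypothetical values chosen only to exercise the formula.

```python
def speedup_clock_view(clock_unpipelined, pipeline_depth, stall_cpi, penalty=0.0):
    """Speedup when CPI is treated as 1 on both machines.

    The pipelined clock is the unpipelined clock divided by the depth,
    plus a fixed per-cycle penalty.
    """
    clock_pipelined = clock_unpipelined / pipeline_depth + penalty
    return (1.0 / (1.0 + stall_cpi)) * (clock_unpipelined / clock_pipelined)

# Hypothetical numbers: 10 ns unpipelined cycle, 5 stages.
no_overhead = speedup_clock_view(10.0, 5, stall_cpi=0.0)
with_overhead = speedup_clock_view(10.0, 5, stall_cpi=0.9, penalty=0.5)
print(f"no stalls, no penalty: {no_overhead:.2f}")
print(f"stalls and penalty:    {with_overhead:.2f}")
```

Both views give the pipeline depth as the upper bound; they differ only in whether the gain is booked as lower CPI or as a shorter clock cycle, which is why a problem should stick to one definition of cycle throughout.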
Control Hazards • When a branch is executed it may (or may not!) change the PC to something other than Current PC+4. • Simple solution: • After detecting a branch instruction, stall the pipeline until the target address is computed. • This introduces 3 cycles of stalls. • Different implementation than a data stall, since the IF cycle must be repeated.
Pipelined DLX Datapath (Figure 3.4, page 137) [Figure: datapath stages Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc.; Memory Access; Write Back, separated by pipeline registers; annotations: “Need result”, “Latch for later”] • Data stationary control • local decode for each instruction phase / pipeline stage
Four Branch Hazard Alternatives #1: Stall until branch direction is clear #2: Predict Branch Not Taken • Execute successor instructions in sequence • “Flush” instructions in pipeline if branch actually taken • Advantage of late pipeline state update, since don’t need to “Undo” • 47% DLX branches not taken on average • PC+4 already calculated, so use it to get next instruction #3: Predict Branch Taken • 53% DLX branches taken on average • But haven’t calculated branch target address in DLX • DLX still incurs 1 cycle branch penalty
Four Branch Hazard Alternatives #4: Delayed Branch • Define branch to take place AFTER a following instruction:
branch instruction
sequential successor 1
sequential successor 2
........
sequential successor n
branch target if taken
(sequential successors 1 through n form the branch delay of length n)
• 1 slot delay allows proper decision and branch target address in 5 stage pipeline • DLX uses this
Delayed Branch • Where to get instructions to fill the branch delay slot? • Before the branch instruction • From the target address: only valuable when branch taken • From fall through: only valuable when branch not taken • Cancelling branches allow more slots to be filled, because the slot instruction becomes a no-op if the prediction is wrong. • Compiler effectiveness for a single branch delay slot: • Fills about 60% of branch delay slots • About 80% of instructions executed in branch delay slots are useful in computation • About 50% (60% x 80%) of slots usefully filled
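The slot-filling arithmetic above works out as follows. The last step, treating every slot that is not usefully filled as one wasted cycle, is an assumption added here for illustration rather than something stated on the slide.

```python
FILL_RATE = 0.60     # fraction of delay slots the compiler fills (from the slide)
USEFUL_RATE = 0.80   # fraction of filled slots doing useful work (from the slide)

usefully_filled = FILL_RATE * USEFUL_RATE

# Assumption: a slot that is empty or not useful wastes exactly one cycle.
effective_penalty = 1.0 * (1.0 - usefully_filled)

print(f"usefully filled slots: {usefully_filled:.0%}")
print(f"effective branch penalty: {effective_penalty:.2f} cycles")
```

Under that assumption a single delay slot leaves roughly half a cycle of penalty per branch.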
Evaluating Branch Alternatives
Scheduling scheme    Branch penalty   CPI    Speedup v. unpipelined   Speedup v. stall
Stall pipeline       3                1.42   3.5                      1.0
Predict taken        1                1.14   4.4                      1.26
Predict not taken    1                1.09   4.5                      1.29
Delayed branch       0.5              1.07   4.6                      1.31
Conditional & unconditional branches = 14% of instructions; 65% change the PC
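The CPI column can be recomputed from the branch statistics under the table. This sketch assumes an ideal CPI of 1, a pipeline depth of 5, and equal clock cycles; it also assumes that predict-not-taken pays its penalty only on the 65% of branches that change the PC. The speedup columns it prints agree with the table to within about 0.1, since the table's values appear to have been rounded.

```python
BRANCH_FREQ = 0.14      # conditional + unconditional branches (from the slide)
TAKEN_FRACTION = 0.65   # fraction of branches that change the PC (from the slide)
PIPELINE_DEPTH = 5      # assumption: 5-stage pipeline

# scheme -> (branch penalty in cycles, fraction of branches that pay it)
SCHEMES = {
    "Stall pipeline":    (3.0, 1.0),
    "Predict taken":     (1.0, 1.0),
    "Predict not taken": (1.0, TAKEN_FRACTION),
    "Delayed branch":    (0.5, 1.0),
}

def branch_cpi(penalty, paying_fraction):
    """CPI = 1 + branch frequency * fraction paying the penalty * penalty."""
    return 1.0 + BRANCH_FREQ * paying_fraction * penalty

cpis = {name: branch_cpi(*params) for name, params in SCHEMES.items()}
stall_cpi = cpis["Stall pipeline"]

for name, cpi in cpis.items():
    print(f"{name:18s} CPI={cpi:.2f} "
          f"v. unpipelined={PIPELINE_DEPTH / cpi:.2f} "
          f"v. stall={stall_cpi / cpi:.2f}")
```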
Compiler “Static” Prediction of Taken/Untaken Branches • Improves strategy for placing instructions in the delay slot • Two strategies • Backward branch predict taken, forward branch not taken • Profile-based prediction: record branch behavior, predict branch based on prior run [Figure: chart comparing “Always taken” with “Taken backwards, Not Taken Forwards”; annotation: common case both directions]
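Both strategies are simple enough to sketch. The function names, the trace format, and the addresses below are hypothetical; a real compiler would work on its own intermediate representation and profile data.

```python
from collections import Counter

def predict_by_direction(branch_pc, target_pc):
    """Backward branches (target at a lower address) are predicted taken."""
    return target_pc < branch_pc

def predict_by_profile(profile):
    """profile: iterable of (branch_pc, taken) pairs recorded on a prior run.

    Returns a dict mapping each branch PC to its majority direction
    (ties are predicted taken).
    """
    taken = Counter()
    total = Counter()
    for pc, was_taken in profile:
        total[pc] += 1
        taken[pc] += int(was_taken)
    return {pc: taken[pc] * 2 >= total[pc] for pc in total}

# Hypothetical trace from a prior run: (branch address, was it taken?)
trace = [(0x40, True), (0x40, True), (0x40, False), (0x80, False), (0x80, False)]
predictions = predict_by_profile(trace)

print(predict_by_direction(branch_pc=0x40, target_pc=0x10))  # backward branch
print(predict_by_direction(branch_pc=0x40, target_pc=0x60))  # forward branch
print(predictions)
```

The direction heuristic needs no profiling run, while the profile-based version depends on the prior run being representative of later ones.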
Evaluating Static Branch Prediction • Misprediction rate ignores the frequency of branches • “Instructions between mispredicted branches” is a better metric since it weights by frequency.
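A small sketch of the metric shows why it is more informative. The two programs and their numbers are invented for illustration: both have the same misprediction rate but different branch frequencies.

```python
def instructions_between_mispredictions(branch_freq, misprediction_rate):
    """Average number of instructions executed per mispredicted branch."""
    mispredicts_per_instr = branch_freq * misprediction_rate
    if mispredicts_per_instr == 0:
        return float("inf")  # no branches, or no mispredictions
    return 1.0 / mispredicts_per_instr

# Hypothetical programs with identical 10% misprediction rates.
sparse = instructions_between_mispredictions(branch_freq=0.05, misprediction_rate=0.10)
dense = instructions_between_mispredictions(branch_freq=0.20, misprediction_rate=0.10)

print(f"5% branches:  {sparse:.0f} instructions between mispredictions")
print(f"20% branches: {dense:.0f} instructions between mispredictions")
```

The branch-heavy program hits a misprediction four times as often even though the misprediction rates are identical.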
Interrupts • Synchronous vs. asynchronous • User requested vs. coerced • Maskable vs. unmaskable • Within vs. between instructions • Resume vs. terminate • External interrupts are easier to implement (asynchronous, maskable, between instructions), because they can be handled after completion of all instructions in the pipe. They are harder to perform well (coerced), because they are not predictable (like a branch, with no support).
Pipelining Complications • Interrupts: 5 instructions executing in 5 stage pipeline • How to stop the pipeline? • How to restart the pipeline? • Who caused the interrupt?
Stage   Problem interrupts occurring
IF      Page fault on instruction fetch; misaligned memory access; memory-protection violation
ID      Undefined or illegal opcode
EX      Arithmetic interrupt
MEM     Page fault on data fetch; misaligned memory access; memory-protection violation
Pipelining Complications • “Precise exceptions”: • all instructions before the exception complete before the exception is taken, and all subsequent instructions can be restarted from scratch. • Performance penalty (> 10x on some machines) • Consequently: 2 modes (precise for debugging). • Flushing pipeline so that instr. is restartable • Delayed branch complicates, since can’t just save one PC, since pipeline is not sequentially ordered. May need to save a PC per pipeline stage (of delayed branch) + 1. • Must retrieve source operands • DST = SRC • SRC may be overwritten by other instructions.
Pipelining Complications • Simultaneous exceptions in more than one pipeline stage, e.g., • Load with data page fault in MEM stage • Add with instruction page fault in IF stage • Add fault will happen BEFORE load fault • Solution #1 • Interrupt status vector per instruction • Defer check til last stage, kill state update if exception • Solution #2 • Interrupt ASAP • Restart everything that is incomplete Another advantage for state update late in pipeline!
Implementing Precise Exceptions • Straightforward for DLX integer pipeline • Interrupt Status Vector (ISV) is a special register • ISV stored in pipeline latch • At time of any interrupt/exception all writes are disabled. • Actual check for interrupt and trap to interrupt handler deferred until WB. • ISV is checked, and interrupts are posted in instruction order (unpipelined order).
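The deferred check can be illustrated with a toy simulation. This is a rough sketch, not the DLX hardware: the `Instr` class, the `run` function, and the one-instruction-per-cycle issue model are all assumptions made for illustration. It replays the earlier example of a load that faults in MEM followed by an add that faults in IF.

```python
from dataclasses import dataclass, field

STAGES = ["IF", "ID", "EX", "MEM", "WB"]

@dataclass
class Instr:
    name: str
    faults: dict = field(default_factory=dict)  # stage -> exception (hypothetical)
    isv: list = field(default_factory=list)     # status vector carried down the pipe

def run(program):
    """Issue one instruction per cycle; return (raised, posted) event lists."""
    raised = []   # faults in the order they occur in time
    posted = []   # the exception actually taken, checked only at WB
    for cycle in range(len(program) + len(STAGES) - 1):
        for i, ins in enumerate(program):       # older instructions first
            s = cycle - i
            if not 0 <= s < len(STAGES):
                continue
            stage = STAGES[s]
            if stage in ins.faults:
                # Record the fault in the ISV instead of trapping immediately.
                ins.isv.append(ins.faults[stage])
                raised.append((cycle, ins.name, ins.faults[stage]))
            if stage == "WB" and ins.isv:
                # Trap here; younger instructions are discarded and restarted.
                posted.append((cycle, ins.name, ins.isv[0]))
                return raised, posted
    return raised, posted

program = [
    Instr("lw", faults={"MEM": "data page fault"}),
    Instr("add", faults={"IF": "instruction page fault"}),
]
raised, posted = run(program)
print("raised in time order:", raised)
print("posted at WB:        ", posted)
```

The add's fault is raised earlier in time, but because the check waits until WB, the load's fault is the one posted, which matches program order.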
Pipelining Complications • Complex Addressing Modes and Instructions • Address modes: Autoincrement causes register change during instruction execution • Interrupts? Need to restore register state • Adds WAR and WAW hazards since writes no longer last stage • Memory-Memory Move Instructions • Must be able to handle multiple page faults • Long-lived instructions: partial state save on interrupt • Condition Codes: set by previous instruction (what if interrupt in between?)
Pipelining Complications: When not all instructions take the same time... • Floating Point: long execution time • Impractical to require that FP operations complete in one clock cycle: • Slow clock • So we have at least one stage of the pipeline that will complete with different latencies for different instructions.