Pipelining

Pipelining
Between 411 problem sets, I haven’t had a minute to do laundry. Now that’s what I call dirty laundry!
Read Chapter 4.5-4.8

Forget 411… Let’s Solve a “Relevant Problem”
INPUT: dirty laundry
OUTPUT: 4 more weeks
Device: Washer. Function: Fill, Agitate, Spin. WasherPD = 30 mins
Device: Dryer. Function: Heat, Spin. DryerPD = 60 mins

One Load at a Time
Everyone knows that the real reason that UNC students put off doing laundry so long is *not* because they procrastinate, are lazy, or even have better things to do. The fact is, doing laundry one load at a time is not smart. (Sorry Mom, but you were wrong about this one!)
Step 1: Washer. Step 2: Dryer.
Total = WasherPD + DryerPD = 90 mins

Doing N Loads of Laundry
Here’s how they do laundry at Duke, the “combinational” way. (Actually, this is just an urban legend. No one at Duke actually does laundry. The butlers all arrive on Wednesday morning, pick up the dirty laundry, and return it all pressed and starched by dinner.)
Step 1, Step 2, Step 3, Step 4, …: each load is washed and then dried before the next load starts.
Total = N*(WasherPD + DryerPD) = N*90 mins

Doing N Loads… the UNC way
UNC students “pipeline” the laundry process. That’s why we wait!
Step 1, Step 2, Step 3, …: while one load is in the dryer, the next load is in the washer.
Total = N * Max(WasherPD, DryerPD) = N*60 mins
Actually, it’s more like N*60 + 30 if we account for the startup transient correctly. When doing pipeline analysis, we’re mostly interested in the “steady state” where we assume we have an infinite supply of inputs.

Recall Our Performance Measures
• Latency: The delay from when an input is established until the output associated with that input becomes valid.
  (Duke Laundry = 90 mins)
  (UNC Laundry = 120 mins)
  Assuming that the wash is started as soon as possible and waits (wet) in the washer until the dryer is available.
• Throughput: The rate at which inputs or outputs are processed.
  (Duke Laundry = 1/90 outputs/min)
  (UNC Laundry = 1/60 outputs/min)
Even though we increase latency, it takes less time per load.
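The laundry numbers can be checked with a small simulation. This is a minimal sketch in Python; the function names are made up for illustration. It assumes, as the slide does, that each wash starts as soon as the washer is empty and that the wet load waits in the washer until the dryer is free.

```python
WASHER_PD = 30  # minutes, from the slide
DRYER_PD = 60   # minutes, from the slide


def sequential_laundry(n):
    """One load at a time: every load costs washer time plus dryer time."""
    return n * (WASHER_PD + DRYER_PD)


def pipelined_laundry(n):
    """Overlap washing of the next load with drying of the current one.

    Returns a list of (wash_start, dry_done) times in minutes, one per load.
    """
    washer_free = 0
    dryer_free = 0
    schedule = []
    for _ in range(n):
        wash_start = washer_free
        wash_done = wash_start + WASHER_PD
        # The wet load waits in the washer until the dryer is available.
        dry_start = max(wash_done, dryer_free)
        dry_done = dry_start + DRYER_PD
        washer_free = dry_start  # washer is emptied when drying begins
        dryer_free = dry_done
        schedule.append((wash_start, dry_done))
    return schedule


if __name__ == "__main__":
    n = 4
    schedule = pipelined_laundry(n)
    print("sequential total:", sequential_laundry(n), "mins")
    print("pipelined total: ", schedule[-1][1], "mins")
    print("per-load latency:", [done - start for start, done in schedule])
```

The first load sees only the 90-minute latency; every later load sees the steady-state 120 minutes, and the total comes out to N*60 + 30.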

Okay, Back to Circuits…
[Figure: input X feeds blocks F and G; block H combines F(X) and G(X) to produce P(X). Timing diagram of X, F(X), G(X), P(X).]
For combinational logic: latency = tPD, throughput = 1/tPD.
We can’t get the answer faster, but are we making effective use of our hardware at all times?
F & G are “idle”, just holding their outputs stable while H performs its computation.

Pipelined Circuits
Use registers to hold H’s input stable! Now F & G can be working on input Xi+1 while H is performing its computation on Xi. We’ve created a 2-stage pipeline: if we have a valid input X during clock cycle j, P(X) is valid during clock j+2.
[Figure: F (15) and G (20) feed pipeline registers, followed by H (25) and an output register producing P(X)]
Suppose F, G, H have propagation delays of 15, 20, 25 ns and we are using ideal zero-delay registers (ts = 0, tpd = 0):
                     latency (ns)   throughput (1/ns)
  unpipelined        45             1/45
  2-stage pipeline   50 (worse)     1/25 (better)
Pipelining uses registers to improve the throughput of combinational circuits.
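The table entries follow from two rules: the clock period is set by the slowest stage, and the latency is the clock period times the number of stages. Here is a minimal sketch, assuming ideal zero-delay registers as on the slide; the function names and the list-of-lists encoding of the circuit are illustrative choices, not part of the original.

```python
def unpipelined(stages):
    """stages: list of stages, each a list of block delays (ns) that run in parallel.

    Returns (latency_ns, throughput_per_ns) for the purely combinational circuit.
    """
    latency = sum(max(stage) for stage in stages)
    return latency, 1 / latency


def pipelined(stages):
    """Same circuit with an ideal register after every stage."""
    clock = max(max(stage) for stage in stages)  # slowest stage sets the clock
    latency = clock * len(stages)
    return latency, 1 / clock


if __name__ == "__main__":
    # F (15 ns) and G (20 ns) work in parallel, then H (25 ns).
    circuit = [[15, 20], [25]]
    print("unpipelined:", unpipelined(circuit))
    print("2-stage:    ", pipelined(circuit))
```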

Pipeline Diagrams
[Figure: the 2-stage pipeline, with F (15) and G (20) feeding registers, followed by H (25) and an output register producing P(X)]
  Clock cycle    i      i+1      i+2        i+3        …
  Input          Xi     Xi+1     Xi+2       Xi+3       …
  F Reg                 F(Xi)    F(Xi+1)    F(Xi+2)    …
  G Reg                 G(Xi)    G(Xi+1)    G(Xi+2)    …
  H Reg                          H(Xi)      H(Xi+1)    H(Xi+2)
(Rows are pipeline stages; columns are clock cycles.)
The results associated with a particular set of input data move diagonally through the diagram, progressing through one pipeline stage each clock cycle.
This is an example of parallelism. At any instant we are computing 2 results.
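A pipeline diagram like this one can be generated mechanically: the register after a stage holds, in cycle c, the result for the input that arrived a fixed number of cycles earlier. The sketch below is one way to do it; the function names are hypothetical, and inputs are numbered X0, X1, … instead of Xi, Xi+1, ….

```python
def cell(label, index, num_inputs):
    """Text for one diagram cell, or '' if no input occupies it."""
    if not 0 <= index < num_inputs:
        return ""
    return f"{label}(X{index})" if label else f"X{index}"


def pipeline_diagram(num_inputs, num_cycles):
    """Return {row name: list of cells, one per clock cycle}."""
    rows = {"Input": [], "F Reg": [], "G Reg": [], "H Reg": []}
    for cycle in range(num_cycles):
        rows["Input"].append(cell("", cycle, num_inputs))
        # F and G registers are loaded one cycle after the input arrives.
        rows["F Reg"].append(cell("F", cycle - 1, num_inputs))
        rows["G Reg"].append(cell("G", cycle - 1, num_inputs))
        # H register is loaded two cycles after the input arrives.
        rows["H Reg"].append(cell("H", cycle - 2, num_inputs))
    return rows


if __name__ == "__main__":
    for name, cells in pipeline_diagram(num_inputs=3, num_cycles=5).items():
        print(f"{name:6}", " ".join(f"{c:8}" for c in cells))
```

Reading any one input across the rows traces the diagonal described above.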

Pipelining Summary
• Advantages:
  - Higher throughput than combinational system
  - Different parts of the logic work on different parts of the problem…
• Disadvantages:
  - Generally, increases latency
  - Only as good as the *weakest* link (often called the pipeline’s BOTTLENECK)

Review of CPU Performance
MIPS = Freq / CPI
  MIPS = Millions of Instructions/Second
  Freq = Clock Frequency, MHz
  CPI = Clocks per Instruction
• To Increase MIPS:
  1. DECREASE CPI.
     - RISC simplicity reduces CPI to 1.0.
     - CPI below 1.0? State-of-the-art multiple instruction issue
  2. INCREASE Freq.
     - Freq limited by delay along longest combinational path; hence
     - PIPELINING is the key to improving performance.
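The relation MIPS = Freq / CPI is easy to play with in code. In this small sketch the clock frequencies and CPI values are hypothetical, picked only to show the two levers.

```python
def mips(freq_mhz, cpi):
    """Millions of instructions per second, from MIPS = Freq / CPI."""
    return freq_mhz / cpi


if __name__ == "__main__":
    # Hypothetical numbers, not taken from any particular processor.
    print(mips(freq_mhz=50, cpi=1.0))   # baseline
    print(mips(freq_mhz=50, cpi=0.5))   # lower CPI (multiple issue)
    print(mips(freq_mhz=100, cpi=1.0))  # higher clock frequency (pipelining)
```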

Where Are the Bottlenecks?
[Figure: unpipelined miniMIPS datapath: PC, Instruction Memory, Register File, ALU, Data Memory, and Control Logic]
• Pipelining goal:
  - Break LONG combinational paths
  - Put memories and the ALU in separate stages

miniMIPS Timing
• Different instructions use various parts of the data path.
  Program execution order    Time
  add $4, $5, $6             14 nS
  beq $1, $2, 40             14 nS
  lw $3, 30($0)              20 nS
  jal 20000                  9 nS
  sw $2, 20($4)              19 nS
1 instr every 14 nS, 14 nS, 20 nS, 9 nS, 19 nS
Component delays (figure legend):
  Instruction Fetch      6 nS
  Instruction Decode     2 nS
  Register Prop Delay    2 nS
  ALU Operation          5 nS
  Branch Target          4 nS
  Data Access            6 nS
  Register Setup         1 nS
The above scenario is possible only if the system could vary the clock period based on the instruction being executed. This leads to complicated timing generation, and, in the end, slower systems, since it is not very compatible with pipelining!

Uniform miniMIPS Timing
• With a fixed clock period, we have to allow for the worst case.
1 instr EVERY 20 nS
[Figure: the same five instructions (add, beq, lw, jal, sw), each now occupying one 20 nS clock period]
Component delays (figure legend): Instruction Fetch 6 nS, Instruction Decode 2 nS, Register Prop Delay 2 nS, ALU Operation 5 nS, Branch Target 4 nS, Data Access 6 nS, Register Setup 1 nS
Isn’t the net effect just a slower CPU?
By accounting for the “worst case” path (i.e. allowing time for each possible combination of operations) we can implement a fixed clock period. This simplifies timing generation, enforces a uniform processing order, and allows for pipelining!
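To put a number on "slower", compare the two clocking schemes for the five-instruction example, using the per-instruction times given on the miniMIPS Timing slide (14, 14, 20, 9, 19 nS). This is a minimal sketch; the dictionary and function names are illustrative.

```python
# Per-instruction path delays in ns, from the miniMIPS Timing slide.
INSTR_TIME_NS = {"add": 14, "beq": 14, "lw": 20, "jal": 9, "sw": 19}


def variable_clock_time(program):
    """Total time if the clock period could adapt to each instruction."""
    return sum(INSTR_TIME_NS[op] for op in program)


def fixed_clock_time(program):
    """Total time with one clock period sized for the worst-case instruction."""
    period = max(INSTR_TIME_NS.values())
    return period * len(program)


if __name__ == "__main__":
    program = ["add", "beq", "lw", "jal", "sw"]
    print("variable clock:", variable_clock_time(program), "ns")
    print("fixed clock:   ", fixed_clock_time(program), "ns")
```

For this program the fixed clock costs 100 ns instead of 76 ns, which is the price paid for a uniform period that can later be pipelined.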

Goal: 5-Stage Pipeline
GOAL: Maintain (nearly) 1.0 CPI, but increase clock speed to barely include slowest components (mems, regfile, ALU)
APPROACH: structure processor as 5-stage pipeline:
  IF     Instruction Fetch stage: Maintains PC, fetches one instruction per cycle and passes it to
  ID/RF  Instruction Decode/Register File stage: Decode control lines and select source operands
  ALU    ALU stage: Performs specified operation, passes result to
  MEM    Memory stage: If it’s a lw, use ALU result as an address, pass mem data (or ALU result if not lw) to
  WB     Write-Back stage: writes result back into register file.

5-Stage miniMIPS
[Figure: 5-stage miniMIPS datapath with stages Instruction Fetch, Register File, ALU, Memory, and Write Back, separated by pipeline registers (PCREG, IRREG; PCALU, IRALU, A, B, WDALU; PCMEM, IRMEM, YMEM, WDMEM; PCWB, IRWB, YWB)]
• Omits some details
• NO bypass or interlock logic
• Address is available right after instruction enters Memory stage
• Data is needed just before rising clock edge at end of Write Back stage
• The data memory therefore has almost 2 clock cycles

Pipelining
• Improve performance by increasing instruction throughput
• Ideal speedup is number of stages in the pipeline. Do we achieve this?
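One way to see why the ideal speedup is only approached, never reached, is to count cycles. The sketch below assumes perfectly balanced stages and no hazards (both idealizations): a k-stage pipeline needs k cycles to fill and then finishes one instruction per cycle, while the unpipelined machine is charged k stage-times per instruction. The function name is made up for this example.

```python
def pipeline_speedup(num_instructions, num_stages=5):
    """Speedup over an unpipelined machine, in units of one stage delay.

    Assumes balanced stages and no stalls or flushes.
    """
    unpipelined_time = num_instructions * num_stages
    pipelined_time = num_stages + num_instructions - 1  # fill, then 1 per cycle
    return unpipelined_time / pipelined_time


if __name__ == "__main__":
    for n in (1, 5, 100, 10_000):
        print(n, round(pipeline_speedup(n), 3))
```

Even under these generous assumptions the speedup only tends toward the number of stages as the instruction count grows; hazards reduce it further.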

Pipelining
• What makes it easy?
  - all instructions are the same length
  - just a few instruction formats
  - memory operands appear only in loads and stores
• What makes it hard?
  - structural hazards: suppose we had only one memory
  - control hazards: need to worry about branch instructions
  - data hazards: an instruction depends on a previous instruction
• Individual instructions still take the same number of cycles
• But we’ve improved the throughput by increasing the number of simultaneously executing instructions

Structural Hazards

Data Hazards
• Problem with starting next instruction before first is finished
  - dependencies that “go backward in time” are data hazards

Software Solution
• Have compiler guarantee no hazards
• Where do we insert the “nops”?
    sub $2, $1, $3
    and $12, $2, $5
    or $13, $6, $2
    add $14, $2, $2
    sw $15, 100($2)
• Problem: this really slows us down!
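The compiler's job here can be sketched as a scan over the instruction list: whenever an instruction reads a register written too recently, pad with nops. The code below is a rough illustration, not a real assembler pass. Instructions are hand-encoded as (op, destination, sources) tuples, and `gap`, the number of slots that must separate a writer from a reader, is an assumed parameter that depends on the pipeline; the default of 2 matches a textbook 5-stage design where the register file can be written and read in the same cycle.

```python
NOP = ("nop", None, ())

# The slide's example, encoded by hand as (op, dest, sources).
PROGRAM = [
    ("sub", "$2", ("$1", "$3")),
    ("and", "$12", ("$2", "$5")),
    ("or", "$13", ("$6", "$2")),
    ("add", "$14", ("$2", "$2")),
    ("sw", None, ("$15", "$2")),  # sw reads both registers, writes none
]


def insert_nops(program, gap=2):
    """Pad with NOPs so no register is read within `gap` slots of its writer."""
    out = []
    for instr in program:
        sources = instr[2]
        needed = 0
        # Check the last `gap` emitted slots for a writer of one of our sources.
        for distance in range(1, min(gap, len(out)) + 1):
            writer_dest = out[-distance][1]
            if writer_dest is not None and writer_dest in sources:
                needed = max(needed, gap - distance + 1)
        out.extend([NOP] * needed)
        out.append(instr)
    return out


if __name__ == "__main__":
    for op, dest, sources in insert_nops(PROGRAM):
        print(op, dest or "", *sources)
```

With the default gap, the nops land between `sub` and `and`; the later readers of `$2` are then far enough away on their own.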

Forwarding
• Use temporary results, don’t wait for them to be written
  - register file forwarding to handle read/write to same register
  - ALU forwarding
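The choice a forwarding (bypass) mux has to make can be written as a small priority rule: prefer the newest in-flight result that targets the register being read. The sketch below follows the textbook-style forwarding unit and is not necessarily how the miniMIPS bypass muxes are wired; the function name, the stage labels, and the use of plain register numbers are assumptions for illustration.

```python
def forward_select(src_reg, mem_dest, wb_dest):
    """Decide where an ALU operand should come from.

    src_reg:  register number the instruction in the ALU stage wants to read
    mem_dest: register written by the instruction now in MEM (None if none)
    wb_dest:  register written by the instruction now in WB (None if none)
    Returns "MEM", "WB", or "REGFILE".
    """
    if src_reg == 0:
        return "REGFILE"  # $0 always reads as zero, never forwarded
    if mem_dest == src_reg:
        return "MEM"      # newest result wins
    if wb_dest == src_reg:
        return "WB"
    return "REGFILE"


if __name__ == "__main__":
    # sub $2,$1,$3 followed by and $12,$2,$5: sub is in MEM when and is in ALU.
    print(forward_select(src_reg=2, mem_dest=2, wb_dest=None))
    # One instruction later, or $13,$6,$2 finds sub in WB.
    print(forward_select(src_reg=2, mem_dest=12, wb_dest=2))
```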

Can't always forward
• Load word can still cause a hazard:
  - an instruction tries to read a register following a load instruction that writes to the same register.
• Thus, we need a hazard detection unit to “stall” the instruction
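The check a hazard detection unit performs for this load-use case is small enough to write down. This is a hedged sketch of the usual textbook condition, with hypothetical names: stall when the instruction one stage ahead is a load whose destination is one of the registers the current instruction is about to read.

```python
def must_stall(ahead_instr, reader_sources):
    """Load-use hazard check.

    ahead_instr:    (op, dest_reg) of the instruction one stage ahead
    reader_sources: register numbers read by the instruction being decoded
    """
    op, dest = ahead_instr
    return (
        op == "lw"
        and dest is not None
        and dest != 0  # writes to $0 never create a dependence
        and dest in reader_sources
    )


if __name__ == "__main__":
    # lw $2, 20($1) followed immediately by and $4, $2, $5
    print(must_stall(("lw", 2), (2, 5)))
    # The same reader behind a sub can be handled by forwarding instead.
    print(must_stall(("sub", 2), (2, 5)))
```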

Stalling
• We can stall the pipeline by keeping an instruction in the same stage

Branch Hazards
• When we decide to branch, other instructions are in the pipeline!
• We are predicting “branch not taken”
  - need to add hardware for flushing instructions if we are wrong
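The cost of predicting “branch not taken” can be estimated with a back-of-the-envelope formula: every taken branch flushes some number of wrongly fetched instructions. The sketch below is only an estimate; the function name is made up and the branch statistics and penalty in the example are hypothetical, not measurements.

```python
def effective_cpi(branch_fraction, taken_fraction, flush_penalty, base_cpi=1.0):
    """Average CPI when each taken branch costs `flush_penalty` extra cycles.

    branch_fraction: fraction of instructions that are branches
    taken_fraction:  fraction of those branches that are taken (mispredicted)
    """
    return base_cpi + branch_fraction * taken_fraction * flush_penalty


if __name__ == "__main__":
    # Hypothetical workload: 20% branches, half of them taken, 1-cycle flush.
    print(effective_cpi(branch_fraction=0.2, taken_fraction=0.5, flush_penalty=1))
```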

5-Stage miniMIPS
We wanted a simple, clean pipeline but…
[Figure: 5-stage miniMIPS datapath (Instruction Fetch, Register File, ALU, Memory, Write Back) with bypass muxes, NOP injection, and stall logic added]
• broke the sequential semantics of ISA by adding a branch delay-slot and early branch resolution logic
• added A/B bypass muxes to get data before it’s written to regfile
• added CLK EN to freeze IF/RF stages so we can wait for lw to reach WB stage

Pipeline Summary (I)
• Started with unpipelined implementation
  - direct execute, 1 cycle/instruction
  - it had a long cycle time: mem + regs + alu + mem + wb
• We ended up with a 5-stage pipelined implementation
  - increase throughput (3x???)
  - delayed branch decision (1 cycle)
      Choose to execute instruction after branch
  - delayed register writeback (3 cycles)
      Add bypass paths (6 x 2 = 12) to forward correct value
  - memory data available only in WB stage
      Introduce NOPs at IRALU, to stall IF and RF stages until LD result was ready

Pipeline Summary (II)
• Fallacy #1: Pipelining is easy
  - Smart people get it wrong all of the time!
• Fallacy #2: Pipelining is independent of ISA
  - Many ISA decisions impact how easy/costly it is to implement pipelining (e.g. branch semantics, addressing modes).
• Fallacy #3: Increasing pipeline stages improves performance
  - Diminishing returns. Increasing complexity.
