ECE232: Hardware Organization and Design

ECE232: Hardware Organization and Design Part 11: Pipelining Chapter 4/6 http://www.ecs.umass.edu/ece/ece232/

CPI Calculation • CPI stands for average number of Cycles Per Instruction • Assume an instruction mix of 24% loads, 12% stores, 44% R-format, 18% branches, and 2% jumps • CPI = 0.24 * 5 + 0.12 * 4 + 0.44 * 4 + 0.18 * 3 + 0.02 * 3 = 4.04 • Speedup? • Question: Can we achieve a CPI of 1???

Speeding up through pipelining • Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold • Washer takes 30 minutes • Dryer takes 30 minutes • “Folder” takes 30 minutes • “Stasher” takes 30 minutesto put clothes into drawers A B C D

A B C D Sequential Laundry • Sequential laundry takes 8 hours for 4 loads • If they learned pipelining, how long would laundry take? 2 AM 12 6 PM 1 8 7 11 10 9 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 T a s k O r d e r Time

30 30 30 30 30 30 30 A B C D Pipelined Laundry: Start work ASAP • Pipelined laundry takes 3.5 hours for 4 loads! 2 AM 12 6 PM 1 8 7 11 10 9 Time T a s k O r d e r

Pipelining doesn’t help latency of single task, it helps throughput of entire workload Multiple tasks operating simultaneously using different resources Potential speedup = Numberpipe stages Pipeline rate limited by slowest pipeline stage Unbalanced lengths of pipe stages reduces speedup Time to “fill” pipeline and time to “drain” it reduces speedup 30 30 30 30 30 30 30 A B C D Pipelining Lessons 6 PM 7 8 9 Time T a s k O r d e r

Pipelining Instructions Time (in cycles) Fetch = 10 ns Decode = 6 ns Execute = 8 ns Memory = 10 ns Write back = 6 ns F D EX M W F D EX M W F D EX M W Instruction F D EX M W W F D EX M M W F D EX

Ifetch Reg Exec Mem Wr Ifetch Reg Exec Mem Ifetch Ifetch Reg Exec Mem Wr Ifetch Reg Exec Mem Wr Ifetch Reg Exec Mem Wr Single Cycle, Multiple Cycle, vs. Pipeline Cycle 1 Cycle 2 Clk Single Cycle Implementation: Load Store Waste Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 Cycle 10 Clk Multiple Cycle Implementation: Load Store R-type Pipeline Implementation: Load Store R-type

Why Pipeline? • Suppose we execute 100 instructions • Single Cycle Machine • 45 ns/cycle x 1 CPI x 100 inst = 4500 ns • Multicycle Machine • 10 ns/cycle x 4.04 CPI (for the given inst mix) x 100 inst = 4040 ns • Instruction mix of 24% loads, 12% stores, 44% R-format, 18% branches, and 2% jumps • Ideal pipelined machine (with 5 stages) • 10 ns/cycle x (1 CPI x 100 inst + 4 cycle drain) = 1040 ns • Speedup=4.33 vs. single-cycle • 3.88 vs. multi-cycle (for the given inst mix)

Time (clock cycles) I n s t r. O r d e r Inst 1 Im Dm Reg Reg ALU Inst 2 Im Dm Reg Reg ALU Inst 3 Im Dm Reg Reg ALU Inst 4 Im Dm Reg Reg ALU Inst 5 Im Dm Reg Reg ALU Why Pipeline? Because the resources are there!

IMem DMem Reg Reg ALU Pipelining Rules • Forward traveling signals at each stage are latched • Only perform logic on signals in the same stage • signal labeling useful to prevent errors, • e.g., IRR, IRA, IRM, IRW • Backward travelling signals at each stage represent hazards Inst 5 Inst 4 Inst 3 Inst 2 Inst 1

MIPS Pipelined Datapath • State registers between pipeline stages to isolate them IF:IFetch ID:Dec EX:Execute MEM: MemAccess WB: WriteBack Inst 5 Inst 4 Inst 3 Inst 2 Inst 1 Add Add 4 Shift left 2 Read Addr 1 Instruction Memory Data Memory Register File Read Data 1 Read Addr 2 IFetch/Dec Read Address PC Read Data Dec/Exec Address Exec/Mem Write Addr ALU Read Data 2 Mem/WB Write Data Write Data Sign Extend 16 32 System Clock

Pipeline Hazards • Data hazards: an instruction uses the result of a previous instruction (RAW) ADD R1, R2, R3 or SW R1, 4(R2) SUB R4, R1, R5 LW R3, 4(R2) • Control hazards: the address of the next instruction to be executed depends on a previous instruction BEQ R1,R2,CONT SUB R6,R7,R8 … CONT: ADD R3,R4,R5 • Structural hazards: two instructions need access to the same resource • e.g., single memory shared for instruction fetch and load/store

Reading data from memory Mem Mem Mem Mem Mem Reg Reg Reg Reg Reg Reg Reg Reg Reg Reg Mem Mem Mem Mem Mem ALU ALU ALU ALU ALU Reading instruction from memory Structural Hazard Time (clock cycles) lw I n s t r. O r d e r Inst 1 Inst 2 Inst 3 Inst 4 • Fix with separate instruction and data memories (I$ and D$)

Data Hazards (RAW) Time (in cycles) F D EX M W Write Data to R1 Here F D EX M W Instruction Get data from R1 Here ADD R1, R2, R3 SUB R4, R1, R5

DM DM Reg Reg Reg Reg stall IM IM ALU ALU stall One Way to handle a Data Hazard By waiting – introducing stalls – but impacts CPI add $1,… I n s t r. O r d e r stall sub $4,$1,$5

Time (clock cycles) I n s t r. O r d e r Inst 1 Im Dm Reg Reg ALU Inst 2 Im Dm Reg Reg ALU Inst 3 Im Dm Reg Reg ALU Inst 4 Im Dm Reg Reg ALU Inst 5 Im Dm Reg Reg ALU Must allow Wr/Rd in REG in same cycle Split cycle into two halves

DM DM DM Reg Reg Reg Reg Reg Reg stall IM IM IM ALU ALU ALU stall sub $4,$1,$5 and $6,$1,$7 Only two stall cycles Write in 1st half, Read in 2nd half add $1,… I n s t r. O r d e r

DM DM DM DM Reg Reg Reg Reg Reg Reg Reg Reg IM IM IM IM ALU ALU ALU ALU Register File (write and then read) Time (clock cycles) Fix register file access hazard by doing reads in the second half of the cycle and writes in the first half add $1, I n s t r. O r d e r Inst 1 Inst 2 or $8,$1,$9 clock edge that controls loading of pipeline state registers

DM DM DM DM DM Reg Reg Reg Reg Reg Reg Reg Reg Reg Reg IM IM IM IM IM ALU ALU ALU ALU ALU Forwarding with Load-use Data Hazards • sub needs to stall • Will still need one stall cycle even with forwarding lw $1,4($2) I n s t r. O r d e r sub $4,$1,$5 and $6,$1,$7 or $8,$1,$9 xor $4,$1,$5

Injecting Bubbles IF ID EX MEM WB and sub lw Inst -1 Inst -2 and sub bubble lw Inst -1 Add Add 4 Shift left 2 Read Addr 1 Instruction Memory Data Memory Register File Read Data 1 Read Addr 2 IFetch/Dec Read Address PC Read Data Dec/Exec Address Exec/Mem Write Addr ALU Read Data 2 Mem/WB Write Data Write Data Inst –2 Inst –1 lw sub and Sign Extend 16 32 System Clock

3 Types of Data Hazards • RAW (read after write) • only hazard for ‘fixed’ pipelines • later instruction must read after earlier instruction writes • WAW (write after write) • variable-length pipeline • later instruction must write after earlier instruction writes • WAR (write after read) • instruction with late read (e.g., waiting for an execution unit) • later instruction must write after earlier instruction reads add $1,$2,$3 sub $4,$1,$5 div $1,$4,$3 add $1,$2,$5 mlt $4,$1,$3 add $1,$2,$5

Control Hazard Time (in cycles) F D EX M W Destination Available Here F D EX M W Instruction Need Destination Here JR R25 ... XX: ADD ... Simple solution: Flush Instruction fetch until branch resolved

ECE232: Hardware Organization and Design