CpE 242 Computer Architecture and Engineering Designing a Pipeline Processor

A Single Cycle Processor
[Figure: single-cycle datapath. Main Control decodes op (Instruction<31:26>) into RegDst, Branch, Jump, ALUSrc, ALUop, RegWr, MemWr, MemtoReg, and ExtOp. ALU Control combines ALUop with func (Instruction<5:0>) to produce the 3-bit ALUctr. The Instruction Fetch Unit supplies Rs, Rt, Rd, and Imm16 to the 32 x 32-bit register file (Ra, Rb, Rw, busA, busB, busW), the Extender, the ALU (with Zero output), and the Data Memory (WrEn, Adr, Data In).]

Drawbacks of this Single Cycle Processor
• Long cycle time:
  • Cycle time must be long enough for the load instruction:
    PC's Clock-to-Q +
    Instruction Memory Access Time +
    Register File Access Time +
    ALU Delay (address calculation) +
    Data Memory Access Time +
    Register File Setup Time +
    Clock Skew
• Cycle time is much longer than needed for all other instructions. Examples:
  • R-type instructions do not require a data memory access
  • Jump requires neither an ALU operation nor a data memory access
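The load's critical path can be added up term by term. The sketch below is illustrative only: the delay values (in ns) are hypothetical placeholders, not figures from the lecture.

```python
# Hypothetical component delays in ns (assumed for illustration, not from the slides).
LOAD_PATH_NS = {
    "pc_clk_to_q": 1,
    "instruction_memory_access": 10,
    "register_file_access": 5,
    "alu_delay": 8,
    "data_memory_access": 10,
    "register_file_setup": 1,
    "clock_skew": 1,
}

def single_cycle_time(path):
    """The cycle must cover every delay on the slowest instruction's path."""
    return sum(path.values())

# An R-type instruction follows the same path but skips the data memory access.
r_type_path = {k: v for k, v in LOAD_PATH_NS.items() if k != "data_memory_access"}

load_ns = single_cycle_time(LOAD_PATH_NS)
r_type_ns = single_cycle_time(r_type_path)
print(f"load needs {load_ns} ns, R-type needs {r_type_ns} ns, "
      f"wasted per R-type: {load_ns - r_type_ns} ns")
```

With these assumed numbers, every R-type instruction idles for the length of one data memory access, because the clock is set by the load.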

Overview of a Multiple Cycle Implementation
• The root of the single cycle processor's problems:
  • The cycle time has to be long enough for the slowest instruction
• Solution:
  • Break the instruction into smaller steps
  • Execute each step (instead of the entire instruction) in one cycle
    • Cycle time: the time it takes to execute the longest step
    • Keep all the steps similar in length
  • This is the essence of the multiple cycle processor
• The advantages of the multiple cycle processor:
  • Cycle time is much shorter
  • Different instructions take different numbers of cycles to complete
    • Load takes five cycles
    • Jump takes only three cycles
  • A functional unit can be used more than once per instruction

Multiple Cycle Processor
• MCP: a functional unit can be used more than once per instruction
[Figure: multiple-cycle datapath with PC, Ideal Memory, Instruction Reg, Reg File, Extend, << 2, ALU, ALU Control, and Target register. Control signals: PCWr, PCWrCond, PCSrc, BrWr, IorD, MemWr, IRWr, RegDst, RegWr, ALUSelA, ALUSelB, ALUOp, ExtOp, MemtoReg.]

Outline of Today's Lecture
• Recap and Introduction (5 minutes)
• Introduction to the Concept of Pipelined Processor (15 minutes)
• Pipelined Datapath and Pipelined Control (25 minutes)
• How to Avoid Race Conditions in a Pipeline Design? (5 minutes)
• Pipeline Example: Instructions Interaction (15 minutes)
• Summary (5 minutes)

Timing Diagram of a Load Instruction
[Figure: timing diagram of a load, divided into Instruction Fetch, Instr Decode / Reg. Fetch, Address, Data Memory, and Reg Wr. Each signal changes from its old value to its new value after the listed delay: PC (Clk-to-Q); Rs, Rt, Rd, Op, Func (Instruction Memory Access Time); ALUctr, ExtOp, ALUSrc, RegWr (Delay through Control Logic); busA (Register File Access Time); busB (Delay through Extender & Mux); Address (ALU Delay); busW (Data Memory Access Time), followed by the Register File Write Time.]

The Five Stages of Load
[Figure: Load occupies Ifetch, Reg/Dec, Exec, Mem, and Wr in Cycles 1 to 5.]
• Ifetch: Instruction Fetch
  • Fetch the instruction from the Instruction Memory
• Reg/Dec: Registers Fetch and Instruction Decode
• Exec: Calculate the memory address
• Mem: Read the data from the Data Memory
• Wr: Write the data back to the register file

Key Ideas Behind Pipelining
• Grading the final exam for a class of 100 students:
  • 5 problems, five people grading the exam
  • Each person grades ONLY one problem
  • Each grader passes the exam to the next person as soon as their part is finished
  • Assume each problem takes 12 min to grade
  • Each individual exam still takes 1 hour to grade
  • But with 5 people, all exams can be graded nearly five times quicker
• The load instruction has 5 stages:
  • Five independent functional units, one working on each stage
  • Each functional unit is used only once
  • The 2nd load can start as soon as the 1st finishes its Ifetch stage
  • Each load still takes five cycles to complete
  • The throughput, however, is much higher

Key Ideas Behind Pipelining
[Figure: k-stage pipeline. Labels: buffer, Input Tasks, Stage 1, Stage 2, ..., Stage k.]
• Let n be the number of tasks (exams or instructions)
• Let k be the number of stages for each task
• Let T be the time per stage
• Time per task = T × k
• Total time for n tasks, non-pipelined = T × k × n
• Total time for n tasks, pipelined = T × k + T × (n − 1)
• Speedup = pipelined performance / non-pipelined performance
          = non-pipelined total time / pipelined total time
          = (k × n) / (k + n − 1)
          ≈ k when n >> k
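The formulas can be checked numerically. The sketch below plugs in the exam-grading numbers from the previous slide (n = 100, k = 5, T = 12 min); the function names are chosen here for illustration.

```python
def non_pipelined_time(n, k, t):
    """Each of the n tasks runs all k stages before the next task starts."""
    return t * k * n

def pipelined_time(n, k, t):
    """The first task takes k stages; each later task finishes one stage time after it."""
    return t * k + t * (n - 1)

def speedup(n, k):
    """Ratio of the two totals; the stage time T cancels out."""
    return (k * n) / (k + n - 1)

n, k, t = 100, 5, 12  # exams, graders (stages), minutes per problem
print(non_pipelined_time(n, k, t))   # total minutes with no overlap
print(pipelined_time(n, k, t))       # total minutes with five graders in a pipeline
print(round(speedup(n, k), 2))       # approaches k as n grows
print(round(speedup(1_000_000, k), 4))
```

For 100 exams the speedup is a little under 5; it only approaches k = 5 as n grows, because the first exam still needs the full k stages.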

Pipelining the Load Instruction
[Figure: three overlapped loads over Cycles 1 to 7. The 1st lw runs Ifetch, Reg/Dec, Exec, Mem, Wr in Cycles 1 to 5; the 2nd lw starts in Cycle 2; the 3rd lw starts in Cycle 3.]
• The five independent functional units in the pipeline datapath are:
  • Instruction Memory for the Ifetch stage
  • Register File's Read ports (busA and busB) for the Reg/Dec stage
  • ALU for the Exec stage
  • Data Memory for the Mem stage
  • Register File's Write port (busW) for the Wr stage
• One instruction enters the pipeline every cycle
  • One instruction comes out of the pipeline (completes) every cycle
  • The "Effective" Cycles per Instruction (CPI) is 1

The Four Stages of R-type
[Figure: R-type occupies Ifetch, Reg/Dec, Exec, and Wr in Cycles 1 to 4.]
• Ifetch: Instruction Fetch
  • Fetch the instruction from the Instruction Memory
• Reg/Dec: Registers Fetch and Instruction Decode
• Exec: ALU operates on the two register operands
• Wr: Write the ALU output back to the register file

Pipelining the R-type and Load Instruction
[Figure: pipeline diagram over Cycles 1 to 9 for the sequence R-type, R-type, Load, R-type, R-type. Each R-type uses four stages (Ifetch, Reg/Dec, Exec, Wr); the Load uses five (Ifetch, Reg/Dec, Exec, Mem, Wr).]
• We have a problem:
  • Two instructions try to write to the register file at the same time!

Important Observation
[Figure: Load uses stages 1 to 5 (Ifetch, Reg/Dec, Exec, Mem, Wr); R-type uses stages 1 to 4 (Ifetch, Reg/Dec, Exec, Wr).]
• Each functional unit can only be used once per instruction
• Each functional unit must be used at the same stage for all instructions:
  • Load uses the Register File's Write Port during its 5th stage
  • R-type uses the Register File's Write Port during its 4th stage
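The write-port collision can be found mechanically. A minimal sketch, assuming one instruction is fetched per cycle starting in Cycle 1; the function and variable names are made up for this example.

```python
from collections import defaultdict

# Stage in which each instruction class uses the register file write port
# (from the slide: Load in its 5th stage, R-type in its 4th).
WRITE_STAGE = {"load": 5, "r-type": 4}

def write_port_conflicts(program, write_stage=WRITE_STAGE):
    """Return {cycle: [instruction indices]} for cycles with more than one write.

    Instruction i (0-based) is assumed to be fetched in cycle i + 1.
    """
    writers = defaultdict(list)
    for i, kind in enumerate(program):
        fetch_cycle = i + 1
        writers[fetch_cycle + write_stage[kind] - 1].append(i)
    return {cycle: idx for cycle, idx in writers.items() if len(idx) > 1}

program = ["r-type", "r-type", "load", "r-type", "r-type"]
conflicts = write_port_conflicts(program)
print(conflicts)  # the Load and the R-type behind it write in the same cycle

# If every class writes in the same stage, no two writes can coincide.
print(write_port_conflicts(program, {"load": 5, "r-type": 5}))
```

The second call illustrates the observation above: once every instruction uses the write port at the same stage, one-per-cycle issue can never produce two writes in one cycle.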

Solution 1: Insert "Bubble" into the Pipeline
[Figure: pipeline diagram over Cycles 1 to 9 for a Load followed by R-type instructions, with a bubble inserted into the pipeline.]
• Insert a "bubble" into the pipeline to prevent 2 writes in the same cycle
  • The control logic can be complex
• No instruction is completed during Cycle 5:
  • The "Effective" CPI for load is 2

Solution 2: Delay R-type's Write by One Cycle
• Delay R-type's register write by one cycle:
  • Now R-type instructions also use the Reg File's write port at Stage 5
  • Mem stage is a NOOP stage: nothing is being done
[Figure: R-type now uses stages 1 to 5 (Ifetch, Reg/Dec, Exec, Mem, Wr). Pipeline diagram over Cycles 1 to 9 for the sequence R-type, R-type, Load, R-type, R-type, each with five stages.]

The Four Stages of Store
[Figure: Store's stage boxes (Ifetch, Reg/Dec, Exec, Mem, Wr) over Cycles 1 to 4.]
• Ifetch: Instruction Fetch
  • Fetch the instruction from the Instruction Memory
• Reg/Dec: Registers Fetch and Instruction Decode
• Exec: Calculate the memory address
• Mem: Write the data into the Data Memory

The Four Stages of Beq
[Figure: Beq's stage boxes (Ifetch, Reg/Dec, Exec, Mem, Wr) over Cycles 1 to 4.]
• Ifetch: Instruction Fetch
  • Fetch the instruction from the Instruction Memory
• Reg/Dec: Registers Fetch and Instruction Decode
• Exec: ALU compares the two register operands
  • Adder calculates the branch target address
• Mem: If the registers compared in the Exec stage are equal,
  • Write the branch target address into the PC

A Pipelined Datapath
[Figure: pipelined datapath with five stages (Ifetch, Reg/Dec, Exec, Mem, Wr) separated by the IF/ID, ID/Ex, Ex/Mem, and Mem/Wr pipeline registers. Blocks: IUnit, RFile, Exec Unit, Data Mem. PC+4 and Imm16 are carried along through the pipeline registers. Control signals: ExtOp, ALUOp, ALUSrc, RegDst, Branch, MemWr, MemtoReg, RegWr.]

The Instruction Fetch Stage
• Location 10: lw $1, 0x100($2)    $1 <- Mem[($2) + 0x100]
[Figure: pipelined datapath with the Ifetch stage marked "You are here!". PC = 14; IF/ID holds lw $1, 0x100($2).]

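A Detail View of the Instruction Unit
• Location 10: lw $1, 0x100($2)
[Figure: Instruction Unit, marked "You are here!". The PC (10) addresses the Instruction Memory; an Adder adds "4" to give the next PC = 14; a mux selects the PC input; the fetched instruction is latched into IF/ID: lw $1, 0x100($2).]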

The Decode / Register Fetch Stage
• Location 10: lw $1, 0x100($2)    $1 <- Mem[($2) + 0x100]
[Figure: pipelined datapath with the Reg/Dec stage marked "You are here!". ID/Ex holds Reg. 2 and 0x100.]

Load's Address Calculation Stage
• Location 10: lw $1, 0x100($2)    $1 <- Mem[($2) + 0x100]
[Figure: pipelined datapath with the Exec stage marked "You are here!". Control: ExtOp=1, ALUSrc=1, ALUOp=Add, RegDst=0. Ex/Mem holds the Load's address.]

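A Detail View of the Execution Unit
[Figure: Execution Unit, marked "You are here!". Inputs from the ID/Ex register: PC+4, busA, busB, imm16. The Extender (ExtOp=1) widens imm16 from 16 to 32 bits; a mux (ALUSrc=1) selects between busB and the extended immediate; ALU Control turns ALUOp=Add into the 3-bit ALUctr; the ALU produces Zero and ALUout. A separate Adder forms the branch Target from PC+4 and the immediate shifted left by 2. Ex/Mem holds the Load's memory address.]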

Load's Memory Access Stage
• Location 10: lw $1, 0x100($2)    $1 <- Mem[($2) + 0x100]
[Figure: pipelined datapath with the Mem stage marked "You are here!". Control: Branch=0, MemWr=0. Mem/Wr holds the Load's data.]

Load's Write Back Stage
• Location 10: lw $1, 0x100($2)    $1 <- Mem[($2) + 0x100]
[Figure: pipelined datapath with the Wr stage active ("You are somewhere out there!"). Control: RegWr=1, MemtoReg=1.]

How About Control Signals?
• Key Observation: Control Signals at Stage N = Func (Instr. at Stage N)
  • N = Exec, Mem, or Wr
• Example: Control Signals at Exec Stage = Func(Load's Exec)
[Figure: pipelined datapath with the Load in Exec. Control: ExtOp=1, ALUSrc=1, ALUOp=Add, RegDst=0. Ex/Mem holds the Load's address.]

Pipeline Control
• The Main Control generates the control signals during Reg/Dec
  • Control signals for Exec (ExtOp, ALUSrc, ...) are used 1 cycle later
  • Control signals for Mem (MemWr, Branch) are used 2 cycles later
  • Control signals for Wr (MemtoReg, RegWr) are used 3 cycles later
[Figure: Main Control in Reg/Dec feeding the ID/Ex, Ex/Mem, and Mem/Wr registers. ExtOp, ALUSrc, ALUOp, and RegDst are carried to Exec; MemWr and Branch are carried to Mem; MemtoReg and RegWr are carried through all three registers to Wr.]
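The way control signals travel with the instruction can be sketched as a control word that is trimmed at each pipeline register. This is a simplified model, not the lecture's circuit: the grouping follows the figure (RegWr is consumed in Wr, where the register file is written), the lw values come from the load walkthrough slides, and the helper names are invented here.

```python
# Control signals grouped by the stage that consumes them.
EXEC_SIGNALS = ("ExtOp", "ALUSrc", "ALUOp", "RegDst")
MEM_SIGNALS = ("MemWr", "Branch")
WR_SIGNALS = ("MemtoReg", "RegWr")

def main_control_lw():
    """Control word generated for lw during Reg/Dec."""
    return {"ExtOp": 1, "ALUSrc": 1, "ALUOp": "Add", "RegDst": 0,
            "MemWr": 0, "Branch": 0, "MemtoReg": 1, "RegWr": 1}

def keep(word, names):
    """Pass on only the signals that later stages still need."""
    return None if word is None else {n: word[n] for n in names}

def clock(regs, decoded):
    """One clock edge: every pipeline register latches what the stage before it holds."""
    return {
        "ID/Ex": decoded,
        "Ex/Mem": keep(regs["ID/Ex"], MEM_SIGNALS + WR_SIGNALS),
        "Mem/Wr": keep(regs["Ex/Mem"], WR_SIGNALS),
    }

empty = {"ID/Ex": None, "Ex/Mem": None, "Mem/Wr": None}
after_1 = clock(empty, main_control_lw())  # lw enters Exec: Exec signals in use
after_2 = clock(after_1, None)             # lw enters Mem: MemWr, Branch in use
after_3 = clock(after_2, None)             # lw enters Wr: MemtoReg, RegWr in use

print(after_1["ID/Ex"])
print(after_2["Ex/Mem"])
print(after_3["Mem/Wr"])
```

Each stage reads only the register directly in front of it, so its control signals depend only on the instruction currently in that stage.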

Beginning of the Wr Stage: A Real World Problem
• At the beginning of the Wr stage, we have a problem if:
  • RegAdr's (Rd or Rt) Clk-to-Q > RegWr's Clk-to-Q
• Similarly, at the beginning of the Mem stage, we have a problem if:
  • WrAdr's Clk-to-Q > MemWr's Clk-to-Q
• We have a race condition between Address and Write Enable!
[Figure: two timing diagrams. Left: the Mem/Wr register drives RegAdr, RegWr, and Data into the Reg File; RegWr's Clk-to-Q is compared with RegAdr's Clk-to-Q. Right: the Ex/Mem register drives WrAdr, MemWr, and Data into the Data Memory; MemWr's Clk-to-Q is compared with WrAdr's Clk-to-Q.]

The Pipeline Problem
• The Multiple Cycle design prevents a race condition between Addr and WrEn:
  • Make sure Address is stable by the end of Cycle N
  • Assert WrEn during Cycle N + 1
• This approach can NOT be used in the pipeline design because:
  • Must be able to write the register file every cycle
  • Must be able to write the data memory every cycle
[Figure: pipeline diagram of Store, Store, R-type, R-type in consecutive cycles, each with Ifetch, Reg/Dec, Exec, Mem, Wr.]

Synchronize Register File & Synchronize Memory
• Solution: AND the Write Enable signal with the Clock
  • This is the ONLY place where gating the clock is used
  • MUST consult a circuit expert to ensure there is no timing violation:
    • Example: Clock High Time > Write Access Delay
[Figure: synchronized memory and register file. I_Addr, I_Data, and I_WrEn are captured on the Clk edge to produce Address, Data, and C_WrEn; C_WrEn is ANDed with Clk to form the WrEn input of the Reg File or Memory. Address, Data, and WrEn must be stable at least one set-up time before the Clk edge. The write occurs in the cycle following the clock edge that captures the signals.]

A More Extensive Pipelining Example
[Figure: pipeline diagram over Cycles 1 to 8 for 0: Load, 4: R-type, 8: Store, 12: Beq (target is 1000), each with Ifetch, Reg/Dec, Exec, Mem, Wr; the ends of Cycles 4, 5, 6, and 7 are marked.]
• End of Cycle 4: Load's Mem, R-type's Exec, Store's Reg, Beq's Ifetch
• End of Cycle 5: Load's Wr, R-type's Mem, Store's Exec, Beq's Reg
• End of Cycle 6: R-type's Wr, Store's Mem, Beq's Exec
• End of Cycle 7: Store's Wr, Beq's Mem
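The stage occupied by each instruction in a given cycle follows directly from its position in the program. A small sketch, assuming one instruction enters Ifetch per cycle starting in Cycle 1; the names are chosen for this example.

```python
STAGES = ["Ifetch", "Reg/Dec", "Exec", "Mem", "Wr"]
PROGRAM = [(0, "Load"), (4, "R-type"), (8, "Store"), (12, "Beq")]

def occupancy(cycle, program=PROGRAM):
    """Map each in-flight instruction to the stage it occupies in `cycle` (1-based)."""
    result = {}
    for i, (addr, name) in enumerate(program):
        stage_index = cycle - 1 - i  # instruction i enters Ifetch in cycle i + 1
        if 0 <= stage_index < len(STAGES):
            result[f"{addr}: {name}"] = STAGES[stage_index]
    return result

for cycle in (4, 5, 6, 7):
    print(f"Cycle {cycle}: {occupancy(cycle)}")
```

Printing Cycles 4 through 7 reproduces the four bullets above; from Cycle 6 on, the Load has left the pipeline and no longer appears.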

Pipelining Example: End of Cycle 4
• 0: Load's Mem    4: R-type's Exec    8: Store's Reg    12: Beq's Ifetch
[Figure: pipelined datapath at the end of Cycle 4. PC = 16. Pipeline registers: IF/ID: Beq instruction; ID/Ex: Store's busA & B; Ex/Mem: R-type's result; Mem/Wr: Load's Dout. Control: ALUOp=R-type, ExtOp=x, ALUSrc=0, RegDst=1, Branch=0, MemWr=0, MemtoReg=x, RegWr=0.]

Pipelining Example: End of Cycle 5
• 0: Load's Wr    4: R-type's Mem    8: Store's Exec    12: Beq's Reg    16: R-type's Ifetch
[Figure: pipelined datapath at the end of Cycle 5. PC = 20. Pipeline registers: IF/ID: instruction @ 16; ID/Ex: Beq's busA & B; Ex/Mem: Store's address; Mem/Wr: R-type's result. Control: ALUOp=Add, ExtOp=1, ALUSrc=1, RegDst=x, Branch=0, MemWr=0, MemtoReg=1, RegWr=1.]

Pipelining Example: End of Cycle 6
• 4: R-type's Wr    8: Store's Mem    12: Beq's Exec    16: R-type's Reg    20: R-type's Ifetch
[Figure: pipelined datapath at the end of Cycle 6. PC = 24. Pipeline registers: IF/ID: instruction @ 20; ID/Ex: R-type's busA & B; Ex/Mem: Beq's results; Mem/Wr: nothing for Store. Control: ALUOp=Sub, ExtOp=1, ALUSrc=0, RegDst=x, Branch=0, MemWr=1, MemtoReg=0, RegWr=1.]

Pipelining Example: End of Cycle 7
• 8: Store's Wr    12: Beq's Mem    16: R-type's Exec    20: R-type's Reg    24: R-type's Ifetch
[Figure: pipelined datapath at the end of Cycle 7. PC = 1000. Pipeline registers: IF/ID: instruction @ 24; ID/Ex: R-type's busA & B; Ex/Mem: R-type's results; Mem/Wr: nothing for Beq. Control: ALUOp=R-type, ExtOp=x, ALUSrc=0, RegDst=1, Branch=1, MemWr=0, MemtoReg=x, RegWr=0.]

The Delay Branch Phenomenon
[Figure: pipeline diagram over Cycles 4 to 11 for 12: Beq (target is 1000), 16: R-type, 20: R-type, 24: R-type, 1000: Target of Br, each with Ifetch, Reg/Dec, Exec, Mem, Wr.]
• Although Beq is fetched during Cycle 4:
  • Target address is NOT written into the PC until the end of Cycle 7
  • Branch's target is NOT fetched until Cycle 8
  • 3-instruction delay before the branch takes effect
• This is referred to as a Branch Hazard:
  • Clever design techniques can reduce the delay to ONE instruction
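The 3-instruction delay is plain cycle counting. A minimal sketch, with names invented here; `resolve_stage` is the 1-based stage at the end of which the PC is written (Mem, stage 4, in this design). The second call assumes, purely for illustration, a design that writes the PC at the end of Reg/Dec; the slides do not say which technique achieves the one-instruction delay.

```python
def branch_delay(fetch_cycle, resolve_stage=4):
    """Return (pc_write_cycle, target_fetch_cycle, instructions fetched in between)."""
    pc_write_cycle = fetch_cycle + resolve_stage - 1  # PC updated at the end of this cycle
    target_fetch_cycle = pc_write_cycle + 1           # first fetch that sees the new PC
    delay = target_fetch_cycle - fetch_cycle - 1      # fetches between Beq and its target
    return pc_write_cycle, target_fetch_cycle, delay

print(branch_delay(4))                    # Beq fetched in Cycle 4, PC written after Mem
print(branch_delay(4, resolve_stage=2))   # hypothetical: PC written after Reg/Dec
```

With the branch resolved in Mem, the three fetches in Cycles 5, 6, and 7 are the instructions at 16, 20, and 24 in the figure.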

The Delay Load Phenomenon
[Figure: pipeline diagram over Cycles 1 to 8 for I0: Load followed by the instructions Plus 1, Plus 2, Plus 3, and Plus 4, each with Ifetch, Reg/Dec, Exec, Mem, Wr.]
• Although Load is fetched during Cycle 1:
  • The data is NOT written into the Reg File until the end of Cycle 5
  • We cannot read this value from the Reg File until Cycle 6
  • 3-instruction delay before the load takes effect
• This is referred to as a Data Hazard:
  • Clever design techniques can reduce the delay to ONE instruction
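Which of the following instructions can see the loaded value is again a matter of counting cycles. A small sketch under the slide's assumptions: the register file is written at the end of the Load's Wr cycle, and a value can only be read in a later cycle. The function name is made up for this example.

```python
def sees_loaded_value(distance, load_fetch_cycle=1):
    """True if the instruction `distance` slots after the Load reads the new value.

    The Load writes the register file at the end of its 5th stage (Wr);
    a following instruction reads its operands in its 2nd stage (Reg/Dec).
    """
    write_cycle = load_fetch_cycle + 4            # Load's Wr cycle
    read_cycle = load_fetch_cycle + distance + 1  # follower's Reg/Dec cycle
    return read_cycle > write_cycle               # must read strictly after the write

for d in range(1, 5):
    print(f"Plus {d}: {'new' if sees_loaded_value(d) else 'old'} value")
```

Plus 1 through Plus 3 read their operands in Cycles 3 to 5 and still get the old value; Plus 4 reads in Cycle 6 and is the first to see the loaded data.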

Summary
• Disadvantages of the Single Cycle Processor
  • Long cycle time
  • Cycle time is too long for all instructions except the Load
• Multiple Clock Cycle Processor:
  • Divide the instructions into smaller steps
  • Execute each step (instead of the entire instruction) in one cycle
• Pipeline Processor:
  • Natural enhancement of the multiple clock cycle processor
  • Each functional unit can only be used once per instruction
  • If an instruction is going to use a functional unit:
    • it must use it at the same stage as all other instructions
  • Pipeline Control:
    • Each stage's control signals depend ONLY on the instruction that is currently in that stage

Single Cycle, Multiple Cycle, vs. Pipeline
[Figure: timing comparison of the three implementations. Single Cycle Implementation: Load in Cycle 1 and Store in Cycle 2, with part of the Store's cycle marked Waste. Multiple Cycle Implementation: Load (Ifetch, Reg, Exec, Mem, Wr), then Store (Ifetch, Reg, Exec, Mem), then R-type starting with Ifetch in Cycle 10. Pipeline Implementation: Load, Store, and R-type overlapped, each with Ifetch, Reg, Exec, Mem, Wr.]

Where to get more information?
• Everything you need to know about pipeline computers:
  • Peter Kogge, "The Architecture of Pipeline Computers," McGraw Hill Book Company, 1981.
• Some classic references on RISC pipelines:
  • Manolis Katevenis, "Reduced Instruction Set Computer Architectures for VLSI," PhD Thesis, UC Berkeley, 1984.
• Other references:
  • David A. Patterson, "Reduced Instruction Set Computers," Communications of the ACM, January 1985.
  • Shing Kong, "Performance, Resources, and Complexity," PhD Thesis, UC Berkeley, 1989.
