Pipelined b Processor

Pipelined b Processor

Recap: Unpipelined b ILL OP JT Instruction Memory XAdr A D 4 3 2 1 0 PCSEL ra<20:16> rb<15:11> rc<25:21> 00 PC WASEL 0 1 RA2SEL XP 1 +4 RA1 RA2 WD WA Register File 0 WERF WE rc<25:21> RD1 RD2 <PC> + 4 + C<15:0> <<2 IRQ Z Z JT Wr C<15:0> sign extended Control Logic WD <PC> + 4 + 4C R/W 0 1 1 0 BSEL ASEL PCSEL RA2SEL ASEL BSELWDSELALUFN Wr WERF WASEL Data Memory A A B ALUFN ALU RD 1 0 2 WDSEL What can we do to make it run FASTER?

CPU Performance To increase MIPS: • Decrease CPI • Instruction set simplicity reduces CPI to 1.0 • To reduce CPI below 1.0 need multiple instruction issue machines (stay tuned!) • Increase Frequency • Frequency limited by longest combinational path from register outputs to register inputs • Can therefore increase frequency by pipelining MIPS = Frequency in MHz . Clocks Per Instruction (CPI)

Pipeline Stages Goal: Maintain (nearly) 1.0 CPI, but increase clock speed Approach: Structure processor as 4-stage pipeline • Instruction Fetch: Maintains PC. Fetches one instruction per cycle. • Register File: Reads source operands from register file. • ALU: Performs indicated operation. • Write-Back: Writes result back into register file. IF RF ALU WB

Sketch of 4-Stage Pipeline IF Instruction Fetch instruction Register File RF (read) CL instruction A B ALU ALU CL instruction Y Write Back Same RF as above! CL RF (write) Need to pass through the stages for branch and jump instructions

4-Stage b Pipeline ILL OP JT XAdr • Omits some detail • No bypass or interlock logic 4 3 2 1 0 PCSEL Instruction Memory A 00 PCIF D IF +4 PCRF IRRF ra<20:16> rb<15:11> rc<25:21> + 0 1 RA2SEL RA1 RA2 C<15:0> <<2 Register File RD1 RD2 Z <PC> + 4 + 4C JT RF C<15:0> BSEL 0 ASEL 1 1 0 PCALU IRALU A B DALU A B ALU ALU ALUFN PCWB IRWB Y DWB WD A XP Data Memory rc<25:21> RD WB R/W 1 0 1 0 2 WASEL WDSEL WA WD Wr WERF Register File

... SUBC XOR MUL ... ADDC SUBC XOR MUL ... ADDC SUBC XOR MUL ADDC SUBC XOR MUL 4-Pipeline Execution Consider a sequence of instructions … ADDC(r1, 1, r2) SUBC(r1, 1, r3) XOR(r1, r5, r1) MUL(r2, r5, r0) … executed on the 4-stage pipeline: TIME (cycles) i i + 1 i + 2 i + 3 i + 4 i + 5 i + 6 IF ADDC Pipeline RF ALU WB

Branch Execution Consider a different sequence: LOOP: CMPLEC(r3, 100, r0) ADD(r1, r2, r3) SUB(r1, r2, r4) BNE(r0, LOOP) XOR(r31, r31, r3) … i i + 1 i + 2 i + 3 i + 4 i + 5 i + 6 IF CMP ADD SUB BNE ? RF CMP ADD SUB BNE ALU CMP ADD SUB BNE WB CMP ADD SUB BNE What instruction should we fetch after BNE ?

Branch Execution - II i i + 1 i + 2 i + 3 i + 4 i + 5 i + 6 IF CMP ADD SUB BNE ? RF CMP ADD SUB BNE ALU CMP ADD SUB BNE WB CMP ADD SUB BNE 4 3 2 1 0 PCSEL Instruction Memory A 00 PCIF IF D XOR +4 PCRF IRRF + 0 1 RA2SEL <PC> + 4 + 4C RF RA1 RA2 C<15:0> <<2 Register File BNE RD1 RD2 Z PCIF will output correct address in clock cycle .

Branch Delay Slots Problem: One (or more) following instructions have been prefetched by the time a branch is taken. • Possible solutions are: • “Program around it” • Follow each branch instruction with a NOP = ADD(r31, r31, r31) instruction. • Make the compiler clever enough to move useful instructions after branches. These instructions will be executed regardless of whether the branch is taken or not. • Make pipeline “annul” instruction following branch which is taken, e.g., by disabling RF write and Memory write.

Annulling Prefetched Instructions ILL OP JT XAdr 4 3 2 1 0 PCSEL Instruction Memory A 00 PCIF D NOP +4 XORC 0 1 2 AnnulIF PCRF IRRF BNE ra<20:16> rb<15:11> rc<25:21> + 0 1 RA2SEL <PC> + 4 + 4C RA1 RA2 C<15:0> <<2 Register File RD1 RD2 Z JT PCALU IRALU ADD AnnulIF = 1 if branch is taken, i.e., PCSEL = 1 Similarly for JMP instruction

Data Hazards Consider the sequence: ADD(r1, r2, r3) CMPLEC(r3, 5, r0) MULC(r1, r2, r4) SUB(r1, r2, r5) … i i + 1 i + 2 i + 3 i + 4 i + 5 i + 6 IF ADD CMP MUL SUB RF ADD CMP MUL SUB ALU ADD CMP MUL SUB WB ADD CMP MUL SUB ADD writes new value in r3 during cycle i + 3, which is available beginning of cycle i + 4. Value of r3 read by CMP during cycle i + 2. CMP reads old value of r3.

Insert how many NOPs here ? Data Hazards: Solution - I ADD(r1, r2, r3) CMPLEC(r3, 5, r0) • Solution I (software) • “Program around it” • Compiler inserts NOPs • Compiler rewrites instruction sequence Rewrite ADD(r1, r2, r3) ADD(r1, r2, r3) CMPLEC(r3, 5, r0) MUL(r1, r2, r4) MUL(r1, r2, r4) as SUB(r1, r2, r5) SUB(r1, r2, r5) CMPLEC(r3, 5, r0)

r3 read r3 written Data Hazards: Solution - II • Solution II (hardware) • Detect problem and stall the pipeline • Freeze IF, RF stages for 2 cycles and insert NOPs into IRALU for 2 cycles ADD(r1, r2, r3) CMPLEC(r3, 5, r0) i i + 1 i + 2 i + 3 i + 4 i + 5 i + 6 IF ADD CMP MUL MUL MUL SUB RF ADD CMP CMP CMP MUL SUB ALU ADD NOP1 NOP2 CMP MUL WB ADD NOP1 NOP2 CMP

Pipeline Stalls ILL OP JT XAdr 4 3 2 1 0 PCSEL Instruction Memory A 00 PCIF D MUL +4 IF PCRF IRRF ra<20:16> rb<15:11> rc<25:21> + 0 1 RA2SEL CMP RA1 RA2 C<15:0> <<2 Register File RD1 RD2 Z <PC> + 4 + 4C JT RF NOP C<15:0> BSEL 0 1 1 0 STALLALU 0 ASEL 1 PCALU IRALU A B DALU ADD A B ALU ALU ALUFN PCWB IRWB Y DWB WD A XP Data Memory rc<25:21> RD WB R/W 1 0 1 0 2 WASEL WDSEL WA WD Wr WERF Register File

r3 read r3 compared to 5 r1 + r2 computed r3 written Data Hazards: Solution - III Solution III (hardware) • Bypass paths. • Extra data paths and control logic which re-route data in problem cases. ADD(r1, r2, r3) CMPLEC(r3, 5, r0) i i + 1 i + 2 i + 3 i + 4 i + 5 i + 6 IF ADD CMP MUL SUB RF ADD CMP MUL SUB ALU ADD CMP MUL SUB WB ADD CMP MUL SUB

Bypass Paths - I IRRF Select bypass if OPCODERF = OP, OPC, ... AND OPCODEALU = OP, OPC, ... AND raRF = rcALU (Similar path for other ALU input) CMPLEC(r3, 5, r0) 0 1 RA2SEL RA1 RA2 Register File RD1 RD2 BSEL 0 ASEL 1 1 0 IRALU A B ADD(r1, r2, r3) A B ALU IRWB Y 1 0 2 WDSEL WA WD

Bypass Paths - II IRRF Select bypass if OPCODERF = OP AND WERF = 1 AND rbRF = WA (Similar path for other ALU input) SUB(r1, r2, r3) 0 1 RA2SEL RA1 RA2 Register File RD1 RD2 BSEL 0 ASEL 1 1 0 IRALU A B ??? A B ALU IRWB Y XOR(r4, r5, r2) 1 0 2 WDSEL 1 0 WASEL WA WD Register File WERF

Next Time: Pipelining Subtleties Dilbert : S. Adams

Pipelined b Processor

Pipelined b Processor

Presentation Transcript

Lecture 5. MIPS Processor Design Pipelined MIPS #1

Chapter Six- 1st Half Pipelined Processor Delayed Controls, Hazards

Chapter Six - 2nd Half Pipelined Processor Forwarding, Hazards, Branching

Chapter 4 Processor Architecture Pipelined Implementation

Lecture 5. MIPS Processor Design Pipelined MIPS #2

Pipelined Processor Design

5 Pipelined Processor

A Pipelined Processor

Frame-Level Pipelined Motion Estimation Array Processor

Pipelined Processor II (cont’d) CPSC 321

designKilla: The 32-bit pipelined processor

Pipelined Electronics

32-bit Pipelined RISC Processor

Pipelined Processor Design

Lecture 9. MIPS Processor Design – Pipelined Processor Design #2

Branch Hazards in the Pipelined Processor

32-bit Pipelined RISC Processor

32-bit Pipelined RISC Processor

Pipelined Processor Design

Pipelined Processor Design

Reducing Average CPE Time On A Y86 Pipelined Processor

Pipelined Processor Design