This presentation introduces the concept of pipelined datapath in computer architecture and assembly language. It covers the advantages and disadvantages of single cycle and multicycle implementations, as well as the basics of pipelining. The MIPS processor is used as an example to illustrate the pipelined datapath.
 
                
14:332:331 Computer Architecture and Assembly Language
Spring 2005, Week 11
Introduction to Pipelined Datapath
[Adapted from Dave Patterson’s UCB CS152 slides and Mary Jane Irwin’s PSU CSE331 slides]
Heads Up
• Reminders
• Pipelined datapath and control
• HW#6 will be handed out soon
Review: Multicycle Data and Control Path
[Figure: multicycle datapath driven by the control FSM. Components: PC, Memory (instructions or data), IR, MDR, Register File, A and B registers, ALU, ALUout, Sign Extend, Shift left 2, ALU control. Control signals: PCWriteCond, PCWrite, PCSource, IorD, ALUOp, MemRead, MemWrite, ALUSrcA, ALUSrcB, MemtoReg, RegWrite, IRWrite, RegDst.]
Review: Multicycle Datapath FSM
• Unless otherwise assigned: PCWrite, IRWrite, MemWrite, RegWrite = 0; others = X
• State 0, Instr Fetch (Start): IorD=0, MemRead, IRWrite, ALUSrcA=0, ALUSrcB=01, PCSource=00, ALUOp=00, PCWrite
• State 1, Decode: ALUSrcA=0, ALUSrcB=11, ALUOp=00, PCWriteCond=0
• State 2, Execute (Op = lw or sw): ALUSrcA=1, ALUSrcB=10, ALUOp=00, PCWriteCond=0
• State 6, Execute (Op = R-type): ALUSrcA=1, ALUSrcB=00, ALUOp=10, PCWriteCond=0
• State 8, Execute (Op = beq): ALUSrcA=1, ALUSrcB=00, ALUOp=01, PCSource=01, PCWriteCond
• State 9, Execute (Op = j): PCSource=10, PCWrite
• State 3, Memory Access (Op = lw): MemRead, IorD=1, PCWriteCond=0
• State 5, Memory Access (Op = sw): MemWrite, IorD=1, PCWriteCond=0
• State 7 (follows state 6): RegDst=1, RegWrite, MemtoReg=0, PCWriteCond=0
• State 4, Write Back (follows state 3): RegDst=0, RegWrite, MemtoReg=1, PCWriteCond=0
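The FSM reviewed above can be sketched as a next-state table to see how many clock cycles each instruction class needs. This is a minimal illustration, not the course's implementation: the transition edges are assumed from the standard ten-state textbook diagram (the flattened figure does not show the arrows), and the opcode classes are plain strings rather than decoded Instr[31-26] bits.

```python
# Next-state function for the ten-state multicycle control FSM (states 0-9).
# Assumption: edges follow the usual textbook diagram; opcode classes are strings.
NEXT_STATE = {
    0: lambda op: 1,                                   # Instr Fetch -> Decode
    1: lambda op: {"lw": 2, "sw": 2, "R-type": 6,
                   "beq": 8, "j": 9}[op],              # Decode dispatches on opcode
    2: lambda op: {"lw": 3, "sw": 5}[op],              # address computed -> memory access
    3: lambda op: 4,                                   # lw: memory read -> write back
    4: lambda op: 0,                                   # lw done
    5: lambda op: 0,                                   # sw done
    6: lambda op: 7,                                   # R-type: execute -> register write
    7: lambda op: 0,                                   # R-type done
    8: lambda op: 0,                                   # beq done
    9: lambda op: 0,                                   # j done
}


def state_sequence(op):
    """Return the states visited while executing one instruction of class `op`."""
    state, visited = 0, []
    while True:
        visited.append(state)
        state = NEXT_STATE[state](op)
        if state == 0:          # back at Instr Fetch: the instruction is finished
            return visited


def cycles(op):
    """Clock cycles the multicycle design spends on one instruction."""
    return len(state_sequence(op))


for op in ("lw", "sw", "R-type", "beq", "j"):
    print(op, state_sequence(op), cycles(op))
```

Under these assumptions lw visits five states, sw and R-type four, and beq and j three, which is the "different instructions take a different number of clock cycles" property of the multicycle design.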
Review: FSM Implementation
[Figure: combinational control logic plus a state register clocked by the system clock. Inputs: opcode bits Op5-Op0 (Inst[31-26]) and the current state. Outputs: PCWrite, PCWriteCond, IorD, MemRead, MemWrite, IRWrite, MemtoReg, PCSource, ALUOp, ALUSourceB, ALUSourceA, RegWrite, RegDst, and the next state.]
Single Cycle Disadvantages & Advantages
• Uses the clock cycle inefficiently: the clock cycle must be timed to accommodate the slowest instruction
• Is wasteful of area, since some functional units (e.g., adders) must be duplicated because they cannot be shared within a clock cycle
but
• Is simple and easy to understand
[Figure: single cycle implementation timing. lw fills Cycle 1; sw finishes early in Cycle 2 and the rest of that cycle is wasted.]
Multicycle Advantages & Disadvantages
• Uses the clock cycle efficiently: the clock cycle is timed to accommodate the slowest instruction step
  • balance the amount of work to be done in each step
  • restrict each step to use only one major functional unit
• Multicycle implementations allow
  • functional units to be used more than once per instruction as long as they are used on different clock cycles
  • faster clock rates
  • different instructions to take a different number of clock cycles
but
• Requires additional internal state registers, muxes, and more complicated (FSM) control
The Five Stages of a Load Instruction
• IFetch: Instruction Fetch and Update PC
• Dec: Register Fetch and Instruction Decode
• Exec: Execute R-type; calculate memory address
• Mem: Read/write the data from/to the Data Memory
• WB: Write the data back to the register file
[Figure: lw occupies IFetch, Dec, Exec, Mem, WB in Cycles 1 through 5.]
Single Cycle vs. Multiple Cycle Timing
[Figure: Single Cycle Implementation: lw takes Cycle 1; sw takes Cycle 2, part of which is wasted. Multiple Cycle Implementation: lw takes Cycles 1-5 (IFetch, Dec, Exec, Mem, WB), sw takes Cycles 6-9 (IFetch, Dec, Exec, Mem), and the R-type instruction begins its IFetch in Cycle 10.]
• Note: the multicycle clock period is longer than 1/5th of the single cycle clock period because of the overhead of the flip-flops between stages
Pipelined MIPS Processor
• Start the next instruction while still working on the current one
  • improves throughput: the total amount of work done in a given time
  • instruction latency (execution time, delay time, response time), the time from the start of an instruction to its completion, is not reduced
[Figure: lw, sw, and an R-type instruction each pass through IFetch, Dec, Exec, Mem, WB, with each instruction starting one cycle after the one before it (time axis: Cycle 1 through Cycle 8).]
Single Cycle, Multiple Cycle, vs. Pipeline
[Figure: the same instructions under each implementation. Single Cycle Implementation: Load in Cycle 1, Store in Cycle 2 with part of the cycle wasted. Multiple Cycle Implementation: lw in Cycles 1-5, sw in Cycles 6-9, R-type starting in Cycle 10. Pipeline Implementation: lw, sw, and R-type each run IFetch, Dec, Exec, Mem, WB, overlapped and offset by one cycle. One cycle in the figure is marked as wasted.]
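The three-way comparison can be made concrete with a small cycle-count calculation. This is a sketch under stated assumptions: every stage is taken to need the same time `stage_ps` (a hypothetical parameter, 200 ps in the usage lines), `overhead_ps` models the extra flip-flop delay per short cycle, and the pipeline is assumed to run without hazards. The per-instruction step counts come from the timing figures (lw: 5, sw: 4) and the FSM review (R-type: 4).

```python
# Steps each instruction class takes in the multicycle design.
MULTICYCLE_STEPS = {"lw": 5, "sw": 4, "R-type": 4}
PIPE_STAGES = 5  # IFetch, Dec, Exec, Mem, WB


def single_cycle_ps(program, stage_ps):
    """One long cycle per instruction, sized for the slowest one (lw, five steps)."""
    return len(program) * PIPE_STAGES * stage_ps


def multicycle_ps(program, stage_ps, overhead_ps=0):
    """Each instruction takes only the steps it needs; each step is one short cycle."""
    return sum(MULTICYCLE_STEPS[op] for op in program) * (stage_ps + overhead_ps)


def pipelined_ps(program, stage_ps, overhead_ps=0):
    """Five cycles for the first instruction, then one more per following instruction."""
    if not program:
        return 0
    return (PIPE_STAGES + len(program) - 1) * (stage_ps + overhead_ps)


program = ["lw", "sw", "R-type"]
print(single_cycle_ps(program, 200))  # 3 * 5 * 200 = 3000
print(multicycle_ps(program, 200))    # (5 + 4 + 4) * 200 = 2600
print(pipelined_ps(program, 200))     # (5 + 3 - 1) * 200 = 1400
```

With a nonzero `overhead_ps` the multicycle and pipelined totals grow, which is the flip-flop overhead effect noted on the timing slide.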
Pipelining the MIPS ISA
• What makes it easy
  • all instructions are the same length (32 bits)
  • few instruction formats (three) with symmetry across formats
  • memory operations can occur only in loads and stores
  • operands must be aligned in memory so a single data transfer requires only one memory access
• What makes it hard
  • structural hazards: what if we had only one memory?
  • control hazards: what about branches?
  • data hazards: what if an instruction’s input operands depend on the output of a previous instruction?
MIPS Pipeline Datapath Modifications
• What do we need to add/modify in our MIPS datapath?
  • State registers between pipeline stages to isolate them
[Figure: pipelined datapath split into IFetch, Dec, Exec, Mem, WB by the state registers IFetch/Dec, Dec/Exec, Exec/Mem, and Mem/WB, all driven by the system clock. Components: PC, Instruction Memory, two adders (one adding 4 to the PC), Register File, Sign Extend (16 to 32 bits), Shift left 2, ALU, Data Memory.]
MIPS Pipeline Control Path Modifications
• All control signals are determined during Decode
  • and held in the state registers between pipeline stages
[Figure: the pipelined datapath with a Control unit added in the Dec stage; its outputs travel through the Dec/Exec, Exec/Mem, and Mem/WB state registers alongside the data.]
Graphically Representing MIPS Pipeline
[Figure: one instruction drawn as the sequence IM, Reg, ALU, DM, Reg.]
• Can help with answering questions like:
  • how many cycles does it take to execute this code?
  • what is the ALU doing during cycle 4?
  • is there a hazard, why does it occur, and how can it be fixed?
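The questions above can also be answered with a few lines of code that build the same chart as text. This is only an illustrative sketch: it assumes an ideal five-stage pipeline with one instruction issued per cycle and no stalls, and the function names are made up for the example.

```python
STAGES = ["IFetch", "Dec", "Exec", "Mem", "WB"]


def stage_in_cycle(instr_index, cycle):
    """Stage occupied by instruction `instr_index` (0-based) in `cycle` (1-based), or None."""
    k = cycle - 1 - instr_index      # instruction i enters IFetch in cycle i + 1
    return STAGES[k] if 0 <= k < len(STAGES) else None


def total_cycles(instrs):
    """Cycles needed to run the whole sequence: fill time plus one per extra instruction."""
    return len(instrs) + len(STAGES) - 1 if instrs else 0


def pipeline_chart(instrs):
    """Text chart with one row per instruction and one column per clock cycle."""
    rows = []
    for i, name in enumerate(instrs):
        cells = [stage_in_cycle(i, c) or "." for c in range(1, total_cycles(instrs) + 1)]
        rows.append(f"{name:<8}" + " ".join(f"{s:<6}" for s in cells))
    return "\n".join(rows)


def in_exec(instrs, cycle):
    """Which instruction is using the ALU (Exec stage) in the given cycle, if any."""
    for i, name in enumerate(instrs):
        if stage_in_cycle(i, cycle) == "Exec":
            return name
    return None


code = ["lw", "sw", "add"]
print(pipeline_chart(code))
print("cycles:", total_cycles(code))             # 3 + 5 - 1 = 7
print("ALU in cycle 4:", in_exec(code, 4))       # the second instruction, sw
```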
Why Pipeline? For Throughput!
[Figure: Inst 0 through Inst 4 in instruction order against time (clock cycles), each drawn as IM, Reg, ALU, DM, Reg and offset by one cycle from the one before.]
• Time to fill the pipeline
• Once the pipeline is full, one instruction is completed every cycle
Can pipelining get us into trouble?
• Yes: Pipeline Hazards
  • structural hazards: attempt to use the same resource by two different instructions at the same time
  • data hazards: attempt to use item before it is ready
    • instruction depends on result of prior instruction still in the pipeline
  • control hazards: attempt to make a decision before condition is evaluated
    • branch instructions
• Can always resolve hazards by waiting
  • pipeline control must detect the hazard
  • take action (or delay action) to resolve hazards
A Unified Memory Would Be a Structural Hazard
[Figure: lw, Inst 1, Inst 2, Inst 3, Inst 4 in instruction order against time (clock cycles), each drawn as Mem, Reg, ALU, Mem, Reg. lw is reading data from memory in the same cycle in which a later instruction is reading its instruction from memory.]
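The conflict in the figure can be found mechanically by listing, cycle by cycle, who wants the single memory. The sketch below assumes an ideal five-stage pipeline with one instruction fetched per cycle, so instruction i (0-based) is fetched in cycle i + 1 and reaches its Mem stage in cycle i + 4; only lw and sw are treated as touching data memory. The function name is hypothetical.

```python
def memory_conflicts(instrs):
    """Cycles in which a unified memory is asked for an instruction fetch and a data access.

    `instrs` is a list of opcode strings issued one per cycle with no stalls.
    """
    fetch_cycles = {i + 1 for i in range(len(instrs))}
    data_cycles = {i + 4 for i, op in enumerate(instrs) if op in ("lw", "sw")}
    return sorted(fetch_cycles & data_cycles)


# lw followed by four instructions that do not touch data memory
print(memory_conflicts(["lw", "add", "sub", "and", "or"]))   # [4]
# no loads or stores: the single memory is only used for fetches
print(memory_conflicts(["add", "sub", "and"]))               # []
```

In the first case the conflict falls in cycle 4, when lw is in its Mem stage and the fourth instruction in the sequence is being fetched. Separate instruction and data memories remove the overlap.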
How About Register File Access?
[Figure: add, Inst 1, Inst 2, add, Inst 4 in instruction order against time (clock cycles), each drawn as IM, Reg, ALU, DM, Reg. The first add writes the register file in the same cycle in which the second add reads it.]
• Can fix register file access hazard by doing writes in the first half of the cycle and reads in the second half
Branch Instructions Cause Control Hazards
• Dependencies backward in time cause hazards
[Figure: add, beq, lw, Inst 3, Inst 4 in instruction order, each drawn as IM, Reg, ALU, DM, Reg.]
One Way to “Fix” a Control Hazard
[Figure: add, beq, two stall slots, then lw and Inst 3 in instruction order.]
• Can fix branch hazard by waiting (stall), but this affects throughput
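How much stalling on branches costs can be estimated with a one-line formula: average cycles per instruction is 1 plus the fraction of branches times the stall cycles per branch. The sketch below assumes an otherwise ideal pipeline; the branch fraction and stall count in the usage lines are hypothetical inputs chosen for illustration, not measurements from the slides.

```python
def cpi_with_branch_stalls(branch_fraction, stall_cycles):
    """Average cycles per instruction when every branch stalls the pipeline.

    Assumes an otherwise ideal pipeline (CPI of 1 with no other hazards).
    """
    return 1.0 + branch_fraction * stall_cycles


def relative_throughput(branch_fraction, stall_cycles):
    """Throughput as a fraction of the ideal one-instruction-per-cycle rate."""
    return 1.0 / cpi_with_branch_stalls(branch_fraction, stall_cycles)


# Hypothetical workload: one instruction in five is a branch, two stall cycles each.
print(cpi_with_branch_stalls(0.2, 2))
print(relative_throughput(0.2, 2))
```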
Register Usage Can Cause Data Hazards
• Dependencies backward in time cause hazards
[Figure: the following sequence in instruction order, each instruction drawn as IM, Reg, ALU, DM, Reg; the instructions after the add all read r1.]
add r1,r2,r3
sub r4,r1,r5
and r6,r1,r7
or r8,r1,r9
xor r4,r1,r5
One Way to “Fix” a Data Hazard
[Figure: add r1,r2,r3, then two stall slots, then sub r4,r1,r5 and and r6,r1,r7 in instruction order.]
• Can fix data hazard by waiting (stall), but this affects throughput
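The two stall slots in the figure can be reproduced with a small scheduler that delays each instruction's Dec stage until its source registers have been written. This is a sketch under stated assumptions: there is no forwarding, instructions issue in order, and the register file is written in the first half of a cycle and read in the second half, so a value written in WB can be read by a Dec stage in the same cycle. The tuple layout `(text, dest, sources)` is invented for the example.

```python
def decode_cycles(program):
    """Return ([(text, decode_cycle), ...], total_stall_cycles).

    `program` is a list of (text, dest, sources); dest may be None.
    """
    ready = {}       # register -> first cycle in which a Dec-stage read sees the new value
    prev = 1         # the first instruction is fetched in cycle 1, so its Dec is cycle 2
    schedule, stalls = [], 0
    for text, dest, sources in program:
        earliest = prev + 1                                   # in order, one Dec per cycle
        cycle = max([earliest] + [ready.get(r, 0) for r in sources])
        stalls += cycle - earliest
        schedule.append((text, cycle))
        if dest is not None:
            ready[dest] = cycle + 3                           # Dec -> Exec -> Mem -> WB
        prev = cycle
    return schedule, stalls


program = [
    ("add r1,r2,r3", "r1", ["r2", "r3"]),
    ("sub r4,r1,r5", "r4", ["r1", "r5"]),
    ("and r6,r1,r7", "r6", ["r1", "r7"]),
]
schedule, stalls = decode_cycles(program)
print(schedule)   # add decodes in cycle 2, sub waits until cycle 5, and follows in cycle 6
print(stalls)     # 2
```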
Loads Can Cause Data Hazards
• Dependencies backward in time cause hazards
[Figure: the following sequence in instruction order, each instruction drawn as IM, Reg, ALU, DM, Reg; the instructions after the lw all read r1.]
lw r1,100(r2)
sub r4,r1,r5
and r6,r1,r7
or r8,r1,r9
xor r4,r1,r5
Stores Can Cause Data Hazards
• Dependencies backward in time cause hazards
[Figure: the following sequence in instruction order, each instruction drawn as IM, Reg, ALU, DM, Reg; the sw stores the r1 value produced by the add.]
add r1,r2,r3
sw r1,100(r5)
and r6,r1,r7
or r8,r1,r9
xor r4,r1,r5
Other Pipeline Structures Are Possible
• What about (slow) multiply operation?
  • let it take two cycles
• What if the data memory access is twice as slow as the instruction memory?
  • make the clock twice as slow or …
  • let data memory access take two cycles (and keep the same clock rate)
[Figure: two pipeline variants. One adds a MUL unit alongside the ALU: IM, Reg, ALU (with MUL), DM, Reg. The other splits data memory access into two stages: IM, Reg, ALU, DM1, DM2, Reg.]
Sample Pipeline Alternatives
• ARM7: IM, Reg, EX
  • IM: PC update, IM access
  • Reg: decode, reg access
  • EX: ALU op, DM access, shift/rotate, commit result (write back)
• StrongARM-1: IM, Reg, ALU, DM, Reg
• XScale: IM1, IM2, Reg, SHFT, ALU, DM1, DM2, Reg
  • stage annotations in the figure: PC update, BTB access, start IM access; IM access; decode, reg 1 access; shift/rotate, reg 2 access; ALU op; start DM access, exception; DM write, reg write
Summary
• All modern day processors use pipelining
• Pipelining doesn’t help the latency of a single task; it helps the throughput of the entire workload
• Multiple tasks operate simultaneously using different resources
• Potential speedup = number of pipe stages
• Pipeline rate is limited by the slowest pipeline stage
• Unbalanced lengths of pipe stages reduce speedup
• Time to “fill” the pipeline and time to “drain” it reduce speedup
• Must detect and resolve hazards
• Stalling negatively affects throughput
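The gap between the potential speedup and what fill and drain time allow can be checked numerically. The sketch below assumes perfectly balanced stages, no hazards, and an unpipelined baseline that spends one cycle of the same length in every stage for every instruction; under those assumptions n instructions on a k-stage pipeline take k + n - 1 cycles instead of n * k.

```python
def pipeline_speedup(n_instructions, n_stages):
    """Speedup of an ideal k-stage pipeline over running each instruction to completion.

    Assumes balanced stages, no stalls, and equal cycle time in both designs.
    """
    unpipelined_cycles = n_instructions * n_stages
    pipelined_cycles = n_stages + n_instructions - 1   # fill the pipe, then one per cycle
    return unpipelined_cycles / pipelined_cycles


for n in (1, 5, 100, 1_000_000):
    print(n, pipeline_speedup(n, 5))
```

For a single instruction the speedup is exactly 1, and it approaches the number of stages only as the instruction count grows large.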