Appendix C
Presentation Transcript
Appendix C • Pipeline implementation • Pipeline hazards, detection and forwarding • Multiple-cycle operations • MIPS R4000 • CDA5155 Fall 2014, Peir / University of Florida
Limits of Pipelining • Increasing the number of pipeline stages in a given logic block by a factor of n generally allows increasing clock speed & throughput by a factor of almost n. • Usually less than n because of overheads such as latches and imbalance of delay among the stages. • But pipelining has a natural limit: • At least 1 layer of logic gates per pipeline stage! • Practical minimum is usually several gates (2-10). • Commercial designs are approaching this point!!
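The "almost n" claim above can be checked with a small calculation. This is a minimal sketch that assumes perfectly balanced stages and a fixed latch overhead per stage; the function name and the numbers (10 ns of logic, 0.1 ns of latch delay) are illustrative assumptions, not figures from the slides.

```python
def pipelined_speedup(logic_delay_ns, stages, latch_overhead_ns):
    """Speedup of an n-stage pipeline over the same logic left unpipelined.

    Assumes the logic splits into perfectly balanced stages and that every
    stage pays the same latch overhead (both are simplifying assumptions).
    """
    unpipelined_cycle = logic_delay_ns
    pipelined_cycle = logic_delay_ns / stages + latch_overhead_ns
    return unpipelined_cycle / pipelined_cycle


# Hypothetical block: 10 ns of logic, 0.1 ns latch overhead per stage.
for n in (5, 10, 20, 100):
    print(n, round(pipelined_speedup(10.0, n, 0.1), 2))
```

As the stage count grows, the latch overhead makes up a larger share of each cycle, so the speedup falls further and further below n.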
Basic RISC Pipelining • Basic idea: • Each instruction spends 1 clock cycle in each of the 5 execution stages. • During 1 clock cycle, the pipeline can be processing (different stages of) 5 different instructions.
Pipeline Hazards • Hazards are circumstances which may lead to stalls (delays, “bubbles”) in the pipeline if not addressed. • Three major types: • Structural hazards: • Lack of HW resources to keep all instructions moving. • Data hazards • Data results of earlier instrs. not yet avail. when needed. • Control hazards • Control decisions resulting from earlier instrs. (branches) not yet made; don’t know which new instrs. to execute.
Structural Hazard Example • Suppose you had a combined instruction+data memory with only 1 read port
Three Types of Data Hazards • Let i be an earlier instruction, j a later one. • RAW (read after write) • j tries to read a value before i writes it • WAW (write after write) • i and j write to same place, but in the wrong order. • Only occurs if >1 pipeline stage can write. • WAR (write after read) • j writes a new value to a location before i has read the old one. • Only occurs if writes can happen before reads in pipeline.
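The three definitions above reduce to set intersections on the registers each instruction reads and writes. The sketch below only reports which hazards are possible between an earlier instruction i and a later instruction j; whether one actually causes a stall depends on the pipeline. The function and argument names are hypothetical.

```python
def classify_hazards(i_reads, i_writes, j_reads, j_writes):
    """Return the potential data hazards between earlier i and later j."""
    hazards = set()
    if set(i_writes) & set(j_reads):
        hazards.add("RAW")  # j reads a location that i writes
    if set(i_writes) & set(j_writes):
        hazards.add("WAW")  # both write the same location
    if set(i_reads) & set(j_writes):
        hazards.add("WAR")  # j overwrites a location that i still reads
    return hazards


# i: add r1, r2, r3    j: sub r4, r1, r5   -> j reads r1, which i writes
print(classify_hazards(["r2", "r3"], ["r1"], ["r1", "r5"], ["r4"]))
```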
Data Hazard Prevention • A clever compiler can often reschedule instructions (code motion) to avoid a stall. • A simple example:
• Original code:
    lw  r2, 0(r4)
    add r1, r2, r3    (Note: stall happens here!)
    lw  r5, 4(r4)
• Transformed code:
    lw  r2, 0(r4)
    lw  r5, 4(r4)
    add r1, r2, r3    (No stall needed!)
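The effect of the rescheduling can be checked by counting load-use stalls in both versions. This sketch assumes a 5-stage pipeline with full forwarding, where the only remaining stall is one cycle when an instruction reads a register loaded by the instruction immediately before it. The tuple encoding (opcode, destination, sources) is a hypothetical representation.

```python
def count_load_use_stalls(program):
    """Count one-cycle stalls caused by a load followed directly by a user.

    Each instruction is a tuple (opcode, dest_register, source_registers).
    """
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, curr_sources = curr
        if prev_op == "lw" and prev_dest in curr_sources:
            stalls += 1
    return stalls


original = [
    ("lw", "r2", ["r4"]),
    ("add", "r1", ["r2", "r3"]),
    ("lw", "r5", ["r4"]),
]
transformed = [original[0], original[2], original[1]]

print(count_load_use_stalls(original), count_load_use_stalls(transformed))
```

Moving the second load between the first load and the add gives the first load time to finish before r2 is needed.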
Hazard Detection Logic for Load • Note: the right-hand side of the equation should be IF/ID.IR (Fig. C.25). • Example: Detecting whether an instruction that has just been fetched needs to be stalled because of a dependence on a preceding load.
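The load interlock check can be sketched as a single comparison. This is a simplified version that uses the signal names from the hazard-unit slide (ID/EX.MemRead, ID/EX.RegisterRt) and assumes the fetched instruction's two source fields are available as IF/ID rs and rt. It does not decode the opcode, so it may stall conservatively for instructions that do not actually read rt.

```python
def must_stall(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    """True if the instruction in IF/ID depends on a load now in ID/EX.

    id_ex_mem_read: the instruction in ID/EX is a load
    id_ex_rt:       destination register number of that load
    if_id_rs/rt:    source register fields of the instruction in IF/ID
    """
    return bool(id_ex_mem_read) and id_ex_rt in (if_id_rs, if_id_rt)


# lw r2, 0(r4) followed by add r1, r2, r3: the add must wait one cycle.
print(must_stall(True, 2, 2, 3))
```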
Forwarding Situations in MIPS • Same as Figure C.26
Forwarding to the ALU • Provide multiple paths to the inputs of the ALU
Datapath with Forwarding Hardware • [Figure: pipelined datapath (IF/ID, ID/EX, EX/MEM, MEM/WB registers) with a Forward Unit that compares ID/EX.RegisterRs and ID/EX.RegisterRt against EX/MEM.RegisterRd and MEM/WB.RegisterRd]
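The comparisons made by the forward unit can be sketched as a priority check for one ALU operand. The register-field names follow the figure. The RegWrite flags and the rule that register 0 is never forwarded are assumptions taken from the usual textbook MIPS formulation, not from the slide itself.

```python
def forward_select(id_ex_reg, ex_mem_reg_write, ex_mem_rd,
                   mem_wb_reg_write, mem_wb_rd):
    """Choose the source of one ALU operand (ID/EX.RegisterRs or Rt).

    The most recent producer (EX/MEM) takes priority over MEM/WB.
    """
    if ex_mem_reg_write and ex_mem_rd != 0 and ex_mem_rd == id_ex_reg:
        return "EX/MEM"
    if mem_wb_reg_write and mem_wb_rd != 0 and mem_wb_rd == id_ex_reg:
        return "MEM/WB"
    return "REGFILE"


# The operand is r1, and the instruction one ahead is writing r1.
print(forward_select(1, True, 1, True, 1))
```

Checking EX/MEM first matters when both older instructions write the same register, since only the newer value is correct.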
Adding the Hazard Hardware • [Figure: the forwarding datapath extended with a Hazard Unit that takes ID/EX.MemRead and ID/EX.RegisterRt as inputs, alongside the Forward Unit]
Branch Hazard • Suppose the new PC value is not computed until the MEM stage. • Then we must stall 3 clocks after every branch!
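The cost of stalling on every branch can be estimated with the usual CPI formula, base CPI plus branch frequency times penalty. The 20% branch frequency below is an illustrative assumption and does not come from the slides.

```python
def effective_cpi(branch_fraction, branch_penalty_cycles, base_cpi=1.0):
    """CPI when every branch stalls the pipeline for a fixed penalty."""
    return base_cpi + branch_fraction * branch_penalty_cycles


# Hypothetical workload in which 20% of instructions are branches.
cpi_mem_resolution = effective_cpi(0.20, 3)  # new PC known in MEM
cpi_id_resolution = effective_cpi(0.20, 1)   # new PC known in ID
print(cpi_mem_resolution, cpi_id_resolution)
```

Under these assumptions, resolving the branch earlier in the pipeline cuts the added CPI in proportion to the cycles saved.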
Early Branch Resolution • Branch resolution at ID stage
Predict-Not-Taken • (Branch resolves in ID) • Same as Fig. C.12
Delayed Branches • Machine code sequence: • Branch instruction • Delay slot instruction(s) • Post-branch instructions • Branch is taken (if taken) at this point • Same as Fig. C.13
Filling the Branch-Delay Slot • For (b) and (c), the delay-slot instruction must have no side effects if the branch goes the other way! • Note: dynamic branch prediction will be covered in Chap. 3
Multi-Cycle Execution • Figure C.33 The MIPS pipeline with three additional unpipelined, floating-point, functional units.
Latency & Initiation Interval • Latency: • Extra delay cycles before result is available. • Initiation interval: • Minimum number of cycles before a new input can be given to that functional unit.
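The two definitions above translate into two simple timing rules. The table values in this sketch are illustrative assumptions chosen to resemble typical textbook numbers; they are not taken from the slides.

```python
# (latency, initiation interval) per functional unit: assumed values.
UNITS = {
    "int_alu": (0, 1),
    "load": (1, 1),
    "fp_add": (3, 1),
    "fp_mul": (6, 1),
    "fp_div": (24, 25),
}


def earliest_dependent_issue(unit, producer_issue_cycle):
    """First cycle in which a consumer of the result can issue."""
    latency, _ = UNITS[unit]
    return producer_issue_cycle + 1 + latency


def earliest_next_issue(unit, previous_issue_cycle):
    """First cycle in which the same unit can accept another operation."""
    _, interval = UNITS[unit]
    return previous_issue_cycle + interval


print(earliest_dependent_issue("fp_add", 1))  # consumer waits 3 extra cycles
print(earliest_next_issue("fp_div", 1))       # unpipelined divider stays busy
```

A latency of 0 means a dependent instruction can issue in the very next cycle. An initiation interval greater than 1 indicates a unit that is not fully pipelined.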
Pipelined Multiple-FP Operations • Figure C.35 A pipeline that supports multiple outstanding FP operations.
Pipelining FP Instructions • Notice instructions may complete out-of-order:
    MULTD  IF ID M1 M2 M3 M4 M5 M6 M7 ME WB
    ADDD   IF ID A1 A2 A3 A4 ME WB
    LD     IF ID EX ME WB
    SD     IF ID EX ME WB
• Raises the possibility of WAW hazards, and structural hazards in MEM & WB stages. • Structural hazards may occur especially often with non-pipelined DIV unit. • Out-of-order completion impacts exception handling.
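The out-of-order completion can be made concrete by computing the cycle in which each instruction reaches WB. This sketch assumes one instruction is fetched per cycle starting at cycle 1, with no stalls. The execution-stage counts are read off the stage lists above.

```python
# Number of execution stages between ID and ME, from the stage lists above.
EX_STAGES = {"MULTD": 7, "ADDD": 4, "LD": 1, "SD": 1}


def wb_cycle(op, fetch_cycle):
    """Cycle of the WB stage: IF, ID, the execution stages, ME, then WB."""
    id_cycle = fetch_cycle + 1
    me_cycle = id_cycle + EX_STAGES[op] + 1
    return me_cycle + 1


program = ["MULTD", "ADDD", "LD", "SD"]
wb = {op: wb_cycle(op, fetch) for fetch, op in enumerate(program, start=1)}
completion_order = sorted(program, key=wb.get)

print(wb)
print(completion_order)
```

Under these assumptions the first instruction fetched (MULTD) is the last to reach WB, which is why WAW hazards and exception ordering become an issue.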
Issues in Multi-Cycle Operations • Stalls for RAW hazards are longer and more frequent (Fig. C.37) • WAW is possible; WAR is not (why?) • Structural hazards are possible for non-pipelined units • Multiple WBs in the same cycle are likely (Fig. C.38) • Handling hazards • At the issue (ID) stage: • Check structural hazards: functional unit, WB port • Check RAW hazards: issue with forwarding • Check WAW hazards: do not issue, so that writes happen in order • Detect and stall the instruction before the MEM and WB stages • More uniform handling given in Chapter 3.
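The WB-port check at issue time can be sketched as a reservation table: an instruction may leave ID only if the cycle in which it will reach WB has not already been claimed. This is a minimal model that assumes a single write port and in-order issue, and it ignores RAW and WAW checks. The encoding of each operation as (name, cycles from ID to WB) is hypothetical.

```python
def issue_with_wb_check(ops):
    """Schedule in-order issue so that no two ops share a WB cycle.

    ops: list of (name, cycles_from_id_to_wb).
    Returns a list of (name, id_cycle, wb_cycle).
    """
    reserved_wb_cycles = set()
    id_cycle = 0
    schedule = []
    for name, distance in ops:
        id_cycle += 1  # at best, one instruction issues per cycle
        while id_cycle + distance in reserved_wb_cycles:
            id_cycle += 1  # stall in ID until the write port is free
        reserved_wb_cycles.add(id_cycle + distance)
        schedule.append((name, id_cycle, id_cycle + distance))
    return schedule


# ID-to-WB distances: MULTD 9 (M1-M7, ME, WB), ADDD 6, LD 3.
ops = [("MULTD", 9), ("LD", 3), ("LD", 3), ("ADDD", 6), ("LD", 3)]
for entry in issue_with_wb_check(ops):
    print(entry)
```

In this example the ADDD would reach WB in the same cycle as the MULTD, so it is held in ID for one extra cycle.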
Maintaining Precise Exceptions • Settle for imprecise exceptions • Buffer results and complete in order • Requires large buffers and comparators • History file, future file approaches • Software trap handling when an exception occurs • Hybrid scheme: issue only when certain that no earlier instruction can raise an exception • All instructions before can be completed • No instructions after can be completed
Real MIPS R4000 Pipeline • IF,IS - Instruction cache fetch, First & Second halves. • RF - Inst. decode, Register Fetch, hazard check… • EX - Execution (EA calc, ALU op, target calc…) • DF,DS - Data cache access, First & Second halves. • TC - Tag Check, did cache access hit? • Note, use data before resolving hit/miss. • WB - Write-Back for loads & register-register ops. • Read through C.43 – C.51