CMPS 255 Computer Architecture: Pipelining, Hazards, and Exceptions (PH: 4.5, 4.9)

Instruction Times (Critical Paths)
• What is the clock cycle time, assuming negligible delays for muxes, control unit, sign extend, PC access, shift left 2, wires, and setup and hold times? The only non-negligible delays are:
  • Instruction and Data Memory (200 ps)
  • ALU and adders (200 ps)
  • Register File access (reads or writes) (100 ps)
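The question above can be worked through with a small sketch. The delays come from the slide; the list of units each instruction class passes through is an assumption based on the usual single-cycle MIPS datapath, and the names `DELAY_PS`, `PATHS`, `instruction_time_ps`, and `single_cycle_clock_ps` are illustrative only.

```python
# Delays in picoseconds, taken from the slide; all other delays are treated as negligible.
DELAY_PS = {"IM": 200, "RegRead": 100, "ALU": 200, "DM": 200, "RegWrite": 100}

# Units on each instruction's critical path (assumed breakdown, not stated on the slide).
PATHS = {
    "lw": ["IM", "RegRead", "ALU", "DM", "RegWrite"],
    "sw": ["IM", "RegRead", "ALU", "DM"],
    "R-type": ["IM", "RegRead", "ALU", "RegWrite"],
    "beq": ["IM", "RegRead", "ALU"],
}


def instruction_time_ps(name):
    """Sum the delays of the units one instruction class uses."""
    return sum(DELAY_PS[unit] for unit in PATHS[name])


def single_cycle_clock_ps():
    """A single-cycle clock must fit the slowest instruction."""
    return max(instruction_time_ps(name) for name in PATHS)


if __name__ == "__main__":
    for name in PATHS:
        print(name, instruction_time_ps(name), "ps")
    print("clock cycle:", single_cycle_clock_ps(), "ps")
```

Under these assumptions lw is the slowest instruction, so it sets the single-cycle clock period.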

Single Cycle Disadvantages & Advantages
• Uses the clock cycle inefficiently: the clock cycle must be timed to accommodate the slowest instruction
  • especially problematic for more complex instructions like floating point multiply
• May be wasteful of area, since some functional units (e.g., adders) must be duplicated because they cannot be shared during a clock cycle
but
• Is simple and easy to understand
[Figure: single-cycle clock timing for lw (Cycle 1) and sw (Cycle 2), with the unused part of the sw cycle marked "Waste"]

Single Cycle DataPath
• Executing an instruction classically takes five stages of the datapath:
  1. Instruction Fetch (IFetch)
  2. Instruction Decode (IDecode)
  3. ALU Computation or Execute
  4. Memory Access
  5. Write Back to Registers (WB)
[Figure: single-cycle datapath labeled with the five stages: PC, +4, and instruction memory (1. Instruction Fetch); registers with rs, rt, rd, and imm (2. Decode/Register Read); ALU (3. Execute); data memory (4. Memory); 5. WriteBack]

Multicycle Implementation Overview
• Each instruction step takes 1 clock cycle
  • Therefore, an instruction takes more than 1 clock cycle to complete
• Not every instruction takes the same number of clock cycles to complete
• Multicycle implementations allow
  • faster clock rates
  • different instructions to take a different number of clock cycles
  • functional units to be used more than once per instruction, as long as they are used on different clock cycles; as a result
    • only need one memory
    • only need one ALU/adder
[Figure: multicycle timing of lw and sw across the IFetch, Dec, Exec, Mem, and WB steps, Cycles 1 to 5]

Single Cycle Vs Multiple Cycle
[Figure: timing comparison. Single Cycle Implementation: lw in Cycle 1, sw in Cycle 2 with part of the cycle wasted. Multiple Cycle Implementation: over Cycles 1 to 10, lw takes IFetch, Dec, Exec, Mem, WB; sw takes IFetch, Dec, Exec, Mem; an R-type instruction then begins with IFetch.]

How Can We Make It Even Faster?
• Start fetching and executing the next instruction before the current one has completed; i.e., multiple instructions are overlapped in execution.
  • Pipelining: technique in which multiple instructions are overlapped in execution
  • Modern processors are pipelined for performance
  • Split the instruction execution into steps, where each step completes a part of an instruction. Each step is called a pipeline stage or a pipeline segment.
  • The stages or steps are connected in a linear fashion, one stage to the next, to form the pipeline (or pipelined CPU datapath): instructions enter at one end, progress through the stages, and exit at the other end.
• Fetch (and execute) more than one instruction at a time.
• Fetch (and execute) instructions from more than one instruction stream (multithreading (hyperthreading)).

Single Cycle versus Pipeline
[Figure: Single Cycle Implementation (CC = 800 ps): lw in Cycle 1, sw in Cycle 2 with part of the cycle wasted. Pipeline Implementation (CC = 200 ps): lw, sw, and an R-type instruction each pass through IFetch, Dec, Exec, Mem, WB, overlapped one stage apart.]
• To complete an entire instruction in the pipelined case takes 1000 ps (as compared to 800 ps for the single cycle case). Why?
• To complete two takes 1200 ps instead of 1600 ps
• How long does each take to complete 1,000,000 adds?
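The numbers on this slide follow from a simple count, sketched below. The sketch assumes an ideal five-stage pipeline with no stalls, where the first instruction needs all five 200 ps stages and each later instruction finishes one cycle after the one before it. The function names are illustrative only.

```python
def single_cycle_time_ps(n_instructions, clock_ps=800):
    """Single cycle: every instruction takes one full (long) clock cycle."""
    return n_instructions * clock_ps


def pipelined_time_ps(n_instructions, stages=5, clock_ps=200):
    """Ideal pipeline: (stages - 1) cycles to fill, then one instruction completes per cycle."""
    return (stages + n_instructions - 1) * clock_ps


if __name__ == "__main__":
    for n in (1, 2, 1_000_000):
        print(n, single_cycle_time_ps(n), pipelined_time_ps(n))
```

With one instruction the pipeline is slower (5 × 200 ps versus 800 ps) because every instruction passes through all five stages at the slowest stage's pace; the benefit appears only as instructions overlap.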

The Laundry analogy with pipelining
• Ann, Brian, Cathy, Dave (A, B, C, D) each have one load of clothes to wash, dry, fold, and put away
  • Washer takes 30 minutes
  • Dryer takes 30 minutes
  • “Folder” takes 30 minutes
  • “Stasher” takes 30 minutes to put clothes into drawers

The Laundry analogy: Sequential Laundry
• Sequential laundry takes 8 hours for 4 loads
[Figure: task order A, B, C, D against a time axis from 6 PM to 2 AM; each load occupies four consecutive 30-minute steps, one load after another]

The Laundry analogy with pipelining
• Multiple tasks operating simultaneously using different resources
• Pipelined laundry takes 3.5 hours for 4 loads!
[Figure: task order A, B, C, D against a time axis starting at 6 PM; the loads overlap, each starting 30 minutes after the previous one, for seven 30-minute slots in total]

Pipelining Lessons
• Clock cycle must be long enough to accommodate the slowest operation
• Clock cycle length is determined by the time required for the slowest pipeline stage
• In our example, the pipelined execution clock cycle must have the worst-case clock cycle of 200 ps
• An important pipeline design consideration is to balance the length of each pipeline stage.
• If all stages are perfectly balanced, then the time per instruction on a pipelined machine (assuming ideal conditions) is:
$$\text{Time between instructions}_{\text{pipelined}} = \frac{\text{Time between instructions}_{\text{nonpipelined}}}{\text{Number of pipeline stages}}$$

Pipelining Lessons
• Fourfold performance improvement:
  • Nonpipelined: time between the first and fourth instructions is 3 × 800 ps or 2400 ps
  • Pipelined: time between the first and fourth instructions is 3 × 200 ps or 600 ps

Pipelining Lessons
• Under ideal conditions and with a large number of instructions, the speed-up from pipelining is approximately equal to the number of pipe stages; a five-stage pipeline is nearly five times faster
• Potential speedup from pipelining = the number of pipeline stages = n
• Goal: one instruction is completed every cycle:
  • CPI = 1
  • CPI is Clock Cycles per Instruction.
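A short derivation shows why the speedup approaches the number of stages. It assumes n perfectly balanced stages of duration t each, N instructions, and no stalls; N and t are symbols introduced here for illustration.

$$T_{\text{nonpipelined}} = N \cdot n \cdot t, \qquad T_{\text{pipelined}} = (n + N - 1) \cdot t$$

$$\text{Speedup} = \frac{T_{\text{nonpipelined}}}{T_{\text{pipelined}}} = \frac{N \cdot n}{n + N - 1} \longrightarrow n \quad \text{as } N \to \infty$$

The n − 1 term is the time to fill the pipeline; it matters for short instruction sequences and becomes negligible for long ones.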

Pipelining Lessons
• Total execution time for three instructions: 1400 ps versus 2400 ps.
• Fourfold performance?
  • Not for three instructions: the number of instructions is not large
• If we add 1,000,000 instructions to reach 1,000,003:
  • Pipelined total execution time: 1,000,000 × 200 ps + 1400 ps = 200,001,400 ps
  • Nonpipelined: 1,000,000 × 800 ps + 2400 ps = 800,002,400 ps
  • Ratio: 800,002,400 / 200,001,400 ≈ 4.00

Pipelining Lessons
• The formula suggests that a five-stage pipeline offers nearly a fivefold improvement over the 800 ps nonpipelined time, or an 800/5 = 160 ps clock cycle
• However, the stages may be imperfectly balanced
• Pipelining involves some overhead (seen next)
• Thus, the time per instruction in the pipelined processor will exceed the minimum possible, and speed-up will be less than the number of pipeline stages
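The gap between the ideal 160 ps and the actual 200 ps clock can be checked with a small sketch. The per-stage times are an assumption obtained by mapping each stage to the unit delay given earlier (memory and ALU 200 ps, register file 100 ps); the names are illustrative only.

```python
# Assumed per-stage times in ps, derived from the unit delays given earlier in the slides.
STAGE_PS = {"IFetch": 200, "Dec": 100, "Exec": 200, "Mem": 200, "WB": 100}


def ideal_clock_ps(stage_ps):
    """Clock period if the same total work were split into perfectly balanced stages."""
    return sum(stage_ps.values()) / len(stage_ps)


def actual_clock_ps(stage_ps):
    """Clock period is set by the slowest stage."""
    return max(stage_ps.values())


def max_speedup(stage_ps):
    """Upper bound on speedup over the single-cycle design, ignoring fill time and overhead."""
    return sum(stage_ps.values()) / actual_clock_ps(stage_ps)


if __name__ == "__main__":
    print(ideal_clock_ps(STAGE_PS), actual_clock_ps(STAGE_PS), max_speedup(STAGE_PS))
```

With these stage times the bound is fourfold rather than fivefold, which matches the 800 ps versus 200 ps comparison used in the slides.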

A Pipelined MIPS Processor
• Start the next instruction before the current one has completed
  • improves throughput: total amount of work done in a period of time
  • instruction latency (execution time, delay time, response time: the time from the start of an instruction to its completion) is not reduced
• Clock cycle (pipeline stage time) is limited by the slowest stage
  • for some instructions, some stages are wasted cycles
[Figure: lw, sw, and an R-type instruction on a time axis of Cycles 1 to 8, each passing through IFetch, Dec, Exec, Mem, WB and overlapped one stage apart]

Single Cycle, Multiple Cycle, vs. Pipeline
[Figure: timing comparison. Single Cycle Implementation: lw in Cycle 1, sw in Cycle 2 with part of the cycle wasted. Multiple Cycle Implementation: over Cycles 1 to 10, lw takes IFetch, Dec, Exec, Mem, WB; sw takes IFetch, Dec, Exec, Mem; an R-type instruction then begins with IFetch. Pipeline Implementation: lw, sw, and an R-type instruction each pass through IFetch, Dec, Exec, Mem, WB, overlapped one stage apart.]

Pipelining the MIPS ISA
• What makes it easy
  • all instructions are the same length (32 bits)
    • can fetch in the 1st stage and decode in the 2nd stage
  • few instruction formats (three) with symmetry across formats
    • can decode and read registers in one step
    • begin reading register file in 2nd stage
  • memory operations can occur only in loads and stores
    • can calculate address in 3rd stage, access memory in 4th stage
    • use the execute stage to calculate memory addresses
  • alignment of memory operands
  • each MIPS instruction writes at most one result and does so near the end of the pipeline (MEM and WB)

Hazards
• What makes it hard
• Hazards: situations that prevent starting the next instruction in the next cycle
  • structural hazards: what if we had only one memory?
    • A required resource is busy
  • control hazards: what about branches?
    • Deciding on control action depends on previous instruction
  • data hazards: what if an instruction’s input operands depend on the output of a previous instruction?
    • Need to wait for previous instruction to complete its data read/write

Graphically Representing MIPS Pipeline
[Figure: one instruction drawn as the stage sequence IM, Reg, ALU, DM, Reg]
• Can help with answering questions like:
  • How many cycles does it take to execute this code?
  • What is the ALU doing during cycle 4?
  • Is there a hazard, why does it occur, and how can it be fixed?

Why Pipeline? For Performance!
• Once the pipeline is full, one instruction is completed every cycle, so CPI = 1
[Figure: Inst 0 to Inst 4 in instruction order against time (clock cycles), each drawn as IM, Reg, ALU, DM, Reg and offset by one cycle; the initial cycles are labeled "Time to fill the pipeline"]

Can Pipelining Get Us Into Trouble?
• Yes: Pipeline Hazards
  • structural hazards: attempt to use the same resource by two different instructions at the same time
  • data hazards: attempt to use data before it is ready
    • An instruction’s source operand(s) are produced by a prior instruction still in the pipeline
    • add $s0, $t0, $t1
      sub $t2, $s0, $t3
  • control hazards: attempt to make a decision about program control flow before the condition has been evaluated and the new PC target address calculated
    • branch and jump instructions, exceptions
• Can always resolve hazards by waiting
  • pipeline control must detect the hazard
  • and take action to resolve hazards

A Single Memory Would Be a Structural Hazard
• Can fix with separate instr and data memories
[Figure: lw followed by Inst 1 to Inst 4 in instruction order against time (clock cycles), each drawn as Mem, Reg, ALU, Mem, Reg; in the same cycle lw is "Reading data from memory" while a later instruction is "Reading instruction from memory"]

How About Register File Access?
• Fix register file access hazard by doing reads in the second half of the cycle and writes in the first half
[Figure: instruction order against time (clock cycles): add $1, then Inst 1, Inst 2, then add $2,$1, each drawn as IM, Reg, ALU, DM, Reg]

Data Hazards
• An instruction depends on completion of data access by a previous instruction
• add $s0, $t0, $t1
  sub $t2, $s0, $t3
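A dependence like the one between add and sub above can be detected mechanically. The following is a minimal sketch, assuming R-type instructions written as `op rd, rs, rt`, a five-stage pipeline without forwarding, and a register file that writes in the first half of a cycle and reads in the second half, so only the next two instructions can read a stale value. The names `parse` and `raw_hazards` are hypothetical, and the sketch ignores cases such as an intervening instruction overwriting the same register.

```python
import re


def parse(instr):
    """Split an R-type instruction 'op rd, rs, rt' into (dest, sources)."""
    regs = re.findall(r"\$\w+", instr)
    return regs[0], regs[1:]


def raw_hazards(program, window=2):
    """Return (producer_index, consumer_index, register) for each read-after-write hazard.

    window is how many following instructions would read the register too early
    (2 under the assumptions stated in the lead-in).
    """
    hazards = []
    for i, instr in enumerate(program):
        dest, _ = parse(instr)
        for j in range(i + 1, min(i + 1 + window, len(program))):
            _, sources = parse(program[j])
            if dest in sources:
                hazards.append((i, j, dest))
    return hazards


if __name__ == "__main__":
    print(raw_hazards(["add $s0, $t0, $t1", "sub $t2, $s0, $t3"]))
```

Pipeline control performs the equivalent comparison in hardware by matching register numbers held in the pipeline registers.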

Register Usage Can Cause Data Hazards
• Dependencies backward in time cause hazards
• Read before write data hazard
[Figure: instruction order against time: add $1, followed by sub $4,$1,$5; and $6,$1,$7; or $8,$1,$9; xor $4,$1,$5, each drawn as IM, Reg, ALU, DM, Reg]

Loads Can Cause Data Hazards
• Dependencies backward in time cause hazards
• Load-use data hazard: a form of data hazard in which data being loaded by lw has not yet become available when it is needed by another instruction.
[Figure: instruction order against time: lw $1,4($2) followed by sub $4,$1,$5; and $6,$1,$7; or $8,$1,$9; xor $4,$1,$5, each drawn as IM, Reg, ALU, DM, Reg]

One Way to “Fix” a Data Hazard
• add instruction doesn’t write its result until the fifth stage
• Can fix data hazard by waiting: stall
• A data hazard could severely stall the pipeline, meaning that we would have to waste several clock cycles
• Register file: writes in the first half of the cycle, reads in the second half

One Way to “Fix” a Data Hazard
• Can fix data hazard by waiting: stall
[Figure: instruction order against time: add $1, followed by two stall cycles, then sub $4,$1,$5 and and $6,$1,$7]
• Can we do better?
  • Do we need to wait for the instruction to complete before trying to resolve the data hazard?

Forwarding (aka Bypassing)
• Forwarding or bypassing: a method of resolving a data hazard by retrieving the missing data from internal buffers rather than waiting for it to arrive from programmer-visible registers or memory.
• Instead of waiting for the first instruction to complete, results can be forwarded as soon as they are available to where they are needed
  • Don’t wait for it to be stored in a register
  • Requires extra connections in the datapath

Forwarding to “Fix” a Data Hazard
• Fix data hazards by forwarding results as soon as they are available to where they are needed
[Figure: instruction order against time: add $1, followed by sub $4,$1,$5; and $6,$1,$7; or $8,$1,$9; xor $4,$1,$5, each drawn as IM, Reg, ALU, DM, Reg]

Load-Use Data Hazard
• Forwarding cannot prevent all pipeline stalls
• If the value is not computed when needed, the path from the memory access stage output to the execution stage input would be going backward in time
  • Impossible: can’t forward backward in time!
• Example: the desired data would be available only after the fourth stage of the lw instruction, which is too late for the input of the third stage of sub

Forwarding with Load-use Data Hazards
• Will still need one stall cycle even with forwarding
[Figure: instruction order against time: lw $1,4($2) followed by sub $4,$1,$5; and $6,$1,$7; or $8,$1,$9; xor $4,$1,$5, each drawn as IM, Reg, ALU, DM, Reg]

Code Scheduling to Avoid Stalls
• Reorder code to avoid use of load result in the next instruction
• C code: A = B + E; C = B + F;
• MIPS assembly, assuming all variables are in memory and are addressable as offsets from $t0:
    lw  $t1, 0($t0)
    lw  $t2, 4($t0)
    (stall)
    add $t3, $t1, $t2
    sw  $t3, 12($t0)
    lw  $t4, 8($t0)
    (stall)
    add $t5, $t1, $t4
    sw  $t5, 16($t0)
  13 cycles
• Both add instructions have a hazard because of their respective dependence on the immediately preceding lw instruction
• Moving up the third lw instruction to become the third instruction eliminates both hazards:
    lw  $t1, 0($t0)
    lw  $t2, 4($t0)
    lw  $t4, 8($t0)
    add $t3, $t1, $t2
    sw  $t3, 12($t0)
    add $t5, $t1, $t4
    sw  $t5, 16($t0)
  11 cycles
• On a pipelined processor with forwarding, the reordered sequence will complete in two fewer cycles than the original version
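The 13 and 11 cycle counts can be reproduced with a small sketch. It assumes a five-stage pipeline with full forwarding, where the only stall is one cycle when an instruction uses the result of the lw immediately before it, and it handles only the lw, sw, and add forms used in this example. The helper names are hypothetical.

```python
import re


def parse(line):
    """Return (opcode, dest, sources) for the lw/sw/add subset used in the example."""
    op, rest = line.split(None, 1)
    regs = re.findall(r"\$\w+", rest)
    if op == "lw":   # lw rt, offset(rs): writes rt, reads rs
        return op, regs[0], [regs[1]]
    if op == "sw":   # sw rt, offset(rs): writes no register, reads rt and rs
        return op, None, regs
    return op, regs[0], regs[1:]  # R-type: op rd, rs, rt


def load_use_stalls(program):
    """Count instructions that read the destination of the lw directly before them."""
    stalls = 0
    previous = None
    for line in program:
        current = parse(line)
        if previous and previous[0] == "lw" and previous[1] in current[2]:
            stalls += 1
        previous = current
    return stalls


def total_cycles(program, stages=5):
    """Fill time (stages - 1) plus one cycle per instruction plus load-use stalls."""
    return (stages - 1) + len(program) + load_use_stalls(program)


ORIGINAL = [
    "lw $t1, 0($t0)", "lw $t2, 4($t0)", "add $t3, $t1, $t2", "sw $t3, 12($t0)",
    "lw $t4, 8($t0)", "add $t5, $t1, $t4", "sw $t5, 16($t0)",
]
REORDERED = [
    "lw $t1, 0($t0)", "lw $t2, 4($t0)", "lw $t4, 8($t0)", "add $t3, $t1, $t2",
    "sw $t3, 12($t0)", "add $t5, $t1, $t4", "sw $t5, 16($t0)",
]

if __name__ == "__main__":
    print(total_cycles(ORIGINAL), total_cycles(REORDERED))
```

A compiler scheduler does essentially this analysis before deciding which independent instruction to move into the slot after a load.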

Control Hazards
• When the flow of instruction addresses is not sequential (i.e., not PC = PC + 4)
  • Conditional branches (beq, bne)
  • Unconditional branches (j, jal, jr)
  • Exceptions
• Possible “solutions”
  • Stall (impacts performance)
  • Move branch decision point as early in the pipeline as possible, thereby reducing the number of stall cycles
  • Delay decision (requires compiler support)
  • Predict and hope for the best!
• Control hazards occur less frequently than data hazards, but there is nothing as effective against control hazards as forwarding is for data hazards

Branches Cause Control Hazards
• Dependencies backward in time cause hazards
• Control hazard: when the proper instruction cannot execute in the proper pipeline clock cycle because the instruction that was fetched is not the one that is needed
[Figure: instruction order against time: beq followed by lw, Inst 3, Inst 4, each drawn as IM, Reg, ALU, DM, Reg]

One Way to “Fix” a Branch Control Hazard
• Fix branch hazard by waiting (flush), but this affects CPI
[Figure: instruction order against time: beq followed by three flushed slots, then the beq target and Inst 3]

Another Way to “Fix” a Branch Control Hazard
• Move branch decision hardware back to as early in the pipeline as possible, i.e., during the decode cycle
  • Put in extra hardware to test registers, calculate the branch address, and update the PC during the second stage of the pipeline
• Fix branch hazard by waiting (flush)
[Figure: instruction order against time: beq followed by one flushed slot, then the beq target and Inst 3]

Two “Types” of Stalls
• Noop instruction (or bubble) inserted between two instructions in the pipeline (e.g., load-use hazards)
  • Keep the instructions earlier in the pipeline (later in the code) from progressing down the pipeline for a cycle (“bounce” them in place with write control signals)
  • Insert noop instruction by zeroing control bits in the pipeline register at the appropriate stage
  • Let the instructions later in the pipeline (earlier in the code) progress normally down the pipeline
• Flushes (or instruction squashing), where an instruction in the pipeline is replaced with a noop instruction (as done for instructions located sequentially after j and beq instructions)
  • Zero the control bits for the instruction to be flushed

Branch Prediction
• Simple approach: always predict that branches will be untaken.
  • When you’re right, the pipeline proceeds at full speed.
  • When you’re wrong, which happens only when a branch is taken, the pipeline stalls
[Figure: two pipeline timing diagrams, labeled "Prediction correct" and "Prediction incorrect"]
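The cost of predicting untaken can be estimated with a short sketch. The formula assumes an otherwise ideal pipeline (base CPI of 1) and a fixed penalty per taken branch; the numeric inputs in the usage example are hypothetical and are not taken from the slides.

```python
def cpi_predict_untaken(branch_fraction, taken_fraction, penalty_cycles, base_cpi=1.0):
    """Effective CPI when every taken branch costs penalty_cycles and untaken branches cost nothing."""
    return base_cpi + branch_fraction * taken_fraction * penalty_cycles


if __name__ == "__main__":
    # Hypothetical workload: 20% branches, 60% of them taken, 1-cycle penalty
    # (branch decided in the decode stage, so one instruction is flushed).
    print(cpi_predict_untaken(0.20, 0.60, 1))
```

Moving the branch decision earlier in the pipeline lowers `penalty_cycles`, and a better predictor lowers the fraction of branches that pay it.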

Pipelining Summary
• All modern-day processors use pipelining
• Pipelining doesn’t help the latency of a single task; it helps the throughput of the entire workload
• Potential speedup: a really fast clock cycle and the ability to complete one instruction every clock cycle (CPI = 1)
• Pipeline rate is limited by the slowest pipeline stage
  • Unbalanced pipe stages make for inefficiencies
  • The time to “fill” the pipeline and the time to “drain” it can impact speedup for deep pipelines and short code runs
• Must detect and resolve hazards
  • Stalling negatively affects CPI (makes CPI greater than the ideal of 1)

MIPS Exceptions and Interrupts
• Exception: refers to any unexpected change in control flow, without distinguishing whether the cause is internal or external
  • unscheduled event that disrupts program execution
  • another form of control hazard
• Interrupt: an exception that comes from outside of the processor
• Examples:
• Dealing with them without sacrificing performance is hard

MIPS Exceptions and Interrupts
• In MIPS, exceptions are managed by a System Control Coprocessor (CP0)
• The OS looks at the cause of the exception and “deals” with it
• CPU must provide OS with
  • An indication of what type of event occurred
  • An indication of where the event occurred
    • The Exception Program Counter (EPC)
    • CPU might undo addition of 4 from fetch cycle
• Two main methods to communicate the exception reason:
  • Cause register
  • Vectored interrupts

Handling Exceptions: Cause Register
• Idea: CPU provides OS with a value in a register that indicates what caused the event
• Two registers are added to MIPS to deal with exceptions:
  • EPC: 32-bit register used to hold the address of the affected instruction.
  • Cause: 32-bit register used to record the cause of the exception. For example:
    • Invalid Instruction: Cause = 0x0000000A
    • Arithmetic Overflow: Cause = 0x0000000C
• When an exception occurs, the CPU:
  • Sets the EPC and Cause registers
  • Starts executing at a defined address
    • 0x80000180 in MIPS
• OS determines how to handle the exception

Handling Exceptions: Vectored
• Idea: the address to which control is transferred is determined by the cause of the exception. CPU starts executing at that address
  • EPC contains instruction address
  • No Cause register
• CPU goes to an address based on the event type
  • Looks at the interrupt vector (or description) table
  • For example:
    • Arithmetic Overflow: PC = 0xC0000000
    • Undefined Instruction: PC = 0xC0000020
• When an exception or interrupt occurs:
  • The CPU sets the EPC and looks up the interrupt handler address
  • Starts executing the interrupt handler
  • The handler returns to the program when done
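The difference between the cause-register scheme and the vectored scheme can be sketched as two small functions. The addresses and cause values are the examples given on the slides; the dictionary keys, function names, and the idea of returning the new register state as a dictionary are illustrative assumptions, not how hardware represents it.

```python
# Example values from the slides; the slides call the first cause "Invalid Instruction"
# in one place and "Undefined Instruction" in the other.
HANDLER_ENTRY = 0x80000180
CAUSE_CODES = {"undefined instruction": 0x0000000A, "arithmetic overflow": 0x0000000C}
VECTOR_TABLE = {"arithmetic overflow": 0xC0000000, "undefined instruction": 0xC0000020}


def take_exception_cause_register(event, faulting_pc):
    """Cause-register scheme: record where and why, then jump to one fixed entry point."""
    return {"EPC": faulting_pc, "Cause": CAUSE_CODES[event], "PC": HANDLER_ENTRY}


def take_exception_vectored(event, faulting_pc):
    """Vectored scheme: record where, then jump to an address chosen by the event type."""
    return {"EPC": faulting_pc, "PC": VECTOR_TABLE[event]}


if __name__ == "__main__":
    pc = 0x00400020  # hypothetical address of the offending instruction
    print(take_exception_cause_register("arithmetic overflow", pc))
    print(take_exception_vectored("arithmetic overflow", pc))
```

In the first scheme the OS handler must read Cause to decide what to do; in the second the dispatch has already happened by the time the handler starts.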

Handler Actions
• Read cause, and transfer to relevant handler
• Determine action required
• If restartable
  • Take corrective action
  • Use EPC to return to program
• Otherwise
  • Terminate program
  • Report error using EPC, cause, …

Exceptions in a Pipeline
• Another form of control hazard
• Consider overflow on add in EX stage: add $1, $2, $1
• Effect of exception: the pipeline has to:
  • stop executing the offending instruction in midstream,
  • let all prior instructions complete,
  • flush all following instructions,
  • set Cause and EPC register values (cause of the exception, instruction address),
  • transfer control to the handler by jumping to a prearranged address (the address of the exception handler code)
• Similar to a mispredicted branch
  • Use much of the same hardware

Where in the Pipeline Exceptions Occur
[Figure: one instruction drawn as the stage sequence IM, Reg, ALU, DM, Reg]
  Exception                Stage(s)?
  Arithmetic overflow      EX
  Undefined instruction    ID
  Page fault               IF, MEM
  I/O service request      any
  Hardware malfunction     any
• Beware that multiple exceptions can occur simultaneously in a single clock cycle

Multiple Simultaneous Exceptions
• Hardware sorts the exceptions so that the earliest instruction is the one interrupted first
[Figure: Inst 0 to Inst 4 in instruction order, each drawn as IM, Reg, ALU, DM, Reg, with exceptions marked on the pipeline: D$ page fault, arithmetic overflow, undefined instruction, I$ page fault]
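The sorting rule above can be sketched in a few lines. The sketch assumes an in-order five-stage pipeline, where the instruction in a later stage is always earlier in program order, so the exception raised in the latest stage belongs to the earliest instruction. The names `STAGE_ORDER` and `first_to_handle` are hypothetical.

```python
# Stages from youngest instruction (IF) to oldest instruction (WB) in an in-order pipeline.
STAGE_ORDER = ["IF", "ID", "EX", "MEM", "WB"]


def first_to_handle(pending):
    """Pick the exception belonging to the earliest instruction.

    pending is a list of (stage, cause) pairs raised in the same clock cycle.
    """
    return max(pending, key=lambda entry: STAGE_ORDER.index(entry[0]))


if __name__ == "__main__":
    same_cycle = [
        ("IF", "I$ page fault"),
        ("ID", "undefined instruction"),
        ("EX", "arithmetic overflow"),
        ("MEM", "D$ page fault"),
    ]
    print(first_to_handle(same_cycle))
```

Handling the earliest instruction first keeps the exception behavior consistent with sequential program order; the later instructions are flushed and may raise their exceptions again when re-executed.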
