Explore how assembly line implementation optimizes car production, pipeline stages in manufacturing, and benefits of dividing processes. Learn to resolve potential pipeline hazards.
Automobile Manufacturing
1. Build frame: 60 min
2. Add engine: 50 min
3. Build body: 80 min
4. Paint: 40 min
5. Finish: 45 min
Total: 275 min
• Latency: time from start to finish for one car. 275 minutes per car (smaller is better).
• Throughput: number of finished cars per time unit. 1 car/275 min = 0.218 cars/hour (larger is better).
• Issue: how can we make the process better by adding more workers?
6.1
An Assembly Line
[Figure: timeline of cars 1-4 moving through the five stages. Each stage gets an 80-minute slot; the work itself takes 60, 50, 80, 40, and 45 minutes.]
• Latency: 400 min/car
• Throughput: 4 cars/640 min (1 car/160 min). Will approach 1 car/80 min as time goes on.
• The first two stages can’t produce faster than one car/80 min, or a backlog will occur at the third stage.
• The last two stages only receive one car/80 min to work on.
6.1
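The latency and throughput figures for the two factories can be checked with a few lines of Python. This is a minimal sketch: the function names are made up for illustration, and it assumes every station on the line advances in lockstep at the pace of the slowest stage (80 min).

```python
# Stage times in minutes: frame, engine, body, paint, finish.
STAGE_MINUTES = [60, 50, 80, 40, 45]

def one_at_a_time(cars):
    """Total minutes when each car is finished before the next one starts."""
    return cars * sum(STAGE_MINUTES)

def assembly_line(cars):
    """Total minutes on a line where every stage gets one slot of the
    slowest stage's length (assumption: all stations move in lockstep)."""
    slot = max(STAGE_MINUTES)
    # The first car needs one slot per stage; each later car adds one slot.
    return (len(STAGE_MINUTES) + cars - 1) * slot

for cars in (1, 4, 100):
    line = assembly_line(cars)
    print(f"{cars:>3} cars: no line {one_at_a_time(cars):>6} min, "
          f"line {line:>5} min ({line / cars:.1f} min/car)")
```

For 4 cars the line needs 640 min, matching the slide; as the number of cars grows, the minutes per car approach 80.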
Applying Assembly Lines to CPUs
• The single-cycle design did everything “at once”
• Can we break the single-cycle design up into stages?
• Issues:
  • Car assembly works well. Will it be as easy to apply the same technique to a CPU?
6.1
Breaking up the Single-Cycle Datapath
[Figure: single-cycle datapath (PC, instruction memory, register file, sign extend, adders, ALU, data memory) divided into stages.]
Stages from multi-cycle design:
1. Instr. Fetch, PC = PC+4
2. Instr. Decode, Register Fetch
3. Execute, Address Calc.
4. Memory
5. Reg. Write-back
6.2
The Key - Pipeline Registers
[Figure: the same datapath with clocked pipeline registers inserted between the stages (Instr. Fetch, Instr. Decode/Register Fetch, Execute/Address Calc., Memory, Reg. Write-back). PC+4 is carried along in the pipeline registers.]
6.2
Example: R-type Instruction
[Figure: an R-type instruction traced through the pipelined datapath.]
• Writes the correct data to the wrong register: by the time the result reaches write-back, the write register number at the register file comes from a later instruction that is now being decoded.
• In general, arrows that go backwards across pipeline stages may be bad news...
6.2
Correcting the Write Register Problem
[Figure: pipelined datapath in which the Rt:[20-16] and Rd:[15-11] fields are carried forward through the pipeline registers, so the write register number reaches the register file together with the write data.]
6.2
Assembly-line Control Signals
• In an assembly line, the manufacturing instructions can be attached to the car. The instructions then move along with the car.
[Figure: cars on the line, each tagged with the instructions for its remaining stages:]
  F: Standard, E: 135 HP, B: 2-door, P: Green
  F: Leather, E: 190 HP, B: 4-door, P: Blue
  F: Cotton, B: 2-door, P: Lavender
  F: Leather, P: Green
  F: Vinyl
  F: Leather
• By separating the control signals by stages, only the signals needed for the current stage must be decoded. All signals for later stages must be passed along.
6.1
The Pipelined Control Logic
[Figure: pipelined datapath with a Control unit that decodes Op:[31-26]. The control signals are grouped by stage (E, M, W) and passed along through the pipeline registers.]
Signals shown: PCSrc, Branch, MemRead, MemWrite, MemToReg, RegWrite, ALUSrc, ALUOp, ALUcontrol, RegDest.
6.3
How’d we do?
• Compared to single-cycle:
  • 5 stages --> potentially 5x speedup
  • Not likely:
    • Stages won’t all be equally long
    • Pipeline registers will cause some delays
  • Latency --> greater than in single-cycle design
  • More complexity, but nicely divided up
Example 1
• Consider executing the following code
    add $3, $4, $5
    and $6, $7, $8
    sub $9, $10, $11
  on
  • A single-cycle machine with a cycle time of 200 ns
  • A 5-stage pipeline machine with a cycle time of 50 ns
• Which one runs faster?
• What if there were 100 instructions instead of 3?
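One way to work Example 1 is to count cycles directly. This is a sketch under the assumption that the pipeline never stalls, which holds for the three instructions shown because none of them reads a register written by another; the 100-instruction case is assumed to be equally free of hazards. The function names are illustrative.

```python
def single_cycle_ns(instructions, cycle_ns=200):
    """Every instruction takes one long cycle."""
    return instructions * cycle_ns

def pipelined_ns(instructions, stages=5, cycle_ns=50):
    """The first instruction needs `stages` cycles to drain through the
    pipeline; each later one finishes one cycle after its predecessor.
    Assumption: no stalls."""
    return (stages + instructions - 1) * cycle_ns

for n in (3, 100):
    print(f"{n:>3} instructions: single-cycle {single_cycle_ns(n)} ns, "
          f"pipelined {pipelined_ns(n)} ns")
```

With 3 instructions the pipeline takes 350 ns against 600 ns; with 100 instructions it takes 5200 ns against 20000 ns, so the speedup grows with program length but stays below the cycle-time ratio of 4.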
Analyzing Pipelines
  ADD $10, $14, $0
  SUB $12, $13, $2
  AND $1, $6, $11
  SW $3, 200($9)
  OR $9, $13, $7
[Figure: pipeline diagram. Each instruction passes through IF, RF, EX, M, WB and starts one cycle after the previous one.]
6.4
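The staggered pipeline diagram for these five instructions can be regenerated as text. A minimal sketch, assuming an ideal pipeline with no stalls; `pipeline_chart` is a made-up helper name.

```python
STAGES = ["IF", "RF", "EX", "M", "WB"]

def pipeline_chart(instructions):
    """Return a text chart with one row per instruction and one column per
    clock cycle. Assumption: instruction i enters IF in cycle i + 1."""
    total_cycles = len(instructions) + len(STAGES) - 1
    width = max(len(text) for text in instructions)
    header = " " * width + "".join(f"{c:>4}" for c in range(1, total_cycles + 1))
    rows = [header]
    for i, text in enumerate(instructions):
        cells = ["    "] * total_cycles
        for s, stage in enumerate(STAGES):
            cells[i + s] = f"{stage:>4}"  # stage s runs in cycle i + s + 1
        rows.append(text.ljust(width) + "".join(cells))
    return "\n".join(rows)

program = [
    "ADD $10, $14, $0",
    "SUB $12, $13, $2",
    "AND $1, $6, $11",
    "SW $3, 200($9)",
    "OR $9, $13, $7",
]
print(pipeline_chart(program))
```

Reading a column top to bottom shows which instruction occupies each stage in that cycle.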
Data Hazards
  ADD $13, $14, $0     Writes register $13
  SUB $12, $13, $2     Reads wrong $13
  AND $1, $6, $13      Reads wrong $13
  SW $3, 200($13)      Reads ? $13
  OR $9, $13, $7       Reads correct $13
[Figure: pipeline diagram of the five instructions. SW’s RF stage falls in the same cycle as ADD’s WB stage, so whether SW sees the new value depends on whether a register can be written and read in one cycle.]
6.4
Preventing Data Hazards
  ADD $13, $14, $0
  NOP
  NOP
  NOP
  SUB $12, $13, $2
  AND $1, $6, $13
  SW $3, 200($13)
  OR $9, $13, $7
[Figure: pipeline diagram with three NOPs between ADD and SUB.]
• Insert NOPs into the instruction stream to allow WB to happen before RF.
• Assume we can’t write a register and read the new value in the same cycle.
6.4
Detecting Hazards
  ADD $13, $14, $0     Write: $13
  SUB $12, $13, $2     Read A: $13
  AND $1, $6, $13      Read B: $13
  SW $3, 200($13)      Read A: $13
  OR $9, $13, $7
[Figure: pipeline diagram with three comparisons against the instruction in RF:]
• Compare write reg # in EX with read reg # in RF
• Compare write reg # in M with read reg # in RF
• Compare write reg # in WB with read reg # in RF
• Check each instruction as it is being decoded (RF-ID stage).
• If it reads a register that will be written by any instruction ahead of it (in the EX, M, or WB stages), there is a hazard.
6.5
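The three comparisons amount to checking each instruction's source registers against the destinations of the three instructions ahead of it. A sketch in Python, assuming (as the NOP slide does) that a register cannot be written and read in the same cycle; the tuple encoding of instructions and the name `find_hazards` are illustrative only.

```python
def find_hazards(program, window=3):
    """program: list of (text, dest_register_or_None, source_registers).
    Returns (producer, consumer, register) for every read of a register
    that one of the `window` preceding instructions has not yet written.
    window=3 models the EX, M, and WB stages ahead of the RF stage."""
    hazards = []
    for i, (text, _, sources) in enumerate(program):
        for distance in range(1, window + 1):
            j = i - distance
            if j < 0:
                break
            producer_text, dest, _ = program[j]
            if dest is not None and dest in sources:
                hazards.append((producer_text, text, dest))
    return hazards

program = [
    ("ADD $13, $14, $0", "$13", ("$14", "$0")),
    ("SUB $12, $13, $2", "$12", ("$13", "$2")),
    ("AND $1, $6, $13", "$1", ("$6", "$13")),
    ("SW $3, 200($13)", None, ("$3", "$13")),  # a store writes no register
    ("OR $9, $13, $7", "$9", ("$13", "$7")),
]
for producer, consumer, reg in find_hazards(program):
    print(f"{consumer!r} reads {reg} before {producer!r} has written it")
```

On this program it reports SUB, AND, and SW as hazards on $13 and leaves OR alone, in line with the annotations on the Data Hazards slide.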
Stalling with Bubbles
  ADD $13, $14, $0
  SUB $12, $13, $2
  AND $1, $6, $13
  SW $3, 200($13)
  OR $9, $13, $7
[Figure: pipeline diagram in which SUB is held back for three cycles while the register comparisons (=) detect the hazard, then proceeds.]
• Stalling:
  • Kill the current execution by “neutralizing” all the control signals so that it won’t write any registers.
  • Don’t write PC+4 into PC --> stay at the current instruction and try again.
6.5
Register Forwarding
  ADD $13, $14, $0
  SUB $12, $13, $2
  AND $1, $6, $13
  SW $3, 200($13)
  OR $9, $13, $2
[Figure: pipeline diagram of the five instructions with no stalls.]
• Register $13’s value is computed in the EX stage of the ADD even though it isn’t written into the register until the WB stage.
  --> The pipeline register following the EX stage holds the value of $13 that’s needed in the SUB instruction’s EX stage.
6.6
Unforwardable Loads
  LW $2, 30($2)
  AND $1, $2, $13
  SW $3, 200($2)
  OR $9, $2, $1
[Figure: pipeline diagram in which the AND after the LW is stalled for one cycle.]
• Loads don’t produce the value to write back until the Memory stage. This is one stage too late for the next instruction.
  --> We can’t prevent stalls if the instruction following a load uses the result of the load.
6.6
Example 2
• Consider executing the following code on a 5-stage pipeline datapath
    add $3, $4, $5
    lw $7, 100($3)
    sub $8, $7, $9
• Identify any potential data dependencies
• How many cycles will it take to execute this code assuming no register forwarding?
• How many cycles will it take to execute this code assuming register forwarding is available?
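One possible way to check an answer to Example 2 is a small timing model. This is a sketch under stated assumptions, not the official solution: without forwarding, a consumer's RF stage must come after the producer's WB stage (no same-cycle write and read); with forwarding, an ALU result can be used in the next EX stage and a loaded value one cycle later. The tuple encoding and the name `cycles` are made up for illustration.

```python
def cycles(program, forwarding):
    """program: list of (op, dest_or_None, sources).
    Returns the cycle in which the last instruction completes WB."""
    ex_ready = {}  # register -> earliest EX cycle in which a consumer may use it
    ex = 2         # the first instruction executes in cycle 3 (IF=1, RF=2)
    for op, dest, sources in program:
        ex += 1    # in-order issue: at best one cycle after the previous EX
        for reg in sources:
            ex = max(ex, ex_ready.get(reg, 0))  # stall until operands are ready
        if dest is not None:
            if forwarding:
                # ALU result: end of EX. Loaded value: end of M.
                ex_ready[dest] = ex + (2 if op == "lw" else 1)
            else:
                # Producer WB is ex + 2; consumer RF must be at least ex + 3,
                # so the consumer's EX is at least ex + 4.
                ex_ready[dest] = ex + 4
    return ex + 2  # M and WB follow the last EX

program = [
    ("add", "$3", ("$4", "$5")),
    ("lw", "$7", ("$3",)),
    ("sub", "$8", ("$7", "$9")),
]
print("no forwarding:", cycles(program, forwarding=False))
print("forwarding:   ", cycles(program, forwarding=True))
```

Under these assumptions the model gives 13 cycles without forwarding and 8 with it (one load-use stall remains), against 7 cycles for three independent instructions.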
Branch Hazards
        BEQ $2, $1, SKIP
        AND $1, $2, $13
        SW $3, 200($2)
        OR $9, $2, $4
        ADD $3, $2, $5
  SKIP: LW $2, 32($4)
[Figure: pipeline diagram of BEQ followed by AND, SW, and OR, then either ADD or LW.]
• Don’t know the result of the branch until the end of the M stage.
• If the branch is taken, we’ve blown it by executing the intervening instructions.
6.7
Solution 1: Stall
        BEQ $2, $1, SKIP
        AND $1, $2, $13
        SW $3, 200($2)
        OR $9, $2, $4
        ADD $3, $2, $5
  SKIP: LW $2, 32($4)
[Figure: pipeline diagram, branch not taken. The fetch of AND is repeated for three cycles until the branch result is known.]
• Stalling always solves the problem.
• If we didn’t have so many branches in programs, it would not be a problem.
6.5
Solution 2: Assume not Taken
        BEQ $2, $1, SKIP
        AND $1, $2, $13
        SW $3, 200($2)
        OR $9, $2, $4
        ADD $3, $2, $5
  SKIP: LW $2, 32($4)
[Figure: pipeline diagram, branch is taken. AND, SW, and OR enter the pipeline after BEQ and must be undone before LW is fetched.]
• If we guess right, we win --> no stall at all.
• If we guess wrong:
  1. We have to undo all that we did (fortunately, no writebacks have occurred yet).
  2. We still take all the time of a stall.
6.7
Solution 3: Better Prediction
• Predict that the branch goes the same way as the last time
  • Works great for loops
  • Works great for “special-case” code
• Need to keep track of the information for each branch, though...
  • One or two bits will do
  • Keep a small table of recently used branches and which way they went
6.7
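The “one or two bits” idea can be sketched as follows. The slide does not specify the two-bit scheme; the saturating counter used here is one common choice and should be read as an assumption, as should the function names and the loop example.

```python
def one_bit_mispredictions(outcomes, prediction=False):
    """Predict that the branch goes the same way as last time."""
    misses = 0
    for taken in outcomes:
        if prediction != taken:
            misses += 1
        prediction = taken  # remember the most recent direction
    return misses

def two_bit_mispredictions(outcomes, counter=0):
    """Saturating counter 0..3 (assumed scheme): predict taken when >= 2.
    It takes two wrong guesses in a row to flip the prediction."""
    misses = 0
    for taken in outcomes:
        if (counter >= 2) != taken:
            misses += 1
        counter = min(3, counter + 1) if taken else max(0, counter - 1)
    return misses

# Hypothetical loop branch: taken 9 times, then falls through; loop run 3 times.
outcomes = ([True] * 9 + [False]) * 3
print("1-bit misses:", one_bit_mispredictions(outcomes))
print("2-bit misses:", two_bit_mispredictions(outcomes))
```

A table of recently used branches would keep one such bit or counter per branch, for example in a dictionary keyed by branch address.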
Solution 4: Delayed Branches
Original code:
        XOR $1, $3, $3
        ADD $2, $3, $4
        SUB $4, $3, $1
        OR $3, $2, $0
        BEQ $10, $11, SKIP
        LW $4, 60($2)
  SKIP: AND $1, $2, $3
• If we had some warning, we could compute the branch ahead of time...
With a delayed branch:
        XOR $1, $3, $3
        Branch-After-Three-EQ $10, $11, SKIP
        ADD $2, $3, $4
        SUB $4, $3, $1
        OR $3, $2, $0
        LW $4, 60($2)
  SKIP: AND $1, $2, $3
• ADD, SUB, and OR fill the 3 delay slots. These instructions are always executed. The branch can’t depend on them...
6.7
3-slot Delayed Branch
        Branch-After-Three-EQ $10, $11, SKIP
        ADD $2, $3, $4
        SUB $4, $3, $1
        OR $3, $2, $0
        LW $4, 60($2)
  SKIP: AND $1, $2, $3
[Figure: pipeline diagram of B3E, ADD, SUB, OR, and then LW or AND, each passing through IF, RF, EX, M, WB with no stalls.]
6.7
Branch summary
• Two decent solutions:
  • Branch prediction
    • Requires more hardware
    • Used in modern microprocessors
  • Delayed branch
    • Requires special software manipulation
    • Often doesn’t deliver its promise
    • Used often in CPUs 4-10 years ago
Example 3
• Consider executing the following code
    LOOP: add $3, $4, $5
          and $6, $7, $8
          bne $12, $8, LOOP
  on
  • A single-cycle machine with a cycle time of 200 ns
  • A 5-stage pipeline machine with a cycle time of 50 ns
• Assume the loop executes 10 times
• Assume the loop executes 100 times
• Assume the loop executes 1000 times
• Which one runs faster?
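A rough way to compare the two machines on this loop is sketched below. It is one set of assumptions, not the official answer: the branch result is known at the end of the M stage, so each branch that is followed by another loop iteration costs 3 extra cycles (the same count applies whether the pipeline stalls on every branch or assumes not taken and guesses wrong), and the loop body has no data hazards. Function and parameter names are illustrative.

```python
def loop_single_cycle_ns(iterations, body=3, cycle_ns=200):
    """Single-cycle machine: one long cycle per executed instruction."""
    return iterations * body * cycle_ns

def loop_pipelined_ns(iterations, body=3, stages=5, cycle_ns=50,
                      branch_penalty=3):
    """Pipelined machine: fill time + one cycle per instruction + branch
    penalties. The final bne only delays code after the loop, so it is
    not charged here (assumption)."""
    instructions = iterations * body
    stalls = branch_penalty * (iterations - 1)
    return (stages - 1 + instructions + stalls) * cycle_ns

for n in (10, 100, 1000):
    print(f"{n:>4} iterations: single-cycle {loop_single_cycle_ns(n)} ns, "
          f"pipelined {loop_pipelined_ns(n)} ns")
```

Setting `branch_penalty=0` gives the ideal case in which every branch is predicted correctly.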
Example 4
• Consider executing the following code on a 5-stage pipeline datapath
               addi $3, $0, 10
    LOOPSTART: lw $5, ARRAY($3)
               addi $5, $5, 1
               sw $5, ARRAY
               addi $3, $3, -1
               bne $3, $0, LOOPSTART
               add $3, $5, $6
               sub $7, $8, $9
               addi $4, $6, 3
• Identify potential data dependencies
• How many cycles will it take to execute this code?
  • With nops/stalls
  • With branch prediction assuming branch not taken
  • With branch prediction based on one previous result