
Pipelining


kenton


  1. Pipelining 6.1, 6.2

  2. Performance Measurements • Cycle Time: Time __________________ • Latency: Time to finish a _____________, start to finish • Throughput: Average ______________ per unit time • CyclesPerJob: On average, number of _______________________________.

  3. Goals • Faster clock rate • Use machine more efficiently • No longer execute only one instruction at a time

  4. Laundry • Laundry-o-matic washes, dries & folds • Wash: 30 min • Dry: 40 min • Fold: 20 min • It switches them internally with no delay • How long to complete 1 load? ______

  5. Laundry-o-Matic - SingleCycle • [Timing diagram: loads 1, 2, 3 each pass through W, D, F; x-axis 0 to 270 minutes]

  6. Laundry-o-Matic • Cycle Time: Clothing is switched every ____ minutes • Latency: A single load takes a total of ______ minutes • Throughput: A load completes each ______ minutes • CyclesPerLoad: Every ____ cycle(s), a load completes
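The blanks above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming the single-cycle Laundry-o-Matic treats one full wash-dry-fold pass as one cycle; the names `STAGE_MINUTES` and `single_cycle_metrics` are made up for illustration.

```python
# Stage times from the Laundry slide, in minutes.
STAGE_MINUTES = {"wash": 30, "dry": 40, "fold": 20}

def single_cycle_metrics(stage_minutes):
    """Metrics when one cycle covers the whole wash-dry-fold pass."""
    cycle_time = sum(stage_minutes.values())  # clothing is switched once per pass
    return {
        "cycle_time": cycle_time,
        "latency": cycle_time,           # one load takes exactly one cycle
        "minutes_per_load": cycle_time,  # a load completes every cycle
        "cycles_per_load": 1,
    }

metrics = single_cycle_metrics(STAGE_MINUTES)
print(metrics)
```

Under that assumption the cycle time and the latency are the same number, because nothing overlaps.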

  7. Pipelined Laundry • Split the laundry-o-matic into a washer, dryer, and folder (what a concept) • Moving the laundry from one to another takes 6 minutes

  8. Pipelined Laundry • Switch all loads at the same time • [Timing diagram: loads 1, 2, 3 each pass through W, D, F; x-axis 0 to 270 minutes]

  9. Pipelined Laundry • Cycle Time: Clothing is switched every ____ minutes • Latency: A single load takes a total of ______ minutes • Throughput: A load completes each ______ minutes • CyclesPerLoad: Every ____ cycle(s), a load completes
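The pipelined numbers can be worked out the same way. The sketch below assumes the 6-minute move is paid after every stage, including the last one, much like a pipeline register delay; the slide does not say so explicitly. The function names are hypothetical.

```python
STAGE_MINUTES = {"wash": 30, "dry": 40, "fold": 20}
MOVE_MINUTES = 6  # from the Pipelined Laundry slide

def pipelined_metrics(stage_minutes, move_minutes):
    # All loads switch together, so the slowest stage sets the pace.
    cycle_time = max(stage_minutes.values()) + move_minutes
    return {
        "cycle_time": cycle_time,
        "latency": cycle_time * len(stage_minutes),  # one cycle per stage
        "minutes_per_load": cycle_time,              # steady state
        "cycles_per_load": 1,                        # steady state
    }

def total_minutes(n_loads, stage_minutes, move_minutes):
    """Time to finish n_loads back-to-back loads, including pipeline fill."""
    cycle_time = max(stage_minutes.values()) + move_minutes
    return (len(stage_minutes) + n_loads - 1) * cycle_time

pipe = pipelined_metrics(STAGE_MINUTES, MOVE_MINUTES)
print(pipe, total_minutes(3, STAGE_MINUTES, MOVE_MINUTES))
```

With these assumptions a single load takes longer than in the single-cycle machine, yet loads finish more often, which is the trade the comparison slide asks about.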

  10. Single-Cycle vs Pipelined • _________ has the higher cycle time • _________ has the higher clock rate • _________ has the higher single-load latency • _________ has the higher throughput • _________ has the higher CPL (Cycles per Load) • More stages make a _______ clock rate

  11. Obstacles to Speedup in Pipelining • 1. • 2. • Ideal cycle time without the above limitations, with an n-stage pipeline:

  12. Creating Stages • Fetch (IF) – get instruction • Decode (ID) – read registers • Execute (EX) – use ALU • Memory (MEM) – access memory • WriteBack (WB) – write registers

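  13. Pipelined Machine • [Datapath diagram: Fetch, Decode, Execute, Memory, and (Writeback) stages separated by pipeline registers. Labeled components: PC, Instruction Memory (Read Addr, Out Data), Register File (src1, src2, destreg, src1data, src2data, destdata), Sign Ext (16 to 32), two << 2 shifters, the constant 4, Data Memory (Addr, In Data, Out Data). Instruction fields: op/fun, rs, rt, rd, imm]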

  14. add $s0, $0, $0 • lw $s1, 0($t0) • sw $s2, 0($t1) • or $s3, $s4, $t3 • [Timing diagram: pipeline stages of each instruction over cycles 1 to 8] • In what cycle was $s1 written? • In what cycle was $s4 read? • In what cycle was the add executed?
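One way to check answers to these questions is to compute the ideal schedule directly. This is a rough sketch, assuming the five stages from the Creating Stages slide (Fetch, Decode, Execute, Memory, WriteBack), one new instruction fetched per cycle, and no stalls. `stage_cycle` is a name made up for this example.

```python
# Assumed stage order; Execute is abbreviated EX here.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def stage_cycle(instr_index, stage):
    """1-based cycle in which the instruction at instr_index (0-based)
    occupies the given stage, in an ideal pipeline with no stalls."""
    return instr_index + STAGES.index(stage) + 1

program = [
    "add $s0, $0, $0",
    "lw $s1, 0($t0)",
    "sw $s2, 0($t1)",
    "or $s3, $s4, $t3",
]

for i, text in enumerate(program):
    row = {stage: stage_cycle(i, stage) for stage in STAGES}
    print(f"{text:<18} {row}")
```

Under these assumptions, registers are read in ID and written in WB, so the lw writes $s1 in cycle 6, the or reads $s4 in cycle 5, and the add uses the ALU in cycle 3.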

  15. Performance Analysis • Measurements related to our machine • Job = single instruction • Latency - Time to finish a complete _______________, start to finish. • Throughput – Average ______________ completed per unit time. • CyclesPerInstruction – An _________ completes each how many cycles? • Which is more important for reducing program execution time?

  16. Machine Comparison • Stage delays: Fetch 2 ns, Decode 1 ns, Execute 2 ns, Memory 2 ns, WriteBack 1 ns • Pipeline register delay: 0.1 ns • Single-Cycle Implementation: Clock cycle time: _____ ns; Latency of a single instruction: _____ ns; Throughput for machine: _____ inst/ns • Pipelined Implementation: Clock cycle time: _____ ns; Latency of a single instruction: _____ ns; Throughput for machine: _____ inst/ns
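The comparison above reduces to a sum for the single-cycle machine and a maximum for the pipelined one. The sketch below assumes the single-cycle design pays no pipeline register delay and that the pipelined design pays it once per stage; the function names are illustrative.

```python
STAGE_NS = [2, 1, 2, 2, 1]  # Fetch, Decode, Execute, Memory, WriteBack
REG_DELAY_NS = 0.1          # pipeline register delay from the slide

def single_cycle(stage_ns):
    cycle = sum(stage_ns)  # one long cycle covers every stage
    return {"cycle_ns": cycle, "latency_ns": cycle, "inst_per_ns": 1 / cycle}

def pipelined(stage_ns, reg_delay_ns):
    cycle = max(stage_ns) + reg_delay_ns  # slowest stage plus register delay
    return {
        "cycle_ns": cycle,
        "latency_ns": cycle * len(stage_ns),  # one cycle per stage
        "inst_per_ns": 1 / cycle,             # steady state, no stalls
    }

sc = single_cycle(STAGE_NS)
pl = pipelined(STAGE_NS, REG_DELAY_NS)
print(sc)
print(pl)
```

The pipelined latency comes out higher than the single-cycle latency while the throughput is several times better, which bears on the earlier question of which measure matters more for program execution time.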

  17. Example 2 – How do we speed up the pipelined machine? • Stage delays: Fetch 6 ns, Decode 4 ns, Execute 8 ns, Memory 10 ns, Writeback 4 ns • Pipeline register delay: 0.1 ns • Single cycle: 1 / ____ ns • Pipelined: 1 / ____ ns

  18. Example 2 – Split more stages • Stage delays: Fetch 6 ns, Decode 4 ns, Execute 8 ns, Memory 10 ns, Writeback 4 ns • Pipeline register delay: 0.1 ns • Which stage(s) should we split? _________ and _________

  19. Example 2 – After Split • Stage delays, F through WB (seven stages): ___ ns, ___ ns, ___ ns, ___ ns, ___ ns, ___ ns, ___ ns • Pipeline register delay: 0.1 ns • Single cycle: 1 / ____ ns • Pipelined: 1 / ____ ns
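A small script can test a candidate split for Example 2. This sketch assumes the two slowest stages (Memory and Execute) are the ones split, and that each splits into two equal halves; real stages rarely divide that cleanly. The helper names are hypothetical.

```python
REG_DELAY_NS = 0.1
STAGES = [("Fetch", 6), ("Decode", 4), ("Execute", 8), ("Memory", 10), ("Writeback", 4)]

def pipelined_cycle_ns(stages, reg_delay_ns=REG_DELAY_NS):
    """Cycle time is set by the slowest stage plus the register delay."""
    return max(ns for _, ns in stages) + reg_delay_ns

def split_in_half(stages, name):
    """Return a new stage list with the named stage replaced by two equal halves."""
    out = []
    for stage_name, ns in stages:
        if stage_name == name:
            out += [(f"{stage_name}1", ns / 2), (f"{stage_name}2", ns / 2)]
        else:
            out.append((stage_name, ns))
    return out

after_split = split_in_half(split_in_half(STAGES, "Memory"), "Execute")

print("single-cycle cycle time:", sum(ns for _, ns in STAGES), "ns")
print("pipelined before split :", pipelined_cycle_ns(STAGES), "ns")
print("pipelined after split  :", pipelined_cycle_ns(after_split), "ns")
```

After the split, Fetch becomes the slowest stage, so splitting Execute or Memory further would not shorten the cycle under these assumptions.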

  20. Incorrect Execution • Easy, right? Not so fast. • In what cycle does the lw write $s0? • In what cycle does the or read $s0? • lw $s0, 0($t4) • or $s3, $s0, $t3 • sw $s2, 0($t1) • and $s6, $s4, $t3 • [Timing diagram: pipeline stages of each instruction over cycles 1 to 8]

  21. Correct, Slow Execution: Data Hazards • ID & WB take only ½ cycle each. All other stages take the full cycle. • In what cycle does the lw write $s0? • In what cycle does the or read $s0? • Stall: wasted cycles • lw $s0, 0($t4) • or $s3, $s0, $t3 (IF repeated three times before ID) • sw $s2, 0($t1) • and $s6, $s4, $t3 • [Timing diagram: pipeline stages of each instruction over cycles 1 to 8]
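The number of wasted cycles on this slide can be derived from the stage positions. This is a minimal sketch, assuming a five-stage pipeline (with an Execute stage between ID and MEM), no forwarding, and a register file that writes in the first half of a cycle and reads in the second half, so a consumer's ID may share a cycle with the producer's WB. It also ignores any stalls the producer itself suffers. The function names are made up.

```python
def ideal_id_cycle(index):
    """Cycle of ID for the instruction at 0-based index, with no stalls."""
    return index + 2

def wb_cycle(index):
    """Cycle of WB for the instruction at 0-based index, with no stalls."""
    return index + 5

def stalls_without_forwarding(producer_index, consumer_index):
    """Stall cycles needed so the consumer's ID is no earlier than the producer's WB."""
    return max(0, wb_cycle(producer_index) - ideal_id_cycle(consumer_index))

# lw $s0 is instruction 0, or $s3, $s0, $t3 is instruction 1.
print(stalls_without_forwarding(0, 1))
```

For the lw/or pair this gives two stall cycles, matching the three consecutive IF entries shown for the or.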

  22. Barriers to pipeline performance • Uneven stages • Pipeline register delays • Data Hazards • ______________________________________

  23. Default: Stall • Stall: wasted cycles • lw $s0, 0($t4) • or $s3, $s0, $t3 • sw $s2, 0($t1) • and $s6, $s4, $t3 • [Timing diagram to complete, cycles 1 to 8: all stages drawn for the lw, only IF drawn for the or]

  24. Solution 1: Data Forwarding • In what cycle is $s0 produced in the machine? • In what cycle is $s0 used in the machine? • Add wires that pass data from the stage in which it is produced to the stage in which it is used. • lw $s0, 0($t4) • or $s3, $s0, $t3 • sw $s2, 0($t1) • and $s6, $s4, $t3 • [Timing diagram to complete, cycles 1 to 8: all stages drawn for the lw, IF and ID for the or, IF for the sw]
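The effect of forwarding on stall counts can be sketched with the same stage arithmetic. The model below is an assumption, not the slide's datapath: ALU results are taken to be ready at the end of Execute, loaded values at the end of Memory, and every consumer is taken to need its operand at the start of its own Execute stage.

```python
# 1-based stage position at whose end the result exists (assumed model).
RESULT_READY_POSITION = {"alu": 3, "load": 4}  # EX = 3, MEM = 4
EX_POSITION = 3

def stalls_with_forwarding(producer_kind, distance):
    """Stall cycles for a consumer issued `distance` instructions after the producer."""
    ready_cycle = RESULT_READY_POSITION[producer_kind]  # producer's IF is cycle 1
    ideal_ex_cycle = distance + EX_POSITION             # consumer's EX with no stalls
    # The value must exist by the end of the cycle before the consumer's EX.
    return max(0, ready_cycle + 1 - ideal_ex_cycle)

print(stalls_with_forwarding("load", 1))  # lw followed directly by a dependent or
print(stalls_with_forwarding("alu", 1))   # ALU op followed directly by a dependent op
```

In this model forwarding removes the stalls after an ALU instruction but still leaves one stall when an instruction uses a loaded value immediately, which is why reordering remains useful.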

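  25. Data Forwarding: Where are those wires? • [Datapath diagram: Fetch, Decode, Execute, Memory, and (Writeback) stages separated by pipeline registers. Labeled components: PC, Instruction Memory (Read Addr, Out Data), Register File (src1, src2, destreg, src1data, src2data, destdata), Sign Ext (16 to 32), two << 2 shifters, the constant 4, Data Memory (Addr, In Data, Out Data). Instruction fields: op/fun, rs, rt, rd, imm]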

  26. Data Forwarding: Example 2 • Draw the timing diagram with data forwarding • Draw arrows to indicate data passing through forwarding • lw $t0, 0($s0) • addi $t0, $t0, 1 • add $s2, $s2, $t0 • sw $s2, 0($s0) • [Timing diagram to complete, cycles 1 to 12: F D M W drawn for the lw, F D for the addi, F for the add]

  27. Who reorders instructions? • Static scheduling • Compiler • ___________, but does not know when caches miss or loads/stores are to the same locations (conservative) • Dynamic scheduling • Hardware • _____________________________, but has all knowledge (more accurate)

  28. Solution 2: Instruction Reordering (Before reordering) • Stall: wasted cycles • lw $s0, 0($t4) • or $s3, $s0, $t3 (IF repeated three times before ID) • sw $s2, 0($t1) • and $s6, $s4, $t3 • [Timing diagram: pipeline stages of each instruction over cycles 1 to 8]

  29. Solution 2: Instruction Reordering (With Instruction Reordering) • lw $s0, 0($t4) • [Timing diagram to complete, cycles 1 to 8: only the lw is written in; the remaining instruction rows are left blank]

  30. Instruction Reordering - Are these equivalent? (Get the same answer) • lw $s0, 0($t4) • or $s3, $s0, $t3 (IF repeated three times before ID) • sw $s3, 0($t1) • and $s0, $s4, $t3 • [Timing diagram: pipeline stages of each instruction over cycles 1 to 8]

  31. A deeper look • lw $s0, 0($t4) • sw $s3, 0($t1) • and $s0, $s4, $t3 • or $s3, $s0, $t3 • [Timing diagram: pipeline stages of each instruction over cycles 1 to 8]

  32. Dependencies And our friends RAW, WAR, and WAW

  33. RAW – Read after Write • add $t0, $s0, $s1 • sub $t2, $t0, $s3 • or $s3, $t7, $s2 • and $t2, $t7, $s0

  34. WAR – Write after Read • add $t0, $s0, $s1 • sub $t2, $t0, $s3 • or $s3, $t7, $s2 • and $t2, $t7, $s0

  35. WAW – Write after Write • add $t0, $s0, $s1 • sub $t2, $t0, $s3 • or $s3, $t7, $s2 • and $t2, $t7, $s0

  36. Identify all of the dependencies (RAW, WAR, WAW) • add $t0, $s0, $s1 • sub $t2, $t0, $s3 • or $s3, $t7, $s2 • and $t2, $t7, $s0 • [Timing diagram: pipeline stages of each instruction over cycles 1 to 8]
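The three dependency kinds can be found mechanically by comparing each instruction's destination and source registers with those of every later instruction. The sketch below does that for the four instructions on this slide. It assumes three-operand register instructions written as `(op, dest, src1, src2)` tuples and reports every pair, even when another write to the same register falls in between; `find_dependencies` is a hypothetical name.

```python
def find_dependencies(instrs):
    """Return (kind, earlier_index, later_index, register) for each dependency."""
    deps = []
    for i, (_, dest_i, *srcs_i) in enumerate(instrs):
        for j in range(i + 1, len(instrs)):
            _, dest_j, *srcs_j = instrs[j]
            if dest_i in srcs_j:    # later instruction reads what the earlier one wrote
                deps.append(("RAW", i, j, dest_i))
            if dest_j in srcs_i:    # later instruction overwrites what the earlier one read
                deps.append(("WAR", i, j, dest_j))
            if dest_i == dest_j:    # both write the same register
                deps.append(("WAW", i, j, dest_i))
    return deps

program = [
    ("add", "$t0", "$s0", "$s1"),
    ("sub", "$t2", "$t0", "$s3"),
    ("or",  "$s3", "$t7", "$s2"),
    ("and", "$t2", "$t7", "$s0"),
]

for dep in find_dependencies(program):
    print(dep)
# ('RAW', 0, 1, '$t0')
# ('WAR', 1, 2, '$s3')
# ('WAW', 1, 3, '$t2')
```

Two instructions that only read the same register (such as $s0 in the add and the and) are not reported, since reads can happen in any order.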

  37. Which dependencies might cause stalls (hazards)? (RAW, WAR, WAW) • add $t0, $s0, $s1 • sub $t2, $t0, $s3 • or $s3, $t7, $s2 • and $t2, $t7, $s0 • [Timing diagram: pipeline stages of each instruction over cycles 1 to 8] • So why do we care about WAW and WAR dependencies at all?

  38. Can we reorder the or? (RAW, WAR, WAW) • add $t0, $s0, $s1 • sub $t2, $t0, $s3 • or $s3, $t7, $s2 • and $t2, $t7, $s0 • [Timing diagram: pipeline stages of each instruction over cycles 1 to 8]

  39. No, we cannot reorder the or (RAW, WAR, WAW) • add $t0, $s0, $s1 • or $s3, $t7, $s2 • sub $t2, $t0, $s3 (marked RAW) • and $t2, $t7, $s0 • [Timing diagram: pipeline stages of each instruction over cycles 1 to 8]

  40. Can we reorder the and? (RAW, WAR, WAW) • add $t0, $s0, $s1 • sub $t2, $t0, $s3 • or $s3, $t7, $s2 • and $t2, $t7, $s0 • [Timing diagram: pipeline stages of each instruction over cycles 1 to 8]

  41. No, siree. The and stays put. (RAW, WAR, WAW) • add $t0, $s0, $s1 • and $t2, $t7, $s0 • sub $t2, $t0, $s3 • or $s3, $t7, $s2 • [Timing diagram: pipeline stages of each instruction over cycles 1 to 8]

  42. How to remove WAR, WAW dependencies? • add $t0, $s0, $s1 • sub $t2, $t0, $s3 • or $s3, $t7, $s2 • and $t2, $t7, $s0 • and $t4, $s3, $s5 • add $s3, $s4, $s6 • [Timing diagram: pipeline stages of each instruction over cycles 1 to 8]

  43. ___________________________: use a different register for that result (and all subsequent uses of that result) • add $t0, $s0, $s1 • sub ____, $t0, $s3 • or ____, $t7, $s2 • and $t2, $t7, $s0 • and $t4, ___, $s5 • add $s3, $s4, $s6 • [Timing diagram: pipeline stages of each instruction over cycles 1 to 8]
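The rule on this slide (give the result a different register and update all later uses of it) can be sketched as a single pass over the code. This is only an illustration under simplifying assumptions: straight-line code, `(op, dest, src1, src2)` tuples, and a pool of spare register names (`$p0`, `$p1`, invented here) that are not used elsewhere. A destination is renamed only when the same register is written again later, so final values end up in their original registers.

```python
def rename(instrs, fresh_names):
    fresh = iter(fresh_names)
    current = {}  # architectural register -> name currently holding its value
    renamed = []
    for i, (op, dest, src1, src2) in enumerate(instrs):
        # Sources read whichever name holds the latest value.
        src1, src2 = current.get(src1, src1), current.get(src2, src2)
        written_later = any(later[1] == dest for later in instrs[i + 1:])
        if written_later:
            new_dest = next(fresh)    # this result gets its own register
            current[dest] = new_dest
        else:
            new_dest = dest           # last write keeps the original register
            current.pop(dest, None)
        renamed.append((op, new_dest, src1, src2))
    return renamed

program = [
    ("add", "$t0", "$s0", "$s1"),
    ("sub", "$t2", "$t0", "$s3"),
    ("or",  "$s3", "$t7", "$s2"),
    ("and", "$t2", "$t7", "$s0"),
    ("and", "$t4", "$s3", "$s5"),
    ("add", "$s3", "$s4", "$s6"),
]

for instr in rename(program, ["$p0", "$p1"]):
    print(instr)
```

On this input the sub and the or receive new destinations and the second and reads the or's new register, which lines up with the three blanks on the slide. The RAW dependencies are untouched, since renaming only removes name (WAR and WAW) dependencies.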

  44. Who renames registers? • Static register renaming • Compiler • Compiler is the one who makes assignments in the first place! • Number of registers limited by ____________ • Dynamic register renaming • Hardware • Can offer more registers – • Number of registers limited by ____________

  45. Minimizing Data Hazards: Summary

  46. Dependencies Summary • What is the difference between a hazard and a dependency? • How can we get rid of name dependencies? • What limits this solution?

  47. The Big Picture: Fewer Stages vs. More Stages • More hardware • Smaller • More power • Faster clock • Fewer stalls • Higher throughput

  48. The Big Picture • Throughput (inst / ns) = (inst / cycle) × (cycles / ns)
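The formula shows why a faster clock alone does not guarantee higher throughput. In the sketch below, the cycle times come from Example 2 (10.1 ns for five stages, 6.1 ns after an even split into seven), while the instructions-per-cycle figures are invented purely for illustration and are not measurements.

```python
def throughput_inst_per_ns(inst_per_cycle, cycle_ns):
    """Throughput = (inst / cycle) * (cycles / ns)."""
    return inst_per_cycle * (1 / cycle_ns)

# Hypothetical inst/cycle values: deeper pipelines tend to stall more.
five_stage = throughput_inst_per_ns(inst_per_cycle=0.90, cycle_ns=10.1)
seven_stage = throughput_inst_per_ns(inst_per_cycle=0.75, cycle_ns=6.1)

print(f"5 stages: {five_stage:.4f} inst/ns")
print(f"7 stages: {seven_stage:.4f} inst/ns")
```

With these made-up numbers the deeper pipeline still wins, but a large enough drop in instructions per cycle would reverse the result.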

  49. Control Hazard (Incorrect Execution) • In what cycle does the nextPC get calculated for the bne? • In what cycle does the or get fetched? • add $s5, $s4, $t1 • bne $s0, $s1, end • or $s3, $s0, $t3 • end: sw $s2, 0($t1) • [Timing diagram: pipeline stages of each instruction over cycles 1 to 8]
