
Pipeline Datapath

Pipeline Datapath. With some slides from: John Lazzaro and Dan Garcia. The clock: a Pentium computer running at 200 MHz performs 2*10^8 clock cycles per second (Hertz = 1/sec), so each clock cycle takes 5*10^-9 s = 5 nanoseconds. How long does an instruction take nowadays?


Presentation Transcript


  1. Pipeline Datapath. With some slides from: John Lazzaro and Dan Garcia

  2. The clock. A Pentium computer running at 200 MHz performs 2*10^8 clock cycles per second (Hertz = 1/sec). Each clock cycle takes 5*10^-9 s = 5 nanoseconds. How long does an instruction take nowadays?
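
The arithmetic on this slide can be checked with a short sketch. The function name `clock_period_ns` is an illustrative assumption, not something from the slides.

```python
def clock_period_ns(frequency_hz: float) -> float:
    """Return the duration of one clock cycle in nanoseconds."""
    # Period is the reciprocal of frequency; 1 second = 1e9 nanoseconds.
    return 1e9 / frequency_hz


pentium_hz = 200e6  # 200 MHz = 2 * 10^8 cycles per second
print(clock_period_ns(pentium_hz))  # 5.0
```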

  3. Structure of the datapath (figure). Stages: 1. Instruction Fetch, 2. Decode/Register Read, 3. Execute, 4. Memory, 5. WriteBack. Components: PC, +4, instruction memory, registers (rs, rt, rd), imm, ALU, data memory.

  4. Gotta Do Laundry (loads A, B, C, D) • Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, fold, and put away • Washer takes 30 minutes • Dryer takes 30 minutes • “Folder” takes 30 minutes • “Stasher” takes 30 minutes to put clothes into drawers

  5. Sequential Laundry (figure: timeline from 6 PM to 2 AM; loads A, B, C, D in task order, each taking four 30-minute steps, one load after another) • Sequential laundry takes 8 hours for 4 loads

  6. Pipelined Laundry (figure: same timeline; loads A, B, C, D overlap, each starting 30 minutes after the previous one, seven 30-minute steps in total) • Pipelined laundry takes 3.5 hours for 4 loads!

  7. General Definitions • Latency: time to completely execute a certain task • for example, time to read a sector from disk is disk access time or disk latency • Throughput: amount of work that can be done over a period of time

  8. Pipelining Lessons (1/2) (figure: pipelined laundry timeline, 6 PM to 9:30 PM, loads A, B, C, D) • Pipelining doesn’t help latency of a single task, it helps throughput of the entire workload • Multiple tasks operating simultaneously using different resources • Potential speedup = number of pipe stages • Time to “fill” the pipeline and time to “drain” it reduce speedup: 2.3X v. 4X in this example
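
A minimal sketch of where the 2.3X figure comes from, assuming four loads, four stages, and 30 minutes per stage as in the laundry slides. The function names are illustrative only.

```python
def sequential_minutes(loads: int, stages: int, stage_minutes: int) -> int:
    """Each load runs through every stage before the next load starts."""
    return loads * stages * stage_minutes


def pipelined_minutes(loads: int, stages: int, stage_minutes: int) -> int:
    """The first load needs `stages` steps to fill the pipeline;
    every further load finishes one step after the previous one."""
    return (stages + loads - 1) * stage_minutes


seq = sequential_minutes(4, 4, 30)   # 480 minutes = 8 hours
pipe = pipelined_minutes(4, 4, 30)   # 210 minutes = 3.5 hours
print(seq, pipe, round(seq / pipe, 2))  # 480 210 2.29
```

With many more loads the ratio approaches the number of stages (4X), because the fill and drain time is amortized over more work.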

  9. Pipelining Lessons (2/2) (figure: pipelined laundry timeline, 6 PM to 9:30 PM, loads A, B, C, D) • Suppose new Washer takes 20 minutes, new Stasher takes 20 minutes. How much faster is the pipeline? • Pipeline rate is limited by the slowest pipeline stage • Unbalanced lengths of pipe stages also reduce speedup
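
One way to explore the question on this slide is a small simulation. This is a sketch under stated assumptions: the dryer and folder stay at 30 minutes, a stage starts a load as soon as the load has left the previous stage and the stage itself is free, and loads may wait between stages. The function name `finish_times` is hypothetical.

```python
def finish_times(stage_minutes, loads):
    """Return the completion time (in minutes) of each load."""
    free_at = [0] * len(stage_minutes)  # when each stage next becomes free
    done = []
    for _ in range(loads):
        t = 0  # time at which this load is ready for the next stage
        for s, minutes in enumerate(stage_minutes):
            start = max(t, free_at[s])  # wait for the stage if it is busy
            t = start + minutes
            free_at[s] = t
        done.append(t)
    return done


old = finish_times([30, 30, 30, 30], 4)  # washer, dryer, folder, stasher
new = finish_times([20, 30, 30, 20], 4)  # faster washer and stasher
print(old)  # [120, 150, 180, 210]
print(new)  # [100, 130, 160, 190]
```

In both cases a load finishes every 30 minutes once the pipeline is full: the 30-minute dryer and folder set the rate, so the faster washer and stasher shorten only the fill and drain time.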

  10. Inspiration: Automobile assembly line. Assembly line moves on a steady clock. Each station does the same task on each car. (Figure labels: car body shell, car chassis, merge station, bolting station, the clock.)

  11. Inspiration: Automobile assembly line Simpler station tasks → more cars per hour. Simple tasks take less time, clock is faster.

  12. Inspiration: Automobile assembly line Line speed limited by slowest task. Most efficient if all tasks take same time to do

  13. Inspiration: Automobile assembly line Simpler tasks, complex car → long line! These lines go 24 x 7, and rarely shut down. Why?

  14. Lessons from car assembly lines. Faster line movement yields more cars per hour off the line. Faster line movement requires more stages, each doing simpler tasks. To maximize efficiency, all stages should take the same amount of time (if not, workers in fast stages are idle). “Filling”, “flushing”, and “stalling” the assembly line are all bad news.

  15. Structure of the datapath (figure). Stages: 1. Instruction Fetch, 2. Decode/Register Read, 3. Execute, 4. Memory, 5. WriteBack. Components: PC, +4, instruction memory, registers (rs, rt, rd), imm, ALU, data memory.

  16. Key Analogy: The instruction is the car. “Data-stationary control”. (Figure: pipeline stage #1 is Instruction Fetch; an IR is carried into each of stages #2 to #5, and the IR at each stage controls the hardware in that stage.)

  17. Representation #1: Timeline. Good for visualizing pipeline fills. Stages: IF (Fetch), ID (Decode), EX (ALU), MEM, WB, with an IR between stages.
      Sample Program: I1: ADD R4,R3,R2; I2: AND R6,R5,R4; I3: SUB R1,R9,R8; I4: XOR R3,R2,R1; I5: OR R7,R6,R5
      Time:  t1   t2   t3   t4   t5   t6   t7   t8
      I1:    IF   ID   EX   MEM  WB
      I2:         IF   ID   EX   MEM  WB
      I3:              IF   ID   EX   MEM  WB
      I4:                   IF   ID   EX   MEM  WB
      I5:                        IF   ID   EX   MEM
      I6:                             IF   ID   EX
      At t5 the pipeline is “full”.

  18. Representation #2: Resource Usage. Good for visualizing pipeline stalls. Stages: IF (Fetch), ID (Decode), EX (ALU), MEM, WB, with an IR between stages.
      Sample Program: I1: ADD R4,R3,R2; I2: AND R6,R5,R4; I3: SUB R1,R9,R8; I4: XOR R3,R2,R1; I5: OR R7,R6,R5
      Time:  t1   t2   t3   t4   t5   t6   t7   t8
      IF:    I1   I2   I3   I4   I5   I6   I7   I8
      ID:         I1   I2   I3   I4   I5   I6   I7
      EX:              I1   I2   I3   I4   I5   I6
      MEM:                  I1   I2   I3   I4   I5
      WB:                        I1   I2   I3   I4
      From t5 on the pipeline is “full”.
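
Both representations can be generated from one rule: with no stalls, instruction i (counting from 0) occupies stage s in cycle i + s. The sketch below assumes an ideal five-stage pipeline with no hazards; the function names are illustrative.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def timeline(num_instructions, num_cycles):
    """Representation #1: one row per instruction, one column per cycle."""
    rows = []
    for i in range(num_instructions):
        row = []
        for t in range(num_cycles):
            s = t - i  # stage index of instruction i in cycle t
            row.append(STAGES[s] if 0 <= s < len(STAGES) else "")
        rows.append(row)
    return rows


def resource_usage(num_instructions, num_cycles):
    """Representation #2: one row per stage, one column per cycle."""
    rows = {}
    for s, stage in enumerate(STAGES):
        row = []
        for t in range(num_cycles):
            i = t - s  # instruction index occupying stage s in cycle t
            row.append(f"I{i + 1}" if 0 <= i < num_instructions else "")
        rows[stage] = row
    return rows


for row in timeline(6, 8):
    print(" ".join(f"{cell:4}" for cell in row))
for stage, row in resource_usage(8, 8).items():
    print(f"{stage:4}", " ".join(f"{cell:3}" for cell in row))
```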

  19. Review: Datapath for MIPS • Use datapath figure to represent pipeline. (Figure: stages 1. Instruction Fetch, 2. Decode/Register Read, 3. Execute, 4. Memory, 5. WriteBack, abbreviated IFtch, Dcd, Exec, Mem, WB and drawn as I$, Reg, ALU, D$, Reg; components: PC, +4, instruction memory, registers (rs, rt, rd), imm, ALU, data memory.)

  20. Graphical Pipeline Representation (in Reg, right half highlighted means read, left half means write). (Figure: time in clock cycles runs across; instruction order runs down: Load, Add, Store, Sub, Or; each instruction is drawn as I$, Reg, ALU, D$, Reg.)

  21. Example • Suppose 2 ns for memory access, 2 ns for ALU operation, and 1 ns for register file read or write • Nonpipelined Execution: • lw: IF + Read Reg + ALU + Memory + Write Reg = 2 + 1 + 2 + 2 + 1 = 8 ns • add: IF + Read Reg + ALU + Write Reg = 2 + 1 + 2 + 1 = 6 ns • Pipelined Execution: • clock cycle = Max(IF, Read Reg, ALU, Memory, Write Reg) = 2 ns
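
The numbers in this example can be reproduced with a few lines. This is only a sketch of the slide's arithmetic; the dictionary and function names are assumptions made for illustration.

```python
# Time per step in nanoseconds, as given on the slide.
STEP_NS = {"IF": 2, "Read Reg": 1, "ALU": 2, "Memory": 2, "Write Reg": 1}

# Steps each instruction actually uses in the nonpipelined design.
STEPS_USED = {
    "lw": ["IF", "Read Reg", "ALU", "Memory", "Write Reg"],
    "add": ["IF", "Read Reg", "ALU", "Write Reg"],
}


def nonpipelined_ns(instruction):
    """Total time of one instruction when its steps run back to back."""
    return sum(STEP_NS[step] for step in STEPS_USED[instruction])


def pipelined_cycle_ns():
    """The pipeline clock must fit the slowest stage."""
    return max(STEP_NS.values())


print(nonpipelined_ns("lw"))   # 8
print(nonpipelined_ns("add"))  # 6
print(pipelined_cycle_ns())    # 2
```

Note that under this model a pipelined lw passes through five 2 ns stages, so its latency is 10 ns rather than 8 ns; the gain is that one instruction can complete every 2 ns.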

  22. Division into stages

  23. Adding the registers

  24. lw: Instruction fetch (figure: pipelined datapath with pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB; components: PC, Add (+4), instruction memory, registers, sign extend (16 to 32), shift left 2, ALU, data memory, muxes; label: IF/ID)

  25. lw: Instruction decode (figure: pipelined datapath with pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB; components: PC, Add (+4), instruction memory, registers, sign extend (16 to 32), shift left 2, ALU, data memory, muxes; label: ID/EX)

  26. EX/MEM

  27. lw: Memory (figure: pipelined datapath with pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB; components: PC, Add (+4), instruction memory, registers, sign extend (16 to 32), shift left 2, ALU, data memory, muxes; label: MEM/WB)

  28. lw: Write back (figure: pipelined datapath with pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB; components: PC, Add (+4), instruction memory, registers, sign extend (16 to 32), shift left 2, ALU, data memory, muxes)

  29. A correction: keep the right Rd all the way!

  30. So here is the updated CPU:

  31. Control

  32. The control lines

  33. Datapath with Control

  34. Example. A demonstration of a sequence of instructions: lw $10, 20($1); sub $11, $2, $3; and $12, $4, $5; or $13, $6, $7; add $14, $8, $9

  35. ID: and $12, $4, $5

  36. Clock 9 (figure: datapath with control; IF: after<4>, ID: after<3>, EX: after<2>, MEM: after<1>, WB: add $14, ...; control signals shown include RegDst, ALUSrc, ALUOp, Branch, MemRead, MemWrite, MemtoReg, RegWrite)

  37. Problems for Computers • Limits to pipelining: Hazards prevent the next instruction from executing during its designated clock cycle • Structural hazards: HW cannot support this combination of instructions (single person to fold and put clothes away) • Control hazards: Pipelining of branches & other instructions stalls the pipeline until the hazard is resolved, leaving “bubbles” in the pipeline • Data hazards: Instruction depends on result of prior instruction still in the pipeline

  38. An example for data hazards: sub $2, $1, $3; and $12, $2, $5; or $13, $6, $2; add $14, $2, $2; sw $15, 100($2)

  39. An example for data hazards: sub $2, $1, $3; and $12, $2, $5; or $13, $6, $2; add $14, $2, $2; sw $15, 100($2). Register $2 is updated only at the WB phase of sub, i.e., the 5th clock cycle (actually at the end of the 5th clock cycle). However, the and instruction tries to read $2 in its decode phase, in the 3rd clock cycle.
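
The timing argument above can be turned into a small hazard check. This is a sketch under the slide's assumptions: the producer writes its register at the end of cycle 5, and an instruction that is d positions later reads its registers in cycle 2 + d, so it reads a stale value when d is 1, 2, or 3. The tuple encoding of instructions and the name `data_hazards` are illustrative choices, not part of the slides.

```python
# Each instruction: (mnemonic, destination register or None, source registers).
PROGRAM = [
    ("sub", "$2", ["$1", "$3"]),
    ("and", "$12", ["$2", "$5"]),
    ("or", "$13", ["$6", "$2"]),
    ("add", "$14", ["$2", "$2"]),
    ("sw", None, ["$15", "$2"]),  # sw reads both $15 and $2, writes no register
]


def data_hazards(program, min_distance=4):
    """Return (producer, consumer) index pairs that are too close together.

    A consumer is safe only if it is at least `min_distance` instructions
    after the producer of a register it reads.
    """
    hazards = []
    for i, (_, dest, _) in enumerate(program):
        if dest is None or dest == "$0":  # $0 is never really written
            continue
        for j in range(i + 1, min(i + min_distance, len(program))):
            if dest in program[j][2]:
                hazards.append((i, j))
            if program[j][1] == dest:  # a newer write hides the older one
                break
    return hazards


print(data_hazards(PROGRAM))  # [(0, 1), (0, 2), (0, 3)]
```

Here and, or, and add all read $2 too early, while sw is four instructions after sub and reads the updated value.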

  40. Graphic representation of data hazards:

  41. Solving data hazards by adding nops: sub $2, $1, $3; nop; nop; nop; and $12, $2, $5; or $13, $6, $2; add $14, $2, $2; sw $15, 100($2)

  42. Solving data hazards by adding nops (figure: pipeline diagram, time in clock cycles CC 1 to CC 12; value of register $2: 10 in CC 1 to CC 4, 10/-20 in CC 5, -20 in CC 6 to CC 12; program execution order: sub $2, $1, $3; nop; nop; nop; and $12, $2, $5; or $13, $6, $2; add $14, $2, $2; sw $15, 100($2); each instruction is drawn as IM, Reg, ALU, DM, Reg)

  43. The internal structure of the Register File (figure). Inputs: Rd reg 1 (= Rs, 5 bits), Rd reg 2 (= Rt, 5 bits), Wr reg (= Rd, 5 bits), write data (32 bits), RegWrite (enable, E). Outputs: Read data 1 (32 bits), Read data 2 (32 bits). We read the values of two different registers simultaneously from the two outputs, and we write to one of the other registers (at the next rising clock edge).

  44. We could earn 1 ck cycle if GPR is “transparent”, i.e., we could see the write data to the GPR at the GPR outputs (if the write address equals the read address), i.e., during Ck #5. (Figure: pipeline diagram, time in clock cycles CC 1 to CC 12; value of register $2: 10 in CC 1 to CC 4, 10/-20 in CC 5, -20 in CC 6 to CC 12; program execution order: sub $2, $1, $3; nop; nop; and $12, $2, $5; or $13, $6, $2; add $14, $2, $2; sw $15, 100($2))

  45. The internal structure of the modified Register File. We “bypass” the input data (the write data) to the read data 1 output whenever Rs = Rd/Rt (i.e., whenever read reg 1 = write reg, but not zero). We “bypass” the input data (the write data) to the read data 2 output whenever Rt = Rd/Rt (i.e., whenever read reg 2 = write reg, but not zero). (Figure inputs: Rd reg 1 (= Rs, 5 bits), Rd reg 2 (= Rt, 5 bits), Wr reg (= Rd, 5 bits), write data (32 bits), RegWrite (enable, E); outputs: Read data 1 and Read data 2, 32 bits each, each selected by a mux between the register contents and the write data.) We read the values of two different registers simultaneously from the two outputs, and we write to one of the other registers (at the next rising clock edge).
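
A behavioral sketch of the bypass described on this slide, written as a small Python model rather than hardware. The class and method names are hypothetical, and the model assumes a write takes effect in the array only at the next clock edge.

```python
class RegisterFile:
    """32 registers; optionally 'transparent' to a pending write."""

    def __init__(self, transparent):
        self.regs = [0] * 32
        self.transparent = transparent

    def read(self, rs, rt, write_reg=0, write_data=0, reg_write=False):
        """Read two registers in a cycle where a write may be pending."""
        def port(r):
            # Bypass: forward the write data when the read address equals
            # the write address, RegWrite is set, and the register is not 0.
            if self.transparent and reg_write and r == write_reg and r != 0:
                return write_data
            return self.regs[r]
        return port(rs), port(rt)

    def clock_edge(self, write_reg, write_data, reg_write):
        """Commit the pending write at the rising clock edge."""
        if reg_write and write_reg != 0:
            self.regs[write_reg] = write_data


plain = RegisterFile(transparent=False)
bypassed = RegisterFile(transparent=True)
for rf in (plain, bypassed):
    rf.regs[2] = 10  # old value of $2, as in the example program

# Cycle 5: sub is writing -20 to $2 while a later instruction reads $2 and $5.
print(plain.read(2, 5, write_reg=2, write_data=-20, reg_write=True))     # (10, 0)
print(bypassed.read(2, 5, write_reg=2, write_data=-20, reg_write=True))  # (-20, 0)
```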

  46. After that change we need only 2 nops: sub $2, $1, $3; nop; nop; and $12, $2, $5; or $13, $6, $2; add $14, $2, $2; sw $15, 100($2). After the change, the WB of an earlier instruction can happen in the same cycle as the register read (decode) phase of a later instruction (the one with two other instructions in between). So when we have a data hazard, we need to add only two nop instructions. Unfortunately, this happens too often. We need a better solution!
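
The nop counts on these slides (three without the transparent register file, two with it) can be reproduced by a simple padding pass. This is a sketch, not a real assembler pass: the instruction encoding, the `insert_nops` name, and the `min_distance` parameter (4 for the original register file, 3 for the transparent one) are assumptions for illustration.

```python
NOP = ("nop", None, [])

# Each instruction: (mnemonic, destination register or None, source registers).
PROGRAM = [
    ("sub", "$2", ["$1", "$3"]),
    ("and", "$12", ["$2", "$5"]),
    ("or", "$13", ["$6", "$2"]),
    ("add", "$14", ["$2", "$2"]),
    ("sw", None, ["$15", "$2"]),
]


def insert_nops(program, min_distance):
    """Pad with nops so every reader is at least `min_distance`
    instructions after the instruction that writes its source."""
    out = []
    for instr in program:
        sources = instr[2]
        needed = 0
        # Look back over the instructions that are still too close.
        for back in range(1, min_distance):
            if back > len(out):
                break
            dest = out[-back][1]
            if dest is not None and dest in sources:
                needed = max(needed, min_distance - back)
        out.extend([NOP] * needed)
        out.append(instr)
    return out


print([i[0] for i in insert_nops(PROGRAM, 4)])
# ['sub', 'nop', 'nop', 'nop', 'and', 'or', 'add', 'sw']
print([i[0] for i in insert_nops(PROGRAM, 3)])
# ['sub', 'nop', 'nop', 'and', 'or', 'add', 'sw']
```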

  47. Graphic representation of data hazards:
