
Lecture 9 Dynamic Scheduling of Pipeline


jatin




Presentation Transcript


  1. Lecture 9: Dynamic Scheduling of Pipeline, CS510 Computer Architectures

  2. Static vs Dynamic Scheduling • Static scheduling by the compiler • Code motion to fill load delay slots and branch delay slots • Code motion to avoid data dependences • In-order instruction issue: • If an instruction is stalled, no later instruction can proceed • Multiple copies of a functional unit may sit idle, which is inefficient • Dynamic scheduling by hardware • Allows out-of-order execution and out-of-order completion • Even when an instruction is stalled, later instructions that have no data dependence on the stalled instruction (or on the instruction causing the stall) can proceed • Makes efficient use of multiple functional units

  3. HW Schemes: Instruction Parallelism • Why schedule in hardware at run time? • It works when dependences are unknown at compile time • It keeps the compiler simpler • Code compiled for one machine runs well on another • Key idea: allow instructions behind a stall to proceed: DIVD F0,F2,F4; ADDD F10,F0,F8; SUBD F8,F8,F14 • In DLX, SUBD cannot execute even if a separate adder is available, because DLX keeps execution in order and ADDD is stalled waiting for F0 from DIVD • This enables out-of-order execution, and therefore out-of-order completion • In DLX, the ID stage checks for both structural hazards and data hazards

  4. HW Schemes: Instruction Parallelism • Out-of-order execution splits the ID stage in two: 1. Issue: decode instructions and check for structural hazards 2. Read operands: wait until there are no data hazards, then read operands • Scoreboards (Control Data Corporation CDC 6600) let an instruction execute as soon as conditions 1 and 2 hold, without waiting for earlier instructions • Centralized hazard detection and resolution • Every instruction goes through the scoreboard • The scoreboard determines when an instruction can read its operands and begin execution • It monitors every change in the hardware to decide when each instruction can execute

  5. Scoreboard Implications • Out-of-order completion introduces WAR and WAW hazards: • WAR: ADDD R1,R2,R3 followed by LD R2,X (LD must not overwrite R2 before ADDD reads it) • WAW: ADDD R1,R2,R3 followed by LD R1,X (R1 must end up holding LD's result) • Solutions for WAR: • Queue both the operation and copies of its operands • Read registers only during the Read Operands stage • For WAW: stall until the earlier instruction writing the same register completes • Multiple instructions must be in the execution phase at once, which requires multiple execution units or pipelined execution units (superpipelining) • The scoreboard keeps track of dependences and the state of operations • The scoreboard replaces the ID, EX, and WB stages with 4 stages
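The three hazard types on this slide can be told apart mechanically by comparing register names. The Python sketch below is illustrative only: `Instr` and `dependences` are hypothetical names, and memory operands such as `X` are ignored because only register hazards matter here.

```python
from typing import NamedTuple, Optional, Set, Tuple


class Instr(NamedTuple):
    text: str
    dest: Optional[str]          # register written (None if none)
    srcs: Tuple[str, ...]        # registers read; memory operands are ignored


def dependences(first: Instr, second: Instr) -> Set[str]:
    """Hazard types between `first` and a later `second` in program order."""
    kinds: Set[str] = set()
    if first.dest is not None and first.dest in second.srcs:
        kinds.add("RAW")   # second reads the value first produces
    if second.dest is not None and second.dest in first.srcs:
        kinds.add("WAR")   # second must not overwrite a source before first reads it
    if first.dest is not None and first.dest == second.dest:
        kinds.add("WAW")   # second's value must be the one left in the register
    return kinds


add = Instr("ADDD R1,R2,R3", "R1", ("R2", "R3"))
print(dependences(add, Instr("LD R2,X", "R2", ())))   # {'WAR'}
print(dependences(add, Instr("LD R1,X", "R1", ())))   # {'WAW'}
```

The same function flags the earlier three-instruction example: DIVD F0,F2,F4 followed by ADDD F10,F0,F8 is a RAW dependence, and ADDD F10,F0,F8 followed by SUBD F8,F8,F14 is a WAR hazard.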

  6. 4 Stages of Scoreboard Control: 1st Stage (ID1) - Issue • Decode instructions and check for structural hazards • If the functional unit for the instruction is free (no structural hazard) and no other active instruction has the same destination register (no WAW hazard): • The scoreboard issues the instruction to the functional unit • It updates its internal data structure • If a structural hazard or WAW hazard exists: • Stall instruction issue • No further instructions issue until the hazards are cleared • The IF/ID1 buffer allows further instruction fetch (IF) to continue

  7. 4 Stages of Scoreboard Control: 2nd Stage (ID2) - Read Operands • Wait until there is no data hazard, then read operands • To prevent RAW hazards, a source operand is available for reading when no earlier-issued active instruction is going to write it, i.e., when none of the currently active functional units is going to write the register holding the operand • The scoreboard then tells the functional unit to read its operands and begin execution • The scoreboard thus resolves RAW hazards dynamically • => out-of-order execution

  8. 4 Stages of Scoreboard Control: 3rd Stage (EX) - Execution • Operates on operands • The functional unit begins execution upon receiving its operands • When the result is ready, the functional unit notifies the scoreboard that execution is complete

  9. 4 Stages of Scoreboard Control: 4th Stage (WB) - Write Result • Finish execution • Once the scoreboard knows the functional unit has completed execution, it checks for a WAR hazard: • If there is none, it writes the result • If there is a WAR hazard, it stalls the instruction • Example: • DIVD F0,F2,F4 • ADDD F10,F0,F8 • SUBD F8,F8,F14 • The CDC 6600 scoreboard would stall SUBD's write of F8 until ADDD has read its operands (ADDD is itself waiting for F0 from DIVD)
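A minimal sketch of the Write Result WAR check, applied to the example above. It assumes a stripped-down, hypothetical `UnitStatus` record holding only the Fj, Fk, Rj, Rk fields of a functional unit, where Rj/Rk = Yes means "ready but not yet read"; `can_write` is likewise an illustrative name.

```python
from typing import List, NamedTuple, Optional


class UnitStatus(NamedTuple):
    """Hypothetical subset of a functional unit status row."""
    name: str
    fj: Optional[str]
    fk: Optional[str]
    rj: bool   # True (Yes) = Fj is ready but has not been read yet
    rk: bool


def can_write(dest: str, units: List[UnitStatus]) -> bool:
    # Writing dest is safe only if no unit still has dest as a ready, unread source.
    return all((u.fj != dest or not u.rj) and (u.fk != dest or not u.rk) for u in units)


# ADDD F10,F0,F8 has issued but cannot read yet: F0 is still being computed
# by DIVD (Rj = No), while F8 is ready and unread (Rk = Yes).
addd = UnitStatus("ADDD", "F0", "F8", rj=False, rk=True)
print(can_write("F8", [addd]))               # False: SUBD must stall (WAR)

# After ADDD reads its operands, Rj = Rk = No and SUBD may write F8.
addd_after_read = addd._replace(rk=False)
print(can_write("F8", [addd_after_read]))    # True
```

Note that Rj = No covers two situations (operand not yet produced, or already read); in both cases the old register value is no longer needed by that unit, so the write may proceed.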


  11. 3 Parts of the Scoreboard 1. Instruction status: indicates which of the 4 steps (Issue, Read Operands, Execution Complete, Write Result) the instruction is in 2. Functional unit status: indicates the state of each functional unit (FU), with 9 fields per unit: • Busy: whether the unit is busy • Op: operation to perform in the unit (e.g., + or -) • Fi: destination register number • Fj, Fk: source register numbers • Qj, Qk: functional units producing source registers Fj, Fk • Rj, Rk: flags indicating when Fj, Fk are ready 3. Register result status: indicates which functional unit will write each register, if any; blank when no pending instruction will write that register

  12. Scoreboard Pipeline Control (FU: the unit used by the instruction; D, S1, S2: its destination and source registers; f ranges over all functional units)
    Issue: wait until not Busy(FU) (no structural hazard) and not Result(D) (no WAW hazard: no other active instruction has the same destination register). Bookkeeping: Busy(FU) ← Yes; Op(FU) ← op; Fi(FU) ← D; Fj(FU) ← S1; Fk(FU) ← S2; Qj ← Result(S1); Qk ← Result(S2); Rj ← not Qj; Rk ← not Qk; Result(D) ← FU
    Read operands: wait until Rj and Rk. Bookkeeping: Rj ← No; Rk ← No; Qj ← 0; Qk ← 0
    Execution complete: wait until the functional unit is done.
    Write result: wait until ∀f ((Fj(f) ≠ Fi(FU) or Rj(f) = No) and (Fk(f) ≠ Fi(FU) or Rk(f) = No)) (no WAR hazard). Bookkeeping: ∀f (if Qj(f) = FU then Rj(f) ← Yes); ∀f (if Qk(f) = FU then Rk(f) ← Yes); Result(Fi(FU)) ← 0; Busy(FU) ← No
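To see how these bookkeeping rules produce the timing in the scoreboard example that follows, here is a small cycle-by-cycle simulator in Python. It is a sketch, not the CDC 6600 hardware: the program encoding, the unit names, and the latencies (Integer/load 1, Add 2, Mult 10, Divide 40 cycles) are assumptions chosen to match the example; each stage takes one clock, and every decision in a cycle is made from the state at the start of that cycle, so a result written in cycle c can be read, and a freed unit reissued, in cycle c+1.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical encoding of the example: (op, dest, src1, src2).
# The load base registers R2/R3 are never written here, so they are always ready.
PROGRAM = [
    ("LD", "F6", None, "R2"),
    ("LD", "F2", None, "R3"),
    ("MULT", "F0", "F2", "F4"),
    ("SUBD", "F8", "F6", "F2"),
    ("DIVD", "F10", "F0", "F6"),
    ("ADDD", "F6", "F8", "F2"),
]
# Assumed unit mapping and execution latencies in cycles.
UNIT_FOR = {"LD": ["Integer"], "MULT": ["Mult1", "Mult2"], "SUBD": ["Add"],
            "ADDD": ["Add"], "DIVD": ["Divide"]}
LATENCY = {"LD": 1, "MULT": 10, "SUBD": 2, "ADDD": 2, "DIVD": 40}


@dataclass
class FU:
    """One row of the functional unit status table."""
    busy: bool = False
    op: str = ""
    fi: Optional[str] = None   # destination register
    fj: Optional[str] = None   # source registers
    fk: Optional[str] = None
    qj: Optional[str] = None   # units producing Fj, Fk (None = no pending producer)
    qk: Optional[str] = None
    rj: bool = False           # Fj, Fk ready and not yet read
    rk: bool = False
    inst: int = -1             # index of the instruction occupying the unit


def no_war(fus, reg):
    # Write-result condition: no busy unit still waits to read the old value of reg.
    return all(not ((g.fj == reg and g.rj) or (g.fk == reg and g.rk))
               for g in fus.values() if g.busy)


def simulate(program, max_cycles=200):
    fus = {name: FU() for names in UNIT_FOR.values() for name in names}
    result = {}                              # register result status: reg -> unit
    status = [[None] * 4 for _ in program]   # Issue, Read, Exec complete, Write
    nxt = 0
    for cycle in range(1, max_cycles + 1):
        # 1) Decide every action from the state at the start of the cycle.
        issue = None
        if nxt < len(program):
            op, dest, _, _ = program[nxt]
            free = [u for u in UNIT_FOR[op] if not fus[u].busy]
            if free and dest not in result:  # no structural and no WAW hazard
                issue = free[0]
        reads = [u for u, f in fus.items()
                 if f.busy and status[f.inst][1] is None and f.rj and f.rk]
        done = [u for u, f in fus.items()
                if f.busy and status[f.inst][1] is not None
                and status[f.inst][2] is None
                and cycle == status[f.inst][1] + LATENCY[f.op]]
        writes = [u for u, f in fus.items()
                  if f.busy and status[f.inst][2] is not None and no_war(fus, f.fi)]
        # 2) Apply them; issue first so a same-cycle write still wakes the new entry.
        if issue:
            op, dest, s1, s2 = program[nxt]
            qj, qk = result.get(s1), result.get(s2)
            fus[issue] = FU(True, op, dest, s1, s2, qj, qk, qj is None, qk is None, nxt)
            result[dest] = issue
            status[nxt][0] = cycle
            nxt += 1
        for u in reads:
            f = fus[u]
            f.rj = f.rk = False
            f.qj = f.qk = None
            status[f.inst][1] = cycle
        for u in done:
            status[fus[u].inst][2] = cycle
        for u in writes:
            f = fus[u]
            for g in fus.values():           # tell waiting units their operand is ready
                if g.qj == u:
                    g.rj = True
                if g.qk == u:
                    g.rk = True
            del result[f.fi]
            status[f.inst][3] = cycle
            fus[u] = FU()
        if all(s[3] is not None for s in status):
            break
    return status


for (op, dest, s1, s2), row in zip(PROGRAM, simulate(PROGRAM)):
    print(f"{op:5} {dest:4} {s1 or '-':4} {s2:4}", row)
```

Running it prints one row per instruction with its Issue, Read Operands, Execution Complete, and Write Result cycles; under the assumptions above these match the final slide of the example, e.g. DIVD at 8, 21, 61, 62 and ADDD at 13, 14, 16, 22, where ADDD's write is held back by the WAR hazard on F6 until DIVD has read it.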

  13. Scoreboard Example, Clock 0 • Instruction sequence: LD F6,34(R2); LD F2,45(R3); MULT F0,F2,F4; SUBD F8,F6,F2; DIVD F10,F0,F6; ADDD F6,F8,F2 • Instruction status columns: Issue, Read Operands, Execution Complete, Write Result (all empty) • Functional unit status (fields Busy, Op, Fi, Fj, Fk, Qj, Qk, Rj, Rk): Integer, Mult1, Mult2, Add, Divide, all not busy • Register result status (F0, F2, F4, ..., F30): all empty

  14. Scoreboard Example, Cycle 1 • LD F6,34(R2) issues (1) to the Integer unit: Busy = Y, Op = Load, Fi = F6, Fk = R2, Rk = Y • Register result status: F6 = Int

  15. Scoreboard Example, Cycle 2 • LD F6 reads its operands (2); the Integer unit's Rk becomes N • The second LD cannot issue because the Integer unit is busy (structural hazard)

  16. Scoreboard Example, Cycle 3 • LD F6 completes execution (3)

  17. Scoreboard Example, Cycle 4 • LD F6 writes its result (4), freeing the Integer unit

  18. Scoreboard Example, Cycle 5 • LD F2,45(R3) issues (5) to the Integer unit: Busy = Y, Op = Load, Fi = F2, Fk = R3, Rk = Y • Register result status: F2 = Int

  19. Scoreboard Example, Cycle 6 • LD F2 reads its operands (6) • MULT F0,F2,F4 issues (6) to Mult1: Fi = F0, Fj = F2, Fk = F4, Qj = Int, Rj = N, Rk = Y • Register result status: F0 = Mult1, F2 = Int

  20. Scoreboard Example, Cycle 7 • LD F2 completes execution (7) • SUBD F8,F6,F2 issues (7) to the Add unit: Fi = F8, Fj = F6, Fk = F2, Qk = Int, Rj = Y, Rk = N • MULT still waits for F2 (Rj = N) • Register result status: F0 = Mult1, F2 = Int, F8 = Add

  21. Scoreboard Example, Cycle 8a • DIVD F10,F0,F6 issues (8) to the Divide unit: Fi = F10, Fj = F0, Fk = F6, Qj = Mult1, Rj = N, Rk = Y • Register result status: F0 = Mult1, F2 = Int, F8 = Add, F10 = Div

  22. Scoreboard Example, Cycle 8b • LD F2 writes its result (8): the Integer unit is freed, Mult1's Rj and Add's Rk become Y, and F2 is cleared from the register result status

  23. Scoreboard Example, Cycle 9 • MULT and SUBD both read their operands (9); remaining execution time: Mult1 = 10, Add = 2 • ADDD cannot issue because the Add unit is busy (structural hazard)

  24. Scoreboard Example, Cycle 11 • SUBD completes execution (11); Mult1 has 8 cycles remaining

  25. Scoreboard Example, Cycle 12 • SUBD writes its result (12), freeing the Add unit and clearing F8 from the register result status; Mult1 has 7 cycles remaining

  26. Scoreboard Example, Cycle 13 • ADDD F6,F8,F2 issues (13) to the Add unit: Fi = F6, Fj = F8, Fk = F2, Rj = Y, Rk = Y • Register result status: F0 = Mult1, F6 = Add, F10 = Div • Mult1 has 6 cycles remaining

  27. Scoreboard Example, Cycle 14 • ADDD reads its operands (14); remaining execution time: Add = 2, Mult1 = 5

  28. Scoreboard Example, Cycle 15 • Remaining execution time: Add = 1, Mult1 = 4

  29. Scoreboard Example, Cycle 16 • ADDD completes execution (16); remaining execution time: Add = 0, Mult1 = 3

  30. Scoreboard Example, Cycle 17 • ADDD cannot write F6: DIVD has not yet read F6 (WAR hazard), and DIVD is itself still waiting for F0 from MULT • Mult1 has 2 cycles remaining

  31. Scoreboard Example, Cycle 18 • Mult1 has 1 cycle remaining; ADDD is still stalled

  32. Scoreboard Example, Cycle 19 • MULT completes execution (19)

  33. Scoreboard Example, Cycle 20 • MULT writes its result (20): Mult1 is freed, F0 is cleared from the register result status, and the Divide unit's Rj becomes Y

  34. Scoreboard Example, Cycle 21 • DIVD reads its operands (21), which clears the WAR hazard on F6

  35. Scoreboard Example, Cycle 22 • ADDD writes F6 (22), freeing the Add unit • Divide has 40 cycles remaining

  36. Scoreboard Example, Cycle 61 • DIVD completes execution (61)

  37. Scoreboard Example, Cycle 62 • DIVD writes its result (62); all instructions are done • Final instruction status (Issue / Read Operands / Execution Complete / Write Result): LD F6,34(R2): 1 / 2 / 3 / 4 • LD F2,45(R3): 5 / 6 / 7 / 8 • MULT F0,F2,F4: 6 / 9 / 19 / 20 • SUBD F8,F6,F2: 7 / 9 / 11 / 12 • DIVD F10,F0,F6: 8 / 21 / 61 / 62 • ADDD F6,F8,F2: 13 / 14 / 16 / 22

  38. Scoreboard Summary • Speedup of 1.7 for FORTRAN programs and 2.5 for hand-coded assembly language programs, BUT slow memory (no cache) limits the benefit • Limitations of the CDC 6600 scoreboard: • No forwarding hardware • Limited to instructions in a basic block (small window) • Small number of functional units (structural hazards) • Must wait for WAR hazards • Must prevent WAW hazards


  40. Case Study: Tomasulo Algorithm

  41. Limitations of Scoreboard • No forwarding • Limited to instructions in a basic block (small window) • Number of functional units (structural hazards) • Wait for WAR hazards • Prevent WAW hazards

  42. Another Dynamic Algorithm: Tomasulo Algorithm • Used in the IBM 360/91, about 3 years after the CDC 6600 • Goal: high performance without special compilers • Differences between the IBM 360 and CDC 6600 ISAs: • IBM has only 2 register specifiers per instruction vs. 3 in the CDC 6600 • IBM has 4 FP registers vs. 8 in the CDC 6600 • Differences between the Tomasulo algorithm and the scoreboard: • Control and buffers are distributed with the functional units (called "reservation stations") rather than centralized in a scoreboard • Register specifiers in instructions are replaced by pointers to reservation station buffers • Hardware register renaming avoids WAR and WAW hazards • The Common Data Bus (CDB) broadcasts results to all FUs • Loads and stores are treated as FUs as well

  43. Only Data Dependence with Register Renaming • Name dependences (arrows) and data dependences (blue and green)
    Loop with renamed registers (only data dependences remain):
      Loop: LD   F0,0(R1)
            ADDD F4,F0,F2
            SD   0(R1),F4
            LD   F6,-8(R1)
            ADDD F8,F6,F2
            SD   -8(R1),F8
            LD   F10,-16(R1)
            ADDD F12,F10,F2
            SD   -16(R1),F12
            LD   F14,-24(R1)
            ADDD F16,F14,F2
            SD   -24(R1),F16
            SUBI R1,R1,#32
            BNEZ R1,Loop
    Loop reusing F0 and F4 in every iteration (name dependences, removed by register renaming):
      Loop: LD   F0,0(R1)
            ADDD F4,F0,F2
            SD   0(R1),F4
            LD   F0,-8(R1)
            ADDD F4,F0,F2
            SD   -8(R1),F4
            LD   F0,-16(R1)
            ADDD F4,F0,F2
            SD   -16(R1),F4
            LD   F0,-24(R1)
            ADDD F4,F0,F2
            SD   -24(R1),F4
            SUBI R1,R1,#32
            BNEZ R1,Loop
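Renaming can be sketched as a single pass that gives every FP destination a fresh register and redirects later reads to the newest name. In this hypothetical Python sketch (`rename` and the tuple encoding are illustrative, not from the lecture), the free-register list is chosen so the output reproduces the renamed loop above (F0, F4, F6, ..., F16); memory offsets are dropped because they do not affect register names.

```python
from typing import List, Optional, Tuple

Instr = Tuple[str, Optional[str], Tuple[str, ...]]   # (opcode, dest, register sources)


def rename(code: List[Instr], free_regs: List[str]) -> List[Instr]:
    """Give each FP destination a fresh register so only RAW dependences remain."""
    latest = {}                 # architectural name -> newest renamed register
    free = list(free_regs)
    out = []
    for op, dest, srcs in code:
        new_srcs = tuple(latest.get(s, s) for s in srcs)   # read the newest version
        new_dest = dest
        if dest is not None and dest.startswith("F"):
            new_dest = free.pop(0)   # fresh name; raises IndexError if none are left
            latest[dest] = new_dest
        out.append((op, new_dest, new_srcs))
    return out


# Unrolled loop body before renaming (offsets 0, -8, -16, -24 omitted).
body: List[Instr] = []
for _ in range(4):
    body += [("LD", "F0", ("R1",)), ("ADDD", "F4", ("F0", "F2")), ("SD", None, ("F4", "R1"))]
body += [("SUBI", "R1", ("R1",)), ("BNEZ", None, ("R1",))]

renamed = rename(body, ["F0", "F4", "F6", "F8", "F10", "F12", "F14", "F16"])
for op, dest, srcs in renamed:
    print(op, dest or "-", ",".join(srcs))
```

Sources are looked up before the destination is renamed, so an instruction that reads and writes the same register still reads the older version. Integer registers (R1) are left alone in this sketch.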

  44. Tomasulo Organization (figure) • Instructions from the instruction unit enter the floating-point operation queue (issue) • FP registers • Load buffers 1-6 hold values loaded from memory into registers • Store buffers 1-3 (addresses) send data to memory • FP Add reservation stations 1-3 feed the FP adder; FP Multiply reservation stations 1-2 feed the FP multiplier, connected through the operand bus and operation bus • Results return on the Common Data Bus (CDB)

  45. Reservation Station Components • Op: operation to perform in the unit (e.g., + or -) • Qj, Qk: reservation stations producing the source values Vj, Vk; 0 indicates that Vj, Vk are already available, which eliminates the Rj, Rk fields of the scoreboard • Vj, Vk: values of the source operands • Busy: indicates that the reservation station and its FU are busy • Register result status: indicates which functional unit will write each register, if any; blank when no pending instruction will write that register

  46. Three Stages of Tomasulo Algorithm 1. Issue: get an instruction from the FP operation queue • FP op: if a reservation station is free, issue the instruction and send the operation and any operands already in registers (this renames registers) • LD/ST: if a buffer is available, issue the instruction • If no reservation station or buffer is available, there is a structural hazard: stall • Register renaming happens here 2. Execution: operate on operands (EX) • When an operand becomes ready, put it in the reservation station • If it is not ready, watch the CDB for the result • When both operands are available, execute • This is the RAW check 3. Write Result: finish execution (WB) • When the result is available, write it on the Common Data Bus and from there to all waiting units: registers and reservation stations • Mark the reservation station available
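The Issue and Write Result stages can be sketched in a few lines of Python. This is a simplified, hypothetical model: `RS`, `Tomasulo`, `issue`, and `broadcast` are illustrative names, execution timing is omitted, loads are modeled as stations with no register sources (the real design keeps them in separate load buffers), and the register values are made up. Replaying the first three instructions of the following example yields the cycle-3 register result status (F0 = Mult1, F2 = LD2, F6 = LD1) with Mult1 waiting on LD2.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class RS:
    """Reservation station entry: Busy, Op, Vj, Vk, Qj, Qk."""
    busy: bool = False
    op: str = ""
    vj: Optional[float] = None   # source values, valid once the matching Q is None
    vk: Optional[float] = None
    qj: Optional[str] = None     # station that will produce the value (None = ready)
    qk: Optional[str] = None


class Tomasulo:
    def __init__(self, regs: Dict[str, float], stations: List[str]):
        self.regs = dict(regs)                  # FP register values
        self.qi: Dict[str, str] = {}            # register result status
        self.rs = {name: RS() for name in stations}

    def _operand(self, reg: Optional[str]) -> Tuple[Optional[float], Optional[str]]:
        # A value if the register is up to date, otherwise the producing station's tag.
        if reg is None:
            return None, None
        if reg in self.qi:
            return None, self.qi[reg]
        return self.regs[reg], None

    def issue(self, station: str, op: str, dest: str,
              src1: Optional[str] = None, src2: Optional[str] = None) -> None:
        if self.rs[station].busy:
            raise RuntimeError(f"{station} busy: structural hazard, stall issue")
        vj, qj = self._operand(src1)
        vk, qk = self._operand(src2)            # read sources before renaming dest
        self.rs[station] = RS(True, op, vj, vk, qj, qk)
        self.qi[dest] = station                 # dest now names this station

    def broadcast(self, tag: str, value: float) -> None:
        """Write Result: put (tag, value) on the CDB; every waiter picks it up."""
        for entry in self.rs.values():
            if entry.qj == tag:
                entry.vj, entry.qj = value, None
            if entry.qk == tag:
                entry.vk, entry.qk = value, None
        for reg in [r for r, t in self.qi.items() if t == tag]:
            self.regs[reg] = value
            del self.qi[reg]
        self.rs[tag] = RS()                     # mark the station available


STATIONS = ["LD1", "LD2", "LD3", "Add1", "Add2", "Add3", "Mult1", "Mult2"]
regs = {f"F{i}": 0.0 for i in range(0, 31, 2)}
regs["F4"] = 2.5                                # hypothetical register contents
t = Tomasulo(regs, STATIONS)
t.issue("LD1", "LD", "F6")                      # LD F6,34(R2)
t.issue("LD2", "LD", "F2")                      # LD F2,45(R3)
t.issue("Mult1", "MULTD", "F0", "F2", "F4")     # MULTD F0,F2,F4
print(t.qi)            # {'F6': 'LD1', 'F2': 'LD2', 'F0': 'Mult1'}
print(t.rs["Mult1"])   # RS(busy=True, op='MULTD', vj=None, vk=2.5, qj='LD2', qk=None)
t.broadcast("LD2", 1.5)                         # LD F2 writes its result on the CDB
print(t.rs["Mult1"].vj, t.qi)                   # 1.5 {'F6': 'LD1', 'F0': 'Mult1'}
```

Because `issue` reads the sources before overwriting `qi[dest]`, a later instruction that writes the same register simply takes over the name; earlier readers keep the tag or value they captured, which is how renaming removes WAR and WAW hazards.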

  47. Tomasulo Example, Cycle 0 • Instruction sequence: LD F6,34(R2); LD F2,45(R3); MULTD F0,F2,F4; SUBD F8,F6,F2; DIVD F10,F0,F6; ADDD F6,F8,F2 • Instruction status columns: Issue, Exec Complete, Write Result (all empty) • Load buffers LD1, LD2, LD3 (Busy, Address): all not busy • Reservation stations Add1, Add2, Add3, Mult1, Mult2 (Time, Busy, Op, Vj, Vk, Qj, Qk): all not busy • Register result status Qi (F0, F2, ..., F30): empty • R2 = 80, R3 = 90

  48. Tomasulo Example, Cycle 1 • LD F6,34(R2) issues (1) to load buffer LD1: Busy = Yes, Address = 34+80 • Register result status: F6 = LD1

  49. Tomasulo Example, Cycle 2 • LD F2,45(R3) issues (2) to load buffer LD2: Busy = Yes, Address = 45+90 • Register result status: F2 = LD2, F6 = LD1

  50. Tomasulo Example, Cycle 3 • LD F6 completes execution (3) • MULTD F0,F2,F4 issues (3) to reservation station Mult1: Busy = Yes, Op = MULTD, Vk = R(F4) (the current value of F4), Qj = LD2 (waiting for F2) • Register result status: F0 = Mult1, F2 = LD2, F6 = LD1
