1 / 39

Advanced Pipelining and Instruction Level Parallelism (ILP)

Advanced Pipelining and Instruction Level Parallelism (ILP). HW Schemes: Instruction Parallelism. Why in HW at run time? Works when can’t know real dependence at compile time Makes the compiler simpler Code for one machine runs well on another

chace
Télécharger la présentation

Advanced Pipelining and Instruction Level Parallelism (ILP)

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Advanced Pipelining and Instruction Level Parallelism (ILP)

  2. HW Schemes: Instruction Parallelism • Why in HW at run time? • Works when can’t know real dependence at compile time • Makes the compiler simpler • Code for one machine runs well on another • Key idea: Allow instructions behind stall to proceed DIVDF0,F2,F4 ADDD F10,F0,F8 SUBD F12,F8,F14 • Enables out-of-order execution => out-of-order completion • ID stage checked both for structural and data hazards • Scoreboard dates to CDC 6600 in 1963

  3. Dynamic Scheduling Compiler scheduling is static scheduling • Idea: Dynamic HW control of hazard and issue • Hazards handled are structural and data • Control are also handled, but ignore them for now • Issue to EX stage only when things are clear • Implementation: two approaches in the book • Control-centric – Scoreboarding • Data-centric – Tomasulo algorithm • Variants of these basic schemes are also seen in today’s processors • Issue queue – decentralized control-centric • Also distributed versions of the data-centric approach

  4. Scoreboarding • Out-of-order execution divides ID stage: 1. Issue—decode instructions, check for structural hazards 2. Read operands—wait until no data hazards, then read operands • Scoreboards allow instruction to execute whenever 1 & 2 hold, not waiting for prior instructions • CDC 6600: In order issue, out of order execution, out of order commit ( also called completion)

  5. Scoreboard Implications • Out-of-order completion => WAR, WAW hazards? • Solutions for WAR • Queue both the operation and copies of its operands • Read registers only during Read Operands stage • For WAW, must detect hazard: stall until other completes • Need to have multiple instructions in execution phase => multiple execution units or pipelined execution units • Scoreboard keeps track of dependencies, state or operations • Scoreboard replaces ID, EX, WB with 4 stages

  6. Rules Text is a bit underspecified due to ALU/FP focus • In reality, issue rules are influenced by: • Instruction window (how many instructions you see) • Size affects how many instructions are available for issue • Bigger tends to be better up to a point (bigger is always more expensive…) • Fill rate – how many instructions can enter the window per cycle • Issue rate – how many instructions can issue per cycle • Execution units • What and how many? • Result latencies and initiation intervals have a clear influence • Register file issues (FPRs and GPRs) • Size and number of read and write ports available • Bypass (forwarding) capabilities • Coupled with execution latencies to determine operand availability

  7. Issue Rules Idea is to only issue when hazards clear • Structural hazards • Obvious – desired EX unit must be free • Bus structure influences • Sharing of decoded instruction or register file ports can limit issue • E.g. FMPY and FDIV units may exist, but only one may issue per cycle • Data Hazards • More complex due to parallel multi-cycle EX units • RAW – delay issue until producing instruction clears • WAW – delay issue if pending Rd target is in EX • WAR – delay issue of Writing inst until reading inst clears operand read stage

  8. Four Stages of Scoreboard Control 1. Issue—decode instructions & check for structural hazards (ID1) If a functional unit for the instruction is free and no other active instruction has the same destination register (WAW), the scoreboard issues the instruction to the functional unit and updates its internal data structure. If a structural or WAW hazard exists, then the instruction issue stalls, and no further instructions will issue until these hazards are cleared. 2. Read operands—wait until no data hazards, then read operands (ID2) A source operand is available if no earlier issued active instruction is going to write it, or if the register containing the operand is being written by a currently active functional unit. When the source operands are available, the scoreboard tells the functional unit to proceed to read the operands from the registers and begin execution. The scoreboard resolves RAW hazards dynamically in this step, and instructions may be sent into execution out of order.

  9. Four Stages of Scoreboard Control 3. Execution—operate on operands (EX) The functional unit begins execution upon receiving operands. When the result is ready, it notifies the scoreboard that it has completed execution. 4. Write result—finish execution (WB) Once the scoreboard is aware that the functional unit has completed execution, the scoreboard checks for WAR hazards. If none, it writes results. If WAR, then it stalls the instruction. Example: DIVD F0,F2,F4 ADDD F10,F0,F8 SUBDF8,F8,F14 CDC 6600 scoreboard would stall SUBD until ADDD reads operands

  10. Three Parts of the Scoreboard 1.Instruction status—which of 4 steps the instruction is in 2. Functional unit status—Indicates the state of the functional unit (FU). 9 fields for each functional unit Busy—Indicates whether the unit is busy or not Op—Operation to perform in the unit (e.g., + or –) Fi—Destination register Fj, Fk—Source-register numbers Qj, Qk—Functional units producing source registers Fj, Fk Rj, Rk—Flags indicating when Fj, Fk are ready 3. Register result status—Indicates which functional unit will write each register, if one exists. Blank when no pending instructions will write that register

  11. Instruction status Wait until Bookkeeping Issue Not busy (FU) and not result(D) Busy(FU) yes; Op(FU) op; Fi(FU) `D’; Fj(FU) `S1’; Fk(FU) `S2’; Qj Result(‘S1’); Qk Result(`S2’); Rj not Qj; Rk not Qk; Result(‘D’) FU; Read operands Rj and Rk Rj No; Rk No Execution complete Functional unit done Write result f((Fj( f )≠Fi(FU) or Rj( f )=No) & (Fk( f ) ≠Fi(FU) or Rk( f )=No)) f(if Qj(f)=FU then Rj(f) Yes);f(if Qk(f)=FU then Rj(f) Yes); Result(Fi(FU)) 0; Busy(FU) No Detailed Scoreboard Pipeline Control

  12. Scoreboarding Example • Pick some functional unit latencies • FP Add / Sub is 2 cycles • FP Mult is 10 cycles • FP divide is 40 cycles

  13. Scoreboard Example

  14. Scoreboard Example Cycle 1

  15. Scoreboard Example Cycle 2 • Issue 2nd LD?

  16. Scoreboard Example Cycle 3 • Issue MULT?

  17. Scoreboard Example Cycle 4

  18. Scoreboard Example Cycle 5

  19. Scoreboard Example Cycle 6

  20. Scoreboard Example Cycle 7 • Read multiply operands?

  21. Scoreboard Example Cycle 8a

  22. Scoreboard Example Cycle 8b

  23. Scoreboard Example Cycle 9 • Read operands for MULT & SUBD? Issue ADDD?

  24. Scoreboard Example Cycle 11

  25. Scoreboard Example Cycle 12 • Read operands for DIVD?

  26. Scoreboard Example Cycle 13

  27. Scoreboard Example Cycle 14

  28. Scoreboard Example Cycle 15

  29. Scoreboard Example Cycle 16

  30. Scoreboard Example Cycle 17 • Write result of ADDD?

  31. Scoreboard Example Cycle 18

  32. Scoreboard Example Cycle 19 Figure 4.5 in the book…

  33. Scoreboard Example Cycle 20

  34. Scoreboard Example Cycle 21

  35. Scoreboard Example Cycle 22

  36. Scoreboard Example Cycle 61

  37. Scoreboard Example Cycle 62

  38. CDC 6600 Scoreboard • Speedup 1.7 from compiler; 2.5 by hand BUT slow memory (no cache) limits benefit • Limitations of 6600 scoreboard: • No forwarding hardware • Limited to instructions in basic block (small window) • Small number of functional units (structural hazards), especailly integer/load store units • Do not issue on structural hazards • Wait for WAR hazards • Prevent WAW hazards

  39. Intrinsic Scoreboard Limits • Amount of ILP • Based on true dependencies (RAW hazards) • Compiler can do a lot to reduce this, unrolling, scheduling, etc. • Number of Scoreboard entries • Window size • Fill rate and issue rate • Number and type of functional units • More means fewer structural hazards • Antidependencies and output dependencies • WAR and WAW hazards • Renaming (using virtual registers) and more registers will help • Plus you can also speculate…

More Related