
Embedded Computer Architectures


Presentation Transcript


  1. Embedded Computer Architectures Hennessy & Patterson, Chapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation Gerard Smit (Zilverling 4102), smit@cs.utwente.nl André Kokkeler (Zilverling 4096), kokkeler@utwente.nl

  2. Contents • Introduction • Hazards <= dependencies • Instruction Level Parallelism; Tomasulo’s approach • Branch prediction

  3. Dependencies • True Data dependency • Name dependency • Antidependency • Output dependency • Control dependency

  4. Data Dependency [Figure: instructions i, i+1 and i+2 chained by data dependences, each consuming the result of the previous one] Two instructions are data dependent => risk of RAW hazard

  5. Name Dependency • Antidependence • Output dependence [Figure: Inst i reads a register or memory location that Inst j writes] Two instructions are antidependent => risk of WAR hazard [Figure: Inst i and Inst j both write the same register or memory location] Two instructions are output dependent => risk of WAW hazard
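The dependences above map directly onto the hazard types. As a hedged illustration (the tuple-based instruction encoding and the `hazard` function name are inventions of this sketch, not anything from the chapter), the check can be written as:

```python
# Minimal sketch: classify the hazard risks between two instructions
# (i earlier than j in program order) from their register read/write sets.
# Instruction format is hypothetical: (writes, reads) as sets of register names.

def hazard(instr_i, instr_j):
    """Return the hazards instruction j risks w.r.t. the earlier instruction i."""
    wi, ri = instr_i
    wj, rj = instr_j
    risks = []
    if wi & rj:            # j reads what i writes: true data dependence
        risks.append("RAW")
    if ri & wj:            # j writes what i reads: antidependence
        risks.append("WAR")
    if wi & wj:            # both write the same location: output dependence
        risks.append("WAW")
    return risks

# ADD F0,F2,F4 followed by SUB F6,F0,F8: SUB reads F0 => RAW risk
print(hazard(({"F0"}, {"F2", "F4"}), ({"F6"}, {"F0", "F8"})))  # ['RAW']
```

A real pipeline must also consider memory locations; the same set logic applies.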

  6. Control Dependency • Branch condition determines whether instruction i is executed => i is control dependent on the branch

  7. Instruction Level Parallelism • Pipelining = ILP • Other approach: Dynamic scheduling => out of order execution • Instruction Decode stage split into • Issue (decode, check for structural hazards) • Read Operands

  8. Instruction Level Parallelism • Scoreboard: • Sufficient resources • No data dependencies • Tomasulo’s approach • Minimize RAW hazards • Register renaming to minimize WAW and WAR hazards Read operands issue Reservation Station (park instructions while waiting for operands)

  9. Tomasulo’s approach • Register Renaming [Figure: timeline of register F0 use, marking the start of each instruction] • Read F0 • Write F0 • Read F0 • Write F0

  10. Tomasulo’s approach • Register Renaming [Figure: the same F0 timeline] • Read F0 • Write F0 • Read F0 • Write F0 Problems arise if the arrows (register uses) cross

  11. Tomasulo’s approach • Register Renaming [Figure: the F0 timeline with stalls] • Read F0 • Write F0 • Read F0 • Write F0 Instr 2, 3, … will be stalled. Note that Instr 2 and 3 are stalled only because Instr 1 is not ready; if not for Instr 1, they could be executed earlier.

  12. Tomasulo’s approach • Register Renaming [Figure: F0 renamed into Instr 1.Register F0 and Instr 3.Register F0] • Read F0 • Write F0 • Read F0 • Write F0 How is it arranged that the value is written into Instr 3.Register F0 and not into Instr 1.Register F0?

  13. Tomasulo’s approach • Register Renaming [Figure: Instr 1.Register F0 with Instr 1.F0Source = Instr. k; Instr 3.Register F0 with Instr 3.F0Source = Instr. 2] • Read F0 • Write F0 • Read F0 • Write F0 The result of Instr 2 is labelled with ‘Instr. 2’. Hardware checks whether there is an instruction waiting for the result (by checking the F0Source fields of instructions) and places the result in the correct place.

  14. Tomasulo’s approach • Register Renaming [Figure: Instr 3.Register F0 with Instr 3.F0Source = Instr. 2; a reservation-station entry with fields operation (read), F0Data and F0Source] • Read F0 • Write F0 • Read F0 • Write F0

  15. Tomasulo’s approach • Register Renaming [Figure: two reservation-station entries, each with fields operation (read), F0Data and F0Source] • Read F0 • Write F0 • Read F0 • Write F0

  16. Tomasulo’s approach • Register Renaming [Figure: the Issue stage fills reservation-station entries with fields operation (read)/(write), F0Data and F0Source; the operation and F0Source fields are filled during Issue, the F0Data fields during execution] • Read F0 • Write F0 • Read F0 • Write F0

  17. Tomasulo’s approach • Effects • Register Renaming: prevents WAW and WAR hazards • Execution starts only when operands are available (data fields are filled): resolves RAW hazards

  18. Tomasulo’s approach • Issue in more detail (issue is done sequentially) This is the only information you have: • Read F0 • Write F0 • Read F0 • Write F0 During issue, you have to keep track of which instruction changed F0 last! Reservation Station (format: label operation data source): read1 read Empty ????? | write1 write | read2 read Empty

  19. Tomasulo’s approach • Issue in more detail Keeping track of register status during issue is done for every register • Read F0 • Write F0 • Read F0 • Write F0 Register status of F0 after each issue: ???? => write1 => write1 => write2 Reservation Station (format: label operation data source): read1 read Empty ????? | write1 write | read2 read Empty write1 | write2 write

  20. Tomasulo’s approach • Definitions for the MIPS • For each reservation station: Name, Busy, Operation, Vj, Vk, Qj, Qk, A • Name = label • Busy = in execution or not • Operation = instruction • V = operand value • Q = operand source • A = memory address (Load, Store)
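A small software sketch of the issue step can make the V/Q fields concrete. Everything here (the dictionary-based register file, tag names like `rs0`, the `issue` function) is illustrative only, assuming the field meanings listed above:

```python
# Hedged sketch of Tomasulo issue: a register-status table remembers which
# reservation station will produce each register; at issue, each operand is
# captured either as a value (V field) or as a producer tag (Q field).

reg_value = {"F0": 1.0, "F2": 2.0, "F4": 3.0}   # architectural register file
reg_status = {}       # register -> tag of the station that will write it
stations = {}         # tag -> reservation-station entry
next_tag = [0]

def issue(op, dest, src1, src2):
    tag = f"rs{next_tag[0]}"; next_tag[0] += 1
    entry = {"op": op, "Vj": None, "Vk": None, "Qj": None, "Qk": None}
    for v, q, src in (("Vj", "Qj", src1), ("Vk", "Qk", src2)):
        if src in reg_status:          # operand still being produced
            entry[q] = reg_status[src]
        else:                          # operand value is available now
            entry[v] = reg_value[src]
    reg_status[dest] = tag             # rename: dest now comes from this station
    stations[tag] = entry
    return tag

t1 = issue("MUL", "F0", "F2", "F4")    # writes F0
t2 = issue("ADD", "F2", "F0", "F4")    # must wait for the MUL's F0
print(stations[t2]["Qj"] == t1)        # True: F0 renamed to the MUL's tag
```

When the MUL broadcasts its result with tag `t1`, hardware would fill every Q field holding that tag, exactly as the F0Source discussion above describes.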

  21. Tomasulo’s approach; hardware view [Figure: the instruction queue feeds the issue hardware, which performs register renaming and fills the reservation stations; “reservation fill hardware” puts results, together with the identification of the instruction producing them, in the correct place in the reservation stations; “execution control hardware” checks which instructions have their operands and a free execution unit, and transports operands to the execution units; results return over the Common Data Bus]

  22. Branch prediction • Data Hazards => Tomasulo’s approach • Branch (control) hazards => Branch prediction • Goal: Resolve outcome of branch early => prevent stalls because of control hazards

  23. Branch prediction; 1 history bit • Example: Outerloop: … R=10 Innerloop: … R=R-1 BNZ R, Innerloop … … Branch Outerloop History bit: was the branch taken previously or not: - predict taken: fetch from ‘Innerloop’ - predict not taken: fetch next instruction Actual outcome of branch: - taken: set history bit to ‘taken’ - not taken: set history bit to ‘not taken’ In this situation: correct prediction in 80% of branch evaluations
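The 80% figure can be reproduced with a few lines of simulation. This is a hedged sketch assuming the inner loop produces 9 taken branches and then 1 not-taken branch per outer iteration (R counting 10 down to 0); the `run_1bit` helper is an invention of this sketch:

```python
# 1-bit predictor on the slide's nested loop: the single history bit flips on
# every mispredict, so it misses twice per inner loop in steady state: the
# final fall-through, then the first taken branch of the next pass.

def run_1bit(outer_iters):
    history = True                        # predict taken initially
    correct = total = 0
    for _ in range(outer_iters):
        for taken in [True] * 9 + [False]:   # 9 taken, 1 not taken
            correct += (history == taken)
            total += 1
            history = taken               # remember only the last outcome
    return correct / total

print(run_1bit(100))   # close to 0.80, the figure from the slide
```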

  24. Branch prediction; 2 history bits • Example: Outerloop: … R=10 Innerloop: … R=R-1 BNZ R, Innerloop … … Branch Outerloop [Figure: 2-bit state machine with two ‘predict taken’ states and two ‘predict not taken’ states; ‘taken’ and ‘not taken’ outcomes move between neighbouring states] In this application: correct prediction in 90% of branch evaluations
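A common realisation of the 2-bit state machine is a saturating counter; simulating it on the same branch pattern reproduces the 90% figure. A sketch, assuming the counter starts in the strongly-taken state:

```python
# 2-bit saturating counter: states 0..3, predict taken when >= 2. A single
# not-taken outcome only moves strongly-taken (3) to weakly-taken (2), so one
# loop exit per outer iteration no longer flips the next prediction.

def run_2bit(outer_iters):
    counter = 3                           # start strongly taken
    correct = total = 0
    for _ in range(outer_iters):
        for taken in [True] * 9 + [False]:   # 9 taken, 1 not taken
            predict_taken = counter >= 2
            correct += (predict_taken == taken)
            total += 1
            counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return correct / total

print(run_2bit(100))   # 0.9: only the final fall-through mispredicts
```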

  25. Branch prediction; Correlating branch predictors If (aa == 2) aa=0; If (bb == 2) bb=0; If (aa != bb) The results of the first two branches are used in the prediction of the last branch Example: suppose aa == 2 and bb == 2; then the condition of the last ‘if’ is always false => if the previous two branches are not taken, the last branch is taken.

  26. Branch prediction; Correlating branch predictors • Mechanism: suppose the results of the 3 previous branches are used to influence the decision. • 8 possible sequences (br-3 br-2 br-1 => prediction for br, the branch under consideration): NT NT NT => T; NT NT T => NT; …. ; T T T => T • Depending on the outcome of the branch under consideration, the prediction is updated: • 1-bit history: (3,1) predictor For the sequence (NT NT NT) the prediction is that the branch will be taken => fetches from the branch destination

  27. Branch prediction; Correlating branch predictors • Mechanism: suppose the results of the 3 previous branches are used to influence the decision. • 8 possible sequences (br-3 br-2 br-1 => prediction for br, the branch under consideration): NT NT NT => T; NT NT T => NT; …. ; T T T => T • Depending on the outcome of the branch under consideration, the prediction is updated: • 1-bit history: (3,1) predictor • 2-bit history: (3,2) predictor For the sequence (NT NT NT) the prediction is that the branch will be taken => fetches from the branch destination • The (3,2) prediction is represented by 2 bits • 2 combinations indicate: predict taken • 2 combinations indicate: predict not taken • Updated by means of a state machine
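A (3,1) predictor as described above can be sketched as a 3-bit global history that indexes a table of eight 1-bit predictions. The class name and the all-not-taken initial state are assumptions of this sketch:

```python
# (3,1) correlating predictor: the last three branch outcomes (1 bit each)
# select one of 8 single-bit predictions; the selected prediction is updated
# to the actual outcome, just like a plain 1-bit predictor would be.

class Correlating31:
    def __init__(self):
        self.history = 0                 # last 3 outcomes packed into 3 bits
        self.table = [False] * 8         # one 1-bit prediction per pattern

    def predict(self):
        return self.table[self.history]

    def update(self, taken):
        self.table[self.history] = taken             # 1-bit update rule
        self.history = ((self.history << 1) | taken) & 0b111

p = Correlating31()
for outcome in [False, False, False]:    # NT NT NT
    p.update(outcome)
p.update(True)                           # the branch after NT NT NT was taken
print(p.table[0b000])                    # True: pattern NT NT NT now predicts taken
```

A (3,2) predictor would replace each table entry with the 2-bit saturating counter from the previous slide.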

  28. Branch Target Buffer • Solutions: • Delayed Branch • Branch Target Buffer Even with a good prediction, we don’t know where to branch to until here, and we’ve already retrieved the next instruction

  29. Branch Target Buffer [Figure: the Program Counter is matched against a memory (like an instruction cache) holding addresses of branch instructions and the corresponding branch targets; on a hit, the branch target is selected as the next address] After the IF stage, the branch address is already in the PC
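Functionally, the branch target buffer is a lookup from branch PC to predicted target, consulted during fetch. A sketch under stated assumptions (4-byte instructions, hypothetical addresses, invented `next_pc` helper):

```python
# Branch target buffer as a map from the PC of a branch instruction to its
# predicted target, so the next fetch address is known before decode.

btb = {}   # branch PC -> predicted target PC

def next_pc(pc):
    """Fetch address for the cycle after fetching the instruction at `pc`."""
    if pc in btb:
        return btb[pc]          # hit: fetch from the predicted target
    return pc + 4               # miss: fall through to the next instruction

btb[0x100] = 0x80               # a taken branch at 0x100 targeting 0x80
print(hex(next_pc(0x100)))      # 0x80
print(hex(next_pc(0x104)))      # 0x108
```

Branch folding (next slide) stores the instruction at the target instead of the target address, removing the branch from the fetched stream entirely.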

  30. Branch Folding [Figure: as in the branch target buffer, but the memory stores the instructions at the branch targets; on a hit, the instruction at the target is supplied directly] Unconditional branches: effectively removes the branch instruction (penalty of -1)

  31. Return Address Predictors • Indirect branches: branch address only known at run time. • 80% of the time: return instructions. • Small fast stack: [Figure: procedure calls push return addresses; RET instructions pop them]
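The small fast stack can be sketched as a bounded push/pop structure: each call pushes its return address, each RET pops the prediction. The depth and method names are assumptions of this sketch:

```python
# Return-address predictor: a small hardware stack. Nested calls push return
# addresses; a RET is predicted to target the top of the stack. The depth is
# bounded, as in real hardware, so deep recursion evicts old entries.

class ReturnStack:
    def __init__(self, depth=8):
        self.stack, self.depth = [], depth

    def call(self, return_addr):
        if len(self.stack) == self.depth:
            self.stack.pop(0)            # oldest entry lost on overflow
        self.stack.append(return_addr)

    def ret(self):
        return self.stack.pop() if self.stack else None

rs = ReturnStack()
rs.call(0x104)        # outer call returns to 0x104
rs.call(0x204)        # nested call returns to 0x204
print(hex(rs.ret()))  # 0x204: the innermost return is predicted first
print(hex(rs.ret()))  # 0x104
```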

  32. Multiple Issue Processors Goal: issue multiple instructions in a clock cycle • Superscalar: issue a varying number of instructions per clock • Statically scheduled • Dynamically scheduled • VLIW: issue a fixed number of instructions per clock • Statically scheduled

  33. Multiple Issue Processors • Example

  34. Hardware Based Speculation • Multiple Issue Processors => nearly 1 branch every clock cycle • Dynamic scheduling + branch prediction: fetch + issue • Dynamic scheduling + branch speculation: fetch + issue + execution • KEY: do not perform updates that cannot be undone until you’re sure the corresponding operation really should be executed.

  35. Hardware Based Speculation • Tomasulo: [Figure: operations before the issued branch (predict not taken) are finished and have updated the Register File; Operation k is issued beyond the branch] • Operation k: • Operand available • Execution postponed until it is clear whether the branch is taken

  36. Hardware Based Speculation • Tomasulo: [Figure: Operation i finished; the issued branch (predict not taken) is resolved] • Dependent on the outcome of the branch: • Flush reservation stations • Start execution

  37. Hardware Based Speculation • Speculation: [Figure: results of operations before the branch are committed sequentially from the reorder buffer to the Register File; Operation k is issued beyond the branch (predict not taken)] • Operation k: • Operand available and executed

  38. Hardware Based Speculation • Speculation: [Figure: Operation i committed; commit proceeds sequentially from the reorder buffer to the Register File, up to the branch (predict not taken)] • Operation k: • Operand available and executed

  39. Hardware Based Speculation • Speculation: Register File Operation i Commit: sequentially Branch (Predict Not Taken) Reorder Buffer Committed Operation k • Operation k: • Operand available and executed

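The commit discipline shown above can be sketched as a queue that retires finished instructions strictly in program order and squashes everything younger than a mispredicted branch. The entry fields and function names are inventions of this sketch:

```python
# Reorder-buffer sketch: results are held until all older instructions have
# committed; a mispredicted branch flushes every younger (speculative) entry
# instead of letting its results reach the register file.

from collections import deque

rob = deque()        # entries kept in program (issue) order

def issue(name):
    entry = {"name": name, "done": False, "mispredict": False}
    rob.append(entry)
    return entry

def commit():
    """Commit finished instructions from the head; flush on a bad branch."""
    committed = []
    while rob and rob[0]["done"]:
        entry = rob.popleft()
        committed.append(entry["name"])
        if entry["mispredict"]:
            rob.clear()          # squash all younger, speculative entries
            break
    return committed

i1 = issue("op_i"); br = issue("branch"); k = issue("op_k")
k["done"] = True                 # op_k finished early, out of order
print(commit())                  # []: head (op_i) not done, nothing commits
i1["done"] = True; br["done"] = True; br["mispredict"] = True
print(commit())                  # ['op_i', 'branch']; op_k was squashed
```

Because commit is sequential, the executed-but-uncommitted op_k never updates architectural state, which is exactly the "undoable updates" property the earlier slide calls KEY.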
  41. Hardware Based Speculation • Some aspects • Speculated instructions causing a lot of work may turn out not to be needed => restrict allowed actions in speculative mode • The ILP of a program is limited • Realistic branch predictors: easier to implement => less efficient

  42. Pentium Pro Implementation • Pentium Family

  43. Pentium Pro Implementation • I486: CISC => problems with pipelining • 2 observations: • CISC instructions are translated into sequences of microinstructions • Microinstructions are of equal length • Solution: pipeline the microinstructions

  44. Pentium Pro Implementation [Figure: micro-program control flow] • Fetch cycle routine: … jump to Indirect or Execute • Indirect cycle routine: … jump to Execute • Interrupt cycle routine: … jump to Fetch • Execute cycle begin: jump to Op code routine • AND routine: … jump to Fetch or Interrupt • ADD routine: … jump to Fetch or Interrupt Note: each micro-program ends with a branch to the Fetch, Interrupt, Indirect or Execute micro-program

  45. Pentium Pro Implementation

  46. Pentium Pro Implementation • All RISC features are implemented on the execution of microinstructions instead of machine instructions • Microinstruction-level pipeline with dynamically scheduled microoperations • Fetch machine instruction (3 stages) • Decode machine instruction into microinstructions (2 stages) • Issue microinstructions (2 stages; register renaming and reorder-buffer allocation performed here) • Execution of microinstructions (1 stage; floating-point units pipelined, execution takes between 1 and 32 cycles) • Write back (3 stages) • Commit (3 stages) • Superscalar: can issue up to 3 microoperations per clock cycle • Reservation stations (20 of them) and multiple functional units (5 of them) • Reorder buffer (40 entries) and speculation used

  47. Pentium Pro Implementation • Execution units have the following numbers of stages: • Integer ALU: 1 • Integer Load: 3 • Integer Multiply: 4 • FP add: 3 • FP multiply: 5 (partially pipelined; multiplies can start every other cycle) • FP divide: 32 (not pipelined)

  48. Thread-Level Parallelism • ILP: on instruction level • Thread-Level Parallelism: on a higher level • Server applications • Database queries • Thread: has all information (instructions, data, PC, register state, etc.) to allow it to execute • On a separate processor • As a process on a single processor.

  49. Thread-Level Parallelism • Potentially high efficiency • Desktop applications: • Costly to switch to ‘thread-level reprogrammed’ applications. • Thread-level parallelism often hard to find => ILP continues to be the focus for desktop-oriented processors (for embedded processors, the situation is different)
