
Superscalar Organization ECE/CS 752 Fall 2017


Presentation Transcript


  1. Superscalar Organization. ECE/CS 752 Fall 2017. Prof. Mikko H. Lipasti, University of Wisconsin-Madison

  2. CPU, circa 1986 • MIPS R2000, ~“most elegant pipeline ever devised” J. Larus • Enablers: RISC ISA, pipelining, on-chip cache memory Source: https://imgtec.com Mikko Lipasti-University of Wisconsin

  3. Iron Law • Time/Program = Instructions/Program × Cycles/Instruction × Time/Cycle (code size × CPI × cycle time) • Processor Performance = 1 / (Time/Program) • Design chain: Architecture --> Implementation --> Realization, owned by the compiler designer, processor designer, and chip designer respectively Mikko Lipasti -- University of Wisconsin
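As a hypothetical worked example of the Iron Law (numbers chosen only for illustration): a program executing 10^9 dynamic instructions with an average CPI of 1.25 on a 2 ns clock takes 10^9 × 1.25 × 2 ns = 2.5 s; halving CPI at the same code size and cycle time halves the execution time to 1.25 s.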

  4. Limitations of Scalar Pipelines • Scalar upper bound on throughput • IPC <= 1 or CPI >= 1 • Rigid pipeline stall policy • One stalled instruction stalls entire pipeline • Limited hardware parallelism • Only temporal (across pipeline stages) Mikko Lipasti-University of Wisconsin

  5. Superscalar Proposal • Fetch/execute multiple instructions per cycle • Decouple stages so stalls don’t propagate • Exploit instruction-level parallelism (ILP)

  6. Limits on Instruction Level Parallelism (ILP)

  7. High-IPC Processor Evolution • Desktop/Workstation market: 1980s scalar RISC pipeline (MIPS, SPARC, Intel 486) --> early 1990s 2-4 issue in-order (IBM RIOS-I, Intel Pentium) --> mid 1990s limited out-of-order (PowerPC 604, Intel P6) --> 2000s large-ROB out-of-order (DEC Alpha 21264, IBM Power4/5, AMD K8); 1985 – 2005: 20 years, 100x frequency • Mobile market: 2002 scalar RISC pipeline (ARM11) --> 2005 2-4 issue in-order (Cortex A8) --> 2009 limited out-of-order (Cortex A9) --> 2011 large-ROB out-of-order (Cortex A15); 2002 – 2011: 10 years, 10x frequency Mikko Lipasti-University of Wisconsin

  8. What Does a High-IPC CPU Do? • Fetch and decode • Construct data dependence graph (DDG) • Evaluate DDG • Commit changes to program state Source: [Palacharla, Jouppi, Smith, 1996] Mikko Lipasti-University of Wisconsin

  9. A Typical High-IPC Processor 1. Fetch and Decode 2. Construct DDG 3. Evaluate DDG 4. Commit results Mikko Lipasti-University of Wisconsin

  10. Power Consumption Core i7 [Source: Intel] ARM Cortex A15 [Source: NVIDIA] • Actual computation overwhelmed by overhead of aggressive execution pipeline Mikko Lipasti-University of Wisconsin

  11. Limitations of Scalar Pipelines • Scalar upper bound on throughput • IPC <= 1 or CPI >= 1 • Inefficient unified pipeline • Long latency for each instruction • Rigid pipeline stall policy • One stalled instruction stalls all newer instructions

  12. Parallel Pipelines

  13. Intel Pentium Parallel Pipeline

  14. Diversified Pipelines

  15. Power4 Diversified Pipelines (block diagram): PC and I-Cache feed a Fetch Queue with BR Scan and BR Predict; Decode dispatches into FP, FX/LD 1, FX/LD 2, and BR/CR issue queues tracked by a Reorder Buffer; execution units are FP1, FP2, FX1, FX2, LD1, LD2, CR, and BR, with a Store Queue (StQ) and D-Cache behind the load/store units

  16. Rigid Pipeline Stall Policy: backward propagation of stalling; bypassing of a stalled instruction is not allowed

  17. Dynamic Pipelines

  18. Interstage Buffers

  19. Superscalar Pipeline Stages In Program Order Out of Order In Program Order

  20. Limitations of Scalar Pipelines • Scalar upper bound on throughput • IPC <= 1 or CPI >= 1 • Solution: wide (superscalar) pipeline • Inefficient unified pipeline • Long latency for each instruction • Solution: diversified, specialized pipelines • Rigid pipeline stall policy • One stalled instruction stalls all newer instructions • Solution: Out-of-order execution, distributed execution pipelines

  21. High-IPC Processor Mikko Lipasti-University of Wisconsin

  22. Instruction Flow • Objective: fetch multiple instructions per cycle • Figure: PC indexes the instruction cache, but only 3 instructions are fetched • Challenges: branches (unpredictable), misaligned branch targets, instruction cache misses • Solutions: prediction and speculation, high-bandwidth fetch logic, nonblocking cache and prefetching Mikko Lipasti-University of Wisconsin

  23. I-Cache Organization • Figure: the fetch address drives a row decoder into an array of TAG + cache-line entries; in the left organization 1 cache line = 1 physical row, in the right organization 1 cache line = 2 physical rows • SRAM arrays need to be square to minimize delay Mikko Lipasti-University of Wisconsin
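A minimal sketch of how a fetch address splits into tag, line index, and byte offset, and how the same line maps onto one or two physical SRAM rows; the cache geometry here (16 KB, direct-mapped, 64-byte lines) is an assumption for illustration, not a specific machine's parameters.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed geometry: 16 KB direct-mapped I-cache, 64-byte lines
 * -> 256 lines, 6 offset bits, 8 index bits.                          */
#define LINE_BYTES  64u
#define NUM_LINES   256u
#define OFFSET_BITS 6u
#define INDEX_BITS  8u

int main(void) {
    uint32_t pc = 0x00401A48u;                            /* example fetch PC */

    uint32_t offset = pc & (LINE_BYTES - 1);              /* byte within line */
    uint32_t index  = (pc >> OFFSET_BITS) & (NUM_LINES - 1);
    uint32_t tag    = pc >> (OFFSET_BITS + INDEX_BITS);   /* compared vs. TAG */

    uint32_t row_one_to_one = index;          /* 1 cache line = 1 physical row  */
    uint32_t row_first      = index * 2;      /* 1 cache line = 2 physical rows */
    uint32_t row_second     = index * 2 + 1;  /* (narrower, squarer array)      */

    printf("tag=%x index=%u offset=%u\n", tag, index, offset);
    printf("1:1 row=%u, 1:2 rows=%u and %u\n", row_one_to_one, row_first, row_second);
    return 0;
}
```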

  24. Fetch Alignment Mikko Lipasti-University of Wisconsin

  25. IBM RIOS-I Fetch Hardware Mikko Lipasti-University of Wisconsin

  26. Disruption of Instruction Flow Mikko Lipasti-University of Wisconsin

  27. Branch Prediction • Target address generation --> target speculation • Access register: PC, general-purpose register, link register • Perform calculation: +/- offset, autoincrement • Condition resolution --> condition speculation • Access register: condition code register, general-purpose register • Perform calculation: comparison of data register(s) Mikko Lipasti-University of Wisconsin
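To make the two halves concrete, here is a minimal sketch, using a hypothetical PC-relative branch encoding rather than any real ISA, of what must eventually be computed for a conditional branch; until these values are known late in the pipeline, the fetch stage can only speculate on them.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical PC-relative target generation:
 * target = PC + 4 + (sign-extended offset << 2).                      */
uint32_t branch_target(uint32_t pc, int16_t offset) {
    return pc + 4 + ((int32_t)offset << 2);
}

/* Hypothetical condition resolution: branch-if-equal compares two
 * general-purpose registers; this is the value the front end guesses. */
bool branch_taken_beq(uint32_t rs, uint32_t rt) {
    return rs == rt;
}
```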

  28. Target Address Generation Mikko Lipasti-University of Wisconsin

  29. Branch Condition Resolution Mikko Lipasti-University of Wisconsin

  30. Branch Instruction Speculation Mikko Lipasti-University of Wisconsin

  31. Hardware Smith Predictor • Jim E. Smith. A Study of Branch Prediction Strategies. International Symposium on Computer Architecture, pages 135-148, May 1981 • Widely employed: Intel Pentium, PowerPC 604, MIPS R10000, etc. Mikko Lipasti-University of Wisconsin
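A minimal sketch of a Smith-style predictor in the spirit of this slide: a table of 2-bit saturating counters indexed by low-order PC bits. The table size and indexing are illustrative assumptions, not the parameters of any of the machines named above.

```c
#include <stdint.h>
#include <stdbool.h>

#define PHT_ENTRIES 1024u                 /* assumed table size (power of two) */

static uint8_t pht[PHT_ENTRIES];          /* 2-bit counters: 0-1 = not taken,
                                             2-3 = taken                       */

static unsigned pht_index(uint32_t pc) {
    return (pc >> 2) & (PHT_ENTRIES - 1); /* drop byte offset, mask the index  */
}

bool smith_predict(uint32_t pc) {
    return pht[pht_index(pc)] >= 2;       /* MSB of the counter is the guess   */
}

void smith_update(uint32_t pc, bool taken) {
    uint8_t *ctr = &pht[pht_index(pc)];
    if (taken  && *ctr < 3) (*ctr)++;     /* saturate at strongly taken        */
    if (!taken && *ctr > 0) (*ctr)--;     /* saturate at strongly not taken    */
}
```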

  32. Cortex A15: Bi-Mode Predictor • PHT partitioned into T/NT halves • Selector chooses source • Reduces negative interference, since most entries in PHT0 tend towards NT and most entries in PHT1 tend towards T • 15% of A15 core power! Mikko Lipasti-University of Wisconsin
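A minimal sketch of the bi-mode idea described above: two direction PHTs plus a PC-indexed selector, with only the selected PHT trained so that NT-biased and T-biased branches stop interfering. Table sizes and the gshare-style index are assumptions for illustration; the A15's actual configuration is not public at this level of detail.

```c
#include <stdint.h>
#include <stdbool.h>

#define ENTRIES 1024u                     /* assumed sizes                     */

static uint8_t  pht_nt[ENTRIES];          /* PHT0: not-taken-biased half       */
static uint8_t  pht_t [ENTRIES];          /* PHT1: taken-biased half           */
static uint8_t  choice[ENTRIES];          /* selector, indexed by PC only      */
static uint32_t ghist;                    /* global branch history             */

static unsigned hist_idx(uint32_t pc) { return ((pc >> 2) ^ ghist) & (ENTRIES - 1); }
static unsigned pc_idx(uint32_t pc)   { return (pc >> 2) & (ENTRIES - 1); }

static void bump(uint8_t *c, bool up) {
    if (up  && *c < 3) (*c)++;
    if (!up && *c > 0) (*c)--;
}

bool bimode_predict(uint32_t pc) {
    uint8_t *pht = (choice[pc_idx(pc)] >= 2) ? pht_t : pht_nt;
    return pht[hist_idx(pc)] >= 2;
}

void bimode_update(uint32_t pc, bool taken) {
    bool     use_t = choice[pc_idx(pc)] >= 2;
    uint8_t *pht   = use_t ? pht_t : pht_nt;
    bool     pred  = pht[hist_idx(pc)] >= 2;

    /* Train only the selected PHT; this is what limits negative interference. */
    bump(&pht[hist_idx(pc)], taken);

    /* Train the selector with the outcome, except when it picked the "other"
     * bank and the prediction still turned out correct (standard bi-mode rule). */
    if (!(pred == taken && use_t != taken))
        bump(&choice[pc_idx(pc)], taken);

    ghist = (ghist << 1) | (taken ? 1u : 0u);
}
```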

  33. Branch Target Prediction • Does not work well for function/procedure returns • Does not work well for virtual functions, switch statements Mikko Lipasti-University of Wisconsin
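For the return case called out above, the standard remedy is a return address stack (RAS) consulted instead of the BTB when the fetch group contains a predicted return; a minimal sketch, with an assumed depth of 16:

```c
#include <stdint.h>

#define RAS_DEPTH 16u                    /* assumed depth; overflow wraps and
                                            silently discards oldest entries  */

static uint32_t ras[RAS_DEPTH];
static unsigned ras_top;

/* On a predicted call: push the sequential return address. */
void ras_push(uint32_t call_pc) {
    ras[ras_top % RAS_DEPTH] = call_pc + 4;
    ras_top++;
}

/* On a predicted return: pop the predicted target. A BTB entry for the
 * return would only remember the target from the previous caller, which is
 * why returns predict poorly without a RAS. */
uint32_t ras_pop(void) {
    ras_top--;
    return ras[ras_top % RAS_DEPTH];
}
```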

  34. Branch Speculation • Leading Speculation • Done during the Fetch stage • Based on potential branch instruction(s) in the current fetch group • Trailing Confirmation • Done during the Branch Execute stage • Based on the next Branch instruction to finish execution Mikko Lipasti-University of Wisconsin

  35. Branch Speculation • Start new correct path • Must remember the alternate (non-predicted) path • Eliminate incorrect path • Must ensure that the mis-speculated instructions produce no side effects Mikko Lipasti-University of Wisconsin

  36. Mis-speculation Recovery • Start new correct path • Update PC with computed branch target (if predicted NT) • Update PC with sequential instruction address (if predicted T) • Can begin speculation again at next branch • Eliminate incorrect path • Use tag(s) to deallocate resources occupied by speculative instructions • Invalidate all instructions in the decode and dispatch buffers, as well as those in reservation stations Mikko Lipasti-University of Wisconsin

  37. Parallel Decode • Primary Tasks • Identify individual instructions (!) • Determine instruction types • Determine dependences between instructions • Two important factors • Instruction set architecture • Pipeline width Mikko Lipasti-University of Wisconsin

  38. Intel P6 Fetch/Decode Mikko Lipasti-University of Wisconsin

  39. Predecoding in the AMD K5 • Now commonly employed in loop buffers, decoded instruction caches (uop caches)

  40. Dependence Checking • Trailing instructions in fetch group • Check for dependence on leading instructions • Figure: each trailing instruction's source fields (Src0, Src1) are compared (?=) against the destination fields (Dest) of every leading instruction in the group Mikko Lipasti-University of Wisconsin
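A minimal sketch of those comparators in software form, for an assumed 4-wide group and a simplified one-destination, two-source instruction format:

```c
#include <stdbool.h>
#include <stdint.h>

#define GROUP_SIZE 4                      /* assumed fetch/decode width        */

struct uop { uint8_t dest, src0, src1; }; /* simplified instruction format     */

/* dep[i] becomes true if instruction i reads a register written by any
 * earlier (leading) instruction in the same group -- the ?= comparisons. */
void check_group_deps(const struct uop g[GROUP_SIZE], bool dep[GROUP_SIZE]) {
    for (int i = 0; i < GROUP_SIZE; i++) {
        dep[i] = false;
        for (int j = 0; j < i; j++) {
            if (g[i].src0 == g[j].dest || g[i].src1 == g[j].dest)
                dep[i] = true;
        }
    }
}
```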

  41. Summary: Instruction Flow • Fetch group alignment • Target address generation • Branch target buffer • Branch condition prediction • Speculative execution • Tagging/tracking instructions • Recovering from mispredicted branches • Decoding in parallel Mikko Lipasti-University of Wisconsin

  42. High-IPC Processor Mikko Lipasti-University of Wisconsin

  43. Register Data Flow • Parallel pipelines • Centralized instruction fetch • Centralized instruction decode • Diversified execution pipelines • Distributed instruction execution • Data dependence linking • Register renaming to resolve true/false dependences • Issue logic to support out-of-order issue • Reorder buffer to maintain precise state Mikko Lipasti-University of Wisconsin
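A minimal sketch of the renaming step listed above: a map table from architected to physical registers plus a free list. The sizes are assumptions, and real designs also checkpoint the map table so it can be restored on a branch misprediction.

```c
#include <stdint.h>

#define ARCH_REGS 32
#define PHYS_REGS 64                        /* assumed physical register count  */

static uint8_t map_table[ARCH_REGS];        /* arch reg -> current phys reg     */
static uint8_t free_list[PHYS_REGS];
static int     free_count;

/* Rename "rd = rs op rt": sources read the current mapping, and the
 * destination gets a fresh physical register, eliminating WAR/WAW (false)
 * dependences so only true dependences constrain out-of-order issue. */
void rename_one(uint8_t rd, uint8_t rs, uint8_t rt,
                uint8_t *prs, uint8_t *prt, uint8_t *prd) {
    *prs = map_table[rs];
    *prt = map_table[rt];
    *prd = free_list[--free_count];         /* a real machine stalls if empty   */
    map_table[rd] = *prd;
}
```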

  44. Issue Queues and Execution Lanes ARM Cortex A15 Source: theregister.co.uk Mikko Lipasti-University of Wisconsin

  45. Necessity of Instruction Dispatch

  46. Centralized Reservation Station

  47. Distributed Reservation Station

  48. Issues in Instruction Execution • Current trends • More parallelism makes bypassing very challenging • Deeper pipelines • More diversity • Functional unit types • Integer => short vector • Floating point => vector (longer and longer) • Load/store: most difficult to make parallel • Branch • Specialized units (media) • Very wide datapaths (256 bits/register or more)

  49. Bypass Networks • (Background figure: the Power4 pipeline diagram from slide 15) • O(n²) interconnect from/to FU inputs and outputs • Associative tag-match to find operands • Solutions (hurt IPC, help cycle time): use RF only with no bypass network (IBM Power4); decompose into clusters (Alpha 21264)

  50. Specialized units • Intel Pentium 4 staggered adders (“Fireball”) • Run at 2x clock frequency • Two 16-bit bitslices • Dependent ops execute on half-cycle boundaries • Full result not available until a full cycle later
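A minimal sketch of the staggered (bitsliced) addition idea: the low 16 bits and their carry-out are produced first, so a dependent operation on the low half can begin before the high half completes. The structure below only illustrates the dataflow; it makes no claim about actual Pentium 4 circuit timing.

```c
#include <stdint.h>
#include <stdio.h>

/* Staggered 32-bit add built from two 16-bit bitslices. The low slice's
 * sum and carry-out are what a dependent low-half operation would consume
 * one half-cycle later; the high slice finishes a half-cycle after that.  */
struct staggered_add {
    uint16_t lo;        /* available after the first half-cycle   */
    uint16_t hi;        /* available a half-cycle later            */
    unsigned carry_mid; /* carry out of the low slice              */
};

struct staggered_add add32_staggered(uint32_t a, uint32_t b) {
    struct staggered_add r;
    uint32_t lo_sum = (a & 0xFFFFu) + (b & 0xFFFFu);
    r.lo        = (uint16_t)lo_sum;
    r.carry_mid = lo_sum >> 16;
    r.hi        = (uint16_t)((a >> 16) + (b >> 16) + r.carry_mid);
    return r;
}

int main(void) {
    struct staggered_add r = add32_staggered(0x0001FFFFu, 0x00000001u);
    uint32_t full = ((uint32_t)r.hi << 16) | r.lo;   /* expect 0x00020000 */
    printf("lo=%04x hi=%04x full=%08x\n", r.lo, r.hi, full);
    return 0;
}
```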
