
EECS 470



Presentation Transcript


  1. EECS 470 Superscalar Architectures and the Pentium 4 Lecture 12

  2. Optimizing CPU Performance • Golden Rule: t_CPU = N_inst × CPI × t_CLK • Given this, what are our options? • Reduce the number of instructions executed • Reduce the cycles to execute an instruction • Reduce the clock period • Our next focus: Further reducing CPI • Approach: Superscalar execution • Capable of initiating multiple instructions per cycle • Possible to implement for in-order or out-of-order pipelines
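A small worked example of the golden rule may help make the trade-off concrete. The workload size, the CPI values, and the 10% clock-period penalty below are hypothetical numbers chosen for illustration, not figures from the lecture.

```python
def cpu_time(n_inst: int, cpi: float, t_clk_s: float) -> float:
    """Golden rule: t_CPU = N_inst * CPI * t_CLK, in seconds."""
    return n_inst * cpi * t_clk_s


# Hypothetical workload: one billion instructions on a 1.5 GHz clock.
N_INST = 1_000_000_000
T_CLK = 1 / 1.5e9

# Scalar pipeline: assumed CPI of 1.0.
scalar = cpu_time(N_INST, cpi=1.0, t_clk_s=T_CLK)

# Superscalar pipeline: assumed CPI of 0.5, but the extra logic is
# assumed to stretch the clock period by 10%.
superscalar = cpu_time(N_INST, cpi=0.5, t_clk_s=T_CLK * 1.1)

speedup = scalar / superscalar
print(f"scalar {scalar:.3f} s, superscalar {superscalar:.3f} s, speedup {speedup:.2f}x")
```

Under these assumptions the speedup is 1 / (0.5 × 1.1): halving CPI does not halve run time once the slower clock is taken into account.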

  3. Why Superscalar? • Optimization results in more complexity • Longer wires, more logic → higher t_CLK and t_CPU • Architects must strike a balance with reductions in CPI • [Figure panels: Pipelining; Superscalar + Pipelining]

  4. Implications of Superscalar Execution • Instruction fetch? • Taken branches, multiple branches, partial cache lines • Instruction decode? • Simple for fixed length ISA, much harder for variable length • Renaming? • Multi-port RT, inter-inst dependencies must be recognized • Dynamic Scheduling? • Requires multiple result buses, smarter selection logic • Execution? • Multiple functional units, multiple result buses • Commit? • Multiple ROB/ARF ports, dependencies must be recognized

  5. P4 Overview • Latest iA32 processor from Intel • Equipped with the full set of iA32 SIMD operations • First flagship architecture since the P6 microarchitecture • Pentium 4 ISA = Pentium III ISA + SSE2 • SSE2 (Streaming SIMD Extensions 2) provides 128-bit SIMD integer and floating point operations + prefetch

  6. Comparison Between Pentium III and Pentium 4

  7. Execution Pipeline

  8. Front End • Predicts branches • Fetches/decodes code into trace cache • Generates μops for complex instructions • Prefetches instructions that are likely to be executed

  9. Branch Prediction • Dynamically predict the direction and target of branches based on PC using BTB • If no dynamic prediction available, statically predict • Taken for backwards looping branches • Not taken for forward branches • Implemented at decode • Traces built across (predicted) taken branches to avoid taken branch penalties • Also includes a 16-entry return address stack predictor
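The prediction policy on this slide can be sketched roughly as follows. This is a toy model, not the actual Pentium 4 logic: the class names, the dictionary-based BTB, and the drop-oldest overflow policy of the return address stack are all assumptions made for illustration.

```python
from collections import deque
from typing import Deque, Dict, Optional, Tuple


class BranchPredictor:
    """Toy predictor: BTB lookup by PC, static fallback on a BTB miss."""

    def __init__(self) -> None:
        # Hypothetical BTB: branch PC -> (last observed direction, target).
        self.btb: Dict[int, Tuple[bool, int]] = {}

    def predict(self, pc: int, target: int) -> bool:
        """Return True if the branch at `pc` is predicted taken."""
        entry = self.btb.get(pc)
        if entry is not None:
            return entry[0]  # dynamic prediction available
        # Static rule from the slide: backward (loop) branches are
        # predicted taken, forward branches not taken.
        return target < pc

    def update(self, pc: int, taken: bool, target: int) -> None:
        """Record the resolved outcome (sent back from retirement)."""
        self.btb[pc] = (taken, target)


class ReturnAddressStack:
    """16-entry return address stack; overflow policy is an assumption."""

    def __init__(self, entries: int = 16) -> None:
        # deque(maxlen=...) silently drops the oldest entry when full.
        self.stack: Deque[int] = deque(maxlen=entries)

    def on_call(self, return_pc: int) -> None:
        self.stack.append(return_pc)

    def on_return(self) -> Optional[int]:
        return self.stack.pop() if self.stack else None
```

For example, `BranchPredictor().predict(0x100, 0x80)` falls back to the static rule and predicts taken, because the target lies behind the branch.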

  10. Decoder • Single decoder available • Operates at a maximum of 1 instruction per cycle • Receives instructions from L2 cache 64 bits at a time • Some complex instructions must enlist the micro-ROM • Used for very complex iA32 instructions (> 4 μops) • After the microcode ROM finishes, the front-end resumes fetching μops from the Trace Cache

  11. Execution Pipeline

  12. Trace Cache • Primary instruction cache in P4 architecture • Stores 12K decoded μops • On a miss, instructions are fetched from L2 • Trace predictor connects traces • Trace cache removes: • Decode latency after mispredictions • Decode power for all pre-decoded instructions

  13. Branch Hints • P4 software can provide hints to branch prediction and trace cache • Specify the likely direction of a branch • Implemented with conditional branch prefixes • Used for decode-stage predictions and trace building

  14. Execution Pipeline

  15. Execution Pipeline

  16. Execution • 126 μops can be in flight at once • Up to 48 loads / 24 stores • Can dispatch up to 6 μops per cycle • 2x trace cache and retirement μop bandwidth • Provides additional B/W for scheduling misspeculation

  17. Execution Units

  18. Register Renaming

  19. Register Renaming • 8-entry architectural register file • 128-entry physical register file • 2 RATs (Front-end RAT and Retirement RAT) • Retirement RAT eliminates register writes into ARF
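A minimal sketch of how two RATs over one physical register file might interact is shown below. The register counts come from the slide; the class name, the free-list policy, and the initial mapping are assumptions, and misprediction recovery is left out.

```python
from collections import deque
from typing import List, Tuple


class Renamer:
    """Toy renamer with a front-end RAT and a retirement RAT."""

    def __init__(self, n_arch: int = 8, n_phys: int = 128) -> None:
        # Assumed reset state: architectural register i maps to physical i.
        self.frontend_rat: List[int] = list(range(n_arch))
        self.retire_rat: List[int] = list(range(n_arch))
        self.free_list = deque(range(n_arch, n_phys))

    def rename(self, dst: int, srcs: List[int]) -> Tuple[int, List[int]]:
        """Map sources through the front-end RAT, then allocate a new
        physical register for the destination."""
        phys_srcs = [self.frontend_rat[s] for s in srcs]
        if not self.free_list:
            raise RuntimeError("no free physical register: allocation stalls")
        new_phys = self.free_list.popleft()
        self.frontend_rat[dst] = new_phys
        return new_phys, phys_srcs

    def retire(self, dst: int, phys: int) -> None:
        """Commit by repointing the retirement RAT. No data is copied
        into a separate architectural register file."""
        freed = self.retire_rat[dst]
        self.retire_rat[dst] = phys
        self.free_list.append(freed)  # old committed value is now dead


r = Renamer()
p1, s1 = r.rename(dst=0, srcs=[1, 2])   # r0 <- r1 op r2
p2, s2 = r.rename(dst=1, srcs=[0])      # r1 <- f(r0): reads the renamed r0
```

The second instruction reads physical register `p1`, which is how the dependency on the first instruction is carried without touching architectural state.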

  20. Store and Load Scheduling • Out-of-order store and load operations • Stores are always in program order • 48 loads and 24 stores could be in flight • Store/load buffers are allocated at the allocation stage • Total 24 store buffers and 48 load buffers

  21. Execution Pipeline

  22. Retirement • Can retire 3 μops per cycle • Implements precise exceptions • Reorder buffer used to organize completed μops • Also keeps track of branches and sends updated branch information to the BTB
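In-order retirement from a reorder buffer can be sketched as below. The 3-per-cycle width is from the slide; the `RobEntry` type and `retire_cycle` function are hypothetical names, and exception handling and BTB updates are omitted.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List


@dataclass
class RobEntry:
    uop_id: int
    done: bool = False  # set when the micro-op finishes executing


def retire_cycle(rob: Deque[RobEntry], width: int = 3) -> List[int]:
    """Retire up to `width` completed micro-ops from the ROB head.

    Retirement stops at the first unfinished entry, so younger micro-ops
    never commit ahead of older ones (needed for precise exceptions).
    """
    retired: List[int] = []
    while rob and rob[0].done and len(retired) < width:
        retired.append(rob.popleft().uop_id)
    return retired


# Six micro-ops in program order; micro-op 2 has not finished yet.
rob = deque(RobEntry(i, done=(i != 2)) for i in range(6))
first = retire_cycle(rob)    # blocked behind micro-op 2
rob[0].done = True           # micro-op 2 completes
second = retire_cycle(rob)   # limited by the retirement width
```

Here the first cycle retires only two micro-ops even though five are complete, because the unfinished one at the head blocks everything behind it.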

  23. Data Stream of Pentium 4 Processor

  24. On-chip Caches • L1 instruction cache (Trace Cache) • L1 data cache • L2 unified cache • All caches use a pseudo-LRU replacement algorithm • Parameters:

  25. L1 Data Cache • Non-blocking • Supports up to 4 outstanding load misses • Load latency • 2-clock for integer • 6-clock for floating-point • 1 Load and 1 Store per clock • Load speculation • Assume the access will hit the cache • “Replay” the dependent instructions when miss detected

  26. L2 Cache • Non-blocking • Load latency • Net load access latency of 7 cycles • Bandwidth • 1 load and 1 store in one cycle • New cache operations may begin every 2 cycles • 256-bit wide bus between L1 and L2 • 48Gbytes per second @ 1.5GHz
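The quoted bandwidth figure follows from the bus width and the clock rate, assuming one full-width transfer per core clock at 1.5 GHz (the transfer rate is an assumption; the slide gives only the width, the clock, and the result):

```latex
\[
\text{BW}_{\text{L1--L2}}
  = \frac{256\ \text{bits}}{8\ \text{bits/byte}} \times 1.5 \times 10^{9}\ \text{transfers/s}
  = 32\ \text{bytes} \times 1.5 \times 10^{9}\ \text{s}^{-1}
  = 48\ \text{Gbytes/s}
\]
```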

  27. L2 Cache Data Prefetcher • Hardware prefetcher monitors the reference patterns • Brings in cache lines automatically • Attempts to fetch 256 bytes ahead of current access • Prefetches for up to 8 simultaneous independent streams
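One way to picture a stream prefetcher with these parameters is the sketch below. Only the 256-byte lookahead and the 8-stream limit come from the slide. The 64-byte line size, the ascending-only stream detection, and the oldest-first replacement of streams are assumptions, and the real hardware very likely differs.

```python
from typing import List, Optional

LINE = 64         # assumed cache-line size in bytes
AHEAD = 256       # prefetch distance in bytes (from the slide)
MAX_STREAMS = 8   # simultaneous streams (from the slide)


class StreamPrefetcher:
    """Toy sequential-stream detector for ascending addresses."""

    def __init__(self) -> None:
        # Each entry is the next line address a tracked stream expects;
        # the least recently advanced stream sits at index 0.
        self.streams: List[int] = []

    def access(self, addr: int) -> Optional[int]:
        """Observe a demand access; return a line address to prefetch,
        or None if no tracked stream matched."""
        line = addr - addr % LINE
        nxt = line + LINE
        if line in self.streams:
            # The access continues a tracked stream: advance it and
            # request the line AHEAD bytes past the current access.
            self.streams.remove(line)
            self.streams.append(nxt)
            return line + AHEAD
        if nxt not in self.streams:
            # Start tracking a new candidate stream.
            if len(self.streams) == MAX_STREAMS:
                self.streams.pop(0)  # replace the oldest stream
            self.streams.append(nxt)
        return None


p = StreamPrefetcher()
p.access(0x1000)             # first touch: starts a stream, no prefetch
hit = p.access(0x1040)       # next sequential line: stream confirmed
```

After the second access the prefetcher requests the line 256 bytes past `0x1040`, staying a fixed distance ahead of the demand stream.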

  28. System Bus • Delivers data at 3.2 Gbytes/s • 64-bit wide bus • Four data phases per clock cycle (quad pumped) • 100 MHz clocked system bus
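The three bus parameters on this slide determine the quoted bandwidth. A short check, using a hypothetical helper name:

```python
def peak_bandwidth_bytes_per_s(width_bits: int, clock_hz: float,
                               transfers_per_clock: int = 1) -> float:
    """Peak bandwidth = bytes per transfer * clock rate * transfers per clock."""
    return width_bits / 8 * clock_hz * transfers_per_clock


# 64-bit bus, 100 MHz clock, four data phases per clock (quad pumped).
system_bus = peak_bandwidth_bytes_per_s(64, 100e6, transfers_per_clock=4)
print(f"{system_bus / 1e9:.1f} Gbytes/s")
```

This is a peak figure: it assumes every data phase carries useful data, which real traffic will not sustain.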

  29. Execution on MPEG4 Benchmarks @ 1 GHz

  30. Performance Trends • [Figure labels: Real-time speech (10k SPECInt2000); Moore's Law speedup; Performance gap]

  31. Power Trends • [Figure labels: Rocket nozzle; Nuclear reactor; Hot plate; Power gap; Real-time speech (500 mW power)]
