This lecture focuses on optimizing CPU performance using the Superscalar architecture, specifically in the context of the Pentium 4 processor. Key strategies include reducing instruction execution counts, minimizing cycles per instruction, and shortening clock periods. Superscalar execution enables the initiation of multiple instructions per cycle, leading to increased performance while addressing the complexities of instruction fetch, decode, renaming, dynamic scheduling, and execution. Insights will also be provided on Pentium 4's architecture, instruction pipelines, and advanced cache management techniques.
EECS 470 Superscalar Architectures and the Pentium 4 Lecture 12
Optimizing CPU Performance • Golden Rule: tCPU = Ninst × CPI × tCLK • Given this, what are our options? • Reduce the number of instructions executed • Reduce the cycles to execute an instruction • Reduce the clock period • Our next focus: further reducing CPI • Approach: superscalar execution • Capable of initiating multiple instructions per cycle • Possible to implement for in-order or out-of-order pipelines
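A quick worked example of the trade-off (all numbers are hypothetical, chosen only for illustration): a superscalar design can win on tCPU even with a somewhat slower clock, because the CPI reduction dominates.

```latex
% Hypothetical numbers: scalar baseline vs. a 2-wide superscalar design.
\[
t_{CPU} = N_{inst} \cdot CPI \cdot t_{CLK}
\]
\[
t_{CPU}^{scalar} = 10^{9} \cdot 2.0 \cdot 1.0\,\mathrm{ns} = 2.0\,\mathrm{s}
\qquad
t_{CPU}^{superscalar} = 10^{9} \cdot 1.2 \cdot 1.1\,\mathrm{ns} = 1.32\,\mathrm{s}
\]
```

Despite a 10% slower clock, the lower CPI gives roughly a 1.5x speedup, which is exactly the balancing act the next slide describes.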
Why Superscalar? • Optimization results in more complexity • Longer wires, more logic → higher tCLK and tCPU • Architects must strike a balance with reductions in CPI • [Figure: pipelining alone vs. superscalar + pipelining]
Implications of Superscalar Execution • Instruction fetch? • Taken branches, multiple branches, partial cache lines • Instruction decode? • Simple for a fixed-length ISA, much harder for variable length • Renaming? • Multi-ported RAT, and dependencies between instructions renamed in the same cycle must be recognized • Dynamic scheduling? • Requires multiple result buses, smarter selection logic • Execution? • Multiple functional units, multiple result buses • Commit? • Multiple ROB/ARF ports, dependencies must be recognized
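To make the renaming/issue point concrete, here is a minimal sketch, not modeled on any particular machine, of forming 2-wide in-order issue packets: an instruction cannot issue in the same cycle as an older instruction it depends on, so the packet is cut at RAW and WAW hazards.

```python
# Minimal sketch: forming 2-wide in-order issue packets.
# An instruction is (dest, src1, src2) over architectural register names.
WIDTH = 2

def issue_packets(insts, width=WIDTH):
    packets, packet = [], []
    for inst in insts:
        dest, src1, src2 = inst
        dests_in_packet = {d for d, _, _ in packet}
        raw = src1 in dests_in_packet or src2 in dests_in_packet
        waw = dest in dests_in_packet
        if len(packet) == width or raw or waw:
            packets.append(packet)       # cut the packet at a hazard
            packet = []
        packet.append(inst)
    if packet:
        packets.append(packet)
    return packets

prog = [("r1", "r2", "r3"),   # r1 = r2 op r3
        ("r4", "r1", "r5"),   # RAW on r1 -> new packet
        ("r6", "r2", "r2"),
        ("r7", "r6", "r1")]   # packet full, and RAW on r6 -> new packet
print(issue_packets(prog))
```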
P4 Overview • Latest iA32 processor from Intel • Equipped with the full set of iA32 SIMD operations • First new flagship microarchitecture since P6 • Pentium 4 ISA = Pentium III ISA + SSE2 • SSE2 (Streaming SIMD Extensions 2) provides 128-bit SIMD integer and floating-point operations + prefetch
Front End • Predicts branches • Fetches/decodes code into the trace cache • Generates µops for complex instructions • Prefetches instructions that are likely to be executed
Branch Prediction • Dynamically predicts the direction and target of branches based on the PC, using the BTB • If no dynamic prediction is available, predicts statically • Taken for backward (looping) branches • Not taken for forward branches • Static prediction is implemented at decode • Traces are built across (predicted) taken branches to avoid taken-branch penalties • Also includes a 16-entry return address stack predictor
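The static fallback is simple enough to sketch directly. The sketch assumes direction can be judged by comparing the branch target to the branch PC; backward branches are predicted taken because they usually close loops.

```python
# Sketch of static backward-taken / forward-not-taken prediction.
def static_predict(pc: int, target: int) -> bool:
    """Return True if the branch is predicted taken."""
    return target <= pc   # backward branch: likely a loop back-edge

assert static_predict(pc=0x1000, target=0x0F80)      # loop back-edge: taken
assert not static_predict(pc=0x1000, target=0x1040)  # forward: not taken
```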
Decoder • A single decoder is available • Operates at a maximum of 1 instruction per cycle • Receives instructions from the L2 cache 64 bits at a time • Some complex instructions must enlist the micro-ROM • Used for very complex iA32 instructions (> 4 µops) • After the microcode ROM finishes, the front end resumes fetching µops from the trace cache
Trace Cache • Primary instruction cache in the P4 architecture • Stores 12k decoded µops • On a miss, instructions are fetched from L2 • A trace predictor connects traces • The trace cache removes • Decode latency after mispredictions • Decode power for all pre-decoded instructions
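A minimal sketch of the trace-cache idea, with hypothetical decode() and next_pc() helpers standing in for the real decoder and branch predictor (they are not P4 internals): hits bypass the decoder entirely, and on a miss a new trace is built along the predicted path.

```python
# Illustrative trace-cache lookup; sizes and helpers are invented.
trace_cache = {}          # start_pc -> list of already-decoded uops
MAX_TRACE_UOPS = 6        # assumed small trace limit for the sketch

def fetch_uops(pc, decode, next_pc):
    """Return decoded uops starting at pc, building a trace on a miss."""
    if pc in trace_cache:                 # hit: no decode latency or power
        return trace_cache[pc]
    trace, cur = [], pc
    while len(trace) < MAX_TRACE_UOPS:    # miss: decode from L2
        trace.extend(decode(cur))
        cur = next_pc(cur)                # may follow a predicted-taken branch
    trace_cache[pc] = trace
    return trace

demo = fetch_uops(0x400, decode=lambda pc: [f"uop@{pc:#x}"],
                  next_pc=lambda pc: pc + 4)
print(len(demo), 0x400 in trace_cache)    # 6 True
```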
Branch Hints • P4 software can provide hints to the branch predictor and the trace cache • Specify the likely direction of a branch • Implemented with conditional branch prefixes • Used for decode-stage predictions and trace building
Execution • 126 µops can be in flight at once • Up to 48 loads / 24 stores • Can dispatch up to 6 µops per cycle • 2x the trace cache and retirement µop bandwidth • Provides additional bandwidth for scheduling around misspeculation
Register Renaming • 8-entry architectural register file • 128-entry physical register file • 2 RATs (front-end RAT and retirement RAT) • The retirement RAT eliminates register writes into the ARF
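A minimal renaming sketch using a front-end RAT and a free list. The 8 architectural and 128 physical registers match the slide, but the code is a generic textbook scheme, not the exact P4 mechanism.

```python
# Generic register renaming with a RAT and a free list (illustrative).
NUM_ARCH, NUM_PHYS = 8, 128
rat = {f"r{i}": i for i in range(NUM_ARCH)}      # arch reg -> phys reg
free_list = list(range(NUM_ARCH, NUM_PHYS))      # unallocated phys regs

def rename(dest, src1, src2):
    """Rename one instruction; returns (phys dest, phys src1, phys src2)."""
    p_src1, p_src2 = rat[src1], rat[src2]        # read sources first
    p_dest = free_list.pop(0)                    # allocate a fresh phys reg
    rat[dest] = p_dest                           # later readers see new map
    return p_dest, p_src1, p_src2

print(rename("r1", "r2", "r3"))   # (8, 2, 3)
print(rename("r4", "r1", "r1"))   # r1 now reads the renamed copy: (9, 8, 8)
```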
Store and Load Scheduling • Out-of-order store and load operations • Stores always commit to memory in program order • 48 loads and 24 stores can be in flight • Store/load buffers are allocated at the allocation stage • 24 store buffers and 48 load buffers in total
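A toy model of that discipline: a load may execute early but must first check older buffered stores for a matching address (store-to-load forwarding), while stores drain to memory strictly in program order. Word-granularity address matching is assumed to keep the sketch short.

```python
# Store buffering with store-to-load forwarding (illustrative).
from collections import deque

store_buffer = deque()        # (addr, data), oldest first, program order
memory = {}

def do_store(addr, data):
    store_buffer.append((addr, data))     # buffered, not yet in memory

def do_load(addr):
    for a, d in reversed(store_buffer):   # youngest matching store wins
        if a == addr:
            return d
    return memory.get(addr, 0)

def commit_store():
    addr, data = store_buffer.popleft()   # always the oldest: program order
    memory[addr] = data

do_store(0x100, 42)
assert do_load(0x100) == 42               # forwarded from the store buffer
commit_store()
assert memory[0x100] == 42
```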
Retirement • Can retire 3 µops per cycle • Implements precise exceptions • A reorder buffer is used to organize completed µops • Also keeps track of branches and sends updated branch information to the BTB
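A sketch of in-order retirement from a ROB, with the 3-per-cycle width taken from the slide. Retiring strictly from the oldest entry is what makes exceptions precise: an excepting µop is never retired, and everything younger can simply be flushed.

```python
# In-order retirement sketch; the ROB entry format is invented.
from collections import deque

RETIRE_WIDTH = 3
rob = deque()    # oldest first; entries: {"done": bool, "exception": bool}

def retire_cycle():
    retired = 0
    while rob and retired < RETIRE_WIDTH:
        head = rob[0]
        if not head["done"]:
            break                 # oldest uop unfinished: stall retirement
        if head.get("exception"):
            rob.clear()           # flush younger uops, then handle the trap
            break
        rob.popleft()             # retire in program order
        retired += 1
    return retired

rob.extend([{"done": True}, {"done": True}, {"done": False}])
print(retire_cycle())             # 2: stops at the unfinished uop
```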
On-chip Caches • L1 instruction cache (trace cache) • L1 data cache • L2 unified cache • All caches use a pseudo-LRU replacement algorithm • [Cache parameter table not reproduced]
L1 Data Cache • Non-blocking • Supports up to 4 outstanding load misses • Load latency • 2 clocks for integer • 6 clocks for floating point • 1 load and 1 store per clock • Load speculation • Assumes the access will hit the cache • "Replays" the dependent instructions when a miss is detected
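A toy model of that hit-speculation policy: dependents are woken as if the load hits, and re-executed if it turns out to have missed. The latencies are abstracted away here, and the cache contents and replay machinery are invented for illustration.

```python
# Load-hit speculation with replay (illustrative).
L1 = {0x40: 7}                       # tiny "cache": addr -> value

def load_and_wake(addr, dependents, miss_value):
    """Run dependents speculatively; replay them if the load missed."""
    speculative = L1.get(addr)       # scheduler assumed this would hit
    results = [dep(speculative) for dep in dependents]    # woken early
    if speculative is None:          # miss detected: results are bogus
        L1[addr] = miss_value        # line arrives (e.g., from L2)
        results = [dep(miss_value) for dep in dependents] # replay
    return results

double = lambda v: None if v is None else 2 * v
print(load_and_wake(0x40, [double], miss_value=0))   # hit: [14]
print(load_and_wake(0x80, [double], miss_value=5))   # miss -> replay: [10]
```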
L2 Cache • Non-blocking • Load latency • Net load access latency of 7 cycles • Bandwidth • 1 load and 1 store in one cycle • New cache operations may begin every 2 cycles • 256-bit wide bus between L1 and L2 • 48 GB/s @ 1.5 GHz
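The quoted bandwidth is consistent with the bus width and core clock, assuming the 256-bit bus moves data on every cycle (e.g., a 64-byte transfer occupying the 2 cycles between operation starts):

```latex
\[
\frac{256\ \text{bits}}{8\ \text{bits/byte}} \times 1.5\times10^{9}\ \tfrac{\text{cycles}}{\text{s}}
  = 32\ \tfrac{\text{B}}{\text{cycle}} \times 1.5\ \text{GHz}
  = 48\ \tfrac{\text{GB}}{\text{s}}
\]
```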
L2 Cache Data Prefetcher • A hardware prefetcher monitors reference patterns • Brings in cache lines automatically • Attempts to fetch 256 bytes ahead of the current access • Prefetches for up to 8 simultaneous independent streams
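A sketch of a simple sequential stream prefetcher in that spirit. The 256-byte distance and 8-stream limit come from the slide; the detection logic and the 64-byte line size are assumptions of the sketch.

```python
# Sequential stream prefetcher sketch (detection logic is invented).
LINE = 64              # assumed line size for the sketch
AHEAD = 256            # prefetch distance from the slide
MAX_STREAMS = 8        # stream limit from the slide

streams = {}           # stream id -> next expected miss address

def on_miss(addr, prefetch):
    for sid, expected in streams.items():
        if addr == expected:                  # sequential pattern continues
            streams[sid] = addr + LINE
            prefetch(addr + AHEAD)            # stay 256 B ahead
            return
    if len(streams) < MAX_STREAMS:            # start tracking a new stream
        streams[len(streams)] = addr + LINE

issued = []
for a in (0x1000, 0x1040, 0x1080):
    on_miss(a, issued.append)
print([hex(a) for a in issued])    # ['0x1140', '0x1180']
```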
System Bus • Delivers data at 3.2 GB/s • 64-bit wide bus • Four data phases per clock cycle (quad-pumped) • 100 MHz system bus clock
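The headline rate follows directly from the bus parameters above:

```latex
\[
8\ \tfrac{\text{B}}{\text{transfer}} \times 4\ \tfrac{\text{transfers}}{\text{clock}}
  \times 100\times10^{6}\ \tfrac{\text{clocks}}{\text{s}} = 3.2\ \tfrac{\text{GB}}{\text{s}}
\]
```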
Performance Trends • [Figure: speedup over time, comparing the Moore's Law curve against the roughly 10k SPECInt2000 needed for real-time speech; the shortfall is labeled the performance gap]
Power Trends • [Figure: power over time, with processor power density approaching hot plate, nuclear reactor, and rocket nozzle levels, against a roughly 500 mW budget for real-time speech; the difference is labeled the power gap]