This lecture focuses on optimizing CPU performance using the Superscalar architecture, specifically in the context of the Pentium 4 processor. Key strategies include reducing instruction execution counts, minimizing cycles per instruction, and shortening clock periods. Superscalar execution enables the initiation of multiple instructions per cycle, leading to increased performance while addressing the complexities of instruction fetch, decode, renaming, dynamic scheduling, and execution. Insights will also be provided on Pentium 4's architecture, instruction pipelines, and advanced cache management techniques.
EECS 470 Superscalar Architectures and the Pentium 4 Lecture 12
Optimizing CPU Performance • Golden Rule: tCPU = Ninst × CPI × tCLK • Given this, what are our options? • Reduce the number of instructions executed • Reduce the cycles to execute an instruction • Reduce the clock period • Our next focus: further reducing CPI • Approach: superscalar execution • Capable of initiating multiple instructions per cycle • Possible to implement for in-order or out-of-order pipelines
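A quick worked example of the trade-off (all numbers are hypothetical, chosen only for illustration): a superscalar design can win on tCPU even with a somewhat slower clock, because the CPI reduction dominates.

```latex
% Hypothetical numbers: scalar baseline vs. a 2-wide superscalar design.
\[
t_{CPU} = N_{inst} \cdot CPI \cdot t_{CLK}
\]
\[
t_{CPU}^{scalar} = 10^{9} \cdot 2.0 \cdot 1.0\,\mathrm{ns} = 2.0\,\mathrm{s}
\qquad
t_{CPU}^{superscalar} = 10^{9} \cdot 1.2 \cdot 1.1\,\mathrm{ns} = 1.32\,\mathrm{s}
\]
```

Despite a 10% slower clock, the lower CPI gives roughly a 1.5x speedup, which is exactly the balancing act the next slide describes.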
Why Superscalar? • Optimization results in more complexity • Longer wires, more logic → higher tCLK and tCPU • Architects must strike a balance with reductions in CPI • [Figure: pipelining alone vs. superscalar + pipelining]
Implications of Superscalar Execution • Instruction fetch? • Taken branches, multiple branches, partial cache lines • Instruction decode? • Simple for a fixed-length ISA, much harder for variable length • Renaming? • Multi-ported RAT, and dependencies between instructions renamed in the same cycle must be recognized • Dynamic scheduling? • Requires multiple result buses, smarter selection logic • Execution? • Multiple functional units, multiple result buses • Commit? • Multiple ROB/ARF ports, dependencies must be recognized
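To make the renaming/issue point concrete, here is a minimal sketch, not modeled on any particular machine, of forming 2-wide in-order issue packets: an instruction cannot issue in the same cycle as an older instruction it depends on, so the packet is cut at RAW and WAW hazards.

```python
# Minimal sketch: forming 2-wide in-order issue packets.
# An instruction is (dest, src1, src2) over architectural register names.
WIDTH = 2

def issue_packets(insts, width=WIDTH):
    packets, packet = [], []
    for inst in insts:
        dest, src1, src2 = inst
        dests_in_packet = {d for d, _, _ in packet}
        raw = src1 in dests_in_packet or src2 in dests_in_packet
        waw = dest in dests_in_packet
        if len(packet) == width or raw or waw:
            packets.append(packet)       # cut the packet at a hazard
            packet = []
        packet.append(inst)
    if packet:
        packets.append(packet)
    return packets

prog = [("r1", "r2", "r3"),   # r1 = r2 op r3
        ("r4", "r1", "r5"),   # RAW on r1 -> new packet
        ("r6", "r2", "r2"),
        ("r7", "r6", "r1")]   # packet full, and RAW on r6 -> new packet
print(issue_packets(prog))
```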
P4 Overview • Latest iA32 processor from Intel • Equipped with the full set of iA32 SIMD operations • First new flagship microarchitecture since P6 • Pentium 4 ISA = Pentium III ISA + SSE2 • SSE2 (Streaming SIMD Extensions 2) provides 128-bit SIMD integer and floating-point operations + prefetch
Front End • Predicts branches • Fetches/decodes code into the trace cache • Generates µops for complex instructions • Prefetches instructions that are likely to be executed
Branch Prediction • Dynamically predicts the direction and target of branches based on the PC, using the BTB • If no dynamic prediction is available, predicts statically • Taken for backward (looping) branches • Not taken for forward branches • Static prediction is implemented at decode • Traces are built across (predicted) taken branches to avoid taken-branch penalties • Also includes a 16-entry return address stack predictor
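The static fallback is simple enough to sketch directly. The sketch assumes direction can be judged by comparing the branch target to the branch PC; backward branches are predicted taken because they usually close loops.

```python
# Sketch of static backward-taken / forward-not-taken prediction.
def static_predict(pc: int, target: int) -> bool:
    """Return True if the branch is predicted taken."""
    return target <= pc   # backward branch: likely a loop back-edge

assert static_predict(pc=0x1000, target=0x0F80)      # loop back-edge: taken
assert not static_predict(pc=0x1000, target=0x1040)  # forward: not taken
```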
Decoder • A single decoder is available • Operates at a maximum of 1 instruction per cycle • Receives instructions from the L2 cache 64 bits at a time • Some complex instructions must enlist the micro-ROM • Used for very complex iA32 instructions (> 4 µops) • After the microcode ROM finishes, the front end resumes fetching µops from the trace cache
Trace Cache • Primary instruction cache in the P4 architecture • Stores 12k decoded µops • On a miss, instructions are fetched from L2 • A trace predictor connects traces • The trace cache removes • Decode latency after mispredictions • Decode power for all pre-decoded instructions
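A minimal sketch of the trace-cache idea, with hypothetical decode() and next_pc() helpers standing in for the real decoder and branch predictor (they are not P4 internals): hits bypass the decoder entirely, and on a miss a new trace is built along the predicted path.

```python
# Illustrative trace-cache lookup; sizes and helpers are invented.
trace_cache = {}          # start_pc -> list of already-decoded uops
MAX_TRACE_UOPS = 6        # assumed small trace limit for the sketch

def fetch_uops(pc, decode, next_pc):
    """Return decoded uops starting at pc, building a trace on a miss."""
    if pc in trace_cache:                 # hit: no decode latency or power
        return trace_cache[pc]
    trace, cur = [], pc
    while len(trace) < MAX_TRACE_UOPS:    # miss: decode from L2
        trace.extend(decode(cur))
        cur = next_pc(cur)                # may follow a predicted-taken branch
    trace_cache[pc] = trace
    return trace

demo = fetch_uops(0x400, decode=lambda pc: [f"uop@{pc:#x}"],
                  next_pc=lambda pc: pc + 4)
print(len(demo), 0x400 in trace_cache)    # 6 True
```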
Branch Hints • P4 software can provide hints to the branch predictor and the trace cache • Specify the likely direction of a branch • Implemented with conditional branch prefixes • Used for decode-stage predictions and trace building
Execution • 126 µops can be in flight at once • Up to 48 loads / 24 stores • Can dispatch up to 6 µops per cycle • 2x the trace cache and retirement µop bandwidth • Provides additional bandwidth for scheduling around misspeculation
Register Renaming • 8-entry architectural register file • 128-entry physical register file • 2 RATs (front-end RAT and retirement RAT) • The retirement RAT eliminates register writes into the ARF
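A minimal renaming sketch using a front-end RAT and a free list. The 8 architectural and 128 physical registers match the slide, but the code is a generic textbook scheme, not the exact P4 mechanism.

```python
# Generic register renaming with a RAT and a free list (illustrative).
NUM_ARCH, NUM_PHYS = 8, 128
rat = {f"r{i}": i for i in range(NUM_ARCH)}      # arch reg -> phys reg
free_list = list(range(NUM_ARCH, NUM_PHYS))      # unallocated phys regs

def rename(dest, src1, src2):
    """Rename one instruction; returns (phys dest, phys src1, phys src2)."""
    p_src1, p_src2 = rat[src1], rat[src2]        # read sources first
    p_dest = free_list.pop(0)                    # allocate a fresh phys reg
    rat[dest] = p_dest                           # later readers see new map
    return p_dest, p_src1, p_src2

print(rename("r1", "r2", "r3"))   # (8, 2, 3)
print(rename("r4", "r1", "r1"))   # r1 now reads the renamed copy: (9, 8, 8)
```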
Store and Load Scheduling • Out-of-order store and load operations • Stores always commit to memory in program order • 48 loads and 24 stores can be in flight • Store/load buffers are allocated at the allocation stage • 24 store buffers and 48 load buffers in total
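A toy model of that discipline: a load may execute early but must first check older buffered stores for a matching address (store-to-load forwarding), while stores drain to memory strictly in program order. Word-granularity address matching is assumed to keep the sketch short.

```python
# Store buffering with store-to-load forwarding (illustrative).
from collections import deque

store_buffer = deque()        # (addr, data), oldest first, program order
memory = {}

def do_store(addr, data):
    store_buffer.append((addr, data))     # buffered, not yet in memory

def do_load(addr):
    for a, d in reversed(store_buffer):   # youngest matching store wins
        if a == addr:
            return d
    return memory.get(addr, 0)

def commit_store():
    addr, data = store_buffer.popleft()   # always the oldest: program order
    memory[addr] = data

do_store(0x100, 42)
assert do_load(0x100) == 42               # forwarded from the store buffer
commit_store()
assert memory[0x100] == 42
```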
Retirement • Can retire 3 µops per cycle • Implements precise exceptions • A reorder buffer is used to organize completed µops • Also keeps track of branches and sends updated branch information to the BTB
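A sketch of in-order retirement from a ROB, with the 3-per-cycle width taken from the slide. Retiring strictly from the oldest entry is what makes exceptions precise: an excepting µop is never retired, and everything younger can simply be flushed.

```python
# In-order retirement sketch; the ROB entry format is invented.
from collections import deque

RETIRE_WIDTH = 3
rob = deque()    # oldest first; entries: {"done": bool, "exception": bool}

def retire_cycle():
    retired = 0
    while rob and retired < RETIRE_WIDTH:
        head = rob[0]
        if not head["done"]:
            break                 # oldest uop unfinished: stall retirement
        if head.get("exception"):
            rob.clear()           # flush younger uops, then handle the trap
            break
        rob.popleft()             # retire in program order
        retired += 1
    return retired

rob.extend([{"done": True}, {"done": True}, {"done": False}])
print(retire_cycle())             # 2: stops at the unfinished uop
```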
On-chip Caches • L1 instruction cache (trace cache) • L1 data cache • L2 unified cache • All caches use a pseudo-LRU replacement algorithm • [Cache parameter table not reproduced]
L1 Data Cache • Non-blocking • Supports up to 4 outstanding load misses • Load latency • 2 clocks for integer • 6 clocks for floating point • 1 load and 1 store per clock • Load speculation • Assumes the access will hit the cache • "Replays" the dependent instructions when a miss is detected
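A toy model of that hit-speculation policy: dependents are woken as if the load hits, and re-executed if it turns out to have missed. The latencies are abstracted away here, and the cache contents and replay machinery are invented for illustration.

```python
# Load-hit speculation with replay (illustrative).
L1 = {0x40: 7}                       # tiny "cache": addr -> value

def load_and_wake(addr, dependents, miss_value):
    """Run dependents speculatively; replay them if the load missed."""
    speculative = L1.get(addr)       # scheduler assumed this would hit
    results = [dep(speculative) for dep in dependents]    # woken early
    if speculative is None:          # miss detected: results are bogus
        L1[addr] = miss_value        # line arrives (e.g., from L2)
        results = [dep(miss_value) for dep in dependents] # replay
    return results

double = lambda v: None if v is None else 2 * v
print(load_and_wake(0x40, [double], miss_value=0))   # hit: [14]
print(load_and_wake(0x80, [double], miss_value=5))   # miss -> replay: [10]
```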
L2 Cache • Non-blocking • Load latency • Net load access latency of 7 cycles • Bandwidth • 1 load and 1 store in one cycle • New cache operations may begin every 2 cycles • 256-bit wide bus between L1 and L2 • 48 GB/s @ 1.5 GHz
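The quoted bandwidth is consistent with the bus width and core clock, assuming the 256-bit bus moves data on every cycle (e.g., a 64-byte transfer occupying the 2 cycles between operation starts):

```latex
\[
\frac{256\ \text{bits}}{8\ \text{bits/byte}} \times 1.5\times10^{9}\ \tfrac{\text{cycles}}{\text{s}}
  = 32\ \tfrac{\text{B}}{\text{cycle}} \times 1.5\ \text{GHz}
  = 48\ \tfrac{\text{GB}}{\text{s}}
\]
```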
L2 Cache Data Prefetcher • A hardware prefetcher monitors reference patterns • Brings in cache lines automatically • Attempts to fetch 256 bytes ahead of the current access • Prefetches for up to 8 simultaneous independent streams
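A sketch of a simple sequential stream prefetcher in that spirit. The 256-byte distance and 8-stream limit come from the slide; the detection logic and the 64-byte line size are assumptions of the sketch.

```python
# Sequential stream prefetcher sketch (detection logic is invented).
LINE = 64              # assumed line size for the sketch
AHEAD = 256            # prefetch distance from the slide
MAX_STREAMS = 8        # stream limit from the slide

streams = {}           # stream id -> next expected miss address

def on_miss(addr, prefetch):
    for sid, expected in streams.items():
        if addr == expected:                  # sequential pattern continues
            streams[sid] = addr + LINE
            prefetch(addr + AHEAD)            # stay 256 B ahead
            return
    if len(streams) < MAX_STREAMS:            # start tracking a new stream
        streams[len(streams)] = addr + LINE

issued = []
for a in (0x1000, 0x1040, 0x1080):
    on_miss(a, issued.append)
print([hex(a) for a in issued])    # ['0x1140', '0x1180']
```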
System Bus • Delivers data at 3.2 GB/s • 64-bit wide bus • Four data phases per clock cycle (quad-pumped) • 100 MHz system bus clock
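The headline rate follows directly from the bus parameters above:

```latex
\[
8\ \tfrac{\text{B}}{\text{transfer}} \times 4\ \tfrac{\text{transfers}}{\text{clock}}
  \times 100\times10^{6}\ \tfrac{\text{clocks}}{\text{s}} = 3.2\ \tfrac{\text{GB}}{\text{s}}
\]
```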
Performance Trends • [Figure: speedup over time, comparing the Moore's Law curve against the roughly 10k SPECInt2000 needed for real-time speech; the shortfall is labeled the performance gap]
Power Trends • [Figure: power over time, with processor power density approaching hot plate, nuclear reactor, and rocket nozzle levels, against a roughly 500 mW budget for real-time speech; the difference is labeled the power gap]