Design decisions common to modern processors

A Critical Look At IA-64
Massive Resources, Massive ILP, But Can It Deliver?
Martin Hopkins, IBM Research, 2/7/00
Sampoorani, Sivakumar and Joshua

Design decisions common to modern processors • Pipelining • Micro Ops • Large ROB • Single path execution • Dynamic scheduling

At what cost? • Accurate Branch Prediction • Dependency Checking • Register Renaming • Alias Detection Hardware

Performance of IA-64 • Execution time = Cycle Time × IC × CPI • No improvement reported in frequency • Possible reasons? • Reducing CPI at the cost of cycle time • Compares and branches in same cycle • Predicated execution => more FUs => more complexity + longer wires => limit on frequency => more power
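The execution-time product on this slide can be worked through numerically. The sketch below is only an illustration: the function name and every number in it are hypothetical, chosen to show how a lower CPI can be partly cancelled by a longer cycle and a larger instruction count. They are not measurements of any processor.

```python
def execution_time(cycle_time_ns, instruction_count, cpi):
    """Execution time in ns = cycle time x instruction count (IC) x CPI."""
    return cycle_time_ns * instruction_count * cpi

# Hypothetical baseline machine.
baseline = execution_time(cycle_time_ns=1.0, instruction_count=1_000_000, cpi=1.0)

# Hypothetical wide machine: CPI is halved, but the cycle is 1.25x longer
# and the dynamic path length (IC) is 1.5x larger.
wide = execution_time(cycle_time_ns=1.25, instruction_count=1_500_000, cpi=0.5)

print(baseline, wide, wide / baseline)
```

With these made-up inputs the halved CPI yields only a small net gain, because the other two factors grow at the same time.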

Dynamic Path Length (IC) Longer than other architectures Reasons? • Speculation • Check operations and recovery code • Predication • No sign extended loads • No integer multiply or divide

Dynamic Path Length (IC) • Loads and Stores – Only post-execution update of base register
    ldsz.ldtype.ldhint r1 = [r3]          no base update form
    ldsz.ldtype.ldhint r1 = [r3], r2      register base update
    ldsz.ldtype.ldhint r1 = [r3], imm     immediate base update
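The post-update behaviour of the three load forms can be sketched in a few lines. This is a toy model, not IA-64 semantics in full: `load_post_update`, the `memory` dictionary and the register names are assumptions made for illustration. The point is that the address used is always the old base value, and the base register changes only after the access.

```python
def load_post_update(memory, regs, dest, base, update=0):
    """Load from [base], then add `update` to the base register.

    update=0 models the no-base-update form; passing a register's value or a
    constant models the register and immediate base update forms.
    """
    regs[dest] = memory[regs[base]]   # address is the base BEFORE the update
    regs[base] += update              # base is modified after the access

# Hypothetical memory holding three 8-byte elements.
memory = {0: 10, 8: 20, 16: 30}
regs = {"r1": 0, "r3": 0}

total = 0
for _ in range(3):
    load_post_update(memory, regs, dest="r1", base="r3", update=8)
    total += regs["r1"]

print(total, regs["r3"])
```

Because only the post-update form exists, an access at base plus displacement needs a separate add beforehand, which is one of the path-length costs the slide lists.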

CPI Cache Effects • Larger code footprint • 128 bit bundle - 3 instructions • Restrictions on placing instructions • Branch target - beginning of bundle • Recovery code • Pollutes I-Cache and/or triggers page faults • Speculative loads - Pollute D-cache
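The bundle format's contribution to code footprint is simple arithmetic. The sketch below assumes a fixed-length 32-bit RISC encoding as the comparison point; that baseline and the function names are assumptions for illustration, not figures from the slides.

```python
BUNDLE_BITS = 128
SLOTS_PER_BUNDLE = 3
RISC_INSTRUCTION_BITS = 32  # assumption: fixed-length 32-bit RISC encoding

def bits_per_useful_instruction(useful_per_bundle):
    """Bundle bits divided by the number of non-nop instructions it carries."""
    return BUNDLE_BITS / useful_per_bundle

def growth_vs_risc(useful_per_bundle):
    """Code-size ratio relative to the assumed 32-bit RISC encoding."""
    return bits_per_useful_instruction(useful_per_bundle) / RISC_INSTRUCTION_BITS

full_bundle_growth = growth_vs_risc(SLOTS_PER_BUNDLE)
print(round(full_bundle_growth, 3))
```

Even a completely filled bundle spends about a third more bits per instruction than the assumed RISC encoding. Placement restrictions and branch-target alignment lower the number of useful instructions per bundle, which raises the ratio further.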

Stalls possible • Example
    load ra =
    load rb = ;;          // end of bundle
    add rx = ra
    load ry = [rb] ;;
If load ra causes a cache miss, stall. Superscalar out-of-order processors can execute non-dependent instructions in parallel with the cache miss.
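The stall in this example can be made concrete with a small timing model. Everything here is an idealized sketch: the latencies (100 cycles for the miss, 2 for a hit, 1 for the add) are hypothetical, the two leading loads are given no source operands because the slide elides them, and the out-of-order model assumes unlimited issue width and window size.

```python
from dataclasses import dataclass

@dataclass
class Op:
    name: str
    dest: str
    srcs: tuple
    latency: int
    group: int  # index of the issue group (instructions up to a ";;")

def in_order_start_times(ops):
    """A whole group waits until every source of every op in it is ready."""
    ready, start, cycle = {}, {}, 0
    for g in sorted({op.group for op in ops}):
        group_ops = [op for op in ops if op.group == g]
        needed = [ready.get(s, 0) for op in group_ops for s in op.srcs]
        cycle = max([cycle] + needed)
        for op in group_ops:
            start[op.name] = cycle
            ready[op.dest] = cycle + op.latency
        cycle += 1
    return start

def out_of_order_start_times(ops):
    """Each op starts as soon as its own sources are ready (idealized)."""
    ready, start = {}, {}
    for op in ops:
        t = max([0] + [ready.get(s, 0) for s in op.srcs])
        start[op.name] = t
        ready[op.dest] = t + op.latency
    return start

MISS, HIT = 100, 2  # hypothetical latencies in cycles
program = [
    Op("load_ra", "ra", (), MISS, group=0),   # this load misses the cache
    Op("load_rb", "rb", (), HIT, group=0),
    Op("add_rx", "rx", ("ra",), 1, group=1),
    Op("load_ry", "ry", ("rb",), HIT, group=1),
]

in_order = in_order_start_times(program)
out_of_order = out_of_order_start_times(program)
print(in_order["load_ry"], out_of_order["load_ry"])
```

In the in-order model the independent `load_ry` waits behind the miss because it shares a group with `add_rx`. In the idealized out-of-order model it starts as soon as `rb` is available.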

Comparing Complexities • Support for speculative execution • Superscalar processors • reorder buffer • register renaming hardware • EPIC • need to expose parallelism, speculation • hardware just does what the compiler says

IA-64: Exposing Speculative Execution • Control speculation (moving loads above branches) • Data speculation (moving loads above stores)

Control Speculation • Hardware for deferring exceptions exposed to software • NaT (Not a Thing or poison bits) • set NaT bit associated with a register on exception • perform an explicit check before using the register • Increase in machine state • 2 NaT registers • instructions to modify, test, and retrieve NaT values
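The deferred-exception mechanism can be sketched as a value paired with a poison bit. This is a simplified model with hypothetical names (`Reg`, `MEMORY`, `speculative_load`, `check`): a dictionary stands in for memory, a missing key stands in for a faulting address, and the NaT bit is assumed to propagate through dependent arithmetic.

```python
from dataclasses import dataclass

@dataclass
class Reg:
    value: int = 0
    nat: bool = False   # NaT bit: set when a speculative load deferred a fault

MEMORY = {0x1000: 7}    # hypothetical memory; absent addresses "fault"

def speculative_load(address):
    """Speculative load: defer the exception by setting NaT instead of faulting."""
    if address not in MEMORY:
        return Reg(nat=True)
    return Reg(MEMORY[address])

def non_speculative_load(address):
    """Ordinary load: a bad address raises KeyError, standing in for the fault."""
    return Reg(MEMORY[address])

def add(a, b):
    """The NaT bit propagates through dependent computation."""
    return Reg(a.value + b.value, a.nat or b.nat)

def check(reg, recovery):
    """Explicit check before use: run recovery code only if NaT is set."""
    return recovery() if reg.nat else reg

good = check(add(speculative_load(0x1000), Reg(1)), recovery=lambda: None)
poisoned = add(speculative_load(0x2000), Reg(1))
print(good.value, poisoned.nat)
```

In this sketch the recovery path would redo the work with `non_speculative_load`, so the exception is raised only if the program really reaches the use of the value.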

Data Speculation • Explicit memory-alias-detection table • ALAT (Advanced Load Address table) • loads place their entries in ALAT • stores remove the entry if addresses match • Hardware cost: • ALAT is 32 entry, 2 way set associative • recovery code requires that operands be maintained (until the store is seen the operands have to be maintained) • increased register requirements (128 Int + 128 FP)
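The load and store behaviour of the ALAT described above can be sketched as a small table keyed by target register. This is a deliberately simplified model: the class and method names are hypothetical, addresses match only when exactly equal, and the table is fully associative with oldest-first eviction rather than 2-way set associative.

```python
class ALAT:
    """Toy model of an advanced load address table (simplified, see lead-in)."""

    def __init__(self, capacity=32):
        self.capacity = capacity
        self.entries = {}   # target register -> address it was loaded from

    def advanced_load(self, reg, address):
        """A load hoisted above stores records its address in the table."""
        self.entries.pop(reg, None)
        if len(self.entries) >= self.capacity:
            del self.entries[next(iter(self.entries))]  # evict oldest entry
        self.entries[reg] = address

    def store(self, address):
        """A store removes every entry whose address matches."""
        for reg in [r for r, a in self.entries.items() if a == address]:
            del self.entries[reg]

    def check(self, reg):
        """True if the entry survived, so the speculated value is still valid."""
        return reg in self.entries

alat = ALAT()
alat.advanced_load("r1", 0x100)
alat.store(0x200)                 # different address: entry survives
no_conflict = alat.check("r1")

alat.advanced_load("r2", 0x300)
alat.store(0x300)                 # same address: entry is removed
conflict_detected = not alat.check("r2")
print(no_conflict, conflict_detected)
```

When the check fails, recovery code has to redo the load and everything that depended on it, which is why the operands must be kept alive until the store has been seen.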

Data Speculation Hardware Costs • Increased register pressure implies • more state to be saved across functions • to avoid this: • Register stacking (SPARC register windows) • (0-31) global registers, others dynamically mapped • CFM (Current Frame Marker) • Register Stack engine • Should also handle stack overflows • Additional complexity due to rotating registers

Hardware Costs • Superscalar: reorder buffer; register rename mechanism • EPIC: NaT bits and associated instructions; ALAT; increased number of registers; Register Stack Engine; additional complexities due to rotating registers, page faults, …

Runtime Information • Information about behavior of programs • Can’t be predicted at compile time • Profiling helps • But costly… • Superscalar machines • Dynamic selection of instructions to execute • Rely upon information known at run time

EPIC • Depends mostly on compiler • Run time information is not used so much • Consider the following code sequence
    cmp p1, p2 = ..                          /* set predicate registers */
    (p1) br.cond low_probability_path ;;     /* if (p1) goto ... */
    ld ra = [rb] ;;
    add rc = ra, rd ;;
    use of (rc)
4 bundles, load not hoisted over a branch (which is not usually taken)

As Scheduled by IA-64 Compiler • Optimize for the most probable path
    ld.s ra = [rb] ;;
    add rc = ra, rd
    cmp p1, p2 = ...
    (p1) br.cond low_probability_path ;;
    check.s rc, recovery_code
    use of (rc)
• 3 bundles

When Low Probability Path Is Taken • Superscalar processor: executes the load as early as possible; cancels it if found to be mis-speculated; changes assumptions dynamically • EPIC: the load has to complete, since the dependent add is in the next bundle; may take 100s of cycles if the pointer is random • Heavy penalty if the compiler gets the probabilities wrong

Dependence on Profiling • RISC and CISC find profiling useful, but not essential • IA-64 is much more dependent on profiling • Difficulties involved with profiling • Additional responsibility for programmer • Creating a representative test suite • Using in demanding, diverse development environments

Code Bloat • RISC instructions: 50 • 3 instructions per 128 bits: 33 • Avg of 2 instructions per bundle: 33 • Branch target at beginning of bundle: 10 • Check ops • Recovery code: 20 • No base+disp addressing: 15 • No sign-extended loads • Predication • Optimizations: 30 • IA-64 code should be 4.8 times x86 code

Some things that may reduce code size • Post-increment loads can eliminate an add in a loop • e.g. accessing an array in strides • Combining a compare and a logical op • r1 + r2 + 1 • Rotating register files for s/w pipelining • All the above amount to <5% difference, so net code bloat is about 4 times (excluding optimization overhead). • Code bloat => more memory b/w requirement.

Performance comparison • 800 MHz Itanium • SPECint: <68% Alpha 21264 (1 GHz) (20% less power), <60% P4 (2 GHz) • SPECfp: >20% Alpha 21264, >8% P4 • Power – a major hurdle

Conclusion • The IA-64 gamble – power is not going to be a critical limitation in future. • This allows use of massive resources
