

  1. Runahead Execution: A Power-Efficient Alternative to Large Instruction Windows for Tolerating Long Main Memory Latencies. Onur Mutlu, EE 382N Guest Lecture

  2. Outline • Motivation • Runahead Execution • Evaluation • Limitations of the Baseline Runahead Mechanism • Overcoming the Limitations • Efficient Runahead Execution • Address-Value Delta (AVD) Prediction • Summary

  3. Motivation • Memory latency is very long in today’s processors (hundreds of processor clock cycles) • CDC 6600: 10 cycles [Thornton, 1970] • Alpha 21264: 120+ cycles [Wilkes, 2001] • Intel Pentium 4: 300+ cycles [Sprangle & Carmean, 2002] • And, it continues to increase (in terms of processor cycles) • DRAM latency is not reducing as fast as processor cycle time • Existing techniques to tolerate the memory latency do not work well enough.

  4. Existing Latency Tolerance Techniques • Caching [Wilkes, 1965] • Widely used, simple, effective, but not enough, inefficient, passive • Not all applications/phases exhibit temporal or spatial locality • Cache miss penalty is very high • Prefetching [IBM 360/91, 1967] • Works well for regular memory access patterns • Prefetching irregular access patterns is difficult, inaccurate, and hardware-intensive • Multithreading [CDC 6600, 1964] • Does not improve the performance of a single thread • Out-of-order execution [Tomasulo, 1967] • Tolerates cache misses that cannot be prefetched • Requires extensive hardware resources for tolerating long latencies

  5. Out-of-order Execution • Instructions are executed out of sequential program order to tolerate latency. • Execution of an older long-latency instruction is overlapped with the execution of younger independent instructions. • Instructions are retired in program order to support precise exceptions/interrupts. • Not-yet-retired instructions and their results are buffered in hardware structures, called the instruction window: • Scheduling window (reservation stations) • Register files • Load and store buffers • Reorder buffer

  6. Small Windows: Full-window Stalls • When a long-latency instruction is not complete, it blocks retirement. • Incoming instructions fill the instruction window while retirement is blocked. • Once the window is full, the processor cannot place new instructions into it. This is called a full-window stall. • A full-window stall prevents the processor from making progress in the execution of the program.

  7. Small Windows: Full-window Stalls [example: 8-entry instruction window] Oldest instruction: LOAD R1 ← mem[A]. L2 Miss! Takes 100s of cycles. Younger instructions (e.g., BEQ R1, R0, target) fill the window; those independent of the L2 miss are executed out of program order, but cannot be retired. Still-younger instructions cannot be executed because there is no space in the instruction window. The processor stalls until the L2 miss is serviced. • L2 cache misses are responsible for most full-window stalls.

  8. Impact of L2 Cache Misses [figure] 500-cycle DRAM latency, 4.25 GB/s DRAM bandwidth, aggressive stream-based prefetcher. Data averaged over 147 memory-intensive benchmarks on a high-end x86 processor model.

  9. The Problem • Out-of-order execution requires large instruction windows to tolerate today’s main memory latencies. • Even in the presence of caches and prefetchers • As main memory latency (in terms of processor cycles) increases, instruction window size should also increase to fully tolerate the memory latency. • Building a large instruction window is not an easy task if we would like to achieve • Low power/energy consumption • Short cycle time • Low design and verification complexity

  10. Outline • Motivation • Runahead Execution • Evaluation • Limitations of the Baseline Runahead Mechanism • Overcoming the Limitations • Efficient Runahead Execution • Address-Value Delta (AVD) Prediction • Summary

  11. Overview of Runahead Execution [HPCA’03] • A technique to obtain the memory-level parallelism benefits of a large instruction window (without having to build it!) • When the oldest instruction is an L2 miss: • Checkpoint architectural state and enter runahead mode • In runahead mode: • Instructions are speculatively pre-executed • The purpose of pre-execution is to discover other L2 misses • L2-miss dependent instructions are marked INV and dropped • Runahead mode ends when the original L2 miss returns • Checkpoint is restored and normal execution resumes
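
The control flow of the mechanism can be summarized in a short sketch. The following C code is illustrative only; the types and function names are invented for exposition, not taken from the paper:

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical architectural state; the actual checkpoint contents
       are listed on slide 15. */
    typedef struct { uint64_t regs[16]; uint64_t pc; } arch_state_t;

    static arch_state_t checkpoint;   /* saved on runahead entry */
    static bool in_runahead = false;

    /* Oldest instruction in the window is an L2-miss load: enter runahead. */
    void enter_runahead(const arch_state_t *now, uint64_t miss_load_pc) {
        checkpoint = *now;            /* checkpoint architectural state */
        checkpoint.pc = miss_load_pc; /* remember where to restart fetch */
        in_runahead = true;           /* pre-execute speculatively; INV-mark
                                         L2-miss dependent results (not shown) */
    }

    /* The runahead-causing miss is serviced: resume normal execution. */
    void exit_runahead(arch_state_t *now) {
        *now = checkpoint;            /* architecturally, nothing happened */
        in_runahead = false;          /* pipeline flush and INV reset not shown */
    }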

  12. Runahead Example [timeline figure] • Perfect caches: Compute, Load 1 Hit, Compute, Load 2 Hit, Compute • Small window: Compute, Load 1 Miss, stall for Miss 1, Compute, Load 2 Miss, stall for Miss 2 • Runahead: Compute, Load 1 Miss, runahead during Miss 1 (Load 2 Miss is discovered, so Miss 2 overlaps with Miss 1), then Compute with Load 1 Hit and Load 2 Hit. Saved cycles relative to the small window.

  13. Benefits of Runahead Execution Instead of stalling during an L2 cache miss: • Pre-executed loads and stores independent of L2-miss instructions generate very accurate data prefetches: • From main memory to L2 • From L2 to L1 • For both regular and irregular access patterns • Instructions on the predicted program path are prefetched into the instruction/trace cache and L2. • Hardware prefetcher and branch predictor tables are trained using future access information.

  14. Runahead Execution Mechanism • Entry into runahead mode • Instruction processing in runahead mode • INV bits and pseudo-retirement • Handling of store/load instructions (runahead cache) • Handling of branch instructions • Exit from runahead mode • Modifications to pipeline

  15. Entry into Runahead Mode When an L2-miss load instruction is the oldest in the instruction window: • Processor checkpoints the architectural register state (correctness), the branch history register (performance), and the return address stack (performance) • Processor records the address of the L2-miss load • Processor enters runahead mode • The L2-miss load marks its destination register as INV (invalid) and is removed from the instruction window
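
A minimal sketch of what such a checkpoint might contain, following the bullets above (all names and sizes are hypothetical):

    #include <stdint.h>

    #define NUM_ARCH_REGS 16   /* illustrative */
    #define RAS_DEPTH     16   /* illustrative */

    struct runahead_checkpoint {
        uint64_t arch_regs[NUM_ARCH_REGS]; /* architectural registers: correctness */
        uint64_t branch_history;           /* branch history register: performance */
        uint64_t ras[RAS_DEPTH];           /* return address stack: performance */
        uint32_t ras_top;
        uint64_t miss_load_pc;             /* address of the runahead-causing load */
    };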

  16. Processing in Runahead Mode Runahead mode processing is the same as normal instruction processing, EXCEPT: • It is purely speculative (no architectural state updates) • L2-miss dependent instructions are identified and treated specially: they are quickly removed from the instruction window, and their values are not trusted

  17. Processing in Runahead Mode • Two types of results are produced: INV (invalid) and VALID • INV = dependent on an L2 miss • The first INV result is produced by the L2-miss load that caused entry into runahead mode • An instruction produces an INV result if it sources an INV result, or if it misses in the L2 cache (a prefetch request is generated) • INV results are marked using INV bits in the register file, store buffer, and runahead cache • INV bits prevent the introduction of bogus data into the pipeline: bogus values are not used for prefetching or branch resolution
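
The INV propagation rule in the bullets above can be expressed compactly. A toy model in C, assuming one INV bit per physical register (register count and interfaces are invented):

    #include <stdbool.h>

    #define NUM_PHYS_REGS 128
    static bool inv_bit[NUM_PHYS_REGS];     /* one INV bit per register entry */

    /* An instruction's result is INV if it sources an INV value or if it is
       itself an L2-miss load (for which a prefetch request is still sent). */
    bool result_is_inv(const int *src_regs, int num_srcs, bool is_l2_miss_load) {
        if (is_l2_miss_load)
            return true;
        for (int i = 0; i < num_srcs; i++)
            if (inv_bit[src_regs[i]])
                return true;                /* INV propagates through dependences */
        return false;
    }

    void writeback(int dest_reg, bool inv) {
        inv_bit[dest_reg] = inv;            /* bogus values never feed prefetching
                                               or branch resolution */
    }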

  18. Pseudo-retirement in Runahead Mode • An instruction is examined for pseudo-retirement when it becomes the oldest in the instruction window • An INV instruction is removed from the window immediately • A VALID instruction is removed when it completes execution; it updates only microarchitectural state • Architectural (software-visible) register/memory state is NOT updated in runahead mode • Pseudo-retired instructions free their allocated resources, which allows the processing of later instructions • Pseudo-retired stores communicate their data and INV status to dependent loads

  19. Store/Load Handling in Runahead Mode • A pseudo-retired store writes its data and INV status to a dedicated scratchpad memory, called the runahead cache • Purpose: data communication through memory in runahead mode • A runahead load accesses the store buffer, runahead cache, and L1 data cache in parallel • If a runahead load is dependent on a pseudo-retired runahead store, it reads its data and INV status from the runahead cache • This communication does not always need to be correct, so the runahead cache can be very small (512 bytes)
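
The lookup priority for a runahead load can be sketched as follows; the probe functions are hypothetical stand-ins for the three hardware structures (shown as sequential calls, though the hardware probes them in parallel):

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct { bool hit; bool inv; uint64_t data; } probe_t;

    /* Stub probes, for illustration only. */
    static probe_t store_buffer_probe(uint64_t a)   { (void)a; return (probe_t){0}; }
    static probe_t runahead_cache_probe(uint64_t a) { (void)a; return (probe_t){0}; }
    static probe_t l1_dcache_probe(uint64_t a)      { (void)a; return (probe_t){0}; }

    probe_t runahead_load(uint64_t addr) {
        probe_t r = store_buffer_probe(addr);   /* in-flight older stores first */
        if (r.hit) return r;
        r = runahead_cache_probe(addr);         /* then pseudo-retired stores */
        if (r.hit) return r;
        return l1_dcache_probe(addr);           /* then L1; an L2 miss marks the
                                                   load INV and prefetches */
    }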

  20. Branch Handling in Runahead Mode • Branch instructions are predicted in runahead mode, as in normal-mode execution • Runahead branches use the same predictor as normal branches • VALID branches are resolved and initiate recovery if mispredicted • VALID branches update the branch predictor tables: this early training of the predictor reduces mispredictions in normal mode • INV branches cannot be resolved. A mispredicted INV branch causes the processor to stay on the wrong program path until the end of runahead execution

  21. Exit from Runahead Mode When the runahead-causing L2 miss is serviced: • All instructions in the machine are flushed • INV bits are reset; the runahead cache is flushed • The processor restores the architectural state as it was before the runahead-causing instruction was fetched. Architecturally, NOTHING happened; but hopefully, useful prefetch requests were generated (caches were warmed up) • The processor starts fetch beginning with the runahead-causing load (which is re-fetched and re-executed), and the mode is switched to normal mode • Instructions executed in runahead mode are re-executed in normal mode

  22. Modifications to Pipeline [pipeline block diagram; the labeled additions over the baseline are the checkpointed state, INV bits (in the RATs, register files, and store buffer), and the runahead cache with its access queue] < 0.05% area overhead

  23. Outline • Motivation • Runahead Execution • Evaluation • Limitations of the Baseline Runahead Mechanism • Overcoming the Limitations • Efficient Runahead Execution • Address-Value Delta (AVD) Prediction • Summary

  24. Baseline Processor • 3-wide fetch, 29-stage pipeline x86 processor • 128-entry instruction window • 48-entry load buffer, 32-entry store buffer • 512 KB, 8-way, 16-cycle unified L2 cache • Approximately 500-cycle L2 miss latency • Bandwidth, contention, conflicts modeled in detail • Aggressive streaming data prefetcher (16 streams) • Next-two-lines instruction prefetcher

  25. Evaluated Benchmarks • 147 Intel x86 benchmarks simulated for 30 million instructions • Benchmark Suites • SPEC CPU 95 (S95): 10 benchmarks, mostly FP • SPEC FP 2000 (FP00): 11 benchmarks • SPEC INT 2000 (INT00): 6 benchmarks • Internet (WEB): 18 benchmarks: SpecJbb, Webmark2001 • Multimedia (MM): 9 benchmarks: mpeg, speech rec., quake • Productivity (PROD): 17 benchmarks: Sysmark2k, winstone • Server (SERV): 2 benchmarks: TPC-C, timesten • Workstation (WS): 7 benchmarks: CAD, nastran, verilog

  26. Performance of Runahead Execution [figure]

  27. Runahead vs. Large Windows [figure]

  28. Runahead vs. Large Windows (Alpha) [figure]

  29. Outline • Motivation • Runahead Execution • Evaluation • Limitations of the Baseline Runahead Mechanism • Overcoming the Limitations • Efficient Runahead Execution • Address-Value Delta (AVD) Prediction • Summary

  30. Limitations of the Baseline Runahead Mechanism • Energy inefficiency: many instructions are speculatively executed, sometimes without providing any performance benefit; on average, 27% extra instructions for 22% IPC improvement → Efficient Runahead Execution [ISCA’05, IEEE Micro Top Picks’06] • Ineffectiveness for pointer-intensive applications: runahead cannot parallelize dependent L2 cache misses → Address-Value Delta (AVD) Prediction [MICRO’05] • Irresolvable branch mispredictions in runahead mode: cannot recover from a mispredicted L2-miss dependent branch → Wrong Path Events [MICRO’04]

  31. Outline • Motivation • Runahead Execution • Evaluation • Limitations of the Baseline Runahead Mechanism • Overcoming the Limitations • Efficient Runahead Execution • Address-Value Delta (AVD) Prediction • Summary

  32. Causes of Energy Inefficiency • Short runahead periods • Periods that last for 10s of cycles instead of 100s • Overlapping runahead periods • Two periods that execute the same dynamic instructions (e.g. due to dependent L2 cache misses) • Useless runahead periods • Periods that do not result in the discovery of L2 cache misses

  33. Efficient Runahead Execution • Key idea: Predict if a runahead period is going to be short, overlapping, or useless. If so, do not enter runahead mode. • Simple hardware and software (compile-time profiling) techniques are effective predictors. • Details in our papers. • Extra executed instructions decrease from 27% to 6% • Performance improvement remains the same (22%)
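
As one concrete illustration, useless-period prediction can be approximated with a table of saturating counters indexed by the PC of the runahead-causing load. This is a minimal sketch of the idea, not the exact mechanism evaluated in the ISCA'05 paper; the table size and threshold are invented:

    #include <stdbool.h>
    #include <stdint.h>

    #define TABLE_SIZE 256
    static uint8_t useless_count[TABLE_SIZE];   /* 2-bit saturating counters */

    /* Suppress runahead entry for loads whose past periods were useless. */
    bool should_enter_runahead(uint64_t load_pc) {
        return useless_count[load_pc % TABLE_SIZE] < 3;
    }

    void on_runahead_exit(uint64_t load_pc, int l2_misses_discovered) {
        uint8_t *c = &useless_count[load_pc % TABLE_SIZE];
        if (l2_misses_discovered == 0) {
            if (*c < 3) (*c)++;                 /* period discovered nothing */
        } else {
            *c = 0;                             /* period generated useful prefetches */
        }
    }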

  34. Outline • Motivation • Runahead Execution • Evaluation • Limitations of the Baseline Runahead Mechanism • Overcoming the Limitations • Efficient Runahead Execution • Address-Value Delta (AVD) Prediction • Summary

  35. The Problem: Dependent Cache Misses [timeline figure] Runahead: Load 2 is dependent on Load 1, so runahead cannot compute its address; Load 2 is INV and Miss 2 is not prefetched • Runahead execution cannot parallelize dependent misses • This limitation results in wasted opportunity to improve performance and wasted energy (useless pre-execution) • Runahead performance would improve by 25% if this limitation were ideally overcome

  36. The Goal of AVD Prediction • Enable the parallelization of dependent L2 cache misses in runahead mode with a low-cost mechanism • How: Predict the values of L2-miss address (pointer) loads • Address load: loads an address into its destination register, which is later used to calculate the address of another load (as opposed to a data load, whose result is consumed as data; see the example below)
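
The distinction in the last bullet is easy to see in code. In this small, illustrative list traversal, the pointer load is an address load and the value load is a data load:

    #include <stddef.h>

    struct node { struct node *next; int value; };

    int sum_list(const struct node *n) {
        int sum = 0;
        while (n != NULL) {
            sum += n->value;   /* data load: the loaded value is consumed as data */
            n = n->next;       /* address load: the loaded value becomes the
                                  address of the next load (the dependent-miss chain) */
        }
        return sum;
    }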

  37. Parallelizing Dependent Cache Misses [timeline figure] • Without AVD prediction: Load 1 Miss starts runahead; Load 2 is INV because its address cannot be computed, so Miss 2 is not prefetched. In normal mode, Load 1 Hits, then Load 2 Misses and stalls for Miss 2 • With AVD prediction: Load 1's value is predicted, so Load 2's address can be computed and Load 2 Misses in runahead mode; Miss 2 overlaps with Miss 1. In normal mode, both Load 1 and Load 2 Hit. Saved cycles and saved speculative instructions

  38. AVD Prediction • The address-value delta (AVD) of a load instruction is defined as: AVD = Effective Address of Load − Data Value of Load • For some address loads, the AVD is stable • An AVD predictor keeps track of the AVDs of address loads • When a load misses in L2 in runahead mode, the AVD predictor is consulted • If the predictor returns a stable (confident) AVD for that load, the value of the load is predicted: Predicted Value = Effective Address − Predicted AVD

  39. Identifying Address Loads in Hardware • If the AVD is too large, the value being loaded is likely NOT an address • Only predict loads that have satisfied: -MaxAVD < AVD < +MaxAVD • This identification mechanism eliminates almost all data loads from consideration • Enables the AVD predictor to be small

  40. An Implementable AVD Predictor • Set-associative prediction table • Prediction table entry consists of • Tag (PC of the load) • Last AVD seen for the load • Confidence counter for the recorded AVD • Updated when an address load is retired in normal mode • Accessed when a load misses in L2 cache in runahead mode • Recovery-free: No need to recover the state of the processor or the predictor on misprediction • Runahead mode is purely speculative
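
Putting slides 38-40 together, the predictor might look like the following C sketch. It uses a direct-mapped table for brevity where the paper describes a set-associative one; sizes, field widths, and names are illustrative:

    #include <stdbool.h>
    #include <stdint.h>

    #define AVD_ENTRIES 16
    #define MAX_AVD     (64 * 1024)  /* |AVD| >= MaxAVD: probably not an address */
    #define CONF_MAX    2            /* confidence needed before predicting */

    typedef struct {
        uint64_t tag;        /* PC of the load */
        int64_t  last_avd;   /* last AVD seen for this load */
        uint8_t  conf;       /* confidence counter for the recorded AVD */
        bool     valid;
    } avd_entry_t;

    static avd_entry_t table[AVD_ENTRIES];

    /* Update: called when an address load retires in normal mode. */
    void avd_update(uint64_t pc, uint64_t eff_addr, uint64_t data_value) {
        int64_t avd = (int64_t)(eff_addr - data_value);
        if (avd <= -MAX_AVD || avd >= MAX_AVD)
            return;                              /* filters out data loads */
        avd_entry_t *e = &table[pc % AVD_ENTRIES];
        if (e->valid && e->tag == pc && e->last_avd == avd) {
            if (e->conf < CONF_MAX) e->conf++;   /* stable AVD: gain confidence */
        } else {
            e->tag = pc; e->last_avd = avd; e->conf = 0; e->valid = true;
        }
    }

    /* Access: called when a load misses in L2 during runahead mode. */
    bool avd_predict(uint64_t pc, uint64_t eff_addr, uint64_t *pred_value) {
        avd_entry_t *e = &table[pc % AVD_ENTRIES];
        if (e->valid && e->tag == pc && e->conf == CONF_MAX) {
            *pred_value = eff_addr - (uint64_t)e->last_avd;
            return true;
        }
        return false;   /* mispredictions need no recovery: runahead is speculative */
    }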

  41. Why Do Stable AVDs Occur? • Regularity in the way data structures are allocated in memory AND traversed • Two types of loads can have stable AVDs • Traversal address loads: produce addresses consumed by address loads • Leaf address loads: produce addresses consumed by data loads

  42. Traversal Address Loads Regularly-allocated linked list (nodes at A, A+k, A+2k, ...): a traversal address load loads the pointer to the next node: node = node->next
AVD = Effective Addr − Data Value:
    Effective Addr   Data Value   AVD
    A                A+k          -k
    A+k              A+2k         -k
    A+2k             A+3k         -k
    A+3k             A+4k         -k
    A+4k             A+5k         -k
Striding data values, stable AVD.
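
A runnable demonstration of this pattern: nodes allocated as contiguous, fixed-size chunks (here an array standing in for a pool allocator) make the AVD of the node = node->next load a constant -k:

    #include <stdint.h>
    #include <stdio.h>

    struct node { struct node *next; int payload; };

    int main(void) {
        struct node pool[6];                 /* contiguous, fixed-size chunks */
        for (int i = 0; i < 5; i++) pool[i].next = &pool[i + 1];
        pool[5].next = NULL;

        for (struct node *n = pool; n->next != NULL; n = n->next) {
            intptr_t eff_addr = (intptr_t)&n->next;  /* address the load accesses */
            intptr_t value    = (intptr_t)n->next;   /* pointer value it returns */
            /* prints -sizeof(struct node), i.e., -k, on every iteration */
            printf("AVD = %ld\n", (long)(eff_addr - value));
        }
        return 0;
    }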

  43. Properties of Traversal-based AVDs [figure: after sorting, the distance between consecutive nodes is NOT constant] • Stable AVDs can be captured with a stride value predictor • Stable AVDs disappear with the re-organization of the data structure (e.g., sorting) • Stability of AVDs is dependent on the behavior of the memory allocator • Allocation of contiguous, fixed-size chunks is useful

  44. Leaf Address Loads Sorted dictionary in parser: nodes point to strings (words); string and node are allocated consecutively. The dictionary is looked up for an input word. A leaf address load loads the pointer to the string of each node:
    lookup (node, input) {
      // ...
      ptr_str = node->string;
      m = check_match(ptr_str, input);
      if (m >= 0) lookup(node->right, input);
      if (m < 0)  lookup(node->left, input);
    }
[binary tree figure: each node at X+k points to its string at X]
AVD = Effective Addr − Data Value:
    Effective Addr   Data Value   AVD
    A+k              A            k
    C+k              C            k
    F+k              F            k
No stride in the data values, but a stable AVD.
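
The "string and node allocated consecutively" pattern can be reproduced with a small, illustrative allocator that places each node a fixed k bytes after its string, so the AVD of the node->string load is always k:

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    struct tnode { char *string; struct tnode *left, *right; };

    enum { K = 32 };   /* string storage size; hypothetical */

    static struct tnode *make_node(const char *word) {
        /* one chunk: K bytes of string storage followed by the node itself */
        char *chunk = malloc(K + sizeof(struct tnode));
        strncpy(chunk, word, K - 1);
        chunk[K - 1] = '\0';
        struct tnode *n = (struct tnode *)(chunk + K);
        n->string = chunk;               /* always node address minus K */
        n->left = n->right = NULL;
        return n;
    }

    int main(void) {
        const char *words[] = { "alpha", "beta", "gamma" };
        for (int i = 0; i < 3; i++) {
            struct tnode *n = make_node(words[i]);
            /* AVD of the leaf load ptr_str = node->string: always K */
            printf("AVD = %ld\n",
                   (long)((intptr_t)&n->string - (intptr_t)n->string));
        }
        return 0;
    }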

  45. Properties of Leaf-based AVDs [figure: after sorting, the distance between each node and its string is still constant] • Stable AVDs cannot be captured with a stride value predictor • Stable AVDs do not disappear with the re-organization of the data structure (e.g., sorting) • Stability of AVDs is dependent on the behavior of the memory allocator

  46. Performance of AVD Prediction [figure: 12.1% average performance improvement]

  47. Effect on Executed Instructions [figure: 13.3% average reduction in executed instructions]

  48. AVD Prediction vs. Stride Value Prediction • Performance: both can capture traversal address loads with stable AVDs (e.g., treeadd) • Stride VP cannot capture leaf address loads with stable AVDs (e.g., health, mst, parser) • The AVD predictor cannot capture data loads with striding data values; predicting these can be useful for the correct resolution of mispredicted L2-miss dependent branches (e.g., parser) • Complexity: the AVD predictor requires far fewer entries (only address loads), and AVD prediction logic is simpler (no stride maintenance)

  49. AVD vs. Stride VP Performance [figure comparing AVD prediction and stride value prediction with 16-entry and 4096-entry predictors; data points from the chart: 2.5%, 4.5%, 12.1%, 12.6%, 13.4%, 16%]

  50. Outline • Motivation • Runahead Execution • Evaluation • Limitations of the Baseline Runahead Mechanism • Overcoming the Limitations • Efficient Runahead Execution • Address-Value Delta (AVD) Prediction • Summary
