Advanced Microarchitecture


Presentation Transcript


  1. Advanced Microarchitecture Lecture 5: Advanced Fetch

  2. Branch Predictions Can Be Wrong • How/when do we detect a misprediction? • What do we do about it? • resteer fetch to the correct address • hunt down and squash instructions from the wrong path

  3. Example Control Flow [diagram: a branch “br” at block A; the correct path and the predicted path diverge into different sequences of basic blocks B, C, D, E, F, G]

  4. Multiple speculatively fetched basic blocks may be in-flight at the same time! [diagram: a simple pipeline with Fetch (IF), Decode (ID), Dispatch (DP), and Execute (EX); at time T, “br A”, “br B”, and “br D” occupy successive stages when the misprediction is detected in EX]

  5. In More Detail • IF: direction prediction, target prediction • ID: we know if the branch is a return, indirect jump, or phantom branch • squash instructions in BP and I$-lookup • resteer BP to the new target from the RAS/iBTB • DP: if indirect target, can potentially read the target from the RF • squash instructions in BP, I$, and ID • resteer BP to the target from the RF • EX: detect wrong direction, or wrong target (indirect) • squash instructions in BP, I$, ID, and DP, plus RS and ROB • resteer BP to the correct next PC

  6. Phantom Branches • May occur when performing multiple bpreds: 4 predictions correspond to 4 possible branches in the fetch group (slots A, B, C, D holding ADD, BR, XOR, BR), with BPred output N N T T and targets X and Z • Fetch: ABCX… (C appears to be a taken branch) • After fetch, we discover C cannot be taken because it is not even a branch! This is a phantom branch. • Should have fetched: ABCDZ…
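The phantom-branch check on this slide can be sketched as a post-fetch filter. A minimal sketch in Python, assuming a toy fetch group where predecode `is_branch` bits (known only after fetch) are compared against the per-slot taken predictions made before fetch; all names are illustrative:

```python
def find_phantom_branch(is_branch, predictions):
    """Return the index of the first slot predicted taken that turns
    out not to be a branch at all, or None if no phantom branch."""
    for i, (br, pred_taken) in enumerate(zip(is_branch, predictions)):
        if pred_taken and not br:
            return i          # predicted taken, but not a branch: phantom
        if pred_taken and br:
            return None       # a real taken branch ends the fetch group
    return None

# Slide example: group ABCD = ADD, BR, XOR, BR; predictor says N N T T.
# Slot C (index 2, the XOR) is predicted taken, so it is a phantom branch.
phantom = find_phantom_branch([False, True, False, True],
                              [False, False, True, True])
```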

  7. Hardware Organization [diagram: PC feeds the I$ and the branch predictor; next-PC selection chooses among PC + sizeof(I$-line) for no branch, the BTB target for a predicted branch (BPred) or unconditional branch, the RAS (push on call, pop on return) for returns, the iBTB for indirect jumps, and the actual target from EX when the resolved control != the prediction; ID identifies “is indir”, “is retn”, and “uncond br”]

  8. Recovery • Squashing instructions in the front-end pipeline: [diagram: IF, ID, DS, EX stages holding fetch groups WXYZ, QRST, KLMN, EFGH; on “mispred!” in EX, the younger groups are replaced with nop's] • nop's are filtered out – no need to take up RS and ROB entries • What about insts that are already in the RS, ROB, LSQ?

  9. Wait for Drain • Squash in-order front-end (as before) • Stall dispatch (no new instructions → ROB, RS) • Let OOO engine execute as usual • Let commit operate as usual except: • check for the mispredicted branch • cannot commit any instructions after it • but after the mispredicted branch commits, any remaining instructions in the ROB, RS, LSQ must be on the wrong path • flush the OOO engine • allow dispatch to continue
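The drain-based recovery on this slide can be sketched with a toy in-order commit loop. This is a hedged illustration, not real hardware: `rob` is a simple Python list standing in for the reorder buffer, committed oldest-first:

```python
def wait_for_drain(rob, mispredicted_branch):
    """Stall dispatch, commit up to and including the mispredicted
    branch, then flush everything left (it must be wrong-path)."""
    committed = []
    while rob:
        inst = rob.pop(0)          # commit in program order
        committed.append(inst)
        if inst == mispredicted_branch:
            break                  # cannot commit anything younger
    flushed = list(rob)            # remaining insts are wrong-path
    rob.clear()                    # flush the OOO engine
    return committed, flushed

rob = ["LOAD", "ADD", "BR", "XOR", "SUB", "ST"]   # BR is the mispredict
committed, flushed = wait_for_drain(rob, "BR")
```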

  10. Wait for Drain (2) • Simple to implement! • Performance degradation: what if a load ahead of the mispredicted branch has a cache miss and goes to main memory? [diagram: ideal recovery refetches the correct path immediately after LOAD, ADD, BR; drain-and-wait instead executes wrong-path junk (XOR, LOAD, SUB, ST, BR, all squashed) until the long-latency load finally drains]

  11. Branch Tags/IDs/Colors • Each instruction fetched is assigned the “current branch tag” • Each predicted branch causes a new branch tag to be allocated (and becomes the current tag) [diagram: ROB entries tagged 1 1 1 1 1 2 2 2 2 2 2 2 4 4 4 7 7 7 7 7 5 3 3 3 3, with a branch at each tag boundary] • (Tags might not necessarily be in any particular order)

  12. Branch Tags (2) [diagram: the same ROB; when the branch with tag 7 mispredicts, all entries tagged 7, 5, and 3 are squashed] • Tag List: 1 2 4 7 5 3

  13. Overkill for ROB / LSQ • ROB and LSQ keep instructions in program order (more on this in a future lecture) • All instructions physically after the mispredicted branch should be squashed … Simple! • Some sort of tagging/coloring is useful for the RS! • instructions in the RS may be in arbitrary order • there may be multiple sets of RS's (e.g. Integer RS and FP RS alongside the ROB)
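The contrast above — positional squash for the in-order ROB vs. tag-based squash for unordered reservation stations — can be sketched as follows, using the slide's tag list 1 2 4 7 5 3 and assuming each RS entry carries the branch tag that was current at dispatch (names illustrative):

```python
def squash_tags(tag_list, mispredicted_tag):
    """Tags allocated at or after the mispredicted branch's tag are
    all wrong-path; return that set (tag_list is in allocation order)."""
    idx = tag_list.index(mispredicted_tag)
    return set(tag_list[idx:])

def squash_rs(rs_entries, dead_tags):
    """RS entries are unordered, so filter by tag rather than position."""
    return [e for e in rs_entries if e["tag"] not in dead_tags]

tag_list = [1, 2, 4, 7, 5, 3]          # allocation order from the slide
dead = squash_tags(tag_list, 7)        # the branch with tag 7 mispredicts
rs = [{"op": "ADD", "tag": 2}, {"op": "MUL", "tag": 7},
      {"op": "XOR", "tag": 3}]
surviving = squash_rs(rs, dead)
```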

  14. Hardware Complexity [diagram: each RS entry compares “my tag” against every invalidate broadcast bus (invalidate tag 0, invalidate tag 1, invalidate tag 2, …) to compute “squash”; both the height and the width of this comparator array increase with the number of branch tags] • Overall area overhead is quadratic in tag count

  15. Simplifications • For a ROB with n entries, could potentially have n different branches, each requiring a unique tag • In practice, only a fraction of insts are branches, so limit to k < n tags instead • If a (k+1)st branch is fetched, dispatch must be stalled until a tag has been deallocated

  16. Simplifications (2) • For k tags, may need to broadcast all k if the oldest branch mispredicted, resulting in O(k²) overhead • Limit to only one (for example) broadcast per cycle [diagram: tags 7, 5, and 3 are invalidated one per cycle before “Resume Fetch”]

  17. Branch Predictor Latency • To provide a continuous stream of instructions, the branch predictor must make one prediction every cycle • Pipelining? • Nope. If the current prediction is NT, then the next PC is A; if taken, then the next PC is B. A dependency exists between successive predictions • Limits predictor size/latency • Smaller predictor is less accurate • Or clock frequency penalty

  18. Ahead Prediction • Normally: • PC1 → PC2 → PC3 → PC4 → PC5 → … • Each “→” is a prediction that takes a single cycle • PCi is predicted from PCi-1 • Instead: • PC1 → PC3 → PC5 → … • and PC2 → PC4 → … • PCi is predicted from PCi-2, and so the prediction can take two cycles instead of one • In general, can k-ahead pipeline the predictor
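A toy model of 2-ahead pipelining: assuming a hypothetical `predict2` function that maps a fetch address directly to the address two fetch groups later, the two interleaved chains reconstruct the full fetch stream:

```python
def two_ahead_stream(pc0, pc1, predict2, n):
    """Interleave the chains pc0->pc2->pc4... and pc1->pc3->pc5...
    to produce the full fetch stream of n addresses; each chain's
    prediction has two cycles to complete."""
    chains = [pc0, pc1]
    stream = []
    for i in range(n):
        stream.append(chains[i % 2])
        chains[i % 2] = predict2(chains[i % 2])   # 2-cycle prediction
    return stream

# Illustrative straight-line code with 16-byte fetch groups: the
# address two groups ahead is simply pc + 32 (assumption for the demo).
stream = two_ahead_stream(0x100, 0x110, lambda pc: pc + 32, 6)
```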

  19. Ahead Prediction Timing [diagram: a 2-cycle ahead-pipelined branch predictor; in cycle k the fetch address is PCi while the predictor's two stages hold PCi+1 and PCi+2; in cycle k+1 the fetch address is PCi+1 with PCi+2 and PCi+3 in flight; in cycle k+2, PCi+2 with PCi+3 and PCi+4]

  20. Ahead Prediction Misprediction • Cycle k: mispredict detected; the new PC (NPC) is sent to the front-end, but the ahead-pipelined predictor also needs the address before NPC – the PC of the mispredicted branch – to restart • Cycle k+1: NPC → I$, PC → predictor • Cycle k+2: I$ bubble; NPC → predictor, which produces the next-next PC (N2PC) • Cycle k+3: N2PC → I$, N2PC → predictor

  21. Overriding Branch Predictors • Use two branch predictors • 1st one has single-cycle latency (fast, medium accuracy) • 2nd one has multi-cycle latency, but more accurate • Second predictor can override the 1st prediction if it disagrees • Idea: better to pay a small number of bubbles (the difference in the 1st and 2nd predictors' latencies) than to pay for a full branch misprediction (full pipeline flush, 20+ cycles of delay)

  22. Overriding Predictors (2) [diagram: starting from Z, the fast 1st predictor produces A, then B, then C, each feeding a 2-cycle pipelined I$ (Fetch A, Fetch B, …), while the slower 2nd predictor produces A', B', C' a few cycles behind] • If A = A' (both preds agree), done • If A ≠ A', flush A, B, and C, and restart fetch with A'

  23. Benefit of Overriding Predictors • Assume • 1-cycle predictor, 80% accuracy • 3-cycle predictor, 95% accuracy • Misprediction penalty of 20 cycles • Fetch bubbles per branch • 1-cycle pred only: 0.8×0 + 0.2×20 = 4 • 3-cycle pred only: 0.95×3 + 0.05×20 = 3.85 • Overriding config: 0.8×0.95×0 + 0.2×0.95×3 + 0.2×0.05×20 + 0.8×0.05×23 = 1.69 • Worst case, the branch mispred penalty is worse than without overriding predictors!
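The slide's arithmetic can be checked directly. A small sketch, assuming (as the slide does) that the two predictors' outcomes combine independently and that a slow-predictor miss costs the override delay plus the full penalty:

```python
def expected_bubbles(acc_fast, acc_slow, override_delay, mispred_penalty):
    """Expected fetch bubbles per branch for an overriding configuration."""
    both_right = acc_fast * acc_slow * 0
    fast_wrong = (1 - acc_fast) * acc_slow * override_delay
    both_wrong = (1 - acc_fast) * (1 - acc_slow) * mispred_penalty
    slow_wrong = acc_fast * (1 - acc_slow) * (override_delay + mispred_penalty)
    return both_right + fast_wrong + both_wrong + slow_wrong

fast_only = 0.8 * 0 + 0.2 * 20              # slide: 4 bubbles per branch
slow_only = 0.95 * 3 + 0.05 * 20            # slide: 3.85 bubbles per branch
overriding = expected_bubbles(0.8, 0.95, 3, 20)   # slide: 1.69 bubbles
```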

  24. Speculative Branch Update • Ideal branch prediction problem • Given PC, predict branch outcome • Given actual outcome, update/train predictor • Repeat • Actual problem • Streams of predictions and updates in parallel [timeline: Predict A B C D E F G overlaps Update A B C D E F G]

  25. Speculative Branch Update (2) • BHR update cannot be delayed until branch retirement • Can't update the BHR until commit because the outcome is not known until then [timeline: branches B–E are all predicted with the same stale BHR value 011010; only after A's outcome commits does the BHR become 110101]

  26. Speculative Branch Update (3) • Update branch history using predictions • Speculative update • If predictions are correct, then the BHR is correct • Effectively simulates alternating lookup and update w.r.t. the BHR • So what if there's a misprediction? • Checkpoint and recover
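Checkpoint-and-recover for a speculative BHR can be sketched as below, modeling the BHR as a bit-string shifted on every prediction and checkpointed per branch (the class and its methods are illustrative, not a real design):

```python
class SpeculativeBHR:
    """Toy branch history register updated speculatively at predict
    time, with per-branch checkpoints for misprediction recovery."""

    def __init__(self, bits):
        self.bits = bits            # oldest outcome first
        self.checkpoints = {}

    def predict(self, branch_id, prediction):
        # Checkpoint BEFORE shifting in the predicted outcome, so a
        # misprediction can restore the pre-branch history.
        self.checkpoints[branch_id] = self.bits
        self.bits = self.bits[1:] + ("1" if prediction else "0")

    def recover(self, branch_id, actual):
        # Restore the checkpoint, then shift in the actual outcome;
        # younger speculative updates are implicitly discarded.
        self.bits = self.checkpoints[branch_id][1:] + ("1" if actual else "0")

bhr = SpeculativeBHR("011010")
bhr.predict("A", True)      # speculatively shift in a 1
bhr.predict("B", False)     # then a 0
bhr.recover("A", False)     # A was actually NT: roll back past B too
```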

  27. Recovery of Speculative BHR [diagram: BPred lookup reads and updates a Speculative BHR (0110100100100…); a separate Retirement BHR is updated at retirement; on “Mispredict!”, the speculative BHR is restored from the retirement BHR]

  28. Execution-Time Recovery • Commit-time recovery may substantially delay branch misprediction recovery [example: a load ahead of the branch takes a $-miss to DRAM; the branch has executed, but recovery can't happen until the load retires] • Have every branch checkpoint the BHR at the time it predicted • On mispredict, recover the speculative BHR from this checkpoint

  29. Traces • A “Trace” is a dynamic stream of instructions • Static layout: A B C D E F G H I J K L • Observed paths through the program (example traces): • A B C H I J H I J K L • A G H I J H I J K L • A B C H I J

  30. Trace Cache • Idea is to cache dynamic traces instead of static instructions [diagram: the trace A B C D E F G H I J spans multiple I$ lines, so I$ fetch takes 5 cycles; a single T$ line holding the whole trace is fetched in 1 cycle]

  31. Hardware Organization [diagram: the fetch address indexes the T$ (tag, etc. + insts) and the I$ in parallel, along with the BPred and BTB; hit logic selects the T$ line, or the I$ path (mask, exchange, shift) into the instruction latch to the decoder; a line-fill buffer with fill control and merge logic constructs new T$ lines]

  32. Tags, etc. • Each T$ line stores: Tag | # Br. | Branch Mask | Fall-thru Addr | Target Addr • Example for Fetch: A → Tag A, 3 branches, mask 11,1 (branches 1 & 2 both taken in this trace; the trailing 1 means the trace ends in a branch), fall-thru addr X, target addr Y for the 3rd (trace-ending) branch

  33. Hit Logic, Next Address Selection • Fetch: A • Compare the fetch address against the T$ tag (A) to match the 1st block; the multi-BPred output (e.g. N T T) is compared against the branch mask (11,1) to match the remaining block(s) • Cond. AND of the two matches → trace hit • The next fetch address is selected from the fall-thru (X) or target (Y) of the trace-ending branch
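The hit logic can be sketched as a couple of comparisons. This assumes a toy line format with a tag, branch count, and per-branch taken mask, as drawn on the slide; the field names are made up for illustration:

```python
def trace_hit(line, fetch_addr, predictions):
    """Full hit iff the tag matches and the multiple-branch predictor
    agrees with the line's mask for every intra-trace branch (the
    last branch only selects the next address, so it is excluded)."""
    if line["tag"] != fetch_addr:
        return False
    mask = line["mask"][: line["num_br"] - 1]
    return all(p == m for p, m in zip(predictions, mask))

def next_fetch_addr(line, final_pred):
    """The trace-ending branch picks target vs. fall-through."""
    return line["target"] if final_pred else line["fallthru"]

# Slide example: Tag A, 3 branches, branches 1 & 2 taken in the trace.
line = {"tag": "A", "num_br": 3, "mask": [True, True],
        "fallthru": "X", "target": "Y"}
hit = trace_hit(line, "A", [True, True])   # predictor agrees: T, T
nxt = next_fetch_addr(line, False)         # 3rd branch predicted NT -> X
```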

  34. Generating Multiple Predictions • Serialized access (BHR → BPred → BPred → BPred): incredibly slow • Instead, make three predictions in parallel • Predictor must be BHR-based only (no PC bits!)

  35. Associativity • Set-Associativity [diagram: traces ABC and XYZ with the same index live in different ways of the same set] • Benefit: reduced miss rate • Cost: access time, replacement complexity • Path/Trace-Associativity [diagram: traces ABC and ABD, with the same starting address, stored in different ways] • Benefit: possible reduced miss rate, less trace-thrashing • Cost: access time, replacement complexity, code duplication

  36. Indexing [diagram: the T$ is indexed by the starting address A combined with BHR bits, so history X maps the fetch to trace ABC while history Y maps it to trace ABD] • Works if the path after AB consistently correlates with the path before AB • Provides a similar benefit to path-assoc.

  37. Trace Fill Unit Placement • Build trace at fetch: [diagram: instructions from the I$ on their way to decode also enter a trace construction buffer; the trace is stored to the T$ when complete] • Build trace at retire: [diagram: instructions retiring from the ROB enter a trace construction buffer; the trace is stored to the T$ when complete]

  38. Trace Fill Unit Placement (2) • At Fetch • Speculative traces (uses branch prediction – not verified) • Construction buffer management • Building ABC, detect mispredict → should be ABD; need to find C in the buffer, clean it out, and then insert D • At Retire • Non-speculative, all traces are “correct” • No interaction with branch predictor • Simpler construction buffer • Slower response time • Time from fetching ABC → retiring ABC may be long • Until retirement, ABC is not in the T$ and fetch must use the I$

  39. Trace Selection • Some traces may have poor temporal locality [diagram: from A, the path through B is taken 97% of the time, the path through C only 3%] • Storing ACD evicts ABD (assuming no path-assoc), but likely won't be useful • Alternative: use a trace filtering mechanism • extra HW required

  40. Statistical Filtering [PACT 2005] • For each trace, insert with probability p < 1.0 • Example: p = 0.05 (5% chance of insertion per trace) • Hot trace: ABC, seen 50 times • Cold trace: XYZ, seen twice • Probability of ABC getting inserted • 1.0 – P(not getting inserted) = 1.0 – (1.0 – 0.05)^50 = 1.0 – 0.95^50 = 92.3% (good chance that ABC gets in the T$) • Probability of XYZ getting inserted • 1.0 – (1.0 – 0.05)^2 = 1.0 – 0.95^2 = 9.75% (not so likely)
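The two probabilities on this slide follow from 1 − P(never inserted). A one-function sketch of the arithmetic:

```python
def p_inserted(p, occurrences):
    """Probability that a trace built `occurrences` times is inserted
    into the T$ at least once, with per-build insert probability p."""
    return 1.0 - (1.0 - p) ** occurrences

hot = p_inserted(0.05, 50)    # hot trace ABC: very likely gets in
cold = p_inserted(0.05, 2)    # cold trace XYZ: probably filtered out
```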

  41. Partial Matches [diagram: the T$ holds trace ABC, but the BPred only matches through AB; supply AB from the T$ as a partial hit while the I$ path covers the rest] • Benefit: more insts per fetch • Cost: more complex “hit” logic • squashing logic (ABC → AB) • targets for intermediate branches

  42. Netburst (P4) Trace Cache • No I$ !! [diagram: the front-end BTB and the iTLB/prefetcher fetch from the L2 cache into the decoder, which fills the Trace $ with decoded instructions; the Trace $ has its own BTB and feeds rename, execute, etc.] • Trace-based prediction (predict next-trace, not next-PC)

  43. Trace Prediction • Each trace has a unique identifier, analogous to but different from a conventional PC • effectively the starting PC plus intra-trace branch directions • The trace predictor takes a trace-id as input and outputs a predicted next-trace-id • The trace cache is indexed with the trace-id, with the tag match against the trace-id as well
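One plausible encoding of such a trace-id — the starting PC with one direction bit appended per intra-trace branch — can be sketched as follows (the packing scheme is an assumption for illustration, not the actual mechanism):

```python
def make_trace_id(start_pc, directions):
    """Pack the starting PC plus one bit per intra-trace branch
    (1 = taken) into a single identifier."""
    tid = start_pc
    for taken in directions:
        tid = (tid << 1) | (1 if taken else 0)
    return tid

# Same start PC, different paths -> different trace-ids, so the trace
# predictor and trace cache can tell traces ABC and ABD apart.
id_abc = make_trace_id(0x400, [True, True])    # both branches taken
id_abd = make_trace_id(0x400, [True, False])   # second branch not taken
```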

  44. No I$, Decoded Trace Cache • No I$ means a T$ miss must pay the latency of an L2 access • Severe performance penalty for applications with poor trace locality • Decoded instructions remove decode logic from the branch misprediction penalty [diagram: with a conventional front-end, the mispredict penalty spans Fetch, Fetch, Dec, Dec, Ren, Disp, Exec; with the decoded T$, only T$, T$, Ren, Disp, Exec]
