
CRE652 Processor Architecture Dynamic Branch Prediction



  1. CRE652 Processor Architecture Dynamic Branch Prediction

  2. Dynamic Branch Prediction Predict the branch outcome at run time, together with where the target instruction is. • avoids the control hazard • has a heavy effect on multi-issue processors example: bne <target> add … <target> sub. While the branch is still in IF/ID, the next fetch must already begin: to avoid a stall, the fetch unit needs to know which instruction, ADD or SUB, to fetch even before the branch is decoded.

  3. Example: suppose a branch comes every six instructions, and the prediction success rates are Static-Taken: 70% and Dynamic: 90%. Assume a 2-cycle stall per mis-prediction (and no other stalls in the pipe). With single issue: CPI = 1 + (0.3*2)*1/6 = 1.1 for static, CPI = 1 + (0.1*2)*1/6 ≈ 1.03 for dynamic. About 7% difference. With 6-issue, branches come six times as often per cycle: CPI = 1 + (0.3*2)*6/6 = 1.6 for static, CPI = 1 + (0.1*2)*6/6 = 1.2 for dynamic. About 30% difference! (if one commit/cycle) CPI = 1/6 + (0.3*2)*6/6 ≈ 0.77 for static, CPI = 1/6 + (0.1*2)*6/6 ≈ 0.37 for dynamic. More than 100% difference! (if six commits/cycle)
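
  The arithmetic above can be checked with a minimal C sketch of this simple penalty model (base CPI plus misprediction rate * penalty * branches per cycle); the function name and layout are illustrative only:

      #include <stdio.h>

      /* CPI = base CPI + mispredict rate * penalty cycles * branches per cycle
         (the simple penalty model used on this slide). */
      static double cpi(double base, double miss_rate, double penalty, double branches_per_cycle) {
          return base + miss_rate * penalty * branches_per_cycle;
      }

      int main(void) {
          /* one branch every six instructions, 2-cycle misprediction penalty */
          printf("single-issue     static : %.2f\n", cpi(1.0,     0.3, 2.0, 1.0 / 6));  /* 1.10 */
          printf("single-issue     dynamic: %.2f\n", cpi(1.0,     0.1, 2.0, 1.0 / 6));  /* 1.03 */
          printf("6-issue/1-commit static : %.2f\n", cpi(1.0,     0.3, 2.0, 1.0));      /* 1.60 */
          printf("6-issue/1-commit dynamic: %.2f\n", cpi(1.0,     0.1, 2.0, 1.0));      /* 1.20 */
          printf("6-issue/6-commit static : %.2f\n", cpi(1.0 / 6, 0.3, 2.0, 1.0));      /* 0.77 */
          printf("6-issue/6-commit dynamic: %.2f\n", cpi(1.0 / 6, 0.1, 2.0, 1.0));      /* 0.37 */
          return 0;
      }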

  4. Branch Prediction • What to predict • Branch direction (taken or not taken): needed for conditional branches; the harder part • Branch target if taken • When to predict • Target at the IF stage; direction could be later, but earlier than EX • When to verify • At the end of the branch's EX stage, when the branch is resolved • Predictor type • Static: always assume the branch is either taken or not taken • Dynamic: changes over time → our focus [Pipeline diagram: the sequence Add r1,r2,r3; load r4,100(r5); Subi r4,r4,200; Store r4,120(r5); Addi r5,r5,1; BNE r1,r5,offset flowing through the IF ID IS EX stages]

  5. Branch Misprediction Recovery [Diagram: branch at A; predicted path through B and D, actual path through C] • Assume the branch at A is mis-predicted. The processor should • redirect the fetch point to the other path out of A, and • cancel/nullify the instructions fetched along B and D. The mis-prediction penalty is the number of cycles from the time the branch is predicted (at fetch) to the time the branch is resolved (typically at the end of EX). All instructions fetched, decoded, and executed in between must be canceled.

  6. Dynamic Branch Prediction With a single-issue pipe, dynamic branch prediction may be only a modest improvement, but it is an essential feature for multiple-issue pipes. Dynamic prediction is based on branch history: • Looking only at the history of the branch being predicted → prediction in isolation • Looking at the history of other branches in addition to the branch being predicted → correlating prediction

  7. Dynamic Branch Prediction example: … bne <target1> … <target1> … beq <target2> … <target2> Consider predicting "beq": we can consider the history of "beq" only (prediction in isolation), or the histories of "bne" and "beq" together (correlating prediction).

  8. Prediction in Isolation: Branch History Q: How many previous branch decisions should be considered (the branch history depth) to get good prediction success? • One-bit history, aka last-value prediction: remember only the previous branch decision • Start with a prediction of either T or N • If wrong, change the prediction to the other value for next time [State diagram: state 0 predicts N, state 1 predicts T; a taken branch moves to state 1, a not-taken branch moves to state 0]
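
  A minimal C sketch of the one-bit (last-value) predictor described above; the type and function names are illustrative:

      /* 1-bit (last-value) predictor: the state is simply the last outcome of this branch. */
      typedef struct { unsigned char last_taken; /* 1 = taken, 0 = not taken */ } pred1_t;

      static int  pred1_predict(const pred1_t *p)       { return p->last_taken; }
      static void pred1_update (pred1_t *p, int taken)  { p->last_taken = taken ? 1 : 0; }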

  9. 2-bit History More bits record more history, which may make the prediction more accurate (or maybe not?). • Two-bit history (prediction) bits, with the prediction for each history value chosen from static profiling:

      2-bit history   Profile of Taken (%)   Prediction
      NN (00)                 11                 N
      NT (01)                 61                 T
      TN (10)                 54                 T
      TT (11)                 97                 T

  The state variable is the branch history itself: each new outcome is shifted into the 2-bit history, and the prediction is looked up for the resulting value. [State diagram over the four history states 00, 01, 10, 11]
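
  A minimal C sketch of such a 2-bit-history predictor, assuming the profiled prediction table above; names are illustrative:

      /* 2-bit history predictor: the state is the last two outcomes of this branch
         (bit 1 = older outcome, bit 0 = newer outcome, 1 = taken).
         The prediction per history value follows the profiling table above. */
      typedef struct { unsigned char hist; /* 0=NN, 1=NT, 2=TN, 3=TT */ } hist2_t;

      static const int predict_for_hist[4] = { 0, 1, 1, 1 };  /* NN->N, NT->T, TN->T, TT->T */

      static int  hist2_predict(const hist2_t *p)      { return predict_for_hist[p->hist]; }
      static void hist2_update (hist2_t *p, int taken) { p->hist = (unsigned char)(((p->hist << 1) | (taken ? 1 : 0)) & 0x3); }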

  10. How large might n be? Prediction success rate (%) for an n-bit history, by workload:

      n    Compiler   Business   Scientific   System
      0      64.1       64.4        70.4       54.0
      1      91.9       95.2        86.6       79.7
      2      93.0       96.5        90.8       83.4
      3      93.7       96.6        91.0       83.5
      4      94.5       96.8        91.8       83.7
      5      94.7       97.0        92.0       83.9

  note: 0-bit is static Taken prediction. Even with ∞ bits, it improves little over 2-bit prediction.

  11. Bi-modal Predictor: counting mis-predictions instead of recording branch history • A 1-bit predictor might be too sensitive. Example: for (i = 1; i <= 5; i++) for (j = 1; j <= 10; j++) do something, which corresponds roughly to: Label1: i = i + 1; Label2: do something; j = j + 1; ble j, 10, Label2; ble i, 5, Label1. For each pass through the inner loop, its closing branch (ble j, 10, Label2) will be mis-predicted twice by a 1-bit predictor: once when the inner loop exits, and once again when it is re-entered.

  12. Bi-modal (Saturating Counter) Predictor • Bi-modal predictor: a 2-bit "saturating" counter whose state variable is a number (0..3) • A taken branch increments the counter (+1), a not-taken branch decrements it (-1), saturating at 0 and 3; states 0N and 1N predict Not-taken, states 2T and 3T predict Taken • Only two consecutive mis-predictions cause the prediction to change. With the same hardware resources, the bi-modal predictor has better prediction accuracy than the 2-bit history one. [State diagram: 0N - 1N - 2T - 3T, moving up on T (+1) and down on N (-1)]
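
  A minimal C sketch of the 2-bit saturating counter; names are illustrative:

      /* 2-bit saturating counter: values 0,1 predict not-taken; 2,3 predict taken.
         Only two consecutive mis-predictions flip the prediction. */
      typedef struct { unsigned char ctr; /* 0..3 */ } sat2_t;

      static int pred_sat2(const sat2_t *p) { return p->ctr >= 2; }

      static void update_sat2(sat2_t *p, int taken) {
          if (taken) { if (p->ctr < 3) p->ctr++; }   /* +1, saturating at 3 */
          else       { if (p->ctr > 0) p->ctr--; }   /* -1, saturating at 0 */
      }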

  13. Hardware Organization The low-order l bits of the 32-bit PC (0 < l <= 32) index the prediction table, which has 2^l entries; each entry is an n-bit counter/history. Multiple branches can be mapped into one entry: the aliasing problem (a resolution issue). How many entries should the prediction table have?
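
  A minimal sketch of indexing such a prediction table with the low-order PC bits; the table size is an assumption, and the aliasing mentioned above shows up as different PCs mapping to the same counter:

      #define L_BITS   12u                      /* l index bits -> 4096-entry table (assumption) */
      #define ENTRIES  (1u << L_BITS)

      static unsigned char counters[ENTRIES];   /* one 2-bit saturating counter per entry */

      /* Drop the byte-offset bits of the PC, then keep the low l bits.
         Branches whose PCs agree in these bits alias to the same counter. */
      static unsigned bht_index(unsigned pc) { return (pc >> 2) & (ENTRIES - 1); }

      static int bht_predict(unsigned pc) { return counters[bht_index(pc)] >= 2; }

      static void bht_train(unsigned pc, int taken) {
          unsigned char *c = &counters[bht_index(pc)];
          if (taken) { if (*c < 3) (*c)++; }
          else       { if (*c > 0) (*c)--; }
      }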

  14. Correlating Prediction A branch decision may be affected by other branch decisions: if (aa == 2) aa = 0; if (bb == 4) bb = 0; if (aa != bb) { … If the first two conditions are true, then the third will be false.

  15. Correlating Branch Predictor • If we use the outcomes of the 2 most recent branches as history, there are 4 possibilities (T-T, T-N, N-T, N-N). • For each possibility, we need a separate predictor. • And this repeats for every branch. 2^4 = 16: (2,2) branch prediction

  16. Another Way to View the Correlating Branch Predictor Save recent branch outcomes to approximate the control path followed → Branch History Register (BHR), also called BHSR (Branch History Shift Register). An m-bit shift register holds the outcomes of the last m branch instruction executions (recall: this is dynamic prediction). Whenever a branch decision is made, the oldest bit is shifted out of the BHR and the new decision bit is shifted in. [Diagram: BHR shifting in T/N outcomes as 1/0 bits]
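
  A minimal sketch of maintaining an m-bit BHR, encoding taken as 1; the width chosen is an assumption:

      #define M_BITS 12                                   /* BHR width m (assumption) */

      static unsigned bhr;                                /* holds the outcomes of the last M_BITS branches */

      /* On every branch decision: shift the newest outcome in at the bottom;
         the oldest bit falls off the top. */
      static void bhr_update(int taken) {
          bhr = ((bhr << 1) | (taken ? 1u : 0u)) & ((1u << M_BITS) - 1u);
      }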

  17. Performance of Correlating Branch Prediction • With the same number of state bits, a (2,2) predictor performs better than a non-correlating 2-bit predictor. • It even outperforms a 2-bit predictor with an infinite number of entries.

  18. Correlating Predictor Notes: • A (0, 2) predictor is 2-bit prediction in isolation. • Sometimes the m history bits and the n prediction bits describe the same branch instruction, e.g. a loop-closing branch with no other branch in the loop body. • An entry is not unique to a specific branch: the program can follow different execution paths, and therefore different BHR values, to reach one particular branch. • A larger m may provide better resolution and therefore better accuracy: 10 or 12 bits seems popular.

  19. Correlating Predictor An (m, n) predictor: m is the number of (global) BHR bits; n is the number of history/counter bits per local branch. The PC and the BHR together are used to access the branch prediction/history table (a table of history/prediction bits, in most cases a 2-bit history table).

  20. gshare Predictor (McFarling) The BHR contents, as well as the branch's PC, are used to index into an array of isolated predictors. The PC and BHR can be combined to access the 2-bit history table either by concatenation or by XOR (partial or full). The table itself, called the Branch History Table (BHT) or Pattern History Table (PHT), is a 2^m-entry table of 2-bit history/counter predictors. [Diagram: PC and m-bit BHR combined into an index into the PHT, producing the prediction]
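
  A minimal sketch of a gshare-style lookup, XORing PC bits with the global history to index a single table of 2-bit counters; the sizes and names are assumptions:

      #define GH_BITS  12u
      #define PHT_SIZE (1u << GH_BITS)

      static unsigned char pht[PHT_SIZE];       /* 2-bit saturating counters */
      static unsigned      ghr;                 /* global history register, GH_BITS wide */

      /* XOR the PC bits with the global history to form the PHT index. */
      static unsigned gshare_index(unsigned pc) { return ((pc >> 2) ^ ghr) & (PHT_SIZE - 1); }

      static int gshare_predict(unsigned pc) { return pht[gshare_index(pc)] >= 2; }

      static void gshare_train(unsigned pc, int taken) {
          unsigned char *c = &pht[gshare_index(pc)];
          if (taken) { if (*c < 3) (*c)++; } else { if (*c > 0) (*c)--; }
          ghr = ((ghr << 1) | (taken ? 1u : 0u)) & (PHT_SIZE - 1);   /* update the history last */
      }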

  21. 2-level Predictors: extended idea • BHR and BHR table • We can have one BHR (global BHR, "G") for the whole program: a single register that is read and updated by every branch • Or a per-address BHR ("P"): a BHR table indexed by a portion of the PC bits, where each BHR is dedicated to one particular branch; the current branch's PC locates the BHR to read and update. [Diagram: one global BHR read and updated by all branches vs. a BHR table containing multiple BHRs, each read and updated by one particular branch]

  22. 2-level Predictors • PHT (Pattern History Table) • Each PHT entry contains an n-bit history/counter predictor • We can have a single PHT indexed by the BHR ("g") • Or a per-address PHT (a set of PHTs, "p"): use the PC to locate a PHT first, then use the BHR to locate one particular entry; each PHT is dedicated to one particular branch. [Diagram: one global PHT indexed by BHR bits vs. multiple PHTs selected by the PC and indexed by BHR bits, e.g. GAp]

  23. xAy Predictor: GAg • Yeh and Patt proposed the 2-level predictor naming scheme xAy • "A" means adaptive • x: BHR organization; y: PHT organization • G/g: global; P/p: per-address (individual), e.g. GAg: global BHR with global PHT • A variation of GAg: gshare by McFarling, where the PHT index is "randomized" by combining BHR bits with PC bits. [Diagram: one global BHR indexing one global PHT]

  24. xAy Predictor: PAg PAg predictor → per-address BHR (local BHR) with a single global PHT (the BHRs now form a table of BHRs) → the PC is used as a tag to match the instruction address to a specific local BHR. Surprisingly, the BHR alone, without the PC, can give a good prediction success rate if the PHT is big (>4K entries) and the BHR is wide (>12 bits). [Diagram: PC selects a local BHR from the BHR table; that BHR indexes the global PHT to produce the prediction]
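
  A minimal sketch of a PAg-style lookup: a table of per-branch local histories indexing one shared global PHT of 2-bit counters; sizes and names are assumptions:

      #define LHT_SIZE 1024u                 /* number of local BHRs (assumption) */
      #define LH_BITS  10u                   /* local history width (assumption) */
      #define PHT_SIZE (1u << LH_BITS)

      static unsigned short lht[LHT_SIZE];   /* table of per-branch local histories */
      static unsigned char  pht[PHT_SIZE];   /* single global PHT of 2-bit counters */

      static unsigned lht_index(unsigned pc) { return (pc >> 2) & (LHT_SIZE - 1); }

      static int pag_predict(unsigned pc) {
          unsigned hist = lht[lht_index(pc)];
          return pht[hist] >= 2;             /* the local history indexes the global PHT */
      }

      static void pag_train(unsigned pc, int taken) {
          unsigned li = lht_index(pc);
          unsigned char *c = &pht[lht[li]];
          if (taken) { if (*c < 3) (*c)++; } else { if (*c > 0) (*c)--; }
          lht[li] = (unsigned short)(((lht[li] << 1) | (taken ? 1u : 0u)) & (PHT_SIZE - 1));
      }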

  25. xAy Predictor: PAp PAp predictor → per-address BHR with per-address PHT → the PC is used as a tag to select both a BHR and a PHT, and then the BHR selects the entry within that PHT. [Diagram: PC selects a local BHR and one of several PHTs; the BHR indexes that PHT to produce the prediction]

  26. Combining Predictor S. McFarling, "Combining Branch Predictors", WRL Technical Note TN-36, June 1993. Hybrid/Tournament predictor: each predictor has its own advantages and works better than the other in certain situations → combine two different predictors to create a better, i.e. more accurate, predictor → this needs a predictor of predictors, a meta-predictor. e.g. a 2-bit saturating counter as the meta-predictor, choosing between two predictors such as a local predictor and gshare. Recall how the 2-bit saturating counter works: two consecutive wrong choices switch which predictor is used. [State diagram: Strong A, Weak A, Weak B, Strong B; "A wrong & B right" moves toward B, "A right & B wrong" moves toward A, otherwise no change]
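
  A minimal sketch of such a tournament chooser, assuming the two component predictions (predA, predB) are computed elsewhere; the table size, index, and names are assumptions:

      /* Meta-predictor for a tournament: one 2-bit counter per entry.
         Counter values 0,1 favor predictor A; 2,3 favor predictor B.
         idx would be derived from the PC and/or global history (assumption). */
      static unsigned char chooser[4096];

      static int tournament_choice(unsigned idx, int predA, int predB) {
          return (chooser[idx] >= 2) ? predB : predA;
      }

      /* Train only when the two component predictors disagree:
         two consecutive wrong choices move the selection to the other predictor. */
      static void tournament_train(unsigned idx, int predA, int predB, int taken) {
          if (predA == predB) return;                                     /* both right or both wrong */
          if (predB == taken) { if (chooser[idx] < 3) chooser[idx]++; }   /* B right, A wrong */
          else                { if (chooser[idx] > 0) chooser[idx]--; }   /* A right, B wrong */
      }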

  27. Branch Prediction: Alpha 21264 A tournament predictor built from saturating counters: • Local (PAg) part: a 1024-entry table of 10-bit local BHRs, indexed by the PC, selecting among 1024 3-bit saturating counters • Global (GAg) part: 4096 2-bit counters indexed by a 12-bit global BHR (the path of the last 12 branches) • Choice (meta) predictor: 4096 2-bit counters • Slightly different from McFarling's original scheme

  28. Branch Prediction: Remaining Issues • Tournament prediction with a meta-predictor • Aliasing • within the same process • between threads • Effectiveness of the BHR: how about path history? Instead of recording T or N, one may record (portions of) the addresses followed.

  29. Branch Target Buffer (BTB) Recall: direction prediction alone does not remove the stalls due to the control hazard, because while the branch is still in IF/ID the next instruction must already be fetched. To avoid the stall, even before knowing that the fetched instruction is a branch, the PC for the next instruction should be loaded: the target address must be available at instruction fetch.

  30. Branch Target Buffer To reduce the restart delay, the Branch Target Buffer (BTB) is • a small, fast cache holding target addresses • indexed with the PC of the conditional branch instruction • each entry contains the branch's PC as the tag, to guarantee the current instruction is the branch buffered in the BTB • accessed at the same time as instruction fetch • sometimes implemented as an extension of the I-cache

  31. BTB operation

  32. e.g. a BTB cache tagged with branch instruction addresses and accessed with the PC as the index; each entry holds the target address, an n/t prediction, and possibly the target instruction. The I-cache and the BTB are accessed together during IF. On a BTB hit, the predicted direction and target redirect the fetch. When the branch is resolved (assume at ID/EX), the actual decision is checked against the prediction: if the prediction was wrong, the PC is updated to the correct path and the BTB entry is updated (the prediction is reversed). [Flow diagram: IF accesses the I-cache and BTB in parallel; the branch and its target proceed through IF and ID; the actual outcome is checked against the prediction]
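
  A minimal sketch of a direct-mapped BTB lookup and update along the lines described above; the entry layout, size, and names are assumptions:

      #define BTB_SIZE 1024u

      typedef struct {
          unsigned tag;       /* PC of the branch cached here */
          unsigned target;    /* predicted target address */
          int      taken;     /* direction prediction: 1 = taken, 0 = not taken */
          int      valid;
      } btb_entry_t;

      static btb_entry_t btb[BTB_SIZE];

      static unsigned btb_index(unsigned pc) { return (pc >> 2) & (BTB_SIZE - 1); }

      /* At fetch: on a hit with a taken prediction, fetch from the target; otherwise PC+4. */
      static unsigned next_fetch_pc(unsigned pc) {
          const btb_entry_t *e = &btb[btb_index(pc)];
          return (e->valid && e->tag == pc && e->taken) ? e->target : pc + 4;
      }

      /* When the branch resolves: install or correct the entry with the actual outcome. */
      static void btb_resolve(unsigned pc, unsigned actual_target, int actually_taken) {
          btb_entry_t *e = &btb[btb_index(pc)];
          e->valid = 1; e->tag = pc; e->target = actual_target; e->taken = actually_taken;
      }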

  33. Branch Target Buffer Note: when should a branch instruction be put into the BTB? There is no need to enter instructions executed only once → optimizing the BTB design. How large? 1K to 8K entries?! • When to insert: the first time the branch instruction is executed, or the first time the branch is actually TAKEN (gives a better hit rate) • When to kick out (replacement): doesn't matter much; the usual LRU is fine

  34. Branch Folding Store the target instruction in the BTB instead of just the target address → branch folding: a 0-cycle branch. Without folding, the branch instruction and the target address are fetched from the BTB: <branch> IF ID, <target> IF ID EX, and <if the prediction was wrong: retarget and refetch>.

  35. Branch Folding Assuming a separate decoder for branches and other instructions: fetch the branch and the target instruction (from the BTB), decode both, and if the prediction is correct, the target instruction proceeds directly to the EX stage; otherwise, fetch the correct target. Note: • Still a 2-cycle delay if the prediction is wrong, but a free branch if the prediction is correct. • Predicated instructions are a generalized form of branch folding, with each instruction tagged by a prediction (predicate) bit: a compiler-based approach, as in Intel EPIC.

  36. Indirect Jump Unlike most conditional branches, which are PC-relative with a constant offset, some branches take their target address from a register or memory location, e.g. jr register (the register contains the target). For such indirect branches the target address changes frequently at run time, so branch prediction schemes based on branch history do not work well.

  37. Return Address Stack (RAS) • The return address changes as calls come from different places: the return address used by the previous jump may have nothing to do with the current instance of the jump to the return address. • A BTB holding the last jump address as the target does not work well: only 51.8% prediction success with SPECint95, and even worse with speculative execution. • The majority of indirect branches are returns: about 85% of indirect jumps are returns. • Return Address Stack (implemented as a circular buffer in hardware): • When fetching a call instruction, push the next (return) address onto the stack. • When fetching a return, pop the address from the stack before the return executes; the popped value is the speculated target address. • Note: the value could be wrong, because the hardware stack has a limited capacity and is not preserved across context switches.

  38. Return Address Stack (RAS) A small, fast hardware stack (cache) with the most recent return address on top. On a hit (i.e. the instruction is a return), the PC is updated with the address popped from the stack. Notes: 1. With some instruction formats, a tagged cache is not necessary. 2. How many slots should the RAS have? What is the maximum call depth? Intel Pentium III: 8 slots; Alpha 21264: 32 slots. [Diagram: nested calls (pc: call xx; pc+4: add … and pc+100: call zz; pc+104: sub …) push PC+4 and PC+104 onto the stack; each ret pops the top entry]
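
  A minimal sketch of a RAS implemented as a circular buffer, as described above; the depth and names are illustrative, and note that overflow silently overwrites the oldest entry, which is one source of wrong pops:

      #define RAS_DEPTH 32u                        /* e.g. the Alpha 21264 has 32 slots */

      static unsigned ras[RAS_DEPTH];
      static unsigned ras_top;                     /* count of pushes minus pops */

      /* On fetching a call: push the return address (the instruction after the call). */
      static void ras_push(unsigned return_pc) {
          ras[ras_top % RAS_DEPTH] = return_pc;    /* overflow overwrites the oldest entry */
          ras_top++;
      }

      /* On fetching a return: pop the predicted return address. */
      static unsigned ras_pop(void) {
          if (ras_top > 0) ras_top--;
          return ras[ras_top % RAS_DEPTH];
      }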

  39. Integrated Instruction Fetch Units • An aggressive fetch unit is important in a multi-issue superscalar processor. • Integrated branch prediction: do both target prediction and direction prediction. • Instruction pre-fetch: fetch ahead beyond the current cache line. • Instruction memory access and buffering: memory provides a smooth instruction flow to the fetch unit. • Trace cache: previously, the fetch boundary in each cycle was the first branch; a trace cache stores "traces" rather than consecutive blocks, so in each cycle instructions can be fetched across multiple branches.
