
Branch Prediction

Branch Prediction. J. Nelson Amaral. Why Branch Prediction? One in every 5 to 7 instructions of a program is a branch. Not predicting, or mispredicting, is very costly in architectures with deep pipelines or with many functional units. Baer p. 129. Anatomy of a Predictor. Baer p. 130.



Presentation Transcript

  1. Branch Prediction J. Nelson Amaral

  2. Why Branch Prediction? • One in every 5 to 7 instructions of a program is a branch. • Not predicting, or mispredicting, is very costly in architectures with deep pipelines or with many functional units. Baer p. 129

  3. Anatomy of a Predictor Baer p. 130

  4. Anatomy of a Branch Predictor: Program Execution • Event Source: the execution of the program • Predictive information: • Can be encoded in the instruction code • a bit indicates most likely outcome • forward/backward branch • Obtained from some profiling information Baer p. 130

  5. Anatomy of a Branch Predictor (cont.): Event Selection • Event Selection: when to predict? • Simple solution: compute the prediction for every instruction (even non-branches) • Only use the result of the prediction for branches Baer p. 130

  6. Anatomy of a Branch Predictor (cont.): Prediction Indexing • Prediction Indexing: • Use part of the PC to index prediction tables: • history of outcome of previous branches at this PC • history of execution path leading to this PC Baer p. 130

  7. Anatomy of a Branch Predictor (cont.): Predictor Mechanism • Predictor Mechanism: • Static (example): • forward: always not taken • backward: always taken • Dynamic: • Finite State Machine predictor: saturating counters • Markov predictor: correlation Baer p. 131
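The static example on this slide (forward: not taken, backward: taken) can be sketched in a few lines. This is only an illustration: the function name `predict_static` and the use of raw addresses to decide the branch direction are assumptions, not something taken from the slides.

```python
def predict_static(branch_pc: int, target_pc: int) -> bool:
    """Static rule: backward branches are predicted taken, forward ones not taken.

    Returns True for "taken". A branch whose target is at or below its own
    address is treated as backward (assumed convention for this sketch).
    """
    return target_pc <= branch_pc


# A loop-closing branch jumps backward, so it is predicted taken.
print(predict_static(branch_pc=0x400, target_pc=0x3F0))  # True
# A forward branch (e.g., skipping an if-body) is predicted not taken.
print(predict_static(branch_pc=0x400, target_pc=0x420))  # False
```

The rule needs no state at all, which is why it is called static: the prediction for a given branch never changes during execution.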

  8. Anatomy of a Branch Predictor (cont.): Feedback • Feedback and Recovery: • Use real outcome to reinforce prediction • Must recover from mispredictions Baer p. 131

  9. Control Flow Statistics A 4-way superscalar has to predict a branch, on average, every other cycle. Baer p. 131

  10. Interbranch Distances Simulation for a 4-way out-of-order architecture. • 40% of the time there are 0 or 1 cycles between predictions. • Branch resolution takes about 10 cycles. • If the prediction is wrong, up to 40 wrong instructions are in flight by the time the resolution occurs. Baer p. 131

  11. Static Predictions Always Taken OR Always Not Taken Baer p. 132

  12. Static Predictions • Early studies indicated that 2/3 of branches are taken • but 30% of those branches were unconditional! • For conditional branches there appears to be no preferred direction. Always Taken Baer p. 132

  13. Alternative Static Predictions • Forward: always not taken. Backward: always taken. • Accuracy improvements are barely noticeable. • Static prediction based on profiling is slightly better. • Static branch-not-taken has no implementation cost on the pipeline. Baer p. 132

  14. Dynamic Predictors • The prediction for a given branch changes as the program executes. • Simple: a finite-state machine encodes the outcome of a few recent executions of the branch. • Elaborate: not only earlier outcomes of the branch, but also other correlated parts of the program are considered. Baer p. 132

  15. When to predict? • Static prediction: at the Instruction Decode stage • Know that the instruction is a branch • Dynamic prediction: at the Instruction Fetch stage • Calculate prediction for every instruction, even non-branch ones. Baer p. 133

  16. What to Predict? • Branch Direction: Is the branch taken or not? • Branch Target: Address of the next instruction for a taken branch Baer p. 133

  17. Predicting Direction • Where do we find the prediction? Look at the recent past: what was the direction the last time this same branch was executed? • How do we encode the prediction? A single bit encodes the prediction: Prediction bit is set at prediction time. Baer p. 133

  18. Prediction Hysteresis • Look at the last two resolutions • Two wrong predictions are necessary to change the prediction • Motivated by wrong predictions at the end of inner loops. Baer p. 133

  19. 2-Bit Saturating Counter States: • Last two instances were taken • Last instance was not taken but the previous was taken • Last instance was taken but the previous was not • Last two instances were not taken Baer p. 134
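A minimal sketch of a 2-bit saturating counter follows. The state diagram from the slide is not reproduced in this transcript, so the numeric encoding (0 = strong not taken up to 3 = strong taken) and the plain increment/decrement transitions are assumptions; the class name `TwoBitCounter` is illustrative.

```python
class TwoBitCounter:
    """2-bit saturating counter.

    Assumed encoding: 0 = strong NT, 1 = weak NT, 2 = weak T, 3 = strong T.
    """

    def __init__(self, state: int = 1):
        self.state = state

    def predict(self) -> bool:
        # The upper half of the range predicts "taken".
        return self.state >= 2

    def update(self, taken: bool) -> None:
        # Move one step toward the actual outcome, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


c = TwoBitCounter(state=3)   # strongly taken
c.update(False)              # one not-taken outcome: still predicts taken
print(c.predict())           # True
c.update(False)              # a second not-taken outcome flips the prediction
print(c.predict())           # False
```

This is the hysteresis described earlier: starting from a strong state, two wrong predictions are needed before the predicted direction changes.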

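  20. 2-Bit Saturating Counter (Example) for (i = 0; i < m; i++) for (j = 0; j < n; j++) begin S1; S2; …; Sk end; [Flowchart: i ← 0; j ← 0; S1; S2; …; Sk; j ← j+1; test j < n (T: back to S1, NT: i ← i+1); test i < m; edge labels m ≤ 0 and n ≥ 0.] 1-bit predictor for the j < n branch, listed as (i, j): prediction / outcome • (0, 0): NT / T • (0, 1): T / T • (0, n): T / NT • (1, 0): NT / T • (1, 1): T / T • Total: 2 × m mispredictions Baer p. 134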

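  21. 2-Bit Saturating Counter (Example) for (i = 0; i < m; i++) for (j = 0; j < n; j++) begin S1; S2; …; Sk end; [Flowchart: i ← 0; j ← 0; S1; S2; …; Sk; j ← j+1; test j < n (T: back to S1, NT: i ← i+1); test i < m; edge labels m ≤ 0 and n ≥ 0.] Predictors for the j < n branch, listed as (i, j): 1-bit prediction / outcome; 2-bit state, prediction / outcome • (0, 0): NT / T; wNT, NT / T • (0, 1): T / T; sT, T / T • (0, n): T / NT; sT, T / NT • (1, 0): NT / T; wT, T / T • (1, 1): T / T; sT, T / T • Total for the 2-bit counter: m + 1 mispredictions Baer p. 134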
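The misprediction counts in this example can be checked with a short simulation. This is a sketch under stated assumptions: the inner-loop branch is modelled as n - 1 taken outcomes followed by one not-taken outcome per outer iteration, the 1-bit predictor starts at "not taken", and the 2-bit counter starts at weak not taken and moves one step per outcome. All function names are illustrative.

```python
def inner_branch_outcomes(m: int, n: int):
    """Yield outcomes of the loop-closing `j < n` test for m outer iterations.

    Per outer iteration: n - 1 taken outcomes, then one not-taken outcome.
    """
    for _ in range(m):
        for j in range(1, n + 1):
            yield j < n


def mispredictions_1bit(outcomes, initial: bool = False) -> int:
    """1-bit predictor: predict whatever the branch did last time."""
    prediction, misses = initial, 0
    for taken in outcomes:
        misses += prediction != taken
        prediction = taken
    return misses


def mispredictions_2bit(outcomes, state: int = 1) -> int:
    """2-bit saturating counter, states 0..3, predicting taken when state >= 2."""
    misses = 0
    for taken in outcomes:
        misses += (state >= 2) != taken
        state = min(3, state + 1) if taken else max(0, state - 1)
    return misses


m, n = 10, 4
print(mispredictions_1bit(inner_branch_outcomes(m, n)))  # 20, i.e. 2 * m
print(mispredictions_2bit(inner_branch_outcomes(m, n)))  # 11, i.e. m + 1
```

The 1-bit predictor is wrong at both the first and the last iteration of every inner loop, while the 2-bit counter only misses the loop exit (plus one warm-up miss).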

  22. Accuracy of Branch Prediction • Includes unconditional branches • Predictions are associated with branches after each branch’s first execution • Average of 26 traces (IBM 370, DEC PDP-11, CDC 6400) • Average of 32 traces (MIPS R2000, Sun SPARC, DEC VAX, Motorola 68000) • 3-bit counters yield only minor improvements • Fixed prediction: determined by the first execution of the branch. Baer p. 135

  23. Where to Store the Prediction • Need one (or two) bits for each possible branch address: a 32-bit address → 2^30 entries. • Storing prediction bits with the instructions: need to modify code every 5 instructions. • Use a cache (Branch Prediction Buffer, BPB): many more bits for tags than for predictions. • Solution: ditch the tags. Baer p. 136

  24. Pattern History Table (PHT) Use selected bits from the PC to index (or hash) the PHT. Each entry of the PHT stores the state of a finite state machine associated with a branch. Aliasing: multiple branches may index the same PHT entry. Performance degrades slightly. Baer p. 136
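A tagless PHT indexed by PC bits can be sketched as below. The details are assumptions made for illustration: 4-byte instructions (so the two low PC bits are dropped), 2-bit counters initialised to weak not taken, and the class name `BimodalPredictor`.

```python
class BimodalPredictor:
    """Tagless table of 2-bit counters indexed by low-order PC bits."""

    def __init__(self, index_bits: int = 10):
        self.mask = (1 << index_bits) - 1
        self.pht = [1] * (1 << index_bits)  # every counter starts weakly not taken

    def index(self, pc: int) -> int:
        # Drop the byte offset (assumes 4-byte instructions), keep index_bits bits.
        return (pc >> 2) & self.mask

    def predict(self, pc: int) -> bool:
        return self.pht[self.index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = self.index(pc)
        self.pht[i] = min(3, self.pht[i] + 1) if taken else max(0, self.pht[i] - 1)


p = BimodalPredictor(index_bits=4)
# Two different branches that share the same low-order PC bits alias to one entry.
print(p.index(0x1000) == p.index(0x1040))  # True
p.update(0x1000, True)
p.update(0x1000, True)
# The second branch now inherits the first branch's "taken" state.
print(p.predict(0x1040))  # True
```

Because there are no tags, the table cannot tell the two branches apart; that is the aliasing the slide refers to.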

  25. Accuracy of Bimodal Predictor (based on PHT) Based on 10 SPEC89 traces. Baer p. 137

  26. Where Is the Predictor Stored? Separate PHT Embedded in Instruction cache MIPS R10000: (512 counters) Alpha 21264: 1 counter per instruction? (2K counters) Sun UltraSPARC: 2 counters/cache line (2K counters) IBM PowerPC 620: (512 counters) AMD K5: 1 counter/cache line (1K counters) Intel Pentium: Combines PHT with Branch Target Buffer (512 entries) Baer p. 137

  27. Feedback and Recovery Feedback Baer p. 137

  28. Feedback: Bimodal Predictor • Feedback: update the 2-bit counter for the executing branch • When is the update done? • When the actual direction is found (EX stage): other predictions of the same branch may already have been made. • When the branch commits: even more predictions may have been made. • Speculatively, when the prediction is made: this only reinforces the prediction in a bimodal predictor. • EX/commit updating makes little difference in performance. Baer p. 137 Textbook typo (p. 137): choice for the timing of the “update”.

  29. Local vs. Global Predictor • Local: • Only use the history of the branch to be predicted • Global: • Use the history of other branches that precede the branch to be predicted. Baer p. 138

  30. Motivation for Global Prediction • Example from SPEC program eqntott: if (aa == 2) /* b1 */ aa = 0; if (bb == 2) /* b2 */ bb = 0; if(aa != bb){ /* b3 */ …. } If b1 and b2 are taken, then b3 is not taken. Baer p. 138

  31. Correlator Predictor Two-level predictor. History register: a 1 is inserted on the right when a branch is taken (0 otherwise); shifted-out bits are lost. Baer p. 139
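The history-register update described here is a shift and mask. A minimal sketch, where the function name `shift_history` and the explicit `bits` width parameter are assumptions for illustration:

```python
def shift_history(history: int, taken: bool, bits: int) -> int:
    """Shift the outcome in on the right; bits shifted out on the left are lost."""
    return ((history << 1) | int(taken)) & ((1 << bits) - 1)


h = 0b1011
h = shift_history(h, True, bits=4)
print(format(h, "04b"))  # 0111
h = shift_history(h, False, bits=4)
print(format(h, "04b"))  # 1110
```

In a two-level predictor the resulting register value is then used to select an entry in the PHT.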

  32. Update Problem in the Correlator Predictor • PHT is updated non-speculatively at commit stage. • What is the problem with non-speculative updates of the global register? Baer p. 139

  33. Updating the Global Register in the Correlator Predictor if (aa == 2) /* b1 */ aa = 0; if (bb == 2) /* b2 */ bb = 0; if(aa != bb){ /* b3 */ …. } Branches b1 and b2 are not included in the prediction of branch b3! Baer p. 139

  34. Updating the Global Register in the Correlator Predictor if (aa == 2) /* b1 */ aa = 0; if (bb == 2) /* b2 */ bb = 0; if(aa != bb){ /* b3 */ …. } • Mispredictions and cache misses affect the commit time of earlier branches. • Two consecutive predictions of a branch b may use different ancestors of b, even if the path leading to b is the same. Baer p. 139

  35. Solution to the Update Problem in the Correlator Predictor • Update Global Register speculatively when prediction is made. • New problem: • Need a repair mechanism • All bits after a misprediction are from branches in the wrong path. Baer p. 139

  36. Repair Mechanism for Global Register in the Correlator Predictor • Decode Stage: • Checkpoint current GR into a FIFO queue • Commit Stage: • H: head of the queue • The corresponding check-pointed GR is H. • Correct prediction: discard H • Incorrect prediction: shift branch outcome into H and make it the new GR. Baer p. 144
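The checkpoint-and-repair scheme can be sketched as follows. Several details are assumptions made to keep the sketch small: the checkpoint and the speculative update happen in the same step (`on_predict`), branches commit in program order, and on a misprediction all younger checkpoints are discarded because those branches are on the wrong path. The class and method names are illustrative.

```python
from collections import deque


class GlobalHistory:
    """Speculatively updated global register (GR) with FIFO checkpoints."""

    def __init__(self, bits: int = 8):
        self.bits = bits
        self.gr = 0
        self.checkpoints = deque()  # one saved GR per in-flight branch, oldest first

    def _shift(self, value: int, taken: bool) -> int:
        return ((value << 1) | int(taken)) & ((1 << self.bits) - 1)

    def on_predict(self, predicted_taken: bool) -> None:
        # Checkpoint the GR this branch saw, then update the GR speculatively.
        self.checkpoints.append(self.gr)
        self.gr = self._shift(self.gr, predicted_taken)

    def on_commit(self, predicted_taken: bool, actual_taken: bool) -> None:
        head = self.checkpoints.popleft()  # H: checkpoint of the committing branch
        if predicted_taken != actual_taken:
            # Repair: shift the real outcome into H and make it the new GR.
            self.gr = self._shift(head, actual_taken)
            # Assumption: younger branches are squashed, so drop their checkpoints.
            self.checkpoints.clear()


g = GlobalHistory(bits=4)
g.on_predict(True)    # branch A predicted taken
g.on_predict(True)    # branch B predicted taken (speculative GR is now 0b0011)
g.on_commit(predicted_taken=True, actual_taken=False)  # A was mispredicted
print(format(g.gr, "04b"))  # 0000
```

After the repair, the GR contains only the corrected outcome of branch A; the bit contributed by wrong-path branch B is gone.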

  37. Optimization to GR Checkpointing Put into the queue a GR that has the corrected bit shifted into it. Baer p. 144

  38. Issues with Correlator Predictor • For small PHTs • Performance is worse than local predictors • It does not use the location of the branch in the program for the prediction • May introduce excessive aliasing • Solution to the aliasing problem: • Reintroduce the PC in the indexing of PHT Baer p. 140

  39. gshare Predictor A common hash is an XOR function. Baer p. 141
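A sketch of the XOR hash used to index the PHT in gshare. The function name `gshare_index`, the assumption of 4-byte instructions, and the choice to XOR the low-order PC bits with the whole history register are illustrative; real designs differ in which bits they combine.

```python
def gshare_index(pc: int, global_history: int, index_bits: int) -> int:
    """Combine PC bits and global history with XOR to form a PHT index."""
    mask = (1 << index_bits) - 1
    # Drop the byte offset (assumes 4-byte instructions) before hashing.
    return ((pc >> 2) ^ global_history) & mask


# The same branch maps to different PHT entries under different histories.
print(gshare_index(0x1000, 0b1010, index_bits=4))  # 10
print(gshare_index(0x1000, 0b0101, index_bits=4))  # 5
# Different branches with the same history also map to different entries.
print(gshare_index(0x1004, 0b1010, index_bits=4))  # 11
```

Reintroducing the PC this way lets the predictor separate branches that would otherwise collide when indexed by global history alone.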

  40. Accuracy and Use of gshare • Almost perfect for SPEC FP95. • 0.83 accuracy for SPEC INT95 • 0.65 for program go • Used in: Sun UltraSPARC, IBM Power4, AMD K5 Baer p. 141

  41. Example [Flowchart: i ← 0; j ← 0; S1; S2; …; Sk; j ← j+1; test j < n (T: back to S1, NT: i ← i+1); test i < m; edge labels m ≤ 0 and n ≥ 0.] • Assume n = 4: • bimodal mispredicts 1 out of 5 times • global mispredicts from 0 to 5 times depending on other branches in the loop • This branch has a fixed pattern: “4 taken, 1 not taken” • How can this pattern be learned? • Remember the history of individual branches • We need predictors more attuned to the locality of individual branches Baer p. 142

  42. global-set predictor • First Level: A global shift register for correlations • Second Level: A set of multiple PHTs to prevent aliasing • expensive in terms of storage • must use few PHTs to be viable Baer p. 142/143

  43. set-global predictor • Set of Branch History registers (BHT) • A single global PHT Baer p. 143
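A sketch of a set-global organisation: per-branch history registers feeding one shared PHT. It is applied to a repeating "4 taken, 1 not taken" pattern to show that local history can capture it. The class name, the dictionary used as the BHT, the 4-bit history length, and the 2-bit counters are all assumptions for illustration.

```python
class LocalHistoryPredictor:
    """Per-branch history registers (BHT) indexing a single shared PHT."""

    def __init__(self, history_bits: int = 4):
        self.history_bits = history_bits
        self.bht = {}                           # branch PC -> local history register
        self.pht = [1] * (1 << history_bits)    # shared 2-bit counters, weak not taken

    def predict(self, pc: int) -> bool:
        return self.pht[self.bht.get(pc, 0)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        h = self.bht.get(pc, 0)
        c = self.pht[h]
        self.pht[h] = min(3, c + 1) if taken else max(0, c - 1)
        # Shift the outcome into this branch's own history register.
        self.bht[pc] = ((h << 1) | int(taken)) & ((1 << self.history_bits) - 1)


pattern = [True, True, True, True, False]   # "4 taken, 1 not taken"
p = LocalHistoryPredictor(history_bits=4)
late_misses = 0
for k in range(200):
    taken = pattern[k % len(pattern)]
    # Count mispredictions only after a warm-up period of 100 branches.
    if k >= 100 and p.predict(0x400) != taken:
        late_misses += 1
    p.update(0x400, taken)
print(late_misses)  # 0
```

With 4 bits of history, each of the five positions in the pattern is preceded by a distinct history, so every PHT entry sees a single consistent outcome and the loop exit is predicted correctly once the counters warm up.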

  44. set-set predictor • A set of branch history registers (BHT) • A set of PHTs Baer p. 143

  45. Predicting the Branch Target • When is the target of a branch computed? • In a superscalar architecture (e.g., the IA-32 of the Intel P6), after several pipeline stages. • What is the point of predicting direction early if we don’t know where the branch goes? • Need to also predict the branch target address. Baer p. 145

  46. Branch Target Buffer (BTB) • A cachelike storage that records branch addresses and associated targets • If there is a hit in BTB for branch predicted taken: • PC ← Target in BTB for branch Baer p. 146

  47. Integrated BTB-PHT • BTB needs much more space than the PHT • # of entries is limited by BTB. • BTB must be accessed in a single cycle Baer p. 146

  48. Decoupled BTB-PHT • Parallel BTB and PHT access • if the PHT says ‘taken’ and there is a hit in the BTB, then PC ← Address in BTB Baer p. 146
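The next-PC selection rule for a decoupled BTB and PHT can be sketched as below. The BTB is modelled as a plain dictionary from branch address to target, and falling through to the next sequential instruction on a BTB miss or a not-taken prediction is an assumption of this sketch, as are the function name and the 4-byte instruction size.

```python
def next_fetch_pc(pc: int, btb: dict, pht_says_taken: bool, instr_bytes: int = 4) -> int:
    """Pick the next fetch address from the BTB lookup and the PHT direction."""
    target = btb.get(pc)          # BTB lookup (None on a miss)
    if pht_says_taken and target is not None:
        return target             # predicted taken and the target is known
    return pc + instr_bytes       # otherwise fetch sequentially (assumed fallback)


btb = {0x400: 0x3F0}              # one recorded branch and its target
print(hex(next_fetch_pc(0x400, btb, pht_says_taken=True)))   # 0x3f0
print(hex(next_fetch_pc(0x400, btb, pht_says_taken=False)))  # 0x404
print(hex(next_fetch_pc(0x500, btb, pht_says_taken=True)))   # 0x504
```

In hardware the two lookups proceed in parallel; the sequential code here only models the final selection.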

  49. Decoupled BTB-PHT • For space efficiency: • Only taken branches are added to the BTB • They are added at the backend when the outcome is known. • IBM PowerPC 620: 256-entry, 2-way set-associative BTB; 2K-counter PHT Baer p. 146

  50. Integrating the BTB with the Branch History Table (BHT) • Most likely, it is not the same bit field from the PC that is used to index the BTB+BHT and to select the PHT. • Intel P6: 4-bit local history; 512 BTB entries; # of PHTs not published. • What happens on a BTB miss? “Backward taken, forward not taken” prediction. • The history of all branches needs to be recorded in BTB+BHT • Taken and not taken branches need to be included Baer p. 147
