
Microarchitecture of Superscalars (3) Branch Prediction

Microarchitecture of Superscalars (3) Branch Prediction. Dezső Sima Fall 2007. (Ver. 2.0). Dezső Sima, 2007. Branch prediction. 1. Introduction. 2. Basic branch prediction mechanisms. 3. Auxiliary branch prediction mechanisms. 4. Accessing the branch target path.

Presentation Transcript


  1. Microarchitecture of Superscalars (3) Branch Prediction. Dezső Sima, Fall 2007 (Ver. 2.0). Dezső Sima, 2007

  2. Branch prediction: 1. Introduction; 2. Basic branch prediction mechanisms; 3. Auxiliary branch prediction mechanisms; 4. Accessing the branch target path

  3. 1.1 The branch processing problem of pipelining (1) Figure 1.1: Straightforward processing of an unconditional branch on a four stage pipeline (stages F, D, E, W). Steps shown: branch fetching, branch detection, BTA calculation, BTI fetching. Result: 2 bubbles.

  4. 1.1 The branch processing problem of pipelining (2) Figure 1.2: Straightforward processing of a conditional branch on a four stage pipeline with immediate condition resolution (stages F, D, E, W). Steps shown: bc fetching, bc detection, condition checking (branch!), BTA calculation, BTI fetching. Result: 3 bubbles.

  5. 1.1 The branch processing problem of pipelining (3) Figure 1.3: Straightforward processing of a conditional branch on a four stage pipeline, with delayed condition resolution

  6. 1.1 The branch processing problem of pipelining (4) Figure 1.4: Number of pipeline stages in Intel’s and AMD’s processors (plot of the number of pipeline stages, 0 to 40, against the year, 1990 to 2005; processors shown: Pentium, K6, Pentium Pro, Athlon, Athlon-64, Core Duo, Conroe, Pentium 4 (~20), P4 Prescott (~30))

  7. 1.2 Branch statistics (1) Figure 1.5: Dynamic ratio of branches

  8. 1.2 Branch statistics (2) Figure 1.6: Ratio of the main instruction types. Source: Stephens et al., "Instruction level profiling and evaluation of the IBM RS/6000", Proc. 18th ISCA, pp. 137-146

  9. 1.2 Branch statistics (3) Figure 1.7: Grohoski’s estimate of branch statistics. Branches divide into unconditional branches (simple unconditional branch, branch to subroutine, return from subroutine) and conditional branches (loop-closing conditional branches, taken for the first (n-1) iterations, and other conditional branches, taken or not taken). Fractions shown: ~1/3, ~1/3, ~1/6, ~1/6; overall: not taken ~1/6, taken ~5/6. Source: Grohoski, G.F, IBM J. Res. Develop., 34 Jan. pp. 37-58

  10. 1.2 Branch statistics (3) Figure 1.8: Frequency of taken and not taken branches. Source: Sima, D. et al., ACA, Addison Wesley, 1997, p. 303

  11. 1.3 The principle of branch prediction (1) Figure 1.9: Correctly predicted conditional branch with delayed condition resolution on a four stage pipeline

  12. 1.3 The principle of branch prediction (2) Figure 1.10: Incorrectly predicted conditional branch with delayed condition resolution on a four stage pipeline. Steps shown: bc fetching, bc detection, branch prediction (branch!), BTA calculation, BTI fetching (speculative), BTI decode, dynamic stop, condition checking (no branch!). Result: a large number of bubbles.

  13. 1.3 The principle of branch prediction (3) Figure 1.11: Branch misprediction penalty on a long pipeline

  14. 1.4 Branch prediction accuracy/penalty (1) Figure 1.12: Branch prediction accuracy. BHT: Branch history table; BTAC: Branch target address cache; BTIC: Branch target instruction cache; IC: Instruction cache. Source: Sima, D. et al., ACA, Addison Wesley, 1997, p. 340

  15. 1.4 Prediction accuracy/penalty (2) Effective penalty of branch processing (simplified). fc: Probability (frequency) of correctly predicted branches; fm: Probability (frequency) of mispredicted branches; Pc: Penalty of correctly predicted branches; Pm: Penalty of mispredicted branches. Examples: PPro: 0.1 × 10 cycles = 1 cycle; P4 Willamette: 0.05 × 20 cycles = 1 cycle; P4 Prescott: 0.05 × 30 cycles = 1.5 cycles
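The example figures on this slide can be reproduced with a few lines of Python. This is a minimal sketch. It assumes that the simplified effective penalty is the weighted sum fc·Pc + fm·Pm with fc = 1 - fm, and that correctly predicted branches cost nothing (Pc = 0), which is what the three example values suggest. The function name and its parameters are illustrative and not part of the slides.

```python
def effective_penalty(fm, pm, pc=0.0):
    """Average branch penalty in cycles, assuming P = fc*Pc + fm*Pm and fc = 1 - fm."""
    fc = 1.0 - fm
    return fc * pc + fm * pm


# (misprediction frequency fm, misprediction penalty Pm in cycles) from the slide
examples = {
    "PPro": (0.10, 10),
    "P4 Willamette": (0.05, 20),
    "P4 Prescott": (0.05, 30),
}

for name, (fm, pm) in examples.items():
    print(f"{name}: {effective_penalty(fm, pm):.2f} cycles per branch")
```

Under these assumptions a longer pipeline needs a proportionally lower misprediction rate just to keep the same effective penalty.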

  16. 2. Basic branch prediction mechanisms 2.1 Introduction (1) Branch processing: branch detection, branch prediction, accessing the branch target path

  17. 2.1 Introduction (2) Branch prediction mechanisms: basic branch prediction mechanism, auxiliary branch prediction mechanism

  18. 2.1 Introduction (2) Basic branch prediction mechanism Processor based Compiler hints Local

  19. Figure 2.1.: Local prediction. Prediction depends only on the behaviour of the branch considered.

  20. 2.1 Introduction (2) Basic branch prediction mechanism Processor based Compiler hints Local Global (2-level)

  21. Figure 2.2.: Global prediction (two example execution paths, Path 1 and Path 2, with different branch outcomes). Prediction depends on the actual execution path, that is, on all branches executed.

  22. 2.1 Introduction (2) Basic branch prediction mechanism: Processor based, Compiler hints; Local, Global (2-level), Combined (Choice prediction)

  23. 2.2. Local prediction (1) Local prediction 1-level 2-level

  24. 2.2. Local prediction (2) 1-level (local) prediction. Fixed prediction (always the same prediction): 'Always not taken' approach, 'Always taken' approach. Static prediction (based on the object code): displacement-based, opcode-based. Dynamic prediction (based on the execution history): 1-bit prediction. Processors listed: 80486 (1989), MC 68040 (1990), SuperSparc (1992), R4000 (1992), R8000 (1994), POWER1 (1990), PPC 601 (1993), PPC 601 (1993), POWER2 (1993). PPC: PowerPC

  25. 2.2. Local prediction (3) Figure 2.3: Principle of the 1-bit dynamic prediction. The IFA selects an entry x of the BHT (Branch History Table). x = 0: sequential continuation; x = 1: branch.

  26. 2.2. Local prediction (4) Figure 2.4: State transition diagram of the 1-bit dynamic prediction. States: Taken, Not taken. T: Branch has been taken; NT: Branch has not been taken.
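The 1-bit scheme of Figures 2.3 and 2.4 can be sketched in Python as follows. This is a sketch under stated assumptions: the table size, the use of the low-order part of the IFA as the index, and all class and function names are illustrative choices, not details taken from the slides.

```python
class OneBitPredictor:
    """1-bit dynamic prediction: one history bit per BHT entry."""

    def __init__(self, entries=1024):
        self.entries = entries
        self.bht = [0] * entries  # 0: sequential continuation, 1: branch

    def _index(self, ifa):
        # Assumption: the low-order part of the IFA selects the entry.
        return ifa % self.entries

    def predict(self, ifa):
        return self.bht[self._index(ifa)] == 1

    def update(self, ifa, taken):
        # The bit simply records the last outcome (T -> Taken, NT -> Not taken).
        self.bht[self._index(ifa)] = 1 if taken else 0


def count_mispredictions(predictor, ifa, outcomes):
    """Predict, compare with the actual outcome, then update the table."""
    misses = 0
    for taken in outcomes:
        if predictor.predict(ifa) != taken:
            misses += 1
        predictor.update(ifa, taken)
    return misses


# A loop-closing branch: taken three times, then not taken; the loop runs twice.
loop_branch = ([True] * 3 + [False]) * 2
one_bit_misses = count_mispredictions(OneBitPredictor(), 0x400, loop_branch)
print(one_bit_misses)
```

With a single history bit, a loop-closing branch is mispredicted both at the loop exit and at the next loop entry, which is the weakness the 2-bit scheme addresses.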

  27. 2.2. Local prediction (6) 1-level (local) prediction. Fixed prediction (always the same prediction): 'Always not taken' approach, 'Always taken' approach. Static prediction (based on the object code): displacement-based, opcode-based. Dynamic prediction (based on the execution history): 1-bit prediction, 2-bit prediction. Processors listed: 80486 (1989), MC 68040 (1990), SuperSparc (1992), R4000 (1992), R8000 (1994), POWER1 (1990), PPC 601 (1993), PPC 601 (1993), POWER2 (1993). Later models shown with 2-bit prediction: Pentium (1993), MC 68060 (1993), UltraSparc (1995), R10000 (1996), PPC 604 (1995), PPC 620 (1996). PPC: PowerPC

  28. 2.2. Local prediction (7) Figure 2.6: Principle of the 2-bit dynamic prediction. The IFA selects an entry xx of the BHT (Branch History Table). xx = 00, 01: sequential continuation; xx = 10, 11: branch.

  29. 2.2. Local prediction (8) Figure 2.7: State transition diagram of the most frequently used 2-bit dynamic prediction (Smith algorithm). States: 00 strongly not taken, 01 weakly not taken, 10 weakly taken, 11 strongly taken. Prediction "Taken" for 10 and 11; prediction "Not Taken" for 00 and 01. Transitions are labelled with what the branch has been: AT: actually taken; ANT: actually not taken. Initialised when a branch is taken first.
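A 2-bit saturating counter of the kind shown in Figure 2.7 can be sketched as below. The starting state (11, strongly taken) is an assumption made for the demonstration, as are the class name and the example outcome sequence.

```python
class TwoBitCounter:
    """2-bit saturating counter: 00/01 predict not taken, 10/11 predict taken."""

    def __init__(self, state=0b11):
        # Assumption: start in 'strongly taken'; the slide only states that the
        # entry is initialised when the branch is first taken.
        self.state = state

    def predict(self):
        return self.state >= 0b10

    def update(self, taken):
        if taken:  # AT: move towards 11 and saturate there
            self.state = min(self.state + 1, 0b11)
        else:      # ANT: move towards 00 and saturate there
            self.state = max(self.state - 1, 0b00)


# A loop-closing branch: taken three times, then not taken; the loop runs twice.
loop_branch = ([True] * 3 + [False]) * 2

counter = TwoBitCounter()
two_bit_misses = 0
for taken in loop_branch:
    if counter.predict() != taken:
        two_bit_misses += 1
    counter.update(taken)

print(two_bit_misses, format(counter.state, "02b"))
```

A single not-taken outcome only moves the counter from 11 to 10, so the prediction stays "Taken" when the loop is entered again; only the loop exit itself is mispredicted.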

  30. 2.2. Local prediction (5) Accessing BHTs/BTACs. Indexed access: for large tables most branches will map to a unique entry; for smaller tables multiple branches may map to the same entry, resulting in interferences and thus in degraded prediction accuracy. Cache-like access (direct / set associative, e.g. two-way set associative): reduces interferences but increases cost. Associative access: avoids interference but strongly increases cost. Examples: 128*4 way BHT/BTAC (Pentium Pro), 1K*4 way BHT/BTAC (Pentium II, III, 4), 128*2 way BTAC (Power3), 64 entry BTAC (PPC 604), 16K entry local BHT (Power4), 16K entry global BHT (Power4), 16K entry selector table (Power4). Figure 2.5: Alternatives for accessing Branch History Tables or Branch Target Address Buffers

  31. 2.2. Local prediction (9) 1-level (local) prediction. Fixed prediction (always the same prediction): 'Always not taken' approach, 'Always taken' approach. Static prediction (based on the object code): displacement-based, opcode-based. Dynamic prediction (based on the execution history): 1-bit prediction, 2-bit prediction, 3-bit prediction. Processors listed: 80486 (1989), MC 68040 (1990), SuperSparc (1992), R4000 (1992), R8000 (1994), POWER1 (1990), PPC 601 (1993), PPC 601 (1993), POWER2 (1993). Later models shown with 2-bit prediction: Pentium (1993), MC 68060 (1993), UltraSparc (1995), R10000 (1996), PPC 604 (1995), PPC 620 (1996). PPC: PowerPC. Figure 2.8: Early branch prediction mechanisms and their trends indicated by subsequent models of pipelined, 1. and 2. generation superscalars

  32. 2.2. Local prediction (10) Local prediction: 1-level, 2-level. Fixed prediction (always the same prediction), Static prediction (based on the object code), Dynamic prediction (based on the execution history)

  33. 2.2. Local prediction (11) 2-level local branch prediction (1. level: branch patterns, 2. level: history bits). Individual counters, with individual history tables for different patterns (Pentium Pro): tables of e.g. 128×4 bit (4 ways each) and e.g. 16×2 bit. Shared counters, with a shared global history table for all patterns (Alpha 21264): tables of e.g. 1K×10 bit and e.g. 1K×3 bit. The 21264 uses 3-bit saturating counters whose most significant bit provides the prediction.
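The shared-counter variant can be sketched as follows, using the table sizes quoted for the Alpha 21264 (1K × 10-bit local histories, 1K × 3-bit counters). This is an illustrative model only, not a description of the actual hardware: the indexing by the low-order part of the IFA, the all-zero initial state and all names are assumptions.

```python
class TwoLevelLocalPredictor:
    """2-level local prediction with one counter table shared by all patterns."""

    def __init__(self, history_entries=1024, history_bits=10, counter_bits=3):
        self.history_entries = history_entries
        self.history_mask = (1 << history_bits) - 1
        self.counter_max = (1 << counter_bits) - 1
        self.msb = 1 << (counter_bits - 1)
        self.histories = [0] * history_entries     # level 1: per-branch outcome pattern
        self.counters = [0] * (1 << history_bits)  # level 2: saturating counters

    def predict(self, ifa):
        pattern = self.histories[ifa % self.history_entries]
        # The most significant counter bit provides the prediction.
        return bool(self.counters[pattern] & self.msb)

    def update(self, ifa, taken):
        slot = ifa % self.history_entries
        pattern = self.histories[slot]
        count = self.counters[pattern]
        if taken:
            self.counters[pattern] = min(count + 1, self.counter_max)
        else:
            self.counters[pattern] = max(count - 1, 0)
        # Shift the newest outcome into the branch's local history.
        self.histories[slot] = ((pattern << 1) | int(taken)) & self.history_mask


predictor = TwoLevelLocalPredictor()
outcomes = ([True] * 3 + [False]) * 100  # loop-closing branch, taken 3 times out of 4
missed = []
for taken in outcomes:
    missed.append(predictor.predict(0x400) != taken)
    predictor.update(0x400, taken)

local_tail_misses = sum(missed[-100:])
print(local_tail_misses)
```

Because each recorded pattern of this periodic branch is always followed by the same outcome, the counters selected by the patterns settle and, after a warm-up phase, even the loop exit is predicted.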

  34. 2.2. Local prediction (12) Figure 2.9.: The principle of Pentium Pro’s 128x4 way set associative BHT. Labels shown: BTA (linear) supplying tag and index; sets 0 to 127; Way 0 to Way 3, each with tags and a 4-bit history; counters 0 to 15; xx: 00/01 not taken, 10/11 taken.

  35. 2.2. Local prediction (13) Figure 2.10.: The actual layout of Pentium Pro’s 128x4 way set associative BHT (entries 0 to 127; each way holds Tag, H and C fields)

  36. 2.3. Global prediction (1) Basic branch prediction mechanism: Processor based, Compiler hints; Local, Global (2-level), Combined (Choice prediction)

  37. 2.3. Global prediction (1) Global prediction Simple global

  38. 2.3. Global prediction (1) Figure 2.11.: Simple global prediction. The global history (a shift register, e.g. 0 1 1 0 0 1 1) indexes the BHT; the selected entry x holds the branch history.
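Simple global prediction can be sketched as a shift register of recent outcomes that directly indexes a table of counters. The sketch below makes several assumptions that are not stated on the slide: 2-bit saturating counters as table entries, a "weakly not taken" initial state, and a 4-bit history for the demonstration. All names are illustrative.

```python
class GlobalPredictor:
    """Simple global prediction: the global history alone indexes the BHT."""

    def __init__(self, history_bits=12):
        self.mask = (1 << history_bits) - 1
        self.history = 0                      # shift register of recent outcomes
        self.bht = [1] * (1 << history_bits)  # assumed 2-bit counters, weakly not taken

    def predict(self):
        return self.bht[self.history] >= 2

    def update(self, taken):
        count = self.bht[self.history]
        self.bht[self.history] = min(count + 1, 3) if taken else max(count - 1, 0)
        self.history = ((self.history << 1) | int(taken)) & self.mask


predictor = GlobalPredictor(history_bits=4)
missed = []
for i in range(200):
    a_taken = (i % 2 == 0)            # branch A alternates taken / not taken
    for taken in (a_taken, a_taken):  # branch B always repeats A's outcome
        missed.append(predictor.predict() != taken)
        predictor.update(taken)

global_tail_misses = sum(missed[-100:])
print(global_tail_misses)
```

In this example the outcome of branch B is fully determined by the preceding outcome of branch A, which is the kind of correlation a purely local scheme cannot see.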

  39. 2.3. Global prediction (1) Global prediction Simple global Gshare

  40. 2.3. Global prediction (1) Figure 2.12.: Principle of the Gshare prediction. The global history (e.g. 0 1 1 0 0 1 1) is XORed with bits of the IFA (e.g. ... 1 0 0 1 1 0 0); the result indexes the BHT, whose entry x holds the branch history.
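The Gshare index computation can be sketched as below, using the bit patterns from Figure 2.12. The 7-bit index width, the 2-bit counters and their initial value are assumptions for illustration, and the class and method names are hypothetical.

```python
class GsharePredictor:
    """Gshare: the BHT index is the XOR of the global history and IFA bits."""

    def __init__(self, index_bits=12):
        self.mask = (1 << index_bits) - 1
        self.history = 0
        self.bht = [1] * (1 << index_bits)  # assumed 2-bit counters, weakly not taken

    def index(self, ifa):
        return (ifa ^ self.history) & self.mask

    def predict(self, ifa):
        return self.bht[self.index(ifa)] >= 2

    def update(self, ifa, taken):
        i = self.index(ifa)
        self.bht[i] = min(self.bht[i] + 1, 3) if taken else max(self.bht[i] - 1, 0)
        self.history = ((self.history << 1) | int(taken)) & self.mask


predictor = GsharePredictor(index_bits=7)
predictor.history = 0b0110011                 # global history as drawn in the figure
gshare_index = predictor.index(0b1001100)     # IFA bits as drawn in the figure
print(format(gshare_index, "07b"))
```

Hashing the address into the index lets different branches that see the same global history use different table entries, at the cost of occasional collisions.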

  41. 2.3. Global prediction (1) Global prediction Simple global Gshare Gselect

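  42. 2.3. Global prediction (1) Figure 2.13.: Principle of the Gselect prediction. Bits of the global history and bits of the IFA together select the BHT entry x, which holds the branch history. Values shown in the figure: global history 0 1 1 0 0 1 1, IFA ... 1 0, and 0 1 1 0.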
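Gselect is commonly described as concatenating some global history bits with some IFA bits to form the BHT index. The following sketch assumes that definition. Which bits are taken, how many of each, and the function name are assumptions, since the figure does not spell them out.

```python
def gselect_index(history, ifa, history_bits=2, ifa_bits=2):
    """Concatenate the low history bits (upper part) with the low IFA bits (lower part)."""
    history_part = history & ((1 << history_bits) - 1)
    ifa_part = ifa & ((1 << ifa_bits) - 1)
    return (history_part << ifa_bits) | ifa_part


# Example values: a 7-bit global history and an IFA whose low bits are 1 0.
example_index = gselect_index(history=0b0110011, ifa=0b10)
print(format(example_index, "04b"))
```

Compared with XOR hashing, concatenation keeps history and address information separate in the index, but for a fixed table size it leaves fewer bits for each.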

  43. 2.4. Combined prediction (1) Basic branch prediction mechanism: Processor based, Compiler hints; Local, Global (2-level), Combined (Choice prediction)

  44. 2.4. Combined prediction (2) Figure 2.14.: Principle of the combined local and global prediction (as used in the Alpha 21264, or the POWER 4). Inputs shown: IFA, IFA, global history. Tables shown: local BHT, best choice BHT, global BHT. The best choice BHT selects between the local prediction and the global prediction to give the resulting prediction. Actual prediction (for updating).
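The selection principle can be sketched as below. This is a simplified model under several assumptions: all three tables hold 2-bit saturating counters, the local and the choice table are indexed by the IFA, the global table by the global history, and the choice counter is only trained when the two predictions disagree. Real designs differ in these details (the Alpha 21264, for instance, references its choice table with the global history). All names are illustrative.

```python
def bump(counter, up):
    """Move a 2-bit saturating counter one step up or down."""
    return min(counter + 1, 3) if up else max(counter - 1, 0)


class CombinedPredictor:
    """Combined prediction: a choice table picks the local or the global prediction."""

    def __init__(self, index_bits=10):
        size = 1 << index_bits
        self.mask = size - 1
        self.history = 0
        self.local_bht = [1] * size
        self.global_bht = [1] * size
        self.choice = [1] * size  # >= 2: use the global prediction

    def _both(self, ifa):
        local_pred = self.local_bht[ifa & self.mask] >= 2
        global_pred = self.global_bht[self.history] >= 2
        return local_pred, global_pred

    def predict(self, ifa):
        local_pred, global_pred = self._both(ifa)
        return global_pred if self.choice[ifa & self.mask] >= 2 else local_pred

    def update(self, ifa, taken):
        local_pred, global_pred = self._both(ifa)
        slot = ifa & self.mask
        if local_pred != global_pred:
            # Train the selector towards whichever predictor was right.
            self.choice[slot] = bump(self.choice[slot], global_pred == taken)
        self.local_bht[slot] = bump(self.local_bht[slot], taken)
        self.global_bht[self.history] = bump(self.global_bht[self.history], taken)
        self.history = ((self.history << 1) | int(taken)) & self.mask


predictor = CombinedPredictor(index_bits=4)
missed = []
for i in range(200):
    taken = (i % 2 == 0)  # strictly alternating branch
    missed.append(predictor.predict(0x8) != taken)
    predictor.update(0x8, taken)

combined_tail_misses = sum(missed[-100:])
print(combined_tail_misses)
```

For a strictly alternating branch a lone 2-bit counter keeps flipping between its two weak states, while the history-indexed table learns the pattern, so the selector moves towards the global prediction for this branch.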

  45. 2.4. Combined prediction (3) Figure 2.15.: Implementation alternatives of the combined prediction. Alpha 21264: 1. prediction: 2-level local dynamic prediction with a shared counter table for all patterns (1K * 10 bits / 1K * 3 bits); 2. prediction: simple 2-level global prediction (12-bit global history / 4K * 2 bits); Choice: global history referenced choice table (12-bit global history / 4K * 2 bits).

  46. 2.4. Combined prediction (4) • Minimum branch penalty: 7 cycles • Typical branch penalty: 11+ cycles (IQ delay) • 48K bits of target addresses stored in I-cache • 32-entry return address stack • Predictor tables are reset on a context switch Figure 2.16.: The combined predictor of the Alpha 21264 Source: Microprocessor Report, 10/28/96

  47. 2.4. Combined prediction (5) Figure 2.17.: Implementation alternatives of the combined prediction. Alpha 21264: 1. prediction: 2-level local dynamic prediction with a shared counter table for all patterns (1K * 10 bits / 1K * 3 bits); 2. prediction: simple 2-level global prediction (12-bit global history / 4K * 2 bits); Choice: global history referenced choice table (12-bit global history / 4K * 2 bits). POWER 4: 1. prediction: 1-level local dynamic prediction (16K * 1-bit); 2. prediction: Gshare global prediction (11-bit global history is hashed with the IFA, 16K * 1-bit counter table); Choice: accessed in the same way as the 2-level global counter table (16K * 1-bit).

  48. 2.4. Combined prediction (6) Figure 2.18.: The principle of the combined predictor of the POWER 4. Labels shown: 11-bit global history (1 bit per group), XORed with IFA bits; three 16K*1bit tables (Local History, Global History, Selector Table), each accessed with a 14-bit index; the selector table selects the better of the local prediction and the global prediction; update path.

  49. 2.5. Overview of the basic branch prediction mechanisms Figure 2.20.: Trends of branch prediction schemes used in 2. and 3. generation superscalars

  50. 3. Auxiliary branch prediction mechanisms. Figure 3.1.: Overview of auxiliary branch prediction mechanisms in 2. and 3. generation superscalars. Mechanisms shown include the backup use of static prediction and the RAS (Return Address Stack). Processors listed: Pentium, Pentium Pro, P4 Willamette/Northwood, P4 Prescott, K6, K7, K8, PPC 604, PPC 620, POWER 3, POWER 4, POWER 5, Alpha 21164, Alpha 21264, PA-8000, PA-8500/8700, UltraSPARC-III. Notes: 1: 1. generation superscalars; 2: Supported by compiler hints.
