Microarchitecture of Superscalars (3) Branch Prediction

Dezső Sima, Fall 2007 (Ver. 2.0)

Branch prediction
1. Introduction
2. Basic branch prediction mechanisms
3. Auxiliary branch prediction mechanisms
4. Accessing the branch target path

1.1 The branch processing problem of pipelining (1)
Figure 1.1: Straightforward processing of an unconditional branch on a four stage pipeline (stages F, D, E, W). Diagram labels: branch fetching, branch detection, BTA (branch target address) calculation, BTI (branch target instruction) fetching; 2 bubbles.

1.1 The branch processing problem of pipelining (2)
Figure 1.2: Straightforward processing of a conditional branch on a four stage pipeline with immediate condition resolution (stages F, D, E, W). Diagram labels: bc fetching, bc detection, condition checking (branch!), BTA calculation, BTI fetching; 3 bubbles.

1.1 The branch processing problem of pipelining (3) Figure 1.3: Straightforward processing of a conditional branch on a four stage pipeline, with delayed condition resolution

1.1 The branch processing problem of pipelining (4)
Figure 1.4: Number of pipeline stages in Intel’s and AMD’s processors (y-axis: number of pipeline stages, 10 to 40; x-axis: year, 1990 to 2005). Labelled points include P4 Prescott (~30) and Pentium 4 (~20); the Pentium, K6, Pentium Pro, Athlon, Athlon-64, Core Duo and Conroe are also plotted, between about 5 and 14 stages.

1.2 Branch statistics (1) Figure 1.5: Dynamic ratio of branches

1.2 Branch statistics (2) Figure 1.6: Ratio of the main instruction types Source: Stephens et al., "Instruction level profiling and evaluation of the IBM RS/6000", Proc. 18th ISCA, pp. 137-146

1.2 Branch statistics (3)
Figure 1.7: Grohoski’s estimate of branch statistics
Branches
• Unconditional branches (~ 1/3): simple unconditional branch, branch to subroutine, return from subroutine. Taken.
• Conditional branches
  • Loop-closing conditional branch (~ 1/3): taken for the first (n-1) iterations
  • Other conditional branches (~ 1/3): taken ~ 1/6, not taken ~ 1/6
Overall: taken ~ 5/6, not taken ~ 1/6
Source: Grohoski, G.F, IBM J. Res. Develop., 34 Jan. pp. 37-58

1.2 Branch statistics (3) Figure 1.8: Frequency of taken and not taken branches Source: Sima, D. et al., ACA, Addison Wesley, 1997, p. 303

1.3 The principle of branch prediction (1) Figure 1.9: Correctly predicted conditional branch with delayed condition resolution on a four stage pipeline

1.3 The principle of branch prediction (2)
Figure 1.10: Incorrectly predicted conditional branch with delayed condition resolution on a four stage pipeline (stages F, D, E, W). Diagram labels: bc fetching; bc detection, branch prediction (branch!) and BTA calculation; BTI fetching and BTI decode (speculative); dynamic stop while condition checking is repeated; condition checking (no branch!); a large number of bubbles.

1.3 The principle of branch prediction (3) Figure 1.11: Branch misprediction penalty on a long pipeline

1.4 Branch prediction accuracy/penalty (1)
Figure 1.12: Branch prediction accuracy
BHT: Branch history table
BTAC: Branch target address cache
BTIC: Branch target instruction cache
IC: Instruction cache
Source: Sima, D. et al., ACA, Addison Wesley, 1997, p. 340

1.4 Prediction accuracy/penalty (2)
Effective penalty of branch processing (simplified):
$P = f_c \cdot P_c + f_m \cdot P_m$
fc: Probability (frequency) of correctly predicted branches
fm: Probability (frequency) of mispredicted branches
Pc: Penalty of correctly predicted branches
Pm: Penalty of mispredicted branches
Examples (taking Pc ≈ 0, so that P ≈ fm · Pm):
• PPro: fm = 0.1, Pm = 10 cycles, P ≈ 1 cycle
• P4 Willamette: fm = 0.05, Pm = 20 cycles, P ≈ 1 cycle
• P4 Prescott: fm = 0.05, Pm = 30 cycles, P ≈ 1.5 cycles
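The example figures on this slide are consistent with a weighted sum of the two penalties, fc · Pc + fm · Pm, with the penalty of correctly predicted branches taken as zero. The sketch below reproduces them under that assumption; the function name, the default `p_c=0.0` and the relation fc = 1 - fm are illustrative choices, not taken from the slide.

```python
def effective_penalty(f_m, p_m, p_c=0.0):
    """Return f_c * P_c + f_m * P_m in cycles, assuming f_c = 1 - f_m."""
    f_c = 1.0 - f_m
    return f_c * p_c + f_m * p_m

# (f_m, P_m in cycles) as listed in the slide's examples
examples = {
    "PPro": (0.1, 10),
    "P4 Willamette": (0.05, 20),
    "P4 Prescott": (0.05, 30),
}

penalties = {name: effective_penalty(f_m, p_m) for name, (f_m, p_m) in examples.items()}
for name, cycles in penalties.items():
    print(f"{name}: {cycles:.2f} cycles per branch")
```

The comparison suggests why longer pipelines need better predictors: halving the misprediction rate between the PPro and the P4 Willamette only keeps the effective penalty level once the misprediction penalty doubles.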

2. Basic branch prediction mechanisms
2.1 Introduction (1)
Branch processing
• Branch detection
• Branch prediction
• Accessing the branch target path

2.1 Introduction (2)
Branch prediction mechanisms
• Basic branch prediction mechanism
• Auxiliary branch prediction mechanism

2.1 Introduction (2)
Basic branch prediction mechanism
• Processor based
  • Local
• Compiler hints

Figure 2.1: Local prediction. Prediction depends only on the behaviour of the branch considered.

2.1 Introduction (2)
Basic branch prediction mechanism
• Processor based
  • Local
  • Global (2-level)
• Compiler hints

Figure 2.2: Global prediction. Prediction depends on the actual execution path (Path 1 or Path 2 in the diagram), that is, on all branches executed.

2.1 Introduction (2)
Basic branch prediction mechanism
• Processor based
  • Local
  • Global (2-level)
  • Combined (choice prediction)
• Compiler hints

2.2. Local prediction (1) Local prediction 1-level 2-level

2.2. Local prediction (2)
1-level (local) prediction
• Fixed prediction (always the same prediction): 'Always not taken' approach, 'Always taken' approach
• Static prediction (based on the object code): displacement-based, opcode-based
• Dynamic prediction (based on the execution history): 1-bit prediction
Processors shown: 80486 (1989), MC 68040 (1990), SuperSparc (1992), R4000 (1992), R8000 (1994), POWER1 (1990), POWER2 (1993), PPC 601 (1993, shown twice)
PPC: PowerPC

2.2. Local prediction (3)
Figure 2.3: Principle of the 1-bit dynamic prediction. The IFA selects an entry x of the BHT (Branch History Table), where 0: sequential continuation, 1: branch.

2.2. Local prediction (4)
Figure 2.4: State transition diagram of the 1-bit dynamic prediction. Two states, "Taken" and "Not taken": T leads to "Taken", NT leads to "Not taken".
T: Branch has been taken
NT: Branch has not been taken
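The 1-bit scheme of Figures 2.3 and 2.4 can be sketched in a few lines: each BHT entry remembers only the last outcome of the branch mapped to it. The class and function names, the 16-entry table and the modulo indexing by the IFA are assumptions made for this illustration.

```python
class OneBitPredictor:
    """1-bit dynamic prediction: predict whatever the branch did last time."""

    def __init__(self, entries=16):
        self.entries = entries
        self.bht = [0] * entries  # 0: sequential continuation, 1: branch

    def _index(self, ifa):
        return ifa % self.entries  # assumed indexing scheme

    def predict(self, ifa):
        return self.bht[self._index(ifa)] == 1

    def update(self, ifa, taken):
        self.bht[self._index(ifa)] = 1 if taken else 0


def count_mispredictions(predictor, ifa, outcomes):
    misses = 0
    for taken in outcomes:
        if predictor.predict(ifa) != taken:
            misses += 1
        predictor.update(ifa, taken)
    return misses


# A loop-closing branch: taken three times, then not taken, for two passes.
loop_branch = [True, True, True, False] * 2
one_bit_misses = count_mispredictions(OneBitPredictor(), 0x40, loop_branch)
print(one_bit_misses)
```

Each pass through the loop costs two mispredictions here: one at the loop exit and one when the loop is entered again, because the single bit was flipped by the exit.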

2.2. Local prediction (6)
1-level (local) prediction
• Fixed prediction (always the same prediction): 'Always not taken' approach, 'Always taken' approach
• Static prediction (based on the object code): displacement-based, opcode-based
• Dynamic prediction (based on the execution history): 1-bit prediction, 2-bit prediction
Earlier processors shown: 80486 (1989), MC 68040 (1990), SuperSparc (1992), R4000 (1992), R8000 (1994), POWER1 (1990), POWER2 (1993), PPC 601 (1993, shown twice)
Processors added with 2-bit prediction: Pentium (1993), MC 68060 (1993), UltraSparc (1995), R10000 (1996), PPC 604 (1995), PPC 620 (1996)
PPC: PowerPC

2.2. Local prediction (7)
Figure 2.6: Principle of the 2-bit dynamic prediction. The IFA selects an entry xx of the BHT, where 00, 01: sequential continuation, 10, 11: branch.
BHT: Branch History Table

2.2. Local prediction (8)
Figure 2.7: State transition diagram of the most frequently used 2-bit dynamic prediction (Smith algorithm)
States: 11 strongly taken, 10 weakly taken, 01 weakly not taken, 00 strongly not taken
Prediction: "Taken" in states 10 and 11, "Not Taken" in states 00 and 01
Transitions: AT moves one state towards 11, ANT moves one state towards 00
Initialised when a branch is taken first
Branch has been: AT: actually taken, ANT: actually not taken
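The 2-bit scheme of Figures 2.6 and 2.7 is commonly realised as a saturating counter, sketched below. The table size, the modulo indexing and the initial state 10 (weakly taken) are assumptions for this illustration; the slide only says that an entry is initialised when a branch is first taken.

```python
class TwoBitPredictor:
    """2-bit saturating counters: 00, 01 predict not taken; 10, 11 predict taken."""

    def __init__(self, entries=16, initial=0b10):
        self.entries = entries
        self.bht = [initial] * entries  # assumed initial state: weakly taken

    def predict(self, ifa):
        return self.bht[ifa % self.entries] >= 0b10

    def update(self, ifa, taken):
        i = ifa % self.entries
        if taken:
            self.bht[i] = min(self.bht[i] + 1, 0b11)  # saturate at strongly taken
        else:
            self.bht[i] = max(self.bht[i] - 1, 0b00)  # saturate at strongly not taken


def count_mispredictions(predictor, ifa, outcomes):
    misses = 0
    for taken in outcomes:
        if predictor.predict(ifa) != taken:
            misses += 1
        predictor.update(ifa, taken)
    return misses


# Same loop-closing branch as a 1-bit scheme would struggle with:
# taken three times, then not taken, for two passes.
loop_branch = [True, True, True, False] * 2
two_bit_misses = count_mispredictions(TwoBitPredictor(), 0x40, loop_branch)
print(two_bit_misses)
```

A single not-taken outcome only weakens the prediction, so the loop exit costs one misprediction per pass and re-entering the loop costs none.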

2.2. Local prediction (5)
Accessing BHTs/BTACs
• Indexed access: the IFA indexes the BHT (counters) directly. For large tables most branches will map to a unique entry. For smaller tables multiple branches may map to the same entry, resulting in interferences and thus in degraded prediction accuracy. Examples: 16K entry local BHT (Power4), 16K entry global BHT (Power4), 16K entry selector table (Power4)
• Cache-like access (direct / set associative, e.g. two-way set associative): the IFA supplies an index and a tag. Reduces interferences but increases cost. Examples: 128*4 way BHT/BTAC (Pentium Pro), 1K*4 way BHT/BTAC (Pentium II, III, 4), 128*2 way BTAC (Power3)
• Associative access: the IFA is compared with the tags of all entries. Avoids interference but strongly increases cost. Example: 64 entry BTAC (PPC 604)
Figure 2.5: Alternatives for accessing Branch History Tables or Branch Target Address Buffers
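The interference that Figure 2.5 attributes to small indexed tables can be illustrated with two hypothetical branch addresses. The helper name, the addresses and the use of the low-order address bits as the index are assumptions; real designs typically drop alignment bits first.

```python
def bht_index(ifa, entries):
    """Indexed access: low-order address bits select the entry (entries is a power of two)."""
    return ifa & (entries - 1)

# Two hypothetical branch addresses
branch_a, branch_b = 0x1040, 0x2040

# Small table: both branches fall on the same entry and disturb each other's history.
small = (bht_index(branch_a, 64), bht_index(branch_b, 64))

# Large table (16K entries): each branch gets its own entry.
large = (bht_index(branch_a, 16 * 1024), bht_index(branch_b, 16 * 1024))

print(small, small[0] == small[1])
print(large, large[0] == large[1])
```

A cache-like or associative organisation would additionally store the remaining address bits as a tag, so that a lookup by the second branch could be recognised as a miss instead of silently reusing the first branch's history.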

2.2. Local prediction (9)
1-level (local) prediction
• Fixed prediction (always the same prediction): 'Always not taken' approach, 'Always taken' approach
• Static prediction (based on the object code): displacement-based, opcode-based
• Dynamic prediction (based on the execution history): 1-bit prediction, 2-bit prediction, 3-bit prediction
Earlier processors shown: 80486 (1989), MC 68040 (1990), SuperSparc (1992), R4000 (1992), R8000 (1994), POWER1 (1990), POWER2 (1993), PPC 601 (1993, shown twice)
Processors added with 2-bit prediction: Pentium (1993), MC 68060 (1993), UltraSparc (1995), R10000 (1996), PPC 604 (1995), PPC 620 (1996)
PPC: PowerPC
Figure 2.8: Early branch prediction mechanisms and their trends indicated by subsequent models of pipelined processors and 1st and 2nd generation superscalars

2.2. Local prediction (10)
Local prediction
• 1-level
  • Fixed prediction: always the same prediction
  • Static prediction: based on the object code
  • Dynamic prediction: based on the execution history
• 2-level

2.2. Local prediction (11)
2-level local branch prediction (1st level: branch patterns, 2nd level: history bits)
• Individual counters, with individual history tables for different patterns (Pentium Pro): the IFA selects an entry of a local BHT holding branch patterns (e.g. 128×4 bit, 4 ways each), and each pattern indexes its own local BHT of counters (e.g. 16×2 bit)
• Shared counters, with a shared global history table for all patterns (Alpha 21264): the IFA selects a branch pattern in a local BHT (e.g. 1K×10 bit), and the pattern indexes a single local BHT of counters (e.g. 1K×3 bit)
The 21264 uses 3-bit saturating counters whose most significant bit provides the prediction
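The shared-counter variant can be sketched with the table sizes quoted for the Alpha 21264 (1K×10-bit branch patterns, 1K×3-bit counters whose most significant bit gives the prediction). This is a simplified illustration, not a model of the real hardware: the class name, the modulo indexing by the IFA and the all-zero initial state are assumptions.

```python
class TwoLevelLocalPredictor:
    """Level 1: per-branch outcome patterns. Level 2: counters shared by all patterns."""

    def __init__(self, entries=1024, history_bits=10, counter_bits=3):
        self.entries = entries
        self.history_mask = (1 << history_bits) - 1
        self.counter_max = (1 << counter_bits) - 1
        self.msb = 1 << (counter_bits - 1)
        self.patterns = [0] * entries              # level 1: one pattern per entry
        self.counters = [0] * (1 << history_bits)  # level 2: indexed by pattern

    def predict(self, ifa):
        pattern = self.patterns[ifa % self.entries]
        return bool(self.counters[pattern] & self.msb)  # MSB provides the prediction

    def update(self, ifa, taken):
        i = ifa % self.entries
        pattern = self.patterns[i]
        c = self.counters[pattern]
        self.counters[pattern] = min(c + 1, self.counter_max) if taken else max(c - 1, 0)
        # Shift the newest outcome into this branch's pattern.
        self.patterns[i] = ((pattern << 1) | int(taken)) & self.history_mask


# A branch that strictly alternates taken / not taken.
predictor = TwoLevelLocalPredictor()
misses = []
for step in range(100):
    taken = (step % 2 == 0)
    misses.append(predictor.predict(0x40) != taken)
    predictor.update(0x40, taken)

late_misses = sum(misses[50:])
print(sum(misses), late_misses)
```

Once the pattern register has filled and the two recurring patterns have trained their counters, the alternating branch is predicted without error, which a single 1-bit or 2-bit entry per branch cannot achieve.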

2.2. Local prediction (12)
Figure 2.9: The principle of Pentium Pro’s 128x4 way set associative BHT. The BTA (linear) is split into a tag and an index (bits 6 to 0) that selects one of 128 sets (0 to 127); each of the four ways (Way 0 to Way 3) holds a tag and a 4-bit history; the history selects one of 16 two-bit counters (0 to 15) xx, where 00/01: not taken, 10/11: taken.

2.2. Local prediction (13)
Figure 2.10: The actual layout of Pentium Pro’s 128x4 way set associative BHT. Each of the 128 sets (0 to 127) holds four ways, each consisting of a tag, a history field (H) and counters (C).

2.3. Global prediction (1)
Basic branch prediction mechanism
• Processor based
  • Local
  • Global (2-level)
  • Combined (choice prediction)
• Compiler hints

2.3. Global prediction (1) Global prediction Simple global

2.3. Global prediction (1)
Figure 2.11: Simple global prediction. The global history (a shift register, e.g. 0 1 1 0 0 1 1) indexes the BHT; the selected entry x holds the branch history.
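A minimal sketch of the simple global scheme in Figure 2.11: a shift register records the outcomes of all recent branches and alone indexes a table of counters. The 7-bit history matches the width drawn in the figure; the 2-bit counters, their zero initial state and the test pattern are assumptions for this illustration.

```python
class SimpleGlobalPredictor:
    """BHT of 2-bit counters indexed only by the global history shift register."""

    def __init__(self, history_bits=7):
        self.mask = (1 << history_bits) - 1
        self.history = 0
        self.bht = [0] * (1 << history_bits)

    def predict(self):
        return self.bht[self.history] >= 2

    def update(self, taken):
        c = self.bht[self.history]
        self.bht[self.history] = min(c + 1, 3) if taken else max(c - 1, 0)
        # Shift the outcome of the branch just executed into the global history.
        self.history = ((self.history << 1) | int(taken)) & self.mask


# Two branches executed in turn: A alternates, and B always goes the same way as A.
predictor = SimpleGlobalPredictor()
misses = []
for i in range(30):
    a_taken = (i % 2 == 0)
    for taken in (a_taken, a_taken):  # branch A, then the correlated branch B
        misses.append(predictor.predict() != taken)
        predictor.update(taken)

late_misses = sum(misses[30:])
print(sum(misses), late_misses)
```

Because the history contains the outcome of branch A when branch B is predicted, the correlation between the two is captured, which a purely local scheme would not see.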

2.3. Global prediction (1) Global prediction Simple global Gshare

2.3. Global prediction (1)
Figure 2.12: Principle of the Gshare prediction. The global history (e.g. 0 1 1 0 0 1 1) is XORed with bits of the IFA (e.g. ... 1 0 0 1 1 0 0); the result indexes the BHT, whose selected entry x holds the branch history.
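The Gshare indexing of Figure 2.12 can be sketched as below, using the bit patterns drawn in the figure. The function and class names, the choice of the low-order IFA bits, the 2-bit counters and their initial value are assumptions for this illustration.

```python
def gshare_index(ifa, global_history, index_bits):
    """XOR the global history with the low-order IFA bits to form the BHT index."""
    mask = (1 << index_bits) - 1
    return (ifa ^ global_history) & mask


class GsharePredictor:
    def __init__(self, index_bits=7):
        self.index_bits = index_bits
        self.history = 0
        self.bht = [1] * (1 << index_bits)  # 2-bit counters, assumed to start at 01

    def predict(self, ifa):
        return self.bht[gshare_index(ifa, self.history, self.index_bits)] >= 2

    def update(self, ifa, taken):
        i = gshare_index(ifa, self.history, self.index_bits)
        c = self.bht[i]
        self.bht[i] = min(c + 1, 3) if taken else max(c - 1, 0)
        self.history = ((self.history << 1) | int(taken)) & ((1 << self.index_bits) - 1)


# Bit patterns from the figure: history 0110011, IFA bits 1001100.
figure_index = gshare_index(0b1001100, 0b0110011, 7)
print(format(figure_index, "07b"))

# A branch that is always taken is learned once the history has settled.
predictor = GsharePredictor()
for _ in range(20):
    predictor.update(0b1001100, True)
learned = predictor.predict(0b1001100)
```

Hashing the address into the index lets different branches with the same global history use different counters, without the wider index that concatenation would need.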

2.3. Global prediction (1) Global prediction Simple global Gshare Gselect

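2.3. Global prediction (1)
Figure 2.13: Principle of the Gselect prediction. Bits of the global history (e.g. 0 1 1 0 0 1 1) are concatenated with bits of the IFA (e.g. ... 1 0); the result indexes the BHT, whose selected entry x holds the branch history.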
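Gselect forms the BHT index by concatenating global history bits with address bits instead of XORing them. The sketch below assumes that the low-order bits of each are used and that the history bits form the upper part of the index; the function name and the bit widths are illustrative, and the figure does not settle these details.

```python
def gselect_index(ifa, global_history, history_bits, address_bits):
    """Concatenate history bits (upper part) with IFA bits (lower part)."""
    h = global_history & ((1 << history_bits) - 1)
    a = ifa & ((1 << address_bits) - 1)
    return (h << address_bits) | a


# Global history 0110011 and an IFA ending in ...10, as drawn in the figure.
index = gselect_index(0b1001110, 0b0110011, history_bits=4, address_bits=2)
print(format(index, "06b"))

# Two branches with the same history but different address bits get different entries.
other = gselect_index(0b1001101, 0b0110011, history_bits=4, address_bits=2)
```

With a fixed table size, every address bit added to the index removes one history bit, which is the trade-off that Gshare's XOR hashing avoids.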

2.4. Combined prediction (1)
Basic branch prediction mechanism
• Processor based
  • Local
  • Global (2-level)
  • Combined (choice prediction)
• Compiler hints

2.4. Combined prediction (2)
Figure 2.14: Principle of the combined local and global prediction (as used in the Alpha 21264, or the POWER 4). The IFA indexes the local BHT and the global history indexes the global BHT; a "best choice" BHT selects between the local prediction and the global prediction to give the resulting prediction. The local prediction, the global prediction and the actual prediction are fed back for updating.
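The combined scheme of Figure 2.14 can be sketched with three counter tables. The version below is a simplified illustration, not a model of the Alpha 21264 or POWER 4: the local component is a 1-level table of 2-bit counters indexed by the IFA, the choice table is indexed by the global history and trained only when the two components disagree, and all names, sizes and initial values are assumptions.

```python
def bump(counter, up):
    """Move a 2-bit saturating counter one step up or down."""
    return min(counter + 1, 3) if up else max(counter - 1, 0)


class CombinedPredictor:
    def __init__(self, local_entries=1024, history_bits=12):
        self.local_entries = local_entries
        self.mask = (1 << history_bits) - 1
        self.history = 0
        self.local_table = [1] * local_entries
        self.global_table = [1] * (1 << history_bits)
        self.choice_table = [1] * (1 << history_bits)  # >= 2: use the global prediction

    def predict(self, ifa):
        local_pred = self.local_table[ifa % self.local_entries] >= 2
        global_pred = self.global_table[self.history] >= 2
        use_global = self.choice_table[self.history] >= 2
        return global_pred if use_global else local_pred

    def update(self, ifa, taken):
        li, gi = ifa % self.local_entries, self.history
        local_pred = self.local_table[li] >= 2
        global_pred = self.global_table[gi] >= 2
        if local_pred != global_pred:
            # Train the chooser towards whichever component was right.
            self.choice_table[gi] = bump(self.choice_table[gi], global_pred == taken)
        self.local_table[li] = bump(self.local_table[li], taken)
        self.global_table[gi] = bump(self.global_table[gi], taken)
        self.history = ((self.history << 1) | int(taken)) & self.mask


# An alternating branch: the 2-bit local counter keeps missing, the global side learns it.
predictor = CombinedPredictor()
misses = []
for step in range(60):
    taken = (step % 2 == 0)
    misses.append(predictor.predict(0x40) != taken)
    predictor.update(0x40, taken)

late_misses = sum(misses[30:])
print(sum(misses), late_misses)
```

In this run the chooser migrates to the global component for the two recurring history values, after which the alternating branch is predicted correctly.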

2.4. Combined prediction (3)
Combined prediction: 1. prediction, 2. prediction, choice
Alpha 21264
• 1. prediction: 2-level local dynamic prediction with a shared counter table for all patterns (1K * 10 bits/1K * 3 bits)
• 2. prediction: simple 2-level global prediction (12-bit global history/4K * 2 bits)
• Choice: global history referenced choice table (12-bit global history/4K * 2 bits)
Figure 2.15: Implementation alternatives of the combined prediction

2.4. Combined prediction (4)
Figure 2.16: The combined predictor of the Alpha 21264
• Minimum branch penalty: 7 cycles
• Typical branch penalty: 11+ cycles (IQ delay)
• 48K bits of target addresses stored in I-cache
• 32-entry return address stack
• Predictor tables are reset on a context switch
Source: Microprocessor Report, 10/28/96

2.4. Combined prediction (5)
Combined prediction: 1. prediction, 2. prediction, choice
Alpha 21264
• 1. prediction: 2-level local dynamic prediction with a shared counter table for all patterns (1K * 10 bits/1K * 3 bits)
• 2. prediction: simple 2-level global prediction (12-bit global history/4K * 2 bits)
• Choice: global history referenced choice table (12-bit global history/4K * 2 bits)
POWER 4
• 1. prediction: 1-level local dynamic prediction (16K * 1-bit)
• 2. prediction: Gshare global prediction (11-bit global history is hashed with the IFA, 16K * 1-bit counter table)
• Choice: accessed in the same way as the 2-level global counter table (16K * 1-bit)
Figure 2.17: Implementation alternatives of the combined prediction

2.4. Combined prediction (6)
Figure 2.18: The principle of the combined predictor of the POWER 4. Three 16K*1bit tables (local history, global history, selector table) are each accessed with a 14-bit index; the 11-bit global history (1 bit per group) is XORed with IFA bits to access the global history and selector tables, while the IFA indexes the local history table. The selector table selects the better of the local prediction and the global prediction; all tables are updated afterwards.

2.5. Overview of the basic branch prediction mechanisms Figure 2.20: Trends of branch prediction schemes used in 2nd and 3rd generation superscalars

3. Auxiliary branch prediction mechanisms
Figure 3.1: Overview of auxiliary branch prediction mechanisms in 2nd and 3rd generation superscalars (table; recoverable column heading: Backup use of static prediction)
Processors covered: Pentium, Pentium Pro, P4 Will/Northw., P4 Prescott, K6, K7, K8, PPC 604, PPC 620, POWER 3, POWER 4, POWER 5, Alpha 21164, Alpha 21264, PA-8000, PA-8500/8700, UltraSPARC-III
1: 1st generation superscalars
2: Supported by compiler hints
RAS: Return Address Stack
