ILP and Dynamic Execution: Branch Prediction & Multiple Issue

ILP and Dynamic Execution: Branch Prediction & Multiple Issue Original by Prof. David A. Patterson

Review Tomasulo • Reservations stations: implicit register renaming to larger set of registers + buffering source operands • Prevents registers as bottleneck • Avoids WAR, WAW hazards of Scoreboard • Allows loop unrolling in HW • Not limited to basic blocks (integer units gets ahead, beyond branches) • Lasting Contributions • Dynamic scheduling • Register renaming • 360/91 descendants are Pentium III; PowerPC 604; MIPS R10000; HP-PA 8000; Alpha 21264

Tomasulo Algorithm and Branch Prediction • 360/91 predicted branches, but did not speculate: pipeline stopped until the branch was resolved • No speculation; only instructions that can complete • Speculation with Reorder Buffer allows execution past branch, and then discard if branch fails • just need to hold instructions in buffer until branch can commit

Case for Branch Prediction when Issue N instructions per clock cycle • Branches will arrive up to n times faster in an n-issue processor • Amdahl’s Law => relative impact of the control stalls will be larger with the lower potential CPI in an n-issue processor

7 Branch Prediction Schemes • 1-bit Branch-Prediction Buffer • 2-bit Branch-Prediction Buffer • Correlating Branch Prediction (two-level) Buffer • Tournament Branch Predictor • Branch Target Buffer • Integrated Instruction Fetch Units • Return Address Predictors

Dynamic Branch Prediction • Performance = ƒ(accuracy, cost of misprediction) • Branch History Table (BHT): Lower bits of PC address index table of 1-bit values • Says whether or not branch taken last time • No address check (saves HW, but may not be right branch) • Problem: in a loop, 1-bit BHT will cause 2 mispredictions (avg is 9 iterations before exit): • End of loop case, when it exits instead of looping as before • First time through loop on next time through code, when it predicts exit instead of looping • Only 80% accuracy even if loop 90% of the time

Dynamic Branch Prediction(Jim Smith, 1981) • Solution: 2-bit scheme where change prediction only if get misprediction twice: (Figure 3.7, p. 249) • Red: stop, not taken • Green: go, taken • Adds hysteresis to decision making process T NT Predict Taken Predict Taken T T NT NT Predict Not Taken Predict Not Taken T NT

Prediction Accuracy of a 4096-entry 2-bit Prediction Buffer vs. an Infinite Buffer

BHT Accuracy • Mispredict because either: • Wrong guess for that branch • Got branch history of wrong branch when index the table • 4096 entry table programs vary from 1% misprediction (nasa7, tomcatv) to 18% (eqntott), with spice at 9% and gcc at 12% • For SPEC92,4096 about as good as infinite table

Improve Prediction Strategy By Correlating Branches • Consider the worst case for the 2-bit predictor if (aa==2) then aa=0; if (bb==2) then bb=0; if (aa != bb) then whatever • single level predictors can never get this case • Correlating or 2-level predictors • Correlation = what happened on the last branch • Note that the last correlator branch may not always be the same • Predictor = which way to go • 4 possibilities: which way the last one went chooses the prediction • (Last-taken, last-not-taken) X (predict-taken, predict-not-taken) if the first 2 fail then the 3rd will always be taken From 柯皓仁, 交通大學

Correlating Branches • Hypothesis: recently executed branches are correlated; that is, behavior of recently executed branches affects prediction of current branch • Idea: record m most recently executed branches as taken or not taken, and use that pattern to select the proper branch history table • In general, (m,n) predictor means record last m branches to select between 2m history tables each with n-bit counters • Old 2-bit BHT is then a (0,2) predictor From 柯皓仁, 交通大學

if (d==0) d = 1; if (d==1) … BNEZ R1, L1 ;branch b1 (d!=0) DAAIU R1, R0, #1 ;d==0, so d=1 L1: DAAIU R3, R1, #-1 BNEZ R3, L2 ;branch b2 (d!=1) … L2: Example of Correlating Branch Predictors From 柯皓仁, 交通大學

A Problem for 1-bit Predictor without Correlators From 柯皓仁, 交通大學

Example of Correlating Branch Predictors (1,1) (Cont.) From 柯皓仁, 交通大學

In general: (m,n) BHT (prediction buffer) • p bits of buffer index = 2p entries of BHT • Use last m branches = global branch history • Use n bit predictor • Total bits for the (m, n) BHT prediction buffer: From 柯皓仁, 交通大學

(2,2) Predictor Implementation 4 banks = each with 32 2-bit predictor entries p=5m=2n=2 5:32 From 柯皓仁, 交通大學

Accuracy of Different Schemes

Tournament Predictors • Adaptively combine local and global predictors • Multiple predictors • One based on global information: Results of recently executed m branches • One based on local information: Results of past executions of the current branch instruction • Selector to choose which predictors to use • Advantage • Ability to select the right predictor for the right branch From 柯皓仁, 交通大學

Misprediction Rate Comparison

Branch Target Buffer (BTB) • To reduce the branch penalty to 0 • Need to know what the address is by the end of IF • But the instruction is not even decoded yet • So use the instruction address rather than wait for decode • If prediction works then penalty goes to 0! • BTB Idea -- Cache to store taken branches (no need to store untaken) • Match tag is instruction address  compare with current PC • Data field is the predicted PC • May want to add predictor field • To avoid the mispredict twice on every loop phenomenon • Adds complexity since we now have to track untaken branches as well

Branch PC Predicted PC Need Address at Same Time as Prediction • Branch Target Buffer (BTB): Address of branch index to get prediction AND branch address (if taken) • Note: must check for branch match now, since can’t use wrong branch address (Figure 3.19, p. 262) PC of instruction FETCH =? Extra prediction state bits Yes: instruction is branch and use predicted PC as next PC No: branch not predicted, proceed normally (Next PC = PC+4)

Integrated Instruction Fetch Units • Integrated branch prediction • The branch predictor becomes part of the instruction fetch unit and is constantly predicting branches • Instruction prefetch • To deliver multiple instructions per clock, the instruction fetch unit will likely need to fetch ahead • Instruction memory access and buffering

Return Address Predictor • Indirect jump – jumps whose destination address varies at run time • indirect procedure call, select or case, procedure return • SPEC89 benchmarks: 85% of indirect jumps are procedure returns • Accuracy of BTB for procedure returns are low • if procedure is called from many places, and the calls from one place are not clustered in time • Use a small buffer of return addresses operating as a stack • Cache the most recent return addresses • Push a return address at a call, and pop one off at a return • If the cache is sufficient large (max call depth)  prefect From 柯皓仁, 交通大學

Dynamic Branch Prediction Summary • Branch History Table: 2 bits for loop accuracy • Correlation: Recently executed branches correlated with next branch. • Either different branches • Or different executions of same branches • Tournament Predictor: more resources to competitive solutions and pick between them • Branch Target Buffer: include branch address & prediction • Return address stack for prediction of indirect jump

Getting CPI < 1: Issuing Multiple Instructions/Cycle • Vector Processing • Explicit coding of independent loops as operations on large vectors of numbers • Multimedia instructions being added to many processors • Superscalar • varying number instructions/cycle (1 to 8), scheduled by compiler or by HW (Tomasulo) • IBM PowerPC, Sun UltraSparc, DEC Alpha, Pentium III/4 • (Very) Long Instruction Words (V)LIW: • fixed number of instructions (4-16) scheduled by the compiler; put operations into wide templates • Intel Architecture-64 (IA-64) 64-bit address • Renamed: “Explicitly Parallel Instruction Computer (EPIC)” • Anticipated success of multiple instructions lead to Instructions Per Clock cycle (IPC) vs. CPI

Getting CPI < 1: IssuingMultiple Instructions/Cycle • Superscalar MIPS: 2 instructions, 1 FP & 1 anything – Fetch 64-bits/clock cycle; Int on left, FP on right Type Pipe Stages Int. instruction IF ID EX MEM WB FP ADD instruction IF ID EX EX EX WB Int. instruction IF ID EX MEM WB FP ADD instruction IF ID EX EX EX WB Int. instruction IF ID EX MEM WB FP ADD instruction IF ID EX EX EX WB • While Integer/FP split is simple for the HW, get CPI of 0.5 only for programs with: • Exactly 50% FP operations AND No hazards

Multiple Issue Issues • Issue packet: group of instructions from fetch unit that could potentially issue in 1 clock • If instruction causes structural hazard or a data hazard either due to earlier instruction in execution or to earlier instruction in issue packet, then instruction does not issue • 0 to N instruction issues per clock cycle, for N-issue • Performing issue checks in 1 cycle could limit clock cycle time • Issue stage usually split and pipelined • 1st stage decides how many instructions from within this packet can issue, 2nd stage examines hazards among selected instructions and those already been issued • higher branch penalties => prediction accuracy important

Studies of The Limitations of ILP • Perfect hardware model • Rename as much as you need • Infinite registers • No WAW or WAR • Branch prediction is perfect • Never happen in reality • Jump prediction (even computed such as return) is perfect • Similarly unreal • Perfect memory disambiguation • Almost perfect is not too hard in practice • Issue an unlimited number of instructions at once & no restriction on types of instructions issued • One-cycle latency

What A Perfect Processor Must Do? • Look arbitrary far ahead to find a set of instructions to issue, predicting all branches perfectly • Rename all register uses to avoid WAW and WAR hazards • Determine whether there are any dependences among the instructions in the issue packet; if so, rename accordingly • Determine if any memory dependences exist among the issuing instructions and hand them appropriately • Provide enough replicated function units to allow all the ready instructions to issue

How many instructions would issue on the perfect machine every cycle?

Window Size • The set of instructions that is examined for simultaneous execution is called the window • The window size will be determined by the cost of determining whether n issuing instructions have any register dependences among them • In theory, This cost will be O(n2), refer to p.243, why? • Each instruction in the window must be kept in processor • Window size is limited by the required storage, the comparisons, and a limited issue rate

Effects of Reducing the Window Size

Effects of Realistic Branch and Jump Prediction • Perfect • Tournament-based • Uses a correlating 2 bit and non-correlating 2 bit plus a selector to choose between the two • Prediction buffer has 8K (13 address bits from the branch) • 3 entries per slot - non-correlating, correlating, select • Standard 2 bit • 512 (9 address bits) entries • Plus 16 entry buffer to predict RETURNS • Static • Based on profile - predict either T or NT but it stays fixed • None

The Effect of Branch-Prediction Schemes

Effects of Finite Registers

Models for Memory Alias Analysis • Perfect • Global/Stack Perfect • Perfect prediction for global and stack references and assume all heap references conflict • The best compiler-based analysis schemes can do • Inspection • Pointers to different allocation areas (such as the global area and the stack area) are assumed no conflict • Addresses using same register with different offsets • None • All memory references are assumed to conflict

Effects of Memory Aliasing

ILP and Dynamic Execution: Branch Prediction & Multiple Issue