Dynamic Branch Prediction and Speculation

ENGS 116 Lecture 9
Dynamic Branch Prediction and Speculation
Vincent H. Berk
October 10, 2005
Reading for today: Chapter 3.2–3.6
Reading for Wednesday: Chapter 3.7–3.9, 4.1
Homework #2: due Friday 14th: 2.8, A.2, A.13, 3.6a&b, 3.10, 4.5, 4.8 (4.13 optional)
Project Proposals due Wednesday!

Project Proposals
• 2 pages
• Names and Title
• Introduction to problem domain
• Research question
  • goal of your project
• Work plan
  • e.g.: 2 weeks programming, 1 week experiments, 1 week writing
• References
  • books, websites, articles

Dynamic Branch Prediction
• Control dependences limit ILP
• Performance = f(accuracy, cost of misprediction)
• Branches arrive much faster when multiple instructions are issued per clock
• Amdahl’s Law
• Want to predict outcome of branch as early as possible
• Methods:
  • Branch history table (1 or more bits)
  • Correlated branches
  • Branch target buffer

Dynamic Branch Prediction: A Simple Approach
• Branch History Table (BHT), also known as a Branch Prediction Buffer
• Lower bits of the PC address index a table of 1-bit values
• Each entry says whether or not the branch was taken last time
• No address check
• Problem: in a loop, a 1-bit BHT causes two mispredictions
  • First iteration the next time the loop is entered, when it predicts exit instead of looping
  • End-of-loop case, when the branch exits instead of looping as before
[Figure: table of 1-bit entries (T, NT, T, T, NT, NT, ...) indexed by the lower bits of the PC]
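The two mispredictions per loop execution can be checked with a small simulation. This is a minimal sketch of a single 1-bit table entry; the function name, the 10-iteration loop, and the initial state of the bit are assumptions chosen for illustration.

```python
def simulate_one_bit(outcomes, initial=False):
    """Count mispredictions of one 1-bit predictor entry.

    outcomes: iterable of booleans, True = branch taken.
    initial:  starting value of the bit (an assumption; a real table
              starts in whatever state earlier branches left behind).
    """
    prediction = initial
    mispredictions = 0
    for taken in outcomes:
        if prediction != taken:
            mispredictions += 1
        prediction = taken  # 1-bit scheme: remember only the last outcome
    return mispredictions


# A loop branch taken 9 times and then not taken on exit.
one_pass = [True] * 9 + [False]
# The whole loop is executed three times in a row.
trace = one_pass * 3

print(simulate_one_bit(trace, initial=True))  # prints 5
```

The first pass mispredicts only the exit; every later pass mispredicts both its first iteration and its exit, which gives 1 + 2 + 2 = 5.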

Dynamic Branch Prediction: A Better Way
• Solution: 2-bit scheme where the prediction changes only if we mispredict twice
• Helps when target is known before result of condition
[Figure: 2-bit prediction state diagram with two "Predict taken" states and two "Predict not taken" states, connected by Taken / Not taken transitions]

BHT General Case
• n-bit predictor:
  • The counter can hold values between 0 and 2^n - 1
  • Predict taken when the value is greater than or equal to half of the maximum value, i.e. value >= 2^(n-1)
  • The counter is incremented on each taken branch
  • and decremented on each not-taken branch
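The n-bit rule above can be sketched as a saturating counter. The class and function names are hypothetical, and the starting value of the counter is an assumption.

```python
class SaturatingCounter:
    """n-bit saturating counter used as a branch predictor entry."""

    def __init__(self, n_bits=2, value=0):
        self.max_value = (1 << n_bits) - 1   # 2^n - 1
        self.threshold = 1 << (n_bits - 1)   # 2^(n-1)
        self.value = value                   # assumed starting state

    def predict(self):
        # Predict taken when the counter is in the upper half of its range.
        return self.value >= self.threshold

    def update(self, taken):
        # Increment on taken, decrement on not taken, saturating at the ends.
        if taken:
            self.value = min(self.value + 1, self.max_value)
        else:
            self.value = max(self.value - 1, 0)


def count_mispredictions(counter, outcomes):
    """Run a trace of outcomes (True = taken) through one counter."""
    misses = 0
    for taken in outcomes:
        if counter.predict() != taken:
            misses += 1
        counter.update(taken)
    return misses


# Same loop trace as a 10-iteration loop executed three times.
trace = ([True] * 9 + [False]) * 3
print(count_mispredictions(SaturatingCounter(n_bits=2, value=3), trace))  # prints 3
```

With two bits, only the loop exit is mispredicted on each pass, because a single not-taken outcome moves the counter from 3 to 2 and the prediction stays "taken".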

BHT Accuracy
• Mispredict because either:
  • Wrong guess for that branch
  • Got branch history of the wrong branch when indexing the table
• 4096-entry table: programs vary from 1% misprediction (nasa7, tomcatv) to 18% (eqntott), with spice at 9% and gcc at 12%
• 4096 entries are about as good as an infinite number of entries
• 2-bit predictors work nearly as well as predictors with more bits

Correlating Branches
• Hypothesis: recent branches are correlated; that is, the behavior of recently executed branches affects the prediction of the current branch

    if (d == 0)
        d = 1;
    if (d == 1)
        …

Correlated Branch Prediction
• Idea: record the m most recently executed branches as taken or not taken, and use that pattern to select the proper n-bit branch history table
• In general, an (m,n) predictor records the last m branches to select between 2^m history tables, each with n-bit counters
• Thus, the old 2-bit BHT is a (0,2) predictor
• Global Branch History: m-bit shift register keeping the T/NT status of the last m branches
• Each entry in the table has 2^m n-bit predictors
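A minimal sketch of an (m,n) predictor follows, assuming counters start at 0 and the table is indexed by the PC modulo the number of entries. The class name, the `run` helper, and the alternating test trace are illustrative assumptions.

```python
class CorrelatingPredictor:
    """(m,n) predictor: m bits of global history pick one of 2^m n-bit counters."""

    def __init__(self, m=2, n=2, entries=1024):
        self.m, self.n, self.entries = m, n, entries
        self.history = 0                      # last m outcomes, newest in bit 0
        self.max_value = (1 << n) - 1
        self.threshold = 1 << (n - 1)
        # One row per table entry, 2^m counters per row (all assumed to start at 0).
        self.table = [[0] * (1 << m) for _ in range(entries)]

    def predict(self, pc):
        return self.table[pc % self.entries][self.history] >= self.threshold

    def update(self, pc, taken):
        row = self.table[pc % self.entries]
        value = row[self.history]
        # Update only the counter selected by the current global history.
        row[self.history] = min(value + 1, self.max_value) if taken else max(value - 1, 0)
        # Shift the outcome into the m-bit global history register.
        self.history = ((self.history << 1) | int(taken)) & ((1 << self.m) - 1)

    def storage_bits(self):
        return (1 << self.m) * self.n * self.entries


def run(predictor, outcomes, pc=0):
    """Count mispredictions for one branch at address pc."""
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses


alternating = [True, False] * 50
print(run(CorrelatingPredictor(m=0, n=2, entries=16), alternating))  # prints 50
print(run(CorrelatingPredictor(m=1, n=2, entries=16), alternating))  # prints 2
```

On this trace one bit of history is enough to learn the alternation after a short warm-up, while the (0,2) predictor keeps missing every taken outcome. Note that `storage_bits()` gives 8192 for both a 4096-entry (0,2) table and a 1024-entry (2,2) table.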

Correlating Branches
• (2,2) predictor
  • Behavior of recent branches selects between four predictions of the next branch, updating just that prediction
[Figure: the branch address selects a row of four 2-bit per-branch predictors; the 2-bit global branch history selects which of the four supplies the prediction]

Accuracy of Different Schemes (from second edition)
[Figure: bar chart of frequency of mispredictions (0% to 20%) for nasa7, matrix300, tomcatv, doduc, spice, fpppp, gcc, espresso, eqntott, and li, comparing three predictors: 4,096 entries with 2 bits per entry; unlimited entries with 2 bits per entry; 1,024 entries (2,2)]

Tournament Predictors
• Multilevel branch predictor
• Use an n-bit saturating counter to choose between predictors
• Usual choice is between global and local predictors
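The selection step can be sketched as follows. The update policy (move the chooser only when exactly one of the two predictors was right) and the convention that high values favour the global predictor are assumptions for illustration, as are the function names.

```python
def update_chooser(chooser, global_correct, local_correct, max_value=3):
    """Return the new value of a saturating chooser counter.

    Assumed convention: high values favour the global predictor,
    low values favour the local predictor.
    """
    if global_correct and not local_correct:
        return min(chooser + 1, max_value)
    if local_correct and not global_correct:
        return max(chooser - 1, 0)
    return chooser  # both right or both wrong: no evidence either way


def choose(chooser, global_prediction, local_prediction, threshold=2):
    """Pick the prediction of whichever predictor the chooser currently favours."""
    return global_prediction if chooser >= threshold else local_prediction


chooser = 1                                   # currently favours the local predictor
print(choose(chooser, True, False))           # prints False
chooser = update_chooser(chooser, global_correct=True, local_correct=False)
print(choose(chooser, True, False))           # prints True
```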

Tournament Predictors: DEC Alpha 21264
Tournament predictor using 4K 2-bit counters indexed by local branch address. Chooses between:
• Global predictor
  • 4K entries indexed by the history of the last 12 branches (2^12 = 4K)
  • Each entry is a standard 2-bit predictor
• Local predictor
  • Local history table: 1024 10-bit entries recording the last 10 branches, indexed by branch address
  • The pattern of the last 10 occurrences of that particular branch is used to index a table of 1K entries with 3-bit saturating counters
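As a quick check on the sizes listed above, the total predictor storage can be added up. The variable names are illustrative; the table sizes are the ones given on this slide.

```python
K = 1024

chooser_bits = 4 * K * 2          # 4K 2-bit selection counters
global_bits = 4 * K * 2           # 4K 2-bit global predictor entries
local_history_bits = 1024 * 10    # 1024 local histories of 10 bits each
local_counter_bits = 1 * K * 3    # 1K 3-bit saturating counters

total_bits = chooser_bits + global_bits + local_history_bits + local_counter_bits
print(total_bits, total_bits // K)  # prints 29696 29
```

That is 29K bits of predictor state in total.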

Branch Target Buffers
• Branch target calculation is costly and stalls the instruction fetch
• The BTB stores PCs the same way as caches
• The PC of a branch is sent to the BTB
• When a match is found, the corresponding predicted PC is returned
• If the branch was predicted taken, instruction fetch continues at the returned predicted PC


Figure 3.20: The steps involved in handling an instruction with a branch-target buffer
• IF: Send PC to memory and branch-target buffer
• ID: Entry found in branch-target buffer?
  • No: Is instruction a taken branch?
    • No: Normal instruction execution
    • Yes: Enter branch instruction PC and next PC into branch-target buffer (EX)
  • Yes: Send out predicted PC, then in EX: Branch taken?
    • No: Mispredicted branch; kill fetched instruction, restart fetch at other target, delete entry from target buffer
    • Yes: Branch correctly predicted; continue execution with no stalls
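The decision flow of Figure 3.20 can be sketched in a few lines. This is a simplified model, not a description of any real design: the buffer is an unbounded dictionary rather than a small tagged cache, and the class and method names are hypothetical.

```python
class BranchTargetBuffer:
    """Toy branch-target buffer: maps a branch PC to its predicted target PC."""

    def __init__(self):
        self.entries = {}  # assumption: unbounded, fully associative

    def fetch(self, pc, fallthrough):
        """IF stage: return the next PC to fetch from."""
        # Hit: fetch from the predicted target. Miss: fetch sequentially.
        return self.entries.get(pc, fallthrough)

    def resolve(self, pc, taken, target):
        """Update the buffer once the branch outcome is known.

        Returns True if the earlier fetch went down the wrong path.
        """
        hit = pc in self.entries
        if hit and not taken:
            del self.entries[pc]        # mispredicted: delete the entry
            return True
        if not hit and taken:
            self.entries[pc] = target   # taken branch not in buffer: enter it
            return True
        return False                    # correctly predicted, or not a taken branch


btb = BranchTargetBuffer()
print(hex(btb.fetch(0x40, 0x44)))        # prints 0x44 (miss: fall through)
print(btb.resolve(0x40, True, 0x10))     # prints True (taken branch, entry added)
print(hex(btb.fetch(0x40, 0x44)))        # prints 0x10 (hit: predicted target)
```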

Multiple Issue Machines
• Superscalar: multiple parallel dedicated pipelines
  • Varying number of instructions per cycle, scheduled by compiler and/or by hardware (Tomasulo)
  • IBM PowerPC, Sun UltraSparc, DEC Alpha, IA32 Pentium
• VLIW (Very Long Instruction Word): multiple operations encoded in one instruction
  • Instructions have a wide template (4-16 operations)
  • IA-64 Itanium

Getting CPI < 1: Issuing Multiple Instructions/Cycle
• Superscalar DLX: 2 instructions, 1 FP & 1 anything else
  • Fetch 64 bits/clock cycle; integer on left, FP on right
  • Can only issue 2nd instruction if 1st instruction issues
  • More ports for FP registers to do FP load & FP op in a pair
• 1-cycle load delay expands to 3 instructions in superscalar DLX
  • Instruction in right half can’t use it, nor instructions in next slot

Multiple Issue Challenges
• While the Integer/FP split is simple for the HW, we get a CPI of 0.5 only for programs with:
  • Exactly 50% FP operations
  • No hazards
• If more instructions issue at the same time, decode and issue become more difficult
  • Even 2-way superscalar => examine 2 opcodes, 6 register specifiers, & decide if 1 or 2 instructions can issue
• VLIW: trade off instruction space for simple decoding
  • The long instruction word has room for many operations
  • By definition, all the operations the compiler puts in the long instruction word are independent => execute in parallel
  • E.g., 2 integer operations, 2 FP ops, 2 memory refs, 1 branch => 16 to 24 bits per field => 7*16 or 112 bits to 7*24 or 168 bits wide
  • Need compiling technique that schedules across several branches
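The 50% FP condition can be illustrated with a toy issue model. It assumes in-order issue of at most one integer-slot instruction followed by one FP instruction per cycle and ignores all hazards; the function name and the `'int'`/`'fp'` labels are hypothetical.

```python
def dual_issue_cycles(kinds):
    """Count issue cycles for a sequence of 'int' / 'fp' instructions.

    Toy model: each cycle issues one instruction, plus the next one as well
    when the pair is (int, fp) in program order. Hazards are ignored.
    """
    cycles = 0
    i = 0
    while i < len(kinds):
        if kinds[i] == "int" and i + 1 < len(kinds) and kinds[i + 1] == "fp":
            i += 2   # integer on the left, FP on the right: issue both
        else:
            i += 1   # no legal pair: issue a single instruction
        cycles += 1
    return cycles


balanced = ["int", "fp"] * 4
print(dual_issue_cycles(balanced) / len(balanced))      # prints 0.5
all_integer = ["int"] * 8
print(dual_issue_cycles(all_integer) / len(all_integer))  # prints 1.0
```

Only the perfectly interleaved stream reaches a CPI of 0.5; any other mix leaves issue slots empty.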

Limits to Multi-Issue Machines
• Inherent limitations of instruction-level parallelism
  • 1 branch in 5: how to keep a 5-way VLIW busy?
  • Latencies of units: many operations must be scheduled
• Easy: more instruction bandwidth
• Easy: duplicate functional units to get parallel execution
• Hard: increase ports to register file (bandwidth)
  • VLIW example needs 7 reads and 3 writes for integer registers & 5 reads and 3 writes for FP registers
• Harder: increase ports to memory (bandwidth)
• Decoding superscalar and impact on clock rate, pipeline depth?

Hardware-Based Speculation
• Instead of just instruction fetch and decode, also execute instructions based on the prediction of the branch
• Execute instructions out of order as soon as their operands are available
• Wait with instruction commit until the branch is decided
• Re-order instructions after execution and commit them in order
  • reorder buffer or ROB
  • register file not updated until commit
• Do not raise exceptions until the instruction is committed
• ROB holds and provides operands until commit

Tomasulo with Speculation
• Issue: requires an empty reservation station and an empty ROB slot. Send operands to the reservation station from the register file or from the ROB. This stage is often referred to as dispatch.
• Execute: monitor the CDB for operands, check RAW hazards. When both operands are available, execute.
• Write Result: when available, write the result on the CDB through to the ROB and any waiting reservation stations. Stores write to the value field in the ROB.
• Commit: three cases:
  • Normal commit: write registers, in-order commit
  • Store: update memory
  • Incorrect branch: flush ROB and reservation stations, and restart execution at the correct PC
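The three commit cases can be sketched as a loop over the head of the reorder buffer. The entry layout (dictionaries with `kind`, `done`, and per-kind fields) and the function name are assumptions made for this example; reservation stations and the CDB are not modelled.

```python
from collections import deque


def commit_in_order(rob, registers, memory):
    """Commit finished entries from the head of the ROB, oldest first.

    rob: deque of dicts with 'kind' ('alu', 'store' or 'branch') and 'done',
         plus 'dest'/'value' (alu), 'addr'/'value' (store),
         or 'mispredicted'/'correct_pc' (branch).
    Returns the PC to restart from after a misprediction, else None.
    """
    while rob and rob[0]["done"]:
        entry = rob.popleft()
        if entry["kind"] == "alu":
            registers[entry["dest"]] = entry["value"]   # normal commit
        elif entry["kind"] == "store":
            memory[entry["addr"]] = entry["value"]      # store: update memory
        elif entry["kind"] == "branch" and entry["mispredicted"]:
            rob.clear()                                 # squash younger instructions
            return entry["correct_pc"]
    return None


rob = deque([
    {"kind": "alu", "done": True, "dest": "R1", "value": 5},
    {"kind": "branch", "done": True, "mispredicted": True, "correct_pc": 0x80},
    {"kind": "alu", "done": True, "dest": "R2", "value": 9},  # speculative
])
registers, memory = {}, {}
restart_pc = commit_in_order(rob, registers, memory)
print(registers, hex(restart_pc), len(rob))  # prints {'R1': 5} 0x80 0
```

The speculative write to R2 never reaches the register file, because it sits behind the mispredicted branch and is flushed before it can commit.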


Problems with Speculation
• Multi-issue machines:
  • Must be able to commit multiple instructions from the ROB
  • More registers, more renaming
• How much speculation:
  • How many branches deep?
  • What to do on a cache miss?
  • TLB miss?
  • Cache interference due to incorrect branch prediction

Figure 3.41: Number of registers available for renaming.

Figure 3.45: Window size: the number of instructions the issue unit may look ahead and schedule from.

HW Support for More ILP
• Avoid branch prediction by turning branches into conditionally executed instructions:
    if (X) then A = B op C else NOP
  • If false, then neither store result nor cause exception
  • Expanded ISAs of Alpha, MIPS, PowerPC, and SPARC have conditional move; PA-RISC can annul any following instruction
  • IA-64: 64 1-bit condition fields selected, so conditional execution of any instruction
• Drawbacks to conditional instructions
  • Still takes a clock even if “annulled”
  • Stall if condition evaluated late
  • Complex conditions reduce effectiveness; condition becomes known late in pipeline
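The transformation can be modelled in software as follows. This is only an analogy: the function names are hypothetical, and Python's conditional expression merely stands in for a hardware conditional move.

```python
def branchy_abs(x):
    """Control-dependent version: the assignment sits behind a branch."""
    if x < 0:
        x = -x
    return x


def predicated_abs(x):
    """If-converted version: compute unconditionally, then select.

    The negation always executes (it "still takes a clock"); the final
    select models a conditional move that keeps x when the condition is false.
    """
    negated = -x
    return negated if x < 0 else x


print(branchy_abs(-7), predicated_abs(-7))  # prints 7 7
```

Both versions compute the same result; the second one has no control dependence for a branch predictor to guess, at the cost of always executing the negation.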
