
Lecture 10 Dynamic Branch Prediction, Superscalar, VLIW, and Software Pipelining


Presentation Transcript


  1. Lecture 10: Dynamic Branch Prediction, Superscalar, VLIW, and Software Pipelining (CS510 Computer Architectures)

  2. Review: Tomasulo Summary
• Register file is not the bottleneck
• Avoids the WAR and WAW hazards of the Scoreboard
• Not limited to basic blocks (provided branch prediction)
• Allows loop unrolling in HW
• Lasting contributions: dynamic scheduling, register renaming, load/store disambiguation

  3. Reasons for Branch Prediction
• For a branch instruction, predict whether the branch will actually be TAKEN or NOT TAKEN
• Pipeline bubbles (stalls) due to branches are the main source of performance degradation
• Prediction avoids pipeline bubbles by feeding the pipeline with instructions from the predicted path (speculation)
• Speculative execution significantly improves the performance of deeply pipelined, wide-issue superscalar processors
• More accurate branch prediction is required because all speculative work beyond a branch must be thrown away on a misprediction
• Performance = f(accuracy, cost of misprediction)
• Accuracy is better with dynamic prediction

  4. 1-Bit Branch History Table
• Simplest of all dynamic prediction schemes
• Based on the property that most branches are either usually TAKEN or usually NOT TAKEN (a property found in the iterations of a loop)
• Keep the last outcome of each branch in the BHT
• The BHT is indexed using the low-order bits of the PC
• Problem: in a loop, a 1-bit BHT causes two mispredictions on the loop-ending branch:
  - the last time through the loop, when it exits instead of looping as before
  - the first time through the loop, when it predicts exit instead of looping
(Figure: the PC indexes the BHT; the stored bit is the prediction and is updated with the actual outcome.)
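
To make the mechanism concrete, here is a minimal C sketch of a 1-bit BHT, assuming a direct-mapped 4096-entry table indexed by the low-order bits of a word-aligned PC; the table size, index function, branch address, and the small driver that replays the 'for' branch pattern are illustrative, not from the slides.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define BHT_ENTRIES 4096

    static bool bht[BHT_ENTRIES];                     /* one bit per entry: last outcome (1 = TAKEN) */

    static bool predict_1bit(uint32_t pc)             /* predict that the branch repeats its last outcome */
    {
        return bht[(pc >> 2) % BHT_ENTRIES];
    }

    static void update_1bit(uint32_t pc, bool taken)  /* remember only the most recent outcome */
    {
        bht[(pc >> 2) % BHT_ENTRIES] = taken;
    }

    int main(void)
    {
        const uint32_t pc = 0x400100;                 /* hypothetical address of the 'for' branch */
        bht[(pc >> 2) % BHT_ENTRIES] = true;          /* assume initial last outcome = 1 (TAKEN) */
        int correct = 0, total = 0;
        for (int i = 0; i < 400; i++) {               /* replay the outcome pattern 1,1,1,0,... */
            bool taken = (i % 4) != 3;
            correct += (predict_1bit(pc) == taken);
            update_1bit(pc, taken);
            total++;
        }
        printf("1-bit BHT accuracy: %.0f%%\n", 100.0 * correct / total);  /* ~50%, as on the slide */
        return 0;
    }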

  5. 1-Bit Branch History Table: Example

    main() {
      int i, j;
      while (1)
        for (i = 1; i <= 4; i++) {
          j = i++;
        }
    }

• 'while' branch: always TAKEN; outcome string 11111111...
• 'for' branch: TAKEN x 3, NOT TAKEN x 1; outcome string 111011101110...
• Trace of the 'for' branch (assume initial last outcome = 1; O = correct, X = mispredict):

  outcome   last outcome   prediction   new last outcome   correct?
     1           1             T               1              O
     1           1             T               1              O
     1           1             T               1              O
     0           1             T               0              X
     1           0             N               1              X
     1           1             T               1              O
     1           1             T               1              O
     0           1             T               0              X
    ...

• Prediction accuracy on the 'for' branch: 50%

  6. 2-Bit Counter Scheme
• The prediction is based on the last two branch outcomes
• Each BHT entry consists of 2 bits, usually a 2-bit counter, associated with the state of an automaton (many different automata are possible)
(Figure: the PC indexes the BHT; each entry's automaton supplies the prediction and is updated with the branch outcome.)
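
As a sketch of the automaton, here is one 2-bit saturating counter in C, the most common choice for a BHT entry; the encoding of states as 0..3 with the MSB as the prediction follows the slides, while the function names are only illustrative.

    #include <stdbool.h>
    #include <stdint.h>

    /* One 2-bit saturating counter: states 00 and 01 predict NOT TAKEN; 10 and 11 predict TAKEN. */

    static bool predict_2bit(uint8_t counter)
    {
        return (counter >> 1) & 1;                    /* MSB of the state is the prediction */
    }

    static uint8_t update_2bit(uint8_t counter, bool taken)
    {
        if (taken)
            return counter < 3 ? counter + 1 : 3;     /* count up toward 11, saturating */
        else
            return counter > 0 ? counter - 1 : 0;     /* count down toward 00, saturating */
    }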

  7. 2-Bit BHT (1)
• 2-bit scheme: the prediction changes only after two mispredictions in a row
• The MSB of the state represents the prediction: 1x = TAKEN, 0x = NOT TAKEN
• States 11 and 10 predict TAKEN; states 01 and 00 predict NOT TAKEN; each T/NT outcome moves the automaton between states
• Trace of the 'for' branch (assume initial counter value = 11; O = correct, X = mispredict):

  outcome   counter   prediction   new counter   correct?
     1         11          T            11           O
     1         11          T            11           O
     1         11          T            11           O
     0         11          T            10           X
     1         10          T            11           O
     1         11          T            11           O
     1         11          T            11           O
     0         11          T            10           X
    ...

• Prediction accuracy on the 'for' branch: 75%

  8. 2-Bit BHT (2)
• A branch that goes in its unusual direction once causes a misprediction only once, rather than twice as in the 1-bit BHT
• The MSB of the state represents the prediction: 1x = TAKEN, 0x = NOT TAKEN
• States 11 and 10 predict TAKEN; states 01 and 00 predict NOT TAKEN
• Trace of the 'for' branch (assume initial counter value = 10; O = correct, X = mispredict):

  outcome   counter   prediction   new counter   correct?
     1         10          T            11           O
     1         11          T            11           O
     1         11          T            11           O
     0         11          T            10           X
     1         10          T            11           O
     1         11          T            11           O
     1         11          T            11           O
     0         11          T            10           X
    ...

• Prediction accuracy on the 'for' branch: 75%

  9. 2-Level Self Prediction
• First level: record the previous few outcomes (the history) of the branch itself
  - Branch History Table (BHT): one shift register per branch
• Second level: a Predictor Table (PT) indexed by the history pattern captured in the BHT
  - each PT entry is a 2-bit counter
(Figure: the PC indexes the BHT; the history read from the BHT indexes the PT, whose counter gives the prediction.)

  10. 2-Level Self Prediction Algorithm
• Using the PC, read the self history from the BHT
• Using the self history, read the counter value from the PT
• If the counter's MSB == 1, predict TAKEN; else predict NOT TAKEN
• After the branch outcome is resolved, do the bookkeeping:
  - if mispredicted, discard all speculative execution
  - shift the new branch outcome into the BHT entry that was read
  - update the PT: if the branch was TAKEN, increase the counter by 1; else decrease it by 1
• Trace of the 'for' branch (initial BHT entry = 0000, self history length = 4, initial PT value = 10; O = correct, X = mispredict):

  outcome   self history (BHT)   counter (PT)   prediction   new BHT entry   new PT entry   correct?
     1             0000               10             T             1000             11          O
     1             1000               10             T             1100             11          O
     1             1100               10             T             1110             11          O
     0             1110               10             T             0111             01          X
     1             0111               10             T             1011             11          O
     1             1011               10             T             1101             11          O
     1             1101               10             T             1110             11          O
     0             1110               01             N             0111             00          O
     1             0111               11             T             1011             11          O
     1             1011               11             T             1101             11          O
     1             1101               11             T             1110             11          O
     0             1110               00             N             0111             00          O
    ...

• After warm-up, prediction accuracy = 100%
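
A minimal C sketch of this two-level self (local) predictor, assuming a 4-bit history per branch and illustrative table sizes; the shift-in at the MSB side matches the trace above, while the index width and names are assumptions.

    #include <stdbool.h>
    #include <stdint.h>

    #define BHT_ENTRIES 1024                          /* illustrative sizes */
    #define HIST_BITS   4

    static uint8_t bht[BHT_ENTRIES];                  /* level 1: per-branch history shift registers */
    static uint8_t pt[1 << HIST_BITS];                /* level 2: one 2-bit counter per history pattern */
                                                      /* (the slide's trace assumes counters start at 10;
                                                         they are zero-initialized here for brevity) */
    static bool predict_local(uint32_t pc)
    {
        uint8_t hist = bht[(pc >> 2) % BHT_ENTRIES];
        return (pt[hist] >> 1) & 1;                   /* MSB of the selected counter */
    }

    static void update_local(uint32_t pc, bool taken)
    {
        uint32_t i    = (pc >> 2) % BHT_ENTRIES;
        uint8_t  hist = bht[i];
        if (taken) { if (pt[hist] < 3) pt[hist]++; }  /* update the counter chosen by the old history */
        else       { if (pt[hist] > 0) pt[hist]--; }
        /* shift the new outcome in at the MSB, as in the trace above */
        bht[i] = (uint8_t)((hist >> 1) | ((uint8_t)taken << (HIST_BITS - 1)));
    }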

  11. BHT Accuracy
• Mispredictions occur because either:
  - the guess for that branch was wrong, or
  - the branch history of a different branch was used when indexing the table
• With a 4096-entry table, program misprediction rates vary from 1% (nasa7, tomcatv) to 18% (eqntott), with spice at 9% and gcc at 12%
• A 4096-entry table is about as good as an infinite table, but 4096 entries is a lot of hardware

  12. Correlating Branches
• Example source: if (d == 0) d = 1; if (d == 1) ...
• DLX code:

        BNEZ  R1,L1      ; branch b1 (taken if d != 0)
        ADDI  R1,R1,#1   ; d == 0, so d = 1
  L1:   SUBI  R3,R1,#1
        BNEZ  R3,L2      ; branch b2 (taken if d != 1)
        .....
  L2:

• Behavior of b1 and b2 as a function of d:

  initial value of d   d==0?   b1   value of d before b2   d==1?   b2
          0              Y     NT             1              Y     NT
          1              N      T             1              Y     NT
          2              N      T             2              N      T

  If b1 is NOT TAKEN, then b2 is NOT TAKEN.

• 1-bit self-history predictor, with d alternating 2, 0, 2, 0, ...:

   d    b1 prediction   b1 action   new b1 prediction   b2 prediction   b2 action   new b2 prediction
   2         NT             T               T                 NT            T               T
   0          T            NT              NT                  T           NT              NT
   2         NT             T               T                 NT            T               T
   0          T            NT              NT                  T           NT              NT

  All branches are mispredicted.

  13. Correlating Branches
• Idea: the TAKEN/NOT TAKEN behavior of recently executed branches (the global history, kept in a Global History Register, GHR) is related to the behavior of the next branch, in addition to that branch's own history (the self history kept in the Pattern History Table, PHT)
• The behavior of the recent branches (GHR) selects from, say, four predictions of the next branch held in the PHT, and only the selected prediction is updated
• (2,2) correlating predictor: the branch address indexes the PHT; the 2-bit GHR selects one of four 2-bit predictors in the indexed entry
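
A minimal C sketch of this (2,2) organization, assuming a direct-mapped PHT indexed by the low-order bits of the branch PC; the table size and names are illustrative.

    #include <stdbool.h>
    #include <stdint.h>

    #define PHT_ENTRIES 1024                          /* illustrative size */

    static uint8_t pht[PHT_ENTRIES][4];               /* four 2-bit counters per entry, one per GHR value */
    static uint8_t ghr;                               /* 2-bit global history of the last two branches */

    static bool predict_corr(uint32_t pc)
    {
        uint8_t c = pht[(pc >> 2) % PHT_ENTRIES][ghr & 3];
        return (c >> 1) & 1;                          /* MSB of the selected counter is the prediction */
    }

    static void update_corr(uint32_t pc, bool taken)
    {
        uint8_t *c = &pht[(pc >> 2) % PHT_ENTRIES][ghr & 3];
        if (taken) { if (*c < 3) (*c)++; }            /* update only the counter the GHR selected */
        else       { if (*c > 0) (*c)--; }
        ghr = (uint8_t)(((ghr << 1) | taken) & 3);    /* shift the new outcome into the global history */
    }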

  14. Correlating Branches
• Same code and source as before:

        BNEZ  R1,L1      ; branch b1 (taken if d != 0)
        ADDI  R1,R1,#1   ; d == 0, so d = 1
  L1:   SUBI  R3,R1,#1
        BNEZ  R3,L2      ; branch b2 (taken if d != 1)
        .....
  L2:

• Each branch keeps a pair of prediction bits X/Y: X is used if the last branch executed was NOT TAKEN, Y if it was TAKEN:

  prediction bits (X/Y)                          NT/NT   NT/T   T/NT   T/T
  prediction if the last branch was TAKEN          NT      T     NT     T
  prediction if the last branch was NOT TAKEN      NT      NT     T     T

• Trace with d alternating 2, 0, 2, 0, ... (initial prediction bits NT/NT, initial last branch NOT TAKEN; the prediction actually used is the one selected by the last branch's action):

   d    b1 prediction   b1 action   new b1 prediction   b2 prediction   b2 action   new b2 prediction
   2        NT/NT           T             T/NT               NT/NT          T             NT/T
   0         T/NT          NT             T/NT                NT/T         NT             NT/T
   2         T/NT           T             T/NT                NT/T          T             NT/T
   0         T/NT          NT             T/NT                NT/T         NT             NT/T

• Mispredictions occur only in the first two predictions.

  15. Accuracy of Different Schemes
(Figure: frequency of mispredictions per benchmark for three predictors: a 4096-entry 2-bit BHT, an unlimited-entry 2-bit BHT, and a 1024-entry (2,2) BHT. Benchmarks: gcc, eqntott, nasa7, spice, fpppp, li, doduc, matrix300, tomcatv, espresso; the misprediction frequencies shown range from 0% to 18%.)

  16. Branch Target Buffer (BTB)
• We need the predicted target address as well as the prediction
• Branch Target Buffer (BTB): indexed by the address of the branch instruction to get the prediction AND the branch target address (if taken)
• Lookup: the PC of the instruction being fetched is compared against the branch addresses stored in the BTB
  - match: the instruction is a branch, and the predicted PC should be used as the next PC
  - no match: the instruction is not predicted to be a branch; proceed normally
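
A minimal C sketch of the lookup just described, assuming a small direct-mapped BTB that keeps the full branch PC as a tag together with the predicted target PC; sizes and field names are illustrative.

    #include <stdbool.h>
    #include <stdint.h>

    #define BTB_ENTRIES 256                           /* illustrative size */

    struct btb_entry {
        bool     valid;
        uint32_t branch_pc;                           /* tag: address of the branch instruction */
        uint32_t target_pc;                           /* predicted next PC if the branch is taken */
    };

    static struct btb_entry btb[BTB_ENTRIES];

    /* Returns true and the predicted next PC only if the fetch PC hits in the BTB,
     * i.e. the instruction is known to be a branch and is predicted TAKEN. */
    static bool btb_lookup(uint32_t fetch_pc, uint32_t *next_pc)
    {
        const struct btb_entry *e = &btb[(fetch_pc >> 2) % BTB_ENTRIES];
        if (e->valid && e->branch_pc == fetch_pc) {
            *next_pc = e->target_pc;
            return true;
        }
        return false;                                 /* miss: not predicted to be a branch, fetch PC + 4 */
    }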

  17. Branch Target Buffer
• IF: send the PC to memory and to the branch-target buffer
  - entry found in the BTB: send out the predicted PC
  - no entry: proceed as for a normal instruction
• ID: determine whether the instruction is actually a taken branch
• EX: resolve the branch
  - not a branch: normal instruction execution
  - predicted branch, prediction correct: branch correctly predicted, no penalty
  - mispredicted branch: kill the fetched instructions, restart fetch at the other target, and delete the entry from the BTB
  - taken branch not in the BTB: enter the branch instruction address and the next PC into the BTB
• Penalties (cycles):

  in BTB?   prediction   actual   penalty
     Y          T           T        0
     Y          T          NT        2
     N          -           T        2

  18. (No transcript text for this slide.)

  19. Getting CPI < 1: Issuing Multiple Instructions per Cycle
Two variations:
• Superscalar: a varying number of instructions per cycle (1 to 8), scheduled by the compiler or by HW (Tomasulo)
  - IBM PowerPC, Sun SuperSparc, DEC Alpha, HP 7100
• Very Long Instruction Word (VLIW): a fixed number of instructions (16) scheduled by the compiler
  - Joint HP/Intel agreement in 1998?

  20. Getting CPI < 1: Issuing Multiple Instructions per Cycle
• Superscalar DLX: 2 instructions per cycle, 1 FP and 1 of anything else
  - fetch 64 bits per clock cycle: integer instruction on the left, FP instruction on the right
  - the 2nd instruction can issue only if the 1st instruction issues
  - more ports on the FP registers are needed to do an FP load and an FP op as a pair
• Pipeline timing:

  instruction type    1    2    3    4    5    6    7
  Int. instruction    IF   ID   EX   MEM  WB
  FP instruction      IF   ID   EX   MEM  WB
  Int. instruction         IF   ID   EX   MEM  WB
  FP instruction           IF   ID   EX   MEM  WB
  Int. instruction              IF   ID   EX   MEM  WB
  FP instruction                IF   ID   EX   MEM  WB

• The 1-cycle load delay expands to 3 instructions in the superscalar machine: the instruction in the load's right half cannot use the result, nor can the instructions in the next issue slot

  21. Unrolled Loop that Minimizes Stalls for Scalar
• Original loop:

  Loop: LD    F0,0(R1)
        ADDD  F4,F0,F2
        SD    0(R1),F4
        SUBI  R1,R1,#8
        BNEZ  R1,Loop

• Latencies: LD to ADDD: 1 cycle; ADDD to SD: 2 cycles
• Unrolled four times and scheduled:

   1  Loop: LD    F0,0(R1)
   2        LD    F6,-8(R1)
   3        LD    F10,-16(R1)
   4        LD    F14,-24(R1)
   5        ADDD  F4,F0,F2
   6        ADDD  F8,F6,F2
   7        ADDD  F12,F10,F2
   8        ADDD  F16,F14,F2
   9        SD    0(R1),F4
  10        SD    -8(R1),F8
  11        SUBI  R1,R1,#32
  12        SD    16(R1),F12
  13        BNEZ  R1,LOOP
  14        SD    8(R1),F16    ; 8-32 = -24

• 14 clock cycles, or 3.5 clocks per iteration

  22. Loop Unrolling in Superscalar
• Latencies: LD to ADDD: 1 cycle; ADDD to SD: 2 cycles

        integer instruction   FP instruction     clock cycle
  Loop: LD   F0,0(R1)                                  1
        LD   F6,-8(R1)                                 2
        LD   F10,-16(R1)      ADDD F4,F0,F2            3
        LD   F14,-24(R1)      ADDD F8,F6,F2            4
        LD   F18,-32(R1)      ADDD F12,F10,F2          5
        SD   0(R1),F4         ADDD F16,F14,F2          6
        SD   -8(R1),F8        ADDD F20,F18,F2          7
        SD   -16(R1),F12                               8
        SUBI R1,R1,#40                                 9
        SD   16(R1),F16                               10
        BNEZ R1,LOOP                                  11
        SD   8(R1),F20                                12

• Unrolled 5 times to avoid delays (one extra unrolling due to the 2-cycle ADDD-to-SD latency)
• 12 clocks, or 2.4 clocks per iteration

  23. Dynamic Scheduling in Superscalar
• Dependences stop instruction issue
• Code compiled for the scalar version will run poorly on the superscalar machine
• Good code for a superscalar depends on the structure of that superscalar
• Simple approach:
  - issue an integer instruction and an FP instruction to their separate reservation stations, i.e., one set for the integer FU/registers and one for the FP FU/registers
  - issue both only when the two instructions do not access the same register; instructions with a data dependence cannot be issued together (a sketch of this check follows)
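
A minimal C sketch of the pairing check described in the simple approach, using a hypothetical instruction record with one destination and two sources; real issue logic also checks reservation-station availability and distinguishes integer from FP registers, both omitted here.

    #include <stdbool.h>
    #include <stdint.h>

    struct instr {
        bool    is_fp;                                /* true = FP operation, false = integer/other */
        int8_t  dest;                                 /* destination register, -1 if none */
        int8_t  src1, src2;                          /* source registers, -1 if unused */
    };

    /* The pair may issue together only if one instruction is FP and the other is not,
     * and the second neither reads nor overwrites the register written by the first. */
    static bool can_issue_pair(const struct instr *first, const struct instr *second)
    {
        if (first->is_fp == second->is_fp)
            return false;                             /* need one integer-side and one FP instruction */
        if (first->dest >= 0 &&
            (second->src1 == first->dest ||
             second->src2 == first->dest ||
             second->dest == first->dest))
            return false;                             /* data dependence: cannot issue together */
        return true;
    }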

  24. Dynamic Scheduling in Superscalar
How can we issue two instructions to the reservation stations and still keep in-order issue for Tomasulo?
• In-order issue is kept to simplify the bookkeeping
• Run the issue stage at 2x the clock rate, so that two issues can be made within one ordinary clock cycle; issue remains in order, while execution proceeds on the ordinary clock
• Only FP loads might cause a dependence between the integer and FP issue:
  - replace the load reservation stations with a load queue; operands must be read in the order they are fetched
  - a load checks addresses in the Store Queue to avoid a RAW violation
  - a store checks addresses in the Load Queue to avoid WAR and WAW violations (a sketch of the load-side check follows)
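
A minimal C sketch of the load-side address check, assuming a small hypothetical store queue of pending (not yet performed) stores; the symmetric check of stores against the load queue works the same way.

    #include <stdbool.h>
    #include <stdint.h>

    #define SQ_SIZE 8                                 /* illustrative queue size */

    struct sq_entry {
        bool     valid;                               /* store issued but not yet written to memory */
        uint32_t addr;
    };

    static struct sq_entry store_queue[SQ_SIZE];

    /* A load may not access memory while an earlier pending store targets the same
     * address; letting it go would violate the RAW dependence through memory. */
    static bool load_must_wait(uint32_t load_addr)
    {
        for (int i = 0; i < SQ_SIZE; i++)
            if (store_queue[i].valid && store_queue[i].addr == load_addr)
                return true;
        return false;
    }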

  25. Performance of Dynamic SS
(Table: the clock cycle in which each instruction of two loop iterations issues, executes (effective address calculation for memory operations), accesses memory, and writes its result, for the loop LD F0,0(R1) / ADDD F4,F0,F2 / SD 0(R1),F4 / SUBI R1,R1,#8 / BNEZ R1,LOOP.)
• About 4 clocks per iteration
• Branches and decrements still take 1 clock cycle

  26. Limits of Superscalar
• While the integer/FP split is simple for the HW, it achieves a CPI of 0.5 only for programs with:
  - exactly 50% FP operations
  - no hazards
• If more instructions issue at the same time, decode and issue become more difficult
  - even a 2-scalar machine must examine 2 opcodes and 6 register specifiers, and decide whether 1 or 2 instructions can issue
• VLIW: trade instruction space for simple decoding
  - the long instruction word has room for many operations
  - by definition, all the operations the compiler puts in the long instruction word can execute in parallel
  - e.g., 2 integer operations, 2 FP ops, 2 memory references, 1 branch
  - 16 to 24 bits per field => 7 x 16 = 112 bits to 7 x 24 = 168 bits wide
  - needs a compiling technique that schedules across several branches

  27. Loop Unrolling in VLIW

  memory ref 1     memory ref 2     FP op 1           FP op 2           int op / branch   clock
  LD F0,0(R1)      LD F6,-8(R1)                                                              1
  LD F10,-16(R1)   LD F14,-24(R1)                                                            2
  LD F18,-32(R1)   LD F22,-40(R1)   ADDD F4,F0,F2     ADDD F8,F6,F2                          3
  LD F26,-48(R1)                    ADDD F12,F10,F2   ADDD F16,F14,F2                        4
                                    ADDD F20,F18,F2   ADDD F24,F22,F2                        5
  SD 0(R1),F4      SD -8(R1),F8     ADDD F28,F26,F2                                          6
  SD -16(R1),F12   SD -24(R1),F16                                                            7
  SD -32(R1),F20   SD -40(R1),F24                                       SUBI R1,R1,#48       8
  SD 0(R1),F28                                                          BNEZ R1,LOOP         9

• Unrolled 7 times to avoid delays
• 7 results in 9 clocks, or 1.3 clocks per iteration
• Needs more registers than the superscalar version

  28. Limits to Multi-Issue Machines
1. Inherent limitations of ILP in programs
  • 1 branch per 5 instructions => how do we keep a 5-way VLIW busy?
  • Latencies of the functional units => many operations must be in flight
  • Need about (pipeline depth) x (number of functional units) independent operations to keep the machine busy
2. Difficulties in building the HW
  • Duplicate FUs to get parallel execution
  • More ports on the register files; for the VLIW example:
    - the integer register file needs 6 reads (2 integer units, 2 LD units, 2 SD units) and 3 writes (1 integer unit, 2 LD units)
    - the FP register file needs 6 reads (4 FP units, 2 SD units) and 4 writes (2 FP units, 2 LD units)
  • More ports to memory
  • Superscalar decoding and its impact on clock rate and pipeline depth

  29. Limits to Multi-Issue Machines
3. Limitations specific to either the superscalar or VLIW implementation
  • Multiple-issue logic in the superscalar
  • VLIW code size: unrolled loops plus wasted fields in the long instruction word
  • VLIW lock step => one hazard stalls all the instructions in the word
  • VLIW binary compatibility is a practical weakness

  30. Detecting and Eliminating Dependencies
• Read Section 4.5

  31. Software Pipelining
• Observation: if the iterations of a loop are independent, then we can get ILP by taking instructions from different iterations
• Software pipelining reorganizes a loop so that each iteration of the new loop is made from instructions chosen from different iterations of the original loop ("Tomasulo in SW"); a C-level sketch of the transformation follows
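
Before the DLX version on the next slide, here is a C-level sketch of the transformation (my own illustration, not from the slides): the loop adds a scalar to each array element, and the software-pipelined version overlaps the store for iteration i, the add for iteration i-1, and the load for iteration i-2.

    /* Original loop: x[i] = x[i] + c, walking the array from the top down. */
    void add_scalar(double *x, double c, int n)
    {
        for (int i = n - 1; i >= 0; i--)
            x[i] = x[i] + c;
    }

    /* Software-pipelined version (assumes n >= 3): each trip through the loop
     * stores iteration i, adds for iteration i-1, and loads for iteration i-2. */
    void add_scalar_swp(double *x, double c, int n)
    {
        double loaded = x[n - 1];                     /* start-up code: fill the pipeline */
        double summed = loaded + c;
        loaded = x[n - 2];
        for (int i = n - 1; i >= 2; i--) {
            x[i]   = summed;                          /* store for iteration i      */
            summed = loaded + c;                      /* add   for iteration i - 1  */
            loaded = x[i - 2];                        /* load  for iteration i - 2  */
        }
        x[1] = summed;                                /* finish code: drain the pipeline */
        x[0] = loaded + c;
    }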

  32. SW Pipelining Example
• Before: unrolled 3 times

   1 LOOP: LD    F0,0(R1)
   2       ADDD  F4,F0,F2
   3       SD    0(R1),F4
   4       LD    F6,-8(R1)
   5       ADDD  F8,F6,F2
   6       SD    -8(R1),F8
   7       LD    F10,-16(R1)
   8       ADDD  F12,F10,F2
   9       SD    -16(R1),F12
  10       SUBI  R1,R1,#24
  11       BNEZ  R1,LOOP

• After: software-pipelined version of the loop

           LD    F0,0(R1)      ; start-up code
           ADDD  F4,F0,F2
           LD    F0,-8(R1)
   1 LOOP: SD    0(R1),F4      ; stores to M[i]
   2       ADDD  F4,F0,F2      ; adds to M[i-1]
   3       LD    F0,-16(R1)    ; loads from M[i-2]
   4       SUBI  R1,R1,#8
   5       BNEZ  R1,LOOP
           SD    0(R1),F4      ; finish code
           ADDD  F4,F0,F2
           SD    -8(R1),F4

• In the steady-state body, the SD reads the F4 written by the ADDD of the previous iteration, and the ADDD reads the F0 loaded two iterations earlier, so the LD-to-ADDD and ADDD-to-SD latencies are covered without stalls

  33. SW Pipelining Example: Comparison with Loop Unrolling
• Software pipelining is symbolic loop unrolling:
  - less code space
  - the start-up and finish overhead is paid only once, vs. once per unrolled loop body in loop unrolling (e.g., 100 iterations = 25 passes through a loop with 4 unrolled iterations each)
(Figure: number of overlapped operations vs. time for software pipelining and for loop unrolling.)

  34. Summary
• Branch prediction
  - Branch History Table: 2 bits per entry for accuracy on loops
  - Correlation: recently executed branches are correlated with the next branch
  - Branch Target Buffer: holds the branch target address as well as the prediction
• Superscalar and VLIW
  - CPI < 1
  - dynamic issue (superscalar) vs. static issue (VLIW)
  - the more instructions that issue at the same time, the larger the penalty of hazards
• SW pipelining
  - symbolic loop unrolling gets the most from the pipeline with little code expansion and little overhead
  - SW pipelining works when the behavior of branches is fairly predictable at compile time
