Goal: Reduce the Penalty of Control Hazards

##### Presentation Transcript

1. Goal: Reduce the Penalty of Control Hazards
   • There are 2 problems with branches:
     • Their existence: The pipeline can’t fetch the next instruction until it knows whether the current instruction is a branch.
     • Their outcome (for conditional branches): Until the condition is resolved, the next instruction’s PC isn’t known.
   • Solution: Always assume the branch isn’t taken. The next instruction fetched is the one at PC + 4.
   • Problem: What if we are mostly wrong?
   Computer Architecture- Branch Prediction

2. The problem with branches is that we can’t continue fetching instructions if there is a chance that the instruction currently being decoded is a branch. It would seem that we have to combine the IF and ID stages of the pipeline into one long stage.
   • Even if we did this, we would still have to wait until the EX stage to know whether the branch is taken, which is another delay.
   • Assuming the branch isn’t taken allows the pipeline to continue fetching instructions.
   • However, if the branch is taken we have to discard the fetched instructions and fetch new ones from the branch target. This is expensive.
   • Is assuming not taken smart? Think about error-checking code: most ifs aren’t taken. But what about loops? The branch that closes a loop is taken on every iteration except the last.

3. Reducing the Delay of Branches
   • Problem: The decision to branch is made in the EX stage, after the registers are compared.
   • Solution: Perform the comparison in the ID stage, so that a wrong prediction costs only 1 cycle.
   • Problem: The ALUs needed are used in the EX stage.
   • Solution: Move the branch target ALU to the ID stage. Compare the registers using XOR, not subtraction.

4. There are two parts to a branch: computing the branch target and resolving the condition.
   • The branch target can be computed in the ID stage. The current PC and the immediate offset (in the instruction) are available there. We just have to move the ALU that performs this from the EX stage to the ID stage.
   • Comparing the registers is slightly more complicated. Their values have to be read from the register file (RF) and then subtracted. This might take too long for a single cycle.
   • But do their values have to be subtracted? XORing the two values gives all 0s if they are equal. ORing all the result bits then gives 0 for equal and 1 for unequal. This is faster than a subtraction, which has to propagate carries from one bit position to the next.
   • Thus it is possible to read from the RF and perform an equality test in the same stage.
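The XOR-based equality test described above can be sketched in C. This is only a software model of the gate-level idea: the function names `or_reduce` and `regs_equal` are made up for illustration, and the loop stands in for what would be a tree of OR gates in hardware.

```c
#include <stdint.h>

/* OR-reduce: returns 1 if any bit of x is set, 0 otherwise.
   In hardware this is a tree of OR gates; the loop only models it. */
int or_reduce(uint32_t x)
{
    int any = 0;
    for (int bit = 0; bit < 32; bit++)
        any |= (int)((x >> bit) & 1u);
    return any;
}

/* Returns 1 when the two register values are equal, 0 otherwise.
   No subtraction, and therefore no carry chain, is involved. */
int regs_equal(uint32_t rs, uint32_t rt)
{
    uint32_t diff = rs ^ rt;   /* all zeros if and only if rs == rt */
    return !or_reduce(diff);
}
```

For a `beq`, the branch is taken when `regs_equal` returns 1; for a `bne`, when it returns 0.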

5. Branch in ID Stage

6. Branch Hazard Example
   • The following code contains a branch hazard:

         36 sub $10,$4,$8
         40 beq $1,$3,7        (40 + 4 + 7*4 = 72)
         44 and $12,$2,$5
         48 or  $13,$2,$6
         . . .
         72 lw  $4,50($7)

   • By default the instruction at PC=44 is fetched. If the contents of registers 1 and 3 are equal, the next instruction should be the one at PC=72.
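The target arithmetic in this example (40 + 4 + 7*4 = 72) can be written as a short C sketch. The function names are illustrative, and the code assumes the MIPS convention used on the slide: a signed word offset relative to the instruction after the branch.

```c
#include <stdint.h>

/* Target of a PC-relative branch: the immediate counts instructions
   (4-byte words) relative to the instruction after the branch. */
uint32_t branch_target(uint32_t branch_pc, int16_t offset_words)
{
    return branch_pc + 4u + (uint32_t)((int32_t)offset_words * 4);
}

/* Next PC of a beq once its condition is known. */
uint32_t next_pc_beq(uint32_t branch_pc, int16_t offset_words,
                     uint32_t rs, uint32_t rt)
{
    return (rs == rt) ? branch_target(branch_pc, offset_words)
                      : branch_pc + 4u;   /* fall through */
}
```

With the values from the slide, `branch_target(40, 7)` is 72, and `next_pc_beq` returns 44 whenever the two register values differ.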

7. Branch Direction Predicted
   [Figure: instructions in the pipeline, from the IF stage onward: and $12,$2,$5; beq $1,$3,7; sub $10,$4,$8]

8. Instruction Flushed, Branch Target Fetched
   [Figure: instructions in the pipeline, from the IF stage onward: lw $4,50($7); bubble; beq $1,$3,7; sub $10,$4,$8]

9. Dynamic Branch Prediction
   • Check whether the branch was taken the last time it was executed.
   • Should be at least as good as predict not taken.
   • A Branch Prediction Buffer or Branch History Table (BHT) is a small memory indexed by the LSBs of the branch’s PC.
   • Each entry contains a bit that is set (1) when the branch is taken and reset (0) when it isn’t.
   • During the IF stage the BHT is accessed, and the next instruction is fetched according to the bit.
   • There are several problems with this approach!

10. Branch History Table (BHT)
   • The 2 LSBs of the PC are always 0 (why?).
   • The next log2(n) bits index the table, where n is the number of entries in the BHT.
   • In the case shown, bits 2-4 index the BHT.
   [Figure: the PC 1101 . . . 0010101 00 indexing a BHT; the 3 index bits select one of 8 entries, which hold 1 0 1 1 0 0 1 1]

11. The first problem is that we don’t store the PC in the BHT. Thus it is possible that another branch at another address has set the bit.
   • For instance, suppose the BHT contains 64 entries:

         80   beq $1,$3,19      # mapped to entry 20
         . . .
         336  bne $7,$9,65      # mapped to entry 20
         . . .
         2640 beq $23,$0,-12    # mapped to entry 20

   • All of the above addresses map to the same entry: (PC/4) % 64 = 20.
   • So what! The prediction is just a guess; it has to be validated in the ID stage anyhow.
   • Enlarging the BHT solves most of the aliasing problem.
   • The second problem is with loops. The technique will always mispredict twice on a loop, even if the loop is executed frequently. See the next slide for elaboration.
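A 1-bit BHT and the aliasing it allows can be modeled with a few lines of C. This is a minimal sketch, not a description of any particular processor: the type `Bht1` and the function names are invented for illustration, and the 64-entry size is taken from the example above.

```c
#include <stdint.h>

#define BHT_ENTRIES 64u   /* size from the example; must be a power of two */

/* One prediction bit per entry: 1 = taken last time, 0 = not taken. */
typedef struct {
    uint8_t taken[BHT_ENTRIES];
} Bht1;

/* Drop the two always-zero LSBs, then keep log2(BHT_ENTRIES) bits.
   This is the same as (PC/4) % 64 for a 64-entry table. */
unsigned bht_index(uint32_t pc)
{
    return (pc >> 2) & (BHT_ENTRIES - 1u);
}

/* IF stage: predict the direction of the branch at pc. */
int bht_predict(const Bht1 *t, uint32_t pc)
{
    return t->taken[bht_index(pc)];
}

/* After the branch is resolved: remember its outcome. */
void bht_update(Bht1 *t, uint32_t pc, int was_taken)
{
    t->taken[bht_index(pc)] = was_taken ? 1 : 0;
}
```

Because no PC is stored in the table, recording a taken outcome for the branch at address 80 also changes the prediction for the branches at 336 and 2640, since all three index entry 20.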

12. Loops and Prediction
   • Look at the following loop:

         for(i=0;i<10;i++)
           for(j=0;j<10;j++)
             . . .

   • In assembly language:

         L1: . . .
         L2: . . .
             bne $1,$3,L2
             . . .
             bne $5,$7,L1

   • The inner branch mispredicts on the first and last of its 10 iterations, resulting in an 80% hit ratio. The miss at the end of the loop is inevitable, but the first one isn’t.
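The 80% figure can be checked with a small simulation of the inner branch under a 1-bit predictor. The sketch below assumes the prediction bit starts as "not taken" and that no other branch aliases to the same entry; the function name is hypothetical.

```c
/* Counts mispredictions of the inner-loop branch under a 1-bit predictor. */
int inner_mispredicts_1bit(int outer_iters, int inner_iters)
{
    int last_taken = 0;   /* the single history bit, initially "not taken" */
    int misses = 0;

    for (int i = 0; i < outer_iters; i++) {
        for (int j = 0; j < inner_iters; j++) {
            /* The bne at the bottom of the inner loop is taken on every
               iteration except the last one. */
            int taken = (j != inner_iters - 1);
            if (last_taken != taken)
                misses++;
            last_taken = taken;   /* remember only the latest outcome */
        }
    }
    return misses;
}
```

For the loop on the slide, `inner_mispredicts_1bit(10, 10)` counts 20 misses out of 100 executions of the branch, which is the 80% hit ratio quoted above.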

13. 2-bit Prediction Scheme
   • Record the last 2 branch decisions. Change the predicted direction only after two consecutive mispredictions.
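One common way to realize this rule is a 2-bit saturating counter. Whether the slide's state diagram (not preserved in this transcript) is exactly this machine is an assumption, and the function names and the `initial_counter` parameter are illustrative only. The sketch replays the nested loop from the previous slide.

```c
/* 2-bit saturating counter: values 0 and 1 predict not taken,
   values 2 and 3 predict taken. */
int counter_predict(int counter)
{
    return counter >= 2;
}

/* Move one step toward 3 on a taken branch, toward 0 otherwise. */
int counter_update(int counter, int taken)
{
    if (taken)
        return counter < 3 ? counter + 1 : 3;
    return counter > 0 ? counter - 1 : 0;
}

/* Counts mispredictions of the inner-loop branch of the nested loop. */
int inner_mispredicts_2bit(int outer_iters, int inner_iters,
                           int initial_counter)
{
    int counter = initial_counter;
    int misses = 0;

    for (int i = 0; i < outer_iters; i++) {
        for (int j = 0; j < inner_iters; j++) {
            int taken = (j != inner_iters - 1);   /* last iteration exits */
            if (counter_predict(counter) != taken)
                misses++;
            counter = counter_update(counter, taken);
        }
    }
    return misses;
}
```

Under these assumptions, `inner_mispredicts_2bit(10, 10, 0)` counts 12 misses (3 in the first pass while the counter warms up, then 1 per pass at loop exit), compared with the 20 implied by the 80% hit ratio of the 1-bit scheme.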

14. Success Rate of a 2-bit BHT
   • The mispredict rate is very low for loop-intensive FP applications (0-3%) but higher for integer applications (10-18%), even with a very large BHT (4K entries).
   • The following code is very hard to predict:

         if(aa == 2) aa = 0;
         if(bb == 2) bb = 0;
         if(aa != bb) . . .;    // never true when the 1st & 2nd ifs are both true

   • Solution: Use a 2-level predictor. Record the m previous branch outcomes, so that each branch has 2^m n-bit predictors, one per history pattern. This is called an (m,n) BHT.

15. (2,2) Branch History Table (BHT)
   • Each time a branch is fetched, one of 4 BHT banks is accessed according to the global branch history.
   • If the previous two branches were not taken and then taken, the 2nd bank is accessed and updated.
   • Aliasing is a problem, but who cares?
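A (2,2) predictor can be sketched in C as four banks of 2-bit counters selected by a 2-bit global history register. Everything named here (`CorrelatingBht`, `cbht_predict`, `cbht_update`, the 16 entries per bank, the saturating-counter update) is an assumption made for illustration; the slide only fixes the idea of choosing a bank by the last two outcomes.

```c
#include <stdint.h>

#define HISTORY_BITS 2u                /* m: outcomes remembered globally  */
#define BANKS (1u << HISTORY_BITS)     /* 2^m banks                        */
#define ENTRIES 16u                    /* per bank; illustrative, power of 2 */

typedef struct {
    uint8_t  counter[BANKS][ENTRIES];  /* n = 2-bit saturating counters    */
    unsigned history;                  /* last m outcomes, newest in bit 0 */
} CorrelatingBht;

unsigned entry_index(uint32_t pc)
{
    return (pc >> 2) & (ENTRIES - 1u);
}

/* The global history picks the bank, the PC picks the entry. */
int cbht_predict(const CorrelatingBht *p, uint32_t pc)
{
    return p->counter[p->history][entry_index(pc)] >= 2;
}

/* Update the counter that made the prediction, then shift in the outcome. */
void cbht_update(CorrelatingBht *p, uint32_t pc, int taken)
{
    uint8_t *c = &p->counter[p->history][entry_index(pc)];
    if (taken) { if (*c < 3) (*c)++; }
    else       { if (*c > 0) (*c)--; }
    p->history = ((p->history << 1) | (taken ? 1u : 0u)) & (BANKS - 1u);
}

/* Feeds a branch that alternates taken / not taken and counts misses. */
int alternating_mispredicts(CorrelatingBht *p, uint32_t pc, int n)
{
    int misses = 0;
    for (int k = 0; k < n; k++) {
        int taken = (k % 2 == 0);
        if (cbht_predict(p, pc) != taken)
            misses++;
        cbht_update(p, pc, taken);
    }
    return misses;
}
```

With this encoding, the history "not taken, then taken" is the value 1, which selects the 2nd bank as on the slide. Starting from a zeroed structure, a strictly alternating branch mispredicts only 3 times in 100 executions, because each history pattern trains its own counter.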

16. The (m,n) BHT scheme isn’t easy to understand. The idea is that we don’t look only at the history of the current branch, but at the history of a branch that was reached in a specific way.
   • There is no guarantee that the global branch history matches the exact path taken by the code.
   • So what? We are only predicting the outcome of a branch; it will be verified in the ID stage in any case.
   • Using a 2-level scheme results in branch misprediction rates of less than 10% for almost all applications.
   • Thus when writing code you should be aware that:
     • Branches can be expensive; they result in control hazards.
     • A seldom-taken or often-taken branch will be predicted with a high degree of accuracy.
     • Branches that can go either way cause the most problems.

17. The Branch Target
   • When a branch is assumed not taken, the next PC is known immediately: it is PC + 4.
   • Problem: Branch prediction (BP) is performed in the IF stage, but the branch target is only computed in the ID stage. For a branch predicted taken, the benefit of a low misprediction rate is lost, because the fetch still has to wait for the target.
   • Solution: Save the branch target with the branch history. This buffer is called the Branch Target Buffer (BTB).
   • The branch target is read from the BTB during the IF stage.
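A BTB lookup in the IF stage might look like the following C sketch. The layout is an assumption: the stored `branch_pc` tag, the single `predict_taken` bit, and the 16-entry size are illustrative choices, not details taken from the slides.

```c
#include <stdint.h>

#define BTB_ENTRIES 16u   /* illustrative size; must be a power of two */

typedef struct {
    int      valid;
    uint32_t branch_pc;      /* assumed tag: full PC of the branch       */
    uint32_t target;         /* target saved when the branch resolved    */
    int      predict_taken;  /* history kept alongside the target        */
} BtbEntry;

typedef struct {
    BtbEntry entry[BTB_ENTRIES];
} Btb;

/* IF stage: predicted next PC for the instruction at pc. */
uint32_t btb_next_pc(const Btb *b, uint32_t pc)
{
    const BtbEntry *e = &b->entry[(pc >> 2) & (BTB_ENTRIES - 1u)];
    if (e->valid && e->branch_pc == pc && e->predict_taken)
        return e->target;    /* target available without waiting for ID */
    return pc + 4u;          /* miss or predicted not taken              */
}

/* Called once the branch is resolved, to record outcome and target. */
void btb_update(Btb *b, uint32_t pc, int taken, uint32_t target)
{
    BtbEntry *e = &b->entry[(pc >> 2) & (BTB_ENTRIES - 1u)];
    e->valid = 1;
    e->branch_pc = pc;
    e->target = target;
    e->predict_taken = taken ? 1 : 0;
}
```

After `btb_update(&b, 40, 1, 72)`, a fetch at PC 40 is redirected to 72 in the same cycle, while a different instruction that happens to index the same entry still falls through to PC + 4 because its tag does not match.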

18. The Branch Target Buffer (BTB)

19. Delay Slots

20. The compiler can try to help reduce the penalty of control hazards by reordering the code.
   • If it is known that there is a one-cycle delay between a branch instruction and the next instruction, the compiler will try to fill that slot with an instruction from before the branch that the branch doesn’t depend on. This is the best case.
   • If this is impossible, an instruction from the branch target or from the “fall through” path is scheduled:
     • An instruction from the branch target is used if the branch is backward (usually a loop).
     • An instruction from the fall through is used if the branch is forward.
     • This instruction must be OK to execute even if the branch goes in the unexpected direction, for instance if the register it writes will be overwritten in any case.
   • Branch delay slots are being used less as branch predictors get better.