Eliminating Branch Penalty Fall 2001

hiero

Presentation Transcript


  1. Eliminating Branch Penalty, Fall 2001

  2. Pipelined ALU Instruction Datapath
     • IF: instruction fetch
     • ID: instruction decode / register fetch
     • EX: execute
     • MEM: memory access
     • WB: write back

  3. Branch Hazard Pipeline Diagram
     • Problem
       • Instruction is fetched in IF, but the branch condition is not set until MEM
     Instruction sequence:
         beq  $31, target
         addq $31, 63, $1
         addq $31, 63, $2
         addq $31, 63, $3
         addq $31, 63, $4
       target:
         addq $31, 63, $5
     [Pipeline diagram: successive instructions pass through IF, ID, EX, M, WB along the time axis; marker "PC Updated"]

  4. Stall Until Resolve Branch
     • Detect when branch is in stage ID or EX
     • Stop fetching until the branch is resolved
       • Stall IF. Inject bubble into ID
     [Datapath diagram: Instr. Mem., Reg. File, Data Mem.; stages IF, ID, EX, MEM. Stall control signals: Stall, Bubble, Transfer, Transfer, Transfer. Perform when branch is in either stage]

  5. Taken Branch Resolution
     • When branch is taken, instruction Xtra1 is still in the pipe
     • Need to flush it when the taken branch is detected in MEM
       • Convert it to a bubble
     [Datapath diagram: Instr. Mem., Reg. File, Data Mem.; stages IF, ID, EX, MEM. Stall control signals: Transfer, Bubble, Transfer, Transfer, Transfer. Perform when taken branch is detected]

  6. Taken Branch Pipeline Diagram
     • Behavior
       • Instruction Xtra1 is held in IF for two extra cycles
       • Then turned into a bubble as it enters ID
     Instruction sequence:
         beq  $31, target
         addq $31, 63, $1    # Xtra1 (diagram: IF IF IF)
       target:
         addq $31, 63, $5    # Target
     [Pipeline diagram: stages IF, ID, EX, M, WB along the time axis; marker "PC Updated"]

  7. Not Taken Branch Resolution
     • [Stall two cycles with not-taken branches as well]
     • When branch is not taken, instruction Xtra1 is already in the pipe
     • Let it proceed as usual
     [Datapath diagram: Instr. Mem., Reg. File, Data Mem.; stages IF, ID, EX, MEM. Stall control signals: Transfer, Transfer, Transfer, Transfer, Transfer]

  8. Not Taken Branch Pipeline Diagram
     • Behavior
       • Instruction Xtra1 is held in IF for two extra cycles
       • Then allowed to proceed
     Instruction sequence:
         beq  $31, target
         addq $31, 63, $1    # Xtra1 (diagram: IF IF)
         addq $31, 63, $2    # Xtra2
         addq $31, 63, $3    # Xtra3
         addq $31, 63, $4    # Xtra4
     [Pipeline diagram: stages IF, ID, EX, M, WB along the time axis; marker "PC Not Updated"]

  9. Analysis of Stalling
     • Branch Instruction Timing
       • 1 instruction cycle
       • 3 extra cycles when taken
       • 2 extra cycles when not taken
     • Performance Impact
       • Branches are 16% of instructions in SpecInt92 benchmarks
       • 67% of branches are taken
       • Adds 0.16 * (0.67 * 3 + 0.33 * 2) = 0.43 cycles to CPI
         • CPI: average number of cycles per instruction
       • Serious performance impact
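The CPI arithmetic on this slide can be checked with a short sketch. The function name `branch_cpi_penalty` and its parameters are illustrative and not part of the original slides; the numbers are the ones quoted above.

```python
def branch_cpi_penalty(branch_frac, taken_frac, taken_cycles, not_taken_cycles):
    """Average extra cycles per instruction caused by branches.

    branch_frac      -- fraction of all instructions that are branches
    taken_frac       -- fraction of branches that are taken
    taken_cycles     -- extra cycles paid by a taken branch
    not_taken_cycles -- extra cycles paid by a not-taken branch
    """
    per_branch = taken_frac * taken_cycles + (1.0 - taken_frac) * not_taken_cycles
    return branch_frac * per_branch


# Figures from the slide: 16% branches, 67% taken, 3 / 2 extra cycles.
stall = branch_cpi_penalty(0.16, 0.67, 3, 2)
print(f"stall-until-resolved penalty: {stall:.2f} cycles per instruction")
```

Changing the two cycle counts gives the penalty for other schemes, under the same assumed branch mix.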

  10. Fetch & Cancel When Taken
      • Instruction does not cause any updates until MEM or WB stages
      • Instruction can be “cancelled” from pipe up through EX stage
        • Replace with bubble
      • Strategy
        • Continue fetching under the assumption that the branch is not taken
        • If the branch is taken, cancel the undesired instructions
      [Datapath diagram: Instr. Mem., Reg. File, Data Mem.; stages IF, ID, EX, MEM. Stall control signals: Transfer, Bubble, Bubble, Bubble, Transfer. Perform when taken branch is detected]

  11. What’s the Problem?
      • Need to know destination of branch to fetch next instruction
      [Pipeline diagram: stages Instruction Fetch ("Need address here"), Decode, Execute ("Compute address here"), Memory Access, Writeback; the span between them is labeled "Branch Delay"]
      Instructions shown in the pipe:
          bne r2, #0, r3
          sub r7, r8, r9
          add r4, r5, r6

  12. Delayed Branches
      • One or more instructions after the branch get executed regardless of whether the branch is taken
      • Exposes branch delay to compiler
      Without delay slots:
          add r2, r3, r4
          sub r7, r8, r9
          bne r5, #0, r10
          (stall)
          (stall)
          div r2, r1, r7
      With delay slots filled:
          bne r5, #0, r10
          add r2, r3, r4    (delay)
          sub r7, r8, r9    (delay)
          div r2, r1, r7

  13. Delayed Branching - Pros/Cons
      • Pros:
        • Low hardware cost
      • Cons:
        • Depends on compiler to fill delay slots
          • Ability to fill delay slots drops as # of slots increases
        • Exposes implementation details to compiler
          • Can’t change pipeline without breaking software compatibility
          • Can’t add to existing architecture and retain compatibility

  14. Prediction
      • Basic Idea: Predict which way the branch will go, start executing down that path
      Instructions:
          bne r2, #0, r4
          add r3, r4, r5
          sub r7, r8, r9
          mul r2, r4, r6
          lsh r5, r2, r1
          add r8, r8, r9
          sub r2, r4, r6

      Cycle  IF            ID            EX            MEM           WB
      n      bne r2,#0,r4
      n+1    add r3,r4,r5  bne r2,#0,r4
      n+2    sub r7,r8,r9  add r3,r4,r5  bne r2,#0,r4
      n+3    add r8,r8,r9  sub r7,r8,r9  (squash)      bne r2,#0,r4
      n+4    sub r2,r4,r6  add r8,r8,r9  (squash)      (squash)      bne r2,#0,r4
      n+5                  sub r2,r4,r6  add r8,r8,r9  (squash)      (squash)
      n+6                                sub r2,r4,r6  add r8,r8,r9  (squash)
      n+7                                              sub r2,r4,r6  add r8,r8,r9
      n+8                                                            sub r2,r4,r6

  15. Branches Limit ILP
      • Programs average about 5 instructions between branches
        • Can’t issue instructions if you don’t know where the program is going
      • Current processors issue 4-6 operations/cycle

  16. Branch Prediction Techniques

  17. Compiler-Static Prediction
      • Predict at compile time, before execution, whether branches will be taken
      • Schemes
        • Predict taken
          • Would be hard to squeeze into our pipeline
          • Can’t compute target until ID
        • Backwards taken, forwards not taken (good performance for loops)
          • Predict based on sign of displacement
          • Exploits fact that loops are usually closed with backward branches
      • No run-time adaptation: bad performance for data-dependent branches
          if (a == 0) b = 3; else b = 4;
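As a rough illustration of the "backwards taken, forwards not taken" scheme, the sketch below predicts from the sign of the displacement alone and scores it on two made-up traces. The traces, displacement values, and function names are hypothetical and chosen only for illustration; they do not come from the slides.

```python
def predict_taken(displacement):
    """Static rule: backward branches (negative displacement) are predicted taken."""
    return displacement < 0


def accuracy(trace):
    """Fraction of (displacement, actually_taken) pairs predicted correctly."""
    hits = sum(predict_taken(disp) == taken for disp, taken in trace)
    return hits / len(trace)


# Hypothetical loop-closing branch: taken 9 times, then falls through once.
loop_trace = [(-16, True)] * 9 + [(-16, False)]

# Hypothetical forward, data-dependent branch (like the if/else above)
# whose outcome alternates with the data.
if_trace = [(8, i % 2 == 0) for i in range(10)]

print("loop branch accuracy:", accuracy(loop_trace))
print("data-dependent branch accuracy:", accuracy(if_trace))
```

The loop branch is mispredicted only on loop exit, while the data-dependent branch is right only as often as the data happens to favor the fixed guess.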

  18. Hardware-Dynamic Prediction • Use run-time information to make prediction

  19. Some Interesting Patterns
      • TTTTTTTTTT
        • +1 +1 +1 +1 +1 +1 +1 +1 +1 +1 +1 +1 +1 +1 +1 +1 +1 +1 +1 +1 +1 +1 +1 +1 …
        • Should give perfect prediction
      • RRRRRRRRR
        • -1 -1 +1 +1 +1 +1 -1 +1 -1 -1 +1 +1 -1 -1 +1 +1 +1 +1 +1 -1 -1 -1 +1 -1 …
        • Will mispredict 1/2 of the time
      • N*N[TNTN]
        • -1 -1 -1 -1 -1 -1 -1 -1 +1 -1 +1 -1 +1 -1 +1 -1 +1 -1 +1 -1 +1 -1 +1 -1 …
        • Should alternate incorrectly
      • N*T[TNTN]
        • -1 -1 -1 -1 -1 -1 -1 +1 +1 -1 +1 -1 +1 -1 +1 -1 +1 -1 +1 -1 +1 -1 +1 -1 …
        • Should alternate incorrectly
      • N*N[TTNN]
        • -1 -1 -1 -1 -1 -1 -1 -1 +1 +1 -1 -1 +1 +1 -1 -1 +1 +1 -1 -1 +1 +1 -1 -1 …
      • N*T[TTNN]
        • -1 -1 -1 -1 -1 -1 -1 +1 +1 +1 -1 -1 +1 +1 -1 -1 +1 +1 -1 -1 +1 +1 -1 -1 …
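The slide does not say which predictor these patterns are meant to exercise. As one possible reading, the sketch below runs a single 2-bit saturating counter (an assumption, not stated on the slide) over three of the patterns, with taken written as "T" and not taken as "N". All names are illustrative.

```python
def two_bit_accuracy(outcomes, counter=0):
    """Fraction of outcomes predicted correctly by one 2-bit saturating counter.

    counter ranges over 0..3; the branch is predicted taken when counter >= 2.
    """
    hits = 0
    for taken in outcomes:
        if (counter >= 2) == taken:
            hits += 1
        # Move toward 3 on taken, toward 0 on not taken, saturating at the ends.
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return hits / len(outcomes)


def pattern(text):
    """Turn a string such as 'TNTN' into a list of booleans (True = taken)."""
    return [ch == "T" for ch in text]


always_taken = pattern("T" * 24)
alternating = pattern("TN" * 12)   # the [TNTN] part, after a long run of N
pairs = pattern("TTNN" * 6)        # the [TTNN] part, after a long run of N

print("TTTT...:", two_bit_accuracy(always_taken, counter=3))
print("TNTN...:", two_bit_accuracy(alternating, counter=0))
print("TTNN...:", two_bit_accuracy(pairs, counter=0))
```

Starting the counter at 0 models the long not-taken prefix (N*) that precedes the repeating part of each pattern.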

  20. Pentium Branch Prediction
      • Branch Target Buffer
        • Stores information about previously executed branches
        • Indexed by instruction address
        • Specifies branch destination + whether or not taken
        • 256 entries
      • Branch Processing
        • Look for instruction in BTB
        • If found, start fetching at destination
        • Branch condition resolved early in WB
          • If prediction correct, no branch penalty
          • If prediction incorrect, lose ~3 cycles
            • Which corresponds to > 3 instructions
        • Update BTB
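The lookup and update steps above can be sketched as a small table keyed by instruction address. This is a simplified model under stated assumptions: a direct-mapped table indexed by address modulo the entry count, with one taken bit per entry. The slide gives only the entry count and the stored fields, so the class name, the indexing, and the replacement behavior here are hypothetical.

```python
class BranchTargetBuffer:
    """Simplified BTB: direct-mapped, one (tag, target, taken) entry per slot."""

    def __init__(self, entries=256):
        self.entries = entries
        self.table = {}  # slot index -> (branch address, target address, taken bit)

    def lookup(self, pc):
        """Return the predicted target if pc hits and was last taken, else None."""
        entry = self.table.get(pc % self.entries)
        if entry is not None and entry[0] == pc and entry[2]:
            return entry[1]
        return None  # miss or predicted not taken: keep fetching sequentially

    def update(self, pc, target, taken):
        """Record the resolved outcome, replacing whatever shared the slot."""
        self.table[pc % self.entries] = (pc, target, taken)


btb = BranchTargetBuffer()
first = btb.lookup(0x400)          # never seen: no prediction
btb.update(0x400, 0x480, True)     # branch resolved as taken
second = btb.lookup(0x400)         # now predicts the recorded destination
btb.update(0x400 + 256, 0x900, True)
third = btb.lookup(0x400)          # evicted by a branch mapping to the same slot
```

In this model a miss and a not-taken prediction look the same to the fetch stage, since both continue with the next sequential instruction.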

  21. Pentium II Operation
      • Translates instructions dynamically into “Uops”
        • 118 bits wide
        • Holds operation, two sources, and destination
      • Executes Uops with “Out of Order” engine
        • Uop executed when
          • Operands available
          • Functional unit available
        • Execution controlled by “Reservation Stations”
          • Keeps track of data dependencies between uops
          • Allocates resources
      • Features
        • Executes operations in parallel
          • Up to 5 at once
        • Very deep pipeline
          • 12–18 cycle latency

  22. Pentium II Branch Prediction
      • Two-Level Scheme
        • Yeh & Patt, ISCA ’93
        • Keep shift register showing past k outcomes for branch
        • Use to index 2^k-entry table
        • Each entry provides a 2-bit saturating counter predictor
      • Very effective for any deterministic branching pattern
      [Figure source: Microprocessor Report, March 27, 1995]
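A minimal sketch of the two-level idea for a single branch follows: a k-bit shift register of recent outcomes selects one of 2^k 2-bit saturating counters. The class name, the initial counter value, and the test pattern are assumptions for illustration, not details taken from the slide or from the Pentium II.

```python
class TwoLevelPredictor:
    """One branch: k-bit outcome history indexing 2**k two-bit counters."""

    def __init__(self, k=4):
        self.k = k
        self.history = 0                 # last k outcomes, newest in bit 0
        self.counters = [1] * (1 << k)   # assumed start: weakly not taken

    def predict(self):
        return self.counters[self.history] >= 2

    def update(self, taken):
        c = self.counters[self.history]
        self.counters[self.history] = min(c + 1, 3) if taken else max(c - 1, 0)
        # Shift the new outcome into the history register, keeping k bits.
        self.history = ((self.history << 1) | int(taken)) & ((1 << self.k) - 1)


def run(predictor, outcomes):
    """Predict then train on each outcome; return the number of correct predictions."""
    hits = 0
    for taken in outcomes:
        hits += predictor.predict() == taken
        predictor.update(taken)
    return hits


outcomes = [i % 2 == 0 for i in range(200)]   # alternating taken / not taken
p = TwoLevelPredictor(k=4)
warm = run(p, outcomes[:100])      # includes the learning phase
steady = run(p, outcomes[100:])    # after the counters have settled
print("correct in first 100:", warm, "correct in last 100:", steady)
```

Once each history value has been seen a few times, the alternating pattern is predicted without error, which a single counter per branch cannot do.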

  23. Pentium II Branch Prediction
      • Critical to Performance
        • 11–15 cycle penalty for misprediction
      • Branch Target Buffer
        • 512 entries
        • 4 bits of history
        • Adaptive algorithm
          • Can recognize repeated patterns, e.g., alternating taken–not taken
      • Handling BTB misses
        • Detect in cycle 6
        • Predict taken for negative offset, not taken for positive
          • Loops vs. conditionals

  24. PAg Prediction
      [Diagram: Branch Address selects an entry in the Per-Address Branch History Table, which indexes the Global Pattern History Table]

  25. PAp Prediction
      [Diagram: Branch Address selects an entry in the Per-Address Branch History Table, which indexes the Per-Address Pattern History Table]

  26. Pros/Cons of Two-Level Branch Prediction
      • Pros:
        • Predicts correlated branch behavior that breaks other predictors
          • eqntott example
        • Better overall performance than purely address-based predictors
      • Cons:
        • Interference between unrelated branches with same history
          • Example: all loop-end branches will map to same entry in pattern history table
          • Sometimes this is a good thing, sometimes this is a bad thing
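The interference point can be illustrated by running the same two branches through a PAg-style and a PAp-style predictor. In this sketch both keep a per-address history register; the only difference is whether the 2-bit counters are shared across branches (PAg) or kept per branch (PAp). The branch addresses, patterns, history length, and names are hypothetical.

```python
from collections import defaultdict


class TwoLevel:
    """Per-address history; pattern counters shared (PAg) or per branch (PAp)."""

    def __init__(self, k, per_address_pht):
        self.k = k
        self.per_address_pht = per_address_pht
        self.history = defaultdict(int)          # branch address -> last k outcomes
        self.counters = defaultdict(lambda: 1)   # pattern key -> 2-bit counter

    def _key(self, pc):
        h = self.history[pc]
        return (pc, h) if self.per_address_pht else h

    def predict(self, pc):
        return self.counters[self._key(pc)] >= 2

    def update(self, pc, taken):
        key = self._key(pc)
        c = self.counters[key]
        self.counters[key] = min(c + 1, 3) if taken else max(c - 1, 0)
        mask = (1 << self.k) - 1
        self.history[pc] = ((self.history[pc] << 1) | int(taken)) & mask


def steady_hits(pred, steps=300, warmup=60):
    """Interleave two branches; count correct predictions after the warm-up."""
    hits = 0
    for i in range(steps):
        # Branch 0x100 repeats taken, taken, not taken; branch 0x200 is always taken.
        for pc, taken in ((0x100, i % 3 != 2), (0x200, True)):
            if i >= warmup:
                hits += pred.predict(pc) == taken
            pred.update(pc, taken)
    return hits


pag = steady_hits(TwoLevel(k=2, per_address_pht=False))
pap = steady_hits(TwoLevel(k=2, per_address_pht=True))
print("PAg correct:", pag, "PAp correct:", pap, "out of", 2 * (300 - 60))
```

Both branches reach the history "taken, taken" but disagree about what comes next, so the shared counter in the PAg-style table keeps getting pulled toward the always-taken branch, while the PAp-style table learns each branch separately at the cost of more counters.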

  27. Comparing Predictors
      • Papers present two comparisons:
        • Highest prediction rate for given history register length
          • PAp wins (much more total hardware)
        • Least hardware for 97% accuracy
          • PAg wins
      • Note: PAp and PAg do less well if you consider context switches, as they have more state to re-load

  28. Branch Prediction Comparisons • Microprocessor Report March 27, 1995

  29. 21264 Branch Prediction Logic
      • Purpose: Predict whether or not branch taken
      • 35Kb of prediction information
      • 2% of total die size
      • Claim 0.7–1.0% misprediction

  30. Future Directions

  31. Future Directions
