Eliminating Branch Penalty
Fall 2001
 
                
Pipelined ALU Instruction Datapath
• IF: instruction fetch
• ID: instruction decode / register fetch
• EX: execute
• MEM: memory access
• WB: write back
Branch Hazard Pipeline Diagram
• Problem
  • Instruction fetched in IF, branch condition set in MEM
[Pipeline diagram: successive instructions pass through IF ID EX M WB, each offset by one cycle along the Time axis; label "PC Updated"]

        beq  $31, target
        addq $31, 63, $1
        addq $31, 63, $2
        addq $31, 63, $3
        addq $31, 63, $4
target: addq $31, 63, $5
Stall Until Resolve Branch
• Detect when branch in stages ID or EX
• Stop fetching until the branch is resolved
  • Stall IF; inject bubble into ID
[Datapath figure: Instr. Mem., Reg. File, Data Mem.; stall control signals across IF, ID, EX, MEM: Stall, Bubble, Transfer, Transfer, Transfer. Perform when branch is in either stage.]
Taken Branch Resolution
• When branch taken, still have instruction Xtra1 in pipe
• Need to flush it when a taken branch is detected in MEM
  • Convert it to bubble
[Datapath figure: Instr. Mem., Reg. File, Data Mem.; stall control signals across IF, ID, EX, MEM: Transfer, Bubble, Transfer, Transfer, Transfer. Perform when a taken branch is detected.]
Taken Branch Pipeline Diagram
• Behavior
  • Instruction Xtra1 held in IF for two extra cycles
  • Then turned into a bubble as it enters ID
[Pipeline diagram along the Time axis: beq and the target instruction each pass through IF ID EX M WB; Xtra1 shows IF IF IF; label "PC Updated"]

        beq  $31, target
        addq $31, 63, $1    # Xtra1
target: addq $31, 63, $5    # Target
Not Taken Branch Resolution
• [Stall two cycles with not-taken branches as well]
• When branch not taken, already have instruction Xtra1 in pipe
• Let it proceed as usual
[Datapath figure: Instr. Mem., Reg. File, Data Mem.; stall control signals across IF, ID, EX, MEM: Transfer, Transfer, Transfer, Transfer, Transfer.]
Not Taken Branch Pipeline Diagram
• Behavior
  • Instruction Xtra1 held in IF for two extra cycles
  • Then allowed to proceed
[Pipeline diagram along the Time axis: each instruction passes through IF ID EX M WB; Xtra1 shows two extra IF cycles; label "PC Not Updated"]

        beq  $31, target
        addq $31, 63, $1    # Xtra1
        addq $31, 63, $2    # Xtra2
        addq $31, 63, $3    # Xtra3
        addq $31, 63, $4    # Xtra4
Analysis of Stalling
• Branch Instruction Timing
  • 1 instruction cycle
  • 3 extra cycles when taken
  • 2 extra cycles when not taken
• Performance Impact
  • Branches are 16% of instructions in SpecInt92 benchmarks
  • 67% of branches are taken
  • Adds 0.16 × (0.67 × 3 + 0.33 × 2) ≈ 0.43 cycles to CPI
    • CPI: average number of cycles per instruction
  • Serious performance impact
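The CPI figure above can be reproduced with a short calculation. This is a minimal sketch; the function and parameter names are illustrative and do not come from the slides.

```python
def extra_cpi(branch_frac, taken_frac, taken_penalty, not_taken_penalty):
    """Average extra cycles per instruction caused by branch stalls."""
    not_taken_frac = 1.0 - taken_frac
    # Expected penalty for one branch, weighted by how often it is taken.
    per_branch = taken_frac * taken_penalty + not_taken_frac * not_taken_penalty
    # Only a fraction of all instructions are branches.
    return branch_frac * per_branch


# Numbers from the slide: 16% branches, 67% taken, 3 / 2 extra cycles.
stall_cpi = extra_cpi(0.16, 0.67, 3, 2)
print(f"{stall_cpi:.2f}")  # 0.43
```

Changing the two penalty arguments gives a quick way to compare other branch-handling schemes under the same branch statistics.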
Fetch & Cancel When Taken
• Instruction does not cause any updates until MEM or WB stages
• Instruction can be “cancelled” from pipe up through EX stage
  • Replace with bubble
• Strategy
  • Continue fetching under assumption that branch not taken
  • If decide to take branch, cancel undesired ones
[Datapath figure: Instr. Mem., Reg. File, Data Mem.; stall control signals across IF, ID, EX, MEM: Transfer, Bubble, Bubble, Bubble, Transfer. Perform when a taken branch is detected.]
What’s the Problem?
• Need to know destination of branch to fetch next instruction
[Figure: pipeline stages Instruction Fetch, Decode, Execute, Memory Access, Writeback, annotated "Need address here", "Compute address here", and "Branch Delay"; instructions shown: bne r2, #0, r3; sub r7, r8, r9; add r4, r5, r6]
Delayed Branches
• One or more instructions after branch get executed regardless of whether branch taken
• Exposes branch delay to compiler

Original order:
        add r2, r3, r4
        sub r7, r8, r9
        bne r5, #0, r10
        (stall)
        (stall)
        div r2, r1, r7

With delay slots filled:
        bne r5, #0, r10
        add r2, r3, r4    (delay)
        sub r7, r8, r9    (delay)
        div r2, r1, r7
Delayed Branching: Pros/Cons
• Pros:
  • Low hardware cost
• Cons:
  • Depends on compiler to fill delay slots
    • Ability to fill delay slots drops as # of slots increases
  • Exposes implementation details to compiler
    • Can’t change pipeline without breaking software compatibility
    • Can’t add to existing architecture and retain compatibility
Prediction
• Basic Idea: Predict which way branch will go, start executing down that path

        bne r2, #0, r4
        add r3, r4, r5
        sub r7, r8, r9
        mul r2, r4, r6
        lsh r5, r2, r1
        add r8, r8, r9
        sub r2, r4, r6

Cycle  IF             ID             EX             MEM            WB
n      bne r2,#0,r4
n+1    add r3,r4,r5   bne r2,#0,r4
n+2    sub r7,r8,r9   add r3,r4,r5   bne r2,#0,r4
n+3    add r8,r8,r9   sub r7,r8,r9   (squash)       bne r2,#0,r4
n+4    sub r2,r4,r6   add r8,r8,r9   (squash)       (squash)       bne r2,#0,r4
n+5                   sub r2,r4,r6   add r8,r8,r9   (squash)       (squash)
n+6                                  sub r2,r4,r6   add r8,r8,r9   (squash)
n+7                                                 sub r2,r4,r6   add r8,r8,r9
n+8                                                                sub r2,r4,r6
Branches Limit ILP
• Programs average about 5 instructions between branches
• Can’t issue instructions if you don’t know where the program is going
• Current processors issue 4–6 operations/cycle
Compiler-Static Prediction
• Predict at compile time, before execution, whether branches will be taken
• Schemes
  • Predict taken
    • Would be hard to squeeze into our pipeline
    • Can’t compute target until ID
  • Backwards taken, forwards not taken (good performance for loops)
    • Predict based on sign of displacement
    • Exploits fact that loops are usually closed with backward branches
• No run-time adaptation: bad performance for data-dependent branches

        if (a == 0) b = 3; else b = 4;
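The backwards-taken, forwards-not-taken rule is small enough to sketch directly. The displacements and outcome traces below are made-up illustrations, not measurements from the slides.

```python
def predict_taken(displacement):
    """Backwards taken, forwards not taken: use the sign of the displacement."""
    return displacement < 0


def accuracy(displacement, outcomes):
    """Fraction of outcomes (True = taken) that the static guess gets right."""
    guess = predict_taken(displacement)
    hits = sum(1 for taken in outcomes if taken == guess)
    return hits / len(outcomes)


# Hypothetical loop-closing branch: jumps backwards, taken 9 times out of 10.
loop_acc = accuracy(-16, [True] * 9 + [False])

# Hypothetical forward branch whose outcome depends on data and alternates.
alt_acc = accuracy(24, [True, False] * 5)

print(loop_acc, alt_acc)  # 0.9 0.5
```

The static guess never changes, so the data-dependent branch stays at the same accuracy no matter how long the program runs.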
Hardware-Dynamic Prediction
• Use run-time information to make prediction
Some Interesting Patterns
• TTTTTTTTTT
  +1 +1 +1 +1 +1 +1 +1 +1 +1 +1 +1 +1 +1 +1 +1 +1 +1 +1 +1 +1 +1 +1 +1 +1 …
  • Should give perfect prediction
• RRRRRRRRR
  -1 -1 +1 +1 +1 +1 -1 +1 -1 -1 +1 +1 -1 -1 +1 +1 +1 +1 +1 -1 -1 -1 +1 -1 …
  • Will mispredict 1/2 of the time
• N*N[TNTN]
  -1 -1 -1 -1 -1 -1 -1 -1 +1 -1 +1 -1 +1 -1 +1 -1 +1 -1 +1 -1 +1 -1 +1 -1 …
  • Should alternate incorrectly
• N*T[TNTN]
  -1 -1 -1 -1 -1 -1 -1 +1 +1 -1 +1 -1 +1 -1 +1 -1 +1 -1 +1 -1 +1 -1 +1 -1 …
  • Should alternate incorrectly
• N*N[TTNN]
  -1 -1 -1 -1 -1 -1 -1 -1 +1 +1 -1 -1 +1 +1 -1 -1 +1 +1 -1 -1 +1 +1 -1 -1 …
• N*T[TTNN]
  -1 -1 -1 -1 -1 -1 -1 +1 +1 +1 -1 -1 +1 +1 -1 -1 +1 +1 -1 -1 +1 +1 -1 -1 …
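This slide does not name the predictor being exercised, so the following is only a sketch under an assumption: a single saturating counter per branch, 1 bit wide (last outcome) or 2 bits wide, starting in the strongest not-taken state. Outcomes are written as T/N strings rather than +1/-1, and the helper names are illustrative.

```python
def outcomes_from(pattern):
    """Turn a string such as 'NNTN' into booleans (True = taken)."""
    return [ch == "T" for ch in pattern]


def count_mispredictions(outcomes, bits=2, state=0):
    """Run one saturating counter over the outcomes and count wrong guesses."""
    top = (1 << bits) - 1          # highest counter value
    threshold = 1 << (bits - 1)    # predict taken in the upper half of the range
    misses = 0
    for taken in outcomes:
        if (state >= threshold) != taken:
            misses += 1
        # Move toward taken or not taken, saturating at both ends.
        state = min(state + 1, top) if taken else max(state - 1, 0)
    return misses


patterns = {
    "T*": "T" * 20,
    "N*N[TNTN]": "NNNN" + "TNTN" * 4,
    "N*T[TNTN]": "NNNT" + "TNTN" * 4,
    "N*N[TTNN]": "NNNN" + "TTNN" * 4,
}
for name, text in patterns.items():
    seq = outcomes_from(text)
    one_bit = count_mispredictions(seq, bits=1)
    two_bit = count_mispredictions(seq, bits=2)
    print(f"{name}: {one_bit} (1-bit) / {two_bit} (2-bit) misses in {len(seq)}")
```

Under these assumptions the all-taken run is predicted correctly after a short warm-up, while the alternating tail defeats a single counter: the 1-bit version is wrong on nearly every alternating outcome, and the 2-bit version is wrong on at least half of them.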
Pentium Branch Prediction
• Branch Target Buffer
  • Stores information about previously executed branches
    • Indexed by instruction address
    • Specifies branch destination + whether or not taken
  • 256 entries
• Branch Processing
  • Look for instruction in BTB
  • If found, start fetching at destination
  • Branch condition resolved early in WB
    • If prediction correct, no branch penalty
    • If prediction incorrect, lose ~3 cycles
      • Which corresponds to > 3 instructions
  • Update BTB
Pentium II Operation
• Translates instructions dynamically into “Uops”
  • 118 bits wide
  • Holds operation, two sources, and destination
• Executes Uops with “Out of Order” engine
  • Uop executed when
    • Operands available
    • Functional unit available
  • Execution controlled by “Reservation Stations”
    • Keeps track of data dependencies between uops
    • Allocates resources
• Features
  • Executes operations in parallel
    • Up to 5 at once
  • Very deep pipeline
    • 12–18 cycle latency
Pentium II Branch Prediction
• Two-Level Scheme
  • Yeh & Patt, ISCA ’93
  • Keep shift register showing past k outcomes for branch
  • Use it to index a 2^k-entry table
  • Each entry provides a 2-bit saturating counter predictor
  • Very effective for any deterministic branching pattern
[Figure source: Microprocessor Report, March 27, 1995]
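The two-level idea can be sketched for a single branch: a k-bit shift register of past outcomes selects one of 2^k two-bit saturating counters. This is an illustrative model, not the Pentium II hardware; the class name, the initial counter value, and the trace are assumptions.

```python
class TwoLevelPredictor:
    """k-bit outcome history indexing a table of 2**k two-bit counters."""

    def __init__(self, k):
        self.mask = (1 << k) - 1
        self.history = 0                  # last k outcomes, newest in bit 0
        self.counters = [1] * (1 << k)    # assumed start: weakly not taken

    def predict(self):
        return self.counters[self.history] >= 2

    def update(self, taken):
        c = self.counters[self.history]
        self.counters[self.history] = min(c + 1, 3) if taken else max(c - 1, 0)
        # Shift the new outcome into the history register.
        self.history = ((self.history << 1) | int(taken)) & self.mask


def run(predictor, outcomes):
    """Feed outcomes (True = taken) to the predictor and count mispredictions."""
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses


# Alternating taken / not taken, the pattern a single counter handles badly.
p = TwoLevelPredictor(k=2)
warmup_misses = run(p, [True, False] * 10)
steady_misses = run(p, [True, False] * 10)
print(warmup_misses, steady_misses)  # 2 0
```

Once each history value has seen its following outcome a couple of times, the counters for that history settle, which is why a repeating pattern shorter than the history stops causing mispredictions.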
Pentium II Branch Prediction
• Critical to Performance
  • 11–15 cycle penalty for misprediction
• Branch Target Buffer
  • 512 entries
  • 4 bits of history
  • Adaptive algorithm
    • Can recognize repeated patterns, e.g., alternating taken–not taken
• Handling BTB misses
  • Detect in cycle 6
  • Predict taken for negative offset, not taken for positive
    • Loops vs. conditionals
PAg Prediction
[Figure labels: Branch Address, Per-Address Branch History Table, Global Pattern History Table]
PAp Prediction
[Figure labels: Branch Address, Per-Address Branch History Table, Per-Address Pattern History Table]
Pros/Cons of Two-Level Branch Prediction
• Pros:
  • Predicts correlated branch behavior that breaks other predictors
    • eqntott example
  • Better overall performance than purely address-based predictors
• Cons:
  • Interference between unrelated branches with same history
    • Example: all loop-end branches will map to same entry in pattern history table
    • Sometimes this is a good thing, sometimes this is a bad thing
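The interference point can be illustrated by modelling PAg and PAp side by side. In this sketch both keep one history register per branch address; PAg shares a single pattern history table, while PAp keeps a separate one per branch. The class, the addresses, and the trace are hypothetical, and the tables are unbounded dictionaries rather than fixed-size hardware arrays.

```python
from collections import defaultdict


class PerAddressTwoLevel:
    """Per-address history; pattern table is shared (PAg) or per branch (PAp)."""

    def __init__(self, k, per_address_pht):
        self.mask = (1 << k) - 1
        self.per_address_pht = per_address_pht
        self.bht = defaultdict(int)          # branch address -> k-bit history
        self.pht = defaultdict(lambda: 1)    # 2-bit counters, assumed weakly not taken

    def _key(self, addr):
        history = self.bht[addr]
        # PAp adds the branch address to the table key; PAg uses history alone.
        return (addr, history) if self.per_address_pht else history

    def predict(self, addr):
        return self.pht[self._key(addr)] >= 2

    def update(self, addr, taken):
        key = self._key(addr)
        c = self.pht[key]
        self.pht[key] = min(c + 1, 3) if taken else max(c - 1, 0)
        self.bht[addr] = ((self.bht[addr] << 1) | int(taken)) & self.mask


def run(predictor, trace):
    """Count mispredictions over (address, taken) pairs."""
    misses = 0
    for addr, taken in trace:
        if predictor.predict(addr) != taken:
            misses += 1
        predictor.update(addr, taken)
    return misses


# Hypothetical trace: branch 0x100 repeats T, T, N; branch 0x200 is always taken.
trace = []
for i in range(30):
    trace.append((0x100, i % 3 != 2))
    trace.append((0x200, True))

pag_misses = run(PerAddressTwoLevel(k=2, per_address_pht=False), trace)
pap_misses = run(PerAddressTwoLevel(k=2, per_address_pht=True), trace)
print(pag_misses, pap_misses)
```

Both branches reach the history "taken, taken" but disagree about what comes next, so in the shared table they keep overwriting each other's counter. The per-address tables avoid that at the cost of more state.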
Comparing Predictors
• Papers present two comparisons:
  • Highest prediction rate for given history register length
    • PAp wins (much more total hardware)
  • Least hardware for 97% accuracy
    • PAg wins
• Note: PAp and PAg do less well if you consider context switches, as they have more state to reload
Branch Prediction Comparisons
[Figure source: Microprocessor Report, March 27, 1995]
21264 Branch Prediction Logic
• Purpose: Predict whether or not branch taken
• 35Kb of prediction information
• 2% of total die size
• Claim 0.7–1.0% misprediction