
Advanced Microarchitecture




  1. Advanced Microarchitecture Lecture 3: Superscalar Fetch

  2. Fetch Rate is an ILP Upper Bound • To sustain an execution rate of N IPC, you must be able to sustain a fetch rate of N IPC! • Over the long term, you cannot burn 2000 calories a day while only consuming 1500 calories a day. You will starve! • This also suggests that you don’t need to fetch N instructions every single cycle, just N on average

  3. Impediments to “Perfect” Fetch • A machine with superscalar degree N will ideally fetch N instructions every cycle • This doesn’t happen, due to: • Instruction cache organization • Branches • And the interaction between the two

  4. Instruction Cache Organization • To fetch N instructions per cycle from the I$, we need: • The physical organization of each I$ row to be wide enough to store N instructions • The ability to access an entire row at the same time • Alternative: do multiple fetches per cycle • Not good: increases cycle time / latency by too much [Figure: I$ with a row decoder selecting one cache line, each line holding a tag plus N instructions]

  5. Fetch Operation • Each cycle, the PC of the next instruction to fetch is used to access an I$ line • The N instructions specified by this PC and the next N-1 sequential addresses form a fetch group • The fetch group might not be aligned with the row structure of the I$

  6. Fragmentation via Misalignment • If PC = xxx01001 and N=4: • The ideal fetch group is xxx01001 through xxx01100 (inclusive) • But only one line can be accessed per cycle, so we fetch only 3 instructions (instead of N=4) [Figure: the fetch group starts at offset 01 of one row, so its last instruction lies in the next row]

  7. Fetch Rate Computation • Assume N=4 • Assume the fetch group starts at a random location • Then fetch rate = ¼ × 4 + ¼ × 3 + ¼ × 2 + ¼ × 1 = 2.5 instructions per cycle
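The expected-value arithmetic above can be checked with a short Python sketch (variable names are mine, not the slides’):

```python
# Expected fetch rate when the fetch group may start at any of the
# 4 slots of a 4-instruction cache row, each with probability 1/4.
N = 4
# Starting at slot i leaves N - i instructions in the row.
rate = sum(0.25 * (N - i) for i in range(N))
print(rate)  # 2.5
```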

  8. Misalignment Reduces Fetch Bandwidth • It now takes two cycles to fetch N instructions • Fetch bandwidth is halved! • The reduction may not be as bad as a full halving in practice [Figure: cycle 1 fetches 3 instructions from the end of one row, cycle 2 fetches the remaining instruction from the start of the next row]

  9. Reducing Fetch Fragmentation • Make |fetch group| != |row width| (e.g. cache lines of 8 instructions with N=4) • If the start of the fetch group is N or more instructions from the end of the cache line, then N instructions can be delivered [Figure: each cache line holds a tag plus 8 instructions, twice the fetch-group width]

  10. May Require Extra Hardware • A rotator realigns the instructions read from the wide cache line into an aligned fetch group [Figure: cache line feeding a rotator whose output is the aligned fetch group]

  11. Fetch Rate Computation • Let N=4, cache line size = 8 • Then fetch rate = 5/8 × 4 + 1/8 × 3 + 1/8 × 2 + 1/8 × 1 = 3.25 instructions per cycle
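The same computation generalizes to any fetch-group size N and line width L; a small Python sketch (function name is mine) reproduces both the 2.5 and 3.25 figures:

```python
# Expected fetch rate for fetch-group size N and cache-line width L
# (in instructions): a start slot s delivers min(N, L - s) instructions,
# and each start slot is equally likely.
def fetch_rate(N, L):
    return sum(min(N, L - s) for s in range(L)) / L

print(fetch_rate(4, 4))  # 2.5
print(fetch_rate(4, 8))  # 3.25
```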

  12. Fragmentation via Branches • Even if the fetch group is aligned, and/or the cache line is larger than the fetch group, taken branches disrupt fetch • The instructions after a taken branch in the fetch group must be discarded [Figure: a branch in the middle of the fetch group; the following instructions in the line are crossed out]

  13. Fetch Rate Computation • Let N=4 • A branch occurs every 5 instructions on average • Assume branches are always taken • Assume the branch target may start at any offset in a cache row • So: 25% chance of the fetch group starting at each location, and a 20% chance for each instruction to be a branch

  14. Fetch Rate Computation (2) • Fetch group starting at the last slot: ¼ × 1 • At the third slot: ¼ × (0.2 × 1 + 0.8 × 2) • At the second slot: ¼ × (0.2 × 1 + 0.8 × (0.2 × 2 + 0.8 × 3)) • At the first slot: ¼ × (0.2 × 1 + 0.8 × (0.2 × 2 + 0.8 × (0.2 × 3 + 0.8 × 4))) • Sum = 2.048 instructions fetched per cycle • Simplified analysis: it doesn’t account for the higher probability of the fetch group being aligned when the previous fetch group contained no branches
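The nested expected-value terms above can be reproduced with a short Python sketch (the function name and loop structure are mine):

```python
# Expected fetch rate with N=4, a uniformly random start slot, and a
# 20% chance that each instruction is a (taken) branch ending the group.
def expected_insts(m, p_branch=0.2):
    # Expected instructions fetched when at most m fit in the row:
    # instruction k ends the group early if it is a taken branch.
    e = 0.0
    survive = 1.0
    for k in range(1, m):
        e += survive * p_branch * k
        survive *= 1 - p_branch
    return e + survive * m  # no branch before the end of the row

# Average over the four equally likely start slots (m = 4, 3, 2, 1):
rate = sum(0.25 * expected_insts(4 - s) for s in range(4))
print(round(rate, 3))  # 2.048
```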

  15. Example: IBM RS/6000 • The I$ is built from interleaved arrays with “T logic” that increments the row address for the arrays holding the slots before the fetch PC’s starting offset, so a fetch group that wraps past the end of one row (e.g. PC = B1010, fetching B10, B11, B12, B13) is delivered in a single access • The instruction buffer network realigns the outputs after the tag check [Figure: four interleaved cache arrays (entries A0–A15, B0–B15) with T logic per array, feeding the instruction buffer network]

  16. Types of Branches • Direction: • Conditional vs. Unconditional • Target: • PC-encoded • PC-relative • Absolute offset • Computed (target derived from register) • Must resolve both direction and target to determine the next fetch group

  17. Prediction • Generally use hardware predictors for both direction and target • Direction predictor simply predicts that a branch is taken or not-taken (Exact algorithms covered next lecture) • Target prediction needs to predict an actual address

  18. Where Are the Branches? • Before we can predict a branch, we need to know that we have a branch to predict! • Where’s the branch in this fetch group? [Figure: the fetch group read from the I$ is just raw instruction bits; branch positions are not apparent before decode]

  19. Simplistic Fetch Engine • Partially decode (PD) every fetched instruction to find the branch, then consult the direction and target predictors, all before the next fetch • Huge latency! Clock frequency plummets [Figure: fetch PC → I$ → per-instruction partial decode (PD) → direction and target predictors → next PC = predicted target, or branch’s PC + sizeof(inst)]

  20. Branch Identification • Predecode branches on fill from L2: store 1 bit per instruction, set if the instruction is a branch • This removes the partial-decode logic from the fetch path • Note: sizeof(inst) may not be known before decode (ex. x86) • … still a long latency (the I$ access itself is sometimes > 1 cycle)

  21. Line Granularity • Predict the next fetch group independent of the exact location of branches in the current fetch group • If there’s only one branch in a fetch group, does it really matter where it is? [Figure: one predictor entry per fetch group vs. one predictor entry per instruction]

  22. Predicting by Line • Better! Latency is determined by the branch predictor, not by locating the branch • This is still challenging: we may need to choose between multiple targets for the same cache line, e.g. br1 with target X and br2 with target Y:

  br1  br2  correct dir pred  correct target pred
  N    N    N                 cache line address + sizeof($-line)
  N    T    T                 Y
  T    --   T                 X

  23. Multiple Branch Prediction • Using the PC without its LSBs, the target and direction predictors produce one prediction per slot in the cache line (e.g. targets addr0–addr3, directions N N N T) • Scan for the 1st “T”: its predicted target becomes the next PC; if there is none, the next PC is the line address + sizeof($-line) [Figure: predictor arrays indexed by the PC without its LSBs; scan-for-first-taken selects among addr0–addr3]

  24. Direction Prediction • Details next lecture • Over 90% accurate today for integer applications • Higher for FP applications

  25. Target Prediction • PC-relative branches • If not-taken: next address = branch address + sizeof(inst) • If taken: next address = branch address + SEXT(offset) • sizeof(inst) doesn’t change • The offset doesn’t change (not counting self-modifying code)

  26. Taken Targets Only • Only need to predict taken-branch targets • A taken branch’s target is the same every time • The prediction is really just a “cache” of previously seen targets [Figure: PC indexes the target predictor; the fall-through path computes PC + sizeof(inst)]

  27. Branch Target Buffer (BTB) • Each entry holds a valid bit (V), the branch instruction address as the tag (BIA), and the branch target address (BTA) • The branch PC indexes the BTB and is compared against the stored BIA; on a hit, the BTA is the next fetch PC
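A toy Python sketch of the lookup (a dictionary stands in for the tagged SRAM; the class, method names, and fixed 4-byte instruction size are illustrative assumptions):

```python
# Minimal BTB sketch: maps a branch PC to its last taken target;
# a miss (or a not-taken prediction) falls through to PC + inst size.
INST_SIZE = 4  # assumed fixed-width ISA

class BTB:
    def __init__(self):
        self.entries = {}          # branch PC -> predicted target

    def update(self, pc, target): # on a resolved taken branch
        self.entries[pc] = target

    def next_fetch_pc(self, pc, predict_taken):
        if predict_taken and pc in self.entries:
            return self.entries[pc]   # BTB hit: use cached target
        return pc + INST_SIZE         # miss or not-taken: fall through

btb = BTB()
btb.update(0xFC34, 0x1000)                     # CALL printf seen taken
print(hex(btb.next_fetch_pc(0xFC34, True)))    # 0x1000 (hit)
print(hex(btb.next_fetch_pc(0xFC38, True)))    # 0xfc3c (miss: fall through)
```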

  28. Set-Associative BTB • Several ways per set, each with its own valid bit, tag, and target • The PC indexes all ways in parallel; the tag compares select the matching way’s target as the next PC

  29. Cutting Corners • A branch prediction may be wrong • The processor has ways to detect mispredictions • So tweaks that make the BTB more or less “wrong” don’t change the correctness of processor operation • They may affect performance, though

  30. Partial Tags • Storing the branch’s full address as the tag (e.g. 00000000cfff981 for the branch at 00000000cfff9810) is expensive • Store just a partial tag instead (f981, f982, f984): much cheaper, but different branches can now alias to the same entry and cause false hits
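A tiny sketch of why partial tags can produce false hits; the 16-bit tag slice and the 4 dropped low bits are illustrative choices that happen to reproduce the f981 tag from the slide:

```python
# Partial tags keep only a few tag bits; two different branch PCs can
# alias to the same partial tag, yielding a false BTB hit.
def partial_tag(pc):
    return (pc >> 4) & 0xFFFF   # drop 4 low bits, keep 16 tag bits

a = 0x00000000CFFF9810
b = 0x00000001CFFF9810          # a different branch, far away
print(hex(partial_tag(a)))                 # 0xf981
print(partial_tag(a) == partial_tag(b))    # True: alias -> false hit
```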

  31. PC-offset Encoding • Store only the low-order bits of the target (e.g. ff9704 instead of 00000000cfff9704); the upper bits are taken from the branch’s own PC (e.g. 00000000cf | ff9900) • If the target is too far away, or the original PC is close to a “roll-over” point, the reconstructed target will be wrong and the branch mispredicted
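The reconstruction (and the roll-over failure mode) can be shown in a few lines of Python; the 24-bit stored width matches the 6 hex digits on the slide, and the second branch PC is a hypothetical example of my own:

```python
# Store only the low 24 bits of the target; the full target is rebuilt
# by splicing them under the branch PC's upper bits.
LOW = 24
MASK = (1 << LOW) - 1

def rebuild_target(pc, stored_low):
    return (pc & ~MASK) | stored_low

pc = 0x00000000CFFF984C                        # branch PC from the slide
print(hex(rebuild_target(pc, 0xFF9900)))       # 0xcfff9900: correct
# Near a 2**24 roll-over the upper bits differ, so the rebuilt target
# is wrong and the branch mispredicts:
pc2 = 0x00000000D000000C                       # branch just past the boundary
print(hex(rebuild_target(pc2, 0xFF9900)))      # 0xd0ff9900, not 0xcfff9900
```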

  32. BTB Miss? • Dir-pred says “taken” but the target-pred (BTB) misses • Could default to the fall-through PC (as if dir-pred said NT) • But we know that’s likely to be wrong! • Stall fetch until the target is known … when’s that? • PC-relative: after decode, we can compute the target • Indirect: must wait until register read/execute

  33. Stall on BTB Miss • Fetch stalls while the instruction proceeds to decode, where PC + displacement is computed • The computed target becomes the next PC and fetch unstalls [Figure: PC → I$ → decode; the decode-stage adder (PC + displacement) feeds the next PC]

  34. BTB Miss Timing

  Cycle  Stage 1      Stage 2     Stage 3  Stage 4
  i      BTB miss
  i+1    stall        I$ access
  i+2    stall        stall      decode
  i+3    I$ access    stall      stall    rename
  i+4    inject nops  I$ access  stall

  The target is computed at decode in cycle i+2, so the I$ access at the correct target starts in cycle i+3

  35. Decode-time Correction • Fetch continues down the path of the BTB-predicted target “foo” • Later, decode discovers the predicted target was actually “bar”; flush the wrong-path instructions and resteer • Similar penalty to a BTB miss: 3 cycles of bubbles is much better than 20+ from a full pipeline flush

  36. What about Indirect Jumps? • The target comes from a register (“Get target from R5”) • Stall until R5 is ready and the branch executes • may be a while if “Load R5 = 0[R3]” misses to main memory • Or fetch down the NT-path • why?

  37. Subroutine Calls: No Problem! • Every call site jumps to the same, fixed target • BTB entries (tags FC3, FD0, FFB, all with target 0x1000) correctly predict the calls at A: 0xFC34, B: 0xFD08, and C: 0xFFB0 to P: 0x1000 (the start of printf)

  38. Subroutine Returns • printf saves the return address (P: 0x1000: ST $RA  [$sp]), reloads it (0x1B98: LD $tmp  [$sp]), and returns via 0x1B9C: RETN $tmp • A BTB entry for the return (tag 1B9, target 0xFC38) predicts the return to A’: 0xFC38 correctly after the call from A: 0xFC34 • But after the call from B: 0xFD08, the correct return is B’: 0xFD0C, so the BTB mispredicts • Returns have multiple targets: one BTB entry cannot capture them all

  39. Return Address Stack (RAS) • Keep track of the call stack • On the call at A: 0xFC34 (CALL printf), push the return address FC38 onto the RAS • On the RETN at 0x1B9C, pop FC38 and use it as the predicted next PC, instead of the BTB’s stale target

  40. Overflow • What if the call depth exceeds the RAS size? • Option 1: wrap around and overwrite the oldest entry • will lead to an eventual misprediction, after four pops for a 4-entry RAS • Option 2: do not modify the RAS • will lead to a misprediction on the very next pop
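The wrap-around option can be sketched as a small circular buffer in Python (the 4-entry size matches the slide; the class, method names, and return addresses are illustrative):

```python
# Circular return-address stack: a push when full wraps around and
# overwrites the oldest entry, so the deepest return mispredicts.
class RAS:
    def __init__(self, size=4):
        self.stack = [None] * size
        self.top = 0                 # index of the next push

    def push(self, ret_addr):        # on CALL
        self.stack[self.top % len(self.stack)] = ret_addr
        self.top += 1

    def pop(self):                   # on RETN: predicted return address
        self.top -= 1
        return self.stack[self.top % len(self.stack)]

ras = RAS()
for ret in [0x100, 0x200, 0x300, 0x400, 0x500]:  # 5 calls, 4 entries
    ras.push(ret)                    # 0x500 overwrites 0x100
pops = [ras.pop() for _ in range(5)]
print([hex(p) for p in pops])
# The first four pops are correct; the fifth returns 0x500 instead of
# the 0x100 that was pushed first: misprediction after four pops.
```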

  41. How Can You Tell It’s a Return? • Option 1: a pre-decode bit in the BTB (return=1, else=0) • Option 2: wait until after decode • Initially use the BTB’s target prediction • After decode, when you know it’s a return, treat it like a BTB miss or BTB misprediction • Costs a few bubbles, but simpler, and still better than a full pipeline flush
