CS 7960-4 Lecture 2

CS 7960-4 Lecture 2 Limits of Instruction-Level Parallelism David W. Wall WRL Research Report 93/6 Also appears in ASPLOS’91

Goals of the Study • Under optimistic assumptions, you can find a • very high degree of parallelism (1000!) • What about parallelism under realistic assumptions? • What are the bottlenecks? What contributes to parallelism?

Dependencies For registers and memory: True data dependency RAW Anti dependency WAR Output dependency WAW Control dependency Structural dependency

Perfect Scheduling For a long loop: Read a[i] and b[i] from memory and store in registers Add the register values Store the result in memory c[i] The whole program should finish in 3 cycles!! Anti and output dependences : the assembly code keeps using lr1 Control dependences : decision-making after each iteration Structural dependences : how many registers and cache ports do I have?

Impediments to Perfect Scheduling • Register renaming • Alias analysis • Branch prediction • Branch fanout • Indirect jump prediction • Window size and cycle width • Latency

Register Renaming • lr1  … pr22  … • …  lr1 …  pr22 • lr1  … pr24  … • If the compiler had infinite registers, you would • not have WAR and WAW dependences • The hardware can renumber every instruction and • extract more parallelism • Implemented models: • None • Finite registers • Perfect (infinite registers – only RAW)

Alias Analysis • You have to respect RAW dependences for • memory as well – • store value to addrA • load from addrA • Problem is: you do not know the address at • compile-time or even during instruction dispatch

Alias Analysis • Policies: • Perfect: You magically know all addresses and only delay loads that conflict with earlier stores • None: Until a store address is known, you stall every subsequent load • Analysis by compiler: (addr) does not conflict with (addr+4) – global and stack data are allocated by the compiler, hence conflicts can be detected – accesses to the heap can conflict with each other

Global, Stack, and Heap main() int a, b;  global data call func(); func() int c, d;  stack data int *e;  e,f are stack data int *f; e = (int *)malloc(8);  e, f point to heap data f = (int *)malloc(8); … *e = c;  store c into addr stored in e d = *f;  read value in addr stored in f This is a conflict if you had previously done e=e+8

Branch Prediction • If you go the wrong way, you are not extracting • useful parallelism • You can predict the branch direction statically or • dynamically • You can execute along both directions and throw • away some of the work (need more resources)

Effect on Performance 1000 instructions 150 are branches With 1% mispredict rate, 1.5 branches are mispredicts Assume 1-cycle mispredict penalty per branch If you assume that IPC is 2, 1.5 cycles are added to the 500-cycle execution time Performance loss = 0.3 x mispredict-rate x penalty Today, performance loss = 0.3 x 5 x 40 = 60%

Dynamic Branch Prediction • Tables of 2-bit counters that get biased towards • being taken or not-taken • Can use history (for each branch or global) • Can have multiple predictors and dynamically pick • the more promising one • Much more in a few weeks…

Static Branch Prediction • Profile the application and provide hints to the • hardware • Hardly used in today’s high-performance • processors • Dynamic predictors are much better • (Figure 7, Pg.10)

Branch Fanout • Execute both directions of the branch – an • exponential growth in resource requirements • Hence, do this until you encounter four branches, • after which, you employ dynamic branch prediction • Better still, execute both directions only if the • prediction confidence is low • Not commonly used in today’s processors.

Indirect Jumps • Indirect jumps do not encode the target in the • instruction – the target has to be computed • The address can be predicted by • using a table to store the last target • using a stack to keep track of subroutine call and returns (the most common indirect jump) • The combination achieves 95% prediction rates • (Figure 8, pg. 12)

Latency • In their study, every instruction has unit latency • -- highly questionable assumption today! • They also model other “realistic” latencies • Parallelism is being defined as • cycles for sequential exec / cycles for superscalar, • not as instructions / cycles • Hence, increasing instruction latency can increase • parallelism – not true for IPC

Window Size & Cycle Width 8 available slots in each cycle Window of 2048 instructions

Window Size & Cycle Width • Discrete windows: grab 2048 instructions, schedule • them, retire all cycles, grab the next window • Continuous windows: grab 2048 instructions, • schedule them, retire the oldest cycle, grab a few • more instructions • Window size and register renaming are not related • They allow infinite windows and cycles (infinite • windows and limited cycles is memory-intensive)

Simulated Models • Seven models: control, register, and memory • dependences (Figure 11, Pg. 15) • Today’s processors: ? • However, note optimistic scheduling, 2048 instr • window, cycle width of 64, and 1-cycle latencies • SPEC’92 benchmarks, utility programs (grep, sed, • yacc), CAD tools

Aggressive Models • Parallelism steadily increases as we move to • aggressive models (Fig 12, Pg. 16) • Branch fanout does not buy much • IPC of Great model: 10 • Reality: 1.5 • Numeric programs can do much better

Cycle Width and Window Size • Unlimited cycle widths buys very little (much less • than 10%) (Figure 15) • Decreasing the window size seems to have little • effect as well (you need only 256?! – are registers • the bottleneck?) (Figure 16) • Unlimited window size and cycle widths don’t help • (Figure 18) • Would these results hold true today?

Memory Latencies • The ability to prefetch has a huge impact on IPC – • to hide a 200 cycle latency, you have to spot the • instruction very early • Hence, registers and window size are extremely • important today!

Loop Unrolling • Should be a non-issue for today’s processors • Reduces dynamic instruction count • Can help if rename checkpoints are a bottleneck

Superscalar Pipeline BPred B T B Rename Table I-Cache PC IFQ checkpoints D-Cache in2 in1 out op R O B LSQ Regfile FU FU FU FU Issue queue

Branch Prediction • Obviously, better prediction helps (Fig. 22) • Fanout does not help much (Fig. 24-b) – not • selecting the right branches? • Fig. 27 does not show a graph for profiled-fanout • plus hardware prediction • Luckily, small tables are good enough for good • indirect jump prediction • Note: Mispredict penalty=0

Mispredict Penalty • Has a big impact on performance (earlier equation) • (Fig. 30) • Pentium4 mispredict penalty = 20 + issueq wait-time

Alias Analysis • Has a big impact on performance – compiler • analysis results in a two-fold speed-up (Fig. 32) • Reality: modern languages make such an analysis • very hard • Later, we’ll read a paper that attempts this in • hardware (Chrysos ’98)

Registers • Results suggest that 64 registers are good • enough (Fig. 33) • Precise exceptions make it difficult to look at • a large speculative window with few registers • Room for improvement for register utilization? • (Monreal ’99)

Instruction Latency • Parallelism almost unaffected by increased • latency (increases marginally in some cases!) • Note: “unconventional” definition of parallelism • Today, latency strongly influences IPC

Conclusions • Branch prediction, alias analysis, mispredict • penalty are huge bottlenecks • Instr latency, registers, window size, cycle width • are not huge bottlenecks • Today, they are all huge bottlenecks because they • all influence effective memory latency…which is • the biggest bottleneck

Questions • Is it worth doing multi-issue? • Is there more ILP left to extract? • What would some of these graphs look like today? • Weaknesses – cache and reg model • Lessons for today’s procs – is it worth doing • multi-issue, parallelism has gone down

Next Week’s Paper • “Complexity-Effective Superscalar Processors”, • Palacharla, Jouppi, Smith, ISCA ’97 • The impact of increased issue width and window • size on clock speed

Title • Bullet

CS 7960-4 Lecture 2

CS 7960-4 Lecture 2

Presentation Transcript

CS 584 Lecture 4

CS 179: Lecture 4 Lab Review 2

CS 7960-4 Lecture 20

CS 7960-4 Lecture 24

CS 161 Lecture 4

CS 7960-4 Lecture 8

CS 519: Lecture 4

CS 7960-4 Lecture 5

CS 140L Lecture 4

CS 140L Lecture 4

CS 425 Lecture 4

CS 7960-4 Lecture 23

CS 7960-4 Lecture 17

CS 160: Lecture 4

CS 7960-4 Lecture 10

CS 7960-4 Lecture 7

CS 7960-4 Lecture 20

CS 7960-4 Lecture 4

CS 7960-4 Lecture 20

CS 7960-4 Lecture 14

CS 7960-4 Lecture 18