Lecture 13: ILP

Last Time
  Branch delays
  Delay slots
  Branch prediction
Today
  Breaking the CPI barrier

Control Hazards
Conservatively, the pipeline waits until the branch target is computed before fetching the next instruction (or uses a delay slot).
Alternatively, we can speculate on which direction the branch will take and which address it will go to.
We then need to confirm the speculation later and back up if it was wrong.
[Figure: pipeline timing diagrams (stages F R X M W) for the instructions following a branch]

Example: Speculative Conditional Branch
        BEQ %1, %2, LOOP
        ADD %2, %3, %4
        ...
  LOOP: SUB %5, %6, %7
[Figure: pipelined datapath (PC, PC+4, I-Mem, register file, =? comparator, ALU, D-Mem, per-stage IR registers) with the BEQ in flight]

Branch Prediction
Depending on use, some branches are very predictable
  loops: TTT…TN
  limit checks: almost always pass
Some are not very predictable
  data-dependent dispatch with equally likely cases
Types of predictors
  static
  history
  multi-bit history
  pattern
Examples
  for(j=0;j<30;j++) { … }
  switch(mode) { case 1: … case 2: … default: … }
  if(a > limit) { … }

Simple Dynamic Predictor
Branch history table (BHT)
  indexed by PC
  stores the last direction each branch went
  may indicate whether the last instruction at this address was a branch
  the table is a cache of recent branches
  buffer sizes of 4096 entries are common
What happens if we:
  don't find the PC in the BHT?
  run out of BHT entries?
[Figure: the PC indexes both the instruction memory (IM, feeding IR) and the BHT, which supplies the prediction and receives updates]
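The last-direction table above can be sketched in a few lines of Python. This is a minimal illustration, not the design from the lecture: the class name, the direct-mapped organization, the full-PC tag, and the default of predicting not-taken on a miss are all assumptions chosen to give one possible answer to the two questions on the slide (a miss falls back to a static guess; running out of entries evicts whatever shared the slot).

```python
class LastDirectionPredictor:
    """Hypothetical 1-bit BHT: remembers the last direction of each branch."""

    def __init__(self, entries=4096, default_taken=False):
        self.entries = entries
        self.default_taken = default_taken  # static guess used on a BHT miss
        self.table = {}                     # slot index -> (pc tag, last direction)

    def _index(self, pc):
        # Assumes word-aligned instructions, so the low two PC bits are dropped.
        return (pc >> 2) % self.entries

    def predict(self, pc):
        entry = self.table.get(self._index(pc))
        if entry is None or entry[0] != pc:
            return self.default_taken       # PC not found in the BHT
        return entry[1]

    def update(self, pc, taken):
        # Direct-mapped: a new branch simply evicts the old one in its slot.
        self.table[self._index(pc)] = (pc, taken)


def run_trace(predictor, trace):
    """Return (correct predictions, total branches) for a list of (pc, taken)."""
    correct = 0
    for pc, taken in trace:
        if predictor.predict(pc) == taken:
            correct += 1
        predictor.update(pc, taken)
    return correct, len(trace)
```

On a loop branch that is taken 29 times and then falls through, a predictor like this mispredicts twice per pass through the loop: once on the exit, and once on re-entry because the stored direction is still "not taken".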

Global History Tables
History gives a pattern of recent branches
  e.g., TTNTTNTTN: what comes next?
Predict the next branch by looking up the history of branches for a particular pattern
Can combine the PC and the history register (xor) to distinguish between different branches
[Figure: the branch history register (110110) passes through a function f to index a pattern history table (PHT) of 2-bit saturating counters, whose state gives the prediction; in the second variant the history is XORed with the PC to form the PHT index]
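A global-history predictor of this kind might look like the following sketch. The names, the 6-bit history length, the initial counter value (weakly not-taken), and the exact way the PC is folded in are assumptions for illustration; only the overall structure (history register, XOR with the PC, table of 2-bit saturating counters) comes from the slide.

```python
class GlobalHistoryPredictor:
    """Hypothetical predictor: (PC xor global history) indexes 2-bit counters."""

    def __init__(self, history_bits=6):
        self.mask = (1 << history_bits) - 1
        self.history = 0                           # most recent outcome in bit 0
        self.counters = [1] * (1 << history_bits)  # 0..3, start weakly not-taken

    def _index(self, pc):
        return ((pc >> 2) ^ self.history) & self.mask

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2  # 2 or 3 means "taken"

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)  # saturate at 3
        else:
            self.counters[i] = max(0, self.counters[i] - 1)  # saturate at 0
        # Shift the outcome into the global history register.
        self.history = ((self.history << 1) | int(taken)) & self.mask
```

Fed the repeating pattern TTN from a single branch, the three steady-state histories each select a different counter, so after a short warm-up every outcome, including the N, is predicted correctly.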

Per-Branch Pattern History Table (Two-Level Predictor)
Two-level predictor
  first level: find the history (pattern)
  second level: predict the branch for that pattern
Correlating predictors
Differs from the global pattern history table in that each branch has its own private history
What structures must be updated?
The BPT may be independent for each BHT entry or shared
[Figure: the PC indexes the BHT to fetch a per-branch history (110110), which passes through f to index the BPT; the BPT state gives the prediction]
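The two lookups can be sketched as follows, which also suggests an answer to the update question: both the branch's private history and the counter it selected are written after every branch. The sketch assumes an unbounded BHT keyed by PC, a 6-bit history, and a single BPT shared by all branches; those choices and all names are illustrative assumptions.

```python
class TwoLevelLocalPredictor:
    """Hypothetical per-branch history (BHT) feeding a shared counter table (BPT)."""

    def __init__(self, history_bits=6):
        self.mask = (1 << history_bits) - 1
        self.bht = {}                         # level 1: PC -> private history
        self.bpt = [1] * (1 << history_bits)  # level 2: shared 2-bit counters

    def predict(self, pc):
        history = self.bht.get(pc, 0)         # level 1: find this branch's pattern
        return self.bpt[history] >= 2         # level 2: predict for that pattern

    def update(self, pc, taken):
        history = self.bht.get(pc, 0)
        counter = self.bpt[history]
        # Both structures are updated once the outcome is known.
        self.bpt[history] = min(3, counter + 1) if taken else max(0, counter - 1)
        self.bht[pc] = ((history << 1) | int(taken)) & self.mask
```

Because each branch keeps its own history, two interleaved branches with different repeating patterns (say TTN and TN) do not disturb each other's pattern, even though they share the counter table.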

Compaq Alpha 21264 Branch Predictor (1998)
Local: previous executions of this branch
Global: previous executions of all branches
[Figure: the PC indexes a local history table (1024x10) feeding a local prediction table (1024x3); the path history indexes a global prediction table (4096x2) and a choice prediction table (4096x2), which selects the final prediction]

Branch Prediction Accuracy
• Static branch prediction (compiler) - 70%
  • Can get better with profiling (run the program and examine branch directions)
• Per-branch 2-bit saturating counters (no history) - 85%
• Two-level predictor (with history) - 90-95% accuracy
• BUT - depends on benchmark

Branch Target Tables
Need to know where to go if the prediction is 'taken'
  predict the target along with the direction
JR instruction: target not known until after the R stage!
May use a different target prediction strategy for different types of branches
  e.g., subroutine returns
Use a Branch Target Buffer (BTB) to predict target addresses
[Figure: pipeline diagram (F R A M W) of a predicted-taken branch and its successor; the target is calculated after the successor's fetch, so the fetch must guess the target]

Branch Target Prediction (2)
Use the current PC to index a cache of next PCs
Use a push-down stack to record subroutine return addresses
The ISA can give hints about where you're going
  Digital Alpha has 4 instructions with identical ISA behavior: JMP, JSR, RET, JSR_COROUTINE
  they specify the predictor's use of the stack
  they include a hint of the target address
    JMP R31, (R3), hint
[Figure: the PC feeds the IM (to IR), the BHT (taken/not-taken prediction), the BTB (predicted target PC) and a return stack; the next PC is selected among PC+4, the predicted target, and the actual target PC from the ALU]
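The combination of a next-PC cache and a return stack can be sketched like this. The class, the "call"/"ret"/"jump" labels, and the unbounded table and stack are assumptions for illustration; a real BTB is a finite cache, and the instruction kind would come from decode or from ISA hints such as the Alpha ones listed above.

```python
class TargetPredictor:
    """Hypothetical BTB (PC -> last target) plus a push-down return-address stack."""

    def __init__(self):
        self.btb = {}            # cache of next PCs, indexed by branch PC
        self.return_stack = []   # predicted return addresses

    def predict(self, pc, kind):
        # Returns are predicted from the stack; everything else from the BTB.
        if kind == "ret" and self.return_stack:
            return self.return_stack[-1]
        return self.btb.get(pc, pc + 4)   # BTB miss: assume fall-through

    def update(self, pc, kind, actual_target):
        if kind == "call":
            self.return_stack.append(pc + 4)   # assumes 4-byte instructions
        elif kind == "ret" and self.return_stack:
            self.return_stack.pop()
        self.btb[pc] = actual_target
```

The stack matters when one subroutine is called from several places: the BTB entry for the return instruction remembers only the previous caller, while the top of the stack holds the address after the current call.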

Reducing control hazards
• Move the branch test and target computation as early in the pipeline as possible
• Delayed branches
  • Execute a useful instruction after the branch instead of stalling
• Branch prediction
  • Predict taken/not taken
  • Predict target address (BTB)

R4000 Pipeline
  IF  I$ start
  IS  I$ finish
  RF  decode/opfetch
  EX  ALU
  DS  D$ start
  DF  D$ finish
  TC  tag check
  WB  write back
[Figure: eight-stage datapath with instruction memory, register file, and data memory]
• How long is the load delay?
• How long is the branch delay?
• How many comparators are needed to implement the forwarding decisions?
• What instruction sequences will still cause stalls?

Can CPI be < 1?
• Hazard resolution
  • WAR/WAW hazards -> register renaming
  • RAW hazards -> speculation (if possible)
  • Control hazards
    • Better branch prediction (or use predication to eliminate branches)
    • Speculation
• Instruction Level Parallelism (ILP)
  • Compiler schedules ILP (VLIW)
  • Multi-issue (in-order execution to multiple pipes)
  • Dynamic scheduling ("Superscalar")
Renaming example
  ADD R1,R2,R3         ADD R1,R2,R3
  SUB R1,R4,R5    ->   SUB R1',R4,R5
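The renaming idea can be sketched as a small mapping-table pass. This is an illustration under assumptions, not the mechanism from the lecture: it invents an unlimited supply of physical names P0, P1, … and a tuple format (op, dest, src1, src2), and it never frees a register.

```python
def rename(instructions):
    """Give every destination a fresh physical register (hypothetical names P0, P1, ...).

    instructions: list of (op, dest, src1, src2) using architectural names.
    """
    mapping = {}      # architectural register -> current physical register
    renamed = []
    next_phys = 0
    for op, dest, src1, src2 in instructions:
        # Sources read the most recent mapping, which preserves true (RAW) dependences.
        s1 = mapping.get(src1, src1)
        s2 = mapping.get(src2, src2)
        # A fresh destination removes WAW and WAR conflicts on the old name.
        new_dest = f"P{next_phys}"
        next_phys += 1
        mapping[dest] = new_dest
        renamed.append((op, new_dest, s1, s2))
    return renamed
```

Applied to ADD R1,R2,R3 followed by SUB R1,R4,R5, the two writes land in different physical registers, so the instructions no longer need to complete in order.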

Parallelism: Overt vs Covert (1 of 2)
Covert approach: the programmer doesn't see parallelism, just speed
  Dusty Deck
  Compiler: may discover parallelism (e.g., in loops) and/or reorganize code to remove dependencies
  Sequential Program: generated code usually has sequential semantics for compatibility with the ISA
  Computer (pipelined and/or multiple issue): hardware executes in parallel while preserving sequential semantics

Parallelism: Overt vs Covert
Starting points: Dusty Deck, Problem
Parallelizer
Parallel Program: a program with multiple threads that explicitly communicate and synchronize
Compiler: produces a Sequential Program or a Parallel Program
  In addition to threads, object code may make ILP explicit, encoding dependencies
Pipelined and/or Multiple Issue CPU: exploits only ILP
Parallel Computer: exploits ILP and thread-level parallelism

Problem vs Program
Find the maximum element of an array, a

  max = a[0];
  for(i=1;i<n;i++) {
    if(a[i] > max) max = a[i];
  }
  return max;

Parallelism and Hardware
Technology gives us lots of function units
  They get only slightly faster each year
  The wires get slower
Pipeline or replicate at the bit, word, vector, and subroutine levels
  Pipeline
  Parallel or interleave
[Figure: a 16mm chip compared with a 1mm 64-bit ALU]

Parallelism and Software
Independent operations (ILP)
  a = (b + c) * (d + e + f);
Function decomposition
  stages: xform, clip, render
Domain decomposition
  1: xform/clip, render
  2: xform/clip, render
  3: xform/clip, render
  4: xform/clip, render

Compilers for Covert Parallelism
  a = b + c * d;
  e = f + g;

Naive Code
  LW  %1, C
  LW  %2, D
  MUL %3, %1, %2
  LW  %4, B
  ADD %5, %4, %3
  SW  %5, A
  LW  %6, F
  LW  %7, G
  ADD %8, %6, %7
  SW  %8, E

10 instructions
13 cycles
3 load stalls
would be worse on a cache miss or if MUL or ADD had latency greater than 1
CPI = 13/10 = 1.3
[Figure: pipeline timing diagram (F R X M W) showing the three load stalls]

Compilers for Covert Parallelism (2)
  a = b + c * d;
  e = f + g;

Rescheduled Code
  LW  %1, C
  LW  %2, D
  LW  %4, B
  LW  %6, F
  MUL %3, %1, %2
  LW  %7, G
  ADD %5, %4, %3
  ADD %8, %6, %7
  SW  %5, A
  SW  %8, E

10 instructions
10 cycles
no load stalls
can tolerate 2-4 cycles of miss latency
can tolerate 2-cycle ADD and MUL
CPI = 10/10 = 1

Move loads to the top
Space ALU operations based on pipeline latency
Scheduling can be done by the compiler or at run-time by the hardware
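The cycle counts for the naive and rescheduled sequences can be checked with a small stall counter. It is a sketch under simplifying assumptions that appear to match the slides' accounting: one cycle per instruction, a one-cycle stall only when an instruction uses the result of the load immediately before it, and pipeline fill time ignored. The tuple format (op, dest, sources) and the function name are invented for the example.

```python
def count_cycles(instrs):
    """Cycles = instructions + load-use stalls (assumed one-cycle load delay)."""
    stalls = 0
    for prev, cur in zip(instrs, instrs[1:]):
        prev_op, prev_dest, _ = prev
        _, _, cur_srcs = cur
        # Stall when an instruction reads the register loaded just before it.
        if prev_op == "LW" and prev_dest in cur_srcs:
            stalls += 1
    return len(instrs) + stalls


# (op, destination register or None, source registers)
NAIVE = [
    ("LW", "%1", ()), ("LW", "%2", ()), ("MUL", "%3", ("%1", "%2")),
    ("LW", "%4", ()), ("ADD", "%5", ("%4", "%3")), ("SW", None, ("%5",)),
    ("LW", "%6", ()), ("LW", "%7", ()), ("ADD", "%8", ("%6", "%7")),
    ("SW", None, ("%8",)),
]
RESCHEDULED = [
    ("LW", "%1", ()), ("LW", "%2", ()), ("LW", "%4", ()), ("LW", "%6", ()),
    ("MUL", "%3", ("%1", "%2")), ("LW", "%7", ()), ("ADD", "%5", ("%4", "%3")),
    ("ADD", "%8", ("%6", "%7")), ("SW", None, ("%5",)), ("SW", None, ("%8",)),
]

print(count_cycles(NAIVE), count_cycles(RESCHEDULED))  # prints "13 10"
```

The same ten instructions go from 13 cycles to 10 purely by reordering, which is the point of moving the loads to the top.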

Hardware to exploit ILP
[Figure: a 1-wide datapath (PC, I-Mem, Reg File, D-Mem) beside a 2-wide datapath built from the same components]

Compiler Scheduled ILP

Single issue
  LW  %1, C
  LW  %2, D
  LW  %4, B
  LW  %6, F
  MUL %3, %1, %2
  LW  %7, G
  ADD %5, %4, %3
  ADD %8, %6, %7
  SW  %5, A
  SW  %8, E
10 instructions
10 cycles
CPI = 1

Two ALUs
[Figure: the same ten instructions arranged in two issue columns, ALU 1 and ALU 2]
10 instructions
6 cycles
CPI = 0.6

Control Flow Restrictions
  max = a[0] ;
  for(i=1;i<n;i++) {
    if(a[i] > max) max = a[i] ;
  }
  return max ;

  LOOP:  LW   %1, %2       ; // a[i]
         SGT  %3, %1, %8   ; // a[i] > max
         BEQZ %3, NOMAX
         ADDI %8, %1, #0   ; // update max
  NOMAX: ADDI %2, %2, #4   ; // update a[i] ptr
         ADDI %4, %4, #1   ; // update i
         SLT  %5, %4, %9   ; // i < n
         BNEZ %5, LOOP

Can only reschedule code inside a basic block.
Small basic blocks limit opportunities for scheduling.
IC = 8
CPI = 11/8 = 1.4

Predication to Eliminate Branches
  max = a[0] ;
  for(i=1;i<n;i++) {
    if(a[i] > max) max = a[i] ;
  }
  return max ;

Predicate the conditional to make this all one basic block
  LOOP:         LW   %1, %2       ; // a[i]
                SGT  %3, %1, %8   ; // a[i] > max
        IF(%3)  ADDI %8, %1, #0   ; // update max
                ADDI %2, %2, #4   ; // update a[i] ptr
                ADDI %4, %4, #1   ; // update i
                SLT  %5, %4, %9   ; // i < n
                BNEZ %5, LOOP
IC = 7
CPI = 9/7 = 1.3

Reschedule to eliminate stalls and provide slack
  LOOP:         LW   %1, %2       ; // a[i]
                ADDI %4, %4, #1   ; // update i
                ADDI %2, %2, #4   ; // update a[i] ptr
                SGT  %3, %1, %8   ; // a[i] > max
                SLT  %5, %4, %9   ; // i < n
        IF(%3)  ADDI %8, %1, #0   ; // update max
                BNEZ %5, LOOP
CPI = 8/7 = 1.1

Loop Unrolling for ILP
  max = a[0] ;
  for(i=1;i<n;i+=2) {
    if(a[i] > max) max = a[i];
    if(a[i+1] > max) max = a[i+1];
  }
  return max ;

Unroll the loop to make an even bigger basic block
  More opportunities for parallelism
  Lower loop overhead: 6+4 vs 2(3+4)
This loop has a loop-carried dependence through max; more parallelism if the loop is not serial.

  LOOP:          LW   %1, %2       ; // a[i]
                 LW   %11, 4(%2)   ; // a[i+1]
                 SGT  %3, %1, %8   ; // a[i] > max
                 ADDI %2, %2, #8   ; // update a[i] ptr
        IF(%3)   ADDI %8, %1, #0   ; // update max
                 SGT  %12, %11, %8 ; // a[i+1] > max
        IF(%12)  ADDI %8, %11, #0  ; // update max
                 ADDI %4, %4, #2   ; // update i
                 SLT  %5, %4, %9   ; // i < n-1
                 BNEZ %5, LOOP

IC per iteration = 5
CPI = 11/10 = 1.1
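The effect of predication plus unrolling on the max loop can be mirrored in Python as a rough sketch. The conditional expression stands in for a predicated ADDI (the update always "executes" but only takes effect when the predicate is true). The cleanup step for a leftover element is an addition made here; the slide's loop appears to assume the trip count divides evenly by two. Function names are illustrative.

```python
def max_branchy(a):
    """Original loop: one data-dependent branch per element."""
    m = a[0]
    for i in range(1, len(a)):
        if a[i] > m:
            m = a[i]
    return m


def max_predicated_unrolled(a):
    """Unrolled by 2, with each conditional update written as a select."""
    m = a[0]
    n = len(a)
    i = 1
    while i < n - 1:
        p = a[i] > m                   # SGT  %3, %1, %8
        m = a[i] if p else m           # IF(%3) ADDI %8, %1, #0
        q = a[i + 1] > m               # SGT  %12, %11, %8
        m = a[i + 1] if q else m       # IF(%12) ADDI %8, %11, #0
        i += 2
    if i < n:                          # cleanup for a leftover element (assumption)
        p = a[i] > m
        m = a[i] if p else m
    return m
```

The second comparison still reads the value of m produced by the first select, which is the loop-carried dependence through max that limits how much of the unrolled body can run in parallel.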

Multiple ALUs - ILP

Single ALU
  LOOP:          LW   %1, %2       // a[i]
                 LW   %11, 4(%2)   // a[i+1]
                 SGT  %3, %1, %8   // a[i] > max
                 ADDI %2, %2, #8   // update a[i] ptr
        IF(%3)   ADDI %8, %1, #0   // update max
                 SGT  %12, %11, %8 // a[i+1] > max
        IF(%12)  ADDI %8, %11, #0  // update max
                 ADDI %4, %4, #2   // update i
                 SLT  %5, %4, %9   // i < n-1
                 BNEZ %5, LOOP

Rescheduled for two ALUs
  LOOP:          LW   %1, %2
                 LW   %11, 4(%2)
                 ADDI %2, %2, #8
                 ADDI %4, %4, #2
                 SGT  %3, %1, %8
                 SLT  %5, %4, %9
        IF(%3)   ADDI %8, %1, #0
                 SGT  %12, %11, %8
        IF(%12)  ADDI %8, %11, #0
                 BNEZ %5, LOOP

IC per iteration = 5
CPI = 7/10 = 0.7

Compilation and ISA
• Efficient compilation requires knowledge of the pipeline structure
  • latency of each operation type
• But a good ISA transcends several implementations with different pipelines
  • should things like a delayed branch be in an ISA?
  • should a compiler use the properties of one implementation when compiling for an ISA?
  • do we need a new interface?

Very-Long Instruction Word (VLIW) Computers
Instruction word consists of several conventional 3-operand instructions (up to 28 on the Multiflow Trace), one for each of the ALUs
Register file has 3N ports to feed N ALUs
All ALU-ALU communication takes place via the register file
[Figure: the IP addresses an instruction memory whose wide word holds several (Op, Rd, Ra, Rb) fields, each driving one ALU attached to a shared register file]

VLIW Pros and Cons
Pros
  Very simple hardware
    no dependency detection
    simple issue logic
    just ALUs and register file
  Potentially exploits large amounts of ILP
Cons
  Lockstep execution (static schedule)
    very sensitive to long-latency operations (cache misses)
  Global register file hard to build
  Lots of NOPs
    poor code 'density'
    I-cache capacity and bandwidth compromised
  Must recompile sources
  Implementation visible through ISA

EPIC - Intel Merced
128-bit instructions
  three 3-address operations
  a template that encodes dependencies
128 general registers
predication
speculative load
[Figure: instruction format: op1, op2, op3, tmp; each operation holds pred, op, rd, rs1, rs2, const]

Multiple Issue
[Figure: instruction memory feeds an instruction buffer; hazard detection logic selects instructions to issue to the ALUs and register file]

Multiple Issue (Details)
Superficially looks like VLIW, but:
  Dependencies and structural hazards are checked at run-time
  Can run existing binaries
    must recompile for performance, not correctness
  More complex issue logic
    Swizzle next N instructions into position
    Check dependencies and resource needs
    Issue M <= N instructions that can execute in parallel

Example Multiple Issue
Issue rules: at most 1 LD/ST, at most 1 floating op
Latency: LD - 1, int - 1, F* - 1, F+ - 1 cycle

  Cycle
  1      LOOP: LD    %F0, 0(%1)        // a[i]
  2            LD    %F2, 0(%2)        // b[i]
  3            MULTD %F8, %F0, %F2     // a[i] * b[i]
  4            ADDD  %F12, %F8, %F16   // + c
  5            SD    %F12, 0(%3)       // d[i]
               ADDI  %1, %1, 4
  6            ADDI  %2, %2, 4
               ADDI  %3, %3, 4
  7            ADDI  %4, %4, 1         // increment i
               SLT   %5, %4, %6        // i<n-1
  8            BNEQ  %5, %0, LOOP

Old CPI = 1
New CPI = 8/11 = 0.73

Rescheduled for Multiple Issue
Issue rules: at most 1 LD/ST, at most 1 floating op
Latency: LD - 1, int - 1, F* - 1, F+ - 1 cycle

  Cycle
  1      LOOP: LD    %F0, 0(%1)        // a[i]
               ADDI  %1, %1, 4
  2            LD    %F2, 0(%2)        // b[i]
               ADDI  %2, %2, 4
  3            MULTD %F8, %F0, %F2     // a[i] * b[i]
               ADDI  %4, %4, 1         // increment i
  4            ADDD  %F12, %F8, %F16   // + c
               SLT   %5, %4, %6        // i<n-1
  5            SD    %F12, 0(%3)       // d[i]
               ADDI  %3, %3, 4
  6            BNEQ  %5, %0, LOOP

Old CPI = 0.73
New CPI = 6/11 = 0.55
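The run-time issue check can be sketched as a greedy in-order loop. This is an illustration under assumptions: a 2-wide machine, the two issue rules stated on the slide, unit latency, and a strict rule that an instruction never issues in the same cycle as one whose result it reads or overwrites. The tuple format and names are invented for the example.

```python
MEM_OPS = {"LD", "SD"}
FP_OPS = {"MULTD", "ADDD"}


def issue_cycles(instrs, width=2):
    """Count cycles for in-order issue of up to `width` instructions per cycle."""
    cycles = 0
    i = 0
    while i < len(instrs):
        cycles += 1
        written = set()          # registers written by this cycle's group
        mem_used = fp_used = False
        issued = 0
        while i < len(instrs) and issued < width:
            op, dest, srcs = instrs[i]
            if any(s in written for s in srcs):
                break            # RAW dependence on an instruction in this group
            if dest is not None and dest in written:
                break            # WAW within the group
            if (op in MEM_OPS and mem_used) or (op in FP_OPS and fp_used):
                break            # structural limit: 1 LD/ST and 1 FP op per cycle
            mem_used = mem_used or op in MEM_OPS
            fp_used = fp_used or op in FP_OPS
            if dest is not None:
                written.add(dest)
            issued += 1
            i += 1
    return cycles


# The rescheduled loop body as (op, destination or None, sources).
RESCHEDULED_LOOP = [
    ("LD", "F0", ("%1",)), ("ADDI", "%1", ("%1",)),
    ("LD", "F2", ("%2",)), ("ADDI", "%2", ("%2",)),
    ("MULTD", "F8", ("F0", "F2")), ("ADDI", "%4", ("%4",)),
    ("ADDD", "F12", ("F8", "F16")), ("SLT", "%5", ("%4", "%6")),
    ("SD", None, ("F12", "%3")), ("ADDI", "%3", ("%3",)),
    ("BNEQ", None, ("%5", "%0")),
]

print(issue_cycles(RESCHEDULED_LOOP))  # prints 6
```

Stopping at the first instruction that cannot issue keeps the model in order; the unissued instructions simply become the head of the next cycle's window.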

Multiple Issue vs VLIW
More complex issue logic
  check dependencies
  check structural hazards
  issue a variable number of instructions (0-N)
  shift unissued instructions over
Able to run existing binaries
  recompile for performance, not correctness
Datapaths nearly identical
Neither VLIW nor multiple issue can schedule around run-time variation in instruction latency
  e.g., cache misses
Dealing with run-time variation requires run-time (dynamic) scheduling

Next Time
• Dynamic scheduling
• Out-of-order issue
• Register renaming
  • gets rid of WAW and WAR hazards
• Reservation stations
  • wait for operands and execution unit
  • distributed issue decision
• Reorder buffer
  • need to commit or retire instructions in order
