Presentation Transcript


1. Topic 3: Exploitation of Instruction Level Parallelism
"The secret to creativity is knowing how to hide your sources." - Albert Einstein
eleg652-F06

2. Reading List
• Slides: Topic3x
• Henn&Patt: Chapters 3 & 4
• Other assigned readings from homework and classes

3. Instruction Level Parallelism
• Parallelism that is found between instructions (or within an instruction)
• Dynamic and static exploitation
  • Dynamic: hardware related
  • Static: software related (compiler and system software)
• VLIW and superscalar
• Micro-dataflow and Tomasulo's algorithm

4. RISC Concepts: Revisited
• Reduced Instruction Set Computer
• "Internal computing architecture in which processor instructions are pared down so that most of them can be executed in one clock cycle, theoretically improving computing efficiency" - Black Box Pocket Glossary of Computer Terms
• Characteristics:
  • Uniform instruction encoding
  • Homogeneous register banks
  • Simplified addressing modes
  • Simplified data structures
  • Branch delay slot
  • Cache
  • Pipeline

5. RISC Concepts: Revisited
• What prevents one instruction per cycle (CPI = 1)?
  • Hazards
  • Dependencies
  • Long-latency ops
  • Cache thrashing

6. Pipeline: A Review
• Hazards
  • Any situation that prevents the smooth flow of instructions along the pipeline
• Types
  • Structural: due to limited resources and contention among them
  • Control: instructions that change the PC (program counter)
  • Data: an instruction depends on values from a previous instruction
• Stall
  • Hazards will "stall" the pipeline
  • Serious: it can hold up many instructions for many cycles

7. RISC Pipeline & Instruction Issue
• Instruction issue
  • The process of letting an instruction move from ID to EXEC
• Issue vs. execution
• In DLX
  • ID: check all data hazards, stall if any exists
Typical RISC pipeline: Instruction Fetch, Instruction Decode, Execute, Memory Op, Register Update

8. Hazards
• Structural hazards
  • Non-pipelined function units
  • Single-port register bank and single-port memory bank
• Data hazards
  • For some: forwarding
  • For others: pipeline interlock
Example:
  LD  R1, A
  ADD R4, R1, R7   ; needs a bubble / stall
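The forwarding-vs-interlock split above can be made concrete. A minimal sketch in Python (the tuple encoding of instructions is my own, illustrative only): ALU results can be forwarded from EX/MEM in time, but a load's value is only available after MEM, one cycle too late for the next instruction's EX stage, so that case needs an interlock.

```python
# Hedged sketch: instructions encoded as (opcode, dest, sources) tuples
# (an illustrative encoding, not the DLX hardware's actual representation).
def needs_interlock(producer, consumer):
    """True when forwarding alone cannot resolve the RAW hazard:
    a load's value arrives after MEM, one cycle too late for the
    immediately following instruction's EX stage."""
    op, dest, _ = producer
    _, _, srcs = consumer
    return op == "LD" and dest in srcs

# The slide's example: LD R1, A followed by ADD R4, R1, R7
print(needs_interlock(("LD", "R1", []), ("ADD", "R4", ["R1", "R7"])))   # → True
# An ALU producer can be forwarded, so no bubble is needed:
print(needs_interlock(("ADD", "R1", ["R2", "R3"]), ("SUB", "R4", ["R1", "R5"])))  # → False
```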

9. Structural Hazard
Instruction        1   2   3   4   5   6   7   8   9
Load instruction   IF  ID  EX  MEM WB
Instruction i+1        IF  ID  EX  MEM WB
Instruction i+2            IF  ID  EX  MEM WB
Instruction i+3                IF  ID  EX  MEM WB
Instruction i+4                    IF  ID  EX  MEM
With a single memory bank for instructions and data, instruction i+3's IF in cycle 4 contends with the load's MEM access for the one memory port.

10. Data Hazards
Data is written in the WB stage; data is read in the ID stage.
The ADD instruction writes a register that is a source operand for the SUB instruction, but ADD doesn't finish writing the data into the register file until three clock cycles after SUB begins reading it. The SUB instruction may read the incorrect value, and the result may be non-deterministic. Solved by forwarding.

11. Data Dependency: A Review
  B + C → A   (1)
  A + C → B   (2)
  A + D → E   (3)
  E + D → A   (4)
• Flow dependency (RAW conflict): e.g., (1) → (2), A is written and then read
• Anti-dependency (WAR conflict): e.g., (3) → (4), A is read and then written
• Output dependency (WAW conflict): e.g., (1) and (4) both write A
• RAR is not really a problem
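The four conflict types above can be checked mechanically. A small illustrative helper (Python; the statement encoding as a destination plus a set of sources is my own, not the slides'):

```python
def classify(first, second):
    """Classify the dependence of a later statement on an earlier one.
    Each statement is (destination, set-of-source-operands)."""
    d1, srcs1 = first
    d2, srcs2 = second
    kinds = []
    if d1 in srcs2:
        kinds.append("RAW (flow)")       # earlier write, later read
    if d2 in srcs1:
        kinds.append("WAR (anti)")       # earlier read, later write
    if d1 == d2:
        kinds.append("WAW (output)")     # both write the same location
    return kinds or ["RAR or independent"]

# B + C -> A followed by A + C -> B: flow on A, anti on B
print(classify(("A", {"B", "C"}), ("B", {"A", "C"})))
# B + C -> A followed by E + D -> A: output dependency on A
print(classify(("A", {"B", "C"}), ("A", {"E", "D"})))
```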

12. Forwarding Example
ADD R1, R2, R3    IF  ID  EX  MEM WB
SUB R4, R1, R5        IF  ID  EX  MEM WB
AND R6, R1, R7            IF  ID  EX  MEM WB
OR  R8, R1, R9                IF  ID  EX  MEM WB
XOR R10, R1, R11                  IF  ID  EX  MEM WB
The R1 produced by ADD is forwarded to the EX stages of the following instructions.

13. Bypassing Pitfalls
The code:
  LW  R1, 32(R6)
  ADD R4, R1, R7
  SUB R5, R1, R8
  AND R6, R1, R7
The pipeline:
  LW  R1, 32(R6)   IF  ID  EX    MEM WB
  ADD R4, R1, R7       IF  ID    STALL EX  MEM WB
  SUB R5, R1, R8           IF    STALL ID  EX  MEM WB
  AND R6, R1, R7           STALL IF    ID  EX  MEM WB
The load delay slot cannot be eliminated by forwarding alone.
Pipeline interlock: stall / bubble for hazards that cannot be solved by forwarding.

14. Pipelining
• Issue: passing the Instruction Decode stage
• DLX: only issue an instruction if there is no hazard
• Detecting interlocks early in the pipeline has the advantage that the pipeline never needs to suspend an instruction and undo state changes.

15. Instruction Level Parallelism
• Static scheduling
  • Simple scheduling
  • Loop unrolling
  • Loop unrolling + scheduling
  • Software pipelining
• Dynamic scheduling
  • Out-of-order execution
  • Data flow computers
  • Speculation

16. Constraint Graph
• Directed edges: data dependence
• Undirected edges: resource constraints
• An edge (u, v) (directed or undirected) of length e represents an interlock between nodes u and v; they must be separated by e time units.
[Figure: constraint graph over nodes S1-S6, with operation latencies labeling the edges.]

17. Code Scheduling for a Single Pipeline
• Input: a constraint graph G = (V, E)
• Output: a sequence of operations in G (v1, v2, v3, ..., vn) plus a number of no-ops, such that:
  • If the no-ops are deleted, the sequence is a topological sort of G.
  • Any two nodes x and y in the sequence are separated by a distance greater than or equal to d(x, y) in graph G.
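The specification above can be sketched as a greedy list scheduler in Python. The three-node graph and its latencies below are made up for illustration; they are not the figure from slide 16.

```python
def schedule(nodes, edges):
    """Emit a topological order of the constraint graph, inserting 'NOP's
    so that dependent operations u -> v are separated by at least
    edges[(u, v)] cycles. edges maps (u, v) -> latency."""
    preds = {v: [] for v in nodes}
    for (u, v), lat in edges.items():
        preds[v].append((u, lat))
    issued = {}            # node -> cycle at which it was issued
    seq, cycle = [], 0
    remaining = list(nodes)
    while remaining:
        ready = [n for n in remaining
                 if all(u in issued and cycle - issued[u] >= lat
                        for u, lat in preds[n])]
        if ready:
            n = ready[0]
            issued[n] = cycle
            seq.append(n)
            remaining.remove(n)
        else:
            seq.append("NOP")   # interlock distance not yet satisfied
        cycle += 1
    return seq

# S1 -> S2 with latency 2, S2 -> S3 with latency 1:
print(schedule(["S1", "S2", "S3"], {("S1", "S2"): 2, ("S2", "S3"): 1}))
# → ['S1', 'NOP', 'S2', 'S3']
```

Deleting the NOPs from the output leaves a topological sort, and each dependent pair is spaced by at least its edge latency, matching the two conditions on the slide.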

18. Advanced Pipelining
• Instruction reordering and scheduling within the loop body
• Loop unrolling
  • Code size suffers
• Superscalar
  • Compact code
  • Multiple issue of different instruction types
• VLIW

19. VLIW
• Very Long Instruction Word
• The compiler has all the responsibility to schedule instructions
  • Makes hardware simpler
  • Moves complexity to software
• Concept developed by Joseph (Josh) Fisher at Yale University in the early 1980s

20. An Example: X[i] + a
Loop: LD   F0, 0(R1)    ; load the vector element
      ADDD F4, F0, F2   ; add the scalar in F2
      SD   0(R1), F4    ; store the vector element
      SUB  R1, R1, #8   ; decrement the pointer by 8 bytes (per DW)
      BNEZ R1, Loop     ; branch when it's not zero
The load can bypass the store. Assume that the latency for integer ops is zero and the latency for an integer load is 1.

21. An Example: X[i] + a
Loop: LD   F0, 0(R1)    1
      STALL             2   ; load latency
      ADDD F4, F0, F2   3
      STALL             4   ; FP ALU latency
      STALL             5   ; FP ALU latency
      SD   0(R1), F4    6
      SUB  R1, R1, #8   7
      BNEZ R1, Loop     8
      STALL             9
This requires 9 cycles per iteration.
[Figure: constraint graph over LD, ADDD, SD, SUB, BNEZ with edge latencies 0, 0, 1, 2, 1.]

22. An Example: X[i] + a (Scheduling)
Loop: LD   F0, 0(R1)    1
      STALL             2
      ADDD F4, F0, F2   3
      SUB  R1, R1, #8   4
      BNEZ R1, Loop     5
      SD   8(R1), F4    6
This requires 6 cycles per iteration.
[Figure: constraint graph over LD, ADDD, SD, SUB, BNEZ with edge latencies 0, 0, 1, 2, 1.]

23. An Example: X[i] + a (Unrolling)
Loop: LD   F0, 0(R1)      1
      NOP                 2
      ADDD F4, F0, F2     3
      NOP                 4
      NOP                 5
      SD   0(R1), F4      6
      LD   F6, -8(R1)     7
      NOP                 8
      ADDD F8, F6, F2     9
      NOP                 10
      NOP                 11
      SD   -8(R1), F8     12
      LD   F10, -16(R1)   13
      NOP                 14
      ADDD F12, F10, F2   15
      NOP                 16
      NOP                 17
      SD   -16(R1), F12   18
      LD   F14, -24(R1)   19
      NOP                 20
      ADDD F16, F14, F2   21
      NOP                 22
      NOP                 23
      SD   -24(R1), F16   24
      SUB  R1, R1, #32    25
      BNEZ R1, LOOP       26
      NOP                 27
This requires 27 / 4 ≈ 6.8 cycles per iteration.

24. An Example: X[i] + a (Unrolling + Scheduling)
Loop: LD   F0, 0(R1)      1
      LD   F6, -8(R1)     2
      LD   F10, -16(R1)   3
      LD   F14, -24(R1)   4
      ADDD F4, F0, F2     5
      ADDD F8, F6, F2     6
      ADDD F12, F10, F2   7
      ADDD F16, F14, F2   8
      SD   0(R1), F4      9
      SD   -8(R1), F8     10
      SD   -16(R1), F12   11
      SUB  R1, R1, #32    12
      BNEZ R1, LOOP       13
      SD   8(R1), F16     14
This requires 14 / 4 = 3.5 cycles per iteration.
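The cycles-per-iteration figures on slides 21-24 are simply total cycles divided by the iterations each body covers; a quick arithmetic check (note that slide 23's "6.8" is 27/4 = 6.75, rounded):

```python
# Cycle and iteration counts taken from the four code versions above.
variants = [
    ("original",             9,  1),
    ("scheduled",            6,  1),
    ("unrolled (4x)",        27, 4),
    ("unrolled + scheduled", 14, 4),
]
for name, cycles, iters in variants:
    print(f"{name}: {cycles / iters:.2f} cycles per iteration")
```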

25. Topic 3a: Multi-Issue Architectures - Beyond Simple RISC

26. ILP
• ILP of a program
  • The average number of instructions that a superscalar processor might be able to execute at the same time
  • Limited by data dependencies, latencies, and other processor difficulties
• ILP of a machine
  • The ability of a processor to take advantage of the program's ILP
  • The number of instructions that can be fetched and executed at the same time by the processor

27. Multi-Issue Architectures
• Superscalar
  • "Machines that issue multiple independent instructions per clock cycle when they are properly scheduled by the compiler and runtime scheduler"
• Very Long Instruction Word
  • "A machine where the compiler has complete responsibility for creating a package of instructions that can be simultaneously issued, and the hardware does not dynamically make any decisions about multiple issue"
Patterson & Hennessy, pp. 317-318

28. Multiple Instruction Issue
• Multiple issue + static scheduling → VLIW
• Dynamic scheduling
  • Tomasulo
  • Scoreboarding
• Multiple issue + dynamic scheduling → superscalar
• Decoupled architectures
  • Static scheduling of register-register instructions
  • Dynamic scheduling of memory ops
  • Buffers

29. Five Primary Approaches

30. Two-Issue Architecture: Unrolled and Scheduled Code
Integer instruction      FP instruction        Clock cycle
Loop: LD F0, 0(R1)                             1
      LD F6, -8(R1)                            2
      LD F10, -16(R1)    ADDD F4, F0, F2       3
      LD F14, -24(R1)    ADDD F8, F6, F2       4
      LD F18, -32(R1)    ADDD F12, F10, F2     5
      SD 0(R1), F4       ADDD F16, F14, F2     6
      SD -8(R1), F8      ADDD F20, F18, F2     7
      SD -16(R1), F12                          8
      SD -24(R1), F16                          9
      SUB R1, R1, #40                          10
      BNEZ R1, LOOP                            11
      SD 8(R1), F20                            12
The unrolled and scheduled code → 2.4 cycles per iteration (5 iterations in 12 cycles).

31. A VLIW Code Sequence (loop unrolled to cover seven iterations)
Memory ref 1      Memory ref 2      FP op 1            FP op 2            Integer op/branch
LD F0, 0(R1)      LD F6, -8(R1)
LD F10, -16(R1)   LD F14, -24(R1)
LD F18, -32(R1)   LD F22, -40(R1)   ADDD F4, F0, F2    ADDD F8, F6, F2
LD F26, -48(R1)                     ADDD F12, F10, F2  ADDD F16, F14, F2
                                    ADDD F20, F18, F2  ADDD F24, F22, F2
SD 0(R1), F4      SD -8(R1), F8     ADDD F28, F26, F2
SD -16(R1), F12   SD -24(R1), F16
SD -32(R1), F20   SD -40(R1), F24                                        SUB R1, R1, #48
SD -0(R1), F28                                                           BNEZ R1, LOOP
7 iterations in 9 cycles → 1.28 cycles per iteration

32. Trace Scheduling
• First used for VLIW architectures
• Trace
  • A straight-line sequence of instructions executed for some data, or a sequence of ops which constitutes a possible path based on "predicted" branches
• Trace scheduling
  • Identify a "most probable" sequence of instructions and then "compact" the instructions in such a path
• Tools
  • For loops: unrolling
  • For branches: static branch prediction

33. An Example: Traces
A;
B;
C;
if (D) {
  E;
  F;
} else {
  G;
}
H;
I;
Trace 1: A, B, C, br D, E, F, H, I. Trace 2: A, B, C, br D, G, H, I.
Basic block: an instruction sequence which has only one entry point and one exit point (no branch targets or branches in the middle).
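The basic-block definition above corresponds to the classic "leader" rule: the first instruction, every branch target, and every instruction that follows a branch each start a new block. A small sketch (Python; the instruction encoding and single-target `br` are hypothetical simplifications of the slide's example):

```python
def basic_blocks(code):
    """Split a list of (label, opcode, target-or-None) instructions
    into basic blocks using the leader rule."""
    labels = {lbl: i for i, (lbl, _, _) in enumerate(code)}
    leaders = {0}                                 # first instruction
    for i, (_, op, target) in enumerate(code):
        if op == "br":
            leaders.add(labels[target])           # branch target
            if i + 1 < len(code):
                leaders.add(i + 1)                # instruction after a branch
    cuts = sorted(leaders) + [len(code)]
    return [code[a:b] for a, b in zip(cuts, cuts[1:])]

# One path through the slide's example: A; B; C; br D -> H; E; F; H; I
prog = [("A", "op", None), ("B", "op", None), ("C", "br", "H"),
        ("E", "op", None), ("F", "op", None),
        ("H", "op", None), ("I", "op", None)]
print([[inst[0] for inst in b] for b in basic_blocks(prog)])
# → [['A', 'B', 'C'], ['E', 'F'], ['H', 'I']]
```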

34. Code Motion & Compensation Code
[Figure: three versions of the trace A, B, C, br D, E, F / G, H, I - the original code, the code with an operation moved up into the preceding block, and the code with an operation moved down into the succeeding block. Compensation code (e.g., "undo E") is inserted on the off-trace path to preserve correctness.]

35. Trace Scheduling
• Similar to basic block scheduling
  • But the unit is a trace, not a basic block
• Reduces the execution time of likely traces
  • Using profiling

36. Software Pipelining
• Reorganizing loops such that each iteration is composed of instruction sequences chosen from different iterations
• Smaller code size compared to unrolling
• Some architectures have specific software support
  • Rotating register banks
  • Predicated instructions

37. Software Pipelining
• Overlap instructions without unrolling the loop
• Given the vector M in memory, and ignoring the start-up and finishing code, we have:
Loop: SD   0(R1), F4    ; stores into M[i]
      ADDD F4, F0, F2   ; adds to M[i+1]
      LD   F0, -8(R1)   ; loads M[i+2]
      BNEZ R1, LOOP
      SUB  R1, R1, #8   ; subtract in delay slot
This loop can be run at a rate of 5 cycles per result, ignoring the start-up and clean-up portions.

38. Software Pipelining
[Figure: iterations overlapped over time - each software-pipelined iteration combines instructions drawn from several original iterations.]

39. Software Pipelined Code vs. Unrolled Loop
[Figure: number of overlapped instructions over time for software-pipelined code (prologue, steady state, epilogue) and for an unrolled loop.]
Overhead for software pipelining: two times the cost - once for the prologue and once for the epilogue.
Overhead for an unrolled loop: m/n times the cost, for m loop executions and n-way unrolling.

40. Loop Unrolling vs. Software Pipelining
• When not running at the maximum rate
  • Unrolling: pay the overhead m/n times for m iterations and n-way unrolling
  • Software pipelining: pay the overhead twice
    • Once at the prologue and once at the epilogue
• Moreover
  • Code compactness
  • Optimal runtime
  • Storage constraints
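The two overhead formulas above, compared in numbers (a sketch; the trip count of 1000 and 4-way unrolling are arbitrary example values):

```python
def unrolling_overhead(m, n):
    """Unrolling pays start-up/wind-down overhead once per pass through
    the unrolled body: m iterations with n-way unrolling -> m/n passes."""
    return m / n

def software_pipelining_overhead():
    """Software pipelining pays a fixed cost: one prologue + one epilogue,
    independent of the trip count."""
    return 2

m, n = 1000, 4
print(unrolling_overhead(m, n))        # → 250.0
print(software_pipelining_overhead())  # → 2
```

For long-running loops the fixed prologue/epilogue cost is quickly amortized, which is the slide's point in favor of software pipelining.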

41. Comparison of Static Methods

42. On a Final Note
Loop unrolling, trace scheduling, and software pipelining all aim at exposing fine-grain parallelism. "The effectiveness of these techniques and their suitability for various architectural approaches are among the most open research areas in pipelined processor design" - Henn & Patt

43. Limitations of VLIW
• Limited parallelism in (statically scheduled) code
  • Basic blocks may be too small
  • Global code motion is difficult
• Limited hardware resources
  • Code size
  • Memory port limitations
• A stall is serious
• Caches are difficult to use effectively
  • I-cache misses have the potential to multiply the miss rate by a factor of n, where n is the issue width
  • The cache miss penalty is increased because of the length of the instruction word

44. An Open Question
"...Whether there are large classes of applications that are not suitable for vector machines, but still offer enough parallelism to justify the VLIW approach rather than a simpler one, such as a superscalar machine?" - Henn & Patt, 1990

45. A VLIW Example: TMS320C62x/C67x Block Diagram
Source: TMS320C6000 Technical Brief, February 1999

46. A VLIW Example: TMS320C62x/C67x Data Paths
Assembly example. Source: TMS320C6000 Technical Brief, February 1999

47. Topic 3b: Introduction to Superscalar

48. Instruction Issue Policy
• It determines the processor's lookahead policy
  • The ability to examine instructions beyond the current PC
  • Lookahead must ensure correctness at all costs
• Issue policy
  • The protocol used to issue instructions
• Note: distinguish issue, execution, and completion

49. Issues in Out-of-Order Execution & Completion
R3 := R3 op R5   (1)
R4 := R3 + 1     (2)
R3 := R5 + 1     (3)
R7 := R3 op R4   (4)
Flow dependency: (1) → (2). Anti-dependency: (2) → (3). Output dependency: (1) → (3).
(2) and (3) cannot be completed out of order; otherwise the anti-dependence may be violated, i.e., the R3 read by (2) may be incorrectly overwritten by (3) when (2) is stalled for some reason.

50. Issues in Out-of-Order Execution & Completion
R3 := R3 op R5   (1)
R4 := R3 + 1     (2)
R3 := R5 + 1     (3)
R7 := R3 op R4   (4)
(1) and (3) cannot be completed out of order! Output dependence has to be checked against all preceding instructions already in the execution pipes before an instruction is issued, to ensure that results are written in the correct order. Otherwise the R3 read by (4) may get a wrong value.
