This lecture focuses on the principles of pipelining and superscalar architecture, highlighting their importance in improving system throughput with minimal costs in hardware, power, and complexity. Key performance metrics such as bandwidth, latency, and optimal pipeline depth are discussed alongside hardware cost models and cost/performance tradeoffs. The lecture emphasizes the significance of balancing pipeline stages, avoiding data hazards, and optimizing instruction cycles. Real-world examples, such as automobile assembly lines and floating-point multipliers, illustrate these concepts in practical scenarios.
Advanced Microarchitecture Lecture 2: Pipelining and Superscalar Review
Pipelined Design
• Motivation: increase throughput with little increase in cost (hardware, power, complexity, etc.)
• Bandwidth or Throughput = Performance
  • BW = number of tasks / unit time
  • For a system that operates on one task at a time: BW = 1 / latency
• Pipelining can increase BW if there are many repetitions of the same operation/task
• Latency per task remains the same or increases
Pipelining Illustrated
• Unpipelined combinational logic, N gate delays: BW ≈ 1/N
• Two stages of N/2 gate delays each: BW ≈ 2/N
• Three stages of N/3 gate delays each: BW ≈ 3/N
Performance Model
• Starting from an unpipelined version with propagation delay T and BW = 1/T:
  Perf_pipe = BW_pipe = 1 / (T/k + S), where k = number of stages and S = latch delay
• (Figure: the unpipelined logic of delay T vs. a k-stage pipeline with T/k of logic plus a latch of delay S per stage)
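As a quick check of this model, the speedup over the unpipelined design (not stated on the slide, but it follows directly from the two bandwidth expressions) is:

```latex
\mathrm{Speedup}(k) \;=\; \frac{BW_{pipe}}{BW_{unpipe}}
  \;=\; \frac{1/(T/k + S)}{1/T}
  \;=\; \frac{T}{T/k + S}
  \;\longrightarrow\; \frac{T}{S} \quad \text{as } k \to \infty .
```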
Hardware Cost Model
• Starting from an unpipelined version with hardware cost G:
  Cost_pipe = G + kL, where k = number of stages and L = latch cost (incl. control)
• (Figure: the unpipelined logic of cost G vs. a k-stage pipeline with G/k of logic plus a latch of cost L per stage)
Cost/Performance Tradeoff
• Cost/Performance: C/P = (Lk + G) / [1/(T/k + S)] = (Lk + G)(T/k + S) = LT + GS + LSk + GT/k
• Optimal cost/performance: find the minimum C/P w.r.t. the choice of k
• d(C/P)/dk = LS − GT/k² = 0  ⇒  k_opt = √(GT / (LS))   (numeric sketch below)
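A minimal numeric sketch of this optimum (Python; it reuses the two parameter sets from the plot on the next slide, and the function names are my own):

```python
import math

def cost_perf(k, G, L, T, S):
    """C/P for a k-stage pipeline: cost (G + k*L) divided by perf 1/(T/k + S)."""
    return (G + k * L) * (T / k + S)

def k_opt(G, L, T, S):
    """Closed-form optimum from d(C/P)/dk = L*S - G*T/k**2 = 0."""
    return math.sqrt(G * T / (L * S))

for G, L, T, S in [(175, 41, 400, 22), (175, 21, 400, 11)]:
    k = k_opt(G, L, T, S)
    print(f"G={G}, L={L}, T={T}, S={S}: k_opt = {k:.1f}, C/P there = {cost_perf(k, G, L, T, S):.0f}")
```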
“Optimal” Pipeline Depth: k_opt
• (Plot: cost/performance ratio C/P (×10⁴) vs. pipeline depth k, for the two parameter sets G=175, L=41, T=400, S=22 and G=175, L=21, T=400, S=11)
Cost?
• “Hardware cost”: transistor/gate count (should include the additional logic to control the pipeline), and area (related to gate count)
• Power! More gates ⇒ more switching; more gates ⇒ more leakage
• Many metrics to optimize; very difficult to determine what really is “optimal”
Pipelining Idealism
• Uniform Suboperations: the operation to be pipelined can be evenly partitioned into uniform-latency suboperations
• Repetition of Identical Operations: the same operations are to be performed repeatedly on a large number of different inputs
• Repetition of Independent Operations: all the repetitions of the same operation are mutually independent, i.e., no data dependences and no resource conflicts
• Good examples: automobile assembly line, floating-point multiplier, instruction pipeline (?)
Instruction Pipeline Design
• Uniform suboperations … NOT! ⇒ balance pipeline stages
  • Stage quantization to yield balanced stages
  • Minimize internal fragmentation (some waiting stages)
• Identical operations … NOT! ⇒ unify instruction types
  • Coalesce instruction types into one “multi-function” pipe
  • Minimize external fragmentation (some idling stages)
• Independent operations … NOT! ⇒ resolve data and resource hazards
  • Inter-instruction dependency detection and resolution
  • Minimize performance loss
The Generic Instruction Cycle
• The “computation” to be pipelined:
  • Instruction Fetch (IF)
  • Instruction Decode (ID)
  • Operand(s) Fetch (OF)
  • Instruction Execution (EX)
  • Operand Store (OS), a.k.a. Writeback (WB)
  • Update Program Counter (PC)
The Generic Instruction Pipeline
• Based on the obvious subcomputations:
  • IF: Instruction Fetch
  • ID: Instruction Decode
  • OF/RF: Operand Fetch
  • EX: Instruction Execute
  • OS/WB: Operand Store
Balancing Pipeline Stages
• Stage latencies: T_IF = 6 units, T_ID = 2 units, T_OF = 9 units, T_EX = 5 units, T_OS = 9 units
• Without pipelining: T_cyc = T_IF + T_ID + T_OF + T_EX + T_OS = 31
• Pipelined: T_cyc = max{T_IF, T_ID, T_OF, T_EX, T_OS} = 9
• Speedup = 31 / 9   (see the sketch below)
• Can we do better in terms of either performance or efficiency?
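A tiny numeric sketch of the two clocking choices above (Python; the stage latencies are the ones listed on the slide):

```python
stages = {"IF": 6, "ID": 2, "OF": 9, "EX": 5, "OS": 9}   # latencies in "units"

t_unpipelined = sum(stages.values())   # one instruction every 31 units
t_pipelined = max(stages.values())     # one instruction every 9 units in steady state

print(t_unpipelined, t_pipelined, round(t_unpipelined / t_pipelined, 2))   # 31 9 3.44
```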
Balancing Pipeline Stages
• Two methods for stage quantization:
  • Merging multiple subcomputations into one
  • Subdividing a subcomputation into multiple smaller ones
• Recent/current trends:
  • Deeper pipelines (more and more stages), up to a certain point: then the cost function takes over
  • Multiple different pipelines/subpipelines
  • Pipelining of memory accesses (tricky)
Granularity of Pipeline Stages
• Original subcomputation latencies: (T_IF, T_ID, T_OF, T_EX, T_OS) = (6, 2, 9, 5, 9)
• Finer-grained machine cycle: T_cyc = 3 units ⇒ 11 machine cycles per instruction
• Coarser-grained machine cycle: 4 machine cycles per instruction (stages T_IF&ID = 8, T_OF = 9, T_EX = 5, T_OS = 9)
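A small follow-up sketch contrasting the two granularities (Python; latch overhead is ignored, matching the simplified numbers on the slide):

```python
stage_latencies = [6, 2, 9, 5, 9]   # IF, ID, OF, EX, OS, in "units"

# Coarser-grained: merge IF+ID, keep the rest; the cycle time is set by the slowest stage
coarse_stages = [6 + 2, 9, 5, 9]
coarse_cyc = max(coarse_stages)      # 9 units, 4 machine cycles per instruction

# Finer-grained: cut every subcomputation into 3-unit slices
fine_cyc = 3
fine_stage_count = sum(-(-t // fine_cyc) for t in stage_latencies)   # ceiling division -> 11

print(f"coarse: {len(coarse_stages)} stages, one instruction per {coarse_cyc} units in steady state")
print(f"fine:   {fine_stage_count} stages, one instruction per {fine_cyc} units in steady state")
```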
Hardware Requirements
• Logic needed for each pipeline stage
• Register file ports needed to support all (relevant) stages
• Memory access ports needed to support all (relevant) stages
Pipeline Examples
• AMDAHL 470V/7 (generic phases IF/ID/OF/EX/OS): PC GEN, Cache Read, Cache Read, Decode, Read REG, Add GEN, Cache Read, Cache Read, EX 1, EX 2, Check Result, Write Result
• MIPS R2000/R3000: IF, RD, ALU, MEM, WB
Instruction Dependencies
• Data Dependence
  • True Dependence (RAW): an instruction must wait for all of its required input operands
  • Anti-Dependence (WAR): a later write must not clobber a still-pending earlier read
  • Output Dependence (WAW): an earlier write must not clobber an already-finished later write
• Control Dependence (a.k.a. Procedural Dependence)
  • Conditional branches cause uncertainty in instruction sequencing
  • Instructions following a conditional or computed branch depend on the execution of the branch instruction
Example: Quick Sort on MIPS
# for (;(j<high)&&(array[j]<array[low]);++j);
# $10 = j; $9 = high; $6 = array; $8 = low
bge $10, $9, $36
mul $15, $10, 4
addu $24, $6, $15
lw $25, 0($24)
mul $13, $8, 4
addu $14, $6, $13
lw $15, 0($14)
bge $25, $15, $36
$35:
addu $10, $10, 1
. . .
$36:
addu $11, $11, -1
. . .
Hardware Dependency Analysis
• Processor must handle:
  • Register data dependencies: RAW, WAW, WAR
  • Memory data dependencies: RAW, WAW, WAR
  • Control dependencies
• (A small sketch of the register-dependency check follows.)
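A minimal sketch of the register part of this check (Python; representing an instruction as explicit src/dest sets is my own simplification):

```python
def register_hazards(older, newer):
    """Classify register data dependences from an older to a newer instruction.

    Each instruction is a dict like {"dest": {"r1"}, "src": {"r2", "r3"}}.
    """
    hazards = []
    if older["dest"] & newer["src"]:
        hazards.append("RAW")   # newer reads what older writes
    if older["src"] & newer["dest"]:
        hazards.append("WAR")   # newer overwrites what older still needs to read
    if older["dest"] & newer["dest"]:
        hazards.append("WAW")   # both write the same register
    return hazards

# e.g. r1 <- r2 + 1 followed by r3 <- r1 / 17  =>  ['RAW']
print(register_hazards({"dest": {"r1"}, "src": {"r2"}},
                       {"dest": {"r3"}, "src": {"r1"}}))
```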
Terminology
• Pipeline Hazards: potential violations of program dependencies; must ensure program dependencies are not violated
• Hazard Resolution:
  • Static method: performed at compile time in software
  • Dynamic method: performed at runtime using hardware (stall, flush, or forward)
• Pipeline Interlock: hardware mechanism for dynamic hazard resolution; must detect and enforce dependencies at runtime
Pipeline: Steady State
• (Pipeline diagram: Inst_j through Inst_j+4 advance through IF, ID, RD, ALU, MEM, WB, offset by one cycle; a new instruction enters every cycle, so every stage is busy in steady state.)
Pipeline: Data Hazard
• (Pipeline diagram: same flow as above, with a data hazard between nearby instructions; a younger instruction needs a value that an older instruction has not yet written back.)
Pipeline: Stall on Data Hazard
• (Pipeline diagram: the dependent Inst_j+1 is stalled in RD until the hazard clears; Inst_j+2 is stalled in ID and Inst_j+3 in IF behind it, then all resume in order.)
Different View (figure)
Pipeline: Forwarding Paths
• (Pipeline diagram: results are forwarded from the ALU and MEM stages of older instructions to younger ones; many possible paths.)
• The MEM→ALU case still requires stalling even with forwarding paths.
ALU Forwarding Paths
• (Figure: comparators match the dest register of the instructions in the ALU and MEM stages against src1/src2 of the instruction reading the register file; on a match, the result is bypassed around the register file.)
• A deeper pipeline may require additional forwarding paths.
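A compact sketch of the selection logic implied by the figure (Python; the stage-record field names are illustrative, not from the slides):

```python
def select_operand(src_reg, regfile_value, ex_stage, mem_stage):
    """Pick the newest available value of src_reg for the instruction reading its operands.

    ex_stage / mem_stage describe the older instructions currently in the ALU and MEM
    stages (or None), e.g. {"dest": "r4", "value": 42, "is_load": False}.
    Returns (value, stall): stall is True for the load-use case, where even
    forwarding cannot supply the value yet.
    """
    if ex_stage and ex_stage["dest"] == src_reg:
        if ex_stage["is_load"]:
            return None, True              # load result not ready until MEM: must stall
        return ex_stage["value"], False    # forward the ALU result
    if mem_stage and mem_stage["dest"] == src_reg:
        return mem_stage["value"], False   # forward the MEM-stage result
    return regfile_value, False            # no match: use the register file value

# e.g. an add writing r1 sits in the ALU stage; the next instruction reads r1
print(select_operand("r1", 0, {"dest": "r1", "value": 7, "is_load": False}, None))  # (7, False)
```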
Pipeline: Control Hazard
• (Pipeline diagram: Inst_i through Inst_i+4 in flight; the instructions after Inst_i cannot be fetched with certainty until the branch outcome is known.)
Pipeline: Stall on Control Hazard
• (Pipeline diagram: Inst_i+1 is stalled in IF until the branch resolves; Inst_i+2 through Inst_i+4 follow it in order afterwards.)
Pipeline: Prediction for Control Hazards
• (Pipeline diagram: Inst_i+1 through Inst_i+4 are fetched down the predicted path; when the prediction turns out wrong, the speculative state is cleared, the wrong-path instructions become nops, and fetch is resteered to the new Inst_i+2 through Inst_i+4.)
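A toy sketch of the squash-and-resteer step shown in the diagram (Python; representing the pipeline as a plain oldest-first list is purely illustrative):

```python
def squash_and_resteer(pipeline, branch_index, correct_target):
    """Convert instructions fetched after the mispredicted branch into nops
    and return the corrected fetch PC.

    pipeline: oldest-first list of dicts like {"op": "add", "pc": 0x40}.
    branch_index: position of the mispredicted branch in that list.
    """
    for i in range(branch_index + 1, len(pipeline)):
        pipeline[i] = {"op": "nop", "pc": None}   # speculative state cleared
    return pipeline, correct_target               # fetch resteered to the right path
```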
Going Beyond Scalar
• A simple pipeline is limited to execution at CPI ≥ 1.0
• “Superscalar” can achieve CPI ≤ 1.0 (i.e., IPC ≥ 1.0)
• Superscalar means executing more than one scalar instruction in parallel (e.g., add + xor + mul)
• Contrast with Vector, which also executes multiple operations in parallel, but they must all be the same operation (e.g., four parallel additions)
Architectures for Instruction Parallelism
• Scalar pipeline (baseline)
  • Instruction/overlap parallelism = D
  • Operation latency = 1
  • Peak IPC = 1
• (Diagram: successive instructions vs. time in cycles; D different instructions overlapped)
Superscalar Machine
• Superscalar (pipelined) execution
  • Instruction parallelism = D × N
  • Operation latency = 1
  • Peak IPC = N per cycle
• (Diagram: successive instructions vs. time in cycles; D × N different instructions overlapped, N issued per cycle)
Ex. Original Pentium
• Prefetch: 4× 32-byte buffers
• Decode1: decode up to 2 insts
• Decode2 (×2): read operands, address computation
• Execute (×2): asymmetric u-pipe and v-pipe
  • Both pipes: mov, lea, simple ALU, push/pop, test/cmp
  • u-pipe only: shift, rotate, some FP
  • v-pipe only: jmp, jcc, call, fxch
• Writeback (×2)
Pentium Hazards, Stalls
• “Pairing Rules” (when can/can’t two insts exec at the same time?)
  • Read/flow dependence: mov eax, 8 ; mov [ebp], eax
  • Output dependence: mov eax, 8 ; mov eax, [ebp]
  • Partial register stalls: mov al, 1 ; mov ah, 0
  • Function unit rules: some instructions can never be paired (MUL, DIV, PUSHA, MOVS, some FP)
Limitations of In-Order Pipelines
• CPI of in-order pipelines degrades very sharply if the machine parallelism is increased beyond a certain point, i.e., when N approaches the average distance between dependent instructions
• Forwarding is no longer effective, so the machine must stall more often
• The pipeline may never be full due to the frequency of dependency stalls
The N-Instruction Limit
• Pentium: superscalar degree N = 2 is reasonable… going much further encounters rapidly diminishing returns
• Ex. superscalar degree N = 4: a dependent instruction must be at least N = 4 instructions away; any dependency among the instructions issued together will cause a stall
• On average, the parent-child separation is only about 5 instructions! (Franklin and Sohi ’92)
• An average of 5 means there are many cases where the separation is < 4; each of these limits parallelism
In Search of Parallelism
• “Trivial” parallelism is limited
  • What is trivial parallelism? In-order: sequential instructions that do not have dependencies
  • In all previous examples, all instructions executed either at the same time as or after earlier instructions
  • The previous slides show that superscalar execution quickly hits a ceiling
• So what is “non-trivial” parallelism? …
What is Parallelism?
• Work T1: time to complete a computation on a sequential system
• Critical Path T∞: time to complete the same computation on an infinitely-parallel system
• Average Parallelism: Pavg = T1 / T∞
• For a p-wide system: Tp ≥ max{T1/p, T∞}; if Pavg >> p, then Tp ≈ T1/p
• Example: x = a + b; y = b * 2; z = (x - y) * (x + y)   (worked below)
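A small sketch applying these definitions to the example expression (Python; unit latency per operation is assumed, as in the ILP discussion that follows):

```python
# Dataflow for: x = a + b; y = b * 2; z = (x - y) * (x + y)
# Each node lists the earlier nodes it depends on (inputs a, b are already available).
deps = {
    "x": [],            # a + b
    "y": [],            # b * 2
    "d": ["x", "y"],    # x - y
    "s": ["x", "y"],    # x + y
    "z": ["d", "s"],    # d * s
}

T1 = len(deps)   # work: 5 operations on a sequential system

def depth(node):
    """Length of the longest dependence chain ending at node."""
    return 1 + max((depth(p) for p in deps[node]), default=0)

T_inf = max(depth(n) for n in deps)      # critical path: 3
print(T1, T_inf, round(T1 / T_inf, 2))   # 5 3 1.67
```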
ILP: Instruction-Level Parallelism
• ILP is a measure of the inter-instruction dependencies in a program
• Average ILP = number of instructions / longest dependence path
• code1: r1 ← r2 + 1; r3 ← r1 / 17; r4 ← r0 - r3
  • ILP = 1 (must execute serially): T1 = 3, T∞ = 3
• code2: r1 ← r2 + 1; r3 ← r9 / 17; r4 ← r0 - r10
  • ILP = 3 (can execute at the same time): T1 = 3, T∞ = 1
ILP != IPC
• Instruction-level parallelism usually assumes infinite resources, perfect fetch, and unit latency for all instructions; ILP is more a property of the program's dataflow
• IPC is the “real” observed metric of exactly how many instructions are executed per machine cycle, which includes all of the limitations of a real machine
• The ILP of a program is an upper bound on the attainable IPC
Scope of ILP Analysis
• r1 ← r2 + 1; r3 ← r1 / 17; r4 ← r0 - r3   (ILP = 1 by itself)
• r11 ← r12 + 1; r13 ← r19 / 17; r14 ← r0 - r20   (ILP = 3 by itself)
• Analyzed across both blocks together: ILP = 2
DFG Analysis
A: R1 = R2 + R3
B: R4 = R5 + R6
C: R1 = R1 * R4
D: R7 = LD 0[R1]
E: BEQZ R7, +32
F: R4 = R7 - 3
G: R1 = R1 + 1
H: R4 → ST 0[R1]
J: R1 = R1 - 1
K: R3 → ST 0[R1]
In-Order Issue, Out-of-Order Completion
• (Diagram: an in-order instruction stream begins execution in order on functional units INT, Ld/St, Fadd1–2, and Fmul1–3; completion is out of order.)
• Issue = send an instruction to execution
• The issue stage needs to check: 1. structural dependence, 2. RAW hazard, 3. WAW hazard, 4. WAR hazard
Example
• Same code as the DFG Analysis slide (A–K)
• Schedule: cycle 1: A, B; 2: C; 3: D; 4: (none); 5: (none); 6: E, F, G; 7: H, J; 8: K
• IPC = 10/8 = 1.25
Example (2)
• Modified code: A: R1 = R2 + R3; B: R4 = R5 + R6; C: R1 = R1 * R4; D: R9 = LD 0[R1]; E: BEQZ R7, +32; F: R4 = R7 - 3; G: R1 = R1 + 1; H: R4 → ST 0[R9]; J: R1 = R9 - 1; K: R3 → ST 0[R1]
• Schedule: cycle 1: A, B; 2: C; 3: D, E, F, G; 4: (none); 5: (none); 6: H, J; 7: K
• IPC = 10/7 = 1.43
Track with Simple Scoreboarding
• Scoreboard: a bit array, 1 bit for each GPR
  • If the bit is not set: the register has valid data
  • If the bit is set: the register has stale data, i.e., some outstanding instruction is going to change it
• Issue in order: RD ← Fn(RS, RT)
  • If SB[RS] or SB[RT] is set ⇒ RAW, stall
  • If SB[RD] is set ⇒ WAW, stall
  • Else, dispatch to FU (Fn) and set SB[RD]
• Complete out of order: update GPR[RD], clear SB[RD]   (sketched below)
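A minimal sketch of that issue/complete rule (Python; the class and method names are my own):

```python
class SimpleScoreboard:
    """One busy bit per GPR: set means an outstanding instruction will still write it."""

    def __init__(self, num_gprs=32):
        self.busy = [False] * num_gprs

    def try_issue(self, rd, rs, rt):
        """In-order issue of RD <- Fn(RS, RT). Returns False if the instruction must stall."""
        if self.busy[rs] or self.busy[rt]:
            return False          # RAW: a source operand is still being produced
        if self.busy[rd]:
            return False          # WAW: an older write to RD is still in flight
        self.busy[rd] = True      # dispatch to the FU and mark RD as pending
        return True

    def complete(self, rd):
        """Out-of-order completion: write back GPR[rd] and clear its busy bit."""
        self.busy[rd] = False

sb = SimpleScoreboard()
print(sb.try_issue(rd=1, rs=2, rt=3))   # True: issues and sets SB[1]
print(sb.try_issue(rd=4, rs=1, rt=0))   # False: RAW on R1, so this one stalls
```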
Out-of-Order Issue
• (Diagram: the in-order instruction stream first passes through extra dependency-resolution (DR) stages/buffers, then executes out of program order on INT, Ld/St, Fadd1–2, and Fmul1–3, with out-of-order completion.)
OOO Scoreboarding
• Similar to in-order scoreboarding, but needs new tables to track the status of individual instructions and functional units
• Still enforces dependencies:
  • Stall dispatch on WAW
  • Stall issue on RAW
  • Stall completion on WAR
• Limitations of scoreboarding? Hints:
  • No structural hazards
  • One can always write a RAW-free code sequence: Add R1 = R0 + 1; Add R2 = R0 + 1; Add R3 = R0 + 1; …
  • Think about the x86 ISA with only 8 registers: the finite number of registers in any ISA will force you to reuse register names at some point ⇒ WAR, WAW stalls
Lessons thus Far
• More out-of-orderness ⇒ more ILP exposed, but more hazards
• Stalling is a generic technique to ensure sequencing
• RAW stall is a fundamental requirement (?)
• Compiler analysis and scheduling can help (not covered in this course)