Computer Applications

Elect 707 Computer Applications Chapter 3 Instruction Level Parallelism 2 Dr. Eng. Amr T. Abdel-Hamid Spring 2014 Text book slides: Computer Architecture: A Quantitative Approach 4th Edition, John L. Hennessy & David A. Patterso with modifications.

Getting CPI < 1 • CPI ≥ 1 if issue only 1 instruction every clock cycle • Scalar vs superscalar processors. • Multiple-issue processors come in 3 flavors: • statically-scheduled superscalar processors, • dynamically-scheduled superscalar processors, and • VLIW (very long instruction word) processors • The 2 types of superscalarprocessors issue varying numbers of instructions per clock • use in-order execution if they are statically scheduled, or • out-of-order execution if they are dynamically scheduled

VLIW: Very Large Instruction Word • VLIW processors issue a fixed number of instructions formatted either as one large instruction or as a fixed instruction packet with the parallelism among instructions explicitly indicated by the instruction (Intel/HP Itanium) • Each “instruction” has explicit coding for multiple operations • In IA-64, grouping called a “packet” • In Transmeta, grouping called a “molecule” (with “atoms” as ops)

VLIW: Very Large Instruction Word • Tradeoff instruction space for simple decoding • The long instruction word has room for many operations • By definition, all the operations the compiler puts in the long instruction word are independent => execute in parallel • E.g., 2 integer operations, 2 FP ops, 2 Memory refs, 1 branch • 16 to 24 bits per field => 7*16 or 112 bits to 7*24 or 168 bits wide • Need compiling technique that schedules across several branches

Unrolled Loop that Minimizes Stalls for Scalar 1 Loop: L.D F0,0(R1) 2 L.D F6,-8(R1) 3 L.D F10,-16(R1) 4 L.D F14,-24(R1) 5 ADD.D F4,F0,F2 6 ADD.D F8,F6,F2 7 ADD.D F12,F10,F2 8 ADD.D F16,F14,F2 9 S.D 0(R1),F4 10 S.D -8(R1),F8 11 S.D -16(R1),F12 12 DSUBUI R1,R1,#32 13 BNEZ R1,LOOP 14 S.D 8(R1),F16 ; 8-32 = -24 14 clock cycles, or 3.5 per iteration L.D to ADD.D: 1 Cycle ADD.D to S.D: 2 Cycles

Loop Unrolling in VLIW Memory Memory FP FP Int. op/ Clockreference 1 reference 2 operation 1 op. 2 branch L.D F0,0(R1) L.D F6,-8(R1) 1 L.D F10,-16(R1) L.D F14,-24(R1) 2 L.D F18,-32(R1) L.D F22,-40(R1) ADD.D F4,F0,F2 ADD.D F8,F6,F2 3 L.D F26,-48(R1) ADD.D F12,F10,F2 ADD.D F16,F14,F2 4 ADD.D F20,F18,F2 ADD.D F24,F22,F2 5 S.D 0(R1),F4 S.D -8(R1),F8 ADD.D F28,F26,F2 6 S.D -16(R1),F12 S.D -24(R1),F16 7 S.D -32(R1),F20 S.D -40(R1),F24 DSUBUI R1,R1,#4 8 S.D -0(R1),F28 BNEZ R1,LOOP 9 Unrolled 7 times to avoid delays 7 results in 9 clocks, or 1.3 clocks per iteration (1.8X) Note: Need more registers in VLIW (15 vs. 6 in SS) & much more H/W

Problems with 1st Generation VLIW • Increase in code size • generating enough operations in a straight-line code fragment requires ambitiously unrolling loops • whenever VLIW instructions are not full, unused functional units translate to wasted bits in instruction encoding • Operated in lock-step; no hazard detection HW • a stall in any functional unit pipeline caused entire processor to stall, since all functional units must be kept synchronized • Compiler might prediction function units, but caches hard to predict • Binary code compatibility • Pure VLIW => different numbers of functional units and unit latencies require different versions of the code

Putting it ALL together • Seen how the individual mechanisms of: • dynamic scheduling, • multiple issue, • and speculation • Put all three together which yields a microarchitecture quite similar to those in modern microprocessors. • For simplicity, we consider only an issue rate of two instructions per clock. • Let’s assume we want to extend Tomasulo’s algorithm to support a two-issue superscalar pipeline with a separate integer and floating-point units.

Example: Loop: LD R2,0(R1) ;R2=array element DADDIU R2,R2,#1 ;increment R2 SD R2,0(R1) ;store result DADDIU R1,R1,#8 ;increment pointer BNE R2,R3,LOOP ;branch if not last element

Tomasulo’s algorithm

Tomasulo’s algorithm with Speculation

Limits to ILP • Conflicting studies of amount • Benchmarks (vectorized Fortran FP vs. integer C programs) • Hardware sophistication • Compiler sophistication • How much ILP is available using existing mechanisms with increasing HW budgets? • Do we need to invent new HW/SW mechanisms to keep on processor performance curve? • Intel MMX, SSE (Streaming SIMD Extensions): 64 bit ints • Intel SSE2: 128 bit, including 2 64-bit Fl. Pt. per clock • Motorola AltaVec: 128 bit ints and FPs • Supersparc Multimedia ops, etc.

The Model Processor Assumptions for ideal/perfect machine to start: • Register renaming – infinite virtual registers => all register WAW & WAR hazards are avoided • Branch prediction – perfect; no mispredictions • Jump prediction – all jumps perfectly predicted (returns, case statements) • 2 & 3 Zero control dependencies; perfect speculation & an unbounded buffer of instructions available • Memory-address alias analysis – addresses known & a load can be moved before a store provided addresses not equal; => 1 & 4 eliminates all but RAW Also: • perfect caches; • 1 cycle latency for all instructions (FP *,/); • unlimited instructions issued/clock cycle;

Limits to ILP HW Model comparison

Benchmark programs showed • Three of these benchmarks (fpppp, doduc, and tomcatv) are floating-point intensive, and the other three are integer programs. • Two of the floating-point benchmarks (fpppp and tomcatv) have extensive parallelism, • could be exploited by a vector computer or by a multiprocessor. • The program li is a LISP interpreter that has many short dependences.

Upper Limit to ILP: Ideal Machine(Figure 3.1) FP: 75 - 150 Integer: 18 - 60 Instructions Per Clock

Maximum Issue Count: Instruction Window Size • What the perfect processor must do: • Look arbitrarily far ahead to find a set of instructions to issue, predicting all branches perfectly. • Rename all register uses to avoid WAR and WAW hazards. • Determine whether there are any data dependences among the instructions in the issue packet; if so, rename accordingly. • Determine if any memory dependences exist among the issuing instructions and handle them appropriately. • Provide enough replicated functional units to allow all the ready instructions to issue. • # of comparisons to check for Data Dependency:

Limiting the Instruction Window • Limit window size to n (no longer arbitrary) • Window = number of instructions that are candidates for concurrent execution in a cycle • Window size determines • Instruction storage needed within the pipeline • Maximum issue rate • Number of operand comparisons needed for dependence checking is O(n2) • To try and detect dependences among 2000 instructions would require some 4 million comparisons • Issuing 50 instructions requires 2450 comparisons

More Realistic HW: Window ImpactFigure 3.2 Change from Infinite window 2048, 512, 128, 32 FP: 9 - 150 Integer: 8 - 63 IPC

Branch Impact

More Realistic HW: Branch Impact Perfect Tournament BHT (512) Profile No prediction FP: 15 - 45 Change from Infinite window to examine to 2048 and maximum issue of 64 instructions per clock cycle Integer: 6 - 12 IPC

More Realistic HW: Renaming Register Impact (N int + N fp) Change 2048 instr window, 64 instr issue, 8K 2 level Prediction FP: 11 - 45 Integer: 5 - 15 IPC

Performance beyond single thread ILP • There can be much higher natural parallelism in some applications (e.g., Database or Scientific codes) • Explicit Thread Level Parallelism or Data Level Parallelism • Thread: process with own instructions and data • thread may be a process part of a parallel program of multiple processes, or it may be an independent program • Each thread has all the state (instructions, data, PC, register state, and so on) necessary to allow it to execute • Data Level Parallelism: Perform identical operations on data, and lots of data

Thread Level Parallelism (TLP) • ILP exploits implicit parallel operations within a loop or straight-line code segment • TLP explicitly represented by the use of multiple threads of execution that are inherently parallel • Goal: Use multiple instruction streams to improve • Throughput of computers that run many programs • Execution time of multi-threaded programs • TLP could be more cost-effective to exploit than ILP

New Approach: Multithreaded Execution • Multithreading: multiple threads to share the functional units of 1 processor via overlapping • processor must duplicate independent state of each thread e.g., a separate copy of register file, a separate PC, and for running independent programs, a separate page table • memory shared through the virtual memory mechanisms, which already support multiple processes • HW for fast thread switch; much faster than full process switch  100s to 1000s of clocks • When switch? • Alternate instruction per thread (fine grain) • When a thread is stalled, perhaps for a cache miss, another thread can be executed (coarse grain)

Fine-Grained Multithreading • Switches between threads on each instruction, causing the execution of multiples threads to be interleaved • Usually done in a round-robin fashion, skipping any stalled threads • CPU must be able to switch threads every clock • Advantage is it can hide both short and long stalls, since instructions from other threads executed when one thread stalls • Disadvantage is it slows down execution of individual threads, since a thread ready to execute without stalls will be delayed by instructions from other threads • )

Course-Grained Multithreading • Switches threads only on costly stalls, such as L2 cache misses • Advantages • Relieves need to have very fast thread-switching • Doesn’t slow down thread, since instructions from other threads issued only when the thread encounters a costly stall • Disadvantage is hard to overcome throughput losses from shorter stalls, due to pipeline start-up costs • Since CPU issues instructions from 1 thread, when a stall occurs, the pipeline must be emptied or frozen • New thread must fill pipeline before instructions can complete • Because of this start-up overhead, coarse-grained multithreading is better for reducing penalty of high cost stalls, where pipeline refill << stall time • Used in IBM AS/400

Multithreaded Categories Simultaneous Multithreading Multiprocessing Superscalar Fine-Grained Coarse-Grained Time (processor cycle) Thread 1 Thread 3 Thread 5 Thread 2 Thread 4 Idle slot

For most apps, most execution units lie idle From: Tullsen, Eggers, and Levy, “Simultaneous Multithreading: Maximizing On-chip Parallelism, ISCA 1995.

Do both ILP and TLP? • TLP and ILP exploit two different kinds of parallel structure in a program • Could a processor oriented at ILP to exploit TLP? • functional units are often idle in data path designed for ILP because of either stalls or dependences in the code • Could the TLP be used as a source of independent instructions that might keep the processor busy during stalls? • Could TLP be used to employ the functional units that would otherwise lie idle when insufficient ILP exists?

Simultaneous Multi-threading ... One thread, 8 units Two threads, 8 units Cycle M M FX FX FP FP BR CC Cycle M M FX FX FP FP BR CC M = Load/Store, FX = Fixed Point, FP = Floating Point, BR = Branch, CC = Condition Codes

Simultaneous Multithreading (SMT) • Simultaneous multithreading (SMT): insight that dynamically scheduled processor already has many HW mechanisms to support multithreading • Large set of virtual registers that can be used to hold the register sets of independent threads • Register renaming provides unique register identifiers, so instructions from multiple threads can be mixed in datapath without confusing sources and destinations across threads • Out-of-order completion allows the threads to execute out of order, and get better utilization of the HW • Just adding a per thread renaming table and keeping separate PCs • Independent commitment can be supported by logically keeping a separate reorder buffer for each thread Source: Micrprocessor Report, December 6, 1999 “Compaq Chooses SMT for Alpha”

Multithreaded Pipeline Example • Slide from Joel Emer

Sun Niagara Multithreaded Pipeline

Computer Applications