
Part 9 Instruction Level Parallelism (ILP) - Concurrency



  1. Computer Architecture Slide Sets WS 2012/2013 Prof. Dr. Uwe Brinkschulte Prof. Dr. Klaus Waldschmidt Part 9 Instruction Level Parallelism (ILP) - Concurrency Computer Architecture – Part 9 –page 1 of 84 – Prof. Dr. Uwe Brinkschulte, Prof. Dr. Klaus Waldschmidt

  2. Concurrency: Classical pipelining allows the termination of up to one instruction per clock cycle (scalar execution). A concurrent execution of several instructions in one clock cycle requires the availability of several independent functional units. These functional units are more or less heterogeneous (that means, they are designed and optimized for different functions). Two major concepts of concurrency exist on the ILP level: superscalar concurrency and VLIW concurrency. These concepts can also be found in combination.

  3. Concurrency - superscalar: The superscalar technique operates on a conventional sequential instruction stream. The concurrent instruction issue is performed completely during runtime by hardware. This technique requires a lot of hardware resources, but it allows a very efficient dynamic issue of instructions at runtime. On the downside, no long-running dependency analysis (as is possible e.g. by a compiler) can be performed.

  4. Concurrency - superscalar: The superscalar technique is a pure microarchitecture technique, since it is not visible on the architectural level (conventional sequential instruction stream). Thus, the hardware structure (e.g. the number of parallel execution units) can be changed without changing the architectural specification (e.g. the ISA). Superscalar execution is usually combined with pipelining (superscalar pipeline).

  5. Concurrency - VLIW: The VLIW technique (Very Long Instruction Word) operates on a parallel instruction stream. The concurrent instruction issue is organized statically with the support of the compiler. The consequence is a lower amount of hardware resources. Extensive compiler optimizations are possible to exploit parallelism. On the downside, no dynamic effects can be considered (e.g. branch prediction is difficult in VLIW).

  6. Concurrency - VLIW: VLIW is an architectural technique, since the parallel instruction stream is visible on the architectural level. Therefore, a change in e.g. the level of parallelism leads to a change in the architectural specification. VLIW is usually combined with pipelining. VLIW can also be combined with superscalar concepts, as done e.g. in EPIC (Explicitly Parallel Instruction Computing, Intel Itanium).

  7. Degree of parallelism in ILP: The main question in designing a concurrent computer architecture is: how much instruction-level parallelism (ILP) exists in the code of an application? This question has been analyzed very extensively for the compilation of sequential imperative programming languages into a RISC instruction set. The result of all these analyses is: programs contain a fine-grained parallelism degree of 5-7.

  8. Degree of parallelism in ILP: Higher degrees of parallelism can be obtained only by code with long basic blocks (long instruction sequences without branches). Numerical applications in combination with loop unrolling are an application class with a higher ILP; a further application class is embedded system control. A computer architecture for general-purpose applications designed for an ILP degree higher than 5-7 can suffer from decreasing efficiency because of many idle functional units.

  9. Superscalar technique: Components of a superscalar processor (figure).

  10. Superscalar technique: A superscalar pipeline
  • operates on a sequential instruction stream
  • instructions are collected in an instruction window
  • instruction issue to heterogeneous execution units is done by hardware
  • the microprocessor has several, mostly heterogeneous, functional units in the execution stage of the instruction pipeline
  • instruction processing can be done out of sequential instruction stream order
  • the sequential instruction stream order is finally restored

  11. Superscalar technique: In-order and out-of-order sections in a superscalar pipeline (figure showing the in-order and out-of-order sections of the pipeline).

  12. Instruction fetch
  • Loads several instructions (an instruction block) from the nearest instruction memory (e.g. the instruction cache) to an instruction buffer
  • Usually, as many instructions are fetched per clock cycle as can be issued to the execution units (fetch bandwidth)
  • Control flow conflicts are solved by branch prediction and a branch target address cache
  • The instruction buffer decouples instruction fetch from decode

  13. Instruction fetch
  • Harvard architecture at the cache level
  • Self-modifying code cannot be implemented efficiently on today's superscalar processors
  • The instruction cache (single port) is mostly organized more simply than the data cache (multi port)
  • In case of branches, instructions have to be fetched from different cache blocks
  • Solutions to parallelize this: multi-channel caches, interleaved caches, multiple instruction fetch units, trace cache

  14. Decode
  • Decodes multiple instructions per clock cycle
  • Decode bandwidth usually equal to fetch bandwidth
  • A fixed-length instruction format simplifies decoding of several instructions per clock cycle
  • Variable instruction length => multi-stage decoding
    • first stage: determine instruction boundaries
    • second stage: decode instructions and create one or more microinstructions
    • complex CISC instructions are split into simpler RISC instructions

  15. Register rename
  • Goal of register renaming: remove false dependencies (output dependency, anti dependency)
  • Renaming can be done:
    • statically by the compiler
    • dynamically by hardware
  • Dynamic register renaming:
    • architectural registers are mapped to physical registers
    • each destination register specified in an instruction is mapped to a free physical register
    • following instructions having the same architectural register as a source register will get the last assigned physical register as input operand
    => false dependencies between register operands are removed
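The dynamic renaming scheme described above can be sketched in a few lines of Python. This is an illustrative model, not the slides' hardware: the mapping table is a plain list, and an unlimited supply of physical registers is assumed (no free-list handling).

```python
# Hypothetical sketch of dynamic register renaming: each architectural
# destination register gets a fresh physical register, and later reads of
# that architectural register use the most recent mapping.
def rename(instructions, num_arch_regs):
    """instructions: list of (dest, src1, src2) architectural register ids.
    Returns the instructions rewritten with physical register ids."""
    # initial identity mapping: architectural register i -> physical register i
    mapping = list(range(num_arch_regs))
    next_free = num_arch_regs
    renamed = []
    for dest, src1, src2 in instructions:
        # sources read the last assigned physical register
        p1, p2 = mapping[src1], mapping[src2]
        # the destination gets a fresh physical register (removes WAR/WAW)
        mapping[dest] = next_free
        renamed.append((next_free, p1, p2))
        next_free += 1
    return renamed

prog = [(0, 1, 2),   # I1: r0 = r1 op r2
        (2, 3, 4),   # I2: r2 = r3 op r4  (WAR on r2 with I1)
        (0, 2, 4)]   # I3: r0 = r2 op r4  (WAW on r0, RAW on r2)
print(rename(prog, 8))  # → [(8, 1, 2), (9, 3, 4), (10, 9, 4)]
```

After renaming, I2's write to r2 lands in a fresh physical register, so it no longer conflicts with I1's read, while I3's true dependency on r2 correctly follows the new mapping.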

  16. Register rename
  Two possible implementations:
  • Two different register sets are present:
    • architectural registers store the „valid“ values
    • rename buffer registers store temporary results
    • on renaming, architectural registers are assigned to buffer registers
  • Only one register set of so-called physical registers is present:
    • these store temporary as well as valid values
    • architectural registers are mapped to physical registers
    • the architectural registers themselves are physically non-existent
    • a mapping table defines which physical register currently operates as which architectural register for a given instruction

  17. Register rename: Possible implementation (figure: a mapping table translates logical destination and source registers into physical registers; a multiplexer and a dependency check determine the physical source registers). Mapping has to be done for multiple instructions simultaneously.

  18. Instruction window
  • Decoded instructions are written to the instruction window
  • The instruction window decouples fetch/decode from execution
  • The instructions in the instruction window are
    • free of control flow dependencies due to branch prediction
    • free of false dependencies due to register renaming
  • True dependencies and resource dependencies remain
  • Instruction issue checks in each clock cycle which instructions from the instruction window can be issued to the execution units
  • These are issued up to the maximum issue bandwidth (number of execution units)
  • The original instruction sequence is stored in the reorder buffer

  19. Instruction window and issue terminology
  • Issue means the assignment of instructions to execution units or preceding reservation stations, if present (see e.g. the Tomasulo algorithm)
  • If reservation stations are present, the assignment of instructions from reservation stations to the execution units is called dispatch
  • The instruction issue policy describes the protocol used to select instructions for issuing
  • Depending on the processor, instructions can be issued in-order or out-of-order
  • The lookahead capability determines how many instructions in the instruction window can be inspected to find the next issuable instructions
  • The issuing logic determining executable instructions is often called the scheduler

  20. In-order versus out-of-order issue
  Example:
  I1 w = a - b
  I2 x = c + w   (RAW on w with I1)
  I3 y = d - e
  I4 z = e + y   (RAW on y with I3)
  In-order issue:              Out-of-order issue:
  clock n:   I1                clock n:   I1, I3
  clock n+1: I2, I3            clock n+1: I2, I4
  clock n+2: I4
  Using in-order issue, the scheduler has to wait after I1 (RAW), then I2 and I3 can be issued in parallel (no dependency), finally I4 can be issued (RAW).
  Using out-of-order issue, the scheduler can issue I1 and I3 in parallel (no dependency), followed by I2 and I4 => one clock cycle is saved.
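The two issue policies compared above can be reproduced with a small Python simulation. This is a sketch under simplifying assumptions (results become visible one cycle after issue, issue bandwidth of two, no resource conflicts); the `schedule` helper and the instruction encoding are illustrative, not taken from the slides.

```python
# Sketch of issue scheduling: each instruction is (name, dest, sources).
# An instruction is issuable once all its sources were produced in an
# earlier cycle. The program is assumed to be free of deadlocks.
def schedule(instrs, initial, width=2, in_order=True):
    ready = set(initial)          # operand values available from the start
    pending = list(instrs)
    cycles = []
    while pending:
        issued = []
        blocked = False
        for ins in list(pending):
            if len(issued) == width:
                break
            name, dest, srcs = ins
            if not blocked and all(s in ready for s in srcs):
                issued.append(ins)
                pending.remove(ins)
            elif in_order:
                blocked = True    # in-order: nothing may pass a stalled instruction
        # results become visible only to later cycles (RAW takes one cycle)
        ready.update(dest for _, dest, _ in issued)
        cycles.append([name for name, _, _ in issued])
    return cycles

prog = [("I1", "w", ("a", "b")),
        ("I2", "x", ("c", "w")),   # RAW on w
        ("I3", "y", ("d", "e")),
        ("I4", "z", ("e", "y"))]   # RAW on y
assert schedule(prog, "abcde", in_order=True)  == [["I1"], ["I2", "I3"], ["I4"]]
assert schedule(prog, "abcde", in_order=False) == [["I1", "I3"], ["I2", "I4"]]
```

The two assertions reproduce the slide's schedules: out-of-order issue saves one clock cycle on this sequence.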

  21. False dependencies and out-of-order issue
  Example:
  I1 w = a - b
  I2 x = c + w   (RAW on w with I1; I2 must use the old c)
  I3 c = d - e   (WAR on c with I2)
  I4 z = e + c   (RAW on c with I3; I4 must use the new c)
  Out-of-order issue (wrong: I2 and I4 both use the new c):
  I1 w = a - b,  I3 c = d - e
  I2 x = c + w,  I4 z = e + c
  Out-of-order issue with register rename (correct: I2 uses the old c1, I4 uses the new c2):
  I1 w = a - b,   I3 c2 = d - e
  I2 x = c1 + w,  I4 z = e + c2
  Out-of-order issue makes false dependencies (WAR, WAW) critical; register renaming removes these issues.

  22. Scheduling techniques: There are several possible techniques to determine and issue the next executable instructions, e.g.: associative memory (central solution), the Tomasulo algorithm (decentralized solution), scoreboard (central solution).

  23. Wake up with associative memory: The instructions waiting in the instruction window are marked by so-called tags. The tags of the produced results are compared with the tags of the operands of the waiting instructions. For comparison, each window cell is equipped with comparators, and all comparators work in parallel; this kind of memory is called an associative memory. A hit of a comparison is marked by a ready bit. If the ready bits of an instruction are complete, the instruction is issued. This solves the true dependencies.
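The tag-comparison wake-up described on this slide can be sketched as follows; the parallel comparators of the associative memory are modelled by an ordinary loop, and the class and tag names are illustrative only.

```python
# Sketch of tag-based wake-up: each window entry holds two operand tags and
# ready bits; every result tag broadcast on the result bus is compared
# against all entries (in hardware: all comparators in parallel).
class WindowEntry:
    def __init__(self, name, tag1, tag2):
        self.name = name
        self.tags = [tag1, tag2]
        # None means the operand value is already available
        self.ready = [t is None for t in self.tags]

    def snoop(self, result_tag):
        # comparators: set the ready bit on a tag match
        for i, t in enumerate(self.tags):
            if t == result_tag:
                self.ready[i] = True

    def woken(self):
        # the entry is woken up once all ready bits are set
        return all(self.ready)

window = [WindowEntry("I2", "t1", None),      # waits for the result tagged t1
          WindowEntry("I4", "t1", "t3")]      # waits for t1 and t3
for tag in ["t1", "t3"]:                      # result bus broadcasts
    for entry in window:
        entry.snoop(tag)
print([e.name for e in window if e.woken()])  # → ['I2', 'I4']
```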

  24. Wake up with associative memory (figure).

  25. Priority based issuing of instructions woken up
  • If more instructions are woken up for issuing than execution units are available (issue bandwidth), a priority selection logic is necessary
  • This selection logic determines for each execution unit the instruction to issue from the woken-up instructions
  • Therefore, each execution unit needs such a selection unit
  • This solves the resource dependencies
  • The hardware complexity of the issue unit rises with the size of the instruction window and the number of execution units

  26. Selection logic for a single execution unit (figure).

  27. Tomasulo algorithm: The most well-known principle for instruction parallelism of superscalar processors is the Tomasulo algorithm. This algorithm was first implemented in the IBM System/360 Model 91 by R. Tomasulo. The main assumption of the Tomasulo algorithm is that the semantics of a program are unchanged if the data dependencies are preserved when modifying the sequence of the instructions. The Tomasulo algorithm is based on the dataflow principle: all waiting instructions in the instruction window can be ordered in a dataflow graph. As a consequence, all instructions in one level of the dataflow graph can be issued and executed in parallel, and all dependencies in the dataflow graph can be represented by pointers to the functional units.

  28. Tomasulo algorithm: Therefore the functional units are equipped with additional registers, so-called reservation stations, which store these pointers or the operands themselves. Assigning operands and pointers to the reservation stations (issue) solves the resource dependencies. As soon as all operands and pointers are available, the function is executed (dispatch); this solves the true data dependencies. If all operands are available immediately, issue and dispatch can be done in the same clock cycle, so dispatch usually is not a pipeline stage. In contrast to the associative memory approach, resource dependencies are solved before true data dependencies. For a better distinction of the reservation stations from the registers of the original register file, the registers of the register file are regarded as functional units with the identity operation.
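A reservation station holding either operand values or pointers (tags) to producing units can be sketched like this; the `RS` class, the tag names and the two-operand format are hypothetical, and only the capture/dispatch behaviour is modelled.

```python
# Simplified sketch of a Tomasulo reservation station: each operand slot
# stores either a value or a pointer (tag) to the producing unit; once both
# slots hold values, the instruction can dispatch and execute.
class RS:
    def __init__(self, op, a, b):
        self.op = op
        self.src = [a, b]          # each slot: ("val", x) or ("tag", producer)

    def capture(self, tag, value):
        # result broadcast: replace matching pointers by the produced value
        self.src = [("val", value) if s == ("tag", tag) else s
                    for s in self.src]

    def ready(self):
        # dispatch condition: all operands are values, no pointers left
        return all(kind == "val" for kind, _ in self.src)

    def execute(self):
        ops = {"add": lambda x, y: x + y, "sub": lambda x, y: x - y}
        return ops[self.op](self.src[0][1], self.src[1][1])

# e.g. y = x + z, where x is still being produced by the divide unit ("div")
rs = RS("add", ("tag", "div"), ("val", 5))
assert not rs.ready()              # waiting on the pointer to the div unit
rs.capture("div", 10)              # the divide unit broadcasts its result
assert rs.ready() and rs.execute() == 15
```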

  29. Dataflow graph of instructions in the instruction window (figure: the graph is ordered in levels 0, 1, 2, ...; the nodes are implemented by registers with the identity operation, reservation stations and functional units).

  30. Simple microarchitecture for demonstrating the Tomasulo algorithm (figure: a register unit with registers a, b, c, d, e, f, x, y, z and an execution unit with the functional units sub, add, div and mul, each preceded by reservation stations).

  31. Simple microarchitecture for demonstrating the Tomasulo algorithm (figure: the program below executed in three steps on the microarchitecture of slide 30)
  I1 x = a / b
  I2 y = x + z   (RAW on x with I1)
  I3 z = c × d   (WAR on z with I2)
  I4 x = e - f   (WAW on x with I1)

  32. Execution of the program sequence on the microarchitecture
  First step: instructions I1 - I4 and the available operands are issued to the corresponding reservation stations; the reservation stations of the results are reserved for I1, I2 and I3. The result reservation station for I4 cannot be reserved because it is already occupied by the result of I1.
  Second step: instructions I1 and I3 are dispatched because all operands and result space are available. The result of I1 is transferred to the reservation station where I2 is waiting; therefore, the result reservation station occupied by I1 so far becomes free and is now reserved for I4.
  Third step: instructions I2 and I4 are dispatched now and the results are stored.

  33. Scoreboard (Thornton algorithm): The true data dependencies in a superscalar processor can also be solved solely via the register file. This is the basic idea of scoreboarding, and therefore the principle is very simple. It is a central method within a microarchitecture for controlling the instruction sequence according to the data dependencies. Registers that are in use are marked by a scoreboard bit. A register is marked as in use if it is the destination of an instruction. Only free registers are available for read or write operations. This is a very simple solution for solving data dependencies.

  34. Scoreboard (Thornton algorithm): The scoreboard bit is set at the instruction issue point of the pipeline. It is set at the request for a destination register and is reset after the write back phase. Each instruction is checked for a conflict between its source operands and an "in use" destination register. In case of a conflict, the instruction is delayed until the scoreboard bit is reset. With this simple method, a RAW conflict is solved.
  Register file:           R0 R1 R2 ... Ri ... Rn
  Scoreboard bit vector:    0  0  1 ...  1 ...  0
  The length of the scoreboard bit vector is the same as the length of the register file.
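The scoreboard bit vector and its set/reset protocol can be sketched in a few lines of Python; the class below illustrates the method described above, not any concrete processor.

```python
# Sketch of Thornton-style scoreboarding: one "in use" bit per register.
class Scoreboard:
    def __init__(self, num_regs):
        self.in_use = [False] * num_regs   # the scoreboard bit vector

    def can_issue(self, dest, srcs):
        # conflict check: stall if the destination or any source register
        # is still marked as in use
        return not self.in_use[dest] and not any(self.in_use[s] for s in srcs)

    def issue(self, dest):
        self.in_use[dest] = True           # set at the instruction issue point

    def write_back(self, dest):
        self.in_use[dest] = False          # reset after the write back phase

sb = Scoreboard(8)
sb.issue(2)                      # I1 writes R2 -> R2 marked in use
assert not sb.can_issue(3, [2])  # I2 reads R2 -> RAW conflict, delayed
sb.write_back(2)                 # I1's write back resets the bit
assert sb.can_issue(3, [2])      # I2 may now issue
```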

  35. State graph of the scoreboard method (figure: register Ri moves from state 0, free/unused, to state 1, occupied/in use, when Ri is the address of a destination operand; it moves back to state 0 when the write back to Ri is finished, unless Ri is already the address of another destination operand).

  36. Scoreboard logic (figure: the fields R, S1 and S2 of the instruction word address the scoreboard logic; the SC bit of the destination register R is set in the RF READ stage, the operands (S1) and (S2) are read, and the bit is reset in the RF WRITE stage after the EX stage has produced the result).

  37. Instruction window organization
  • Centralized window, single stage
  • Decentralized windows, single stage
  • Centralized or decentralized windows, two stages

  38. Execution
  • Out-of-order execution of the instructions in mostly parallel execution units
  • Results are stored in the rename buffers or physical registers
  • Execution units can be
    • single cycle units (execution takes a single clock cycle), latency = throughput = 1
    • multiple cycle units (execution takes multiple clock cycles), latency > 1
      • with pipelining (e.g. arithmetic pipeline), throughput = 1
      • without pipelining (e.g. load-/store-unit - possible cache misses), throughput = 1 / latency
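The latency/throughput distinction for multiple cycle units can be made concrete with a small helper: a pipelined unit finishes k independent operations in latency + (k - 1) cycles, a non-pipelined unit in k * latency cycles. The function name is illustrative.

```python
# Cycles to finish k independent operations on a multiple cycle unit.
def cycles(k, latency, pipelined):
    # pipelined: after the pipeline fills, one result per cycle (throughput 1)
    # non-pipelined: throughput = 1 / latency
    return latency + (k - 1) if pipelined else k * latency

assert cycles(4, 3, pipelined=True) == 6    # 3 cycles to fill, then 1/cycle
assert cycles(4, 3, pipelined=False) == 12  # each operation takes 3 cycles
```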

  39. Execution: Load-store units
  • Load and store instructions often take different paths inside the load-store unit (a wait buffer for stores)
  • Store instructions need the address (address calculation) and the value to store, while load instructions only need the address
  • Therefore, load instructions are often moved ahead of store instructions, as long as the same address is not concerned
  (figure: load-store unit with a write buffer holding the address and register content of pending stores)
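The rule that loads may pass buffered stores only when the addresses differ can be sketched as a small write-buffer model. This is an illustrative design: forwarding the value of the youngest matching buffered store directly to the load is one common choice and is assumed here.

```python
# Sketch of a load-store unit with a write buffer: loads bypass buffered
# stores to other addresses; a load from a buffered address gets the value
# forwarded from the youngest matching store.
class LoadStoreUnit:
    def __init__(self, memory):
        self.memory = memory
        self.store_buffer = []              # (address, value), oldest first

    def store(self, addr, value):
        self.store_buffer.append((addr, value))   # store waits in the buffer

    def load(self, addr):
        # check the youngest matching buffered store first
        for a, v in reversed(self.store_buffer):
            if a == addr:
                return v                    # same address: value forwarded
        return self.memory[addr]            # different address: load passes stores

    def drain(self):
        for a, v in self.store_buffer:      # stores reach memory in order
            self.memory[a] = v
        self.store_buffer.clear()

mem = {0: 1, 4: 2}
lsu = LoadStoreUnit(mem)
lsu.store(0, 99)                 # store to address 0 waits in the buffer
assert lsu.load(4) == 2          # load from address 4 passes the store
assert lsu.load(0) == 99         # same address: forwarded, not from memory
```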

  40. Execution: Load-store units
  • A load instruction is completed as soon as the value to load is written to a buffer register
  • A store instruction is completed as soon as the value is written to the cache
    • This cannot be undone!
  • So store instructions on a speculative path (branch prediction) cannot be completed before the speculation is confirmed to be true
  • Speculative load instructions are not a problem

  41. Execution: Multimedia units
  • perform SIMD operations (subword parallelism): the same operation is performed on parts of a register
  • graphics-oriented multimedia operations
    • arithmetic or logic operations on packed datatypes, e.g. eight 8-bit, four 16-bit or two 32-bit partial words
    • pack and unpack operations, mask, conversion and compare operations
  • video-oriented multimedia operations
    • two to four simultaneous 32-bit floating-point operations
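Subword parallelism can be illustrated in software: the sketch below performs four independent 8-bit additions inside one 32-bit word, with carries confined to their lanes, which is what a packed-add multimedia instruction does in a single hardware operation.

```python
# Sketch of subword parallelism: four independent 8-bit additions packed
# into one 32-bit word; carries do not cross lane boundaries.
def simd_add8(x, y):
    result = 0
    for lane in range(4):
        shift = 8 * lane
        a = (x >> shift) & 0xFF
        b = (y >> shift) & 0xFF
        result |= ((a + b) & 0xFF) << shift   # wraparound stays in the lane
    return result

# lanes: 0x10+0x01, 0x20+0x02, 0x30+0x03, 0xFF+0x02 (wraps to 0x01)
assert simd_add8(0xFF302010, 0x02030201) == 0x01332211
```

A plain 32-bit add of the same operands would let the carry from the 0xFF lane spill into the next lane; masking each lane is exactly what the hardware avoids by cutting the carry chain every 8 bits.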

  42. Retire and write back
  Retire and write back is responsible for:
  • committing or discarding the completed results from execution
  • rolling back wrong speculation paths from branch prediction
  • restoring the original sequential instruction order
  • allowing precise interrupts and exceptions

  43. Some wordings: Completion of an instruction
  • The execution unit has finished the execution of the instruction
  • The results are written to temporary buffer registers and are available as operands for data-dependent instructions
  • Completion is done out of order
  • During completion, the position of the instruction in the original instruction sequence and the current completion state are stored in a reorder buffer
  • The completion state might indicate a preceding interrupt/exception or a pending speculation for this instruction

  44. Some wordings: Commitment of an instruction
  • Commitment is done in the original instruction order (in-order)
  • The result of an instruction can be committed, if
    • execution is completed
    • the results of all instructions preceding this instruction in the original instruction order are committed or will be committed within the same clock cycle
    • no interrupt/exception occurred before or during execution
    • the execution no longer depends on any speculation
  • During commitment the results are written permanently to the architectural registers
  • Committed instructions are removed from the reorder buffer

  45. Some wordings: Removal of an instruction
  • The instruction is removed from the reorder buffer without committing it
  • All results of the instruction are discarded
  • This is done e.g. in case of misspeculation or a preceding interrupt/exception
  Retirement of an instruction
  • The instruction is removed from the reorder buffer with or without committing it (commitment or removal)

  46. Interrupts and exceptions
  On an interrupt or exception, the regular program flow is interrupted and an interrupt service routine (exception handler) is called.
  Classes of interrupts/exceptions:
  • Aborts: very fatal, lead to processor shutdown. Reasons: hardware failures like defective memory cells
  • Traps: fatal, normally lead to program termination. Reasons: arithmetic errors (overflow, underflow, division by 0), privilege violation, invalid opcode, …
  • Faults: cause the repetition of the last executed instruction after handling. Reasons: virtual memory management errors like page faults
  • External interrupts: lead to interrupt handling. Reasons: interrupts from external devices to indicate the presence of data, or timer events
  • Software interrupts: lead to interrupt handling. Reason: an interrupt instruction in the program

  47. Interrupts and exceptions: Usually, exceptions like aborts, traps or faults have higher priorities than other interrupts. (figure: program flow for interrupt/exception handling; on an interrupt request the status is saved and the interrupt mask is set, the interrupt routine runs, and on return from interrupt the status is restored and the main program continues)

  48. Precise interrupts and exceptions: An interrupt or exception is called precise if the processor state saved at the start of the interrupt routine is identical to a sequential in-order execution on a von Neumann architecture. For out-of-order execution on a superscalar processor this means:
  • all instructions preceding the interrupt-causing instruction are committed and therefore have modified the processor state
  • all instructions succeeding the interrupt-causing instruction are removed and therefore have not influenced the processor state
  • depending on the interrupt-causing instruction, it is either committed or removed

  49. Reorder buffer: The reorder buffer stores the sequential order of the issued instructions and therefore allows result serialization during retirement. The reorder bandwidth is usually identical to the issue bandwidth.
  Possible reorder buffer organizations:
  • contains instruction states only
  • contains instruction states and results (combination of reorder buffer and rename buffer registers)
  Alternative reorder techniques:
  • checkpoint repair
  • history buffer

  50. Reorder buffer: The reorder buffer can be implemented as a ring buffer. Consecutive completed and non-speculative instructions at the head of the ring buffer can be committed. (figure: ring buffer with entries I1 to I6 between head and tail; I1 is completed and can be committed, the following entries are either completed but based on speculation, issued only, or empty slots)
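The in-order retirement rule of the reorder buffer can be sketched as follows; the model uses a deque instead of a hardware ring buffer, and the entry fields are illustrative.

```python
# Sketch of in-order retirement: only consecutive completed, non-speculative
# entries at the head of the reorder buffer may commit.
from collections import deque

class ReorderBuffer:
    def __init__(self):
        self.entries = deque()              # kept in original program order

    def issue(self, name):
        self.entries.append({"name": name, "done": False, "spec": False})

    def complete(self, name, speculative=False):
        # completion is out of order; record result state and speculation flag
        for e in self.entries:
            if e["name"] == name:
                e["done"], e["spec"] = True, speculative

    def commit(self):
        committed = []
        # pop from the head only while entries are completed and confirmed
        while (self.entries and self.entries[0]["done"]
               and not self.entries[0]["spec"]):
            committed.append(self.entries.popleft()["name"])
        return committed

rob = ReorderBuffer()
for i in ["I1", "I2", "I3"]:
    rob.issue(i)
rob.complete("I1")
rob.complete("I3")                   # completed out of order
assert rob.commit() == ["I1"]        # I2 blocks I3 at the head
rob.complete("I2")
assert rob.commit() == ["I2", "I3"]  # now the remaining entries drain in order
```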
