Part 8 Instruction Level Parallelism (ILP) - Pipelining

Computer Architecture Slide Sets WS 2012/2013 Prof. Dr. Uwe Brinkschulte M.Sc. Benjamin Betting Part 8 Instruction Level Parallelism (ILP) - Pipelining Computer Architecture – Part 8 –page 1 of 75 – Prof. Dr. Uwe Brinkschulte, M.Sc. Benjamin Betting

Parallel Computing
• Instruction-Level Parallelism: Pipelining, Superscalar, VLIW, EPIC
• Thread- and Task-Level Parallelism: Multithreading, Multiprocessing, Multi-Cores, Cluster of Computers, Cloud- and Grid-Computing

Architectures with instruction level parallelism (ILP)
Pipelining vs. concurrency

Most computer architectures are still based on the well-known von Neumann or Harvard principle, which relies on sequential operation. In modern high-performance processors, this sequential operation mode is extended by instruction level parallelism (ILP).

ILP can be implemented by two modes of parallelism:
• Parallelism in time (pipelining)
• Parallelism in space (concurrency)

Pipelining vs. concurrency

Together with technological improvements, these two techniques of parallelism are key to high performance.
• Parallelism in time (pipelining) means that the execution of instructions is overlapped in time by partitioning the instruction cycle.
• Parallelism in space (concurrency) means that more than one instruction is executed in parallel, either in order or out of order.

Modern microprocessors combine both techniques. Together they define the instruction level parallelism that yields better performance.

Pipelining vs. concurrency

Parallelism in time relies on the assembly line principle, which is also well established in automotive production. It can be combined effectively with concurrency. In computer architecture, an assembly line is called a pipeline.

[Figure: two panels, pipelining and concurrency, each showing instructions 1 to 3 over time t; labels: cycle, stage.]

Pipelining vs. concurrency

"Pipelines accelerate execution speed in the same way like Henry Ford revolutionized car manufacturing with the introduction of the assembly line" (Peter Wayner, 1992)

Pipelining means the fragmentation of a machine instruction into several partial operations. These partial operations are executed by partial processing units in a sequential and synchronized manner. Every processing unit executes only one specific partial operation. Together, the partial processing units form a pipeline.

Fragmentation of the instruction cycle

Possible fragmentation into 5 stages:

1. instruction fetch
The instruction addressed by the program counter is loaded from main memory or a cache into the instruction register. The program counter is incremented.

2. instruction decode
Internal control signals are generated according to the instruction's opcode and addressing modes.

3. operand fetch
The operands are provided by registers or functional units.

Fragmentation of the instruction cycle

4. execute
The operation is executed with the operands.

5. write back
The result is written into a register or bypassed to serve as an operand for a succeeding operation.

Depending on the instruction or instruction class, some stages may be skipped. The entirety of the stages is called the instruction cycle.

Instruction pipelining

In the first stage, the fetch unit accesses the instruction. The fetched instruction is passed to the instruction decode unit. While this second unit processes the instruction, the first unit already fetches the next instruction.

In the best case, an n-stage pipeline executes n instructions in parallel, each in a different stage of its execution. Once the pipeline is filled, one instruction finishes every clock cycle.

A processor capable of finishing one instruction per clock cycle is called a scalar processor.

Instruction pipelining

clock cycle       1    2    3    4    5    6    7
1. instruction    IF   ID   OF   EX   WB
2. instruction         IF   ID   OF   EX   WB
3. instruction              IF   ID   OF   EX   WB

IF = instruction fetch, ID = instruction decode, OF = operand fetch, EX = execute, WB = write back

Computer Architecture – Part 8 – page 10 of 75 – Prof. Dr. Uwe Brinkschulte, Prof. Dr. Klaus Waldschmidt

Pipeline design principles

• Pipeline stages are linked by registers.
• The instruction and the intermediate result are forwarded to the next pipeline register every clock cycle (in special cases every half clock cycle).
• A pipeline is only as fast as its slowest stage. An important issue in pipeline design is therefore to ensure that the stages consume equivalent amounts of time.
• A high number of pipeline stages (often called a superpipeline) leads to short clock cycles and a higher speedup.
• But a stall of a long pipeline, e.g. due to a control flow dependency, results in long wait times until the pipeline can be refilled.

Thus, the designer faces a real trade-off.

Basic pipeline measures

Pipelining belongs to the class of fine-grain parallelism. It takes place at the microarchitectural level.

Definitions:
• An operation is the application of a function F to operands. An operation produces a result.
• An operation can be made up of a set of partial operations f1 ... fp (in most cases p = k, the number of pipeline stages). It is assumed that the partial operations are applied in sequential order.
• An instruction defines through its format the function, operands and result.

A k-stage pipeline executes n operations of F in

tp(n,k) = k + (n - 1) cycles

• k cycles to execute the first instruction (fill pipeline)
• n - 1 cycles to execute the remaining n - 1 instructions

Pipeline operation

Example: tp(10,5) = 5 + (10 - 1) = 14 cycles

[Figure: occupancy of stages 1 to 5 over time t, with the phases start-up (fill), processing and drain.]
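
The cycle count and the fill, processing and drain phases can be reproduced with a few lines of Python. This is a minimal sketch of an ideal pipeline without stalls; the function names `pipeline_cycles` and `stage_occupancy` are chosen here for illustration and do not come from the slides.

```python
from typing import List, Optional


def pipeline_cycles(n: int, k: int) -> int:
    """Cycles a k-stage pipeline needs for n operations: tp(n,k) = k + (n - 1)."""
    if n < 1 or k < 1:
        raise ValueError("n and k must be positive")
    return k + (n - 1)


def stage_occupancy(n: int, k: int) -> List[List[Optional[int]]]:
    """table[c][s] is the operation (0-based) in stage s during cycle c, or None if idle."""
    table = []
    for cycle in range(pipeline_cycles(n, k)):
        row = []
        for stage in range(k):
            op = cycle - stage  # operation `op` enters stage 0 in cycle `op`
            row.append(op if 0 <= op < n else None)
        table.append(row)
    return table


if __name__ == "__main__":
    print(pipeline_cycles(10, 5))  # 14, as in the example above
    # Small case (4 operations, 3 stages) to make fill and drain visible
    for cycle, row in enumerate(stage_occupancy(4, 3)):
        print(cycle, ["-" if op is None else f"op{op}" for op in row])
```

Rows with idle entries on the right belong to the fill phase, rows with idle entries on the left to the drain phase.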

Basic pipeline measures

Two basic measures are the pipeline throughput and the pipeline speedup.

In the best case, where a large number of linearly succeeding operations is executed, the pipeline speedup converges to the number of pipeline stages.
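
One common way to write both measures in terms of tp(n,k) = k + (n - 1) is sketched here. It assumes, as a simplification not stated on the slide, that the unpipelined execution of one operation takes k cycles and that throughput is counted in operations per cycle. The symbols T and S are chosen for illustration.

```latex
\begin{align*}
T(n,k) &= \frac{n}{t_p(n,k)} = \frac{n}{k + (n-1)} \\
S(n,k) &= \frac{n \cdot k}{t_p(n,k)} = \frac{n \cdot k}{k + (n-1)},
\qquad \lim_{n \to \infty} S(n,k) = k
\end{align*}
```

For n = 10 and k = 5 this gives S = 50/14 ≈ 3.57, still well below the limit of 5.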

Basic pipeline measures

Pipeline efficiency reaches 1 (peak performance) if an infinite operation stream without bubbles or stalls is executed. This is, of course, only a best-case analysis.

Practical evaluation uses the Hockney numbers:
• n∞: pipeline peak performance at an infinite number of operations
• n½: number of operations at which the pipeline reaches half of its peak performance
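
Under the idealized model tp(n,k) = k + (n - 1), efficiency and n½ can be worked out as follows. This is a sketch that assumes efficiency is defined as speedup per stage and that an unpipelined operation takes k cycles; neither assumption is stated on the slide, and the symbol E is chosen for illustration.

```latex
\begin{align*}
E(n,k) &= \frac{n}{k + (n-1)}, \qquad \lim_{n \to \infty} E(n,k) = 1 \\
E(n_{1/2},k) = \tfrac{1}{2}
&\;\Longrightarrow\; 2\,n_{1/2} = k + n_{1/2} - 1
\;\Longrightarrow\; n_{1/2} = k - 1
\end{align*}
```

For a 5-stage pipeline this model gives n½ = 4: with 4 operations the pipeline needs 8 cycles, i.e. half of its peak rate.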

Pipeline stages

[Figure: a function F with input "instructions and operands" and output "results" is split into the stages f1, f2, f3, ..., fk.]

Stages are separated by registers.

Partitioning of an operation F

[Figure: operation F with time tf; partitioning into f1 and f2 with time tf/2 each; alternatively two units F with time tf each operated in parallel.]

If partitioning an operation is impossible, F can also be applied in parallel and overlapped over two clock cycles.

Operation example for partitioning F

[Figure: operations i, i+1, i+2, i+3 passing through both variants (partitioning into f1 and f2, and two parallel units F) over the time steps t to t+5.]

Balancing pipeline suboperations

If tfi = max(tf1 ... tfk) determines the clock frequency in an unbalanced pipeline (tfi >> tf1, ..., tfi >> tfk), fi should be partitioned further for better performance.

[Figure: unbalanced pipeline f1, f2, f3 with f1 << f2 and f2 >> f3; version 1 splits f2 into f2a, f2b and f2c; version 2 uses several f2 units side by side.]

Overall execution time, clock frequency

Register delays:
• tpd = propagation delay time
• tsu = set-up time

Clock period:
cp = max(tfi) + tpd + tsu

Overall pipelined execution time of an operation F:
t(F) = (max(tfi) + tpd + tsu) • k
     = k • max(tfi) + k • (tpd + tsu)

Here max(tfi) + tpd + tsu corresponds to the clock period, k is the number of stages, max(tfi) is the maximum processing time of a suboperation, and tpd + tsu is the register delay.
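
The effect of balancing on the clock period can be checked numerically. The sketch below evaluates the two formulas for an unbalanced and a balanced pipeline; the stage times and register delays (in picoseconds) are made-up values for illustration only, and the function names are not from the slides.

```python
def clock_period(stage_times, t_pd, t_su):
    """cp = max(tfi) + tpd + tsu: the slowest stage plus the register delays."""
    return max(stage_times) + t_pd + t_su


def pipelined_latency(stage_times, t_pd, t_su):
    """t(F) = cp * k, where k is the number of stages."""
    return clock_period(stage_times, t_pd, t_su) * len(stage_times)


if __name__ == "__main__":
    T_PD, T_SU = 50, 50  # hypothetical register delays in ps

    unbalanced = [200, 600, 200]          # f2 dominates the clock period
    balanced = [200, 200, 200, 200, 200]  # f2 split into three 200 ps stages

    print(clock_period(unbalanced, T_PD, T_SU))       # 700
    print(pipelined_latency(unbalanced, T_PD, T_SU))  # 2100
    print(clock_period(balanced, T_PD, T_SU))         # 300
    print(pipelined_latency(balanced, T_PD, T_SU))    # 1500
```

In this made-up case, splitting the slow stage shortens the clock period from 700 ps to 300 ps, although every additional stage adds another register delay tpd + tsu to t(F).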

Architecture of a linear 5-stage pipeline with registers

[Figure: block diagram of the five stages IF, ID, OF, EX and WB with the ALU and the components listed below.]

Stages:
IF = instruction fetch
ID = instruction decode
OF = operand fetch
EX = execute
WB = write back

Components:
IC = instruction cache
DC = data cache
IR = instruction register
CR = control register
RF = register file, e.g. 3-port register file
DE = decoder (control unit)
OR = operand register
PC = program counter

Pipeline hazards

• So far, we have assumed a smooth flow of operations through the pipeline
• However, several effects can cause stalls in pipelined operation
• These effects are called pipeline hazards
• Pipeline hazards can be caused by
  • dataflow dependencies
  • resource dependencies
  • control flow dependencies

Dataflow dependencies

Pipelined processors have to consider 3 classes of dataflow dependencies. The same dependencies have to be considered in concurrency.

1. true dependency: read after write (RAW)

destination (i) = source (i + 1)

X := A + B    instruction i
Y := X + B    instruction i + 1

X has to be written by instruction i before it is read by the succeeding instruction. A hazard occurs if the distance between the two instructions is smaller than the number of pipeline stages. In this case X would be read before it has been written.

Dataflow dependencies

2. anti dependency: write after read (WAR)

source (i) = destination (i + 1)

X := Y + B    instruction i
Y := A + C    instruction i + 1

Y has to be read by instruction i before it is written by the succeeding instruction. A hazard occurs if the order of the instructions is changed in the pipeline.

Dataflow dependencies

3. output dependency: write after write (WAW)

destination (i) = destination (i + 1)

Y := A / B    instruction i
Y := C + D    instruction i + 1

Both instructions write their results into the same register. A hazard occurs if the order of the instructions is changed in the pipeline.

Example of a short assembler program containing a true dependency, anti dependencies and an output dependency:

I1  ADD  R1,2,R2    ; R1 = R2+2
I2  ADD  R4,R3,R1   ; R4 = R1+R3
I3  MULT R3,3,R5    ; R3 = R5·3
I4  MULT R3,3,R6    ; R3 = R6·3

Dependency graph:
• I1 → I2: true dependency (R1)
• I2 → I3: anti dependency (R3)
• I2 → I4: anti dependency (R3)
• I3 → I4: output dependency (R3)
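
The three dependency classes can be found mechanically by comparing destination and source registers. The following Python sketch applies the three definitions to the program above. It is a naive pairwise check under simplifying assumptions: immediate operands are left out, and it does not account for a register being redefined between two instructions. The `Instr` type and the function name are illustrative only.

```python
from typing import List, NamedTuple, Tuple


class Instr(NamedTuple):
    name: str
    dest: str
    sources: Tuple[str, ...]


def dependencies(program: List[Instr]) -> List[Tuple[str, str, str, str]]:
    """Return (earlier, later, kind, register) for every dependent pair."""
    found = []
    for a_idx, a in enumerate(program):
        for b in program[a_idx + 1:]:
            if a.dest in b.sources:   # destination(a) = source(b)
                found.append((a.name, b.name, "true (RAW)", a.dest))
            if b.dest in a.sources:   # source(a) = destination(b)
                found.append((a.name, b.name, "anti (WAR)", b.dest))
            if a.dest == b.dest:      # destination(a) = destination(b)
                found.append((a.name, b.name, "output (WAW)", a.dest))
    return found


PROGRAM = [
    Instr("I1", "R1", ("R2",)),        # R1 = R2+2
    Instr("I2", "R4", ("R3", "R1")),   # R4 = R1+R3
    Instr("I3", "R3", ("R5",)),        # R3 = R5*3
    Instr("I4", "R3", ("R6",)),        # R3 = R6*3
]

if __name__ == "__main__":
    for dep in dependencies(PROGRAM):
        print(dep)
```

For the four instructions above, the sketch reports one true dependency, two anti dependencies and one output dependency.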

Example of a true dependency hazard (RAW) in a 5-stage pipeline

i:   X := A op B
i+1: Y := X op C

time   fetch   decode   read              execute            write
t      i
t+1    i+1     i
t+2            i+1      i read A, B
t+3                     i+1 read X, C     i X := A op B
t+4                                       i+1 Y := X op C    i write X
t+5                                                          i+1 write Y

The issue check takes place at the issue point. RAW hazard: i+1 reads X in cycle t+3, but i writes X only in cycle t+4.

Solutions for true dependency hazards

• Software solutions:
  • Inserting NOOP instructions
  • Reordering instructions
• Hardware solutions:
  • Pipeline interlocking
  • Forwarding

Any combination of these solutions is possible as well.

Solving a true dependency hazard by inserting NOOPs

time   fetch   decode   read              execute            write
t      i
t+1    NOOP    i
t+2    NOOP    NOOP     i read A, B
t+3    i+1     NOOP     NOOP              i X := A op B
t+4            i+1      NOOP              NOOP               i write X
t+5                     i+1 read X, C     NOOP               NOOP
t+6                                       i+1 Y := X op C    NOOP
t+7                                                          i+1 write Y

The NOOPs are inserted by the compiler or programmer. The RAW hazard is eliminated through the insertion of NOOPs (bubbles) into the pipeline. This was the solution used in the first RISC processors.
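
What a compiler does here can be sketched in a few lines of Python. The sketch assumes that two delay slots are needed between a producer and its consumer, as with the two NOOPs on the slide above; the instruction encoding as `(dest, sources)` tuples and the name `insert_noops` are illustrative assumptions.

```python
def insert_noops(program, delay_slots=2):
    """Insert "NOOP" entries so that each consumer is at least `delay_slots`
    instructions behind the producer of any of its source operands.

    program: list of (dest, sources) tuples in program order.
    """
    out = []
    for dest, sources in program:
        # Only the last `delay_slots` emitted entries can still be too close.
        window = out[-delay_slots:] if delay_slots > 0 else []
        needed = 0
        for distance, prev in enumerate(reversed(window), start=1):
            if prev != "NOOP" and prev[0] in sources:
                # `distance - 1` entries already separate producer and consumer
                needed = max(needed, delay_slots - distance + 1)
        out.extend(["NOOP"] * needed)
        out.append((dest, sources))
    return out


if __name__ == "__main__":
    prog = [("X", ("A", "B")), ("Y", ("X", "C"))]
    for entry in insert_noops(prog):
        print(entry)
```

If enough independent instructions already sit between producer and consumer, the function inserts nothing, which is the idea behind reordering instructions instead of padding with NOOPs.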

Solving a true dependency hazard by reordering instructions

Sometimes, instead of inserting NOOPs, instructions can be reordered to have the same effect. To do so, instructions that have no true dependencies and do not change the control flow are placed between the conflicting instructions.

Example:

With NOOPs:          Reordered:
X := A op B          X := A op B
NOOP                 Z := D op E
NOOP                 F := INP(0)
Y := X op C          Y := X op C
Z := D op E
F := INP(0)

Solving a true dependency hazard by pipeline interlocking

Pipeline interlocking means that the hardware delays pipeline processing until the conflict is resolved. This relieves the compiler or programmer of the task. The original MIPS design took the opposite approach: its name stands for "Microprocessor without Interlocked Pipeline Stages", and hazards were resolved by the compiler.

time   fetch   decode              read              execute            write
t      i
t+1    i+1     i
t+2            i+1                 i read A, B
t+3            i+1 (interlocked)                     i X := A op B
t+4            i+1 (interlocked)                                        i write X
t+5                                i+1 read X, C
t+6                                                  i+1 Y := X op C
t+7                                                                     i+1 write Y

The issue check takes place at the issue point. Instruction i+1 is held back by interlocking until i has written X.

Forwarding

Forwarding is a simple hardware technique to save one delay slot (NOOP). An operand X needed for instruction i + 1 is forwarded directly from the output of the ALU to its input. The register file is bypassed. If more than one delay slot is necessary, forwarding is combined with interlocking or NOOP insertion.

The data forwarding path can also be used to provide operands of a waiting instruction from the cache. This shortens the delay slot between a load and an execute instruction using this operand. Data cache access is sped up considerably by this technique.

Load and result forwarding

[Figure: data path cache memory → register → ALU with two bypasses, one for load forwarding and one for result forwarding.]

Hardware realization of the forward path

[Figure: RF read delivers the operands (S1) and (S2) to EX, whose result (R) goes to RF write; a forward control switches the data forwarding path (result forwarding) and the load data path (load forwarding).]

time   fetch   decode   read              execute            write
t      i
t+1    i+1     i
t+2            i+1      i read A, B
t+3            i+1                        i X := A op B
t+4                     i+1 read X, C                        i write X
t+5                                       i+1 Y := X op C
t+6                                                          i+1 write Y

The issue check for i+1 takes place at the issue point. In cycle t+3, one NOOP or interlocking cycle remains. In cycle t+4, data forwarding delivers X to i+1 while i writes X.

Anti- and output-dependency hazards (false dependencies)

An output dependency hazard may occur if an instruction i needs more time units to execute than instruction i + 1. Of course, this is only possible if the processor consists of several processing units with different numbers of stages.

Anti-dependency hazards only occur if the order of instructions is changed in the pipeline. This never happens in ordinary scalar pipelines, but it can happen in superscalar pipelines.

Output dependency hazard (regarding only 3 stages of the 5-stage pipeline)

[Figure: RF read feeds the functional units FU 1 and FU 2 in parallel; both feed RF write.]

time   read               execute FU 1     execute FU 2     write
t      i read A, B
t+1    i+1 read C, D      i 1. A op B
t+2                       i 2. A op B      i+1 C op D
t+3                       i 3. A op B                       i+1 write Y
t+4                                                         i write Y

Instruction i is issued at t, instruction i+1 at t+1. Because FU 1 needs three cycles and FU 2 only one, i+1 writes Y before i does.
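
The condition for this hazard can be expressed as a simple check on write-back times. The sketch below assumes the timing of the slide above: an instruction reads its operands in its issue cycle, executes for a fixed number of cycles and writes in the cycle after that. The tuple layout and the name `waw_hazards` are illustrative assumptions.

```python
def waw_hazards(instrs):
    """Find output dependency hazards.

    instrs: list of (name, issue_cycle, execute_latency, dest) in program order.
    Returns (earlier, later, register) for every pair where the later
    instruction writes the same register no later than the earlier one.
    """
    writes = [(issue + latency + 1, name, dest)
              for name, issue, latency, dest in instrs]
    hazards = []
    for a in range(len(writes)):
        for b in range(a + 1, len(writes)):
            cycle_a, name_a, dest_a = writes[a]
            cycle_b, name_b, dest_b = writes[b]
            if dest_a == dest_b and cycle_b <= cycle_a:
                hazards.append((name_a, name_b, dest_a))
    return hazards


if __name__ == "__main__":
    # i runs 3 cycles on FU 1, i+1 runs 1 cycle on FU 2
    print(waw_hazards([("i", 0, 3, "Y"), ("i+1", 1, 1, "Y")]))
    # with equal latencies the writes stay in program order
    print(waw_hazards([("i", 0, 1, "Y"), ("i+1", 1, 1, "Y")]))
```

With equal latencies the list stays empty, which matches the statement that the hazard requires processing units with different numbers of stages.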

Removing false dependencies

False dependencies can always be removed by register renaming. This can be done by hardware or by the compiler, so the hazard never occurs.

Example:

Anti dependency:     Output dependency:
X := Y op B          Y := A op B
Y := A op C          Y := C op D

Renaming the second Y to Z:

X := Y op B          Y := A op B
Z := A op C          Z := C op D
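
A very simple renaming scheme can be sketched in Python. Unlike the slide, which renames only the second Y, this sketch gives every destination a fresh name, which is one straightforward way to make all anti and output dependencies disappear while true dependencies are preserved. The names `P0`, `P1`, ... and the function `rename` are illustrative assumptions, and the pool of fresh names is treated as unlimited.

```python
from itertools import count


def rename(program):
    """Give every destination a fresh register name.

    program: list of (dest, sources) tuples in program order.
    Sources are rewritten to the latest name of the value they read,
    so true (RAW) dependencies are preserved.
    """
    fresh = (f"P{n}" for n in count())
    current = {}  # original register name -> latest fresh name
    renamed = []
    for dest, sources in program:
        new_sources = tuple(current.get(s, s) for s in sources)
        current[dest] = next(fresh)
        renamed.append((current[dest], new_sources))
    return renamed


if __name__ == "__main__":
    print(rename([("X", ("Y", "B")), ("Y", ("A", "C"))]))  # anti dependency on Y
    print(rename([("Y", ("A", "B")), ("Y", ("C", "D"))]))  # output dependency on Y
    print(rename([("X", ("A", "B")), ("Y", ("X", "C"))]))  # true dependency on X
```

In the third case the second instruction still reads the renamed result of the first one, so the true dependency remains, as it must.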

Resource dependencies

• Resource dependencies can be classified into:
  • intra-pipeline dependencies
  • instruction class dependencies

An intra-pipeline dependency occurs if instructions in two different pipeline stages need the same pipeline resource. The succeeding instruction (and the instructions following it) have to be delayed until the resource becomes available.

This happens e.g. if the common register file lacks a sufficient number of ports or if some instructions need more than one clock cycle to pass through a particular pipeline resource.

Examples: a register file with a common read/write port (possible conflict of the read in stage 3 with the write in stage 5) or a multi-cycle division unit in the execute stage.

Resource dependencies

An instruction class dependency occurs if two or more instructions in the same pipeline stage need a pipeline resource that exists only once. This never happens in a scalar pipeline.

Superscalar processors with several execution units often face this sort of conflict. A two-way superscalar processor may issue two instructions to two execution units simultaneously. If both instructions need the same execution unit, which exists only once, an instruction class dependency arises.

Control flow dependencies

Every change in control flow is a potential source of conflict. Several instruction classes cause changes in control flow:
• conditional branch
• jump
• jump to subroutine, return from subroutine

The control flow target is not yet available when the next instruction is to be fetched. Conditional branches in particular cause severe conflicts: the next instruction to issue depends on the evaluation of the condition, which is usually finished only in the last pipeline stages.

Control flow hazards

Example of a control flow hazard due to a conditional branch

[Figure: CMP and the following BRANCH COND pass through the stages IF, ID, OF, EX, WB; BRANCH COND depends on the condition code of CMP, and NEXT CORRECT I can only be fetched once the branch is resolved.]

Solutions for control flow hazards

• Software solutions:
  • Inserting NOOP instructions
  • Reordering instructions
• Hardware solutions:
  • Pipeline interlocking
  • Forwarding
  • Fast compare and jump logic
  • Branch prediction

Solution: interlocking or NOOP insertion

[Figure: pipeline diagram over the stages IF, ID, OF, EX, WB. Two delay slots (DELAY SLOT1, DELAY SLOT2) separate CMP from BRANCH COND, which needs the condition code; further NOOP or interlocking cycles follow before NEXT CORRECT I and NEXT+1 CORRECT I enter the pipeline.]

Penalty: 6 cycles

Reducing penalty by forwarding the comparison result

[Figure: pipeline diagram over the stages IF, ID, OF, EX, WB. The condition code is forwarded from CMP to BRANCH COND, so only DELAY SLOT2 remains between them; NOOP or interlocking cycles still precede NEXT CORRECT I and NEXT+1 CORRECT I.]

Penalty: 4 cycles

Reducing penalty by forwarding the next correct instruction address

[Figure: pipeline diagram over the stages IF, ID, OF, EX, WB. In addition to the condition code, the address of the next correct instruction is forwarded; DELAY SLOT2 and NOOP or interlocking cycles remain before NEXT CORRECT I and NEXT+1 CORRECT I.]

Penalty: 3 cycles

Reducing penalty by fast compare and jump logic

[Figure: pipeline diagram over the stages IF, ID, OF, EX, WB. A fast compare logic delivers the comparison result (condition code) to a fast jump logic; DELAY SLOT2 and one NOOP or interlocking cycle remain before NEXT CORRECT I and NEXT+1 CORRECT I.]

Penalty: 2 cycles

Reducing penalty by fast compare and jump logic

Special logic for compare and jump instructions can reduce the penalty by one cycle. These circuits can be much faster than a more general execution unit (ALU), so that comparison and jump can be completed in one clock cycle.

The fast compare logic can be faster because normally only simple comparisons such as equal, unequal, <0, >0, ≤0, ≥0, =0 are needed.

Reducing penalty by fast compare and jump logic + reordering instructions

The remaining 2 NOOPs or interlocking cycles can be replaced by reordering the code: two independent instructions can be moved after the branch instruction (delayed branch).

Example:

With NOOPs:                      Reordered (delayed branch):
Z := D op E                      CMP
F := INP(0)                      BRANCH COND
CMP                              Z := D op E
BRANCH COND                      F := INP(0)
NOOP                             NEXT INSTR (COND = FALSE)
NOOP                             . . .
NEXT INSTR (COND = FALSE)        NEXT INSTR (COND = TRUE)
. . .
NEXT INSTR (COND = TRUE)

Branch prediction

Another way of avoiding control flow hazards is branch prediction. Here, the outcome of the branch (taken or not taken) is predicted before the result of the comparison is known. If the prediction is correct, the penalty can be reduced to as little as 0.

First, let's assume we have a perfectly working branch predictor.

Reducing penalty by branch prediction

[Figure: pipeline diagram over the stages IF, ID, OF, EX, WB with a branch predictor that delivers the prediction result (taken or not taken) and the next address; CMP and BRANCH COND are followed by NEXT CORRECT I and NEXT+1 CORRECT I.]

Penalty: still 2 cycles
