
CSC 4250 Computer Architectures



Presentation Transcript


  1. CSC 4250 Computer Architectures. September 15, 2006. Appendix A. Pipelining

  2. What is Pipelining? • An implementation technique whereby multiple instructions are overlapped in execution • Pipelining exploits parallelism among the instructions in a sequential instruction stream • Recall the formula: CPU time = IC × CPI × cct (instruction count × clocks per instruction × clock cycle time) • Pipelining yields a reduction in the average execution time per instruction; i.e., it decreases the CPI
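The CPU time formula can be checked with a short calculation. The instruction count and CPI values below are made-up illustrative numbers, not figures from the lecture:

```python
def cpu_time_ns(ic, cpi, cct_ns):
    # CPU time = instruction count x clocks per instruction x clock cycle time
    return ic * cpi * cct_ns

# Hypothetical program: same instruction count and clock, CPI cut from 5 to 1.
before = cpu_time_ns(1_000_000, 5, 1.0)
after = cpu_time_ns(1_000_000, 1, 1.0)
print(before / after)  # 5.0
```

With IC and cct held fixed, the speedup is exactly the ratio of the CPIs, which is why pipelining's CPI reduction translates directly into faster programs.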

  3. RISC Architectures • Reduced Instruction Set Computer • All operations on data apply to data in registers • The only operations that affect memory are loads and stores, which move data from memory to a register or from a register to memory, respectively • Instruction formats are few in number, with all instructions typically the same size

  4. Three Classes of Instructions We consider • ALU instructions • Load and store instructions • Branches (no jumps)

  5. ALU Instructions • Take either two registers or a register and a sign-extended immediate, operate on them, and store the result into a third register: • DADD R1,R2,R3 (fields: Opcode | R2 (rs) | R3 (rt) | R1 (rd) | shamt | opx) Reg[R1] ← Reg[R2] + Reg[R3] • DADDI R1,R2,#3 (fields: Opcode | R2 (rs) | R1 (rt) | Immediate) Reg[R1] ← Reg[R2] + 3

  6. Load and Store Instructions • Take a register source (base register) and an immediate field (offset). Their sum (the effective address) is the memory address. A second register is the destination (load) or source (store) of the data. • LD R2,30(R1) (fields: Opcode | R1 (rs) | R2 (rt) | Immediate) Reg[R2] ← Mem[30+Reg[R1]] • SD R2,30(R1) (fields: Opcode | R1 (rs) | R2 (rt) | Immediate) Mem[30+Reg[R1]] ← Reg[R2]
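The effective-address calculation can be sketched in a few lines; the register file and memory are modeled as hypothetical dictionaries, and the function name is mine:

```python
def effective_address(regs, base, offset):
    # effective address = sign-extended offset + contents of the base register
    return offset + regs[base]

regs = {1: 1000, 2: 0}    # hypothetical register file: R1 = 1000
mem = {1030: 42}          # hypothetical data memory

# LD R2,30(R1): Reg[R2] <- Mem[30 + Reg[R1]]
regs[2] = mem[effective_address(regs, 1, 30)]
# SD R2,30(R1): Mem[30 + Reg[R1]] <- Reg[R2]
mem[effective_address(regs, 1, 30)] = regs[2]
print(regs[2])  # 42
```

Note that the load and the store use the identical address arithmetic; only the direction of the transfer differs.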

  7. Branches • Branches are conditional transfers of control • The branch destination is obtained by adding a sign-extended offset to the current PC • We consider only comparison against zero: • BEQZ R1,name • BEQZ is a pseudo-instruction for BEQ with R0: • BEQ R1,R0,name (fields: Opcode | R1 (rs) | R0 (rt) | Immediate)

  8. RISC Instruction Set • Every instruction takes at most five clock cycles: • Instruction fetch cycle (IF) • Instruction decode/register fetch cycle (ID) • Execution/effective address cycle (EX) • Memory access/branch completion cycle (MEM) • Write-back cycle (WB)

  9. Instruction Fetch (IF) • Send program counter (PC) to memory and fetch current instruction from memory; • Update PC by adding 4 (why 4?). • Operations: IR ← Mem[PC]; NPC ← PC + 4;

  10. Instruction Decode/Register Fetch (ID) • Decode instruction • Read registers • Decoding is done in parallel with reading registers (fixed-field decoding) • Sign-extend the offset field • Operations: A ← Reg[rs]; B ← Reg[rt]; Imm ← sign-extended immediate field of IR (A and B are temporary registers).

  11. Execution/Effective Address (EX) • ALU operates on the operands prepared in ID, performing one of four possible functions: • Memory ref. (add base register and offset): • ALUOutput ← A + Imm • Register-Register ALU instruction: • ALUOutput ← A func B • Register-Immediate ALU instruction: • ALUOutput ← A op Imm • Branch: • ALUOutput ← NPC + (Imm << 2) • Cond ← (A == 0)
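The branch case of the EX stage can be sketched directly from the two operations above. The `sign_extend` helper is an assumption (the slides only say the immediate is sign-extended), as is the 16-bit field width:

```python
def sign_extend(value, bits=16):
    # Interpret a `bits`-wide field as a two's-complement integer.
    mask = 1 << (bits - 1)
    return (value & (mask - 1)) - (value & mask)

def branch_ex(npc, imm16, a):
    # EX stage for a branch: ALUOutput <- NPC + (Imm << 2); Cond <- (A == 0)
    alu_output = npc + (sign_extend(imm16) << 2)
    cond = (a == 0)
    return alu_output, cond

# Offset field 0xFFFF is -1; shifted left by 2 it moves the target back 4 bytes.
print(branch_ex(104, 0xFFFF, 0))  # (100, True)
```

The left shift by 2 reflects that branch offsets count instructions (4 bytes each), not bytes, so the immediate field reaches four times farther.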

  12. Memory Access/Branch Completion (MEM) • PC is updated: PC ← NPC • Access memory if needed: LMD = Load Memory Data Register LMD ← Mem[ALUOutput] or Mem[ALUOutput] ← B • Branch: If (cond) PC ← ALUOutput

  13. Write Back (WB) • Register-Register ALU: Reg[rd] ← ALUOutput • Register-Immediate ALU: Reg[rt] ← ALUOutput • Load: Reg[rt] ← LMD
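As a sketch, the five stages can be traced for a single load instruction. Everything here (the function, the dictionary register file and memory, the pre-decoded fields) is hypothetical, but the per-stage operations and temporary names (NPC, A, ALUOutput, LMD) follow slides 9 through 13:

```python
def execute_load(pc, regs, mem, rs, rt, imm):
    # IF: fetch the instruction (elided here) and compute NPC
    npc = pc + 4
    # ID: read the base register; imm is already a sign-extended int
    A = regs[rs]
    # EX: effective address
    alu_output = A + imm
    # MEM: update the PC and read memory into LMD
    pc = npc
    lmd = mem[alu_output]
    # WB: write the loaded value into destination register rt
    regs[rt] = lmd
    return pc

regs = {1: 1000, 2: 0}
mem = {1030: 7}
pc = execute_load(100, regs, mem, rs=1, rt=2, imm=30)   # LD R2,30(R1)
print(pc, regs[2])  # 104 7
```

This unpipelined trace runs the stages in sequence for one instruction; pipelining, introduced next, runs the same five stages for five different instructions at once.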

  14. Simple RISC Pipeline
                    Clock Number
  Instr. #     1    2    3    4    5    6    7    8    9
  Instr. i     IF   ID   EX   ME   WB
  Instr. i+1        IF   ID   EX   ME   WB
  Instr. i+2             IF   ID   EX   ME   WB
  Instr. i+3                  IF   ID   EX   ME   WB
  Instr. i+4                       IF   ID   EX   ME   WB
  • What are the stages needed for an ALU instruction? • What are the stages needed for a Store instruction? • What are the stages needed for a Branch instruction? • Which stage is expected to take the most time?
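A consequence of this overlap: with one instruction issued per clock, n instructions on a k-stage pipeline complete in k + n - 1 cycles. A one-line sketch (the function name is mine):

```python
def pipeline_cycles(n_instructions, n_stages=5):
    # The first instruction takes n_stages cycles; each later one adds a cycle.
    return n_stages + n_instructions - 1

print(pipeline_cycles(5))  # 9
```

Five instructions finish in 9 cycles, matching the final WB in clock 9 of the diagram; for large n the cost per instruction approaches one cycle.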

  15. Figure A.2. Pipeline

  16. Three Observations on Overlapping Execution • Use separate instruction and data memories, typically implemented as separate instruction and data caches. The use of separate caches eliminates a conflict for a single memory that would arise between instruction fetch and data memory access.

  17. Three Observations on Overlapping Execution • The register file is used in two stages: one for reading in ID and one for writing in WB. These uses are distinct. Hence, we need to perform two reads and one write every clock cycle (why two reads?). To handle reads and a write to the same register (and for another reason that will arise), we perform the register write in the first half of the clock cycle and the reads in the second half.

  18. Three Observations on Overlapping Execution • To start a new instruction every clock, we must increment and store the PC every clock, and this must be done during the IF stage in preparation for the next instruction. Another problem is that a branch does not change the PC until the MEM stage (this problem will be handled soon).

  19. Pipeline Registers • Prevent interference between two different instructions in adjacent stages in the pipeline. • Carry data of a given instruction from one stage to the next. • Registers are triggered by the clock edge: values change instantaneously on the clock edge. • Add pipelining overhead.
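One way to picture pipeline registers is as a latch between each pair of stages holding one instruction's values. The sketch below models two latches as dicts; the field names follow the slides (IR, NPC, A, B), while the decoded instruction encoding is invented:

```python
# Hypothetical register file and instruction memory (already decoded).
regs = {2: 10, 3: 32}
imem = {100: {"op": "DADD", "rs": 2, "rt": 3, "rd": 1}}

# IF writes the IF/ID latch; in the next cycle, ID reads that latch
# and writes the ID/EX latch, so adjacent instructions never collide.
IF_ID = {"IR": imem[100], "NPC": 100 + 4}
ID_EX = {"A": regs[IF_ID["IR"]["rs"]],
         "B": regs[IF_ID["IR"]["rt"]],
         "NPC": IF_ID["NPC"]}
print(ID_EX["A"] + ID_EX["B"])  # 42
```

Because each stage communicates only through its input and output latches, the values of instruction i in EX cannot be disturbed by instruction i+1 sitting in ID.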

  20. Figure A.3. Pipeline Registers

  21. Example • Consider an unpipelined processor. Assume a 1 ns clock cycle, 4 cycles for ALU operations and branches, and 5 cycles for memory operations. Suppose their relative frequencies are 40%, 20%, and 40%, respectively. The pipelining overhead is 0.2 ns. What is the speedup from pipelining?

  22. Answer • Average execution time on the unpipelined processor = Clock cycle × Average CPI = 1 ns × ((40%+20%)×4 + 40%×5) = 4.4 ns • On the pipelined processor, each instruction takes one clock cycle plus the overhead: 1 ns + 0.2 ns = 1.2 ns • Speedup from pipelining = 4.4 ns / 1.2 ns ≈ 3.7
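The arithmetic in this answer can be reproduced directly; the dictionary keys are just labels for the three instruction classes in the example:

```python
# Relative frequencies and unpipelined cycle counts from the example.
freq = {"alu": 0.40, "branch": 0.20, "memory": 0.40}
cycles = {"alu": 4, "branch": 4, "memory": 5}

unpipelined_time = 1.0 * sum(freq[k] * cycles[k] for k in freq)  # ns
pipelined_time = 1.0 + 0.2  # one cycle per instruction plus 0.2 ns overhead

print(round(unpipelined_time, 1), round(unpipelined_time / pipelined_time, 1))
# 4.4 3.7
```

Note the pipelining overhead lengthens the clock cycle, so the speedup is 3.7 rather than the ideal 4.4 that a free five-stage overlap would give.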
