CS2504: Computer Organization

Lecture 5: Processor Datapath and Control Dimitris Nikolopoulos

Implement MIPS • We will look into the implementation of: • Memory reference instructions (lw, sw) • Arithmetic, logical instructions (add, sub, and, or, slt) • Branch instructions (beq, j) • We will build on two principles: • Making the common case fast • Simplicity favors regularity • We will build structures common in many microprocessors for all markets

Things common in instructions • Fetch code from memory using the PC • Read one or two registers, using fields in the instruction • After these steps, execution becomes instruction-specific • All instruction classes except j use the ALU • Many instructions write a register • Memory instructions read or write memory

Abstract view of MIPS Missing multiplexers and control logic for registers and memories

Basic implementation of MIPS

Single-cycle datapath • Instruction begins execution on one clock edge and ends on the next clock edge • Easy, but impractical implementation • Simple and complex instructions • Variable number of clock cycles per instruction • Single-cycle datapath needs separate instruction and data memories • Different format of data and instructions • Less expensive design • Cannot use a single-ported memory for both instruction and data reads/writes in one cycle!

Datapath Elements - Instructions We can use the ALU we designed during logic design lectures, hardwired to perform only additions.

Datapath Elements - Instructions Logic to fetch the next instruction in program order. Common case.

Datapath Elements – Register file R-instructions need two input registers and one output register. The register file always outputs the data of the two read registers. Control is needed to write a register. Writes are edge-triggered, therefore the register file can read and write registers in the same cycle. Writes need the control signal, register number and data ready before the falling clock edge.

Datapath Elements – Loads/stores Loads need to read into register file from address calculated with base plus offset (need ALU). Similarly, store needs to store in memory location calculated with base plus offset (need ALU).

Datapath Elements – Loads/stores Sign extensions needed for handling positive and negative offsets in load/store instructions. Data memory needed to read from and write to.
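The sign extension and base-plus-offset addition described above can be sketched in a few lines of Python. This is only an illustrative model, not the hardware itself; the function names `sign_extend16` and `effective_address` are made up for this example.

```python
def sign_extend16(imm16):
    """Interpret the low 16 bits of imm16 as a two's-complement number."""
    imm16 &= 0xFFFF
    # If bit 15 is set the offset is negative.
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def effective_address(base, imm16):
    """Base register value plus sign-extended offset, wrapped to 32 bits."""
    return (base + sign_extend16(imm16)) & 0xFFFFFFFF
```

For example, with a base of 0x1000, the offset field 0x0004 gives address 0x1004, while the offset field 0xFFFC (that is, -4) gives 0x0FFC.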

Datapath Elements – Branches Branch needs to calculate target address, by adding a sign-extended offset to PC+4 (architecture convention). Branch also needs to compare two registers to decide whether it is taken (condition is true, go to target) or not taken (condition is false, execute PC+4). Comparison can be done in standard ALU, by asserting the subtract control signal.
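A minimal Python sketch of the beq decision follows. It assumes the usual MIPS convention that the 16-bit offset counts words, so it is shifted left by 2 before being added to PC+4; the slide does not spell out that shift. The helper names are hypothetical.

```python
def sign_extend16(imm16):
    """Interpret the low 16 bits of imm16 as a two's-complement number."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def branch_target(pc, imm16):
    """PC+4 plus the sign-extended word offset (assumed shifted left by 2)."""
    return (pc + 4 + (sign_extend16(imm16) << 2)) & 0xFFFFFFFF


def next_pc(pc, rs_val, rt_val, imm16):
    """beq: subtract in the ALU; a zero result means the branch is taken."""
    zero = ((rs_val - rt_val) & 0xFFFFFFFF) == 0
    return branch_target(pc, imm16) if zero else (pc + 4) & 0xFFFFFFFF
```

Note that the comparison and the target calculation are independent, which is why the datapath uses the main ALU for one and a separate adder for the other.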

Simple Single-Cycle Datapath No sharing of the datapath between instructions (one at a time). R-instructions and memory instructions can use the same ALU. Value of destination register comes from ALU or memory (via a multiplexer). ALU input may come from register file (R-instruction) or directly from the instruction field (immediate operand, offset). Need MUX

Simple Single-Cycle Datapath Combines instruction fetch logic, R- and memory instructions datapath and branch datapath. Branch uses main ALU for comparisons, so we need one more adder for the branch target.

MIPS ALU Control Memory access instructions need addition. R-instructions use one of the five ALU operations other than nor (and, or, add, sub, slt), depending on the 6-bit funct field in the instruction. BEQ needs a subtraction. Implementing control: Use the 6-bit funct field of the instruction and 2 bits indicating add (00) for loads/stores, subtract (01) for beq, or whatever the funct field indicates (10). Note that only a small fraction of the possible funct values are used here.

MIPS ALU Control Simplification: Use only a few out of the 64 possible combinations in the function field. Lots of don't cares. We can simplify the logic.
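The two-level decoding can be sketched as a small lookup in Python. The 2-bit codes (00, 01, 10) come from the slides; the 4-bit ALU control encodings and the funct values below are the ones commonly used in textbook MIPS designs and are an assumption here, since the slides do not list them. The name `alu_control` is hypothetical.

```python
# Assumed 4-bit ALU control encodings (textbook convention, not from the slides).
ALU_AND, ALU_OR, ALU_ADD, ALU_SUB, ALU_SLT = 0b0000, 0b0001, 0b0010, 0b0110, 0b0111

# Only five of the 64 possible funct values are used; the rest are don't cares.
FUNCT_TO_ALU = {
    0b100000: ALU_ADD,  # add
    0b100010: ALU_SUB,  # sub
    0b100100: ALU_AND,  # and
    0b100101: ALU_OR,   # or
    0b101010: ALU_SLT,  # slt
}


def alu_control(alu_op, funct):
    """Map the 2-bit ALUOp and the 6-bit funct field to an ALU operation."""
    if alu_op == 0b00:        # loads/stores: always add
        return ALU_ADD
    if alu_op == 0b01:        # beq: always subtract
        return ALU_SUB
    if alu_op == 0b10:        # R-type: let the funct field decide
        return FUNCT_TO_ALU[funct & 0x3F]
    raise ValueError("unused ALUOp value")
```

For loads, stores and beq the funct argument is ignored, which is where the don't-care entries in the truth table come from.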

MIPS ALU Control • Remember instruction formats: • Bits 31:26 always contain opcode • Registers to read always in 25:21 and 20:16 • Base register for load/store in 25:21 • 16-bit offset for branch, load, store in 15:0 • Destination register in one of two places • 20:16 for a load (rt register) • 15:11 for R-type instructions (rd register)
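The bit ranges above translate directly into shifts and masks. The following Python sketch pulls every field out of a 32-bit instruction word; the function name `decode` and the dictionary keys are chosen for this example only.

```python
def decode(word):
    """Split a 32-bit MIPS instruction word into its fixed-position fields."""
    return {
        "opcode": (word >> 26) & 0x3F,  # bits 31:26
        "rs":     (word >> 21) & 0x1F,  # bits 25:21 (first read / base register)
        "rt":     (word >> 16) & 0x1F,  # bits 20:16 (second read / load destination)
        "rd":     (word >> 11) & 0x1F,  # bits 15:11 (R-type destination)
        "funct":  word & 0x3F,          # bits 5:0
        "imm16":  word & 0xFFFF,        # bits 15:0 (offset for branch, load, store)
    }


# add $t0, $t1, $t2 encodes as 0x012A4020; lw $t0, 4($t1) encodes as 0x8D280004.
add_fields = decode(0x012A4020)
lw_fields = decode(0x8D280004)
```

Because the fields sit in fixed positions, the hardware can read registers 25:21 and 20:16 before it even knows which instruction it is looking at.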

MIPS ALU Control Added lines from instruction for read/write registers, ALU control block. RegDst selects destination register. ALUSrc selects ALU input from immediate field or from register file. PCSrc decides on whether to advance the PC or use the branch target. MemtoReg decides on whether data written to register file comes from ALU operation or from data memory.

MIPS ALU Control - R-instruction Flow of R-instruction: Instruction fetched, PC incremented. Two registers read from register file, while control unit decides what operation to perform in parallel. ALU operates on the data based on bits 5:0 of the instruction. Result of ALU written to register file based on bits 15:11 of instruction.

MIPS ALU Control - Load/Store Flow of load instruction: Instruction fetched, PC incremented. One register (the base) is read. ALU adds the base value from the register file and the sign-extended offset from the instruction. Sum is used to access data memory. Data from memory written to the register specified by bits 20:16.

MIPS ALU Control – Designing the unit Example: R-format: Destination register taken from bits 15:11 (RegDst = 1). Second ALU input from register file (ALUSrc = 0). Data written to register file comes from the ALU, not from memory (MemtoReg = 0). Writes register (RegWrite = 1). Does not read memory (MemRead = 0). Does not write memory (MemWrite = 0). Does not branch (Branch = 0). Performs an operation defined by the funct field of the instruction (ALUOp = 10).
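The R-format row above can be written down as one entry of a lookup table keyed by opcode. The sketch below adds rows for lw, sw and beq using the values usually given for the textbook single-cycle design; those three rows and the signal names RegWrite, MemRead, MemWrite, Branch and ALUOp are assumptions, since the slide text only spells out the R-format case. `None` marks a don't care.

```python
# Opcode -> control signals. None means "don't care".
CONTROL = {
    0x00: dict(RegDst=1, ALUSrc=0, MemtoReg=0, RegWrite=1,
               MemRead=0, MemWrite=0, Branch=0, ALUOp=0b10),   # R-format
    0x23: dict(RegDst=0, ALUSrc=1, MemtoReg=1, RegWrite=1,
               MemRead=1, MemWrite=0, Branch=0, ALUOp=0b00),   # lw (assumed row)
    0x2B: dict(RegDst=None, ALUSrc=1, MemtoReg=None, RegWrite=0,
               MemRead=0, MemWrite=1, Branch=0, ALUOp=0b00),   # sw (assumed row)
    0x04: dict(RegDst=None, ALUSrc=0, MemtoReg=None, RegWrite=0,
               MemRead=0, MemWrite=0, Branch=1, ALUOp=0b01),   # beq (assumed row)
}


def main_control(opcode):
    """Return the control signals for one of the four supported opcodes."""
    return CONTROL[opcode]
```

Instructions that never write a register (sw, beq) leave RegDst and MemtoReg as don't cares, which again helps to simplify the control logic.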

MIPS ALU Control – Designing the unit

Adding a Jump Instruction

Why not single-cycle? • Single-cycle design means CPI = 1 • Load instruction uses five functional units in series • Instruction memory to fetch the instruction • Register file to read the base register • ALU to calculate base plus offset • Data memory to read the data • Register file again to write the target register • Performance bound by the slowest instruction!

Why not single-cycle? • Example: • Memory unit latency: 200ps • ALU and adders: 100ps • Register file (read/write): 50ps • All others: (unrealistically) no delay • Instruction mix: 25% loads, 10% stores, 45% ALU instructions, 15% branches, 5% jumps.

Why not single-cycle? • CPU execution time = icount * CPI * clock cycle • CPI = 1 • R-instructions • fetch, access registers, ALU, access registers, 400ps. • Load • fetch, access registers, ALU, access memory, access registers, 600ps. • Store • fetch, access registers, ALU, access memory, 550ps.

Why not single-cycle? • CPU execution time = icount * CPI * clock cycle • CPI = 1 • Branch • Fetch, access registers, ALU, 350ps • Jump • Fetch, 200ps. • Slowest instruction defines clock cycle: 600ps. This means that all instructions take 600ps, even though some need as little as 200ps! • Average instruction time in the example, if each instruction took only as long as it needs: 447.5ps!
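The numbers in this example can be reproduced with a short Python calculation. It uses only the unit latencies and the instruction mix given on the slides; the variable names are made up for this sketch.

```python
MEM, ALU, REG = 200, 100, 50  # unit latencies in ps, from the example

# Latency of each instruction class: sum of the units it passes through.
LATENCY = {
    "alu":    MEM + REG + ALU + REG,        # fetch, read regs, ALU, write reg
    "load":   MEM + REG + ALU + MEM + REG,  # fetch, read regs, ALU, memory, write reg
    "store":  MEM + REG + ALU + MEM,        # fetch, read regs, ALU, memory
    "branch": MEM + REG + ALU,              # fetch, read regs, ALU
    "jump":   MEM,                          # fetch only
}

MIX_PERCENT = {"load": 25, "store": 10, "alu": 45, "branch": 15, "jump": 5}

# The single-cycle clock must fit the slowest instruction.
single_cycle_clock = max(LATENCY.values())

# Weighted average if every instruction took only as long as it needs.
average_latency = sum(MIX_PERCENT[k] * LATENCY[k] for k in LATENCY) / 100

# How much time the fixed 600 ps clock wastes relative to that average.
slowdown = single_cycle_clock / average_latency
```

With this mix the fixed clock is roughly 1.34 times slower than the weighted average, which is the motivation for letting different instructions take different amounts of time.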

Why not single-cycle? • Single-cycle design violates the principle "make the common case fast". Fast instructions run as slowly as the slowest instructions • Each functional unit of the microprocessor can be used at most once per clock cycle. If the instruction mix needs more functional units, they need to be duplicated. • Solution: multi-cycle instructions, intro to pipelining.

The laundry example

Pipelining • Multiple instructions overlapped in execution • The pipeline paradox: • Each task takes the same amount of time as in a non-pipelined implementation • But, a set of tasks executes faster in a pipelined implementation, because many tasks proceed in parallel!

Pipelining in MIPS • Fetch instruction from memory • Read registers, while decoding instruction • Execute operation (ALU) or calculate address (ALU) • Access an operand in data memory • Write the result into a register • Classic, 5-stage pipeline (memorize!)

Pipelining improves performance

Understanding pipeline performance • In the example, the first instruction graduates at 900ps. • Following that, an instruction graduates every 200ps. • What is the maximum speedup from pipelining?
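One way to explore the speedup question is to model total execution time for n instructions. The 900 ps and 200 ps figures are from the slide; the unpipelined time per instruction is not stated there, so the 800 ps used below is purely an assumed value, and the function names are hypothetical.

```python
def pipelined_time(n, first_done=900, interval=200):
    """Time in ps until the n-th instruction graduates from the pipeline."""
    return first_done + (n - 1) * interval


def unpipelined_time(n, per_instruction):
    """Time in ps when instructions run strictly one after another."""
    return n * per_instruction


def speedup(n, per_instruction, first_done=900, interval=200):
    """Ratio of unpipelined to pipelined execution time for n instructions."""
    return unpipelined_time(n, per_instruction) / pipelined_time(n, first_done, interval)


# Assumed unpipelined latency of 800 ps per instruction (not from the slide).
small_program = speedup(3, per_instruction=800)
large_program = speedup(1_000_000, per_instruction=800)
```

As n grows, the startup cost of filling the pipeline is amortized and the speedup approaches per_instruction / interval. If all five stages took exactly one interval each, that limit would equal the number of stages.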

ISA implications for pipelining • Fixed-width instructions simplify the fetch stage • Symmetry between instructions enables register access in parallel with instruction decoding. Asymmetric instructions need more pipeline stages • Memory operands only in loads/stores. We can still do a load in five stages. If memory operands were added to arithmetic instructions, we would need a longer pipeline

Hazards • Broadly defined: A hazard is a conflict between two instructions that attempt to access the same resource, in the same cycle, at different stages of their pipelined execution. • Structural hazard: Hardware does not have enough resources. For example, suppose we had a single memory for instructions and data.

Structural Hazard Assume a single memory for instructions and data, and assume that four load instructions execute back to back. The first instruction's data access would conflict with the fourth instruction's instruction fetch!

Data hazard • A data hazard occurs because an instruction may need input from an earlier instruction in the pipeline and the input may not be available yet:
    add $s0,$t0,$t1
    sub $t2,$s0,$t3
• The add instruction does not write $s0 until its 5th stage • But the sub instruction needs the new value of $s0 in its second stage

Alternative pipeline representation Symbols corresponding to pipeline stages: fetch, decode, execute, memory access and register write back. Shading indicates how each element is used by the instruction. Half-shaded resources indicate reads (right shade) or writes (left shade).

Forwarding to resolve data hazards Idea: Forward the needed input to the execute stage as soon as the execute stage of the previous instruction produces it. Caution: we can only forward ahead in time, not back. In this case, the add produces the needed result at time 600, so the result can be routed back to the ALU input and used by the next instruction.

Is forwarding always possible? Result from load not in register file until t=800. If sub follows load in the next cycle, it will need the result in the ALU by time t=600. Can't go backwards in time, therefore forwarding will not work in this case, unless we insert a delay (bubble) in the pipeline. First example of non-perfect pipelining!

Resolving data hazards for free Consider the following segment in C: A = B + E; C = B + F; Here is the translation to MIPS:
    lw  $t1, 0($t0)
    lw  $t2, 4($t0)
    add $t3, $t1, $t2
    sw  $t3, 12($t0)
    lw  $t4, 8($t0)
    add $t5, $t1, $t4
    sw  $t5, 16($t0)
Can you find the data hazards in this example? Can you remove the hazards by just reordering instructions? Key idea: there is no dependence between the two assignments, so A and C can be assigned in parallel!

Resolving data hazards for free In the original order, each add uses a register loaded by the lw immediately before it. Moving the third lw up, ahead of the first add, separates every load from the instruction that uses its result:
    lw  $t1, 0($t0)
    lw  $t2, 4($t0)
    lw  $t4, 8($t0)
    add $t3, $t1, $t2
    sw  $t3, 12($t0)
    add $t5, $t1, $t4
    sw  $t5, 16($t0)
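The effect of reordering can be checked with a small Python sketch that counts load-use stalls. It assumes a pipeline with forwarding, where the only remaining stall is an instruction that reads a register loaded by the lw immediately before it. The parser handles only the lw, sw and add forms used in this example, and all function names are hypothetical.

```python
import re


def parse(line):
    """Return (op, dest, sources) for the lw/sw/add forms used on the slide."""
    op, rest = line.split(None, 1)
    regs = re.findall(r"\$\w+", rest)
    if op == "lw":                    # lw rt, off(rs): writes rt, reads rs
        return op, regs[0], [regs[1]]
    if op == "sw":                    # sw rt, off(rs): reads both, writes none
        return op, None, regs
    return op, regs[0], regs[1:]      # add rd, rs, rt


def load_use_stalls(program):
    """Count instructions that use the result of the lw directly before them."""
    stalls = 0
    prev = None
    for line in program:
        op, dest, srcs = parse(line)
        if prev is not None and prev[0] == "lw" and prev[1] in srcs:
            stalls += 1
        prev = (op, dest, srcs)
    return stalls


original = [
    "lw $t1, 0($t0)", "lw $t2, 4($t0)", "add $t3, $t1, $t2", "sw $t3, 12($t0)",
    "lw $t4, 8($t0)", "add $t5, $t1, $t4", "sw $t5, 16($t0)",
]
reordered = [
    "lw $t1, 0($t0)", "lw $t2, 4($t0)", "lw $t4, 8($t0)", "add $t3, $t1, $t2",
    "sw $t3, 12($t0)", "add $t5, $t1, $t4", "sw $t5, 16($t0)",
]
```

Under these assumptions the original order has two load-use stalls (one per add) and the reordered version has none, with the same number of instructions.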

Control hazards • Instruction following a branch needs to be fetched in next clock cycle. • Unfortunately, we do not know the outcome of the branch and whether we should execute the next instruction or not when we fetch the branch. • The outcome of the branch is known after the ALU stage (3rd in the pipeline).

Resolving control hazards earlier Assume that we throw some extra hardware at the problem (more specifically, ALUs), so that we can calculate the branch condition and the branch target by the end of the second stage of the instruction. Even then, we still need to insert a bubble in the pipeline so that we can safely fetch the next instruction. Assume that all other instructions have CPI=1 and branches need two cycles. If YY% of instructions are branches, the CPI of the machine increases to 1.YY.
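The CPI argument above is a one-line formula: every branch costs one extra cycle, so the average grows by the branch fraction. A minimal Python sketch, with a hypothetical function name and the 15% branch share borrowed from the earlier instruction mix as an example input:

```python
def cpi_with_branch_bubble(branch_fraction, base_cpi=1.0, bubble_cycles=1):
    """Average CPI when every branch adds bubble_cycles stall cycles."""
    return base_cpi + branch_fraction * bubble_cycles


# With 15% branches, each needing one bubble, CPI rises from 1 to about 1.15.
example_cpi = cpi_with_branch_bubble(0.15)
```

The same function shows why resolving the branch in the second stage matters: if the outcome were only known after the third stage, bubble_cycles would be 2 and the penalty would double.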

Neo and the Oracle Oracle: I'd ask you to sit down, but, you're not going to anyway. And don't worry about the vase. Neo: What vase? [Neo turns to look for a vase, and as he does, he knocks over a vase of flowers, which shatters on the floor.] Oracle: That vase. Neo: I'm sorry-- Oracle: I said don't worry about it. I'll get one of my kids to fix it. Neo: How did you know? Oracle: Ohh, what's really going to bake your noodle later on is, would you still have broken it if I hadn't said anything?

Prediction • Modern processors make use of prediction to handle branches, as well as several other events that may cause long pipeline stalls. • If we predict that the branch is not taken, and the branch is indeed not taken, then the pipeline proceeds at full speed. • If the branch is taken, then we will have to insert a bubble

Prediction (always not taken)

Designing predictors • Always predicting taken, or always predicting not taken, is like flipping a coin. • But branches are not really random: • The branch at the end of a loop, for example, jumps back every time except on the last iteration. • Motivation for dynamic predictors: • Hardware structures that keep information on a per-branch basis. For example, they keep the history of outcomes of the same branch during the execution of the program. • Predict based on history of branches.
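To make the idea of per-branch history concrete, here is a Python sketch of one common scheme, a 2-bit saturating counter per branch address. The slides do not say which predictor design is meant, so this particular scheme, the starting counter value and all names are assumptions made for illustration.

```python
from collections import defaultdict


class TwoBitPredictor:
    """Per-branch 2-bit counters: 0 and 1 predict not taken, 2 and 3 predict taken."""

    def __init__(self):
        # Assumed initial state: weakly not taken.
        self.counters = defaultdict(lambda: 1)

    def predict(self, pc):
        return self.counters[pc] >= 2

    def update(self, pc, taken):
        c = self.counters[pc]
        # Saturate at 0 and 3 so a single surprise does not flip a strong prediction.
        self.counters[pc] = min(c + 1, 3) if taken else max(c - 1, 0)


def accuracy(predictor, pc, outcomes):
    """Fraction of outcomes the predictor guesses correctly, updating as it goes."""
    correct = 0
    for taken in outcomes:
        if predictor.predict(pc) == taken:
            correct += 1
        predictor.update(pc, taken)
    return correct / len(outcomes)


# Loop-closing branch: taken 9 times, then falls through; the loop runs 10 times.
loop_outcomes = ([True] * 9 + [False]) * 10
dynamic_accuracy = accuracy(TwoBitPredictor(), 0x400, loop_outcomes)
always_not_taken_accuracy = loop_outcomes.count(False) / len(loop_outcomes)
```

On this loop pattern the history-based predictor is wrong only on loop exits (plus once while warming up), whereas always predicting not taken is right only on the exits.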
