Chapter 4: The Processor (part A)

Note: about two thirds of slides in this fileare adopted from Mary Jane Irwin’s work: ( www.cse.psu.edu/~mji ) Chapter 4: The Processor (part A)

System Bus Model Revisited • System Components • CPU, Memory, I/O devices • CPU Components • Datapath and control unit • Datapath Components • ALU and registers • System Interconnections • Data Bus, Address Bus, and Control Bus • Program Execution • Instruction Fetch-Execute Cycle

Instruction Execution Cycle • A Program • consists of a sequence of instructions. • For each instruction, an arithmetic or logic operation, or data movement is performed. • Instruction Execution Cycle (Fetch and execute) • Program Counter (PC) holds address of the next instruction to fetch. • Processor fetches the instruction from the memory location pointed to by PC. • Increment PC by 4, or told otherwise such as increment PC by 4 and then be added by the address offset, e.g., jump instruction. • Processor interprets instruction and performs required actions.

Instruction Fetch-Execution Cycle for MIPS 1.IF 2.ID 3.EX 4.MEM 5.WB • Execution and completion of a MIPS instruction usually includes all or most of the following five steps: • IF: Instruction fetching • ID: Instruction decoding / register read • EX: Arithmetic/logical operation execution / address calculation • MEM: Memory access • WB: Write back

A Simplified MIPS Instruction Set • A simplified instruction set of only 9 instructions: • lw, • sw, • add, • sub, • and, • or, • slt, • beq, • j

Implementation of a MIPS Processor (I) • Now we're ready to implement the simplified MIPS processor • We start with an implementation of the datapath, and then followed by the control unit. • For datapath, we start with individual components, then consider the connections between them.

Implementation of a MIPS Processor (I) • Hardware components involved in each of steps: • IF: Instruction fetching • Instruction memory, PC, IR, adder • ID: Instruction decoding / register read • IR, registers, control unit • EX: Arithmetic/logical operation execution • ALU: arithmetic & logic unit, control unit • MEM: Memory access • Data memory, control unit • WB: Write back • Registers, multiplexers, control unit

Implementation of a MIPS Processor (II) • Individual component can be implemented using either combinational logic or timing logic (refer to your digital logical design course): • Combinational circuits design for • ALU • Adder • Multiplexer • Control unit • Sequential logic design for • PC (program counter), IR (instruction register) • Registers • Instruction memory • Data memory

Implementation of a MIPS Processor (III) • Some components can be readily designed and implemented: • Adder • Multiplexer • Registers • Instruction memory • Data memory • For other components, we need to know the instruction set before a truth table can be created for the component. • ALU, Control unit, etc.

Review: MIPS (RISC) Design Principles • Simplicity favors regularity • fixed size instructions • small number of instruction formats • opcode always the first 6 bits • Smaller is faster • limited instruction set • limited number of registers in register file • limited number of addressing modes • Make the common case fast • arithmetic operands from the register file (load-store machine) • allow instructions to contain immediate operands • Good design demands good compromises • three instruction formats

Fetch PC = PC+4 Exec Decode The Processor: Datapath & Control • Highly simplified MIPS instruction set • memory-reference instructions: lw, sw • arithmetic-logical instructions: add, sub, and, or, slt • control flow instructions: beq, j • Generic implementation • use the program counter (PC) to supply the instruction address and fetch the instruction from memory (and update the PC) • decode the instruction (and read registers) • execute the instruction • All instructions (except j) use the ALU after reading the registers How? memory-reference? arithmetic? control flow?

Aside: Clocking Methodologies • The clocking methodology defines when data in a state element is valid and stable relative to the clock • State elements - a memory element such as a register • Edge-triggered – all state changes occur on a clock edge • Typical execution • read contents of state elements -> send values through combinational logic -> write results to one or more state elements State element 1 State element 2 Combinational logic clock one clock cycle • Assumes state elements are written on every clock cycle; if not, need explicit write control signal • write occurs only when both the write control is asserted and the clock edge occurs

Implementation of a MIPS Processor Datapath (I) • Now we look at constructing a datapath step by step for the five steps: • Instruction fetching • Instruction decoding / register read • Arithmetic/logical operation execution / address calculation • Memory access • Write back

Add 4 Fetch PC = PC+4 Instruction Memory Exec Decode Read Address clock PC Instruction Fetching Instructions • Fetching instructions involves • reading the instruction from the Instruction Memory • updating the PC value to be the address of the next (sequential) instruction • PC is updated every clock cycle, so it does not need an explicit write control signal just a clock signal • Reading from the Instruction Memory is a combinational activity, so it doesn’t need an explicit read control signal

Fetch PC = PC+4 Exec Decode Read Addr 1 Read Data 1 Register File Read Addr 2 Write Addr Read Data 2 Write Data Decoding Instructions • Decoding instructions involves • sending the fetched instruction’s opcode and function field bits to the control unit and Control Unit Instruction • reading two values from the Register File • Register File addresses are contained in the instruction

31 25 20 15 10 5 0 R-type: op rs rt rd shamt funct RegWrite ALU control Fetch PC = PC+4 Read Addr 1 Read Data 1 Register File Read Addr 2 overflow Instruction zero Exec Decode ALU Write Addr Read Data 2 Write Data Executing R Format Operations • R format operations (add, sub, slt, and, or) • perform operation (op and funct) on values in rs and rt • store the result back into the Register File (into location rd) • Note that Register File is not written every cycle (e.g. sw), so we need an explicit write control signal for the Register File

RegWrite ALU control MemWrite overflow zero Read Addr 1 Read Data 1 Address Register File Read Addr 2 Instruction Data Memory Read Data ALU Write Addr Read Data 2 Write Data Write Data MemRead Sign Extend 16 32 Executing Load and Store Operations • Load and store operations involves • compute memory address by adding the base register (read from the Register File during decode) to the 16-bit signed-extended offset field in the instruction • store value (read from the Register File during decode) written to the Data Memory • load value, read from the Data Memory, written to the Register File

Executing Branch Operations • Branch operations involves • compare the operands read from the Register File during decode for equality (zero ALU output) • compute the branch target address by adding the updated PC to the 16-bit signed-extended offset field in the instr Branch target address Add Add 4 Shift left 2 ALU control PC zero (to branch control logic) Read Addr 1 Read Data 1 Register File Read Addr 2 Instruction ALU Write Addr Read Data 2 Write Data Sign Extend 16 32

Executing Jump Operations • Jump operation involves • replace the lower 28 bits of the PC with the lower 26 bits of the fetched instruction shifted left by 2 bits Add 4 4 Jump address Instruction Memory Shift left 2 28 Read Address PC Instruction 26

Creating a Single Datapath from the Parts • Assemble the datapath segments and add control lines and multiplexors as needed • Single cycle design – fetch, decode and execute each instructions in one clock cycle • no datapath resource can be used more than once per instruction, so some must be duplicated (e.g., separate Instruction Memory and Data Memory, several adders) • multiplexors needed at the input of shared elements with control lines to do the selection • write signals to control writing to the Register File and Data Memory • Cycle time is determined by length of the longest path

Add RegWrite ALUSrc ALU control MemWrite MemtoReg 4 ovf zero Read Addr 1 Instruction Memory Read Data 1 Address Register File Read Addr 2 Data Memory Read Address PC Instruction Read Data ALU Write Addr Read Data 2 Write Data Write Data MemRead Sign Extend 16 32 Fetch, R, and Memory Access Portions

31 25 0 J-type: op target address Adding the Control • Selecting the operations to perform (ALU, Register File and Memory read/write) • Controlling the flow of data (multiplexor inputs) 31 25 20 15 10 5 0 R-type: op rs rt rd shamt funct 31 25 20 15 0 • Observations • op field always in bits 31-26 • addr of registers to be read are always specified by the rs field (bits 25-21) and rt field (bits 20-16); for lw and swrs is the base register • addr. of register to be written is in one of two places – in rt (bits 20-16) for lw; in rd (bits 15-11) for R-type instructions • offset for beq, lw, and swalways in bits 15-0 I-Type: address offset op rs rt

Single Cycle Datapath with Control Unit 0 Add Add 1 4 Shift left 2 PCSrc ALUOp Branch MemRead Instr[31-26] Control Unit MemtoReg MemWrite ALUSrc RegWrite RegDst ovf Instr[25-21] Read Addr 1 Instruction Memory Read Data 1 Address Register File Instr[20-16] zero Read Addr 2 Data Memory Read Address PC Instr[31-0] 0 Read Data 1 ALU Write Addr Read Data 2 0 1 Write Data 0 Instr[15 -11] Write Data 1 Instr[15-0] Sign Extend ALU control 16 32 Instr[5-0]

R-type Instruction Data/Control Flow 0 Add Add 1 4 Shift left 2 PCSrc ALUOp Branch MemRead Instr[31-26] Control Unit MemtoReg MemWrite ALUSrc RegWrite RegDst ovf Instr[25-21] Read Addr 1 Instruction Memory Read Data 1 Address Register File Instr[20-16] zero Read Addr 2 Data Memory Read Address PC Instr[31-0] 0 Read Data 1 ALU Write Addr Read Data 2 0 1 Write Data 0 Instr[15 -11] Write Data 1 Instr[15-0] Sign Extend ALU control 16 32 Instr[5-0]

Load Word Instruction Data/Control Flow 0 Add Add 1 4 Shift left 2 PCSrc ALUOp Branch MemRead Instr[31-26] Control Unit MemtoReg MemWrite ALUSrc RegWrite RegDst ovf Instr[25-21] Read Addr 1 Instruction Memory Read Data 1 Address Register File Instr[20-16] zero Read Addr 2 Data Memory Read Address PC Instr[31-0] 0 Read Data 1 ALU Write Addr Read Data 2 0 1 Write Data 0 Instr[15 -11] Write Data 1 Instr[15-0] Sign Extend ALU control 16 32 Instr[5-0]

Branch Instruction Data/Control Flow 0 Add Add 1 4 Shift left 2 PCSrc ALUOp Branch MemRead Instr[31-26] Control Unit MemtoReg MemWrite ALUSrc RegWrite RegDst ovf Instr[25-21] Read Addr 1 Instruction Memory Read Data 1 Address Register File Instr[20-16] zero Read Addr 2 Data Memory Read Address PC Instr[31-0] 0 Read Data 1 ALU Write Addr Read Data 2 0 1 Write Data 0 Instr[15 -11] Write Data 1 Instr[15-0] Sign Extend ALU control 16 32 Instr[5-0]

Adding the Jump Operation Instr[25-0] 1 Shift left 2 28 32 26 0 PC+4[31-28] 0 Add Add 1 4 Shift left 2 PCSrc Jump ALUOp Branch MemRead Instr[31-26] Control Unit MemtoReg MemWrite ALUSrc RegWrite RegDst ovf Instr[25-21] Read Addr 1 Instruction Memory Read Data 1 Address Register File Instr[20-16] zero Read Addr 2 Data Memory Read Address PC Instr[31-0] 0 Read Data 1 ALU Write Addr Read Data 2 0 1 Write Data 0 Instr[15 -11] Write Data 1 Instr[15-0] Sign Extend ALU control 16 32 Instr[5-0]

Fig.4.24 (p.329) Instr[25-0] 1 Shift left 2 28 32 26 0 PC+4[31-28] 0 Add Add 1 4 Shift left 2 PCSrc Jump ALUOp Branch MemRead Instr[31-26] Control Unit MemtoReg MemWrite ALUSrc RegWrite RegDst ovf Instr[25-21] Read Addr 1 Instruction Memory Read Data 1 Address Register File Instr[20-16] zero Read Addr 2 Data Memory Read Address PC Instr[31-0] 0 Read Data 1 ALU Write Addr Read Data 2 0 1 Write Data 0 Instr[15 -11] Write Data 1 Instr[15-0] Sign Extend ALU control 16 32 Instr[5-0]

Implementing the Control Units • The control unit includes mainly • ALU control unit • Main control unit • Using only combinational circuits (simple!) • Inputs are from each instruction’s • op-code field (6 bits [31:26]) for all instructions, and • funct field (6 bits [5:0]) for R-type instructions. • Outputs are the control lines to control • ALU, • Multiplexers, • registers, and • memory

Implementing ALU Control Unit (I) • Inputs: 6-bit function code, plus 2-bit ALUOp from main control unit • Outputs: 4-bit ALU control lines used to decide which operation ALU performs

Implementing ALU Control Unit (II) • Unit Inputs: ALUOp code (2 bits) and Funct field (6 bits) • ALUOp code is generated at the main control unit • Unit Outputs: ALU control lines (4 bits) • The circuit for ALU control unit • Obtained through combinational digital logic design method

Implementing Main Control Unit (I) • Inputs are 6-bit op-code from all instructions • op-code field (6 bits [31:26]) • Outputs (10-bit) are the control lines to control • Memory (2 bits) • MemRead, MemWrite • Multiplexers (5 bits) • RegDst, (Jump), Branch, MemtoReg, ALUSrc, • Registers (1 bit) • RegWrite • and 2-bit ALUOp (2 bits) • ALUOp0, ALUOp1

Implementing Main Control Unit (II) The Outputs of Main Control Unit

Implementing Main Control Unit (III) Truth table for main control unit:

Implementing Main Control Unit (IV) The circuit for main control unit:

Handle the Jump Instruction • For jump instruction, the target address can be formed with the concatenation of • The upper 4 bits of [PC]+4 • The 26-bit immediate field of the jump instruction • The bits 00 • For main control unit, add an output control signal Jump, which is “1” when the 6-bit op-code matches that of instruction j.

Implementation of a MIPS Processor

Instruction Critical Paths • What is the clock cycle time assuming negligible delays for muxes, control unit, sign extend, PC access, shift left 2, wires, setup and hold times except: • Instruction and Data Memory (200 ps) • ALU and adders (200 ps) • Register File access (reads or writes) (100 ps)

Cycle 1 Cycle 2 Clk lw sw Waste Single Cycle Disadvantages & Advantages • One instruction is completed in one single cycle • Cycle time has to be chosen as the max time delay • i.e., 800 ns • Uses the clock cycle inefficiently – the clock cycle must be timed to accommodate the slowest instruction • especially problematic for more complex instructions like floating point multiple cycle but • Is simple and easy to understand

How Can We Make It Faster? • Start fetching and executing the next instruction before the current one has completed • Pipelining – (all?) modern processors are pipelined for performance • Remember the performance equation: CPU time = CPI * CC * IC • Under ideal conditions and with a large number of instructions, the speedup from pipelining is approximately equal to the number of pipe stages • A five stage pipeline is nearly five times faster because the CC is nearly five times faster • Fetch (and execute) more than one instruction at a time • Superscalar processing – stay tuned

IFetch Dec Exec Mem WB The Five Stages of Load Instruction • IFetch: Instruction Fetch and Update PC • Dec: Registers Fetch and Instruction Decode • Exec: Execute R-type; calculate memory address • Mem: Read/write the data from/to the Data Memory • WB: Write the result data into the register file Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 lw

IFetch IFetch IFetch Exec Exec Exec Mem Mem Mem WB WB WB A Pipelined MIPS Processor • Start the next instruction before the current one has completed • improves throughput - total amount of work done in a given time • instruction latency (execution time, delay time, response time - time from the start of an instruction to its completion) is not reduced Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Dec lw Dec sw Dec R-type • clock cycle (pipeline stage time) is limited by the sloweststage • for some stages don’t need the whole clock cycle (e.g., WB) • for some instructions, some stages are wastedcycles (i.e., nothing is done during that cycle for that instruction)

Single Cycle Implementation (CC = 800 ps): Cycle 1 Cycle 2 Clk lw sw Waste Pipeline Implementation (CC = 200 ps): IFetch Dec Exec Mem WB lw IFetch Dec Exec Mem WB sw IFetch Dec Exec Mem WB R-type Single Cycle versus Pipeline 400 ps • To complete an entire instruction in the pipelined case takes 1000 ps (as compared to 800 ps for the single cycle case). Why ? • How long does each take to complete 1,000,000 adds ?

Pipelining the MIPS ISA • What makes it easy • all instructions are the same length (32 bits) • can fetch in the 1st stage and decode in the 2nd stage • few instruction formats (three) with symmetry across formats • can begin reading register file in 2nd stage • memory operations occur only in loads and stores • can use the execute stage to calculate memory addresses • each instruction writes at most one result (i.e., changes the machine state) and does it in the last few pipeline stages (MEM or WB) • operands must be aligned in memory so a single data transfer takes only one data memory access

IF/ID ID/EX EX/MEM Add Add MEM/WB 4 Shift left 2 Read Addr 1 Instruction Memory Data Memory Register File Read Data 1 Read Addr 2 Read Address PC Read Data Address Write Addr ALU Read Data 2 Write Data Write Data Sign Extend 16 32 MIPS Pipeline DatapathAdditions/Mods • State registers between each pipeline stage to isolate them IF:IFetch ID:Dec EX:Execute MEM: MemAccess WB: WriteBack System Clock

MIPS Pipeline Control Path Modifications • All control signals can be determined during Decode • and held in the state registers between pipeline stages PCSrc ID/EX EX/MEM Control IF/ID Add MEM/WB Branch Add 4 RegWrite Shift left 2 Read Addr 1 Instruction Memory Data Memory Register File Read Data 1 Read Addr 2 MemtoReg Read Address ALUSrc PC Read Data Address Write Addr ALU Read Data 2 Write Data Write Data ALU cntrl MemRead Sign Extend 16 32 ALUOp RegDst

Pipeline Control • IF Stage: read Instr Memory (always asserted) and write PC (on System Clock) • ID Stage: no optional control signals to set

DM Reg Reg IM ALU Graphically Representing MIPS Pipeline • Can help with answering questions like: • How many cycles does it take to execute this code? • What is the ALU doing during cycle 4? • Is there a hazard, why does it occur, and how can it be fixed?

DM DM DM DM DM Reg Reg Reg Reg Reg Reg Reg Reg Reg Reg IM IM IM IM IM ALU ALU ALU ALU ALU Time to fill the pipeline Why Pipeline? For Performance! Time (clock cycles) Once the pipeline is full, one instruction is completed every cycle, so CPI = 1 Inst 0 I n s t r. O r d e r Inst 1 Inst 2 Inst 3 Inst 4

Can Pipelining Get Us Into Trouble? • Yes:Pipeline Hazards • structural hazards: attempt to use the same resource by two different instructions at the same time • data hazards: attempt to use data before it is ready • An instruction’s source operand(s) are produced by a prior instruction still in the pipeline • control hazards: attempt to make a decision about program control flow before the condition has been evaluated and the new PC target address calculated • branch and jump instructions, exceptions • Can usually resolve hazards by waiting • pipeline control must detect the hazard • and take action to resolve hazards

Chapter 4: The Processor (part A)