CPE555A: Real-Time Embedded Systems

CPE555A:Real-Time Embedded Systems Lecture 2 Ali Zaringhalam Stevens Institute of Technology CPE555A – Real-Time Embedded Systems Stevens Institute of Technology 1

Outline
• RISC ISA
• Single-cycle CPU
• Multi-cycle CPU
• Pipelining
• Pipeline hazards

Von Neumann Machine
[Figure: Von Neumann machine block diagram with blocks labeled CPU, Control Unit, ALU/Datapath, Main Memory, and Input/Output.]

Fetch/Execute Cycle
[Figure: flowchart with boxes labeled Start, Fetch Next Instruction, Execute Instruction, and Handle Interrupts (If Any); paths labeled Interrupts Disabled and Interrupts Enabled.]
• The Program Counter (PC) register holds the address of the current instruction.
• After the instruction is fetched, the PC is automatically incremented to point to the next instruction.

Need for Instructions
• We need a way to tell the processor what steps to take to execute our program. In the Von Neumann model this includes:
  • fetching data from memory
  • performing arithmetic and logical operations on the data
  • storing the results of computation in memory
  • performing input/output
• In addition, the processor must support certain high-level programming constructs. These include:
  • modifying the sequential flow of control for if-then-else and case
  • subroutine calls to support structured programming

Examples of Instructions

RISC Instruction Set Architecture
• Reduced Instruction Set Computer (RISC) is an important class of Instruction Set Architecture (ISA).
• Examples of RISC processors:
  • PowerPC (Freescale)
  • SPARC (Sun Microsystems)
  • MIPS (MIPS/Silicon Graphics)
  • ARM (heavily used in embedded systems today)
• The ISAs implemented in these machines are not quite the same but share a large set of common characteristics (to be discussed shortly).

Summary: MIPS Instruction Formats
Field widths are in bits; each format totals 32 bits.
I-Type Format:  opcode (6) | rs (5) | rt (5) | immediate (16)
J-Type Format:  opcode (6) | immediate (26)
R-Type Format:  opcode (6) | rs (5) | rt (5) | rd (5) | shamt (5) | func (6)
This ISA was designed to allow efficient pipelining of instructions in hardware.
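
The three formats above can be sketched as bit-packing helpers. This is a minimal illustration, assuming the fields are laid out left to right from the most significant bit, as drawn on the slide. The function names are hypothetical, and the example opcode/func values (0x23 for lw, 0x20 for add) are the standard MIPS32 encodings rather than values given on the slide.

```python
def encode_r(opcode, rs, rt, rd, shamt, func):
    """Pack an R-type word: opcode(6) rs(5) rt(5) rd(5) shamt(5) func(6)."""
    return (opcode << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | func


def encode_i(opcode, rs, rt, immediate):
    """Pack an I-type word: opcode(6) rs(5) rt(5) immediate(16)."""
    # Mask so that a negative immediate occupies exactly 16 bits.
    return (opcode << 26) | (rs << 21) | (rt << 16) | (immediate & 0xFFFF)


def encode_j(opcode, immediate):
    """Pack a J-type word: opcode(6) immediate(26)."""
    return (opcode << 26) | (immediate & 0x3FFFFFF)


# add reg4, reg3, reg2 (opcode 0, func 0x20: assumed standard MIPS32 values)
add_word = encode_r(0, 3, 2, 4, 0, 0x20)
# lw reg4, 100(reg1) (opcode 0x23: assumed standard MIPS32 value)
lw_word = encode_i(0x23, 1, 4, 100)
# j with a 26-bit field of 0x100 (opcode 2)
j_word = encode_j(2, 0x100)

print(f"{add_word:#010x} {lw_word:#010x} {j_word:#010x}")
```

Because every format is exactly 32 bits wide and the opcode always sits in the top 6 bits, hardware can locate the opcode before it knows which format it is looking at.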

What’s in an ISA?
• Above all, an ISA is a set of specifications
• An ISA gives you a set of requirements on what to build (i.e., support) in a processor. These include:
  • the set of instructions that the processor must support
  • the number of programmable registers
  • instruction format, including size and encoding
  • the interface between the processor and the operating system for exception handling
  • what features are required and what features are optional (for example, in MIPS integer arithmetic is required but floating-point arithmetic is optional)
  • in short: whatever is required to ensure binary compatibility between two machines implementing the same ISA

What Isn’t in an ISA?
• An ISA doesn’t tell you how to build a processor. Should it be pipelined? How many instructions should be issued per cycle? And so on.
• This permits:
  • processor vendors to implement the ISA in different ways based on technology/performance/cost requirements
  • compiler developers to develop compilers that translate to an ISA independent of the processor’s specific implementation
    • this is not entirely true when it comes to performance optimization
  • an ISA to live longer than a specific implementation (a particular processor becomes obsolete long before an ISA is abandoned in favor of a new one)

Characteristics of RISC Processors
• Large number of General Purpose Registers
• Strictly load/store
• Fixed-size instructions
• Variable-format instructions
• Limited number of addressing modes
• Small instruction set (MIPS32 has 168 instructions vs. ~700 in VAX)

RISC Alternative: CISC
• CISC: Complex Instruction Set Computer
  • variable-length, variable-format instructions
  • complex instructions
  • memory-register instructions
  • complex addressing modes
• Example: Intel’s IA32

What’s a General-Purpose Register?

Storage-Device Hierarchy
[Figure: storage-device hierarchy ordered by increasing access time. Labeled access times: 0.25-0.5 ns, 0.5-20 ns, 80-250 ns. Reference point: 4 GHz CPU, cycle time T = 0.25 ns.]

Why a Large Number of GPRs?
• Registers are cheaper to make now
• Registers offer compiler writers flexibility
  • compiler developers prefer unreserved registers
• Registers are faster to access than main memory or cache
• Registers can store variables for as long as necessary. This reduces the need to access memory for data
• We can address registers with fewer bits than main memory. This reduces code size (improves code density)
  • in MIPS we need 5 bits to address 32 registers
  • in a 32-bit machine we need 32 bits to address a memory location

MIPS Register Organization
• 32 GPRs (or integer registers), each 32 bits wide
  • reg31 is used to store the return address during procedure calls. At other times it can be used for any purpose.
• Why not consider more than 32 registers?
  • addressing registers in instructions requires address bits: we need n bits to address 2^n registers (5 bits to address 32 registers); there is a tradeoff between the number of GPRs and instruction size
  • more registers means more hardware (e.g., gates, wires); more hardware translates into a longer datapath and a longer clock cycle (lower clock rate)
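
The tradeoff between register count and instruction size can be made concrete with a little arithmetic. This is a rough sketch with hypothetical helper names; it assumes an R-type instruction with three register fields (rs, rt, rd), as in the MIPS format summary.

```python
def register_address_bits(num_registers):
    """Smallest n such that 2**n >= num_registers."""
    return (num_registers - 1).bit_length()


def register_field_bits(num_registers, fields=3):
    """Total bits spent naming registers in an instruction with `fields` register fields."""
    return fields * register_address_bits(num_registers)


for count in (32, 64, 128):
    print(count, register_address_bits(count), register_field_bits(count))
```

Under these assumptions, going from 32 to 64 registers costs one extra bit per register field, or three extra bits out of a fixed 32-bit R-type instruction.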

Data Transfer Instructions
I-Type Format:  opcode (6) | rs (5) | rt (5) | immediate (16)

Example
[Figure: code example; surviving labels: reg2, reg1, temp.]
• Note that you must load data from memory into register reg4 before any arithmetic operation. Hence the name “load-store”, which means that you cannot use a memory operand in ALU instructions.
• A word is 4 bytes, so the offset is 8 x 4 = 32.

Memory Addressing
• What are the addressable objects in memory?
  • in most processors today, instructions can address and operate on individual bytes
  • but other multi-byte scalars such as word and half-word are also available for access
• And the issues for multi-byte scalars are...
  • how to organize the bytes of a multi-byte scalar in memory: little-endian vs. big-endian conventions
  • how to access multi-byte scalars: alignment restrictions

Little & Big Endian
• Big endian: word address is the address of the most significant byte
• Little endian: word address is the address of the least significant byte
Byte order      MSB                  LSB
Big Endian      B+0    B+1    B+2    B+3
Little Endian   B+3    B+2    B+1    B+0
B is some base address

Example: 0x12345678
Memory address   Big Endian   Little Endian
200              12           78
201              34           56
202              56           34
203              78           12
• B = 200 is the base address in this example
• Big endian is similar to writing English: the most significant byte comes first
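
The byte layout in this example can be reproduced with Python's built-in `int.to_bytes`. The helper name `byte_layout` is hypothetical; the base address 200 and the value 0x12345678 come from the slide.

```python
def byte_layout(value, base, byteorder):
    """Return (address, byte) pairs for a 4-byte word stored at `base`."""
    raw = value.to_bytes(4, byteorder=byteorder)
    return [(base + offset, raw[offset]) for offset in range(4)]


big = byte_layout(0x12345678, 200, "big")
little = byte_layout(0x12345678, 200, "little")

for (address, big_byte), (_, little_byte) in zip(big, little):
    print(address, f"{big_byte:02x}", f"{little_byte:02x}")

# Reading the little-endian bytes back with the wrong convention scrambles the value.
misread = int.from_bytes(bytes(b for _, b in little), byteorder="big")
print(hex(misread))
```

The final two lines illustrate why endian-ness matters when machines of different conventions exchange multi-byte data.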

Example Machines
• Little endian
  • 80x86, VAX, Alpha
• Big endian
  • SPARC, 680x0, IBM370/390, most RISC
• Bi-endian processors can be configured to operate in either big- or little-endian mode (e.g., MIPS64)
• When to worry about endian-ness?
  • byte/bit manipulation within a multi-byte scalar (e.g., access the 3rd most significant byte of a 4-byte word)
  • data communication between machines of different endian-ness

MIPS is Strictly Load/Store?

Addressing Modes
• Instructions need to know where/what their operands are. So the question is how the operands are supplied to the instruction.
• MIPS ISA supports three methods for this purpose:
  • immediate mode addressing: the operand is encoded directly in the instruction as a constant
  • register mode addressing: the operand is in a register and the address of the register is encoded in the instruction
  • displacement mode addressing: the operand is stored in memory and the address of the memory location is encoded in the instruction as a base register plus a displacement

Addressing Mode Examples
• Immediate mode (16-bit field for the constant)
  addi reg4, reg4, 7      # Regs[reg4] = Regs[reg4] + 7
• Register mode
  add reg4, reg3, reg2    # Regs[reg4] = Regs[reg3] + Regs[reg2]
• Displacement mode (16 bits for the displacement)
  lw reg4, 100(reg1)      # Regs[reg4] = Mem[Regs[reg1] + 100]
• Special cases of displacement mode
  • indirect mode: displacement value = 0
    lw reg4, 0(reg1)      # Regs[reg4] = Mem[Regs[reg1]]
  • absolute addressing: reg0 as base register (always stores 0)
    lw reg4, 8700(reg0)   # Regs[reg4] = Mem[8700]
• MIPS ISA supports 3 addressing modes explicitly, but effectively we have 5 addressing modes at our disposal.
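
The way indirect and absolute addressing fall out of displacement mode can be sketched with a toy register file and memory. This is only an illustration: the `regs` list, the `mem` dictionary, the stored values, and the `load_word` helper are all hypothetical stand-ins for Regs[] and Mem[] on the slide.

```python
regs = [0] * 32                       # Regs[]; reg0 is assumed to always hold 0
regs[1] = 1000                        # hypothetical base address in reg1
mem = {1100: 11, 1000: 22, 8700: 33}  # hypothetical word values keyed by address


def load_word(base_reg, displacement):
    """Displacement mode: return Mem[Regs[base_reg] + displacement]."""
    return mem[regs[base_reg] + displacement]


displacement_value = load_word(1, 100)   # lw reg4, 100(reg1)
indirect_value = load_word(1, 0)         # lw reg4, 0(reg1): displacement is 0
absolute_value = load_word(0, 8700)      # lw reg4, 8700(reg0): base is always 0

print(displacement_value, indirect_value, absolute_value)
```

All three loads go through the same address computation; only the choice of base register and displacement differs, which is why no extra hardware is needed for the two special cases.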

I For Immediate Addressing Mode
• A large percentage of arithmetic operations have one constant operand (e.g., X=X+4)
• Keeping and loading constants from memory is inefficient (consider storing all integer constants in memory!)
• ALU instructions with immediate addressing mode are designed to address this need
  • use the I-type instruction format
  • encode the constant in the instruction’s 16-bit immediate field
  • constants in the range -2^15 to (2^15 - 1) can be encoded
  • example: addi R4, R8, 79
I-Type Format:  opcode (6) | rs (5) | rt (5) | immediate (16)
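
The range of a 16-bit signed immediate, and the sign extension that recovers the constant from the field, can be checked with a short sketch. The names `fits_immediate` and `sign_extend16` are hypothetical helpers, not part of any MIPS toolchain.

```python
IMM_BITS = 16
IMM_MIN = -(1 << (IMM_BITS - 1))      # -2^15
IMM_MAX = (1 << (IMM_BITS - 1)) - 1   # 2^15 - 1


def fits_immediate(constant):
    """True if the constant can be encoded in the 16-bit immediate field."""
    return IMM_MIN <= constant <= IMM_MAX


def sign_extend16(field):
    """Interpret a 16-bit field as a two's-complement signed integer."""
    field &= 0xFFFF
    return field - 0x10000 if field & 0x8000 else field


print(IMM_MIN, IMM_MAX)
print(fits_immediate(79), fits_immediate(40000))
print(sign_extend16(0xFFFF), sign_extend16(79))
```

A constant that does not fit, such as 40000 here, cannot be supplied by a single immediate-mode instruction.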

Conditional Branch Instructions

Unconditional Branch Instructions
• Assume:
  • f → reg1
  • g → reg2
  • h → reg3
  • i → reg4
  • i → reg5

Encoding Unconditional Branch Instructions
• Unconditional PC-region jump encoded as J-Type instruction
  • opcode J: 2
  • 26-bit PC-region offset with respect to PC+4
• Unconditional register jump encoded as R-Type instruction
  • opcode: 0
  • funct JR: 8
  • rs contains branch address
J-Type Format:  opcode=2 (6) | offset added to PC+4 (26)
R-Type Format:  opcode=0 (6) | rs (5) | rt (5) | rd (5) | shamt (5) | func (6)

Procedure Call: Invocation

Procedure Call: Return
J-Type Format:  opcode=3 (6) | immediate (26)

Yet Another Instruction for Procedure Calls: jalr
• jalr: jump & link register instruction
  • encoded in R-Type format
  • jumps to the address in rs
  • opcode = 0; funct = 9
• The jalr instruction stores the return address in R31
  • to return at the end of the procedure: jr $ra
• Used when the procedure’s address is not known at compile time
R-Type Format:  opcode=0 (6) | rs (5) | rt (5) | rd (5) | shamt (5) | func=9 (6)


Single-Cycle CPU
[Figure: single-cycle CPU showing the datapath and the control unit.]

Single-Cycle CPU
• One clock cycle for each instruction
• No datapath resource can be used more than once per clock cycle
• Results in resource duplication for elements that must be used more than once. Examples:
  • Separate memory units for instruction and data
  • Two ALUs for conditional branches

Shortcomings of Single-Cycle CPU
• Duplication of datapath elements
  • Separate instruction and data memory
  • Multiple ALUs
• Clock cycle must have the same length for all instructions
• Cycle determined by the longest path: load instruction
  • memory (fetch from instruction memory)
  • register (read base address)
  • ALU (compute memory address)
  • data memory (read from data memory)
  • register (write into destination register)
• Several instructions require a shorter cycle
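
The effect of the longest path on the clock can be sketched numerically. The unit latencies below are hypothetical round numbers chosen only for illustration (they are not from the slides), and the paths for store, ALU, and branch instructions are assumptions; only the load path follows the list above.

```python
# Hypothetical latencies in picoseconds for each datapath unit (assumed values).
LATENCY_PS = {"imem": 200, "reg_read": 100, "alu": 200, "dmem": 200, "reg_write": 100}

# Units each instruction class passes through; the load path matches the slide.
PATHS = {
    "load": ["imem", "reg_read", "alu", "dmem", "reg_write"],
    "store": ["imem", "reg_read", "alu", "dmem"],       # assumed path
    "alu": ["imem", "reg_read", "alu", "reg_write"],    # assumed path
    "branch": ["imem", "reg_read", "alu"],              # assumed path
}

path_delay_ps = {name: sum(LATENCY_PS[unit] for unit in units)
                 for name, units in PATHS.items()}

# In a single-cycle CPU, every instruction pays for the slowest path.
cycle_time_ps = max(path_delay_ps.values())
slowest = max(path_delay_ps, key=path_delay_ps.get)

print(path_delay_ps)
print(slowest, cycle_time_ps)
```

With these assumed numbers the branch needs only part of the cycle yet still occupies the whole of it, which is the waste the multi-cycle design targets.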

Multi-Cycle CPU
[Figure: datapath divided into five stages: IF, ID, EX, MEM, WB.]
• Break up the datapath into smaller functional segments.
• Each instruction uses only the functions it needs
• Run a faster clock (shorter clock cycle)

Additional Datapath Elements
• Internal registers (invisible to the programmer)
  • Store intermediate results from one clock cycle to the next during execution of each instruction
  • Similar to a scratchpad or a temporary variable
• Instruction Register (IR)
• Load Memory Data Register (LMDR)
• A and B registers to store register operands read from the register file
• ALUout to store the result of an ALU operation

Internal & External Registers
• Internal registers store data from one clock cycle to the next within a single instruction cycle.
  • At the end of a clock cycle, data needed in subsequent clock cycles must be stored in an internal register
• Data needed by subsequent instructions of the program are stored in the external registers or memory
• At the end of a clock cycle, data is stored in one, the other, or both register classes

Instruction Steps - 1
• In the simplest implementation, instructions take at most 5 clock cycles
  • instruction fetch (IF)
  • instruction decode/register fetch cycle (ID)
  • execution/effective address cycle (EX)
  • memory access/branch completion cycle (MEM)
  • write-back cycle (WB)
• Which instructions require no less than 5 cycles to complete?

1. Instruction Fetch Cycle (IF)
• PC register content is applied to the instruction memory address bus
• Instruction is fetched and saved in the IR register to be used in the ID stage
• PC is incremented by 4 to compute the address of the next sequential instruction.
IR = Mem[PC]
PC = PC + 4

2. Instruction Decode/Register Fetch Cycle (ID)
I-Type:  opcode (6) | rs (5) | rt (5) | immediate (16)
R-Type:  opcode (6) | rs (5) | rt (5) | rd (5) | func (11)
• Functions in the ID stage
  • Decode instruction
  • Access the register file to read the registers and store them in A and B (at the next clock edge) for use in the next cycles
• Fixed-Field Decoding
  • Decoding is done in parallel with reading the register file because these fields occur in fixed locations for all instructions
  • Reading registers that will not subsequently be used is harmless
A = Regs[rs]
B = Regs[rt]
Imm = (immediate sign extended)
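
Fixed-field decoding can be sketched as plain shifts and masks on the 32-bit instruction word. The function name `decode_fields` is hypothetical, and the sketch assumes the fields sit left to right from the most significant bit. Every field is extracted unconditionally, mirroring the point that reading a field the instruction does not use is harmless.

```python
def decode_fields(ir):
    """Extract all fixed-position fields of a 32-bit instruction word."""
    immediate = ir & 0xFFFF
    if immediate & 0x8000:            # sign-extend the 16-bit immediate
        immediate -= 0x10000
    return {
        "opcode": (ir >> 26) & 0x3F,
        "rs": (ir >> 21) & 0x1F,
        "rt": (ir >> 16) & 0x1F,
        "rd": (ir >> 11) & 0x1F,      # meaningful only for R-type
        "imm": immediate,             # meaningful only for I-type
    }


# 0x8C240064 is lw reg4, 100(reg1) under the standard MIPS32 encoding (assumed).
fields = decode_fields(0x8C240064)
print(fields)
```

No field position depends on the opcode, so the register file can be read using rs and rt before the control unit has finished working out what the instruction is.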

3. Execution/Effective Address Cycle (EX)
I-Type:  opcode (6) | rs (5) | rt (5) | immediate (16)
R-Type:  opcode (6) | rs (5) | rt (5) | rd (5) | func (11)
• Memory Reference Instruction
  • The ALU adds the operands prepared in the last clock cycle: the base register value in A and the sign-extended immediate. The result is the effective address of an operand for load/store.
  ALUout = A + Sign-Extended(Imm)
• Register-Register ALU Instruction
  • The ALU performs the operation on the operands prepared in A and B in the last cycle.
  ALUout = A func B

4. Memory Access or ALU Instruction Completion (MEM)
I-Type:  opcode (6) | rs (5) | rt (5) | immediate (16)
• Memory Reference
  • Use the memory address computed by the ALU and stored in ALUout in the previous clock cycle
  • Access memory and perform a read or write depending on load or store
  LMDR = Mem[ALUout] or Mem[ALUout] = B
• ALU Instruction (R- or I-Type)
  • Store the result from ALUout in the destination register.
  Regs[IR11…15] = ALUout # R-Type ALU
  Regs[IR16…20] = ALUout # I-Type ALU

5. Write-Back Cycle (WB)
I-Type:  opcode (6) | rs (5) | rt (5) | immediate (16)
• Load Instruction
  • Load data into the destination register (the rt field of an I-type load). The data was fetched in an earlier clock cycle and stored in LMDR.
  Regs[IR16..20] = LMDR
• What to do in this stage for conditional branch instructions?
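
The five steps can be traced for a single load with a short sketch. This is an illustration under stated assumptions, not the course's reference model: memory is a hypothetical dictionary keyed by byte address and shared by instructions and data, the function name `run_load` is made up, and the destination is taken from the rt field (bits 16 to 20), as in standard MIPS I-type loads.

```python
def sign_extend16(field):
    """Interpret a 16-bit field as a two's-complement signed integer."""
    return field - 0x10000 if field & 0x8000 else field


def run_load(pc, regs, mem):
    """Step one lw instruction through IF, ID, EX, MEM, WB; return the new PC."""
    # 1. IF: fetch the instruction into IR and advance the PC.
    ir = mem[pc]
    pc = pc + 4
    # 2. ID: read both register fields and sign-extend the immediate.
    a = regs[(ir >> 21) & 0x1F]          # A = Regs[rs]
    b = regs[(ir >> 16) & 0x1F]          # B = Regs[rt]; read but unused by a load
    imm = sign_extend16(ir & 0xFFFF)
    # 3. EX: compute the effective address.
    alu_out = a + imm                    # ALUout = A + Imm
    # 4. MEM: read the data word.
    lmdr = mem[alu_out]                  # LMDR = Mem[ALUout]
    # 5. WB: write the loaded word into the destination register.
    regs[(ir >> 16) & 0x1F] = lmdr
    return pc


regs = [0] * 32
regs[1] = 1000                           # hypothetical base address
mem = {0: 0x8C240064, 1100: 42}          # lw reg4, 100(reg1) and one data word
new_pc = run_load(0, regs, mem)
print(new_pc, regs[4])
```

Each local variable (`ir`, `a`, `b`, `imm`, `alu_out`, `lmdr`) plays the role of an internal register that carries a value from one clock cycle to the next.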

Example
• Assume the following instruction mix and clock cycles:
  Instruction   Frequency   Clock cycles
  load          23%         5
  store         13%         4
  branches      19%         3
  jumps          2%         3
  ALU           43%         4
• What is the average CPI?
  CPI = 0.23 x 5 + 0.13 x 4 + 0.19 x 3 + 0.02 x 3 + 0.43 x 4 = 4.02
• Performance improvement over the single-cycle CPU, taking the single-cycle CPU as equivalent to 5 of these clock cycles for every instruction: 5.0/4.02 = 1.24
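
The CPI arithmetic can be checked with a few lines. The instruction mix and cycle counts are the ones on the slide; the variable names are hypothetical, and the speedup treats the single-cycle CPU as costing 5 of the multi-cycle clock periods per instruction.

```python
# (frequency, clock cycles) per instruction class, from the slide.
MIX = {
    "load": (0.23, 5),
    "store": (0.13, 4),
    "branch": (0.19, 3),
    "jump": (0.02, 3),
    "alu": (0.43, 4),
}

total_frequency = sum(freq for freq, _ in MIX.values())
average_cpi = sum(freq * cycles for freq, cycles in MIX.values())
speedup = 5.0 / average_cpi

print(f"CPI = {average_cpi:.2f}, speedup = {speedup:.2f}")
```

Because loads, stores, and ALU instructions together make up 79% of the mix and each takes 4 or 5 cycles, the average stays close to 4.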

Multi-Cycle CPU Summary
• Multi-cycle CPU improves performance but not by much
• Performance is limited by the high frequency of instructions with high CPI (load, store, ALU)
• Significant performance gain can be made through pipelining
• The pipelining model uses the same stages as the multi-cycle CPU

What is Pipelining?
• Pipelining is an implementation technique in which the execution of sequential instructions is overlapped in time
• It improves instruction execution throughput, but not the execution time of individual instructions
• Hazard: a situation in which the next instruction in the pipeline cannot be executed in the following clock cycle

Laundry Example
• Ann, Brian, Cathy, Dave (A, B, C, D) each have one load of clothes to wash, dry, and fold
• Washer takes 30 minutes
• Dryer takes 40 minutes
• “Folder” takes 20 minutes
• “Execution time”: 90 minutes

Sequential Laundry
[Figure: timeline from 6 PM to midnight showing loads A, B, C, D in task order, each taking 30 + 40 + 20 minutes, one after another.]
• Sequential laundry takes 6 hours for 4 loads.
• Minimum = 30 + 40 + 20 minutes = 1.5 hours
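
The 6-hour figure, and what overlapping the loads would buy, can be sketched with a small schedule calculation. The overlapped schedule is an assumption added here for comparison: it supposes each machine handles one load at a time and that a finished load may wait between machines. The function names are hypothetical.

```python
STAGE_MINUTES = [30, 40, 20]   # washer, dryer, "folder"


def sequential_minutes(loads):
    """Each load finishes all three stages before the next load starts."""
    return loads * sum(STAGE_MINUTES)


def overlapped_minutes(loads):
    """Each stage starts a load as soon as both the load and the stage are free."""
    stage_free_at = [0] * len(STAGE_MINUTES)
    for _ in range(loads):
        ready_at = 0                        # when this load can enter the next stage
        for stage, duration in enumerate(STAGE_MINUTES):
            start = max(ready_at, stage_free_at[stage])
            ready_at = start + duration
            stage_free_at[stage] = ready_at
    return stage_free_at[-1]


print(sequential_minutes(4) / 60, "hours sequential")
print(overlapped_minutes(4) / 60, "hours overlapped")
```

Under these assumptions the slowest stage, the 40-minute dryer, sets the pace once the pipeline is full, while a single load still takes the full 90 minutes.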
