Week 5 Lecture slides

Cosc 3P92 Week 5 Lecture slides Voters quickly forget what a man says. Richard M. Nixon (1913-1994) Former U.S. President

Hardware components MIC(overview) • MAR and MDR are registers which latch the addresses and data prior to processing

Hardware components MIC (overview) • MAR holds word addresses 0, 1, 2, 3…, each naming a 4-byte word. • Shift 2 bits left to form the byte address. • Causes bytes 0, 4, 8, 12… (the starts of words 0, 1, 2, 3…) to be addressed. • Alignment of words.
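The address translation on this slide can be sketched in a few lines. This is only an illustration, assuming 4-byte words; the function names are made up for the example and are not part of the Mic-1 description.

```python
WORD_SIZE_BYTES = 4  # assumption: 32-bit words, as on the slide


def word_to_byte_address(word_address: int) -> int:
    """Shift a word address 2 bits left to get the byte address of that word."""
    return word_address << 2


def byte_to_word_address(byte_address: int) -> int:
    """Drop the low 2 bits: every byte inside a word maps to the same word."""
    return byte_address >> 2


# Words 0, 1, 2, 3 start at these byte addresses.
byte_addresses = [word_to_byte_address(w) for w in range(4)]
```

Because the low 2 bits of the byte address are always zero after the shift, every access made this way is word aligned.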

Hardware components MIC (overview) • Each micro instruction controls • register enables • bus enables • ALU • Memory • Next Micro instruction address

Hardware components MIC (overview)

Memory control • MAR - memory address register • CPU writes addresses of memory to read, write • MBR - memory buffer register • contains data for write or read • both act as ‘latches’ to hold addr, data until memory finished using them.

Control unit • main functions of a control unit: • instruction interpretation • instruction sequencing • the control unit is a finite-state machine. (Figure: CPU block diagram showing the master clock, external command signals, the control unit, the execution unit, and the status and control signals between them.)

Execution unit • An execution unit consists of: • a register section • an ALU • some dedicated hardware or firmware (Figure: typical CPU model. General purpose registers R0, R1, …, Rn-1; ALU (arithmetic logic unit); control unit; dedicated registers SR (status reg), IR (instn reg), PC (prog cntr), SP (stack ptr), MAR (mem addr reg), MBR (mem buffer reg), etc.; dedicated multiply, division firmware (FP).)

Data transfer within a CPU • A single-bus architecture: • To compute R2 <– R0 + R1: 1. A <– R0, 2. B <– R1, 3. R2 <– A+B (Figure: one bus connecting the general purpose regs R0, R1, etc., the special purpose regs PC etc., and the ALU with its buffer regs A and B.)

Data transfer within a CPU • A two-bus architecture • To compute R2 <– R0 + R1: 1. Buffer <– R0 + R1 (via Bus A and Bus B), 2. R2 <– Buffer (via either Bus A or Bus B). (Figure: Bus A and Bus B connecting the general regs R0, R1, etc., the special regs PC, MBR, etc., and the ALU with its output buffer reg.)

Data transfer within a CPU • A three-bus architecture: • To compute R2 <– R0 + R1: 1. R2 <– R0 + R1 (via Bus A, Bus B and Bus C). (Figure: Bus A and Bus B feeding the ALU from the general regs R0, R1, etc. and the special regs PC, MBR, etc.; Bus C carrying the ALU result back.)
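The three bus organizations differ in how many register transfers the same addition needs. The sketch below is a rough model only: each function performs R2 <– R0 + R1 using the steps listed on the slides and returns the number of steps. The register file is a plain dictionary and the function names are hypothetical.

```python
def single_bus_add(regs: dict) -> int:
    """One bus: operands must be staged in buffer regs A and B first."""
    a = regs["R0"]        # step 1: A <- R0
    b = regs["R1"]        # step 2: B <- R1
    regs["R2"] = a + b    # step 3: R2 <- A + B
    return 3


def two_bus_add(regs: dict) -> int:
    """Two buses: both operands reach the ALU at once, result is buffered."""
    buffer = regs["R0"] + regs["R1"]  # step 1: Buffer <- R0 + R1 (Bus A, Bus B)
    regs["R2"] = buffer               # step 2: R2 <- Buffer
    return 2


def three_bus_add(regs: dict) -> int:
    """Three buses: operands on A and B, result returned on C in one step."""
    regs["R2"] = regs["R0"] + regs["R1"]  # step 1: R2 <- R0 + R1
    return 1


steps = {
    "single": single_bus_add({"R0": 2, "R1": 3, "R2": 0}),
    "two": two_bus_add({"R0": 2, "R1": 3, "R2": 0}),
    "three": three_bus_add({"R0": 2, "R1": 3, "R2": 0}),
}
```

All three variants leave the same value in R2; only the step count changes.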

Design of control units • Hardwired approach • The control unit is treated as a synchronous (i.e., clocked) sequential circuit and is implemented as a hardwired state machine. (Figures: Register Transfer Model of Finite State Machine, with inputs, registers, combinational logic, outputs, and next-state feedback paths; PLA Implementation of a Finite State Machine, with AND plane, OR plane, and registers.)
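A finite-state machine of this kind can be sketched as a transition table: the next state is a function of the current state and the inputs. The states (FETCH, DECODE, EXECUTE) and status values ("ready", "wait") below are hypothetical and chosen only for illustration; they are not taken from the slides.

```python
# Hypothetical transition table: (current state, status input) -> next state.
TRANSITIONS = {
    ("FETCH", "ready"): "DECODE",
    ("FETCH", "wait"): "FETCH",      # stay put until memory responds
    ("DECODE", "ready"): "EXECUTE",
    ("EXECUTE", "ready"): "FETCH",
}


def step(state: str, status: str) -> str:
    """Combinational next-state logic; unknown inputs leave the state unchanged."""
    return TRANSITIONS.get((state, status), state)


def run(statuses: list, state: str = "FETCH") -> list:
    """Clock the machine once per status input and record every state visited."""
    trace = [state]
    for status in statuses:
        state = step(state, status)
        trace.append(state)
    return trace
```

In a hardwired control unit the table would be realized as combinational logic (for example a PLA) feeding a state register, rather than as a lookup in software.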

Microprogramming • Use of memory to implement the control unit • Instructions are implemented as sequences of microinstructions stored in control memory • Each machine language instruction is interpreted by circuitry, and executed using sequences of microprogram instructions • Micro-programs are much like assembled code, except: • direct mapping between instruction fields and hardware components of the CPU. • control fields are specified. • timing is critical; parallelism can be exploited.

Microprogramming • What is being controlled? • data paths: inter-register connections • control points: hardware enabling lines which govern register-to-register communications • idea is that we can control the operation of ALU and micro-control unit using combinations of control fields encoded in micro-instructions (Figure: register, combinational logic driven by control values, register.)

Microprogramming • Each control point specifies a micro-operation • All micro operations which may be executed in parallel can be specified in a single micro instruction. • Factors which determine parallel operations. • Buses must only have 1 input active at a time. • Registers can be either read/written • Not both at the same time.

Microprogramming • Basic microinstruction formats: {Over heads}

Data path • 32-bit registers (none are user-accessible) • B bus: main one to ALU • C bus: from ALU back to registers • H reg: contains other operand for ALU • loaded by performing null op on data, and sending it to H

Data path • ALU control: 6 control lines • shifter: 2 control lines • 1. logical shift left 8 bits • 2. arithmetic shift right 1 bit

Data path timing • Four sub-cycles: • 1. control signals set up (w) • 2. registers loaded on B bus (x) • 3. ALU and shifter (y) • 4. results available to registers on C (z)

Data path timing • These are implicit sub-cycles: they rely on timing of previous steps • Only real clock signals used: • falling edge of clock (starts the cycle) • rising edge (loading from C in step 4) • ALU is continually processing all intermediate values it sees. Its output only makes sense at the appropriate time above (after 3) • Can operate on and save a register in 1 clock cycle: • load PC to B • inc • save to PC
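One data path cycle can be sketched as a single function call. This is a simplified model, not the real hardware: registers are a dictionary, the ALU operation is passed in as a Python function, and the names run_cycle and increment_b are invented for the example.

```python
MASK32 = 0xFFFFFFFF  # registers are 32 bits wide


def run_cycle(registers: dict, b_source: str, alu_op, c_targets: list) -> int:
    """Model one clock cycle: drive B, let the ALU settle, load from C."""
    b_bus = registers[b_source]              # sub-cycle x: register onto B bus
    h_value = registers["H"]                 # H is the other ALU input
    c_bus = alu_op(h_value, b_bus) & MASK32  # sub-cycle y: ALU and shifter
    for name in c_targets:                   # sub-cycle z: load from C bus
        registers[name] = c_bus
    return c_bus


def increment_b(h: int, b: int) -> int:
    """ALU function that ignores H and returns B + 1."""
    return b + 1


registers = {"H": 0, "PC": 7, "SP": 20, "MDR": 0}
run_cycle(registers, "PC", increment_b, ["PC"])          # PC = PC + 1
run_cycle(registers, "SP", increment_b, ["SP", "MDR"])   # SP = MDR = SP + 1
```

The same register (PC) is both the B source and the C target, which works because the old value is read early in the cycle and the new value is loaded at its end.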

Memory again • 2 memory buffers: • 32 bit port: MAR, MDR (read, write) • word addresses • 8-bit: MBR • low byte from PC (read only) • byte addresses • can be loaded signed, unsigned onto B bus • call reads into MBR “fetches” • control: • black arrow: enable from C bus • white arrow: enable onto B bus • 2 bus control: • out B • in C • out B / in C • none

Memory again • MAR aligned to words (32 bits, 4 bytes): [4.4] • Memory is available 2 cycles from when read was initiated • avail. at end of 2nd cycle, so 3rd cycle can use them
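The two-cycle read latency can be sketched with a small model. It assumes, as the slide states, that a read initiated at the end of one cycle delivers its data at the end of the following cycle, so the cycle after that can use it. The class and method names are hypothetical.

```python
class WordMemory:
    """Toy word-addressed memory with a one-cycle-delayed read into MDR."""

    def __init__(self, words: dict):
        self.words = words
        self.mdr = 0
        self._pending = None  # word address of a read started last cycle

    def end_of_cycle(self, mar: int, rd: bool) -> None:
        """Call once at the end of every clock cycle."""
        # Data for a read started at the end of the previous cycle arrives now.
        if self._pending is not None:
            self.mdr = self.words[self._pending]
        # A read requested in this cycle is only initiated here.
        self._pending = mar if rd else None


memory = WordMemory({2: 0xCAFE})
memory.end_of_cycle(mar=2, rd=True)    # cycle 1: read initiated
mdr_after_cycle_1 = memory.mdr         # still the old value
memory.end_of_cycle(mar=2, rd=False)   # cycle 2: data arrives at the end
mdr_after_cycle_2 = memory.mdr         # usable during cycle 3
```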

Microinstructions • 29 signals for data path: • 1. 9 signals to control C bus output into registers • 2. 9 signals to enable registers onto B bus • 3. 8 signals for ALU, shifter functions (6 ALU + 2 shifter) • 4. 2 signals for memory W/R via MAR/MDR • 5. 1 signal for memory fetch via PC/MBR • Issues: • may load more than 1 reg from C (9 bits) • but never load more than 1 reg onto B, so the 9 B enables are encoded in 4 bits and decoded, which enforces this --> 4 signals. • Need 2 more fields for determining next m.i.: • NextAddr (9 bits, addr space of 512) • conditional jumps (3 bits)

Microinstructions • Fields: • Addr: address of next micro-instruction • JAM: determines how next m.i. selected • ALU: ALU, shifter control • C: which registers written from C bus • Mem: memory functions • B: B source (encoded)
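The fields can be packed into a single microinstruction word as sketched below. The widths follow the slides (9-bit NextAddr, 3-bit JAM, 6 ALU + 2 shifter lines, 9 C bits, 2 + 1 memory bits, 4-bit encoded B). Placing the fields from high bits to low bits in the order listed is an assumption made for this example, as are the function names.

```python
# (field name, width in bits), highest-order field first (assumed layout).
FIELDS = [("addr", 9), ("jam", 3), ("alu", 8), ("c", 9), ("mem", 3), ("b", 4)]
WORD_BITS = sum(width for _, width in FIELDS)


def pack(**values: int) -> int:
    """Build a microinstruction word; fields that are not given default to 0."""
    word = 0
    for name, width in FIELDS:
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word = (word << width) | value
    return word


def unpack(word: int) -> dict:
    """Split a microinstruction word back into its named fields."""
    fields = {}
    for name, width in reversed(FIELDS):
        fields[name] = word & ((1 << width) - 1)
        word >>= width
    return fields


example = pack(addr=0x1FF, jam=0b001, b=5)
```

Summing the widths gives the size of one control store word.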

Example micro-architecture: Mic-1

Example microarchitecture: Mic-1 • sequencer: executes microinstructions • Two tasks: • set control signals for system • determine next m.i. to execute • control store: contains m.i. for interpreting ISA instns. • each instn a 36-bit word like [4.5] • each m.i specifies its successor • MPC: MicroProgram Counter • 9-bit address of next m.i. to execute • MIR: MicroInstruction Register • 36-bit m.i. being executed • Note that bits in MIR may directly control other parts of the circuit • eg. C

Mic-1 operation cycle • Basic ALU cycle: • 1. set up the inputs to the ALU • 2. let the ALU do its computation • 3. store the results • Clock cycles for Mic-1 • 1. MIR enabled (during subcycle w) • 2. MIR signals control data path (B bus; note H always enabled) (subcycle x) • 3. B and H inputs are stable, and ALU computes its output; shifter finishes; N, Z bits stable (subcycle y) • 4. shifter, N, Z outputs loaded from C bus into registers • rising clock edge determines end • MIR is reloaded and calculated at this point as well • Memory read is initiated at end too • Note that all the above will complete in 1 cycle • microinstructions can specify all these operations in parallel

Mic-1 sequencing • First, 9-bit next addr field copied into MPC • JAM inspected: • 000 = use MPC as it is • if JAMN (or JAMZ) is set, then the N bit (or Z bit) is ORed with the high bit of MPC • hence next address is either: MPC, or MPC with its high bit ORed with 1 • JMPC set: MBR byte ORed with low byte of NextAddr field • permits multiway jumps • can quickly branch to instn for just-loaded opcodes (ie. opcode number = address in control store!)
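The next-address rules above can be written out as a small function. This is a sketch under the assumptions stated on the slide (9-bit MPC, the three JAM bits named JAMN, JAMZ and JMPC); the function name and parameter names are invented here.

```python
def next_mpc(next_addr: int, jam_n: bool, jam_z: bool, jmpc: bool,
             n: bool, z: bool, mbr: int) -> int:
    """Compute the 9-bit address of the next microinstruction."""
    high_bit = (next_addr >> 8) & 1
    # JAMN / JAMZ: OR the ALU's N or Z flag into the high bit.
    if (jam_n and n) or (jam_z and z):
        high_bit = 1
    low_byte = next_addr & 0xFF
    # JMPC: OR the MBR byte into the low 8 bits (multiway jump).
    if jmpc:
        low_byte |= mbr & 0xFF
    return (high_bit << 8) | low_byte
```

With JAMZ set, the two possible successors differ only in the high bit, for example 0x092 and 0x192. With JMPC set and a NextAddr of 0, the opcode held in MBR becomes the control store address directly.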

Microinstructions and notation • As in assembler programming, helps to use higher-level notation instead of raw numeric m.i. fields • can specify everything that happens in 1 clock cycle: • permits parallelism: eg. prefetch next instns • Notation: high-level, but directly translatable to single m.i.’s • Examples: • SP=SP+1: incr SP by 1 • MDR = SP: copy SP into MDR • MDR = SP+H; rd : add SP and H, save in MDR, and initiate a read • SP=MDR=SP+1: incr SP, load into both MDR, SP

Microinstructions and notation • Memory takes 2 cycles: • MAR=SP; rd : assign value into MDR • (another instn) • memory ready now! • next addresses: assume it is the labeled next m.i. after current one (unless a conditional jump) • if (Z) goto L1; else goto L2 : sets JAMZ • L1 and L2 have the same low 8 bits (set by assembler) • Summary of legal operations on operands:

Example M.I. implementation: IJVM • A stack-based virtual machine that Mic-1 is designed to implement. • All instructions access the stack: no general registers are used by compiler • eg. parameter passing [4.8] • eg. arithmetic [4.9] • Recall: • JVM instruction formats: [5.15] • Java memory usage, registers: [4.10] • Complete instruction set: [4.11] • Example translated code: [4.14]

JVM Instruction Formats

Memory area of IJVM

IJVM Instruction Set

Translating Java to IJVM

Implementation (cont) • See overheads (book page 234-236) • Note: • each m.i. contains address of next instn • micro-assembler labels all instns appropriately, and must put them in right control store addresses (equiv. to opcode) • the sequenced instns may reside in any free area of control store! Microassembler automatically sets ‘next address fields’. • only explicit ‘goto’s will override this sequencing • Two parts: • 1. fetch next byte for next instn (done at Main1) • 2. branch to that opcode address and carry out instruction • Fetching instructions (Main1) • PC always points to next instruction in Java application program • can be reset by branches (see goto5, T, F,...) • When Main1 is executed, the next opcode is assumed to be ready. The fetch at Main1 is for the next opcode. Hence instns must fetch it if necessary (eg. see bipush2)

Implementation (cont) • Example 1: iadd (“pop 2 words from stack, push their sum”) • iadd1: reads next-to-top word in stack (TOS register already contains top of stack word); bumps down the SP for writing result • iadd2: sets TOS ready for addition (put in H) • iadd3: add next-to-top value (read in iadd1) to H, update TOS, save result in MDR for writing • Example 2: dup (“copy top stack word and push it”) • dup1: incr SP pointer, copy to MAR • dup2: save TOS (top stack word) to new SP, write it • note: can’t write it in dup1, because both SP and MDR must be updated thru data path, and not both at once
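The three iadd steps can be traced with a small model. This is a sketch only: registers and stack memory are dictionaries, the memory read started in iadd1 is simply assumed to have arrived by the end of iadd2, and the function name is hypothetical. It follows the slide in treating a pop as a decrement of SP.

```python
MASK32 = 0xFFFFFFFF


def iadd(regs: dict, stack_memory: dict) -> None:
    """Pop two words, push their sum; TOS caches the top-of-stack word."""
    # iadd1: bump SP down, point MAR at the next-to-top word, start the read.
    regs["SP"] -= 1
    regs["MAR"] = regs["SP"]
    pending_read = regs["MAR"]

    # iadd2: put the old top of stack in H; the read completes meanwhile.
    regs["H"] = regs["TOS"]
    regs["MDR"] = stack_memory[pending_read]

    # iadd3: add, update TOS, and write the sum back through MDR.
    result = (regs["MDR"] + regs["H"]) & MASK32
    regs["MDR"] = regs["TOS"] = result
    stack_memory[regs["MAR"]] = regs["MDR"]


regs = {"SP": 11, "TOS": 4, "MAR": 0, "MDR": 0, "H": 0}
stack_memory = {10: 3, 11: 4}   # word 11 is the top of the stack
iadd(regs, stack_memory)
```

After the call, SP points at word 10, which now holds the sum, and TOS holds the same value.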

Implementation (cont) • Example 3: goto offset (“unconditional branch”) • [Fig 4.22] • goto1: save addr of opcode to OPC (old PC) • goto2: get the 2nd byte of offset (1st byte already in MBR) • goto3: shift 1st byte left 8 bits • goto4: OR low byte into high byte • goto5: add 16-bit offset to (old) PC; get next opcode • goto6: goto Main1 • Note: pause needed in goto6 (must wait 2 extra cycle)
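The offset arithmetic in goto3 to goto5 can be sketched as below. Treating the first offset byte as signed (so that backward branches work) is an assumption of this example, as is the function name; the slide only says that a 16-bit offset is added to the old PC.

```python
MASK32 = 0xFFFFFFFF


def branch_target(opc: int, offset_byte1: int, offset_byte2: int) -> int:
    """Combine two offset bytes and add them to OPC (address of the opcode)."""
    # Assumption: the high byte is sign-extended before shifting.
    signed_high = offset_byte1 - 256 if offset_byte1 & 0x80 else offset_byte1
    h = signed_high << 8            # goto3: shift 1st byte left 8 bits
    h |= offset_byte2 & 0xFF        # goto4: OR low byte into high byte
    return (opc + h) & MASK32       # goto5: add 16-bit offset to old PC
```

For example, offset bytes 0x01 0x02 give +258, and 0xFF 0xFC give -4.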

Improving performance • 1. Faster clock, transistors, electrical circuits • 2. simpler organization yields shorter clock cycles • eg. get rid of (B bus) decoder • 3. Merge interpreter loop with microcode (pt 2) • [4.23], [4.24] • saves extra cycles if done in all instns • significant speedup! • 4. Three buses • [4.25], [4.26] • reduces need for separate instns to load H reg

2 Bus vs. 3 Bus

Improving performance • 5. Instruction fetch unit [4.27] • in Mic-1, ALU is used to increment PC and fetch instns • this uses up instn. cycles • IFU can be used: • 1. pre-fetches all instns outside of main data path • 2. pre-fetches operands: if they are required, they are there (else garbage, but ignored anyway)

Fetch Unit

Improving performance • Instruction fetch unit (cont) • shift register: always loaded with next bytes from memory • MBR1 (1 byte, as before); and new MBR2 (2 bytes) • values from shift reg dumped into both MBR1, MBR2 after every instn read; if needed, they are quickly put onto data path as req’d • need some fetching logic to know when to read more bytes into shift register, when to refresh MBR1, MBR2 • IMAR: separate memory addr reg (separate from MAR) • own dedicated incrementer (no need for ALU) • IFU must keep PC incremented properly, depending on instn length (if MBR1, MBR2 used) • branches may reset PC as well (from C)

Improving performance • Mic-2: • A, B buses • IFU • new IJVM [4.30, See overheads] • smaller, faster • MBR1 always has next opcode (due to IFU)

Mic-2

Improving performance: 6. Pipelining • divide instn. execution into modular steps and carry out different steps for seql. instns simultaneously • “instruction-level parallelism” • superscalar: single pipeline with parallel functional units • most instns take more than 1 cycle to complete • with pipelining: n instns in n cycles • To implement it: [4.31] • add latch to A, B, C buses • they keep values stable during sub-cycles: can use values in 3 sections of the data path • (i) loading before ALU (A, B) • (ii) doing ALU, shift, and loading C latch • (iii) storing C back into registers
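The benefit of pipelining can be estimated with a simple cycle count. The sketch below assumes an ideal pipeline with no stalls or branches, where every instruction passes through the same number of one-cycle stages (three here, matching sections (i) to (iii)). The function names are made up for the example.

```python
def unpipelined_cycles(instructions: int, stages: int) -> int:
    """Each instruction finishes all its stages before the next one starts."""
    return instructions * stages


def pipelined_cycles(instructions: int, stages: int) -> int:
    """Fill the pipeline once, then complete one instruction per cycle."""
    if instructions == 0:
        return 0
    return stages + instructions - 1


STAGES = 3  # (i) load A/B latches, (ii) ALU and shift, (iii) store C
```

For 100 instructions this gives 300 cycles without pipelining and 102 with it, which is why the slide rounds the pipelined case to n instructions in about n cycles.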

Mic-3
