The Microarchitecture Level

The Microarchitecture Level Chapter 4

Implementation of IJVM Using the Mic-1 • TOS register • At the beginning and end of each instruction, TOS contains the value of the memory location pointed to by SP, the top word on the stack. • Keeping this value in TOS saves memory references in many instructions. • It can also cost extra memory operations in a few instructions, e.g. after POP, the new top-of-stack word must be fetched from memory and put in TOS. • OPC register • It is a temporary register. • It is used to save the address of the opcode of a branch instruction while PC is incremented to access the parameters. • The microprogram has a main loop that fetches, decodes and executes IJVM instructions. • The microinstruction at the main loop branches to the microcode that executes the IJVM instruction and starts the fetch of the byte following the opcode, which may be an operand byte or the next opcode.

Implementation of IJVM Using the Mic-1 • The Need for the NEXT_ADDRESS field • All the control store addresses corresponding to opcodes must be reserved for the first word of the corresponding instruction interpreter. • For example, the code that interprets POP starts at 0x57 and the code that interprets DUP starts at 0x59. • The code for POP has 3 microinstructions, so if they were placed in consecutive words, they would interfere with the start of DUP. • Because each microinstruction names its successor in NEXT_ADDRESS, the remaining words of an interpreter can be placed anywhere in the control store. • How the interpreter works • Assume MBR contains 0x60, the opcode for IADD. • The microinstruction in the main loop does the following: • 1. Increment the PC, so that it contains the address of the next byte after the opcode. • 2. Initiate a fetch of the next byte into MBR. This byte might be an operand or the next opcode. • 3. Perform a multiway branch to the address contained in MBR. • The byte fetched in step 2 will be available by the start of the third microinstruction.
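The three steps of the main-loop microinstruction can be sketched in Python. This is only an illustrative model, not Mic-1 microcode: the function name `main1`, the `regs` dictionary, the `in_flight` slot that stands for a fetch that has started but not yet reached MBR, and the two-byte `method_area` are all assumptions made for the example.

```python
IADD, POP = 0x60, 0x57  # IJVM opcodes named on the slide

def main1(regs, method_area):
    """Model of the main-loop microinstruction: PC=PC+1; fetch; goto(MBR)."""
    next_mpc = regs["MBR"]                       # MBR still holds the current opcode
    regs["PC"] += 1                              # 1. PC addresses the byte after the opcode
    regs["in_flight"] = method_area[regs["PC"]]  # 2. fetch started; MBR not updated yet
    return next_mpc                              # 3. multiway branch on the opcode value

method_area = [IADD, POP]                        # hypothetical instruction stream
regs = {"PC": 0, "MBR": IADD, "in_flight": None}
next_mpc = main1(regs, method_area)
print(hex(next_mpc), regs["PC"], hex(regs["in_flight"]))
```

The branch target is the opcode itself, which is why the control store word at each opcode's address must hold the first microinstruction of that opcode's interpreter.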

Implementation of IJVM Using the Mic-1 • IADD instruction • The main loop branches to the microinstruction labeled iadd1. • 1. The TOS is already present. The next-to-top word of the stack is fetched from memory. • 2. The TOS must be added to the next-to-top word fetched from memory. • 3. The result, which is to be pushed on the stack, must be stored back into memory, as well as stored in the TOS register. • BIPUSH instruction • The byte following the opcode is to be interpreted as a signed integer. • This byte, fetched into MBR in Main1, is sign-extended to 32 bits and pushed onto the top of the stack. • Before returning to Main1, PC is incremented and the next opcode is fetched. • Figure 4-18. The BIPUSH instruction format.
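A minimal sketch of how BIPUSH and IADD interact with the cached TOS value follows. The class `Stack`, its field names, and the helper functions are hypothetical, and the model works at the level of whole IJVM instructions rather than individual microinstructions.

```python
def to_signed32(x):
    """Wrap an integer to a signed 32-bit value, as a 32-bit data path would."""
    x &= 0xFFFFFFFF
    return x - 0x1_0000_0000 if x & 0x8000_0000 else x

def sign_extend8(byte):
    """Interpret an 8-bit value (0..255) as a signed integer (-128..127)."""
    return byte - 0x100 if byte & 0x80 else byte

class Stack:
    def __init__(self, size=16):
        self.mem = [0] * size
        self.sp = 0
        self.tos = 0                      # cached copy of mem[sp]

    def bipush(self, byte):
        self.sp += 1
        self.tos = sign_extend8(byte)     # operand byte is signed
        self.mem[self.sp] = self.tos      # write to memory and to TOS

    def iadd(self):
        self.sp -= 1
        next_to_top = self.mem[self.sp]   # the only memory read: TOS is already cached
        self.tos = to_signed32(self.tos + next_to_top)
        self.mem[self.sp] = self.tos      # result goes to memory and to TOS

s = Stack()
s.bipush(0x05)   # pushes 5
s.bipush(0xFE)   # pushes -2
s.iadd()
print(s.tos, s.sp)
```

Because TOS already holds the top word, `iadd` needs one memory read instead of two.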

Implementation of IJVM Using the Mic-1 • Figure 4-19. (a) ILOAD with a 1-byte index. (b) WIDE ILOAD with a 2-byte index. • ILOAD instruction • ILOAD has an unsigned index byte following the opcode, which identifies the word in the local variable space that is to be pushed onto the stack. • It requires both a read (to obtain the local variable) and a write (to push it onto the stack). • Because the byte is unsigned, the upper 24 bits of the B bus are supplied with zeros. • ISTORE instruction • A word is removed from the top of the stack and stored at the location specified by the sum of LV and the index contained in the instruction. • To access a variable anywhere in the complete local variable space, a special opcode WIDE, known as a prefix byte, is placed before the ILOAD or ISTORE opcode. • With the prefix, the definitions of ILOAD and ISTORE are modified to use a 16-bit index.
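The effect of ILOAD and ISTORE on memory and on the SP, LV and TOS registers can be sketched as below. The function names, the `regs` dictionary and the memory size are assumptions for illustration; the model treats each instruction as a single step and ignores the microinstruction timing.

```python
def iload(mem, regs, index_byte):
    """Push the local variable at LV + index onto the stack."""
    index = index_byte & 0xFF              # unsigned index: upper bits are zero
    value = mem[regs["LV"] + index]        # read the local variable
    regs["SP"] += 1
    mem[regs["SP"]] = value                # write it to the new top of stack
    regs["TOS"] = value

def istore(mem, regs, index_byte):
    """Pop the top word and store it at LV + index."""
    index = index_byte & 0xFF
    mem[regs["LV"] + index] = regs["TOS"]  # TOS already holds the word to store
    regs["SP"] -= 1
    regs["TOS"] = mem[regs["SP"]]          # refetch the new top of stack

mem = [0] * 16
regs = {"LV": 2, "SP": 8, "TOS": 0}
mem[regs["LV"] + 3] = 42                   # local variable 3 holds 42
iload(mem, regs, 3)
istore(mem, regs, 1)                       # copy it into local variable 1
print(mem[regs["LV"] + 1], regs["SP"])
```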

Implementation of IJVM Using the Mic-1 • WIDE ILOAD/ISTORE instructions • WIDE branches to wide1, which fetches the first byte after the opcode. • Since WIDE ILOAD requires different microcode than ILOAD, a second multiway branch is done at wide1 that ORs 0x100 with the opcode while putting it into MPC. • The interpretation of WIDE ILOAD therefore starts at 0x115 (instead of 0x15). • In this way every WIDE opcode starts at an address 256 words higher in the control store than the corresponding regular opcode. • The index is constructed by concatenating the two index bytes. • From here the operation is the same as iload3 to iload5 of ILOAD, so only a simple branch to iload3 is required. • The same situation occurs for WIDE ISTORE. • LDC_W instruction • It loads a constant from the constant pool. • It has a 16-bit unsigned offset. • It is indexed off CPP rather than LV.
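The two address calculations described above, the control-store entry point of a WIDE instruction and the 16-bit index, are small enough to check directly. In this sketch the function names are hypothetical; the ISTORE opcode value 0x36 is the standard IJVM encoding, and the first index byte is taken as the high-order byte.

```python
ILOAD, ISTORE = 0x15, 0x36

def wide_entry(opcode):
    """Control-store address where the WIDE variant of an opcode starts."""
    return opcode | 0x100                    # 256 words above the regular entry

def wide_index(first_byte, second_byte):
    """Concatenate two unsigned index bytes into a 16-bit index."""
    return (first_byte << 8) | second_byte   # first byte is the high-order byte

print(hex(wide_entry(ILOAD)), hex(wide_entry(ISTORE)), wide_index(0x01, 0x2C))
```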

Implementation of IJVM Using the Mic-1 • Figure 4-20. The initial microinstruction sequence for ILOAD and WIDE ILOAD. The addresses are examples.

Implementation of IJVM Using the Mic-1 • IINC instruction • Increments a local variable by a constant using two operands, each of 1 byte • INDEX specifies the offset from the beginning of the local variable frame. • It reads the variable, increments it by CONST, a value contained in the instruction, and stores it back in the same location. • CONST is a signed 8-bit constant, in the range –128 to +127. • Figure 4-21. The IINC instruction has two different operand fields
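A short sketch of IINC, assuming a flat word-addressed memory list and a hypothetical function name `iinc`. INDEX is treated as unsigned and CONST as a signed byte, as described above; 32-bit overflow is ignored to keep the example small.

```python
def iinc(mem, lv, index_byte, const_byte):
    """Add the signed 8-bit CONST to the local variable at LV + INDEX."""
    const = const_byte - 0x100 if const_byte & 0x80 else const_byte  # -128..+127
    mem[lv + index_byte] += const      # read, add, write back to the same location
    return mem[lv + index_byte]

mem = [0] * 8
lv = 2
mem[lv + 1] = 10
iinc(mem, lv, 1, 0xFD)                 # 0xFD is -3 as a signed byte
print(mem[lv + 1])
```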

Implementation of IJVM Using the Mic-1 • Figure 4-22. The situation at the start of various microinstructions. (a) Main1. (b) goto1. (c) goto2. (d) goto3. (e) goto4. • GOTO instruction • The address of the next instruction is found by adding the signed 16-bit offset to the address of the opcode. • The offset is relative to the value that PC had at the start of instruction decoding, not the value after the 2 offset bytes have been fetched. • The offsets used in the GOTO instruction are signed 16-bit values, with a minimum of –32768 and a maximum of +32767.
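The branch-target calculation can be written out as follows. The function name and the example addresses are hypothetical; the point is that the offset is applied to the address of the GOTO opcode (the value saved in OPC), not to the PC after the offset bytes have been consumed.

```python
def goto_target(opcode_address, offset_byte1, offset_byte2):
    """Target = address of the GOTO opcode + signed 16-bit offset."""
    offset = (offset_byte1 << 8) | offset_byte2
    if offset & 0x8000:                 # sign-extend the 16-bit value
        offset -= 0x10000
    return opcode_address + offset

print(goto_target(100, 0x00, 0x10))     # forward branch
print(goto_target(100, 0xFF, 0xFB))     # backward branch, offset -5
```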

Implementation of IJVM Using the Mic-1 • Conditional branch instructions: IFLT, IFEQ, IF_ICMPEQ • The first two pop the top word from the stack, branching if the word is less than zero or equal to zero, respectively. • IF_ICMPEQ pops the top two words off the stack and branches if and only if they are equal. • In iflt4 the word saved in OPC is run through the ALU without being stored, and the N bit is latched and tested. • If the test is successful a branch takes place to T, and to F otherwise. • In IFEQ, the Z bit is tested instead of the N bit. • The assembler for MAL has to make sure that the addresses of T and F differ only in the leftmost bit. • In IF_ICMPEQ, two operands are read.
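The requirement that T and F differ only in the leftmost address bit follows from how the Mic-1 forms the next microinstruction address: the tested condition bit is ORed into the high-order bit of the 9-bit address. The sketch below assumes that mechanism; the function name and the example address 0x02 for F are made up.

```python
HIGH_BIT = 0x100   # leftmost bit of a 9-bit control-store address

def next_mpc(next_address, jamn, jamz, n, z):
    """OR the selected ALU condition bit into the high bit of NEXT_ADDRESS."""
    taken = (jamn and n) or (jamz and z)
    return next_address | (HIGH_BIT if taken else 0)

F = 0x02            # hypothetical address of the fall-through microcode
T = F | HIGH_BIT    # the assembler must place T exactly here

print(hex(next_mpc(F, jamn=True, jamz=False, n=True, z=False)))   # IFLT, word < 0
print(hex(next_mpc(F, jamn=True, jamz=False, n=False, z=False)))  # IFLT, word >= 0
```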

Implementation of IJVM Using the Mic-1 • INVOKEVIRTUAL and IRETURN instructions • The INVOKEVIRTUAL instruction invokes a method by using a 16-bit offset to determine the address of the method to be invoked. • The offset is an offset into the constant pool. • The first two bytes of each method give the number of parameter words (including OBJREF). • The second two bytes give the size of the local variable frame. • So that the machine can be restored to its previous state, the old PC and old LV values are stored immediately above the newly created local variable frame. • IRETURN uses the address stored in the word that LV points to in order to retrieve the link information. • Then it restores SP, LV and PC to their previous values and copies the return value from the top of the current stack onto the top of the original stack.
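A rough sketch of the frame bookkeeping is given below. It assumes the usual IJVM layout in which the OBJREF slot at the base of the new frame is overwritten with a link pointer to the saved PC; the function names, argument names and memory size are hypothetical, and the constant-pool lookup and the 4-byte method header are left out (the caller passes `n_params` and `n_locals` directly).

```python
def invokevirtual(mem, sp, lv, return_pc, n_params, n_locals):
    """Build the callee frame; returns the new (sp, lv)."""
    new_lv = sp - n_params + 1             # slot of OBJREF, base of the new frame
    link = new_lv + n_params + n_locals    # first word above params and locals
    mem[new_lv] = link                     # OBJREF slot now holds the link pointer
    mem[link] = return_pc                  # old PC
    mem[link + 1] = lv                     # old LV
    return link + 1, new_lv

def ireturn(mem, sp, lv):
    """Tear down the frame; returns the restored (sp, lv, pc)."""
    return_value = mem[sp]
    link = mem[lv]                         # link pointer found via LV
    old_pc, old_lv = mem[link], mem[link + 1]
    mem[lv] = return_value                 # result lands on top of the caller's stack
    return lv, old_lv, old_pc

mem = [0] * 32
sp, lv = 12, 2                             # caller has pushed OBJREF (11) and one parameter (12)
sp, lv = invokevirtual(mem, sp, lv, return_pc=0x40, n_params=2, n_locals=1)
sp += 1
mem[sp] = 99                               # callee pushes its return value
sp, lv, pc = ireturn(mem, sp, lv)
print(sp, lv, hex(pc), mem[sp])
```

After the return, SP is one word above where it was before OBJREF and the parameter were pushed, and that word holds the return value.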

Design of the Microarchitecture Level • Speed versus Cost • Given a circuit technology and an ISA, there are three basic approaches for increasing the speed of execution: • 1. Reduce the number of clock cycles needed to execute an instruction. • 2. Simplify the organization so that the clock cycle can be shorter. • 3. Overlap the execution of instructions. • Path length: the number of clock cycles needed to execute a set of operations. • For example, by adding an incrementer to PC, we no longer have to use the ALU to advance PC, which eliminates cycles. • However, this capability helps less than it sounds, because in most instructions a read operation is also performed during the cycle in which PC is incremented, and the next step has to wait for the data from memory anyway. • To speed up instruction fetching, it is important to overlap the execution of instructions. • Separating the circuitry for fetching instructions (the 8-bit memory port) from the main data path makes it possible to fetch opcodes in advance.

Speed versus Cost • Simple overlap of instruction fetch and execution is very effective in speeding up execution. • In terms of cost, bigger, more complex chips are much more expensive than smaller, simpler ones. • There are many fast circuit designs for an adder, but they may need more space and thus cost more. • The length of a clock cycle is decided by the sequence of operations that must be performed serially in a single clock cycle. • The amount of decoding performed influences the length of the clock cycle. • For example, the decoder for the B bus adds delay to the critical path. • For a high-performance implementation, using a decoder is probably not a good idea; for a low-cost one, it might be.

Reducing the Execution Path Length • Merging the Interpreter Loop with the Microcode • The main loop can be overlapped with the previous instruction, and in some cases it can be reduced to nothing. • Consider each sequence of microinstructions that terminates by branching to Main1. • At each of these places, the main loop microinstruction can be placed at the end of the sequence, with the multiway branch now replicated at many places.

Label   Operations                   Comments
pop1    MAR=SP=SP-1; rd              Read in next-to-top word on stack
pop2                                 Wait for new TOS to be read from memory
pop3    TOS=MDR; goto Main1          Copy new word to TOS
Main1   PC=PC+1; fetch; goto(MBR)    MBR holds opcode; get next byte; dispatch

Figure 4-23. Original microprogram sequence for executing POP.

Reducing the Execution Path Length • In Figure 4-24, the sequence has been reduced to three microinstructions by merging the main-loop microinstruction into it. • The end of the sequence branches directly to the specific code for the subsequent instruction. • This trick reduces the execution time of the next instruction by one cycle, so that, for example, a subsequent IADD goes from four cycles to three. • It is equivalent to speeding the clock from 250 MHz (4 nsec microinstructions) to 333 MHz (3 nsec microinstructions) for free.

Label      Operations               Comments
pop1       MAR=SP=SP-1; rd          Read in next-to-top word on stack
Main1.pop  PC=PC+1; fetch           MBR holds opcode; fetch next byte
pop3       TOS=MDR; goto(MBR)       Copy new word to TOS; dispatch on opcode

Figure 4-24. Enhanced microprogram sequence for executing POP.
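The 250 MHz versus 333 MHz comparison is simple arithmetic, worked through below for an instruction whose cost drops from four microinstructions to three. The variable names are made up for this sketch.

```python
def period_ns(freq_mhz):
    """Clock period in nanoseconds for a frequency in MHz."""
    return 1000.0 / freq_mhz

cycle = period_ns(250)           # 4 ns per microinstruction
time_before = 4 * cycle          # four microinstructions, including Main1
time_after = 3 * cycle           # three microinstructions after merging

# Clock that would run the original four microinstructions in the new time:
equivalent_period = time_after / 4
equivalent_mhz = 1000.0 / equivalent_period

print(time_before, time_after, equivalent_period, round(equivalent_mhz))
```

The saving is one 4 ns cycle per instruction, the same as running four microinstructions at 3 ns each.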

Reducing the Execution Path Length • A Three-bus Architecture • The ALU can have two full input buses, an A bus and a B bus, giving three buses in all. • All (or most) of the registers should have access to both input buses. • The advantage is that it becomes possible to add two registers in one cycle.

Label    Operations                   Comments
iload1   H=LV                         MBR contains index; copy LV to H
iload2   MAR=MBRU+H; rd               MAR = address of local variable to push
iload3   MAR=SP=SP+1                  SP points to new top of stack; prepare write
iload4   PC=PC+1; fetch; wr           Inc PC; get next opcode; write top of stack
iload5   TOS=MDR; goto Main1          Update TOS
Main1    PC=PC+1; fetch; goto(MBR)    MBR holds opcode; get next byte; dispatch

Figure 4-25. Mic-1 code for executing ILOAD.

Reducing the Execution Path Length • A Three-bus Architecture

Label    Operations                   Comments
iload1   MAR=MBRU+LV; rd              MAR = address of local variable to push
iload2   MAR=SP=SP+1                  SP points to new top of stack; prepare write
iload3   PC=PC+1; fetch; wr           Inc PC; get next opcode; write top of stack
iload4   TOS=MDR                      Update TOS
iload5   PC=PC+1; fetch; goto(MBR)    MBR holds opcode; get next byte; dispatch

Figure 4-26. Three-bus code for executing ILOAD.

• An Instruction Fetch Unit • For every instruction the following common operations may occur: • 1. The PC is passed through the ALU and incremented. • 2. The PC is used to fetch the next byte in the instruction stream. • 3. Operands are read from memory. • 4. Operands are written to memory. • 5. The ALU does a computation and the results are stored back.
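To make the saving from the three-bus design concrete, the two ILOAD sequences from Figures 4-25 and 4-26 can be listed and counted. This is only a bookkeeping sketch; the list names are hypothetical and the strings are copied from the figures.

```python
MIC1_ILOAD = [
    "H=LV",                        # extra cycle: copy LV to H before adding
    "MAR=MBRU+H; rd",
    "MAR=SP=SP+1",
    "PC=PC+1; fetch; wr",
    "TOS=MDR; goto Main1",
    "PC=PC+1; fetch; goto(MBR)",   # Main1
]

THREE_BUS_ILOAD = [
    "MAR=MBRU+LV; rd",             # two registers added in a single cycle
    "MAR=SP=SP+1",
    "PC=PC+1; fetch; wr",
    "TOS=MDR",
    "PC=PC+1; fetch; goto(MBR)",   # main-loop work merged into the sequence
]

saved = len(MIC1_ILOAD) - len(THREE_BUS_ILOAD)
print(len(MIC1_ILOAD), len(THREE_BUS_ILOAD), saved)
```

The one-cycle saving comes from dropping `H=LV`: with two input buses, LV and MBRU can both reach the ALU directly.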

Reducing the Execution Path Length • If an instruction has additional fields (for operands), each field must be explicitly fetched, 1 byte at a time, and assembled before it can be used. • Fetching and assembling a field ties up the ALU for at least one cycle per byte to increment the PC, and then again to assemble the resulting index or offset. • In many cases the ALU is simply used as a path to copy a value from one register to another. • These cycles might be eliminated by introducing additional data paths that do not go through the ALU. • In the Mic-1, an IFU (Instruction Fetch Unit) can independently increment PC and fetch bytes from the byte stream before they are needed. • This unit requires only an incrementer, a circuit far simpler than a full adder. • An IFU can also assemble 8- and 16-bit operands so that they are ready for immediate use whenever needed.

Reducing the Execution Path Length • There are two ways to accomplish the assembling of bytes: • 1. The IFU can actually interpret each opcode, determining how many additional fields must be fetched, and assemble them into a register ready for use by the main execution unit. • 2. The IFU can take advantage of the stream nature of the instructions and make available at all times the next 8- and 16-bit pieces, whether or not doing so makes sense. The main execution unit can then ask for whatever it needs. • The second scheme is shown in Figure 4-27. • Figure 4-27. A fetch unit for the Mic-1.

Reducing the Execution Path Length • There are two MBRs: the 8-bit MBR1 and the 16-bit MBR2. • The IFU keeps track of the most recent byte or bytes consumed by the main execution unit. • The IFU automatically senses when MBR1 is read, prefetches the next byte, and loads it into MBR1 immediately. • MBR1 has two interfaces to the B bus: MBR1 (sign-extended to 32 bits) and MBR1U (zero-extended). • MBR2 provides the same functionality but holds the next 2 bytes. • It also has two interfaces to the B bus: MBR2 and MBR2U. • The IFU fetches a stream of bytes by using a conventional 4-byte memory port, fetching entire 4-byte words ahead of time and loading the consecutive bytes into a shift register. • The shift register maintains a queue of bytes from memory to feed MBR1 and MBR2. • MBR1 holds the oldest byte in the shift register, and MBR2 holds the oldest 2 bytes, which together form a 16-bit integer.

Reducing the Execution Path Length • Whenever MBR1 is read, the shift register shifts right 1 byte. • Whenever MBR2 is read, it shifts right 2 bytes. • Then MBR1 and MBR2 are reloaded from the oldest byte and pair of bytes, respectively. • If there is sufficient room in the shift register for another word, the IFU starts a memory cycle to read it. • The design of the IFU can be modeled by an FSM (Finite State Machine). • Figure 4-28. A finite-state machine for implementing the IFU.
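The queueing behaviour of the fetch unit can be sketched in Python. This is a simplified model under several assumptions: the class `FetchUnit` and its method names are hypothetical, the shift register is modelled as a `deque` with a capacity of 6 bytes, memory reads complete instantly (real hardware would take cycles and might block), and the first byte of a 2-byte field is taken as the high-order byte.

```python
from collections import deque

class FetchUnit:
    WORD = 4                                  # bytes per memory word

    def __init__(self, memory, capacity=6):
        self.memory = memory                  # bytes object standing in for the method area
        self.capacity = capacity
        self.queue = deque()                  # oldest byte at the left
        self.imar = 0                         # index of the next word to fetch
        self._refill()

    def _refill(self):
        # Fetch whole words while there is room for another one.
        while (len(self.queue) + self.WORD <= self.capacity
               and self.imar * self.WORD < len(self.memory)):
            start = self.imar * self.WORD
            self.queue.extend(self.memory[start:start + self.WORD])
            self.imar += 1

    def _take(self, n):
        if len(self.queue) < n:
            raise RuntimeError("IFU would block: not enough bytes queued")
        value = 0
        for _ in range(n):                    # consume the n oldest bytes
            value = (value << 8) | self.queue.popleft()
        self._refill()
        return value

    def mbr1u(self):                          # 8 bits, zero-extended
        return self._take(1)

    def mbr1(self):                           # 8 bits, sign-extended
        v = self._take(1)
        return v - 0x100 if v & 0x80 else v

    def mbr2u(self):                          # 16 bits, zero-extended
        return self._take(2)

    def mbr2(self):                           # 16 bits, sign-extended
        v = self._take(2)
        return v - 0x10000 if v & 0x8000 else v

ifu = FetchUnit(bytes([0x10, 0xFF, 0x01, 0x2C, 0xA7, 0xFF, 0xFB, 0x00]))
values = [ifu.mbr1u(), ifu.mbr1(), ifu.mbr2u(), ifu.mbr1u(), ifu.mbr2()]
print(values, len(ifu.queue))
```

Each read removes the consumed bytes from the front of the queue, and a new word is pulled in as soon as four bytes of space are free.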

Reducing the Execution Path Length • All FSMs consist of two parts: states, shown as circles, and transitions, shown as arcs from one state to another. • The FSM has seven states, corresponding to how many bytes are currently in the shift register, a number between 0 and 6. • There are three different events. • The first event is 1 byte being read from MBR1, which reduces the state by 1. • The second event is 2 bytes being read from MBR2, which reduces the state by 2. • The third event is the arrival of a word from memory, which advances the state by 4; a memory reference to fetch a new word is started whenever the FSM moves into state 0, 1 or 2. • To work correctly, the IFU must block when it is asked to do something it cannot do, for example, supply the value of MBR2 when there is only 1 byte in the shift register and memory is still fetching a new word. • It can also do only one thing at a time, so incoming events must be serialized. • Finally, whenever PC is changed, the IFU must be updated.
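The state machine can be written as a small transition function. The event names `"MBR1"`, `"MBR2"` and `"WORD"`, the function names, and the choice to report blocking through a returned flag are assumptions made for this sketch; the state is simply the number of bytes in the shift register.

```python
MAX_BYTES = 6
DELTA = {"MBR1": -1, "MBR2": -2, "WORD": +4}   # effect of each event on the byte count

def step(state, event):
    """Return (new_state, blocked) for one serialized event."""
    new_state = state + DELTA[event]
    if new_state < 0:
        return state, True                     # not enough bytes: the IFU must block
    if new_state > MAX_BYTES:
        raise ValueError("word arrived with no room in the shift register")
    return new_state, False

def needs_fetch(state):
    """A memory reference is started on entering state 0, 1 or 2."""
    return state <= 2

state = 3
state, blocked = step(state, "MBR2")           # 3 -> 1, fetch is started
print(state, blocked, needs_fetch(state))
print(step(state, "MBR2"))                     # only 1 byte queued: blocked
state, blocked = step(state, "WORD")           # word arrives: 1 -> 5
print(state, needs_fetch(state))
```

Since a fetch is only started from states 0 to 2, an arriving word can never push the count past 6, so the `ValueError` branch marks a situation the design is meant to rule out.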

Reducing the Execution Path Length • The IFU has its own memory address register, IMAR, which has its own incrementer. • The IFU must monitor the C bus so that whenever PC is loaded, the new PC value is also copied into IMAR. • Since the new value in PC may not be on a word boundary, the IFU has to fetch the necessary word and adjust the shift register appropriately. • With the IFU, the main execution unit writes to PC only when it is necessary to change the sequential nature of the instruction byte stream. • The IFU keeps PC current by sensing when a byte or bytes have been read from MBR1 or MBR2, respectively. • PC has a separate incrementer that can increment by 1 or 2 bytes, depending on how many bytes have been consumed. • The trade-off here is more hardware for a faster machine.
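The word-boundary adjustment after a write to PC can be illustrated with one division. The sketch assumes 4-byte words, a byte address in PC, and bytes stored within a word in increasing address order; the function name is hypothetical.

```python
def reload_after_pc_write(new_pc, word_size=4):
    """Return (word to fetch, leading bytes to discard) for a new byte address."""
    word_address, bytes_to_discard = divmod(new_pc, word_size)
    return word_address, bytes_to_discard

print(reload_after_pc_write(0x102))   # branch target in the middle of a word
print(reload_after_pc_write(8))       # target already on a word boundary
```

In the first case the IFU fetches the whole word but must drop its first two bytes so that the byte at address 0x102 becomes the oldest one in the shift register.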

End Chapter 4
