Instruction Set Principles


Presentation Transcript


  1. Instruction Set Principles
  • The ISA - the portion of the machine visible to the programmer or compiler writer
  • Now that most programs are written in high-level languages, the compiler writer has become an important participant in the computer architecture design process
  • In this chapter, we will look at broad comparisons among ISAs and focus on RISC ISAs

  2. Classifying an ISA
  • We classify ISAs by the type of internal CPU storage (registers) they use for operands:
    • stack architecture
    • accumulator architecture
    • general-purpose register architecture - two forms, register-register (load-store) and register-memory
  • Almost all modern computers (designed after 1980) fall into the last category; the others are found mostly in older computers

  3. Example
  • C = A + B

      Stack      Accumulator   Register-Memory   Register-Register
      Push A     Load A        Load R1, A        Load R1, A
      Push B     Add B         Add R1, B         Load R2, B
      Add        Store C       Store C, R1       Add R3, R1, R2
      Pop C                                      Store C, R3

  • Note that for stack and accumulator architectures, the internal storage is implicit. In register-memory, only one register is used. Only in register-register is there a need for several internal registers.

  4. Reasons for GPR architectures
  • Registers are much faster than memory; moving most calculations directly into registers gives the computer greater speed
  • Registers are easier for a compiler to use and can be used more effectively than other forms of internal storage
  • By moving variables into registers, memory access becomes less of a bottleneck
  • Compilers may wish to rearrange instructions; on a stack or accumulator architecture, where operands are implicit, such reordering may result in errors!

  5. How many registers?
  • Compilers will reserve registers for:
    • holding temporary values in an expression
    • parameter passing
    • commonly used variables
  • Are instructions 2-operand or 3-operand?
  • How many operands may be memory addresses (rather than immediate values or registers storing the data)?

  6. Advantages
  • Register-register: simple, fixed-length instructions; simple code-generation model; all instructions take a similar number of clock cycles
  • Register-memory: data can be accessed without first being loaded into the CPU; instruction format tends to be easy to encode; yields good density in program code
  • Memory-memory: compact code; doesn't waste registers on temporary values

  7. Disadvantages
  • Register-register: higher instruction count; because all instructions have the same length, short instructions waste bits in the encoding
  • Register-memory: operands are not equivalent in a binary operation (one source operand's contents are destroyed by the result); encoding both a register number and a memory address may restrict the number of registers; CPI varies depending on operand location
  • Memory-memory: large variation in instruction size; large variation in work per instruction; memory access creates a bottleneck

  8. Memory Addressing
  • 2 things must be defined for operand references:
    • How memory addresses are interpreted
    • How memory addresses are specified
  • Interpretation raises the byte-ordering (and alignment) problem: does the address name the first, second, third, or fourth byte of the word? (see the sketch below)
  • Little Endian - the byte at address x..x00 is the least significant byte
  • Big Endian - the byte at address x..x00 is the most significant byte
  • This is an issue only if accesses can be made at sizes smaller than a word (which is typical in many computers, so it must be addressed)
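
As an aside (not part of the original deck), the byte-ordering convention is easy to observe from C; a minimal sketch:

    #include <stdio.h>
    #include <stdint.h>

    /* Inspect the byte at the lowest address of a 32-bit word to see
       which end of the word the machine stores there. */
    int main(void) {
        uint32_t word = 0x01020304;
        uint8_t first = *(uint8_t *)&word;   /* the byte at address x..x00 */

        if (first == 0x04)
            printf("little endian: lowest address holds the least significant byte\n");
        else
            printf("big endian: lowest address holds the most significant byte\n");
        return 0;
    }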

  9. Addressing Modes
  • In GPR machines, an addressing mode can specify a constant (immediate), a register, a location in memory, or some combination of these
  • There are many different modes (fig 2.5, p 75)
  • Numerous addressing modes can significantly reduce the instruction count of a program
  • However, many of these addressing modes will add to the CPI of the instruction because of the time it takes to compute the effective address

  10. PC-relative addressing - an addressing mode usually used to specify a location in code, not data; often used in branches
  • Data addressing modes include register (value stored in a register), immediate (value given in the instruction), displacement, register indirect, indexed, direct (absolute), memory indirect, autoincrement and autodecrement, and scaled (illustrated below)
  • Most data addressing modes are variations of indirect addressing or displacement (e.g., see the VAX comparison, fig 2.6, p 76)
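
To make the data modes concrete, here is an illustration (not from the deck) of how common C constructs tend to map onto them; the mode named in each comment is a typical compiler choice, and actual selections depend on the ISA and the compiler:

    /* Illustrative mapping from C accesses to addressing modes. */
    int g;                     /* direct (absolute): address fixed at link time */

    void example(int *p, int a[], int i) {
        int t;
        t = 5;                 /* immediate: constant encoded in the instruction */
        t = *p;                /* register indirect: address held in a register */
        t = p[3];              /* displacement: register + small constant offset */
        t = a[i];              /* indexed (or scaled): base register + index register */
        t = g;                 /* direct (absolute) */
        (void)t;               /* silence unused-variable warnings */
    }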

  11. Questions in Addressing Modes
  • Major question for displacement: how big is the displacement? Most of the SPEC92 benchmarks use displacements of no more than 15 bits
  • Major questions for immediate: how often is this mode used and how big is the datum? See figures 2.8 and 2.9 on pages 78 and 79 to answer these questions.

  12. ISA types of instructions
  • Principal categories of instructions:
    • arithmetic (integer) and logical operations
    • data transfer (load, store)
    • control (branch, jump, procedure call/return, trap)
    • system (OS system call, virtual memory)
    • floating point (+, *)
    • decimal (BCD operations)
    • strings (move, compare, search)
    • graphics (pixel operations, compression/decompression)

  13. Control Flow
  • These types of instructions require a destination address, usually given PC-relative (although for procedure returns, the return address might be stored on the run-time stack)
  • Conditional branches
  • Jumps (unconditional)
  • Procedure calls and returns
  • Conditional branches are the most common

  14. PC-Relative Branching
  • Two advantages of this approach:
    • the displacement is usually small, allowing for smaller instructions
    • the code's absolute position need not be known at compile time, allowing easy use of run-time loaded libraries and of virtual functions in objects
    • this property is known as position independence and is not available if a different form of addressing is used
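
A minimal sketch (not from the deck; the function name and the 16-bit offset width are assumed for illustration) of why PC-relative targets are position independent: the offset travels with the code, so the same bytes work wherever the code is loaded:

    #include <stdint.h>

    /* Resolve a PC-relative branch: sign-extend the offset and add it
       to the PC. Nothing here depends on the code's load address. */
    uint32_t branch_target(uint32_t pc, int16_t offset) {
        return pc + (int32_t)offset;
    }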

  15. Branching Questions
  • What is the form of the condition? Usually the comparison is a simple equality or inequality test (many times against 0)
  • These types of comparisons might be treated as a special case
  • See figure 2.14 for mechanisms to implement this
  • What is the distance of the branch? Figure 2.13 shows that most branch displacements fit in fewer than 11 bits

  16. Procedure Calls/Returns
  • Require saving the caller's status (registers)
  • Two conventions:
    • Caller saving - the calling procedure performs the saving
    • Callee saving - the called procedure performs the saving
  • There are cases where one convention is more appropriate (or efficient) than the other
  • More sophisticated compilers will select the appropriate convention based on circumstances to optimize execution speed

  17. Types and Sizes of Operands
  • Integer - half word (2 bytes), full word (4 bytes)
  • Floating point - single precision (1 word), double precision (2 words)
  • Character (1 byte, using ASCII)
  • Character strings (any number of characters)
  • Packed decimal (1 BCD byte holds 2 digits)
  • Unpacked decimal (1 BCD byte holds 1 digit)
  • NOTE: with 64-bit computers coming out, these typical sizes will change
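
As a quick check (an illustration, not part of the deck; C leaves these sizes implementation-defined, so the output varies by target), the C type sizes on a typical 32-bit machine line up with the slide:

    #include <stdio.h>

    /* Print this implementation's operand sizes. */
    int main(void) {
        printf("short:  %zu bytes\n", sizeof(short));   /* half word */
        printf("int:    %zu bytes\n", sizeof(int));     /* full word */
        printf("float:  %zu bytes\n", sizeof(float));   /* single precision */
        printf("double: %zu bytes\n", sizeof(double));  /* double precision */
        printf("char:   %zu byte\n",  sizeof(char));    /* character */
        return 0;
    }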

  18. Encoding an Instruction Set
  • Concerns: what bit codes represent each instruction? These are the op codes
  • How is operand addressing specified? Will there be separate fields of the instruction to specify this, or will it be implied by the op code?
  • Need a compromise between:
    • having enough bits to specify 2 or 3 operands by memory location, register, or displacement
    • wanting same-length instructions
    • not wasting bits in the instruction (op code)

  19. Concerns
  • Want as many registers as possible, BUT the more registers there are, the more bits are needed to address among them
  • Many addressing modes are seldom used; should they be omitted from the ISA?
  • Instruction sizes should be multiples of bytes (for instance, don't want a 20-bit op code; prefer 16 or 24 bits instead)
  • As instruction size (op code) grows, so does the size of the program!

  20. Three Variations on Encoding
  • Variable (e.g., VAX) - the op code includes the number of operands; each operand is specified independently as a mode plus an extra field (used as a displacement, register selection, or whatever)
  • Fixed (e.g., MIPS, SPARC) - fixed number of operand addresses no matter what the op code is (sketched below)
  • Hybrid (e.g., IBM 360/370, 80x86) - different types of instructions based on the op code, with some set number of variations such as 1-operand instructions, 2-operand instructions, etc.
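
As an illustration of fixed encoding (a sketch with assumed field positions, not the actual MIPS or DLX bit layout): because every instruction is one 32-bit word, a decoder always finds each field at the same bit position.

    #include <stdint.h>

    /* Hypothetical fixed 32-bit register-register format:
       6-bit op code | 5-bit rs1 | 5-bit rs2 | 5-bit rd | 11 unused bits */
    uint32_t encode_rr(uint32_t op, uint32_t rd, uint32_t rs1, uint32_t rs2) {
        return (op  & 0x3F) << 26 |
               (rs1 & 0x1F) << 21 |
               (rs2 & 0x1F) << 16 |
               (rd  & 0x1F) << 11;
    }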

  21. VAX example
  • Consider the following VAX instruction:
  • addl3 r1, 737(r2), (r3) - uses 6 bytes in all
  • The addl3 op code (1 byte; "add 32-bit integers, 3 operands") says to add the two operands designated by the first and second operand specifiers and store the result in the operand denoted by the third
    • first field - register r1 (1 byte)
    • second field - base displacement: add 737 to the contents of r2 for the location of the operand (3 bytes: 1 for mode and register, 2 for the displacement)
    • third field - the operand is at the location stored in r3 (1 byte)
  • VAX instruction lengths vary from 1 to 53 bytes!

  22. The Role of the Compiler
  • Since most programming is done in high-level languages today, the compiler plays a very important role
  • In earlier times, the ISA was designed in part to make assembly-language programming easier - for example, by having instructions that do multiple things
  • Now, ISAs are designed to be the target of a compiler - how can compilation be made more efficient?

  23. Goal of Compilers
  • Highest priority is correctness
  • Next priority is the speed of the compiled code
  • Other priorities include:
    • fast compilation
    • useful debugging facilities and support
    • interoperability with other languages
  • A useful goal for a compiler designer is to have the compiler make multiple passes, each pass performing a finer level of optimization

  24. Optimizations:
  • High-level - operates on source code: e.g., procedure inlining, loop transformations (see the example below)
  • Local - optimizes within straight-line code (a basic block or expression)
  • Global - extends local optimizations across branches, optimizes loops
  • Register allocation - optimizes the use of registers, minimizing memory fetches
  • Machine-dependent - takes advantage of the specific architecture
  • See Figure 2.19, page 93
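
For instance (an illustration, not from the deck; function names are made up), procedure inlining works at the source level by substituting a call with the callee's body:

    /* Before inlining: every call pays call/return overhead. */
    static int square(int x) { return x * x; }

    int sum_of_squares(int a, int b) {
        return square(a) + square(b);
    }

    /* After inlining, the compiler effectively generates: */
    int sum_of_squares_inlined(int a, int b) {
        return (a * a) + (b * b);   /* calls replaced by the body */
    }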

  25. Optimization Examples
  • Common subexpression elimination - when a subexpression is used more than once, store its first result in a register to be reused (illustrated below) -- if it is stored in memory instead, the cost of the memory fetch may cancel the gain from saving the expression's result!
  • Graph coloring - an algorithmic technique for determining how values can be distributed among limited resources. For this optimization, it is used to determine which variables can be kept in the available registers. The problem is NP-complete, but heuristics can be applied that work well.
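
A small C illustration of common subexpression elimination (not from the deck; the hand-transformed version shows what the compiler does internally):

    /* Before: the subexpression (i * 4) is computed twice. */
    void scale(int *a, int *b, int i) {
        a[0] = i * 4 + 1;
        b[0] = i * 4 + 2;
    }

    /* After CSE: compute it once and keep it in a register-resident
       temporary; a memory-resident temporary could cost more than it saves. */
    void scale_cse(int *a, int *b, int i) {
        int t = i * 4;      /* shared subexpression, held in a register */
        a[0] = t + 1;
        b[0] = t + 2;
    }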

  26. Phase-ordering
  • One problem with optimizations performed in separate phases is phase-ordering:
    • A transformation done to optimize at one level may directly affect the optimizations possible at another level
    • Example: a procedure is expanded inline at the high level without knowing the size of the resulting code
    • Example: register allocation is performed near the end of the optimization sequence, but subexpression elimination requires the allocation of registers
  • It is sometimes difficult to separate simpler optimizations from transformations performed by the code generator

  27. Impact of Compiler Technology
  • Compiler technology has affected computer architectures by dictating:
    • how variables are allocated and addressed
    • the number and type of registers needed
  • Variable allocation techniques:
    • Stack (local variables)
    • Global data area (global variables, constants)
    • Heap (dynamically allocated variables accessed through pointers)
  • Aliasing - how is it dealt with? (see the sketch below)
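
Aliasing matters because two names may refer to the same storage, which defeats keeping a variable in a register. A small C illustration (not from the deck):

    /* If p and q can alias the same location, the compiler cannot keep
       *p in a register across the store through q: the store may have
       changed it, forcing a reload from memory. */
    int sum_twice(int *p, int *q) {
        int t = *p;     /* load *p */
        *q = 7;         /* may or may not modify *p */
        return t + *p;  /* *p must be re-read unless non-aliasing is proven */
    }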

  28. Helping the Compiler Writer
  • Programs are locally simple but globally complex. Simple translation processes will not provide efficient code. Make the frequent case fast and the rare case correct.
  • Some useful properties:
    • Regularity
    • Provide primitives, not solutions
    • Simplify trade-offs among alternatives
    • Provide instructions that bind the quantities known at compile time as constants

  29. Introduction to DLX
  • To demonstrate the issues described in this chapter, and to provide an ISA for use in future descriptions of an efficient architecture, DLX is introduced:
    • A RISC architecture that fits the various concepts described in this chapter
    • Derived from previous RISC architectures and designed for pipeline efficiency and for efficiency as a compiler target
    • Easy to understand (unlike CISC and some RISC architectures)

  30. DLX as an ISA
  • General-purpose register, load-store architecture with at least 16 general-purpose registers (and separate floating-point registers)
  • Supports displacement, immediate, and register deferred addressing, with address offsets of 12-16 bits and immediate data of 8-16 bits
  • Supports the simple instructions described in section 2.4
  • Supports 8-, 16-, and 32-bit integers and 32- and 64-bit floating point
  • Uses fixed instruction encoding for efficient performance
  • Minimal instruction set

  31. DLX Registers
  • 32 32-bit general-purpose registers R0..R31
  • 32 single-precision floating-point registers F0..F31, which can be used for double precision by taking them in pairs (F0, F2, F4 ... F30)
  • R0 is always 0 (even if a different value is written to it, it still reads as 0) -- used in addressing and in loading immediate values that need a 0
  • Some registers have special purposes, as used in branching and on other occasions

  32. DLX Data types
  • 32-bit words
  • Integers are stored in 2's complement; there are 8-bit, 16-bit, and 32-bit integers
  • Single- and double-precision floating point
  • Half words are provided because they are used in languages such as C
  • DLX operations work on 32-bit integers and on single- and double-precision floating point. All operands are converted from their given format to one of these and converted back afterwards. Bytes and half words are loaded with leading 0's or with the sign bit extended.
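
A sketch of the two byte-load behaviors in C (an illustration, not from the deck; the LB/LBU naming follows the usual RISC convention for sign- vs. zero-extending loads):

    #include <stdint.h>
    #include <assert.h>

    /* Widen a byte into a 32-bit register. */
    int32_t  load_byte_signed(uint8_t b)   { return (int8_t)b; }  /* replicate sign bit */
    uint32_t load_byte_unsigned(uint8_t b) { return b; }          /* leading zeros */

    int main(void) {
        assert(load_byte_signed(0xFF)   == -1);    /* 0xFFFFFFFF */
        assert(load_byte_unsigned(0xFF) == 0xFF);  /* 0x000000FF */
        return 0;
    }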

  33. Addressing Modes
  • All instructions reference 2 or 3 registers
  • 16-bit fields are used for addressing
  • Immediate and displacement addressing modes
  • Register deferred is accomplished using displacement addressing with a displacement of 0
  • Absolute addressing is accomplished using displacement addressing with R0 as the base
  • Big Endian mode is used, with alignment required
  • The instruction formats (3 of them) are given in figure 2.21, page 99

  34. DLX Operations
  • Complete list given in figure 2.25, p. 104
  • Types: load, store, ALU operations, branch and jump, floating-point operations
  • All ALU operations are integer operations
  • Loads and stores can specify any register (whether integer or floating point). Base addresses are specified using integer registers

  35. Load and Store Examples
  • LW R1, 30(R2) -- load into R1 the value stored at memory location 30+Register[R2]
  • LW R1, 1000(R0) -- load into R1 the value stored at memory location 1000 (R0 is always 0)
  • LF F0, 50(R3) -- load floating-point register F0 with the 4 bytes starting at 50+Register[R3]
  • SW R3, 500(R4) -- store the value currently in R3 to memory location 500+Register[R4]
  • SD F0, 40(R3) -- store the double-precision floating-point value in F0 and F1 to the 8 bytes starting at 40+Register[R3]

  36. ALU Operation Examples
  • ADD R1, R2, R3 -- add the contents of R2 and R3, storing the result in R1
  • ADDI R1, R2, #3 -- add 3 to the contents of R2 and store the result in R1
  • LHI R1, #42 -- load 42 into the upper half of R1, storing 0's in the lower half
  • SLT R1, R2, R3 -- set R1 to 1 if R2 < R3 and to 0 otherwise

  37. Branch Examples
  • J name -- jump to address PC + name, where name is a positive or negative 26-bit 2's complement offset
  • JAL name -- same, but store PC+4 in R31 as the return address (for returning from a procedure)
  • JALR R2 -- same, except that the location to jump to is the value stored in R2 (not PC+Reg[R2])
  • BEQZ R4, name -- if R4 = 0, branch to PC+name
  • BNEZ R4, name -- if R4 ≠ 0, branch to PC+name
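
To pin down the register-transfer semantics behind the load, ALU, and branch examples above, a minimal C sketch (a toy model for illustration, not a full simulator; the Regs, Mem, and PC names, the memory size, and the assumption that PC has already been incremented are all made up here):

    #include <stdint.h>

    uint32_t Regs[32];       /* R0..R31; writes to R0 are ignored in real DLX
                                (not enforced in this sketch) */
    uint8_t  Mem[1 << 20];   /* toy memory, no bounds or alignment checks */
    uint32_t PC;             /* assumed already incremented past the instruction */

    /* LW rt, d(rs): load the big-endian word at Regs[rs]+d */
    void lw(int rt, int16_t d, int rs) {
        uint32_t a = Regs[rs] + d;
        Regs[rt] = (uint32_t)Mem[a] << 24 | Mem[a+1] << 16
                 | Mem[a+2] << 8  | Mem[a+3];
    }

    /* ADD rd, rs1, rs2 and ADDI rd, rs, #imm */
    void add (int rd, int rs1, int rs2)    { Regs[rd] = Regs[rs1] + Regs[rs2]; }
    void addi(int rd, int rs, int16_t imm) { Regs[rd] = Regs[rs] + (int32_t)imm; }

    /* BEQZ rs, name: PC-relative branch taken when the register is zero */
    void beqz(int rs, int32_t name) { if (Regs[rs] == 0) PC += name; }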

  38. DLX Effectiveness
  • The simplified nature of DLX means that accomplishing a given task takes more instructions than in many ISAs (for instance, memory values must first be loaded into registers)
  • But the CPI of these operations is lower than in many other ISAs, making up for this
  • Additionally, the fixed size of all DLX instructions makes pre-fetching and pipelining easier to deal with
  • We will see the ease of pipelining DLX in Ch. 3

  39. Fallacies/Pitfalls
  • Pitfall - designing high-level ISA features
  • Pitfall - giving too much semantics to an ISA instruction, which limits how it can be used
  • Fallacy - there is such a thing as a typical program
  • Fallacy - an architecture with flaws cannot be successful
  • Fallacy - you can design a flawless architecture

  40. Conclusions from History
  • Early architectures had limited ISAs due to hardware limitations
  • As technology advanced, ISAs were able to become more and more complex, more closely matching the features of high-level languages
  • In the 70s, the concern was reducing software cost (by making more features available in the ISA)
  • In the 80s, the concern shifted to machine performance, accomplished partly through hardware innovations and partly through ISA and compiler innovations

  41. ISA Design Today
  • Load-store architectures
  • Restrictions on addressing modes to ensure fixed-size instruction lengths
  • Reduce CPI as much as possible (make the common case fast, the rare case correct)
  • Many registers (at least 16, maybe 32)

  42. What to Expect in the Future
  • 64-bit addresses and therefore 64-bit registers; double precision using 128 bits
  • Can load-store architectures improve performance when simulating 80x86 architectures?
  • Replacing conditional branching with conditional completion, and using more forms of branch prediction
  • Improving cache performance on cache misses (e.g., using predictive approaches)
  • Better floating-point implementations
