Instruction Set Principles & Examples (Appendix B)


“Instruction Set Architecture is the structure of a computer that a machine language programmer (or a compiler) must understand to write a correct (timing independent) program for that machine.” (IBM, Introducing the IBM 360, 1964)
“The portion of a computer visible to programmers or compiler writers” (from the text)
• The ISA defines:
– Instructions that the processor can execute
– Data transfer mechanisms: how to access data
– Control mechanisms (branch, jump, call, etc.)
– “Bridge” between programmer/compiler & hardware

Hierarchical view of a computer
ISA is the computer architecture visible at assembly language level.
• Levels, top to bottom: Application; Programming Language; Compiler; ISA; Datapath and Control; Function Units; Transistors, Wires, Pins
• Performance metrics, top to bottom:
– Answers per day/week/month
– Operations per second
– Millions of instructions per second: MIPS
– Millions of FP operations per second: MFLOPS
– Megabytes per second
– Cycles per second (clock rate)

How about JVM or CLR?
• Sun’s JVM (Java Virtual Machine) is an additional interface layer between the program and the ISA.
• Microsoft’s .NET CLR (Common Language Runtime) plays the same role.
• JVM → platform independence
• CLR → language independence
• What is the main advantage of doing this in each case?

Evolution of Instruction Set Architectures
• Single Accumulator (EDSAC 1950)
• Accumulator + Index Registers (Manchester Mark I, IBM 700 series 1953)
• Separation of Programming Model from Implementation (ISA)
– High-level Language Based (stack architecture) (Burroughs B5000 1963)
– Concept of a Processor Family (IBM 360 1964)
• General Purpose Register Machines
– Complex Instruction Sets (VAX, Intel 432 1977-80), leading to CISC (IBM 390, Intel x86, Pentium)
– Load/Store Architecture (CDC 6600, Cray 1 1963-76), leading to RISC (MIPS, SPARC, HP-PA, 88000, IBM RS6000, … 1987)
• Current x86 = mix of CISC & RISC = externally CISC, internally RISC
• Next? VLIW (IA-64), “EPIC”, or …?
– VLIW: Very Long Instruction Word
– EPIC: Explicitly Parallel Instruction Computing

Intel = CISC (externally), RISC (internally)
• “Moore’s Law led Intel to use a RISC instruction set internally while supporting 80x86 instruction set externally”
• “Intel processors use hardware to translate from 80x86 instructions to RISC-like instructions and then execute the translated operations inside the chip”
– This takes a couple of extra cycles (with PLA implementation) and a few hundred thousand transistors
• The above is from the text
• This may actually have some advantages: the external ISA is kept for compatibility, while the internal ISA can be tweaked for each generation (Transmeta)

Evolution of Instruction Sets
• Major advances in computer architecture are typically associated with landmark instruction set designs
– Ex: Stack vs. General Purpose Registers (GPR): 60s & 70s
– Ex: CISC vs. RISC: 80s
– Ex: VLIW (Very Long Instruction Word): 90s
– Ex: EPIC: future?
• Design decisions must take into account:
– Technology (IC): how many transistors on a chip?
– Machine organization: datapath, control, memory
– Programming languages
– Compiler technology
– Operating systems

Design Space of ISA
Five primary dimensions (factors to consider):
• Number of explicit operands: 0, 1, 2, 3
• Operand storage: where besides memory? Accumulator, stack, registers
• Effective address: how is the memory location specified (addressing modes)? PC-relative, indexed, immediate, …
• Type & size of operands: char, int, float, vector, …; 8/16/32/64 bits. How is it specified? Big/little-endian
• Operations: add, sub, mul, jump, call, … How is it specified? Opcodes!

Basic ISA Classes
• Accumulator: earliest computers
    1 address      add A          acc ← acc + mem[A]
    1+x address    addx A         acc ← acc + mem[A + x]
• Stack: popular in the 60s-70s (Burroughs B5000, HP 3000); design closer to high-level languages; now disappeared (because of too much language dependency)
    0 address      add            tos ← tos + next
• General Purpose Register: CISC & RISC
    2 address      add A B        A ← A + B
    3 address      add A B C      A ← B + C
• Load/Store: RISC
    3 address      add Ra Rb Rc   Ra ← Rb + Rc
                   load Ra Rb     Ra ← mem[Rb]
                   store Ra Rb    mem[Rb] ← Ra
• VLIW: many RISC-style instructions in one word
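To make the register-transfer notation concrete, here is a minimal sketch of two toy interpreters, one for a 1-address accumulator machine and one for a 3-address load/store machine, each computing C = A + B. The function names, the tuple-based instruction format, and the use of symbolic memory addresses (instead of register-indirect loads as on the slide) are assumptions made for illustration, not part of any real ISA.

```python
# Toy interpreters for two ISA classes (illustrative only, not a real ISA).

def run_accumulator(program, mem):
    """1-address machine: the accumulator is the implied operand and destination."""
    acc = 0
    for op, addr in program:
        if op == "load":
            acc = mem[addr]            # acc <- mem[A]
        elif op == "add":
            acc = acc + mem[addr]      # acc <- acc + mem[A]
        elif op == "store":
            mem[addr] = acc            # mem[A] <- acc
        else:
            raise ValueError(f"unknown op: {op}")
    return mem


def run_load_store(program, mem, num_regs=4):
    """3-address load/store machine: ALU operations touch registers only."""
    regs = [0] * num_regs
    for op, *args in program:
        if op == "load":               # Ra <- mem[A]
            ra, addr = args
            regs[ra] = mem[addr]
        elif op == "add":              # Ra <- Rb + Rc
            ra, rb, rc = args
            regs[ra] = regs[rb] + regs[rc]
        elif op == "store":            # mem[A] <- Ra
            ra, addr = args
            mem[addr] = regs[ra]
        else:
            raise ValueError(f"unknown op: {op}")
    return mem


# C = A + B on each machine.
ACC_PROGRAM = [("load", "A"), ("add", "B"), ("store", "C")]
LOAD_STORE_PROGRAM = [
    ("load", 1, "A"),
    ("load", 2, "B"),
    ("add", 3, 1, 2),
    ("store", 3, "C"),
]

if __name__ == "__main__":
    print(run_accumulator(ACC_PROGRAM, {"A": 2, "B": 3}))
    print(run_load_store(LOAD_STORE_PROGRAM, {"A": 2, "B": 3}))
```

The accumulator program is shorter, and each of its instructions names only one operand. The load/store program needs one more instruction, but every operand of its add is a register.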

Primary Adv. and Disadv. of Each Class of Machines
• Accumulator (single register)
– Adv: Minimizes internal state of machine (simpler context switching). Short instructions. Why? The operand is implied (as shown in the previous slide).
– Dis: Since the accumulator is the only available temporary storage, memory traffic is highest for this approach.
• Stack (no general purpose registers)
– Adv: Simple model of expression evaluation (reverse Polish). Simplifies the compiler. Short instructions can yield good code density.
– Dis: A stack cannot be randomly accessed. This limitation makes it difficult to generate efficient code. It is also difficult to implement efficiently, since the stack becomes a bottleneck.

Primary Adv. and Disadv. of Each Class of Machines (cont’d)
• Registers
– Adv: Most general, efficient, flexible model for code generation.
– Dis: All operands must be named (e.g. add Ra Rb Rc), leading to longer instructions.
• While many early machines used stack or accumulator-style architectures (why?), more modern machines (designed in the last 10-20 years and still in use) use a general-purpose register architecture.
– Registers are faster than memory: same speed as the CPU
– Registers are easier for compilers to use than other forms such as stacks (e.g. variables and intermediate results can be stored in registers)
– Registers can be used more effectively by compilers than other forms of internal storage such as accumulators

Machine Types

Code for C = A + B for stack machine
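The figure for this slide is not reproduced here. As a stand-in, the following is a minimal sketch of a 0-address stack machine running the usual four-instruction sequence for C = A + B (push A, push B, add, pop C). The interpreter, its tuple-based instruction format, and the name run_stack are hypothetical.

```python
# Toy 0-address stack machine (illustrative only).

def run_stack(program, mem):
    """Execute push/pop/add; add takes both operands from the stack."""
    stack = []
    for instr in program:
        op = instr[0]
        if op == "push":               # push mem[A] onto the stack
            stack.append(mem[instr[1]])
        elif op == "pop":              # pop top of stack into mem[A]
            mem[instr[1]] = stack.pop()
        elif op == "add":              # tos <- tos + next
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
        else:
            raise ValueError(f"unknown op: {op}")
    return mem


STACK_PROGRAM = [("push", "A"), ("push", "B"), ("add",), ("pop", "C")]

if __name__ == "__main__":
    print(run_stack(STACK_PROGRAM, {"A": 2, "B": 3}))
```

The add instruction carries no operand fields at all, which is why stack code can be dense. The cost is that both operands must first be pushed from memory.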

Intel IA-64 128 general purpose registers

How Many Registers?
• Registers are faster than memory, so have as many as possible? No! Why? Answers below:
• One reason registers are faster is that there are fewer of them: small is fast (hardware truism)
• Another is that they are directly addressed (no address calculation)
– More of them means larger specifiers (longer instruction length)
• Not everything can be put in registers
– Structures, arrays, anything pointed-to (although compilers are getting better at putting more things in registers)
– More registers means more saving/restoring in context switching
• Upshot: trend to more registers: 8 (x86) → 32 (MIPS) → 128 (IA-64)
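The "larger specifiers" point can be quantified with a short sketch. It assumes each register operand is encoded as a plain binary index, so naming one of N registers takes ceil(log2 N) bits; the function names are illustrative.

```python
def specifier_bits(num_registers):
    """Bits needed to name one register out of num_registers (plain binary index)."""
    if num_registers < 1:
        raise ValueError("need at least one register")
    return (num_registers - 1).bit_length()


def operand_field_bits(num_registers, operands=3):
    """Total register-specifier bits in an instruction with `operands` register fields."""
    return operands * specifier_bits(num_registers)


if __name__ == "__main__":
    for n in (8, 32, 128):
        print(n, specifier_bits(n), operand_field_bits(n))
```

Under this assumption a three-register instruction spends 9 bits on specifiers with 8 registers, 15 bits with 32, and 21 bits with 128, which is a large share of a 32-bit instruction word.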

Issues in Memory Addressing
• Interpreting memory addresses
– Little endian: Intel, VAX
– Big endian: IBM 370, Motorola 68000, Sun SPARC
– Bi-endian (can be configured either way): ARM, PowerPC, DEC Alpha, MIPS, PA-RISC, IA-64
– Byte ordering can be a problem when exchanging data among different machines; TCP/IP’s network byte order is big endian
– Memory alignment: byte, half word, word, double word, etc.
• Addressing modes
– Many modes

Big Endian vs Little Endian
• For example, consider the number 1025 stored in a 4-byte integer:
• 00000000 00000000 00000100 00000001
Address   Big-Endian        Little-Endian
00        00000000 (MSB)    00000001 (LSB)
01        00000000          00000100
02        00000100          00000000
03        00000001 (LSB)    00000000 (MSB)
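The same layout can be reproduced with Python's built-in int.to_bytes, as a quick check of the table. The helper name byte_layout is an assumption for illustration.

```python
def byte_layout(value, byteorder, size=4):
    """Return [(address, bit string)] for `value` stored starting at address 0."""
    raw = value.to_bytes(size, byteorder)
    return [(addr, format(byte, "08b")) for addr, byte in enumerate(raw)]


if __name__ == "__main__":
    for order in ("big", "little"):
        print(order)
        for addr, bits in byte_layout(1025, order):
            print(f"  {addr:02d}  {bits}")

    # Reading big-endian bytes as if they were little-endian gives a different number,
    # which is the data-exchange problem between machines of different byte order.
    wire = (1025).to_bytes(4, "big")
    print(int.from_bytes(wire, "little"))
```

The last line shows why a shared convention such as network byte order matters: the same four bytes decode to a different integer when the reader assumes the other byte order.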

Addressing Modes
• How to specify the location of an operand (effective address)
• Addressing modes can:
– Significantly reduce instruction counts
– Increase the average CPI
– Increase the complexity of building a machine
• The VAX has been used for benchmark data since it supports the richest set of addressing modes: 20 addressing modes!
• Addressing modes can be classified based on:
– The source of the data (register, immediate, or memory)
– The address calculation (direct, indirect, indexed)

Addressing Modes Figure B.6—Addressing modes used in recent computers
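The figure itself is not reproduced here. As a rough illustration of what an addressing mode computes, the sketch below evaluates the effective address for six commonly described modes. The mode names follow standard textbook usage, but the function signature, the register names, and the sample values are assumptions made for this example.

```python
def effective_address(mode, regs, mem, base=None, index=None, disp=0, scale=1):
    """Compute the memory address an operand refers to under the given mode."""
    if mode == "register_indirect":    # Mem[Regs[base]]
        return regs[base]
    if mode == "displacement":         # Mem[disp + Regs[base]]
        return disp + regs[base]
    if mode == "indexed":              # Mem[Regs[base] + Regs[index]]
        return regs[base] + regs[index]
    if mode == "direct":               # Mem[disp], an absolute address
        return disp
    if mode == "memory_indirect":      # Mem[Mem[Regs[base]]]
        return mem[regs[base]]
    if mode == "scaled":               # Mem[disp + Regs[base] + Regs[index] * scale]
        return disp + regs[base] + regs[index] * scale
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    regs = {"R1": 1000, "R2": 8, "R3": 2000}   # hypothetical register contents
    mem = {2000: 3000}                          # hypothetical memory contents
    print(effective_address("displacement", regs, mem, base="R1", disp=100))
    print(effective_address("scaled", regs, mem, base="R1", index="R2", disp=100, scale=4))
    print(effective_address("memory_indirect", regs, mem, base="R3"))
```

Memory indirect needs an extra memory read just to form the address, and scaled needs a multiply and two adds. This is the sense in which richer modes reduce instruction count while raising CPI and hardware complexity.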

Design of Addressing Modes
• Now the question is how to design the addressing modes
• Which addressing modes should we include?
• How should we proceed?
• Answer: careful analysis (following slides)

Addressing Modes
Addressing mode usage patterns in three programs on a VAX machine (Figure B.7 of text, 4th Ed.)
Why are VAX machines often used for running benchmarks?

Displacement & Immediate Addressing
• From Figure B.7 above, the displacement and immediate addressing modes occur most frequently
• Therefore, it is important to find the optimal size for the “displacement field” and “immediate field” in the instructions

Displacement in Addressing
Statistics for Alpha with SPEC CPU2000
Suggests a displacement size of 12-16 bits

Immediate Addressing mode
Figure 2.9 (3rd Ed.)
About ¼ of data transfers and ALU operations have immediate operands
Should be carefully designed!

Immediate Addressing mode
Figure 2.10 (3rd Ed.): size distribution of immediate values
Suggests 8-16 bits for the immediate field

Memory Alignment
• Processors often require data types to be aligned on addresses that are a multiple of their size:
– Bytes can be aligned anywhere
– 4-byte integers are aligned on addresses divisible by 4

Alignment Restrictions
• For objects larger than one byte, some computers require alignment on object-sized boundaries.
• Some machines allow misalignment
• Allowing misaligned accesses complicates hardware
• A misaligned memory access needs more than one memory access, so the program runs slower
[Figure: memory bytes 0 to 8 holding byte, half word, word, and double-word objects, with one object marked “misaligned”]
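A small sketch of the alignment rule and of why misalignment costs extra accesses. It assumes a simplified memory that is read in naturally aligned 4-byte words (the bus_width parameter); real memory systems differ, and both function names are hypothetical.

```python
def is_aligned(address, size):
    """An object is aligned when its address is a multiple of its size."""
    return address % size == 0


def accesses_needed(address, size, bus_width=4):
    """Count the aligned bus_width-byte words touched by `size` bytes at `address`."""
    first_word = address // bus_width
    last_word = (address + size - 1) // bus_width
    return last_word - first_word + 1


if __name__ == "__main__":
    for addr in (8, 6):
        print(addr, is_aligned(addr, 4), accesses_needed(addr, 4))
```

In this model a 4-byte word at address 8 sits inside one memory word, while the same object at address 6 straddles two, so the hardware must issue two reads and merge the pieces.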

Data Alignment on IA-32
• IA-32 does not require alignment, but Intel recommends alignment for better performance.
• Data alignment in struct: the requirements for data alignment affect the memory layout of struct variables in sometimes surprising ways.
– Example: struct S1 { int i; char c; int j; };
– Since the integers i and j need to be on 4-byte boundaries, the compiler inserts a 3-byte gap between c and j
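The padding can be inspected without a C compiler by using Python's ctypes module, which lays out Structure fields with the platform's native C rules. This is a sketch: the helper names are hypothetical, and the 3-byte gap assumes the usual case of a 4-byte int with 4-byte alignment.

```python
import ctypes


class S1(ctypes.Structure):
    # Same field order as the slide's struct S1 { int i; char c; int j; };
    _fields_ = [("i", ctypes.c_int), ("c", ctypes.c_char), ("j", ctypes.c_int)]


def field_layout(struct_type):
    """Return [(name, offset, size)] for each field, in declaration order."""
    layout = []
    for name, *_ in struct_type._fields_:
        descriptor = getattr(struct_type, name)
        layout.append((name, descriptor.offset, descriptor.size))
    return layout


def gap_before(struct_type, field):
    """Padding bytes between the end of the previous field and the start of `field`."""
    layout = field_layout(struct_type)
    names = [name for name, _, _ in layout]
    idx = names.index(field)
    if idx == 0:
        return layout[0][1]
    _, prev_offset, prev_size = layout[idx - 1]
    return layout[idx][1] - (prev_offset + prev_size)


if __name__ == "__main__":
    print(field_layout(S1))
    print("gap before j:", gap_before(S1, "j"))
    print("sizeof(S1):", ctypes.sizeof(S1))
```

Reordering the fields so that the char comes last would not shrink this particular struct, because trailing padding is still needed to keep arrays of S1 aligned. With several small fields, though, grouping them together usually does save space.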

Summary: Memory Addressing
From the analysis of current processors, we can predict what future machines may implement:
• Addressing modes: the most popular in current use are:
– Displacement
– Immediate
– Register indirect
• Size of displacement field: at least 12-16 bits; these sizes would capture 75% to 99% of the displacements
• Size of immediate field: at least 8-16 bits; these sizes would capture 50% to 80% of the immediates
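The "capture" percentages come from measuring how many observed values fit in a field of a given width. A minimal sketch of that measurement follows, assuming the field holds a signed two's-complement value. The sample list is made up for illustration and is not benchmark data.

```python
def signed_range(bits):
    """Smallest and largest values of a signed two's-complement field."""
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1


def fits(value, bits):
    """True if `value` can be encoded in a signed field of `bits` bits."""
    low, high = signed_range(bits)
    return low <= value <= high


def coverage(values, bits):
    """Fraction of `values` that fit in a signed field of `bits` bits."""
    return sum(fits(v, bits) for v in values) / len(values)


if __name__ == "__main__":
    sample = [0, 4, -8, 100, 5000, 70000]   # hypothetical displacements
    for width in (8, 12, 16):
        print(width, signed_range(width), coverage(sample, width))
```

Values that do not fit must be built with extra instructions or a longer encoding, so the field width is a trade between instruction size and how often that slow path is taken.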

Operations in the Instruction Set
• Arithmetic and logical: integer arithmetic and logical operations (add, subtract, and, or, …)
• Data transfer: loads/stores
• Control: branch, jump, procedure call and return, traps
• System: operating system call, virtual memory management instructions
• Floating point: floating-point operations (add, sub, multiply)
• Decimal: decimal add, decimal multiply, decimal-to-character conversions
• String: string move, string compare, string search
• Graphics: pixel and vertex operations

Fig 2.16 (3rd Ed.)
Make the common case fast! According to whose law?

Instructions for Control Flow
Major aspects:
• They are in the category of “most frequently executed instructions”
• PIC: Position Independent Code (e.g. PC-relative jump address)
• Caller vs. callee saving of state
Figure 2.19 (3rd Ed.): breakdown of control flow instructions into three classes (Alpha processor)

Frequency of Compare Types
Statistics for SPEC CPU2000
< and ≤ dominate

How Are Conditions Checked?
• Compute condition first
– Condition codes: 80x86, ARM, PowerPC, SPARC
    CMP R1, R2
    BGE LOOP
  (Forces CMP and BR to be adjacent)
– Condition in general purpose register: Alpha, MIPS
    CMP R3, R1, R2
    BGE R3, LOOP
  (Any register can be used, simple)
– Condition in “condition” register
• Fuse condition check and branch: PA-RISC, VAX
    BGE R1, R2, LOOP
  (Reduces instruction count, but complicates pipelining)

Conditions from ALU IA-32 Condition register (EFLAGS)

Procedure Calls
• Procedure calls require both control transfer and state storage.
• Storage options:
– Caller saving: the calling procedure saves state
– Callee saving: the called procedure saves the registers it wants to use
• Most modern systems use a combination of both.

Encoding an Instruction Set
• Encoding affects:
– Size of the compiled program
– Implementation of decoding in the processor
• Encoding is influenced by (the design factors for encoding):
– Number of instructions: size of the opcode
– Number of addressing modes
– Number of operands
– Number of registers: size of the operand fields
– Variable instruction length vs. fixed instruction length
• Intel x86 instructions are between 1 and 17 bytes long

IA-32 Instruction format

Popular Encoding Methods
• Variable: allows all addressing modes to be used by all operations. In general, the smallest code representation (unused fields are not included)
• Fixed: always has the same number of operands. Combines the operations and addressing modes into the opcode. With simpler decoding, better performance than the variable-length method
• Hybrid: mix of the two. For embedded applications, full 32-bit instructions became a burden (code size). Therefore, new hybrid versions of RISC instruction sets support both 16-bit and 32-bit instructions, with code size reduction of up to 40% (Appendix C)
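As an illustration of the fixed-length style, the sketch below packs and unpacks a 32-bit word using the MIPS R-type field layout (opcode 6 bits, rs 5, rt 5, rd 5, shamt 5, funct 6). Choosing MIPS here is an assumption for the example; the slides do not prescribe a format, and the function names are hypothetical.

```python
# MIPS R-type layout, most significant field first: (name, width in bits).
R_TYPE_FIELDS = [("opcode", 6), ("rs", 5), ("rt", 5), ("rd", 5), ("shamt", 5), ("funct", 6)]


def encode_r_type(**values):
    """Pack the named fields into one 32-bit instruction word."""
    word = 0
    for name, width in R_TYPE_FIELDS:
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word = (word << width) | value
    return word


def decode_r_type(word):
    """Unpack a 32-bit word into its fields by fixed position, no searching needed."""
    fields = {}
    for name, width in reversed(R_TYPE_FIELDS):
        fields[name] = word & ((1 << width) - 1)
        word >>= width
    return fields


if __name__ == "__main__":
    # add $t0, $t1, $t2  ->  rd = 8, rs = 9, rt = 10, funct = 0x20
    word = encode_r_type(opcode=0, rs=9, rt=10, rd=8, shamt=0, funct=0x20)
    print(hex(word))
    print(decode_r_type(word))
```

Because every field sits at a fixed bit position, the decoder can extract register numbers before it even knows which operation is encoded. A variable-length format has to determine the instruction length first.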

Figure 2.23 Three basic variations in instruction encoding: Variable length, fixed length, and hybrid Variable length encoding requires more complex decoding than fixed length—execution time of every instruction is a bit longer!

Summary for 5 recent architectures for desktop From Appendix C

RISC vs. CISC (ISA design issue)
RISC = Reduced Instruction Set Computer
• Small (reduced) instruction sets based upon the 80/20 rule
• Fixed-length instructions that often execute in a single cycle
• Operations performed only on registers
• Memory accesses via load & store
• Simpler chip that can run at higher clock speed
• The chip area saved (by not having complex instructions) can be used to speed things up, e.g. more registers, larger cache, multi-level cache, pipeline, etc.
CISC = Complex Instruction Set Computer
• Large instruction sets, in order to:
– a) Support complex functions
– b) Provide backward compatibility for old programs
• Complex, variable-length instructions → smaller code space than RISC
• Usually many addressing modes, including memory-to-memory operations

RISC vs CISC
• Hot debates from the early 80s through the 90s
• Based on the observation of IBM’s John Cocke, the so-called 80/20 rule
– 20% of instructions did 80% of the work (from experiments on a large pool of programs)
• Application of Amdahl’s Law: make the common case faster
• IBM 801
• Berkeley RISC-1 (Patterson)
• Stanford MIPS (Hennessy)
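Amdahl's Law can be written as speedup = 1 / ((1 - f) + f / s), where f is the fraction of execution time that is improved and s is the speedup of that fraction. The sketch below evaluates it for a common case and a rare case. It assumes, for illustration only, that the "80% of the work" corresponds to 80% of execution time, which the slide does not state.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when `fraction_enhanced` of the time runs `speedup_enhanced` times faster."""
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must be between 0 and 1")
    if speedup_enhanced <= 0:
        raise ValueError("speedup_enhanced must be positive")
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


if __name__ == "__main__":
    # Doubling the speed of the common case (80% of time) vs. the rare case (20%).
    print(amdahl_speedup(0.8, 2.0))
    print(amdahl_speedup(0.2, 2.0))
```

Speeding up the common case twofold gives a much larger overall gain than the same improvement applied to the rare case. This is the argument for spending hardware on the simple, frequent instructions.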

Design Principles: CISC (Patterson, 1985)
• Richer instruction sets would simplify compilers.
• Richer instruction sets would alleviate the software crisis.
• Richer instruction sets would improve architecture quality.
• Since execution speed was proportional to program size, architectural techniques that led to smaller programs also led to faster computers.

Design Principles: RISC (Patterson, 1985)
• Functions should be kept simple unless there is a very good reason to do otherwise.
• Simple decoding and pipelined execution are more important than program size.
• Compiler technology should be used to simplify instructions rather than to generate complex instructions.

A “Typical” RISC (Patterson)
• 32-bit fixed format instruction (3 formats)
• 32 32-bit general-purpose registers (R0 contains zero, double-precision numbers take two registers)
• Single address mode for load/store: base + displacement (no indirection)
• Simple branch conditions
• Delayed branch to avoid pipeline penalties
Examples: DLX, SPARC, MIPS, HP PA-RISC, DEC Alpha, IBM/Motorola PowerPC, Motorola M88000

x86 (a CISC)
• Variable-length ISA (1-17 bytes)
• FP operand stack
• 2-operand instructions (extended accumulator)
– Register-register and register-memory support
• Scaled addressing modes
• Has been extended many times (as AMD has recently done with x86-64)
• Intel initially went to IA-64 (Itanium), and has now backtracked to EM64T

After All the Dust Is Settled
• It turns out that the choice (RISC or CISC) doesn’t matter much anymore. Why?
• Can decode CISC instructions into an internal “micro-ISA”
– This takes a couple of extra cycles (PLA implementation) and a few hundred thousand transistors
– Pentium 4 caches these micro-ops
• This may actually have some advantages
– External ISA for compatibility; internal ISA can be tweaked each generation

Impact of Compiler Technology on Architectural Decisions
The interaction of compilers and high-level languages significantly affects how programs use an instruction set. In other words, the ISA and a compiler are very closely related.
1. How are variables allocated and addressed? How many registers are needed to allocate variables appropriately?
2. What is the impact of optimization techniques on instruction mixes?
3. What control structures are used and with what frequency?

Structure of Recent Compilers Typically 2 – 4 levels
