Chapter 4 Instruction Set Examples

Chapter 4 Instruction Set Examples • Advanced Computer Architecture • COE 501

DLX Architecture • Introduced by Hennessy and Patterson in 1990 • Derived from many different instruction set architectures from MIPS, Sun, IBM, Intel, HP, AMD, etc. • DLX is a typical RISC architecture. • 32-bit fixed-length instructions • 3 instruction formats • Load/store architecture • Simple branch conditions (no condition codes) • DLX registers • 32 32-bit general-purpose registers (R0 = 0) • 32 32-bit (or 16 64-bit) floating-point registers • Special-purpose registers (e.g., FP Status and PC)

DLX Design Decisions • DLX is based on the following design decisions • Use general-purpose registers with a load-store architecture • Support commonly used addressing modes • displacement, immediate, and register deferred • Support simple instructions that occur frequently • load, store, add, subtract, move, and, shift, compare equal, branch, jump, call, and return • Support commonly required data sizes • 8 (byte), 16 (half word), and 32-bit (word) integers • 32 (float) and 64-bit (double) floating point • Use fixed-length instructions that are easy to decode • Provide plenty of general-purpose registers and separate floating-point registers

DLX Instruction Formats
• Register-Register (R-type), e.g., ADD R1, R2, R3
    Field:  Op      rs1     rs2     rd      func
    Bits:   31-26   25-21   20-16   15-11   10-0
    Width:  6       5       5       5       11
  (ALU reg. operations, read/write special registers, and moves)
• Register-Immediate (I-type), e.g., SUBI R1, R2, #3
    Field:  Op      rs1     rd      immediate
    Bits:   31-26   25-21   20-16   15-0
    Width:  6       5       5       16
  (ALU imm. operations, loads and stores, conditional branch, jump (and link) register)
• Jump / Call (J-type), e.g., J end
    Field:  Op      offset added to PC
    Bits:   31-26   25-0
    Width:  6       26
  (jump, jump and link, trap and return from exception)
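As a rough illustration of the three formats, the sketch below packs fields into a 32-bit word using the bit positions listed above. The opcode and func values are made up for the example and are not the real DLX encodings.

```python
def encode_r(op, rs1, rs2, rd, func):
    """R-type: op[31:26] rs1[25:21] rs2[20:16] rd[15:11] func[10:0]."""
    return (op << 26) | (rs1 << 21) | (rs2 << 16) | (rd << 11) | (func & 0x7FF)

def encode_i(op, rs1, rd, imm):
    """I-type: op[31:26] rs1[25:21] rd[20:16] immediate[15:0]."""
    return (op << 26) | (rs1 << 21) | (rd << 16) | (imm & 0xFFFF)

def encode_j(op, offset):
    """J-type: op[31:26] offset[25:0]."""
    return (op << 26) | (offset & 0x3FFFFFF)

# Hypothetical opcode/func numbers, chosen only for illustration.
HYPOTHETICAL_OP_RTYPE = 0
HYPOTHETICAL_FUNC_ADD = 0x20

# ADD R1, R2, R3: rd = 1, rs1 = 2, rs2 = 3
word = encode_r(HYPOTHETICAL_OP_RTYPE, rs1=2, rs2=3, rd=1,
                func=HYPOTHETICAL_FUNC_ADD)
print(f"{word:032b}")
```

Because every format keeps the opcode in bits 31-26 and the first source register in bits 25-21, a decoder can read those fields before it knows which format it is looking at.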

Examples of DLX Instructions
• Data Transfer
  • LW R1, 30(R2)     Regs[R1] <= Mem[30 + Regs[R2]]
  • SD F0, 40(R3)     Mem[40 + Regs[R3]] <= Regs[F0]; Mem[44 + Regs[R3]] <= Regs[F1]
  • Loads and stores also for bytes, half words, and floats
  • How would you perform a register move? A no-op?
• Arithmetic and Logic
  • SUB R1, R2, R3    Regs[R1] <= Regs[R2] - Regs[R3]
  • SLLI R1, R2, #5   Regs[R1] <= Regs[R2] << 5
  • LHI R1, #42       Regs[R1] <= 42 ## 0^16 (42 concatenated with 16 zero bits)
  • SLT R1, R2, R3    if (Regs[R2] < Regs[R3]) Regs[R1] <= 1 else Regs[R1] <= 0
  • How would you load a 32-bit immediate into a register?
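One possible answer to the questions above, sketched as a tiny register-file model: a move is an ADD with R0, a no-op is any instruction that targets R0, and a 32-bit constant is built with LHI followed by an OR immediate. The helper names are hypothetical, and the sketch assumes that ORI zero-extends its 16-bit immediate.

```python
MASK32 = 0xFFFFFFFF
regs = [0] * 32  # R0 always reads as zero

def write(rd, value):
    # Writes to R0 are discarded so that R0 stays zero.
    if rd != 0:
        regs[rd] = value & MASK32

def lhi(rd, imm16):
    # Load high immediate: imm16 ## 0^16
    write(rd, (imm16 & 0xFFFF) << 16)

def ori(rd, rs1, imm16):
    # Assumption: the immediate is zero-extended.
    write(rd, regs[rs1] | (imm16 & 0xFFFF))

def add(rd, rs1, rs2):
    write(rd, regs[rs1] + regs[rs2])

def load_imm32(rd, value):
    lhi(rd, value >> 16)   # upper 16 bits
    ori(rd, rd, value)     # lower 16 bits

load_imm32(1, 0x12345678)
add(2, 1, 0)   # register move: R2 <= R1 + R0
add(0, 0, 0)   # no-op: the result is thrown away
```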

Examples of DLX Instructions
• Control
  • JALR R2            Regs[R31] <= PC + 4; PC <= Regs[R2]
  • JR R3              PC <= Regs[R3]
  • BNEZ R4, name      if (Regs[R4] != 0) PC <= name else PC <= PC + 4
  • How would you implement a subroutine call and return?
• Floating Point
  • MULF F1, F2, F3    Regs[F1] <= Regs[F2] * Regs[F3]
  • ADDD F0, F2, F4    Regs[F0&F1] <= Regs[F2&F3] + Regs[F4&F5]
  • What would be difficult about adding a floating-point multiply and add instruction to DLX?
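A minimal sketch of one way to do a call and return with the control instructions above: jump and link saves PC + 4 in R31, and JR R31 returns. The `Machine` class and the addresses are hypothetical and model only the PC and the register file.

```python
class Machine:
    """Toy model: just a PC and 32 integer registers."""

    def __init__(self, pc=0):
        self.regs = [0] * 32
        self.pc = pc

    def jal(self, target):
        # Jump and link: save the return address in R31, then jump.
        self.regs[31] = self.pc + 4
        self.pc = target

    def jr(self, rs):
        # Jump register: PC <= Regs[rs]
        self.pc = self.regs[rs]

CALL_SITE = 0x100    # hypothetical address of the call instruction
SUBROUTINE = 0x400   # hypothetical address of the subroutine

m = Machine(pc=CALL_SITE)
m.jal(SUBROUTINE)    # call
entry_pc = m.pc
m.jr(31)             # return to the instruction after the call
```

Nested calls would overwrite R31, so a non-leaf subroutine has to save R31 (for example on a software-managed stack) before calling again.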

DLX Instruction Set (Appendix C.3) • Data transfer • Load/store word • Load/store halfword or byte (signed/unsigned loads) • Load/store floating point single/double • Register moves • Arithmetic and Logic • Add/subtract (signed or unsigned, reg. or imm.) • Multiply/divide (signed or unsigned, operands in FP reg.) • And, or, xor (reg. or imm.) • Load high word (loads upper half of a reg. with imm.) • Shifts (LL, RL, RA) (reg. or imm.) • Set conditionals (LT, GT, LE, GE, EQ, NE) (reg. or imm.)

DLX Instruction Set • Control • Conditional branch on register (compare with zero) • Conditional branch on FP status bit (bit true or false) • Jump, jump register (26-bit imm. or reg.) • Jump and link, jump and link register (26-bit imm. or reg.) • Trap, return from exception (trap to and return from O.S.) • Floating Point • Add, subtract, multiply, divide (single or double) • FP converts (convert between single, double, and integer) • FP compares (single or double, sets bit in FP status)

DLX Instruction Usage

DLX Summary • Simple load/store architecture • Only accesses memory on loads/stores • All other operations use registers and immediates • Designed for pipeline efficiency • Fixed-length instruction encoding • Simple instructions • Easy to compile to • Simple, frequently used instructions • Orthogonal instruction set • Few addressing modes • Reduces execution time by • reducing CPI • reducing clock cycle time
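The last point follows from the usual CPU time relation: execution time = instruction count × CPI × clock cycle time. The sketch below plugs in made-up numbers (they are not measurements of any DLX implementation) to show how lowering the CPI or shortening the cycle reduces execution time.

```python
def execution_time(instruction_count, cpi, clock_rate_hz):
    """CPU time in seconds = instruction count x CPI x clock cycle time."""
    cycle_time = 1.0 / clock_rate_hz
    return instruction_count * cpi * cycle_time

# Hypothetical workload: 1 billion instructions.
baseline = execution_time(1_000_000_000, cpi=2.0, clock_rate_hz=100e6)
lower_cpi = execution_time(1_000_000_000, cpi=1.5, clock_rate_hz=100e6)
faster_clock = execution_time(1_000_000_000, cpi=2.0, clock_rate_hz=200e6)

print(baseline, lower_cpi, faster_clock)
```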

History of the Intel 80x86 • 1971: Intel invents microprocessor - 4004 • 1974: 8080 introduced • 8-bit microprocessor, used in the Altair personal computer • Accumulator machine • 1978: 8086 introduced • 16-bit microprocessor • Accumulator plus dedicated registers • 1980: IBM selects 8088 as basis for IBM PC • 8088 is 8-bit external bus version of 8086 • 1980: 8087 floating-point coprocessor • adds 60 floating-point instructions • 80-bit floating-point registers • uses hybrid stack/register scheme

History of the Intel 80x86 • 1982: 80286 introduced • 24-bit address • memory mapping & protection • 1985: 80386 introduced • 32-bit address, 32-bit GP registers • Support for multitasking • 1989: 80486 introduced • Built-in math coprocessor • More powerful cache and instruction pipelining • 1993: Pentium introduced • Superscalar processor (multiple instructions per cycle) • 1995: Pentium Pro introduced • More aggressive superscalar with register renaming, branch prediction, and speculative execution

History of the Intel 80x86 • Pentium II introduced • Incorporated MMX Technology • 57 new instructions for processing video, audio, and graphics • Supports single instructions that operate on multiple data (SIMD) • 8 data of 8 bits, 4 data of 16 bits, or 2 data of 32 bits • Pentium III introduced • Features Internet Streaming SIMD Extensions • Improves performance of 3D graphics and internet applications • Allows one instruction to be executed on 4 pairs of 32-bit floating-point data • Itanium introduced • Release of IA-64 architecture • New 64-bit RISC-like architecture • 128-bit instruction bundles (3 instructions per bundle) • The shape of the Intel architecture was driven by the desire for backward compatibility • Highly irregular architecture • Over 50 million sold per year

Intel 80x86 Integer Registers

X86 Operand Types
• x86 instructions typically have two operands, where one operand is both a source and a destination operand.
• Possible combinations include
    Source/destination type    Second source type
    Register                   Register
    Register                   Immediate
    Register                   Memory
    Memory                     Register
    Memory                     Immediate
• No memory-memory or immediate-immediate
• Immediates can be 8, 16, or 32 bits

Intel 80x86 Floating Point Registers • Operations on the top of stack and one register within the stack

Usage of Intel 80x86 Floating Point Registers
    Second operand                        NASA7     Spice
    Stack (2nd operand ST(1))             0.3%      2.0%
    Register (2nd operand ST(i), i>1)     23.3%     8.3%
    Memory                                76.3%     89.7%
• Above are dynamic instruction percentages (i.e., based on counts of executed instructions)
• Stack unused by Solaris compilers for fastest execution

80x86 Instruction Format • Instruction sizes vary from 1 to 17 bytes

80x86 Instructions • Data movement (move, push, pop) • Arithmetic and logic (logic ops, tests CCs, shifts, integer and decimal arithmetic) • Control flow (branches, jumps, calls, returns) • String instructions (move and compare) • FP data movement (load, load const., store) • Arithmetic instructions (add, subtract, multiply, divide, square root, absolute value) • Comparisons (can send result to ALU) • Transcendental functions (sin, cos, log, etc.)

80x86 Addressing Mode Usage for 32-bit Mode
    Addressing Mode                    Gcc    Espr.   NASA7   Spice   Avg.
    Register indirect                  10%    10%     6%      2%      7%
    Base + 8-bit disp                  46%    43%     32%     4%      31%
    Base + 32-bit disp                 2%     0%      24%     10%     9%
    Indexed                            1%     0%      1%      0%      1%
    Based indexed + 8b disp            0%     0%      4%      0%      1%
    Based indexed + 32b disp           0%     0%      0%      0%      0%
    Base + Scaled Indexed              12%    31%     9%      0%      13%
    Base + Scaled Index + 8b disp      2%     1%      2%      0%      1%
    Base + Scaled Index + 32b disp     6%     2%      2%      33%     11%
    32-bit Direct                      19%    12%     20%     51%     26%
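Most rows in the table are special cases of one general form, address = base + index × scale + displacement, where the scale is 1, 2, 4, or 8 in 32-bit mode. The sketch below computes that form; the function name and the example register values are made up for illustration.

```python
def effective_address(base=0, index=0, scale=1, disp=0):
    """32-bit x86-style effective address: base + index*scale + disp."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("scale must be 1, 2, 4, or 8")
    return (base + index * scale + disp) & 0xFFFFFFFF  # wraps at 32 bits

# Hypothetical register contents, for illustration only.
register_indirect = effective_address(base=0x1000)
base_plus_disp8 = effective_address(base=0x1000, disp=8)
scaled_index_disp = effective_address(base=0x1000, index=3, scale=4, disp=8)
direct = effective_address(disp=0x2000)
```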

80x86 Length Distribution

Instruction Counts: 80x86 vs. DLX
    SPEC pgm    x86               DLX               DLX÷x86
    gcc         3,771,327,742     3,892,063,460     1.03
    espresso    2,216,423,413     2,801,294,286     1.26
    spice       15,257,026,309    16,965,928,788    1.11
    nasa7       15,603,040,963    6,118,740,321     0.39
• DLX tends to execute more instructions for integer programs, while the 80x86 executes more instructions for floating-point programs
• 80x86 performs many more data transfers
  • Two to four times more for floating-point programs
  • About 1.25 times more for integer programs
  • Why so many more for floating point?
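The ratio column can be reproduced directly from the counts in the table. A short check, using only the numbers shown above:

```python
# (x86 instruction count, DLX instruction count) per SPEC program
counts = {
    "gcc": (3_771_327_742, 3_892_063_460),
    "espresso": (2_216_423_413, 2_801_294_286),
    "spice": (15_257_026_309, 16_965_928_788),
    "nasa7": (15_603_040_963, 6_118_740_321),
}

# Ratio of DLX instructions executed to x86 instructions executed.
ratios = {name: round(dlx / x86, 2) for name, (x86, dlx) in counts.items()}

for name, ratio in ratios.items():
    print(f"{name:10s} {ratio:.2f}")
```

A ratio above 1 means DLX executes more instructions than the 80x86 for that program; nasa7 is the only program in the table where the ratio falls below 1.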

Comparison • How would you expect the x86 and MIPS architectures to compare on the following? • CPI on SPEC benchmarks • Ease of design and implementation • Ease of writing assembly language & compilers • Code density • Overall performance • What other advantages/disadvantages are there to the two architectures?

Graphics and Multimedia Instruction Set Extensions • Several companies have extended their computer’s instruction sets to better support graphics and multimedia applications. • Intel’s MMX Technology • Intel’s Internet Streaming SIMD Extensions • AMD’s 3DNow! Technology • Sun’s Visual Instruction Set • Motorola’s and IBM’s AltiVec Technology • These extensions improve the performance of • Computer-aided design • Internet applications • Computer visualization • Video games • Speech recognition

MMX Data Types
• MMX Technology supports operations on the following 64-bit integer data types.
  • Packed byte (eight 8-bit elements)
  • Packed word (four 16-bit elements)
  • Packed double word (two 32-bit elements)
  • Packed quad word (one 64-bit element)

SIMD Operations
• MMX Technology allows a Single Instruction to work on Multiple pieces of Data (SIMD).
• PADD[W]: Packed add word
      A3       A2       A1       A0
    + B3       B2       B1       B0
    = A3+B3    A2+B2    A1+B1    A0+B0
• In the above example, 4 parallel adds are performed on 16-bit elements.
• Most MMX instructions only require a single cycle.
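To make the lane structure concrete, the sketch below models a packed add word on plain 64-bit integers: the operand is split into four 16-bit lanes, each lane is added independently, and no carry crosses a lane boundary. The helper names are hypothetical, and this is a behavioral model only, not how the hardware is built.

```python
def split_words(x):
    """Split a 64-bit value into four 16-bit lanes (lane 0 = least significant)."""
    return [(x >> (16 * i)) & 0xFFFF for i in range(4)]

def join_words(lanes):
    """Pack four lanes into a 64-bit value, keeping the low 16 bits of each."""
    return sum((lane & 0xFFFF) << (16 * i) for i, lane in enumerate(lanes))

def paddw(a, b):
    """Wrap-around packed add of four 16-bit words."""
    return join_words([x + y for x, y in zip(split_words(a), split_words(b))])

result = paddw(0x0001_0002_0003_0004, 0x0010_0020_0030_0040)
print(f"{result:016x}")
```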

Saturating Arithmetic • Both wrap-around and saturating adds are supported. • With saturating arithmetic, results that overflow are clamped to the largest representable value (and results that underflow to the smallest). • PADD[W]: Packed wrap-around add • PADDUS[W]: Packed unsigned saturating add
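The difference between the two kinds of add shows up only when a lane overflows. A small sketch on lists of unsigned 16-bit lanes, with hypothetical function names modeled on the instruction names:

```python
def paddw(a, b):
    """Wrap-around add: each lane keeps only its low 16 bits."""
    return [(x + y) & 0xFFFF for x, y in zip(a, b)]

def paddusw(a, b):
    """Unsigned saturating add: each lane is clamped at 0xFFFF."""
    return [min(x + y, 0xFFFF) for x, y in zip(a, b)]

a = [0xFFFF, 0x0001, 0x8000, 0x0010]
b = [0x0001, 0x0001, 0x8000, 0x0020]

wrapped = paddw(a, b)
saturated = paddusw(a, b)
```

Saturation is usually what pixel arithmetic wants: brightening a nearly white pixel should leave it white rather than wrapping around to black.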

Pack and Unpack Instructions • Pack and unpack instructions provide conversion between standard data types and packed data types. • PACKSS[DW]: Pack with signed saturation, packed double word to packed word
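A behavioral sketch of the pack operation named above: four signed 32-bit values (two from each operand) are narrowed to four signed 16-bit values, and anything out of range is clamped. The lists stand in for MMX registers, with element 0 as the least significant lane.

```python
def saturate_i16(x):
    """Clamp a signed value into the signed 16-bit range."""
    return max(-32768, min(32767, x))

def packssdw(dst, src):
    """Pack two signed dwords from dst and two from src into four signed words.

    The low two result lanes come from dst, the high two from src.
    """
    return [saturate_i16(v) for v in dst + src]

packed = packssdw([100000, -100000], [5, -7])
```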

Multiply-Add Operations • Many graphics applications require multiply-accumulate operations • Vector Dot Products • Matrix Multiplies • Fast Fourier Transforms (FFTs) • Filter implementations • PMADDWD: Packed multiply-add, word to double word
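The multiply-add instruction can be modeled as four signed 16-bit multiplies followed by two adds of adjacent products, giving two signed 32-bit results. The sketch below is a behavioral model on Python lists; `wrap_i32` is a hypothetical helper that mimics 32-bit wrap-around.

```python
def wrap_i32(x):
    """Wrap an integer into the signed 32-bit range."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x

def pmaddwd(a, b):
    """Multiply four signed words pairwise, then add adjacent products.

    Returns two signed dwords: [a0*b0 + a1*b1, a2*b2 + a3*b3].
    """
    products = [x * y for x, y in zip(a, b)]
    return [wrap_i32(products[0] + products[1]),
            wrap_i32(products[2] + products[3])]

sums = pmaddwd([1, 2, 3, 4], [5, 6, 7, 8])
```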

Vector Dot Product
• A dot product on an 8-element vector can be performed using 9 MMX instructions
• Without MMX, 40 instructions are required
• The two partial sums are added to give the result:
    [ 0 | a0*c0+..+a3*c3 ] + [ 0 | a4*c4+..+a7*c7 ] => a0*c0+..+a7*c7
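The structure of the MMX dot product can be sketched as follows: each half of the vector goes through one multiply-add, the two dword results of each half are folded into one partial sum, and the partial sums are added. This is a behavioral model that ignores 32-bit overflow and does not reproduce the exact 9-instruction sequence; the function names are hypothetical.

```python
def pmaddwd(a, b):
    """Four word multiplies, adjacent products added: two dword results."""
    p = [x * y for x, y in zip(a, b)]
    return [p[0] + p[1], p[2] + p[3]]

def fold(pair):
    """Add the high dword to the low dword (a shift plus an add in MMX)."""
    return pair[0] + pair[1]

def dot8(a, c):
    """Dot product of two 8-element vectors of 16-bit values."""
    low = fold(pmaddwd(a[:4], c[:4]))    # a0*c0 + .. + a3*c3
    high = fold(pmaddwd(a[4:], c[4:]))   # a4*c4 + .. + a7*c7
    return low + high

a = [1, 2, 3, 4, 5, 6, 7, 8]
c = [8, 7, 6, 5, 4, 3, 2, 1]
result = dot8(a, c)
```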

Packed Compare Instructions • Packed compare instructions allow a bit mask to be set or cleared • This is useful when images with certain qualities need to be extracted.
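A sketch of how such a mask might be used, in the style of a chroma-key overlay: a packed byte compare yields 0xFF where a pixel matches the key color and 0x00 elsewhere, and logical operations then select between two images without any branches. The pixel values, the key color, and the `select` helper are made up for the example.

```python
def pcmpeqb(a, b):
    """Packed compare equal on bytes: 0xFF where equal, 0x00 otherwise."""
    return [0xFF if x == y else 0x00 for x, y in zip(a, b)]

def select(mask, if_set, if_clear):
    """Per byte: (mask AND if_set) OR (NOT mask AND if_clear)."""
    return [(m & s) | (~m & 0xFF & c)
            for m, s, c in zip(mask, if_set, if_clear)]

KEY = 0x10                                   # hypothetical key color
foreground = [0x10, 0x55, 0x10, 0x77]
background = [0xA0, 0xA1, 0xA2, 0xA3]

mask = pcmpeqb(foreground, [KEY] * 4)
composite = select(mask, background, foreground)
```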

MMX Instructions
• MMX Technology adds 57 new instructions to the x86 architecture.
• Some of these instructions include
  • PADD(b, w, d)         Packed addition
  • PSUB(b, w, d)         Packed subtraction
  • PCMPEQ(b, w, d)       Packed compare equal
  • PMULLw                Packed word multiply low
  • PMULHw                Packed word multiply high
  • PMADDwd               Packed word multiply-add
  • PSRL(w, d, q)         Packed shift right logical
  • PACKSS(wb, dw)        Pack data
  • PUNPCK(bw, wd, dq)    Unpack data
  • PAND, POR, PXOR       Packed logical operations

Performance Comparison
• The following shows the performance of Pentium processors with and without MMX Technology
    Application         Without MMX    With MMX    Speedup
    Video               155.52         268.70      1.72
    Image Processing    159.03         743.90      4.67
    3D geometry         161.52         166.44      1.03
    Audio               149.80         318.90      2.13
    Overall             156.00         255.43      1.64

MMX Technology Summary • MMX technology extends the Intel x86 architecture to improve the performance of multimedia and graphics applications. • It provides a speedup of 1.5 to 2.0 for certain applications. • MMX instructions are hand-coded in assembly or implemented as libraries to achieve high performance. • MMX data types use the x86 floating-point registers to avoid adding state to the processor. • Makes it easy to handle context switches • Makes it hard to perform MMX and floating-point instructions at the same time • Increases the chip area by only about 5%.

Questions on MMX • What are the strengths and weaknesses of MMX Technology? • How could MMX Technology potentially be improved? • How did the developers of MMX preserve backward compatibility with the x86 architecture? • Why was this important? • What are the disadvantages of this approach? • What restrictions/limitations are there on the use of MMX Technology?

Internet Streaming SIMD Extensions • Intel’s Internet Streaming SIMD Extensions (ISSE) • Help improve the performance of video and 3D applications • Are designed for streaming data, which is used once and then discarded • 70 new instructions beyond MMX Technology • Add new 128-bit registers • Provide the ability to perform parallel floating-point operations • Four parallel operations on 32-bit numbers • Reciprocal and reciprocal square root instructions (for normalization) • Packed average instruction (for motion compensation) • Provide data prefetch instructions • Make certain applications 1.5 to 2.0 times faster
