Lecture 7: Multiplication and Floating Point

Lecture 7: Multiplication and Floating Point EEN 312: Processors: Hardware, Software, and Interfacing Department of Electrical and Computer Engineering Spring 2014, Dr. Rozier (UM)

LAB 2

Lab Phases: Recursive • Phase 1 – Factorial • Phase 2 - Fibonacci

Lab Phases: Arrays • Phase 4 – Sum Array • Phase 5 – Find Item • Phase 6 – Bubble Sort

Lab Phases: Trees Array representation: [1,2,3,4,5,6,7,0,0,0,0,0,0,0,0] • Phase 7 – Tree Height • Phase 8 – Tree Traversal [1,2,5,0,0,4,0,0,3,6,0,0,7,0,0] 1 2 3 4 5 6 7

INTEGER ARITHMETIC

Addition and Subtraction

Half adder Inputs Outputs A B C S 0 0 0 0 1 0 0 1 0 1 0 1 1 1 1 0

Full Adder

Full Adder Inputs Outputs A B Cin Cout S 0 0 0 0 0 1 0 0 0 1 0 1 0 0 1 1 1 0 1 0 0 0 1 0 1 1 0 1 1 0 0 1 1 1 0 1 1 1 1 1

In Groups, Implement a 1-bit Full Adder Inputs Outputs A B Cin Cout S 0 0 0 0 0 1 0 0 0 1 0 1 0 0 1 1 1 0 1 0 0 0 1 0 1 1 0 1 1 0 0 1 1 1 0 1 1 1 1 1

Getting a Full Adder Half Adder Full Adder

Putting Together Multiple Bits

Making it Faster Carry Look Ahead Adder

Making it Even Faster Kogge-Stone Adder Carry-Select Adder

X B2U(X) B2T(X) 0000 0 0 0001 1 1 0010 2 2 0011 3 3 0100 4 4 0101 5 5 0110 6 6 0111 7 7 1000 8 –8 1001 9 –7 1010 10 –6 1011 11 –5 1100 12 –4 1101 13 –3 1110 14 –2 1111 15 –1 How do we get subtraction?

X B2U(X) B2T(X) 0000 0 0 0001 1 1 0010 2 2 0011 3 3 0100 4 4 0101 5 5 -1 x 1 1 0 1 1 0 1 1 1 1 1 1 1 0 1 1 0110 6 6 0111 7 7 + ~x 1000 8 –8 0 1 1 0 0 0 1 0 1001 9 –7 1010 10 –6 1011 11 –5 1100 12 –4 1101 13 –3 1110 14 –2 1111 15 –1 How do we get subtraction?

Start with long-multiplication approach 1000 × 1001 1000 0000 0000 1000 1001000 Multiplication multiplicand multiplier product Length of product is the sum of operand lengths

Multiplication Hardware Initially 0

Perform steps in parallel: add/shift Optimized Multiplier • One cycle per partial-product addition • That’s ok, if frequency of multiplications is low

Uses multiple adders Cost/performance tradeoff Faster Multiplier • Can be pipelined • Several multiplication performed in parallel

Computing Exact Product of w-bit numbers x, y Either signed or unsigned Ranges Unsigned: 0 ≤ x * y ≤ (2w – 1) 2 = 22w – 2w+1 + 1 Up to 2w bits Two’s complement min: x * y ≥ (–2w–1)*(2w–1–1) = –22w–2 + 2w–1 Up to 2w–1 bits Two’s complement max: x * y ≤ (–2w–1) 2 = 22w–2 Up to 2w bits, but only for (TMinw)2 Maintaining Exact Results Would need to keep expanding word size with each product computed Done in software by “arbitrary precision” arithmetic packages Multiplication

Standard Multiplication Function Ignores high order w bits Implements Modular Arithmetic UMultw(u , v) = u · v mod 2w • • • • • • • • • • • • • • • Unsigned Multiplication in C u Operands: w bits * v u · v True Product: 2*w bits UMultw(u , v) Discard w bits: w bits

SUN XDR library Widely used library for transferring data between machines ele_src malloc(ele_cnt * ele_size) Code Security Example #2 void* copy_elements(void *ele_src[], int ele_cnt, size_t ele_size);

XDR Code void* copy_elements(void *ele_src[], int ele_cnt, size_t ele_size) { /* * Allocate buffer for ele_cnt objects, each of ele_size bytes * and copy from locations designated by ele_src */ void *result = malloc(ele_cnt * ele_size); if (result == NULL) /* malloc failed */ return NULL; void *next = result; int i; for (i = 0; i < ele_cnt; i++) { /* Copy object i to destination */ memcpy(next, ele_src[i], ele_size); /* Move pointer to next memory region */ next += ele_size; } return result; }

What if: ele_cnt = 220 + 1 ele_size = 4096 = 212 Allocation = ?? How can I make this function secure? XDR Vulnerability malloc(ele_cnt * ele_size)

Standard Multiplication Function Ignores high order w bits Some of which are different for signed vs. unsigned multiplication Lower bits are the same • • • • • • • • • • • • • • • Signed Multiplication in C u Operands: w bits * v u · v True Product: 2*w bits TMultw(u , v) Discard w bits: w bits

Operation u << kgives u * 2k Both signed and unsigned Examples u << 3 == u * 8 u << 5 - u << 3 == u * 24 Most machines shift and add faster than multiply Compiler generates this code automatically • • • Power-of-2 Multiply with Shift k u • • • Operands: w bits * 2k 0 ••• 0 1 0 ••• 0 0 u · 2k True Product: w+k bits 0 ••• 0 0 Discard k bits: w bits UMultw(u , 2k) ••• 0 ••• 0 0 TMultw(u , 2k)

Multiply on ARM MUL{<cond>}{S} Rd, Rm, Rs Rd = Rm * Rs MLA{<cond>}{S} Rd, Rm, Rs, Rn Rd = Rm * Rs + Rn

Check for 0 divisor Long division approach If divisor ≤ dividend bits 1 bit in quotient, subtract Otherwise 0 bit in quotient, bring down next dividend bit Restoring division Do the subtract, and if remainder goes < 0, add divisor back Signed division Divide using absolute values Adjust sign of quotient and remainder as required Division quotient dividend 1001 1000 1001010 -1000 10 101 1010 -1000 10 divisor remainder n-bit operands yield n-bitquotient and remainder

Division Hardware Initially divisor in left half Initially dividend

One cycle per partial-remainder subtraction Looks a lot like a multiplier! Same hardware can be used for both Optimized Divider

Can’t use parallel hardware as in multiplier Subtraction is conditional on sign of remainder Faster dividers (e.g. SRT devision) generate multiple quotient bits per step Still require multiple steps Faster Division

Division in ARM • ARMv6 has no DIV instruction.

Division in ARM • ARMv6 has no DIV instruction. N = D x Q + R with 0 <= |R| < |D| N/D = Q + R

An Algorithm for Division

FLOATING POINT

What is 1011.1012? Carnegie Mellon Fractional binary numbers

Representation Bits to right of “binary point” represent fractional powers of 2 Represents rational number: Carnegie Mellon Fractional Binary Numbers • • • • • •

Carnegie Mellon Fractional Binary Numbers: Examples • Value Representation • 5 3/4 101.112 • 2 7/8 010.1112 • 63/64 001.01112 • Observations • Divide by 2 by shifting right • Multiply by 2 by shifting left • Numbers of form 0.111111…2 are just below 1.0 • 1/2 + 1/4 + 1/8 + … + 1/2i + … ➙ 1.0 • Use notation 1.0 – ε

Limitation Can only exactly represent numbers of the form x/2k Other rational numbers have repeating bit representations Value Representation 1/3 0.0101010101[01]…2 1/5 0.001100110011[0011]…2 1/10 0.0001100110011[0011]…2 Carnegie Mellon Representable Numbers

Defined by IEEE Std 754-1985 Developed in response to divergence of representations Portability issues for scientific code Now almost universally adopted Two representations Single precision (32-bit) Double precision (64-bit) Floating Point Standard

S: sign bit (0  non-negative, 1  negative) Normalize significand: 1.0 ≤ |significand| < 2.0 Always has a leading pre-binary-point 1 bit, so no need to represent it explicitly (hidden bit) Significand is Fraction with the “1.” restored Exponent: excess representation: actual exponent + Bias Ensures exponent is unsigned Single: Bias = 127; Double: Bias = 1203 IEEE Floating-Point Format single: 8 bitsdouble: 11 bits single: 23 bitsdouble: 52 bits S Exponent Fraction

Consider a 4-digit decimal example 9.999 × 101 + 1.610 × 10–1 1. Align decimal points Shift number with smaller exponent 9.999 × 101 + 0.016 × 101 2. Add significands 9.999 × 101 + 0.016 × 101 = 10.015 × 101 3. Normalize result & check for over/underflow 1.0015 × 102 4. Round and renormalize if necessary 1.002 × 102 Floating-Point Addition

Now consider a 4-digit binary example 1.0002 × 2–1 + –1.1102 × 2–2 (0.5 + –0.4375) 1. Align binary points Shift number with smaller exponent 1.0002 × 2–1 + –0.1112 × 2–1 2. Add significands 1.0002 × 2–1 + –0.1112 × 2–1 = 0.0012 × 2–1 3. Normalize result & check for over/underflow 1.0002 × 2–4, with no over/underflow 4. Round and renormalize if necessary 1.0002 × 2–4 (no change) = 0.0625 Floating-Point Addition

Much more complex than integer adder Doing it in one clock cycle would take too long Much longer than integer operations Slower clock would penalize all instructions FP adder usually takes several cycles Can be pipelined FP Adder Hardware

FP Adder Hardware Step 1 Step 2 Step 3 Step 4

FP multiplier is of similar complexity to FP adder But uses a multiplier for significands instead of an adder FP arithmetic hardware usually does Addition, subtraction, multiplication, division, reciprocal, square-root FP  integer conversion Operations usually takes several cycles Can be pipelined FP Arithmetic Hardware

Lecture 7: Multiplication and Floating Point