
Floating Point







  1. CPSC 252 Computer Organization Ellen Walker, Hiram College Floating Point

  2. Representing Non-Integers • Often represented in decimal format • Some require infinite digits to represent exactly • With a fixed number of digits (or bits), many numbers are approximated • Precision is a measure of the degree of approximation

  3. Scientific Notation (Decimal) • Format: m.mmmm x 10^eeeee • Normalized = exactly 1 digit before decimal point • Mantissa (m) represents the significant digits • Precision limited by number of digits in mantissa • Exponent (e) represents the magnitude • Magnitude limited by number of digits in exponent • Exponent < 0 for numbers between 0 and 1

  4. Scientific Notation (Binary) • Format: 1.mmmm x 2^eeeee • Normalized = 1 before the binary point • Mantissa (m) represents the significant bits • Precision limited by number of bits in mantissa • Exponent (e) represents the magnitude • Magnitude limited by number of bits in exponent • Exponent < 0 for numbers between 0 and 1

  5. Binary Examples • 1/16 = 1.0 x 2^-4 (mantissa 1.0, exponent -4) • 32.5 = 1.000001 x 2^5 (mantissa 1.000001, exponent 5)

  6. Quick Decimal-to-Binary Conversion (Exact) • Multiply the number by a power of 2 big enough to get an integer • Convert this integer to binary • Place the binary point the appropriate number of bits (based on the power of 2 from step 1) from the right of the number

  7. Conversion Example • Convert 32.5 to binary • Multiply 32.5 by 2 (result is 65) • Convert 65 to binary (result is 1000001) • Place the binary point (in this case 1 bit from the right) (result is 100000.1) • Convert to binary scientific notation (result is 1.000001 x 2^5)
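
The recipe above can be sketched in Python (an illustrative helper, not part of the slides); `shift_bits` is the power of 2 chosen in step 1:

```python
from fractions import Fraction

def to_binary_scientific(x, shift_bits):
    """Exact decimal-to-binary conversion via the multiply-shift trick.

    shift_bits: a power of 2 large enough that x * 2**shift_bits is an integer.
    Returns (mantissa_string, exponent) in normalized 1.mmm x 2^e form.
    """
    scaled = Fraction(x) * 2**shift_bits
    assert scaled.denominator == 1, "shift_bits too small for an exact result"
    bits = bin(scaled.numerator)[2:]        # e.g. 65 -> '1000001'
    # Normalizing puts the binary point after the leading 1; the exponent is
    # the number of bits to its right, minus the shift applied in step 1.
    exponent = len(bits) - 1 - shift_bits
    mantissa = bits[0] + '.' + (bits[1:] or '0')
    return mantissa, exponent

# 32.5 * 2 = 65 = 1000001 in binary, so 32.5 = 1.000001 x 2^5
```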

  8. Floating Point Representation • Mantissa - m bits (unsigned) • Exponent - e bits (signed) • Sign (separate) - 1 bit • Total = 1+m+e bits • Tradeoff between precision and magnitude • Total bits fit into 1 or 2 full words

  9. Implicit First Bit • Remember the mantissa must always begin with “1.” • Therefore, we can save a bit by not actually representing the 1 explicitly. • Example: • Mantissa bits 0001 • Mantissa: 1.0001

  10. Offset Exponent • Exponent can be positive or negative, but it’s cleaner (for sorting) to use an unsigned representation • Therefore, store the exponent as unsigned with a bias of (2^(bits-1))-1 added; subtracting the bias recovers the actual exponent • Examples: 8-bit exponent (bias 127) • 00000001 = 1 - 127 = -126 • 10000000 = 128 - 127 = 1
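
In code, biasing is just an add on encode and a subtract on decode (a small sketch, assuming the slide’s 8-bit field):

```python
BIAS = 127  # 8-bit exponent field: 2**(8 - 1) - 1

def encode_exponent(actual):
    return actual + BIAS        # stored as an unsigned value

def decode_exponent(stored):
    return stored - BIAS

# stored 0b00000001 -> actual -126; stored 0b10000000 (128) -> actual 1
```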

  11. IEEE 754 Floating Point Representation (Single) • Sign (1 bit), Exponent (8 bits), Magnitude (23 bits) • What is the largest value that can be represented? • What is the smallest positive value that can be represented? • How many “significant bits” can be represented? • Values can be sorted using integer comparison • Sign first • Exponent next (sorted as unsigned) • Magnitude last (also unsigned)
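
For reference: the largest finite single-precision value is (2 - 2^-23) x 2^127 (about 3.4 x 10^38), the smallest positive normalized value is 2^-126 (about 1.2 x 10^-38), and there are 24 significant bits (23 stored plus the implicit 1). The field split can be inspected with Python’s struct module (an illustrative helper, not from the slides):

```python
import struct

def ieee754_fields(x):
    """Split a float, stored as IEEE 754 single precision, into bit fields."""
    (bits,) = struct.unpack('>I', struct.pack('>f', x))  # raw 32-bit pattern
    sign     = bits >> 31
    exponent = (bits >> 23) & 0xFF    # biased, unsigned
    fraction = bits & 0x7FFFFF        # 23 stored bits; leading 1 is implicit
    return sign, exponent, fraction

# ieee754_fields(1.0) -> (0, 127, 0): exponent 127 - 127 = 0, mantissa 1.0
```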

  12. Double Precision • Floating point number takes 2 words (64 bits) • Sign is 1 bit • Exponent is 11 bits (vs. 8) • Magnitude is 52 bits (vs. 23) • The last 32 bits of the magnitude are in the second word

  13. Floating Point Errors • Overflow • A positive exponent becomes too large for the exponent field • Underflow • A negative exponent becomes too large for the exponent field • Rounding (not actually an error) • The result of an operation has too many significant bits for the fraction field

  14. Special Values • Infinity • Result of dividing a non-zero value by 0 • Can be positive or negative • Infinity +/- anything = Infinity • Not A Number (NaN) • Result of an invalid mathematical operation, e.g. 0/0 or Infinity-Infinity

  15. Representing Special Values in IEEE 754 • Exponent ≠ 0, Exponent ≠ FF • Ordinary floating point number • Exponent = 00, Fraction = 0 • Number is 0 • Exponent = 00, Fraction ≠ 0 • Number is denormalized (leading 0. instead of 1.) • Exponent = FF, Fraction = 0 • Infinity (+ or -, depending on sign) • Exponent = FF, Fraction ≠ 0 • Not a Number (NaN)
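
These cases translate directly into a classifier over the raw 32-bit pattern (a sketch; FF is the all-ones exponent):

```python
def classify(bits):
    """Classify a 32-bit IEEE 754 single-precision bit pattern."""
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if 0 < exponent < 0xFF:
        return 'normalized'
    if exponent == 0:
        return 'zero' if fraction == 0 else 'denormalized'
    return 'infinity' if fraction == 0 else 'NaN'   # exponent == FF

# classify(0x7F800000) -> 'infinity'; classify(0x7FC00000) -> 'NaN'
```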

  16. Double Precision in MIPS • Each even register can be considered a register pair for double precision • High-order word in the even register • Low-order word in the odd register

  17. Floating Point Arithmetic in MIPS • add.s, add.d, sub.s, sub.d fd, fs, ft • Single and double precision addition / subtraction • fd = fs +/- ft • 32 floating point registers $f0 - $f31 • Used in pairs for double precision • Registers for add.d (etc.) must be even numbers

  18. Why Separate Floating Point Registers? • Twice as many registers using the same number of instruction bits • Integer & floating point operations usually on distinct data • Increased parallelism possible • Customized hardware possible

  19. Load / Store Floating Point Numbers • lwc1 32-bit word to FP register • swc1 FP register to 32-bit word • ldc1 2 words to FP register pair • sdc1 register pair to 2 words • (Note: the last character is the number 1)

  20. Floating Point Addition • Align the binary points (make exponents equal) • Add the revised mantissas • Normalize the sum

  21. Changing Exponents for Alignment and Normalization • To keep the number the same: • Left shift mantissa by 1 bit and decrement exponent • Right shift mantissa by 1 bit and increment exponent • Align by right-shifting the number with the smaller exponent • Normalize by • Shifting the result to put 1 before the binary point • Rounding the result to the correct number of significant bits

  22. Addition Example • Add 1.101 x 2^4 + 1.101 x 2^5 (26 + 52) • Align binary points: 1.101 x 2^4 = 0.1101 x 2^5 • Add mantissas: 0.1101 x 2^5 + 1.1010 x 2^5 = 10.0111 x 2^5

  23. Addition Example (cont.) • Normalize: 10.0111 x 2^5 = 1.00111 x 2^6 (78) • Round to 3-bit mantissa: 1.00111 x 2^6 ~= 1.010 x 2^6 (80)
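
The three steps (align, add, normalize with round-to-nearest-even) can be sketched for a toy mantissa width; here a value is a pair (m, e) where m is the significand as an integer with mant_bits fraction bits, so (0b1101, 4) encodes 1.101 x 2^4. This is a hypothetical helper, not from the slides:

```python
def fp_add(a, b, mant_bits=3):
    """Toy FP add on (mantissa, exponent) pairs; assumes positive,
    normalized inputs. (0b1101, 4) with mant_bits=3 means 1.101 x 2^4."""
    (ma, ea), (mb, eb) = a, b
    # 1. Align binary points: bring both to the smaller exponent. (Hardware
    #    right-shifts the smaller operand; left-shifting the larger here
    #    keeps the sum exact so there is a single rounding step at the end.)
    e = min(ea, eb)
    total = (ma << (ea - e)) + (mb << (eb - e))   # 2. exact integer sum
    # 3. Normalize and round: keep 1 + mant_bits significant bits,
    #    rounding the dropped bits half-to-even.
    shift = max(total.bit_length() - (1 + mant_bits), 0)
    if shift:
        keep = total >> shift
        rem = total - (keep << shift)
        half = 1 << (shift - 1)
        if rem > half or (rem == half and keep & 1):
            keep += 1
        if keep.bit_length() > 1 + mant_bits:   # carry out, e.g. 1111 -> 10000
            keep >>= 1
            shift += 1
        total = keep
    return total, e + shift

# Slide example: 1.101 x 2^4 + 1.101 x 2^5 -> 1.010 x 2^6 (26 + 52 ~ 80)
```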

  24. Rounding • At least 1 bit beyond the last bit is needed • Rounding up could require renormalization • Example: 1.1111 -> 10.000 • For multiplication, 2 extra bits are needed in case the product’s first bit is 0 and it must be left shifted (guard, round) • For complete generality, add “sticky bit” that is set whenever additional bits to the right would be >0

  25. Round to Nearest Even • Most common rounding mode • If the actual value is halfway between two values, round to an even result • Examples: • 1.0011 -> 1.010 • 1.0101 -> 1.010 • If the sticky bit is set, round up because the value isn’t really halfway between!
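
Python’s built-in round() also rounds half to even, so the slide’s examples can be checked directly, writing the mantissas as integers and dropping one low bit (the division is exact for small mantissas like these):

```python
def round_mantissa(mantissa, drop_bits):
    """Round-half-to-even, dropping drop_bits low bits of an integer mantissa."""
    return round(mantissa / (1 << drop_bits))   # round() is half-to-even

# 1.0011 = 0b10011: halfway between 0b1001 and 0b1010 -> even 0b1010 (1.010)
# 1.0101 = 0b10101: halfway between 0b1010 and 0b1011 -> even 0b1010 (1.010)
```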

  26. Floating point addition

  27. Floating Point Multiplication • Calculate new exponent by adding exponents together • Multiply the significands (using shift & add) • Normalize the product • Round • Set the sign

  28. Adding Exponents • Remember that exponents are biased • Adding exponents adds 2 copies of bias! (exp1 + 127) + (exp2 + 127) = (exp1+exp2 + 254) • Therefore, subtract the bias from the sum and the result is a correctly biased value
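
As arithmetic on the stored fields (a small sketch, using the single-precision bias of 127):

```python
BIAS = 127

def add_stored_exponents(stored1, stored2):
    """(e1 + BIAS) + (e2 + BIAS) carries two copies of the bias; drop one."""
    return stored1 + stored2 - BIAS

# actual exponents 3 and 4: stored 130 + 131 - 127 = 134, i.e. 7 + 127
```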

  29. Multiplication Example • Convert 2.25 x 1.5 to binary floating point (3 bits exponent, 3 bits mantissa) • 2.25 = 10.01 x 2^0 = 1.001 x 2^1 • Exp = 100 (because bias is 3) • 2.25 = 0 100 001 • 1.5 = 1.100 x 2^0 • Exp = 011, Mantissa: 100 • 1.5 = 0 011 100

  30. 1. Add Exponents • 0 100 001 x 0 011 100 • Add Exponents (and subtract bias) 100 + 011 – 011 = 100

  31. 2. Multiply Significands • 0 100 001 x 0 011 100 • Remember to restore the leading 1 • Remember that the number of binary places doubles • 1.001 x 1.100: partial products 0.100100 (1.001 x 0.100) + 1.001000 (1.001 x 1) = 1.101100 x 2^1

  32. Finish Up • Product is 1.1011 * 2^1 • Already normalized • But, too many bits, so we need to round • Nearest even number (up) is 1.110 • Result: 0 100 110 • Value is 1.75 * 2 = 3.5
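
The whole multiply can be sketched end-to-end in the slide’s 1+3+3-bit format, with values as (sign, stored_exponent, fraction) triples; a hypothetical helper that ignores overflow, underflow, and special values:

```python
def fp_mul(a, b, mant_bits=3, bias=3):
    """Toy FP multiply on (sign, stored_exp, fraction) triples with an
    implicit leading 1; rounds half to even. No overflow/special handling."""
    (s1, e1, f1), (s2, e2, f2) = a, b
    # 1. Add exponents, subtracting the doubled bias.
    exp = e1 + e2 - bias
    # 2. Multiply significands with the implicit 1 restored;
    #    the product carries 2 * mant_bits fraction bits.
    sig = ((1 << mant_bits) | f1) * ((1 << mant_bits) | f2)
    # 3./4. Normalize and round to 1 + mant_bits significant bits.
    shift = sig.bit_length() - (1 + mant_bits)
    keep = sig >> shift
    rem = sig - (keep << shift)
    half = 1 << (shift - 1)
    if rem > half or (rem == half and keep & 1):
        keep += 1
        if keep.bit_length() > 1 + mant_bits:   # rounding carried out
            keep >>= 1
            shift += 1
    exp += shift - mant_bits
    # 5. Set the sign.
    return (s1 ^ s2, exp, keep & ((1 << mant_bits) - 1))

# Slide example: fp_mul((0, 4, 0b001), (0, 3, 0b100)) -> (0, 4, 0b110) = 3.5
```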

  33. Types of Errors • Overflow • Exponent is too large for the number of bits allotted • Underflow • Negative exponent is too small (too negative) to fit in the number of bits • Rounding error • Mantissa has too many bits

  34. Overflow and Underflow • Addition • Overflow is possible when adding two positive or two negative numbers • Multiplication • Overflow is possible when multiplying two large absolute value numbers • Underflow is possible when multiplying two numbers very close to 0

  35. Limitations of Finite Floating Point Representations • Gap between 0 and the smallest non-zero number • Gaps between values when the last bit of the mantissa changes • Fixed number of values between 0 and 1 • Significant effects of rounding in mathematical operations

  36. Implications for Programmers • Mathematical rules are not always followed • (a / b) * b does not always equal a • (a + b) + c does not always equal a + (b + c) • Use inequality comparisons instead of directly comparing floating point numbers (with ==) • if ((x > -epsilon) && (x < epsilon)) instead of if (x == 0) • Epsilon can be set based on the problem or knowledge of the representation (e.g. single vs. double precision)
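
Both pitfalls are easy to demonstrate with standard IEEE double-precision behavior (epsilon here is an arbitrary tolerance chosen for the example):

```python
x = (0.1 + 0.1 + 0.1) - 0.3      # mathematically zero
print(x == 0)                    # False: 0.1 is not exactly representable

print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))   # False: not associative

epsilon = 1e-9                   # tolerance chosen for this problem
print(-epsilon < x < epsilon)    # True: compare with a tolerance instead
```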

  37. The Pentium Floating Point Bug • To speed up division, a table was used • It was assumed that 5 elements of the table would never be accessed (and the hardware was optimized to make them 0) • These table elements occasionally caused errors in bits 12 to 52 of floating point significands • (see Section 3.8 for more)

  38. A Marketing Error • July 1994 - Intel discovers the bug, decides not to halt production or recall chips • September 1994 - A professor discovers the bug, posts to the Internet (after attempting to inform Intel) • November 1994 - Press articles appear; Intel says it will affect “maybe several dozen people” • December 1994 - IBM disputes this claim and halts shipment of Pentium-based PCs • Late December 1994 - Intel apologizes

  39. The “Big Picture” • Bits in memory have no inherent meaning. A given sequence can contain • An instruction • An integer • A string of characters • A floating point number • All number representations are finite • Finite arithmetic requires compromises
