
Floating Point Arithmetic



  1. Floating Point Arithmetic

  2. Table of Contents
  • History of Floating Point
  • Defining Floating Point Arithmetic
  • Floating Point Representation
  • Floating Point Format
  • Floating Point Precisions
  • Floating Point Operation
  • Special values
  • Error Analysis
  • Exception Handling
  • FPU Data Register Stack

  3. IA32 Floating Point: History
  • 8086: first Intel processor generation with IEEE floating point support, via the separate 8087 FPU (floating point unit) coprocessor
  • 486: merged the FPU and integer unit onto one chip
  Summary:
  • Hardware to add, multiply, and divide
  • Floating point data registers
  • Various control & status registers
  Floating Point Formats:
  • single precision (C float): 32 bits
  • double precision (C double): 64 bits
  • extended precision (C long double): 80 bits

  4. Defining Floating Point Arithmetic
  Representable numbers:
  • Scientific notation: ±d.d…d × r^exp
  • sign bit ±
  • radix r (usually 2 or 10, sometimes 16)
  • significand d.d…d (how many base-r digits d?)
  • exponent exp (range?)
  • others?
  Operations:
  • arithmetic: +, −, ×, /, …
  • how to round the result to fit in the format
  • comparison (<, =, >)
  • conversion between different formats: short to long FP numbers, FP to integer
  • exception handling: what to do for 0/0, 2*largest_number, etc.
  • binary/decimal conversion: for I/O, when the radix is not 10
  • language/library support for these operations

  5. Floating Point Representation
  • Floating point describes a system for representing real numbers which supports a wide range of values.
  • A floating-point number is one in which the decimal point can be in any position.
  • Example: a memory location set aside for a floating-point number can store 0.735, 62.3, or 1200.

  6. In comparison with:
  • Radix point (or radix character): the symbol used in numerical representations to separate the integer part of a number (to the left of the radix point) from its fractional part (to the right of the radix point). "Radix point" is a general term that applies to all number bases. Ex: in base 10 (decimal): 13.625 (decimal point); in base 2 (binary): 1101.101 (binary point).
  • Fixed point: a number in which the position of the decimal point is fixed. A fixed-point memory location can only accommodate a specific number of decimal places, usually 2 (for currency) or none (for integers). For example, amounts of money in U.S. currency can always be represented as numbers with exactly two digits to the right of the point (1.00, 123.45, 0.76, etc.).
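A short sketch of the contrast: fixed-point currency kept as an integer count of cents stays exact, while the binary floating-point sum of 0.1 and 0.2 picks up representation error (the helper name add_cents is illustrative, not a library function):

```python
# Fixed point: store U.S. currency as an integer number of cents, so
# every amount has exactly two digits to the right of the point.
def add_cents(a_cents, b_cents):
    return a_cents + b_cents

total_cents = add_cents(10, 20)    # $0.10 + $0.20
assert total_cents == 30           # exact: 30 cents

# Floating point: 0.1 and 0.2 have no exact base-2 representation,
# so the stored values are rounded and the sum is slightly off.
fp_total = 0.1 + 0.2
assert fp_total != 0.3
assert abs(fp_total - 0.3) < 1e-15
```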

  7. Floating Point Representation

  8. Floating Point Representation: Binary Case
  A floating-point number is stored as three fields: a sign bit, a biased exponent, and a significand (or mantissa). Its value is
  ±S × B^E
  where:
  • S is the fraction, mantissa, or significand
  • E is the exponent
  • B is the base (B = 2 in the binary case)
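The binary example 1101.101 from the earlier slide, normalized as 1.101101 × 2^3, can be evaluated with a small helper (a sketch; fp_value is a hypothetical name, not a library function):

```python
def fp_value(sign, significand, exponent, base=2):
    """Evaluate (-1)**sign * S * B**E, with the significand S given
    as a base-B digit string 'd.dd...'."""
    int_part, frac_part = significand.split(".")
    s = int(int_part, base)
    for i, digit in enumerate(frac_part, start=1):
        s += int(digit, base) * base ** -i
    return (-1) ** sign * s * base ** exponent

# 1101.101 (binary) = 13.625; normalized it is +1.101101 x 2^3:
assert fp_value(0, "1.101101", 3) == 13.625
# Flipping the sign bit negates the value:
assert fp_value(1, "1.101101", 3) == -13.625
```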

  9. IEEE 754: Floating Point in Modern Computers
  The IEEE has standardized the computer representation for binary floating-point numbers in IEEE 754. This standard is followed by almost all modern machines.

  10. IEEE 754: Floating Point Format
  • Defines single and double precision formats (32 and 64 bits)
  • Standardizes formats across many different platforms
  • Radix 2
  • Single:
  » range 10^−38 to 10^+38
  » 8-bit exponent with 127 bias
  » 23-bit mantissa
  • Double:
  » range 10^−308 to 10^+308
  » 11-bit exponent with 1023 bias
  » 52-bit mantissa

  11. IEEE 754 Format Parameters

  12. Floating Point Precisions
  IEEE 754:
  • 16-bit: Half (binary16)
  • 32-bit: Single (binary32), decimal32
  • 64-bit: Double (binary64), decimal64
  • 128-bit: Quadruple (binary128), decimal128
  Other:
  • Minifloat
  • Extended precision
  • Arbitrary precision

  13. Floating Point Precisions
  • Single precision, called "float" in the C language family, and "real" or "real*4" in Fortran: a binary format that occupies 32 bits (4 bytes); its significand has a precision of 24 bits (about 7 decimal digits).
  • Double precision, called "double" in the C language family, and "double precision" or "real*8" in Fortran: a binary format that occupies 64 bits (8 bytes); its significand has a precision of 53 bits (about 16 decimal digits).
  • The other basic formats are quadruple precision (128-bit) binary, as well as decimal floating point: decimal64 (64 bits) and decimal128 (128 bits).
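The precision difference is easy to observe by round-tripping a double-precision value through the 32-bit single format (a sketch using Python's struct module):

```python
import struct

def round_trip_single(x):
    """Store x in IEEE 754 single precision (32 bits), read it back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

pi = 3.141592653589793            # double: ~16 significant decimal digits
pi32 = round_trip_single(pi)      # single: only ~7 of them survive
assert pi32 != pi                 # precision was lost in the 32-bit format
assert abs(pi32 - pi) < 1e-6      # but it still agrees to ~7 digits
assert round_trip_single(1.5) == 1.5   # small powers of 2 are exact
```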

  14. PRECISION CONSIDERATIONS
  • Guard bits: prior to a floating-point operation, the exponent and significand of each operand are loaded into ALU registers. The register contains additional bits, called guard bits, which are used to pad out the right end of the significand with 0s.
  • Rounding: the precision of the result is determined by the rounding policy. The result of any operation on the significands is generally stored in a longer register and then rounded back to the format's length.

  15. THE USE OF GUARD BITS

  16. THE STANDARD LISTS FOUR ALTERNATIVE APPROACHES:
  • Round to nearest: the result is rounded to the nearest representable number.
  • Round toward +∞: the result is rounded up toward plus infinity.
  • Round toward −∞: the result is rounded down toward negative infinity.
  • Round toward 0: the result is rounded toward zero.
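Python's decimal module (decimal arithmetic rather than binary hardware floating point, but with the same four IEEE rounding directions) can illustrate the modes on the halfway case 2.5:

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_CEILING, ROUND_FLOOR, ROUND_DOWN

x = Decimal("2.5")
one = Decimal("1")   # quantize to integer precision

# Round to nearest (IEEE default breaks ties to the even neighbor):
assert x.quantize(one, rounding=ROUND_HALF_EVEN) == 2
# Round toward +infinity:
assert x.quantize(one, rounding=ROUND_CEILING) == 3
# Round toward -infinity:
assert x.quantize(one, rounding=ROUND_FLOOR) == 2
# Round toward zero:
assert Decimal("-2.5").quantize(one, rounding=ROUND_DOWN) == -2
```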

  17. Internal Representation
  Floating-point numbers are typically packed into a computer datum as the sign bit, the exponent field, and the significand (mantissa), from left to right. For the IEEE 754 binary formats they are apportioned as follows:
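The single-precision layout (1 sign bit, 8 exponent bits, 23 fraction bits, left to right) can be inspected directly; a sketch using Python's struct module:

```python
import struct

def decode_single(x):
    """Unpack an IEEE 754 single-precision value into its fields:
    1 sign bit | 8 exponent bits | 23 fraction bits, left to right."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    return sign, exponent, fraction

# 1.0 = +1.0 x 2^0: sign 0, biased exponent 0 + 127 = 127, fraction 0
assert decode_single(1.0) == (0, 127, 0)
# -2.0 = -1.0 x 2^1: sign 1, biased exponent 1 + 127 = 128, fraction 0
assert decode_single(-2.0) == (1, 128, 0)
```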

  18. IEEE STANDARD FOR BINARY FLOATING-POINT ARITHMETIC
  IEEE 754 goes beyond the simple definition of a format to lay down specific practices and procedures, so that floating-point arithmetic produces uniform, predictable results independent of the hardware platform.

  19. Floating Point Operation
  • EXPONENT OVERFLOW: a positive exponent exceeds the maximum possible exponent value. In some systems, this may be designated as +∞ or −∞.
  • EXPONENT UNDERFLOW: a negative exponent is less than the minimum possible exponent value (e.g., −200 is less than −127). This means that the number is too small to be represented, and it may be reported as 0.
  • SIGNIFICAND UNDERFLOW: in the process of aligning significands, digits may flow off the right end of the significand. As we shall discuss, some form of rounding is required.
  • SIGNIFICAND OVERFLOW: the addition of two significands of the same sign may result in a carry out of the most significant bit. This can be fixed by realignment, as we shall explain.

  20. FLOATING-POINT: ADDITION AND SUBTRACTION (Z ← X ± Y)
  • PHASE 1: ZERO CHECK. Addition and subtraction are identical except for a sign change, so subtraction proceeds by changing the sign of the subtrahend and then adding. Next, if either operand is 0, the other is reported as the result.
  • PHASE 2: SIGNIFICAND ALIGNMENT. The next phase is to manipulate the numbers so that the two exponents are equal.
  • PHASE 3: ADDITION. The two significands are added together, taking into account their signs. Because the signs may differ, the result may be 0. There is also the possibility of significand overflow by 1 digit; if so, the result is shifted right and the exponent is incremented. An exponent overflow could occur as a result; it would be reported and the operation halted.
  • PHASE 4: NORMALIZATION. The final phase normalizes the result. Normalization consists of shifting significand digits left until the most significant digit (one bit, or 4 bits for a base-16 exponent) is nonzero.
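The four phases can be sketched in Python on a toy representation: a number is a pair (significand, exponent) with value significand × 2^exponent, the sign carried in the integer significand. This is an illustrative sketch of the phase structure, not the IEEE bit-level algorithm:

```python
def normalize(sig, exp, precision=8):
    """Phase 4: shift the significand until it is exactly `precision`
    bits wide (leading bit 1), adjusting the exponent to compensate."""
    if sig == 0:
        return (0, 0)
    while abs(sig) >= 1 << precision:       # significand overflow:
        sig >>= 1                           # shift right, bump exponent
        exp += 1                            # (a real FPU would round here)
    while abs(sig) < 1 << (precision - 1):  # shift left until normalized
        sig <<= 1
        exp -= 1
    return (sig, exp)

def fp_add(x, y, precision=8):
    """Toy float add; subtraction is fp_add(x, (-sy, ey)): change the
    sign of the subtrahend, then add."""
    (sx, ex), (sy, ey) = x, y
    # Phase 1: zero check -- if either operand is 0, report the other.
    if sx == 0:
        return normalize(sy, ey, precision)
    if sy == 0:
        return normalize(sx, ex, precision)
    # Phase 2: significand alignment -- make the two exponents equal
    # (shifting left is exact here because Python ints are unbounded).
    if ex > ey:
        sx, ex = sx << (ex - ey), ey
    else:
        sy, ey = sy << (ey - ex), ex
    # Phase 3: addition -- signs may differ, so the sum may be 0, or it
    # may overflow by one digit; Phase 4 then renormalizes either way.
    return normalize(sx + sy, ex, precision)

def value(n):
    sig, exp = n
    return sig * 2.0 ** exp

six = (192, -5)          # 192 * 2**-5 = 6.0 (8-bit significand)
two = (128, -6)          # 128 * 2**-6 = 2.0
assert value(fp_add(six, two)) == 8.0
assert fp_add(six, (-192, -5)) == (0, 0)   # x - x gives exactly zero
```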

  21. FLOATING-POINT ADDITION AND SUBTRACTION (Z ← X ± Y)

  22. FLOATING-POINT: MULTIPLICATION (Z ← X × Y)

  23. FLOATING-POINT: DIVISION (Z ← X / Y)

  24. Special values
  • Signed zero: in the IEEE 754 standard, zero is signed, meaning that there exist both a "positive zero" (+0) and a "negative zero" (−0).
  • Subnormal numbers: subnormal values fill the underflow gap with values whose absolute distance from one another is the same as for adjacent values just outside the underflow gap.
  • Infinities: the infinities of the extended real number line can be represented in IEEE floating point data types, just like ordinary floating point values like 1, 1.5, etc. They are not error values in any way, though they are often (but not always, as it depends on the rounding) used as replacement values when there is an overflow. Upon a divide-by-zero exception, a positive or negative infinity is returned as an exact result. An infinity can also be introduced as a numeral (like C's "INFINITY" macro, or "∞" if the programming language allows that syntax).
  • NaNs: IEEE 754 specifies a special value called "Not a Number" (NaN) to be returned as the result of certain "invalid" operations, such as 0/0, ∞×0, or sqrt(−1). The representation of NaNs specified by the standard has some unspecified bits that could be used to encode the type of error, but there is no standard for that encoding. In theory, signaling NaNs could be used by a runtime system to extend the floating-point numbers with other special values, without slowing down computations with ordinary values. Such extensions do not seem to be common, though.
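All four kinds of special value can be observed from Python, whose floats are IEEE 754 doubles (a small demonstration sketch):

```python
import math

# Signed zero: +0.0 and -0.0 compare equal but carry different signs.
assert -0.0 == 0.0
assert math.copysign(1.0, -0.0) == -1.0

# Subnormal numbers: the smallest positive double, 2**-1074, lies far
# below the smallest *normalized* double (about 2.2e-308); halving it
# falls out of the representable range and underflows to zero.
tiny = 2.0 ** -1074
assert tiny > 0.0 and tiny / 2.0 == 0.0

# Infinities are ordinary values, not errors: arithmetic keeps going.
inf = float("inf")
assert inf > 1e308 and inf + 1.0 == inf and -inf < -1e308

# NaN compares unequal to everything, including itself.
nan = float("nan")
assert nan != nan
```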

  25. IEEE Floating Point Arithmetic Standard 754: NaN (Not A Number)
  NaN encoding: sign bit, nonzero significand, maximum exponent
  • Invalid exception: occurs when the exact result is not a well-defined real number
  • 0/0
  • sqrt(−1)
  • infinity − infinity, infinity/infinity, 0 × infinity
  • NaN + 3
  • NaN > 3?
  • Return a NaN in all these cases
  Two kinds of NaNs:
  • Quiet: propagates without raising an exception
  • good for indicating missing data; Ex: max(3, NaN) = 3
  • Signaling: generates an exception when touched
  • good for detecting uninitialized data
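In Python these cases look like this (note that Python itself traps some invalid operations, such as 0.0/0.0 and sqrt(−1), with language-level exceptions rather than returning a quiet NaN):

```python
import math

inf = float("inf")
nan = float("nan")

# Invalid operations whose exact result is not a well-defined real
# number return NaN:
assert math.isnan(inf - inf)          # infinity - infinity
assert math.isnan(inf * 0.0)          # 0 * infinity
assert math.isnan(inf / inf)          # infinity / infinity
# (0.0 / 0.0 raises ZeroDivisionError and math.sqrt(-1.0) raises
# ValueError: Python traps these instead of returning NaN.)

# A quiet NaN propagates through arithmetic without raising:
assert math.isnan(nan + 3)
# ... and every ordered comparison against it is False:
assert not (nan > 3) and not (nan < 3) and not (nan == 3)
```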

  26. OPERATIONS THAT PRODUCE A QUIET NaN

  27. IEEE Floating Point Arithmetic Standard 754: Normalized Numbers
  • Normalized nonzero representable numbers: ±1.d…d × 2^exp
  • Macheps = machine epsilon = 2^(−#significand bits) = relative error in each operation
  • OV = overflow threshold = largest number
  • UN = underflow threshold = smallest number
  • ±Zero: sign bit ±, significand and exponent all zero (why bother with −0? see later)
  Format    # bits   # significand bits   macheps            # exponent bits   exponent range
  Single    32       23+1                 2^−24 (~10^−7)     8                 2^−126 to 2^127 (~10^±38)
  Double    64       52+1                 2^−53 (~10^−16)    11                2^−1022 to 2^1023 (~10^±308)
  Extended  ≥80      ≥64                  ≤2^−64 (~10^−19)   ≥15               2^−16382 to 2^16383 (~10^±4932)
  (Extended is 80 bits on Intel machines)
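The double-precision row of the table can be checked against Python's sys.float_info:

```python
import sys

# Double precision: 52+1 significand bits, so macheps = 2**-53.
macheps = 2.0 ** -53
assert sys.float_info.epsilon == 2.0 ** -52  # Python reports the spacing above 1.0

# macheps is the worst-case relative error of one rounded operation:
# 1 + macheps rounds back to exactly 1 (the tie goes to the even
# neighbor), while 1 + 2*macheps is representable and survives.
assert 1.0 + macheps == 1.0
assert 1.0 + 2.0 * macheps > 1.0

# OV and UN, matching the table's double-precision row:
assert 1.7e308 < sys.float_info.max < 1.8e308       # OV, just under 2^1024
assert 2.2e-308 < sys.float_info.min < 2.3e-308     # UN = 2^-1022
```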

  28. IEEE Floating Point Arithmetic Standard 754: "Denorms"
  • Denormalized numbers: ±0.d…d × 2^min_exp
  • sign bit, nonzero significand, minimum exponent
  • Fill in the gap between UN and 0
  • Underflow exception
  • occurs when an exact nonzero result is less than the underflow threshold UN
  • Ex: UN/3
  • return a denorm, or zero
  • Why bother? Necessary so that the following code never divides by zero:
  • if (a != b) then x = a/(a-b)
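The a/(a−b) guarantee can be demonstrated with two distinct tiny normalized doubles whose difference falls below UN:

```python
import sys

UN = sys.float_info.min           # smallest normalized double, 2**-1022

a = 1.5 * UN                      # two distinct tiny normalized numbers
b = 1.0 * UN
diff = a - b                      # exactly 0.5 * 2**-1022: below UN!

# Without denorms, diff would flush to zero and a/(a - b) would divide
# by zero. Gradual underflow returns a denormalized number instead,
# preserving the guarantee that a != b implies a - b != 0.
assert a != b
assert diff == 0.5 * UN and diff != 0.0
assert a / diff == 3.0
```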

  29. IEEE Floating Point Arithmetic Standard 754: ±Infinity
  • ±Infinity: sign bit, zero significand, maximum exponent
  • Overflow exception
  • occurs when an exact finite result is too large to represent accurately
  • Ex: 2*OV
  • return ±infinity
  • Divide-by-zero exception
  • return ±infinity = 1/±0
  • sign of zero important! Example later…
  • Also return ±infinity for 3+infinity, 2*infinity, infinity*infinity
  • Result is exact, not an exception!
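A quick sketch of these cases in Python (whose floats are IEEE doubles):

```python
import math
import sys

inf = float("inf")

# Overflow exception: 2 * OV has no finite representation, so the
# operation returns +infinity (Python does not trap here).
assert 2.0 * sys.float_info.max == inf

# The sign of zero matters: it survives arithmetic, which is what makes
# 1/+0 = +inf versus 1/-0 = -inf meaningful in IEEE terms (plain Python
# raises ZeroDivisionError instead of returning an infinity).
assert math.copysign(1.0, 5.0 * -0.0) == -1.0

# Exact infinity arithmetic -- a result, not an exception:
assert 3.0 + inf == inf and 2.0 * inf == inf and inf * inf == inf
```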

  30. Error Analysis
  Basic error formula:
  fl(a op b) = (a op b) × (1 + d)
  where:
  • op is one of +, −, ×, /
  • |d| ≤ ε = machine epsilon = macheps
  • assuming no overflow, underflow, or divide by zero
  Example: adding 4 numbers
  fl(x1+x2+x3+x4) = {[(x1+x2)×(1+d1) + x3]×(1+d2) + x4}×(1+d3)
                  = x1×(1+d1)×(1+d2)×(1+d3) + x2×(1+d1)×(1+d2)×(1+d3) + x3×(1+d2)×(1+d3) + x4×(1+d3)
                  = x1×(1+e1) + x2×(1+e2) + x3×(1+e3) + x4×(1+e4)
  where each |ei| ≲ 3×macheps, so we get the exact sum of slightly changed summands xi×(1+ei).
  Backward error analysis: an algorithm is called numerically stable if it gives the exact result for slightly changed inputs. Numerical stability is an algorithm design goal.
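The error formula can be verified numerically with exact rational arithmetic (Python's fractions module represents each double exactly, so the relative error term d can be computed exactly):

```python
from fractions import Fraction

MACHEPS = Fraction(1, 2 ** 53)    # double-precision macheps

# One addition: fl(a + b) = (a + b) * (1 + d) with |d| <= macheps.
a, b = 0.1, 0.7
computed = a + b                              # fl(a + b), rounded
exact = Fraction(a) + Fraction(b)             # exact sum of the two floats
d = (Fraction(computed) - exact) / exact      # the relative error term
assert abs(d) <= MACHEPS

# Summing four numbers left to right accumulates at most ~3 such terms:
xs = [0.1, 0.2, 0.3, 0.4]
s = 0.0
for x in xs:
    s += x
exact_sum = sum(Fraction(x) for x in xs)
rel = abs((Fraction(s) - exact_sum) / exact_sum)
assert rel <= 4 * MACHEPS         # within the ~3 * macheps bound, with slack
```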

  31. Exception Handling
  What happens when the "exact value" is not a real number, or is too small or too large to represent accurately? 5 exceptions:
  • Overflow: exact result > OV, too large to represent
  • Underflow: exact result nonzero and < UN, too small to represent
  • Divide-by-zero: nonzero/0
  • Invalid: 0/0, sqrt(−1), …
  • Inexact: the result had to be rounded (very common!)
  Possible responses:
  • Stop with an error message (unfriendly, not the default)
  • Keep computing (the default, but how?)

  32. Exception Handling User Interface
  Each of the 5 exceptions has the following features:
  • A sticky flag, which is set as soon as an exception occurs. The sticky flag can be reset and read by the user:
  • reset overflow_flag and invalid_flag
  • perform a computation
  • test overflow_flag and invalid_flag to see if any exception occurred
  • An exception flag, which indicates whether a trap should occur
  • Not trapping is the default: instead, continue computing, returning a NaN, infinity, or denorm
  • On a trap, there should be a user-writable exception handler with access to the parameters of the exceptional operation
  • Trapping, or "precise interrupts" like this, is rarely implemented, for performance reasons.
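Hardware sticky flags are not exposed by plain Python, but the reset/compute/test pattern can be sketched with a small hypothetical class (StickyFlags is illustrative, not a real FPU interface) that inspects results for ±infinity and NaN:

```python
import math

class StickyFlags:
    """Sketch of the sticky-flag pattern: a flag is set when an
    exceptional result appears and stays set until explicitly reset."""
    def __init__(self):
        self.reset()
    def reset(self):
        self.overflow = False
        self.invalid = False
    def check(self, result):
        if math.isinf(result):
            self.overflow = True      # keep computing with +-infinity
        if math.isnan(result):
            self.invalid = True       # keep computing with a NaN
        return result

flags = StickyFlags()
flags.reset()                         # 1. reset the flags
y = flags.check(2.0 * 1.7e308)        # 2. perform a computation (overflows)
z = flags.check(y + 1.0)              # the flag stays set ("sticky")
assert flags.overflow                 # 3. test: an exception occurred
assert not flags.invalid              #    ... but no invalid operation
```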

  33. FPU Data Register Stack
  • FPU register format (extended precision), bits 79..0: sign s (bit 79), exp (bits 78–64), frac (bits 63–0)
  • FPU register stack:
  • stack grows down, and wraps around from R0 -> R7
  • FPU registers are typically referenced relative to the top of stack
  • st(0) is the top of stack (Top), followed by st(1), st(2), …
  • push: increment Top, then load; pop: store, then decrement Top
  Absolute view vs. stack view (here with Top = R2):
  R7 = st(5), R6 = st(4), R5 = st(3), R4 = st(2), R3 = st(1), R2 = st(0) (Top), R1 = st(7), R0 = st(6)

  34. FPU Instructions
  Large number of floating point instructions and formats:
  • ~50 basic instruction types
  • load, store, add, multiply
  • sin, cos, tan, arctan, and log!
  Sampling of instructions:
  Instruction   Effect                        Description
  fldz          push 0.0                      Load zero
  flds S        push S                        Load single precision real
  fmuls S       st(0) <- st(0)*S              Multiply
  faddp         st(1) <- st(0)+st(1); pop     Add and pop

  35. END
