
A gentle introduction to floating point arithmetic



  1. A gentle introduction to floating point arithmetic
     Ho Chun Hok (cho@doc.ic.ac.uk)
     Custom Computing Group Seminar, 25 Nov 2005

  2. IEEE 754: floating point standard
A floating point value v occupies vsize bits, split into three fields: a sign bit s, an esize-bit exponent, and an fsize-bit fraction.
• Normal numbers (when 0 < exponent < max exponent)
  • v = (-1)^s x 2^(exponent - bias) x (1.fraction), where bias = 2^(esize-1) - 1
• Subnormal numbers (when exponent = 0)
  • v = (-1)^s x 2^(1 - bias) x (0.fraction)
• Special numbers (when exponent = max exponent)
  • Infinity, NaN (not a number)
• Precisions
  • Single: esize = 8, fsize = 23, vsize = 32
  • Double: esize = 11, fsize = 52, vsize = 64
  • Double extended: vsize > 64
• Operations
  • +, -, x, /, sqrt, f2i, i2f, compare, f2d, d2f
• Rounding modes
  • Nearest even, +inf, -inf, towards 0
(A C sketch that decodes these fields follows this slide.)
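As an aside not in the original slides, here is a minimal C sketch of how these fields can be pulled out of a single-precision value (esize = 8, fsize = 23, bias = 127); the example value -6.25 is arbitrary:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void)
{
    float v = -6.25f;                         /* (-1)^1 x 2^2 x 1.5625 */
    uint32_t bits;
    memcpy(&bits, &v, sizeof bits);           /* reinterpret the 32 bits safely */

    uint32_t s        = bits >> 31;           /* 1-bit sign */
    uint32_t exponent = (bits >> 23) & 0xFF;  /* 8-bit stored (biased) exponent */
    uint32_t fraction = bits & 0x7FFFFF;      /* 23-bit fraction */

    printf("s=%u  exponent=%u (unbiased %d)  fraction=0x%06X\n",
           s, exponent, (int)exponent - 127, fraction);
    /* prints: s=1  exponent=129 (unbiased 2)  fraction=0x480000 */
    return 0;
}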

  3. What the IEEE 754 standard is supposed to provide
• An approximation to real numbers with a known expected error
  • The epsilon of any real number can be determined when mapping it to a floating point number
• Results of all operations are correctly rounded whenever the exact result cannot be represented
• Some mathematical properties hold (in general): x+y == y+x, -(-a) == a, a>=b & c>=0 => a*c >= b*c, x+0 == x, y*y >= 0 (a quick check is sketched below)
• All exceptions can be detected, using exception flags
• Same results across different machines
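A small C check of the identities listed above (not part of the original slides; the values are purely illustrative):

#include <stdio.h>

int main(void)
{
    double x = 0.1, y = 0.3, a = -2.5;

    printf("x+y == y+x : %d\n", x + y == y + x);
    printf("-(-a) == a : %d\n", -(-a) == a);
    printf("x+0 == x   : %d\n", x + 0.0 == x);
    printf("y*y >= 0   : %d\n", y * y >= 0.0);

    /* Associativity is deliberately not on the slide's list:
       (x+y)+a == x+(y+a) is not guaranteed and often fails. */
    return 0;
}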

  4. How to ensure the standard is followed?
• Processor?
  • Round numbers in the different modes
  • Gradual underflow
  • Raise exceptions
  • (A C sketch of switching rounding modes follows this slide.)
• Operating System?
  • Handle exceptions
  • Handle operations that may not be supported in hardware (what if a processor cannot handle subnormal numbers?)
  • Keep track of the floating point unit state (precision, rounding mode)
• Programming Language?
  • Well-defined semantics for floating point (yes, we have the infamous JAVA language)
• Compiler?
  • Preserve the semantics defined in the language
• Programmer?
  • Read: What Every Computer Scientist Should Know About Floating-Point Arithmetic
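A minimal sketch, not from the slides, of how a C99 program can switch between the four rounding modes through <fenv.h>; the variable names are arbitrary:

#include <stdio.h>
#include <fenv.h>

#pragma STDC FENV_ACCESS ON

int main(void)
{
    const int   modes[] = { FE_TONEAREST, FE_UPWARD, FE_DOWNWARD, FE_TOWARDZERO };
    const char *names[] = { "nearest even", "+inf", "-inf", "towards 0" };
    volatile double one = 1.0, three = 3.0;   /* volatile: stop compile-time folding */

    for (int i = 0; i < 4; i++) {
        fesetround(modes[i]);
        printf("%-12s : %.20f\n", names[i], one / three);
    }
    fesetround(FE_TONEAREST);                 /* restore the default mode */
    return 0;
}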

  5. Case study 1
#include <stdio.h>

int main(void)
{
    double ref, index;
    int i;

    /* Look for the first i with i/(i+1) == 169/170; the exact match is i = 169. */
    ref = (double) 169.0 / (double) 170.0;
    for (i = 0; i < 250; i++) {
        index = i;
        if (ref == (double) (index / (index + 1.0)))
            break;
    }
    printf("i=%d\n", i);
    return 0;
}

  6. Visual C compiler, running on a Pentium M
• Same result on lulu (Pentium 3) and irina (Xeon)

  7. GCC, running on Pentium 4 (skokie)

  8. VCC:
    fld    qword ptr [ebp-10h]
    fadd   qword ptr [__real@8@3fff8000000000000000 (00426028)]
    fdivr  qword ptr [ebp-10h]
    fcomp  qword ptr [ebp-8]
    fnstsw ax
    test   ah,40h
    je     main+5Fh (0040106f)
    jmp    main+61h (00401071)
gcc:
    fld1
    faddp   %st,%st(1)
    fldl    0xfffffff0(%ebp)
    fdivp   %st,%st(1)
    fldl    0xfffffff8(%ebp)
    fxch    %st(1)
    fucompp
    fnstsw  %ax
    and     $0x45,%ah
    cmp     $0x40,%ah
    je      0x80483d2 <main+102>
• VCC uses the normal stack (via ebp, 64-bit doubles) to hold the result, and compares it with a 64-bit double precision value
• GCC uses the x87 FPU register stack (st, 80 bits) to hold the result, and compares it with a 64-bit double precision value

  9. Case study 1
• It's a compiler issue
  • Using more precision to calculate the intermediate result is a good idea
  • But the compiler should convert the 80-bit floating point value to 64 bits before the comparison (see the note below)
• And it's a programmer issue too
  • Equality tests between FP variables are dangerous
  • We can detect the problem before it hurts...
• It is not easy to comply with the standard
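A practical aside, not from the original slides: on x87-era GCC the discrepancy in Case study 1 can usually be removed at build time, for example with

    gcc -msse2 -mfpmath=sse case1.c

which performs the double arithmetic in 64-bit SSE registers instead of the 80-bit x87 stack (the file name case1.c is just a placeholder; -ffloat-store is a weaker alternative, since it only forces named variables back to 64 bits, not intermediate expression results).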

  10. Case study 2
• Calculate [the expression on this slide was an image and is missing from the transcript]
• When x is large, the result == 0, rather than [the expected nonzero value]
• Beware: even if everything complies with the standard, the standard cannot guarantee the result is always correct
• Again, the programmer should detect this before it hurts
  • Define a routine to trap the exception (a sketch follows this slide)
  • Exceptions are not errors as long as they are handled correctly
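The slide's original expression is lost, so the sketch below uses exp(-x*x) purely as a hypothetical stand-in for a computation that silently underflows to 0 for large x; the point is the trap routine, which consults the IEEE exception flags instead of trusting the zero result (link with -lm):

#include <stdio.h>
#include <math.h>
#include <fenv.h>

#pragma STDC FENV_ACCESS ON

/* Hypothetical example: exp(-x*x) underflows to 0 once x*x exceeds roughly 745. */
static double checked_exp_neg_sq(double x)
{
    feclearexcept(FE_ALL_EXCEPT);
    double r = exp(-x * x);
    if (fetestexcept(FE_UNDERFLOW))
        fprintf(stderr, "underflow for x = %g: result %g is not the true value\n", x, r);
    return r;
}

int main(void)
{
    printf("%g\n", checked_exp_neg_sq(2.0));   /* well within range */
    printf("%g\n", checked_exp_neg_sq(40.0));  /* exp(-1600) underflows */
    return 0;
}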

  11. Case Study 3
• Jean-Michel Muller's Recurrence (the slide's formula did not survive the transcript; the usual form is given below)
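For reference (reconstructed, since the slide image is missing): the usual statement of Muller's recurrence, consistent with the starting values quoted on slide 13, is

    x[n+1] = 111 - 1130/x[n] + 3000/(x[n] * x[n-1]),    x[0] = 11/2, x[1] = 61/11

A minimal double-precision C sketch (not from the original slides) shows the same runaway behaviour as the extended-precision trace on the next slide:

#include <stdio.h>

int main(void)
{
    double xprev = 11.0 / 2.0;    /* x[0] */
    double xcur  = 61.0 / 11.0;   /* x[1] */

    /* Analytically the sequence tends to 6, but rounding errors steer it to 100. */
    for (int n = 2; n <= 25; n++) {
        double xnext = 111.0 - 1130.0 / xcur + 3000.0 / (xcur * xprev);
        xprev = xcur;
        xcur  = xnext;
        printf("x[%d] = %e\n", n, xcur);
    }
    return 0;
}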

  12. Using double extended precision (80 bits)
• x[2] = 5.590164e+00
• x[3] = 5.633431e+00
• x[4] = 5.674649e+00
• x[5] = 5.713329e+00
• x[6] = 5.749121e+00
• x[7] = 5.781811e+00
• x[8] = 5.811314e+00
• x[9] = 5.837660e+00
• x[10] = 5.861018e+00
• x[11] = 5.882514e+00
• x[12] = 5.918471e+00
• x[13] = 6.240859e+00
• x[14] = 1.115599e+01
• x[15] = 5.279838e+01
• x[16] = 9.469105e+01
• x[17] = 9.966651e+01
• x[18] = 9.998007e+01
• x[19] = 9.999881e+01
• It converges to 100.0, which seems correct

  13. Case Study 3
• This series can converge to 5, 6, or 100, depending on the values of x0 and x1
• If x0 = 11/2 and x1 = 61/11, the series should converge to 6
• A little round-off error can affect the result dramatically
• We can calculate the result analytically by substituting the closed-form solution (the slide's formula is missing; a standard form is given below)
• In general, it is very difficult to detect this kind of error
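For reference, one standard way to carry out the substitution the slide refers to (not part of the original transcript): since 5, 6 and 100 are the roots of x^3 - 111x^2 + 1130x - 3000, every solution of the recurrence can be written as

    x[n] = (a*100^(n+1) + b*6^(n+1) + c*5^(n+1)) / (a*100^n + b*6^n + c*5^n)

The starting values x0 = 11/2, x1 = 61/11 correspond to a = 0 (with b = c), so the exact limit is 6. Any rounding error makes a slightly nonzero, after which the 100^n terms dominate and the computed sequence heads for 100 instead.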

  14. Case Study 4
• The table maker's dilemma
• If we want n-digit accuracy for an elementary function like sine or cosine, we (in most cases) need to calculate the result to n+2 digits
• What if the last 2 digits are "10", i.e. exactly on the rounding boundary? Then we calculate n+3 digits
• What if the last 3 digits are "100"? Then we calculate n+4 digits
• What if...
• The results of most elementary function libraries (e.g. libm) are therefore not correctly rounded in some cases

  15. Conclusion
• We have a standard representation for floating point numbers
• Conforming to the standard requires collaboration between different parties
• Even on a standard-compliant platform, care must be taken with underflow and overflow, and the algorithm must be numerically stable
• When using elementary functions, don't expect the results to be comparable between different machines
  • Elementary functions are NOT included in the IEEE standard
• Floating point, when used properly, can do something serious
