Floating Point Computation


Presentation Transcript


  1. Floating Point Computation Jyun-Ming Chen Spring 2013

  2. Contents • Sources of Computational Error • Computer Representation of (floating-point) Numbers • Efficiency Issues

  3. Sources of Computational Error • Converting a mathematical problem to a numerical problem introduces errors due to limited computational resources: round-off error (limited precision of representation) and truncation error (limited time for computation) • Miscellaneous sources: error in the original data; blunders (mistakes made through stupidity, ignorance, or carelessness, e.g. programming or data-input errors); propagated error

  4. Supplement: Error Classification (Hildebrand) • Gross error: caused by human or mechanical mistakes • Roundoff error: the consequence of using a number specified by n correct digits to approximate a number which requires more than n digits (generally infinitely many digits) for its exact specification • Truncation error: any error which is neither a gross error nor a roundoff error; frequently, a truncation error corresponds to the fact that, whereas an exact result would be afforded (in the limit) by an infinite sequence of steps, the process is truncated after a certain finite number of steps

  5. Common Measures of Error • Definitions: total error = round-off + truncation • Absolute error = | numerical - exact | • Relative error = absolute error / | exact | • If the exact value is zero, the relative error is not defined
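As a minimal C sketch of these definitions (the values are chosen purely for illustration):

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        const double exact = 3.141592653589793;  /* reference value (pi) */
        const double numerical = 3.14159;        /* a 6-digit approximation */
        double abs_err = fabs(numerical - exact);
        double rel_err = abs_err / fabs(exact);  /* undefined if exact == 0 */
        printf("absolute: %e  relative: %e\n", abs_err, rel_err);
        return 0;
    }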

  6. Ex: Round-off Error • The representation consists of a finite number of digits • The approximation of real numbers on the number line is therefore discrete!

  7. Watch out for printf !! • By default, “%f” prints out 6 digits after the decimal point
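A small demonstration of the pitfall: the default six digits may be fewer, or more, than what is actually significant.

    #include <stdio.h>

    int main(void) {
        float x = 1.0f / 3.0f;
        printf("%f\n", x);      /* default: 6 digits -> 0.333333 */
        printf("%.10f\n", x);   /* 10 digits, but a float carries only ~7 */
        return 0;
    }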

  8. Ex: Numerical Differentiation • Evaluating the first derivative of f(x) with a forward difference: f'(x) ≈ (f(x+h) - f(x)) / h • The higher-order Taylor terms dropped by this formula are the truncation error

  9. Numerical Differentiation (cont) • Select a problem with a known answer, so that we can evaluate the error!

  10. Numerical Differentiation (cont) • Error analysis: as h decreases, the truncation error decreases, but what happened at h = 0.00001?! (For small h, f(x+h) and f(x) agree in most of their digits, so the subtraction loses precision and round-off error takes over.)
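A sketch of such an experiment (not the original program), assuming f(x) = sin x in single precision so the exact derivative cos x is known; the error first shrinks with h and then grows again once round-off dominates:

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        float x = 1.0f;
        float exact = cosf(x);                 /* d/dx sin(x) = cos(x) */
        for (float h = 1e-1f; h > 1e-7f; h /= 10.0f) {
            float approx = (sinf(x + h) - sinf(x)) / h;
            printf("h = %.0e  error = %.3e\n", h, fabsf(approx - exact));
        }
        return 0;
    }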

  11. Ex: Polynomial Deflation • F(x) is a polynomial with 20 real roots • Use any method to numerically solve for a root, then deflate the polynomial to 19th degree • Solve for another root, and deflate again, and again, … • The accuracy of the roots obtained gets worse each time due to error propagation (see the sketch below)
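The deflation step itself is synthetic division; a minimal sketch (the root-finding method is left abstract, and any error in the computed root r contaminates the deflated coefficients, which is why later roots degrade):

    /* Divide p(x), coefficients highest-degree first, by (x - r) in place.
       Afterwards p[0..n-1] hold the degree n-1 quotient; the returned
       remainder equals p(r) and is ~0 when r is an accurate root. */
    double deflate(double *p, int n, double r) {
        double b = p[0];               /* leading coefficient */
        for (int i = 1; i <= n; ++i) {
            double t = p[i];
            p[i - 1] = b;              /* quotient coefficient */
            b = t + r * b;             /* synthetic-division recurrence */
        }
        return b;                      /* remainder = p(r) */
    }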

  12. Computer Representation of Floating Point Numbers • Decimal-binary conversion • Floating point vs. fixed point • Standard: IEEE 754 (1985)

  13. Decimal-Binary Conversion • Ex: 29 (base 10): divide by 2 repeatedly and record the remainders:
  2 ) 29
  2 ) 14   remainder 1
  2 )  7   remainder 0
  2 )  3   remainder 1
  2 )  1   remainder 1
       0   remainder 1
  Reading the remainders from bottom to top: 29 (base 10) = 11101 (base 2)
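The same procedure in C, as a minimal sketch:

    #include <stdio.h>

    /* Print a non-negative integer in binary via repeated division by 2. */
    void print_binary(unsigned n) {
        char digit[sizeof n * 8];
        int k = 0;
        do {
            digit[k++] = '0' + (n % 2);   /* remainder = next binary digit */
            n /= 2;
        } while (n > 0);
        while (k > 0)
            putchar(digit[--k]);          /* remainders are read backwards */
        putchar('\n');
    }

    int main(void) {
        print_binary(29);                 /* prints 11101 */
        return 0;
    }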

  14. Fraction-Binary Conversion • Ex: 0.625 (base 10) • Multiply the fraction by 2 repeatedly; the integer part produced at each step is the next binary digit: a1 = 1, a2 = 0, a3 = 1, a4 = a5 = … = 0

  15. Computing: 0.625 × 2 = 1.250 → a1 = 1; 0.250 × 2 = 0.500 → a2 = 0; 0.500 × 2 = 1.000 → a3 = 1, done: 0.625 (base 10) = 0.101 (base 2) • How about 0.1 (base 10)? 0.1 (base 10) = 0.000110011… (base 2); the block 0011 repeats forever, so 0.1 has no exact binary representation
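The multiply-by-2 procedure in C, as a minimal sketch (note that 0.1 is already rounded when stored in a double, so the digits printed are those of the stored value):

    #include <stdio.h>

    /* Print the first `bits` binary digits of a fraction in [0, 1). */
    void print_fraction_binary(double f, int bits) {
        printf("0.");
        for (int i = 0; i < bits; ++i) {
            f *= 2.0;                     /* shift one binary place left */
            if (f >= 1.0) { putchar('1'); f -= 1.0; }
            else          { putchar('0'); }
        }
        putchar('\n');
    }

    int main(void) {
        print_fraction_binary(0.625, 8);  /* 0.10100000 -- terminates */
        print_fraction_binary(0.1, 16);   /* 0.0001100110011001... repeats */
        return 0;
    }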

  16. Floating vs. Fixed Point • Decimal, 6 digits (positive numbers) • Fixed point, with 5 digits after the decimal point: 0.00001, …, 9.99999 • Floating point, 2 digits as exponent (base 10) and 4 digits for mantissa (accuracy): 0.001×10^0, …, 9.999×10^99 • Comparison: fixed point has fixed accuracy and simple math for computation (used in systems without an FPU); floating point trades accuracy for a larger range of representation

  17. Floating Point Representation • A number is represented by a fraction f, a base b, and an exponent e: x = ±f × b^e • The fraction is usually normalized so that its leading digit is nonzero • Base b: 2 for personal computers, 16 for mainframes, … • Exponent e

  18. IEEE 754-1985 • Purpose: make floating-point systems portable • Defines the number representation, how calculations are performed, exceptions, … • Single precision (32-bit) • Double precision (64-bit)

  19. Number Representation • S: sign of mantissa • Range (roughly): single 10^-38 to 10^38; double 10^-307 to 10^307 • Precision (roughly): single 7-8 significant decimal digits; double 15 significant decimal digits
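These limits are exposed by the standard header <float.h>, so they can be checked directly on your own platform:

    #include <float.h>
    #include <stdio.h>

    int main(void) {
        printf("float : %e .. %e, %d decimal digits\n",
               FLT_MIN, FLT_MAX, FLT_DIG);
        printf("double: %e .. %e, %d decimal digits\n",
               DBL_MIN, DBL_MAX, DBL_DIG);
        return 0;
    }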

  20. Significant Digits • In the binary sense, 24 bits are significant (23 stored bits plus the implicit one; next page) • In the decimal sense, roughly 7-8 significant decimal digits (the spacing between neighboring values near 1.0 is 2^-23 ≈ 10^-7) • When you write your program, make sure the results you print carry the meaningful significant digits

  21. Implicit One • The normalized mantissa is always ≥ 1.0 (and < 2.0) • Only the fractional part is stored, which buys one extra bit of precision • Ex: 3.5 = 1.11 (base 2) × 2^1; only the fraction bits 1100…0 are stored

  22. Exponent Bias • Ex: in single precision, the exponent has 8 bits: 0000 0000 (0) to 1111 1111 (255) • An offset is added to represent both positive and negative exponents: effective exponent = biased exponent - bias • Bias value: 127 (32-bit), 1023 (64-bit) • Ex (32-bit): 1000 0000 (128) → effective exponent = 128 - 127 = 1

  23. Ex: Convert -3.5 to a 32-bit FP Number
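Working it out: the number is negative, so S = 1. |-3.5| = 3.5 = 1.11 (base 2) × 2^1, so the biased exponent is 1 + 127 = 128 = 1000 0000 (base 2), and the stored fraction (with the implicit one dropped) is 110 0000 0000 0000 0000 0000. Altogether: 1 10000000 11000000000000000000000, i.e. 0xC0600000.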

  24. Examine Bits of FP Numbers • Explain how this program works
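The program itself did not survive the transcript; what follows is a minimal sketch of such a bit examiner (memcpy is the portable way to reinterpret the 32 bits of a float):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Print the sign | exponent | fraction fields of a 32-bit float. */
    void examine(float f) {
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);
        printf("%g = ", f);
        for (int i = 31; i >= 0; --i) {
            putchar('0' + ((bits >> i) & 1u));
            if (i == 31 || i == 23)
                putchar(' ');             /* separate s | e | m */
        }
        putchar('\n');
    }

    int main(void) {
        examine(3.5f);    /* 0 10000000 11000000000000000000000 */
        examine(-3.5f);   /* 1 10000000 11000000000000000000000 */
        examine(1.0f);    /* 0 01111111 00000000000000000000000 */
        return 0;
    }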

  25. The “Examiner” • Use the previous program to: • Observe how ME works • Test subnormal behavior on your computer/compiler • Convince yourself why the subtraction of two nearly equal numbers produces lots of error • Explore NaN: Not-a-Number !?

  26. Design Philosophy of IEEE 754 • Field order: [s | e | m] • S first: whether the number is +/- can be tested easily • E before M: simplifies sorting • Negative exponents are represented by a bias (not 2's complement) for ease of sorting: [biased rep] -1, 0, 1 → 126, 127, 128; [2's compl.] -1, 0, 1 → 0xFF, 0x00, 0x01, which would require more complicated math for sorting and increment/decrement

  27. Exceptions • Overflow: ±INF, when a number exceeds the range of representation • Underflow: when numbers are too close to zero, they are treated as zero • Dwarf: the smallest representable number in the FP system • Machine Epsilon (ME): a number with computational significance (more later)

  28. Extremities (more later) • E = (1…1): M = (0…0) → infinity; M not all zeros → NaN (Not a Number) • E = (0…0): M = (0…0) → clean zero; M not all zeros → dirty zero (see next page)

  29. Not-a-Number • Produced by numerical exceptions: square root of a negative number, invalid domain of trigonometric functions, … • Often causes the program to stop running

  30. Extremities (32-bit) • Max: 1.111…1 (base 2) × 2^(254-127) = (10 - 0.000…1) (base 2) × 2^127 ≈ 2^128 • Min (without stepping into dirty zero): 1.000…0 (base 2) × 2^(1-127) = 2^-126

  31. Dirty Zero (a.k.a. denormals) • No “implicit one” • IEEE 754 did not specify compatibility for denormals • If you are not sure how to handle them, stay away from them; scale your problem properly • “Many problems can be solved by pretending as if they do not exist”

  32. Dirty Zero (cont) • Denormals fill the gap between 0 and 2^-126 on the real line, from the dwarf up to the smallest normalized number:
  2^-126 = 00000000 10000000 00000000 00000000 (smallest normalized)
  2^-127 = 00000000 01000000 00000000 00000000
  2^-128 = 00000000 00100000 00000000 00000000
  2^-129 = 00000000 00010000 00000000 00000000
  (Dwarf: the smallest representable number)

  33. Dwarf (32-bit) • Value: 2^-149 (= 2^-23 × 2^-126; bit pattern 00000000 00000000 00000000 00000001)

  34. Machine Epsilon (ME) • Definition: the smallest non-zero number that makes a difference when added to 1.0 on your working platform • This is not the same as the dwarf

  35. Computing ME (32-bit) • Keep halving eps while 1 + eps still differs from 1.0, so that 1 + eps gets closer and closer to 1.0 • ME: (00111111 10000000 00000000 00000001) - 1.0 = 2^-23 ≈ 1.19 × 10^-7
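A sketch of the usual loop (the volatile store forces a true 32-bit result even on hardware that computes intermediates in extended precision):

    #include <stdio.h>

    int main(void) {
        float eps = 1.0f;
        for (;;) {
            volatile float probe = 1.0f + eps / 2.0f;
            if (probe == 1.0f)       /* halving once more makes no difference */
                break;
            eps /= 2.0f;
        }
        printf("ME = %e\n", eps);    /* 2^-23 ~ 1.19e-07 for IEEE single */
        return 0;
    }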

  36. Effect of ME
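The slide's demonstration did not survive the transcript; one plausible illustration of the effect is that anything smaller than ME simply vanishes when added to 1.0:

    #include <float.h>
    #include <stdio.h>

    int main(void) {
        float a = 1.0f + FLT_EPSILON;         /* differs from 1.0 */
        float b = 1.0f + FLT_EPSILON / 2.0f;  /* rounds back to exactly 1.0 */
        printf("%d %d\n", a != 1.0f, b == 1.0f);  /* prints: 1 1 */
        return 0;
    }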

  37. Significance of ME • Never terminate an iteration by testing whether two FP numbers are equal • Instead, test whether |x - y| < ME
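Written out in C (this test is appropriate for values near 1.0; for general magnitudes the tolerance is usually scaled by |x| and |y|):

    #include <float.h>
    #include <math.h>
    #include <stdio.h>

    int main(void) {
        double x = 0.1 * 3.0;   /* 0.30000000000000004... */
        double y = 0.3;         /* 0.29999999999999998... */
        printf("x == y             : %d\n", x == y);                    /* 0 */
        printf("|x-y| < DBL_EPSILON: %d\n", fabs(x - y) < DBL_EPSILON); /* 1 */
        return 0;
    }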

  38. Machine Epsilon (Wikipedia) • Machine epsilon gives an upper bound on the relative error due to rounding in floating-point arithmetic.

  39. Numerical Scaling • Number density: there are as many IEEE 754 numbers in [1.0, 2.0] as there are in [256, 512] • Revisit “round-off” error: ME is a measure of the real-number density near 1.0 • Implication: scale your problem so that intermediate results lie between 1.0 and 2.0 (where numbers are dense and round-off error is smallest)

  40. Scaling (cont) • Performing computation on denser portions of the real line minimizes the round-off error • But don't overdo it; switching to double precision easily increases the precision • The densest part is near the subnormals, if density is defined as numbers per unit length

  41. How Subtraction Is Performed on Your PC • Steps: convert to base 2; equalize the exponents by adjusting the mantissa values, truncating the bits that do not fit; subtract the mantissas; normalize

  42. Subtraction of Nearly Equal Numbers • Base 10: 1.24446 - 1.24445 = 0.00001; after normalization, only one of the original six significant digits survives • Significant loss of accuracy (most bits are unreliable)

  43. Theorem of Loss of Precision • Let x and y be normalized floating-point machine numbers with x > y > 0 • If 2^-p ≤ 1 - y/x ≤ 2^-q, then at most p and at least q significant binary bits are lost in the subtraction x - y • Interpretation: “When two numbers are very close, their subtraction introduces a lot of numerical error.”

  44. Implications • Every FP operation introduces error, but the subtraction of nearly equal numbers is the worst and should be avoided whenever possible • When you program, rewrite such expressions algebraically so the cancellation is removed before any digits are lost
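The slide's before/after code did not survive the transcript; a classic instance of the rewrite it advocates is replacing sqrt(x+1) - sqrt(x), which cancels badly for large x, by its algebraically equal conjugate form:

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        double x = 1e12;
        double bad  = sqrt(x + 1.0) - sqrt(x);         /* cancellation */
        double good = 1.0 / (sqrt(x + 1.0) + sqrt(x)); /* no subtraction */
        printf("naive    : %.15e\n", bad);
        printf("conjugate: %.15e\n", good);
        return 0;
    }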

  45. Efficiency Issues • Horner scheme • Program examples

  46. Horner Scheme • For polynomial evaluation: rewrite a_n x^n + … + a_1 x + a_0 as (…(a_n x + a_(n-1)) x + …) x + a_0 • Compare efficiency: n multiplications instead of roughly n^2/2 for naive term-by-term evaluation
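A sketch comparing the two evaluation orders (coefficients stored lowest-degree first):

    #include <stdio.h>

    /* p(x) = c[0] + c[1] x + ... + c[n] x^n */

    /* Naive: recomputes x^i term by term, ~n^2/2 multiplications. */
    double eval_naive(const double *c, int n, double x) {
        double sum = 0.0;
        for (int i = 0; i <= n; ++i) {
            double xp = 1.0;
            for (int j = 0; j < i; ++j)
                xp *= x;
            sum += c[i] * xp;
        }
        return sum;
    }

    /* Horner: n multiplications and n additions. */
    double eval_horner(const double *c, int n, double x) {
        double sum = c[n];
        for (int i = n - 1; i >= 0; --i)
            sum = sum * x + c[i];
        return sum;
    }

    int main(void) {
        double c[] = { 2.0, -3.0, 0.0, 1.0 };   /* x^3 - 3x + 2 */
        printf("%g %g\n", eval_naive(c, 3, 2.0), eval_horner(c, 3, 2.0));
        return 0;
    }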

  47. Accuracy vs. Efficiency

  48. Good Coding Practice

  49. Storing Multidimensional Arrays in Linear Memory • C and others: row-major order (each row is stored contiguously; the last index varies fastest) • Fortran, MATLAB: column-major order (the first index varies fastest)
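A minimal illustration of the row-major rule in C (the flat view of a 2-D array is the standard teaching idiom here):

    #include <stdio.h>

    enum { R = 3, C = 4 };

    int main(void) {
        int a[R][C];
        int *flat = &a[0][0];     /* the same R*C ints, viewed linearly */
        flat[1 * C + 2] = 42;     /* row-major offset of (i=1, j=2) is i*C + j */
        printf("%d\n", a[1][2]);  /* prints 42 */
        return 0;
    }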

  50. On Accessing Arrays … • Which loop order is more efficient?
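A sketch that times the two loop orders (the exact numbers depend on machine and compiler, but in C the row-wise loop is typically several times faster because it touches memory sequentially):

    #include <stdio.h>
    #include <time.h>

    #define N 2000
    static double a[N][N];

    int main(void) {
        clock_t t0 = clock();
        for (int i = 0; i < N; ++i)        /* row-wise: consecutive memory */
            for (int j = 0; j < N; ++j)
                a[i][j] += 1.0;
        clock_t t1 = clock();
        for (int j = 0; j < N; ++j)        /* column-wise: strided access */
            for (int i = 0; i < N; ++i)
                a[i][j] += 1.0;
        clock_t t2 = clock();
        printf("row-wise:    %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
        printf("column-wise: %.3f s\n", (double)(t2 - t1) / CLOCKS_PER_SEC);
        return 0;
    }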
