Chapter 6-3 Divider and Floating Point

Chapter 6-3 Divider and Floating Point
• Divider
• Floating Point Representation and Numbers
• Next Lecture: Processing Unit Fundamental

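Longhand Division Examples
Decimal: 274 ÷ 13 = 21, remainder 1 (27 − 26 = 1, bring down 4, 14 − 13 = 1)
Binary: 100010010 ÷ 1101 = 10101, remainder 1 (10001 − 1101 = 100, 10000 − 1101 = 11, 1110 − 1101 = 1)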

Integer Division – Decimal Version
Divide 274 by 13 (quotient 21, remainder 1)
• We try to divide 13 into 2, and it does not work
• We try to divide 13 into 27
• Trial: multiplying 13 by 2 gives 26, and 27 − 26 = 1
• 1 is less than 13, so we enter 2 as the quotient digit and perform the subtraction
• The next digit of the dividend, 4, is brought down, and we finish by deciding that 13 goes into 14 once, leaving a remainder of 1

Integer Division – Binary Version
Divide 100010010 by 1101 (quotient 10101, remainder 1)
Implementation
• Position the divisor appropriately with respect to the dividend
• Perform a subtraction
• If the remainder is zero or positive, a quotient bit of 1 is determined
• The remainder is extended by another bit of the dividend
• The divisor is repositioned and another subtraction is performed
• If the remainder is negative, a quotient bit of 0 is determined
• The dividend is restored by adding back the divisor
• The divisor is repositioned for another subtraction

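Unsigned Binary Division Scheme: Restoring Division
[Figure: circuit arrangement for binary division. Register A (a_n a_(n-1) … a_0; remainder, initially 0) and register Q (q_(n-1) … q_0; holds the dividend, then the quotient) shift left together. An (n+1)-bit adder adds or subtracts the divisor register M (0 m_(n-1) … m_0) under the Add/Subtract signal from the control sequencer, which also sets the quotient bit q_0.]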

Restoring Binary Division
Example: 1000 ÷ 11 (8 ÷ 3). M = 00011, and each subtraction adds 11101 (the 2's complement of 00011). The shift positions the divisor appropriately.

                 A        Q
Initially        00000    1000
First cycle
  Shift          00001    000_
  Subtract       11110             (2A − M is negative)
  Set q0                  0000
  Restore (+M)   00001
Second cycle
  Shift          00010    000_
  Subtract       11111
  Set q0                  0000
  Restore (+M)   00010
Third cycle
  Shift          00100    000_
  Subtract       00001
  Set q0                  0001
Fourth cycle
  Shift          00010    001_
  Subtract       11111
  Set q0                  0010
  Restore (+M)   00010

Result: remainder A = 00010 (2), quotient Q = 0010 (2)

Unsigned Binary Division Scheme
• An n-bit positive divisor is loaded into register M
• An n-bit positive dividend is loaded into register Q
• Register A is set to 0
• After the division is complete, the n-bit quotient is in register Q and the remainder is in register A
• The required subtractions are performed using 2's complement arithmetic
• The extra bit position at the left end of both A and M accommodates the sign bit during subtractions
Algorithm: do the following n times
1. Shift A and Q left one binary position
2. Subtract M from A, and place the answer back in A
3. If the sign of A is 1, set q0 to 0 and add M back to A (restoring)
4. Otherwise, set q0 to 1
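
The four-step loop above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `restoring_divide` is made up here, and register A is modelled as an ordinary signed Python integer rather than as an (n+1)-bit 2's complement register, so "the sign of A is 1" becomes `a < 0`.

```python
def restoring_divide(dividend, divisor, n):
    """Divide two n-bit unsigned integers; return (quotient, remainder).

    Hypothetical helper for illustration: A, Q and M are plain Python
    integers, not fixed-width hardware registers.
    """
    if divisor == 0:
        raise ZeroDivisionError("divisor M must be non-zero")
    a, q, m = 0, dividend, divisor        # registers A, Q, M
    mask = (1 << n) - 1                   # keeps Q to n bits
    for _ in range(n):
        # 1. Shift A and Q left one position; the MSB of Q moves into A.
        a = (a << 1) | ((q >> (n - 1)) & 1)
        q = (q << 1) & mask
        # 2. Subtract M from A.
        a -= m
        if a < 0:
            # 3. Negative result: q0 stays 0 and A is restored by adding M.
            a += m
        else:
            # 4. Non-negative result: set q0 to 1.
            q |= 1
    return q, a


print(restoring_divide(0b1000, 0b11, 4))   # (2, 2)
print(restoring_divide(274, 13, 9))        # (21, 1)
```

The first call reproduces the 1000 ÷ 11 example (quotient 0010, remainder 00010), and the second repeats the longhand 274 ÷ 13 division with n = 9 bits.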

Improvement
• The first cycle computes 2A − M. If we shift without restoring between the first and second cycles, the result is 4A − 2M
• We can get 4A − M by adding M once in the second cycle
• Therefore, the algorithm can be improved by avoiding the need to restore A after an unsuccessful subtraction
Steps: do the following n times
1. If the sign of A is 0, shift A and Q left one bit position and subtract M from A. Otherwise, shift A and Q left and add M to A
2. If the sign of A is now 0, set q0 to 1. Otherwise, set q0 to 0
Final step: if the sign of A is 1, add M to A (to get a correct remainder)
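
A minimal sketch of these non-restoring steps follows, again assuming signed Python integers stand in for the (n+1)-bit register A; the name `nonrestoring_divide` is hypothetical.

```python
def nonrestoring_divide(dividend, divisor, n):
    """Non-restoring division of n-bit unsigned integers.

    Returns (quotient, remainder). Illustrative only: A is a signed
    Python integer, so "sign of A is 1" is written as a < 0.
    """
    if divisor == 0:
        raise ZeroDivisionError("divisor M must be non-zero")
    a, q, m = 0, dividend, divisor
    mask = (1 << n) - 1
    for _ in range(n):
        negative = a < 0                      # sign of A before the shift
        # Shift A and Q left; the MSB of Q becomes the LSB of A.
        a = (a << 1) | ((q >> (n - 1)) & 1)
        q = (q << 1) & mask
        # Add M after a negative partial remainder, otherwise subtract M.
        a = a + m if negative else a - m
        if a >= 0:
            q |= 1                            # set q0 to 1
    if a < 0:
        a += m                                # final step: correct the remainder
    return q, a


print(nonrestoring_divide(0b1000, 0b11, 4))   # (2, 2)
```

Compared with the restoring version, each cycle performs exactly one addition or subtraction; the only extra addition is the single remainder correction at the end.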

Non-Restoring Example
Example: 1000 ÷ 11 (8 ÷ 3). M = 00011, and subtraction adds 11101 (the 2's complement of 00011).

                 A        Q
Initially        00000    1000
First cycle
  Shift          00001    000_
  Subtract       11110
  Set q0                  0000
Second cycle
  Shift          11100    000_
  Add            11111
  Set q0                  0000
Third cycle
  Shift          11110    000_
  Add            00001
  Set q0                  0001
Fourth cycle
  Shift          00010    001_
  Subtract       11111
  Set q0                  0010     (quotient)
Restore remainder
  Add            00010             (11111 + 00011)

Result: remainder A = 00010 (2), quotient Q = 0010 (2)

Unsigned Binary Division Technique
• There are no simple algorithms for performing signed division that are comparable to the algorithms for signed multiplication
• If you have any ideas, please come to me and let's discuss ($$$)
• In a division, the operands can be preprocessed to transform them into positive values
• After using one of the previous algorithms, the results are transformed to the correct signed values as needed

Why Use Floating Point Numbers
• In the 2's complement system, the signed value F represented by the n-bit binary fraction B = b_0 . b_-1 b_-2 … b_-(n-1) lies in the range −1 ≤ F ≤ 1 − 2^-(n-1)
• Consider the range of values representable in a 32-bit signed, fixed point format: interpreted as integers, the value range is approximately 0 to ±2.15×10^9; interpreted as fractions, it is approximately ±4.66×10^-10 to ±1
• Neither of these ranges is sufficient for scientific calculations
• Example: Avogadro's Number: 6.0247 × 10^23 mole^-1
=> floating point numbers

Representing Floating Point Numbers
• Because the position of the binary point in a floating point number is variable, it must be given explicitly in the floating point representation
• For example, 6.0247 is given to five significant digits
• The scale factor 10^23 indicates the position of the decimal point with respect to the significant digits
• By convention, when the decimal point is placed to the right of the first (non-zero) significant digit, the number is said to be normalized
• General form and size for numbers in the decimal system: ±X1.X2X3X4X5X6X7 × 10^(±Y1Y2), where Xi and Yi are decimal digits
• It is possible to approximate this mantissa precision and scale factor range in a binary representation that occupies 32 bits

Floating Point (a brief look)
• We need a way to represent
  • numbers with fractions, e.g., 3.1416
  • very small numbers, e.g., 0.000000001, 6.0247 × 10^-23
  • very large numbers, e.g., 3.15576 × 10^9, 6.0247 × 10^23
• Representation:
  • sign, exponent, significand: (–1)^sign × significand × 2^exponent
  • more bits for significand gives more accuracy
  • more bits for exponent increases range
• IEEE 754 floating point standard:
  • single precision: 8 bit exponent, 23 bit significand
  • double precision: 11 bit exponent, 52 bit significand
  • Significand = (1 + fraction), with the leading 1 defined implicitly

IEEE 754 Standard Floating-point Representation
Single Precision (32-bit)
• bit 31: S (1 bit)
• bits 30–23: exponent (8 bits)
• bits 22–0: significand (23 bits)
• value = (–1)^sign × (1 + fraction) × 2^(exponent − 127)
Double Precision (64-bit)
• bit 63: S (1 bit)
• bits 62–52: exponent (11 bits)
• bits 51–32: significand (20 bits)
• bits 31–0: significand, continued (32 bits)
• value = (–1)^sign × (1 + fraction) × 2^(exponent − 1023)

Representing Floating Point Numbers
• A 24-bit mantissa can approximately represent a 7 digit decimal number
• An 8 bit exponent to an implied base 2 provides a scale factor with a reasonable range
• One bit is needed for the sign of the number
• Since the leading non-zero bit of a normalized binary mantissa must be 1, it does not have to be included explicitly in the representation, so only 23 mantissa bits are stored

Floating Point Normalization in IEEE Single-Precision Format
• If a number is not normalized, it can always be put in normalized form by shifting the fraction and adjusting the exponent

Representing Floating Point Numbers
• Instead of the signed exponent, E, the value actually stored in the exponent field is an unsigned integer E' = E + 127, called the excess-127 format
• E' is in the range 0 ≤ E' ≤ 255
• The end values of E' (i.e., 0 and 255) are reserved for special values such as exact 0 and infinity
• The range of E' for normal values is 0 < E' < 255
• Thus, the actual exponent, E, is in the range −126 ≤ E ≤ 127
• The excess-x representation enables efficient comparison of the relative sizes of two floating point numbers (e.g., sorting)
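
The comparison point can be checked with a short experiment. The sketch below assumes Python's standard `struct` module to obtain single-precision bit patterns; `bits_of` is a hypothetical helper name, and only positive values are used, since negative numbers need the sign bit handled separately.

```python
import struct


def bits_of(x):
    """Return the 32-bit single-precision pattern of x as an unsigned int."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


values = [0.15625, 1.0, 1.5, 2.0, 1024.0, 3.0e38]   # already in increasing order
patterns = [bits_of(v) for v in values]

# Stored (excess-127) exponents E' of each value.
stored = [(p >> 23) & 0xFF for p in patterns]
print(stored)                        # [124, 127, 127, 128, 137, 254]

# Because E' is unsigned and sits above the fraction bits, the raw
# patterns of positive numbers sort in the same order as the values.
print(patterns == sorted(patterns))  # True
```

Because the biased exponent is never negative, an ordinary unsigned integer comparison of the bit patterns orders positive floating point numbers correctly.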

Representing Floating Point Numbers
• The last 23 bits represent the mantissa
• Since binary normalization is used, the most significant bit of the mantissa is always equal to 1
• The 23 bits stored in the M field represent the fractional part of the mantissa (the bits to the right of the binary point)
• This 32 bit representation is called single precision because it occupies a single 32-bit word
• The scale factor has a range of 2^-126 to 2^+127, which is approximately 10^±38
• The 24 bit mantissa provides approximately the same precision as a 7 digit decimal value

IEEE 754 floating-point standard
• Leading “1” bit of significand is implicit (e.g., 1.1011 is stored as 1011)
• Exponent is “biased” to make sorting easier
  • all 0s is smallest exponent; all 1s is largest
  • bias of 127 for single precision and 1023 for double precision
  • summary: (–1)^sign × (1 + fraction) × 2^(exponent − bias) (a.k.a. a normalized number, because of the leading 1 as in scientific notation)
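
To make the summary formula concrete, the following sketch splits a single-precision pattern into its three fields and evaluates the formula again. It assumes Python's standard `struct` module; the helper names `decode_single` and `normalized_value` are made up for this illustration, and the formula applies only to normalized numbers (stored exponent strictly between 0 and 255).

```python
import struct


def decode_single(x):
    """Return (sign, stored exponent, fraction bits) of x in single precision."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF      # biased exponent, 8 bits
    fraction = bits & 0x7FFFFF          # 23 stored fraction bits
    return sign, exponent, fraction


def normalized_value(sign, exponent, fraction, bias=127):
    """Evaluate (-1)^sign x (1 + fraction) x 2^(exponent - bias)."""
    return (-1) ** sign * (1 + fraction / 2**23) * 2.0 ** (exponent - bias)


fields = decode_single(6.5)             # 6.5 = 1.101 (binary) x 2^2
print(fields)                           # (0, 129, 5242880)
print(normalized_value(*fields))        # 6.5
```

The stored exponent 129 is the true exponent 2 plus the bias 127, and the fraction 5242880 (0x500000) holds the bits 101 that follow the implicit leading 1.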

IEEE 754 Floating-Point Conversion
• Steps
  • Any number
  • Binary conversion
  • Normalization
  • Put that in 32-bit (single) or 64-bit (double) expression
    • Mind excess-format (add 127 for exponent)
    • Leading “1” bit of significand is implicit
• Example:
  • decimal: −.75 = −3/4 = −3/2^2 = −(1/2 + 1/2^2)
  • binary: −.11 = −1.1 × 2^-1 (normalization)
  • floating point: exponent = −1 + 127 = 126 = 01111110
  • IEEE single precision: 1 01111110 10000000000000000000000 = 10111111010000000000000000000000

IEEE 754 Standard Example (single precision)
(−0.75)10 = (????????)16
(−0.75)10 = (BF400000)16

IEEE 754 Standard Example (single precision)
(23.15625)10 = (????????)16
(23.15625)10 = (41B94000)16
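
The conversion steps and the two single-precision examples can be reproduced with a small sketch. The function `encode_single` is a hypothetical helper written for this illustration: it uses exact rational arithmetic from Python's `fractions` module, performs no rounding, and therefore only accepts non-zero values that fit a normalized single-precision number exactly.

```python
from fractions import Fraction


def encode_single(value):
    """Encode a non-zero, exactly representable value as a 32-bit pattern."""
    x = Fraction(value)
    if x == 0:
        raise ValueError("zero uses a special encoding, not handled here")
    sign = 1 if x < 0 else 0
    x = abs(x)
    # Normalization: scale x into [1, 2) and track the true exponent E.
    exponent = 0
    while x >= 2:
        x /= 2
        exponent += 1
    while x < 1:
        x *= 2
        exponent -= 1
    # Drop the implicit leading 1 and keep 23 fraction bits.
    fraction = (x - 1) * 2**23
    if fraction.denominator != 1 or not -126 <= exponent <= 127:
        raise ValueError("value is not exactly representable (no rounding here)")
    # Excess-127: the stored exponent is E + 127.
    return (sign << 31) | ((exponent + 127) << 23) | int(fraction)


print(f"{encode_single(-0.75):08X}")      # BF400000
print(f"{encode_single(23.15625):08X}")   # 41B94000
```

For 23.15625 the binary form is 10111.00101 = 1.011100101 × 2^4, so the stored exponent is 4 + 127 = 131 = 10000011.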

Double Precision
Format (64 bits): S (sign, 1 bit) | E' (11-bit excess-1023 exponent) | M (52-bit mantissa fraction)
Value represented = ±1.M × 2^(E' − 1023)
• The double precision format has increased exponent and mantissa ranges
• The 11-bit excess-1023 exponent E' has the range 0 < E' < 2047 for normal values
• 0 and 2047 are used to indicate the special values 0 and infinity

Double Precision
• The actual exponent E is in the range −1022 ≤ E ≤ 1023, providing scale factors of 2^-1022 to 2^+1023 (approximately 10^±308)
• The 52 bit mantissa provides a precision equivalent to about 16 decimal digits
Exceptions
• Five exceptions: invalid operation, division by 0, overflow, underflow and inexact
• An invalid operation occurs, for example, when the user attempts to take the square root of a negative number
• A division by 0 exception occurs when the divisor is 0 and the numerator is finite and non-zero; the result is infinity

IEEE 754 Standard Encoding
• NaN: (infinity − infinity), or 0/0
• Denormalized number = (−1)^sign × 0.M × 2^(1 − bias)
• smallest: ±2^−149 ≈ ±1.4012985×10^−45
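
These special encodings can be inspected directly. The sketch below assumes Python's standard `struct` and `math` modules; `from_bits` and `denormal_value` are hypothetical helper names, and the bit patterns are the usual single-precision ones (exponent field all 1s with zero fraction for infinity, exponent field all 0s for denormalized numbers).

```python
import math
import struct


def from_bits(bits):
    """Interpret a 32-bit pattern as a single-precision value."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]


def denormal_value(sign, fraction, bias=127):
    """Evaluate (-1)^sign x 0.M x 2^(1 - bias) for a 23-bit fraction M."""
    return (-1) ** sign * (fraction / 2**23) * 2.0 ** (1 - bias)


inf = from_bits(0x7F800000)          # exponent field 255, fraction 0
nan = inf - inf                      # infinity - infinity is not a number
smallest = from_bits(0x00000001)     # exponent field 0, fraction 0...01

print(math.isinf(inf), math.isnan(nan))   # True True
print(smallest == 2.0 ** -149)            # True
print(smallest == denormal_value(0, 1))   # True
```

The smallest denormalized value has 0.M = 2^-23, which combined with the fixed scale 2^-126 gives 2^-149.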

Precision Issues
• Cannot represent all possible real numbers; they are infinite!
• Must sacrifice precision when representing FP numbers in some cases
  • Precision lost when integer portion is too large
  • Precision lost when fraction portion is too small
• Example
  • How to represent 2^24 and 2^24 + 1?
    • Both = 4B800000 in single precision
  • How to represent 2^-127? (use a denormalized number: 0.1 × 2^-126)
  • How about 2^-150? (use a denormalized number? What is the smallest denormalized number?)
    • Smallest: ±2^−149 ≈ ±1.4012985×10^−45
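
The examples above can be checked by packing the values into single precision. This sketch assumes Python's standard `struct` module and the default round-to-nearest-even behaviour of the underlying hardware; `single_hex` is a hypothetical helper name.

```python
import struct


def single_hex(x):
    """Return the single-precision encoding of x as 8 hex digits."""
    return struct.pack(">f", x).hex().upper()


print(single_hex(2.0 ** 24))        # 4B800000
print(single_hex(2.0 ** 24 + 1))    # 4B800000 (the +1 is lost)
print(single_hex(2.0 ** -127))      # 00400000 (denormalized: 0.1 x 2^-126)
print(single_hex(2.0 ** -149))      # 00000001 (smallest denormalized number)
print(single_hex(2.0 ** -150))      # 00000000 (too small: rounds to zero)
```

With a 24-bit mantissa, 2^24 + 1 would need 25 significant bits, so it rounds to its neighbour 2^24; 2^-150 lies halfway between 0 and the smallest denormalized number and rounds to 0 under round-to-nearest-even.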

Floating Point Complexities
• Operations are somewhat more complicated
• In addition to overflow, we can have “underflow”
• Accuracy can be a big problem
  • IEEE 754 keeps two extra bits, guard and round
  • four rounding modes
  • positive divided by zero yields “infinity”
  • zero divided by zero yields “not a number”
  • other complexities
• Implementing the standard can be tricky
• Not using the standard can be even worse
  • Pentium bug!

Basic Operation with Floating Point
• Since the scale factor is in the form 2^i, shifting the mantissa right or left by one bit is compensated by an increase or a decrease of 1 in the exponent
• As computations proceed, a number that does not fall in the representable range might be generated
• In single precision, this means that its normalized representation requires an exponent less than −126 (underflow) or greater than 127 (overflow)
Arithmetic operations on floating point numbers
• If the exponents differ, mantissas must be shifted with respect to each other before they are added or subtracted

Arithmetic Operations with Floating Point
Add/subtract
1. Choose the number with the smaller exponent and shift its mantissa right a number of steps equal to the difference in exponents
2. Set the exponent of the result equal to the larger exponent
3. Perform addition/subtraction on the mantissas and determine the sign of the result
4. Normalize the resulting value
Multiply
1. Add the exponents and subtract 127 (each stored exponent already includes the excess of 127, so the sum includes it twice)
2. Multiply the mantissas and determine the sign of the result
3. Normalize the resulting value
Divide
1. Subtract the exponents and add 127 (the two excesses cancel in the difference)
2. Divide the mantissas and determine the sign of the result
3. Normalize the resulting value
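
The multiply rule can be sketched on unpacked single-precision operands. Everything in this block is an illustrative assumption rather than a description of any particular hardware: a number is a tuple (sign, stored exponent E', 24-bit mantissa including the leading 1), the names `fp_multiply` and `to_float` are made up, extra product bits are truncated, and overflow, underflow and special values are ignored. Since each stored exponent already contains the excess of 127, the sum E'A + E'B contains it twice and one copy is removed.

```python
BIAS = 127


def to_float(num):
    """Value of (sign, stored exponent, 24-bit mantissa with leading 1)."""
    sign, exponent, mantissa = num
    return (-1) ** sign * (mantissa / 2**23) * 2.0 ** (exponent - BIAS)


def fp_multiply(a, b):
    """Multiply two unpacked numbers; the result mantissa is truncated."""
    sa, ea, ma = a
    sb, eb, mb = b
    sign = sa ^ sb                 # signs differ -> negative product
    exponent = ea + eb - BIAS      # sum holds the excess twice; remove one
    product = ma * mb              # up to 48 bits, value in [1, 4) x 2^46
    if product >> 47:              # product >= 2: normalize right by one bit
        mantissa = product >> 24
        exponent += 1
    else:
        mantissa = product >> 23
    return sign, exponent, mantissa


three = (0, 128, 0xC00000)          # +1.5  x 2^1 =  3.0
minus_2_5 = (1, 128, 0xA00000)      # -1.25 x 2^1 = -2.5
result = fp_multiply(three, minus_2_5)
print(result == (1, 129, 0xF00000))   # True
print(to_float(result))               # -7.5
```

A divide sketch would mirror this with `ea - eb + BIAS`, because the two excesses cancel in the difference and one must be put back.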

Implementing Floating Point Operations
• Hardware implementation of FP operations involves considerable circuitry
• Operations can also be implemented in software
• In most modern computers, FP operations are available at the machine-instruction level and are implemented in hardware

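FPU
[Figure: floating-point addition-subtraction unit. Inputs: 32-bit operands A: SA, E'A, MA and B: SB, E'B, MB. An 8-bit subtractor forms n = E'A − E'B; its sign goes to the SWAP block and its magnitude to the SHIFTER, which shifts the mantissa M of the number with the smaller E' n bits to the right, while the M of the number with the larger E' goes directly to the mantissa adder/subtractor. A combinational CONTROL network takes SA, SB, the Add/Subtract selection and the exponent-comparison sign, and produces the Add/Sub signal and the result sign. A MUX selects E'A or E'B as the tentative exponent E'. The magnitude M from the adder/subtractor feeds the leading zeros detector (output X) and the normalize-and-round block, and a second 8-bit subtractor forms E' − X. Output: 32-bit result R: SR, E'R, MR, with R = A + B.]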

FPU Operations
Step 1: Compare exponents to determine how far to shift the mantissa of the number with the smaller exponent
• The count n = E'A − E'B is determined by the 8-bit subtractor
• The magnitude of the difference, n, is sent to the SHIFTER
• If n is larger than the number of significant bits of the operands, the answer is the larger operand
• The sign of the difference that results from comparing exponents determines which mantissa is to be shifted
• The sign is sent to the SWAP block
• If the sign is 0, then E'A ≥ E'B and the mantissas MA and MB pass straight through the SWAP block
  • Thus MB is sent to the SHIFTER to be shifted n positions to the right
  • The other mantissa, MA, is sent directly to the mantissa adder/subtractor
• If the sign is 1, then E'A < E'B and the mantissas are swapped before they are sent to the SHIFTER

FPU Operation
Step 2: performed by the two-way multiplexer, MUX
• The exponent of the result, E', is tentatively determined as E'A if E'A ≥ E'B, or E'B if E'A < E'B
• This is based on the sign of the difference resulting from comparing exponents in step 1
Step 3: the control logic determines whether the mantissas are to be added or subtracted
• This is decided by the signs of the operands, SA and SB, and the operation (add or subtract) to be performed on the operands (i.e., (±A) ± (±B))
• The control logic also determines the sign of the result, SR

FPU Operation
• If A is negative (SA = 1), B is positive (SB = 0) and the operation is A − B
  • Then the mantissas are added and the sign of the result is negative
• If A and B are both positive (SA = SB = 0) and the operation is A − B, then the mantissas are subtracted
  • The sign of the result, SR, depends on the mantissa subtraction operation
  • If E'A > E'B, then MA − (shifted MB) is positive, and the result is positive
  • If E'B > E'A, then MB − (shifted MA) is positive, and the result is negative
• The sign from the exponent comparison is also required as input to the control block
• When E'A = E'B and the mantissas are subtracted, the sign of the mantissa adder-subtractor output determines the sign of the result

FPU Operation
Step 4: normalize the result of step 3
• If there are leading zeros in M, the number of leading zeros determines the number of bit shifts, X, to be applied to M
• The normalized value is truncated to generate the 24-bit mantissa, MR, of the result
• The value X is also subtracted from the tentative result exponent E' to generate the true result exponent, E'R
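
Steps 1 to 4 can be followed end to end in a short software sketch. It is not a model of the actual circuit: numbers are assumed to be tuples (sign, stored exponent E', 24-bit mantissa including the leading 1), the names `fp_add` and `to_float` are hypothetical, shifted-out bits are simply truncated (no guard bits), and overflow, underflow and special values are ignored. The sketch also shifts right by one position when the mantissa sum carries out of 24 bits, a case the step 4 description above does not spell out.

```python
def to_float(num):
    """Value of (sign, stored exponent, 24-bit mantissa with leading 1)."""
    sign, exponent, mantissa = num
    return (-1) ** sign * (mantissa / 2**23) * 2.0 ** (exponent - 127)


def fp_add(a, b):
    """Compute A + B on unpacked operands (for A - B, flip B's sign first)."""
    sa, ea, ma = a
    sb, eb, mb = b
    # Step 1: make A the operand with the larger exponent (the SWAP role),
    # then shift the other mantissa right by n = E'A - E'B (the SHIFTER).
    if ea < eb:
        (sa, ea, ma), (sb, eb, mb) = (sb, eb, mb), (sa, ea, ma)
    mb >>= ea - eb
    # Step 2: the tentative result exponent is the larger exponent.
    exponent = ea
    # Step 3: add or subtract the magnitudes and determine the sign.
    if sa == sb:
        mantissa, sign = ma + mb, sa
    else:
        mantissa, sign = ma - mb, sa
        if mantissa < 0:               # only possible when E'A = E'B
            mantissa, sign = -mantissa, sb
    if mantissa == 0:
        return 0, 0, 0                 # exact zero
    # Step 4: normalize, adjusting the exponent for each shift.
    while mantissa >= 1 << 24:         # carry out of the 24-bit mantissa
        mantissa >>= 1
        exponent += 1
    while mantissa < 1 << 23:          # leading zeros: shift left X times
        mantissa <<= 1
        exponent -= 1
    return sign, exponent, mantissa


a = (0, 129, 0xD00000)                 # +1.625 x 2^2  = 6.5
b = (1, 126, 0xC00000)                 # -1.5   x 2^-1 = -0.75
print(to_float(fp_add(a, b)))          # 5.75
```

Here n = 129 − 126 = 3, so the mantissa of B is shifted right three places before the subtraction, and the result needs no further normalization.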

Example
We have discussed the IEEE 32-bit format for floating-point numbers. Here we use a shortened format that retains all the pertinent concepts but is manageable for working through numerical exercises. Consider floating-point numbers represented in a 12-bit format as (1-bit sign | 5-bit exponent | 6-bit mantissa). The 5-bit exponent is a binary excess-15 number. The 6-bit mantissa is normalized the same way as in the IEEE 32-bit format. The exponents 00000 and 11111 represent exact zero and infinity, respectively.
A. Represent the following numbers in this format.
   −7 = ____________________________
   1/32 = ____________________________
B. What are the least and the greatest negative numbers (other than 0 and infinity, respectively) representable in this format?
   least negative number (its absolute value is closest to zero) = _________________________
   greatest negative number (its absolute value is closest to infinity) = _________________________
C. Perform Add and Multiply operations on the operands in the format described above, using the proper algorithms learned in class. Assume there are no guard bits.
   A = 0 10010 011010
   B = 1 10000 101100
   A + B = _________________
   A × B = _________________
