1 / 47

Floating Point Arithmetic

Presentation Transcript


  1. Floating Point Arithmetic • "Though this be madness, yet there is method in 't."

  2. Use and Distribution Notice • Possession of any of these files implies understanding and agreement to this policy. • The slides are provided for the use of students enrolled in Jeff Six's Computer Architecture class (CMSC 411) at the University of Maryland Baltimore County. They are the creation of Mr. Six and he reserves all rights as to the slides. These slides are not to be modified or redistributed in any way. All of these slides may only be used by students for the purpose of reviewing the material covered in lecture. Any other use, including but not limited to, the modification of any slides or the sale of any slides or material, in whole or in part, is expressly prohibited. • Most of the material in these slides, including the examples, is derived from Computer Organization and Design, Second Edition. Credit is hereby given to the authors of this textbook for much of the content. This content is used here for the purpose of presenting this material in CMSC 411, which uses this textbook.

  3. Floating Point Numbers • Now we have explored integers – logical operations and all four major arithmetic operations. • We will now move on to real numbers – more often called floating point numbers in computer terminology. • Recall that we can always express real numbers in scientific notation: 0.000000001 = 1.0 x 10^-9 and 3,155,760,000 = 3.15576 x 10^9. • A number in scientific notation with no leading zeros is called a normalized number: 1.0 x 10^-9 is normalized, while 10.0 x 10^-10 is not.

  4. Computers and Normalized Numbers • Since computers only deal with binary numbers, we express them using normalized scientific notation, using a binary point. • Why always use this notation? • It simplifies the exchange of data as all floating point numbers are in the same form. • It simplifies the floating point algorithms since the numbers are always in this form. • It increases the accuracy of the numbers that can be stored in a word as all leading zeros are removed, freeing up that space for digits to the right of the binary point.

  5. Design Decisions • We need to store three pieces of information when storing a floating point number… • The sign (positive/negative) • The exponent • The significand (mantissa) • So for a fixed number of bits (say, a word), how do we decide how many bits store the exponent and how many store the significand? • Increasing the size of the significand increases the accuracy of the numbers we can represent. • Increasing the size of the exponent increases the range of the numbers we can represent. • It's a tradeoff (just like every other design decision).

  6. The IEEE 754 Floating Point Standards • The IEEE has considered these design choices and recommended a tradeoff value of 8 bits for the exponent and 23 bits for the significand (with the assumption of a 32-bit word). • This is the format used by MIPS and almost every computer used since 1980 – it is that good of a tradeoff decision. • The layout is [ s (1 bit) | exponent (8 bits) | significand (23 bits) ]. • The number represented is simply (-1)^S x F x 2^E, where S, F, and E are related to the s, exponent, and significand fields (a one in the sign bit signifies a negative value).

  7. Floating Point Exceptions • The IEEE 754 standard allows a large range of real numbers to be expressed, from fractions as small as 2.0 x 10^-38 to numbers as large as 2.0 x 10^38. • Note that we have a large range, but not an infinite range… • Floating point overflow occurs when the computed exponent is too large to be represented in the exponent field (too big of a number). • Floating point underflow occurs when the computed exponent is too small to be represented in the exponent field (too small of a fraction).

  8. Double Precision • To help deal with these issues, the IEEE 754 standards include a double precision specification, where two words are used to represent the number. • Here, the exponent grows to 11 bits and the significand grows to 52 bits… • The first word holds [ s (1 bit) | exponent (11 bits) | significand (20 bits) ] and the second word holds the remaining 32 bits of the significand.

  9. Double Precision Advantages • The double precision standard allows us to express a greater range of numbers, from fractions as small as 2.0 x 10^-308 to numbers as large as 2.0 x 10^308. • Although this form does increase the exponent range, its primary advantage is its (much) greater accuracy, due to the significantly larger significand.

  10. A Slight Optimization • Ever trying to save bits, the designers of IEEE 754 noticed that normalized binary numbers always have a one as the bit to the left of the binary point. • This one is thus implied – IEEE 754 numbers actually have 24 significand bits (the implied one and the 23 stored fraction bits) in single precision and 53 significand bits (the implied one and the 52 stored fraction bits) in double precision. • Zero is the special case; it is represented with a zero in the exponent (and everywhere else).

  11. More Optimizations… • The significand has an implied one, making our representation expression… (-1)^S x (1 + significand) x 2^E, where the bits of the significand represent the fraction between zero and one. • To simplify comparison operations, the designers of this standard wanted an easy (quick) way to compare two floating point numbers. • This is why the sign bit is stored in the MSB – a test of less than, greater than, or equal to zero can be performed quickly.

  12. Comparison Optimizations • The exponent comes before the significand – this also eases comparison as an integer comparison of the exponent can lead to a quick result. • It’s still not as easy as two’s complement numbers as we need to look at the sign bit and the magnitude of the exponent. • This works great – if both exponents are positive. What about negative exponents? How are these encoded? Remember, we need to be able to easily compare two exponent values to determine their relationship.

  13. Exponent Encoding • If we did the most obvious thing and used two's complement to encode exponents, we run into the issue that a negative exponent will look like a big number (as it will have a one in the MSB). • For instance, if we encoded 1.0 x 2^-1 and 1.0 x 2^1 as IEEE 754 floating point numbers with two's complement-encoded exponents… • 1.0 x 2^-1: [ s=0 | exponent=1111 1111 | significand=0000 0000 0000 0000 0000 000 ] – this number is actually smaller. • 1.0 x 2^1: [ s=0 | exponent=0000 0001 | significand=0000 0000 0000 0000 0000 000 ]

  14. Exponent Encoding • The desirable notation is for the most negative exponent to be encoded as 00…00 and for the most positive exponent to be encoded as 11..11. • This convention is known as biased encoding – a bias value is added to the normal, unsigned binary representation of the number to form the biased representation. • The IEEE 754 standard uses a bias of 127 (which results in the encoding set we want). • So, a –1 is encoded as (-1)+127=126 (0111 1110) and 1 is encoded as 1+127=128 (1000 0000).
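
To make the bias concrete, here is a minimal Python sketch of encoding and decoding a single precision exponent (Python and the helper names are illustrative choices, not anything defined by the standard):

  # IEEE 754 single precision exponent bias.
  BIAS = 127

  def encode_exponent(e):
      # Map a true exponent to its 8-bit biased field.
      return e + BIAS

  def decode_exponent(field):
      # Map an 8-bit biased field back to the true exponent.
      return field - BIAS

  print(encode_exponent(-1))   # 126, i.e. 0111 1110
  print(encode_exponent(1))    # 128, i.e. 1000 0000
  # The point of the bias: unsigned comparison of the fields orders the exponents.
  print(encode_exponent(-1) < encode_exponent(1))   # True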

  15. Biased Encoding in IEEE 754 • This means that the expression to determine the value encoded in an IEEE 754 value is… (-1)^S x (1 + significand) x 2^(exponent - bias) • The same form is true for double precision numbers, with the exception of the bias value. The bias value for double precision values is 1023 (which again results in an encoding where 00…00 is the most negative exponent and 11…11 is the most positive exponent).

  16. Example: Encoding IEEE 754 • Let's encode -0.75 (decimal) into IEEE 754. • In binary… -0.11. • In normalized scientific notation… -1.1 x 2^-1. • The expression we need to form is… (-1)^S x (1 + significand) x 2^(exponent - bias) • Filling in the appropriate values… (-1)^1 x (1 + .1000 0000 0000 0000 0000 000) x 2^(126 - 127) • So, in single precision IEEE 754… [ s=1 | exponent=0111 1110 | significand=1000 0000 0000 0000 0000 000 ] • And in double precision IEEE 754… [ s=1 | exponent=011 1111 1110 | significand=1000 0000 0000 0000 0000 followed by 32 more zero bits ]
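
One way to check this encoding (a sketch using Python's struct module; the variable names are just for illustration) is to pack -0.75 as a single precision value and pick the fields apart:

  import struct

  # View -0.75 as its 32-bit IEEE 754 single precision pattern (big-endian).
  bits = int.from_bytes(struct.pack('>f', -0.75), 'big')
  sign     = bits >> 31              # 1
  exponent = (bits >> 23) & 0xFF     # 126  (0111 1110)
  fraction = bits & 0x7FFFFF         # 0x400000, i.e. .1000...0
  print(f'{bits:032b}')              # 10111111010000000000000000000000
  print(sign, exponent, hex(fraction))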

  17. Example: Decoding IEEE 754 • Let's decode this IEEE 754 encoded number… [ s=1 | exponent=1000 0001 | significand=0100 0000 0000 0000 0000 000 ] • We have our expression for this type of representation… (-1)^S x (1 + significand) x 2^(exponent - bias) • Filling in the values… (-1)^1 x (1 + 0.25) x 2^(129 - 127) • Solving this expression… -1 x 1.25 x 2^2 = -1.25 x 4 = -5.0 • So, this encoded value represents -5.0 (decimal).
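
The same decoding can be written directly from the expression above; this short Python sketch (an illustration, not anything from the slides) recovers -5.0 from the bit pattern shown:

  # Decode a 32-bit IEEE 754 single precision pattern by hand.
  bits = 0b1_10000001_01000000000000000000000   # the pattern decoded above

  sign     = bits >> 31                  # 1
  exponent = (bits >> 23) & 0xFF         # 129
  fraction = (bits & 0x7FFFFF) / 2**23   # 0.25

  value = (-1)**sign * (1 + fraction) * 2**(exponent - 127)
  print(value)                           # -5.0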

  18. Floating Point Addition • Now that we know how floating point numbers can be stored, let's move on to the addition of floating point numbers. • A good way to understand how computers do floating point addition is to perform such addition ourselves, step-by-step. • Then we can look at adding hardware to an ALU to perform the steps we do by hand. • Let's work step-by-step through adding 9.999 x 10^1 + 1.610 x 10^-1 (we'll use decimal, but binary numbers can be added using the same steps, as we'll see).

  19. One More Little Thing • Computers store floating point numbers in fixed-size memory locations. • For our example, let's assume we are using a format that can only store four decimal digits of the significand and two decimal digits of the exponent. • The same principles will apply to binary numbers using the IEEE 754 standard – this is just a simplified example to illustrate the process and the tradeoffs of limited storage.

  20. Floating Point Addition, Step 1: Align the Decimal Points • To properly add the numbers, we need to align the decimal point of the number that has the smaller exponent to match that of the number with the larger exponent. • In our example, we need to align 1.610 x 10^-1's exponent to match that of 9.999 x 10^1. • 1.610 x 10^-1 = 0.1610 x 10^0 = 0.01610 x 10^1 • Remember we can only store four digits in the significand, so we get the value 0.016 x 10^1 (note that we lost some precision due to the simplified constraints of our hardware).

  21. Floating Point Addition, Step 2: Add the Significands • Now that the exponents are aligned, we can add the significands: 9.999 + 0.016 = 10.015. • So, the sum we just computed is 10.015 x 10^1.

  22. Floating Point Addition, Step 3: Normalize the Sum • We need to normalize the sum, as it is not in normalized scientific notation. • 10.015 x 10^1 = 1.0015 x 10^2 • Remember that we need to check for underflow or overflow. In this case, the computed exponent of 2 fits in the storage requirements we outlined, so neither of these errors has occurred.

  23. Floating Point Addition, Step 4: Make It Fit • We have violated our significand storage requirement, so we must round the number. • Using old rounding rules from elementary school, 1.0015 x 10^2 rounds up to 1.002 x 10^2. • Notice that some "bad luck" rounding decisions can result in a non-normalized number (like rounding up a string of 9s). If this happens, we would need to do step 3 again.

  24. The Floating Point Addition Algorithm • Start. • 1. Compare the exponents of the two numbers; shift the smaller number to the right until its exponent matches the larger exponent. • 2. Add the significands. • 3. Normalize the sum, either shifting right and incrementing the exponent or shifting left and decrementing the exponent. • If overflow or underflow occurs, signal an exception. • 4. Round the significand to the appropriate number of bits. • If the result is no longer normalized, go back to step 3; otherwise, done.
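
The four steps translate almost line for line into code. Below is a toy Python sketch of the decimal version used in the example – 4-digit significands kept as integers scaled by 1000, round half up – meant only as an illustration of the algorithm, not of the binary hardware:

  # Toy decimal FP adder: a value is (sig / 1000) * 10**exp, with 1000 <= sig <= 9999.
  def fp_add(sig_a, exp_a, sig_b, exp_b):
      # Step 1: align - make (sig_a, exp_a) the operand with the larger exponent,
      # then shift the other significand right, losing the shifted-out digits.
      if exp_a < exp_b:
          sig_a, exp_a, sig_b, exp_b = sig_b, exp_b, sig_a, exp_a
      sig_b //= 10 ** (exp_a - exp_b)
      # Step 2: add the significands.
      sig, exp = sig_a + sig_b, exp_a
      # Step 3: normalize (the sum of two normalized values can carry out one digit).
      if sig >= 10000:
          # Step 4: round (half up) while shifting the extra digit back out.
          sig = (sig + 5) // 10
          exp += 1
      return sig, exp

  # 9.999 x 10^1 + 1.610 x 10^-1  ->  (1002, 2), i.e. 1.002 x 10^2
  print(fp_add(9999, 1, 1610, -1))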

  25. Hardware for FP Addition • Many computer designs include hardware for fast floating point operations, such as addition. • The generic design of this hardware includes two ALUs, a control unit, a shift register, a mini-ALU (an increment/decrement unit), and (complex) rounding hardware.

  26. Hardware for FP Addition • [Block diagram: a small ALU compares the two exponents to produce the exponent difference; control logic shifts the smaller number's significand right by that amount; the big ALU adds the significands; the result is normalized by shifting left or right while decrementing or incrementing the exponent; rounding hardware then produces the final sign, exponent, and significand.]

  27. Floating Point Multiplication • Now that the easy task of adding FP numbers has been addressed, we can move to the more complex problem of multiplication. • Once again, we can do this step-by-step. • We'll follow the same procedure as before, working through a decimal example and assuming we can only store four digits of the significand and two digits of the exponent. • Keep in mind that the same procedure can be applied to binary numbers in IEEE 754 format – this is just a simplified example. • Let's multiply 1.110 x 10^10 and 9.200 x 10^-5.

  28. Floating Point Multiplication, Step 1: Add the Exponents • Computing the exponent of the product is easy – just add the exponents of the numbers being multiplied. • Adding 10 and (-5), we get 5 – the exponent of the product is 5. • Let's try this with biased exponents (as we know we normally store exponents in biased form) and a bias of 127. • (10+127) + (-5+127) = 137 + 122 = 259 • That's not right: interpreted as a biased exponent, 259 would mean 259 - 127 = 132, not 5. • We added the bias twice – when we add two biased exponents, the bias must be subtracted from the sum once: 259 - 127 = 132, which is the correct biased encoding of 5 (the correct answer!).

  29. Floating Point Multiplication, Step 2: Multiply the Significands • Now we multiply the significands: 1.110 x 9.200 gives the partial products 0000, 0000, 2220, and 9990, which sum to 10212000. • The decimal point goes six places from the right as there are three decimal places in each multiplied term – the product is 10.212000. • Assuming we can only store three digits to the right of the decimal point, the product is 10.212 x 10^5.

  30. Floating Point Multiplication, Step 3: Normalize the Product • We need to normalize the product, as it is not in normalized scientific notation. • 10.212 x 10^5 = 1.0212 x 10^6 • Remember that we need to check for underflow or overflow. In this case, the computed exponent of 6 fits in the storage requirements we outlined, so neither of these errors has occurred.

  31. Floating Point Multiplication, Step 4: Make It Fit • Once again, we have violated our significand storage requirement, so we must round the number. • Using old rounding rules from elementary school, 1.0212 x 10^6 rounds down to 1.021 x 10^6. • Again, we might need to normalize again if a "bad luck" rounding decision is made (just like in FP addition).

  32. Floating Point Multiplication, Step 5: Determine the Sign • Lastly, we need to determine the sign of the product. • If the signs of the original operands are both positive or both negative (they are the same), the product is positive. If the signs of the original operands differ, the product is negative. • In our example, both operands have a positive sign, so our product is positive. • Therefore, the product is +1.021 x 10^6.

  33. The Floating Point Multiplication Algorithm • Start. • 1. Add the biased exponents of the two numbers, subtracting the bias from the sum to get the new biased exponent. • 2. Multiply the significands. • 3. Normalize the product if necessary, shifting it right and incrementing the exponent. • If overflow or underflow occurs, signal an exception. • 4. Round the significand to the appropriate number of bits. • If the result is no longer normalized, go back to step 3. • 5. Set the sign of the product to positive if the signs of the original operands are the same; if they differ, make the sign negative. • Done.
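
And the matching toy sketch for multiplication, in the same illustrative decimal format (again a sketch of the algorithm, not of real hardware; it omits the flowchart's re-check for a result that rounding pushed back out of normalized form):

  # Toy decimal FP multiplier: a value is sign * (sig / 1000) * 10**exp.
  def fp_mul(sign_a, sig_a, exp_a, sign_b, sig_b, exp_b):
      # Step 1: add the exponents (with biased exponents, subtract the bias once here).
      exp = exp_a + exp_b
      # Step 2: multiply the significands; the raw product is scaled by 1000 * 1000.
      prod = sig_a * sig_b                  # e.g. 1110 * 9200 = 10212000
      # Steps 3 and 4: normalize to one digit before the point, then round (half up)
      # back down to four significant digits.
      if prod >= 10_000_000:                # two digits before the point
          exp += 1
          sig = (prod + 5_000) // 10_000
      else:
          sig = (prod + 500) // 1_000
      # Step 5: the product is positive exactly when the operand signs agree.
      return sign_a * sign_b, sig, exp

  # (+1.110 x 10^10) * (+9.200 x 10^-5)  ->  (1, 1021, 6), i.e. +1.021 x 10^6
  print(fp_mul(+1, 1110, 10, +1, 9200, -5))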

  34. Floating Point Division • FP division hardware is quite complex. • In addition to following our tried-and-true approach of doing it ourselves step-by-step and building hardware for it, let’s just mention some commercial speedup approaches. • Newton’s Iteration is a technique that finds the reciprocal of one operand by finding the zero of a function and then multiplying using an optimized multiplication hardware design. • The SRT Division technique tries to guess several quotient bits per step using a lookup table and the upper bits of the dividend and remainder (the Intel Pentium uses a similar but slightly different approach).

  35. Accuracy and Rounding • One more issue comes up when dealing with floating point computation. • What we store in a register (normally in IEEE 754 format) is, in general, only an approximation of the true real number (as a real number can have an infinite number of places past the decimal/binary point). • The best a computer can ever do with FP numbers is to compute and store the floating point representation that is closest to the actual number. • To accomplish this, IEEE 754 offers several modes of rounding FP numbers.

  36. Standard Support for Rounding • In all of the examples we did, we were vague about how many bits each intermediate value could occupy. • If we always truncated anything past what we could represent, we could never round (as the information we would need to round would be truncated). • Therefore, IEEE 754 always keeps two extra bits on the right during intermediate calculations, called the guard bit and the round bit. • These bits are computed just like any other bits in the calculations – when the end of the algorithm is reached, they are used to round the result (they no longer hold data after the round).

  37. The Sticky Bit • IEEE 754 also includes a third special bit, called the sticky bit. • This bit lies all the way to the right and it is set whenever there is nonzero data past the round bit. • Having this bit allows the computer to get the same result as if the intermediate values were computed to infinite precision and then rounded.
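
As an illustration of how the three extra bits can be used, here is a Python sketch of the standard's default round-to-nearest-even decision driven by the guard, round, and sticky bits. The specific rule shown is the usual default mode and is assumed here; the slides only say the bits are used to round the result:

  # Round-to-nearest-even given the guard (g), round (r), and sticky (s) bits.
  # frac is the integer significand kept so far; g, r, s are each 0 or 1.
  def round_nearest_even(frac, g, r, s):
      tail = (g << 2) | (r << 1) | s   # weight of the discarded bits, out of 8
      if tail > 4:                     # discarded part is more than half an ulp
          return frac + 1
      if tail < 4:                     # less than half an ulp: truncate
          return frac
      return frac + (frac & 1)         # exactly half: round to the even neighbor

  print(round_nearest_even(0b1010, 1, 0, 1))  # 11 (rounds up)
  print(round_nearest_even(0b1010, 1, 0, 0))  # 10 (tie, already even)
  print(round_nearest_even(0b1011, 1, 0, 0))  # 12 (tie, rounds to even)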

  38. FP Support in MIPS • The MIPS architecture supports IEEE 754 in both single and double precision formats… • FP addition – single (add.s) and double (add.d) • FP subtraction – single (sub.s) and double (sub.d) • FP multiply – single (mul.s) and double (mul.d) • FP divide – single (div.s) and double (div.d) • FP comparison – single (c.x.s) and double (c.x.d), where x is one of equal (eq), not equal (neq), less than (lt), greater than (gt), less than or equal to (le), or greater than or equal (ge) • FP branch – true (bc1t) and false (bc1f) • (FP comparison sets a special bit to true or false and the FP branch decides to branch based on that condition).

  39. Separate FP Registers? • One issue that designers of computers face is whether to use the same registers that integer instructions use for floating point computations or to introduce new, FP only registers. • Integer ops and FP ops frequently work on different data, so there should not be much conflict with sharing registers. • The major impact is the need to introduce a separate set of data transfer instructions for FP. • If separate registers are used, you get twice as many registers without using more bits in the instruction format. • It’s a design decision – is it worth the cost?

  40. Historical Factors • One reason that some designs have separate register banks for integer and FP numbers is the limitations of early computers. • In the 1980s, there were simply not enough transistors to put FP hardware on the main microprocessor chip. The FP unit, and its register bank, was available as a second, optional chip, called an accelerator or a math coprocessor. • The early 1990s saw integrated FP hardware on-chip (along with a lot of other functions) – that's why there are almost no coprocessors anymore.

  41. MIPS says Yes • The designers of MIPS decided to include separate FP registers, $f0, $f1, and so forth. • There are separate instructions for loading and storing FP registers – lwc1 and swc1. The base address registers used are still the normal integer registers. • A double precision register is an even/odd pair of single precision registers, using the even register number as its name. • For example, to load two single precision numbers from memory, add them, and store the result…
  lwc1  $f4, x($sp)     # load 32-bit FP num into $f4
  lwc1  $f6, y($sp)     # load 32-bit FP num into $f6
  add.s $f2, $f4, $f6   # $f2 = $f4 + $f6, single precision
  swc1  $f2, z($sp)     # store 32-bit FP num from $f2

  42. Floating Point Support in the PowerPC Architecture • The PowerPC architecture is much like MIPS when dealing with FP numbers. There are a few differences that mainly derive from the fact that PowerPC is a much newer and more advanced architecture. • There are no Hi and Lo registers – PowerPC instructions operate directly on registers. • PowerPC has 32 single precision and 32 double precision floating point registers, twice as many as in the MIPS architecture. • PowerPC also introduces a multiply-add instruction (more about that on the next slide).

  43. The Multiply-Add Instruction • The PowerPC floating point multiply-add instruction reads three operands, multiplies two of them, adds the third to the product, and stores the result in the third operand register. • Two MIPS instructions = One PowerPC instruction • This instruction rounds only after both operations complete; two separate instructions = two roundings and less accurate results. • This instruction is actually used in PowerPC to perform floating point division (using Newton's Iteration, as we previously mentioned) – the necessity for accurate division was the primary motivation for skipping the rounding operation between the two operations in this "fused" op.
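
The accuracy difference is easy to demonstrate. The Python sketch below (an illustration of the rounding argument, not PowerPC code; the operands are contrived for the purpose) compares a doubly rounded a*b + c against the singly rounded result obtained via exact rational arithmetic:

  from fractions import Fraction

  a = 1.0 + 2.0**-27
  b = 1.0 - 2.0**-27
  c = -1.0
  # Mathematically a*b = 1 - 2**-54, which rounds to exactly 1.0 in double precision.

  two_roundings = a * b + c                                       # round after *, round after +
  one_rounding  = float(Fraction(a) * Fraction(b) + Fraction(c))  # one final rounding

  print(two_roundings)   # 0.0
  print(one_rounding)    # -5.551115123125783e-17  (exactly -2**-54)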

  44. Floating Point Support in the Intel IA-32 Architecture • Floating point support for IA-32/x86 was introduced with the 8087 coprocessor in 1980 and is vastly different from floating point support in the MIPS and PowerPC architectures. • Intel uses a stack-based architecture for the floating point instructions – it's almost like another machine itself. • Loads push FP numbers onto the FP stack and increment the FP stack pointer. • Stores pop the FP number off of the top of the FP stack and store it to memory.

  45. The FP Stack-based Architecture • FP operations operate on the top two elements on the FP stack – they are replaced with the result (two pops and one push). • There are also ways of having one operand in memory and only using the top of the stack or having one operand in one of seven special FP registers. • FP instructions in IA-32 fall into one of four types… • Data movement – load, load immediate, store, etc • Arithmetic – add, sub, mul, div, sqrt, abs, etc • Comparison – can send result to the integer processor so that it can branch • Transcendental – sine, cosine, log, exponentiation, etc
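
To make the stack discipline concrete, here is a toy Python sketch (illustrative function names, not Intel's actual x87 mnemonics or semantics) in which loads push, arithmetic pops two operands and pushes one result, and stores pop:

  # Toy FP stack machine illustrating the push/pop discipline described above.
  stack = []

  def fld(x):          # load: push a value onto the FP stack
      stack.append(float(x))

  def fadd():          # arithmetic: pop two operands, push the result
      b, a = stack.pop(), stack.pop()
      stack.append(a + b)

  def fmul():
      b, a = stack.pop(), stack.pop()
      stack.append(a * b)

  def fstp():          # store: pop the top of the stack
      return stack.pop()

  # Evaluate (2.0 + 3.0) * 4.0 stack-style.
  fld(2.0); fld(3.0); fadd()
  fld(4.0); fmul()
  print(fstp())        # 20.0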

  46. Aside: Stack-based Machines • This stack-based architecture is vastly different than the register-based architectures we have seen so far. • Data is moved to/from the stack instead of registers for the processor to operate on it. • This type of architecture is not uncommon. Some common (and very modern) computers share a similar architecture… • The Java Virtual Machine (JVM) • The Microsoft Common Language Runtime (CLR) • Most HP graphing calculators (the interface, not the actual processor)

  47. Double Extended Precision • The stack in the Intel IA-32 floating point system is 80 bits wide, known as double extended precision. • All FP numbers, both single and double precision, are converted to this form when moved to the stack and converted back when moved back to memory. • The FP registers are 80 bits wide. • If one operand in a FP operation is in memory, it is converted on-the-fly to 80-bit format for the operation. • This form is not normally used by modern programming language compilers, but it is available if desired using straight assembly language programming.
