CSE 575 Computer Arithmetic, Spring 2005, Mary Jane Irwin (www.cse.psu.edu/~mji)

Remaining Lecture Schedule

Review: Binary Adders
[Figure: taxonomy of synchronous word parallel adders]
• ripple carry adders (RCA)
• carry prop min adders
  • signed-digit adders
  • residue adders
• fast carry prop adders (CPAs)
  • Manchester carry chain
  • carry select
  • carry lookahead
  • prefix
  • cond. sum
  • carry skip
[Complexity labels in the figure: T = O(log n), A = O(n log n); T = O(n), A = O(n) (three times); T = O(1), A = O(n)]

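Multioperand Addition
• Addition of more than two numbers
  • vector inner products
  • computing averages
• Summing k operands (X0 through X6 in the figure), each n bits wide, gives a sum of
  $\log_2(k 2^n - k + 1) \le n + \log_2 k$ bits (rounded up to an integer)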
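The width formula above can be checked numerically. The following is a minimal sketch, assuming unsigned operands; the function names `multioperand_sum_bits` and `width_bound` are illustrative and not from the slides.

```python
def multioperand_sum_bits(k: int, n: int) -> int:
    """Bits needed to hold the sum of k unsigned n-bit operands."""
    max_sum = k * (2**n - 1)          # every operand at its maximum: k*2^n - k
    return max_sum.bit_length()       # same as ceil(log2(max_sum + 1))


def width_bound(k: int, n: int) -> int:
    """The simpler bound n + ceil(log2 k)."""
    return n + (k - 1).bit_length()   # (k - 1).bit_length() == ceil(log2 k) for k >= 1


if __name__ == "__main__":
    # Seven 4-bit operands, as in the X0..X6 figure
    print(multioperand_sum_bits(7, 4), width_bound(7, 4))
```

The exact width never exceeds the simpler bound, which is why the partial sum register in a serial implementation can be sized as n + log k bits.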

Serial Implementation
[Figure: the operands X_j (n bits each) and a partial sum register feed an (n + log k)-bit CPA]
T_serial-multiadd = O(k log(n + log k)) = O(k log n + k loglog k)
Addition time grows superlinearly with k when n is fixed and logarithmically with n for a fixed k.

Multiply
• Binary multiplication as repeated additions
[Figure: n-bit multiplicand D and multiplier Q, the partial product array (n rows), and the 2n-bit double precision product P]

A Serial Implementation
[Figure: multiplicand register D (2n bits), 2n-bit CPA, partial product register P, multiplier register Q (n bits, examined 1 bit at a time), add/no add control]
T_serial-multiply = O(n log(2n))
Multiplication time grows superlinearly with n (when using a log time adder)

Shift & Add Multiplication
• Left shift and add
  • Partial products accumulated from bottom to top
  • Requires a 2n bit adder
• Right shift and add
  • Partial products accumulated from top to bottom
  • Only requires an n bit adder
  • Sign extend 'icand on right shift; premultiply 'icand by 2^n to offset effect of right shifts (integer operands only)

Right Shift & Add Multiplier
[Figure: n-bit multiplicand register (D) with add/subt control, n-bit CPA, (partial) product register P, multiplier register Q with add/no add control]
T_serial-multiply = O(n log n) with a log time adder, or O(n^2) with a linear time adder
Multiplication time grows superlinearly with n.
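A minimal software sketch of the right shift and add scheme, assuming unsigned n-bit operands. The function name `shift_add_multiply` and the variable names are illustrative; the loop models the combined P || Q register being shifted right once per multiplier bit.

```python
def shift_add_multiply(d: int, q: int, n: int) -> int:
    """Unsigned n-bit multiply using right shift and add."""
    assert 0 <= d < 2**n and 0 <= q < 2**n
    p = 0                                    # upper half: partial product register
    for _ in range(n):
        if q & 1:                            # add/no add control from the multiplier lsb
            p += d                           # n-bit add; the carry out makes p n + 1 bits
        # shift the combined P || Q register right by one bit
        q = (q >> 1) | ((p & 1) << (n - 1))  # lsb of P moves into the top of Q
        p >>= 1
    return (p << n) | q                      # 2n-bit double precision product


if __name__ == "__main__":
    print(shift_add_multiply(13, 11, 4))
```

The add only ever touches the upper n bits, which is why this organization needs an n bit adder where the left shift version needs 2n bits.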

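Signed Multiplication
• So far we have, for Q = (q_0 . q_1 q_2 q_3 … q_{n-1}) with q_0 the sign bit,
  P_0 = 0
  P_1 = ½(P_0 + q_{n-1} D)
  P_2 = ½(P_1 + q_{n-2} D)
  . . .
  $P_{i+1} = \tfrac{1}{2}(P_i + q_{n-i-1} D) = \left(\sum_{j=1}^{i+1} q_{n-j}\, 2^{-(i+2-j)}\right) D$
• So $P_{n-1} = \left(\sum_{j=1}^{n-1} q_{n-j}\, 2^{-(n-j)}\right) D = \left(\sum_{j=1}^{n-1} q_j\, 2^{-j}\right) D = Q * D$ when the sign bit q_0 is 0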

Negative (2s'C) Multiplicand
• As long as we sign extend the 'icand our scheme works fine
• But what if both 'icand and 'ier are negative?

            1 0 0 1 1    D = -13
          * 0 1 0 1 1    Q = +11
    1 1 1 1 1 0 0 1 1    q_4 = 1 (sign extend)
    1 1 1 1 0 0 1 1      q_3 = 1 (sign extend)
        0 . . . 0        q_2 = 0
    1 1 0 0 1 1          q_1 = 1 (sign extend)
    1 0 1 1 1 0 0 0 1    P = -143

Negative (2s'C) Multiplier
• Recall for 2s'C
  $D = -d_0 2^0 + \sum_{j=1}^{n-1} d_j 2^{-j}$ and $Q = -q_0 2^0 + \sum_{j=1}^{n-1} q_j 2^{-j}$
  and what we have computed so far is
  $P_{n-1} = \left(\sum_{j=1}^{n-1} q_j 2^{-j}\right) D$
  what we want is
  $P = Q * D = -q_0 2^0 D + \left(\sum_{j=1}^{n-1} q_j 2^{-j}\right) D$
• So the correction factor for 2s'C is $P = P_{n-1} - q_0 D$

Negative (2s'C) Multiplier Example

            1 0 0 1 1    D = -13
          * 1 1 0 1 0    Q = -6
            0 . . . 0    q_4 = 0
      1 1 1 0 0 1 1      q_3 = 1 (sign extend)
        0 . . . 0        q_2 = 0
      1 0 0 1 1          q_1 = 1
    1 0 1 1 1 1 1 1 0    P_{n-1} = -130
  - 1 0 0 1 1            subtract q_0 D (correction)
    0 0 1 0 0 1 1 1 0    P = +78
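The correction step in the example can be reproduced with a short sketch. It assumes integer scaling (the sign bit weighs 2^(n-1) instead of 2^0), and the names `to_signed` and `twos_complement_multiply` are illustrative, not from the slides.

```python
def to_signed(pattern: int, n: int) -> int:
    """Interpret an n-bit pattern as a 2's complement integer."""
    return pattern - (1 << n) if pattern & (1 << (n - 1)) else pattern


def twos_complement_multiply(d_bits: int, q_bits: int, n: int) -> int:
    """Multiply two n-bit 2's complement patterns using P = P_{n-1} - q0 * D."""
    d = to_signed(d_bits, n)                   # the sign extended multiplicand
    q0 = (q_bits >> (n - 1)) & 1               # multiplier sign bit
    low = q_bits & ((1 << (n - 1)) - 1)        # q1 .. q_{n-1} read as an unsigned number
    p_uncorrected = low * d                    # what the shift and add loop accumulates
    return p_uncorrected - q0 * (d << (n - 1)) # subtract q0 * D at the sign bit's weight


if __name__ == "__main__":
    print(twos_complement_multiply(0b10011, 0b11010, 5))   # D = -13, Q = -6
```

For the slide's operands the uncorrected value is -130 and the correction adds 208, matching the two bottom rows of the example.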

Other Negative Multipliers
• 1s'C: $P = P_{n-1} - q_0 D + 2^{-(n-1)} q_0 D$
  • adder must do 1s'C addition (EAC)
  • sign extend 'icand
  • initialize P_0 as q_0 D (rather than clearing the register)
  • do an optional subtraction as a last step
• SM: $|P| = P_{n-1}$ and $p_{sign} = q_0 \oplus d_0$
  • strip off the sign bits and do unsigned multiplication (so no corrections and no sign extensions)
  • sign of the product is the xor of 'ier and 'icand sign bits

Lower Bound on Multiplication
• Winograd's lower bound on multiplication of two n-digit d-valued numbers is t ≥ log2n
• Mult can be done as the addition of the log representation of two numbers
  a * b = c ⇔ log a + log b = log c
  but the data representation is nonstandard

Faster Serial Multiplication
• Use a log n time fast CPA
• Bypass addition cycle when 'ier bit is 0
• Zero detect and barrel shift
  • Detect strings of zeros in the 'ier and shift 1, 2, 3, … n-1 places right in one cycle
• Use higher radix multiplication
  • Multiplier recoding to simplify multiple formation
  • CSAs to form multiples

Carry Save Adder (CSA)
• A carry save adder is nothing more than a full adder with the carries saved rather than propagated!
• Also called a (3,2) counter
[Figure: a single full adder (FA)]

Carry Save Word Adder
• A 6 bit CSA reduces three 6-bit inputs to one 6-bit output and one 7-bit output
[Figure: six full adders (FA) side by side forming a 6-b CSA]
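A word-wide carry save adder is easy to model with bitwise operations. This is a minimal sketch; the function name `carry_save_add` is illustrative.

```python
def carry_save_add(x: int, y: int, z: int):
    """Reduce three words to a sum word and a carry word without propagating carries."""
    s = x ^ y ^ z                              # per-bit full adder sum output
    c = ((x & y) | (x & z) | (y & z)) << 1     # per-bit carry output, weighted one place left
    return s, c


if __name__ == "__main__":
    s, c = carry_save_add(0b101101, 0b011011, 0b110110)
    print(bin(s), bin(c), s + c)               # s + c equals the sum of the three inputs
```

For 6-bit inputs the sum word stays within 6 bits and the carry word within 7, and a single carry propagate add of the two words recovers the full sum.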

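Radix 4 Multiply
• Radix 4 multiply involves half as many additions, so runs twice as fast
  $P_{i+1} = \tfrac{1}{4}\left(P_i + (q_{n-i-1} \,\|\, q_{n-i-2})\, D\right)$ with $P_0 = 0$
  where || joins two multiplier bits into one radix 4 digit and, as before, the accumulated product is
  $\left(\sum_{j=1}^{n-1} q_j 2^{-j}\right) D = Q * D$
[Figure: n-bit multiplicand D and multiplier Q, a partial product array only n/2 rows high, and the 2n-bit double precision product P]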

Forming the Multiples
• Need the multiples 0*D, 1*D, 2*D, 3*D
• All are easy except 3*D
  • compute it via an addition (3D = 2D + 1D) every cycle: too slow!
  • precompute it and store it in a register
  • use a CSA to form the multiples
  • replace 3D with a 4D (a carry into the next higher multiplier digit) and a -D, that is, recode the multiplier so you don't need it
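As a rough illustration of the "precompute 3D" option, the sketch below forms the four multiples once and then consumes the multiplier two bits per iteration. It assumes unsigned operands and an even n, and it accumulates in left shift form for brevity; `radix4_multiply` is an illustrative name.

```python
def radix4_multiply(d: int, q: int, n: int) -> int:
    """Unsigned radix 4 multiply with 0, D, 2D and a precomputed 3D."""
    assert n % 2 == 0 and 0 <= d < 2**n and 0 <= q < 2**n
    multiples = [0, d, d << 1, (d << 1) + d]   # 3D costs one addition, done once up front
    p = 0
    for i in range(n // 2):                    # n/2 iterations instead of n
        digit = (q >> (2 * i)) & 0b11          # next radix 4 multiplier digit
        p += multiples[digit] << (2 * i)       # add the selected multiple at its weight
    return p


if __name__ == "__main__":
    print(radix4_multiply(13, 11, 4))
```

Recoding the multiplier into the digit set [-2,-1,0,1,2] removes the 3D entry altogether, at the cost of needing subtraction.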

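Using a CSA to Form Multiples
[Figure: 'icand register D supplies the multiples 0,2D and 0,1D (with add/subt control) to an (n+1)-bit CSA followed by an (n+2)-bit CPA; the (partial) product register P is n+2 bits and sits beside the 'ier register Q]
Shift P || Q right 2 bits each iteration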

Recoding the Multiplier
• Recall for radix 4, Q=[0,1,2,3] can be recoded into Q=[-2,-1,0,1,2]
• This recoding has to be accomplished so that the algebraic value ($Q = -q_0 + \sum_j q_j r^{-j}$ in RC) of the 'ier is unchanged
  . . . q_{j-1} q_j q_{j+1} . . .   (add a unit at q_{j-1}, subtract r at q_j)
  $r^{-(j-1)} q_{j-1} + r^{-j} q_j = r^{-(j-1)}(q_{j-1} + 1) + r^{-j}(q_j - r)$

Goals of Recoding
• Maximize the number of zero's
  0111 1111 → 1000 000-1
or
• Eliminate the possibility of a 11 or -1-1 digit pairing
  0111 0111 → 100-1 100-1 → 1000 -100-1

Recoding
• With mode digit, m_j, and recoded digit, q_j'
  $r^{-(j-1)} q'_{j-1} + r^{-j} q'_j + r^{-(j+1)} q'_{j+1} = r^{-(j-1)}(q_{j-1} + m_{j-1}) + r^{-j}(q_j - r m_{j-1} + m_j) + r^{-(j+1)}(q_{j+1} - r m_j)$
• So that $q'_j = q_j - r m_{j-1} + m_j$

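Recoding, Con't
• And
  $Q' = \sum_j q'_j r^{-j} = \sum_j r^{-j}(q_j - r m_{j-1} + m_j)$
  $\quad = \sum_j r^{-j} q_j + \sum_j r^{-j}(-r m_{j-1} + m_j)$
  $\quad = \sum_j r^{-j} q_j - r^0 m_0 + r^{-1} m_1 - r^{-1} m_1 + \dots + r^{-(n-1)} m_{n-1}$
  $\quad = -m_0 + \sum_j r^{-j} q_j + r^{-(n-1)} m_{n-1}$
  with every sum taken over j = 1, …, n-1
• So if m_{n-1} = 0 and m_0 = q_0 then the recoding works for RC notation and the choices for m_j (j = 1, 2, …, n-2) are arbitrary!!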
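The telescoping argument can be checked by brute force. The sketch below is an illustration under stated assumptions: digits are held msb first in a tuple `q`, mode digits are restricted to 0 or 1, exact fractions stand in for the fixed-point values, and all function names are made up for this example.

```python
from fractions import Fraction
from itertools import product


def rc_value(q, r):
    """Value of the radix complement fraction q0 . q1 ... q_{n-1}."""
    return -q[0] + sum(Fraction(q[j], r**j) for j in range(1, len(q)))


def recode(q, m, r):
    """Apply q'_j = q_j - r*m_{j-1} + m_j for j = 1 .. n-1."""
    return [q[j] - r * m[j - 1] + m[j] for j in range(1, len(q))]


def recoded_value(q_prime, r):
    """Value of the recoded digits q'_1 .. q'_{n-1}."""
    return sum(Fraction(d, r**j) for j, d in enumerate(q_prime, start=1))


def value_preserved_for_all(n, r):
    """Try every multiplier and every choice of the interior mode digits."""
    for q in product(range(2), *[range(r)] * (n - 1)):
        for interior in product(range(2), repeat=n - 2):
            m = (q[0],) + interior + (0,)      # m_0 = q_0 and m_{n-1} = 0
            if recoded_value(recode(q, m, r), r) != rc_value(q, r):
                return False
    return True


if __name__ == "__main__":
    print(value_preserved_for_all(5, 2), value_preserved_for_all(4, 4))
```

Because only the two boundary mode digits matter for correctness, each recoding family is free to pick the interior mode digits to suit its own goal.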

Recoding Table
• In binary $q'_j = q_j - 2 m_{j-1} + m_j$
• Given m_j from the previous step, when q_j is sensed pick m_{j-1}

Recoding Families
• Canonical (Booths)
• Differentiating
• Nonrestoring
• Modified Booths
[Slide annotations: "uses q_{j-1} and q_j"; "uses q_{j-1}, q_j and q_{j+1}"]

Multiplier Recoding Schemes
[Table; surviving row labels: isolated 1, start string of 1's, middle of string of 1's, start string of 0's, isolated 0, middle of string of 0's]

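Canonical (Booths) Recoding
[Worked example, digits generated right to left, for the multiplier Q = 0 1 0 0 1 0 1 1 used in the later recoding slides]
  Q' = 0 1 0 1 0 -1 0 -1
  m7 = 1, m6 = 1, m5 = 1, m4 = 1, m3 = 0, m2 = 0, m1 = 0, m0 = 0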

Canonical Recoding Facts
• Every two nonzero recoded digits are separated by at least one zero digit (in binary), so Q=[-2,-1,0,1,2] and no 3D to deal with
• Produces a multiplier with the most zeros
• It is a left-directed (serial) recoding
• Proof (of the first fact), where q_i' stands for "recoded digit i is nonzero":
  If no two successive digits are nonzero, it must be true that q_{i-1}' & q_i' = 0
  From the table
    m_{i-1} = m_i q_i | m_i q_{i-1} | q_{i-1} q_i   and   q_i' = q_i !m_i | !q_i m_i
  and
    !m_{i-1} = !q_i !q_{i-1} | !m_i !q_{i-1} | !m_i !q_i   and   !q_i' = q_i m_i | !q_i !m_i
  So
    q_{i-1}' & q_i' = (q_{i-1} !m_{i-1} | !q_{i-1} m_{i-1})(q_i !m_i | !q_i m_i)
  And substituting terms gives
    q_{i-1}' & q_i' = (q_{i-1} !q_i !m_i | !q_{i-1} q_i m_i)(q_i !m_i | !q_i m_i) = 0
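The majority rule for the mode digit and the formula q_j' = q_j - 2m_{j-1} + m_j translate directly into a right-to-left scan. This is a sketch under assumptions: the multiplier is an unsigned Python integer, bit i of that integer has weight 2^i (so indices run opposite to the slides' q_j), one extra leading digit is produced to absorb a final carry, and the function names are illustrative.

```python
def canonical_recode(value: int, n: int):
    """Canonical signed digit recoding of an unsigned n-bit integer, msb first, n + 1 digits."""
    assert 0 <= value < 2**n
    digits = []
    mode = 0                                   # mode digit entering the least significant digit
    for i in range(n + 1):
        q_i = (value >> i) & 1
        q_next = (value >> (i + 1)) & 1        # the next more significant bit
        next_mode = 1 if q_i + q_next + mode >= 2 else 0   # majority of the three
        digits.append(q_i + mode - 2 * next_mode)          # q' = q - 2*m_out + m_in
        mode = next_mode
    return digits[::-1]


def digits_value(digits):
    """Value of a msb-first radix 2 signed digit list."""
    v = 0
    for d in digits:
        v = 2 * v + d
    return v


if __name__ == "__main__":
    print(canonical_recode(0b01001011, 8))
```

Running it on 0100 1011 should reproduce the recoded multiplier from the worked example, with one extra leading zero.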

Multiplier Recoding Schemes

Differentiating Recoding
[Worked example, digits generated right to left]
  Q' = 1 -1 0 1 -1 1 0 -1
  m7 = 1, m6 = 1, m5 = 0, m4 = 1, m3 = 0, m2 = 0, m1 = 1, m0 = 0

Differentiating Recoding Facts
• Because of the pairing of rows, the recoding is independent of q_{j-1}, and m_{j-1} = q_j so m_j = q_{j+1}
• So the recoding can be based upon just q_j and q_{j+1} to recode q_j

More Differentiating Facts
• The recoding can be done lsd first (left directed) OR msd first (right directed) OR in parallel

    Q                     0  1  0  0  1  0  1  1
    M (Q shifted left)    1  0  0  1  0  1  1  0
    Q'                    1 -1  0  1 -1  1  0 -1

• Successive nonzero recoded digits are always of opposite sign, so Q=[-2,-1,0,1,2] and still no 3D to have to deal with
• Also gives a n/2 versus n height partial product array (pairing the digits of Q' above gives the radix 4 digits 1 1 -1 -1)
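Since m_j = q_{j+1}, every recoded digit is just the shifted word minus the original word, bit by bit, and all digits can be formed at once. A minimal sketch, assuming an n-bit 2's complement pattern held in a Python integer; the function names are illustrative.

```python
def booth_recode(pattern: int, n: int):
    """Differentiating recoding of an n-bit 2's complement pattern, digits msb first."""
    mask = (1 << n) - 1
    shifted = (pattern << 1) & mask            # M: Q shifted left, a 0 enters at the lsd
    return [((shifted >> i) & 1) - ((pattern >> i) & 1) for i in range(n - 1, -1, -1)]


def digits_value(digits):
    """Value of a msb-first radix 2 signed digit list."""
    v = 0
    for d in digits:
        v = 2 * v + d
    return v


def signed_value(pattern: int, n: int) -> int:
    """Interpret an n-bit pattern as a 2's complement integer."""
    return pattern - (1 << n) if pattern >> (n - 1) else pattern


if __name__ == "__main__":
    print(booth_recode(0b01001011, 8))
```

No digit depends on a neighbouring recoded digit, which is what allows the recoding to run in either direction or fully in parallel.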

Modified Booth's Recoding
• Modified Booth's recoding has the same goal as differentiating, to have a recoding scheme that is parallel and that allows a radix 4 multiply without 3D
• Instead of a mode digit, it uses three adjacent bits of Q to do the recoding and recodes two bits at a time (instead of one)
• Successive nonzero radix 2 recoded digits are always of opposite sign, so each recoded pair lies in Q=[-2,-1,0,1,2]

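Modified Booth's Scheme

  q_{j-1} q_j q_{j+1}   recoded digit   comment
  0 0 0                   0             run of zeros
  0 0 1                  +1             end of string of ones
  0 1 0                  +1             isolated one
  0 1 1                  +2             end of string of ones
  1 0 0                  -2             start of string of ones
  1 0 1                  -1             isolated zero
  1 1 0                  -1             start of string of ones
  1 1 1                   0             run of ones

  (q_{j+1} is the bit just to the right of the pair being recoded; the recoded digit is -2 q_{j-1} + q_j + q_{j+1})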
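A compact sketch of modified Booth recoding, assuming an even-width 2's complement pattern held in a Python integer. Each radix 4 digit is computed from three adjacent bits as -2*high + mid + low; the function names are illustrative.

```python
def modified_booth_recode(pattern: int, n: int):
    """Radix 4 modified Booth digits of an n-bit 2's complement pattern, msb first."""
    assert n % 2 == 0 and 0 <= pattern < 2**n
    extended = pattern << 1                    # append the implicit 0 right of the lsb
    digits = []
    for k in range(n // 2):
        low = (extended >> (2 * k)) & 1        # bit to the right of the pair
        mid = (extended >> (2 * k + 1)) & 1    # low bit of the pair
        high = (extended >> (2 * k + 2)) & 1   # high bit of the pair
        digits.append(-2 * high + mid + low)
    return digits[::-1]


def radix4_value(digits):
    """Value of a msb-first radix 4 signed digit list."""
    v = 0
    for d in digits:
        v = 4 * v + d
    return v


if __name__ == "__main__":
    print(modified_booth_recode(0b01001011, 8))
```

Each digit selects one of -2D, -D, 0, D, 2D, all of which are shifts and complements of D, so no 3D multiple is ever needed.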

Recoding Hardware Comparison
• How do differentiating and modified Booth's compare wrt time, complexity, power?
[Figure: a differentiating recoder and a modified Booths recoder, each applied to Q = 0 1 0 0 1 0 1 1 0]

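Right Shift & Add Multiplier
[Figure: 'icand register supplies D and !D (with 0) under add/subt control, selecting among -2D, -1D, 0, 1D, 2D; the 'ier register Q feeds the recode logic; an (n+2)-bit CPA updates the (n+2)-bit (partial) product register P]
Shift P || Q right 2 bits each iteration
T_serial-multiply = O((n/2) log n)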

Multiplier Recoding Schemes
[Table; surviving annotations: "low order 0's only", "never?"]

Nonrestoring Recoding
[Worked example, digits generated right to left]
  Q' = 1 -1 1 -1 -1 1 -1 1
  m7 = 0, m6 = 1, m5 = 0, m4 = 1, m3 = 1, m2 = 0, m1 = 1, m0 = 0

Nonrestoring Recoding Facts
• The msd does not conform to the rules, it is overridden by the termination condition that m_0 must agree with the sign of the 'ier
• Gives a recoded digit set of Q'=[-3,-1,1,3] (lsd could also be Q'=[-2, 0, 2])
• It is a left-directed (serial) recoding
• It corresponds to the inverse of nonrestoring division, so could be useful in helping to determine the relationship between multiply and divide
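The sketch below reproduces the worked nonrestoring example. It rests on an assumption: the rule "m_j is the complement of q_j" is inferred from the mode digits in that example, since the recoding table itself is not reproduced in the text. Bits are held msb first with q_0 the sign bit, and the function names are illustrative.

```python
def nonrestoring_recode(bits):
    """bits = [q0, q1, ..., q_{n-1}], msb first, q0 the 2's complement sign bit."""
    n = len(bits)
    m = [1 - b for b in bits]      # assumed rule: m_j is the complement of q_j
    m[0] = bits[0]                 # termination condition: m_0 agrees with the sign
    m[n - 1] = 0                   # no mode digit enters the least significant digit
    return [bits[j] - 2 * m[j - 1] + m[j] for j in range(1, n)]


def digits_value(digits):
    """Value of a msb-first radix 2 signed digit list."""
    v = 0
    for d in digits:
        v = 2 * v + d
    return v


if __name__ == "__main__":
    print(nonrestoring_recode([0, 0, 1, 0, 0, 1, 0, 1, 1]))
```

Under this rule every digit except the least significant one is +1 or -1, so pairing digits for radix 4 gives values such as 2(+1) + 1 = 3 and 2(-1) + 1 = -1, consistent with the digit set quoted above.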

Higher Radix Multiply
• Does recoding work for radix 8? radix 16? radix 32?
• Only choice is to form or pre-form multiples of D
  • -7 to 7 (max. redundant): many "hard" multiples (-3,3,-5,5,-6,6,-7,7)
  • -6 to 6: many "hard" multiples (-3,3,-5,5,-6,6)
  • -5 to 5: some "hard" multiples (-3,3,-5,5)
  • -4 to 4 (min. redundant): few "hard" multiples (-3,3)

Multiply Operation Review
[Figure: n-bit multiplicand (D) and multiplier (Q), the partial product array (ppa) of n rows, and the 2n-bit double precision product (P = Q*D)]
(note: the ppa can be formed in parallel)

Parallel Multiplication
• In a parallel multiplier
  • Can use a multiplier recoding scheme to cut the height of the ppa in half (from n bits high to n/2 bits high)
    • must be able to form the ppa in parallel (so must use either modified Booths or differentiating recoding)
  • Reduce the height of the ppa to two rows in parallel with a tree of fast adders
  • Use a fast CPA to do the final add

Full Tree Multiplier Structure
[Figure: Q ('ier) drives a row of multiple forming circuits, each selecting between 0 and D; their outputs enter the partial product reduction tree, which feeds a fast CPA that produces P (product)]
• use multiplier recoding to reduce the height of the tree to n/2

Tree Reduction Techniques
• CSA ((3,2) counters) trees
  • Wallace - row reduction
    • combine partial product bits as early as possible
    • fastest possible design, shorter CPA
  • Dadda - column reduction
    • combine partial product bits as late as possible
    • cheaper CSA tree, wider CPA
• Other counter trees
• SDA trees
[Figure: a row of full adders (FA) forming an n-b CSA]

4x4 Tree Reduction
[Figure labels: 5-bit CPA, 6-bit CPA]

6x6 Tree Reduction

Maximum Inputs for CSA Trees
The maximum number, n, of inputs that can be reduced to two outputs by an h-level CSA tree is
  $n(h) = \lfloor 3\, n(h-1) / 2 \rfloor$
Giving an upper bound of $n(h) \le 2 (3/2)^h$ and a lower bound of $n(h) > 2 (3/2)^{h-1}$
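The recurrence is easy to tabulate. This sketch assumes the base case n(0) = 2 (two operands need no CSA level and go straight to the CPA) and uses exact fractions for the bounds; the function names are illustrative.

```python
from fractions import Fraction


def max_csa_inputs(h: int) -> int:
    """Largest number of rows an h-level CSA tree can reduce to two."""
    n = 2                          # assumed base case: zero levels handle two rows
    for _ in range(h):
        n = (3 * n) // 2           # one more level: every 2 rows below can come from 3 above
    return n


def within_bounds(h: int) -> bool:
    """Check 2*(3/2)^(h-1) < n(h) <= 2*(3/2)^h exactly."""
    n = max_csa_inputs(h)
    return 2 * Fraction(3, 2) ** (h - 1) < n <= 2 * Fraction(3, 2) ** h


if __name__ == "__main__":
    print([max_csa_inputs(h) for h in range(10)])
```

Read the other way round, the table gives the number of CSA levels a multiplier needs: a ppa with up to 9 rows fits in 4 levels, while 10 to 13 rows need 5.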

Log Reduction Trees: Traditional Wallace and Dadda Approaches
• Gives an irregular structure making design and layout quite difficult
• Connections and signal paths of varying lengths lead to signal skew and increased glitching, impacting both performance and power consumption
• Is there an approach better for VLSI layout we can use?