CSE 246: Computer Arithmetic Algorithms and Hardware Design

CSE 246: Computer Arithmetic Algorithms and Hardware Design Fall 2006 Lecture 8: Division Instructor: Prof. Chung-Kuan Cheng

Topics:
• Radix-4 SRT Division
• Division by a Constant
• Division by a Repeated Multiplication

Project Update
• Come in to speak briefly about the final project
• Status Update
• 2:30 – 3:00 p.m.
• Tuesday or Thursday

Radix-4 SRT Division
• $4s_{j-1} = q_j d + s_j$, where $q_j \in [-2, 2]$ and $s_{j-1} \in [-hd, +hd]$
• $h \le 2/3$
• Therefore, $s_{j-1} \in [-2d/3, 2d/3]$
• And $4s_{j-1} \in [-8d/3, 8d/3]$
• $s$ shifts to the left by 2 bits

Radix-4 SRT Division
[Figure: selection diagram plotting $4s_{j-1}$ (vertical axis, marked at $-2d/3$, $d/3$, $2d/3$, $4d/3$, $5d/3$, $8d/3$) against $d$ (horizontal axis, binary .1 to 1.00), with regions labeled $q_j = 0$, $q_j = 1$, and $q_j = 2$.]
• Anything above $8d/3$ goes against our assumption and is therefore the infeasible region.
• The overlap regions of $q_j$ denote a choice still allowing for recursion. The gap defines the precision for carry-save addition.

Radix-4 SRT Division
• The value of $q_j$ determines the range of $4s_{j-1}$ it governs: $[(q_j - 2/3)d, (q_j + 2/3)d]$
• For example, $q_j = 1$
• $1 + 2/3 = 5/3$
• $1 - 2/3 = 1/3$
• The range is $d/3$ to $5d/3$
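The recurrence and digit ranges above can be tried out with exact rational arithmetic. The following is a minimal sketch, not the hardware method: the function name `srt_radix4` is hypothetical, and the digit is chosen by exact rounding of $4s_{j-1}/d$ (clamped to $[-2, 2]$) rather than by the truncated comparison that the overlap regions make possible.

```python
from fractions import Fraction

def srt_radix4(z, d, steps):
    """Radix-4 SRT recurrence 4*s_{j-1} = q_j*d + s_j with digits in [-2, 2].

    Assumption: digits are picked by exact rounding, not by a lookup table.
    """
    s, d = Fraction(z), Fraction(d)
    if abs(s) > 2 * d / 3:
        raise ValueError("initial partial remainder must lie in [-2d/3, 2d/3]")
    digits = []
    for _ in range(steps):
        p = 4 * s                            # shift left by 2 bits
        q = max(-2, min(2, round(p / d)))    # any digit keeping |s_j| <= 2d/3 works
        s = p - q * d                        # new partial remainder s_j
        assert abs(s) <= 2 * d / 3
        digits.append(q)
    # Quotient value is the sum of q_j * 4^(-j), j = 1..steps.
    quotient = sum(Fraction(q, 4 ** (j + 1)) for j, q in enumerate(digits))
    return digits, quotient, s
```

After `steps` iterations the identity $z = Q d + 4^{-\text{steps}} s$ holds exactly, so the quotient error is at most $(2/3) \cdot 4^{-\text{steps}}$.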

Division by a Constant
• Multiplication takes O(log n) time, but division is linear in n, which is much slower
• Try to convert division to multiplication
• Property: given an odd number $d$, there exists an $m$ such that $d \cdot m = 2^n - 1$
• Examples:
• $d = 3$, $m = 5$: $3 \cdot 5 = 2^4 - 1$
• $d = 7$, $m = 9$: $7 \cdot 9 = 2^6 - 1$
• $d = 11$, $m = 93$: $11 \cdot 93 = 2^{10} - 1$
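A short search illustrates the property. This is a sketch with a hypothetical helper name, `find_multiplier`; it returns the smallest $n$ that works, so for $d = 3$ and $d = 7$ it finds smaller pairs than the ones listed above (any multiple of the smallest $n$ also works).

```python
def find_multiplier(d):
    """Return (m, n) with d * m == 2**n - 1, using the smallest such n.

    The loop terminates for every odd d > 0 because 2 is invertible modulo d.
    """
    if d <= 0 or d % 2 == 0:
        raise ValueError("d must be a positive odd integer")
    n = 1
    while (2 ** n - 1) % d != 0:
        n += 1
    return (2 ** n - 1) // d, n
```

For example, `find_multiplier(11)` gives the pair $m = 93$, $n = 10$ from the list above.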

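Division by a Constant
• $1/d = m/(2^n - 1)$
• $1/(1 - r) = 1 + r + r^2 + r^3 + \cdots = (1 + r)(1 + r^2)(1 + r^4)(1 + r^8)\cdots$
• With $r = 2^{-n}$: $\dfrac{m}{2^n - 1} = \dfrac{m}{2^n} \cdot \dfrac{1}{1 - 2^{-n}} = \dfrac{m}{2^n}(1 + 2^{-n})(1 + 2^{-2n})(1 + 2^{-4n})\cdots$
• Example: $z/7 = zm/(2^n - 1)$, $m = 9$, $n = 6$
• $\dfrac{z}{7} = \dfrac{9z}{2^6 - 1} = \dfrac{9z}{2^6} \cdot \dfrac{1}{1 - 2^{-6}} = \dfrac{9z}{2^6}(1 + 2^{-6})(1 + 2^{-12})(1 + 2^{-24})\cdots$
• log(n/6) operations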
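The product form can be checked with exact fractions. The sketch below uses a hypothetical function name and multiplies by a finite number of factors; in hardware each factor $(1 + 2^{-kn})$ would be one shift and one add. Truncating after three factors with $n = 6$ leaves a relative error of exactly $2^{-48}$.

```python
from fractions import Fraction

def divide_by_constant(z, m, n, factors):
    """Approximate z/d, where d*m == 2**n - 1, using (m*z / 2**n) * prod(1 + r**(2**i))."""
    r = Fraction(1, 2 ** n)
    result = Fraction(z * m, 2 ** n)
    for _ in range(factors):
        result *= 1 + r   # multiply by the next factor (1 + r), (1 + r^2), (1 + r^4), ...
        r *= r            # square r for the following factor
    return result
```

For instance, `divide_by_constant(100, 9, 6, 3)` approximates $100/7$ from below.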

Division by Reciprocation
• Find $1/d$ by iteration
• Newton-Raphson algorithm: $x_{i+1} = x_i - f(x_i)/f'(x_i)$
• Set $f(x) = 1/x - d$ with $1/2 \le d < 1$. We have $f'(x) = -1/x^2$
• Thus $x_{i+1} = x_i(2 - x_i d)$
• Let $e_i = 1/d - x_i$. We have $e_{i+1} = 1/d - x_{i+1} = 1/d - x_i(2 - x_i d) = d(1/d - x_i)^2 = d e_i^2$
• The convergence rate is quadratic.
• For k iterations, it takes 2k multiplications

Division by Reciprocation
• $z/d = 3/0.7$
• $x_0 = 4(\sqrt{3} - 1) - 2d = 2.9282 - 2d = 1.5282$
• $e_0 = 1/d - x_0 = 1/0.7 - 1.5282 = -0.0996286$
• $x_1 = x_0(2 - x_0 d) = 1.4216233$
• $e_1 = 1/d - x_1 = 1/0.7 - 1.4216233 = 0.0069481$
• $x_2 = x_1(2 - x_1 d) = 1.4285376$
• $e_2 = 1/d - x_2 = 1/0.7 - 1.4285376 = 0.0000338$
• $x_3 = x_2(2 - x_2 d) = 1.4285714$
• $e_3 = 1/d - x_3 \approx 0.0000000008$
• The convergence rate is quadratic.
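The iteration is short enough to run directly. This is a floating-point sketch with a hypothetical function name, starting from the initial estimate $x_0 = 4(\sqrt{3} - 1) - 2d$ used in the example:

```python
import math

def reciprocal_newton(d, iterations):
    """Return [x_0, x_1, ..., x_k] for the iteration x_{i+1} = x_i * (2 - x_i * d)."""
    x = 4 * (math.sqrt(3) - 1) - 2 * d   # initial estimate from the example
    history = [x]
    for _ in range(iterations):
        x = x * (2 - x * d)              # two multiplications per iteration
        history.append(x)
    return history

xs = reciprocal_newton(0.7, 3)
errors = [1 / 0.7 - x for x in xs]       # each error is d times the square of the previous one
quotient = 3 * xs[-1]                    # z/d = z * (1/d) for z = 3
```

Because $e_{i+1} = d e_i^2$, every error after the first is non-negative: the iterates approach $1/d$ from below.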

Division by Recursive Multiplication
• $q = z/d = (z/d)(x_0/x_0)(x_1/x_1)\cdots(x_{k-1}/x_{k-1})$ eq(a)
• Let $1/2 \le d < 1$
• It takes 2k multiplications for eq(a)
• We also need k operations to find $x_i$

Division by a Repeated Multiplication
• $q = z/d = (z/d)(x_0/x_0)(x_1/x_1)\cdots(x_{k-1}/x_{k-1})$
• Let $1/2 \le d < 1$
• Set $d_0 = d$, $x_k = 2 - d_k$
1. $d_1 = d x_0 = d(2 - d) = 1 - (1 - d)^2$
2. $d_{k+1} = d_k x_k = d_k(2 - d_k) = 1 - (1 - d_k)^2$
3. $1 - d_{k+1} = (1 - d_k)^2 = (1 - d)^{2^{k+1}}$: quadratic convergence
• For k-bit operands, we need $2m - 1$ multiplications and $m$ 2's-complement operations
• $m = \lceil \log_2 k \rceil$, with $\log_2 m$ extra bits for precision

Division by a Repeated Multiplication
• $q = z/d = 3/0.7 = (z/d)(x_0/x_0)(x_1/x_1)\cdots(x_{k-1}/x_{k-1})$
• $d_0 = d = 0.7$, $x_k = 2 - d_k$, $d_{k+1} = d_k x_k$
1. $x_0 = 2 - d_0 = 1.3$, $d_1 = d_0 x_0 = 0.7 \times 1.3 = 0.91$
2. $x_1 = 2 - d_1 = 1.09$, $d_2 = d_1 x_1 = 0.91 \times 1.09 = 0.9919$
3. $x_2 = 2 - d_2 = 1.0081$, $d_3 = d_2 x_2 = 0.9919 \times 1.0081 = 0.99993439$
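The steps above can be sketched with exact fractions, scaling the numerator alongside the denominator so that the numerator converges to the quotient as $d_k$ converges to 1. The function name is hypothetical.

```python
from fractions import Fraction

def divide_by_repeated_multiplication(z, d, iterations):
    """Multiply z and d by x_k = 2 - d_k until d_k is close to 1; z_k then approximates z/d."""
    z_k, d_k = Fraction(z), Fraction(d)
    for _ in range(iterations):
        x_k = 2 - d_k    # 2's complement of d_k
        z_k *= x_k       # numerator tracks the same scaling
        d_k *= x_k       # d_{k+1} = d_k * (2 - d_k) = 1 - (1 - d_k)**2
    return z_k, d_k
```

With $z = 3$, $d = 0.7$ and three iterations, $d_3 = 1 - 0.3^8$ exactly and $z_3 = (3/0.7) \cdot d_3$, so the quotient estimate is low by the same relative amount that $d_3$ falls short of 1.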

Division Methods
• Iteration
• Memory
• Arithmetic

Division – Iteration effort
• Pencil-and-paper method ($A = QB + 2^{-n}R$ and $R < B$): 1 bit of partial quotient per iteration, n iterations
• Example: A = 0.1001, B = 0.1010; Q = A / B
• $Q_i$: partial quotient; $R_i$: partial remainder; $R_{i+1} = R_i - B \times Q_i$
[Figure: long-division worksheet showing the partial remainders R0 = A, R1, R2, R3, R4 and the partial quotients Q1 = 0.1, Q2 = 0.01, Q3 = 0.000, Q4 = 0.0001, giving Q = 0.1101.]
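The one-bit-per-iteration scheme can be sketched on integers that stand for n-bit binary fractions. This is a restoring-style formulation under assumed conventions (hypothetical function name, operands given as integers scaled by $2^n$, and $A < B$ so that the quotient is a fraction); the operands in the usage note are illustrative and are not taken from the slide.

```python
def bit_serial_divide(a, b, n):
    """Divide fractions a/2**n and b/2**n, producing n quotient bits.

    Returns (q, r) with a * 2**n == q * b + r and 0 <= r < b,
    which is A = Q*B + 2**(-n) * R in terms of the fractions.
    """
    if not 0 <= a < b:
        raise ValueError("requires 0 <= a < b so the quotient is below 1")
    q, r = 0, a
    for _ in range(n):
        r <<= 1            # bring down the next bit position
        q <<= 1
        if r >= b:         # trial subtraction succeeds: quotient bit is 1
            r -= b
            q |= 1
    return q, r
```

For example, `bit_serial_divide(7, 10, 4)` divides 0.0111 by 0.1010 and returns the quotient bits 1011 (that is, 0.1011) with remainder 2.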

Division – Memory effort
• A lookup table is the simplest way to obtain multiple partial quotient bits in each iteration.
• SRT method: a lookup table stores m-bit partial quotients determined by m bits of partial remainder and m bits of divisor. Table size: $2^{2m} \times m$ bits.
• The SRT method is limited by the memory wall.
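A quick calculation shows how fast the table grows with m, which is what the memory wall refers to. The helper name below is hypothetical.

```python
def srt_table_bits(m):
    """Size in bits of a table with 2**(2*m) entries of m bits each."""
    return (1 << (2 * m)) * m
```

Going from m = 4 to m = 8 raises the table from 1,024 bits to 524,288 bits, and m = 16 would need 68,719,476,736 bits.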

Division – Arithmetic effort
• Partial quotient is calculated by arithmetic functions.
• Prescaling:
• Taylor expansion:
• Series expansion:

Division – Solution space
• Modern FPGAs contain plenty of memory and built-in multipliers, which enable high-performance dividers.
[Figure: solution space with three axes (Memory Effort, Iteration Effort, Arithmetic Effort). Marked points: SRT (near the memory wall), Pencil-and-paper, Prescaling, Series Expansion, Taylor Expansion, and "Our target". Annotations: Low latency, Low area.]

Division – PST algorithm
• Utilize the power of series expansion, which needs a good starting point.
• Prescaling provides a scaled divisor close to 1.
• 0-order Taylor expansion iterates to reach the final quotient.

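Division – PST algorithm
Algorithm:
• $E_0 = \mathrm{Table}(d(m)) \approx 1/d$
• $z_1 = z \times E_0$; $d_1 = d \times E_0$
• $E_1 = (2 - d_1) \approx \mathrm{INV}(d_1(2m))$
• $R_0 = z_1$
• $Q_i = R_{i-1} \times E_1$
• $R_i = R_{i-1} - Q_i \times d_1$
• $Q = Q + Q_i$
Example (m = 4):
• z = 0.1011,0110
• d = 0.1100,1011
• d(m) = 0.1100
• $E_0$ = 1.0011
• $z_1 = z \times E_0$ = 0.1101,1000,0010
• $d_1 = d \times E_0$ = 0.1111,0001,0001
• $E_1 = \mathrm{INV}(d_1(2m))$ = 1.0000,1110
• $Q_1 = z_1 \times E_1$ = 0.1110,0011
• $R_1 = z_1 - Q_1 \times d_1$ = 0.0000,0010,0101,1110,1101
• $Q_2 = R_1 \times E_1$ = 0.1001,1111 (shown scaled by $2^6$)
• $R_2 = R_1 - Q_2 \times d_1$ = 0.0000,0001,1111,1011,0001 (shown scaled by $2^6$)
• Q = 0.1110,0011 + 0.0000,0010,0111,11 = 0.1110,0101,0111,11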
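The worked example can be reproduced with exact fractions. The sketch below rests on several assumptions that are not stated on the slide: the function names are hypothetical, the lookup table is not modeled (the value $E_0$ = 1.0011 is passed in directly), INV is taken to be a bitwise inversion of $d_1$ truncated to 2m fractional bits, and each partial quotient is truncated to 2m bits, then to 2m − 2 further bits per iteration, which is the spacing the example's numbers suggest.

```python
from fractions import Fraction
from math import floor

def truncate(x, frac_bits):
    """Truncate a non-negative Fraction to frac_bits fractional bits."""
    scale = 1 << frac_bits
    return Fraction(floor(x * scale), scale)

def pst_divide(z, d, e0, m=4, iterations=2):
    """Prescale by e0, then refine: Q_i = R_{i-1} * E1, R_i = R_{i-1} - Q_i * d1."""
    z1 = z * e0                           # prescaled dividend
    d1 = d * e0                           # prescaled divisor, close to 1
    # E1 = INV(d1(2m)): invert the bits of d1 kept to 2m fractional bits
    # (one integer bit), which approximates 2 - d1.
    d1_top = floor(d1 * (1 << (2 * m)))
    e1 = Fraction((1 << (2 * m + 1)) - 1 - d1_top, 1 << (2 * m))
    q, r = Fraction(0), z1                # R_0 = z1
    frac_bits = 2 * m
    for _ in range(iterations):
        qi = truncate(r * e1, frac_bits)  # partial quotient
        r -= qi * d1                      # partial remainder
        q += qi
        frac_bits += 2 * m - 2            # assumption: 2m - 2 new bits per iteration
    return q, r, e1
```

Calling `pst_divide(Fraction(182, 256), Fraction(203, 256), Fraction(19, 16))` uses z = 0.1011,0110, d = 0.1100,1011 and $E_0$ = 1.0011, and returns Q = 14687/16384, which is 0.1110,0101,0111,11 in binary.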

Division – FPGA Implementation
• The PST algorithm is suitable for designing high-performance division units in FPGAs
• 32-bit division with 5-cycle latency
