CSE477 VLSI Digital Circuits, Fall 2002. Lecture 21: Multiplier Design. Mary Jane Irwin (www.cse.psu.edu/~mji), www.cse.psu.edu/~cg477. [Adapted from Rabaey’s Digital Integrated Circuits, © 2002, J. Rabaey et al.]
Review: Basic Building Blocks • Datapath • Execution units • Adder, multiplier, divider, shifter, etc. • Register file and pipeline registers • Multiplexers, decoders • Control • Finite state machines (PLA, ROM, random logic) • Interconnect • Switches, arbiters, buses • Memory • Caches (SRAMs), TLBs, DRAMs, buffers
Review: Binary Adder Landscape [Figure: taxonomy of synchronous word parallel adders, branching into ripple carry adders (RCA), carry prop min adders (signed-digit adders, residue adders), and fast carry prop adders (Manchester carry chain, parallel prefix, conditional sum, carry select, carry skip). Complexity annotations in the figure: T = O(N), A = O(N); T = O(log N), A = O(N log N); T = O(1), A = O(N).]
Multiply Operation • Multiplication as repeated additions • The partial products can be formed in parallel [Figure: an N-bit multiplicand and an N-bit multiplier form the partial product array, which is summed into the 2N-bit double precision product]
Shift & Add Multiplication • Right shift and add • Partial product array rows are accumulated from top to bottom on an N-bit adder • After each addition, right shift (by one bit) the accumulated partial product to align it with the next row to add • Time for N bits: T_serial_mult = O(N · T_adder) = O(N^2) for an RCA • Making it faster • Use a faster adder • Use higher radix (e.g., base 4) multiplication • Use multiplier recoding to simplify multiple formation • Form partial product array in parallel and add it in parallel • Making it smaller (i.e., slower) • Use an array multiplier • Very regular structure with only short wires to nearest neighbor cells, and thus a very simple and efficient layout in VLSI • Can be easily and efficiently pipelined
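The right-shift-and-add procedure above can be sketched in software as follows. This is a minimal behavioral model, not the lecture's circuit: the function name `shift_add_multiply` and the split of the 2N-bit result into `high` and `low` registers are assumptions made for illustration.

```python
def shift_add_multiply(multiplicand, multiplier, n):
    """Multiply two unsigned n-bit integers with right-shift-and-add."""
    mask = (1 << n) - 1
    multiplicand &= mask
    low = multiplier & mask  # multiplier register; refilled with low product bits
    high = 0                 # accumulated partial product (n bits plus a carry)
    for _ in range(n):
        if low & 1:
            # One pass through the N-bit adder; may set carry bit n.
            high += multiplicand
        # Right shift the (high, low) pair by one bit to align the next row.
        low = (low >> 1) | ((high & 1) << (n - 1))
        high >>= 1
    return (high << n) | low
```

Each loop iteration costs one adder delay, which is where the O(N · T_adder) time comes from.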
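Tree Multiplier Structure [Figure: the multiplier (‘ier) register Q and the multiplicand (‘icand) D, together with 0 inputs, feed the multiple forming circuits (mux); these produce the partial product array, which passes through a reduction tree and then a fast carry propagate adder (CPA) to give the product P] • Delay: mux + reduction tree (log N) + CPA (log N)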
(4,2) Counter • Built out of two (3,2) counters (just FAs!) • All of the inputs (4 external plus one internal) have the same weight (i.e., are in the same bit position) • The internal output is carried to the next higher weight position (marked in the figure) • Note: two carry outs, one “internal” and one “external” [Figure: two stacked (3,2) counters]
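A bit-level sketch of the (4,2) counter described above, built from two full adders. The names `full_adder` and `counter_4_2` and the choice of which three inputs enter the first adder are assumptions for illustration; the slide's figure may wire the inputs differently.

```python
def full_adder(a, b, c):
    """(3,2) counter: three equal-weight bits in, (sum, carry) out."""
    return a ^ b ^ c, (a & b) | (b & c) | (c & a)


def counter_4_2(x1, x2, x3, x4, cin):
    """(4,2) counter made of two (3,2) counters.

    Returns (s, carry, cout): s has the input weight, while carry (external)
    and cout (internal) both have twice that weight.
    """
    s1, cout = full_adder(x1, x2, x3)   # internal carry out does not depend on cin
    s, carry = full_adder(s1, x4, cin)  # internal carry in arrives at the same weight
    return s, carry, cout
```

Because `cout` never depends on `cin`, the internal carry travels at most one bit position when counters are placed side by side.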
Tiling (4,2) Counters • Reduces columns four high to columns only two high • Tiles with neighboring (4,2) counters • Internal carry in at same “level” (i.e., bit position weight) as the internal carry out [Figure: three (4,2) counters side by side, each built from two (3,2) counters]
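To illustrate the tiling, the sketch below places one (4,2) counter per bit position and reduces four rows to two. The function name `compress_4_to_2` is hypothetical, and routing the last internal carry into the sum row is a choice made here so that both output rows stay plain integers.

```python
def compress_4_to_2(a, b, c, d, width):
    """Reduce four unsigned width-bit rows to two rows with tiled (4,2) counters."""
    sum_row = 0
    carry_row = 0
    cin = 0  # internal carry into bit position 0
    for i in range(width):
        x1, x2, x3, x4 = ((v >> i) & 1 for v in (a, b, c, d))
        # First (3,2) counter of this tile.
        s1 = x1 ^ x2 ^ x3
        cout = (x1 & x2) | (x2 & x3) | (x3 & x1)
        # Second (3,2) counter takes the neighbor's internal carry.
        s = s1 ^ x4 ^ cin
        carry = (s1 & x4) | (x4 & cin) | (cin & s1)
        sum_row |= s << i
        carry_row |= carry << (i + 1)  # external carry: next higher weight
        cin = cout                     # internal carry: feeds the neighboring tile
    sum_row |= cin << width            # last internal carry has weight 2**width
    return sum_row, carry_row
```

The two returned rows still need a carry propagate adder: `a + b + c + d == sum_row + carry_row`.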
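4x4 Partial Product Array Reduction • Fast 4x4 multiplication using (4,2) counters [Figure: the multiplicand and multiplier form the partial product array; (4,2) counters reduce it to a two-row reduced partial product array, which goes to the CPA to form the double precision product]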
8x8 Partial Product Array Reduction • What is the minimum number of (4,2) counters needed to reduce the array to 2 rows? • Answer: 24 [Figure: ‘icand and ‘ier form the partial product array, which is reduced to the reduced partial product array]
Alternate 8x8 Partial Product Array Reduction • More (4,2) counters, so what is the advantage? [Figure: ‘icand and ‘ier form the partial product array, which is reduced to the reduced partial product array]
Array Reduction Layout Approach [Figure: the multiplicand feeds a row of multiple generators, each controlled by 2 multiple selection signals from the ‘ier; their outputs pass through stacked (4,2) counter slices and finally the CPA]
Next Lecture and Reminders • Next lecture • Shifters, decoders, and multiplexers • Reading assignment – Rabaey, et al, 11.5-11.6 • Reminders • Project final reports due December 5th • HW5 (last one!) due November 19th • Final grading negotiations/correction (except for the final exam) must be concluded by December 10th • Final exam scheduled • Monday, December 16th from 10:10 to noon in 118 and 121 Thomas
Topics • Adders and ALUs (§6.4, §6.5) • Multipliers (§6.6) • Array multiplier • Baugh-Wooley multiplier • Booth encoding • Wallace tree multiplier • Subsystem design principles (§6.2)
Elementary School Algorithm
          0 1 1 0   multiplicand
        × 1 0 0 1   multiplier
          0 1 1 0   partial product
      + 0 0 0 0     partial product
        0 0 1 1 0   running sum
    + 0 0 0 0       partial product
      0 0 0 1 1 0   running sum
  + 0 1 1 0         partial product
    0 1 1 0 1 1 0   product
Combinational Multiplier • Each bit of the multiplier controls whether an addition occurs
Array Multiplier • Regular layout • An n × m cell layout • Easy to pipeline • Used frequently in FPGAs and ASICs • Critical path • Less than (n+m-1) bit adder delay • Handles unsigned multiplication ONLY
A 4 × 4 Unsigned Array Multiplier • Skew the array for a rectangular layout
Unsigned Array Multiplier [Figure: 4 × 4 array of adder cells (inputs a, b, Cin; outputs Sum, Cout) that sums the partial product bits x3y0 through x0y3 row by row, with 0 fed into the unused inputs, to produce the product bits P0 through P7]
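A behavioral sketch of an unsigned array multiplier: one row of full-adder cells per multiplier bit, n × m cells in total. The names `full_adder` and `array_multiply` are assumptions, and the carries here ripple along each row, which may differ from the carry routing in the figure.

```python
def full_adder(a, b, cin):
    """One array cell: returns (sum, cout)."""
    return a ^ b ^ cin, (a & b) | (b & cin) | (cin & a)


def array_multiply(x, y, n, m):
    """Multiply unsigned n-bit x by unsigned m-bit y with an n x m grid of cells."""
    xb = [(x >> j) & 1 for j in range(n)]
    yb = [(y >> i) & 1 for i in range(m)]
    acc = [0] * (n + m)  # running sum bits, eventually P0 .. P(n+m-1)
    for i in range(m):   # one row of cells per multiplier bit
        carry = 0
        for j in range(n):
            partial_bit = xb[j] & yb[i]  # AND gate forming x_j * y_i
            acc[i + j], carry = full_adder(acc[i + j], partial_bit, carry)
        acc[i + n] = carry               # row carry out becomes the next product bit
    return sum(bit << k for k, bit in enumerate(acc))
```

As the slides note, this structure only handles unsigned operands.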
Signed Multiplication • Signed number representation • Signed n×n multiplication • (1110)₂ × (0011)₂ = (1010)₂, i.e., (−2) × 3 = (−6) • No difference from unsigned multiplication if the result has the same bit-width as the inputs • But what if we want the result to be 2n bits? • Use sign-bit extension • Needs a 2n × 2n array multiplier
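The example above can be checked with a short sketch. The helper names `to_signed`, `sign_extend`, and `signed_multiply_2n` are assumptions for illustration; Python's `*` stands in for the unsigned array multiplier.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def sign_extend(value, from_bits, to_bits):
    """Copy the sign bit of a from_bits-wide pattern into the upper bits."""
    value &= (1 << from_bits) - 1
    if value >> (from_bits - 1):
        value |= ((1 << to_bits) - 1) ^ ((1 << from_bits) - 1)
    return value


def signed_multiply_2n(a, b, n):
    """2n-bit signed product via sign extension and an unsigned 2n x 2n multiply."""
    wide = 2 * n
    product = sign_extend(a, n, wide) * sign_extend(b, n, wide)
    return product & ((1 << wide) - 1)  # keep only the low 2n bits
```

With n = 4, the low four bits of the plain unsigned product of 1110 and 0011 are 1010 (−6), but the full 8-bit unsigned product 00101010 is not −6; `signed_multiply_2n(0b1110, 0b0011, 4)` gives 11111010, which is −6 in 8 bits.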
Baugh-Wooley Multiplier: Structure [Figure: 4 × 4 array of adder cells (inputs a, b, Cin; outputs Sum, Cout) summing the partial product bits x3y0 through x0y3, with extra inputs x3, y3, and a constant 1 for sign handling, to produce the product bits P0 through P7]
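As a hedged illustration of how a Baugh-Wooley style array avoids the 2n × 2n multiplier, the sketch below uses the modified Baugh-Wooley formulation found in many textbooks: partial product bits that involve exactly one sign bit are complemented, and two constant 1s are added. The figure above appears to use a different arrangement of correction terms, so treat this as an assumption rather than a transcription of the slide; the name `baugh_wooley_multiply` is hypothetical.

```python
def baugh_wooley_multiply(x, y, n):
    """2n-bit two's-complement product of two n-bit patterns (n >= 2)."""
    mask = (1 << n) - 1
    x &= mask
    y &= mask
    xb = [(x >> i) & 1 for i in range(n)]
    yb = [(y >> i) & 1 for i in range(n)]
    total = 0
    # Magnitude bits times magnitude bits: ordinary AND terms.
    for i in range(n - 1):
        for j in range(n - 1):
            total += (xb[i] & yb[j]) << (i + j)
    # Terms with exactly one sign bit: complemented AND (NAND) terms.
    for i in range(n - 1):
        total += (1 - (xb[n - 1] & yb[i])) << (i + n - 1)
        total += (1 - (yb[n - 1] & xb[i])) << (i + n - 1)
    # Sign bit times sign bit has positive weight 2**(2n-2).
    total += (xb[n - 1] & yb[n - 1]) << (2 * n - 2)
    # Constant correction bits at positions n and 2n-1.
    total += (1 << n) + (1 << (2 * n - 1))
    return total & ((1 << (2 * n)) - 1)
```

Every term is non-negative, so the whole sum can be formed by the same kind of unsigned adder array.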
Booth Multiplier • Utilizes the Booth encoding scheme • Booth encoding scheme • Handles signed multiplication • Reduces the number of partial products by half • Small area and fast • Encoding scheme cannot be applied hierarchically • Often used as the first stage of partial product reduction
Booth Encoding: Principle • Write the multiplier y in two’s-complement form • Consider the first two terms • By looking at three bits of y at a time, we can determine whether to add 0, ±x, or ±2x to the partial product.
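The slide's equations did not survive extraction. As a sketch of the standard argument, assume an n-bit two's-complement multiplier with n even and define y_{-1} = 0; this indexing is an assumption and may differ from the slide's. Regrouping the bits in pairs gives one signed digit per pair:

```latex
\begin{align*}
y &= -y_{n-1}\,2^{n-1} + \sum_{i=0}^{n-2} y_i\,2^{i} \\
  &= \sum_{k=0}^{n/2-1} \bigl(y_{2k-1} + y_{2k} - 2\,y_{2k+1}\bigr)\,4^{k},
  \qquad y_{-1} = 0 .
\end{align*}
```

Each digit y_{2k-1} + y_{2k} - 2y_{2k+1} depends on three adjacent bits and takes a value in {-2, -1, 0, 1, 2}, so x·y is a sum of only n/2 terms, each equal to 0, ±x, or ±2x shifted left by 2k bits.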
Booth Actions
y_i  y_{i-1}  y_{i-2}  increment
0    0        0        0
0    0        1        X
0    1        0        X
0    1        1        2X
1    0        0        -2X
1    0        1        -X
1    1        0        -X
1    1        1        0
Booth Example • Don’t forget to sign-extend the encoded values when adding them together • Only have to extend 2 bits though • x = 011001 (25₁₀), y = 101110 (−18₁₀) • y1 y0 y−1 = 100, P1 = P0 − (10 × 011001) = 11111001110 • y3 y2 y1 = 111, P2 = P1 + 0 = 11111001110 • y5 y4 y3 = 101, P3 = P2 − 0110010000 = 11000111110 (−450₁₀ in 11 bits)
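The example can be reproduced with a small sketch of radix-4 Booth recoding. The names `BOOTH_TABLE`, `booth_radix4_digits`, `booth_multiply`, and `to_bits` are hypothetical, and the code assumes an even multiplier width n.

```python
# Three adjacent multiplier bits (high bit first) -> multiple of x to add.
BOOTH_TABLE = {
    0b000: 0, 0b001: 1, 0b010: 1, 0b011: 2,
    0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0,
}


def booth_radix4_digits(y, n):
    """Recode an n-bit two's-complement multiplier into n/2 digits, low digit first."""
    padded = (y & ((1 << n) - 1)) << 1  # append the implicit bit y_{-1} = 0
    return [BOOTH_TABLE[(padded >> i) & 0b111] for i in range(0, n, 2)]


def booth_multiply(x, y, n):
    """Return (product, running sums) for signed x times n-bit signed y."""
    acc = 0
    running_sums = []
    for k, digit in enumerate(booth_radix4_digits(y, n)):
        acc += digit * x * 4 ** k  # add 0, +/-x, or +/-2x, shifted by 2k bits
        running_sums.append(acc)
    return acc, running_sums


def to_bits(value, width):
    """Two's-complement bit string of the given width."""
    return format(value & ((1 << width) - 1), f"0{width}b")
```

For x = 25 and y = −18 with n = 6, the digits come out as −2, 0, −1, and the running sums rendered with `to_bits(..., 11)` match P1, P2, and P3 in the example.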
Wallace Tree • Reduces the number of partial products • Built from carry-save adders: • Three inputs: a, b, c • Two outputs: y, z such that y + z = a + b + c • Carry-save equations: • y_i = a_i ⊕ b_i ⊕ c_i • z_{i+1} = a_i b_i + b_i c_i + c_i a_i • What’s the difference from a carry-ripple adder?
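The carry-save equations can be applied to whole words at once with bitwise operators. The sketch below assumes unsigned Python integers as the word representation; the name `carry_save_add` is chosen for illustration.

```python
def carry_save_add(a, b, c):
    """Carry-save adder: returns (y, z) with y + z == a + b + c."""
    y = a ^ b ^ c                          # y_i = a_i XOR b_i XOR c_i
    z = ((a & b) | (b & c) | (c & a)) << 1  # z_{i+1} = majority(a_i, b_i, c_i)
    return y, z
```

No carry moves between bit positions inside the function, so the delay is that of a single full adder regardless of word width. A carry-ripple adder instead feeds each carry into the next position, so its delay grows with the word width.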
Wallace Tree Structure [Figure: a carry-ripple adder built from three FAs with inputs a0..a2, b0..b2, c0..c2 and outputs s0..s2, compared with a carry-save adder built from three FAs with the same inputs and outputs y0..y2 and z1..z3]
Wallace Tree Operation • n operands are reduced to about 2n/3 operands after each level • Sum of inputs = Sum of outputs • Can apply the reduction hierarchically • A more efficient design uses 4-2 adders to reduce n operands to n/2 operands after each level • Need a final adder to add the last two numbers
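A sketch of the hierarchical reduction, using 3-2 carry-save adders only. The names `wallace_reduce` and `wallace_multiply` are assumptions, rows left over at a level are simply passed through, and Python's `sum` stands in for the final adder.

```python
def carry_save_add(a, b, c):
    """3-2 reduction: returns (y, z) with y + z == a + b + c."""
    return a ^ b ^ c, ((a & b) | (b & c) | (c & a)) << 1


def wallace_reduce(operands):
    """Reduce a list of unsigned rows to at most two rows; returns (rows, levels)."""
    rows = list(operands)
    levels = 0
    while len(rows) > 2:
        full_groups = len(rows) // 3
        next_rows = []
        for g in range(full_groups):
            y, z = carry_save_add(*rows[3 * g:3 * g + 3])
            next_rows += [y, z]
        next_rows += rows[3 * full_groups:]  # leftovers skip this level
        rows = next_rows
        levels += 1
    return rows, levels


def wallace_multiply(x, y, n):
    """Unsigned n-bit multiply: partial products, tree reduction, final add."""
    partial_products = [(x << i) if (y >> i) & 1 else 0 for i in range(n)]
    rows, levels = wallace_reduce(partial_products)
    return sum(rows), levels
```

For eight partial products the row count goes 8, 6, 4, 3, 2, so the tree has four levels before the final adder.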
A Booth-Wallace Tree Multiplier • Most commonly used high-performance multiplier [Figure: a row of Booth encoders (B) feeds Wallace tree level 1 (four 4-2 adder arrays), then flip-flops (FF), level 2 (two 4-2 adder arrays), FF, level 3 (one 4-2 adder array), FF, level 4 (a 3-2 adder array), and a final 64-bit adder that is not part of the pipeline]
Topics • Adders and ALUs (§6.4, §6.5) • Multipliers (§6.6) • Subsystem design principles (§6.2)
Pipelining • Pipelining can be used to reduce clock period at the expense of latency [Figure: combinational logic 1 followed by combinational logic 2]
Cycle Time and Latency [Figure: two plots, cycle time versus # stages and latency versus # stages]
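The tradeoff can be illustrated with a simple first-order model. Everything in it is an assumption rather than something stated on the slides: the logic is taken to split evenly across stages, and each stage adds a fixed register overhead (`register_overhead`, a hypothetical parameter covering setup and clock-to-Q time).

```python
def pipeline_timing(logic_delay, stages, register_overhead):
    """First-order model: returns (cycle_time, latency) in the units of the inputs."""
    cycle_time = logic_delay / stages + register_overhead  # per-stage logic plus register
    latency = stages * cycle_time                          # time for one result to emerge
    return cycle_time, latency
```

Under this model, going from one stage to two with 10 units of logic and 1 unit of overhead shortens the cycle time from 11 to 6 units but lengthens the latency from 11 to 12 units.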
Data Paths • A data path is a logical and physical structure: • bit-wise logical organization • bit-wise physical structure • Data paths generally use busses to pass data between function units.
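Bit Slice Organization [Figure: registers, shifter, and ALU arranged as bit slices from bit n-1 down to bit 0, connected by a bus, with control signals running across the slices]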
Data Path Cell Design • Connections may be made by: • abutment, requiring stretching cells; • river routing, requiring a routing channel between function units.
Project • Due 10/26 • Schematic • Verilog/Spectre simulation results • 10/27 presentation (10-15 PowerPoint slides) • Important (efficiency-related) • How to add array of instances