
Matthew W. Ernest Electrical, Computer and Systems Engineering Dept.

Critical ALU Path Optimization and Implementation in a BiCMOS Process for Gigahertz Range Processors. Matthew W. Ernest Electrical, Computer and Systems Engineering Dept. Rensselaer Polytechnic Institute. Overview. Motivation Parallel Prefixes and Carry Types HBT Digital Circuits


Presentation Transcript


  1. Critical ALU Path Optimization and Implementation in a BiCMOS Process for Gigahertz Range Processors Matthew W. Ernest Electrical, Computer and Systems Engineering Dept. Rensselaer Polytechnic Institute

  2. Overview • Motivation • Parallel Prefixes and Carry Types • HBT Digital Circuits • Pseudo-carry Adder • Future Directions

  3. Motivation “Speed has always been important otherwise one wouldn't need the computer.” -Seymour Cray • Ubiquity • Simplicity • Complexity

  4. Parallel Prefixes Given: $x_0, x_1, x_2, \ldots, x_k$ Find: $x_0,\ x_0 \otimes x_1,\ x_0 \otimes x_1 \otimes x_2,\ \ldots,\ x_0 \otimes x_1 \otimes x_2 \otimes \cdots \otimes x_k$ • The set of problems covering sequences of operations where terms are added in order to the result of the previous operation • Carry computation is an application of parallel prefix theory
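
The prefix problem on this slide can be sketched in a few lines of Python. This is a sequential reference only, assuming nothing about the operator beyond associativity; the names `prefix_scan`, `carry_op` and `carries` are illustrative and do not come from the slides. The second half applies the scan to (generate, propagate) pairs, which is one common way to phrase carry computation as a prefix problem.

```python
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")

def prefix_scan(xs: List[T], op: Callable[[T, T], T]) -> List[T]:
    """Return [x0, x0 op x1, ..., x0 op ... op xk] (sequential reference)."""
    out: List[T] = []
    for x in xs:
        out.append(x if not out else op(out[-1], x))
    return out

def carry_op(lo: Tuple[int, int], hi: Tuple[int, int]) -> Tuple[int, int]:
    """Combine (generate, propagate) of a lower span with the span above it."""
    g_lo, p_lo = lo
    g_hi, p_hi = hi
    return (g_hi | (p_hi & g_lo), p_hi & p_lo)

def carries(a: int, b: int, n: int) -> List[int]:
    """Carry out of each bit position 0..n-1 for a + b with carry-in 0."""
    gp = [((a >> i) & (b >> i) & 1, ((a >> i) | (b >> i)) & 1) for i in range(n)]
    return [g for g, _ in prefix_scan(gp, carry_op)]
```

Because `carry_op` is associative, the same results can be produced by a tree of combiners instead of the left-to-right loop, which is what makes the parallel formulations on the following slides possible.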

  5. Carry Types: Carry Select • Compute possible results in parallel • Select when actual carry-in is available • Requires internal carry for blocks, e.g. ripple • Delay: $O(f(n/b) + b)$, min. $O(n^{1/2})$ • Area: $O(f(n/b) \cdot b + b)$, approx. $2n$ • Affected by block sizing
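
A behavioral sketch of carry select, assuming ripple carry inside each block as the slide suggests. The function names and the uniform block size are hypothetical choices for illustration; real designs often vary the block size, which is the "block sizing" effect noted above.

```python
def ripple_add(a: int, b: int, cin: int, width: int):
    """Add two width-bit values bit by bit; return (sum, carry_out)."""
    s, c = 0, cin
    for i in range(width):
        x, y = (a >> i) & 1, (b >> i) & 1
        s |= (x ^ y ^ c) << i
        c = (x & y) | (c & (x ^ y))
    return s, c

def carry_select_add(a: int, b: int, n: int, block: int, cin: int = 0):
    """n-bit add with per-block sums precomputed for carry-in 0 and 1."""
    total, carry = 0, cin
    for lo in range(0, n, block):
        width = min(block, n - lo)
        m = (1 << width) - 1
        xa, xb = (a >> lo) & m, (b >> lo) & m
        # Both candidates are independent of the incoming carry, so in
        # hardware every block can compute them at the same time.
        s0, c0 = ripple_add(xa, xb, 0, width)
        s1, c1 = ripple_add(xa, xb, 1, width)
        # Only this selection step waits for the actual carry-in.
        s, carry = (s1, c1) if carry else (s0, c0)
        total |= s << lo
    return total, carry
```

In this model the serial part is one ripple through a block plus one selection per block, which matches the shape of the delay expression on the slide.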

  6. Carry Types: Carry look-ahead • Carry-out can be “generated” at the current position or carry-in “propagated” • Delay: $O(1)$ • Area: $O(n^2)$ • High fan-in/fan-out

  7. Carry Types: Block carry look-ahead • A block propagates a carry if all bits in the block propagate a carry • A block generates a carry if a bit generates a carry and all succeeding bits propagate • Delay: O(log n) • Area: O(n log n)
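
The two block rules above translate directly into code. The sketch below assumes bit-level G = A·B and P = A+B and uses a doubling scan as one possible log-depth arrangement; it is an illustration, not necessarily the tree used in the presentation. `block_gp` and `lookahead_levels` are hypothetical names.

```python
def block_gp(a: int, b: int, lo: int, width: int):
    """(generate, propagate) of bits lo..lo+width-1, straight from the definitions."""
    g = [(a >> i) & (b >> i) & 1 for i in range(lo, lo + width)]
    p = [((a >> i) | (b >> i)) & 1 for i in range(lo, lo + width)]
    # Propagates only if every bit in the block propagates.
    block_p = int(all(p))
    # Generates if some bit generates and all higher bits in the block propagate.
    block_g = int(any(g[k] and all(p[k + 1:]) for k in range(width)))
    return block_g, block_p

def lookahead_levels(a: int, b: int, n: int):
    """Doubling scan over (G, P) pairs; returns (carries, number of levels)."""
    gp = [block_gp(a, b, i, 1) for i in range(n)]
    levels, dist = 0, 1
    while dist < n:
        nxt = list(gp)
        for i in range(dist, n):
            g_hi, p_hi = gp[i]
            g_lo, p_lo = gp[i - dist]
            nxt[i] = (g_hi | (p_hi & g_lo), p_hi & p_lo)
        gp, dist, levels = nxt, dist * 2, levels + 1
    return [g for g, _ in gp], levels
```

For n = 8 the scan takes three combining levels, in line with the $O(\log n)$ delay quoted on the slide.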

  8. Block carry look-ahead trees

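  9. Carry vs. Pseudo-carry
     $C_{out} = G_n + P_n \cdot G_{n-1} + \cdots + P_n \cdot P_{n-1} \cdots P_0 \cdot C_{in}$
     If $G = A \cdot B$ and $P = A + B$ then $G = G \cdot P$
     $C_{out} = P_n \cdot G_n + P_n \cdot G_{n-1} + \cdots + P_n \cdot P_{n-1} \cdots P_0 \cdot C_{in}$
     $C_{out} = P_n (G_n + G_{n-1} + \cdots + P_{n-1} \cdots P_0 \cdot C_{in})$
     $C_{out} = P_n \cdot H_n$
     $H_n = G_n + G_{n-1} + \cdots + P_{n-1} \cdots P_0 \cdot C_{in}$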
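
The factorization on this slide can be checked exhaustively for small word sizes. The sketch below assumes G = A·B and P = A+B at every bit, as stated on the slide; `carry_out`, `pseudo_carry` and `check` are illustrative names.

```python
from itertools import product

def carry_out(g, p, cin):
    """C_out = G_n + P_n G_{n-1} + ... + P_n ... P_0 C_in, via the recurrence."""
    c = cin
    for gi, pi in zip(g, p):  # bit 0 first
        c = gi | (pi & c)
    return c

def pseudo_carry(g, p, cin):
    """H_n = G_n + G_{n-1} + P_{n-1} G_{n-2} + ... + P_{n-1} ... P_0 C_in."""
    # Everything after G_n is the ordinary carry out of the bits below n.
    return g[-1] | carry_out(g[:-1], p[:-1], cin)

def check(n_bits: int = 4) -> bool:
    """True if C_out == P_n AND H_n for every operand pair and carry-in."""
    for bits in product((0, 1), repeat=2 * n_bits + 1):
        a, b, cin = bits[:n_bits], bits[n_bits:2 * n_bits], bits[-1]
        g = [x & y for x, y in zip(a, b)]
        p = [x | y for x, y in zip(a, b)]
        if carry_out(g, p, cin) != (p[-1] & pseudo_carry(g, p, cin)):
            return False
    return True
```

The check relies on G·P = G, which holds only because P is the inclusive OR of the operands; with P defined as exclusive OR the identity would not hold.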

  10. Carry vs. Pseudo-carry • Redundant terms create factorization opportunities • Factorization moves terms from critical paths to non-critical paths • Multiple paths can be parallelized • Products with fewer terms lead to implementations with smaller, faster gates

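  11. Deriving Block Pseudo-carry from Block Carry Look-ahead Terms • Pseudo-carries can be generated in blocks like carries
      Block Generate: $G^{i \cdot j}_{0} = G^{i}_{j} + P^{i}_{j} G^{i}_{(j-1)i} + \cdots + P^{i}_{j} P^{i}_{(j-1)i} P^{i}_{(j-2)i} \cdots G^{i}_{0}$
      If $G = A \cdot B$ and $P = A + B$ then $G = G \cdot P$
      $G^{i \cdot j}_{0} = P^{i}_{j} G^{i}_{j} + P^{i}_{j} G^{i}_{(j-1)i} + \cdots + P^{i}_{j} P^{i}_{(j-1)i} P^{i}_{(j-2)i} \cdots G^{i}_{0}$
      $G^{i \cdot j}_{0} = P^{i}_{j} (G^{i}_{j} + G^{i}_{(j-1)i} + \cdots + P^{i}_{(j-1)i} P^{i}_{(j-2)i} \cdots G^{i}_{0})$
      $H^{i \cdot j}_{0} = G^{i}_{j} + G^{i}_{(j-1)i} + \cdots + P^{i}_{(j-1)i} P^{i}_{(j-2)i} \cdots G^{i}_{0}$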

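  12. Generalized Pseudocarry Equations
      $H^{2}_{s} = G^{1}_{s+1} + G^{1}_{s}$
      $H^{i+j}_{s} = H^{j}_{s+i} + I^{j}_{s+i-1} \cdot H^{i}_{s}$
      $H^{i+j+k}_{s} = H^{k}_{s+i+j} + I^{k}_{s+i+j-1} \cdot H^{j}_{s+i} + I^{k}_{s+i+j-1} \cdot I^{j}_{s+i-1} \cdot H^{i}_{s}$
      $I^{p+q}_{t} = I^{q}_{t+p} \cdot I^{p}_{t}$
      $I^{p+q+r}_{t} = I^{r}_{t+q+p} \cdot I^{q}_{t+p} \cdot I^{p}_{t}$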
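
A brute-force check of the two-way merge rule, under an assumed reading of the notation: the superscript is the block width, the subscript is the lowest bit of the block, H is the block pseudo-carry and I is the product of the propagate terms over the block. That reading is an interpretation of the slide, not something it states, and the helper names are hypothetical.

```python
from itertools import product

def make_terms(a, b):
    """Build G, H, I for bit lists a, b (bit 0 first), with G=A*B and P=A+B."""
    g = [x & y for x, y in zip(a, b)]
    p = [x | y for x, y in zip(a, b)]

    def G(w, s):
        """Block generate of bits s..s+w-1 (0 for an empty block)."""
        c = 0
        for k in range(s, s + w):
            c = g[k] | (p[k] & c)
        return c

    def H(w, s):
        """Block pseudo-carry: top generate OR generate of the bits below it."""
        return g[s + w - 1] | G(w - 1, s)

    def I(w, t):
        """Product of propagates p[t] .. p[t+w-1]."""
        return int(all(p[t:t + w]))

    return G, H, I

def check_merge(n: int = 5) -> bool:
    """Verify H(2,s) = G(1,s+1) + G(1,s) and the two-way H and I merges."""
    for bits in product((0, 1), repeat=2 * n):
        G, H, I = make_terms(bits[:n], bits[n:])
        for s in range(n - 1):
            if H(2, s) != (G(1, s + 1) | G(1, s)):
                return False
            for i in range(1, n - s):
                for j in range(1, n - s - i + 1):
                    if H(i + j, s) != (H(j, s + i) | (I(j, s + i - 1) & H(i, s))):
                        return False
                    if I(i + j, s) != (I(j, s + i) & I(i, s)):
                        return False
    return True
```

Under this reading the I term in the H merge is offset one bit below the upper block, which is why its subscript is s+i-1 rather than s+i.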

  13. Generating Sums Using Pseudocarry • Sum with pseudo-carry no more complex than sum with carry • Other look-ahead features still apply, e.g. Han-Carlson “every other carry”
      $S_n = A_n \oplus B_n \oplus C_{n-1}$
      If $T_n = A_n \oplus B_n$ and $C_m = P_m \cdot H_m$ then $S_n = T_n \oplus P_{n-1} H_{n-1}$
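
A small behavioral model of addition built on the sum equation above. It assumes carry-in 0 and computes the pseudo-carries serially for clarity, using H_i = G_i + P_{i-1}·H_{i-1}, which follows from C = P·H; a real implementation would compute H with a look-ahead tree. `pseudo_carry_add` is a hypothetical name.

```python
def pseudo_carry_add(a: int, b: int, n: int):
    """n-bit add using S_i = T_i xor (P_{i-1} and H_{i-1}), carry-in 0."""
    g = [(a >> i) & (b >> i) & 1 for i in range(n)]
    p = [((a >> i) | (b >> i)) & 1 for i in range(n)]
    t = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]
    h = [0] * n
    h[0] = g[0]
    for i in range(1, n):
        # H_i = G_i + C_{i-1}, with C_{i-1} = P_{i-1} H_{i-1}
        h[i] = g[i] | (p[i - 1] & h[i - 1])
    s = t[0]  # bit 0 has no incoming carry
    for i in range(1, n):
        s |= (t[i] ^ (p[i - 1] & h[i - 1])) << i
    cout = p[n - 1] & h[n - 1]
    return s, cout
```

The only extra work relative to a carry-based sum is the AND with P_{n-1}, which can sit off the critical path because P is available long before H.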

  14. Adder comparison CSel PCLA Ripple CLA Bits C B A 32 32 12 12 9 6 5 64 64 20 16 12 7 6

  15. HBT Digital Circuits • Exponential I/V relationship leads to high gain and fast switching • Vertical arrangement allows critical dimensions to be smaller with tighter tolerances • Traditionally high DC power consumption: compare increasing leakage and switching currents for FETs

  16. Current Steering Logic • Constant current source equals combined emitter currents • Ratio of current through each transistor is an exponential function of base voltage • Difference in currents at the collectors is converted to a difference in voltage on the pull-up resistors
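
The three statements above correspond to the textbook model of an ideal bipolar differential pair. The sketch assumes matched transistors, negligible base current and a thermal voltage of about 25.85 mV (roughly 300 K); the tail current and resistor values in any call are made-up illustration values, not figures from the presentation.

```python
import math

def steer(i_tail_ma: float, dv_volts: float, v_t: float = 0.02585):
    """Ideal differential pair: return (i_c1, i_c2) in mA for Vb1 - Vb2 = dv."""
    ratio = math.exp(dv_volts / v_t)   # I_c1 / I_c2, exponential in base voltage
    i_c2 = i_tail_ma / (1.0 + ratio)
    # The two branch currents always add up to the constant tail current.
    return i_tail_ma - i_c2, i_c2

def output_swing(i_tail_ma: float, dv_volts: float, r_load_ohms: float) -> float:
    """Voltage difference (V) developed across equal pull-up resistors."""
    i1, i2 = steer(i_tail_ma, dv_volts)
    return (i1 - i2) * 1e-3 * r_load_ohms
```

With these assumptions an input difference of about 100 mV already steers roughly 98% of the tail current to one side, which is why small input swings are enough to switch the gate.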

  17. Single-ended vs. Double-ended • Single-ended: limited to simple functions; large fan-in • Double-ended: any function of inputs; fan-in limited by supply voltage

  18. Look-ahead gate w/ fully differential logic (schematic; signal pairs labeled Hn-2, Hn-1, Hn, In-1, In)

  19. Mixed input look-ahead gates (schematic; labels Hn, Hn-1, In, Vr) • In(Hn+ Hn-1) + In•Hn • Hn+ In•Hn-1 • Two series-gated levels for three inputs

  20. Mixed input look-ahead gates (schematic; labels Hn, Hn-1, Hn-2, In, In-1) • In In-1(Hn+ Hn-1 + Hn-2) + In In-1(Hn+ Hn-1) + In• In-1• Hn • Hn+ In•Hn-1 + In• In-1• Hn-2 • Three series-gated levels for five inputs

  21. Pseudocarry Blocks (block diagram: sixteen $H^{2}_{s}$ blocks, five $H^{6}_{s}$ blocks, then $H^{18}_{s}$, $H^{14}_{s}$ and $H^{32}_{s}$)

  22. Pseudocarry Tree Oscillator (block diagram; labels: Select, A, B, Cin, Cout)

  23. Carry Tree High-speed Output 2 x 165 ps

  24. Breakdown of measured delay • Total measured delay = 165 ps • Chart labels: Resistor model, Temperature, Wire C, Devices • Chart values: 11%, 6%, 12%, 71%

  25. Loaded vs. unloaded toggling • At design time, fT peak at 1.2 mA/µm² but limit at 2 mA/µm² • For some devices, max. frequency when driving a load can occur above the fT peak current • Models supported this; no reason at the time not to believe them • However, models are never qualified above fT peak current!

  26. Loaded vs. unloaded toggling

  27. Resistor Model Effects

  28. Model parameter variation

  29. Cadence internal parasitic methods • Approximates all capacitance as polynomial function of distance between conductors • Cannot extract RC and capacitance between conductors at the same time: killer for differential wiring! • Convenient, but window of usability small and shrinking

  30. QuickCap capacitance extraction • Field solving with floating random walk method • Accuracy almost wholly a function of run time: 4x run time gives ½ the error • Random walks are independent, giving near-perfect parallelization
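
The "4x run time gives ½ the error" rule is what one expects from any Monte Carlo estimate whose statistical error falls as one over the square root of the number of samples. The helpers below assume exactly that scaling, and that run time is proportional to the number of walks; they model the rule of thumb only and do not call or describe QuickCap itself.

```python
import math

def error_ratio_after(runtime_factor: float) -> float:
    """Factor by which statistical error shrinks when run time is multiplied."""
    return 1.0 / math.sqrt(runtime_factor)

def runtime_factor_for(error_ratio: float) -> float:
    """Run-time multiplier needed to scale the error by error_ratio (< 1)."""
    return 1.0 / error_ratio ** 2
```

Under this assumption, cutting the error to one tenth costs about one hundred times the run time, which is why the independence of the walks, and the parallel speed-up it allows, matters in practice.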

  31. Comparing parasitic extraction

  32. Cadence/QuickCap Design Flow • Extract physical data from layout • Compute RC with QuickCap • Extract netlist from schematic • Combine to simulate with Spectre

  33. Partial manual extraction with QuickCap • Identify main wires of oscillation paths: approx. dozen pairs • QuickCap extraction for each wire-ground cap. and cap. between pair • Add RC-ladder for each pair by hand to schematic and simulate

  34. Simulation with Parasitic Extraction (all values in ps)

      Feedback path   w/o parasitics   QuickCap parasitic cap.   COEFGEN parasitic cap.   Raphael parasitic cap.   QuickCap parasitic RC
      Cin             100              121                       128                      131                      135
      A1              103              123                       130                      129                      137
      A31             108              127                       129                      132                      141

  35. Pseudo-carry Tree configured as Ring Oscillator (block diagram; labels: A, B, Cin, Cout, Sel 0, Sel 1, constant inputs 00...00 and 11...11)

  36. SMI00 Test Structure Layout

  37. SMI00 Test Structure

  38. Carry Tree High-speed Outputs 16 x 146 ps

  39. Comparisons of published adders

      Reference   Type    Size        Gate Del.    Time
      ZIMM96      Carry   32          5            -
      STEL96      Adder   64(32)      12.5(12?)    -
      WANG97      Adder   32          3            2.7 ns
      CHAN98      Adder   64(32)      27(19.5)     -
      SILB98      Fixed   64          -            550 ps
      AIPP99      Adder   64          -            660 ps
      SAGE01      Adder   32[16x2]    -            <500 ps
      MATH01      Adder   64          -            482 ps
      STAS01      Adder   64          -            440 ps
      LEE02       Adder   64                       900 ps
      VANA02      ALU     32          8            <200 ps

  40. Cascode Output Stage • Eliminates Miller capacitance between input and output • Reduces Cjc and Cjs on outputs • Shortens rise time, but increases delay

  41. Dotted Emitter/Collector

  42. “Wide/Short” gate with dotted emitter/collector

  43. “Wide/Short” gate with dotted emitter/collector • Shorter trees lead to lower supply voltages • Wider trees reduce the ratio of emitter-followers to terms computed, lowering total current • More inputs per look-ahead gate means fewer look-ahead levels • Elimination of single-ended inputs on critical H signals allows faster switching with reduced swing

  44. Even wider look-ahead gate Width limited by • Accumulated Cjc and Cjs of dotted-and node • Saturation vs. breakdown • Fan-out loading from inputs and interconnect

  45. Conclusions • 32-bit addition with depth reduced to 5 gates fabricated; 4- and 3-gate-depth circuits designed • Gate to compute 3-way look-ahead fabricated; up to 8-way look-ahead designed • Carry delay for 32-bit addition measured at 146 ps • QuickCap technology file for 5HP brings simulated results within 11% of measured
