Challenges in Automatic Optimization of Arithmetic Circuits. Ajay K. Verma, Philip Brisk and Paolo Ienne. Processor Architecture Laboratory (LAP) & Centre for Advanced Digital Systems (CSDA), Ecole Polytechnique Fédérale de Lausanne (EPFL).
Circuit Performance Depends Heavily on the Description: Multiplier with Optimized Compressor Tree; “Software” Multiplier; Multiplier with Compressor Tree
Pre-Synthesis Optimization of Arithmetic Circuits Known architectures Original Circuit Description Arithmetic optimizations Logic Synthesis Physical Design Automatic architecture exploration
Automation and Computer Arithmetic • Algorithmic approaches for a particular class of circuits • Variable group size CLA adder [Lee91] • Irregular partial product compressors [Stelling98] Automation • Heuristics to optimize general classes of circuits • Kernel and co-kernel extraction [Brayton82] • Decomposition based approaches for general circuits [Bertacco97, Mishchenko01, Yang02]
Logic Synthesis • Synthesis tools have become extremely good at optimizing circuits expressed in Sum-Of-Product form • And when there are plenty of XOR gates? Before expansion: 0.37 ns (138.2 μm²); after expansion: 0.26 ns (146.9 μm²). Before expansion: 0.22 ns (58.8 μm²); after expansion: 0.27 ns (221.2 μm²).
Outline Verma, Brisk, & Ienne; DAC 2007 Best Paper Award nominee Verma, Brisk, & Ienne; IWLS 2008 Verma & Ienne; DAC 2006 Verma & Ienne; ICCAD 2004 Verma, Brisk, & Ienne; TCAD 2008 Creating Macroscopic Structure Exploring Microscopic Structure Optimizing at Word-Level
Outline (axes: complexity from low to high, granularity from high to low) Creating Macroscopic Structure Exploring Microscopic Structure Optimizing at Word-Level
Clustering: Maximization of the Use of Carry-Save Representation The two addition nodes are clustered Two addition nodes are separated by NOT Goal: Swap the adders with other logic operations while preserving the semantics to cluster additions
Examples of Transformations • Advancing shift left over add (distributivity of multiplication over addition): (A << k) = A · 2^k • Advancing shift right over addition is more complex • Advancing SEL over add (existence of the identity element of addition): C ? (A + B) : D = (C ? A : D) + (C ? B : 0)
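The two identities on this slide can be checked exhaustively for small operands. The sketch below is only an illustration: the helper names `sel` and `check_transformations` are made up here, and it assumes unbounded (non-wrapping) integers, which a fixed-width datapath would not have.

```python
from itertools import product

def sel(c, x, y):
    """Word-level multiplexer, i.e. the slide's C ? X : Y (hypothetical helper)."""
    return x if c else y

def check_transformations(width=4, max_shift=3):
    """Exhaustively check both identities for all operands of `width` bits."""
    values = range(1 << width)
    for a, b in product(values, repeat=2):
        for k in range(max_shift + 1):
            # A << k equals A * 2**k, so the shift distributes over the addition.
            assert (a << k) == a * 2**k
            assert ((a + b) << k) == (a << k) + (b << k)
        for c, d in product((False, True), values):
            # SEL moves past the adder; 0, the additive identity, fills the
            # second operand when the else-branch D is selected.
            assert sel(c, a + b, d) == sel(c, a, d) + sel(c, b, 0)
    return True

def shift_right_distributes(a, b, k=1):
    """Shift right does NOT distribute in general: low-order bits are lost."""
    return ((a + b) >> k) == (a >> k) + (b >> k)
```

Calling `shift_right_distributes(1, 1)` returns `False`, which hints at why advancing a shift right over an addition needs more care than the shift-left case.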
Some Transformations Have a Cost • Advancing PP over add (distributive property of multiplication over addition) • This transformation has a significant cost in terms of area!
Generation of All Pareto-Optimal Implementations • Pareto-optimal: better than any other implementation in terms of area or critical-path delay • Theorem: The transformations form a persistent and confluent reduction system
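As a rough illustration of the Pareto-optimality criterion, the sketch below filters a list of (delay, area) pairs down to the non-dominated ones. The function name and the sample numbers are hypothetical; this is not the authors' exploration algorithm, only the final selection step it implies.

```python
def pareto_front(points):
    """Return the (delay, area) points that no other point dominates.

    A point q dominates p when q is no worse than p in both delay and area
    and differs from p in at least one of them.
    """
    front = []
    for p in points:
        dominated = any(
            q != p and q[0] <= p[0] and q[1] <= p[1]
            for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(set(front))

# Hypothetical (delay in ns, area in square micrometres) candidates.
candidates = [(0.5, 100), (0.6, 90), (0.7, 95), (0.5, 120)]
front = pareto_front(candidates)
```

With these made-up candidates, `(0.7, 95)` is dominated by `(0.6, 90)` and `(0.5, 120)` by `(0.5, 100)`, so only two implementations remain on the front.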
Example: adpcmdecode Kernel • AND network • Compressor tree • 0.51 ns, 4901 μm² • 0.85 ns, 5678 μm²
Outline Limited scope for optimizations Bit-level Creating Macroscopic Structure Exploring Microscopic Structure Optimizing at Word-Level
Implementation of Subcircuits Corresponding to Contiguous Layers Can Be Improved Arithmetic ADD Logic LZD A direct implementation of LZA in carry-select fashion [Gerwig99] Leading Zero Anticipator
Input Condensation • Leader expressions: • Sufficient to evaluate the whole of an expression • Once you evaluate them, you can discard the input bits • Compute all leader expressions in parallel • Recursively compute leader expressions again
[Figure: "Some Large Circuit" (IN to OUT) is rewritten as leader expressions L, with |L| < |IN|, feeding a "Smaller circuit" (OUT); illustrated with an 8-input parallel counter and its sum (s) and carry (c) signals]
Progressive Decomposition: Algorithm Overview • Choose a subset of input bits • How many bits? • Many different combinations? • Find leader expressions • Optimize via Boolean ring properties • Find identities • Discard dependent expressions • Rewrite circuit in terms of leader expressions • Recursively process the remaining circuit
[Figure: inputs x and y, output z, with z = f(x, y)]
Example: 3-Input Adder (s2 Output)
X = [a1b1 + (a1 + b1)a0b0] ⊕ [(a1 ⊕ b1 ⊕ a0b0)c1 + c0(a0 ⊕ b0)(c1 + (a1 ⊕ b1 ⊕ a0b0))]
L(X, {a1, b1, c1}) = {a1 ⊕ b1 ⊕ c1, a1b1 ⊕ b1c1 ⊕ a1c1}
[Figure labels: Ripple-Carry Adder, Ripple-Carry Adder, 3:2 Compressor, Carry-save adder (CSA), carry, sum]
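The expression for X can be sanity-checked by brute force. The sketch below reads the operators lost between terms as XOR and every written "+" as OR, which is an assumption about the original slide, and the function names are ours. It checks that X is bit 2 of a + b + c for 2-bit operands, and that X depends on a1, b1, c1 only through the two leader expressions.

```python
from itertools import product

def x_from_slide(a1, a0, b1, b0, c1, c0):
    """Bit s2 of a + b + c as two cascaded ripple-carry adders (| is OR, ^ is XOR)."""
    carry_ab = (a1 & b1) | ((a1 | b1) & a0 & b0)    # carry out of a + b
    s1 = a1 ^ b1 ^ (a0 & b0)                         # bit 1 of a + b
    carry_c = (s1 & c1) | (c0 & (a0 ^ b0) & (c1 | s1))  # carry out of bit 1 when adding c
    return carry_ab ^ carry_c

def leaders(a1, b1, c1):
    """Leader expressions for {a1, b1, c1}: sum and carry of a 3:2 compressor."""
    return a1 ^ b1 ^ c1, (a1 & b1) ^ (b1 & c1) ^ (a1 & c1)

def check_s2():
    """Return how many distinct (leaders, a0, b0, c0) combinations occur."""
    seen = {}
    for a1, a0, b1, b0, c1, c0 in product((0, 1), repeat=6):
        x = x_from_slide(a1, a0, b1, b0, c1, c0)
        total = (2 * a1 + a0) + (2 * b1 + b0) + (2 * c1 + c0)
        assert x == (total >> 2) & 1
        # The leader values plus the untouched inputs must determine X uniquely.
        key = (leaders(a1, b1, c1), a0, b0, c0)
        assert seen.setdefault(key, x) == x
    return len(seen)
```

`check_s2()` visits all 64 input combinations but only 32 distinct keys, reflecting that three input bits have been condensed into two leader signals.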
A Better Division Is Used for Leader Expression Computation
X = ab(c ⊕ d ⊕ e) ⊕ cd(a ⊕ b ⊕ e)
X = (ab ⊕ cd)(a ⊕ b ⊕ c ⊕ d ⊕ e)
Based on the identity: pq(p ⊕ q) = 0
Theorem: An expression of the form (PQ ⊕ RS) can be factored as (P ⊕ R)T, if there exist U and V such that 1) PU = RV = 0 and 2) Q ⊕ S = U ⊕ V
The ideal membership problem can be used to determine the existence of such U and V
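The factorization can be verified exhaustively over the five inputs. The sketch below reads the slide with XOR throughout, picks U = a ⊕ b and V = c ⊕ d as witnesses, and takes T = Q ⊕ U; the witnesses and the formula for T are our own reading, since the slide does not spell them out.

```python
from itertools import product

def check_factoring():
    """Check PQ ^ RS == (P ^ R) & T for the slide's example, over all inputs."""
    for a, b, c, d, e in product((0, 1), repeat=5):
        p, q = a & b, c ^ d ^ e
        r, s = c & d, a ^ b ^ e
        u, v = a ^ b, c ^ d            # assumed witnesses U and V
        # Condition 1: PU = RV = 0, an instance of pq(p ^ q) = 0.
        assert (p & u) == 0 and (r & v) == 0
        # Condition 2: Q ^ S = U ^ V.
        assert (q ^ s) == (u ^ v)
        t = q ^ u                      # assumed form of the common factor T
        assert t == a ^ b ^ c ^ d ^ e
        assert ((p & q) ^ (r & s)) == ((p ^ r) & t)
    return True
```

The last assertion expands as P(Q ⊕ U) ⊕ R(S ⊕ V) = PQ ⊕ PU ⊕ RS ⊕ RV, where the PU and RV terms vanish by condition 1.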
Progressive Decomposition: Qualitative Analysis • Completely agnostic of the type of circuit to optimize • Automatically infers successful circuit designs from the literature… • Carry-lookahead adder (beyond minimal sizes) • Structured LZD/LOD circuit • Optimized LZA circuit (no sum computation) • Carry-save addition • Parallel counter • …and discovers some unknown to us! • Multi-Input comparisons (min/max)
Multi-Input Comparator (Min/max of k n-bit Integers)
• Binary tree of comparators: Number of comparators: k − 1; Critical path delay: O(log n log k); Hardware area: O(kn); 0.46 ns, 1755 μm²
• Pairwise comparison of inputs: Number of comparators: k(k − 1)/2; Critical path delay: O(log n + log k); Hardware area: O(k^2 n); 0.21 ns, 3479 μm²
• With Our Structuring Algorithm (bitwidth reduction using dominators and LODs): Number of LODs: k log* n; Critical path delay: O(log n + log k log* n); Hardware area: O(kn); 0.22 ns, 1331 μm²
log*() is the number of times the logarithm function must be iteratively applied before the result is ≤ 1, e.g., log*(2^65536) = 5
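The iterated logarithm used in these bounds is easy to compute directly. The sketch below assumes base-2 logarithms, which matches the slide's example; the function name `log_star` is ours.

```python
import math

def log_star(x):
    """Count how many times log2 must be applied before the value is <= 1."""
    count = 0
    while x > 1:
        x = math.log2(x)   # math.log2 also accepts arbitrarily large integers
        count += 1
    return count

# The chain for the slide's example is 2**65536 -> 65536 -> 16 -> 4 -> 2 -> 1.
example = log_star(2**65536)
```

Because log* grows so slowly, the k log* n LOD count and the log k log* n delay term stay close to k and log k for any practical bitwidth n.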
Outline Reed-Muller form can be very inefficient Exhaustive Exploration Efficient implementation of the leader expressions ? Creating Macroscopic Structure Exploring Microscopic Structure Optimizing at Word-Level
Problem Statement no “reuse” total “reuse” selective “reuse” Given a set of Boolean expressions, generate all their Pareto-optimal implementations
Enumerating Common Sub-Expressions • Root: Original Reed-Muller form • Either xy or xy replaced by a new variable • The nodes of the DAG correspond to all partial implementations of the two expressions with some sharing between them
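For readers unfamiliar with the Reed-Muller form at the root of the DAG, the sketch below derives it from a truth table with the standard XOR (Möbius) transform. It only illustrates the starting representation; it is not the enumeration algorithm itself, and the function name and the bit ordering of the truth table are assumptions made here.

```python
def reed_muller(truth_table, n):
    """Return the monomials of the Reed-Muller (XOR-of-ANDs) form.

    truth_table[m] is the function value when variable j equals bit j of m.
    Each monomial is a frozenset of variable indices; the empty set is the
    constant 1.
    """
    coeffs = list(truth_table)
    for i in range(n):
        for mask in range(1 << n):
            if mask & (1 << i):
                # XOR in the coefficient of the same mask without variable i.
                coeffs[mask] ^= coeffs[mask ^ (1 << i)]
    return [
        frozenset(j for j in range(n) if (m >> j) & 1)
        for m in range(1 << n)
        if coeffs[m]
    ]

# Carry of a full adder (majority of three bits) as a truth table.
majority = [1 if bin(m).count("1") >= 2 else 0 for m in range(8)]
carry_terms = reed_muller(majority, 3)
```

Here `carry_terms` holds the three two-variable products x0x1, x0x2 and x1x2, the kind of product that a node of the DAG could replace by a new shared variable.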
Pruning the Enumeration DAG • The size of the DAG can be as large as O ((n + m) 2m), where n is the number of variables and m is the size of the Boolean expressions • Enumerating the whole DAG is computationally infeasible • Pruning Criteria • Recognizing node equivalence (width reduction) • Merging some reductions into a single one (height reduction) • Delaying certain reductions (branch reduction)
There Is Scope for Further Pruning… • Number of possible implementations: >10^60 • Number of explored implementations: 2687 • Number of actual Pareto-optimal implementations: 4 • Area and delay for all 6-bit adders generated by our algorithm • Without any pruning, it would be impossible to handle expressions with more than five variables
…but the Enumeration Algorithm Finds Interesting Non-Trivial Relations! • 4 x 4-bit multiplier: better than our best manually-designed cell-based multiplier?! • The method has been generalized for higher bitwidth multipliers • It reduced the delay of the best cell-based 8 x 8-bit multiplier by 10% • Verma & Ienne; ASP-DAC 2007 Best Paper Award nominee
Summary Verma, Brisk, & Ienne; DAC 2007 Best Paper Award nominee Verma, Brisk, & Ienne; IWLS 2008 Verma & Ienne; DAC 2006 Verma & Ienne; ICCAD 2004 Verma, Brisk, & Ienne; TCAD 2008 Creating Macroscopic Structure Exploring Microscopic Structure Optimizing at Word-Level
Computer Arithmetic and Automation • Computer Arithmetic has long been the domain of extremely ingenious, manually developed architectures • Automation has mostly addressed the optimization of such architectures through the exploration of the predefined design spaces they delimit • Logic synthesis, from the “bottom”, has failed to explore beyond known territories due to fairly fundamental issues • It is perhaps high time to try to change all this…