Area Optimizations for Dual-Rail Circuits Using Relative-Timing Analysis

Tiberiu Chelcea, Girish Venkataramani, Seth C. Goldstein • Department of Computer Science, Carnegie Mellon University

QDI: Orphans problem • Early propagation: • “A” arrives early => Z transitions • Stale values on the other signals • Incorrect behavior: inputs acknowledged before being received • [Figure: dual-rail circuit with inputs A, B, C, D and outputs X, Y, Z, each carried on a rail pair such as A1/A0]
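The early-propagation effect can be illustrated with a toy model. The sketch below assumes a (rail1, rail0) encoding and a hypothetical early-propagating dual-rail AND template; neither is taken from the slides, and the names `NULL`, `ZERO`, `ONE` and `dual_rail_and` are illustrative only.

```python
# Dual-rail values as (rail1, rail0); NULL means "no data has arrived yet".
# Encoding and gate template are assumptions made for this illustration.
NULL, ZERO, ONE = (0, 0), (0, 1), (1, 0)

def dual_rail_and(a, b):
    """Hypothetical early-propagating dual-rail AND gate."""
    a1, a0 = a
    b1, b0 = b
    z1 = a1 & b1  # output is 1 only once both inputs are known to be 1
    z0 = a0 | b0  # output is 0 as soon as either input is known to be 0
    return (z1, z0)

# A = 0 arrives while B is still NULL: the output already becomes valid,
# so B's later arrival is never observed at the output (an orphan).
early = dual_rail_and(ZERO, NULL)
# A = 1 arrives while B is still NULL: the gate has to wait for B.
waiting = dual_rail_and(ONE, NULL)
```

In this model `early` is already a valid 0 while `waiting` is still NULL, which is the situation in which an input can be acknowledged before it has been received.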

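NCL-X solution • Add completion detection • [Figure: the dual-rail circuit (inputs A, B, C, D; internal nodes N1, N2, N3; outputs X, Y, Z) with added completion-detection logic; labels in the figure: DoneA, Done, C]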

QDI Gate Delays • QDI implementations always assume the worst: equal probability for any gate delay

Motivation • Quasi-Delay Insensitive (QDI) circuits: • One timing constraint • Naturally tolerate parametric variation, but… • Have large area overheads • Added completion detection for correctness

Parametric Variation and Gate Delays • ITRS’05: 35% parametric variation by 2020 • Goal: pay only what is necessary

Goal: Optimizing Sync→Async Flow • Use timing information to reduce size of completion detection • Use mixed gates to further reduce area • [Figure: flow diagram; labels: w/ early propagation, w/o early propagation, regular gates, strict gates]

Contributions Three new relative-timing area optimizations: • Direct method: • Timing analysis + simple CD elimination • Greedy method: fast but not optimal • Uses strict gates, but may increase area • Exact method: optimal, but slow • Solves an mILP problem

Outline • Timing analysis & Direct Optimization • Greedy optimization method • Exact optimization method • Results • Conclusions

Basics • QDI circuits: • Unbounded but finite delays on gates and wires • One timing assumption: isochronic fork • Timed circuits: • Delays on gates and wires: bounded time intervals • Given input arrival times: compute propagation intervals for each gate and wire

Timing Computation • Conservative assumption: any input change can trigger an output change • [Figure: example circuit (inputs A, B, C, D; internal nodes N1, N2, N3; outputs X, Y, Z) with gates and wires annotated by propagation intervals: (0,0), (0.5,0.7), (0.6,0.8), (1.0,1.2), (1.1,1.2), (1.5,1.9), (2.0,5.6), (3.0,4.0), (3.5,4.1), (3.6,4.9); GlobalPI is marked on the figure]
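A minimal sketch of this interval computation follows. It assumes one plausible reading of the conservative rule: a gate may fire as soon as its earliest input changes and no later than its latest input plus the maximum gate delay. The function name, data layout and delay values are hypothetical and do not reproduce the figure.

```python
def propagate(order, fanin, delay, arrivals):
    """Compute a (earliest, latest) interval for every gate.

    order    -- gate names in topological order
    fanin    -- gate name -> list of driving signals
    delay    -- gate name -> (min_delay, max_delay)
    arrivals -- primary input name -> (earliest, latest) arrival
    """
    times = dict(arrivals)
    for gate in order:
        inputs = [times[src] for src in fanin[gate]]
        # Assumed rule: any input change can trigger an output change,
        # so the earliest output follows the earliest input.
        earliest = min(lo for lo, _ in inputs) + delay[gate][0]
        latest = max(hi for _, hi in inputs) + delay[gate][1]
        times[gate] = (earliest, latest)
    return times

# Illustrative circuit shaped like Z = (A.B).(C+D); delays are made up.
arrivals = {name: (0.0, 0.0) for name in "ABCD"}
fanin = {"N1": ["A", "B"], "N2": ["C", "D"], "Z": ["N1", "N2"]}
delay = {"N1": (1.0, 1.5), "N2": (0.5, 1.0), "Z": (1.0, 2.0)}
times = propagate(["N1", "N2", "Z"], fanin, delay, arrivals)
```

With these made-up delays the output interval is wide because the earliest bound follows the fastest path while the latest bound follows the slowest one.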

Direct Optimization Method • Gate completion detection iff gate may not be stable when outputs are produced • Example annotation: 1.9 < 2.0, so under any input change the gate is quiescent when the output is produced • [Figure: the example circuit with propagation intervals (1.0,1.2), (1.1,1.2), (1.5,1.9), (2.0,5.6), (3.0,4.0), (3.5,4.1), (3.6,4.9) and a Done signal built with a C-element]
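The selection rule can be sketched as a simple filter. The sketch assumes that "when outputs are produced" means the earliest bound over all primary outputs, and that a gate is safe when its latest bound is strictly earlier than that. The gate names and the mapping of intervals to gates are illustrative, not read off the figure.

```python
def gates_needing_cd(times, gates, outputs):
    """Return the gates that may still be switching when an output appears.

    times   -- signal name -> (earliest, latest) propagation interval
    gates   -- gate names to check
    outputs -- primary output names
    """
    # Assumption: the earliest moment any primary output can be produced.
    first_output = min(times[out][0] for out in outputs)
    # A gate whose latest transition precedes that moment is quiescent.
    return [g for g in gates if times[g][1] >= first_output]

# Hypothetical intervals (values are illustrative).
times = {"G1": (1.0, 1.2), "G2": (1.5, 1.9), "G3": (3.0, 4.0), "OUT": (2.0, 5.6)}
needs_cd = gates_needing_cd(times, ["G1", "G2", "G3", "OUT"], ["OUT"])
```

Here `G2` finishes at 1.9, before the first possible output at 2.0, so it is dropped from completion detection, while `G3` and `OUT` are kept.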

Strict Gates • All inputs must arrive before producing an output • Eliminate early propagation effect • Extremely expensive • Decrease length of propagation interval • [Figure: strict-gate implementation with inputs A and B; three elements labeled C]
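The effect on the propagation interval can be sketched by comparing the two gate kinds on the same inputs. This assumes a regular gate's earliest bound follows its earliest input while a strict gate's earliest bound follows its latest-starting input, and it ignores that a strict gate would normally have different delays. `gate_interval` and all numbers are hypothetical.

```python
def gate_interval(inputs, delay, strict=False):
    """Output interval of one gate from its input intervals.

    inputs -- list of (earliest, latest) input intervals
    delay  -- (min_delay, max_delay) of the gate
    strict -- if True, the gate waits for all inputs before firing
    """
    # Strict: nothing happens before the last input can have started arriving.
    # Regular: the first input to arrive may already trigger the output.
    pick_earliest = max if strict else min
    earliest = pick_earliest(lo for lo, _ in inputs) + delay[0]
    latest = max(hi for _, hi in inputs) + delay[1]
    return (earliest, latest)

inputs = [(1.0, 1.5), (3.0, 4.0)]
regular = gate_interval(inputs, (0.5, 1.0))
strict = gate_interval(inputs, (0.5, 1.0), strict=True)
```

Under these assumptions the strict gate's interval starts later and is shorter, which matches the "narrower and delayed" behavior described for GlobalPI later in the deck.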

Timing Computation with Strict Gates • Entire completion detection: single OR gate • This circuit: area not reduced • Goal: smart insertion of strict gates • [Figure: the example circuit built with strict gates, annotated with propagation intervals (1.0,1.2), (1.1,1.2), (1.4,1.9), (1.5,1.9), (3.0,4.0), (3.5,4.1), (3.6,4.9), (5.0,6.8) and a Done signal]

Outline • Timing analysis & Direct Optimization • Greedy optimization method • Exact optimization method • Results • Conclusions

Greedy Optimization (1) • Strict gates: area implications • GlobalPI may be narrower and delayed • Fewer gates non-quiescent • Smaller completion detection • Greedy optimization framework: • Flip gates in the circuit from normal to strict • Select most promising candidate • Continue until no improvements possible

Greedy Optimization (2) Algorithm: • 1. For each gate Gi in the circuit, in turn: • Flip Gi from regular to strict • Perform timing analysis, compute GlobalPIi • Flip Gi back to regular • 2. Select Gk with the narrowest GlobalPIk • 3. If GlobalPIk is narrower than the previous best: flip Gk to strict permanently and continue (go to 1) • Else: finish
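The loop above can be sketched as follows. The timing model is an assumption (regular gates fire on their earliest input, strict gates wait for all inputs, GlobalPI is the hull of the primary-output intervals), strict and regular gates share the same delays for simplicity, and all names and numbers are hypothetical.

```python
def global_pi(order, fanin, delay, arrivals, outputs, strict):
    """Timing analysis: return the (earliest, latest) hull of all outputs."""
    times = dict(arrivals)
    for g in order:
        ins = [times[s] for s in fanin[g]]
        earliest = max if g in strict else min  # strict gates wait for all inputs
        times[g] = (earliest(lo for lo, _ in ins) + delay[g][0],
                    max(hi for _, hi in ins) + delay[g][1])
    return (min(times[o][0] for o in outputs),
            max(times[o][1] for o in outputs))

def greedy_strict(order, fanin, delay, arrivals, outputs):
    """Flip gates to strict one at a time while GlobalPI keeps narrowing."""
    strict = set()
    lo, hi = global_pi(order, fanin, delay, arrivals, outputs, strict)
    best = hi - lo
    while True:
        trials = []
        for g in order:
            if g in strict:
                continue
            # Tentatively flip g, analyze, and implicitly flip it back.
            lo, hi = global_pi(order, fanin, delay, arrivals, outputs, strict | {g})
            trials.append((hi - lo, g))
        if not trials:
            break
        width, gk = min(trials)  # most promising candidate
        if width >= best:        # no improvement: finish
            break
        strict.add(gk)           # flip gk to strict permanently
        best = width
    return strict, best

arrivals = {"A": (0.0, 0.0), "B": (0.5, 1.0), "C": (0.0, 0.0), "D": (0.0, 0.0)}
fanin = {"N1": ["A", "B"], "N2": ["C", "D"], "Z": ["N1", "N2"]}
delay = {"N1": (1.0, 1.5), "N2": (0.5, 1.0), "Z": (1.0, 2.0)}
chosen, width = greedy_strict(["N1", "N2", "Z"], fanin, delay, arrivals, ["Z"])
```

On this toy circuit the loop first makes `Z` strict, then `N1`, and stops when flipping `N2` no longer narrows the output interval. Note that the sketch, like the slide's algorithm, tracks interval width rather than area.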

Greedy Optimization (3) • Algorithm does not optimize for area directly • Instead: may reduce the completion detection by narrowing the output interval • Results promising, but individual benchmarks may result in larger area

Outline • Timing analysis & Direct Method • Greedy optimization method • Exact optimization method • Results • Conclusions

Exact Optimization Method • Mixed Integer Linear Programming (mILP) • Transform circuit graph into an optimization problem: • Introduce variables for each gate, wire and primary input/output • Matrix coefficients: from library (gate areas) and back-annotation (gate/wire delays) files • Decision variables (GS): should a gate be strict?

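mILP formulation • Minimize: $\mathit{TotalArea} = \mathit{GateArea} + \mathit{CDArea}$ • $\mathit{GateArea} = \sum_i \left(GS_i \cdot \mathit{SArea}_i + (1 - GS_i) \cdot \mathit{NArea}_i\right)$ • $\mathit{CDArea} = \mathit{SCD} \cdot \mathit{Or2Area} + (\mathit{SCD} - 1) \cdot \mathit{CArea}$ • SCD: # gates that need completion detection • NeedsCD: does a gate need CD? • $\mathit{NeedsCD} = 0$ if $PI_M < \mathit{GlobalPI}_m$ or successor is strict; otherwise 1 • Rest of the model implements timing computation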
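To make the objective concrete, the sketch below only evaluates TotalArea for a given assignment; it is not a solver and does not model the timing constraints. The NeedsCD flags are passed in rather than derived, the area numbers are made up, and treating CDArea as 0 when no gate needs completion detection is an added assumption (the slide formula is not defined for that case).

```python
def total_area(gs, s_area, n_area, needs_cd, or2_area, c_area):
    """Evaluate TotalArea = GateArea + CDArea for one candidate solution.

    gs       -- per-gate 0/1 decision: 1 if the gate is strict
    s_area   -- per-gate area when strict (SArea)
    n_area   -- per-gate area when regular (NArea)
    needs_cd -- per-gate 0/1 flag: 1 if the gate needs completion detection
    """
    gate_area = sum(g * sa + (1 - g) * na
                    for g, sa, na in zip(gs, s_area, n_area))
    scd = sum(needs_cd)  # SCD: number of gates that need completion detection
    # Assumption: no completion-detection hardware at all when SCD == 0.
    cd_area = scd * or2_area + (scd - 1) * c_area if scd > 0 else 0
    return gate_area + cd_area

# Hypothetical three-gate circuit with made-up areas.
s_area, n_area = [6, 6, 6], [2, 2, 2]
all_regular = total_area([0, 0, 0], s_area, n_area, [1, 1, 1], or2_area=1, c_area=2)
one_strict = total_area([0, 1, 0], s_area, n_area, [1, 0, 1], or2_area=1, c_area=2)
```

With these numbers the single strict gate removes one completion detector but costs more than it saves, which illustrates why the objective has to weigh both terms together.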

Improving the mILP Model • Basic mILP model: too slow even for small circuits (hours for a dozen gates) • Leverage problem knowledge into model improvements: • Branching order: gates closer to the output are more likely to become strict => inspected first • Single-input gates: never strict • Provide initial solution (result of greedy opt) • Can solve problems with hundreds of gates in minutes

Related Work: Optimizations • Cortadella et al: • logical function decompositions • can achieve substantial area savings • can be the starting point for our methods • Zhou et al: consider strict gates in optimization, but no timing information • Sokolov et al: two timing optimizations • Alternate levels: unrealistic assumptions for gate delays • Longest path: applicable only for small circuits

Experimental Setup • Tool flow: • Synthesis & tech-mapping with Synopsys Design Compiler • Perl scripts for dual-rail implementations • Optimization tool reads structural Verilog and timing back-annotations • End result: optimized circuits (Verilog) • Experiments: • Arithmetic and ISCAS’89 benchmarks • Pre-layout runs in 0.18µm technology

Area: Ratio vs. NCL-X method • Direct: 0.83x • Greedy: 0.55x • mILP: 0.43x • Greedy: 2.83x NCL-X area for le32 • mILP does not finish in less than 1 hour: partial results • [Figure: per-benchmark area ratios]

Area breakdown • 8/168 strict • 4.7% before → 40% after • Less than half the area of NCL-X

Parametric Variation: BK adder

Conclusions • Paper introduced: • a method to translate synchronous circuits into optimized asynchronous circuits • Three new relative timing optimizations for improving area • Direct: extremely simple • Greedy: fast, good results • Exact: optimal, may be extremely slow • Analyzed the impact of parametric variation on these circuits

Backup slides

Outline • Background • Timing analysis & Direct Optimization • Greedy optimization method • Exact optimization method • Results • Conclusions

Introduction • Future deep sub-micron technologies: • large parametric variations (ITRS’05 predicts 35% by 2020). • Asynchronous design a natural fit • Asynchronous handshaking: widespread • Acceptance for asynchronous circuits is predicated on quality CAD tools: • “Pure” async: from scratch • Sync to async translation

Synchronous to Asynchronous Translation • Template-based replacement of each sync gate • Z = (A·B)·(C+D) • [Figure: synchronous circuit (inputs A, B, C, D; internal nodes N1, N2, N3; outputs X, Y, Z) next to the corresponding dual-rail circuit, where each signal is carried on a rail pair such as A1/A0]

Related Work • Numerous approaches for translating synchronous circuits into asynchronous • Dealing with the orphans problem: • Kondratiev et al: NCL-X (discussed below) • Brej: anti-tokens • Allows for early propagation • Completion detection in background • Even larger area overheads

ILP optimization for 32-bit BK adder • CrtSol: current best integer solution • Best Estimation: best guess of how far the optimum is; when 0, optimum found

Outline • Timing analysis & Direct Optimization • Greedy optimization method • Exact optimization method • Results • Conclusions


mILP Run Time
