Variation-Aware Chip Design for Reliability and Performance

Deming Chen, ECE, UIUC
Students: Christine Chen, Greg Lucas, Lu Wan
Acknowledgement: work partially supported by Sun Microsystems

Outline
• Background
  • Process variation
  • Motivation of variation-aware chip design
• SSTA with multiple clock domains
• Telescopic logic for processor performance
• Clock tree design with skew reduction
• Conclusion

Process Variation
• Increases as device and interconnect feature sizes are scaled down
• Can be within-die (intra-die) or between dies (inter-die)
(Source: Intel)

Traditional Solutions
• Speed/power binning: measure chips and bin them into performance categories; sell lower-performing or power-hungry chips at a lower price
• Guard-band the design to achieve the desired yield
  • Uses pessimistic worst-case process corners
  • Inefficient as the variability increases with scaling

Example: Worst-Case Execution Time (WCET) of X + Y, where X = <μ_X, σ_X> and Y = <μ_Y, σ_Y>
• Deterministic analysis:
  WCET_X = μ_X + 3σ_X, WCET_Y = μ_Y + 3σ_Y
  WCET_{X+Y} = WCET_X + WCET_Y = μ_X + μ_Y + 3(σ_X + σ_Y)
• Statistical analysis:
  X + Y = <μ_X + μ_Y, sqrt(σ_X^2 + σ_Y^2)>
  WCET_{X+Y} = μ_{X+Y} + 3σ_{X+Y} = μ_X + μ_Y + 3 sqrt(σ_X^2 + σ_Y^2)
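The gap between the two analyses can be checked numerically. The following is a minimal sketch, assuming X and Y are independent and normally distributed (so that variances add); the function names and the sample numbers are illustrative and do not come from the slides.

```python
import math

def wcet_deterministic(mu_x, sigma_x, mu_y, sigma_y, k=3.0):
    """Corner-style bound: add the per-stage worst cases (mu + k*sigma)."""
    return (mu_x + k * sigma_x) + (mu_y + k * sigma_y)

def wcet_statistical(mu_x, sigma_x, mu_y, sigma_y, k=3.0):
    """Statistical bound: sum the distributions first, then take mu + k*sigma.

    Assumes X and Y are independent, so variances (not sigmas) add.
    """
    mu = mu_x + mu_y
    sigma = math.sqrt(sigma_x ** 2 + sigma_y ** 2)
    return mu + k * sigma

# Illustrative numbers (not from the slides): two stages, each with
# mean 100 and sigma 10 time units.
det = wcet_deterministic(100.0, 10.0, 100.0, 10.0)
stat = wcet_statistical(100.0, 10.0, 100.0, 10.0)
print(det, stat)
```

With these sample values the deterministic bound is 260 time units, while the statistical bound is roughly 242, which is the pessimism that guard-banding pays for.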

Statistical Static Timing Analysis (SSTA) for Multiple Clock Domains

Introduction
• Increased process variation in deep-submicron (DSM) technologies demands SSTA
• Many SSTA algorithms have been proposed, but they all focus on simple timing graphs and the traversal algorithm
• Industrial designs contain multi-cycle paths, multiple clock domains, and false paths
• To meet the demands of industrial-strength designs, SSTA must be extended to handle complex timing graphs

WID Variation Modeling
• Two components of process variation
  • Systematic variation: Lg, Wg
  • Random variation: Na, tox
• 40%-60% of the total variation is systematic [Nassif, ISQED'00]
• Therefore, correlation must be considered
• Utilize a grid-based correlation model [Chang & Sapatnekar, ICCAD'03]
  • Correlation is a function of distance; all cells within a grid have correlation = 1
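A grid-based correlation lookup can be sketched in a few lines. This is only an illustration: the slide says that correlation is a function of distance and equals 1 inside a grid, but it does not give the function, so the exponential decay, the `corr_length` parameter, and all names below are assumptions.

```python
import math

def grid_of(x, y, grid_size):
    """Map a cell's (x, y) placement coordinate to its grid square."""
    return (int(x // grid_size), int(y // grid_size))

def cell_correlation(cell_a, cell_b, grid_size, corr_length):
    """Correlation between the systematic variation of two cells.

    Cells in the same grid square are perfectly correlated (1.0).
    Otherwise the correlation decays with the distance between grid
    squares; the exponential form is an assumed placeholder.
    """
    ga = grid_of(cell_a[0], cell_a[1], grid_size)
    gb = grid_of(cell_b[0], cell_b[1], grid_size)
    if ga == gb:
        return 1.0
    dist = grid_size * math.hypot(ga[0] - gb[0], ga[1] - gb[1])
    return math.exp(-dist / corr_length)

# Hypothetical placement coordinates, grid size 10, correlation length 30.
same = cell_correlation((1.0, 1.0), (8.0, 9.0), 10.0, 30.0)
near = cell_correlation((1.0, 1.0), (25.0, 1.0), 10.0, 30.0)
far = cell_correlation((1.0, 1.0), (55.0, 1.0), 10.0, 30.0)
print(same, near, far)
```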

MCSSTA Algorithm Overview
• MCSSTA extends SSTA to handle:
  • Multi-cycle paths
  • Multiple clock domains
  • False paths
• Example: adder A takes 1 cycle; multiplier B takes 2 cycles; multiplexer C has both single-cycle and multi-cycle paths through it

Extending the Max Equations to Multi-Clock Domains
• Clock domain decomposition:
  • P1 = (FF1, FF2) → FF5: 1 cycle
  • P2 = (FF3, FF4) → FF5: 2 cycles
• Find: pdf_FF5 = max(P1, P2), considering the timing constraints and correlation
• Step 1: Normalization
• Step 2: Correlation correction
• Where μ = mean, σ = standard deviation, and n = cycle constraint
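The slide's own normalization and correlation-correction equations are not reproduced in this transcript. As a stand-in, the sketch below shows one plausible reading: each arrival time is divided by its cycle constraint n (as the PCA slide later suggests), and the max of the two normalized Gaussians is then approximated with Clark's classical moment-matching formulas. The function names, the sample numbers, and the choice of Clark's approximation are assumptions, not the authors' stated method.

```python
import math

def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def Phi(x):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def normalize(mu, sigma, n_cycles):
    """Express an arrival time in units of one clock cycle (assumed step)."""
    return mu / n_cycles, sigma / n_cycles

def clark_max(mu1, s1, mu2, s2, rho=0.0):
    """Mean and sigma of max(X, Y) for correlated Gaussians (Clark, 1961)."""
    a = math.sqrt(max(s1 * s1 + s2 * s2 - 2.0 * rho * s1 * s2, 0.0))
    if a == 0.0:
        # X - Y is a constant, so the max is simply the later arrival.
        return (mu1, s1) if mu1 >= mu2 else (mu2, s2)
    alpha = (mu1 - mu2) / a
    m1 = mu1 * Phi(alpha) + mu2 * Phi(-alpha) + a * phi(alpha)
    m2 = ((mu1 ** 2 + s1 ** 2) * Phi(alpha)
          + (mu2 ** 2 + s2 ** 2) * Phi(-alpha)
          + (mu1 + mu2) * a * phi(alpha))
    return m1, math.sqrt(max(m2 - m1 * m1, 0.0))

# Hypothetical arrival times at FF5: P1 has a 1-cycle constraint,
# P2 has a 2-cycle constraint.
p1 = normalize(0.90, 0.05, 1)
p2 = normalize(1.70, 0.10, 2)
mu_ff5, sigma_ff5 = clark_max(p1[0], p1[1], p2[0], p2[1], rho=0.0)
print(mu_ff5, sigma_ff5)
```

After normalization both paths are compared on the same scale (fractions of their allowed time), which is what makes a single max over mixed cycle constraints meaningful.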

MCSSTA Timing Graph Setup
• Each node/edge contains a list of the cycle constraints that go through the node/edge
• False paths and other complicated timing constraints can be handled by removing the cycle constraint from a node/edge
(Figure: circuit and timing graph for multiplexer C)

Principal Components Timing Traversal
• PCA transforms a set of correlated random variables into a set of independent random variables:
• This significantly simplifies traversal of the timing graph, since correlation does not have to be tracked
• Two properties of PCA: 1. 2.
• The normalization and correlation correction operations can be performed simultaneously by dividing the principal components by the cycle constraint
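To make the decorrelation step concrete, here is a small sketch that applies PCA to just two correlated parameters using the closed-form eigen-decomposition of a 2x2 covariance matrix. It is an illustration under assumed numbers, not the traversal used in MCSSTA, and it does not attempt to restate the two PCA properties that are missing from the slide. All names are hypothetical.

```python
import math

def pca_2x2(var_a, var_b, cov_ab):
    """Eigen-decompose the covariance matrix [[var_a, cov_ab], [cov_ab, var_b]].

    Returns (eigenvalues, unit eigenvectors), largest eigenvalue first.
    """
    mean = (var_a + var_b) / 2.0
    radius = math.hypot((var_a - var_b) / 2.0, cov_ab)
    eigenvalues = (mean + radius, mean - radius)
    if cov_ab == 0.0:
        # Already uncorrelated: the coordinate axes are the principal axes.
        if var_a >= var_b:
            vectors = ((1.0, 0.0), (0.0, 1.0))
        else:
            vectors = ((0.0, 1.0), (1.0, 0.0))
        return eigenvalues, vectors
    vectors = []
    for lam in eigenvalues:
        vx, vy = cov_ab, lam - var_a  # solves (C - lam*I) v = 0
        norm = math.hypot(vx, vy)
        vectors.append((vx / norm, vy / norm))
    return eigenvalues, tuple(vectors)

def loadings(eigenvalues, vectors):
    """coeff[k][i] such that variable k = sum_i coeff[k][i] * p_i,
    where the p_i are independent unit-variance principal components."""
    return [[vectors[i][k] * math.sqrt(max(eigenvalues[i], 0.0))
             for i in range(2)] for k in range(2)]

# Hypothetical covariance: variances 4 and 1, covariance 1 (correlation 0.5).
eigs, vecs = pca_2x2(4.0, 1.0, 1.0)
L = loadings(eigs, vecs)
# Rebuilding the covariance from the independent components recovers the input.
rec_aa = L[0][0] ** 2 + L[0][1] ** 2
rec_ab = L[0][0] * L[1][0] + L[0][1] * L[1][1]
rec_bb = L[1][0] ** 2 + L[1][1] ** 2
print(rec_aa, rec_ab, rec_bb)
```

Because every delay becomes a linear combination of the same independent components, sums and covariances along the timing graph reduce to coefficient arithmetic.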

Modified PCA Traversal
• Normalization and correlation correction are performed during the timing graph traversal

Experimental Results
• ISCAS benchmarks, slowest 70% of paths set to 2 cycles
• 0.207% error in mean
• 2.0% error in standard deviation

Summary
• SSTA is closer to maturity
• It has been extended to consider complex timing constraints for normal distributions
• In the future, we plan to extend the method to handle non-normal distributions

Telescopic Logic for Microprocessor Performance

Traditional View of Circuit Optimization
• Quality metrics for circuit optimization
  • Cycle time: t_cycle > AT_longest_path
  • Power consumption: P_overall = P_dynamic + P_static
• Circuit optimization is static
  • Static timing analysis
  • Longest path receives the most optimization effort
  • Decomposition, re-synthesis, sizing up/down, dual threshold voltage
• Power optimization creates a critical path wall
• The critical path wall makes timing optimization more difficult

Recent Innovation: RAZOR Logic
• Tolerates timing errors
  • Data correct: one cycle
  • Data error: n cycles to recover (n > 1)
• Perf = Fmax * (p + (1 - p)/n), where p is the probability that data is latched correctly
(Figure: RAZOR logic)
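The performance expression can be evaluated directly to see how sensitive it is to the error rate. The sketch below simply implements the slide's formula as written; the function name and the sample values are illustrative.

```python
def razor_performance(f_max, p_correct, n_recovery):
    """Perf = Fmax * (p + (1 - p) / n), as given on the slide.

    p_correct  : probability that data is latched correctly
    n_recovery : cycles needed to recover from a timing error (n > 1)
    """
    if not 0.0 <= p_correct <= 1.0:
        raise ValueError("p_correct must be a probability")
    if n_recovery <= 1:
        raise ValueError("recovery takes more than one cycle (n > 1)")
    return f_max * (p_correct + (1.0 - p_correct) / n_recovery)

# Illustrative values: normalized Fmax, 1% error rate, 5-cycle recovery.
perf = razor_performance(1.0, 0.99, 5)
print(perf)
```

With a 1% error rate and a 5-cycle recovery, the expression gives about 0.992 of Fmax, so overclocking pays off only while errors stay rare.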

A Promising Alternative: Telescopic Logic
• One-cycle class: the set of input vectors for which the circuit is stable before t_cycle.
• Two-cycle class: the set of input vectors for which the circuit becomes stable only after t_cycle.
• f_h is asserted when the input vector belongs to the two-cycle class.
• Throughput:
(Figure: telescopic unit)

Concept of Dynamic Circuit Optimization
• Classification of primary outputs (POs)
  • Critical (C) / non-critical (NC)
  • High-activity (HA) / low-activity (LA)
  • Four possible combinations: C+HA, NC+HA, C+LA, NC+LA
• Question: should the optimization be constrained by paths that are rarely exercised?
• Dynamic optimization:
  • Timing speculation: 1. allow a few POs to be slower than t_cycle; 2. perform data recovery when an error is latched.
  • Instead of spending equal optimization effort on "C+HA" and "C+LA", dynamic optimization biases the effort towards "C+HA"

Dynamic Optimization with Telescopic Logic
1. Represent the function with an ROBDD
  • ROBDD: a Reduced Ordered Binary Decision Diagram is used to encode the functionality of the circuit.
2. Use the TCF to derive the sensitization probability
  • TCF: a Timed Characteristic Function (BDD + timing), which encodes the relationship between time and function, is built for the circuit.

Dynamic Optimization with Telescopic Logic (cont.)
3. Maxflow-mincut chooses candidates for lowVT assignment
• PROB: given the signal probability of each PI and the TCF of a circuit, the probability of sensitizing the POs can be derived.
• LowVT: accelerate selected nodes in the circuit by assigning them lowVT.
• MINCUT: use the maxflow-mincut algorithm to find candidates for lowVT assignment.

Use TCF+BDD to Evaluate Functional Bias
• Given different input probabilities:
  • Case 1: each PI has static probability 0.5
  • Case 2: each PI has static probability 0.2
• When both are overclocked by the same amount, the probability of getting correct outputs is higher in case 2 than in case 1.
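The effect of input probabilities on the chance of a correct (one-cycle) result can be illustrated with a toy model. The sketch below uses brute-force enumeration of input vectors in place of a BDD/TCF, assumes the PIs are independent, and uses a made-up 4-input circuit whose long path is sensitized only when every input is 1. None of these details come from the slides.

```python
from itertools import product

def one_cycle_probability(pi_probs, is_one_cycle):
    """Probability that a random input vector settles within one cycle.

    pi_probs     : probability of each primary input being 1 (independent PIs)
    is_one_cycle : predicate telling whether a vector is in the one-cycle class
    """
    total = 0.0
    for vector in product((0, 1), repeat=len(pi_probs)):
        weight = 1.0
        for bit, p in zip(vector, pi_probs):
            weight *= p if bit else (1.0 - p)
        if is_one_cycle(vector):
            total += weight
    return total

def toy_is_one_cycle(vector):
    """Hypothetical circuit: the slow path is exercised only by all-ones."""
    return not all(vector)

case1 = one_cycle_probability([0.5] * 4, toy_is_one_cycle)
case2 = one_cycle_probability([0.2] * 4, toy_is_one_cycle)
print(case1, case2)
```

In this toy circuit the all-ones vector is much rarer when each input is 1 with probability 0.2, so case 2 produces a correct output more often than case 1, in line with the slide's observation.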

Dynamic Optimization Effect
• Blue: result optimized with Synopsys dualVT
• Red: dynamically optimized dualVT result with the same number of lowVT cells
• Although the longest path for red is longer than for blue, red has a higher probability of producing the correct output

Summary
• Telescopic logic is a promising approach to dynamic circuit optimization for improving performance.
• Techniques such as BDD, TCF, maxflow-mincut, and lowVT assignment can be used to achieve dynamic optimization.
• Compared to a circuit optimized in the traditional way, dynamic optimization increases the overall throughput.

Clock Tree Design under Process Variation

Zero-Skew Clock Tree Synthesis
• Clock skews are differences in clock arrival times and hurt circuit frequency
• There are existing clock tree synthesis algorithms for zero skew
  • Tsay, "Exact zero skew," 1991
  • Chao et al., "Zero Skew Clock Routing With Minimum Wirelength," 1992
  • Others
• However, exact zero skew cannot be achieved in the presence of process variation
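For context, deterministic zero-skew merging can be sketched as follows. The code balances the Elmore delay of two subtrees joined by a wire of a given length and solves for the tapping point, in the spirit of the exact zero-skew approach cited above. Parameter names and numbers are hypothetical, delays are treated as fixed values (no variation), and the case where the wire must be elongated (a result outside [0, 1]) is not handled.

```python
def elmore_wire_delay(wire_len, load_cap, r_unit, c_unit):
    """Elmore delay of a uniform wire driving a lumped load capacitance."""
    return r_unit * wire_len * (c_unit * wire_len / 2.0 + load_cap)

def zero_skew_split(t1, c1, t2, c2, length, r_unit, c_unit):
    """Fraction x of the wire placed on subtree 1's side so that both
    root-to-sink delays match.

    t1, t2 : delay from each subtree root to its sinks
    c1, c2 : total capacitance of each subtree
    """
    num = t2 - t1 + r_unit * length * (c2 + c_unit * length / 2.0)
    den = r_unit * length * (c1 + c2 + c_unit * length)
    return num / den

# Hypothetical subtrees and wire parameters.
t1, c1, t2, c2 = 4.0, 2.0, 3.0, 1.0
length, r_unit, c_unit = 10.0, 0.1, 0.2
x = zero_skew_split(t1, c1, t2, c2, length, r_unit, c_unit)
delay_1 = t1 + elmore_wire_delay(x * length, c1, r_unit, c_unit)
delay_2 = t2 + elmore_wire_delay((1.0 - x) * length, c2, r_unit, c_unit)
print(x, delay_1, delay_2)
```

The balance holds only for the nominal resistance and capacitance values; once these vary from die to die, the two delays drift apart, which is the limitation the slide points out.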

Bounded Skew Clock Tree Synthesis
• Delays from the clock source to all of the clock sinks are within a certain bound
• The skew bound is defined as the maximum difference in clock arrival times
• There are existing clock tree synthesis algorithms
  • Cong et al., "Bounded-Skew Clock and Steiner Routing," 1998
  • Others
• Some tradeoffs can be obtained among wirelength, power, and skew bounds
• However, these works still deal with deterministic delays

Buffered Clock Tree Synthesis
• In most buffered clock tree synthesis algorithms, buffers are inserted after clock tree routing by selecting potential buffer positions in the tree
• Simultaneous clock tree routing and buffer insertion is done in [3] and [4]
• [4] is designed to construct a balanced buffered clock tree, which is better suited to later improvement, e.g. link insertion
[3] Chen and Wong, "An Algorithm for Zero-Skew Clock Tree Routing with Buffer Insertion," 1996
[4] Rajaram and Pan, "Variation Tolerant Buffered Clock Network Synthesis with Cross Links," 2006

Motivation for This Work
• There are tradeoffs among clock tree topology, number of buffers, delay, and skew bounds
  • Can we capture these properties and explore the design space during the synthesis of the clock tree?
  • Furthermore, can we use this information to make later improvements?
• Also, the delays are actually probability distributions rather than deterministic values
  • Can we use the delay distributions to make the clock trees more robust?

Algorithm Overview
• Construct buffered clock trees in a bottom-up fashion, passing delay and skew bound information along the way
• At the top (root) level, several clock trees with different delay and skew bound properties are available for the user to choose from
• Tradeoffs among delay, skew bounds, buffer distribution, and tree topology can be seen at this stage
• After the target tree is chosen, the final topology is built in a top-down way

Merging Region [1/2]
• Given two subtrees A and B, we want to connect them at a merging point M1 and obtain the delay and skew bounds from M1 to any of the clock sinks in A and B
• Several other points Mi can be chosen such that the delay and skew bounds from Mi to the sink nodes are the same as those of M1
• These points form a merging region
(Figure: subtrees A and B with merging points M1 and Mi)

Merging Region [2/2]
• Several different merging regions for subtrees A and B can be constructed by defining different delay and skew bounds
• The delay and skew bounds associated with the merging region can be passed to the upper level when merging the new subtree M with another subtree
(Figure: merging region M above subtrees A and B)

Buffer Insertion
• After the merging region is constructed, we make it a potential buffer position
• A buffer library is given, defining the characteristics (e.g. intrinsic delay, capacitance, resistance) of different types of buffers
• Try inserting each type of buffer, as well as inserting no buffer
• Pass the solutions to the upper level
(Figure: three variants of merging region M above subtrees A and B)

Illustration: Merging
• Two distributions describe the delay at node M
(Figure: node M connected to subtrees A and B; labels: wire delays, no delay at the sink nodes)

Illustration: Skew Bounds
• Four quantities are defined at M: the upper skew bound, the lower skew bound, the overall skew bound, and the maximum delay
• For each selected merging region, we can obtain a pair (b, d) to characterize it
(Figure: node M above subtrees A and B)

Buffer Insertion
• Once a merging region is determined, we try to insert buffers at that position
• New delays are calculated from the intrinsic delay and the resistance of buffer type i
• New (b, d) pairs are also calculated accordingly
(Figure: node M above subtrees A and B)

Pruning
• Pruning can be performed to eliminate redundant solutions
• The blue points are redundant because they have larger (b, d) pairs than at least one other point
• The red points are kept in the solution space
(Plot: d versus b)
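The pruning rule amounts to keeping the non-dominated (b, d) pairs. Below is a minimal sketch, assuming that b is the skew bound, d is the maximum delay, and smaller is better for both; the function name and the sample points are illustrative.

```python
def prune(solutions):
    """Keep only (b, d) pairs that no other pair beats or ties in both
    coordinates (a Pareto front with both b and d minimized)."""
    kept = []
    best_d = float("inf")
    # Sort by b, then d; a pair survives only if it improves on the best
    # d seen among pairs with smaller or equal b.
    for b, d in sorted(set(solutions)):
        if d < best_d:
            kept.append((b, d))
            best_d = d
    return kept

# Hypothetical candidate solutions at one merging region.
candidates = [(1, 9), (2, 7), (3, 8), (4, 4), (5, 6), (2, 10)]
front = prune(candidates)
print(front)
```

Here (3, 8) and (2, 10) are dominated by (2, 7), and (5, 6) is dominated by (4, 4), so only three candidates are passed to the next level.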

Next Iteration
(Figure: tree with nodes R, M, N, A, B, C, D)

Top-Level Decision
• When we get to the topmost level, several (b, d) pairs representing different clock tree designs are available
• Choose the design that is most desirable
• Trace back in a top-down way to build the clock tree
(Plot: d versus b)

Summary
• Our algorithm explores the design space of buffered clock trees, capturing their different delays, skew bounds, buffer distributions, and tree topologies
• The synthesized clock trees can be further improved according to their properties
  • For example, a clock tree with loose skew bounds can be improved using link insertion or other techniques
• More accurate delay models, such as SSTA with spatial correlation or non-normally distributed delays, can be adopted

Conclusions
• Process variation is shifting chip design into the statistical domain
• Various analysis/optimization approaches can be taken
  • Process variation modeling
  • Statistical timing analysis
  • Statistical gate level optimization
  • Statistical physical design
  • Variation-aware architecture
  • …
• Performance/power efficiency can be effectively improved
• Some remaining challenges
  • Further validation of variation modeling
  • How systematic variation can be correlated well to the actual measured data
  • Novel joint CAD/architecture works
  • Consider T and V variation together with process variation

Thank you!
