
Large-Scale Static Timing Analysis




Presentation Transcript


  1. Large-Scale Static Timing Analysis Akintayo Holder Rensselaer Polytechnic Institute

  2. Contents • Introduction • Method • Summary

  3. Static Timing Analysis • A gate (or other device) requires all input signals to be present at the same time. • “Same time” defined by clock signal. • STA ensures all devices have their expected inputs.

  4. Static Timing Analysis • Static timing analysis (STA) ensures that signals will propagate through a circuit. • Checks that every gate will have valid inputs. • Block oriented • Polynomial • Circuit dictates STA’s behaviour

  5. STA: Arrival Time Calculation • Late-mode static timing analysis was used in this study. • Arrival time (AT) is the latest time a signal can arrive. • AT(i) := max(AT(j) + Delay(S(j,i))), over all incoming segments S(j,i) • AT propagates forward from the inputs.
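The forward AT sweep above can be sketched in Python. The edge-list representation and names here are illustrative, not the deck's actual implementation; delays are taken as plain per-segment numbers.

```python
from collections import defaultdict

def arrival_times(edges, delay, primary_inputs):
    """Forward late-mode AT propagation over a timing DAG.

    edges: list of (src, sink) segments; delay maps each segment to its delay.
    AT(i) = max over incoming segments S(j, i) of AT(j) + Delay(S(j, i)).
    Nodes are visited in topological order, so every fan-in is ready first.
    """
    fanout = defaultdict(list)
    indeg = defaultdict(int)
    for j, i in edges:
        fanout[j].append(i)
        indeg[i] += 1
    at = {n: 0.0 for n in primary_inputs}  # signals launch at t=0
    ready = list(primary_inputs)
    while ready:
        j = ready.pop()
        for i in fanout[j]:
            cand = at[j] + delay[(j, i)]
            if cand > at.get(i, float("-inf")):
                at[i] = cand                # keep the latest arrival
            indeg[i] -= 1
            if indeg[i] == 0:              # all fan-in seen: node is ready
                ready.append(i)
    return at
```

A two-input gate `c` feeding sink `d` illustrates the max: `c` takes the later of its two input arrivals.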

  6. Slack: how STA checks that the circuit is correct • Required arrival time (RAT) – the latest time a signal may arrive and still meet timing. • RAT propagates backward from the output nodes. • For the circuit to work, every node's AT must not exceed its RAT. • Slack = RAT - AT; negative slack flags a violation.
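The backward RAT sweep mirrors the forward one, with min replacing max; slack then falls out per node. This is a minimal sketch in the same toy representation as above, not the deck's implementation, and it assumes every node reaches an output.

```python
from collections import defaultdict

def required_times_and_slack(edges, delay, at, rat_out):
    """Backward RAT propagation, then Slack = RAT - AT.

    rat_out fixes RAT at the primary outputs.
    RAT(j) = min over outgoing segments S(j, i) of RAT(i) - Delay(S(j, i)).
    """
    fanin = defaultdict(list)
    outdeg = defaultdict(int)
    for j, i in edges:
        fanin[i].append(j)
        outdeg[j] += 1
    rat = dict(rat_out)
    ready = list(rat_out)      # primary outputs have no fanout
    while ready:
        i = ready.pop()
        for j in fanin[i]:
            cand = rat[i] - delay[(j, i)]
            if cand < rat.get(j, float("inf")):
                rat[j] = cand              # keep the tightest requirement
            outdeg[j] -= 1
            if outdeg[j] == 0:             # all fanout seen: node is ready
                ready.append(j)
    return {n: rat[n] - at[n] for n in rat}
```

Non-negative slack everywhere means the circuit meets timing.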

  7. Behaviour of STA • Block-oriented STA is polynomial with respect to the size of the circuit. • Running time depends on the circuit size, • but tighter tolerances require more accurate delay estimates. • Multiple process corners, statistical STA and other techniques are common in industrial tools and increase running times.

  8. Why large-scale static timing analysis is hard • Performance depends on the circuit. • Circuit must be divided/partitioned to minimize the cost of communication. • Graph/circuit partitioning is hard.

  9. What is the role of Decomposition? • Divide the computation into smaller tasks. • Execute tasks in parallel. • The smaller the tasks, the more important the decomposition. • A case of Amdahl's Law: the serial fraction left behind limits speedup.
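Amdahl's Law, mentioned above, can be written as a one-line formula: with parallel fraction p and n processors, speedup = 1 / ((1 - p) + p/n).

```python
def amdahl_speedup(parallel_fraction, n):
    """Amdahl's Law: the serial fraction (1 - p) bounds speedup
    no matter how many processors n are applied."""
    p = parallel_fraction
    return 1.0 / ((1.0 - p) + p / n)
```

Even a 95%-parallel workload tops out below 20x, which is why decomposition quality matters so much at scale.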

  10. What is meant by Regularity? • Presence of a pattern or evidence of rules. • Easy to understand and define. • A regular pattern can be used to decompose work. • Irregular: no obvious pattern or apparent rules. • With no clear pattern, decomposition must choose among many "bad" choices.

  11. Where do we get Structural Irregularity from? • Structural Irregularity is derived from unstructured objects. • Unstructured objects are defined by their neighbour relations • Neighbour relations are unique, complex and often irregular.

  12. STA and Structural Irregularity • STA depends on shape of circuit • Circuit is irregular, which causes STA to demonstrate structural irregularity • Irregularity makes decomposition hard and limits ability to scale

  13. Why large-scale static timing analysis is hard • STA exhibits structural irregularity. • Irregularity hinders problem decomposition. • Large-scale systems usually display traits that make efficient decomposition difficult. • Good decomposition is needed for good performance.

  14. Why balance will help performance • Balance refers to the relative performance of the processors, the network and the memory interconnect. • In a balanced machine, a processor can saturate either the network or the memory interconnect. • Balance simplifies decomposition.

  15. The Blue Gene has modest processors and high-performance networks

  16. The IBM Blue Gene/L • Supports MPI as the primary programming interface. • Simple memory management: 512 MB or 1 GB per node; 1 or 2 processors per node; no virtual memory; subset of the Linux API. • Adaptive routing for point-to-point traffic; independent networks for collectives. • Optimized for bandwidth, scalability and efficiency.

  17. More on Balance • Cray XT5: high-density blades with 8 quad-core Opteron processors; SeaStar2+, a low-latency, high-bandwidth interconnect; a hybrid supercomputer using FPGAs and co-processors. • Better performance when adding a processor from a new node; ~45% slowdown when adding a processor on an existing node instead. • Network saturation was not observed in point-to-point experiments. • Run times of global operations varied. (Worley et al., 2009; Snell et al., 2007)

  18. Contents • Introduction • Method • Summary

  19. Understanding the Input • Static timing analysis processes a timing graph, which represents the circuit. • The timing graph is a DAG from input to output pins. • Large-scale STA sorts edges in the DAG by the depth of their source node (levelization).
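Levelization by source depth can be sketched as a topological sweep that records each node's longest distance from the primary inputs, so all of a node's fan-in lies on strictly earlier levels. The representation below is the same illustrative edge list used earlier, not the deck's data structures.

```python
from collections import defaultdict

def levelize(edges, primary_inputs):
    """Assign each node its depth (longest path) from the primary inputs."""
    fanout = defaultdict(list)
    indeg = defaultdict(int)
    for j, i in edges:
        fanout[j].append(i)
        indeg[i] += 1
    level = {n: 0 for n in primary_inputs}
    ready = list(primary_inputs)
    while ready:
        j = ready.pop()
        for i in fanout[j]:
            # depth = one past the deepest source feeding this node
            level[i] = max(level.get(i, 0), level[j] + 1)
            indeg[i] -= 1
            if indeg[i] == 0:
                ready.append(i)
    return level
```

Grouping nodes by this depth yields the levels swept in the bulk-synchronous algorithm.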

  20. An approach to Large-Scale STA • Compute all nodes on a level, • share the results, • move to the next level. • Levelized Bulk Synchronization: • Only arrival time calculations, • Each processor loads the entire circuit, • Online partitioning using round robin.

  21. Levelized Bulk Synchronization Algorithm

    level = 0
    Repeat
        Repeat
            Repeat
                if (local segment)
                    calculate delay/slew at local processor
            Until (all incoming segments have been visited)
            Compute node arrival time and slew
        Until (all assigned nodes at the current level have been processed)
        Repeat
            if (gate segment)
                calculate delay/slew at sink processor
            else if (net segment)
                calculate delay/slew at source processor
            endif
        Until (all remote outgoing segments have been visited)
        Once all processors are complete, advance to the next level
    Until (all levels have been processed)

  22. Our modest 3 million node benchmark circuit • 3 million nodes • 4 million segments • Almost 1000 levels deep. • A mean width of 3000 nodes but a median width of 32 nodes. • Estimated theoretical speedup = 260, computed for 1024 processors. • Levels with more than 1024 nodes were truncated to 1024.
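An estimate like the 260x figure can be derived from the level-width histogram: each level costs ceil(width / P) parallel steps versus width serial steps. This is one plausible reading of the truncation described above, not necessarily the exact computation used.

```python
import math

def theoretical_speedup(level_widths, procs):
    """Upper bound on levelized speedup for a given processor count.

    Levels narrower than `procs` leave processors idle, which is why a
    median width of 32 caps speedup well below the processor count.
    """
    serial = sum(level_widths)
    parallel_steps = sum(math.ceil(w / procs) for w in level_widths)
    return serial / parallel_steps
```

A few wide levels dominated by many narrow ones quickly pulls the bound down.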

  23. Our modest 3 million node benchmark circuit

  24. Timing? Theoretical speedup is 260x on 1024 processors, but including partitioning, speedup drops to 119x.

  25. Removing global dependencies to improve the timing algorithm • Bulk synchronization forces all processors to wait at the end of each level. • Not required by STA. • Solution: remove the global synchronization. • Without it: compute x nodes, send y updates, continue until all nodes are done.

  26. Removing Global Synchronization • 10% improvement with 1024 processors. • Improved partitioning appears to have more potential for impact.

  27. What about partitioning? • Each processor loads the entire circuit. • For each level, for each node on the level: assign node n to task (n % num_cpus). • Build the list of local nodes and segments. • Flexible with respect to the number of cores. • Ignores the structure of the circuit. • Limits the size of the circuit.
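The online round-robin partitioner above amounts to each processor keeping the nodes whose index matches its rank modulo the processor count. A minimal sketch, with node identifiers assumed to be integers:

```python
def my_nodes(levels, rank, num_cpus):
    """Round-robin (n % num_cpus) assignment: every processor walks the
    full levelized node list and keeps only its own share per level."""
    return [[n for n in level if n % num_cpus == rank] for level in levels]
```

The scheme balances node counts per level but, as the slide notes, ignores the circuit's connectivity, so communicating edges land between processors arbitrarily.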

  28. What about partitioning?

  29. Future Work • Examine how large-scale Static Timing Analysis performs on different circuits.

  30. Why synthesize large circuits? • Motivated by a lack of benchmark circuits and the relatively small size of the circuits provided. • Would the algorithm scale to 10,000s of processors with larger circuits? • Solution: generate billion-node timing graphs.

  31. Synthesis of Large Circuits • By Scaling: create histograms of pins by level and of segments by level; multiply levels, pins and segments by R. Creates a timing graph with the same shape (histogram) but more pins. Not a rational circuit. • By Concatenation: make R copies of the circuit and connect all copies to a new global sink. Creates a timing graph that captures the internal intricacies of the original. Does not capture shape; may be disjoint.
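Synthesis by concatenation can be sketched directly: offset the node ids of each copy and tie every copy's sinks to one fresh global sink. Integer node ids and the function name are assumptions for illustration.

```python
def concatenate(edges, sinks, copies):
    """Replicate a timing graph `copies` times and connect all copies'
    sinks to a new global sink (synthesis by concatenation)."""
    nodes = {n for e in edges for n in e}
    size = max(nodes) + 1                  # id range of one copy
    big = [(j + c * size, i + c * size)
           for c in range(copies) for j, i in edges]
    global_sink = copies * size            # one fresh node for all copies
    big += [(s + c * size, global_sink)
            for c in range(copies) for s in sinks]
    return big
```

Without the global sink the copies would be disjoint components, which matches the caveat on the slide.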

  32. By Scaling: 23.6x speedup from 2^9 to 2^14

  33. By Concatenation: 12.6x speedup from 2^10 to 2^14

  34. Large circuits and parallel I/O • File divided into m parts. • Each processor loads one or more parts. • Supports n = m/k processors with one file, where m >= k > 0. • Scales because each processor reads fewer parts as the processor count increases.
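One plausible part-to-processor mapping for the scheme above assigns each processor a contiguous block of the m parts, differing by at most one part in size; the deck does not spell out the exact mapping, so this is an illustrative sketch.

```python
def parts_for(rank, num_procs, num_parts):
    """Contiguous assignment of num_parts file parts to num_procs processors.

    Each processor gets floor(m/n) or ceil(m/n) parts, so per-processor
    I/O shrinks as the processor count grows.
    """
    base, extra = divmod(num_parts, num_procs)
    start = rank * base + min(rank, extra)   # earlier ranks absorb remainders
    count = base + (1 if rank < extra else 0)
    return list(range(start, start + count))
```

Every part is read exactly once across all ranks, with no coordination needed beyond each rank knowing m, n and its own rank.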

  35. Parallel I/O.12.2x speedup from 10^9 to 10^14

  36. Contents • Introduction • Method • Summary

  37. Summary • Scaling appears related to the size of the circuit: static timing analysis scales further on larger circuits. • We do not know how shape and structure affect the performance of large-scale static timing analysis.

  38. • Prototype for a Large-Scale Static Timing Analyzer running on an IBM Blue Gene, Holder, A., Carothers, C. D., and Kalafala, K., 2010, International Workshop on Parallel and Distributed Scientific and Engineering Computing. • The Impact of Irregularity on Efficient Large-Scale Integer-Intensive Computing, Holder, A., PhD Proposal.

  39. Related Work • Hathaway, D.J. and Donath, W.E., 2001. Distributed Static Timing Analysis. • Hathaway, D.J. and Donath, W.E., 2003. Distributed Static Timing Analysis. • Gove, D.J., Mains, R.E. and Chen, G.J., 2009. Multithreaded Static Timing Analysis. • Gulati, K. and Khatri, S.P., 2009. Accelerating statistical static timing analysis using graphics processing units.

  40. Thanks for your time and attention
