
Large-Scale Static Timing Analysis




Presentation Transcript


  1. Large-Scale Static Timing Analysis Akintayo Holder Rensselaer Polytechnic Institute

  2. Contents • Introduction • Method • Summary

  3. Static Timing Analysis • A gate (or other device) requires all input signals to be present at the same time. • “Same time” defined by clock signal. • STA ensures all devices have their expected inputs.

  4. Static Timing Analysis • Static timing analysis (STA) ensures that signals will propagate through a circuit. • Checks that every gate will have valid inputs. • Block oriented • Polynomial • Circuit dictates STA’s behaviour

  5. STA: Arrival Time Calculation • Late-mode static timing analysis was used in this study. • Arrival time (AT) is the latest time a signal can arrive. • AT(i) := max(AT(j) + Delay(S(j,i))), over all incoming segments S(j,i) • AT propagates forward from the inputs.
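The forward AT sweep above can be sketched in Python. The edge-list representation and names here are illustrative, not the deck's actual implementation; delays are taken as plain per-segment numbers.

```python
from collections import defaultdict

def arrival_times(edges, delay, primary_inputs):
    """Forward late-mode AT propagation over a timing DAG.

    edges: list of (src, sink) segments; delay maps each segment to its delay.
    AT(i) = max over incoming segments S(j, i) of AT(j) + Delay(S(j, i)).
    Nodes are visited in topological order, so every fan-in is ready first.
    """
    fanout = defaultdict(list)
    indeg = defaultdict(int)
    for j, i in edges:
        fanout[j].append(i)
        indeg[i] += 1
    at = {n: 0.0 for n in primary_inputs}  # signals launch at t=0
    ready = list(primary_inputs)
    while ready:
        j = ready.pop()
        for i in fanout[j]:
            cand = at[j] + delay[(j, i)]
            if cand > at.get(i, float("-inf")):
                at[i] = cand                # keep the latest arrival
            indeg[i] -= 1
            if indeg[i] == 0:              # all fan-in seen: node is ready
                ready.append(i)
    return at
```

A two-input gate `c` feeding sink `d` illustrates the max: `c` takes the later of its two input arrivals.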

  6. Slack: how STA checks that the circuit is correct • Required arrival time (RAT) – the latest time a signal may arrive and still meet timing. • RAT propagates backward from the output nodes. • For the circuit to work, every node's AT must not exceed its RAT. • Slack = RAT - AT; negative slack flags a violation.
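The backward RAT sweep mirrors the forward one, with min replacing max; slack then falls out per node. This is a minimal sketch in the same toy representation as above, not the deck's implementation, and it assumes every node reaches an output.

```python
from collections import defaultdict

def required_times_and_slack(edges, delay, at, rat_out):
    """Backward RAT propagation, then Slack = RAT - AT.

    rat_out fixes RAT at the primary outputs.
    RAT(j) = min over outgoing segments S(j, i) of RAT(i) - Delay(S(j, i)).
    """
    fanin = defaultdict(list)
    outdeg = defaultdict(int)
    for j, i in edges:
        fanin[i].append(j)
        outdeg[j] += 1
    rat = dict(rat_out)
    ready = list(rat_out)      # primary outputs have no fanout
    while ready:
        i = ready.pop()
        for j in fanin[i]:
            cand = rat[i] - delay[(j, i)]
            if cand < rat.get(j, float("inf")):
                rat[j] = cand              # keep the tightest requirement
            outdeg[j] -= 1
            if outdeg[j] == 0:             # all fanout seen: node is ready
                ready.append(j)
    return {n: rat[n] - at[n] for n in rat}
```

Non-negative slack everywhere means the circuit meets timing.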

  7. Behaviour of STA • Block-oriented STA is polynomial with respect to the size of the circuit. • Running time depends on the circuit size, • but tighter tolerances require more accurate delay estimates. • Multiple process corners, statistical STA and other techniques are common in industrial tools and increase running times.

  8. Why large-scale static timing analysis is hard • Performance depends on the circuit. • Circuit must be divided/partitioned to minimize the cost of communication. • Graph/circuit partitioning is hard.

  9. What is the role of Decomposition? • Divide the computation into smaller tasks. • Execute tasks in parallel. • The smaller the tasks, the more important the decomposition. • A case of Amdahl's Law: the serial fraction left behind limits speedup.
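Amdahl's Law, mentioned above, can be written as a one-line formula: with parallel fraction p and n processors, speedup = 1 / ((1 - p) + p/n).

```python
def amdahl_speedup(parallel_fraction, n):
    """Amdahl's Law: the serial fraction (1 - p) bounds speedup
    no matter how many processors n are applied."""
    p = parallel_fraction
    return 1.0 / ((1.0 - p) + p / n)
```

Even a 95%-parallel workload tops out below 20x, which is why decomposition quality matters so much at scale.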

  10. What is meant by Regularity? • Presence of a pattern or evidence of rules. • Easy to understand and define. • A regular pattern can be used to decompose work. • Irregular: no obvious pattern or apparent rules. • With no clear pattern, decomposition must choose among many "bad" choices.

  11. Where do we get Structural Irregularity from? • Structural Irregularity is derived from unstructured objects. • Unstructured objects are defined by their neighbour relations • Neighbour relations are unique, complex and often irregular.

  12. STA and Structural Irregularity • STA depends on shape of circuit • Circuit is irregular, which causes STA to demonstrate structural irregularity • Irregularity makes decomposition hard and limits ability to scale

  13. Why large-scale static timing analysis is hard • STA exhibits structural irregularity. • Irregularity hinders problem decomposition. • Large-scale systems usually display traits that make efficient decomposition difficult. • Good decomposition is needed for good performance.

  14. Why balance will help performance • Balance refers to the relative performance of the processors, the network and the memory interconnect. • In a balanced machine, a processor can saturate either the network or the memory interconnect. • Balance simplifies decomposition.

  15. The Blue Gene has modest processors and high-performance networks

  16. The IBM Blue Gene/L • Supports MPI as the primary programming interface. • Simple memory management: 512 MB or 1 GB per node; 1 or 2 processors per node; no virtual memory; subset of the Linux API. • Adaptive routing for point-to-point traffic; independent networks for collectives. • Optimized for bandwidth, scalability and efficiency.

  17. More on Balance • Cray XT5: high-density blades with 8 quad-core Opteron processors; SeaStar2+, a low-latency, high-bandwidth interconnect; a hybrid supercomputer using FPGAs and co-processors. • Better performance when adding a processor from a new node; ~45% slowdown when adding a processor on an existing node instead. • Network saturation was not observed in point-to-point experiments. • Run times of global operations varied. (Worley et al., 2009; Snell et al., 2007)

  18. Contents • Introduction • Method • Summary

  19. Understanding the Input • Static timing analysis processes a timing graph, which represents the circuit. • The timing graph is a DAG from input to output pins. • Large-scale STA sorts edges in the DAG by the depth of their source node (levelization).
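Levelization by source depth can be sketched as a topological sweep that records each node's longest distance from the primary inputs, so all of a node's fan-in lies on strictly earlier levels. The representation below is the same illustrative edge list used earlier, not the deck's data structures.

```python
from collections import defaultdict

def levelize(edges, primary_inputs):
    """Assign each node its depth (longest path) from the primary inputs."""
    fanout = defaultdict(list)
    indeg = defaultdict(int)
    for j, i in edges:
        fanout[j].append(i)
        indeg[i] += 1
    level = {n: 0 for n in primary_inputs}
    ready = list(primary_inputs)
    while ready:
        j = ready.pop()
        for i in fanout[j]:
            # depth = one past the deepest source feeding this node
            level[i] = max(level.get(i, 0), level[j] + 1)
            indeg[i] -= 1
            if indeg[i] == 0:
                ready.append(i)
    return level
```

Grouping nodes by this depth yields the levels swept in the bulk-synchronous algorithm.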

  20. An approach to Large-Scale STA • Compute all nodes on a level, • share the results, • move to the next level. • Levelized Bulk Synchronization: • Only arrival time calculations, • Each processor loads the entire circuit, • Online partitioning using round robin.

  21. Levelized Bulk Synchronization Algorithm

    level = 0
    Repeat
        Repeat
            Repeat
                if (local segment)
                    calculate delay/slew at local processor
            Until (all incoming segments have been visited)
            Compute node arrival time and slew
        Until (all assigned nodes at the current level have been processed)
        Repeat
            if (gate segment)
                calculate delay/slew at sink processor
            else if (net segment)
                calculate delay/slew at source processor
            endif
        Until (all remote outgoing segments have been visited)
        Once all processors are complete, advance to the next level
    Until (all levels have been processed)

  22. Our modest 3 million node benchmark circuit • 3 million nodes • 4 million segments • Almost 1000 levels deep. • A mean width of 3000 nodes but a median width of 32 nodes. • Estimated theoretical speedup = 260, computed for 1024 processors. • Levels with more than 1024 nodes were truncated to 1024.
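An estimate like the 260x figure can be derived from the level-width histogram: each level costs ceil(width / P) parallel steps versus width serial steps. This is one plausible reading of the truncation described above, not necessarily the exact computation used.

```python
import math

def theoretical_speedup(level_widths, procs):
    """Upper bound on levelized speedup for a given processor count.

    Levels narrower than `procs` leave processors idle, which is why a
    median width of 32 caps speedup well below the processor count.
    """
    serial = sum(level_widths)
    parallel_steps = sum(math.ceil(w / procs) for w in level_widths)
    return serial / parallel_steps
```

A few wide levels dominated by many narrow ones quickly pulls the bound down.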

  23. Our modest 3 million node benchmark circuit

  24. Timing? Theoretical speedup is 260x on 1024 processors, but including partitioning, speedup drops to 119x.

  25. Removing global dependencies to improve the timing algorithm • Bulk synchronization forces all processors to wait at the end of each level. • Not required by STA. • Solution: remove the global synchronization. • Without it: compute x nodes, send y updates, continue until all nodes are done.

  26. Removing Global Synchronization • 10% improvement with 1024 processors. • Improved partitioning appears to have more potential for impact.

  27. What about partitioning? • Each processor loads the entire circuit. • For each level, for each node on the level: assign node n to task (n % num_cpus). • Build the list of local nodes and segments. • Flexible with respect to the number of cores. • Ignores the structure of the circuit. • Limits the size of the circuit.
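The online round-robin partitioner above amounts to each processor keeping the nodes whose index matches its rank modulo the processor count. A minimal sketch, with node identifiers assumed to be integers:

```python
def my_nodes(levels, rank, num_cpus):
    """Round-robin (n % num_cpus) assignment: every processor walks the
    full levelized node list and keeps only its own share per level."""
    return [[n for n in level if n % num_cpus == rank] for level in levels]
```

The scheme balances node counts per level but, as the slide notes, ignores the circuit's connectivity, so communicating edges land between processors arbitrarily.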

  28. What about partitioning?

  29. Future Work • Examine how large-scale Static Timing Analysis performs on different circuits.

  30. Why synthesize large circuits? • Motivated by a lack of benchmark circuits and the relatively small size of the circuits provided. • Would the algorithm scale to 10,000s of processors with larger circuits? • Solution: generate billion-node timing graphs.

  31. Synthesis of Large Circuits • By Scaling: create histograms of pins by level and of segments by level; multiply levels, pins and segments by R. Creates a timing graph with the same shape (histogram) but more pins. Not a rational circuit. • By Concatenation: make R copies of the circuit and connect all copies to a new global sink. Creates a timing graph that captures the internal intricacies of the original. Does not capture shape; may be disjoint.
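Synthesis by concatenation can be sketched directly: offset the node ids of each copy and tie every copy's sinks to one fresh global sink. Integer node ids and the function name are assumptions for illustration.

```python
def concatenate(edges, sinks, copies):
    """Replicate a timing graph `copies` times and connect all copies'
    sinks to a new global sink (synthesis by concatenation)."""
    nodes = {n for e in edges for n in e}
    size = max(nodes) + 1                  # id range of one copy
    big = [(j + c * size, i + c * size)
           for c in range(copies) for j, i in edges]
    global_sink = copies * size            # one fresh node for all copies
    big += [(s + c * size, global_sink)
            for c in range(copies) for s in sinks]
    return big
```

Without the global sink the copies would be disjoint components, which matches the caveat on the slide.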

  32. By Scaling: 23.6x speedup from 2^9 to 2^14

  33. By Concatenation: 12.6x speedup from 2^10 to 2^14

  34. Large circuits and parallel I/O • File divided into m parts. • Each processor loads one or more parts. • Supports n = m/k processors with one file, where m >= k > 0. • Scales because each processor reads fewer parts as the processor count increases.
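One plausible part-to-processor mapping for the scheme above assigns each processor a contiguous block of the m parts, differing by at most one part in size; the deck does not spell out the exact mapping, so this is an illustrative sketch.

```python
def parts_for(rank, num_procs, num_parts):
    """Contiguous assignment of num_parts file parts to num_procs processors.

    Each processor gets floor(m/n) or ceil(m/n) parts, so per-processor
    I/O shrinks as the processor count grows.
    """
    base, extra = divmod(num_parts, num_procs)
    start = rank * base + min(rank, extra)   # earlier ranks absorb remainders
    count = base + (1 if rank < extra else 0)
    return list(range(start, start + count))
```

Every part is read exactly once across all ranks, with no coordination needed beyond each rank knowing m, n and its own rank.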

  35. Parallel I/O.12.2x speedup from 10^9 to 10^14

  36. Contents • Introduction • Method • Summary

  37. Summary • Scaling appears related to the size of the circuit: static timing analysis scales further on larger circuits. • We do not know how shape and structure affect the performance of large-scale static timing analysis.

  38. • Prototype for a Large-Scale Static Timing Analyzer running on an IBM Blue Gene, Holder, A., Carothers, C. D., and Kalafala, K., 2010, International Workshop on Parallel and Distributed Scientific and Engineering Computing. • The Impact of Irregularity on Efficient Large-Scale Integer-Intensive Computing, Holder, A., PhD Proposal.

  39. Related Work • Hathaway, D.J. and Donath, W.E., 2001. Distributed Static Timing Analysis. • Hathaway, D.J. and Donath, W.E., 2003. Distributed Static Timing Analysis. • Gove, D.J., Mains, R.E. and Chen, G.J., 2009. Multithreaded Static Timing Analysis. • Gulati, K. and Khatri, S.P., 2009. Accelerating statistical static timing analysis using graphics processing units.

  40. Thanks for your time and attention
