This paper discusses how to expedite statistical static timing analysis (SSTA) by leveraging Graphics Processing Units (GPUs). It covers the basics of SSTA, advantages and drawbacks of both static timing analysis and SSTA, the utilization of Monte Carlo methods, comparisons between CPU and GPU, and previous works in the field. The proposed approach involves Monte Carlo-based SSTA on GPUs with specific algorithms for generating gate delay samples and analyzing circuit delays. It explains the advantages of using GPUs for SSTA and highlights the parallel processing capabilities.
Accelerating Statistical Static Timing Analysis Using Graphics Processing Units • Kanupriya Gulati and Sunil P. Khatri • Department of ECE, Texas A&M University, College Station, TX • ASPDAC 2009
Outline • Preliminaries • Previous works • The proposed approach • Experimental results • Conclusions
Preliminaries • Static Timing Analysis • Statistical Static Timing Analysis • Monte Carlo method • Some differences between GPU and CPU
Static Timing Analysis (STA) • At each gate, the output arrival time is the MAX, over all input pins i, of the SUM of the arrival time at pin i and the pin-to-output rising (or falling) delay from pin i to the output. • Gate delays are read from a lookup table (LUT) for each gate type or computed from delay equations. • The worst-case delay is used as the representative value.
STA example • We use a 2-input NAND gate as an example.
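The MAX-of-SUM rule can be sketched in a few lines of Python. This is only an illustration: the function name `gate_arrival` and the arrival and delay values (in ns) are assumptions, not figures from the slides.

```python
def gate_arrival(input_arrivals, pin_delays):
    """Output arrival time: max over pins of (input arrival + pin-to-output delay)."""
    if len(input_arrivals) != len(pin_delays):
        raise ValueError("one pin-to-output delay is required per input pin")
    return max(a + d for a, d in zip(input_arrivals, pin_delays))

# Hypothetical 2-input NAND: per-pin arrival times and pin-to-output delays (ns).
nand_arrivals = [1.0, 1.5]
nand_delays = [0.5, 0.25]

# Pin 0 gives 1.0 + 0.5 = 1.5, pin 1 gives 1.5 + 0.25 = 1.75, so the output arrives at 1.75.
nand_output_arrival = gate_arrival(nand_arrivals, nand_delays)
print(nand_output_arrival)
```

In a real flow the pin delays would come from the LUT or delay equations, with separate values for rising and falling transitions.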
Pros and Cons of STA • Pros • Can be computed very fast. • Results are easy to interpret. • Cons • Not very precise. • Hard to account for process variation. • Moreover, variations are becoming less systematic.
Statistical Static Timing Analysis (SSTA) • Apply probability and statistics to signals, gates, etc. • The basic idea is the same: MAX and SUM. • Requires generating random samples or operating on probability distribution functions (PDFs) directly.
Why SSTA? • To deal with variations and to move beyond the limitations of the deterministic nature of traditional STA techniques. • The main idea is to include the effect of variations in order to analyze circuit delay more accurately.
Pros and Cons of SSTA • Pros • Can account for variations. • High accuracy. • Cons • High runtime cost for accurate methods. • Results may differ significantly between methods.
Monte Carlo method • There is no single Monte Carlo method; instead, the term describes a large and widely-used class of approaches. • However, these approaches tend to follow a particular pattern: • Define a domain of possible inputs • Generate inputs randomly from the domain using a certain specified probability distribution • Perform a deterministic computation using the inputs • Aggregate the results of the individual computations into the final result
A simple example for Monte Carlo method • How can we approximate π? • Draw a square on the ground with a circle inscribed in it. • Scatter objects of uniform size uniformly over the square. • Count the objects inside the circle and divide by the total number of objects in the square; the ratio approximates π / 4.
A simple example for Monte Carlo method (cont.) • Generally speaking • The more objects (samples), the higher the precision. • The smaller the objects (unit of samples), the higher the precision. • The distribution of the objects (distribution function of the samples) affects the result.
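The π experiment maps directly onto the four-step Monte Carlo pattern. The sketch below assumes point-sized objects and a unit circle inscribed in the square [-1, 1] × [-1, 1]; the function name `estimate_pi` and the fixed seed are illustrative choices.

```python
import random

def estimate_pi(num_samples, seed=0):
    """Estimate pi from the fraction of uniform points that land inside the unit circle."""
    rng = random.Random(seed)          # fixed seed so the run is repeatable
    inside = 0
    for _ in range(num_samples):
        x = rng.uniform(-1.0, 1.0)     # random input from the domain (the square)
        y = rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:       # deterministic test: is the point in the circle?
            inside += 1
    # inside / num_samples approximates pi / 4, so scale by 4.
    return 4.0 * inside / num_samples

pi_estimate = estimate_pi(100_000)
print(pi_estimate)
```

Increasing `num_samples` tightens the estimate, at a rate of roughly one over the square root of the sample count.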
Previous works • Block-based SSTA • Performs statistical MAX and SUM operations while traversing the circuit in a level-wise BFS • Fast but less accurate • Path-based SSTA • Calculates the delay PDF of each selected path • Can be accurate, but it is hard to decide which paths to select
Previous works (cont.) • Block-based SSTA approaches such as [14][15][16] are fast but only approximate. • Path-based SSTA approaches such as [17], which propagate Gaussian distributions, are also approximate. • [19][20][21] propose faster algorithms that compute only bounds on the result. • [22][23][24][25] operate on PDFs directly.
The proposed approach • Monte Carlo based SSTA on GPU with the Mersenne Twister pseudo-random number generator and Box-Muller transformations. • Compute gate delays as in the path-based SSTA approach. • Traverse the circuit as in the block-based SSTA approach.
Monte Carlo based SSTA • Generate gate delay samples according to μ and σ. • Do STA for each set of samples. • Aggregate the results to produce the full circuit delay distribution. • The spirit of the Monte Carlo method: the more samples, the higher the precision.
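The three steps can be sketched on the CPU in plain Python. Everything specific here is an assumption for illustration: the three-gate netlist, the μ and σ values, the use of a single delay per gate instead of per-pin rising and falling delays, and the helper names `sta_once` and `monte_carlo_ssta`. It is not the GPU kernel from the paper.

```python
import random
import statistics

# Hypothetical netlist in topological order: gate -> (fan-in names, mu, sigma).
# Names that are not gates ("a", "b", "c") are primary inputs arriving at t = 0.
CIRCUIT = {
    "g1": (["a", "b"], 1.0, 0.1),
    "g2": (["b", "c"], 1.2, 0.1),
    "g3": (["g1", "g2"], 0.8, 0.05),
}
OUTPUTS = ["g3"]

def sta_once(delays):
    """One deterministic STA pass (MAX of SUM) for one set of gate delay samples."""
    arrival = {}
    for gate, (fanin, _mu, _sigma) in CIRCUIT.items():
        latest_input = max(arrival.get(name, 0.0) for name in fanin)
        arrival[gate] = latest_input + delays[gate]
    return max(arrival[out] for out in OUTPUTS)

def monte_carlo_ssta(num_samples, seed=0):
    """Draw gate delays from N(mu, sigma) and run STA once per sample set."""
    rng = random.Random(seed)
    results = []
    for _ in range(num_samples):
        delays = {g: rng.gauss(mu, sigma) for g, (_f, mu, sigma) in CIRCUIT.items()}
        results.append(sta_once(delays))
    return results

# Aggregate the per-sample circuit delays into summary statistics.
samples = monte_carlo_ssta(10_000)
mean_delay = statistics.mean(samples)
sigma_delay = statistics.stdev(samples)
print(mean_delay, sigma_delay)
```

Each iteration of the outer loop is independent of the others, which is what makes the method a natural fit for many parallel threads.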
Why Monte Carlo based SSTA on GPU? • Sample parallelism • The generation of samples and the corresponding static timing analysis for a single gate can be executed in parallel, with no data dependency • Data parallelism • Gates at the same logic level can execute Monte Carlo based SSTA in parallel
Why Monte Carlo based SSTA on GPU? (cont.) • SIMD execution on the GPU • The Mersenne Twister pseudo-random number generator and the Box-Muller transformations that follow it run in parallel • Large memory bandwidth of the GPU • Extremely fast lookups • Many GPU threads • STA with many samples can be executed quickly • Memory access latency can be hidden well
Mersenne Twister pseudo-random number algorithm • Developed in 1997 by Makoto Matsumoto and Takuji Nishimura; based on a matrix linear recurrence over the finite binary field F2. • For a k-bit word length, the Mersenne Twister generates numbers with an almost uniform distribution in the range [0, 2^k - 1]. • Long period, efficient use of memory, good distribution properties and high performance
Box-Muller transformations • A method of generating pairs of independent standard normally distributed (zero expectation, unit variance) random numbers, given a source of uniformly distributed random numbers. • The uniform samples are transformed into N(0,1) samples. • Developed by George Edward Pelham Box and Mervin Edgar Muller in 1958.
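Data parallelism depends on knowing which gates share a logic level. A minimal sketch of that grouping follows, assuming an acyclic netlist given as a fan-in dictionary; the netlist `FANIN` and the function name `levelize` are hypothetical.

```python
def levelize(circuit):
    """Group gates by logic level, where level = 1 + the highest fan-in level."""
    level = {}

    def level_of(name):
        if name not in circuit:        # primary inputs sit at level 0
            return 0
        if name not in level:          # memoize so each gate is resolved once
            level[name] = 1 + max(level_of(f) for f in circuit[name])
        return level[name]

    groups = {}
    for gate in circuit:
        groups.setdefault(level_of(gate), []).append(gate)
    return [groups[k] for k in sorted(groups)]

# Hypothetical netlist: gate -> fan-in names ("a", "b", "c" are primary inputs).
FANIN = {
    "g1": ["a", "b"],
    "g2": ["b", "c"],
    "g3": ["g1", "c"],
    "g4": ["g2", "g3"],
}

levels = levelize(FANIN)
print(levels)  # [['g1', 'g2'], ['g3'], ['g4']]
```

Gates within one inner list have no dependencies on each other, so all of their samples could be evaluated concurrently before moving on to the next level.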
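As a CPU-side illustration of the range property, Python's standard `random` module uses the Mersenne Twister (MT19937) as its core generator. The sketch below uses it with k = 32; the seed value is arbitrary, and this is not the GPU implementation used in the paper.

```python
import random

K = 32                                   # word length in bits
rng = random.Random(2009)                # Mersenne Twister generator, arbitrary seed

# Each draw is an integer in the range [0, 2^K - 1].
words = [rng.getrandbits(K) for _ in range(1000)]

# Scale to [0, 1) so the values can feed a later transformation step.
uniforms = [w / 2**K for w in words]

print(min(words), max(words))
```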
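In its basic form, the transformation maps two independent uniform samples u1, u2 to z0 = sqrt(-2 ln u1) · cos(2π u2) and z1 = sqrt(-2 ln u1) · sin(2π u2). The Python sketch below applies it and then scales the result to a gate delay with mean μ and standard deviation σ. The function names and the μ and σ values are illustrative assumptions.

```python
import math
import random
import statistics

def box_muller(u1, u2):
    """Map two uniform samples (u1 in (0, 1], u2 in [0, 1)) to two N(0, 1) samples."""
    radius = math.sqrt(-2.0 * math.log(u1))
    angle = 2.0 * math.pi * u2
    return radius * math.cos(angle), radius * math.sin(angle)

def gate_delay_samples(mu, sigma, count, seed=0):
    """Generate `count` delay samples distributed as N(mu, sigma)."""
    rng = random.Random(seed)
    samples = []
    while len(samples) < count:
        u1 = 1.0 - rng.random()        # shift [0, 1) to (0, 1] to avoid log(0)
        u2 = rng.random()
        z0, z1 = box_muller(u1, u2)
        samples.extend((mu + sigma * z0, mu + sigma * z1))
    return samples[:count]

# Hypothetical gate with a mean delay of 1.0 and a standard deviation of 0.1.
delays = gate_delay_samples(mu=1.0, sigma=0.1, count=20_000)
print(statistics.mean(delays), statistics.stdev(delays))
```

Each pair of uniforms yields two normal samples, so the transformation wastes none of the generator's output.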
Example • Suppose a random number sequence: • 0.1 -0.2 0.2 -0.2 0.4 0.1 -0.3 0 0.5 0.1 -0.4 0.2 0.3 -0.2 -0.5 0.3 0.1 0
Experimental results • NVIDIA GeForce 8800 GTX graphics card • 768MB memory • Some specifications are listed in previous slides • The CPU environment used for comparison • 3.6GHz CPU with 3GB memory • Linux • Monte Carlo analysis was performed with 64K samples
Experimental results - Some comparisons • Running 16M threads of the SSTA kernel • CPU took 37.158 sec • GPU took 0.115 sec • About 320x faster • Mersenne Twister generator • CPU generates about 2.24*10^7 numbers/sec • GPU generates about 2.33*10^9 numbers/sec • About 100x faster
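The two speedup figures follow from the reported numbers by simple division, as this short check shows (variable names are illustrative):

```python
# SSTA kernel runtimes reported on the slide, in seconds.
cpu_time_s = 37.158
gpu_time_s = 0.115
kernel_speedup = cpu_time_s / gpu_time_s      # roughly 323, quoted as "about 320x"

# Mersenne Twister throughput reported on the slide, in numbers per second.
cpu_rate = 2.24e7
gpu_rate = 2.33e9
rng_speedup = gpu_rate / cpu_rate             # roughly 104, quoted as "about 100x"

print(round(kernel_speedup), round(rng_speedup))
```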
Conclusions • Monte Carlo based SSTA on GPU • Mersenne Twister generator and Box-Muller transformation • Combination of path-based SSTA approach and block-based SSTA approach • No loss of accuracy and ultra fast