
Architecture and Details of a High Quality, Large-Scale Analytical Placer

Andrew B. Kahng, Sherief Reda and Qinke Wang, VLSI CAD Lab, University of California, San Diego, http://vlsicad.ucsd.edu/



Presentation Transcript


  1. Architecture and Details of a High Quality, Large-Scale Analytical Placer Andrew B. Kahng, Sherief Reda and Qinke Wang VLSI CAD Lab University of California, San Diego http://vlsicad.ucsd.edu/ Work partially supported by the MARCO Gigascale Systems Research Center. ABK is currently with Blaze DFM, Inc., Sunnyvale, CA.

  2. Outline • History of APlace • From APlace1.0 to APlace2.0 • Anatomy of APlace2.0 • New techniques in APlace2.0 • Experimental Results • Conclusions and Future Work

  3. History of APlace • Research to study Synopsys patent • Naylor et al., US Patent 6,301,693 (2001) • Extensible foundation: APlace1.0 • Timing-driven placement • Mixed-size placement • Area-I/O placement • ISPD-2005 placement contest → APlace2.0 • Many parts of APlace rewritten • Superior performance

  4. Outline • History of APlace • From APlace1.0 to APlace2.0 • Anatomy of APlace2.0 • New techniques in APlace2.0 • Experimental Results • Conclusions and Future Work

  5. APlace Problem Formulation • Constrained nonlinear optimization: divide the layout area into uniform bins, and minimize HPWL (and other objectives) subject to the total cell area in every bin being equalized • D_g(x, y): density function that equals the total cell area in global bin g • D: average cell area over all global bins
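
In LaTeX form, using the D_g and D defined on this slide (the notation is a reasonable reconstruction, not quoted from the paper), the constrained problem reads:

    \min_{x,y}\ \mathrm{HPWL}(x,y)
    \quad \text{subject to} \quad
    D_g(x,y) = D \quad \text{for every global bin } g

where D_g(x,y) sums the cell area that falls into bin g and D is the total cell area divided by the number of bins.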

  6. Nonlinear Optimization • Smooth approximation of placement objectives: wirelength, density function, etc. • Quadratic Penalty method • Solve a sequence of unconstrained minimization problems for a sequence of µ → 0 • Conjugate Gradient (CG) solver • Useful for finding an unconstrained minimum of a high-dimensional function • Adaptable to large-scale placement problems: memory requirement is linear in problem size
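
A minimal sketch of the quadratic-penalty loop around a CG solver described above, assuming placeholder callables wirelength() and density_penalty() for the smoothed objectives (not APlace's actual code):

    # Quadratic-penalty outer loop around a conjugate-gradient solver (sketch).
    # 'wirelength' and 'density_penalty' are assumed callables returning the
    # smoothed objectives described on these slides.
    import numpy as np
    from scipy.optimize import minimize

    def solve_placement(x0, wirelength, density_penalty,
                        mu0=1.0, shrink=0.5, rounds=10):
        x, mu = np.asarray(x0, dtype=float), mu0
        for _ in range(rounds):
            # Penalized objective: WL(x) + (1/mu) * density penalty
            obj = lambda v, m=mu: wirelength(v) + density_penalty(v) / m
            x = minimize(obj, x, method="CG").x   # memory grows linearly with |x|
            mu *= shrink                          # drive mu toward 0
        return x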

  7. Wirelength Approximation • Half-Perimeter Wirelength (HPWL) • Half-perimeter of a net's bounding box • Simple, closely tracks routing congestion • Neither strictly convex nor everywhere differentiable • Log-Sum-Exp approximation • Naylor et al., US Patent 6,301,693 (2001) • Precise: approaches HPWL as α → 0 • Strictly convex, continuously differentiable
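
The log-sum-exp approximation referenced above is usually written per net as follows (x-direction shown; y is analogous); this is the standard form from the patent and the literature, not copied from the slide:

    W_{\mathrm{LSE}} = \alpha \left( \ln \sum_{i \in \text{net}} e^{x_i/\alpha}
                       + \ln \sum_{i \in \text{net}} e^{-x_i/\alpha} \right)

which is strictly convex, continuously differentiable, and approaches max_i x_i - min_i x_i (the HPWL span) as α → 0.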

  8. α: Smoothing Parameter • "Significance criterion" for choosing nets with large wirelength to minimize • Larger gradients for longer nets • Minimize long nets more efficiently than short nets • Two-pin net example: partial gradient for x1 • close to 0 when the net length |x1 - x2| is small compared to α • close to 1 or -1 otherwise
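
For the two-pin case above, differentiating the log-sum-exp wirelength gives (a worked step consistent with the slide, not quoted from it):

    \frac{\partial W}{\partial x_1}
      = \frac{e^{x_1/\alpha}}{e^{x_1/\alpha} + e^{x_2/\alpha}}
      - \frac{e^{-x_1/\alpha}}{e^{-x_1/\alpha} + e^{-x_2/\alpha}}

which is close to 0 when |x_1 - x_2| is small compared to α, and tends to +1 or -1 when the net is long relative to α.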

  9. Area Potential Function • Overlap area = product of the overlaps between a cell and a grid bin along the x and y directions • reduces to a 0/1 function when cell size is ignored • Area potential function: defines an "area potential" exerted by a cell on nearby grid bins • smooth bell-shaped function for standard cells [Naylor et al., US Patent 6,301,693 (2001)]

  10. Module Area Potential Function • Mixed-size placement: decide the scope of the area potential based on the module's dimensions • p(d): potential function • d: distance from the module to the grid bin • radius r = w/2 + 2w_g for a block with width w (w_g = grid bin width) • convex curve p(d) = 1 - a·d² for d < w/2 + w_g • concave curve p(d) = b·(r - d)² for w/2 + w_g < d < w/2 + 2w_g • smooth at d = w/2 + w_g
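
Writing the piecewise potential out and solving the continuity and smoothness conditions at d = w/2 + w_g for the constants (a derivation from the slide's piecewise form, not quoted from the paper):

    p(d) =
    \begin{cases}
      1 - a\,d^2,    & 0 \le d \le w/2 + w_g \\
      b\,(r - d)^2,  & w/2 + w_g \le d \le r = w/2 + 2w_g
    \end{cases}

    \text{Matching } p \text{ and } p' \text{ at } d = w/2 + w_g:\quad
    a = \frac{4}{(w + 2w_g)(w + 4w_g)}, \qquad
    b = \frac{2}{w_g\,(w + 4w_g)}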

  11. Changes: APlace1.0 → APlace2.0 • Strong scalability from new clustering algorithm • Dynamic adjustment of weights for wirelength and overlap penalty during global placement • Improvements to legalization, detailed placement • whitespace compaction • cell reordering algorithms • global greedy cell movement • APlace2.0 vs. APlace1.0: up to 19% WL reduction, 1.5-2x speedup

  12. IBM BigBlue4 Placement 2.1M instances, HPWL = 833.21, CPU = 23h

  13. Outline • History of APlace • From APlace1.0 to APlace2.0 • Anatomy of APlace2.0 • New techniques in APlace2.0 • Experimental Results • Conclusions and Future Work

  14. Anatomy of APlace 2.0 (flow) • Global Phase: Clustering → Adaptive APlace engine → Unclustering • Legalization • Detailed Phase: WS arrangement → Cell order polishing → Global moving

  15. New Feature 1: Multi-Level Clustering • Objective: cluster to reduce runtime and allow scalable implementation with no compromise to quality • Multi-level approach using best-choice clustering (ISPD'05) • Clustering ratio ≈ 10 (reduce netlist size by ~10x per level) • #Top-level clusters ≈ 2000 • Wirelength calculation: assume modules are located at their cluster centers; only consider inter-cluster parts of nets • Flow: cluster the netlist until its size is ~2000, run global placement, then uncluster one level and repeat until the netlist is flat, and finally legalize

  16. Best-Choice Clustering • Each clustering level uses the best-choice heuristic with lazy updates and tight area control • For each clustering level: • Calculate the clustering score of each node with respect to its neighbors, based on the number of connections and the areas • Sort all nodes by their best scores using a heap • Until the target clustering ratio is reached: • If the top node of the heap is "valid", cluster it with its closest neighbor; calculate the clustering score of the new node and insert it into the heap; update the netlist and mark all neighbors of the new node as invalid • Else recalculate the top node's score, reinsert it into the heap, and continue
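
A compact sketch of the lazy-update loop described above; score(), closest(), merge() and the .neighbors attribute are assumed placeholder interfaces, not APlace's data structures:

    # Best-choice clustering with lazy updates (sketch, placeholder interfaces).
    import heapq

    def best_choice_cluster(nodes, score, closest, merge, target_count):
        heap = [(-score(n, closest(n)), id(n), n) for n in nodes]
        heapq.heapify(heap)
        stale, removed = set(), set()
        count = len(nodes)
        while count > target_count and heap:
            _, _, n = heapq.heappop(heap)
            if n in removed:
                continue
            if n in stale:
                # Lazy update: recompute the stale score and reinsert,
                # instead of eagerly updating the heap on every merge.
                stale.discard(n)
                heapq.heappush(heap, (-score(n, closest(n)), id(n), n))
                continue
            c = closest(n)
            new = merge(n, c)                  # cluster n with its closest neighbor
            removed.update({n, c})
            count -= 1
            stale.update(new.neighbors)        # neighbors' cached scores are now stale
            heapq.heappush(heap, (-score(new, closest(new)), id(new), new))
        return count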

  17. Two Clustering Concerns • Mark boundaries of the clustering hierarchy at each clustering level → allows exact reversal of clustering during unclustering • Meet the target number of objects by avoiding "saturation" • bypass small fixed objects during clustering

  18. Multiple Levels of Grids • Adaptive grid size based on average cluster size • Better global optimization • use solution of placement problem constrained with coarser grids as initial solution for problem constrained with finer grids • Better scalability • larger grid size spreads modules faster • Different levels of relaxation for density constraints • According to grid size

  19. New Feature 2: Adaptive WL Weight • Important to QOR • Initial weight value chosen for each cluster level and grid level • Based on wirelength and density partial derivatives • Goal: magnitudes of the gradients are roughly equal • Decrease the WL weight by half whenever the CG solver obtains a stable solution
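
One reasonable reading of the weight initialization above, sketched with assumed gradient vectors (not the paper's exact formula): pick the wirelength weight so the two gradient magnitudes balance, then halve it whenever CG converges for the current weight.

    # Balance wirelength vs. density gradient magnitudes (sketch; the exact
    # formula used by APlace may differ).
    import numpy as np

    def initial_wl_weight(grad_wl, grad_density):
        # With an objective of the form  w * WL + density_penalty,
        # equal gradient magnitudes suggest  w ≈ |grad_density| / |grad_wl|.
        return np.linalg.norm(grad_density, 1) / max(np.linalg.norm(grad_wl, 1), 1e-12)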

  20. New Feature 3: Legalization and Detailed Placement • Variant of greedy legalization algorithm (Hill'01): • (1) Sort all cells from left to right: move each cell in order to the closest legal position • (2) Sort all cells from right to left: move each cell in order to the closest legal position • (3) Pick the better of (1) and (2) • Detailed Placement Components: • Global cell movement (Goto81, KenningsM98 BoxPlace, FP…) • Whitespace compaction (KahngTZ'99, KahngMR'04) • Cell order polishing (similar to rowIroning, FS detailed placer) • Intra-row cell reordering • Inter-row cell reordering
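
A minimal sketch of the two-pass greedy legalization above; closest_legal() and total_hpwl() are assumed helpers standing in for site/row snapping and wirelength evaluation:

    # Two-pass greedy legalization (sketch, placeholder helpers).
    def greedy_legalize(cells, closest_legal, total_hpwl):
        def one_pass(order):
            placed = {}
            for c in order:
                # Move each cell, in order, to the closest legal position
                # given what has already been placed.
                placed[c] = closest_legal(c, placed)
            return placed
        pass1 = one_pass(sorted(cells, key=lambda c: c.x))                 # left to right
        pass2 = one_pass(sorted(cells, key=lambda c: c.x, reverse=True))   # right to left
        return min((pass1, pass2), key=total_hpwl)                         # keep the better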

  21. Global Moving • Move a cell to an "optimal" location among the available whitespace • improves quality when utilization is low • Two steps • search for an available location in the optimal region of the cell's placement • search for an available location in the "best" bin • divide the placement area into uniform bins • choose the "best" bin according to the available whitespace and the cost of moving the cell to the bin center • model the whitespace as normally distributed with an estimated width, and estimate whether an available location exists

  22. WhiteSpace (WS) Compaction • Build a chain of nodes per cell between a row start node and an end node: each chain represents the possible placement sites for that cell • The cost on each arc is the change in HPWL from moving the cell to that site • The order of the chains corresponds to the left-to-right order of the cells in the row • A shortest path from source to sink gives the best way to compact the whitespace
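
The shortest-path view above can be computed with a simple prefix-minimum DP; this sketch treats each cell as occupying a single site and uses a placeholder delta_hpwl(cell, site) cost:

    # Whitespace-compaction DP (sketch): cells keep their left-to-right order,
    # each picks a site to the right of its predecessor, minimizing total HPWL change.
    import math

    def compact_row(cells, sites, delta_hpwl):
        m = len(sites)
        # best[j]: min cost with the last placed cell at site j (1-based);
        # best[0] is the virtual start node.
        best = [0.0] * (m + 1)
        for c in cells:
            new = [math.inf] * (m + 1)
            prefix_min = math.inf
            for j in range(1, m + 1):
                prefix_min = min(prefix_min, best[j - 1])   # predecessor strictly to the left
                new[j] = prefix_min + delta_hpwl(c, sites[j - 1])
            best = new
        return min(best[1:])   # cost of the shortest path to the end node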

  23. Cell Order Polishing • Permute a small window of neighboring cells in order to improve wirelength • MetaPlacer’s rowIroning: up to 15 cells in one row assuming equal whitespace distribution • FengShui's cell ordering: six objects in one or more rows regarding whitespace as pseudo cells • Branch-and-bound algorithm • four nearby cells in one or multiple rows • consider optimal placement for each permutation • more accurate, overlap-free permutations and no cell shifting

  24. Single-Row Cell Ordering • Cost of placing the first j cells of a permutation • cost = wirelength increase when placing a cell • ΔWL ≠ 0 only if the cell is the leftmost or rightmost of its net • remaining cells are placed to the right of the first j cells • cost is unrelated to the order or placement of the remaining cells • B&B algorithm • construct permutations in lexicographic order • the next permutation shares a prefix with the previous one • beginning rows of the DP table can be reused where possible • cut a branch when the minimum cost of placing the first j cells exceeds the best cost so far
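
A sketch of the branch-and-bound enumeration above; prefix_cost(prefix, cell) is an assumed non-negative incremental cost of appending a cell to the placed prefix (remaining cells pushed to the right), not APlace's implementation:

    # Branch-and-bound over window permutations (sketch).
    # The bound is valid because prefix_cost is assumed non-negative.
    import math

    def best_order(cells, prefix_cost):
        best = {"cost": math.inf, "order": None}

        def branch(prefix, remaining, cost):
            if cost >= best["cost"]:          # bound: this prefix already loses
                return
            if not remaining:
                best["cost"], best["order"] = cost, prefix
                return
            for i, c in enumerate(remaining): # extend prefixes in lexicographic order
                branch(prefix + [c],
                       remaining[:i] + remaining[i + 1:],
                       cost + prefix_cost(prefix, c))

        branch([], list(cells), 0.0)
        return best["order"], best["cost"]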

  25. Two- or Three-Row Cell Ordering • DP algorithm • decide how many cells are assigned to each row, from top to bottom • construct a permutation in lexicographic order • find the "optimal" placement within the window • Y-cost of placing the first j cells: accurate • remaining cells are placed below the first j cells • X-cost of placing the first j cells: inaccurate when a net connects placed and unplaced cells • results show the method is still effective with a small set of cells and a small window

  26. Outline • Introduction • Clustering • Global Placement • Detailed Placement • Experimental Results • IBM ISPD04 • IBM-PLACE v2 • IBM ICCAD04 • IBM ISPD05 • Conclusions and Future Work

  27. IBM ISPD04 • Test basic placer performance with standard cells • 3% better than the next-best placer, mPL5 (ISPD'05)

  28. IBM-PLACE v2 • Test placer performance in the presence of whitespace, and routability • 12% better than mPL-R+WSA (ICCAD'04)

  29. IBM ICCAD04 • Test placer performance with cells and blocks (floorplacement) • 14% and 19% better than FS and Capo, respectively

  30. IBM ISPD05 • Test placer performance with cells and movable/fixed blocks • 6% better than the best other placer (mFAR)

  31. APlace2.0 Conclusions • 60 days + clean sheet of paper + Qinke Wang + Sherief Reda • Scalable implementation • State-of-the-art clustering and global placement engines • Improved detailed placement engine • Better than best published results by • 3% ISPD’04 suite • 14% ICCAD’04 • 12% IBMPLACE V.2 • 6% ISPD’05 Placement Contest • Recent Applications (other than restoring functionality) • IR-drop driven placement (ICCD-2005 Best Paper) • Lens aberration-aware placement (DATE-2006) • Toward APlace3.0: ?

  32. Thank You • Questions?

  33. Goals and Plan Goals: • Build a new placer to win the competition • Scalable, robust, high-quality implementation • Leave no stone unturned / QOR on the table Plan and Schedule: • Work within most promising framework: APlace • 30 days for coding + 30 days for tuning

  34. Philosophy Respect the competition • Well-funded groups with decades of experience • ABKGroup’s Capo, MLPart, APlace = all unfunded side projects • No placement-related industry interactions • QOR target: 24-26% better than Capo v9r6 on all known benchmarks • Nearly pulled out 10 days before competition Work smart • Solve scalability and speed basics first • Slimmed-down data structure, -msse compiler options, etc. • Ordered list of ~15 QOR ideas to implement • Daily regressions on all known benchmarks • Synthetic testcases to predict bb3, bb4, etc.

  35. Implementation Framework • APlace weaknesses: • Weak clustering • Poor legalization / detailed placement • New APlace: • New clustering • Adaptive parameter setting for scalability • New legalization + iterative detailed placement • New APlace flow (as on slide 14): Clustering → Adaptive APlace engine → Unclustering (Global Phase); Legalization; WS arrangement → Cell order polishing → Global moving (Detailed Phase)

  36. Parameterization and Parallelizing Tuning Knobs: • Clustering ratio, # top-level clusters, cluster area constraints • Initial wirelength weight, wirelength weight reduction ratio • Max # CG iterations for each wirelength weight • Target placement discrepancy • Detailed placement parameters, etc. Resources: • SDSC ROCKS Cluster: 8 Xeon CPUs at 2.8GHz • Michigan Prof. Sylvester's Group: 8 various CPUs • UCSD FWGrid: 60 Opteron CPUs at 1.6GHz • UCSD VLSICAD Group: 8 Xeon CPUs at 2.4GHz Wirelength Improvement after Tuning: 2-3%

  37. Artificial Benchmark Synthesis • Synthetic benchmarks to test code scalability and performance • Rapid response to broadcast of s00-nam.pdf • Created synthetic versions of bigblue3 and bigblue4 within 48 hours • Mimicked fixed-block layout diagrams in the artificial benchmark creation • This process was useful: we identified (and solved) a problem with clustering in the presence of many small fixed blocks

  38. Results

  39. Conclusions • ISPD05 = an exercise in process and philosophy • At the end, we were still 4% short of where we wanted to be • Not happy with how we handled the 5-day time frame • Auto-tuning → first results ~ best results • During the competition, wrote but then left out "annealing" DP improvements that gained another 0.5% • Students and IBM ARL did a really, really great job • Currently restoring capabilities (congestion, timing-driven, etc.) and cleaning (antecedents in Naylor patent)
