
Scalable Solvers and Software for PDE Applications


Presentation Transcript


  1. Scalable Solvers and Software for PDE Applications David E. Keyes, Department of Applied Physics & Applied Mathematics, Columbia University, and Institute for Scientific Computing Research, Lawrence Livermore National Laboratory www.tops-scidac.org

  2. Motivation • Solver performance is a major concern for parallel simulations based on PDE formulations … including many of those of the U.S. DOE Scientific Discovery through Advanced Computing (SciDAC) program • For target applications, implicit solvers may require 50% to 95% of execution time … at least, before “expert” overhaul for algorithmic optimality and implementation performance • Even after a “best manual practice” overhaul, the solver may still require 20% to 50% of execution time • The solver may hit up against both the processor scalability limit and the memory bandwidth limitation of a PDE-based application, before any other part of the code … the first of these is not fundamental, though the second one is

  3. Presentation plan • Overview of the SciDAC initiative • Overview of the Terascale Optimal PDE Simulations project (TOPS) • Brief review of scalable implicit methods (domain decomposed multilevel iterative methods) • algorithms • software components: PETSc, Hypre, etc. • Three “war stories” from the SciDAC magnetically confined fusion energy portfolio • Some advanced research directions • physics-based preconditioning • nonlinear Schwarz • On the horizon

  4. SciDAC apps and infrastructure • 18 projects in scientific software and network infrastructure • 14 projects in biological and environmental research • 10 projects in basic energy sciences • 5 projects in fusion energy science • 4 projects in high energy and nuclear physics

  5. “Enabling technologies” groups to develop reusable software and partner with application groups • From 2001 start-up, 51 projects share $57M/year • Approximately one-third for applications • A third for “integrated software infrastructure centers” • A third for grid infrastructure and collaboratories • Plus, multi-Tflop/s IBM SP machines at NERSC and ORNL available for SciDAC researchers

  6. Unclassified resources for DOE science • IBM Power3+ SMP • 16 procs per node • 6656 procs total • 10 Tflop/s • “Seaborg” Berkeley • IBM Power4 Regatta • 32 procs per node • 864 procs total • 4.5 Tflop/s • “Cheetah” Oak Ridge

  7. Designing a simulation code (from 2001 SciDAC report) • Diagram showing the performance loop and the V&V (verification and validation) loop

  8. A “perfect storm” for simulation (dates are symbolic) • Applications draw together: scientific models (1686), numerical algorithms (1947), computer architecture (1976), scientific software engineering, and hardware infrastructure (architectures) • “Computational science is undergoing a phase transition.” – D. Hitchcock, DOE

  9. Imperative: multiple-scale applications • Multiple spatial scales • interfaces, fronts, layers • thin relative to the domain size L • Multiple temporal scales • fast waves • small transit times relative to convection or diffusion times • Analyst must isolate dynamics of interest and model the rest in a system that can be discretized over a more modest range of scales • May lead to infinitely “stiff” subsystem requiring special treatment by the solution method • Richtmyer-Meshkov instability, c/o A. Mirin, LLNL

  10. Examples: multiple-scale applications • Biopolymers, nanotechnology • 10^12 range in time, from 10^-15 sec (quantum fluctuation) to 10^-3 sec (molecular folding time) • typical computational model ignores smallest scales, works on classical dynamics only, but scientists increasingly want both • Galaxy formation • 10^20 range in space from binary star interactions to diameter of universe • heroic computational model handles all scales with localized adaptive meshing • Supernova simulation • massive ranges in time and space scales for radiation, turbulent convection, diffusion, chemical reaction, nuclear reaction (image: supernova simulation, c/o A. Mezzacappa, ORNL)

  11. SciDAC portfolio characteristics • Multiple temporal scales • Multiple spatial scales • Linear ill conditioning • Complex geometry and severe anisotropy • Coupled physics, with essential nonlinearities • Ambition for uncertainty quantification, parameter estimation, and design Need toolkit of portable, extensible, tunable implicit solvers, not “one-size fits all”

  12. TOPS starting point codes • PETSc (ANL) • Hypre (LLNL) • Sundials (LLNL) • SuperLU (LBNL) • PARPACK (LBNL*) • TAO (ANL) • Veltisto (CMU) • Many interoperability connections between these packages that predated SciDAC • Many application collaborators that predated SciDAC

  13. TOPS participants • Labs (3) and universities (7): CU, CMU, ANL, NYU, UC-B/LBNL, CU-B, LLNL, ODU, UT-K

  14. In the old days, see the “Templates” guides … • www.netlib.org/templates (124 pp.) • www.netlib.org/etemplates (410 pp.) … these are good starts, but not adequate for SciDAC scales!

  15. “Integrated software infrastructure centers” • 7 ISIC groups (4 CS, 3 Math) cover adaptive gridding, discretization, solvers (TOPS), systems software, component architecture, performance engineering, and data management • They partner with 34 applications groups and 10 grid/data collaboratory groups through software integration and performance optimization

  16. Keyword: “Optimal” • Convergence rate nearly independent of discretization parameters • multilevel schemes for rapid linear convergence of linear problems • Newton-like schemes for quadratic convergence of nonlinear problems • Convergence rate as independent as possible of physical parameters • continuation schemes • physics-based preconditioning • Optimal convergence plus scalable loop body yields scalable solver • [Plot: time to solution vs. problem size (increasing with number of processors), contrasting an “unscalable” solver whose time grows with size against a “scalable” one whose time stays roughly flat]

  17. But where to go past O(N)? • Since O(N) is already optimal, there is nowhere further “upward” to go in efficiency, but one must extend optimality “outward,” to more general problems • Hence, for instance, algebraic multigrid (AMG) seeks to obtain O(N) in indefinite, anisotropic, or inhomogeneous problems on irregular grids • AMG framework: distinguish error easily damped by pointwise relaxation from “algebraically smooth” error; choose coarse grids, transfer operators, and smoothers to eliminate these “bad” components within a smaller-dimensional space, and recur

  18. Toolchain for PDE solvers in TOPS project • Design and implementation of “solvers”: time integrators (w/ sens. anal.), nonlinear solvers (w/ sens. anal.), constrained optimizers, linear solvers, eigensolvers • Software integration • Performance optimization • [Diagram: dependences among optimizer, sensitivity analyzer, time integrator, nonlinear solver, eigensolver, and linear solver]

  19. Dominant data structures are grid-based • Finite volumes, finite differences, finite elements: node i of the grid corresponds to row i of the Jacobian J • All lead to problems with sparse Jacobian matrices; many tasks can leverage an efficient set of tools for manipulating distributed sparse data structures
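
As a small illustration of this point (not from the original slides), a Python/SciPy sketch that assembles the sparse Jacobian of a 1D three-point finite-difference stencil, where grid node i corresponds to matrix row i; the grid size and scaling are illustrative assumptions:

   # Illustrative sketch: sparse Jacobian of a 1D three-point stencil;
   # node i of the grid corresponds to row i of the matrix.
   import numpy as np
   import scipy.sparse as sp

   n = 8                               # interior grid nodes (assumed)
   h = 1.0 / (n + 1)                   # mesh spacing
   main = -2.0 * np.ones(n) / h**2
   off = np.ones(n - 1) / h**2

   # Each row holds only the few nonzeros coupling a node to its neighbors,
   # so the Jacobian is naturally stored in a compressed sparse row format.
   J = sp.diags([off, main, off], offsets=[-1, 0, 1], format="csr")
   print(J.nnz, J.shape)               # 3n - 2 nonzeros in an n-by-n matrix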

  20. Newton-Krylov-Schwarz: a PDE applications “workhorse” • Newton: nonlinear solver, asymptotically quadratic • Krylov: accelerator, spectrally adaptive • Schwarz: preconditioner, parallelizable

  21. SPMD parallelism w/domain decomposition • Partitioning of the grid induces block structure on the Jacobian; e.g., the rows assigned to processor 2 hold the blocks A21, A22, A23

  22. Time-implicit Newton-Krylov-Schwarz • For accommodation of unsteady problems, and nonlinear robustness in steady ones, the NKS iteration is wrapped in time-stepping (a pseudo-time loop around the NKS loop):

   for (l = 0; l < n_time; l++) {
     select time step
     for (k = 0; k < n_Newton; k++) {
       compute nonlinear residual and Jacobian
       for (j = 0; j < n_Krylov; j++) {
         forall (i = 0; i < n_Precon; i++) {
           solve subdomain problems concurrently
         } // End of loop over subdomains
         perform Jacobian-vector product
         enforce Krylov basis conditions
         update optimal coefficients
         check linear convergence
       } // End of linear solver
       perform DAXPY update
       check nonlinear convergence
     } // End of nonlinear loop
   } // End of time-step loop
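
A hedged, runnable illustration of the same loop structure (not the production NKS implementation; the model problem, grid, and parameters below are assumptions) using SciPy's Jacobian-free Newton-Krylov solver inside a pseudo-time loop:

   # Illustrative only: backward-Euler pseudo-time stepping with a
   # Jacobian-free Newton-Krylov solve at each step (SciPy newton_krylov).
   import numpy as np
   from scipy.optimize import newton_krylov

   n, dt, n_time = 50, 1e-3, 10
   x = np.linspace(0.0, 1.0, n)
   u = np.sin(np.pi * x)                        # assumed initial condition

   def laplacian(v):
       h = x[1] - x[0]
       lap = np.zeros_like(v)
       lap[1:-1] = (v[2:] - 2.0 * v[1:-1] + v[:-2]) / h**2
       return lap

   for step in range(n_time):                   # time-step (outer) loop
       u_old = u.copy()

       def residual(u_new):                     # nonlinear residual for backward Euler
           r = u_new - u_old - dt * (laplacian(u_new) + u_new**2)
           r[0], r[-1] = u_new[0], u_new[-1]    # Dirichlet boundary conditions
           return r

       # Newton outer iteration, Krylov (LGMRES) inner iteration, with
       # finite-difference Jacobian-vector products: the JFNK combination.
       u = newton_krylov(residual, u_old, method="lgmres", f_tol=1e-8)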

  23. (N)KS kernel in parallel • In each Krylov iteration, processors P1, P2, …, Pn execute the phases: local scatter, Jacobian-vector multiply, preconditioner sweep, DAXPY, inner product • The bulk synchronous model leads to easy scalability analyses and projections; each phase can be considered separately • What happens if, for instance, in this (schematized) iteration, arithmetic speed is doubled, scalar all-gather is quartered, and local scatter is cut by one-third?

  24. Estimating scalability of stencil computations • Given complexity estimates of the leading terms of: • the concurrent computation (per iteration phase) • the concurrent communication • the synchronization frequency • And a bulk synchronous model of the architecture including: • internode communication (network topology and protocol reflecting horizontal memory structure) • on-node computation (effective performance parameters including vertical memory structure) • One can estimate optimal concurrency and optimal execution time • on per-iteration basis, or overall (by taking into account any granularity-dependent convergence rate), based on problem size N and concurrency P • simply differentiate time estimate in terms of (N,P) with respect to P, equate to zero and solve for P in terms of N
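
As an illustrative instance of this recipe (the cost model below is an assumption chosen to match the logarithmic-reduction case on the next slide, not a formula taken from the slides), consider a 3D stencil computation whose per-iteration time on P processors is modeled as

   T(N,P) \;\approx\; a\,\frac{N}{P} \;+\; b\left(\frac{N}{P}\right)^{2/3} \;+\; c\,\ln P ,

with the three terms representing local computation, nearest-neighbor (surface) communication, and a tree-based global reduction. Setting \partial T/\partial P = 0 gives

   a\,\frac{N}{P^{2}} \;+\; \frac{2b}{3}\,\frac{N^{2/3}}{P^{5/3}} \;=\; \frac{c}{P} ,

and multiplying through by P^{2}/N shows that the optimality condition depends only on the ratio P/N; hence P_{\mathrm{opt}} \propto N, i.e., the optimal processor count grows linearly with problem size, as stated on the next slide.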

  25. Scalability results for DD stencil computations • With tree-based (logarithmic) global reductions and scalable nearest neighbor hardware: • optimal number of processors scales linearly with problem size • With 3D torus-based global reductions and scalable nearest neighbor hardware: • optimal number of processors scales as three-fourths power of problem size (almost “scalable”) • With common network bus (heavy contention): • optimal number of processors scales as one-fourth power of problem size (not “scalable”) • bad news for conventional Beowulf clusters, but see 2000 Bell Prize “price-performance awards”, for multiple NICs

  26. NKS efficiently implemented in PETSc’s MPI-based distributed data structures • [Diagram: main routine calling PETSc timestepping solvers (TS), nonlinear solvers (SNES), and linear solvers (SLES, comprising PC and KSP); user code supplies application initialization, function evaluation, Jacobian evaluation, and post-processing]

  27. User code/PETSc library interactions • Same structure as the previous diagram; the user-supplied function and Jacobian evaluation routines can be AD (automatic differentiation) code

  28. 1999 Bell Prize for unstructured grid computational aerodynamics • Transonic “Lambda” shock, Mach contours on surfaces; mesh c/o D. Mavriplis, ICASE • Implemented in PETSc: www.mcs.anl.gov/petsc

  29. Fixed-size parallel scaling results (c/o K. Anderson, W. Gropp, D. Kaushik, D. Keyes and B. Smith) • 11M unknowns; 128 nodes: 43 min; 3072 nodes: 2.5 min, 226 Gflop/s, 70% efficient • Four orders of magnitude in 13 years

  30. Three “war stories” from magnetic fusion energy applications in SciDAC • Physical models based on fluid-like magnetohydrodynamics (MHD)

  31. Challenges in magnetic fusion • Conditions of interest possess two properties that pose great challenges to numerical approaches—anisotropy and stiffness. • Anisotropy produces subtle balances of large forces, and vastly different parallel and perpendicular transport properties. • Stiffness reflects the vast range of time-scales in the system: targeted physics is slow (~transport scale) compared to waves

  32. Tokamak/stellarator simulations • Center for Extended MHD Modeling (based at Princeton Plasma Physics Lab) M3D code • Realistic toroidal geom., unstructured mesh, hybrid FE/FD discretization • Fields expanded in scalar potentials and streamfunctions • Operator-split, linearized, w/11 potential solves in each poloidal cross-plane/step (90% exe. time) • Parallelized w/PETSc (Tang et al., SIAM PP01, Chen et al., SIAM AN02, Jardin et al., SIAM CSE03) • Want from TOPS: • Now: scalable linear implicit solver for much higher resolution (and for AMR) • Later: fully nonlinearly implicit solvers and coupling to other codes

  33. Provided new solvers across existing interfaces • Hypre in PETSc • codes with a PETSc interface (like CEMM’s M3D) can now invoke Hypre routines as solvers or preconditioners with a command-line switch (see the illustrative options below) • SuperLU_DIST in PETSc • as above, with SuperLU_DIST • Hypre in AMR Chombo code • so far, Hypre is level-solver only; its AMG will ultimately be useful as a bottom-solver, since it can be coarsened indefinitely without attention to loss of nested geometric structure; also FAC is being developed for AMR uses, like Chombo
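
As an illustrative example of such a runtime switch (option names here come from general PETSc/Hypre documentation knowledge, not from the slides, and may differ across PETSc versions): a PETSc-based code can select Hypre's BoomerAMG as the preconditioner with options such as -ksp_type gmres -pc_type hypre -pc_hypre_type boomeramg, and, in recent PETSc releases, a sparse direct solve through SuperLU_DIST with -pc_type lu -pc_factor_mat_solver_type superlu_dist, with no changes to the application source.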

  34. Hypre: multilevel preconditioning (a multigrid V-cycle) • Smoother applied on the finest grid • Restriction: transfer from fine to coarse grid; the coarser grid has fewer cells (less work & storage) • Prolongation: transfer from coarse to fine grid • Recursively apply this idea until we have an easy problem to solve on the coarsest grid
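
A minimal Python sketch of the V-cycle idea just described, for a 1D Poisson problem with a weighted-Jacobi smoother, full-weighting restriction, and linear-interpolation prolongation (a geometric illustration only; Hypre's AMG chooses coarse grids and transfer operators algebraically, and the problem and parameters below are assumptions):

   # Illustrative geometric V-cycle for -u'' = f on [0,1], Dirichlet BCs,
   # n = 2**k + 1 grid points. Not Hypre code.
   import numpy as np

   def residual(u, f, h):
       r = np.zeros_like(u)
       r[1:-1] = f[1:-1] - (-u[:-2] + 2.0 * u[1:-1] - u[2:]) / h**2
       return r

   def smooth(u, f, h, nsweeps=3, w=2.0 / 3.0):      # weighted-Jacobi smoother
       for _ in range(nsweeps):
           u_new = u.copy()
           u_new[1:-1] = (1 - w) * u[1:-1] + w * 0.5 * (u[:-2] + u[2:] + h**2 * f[1:-1])
           u = u_new
       return u

   def restrict(r):                                   # full weighting, fine -> coarse
       return np.concatenate(([0.0],
                              0.25 * r[1:-2:2] + 0.5 * r[2:-1:2] + 0.25 * r[3::2],
                              [0.0]))

   def prolong(e_c, n_fine):                          # linear interpolation, coarse -> fine
       e_f = np.zeros(n_fine)
       e_f[2:-1:2] = e_c[1:-1]
       e_f[1:-1:2] = 0.5 * (e_c[:-1] + e_c[1:])
       return e_f

   def v_cycle(u, f, h):
       if u.size <= 3:                                # coarsest grid: solve directly
           u[1] = 0.5 * h**2 * f[1]
           return u
       u = smooth(u, f, h)                            # pre-smooth (damp rough error)
       r_c = restrict(residual(u, f, h))              # restrict residual to coarse grid
       e_c = v_cycle(np.zeros_like(r_c), r_c, 2 * h)  # recur on the coarse error equation
       u = u + prolong(e_c, u.size)                   # prolong correction to fine grid
       return smooth(u, f, h)                         # post-smooth

   n = 2**6 + 1
   h = 1.0 / (n - 1)
   xg = np.linspace(0.0, 1.0, n)
   f = np.pi**2 * np.sin(np.pi * xg)                  # exact solution: sin(pi x)
   u = np.zeros(n)
   for _ in range(10):                                # a few V-cycles
       u = v_cycle(u, f, h)
   print(np.max(np.abs(u - np.sin(np.pi * xg))))      # small, discretization-level error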

  35. Hypre’s AMG in M3D • PETSc-based PPPL code M3D has been retrofit with Hypre’s algebraic MG solver of Ruge-Stüben type • Iteration counts are averaged over 19 different PETSc SLESSolve calls in initialization and one timestep loop for this operator-split unsteady code; abscissa is the number of procs in the scaled problem; problem size ranges from 12K to 303K unknowns (approx 4K per processor)

  36. Hypre’s AMG in M3D • Scaled speedup timings are summed over 19 different PETSc SLESSolve calls in initialization and one timestep loop for this operator-split unsteady code • Majority of AMG cost is coarse-grid formation (preprocessing), which does not scale as well as the inner-loop V-cycle phase; in production, these coarse hierarchies will be saved for reuse (the same linear systems recur in each timestep loop), making AMG much less expensive and more scalable

  37. Hypre’s “Conceptual Interfaces” (slide c/o E. Chow, LLNL) • Linear system interfaces map onto linear solvers (GMG, FAC, hybrid, AMGe, ILU, …) and data layouts (structured, composite, block-structured, unstructured, CSR)

  38. SuperLU in NIMROD • NIMROD is another MHD code in the CEMM collaboration • employs high-order elements on unstructured grids • very poor convergence with default Krylov solver on 2D poloidal crossplane linear solves • TOPS wired in SuperLU, just to try a sparse direct solver • Speedup of more than 10 in serial, and about 8 on a modest parallel cluster (24 procs) • PI Dalton Schnack (General Atomics) thought he entered a time machine • SuperLU is not a “final answer”, but a sanity check • Parallel ILU under Krylov should be superior

  39. 2D Hall MHD sawtooth instability (PETSc examples /snes/ex29.c and /sles/ex31.c; Porcelli et al., 1993, 1999) • Figures: vorticity at early time, vorticity at later time, and zoom (c/o A. Bhattacharjee, CMRS) • [Model equations and equilibrium shown as images on the original slide]

  40. PETSc’s DMMG in Hall MR application • Implicit code (snes/ex29.c) with first- and second-order integration in time • Implicit code (snes/ex29.c) versus explicit code (sles/ex31.c), both with second-order integration in time

  41. Abstract Gantt chart for TOPS • Each color module represents an algorithmic research idea on its way to becoming part of a supported community software tool. At any moment (vertical time slice), TOPS has work underway at multiple levels. While some codes are in applications already, they are being improved in functionality and performance as part of the TOPS research agenda. • Levels over time, from research to dissemination: algorithmic development (e.g., ASPIN), research implementations (e.g., TOPSLib), hardened codes (e.g., PETSc), applications integration, dissemination

  42. Jacobian-free Newton-Krylov • In the Jacobian-free Newton-Krylov (JFNK) method, a Krylov method solves the linear Newton correction equation, requiring Jacobian-vector products • These are approximated by Fréchet derivatives, F’(u)v ≈ [F(u + εv) − F(u)]/ε (where ε is chosen with a fine balance between approximation error and floating-point rounding error), or by automatic differentiation, so that the actual Jacobian elements are never explicitly needed • One builds the Krylov space on a true F’(u) (to within numerical approximation)
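
A hedged Python sketch of this Jacobian-vector product (the perturbation-size heuristic below is one common choice, an assumption rather than the specific formula used in TOPS codes):

   # Approximate J(u) @ v without ever forming J, by a one-sided
   # finite difference (Frechet derivative) of the nonlinear residual F.
   import numpy as np

   def jfnk_matvec(F, u, v, rel_eps=1.0e-7):
       norm_v = np.linalg.norm(v)
       if norm_v == 0.0:
           return np.zeros_like(v)
       # eps balances truncation error (prefers large eps) against
       # floating-point rounding error (prefers small eps).
       eps = rel_eps * (1.0 + np.linalg.norm(u)) / norm_v
       return (F(u + eps * v) - F(u)) / eps

   # Example: F(u) = u**3 - 1 has Jacobian diag(3 u**2)
   F = lambda u: u**3 - 1.0
   print(jfnk_matvec(F, np.array([1.0, 2.0, 3.0]),
                     np.array([1.0, 0.0, 0.0])))      # approximately [3, 0, 0]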

  43. Philosophy of Jacobian-free NK • To evaluate the linear residual, we use the true F’(u), giving a true Newton step and asymptotic quadratic Newton convergence • To precondition the linear residual, we do anything convenient that uses understanding of the dominant physics/mathematics in the system and respects the limitations of the parallel computer architecture and the cost of various operations: • Jacobian of lower-order discretization • Jacobian with “lagged” values for expensive terms • Jacobian stored in lower precision • Jacobian blocks decomposed for parallelism • Jacobian of related discretization • operator-split Jacobians • physics-based preconditioning

  44. Recall idea of preconditioning • Krylov iteration is expensive in memory and in function evaluations, so the subspace dimension k must be kept small in practice, through preconditioning the Jacobian with an approximate inverse (call it B^{-1}), so that the preconditioned matrix has a low condition number • Given the ability to apply the action of B^{-1} to a vector, preconditioning can be done on either the left, as above, or on the right, which is especially convenient for matrix-free methods (see the reconstruction below)
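
For concreteness, a standard way to write this (a reconstruction; the slide's own formulas appeared as images, and the symbol B^{-1} for the approximate inverse is notation assumed here):

   \text{left:}\quad B^{-1} J(u)\,\delta u = -B^{-1} F(u), \qquad
   \text{right:}\quad \big(J(u)\,B^{-1}\big)\, w = -F(u), \quad \delta u = B^{-1} w ,

and in the matrix-free setting the right-preconditioned product is approximated by the Fréchet difference

   J(u)\,B^{-1} v \;\approx\; \frac{F\!\left(u + \varepsilon\,B^{-1} v\right) - F(u)}{\varepsilon} .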

  45. Physics-based preconditioning • In Newton iteration, one seeks to obtain a correction (“delta”) to the solution by inverting the Jacobian matrix on (the negative of) the nonlinear residual: δu = −J(u)^{-1} F(u) • A typical operator-split code also derives a “delta” to the solution, by some implicitly defined means, through a series of implicit and explicit substeps • This implicitly defined mapping from residual to “delta” is a natural preconditioner • Software must accommodate this!

  46. Physics-based preconditioning (“Newton outside”) • We consider a standard “dynamical core,” the shallow-water wave splitting algorithm, as a solver • It leaves a first-order-in-time splitting error • In the Jacobian-free Newton-Krylov framework, this solver, which maps a residual into a correction, can be regarded as a preconditioner • The true Jacobian is never formed, yet the time-implicit nonlinear residual at each time step can be made as small as needed for nonlinear consistency in long time integrations
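
A schematic Python sketch of this “solver as preconditioner” idea (the function names and the stand-in split solver are assumptions for illustration, not the shallow-water dynamical core itself): the operator-split update, viewed as a map from residual to correction, is wrapped as a LinearOperator and passed to a Krylov method as the preconditioner M, while the Jacobian action uses the matrix-free product from the earlier sketch.

   # Schematic only: use an (assumed) operator-split solve as the
   # preconditioner M in a Jacobian-free Krylov solve of J du = -F(u).
   import numpy as np
   from scipy.sparse.linalg import LinearOperator, gmres

   def split_step_solve(residual_vec):
       # Placeholder for the physics-based step: maps a residual to an
       # approximate correction via implicit/explicit substeps.
       return 0.5 * residual_vec                      # stand-in for illustration

   def physics_preconditioned_newton_step(F, u, jfnk_matvec):
       n = u.size
       J = LinearOperator((n, n), matvec=lambda v: jfnk_matvec(F, u, v))
       M = LinearOperator((n, n), matvec=split_step_solve)
       delta_u, info = gmres(J, -F(u), M=M)           # preconditioned Krylov solve
       return u + delta_u                             # one (undamped) Newton update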

  47. Example: shallow water equations • Continuity (*) • Momentum (**) • These equations admit a fast gravity wave, as can be seen by cross-differentiating, e.g., (*) by t and (**) by x, and subtracting (a reconstruction is sketched below)
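
The equations themselves appeared as images on the original slide; a standard 1D form consistent with the surrounding discussion (a reconstruction, writing \phi = g h for the geopotential and u for the velocity) is

   (*)\quad \frac{\partial \phi}{\partial t} + \frac{\partial (u\,\phi)}{\partial x} = 0,
   \qquad
   (**)\quad \frac{\partial u}{\partial t} + u\,\frac{\partial u}{\partial x} + \frac{\partial \phi}{\partial x} = 0 .

Differentiating (*) with respect to t and (**) with respect to x, subtracting, and linearizing about a state of rest with mean geopotential \bar\phi gives the wave equation

   \frac{\partial^2 \phi}{\partial t^2} \;\approx\; \bar\phi\,\frac{\partial^2 \phi}{\partial x^2} ,

i.e., gravity waves with speed c_g = \sqrt{\bar\phi} = \sqrt{g\bar h}, typically much faster than the advective speed |u|.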

  48. 1D shallow water equations, cont. • Wave equation for the geopotential (see the reconstruction above) • Gravity wave speed √(gh) • Typically the gravity wave is much faster than the advective flow, but stability restrictions would require timesteps based on the Courant-Friedrichs-Lewy (CFL) criterion for the fastest wave, for an explicit method • One can solve fully implicitly, or one can filter out the gravity wave by solving semi-implicitly

  49. 1D shallow water equations, cont. • Continuity (*) • Momentum (**) • Solving (**) for the new-time-level velocity and substituting into (*) yields a scalar parabolic equation for the new-time-level geopotential (a reconstruction is sketched below)
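
A hedged reconstruction of this substitution (the exact discretization on the slide is not reproduced in the transcript; this is a standard semi-implicit form consistent with the description): treating the gravity-wave terms implicitly and lagging the advection,

   \frac{\phi^{n+1}-\phi^{n}}{\Delta t} + \partial_x\!\left(u^{n+1}\phi^{n}\right) = 0,
   \qquad
   \frac{u^{n+1}-u^{n}}{\Delta t} + u^{n}\,\partial_x u^{n} + \partial_x \phi^{n+1} = 0 .

Solving the momentum update for u^{n+1} = \tilde u - \Delta t\,\partial_x\phi^{n+1}, with \tilde u = u^{n} - \Delta t\,u^{n}\partial_x u^{n}, and substituting into the continuity update yields the scalar parabolic equation

   \phi^{n+1} - \Delta t^{2}\,\partial_x\!\left(\phi^{n}\,\partial_x \phi^{n+1}\right)
   \;=\; \phi^{n} - \Delta t\,\partial_x\!\left(\phi^{n}\,\tilde u\right) .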

  50. 1D shallow water equations, cont. • After the parabolic equation is spatially discretized and solved for the new geopotential, the new velocity can be found explicitly from the momentum equation • One scalar parabolic solve and one scalar explicit update replace an implicit hyperbolic system • This semi-implicit operator splitting is foundational to multiple-scale problems in geophysical modeling • Similar tricks are employed in aerodynamics (sound waves), MHD (multiple Alfvén waves), reacting flows (fast kinetics), etc. • Temporal truncation error remains due to the lagging of the advection in (**) (to be dealt with shortly)
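
A small Python sketch of one such step, following the reconstruction above (the periodic grid, centered differences, and parameters are illustrative assumptions and may differ from the discretization on the original slides):

   # Illustrative semi-implicit shallow-water step on a periodic 1D grid:
   # one scalar (parabolic) linear solve for phi, one explicit update for u.
   import numpy as np

   n, L, dt, g = 64, 1.0, 1.0e-3, 9.81
   dx = L / n
   x = np.arange(n) * dx

   def ddx(v):                                        # centered periodic d/dx
       return (np.roll(v, -1) - np.roll(v, 1)) / (2.0 * dx)

   # Dense matrix of the same centered periodic d/dx (for the implicit solve)
   I = np.eye(n)
   D = (np.roll(I, 1, axis=1) - np.roll(I, -1, axis=1)) / (2.0 * dx)

   def semi_implicit_step(phi, u):
       u_tilde = u - dt * u * ddx(u)                  # lagged (explicit) advection
       A = I - dt**2 * D @ (phi[:, None] * D)         # I - dt^2 d/dx( phi d/dx . )
       rhs = phi - dt * ddx(phi * u_tilde)
       phi_new = np.linalg.solve(A, rhs)              # the one scalar parabolic solve
       u_new = u_tilde - dt * ddx(phi_new)            # the one scalar explicit update
       return phi_new, u_new

   # Example: still water plus a small bump in the geopotential
   phi = g * (1.0 + 0.01 * np.exp(-100.0 * (x - 0.5) ** 2))
   u = np.zeros(n)
   for _ in range(10):
       phi, u = semi_implicit_step(phi, u)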
