
Scalable Solvers and Software for PDE Applications


Presentation Transcript


  1. Scalable Solvers and Software for PDE Applications The Pennsylvania State University 29 April 2003 David E. Keyes Center for Computational Science Old Dominion University & Institute for Scientific Computing Research Lawrence Livermore National Laboratory

2. Happy Poincaré's Birthday!
• Born: 29 April 1854, Nancy
• Fundamental contributions to topology, analysis, number theory, potential theory, quantum theory, fluid mechanics, the special theory of relativity, and the philosophy of science
• Académie des Sciences, 1887 (President, 1906)
• Fellow, Royal Society, 1894
• Died: 17 July 1912, Paris
• "The last universalist in mathematics" - anon.
• "It is by logic that we prove; it is by intuition that we invent." – Henri Poincaré

3. Plan of presentation
• Imperative of "optimal" algorithms for terascale computing
• Basic domain decomposition and multilevel algorithmic concepts
• Examples of applications
• Example of Bell Prize-winning solver performance on ASCI platforms
• Conclusions and outlook

4. Motivation: optimality
• Convergence rate nearly independent of discretization parameters
• Multilevel schemes for rapid linear convergence of linear problems
• Newton-like schemes for quadratic convergence of nonlinear problems
• Convergence rate as independent as possible of physical parameters
• Continuation schemes
• Physics-based preconditioning
[figure: time to solution vs. problem size (increasing with number of processors) for a steel/rubber composite, parallel multigrid, c/o M. Adams, Berkeley-Sandia; the unscalable method's time grows with problem size while the scalable method's stays flat]
The solver is a key part, but not the only part, of the simulation that needs to be scalable.

  5. Why optimal algorithms? • The more powerful the computer, the greater the importance of optimality • Example: • Suppose Alg1 solves a problem in time CN2, where N is the input size • Suppose Alg2 solves the same problem in time CN • Suppose that the machine on which Alg1 and Alg2 have been parallelized to run has 10,000 processors • In constant time (compared to serial), Alg1 can run a problem 100X larger, whereas Alg2 can run a problem 10,000X larger

  6. Why optimal?, cont. • Alternatively, filling the machine’s memory, Alg1 requires 100X time, whereas Alg2 runs in constant time • Is 10,000 processors a reasonable expectation? • Yes, we have it today (ASCI White (IBM), Red Storm (Cray))! • Could computational scientists really use 10,000X scaling? • Of course; we are approximating the continuum • A grid for weather prediction allows points every 1km versus every 100km on the earth’s surface • In 2D 10,000X disappears fast; in 3D even faster • However, these machines are expensive (Earth Simulator is $0.5B, plus ongoing operating costs), and optimal algorithms are the only algorithms that we can afford to run on them
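To make the arithmetic concrete, here is a tiny Python sketch of the argument of the last two slides (the machine size and the cost exponents are those of the example; "Alg1" and "Alg2" are the hypothetical algorithms above):

```python
# Scaling sketch: with P processors and serial cost C*N**p, parallel time
# is roughly C*N**p / P. Holding wall-clock time fixed, the solvable
# problem size therefore grows like P**(1/p).
P = 10_000  # processors, as in the example

for p, name in [(2, "Alg1, O(N^2)"), (1, "Alg2, O(N)")]:
    growth = P ** (1.0 / p)
    print(f"{name}: problem grows {growth:,.0f}x in constant time")
# -> Alg1, O(N^2): problem grows 100x in constant time
# -> Alg2, O(N):   problem grows 10,000x in constant time
```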

7. Decomposition strategies for Lu = f in Ω
• Operator decomposition
• Function space decomposition
• Domain decomposition
Consider, e.g., the implicitly discretized parabolic case.

  8. Operator decomposition • Consider ADI • Iteration matrix consists of four sequential (“multiplicative”) substeps per timestep • two sparse matrix-vector multiplies • two sets of unidirectional bandsolves • Parallelism within each substep • But global data exchanges between bandsolve substeps
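As an illustration of these substeps, here is a minimal Peaceman-Rachford ADI sketch in Python for the 2D heat equation with zero Dirichlet boundaries; the grid size, timestep, and random initial data are arbitrary choices, not from the talk:

```python
import numpy as np
from scipy.linalg import solve_banded

n, dt = 64, 1e-4
h = 1.0 / (n + 1)
r = dt / (2 * h * h)

# Tridiagonal (I - r*D2) in solve_banded's (1,1) banded layout.
ab = np.zeros((3, n))
ab[0, 1:] = -r           # superdiagonal
ab[1, :] = 1 + 2 * r     # main diagonal
ab[2, :-1] = -r          # subdiagonal

def d2(u, axis):
    """r times the second difference along `axis`, zero Dirichlet data."""
    out = np.zeros_like(u)
    src = np.moveaxis(u, axis, 0)
    dst = np.moveaxis(out, axis, 0)
    dst[:-1] += src[1:]
    dst[1:] += src[:-1]
    dst -= 2 * src
    return r * out

u = np.random.rand(n, n)
for step in range(10):
    # Substep 1: implicit in x (a set of unidirectional bandsolves), explicit in y.
    u = solve_banded((1, 1), ab, u + d2(u, axis=1))
    # Substep 2: implicit in y, explicit in x. The transpose is where the
    # global (all-to-all) data exchange appears in a parallel setting.
    u = solve_banded((1, 1), ab, (u + d2(u, axis=0)).T).T
```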

9. Function space decomposition
• Consider a spectral Galerkin method: expand u(x,t) = Σ_j a_j(t) φ_j(x), yielding a system of ordinary differential equations for the coefficients a_j(t)
• Perhaps the resulting (mass and stiffness) matrices are diagonal
• Perfect parallelism across the spectral index
• But global data exchanges to transform back to physical variables at each step
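A minimal Python illustration of the same pattern, using the periodic 1D heat equation as an assumed stand-in for the spectral Galerkin setting: each Fourier mode advances independently (perfect parallelism across the spectral index), and the transforms are the global data exchanges:

```python
import numpy as np

n, L, dt = 128, 2 * np.pi, 1e-3
x = np.linspace(0, L, n, endpoint=False)
k = 2 * np.pi * np.fft.fftfreq(n, d=L / n)   # wavenumbers

u = np.exp(np.sin(x))                        # arbitrary initial condition
for step in range(100):
    u_hat = np.fft.fft(u)                    # transform: global exchange
    u_hat *= np.exp(-k**2 * dt)              # diagonal, mode-by-mode update
    u = np.fft.ifft(u_hat).real              # back to physical variables
```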

10. Domain decomposition
• Consider restriction and extension operators R_i, R_i^T for subdomains Ω_i, and R_0, R_0^T for a possible coarse grid
• Replace the discretized operator A with B^{-1}A, where the Schwarz preconditioner is B^{-1} = R_0^T A_0^{-1} R_0 + Σ_i R_i^T A_i^{-1} R_i
• Solve by a Krylov method, e.g., CG
• Matrix-vector multiplies with B^{-1}A need:
• parallelism on each subdomain
• nearest-neighbor exchanges, global reductions
• possible small global system (not needed for parabolic case)
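The pieces above can be exercised in a few lines of Python on a toy 1D Laplacian (sizes, overlap, and subdomain count are arbitrary choices): restriction R_i is index gathering, extension R_i^T is scatter-add, and the resulting one-level additive Schwarz operator is handed to CG as preconditioner:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg, splu, LinearOperator

n, nsub, overlap = 200, 8, 4
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n)).tocsr()

# Overlapping index sets (subdomains) and local factorizations A_i = R_i A R_i^T.
edges = np.linspace(0, n, nsub + 1, dtype=int)
subs = [np.arange(max(0, edges[i] - overlap), min(n, edges[i + 1] + overlap))
        for i in range(nsub)]
lus = [splu(A[s, :][:, s].tocsc()) for s in subs]

def schwarz(r):
    """B^{-1} r = sum_i R_i^T A_i^{-1} R_i r; the loop is embarrassingly parallel."""
    z = np.zeros_like(r)
    for s, lu in zip(subs, lus):
        z[s] += lu.solve(r[s])      # gather, local solve, scatter-add
    return z

b = np.ones(n)
x, info = cg(A, b, M=LinearOperator((n, n), matvec=schwarz, dtype=float))
print("residual:", np.linalg.norm(b - A @ x))
```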

  11. Comparison • Operator decomposition (ADI) • natural row-based assignment requires all-to-all, bulk data exchanges in each step (for transpose) • Function space decomposition (Fourier) • natural mode-based assignment requires all-to-all, bulk data exchanges in each step (for transform) • Domain decomposition (Schwarz) • natural domain-based assignment requires local, nearest neighbor data exchanges, global reductions, and optional small global problem

12. Theoretical scaling of domain decomposition (for three common network topologies)
• With logarithmic-time (hypercube- or tree-based) global reductions and scalable nearest-neighbor interconnects:
• optimal number of processors scales linearly with problem size ("scalable"; assumes one subdomain per processor)
• With power-law-time (3D torus-based) global reductions and scalable nearest-neighbor interconnects:
• optimal number of processors scales as the three-fourths power of problem size ("almost scalable")
• With linear-time (common bus) network:
• optimal number of processors scales as the one-fourth power of problem size (*not* scalable)
• bad news for conventional Beowulf clusters, but see 2000 & 2001 Bell Prize "price-performance awards" using multiple commodity NICs per Beowulf node!

13. Three Basic Concepts
• Iterative correction
• Schwarz preconditioning
• Schur preconditioning
Some "Advanced" Concepts
• Polynomial combinations of Schwarz projections
• Schwarz-Schur combinations
• Schwarz on Schur-reduced system
• Schwarz inside Schur-reduced system
• Nonlinear Schwarz (recent)

14. Iterative correction
• The most basic idea in iterative methods: u ← u + B^{-1}(f - Au)
• Evaluate the residual accurately, but solve approximately, where B^{-1} is an approximate inverse to A
• A sequence of complementary solves can be used, e.g., with B_1^{-1} first and then B_2^{-1}; one has u ← u + (B_1^{-1} + B_2^{-1} - B_2^{-1} A B_1^{-1})(f - Au)
• Optimal polynomials of B^{-1}A lead to various preconditioned Krylov methods
• Scale recurrence, e.g., with B^{-1} based on a coarser grid, leads to multilevel methods
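As a concrete toy (a 1D Laplacian with a damped Jacobi approximate inverse, assumed for illustration), iterative correction is just the loop below; its slow convergence is exactly what the multilevel ideas on the next slide accelerate:

```python
import numpy as np

n = 100
A = (np.diag(2.0 * np.ones(n)) + np.diag(-np.ones(n - 1), 1)
     + np.diag(-np.ones(n - 1), -1))
f = np.ones(n)
Binv = np.diag(1.0 / np.diag(A))        # approximate inverse: Jacobi

u = np.zeros(n)
for k in range(50):
    r = f - A @ u                       # evaluate residual accurately
    u = u + (2.0 / 3.0) * (Binv @ r)    # solve approximately (damped)
print("residual norm:", np.linalg.norm(f - A @ u))
```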

15. Multilevel preconditioning
[figure: a multigrid V-cycle, from the finest grid down to a first coarse grid and back: a smoother is applied on each level; restriction transfers the residual from fine to coarse grid (the coarser grid has fewer cells, hence less work & storage); prolongation transfers the correction from coarse to fine grid; recursively apply this idea until we have an easy problem to solve]
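A compact geometric multigrid V-cycle for the 1D Poisson problem, mirroring the figure (grid size, smoother damping, and sweep counts are illustrative choices):

```python
import numpy as np

def apply_A(u, h):
    """1D Poisson operator (-u'') on interior points, zero Dirichlet BCs."""
    Au = 2.0 * u.copy()
    Au[:-1] -= u[1:]
    Au[1:] -= u[:-1]
    return Au / h**2

def smooth(u, f, h, sweeps=2, omega=2.0 / 3.0):
    for _ in range(sweeps):                 # damped Jacobi smoother
        u = u + omega * (h**2 / 2.0) * (f - apply_A(u, h))
    return u

def v_cycle(u, f, h):
    if len(u) <= 3:                         # coarsest grid: easy problem
        return smooth(u, f, h, sweeps=100)
    u = smooth(u, f, h)                     # pre-smooth
    r = f - apply_A(u, h)
    rc = 0.25 * (r[0:-2:2] + 2 * r[1:-1:2] + r[2::2])  # restrict (full weighting)
    ec = v_cycle(np.zeros_like(rc), rc, 2 * h)         # recurse on coarser grid
    e = np.zeros_like(u)                    # prolongate (linear interpolation)
    e[1::2] = ec
    e[0], e[-1] = 0.5 * ec[0], 0.5 * ec[-1]
    e[2:-1:2] = 0.5 * (ec[:-1] + ec[1:])
    return smooth(u + e, f, h)              # correct, then post-smooth

n = 2**7 - 1                                # interior points on finest grid
h, f, u = 1.0 / (n + 1), np.ones(n), np.zeros(n)
for cycle in range(8):
    u = v_cycle(u, f, h)
    print(cycle, np.linalg.norm(f - apply_A(u, h)))
```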

16. Example of Hypre's scaled efficiency
[figure: scaled efficiency of Hypre's multilevel solvers as the problem grows from 64K DOFs to 200M DOFs]

17. Schwarz preconditioning
• Given Ax = b, partition x into subvectors, corresponding to subdomains Ω_i of the domain Ω of the PDE, nonempty, possibly overlapping, whose union is all of the elements of x
• Let the Boolean rectangular matrix R_i extract the i-th subset of x: x_i = R_i x
• Let A_i = R_i A R_i^T; the additive Schwarz preconditioner is B^{-1} = Σ_i R_i^T A_i^{-1} R_i
• The Boolean matrices R_i and R_i^T are gather/scatter operators, mapping between a global vector and its subdomain support

18. SPMD parallelism w/ domain decomposition
[figure: a grid partitioned among processors 1, 2, 3; the matrix rows assigned to proc "2" contain the blocks A21, A22, A23]
Partitioning of the grid induces block structure on the Jacobian.

19. Iteration count estimates from the Schwarz theory
• Krylov-Schwarz iterative methods typically converge in a number of iterations that scales as the square root of the condition number of the Schwarz-preconditioned system
• In terms of N and P, where for d-dimensional isotropic problems N = h^{-d} and P = H^{-d}, for mesh parameter h and subdomain diameter H, iteration counts may be estimated as follows:

Preconditioning Type       | in 2D         | in 3D
Point Jacobi               | O(N^{1/2})    | O(N^{1/3})
Domain Jacobi (δ=0)        | O((NP)^{1/4}) | O((NP)^{1/6})
1-level Additive Schwarz   | O(P^{1/2})    | O(P^{1/3})
2-level Additive Schwarz   | O(1)          | O(1)

20. Comments on the Schwarz theory
• Basic Schwarz estimates are for:
• self-adjoint operators with smooth coefficients
• positive definite operators
• exact subdomain solves
• two-way overlap communication
• generous overlap, δ = O(H) (otherwise the 2-level result is O(1+H/δ))
• Extensible to:
• nonself-adjointness (e.g., convection) and jumping coefficients
• indefiniteness (e.g., wave Helmholtz)
• inexact subdomain solves
• one-way overlap communication ("restricted additive Schwarz")
• small overlap

21. Schur preconditioning
• Given a partition into interior (I) and interface (Γ) unknowns,
A = [ A_II  A_IΓ ]
    [ A_ΓI  A_ΓΓ ]
• Condense: S = A_ΓΓ - A_ΓI A_II^{-1} A_IΓ
• Let M be a good preconditioner for S
• Then
B = [ A_II  A_IΓ ]
    [  0     M   ]
is a preconditioner for A
• Moreover, solves with A_II may be done approximately if all degrees of freedom are retained
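A small dense Python sketch of the condensation step (a random SPD matrix stands in for the discretized PDE; the interior/interface block sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
nI, nG = 40, 10                                  # interior / interface sizes
M = rng.standard_normal((nI + nG, nI + nG))
A = M @ M.T + (nI + nG) * np.eye(nI + nG)        # SPD stand-in matrix
AII, AIG = A[:nI, :nI], A[:nI, nI:]
AGI, AGG = A[nI:, :nI], A[nI:, nI:]
b = np.ones(nI + nG)
bI, bG = b[:nI], b[nI:]

S = AGG - AGI @ np.linalg.solve(AII, AIG)        # condensed (Schur) operator
g = bG - AGI @ np.linalg.solve(AII, bI)          # condensed right-hand side
xG = np.linalg.solve(S, g)                       # interface solve
xI = np.linalg.solve(AII, bI - AIG @ xG)         # interior back-substitution
print("residual:", np.linalg.norm(A @ np.concatenate([xI, xG]) - b))
```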

22. Schwarz polynomials
• Polynomials of Schwarz projections that are combinations of additive and multiplicative may be appropriate for certain implementations
• We may solve the fine subdomains concurrently and follow with a coarse grid (redundantly/cooperatively)
• This leads to algorithm "Hybrid II" in S-B-G'96
• Convenient for "SPMD" (single prog/multiple data)

23. Schwarz-on-Schur
• Preconditioning the Schur complement is complex in and of itself; Schwarz can be used on the reduced problem
• "Neumann-Neumann" alg
• "Balancing Neumann-Neumann" alg
• Multigrid on the Schur complement

24. Schwarz-inside-Schur
• Consider Newton's method for solving the nonlinear rootfinding problem derived from the necessary conditions for constrained optimization
• Constraint: c(x, u) = 0
• Objective: min f(x, u)
• Lagrangian: L(x, u, λ) = f(x, u) + λ^T c(x, u)
• Form the gradient of the Lagrangian with respect to each of x, u, and λ:
L_x = f_x + λ^T c_x = 0,  L_u = f_u + λ^T c_u = 0,  L_λ = c = 0

25. Schwarz-inside-Schur, cont.
• Equality constrained optimization leads to the KKT system for states x, designs u, and multipliers λ:
[ W_xx  W_xu  J_x^T ] [ δx ]    [ g_x ]
[ W_ux  W_uu  J_u^T ] [ δu ] = -[ g_u ]
[ J_x   J_u    0    ] [ δλ ]    [  c  ]
• Then reduced SQP solves the Schur complement system H δu = g, where H is the reduced Hessian

26. Schwarz-inside-Schur, cont.
• Problems
• J_x is the Jacobian of a PDE ⇒ huge!
• The W blocks involve Hessians of objective and constraints ⇒ second derivatives and huge
• H is unreasonable to form, store, or invert
• Solutions
• Use Schur preconditioning on the full system
• Form forward action of Hessians by automatic differentiation (vector-to-vector map)
• Form approximate inverse action of the state Jacobian and its transpose by Schwarz

27. Example of PDE-constrained optimization: Lagrange-Newton-Krylov-Schur implemented in Veltisto/PETSc
• Optimal control of laminar viscous flow
• optimization variables are surface suction/injection
• objective is minimum drag
• 700,000 states; 4,000 controls
• 128 Cray T3E processors
• ~5 hrs for optimal solution (~1 hr for analysis)
[figure: wing tip vortices, no control (l); optimal control (r); optimal boundary controls shown as velocity vectors]
c/o G. Biros and O. Ghattas, www.cs.nyu.edu/~biros/veltisto/

28. Nonlinear Schwarz preconditioning
• Nonlinear Schwarz has Newton both inside and outside and is fundamentally Jacobian-free
• It replaces F(u) = 0 with a new nonlinear system possessing the same root, ℱ(u) = 0
• Define a correction δ_i(u) to the i-th partition (e.g., subdomain) of the solution vector by solving the following local nonlinear system: R_i F(u + δ_i(u)) = 0, where δ_i(u) is nonzero only in the components of the i-th partition
• Then sum the corrections: ℱ(u) = Σ_i δ_i(u)

29. Nonlinear Schwarz, cont.
• It is simple to prove that if the Jacobian of F(u) is nonsingular in a neighborhood of the desired root, then F(u) = 0 and ℱ(u) = 0 have the same unique root
• To lead to a Jacobian-free Newton-Krylov algorithm we need to be able to evaluate, for any u:
• The residual ℱ(u) = Σ_i δ_i(u)
• The Jacobian-vector product ℱ'(u) v
• Remarkably, (Cai-Keyes, 2000) it can be shown that ℱ'(u) ≈ (Σ_i R_i^T J_i^{-1} R_i) J, where J = F'(u) and J_i = R_i J R_i^T
• All required actions are available in terms of F!
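The construction can be exercised on a toy 1D Bratu-type problem. In this hedged sketch, scipy's fsolve stands in for the inner local Newton solves, and a damped Richardson loop on ℱ stands in for the outer Jacobian-free Newton described above; the sizes and the two overlapping blocks are arbitrary:

```python
import numpy as np
from scipy.optimize import fsolve

n, lam = 32, 1.0
h = 1.0 / (n + 1)

def F(u):
    """F(u) = A u - lam*exp(u) - 1, A the 1D Laplacian, zero Dirichlet BCs."""
    Au = (2 * u - np.r_[u[1:], 0.0] - np.r_[0.0, u[:-1]]) / h**2
    return Au - lam * np.exp(u) - 1.0

blocks = [np.arange(0, 20), np.arange(12, 32)]    # overlapping partitions

def script_F(u):
    """script-F(u) = sum_i delta_i(u), with R_i F(u + delta_i(u)) = 0."""
    corr = np.zeros_like(u)
    for idx in blocks:
        def local(ci):                            # correction supported on block
            c = np.zeros_like(u)
            c[idx] = ci
            return F(u + c)[idx]
        corr[idx] += fsolve(local, np.zeros(len(idx)))
    return corr

u = np.zeros(n)
for k in range(20):              # damped Richardson on script-F (outer loop)
    u = u - 0.5 * script_F(u)
print("||F(u)|| =", np.linalg.norm(F(u)))
```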

30. Example of nonlinear Schwarz
[figure: convergence histories vs. Reynolds number: Newton's method shows difficulty at the critical Re and stagnation beyond the critical Re, while Additive Schwarz Preconditioned Inexact Newton (ASPIN) shows convergence for all Re]

31. "Unreasonable effectiveness" of Schwarz
• When does the sum of partial inverses equal the inverse of the sums? When the decomposition is right!
• Let {r_i} be a complete set of orthonormal row eigenvectors for A: r_i A = a_i r_i, or A = Σ_i r_i^T a_i r_i
• Then A^{-1} = Σ_i r_i^T a_i^{-1} r_i, which is the Schwarz formula!
• Good decompositions are a compromise between conditioning and parallel complexity, in practice

32. Newton-Krylov-Schwarz: a parallel PDE "workhorse"
• Newton: nonlinear solver, asymptotically quadratic
• Krylov: accelerator, spectrally adaptive
• Schwarz: preconditioner, parallelizable
Popularized in parallel Jacobian-free form under this name by Cai, Gropp, Keyes & Tidriri (1994); in PETSc since Balay's MS project at ODU (1995).

33. Jacobian-free Newton-Krylov method
• In the Jacobian-free Newton-Krylov (JFNK) method, a Krylov method solves the linear Newton correction equation, requiring Jacobian-vector products
• These are approximated by the Fréchet derivative J(u) v ≈ [F(u + εv) - F(u)] / ε, so that the actual Jacobian elements are never explicitly needed; ε is chosen with a fine balance between approximation error and floating point rounding error
• Schwarz preconditions, using approximate elements
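A minimal JFNK loop in Python (the residual function is a toy, and the ε scaling shown is one common heuristic, not prescribed here): GMRES sees the Jacobian only through Fréchet-difference products:

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, gmres

def F(u):                        # toy nonlinear residual, componentwise
    return u**3 + 2.0 * u - 1.0

n = 4
u = np.zeros(n)
for k in range(10):              # Newton outer loop
    Fu = F(u)
    def jv(v):                   # Frechet-difference Jacobian action
        nv = np.linalg.norm(v)
        if nv == 0.0:
            return np.zeros_like(v)
        eps = 1e-7 / nv          # balance truncation vs. rounding error
        return (F(u + eps * v) - Fu) / eps
    J = LinearOperator((n, n), matvec=jv, dtype=float)
    du, info = gmres(J, -Fu)     # Krylov solve: Jacobian never formed
    u = u + du
print(u, "||F|| =", np.linalg.norm(F(u)))
```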

  34. Philosophy of Jacobian-free NK • To evaluate the linear residual, we use the true F’(u) , giving a true Newton step and asymptotic quadratic Newton convergence • To precondition the linear residual, we do anything convenient that uses understanding of the dominant physics/mathematics in the system and respects the limitations of the parallel computer architecture and the cost of various operations: • combinations of operator-split Jacobians (for reasons of physics or reasons of numerics) • Jacobian of related discretization (for “fast” solves) • Jacobian of lower-order discretization (for more stability, less storage) • Jacobian with “lagged” values for expensive terms (for less computation per degree of freedom) • Jacobian stored in lower precision (for less memory traffic per preconditioning step) • Jacobian blocks decomposed for parallelism

35. Philosophy of Jacobian-free NK, cont.
• These motivations are not new; most large-scale application codes also take "short cuts" on the approximate Jacobian operator to be inverted, showing physical intuition
• The problem with many codes is that they do not anywhere have an accurate global Jacobian operator; they use only the weak Jacobian
• This leads to a weakly nonlinearly converging "defect correction method"
• Defect correction: solve B δu = -F(u) with an approximate operator B, and update u ← u + δu
• in contrast to preconditioned Newton: solve (B^{-1}J(u)) δu = -B^{-1}F(u), with the true Jacobian J(u) applied Krylov-style

36. Physics-based preconditioning: Newton outside
• Consider an algorithm that leaves first-order splitting error, used as a solver
• In the Jacobian-free Newton-Krylov framework, this solver, which maps a residual into a correction, can be regarded as a preconditioner
• The true Jacobian is never formed, yet the time-implicit nonlinear residual at each time step can be made as small as needed for nonlinear consistency in long time integrations

37. Physics-based preconditioning
• In Newton iteration, one seeks to obtain a correction ("delta") to the solution by inverting the Jacobian matrix on (the negative of) the nonlinear residual
• A typical operator-split code also derives a "delta" to the solution, by some implicitly defined means, through a series of implicit and explicit substeps
• This implicitly defined mapping from residual to "delta" is a natural preconditioner
• Software must accommodate this!

38. Ex.: 1D shallow water preconditioning
• Define a continuity residual for each timestep
• Define a momentum residual for each timestep
• Write the continuity equation in delta form (*)
• Write the momentum equation in delta form (**)

39. 1D shallow water preconditioning, cont.
• Solving (**) for the momentum increment and substituting into (*) yields a parabolic equation for the height increment
• After this parabolic equation is solved, back-substitution recovers the momentum increment
• This completes the application of the preconditioner to one Newton-Krylov iteration at one timestep
• Of course, the parabolic solve need not be done exactly; one sweep of multigrid can be used
• See paper by Mousseau et al. (2002) in Ref [1] for impressive results for long-time weather integration

40. Operator-split preconditioning
• Subcomponents of a PDE operator often have special structure that can be exploited if they are treated separately
• Algebraically, this is just a generalization of Schwarz, by term instead of by subdomain
• Suppose J = S + R, and a preconditioner is to be constructed, where solves with S and with R are each "easy"
• Form a preconditioned vector from v by applying the single-term solves in sequence
• Equivalent to replacing the sum J with a product of the single-term operators
• First-order splitting error, yet often used as a solver!

41. Operator-split preconditioning, cont.
• Suppose S is convection-diffusion and R is reaction, among a collection of fields stored as gridfunctions, so that J = S + R
• On a small regular 2D grid with a five-point stencil:
• R is trivially invertible in block diagonal form
• S is invertible with one multilevel solve per field
A sketch follows.
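A hedged Python sketch of the idea, assuming a time-implicit normalization J = I - dt*(S + R) (the slides do not fix this choice); direct factorizations stand in for the per-field multilevel solve on S and for the pointwise block solves on R:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import splu, LinearOperator, gmres

nx, nf, dt = 16, 2, 0.1                        # grid side, fields, timestep
n = nx * nx
stencil = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(nx, nx))
S1 = -(sp.kron(sp.eye(nx), stencil) + sp.kron(stencil, sp.eye(nx))) * nx**2
S = sp.block_diag([S1] * nf)                   # diffusion: decoupled per field
Rpt = np.array([[-2.0, 1.0], [1.0, -3.0]])     # reaction coupling the fields
R = sp.kron(sp.csr_matrix(Rpt), sp.eye(n))     # pointwise (block-diagonal)
J = (sp.eye(nf * n) - dt * (S + R)).tocsc()

solve_S = splu((sp.eye(nf * n) - dt * S).tocsc()).solve
solve_R = splu((sp.eye(nf * n) - dt * R).tocsc()).solve

def Minv(v):
    # Split preconditioner: M^{-1} = (I - dt R)^{-1} (I - dt S)^{-1};
    # M differs from J by the dt^2 S R splitting term (first order).
    return solve_R(solve_S(v))

b = np.ones(nf * n)
x, info = gmres(J, b, M=LinearOperator(J.shape, matvec=Minv, dtype=float))
print("residual:", np.linalg.norm(b - J @ x))
```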

42. Operator-split preconditioning, cont.
• Preconditioners assembled from just the "strong" elements of the Jacobian, alternating the source-term and diffusion-term operators, are competitive in convergence rates with block-ILU on the full Jacobian
• particularly since the decoupled scalar diffusion systems are amenable to simple multigrid treatment, which is not as trivial for the coupled system
• The decoupled preconditioners store many fewer elements, significantly reduce memory bandwidth requirements, and are expected to be much faster per iteration when carefully implemented
• See "alternative block factorization" by Bank et al. in Ref [1]; incorporated into the SciDAC TSI solver by D'Azevedo

43. Using Jacobian of related discretization
• To precondition a variable-coefficient operator, such as ∇·(κ(x)∇u), use a constant-coefficient approximation ∇·(κ̄∇u), based on an average κ̄ of κ(x)
• Brown & Saad (1980) showed that, because of the availability of fast solvers, it may even be acceptable to use a Poisson solve to precondition an operator like the variable-coefficient one above
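A small sketch of this trick in Python (1D, with a rough random coefficient; a direct factorization stands in for the fast constant-coefficient solver):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg, splu, LinearOperator

n = 400
rng = np.random.default_rng(1)
kappa = 1.0 + 9.0 * rng.random(n + 1)          # rough coefficient on cell edges

def stiffness(k):
    """1D finite-difference form of -d/dx(k du/dx), zero Dirichlet BCs."""
    return sp.diags([-k[1:-1], k[:-1] + k[1:], -k[1:-1]], [-1, 0, 1])

A = stiffness(kappa).tocsr()                    # variable-coefficient operator
Minv = splu(stiffness(np.full(n + 1, kappa.mean())).tocsc()).solve  # averaged

b = np.ones(n)
x, info = cg(A, b, M=LinearOperator((n, n), matvec=Minv, dtype=float))
print("residual:", np.linalg.norm(b - A @ x))
```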

  44. Using Jacobian of lower order discretization • Orszag popularized the use of linear finite element discretizations as preconditioners for high-order spectral element discretizations in the 1970s; both approach the same continuous operator • It is common in CFD to employ first-order upwinded convective operators as approximate inversions for higher-order operators: • better factorization stability • smaller matrix bandwidth and complexity • With Jacobian-free NK, we can have the best of both worlds – a stable factorization/cheap solve and a true Jacobian step

  45. Using Jacobian with lagged terms • Newton-chord methods (e.g., papers by Smooke et al.) “freeze” the Jacobian matrices: • saves Jacobian evaluation and factorization, which can be up to 90% of the running time of the code in some apps • however, nonlinear convergence degrades to linear rate • In Jacobian-free NK, we can “freeze” some or all of the terms in the Jacobian preconditioner, while always accessing the action of the true Jacobian for the Krylov matrix-vector multiply: • still saves Jacobian work • maintains asymptotically quadratic rate for nonlinear convergence • See Knoll-Keyes (2002) for example with coupled edge plasma and Navier-Stokes, showing five-fold improvement over full Newton with constantly refreshed Jacobian on LHS, versus JFNK with preconditioner refreshed once each ten timesteps

  46. Using Jacobian with lower precision elements • Memory bandwidth is the critical architectural parameter for sparse linear algebra computations • Storing the preconditioner elements in single precision effectively doubles memory bandwidth (and potentially halves runtime) for this critical phase • We still form the Jacobian-vector product with full precision and “zero-pad” the preconditioner elements back to full length in the arithmetic unit, so the numerical quality of the Krylov subspace does not degrade
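A toy Python illustration of the precision trick (a Jacobi preconditioner stands in for the stored factors, for brevity): the preconditioner data live in float32, halving their memory traffic, and are promoted back to float64 on application so the Krylov vectors stay double:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg, LinearOperator

n = 500
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n)).tocsr()
D32 = (1.0 / A.diagonal()).astype(np.float32)   # preconditioner stored single

def Minv(r):
    # Apply in single precision, then promote ("zero-pad") back to double.
    return (D32 * r.astype(np.float32)).astype(np.float64)

b = np.ones(n)
x, info = cg(A, b, M=LinearOperator((n, n), matvec=Minv, dtype=float))
print("residual:", np.linalg.norm(b - A @ x))
```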

47. Memory BW bottleneck revealed via precision reduction
Execution times for an unstructured NKS Euler simulation on Origin 2000: double precision matrices versus single precision preconditioner.

Number of    Linear Solve         Overall
Processors   Double    Single     Double    Single
16           223s      136s       746s      657s
32           117s      67s        373s      331s
64           60s       34s        205s      181s
120          31s       16s        122s      106s

Note that times are nearly halved, along with precision, for the BW-limited linear solve phase, indicating that the BW can be at least doubled before hitting the next bottleneck!

48. PETSc and Hypre combined in "Terascale Optimal PDE Simulations" (TOPS) ISIC
• Nine institutions, five years, 24 co-PIs

49. Scope for TOPS
[diagram: optimizer, sensitivity analyzer, time integrator, nonlinear solver, eigensolver, linear solver; arrows indicate dependence among components]
• Design and implementation of "solvers"
• Time integrators (w/ sens. anal.)
• Nonlinear solvers (w/ sens. anal.)
• Optimizers
• Linear solvers
• Eigensolvers
• Software integration
• Performance optimization

50. Ex.: Hall magnetic reconnection
Magnetic Reconnection: Applications to Sawtooth Oscillations, Error Field Induced Islands and the Dynamo Effect
The research goals of this project include producing a unique high performance code and using this code to study magnetic reconnection in astrophysical plasmas, in smaller scale laboratory experiments, and in fusion devices. The modular code to be developed will be a fully three-dimensional, compressible Hall MHD code with options to run in slab, cylindrical, and toroidal geometry, and flexible enough to allow changes in algorithms as needed. The code will use adaptive grid refinement, will run on massively parallel computers, and will be portable and scalable. The research goals include studies that will provide increased understanding of sawtooth oscillations in tokamaks, magnetotail substorms, error fields in tokamaks, reverse field pinch dynamos, astrophysical dynamos, and laboratory reconnection experiments.
PI: Amitava Bhattacharjee, University of Iowa
