
Sparse Direct Methods on High Performance Computers


Presentation Transcript


  1. Sparse Direct Methods on High Performance Computers X. Sherry Li xsli@lbl.gov http://crd.lbl.gov/~xiaoye CS267/EngC233: Applications of Parallel Computing April 6, 2009

  2. Sparse linear solvers • Solving a system of linear equations Ax = b • Iterative methods • A is not changed (read-only) • Key kernel: sparse matrix-vector multiply • Easier to optimize and parallelize • Low algorithmic complexity, but may not converge for hard problems • Direct methods • A is modified (factorized) • Harder to optimize and parallelize • Numerically robust, but higher algorithmic complexity • Often use direct method to precondition iterative method
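As a small illustration of the contrast above (my own toy example, not from the slides; it assumes NumPy/SciPy, whose splu wraps the sequential SuperLU):

```python
# Direct vs. iterative solve of the same sparse system -- a minimal sketch.
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n = 1000
# 1D Poisson matrix (tridiagonal, symmetric positive definite)
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csc")
b = np.ones(n)

# Direct method: A is factorized (modified), then reused for solves
lu = spla.splu(A)              # sparse LU via SuperLU
x_direct = lu.solve(b)

# Iterative method: A is only used read-only, for matrix-vector products
x_iter, info = spla.cg(A, b)

print(np.linalg.norm(x_direct - x_iter))
```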

  3. Available sparse codes • Survey of different types of factorization codes http://crd.lbl.gov/~xiaoye/SuperLU/SparseDirectSurvey.pdf • LLT (s.p.d.) • LDLT (symmetric indefinite) • LU (nonsymmetric) • QR (least squares) • Sequential, shared-memory (multicore), distributed-memory, out-of-core • Distributed-memory codes: usually MPI-based • SuperLU_DIST [Li/Demmel/Grigori] • accessible from PETSc, Trilinos • MUMPS, PasTiX, WSMP, . . .

  4. Review of Gaussian Elimination (GE) • First step of GE factors out the first row and column (see the block below) • GE is then repeated on the Schur complement C • The result is the LU factorization A = LU • L is lower triangular with a unit diagonal, U is upper triangular • x is then obtained by solving two triangular systems with L and U
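The block equation on this slide did not survive transcription; the following is the standard way to write the step, with α the pivot, v and w the first column and row of the trailing matrix, B the trailing block, and C the Schur complement that GE is repeated on (notation assumed, not copied from the slide):

```latex
\[
A =
\begin{pmatrix} \alpha & w^{T} \\ v & B \end{pmatrix}
=
\begin{pmatrix} 1 & 0 \\ v/\alpha & I \end{pmatrix}
\begin{pmatrix} \alpha & w^{T} \\ 0 & C \end{pmatrix},
\qquad
C = B - \frac{v\,w^{T}}{\alpha}.
\]
```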

  5. Sparse GE (figure: nonzero pattern of the factors L and U) • Sparse matrices are ubiquitous • Example: A of dimension 10^5, only 10~100 nonzeros per row • Every nonzero costs flops and memory • Scalar algorithm: 3 nested loops • Can re-arrange loops to get different variants: left-looking, right-looking, . . .

for i = 1 to n
    column_scale( A(:,i) )
    for k = i+1 to n s.t. A(i,k) != 0
        for j = i+1 to n s.t. A(j,i) != 0
            A(j,k) = A(j,k) - A(j,i) * A(i,k)

• Typical fill-ratio: 10x for 2D problems, 30-50x for 3D problems
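For concreteness, here is a hedged NumPy rendering of the three-loop sketch above (my own toy, not the slides' code): it runs on a dense array, skips zero entries the way a sparse code would, and does no pivoting.

```python
import numpy as np

def sparse_ge_dense(A):
    """In-place LU of a dense copy of A, mirroring the 3-loop sketch (no pivoting)."""
    A = A.astype(float).copy()
    n = A.shape[0]
    for i in range(n):
        # column_scale(A(:,i)): divide the subdiagonal of column i by the pivot
        A[i+1:, i] /= A[i, i]
        for k in range(i + 1, n):
            if A[i, k] == 0.0:          # s.t. A(i,k) != 0
                continue
            for j in range(i + 1, n):
                if A[j, i] == 0.0:      # s.t. A(j,i) != 0
                    continue
                A[j, k] -= A[j, i] * A[i, k]
    return A   # unit-diagonal L (strict lower part) and U packed together

A = np.array([[4., 0., 1.],
              [2., 3., 0.],
              [0., 1., 2.]])
print(sparse_ge_dense(A))     # note the fill-in created by the updates
```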

  6. Envelope (Profile) solver • Define a bandwidth for each row or column • A little more sophisticated than a band solver • Use skyline storage (SKS) • Lower triangle stored row by row, upper triangle stored column by column • In each row (column), the first nonzero defines a profile • All entries within the profile (some may be zeros) are stored • All fill-ins are confined to the profile • A good ordering would be based on bandwidth reduction • E.g., (reverse) Cuthill-McKee

  7. RCM ordering • Breadth-first search, numbering by levels, then reverse
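A minimal sketch of RCM in SciPy, assuming scipy is available; scipy.sparse.csgraph.reverse_cuthill_mckee implements the breadth-first, level-by-level numbering above and reverses it. The small test matrix is my own.

```python
import scipy.sparse as sp
from scipy.sparse.csgraph import reverse_cuthill_mckee

# Symmetric pattern of a small "arrow" matrix (poor bandwidth as given)
A = sp.csr_matrix([[1, 0, 0, 0, 1],
                   [0, 1, 0, 0, 1],
                   [0, 0, 1, 0, 1],
                   [0, 0, 0, 1, 1],
                   [1, 1, 1, 1, 1]], dtype=float)

perm = reverse_cuthill_mckee(A, symmetric_mode=True)
A_rcm = A[perm, :][:, perm]          # symmetrically permuted matrix
print(perm)
print(A_rcm.toarray())               # nonzeros pulled toward the diagonal
```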

  8. Is a Profile Solver Good Enough? • Example: 3 orderings (natural, RCM, MD) • Envelope size = sum of bandwidths • After LU, the envelope would be entirely filled • (Figure: envelope sizes for the three orderings: Env = 61066, Env = 31775, Env = 22320; NNZ(L) with MD = 12259)

  9. A General Data Structure: Compressed Column Storage (CCS) • Also known as Harwell-Boeing format • Store nonzeros columnwise contiguously • 3 arrays: nzval (nonzero values), rowind (their row indices), colptr (start of each column) • Storage: NNZ reals, NNZ+N+1 integers • Efficient for columnwise algorithms • "Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods", R. Barrett et al. • Example (7 x 7 matrix, 1-based indices, from the slide's figure):

nzval  = [ 1  c  2  d  e  3  k  a  4  h  b  f  5  i  l  6  g  j  7 ]
rowind = [ 1  3  2  3  4  3  7  1  4  6  2  4  5  6  7  6  5  6  7 ]
colptr = [ 1  3  6  8  11  16  17  20 ]
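A minimal SciPy sketch of the same storage scheme (SciPy's csc_matrix is CCS with 0-based indices; the 4 x 4 example values are mine, not the slide's 7 x 7 example):

```python
import numpy as np
import scipy.sparse as sp

#      [[10,  0,  0, -2],
#       [ 3,  9,  0,  0],
#       [ 0,  7,  8,  0],
#       [ 0,  0,  0,  5]]
nzval  = np.array([10., 3., 9., 7., 8., -2., 5.])   # nonzeros, column by column
rowind = np.array([0, 1, 1, 2, 2, 0, 3])            # row index of each nonzero
colptr = np.array([0, 2, 4, 5, 7])                  # start of each column in nzval

A = sp.csc_matrix((nzval, rowind, colptr), shape=(4, 4))
print(A.toarray())
# Storage: NNZ reals (nzval) plus NNZ + N + 1 integers (rowind, colptr)
```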

  10. General Sparse Solver • Use (blocked) CRS or CCS, and any ordering method • Leave room for fill-ins! (symbolic factorization) • Exploit "supernodal" (dense) structures in the factors • Can use Level 3 BLAS • Reduce inefficient indirect addressing (scatter/gather) • Reduce graph traversal time using a coarser graph

  11. Numerical Stability: Need for Pivoting • One step of GE (the block step shown on slide 4): if the pivot α is small, the update B - v·wᵀ/α is huge and some entries of B may be lost in the addition • Pivoting: swap the current diagonal entry with a larger entry from the rest of the matrix • Goal: keep the entries of L and U (the element growth) from getting too large
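A tiny numerical illustration of the point (my own numbers, not from the slides): with a pivot of 1e-20, the huge multiplier wipes out the information in B unless rows are swapped.

```python
import numpy as np

def lu_no_pivot(A):
    """Right-looking LU without pivoting; returns L and U packed together."""
    A = A.astype(float).copy()
    n = A.shape[0]
    for i in range(n):
        A[i+1:, i] /= A[i, i]
        A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i, i+1:])
    return A

A = np.array([[1e-20, 1.0],
              [1.0,   1.0]])
b = np.array([1.0, 2.0])

LU = lu_no_pivot(A)                       # multiplier 1e20 appears in L
U = np.triu(LU)
L = np.tril(LU, -1) + np.eye(2)
x_bad = np.linalg.solve(U, np.linalg.solve(L, b))

x_good = np.linalg.solve(A, b)            # LAPACK uses partial pivoting
print(x_bad, x_good)                      # x_bad ~ [0, 1], true answer ~ [1, 1]
```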

  12. Dense versus Sparse GE • Dense GE: Pr A Pc = LU • Pr and Pc are permutations chosen to maintain stability • Partial pivoting suffices in most cases: Pr A = LU • Sparse GE: Pr A Pc = LU • Pr and Pc are chosen to maintain stability and preserve sparsity • Dynamic pivoting causes dynamic structural change • Alternatives: threshold pivoting, static pivoting, . . .

  13. Algorithmic Issues in Sparse GE • Minimize the number of fill-ins, maximize parallelism • The sparsity structure of L & U depends on that of A, which can be changed by row/column permutations (vertex re-labeling of the underlying graph) • Ordering (combinatorial algorithms; NP-complete to find the optimum [Yannakakis '83]; use heuristics) • Predict the fill-in positions in L & U • Symbolic factorization (combinatorial algorithms) • Perform the factorization and triangular solutions • Numerical algorithms (F.P. operations only on nonzeros) • How and when to pivot? • These usually dominate the total runtime

  14. Ordering • RCM is good for the profile solver • General unstructured methods: • Minimum degree (locally greedy) • Nested dissection (divide-and-conquer, suitable for parallelism)

  15. Ordering: Minimum Degree (1/3) • Local greedy: minimize an upper bound on the fill-in • (Figure: eliminating vertex 1, whose neighbors are i, j, k, l, turns those neighbors into a clique, shown both in the matrix and in the graph)

  16. Minimum Degree Ordering (2/3) • Greedy approach: do the best locally • Best for modest-size problems • Hard to parallelize • At each step: • Eliminate the vertex with the smallest degree • Update the degrees of its neighbors • A straightforward implementation is slow and requires too much memory: the fill edges added can far outnumber the vertices eliminated (see the sketch below)
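The sketch below is a hedged, simplified rendering of this greedy loop on an explicit adjacency-set graph; real codes use the quotient-graph machinery of the next slide instead of forming fill edges explicitly.

```python
def minimum_degree(adj):
    """adj: dict {vertex: set(neighbors)} for a symmetric sparsity pattern."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}   # work on a copy
    order = []
    while adj:
        v = min(adj, key=lambda u: len(adj[u]))       # smallest current degree
        nbrs = adj.pop(v)
        order.append(v)
        # Eliminating v makes its neighbors a clique (these are the fill edges)
        for u in nbrs:
            adj[u].discard(v)
            adj[u].update(nbrs - {u})
    return order

# 3 x 3 five-point grid graph as a small test
grid = {(i, j): {(i+di, j+dj) for di, dj in [(-1,0),(1,0),(0,-1),(0,1)]
                 if 0 <= i+di < 3 and 0 <= j+dj < 3}
        for i in range(3) for j in range(3)}
print(minimum_degree(grid))
```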

  17. Minimum Degree Ordering (3/3) • Use the quotient graph as a compact representation [George/Liu '78] • The collection of cliques resulting from the eliminated vertices affects the degree of an uneliminated vertex • Represent each connected component in the eliminated subgraph by a single "supervertex" • Storage required to implement the QG model is bounded by the size of A • Large body of literature on implementation variants: Tinney/Walker '67, George/Liu '79, Liu '85, Amestoy/Davis/Duff '94, Ashcraft '95, Duff/Reid '95, et al.

  18. Nested Dissection Ordering (1/3) • Model problem: discretized system Ax = b from certain PDEs, e.g., a 5-point stencil on an n x n grid, N = n^2 • Theorem: ND ordering gives optimal complexity in exact arithmetic [George '73, Hoffman/Martin/Rose] • 2D (k x k grid, N = k^2): O(N log N) memory, O(N^{3/2}) operations • 3D (k x k x k grid, N = k^3): O(N^{4/3}) memory, O(N^2) operations

  19. ND Ordering (2/3) • Generalized nested dissection [Lipton/Rose/Tarjan '79] • Global graph partitioning: top-down, divide-and-conquer • Best for the largest problems • Parallel codes available: e.g., ParMetis, Scotch • First level: partition the graph into subdomains A and B separated by a vertex separator S, then recurse on A and B (see the sketch below) • Goal: find the smallest possible separator S at each level • Multilevel schemes: • Chaco [Hendrickson/Leland '94], Metis [Karypis/Kumar '95] • Spectral bisection [Simon et al. '90-'95] • Geometric and spectral bisection [Chan/Gilbert/Teng '94]
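A compact sketch of the recursion on a regular n x n grid (my own toy, not ParMetis/Scotch): the separator is simply a middle grid line, ordered after the two halves it separates.

```python
def nested_dissection(rows, cols):
    """Return grid points (i, j) of a 5-point grid in ND elimination order."""
    if len(rows) <= 2 or len(cols) <= 2:               # small block: stop recursing
        return [(i, j) for i in rows for j in cols]
    if len(cols) >= len(rows):                         # split along the longer side
        mid = len(cols) // 2
        A = nested_dissection(rows, cols[:mid])
        B = nested_dissection(rows, cols[mid+1:])
        S = [(i, cols[mid]) for i in rows]             # separator column
    else:
        mid = len(rows) // 2
        A = nested_dissection(rows[:mid], cols)
        B = nested_dissection(rows[mid+1:], cols)
        S = [(rows[mid], j) for j in cols]             # separator row
    return A + B + S                                   # separator numbered last

n = 7
order = nested_dissection(list(range(n)), list(range(n)))
perm = {p: k for k, p in enumerate(order)}             # grid point -> new label
```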

  20. ND Ordering (3/3)

  21. Ordering for LU (unsymmetric) • Can use a symmetric ordering on a symmetrized matrix • Case of partial pivoting (sequential SuperLU): use an ordering based on A^T A • Case of static pivoting (SuperLU_DIST): use an ordering based on A^T + A • Can find a better ordering based solely on A • Diagonal Markowitz [Amestoy/Li/Ng '06] • Similar to minimum degree, but without symmetrization • Hypergraph partition [Boman, Grigori, et al., '09] • Similar to ND on A^T A, but no need to compute A^T A

  22. High Performance Issues: Reduce Cost of Memory Access & Communication • Blocking to increase flops-to-bytes ratio • Aggregate small messages into one larger message • Reduce cost due to latency • Well done in LAPACK, ScaLAPACK • Dense and banded matrices • Adopted in the new generation sparse software • Performance much more sensitive to latency in sparse case

  23. Speedup Over Un-blocked Code • Sorted in increasing “reuse ratio” = #Flops/nonzeros • Up to 40% of machine peak on large sparse matrices on IBM RS6000/590, MIPS R8000, 25% on Alpha 21164

  24. Source of parallelism: Elimination Tree • For any ordering . . . • Each column corresp. to one vertex in the tree • Exhibits column dependencies during elimination • If column j updates column k, then vertex j is a descendant of vertex k • Disjoint subtrees can be eliminated in parallel
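A hedged sketch of how the tree can be computed: Liu's elimination-tree algorithm with path compression through an 'ancestor' array, for a matrix with symmetric nonzero pattern (the 2D Laplacian test matrix is my own).

```python
import numpy as np
import scipy.sparse as sp

def etree(A):
    """parent[j] = parent of column j in the elimination tree (-1 for roots)."""
    A = sp.csc_matrix(A)
    n = A.shape[0]
    parent = np.full(n, -1)
    ancestor = np.full(n, -1)
    for j in range(n):
        for i in A.indices[A.indptr[j]:A.indptr[j+1]]:   # rows of column j
            r = i
            while r < j and ancestor[r] != -1 and ancestor[r] != j:
                nxt = ancestor[r]
                ancestor[r] = j          # path compression
                r = nxt
            if r < j and ancestor[r] == -1:
                ancestor[r] = j
                parent[r] = j
    return parent

# 2D Laplacian: disjoint subtrees of the etree can be factored in parallel
L1 = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(4, 4))
A = sp.kron(sp.eye(4), L1) + sp.kron(L1, sp.eye(4))
print(etree(A))
```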

  25. Source of parallelism: Separator Tree • Ordering by graph partitioning

  26. Source of parallelism: global partition and distribution • 2D block cyclic recommended for many linear algebra algorithms • Better load balance, less communication, and BLAS-3 • (Figure: 1D blocked, 1D cyclic, 1D block cyclic, and 2D block cyclic layouts)
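A tiny sketch of the 2D block-cyclic map itself (illustrative numbers, mine): block (I, J) lives on process (I mod Pr, J mod Pc) of a Pr x Pc process grid, so every process owns an evenly scattered set of blocks.

```python
def owner(I, J, Pr, Pc):
    """Linear rank of the process owning matrix block (I, J) on a Pr x Pc grid."""
    return (I % Pr) * Pc + (J % Pc)

n_blocks, Pr, Pc = 6, 2, 3                # 6 x 6 blocks on a 2 x 3 process grid
for I in range(n_blocks):
    print(" ".join(str(owner(I, J, Pr, Pc)) for J in range(n_blocks)))
# Scattered ownership -> good load balance; whole blocks -> BLAS-3 friendly
```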

  27. Major stages of sparse LU • Ordering • Symbolic factorization • Numerical factorization – usually dominates the total time • How to pivot? • Triangular solutions

SuperLU_MT: 1. Sparsity ordering 2. Factorization (steps interleave): partial pivoting, symbolic factorization, numerical factorization (BLAS 2.5) 3. Solve

SuperLU_DIST: 1. Static pivoting 2. Sparsity ordering 3. Symbolic factorization 4. Numerical factorization (BLAS 3) 5. Solve

  28. SuperLU_MT [Li/Demmel/Gilbert] • Pthreads or OpenMP • Left-looking: many more reads than writes • Use a shared task queue to schedule ready columns in the elimination tree (bottom up) • (Figure: columns of L, U, and A partitioned among threads P1, P2, with regions marked DONE, BUSY, NOT TOUCHED)

  29. SuperLU_DIST [Li/Demmel/Grigori] • MPI • Right-looking: many more writes than reads • Global 2D block cyclic layout • One step look-ahead to overlap communication & computation • (Figure: matrix blocks mapped block-cyclically onto a 2 x 3 process mesh of processes 0-5, with the ACTIVE block column highlighted)

  30. Multicore platforms • Intel Clovertown: • 2.33 GHz Xeon, 9.3 Gflops/core • 2 sockets x 4 cores • L2 cache: 4 MB / 2 cores • Sun VictoriaFalls: • 1.4 GHz UltraSparc T2, 1.4 Gflops/core • 2 sockets x 8 cores x 8 hardware threads/core • L2 cache shared: 4 MB

  31. Single-core, single-threaded BLAS • Clovertown: Intel MKL • VictoriaFalls: Sun Performance Library (single-threaded) • Can't use the 8 hardware threads per core!

  32. Benchmark matrices

  33. Clovertown • Maximum speedup 4.3, smaller than conventional SMP • Pthreads scale better

  34. VictoriaFalls – multicore + multithread (SuperLU_MT vs. SuperLU_DIST) • Pthreads more robust, scale better • MPICH crashes with a large number of tasks; mismatch between the coarse-grain and fine-grain models

  35. Larger matrices • Sparsity ordering: METIS applied to the structure of A^T + A

  36. Strong scaling: IBM Power5 (1.9 GHz) • Up to 454 Gflops factorization rate

  37. Weak scaling • 3D K x K x K cubic grids; scale the work N^2 = K^6 with P for constant work per processor • Performance is sensitive to communication latency • Cray T3E latency: 3 microseconds ( ~ 2700 flops, 450 MHz, 900 Mflops) • IBM SP latency: 8 microseconds ( ~ 11940 flops, 1.9 GHz, 7.6 Gflops)

  38. Analysis of scalability and isoefficiency • Model problem: matrix from an 11-point Laplacian on a k x k x k (3D) mesh, nested dissection ordering, N = k^3 • Factor nonzeros (memory): O(N^{4/3}) • Number of flops (work): O(N^2) • Total communication overhead: O(N^{4/3} √P) (assuming the P processors are arranged as a 2D grid) • Isoefficiency function: efficiency stays constant if the work grows proportionally with the overhead, i.e., N^2 ∝ N^{4/3} √P, equivalent to N ∝ P^{3/4} • Memory-processor relation: memory per processor O(N^{4/3}/P) = O(1), so parallel efficiency can be kept constant with constant memory per processor, the same as dense LU in ScaLAPACK • Work-processor relation: work O(N^2) = O(P^{3/2}), so the work needs to grow faster than the number of processors
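Worked out (a sketch, assuming the O(N^{4/3} √P) overhead bound above):

```latex
\[
\underbrace{O(N^{2})}_{\text{work}} \;\propto\; \underbrace{O(N^{4/3}\sqrt{P})}_{\text{overhead}}
\;\;\Longrightarrow\;\; N \propto P^{3/4},
\]
\[
\text{memory per processor} = O\!\left(\frac{N^{4/3}}{P}\right) = O(1), \qquad
\text{work} = O(N^{2}) = O(P^{3/2}).
\]
```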

  39. Summary • Important kernel for science and engineering applications, used in practice on a regular basis • Good implementation on high-performance machines requires a large set of tools from CS and NLA • Performance is more sensitive to latency than in the dense case

  40. Open problems • Much room for optimizing performance • Automatic tuning of blocking parameters • Use of modern programming language to hide latency (e.g., UPC) • Scalability of sparse triangular solve • Switch-to-dense, partitioned inverse • Incomplete factorization (ILU preconditioner) – both sequential and parallel • Optimal complexity sparse factorization • In the spirit of fast multipole method, but for matrix inversion • J. Xia’s dissertation (May 2006) • Latency-avoiding sparse LU, QR factorizations

  41. Extra Slides

  42. Static Pivoting via Weighted Bipartite Matching • (Figure: matrix A and its bipartite graph G(A), with row vertices matched to column vertices) • Maximize the diagonal entries: sum, or product (sum of logs) • Hungarian algorithm or the like (MC64): O(n*(m+n)*log n) • Auction algorithm (more parallel): O(n*m*log(n*C)) • J. Riedy's dissertation (expected Dec. 2006?)
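A small sketch of the same idea with a generic assignment solver (not MC64 itself; my own 3 x 3 example): maximize the sum of log|a_ij| over a row-column matching, then permute the rows so large entries land on the diagonal.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

A = np.array([[0.0, 3.0, 0.1],
              [2.0, 0.0, 4.0],
              [0.5, 0.2, 0.0]])

with np.errstate(divide="ignore"):
    weights = np.log(np.abs(A))           # -inf where a_ij = 0 (entry unusable)
weights[np.isinf(weights)] = -1e9         # large negative weight instead of -inf

rows, cols = linear_sum_assignment(weights, maximize=True)
perm = np.empty(3, dtype=int)
perm[cols] = rows                         # perm[j] = row matched to column j
print(A[perm, :])                         # permuted matrix has a "heavy" diagonal
```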

  43. Numerical Accuracy: GESP versus GEPP

  44. Parallel Symbolic Factorization [Grigori/Demmel/Li '06] • Parallel ordering with ParMETIS on G(A^T + A) • Separator tree (binary) to guide the computation • Each step: one row of U, one column of L • Within each separator: 1D block cyclic distribution • Send the necessary contribution to the parent processor • Results: • Reasonable speedup: up to 6x • 5x reduction in maximum per-processor memory needs • Need to improve the memory balance

  45. Application 1: Quantum Mechanics • Scattering in a quantum system of three charged particles • The simplest example is ionization of a hydrogen atom by collision with an electron: e- + H -> H+ + 2e- • Seek the particles' wave functions represented by the time-independent Schrödinger equation • First solution to this long-standing unsolved problem [Rescigno, McCurdy, et al., Science, 24 Dec 1999]

  46. Quantum Mechanics (cont.) • Finite differences lead to complex, unsymmetric systems that are very ill-conditioned • Diagonal blocks have the structure of 2D finite difference Laplacian matrices • Very sparse: nonzeros per row <= 13 • Each off-diagonal block is a diagonal matrix • Between 6 and 24 blocks, each of order between 200K and 350K • Total dimension up to 8.4 M • Too much fill if a direct method is applied to the whole matrix . . .

  47. SuperLU_DIST as Preconditioner • SuperLU_DIST as a block-diagonal preconditioner for CGS iteration (see the sketch below): M^{-1}A x = M^{-1}b, M = diag(A11, A22, A33, …) • Run multiple SuperLU_DIST factorizations simultaneously for the diagonal blocks • No pivoting and no iterative refinement • 12 to 35 CGS iterations @ 1-2 minutes/iteration using 64 IBM SP processors • Total time: 0.5 to a few hours
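A reduced, single-node sketch of this setup in SciPy (my own toy problem; the sequential SuperLU in scipy stands in for SuperLU_DIST): factor each diagonal block once and apply M^{-1} = diag(A11^{-1}, A22^{-1}, ...) inside CGS.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def block_laplacian(nb, n):
    """Block-tridiagonal test matrix with nb diagonal blocks of order n (toy)."""
    T = sp.diags([-1.0, 4.0, -1.0], [-1, 0, 1], shape=(n, n))
    return (sp.kron(sp.eye(nb), T) - sp.kron(sp.eye(nb, k=1), sp.eye(n))
            - sp.kron(sp.eye(nb, k=-1), sp.eye(n))).tocsc()

nb, n = 6, 200
A = block_laplacian(nb, n)
b = np.ones(nb * n)

# Factor the diagonal blocks independently (in the real setup these are
# separate parallel SuperLU_DIST factorizations, one per block)
lus = [spla.splu(A[i*n:(i+1)*n, i*n:(i+1)*n].tocsc()) for i in range(nb)]

def apply_Minv(r):
    return np.concatenate([lus[i].solve(r[i*n:(i+1)*n]) for i in range(nb)])

M = spla.LinearOperator(A.shape, matvec=apply_Minv)
x, info = spla.cgs(A, b, M=M)
print("converged" if info == 0 else f"info = {info}")
```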

  48. One Block Timings on IBM SP • Complex, unsymmetric • N = 2 M, NNZ = 26 M • Fill-ins using Metis: 1.3 G (50x fill) • Factorization speed: 10x speedup (4 to 128 P), up to 30 Gflops

  49. Application 2: Accelerator Cavity Design • Calculate cavity mode frequencies and field vectors • Solve Maxwell's equations in the electromagnetic field • Omega3P simulation code developed at SLAC • (Figures: Omega3P model of a 47-cell section of the 206-cell Next Linear Collider accelerator structure; individual cells used in the accelerating structure)

  50. Accelerator (cont.) • Finite element methods lead to a large sparse generalized eigensystem K x = λ M x • Real symmetric for lossless cavities; complex symmetric when the cavities are lossy • Seek interior eigenvalues (tightly clustered) that are relatively small in magnitude
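A hedged sketch of computing a few interior eigenvalues of such a generalized problem with shift-invert ARPACK in SciPy (toy K and M of my own, real symmetric case):

```python
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n = 2000
K = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csc")
M = sp.identity(n, format="csc") * (1.0 / n)          # toy "mass" matrix

# Shift-invert around sigma: ARPACK works with (K - sigma*M)^{-1} M internally,
# which SciPy factorizes with its sparse direct solver (SuperLU)
sigma = 1e-4
vals, vecs = spla.eigsh(K, k=5, M=M, sigma=sigma, which="LM")
print(vals)       # the 5 eigenvalues of K x = lambda M x closest to sigma
```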
