
Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator


Presentation Transcript


  1. Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator Kengo Nakajima nakajima@cc.u-tokyo.ac.jp Supercomputing Division, Information Technology Center, The University of Tokyo Japan Agency for Marine-Earth Science and Technology April 22, 2008

  2. 08-APR22 Overview • Background • Vector/Scalar Processors • GeoFEM Project • Earth Simulator • Preconditioned Iterative Linear Solvers • Optimization Strategy on the Earth Simulator • BIC(0)-CG Solvers for Simple 3D Linear Elastic Applications • Matrix Assembling • Summary & Future Works

  3. Large-Scale Computing • Scalar Processor • Big gap between clock rate and memory bandwidth. • Very low sustained/peak performance ratio (<10%) 08-APR22

  4. 08-APR22 Scalar Processors: CPU-Cache-Memory Hierarchical Structure [figure: memory hierarchy] • CPU with registers and cache: FAST, small capacity (MB), expensive (100M+ transistors on chip) • Main memory: SLOW, large capacity (GB), cheap

  5. Large-Scale Computing • Scalar Processor • Big gap between clock rate and memory bandwidth. • Very low sustained/peak performance ratio (<10%) • Vector Processor • Very high sustained/peak performance ratio • e.g. 35 % for FEM applications on the Earth Simulator • requires … • very special tuning • sufficiently long loops (= large-scale problem size) for certain performance • Suitable for simple computation. 08-APR22

  6. 08-APR22 Vector Processors: Vector Register & Fast Memory • Parallel processing of simple DO loops • Suitable for simple & large-scale computation [figure: vector processor, with vector registers and a very fast main memory]
       do i= 1, N
         A(i)= B(i) + C(i)
       enddo

  7. Large-Scale Computing • Scalar Processor • Big gap between clock rate and memory bandwidth. • Very low sustained/peak performance ratio (<10%) • Vector Processor • Very high sustained/peak performance ratio • e.g. 35 % for FEM applications on the Earth Simulator • requires … • very special tuning • sufficiently long loops (= large-scale problem size) for certain performance • Suitable for simple computation. 08-APR22

  8. 08-APR22 Performance of single PEs/nodes: Earth Simulator, Hitachi SR8000/MPP (U.Tokyo), IBM SP3 (LBNL), Hitachi SR11000/J2 (U.Tokyo)

                                         Earth        Hitachi       IBM SP3      Hitachi
                                         Simulator    SR8000/MPP    (LBNL)       SR11000/J2
     Peak Performance (GFLOPS)           8.00         1.80          1.50         9.20
     Measured Memory BW, STREAM
       (GB/sec/PE)                       26.6         2.85          0.623        8.00
     Estimated Performance/Core,         2.31-3.24    .291-.347     .072-.076    .880-.973
       GFLOPS (% of peak)                (28.8-40.5)  (16.1-19.3)   (4.8-5.1)    (9.6-10.6)
     Measured Performance/Core,          2.93         .335          .122         1.34
       GFLOPS (% of peak)                (36.6)       (18.6)        (8.1)        (14.5)
     BYTE/FLOP                           3.325        1.583         0.413        0.870
     Comm. BW (GB/sec/Node)              12.3         1.60          1.00         12.0
     MPI Latency (μsec)                  5.6-7.7      6-20          16.3         4.7 *

     * IBM p595, J.T. Carter et al.
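The "Measured Memory BW (STREAM)" row is obtained with the STREAM benchmark. A minimal sketch in the spirit of the STREAM "triad" kernel, not the official benchmark code, with an arbitrarily chosen array size and a single timed pass:

    ! A triad-style bandwidth estimate; illustrative only, not official STREAM.
    program stream_triad_sketch
      implicit none
      integer, parameter :: n = 10000000
      real(8), allocatable :: a(:), b(:), c(:)
      real(8) :: scalar, seconds, bytes
      integer :: i, c0, c1, rate

      allocate(a(n), b(n), c(n))
      b= 1.d0;  c= 2.d0;  scalar= 3.d0

      call system_clock(c0, rate)
      do i= 1, n
        a(i)= b(i) + scalar*c(i)      ! the "triad": 2 loads + 1 store per element
      enddo
      call system_clock(c1)

      seconds= dble(c1-c0)/dble(rate)
      bytes  = 3.d0*8.d0*dble(n)      ! nominal traffic, ignoring write-allocate
      print '(a,f8.2,a,f6.1)', 'triad ~', bytes/seconds/1.d9, ' GB/sec, a(1)=', a(1)
    end program stream_triad_sketch

The BYTE/FLOP row is simply this measured bandwidth divided by peak performance (e.g. 26.6/8.00 = 3.325 for the Earth Simulator), which is the ratio that matters for memory-bound FEM kernels.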

  9. 08-APR22 Typical Behavior [figure: sustained performance vs. problem size, roughly 40% of peak on the Earth Simulator and 8% of peak on the IBM SP3] • IBM SP3: performance is good for small problems due to the cache effect. • Earth Simulator: performance is good for large-scale problems due to the long vector length.

  10. 08-APR22 Parallel Computing: Strong Scaling (Fixed Problem Size) [figure: performance vs. PE# against the ideal line for each machine] • IBM SP3: super-linear speed-up (cache effect) for small numbers of PEs; performance falls off for many PEs due to communication overhead. • Earth Simulator: performance falls off for many PEs due to communication overhead and short vector lengths.

  11. 08-APR22 Improvement of Memory Performance: IBM SP3 ⇒ Hitachi SR11000/J2 [figure: memory performance (bandwidth, latency, etc.), ■ Flat-MPI/DCRS vs. □ Hybrid/DCRS] • IBM SP3 (POWER3): 375 MHz, 1.0 GB/sec, 8 MB L2 cache/PE • Hitachi SR11000/J2 (POWER5+): 2.3 GHz, 8.0 GB/sec, 18 MB L3 cache/PE

  12. 08-APR22 My History with Vector Computers • Cray-1S (1985-1988) • Mitsubishi Research Institute (MRI) • Cray-YMP (1988-1995) • MRI, University of Texas at Austin • Fujitsu VP, VPP series (1985-1999) • JAERI, PNC • NEC SX-4 (1997-2001) • CCSE/JAERI • Hitachi SR2201 (1997-2004) • University of Tokyo, CCSE/JAERI • Hitachi SR8000 (2000-2007) • University of Tokyo • Earth Simulator (2002-)

  13. 08-APR22 • Background • Vector/Scalar Processors • GeoFEM Project • Earth Simulator • Preconditioned Iterative Linear Solvers • Optimization Strategy on the Earth Simulator • BIC(0)-CG Solvers for Simple 3D Linear Elastic Applications • Matrix Assembling • Summary & Future Works

  14. 08-APR22 GeoFEM: FY.1998-2002, http://geofem.tokyo.rist.or.jp/ • Parallel FEM platform for solid earth simulation. • parallel I/O, parallel linear solvers, parallel visualization • solid earth: earthquake, plate deformation, mantle/core convection, etc. • Part of a national project by STA/MEXT for large-scale earth science simulations using the Earth Simulator. • Strong collaborations between the HPC and natural science (solid earth) communities.

  15. 08-APR22 System Configuration of GeoFEM

  16. 08-APR22 Results on Solid Earth Simulation [figure panels: complicated plate model around the Japan Islands; magnetic field of the Earth (MHD code); simulation of the earthquake generation cycle in Southwestern Japan; transport by groundwater flow through heterogeneous porous media; "TSUNAMI !!"]

  17. 08-APR22 Results by GeoFEM

  18. 08-APR22 Features of FEM applications (1/2) • Local “element-by-element” operations • sparse coefficient matrices • suitable for parallel computing • HUGE “indirect” accesses • IRREGULAR sparse matrices • memory intensive
       do i= 1, N
         jS= index(i-1)+1
         jE= index(i)
         do j= jS, jE
           in = item(j)
           Y(i)= Y(i) + AMAT(j)*X(in)
         enddo
       enddo
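The loop above is the sparse matrix-vector product in CRS (compressed row storage) form. A minimal self-contained sketch, keeping the slide's array names (index, item, AMAT) but filling them with an illustrative 4×4 matrix:

    ! Minimal, self-contained sketch of the CRS matrix-vector product shown
    ! above; array names follow the slide, the matrix values are illustrative.
    program crs_matvec_sketch
      implicit none
      integer, parameter :: N = 4
      integer :: index(0:N)        ! index(i) = position of the last nonzero of row i
      integer :: item(10)          ! column numbers of the nonzeros
      real(8) :: AMAT(10)          ! values of the nonzeros
      real(8) :: X(N), Y(N)
      integer :: i, j, in

      ! 1D Laplacian: 2 on the diagonal, -1 on the two off-diagonals
      index = (/ 0, 2, 5, 8, 10 /)
      item  = (/ 1,2,  1,2,3,  2,3,4,  3,4 /)
      AMAT  = (/ 2.d0,-1.d0,  -1.d0,2.d0,-1.d0,  -1.d0,2.d0,-1.d0,  -1.d0,2.d0 /)
      X = 1.d0
      Y = 0.d0

      ! indirect access through item(j): the memory-intensive part
      do i= 1, N
        do j= index(i-1)+1, index(i)
          in  = item(j)
          Y(i)= Y(i) + AMAT(j)*X(in)
        enddo
      enddo

      print '(4f6.1)', Y           ! expected: 1.0  0.0  0.0  1.0
    end program crs_matvec_sketch

The indirect load X(item(j)) is the “indirect access” referred to above: it becomes a gather operation on a vector processor and the main source of cache misses on a scalar processor.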

  19. 08-APR22 Features of FEM applications (2/2) • In parallel computation … • communication with ONLY the neighboring domains (except “dot products” etc.) • the amount of data exchanged is relatively small, because only values on the domain boundary are exchanged • communication (MPI) latency is therefore critical (a minimal halo-exchange sketch follows)
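A minimal sketch of this neighbor-only communication pattern using non-blocking MPI point-to-point calls; the 1D periodic decomposition, the buffer size, and all names (NEIBTOT, NEIB_PE, send_buf, recv_buf) are illustrative assumptions, not GeoFEM's actual interface:

    ! Sketch of a halo (boundary-value) exchange with neighboring domains
    ! using non-blocking MPI; layout and names are illustrative only.
    program halo_exchange_sketch
      use mpi
      implicit none
      integer, parameter :: NEIBTOT = 2, NBND = 100  ! 2 neighbors, 100 boundary values each
      integer :: NEIB_PE(NEIBTOT)                    ! ranks of the neighboring domains
      real(8) :: send_buf(NBND,NEIBTOT), recv_buf(NBND,NEIBTOT)
      integer :: req(2*NEIBTOT), sta(MPI_STATUS_SIZE,2*NEIBTOT)
      integer :: ierr, my_rank, nprocs, k

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, my_rank, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)

      ! neighbors in a 1D periodic decomposition (illustration only)
      NEIB_PE(1)= mod(my_rank-1+nprocs, nprocs)
      NEIB_PE(2)= mod(my_rank+1,        nprocs)
      send_buf  = dble(my_rank)          ! stands in for packed boundary values

      ! post all receives, then all sends, then wait: no deadlock, overlap possible
      do k= 1, NEIBTOT
        call MPI_IRECV(recv_buf(1,k), NBND, MPI_DOUBLE_PRECISION, NEIB_PE(k), &
                       0, MPI_COMM_WORLD, req(k), ierr)
      enddo
      do k= 1, NEIBTOT
        call MPI_ISEND(send_buf(1,k), NBND, MPI_DOUBLE_PRECISION, NEIB_PE(k), &
                       0, MPI_COMM_WORLD, req(NEIBTOT+k), ierr)
      enddo
      call MPI_WAITALL(2*NEIBTOT, req, sta, ierr)

      call MPI_FINALIZE(ierr)
    end program halo_exchange_sketch

Because only boundary values are packed into the buffers, each message is small, so the per-message latency noted above, rather than bandwidth, tends to dominate the communication cost.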

  20. 08-APR22 • Background • Vector/Scalar Processors • GeoFEM Project • Earth Simulator • Preconditioned Iterative Linear Solvers • Optimization Strategy on the Earth Simulator • BIC(0)-CG Solvers for Simple 3D Linear Elastic Applications • Matrix Assembling • Summary & Future Works

  21. 08-APR22 Earth Simulator (ES), http://www.es.jamstec.go.jp/ • 640×8= 5,120 vector processors • SMP cluster-type architecture • 8 GFLOPS/PE • 64 GFLOPS/node • 40 TFLOPS total • 16 GB memory/node, 10 TB total • 640×640 crossbar network, 16 GB/sec ×2 • Memory bandwidth: 32 GB/sec/PE • 35.6 TFLOPS for LINPACK (March 2002) • 26 TFLOPS for AFES (climate simulation)

  22. 08-APR22 Motivations • GeoFEM Project (FY.1998-2002) • FEM-type applications with complicated unstructured grids (not LINPACK, FDM …) on the Earth Simulator (ES) • Implicit Linear Solvers • Hybrid vs. Flat MPI Parallel Programming Model

  23. 08-APR22 Flat MPI vs. Hybrid [figure: SMP nodes, each with eight PEs sharing one memory] • Flat MPI: each PE runs an independent MPI process. • Hybrid: hierarchical structure; one MPI process per SMP node, parallelized inside the node.

  24. 08-APR22 • Background • Vector/Scalar Processors • GeoFEM Project • Earth Simulator • Preconditioned Iterative Linear Solvers • Optimization Strategy on the Earth Simulator • BIC(0)-CG Solvers for Simple 3D Linear Elastic Applications • Matrix Assembling • Summary & Future Works

  25. 08-APR22 Direct/Iterative Methods for Linear Equations • Direct Methods • Gaussian elimination / LU factorization. • compute A^-1 directly. • Robust for a wide range of applications. • More expensive than iterative methods (memory, CPU). • Not suitable for parallel and vector computation due to their global operations. • Iterative Methods • CG, GMRES, BiCGSTAB • Less expensive than direct methods, especially in memory. • Suitable for parallel and vector computing. • Convergence strongly depends on the problem and its boundary conditions (condition number etc.). • Preconditioning is required.

  26. 08-APR22 Preconditioning for Iterative Methods • The convergence rate of an iterative solver strongly depends on the spectral properties (eigenvalue distribution) of the coefficient matrix A. • A preconditioner M transforms the linear system into one with more favorable spectral properties. • In “ill-conditioned” problems the “condition number” (ratio of max/min eigenvalue if A is symmetric) is large. • M transforms the original equation Ax=b into A'x=b', where A'=M^-1 A, b'=M^-1 b. • ILU (Incomplete LU Factorization) and IC (Incomplete Cholesky Factorization) are well-known preconditioners.
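A minimal sketch of where M^-1 enters a Krylov iteration: preconditioned CG on a small SPD test matrix with the simplest possible M, diagonal (Jacobi) scaling. This only illustrates the role of the preconditioner; it is not the Block IC(0) preconditioner used on the Earth Simulator later in the talk:

    ! Preconditioned CG with Jacobi (diagonal) scaling on a small SPD matrix;
    ! a sketch of the algorithm only, not the GeoFEM solver.
    program pcg_jacobi_sketch
      implicit none
      integer, parameter :: N = 4
      real(8) :: A(N,N), b(N), x(N), r(N), z(N), p(N), q(N)
      real(8) :: rho, rho_old, alpha, beta
      integer :: iter, i

      ! SPD test matrix: 1D Laplacian, right-hand side of ones
      A = 0.d0
      do i= 1, N
        A(i,i)= 2.d0
        if (i > 1) A(i,i-1)= -1.d0
        if (i < N) A(i,i+1)= -1.d0
      enddo
      b = 1.d0
      x = 0.d0
      r = b - matmul(A,x)

      do iter= 1, 100
        do i= 1, N
          z(i)= r(i)/A(i,i)              ! z = M^-1 r  with M = diag(A)
        enddo
        rho= dot_product(r,z)
        if (iter == 1) then
          p= z
        else
          beta= rho/rho_old
          p   = z + beta*p
        endif
        q    = matmul(A,p)
        alpha= rho/dot_product(p,q)
        x    = x + alpha*p
        r    = r - alpha*q
        rho_old= rho
        if (sqrt(dot_product(r,r)) < 1.d-12) exit
      enddo

      print '(a,i3,a,4f8.3)', 'iterations:', iter, '   x =', x   ! expected x = 2 3 3 2
    end program pcg_jacobi_sketch

Swapping the z = M^-1 r step for a forward/backward substitution with incomplete factors gives ILU/IC-preconditioned CG; the rest of the iteration is unchanged.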

  27. 08-APR22 Strategy in GeoFEM • Iterative methods are the ONLY choice for large-scale parallel computing. • Preconditioning is important: • general methods, such as ILU(0)/IC(0), cover a wide range of applications. • problem-specific methods.

  28. 08-APR22 • Background • Vector/Scalar Processors • GeoFEM Project • Earth Simulator • Preconditioned Iterative Linear Solvers • Optimization Strategy on the Earth Simulator • BIC(0)-CG Solvers for Simple 3D Linear Elastic Applications • Matrix Assembling • Summary & Future Works

  29. 08-APR22 Block IC(0)-CG Solver on the Earth Simulator • 3D Linear Elastic Problems (SPD) • Parallel Iterative Linear Solver • Node-based Local Data Structure • Conjugate Gradient Method (CG): SPD • Localized Block IC(0) Preconditioning (Block Jacobi) • Modified IC(0): non-diagonal components of the original [A] are kept • Additive Schwarz Domain Decomposition (ASDD) • http://geofem.tokyo.rist.or.jp/ • Hybrid Parallel Programming Model • OpenMP+MPI • Re-Ordering for Vector/Parallel Performance • Comparison with Flat MPI

  30. 08-APR22 Flat MPI vs. Hybrid [figure: SMP nodes, each with eight PEs sharing one memory] • Flat MPI: each PE runs an independent MPI process. • Hybrid: hierarchical structure; one MPI process per SMP node, parallelized inside the node.

  31. 08-APR22 Local Data Structure: Node-based Partitioning [figure: a 2D mesh partitioned into four domains PE#0-PE#3, showing internal nodes, elements, and external nodes on the domain boundaries]

  32. 08-APR22 1 SMP node => 1 domain for the Hybrid Programming Model; MPI communication among domains

  33. 08-APR22 Basic Strategy for Parallel Programming on the Earth Simulator • Hypothesis • Explicit ordering is required for unstructured grids in order to achieve higher performance in the factorization processes on SMP nodes and vector processors.

  34. 08-APR22 ILU(0)/IC(0) Factorization
       do i= 2, n
         do k= 1, i-1
           if ((i,k) ∈ NonZero(A)) then
             aik := aik/akk
           endif
           do j= k+1, n
             if ((i,j) ∈ NonZero(A)) then
               aij := aij - aik*akj
             endif
           enddo
         enddo
       enddo
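A minimal runnable sketch of the pseudo-code above, holding a small matrix in dense storage with an explicit nonzero mask so the "(i,j) ∈ NonZero(A)" tests can be written directly; the matrix is illustrative, and a production code would of course work on the sparse structure itself:

    ! ILU(0) of a small matrix in dense storage with an explicit sparsity
    ! mask; illustrative only.
    program ilu0_sketch
      implicit none
      integer, parameter :: n = 4
      real(8) :: a(n,n)
      logical :: nz(n,n)
      integer :: i, j, k

      ! example pattern/values: 1D Laplacian (tridiagonal)
      a = 0.d0
      do i= 1, n
        a(i,i)= 2.d0
        if (i > 1) a(i,i-1)= -1.d0
        if (i < n) a(i,i+1)= -1.d0
      enddo
      nz = (a /= 0.d0)       ! NonZero(A): fill-in outside this pattern is dropped

      ! in-place ILU(0): L strictly below, U on and above the diagonal; the
      ! update loop is nested inside the aik test (equivalent, since the
      ! update is zero whenever (i,k) lies outside the pattern)
      do i= 2, n
        do k= 1, i-1
          if (nz(i,k)) then
            a(i,k)= a(i,k)/a(k,k)
            do j= k+1, n
              if (nz(i,j)) a(i,j)= a(i,j) - a(i,k)*a(k,j)
            enddo
          endif
        enddo
      enddo

      do i= 1, n
        print '(4f8.3)', (a(i,j), j=1,n)
      enddo
    end program ilu0_sketch

For this tridiagonal example no fill-in is discarded, so ILU(0) coincides with the exact LU factors; on general FEM matrices the fill-in dropped outside NonZero(A) is what makes the factorization "incomplete".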

  35. 08-APR22 ILU/IC Preconditioning M = (L+D)D^-1(D+U). We need to solve Mz = r; writing z1 = D^-1(D+U)z, this splits into two triangular solves. Forward substitution: (L+D)z1 = r, i.e. z1 = D^-1(r - L z1). Backward substitution: (I + D^-1 U)z_new = z1, i.e. z = z1 - D^-1 U z.

  36. 08-APR22 ILU/IC Preconditioning M = (L+D)D^-1(D+U), with L, D, U taken from A. Dependency: the most recent value of “z” at the connected nodes is needed, so vectorization/parallelization is difficult.
     Forward substitution, (L+D)z= r : z= D^-1(r-Lz)
       do i= 1, N
         WVAL= R(i)
         do j= 1, INL(i)
           WVAL= WVAL - AL(i,j) * Z(IAL(i,j))
         enddo
         Z(i)= WVAL / D(i)
       enddo
     Backward substitution, (I+ D^-1 U)znew= zold : z= z - D^-1 Uz
       do i= N, 1, -1
         SW = 0.0d0
         do j= 1, INU(i)
           SW= SW + AU(i,j) * Z(IAU(i,j))
         enddo
         Z(i)= Z(i) - SW / D(i)
       enddo
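The two loops above become runnable once the L/D/U arrays are filled; a minimal self-contained sketch for a small tridiagonal matrix, keeping the slide's array layout (INL/IAL/AL, INU/IAU/AU, D) with illustrative values:

    ! Forward/backward substitution z = M^-1 r for M = (L+D)D^-1(D+U);
    ! array layout follows the slide, the matrix values are illustrative.
    program ilu_substitution_sketch
      implicit none
      integer, parameter :: N = 4, NB = 1       ! at most 1 lower/upper neighbor
      integer :: INL(N), IAL(N,NB), INU(N), IAU(N,NB)
      real(8) :: AL(N,NB), AU(N,NB), D(N), R(N), Z(N), WVAL, SW
      integer :: i, j

      ! L, D, U of the 1D Laplacian: diagonal = 2, off-diagonals = -1
      D = 2.d0;  R = 1.d0
      INL = 1;  INL(1) = 0
      INU = 1;  INU(N) = 0
      do i= 1, N
        IAL(i,1)= i-1;  AL(i,1)= -1.d0
        IAU(i,1)= i+1;  AU(i,1)= -1.d0
      enddo

      ! forward substitution  (L+D) z = r
      do i= 1, N
        WVAL= R(i)
        do j= 1, INL(i)
          WVAL= WVAL - AL(i,j)*Z(IAL(i,j))
        enddo
        Z(i)= WVAL/D(i)
      enddo

      ! backward substitution  (I + D^-1 U) z_new = z_old
      do i= N, 1, -1
        SW = 0.0d0
        do j= 1, INU(i)
          SW= SW + AU(i,j)*Z(IAU(i,j))
        enddo
        Z(i)= Z(i) - SW/D(i)
      enddo

      print '(a,4f8.4)', 'z = M^-1 r :', Z
    end program ilu_substitution_sketch

In this natural ordering, Z(IAL(i,j)) must already hold its updated value when row i is processed; that is exactly the dependency the re-ordering on the following slides removes.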

  37. 08-APR22 Basic Strategy for Parallel Programming on the Earth Simulator • Hypothesis • Explicit ordering is required for unstructured grids in order to achieve higher performance in the factorization processes on SMP nodes and vector processors. • Re-Ordering for Highly Parallel/Vector Performance • Local operation and no global dependency • Continuous memory access • Sufficiently long loops for vectorization

  38. 08-APR22 ILU/IC Preconditioning M = (L+D)D^-1(D+U), with L, D, U taken from A. Dependency: the most recent value of “z” at the connected nodes is needed, so vectorization/parallelization is difficult. Reordering: directly connected nodes do not appear in the RHS.
     Forward substitution, (L+D)z= r : z= D^-1(r-Lz)
       do i= 1, N
         WVAL= R(i)
         do j= 1, INL(i)
           WVAL= WVAL - AL(i,j) * Z(IAL(i,j))
         enddo
         Z(i)= WVAL / D(i)
       enddo
     Backward substitution, (I+ D^-1 U)znew= zold : z= z - D^-1 Uz
       do i= N, 1, -1
         SW = 0.0d0
         do j= 1, INU(i)
           SW= SW + AU(i,j) * Z(IAU(i,j))
         enddo
         Z(i)= Z(i) - SW / D(i)
       enddo

  39. 08-APR22 Basic Strategy for Parallel Programming on the Earth Simulator • Hypothesis • Explicit ordering is required for unstructured grids in order to achieve higher performance in the factorization processes on SMP nodes and vector processors. • Re-Ordering for Highly Parallel/Vector Performance • Local operation and no global dependency • Continuous memory access • Sufficiently long loops for vectorization • 3-Way Parallelism for the Hybrid Parallel Programming Model • Inter-node: MPI • Intra-node: OpenMP • Individual PE: Vectorization

  40. 08-APR22 Re-Ordering Technique for Vector/Parallel Architectures: Cyclic DJDS (RCM+CMC) Re-Ordering (Doi, Washio, Osoda and Maruyama (NEC)) 1. RCM (Reverse Cuthill-McKee) 2. CMC (Cyclic Multicolor) 3. DJDS re-ordering 4. Cyclic DJDS for SMP units • These re-ordering steps can be substituted by traditional multi-coloring (MC).

  41. 08-APR22 Reordering = Coloring • COLOR: unit of independent sets. • Elements grouped in the same “color” are independent of each other, so parallel/vector operation is possible. • Many colors give faster convergence but shorter vector length: a trade-off !! [figure: RCM (Reverse CM), Red-Black (2 colors), and 4-color orderings]
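A minimal sketch of greedy (first-fit) coloring of a small graph stored in CRS-like adjacency arrays; the graph and the array names (xadj, adjncy) are illustrative, and the RCM/CMC orderings used in the talk are more elaborate than this, but the key property is the same: nodes of one color share no edge, so they can be processed in a single vector sweep:

    ! Greedy (first-fit) coloring of a small graph in CRS-like adjacency
    ! arrays; illustrative only.
    program greedy_coloring_sketch
      implicit none
      integer, parameter :: n = 6
      integer :: xadj(0:n), adjncy(10), color(n), used(n)
      integer :: i, j, c

      ! chain graph 1-2-3-4-5-6 (adjacency of a tridiagonal matrix)
      xadj   = (/ 0, 1, 3, 5, 7, 9, 10 /)
      adjncy = (/ 2,  1,3,  2,4,  3,5,  4,6,  5 /)

      color = 0
      do i= 1, n
        used = 0
        do j= xadj(i-1)+1, xadj(i)     ! mark colors already used by neighbors
          if (color(adjncy(j)) > 0) used(color(adjncy(j)))= 1
        enddo
        c = 1                          ! smallest color not used by any neighbor
        do while (used(c) == 1)
          c = c + 1
        enddo
        color(i)= c
      enddo

      print '(a,6i3)', 'colors:', color   ! chain graph -> red-black: 1 2 1 2 1 2
    end program greedy_coloring_sketch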

  42. 08-APR22 Large-Scale Sparse Matrix Storage for Unstructured Grids • 2D storage: long vector length, but many stored ZERO’s • 1D storage (CRS): memory saved, but short vector length

  43. 08-APR22 Re-Ordering in each Color according to the Number of Non-Zero Off-Diagonal Components • Elements in the same color are independent, therefore this intra-hyperplane re-ordering does not affect the results. • DJDS: Descending-Order Jagged Diagonal Storage
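A minimal sketch of building such a jagged-diagonal structure from a CRS matrix: rows are permuted by descending nonzero count and the nonzeros are regrouped into "jagged diagonals", so that the innermost loop of a mat-vec or substitution runs over many rows at once; the matrix and all names are illustrative:

    ! Building a descending-order jagged diagonal storage from CRS;
    ! matrix and names are illustrative only.
    program crs_to_jds_sketch
      implicit none
      integer, parameter :: n = 4, nnz = 10
      ! CRS input (same 1D-Laplacian layout as the earlier mat-vec sketch)
      integer :: index(0:n), item(nnz)
      real(8) :: amat(nnz)
      ! JDS output
      integer :: perm(n), rowlen(n), jd_ptr(0:n), jd_col(nnz)
      real(8) :: jd_val(nnz)
      integer :: i, j, jd, maxnz, cnt, tmp

      index = (/ 0, 2, 5, 8, 10 /)
      item  = (/ 1,2,  1,2,3,  2,3,4,  3,4 /)
      amat  = (/ 2.d0,-1.d0,  -1.d0,2.d0,-1.d0,  -1.d0,2.d0,-1.d0,  -1.d0,2.d0 /)

      ! row lengths, then a permutation sorting rows by descending length
      do i= 1, n
        rowlen(i)= index(i) - index(i-1)
        perm(i)  = i
      enddo
      do i= 2, n                       ! simple insertion sort
        do j= i, 2, -1
          if (rowlen(perm(j)) > rowlen(perm(j-1))) then
            tmp= perm(j);  perm(j)= perm(j-1);  perm(j-1)= tmp
          endif
        enddo
      enddo
      maxnz= rowlen(perm(1))

      ! regroup: the jd-th nonzero of every (permuted) row forms one jagged diagonal
      cnt= 0
      jd_ptr(0)= 0
      do jd= 1, maxnz
        do i= 1, n
          if (rowlen(perm(i)) >= jd) then
            cnt= cnt + 1
            jd_val(cnt)= amat(index(perm(i)-1) + jd)
            jd_col(cnt)= item(index(perm(i)-1) + jd)
          endif
        enddo
        jd_ptr(jd)= cnt
      enddo

      print '(a,4i3)',  'perm   :', perm
      print '(a,10i4)', 'jd_ptr :', jd_ptr(0:maxnz)
    end program crs_to_jds_sketch

In the solver this permutation is applied within each color, which turns the short CRS row loops into the long, vectorizable innermost loops shown on the following slides.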

  44. 08-APR22 Cyclic DJDS (MC/CM-RCM): Cyclic Re-Ordering for SMP Units, Load-Balancing among PEs
     npLX1= NLmax * PEsmpTOT
     INL(0:NLmax*PEsmpTOT*NCOLORS)
       do iv= 1, NCOLORS                  ! colors (independent sets)
     !$omp parallel do
         do ip= 1, PEsmpTOT               ! threads on the SMP node
           iv0= STACKmc(PEsmpTOT*(iv-1)+ip-1)
           do j= 1, NLhyp(iv)             ! jagged diagonals in this color
             iS= INL(npLX1*(iv-1)+PEsmpTOT*(j-1)+ip-1)
             iE= INL(npLX1*(iv-1)+PEsmpTOT*(j-1)+ip  )
     !cdir nodep
             do i= iv0+1, iv0+iE-iS       ! long, vectorized loop over rows
               k = i+iS - iv0
               kk= IAL(k)
               (Important Computations)
             enddo
           enddo
         enddo
       enddo

  45. 08-APR22 Difference between Flat MPI & Hybrid • Most of the re-ordering effort is for vectorization. • If you have a long vector, just divide it and distribute the segments to the PEs on the SMP node. • The source codes of the Hybrid and Flat MPI versions are not so different. • Flat MPI corresponds to Hybrid with one PE per SMP node. • In other words, the Flat MPI code is already sufficiently complicated.

  46. 08-APR22 Cyclic DJDS (MC/CM-RCM) for Forward/Backward Substitution in BILU Factorization

  47. 08-APR22 Simple 3D Cubic Model [figure: cube with (Nx-1)×(Ny-1)×(Nz-1) elements] • Uniformly distributed force in the z-direction at z=Zmax • Ux=0 at x=Xmin, Uy=0 at y=Ymin, Uz=0 at z=Zmin

  48. 08-APR22 Effect of Ordering

  49. 08-APR22 Effect of Re-Ordering [figure panels: long loops with continuous access; short loops with irregular access; short loops with continuous access]

  50. 08-APR22 Matrix Storage, Loops • DJDS (Descending-order Jagged Diagonal Storage), with its long innermost loops, is suitable for vector processors:
       do iv= 1, NVECT
         iv0= STACKmc(iv-1)
         do j= 1, NLhyp(iv)
           iS= index_L(NL*(iv-1)+ j-1)
           iE= index_L(NL*(iv-1)+ j  )
           do i= iv0+1, iv0+iE-iS
             k = i+iS - iv0
             kk= item_L(k)
             Z(i)= Z(i) - AL(k)*Z(kk)
           enddo
         enddo
         iS= STACKmc(iv-1) + 1
         iE= STACKmc(iv  )
         do i= iS, iE
           Z(i)= Z(i)/DD(i)
         enddo
       enddo
     • The reduction-type loop of DCRS is more suitable for cache-based scalar processors because of its localized operations:
       do i= 1, N
         SW= WW(i,Z)
         isL= index_L(i-1)+1
         ieL= index_L(i)
         do j= isL, ieL
           k= item_L(j)
           SW= SW - AL(j)*Z(k)
         enddo
         Z(i)= SW/DD(i)
       enddo
