
High Performance Computing: Concepts, Methods & Means HPC Libraries


Presentation Transcript


  1. High Performance Computing: Concepts, Methods & Means HPC Libraries Hartmut Kaiser, PhD Center for Computation & Technology Louisiana State University April 19th, 2007

  2. Outline • Introduction to High Performance Libraries • Linear Algebra Libraries (BLAS, LAPACK) • PDE Solvers (PETSc) • Mesh manipulation and load balancing (METIS/ParMETIS, JOSTLE) • Special purpose libraries (FFTW) • General purpose libraries (C++: Boost) • Summary – Materials for test

  3. Outline • Introduction to High Performance Libraries • Linear Algebra Libraries (BLAS, LAPACK) • PDE Solvers (PETSc) • Mesh manipulation and load balancing (METIS/ParMETIS, JOSTLE) • Special purpose libraries (FFTW) • General purpose libraries (C++: Boost) • Summary – Materials for test

  4. Puzzle of the Day
  #include <stdio.h>
  int main() {
      int a = 10;
      switch(a) {
          case '1': printf("ONE\n"); break;
          case '2': printf("TWO\n"); break;
          defa1ut: printf("NONE\n");
      }
      return 0;
  }
  If you expect the output of the above program to be NONE, I would request you to check it out!
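  The catch, spelled out: defa1ut: compiles as an ordinary, unused label rather than the default case, and case '1' / case '2' compare a against the character codes 49 and 50, not the integers 1 and 2, so no branch matches and nothing is printed at all. A corrected sketch that really does print NONE:

  #include <stdio.h>

  int main(void) {
      int a = 10;
      switch (a) {
          case 1:  printf("ONE\n");  break;   /* integer labels, not '1'/'2' */
          case 2:  printf("TWO\n");  break;
          default: printf("NONE\n"); break;   /* spelled correctly */
      }
      return 0;
  }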

  5. Application domains • Linear algebra • BLAS, ATLAS, LAPACK, ScaLAPACK, SLATEC, PIM • Ordinary and Partial Differential Equations • PETSc • Mesh manipulation and Load Balancing • METIS, ParMETIS, CHACO, JOSTLE, PARTY • Graph manipulation • Boost.Graph library • Vector/Signal/Image processing • VSIPL, PSSL • General parallelization • MPI, pthreads • Other domain-specific libraries • NAMD, NWChem, Fluent, Gaussian, LS-DYNA

  6. Application Domain Overview • Linear Algebra Libraries • Provide optimized methods for constructing sets of linear equations, performing operations on them (matrix-matrix products, matrix-vector products) and solving them (factoring, forward & backward substitution) • Commonly used libraries include BLAS, ATLAS, LAPACK, ScaLAPACK, PLAPACK • PDE Solvers: • Developing general-purpose, parallel numerical PDE libraries • Usual toolsets include manipulation of sparse data structures, iterative linear system solvers, preconditioners, nonlinear solvers and time-stepping methods • Commonly used libraries for solving PDEs include SAMRAI, PETSc, PARASOL, Overture, among others

  7. Application Domain Overview • Mesh manipulation and Load Balancing • These libraries help in partitioning meshes into roughly equal-sized pieces across processors, thereby balancing the workload while minimizing the size of separators and communication costs. • Commonly used libraries for this purpose include METIS, ParMETIS, CHACO, JOSTLE, among others. • Other packages: • FFTW: a highly optimized Fourier transform package including both real and complex multidimensional transforms in sequential, multithreaded, and parallel versions. • NAMD: molecular dynamics library available for Unix/Linux, Windows, OS X • Fluent: computational fluid dynamics package, used for such applications as environment control systems, propulsion, reactor modeling etc.
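  To make the FFTW bullet concrete, here is a minimal, illustrative sketch (not from the slides) of a 1-D complex forward transform using FFTW's sequential double-precision interface; link with -lfftw3 -lm:

  /* 1-D complex DFT with FFTW: create a plan, fill the input, execute. */
  #include <fftw3.h>

  int main(void) {
      const int n = 8;
      fftw_complex *in  = fftw_malloc(sizeof(fftw_complex) * n);
      fftw_complex *out = fftw_malloc(sizeof(fftw_complex) * n);

      /* Plan first (FFTW selects an algorithm), then fill the data. */
      fftw_plan plan = fftw_plan_dft_1d(n, in, out, FFTW_FORWARD, FFTW_ESTIMATE);

      for (int i = 0; i < n; ++i) { in[i][0] = (double)i; in[i][1] = 0.0; }

      fftw_execute(plan);               /* out[] now holds the DFT of in[] */

      fftw_destroy_plan(plan);
      fftw_free(in);
      fftw_free(out);
      return 0;
  }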

  8. Outline • Introduction to High Performance Libraries • Linear Algebra Libraries (BLAS, LAPACK) • PDE Solvers (PETSc) • Mesh manipulation and load balancing (METIS/ParMETIS, JOSTLE) • Special purpose libraries (FFTW) • General purpose libraries (C++: Boost) • Summary – Materials for test

  9. BLAS • (Updated set of) Basic Linear Algebra Subprograms • The BLAS functionality is divided into three levels: • Level 1: contains vector operations of the form y ← α x + y, as well as scalar dot products and vector norms • Level 2: contains matrix-vector operations of the form y ← α A x + β y, as well as solving Tx = y for x with T being triangular • Level 3: contains matrix-matrix operations of the form C ← α A B + β C, as well as solving B ← α inv(T) B for triangular matrices T. This level contains the widely used General Matrix Multiply operation.

  10. BLAS • Several implementations for different languages exist • Reference implementation (F77 and C): http://www.netlib.org/blas/ • ATLAS, highly optimized for particular processor architectures • A generic C++ template class library providing BLAS functionality: uBLAS (http://www.boost.org) • Several vendors provide libraries optimized for their architecture (AMD, HP, IBM, Intel, NEC, NVIDIA, Sun)

  11. BLAS: F77 naming conventions • Each routine has a name which specifies the operation, the type of matrices involved and their precision. Names are of the form PMMOO, where P encodes the precision, MM the matrix type and OO the operation. • Some of the most common operations (OO): DOT scalar product x^T y, AXPY vector sum α x + y, MV matrix-vector product A x, SV matrix-vector solve inv(A) x, MM matrix-matrix product A B, SM matrix-matrix solve inv(A) B • The types of matrices are (MM): GE general, GB general band, SY symmetric, SB symmetric band, SP symmetric packed, HE Hermitian, HB Hermitian band, HP Hermitian packed, TR triangular, TB triangular band, TP triangular packed • Each operation is defined for four precisions (P): S single real, D double real, C single complex, Z double complex • Examples: SGEMM stands for "single-precision general matrix-matrix multiply", DGEMM stands for "double-precision general matrix-matrix multiply".

  12. BLAS: C naming conventions • F77 routine name is changed to lowercase and prefixed with cblas_ • All routines which accept two-dimensional arrays have a new additional first parameter specifying the matrix memory layout (row major or column major) • Character parameters are replaced by corresponding enum values • Input arguments are declared const • Non-complex scalar input parameters are passed by value • Complex scalar input arguments are passed using a void* • Arrays are passed by address • Output scalar arguments are passed by address • Complex functions become subroutines which return the result via an additional last parameter (void*), appending _sub to the name
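  As an illustration of these conventions (assuming a CBLAS header such as the one shipped with ATLAS or the netlib reference implementation), a small sketch using cblas_dgemm and the _sub form of a complex dot product:

  /* CBLAS conventions: lowercase cblas_ names, explicit memory layout,
     const inputs, complex result returned through a trailing pointer. */
  #include <cblas.h>
  #include <stdio.h>

  int main(void) {
      /* Level 3: C = alpha*A*B + beta*C for 2x2 row-major matrices. */
      double A[4] = {1, 2, 3, 4};
      double B[4] = {5, 6, 7, 8};
      double C[4] = {0, 0, 0, 0};
      cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                  2, 2, 2, 1.0, A, 2, B, 2, 0.0, C, 2);

      /* Complex dot product: the function becomes a subroutine with a
         trailing void* output, hence the _sub suffix. */
      double x[4] = {1, 0, 2, 0};   /* two complex numbers (re, im) */
      double y[4] = {3, 0, 4, 0};
      double dot[2];
      cblas_zdotu_sub(2, x, 1, y, 1, dot);

      printf("C(0,0) = %g, dot = %g + %gi\n", C[0], dot[0], dot[1]);
      return 0;
  }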

  13. BLAS Level 1 routines • Vector operations (xROT, xSWAP, xCOPY, xAXPY etc.) • Scalar dot products (xDOT etc.) • Vector norms and related reductions (xNRM2, xASUM, IxAMAX etc.)
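  A minimal sketch of the Level 1 calling sequences through the C interface (values are arbitrary and only illustrate the calls):

  /* Level 1: axpy (y <- alpha*x + y), dot product, index of max |x[i]|. */
  #include <cblas.h>
  #include <stdio.h>

  int main(void) {
      double x[3] = {1.0, -4.0, 2.0};
      double y[3] = {2.0,  1.0, 0.0};

      cblas_daxpy(3, 2.0, x, 1, y, 1);              /* y = 2*x + y      */
      double d = cblas_ddot(3, x, 1, y, 1);         /* d = x^T y        */
      int imax = (int)cblas_idamax(3, x, 1);        /* argmax of |x[i]| */

      printf("y = [%g %g %g], dot = %g, argmax|x| = %d\n",
             y[0], y[1], y[2], d, imax);
      return 0;
  }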

  14. BLAS Level 2 routines • Matrix-vector operations (xGEMV, xGBMV, xHEMV, xHBMV etc.) • Rank-1 and rank-2 updates (xGER, xHER etc.) • Solving Tx = y for x, where T is triangular (xTRSV, xTBSV etc.)
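  A corresponding Level 2 sketch (illustrative values only): a general matrix-vector product and an in-place triangular solve:

  /* Level 2: y = A*x (dgemv) and solve T*b = rhs in place (dtrsv). */
  #include <cblas.h>
  #include <stdio.h>

  int main(void) {
      double A[4] = {1, 2,
                     3, 4};
      double x[2] = {1, 1};
      double y[2] = {0, 0};
      cblas_dgemv(CblasRowMajor, CblasNoTrans, 2, 2, 1.0, A, 2, x, 1, 0.0, y, 1);

      /* b holds the right-hand side on entry and the solution on exit;
         T is lower triangular with a non-unit diagonal. */
      double T[4] = {2, 0,
                     1, 3};
      double b[2] = {4, 7};
      cblas_dtrsv(CblasRowMajor, CblasLower, CblasNoTrans, CblasNonUnit,
                  2, T, 2, b, 1);

      printf("y = [%g %g], b = [%g %g]\n", y[0], y[1], b[0], b[1]);
      return 0;
  }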

  15. BLAS Level 3 routines • Matrix-matrix operations (xGEMM etc.) • Triangular matrix-matrix products and solves (xTRMM, xTRSM) • Widely used matrix-matrix multiply (xSYMM, xGEMM)

  16. Demo 1 • Shows a matrix multiplication example using BLAS expressed in FORTRAN, C, and C++ • Shows the genericity of uBLAS by comparing generic and banded matrix versions • Shows newmat, a C++ matrix library which uses operator overloading

  17. Outline • Introduction to High Performance Libraries • Linear Algebra Libraries (BLAS, LAPACK) • PDE Solvers (PETSc) • Mesh manipulation and load balancing (METIS/ParMETIS, JOSTLE) • Special purpose libraries (FFTW) • General purpose libraries (C++: Boost) • Summary – Materials for test

  18. LAPACK • Linear Algebra PACKage • http://www.netlib.org/lapack/ • Written in F77 • Provides routines for • Solving systems of simultaneous linear equations, • Least-squares solutions of linear systems of equations, • Eigenvalue problems, • Householder transformations to implement QR decomposition on a matrix, and • Singular value problems • Was initially designed to run efficiently on shared-memory vector machines • Depends on BLAS • Has been extended for distributed-memory systems (ScaLAPACK and PLAPACK)
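  For illustration only (the slide targets the F77 library; the LAPACKE C wrapper used below appeared later and is an assumption here), solving a dense system A x = b with dgesv, which combines an LU factorization with forward/backward substitution:

  /* Solve a 3x3 dense system with LAPACK's dgesv via the LAPACKE C interface.
     Link with something like -llapacke -llapack -lblas. */
  #include <lapacke.h>
  #include <stdio.h>

  int main(void) {
      double A[9] = { 2, 1, 1,
                      1, 3, 2,
                      1, 0, 0 };
      double b[3] = { 4, 5, 6 };    /* right-hand side, overwritten with x */
      lapack_int ipiv[3];           /* pivot indices from the LU factorization */

      lapack_int info = LAPACKE_dgesv(LAPACK_ROW_MAJOR, 3, 1, A, 3, ipiv, b, 1);
      if (info != 0) {
          printf("dgesv failed: info = %d\n", (int)info);
          return 1;
      }
      printf("x = [%g %g %g]\n", b[0], b[1], b[2]);
      return 0;
  }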

  19. LAPACK (Architecture)

  20. LAPACK naming conventions • Very similar to BLAS • XYYZZZ • X: data type • S: REAL • D: DOUBLE PRECISION • C: COMPLEX • Z: COMPLEX*16 or DOUBLE COMPLEX • YY: matrix type • BD: bidiagonal • DI: diagonal • GB: general band • GE: general (i.e., unsymmetric, in some cases rectangular) • GG: general matrices, generalized problem (i.e., a pair of general matrices) • GT: general tridiagonal • HB: (complex) Hermitian band • HE: (complex) Hermitian • HG: upper Hessenberg matrix, generalized problem (i.e., a Hessenberg and a triangular matrix) • HP: (complex) Hermitian, packed storage • HS: upper Hessenberg • OP: (real) orthogonal, packed storage • OR: (real) orthogonal • PB: symmetric or Hermitian positive definite band • PO: symmetric or Hermitian positive definite • PP: symmetric or Hermitian positive definite, packed storage • PT: symmetric or Hermitian positive definite tridiagonal • SB: (real) symmetric band • SP: symmetric, packed storage • ST: (real) symmetric tridiagonal • SY: symmetric • TB: triangular band • TG: triangular matrices, generalized problem (i.e., a pair of triangular matrices) • TP: triangular, packed storage • TR: triangular (or in some cases quasi-triangular) • TZ: trapezoidal • UN: (complex) unitary • UP: (complex) unitary, packed storage • ZZZ: performed computation • Linear systems • Factorizations • Eigenvalue problems • Singular value decomposition • Etc.

  21. Demo 2 • Shows how using a library might speed up the computation considerably

  22. Outline • Introduction to High Performance Libraries • Linear Algebra Libraries (BLAS, LAPACK) • PDE Solvers (PETSc) • Mesh manipulation and load balancing (METIS/ParMETIS, JOSTLE) • Special purpose libraries (FFTW) • General purpose libraries (C++: Boost) • Summary – Materials for test

  23. PETSc (pronounced PET-see) • Portable, Extensible Toolkit for Scientific Computation (http://www-unix.mcs.anl.gov/petsc/petsc-as/) • Suite of data structures and routines for the scalable (parallel) solution of scientific applications modeled by partial differential equations (PDEs) • Employs the MPI standard for all message-passing communication • Intended for use in large-scale application projects • Includes a large suite of parallel linear and nonlinear equation solvers • Easily used in application codes written in C, C++, Fortran and Python • Good introduction: http://www-unix.mcs.anl.gov/petsc/petsc-as/documentation/tutorials/nersc02/nersc02.ppt

  24. PETSc (general features) • Features include: • Parallel vectors • Scatters (handles communicating ghost point information) • Gathers • Parallel matrices • Several sparse storage formats • Easy, efficient assembly. • Scalable parallel preconditioners • Krylov subspace methods • Parallel Newton-based nonlinear solvers • Parallel time stepping (ODE) solvers

  25. PETSc (Architecture) PETSc: Module architecture and layers of abstraction

  26. PETSc: Component details • Vector operations (Vec): Provides the vector operations required for setting up and solving large-scale linear and nonlinear problems. Includes easy-to-use parallel scatter and gather operations, as well as special-purpose code for handling ghost points for regular data structures. • Matrix operations (Mat): A large suite of data structures and code for the manipulation of parallel sparse matrices. Includes four different parallel matrix data structures, each appropriate for a different class of problems. • Preconditioners (PC): A collection of sequential and parallel preconditioners, including • (sequential) ILU(k) (incomplete factorization), • LU (lower/upper decomposition), • both sequential and parallel block Jacobi, overlapping additive Schwarz methods • Time stepping ODE solvers (TS): Code for the time evolution of solutions of PDEs. In addition, provides pseudo-transient continuation techniques for computing steady-state solutions.

  27. PETSc: Component details • Krylov subspace solvers (KSP): Parallel implementations of many popular Krylov subspace iterative methods, including • GMRES (Generalized Minimal Residual method), • CG (Conjugate Gradient), • CGS (Conjugate Gradient Squared), • Bi-CG-Stab (BiConjugate Gradient Stabilized), • two variants of TFQMR (transpose-free QMR), • CR (Conjugate Residuals), • LSQR (an iterative least-squares solver). All are coded so that they are immediately usable with any preconditioners and any matrix data structures, including matrix-free methods. • Non-linear solvers (SNES): Data-structure-neutral implementations of Newton-like methods for nonlinear systems. Includes both line search and trust region techniques with a single interface. Employs by default the above data structures and linear solvers. Users can set custom monitoring routines, convergence criteria, etc.
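  A minimal sketch (not from the slides) tying Vec, Mat, and KSP together: assemble a 1-D Laplacian in parallel and solve A x = b with whatever Krylov method and preconditioner are selected at run time. It assumes a reasonably recent PETSc (3.5 or later for the three-argument KSPSetOperators); error checking is omitted for brevity:

  #include <petscksp.h>

  int main(int argc, char **argv) {
      Mat A; Vec x, b; KSP ksp;
      PetscInt i, rstart, rend, n = 100;

      PetscInitialize(&argc, &argv, NULL, NULL);

      MatCreate(PETSC_COMM_WORLD, &A);
      MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);
      MatSetFromOptions(A);
      MatSetUp(A);

      /* Each rank fills only the rows it owns. */
      MatGetOwnershipRange(A, &rstart, &rend);
      for (i = rstart; i < rend; i++) {
          PetscInt    cols[3] = {i - 1, i, i + 1};
          PetscScalar vals[3] = {-1.0, 2.0, -1.0};
          if (i == 0)          MatSetValues(A, 1, &i, 2, &cols[1], &vals[1], INSERT_VALUES);
          else if (i == n - 1) MatSetValues(A, 1, &i, 2, cols, vals, INSERT_VALUES);
          else                 MatSetValues(A, 1, &i, 3, cols, vals, INSERT_VALUES);
      }
      MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
      MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

      VecCreate(PETSC_COMM_WORLD, &b);
      VecSetSizes(b, PETSC_DECIDE, n);
      VecSetFromOptions(b);
      VecDuplicate(b, &x);
      VecSet(b, 1.0);

      KSPCreate(PETSC_COMM_WORLD, &ksp);
      KSPSetOperators(ksp, A, A);
      KSPSetFromOptions(ksp);     /* e.g. -ksp_type cg -pc_type bjacobi */
      KSPSolve(ksp, b, x);

      KSPDestroy(&ksp); MatDestroy(&A); VecDestroy(&x); VecDestroy(&b);
      PetscFinalize();
      return 0;
  }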

  28. Outline • Introduction to High Performance Libraries • Linear Algebra Libraries (BLAS, LAPACK) • PDE Solvers (PETSc) • Mesh manipulation and load balancing (METIS/ParMETIS, JOSTLE) • Special purpose libraries (FFTW) • General purpose libraries (C++: Boost) • Summary – Materials for test

  29. Mesh libraries • Introduction • Structured/unstructured meshes • Examples • Mesh decomposition

  30. Introduction to Meshes and Grids • Mesh/Grid: 2D or 3D representation of the computational domain. • Common 2D meshes are composed of triangular or quadrilateral elements • Common 3D meshes are composed of hexahedral, tetrahedral or pyramidal elements [Figure: 2D mesh elements (triangle, quadrilateral); 3D mesh elements (tetrahedron, prism, hexahedron)]

  31. Structured/Unstructured Meshes
  Structured Grids (Meshes) • Cartesian grids, logically rectangular grids • Mesh info accessed implicitly using grid point indices • Efficient in both computation and storage • Typically use finite difference discretization
  Unstructured Meshes • Mesh connectivity information must be stored • Incurs additional memory and computational cost • Handles complex geometries and grid adaptivity • Typically use finite volume or finite element discretization • Mesh quality becomes a concern

  32. Mesh examples

  33. Meshes are used for Computation

  34. Mesh Decomposition • Goal is to maximize interior while minimizing connections between subdomains. That is, minimize communication. • Such decomposition problems have been studied in load balancing for parallel computation. • Lots of choices: • METIS, ParMETIS -- University of Minnesota. • PARTI -- University of Maryland, • CHACO -- Sandia National Laboratories, • JOSTLE -- University of Greenwich, • PARTY -- University of Paderborn, • SCOTCH -- Université Bordeaux, • TOP/DOMDEC -- NAS at NASA Ames Research Center. http://www.hlrs.de

  35. Mesh Decomposition • Load balancing • Distribute elements evenly across processors. • Each processor should have equal share of work. • Communication costs should be minimized. • Minimize sub-domain boundary elements. • Minimize number of neighboring domains. • Distribution should reflect machine architecture. • Communication versus calculation. • Bandwidth versus latency. • Note that optimizing load balance and communication cost simultaneously is an NP-hard problem. http://www.epcc.ed.ac.uk/epcc-tec/documents/meshdecomp-slides/MeshDecomp-13.html
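  A small illustrative sketch of graph partitioning, assuming the METIS 5.x C API (the deck predates that release, so treat the exact signature as an assumption): partition a 2x3 grid graph, given in the CSR form METIS expects, into two parts. In a real mesh code the graph would come from the element connectivity of the mesh:

  #include <metis.h>
  #include <stdio.h>

  int main(void) {
      /* Graph: 0-1-2
                | | |
                3-4-5          */
      idx_t nvtxs = 6, ncon = 1, nparts = 2;
      idx_t xadj[7]    = {0, 2, 5, 7, 9, 12, 14};
      idx_t adjncy[14] = {1, 3,  0, 2, 4,  1, 5,  0, 4,  1, 3, 5,  2, 4};
      idx_t part[6], objval;

      /* NULL selects defaults for vertex/edge weights, target weights, options. */
      int status = METIS_PartGraphKway(&nvtxs, &ncon, xadj, adjncy,
                                       NULL, NULL, NULL, &nparts,
                                       NULL, NULL, NULL, &objval, part);
      if (status != METIS_OK) return 1;

      printf("edge-cut = %d, partition:", (int)objval);
      for (int i = 0; i < 6; i++) printf(" %d", (int)part[i]);
      printf("\n");
      return 0;
  }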

  36. Mesh decomposition http://www.hlrs.de

  37. Static and Dynamic Meshes
  Static Grids (Meshes) • Decomposition need only be carried out once • Static decomposition may therefore be carried out as a preprocessing step, often done in serial
  Dynamic Meshes • Decomposition must be adapted as the underlying mesh or processor load changes • Dynamic decomposition therefore becomes part of the calculation itself and cannot be carried out solely as a pre-processing step
  http://www.epcc.ed.ac.uk/epcc-tec/documents/meshdecomp-slides/MeshDecomp-14.html

  38. HP J6700, 1 CPU Solve Time: 13:26 (Baseline Time) src: Amy Apon, http://www.csce.uark.edu/~aapon/courses/concurrent/notes/marc-ddm.ppt

  39. Linux Cluster, 2 CPUs Solve Time: 5:20 Speed-Up: 2.5X src: Amy Apon, http://www.csce.uark.edu/~aapon/courses/concurrent/notes/marc-ddm.ppt

  40. Linux Cluster, 4 CPUs Solve Time: 3:07 Speed-Up: 4.3X src: Amy Apon, http://www.csce.uark.edu/~aapon/courses/concurrent/notes/marc-ddm.ppt

  41. Linux Cluster, 8 CPUs Solve Time: 1:51 Speed-Up: 7.3X src: Amy Apon, http://www.csce.uark.edu/~aapon/courses/concurrent/notes/marc-ddm.ppt

  42. Linux Cluster, 16 CPUs Solve Time: 1:03 Speed-Up: 12.8X src: Amy Apon, http://www.csce.uark.edu/~aapon/courses/concurrent/notes/marc-ddm.ppt
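  (The speed-up figures above are simply the baseline solve time divided by the parallel solve time: for example, 13:26 / 1:03 = 806 s / 63 s ≈ 12.8 on 16 CPUs, i.e. a parallel efficiency of about 80%.)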

  43. Speedup due to decomposition

  44. Jostle and Metis http://www.hlrs.de

  45. Jostle http://www.hlrs.de

  46. Jostle http://www.hlrs.de

  47. Jostle http://www.hlrs.de

  48. Metis http://www.hlrs.de

  49. ParMetis http://www.hlrs.de

  50. Metis (serial) http://www.hlrs.de
