
Achievements and challenges running GPU-accelerated Quantum ESPRESSO on heterogeneous clusters



  1. Achievements and challenges running GPU-accelerated Quantum ESPRESSO on heterogeneous clusters • Filippo Spiga 1,2 <fs395@cam.ac.uk> • 1 HPCS, University of Cambridge • 2 Quantum ESPRESSO Foundation

  2. What is Quantum ESPRESSO?
  • Quantum ESPRESSO is an integrated suite of computer codes for atomistic simulations based on DFT, pseudo-potentials, and plane waves
  • "ESPRESSO" stands for opEnSource Package for Research in Electronic Structure, Simulation, and Optimization
  • Quantum ESPRESSO is an initiative of SISSA, EPFL, and ICTP, with many partners in Europe and worldwide
  • Quantum ESPRESSO is free software that can be freely downloaded; everybody is free to use it and welcome to contribute to its development

  3. What can Quantum ESPRESSO do?
  • ground-state calculations
  • Kohn-Sham orbitals and energies, total energies and atomic forces
  • finite as well as infinite systems
  • any crystal structure or supercell
  • insulators and metals (different schemes of BZ integration)
  • structural optimization (many minimization schemes available)
  • transition states and minimum-energy paths (via NEB or string dynamics)
  • electronic polarization via Berry's phase
  • finite electric fields via saw-tooth potential or electric enthalpy
  • norm-conserving as well as ultra-soft and PAW pseudo-potentials
  • many different energy functionals, including meta-GGA, DFT+U, and hybrids (van der Waals soon to be available)
  • scalar-relativistic as well as fully relativistic (spin-orbit) calculations
  • magnetic systems, including non-collinear magnetism
  • Wannier interpolations
  • ab-initio molecular dynamics
  • Car-Parrinello (many ensembles and flavors)
  • Born-Oppenheimer (many ensembles and flavors)
  • QM-MM (interface with LAMMPS)
  • linear response and vibrational dynamics
  • phonon dispersions, real-space interatomic force constants
  • electron-phonon interactions and superconductivity
  • effective charges and dielectric tensors
  • third-order anharmonicities and phonon lifetimes
  • infrared and (off-resonance) Raman cross sections
  • thermal properties via the quasi-harmonic approximation
  • electronic excited states
  • TDDFT for very large systems (both real-time and "turbo-Lanczos")
  • MBPT for very large systems (GW, BSE)
  ... plus several post-processing tools!

  4. Quantum ESPRESSO in numbers
  • 350,000+ lines of FORTRAN/C code
  • 46 registered developers
  • 1600+ registered users
  • 5700+ downloads of the latest 5.x.x version
  • 2 web-sites (quantum-espresso.org & qe-forge.org)
  • 1 official user mailing-list, 1 official developer mailing-list
  • 24 international schools and training courses (1000+ participants)

  5. PWscf in a nutshell: program flow
  [Program-flow diagram: the SCF cycle alternates stages dominated by 3D-FFT + GEMM + LAPACK, 3D-FFT, and 3D-FFT + GEMM kernels]

  6. Spoiler!
  • Only PWscf is ported to GPU
  • Serial performance (full socket vs full socket + GPU): 3x ~ 4x
  • Parallel performance (best MPI+OpenMP vs the same + GPU): 2x ~ 3x
  • Designed to run better at low numbers of nodes (efficiency not high)
  • Spin magnetization and non-collinear calculations not ported (working on it)
  • I/O set low on purpose
  • NVIDIA Kepler GPUs not yet exploited at their best (working on it)

  7. Achievement: smart and selective BLAS (phiGEMM: CPU+GPU GEMM operations)
  • a drop-in library won't work as expected; control is needed
  • overcome the limit of the GPU memory
  • flexible interface (C on the HOST, C on the DEVICE)
  • dynamic workload adjustment (SPLIT), based on a heuristic
  • call-by-call profiling capabilities
  [Diagram: C = A × B + C is split so that the GPU computes C1 = A1 × B while the CPU computes C2 = A2 × B, with H2D/D2H copies and a possible load unbalance between the two shares]
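To make the SPLIT idea concrete, here is a minimal, hedged sketch of a CPU+GPU split DGEMM in the spirit of phiGEMM (illustrative only, not the phiGEMM sources or its actual interface): the first n_gpu columns of B and C go to cuBLAS, the rest to the host BLAS, and the two shares overlap because the cuBLAS launch returns control to the host immediately. The split fraction here is just a parameter; in phiGEMM it would come from the heuristic and the call-by-call profiling.

```c
/* Hypothetical sketch of a CPU+GPU split GEMM (column-major):
 * C = alpha*A*B + beta*C, with the first n_gpu columns of B and C on the
 * GPU and the remaining n_cpu columns on the host BLAS. Assumes the GPU
 * share fits in device memory. */
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <cblas.h>                 /* any host BLAS exposing cblas_dgemm */

void split_dgemm(int m, int n, int k, double alpha, const double *A,
                 const double *B, double beta, double *C, double split)
{
    int n_gpu = (int)(n * split);  /* columns of B/C computed on the GPU */
    int n_cpu = n - n_gpu;         /* columns of B/C computed on the CPU */

    cublasHandle_t handle;
    cublasCreate(&handle);

    double *dA, *dB, *dC;
    cudaMalloc((void **)&dA, sizeof(double) * (size_t)m * k);
    cudaMalloc((void **)&dB, sizeof(double) * (size_t)k * n_gpu);
    cudaMalloc((void **)&dC, sizeof(double) * (size_t)m * n_gpu);

    cudaMemcpy(dA, A, sizeof(double) * (size_t)m * k, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B, sizeof(double) * (size_t)k * n_gpu, cudaMemcpyHostToDevice);
    cudaMemcpy(dC, C, sizeof(double) * (size_t)m * n_gpu, cudaMemcpyHostToDevice);

    /* GPU share: the kernel launch is asynchronous w.r.t. the host */
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n_gpu, k,
                &alpha, dA, m, dB, k, &beta, dC, m);

    /* CPU share runs concurrently with the GPU kernel */
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, m, n_cpu, k,
                alpha, A, m, B + (size_t)k * n_gpu, k,
                beta, C + (size_t)m * n_gpu, m);

    /* synchronizes implicitly: wait for the GPU result and copy it back */
    cudaMemcpy(C, dC, sizeof(double) * (size_t)m * n_gpu, cudaMemcpyDeviceToHost);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    cublasDestroy(handle);
}
```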

  8. Challenge: rectangular GEMM (bad shape, poor performance)
  Issues:
  • A and B can be larger than the GPU memory
  • A and B matrices are "badly" rectangular (one dominant dimension, the common case due to data distribution: m and n small, k large)
  Solutions (~ +15% performance):
  • tiling approach: tiles not too big, not too small
  • GEMM computation must exceed the copies (H2D, D2H), especially for small tiles
  • handling the "SPECIAL-K" case: beta × C added once, then alpha × Ai × Bi accumulated over the k tiles (see the sketch below)
  Optimizations included in phiGEMM (version > 1.9)
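As an illustration of the "SPECIAL-K" handling, here is a hedged sketch (not the phiGEMM implementation, and without the copy/compute streaming that makes the tiles pay off in practice): the dominant k dimension is walked in device-sized tiles, beta × C is applied only with the first tile, and every later tile accumulates with beta = 1.

```c
#include <cuda_runtime.h>
#include <cublas_v2.h>

/* Accumulate C = alpha*A*B + beta*C on the GPU by tiling the dominant k
 * dimension (column-major). dA_tile (m x k_tile), dB_tile (k_tile x n) and
 * dC (m x n, already holding C) are pre-allocated device buffers; k_tile is
 * chosen so that the three buffers fit in GPU memory. */
void special_k_dgemm(cublasHandle_t handle, int m, int n, int k, int k_tile,
                     double alpha, const double *A, const double *B,
                     double beta, double *dA_tile, double *dB_tile, double *dC)
{
    double one = 1.0;
    for (int k0 = 0; k0 < k; k0 += k_tile) {
        int kb = (k - k0 < k_tile) ? (k - k0) : k_tile;

        /* columns k0 .. k0+kb of A are contiguous (A is m x k, lda = m) */
        cudaMemcpy(dA_tile, A + (size_t)m * k0,
                   sizeof(double) * (size_t)m * kb, cudaMemcpyHostToDevice);
        /* rows k0 .. k0+kb of B are strided: copy a kb x n sub-matrix */
        cublasSetMatrix(kb, n, sizeof(double), B + k0, k, dB_tile, kb);

        /* beta*C is applied exactly once; later tiles accumulate (beta = 1) */
        cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, kb,
                    &alpha, dA_tile, m, dB_tile, kb,
                    (k0 == 0) ? &beta : &one, dC, m);
    }
}
```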

  9. Challenge: parallel 3D-FFT
  • 3D-FFT burns up to 40%~45% of the total SCF run-time
  • ~90% of the 3D-FFTs in PWscf are inside vloc_psi (the "wave" grid)
  • each 3D-FFT is "small": fewer than 300^3 COMPLEX DP elements
  • the 3D-FFT grid need not be a cube
  • in serial a 3D-FFT is called as it is; in parallel, 3D-FFT = Σ 1D-FFT
  • in serial the data layout is straightforward; in parallel it is not
  • MPI communication becomes the big issue for many-node runs
  • GPU FFT is mainly memory-bound → grouping & batching of the 3D-FFTs (sketched below)
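The "grouping & batching" point can be sketched with cuFFT's plan-many interface (a minimal example with an assumed packed stick layout, not the QE-GPU code): all 1D transforms along the locally owned sticks are issued as a single batched call instead of one call per stick.

```c
#include <cufft.h>

/* One batched plan for all the z-sticks owned by this MPI rank, assuming
 * stick s occupies the contiguous range [s*nz, (s+1)*nz) of the buffer. */
cufftHandle make_stick_plan(int nz, int nsticks)
{
    cufftHandle plan;
    int n[1] = { nz };
    cufftPlanMany(&plan, 1, n,
                  NULL, 1, nz,          /* input : packed, stride 1, dist nz */
                  NULL, 1, nz,          /* output: packed, stride 1, dist nz */
                  CUFFT_Z2Z, nsticks);  /* double-complex, nsticks at once   */
    return plan;
}

/* later: a single call transforms every local stick of the grid
 *   cufftExecZ2Z(plan, d_sticks, d_sticks, CUFFT_FORWARD);                  */
```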

  10. Challenge: FFT data layout (it is all about sticks & planes)
  • A single 3D-FFT is divided into independent 1D-FFTs
  • There are two "FFT grid" representations in reciprocal space: wave functions (Ecut) and charge density (4 Ecut)
  • Data are not contiguous and not "trivially" distributed across processors
  • Zeros are not transformed; the different cut-offs preserve accuracy
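As background (standard plane-wave reasoning, not specific to this talk), the two grids differ because the density is built from products of wave functions, so its Fourier components extend to twice the wave-function cutoff radius, i.e. four times the cutoff energy:

```latex
\frac{\hbar^{2}}{2m}\,\lvert \mathbf{k}+\mathbf{G}\rvert^{2} \le E_{\mathrm{cut}}
\quad\text{(wave functions)}
\qquad\Longrightarrow\qquad
\frac{\hbar^{2}}{2m}\,\lvert \mathbf{G}\rvert^{2} \le 4\,E_{\mathrm{cut}}
\quad\text{(charge density } \rho = \textstyle\sum_{v}\lvert\psi_{v}\rvert^{2}\text{)}
```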

  11. Challenge: parallel 3D-FFT (Optimization #1)
  • CUDA-enabled MPI for P2P (within the socket)
  • Overlap FFT computation with MPI communication
  • MPI communication >>> FFT computation for many nodes
  [Timeline diagram: Sync, MemCpy H2D, MPI, MemCpy D2H]
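A hedged sketch of the overlap idea, assuming a CUDA-aware MPI so that device pointers can be handed to MPI directly; the half/half split and the buffer names are illustrative, not the PWscf ones.

```c
#include <mpi.h>
#include <cufft.h>
#include <cuda_runtime.h>

/* Transform the first half of the local sticks, start exchanging it with a
 * non-blocking all-to-all, and transform the second half while the network
 * is busy. count_per_rank = elements of each half sent to every rank. */
void fft_exchange_overlap(cufftHandle plan_half,
                          cufftDoubleComplex *d_half1,
                          cufftDoubleComplex *d_half2,
                          cufftDoubleComplex *d_recv1,
                          int count_per_rank, MPI_Comm comm)
{
    MPI_Request req;

    cufftExecZ2Z(plan_half, d_half1, d_half1, CUFFT_FORWARD);
    cudaDeviceSynchronize();                      /* half 1 ready to send */

    MPI_Ialltoall(d_half1, count_per_rank, MPI_C_DOUBLE_COMPLEX,
                  d_recv1, count_per_rank, MPI_C_DOUBLE_COMPLEX,
                  comm, &req);                    /* communication ...    */

    cufftExecZ2Z(plan_half, d_half2, d_half2, CUFFT_FORWARD);
    cudaDeviceSynchronize();                      /* ... overlaps the FFT */

    MPI_Wait(&req, MPI_STATUS_IGNORE);
}
```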

  12. Challenge: parallel 3D-FFT (Optimizations #2, #3, #4)
  Optimization #2. Observation: overlapping the D2H copy is limited by the MPI communication
  • pinned host memory is needed (!!!)
  • stream the D2H copy to hide the CPU copy and the FFT computation
  Optimization #3. Observation: MPI messages become small for many nodes
  • re-order data before communication
  • batch the MPI_Alltoallv communications
  Optimization #4. Idea: reduce the data transmitted (risky...)
  • perform FFTs and GEMM in DP, truncate the data to SP before communication (sketched below)
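Optimization #4 can be sketched with two small conversion kernels (a hedged illustration, not the actual QE-GPU kernels): the FFTs and GEMMs stay in double precision, but the buffer handed to MPI_Alltoallv is truncated to single precision (MPI_C_FLOAT_COMPLEX), halving the traffic, and is promoted back after the exchange.

```cuda
#include <cuComplex.h>
#include <cufft.h>

/* Truncate double-complex FFT data to single precision before the exchange */
__global__ void dp_to_sp(const cufftDoubleComplex *in, cuComplex *out, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) out[i] = make_cuComplex((float)in[i].x, (float)in[i].y);
}

/* Promote the received single-precision data back to double precision */
__global__ void sp_to_dp(const cuComplex *in, cufftDoubleComplex *out, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) out[i] = make_cuDoubleComplex((double)in[i].x, (double)in[i].y);
}

/* usage sketch:
 *   dp_to_sp<<<blocks, threads>>>(d_dp, d_sp, n);
 *   MPI_Alltoallv(d_sp, sendcnt, sdisp, MPI_C_FLOAT_COMPLEX,
 *                 d_sp_recv, recvcnt, rdisp, MPI_C_FLOAT_COMPLEX, comm);
 *   sp_to_dp<<<blocks, threads>>>(d_sp_recv, d_dp, n_recv);                 */
```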

  13. Achievements: parallel 3D-FFT in miniDFT 1.6 (k-point calculations, ultra-soft pseudo-potentials)
  • Optimization #1: +37% improvement in communication
  • Optimization #2: (chart compared runs with and without proper stream management)
  • Optimization #3: +10% improvement in communication
  • Optimization #4: +52% (!!!) improvement in communication (SP vs DP)
  Lower gain in PWscf!

  14. Challenge: parallel 3D-FFT
  [Diagram: (1) all FFT data copied back to host memory vs (2) data reordering before GPU-GPU communication. Image courtesy of D. Stoic]

  15. Challenge: H*psi
  • compute the kinetic and non-local terms of H * psi (in G space) → complexity: Ni × (N × Ng + Ng × N × Np)
  • loop over (not yet converged) bands:
    • FFT(psi) to R space → complexity: Ni × Nb × FFT(Nr)
    • compute V * psi → complexity: Ni × Nb × Nr
    • FFT(V * psi) back to G space → complexity: Ni × Nb × FFT(Nr)
  • compute Vexx → complexity: Ni × Nc × Nq × Nb × (5 × Nr + 2 × FFT(Nr))
  where N = 2 × Nb (Nb = number of valence bands), Ng = number of G vectors, Ni = number of Davidson iterations, Np = number of PP projectors, Nr = size of the 3D FFT grid, Nq = number of q-points (may differ from Nk)
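Collecting the terms above, and writing FFT(Nr) for the cost of one 3D transform, the overall cost has the back-of-the-envelope form below (a restatement of the slide, with the Vexx term relevant to hybrid functionals only):

```latex
\mathrm{cost} \;\approx\;
\underbrace{N_i \,\bigl( N N_g + N_g N N_p \bigr)}_{\text{kinetic + non-local, G space}}
\;+\;
\underbrace{N_i N_b \,\bigl( 2\,\mathrm{FFT}(N_r) + N_r \bigr)}_{\text{band loop: FFT, } V\psi,\ \mathrm{FFT}^{-1}}
\;+\;
\underbrace{N_i N_c N_q N_b \,\bigl( 5 N_r + 2\,\mathrm{FFT}(N_r) \bigr)}_{V_{\mathrm{exx}}\ \text{(hybrid functionals only)}}
```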

  16. Challenge: H*psi (the non-converged electronic bands dilemma)
  The number of FFTs is not predictable across the SCF iterations.

  17. Challenge: parallel 3D-FFT (the orthogonal approach)
  Considerations:
  • memory on the GPU → K40 "ATLAS" (12 GByte)
  • (still) too much communication → GPU Direct capability needed
  • enough 3D-FFTs? → not predictable in advance
  • benefits also CPU-only runs!
  [Diagram: distributed PSI is gathered ("MPI_Allgatherv") into multiple local grids, local CUFFT G→R transforms and products are computed, and the results return to the distributed HPSI via CUFFT R→G and a scatter ("MPI_Allscatterv"); overlapping is possible. Not ready for production yet.]

  18. Challenge: eigen-solvers (which library?)
  • LAPACK → MAGMA (ICL, University of Tennessee)
    • hybridization approach (CPU + GPU), dynamic scheduling based on DLA (QUARK)
    • single and multi-GPU, no distributed memory (yet)
    • some (inevitable) numerical "discrepancies"
  • ScaLAPACK → ELPA → ELPA + GPU (RZG + NVIDIA)
    • ELPA (Eigenvalue SoLvers for Petaflop Applications) improves on ScaLAPACK
    • ELPA-GPU proof-of-concept based on CUDA FORTRAN
    • effective results below expectation
  • Lanczos diagonalization w/ tridiagonal QR algorithm (Penn State)
    • simple (too simple?) and designed to be GPU friendly
    • takes advantage of GPU Direct
    • experimental, needs testing and validation

  19. HPC Machines
  TITAN (ORNL) [CRAY]
  • 18688 nodes, single-socket
  • single 16-core AMD Opteron
  • one NVIDIA K20x per node
  • Gemini interconnect
  • #2 Top500 Jun 2013 (~17.59 PFlops Rmax)
  WILKES (HPCS) [DELL]
  • 128 nodes, dual-socket
  • dual 6-core Intel Ivy Bridge
  • dual NVIDIA K20c per node
  • dual Mellanox Connect-IB FDR
  • #2 Green500 Nov 2013 (~3632 MFlops/W)

  20. Achievement: save power (serial multi-threaded vs single GPU, NVIDIA Fermi generation)
  [Chart: energy-to-solution reduced by 54% to 58%, with speed-ups of 3.1x to 3.67x. Tests run early 2012 @ ICHEC]

  21. Achievement: improved time-to-solution
  [Chart: speed-ups between ~2.1x and ~3.5x across the serial and parallel test cases. Parallel tests run on Wilkes; serial tests run on the SBN machine]

  22. Challenge: running on CRAY XK7
  Key differences...
  • AMD Bulldozer architecture: 2 cores share the same FPU pipeline → aprun -j 1
  • NUMA locality matters a lot, for both CPU-only and CPU+GPU → aprun -cc numa_node
  • GPU Direct over RDMA is not supported (yet?) → the GPU 3D-FFT path not working
  • Scheduling policy is "unfriendly" → the input has to be really big
  Performance below expectation (<2x). Tricks: many-pw.x, __USE_3D_FFT

  23. Challenge: educate users
  • Performance portability is a myth
  • "configure, compile, run" is the same as for the CPU version
  • All dependencies (MAGMA, phiGEMM) are compiled by QE-GPU
  • No more than 2 MPI processes per GPU
  • Hyper-Q does not work automatically; an additional running daemon (the CUDA MPS server) is needed
  • Forget about 1:1 output comparison
  • QE-GPU can run on every GPU, but some GPUs are better than others...

  24. Lessons learnt: being "heterogeneous" today and tomorrow
  • GPUs do not really improve code scalability, only time-to-solution
  • Re-think data distribution for massively parallel architectures
  • Deal with un-controlled "numerical fluctuations" (the GPU magnifies these)
  • The "data movement" constraint will soon disappear → new Intel Xeon Phi Knights Landing and NVIDIA Project Denver expected by 2015
  • Looking for true alternatives, new algorithms
    • not easy: extensive validation _plus_ module dependencies
  • Performance is a function of human effort
  • Follow the mantra «Do what you are good at.»

  25. Links:
  • http://hpc.cam.ac.uk
  • http://www.quantum-espresso.org/
  • http://foundation.quantum-espresso.org/
  • http://qe-forge.org/gf/project/q-e/
  • http://qe-forge.org/gf/project/q-e-gpu/
  Thank you for your attention!
