H.J.J. van Dam, Martyn Guest and Paul Sherwood

Performance analysis of GA-based applications using the Vampir tool
NWChem and GAMESS-UK on High-end and Commodity class machines
H.J.J. van Dam, Martyn Guest and Paul Sherwood, Quantum Chemistry Group, CLRC Daresbury Laboratory, http://www.cse.clrc.ac.uk
Miles Deegan, Compaq (Galway)

Outline
• Background: PNNL, Daresbury and PALLAS
• Tool for Performance Analysis - VAMPIR & VAMPIR Trace
  • VAMPIR - analysis of trace files
  • VAMPIR Trace
    • Trace Library for MPI applications
    • Extensions to handle GA applications
• Case Studies
  • DFT Calculations on Zeolite Fragments (347 - 1687 GTOs) with Coulomb Fitting
  • High-end Systems - Cray T3E/1200E, Compaq AlphaServer SC (667 & 833 MHz), SGI Origin 3000/R12k-400 and IBM SP/WH2-375
  • Commodity Clusters (IA32 and Alpha Linux)
• NWChem and GAMESS-UK
  • Distributed data (NWChem) and Replicated Data (GAMESS-UK)
  • Analysis of GAs and PeIGs
• Summary

PNNL - Daresbury - Pallas Collaborations
• PNNL - Daresbury Collaboration
  • Long-term interaction between chemistry activities
  • Proposed developments around DFT derivative codes
  • UK Chemistry Collaboration Forum (CCP1)
  • DFT Flagship project and subsequent DL extensions
  • DFT Functional Repository (http://www.dl.ac.uk/DFTlib)
• Daresbury - Pallas Collaboration
  • Demonstrate that clusters of IA32 and Alpha processors are competitive with HPC servers (with low to medium processor numbers) for a wide range of applications
  • Evaluate the suitability of clusters for high-end computing
  • Analyse kernels and full applications (May 2000 - Sep. 2001)

Vampir 2.5: Visualization and Analysis of MPI Programs

Vampir Features
• Offline trace analysis for MPI (and others ...)
• Traces generated by Vampirtrace tool (`ld ... -lVT -lpmpi -lmpi`)
• Convenient user interface
• Scalability in time and processor space
• Excellent zooming and filtering
• High-performance graphics
• Display and analysis of MPI and application events:
  • execution of MPI routines
  • point-to-point and collective communication
  • MPI-2 I/O operations
  • execution of application subroutines (optional)
• “Easy” customization

Vampir Displays
• Global displays show all selected processes
  • Summary Chart: aggregated profiling information
  • Activity Chart: presents per-process profiling information
  • Timeline: detailed application execution over time axis
  • Communication statistics: message statistics for each process pair
  • Global Comm. Statistics: collective operations statistics
  • I/O Statistics: MPI I/O operation statistics
  • Calling Tree: draws global or local dynamic calling trees
• Process displays show a single process per window
  • Activity Chart
  • Timeline
  • Calling Tree

Timeline Display (Message Info)
• Click on message line to see message details
• Source-code references are displayed if recorded by Vampirtrace
[Figure: timeline screenshot with callouts "Message send op" and "Message receive op"]

Vampirtrace: Tracing of MPI and Application Events

Vampirtrace
• Current version: Vampirtrace 2.0
• Significant new features:
  • records collective communication
  • enhanced filter functions
  • extended API
  • records source-code information (selected platforms)
  • support for shmem (Cray T3E)
  • records MPI-2 I/O operations
• Available for all major MPI platforms
• Library that records all MPI calls, point-to-point communication, and collective operations.
• Runtime filters available to limit tracefile size.
• Provides an API for user instrumentation.
• Requires MPI to gather performance data.
• Uses the profiling interface of MPI and is therefore independent of the specifics of a given MPI implementation.

Vampirtrace API
• Switching tracing on/off
    SUBROUTINE VTTRACEOFF( )
    SUBROUTINE VTTRACEON( )
• Specifying user-defined states
    SUBROUTINE VTSYMDEF(ICODE, STATE, ACTIVITY, IERR)
• Entering/leaving user-defined states
    SUBROUTINE VTBEGIN(ICODE, IERR)
    SUBROUTINE VTEND(ICODE, IERR)
• Logging message send/receive events (undocumented)
    SUBROUTINE VTLOGSENDMSG(IME, ITO, ICNT, ITAG, ICOMMID, IERR)
    SUBROUTINE VTLOGRECVMSG(IME, IFRM, ICNT, ITAG, ICOMMID, IERR)
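The define / begin / end pattern of the user-instrumentation calls above can be illustrated with a small stand-in. This is a minimal Python sketch, not the Vampirtrace library: `ToyTracer`, its method names, the injectable clock and the state names `scf_cycle` and `diag` are all hypothetical, chosen only to mirror the roles of VTSYMDEF, VTBEGIN and VTEND.

```python
import time


class ToyTracer:
    """Hypothetical stand-in for a trace library; NOT the Vampirtrace API."""

    def __init__(self, clock=time.perf_counter):
        self.clock = clock
        self.symbols = {}   # icode -> (state, activity), the role of VTSYMDEF
        self.stack = []     # states currently open on this process
        self.events = []    # recorded (kind, icode, timestamp) tuples

    def symdef(self, icode, state, activity):
        """Register a user-defined state under an integer code."""
        self.symbols[icode] = (state, activity)

    def begin(self, icode):
        """Enter a state (the role of VTBEGIN)."""
        if icode not in self.symbols:
            raise KeyError(f"state {icode} was never defined")
        self.stack.append(icode)
        self.events.append(("begin", icode, self.clock()))

    def end(self, icode):
        """Leave a state (the role of VTEND); states must nest properly."""
        if not self.stack or self.stack[-1] != icode:
            raise RuntimeError("begin/end calls must nest properly")
        self.stack.pop()
        self.events.append(("end", icode, self.clock()))

    def time_in_state(self, icode):
        """Total inclusive time between begin/end pairs of one state."""
        total, started = 0.0, []
        for kind, code, stamp in self.events:
            if code != icode:
                continue
            if kind == "begin":
                started.append(stamp)
            else:
                total += stamp - started.pop()
        return total


# Deterministic fake clock so the example is reproducible.
ticks = iter([0.0, 1.0, 4.0, 6.0])
tracer = ToyTracer(clock=lambda: next(ticks))
tracer.symdef(1, "scf_cycle", "application")   # hypothetical state names
tracer.symdef(2, "diag", "application")
tracer.begin(1)
tracer.begin(2)
tracer.end(2)
tracer.end(1)
print(tracer.time_in_state(2))  # 3.0
print(tracer.time_in_state(1))  # 6.0
```

The inner state is charged 3.0 time units and the enclosing state 6.0, which is the kind of inclusive per-state timing a timeline or activity chart is built from.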

Global Arrays

Instrumenting single-sided memory access
• Approach 1: Instrument the puts, gets and data server
  • Advantage: robust and accurate
  • Disadvantage: one does not always have access to the source of the data server
• Approach 2: Instrument the puts and gets only, cheating on the source and destination of the messages
  • Advantage: no instrumentation of the data server required
  • Disadvantage: timings of the messages are inaccurate in case of non-blocking operations
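One plausible reading of Approach 2 is that the process issuing a put or get writes both halves of the message record itself, naming the remote process as the other endpoint. The Python sketch below illustrates that idea under this assumption. `MessageLog`, `traced_put` and `traced_get` are hypothetical names; the two logging methods only imitate the leading arguments of VTLOGSENDMSG and VTLOGRECVMSG and do not call the real library.

```python
class MessageLog:
    """Toy message log; NOT the Vampirtrace message-logging routines."""

    def __init__(self):
        self.records = []

    def log_send(self, ime, ito, icnt, itag):
        # "ime" is the process the send is attributed to, "ito" the target.
        self.records.append(("send", ime, ito, icnt, itag))

    def log_recv(self, ime, ifrm, icnt, itag):
        # "ime" is the process the receive is attributed to, "ifrm" the source.
        self.records.append(("recv", ime, ifrm, icnt, itag))


def traced_put(log, me, owner, count, tag=0):
    """Record a one-sided put as a message from `me` to `owner`."""
    # The real one-sided put would be issued here.
    log.log_send(me, owner, count, tag)
    # Logged by the caller on the owner's behalf: this is the "cheat".
    log.log_recv(owner, me, count, tag)


def traced_get(log, me, owner, count, tag=0):
    """Record a one-sided get as a message from `owner` to `me`."""
    # The real one-sided get would be issued here.
    log.log_send(owner, me, count, tag)
    log.log_recv(me, owner, count, tag)


log = MessageLog()
traced_put(log, me=0, owner=3, count=1024)
traced_get(log, me=0, owner=3, count=512)
for record in log.records:
    print(record)
```

Because the caller writes both records at the time of the call, the logged times presumably need not match when the data actually moves, which is consistent with the inaccuracy noted above for non-blocking operations.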

Runtime tracing options
• The tracing of activities can be modified at runtime through a configuration file, for example:
    Logfile-name /home/user/prog.bpv
    Symbol nnodes off
    Symbol nodeid off
    Symbol GA_Nnodes off
    Symbol GA_Nodeid off
• Tracing of messages cannot be changed.
• VTTRACEON and VTTRACEOFF should be used sparingly.

Practical issues
• The vampirtrace library and evaluation licenses can be downloaded from http://www.pallas.com/
• Evaluation licenses are limited to 32 processors
• CPU cycle providers are not too keen to provide vampirtrace?
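The configuration lines listed under "Runtime tracing options" follow a simple keyword-per-line layout. The Python sketch below reads only the two directives that appear on the slide (`Logfile-name` and `Symbol <name> off`). Treating `on` as a valid value and rejecting every other line are assumptions of this sketch; the real file format may accept more directives. `parse_trace_config` is a hypothetical helper, not part of Vampirtrace.

```python
def parse_trace_config(text):
    """Parse the two directive kinds shown on the slide into a dict."""
    config = {"logfile": None, "symbols": {}}
    for raw in text.splitlines():
        fields = raw.split()
        if not fields:
            continue  # skip blank lines
        keyword = fields[0].lower()
        if keyword == "logfile-name" and len(fields) == 2:
            config["logfile"] = fields[1]
        elif (keyword == "symbol" and len(fields) == 3
              and fields[2].lower() in ("on", "off")):  # "on" is assumed
            config["symbols"][fields[1]] = fields[2].lower() == "on"
        else:
            raise ValueError(f"unrecognised line: {raw!r}")
    return config


sample = """\
Logfile-name /home/user/prog.bpv
Symbol nnodes off
Symbol nodeid off
Symbol GA_Nnodes off
Symbol GA_Nodeid off
"""
config = parse_trace_config(sample)
print(config["logfile"])
print(sorted(name for name, on in config["symbols"].items() if not on))
```

Switching off frequently called, cheap routines such as the node-count and node-id queries is one way to keep the trace file small without touching the source.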

Case Studies - Zeolite Fragments
• DFT Calculations with Coulomb Fitting
  • Basis (Godbout et al.)
    • DZVP - O, Si
    • DZVP2 - H
  • Fitting Basis:
    • DGAUSS-A1 - O, Si
    • DGAUSS-A2 - H
• NWChem & GAMESS-UK
  • Both codes use an auxiliary fitting basis for the Coulomb energy, with 3-centre 2-electron integrals held in core.
• Fragments:
  • Si8O7H18 347/832
  • Si8O25H18 617/1444
  • Si26O37H36 1199/2818
  • Si28O67H30 1687/3928

High-End and Commodity Systems
• Cray T3E/1200E
  • 816 processor system at Manchester (CSAR service)
  • 600 MHz EV56 Alpha processor with 256 MB memory
• IBM SP (32 CPU system at DL)
  • 4-way Winterhawk2 SMP “thin nodes” with 2 GB memory
  • 375 MHz Power3-II processors with 8 MB L2 cache
• Compaq AlphaServer SC - 667 (APAC) and 833 MHz CPUs
  • 4-way ES40/667 and /833 SMP nodes with 2 GB memory
  • Alpha 21264a (EV67) CPUs with 8 MB L2 cache
  • Quadrics “fat tree” interconnect (5 usec latency, 150 MB/sec B/W)
• SGI Origin 3800
  • SARA (1000 CPUs) - Numalink with R12k/400 CPUs
• Commodity Systems (DL)
  • 32 X IA32 single processor CPUs (Pentium III/450), fast ethernet
  • Linux Alpha Cluster (16 X UP2000/667 - Quadrics Interconnect)

Measured Parallel Efficiency for NWChem - DFT on IBM-SP; Wall Times to Solution for SCF Convergence
D.A. Dixon et al., HPC, Plenum, 1999, p. 215
[Figure: speed-up versus number of nodes, both axes from 32 to 256]
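Speed-up and parallel efficiency are derived from wall times in the usual way. The Python sketch below assumes speed-up is measured relative to the smallest run, which is credited with ideal scaling. The slide does not state its baseline, so this is an assumption, and the timings in the usage lines are invented placeholders rather than measurements. `relative_speedup` is a hypothetical helper.

```python
def relative_speedup(times, base_nodes):
    """Return {nodes: (speedup, efficiency)} from {nodes: wall_time}.

    The run on `base_nodes` is assumed to scale ideally, so its
    speed-up is defined as `base_nodes` and its efficiency as 1.0.
    """
    base_time = times[base_nodes]
    result = {}
    for nodes, wall_time in sorted(times.items()):
        speedup = base_nodes * base_time / wall_time
        result[nodes] = (speedup, speedup / nodes)
    return result


# Invented placeholder timings in seconds, NOT data from the slide.
placeholder_times = {32: 100.0, 64: 50.0, 128: 40.0}
for nodes, (speedup, efficiency) in relative_speedup(placeholder_times, 32).items():
    print(nodes, speedup, efficiency)
```

With these placeholder values the 128-node run reaches a speed-up of 80 and an efficiency of 0.625, showing how a curve that bends below the diagonal of a speed-up plot translates into lost efficiency.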

DFT Coulomb Fit - NWChem
[Figure: two panels, Si8O25H18 (617/1444) and Si8O7H18 (347/832); measured time (seconds) versus number of CPUs]

DFT Coulomb Fit - NWChem
[Figure: two panels, Si26O37H36 (1199/2818) and Si28O67H30 (1687/3928); measured time (seconds) versus number of CPUs]
Panel annotations, in slide order: T_IBM-SP/P2SC-120 (256) = 1137; T_IBM-SP/P2SC-120 (256) = 2766

NWChem: Si8O7H18 and Si26O37H36
[Figure: two panels, Si8O7H18 and Si26O37H36]

NWChem / Si8O25H18 / Cycle

NWChem / Si8O25H18 / Diag

NWChem / Si8O25H18 / subdiag

Parallel Implementations of GAMESS-UK
• Extensive use of Global Array (GA) Tools and Parallel Linear Algebra from NWChem Project (EMSL)
• SCF and DFT energies and gradients
  • Replicated data, but …
    • GA Tools for caching of I/O for restart and checkpoint files
    • Storage of 3-centre 2-e integrals in DFT Jfit
    • Linear Algebra (via PeIGs, DIIS/MMOs, Inversion of 2c-2e matrix)
• SCF second derivatives
  • Distribution of <vvoo> and <vovo> integrals via GAs
• MP2 gradients
  • Distribution of <vvoo> and <vovo> integrals via GAs

GAMESS-UK: DFT S-VWN
Impact of Coulomb Fitting: Compaq AlphaServer SC/833
Basis: DZV_A2 (Dgauss) A1_DFT Fit
[Figure: two panels, Si26O37H36 (1199/2818) and Si28O67H30 (1687/3928); measured time (seconds) versus number of CPUs, with series labelled JEXPLICIT and JFIT]

DFT Coulomb Fit - GAMESS-UK
[Figure: two panels, Si26O37H36 (1199/2818) and Si28O67H30 (1687/3928); measured time (seconds) versus number of CPUs]

DFT JFit Performance: Si26O37H36
[Figure: three panels, Cray T3E/1200E, SGI Origin 3000/R12k-400 and AlphaServer SC/833; series labelled SCF, XC and JFit versus number of CPUs]

GAMESS-UK / Si8O25H18 : 8 CPUs: One DFT Cycle

GAMESS-UK / Si8O25H18 : 8 CPUs: Q†HQ (GAMULT2) and PEIGS

Summary
• PNNL, Daresbury and PALLAS collaborations
• Tool for Performance Analysis - VAMPIR & VAMPIR Trace
  • Extended to handle GA Applications
  • Applied in a number of DFT Calculations on Zeolite Fragments on a variety of high-end and commodity-based platforms
• Instrumentation of both NWChem and GAMESS-UK:
  • Distributed data (NWChem)
  • Replicated Data (GAMESS-UK)
  • Analysis of GAs and PeIGs
• Findings
  • Non-intrusive
  • Tracing of substantial runs possible
  • Size of trace files in distributed data applications
  • Use in quantifying scaling problems
    • e.g. GA_MULT2 in GAMESS-UK

Acknowledgements
• Bob Gingold, Australian National University Supercomputer Facility
• Mario Deilmann, Hans Plum, Heinrich Bockhorst, Pallas
