
MARMOT - MPI Analysis and Checking Tool








  1. MARMOT - MPI Analysis and Checking Tool Bettina Krammer, Matthias Müller krammer@hlrs.de, mueller@hlrs.de HLRS High Performance Computing Center Stuttgart Allmandring 30 D-70550 Stuttgart http://www.hlrs.de

  2. Overview • General Problems of MPI Programming • Related Work • New Approach: MARMOT • Results within CrossGrid • Outlook

  3. Problems of MPI Programming • All problems of serial programming • Additional problems: • Increased difficulty of verifying program correctness • Increased difficulty of debugging N parallel processes • Portability issues between different MPI implementations and platforms, e.g. with mpich on the CrossGrid testbed: "WARNING: MPI_Recv: tag= 36003 > 32767 ! MPI only guarantees tags up to this. THIS implementation allows tags up to 137654536" (versions of LAM-MPI < 7.0 only guarantee tags up to 32767) • New problems such as race conditions and deadlocks
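Because the guaranteed tag range differs between MPI implementations, a portable program should not hard-code a limit but query the MPI_TAG_UB attribute at runtime. The following minimal sketch (an illustration of the portability issue above, not part of MARMOT) checks the upper bound before using a large tag; the tag value 36003 is taken from the warning quoted above:

  #include <stdio.h>
  #include "mpi.h"

  int main( int argc, char **argv )
  {
      int *tag_ub;                 /* MPI returns a pointer to the attribute value */
      int flag, my_tag = 36003;

      MPI_Init( &argc, &argv );

      /* Query the largest tag value this MPI implementation guarantees. */
      MPI_Attr_get( MPI_COMM_WORLD, MPI_TAG_UB, &tag_ub, &flag );
      if( flag && my_tag > *tag_ub )
          fprintf( stderr, "tag %d exceeds MPI_TAG_UB (%d) of this implementation\n",
                   my_tag, *tag_ub );

      MPI_Finalize();
      return 0;
  }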

  4. Related Work • Parallel debuggers, e.g. TotalView, DDT, p2d2 • Debug version of the MPI library • Post-mortem analysis of trace files • Special verification tools for runtime analysis • MPI-CHECK • Limited to Fortran • Umpire • First version limited to shared memory platforms • Distributed memory version in preparation • Not publicly available • MARMOT

  5. What is MARMOT? • Tool for the development of MPI applications • Automatic runtime analysis of the application: • Detect incorrect use of MPI • Detect non-portable constructs • Detect possible race conditions and deadlocks • MARMOT does not require source code modifications, just relinking • The C and Fortran bindings of MPI-1.2 are supported

  6. Design of MARMOT (architecture diagram): the application issues MPI calls through the profiling interface, where the MARMOT core tool intercepts them before passing them on to the native MPI library; a debug server runs as an additional process alongside the application processes.
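This interception works through the standard MPI profiling interface: a tool can provide its own definition of each MPI_Xxx routine and forward to the real implementation via the corresponding PMPI_Xxx entry point, which is why only relinking of the application is needed. The fragment below is a hypothetical sketch of that pattern (not MARMOT's actual source); the checks shown mirror the warnings that appear in the log files later in this deck:

  #include <stdio.h>
  #include "mpi.h"

  /* The tool defines its own MPI_Recv; at link time the application's calls
  ** resolve to this wrapper instead of the MPI library's entry point. */
  int MPI_Recv( void *buf, int count, MPI_Datatype datatype, int source,
                int tag, MPI_Comm comm, MPI_Status *status )
  {
      /* Illustrative local checks performed before the call is executed. */
      if( source == MPI_ANY_SOURCE )
          fprintf( stderr, "WARNING: MPI_Recv: Use of MPI_ANY_SOURCE may cause race conditions!\n" );
      if( tag > 32767 )
          fprintf( stderr, "WARNING: MPI_Recv: tag = %d > 32767 is not guaranteed to be portable\n", tag );

      /* Hand the call on to the real MPI library via the profiling interface. */
      return PMPI_Recv( buf, count, datatype, source, tag, comm, status );
  }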

  7. Examples of Client Checks: verification on the local nodes • Verification of proper construction and usage of MPI resources such as communicators, groups, datatypes, etc., for example: • Verification of MPI_Request usage • invalid recycling of an active request • invalid use of an unregistered request • warning if the number of requests is zero • warning if all requests are MPI_REQUEST_NULL • Check for pending messages and active requests in MPI_Finalize • Verification of all other arguments such as ranks, tags, etc.
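The warning about an array in which all requests are MPI_REQUEST_NULL typically points to a completion call that is issued twice. The following hypothetical two-process sketch (not taken from MARMOT's test suite) shows the situation: the second MPI_Waitall is legal MPI, but almost certainly a logic error that a runtime checker can flag:

  #include <stdio.h>
  #include "mpi.h"

  int main( int argc, char **argv )
  {
      int rank, partner, a = 0, b = 1;
      MPI_Request reqs[2];
      MPI_Status  stats[2];

      MPI_Init( &argc, &argv );
      MPI_Comm_rank( MPI_COMM_WORLD, &rank );
      partner = 1 - rank;               /* run with exactly 2 processes */

      MPI_Irecv( &a, 1, MPI_INT, partner, 17, MPI_COMM_WORLD, &reqs[0] );
      MPI_Isend( &b, 1, MPI_INT, partner, 17, MPI_COMM_WORLD, &reqs[1] );

      /* Completes both operations and sets both requests to MPI_REQUEST_NULL. */
      MPI_Waitall( 2, reqs, stats );

      /* All requests are now MPI_REQUEST_NULL: legal, but suspicious. */
      MPI_Waitall( 2, reqs, stats );

      MPI_Finalize();
      return 0;
  }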

  8. Examples of Server Checks: verification between the nodes, control of the program • Everything that requires a global view • Control the execution flow, trace the MPI calls on each node throughout the whole application • Signal conditions, e.g. deadlocks (with traceback on each node) • Check matching send/receive pairs for consistency • Check collective calls for consistency • Output of a human-readable log file
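An example of a mismatch that only such a global view can detect is a send/receive pair whose type signatures disagree: each call looks correct locally, but together they are inconsistent. A minimal hypothetical illustration (not an actual MARMOT test case):

  #include <stdio.h>
  #include "mpi.h"

  int main( int argc, char **argv )
  {
      int        rank;
      int        idata[10] = { 0 };
      double     ddata[10];
      MPI_Status status;

      MPI_Init( &argc, &argv );
      MPI_Comm_rank( MPI_COMM_WORLD, &rank );

      if( rank == 0 ){
          /* Sent as 10 integers ... */
          MPI_Send( idata, 10, MPI_INT, 1, 17, MPI_COMM_WORLD );
      }
      if( rank == 1 ){
          /* ... but received as 10 doubles: the pair is inconsistent.
          ** Neither process can see this on its own; a server-side check can. */
          MPI_Recv( ddata, 10, MPI_DOUBLE, 0, 17, MPI_COMM_WORLD, &status );
      }

      MPI_Finalize();
      return 0;
  }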

  9. MARMOT within CrossGrid

  10. CrossGrid Application Tasks (HEP data acquisition pyramid): the detector delivers data at 40 MHz (40 TB/sec); level 1 (special hardware) reduces the rate to 75 kHz (75 GB/sec); level 2 (embedded processors) to 5 kHz (5 GB/sec); level 3 (PCs) to 100 Hz (100 MB/sec) for data recording & offline analysis.

  11. CrossGrid Application WP 1.4: Air pollution modelling (STEM-II) • Air pollution modelling with the STEM-II model • Transport equation solved with the Petrov-Crank-Nicolson-Galerkin method • Chemistry and mass transfer are integrated using semi-implicit Euler and pseudo-analytical methods • 15500 lines of Fortran code • 12 different MPI calls: • MPI_Init, MPI_Comm_size, MPI_Comm_rank, MPI_Type_extent, MPI_Type_struct, MPI_Type_commit, MPI_Type_hvector, MPI_Bcast, MPI_Scatterv, MPI_Barrier, MPI_Gatherv, MPI_Finalize.

  12. STEM application on an IA32 cluster with Myrinet

  13. Conclusion • MARMOT supports the C and Fortran bindings of MPI-1.2 • Tested with several applications and benchmarks • Tested on different platforms, using different compilers and MPI implementations, e.g. • IA32/IA64 clusters (Intel and g++ compilers) with mpich (CrossGrid testbed, …) • IBM Regatta • NEC SX5, SX6 • Hitachi SR8000, Cray T3E

  14. Impact and Exploitation • MARMOT is mentioned in the U.S. DoD High Performance Computing Modernization Program (HPCMP) Programming Environment and Training (PET) project as one of two freely available tools (MPI-Check and MARMOT) for the development of portable MPI programs. • Contact with the developers of the two other known verification tools similar to MARMOT (MPI-Check, Umpire) • Contact with users outside CrossGrid • MARMOT has been presented at various conferences and in several publications: http://www.hlrs.de/organization/tsc/projects/marmot/pubs.html

  15. Future Work • Scalability and general performance improvements • Better user interface to present problems and warnings • Extended functionality: • more tests to verify collective calls • MPI-2 • Hybrid programming

  16. Thanks to all CrossGrid partners: http://www.eu-crossgrid.org/ Thanks for your attention. Additional information and download: http://www.hlrs.de/organization/tsc/projects/marmot

  17. Back Up

  18. Impact and Exploitation • MARMOT is mentioned in the U.S. DoD High Performance Computing Modernization Program (HPCMP) Programming Environment and Training (PET) project as one of two freely available tools (MPI-Check and MARMOT) for the development of portable MPI programs. • Contact with the developers of the two other known verification tools similar to MARMOT (MPI-Check, Umpire) • MARMOT has been presented at various conferences and in several publications: http://www.hlrs.de/organization/tsc/projects/marmot/pubs.html

  19. Impact and Exploitation cont’d • Tescico proposal (“Technical and Scientific Computing”) • French/German consortium: • Bull, CAPS Entreprise, Université Versailles, • T-Systems, Intel Germany, GNS, Universität Dresden, HLRS • ITEA-labelled • French authorities agreed to give funding • German authorities did not

  20. Impact and Exploitation cont’d • MARMOT is integrated in the HLRS training classes • Deployment in Germany: MARMOT is in use in two of the three National High Performance Computing Centers: • HLRS at Stuttgart (IA32, IA64, NEC SX) • NIC at Jülich (IBM Regatta): • just replaced their Cray T3E with an IBM Regatta • spent a lot of time finding a problem that would have been detected automatically by MARMOT

  21. Feedback from CrossGrid Applications • Task 1.1 (biomedical): • C application • Identified issues: • possible race conditions due to use of MPI_ANY_SOURCE • deadlock • Task 1.2 (flood): • Fortran application • Identified issues: • tags outside of the valid range • possible race conditions due to use of MPI_ANY_SOURCE • Task 1.3 (HEP): • ANN (C application) • no issues found by MARMOT • Task 1.4 (meteo): • STEM-II (Fortran) • MARMOT detected holes in self-defined datatypes used in MPI_Scatterv and MPI_Gatherv. These holes were removed, which helped to improve the performance of the communication.
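Such holes typically come from alignment padding inside a record over which a derived datatype is built. The sketch below illustrates the mechanism in C with MPI-1 calls (hypothetical field names; the STEM-II code itself is Fortran): the extent of the datatype is larger than the payload it actually carries, and that difference is the hole that slows down communication:

  #include <stdio.h>
  #include "mpi.h"

  /* Hypothetical record: most compilers insert 4 bytes of padding between
  ** 'flag' and 'value' so that the double is 8-byte aligned. That padding
  ** shows up as a hole in the derived datatype built below. */
  struct cell {
      int    flag;
      double value;
  };

  int main( int argc, char **argv )
  {
      struct cell  c;
      MPI_Datatype cell_type;
      MPI_Datatype types[2]    = { MPI_INT, MPI_DOUBLE };
      int          blocklen[2] = { 1, 1 };
      MPI_Aint     base, disp[2], extent;

      MPI_Init( &argc, &argv );

      /* Member displacements relative to the start of the struct. */
      MPI_Address( &c, &base );
      MPI_Address( &c.flag, &disp[0] );
      MPI_Address( &c.value, &disp[1] );
      disp[0] -= base;
      disp[1] -= base;

      MPI_Type_struct( 2, blocklen, disp, types, &cell_type );
      MPI_Type_commit( &cell_type );

      /* On a typical platform the extent is 16 bytes although only 12 bytes
      ** carry data: the remaining 4 bytes are the hole. */
      MPI_Type_extent( cell_type, &extent );
      printf( "extent = %ld bytes\n", (long)extent );

      MPI_Type_free( &cell_type );
      MPI_Finalize();
      return 0;
  }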

  22. Examples of Log File
54 rank 1 performs MPI_Cart_shift
55 rank 2 performs MPI_Cart_shift
56 rank 0 performs MPI_Send
57 rank 1 performs MPI_Recv
WARNING: MPI_Recv: Use of MPI_ANY_SOURCE may cause race conditions!
58 rank 2 performs MPI_Recv
WARNING: MPI_Recv: Use of MPI_ANY_SOURCE may cause race conditions!
59 rank 0 performs MPI_Send
60 rank 1 performs MPI_Recv
WARNING: MPI_Recv: Use of MPI_ANY_SOURCE may cause race conditions!
61 rank 0 performs MPI_Send
62 rank 1 performs MPI_Bcast
63 rank 2 performs MPI_Recv
WARNING: MPI_Recv: Use of MPI_ANY_SOURCE may cause race conditions!
64 rank 0 performs MPI_Pack
65 rank 2 performs MPI_Bcast
66 rank 0 performs MPI_Pack

  23. Examples of Log File (continued)
7883 rank 2 performs MPI_Barrier
7884 rank 0 performs MPI_Sendrecv
7885 rank 1 performs MPI_Sendrecv
7886 rank 2 performs MPI_Sendrecv
7887 rank 0 performs MPI_Sendrecv
7888 rank 1 performs MPI_Sendrecv
7889 rank 2 performs MPI_Sendrecv
7890 rank 0 performs MPI_Barrier
7891 rank 1 performs MPI_Barrier
7892 rank 2 performs MPI_Barrier
7893 rank 0 performs MPI_Sendrecv
7894 rank 1 performs MPI_Sendrecv

  24. Examples of Log File (Deadlock)
9310 rank 1 performs MPI_Sendrecv
9311 rank 2 performs MPI_Sendrecv
9312 rank 0 performs MPI_Barrier
9313 rank 1 performs MPI_Barrier
9314 rank 2 performs MPI_Barrier
9315 rank 1 performs MPI_Sendrecv
9316 rank 2 performs MPI_Sendrecv
9317 rank 0 performs MPI_Sendrecv
9318 rank 1 performs MPI_Sendrecv
9319 rank 0 performs MPI_Sendrecv
9320 rank 2 performs MPI_Sendrecv
9321 rank 0 performs MPI_Barrier
9322 rank 1 performs MPI_Barrier
9323 rank 2 performs MPI_Barrier
9324 rank 1 performs MPI_Comm_rank
9325 rank 1 performs MPI_Bcast
9326 rank 2 performs MPI_Comm_rank
9327 rank 2 performs MPI_Bcast
9328 rank 0 performs MPI_Sendrecv
WARNING: all clients are pending!

  25. Examples of Log File (Deadlock: traceback on 0)
timestamp= 9298: MPI_Sendrecv(*sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 2, sendtag = 1, *recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 1, recvtag = 1, comm = self-defined communicator, *status)
timestamp= 9300: MPI_Sendrecv(*sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 1, sendtag = 1, *recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 2, recvtag = 1, comm = self-defined communicator, *status)
timestamp= 9304: MPI_Barrier(comm = MPI_COMM_WORLD)
timestamp= 9307: MPI_Sendrecv(*sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 2, sendtag = 1, *recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 1, recvtag = 1, comm = self-defined communicator, *status)
timestamp= 9309: MPI_Sendrecv(*sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 1, sendtag = 1, *recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 2, recvtag = 1, comm = self-defined communicator, *status)
timestamp= 9312: MPI_Barrier(comm = MPI_COMM_WORLD)
timestamp= 9317: MPI_Sendrecv(*sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 2, sendtag = 1, *recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 1, recvtag = 1, comm = self-defined communicator, *status)
timestamp= 9319: MPI_Sendrecv(*sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 1, sendtag = 1, *recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 2, recvtag = 1, comm = self-defined communicator, *status)
timestamp= 9321: MPI_Barrier(comm = MPI_COMM_WORLD)
timestamp= 9328: MPI_Sendrecv(*sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 2, sendtag = 1, *recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 1, recvtag = 1, comm = self-defined communicator, *status)

  26. Examples of Log File (Deadlock: traceback on 1)
timestamp= 9301: MPI_Sendrecv(*sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 2, sendtag = 1, *recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 0, recvtag = 1, comm = self-defined communicator, *status)
timestamp= 9302: MPI_Barrier(comm = MPI_COMM_WORLD)
timestamp= 9306: MPI_Sendrecv(*sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 0, sendtag = 1, *recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 2, recvtag = 1, comm = self-defined communicator, *status)
timestamp= 9310: MPI_Sendrecv(*sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 2, sendtag = 1, *recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 0, recvtag = 1, comm = self-defined communicator, *status)
timestamp= 9313: MPI_Barrier(comm = MPI_COMM_WORLD)
timestamp= 9315: MPI_Sendrecv(*sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 0, sendtag = 1, *recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 2, recvtag = 1, comm = self-defined communicator, *status)
timestamp= 9318: MPI_Sendrecv(*sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 2, sendtag = 1, *recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 0, recvtag = 1, comm = self-defined communicator, *status)
timestamp= 9322: MPI_Barrier(comm = MPI_COMM_WORLD)
timestamp= 9324: MPI_Comm_rank(comm = MPI_COMM_WORLD, *rank)
timestamp= 9325: MPI_Bcast(*buffer, count = 3, datatype = MPI_DOUBLE, root = 0, comm = MPI_COMM_WORLD)

  27. Examples of Log File (Deadlock: traceback on 2)
timestamp= 9303: MPI_Sendrecv(*sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 0, sendtag = 1, *recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 1, recvtag = 1, comm = self-defined communicator, *status)
timestamp= 9305: MPI_Barrier(comm = MPI_COMM_WORLD)
timestamp= 9308: MPI_Sendrecv(*sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 1, sendtag = 1, *recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 0, recvtag = 1, comm = self-defined communicator, *status)
timestamp= 9311: MPI_Sendrecv(*sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 0, sendtag = 1, *recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 1, recvtag = 1, comm = self-defined communicator, *status)
timestamp= 9314: MPI_Barrier(comm = MPI_COMM_WORLD)
timestamp= 9316: MPI_Sendrecv(*sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 1, sendtag = 1, *recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 0, recvtag = 1, comm = self-defined communicator, *status)
timestamp= 9320: MPI_Sendrecv(*sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 0, sendtag = 1, *recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 1, recvtag = 1, comm = self-defined communicator, *status)
timestamp= 9323: MPI_Barrier(comm = MPI_COMM_WORLD)
timestamp= 9326: MPI_Comm_rank(comm = MPI_COMM_WORLD, *rank)
timestamp= 9327: MPI_Bcast(*buffer, count = 3, datatype = MPI_DOUBLE, root = 0, comm = MPI_COMM_WORLD)

  28. Classical Solutions I: Parallel Debugger • Examples: TotalView, DDT, p2d2 • Advantages: • Same approach and tool as in the serial case • Disadvantages: • Can only fix problems after, and if, they occur • Scalability: how can you debug programs that crash after 3 hours on 512 nodes? • Reproducibility: how do you debug a program that crashes only every fifth time? • Does not help to improve portability

  29. Classical Solutions II: Debug Version of the MPI Library • Examples: • catches some incorrect usage, e.g. node count in MPI_CART_CREATE (mpich) • deadlock detection (NEC MPI) • Advantages: • good scalability • better debugging in combination with TotalView • Disadvantages: • Portability: only helps for this particular MPI implementation • Trade-off between performance and safety • Reproducibility: does not help to debug irreproducible programs
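The mpich check mentioned above targets mistakes like the following hypothetical fragment, in which the requested Cartesian grid needs more processes than the communicator provides; a debug MPI library, or a tool such as MARMOT, can report the inconsistent node count at the call site:

  #include <stdio.h>
  #include "mpi.h"

  int main( int argc, char **argv )
  {
      int      size;
      int      dims[2]    = { 4, 4 };   /* 4 x 4 grid = 16 processes */
      int      periods[2] = { 0, 0 };
      MPI_Comm cart;

      MPI_Init( &argc, &argv );
      MPI_Comm_size( MPI_COMM_WORLD, &size );

      /* If the program is started with fewer than 16 processes, the grid
      ** does not fit into MPI_COMM_WORLD and the call is erroneous. */
      MPI_Cart_create( MPI_COMM_WORLD, 2, dims, periods, 1, &cart );

      MPI_Finalize();
      return 0;
  }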

  30. Comparison with related projects

  31. Comparison with related projects

  32. Examples

  33. Example 1: request-reuse (source code)
/*
** Here we re-use a request we didn't free before
*/
#include <stdio.h>
#include <assert.h>
#include "mpi.h"

int main( int argc, char **argv )
{
    int size = -1;
    int rank = -1;
    int value = -1;
    int value2 = -1;
    MPI_Status send_status, recv_status;
    MPI_Request send_request, recv_request;

    printf( "We call Irecv and Isend with non-freed requests.\n" );
    MPI_Init( &argc, &argv );
    MPI_Comm_size( MPI_COMM_WORLD, &size );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    printf( " I am rank %d of %d PEs\n", rank, size );

  34. Example 1: request-reuse (source code continued)
    if( rank == 0 ){
        /*** this is just to get the request used ***/
        MPI_Irecv( &value, 1, MPI_INT, 1, 18, MPI_COMM_WORLD, &recv_request );
        /*** going to receive the message and reuse a non-freed request ***/
        MPI_Irecv( &value, 1, MPI_INT, 1, 17, MPI_COMM_WORLD, &recv_request );
        MPI_Wait( &recv_request, &recv_status );
        assert( value == 19 );
    }
    if( rank == 1 ){
        value2 = 19;
        /*** this is just to use the request ***/
        MPI_Isend( &value, 1, MPI_INT, 0, 18, MPI_COMM_WORLD, &send_request );
        /*** going to send the message ***/
        MPI_Isend( &value2, 1, MPI_INT, 0, 17, MPI_COMM_WORLD, &send_request );
        MPI_Wait( &send_request, &send_status );
    }
    MPI_Finalize();
    return 0;
}

  35. Example 1: request-reuse (output log)
We call Irecv and Isend with non-freed requests.
1 rank 0 performs MPI_Init
2 rank 1 performs MPI_Init
3 rank 0 performs MPI_Comm_size
4 rank 1 performs MPI_Comm_size
5 rank 0 performs MPI_Comm_rank
6 rank 1 performs MPI_Comm_rank
I am rank 0 of 2 PEs
7 rank 0 performs MPI_Irecv
I am rank 1 of 2 PEs
8 rank 1 performs MPI_Isend
9 rank 0 performs MPI_Irecv
10 rank 1 performs MPI_Isend
ERROR: MPI_Irecv Request is still in use !!
11 rank 0 performs MPI_Wait
ERROR: MPI_Isend Request is still in use !!
12 rank 1 performs MPI_Wait
13 rank 0 performs MPI_Finalize
14 rank 1 performs MPI_Finalize

  36. Example 2: deadlock (source code)
/* This program produces a deadlock.
** At least 2 nodes are required to run the program.
**
** Rank 0 recv a message from Rank 1.
** Rank 1 recv a message from Rank 0.
**
** AFTERWARDS:
** Rank 0 sends a message to Rank 1.
** Rank 1 sends a message to Rank 0.
*/
#include <stdio.h>
#include "mpi.h"

int main( int argc, char** argv ){
    int rank = 0;
    int size = 0;
    int dummy = 0;
    MPI_Status status;

  37. Example 2: deadlock (source code continued)
    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    MPI_Comm_size( MPI_COMM_WORLD, &size );

    if( size < 2 ){
        fprintf( stderr, " This program needs at least 2 PEs!\n" );
    } else {
        if( rank == 0 ){
            MPI_Recv( &dummy, 1, MPI_INT, 1, 17, MPI_COMM_WORLD, &status );
            MPI_Send( &dummy, 1, MPI_INT, 1, 18, MPI_COMM_WORLD );
        }
        if( rank == 1 ){
            MPI_Recv( &dummy, 1, MPI_INT, 0, 18, MPI_COMM_WORLD, &status );
            MPI_Send( &dummy, 1, MPI_INT, 0, 17, MPI_COMM_WORLD );
        }
    }
    MPI_Finalize();
}

  38. Example 2: deadlock (output log)
$ mpirun -np 3 deadlock1
1 rank 0 performs MPI_Init
2 rank 1 performs MPI_Init
3 rank 0 performs MPI_Comm_rank
4 rank 1 performs MPI_Comm_rank
5 rank 0 performs MPI_Comm_size
6 rank 1 performs MPI_Comm_size
7 rank 0 performs MPI_Recv
8 rank 1 performs MPI_Recv
8 Rank 0 is pending!
8 Rank 1 is pending!
WARNING: deadlock detected, all clients are pending

  39. Example 2: deadlock (output log continued)
Last calls (max. 10) on node 0:
timestamp = 1: MPI_Init( *argc, ***argv )
timestamp = 3: MPI_Comm_rank( comm, *rank )
timestamp = 5: MPI_Comm_size( comm, *size )
timestamp = 7: MPI_Recv( *buf, count = -1, datatype = non-predefined datatype, source = -1, tag = -1, comm, *status )
Last calls (max. 10) on node 1:
timestamp = 2: MPI_Init( *argc, ***argv )
timestamp = 4: MPI_Comm_rank( comm, *rank )
timestamp = 6: MPI_Comm_size( comm, *size )
timestamp = 8: MPI_Recv( *buf, count = -1, datatype = non-predefined datatype, source = -1, tag = -1, comm, *status )

  40. Bandwidth on an IA64 cluster with Myrinet

  41. Latency on an IA64 cluster with Myrinet

  42. cg.B on an IA32 cluster with Myrinet

  43. is.B on an IA32 cluster with Myrinet

  44. is.A
