
MARMOT - MPI Analysis and Checking Tool








  1. MARMOT - MPI Analysis and Checking Tool Bettina Krammer, Matthias Müller krammer@hlrs.de, mueller@hlrs.de HLRS High Performance Computing Center Stuttgart Allmandring 30 D-70550 Stuttgart http://www.hlrs.de

  2. Overview • General Problems of MPI Programming • Related Work • New Approach: MARMOT • Results within CrossGrid • Outlook

  3. Problems of MPI Programming • All problems of serial programming • Additional problems: • Increased difficulty of verifying program correctness • Increased difficulty of debugging N parallel processes • Portability issues between different MPI implementations and platforms, e.g. with mpich on the CrossGrid testbed: "WARNING: MPI_Recv: tag= 36003 > 32767 ! MPI only guarantees tags up to this. THIS implementation allows tags up to 137654536" (versions of LAM-MPI < 7.0 only guarantee tags up to 32767) • New problems such as race conditions and deadlocks
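Because the guaranteed tag range differs between MPI implementations, a portable program should not hard-code a limit but query the MPI_TAG_UB attribute at runtime. The following minimal sketch (an illustration of the portability issue above, not part of MARMOT) checks the upper bound before using a large tag; the tag value 36003 is taken from the warning quoted above:

  #include <stdio.h>
  #include "mpi.h"

  int main( int argc, char **argv )
  {
      int *tag_ub;                 /* MPI returns a pointer to the attribute value */
      int flag, my_tag = 36003;

      MPI_Init( &argc, &argv );

      /* Query the largest tag value this MPI implementation guarantees. */
      MPI_Attr_get( MPI_COMM_WORLD, MPI_TAG_UB, &tag_ub, &flag );
      if( flag && my_tag > *tag_ub )
          fprintf( stderr, "tag %d exceeds MPI_TAG_UB (%d) of this implementation\n",
                   my_tag, *tag_ub );

      MPI_Finalize();
      return 0;
  }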

  4. Related Work • Parallel debuggers, e.g. TotalView, DDT, p2d2 • Debug version of the MPI library • Post-mortem analysis of trace files • Special verification tools for runtime analysis • MPI-CHECK • Limited to Fortran • Umpire • First version limited to shared memory platforms • Distributed memory version in preparation • Not publicly available • MARMOT

  5. What is MARMOT? • Tool for the development of MPI applications • Automatic runtime analysis of the application: • Detect incorrect use of MPI • Detect non-portable constructs • Detect possible race conditions and deadlocks • MARMOT does not require source code modifications, just relinking • The C and Fortran bindings of MPI-1.2 are supported

  6. Design of MARMOT (architecture diagram): the application issues MPI calls through the profiling interface, where the MARMOT core tool intercepts them before passing them on to the native MPI library; a debug server runs as an additional process alongside the application processes.
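This interception works through the standard MPI profiling interface: a tool can provide its own definition of each MPI_Xxx routine and forward to the real implementation via the corresponding PMPI_Xxx entry point, which is why only relinking of the application is needed. The fragment below is a hypothetical sketch of that pattern (not MARMOT's actual source); the checks shown mirror the warnings that appear in the log files later in this deck:

  #include <stdio.h>
  #include "mpi.h"

  /* The tool defines its own MPI_Recv; at link time the application's calls
  ** resolve to this wrapper instead of the MPI library's entry point. */
  int MPI_Recv( void *buf, int count, MPI_Datatype datatype, int source,
                int tag, MPI_Comm comm, MPI_Status *status )
  {
      /* Illustrative local checks performed before the call is executed. */
      if( source == MPI_ANY_SOURCE )
          fprintf( stderr, "WARNING: MPI_Recv: Use of MPI_ANY_SOURCE may cause race conditions!\n" );
      if( tag > 32767 )
          fprintf( stderr, "WARNING: MPI_Recv: tag = %d > 32767 is not guaranteed to be portable\n", tag );

      /* Hand the call on to the real MPI library via the profiling interface. */
      return PMPI_Recv( buf, count, datatype, source, tag, comm, status );
  }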

  7. Examples of Client Checks: verification on the local nodes • Verification of proper construction and usage of MPI resources such as communicators, groups, datatypes, etc., for example: • Verification of MPI_Request usage • invalid recycling of an active request • invalid use of an unregistered request • warning if the number of requests is zero • warning if all requests are MPI_REQUEST_NULL • Check for pending messages and active requests in MPI_Finalize • Verification of all other arguments such as ranks, tags, etc.
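The warning about an array in which all requests are MPI_REQUEST_NULL typically points to a completion call that is issued twice. The following hypothetical two-process sketch (not taken from MARMOT's test suite) shows the situation: the second MPI_Waitall is legal MPI, but almost certainly a logic error that a runtime checker can flag:

  #include <stdio.h>
  #include "mpi.h"

  int main( int argc, char **argv )
  {
      int rank, partner, a = 0, b = 1;
      MPI_Request reqs[2];
      MPI_Status  stats[2];

      MPI_Init( &argc, &argv );
      MPI_Comm_rank( MPI_COMM_WORLD, &rank );
      partner = 1 - rank;               /* run with exactly 2 processes */

      MPI_Irecv( &a, 1, MPI_INT, partner, 17, MPI_COMM_WORLD, &reqs[0] );
      MPI_Isend( &b, 1, MPI_INT, partner, 17, MPI_COMM_WORLD, &reqs[1] );

      /* Completes both operations and sets both requests to MPI_REQUEST_NULL. */
      MPI_Waitall( 2, reqs, stats );

      /* All requests are now MPI_REQUEST_NULL: legal, but suspicious. */
      MPI_Waitall( 2, reqs, stats );

      MPI_Finalize();
      return 0;
  }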

  8. Examples of Server Checks: verification between the nodes, control of the program • Everything that requires a global view • Control the execution flow, trace the MPI calls on each node throughout the whole application • Signal conditions, e.g. deadlocks (with traceback on each node) • Check matching send/receive pairs for consistency • Check collective calls for consistency • Output of a human-readable log file
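An example of a mismatch that only such a global view can detect is a send/receive pair whose type signatures disagree: each call looks correct locally, but together they are inconsistent. A minimal hypothetical illustration (not an actual MARMOT test case):

  #include <stdio.h>
  #include "mpi.h"

  int main( int argc, char **argv )
  {
      int        rank;
      int        idata[10] = { 0 };
      double     ddata[10];
      MPI_Status status;

      MPI_Init( &argc, &argv );
      MPI_Comm_rank( MPI_COMM_WORLD, &rank );

      if( rank == 0 ){
          /* Sent as 10 integers ... */
          MPI_Send( idata, 10, MPI_INT, 1, 17, MPI_COMM_WORLD );
      }
      if( rank == 1 ){
          /* ... but received as 10 doubles: the pair is inconsistent.
          ** Neither process can see this on its own; a server-side check can. */
          MPI_Recv( ddata, 10, MPI_DOUBLE, 0, 17, MPI_COMM_WORLD, &status );
      }

      MPI_Finalize();
      return 0;
  }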

  9. MARMOT within CrossGrid

  10. CrossGrid Application Tasks (HEP data acquisition pyramid): the detector delivers data at 40 MHz (40 TB/sec); level 1 (special hardware) reduces the rate to 75 kHz (75 GB/sec); level 2 (embedded processors) to 5 kHz (5 GB/sec); level 3 (PCs) to 100 Hz (100 MB/sec) for data recording & offline analysis.

  11. CrossGrid Application WP 1.4: Air pollution modelling (STEM-II) • Air pollution modelling with the STEM-II model • Transport equation solved with the Petrov-Crank-Nicolson-Galerkin method • Chemistry and mass transfer are integrated using semi-implicit Euler and pseudo-analytical methods • 15500 lines of Fortran code • 12 different MPI calls: • MPI_Init, MPI_Comm_size, MPI_Comm_rank, MPI_Type_extent, MPI_Type_struct, MPI_Type_commit, MPI_Type_hvector, MPI_Bcast, MPI_Scatterv, MPI_Barrier, MPI_Gatherv, MPI_Finalize.

  12. STEM application on an IA32 cluster with Myrinet

  13. Conclusion • MARMOT supports the C and Fortran bindings of MPI-1.2 • Tested with several applications and benchmarks • Tested on different platforms, using different compilers and MPI implementations, e.g. • IA32/IA64 clusters (Intel and g++ compilers) with mpich (CrossGrid testbed, …) • IBM Regatta • NEC SX5, SX6 • Hitachi SR8000, Cray T3E

  14. Impact and Exploitation • MARMOT is mentioned in the U.S. DoD High Performance Computing Modernization Program (HPCMP) Programming Environment and Training (PET) project as one of two freely available tools (MPI-Check and MARMOT) for the development of portable MPI programs. • Contact with the developers of the two other known verification tools similar to MARMOT (MPI-Check, Umpire) • Contact with users outside CrossGrid • MARMOT has been presented at various conferences and in several publications: http://www.hlrs.de/organization/tsc/projects/marmot/pubs.html

  15. Future Work • Scalability and general performance improvements • Better user interface to present problems and warnings • Extended functionality: • more tests to verify collective calls • MPI-2 • Hybrid programming

  16. Thanks to all CrossGrid partners: http://www.eu-crossgrid.org/ Thanks for your attention. Additional information and download: http://www.hlrs.de/organization/tsc/projects/marmot

  17. Back Up

  18. Impact and Exploitation • MARMOT is mentioned in the U.S. DoD High Performance Computing Modernization Program (HPCMP) Programming Environment and Training (PET) project as one of two freely available tools (MPI-Check and MARMOT) for the development of portable MPI programs. • Contact with the developers of the two other known verification tools similar to MARMOT (MPI-Check, Umpire) • MARMOT has been presented at various conferences and in several publications: http://www.hlrs.de/organization/tsc/projects/marmot/pubs.html

  19. Impact and Exploitation cont’d • Tescico proposal (“Technical and Scientific Computing”) • French/German consortium: • Bull, CAPS Entreprise, Université Versailles, • T-Systems, Intel Germany, GNS, Universität Dresden, HLRS • ITEA-labelled • French authorities agreed to give funding • German authorities did not

  20. Impact and Exploitation cont’d • MARMOT is integrated in the HLRS training classes • Deployment in Germany: MARMOT is in use in two of the three National High Performance Computing Centers: • HLRS at Stuttgart (IA32, IA64, NEC SX) • NIC at Jülich (IBM Regatta): • just replaced their Cray T3E with an IBM Regatta • spent a lot of time finding a problem that would have been detected automatically by MARMOT

  21. Feedback from CrossGrid Applications • Task 1.1 (biomedical): • C application • Identified issues: • possible race conditions due to use of MPI_ANY_SOURCE • deadlock • Task 1.2 (flood): • Fortran application • Identified issues: • tags outside of the valid range • possible race conditions due to use of MPI_ANY_SOURCE • Task 1.3 (HEP): • ANN (C application) • no issues found by MARMOT • Task 1.4 (meteo): • STEM-II (Fortran) • MARMOT detected holes in self-defined datatypes used in MPI_Scatterv and MPI_Gatherv. These holes were removed, which helped to improve the performance of the communication.
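Such holes typically come from alignment padding inside a record over which a derived datatype is built. The sketch below illustrates the mechanism in C with MPI-1 calls (hypothetical field names; the STEM-II code itself is Fortran): the extent of the datatype is larger than the payload it actually carries, and that difference is the hole that slows down communication:

  #include <stdio.h>
  #include "mpi.h"

  /* Hypothetical record: most compilers insert 4 bytes of padding between
  ** 'flag' and 'value' so that the double is 8-byte aligned. That padding
  ** shows up as a hole in the derived datatype built below. */
  struct cell {
      int    flag;
      double value;
  };

  int main( int argc, char **argv )
  {
      struct cell  c;
      MPI_Datatype cell_type;
      MPI_Datatype types[2]    = { MPI_INT, MPI_DOUBLE };
      int          blocklen[2] = { 1, 1 };
      MPI_Aint     base, disp[2], extent;

      MPI_Init( &argc, &argv );

      /* Member displacements relative to the start of the struct. */
      MPI_Address( &c, &base );
      MPI_Address( &c.flag, &disp[0] );
      MPI_Address( &c.value, &disp[1] );
      disp[0] -= base;
      disp[1] -= base;

      MPI_Type_struct( 2, blocklen, disp, types, &cell_type );
      MPI_Type_commit( &cell_type );

      /* On a typical platform the extent is 16 bytes although only 12 bytes
      ** carry data: the remaining 4 bytes are the hole. */
      MPI_Type_extent( cell_type, &extent );
      printf( "extent = %ld bytes\n", (long)extent );

      MPI_Type_free( &cell_type );
      MPI_Finalize();
      return 0;
  }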

  22. Examples of Log File
54 rank 1 performs MPI_Cart_shift
55 rank 2 performs MPI_Cart_shift
56 rank 0 performs MPI_Send
57 rank 1 performs MPI_Recv
WARNING: MPI_Recv: Use of MPI_ANY_SOURCE may cause race conditions!
58 rank 2 performs MPI_Recv
WARNING: MPI_Recv: Use of MPI_ANY_SOURCE may cause race conditions!
59 rank 0 performs MPI_Send
60 rank 1 performs MPI_Recv
WARNING: MPI_Recv: Use of MPI_ANY_SOURCE may cause race conditions!
61 rank 0 performs MPI_Send
62 rank 1 performs MPI_Bcast
63 rank 2 performs MPI_Recv
WARNING: MPI_Recv: Use of MPI_ANY_SOURCE may cause race conditions!
64 rank 0 performs MPI_Pack
65 rank 2 performs MPI_Bcast
66 rank 0 performs MPI_Pack

  23. Examples of Log File (continued)
7883 rank 2 performs MPI_Barrier
7884 rank 0 performs MPI_Sendrecv
7885 rank 1 performs MPI_Sendrecv
7886 rank 2 performs MPI_Sendrecv
7887 rank 0 performs MPI_Sendrecv
7888 rank 1 performs MPI_Sendrecv
7889 rank 2 performs MPI_Sendrecv
7890 rank 0 performs MPI_Barrier
7891 rank 1 performs MPI_Barrier
7892 rank 2 performs MPI_Barrier
7893 rank 0 performs MPI_Sendrecv
7894 rank 1 performs MPI_Sendrecv

  24. Examples of Log File (Deadlock)
9310 rank 1 performs MPI_Sendrecv
9311 rank 2 performs MPI_Sendrecv
9312 rank 0 performs MPI_Barrier
9313 rank 1 performs MPI_Barrier
9314 rank 2 performs MPI_Barrier
9315 rank 1 performs MPI_Sendrecv
9316 rank 2 performs MPI_Sendrecv
9317 rank 0 performs MPI_Sendrecv
9318 rank 1 performs MPI_Sendrecv
9319 rank 0 performs MPI_Sendrecv
9320 rank 2 performs MPI_Sendrecv
9321 rank 0 performs MPI_Barrier
9322 rank 1 performs MPI_Barrier
9323 rank 2 performs MPI_Barrier
9324 rank 1 performs MPI_Comm_rank
9325 rank 1 performs MPI_Bcast
9326 rank 2 performs MPI_Comm_rank
9327 rank 2 performs MPI_Bcast
9328 rank 0 performs MPI_Sendrecv
WARNING: all clients are pending!

  25. Examples of Log File (Deadlock: traceback on 0)
timestamp= 9298: MPI_Sendrecv(*sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 2, sendtag = 1, *recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 1, recvtag = 1, comm = self-defined communicator, *status)
timestamp= 9300: MPI_Sendrecv(*sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 1, sendtag = 1, *recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 2, recvtag = 1, comm = self-defined communicator, *status)
timestamp= 9304: MPI_Barrier(comm = MPI_COMM_WORLD)
timestamp= 9307: MPI_Sendrecv(*sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 2, sendtag = 1, *recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 1, recvtag = 1, comm = self-defined communicator, *status)
timestamp= 9309: MPI_Sendrecv(*sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 1, sendtag = 1, *recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 2, recvtag = 1, comm = self-defined communicator, *status)
timestamp= 9312: MPI_Barrier(comm = MPI_COMM_WORLD)
timestamp= 9317: MPI_Sendrecv(*sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 2, sendtag = 1, *recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 1, recvtag = 1, comm = self-defined communicator, *status)
timestamp= 9319: MPI_Sendrecv(*sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 1, sendtag = 1, *recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 2, recvtag = 1, comm = self-defined communicator, *status)
timestamp= 9321: MPI_Barrier(comm = MPI_COMM_WORLD)
timestamp= 9328: MPI_Sendrecv(*sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 2, sendtag = 1, *recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 1, recvtag = 1, comm = self-defined communicator, *status)

  26. Examples of Log File (Deadlock: traceback on 1)
timestamp= 9301: MPI_Sendrecv(*sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 2, sendtag = 1, *recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 0, recvtag = 1, comm = self-defined communicator, *status)
timestamp= 9302: MPI_Barrier(comm = MPI_COMM_WORLD)
timestamp= 9306: MPI_Sendrecv(*sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 0, sendtag = 1, *recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 2, recvtag = 1, comm = self-defined communicator, *status)
timestamp= 9310: MPI_Sendrecv(*sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 2, sendtag = 1, *recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 0, recvtag = 1, comm = self-defined communicator, *status)
timestamp= 9313: MPI_Barrier(comm = MPI_COMM_WORLD)
timestamp= 9315: MPI_Sendrecv(*sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 0, sendtag = 1, *recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 2, recvtag = 1, comm = self-defined communicator, *status)
timestamp= 9318: MPI_Sendrecv(*sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 2, sendtag = 1, *recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 0, recvtag = 1, comm = self-defined communicator, *status)
timestamp= 9322: MPI_Barrier(comm = MPI_COMM_WORLD)
timestamp= 9324: MPI_Comm_rank(comm = MPI_COMM_WORLD, *rank)
timestamp= 9325: MPI_Bcast(*buffer, count = 3, datatype = MPI_DOUBLE, root = 0, comm = MPI_COMM_WORLD)

  27. Examples of Log File (Deadlock: traceback on 2)
timestamp= 9303: MPI_Sendrecv(*sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 0, sendtag = 1, *recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 1, recvtag = 1, comm = self-defined communicator, *status)
timestamp= 9305: MPI_Barrier(comm = MPI_COMM_WORLD)
timestamp= 9308: MPI_Sendrecv(*sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 1, sendtag = 1, *recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 0, recvtag = 1, comm = self-defined communicator, *status)
timestamp= 9311: MPI_Sendrecv(*sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 0, sendtag = 1, *recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 1, recvtag = 1, comm = self-defined communicator, *status)
timestamp= 9314: MPI_Barrier(comm = MPI_COMM_WORLD)
timestamp= 9316: MPI_Sendrecv(*sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 1, sendtag = 1, *recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 0, recvtag = 1, comm = self-defined communicator, *status)
timestamp= 9320: MPI_Sendrecv(*sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 0, sendtag = 1, *recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 1, recvtag = 1, comm = self-defined communicator, *status)
timestamp= 9323: MPI_Barrier(comm = MPI_COMM_WORLD)
timestamp= 9326: MPI_Comm_rank(comm = MPI_COMM_WORLD, *rank)
timestamp= 9327: MPI_Bcast(*buffer, count = 3, datatype = MPI_DOUBLE, root = 0, comm = MPI_COMM_WORLD)

  28. Classical Solutions I: Parallel Debugger • Examples: TotalView, DDT, p2d2 • Advantages: • Same approach and tool as in the serial case • Disadvantages: • Can only fix problems after, and if, they occur • Scalability: how can you debug programs that crash after 3 hours on 512 nodes? • Reproducibility: how do you debug a program that crashes only every fifth time? • Does not help to improve portability

  29. Classical Solutions II: Debug Version of the MPI Library • Examples: • catches some incorrect usage, e.g. node count in MPI_CART_CREATE (mpich) • deadlock detection (NEC MPI) • Advantages: • good scalability • better debugging in combination with TotalView • Disadvantages: • Portability: only helps for this particular MPI implementation • Trade-off between performance and safety • Reproducibility: does not help to debug irreproducible programs
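The mpich check mentioned above targets mistakes like the following hypothetical fragment, in which the requested Cartesian grid needs more processes than the communicator provides; a debug MPI library, or a tool such as MARMOT, can report the inconsistent node count at the call site:

  #include <stdio.h>
  #include "mpi.h"

  int main( int argc, char **argv )
  {
      int      size;
      int      dims[2]    = { 4, 4 };   /* 4 x 4 grid = 16 processes */
      int      periods[2] = { 0, 0 };
      MPI_Comm cart;

      MPI_Init( &argc, &argv );
      MPI_Comm_size( MPI_COMM_WORLD, &size );

      /* If the program is started with fewer than 16 processes, the grid
      ** does not fit into MPI_COMM_WORLD and the call is erroneous. */
      MPI_Cart_create( MPI_COMM_WORLD, 2, dims, periods, 1, &cart );

      MPI_Finalize();
      return 0;
  }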

  30. Comparison with related projects

  31. Comparison with related projects

  32. Examples

  33. Example 1: request-reuse (source code)
/*
** Here we re-use a request we didn't free before
*/
#include <stdio.h>
#include <assert.h>
#include "mpi.h"

int main( int argc, char **argv )
{
    int size = -1;
    int rank = -1;
    int value = -1;
    int value2 = -1;
    MPI_Status send_status, recv_status;
    MPI_Request send_request, recv_request;

    printf( "We call Irecv and Isend with non-freed requests.\n" );
    MPI_Init( &argc, &argv );
    MPI_Comm_size( MPI_COMM_WORLD, &size );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    printf( " I am rank %d of %d PEs\n", rank, size );

  34. Example 1: request-reuse (source code continued)
    if( rank == 0 ){
        /*** this is just to get the request used ***/
        MPI_Irecv( &value, 1, MPI_INT, 1, 18, MPI_COMM_WORLD, &recv_request );
        /*** going to receive the message and reuse a non-freed request ***/
        MPI_Irecv( &value, 1, MPI_INT, 1, 17, MPI_COMM_WORLD, &recv_request );
        MPI_Wait( &recv_request, &recv_status );
        assert( value == 19 );
    }
    if( rank == 1 ){
        value2 = 19;
        /*** this is just to use the request ***/
        MPI_Isend( &value, 1, MPI_INT, 0, 18, MPI_COMM_WORLD, &send_request );
        /*** going to send the message ***/
        MPI_Isend( &value2, 1, MPI_INT, 0, 17, MPI_COMM_WORLD, &send_request );
        MPI_Wait( &send_request, &send_status );
    }
    MPI_Finalize();
    return 0;
}

  35. Example 1: request-reuse (output log)
We call Irecv and Isend with non-freed requests.
1 rank 0 performs MPI_Init
2 rank 1 performs MPI_Init
3 rank 0 performs MPI_Comm_size
4 rank 1 performs MPI_Comm_size
5 rank 0 performs MPI_Comm_rank
6 rank 1 performs MPI_Comm_rank
I am rank 0 of 2 PEs
7 rank 0 performs MPI_Irecv
I am rank 1 of 2 PEs
8 rank 1 performs MPI_Isend
9 rank 0 performs MPI_Irecv
10 rank 1 performs MPI_Isend
ERROR: MPI_Irecv Request is still in use !!
11 rank 0 performs MPI_Wait
ERROR: MPI_Isend Request is still in use !!
12 rank 1 performs MPI_Wait
13 rank 0 performs MPI_Finalize
14 rank 1 performs MPI_Finalize

  36. Example 2: deadlock (source code)
/* This program produces a deadlock.
** At least 2 nodes are required to run the program.
**
** Rank 0 recv a message from Rank 1.
** Rank 1 recv a message from Rank 0.
**
** AFTERWARDS:
** Rank 0 sends a message to Rank 1.
** Rank 1 sends a message to Rank 0.
*/
#include <stdio.h>
#include "mpi.h"

int main( int argc, char** argv ){
    int rank = 0;
    int size = 0;
    int dummy = 0;
    MPI_Status status;

  37. Example 2: deadlock (source code continued)
    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    MPI_Comm_size( MPI_COMM_WORLD, &size );

    if( size < 2 ){
        fprintf( stderr, " This program needs at least 2 PEs!\n" );
    } else {
        if( rank == 0 ){
            MPI_Recv( &dummy, 1, MPI_INT, 1, 17, MPI_COMM_WORLD, &status );
            MPI_Send( &dummy, 1, MPI_INT, 1, 18, MPI_COMM_WORLD );
        }
        if( rank == 1 ){
            MPI_Recv( &dummy, 1, MPI_INT, 0, 18, MPI_COMM_WORLD, &status );
            MPI_Send( &dummy, 1, MPI_INT, 0, 17, MPI_COMM_WORLD );
        }
    }
    MPI_Finalize();
}

  38. Example 2: deadlock (output log)
$ mpirun -np 3 deadlock1
1 rank 0 performs MPI_Init
2 rank 1 performs MPI_Init
3 rank 0 performs MPI_Comm_rank
4 rank 1 performs MPI_Comm_rank
5 rank 0 performs MPI_Comm_size
6 rank 1 performs MPI_Comm_size
7 rank 0 performs MPI_Recv
8 rank 1 performs MPI_Recv
8 Rank 0 is pending!
8 Rank 1 is pending!
WARNING: deadlock detected, all clients are pending

  39. Example 2: deadlock (output log continued)
Last calls (max. 10) on node 0:
timestamp = 1: MPI_Init( *argc, ***argv )
timestamp = 3: MPI_Comm_rank( comm, *rank )
timestamp = 5: MPI_Comm_size( comm, *size )
timestamp = 7: MPI_Recv( *buf, count = -1, datatype = non-predefined datatype, source = -1, tag = -1, comm, *status )
Last calls (max. 10) on node 1:
timestamp = 2: MPI_Init( *argc, ***argv )
timestamp = 4: MPI_Comm_rank( comm, *rank )
timestamp = 6: MPI_Comm_size( comm, *size )
timestamp = 8: MPI_Recv( *buf, count = -1, datatype = non-predefined datatype, source = -1, tag = -1, comm, *status )

  40. Bandwidth on an IA64 cluster with Myrinet

  41. Latency on an IA64 cluster with Myrinet

  42. cg.B on an IA32 cluster with Myrinet

  43. is.B on an IA32 cluster with Myrinet

  44. is.A
