FT-MPI is a fault-tolerant implementation of the Message Passing Interface (MPI), developed under the DOE HARNESS project. It allows applications to keep running despite application or system failures, recovering and maintaining communication among the surviving processes. The system offers several modes of recovery (ABORT, BLANK, SHRINK, REBUILD, and REBUILD_ALL), letting users manage process failures flexibly. FT-MPI is particularly beneficial for distributed applications and large-scale computations, providing a sophisticated way to improve reliability in parallel computing environments.
Building and using an FT-MPI implementation
Graham Fagg, HLRS / UTK, fagg@hlrs.de
What is FT-MPI • FT-MPI is a fault-tolerant MPI system developed under the DOE HARNESS project • What does fault tolerant mean? • An application or system failure does not cause instant application termination. • The application gets to decide what happens, at the MPI API.
What is FT-MPI • Why fault-tolerant MPI? • MTBF_node < JobRun? • OK for small jobs and a small number of nodes… • but (MTBF_node / nodes) < JobRun_MuchBiggerJob • Or you have a distributed application • Or a very distributed application • Or a very large, very distributed application -> GRID…
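For a rough sense of scale (illustrative numbers, not from the slides): if MTBF_node is about 5 years, a 1000-node machine sees a node failure roughly every 5 * 8760 / 1000 ≈ 44 hours, so any job that runs for more than a day or two should expect to lose at least one node.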
[Diagram, normal MPI semantics: a communicator at the logical layer spanning Rank 0, Rank 1 and Rank 2, mapped onto processes at the physical layer; when one process dies the whole job terminates with "Host% Abort: error code XX".]
FT-MPI handling of errors • Under FT-MPI, when a member of a communicator dies: • The communicator state changes to indicate a problem • Message transfers can continue if safe, or be stopped (ignored) • To continue: • The user's application can 'fix' the communicators, or abort.
Fixing communicators • Fixing a communicator really means deciding when it is safe to continue. • The application must decide: • Which processes are still members • Which messages need to be sent • The fix happens when a collective communicator creation occurs • MPI_Comm_create / MPI_Comm_dup, etc. • Special shortcut: a dup of MPI_COMM_WORLD
5 ways to patch it up • There are 5 modes of recovery; they affect the size (extent) and ordering of the communicators • ABORT: just do as other implementations do • BLANK: leave holes • But make sure collectives do the right thing afterwards • SHRINK: re-order processes to make a contiguous communicator • Some ranks change • REBUILD: re-spawn lost processes and add them to MPI_COMM_WORLD • REBUILD_ALL: same as REBUILD, except it rebuilds all communicators and groups and resets all key values, etc.
Uses of different modes • BLANK • Good for parameter sweeps / Monte Carlo simulations, where process loss only means resending of data. • SHRINK • Same as BLANK, except where users need the communicator's size to match its extent • I.e. when using home-grown collectives
Uses of different modes • REBUILD • Applications that need a constant number of processes • Fixed grids / most solvers • REBUILD_ALL • Same as REBUILD, except it does a lot more work behind the scenes. • Useful for applications with multiple communicators (one for each dimension) and some of the key values, etc. • Slower, with a slightly higher overhead, due to the extra state it has to distribute • (A REBUILD recovery sketch, from the application's side, follows below.)
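A minimal sketch of REBUILD-mode recovery from the application's point of view, using only standard MPI calls. The function name and bookkeeping are illustrative assumptions, not FT-MPI API; in REBUILD mode the lost ranks have already been re-spawned, and the collective MPI_Comm_dup is the point where the 'fix' happens (see "Fixing communicators" above).

#include <mpi.h>

/* Hedged sketch: repair a communicator in REBUILD mode, then let the
   application re-query its rank/size and redistribute its work.        */
void rebuild_recover (MPI_Comm *com)
{
    MPI_Comm newcom;
    int rank, size;

    MPI_Comm_dup (*com, &newcom);     /* collective call performs the fix       */
    if (*com != MPI_COMM_WORLD)       /* MPI_COMM_WORLD itself cannot be freed  */
        MPI_Comm_free (com);
    *com = newcom;

    MPI_Comm_rank (*com, &rank);      /* ranks and size may need re-querying    */
    MPI_Comm_size (*com, &size);
    /* application's job: redistribute work, reload checkpointed state, …       */
}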
Using FT-MPI
/* "make sure it is sent" example */
do {
    rc = MPI_Send (…, com);
    if (rc == MPI_ERR_OTHER) {
        /* a process died: fix the communicator, then retry the send */
        MPI_Comm_dup (com, &newcom);
        MPI_Comm_free (&com);
        com = newcom;
    }
} while (rc != MPI_SUCCESS);
Using FT-MPI • Checking every call is not always necessary • A master/slave code may only need a few of the operations in the master's code checked (sketched below).
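A hedged sketch of that master-side-only checking (the tag value, task bookkeeping and result handling are illustrative assumptions, not from the slides): only the master's receive is checked, and on error the master repairs the communicator and re-issues the lost task.

#include <mpi.h>

/* Master loop: only the receive from the workers is checked.
   com is assumed to be a duplicated communicator, not MPI_COMM_WORLD itself. */
void master_loop (MPI_Comm *com, int ntasks)
{
    int task, result, rc;
    MPI_Status status;

    for (task = 0; task < ntasks; task++) {
        rc = MPI_Recv (&result, 1, MPI_INT, MPI_ANY_SOURCE, 0, *com, &status);
        if (rc != MPI_SUCCESS) {
            /* a worker died: fix the communicator and redo this task */
            MPI_Comm newcom;
            MPI_Comm_dup (*com, &newcom);
            MPI_Comm_free (com);
            *com = newcom;
            task--;
            continue;
        }
        /* record the result and hand out the next piece of work here */
    }
}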
Using FT-MPI • Using FT-MPI on something more complex… • Made worse by structured programming that hides MPI calls below many layers. • The layers are usually in different libraries.
Using FT-MPI
[Diagram, repeated over three slides: a layered application, "Build an unstructured grid" -> "Distribute some work" -> "Solve my part" (Do I=0, XXX … MPI_Sendrecv ( ) …). The failure is detected via an MPI call deep inside the solver loop ("you are here"), but the application needs to fix it up at the top level, where the grid is built and the work is distributed.]
Using FT-MPI • Using MPI error handlers makes for neater code. • All application recovery operations can occur in the user's handler, so not every MPI call needs to be checked.
/* install my recovery handler just once */
MPI_Errhandler_create (my_recover_function, &errh);
MPI_Errhandler_get (MPI_COMM_WORLD, &orghandler);  /* save the original handler */
MPI_Errhandler_set (MPI_COMM_WORLD, errh);
MPI_Errhandler_free (&errh);  /* the communicator keeps its own reference */
/* all communicators created from now on get this handler */

/* line-by-line checking */
if (MPI_Send (…) != MPI_SUCCESS) { call recovery … }
if (MPI_Recv (…) != MPI_SUCCESS) { call recovery … }
if (MPI_Scatter (…) != MPI_SUCCESS) { call recovery … }

/* automatic checking, with the handler installed */
MPI_Send (…)
MPI_Recv (…)
MPI_Scatter (…)
[Flowchart, shown over three slides:
• Normal startup: rc = MPI_Init (…) -> install error handler & set LongJMP -> call Solver (…) -> MPI_Finalize (…).
• On error (raised automatically via the MPI library): the error handler does recover ( ), then does the JMP, and Solver (…) is called again.
• Restarted process: rc = MPI_Init (…) returns rc == MPI_Restarted -> set LongJMP / error handler -> "I am new", so do recover ( ) first -> call Solver (…) -> MPI_Finalize (…).]
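A hedged C sketch of those three flows. The MPI_Restarted return code is taken from the slide and is FT-MPI specific; recover, solver and the global jump buffer are illustrative assumptions, not FT-MPI API.

#include <mpi.h>
#include <setjmp.h>

static jmp_buf restart_point;
static MPI_Comm world;

static void recover (void) { /* reload state, redistribute work (application specific) */ }
static void solver (MPI_Comm com) { /* main computation loop */ }

/* Error handler: repair the communicator, then jump back to the restart
   point instead of returning into the failed MPI call.                  */
static void my_recover_function (MPI_Comm *comm, int *err, ...)
{
    MPI_Comm newcom;
    MPI_Comm_dup (*comm, &newcom);    /* collective call performs the fix */
    world = newcom;
    longjmp (restart_point, 1);
}

int main (int argc, char **argv)
{
    MPI_Errhandler errh;
    int rc = MPI_Init (&argc, &argv);

    world = MPI_COMM_WORLD;
    MPI_Errhandler_create (my_recover_function, &errh);
    MPI_Errhandler_set (MPI_COMM_WORLD, errh);

    /* re-enter here after an error (longjmp) or a restart (FT-MPI specific rc) */
    if (setjmp (restart_point) != 0 || rc == MPI_Restarted)
        recover ();

    solver (world);
    MPI_Finalize ();
    return 0;
}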
Implementation details • Built in multiple layers • Has tuned collectives and user-derived datatype handling • Users need to re-compile/re-link against libftmpi and start the application with the ftmpirun command • Can be run both with and without a HARNESS core: • with a core, it uses FT-MPI plug-ins • standalone, it uses an extra daemon on each host to facilitate startup and (failure) monitoring
Implementation details • Distributed recovery (rough sketch below) • Uses a single, dynamically created 'master' list of 'living' nodes • The list is compiled by a 'leader' • The leader is picked by using an atomic swap on a record in 'some naming service' • The list is distributed by an atomic broadcast • Can survive multiple nested failures…
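A rough C sketch of that recovery protocol. Every type and function here (node_list, ns_test_and_set, ns_read_list, atomic_bcast) is a hypothetical placeholder for FT-MPI/HARNESS internals that the slides do not name; it only illustrates the leader election and list distribution described above.

/* hypothetical name-service and group-communication primitives */
typedef struct { int nprocs; int ids[1024]; } node_list;
extern int  ns_test_and_set (const char *key, int my_id);       /* atomic swap: 1 if we won      */
extern void ns_read_list    (const char *key, node_list *out);  /* which nodes checked in alive  */
extern void atomic_bcast    (node_list *list, int i_am_leader); /* all-or-nothing broadcast      */

void recover_membership (int my_id)
{
    node_list living;

    /* 1. Leader election: an atomic swap on a record in the naming service;
          the first surviving process to succeed becomes the leader.         */
    if (ns_test_and_set ("recovery-leader", my_id)) {
        /* 2. The leader compiles the single master list of living nodes.    */
        ns_read_list ("living-nodes", &living);
        /* 3. The list goes out via an atomic broadcast, so every survivor
              sees the same membership.                                       */
        atomic_bcast (&living, 1);
    } else {
        /* Non-leaders wait for the broadcast; if the leader itself dies,
           the election and broadcast are simply repeated (nested failures). */
        atomic_bcast (&living, 0);
    }
}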
Implementation details
[Architecture diagram: a Name Service and an Ftmpi_notifier, plus two hosts, each running an MPI application linked against libftmpi together with a Startup_d daemon.]
Status and future • Beta version • Limited number of MPI functions supported • Currently working on getting PETSc (the Portable, Extensible Toolkit for Scientific Computation from ANL) working in an FT mode • Target of 100+ functions by SC2002. • WHY so many? Every real-world library uses more than the '6' required MPI functions… if it is in the standard, it will be used. • Covers all major classes of functions in MPI. • Future work • Templates for different classes of MPI applications so users can build on our work • Some MPI-2 support (PIO?) • Dynamic tasks are easy for us!
Conclusion • Not Condor for MPI • Can do more than a reload-restart • The application must do some work • But it decides what • Middleware to build FT applications with • I.e. do we know how to do this kind of recovery yet?? • Not a slower alternative • The cost is at recovery time (mostly) • The standard gets in the way
Links and further information • HARNESS and FT-MPI at UTK/ICL http://icl.cs.utk.edu/harness/ • HARNESS at Emory University http://www.mathcs.emory.edu/harness/ • HARNESS at ORNL http://www.epm.ornl.gov/harness/