
HARNESS and Fault Tolerant MPI



  1. HARNESS and Fault Tolerant MPI Graham E Fagg A Joint project with Emory University and ORNL

  2. Harness and FT-MPI • A little on HARNESS and plug-ins for FT-MPI • Why FT-MPI • FT-MPI • G_hcore • Conclusions and Futures

  3. HARNESS • HARNESS • Heterogeneous Adaptable Reconfigurable NEtworked SystemS • Also described as a • “distributed, reconfigurable and heterogeneous computing environment that supports dynamically adaptable parallel applications”

  4. HARNESS • Distributed Virtual Machines (DVMs) built from HARNESS core (hcore) services provided by daemons • Think back to PVM and pvmds • A local DVM -> Personal Grid?

  5. HARNESS

  6. HARNESS

  7. HARNESS Plug-in repository

  8. HARNESS • Research project, so there are different versions of the hcore • Emory has produced a Java DVM system • Vaidy talked about this earlier • ORNL a C based system for experimenting with a Symmetric Control Algorithm • ICL/UTK a C based system for experimenting with FT-MPI, replication of MetaData and Remote Invocation techniques • More later

  9. FT-MPI • Why FT-MPI and what is it? • A Fault tolerant MPI implementation • Harness needs an MPI implementation • Why not make it as survivable and dynamic as Harness ?

  10. FT-MPI • Current MPI applications live under the MPI fault tolerant model of no faults allowed. • This is great on an MPP as if you lose a node you generally lose a partition/job anyway. • Makes reasoning about results easy. If there was a fault you might have received incomplete/incorrect values and hence have the wrong result anyway.

  11. FT-MPI • Do we need a FT version? • As we push towards PetaFlop systems with 10,000-100,000+ nodes, the MTBF starts becoming a problem • A Pacific Blue benchmark of 5800 CPUs for almost 7 hours took only two attempts due to hardware failures.

  12. FT-MPI • Real goals: • an efficient MPI implementation for Harness • A test bed to develop a new generation of dynamic parallel algorithms on.

  13. Semantics of Failure • What is a failure ? • How do we detect failures? • What would they mean to communicators and communications? • Who is responsible for handling them? • How can we handle them?

  14. Semantics of Failure • What is a failure? • Direct loss of an MPI process • crash of an application sub-program • loss of a harness core • loss of a physical node

  15. Semantics of Failure • What is a failure? • Loss of communications with a node • crash of an application sub-program • loss of a harness core • loss of a physical node / NIC • partitioning of the network

  16. Semantics of Failure • What would they mean to communicators and communications? • Communicators are invalidated if there is a failure • They can be reused or rebuilt before they are valid again • Yes it means that there are operations that you can call on an invalid communicator.

  17. Semantics of Failure • Who is responsible for handling them? • The user's application is responsible, unless it has indicated to the system how to handle failures on its behalf.

  18. Semantics of Failure • Constraints on what we can do: • We support the MPI-1.X (and some of the MPI-2) API • current code should drop in unchanged • I.e. we avoid changing the MPI API, but instead change the semantics of some calls, overload others, introduce new constants, etc.

  19. Semantics of Failure • Communicators, and communications within them, follow modes of operation upon errors, based on their states. • There are two types of mode controllable by the application • Communicator modes and communication modes

  20. Semantics of Failure • Communicator states and modes • Under normal MPI • initialized?->OK->failed or exit (dead either way) • Under FT-MPI • FT_OK, FT_DETECTED, FT_RECOVER, FT_RECOVERED, FT_FAILED • or JUST • OK -> problem -> fail/dead or OK

  21. Semantics of Failure • Communicator states and modes • Modes set using MPI attribute calls • Modes: SHRINK, BLANK, REBUILD and ABORT • ABORT… default MPI behavior
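
A hedged sketch of what "modes set using MPI attribute calls" could look like in application code. The key and value names (FTMPI_COMM_MODE_KEY, FTMPI_MODE_REBUILD) are placeholders, not verified FT-MPI identifiers; only the use of an MPI-1 attribute call follows the slide.

    /* Hypothetical names: FTMPI_COMM_MODE_KEY and FTMPI_MODE_REBUILD stand
     * in for whatever key/value FT-MPI exports for mode selection.         */
    int mode = FTMPI_MODE_REBUILD;               /* or SHRINK, BLANK, ABORT */
    MPI_Attr_put (MPI_COMM_WORLD, FTMPI_COMM_MODE_KEY, &mode);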

  22. Semantics of Failure • Communicator states and modes • SHRINK • On a rebuild this forces the missing process to disappear from the communicator • Size changes, also process ranks may change
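
A minimal sketch of what the size/rank change means for application code: after a SHRINK-mode rebuild, re-query rank and size instead of trusting values cached at start-up. The rebuild-by-dup step mirrors slide 30; everything else is standard MPI.

    MPI_Comm newcomm;
    int new_rank, new_size;

    MPI_Comm_dup (comm, &newcomm);       /* rebuild step, as on slide 30    */
    MPI_Comm_free (&comm);
    comm = newcomm;

    MPI_Comm_rank (comm, &new_rank);     /* my rank may have changed        */
    MPI_Comm_size (comm, &new_size);     /* size has shrunk by the failures */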

  23.-25. Semantics of Failure • Communicator states and modes : SHRINK (figure slides)

  26. Semantics of Failure • Communicator states and modes : BLANK • Rebuild the communicator so that gaps are allowed • Size returns the extent of the communicator • P2P operations to a gap fail • collective operations will work, but beware of what you think the result should be...
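
A hedged fragment of how an application might cope with gaps under BLANK, assuming only what this slide states: size still reports the full extent and a point-to-point operation aimed at a gap fails. The helper mark_rank_as_gap and the message arguments are illustrative.

    int size, rank, rc;
    MPI_Comm_size (comm, &size);         /* still the full extent, gaps included */
    for (rank = 0; rank < size; rank++) {
        if (rank == my_rank) continue;
        rc = MPI_Send (buf, count, MPI_INT, rank, TAG, comm);
        if (rc != MPI_SUCCESS)
            mark_rank_as_gap (rank);     /* hypothetical bookkeeping helper */
    }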

  27. Semantics of Failure • Communicator states and modes : REBUILD • Automatic node recovery (re-spawning) when you rebuild a communicator that has died • the new process is inserted either in a gap or at the end • The new process is notified by the return value from MPI_Init • yes, check that value
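
A sketch of the "check that value" advice: a process spawned into the rebuilt communicator learns that it is a replacement from MPI_Init's return value. The constant name MPI_INIT_RESTARTED_NODE and the recovery routine are assumptions, not checked against the FT-MPI headers; the slide only states that MPI_Init's return value carries the notification.

    rc = MPI_Init (&argc, &argv);
    if (rc == MPI_INIT_RESTARTED_NODE) { /* assumed constant name            */
        /* we are a replacement process: restore state from a checkpoint or
         * from surviving peers before rejoining the computation            */
        recover_my_state ();             /* hypothetical application routine */
    }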

  28. Semantics of Failure • Communicator states and modes • How do we know? • MPI operation returns MPI_ERR_OTHER • Then you have to check attributes of the communicator
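
A hedged fragment of the check described here. The attribute key FTMPI_ERROR_STATE_KEY is a placeholder rather than a confirmed FT-MPI name; the FT_DETECTED value comes from the state list on slide 20 and the call pattern is plain MPI-1 MPI_Attr_get.

    rc = MPI_Send (buf, count, MPI_INT, dest, TAG, comm);
    if (rc == MPI_ERR_OTHER) {
        int *state, flag;
        /* FTMPI_ERROR_STATE_KEY is hypothetical: ask the communicator what
         * actually happened before deciding how to recover                 */
        MPI_Attr_get (comm, FTMPI_ERROR_STATE_KEY, &state, &flag);
        if (flag && *state == FT_DETECTED)
            rebuild_communicator (&comm);  /* e.g. via MPI_Comm_dup, slide 30 */
    }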

  29. Semantics of Failure • Communicator states and modes • How do we get from one state to another? • A communicator operation such as: • MPI_Comm_split, MPI_Comm_create, MPI_Comm_dup • MPI_COMM_WORLD can rebuild itself!

  30. Semantics of Failure • Communicator states and modes

  rc = MPI_Send (----, com);
  if (rc == MPI_ERR_OTHER) {
      MPI_Comm_dup (com, &newcom);     /* rebuild the communicator     */
      MPI_Comm_free (&com);
      com = newcom;
      /* continue.. */
      /* retry the Send on com here.. */
  }

  31. rc = MPI_Bcast ( initial_work….);
  if (rc == MPI_ERR_OTHER) reclaim_lost_work(…);
  while ( ! all_work_done) {
      if (work_allocated) {
          rc = MPI_Recv ( buf, ans_size, result_dt, MPI_ANY_SOURCE,
                          MPI_ANY_TAG, comm, &status);
          if (rc == MPI_SUCCESS) {
              handle_work (buf);
              free_worker (status.MPI_SOURCE);
              all_work_done--;
          }
          else {
              reclaim_lost_work (status.MPI_SOURCE);
              if (no_surviving_workers) { /* ! do something ! */ }
          }
      } /* work allocated */
      /* Get a new worker as we must have received a result or a death */
      rank = get_free_worker_and_allocate_work();
      if (rank) {
          rc = MPI_Send (… rank… );
          if (rc == MPI_ERR_OTHER) reclaim_lost_work (rank);
          if (no_surviving_workers) { /* ! do something ! */ }
      } /* if free worker */
  } /* while work to do */

  32.-36. (figure slides: master / workers communicator diagrams)

  37. Semantics of Failure • Communication states and message modes • How communications are handled can also be controlled • Just because a communicator has a problem does not mean the application halts until it is fixed..

  38. Semantics of Failure • Communication states and message modes • Two flavors • CONTINUE (cont) • NO-OP (NOP)

  39. Semantics of Failure • Communication states and message modes • CONT • All messages that can be sent are sent • (You always get to receive if a message is already waiting for you)

  40. Semantics of Failure • Communication states and message modes • NOP • You can not initiate any NEW communications • previous operations should complete if they are still valid • Designed to allow the thread of control for a failed application to float up layers as fast as possible.

  41. Semantics of Failure • The layer problem • Made worse by good software engineering and the use of multiple nested libraries.

  42. Semantics of Failure Build an unstructured grid

  43. Semantics of Failure Build an unstructured grid Distribute some work

  44. Semantics of Failure Build an unstructured grid Distribute some work Solve my part

  45. Semantics of Failure Build an unstructured grid Distribute some work Solve my part Do I=0, XXX …. MPI_Sendrecv ( ) …..

  46. Semantics of Failure Build an unstructured grid Distribute some work Solve my part Do I=0, XXX …. MPI_Sendrecv ( ) ….. Someone died somewhere

  47. Semantics of Failure Build an unstructured grid [I can fix it up here] Distribute some work Solve my part Do I=0, XXX …. MPI_Sendrecv ( ) ….. [Not down here]

  48. Semantics of Failure Build an unstructured grid [I can fix it up here] Distribute some work Solve my part Do I=0, XXX …. MPI_Sendrecv ( ) ….. [NOPs allow me to get out of this part FAST]
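
A compact sketch of the layering idea on slides 41-48, assuming only the NOP semantics from slide 40: once a failure is flagged, no new communication can be started, so each library layer simply hands the error code upward and only the top layer rebuilds and redistributes. All names apart from the MPI calls are illustrative.

    /* innermost layer: do not try to repair anything down here,
     * just float the failure up as fast as possible                       */
    int solve_my_part (MPI_Comm comm)
    {
        int i, rc;
        MPI_Status status;
        for (i = 0; i < XXX; i++) {
            rc = MPI_Sendrecv (sbuf, n, MPI_DOUBLE, right, TAG,
                               rbuf, n, MPI_DOUBLE, left,  TAG,
                               comm, &status);
            if (rc != MPI_SUCCESS)
                return rc;               /* pass the error up a layer       */
        }
        return MPI_SUCCESS;
    }

    /* top layer: the only place that knows enough to fix things           */
    MPI_Comm newcomm;
    rc = solve_my_part (comm);
    if (rc != MPI_SUCCESS) {
        MPI_Comm_dup (comm, &newcomm);   /* rebuild, as on slides 29-30     */
        MPI_Comm_free (&comm);
        comm = newcomm;
        redistribute_some_work (comm);   /* hypothetical: redo the distribution */
    }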

  49. Semantics of Failure • Communication states and message modes • Collective operations are dealt with differently than p2p • They will only return if the operation would have given the same answer for the surviving members as if no failure had occurred

  50. Semantics of Failure • Communication states and message modes • Collective operations fall into two classes • broadcast / scatter: succeed if a non-root node fails and the data survives • gather / reduce: fail if there is an error
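
A hedged fragment combining the two classes above with the recovery pattern from slides 29-30: a failed gather/reduce gives no usable result, so the communicator is rebuilt and the operation repeated. Only the MPI calls are real; buffers and counts are illustrative.

    MPI_Comm newcomm;
    do {
        rc = MPI_Reduce (local, global, n, MPI_DOUBLE, MPI_SUM, 0, comm);
        if (rc != MPI_SUCCESS) {             /* gather/reduce class: fail on error */
            MPI_Comm_dup (comm, &newcomm);   /* recover, as on slide 30            */
            MPI_Comm_free (&comm);
            comm = newcomm;
        }
    } while (rc != MPI_SUCCESS);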
