
FT-MPICH : Providing fault tolerance for MPI parallel applications


Presentation Transcript


1. FT-MPICH: Providing fault tolerance for MPI parallel applications. Prof. Heon Y. Yeom, Distributed Computing Systems Lab., Seoul National University

2. Motivation • Condor supports a Checkpoint/Restart (C/R) mechanism only in the Standard Universe, and only for single-process jobs • C/R for parallel jobs is not provided in any of the current Condor universes • We would like to make C/R available for MPI programs

3. Introduction • Why the Message Passing Interface (MPI)? • Designing a generic FT framework is extremely hard due to the diversity of hardware and software systems, so we have chosen the MPICH series • MPI is the most popular programming model in cluster computing • Providing fault tolerance at the MPI level is more cost-effective than providing it in the OS or hardware

4. Architecture (Concept) • [diagram] FT-MPICH is built around three components: Failure Detection, Monitoring, and a C/R Protocol

5. Architecture (Overall System) • [diagram] A central Management System talks to a Communication layer on each node over Ethernet; on every node, the Communication layer exchanges data with the local MPI Process through an IPC message queue

6. Management System • Makes MPI more reliable by handling Failure Detection, Initialization Coordination, Output Management, Checkpoint Coordination, Checkpoint Transfer, and Recovery

7. Manager System • [diagram] A Leader Manager coordinates one Local Manager per node, handling initialization, CKPT commands, CKPT transfer to stable storage, and failure notification & recovery • Each Local Manager supervises its MPI process • MPI processes communicate with each other directly to exchange data

8. Fault-tolerant MPICH_P4 • [layer diagram] FT-MPICH layers Collective Operations and P2P Operations over the ADI (Abstract Device Interface) and the ch_p4 device (Ethernet) • Added components: an FT Module providing Connection Re-establishment and Atomic Message Transfer, a Recovery Module, and a Checkpoint Toolkit

9. Startup in Condor • Preconditions • The Leader Manager already knows, from user input, the machines on which the MPI processes will run and the number of MPI processes • The Local Manager and MPI process binaries are located at the same path on each machine

10. Startup in Condor • Job submission description file • Vanilla Universe • A shell script is used in the submission description file: the executable points to a shell script that only executes the Leader Manager • Ex) exe.sh (shell script): #!/bin/sh Leader_manager … • Ex) Example.cmd: universe = Vanilla, executable = exe.sh, output = exe.out, error = exe.err, log = exe.log, queue

11. Startup in Condor • The user submits the job using condor_submit • Normal job startup: [diagram] the Central Manager of the Condor pool (Negotiator, Collector) matches the job; the Submit Machine (Schedd, Shadow) hands it to the Execute Machine (Startd, Starter), which launches the job, i.e., the Leader Manager

12. Startup in Condor • The Leader Manager fork()s and exec()s a Local Manager on each Execute Machine • Each Local Manager in turn fork()s and exec()s its MPI process • [diagram: the job (Leader Manager) spawning Local Managers and MPI Processes on Execute Machines 1–3]

13. Startup in Condor • Each MPI process sends its communication info to the Leader Manager, which aggregates it • The Leader Manager then broadcasts the aggregated info to all MPI processes (a sketch of this exchange follows) • [diagram: Local Managers and MPI Processes on Execute Machines 1–3 reporting to the Leader Manager]
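A minimal sketch of this gather-and-broadcast step, assuming a simple (rank, IP, port) table; the struct layout and the recv_info_from_process / send_table_to_process helpers are illustrative, not the actual FT-MPICH interface:

    /* Sketch: each MPI process reports its (rank, IP, port) to the Leader
     * Manager, which collects one entry per rank and then sends the full
     * table back so that every process can connect to every other one. */
    #include <stdint.h>

    struct comm_info {
        int      rank;
        char     ip[16];      /* dotted-quad IPv4 address */
        uint16_t port;        /* listening port of this rank */
    };

    /* hypothetical transport between the Leader Manager and the processes */
    void recv_info_from_process(int i, struct comm_info *out);
    void send_table_to_process(int i, const struct comm_info *table, int n);

    void aggregate_and_broadcast(struct comm_info *table, int nprocs)
    {
        for (int i = 0; i < nprocs; i++)      /* gather one entry per rank */
            recv_info_from_process(i, &table[i]);

        for (int i = 0; i < nprocs; i++)      /* broadcast the whole table */
            send_table_to_process(i, table, nprocs);
    }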

14. Fault Tolerant MPI • To provide MPI fault tolerance, we have adopted • A coordinated checkpointing scheme (vs. an independent scheme): the Leader Manager is the coordinator • Application-level checkpointing (vs. kernel-level checkpointing): this method requires no effort on the part of cluster administrators • A user-transparent checkpointing scheme (vs. user-aware): this method requires no modification of MPI source code (see the example below)
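To make the user-transparency point concrete: applications remain plain MPI code. The toy program below is a generic example (not taken from the slides); under this scheme it would be checkpointed and restarted without any source changes.

    /* An ordinary MPI program: nothing checkpoint-specific is needed,
     * since checkpointing happens transparently below the MPI layer. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        long local = 0, global = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* stand-in for a long-running computation */
        for (long i = 0; i < 100000000L; i++)
            local += (i + rank) % 7;

        /* combine the partial results on rank 0 */
        MPI_Reduce(&local, &global, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("result: %ld (computed by %d ranks)\n", global, size);

        MPI_Finalize();
        return 0;
    }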

15. Atomic Message Passing • Coordination between MPI processes • Assumption: the communication channel is FIFO • Lock() and Unlock() delimit an atomic region around a message transfer • A CKPT SIG that arrives outside the atomic region is handled immediately (the checkpoint is performed); one that arrives inside the region is delayed until Unlock() (see the sketch below) • [diagram: Proc 0 and Proc 1, with Lock()/Unlock() marking the atomic region and CKPT SIGs arriving before, inside, and after it]
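A minimal sketch of the delay logic, assuming a signal-driven implementation; the flag, the choice of SIGUSR1, and do_checkpoint() are illustrative, not the actual FT-MPICH internals (a real implementation would also avoid doing the heavy checkpoint work directly inside the handler):

    #include <signal.h>

    static volatile sig_atomic_t in_atomic    = 0;  /* inside Lock()/Unlock()? */
    static volatile sig_atomic_t ckpt_pending = 0;  /* CKPT SIG arrived while locked */

    void do_checkpoint(void);                       /* hypothetical: writes the image */

    static void ckpt_signal_handler(int sig)
    {
        (void)sig;
        if (in_atomic)
            ckpt_pending = 1;   /* checkpoint is delayed */
        else
            do_checkpoint();    /* checkpoint is performed */
    }

    void install_ckpt_handler(void)
    {
        signal(SIGUSR1, ckpt_signal_handler);       /* signal number is an assumption */
    }

    void Lock(void)
    {
        in_atomic = 1;
    }

    void Unlock(void)
    {
        in_atomic = 0;
        if (ckpt_pending) {     /* take the deferred checkpoint now */
            ckpt_pending = 0;
            do_checkpoint();
        }
    }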

16. Atomic Message Passing (Case 1) • When an MPI process receives a CKPT SIG, it sends and receives barrier messages before checkpointing • [diagram: Proc 0 and Proc 1 exchanging Barrier and Data messages around their Lock()/Unlock() regions, each receiving a CKPT SIG and taking a checkpoint]

17. Atomic Message Passing (Case 2) • By sending and receiving the barrier messages, any in-transit messages are pushed through to their destination • A CKPT SIG received inside the atomic region results in a delayed checkpoint • [diagram: Proc 0 and Proc 1 with the CKPT SIG arriving inside the Lock()/Unlock() region]

18. Atomic Message Passing (Case 3) • The communication channel between the MPI processes is flushed, so dependencies between MPI processes are removed before the (possibly delayed) checkpoint is taken • A combined sketch of the flush step follows • [diagram: Proc 0 and Proc 1 with CKPT SIGs inside the atomic region and delayed checkpoints]
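Putting the three cases together, a rough sketch of the channel-flush step taken on a checkpoint signal; the message format and the channel_* / deliver_to_application helpers are assumptions, not the real ch_p4 interface:

    /* Because each channel is FIFO, once a BARRIER marker is received from a
     * peer, every message that peer sent earlier has already been drained,
     * so no in-transit message can be lost by the checkpoint. */

    enum { MSG_DATA, MSG_BARRIER };

    struct message { int type; /* ... payload ... */ };

    void channel_send_barrier(int peer);                          /* hypothetical */
    void channel_recv(int peer, struct message *m);               /* hypothetical */
    void deliver_to_application(int peer, const struct message *m);

    void flush_channels_before_checkpoint(int npeers)
    {
        /* 1. tell every peer that we are about to checkpoint */
        for (int p = 0; p < npeers; p++)
            channel_send_barrier(p);

        /* 2. drain each channel until the peer's barrier arrives,
         *    buffering any in-transit data messages for the application */
        for (int p = 0; p < npeers; p++) {
            struct message m;
            do {
                channel_recv(p, &m);
                if (m.type == MSG_DATA)
                    deliver_to_application(p, &m);
            } while (m.type != MSG_BARRIER);
        }

        /* channels are now empty: the coordinated checkpoint can proceed */
    }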

19. Checkpointing • Coordinated checkpointing: the Leader Manager issues a checkpoint command to every rank (rank 0–3) • Each rank writes a versioned checkpoint image (ver 1, ver 2, ...) of its address space (text, data, heap, stack) to storage • A coordinator-side sketch follows • [diagram: Leader Manager, ranks 0–3, and storage]
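A minimal sketch of the coordinator side of this scheme; the manager_* helpers and the version bookkeeping are assumptions for illustration:

    /* Sketch: the Leader Manager broadcasts a CKPT command to every rank's
     * Local Manager, then waits until all ranks report that their image for
     * this version has reached storage.  Only then is the version a
     * complete, globally consistent checkpoint. */

    void manager_send_ckpt_command(int rank, int version);    /* hypothetical */
    void manager_wait_for_ckpt_ack(int rank, int version);    /* hypothetical */

    void coordinate_checkpoint(int nranks, int version)
    {
        /* phase 1: ask every rank to checkpoint */
        for (int r = 0; r < nranks; r++)
            manager_send_ckpt_command(r, version);

        /* phase 2: wait for every rank's acknowledgement */
        for (int r = 0; r < nranks; r++)
            manager_wait_for_ckpt_ack(r, version);
    }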

20. Failure Recovery • MPI process recovery: a new process is started and its address space (text, data, heap, stack) is restored from the CKPT image, yielding the restarted process • [diagram: CKPT image applied to a new process to produce the restarted process]

21. Failure Recovery • Connection re-establishment: each MPI process re-opens its socket and sends its IP and port info to the Local Manager • This is the same procedure used at initialization time (see the sketch below)
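A minimal sketch of that re-opening step in plain BSD sockets; report_to_local_manager() and the use of an ephemeral port are assumptions for illustration:

    /* Sketch: open a fresh listening socket, learn which port the kernel
     * assigned, and report (IP, port) to the Local Manager so that the
     * other ranks can reconnect, exactly as at initialization. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    void report_to_local_manager(const char *ip, unsigned short port); /* hypothetical */

    int reopen_listen_socket(const char *my_ip)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr;
        socklen_t len = sizeof(addr);

        memset(&addr, 0, sizeof(addr));
        addr.sin_family      = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port        = 0;             /* let the kernel pick a free port */

        if (fd < 0 || bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
            listen(fd, 16) < 0) {
            if (fd >= 0) close(fd);
            return -1;
        }

        getsockname(fd, (struct sockaddr *)&addr, &len);   /* learn the port */
        report_to_local_manager(my_ip, ntohs(addr.sin_port));
        return fd;                             /* peers will reconnect to this socket */
    }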

22. Fault Tolerant MPI • Recovery from failure: [diagram] when a failure is detected, the Leader Manager rolls the ranks (rank 0–3) back to the ver 1 checkpoint images kept in storage and restarts them

23. Fault Tolerant MPI in Condor • The Leader Manager controls the MPI processes by issuing checkpoint commands and monitoring them • Condor is not aware of the failure incident • [diagram: Condor pool with Central Manager, Submit Machine, and the job (Leader Manager) supervising Local Managers and MPI Processes on Execute Machines 1–3]

24. Fault-tolerant MPICH variants (Seoul National University) • [layer diagram] MPICH-GF on Globus2 (Ethernet), M3 on GM (Myrinet), and SHIELD on MVAPICH (InfiniBand) • All three share the same structure: Collective and P2P Operations over the ADI (Abstract Device Interface), plus the FT Module (Connection Re-establishment, Atomic Message Transfer), Recovery Module, and Checkpoint Toolkit

25. Summary • We can provide fault tolerance for parallel applications using MPICH on Ethernet, Myrinet, and InfiniBand • Currently, only the P4 (Ethernet) version works with Condor • We look forward to working with the Condor team

26. Thank You!
