FT-MPICH: Providing Fault Tolerance for MPI Parallel Applications
Prof. Heon Y. Yeom, Distributed Computing Systems Lab., Seoul National University
Motivation
• Condor supports a Checkpoint/Restart (C/R) mechanism only in the Standard Universe, i.e., for single-process jobs.
• C/R for parallel jobs is not provided in any of the current Condor universes.
• We would like to make C/R available for MPI programs.
Introduction
• Why the Message Passing Interface (MPI)?
• Designing a generic fault-tolerance framework is extremely hard due to the diversity of hardware and software systems.
• We have chosen the MPICH series.
• MPI is the most popular programming model in cluster computing.
• Providing fault tolerance to MPI is more cost-effective than providing it to the OS or hardware.
Architecture (Concept)
[Figure: FT-MPICH combines failure detection, monitoring, and a C/R protocol.]
Architecture (Overall System)
[Figure: A Management System communicates with each node over Ethernet; on each node, a communication layer and the MPI process exchange data through an IPC message queue.]
Management System
[Figure: The Management System handles failure detection, initialization coordination, output management, checkpoint coordination, checkpoint transfer, and recovery. It makes MPI more reliable.]
Manager System
[Figure: A Leader Manager coordinates per-node Local Managers, each supervising one MPI process. The managers handle initialization, checkpoint commands, checkpoint transfer to stable storage, and failure notification and recovery; the MPI processes communicate among themselves to exchange data.]
Fault-Tolerant MPICH_P4
[Figure: Software stack of FT-MPICH. Collective and point-to-point operations sit on the ADI (Abstract Device Interface) over the ch_p4 (Ethernet) device, which is extended with an FT module, a recovery module, and a checkpoint toolkit providing atomic message transfer and connection re-establishment.]
Startup in Condor
• Preconditions
• The Leader Manager already knows, from user input, the machines on which the MPI processes will run and the number of MPI processes.
• The binaries of the Local Manager and the MPI process are located at the same path on each machine.
Startup in Condor
• Job submission description file
• Vanilla Universe
• A shell script file is used in the submission description file: executable points to a shell script, and that script only executes the Leader Manager.
• Example: Example.cmd

  universe = Vanilla
  executable = exe.sh
  output = exe.out
  error = exe.err
  log = exe.log
  queue

• exe.sh (shell script)

  #!/bin/sh
  Leader_manager …
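The completed Example.cmd is then handed to Condor in the usual way with condor_submit Example.cmd; Condor runs exe.sh as a Vanilla-universe job, and the script in turn launches the Leader Manager.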
Startup in Condor
• The user submits the job using condor_submit.
[Figure: Normal job startup in the Condor pool. The Central Manager (Negotiator, Collector) matches the job from the Submit Machine (Schedd, Shadow) to an Execute Machine (Startd, Starter), where the job, i.e. the Leader Manager, starts running.]
Startup in Condor
• The Leader Manager executes a Local Manager on each execute machine via fork() and exec().
• Each Local Manager in turn executes its MPI process.
[Figure: The Leader Manager, running as the Condor job, spawns Local Managers on Execute Machines 1-3; each Local Manager starts one MPI process.]
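As a rough illustration of this fork()/exec() pattern (a minimal sketch: the child binary name ./mpi_app and the lack of arguments are placeholders, not the actual FT-MPICH launch line):

  #include <stdio.h>
  #include <sys/types.h>
  #include <sys/wait.h>
  #include <unistd.h>

  /* Minimal sketch of the manager pattern on the slide: a manager process
   * forks a child and replaces it with another binary (Leader Manager ->
   * Local Manager -> MPI process). "./mpi_app" is a placeholder name. */
  int main(void)
  {
      pid_t pid = fork();
      if (pid < 0) {
          perror("fork");
          return 1;
      }
      if (pid == 0) {
          /* Child: become the managed process (placeholder binary). */
          execl("./mpi_app", "mpi_app", (char *)NULL);
          perror("execl");          /* reached only if exec fails */
          _exit(127);
      }
      /* Parent (the manager) keeps running and supervises the child. */
      int status;
      waitpid(pid, &status, 0);
      printf("child exited with status %d\n", WEXITSTATUS(status));
      return 0;
  }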
Startup in Condor
• The MPI processes send their communication information to the Leader Manager, which aggregates it.
• The Leader Manager broadcasts the aggregated information to all MPI processes.
[Figure: Each MPI process on Execute Machines 1-3 reports its communication information to the Leader Manager, which distributes the aggregated information back to every process.]
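The exchange can be pictured as the Leader Manager filling in a table of (rank, IP, port) entries and then sending the complete table to everyone. The sketch below models only the data structure and the aggregation step in plain C; the endpoint_t type and the sample addresses are invented for illustration, not taken from FT-MPICH:

  #include <stdio.h>

  #define NPROCS 3

  /* Hypothetical per-process endpoint record (not FT-MPICH's real layout). */
  typedef struct {
      int  rank;
      char ip[16];
      int  port;
  } endpoint_t;

  /* Leader side: place each reported entry into the global table by rank. */
  static void aggregate(endpoint_t table[], const endpoint_t *reported, int n)
  {
      for (int i = 0; i < n; i++)
          table[reported[i].rank] = reported[i];
  }

  int main(void)
  {
      /* Pretend these were just received from the three MPI processes. */
      endpoint_t reported[NPROCS] = {
          {0, "10.0.0.1", 5001},
          {1, "10.0.0.2", 5002},
          {2, "10.0.0.3", 5003},
      };
      endpoint_t table[NPROCS];

      aggregate(table, reported, NPROCS);

      /* "Broadcast": every MPI process would receive this full table and use
       * it to open point-to-point connections to its peers. */
      for (int i = 0; i < NPROCS; i++)
          printf("rank %d -> %s:%d\n", table[i].rank, table[i].ip, table[i].port);
      return 0;
  }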
Fault-Tolerant MPI
• To provide MPI fault tolerance, we have adopted:
• A coordinated checkpointing scheme (vs. an independent scheme): the Leader Manager is the coordinator.
• Application-level checkpointing (vs. kernel-level checkpointing): this method requires no effort on the part of cluster administrators.
• A user-transparent checkpointing scheme (vs. user-aware): this method requires no modification of the MPI source code.
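In a coordinated scheme the coordinator, here the Leader Manager, tells every process to checkpoint and commits the checkpoint only after all of them acknowledge. A minimal sketch of that control loop, assuming invented stub functions (send_ckpt_cmd, wait_for_ckpt_ack) in place of the real manager protocol:

  #include <stdbool.h>
  #include <stdio.h>

  #define NPROCS 3

  /* Stubs standing in for the real manager-to-manager messages. */
  static void send_ckpt_cmd(int rank)     { printf("CKPT command -> rank %d\n", rank); }
  static bool wait_for_ckpt_ack(int rank) { printf("CKPT ack     <- rank %d\n", rank); return true; }

  /* Coordinator side of a coordinated checkpoint: version v is committed
   * only after every process has written and acknowledged its local image. */
  static bool coordinated_checkpoint(int version)
  {
      for (int r = 0; r < NPROCS; r++)
          send_ckpt_cmd(r);

      for (int r = 0; r < NPROCS; r++)
          if (!wait_for_ckpt_ack(r))
              return false;            /* abort and keep the previous version */

      printf("checkpoint version %d committed\n", version);
      return true;
  }

  int main(void) { return coordinated_checkpoint(2) ? 0 : 1; }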
Atomic Message Passing
• Coordination between the MPI processes.
• Assumption: the communication channel is FIFO.
• Lock() and Unlock() are used to create an atomic region around a message transfer.
• A CKPT signal that arrives outside the atomic region triggers an immediate checkpoint; one that arrives inside the region is delayed until Unlock().
[Figure: Proc 0 and Proc 1 exchange a message inside a Lock()/Unlock() region; checkpoint signals arriving before or after the region are acted on at once, while a signal arriving inside the region delays the checkpoint.]
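One way to realize the delayed checkpoint is to have the signal handler set a flag while the process is inside a Lock()/Unlock() region. A minimal sketch, assuming SIGUSR1 as the checkpoint signal and invented flag names; the slides do not specify the actual signal or variables:

  #include <signal.h>
  #include <stdio.h>

  /* Assumed state: inside_atomic marks a Lock()/Unlock() region, and
   * ckpt_pending records a checkpoint signal that arrived inside it. */
  static volatile sig_atomic_t inside_atomic = 0;
  static volatile sig_atomic_t ckpt_pending  = 0;

  static void do_checkpoint(void) { printf("checkpointing now\n"); }

  static void ckpt_sig_handler(int signo)
  {
      (void)signo;
      if (inside_atomic)
          ckpt_pending = 1;        /* Cases 2 and 3: delay until Unlock() */
      else
          do_checkpoint();         /* Case 1: checkpoint right away */
  }

  static void Lock(void)   { inside_atomic = 1; }

  static void Unlock(void)
  {
      inside_atomic = 0;
      if (ckpt_pending) {          /* the delayed checkpoint runs here */
          ckpt_pending = 0;
          do_checkpoint();
      }
  }

  int main(void)
  {
      signal(SIGUSR1, ckpt_sig_handler);

      Lock();
      /* ... the atomic message transfer with the peer would happen here ... */
      raise(SIGUSR1);              /* checkpoint request arrives mid-transfer */
      Unlock();                    /* the checkpoint is taken only now */
      return 0;
  }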
Atomic Message Passing (Case 1)
• When an MPI process receives the CKPT signal, it sends and receives barrier messages before taking its checkpoint.
[Figure: Proc 0 and Proc 1 each receive the CKPT signal outside the atomic region, exchange barrier messages, and checkpoint immediately.]
Atomic Message Passing (Case 2)
• By sending and receiving barrier messages, any in-transit message is pushed through to its destination before the checkpoint is taken.
[Figure: the CKPT signal arrives while data is still in transit inside the atomic region; the barrier exchange flushes the data, and the checkpoint is delayed until after Unlock().]
Atomic Message Passing (Case 3)
• The communication channel between the MPI processes is flushed, so the dependency between them is removed before checkpointing.
[Figure: both processes receive the CKPT signal inside the atomic region; after the barrier and data exchange completes, the delayed checkpoints are taken with no message left in the channel.]
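The barrier exchange in the three cases can be thought of as each process sending a barrier marker on every channel and draining incoming data until it sees its peers' markers, so that no application message is left in transit at checkpoint time. A toy sketch of that drain loop under the FIFO assumption; the message and channel representations are invented, whereas the real mechanism lives inside the ch_p4 device:

  #include <stdio.h>

  #define NPEERS 2

  /* Invented message representation: BARRIER markers vs. application DATA. */
  typedef enum { MSG_DATA, MSG_BARRIER } msg_kind_t;
  typedef struct { msg_kind_t kind; int payload; } msg_t;

  /* Stubs for the per-peer FIFO channels. */
  static void send_barrier(int peer) { printf("barrier -> peer %d\n", peer); }
  static msg_t recv_next(int peer)   { printf("recv    <- peer %d\n", peer);
                                       return (msg_t){MSG_BARRIER, 0}; }
  static void deliver_to_app(int peer, msg_t m)
  {
      printf("deliver data %d from peer %d\n", m.payload, peer);
  }

  /* Flush every channel: once this returns, all in-transit application
   * messages have reached their destination (FIFO assumption), so no
   * dependency on channel state remains at checkpoint time. */
  static void flush_channels(void)
  {
      for (int p = 0; p < NPEERS; p++)
          send_barrier(p);

      for (int p = 0; p < NPEERS; p++) {
          for (;;) {
              msg_t m = recv_next(p);
              if (m.kind == MSG_BARRIER)
                  break;               /* channel from peer p is now empty */
              deliver_to_app(p, m);    /* in-transit data delivered pre-checkpoint */
          }
      }
      printf("all channels flushed; safe to checkpoint\n");
  }

  int main(void) { flush_channels(); return 0; }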
Checkpointing
• Coordinated checkpointing.
[Figure: The Leader Manager issues a checkpoint command to ranks 0-3; each rank writes an image of its text, data, heap, and stack segments to storage, producing checkpoint versions 1 and 2.]
Failure Recovery
• MPI process recovery.
[Figure: A new process is created and its text, data, heap, and stack segments are restored from the CKPT image, yielding the restarted process.]
Failure Recovery
• Connection re-establishment: each MPI process re-opens its socket and sends its IP and port information to its Local Manager.
• This is the same procedure used earlier at initialization time.
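A rough sketch of that re-registration step: the restarted process opens a fresh listening socket, lets the OS pick a port, and reports the resulting endpoint; the report is stubbed as a printf here, since the actual Local Manager protocol is not described on the slides:

  #include <arpa/inet.h>
  #include <netinet/in.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/socket.h>
  #include <unistd.h>

  int main(void)
  {
      /* Re-open a listening socket, as each MPI process does after restart. */
      int fd = socket(AF_INET, SOCK_STREAM, 0);
      if (fd < 0) { perror("socket"); return 1; }

      struct sockaddr_in addr;
      memset(&addr, 0, sizeof addr);
      addr.sin_family      = AF_INET;
      addr.sin_addr.s_addr = htonl(INADDR_ANY);
      addr.sin_port        = 0;            /* let the OS choose a free port */

      if (bind(fd, (struct sockaddr *)&addr, sizeof addr) < 0 ||
          listen(fd, 8) < 0) {
          perror("bind/listen");
          return 1;
      }

      /* Find out which port we got, then report it to the Local Manager
       * (stubbed as a printf; the real code would send it over the manager
       * connection, exactly as during initialization). */
      socklen_t len = sizeof addr;
      if (getsockname(fd, (struct sockaddr *)&addr, &len) < 0) {
          perror("getsockname");
          return 1;
      }
      printf("re-registering endpoint: port %d\n", ntohs(addr.sin_port));

      close(fd);
      return 0;
  }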
Fault-Tolerant MPI
• Recovery from failure.
[Figure: After failure detection, the Leader Manager recovers ranks 0-3 from the version 1 checkpoint images held in storage.]
Fault-Tolerant MPI in Condor
• The Leader Manager controls the MPI processes by issuing checkpoint commands and monitoring them.
• Condor is not aware of the failure incident.
[Figure: The Leader Manager, running as the Condor job, supervises the Local Managers and MPI processes on Execute Machines 1-3 in the Condor pool.]
Fault-Tolerant MPICH Variants (Seoul National University)
[Figure: Software stacks of the three variants, M3, SHIELD, and MPICH-GF: collective and point-to-point operations over the ADI on Globus2 (Ethernet), GM (Myrinet), and MVAPICH (InfiniBand) devices, each extended with an FT module, a recovery module, and a checkpoint toolkit providing atomic message transfer and connection re-establishment.]
Summary
• We can provide fault tolerance for parallel applications using MPICH on Ethernet, Myrinet, and InfiniBand.
• Currently, only the P4 (Ethernet) version works with Condor.
• We look forward to working with the Condor team.