

  1. Grand Large. MPICH-V2: a Fault Tolerant MPI for Volatile Nodes based on Pessimistic Sender Based Message Logging. Joint work with A. Bouteiller, F. Cappello, G. Krawezik, P. Lemarinier, F. Magniette. Parallelism team, Grand Large Project. Thomas Hérault, herault@lri.fr, http://www.lri.fr/~herault. 11/18, SC 2003

  2. MPICH-V2
  • Computing nodes of clusters are subject to failure
  • Many applications use MPI as their communication library
  • Goal: design a fault-tolerant MPI library
  • MPICH-V1 is a fault-tolerant MPI implementation, but it requires many stable components to provide high performance
  • MPICH-V2 addresses this requirement and provides higher performance

  3. Outline
  • Introduction
  • Architecture
  • Performance
  • Perspectives & Conclusion

  4. Large Scale Parallel and Distributed Systems and Node Volatility
  • Industry and academia are building larger and larger computing facilities for technical computing (research and production).
  • Platforms with thousands of nodes are becoming common: Tera Scale machines (US ASCI, French Tera), large scale clusters (Score III, etc.), Grids, PC Grids (SETI@home, XtremWeb, Entropia, UD, BOINC)
  • These large scale systems have frequent failures/disconnections:
    • The ASCI-Q full-system MTBF is estimated (analytically) at a few hours (Petrini, LANL); a 5-hour job on 4096 processors has less than a 50% chance of terminating.
    • PC Grid nodes are volatile: disconnections/interruptions are expected to be very frequent (several per hour)
  • When failures/disconnections cannot be avoided, they become a characteristic of the system called Volatility
  • We need a Volatility-tolerant Message Passing library

  5. Programmer's view unchanged: PC client MPI_send() / PC client MPI_recv()
  Goal: execute existing or new MPI applications.
  Problems:
  1) Volatile nodes (any number, at any time)
  2) Non-named receptions (must be replayed in the same order as in the previous, failed execution)
  Objective summary:
  1) Automatic fault tolerance
  2) Transparency for the programmer & user
  3) Tolerate n faults (n being the number of MPI processes)
  4) Scalable infrastructure/protocols
  5) Avoid global synchronizations (ckpt/restart)
  6) Theoretical verification of protocols
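
To make "programmer's view unchanged" concrete, the following is a plain MPI program of the kind such a library targets: it contains no fault-tolerance calls at all, and only the MPI implementation underneath (e.g. relinking against the fault-tolerant library) changes. This is a generic sketch, not code shipped with MPICH-V2.

    #include <mpi.h>
    #include <stdio.h>

    /* Plain MPI ping: rank 0 sends a value, rank 1 receives it.
       No fault-tolerance-specific calls appear anywhere. */
    int main(int argc, char **argv)
    {
        int rank, value = 42;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
            printf("rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }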

  6. Related works
  A classification of fault tolerant message passing environments, considering (A) the level in the software stack where fault tolerance is managed (framework, API, communication library) and (B) the fault tolerance technique (automatic or not; checkpoint based or log based, with optimistic, pessimistic or causal logging):
  • Checkpoint based: coordinated checkpoint; Cocheck, independent of MPI [Ste96] (framework); Starfish, enrichment of MPI [AF99] (framework); Clip, semi-transparent checkpoint [CLP97] (API); MPI/FT, redundancy of tasks [BNC01] (API); FT-MPI, modification of MPI routines with user fault treatment [FD00] (API, non automatic)
  • Log based, optimistic log (sender based): optimistic recovery in distributed systems, n faults with coherent checkpoint [SY85]; sender based message logging, 1 fault [JZ87] (communication library); Pruitt 98, 2 faults, sender based [PRU98] (communication library)
  • Log based, pessimistic log: MPI-FT, n faults, centralized server [LNLE00]; MPICH-V2, n faults, distributed logging (communication library)
  • Log based, causal log: Manetho, n faults [EZ92]; Egida [RAV99]

  7. Checkpoint techniques
  Coordinated Checkpoint (Chandy/Lamport): the objective is to checkpoint the application when there are no in-transit messages between any two nodes. This requires a global synchronization and a network flush, and is not scalable.
  [Timeline: synchronized checkpoints; on a failure, detection triggers a global stop and a restart.]
  Uncoordinated Checkpoint:
  • No global synchronization (scalable)
  • Nodes may checkpoint at any time (independently of the others)
  • Need to log nondeterministic events: in-transit messages
  [Timeline: independent checkpoints; on a failure, detection triggers a restart.]
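
For reference, the coordinated approach named on this slide can be sketched as follows: a hypothetical, heavily simplified rendering of the Chandy-Lamport receiver rule (record local state on the first marker, then record each incoming channel until its own marker arrives). It is illustrative only and unrelated to the MPICH-V2 code base.

    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    #define NCHAN 2
    #define LOGSZ 16

    struct snapshot {
        bool started;                 /* local state already recorded?      */
        bool chan_done[NCHAN];        /* marker received on this channel?   */
        int  chan_log[NCHAN][LOGSZ];  /* messages recorded as "in transit"  */
        int  chan_log_len[NCHAN];
    };

    static void record_local_state(void) { printf("local state recorded\n"); }
    static void send_marker_on_all_channels(void) { printf("markers sent\n"); }

    /* Called for every incoming message; is_marker tells markers apart. */
    void on_receive(struct snapshot *s, int chan, int msg, bool is_marker)
    {
        if (is_marker) {
            if (!s->started) {            /* first marker: take the snapshot */
                s->started = true;
                record_local_state();
                send_marker_on_all_channels();
            }
            s->chan_done[chan] = true;    /* this channel's state is closed  */
        } else if (s->started && !s->chan_done[chan]) {
            /* message was in transit when the snapshot started: record it */
            s->chan_log[chan][s->chan_log_len[chan]++] = msg;
        }
        /* else: a normal message, delivered to the application as usual */
    }

    int main(void)
    {
        struct snapshot s;
        memset(&s, 0, sizeof s);

        on_receive(&s, 0, 7, false);  /* before snapshot: normal delivery     */
        on_receive(&s, 0, 0, true);   /* marker on channel 0: snapshot starts */
        on_receive(&s, 1, 8, false);  /* in transit on channel 1: recorded    */
        on_receive(&s, 1, 0, true);   /* marker on channel 1: snapshot done   */

        printf("in-transit messages recorded on channel 1: %d\n",
               s.chan_log_len[1]);
        return 0;
    }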

  8. Outline
  • Introduction
  • Architecture
  • Performance
  • Perspectives & Conclusion

  9. MPICH-V1
  [Figure: in MPICH-V1, two nodes communicate through a Channel Memory over the network (Put/Get); the full architecture adds a Dispatcher, Channel Memories and Checkpoint servers around the computing nodes.]
  [Plot: ping-pong transfer time vs. message size (0 to 384 KB), mean over 100 measurements; the ch_cm device reaches about 5.6 MB/s versus about 10.5 MB/s for P4, roughly a factor of 2.]

  10. MPICH-V2 protocol
  A new protocol (not published before) based on:
  1) Splitting message logging and event logging
  2) Sender based message logging
  3) A pessimistic approach (reliable event logger)
  • Definition 3 (Pessimistic Logging protocol). Let P be a communication protocol, and E an execution of P with at most f concurrent failures. Let M_C denote the set of messages transmitted between the initial configuration and the configuration C of E. P is a pessimistic message logging protocol if and only if ∀C ∈ E, ∀m ∈ M_C: (|Depend_C(m)| > 1) ⇒ Re-Executable(m)
  • Theorem 2. The protocol of MPICH-V2 is a pessimistic message logging protocol.
  Key points of the proof:
  A. Every nondeterministic event has its logical clock logged on reliable media
  B. Every message reception logged on reliable media is re-executable: the message payload is saved on the sender, and the sender will produce the message again and associate the same unique logical clock
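
The two halves of the protocol, sender based message logging and pessimistic event logging, can be illustrated with the sketch below; the structures, names and the synchronous event-logger call are assumptions made for illustration, not MPICH-V2 source.

    #include <stdio.h>
    #include <string.h>

    /* Sender based message logging: the sender keeps every payload in its
       own volatile memory so it can re-send it during a replay.
       Pessimistic event logging: the receiver logs the reception event
       (sender, logical clock) on a reliable event logger and waits for the
       ack before the application can act on the message. */

    #define MAXLOG 128
    #define MAXPAYLOAD 256

    struct logged_msg {             /* one entry of the sender's message log */
        int  dest;
        long clock;                 /* logical clock tied to the reception   */
        char payload[MAXPAYLOAD];
        int  len;
    };

    static struct logged_msg sender_log[MAXLOG];
    static int sender_log_len = 0;

    /* Sender side: keep the payload locally, then transmit it. */
    void v2_send(int dest, long clock, const char *buf, int len)
    {
        struct logged_msg *e = &sender_log[sender_log_len++];
        e->dest = dest; e->clock = clock; e->len = len;
        memcpy(e->payload, buf, (size_t)len);
        /* ... the actual TCP send of (clock, buf) would happen here ... */
    }

    /* Stand-in for the synchronous exchange with the reliable event logger. */
    static int event_logger_log(int sender, long clock)
    {
        printf("event logger: logged (sender=%d, clock=%ld)\n", sender, clock);
        return 1;                   /* ack */
    }

    /* Receiver side: the event must be safely logged before delivery. */
    void v2_deliver(int sender, long clock, const char *buf, int len)
    {
        if (!event_logger_log(sender, clock))
            return;                 /* no ack: the message cannot be delivered */
        printf("delivering message clock=%ld (%d bytes)\n", clock, len);
        (void)buf;
    }

    int main(void)
    {
        v2_send(1, 42, "hello", 6);     /* sender keeps a re-sendable copy     */
        v2_deliver(0, 42, "hello", 6);  /* receiver logs the event, delivers   */
        return 0;
    }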

  11. Message logger and event logger
  [Figure: process p receives a message m from q and sends the reception event (id, l) to its event logger; r and q are the other processes. After p crashes and restarts from its checkpoint, a re-execution phase replays the receptions in the order recorded by the event logger.]

  12. Computing node
  [Figure: architecture of a computing node. The MPI process (Send/Receive) is coupled to the V2 daemon, which keeps the payload of sent messages, forwards each reception event to the Event Logger (and waits for its ack), and, under Ckpt Control with CSAC, ships the checkpoint image to the Checkpoint Server.]

  13. Impact of uncoordinated checkpoint + sender based message logging
  [Figure: P0 and P1 checkpoint independently to the checkpoint server (CS); the events of messages 1 and 2 are on the event logger (EL) while the payloads stay in P1's message logger (ML), raising the question of which checkpoint image covers them.]
  • Obligation to checkpoint the Message Loggers on the computing nodes
  • A garbage collector is required to reduce the ML checkpoint size

  14. Garbage collection
  [Figure: once P0's checkpoint image covering the receptions of messages 1 and 2 reaches the checkpoint server (CS), messages 1 and 2 can be deleted from P1's message logger by the garbage collector.]
  Receiver checkpoint completion triggers the garbage collector of the senders.
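
A sketch of the garbage-collection rule stated above, assuming each logged payload carries the receiver's reception clock and that the receiver advertises the highest clock covered by its completed checkpoint; names and structures are illustrative, not MPICH-V2 code.

    #include <stdio.h>

    #define MAXLOG 128

    struct logged_msg { int dest; long clock; /* payload omitted */ };

    static struct logged_msg sender_log[MAXLOG];
    static int sender_log_len = 0;

    /* When a receiver finishes a checkpoint, it advertises the highest
       reception clock covered by that checkpoint; the sender drops every
       logged payload destined to that receiver with clock <= limit, since
       it will never have to re-send it. */
    void garbage_collect(int receiver, long covered_clock)
    {
        int kept = 0;
        for (int i = 0; i < sender_log_len; i++) {
            struct logged_msg *e = &sender_log[i];
            if (e->dest == receiver && e->clock <= covered_clock)
                continue;                   /* covered by the checkpoint */
            sender_log[kept++] = *e;        /* still needed for replay   */
        }
        printf("dropped %d logged messages\n", sender_log_len - kept);
        sender_log_len = kept;
    }

    int main(void)
    {
        sender_log[0] = (struct logged_msg){ .dest = 0, .clock = 1 };
        sender_log[1] = (struct logged_msg){ .dest = 0, .clock = 2 };
        sender_log[2] = (struct logged_msg){ .dest = 0, .clock = 3 };
        sender_log_len = 3;

        garbage_collect(0, 2);   /* receiver 0 checkpointed up to clock 2 */
        return 0;
    }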

  15. Scheduling Checkpoints
  • Uncoordinated checkpointing leads to logging in-transit messages
  • Scheduling checkpoints simultaneously would lead to bursts in the network traffic
  • Checkpoint size can be reduced by removing message logs
  • Coordinated checkpoint (Lamport) requires a global synchronization
  • Checkpoint traffic should be flattened
  • Checkpoint scheduling should evaluate the cost and benefit of each checkpoint
  [Figure: after the receivers checkpoint, messages 1, 2 and 3 can be garbage collected from P0's message logger (no message checkpoint needed), while in P1's message logger only 1 and 2 can be deleted and message 3 still needs to be checkpointed.]
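
The slide does not give the actual scheduling policy; the sketch below only illustrates the kind of cost/benefit rule it calls for, using an invented heuristic (checkpoint a node once the message log that senders could reclaim outweighs the cost of shipping its image).

    #include <stdbool.h>
    #include <stdio.h>

    struct node_stats {
        long ckpt_image_bytes;        /* estimated cost of taking a checkpoint */
        long logged_bytes;            /* message-log bytes that would be freed */
        bool checkpoint_in_progress;
    };

    /* Hypothetical heuristic: schedule a checkpoint when the reclaimable
       log exceeds the image cost, and never overlap two checkpoints of the
       same node (which also helps flatten checkpoint traffic). */
    bool should_checkpoint(const struct node_stats *n)
    {
        if (n->checkpoint_in_progress)
            return false;
        return n->logged_bytes > n->ckpt_image_bytes;
    }

    int main(void)
    {
        struct node_stats n = { .ckpt_image_bytes = 64L << 20,
                                .logged_bytes     = 96L << 20,
                                .checkpoint_in_progress = false };
        printf("checkpoint now? %s\n", should_checkpoint(&n) ? "yes" : "no");
        return 0;
    }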

  16. Node (Volatile): Checkpointing
  • User-level checkpoint: Condor Stand Alone Checkpointing (CSAC)
  • Clone checkpointing + non-blocking checkpoint: on a checkpoint order, libmpichv (1) forks a clone of the MPI process; the clone (2) terminates the ongoing communications, (3) closes its sockets and (4) calls ckpt_and_exit(), while the original process keeps computing. On restart, CSAC resumes execution just after (4), reopens the sockets and returns a code.
  • The checkpoint image is sent to the CS on the fly (not stored locally)
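
A minimal sketch of the fork-based, non-blocking checkpoint described above; ckpt_and_exit() is replaced by a stub and the socket handling is reduced to a print, so this illustrates the control flow rather than the libmpichv implementation.

    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static void close_communication_sockets(void)
    {
        /* in the real library: terminate ongoing communications and close
           sockets, so no socket state ends up in the checkpoint image */
        printf("[clone %d] sockets closed\n", getpid());
    }

    static void ckpt_and_exit(void)        /* stub for the CSAC primitive */
    {
        printf("[clone %d] writing checkpoint image, then exiting\n", getpid());
        _exit(0);
    }

    void take_checkpoint(void)
    {
        pid_t pid = fork();
        if (pid == 0) {                    /* clone: does the checkpoint     */
            close_communication_sockets();
            ckpt_and_exit();               /* image streamed to the server   */
        } else if (pid > 0) {
            /* original process: resumes computation immediately, so the
               checkpoint does not block the application */
            printf("[parent %d] computation continues\n", getpid());
        } else {
            perror("fork");
        }
    }

    int main(void)
    {
        take_checkpoint();
        wait(NULL);                        /* only so the demo exits cleanly */
        return 0;
    }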

  17. Library: based on MPICH 1.2.5
  • A new device: the 'ch_v2' device
  • All ch_v2 device functions are blocking communication functions built over the TCP layer
  [Figure: MPICH software stack, from MPI_Send through the ADI (MPID_SendControl, MPID_SendChannel), the Channel Interface and the Chameleon Interface bindings, down to the V2 device interface.]
  V2 device functions:
  • _v2bsend - blocking send
  • _v2brecv - blocking receive
  • _v2probe - check whether any message is available
  • _v2from - get the source of the last message
  • _v2Init - initialize the client
  • _v2Finalize - finalize the client
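
The layering above can be pictured with a toy device table; the struct, signatures and the toy_MPI_Send entry point are invented for illustration and do not match the real MPICH 1.2.5 ADI.

    #include <stdio.h>

    /* A toy "device" whose functions are all blocking, standing in for the
       ch_v2 device built over TCP. */
    struct v2_device {
        int (*init)(void);
        int (*bsend)(int dest, const void *buf, int len);  /* _v2bsend */
        int (*brecv)(int src, void *buf, int len);         /* _v2brecv */
        int (*finalize)(void);
    };

    static int v2_init(void)     { printf("ch_v2: init\n"); return 0; }
    static int v2_finalize(void) { printf("ch_v2: finalize\n"); return 0; }
    static int v2_bsend(int dest, const void *buf, int len)
    {
        (void)buf;
        printf("ch_v2: blocking TCP send of %d bytes to %d\n", len, dest);
        return 0;
    }
    static int v2_brecv(int src, void *buf, int len)
    {
        (void)buf;
        printf("ch_v2: blocking TCP recv of %d bytes from %d\n", len, src);
        return 0;
    }

    static struct v2_device ch_v2 = { v2_init, v2_bsend, v2_brecv, v2_finalize };

    /* A toy MPI_Send-like entry point that delegates to the device,
       standing in for the MPID_SendControl / MPID_SendChannel path. */
    int toy_MPI_Send(int dest, const void *buf, int len)
    {
        return ch_v2.bsend(dest, buf, len);
    }

    int main(void)
    {
        char msg[] = "hello";
        ch_v2.init();
        toy_MPI_Send(1, msg, sizeof msg);
        ch_v2.finalize();
        return 0;
    }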

  18. Outline
  • Introduction
  • Architecture
  • Performance
  • Perspectives & Conclusion

  19. Performance evaluation
  Cluster: 32 Athlon 1800+ CPUs, 1 GB RAM, IDE disc + 16 dual Pentium III, 500 MHz, 512 MB, IDE disc + 48-port 100 Mb/s Ethernet switch; Linux 2.4.18, GCC 2.96 (-O3), PGI Fortran <5 (-O3, -tp=athlonxp)
  Checkpoint Server + Event Logger + Checkpoint Scheduler + Dispatcher run on a single reliable node; the computing nodes are connected through the network.

  20. Bandwidth and Latency
  Latency for a 0-byte MPI message: MPICH-P4 (77 µs), MPICH-V1 (154 µs), MPICH-V2 (277 µs)
  Latency is high due to event logging: a receiving process can send a new message only when the reception event has been successfully logged (3 TCP messages per communication).
  Bandwidth is high because event messages are short.

  21. NAS Benchmarks, Class A and B
  [Plots: performance in Megaflops for each benchmark; annotations attribute the gaps to latency and to memory capacity (logging on disc).]

  22. Breakdown of the execution time

  23. Faulty execution performance
  [Plot: with 1 fault every 45 seconds, the execution time increases by 190 s (+80%).]

  24. Outline
  • Introduction
  • Architecture
  • Performance
  • Perspectives & Conclusion

  25. Perspectives
  • Compare to coordinated techniques
    • Threshold of fault frequency where logging techniques are more valuable
    • MPICH-V/CL (Cluster 2003)
  • Hierarchical logging for Grids
    • Tolerate node failures & cluster failures
    • MPICH-V3 (SC 2003 poster session)
  • Address the latency of MPICH-V2
    • Use causal logging techniques?

  26. Conclusion
  • MPICH-V2 is a completely new protocol replacing MPICH-V1, removing the channel memories
  • The new protocol is pessimistic and sender based
  • MPICH-V2 reaches a ping-pong bandwidth close to that of MPICH-P4
  • MPICH-V2 cannot compete with MPICH-P4 on latency; however, for applications with large messages, performance is close to that of P4
  • In addition, MPICH-V2 resists up to one fault every 45 seconds
  • Main conclusion: MPICH-V2 requires far fewer stable nodes than MPICH-V1, with better performance
  Come and see the MPICH-V demos at booth 3315 (INRIA).

  27. Crash Re-execution performance (1)
  Time for the re-execution of a token ring on 8 nodes, according to the token size and the number of restarted nodes.

  28. Re-execution performance (2)

  29. Logging techniques
  [Figure: initial execution up to a crash, with a checkpoint taken along the way; the replayed execution starts from the last checkpoint of the failed process.]
  The system must provide the messages to be replayed, and discard the re-emissions.
  Main problems:
  • Discard re-emissions (technical)
  • Ensure that messages are replayed in a consistent order
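
One standard way to discard re-emissions, consistent with the unique logical clocks used by the protocol, is duplicate suppression by clock; the sketch below illustrates the idea and is not the MPICH-V2 implementation.

    #include <stdbool.h>
    #include <stdio.h>

    #define NPROCS 8

    /* Each receiver remembers, per sender, the highest logical clock it has
       already delivered. While a restarted sender re-executes, it re-sends
       messages the receiver has already seen; those duplicates are
       recognized by their clock and dropped. */
    static long last_delivered[NPROCS];

    bool should_deliver(int sender, long clock)
    {
        if (clock <= last_delivered[sender])
            return false;                  /* re-emission: discard            */
        last_delivered[sender] = clock;    /* new message: deliver and record */
        return true;
    }

    int main(void)
    {
        printf("%d\n", should_deliver(1, 1));  /* 1: delivered            */
        printf("%d\n", should_deliver(1, 2));  /* 1: delivered            */
        printf("%d\n", should_deliver(1, 2));  /* 0: duplicate, discarded */
        printf("%d\n", should_deliver(1, 3));  /* 1: delivered            */
        return 0;
    }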

  30. Large Scale Parallel and Distributed Systems and programming
  • Many HPC applications use the message passing paradigm
  • Message passing: MPI
  • We need a Volatility-tolerant Message Passing Interface implementation
  • Based on MPICH 1.2.5, which implements the MPI 1.1 standard

  31. Checkpoint Server (stable)
  • Checkpoint images are stored on reliable media (disc): 1 file per node, with the name given by the node
  • Multiprocess server: polls, treats events and dispatches jobs to other processes
  • Incoming messages: Put-checkpoint transactions; outgoing messages: Get-checkpoint transactions + control
  • Open sockets: one per attached node, one per home CM of the attached nodes
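
The "one checkpoint image file per node" storage rule can be sketched as follows; the file naming scheme and the chunked transfer are assumptions for illustration, not the actual server code.

    #include <stdio.h>

    /* Append a chunk of a checkpoint image arriving from a node to the file
       dedicated to that node; the image is streamed chunk by chunk during a
       Put-checkpoint transaction (not stored on the node itself). */
    int store_ckpt_chunk(const char *node_name, const void *chunk, size_t len,
                         int first_chunk)
    {
        char path[256];
        snprintf(path, sizeof path, "ckpt_%s.img", node_name);

        FILE *f = fopen(path, first_chunk ? "wb" : "ab");
        if (!f)
            return -1;
        size_t written = fwrite(chunk, 1, len, f);
        fclose(f);
        return written == len ? 0 : -1;
    }

    int main(void)
    {
        const char part1[] = "checkpoint-bytes-1";
        const char part2[] = "checkpoint-bytes-2";
        store_ckpt_chunk("node17", part1, sizeof part1 - 1, 1);
        store_ckpt_chunk("node17", part2, sizeof part2 - 1, 0);
        return 0;
    }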

  32. NAS Benchmarks, Class A and B
  [Plots: NAS Class A and B results, with annotations on latency and memory capacity (logging on disc).]
