
Presentation Transcript


  1. Laboratoire de Recherche en Informatique Universite de Paris Sud Dr. Franck Cappello

  2. MPICH-V: Toward a scalable fault tolerant MPI for Volatile nodes G. Bosilca, A. Bouteiller, F. Cappello, S. Djilali, G. Fedak, C. Germain, Th. Herault, P. Lemarinier, O. Lodygensky, F. Magniette, V. Neri, A. Selikhov Cluster & GRID group LRI, University of Paris South.

  3. Outline • Introduction • Motivations & Objectives • Architecture • Performance • Concluding remarks

  4. Large Scale Parallel and Distributed Systems and Node Volatility
• Industry and academia are building larger and larger computing facilities for technical computing (research and production).
• Platforms with 1000s of nodes are becoming common: tera-scale machines (US ASCI, French Tera), large-scale clusters (Score III, etc.), Grids, PC-Grids (Seti@home, XtremWeb, Entropia, UD, Boinc).
• These large-scale systems have frequent failures/disconnections:
• The ASCI-Q full-system MTBF is estimated (analytically) at a few hours (Petrini, LANL); a 5-hour job with 4096 procs has less than a 50% chance of terminating.
• PC-Grid nodes are volatile → disconnections/interruptions are expected to be very frequent (several per hour).
• When failures/disconnections cannot be avoided, they become one characteristic of the system, called Volatility.
• Many HPC applications use the message passing paradigm.
→ We need a volatility-tolerant message passing environment.

  5. Related work
Fault tolerant message passing: a long history of research!
Transparency: application checkpointing, MP API + fault management, automatic.
• application ckpt: the application stores intermediate results and restarts from them
• MP API + FM: the message passing API returns errors to be handled by the programmer
• automatic: the runtime detects faults and handles recovery
Checkpoint coordination: none, coordinated, uncoordinated.
• coordinated: all processes are synchronized and the network is flushed before the ckpt; all processes roll back from the same snapshot
• uncoordinated: each process checkpoints independently of the others; each process is restarted independently of the others
Message logging: none, pessimistic, optimistic, causal.
• pessimistic: all messages are logged on reliable media and used for replay
• optimistic: all messages are logged on non-reliable media; if 1 node fails, replay is done according to the other nodes' logs; if more than 1 node fails, roll back to the last coherent checkpoint
• causal: optimistic + antecedence graph, reduces the recovery time

  6. Related work
• Sprite [Douglis, Ousterhout, 1991]: task migration, transparent, remote procedure calls, kernel level, no fault tolerance
• Condor [Litzkow, Livny, Tannenbaum, 1991]: task migration, transparent, user level, includes checkpoint servers, compression, no parallel applications
• Libckpt [Plank, Beck, Kingsley, Li, 1994]: transparent (user configurable), user level, non-blocking checkpoint, incremental checkpoint, no compression, no parallel applications, no checkpoint server
• Cocheck [Stellner, 1996] / Netsolve [Plank, Casanova, Beck, Dongarra, 1999]: based on the Condor checkpoint mechanisms, dedicated to parallel applications, global synchronization (Chandy-Lamport algorithm)
• Clip [Chen, Li, Plank, 1997]: not cross platform, parallel applications, global synchronization (Chandy-Lamport algorithm)
• MPI-FT [Louca, Neophytou, Lachanas, Evripidou, 2000]: transparent; optimistic log: decentralized, only one fault; pessimistic log: centralized, arbitrary number of faults

  7. Related work
A classification of fault tolerant message passing environments considering (A) the level in the software stack where fault tolerance is managed and (B) the fault tolerance technique (automatic vs. non automatic; checkpoint based vs. log based — optimistic, pessimistic or causal log).
[Figure: classification chart, vertical axis "Level" (API, Communication Lib., Framework). Entries include: optimistic recovery in distributed systems, n faults with coherent checkpoint [SY85]; coordinated checkpoint; Manetho, n faults [EZ92]; Cocheck, independent of MPI [Ste96]; Starfish, enrichment of MPI [AF99]; FT-MPI, modification of MPI routines, user fault treatment [FD00]; Egida [RAV99]; Clip, semi-transparent checkpoint [CLP97]; MPI/FT, redundance of tasks [BNC01]; Pruitt 98, 2 faults, sender based [PRU98]; MPI-FT, n faults, centralized server [LNLE00]; sender-based message logging, 1 fault [JZ87]; MPICH-V, n faults, distributed logging.]
→ There is no automatic/transparent, n-fault tolerant, scalable message passing environment.

  8. Outline • Introduction • Motivations & Objectives • Architecture • Performance • Concluding remarks

  9. Objectives and constraints
Programmer's view unchanged: PC client MPI_Send() → PC client MPI_Recv()
Goal: execute existing or new MPI applications
Problems: 1) volatile nodes (any number, at any time), 2) firewalls (PC Grids), 3) non-named receptions (must be replayed in the same order as in the previous, failed execution)
Objective summary: 1) Automatic fault tolerance, 2) Transparency for the programmer & user, 3) Tolerate n faults (n being the number of MPI processes), 4) Firewall bypass (tunnel) for cross-domain execution, 5) Scalable infrastructure/protocols, 6) Avoid global synchronizations (ckpt/restart), 7) Theoretical verification of protocols
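To illustrate the "programmer's view unchanged" objective, here is a minimal, generic MPI program (not taken from the slides): it uses only standard MPI_Send/MPI_Recv calls, including a non-named reception via MPI_ANY_SOURCE, and under MPICH-V it would simply be re-linked against libmpichv.

/* Minimal MPI example: nothing in the source refers to fault tolerance.
 * With MPICH-V the same code is only re-linked against libmpichv. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);      /* logged by the CM */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, MPI_ANY_SOURCE, 0,          /* non-named reception */
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}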

  10. Checkpoint restart
Coordinated checkpoint (Chandy/Lamport): the objective is to checkpoint the application when there are no in-transit messages between any two nodes → global synchronization, network flush → not scalable. On a failure, detection triggers a global stop.
[Figure: timeline of the nodes with Sync, Ckpt and failure events.]
Uncoordinated checkpoint:
• No global synchronization (scalable)
• Nodes may checkpoint at any time (independently of the others)
• Need to log non-deterministic events: in-transit messages
[Figure: timeline of the nodes with Ckpt, failure, detection and restart events.]

  11. Pessimistic message logging on Channel Memories
Distributed pessimistic remote logging: a set of reliable nodes called "Channel Memories" (CM) logs every message.
• All communications are implemented by 1 PUT and 1 GET operation to the CM
• PUT and GET operations are transactions
• When a process restarts, it replays all communications using the Channel Memory
• The CM stores and delivers messages in FIFO order, ensuring a consistent state for each receiver
• The CM also works as a tunnel for firewall-protected nodes (PC-Grids)
[Figure: nodes behind firewalls performing Put/Get operations to the Channel Memory (stable, tunnel) over the network.]
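A self-contained sketch of the logging idea described above, with the Channel Memory modelled as an in-memory FIFO per receiver; the names and data layout are invented for illustration and do not reflect the real CM implementation.

/* Illustrative sketch: a Channel Memory modelled as one FIFO per receiver.
 * A send is one PUT, a receive is one GET; on restart the receiver replays
 * the same GET sequence in the same order. */
#include <stdio.h>
#include <string.h>

#define MAX_MSG 64

typedef struct { int src, tag, len; char payload[64]; } msg_t;
typedef struct { msg_t q[MAX_MSG]; int put, get; } channel_memory_t;

static channel_memory_t cm[2];            /* home CM queues for 2 receivers */

/* PUT: the sender appends to the FIFO of the destination (a transaction). */
void cm_put(int dst, int src, int tag, const char *data)
{
    msg_t *m = &cm[dst].q[cm[dst].put++];
    m->src = src; m->tag = tag; m->len = (int)strlen(data) + 1;
    memcpy(m->payload, data, m->len);
}

/* GET: the receiver pops the next message in FIFO order (total order per receiver). */
msg_t *cm_get(int me) { return &cm[me].q[cm[me].get++]; }

int main(void)
{
    cm_put(1, 0, 0, "hello");             /* rank 0 sends to rank 1 */
    msg_t *m = cm_get(1);                 /* rank 1 receives        */
    printf("recv from %d: %s\n", m->src, m->payload);

    cm[1].get = 0;                        /* crash + restart of rank 1: replay */
    m = cm_get(1);                        /* same message, same order          */
    printf("replayed from %d: %s\n", m->src, m->payload);
    return 0;
}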

  12. Putting it all together: sketch of an execution with a crash
Worst condition: an in-transit message + a checkpoint. On a crash, the failed process rolls back to its latest process checkpoint.
[Figure: pseudo time scale of the processes, Channel Memories and Checkpoint Server; ckpt images 1 and 2 are stored, a crash occurs, and the process rolls back to the latest checkpoint image.]

  13. Outline • Introduction • Motivations & Objectives • Architecture • Performance • Concluding remarks

  14. Global architecture
MPICH-V:
• Communications library: an MPICH device with Channel Memory
• Run-time: executes/manages the instances of MPI processes on the nodes
→ requires only re-linking the application with libmpichv instead of libmpich
[Figure: Dispatcher, Channel Memory, Checkpoint server and Nodes (some behind firewalls) connected through the network.]

  15. Dispatcher (stable)
-- Initializes the execution: distributes roles (CM, CS and Nodes) to participant nodes (launches the appropriate job), checks readiness
-- Launches the instances of MPI processes on Nodes
-- Monitors the Node state (alive signal, or time-out)
-- Reschedules tasks on available nodes for dead MPI process instances
[Figure: the Dispatcher distributes roles to Checkpoint servers, Channel Memories and Nodes; nodes send alive signals; on a failure, a new MPI process instance is launched.]
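A minimal sketch of the alive-signal monitoring described above; the time-out value, data layout and rescheduling helper are assumptions, not the actual dispatcher code.

/* Illustrative dispatcher monitoring loop: each node periodically sends an
 * alive signal; if none is seen within TIMEOUT seconds, the node is declared
 * dead and its MPI process instance is rescheduled on an available node. */
#include <stdio.h>
#include <time.h>

#define NB_NODES 4
#define TIMEOUT  30            /* seconds without an alive signal => dead (example value) */

static time_t last_alive[NB_NODES];

/* Called whenever an alive signal is received from a node. */
void on_alive_signal(int node) { last_alive[node] = time(NULL); }

/* Placeholder for "launch a new MPI process instance on an available node". */
void reschedule(int dead_node)
{
    printf("node %d timed out: rescheduling its MPI process instance\n", dead_node);
    last_alive[dead_node] = time(NULL);   /* the new instance starts fresh */
}

/* One pass of the monitoring loop (would run periodically in the dispatcher). */
void monitor_once(void)
{
    time_t now = time(NULL);
    for (int n = 0; n < NB_NODES; n++)
        if (now - last_alive[n] > TIMEOUT)
            reschedule(n);
}

int main(void)
{
    for (int n = 0; n < NB_NODES; n++) on_alive_signal(n);
    last_alive[2] -= 2 * TIMEOUT;         /* simulate a node that went silent */
    monitor_once();                       /* detects and reschedules node 2   */
    return 0;
}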

  16. Channel Memory (stable)
• Out-of-core message storage (disc) + garbage collection: removes messages older than the current checkpoint image of each node
• In-memory FIFO message queues: ensure a total order on each receiver's messages
• Multithreaded server: polls, treats an event and releases the other threads
• Incoming messages (Put transaction + control) and outgoing messages (Get transaction + control)
• Open sockets: one per attached Node, one per home checkpoint server of each attached node, one for the dispatcher

  17. Channel Memory architecture
• 1 queue structure (control + data) per receiver
• Separate control and data queues: eases access to control
• Control queue: enforces a total order on the messages from all senders
• Data queues (one per sender, src = 0, …, np-1): enforce a total order on the messages of each sender
[Figure: a control queue of (rank, src, tag) entries with first/last pointers, and an array of per-source data queues.]
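A declaration-level sketch of such a per-receiver queue structure (one control queue plus one data queue per sender); all field and type names are assumptions made for illustration.

/* Illustrative per-receiver queue structure: a control queue totally ordering
 * all incoming messages, plus one data queue per sender. */
#include <stddef.h>

#define NP   8            /* number of MPI processes (example value) */
#define QLEN 1024         /* queue capacity (example value)          */

typedef struct { int src, tag, size; } ctrl_entry_t;            /* "what comes next" */
typedef struct { int tag, size; void *payload; } data_entry_t;  /* payload reference */

typedef struct {
    ctrl_entry_t ctrl[QLEN];         /* total order over all senders */
    int ctrl_first, ctrl_last;
    data_entry_t data[NP][QLEN];     /* one FIFO per sender (src)    */
    int data_first[NP], data_last[NP];
} receiver_queues_t;

typedef struct { receiver_queues_t recv[NP]; } channel_memory_t;

/* Log one incoming message: record it in the receiver's control queue
 * (global order) and in the data queue of its sender (per-sender order). */
void cm_log(channel_memory_t *cm, int dst, int src, int tag, void *payload, int size)
{
    receiver_queues_t *r = &cm->recv[dst];
    r->ctrl[r->ctrl_last++] = (ctrl_entry_t){ src, tag, size };
    r->data[src][r->data_last[src]++] = (data_entry_t){ tag, size, payload };
}

int main(void)
{
    static channel_memory_t cm;          /* static: too big for the stack */
    int x = 42;
    cm_log(&cm, 1, 0, 0, &x, (int)sizeof x);   /* message from rank 0 to rank 1 */
    return 0;
}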

  18. Mapping Channel Memories onto nodes
Several CMs → coordination constraints: 1) force a total order on the messages of each receiver, 2) avoid coordination messages among the CMs.
Our solution:
• Each Node is "attached" to exactly one "home" CM
• A node receives messages from its home CM
• A node sends messages to the home CM of the destination node
[Figure: nodes 0, 1, 2, … each attached to its home Channel Memory.]
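The slides only require that each node has exactly one home CM; the sketch below uses a hypothetical static modulo assignment to show how the single-home rule removes any need for CM-to-CM coordination.

#include <stdio.h>

/* Hypothetical static home-CM assignment; the slides do not specify this
 * particular policy, only that each node has exactly one home CM. */
static int home_cm_of(int rank, int num_cm)
{
    return rank % num_cm;
}

int main(void)
{
    int num_cm = 2;
    /* Rank 3 sends to rank 5: the PUT goes to rank 5's home CM, and rank 5's
     * GETs are always served by that same CM, so all of rank 5's messages are
     * ordered by a single CM without any CM-to-CM coordination. */
    printf("message 3 -> 5 is logged on CM %d\n", home_cm_of(5, num_cm));
    printf("rank 5 receives from its home CM %d\n", home_cm_of(5, num_cm));
    return 0;
}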

  19. Checkpoint Server (stable)
• Checkpoint images are stored on reliable media: 1 file per Node (name given by the Node)
• Multiprocess server: polls, treats an event and dispatches the job to other processes
• Incoming messages (Put ckpt transaction) and outgoing messages (Get ckpt transaction + control)
• Open sockets: one per attached Node, one per home CM of the attached Nodes
[Figure: checkpoint images stored on disc.]

  20. Node (Volatile): Checkpointing
• User-level checkpoint: Condor Stand Alone Checkpointing (CSAC)
• Clone checkpointing + non-blocking checkpoint: on a locally triggered ckpt order, libmpichv (1) forks a clone; the clone (2) terminates the ongoing communications, (3) closes its sockets and (4) calls ckpt_and_exit(); on restart, execution resumes using CSAC just after (4), reopening the sockets and returning
• The checkpoint image is sent to the CS on the fly (not stored locally)
• The checkpoint order is triggered locally (not by a dispatcher signal)
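A simplified sketch of clone (fork-based, non-blocking) checkpointing; ckpt_and_exit() stands in for the Condor Stand Alone Checkpointing call and the communication/socket helpers are empty placeholders, not the real libmpichv code.

/* Illustrative sketch of clone checkpointing: the forked child is checkpointed
 * while the parent keeps computing (non-blocking checkpoint). */
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

static void finish_ongoing_comms(void) { /* (2) drain in-flight messages */ }
static void close_sockets(void)        { /* (3) close CM/CS connections  */ }
static void ckpt_and_exit(void)        { /* (4) CSAC would write the image
                                            to the CS; here the clone just exits */
    _exit(0);
}

void trigger_local_checkpoint(void)
{
    pid_t clone = fork();                 /* (1) clone the MPI process */
    if (clone == 0) {                     /* child: becomes the checkpoint image */
        finish_ongoing_comms();
        close_sockets();
        ckpt_and_exit();                  /* image streamed to the CS on the fly */
    }
    /* parent: resumes computation immediately -> non-blocking checkpoint */
}

int main(void)
{
    trigger_local_checkpoint();
    printf("parent keeps computing while the clone is checkpointed\n");
    wait(NULL);                           /* reap the clone in this demo */
    return 0;
}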

  21. Library: based on MPICH
• A new device: the 'ch_cm' device
• All ch_cm device functions are blocking communication functions built over the TCP layer
[Figure: software layering from MPI_Send through the ADI (MPID_SendControl, MPID_SendChannel), the Channel Interface and the Chameleon Interface down to the CM device interface (Binding).]
• CM device functions: _cmInit — initialize the client; _cmbsend — blocking send; _cmbrecv — blocking receive; _cmprobe — check for any message available; _cmfrom — get the src of the last message; _cmFinalize — finalize the client

  22. Main differences with p4 device • No message queuing at a node • MPI_Init includes connection to CM servers • All communication functions include sending of special system message • MPI_Finalize includes sending of additional notification message to all CM servers

  23. Outline • Introduction • Motivations & Objectives • Architecture • Performance • Concluding remarks

  24. Experimental platform
• Icluster-Imag: 216 PIII 733 MHz, 256 MB/node
• 5 subsystems with 32 to 48 nodes, 100BaseT switch
• 1 Gb/s switch mesh between subsystems
• Linux, PGI Fortran or GCC compiler
• Very close to a typical building LAN
• Node volatility is simulated; XtremWeb is used as the software environment (launching MPICH-V)
• NAS BT benchmark → complex application (high comm/comp ratio)
[Figure: platform topology; link labels ~4.8 Gb/s and ~1 Gb/s.]

  25. Basic performance
RTT ping-pong: 2 nodes, 2 Channel Memories, blocking communications; mean over 100 measurements.
[Figure: RTT (s) vs. message size (0-384 kB) for P4 (~10.5 MB/s) and the ch_cm device with 1 CM, in-core and out-of-core (~5.6 MB/s), i.e. a factor of ~2.]
• Performance degradation of a factor of 2 (compared to P4), but MPICH-V tolerates an arbitrary number of faults
• Reasonable, since every message crosses the network twice (store and forward through the CM)
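For reference, a generic ping-pong RTT micro-benchmark of the kind reported here (2 ranks, blocking communications, mean over 100 iterations); this is an illustrative re-implementation, not the benchmark code actually used.

/* Generic MPI ping-pong RTT micro-benchmark (illustrative). */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define REPS 100

int main(int argc, char **argv)
{
    int rank, size = 64 * 1024;                 /* message size in bytes */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char *buf = malloc(size);
    double t0 = MPI_Wtime();
    for (int i = 0; i < REPS; i++) {
        if (rank == 0) {
            MPI_Send(buf, size, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_BYTE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, size, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }
    if (rank == 0)
        printf("mean RTT for %d bytes: %g s\n", size, (MPI_Wtime() - t0) / REPS);

    free(buf);
    MPI_Finalize();
    return 0;
}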

  26. Global operation performance
MPI all-to-all for 9 nodes (1 CM).
[Figure: bar chart of all-to-all times (values around 0.3, 0.7, 1 and 2.1 s), with a ~x3 factor highlighted.]

  27. Impact of the number of threads in the Channel Memory
Individual communication time according to the number of nodes attached to 1 CM and the number of threads in the CM. Asynchronous token ring (# tokens = # nodes), mean over 100 executions.
[Figure: time (s) vs. token size (bytes), one curve per configuration.]
• Increasing the number of threads reduces the CM response time, whatever the number of nodes using the same CM

  28. Impact of sharing a Channel Memory
Individual communication time according to the number of nodes attached to 1 CM (simultaneous communications). Asynchronous token ring (# tokens = # nodes), mean over 100 executions. Tokens rotate simultaneously around the ring: there are always #nodes communications at the same time.
[Figure: time (s) vs. token size (0-384 kB), one curve per node count (1, 2, 4, 8, 12 nodes), up to ~0.5 s.]
• The CM response time (as seen by a node) increases linearly with the number of nodes
• Standard deviation < 3% across nodes → fair distribution of the CM resource

  29. Performance of re-execution
Time for the re-execution of a token ring on 8 nodes, according to the token size and the number of restarted nodes.
[Figure: time (s) vs. token size (0-256 kB), one curve per number of restarts (0 to 8); re-execution is faster than the original execution because the messages are already stored in the CM.]
• The system can survive the crash of all MPI processes
• Re-execution is faster because the messages are available in the CM (stored by the previous execution)

  30. Impact of remote checkpoint on node performance
Time (RTT) between the reception of a checkpoint signal and the actual restart: fork, ckpt, compress, transfer to the CS, way back, decompress, restart.
[Figure: local (disc) vs. remote (Ethernet 100BaseT) checkpoint RTT — bt.A.1 (201 MB): 208 s vs. 214 s (+2%); bt.A.4 (43 MB): 62 s vs. 78 s (+25%); bt.B.4 (21 MB): 44 s vs. 50 s (+14%); bt.w.4 (2 MB): 1.4 s vs. 1.8 s (+28%).]
• The cost of a remote checkpoint is close to that of a local checkpoint (can be as low as +2%)… because compression and transfer are overlapped

  31. Stressing the checkpoint server: ckpt RTT for simultaneous ckpts
RTT experienced by every node for simultaneous checkpoints (ckpt signals are synchronized), according to the number of checkpointing nodes (BT.A.1, single CS).
[Figure: RTT (s), roughly 200 to 500 s, vs. number of simultaneous checkpoints (1 to 7) on a single CS.]
• The RTT increases almost linearly with the number of nodes, after network saturation is reached (from 1 to 2)

  32. Impact of checkpointing on application performance
Performance reduction for NAS BT.A.4 according to the number of consecutive checkpoints. A single checkpoint server for 4 MPI tasks (P4 driver). Ckpt is performed at a random time on each node (no sync.).
[Figure: relative performance (%) vs. number of checkpoints during BT.A.4 (0 to 4), for uni-processor and dual-processor nodes, blocking and non-blocking checkpoints.]
• When 4 checkpoints are performed per process, performance is about 94% of that of a non-checkpointed execution
• Several nodes can use the same CS

  33. Putting it all together: performance scalability
Performance of MPI-PovRay:
• Parallelized version of the PovRay raytracer application
• 1 CM for 8 MPI processes
• Renders a complex 450x350 scene
• The comm/comp ratio is about 10% for 16 MPI processes
[Figure: execution time.]
• MPICH-V provides performance similar to P4, plus fault tolerance (at the cost of 1 CM for every 8 nodes)

  34. Putting it all together: performance with volatile nodes
Performance of BT.A.9 with frequent faults (~1 fault / 110 s):
• 3 CMs, 2 CSs (4 nodes on one CS, 5 on the other)
• 1 checkpoint every 130 seconds on each node (non sync.)
[Figure: total execution time (s), from 610 s (base execution without ckpt and fault) up to ~1100 s, vs. number of faults during execution (0 to 10).]
• The overhead of ckpt is about 23%
• For 10 faults, performance is 68% of that without faults
• MPICH-V allows the application to survive node volatility (1 fault / 2 min.)
• The performance degradation with frequent faults stays reasonable

  35. Putting it all together: MPICH-V vs. MPICH-P4 on NAS BT
• 1 CM per MPI process, 1 CS for 4 MPI processes
• 1 checkpoint every 120 seconds on each node
[Figure: execution times of NAS BT class A for MPICH-P4 and for MPICH-V with CM but no logs, CM with logs, and CM+CS+ckpt.]
• (Whole) MPICH-V compares favorably to MPICH-P4 for all configurations on this platform for BT class A
• The difference in communication times is due to the way asynchronous communications are handled by each environment

  36. Outline • Introduction • Motivations & Objectives • Architecture • Performance • Concluding remarks

  37. Concluding remarks
• MPICH-V:
• a full-fledged fault tolerant MPI environment (lib + runtime)
• uncoordinated checkpoint + distributed pessimistic message logging
• Channel Memories, Checkpoint Servers, Dispatcher and Nodes
• Main results:
• Raw communication performance (RTT) is about ½ that of MPICH-P4
• Scalability is as good as that of P4 (128 nodes) for MPI-Pov
• MPICH-V allows the application to survive node volatility (1 fault / 2 min.)
• When frequent faults occur, the performance degradation is reasonable
• NAS BT performance is comparable to MPICH-P4 (up to 25 nodes)
www.lri.fr/~fci/Group

  38. Future
Channel Memories reduce the communication performance:
→ change packet transit from store-and-forward to wormhole
→ remove the CMs (cluster case): message logging on the node, with the communication causality vector stored separately on the CSs
Remove the need for stable resources: add redundancy for Channel Memories, Checkpoint Servers and the Dispatcher.
[Figure: architecture with redundant Channel Memories, Checkpoint Servers and Dispatcher, nodes behind firewalls on the network.]

  39. MPICH-V2 Architecture
A new protocol (SC03) based on:
1) Splitting message logging and event logging
2) Sender-based message logging
3) A pessimistic approach (reliable event logger)
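A toy sketch of the split described above: the payload stays in a sender-side log while only the small reception event goes to a (here, simulated) reliable event logger before delivery; all names and structures are assumptions for illustration.

/* Illustrative split between sender-based message logging and event logging. */
#include <stdio.h>
#include <string.h>

typedef struct { int dst, seq, len; char payload[64]; } logged_msg_t;
typedef struct { int src, seq; } recv_event_t;

static logged_msg_t sender_log[128];   /* message log kept on the sender          */
static int          sender_log_len;
static recv_event_t event_log[128];    /* would live on a reliable event logger   */
static int          event_log_len;

/* Sender side: keep the payload locally before/while sending it. */
void log_and_send(int dst, int seq, const char *data)
{
    logged_msg_t *m = &sender_log[sender_log_len++];
    m->dst = dst; m->seq = seq; m->len = (int)strlen(data) + 1;
    memcpy(m->payload, data, m->len);
    /* ... the actual network send of the payload would happen here ... */
}

/* Receiver side: pessimistic event logging -> the reception event must be
 * logged (and acknowledged) before execution goes on. */
void deliver(int src, int seq)
{
    event_log[event_log_len++] = (recv_event_t){ src, seq };
    /* only now is the message handed to the application */
}

int main(void)
{
    log_and_send(1, 0, "hello");   /* payload stays in the sender's log   */
    deliver(0, 0);                 /* only the small event goes to the EL */
    printf("%d message(s) in the sender log, %d event(s) logged\n",
           sender_log_len, event_log_len);
    return 0;
}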

  40. Message logger and event logger

  41. [Figure: a computing node running the MPI process and the V2 daemon; the daemon handles Send/Receive, ckpt control and CSAC, keeps the sent payloads on the node's disc, sends reception events to the Event Logger and checkpoint images to the Ckpt Server.]

  42. Impact of uncoordinated checkpoint + sender-based message logging
[Figure: processes P0 and P1 with checkpoint images on the CS, P1's message logger (ML) holding messages 1 and 2, and the event logger (EL).]
• Obligation to checkpoint the Message Loggers on the computing nodes
• A garbage collector is required to reduce the ML checkpoint size

  43. Garbage collection
[Figure: P1's message logger holds messages 1, 2, 3 sent to P0; once P0's checkpoint image covering messages 1 and 2 is complete on the CS, 1 and 2 can be deleted from P1's ML → garbage collector.]
Receiver checkpoint completion triggers the garbage collector of the senders.
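A small sketch of that garbage collection rule: when a receiver notifies checkpoint completion, the sender drops every logged message already covered by the checkpoint; the data layout and sequence-number convention are assumptions.

/* Illustrative garbage collection of a sender-side message log. */
#include <stdio.h>

#define LOG_CAP 128

typedef struct {
    int seq[LOG_CAP];     /* sequence numbers of logged messages to one receiver */
    int count;
} message_log_t;

/* Called on the sender when the receiver notifies checkpoint completion:
 * drop every message whose sequence number the checkpoint already covers. */
void garbage_collect(message_log_t *ml, int last_seq_in_ckpt)
{
    int kept = 0;
    for (int i = 0; i < ml->count; i++)
        if (ml->seq[i] > last_seq_in_ckpt)    /* still needed for a replay */
            ml->seq[kept++] = ml->seq[i];
    ml->count = kept;
}

int main(void)
{
    message_log_t ml = { {1, 2, 3}, 3 };      /* messages 1, 2, 3 are logged     */
    garbage_collect(&ml, 2);                  /* receiver checkpointed up to 2   */
    printf("%d message(s) left in the log\n", ml.count);  /* only message 3 */
    return 0;
}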

  44. Scheduling Checkpoints
• Uncoordinated checkpoint leads to logging in-transit messages
• Scheduling checkpoints simultaneously would lead to a burst in the network traffic
• Checkpoint size can be reduced by removing message logs
• Coordinated checkpoint (Lamport) requires global synchronization
• Checkpoint traffic should be flattened
• Checkpoint scheduling should evaluate the cost and benefit of each checkpoint
[Figure: P0's and P1's message logs; depending on which checkpoints have completed, messages 1 and 2 (or 1, 2 and 3) can be deleted by the garbage collector, message 3 still needs to be checkpointed, and P0 needs no message checkpoint.]

  45. Checkpoint Server (stable)
• Checkpoint images are stored on reliable media: 1 file per Node (name given by the Node)
• Multiprocess server: polls, treats an event and dispatches the job to other processes
• Incoming messages (Put ckpt transaction) and outgoing messages (Get ckpt transaction + control)
• Open sockets: one per attached Node, one per home CM of the attached Nodes
[Figure: checkpoint images stored on disc.]

  46. Node (Volatile): Checkpointing
• User-level checkpoint: Condor Stand Alone Checkpointing (CSAC)
• Clone checkpointing + non-blocking checkpoint: on a locally triggered ckpt order, libmpichv (1) forks a clone; the clone (2) terminates the ongoing communications, (3) closes its sockets and (4) calls ckpt_and_exit(); on restart, execution resumes using CSAC just after (4), reopening the sockets and returning
• The checkpoint image is sent to the CS on the fly (not stored locally)
• The checkpoint order is triggered locally (not by a dispatcher signal)

  47. Performance evaluation
Cluster: 32 Athlon 1800+ CPUs, 1 GB RAM, IDE disc + 16 dual Pentium III, 500 MHz, 512 MB, IDE disc + 48-port 100 Mb/s Ethernet switch
Linux 2.4.18, GCC 2.96 (-O3), PGI Fortran <5 (-O3, -tp=athlonxp)
[Figure: a single reliable node hosting the Checkpoint Server + Event Logger + Checkpoint Scheduler and the Dispatcher, connected to the computing nodes through the network.]

  48. Bandwidth and Latency
Latency for a 0-byte MPI message: MPICH-P4 (77 µs), MPICH-V1 (154 µs), MPICH-V2 (277 µs)
Latency is high due to the event logging → a receiving process can send a new message only when the reception event has been successfully logged (6 TCP messages per communication).
Bandwidth is high because the event messages are short.

  49. NAS Benchmarks, Class A & B
[Figure: performance results; labels: latency, memory capacity.]
