
Proactive Fault Tolerance Using Xen Virtualization

This solution addresses the problem of fault tolerance in high-performance computing (HPC) systems by implementing proactive measures. It anticipates node failures and migrates the operating system to a healthier node, preserving the application state with minimal overhead.





Presentation Transcript


  1. Proactive Fault Tolerance for HPC using Xen Virtualization • Arun Babu Nagarajan, Frank Mueller • North Carolina State University

  2. Problem Statement • Trends in HPC: high-end systems with thousands of processors • Increased probability of a node failure: MTBF becomes shorter • MPI widely accepted in scientific computing • Problem with MPI: no recovery from faults in the standard • Fault tolerance exists today, but… • only reactive: process checkpoint/restart • must restart the entire job • inefficient if only one (or a few) node(s) fails • overhead due to redoing some of the work • issue: at what frequency to checkpoint? • a 100-hour job will run for an additional 150 hours on a petaflop machine (even without failures) [I. Philp, 2005]

  3. Our Solution • Proactive FT • anticipates node failure • takes preventive action instead of reacting to a failure • migrates the whole OS to a healthier physical node • entirely transparent to the application (though not to the OS itself) • hence avoids the high overhead of reactive schemes (the overhead associated with our scheme is very small)

  4. Design space • 1. A mechanism to predict/anticipate the failure of a node • OpenIPMI • lm_sensors (more system-specific: x86 Linux) • 2. A mechanism to identify the best target node • custom centralized approaches: don't scale and are unreliable • scalable distributed approach: Ganglia • 3. Most importantly, a mechanism (for the preventive action) that supports relocation of the running application with • its state preserved • minimum overhead on the application itself • Xen virtualization with live migration support [C. Clark et al., May 2005] • open source

  5. Mechanisms explained • 1. Health monitoring with OpenIPMI • Baseboard Management Controller (BMC) equipped with sensors to monitor different properties of each node, such as temperature, fan speed, and voltage • IPMI (Intelligent Platform Management Interface) • increasingly common in HPC • standard message-based interface to monitor hardware • raw messaging is harder to use and debug • OpenIPMI: open source, higher-level abstraction over the raw IPMI message-response system to communicate with the BMC (i.e., to read sensors) • We use OpenIPMI to gather health information about nodes
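A minimal sketch of the health-monitoring step, shown here with the ipmitool command line rather than the OpenIPMI C library the authors actually used; the watched sensor names and safe bands are illustrative placeholders:

```python
# Sketch: poll node health as the PFT daemon does, but via the
# `ipmitool sensor` CLI instead of the OpenIPMI library from the slides.
import subprocess

# Placeholder watch list: sensor name -> (low threshold, high threshold)
WATCHED = {"CPU Temp": (10.0, 70.0), "Fan1": (1000.0, 20000.0)}

def read_sensors():
    """Return {sensor_name: reading} parsed from `ipmitool sensor` output."""
    out = subprocess.run(["ipmitool", "sensor"],
                         capture_output=True, text=True, check=True).stdout
    readings = {}
    for line in out.splitlines():
        cols = [c.strip() for c in line.split("|")]
        if len(cols) >= 2 and cols[0] in WATCHED:
            try:
                readings[cols[0]] = float(cols[1])
            except ValueError:
                pass  # reading not available ("na")
    return readings

def health_ok(readings):
    """True if every readable watched sensor lies inside its safe band."""
    for name, value in readings.items():
        low, high = WATCHED[name]
        if not (low <= value <= high):
            return False
    return True
```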

  6. Mechanisms explained • 2. Ganglia • widely used, scalable distributed load-monitoring tool • all the nodes in the cluster run a Ganglia daemon, and each node has an approximate view of the entire cluster • UDP used to transfer messages • measures CPU usage, memory usage, and network usage by default • We use Ganglia to identify the least-loaded node → migration target • Also extended to distribute IPMI sensor data
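A minimal sketch of the target-selection step, assuming a standard Ganglia gmond that publishes cluster state as XML on its default TCP port 8649; load_one is the stock Ganglia one-minute load metric:

```python
# Sketch: pick the least-loaded node as the migration target by reading the
# cluster-wide XML dump that a Ganglia gmond serves on TCP port 8649.
import socket
import xml.etree.ElementTree as ET

def least_loaded_node(gmond_host="localhost", port=8649):
    """Return (hostname, 1-minute load) for the least-loaded node, or None."""
    with socket.create_connection((gmond_host, port)) as sock:
        chunks = []
        while True:
            data = sock.recv(65536)
            if not data:          # gmond closes after sending the full XML
                break
            chunks.append(data)
    root = ET.fromstring(b"".join(chunks))
    best = None
    for host in root.iter("HOST"):
        for metric in host.iter("METRIC"):
            if metric.get("NAME") == "load_one":
                load = float(metric.get("VAL"))
                if best is None or load < best[1]:
                    best = (host.get("NAME"), load)
    return best
```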

  7. Mechanisms explained • 3. Fault tolerance with Xen • para-virtualized environment • OS modified • application unchanged • privileged VM and guest VMs run on the Xen hypervisor/VMM • guest VMs can live-migrate to other hosts → little overhead • state of the VM is preserved • VM halted only for an insignificant period of time • Migration phases: • phase 1: send guest image → destination node, app running • phase 2: repeated diffs → destination node, app still running • phase 3: commit final diffs → destination node, OS/app frozen • phase 4: activate guest on destination, app running again (Slide figure: MPI task inside a guest VM, alongside the privileged VM, on the Xen VMM over the hardware)
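The preventive action itself boils down to one call into Xen's user-land tools, which drive the four phases above. A minimal sketch, assuming the Xen 3.x xm command is on the path; the domain and target host names are placeholders:

```python
# Sketch: trigger Xen live migration from the privileged VM by shelling out
# to the Xen 3.x userland tool `xm`.
import subprocess

def live_migrate(domain="guestvm1", target_host="standby-node"):
    """Invoke `xm migrate --live` and block until the tool returns."""
    subprocess.run(["xm", "migrate", "--live", domain, target_host], check=True)
```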

  8. Overall set-up of the components • Each compute node runs the Xen VMM on the hardware; a privileged VM hosts the PFT daemon and Ganglia, a guest VM hosts the MPI task, and the BMC (Baseboard Management Controller) resides in the hardware • One stand-by Xen host runs no guest • Deteriorating health → migrate the guest VM (with the MPI app) to the stand-by host (Slide figure: loaded hosts and a stand-by host, with a 'Migrate' arrow moving a guest VM to the stand-by host)

  9. Overall set-up of the components (continued) • Stand-by Xen host, no guest • Deteriorating health → migrate the guest VM (with the MPI app) to the stand-by host • The destination host generates an unsolicited ARP reply advertising that the guest VM's IP has moved to a new location [C. Clark et al., 2005]; this makes peers resend packets to the new host
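Xen's migration code emits that unsolicited (gratuitous) ARP reply itself; purely to illustrate the packet involved, here is a sketch using scapy (not a tool from the slides), with the IP address, MAC address, and interface as placeholders:

```python
# Illustration only: a gratuitous ARP reply announcing that guest_ip is now
# reachable at new_mac. Xen sends the real one during live migration.
from scapy.all import ARP, Ether, sendp

guest_ip = "10.0.0.42"         # placeholder: the migrated guest VM's IP
new_mac = "00:16:3e:00:00:01"  # placeholder: the guest's MAC on the new host

pkt = Ether(dst="ff:ff:ff:ff:ff:ff") / ARP(
    op=2,                      # ARP reply ("is-at")
    psrc=guest_ip, pdst=guest_ip,
    hwsrc=new_mac, hwdst="ff:ff:ff:ff:ff:ff",
)
sendp(pkt, iface="eth0")       # broadcast so peers update their ARP caches
```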

  10. Proactive Fault Tolerance (PFT) Daemon • Runs on the privileged VM (host) • Initialize: reads the safe thresholds from a config file, one line per sensor: <Sensor name> <Low Thr> <Hi Thr> • CPU temperature and fan speeds monitored; extensible (corrupt sectors, network, voltage fluctuations, …) • Initializes the connection with the IPMI BMC using authentication parameters and hostname • Gathers a listing of the available sensors in the system and validates it against our list • Control flow: Initialize → Health Monitor → threshold breach? • No: keep monitoring • Yes: hand over to load balancing (Ganglia) and raise an alarm for maintenance of the system
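A minimal sketch of reading that threshold config file, assuming the one-line-per-sensor format shown on the slide; the file name and example sensors are hypothetical:

```python
# Sketch: parse the PFTd threshold config, one line per sensor in the form
# "<Sensor name> <Low Thr> <Hi Thr>". Blank lines and '#' comments skipped.
def load_thresholds(path="pftd.conf"):
    """Return {sensor_name: (low, high)} from the threshold config file."""
    thresholds = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            # Sensor names may contain spaces, so split off the two numbers
            # from the right.
            name, low, high = line.rsplit(None, 2)
            thresholds[name] = (float(low), float(high))
    return thresholds

# Example config contents (illustrative values):
#   CPU Temp   10   70
#   Fan1     1000 20000
```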

  11. PFT Daemon • Health monitoring • interacts with the IPMI BMC (via OpenIPMI) to read sensors • periodic sampling of data (event-driven operation is also supported) • threshold exceeded → control handed over to load balancing • PFTd determines the migration target by contacting Ganglia • load-based selection (lowest load) • load obtained via the /proc file system • Invokes Xen live migration for the guest VM • Xen user-land tools (at the VM/host) • command-line interface for live migration • the PFT daemon initiates migration of the guest VM
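Putting the pieces together, a sketch of the PFTd control loop: sample sensors periodically, and on a threshold breach ask Ganglia for the least-loaded node and live-migrate the guest there. It assumes the helper functions sketched after the earlier slides are importable from a hypothetical pftd_helpers module:

```python
# Sketch of the PFTd control loop on the privileged VM.
import time
from pftd_helpers import read_sensors, health_ok, least_loaded_node, live_migrate

def pftd_loop(guest_domain="guestvm1", interval_s=5):
    while True:
        readings = read_sensors()               # periodic sampling via IPMI
        if not health_ok(readings):             # threshold exceeded?
            target, _load = least_loaded_node() # ask Ganglia for the target
            live_migrate(guest_domain, target)  # Xen live migration
            break                               # node can now be serviced
        time.sleep(interval_s)
```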

  12. Experimental Framework • Cluster of 16 nodes (dual-core, dual Opteron 265, 1 Gbps Ethernet) • Xen-3.0.2-3 VMM • Privileged and guest VMs run a ported Linux kernel, version 2.6.16 • Guest VM: • same configuration as the privileged VM • has 1 GB RAM • booted on the VMM with PXE netboot via NFS • has access to NFS (same as the privileged VM) • Ganglia on the privileged VM (and also the guest VM) on all nodes • Node sensors read via OpenIPMI

  13. Experimental Framework • NAS Parallel Benchmarks run on the guest virtual machines • MPICH-2 with an MPD ring on n guest VMs (no job-pause required!) • A process on the privileged domain • monitors the MPI task runs • issues the migration command (NFS used for synchronization) • Measured: • wallclock time with and without migration • actual downtime + migration overhead (modified Xen migration) • benchmarks run 10 times; results report the average • NPB V3.2.1: BT, CG, EP, LU and SP benchmarks • IS runs are too short • MG requires > 1 GB for class C

  14. Experimental Results • 1. Single node failure • 2. Double node failure (charts: NPB Class C / 4 nodes, NPB Class B / 4 nodes) • Single node failure: overhead of 1-4% over total wall clock time • Double node failure: overhead of 2-8% over total wall clock time

  15. Experimental Results • 3. Behavior of problem scaling (chart: NPB, 4 nodes) • Chart depicts only the overhead portion • The dark region represents the part for which the VM was halted • The light region represents the delay incurred due to migration (diff operations, etc.) • Generally, overhead increases with problem size (CG is an exception)

  16. Experimental Results • 4. Behavior of task scaling (chart: NPB Class C) • Generally we expect a decrease in overhead as the number of nodes increases • Some discrepancies observed for BT and LU (migration duration is 40 s, but here we see 60 s)

  17. Experimental Results • 5. Migration duration (charts: NPB, 4 nodes; NPB, 4/8/16 nodes) • Minimum of 13 s needed to transfer a 1 GB VM without any active processes • Maximum of 40 s needed before migration is initiated • Depends on the network bandwidth, the RAM size, and the application
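As a quick sanity check on the 13 s minimum, the raw transfer of a 1 GB guest image over 1 Gbps Ethernet already takes roughly 8-9 s before Xen's iterative diff rounds and protocol overhead are added:

```python
# Back-of-envelope lower bound on migration time from link bandwidth alone.
ram_bits = 1 * 1024**3 * 8   # 1 GB guest RAM, in bits
link_bps = 1e9               # 1 Gbps Ethernet
print(ram_bits / link_bps)   # ~8.6 s for the raw transfer alone
```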

  18. Experimental Results • 6. Scalability (total execution time, NPB Class C) • Speedup is largely unaffected

  19. Related Work • FT: the reactive approach is more common • Automatic • checkpoint/restart (e.g., BLCR: Berkeley Lab Checkpoint/Restart) [S. Sankaran et al., LACSI '03], [G. Stellner, IPPS '96] • log-based (message logging + temporal ordering) [G. Bosilca, Supercomputing 2002] • Non-automatic • explicit invocation of checkpoint routines [R. T. Aulwes et al., IPDPS 2004], [G. E. Fagg and J. J. Dongarra, 2000] • Virtualization in HPC incurs little/no overhead [W. Huang et al., ICS '06] • To make virtualization competitive for message-passing environments, VMM-bypass I/O in VMs has been explored [J. Liu et al., USENIX '06] • Network virtualization can be optimized [A. Menon et al., USENIX '06]

  20. Conclusion • In contrast to the currently available reactive FT schemes, we have developed a proactive system with much lower overhead • Transparent and automatic FT for arbitrary MPI applications • Ideally complements long-running MPI jobs • A proactive system complements reactive systems well: it helps to greatly reduce the high overhead associated with reactive schemes
