
Reliability, Availability, and Serviceability (RAS) for High-Performance Computing


Presentation Transcript


  1. Reliability, Availability, and Serviceability (RAS) for High-Performance Computing
  Stephen L. Scott and Christian Engelmann
  Computer Science Research Group, Computer Science and Mathematics Division

  2. Research and development goals
  • Provide high-level RAS capabilities for current terascale and next-generation petascale high-performance computing (HPC) systems
  • Eliminate many of the numerous single points of failure and control in today's HPC systems
  • Develop techniques to enable HPC systems to run computational jobs 24/7
  • Develop proof-of-concept prototypes and production-type RAS solutions

  3. MOLAR: Adaptive runtime support for high-end computing operating and runtime systems
  • Addresses the challenges for operating and runtime systems to run large applications efficiently on future ultrascale high-end computers
  • Part of the Forum to Address Scalable Technology for Runtime and Operating Systems (FAST-OS)
  • MOLAR is a collaborative research effort (www.fastos.org/molar)

  4. Symmetric active/active redundancy
  • Many active head nodes
  • Workload distribution
  • Symmetric replication between head nodes
  • Continuous service
  • Always up to date
  • No fail-over necessary
  • No restore-over necessary
  • Virtual synchrony model
  • Complex algorithms
  • Prototypes for Torque and Parallel Virtual File System metadata server
  [Figure: active/active head nodes serving the compute nodes]
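Because every active head node applies the same totally ordered stream of updates, each replica stays current, and losing any one head node requires no fail-over and no state restore. The Python below is a minimal sketch of that idea under stated assumptions: the Replica class, the update format, and the in-process replay standing in for a virtual-synchrony group communication layer are all invented for illustration and are not the Torque or PVFS metadata-server prototypes.

```python
# Minimal sketch of symmetric active/active replication (illustrative only):
# every head-node replica applies the same totally ordered stream of updates,
# so any replica can answer queries and no fail-over or restore is needed.

class Replica:
    """One active head node holding a full copy of the service state."""

    def __init__(self, name):
        self.name = name
        self.state = {}                      # e.g., metadata: path -> attributes

    def apply(self, op):
        """Apply one update delivered in total order by the group layer."""
        kind, key, value = op
        if kind == "set":
            self.state[key] = value
        elif kind == "delete":
            self.state.pop(key, None)

    def query(self, key):
        """Reads are served locally; all replicas are equally up to date."""
        return self.state.get(key)


# In a real system a virtual-synchrony group communication layer imposes the
# total order; here we simply replay the same list on every replica.
ordered_updates = [
    ("set", "/scratch/job42", {"owner": "alice"}),
    ("set", "/scratch/job43", {"owner": "bob"}),
    ("delete", "/scratch/job42", None),
]

replicas = [Replica(f"head{i}") for i in range(3)]
for op in ordered_updates:
    for replica in replicas:                 # every active head node applies every update
        replica.apply(op)

# Any surviving replica gives the same answer; losing one loses no state.
assert all(r.query("/scratch/job43") == {"owner": "bob"} for r in replicas)
```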

  5. Symmetric active/active Parallel Virtual File System metadata server
  [Figure: writing and reading throughput (requests/sec) versus number of clients (1–32) for baseline PVFS and active/active configurations with 1, 2, and 4 servers]

  6. Reactive fault tolerance for HPC with LAM/MPI+BLCR job-pause mechanism
  • Operational nodes: Pause
    • BLCR reuses existing processes
    • LAM/MPI reuses existing connections
    • Restore partial process state from checkpoint
  • Failed nodes: Migrate
    • Restart process on new node from checkpoint
    • Reconnect with paused processes
  • Scalable MPI membership management for low overhead
  • Efficient, transparent, and automatic failure recovery
  [Figure: paused MPI processes on live nodes keep their existing connections while the failed MPI process is migrated from the failed node to a spare node via shared storage and reconnected]
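The recovery flow can be summarized as: pause the surviving MPI processes in place (BLCR keeps the processes, LAM/MPI keeps their connections), restart only the failed ranks on spare nodes from their checkpoints, then reconnect and continue. The Python below is a conceptual sketch of that coordinator logic under assumed names (the recover function, the rank-to-node mapping, and the checkpoint paths are hypothetical); it does not use the LAM/MPI or BLCR APIs.

```python
# Conceptual sketch of the job-pause recovery flow (hypothetical helper, not
# the LAM/MPI+BLCR implementation): survivors are paused in place and keep
# their processes and connections; only failed ranks restart on spare nodes.

def recover(rank_to_node, failed_ranks, spare_nodes, checkpoints):
    """Return a new rank -> node mapping after pausing survivors and
    migrating the failed ranks to spare nodes."""
    new_mapping = dict(rank_to_node)
    for rank, node in rank_to_node.items():
        if rank in failed_ranks:
            spare = spare_nodes.pop(0)                   # take a spare node
            print(f"restart rank {rank} on {spare} from {checkpoints[rank]}")
            new_mapping[rank] = spare
        else:
            print(f"pause rank {rank} on {node}; restore partial state in place")
    print("reconnect migrated ranks with paused ranks and resume the job")
    return new_mapping


rank_to_node = {0: "node0", 1: "node1", 2: "node2"}
checkpoints = {rank: f"/shared/ckpt/rank{rank}" for rank in rank_to_node}
new_mapping = recover(rank_to_node, failed_ranks={1},
                      spare_nodes=["spare0"], checkpoints=checkpoints)
```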

  7. LAM/MPI+BLCR job-pause performance
  • 3.4% overhead over job restart, but
  • No LAM reboot overhead
  • Transparent continuation of execution
  • No requeue penalty
  • Less staging overhead
  [Figure: time in seconds for job pause and migrate, LAM reboot, and job restart on the BT, CG, EP, FT, LU, MG, and SP benchmarks]

  8. Proactive fault tolerance for HPC using Xen virtualization
  • Standby Xen host (spare node without guest VM)
  • Deteriorating health
  • Migrate guest VM to spare node
  • New host generates unsolicited ARP reply
  • Indicates that guest VM has moved
  • ARP tells peers to resend to new host
  • Novel fault-tolerance scheme that acts before a failure impacts a system
  [Figure: each node runs a PFT daemon and Ganglia in the privileged VM on the Xen VMM above the hardware BMC; the guest VM hosting the MPI task is migrated to the standby host]
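A rough illustration of the control loop such a per-node PFT daemon might run is sketched below: poll a health reading, and when it crosses a threshold, live-migrate the guest VM to the standby host before the node fails. The temperature source, threshold, VM and host names are assumptions for this sketch, and the migration is only printed (in practice it would go through the Xen toolstack, driven by Ganglia and BMC health data as in the actual project).

```python
# Rough sketch of a proactive fault tolerance (PFT) daemon loop (illustrative).
# Health readings, threshold, and VM/host names are assumptions; the real
# daemon consumes Ganglia and BMC health data in the privileged VM.
import random
import time

TEMP_THRESHOLD_C = 80.0           # assumed health threshold
GUEST_VM = "hpc-guest"            # hypothetical guest VM running the MPI task
SPARE_HOST = "spare-node-01"      # hypothetical standby Xen host (no guest VM)

def read_node_health():
    """Stand-in for a Ganglia/BMC query; returns a fake temperature reading."""
    return random.uniform(60.0, 90.0)

def migrate_guest(vm, target):
    """Trigger live migration of the guest VM to the spare host."""
    # In practice this would call the Xen toolstack (e.g. a live `xm migrate`);
    # printed here so the sketch runs anywhere.
    print(f"would run: xm migrate --live {vm} {target}")

for _ in range(100):                          # bounded polling loop for the demo
    temperature = read_node_health()
    if temperature > TEMP_THRESHOLD_C:
        print(f"health deteriorating ({temperature:.1f} C); migrating {GUEST_VM}")
        migrate_guest(GUEST_VM, SPARE_HOST)
        break                                 # guest VM no longer runs on this node
    time.sleep(0.1)                           # poll interval (shortened for the demo)
```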

  9. VM migration performance impact
  • Single node failure: 0.5–5% additional cost over total wall clock time
  • Double node failure: 2–8% additional cost over total wall clock time
  [Figure: wall clock time in seconds for BT, CG, EP, LU, and SP without migration, with one migration (single node failure), and with two migrations (double node failure)]

  10. HPC reliability analysis and modeling
  • Programming paradigm and system scale impact reliability
  • Reliability analysis
  • Estimate mean time to failure (MTTF)
  • Obtain failure distribution: exponential, Weibull, gamma, etc.
  • Feedback into fault-tolerance schemes for adaptation
  [Figure: total system MTTF (hrs) for a k-of-n AND survivability (k = n) parallel execution model versus the number of participating nodes (10–5,000) at node MTTFs of 1,000, 3,000, 5,000, and 7,000 hrs; cumulative probability of time between failures (TBF) with fitted distributions]
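To make the scaling behind the MTTF plot concrete, the formulas below give the standard closed-form system MTTF under the simplifying assumption of independent, exponentially distributed node failures with a common node MTTF θ and no repair; this is only one of the candidate distributions listed above (a Weibull or gamma fit changes the numbers), so treat it as an illustrative model rather than the exact analysis behind the figure.

```latex
% System MTTF under independent exponential node failures, node MTTF = \theta.
% If the parallel job needs all n nodes (k = n), it fails at the first node failure:
\[
  \mathrm{MTTF}_{\mathrm{sys}}(k = n) = \frac{\theta}{n}
\]
% General k-of-n survivability: sum the expected waiting times while
% i = n, n-1, \dots, k nodes are still alive.
\[
  \mathrm{MTTF}_{\mathrm{sys}}(k\text{-of-}n) = \sum_{i=k}^{n} \frac{\theta}{i}
\]
% Example: \theta = 5000 hrs and n = k = 1000 nodes gives a system MTTF of 5 hrs,
% consistent with the steep drop with node count shown in the figure.
```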

  11. Contacts regarding RAS research
  Stephen L. Scott
  Computer Science Research Group, Computer Science and Mathematics Division
  (865) 574-3144
  scottsl@ornl.gov
  Christian Engelmann
  Computer Science Research Group, Computer Science and Mathematics Division
  (865) 574-3132
  engelmannc@ornl.gov
