

  1. “Proactive Fault Tolerance for HPC with Xen Virtualization” Nagarajan, Mueller, Engelmann, Scott – NCSU and Oak Ridge National Laboratory. Stephen Orchowski – 11/15/2008, CSE 520 – Advanced Computer Architecture

  2. Agenda • Background and motivations • Monitoring of system health • Role of Xen in Fault Tolerance • Migration process • Management of the FT mechanism • Experimental setup and results

  3. Motivations • What is HPC? High-performance computing: large clusters running long, tightly coupled jobs, so component faults are a fact of life. • Checkpoints are used to save the state of the system and of program execution. • Restarts are issued after a fault occurs and the failing component is removed or isolated. • Checkpointing adds overhead – it can “prolong a 100 hour job (without failure) by an additional 151 hours in petaflop systems”

  4. Motivations cont. • Current techniques rely on reactive mechanisms • What if we could predict when a failure is about to occur? • Measure the health of a system by monitoring fan status, component temperatures, voltages, and disk error logs. • Checkpoints are still necessary, but they become the exception rather than the norm.

  5. Proactive Fault Tolerance • The system must provide the following constructs to carry out proactive fault tolerance: • Node health monitoring • Failure prediction • Load balancing • Migration mechanism

  6. Xen Review • Paravirtualization – recall that the hosted guest OS must be modified to run on the VMM; applications do not need to be modified • Xen provides facilities for live migration • All state information is transferred before activation occurs on the target node • Preserves the state of all the processes on the guest.

  7. Migration Process • The source host inquires whether the target has sufficient resources for the new guest; if so, it reserves them • The source host sends all pages of the guest VM to the destination node. Guest page-table entries are marked so that subsequent writes trap and set a dirty bit. • The source host then iteratively resends the dirtied pages to the destination node. • The guest VM is finally stopped; the last remaining dirty pages are sent, and the guest VM begins execution on the destination node. • What is the point of these steps? Why not just stop a guest OS, transfer it, and restart it? The iterative pre-copy keeps the pause short, as the simulation below illustrates.
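As a concrete illustration of why the iterative rounds pay off, here is a toy Python simulation of pre-copy migration. It is a sketch under assumed parameters (page count, dirty rate, stop-and-copy threshold), not Xen's actual implementation:

import random

TOTAL_PAGES = 1000        # assumed guest memory size, in pages
DIRTY_RATE = 0.20         # assumed fraction dirtied while all pages are copied
STOP_THRESHOLD = 10       # pause the guest once this few pages remain

def precopy_migrate():
    to_send = set(range(TOTAL_PAGES))          # round 1: send everything
    round_no = 0
    while len(to_send) > STOP_THRESHOLD:
        round_no += 1
        sent = len(to_send)
        # The guest keeps running during the copy, so some pages are
        # dirtied again.  Shorter rounds dirty fewer pages, so the set
        # of pages to resend shrinks from round to round.
        p_dirty = DIRTY_RATE * sent / TOTAL_PAGES
        to_send = {pg for pg in range(TOTAL_PAGES)
                   if random.random() < p_dirty}
        print(f"round {round_no}: sent {sent} pages, "
              f"{len(to_send)} dirtied during the copy")
    # Final stop-and-copy: the guest is paused only for this short phase.
    print(f"guest paused; sending final {len(to_send)} pages, "
          f"then resuming on the destination")

precopy_migrate()

The guest is paused only for the final, small transfer, so the running MPI application loses far less time than it would if the whole memory image were copied while the guest was stopped.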

  8. Load Balancing • Ganglia – a scalable distributed monitoring system. • Every node runs a daemon which monitors local resources. • Each node sends out multicast packets containing its current status information • Every node therefore has a global view of the current state of the entire system. • Health information is not part of this mechanism. • A target node is selected if it does not yet host a guest VM and has the lowest CPU utilization (a sketch of this rule follows)
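A minimal Python sketch of that selection rule. The node records and the field names hosts_guest and cpu_util are hypothetical stand-ins for the metrics a Ganglia daemon reports:

def select_target(nodes):
    """Pick the node with the lowest CPU utilization among those that
    do not already host a guest VM; None if no node qualifies."""
    candidates = [n for n in nodes if not n["hosts_guest"]]
    if not candidates:
        return None
    return min(candidates, key=lambda n: n["cpu_util"])

nodes = [
    {"name": "node1", "hosts_guest": True,  "cpu_util": 0.10},
    {"name": "node2", "hosts_guest": False, "cpu_util": 0.45},
    {"name": "node3", "hosts_guest": False, "cpu_util": 0.12},
]
print(select_target(nodes)["name"])   # -> node3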

  9. Health Monitoring • Intelligent Platform Management Interface (IPMI) – provides a standardized message-based mechanism for monitoring and managing hardware • Baseboard Management Controller (BMC) – contains sensors to monitor different system components and properties • Periodic sampling is accomplished by means of the OpenIPMI API, which communicates with the BMC.
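The paper samples the BMC through the OpenIPMI C API. Purely as an illustration, the Python sketch below substitutes the ipmitool command-line tool for that API; the sensor names, thresholds, output parsing, and sampling interval are all assumptions:

import subprocess
import time

# Assumed sensor names and limits - not values from the paper.
THRESHOLDS = {"CPU Temp": 70.0, "Ambient Temp": 40.0}

def read_sensor(name):
    """Read one sensor via the ipmitool CLI (output format assumed
    to be "Name | value", e.g. "CPU Temp | 48.000")."""
    out = subprocess.run(["ipmitool", "sensor", "reading", name],
                         capture_output=True, text=True,
                         check=True).stdout
    return float(out.split("|")[1])

def first_breach():
    """Return the first sensor whose reading exceeds its threshold."""
    for name, limit in THRESHOLDS.items():
        if read_sensor(name) > limit:
            return name
    return None

while True:
    sensor = first_breach()
    if sensor:
        print(f"threshold exceeded on {sensor}: request migration")
        break
    time.sleep(5)   # assumed sampling interval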

  10. Putting it all together – The PFT Daemon • The PFT (Proactive Fault Tolerance) Daemon centralizes and controls the three main components • Health monitoring • Decision making • Load balancing • Initialization is lengthy: it loads the threshold values and the specific parameters to monitor. • After initialization, the daemon samples the various sensors via the BMC. • Readings are compared with the thresholds; if any is exceeded, control is transferred to the Ganglia component, which selects a target migration node. • The PFTd then issues the migration command, which begins the live migration of the guest VM from the current node to a “healthier” one (a skeleton of this loop follows).
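Putting those pieces together, a skeleton of the PFTd control flow might look as follows. The sample_sensors and select_target callbacks are hypothetical placeholders for the IPMI and Ganglia components sketched above; only the "xm migrate --live" invocation is Xen's actual live-migration command:

import subprocess
import time

def pftd_loop(guest_domain, thresholds, sample_sensors, select_target):
    """Monitor node health; on a threshold breach, live-migrate the
    guest VM to the node chosen by the load balancer."""
    while True:
        readings = sample_sensors()          # dict: sensor name -> value
        breached = [s for s, v in readings.items()
                    if v > thresholds.get(s, float("inf"))]
        if breached:
            target = select_target()         # healthiest idle node, or None
            if target is not None:
                # Hand the guest off to the healthier node via Xen's
                # live-migration command-line interface (Xen 3.x).
                subprocess.run(["xm", "migrate", "--live",
                                guest_domain, target], check=True)
                return target
        time.sleep(5)                        # assumed sampling interval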

  11. PFTd cont.

  12. Experimental Setup • 16-node cluster; each node has 2 GB main memory and two dual-core AMD Opteron 265 processors, interconnected by a 1 Gbps Ethernet switch • NPB (NAS Parallel Benchmarks) suite • How to simulate failures?

  13. What is the experiment testing? • Recall that HPC clusters experience faults, so checks have to be built into the overall system; this overhead reduces total performance • Measure wall-clock time of the system with and without failures • Measure performance for various scenarios • Single-node failure – 4 nodes • Double-node failure – 4 nodes • Scaling of the test system (i.e., scenarios 1 and 2 on a larger network and system) with 16 nodes

  14. Initial test results

  15. Multi-node tests • Measure performance as the problem and network scale up • Speedup measured with and without migration • One node failure for each test

  16. Live Migration vs. Stop & Copy • Comparison of wall-clock execution time

  17. Conclusions • Node failures can be reasonably predicted from health statistics • Restarts are avoided • Larger problem sizes do not necessarily increase migration overhead • Live migration's overhead is larger than the stop&copy scheme's, but it is faster overall because the application continues to execute during the migration • Live migration helps hide the cost of relocating a guest OS and its associated MPI task • Checkpoint frequency can be reduced

  18. Questions?
