
Presentation Transcript


  1. HA-OSCAR: unleashing HA Beowulf • Research Efforts toward Non-Stop Services in High End and Enterprise Computing • Box Leangsuksun, • Associate Professor, Computer Science • Director, eXtreme Computing Research (XCR)

  2. Research Collaborators • National, Academic and Industry Labs • ORNL • Intel, Dell, Ericsson • Lucent, CRAY • IU, NCSA, OSU, NCSU, UNM, TTU • Systran • OSDL (Linus is here) • ANL, LLNL

  3. Service Unavailability Impacts • No performance and no functionality during an outage • Losses of $195K - $58M per 3.5 hours of downtime (Meta Group report, 2000) (enterprise) • Enterprise/shared major computing resources must run 24/7/365 (enterprise/HPC-HEC) • Critical HPC applications such as national security (homeland defense) (HPC-HEC) • Service provider regulations/mandates, e.g. the FCC mandate (Class 5 local switch = five 9's availability) • Lost time and opportunities • Life-threatening consequences

  4. RASS Definitions • Reliability (MTTF) • How soon does the system fail? • Availability • What is the total uptime? • Availability = MTTF / (MTTF + MTTR) • Serviceability • How quickly can the system be built, managed, and upgraded? • Planned outages make up 60% of total outages • Security also impacts availability
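As a quick illustration of the availability formula above, the short Python sketch below computes availability and yearly downtime from MTTF and MTTR; the 5000-hour MTTF and 2-hour MTTR figures are hypothetical, chosen only to show the arithmetic.

# Availability from the slide's formula: A = MTTF / (MTTF + MTTR).
# The MTTF/MTTR values below are hypothetical, for illustration only.

def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability given mean time to failure and mean time to repair."""
    return mttf_hours / (mttf_hours + mttr_hours)

if __name__ == "__main__":
    mttf = 5000.0   # hours between failures (hypothetical)
    mttr = 2.0      # hours to repair (hypothetical)
    a = availability(mttf, mttr)
    downtime_min_per_year = (1.0 - a) * 365 * 24 * 60
    print(f"Availability: {a:.5%}")                       # about 99.96%
    print(f"Downtime: {downtime_min_per_year:.0f} min/year")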

  5. HA-OSCAR: unleashing HA Beowulf • High Availability Open Source Cluster Application Resources (HA-OSCAR)

  6. HA-OSCAR overview • Production-quality open source Linux cluster project • Combines HA and HPC clustering techniques to enable critical HPC infrastructure • Self-configuring, multi-head Beowulf system • HA-enabled HPC services: active/hot standby • Self-healing with 3-5 second automatic failover time • The first known field-grade open source HA Beowulf cluster release

  7. Monitoring & Self-healing cores • Self-healing daemon • Service monitor: PBS, Maui, NFS, and HTTP services are monitored • Health channel monitor: eth0 and eth0:1 interfaces are monitored • Resource monitor: load_average, disk_usage, and free_memory are monitored

  8. Monitoring and recovery • Enhancements based on the kernel.org MON, IPMI, and net-SNMP frameworks • Recovery • Associative response • Local recovery, e.g. restart, checkpoint • Failover (simple or impersonate/clone) • Admin-defined actions • Adaptive response • Based on previous state and number of retries • Acceleration (time-series) • E.g. if Maui dies, restart it; after 3 retries within 3 minutes, fail over (see the sketch below)
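The sketch below illustrates the adaptive response described on this slide: restart a failed service locally, but escalate to failover after 3 failures within a 3-minute window. The ServiceRecovery class and its restart/failover callables are illustrative stand-ins, not the actual HA-OSCAR daemon code.

import time
from collections import deque

# Illustrative sketch of the adaptive response above: restart a failed
# service locally, but if it fails 3 times within 3 minutes, escalate to
# failover.  Not the actual HA-OSCAR implementation.

MAX_RETRIES = 3
WINDOW_SECONDS = 3 * 60

class ServiceRecovery:
    def __init__(self, name, restart, failover):
        self.name = name
        self.restart = restart      # callable: local restart action
        self.failover = failover    # callable: fail over to the standby head
        self.failures = deque()     # timestamps of recent failures

    def on_failure(self):
        now = time.time()
        self.failures.append(now)
        # Keep only failures inside the sliding time window.
        while self.failures and now - self.failures[0] > WINDOW_SECONDS:
            self.failures.popleft()
        if len(self.failures) >= MAX_RETRIES:
            self.failover(self.name)   # repeated failures: escalate
        else:
            self.restart(self.name)    # transient failure: restart locally

A monitor loop would call on_failure() each time the service monitor reports, for example, that Maui has died.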

  9. HA-OSCAR appeared on the front cover of two major Linux magazines and in various technical papers and research exhibitions • Web site: http://xcr.cenit.latech.edu/ha-oscar • The HA-OSCAR beta was released to the open source community in March 2004

  10. On-going R&D work (lab-grade enhancements)

  11. Reliability Modeling for Dummies

  12. UML-based Approach • UML representation of system architecture -> XMI representation with embedded dependability information -> Extracting dependability parameters and building a logical representation -> Semantic mapping and dependability modeling -> Results showing reliability and availability of the system

  13. An example of UML tools

  14. Examples in UML diagrams

  15. Example of an HA-OSCAR reliability block diagram description:

<RELIABILITY BLOCK DIAGRAM>
  <component> <name> Node1 </name>   <lambda> 3.4E-5 </lambda> <mu> 2.0E-5 </mu> </component>
  <component> <name> Node2 </name>   <lambda> 8.6E-5 </lambda> <mu> 0.0012 </mu> </component>
  <component> <name> Switch1 </name> <lambda> 1.0E-5 </lambda> <mu> 2.0E-4 </mu> </component>
  <component> <name> Switch2 </name> <lambda> 1.3E-5 </lambda> <mu> 2.1E-4 </mu> </component>
  <component> <name> Client4 </name> <lambda> 3.5E-5 </lambda> <mu> 2.1E-4 </mu> </component>
  <Series id=0>  Node1 Switch1 Client1 </Series>
  <Series id=1>  Node1 Switch2 Client1 </Series>
  <Series id=2>  Node1 Switch1 Client2 </Series>
  <Series id=3>  Node1 Switch2 Client2 </Series>
  <Series id=4>  Node1 Switch1 Client3 </Series>
  <Series id=5>  Node1 Switch2 Client3 </Series>
  <Series id=6>  Node1 Switch1 Client4 </Series>
  <Series id=7>  Node1 Switch2 Client4 </Series>
  <Series id=8>  Node2 Switch1 Client1 </Series>
  <Series id=9>  Node2 Switch2 Client1 </Series>
  <Series id=10> Node2 Switch1 Client2 </Series>
  <Series id=11> Node2 Switch2 Client2 </Series>
  <Series id=12> Node2 Switch1 Client3 </Series>
  <Series id=13> Node2 Switch2 Client3 </Series>
  <Series id=14> Node2 Switch1 Client4 </Series>
  <Series id=15> Node2 Switch2 Client4 </Series>
  <Parallel> id=0 id=1 id=2 id=3 id=4 id=5 id=6 id=7 id=8 id=9 id=10 id=11 id=12 id=13 id=14 id=15 </Parallel>
  <System Unreliability> 9.211E-02 </System Unreliability>
  <Mean Time to Failure> <days> 331 </days> </Mean Time to Failure>
  <System Instantaneous Availability per year> 99.997 </System Instantaneous Availability per year>
  <System DownTime per year> <min> 11 </min> </System DownTime per year>
</RELIABILITY BLOCK DIAGRAM>
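One way to read a block diagram like this: each component's steady-state availability is mu / (lambda + mu), a series path is up only if all of its members are up, and the parallel combination fails only if every path fails. The Python sketch below applies that standard textbook combination to two of the sixteen paths, using the lambda/mu values from the XML above; it is a hedged illustration of the math, not the modeling tool's actual algorithm.

# Sketch of evaluating a small reliability block diagram (RBD) with
# steady-state availabilities.  lambda/mu values are taken from the XML
# above; the series/parallel math is the standard textbook combination,
# not necessarily the tool's exact algorithm.

RATES = {                      # name -> (lambda = failure rate, mu = repair rate)
    "Node1":   (3.4e-5, 2.0e-5),
    "Node2":   (8.6e-5, 0.0012),
    "Switch1": (1.0e-5, 2.0e-4),
    "Switch2": (1.3e-5, 2.1e-4),
    "Client4": (3.5e-5, 2.1e-4),
}

def availability(name: str) -> float:
    lam, mu = RATES[name]
    return mu / (lam + mu)     # steady-state availability of one component

def series(names) -> float:
    a = 1.0
    for n in names:            # a series path is up only if all members are up
        a *= availability(n)
    return a

def parallel(paths) -> float:
    u = 1.0
    for p in paths:            # the parallel block fails only if every path fails
        u *= (1.0 - series(p))
    return 1.0 - u

if __name__ == "__main__":
    # Two of the sixteen paths from the diagram, just to show the combination.
    paths = [["Node1", "Switch1", "Client4"], ["Node2", "Switch2", "Client4"]]
    print(f"Availability of the two-path subsystem: {parallel(paths):.6f}")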

  16. Policy-based Fault Prediction, Hardware Management abstraction

  17. Policy-based Fault Prediction, Hardware Management abstraction

  18. Hardware Management abstraction • Ability to access and control detailed status for better management (CPU temperature, baseboard, power status, system ID, up/down, etc.) • IPMI (Intelligent Platform Management Interface) • OpenIPMI and OpenHPI (SA Forum) • The HW abstraction hides vendor-specific details for: • CPU • Power • Memory • Baseboard • Fan (cooling)
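As a rough illustration of the kind of data this layer exposes, the sketch below shells out to the common ipmitool utility and collects sensor readings (CPU temperature, fans, power). HA-OSCAR builds on the OpenIPMI/OpenHPI libraries rather than this exact call, so treat the command and the parsing as assumptions; running it requires ipmitool and a reachable BMC.

import subprocess

# Illustrative only: query local IPMI sensors via the ipmitool CLI and
# report current readings (CPU temp, fans, power, etc.).  HA-OSCAR uses
# the OpenIPMI/OpenHPI libraries; this shell-out is just a simple way to
# show the data the hardware-management layer works with.

def read_sensors():
    out = subprocess.run(["ipmitool", "sensor"], capture_output=True,
                         text=True, check=True).stdout
    sensors = {}
    for line in out.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) >= 2 and fields[1] not in ("", "na"):
            sensors[fields[0]] = fields[1]     # sensor name -> current reading
    return sensors

if __name__ == "__main__":
    for name, value in read_sensors().items():
        print(f"{name}: {value}")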

  19. Our early observations • Thresholds can be set in managed elements to trigger events with severity levels • Automatic failure trend analysis -> prediction, e.g.:
01/25/2004 | 00:31:19 | Sys Fan 1 | critical
01/25/2004 | 00:31:19 | Sys Fan 3 | critical
01/25/2004 | 00:31:19 | Sys Fan 4 | critical
01/25/2004 | 00:31:19 | Processor 1 Fan | ok
01/25/2004 | 00:31:20 | Processor 2 Fan | ok
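Assuming events arrive in the pipe-separated "date | time | sensor | severity" form shown above, the sketch below counts critical events per sensor and flags candidates for predictive action; the real trend analysis is more involved than this.

from collections import Counter

# Assume SEL-style records like the slide's:
#   "01/25/2004 | 00:31:19 | Sys Fan 1 | critical"
# Count critical events per sensor and flag any sensor that crosses a
# threshold as a candidate for predictive action.  Illustrative only.

SAMPLE_EVENTS = [
    "01/25/2004 | 00:31:19 | Sys Fan 1 | critical",
    "01/25/2004 | 00:31:19 | Sys Fan 3 | critical",
    "01/25/2004 | 00:31:19 | Sys Fan 4 | critical",
    "01/25/2004 | 00:31:19 | Processor 1 Fan | ok",
    "01/25/2004 | 00:31:20 | Processor 2 Fan | ok",
]

def flag_trends(events, threshold=1):
    criticals = Counter()
    for line in events:
        date, time_, sensor, severity = [f.strip() for f in line.split("|")]
        if severity == "critical":
            criticals[sensor] += 1
    return [s for s, n in criticals.items() if n >= threshold]

if __name__ == "__main__":
    print("Sensors trending toward failure:", flag_trends(SAMPLE_EVENTS))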

  20. A failure prediction & policy-based recovery for cluster management • Detection - by then, the damage is already done • Prediction • Trend analysis • Anticipate imminent failures for better handling • Correlating multiple events/nodes is more difficult • Examples of IPMI events and trend analysis • E.g. CPU temperature rising too fast within 5 minutes -> prepare to checkpoint, fail over, and restart • Memory bit error detected -> take the node out (see the policy sketch below)
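A compact sketch of the policy idea: predicted or detected conditions map to recovery actions. The condition names and actions mirror the slide's two examples; the action functions and node names are placeholders, and real policies are administrator-defined.

# Illustrative policy table mapping predicted/detected conditions to
# recovery actions, mirroring the slide's examples.  The action functions
# are placeholders standing in for real cluster-management operations.

def checkpoint_failover_restart(node):
    print(f"{node}: checkpoint jobs, fail over services, restart node")

def drain_node(node):
    print(f"{node}: take node out of the schedulable pool")

POLICY = {
    "cpu_temp_rising_fast": checkpoint_failover_restart,  # e.g. rapid rise within 5 min
    "memory_bit_error":     drain_node,
}

def handle_event(condition, node):
    action = POLICY.get(condition)
    if action:
        action(node)

if __name__ == "__main__":
    handle_event("cpu_temp_rising_fast", "node07")   # hypothetical node names
    handle_event("memory_bit_error", "node12")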

  21. HA-OSCAR monitoring, fault prediction, and recovery restructuring

  22. Cluster Power Management (IPMI)

  23. Reliability-aware Runtime

  24. Reliability-Aware Runtime • The programming paradigm and scalability impact reliability, especially in HPC environments • "AND survivability" analysis: with 10, 100, or 1000 nodes, all nodes have to survive • With each node's MTTF at 5000 hours: • N=10, MTTF = 492.424242 hours • N=100, MTTF = 49.9902931 hours • N=1000, MTTF = 4.99999003 hours • N=10000, MTTF = ½ hour • Reliability and availability information enables better job execution (checkpointing, resource management)

  25. MTTF 1000-5000 • E.g. with a nodal failure rate of 2 per year: • N=10, MTTF = 492.424242 • N=100, MTTF = 49.9902931 • N=1000, MTTF = 4.99999003 • The more processors (and the faster they are), the higher the system failure rate (see the sketch below)
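Under the common assumption of independent, exponentially distributed node failures, a system that needs all N nodes up has an MTTF roughly equal to the per-node MTTF divided by N. The sketch below shows that approximation for a 5000-hour node MTTF; it yields 500, 50, 5, and 0.5 hours, close to the slide's figures, which presumably come from a more detailed model.

# Approximate "AND survivability": if every one of N nodes must stay up
# and node failures are independent and exponential with the same rate,
# the system failure rate is N times the node rate, so
#     MTTF_system ~ MTTF_node / N.
# A standard approximation, not necessarily the exact model behind the
# slide's figures.

NODE_MTTF_HOURS = 5000.0

for n in (10, 100, 1000, 10000):
    print(f"N={n:>5}: system MTTF ~ {NODE_MTTF_HOURS / n:10.3f} hours")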

  26. Reliability-aware Checkpointing • Considers scalability vs. reliability in the runtime • MTTF vs. application execution time • HA-OSCAR monitoring -> failure prediction and detection • System-initiated (transparent) and reliability-aware checkpointing in MPI environments • Developed smart checkpointing based on the above: reduces unnecessary overhead while remaining reliability-aware • Detailed reports appeared in HAPCW 2004 and were submitted to IEEE Cluster 2005
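One widely used way to make checkpointing reliability-aware is to derive the checkpoint interval from the system MTTF and the checkpoint cost, e.g. Young's approximation T ≈ sqrt(2 · C · MTTF): shorter MTTF means more frequent checkpoints. The sketch below is offered as a representative technique under that assumption, not as the specific method in the HAPCW 2004 report.

import math

# Young's approximation for the (near-)optimal checkpoint interval:
#     T_opt ~ sqrt(2 * checkpoint_cost * MTTF)
# Shorter system MTTF (larger clusters) -> checkpoint more often.
# A representative reliability-aware heuristic, not HA-OSCAR's exact method.

def checkpoint_interval(mttf_hours: float, checkpoint_cost_hours: float) -> float:
    return math.sqrt(2.0 * checkpoint_cost_hours * mttf_hours)

if __name__ == "__main__":
    cost = 0.1  # hours to write one checkpoint (hypothetical)
    for mttf in (500.0, 50.0, 5.0):   # approximate system MTTFs from the previous slide
        print(f"MTTF={mttf:6.1f} h -> checkpoint every "
              f"{checkpoint_interval(mttf, cost):5.2f} h")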

  27. Federated System Architecture (DOE fastOS)

  28. Summary • Problems in large-scale computing are similar to those in wireless sensor networks • Computing node = sensor node (SN) • Head node = gateway • Reliability issues are similar • Depends on the application • Self-configuration, self-awareness, self-healing • Routing algorithm = location-aware
