Comprehensive Report on 2002 Fault Tolerance Workshop for Scalable Systems
Summary of workshop on scalable fault tolerance addressing challenges in large COTS systems, educating researchers, and fostering collaboration between universities and laboratories.
Report on 2002 Fault Tolerance Workshop
Patricia D. Hough
Computational Sciences and Mathematics Research Department
Sandia National Laboratories
Motivation
• Large COTS systems are prone to failures
  • Lots of parts; complex configurations
  • Applications stress the systems
  • Few options for application survival
• University resources are untapped
  • DOE researchers unfamiliar with fault tolerance experts
  • University researchers unfamiliar with DOE problem domain
Goal: Bring laboratory and university researchers together to educate each other and discuss issues associated with scalable fault tolerance.
Basic Info
• June 10-11, 2002 in Albuquerque, NM
• ~40 attendees
  • Cornell, Denison, Florida, Houston, Indiana, LANL, LLNL, MSTI, SNL, Tennessee, UT Austin
  • Interest exceeded capacity
• Organized by Patty Hough (SNL), Tom Bressoud (Denison), and Lee Ward (SNL)
• Sponsored by the CSRI
Agenda
• 11 invited talks + 2 hours focused discussion on:
  • Application descriptions and needs
  • System monitoring
  • MPI fault tolerance
  • Traditional approaches with a twist
• Topics not covered
  • Checkpoint-free algorithms
  • Preventative measures
  • System services
  • Migration
  • Redistribution
  • Validation
  • Run-time environments
Conclusions
• MPI support is needed
  • Programming model needs to be considered
  • Balance research with timely delivery of capabilities
• New ideas are needed
  • Leverage hardware
  • More systematic, integrated approach
• There are still outstanding issues
  • Transparency vs. intrusiveness
  • Can traditional approaches be made scalable?
• Workshop was a great success!
For more information… http://csmr.ca.sandia.gov/projects/ftalgs.html