CS Research Group. Counting on Failure 10, 9, 8, 7,…,3, 2, 1. Al Geist Computer Science and Mathematics Division Oak Ridge National Laboratory. September 12, 2006 CCGSC Conference Flat Rock, North Carolina. Research sponsored by DOE Office of Science.
Rapid growth in scale drives the need for fault tolerance
E.g. ORNL Leadership Computing Facility hardware roadmap (systems: Cray XT3, Cray XT4, Cray Baker)
• Jul 2006 (today): 54 TF (56 cabinets), 5,294 nodes, 10,588 proc, 21 TB
• Nov 2006: 100 TF (+68 cab), 11,706 nodes, 23,412 proc, 46 TB
• Dec 2007: 250 TF (68 quad), 11,706 nodes, 36,004 proc, 71 TB
• Nov 2008: 1 PF (136 new cab), 24,576 nodes, 98,304 proc, 175 TB
[Chart axis: 25 TF, 54 TF, 100 TF, 250 TF, 1 PF]
20X scale change in 2 ½ years
10 nodes a day
“Estimated” failure rate for a 1 Petaflop system (ORNL 1 PF Cray “Baker” system, 2008)
Est: one every day or two today, times 20
• With 25,000 nodes this is a tiny fraction (0.0004) of the whole system
• The RAS system automatically configures around faults – up for days
• But every one of these failures kills the application that was using that node!
Today’s applications and their runtime libraries may scale, but they are not prepared for the failure rates of these systems
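The arithmetic behind this slide can be checked with a few lines of Python. This is only a back-of-envelope sketch: the constant and function names are made up for illustration, and the inputs are the slide's own rounded figures.

```python
# Back-of-envelope check of the slide's estimate (names are illustrative).
TODAY_FAILURES_PER_DAY = (0.5, 1.0)  # "one every day or two"
SCALE_FACTOR = 20                    # "today, times 20"
NODES_1PF = 25_000                   # rounded node count used on the slide


def scaled_failures_per_day(rate_today, scale=SCALE_FACTOR):
    """Scale today's failure rate linearly with system size."""
    return rate_today * scale


low, high = (scaled_failures_per_day(r) for r in TODAY_FAILURES_PER_DAY)
fraction_lost = low / NODES_1PF  # share of the machine lost per day

print(f"{low:.0f} to {high:.0f} node failures per day")
print(f"fraction of nodes lost per day (low end): {fraction_lost:.4f}")
```

The low end of the range reproduces the "10 nodes a day" headline and the 0.0004 fraction quoted in the first bullet.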
’09 The End of Fault Tolerance as We Know It
The point where checkpointing ceases to be viable
• Time to checkpoint grows larger as problem size increases
• MTTI grows smaller as the number of parts increases
• Crossover point (chart time axis starts at 2006): 2009 is a guess
Good news: the MTTI is better than expected for LLNL BG/L and ORNL XT3 a/b: 6-7 days, not minutes
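The two opposing trends can be illustrated with a toy model. The sketch below assumes independent node failures, so the system MTTI (mean time to interrupt) is the per-node MTBF divided by the node count, and assumes a checkpoint writes all of memory at a fixed aggregate I/O bandwidth. The per-node MTBF and the bandwidth are invented placeholders, not measured values; only the node counts and memory sizes come from the roadmap slide.

```python
# Toy model only: both constants below are assumptions, not data from the talk.
ASSUMED_NODE_MTBF_HOURS = 800_000.0  # picked so today's system lands near 6-7 days
ASSUMED_IO_GB_PER_S = 10.0           # hypothetical aggregate file-system bandwidth


def system_mtti_hours(node_mtbf_hours, nodes):
    """System MTTI if nodes fail independently at the same rate."""
    return node_mtbf_hours / nodes


def checkpoint_hours(memory_tb, io_gb_per_s):
    """Time to write all of memory to disk at a fixed bandwidth."""
    return memory_tb * 1000.0 / io_gb_per_s / 3600.0


def mtti_to_checkpoint_ratio(node_mtbf_hours, nodes, memory_tb, io_gb_per_s):
    """Ratio > 1 means a checkpoint fits between interrupts; 1 is the crossover."""
    return system_mtti_hours(node_mtbf_hours, nodes) / checkpoint_hours(
        memory_tb, io_gb_per_s
    )


# Node counts and memory sizes from the roadmap: today's system and the 1 PF system.
ratio_today = mtti_to_checkpoint_ratio(
    ASSUMED_NODE_MTBF_HOURS, 5294, 21, ASSUMED_IO_GB_PER_S
)
ratio_1pf = mtti_to_checkpoint_ratio(
    ASSUMED_NODE_MTBF_HOURS, 24576, 175, ASSUMED_IO_GB_PER_S
)
print(f"MTTI / checkpoint time today: {ratio_today:.1f}")
print(f"MTTI / checkpoint time at 1 PF: {ratio_1pf:.1f}")
```

With these placeholder inputs the ratio falls sharply between the two systems, which is the trend the slide describes. Where it actually reaches 1 depends entirely on the real MTBF and I/O bandwidth.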
8 Strategies for an application to handle a fault
• Restart from checkpoint file [large apps today]
• Restart from diskless checkpoint [avoids stressing the I/O system and causing more faults]
• Recalculate lost data from in-memory RAID
• Lossy recalculation of lost data [for iterative methods]
• Recalculate lost data from initial and remaining data
• Replicate computation across system
• Reassign lost work to another resource
• Use natural fault tolerant algorithms
[Side labels on the slide: store chkpt in memory; some state saved; no chkpt]
Need to develop a rich methodology to “run through” faults
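As one illustration of the "in-memory RAID" strategy, the sketch below keeps an XOR parity block over equal-sized per-node data blocks and rebuilds a single lost block from the survivors. It is a minimal single-process model with hypothetical names. A real diskless checkpoint would spread the blocks and parity across nodes, and the talk does not specify which encoding was used.

```python
from functools import reduce


def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def parity_of(blocks):
    """Parity block: XOR of every node's data block."""
    return reduce(xor_blocks, blocks)


def recover_lost(surviving_blocks, parity):
    """Rebuild the one missing block from the parity and the survivors."""
    return reduce(xor_blocks, surviving_blocks, parity)


# Hypothetical per-node state, all blocks the same length.
node_data = [b"node0 st", b"node1 st", b"node2 st", b"node3 st"]
parity = parity_of(node_data)

lost = 2  # pretend node 2 failed
survivors = [blk for i, blk in enumerate(node_data) if i != lost]
rebuilt = recover_lost(survivors, parity)
print(rebuilt == node_data[lost])
```

A single parity block tolerates one lost block per group; surviving several simultaneous failures needs more redundancy.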
8 (cont) Natural fault tolerant algorithms
Demonstrated that scale invariance and natural fault tolerance can exist for local and global algorithms where 100 failures happen across 100,000 processes
• Finite Difference (Christian Engelman)
  • Demonstrated natural fault tolerance w/ chaotic relaxation, meshless, finite difference solution of Laplace and Poisson problems
• Global information (Kasidit Chancio)
  • Demonstrated natural fault tolerance in global max problem w/ random, directed graphs
• Gridless Multigrid (Ryan Adams)
  • Combines the fast convergence of multigrid with the natural fault tolerance property. Hierarchical implementation of the finite difference method above.
  • Three different asynchronous updates explored
• Theoretical analysis (Jeffery Chen)
[Figure labels: local, global]
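The global max example can be mimicked with a small gossip simulation. This is a sketch under stated assumptions, not the group's actual algorithm: nodes fail before the run starts, each surviving node pulls from a few randomly chosen peers per round (a random directed graph), and failed peers never answer. The function name and parameters are hypothetical, and the convergence check uses global knowledge that a real run would not have.

```python
import random


def gossip_max(values, failed, peers_per_round=3, max_rounds=100, seed=0):
    """Simulate pull-gossip of the maximum among surviving nodes.

    Returns (final state of surviving nodes, number of rounds run).
    """
    rng = random.Random(seed)
    n = len(values)
    alive = [i for i in range(n) if i not in failed]
    state = {i: values[i] for i in alive}
    target = max(state.values())  # simulation-only convergence check
    for round_no in range(1, max_rounds + 1):
        snapshot = dict(state)  # every node reads last round's values
        for i in alive:
            for j in rng.sample(range(n), peers_per_round):
                if j in snapshot:  # failed peers never answer
                    state[i] = max(state[i], snapshot[j])
        if all(v == target for v in state.values()):
            return state, round_no
    return state, max_rounds


n = 1000
values = list(range(n))
# Ten random failures plus the node that held the overall maximum.
failed = set(random.Random(1).sample(range(n), 10)) | {n - 1}
final_state, rounds = gossip_max(values, failed)
expected = max(v for i, v in enumerate(values) if i not in failed)
print(all(v == expected for v in final_state.values()), rounds)
```

No node waits on a specific peer, so a dead node only removes one source of information. The survivors still agree on the maximum of the surviving values, without any checkpoint or restart.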
7 / 24 System can’t ignore faults
• The file system can’t let data be corrupted by faults.
  • I/O nodes must recover and cover failures
• The heterogeneous OS must be able to tolerate failures of any of its node types and instances.
  • For example, a failed service node shouldn’t take out a bunch of compute nodes.
• The schedulers and other system components must be aware of the dynamically changing system configuration
  • So that tasks get assigned around failed components
Harness P2P control research:
• Fast recovery from fault
• Parallel recovery from multiple node failures
• Support simultaneous updates
6 Options for the system to handle jobs
• Restart – from checkpoint or from beginning
• Notify application and let it handle the problem
• Migrate task to other hardware before failure
• Reassignment of work to spare processor(s)
• Replication of tasks across machine
• Ignore the fault altogether
What to do?
Need a mechanism for each application (or component) to specify to the system what to do if a fault occurs
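One possible shape for such a mechanism is a per-job policy table that the system consults when a fault hits. The sketch below is purely hypothetical: the class and method names are invented for illustration and do not correspond to any existing scheduler or runtime API.

```python
from enum import Enum


class FaultAction(Enum):
    """The six options from the slide."""
    RESTART = "restart"
    NOTIFY = "notify"
    MIGRATE = "migrate"
    REASSIGN = "reassign"
    REPLICATE = "replicate"
    IGNORE = "ignore"


class FaultPolicyRegistry:
    """Hypothetical table mapping each job to the action it asked for."""

    def __init__(self, default=FaultAction.RESTART):
        self._default = default
        self._policies = {}

    def register(self, job_id, action):
        if not isinstance(action, FaultAction):
            raise TypeError("action must be a FaultAction")
        self._policies[job_id] = action

    def action_for(self, job_id):
        """Action the system should take when a fault hits this job."""
        return self._policies.get(job_id, self._default)


registry = FaultPolicyRegistry()
registry.register("climate-run", FaultAction.NOTIFY)    # app handles faults itself
registry.register("gossip-solver", FaultAction.IGNORE)  # naturally fault tolerant
print(registry.action_for("climate-run").value)
print(registry.action_for("unregistered-job").value)
```

Jobs that register nothing fall back to the default, which here is a restart, matching what large applications do today.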
5 recovery modes for MPI applications
The Harness project’s FT-MPI explored 5 modes of recovery. They affect the size (extent) and ordering of the communicators.
• ABORT: just do as vendor implementations do
• BLANK: leave holes
  • But make sure collectives do the right thing afterwards
• SHRINK: re-order processes to make a contiguous communicator
  • Some ranks change
• REBUILD: re-spawn lost processes and add them to MPI_COMM_WORLD
• REBUILD_ALL: same as REBUILD except that it rebuilds all communicators and groups and resets all key values, etc.
May be time to consider an MPI-3 standard that allows applications to recover from faults
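The effect of each mode on a communicator's size and rank ordering can be shown with a toy model in plain Python. This does not call FT-MPI or any MPI library. A communicator is modelled as a list of process names indexed by rank, and REBUILD is assumed to place replacement processes in the vacated rank slots. REBUILD_ALL is left out because the model has only one communicator.

```python
def recover(ranks, failed, mode):
    """Return the rank list after recovery in the given mode (toy model).

    ranks  -- list of process names, indexed by rank
    failed -- set of process names that died
    """
    if mode == "ABORT":
        raise RuntimeError("job aborted")
    if mode == "BLANK":
        # Keep the size; dead ranks become holes.
        return [None if p in failed else p for p in ranks]
    if mode == "SHRINK":
        # Close the gaps; ranks above a hole shift down.
        return [p for p in ranks if p not in failed]
    if mode == "REBUILD":
        # Assumption: a re-spawned replacement takes the vacated slot.
        return [p + "'" if p in failed else p for p in ranks]
    raise ValueError(f"unknown mode: {mode}")


ranks = ["p0", "p1", "p2", "p3"]
failed = {"p1"}
blank = recover(ranks, failed, "BLANK")
shrunk = recover(ranks, failed, "SHRINK")
rebuilt = recover(ranks, failed, "REBUILD")
print(blank)
print(shrunk)
print(rebuilt)
```

In the SHRINK result, process p2 moves from rank 2 to rank 1, which is the "some ranks change" caveat. BLANK keeps every surviving rank stable at the cost of a hole that collectives must skip.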
4 Ways to Fail Anyway
• Validation of the answer on such large systems
  • Fault may not be detected
  • Recovery introduces perturbations
  • Result may depend on which nodes fail
  • Result looks reasonable but is actually wrong
“I’ll just keep running the job till I get the answer I want”
Can’t afford to run every job three (or more) times: yearly allocations are like $5M-$10M grants
3 Steps to Fault Tolerance
There are three main steps in fault tolerance:
• Detection that something has gone wrong
  • System – detection in hardware
  • Framework – detection by runtime environment
  • Library – detection in math or communication library
• Notification of the application, runtime, system components
  • Interrupt – signal sent to job or system component
  • Error code returned by application routine
• Recovery of the application from the fault
  • By the system
  • By the application
  • Neither – natural fault tolerance
[Figure labels: subscription notification; Ace repair staff]
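The three steps can be wired together in a few lines. The sketch below is an illustrative model only. It assumes detection by heartbeat timeout and notification by subscription, with recovery left to whatever handler a component registers. All names are hypothetical.

```python
from collections import defaultdict


class FaultEventBus:
    """Minimal subscription-based notification (hypothetical)."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._subscribers[event_type].append(handler)

    def publish(self, event_type, detail):
        """Notify every subscriber and collect what each handler did."""
        return [handler(detail) for handler in self._subscribers[event_type]]


def detect_failed_nodes(heartbeats, now, timeout):
    """Detection: nodes whose last heartbeat is older than the timeout."""
    return sorted(
        node for node, last_seen in heartbeats.items() if now - last_seen > timeout
    )


bus = FaultEventBus()
# Recovery: the application and the scheduler each react in their own way.
bus.subscribe("node_failed", lambda node: f"app: recompute data held on {node}")
bus.subscribe("node_failed", lambda node: f"scheduler: stop assigning to {node}")

heartbeats = {"n0": 100.0, "n1": 40.0, "n2": 98.0}  # last heartbeat times (s)
failed_nodes = detect_failed_nodes(heartbeats, now=105.0, timeout=30.0)
actions = [msg for node in failed_nodes for msg in bus.publish("node_failed", node)]
print(failed_nodes)
print(actions)
```

Keeping detection, notification and recovery separate means a new detector (hardware, runtime or library) or a new recovery handler can be added without touching the other two steps.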
2 Reasons the problem is only going to get worse
The drive for large-scale simulations in biology, nanotechnology, medicine, chemistry, materials, etc.
• Requires much larger problems (Space)
  • Easily consume the 2 GB per core in ORNL LCF systems
• Requires much longer to run (Time)
  • Science teams in climate, combustion, and fusion want to run for a couple of months dedicated
From a fault tolerance perspective:
• Space means the job ‘state’ to be recovered is huge
• Time means that many faults will occur during a single run
1 Holistic Solution
We need coordinated fault awareness, prediction and recovery across the entire HPC system, from the application to the hardware.
“Prediction and prevention are critical because the best fault is the one that never happens”
[Diagram: a Fault Tolerance Backplane linking the layers Applications, Middleware, Operating System and Hardware; functions Detection, Notification and Recovery; components Monitor, Event Manager, Recovery Services, Logger, Prediction & Prevention, Autonomic Actions, Configuration]
CIFTS project underway at ANL, ORNL, LBL, UTK, IU, OSU
Thanks. Questions?