
Fault Tolerance Challenges and Solutions

Al Geist, Network and Cluster Computing, Computational Sciences and Mathematics Division. Research supported by the Department of Energy's Office of Science, Office of Advanced Scientific Computing Research.


Presentation Transcript


1. Fault Tolerance Challenges and Solutions. Al Geist, Network and Cluster Computing, Computational Sciences and Mathematics Division. Research supported by the Department of Energy's Office of Science, Office of Advanced Scientific Computing Research.

2. Rapid growth in scale drives fault tolerance need. Example: ORNL Leadership Computing Facility hardware roadmap, a 20X change in scale in roughly 2.5 years (Cray XT3 to Cray XT4 to Cray "Baker"):
• Jul 2006: 54 TF (56 cabinets), 5,294 nodes, 10,588 processors, 21 TB memory
• Nov 2006: 100 TF (+68 cabinets), 11,706 nodes, 23,412 processors, 46 TB memory
• Dec 2007: 250 TF (68 quad-core cabinets), 11,706 nodes, 36,004 processors, 71 TB memory
• Nov 2008: 1 PF (136 new cabinets), 24,576 nodes, 98,304 processors, 175 TB memory

3. 10 nodes a day: the "estimated" failure rate for a 1 PF system (today's rate of 1–2 node failures per day times 20), such as the ORNL 1 PF Cray "Baker" system planned for 2008. With 25,000 nodes this is a tiny fraction (0.0004) of the whole system, and the RAS system automatically configures around faults, keeping the machine up for days. But every one of these failures kills the application that was using that node! Today's applications and their runtime libraries may scale, but they are not prepared for the failure rates of these systems.

4. '09: The end of fault tolerance as we know it (the point where checkpointing ceases to be viable). [Figure: time to checkpoint, which increases with problem size, plotted against mean time to interrupt (hours), which decreases as the number of parts increases; the estimated crossover point falls around 2009.] Good news: MTTI is better than expected for LLNL BG/L and ORNL XT3 (6–7 days, not minutes).
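
A rough way to see where that crossover lands is Young's first-order approximation for the optimal checkpoint interval. The slide gives neither this formula nor any of the numbers below, so the program is an illustrative sketch layered on the slide's argument, with assumed checkpoint times and MTTI values.

    /* Sketch: estimate when periodic checkpoint/restart stops paying off.
     * Uses Young's first-order approximation for the optimal checkpoint
     * interval, T_opt = sqrt(2 * C * M), where C is the time to write one
     * checkpoint and M is the mean time to interrupt (MTTI).
     * The sample values are illustrative assumptions, not data from the
     * slide.  Compile with: cc young.c -lm
     */
    #include <math.h>
    #include <stdio.h>

    int main(void) {
        double checkpoint_hours[] = {0.25, 0.5, 1.0, 2.0};    /* time to write one checkpoint  */
        double mtti_hours[]       = {150.0, 50.0, 10.0, 2.0}; /* shrinking MTTI as systems grow */

        for (int i = 0; i < 4; i++) {
            double C = checkpoint_hours[i];
            double M = mtti_hours[i];
            double T = sqrt(2.0 * C * M);              /* optimal interval between checkpoints */
            double overhead = C / T + T / (2.0 * M);   /* approx. fraction of run time lost    */
            printf("C = %4.2f h, MTTI = %6.1f h -> interval %5.2f h, overhead ~%4.0f%%\n",
                   C, M, T, overhead * 100.0);
        }
        return 0;
    }

Once the checkpoint time C is no longer small compared with the MTTI, the estimated overhead climbs toward (and past) 100% of the run, which is exactly the regime the slide labels the end of fault tolerance as we know it.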

5. 8 strategies for an application to handle a fault. We need to develop a rich methodology to "run through" faults, ranging from full checkpoints, through schemes where only some state is saved, down to no checkpoint at all:
• Restart from checkpoint file [large apps today]
• Restart from diskless checkpoint, i.e., store the checkpoint in memory [avoids stressing the I/O system and causing more faults]
• Recalculate lost data from in-memory RAID (sketched after this slide)
• Lossy recalculation of lost data [for iterative methods]
• Recalculate lost data from initial and remaining data
• Replicate computation across the system
• Reassign lost work to another resource
• Use naturally fault-tolerant algorithms
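
One concrete reading of "recalculate lost data from in-memory RAID" is RAID-5-style XOR parity kept in RAM across nodes. The sketch below illustrates that idea with plain MPI; it is not code from Harness or from the slides, and the buffer size and choice of parity rank are assumptions.

    /* Sketch: diskless, RAID-like parity checkpoint held in memory.
     * Every rank keeps its own checkpoint buffer in RAM; one designated
     * rank additionally holds the bitwise XOR of all buffers.  If a single
     * rank is lost, its buffer can be rebuilt by XOR-ing the parity with
     * the surviving ranks' buffers, without touching the file system.
     */
    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>

    #define CKPT_BYTES (1 << 20)   /* 1 MB of checkpoint state per rank (assumed) */
    #define PARITY_RANK 0          /* rank that keeps the XOR parity block        */

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        unsigned char *ckpt   = malloc(CKPT_BYTES);  /* this rank's in-memory checkpoint */
        unsigned char *parity = malloc(CKPT_BYTES);  /* meaningful only on PARITY_RANK   */
        memset(ckpt, (unsigned char)rank, CKPT_BYTES);  /* stand-in for real state */

        /* Encode: reduce all checkpoint buffers with bitwise XOR onto the parity rank. */
        MPI_Reduce(ckpt, parity, CKPT_BYTES, MPI_BYTE, MPI_BXOR, PARITY_RANK, MPI_COMM_WORLD);

        /* Recovery (conceptual): if rank r fails, a replacement XORs the parity
         * block with the checkpoints of all surviving ranks to rebuild rank r's state. */

        free(ckpt);
        free(parity);
        MPI_Finalize();
        return 0;
    }

A real diskless checkpoint scheme would rotate the parity role and protect the parity holder itself, but the single MPI_Reduce with MPI_BXOR already captures the core idea: one extra in-memory buffer lets any single lost rank be rebuilt with no disk traffic.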

6. 7/24 system can't ignore faults: Harness P2P control research.
• Fast recovery from a fault
• Parallel recovery from multiple node failures
• Support for simultaneous updates

7. 6 options for the system to handle jobs:
• Restart, from checkpoint or from the beginning
• Notify the application and let it handle the problem
• Migrate the task to other hardware before failure
• Reassign the work to spare processor(s)
• Replicate tasks across the machine
• Ignore the fault altogether
We need a mechanism for each application (or component) to specify to the system what to do if a fault occurs.
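
The slide asks for such a mechanism without defining one, so the following is a purely hypothetical sketch of what a per-application policy registration could look like in C. The names (ft_policy_t, ft_register_policy, ft_fault_handler) are invented for illustration and do not belong to any real system.

    /* Hypothetical sketch of per-application fault policy registration.
     * Nothing here is a real system API; the enum mirrors the six options
     * on the slide, and the callback is how "notify the application"
     * might be wired up.
     */
    #include <stdio.h>

    typedef enum {
        FT_RESTART_FROM_CHECKPOINT,  /* restart from checkpoint or beginning */
        FT_NOTIFY_APPLICATION,       /* let the application handle the fault */
        FT_MIGRATE_BEFORE_FAILURE,   /* move the task off suspect hardware   */
        FT_REASSIGN_TO_SPARE,        /* reassign work to spare processor(s)  */
        FT_REPLICATE_TASKS,          /* run replicas across the machine      */
        FT_IGNORE_FAULT              /* ignore the fault altogether          */
    } ft_policy_t;

    /* Callback invoked by the runtime when FT_NOTIFY_APPLICATION is chosen. */
    typedef void (*ft_fault_handler)(int failed_node, void *user_data);

    struct ft_registration {
        ft_policy_t      policy;
        ft_fault_handler handler;    /* may be NULL unless policy is NOTIFY */
        void            *user_data;
    };

    /* Stub standing in for the system side of the interface. */
    static struct ft_registration current;
    int ft_register_policy(const struct ft_registration *reg) {
        current = *reg;
        return 0;
    }

    static void on_fault(int failed_node, void *unused) {
        (void)unused;
        printf("application notified: node %d failed, shrinking work set\n", failed_node);
    }

    int main(void) {
        struct ft_registration reg = { FT_NOTIFY_APPLICATION, on_fault, NULL };
        ft_register_policy(&reg);   /* tell the system what to do on a fault */
        return 0;
    }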

8. 5 recovery modes for MPI applications. It may be time to consider an MPI-3 standard that allows applications to recover from faults. The Harness project's FT-MPI explored 5 modes of recovery; these modes affect the size (extent) and ordering of the communicators:
• ABORT: just do as vendor implementations do
• BLANK: leave holes (but make sure collectives do the right thing afterwards)
• SHRINK: re-order processes to make a contiguous communicator (some ranks change)
• REBUILD: re-spawn lost processes and add them to MPI_COMM_WORLD
• REBUILD_ALL: same as REBUILD, except it rebuilds all communicators and groups and resets all key values, etc.
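
FT-MPI's own calls are not shown on the slide, so the fragment below sticks to standard MPI and only shows the building block all of these modes start from: replacing the default abort-on-error behavior with returned error codes so the application survives long enough to attempt recovery. The repair step itself is left as a comment, because SHRINK- or REBUILD-style communicator recovery needs an extension such as FT-MPI (or the later ULFM proposal), not plain MPI.

    /* Sketch: the standard-MPI piece of fault-aware recovery.
     * By default most MPI implementations abort the job on any error
     * (the ABORT mode above).  Installing MPI_ERRORS_RETURN lets the
     * application see the error code and decide what to do instead.
     */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        /* Ask for error codes instead of the default abort-on-error. */
        MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

        int rank, token = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int rc = MPI_Bcast(&token, 1, MPI_INT, 0, MPI_COMM_WORLD);
        if (rc != MPI_SUCCESS) {
            char msg[MPI_MAX_ERROR_STRING];
            int len;
            MPI_Error_string(rc, msg, &len);
            fprintf(stderr, "rank %d saw MPI error: %s\n", rank, msg);
            /* BLANK / SHRINK / REBUILD would go here, but repairing the
             * communicator requires FT-MPI or a ULFM-style extension;
             * plain MPI can only report and (typically) abort. */
            MPI_Abort(MPI_COMM_WORLD, rc);
        }

        MPI_Finalize();
        return 0;
    }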

9. 4 ways to fail anyway: validation of the answer on such large systems. We can't afford to run every job three (or more) times; yearly allocations are like $5M–$10M grants.
• The fault may not be detected
• Recovery introduces perturbations
• The result may depend on which nodes fail
• The result looks reasonable but is actually wrong

10. 3 steps to fault tolerance. There are three main steps in fault tolerance:
• Detection that something has gone wrong
  - System: detection in hardware
  - Framework: detection by the runtime environment
  - Library: detection in a math or communication library
• Notification of the application, runtime, and system components
  - Interrupt: signal sent to the job or a system component
  - Error code returned by an application routine
  - Subscription notification of fault events
• Recovery of the application from the fault
  - By the system
  - By the application
  - Neither: natural fault tolerance
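
Of the three notification paths, subscription notification is the least self-explanatory, so here is a tiny, purely hypothetical C illustration of it: components subscribe callbacks for fault events, and the detector publishes one event that fans out to all of them. ft_subscribe and ft_publish are invented names, not an API from any of the systems mentioned.

    /* Hypothetical sketch of subscription-based fault notification:
     * interested components subscribe a callback, and the monitor that
     * detects a fault publishes one event that reaches all of them.
     */
    #include <stdio.h>

    typedef struct {
        int  node_id;          /* which node faulted            */
        char description[64];  /* e.g. "ECC error", "node down" */
    } fault_event;

    typedef void (*fault_callback)(const fault_event *ev);

    #define MAX_SUBSCRIBERS 16
    static fault_callback subscribers[MAX_SUBSCRIBERS];
    static int n_subscribers = 0;

    int ft_subscribe(fault_callback cb) {           /* register interest */
        if (n_subscribers == MAX_SUBSCRIBERS) return -1;
        subscribers[n_subscribers++] = cb;
        return 0;
    }

    void ft_publish(const fault_event *ev) {        /* notify everyone */
        for (int i = 0; i < n_subscribers; i++)
            subscribers[i](ev);
    }

    static void runtime_on_fault(const fault_event *ev) {
        printf("runtime: migrating work off node %d (%s)\n", ev->node_id, ev->description);
    }
    static void app_on_fault(const fault_event *ev) {
        printf("application: dropping node %d from its work list\n", ev->node_id);
    }

    int main(void) {
        ft_subscribe(runtime_on_fault);
        ft_subscribe(app_on_fault);
        fault_event ev = { 42, "node down" };       /* step 1: detection (simulated) */
        ft_publish(&ev);                            /* step 2: notification          */
        /* step 3: recovery happens inside each subscriber */
        return 0;
    }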

11. 2 reasons the problem is only going to get worse. The drive for large-scale simulations in biology, nanotechnology, medicine, chemistry, materials, etc.:
• requires much larger problems (space), which easily consume the 2 GB per core in ORNL LCF systems;
• requires much longer runs (time); science teams in climate, combustion, and fusion want to run dedicated for a couple of months.
From a fault tolerance perspective, space means that the job 'state' to be recovered is huge, and time means that many faults will occur during a single run.

12. 1 holistic solution. We need coordinated fault awareness, prediction, and recovery across the entire HPC system, from the application to the hardware. The CIFTS project (under way at ANL, ORNL, LBL, UTK, IU, and OSU) is building a Fault Tolerance Backplane that spans the applications, middleware, operating system, and hardware layers and ties together detection, notification, and recovery through services such as a monitor, event manager, logger, configuration, recovery services, prediction and prevention, and autonomic actions. "Prediction and prevention are critical because the best fault is the one that never happens."

13. Contact: Al Geist, Network and Cluster Computing, Computational Sciences and Mathematics Division, (865) 574-3153, gst@ornl.gov
