ECE 753: FAULT-TOLERANT COMPUTING

Kewal K. Saluja
Department of Electrical and Computer Engineering
High-Level Fault Tolerance: Checkpointing and Recovery (more slides) – Forward Error Recovery

Overview
• Forward error recovery
• Summary

Checkpointing: Dist. Sys. (contd.)
• Checkpoint-based recovery – communication-induced
  • Place system-wide constraints on message passing (the communication pattern) and on checkpointing to guarantee that the recovery line progresses
  • For example, if within every checkpoint interval all messages received precede all messages sent, then the system is free of the domino effect. Such a message-passing system appears as follows:
[Figure: timeline diagram with two labels reading "Global checkpoint state"]
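The receive-before-send constraint above can be checked mechanically. The following is a minimal sketch, assuming each process's history is given as a list of the strings "send", "recv", and "ckpt" (an illustrative encoding, not one from the slides). The function names `interval_is_safe` and `force_checkpoints` are hypothetical. The second function shows one possible way to enforce the constraint: take a forced checkpoint before any receive that would otherwise follow a send in the same interval.

```python
def interval_is_safe(events):
    """Return True if, in every checkpoint interval, no receive follows a send."""
    sent_in_interval = False
    for ev in events:
        if ev == "ckpt":
            sent_in_interval = False      # a checkpoint starts a new interval
        elif ev == "send":
            sent_in_interval = True
        elif ev == "recv" and sent_in_interval:
            return False                  # a receive after a send: constraint violated
    return True


def force_checkpoints(events):
    """Insert a forced checkpoint before each receive that would follow a send."""
    out = []
    sent_in_interval = False
    for ev in events:
        if ev == "ckpt":
            sent_in_interval = False
        elif ev == "send":
            sent_in_interval = True
        elif ev == "recv" and sent_in_interval:
            out.append("ckpt")            # forced checkpoint closes the interval
            sent_in_interval = False
        out.append(ev)
    return out


history = ["recv", "send", "recv", "send"]
print(interval_is_safe(history))                     # constraint violated here
print(interval_is_safe(force_checkpoints(history)))  # holds after forced checkpoints
```

The trade-off visible here is the one the slides raise: the constraint is easy to enforce, but every forced checkpoint adds overhead, so the aim is to force as few as possible.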

Checkpointing: Dist. Sys. (contd.)
• Checkpoint-based recovery – communication-induced
  • A consistent state can be guaranteed (no domino effect will take place) if a checkpoint is taken before every non-deterministic event (note that this is a special case of what we saw before). The challenge here is to reduce the number of checkpoints
  • Generalizations of the previous condition (all “receives” precede “sends”) have been studied in the literature

Checkpointing: Dist. Sys. (contd.)
• Log-based recovery – pessimistic logging (contd.)
  • Example and logs
[Figure: message diagram for processes P0, P1, and P2 with messages m0 to m7 and points labeled A, B, and C]
  • Logs of
    – P0 is {m0, m4, m7}
    – P1 is {m1, m3, m6}
    – P2 is {m2, m5}
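As a rough illustration of pessimistic logging, the sketch below logs each received message to (simulated) stable storage before delivering it, then rebuilds the lost state by replaying the log. The class `LoggedProcess`, its handler, and the use of the initial state as the last checkpoint are all assumptions made for illustration. Only the log contents for P0 come from the example above.

```python
class LoggedProcess:
    """Toy process that logs every received message before acting on it."""

    def __init__(self, name):
        self.name = name
        self.stable_log = []  # stands in for stable storage (survives a crash)
        self.state = 0        # volatile state (lost on a crash)

    def _apply(self, msg):
        # Hypothetical deterministic handler: fold the message into the state.
        self.state = (self.state * 31 + sum(map(ord, msg))) % 1_000_003

    def receive(self, msg):
        self.stable_log.append(msg)  # pessimistic: write the log first ...
        self._apply(msg)             # ... and only then deliver the message

    def crash(self):
        # Volatile state is lost; we assume the last checkpoint is the initial state.
        self.state = 0

    def recover(self):
        # Replay the logged messages in their original order.
        for msg in self.stable_log:
            self._apply(msg)


p0 = LoggedProcess("P0")
for m in ["m0", "m4", "m7"]:   # P0's log from the example
    p0.receive(m)
state_before_crash = p0.state
p0.crash()
p0.recover()
print(p0.state == state_before_crash)
```

Replay reproduces the pre-crash state only because the handler is deterministic. That is the property that lets a logged process recover without forcing other processes to roll back.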

Forward error recovery
• Consider the following scheme for checkpoint-based rollback recovery
  • Two processors, P1 and P2
  • At each checkpoint, P1 and P2 compare results and save their state
  • An error detected at such a comparison causes both P1 and P2 to roll back
  • Note: with three processors, the faulty process/processor could potentially be masked
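The duplex compare-and-rollback scheme above can be sketched as a small simulation. Everything here is an assumption made for illustration: the function `run_duplex`, the `compute(state, interval)` signature, and the way a transient fault is injected into the second replica.

```python
def run_duplex(intervals, compute, fault_at=None):
    """Run two replicas, compare at each checkpoint, roll back on a mismatch.

    Returns (final_state, number_of_rollbacks).
    """
    checkpoint = 0            # last saved (agreed) state
    interval = 0
    rollbacks = 0
    pending_fault = fault_at  # interval in which replica 2 suffers a transient fault
    while interval < intervals:
        r1 = compute(checkpoint, interval)   # replica P1
        r2 = compute(checkpoint, interval)   # replica P2
        if pending_fault == interval:
            r2 += 1                          # corrupt P2's result once
            pending_fault = None
        if r1 == r2:
            checkpoint = r1                  # results agree: save the new checkpoint
            interval += 1
        else:
            rollbacks += 1                   # mismatch: both redo the interval
    return checkpoint, rollbacks


def step(state, i):
    # Hypothetical workload for one checkpoint interval.
    return state + i + 1


print(run_duplex(3, step))              # fault-free run
print(run_duplex(3, step, fault_at=1))  # one transient fault, one rollback
```

Note that the comparison only detects a disagreement. With two replicas there is no way to tell which one is wrong, so both must repeat the interval.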

Forward error recovery (contd.)
• Roll forward – two processors and a spare
[Figure: timelines of P1, P2, and the spare, with labels chk1, chk2, error, and compare]

Forward error recovery (contd.)
• Additional issues
  • If we had three processors to begin with, why not use fault masking?
    – Think of more than one job and the number of processors required: a single spare can serve as the spare for many pairs of jobs
  • What if a second fault occurs while processor P1 is continuing?
    – Use the spare for one more period
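The slides do not spell out the roll-forward steps, so the following is only one plausible reading, sketched under stated assumptions. After a mismatch at chk2, P1 and P2 both keep running from their own chk2 states while the spare recomputes the disputed interval from chk1. Whichever processor the spare agrees with is taken to be fault-free, and its forward progress is kept. The function `roll_forward` and the single `compute(state)` workload are hypothetical.

```python
def roll_forward(chk1, compute, p1_chk2, p2_chk2):
    """Use a spare to identify the faulty processor while both continue forward.

    Returns (faulty_processor, forward_state_to_keep), or (None, None) if the
    spare agrees with neither processor and a rollback to chk1 is needed.
    """
    # P1 and P2 do not wait: each continues from its own (disagreeing) chk2 state.
    p1_next = compute(p1_chk2)
    p2_next = compute(p2_chk2)
    # Meanwhile the spare recomputes the disputed interval starting from chk1.
    spare_chk2 = compute(chk1)
    if spare_chk2 == p1_chk2:
        return "P2", p1_next   # P2 was faulty; adopt P1's forward state
    if spare_chk2 == p2_chk2:
        return "P1", p2_next   # P1 was faulty; adopt P2's forward state
    return None, None          # no agreement: fall back to rollback


def work(state):
    # Hypothetical workload for one checkpoint interval.
    return state * 2 + 1


# chk1 = 3, so the correct chk2 state is 7; P2 reports 8 (faulty).
print(roll_forward(3, work, p1_chk2=7, p2_chk2=8))
```

In this sketch the forward state that is kept was computed by a single processor, so it has not yet been compared against anything. That seems to be where the suggestion above to use the spare for one more period comes in.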

Summary
• Discussed checkpointing and logging issues at length
