
Fundamentals of Fault-Tolerant Distributed Computing In Asynchronous Environments

Presentation Transcript


  1. Fundamentals of Fault-Tolerant Distributed Computing In Asynchronous Environments • Paper by Felix C. Gärtner • Graeme Coakley • COEN 317 • November 23, 2003

  2. Presentation Overview • Introduction • Terminology • Formal View of Fault Tolerance • Four Types of Fault Tolerance • Redundancy as the Key to Fault Tolerance • Models of Computation And Their Relevance • Achieving Safety • Achieving Liveness • Conclusions

  3. Introduction • Until the early 1990s, work in fault-tolerant computing focused on specific technologies and applications. • This resulted in distinct terminologies and methodologies. • Goals: • Structure the area clearly. • Survey the fundamental building blocks.

  4. Terminology States, Configurations, and Guarded Commands • Distributed system: a finite set of processes. • Local state: the variables of each process. • State transition: defines an event (a send, receive, or internal event). • Guarded Commands: abstractly represent a local algorithm. <guard> => <command> • Configuration: the local states of all processes plus the state of the communication subsystem.

  5. process Ping
       var z : ℕ init 0
           ack : boolean init true
     begin
       ¬ack ∧ rcv(m) => ack := true; z := z + 1
       ack => snd(a); ack := false
     end

     process Pong
       var wait : boolean init true
     begin
       ¬wait => snd(m); wait := true
       wait ∧ rcv(a) => wait := false
     end
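A minimal Python sketch of how these two guarded-command processes could be simulated (illustrative only, not part of the slides; the class names, queue layout, and random interleaving are my assumptions):

   # Simulate the Ping/Pong guarded commands above. Channels are FIFO queues;
   # an action fires only when its guard holds, and the two processes are
   # interleaved at random, loosely mimicking an asynchronous scheduler.
   from collections import deque
   import random

   class Ping:
       def __init__(self, out, inbox):
           self.z, self.ack = 0, True                # var z : N init 0; ack init true
           self.out, self.inbox = out, inbox

       def step(self):
           if not self.ack and self.inbox and self.inbox[0] == "m":   # ¬ack ∧ rcv(m)
               self.inbox.popleft(); self.ack = True; self.z += 1
           elif self.ack:                                             # ack => snd(a)
               self.out.append("a"); self.ack = False

   class Pong:
       def __init__(self, out, inbox):
           self.wait = True                          # var wait : boolean init true
           self.out, self.inbox = out, inbox

       def step(self):
           if not self.wait:                                          # ¬wait => snd(m)
               self.out.append("m"); self.wait = True
           elif self.inbox and self.inbox[0] == "a":                  # wait ∧ rcv(a)
               self.inbox.popleft(); self.wait = False

   ping_to_pong, pong_to_ping = deque(), deque()
   ping = Ping(out=ping_to_pong, inbox=pong_to_ping)
   pong = Pong(out=pong_to_ping, inbox=ping_to_pong)
   for _ in range(50):
       random.choice([ping, pong]).step()
   print("messages acknowledged by Ping:", ping.z)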

  6. Terminology (continued) Defining Faults and Fault Models • Fault: may cause an error. • Error: may lead to a failure. • Failure: the system has left its correctness specification. • Models: • Crash failure, Fail-stop, and Byzantine. • Fault: can be modeled as an unwanted state transition of a process.

  7. Terminology (continued) Properties of Distributed Systems: Safety and Liveness • Safety property: some specific “bad thing” never happens within the system. • Liveness property: claims some “good thing” will eventually happen during system execution. • Problem specification: consists of a safety property and a liveness property.
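A common way to make these two notions precise (not spelled out on the slide) is linear temporal logic, where $\Box$ reads "always" and $\Diamond$ reads "eventually":

\[
\text{Safety:}\quad \Box\,\neg\mathit{Bad} \qquad\qquad \text{Liveness:}\quad \Diamond\,\mathit{Good}
\]

For example, mutual exclusion ("two processes are never in the critical section at the same time") is a safety property, while "every request is eventually granted" is a liveness property.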

  8. Formal View of Fault Tolerance Definition: • A distributed program A is said to tolerate faults from a fault class F for an invariant P iff there exists a predicate T for which the following requirements hold: • P => T • T is closed in A and T is closed in F • Starting from any state where T holds, every computation that executes actions from A alone eventually reaches a state where P holds.
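The same three requirements written more compactly (the Hoare-triple and leads-to notation below is mine, not the slide's): condition (2) says that neither the actions of A nor the faults in F can falsify T, and condition (3) says that fault-free execution converges from T back to the invariant P.

\[
(1)\ P \Rightarrow T \qquad\quad
(2)\ \{T\}\,A\,\{T\}\ \text{and}\ \{T\}\,F\,\{T\} \qquad\quad
(3)\ T \leadsto P \ \text{under the actions of } A \text{ alone}
\]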

  9. Four Types of Fault Tolerance • Masking: safety and liveness both satisfied • Fail-safe: safety satisfied, liveness not satisfied • Nonmasking: liveness satisfied, safety not satisfied • None: neither safety nor liveness satisfied

  10. Redundancy as the Key to Fault Tolerance Defining Redundancy: • A distributed program (A) is said to be redundant in space iff for all executions e of A in which no faults occur, the set of all configurations of A contains configurations that are not reached in e. • A is said to be redundant in time iff for all executions e of A in which no faults occur, the set of actions of A contains actions that are never executed in e. • A program is said to employ redundancy iff it is redundant in space or in time.

  11. Example: program with redundancy in space and in time

     process Redundancy
       var x ∈ {0, 1, 2} init 1    {* local state *}
     begin
       {* normal program actions: *}
       x = 1 => x := 2    {* 1 *}
       x = 2 => x := 1    {* 2 *}
       x = 0 => x := 1    {* 3 *}
       {* fault action: *}
       true => x := 0
     end
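A small Python sketch (mine, not from the slides) that mechanically checks the definition from slide 8 against this program; the choice of invariant P = {1, 2} and fault span T = {0, 1, 2} is my reading of the example:

   P = {1, 2}                  # invariant: the states reachable without faults (assumed)
   T = {0, 1, 2}               # candidate fault span (assumed)

   # normal program actions as (guard, effect) pairs, in slide order
   actions = [
       (lambda x: x == 1, lambda x: 2),     # {* 1 *}
       (lambda x: x == 2, lambda x: 1),     # {* 2 *}
       (lambda x: x == 0, lambda x: 1),     # {* 3 *}  the correction action
   ]
   fault = [(lambda x: True, lambda x: 0)]  # fault action: true => x := 0

   def successors(x, acts):
       nxt = {eff(x) for guard, eff in acts if guard(x)}
       return nxt or {x}                    # stutter if no action is enabled

   assert P <= T                                              # (1) P => T
   assert all(successors(x, actions) <= T for x in T)         # (2) T closed in A
   assert all(successors(x, fault) <= T for x in T)           #     T closed in F
   for x in T:                                                # (3) convergence to P
       while x not in P:                    # plain reachability suffices here, since
           x = next(iter(successors(x, actions)))   # each state enables one action
   # redundancy: fault-free execution from init x = 1 only visits {1, 2}, so
   # state 0 (redundancy in space) and action {* 3 *} (redundancy in time)
   # are exercised only after a fault.
   reachable, frontier = set(), {1}
   while frontier:
       reachable |= frontier
       frontier = {y for s in frontier for y in successors(s, actions)} - reachable
   assert reachable == {1, 2}
   print("tolerates the fault for P =", P, "; fault-free states:", reachable)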

  12. Redundancy as the Key to Fault Tolerance (continued) Claim: • If A is a nontrivial distributed program that does not employ redundancy, then A may become incorrect with respect to its correctness specification in the presence of faults. Conclusion: • While redundancy is not sufficient for fault tolerance, it is a necessary condition. • Redundancy in space is widespread.

  13. Models of Computation And Their Relevance Models of Distributed Systems • Synchronous systems: there are real-time bounds on message transmission and process response times. • Partially synchronous systems: intermediate models that assume such bounds to a varying degree. • Asynchronous systems: no bounds are assumed. • The weakest model, and a realistic one in many applications. • Every algorithm that works in this model works in all other models. • Cannot detect whether a process has crashed or not.

  14. Achieving Safety: Detection as the Basis for Safety • To ensure safety, we need to employ detection and subsequently inhibit dangerous actions. • Common detection mechanisms: parity, checksums. • Detection includes checking whether a certain predicate Q holds over the entire system. • Q is easier to specify if the type and effect of faults from F are known.
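A small illustrative Python sketch (mine, not from the slides) of detection followed by inhibition: a checksum serves as the detection predicate Q, and a "dangerous" action is only allowed to execute while Q holds, which yields fail-safe behaviour:

   # Store data together with a CRC32 checksum so that corruption is detectable.
   import zlib

   def store(value: bytes):
       return {"data": value, "crc": zlib.crc32(value)}

   def Q(cell) -> bool:
       # detection predicate: the stored data still matches its checksum
       return zlib.crc32(cell["data"]) == cell["crc"]

   def guarded_commit(cell, commit):
       # inhibit the dangerous action unless Q holds
       if Q(cell):
           commit(cell["data"])
       else:
           raise RuntimeError("detection predicate Q violated; action inhibited")

   cell = store(b"account balance = 42")
   cell["data"] = b"account balance = 999"        # a fault corrupts the stored data
   try:
       guarded_commit(cell, lambda d: print("committed:", d))
   except RuntimeError as err:
       print(err)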

  15. Achieving Safety: Detection in Distributed Settings • Deciding whether a predicate over the global state does or does not hold is not easy. • Cooper and Marzullo introduced two transformers: • Possibility(Q) is true iff there exists a consistent observation of the computation for which Q holds at some point. • Definitely(Q) is true iff for all possible consistent observations of the computation, Q holds at some point.
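A toy Python sketch of the two transformers (illustrative only; it assumes the consistent observations have already been enumerated as sequences of global states, which is the hard part that real detection algorithms solve):

   def possibility(Q, observations):
       # Possibility(Q): some observation passes through a state satisfying Q
       return any(any(Q(state) for state in obs) for obs in observations)

   def definitely(Q, observations):
       # Definitely(Q): every observation passes through a state satisfying Q
       return all(any(Q(state) for state in obs) for obs in observations)

   # toy computation: a global state is (x at p1, x at p2); two interleavings
   observations = [
       [(0, 0), (1, 0), (1, 1)],
       [(0, 0), (0, 1), (1, 1)],
   ]
   Q = lambda s: s == (1, 0)                  # "p1 has updated but p2 has not"
   print(possibility(Q, observations))        # True: the first observation sees it
   print(definitely(Q, observations))         # False: the second one never does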

  16. Achieving Safety: Adapting Consensus Algorithms • Set of processes (each process has an initial value) must all decide on a common value. • Central process acts as an observer that can construct all possible observations. • Central process scheme not very fault tolerant: • Central observer can crash • Central observer can send arbitrary messages • Solution: diffuse information among all nodes.

  17. Achieving Safety: Detecting Process Crashes • Fully Asynchronous model: impossible to detect • Chandra and Toueg proposed unreliable failure detectors to extend the asynchronous model. • The main property of failure detectors is accuracy: • Weak: failure detector will never suspect at least one correct process of having crashed. • Eventually Weak: failure detector may suspect every process at one time or another, but there is a time after which some correct process is no longer suspected.
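In practice such detectors are usually approximated with heartbeats and adaptive timeouts; the sketch below (mine, not from the slides) suspects a process whose heartbeat is late and doubles that process's timeout whenever a suspicion turns out to be wrong, so once message delays stabilise a correct process eventually stops being suspected:

   import time

   class HeartbeatDetector:
       def __init__(self, processes, initial_timeout=1.0):
           now = time.monotonic()
           self.timeout = {p: initial_timeout for p in processes}
           self.last_heartbeat = {p: now for p in processes}
           self.suspected = set()

       def on_heartbeat(self, p):
           self.last_heartbeat[p] = time.monotonic()
           if p in self.suspected:            # wrong suspicion: be more patient next time
               self.suspected.discard(p)
               self.timeout[p] *= 2

       def currently_suspected(self):
           now = time.monotonic()
           for p, last in self.last_heartbeat.items():
               if now - last > self.timeout[p]:
                   self.suspected.add(p)
           return set(self.suspected)

   fd = HeartbeatDetector({"p1", "p2", "p3"})
   fd.on_heartbeat("p2")
   print(fd.currently_suspected())            # empty until some heartbeat is late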

  18. Achieving Liveness: Correction • Liveness is tied to the notion of correction. • Correction refers to turning a bad state into a good one. • Common methods include: • retransmission, error-correcting codes, rollback recovery, rollforward recovery, etc. • On detecting a bad state via a detection predicate Q, the system must try to impose a new target predicate R.

  19. Achieving Liveness: Correction via Consensus • Correction corresponds to the decision phase of consensus algorithms. • State machine approach (Schneider) • Servers are made fault tolerant by replicating them and coordinating their behavior via consensus algorithms. • Other methods based on several forms of fault-tolerant broadcasts.

  20. Conclusions • This paper introduces a formal approach to structure the area of fault-tolerant distributed computing, survey fundamental methodologies, and discuss their relations. • This approach reveals the inherent limitations of fault-tolerance methodologies and their interactions with system models. • This paper could not integrate the entire area of fault-tolerant distributed computing. • Many topics still need further attention.

  21. Questions
