
Failure Detectors


Presentation Transcript


  1. Failure Detectors CS 717 Ashish Motivala Dec 6th 2001

  2. Relevant Papers • Unreliable Failure Detectors for Reliable Distributed Systems. Tushar Deepak Chandra and Sam Toueg. Journal of the ACM. • A Gossip-Style Failure Detection Service. R. van Renesse, Y. Minsky, and M. Hayden. Middleware '98. • Scalable Weakly-consistent Infection-style Process Group Membership Protocol. Ashish Motivala, Abhinandan Das, Indranil Gupta. To be submitted to DSN 2002 tomorrow. http://www.cs.cornell.edu/gupta/swim • On the Quality of Service of Failure Detectors. Wei Chen, Cornell University (with Sam Toueg, Advisor, and Marcos Aguilera, Contributing Author). DSN 2000. • Fail-Aware Failure Detectors. C. Fetzer and F. Cristian. In Proceedings of the 15th Symposium on Reliable Distributed Systems.

  3. Asynchronous vs Synchronous Model • No value to assumptions about process speed • Network can arbitrarily delay a message • But we assume that messages are sequenced and retransmitted (arbitrary numbers of times), so they eventually get through. • Failures in asynchronous model? • Usually, limited to process “crash” faults • If detectable, we call this “fail-stop” – but how to detect?

  4. Asynchronous vs Synchronous Model • Asynchronous: no value to assumptions about process speed; the network can arbitrarily delay a message; but we assume messages are sequenced and retransmitted (arbitrary numbers of times), so they eventually get through • Synchronous: assume every process runs within bounded delay, and every link has bounded delay; usually described as “synchronous rounds”

  5. Failures in Asynchronous and Synchronous Systems • Asynchronous: usually limited to process “crash” faults; if detectable, we call this “fail-stop” – but how to detect? • Synchronous: can talk about message “omission” failures (failure to send is the usual approach), but the network is assumed reliable (loss “charged” to the sender); process crash failures, as in the asynchronous setting; “Byzantine” failures: arbitrary misbehavior by processes

  6. Realistic??? • Asynchronous model is too weak: processes have no clocks (real systems have clocks, and “most” timing meets expectations… but with heavy tails) • Synchronous model is too strong (real systems lack a way to implement synchronized rounds) • Partially Synchronous Model: asynchronous network with a reliable channel • Timed Asynchronous Model: time bounds on clock drift rates and message delays [Fetzer]

  7. Impossibility Results • Consensus: all processes need to agree on a value • FLP impossibility of consensus: • A single faulty process can prevent consensus • Realistic, because a slow process is indistinguishable from a crashed one • Chandra/Toueg showed that FLP impossibility applies to many problems, not just consensus • In particular, they show that FLP applies to group membership and reliable multicast • So these practical problems are impossible in asynchronous systems • They also look at the weakest condition under which consensus can be solved

  8. Byzantine Consensus • Example: 3 processes, 1 is faulty (A, B, C) • Non-faulty processes A and B start with input 0 and 1, respectively • They exchange messages: each now has a set of inputs {0, 1, x}, where x comes from C • C sends 0 to A and 1 to B • A has {0, 1, 0} and wants to pick 0. B has {0, 1, 1} and wants to pick 1. • By definition, impossibility in this model means “xxx can’t always be done”

  9. Chandra/Toueg Idea • Theoretical Idea • Separate problem into • The consensus algorithm itself • A “failure detector:” a form of oracle that announces suspected failure • But the process can change its decision • Question: what is the weakest oracle for which consensus is always solvable?

  10. Sample properties • Completeness: detection of every crash • Strong completeness: Eventually, every process that crashes is permanently suspected by every correct process • Weak completeness: Eventually, every process that crashes is permanently suspected by some correct process

  11. Sample properties • Accuracy: does it make mistakes? • Strong accuracy: No process is suspected before it crashes. • Weak accuracy: Some correct process is never suspected • Eventual {strong/weak} accuracy: there is a time after which {strong/weak} accuracy is satisfied.

  12. A sampling of failure detectors

  13. Perfect Detector? • Named Perfect, written P • Strong completeness and strong accuracy • Immediately detects all failures • Never makes mistakes

  14. Example of a failure detector • The detector they call ◇W: “eventually weak” • More commonly written ◇W: “diamond-W” • Defined by two properties: • There is a time after which every process that crashes is suspected by some correct process {weak completeness} • There is a time after which some correct process is never suspected by any correct process {weak accuracy} • E.g. we can eventually agree upon a leader. If it crashes, we eventually, accurately detect the crash

  15. ◇W: Weakest failure detector • They show that ◇W is the weakest failure detector for which consensus is guaranteed to be achieved • Algorithm is pretty simple • Rotate a token around a ring of processes • Decision can occur once the token makes it around once without a change in failure-suspicion status for any process • Subsequently, as the token is passed, each recipient learns the decision outcome
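The rotating-token decision rule above can be sketched as a toy simulation. This is my own minimal model, not code from the paper: suspicion statuses are static, no process crashes mid-run, and the token must complete one quiet circuit before the value is decided.

```python
# Toy sketch of the rotating-token decision rule (my own model, not the
# paper's algorithm verbatim). Assumes suspicion statuses are static and
# eventually stable; with static-but-differing statuses this would not
# terminate, which mirrors why ◇W only guarantees *eventual* agreement.

def token_ring_consensus(n, proposals, suspicions):
    """proposals: one candidate value per process;
    suspicions: each process's (frozen) set of suspected processes."""
    token_value = proposals[0]   # token starts at process 0 carrying its value
    stable_hops = 0              # consecutive hops with unchanged suspicion status
    last_status = None
    i = 0
    while stable_hops < n:       # token must survive one full, quiet circuit
        status = suspicions[i]
        if status == last_status:
            stable_hops += 1
        else:
            stable_hops = 1      # suspicion status changed: restart the circuit
            last_status = status
        i = (i + 1) % n
    # a second circuit would now disseminate token_value to every process
    return token_value

print(token_ring_consensus(4, [7, 3, 9, 1], [frozenset()] * 4))  # → 7
```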

  16. Building systems with ◇W • Unfortunately, this failure detector is not implementable in a purely asynchronous system • And it is the weakest failure detector that solves consensus • Using timeouts we can approximate it, but we may make mistakes at arbitrary times
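A timeout-based approximation can be sketched as follows. This is a minimal sketch of my own (names and parameters are mine, not from the slides): it can wrongly suspect a slow process, and on a false suspicion it doubles that process's timeout, the classic back-off trick behind eventually-accurate detectors in partially synchronous systems.

```python
import time

# Minimal timeout-based failure detector sketch (my own illustration).
# It may suspect a live-but-slow process; when a heartbeat later arrives
# from a suspected process, the suspicion is retracted and the timeout
# doubled, so false suspicions of any fixed-delay process eventually stop.

class TimeoutDetector:
    def __init__(self, initial_timeout=1.0):
        self.initial_timeout = initial_timeout
        self.timeout = {}     # per-process timeout, in seconds
        self.last_heard = {}  # timestamp of last heartbeat
        self.suspected = set()

    def heartbeat(self, p, now=None):
        now = time.monotonic() if now is None else now
        self.last_heard[p] = now
        self.timeout.setdefault(p, self.initial_timeout)
        if p in self.suspected:        # false suspicion: retract and back off
            self.suspected.discard(p)
            self.timeout[p] *= 2

    def check(self, now=None):
        now = time.monotonic() if now is None else now
        for p, t in self.last_heard.items():
            if now - t > self.timeout[p]:
                self.suspected.add(p)
        return self.suspected

d = TimeoutDetector()
d.heartbeat("p1", now=0.0)
print(d.check(now=2.0))    # {'p1'} — suspected after 2 s of silence
d.heartbeat("p1", now=2.5)
print(d.check(now=2.6))    # set() — suspicion retracted, timeout doubled
```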

  17. [Figure] Group Membership Service: a process group {pi, pj, …} over an asynchronous lossy network; the service handles Join, Leave, and Failure events and maintains each member's (e.g. pj's) membership list

  18. Data Dissemination using Epidemic Protocols • Want efficiency, robustness, speed and scale • Tree distribution is efficient, but fragile and hard to configure • Gossip is efficient and robust but has higher latency: network load is almost linear, and detection time scales as O(n log n) with the number of processes

  19. State Monotonic Property • A gossip message contains the state of the sender of the gossip • The receiver uses a merge function to merge the received state with its own state • Need some kind of monotonicity in the state and in the gossip

  20. Simple Epidemic • Assume a fixed population of size n • For simplicity, assume homogeneous spreading • Simple epidemic: anyone can infect anyone with equal probability • Assume that k members are already infected • And that the infection occurs in rounds

  21. Probability of Infection • What is the probability Pinfect(k,n) that a particular uninfected member is infected in a round if k are already infected? • Pinfect(k,n) = 1 – P(nobody infects the member) = 1 – (1 – 1/n)^k • E(#newly infected members) = (n – k) × Pinfect(k,n) • Basically it's a binomial distribution
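A quick numeric check of the formula above (the population size n is my own choice for illustration):

```python
# Numeric check of the slide's formulas:
#   Pinfect(k, n) = 1 - (1 - 1/n)^k
#   E[#newly infected] = (n - k) * Pinfect(k, n)

def p_infect(k, n):
    return 1 - (1 - 1 / n) ** k

def expected_new(k, n):
    return (n - k) * p_infect(k, n)

n = 1000
# At the halfway mark the per-member infection probability is already ~0.4,
# matching the 1 - (1/e)^0.5 approximation on the next slide.
print(round(p_infect(n // 2, n), 3))   # → 0.394
# With a single infected member, about one new member is infected per round.
print(round(expected_new(1, n), 3))    # → 0.999
```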

  22. 2 Phases • Intuition: 2 phases • First half: 1 → n/2 (Phase 1) • Second half: n/2 → n (Phase 2) • For large n, Pinfect(n/2,n) ≈ 1 – (1/e)^0.5 ≈ 0.4

  23. Infection and Uninfection • Infection • Initial growth factor is very high, about 2 • At the halfway mark it's about 1.4 • Exponential growth • Uninfection • Slow decline of uninfection to start • At the halfway mark it's about 0.4 • Exponential decline

  24. Rounds • Number of rounds necessary to infect the entire population is O(log n) • Robbert uses a base of 1.585 for experiments
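A quick push-gossip simulation (my own sketch, not the experimental setup from the talk) shows round counts in the O(log n) ballpark:

```python
import math
import random

# Simulate a simple push epidemic: each round, every infected member
# gossips to one uniformly random member (possibly already infected).
# This is my own illustration of the O(log n) round count, not Robbert's
# experimental code.

def rounds_to_infect_all(n, seed=0):
    rng = random.Random(seed)
    infected = {0}
    rounds = 0
    while len(infected) < n:
        # each infected member picks one random gossip target this round
        infected |= {rng.randrange(n) for _ in range(len(infected))}
        rounds += 1
    return rounds

n = 1024
print(rounds_to_infect_all(n), "rounds; log_1.585(n) ≈",
      round(math.log(n, 1.585), 1))
```

Push-only gossip needs somewhat more than log base-1.585 rounds to reach the final stragglers, which is why the second "half" of the epidemic dominates the tail.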

  25. How the Protocol Works • Each member maintains a list of (address, heartbeat) pairs • Periodically each member gossips: • Increments its own heartbeat • Sends (part of) its list to a randomly chosen member • On receipt of gossip, the lists are merged • Each member maintains the last heartbeat of each list member
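The merge step can be sketched as a per-member maximum over heartbeat counters, which is exactly the monotonic merge the earlier slide asks for (the member names below are illustrative):

```python
# Sketch of the gossip merge: each list maps address -> heartbeat counter,
# and merging keeps the larger counter per address. Taking the max is
# monotonic, so repeated merges in any order converge.

def gossip_merge(mine, received):
    merged = dict(mine)
    for addr, hb in received.items():
        if hb > merged.get(addr, -1):
            merged[addr] = hb
    return merged

a = {"p1": 5, "p2": 3}
b = {"p1": 4, "p2": 7, "p3": 1}
print(gossip_merge(a, b))   # → {'p1': 5, 'p2': 7, 'p3': 1}
```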

  26. [Figure] SWIM Group Membership Service: a process group {pi, pj, …} over an asynchronous lossy network; the service handles Join, Leave, and Failure events and maintains each member's (e.g. pj's) membership list

  27. System Design • Join, Leave, Failure: broadcast to all processes • Need to detect a process failure at some process quickly (to be able to broadcast it) • Failure detector protocol specifications: • Detection time and accuracy: specified by the application designer to SWIM • Load: optimized by SWIM

  28. [Figure] SWIM Failure Detector Protocol: in each protocol period of T time units, pi pings a randomly chosen pj; if no ack arrives, pi asks K random processes to probe pj on its behalf
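One protocol period can be sketched as follows. This is my own simplified model of the ping / indirect-ping exchange; `can_reach` stands in for the network and is an assumption, not SWIM's API, and real SWIM works with timeouts rather than a synchronous reachability oracle.

```python
import random

# Sketch of one SWIM protocol period (my own simplified model):
#   1. pi pings a random member pj and waits for an ack.
#   2. If no ack, pi asks K random helpers to ping pj indirectly.
#   3. pj is suspected only if neither the direct nor any indirect ack arrives.
# `can_reach(a, b)` is a stand-in for "a's message reaches b and b replies".

def protocol_period(pi, members, can_reach, k=3, rng=None):
    rng = rng or random.Random()
    others = [m for m in members if m != pi]
    pj = rng.choice(others)
    if can_reach(pi, pj):                    # direct ping/ack succeeded
        return pj, False
    helpers = rng.sample([m for m in others if m != pj],
                         min(k, len(others) - 1))
    if any(can_reach(pi, h) and can_reach(h, pj) for h in helpers):
        return pj, False                     # indirect ack: pj is alive
    return pj, True                          # no ack at all: suspect pj

members = ["p1", "p2", "p3", "p4", "p5"]
dead = {"p3"}
reach = lambda a, b: b not in dead           # messages to dead members are lost
pj, suspected = protocol_period("p1", members, reach)
print(pj, suspected)
```

The indirect probes are what decouple detection from any single lossy path: a live pj is declared suspect only if K+1 independent routes all fail in the same period.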

  29. Properties • Expected detection time: e/(e–1) ≈ 1.58 protocol periods • Load: O(K) per process • Inaccuracy probability: exponential in K • Process failures detected: • in O(log N) protocol periods w.h.p. • in O(N) protocol periods deterministically

  30. Why not Heartbeating? • Centralized: single point of failure • All-to-all: O(N) load per process • Logical ring: unpredictable under multiple failures

  31. [Figure] LAN scalability results: Win2000, 100Base-T Ethernet LAN; protocol period = 3×RTT, RTT = 10 ms, K = 1

  32. Deployment • Broadcast ‘suspicion’ before ‘declaring’ process failure • Piggyback broadcasts through ping messages • Epidemic-style broadcast • WAN • Load on core routers • No representatives per subnet/domain
