
Lecture IX: Coordination And Agreement

Presentation Transcript


  1. Lecture IX: Coordination And Agreement CMPT 401 2008 Dr. Alexandra Fedorova

  2. A Replicated Service [Figure: clients issue writes (W) and reads (R) over the network; writes go to the master server, which replicates the data to slave servers; reads may be served by the slaves]

  3. A Need For Coordination And Agreement [Figure: the master server fails; the remaining servers must coordinate the election of a new master and must agree on which slave becomes the new master]

  4. Roadmap • Today we will discuss protocols for coordination and agreement • This is a difficult problem because of failures and the lack of a bound on message delay • We will begin with a strong set of assumptions (assume few failures), and then we will relax those assumptions • We will look at several problems requiring coordination and agreement: distributed mutual exclusion, election • We will finally learn that in an asynchronous distributed system it is impossible to guarantee consensus

  5. Distributed Mutual Exclusion (DMTX) • Similar to a local mutual exclusion problem • Processes in a distributed system share a resource • Only one process can access a resource at a time • Examples: • File sharing • Sharing a bank account • Updating a shared database

  6. Assumptions and Requirements • An asynchronous system • Processes do not fail • Message delivery is reliable (exactly once) • Protocol requirements: • Safety: at most one process may execute in the critical section at a time • Liveness: requests to enter and exit the critical section eventually succeed • Fairness: requests to enter the critical section are granted in the order in which they were received

  7. Evaluation Criteria of DMTX Algorithms • Bandwidth consumed • proportional to the number of messages sent in each entry and exit operation • Client delay • delay incurred by a process at each entry and exit operation • System throughput • the rate at which processes can access the critical section (number of accesses per unit of time)

  8. DMTX Algorithms • We will consider the following algorithms: • Central server algorithm • Ring-based algorithm • An algorithm based on voting

  9. The Central Server Algorithm

  10. The Central Server Algorithm • Performance: • Entering a critical section takes two messages (a request message followed by a grant message) • System throughput is limited by the synchronization delay at the server: the time between the release message to the server and the grant message to the next client • Fault tolerance: • Does not tolerate failures • What if the client holding the token fails?
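
The server's behaviour can be sketched in a few lines of code. This is only an illustrative sketch, assuming a LockServer object that receives request/release messages and answers with a grant callback; these names are placeholders, not from the lecture.

```python
import queue
import threading

class LockServer:
    """Central server holding the mutual exclusion token (illustrative sketch)."""

    def __init__(self):
        self.holder = None            # client currently holding the token
        self.waiting = queue.Queue()  # FIFO queue of pending requests
        self.mutex = threading.Lock() # protects the server's own state

    def request(self, client, grant):
        """Handle a 'request' message; grant(client) stands in for the grant message."""
        with self.mutex:
            if self.holder is None:
                self.holder = client
                grant(client)
            else:
                self.waiting.put((client, grant))

    def release(self, client):
        """Handle a 'release' message; hand the token to the next waiting client."""
        with self.mutex:
            if self.waiting.empty():
                self.holder = None
            else:
                self.holder, grant = self.waiting.get()
                grant(self.holder)    # synchronization delay: release -> grant
```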

  11. A Ring-Based Algorithm

  12. A Ring-Based Algorithm (cont) • Processes are arranged in a ring • There is a communication channel from process pi to process p(i+1) mod N • They continuously pass the mutual exclusion token around the ring • A process that does not need to enter the critical section (CS) passes the token along • A process that needs to enter the CS retains the token; once it exits the CS, it resumes passing the token • No fault tolerance • Excessive bandwidth consumption
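
A minimal sketch of this token passing, assuming a per-process object with a hypothetical send(dest, message) transport and a wants_cs flag; these names are illustrative, not from the slides.

```python
class RingProcess:
    """One process in the ring-based mutual exclusion algorithm (sketch)."""

    def __init__(self, pid, n, send):
        self.pid, self.n = pid, n
        self.send = send          # send(dest_pid, message): assumed transport
        self.wants_cs = False     # set when the process needs the critical section

    def on_token(self):
        """Called when the token arrives from the predecessor in the ring."""
        if self.wants_cs:
            self.enter_critical_section()
            self.exit_critical_section()
            self.wants_cs = False
        # Pass the token to the next process in the ring.
        self.send((self.pid + 1) % self.n, "TOKEN")

    def enter_critical_section(self): ...
    def exit_critical_section(self): ...
```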

  13. Maekawa’s Voting Algorithm • To enter a critical section a process must receive permission from a subset of its peers • Processes are organized into voting sets • A process is a member of M voting sets • All voting sets are of equal size (for fairness)

  14. Maekawa’s Voting Algorithm • Intersection of voting sets guarantees mutual exclusion • To avoid deadlock, requests to enter the critical section must be ordered [Figure: processes p1–p4 with pairwise-intersecting voting sets]
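
As an illustration of how equal-size, pairwise-intersecting voting sets can be constructed, here is one common grid-based sketch; the construction is an assumption for illustration (not the one given in the lecture) and requires N to be a perfect square.

```python
import math

def grid_voting_sets(n):
    """Place the n processes on a sqrt(n) x sqrt(n) grid. Process (r, c) votes
    with everyone in its row and column, so any two sets intersect and every
    set has the same size, 2*sqrt(n) - 1."""
    k = math.isqrt(n)
    assert k * k == n, "sketch assumes n is a perfect square"
    sets = []
    for p in range(n):
        r, c = divmod(p, k)
        row = {r * k + j for j in range(k)}
        col = {i * k + c for i in range(k)}
        sets.append(row | col)
    return sets

# Any two voting sets share at least one process, which is what guarantees
# mutual exclusion:
# assert all(s & t for s in grid_voting_sets(9) for t in grid_voting_sets(9))
```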

  15. Elections • Election algorithms are used when a unique process must be chosen to play a particular role: • Master in a master-slave replication system • Central server in the DMTX protocol • We will look at the bully election algorithm • The bully algorithm tolerates failstop failures • But it works only in a synchronous system with reliable messaging

  16. The Bully Election Algorithm • All processes are assigned identifiers • The system always elects a coordinator with the highest identifier: • Each process must know all processes with higher identifiers than its own • Three types of messages: • election – a process begins an election • answer – a process acknowledges the election message • coordinator – an announcement of the identity of the elected process

  17. The Bully Election Algorithm (cont.) • Initiation of election: • Process p1 detects that the existing coordinator p4 has crashed and initiates an election • p1 sends an election message to all processes with higher identifiers than its own [Figure: p1 sends election messages to the higher-numbered processes; the crashed coordinator p4 does not respond]

  18. The Bully Election Algorithm (cont.) • What happens if there are no crashes: • p2 and p3 receive the election message from p1, send back the answer message to p1, and begin their own elections • p3 sends answer to p2 • p3 receives no answer message from p4, so after a timeout it elects itself as the leader (knowing it has the highest ID) [Figure: election and answer messages among p1, p2 and p3; p3 broadcasts the coordinator message]

  19. The Bully Election Algorithm (cont.) • What happens if p3 also crashes after sending the answer message but before sending the coordinator message? • In that case, p2 will time out while waiting for the coordinator message and will start a new election [Figure: p3 crashes after answering; p2 times out and restarts the election]

  20. The Bully Election Algorithm (summary) • The algorithm does not require a central server • Does not require knowing identities of all the processes • Does require knowing identities of processes with higher IDs • Survives crashes • Assumes a synchronous system (relies on timeouts)
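
A hedged sketch of the bully algorithm from the point of view of one process. The transport object with send/wait_for and the TIMEOUT constant are assumptions standing in for the synchronous messaging layer; a real implementation would handle the waits asynchronously.

```python
TIMEOUT = 2.0  # seconds; relies on the synchronous-system assumption

class BullyProcess:
    def __init__(self, my_id, all_ids, transport):
        self.my_id = my_id
        self.higher = [p for p in all_ids if p > my_id]  # only higher IDs needed
        self.transport = transport
        self.coordinator = None

    def start_election(self):
        if not self.higher:
            self.announce()                              # highest ID: elect itself
            return
        self.transport.send(self.higher, "ELECTION", frm=self.my_id)
        if not self.transport.wait_for("ANSWER", timeout=TIMEOUT):
            self.announce()                              # nobody higher is alive
        elif not self.transport.wait_for("COORDINATOR", timeout=TIMEOUT):
            self.start_election()                        # answerer crashed: retry

    def announce(self):
        self.coordinator = self.my_id
        self.transport.send("ALL", "COORDINATOR", frm=self.my_id)

    def on_election(self, frm):
        # Acknowledge the lower-numbered initiator and run our own election.
        self.transport.send([frm], "ANSWER", frm=self.my_id)
        self.start_election()
```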

  21. Consensus in Asynchronous Systems With Failures • The algorithms we’ve covered have limitations: • Either they tolerate only limited failures (failstop) • Or they assume a synchronous system • Consensus cannot be guaranteed in an asynchronous system with failures • Next we will see why…

  22. Consensus • All processes agree on the same value (or set of values) • When do you need consensus? • Leader (master) election • Mutual exclusion • Transactions involving multiple parties (banking) • We will look at several variants of the consensus problem • Consensus • Byzantine generals • Interactive consistency

  23. System Model • There is a set of processes Pi • There is a set of values {v0, …, vN-1} proposed by the processes • Each process Pi decides on di • di belongs to the set {v0, …, vN-1} • Assumptions: • Synchronous system (for now) • Failstop failures • Byzantine failures • Reliable channels

  24. Consensus Algorithm [Figure: processes P1, P2, P3; Step 1 (Propose): each Pi proposes a value vi; Step 2 (Decide): each Pi decides a value di] Courtesy of Jeff Chase, Duke University

  25. Consensus (C) • Each Pi selects di from {v0, …, vN-1} • All Pi select the same vk (make the same decision): di = vk Courtesy of Jeff Chase, Duke University

  26. Conditions for Consensus • Termination: All correct processes eventually decide. • Agreement: All correct processes select the same di. • Integrity: If all correct processes propose the same v, then di = v

  27. Byzantine Generals Problem (BG) • Two types of generals: a commander (leader) and subordinates (lieutenants) • The commander proposes a value vleader • The subordinates must agree: every correct subordinate Pi decides di = vleader Courtesy of Jeff Chase, Duke University

  28. Conditions for Consensus • Termination: All correct processes eventually decide. • Agreement: All correct processes select the same di. • Integrity: If the commander is correct, then all correct processes decide on the value that the commander proposed

  29. Interactive Consistency (IC) di = [v0, …, vN-1] • Each Pi proposes a value vi • Each Pi selects di = [v0, …, vN-1], a vector reflecting the values proposed by all correct participants • All Pi must decide on the same vector

  30. Conditions for Consensus • Termination: All correct processes eventually decide. • Agreement: The decision vector of all correct processes is the same • Integrity: If Pi is correct then all correct processes decide on vi as the ith component of their vector

  31. Equivalence of IC and BG • We will show that BG is equivalent to IC • If there is a solution to one, there is a solution to the other • Notation: • BGi(j, v) returns the decision value of pi when the commander pj proposed v • ICi(v1, v2, …, vN)[j] returns the jth value in the decision vector of pi in the solution to IC, where {v1, v2, …, vN} are the values that the processes proposed • Our goal is to find a solution to IC given a solution to BG

  32. Equivalence of IC and BG • We run the BG problem N times • Each time one process pj acts as the commander and proposes a value v • Recall that in IC each process proposes a value • After each run of the BG problem we record BGi(j, v) for all i – that is, what each process decided when pj proposed v • Similarity with IC: we record what each pi decided for vector position j • We need to record decisions for N vector positions, so we run the problem N times

  33. Equivalence of IC and BG • Initialization: every process starts with an empty decision vector [?, ?, ?] • Run #1: P0 proposes v0; we record d0 for all p → [d0, ?, ?] • Run #2: P1 proposes v1; we record d1 for all p → [d0, d1, ?] • Run #3: P2 proposes v2; we record d2 for all p → [d0, d1, d2]
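
The construction can be written as a short sketch. The function BG(j, v), standing in for one complete run of a Byzantine generals protocol with commander j proposing v, is an assumed primitive, not an API from the lecture.

```python
def interactive_consistency(proposals, BG):
    """proposals[j] is the value proposed by process j; BG(j, v) returns a list
    whose i-th entry is BG_i(j, v), the decision of process i when commander j
    proposed v. Returns one decision vector per process."""
    n = len(proposals)
    vectors = [[None] * n for _ in range(n)]   # one empty vector per process
    for j in range(n):                         # run the BG protocol N times
        decisions = BG(j, proposals[j])        # commander p_j proposes its value
        for i in range(n):
            vectors[i][j] = decisions[i]       # record position j of p_i's vector
    return vectors
```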

  34. Consensus in a Synchronous System Without Failures • Each process pi proposes a decision value vi • All proposed vi are sent around, such that each process knows all proposed vi • Once all processes have received all proposed v’s, they apply to them the same function, such as minimum(v1, v2, …, vN) • Each process pi sets di = minimum(v1, v2, …, vN) • The consensus is reached • What if processes fail? Can the other processes still reach an agreement?
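
A one-round sketch of this failure-free algorithm, with hypothetical broadcast and gather primitives standing in for the single round of message exchange:

```python
def consensus_no_failures(my_value, broadcast, gather):
    """Single-round consensus in a synchronous, failure-free system (sketch)."""
    broadcast(my_value)       # send my proposal to every other process
    proposals = gather()      # receive every process's proposal (including mine)
    return min(proposals)     # every process applies the same function
```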

  35. Consensus in a Synchronous System With Failstop Failures • We assume that at most f out of N processes fail • To reach a consensus despite f failures, we must extend the algorithm to take f+1 rounds • At round 1: each process pi sends its proposed vi to all other processes and receives v’s from other processes • At each subsequent round process pi sends v’s that it has not sent before and receives new v’s • The algorithm terminates after f+1 rounds • Let’s see why it works…
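
A sketch of this (f+1)-round algorithm, with broadcast and receive_all standing in for one synchronous round of message exchange (assumed primitives, not from the slides):

```python
def consensus_with_crashes(my_value, f, broadcast, receive_all):
    """Consensus tolerating up to f crash (failstop) failures (sketch)."""
    known = {my_value}                 # all values this process has seen
    new = {my_value}                   # values not yet forwarded
    for _ in range(f + 1):             # f+1 synchronous rounds
        broadcast(new)                 # send only values not sent before
        received = receive_all()       # union of the sets received this round
        new = received - known
        known |= received
    return min(known)                  # same deterministic choice everywhere
```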

  36. Consensus in a Synchronous System With Failstop Failures: Proof • We prove it by contradiction • Suppose some correct process pi possesses a value that another correct process pj does not possess • This must have happened because some other process pk sent that value to pi but crashed before sending it to pj • The crash must have happened in round f+1 (the last round); otherwise, pi would have forwarded that value to pj in a later round • How could pj have missed that value in every one of the previous rounds? • Only if in every round some process received the value for the first time and crashed before sending it to pj • But this implies that there must have been f+1 crashes • This is a contradiction: we assumed at most f failures

  37. Consensus in a Synchronous System: Discussion • Can this algorithm withstand other types of failures – omission failures, Byzantine failures? • With omission failures, processes separated by a network partition can each agree on a separate value [Figure: a network partition splits the processes into two groups] • Let us look at consensus in the presence of Byzantine failures

  38. Consensus in a Synchronous System With Byzantine Failures • Byzantine failure: a process can forward to another process an arbitrary value v • Byzantine generals: the commander says to one lieutenant that v = A, and to another lieutenant that v = B • We will show that consensus is impossible with only 3 generals • Pease et al. generalized this: consensus is impossible with N ≤ 3f, where N is the number of generals and f the number of faulty ones

  39. BG: Impossibility With Three Generals [Figure: two scenarios with commander p1 and lieutenants p2, p3; faulty processes are shown shaded. Scenario 1: the correct commander sends v to both lieutenants; p2 relays “2:1:v”, the faulty p3 relays “3:1:u”. Scenario 2: the faulty commander sends w to p2 and x to p3, who correctly relay “2:1:w” and “3:1:x”.] • Notation: “3:1:u” means “3 says 1 says u” • Scenario 1: p2 must decide v (by the integrity condition) • But p2 cannot distinguish between Scenario 1 and Scenario 2, so it will decide w in Scenario 2 • By symmetry, p3 will decide x in Scenario 2 • p2 and p3 will have reached different decisions

  40. Solution With Four Byzantine Generals • We can reach consensus if there are 4 generals and at most 1 is faulty • Intuition: use the majority rule [Figure: a correct process receiving conflicting reports asks “Who is telling the truth?” Majority rules!]

  41. Solution With Four Byzantine Generals • Round 1: the commander sends v to all other generals • Round 2: all generals exchange the values they received from the commander • The decision is made based on majority [Figure: two four-general scenarios side by side (detailed on the next two slides); faulty processes are shown shaded, and messages labeled “i:1:v” mean “i says 1 says v”]

  42. Solution With Four Byzantine Generals [Figure: the correct commander p1 sends v to p2, p3 and p4; the faulty p3 relays u to p2 and w to p4] • p2 receives: {v, v, u}. Decides v • p4 receives: {v, v, w}. Decides v

  43. Solution With Four Byzantine Generals [Figure: the faulty commander p1 sends u to p2, w to p3 and v to p4; the correct lieutenants relay what they received] • p2 receives: {u, w, v}. Decides NULL • p3 receives: {w, u, v}. Decides NULL • p4 receives: {u, v, w}. Decides NULL • The result generalizes for systems with N ≥ 3f + 1 (N is the number of processes, f is the number of faulty processes)
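
The lieutenants' majority decision can be sketched as follows; the list-of-values input and the use of None to stand in for NULL are assumptions made for illustration.

```python
from collections import Counter

def decide(received_values):
    """received_values: the commander's value plus the values relayed by the
    other lieutenants, e.g. ['v', 'v', 'u']. Returns the majority value, or
    None (NULL) when there is no majority."""
    value, count = Counter(received_values).most_common(1)[0]
    return value if count > len(received_values) // 2 else None

# decide(['v', 'v', 'u'])  -> 'v'   (correct commander, one faulty lieutenant)
# decide(['u', 'w', 'v'])  -> None  (faulty commander: no majority, decide NULL)
```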

  44. Consensus in an Asynchronous System • In the algorithms we’ve looked at, consensus was reached using several rounds of communication • The systems were synchronous, so each round always terminated • If a process did not receive a message from another process in a given round, it could assume that the process is faulty • In an asynchronous system this assumption cannot be made! • Fischer-Lynch-Paterson (1985): no consensus can be guaranteed in an asynchronous system in the presence of even a single (crash) failure • Intuition: a “failed” process may just be slow, and can rise from the dead at exactly the wrong time.

  45. Consensus in Practice • Real distributed systems are by and large asynchronous • How do they operate if consensus cannot be reached? • Fault masking: assume that failed processes always recover, and define a way to reintegrate them into the group. • If you haven’t heard from a process, just keep waiting… • A round terminates when every expected message is received. • Failure detectors: construct a failure detector that can determine if a process has failed. • A round terminates when every expected message is received, or the failure detector reports that its sender has failed.

  46. Fault Masking • In a distributed system, a recovered node’s state must also be consistent with the states of other nodes. • Transaction processing systems record state to persistent storage, so they can recover after a crash and continue as normal • What if a node crashed before important state was recorded on disk? • A functioning node may need to respond to a peer’s recovery: • rebuild the state of the recovering node, and/or • discard local state, and/or • abort/restart operations/interactions in progress • e.g., two-phase commit protocol

  47. Failure Detectors • First problem: how to detect that a member has failed? • pings, timeouts, heartbeats • recovery notifications • Is the failure detector accurate? – Does it accurately detect failures? • Is the failure detector live? – Are there bounds on failure detection time? • In an asynchronous system, it is impossible for a failure detector to be both accurate and live

  48. Failure Detectors in Real Systems • Use a failure detector that is live but not accurate. • Assume bounded processing delays and delivery times. • Timeout with multiple retries detects failure accurately with high probability. Tune it to observed latencies. • If a “failed” site turns out to be alive, then restore it or kill it (fencing, fail-silent). • What do we assume about communication failures? • How much pinging is enough? • Tune parameters for your system – can you predict how your system will behave under pressure? • That’s why distributed system engineers often participate in multi-day support calls… • What about network partitions? • Processes form two independent groups, reach consensus independently. Rely on quorum.
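
A minimal sketch of a failure detector that is live but not accurate, along the lines described above: heartbeats plus a timeout that must be tuned to observed latencies (the threshold value here is an arbitrary assumption).

```python
import time

class HeartbeatFailureDetector:
    """Live-but-not-accurate failure detector based on heartbeats (sketch)."""

    def __init__(self, timeout_seconds=5.0):
        self.timeout = timeout_seconds   # tune to the latencies you observe
        self.last_heard = {}             # process id -> time of last heartbeat

    def on_heartbeat(self, pid):
        self.last_heard[pid] = time.monotonic()

    def suspected(self, pid):
        # Liveness: after `timeout` seconds of silence we suspect the process.
        # Accuracy is not guaranteed: a slow or partitioned process may be
        # wrongly suspected, so the rest of the system must tolerate false
        # positives (e.g., by fencing the "failed" node or relying on a quorum).
        last = self.last_heard.get(pid)
        return last is None or (time.monotonic() - last) > self.timeout
```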

  49. Summary • Coordination and agreement are essential in real distributed systems • Real distributed systems are asynchronous • Consensus cannot be reached in an asynchronous distributed system • Nevertheless, people still build useful distributed systems that rely on consensus • Fault recovery and masking are used as mechanisms for helping processes reach consensus • Popular fault masking and recovery techniques are transactions and replication – the topics of the next few lectures
