PAXOS

Lecture by Avi Eyal
Based on:
• Deconstructing Paxos, by Rajsbaum
• Paxos Made Simple, by Lamport
• Reconstructing Paxos, by Rajsbaum

Our Goals
• Agree on values (Consensus)
• Arrange those values in a “Total Order”

The Scene
• Complete graph
• Asynchronous system with no FIFO ordering of messages
• Machines may crash (first we deal with “crash-stop”)
• No Byzantine errors
• No corruption of messages
• The number of machines is known
• The system stabilizes after a finite time

A word about stability
By the FLP theorem, Consensus is not solvable in an asynchronous system if even a single process might crash. We therefore assume that after an unknown but finite time, every process that crashes stays crashed and every active process stays active (i.e. no process is unstable forever).

Consensus
• If process Pi proposes a value over and over, then either Pi crashes or Pi decides.
• If Pi decides on a value, then eventually every correct process decides the same value.

Consensus
How can we ensure that only a single value is chosen when some machines are unstable? Do we need a consistent leader? What if we had more than one leader at a time?

Consensus
A decision is taken by more than half the processes, and we make sure that the rest get the message. We will show that we do NOT need a consistent leader at that point, but if we have 2 leaders, they might cause each other to abort.
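
Why "more than half" and not "exactly half": any two majorities share at least one process, so two conflicting decisions cannot both gather a quorum. The following is a minimal sketch that checks this by brute force for small n. The function names are illustrative only and are not part of the lecture.

```python
from itertools import combinations

def majorities(n):
    """All subsets of processes {0..n-1} with more than n/2 members."""
    need = n // 2 + 1
    return [set(c)
            for size in range(need, n + 1)
            for c in combinations(range(n), size)]

def exact_halves(n):
    """All subsets with exactly n/2 members (only meaningful for even n)."""
    return [set(c) for c in combinations(range(n), n // 2)]

def all_pairs_intersect(quorums):
    """True if every two quorums have at least one process in common."""
    return all(a & b for a, b in combinations(quorums, 2))

# Majorities always overlap; exact halves of an even n may be disjoint.
ok_odd = all_pairs_intersect(majorities(5))
ok_even = all_pairs_intersect(majorities(4))
halves_overlap = all_pairs_intersect(exact_halves(4))
```

With 4 processes, the halves {0, 1} and {2, 3} are disjoint, which is why a quorum of exactly n/2 is not enough.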

Proposers & Witnesses
“Read”
• Make sure that more than half of the witnesses will not work with anyone whose round number is lower than mine.
• Get the decided value, if one exists.
“Write”
• Set a value at more than half of the witnesses.

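Proposer and Witness message flow
• “Read”: the proposer sends [“read”, k]. The witness either updates readj and replies [ackRead, k, writej, vj], after which the proposer updates v*, or replies [nackRead, k], after which the proposer aborts.
• “Write”: the proposer sends [“write”, k, v*]. The witness either updates writej, vj and replies [ackWrite, k], after which the proposer decides v*, or replies [nackWrite, k], after which the proposer aborts.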

Consensus
Propose(v)
    k = k + n
    Send [“read”, k] to all
    Wait for more than n/2 replies [ackRead, k’, v’]
    if received any nackRead, abort
    v* = v’ with max k’, or v if none exists
    Send [“write”, k, v*] to all
    Wait for more than n/2 replies [ackWrite, k]
    if received any nackWrite, abort
    decide(v*)
Upon receive [read, k]
    if k < readi or k < writei
        reply [nackRead, k]
    else
        readi = k
        reply [ackRead, k, writei, vi]
Upon receive [write, k, v*]
    if writei > k or readi > k
        reply [nackWrite, k]
    else
        writei = k
        vi = v*
        reply [ackWrite, k]
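
The read/write rounds can be sketched as a small in-memory simulation. This is not the lecture's implementation: messages are replaced by direct method calls, the `up` flag is a hypothetical way to model a crashed witness, and starting k at the process id (so that k, k+n, k+2n, ... never collide between proposers) is an assumption the slides do not state.

```python
ABORT = "abort"

class Witness:
    def __init__(self):
        self.read = 0     # read_i: highest round acknowledged in a "read"
        self.write = 0    # write_i: round of the last accepted "write"
        self.v = None     # v_i: value accepted in round write_i
        self.up = True    # hypothetical flag: False models a crashed witness

    def on_read(self, k):
        if k < self.read or k < self.write:
            return ("nackRead", k)
        self.read = k
        return ("ackRead", k, self.write, self.v)

    def on_write(self, k, v):
        if self.write > k or self.read > k:
            return ("nackWrite", k)
        self.write, self.v = k, v
        return ("ackWrite", k)

class Proposer:
    def __init__(self, pid, witnesses):
        self.witnesses = witnesses
        self.n = len(witnesses)
        self.k = pid      # assumption: start at pid so rounds are unique

    def _broadcast(self, handler, *args):
        # Collect replies from the witnesses that are currently up.
        return [getattr(w, handler)(*args) for w in self.witnesses if w.up]

    def propose(self, v):
        self.k += self.n
        majority = self.n // 2 + 1
        replies = self._broadcast("on_read", self.k)
        if len(replies) < majority or any(r[0] == "nackRead" for r in replies):
            return ABORT
        # Adopt the value written in the highest round, if any was written.
        accepted = [(r[2], r[3]) for r in replies if r[2] > 0]
        v_star = max(accepted, key=lambda p: p[0])[1] if accepted else v
        replies = self._broadcast("on_write", self.k, v_star)
        if len(replies) < majority or any(r[0] == "nackWrite" for r in replies):
            return ABORT
        return v_star

witnesses = [Witness() for _ in range(3)]
p0, p1 = Proposer(0, witnesses), Proposer(1, witnesses)
first = p0.propose("a")     # nothing written yet, so p0 decides its own value
second = p1.propose("b")    # p1 reads "a" from the witnesses and adopts it
witnesses[2].up = False
third = p0.propose("c")     # 2 of 3 witnesses are still a majority
witnesses[1].up = False
fourth = p1.propose("d")    # only 1 of 3 replies: no majority, so abort
```

Once "a" has been written at a majority, every later proposer reads it back and proposes it again, so the decided value never changes.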

Some notes about the Consensus algorithm
• It is possible that Pi proposes a value and does not decide, and that Pj then decides this value even if Pi has crashed (after “write”).
• When 2 leaders are proposing simultaneously, it is possible that neither of them will decide.
• If only a minority of the processes have acknowledged the “write”, we cannot be sure what the decided value will be. (It depends on whether the next proposer gets an answer from them or not.)
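
The second note can be illustrated with a sketch in which two proposers interleave so that each one's “read” lands just before the other's “write”. The witness rules are the ones from the Consensus slide. The scheduling, the names "A" and "B", and the starting round numbers 0 and 1 are assumptions made for this illustration.

```python
def on_read(w, k):
    """Witness rule for [read, k]; returns True for ack, False for nack."""
    if k < w["read"] or k < w["write"]:
        return False
    w["read"] = k
    return True

def on_write(w, k, v):
    """Witness rule for [write, k, v]; returns True for ack, False for nack."""
    if w["write"] > k or w["read"] > k:
        return False
    w["write"], w["v"] = k, v
    return True

def duel(n=3, attempts=4):
    """Alternate proposers A and B; return the list of proposers that decided."""
    witnesses = [{"read": 0, "write": 0, "v": None} for _ in range(n)]
    k = {"A": 0, "B": 1}      # distinct starting rounds, incremented by n
    decided = []
    pending = None            # (name, round) that finished "read", awaiting "write"
    for i in range(attempts):
        name = "A" if i % 2 == 0 else "B"
        k[name] += n
        # This proposer's "read" reaches every witness first...
        read_ok = all(on_read(w, k[name]) for w in witnesses)
        # ...and only then does the rival's delayed "write" arrive.
        if pending is not None:
            rival, rival_k = pending
            if all(on_write(w, rival_k, rival) for w in witnesses):
                decided.append(rival)
        pending = (name, k[name]) if read_ok else None
    return decided

nobody = duel()
```

Every “write” is rejected because a higher round has already been read, so the list of deciders stays empty for as long as the interleaving continues. This is the reason the algorithm eventually needs a single stable leader.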

Total Order
• If Pi delivers m, then eventually every correct process delivers m.
• If Pi delivers m, m’ in this order, then Pj delivers m, m’ in the same order.
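
The ordering property can be checked on recorded delivery logs. The sketch below is a hypothetical test helper, not part of the algorithm, and it assumes each message appears at most once per log. It only checks the second property; the first one (“eventually”) is a liveness property and cannot be checked on a finite log.

```python
def consistent_total_order(logs):
    """True if every two logs agree on the relative order of the
    messages that both of them have delivered."""
    for a in logs:
        for b in logs:
            position_in_b = {m: i for i, m in enumerate(b)}
            # Positions in b of the common messages, taken in a's order.
            indexes = [position_in_b[m] for m in a if m in position_in_b]
            if indexes != sorted(indexes):
                return False
    return True

agree = consistent_total_order([["m", "m2"], ["m", "m2", "m3"]])
disagree = consistent_total_order([["m", "m2"], ["m2", "m"]])
```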

Total Order
Can we do that without a leader? For how long will we need that leader? What if we had more than one leader?

The Paxos Algorithm
• Each process maintains the id of its current leader.
• Proposing values is done through the leader.
• The leader assigns order numbers to the proposed values and then uses Consensus to agree on the sequence.

The Paxos Algorithm
The proposed messages contain values and order numbers. A leader may handle several order numbers at the same time.

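[Figure: message diagram. Processes Pi and Pj send m and m’ to the Leader. Labels on the Leader: Propose(6, m), Propose(7, m’), Decide(7, m*), Decide(6, m’*). Labels on the messages back to Pi and Pj: (7, m*) and (6, m’*).]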

Data Structures
• TO_Delivered[]
• TO_Undelivered[]
• AwaitToBeDelivered[] (used upon delivery)
• nextBatch

The Paxos Algorithm – leader
Converge(L, m)
    returned = abort
    while (returned == abort)
        returned = propose(L, m)    // Repeat until decide
    send [decision, L, m] to all processes
Upon new message m
    Verify that m has not yet been delivered
    find k that does not have a Converge(k, *) active
    Converge(k, m)
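
A minimal sketch of the Converge retry loop, with the consensus call stubbed out. Two things here are assumptions rather than the lecture's text: `max_tries` is added so the sketch cannot loop forever, and the decision message carries the value returned by propose (which may differ from m when an earlier value was already written).

```python
ABORT = "abort"

def converge(propose, slot, m, max_tries=1000):
    """Call propose(slot, m) until it returns something other than ABORT,
    then return the decision message that would be sent to all processes."""
    for _ in range(max_tries):
        returned = propose(slot, m)
        if returned != ABORT:
            return ("decision", slot, returned)
    raise RuntimeError("no decision after %d tries" % max_tries)

def make_flaky_propose(aborts):
    """Hypothetical stub: aborts a fixed number of times, then decides m."""
    calls = {"n": 0}
    def propose(slot, m):
        calls["n"] += 1
        return ABORT if calls["n"] <= aborts else m
    return propose, calls

propose, calls = make_flaky_propose(2)
message = converge(propose, 6, "m")
```

Here the stub aborts twice, so the third call decides and the loop ends.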

The Paxos Algorithm – process
Upon new message m or leader change
    Verify that m has not yet been delivered
    Send TO_Undelivered + m to the leader
Upon receive m from Pj [decision/update, kj, m]
    stop Converge(kj, *) if active
    if kj = nextBatch
        deliver (kj, m) and return
    if kj < nextBatch
        update Pj of its missing messages
    if kj > nextBatch
        AwaitToBeDelivered[kj] = m    // Will be used upon delivery
        send [update, nextBatch-1, TO_Delivered] to all in order to be updated
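
The three cases on receipt of a decision can be sketched as follows. Network sends are replaced by return values that tell the caller what to do next. Starting nextBatch at 0 and draining AwaitToBeDelivered right after each delivery are assumptions about details the slides leave open.

```python
class Process:
    def __init__(self):
        self.to_delivered = []     # TO_Delivered
        self.await_delivery = {}   # AwaitToBeDelivered, keyed by batch number
        self.next_batch = 0        # nextBatch (assumption: numbering starts at 0)

    def _deliver(self, m):
        self.to_delivered.append(m)
        self.next_batch += 1

    def on_decision(self, k, m):
        """Handle [decision, k, m]; returns 'delivered', 'stale' or 'buffered'."""
        if k < self.next_batch:
            return "stale"         # the sender is behind; caller should update it
        if k > self.next_batch:
            self.await_delivery[k] = m
            return "buffered"      # we are behind; caller should ask for an update
        self._deliver(m)
        # Deliver any buffered batches that are now next in line.
        while self.next_batch in self.await_delivery:
            self._deliver(self.await_delivery.pop(self.next_batch))
        return "delivered"

p = Process()
r1 = p.on_decision(1, "b")   # arrives early, so it is buffered
r2 = p.on_decision(0, "a")   # delivers "a", then the buffered "b"
r3 = p.on_decision(0, "a")   # already delivered
```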

Fail Recovery
• Each process holds readi, writei, vi, TO_Delivered and nextBatch on stable storage in order to recover consistently after a crash.
• If a leader proposes, crashes, recovers and proposes again, it might mistake an answer to the first proposal for an answer to the second one. Replies to the proposer should therefore contain the message.
• A process should remember all the messages and should give the same answer to the same message, in case a proposer proposes twice with the same value.
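
One common way to keep such state on stable storage is to write it to a temporary file, force it to disk, and atomically rename it over the old copy. The sketch below assumes a JSON file per process; the file layout, the function names and the default values are illustrative and do not come from the lecture.

```python
import json
import os
import tempfile

def save_state(path, state):
    """Persist state atomically: temp file, fsync, then rename into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())   # make sure the bytes reach the disk
    os.replace(tmp, path)      # atomic rename over any previous copy

def load_state(path):
    """Return the saved state, or fresh defaults if nothing was saved yet."""
    if not os.path.exists(path):
        return {"read": 0, "write": 0, "v": None,
                "TO_Delivered": [], "nextBatch": 0}
    with open(path) as f:
        return json.load(f)
```

A witness would call save_state before replying to a “read” or “write”, so that a reply it has sent is never forgotten after a crash.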

Tradeoffs
• If we know that most of the processes never crash, we can rely on them instead of using stable storage.
• If there are unstable processes that elect themselves as leader over and over, each process can store the leaders of all other processes. A process will then elect a leader only if most of the processes have elected that leader (assuming most processes never crash).
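
The second tradeoff amounts to a majority vote over the leaders that the other processes report. A minimal sketch, assuming each process keeps a map from process id to that process's current leader (the map and the function name are hypothetical):

```python
from collections import Counter

def elected_leader(reported_leaders, n):
    """reported_leaders maps a process id to the leader that process reports.
    Return the leader named by more than n/2 processes, or None."""
    if not reported_leaders:
        return None
    leader, votes = Counter(reported_leaders.values()).most_common(1)[0]
    return leader if 2 * votes > n else None

stable = elected_leader({0: 2, 1: 2, 2: 2, 3: 0, 4: 4}, 5)   # 3 of 5 name process 2
split = elected_leader({0: 2, 1: 2, 2: 0, 3: 0}, 4)          # 2 of 4 is not a majority
```

An unstable process that keeps electing itself collects only its own vote, so it is never adopted as leader by the others.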
