Ordering and dURABILITY IN Isis 2

Ordering and dURABILITY IN Isis2 Cornell University Ken Birman

Isis2 System “joinmyGroup” state transfer myGroup update update • Core functionality: groups of objects • … fault-tolerance, speed (parallelism), coordination • Intended for use in very large-scale settings • The local object instance functions as a gateway • Read-only operations performed on local state • Update operations update all the replicas

Terminology we’ve used • Process group: A term for a collection of programs that are all running (perhaps on different machines, perhaps on the same machine) and that use Isis2 • Each process group has a name (you pick it) • You can have multiple groups in one application • Message: Data encoded to be sent between programs • State transfer: Data to initialize a new group member • Update: Any action that changes the shared data • Lookup: Any action that only queries the data • Multicast: A message sent to every group member

Example: Cloud-Hosted Service Standard Web-Services method invocation A B C D Some service A distributed request that updates group “state”... SafeSend SafeSend SafeSend ... and the response

Multicast properties • In the figure, “SafeSend” is a “multicast” • A message that can be sent to a whole group • What properties do these multicasts need to keep the group members consistent? • In Isis2 we focus on • Ordering properties: relative to group membership changes, and relative to other multicasts • Durability guarantees: what happens if a crash occurs?

Key idea: View ordering In Isis2 new View upcalls are synchronized relative to message delivery

Membership changes Group View is synchronized relative to multicasts • When a group gains or loses a member, the Isis2 Oracle sequences the new view relative to other multicasts. Thus any multicast is delivered in the same view, from the perspective of all recipients. • Also, if a multicast is sent to the group in some view, it reaches all members of the group (of course if some crash, they might not process the message) • State transfers occur after every multicast has been delivered in the prior view and before any are delivered in the new view

Message Ordering • The basic idea of Isis2 is to deliver all multicasts in the same order at all group members receiving them • This keeps the data consistent and allows you to implement “state machine” algorithms: group members perform any desired actions in the same state and in the same order • But we offer various implementations of multicast and if you use them very wisely, some are faster than others. The caveat is that the fast versions can only be used in certain situations, which we’ll discuss.

A multicast arrives in a group… • What information is “the same” for all recipients? • If they call g.GetView(), or remembered properties of the most recently delivered view, all see same view • Also, everyone got the message • And the requested ordering was enforced by Isis2 • What aspects might differ, for different receivers? • Each has its own “rank” in the membership list, obtained by calling v.GetMyRank() or v.GetRankOf(who)

What about failures? What if a failure happens just as a multicast is being sent?

Delayed delivery • In Isis2, a multicast send will often delay (in the platform) for a little while before delivery occurs • As a result, the sender does not know that the group view will be the same when the message is delivered This multicast might have been “sent” in the prior view when r, s and t weren’t yet members!

How can we know for sure? • Suppose the sender of a Query needs to know how many members processed the query, e.g. to notice that some reply is missing due to a failure. What can it do to know? • One option is to have the receivers include View information (such as how many members were in the View, what rank each replying member had) in the Reply() • The sender is also a receiver, so another approach is for the sender to wait for its own multicast or Query to be delivered and then make note of the View

How do we know who sent a message? • You can just include the sender’s Address in the arguments to the message • Cool Isis2 fact: • After you see a View notifying you that some member has failed or voluntarily left the group, you will never receive additional multicasts from that sender! • If a process leaves a group but then tries to send in it, Isis2 throws an exception in that sender.

No messages from the dead • In the Isis2 system, you never receive messages from the deceased • Isis2 watches for “late” messages that came from a process which is already considered to have died • It actively blocks such messages and won’t deliver them • Thus if you reconfigure after a failure, and reassign roles, you can’t get a kind of split-brain effect due to late delivery of a message

Ordering Properties A Everyone receives A first B • The most important form of message ordering is “total order” • Obtained by using g.OrderedSend or g.SafeSend • They both provide the same ordering guarantee. They have different durability properties • Everyone receives these in the same order. Everyone receives B second

Weaker ordering • Some applications want the lowest possible message latency • OrderedSend will usually achieve this best delay, but not always. (Slower case: when multiple group members are calling OrderedSend concurrently) • SafeSend uses a much slower approach. • For the very best speed, protocols guaranteed to be faster are available: Send and RawSend

A FIFO Ordering situation • Suppose one process sends all the multicasts that update some variable in a group. What ordering is really needed? • In this group, only the oldest living membersends multicasts • FIFO suffices! After p and q fail, r is the leader. It has rank 0 in the new view We say that p is the leader. It has rank 0 p q r s t Time: 0 10 20 30 40 50 60 70

A FIFO Ordering Situation • In this group we really only need to deliver messages in the order the leader sent them • For this purpose, the Send primitive is ideal • Send respects the FIFO order its sender used • Guaranteed to be extremely fast • RawSend: Send, but with no effort to guarantee reliability. Respects FIFO order… unless message is lost

What if two senders use Send? • When different senders use Send, the ordering will depend on when the messages showed up! • Different members might see different orderings • Example: r sees A B • … but p sees B A A B

When is FIFO good enough? • Suppose our group manages a collection of data items • Each item has its own leader and only the leader sends updates for that item • Consistency: It suiffices to apply updates in the order they were sent. g.Send() will do this! • But beware… • Multicasts from different senderscan interleave in unpredictable ways

When would you use RawSend? • This primitive doesn’t guarantee reliability • We use it when reporting data from real-time sensors • We want the data delivered in order (new data replaces older data). RawSend is still FIFO ordered • But if data is lost, there is no point “wasting time” in the platform retransmitting it.

What about Query ordering? • Each kind of multicast has an associated Query

CausalSend • Included mostly for academic reasons, but not used very often in Isis2 • Intended for situation in which the leader role moves around for each data item • First p is in charge, then q is the leader for a while, then r, then back to p… • CausalSend will respect the FIFO order “with moving leaders”. But we don’t recommend using it.

CausalSend picture: B is “after” A A B

Causality idea • If B “might have been caused by A”, then B is causally ordered after A (we write A  B) • CausalSend tracks these causality dependencies and makes sure that if A  B, then B will be delivered after A • But the Isis2 implementation of CausalSend is slow and this is why it isn’t used very often

Durability Exactly what happens in the event of a failure?

Durability • A durability guarantee is the property that information will survive a failure • There are several cases to think about • What if the sender of a multicast fails but someone received the multicast? • What if the sender and every receiver (so far) fails? • What if a whole group fails, but later restarts? • What if the group is managing a replicated database or files that aren’t even on the same computers?

Soft State in the Cloud • Many Isis2 applications run in cloud settings.. And the cloud favors “soft state” • After a node crashes, the entire VM is reloaded • Thus any local state (even local files) are restored to their original state! All local data vanishes • We say that a group manages “hard state” if the group members can fail and yet their state lives on • In the cloud a hard-state node costs more $$$

Two cases thus arise • Durability for soft-state scenarios • Here the entire state “lives in the group members” • They might have files, but the files won’t be preserved if those members crash and later restart, even on the same nodes. • Very common in today’s cloud • Durability for hard-state cases • Here the state really is outside the group

Multicast durability • Isis2 offers all-or-nothing delivery guarantees • Either every group member receives your multicast, or no group member receives it, even if the sender fails. As we saw, if a sender fails, its messages will be delivered before Isis2 reports the failure • But this statement didn’t explain what happens when a receiver crashes “instantly”

Two options: Optimistic/Pessimistic • Optimistic case (Send, CausalSend, OrderedSend): • Messages are delivered instantly on arrival (low delay) • But if the sender and all receivers with copies fail, an optimistic message is lost forever even though it might have been delivered to some processes right before they crashed • An optimistic protocol always looks like it was all-or-nothing, but if you could see the details, you might see that in fact, it was delivered, but then “forgotten”

Optimistic delivery A C B • Consider messages B and C • B was delivered to r,s and t. But it didn’t reach p and q because of a network failure. • C was delivered by p and q but never reached r,s,t • But notice that p and q both crashed • In a soft-state case, no evidence survived (unless they talked to someone outside the group – an external client, for example) • In effect, the surviving portion of the system is consistent

Optimistic delivery is fastest • We deliver messages as soon as they arrive • But the price of this speed (which is a big benefit) is that these two “bad cases” can arise. • Nobody can tell when these things happen, unless p or q talked to an external client • … which leads to the idea of g.Flush(k)

How does Flush(k) work? • g.Flush(n) pauses until n group members definitely have all the prior optimistic multicasts. • g.Flush() waits for all members, but this is slow • Normally n=2 or n=3 is fine… • By calling g.Flush(2) or g.Flush(3) before talking to an external client, we can be sure these bad cases will not occur!

With g.Flush(k)… • … those stray delivery events can still occur, but we know that no external observer notices them! • If g.Flush(3) is called prior to talking to the observer, then until there are 3 or more copies of the message, the Flush waits. • In our example the crash would have occurred while we were waiting for g.Flush() to finish • If a tree falls in a forest… If a message is delivered but every processthat saw it crashes, the effect is the sameas if the message wasn’t delivered!

When to call g.Flush(k) • Use this primitive • When working with optimistic multicast protocols like Send, OrderedSend • Call it prior to interacting with something outside of the group, like an external client who issued a request • With g.Flush after g.OrderedSend, we get the guarantee that the group won’t forget the update. Without g.Flush, an unlikely failure sequence could cause a problem (sender+first recipients all die).

Pessimistic Delivery • SafeSend is much more pessimistic • This protocol is a kind of 2-phase commit • Gives the message to recipients, and they hold it • (Two cases: In-memory logging, or on-disk logging) • When all have confirmed receipt, then delivery is authorized • No g.Flush(): it wouldn’t ever need to wait

Where’s the durable state? • SafeSend raises a question of where the state lives • For our optimistic protocols, state lives in the group • But Isis2 can also support two more cases • State lives in a checkpoint that will be reloaded if the whole group shuts down and restarts • State lives in a database or in files external to the group • SafeSend with disk logging aims at this second case

Should I always use SafeSend? • The SafeSend protocol is very costly and scales poorly, so it isn’t a great choice in the cloud • Also, using it correctly is a bit tricky • Better rule of thumb: use g.OrderedSend+g.Flush

Sidebar: Paxos family of protocols • Experts in this area will know about Leslie Lamport’s famous Paxos protocol (Wikipedia has a nice writeup) • It provides ordered, durable “actions” • These are often updates to a replicated database • SafeSend is the Isis2 name for Paxos • You don’t really need to learn about Paxos to understand how SafeSend works, but I’ll include some comments aimed at people who do know about Paxos in this lecture, simply because that work is so famous.

How Paxos works • Paxos is basically a kind of 2-phase commit • In the first phase a leader proposes some action (for us, a multicast) • A quorum of group members (the acceptors) need to vote in favor of the proposed ordering for the message, and they need to first save it in a durable place (usually a log that lives on the disk) • In the second phase, delivery occurs (in Paxos: the learners are informed about the new event)

Paxos has a notion similar to Flush(k) • In Paxos you can specify the number of “acceptors” that must have a copy of a message before it can be delivered. • In Isis2 this same parameter is available by means of a parameter you can set (g.SetSafeSendThreshold(k)) • SafeSend is a true implementation of Paxos if this number is more than half the group members. • With k smaller, like k=2 or k=3, but in a big group SafeSend starts to act exactly like g.OrderedSend()+g.Flush(k)

Isis2:Sendv.s. SafeSend Send scales best, but SafeSend with modern disks (RAM-like performance) and small numbers of acceptors isn’t terrible.

Jitter: how “steady” are latencies? The “spread” of latencies is muchbetter (tighter) with Send: the 2-phaseSafeSend protocol is sensitive to scheduling delays Variance from mean, 32-member case

Flush delay as function of shard size Flush is fairly fast if we only wait foracks from 3-5 members, but is slowif we wait for acks from all members.After we saw this graph, we changedIsis2to let users set the threshold.

Several ways to make data durable Putting our insights to work…

Checkpointing • Any group can be made durable using a checkpointing file • Call g.Persistent(filename) • Checkpoint will periodically be saved, or you can force the creation of checkpoints at times convenient to you • Entire group shares a single checkpoint file and it would normally live in the global file system. It should not live in any sort of soft-state file system! • On restart from a total shutdown, checkpoint is reloaded and the group recovers to its old state

External databases • If a group is being used to replicate something like a set of external mySQL databases, recovering the group state just isn’t good enough • We also need to make sure the mySQL replicas are in the identical states after a recovery • This is the case where we use SafeSend with the disklogging option enabled

What is the disklogger? • The disklogger is a special form of logged checkpoint, similar to the one used for g.Persistent() • But whereas normally there is just one durability log, this log is replicated with one copy per acceptor • Messages delivered by SafeSend are appended to this log during phase one • When an acceptor restarts, its log is scanned and “replayed”. Isis2 will garbage collect a message once all the learners have seen it

Ordering and dURABILITY IN Isis 2