Understanding Atomic Multicast and Virtual Synchrony in Distributed Systems

Fault Tolerance II CSE5306 Lecture Quiz due 7 April 2014

Atomic Multicast • We need to guarantee that… • (a) in the presence of process failures, • (b) a message is delivered to all processes or to none at all, and • (c) messages are delivered to all processes in the same order; • i.e., “atomic multicast.” • When a replica crashes (a above), it loses its group membership. (The group it left behind is still complete, so condition b above is satisfied.) • When it recovers, it must rejoin the group. (It must receive all of the messages that it missed, in the proper order, to satisfy condition c above.)

R U O K ? 1. What is an atomic multicast? • A multicast that keeps working in the presence of process failures. • It delivers each message to all or no processes. • It delivers all messages in the same order. • All of the above. • None of the above.

Virtual Synchrony • Receiving a message and delivering it are different (see above left). • If a group loses or gains a member (i.e., a “view change,” VC) at the same time it receives a message, then that message must not be delivered to anyone. (Atomic multicast prohibits both delivery to a nonmember and failure to deliver to a member.) • Purposefully deciding not to deliver to anyone (i.e., all members ignoring whatever fragment of the multicast they already have seen), on the occasion of a VC, does not make multicasting unreliable. • In fact, ignoring fragments makes multicasting “virtually synchronous” (above right). That is, it is equivalent to the message never having been sent. VCs must be delayed till a multicast is complete.

R U O K ? 2. What is virtual synchrony? • Clearly separating message reception in the operating system from message delivery in the application layer. • Purposely delaying message deliveries till the current view change (i.e., a group’s losing or gaining a member) is completed. • Purposefully delaying view changes till pending message deliveries are completed. • All of the above. • None of the above.

Message Ordering • Messages can be reliably and virtually synchronously multicast in four different orders: • Reliable unordered multicasts: messages are delivered in any order (see upper left). • Reliable FIFO-ordered multicasts: each sender’s messages are delivered in the order they were sent (upper right). • Reliable causally-ordered multicasts: timestamp-ordered deliveries from all senders, so that causally related messages are delivered in causal order. • Totally-ordered multicasts: all messages are delivered in the same order to all group members.
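
As a rough illustration of the FIFO-ordered case, the sketch below holds back messages that arrive ahead of sequence and delivers each sender's messages in send order. The class name `FifoReceiver` and the per-sender sequence numbers are assumptions made for this example; the lecture does not prescribe an implementation.

```python
from collections import defaultdict


class FifoReceiver:
    """Delivers each sender's messages in the order they were sent."""

    def __init__(self):
        self.next_seq = defaultdict(int)   # next expected sequence number per sender
        self.held = defaultdict(dict)      # sender -> {seq: payload} received early
        self.delivered = []                # messages handed to the application

    def receive(self, sender, seq, payload):
        """Called on message receipt; delivery may happen later."""
        self.held[sender][seq] = payload
        # Deliver as many consecutive messages from this sender as possible.
        while self.next_seq[sender] in self.held[sender]:
            msg = self.held[sender].pop(self.next_seq[sender])
            self.delivered.append((sender, msg))
            self.next_seq[sender] += 1


r = FifoReceiver()
r.receive("P1", 1, "m2")   # arrives early, so it is held back
r.receive("P2", 0, "n1")   # different sender, delivered at once
r.receive("P1", 0, "m1")   # fills the gap, so m1 and then m2 are delivered
print(r.delivered)
```

Note that messages from different senders may still interleave differently at different receivers; FIFO ordering constrains only each sender's own stream.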

R U O K ? 3. In what orders can messages be reliably and virtually synchronously multicast? • Unordered multicast—messages deliver in any order. • FIFO-ordered multicast—each sender’s messages deliver in the order they were sent. • Causally-ordered multicast—timestamp-ordered deliveries from all senders. • Totally-ordered multicast—all messages delivered in the same order to all group members. • All of the above.

Implementing Virtual Synchrony • Reliable point-to-point TCP messaging guarantees delivery to each individual group member, but not to all of them (e.g., the sender could fail halfway through the multicast). • Every member’s communication layer holds each message till it is “stable,” i.e., received by every member. Then all deliver together.
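
The hold-until-stable rule can be sketched as follows. This is a minimal model that assumes the communication layer somehow learns which members have received each message (for example, through acknowledgements); the class `HoldUntilStable` and its method names are hypothetical.

```python
class HoldUntilStable:
    """Holds each message until every member of the view has received it."""

    def __init__(self, view):
        self.view = frozenset(view)
        self.received_by = {}    # msg_id -> members known to have received it
        self.delivered = []      # msg_ids delivered to the application, in order

    def note_received(self, msg_id, member):
        """Record that `member` has received `msg_id`; deliver once it is stable."""
        holders = self.received_by.setdefault(msg_id, set())
        holders.add(member)
        if holders >= self.view and msg_id not in self.delivered:
            self.delivered.append(msg_id)

    def unstable(self):
        """Messages received by some members but not yet by all."""
        return sorted(m for m in self.received_by if m not in self.delivered)


layer = HoldUntilStable({"A", "B", "C"})
layer.note_received("m1", "A")
layer.note_received("m1", "B")
print(layer.unstable())    # m1 is still held back
layer.note_received("m1", "C")
print(layer.delivered)     # m1 is now stable and delivered
```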

R U O K ? 4. How can reliable virtual synchrony be assured, in the event of a sender failing halfway through a group’s message delivery? • Reliable TCP point-to-point messaging delivers to each group member, but not all. • Every member’s communication layer holds each message till it is “stable” (i.e., received by every member), then all deliver together. • Both of the above. • None of the above.

Implementing Virtual Synchrony (continued) What if a process fails in the middle of a multicast or in the middle of a view change? • Process 4 notices that process 7 has crashed and sends a view change. • Process 6 sends out all its unstable messages, followed by a flush message. • Process 6 installs the new view when it has received a flush message from everyone else.
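
The three steps above can be simulated in a few lines. The sketch below is a single-threaded toy, not a network protocol: `Member` and `change_view` are hypothetical names, message passing is modeled by direct list updates, and it assumes no further failures occur during the view change.

```python
class Member:
    def __init__(self, name):
        self.name = name
        self.unstable = []     # received, but not known to be received by all
        self.delivered = []
        self.flushes = set()   # names of members whose flush message has arrived
        self.view = None


def change_view(members, new_view):
    """Run one flush round among the surviving members listed in new_view."""
    survivors = [m for m in members if m.name in new_view]
    # Step 1: every survivor forwards its unstable messages to all others.
    for src in survivors:
        for msg in list(src.unstable):
            for dst in survivors:
                if msg not in dst.unstable and msg not in dst.delivered:
                    dst.unstable.append(msg)
    # Step 2: every survivor sends a flush message to all others.
    for src in survivors:
        for dst in survivors:
            if dst is not src:
                dst.flushes.add(src.name)
    # Step 3: with flushes from everyone else, deliver and install the view.
    for m in survivors:
        if m.flushes == set(new_view) - {m.name}:
            m.delivered.extend(m.unstable)
            m.unstable.clear()
            m.view = frozenset(new_view)


p4, p6, p7 = Member("P4"), Member("P6"), Member("P7")
p6.unstable.append("m")            # P7 crashed after reaching only P6
change_view([p4, p6, p7], {"P4", "P6"})
print(p4.delivered, p6.delivered)  # both survivors deliver m before the new view
```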

R U O K ? 5. What if a processor fails in the middle of a multicast or in the middle of a view change? • A functional process, which notices that another process has crashed, sends a view change to all. • A third process sends its unstable (i.e., partially sent) messages to all members, followed by a flush message. • That process installs the new view, after it receives a flush message from everyone else. • All of the above. • None of the above.

Distributed Commit • “Distributed commit” means that a transaction is completed by all members of a group, or by none at all. • In a one-phase commit, a coordinator tells all participants to simultaneously perform the transaction. • But what if one participant crashes and cannot tell the coordinator that it was unable to perform…?

R U O K ? 6. Which of the following accurately describes a one-phase commit? • A distributed transaction in which all members complete a transaction or none at all. • One participant crashes without telling the coordinator it was unable to perform. • A coordinator tells all participants to simultaneously perform a transaction. • All of the above. • None of the above.

Two-Phase Commit • “Two-phase commit” is a distributed transaction with a 2-way handshake: • Coordinator sends VOTE_REQUEST to all participants (above left). • Each participant replies with a VOTE_COMMIT or VOTE_ABORT (above center). • If the vote to commit was unanimous, coordinator sends GLOBAL_COMMIT, else she sends GLOBAL_ABORT. • Every participant either commits the transaction or aborts it as directed. • In general, all participants block waiting for messages, until time runs out, which aborts the transaction (above right). • But what if the coordinator crashes, after sending global commit to half of the members…?
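
The message flow above can be sketched as a small in-process simulation. The message names come from the slide; the `Participant` class, its `responds` flag (which models a crash or timeout), and the function `two_phase_commit` are assumptions made for illustration. Coordinator crashes are not modeled.

```python
VOTE_COMMIT, VOTE_ABORT = "VOTE_COMMIT", "VOTE_ABORT"
GLOBAL_COMMIT, GLOBAL_ABORT = "GLOBAL_COMMIT", "GLOBAL_ABORT"


class Participant:
    def __init__(self, name, can_commit=True, responds=True):
        self.name = name
        self.can_commit = can_commit
        self.responds = responds   # False models a crash or a timeout
        self.state = "INIT"

    def on_vote_request(self):
        if not self.responds:
            return None            # the coordinator sees no vote
        if self.can_commit:
            self.state = "READY"
            return VOTE_COMMIT
        self.state = "ABORT"
        return VOTE_ABORT

    def on_decision(self, decision):
        if self.responds:
            self.state = "COMMIT" if decision == GLOBAL_COMMIT else "ABORT"


def two_phase_commit(participants):
    # Phase 1: collect votes; a missing vote counts against committing.
    votes = [p.on_vote_request() for p in participants]
    unanimous = all(v == VOTE_COMMIT for v in votes)
    decision = GLOBAL_COMMIT if unanimous else GLOBAL_ABORT
    # Phase 2: announce the decision to everyone.
    for p in participants:
        p.on_decision(decision)
    return decision


print(two_phase_commit([Participant("A"), Participant("B")]))
print(two_phase_commit([Participant("A"), Participant("B", can_commit=False)]))
```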

R U O K ? 7. Which of the following accurately describe a two-phase commit? • Coordinator sends VOTE_REQUEST to all participants. • Each participant replies with a VOTE_COMMIT or VOTE_ABORT. • If the vote to commit was unanimous, coordinator sends GLOBAL_COMMIT, else she sends GLOBAL_ABORT. • Every participant either commits the transaction or aborts as directed. • All of the above. • None of the above.

Three-Phase Commit • “Three-phase commit” avoids blocking in fail-stop crashes: • Coordinator: VOTE_REQUEST. Participants: ACK. • Coordinator: PREPARE_COMMIT. Participants: ACK. • Coordinator: GLOBAL_COMMIT. • The states of the coordinator and each participant satisfy the following two conditions: • There is no single state from which a transition can be made directly to either a COMMIT or an ABORT state. • There is no state in which it is not possible to make a final decision, and from which a transition to a COMMIT state can be made.
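
The first condition can be checked mechanically against a participant's state graph. The sketch below assumes the usual textbook state names (INIT, READY, PRECOMMIT, COMMIT, ABORT), which the slide does not spell out; the transition tables and the function `states_violating_condition_one` are illustrative only.

```python
# Participant state graphs, written as state -> directly reachable states.
TWO_PHASE = {
    "INIT": ["READY", "ABORT"],
    "READY": ["COMMIT", "ABORT"],
    "COMMIT": [],
    "ABORT": [],
}
THREE_PHASE = {
    "INIT": ["READY", "ABORT"],
    "READY": ["PRECOMMIT", "ABORT"],
    "PRECOMMIT": ["COMMIT"],
    "COMMIT": [],
    "ABORT": [],
}


def states_violating_condition_one(transitions):
    """Return states with direct transitions to both COMMIT and ABORT."""
    return sorted(
        state
        for state, successors in transitions.items()
        if {"COMMIT", "ABORT"} <= set(successors)
    )


print(states_violating_condition_one(TWO_PHASE))    # READY is the problem state
print(states_violating_condition_one(THREE_PHASE))  # no such state remains
```

In the two-phase graph, a participant stuck in READY cannot tell which way the decision went; the extra PRECOMMIT state removes that ambiguity.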

R U O K ? 8. What is the major difference between the 2- and 3-phase commits? • The 2-phase protocol is vulnerable to coordinator failures. • A crashed 2-phase participant can recover to a COMMIT state, while all others remain in their READY states. • Both of the above. • None of the above.

Recovery • After it crashes, a process must recover…. • What does “recovery” mean? • And how is recovery achieved?

Introduction to Recovery • An “error” is that part of a system’s state that can lead to a failure, which must be prevented. • An error is corrected by… • backward recovery: • simply return to a previously correct state (checkpoint) and replay previously logged messages. • e.g., resending a lost packet. • A few checkpoints • forward recovery: • move from an anticipated error to a correct new state. • e.g., an error-correcting code infers the correct packet from existing ones.
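
Forward recovery can be illustrated with simple XOR parity, which rebuilds one lost packet from the packets that did arrive. This is a stand-in chosen for brevity (strictly an erasure code, not a full error-correcting code), and it assumes equal-length packets with at most one missing; the function names are hypothetical.

```python
from functools import reduce


def xor_bytes(a, b):
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_parity(packets):
    """Parity packet sent alongside the data packets."""
    return reduce(xor_bytes, packets)


def recover_missing(received, parity):
    """Rebuild the single packet marked None from the others plus parity."""
    present = [p for p in received if p is not None]
    return reduce(xor_bytes, present, parity)


packets = [b"abcd", b"efgh", b"ijkl"]
parity = make_parity(packets)
# The second packet is lost in transit; no retransmission is requested.
print(recover_missing([b"abcd", None, b"ijkl"], parity))
```

Backward recovery would instead ask the sender to resend the lost packet.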

R U O K ? 9. How can an error be corrected, before it causes a system failure? • By backward recovery; i.e., simply returning to a previously correct state (checkpoint) to replay previously logged messages. • By forward recovery; i.e., moving from an anticipated error to a correct new state, like an error-correcting code that infers a correct packet from existing data. • Either of the above. • All of the above.

Stable Storage • Information needed to recover from an error must be stored safely on a RAID-like disk drive (above left). • If a process crashes after updating sector ‘a’ but not its copy in the second platter, the recovery process will discover the difference and finish updating the second platter (above center). • If either platter’s sector spontaneously decays, it can be replaced with data from the other platter’s sector (above right).
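
A minimal in-memory model of the two-copy scheme is sketched below. Real stable storage works on disk sectors; here each "platter" is a dictionary, a CRC stands in for the drive's error detection, and the class `StableStorage` with its `crash_after_first` flag is an assumption made for illustration.

```python
import zlib


class StableStorage:
    def __init__(self):
        self.disks = [{}, {}]   # each maps sector -> (data, checksum)

    def write(self, sector, data, crash_after_first=False):
        record = (data, zlib.crc32(data))
        self.disks[0][sector] = record       # always update copy 1 first
        if crash_after_first:
            return                           # simulated crash before copy 2
        self.disks[1][sector] = record

    @staticmethod
    def _valid(record):
        return record is not None and zlib.crc32(record[0]) == record[1]

    def recover(self, sector):
        a = self.disks[0].get(sector)
        b = self.disks[1].get(sector)
        if self._valid(a) and (not self._valid(b) or a != b):
            self.disks[1][sector] = a        # finish the interrupted update
        elif self._valid(b) and not self._valid(a):
            self.disks[0][sector] = b        # repair a decayed first copy

    def read(self, sector):
        self.recover(sector)
        return self.disks[0][sector][0]


s = StableStorage()
s.write("a", b"v1")
s.write("a", b"v2", crash_after_first=True)   # copy 2 still holds v1
s.recover("a")
print(s.disks[1]["a"][0])                     # copy 2 now matches copy 1
```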

R U O K ? 10. How can stable storage be achieved, to facilitate backward recovery? • Store messages on an error-correcting RAID disk drive. • After a crash, compare the first and second copies, and correct any omissions in the second. • Replace any data that spontaneously decays with data from a second copy. • Any of the above. • All of the above.

Checkpointing • Fault-tolerant distributed systems regularly save consistent global states (“distributed snapshots”). • In a subsequent backward recovery, the affected process and its conversation partner return to their most recent concurrent correct states (see “recovery line” above; i.e., two checkpoint bars not separated by message arrows).

R U O K ? 11. What is checkpointing? • Fault-tolerant distributed systems regularly saving consistent global states (i.e., “distributed snapshots”). • In subsequent backward recoveries, the affected process and its conversation partners returning to their most recent concurrent correct states (i.e., “recovery line,” mutual state storage operations not separated by message deliveries). • Both of the above. • None of the above.

Independent Checkpointing • When processes checkpoint independently, many of the checkpoints of a recovering process and its conversation partner may be separated by messages, so the two may have to roll back a long way (i.e., the domino effect). • In the example above, P2 logged the receipt of message m, but P1 has no record of having sent it and cannot resend it.
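
The search for a recovery line, and the domino effect it can trigger, can be sketched as follows. The model is an assumption made for illustration: checkpoints and message events carry numeric timestamps, every process has an initial checkpoint at time 0, and a cut is rejected whenever it records a message's receipt but not its sending. The function name `recovery_line` is hypothetical.

```python
def recovery_line(checkpoints, messages):
    """Find the most recent consistent set of checkpoints.

    checkpoints: {process: sorted checkpoint times, starting with 0}
    messages:    list of (sender, send_time, receiver, receive_time)
    """
    line = {p: times[-1] for p, times in checkpoints.items()}
    changed = True
    while changed:
        changed = False
        for sender, sent, receiver, received in messages:
            # Receipt is inside the cut but the send is not: inconsistent.
            if received <= line[receiver] and sent > line[sender]:
                earlier = [t for t in checkpoints[receiver] if t < received]
                line[receiver] = max(earlier)   # roll the receiver back
                changed = True
    return line


cps = {"P1": [0, 4], "P2": [0, 5]}
msgs = [
    ("P1", 4.5, "P2", 4.8),   # sent after P1's checkpoint, received before P2's
    ("P2", 2.0, "P1", 3.0),   # becomes a problem once P2 has rolled back
]
print(recovery_line(cps, msgs))   # both processes fall back to time 0
```

With only the first message, P2 alone rolls back; adding the second message forces P1 back as well, which is the cascading behavior the slide calls the domino effect.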

R U O K ? 12. What is the domino effect? • Rolling back to a point in the distant past, when a mutual checkpoint was not marred by message traffic. • Finding the receipt of a message, but finding no record of who might have sent it. • Being unable to resend a checkpointed message. • All of the above. • None of the above.

Coordinated Checkpointing • All processes regularly synchronize to jointly write their states to local stable storage. • In their 2-phase blocking protocol, a coordinator multicasts a CHECKPOINT_REQUEST message. • All processes… • ACK the coordinator’s message. • Take a checkpoint and • Delay sending messages until… • Coordinator’s multicast CHECKPOINT_DONE message is received. • Incremental snapshot: • Coordinator multicasts a CHECKPOINT_REQUEST only to those to whom it has sent a message since its last checkpoint. • Processes receiving the CHECKPOINT_REQUEST forward it to those to whom they have sent a message since their last checkpoint, etc. • All send CHECKPOINT_DONE similarly to resume operations.
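
The two-phase blocking protocol can be modeled in a few lines. In this sketch the class `Process`, its fields, and the shared `network` list are hypothetical; only the message names CHECKPOINT_REQUEST and CHECKPOINT_DONE come from the slide, and the incremental variant is not covered.

```python
class Process:
    def __init__(self, name):
        self.name = name
        self.state = 0
        self.checkpoints = []   # stands in for local stable storage
        self.blocked = False
        self.queued = []        # application messages held back while blocked

    def on_checkpoint_request(self):
        """Phase 1: save local state, stop sending, acknowledge."""
        self.checkpoints.append(self.state)
        self.blocked = True
        return "ACK"

    def send(self, msg, network):
        target = self.queued if self.blocked else network
        target.append((self.name, msg))

    def on_checkpoint_done(self, network):
        """Phase 2: resume and release the held-back messages."""
        self.blocked = False
        network.extend(self.queued)
        self.queued.clear()


network = []
procs = [Process("P1"), Process("P2")]
procs[0].state = 7
acks = [p.on_checkpoint_request() for p in procs]   # CHECKPOINT_REQUEST
procs[0].send("hello", network)                      # held back, not sent
if all(a == "ACK" for a in acks):
    for p in procs:
        p.on_checkpoint_done(network)                # CHECKPOINT_DONE
print(network)
```

Because nothing is sent between the two phases, no message can cross the checkpoint, so the saved states form a consistent global state.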

R U O K ? 13. Which of the following accurately describe coordinated checkpointing’s incremental snapshot? • Coordinator multicasts a CHECKPOINT_REQUEST only to those whom it sent messages to, since its last checkpoint. • Processes receiving the CHECKPOINT_REQUEST forward it to those whom they sent messages to, since their last checkpoints, etc. • All affected processes send CHECKPOINT_DONE before resuming operations. • All of the above. • None of the above.

Message Logging • Message logging enables error recovery with a simple replay of all messages sent after the recovery line (i.e., last global checkpoint). • For message replay to work, processes must be piecewise deterministic (i.e., no random responses to received messages). • An “orphan process,” P, has a state inconsistent with the state of a recovered process, Q, because Q failed to log a received message (see above).
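
Replay after a crash can be sketched with a process whose state depends only on the messages it has delivered. The class `LoggedProcess` and its toy message format are assumptions for illustration; the in-memory `log` list stands in for stable storage.

```python
class LoggedProcess:
    def __init__(self):
        self.state = 0
        self.checkpoint = 0
        self.log = []   # messages delivered since the last checkpoint

    @staticmethod
    def _apply(state, msg):
        """Deterministic transition: the same state and message give the same result."""
        op, amount = msg
        return state + amount if op == "add" else state * amount

    def deliver(self, msg):
        self.log.append(msg)                  # log first, then act
        self.state = self._apply(self.state, msg)

    def take_checkpoint(self):
        self.checkpoint = self.state
        self.log.clear()

    def recover(self):
        """Restore the checkpoint and replay the logged messages in order."""
        state = self.checkpoint
        for msg in self.log:
            state = self._apply(state, msg)
        self.state = state


p = LoggedProcess()
p.deliver(("add", 5))
p.take_checkpoint()
p.deliver(("mul", 3))
p.deliver(("add", 1))
before = p.state
p.state = None        # simulated crash: volatile state is lost
p.recover()
print(p.state == before)
```

If `_apply` used a random number or the wall clock, the replay could diverge from the original run, which is why piecewise determinism is required.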

R U O K ? 14. Which of the following accurately describe message logging? • Message logging enables error recovery with a simple replay of all messages sent after the recovery line (i.e., last global checkpoint). • For message replay to work, processes must be piecewise deterministic (i.e., no random responses to received messages). • An “orphan process” has a state inconsistent with the state of a recovered process, because it failed to log a received message. • All of the above. • None of the above.

Characterizing Message Logging Schemes • All who received message m are classified as causally dependent, DEP(m), upon m. Those who receive messages from a DEP(m) also inherit the DEP(m) classification, and they can replay m if necessary. • Those that have copies of m, but have not yet logged them in stable storage, are classified as COPY(m). If they crash, they are unable to replay m. • An orphan process is dependent upon m, but cannot replay it, because every process in COPY(m) has crashed. To prevent orphans, every process that depends upon the delivery of m also must log m: • Pessimistic logging protocol (simple): for every unstable message, m, there must be at most one process in the DEP(m) class. • Optimistic logging protocol (complicated): if every COPY(m) crashes, roll every DEP(m) orphan back to before it became DEP(m).
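
The orphan condition can be written directly in terms of the two sets. In the sketch below, `find_orphans` is a hypothetical helper that takes DEP(m), COPY(m), and the set of crashed processes as plain Python sets.

```python
def find_orphans(dep, copy, crashed):
    """Return the surviving processes that depend on m but cannot get it replayed.

    dep:     processes causally dependent on message m, i.e. DEP(m)
    copy:    processes holding an unlogged copy of m, i.e. COPY(m)
    crashed: processes that have failed
    """
    if copy and copy <= crashed:     # every copy of m is gone
        return dep - crashed         # surviving dependents are orphans
    return set()


# P and Q hold the only copies of m; R depends on m through a later message.
print(find_orphans({"P", "Q", "R"}, {"P", "Q"}, {"P", "Q"}))   # R is orphaned
print(find_orphans({"P", "Q", "R"}, {"P", "Q"}, {"P"}))        # Q can still replay m
print(find_orphans({"P"}, {"P"}, {"P"}))                       # no survivor depends on m
```

The last call shows why limiting the dependents of an unstable message is safe: when only the receiver itself depends on m, its crash leaves no orphan behind.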

R U O K ? 15. How can orphan processes be prevented most easily? • Classify as “causally dependent” upon m, all of those who first received message m, as well as those who receive messages from a causally dependent group member. • For every unstable message, m, ensure that there always is at most one process in the DEP(m) class. • Classify all of those having copies of m, which are not yet logged in stable storage, as COPY(m). • All of the above. • None of the above.

Recovery-Oriented Computing • If the failure can be localized to a few processes, simply reboot them. • If the failure is pervasive, a whole server may need to be restarted, by rolling back to a recovery line and replaying messages. • Relaxing the computing environment (e.g., allocate larger buffers, zero memory before allocation, change message delivery order) can avoid errors without downtime and repairs.

R U O K ? 16. Which of the following accurately characterize recovery-oriented computing? • If a failure can be localized to a few processes, simply reboot them. • If the failure is pervasive, a whole server may need to be restarted, by rolling it back to a recovery line and replaying its messages. • Relaxing the computing environment (e.g., allocate larger buffers, zero memory before allocation, change message delivery order) can avoid errors without downtime and repairs. • All of the above. • None of the above.

Summary • Fault tolerance is masking failures and subsequent recoveries, by operating in the presence of failures. • Failure types are crash, omission, timing and arbitrary (Byzantine). • Fault-tolerant cooperating process groups achieve fault tolerance via redundancy. • Communications within groups must be reliable with respect to ordering and atomicity. • Atomicity requires that messages never cross membership-change boundaries. • Reliable group multicasting can be scaled by reducing feedback. • The popular 2-phase commit protocol enables group membership changes. A 3-phase protocol could solve the coordinator crash problem, but that problem seldom arises. • Combining performance-costly checkpointing with message logging enables crashed processes to replay messages simply and cheaply.

R U O K ? 17. Which of the following are types of failures? • Crash. • Omission. • Timing. • Arbitrary (Byzantine). • All of the above. • None of the above.
