Distributed Systems: Atomicity, Decision Making, Faults, Snapshots

Slides adapted from Ken's CS514 lectures

Announcements • Prelim II coming up next week: • In class, Thursday, November 20th, 10:10—11:25pm • 203 Thurston • Closed book, no calculators/PDAs/… • Bring ID • Topics: • Everything after first prelim • Lectures 14-22, chapters 10-15 (8th ed) • Review Session Tonight, November 18th, 6:30pm–7:30pm • Location: 315 Upson Hall

Review: What time is it? • In a distributed system we need practical ways to deal with time • E.g. we may need to agree that update A occurred before update B • Or offer a “lease” on a resource that expires at time 10:10.0150 • Or guarantee that a time-critical event will reach all interested parties within 100ms

Review: Event Ordering • Problem: distributed systems do not share a clock • Many coordination problems would be simplified if they did (“first one wins”) • Distributed systems do have some sense of time • Events in a single process happen in order • Messages between processes must be sent before they can be received • How helpful is this?

Review: Happens-before • Define a Happens-before relation (denoted by →). • 1) If A and B are events in the same process, and A was executed before B, then A → B. • 2) If A is the event of sending a message by one process and B is the event of receiving that message by another process, then A → B. • 3) If A → B and B → C then A → C.

Review: Total ordering? • Happens-before gives a partial ordering of events • We still do not have a total ordering of events • We are not able to order events that happen concurrently • A and B are concurrent if (not A → B) and (not B → A)

Review: Partial Ordering • Within each process: Pi → Pi+1; Qi → Qi+1; Ri → Ri+1 • Across processes (messages): R0 → Q4; Q3 → R4; Q1 → P4; P1 → Q2

Review: Total Ordering? • P0, P1, Q0, Q1, Q2, P2, P3, P4, Q3, R0, Q4, R1, R2, R3, R4 • P0, Q0, Q1, P1, Q2, P2, P3, P4, Q3, R0, Q4, R1, R2, R3, R4 • P0, Q0, P1, Q1, Q2, P2, P3, P4, Q3, R0, Q4, R1, R2, R3, R4

Review: Timestamps • Assume each process has a local logical clock and that the processes are numbered • Clocks tick once per event (including message send) • When you send a message, include your clock value • When you receive a message, set your clock to MAX(your clock, timestamp of message + 1) • Thus sending comes before receiving • The only visibility into actions at other nodes comes through communication; communicating synchronizes the clocks • If the timestamps of two events A and B are the same, then use the process identity numbers to break ties. • This gives a total ordering!
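The timestamp rules above can be sketched as a small logical clock. This is a minimal illustration, not code from the lecture: the class name `LamportClock` and its methods are hypothetical, and the receive is treated as an event of its own, so the clock always ends up strictly past the message timestamp.

```python
class LamportClock:
    """Logical clock for one process; `pid` breaks ties between equal times."""

    def __init__(self, pid):
        self.pid = pid
        self.time = 0

    def tick(self):
        """Local event: advance the clock by one and return its timestamp."""
        self.time += 1
        return (self.time, self.pid)

    def send(self):
        """A send is an event too; the returned timestamp travels with the message."""
        return self.tick()

    def receive(self, msg_timestamp):
        """Move past both our own clock and the sender's timestamp."""
        msg_time, _sender = msg_timestamp
        self.time = max(self.time, msg_time) + 1
        return (self.time, self.pid)


p, q = LamportClock(pid=0), LamportClock(pid=1)
a = p.tick()      # local event at p
b = q.tick()      # concurrent local event at q, same clock value as a
m = p.send()      # p sends a message
c = q.receive(m)  # q receives it, so m is ordered before c

# Tuples compare by (time, pid), which is exactly the tie-breaking rule.
total_order = sorted([c, m, b, a])
```

Comparing `(time, pid)` pairs orders the concurrent events `a` and `b` by process number, while the send `m` always sorts before the matching receive `c`.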

Review: Distributed Mutual Exclusion • Want mutual exclusion in distributed setting • The system consists of n processes; each process Pi resides at a different processor • Each process has a critical section that requires mutual exclusion • Problem: Cannot use atomic testAndSet primitive since memory is not shared and processes may be on physically separated nodes • Requirement • If Pi is executing in its critical section, then no other process Pj is executing in its critical section • Compare three solutions • Centralized Distributed Mutual Exclusion (CDME) • Fully Distributed Mutual Exclusion (DDME) • Token passing

Today • Atomicity and Distributed Decision Making • Faults in distributed systems • What time is it now? • Synchronized clocks • What does the entire system look like at this moment?

Atomicity • Recall: • Atomicity = either all the operations associated with a program unit are executed to completion, or none are performed. • In a distributed system we may have multiple copies of the data • (e.g. replicas are good for reliability/availability) • PROBLEM: How do we atomically update all of the copies? • That is, either all replicas reflect a change or none do

E.g. Two-Phase Commit • Goal: Update all replicas atomically • Either everyone commits or everyone aborts • No inconsistencies even in face of failures • Caveat: Assume only crash or fail-stop failures • Crash: servers stop when they fail – do not continue and generate bad data • Fail-stop: in addition to crash, fail-stop failure is detectable. • Definitions • Coordinator: Software entity that shepherds the process (in our example could be one of the servers) • Ready to commit: side effects of update safely stored on non-volatile storage • Even if I crash, once I say I am ready to commit then a recovery procedure will find evidence and continue with commit protocol

Two Phase Commit: Phase 1 • Coordinator sends a PREPARE message to each replica • Coordinator waits for all replicas to reply with a vote • Each participant replies with a vote • Votes PREPARED if ready to commit and locks data items being updated • Votes NO if unable to get a lock or unable to ensure ready to commit

Two Phase Commit: Phase 2 • If coordinator receives a PREPARED vote from all replicas then it may decide to commit or abort; if any replica votes NO (or a vote times out), it must abort • Coordinator sends its decision to all participants • If participant receives COMMIT decision then it commits the changes resulting from the update • If participant receives ABORT decision then it discards the changes resulting from the update • Participant replies DONE • When coordinator has received DONE from all participants, it can delete its record of the outcome
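The two phases can be sketched as an in-memory simulation. This is a sketch under strong assumptions: no network, no crashes, no timeouts, and all names (`Replica`, `two_phase_commit`, `can_prepare`) are hypothetical rather than taken from the lecture.

```python
from enum import Enum


class Vote(Enum):
    PREPARED = "PREPARED"
    NO = "NO"


class Decision(Enum):
    COMMIT = "COMMIT"
    ABORT = "ABORT"


class Replica:
    """Toy participant; `can_prepare=False` models a failed lock acquisition."""

    def __init__(self, name, can_prepare=True):
        self.name = name
        self.can_prepare = can_prepare
        self.value = None      # committed state
        self.pending = None    # update held while PREPARED
        self.log = []          # stands in for non-volatile storage

    def prepare(self, update):
        """Phase 1: vote PREPARED only after recording the update."""
        if not self.can_prepare:
            return Vote.NO
        self.pending = update
        self.log.append("ready")
        return Vote.PREPARED

    def decide(self, decision):
        """Phase 2: apply or discard the pending update, then acknowledge."""
        if decision is Decision.COMMIT:
            self.value = self.pending
        self.pending = None
        self.log.append(decision.value.lower())
        return "DONE"


def two_phase_commit(replicas, update):
    """Coordinator: commit only if every replica votes PREPARED."""
    votes = [r.prepare(update) for r in replicas]                  # phase 1
    unanimous = all(v is Vote.PREPARED for v in votes)
    decision = Decision.COMMIT if unanimous else Decision.ABORT
    acks = [r.decide(decision) for r in replicas]                  # phase 2
    return decision, acks


ok_group = [Replica("a"), Replica("b"), Replica("c")]
ok_decision, _ = two_phase_commit(ok_group, update=42)

bad_group = [Replica("a"), Replica("b", can_prepare=False), Replica("c")]
bad_decision, _ = two_phase_commit(bad_group, update=42)
```

In the first run every replica ends with the new value; in the second, a single NO vote leaves every replica unchanged, which is the all-or-nothing property the protocol is after.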

Performance • In absence of failure, 2PC (two-phase commit) makes a total of 2 (1.5?) round trips of messages before decision is made • Prepare • Vote NO or PREPARED • Commit/abort • Done (but done just for bookkeeping, does not affect response time)

Failure Handling in 2PC – Replica Failure • When a failed site recovers, it examines its log to determine the fate of transaction T: • The log contains a <commit T> record. • In this case, the site executes redo(T). • The log contains an <abort T> record. • In this case, the site executes undo(T). • The log contains a <ready T> record • In this case consult coordinator Ci. • If Ci is down, site sends a <query-status T> message to the other sites. • The log contains no control records concerning T. • In this case, the site executes undo(T).
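The four log cases can be written as one lookup. This is an illustrative sketch only: the function name `recovery_action`, the `(kind, txn)` log representation, and the returned strings are assumptions, not an API from the lecture.

```python
def recovery_action(log_records, txn):
    """Return what a recovering site does for `txn`, given its log.

    `log_records` is a list of (kind, txn) pairs such as ("ready", "T").
    """
    kinds = {kind for kind, t in log_records if t == txn}
    if "commit" in kinds:
        return "redo"               # outcome known: reapply the update
    if "abort" in kinds:
        return "undo"               # outcome known: discard the update
    if "ready" in kinds:
        return "ask-coordinator"    # voted PREPARED but never saw a decision
    return "undo"                   # never voted, so T cannot have committed


log = [("ready", "T1"), ("commit", "T1"), ("ready", "T2")]
actions = {t: recovery_action(log, t) for t in ("T1", "T2", "T3")}
```

The order of the checks matters: a `<ready T>` record is only inconclusive when no `<commit T>` or `<abort T>` record follows it.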

Failure Handling in 2PC – Coordinator Ci Failure • If an active site contains a <commit T> record in its log, then T must be committed. • If an active site contains an <abort T> record in its log, then T must be aborted. • If some active site does not contain the record <ready T> in its log then the failed coordinator Ci cannot have decided to commit T. Rather than wait for Ci to recover, it is preferable to abort T. • All active sites have a <ready T> record in their logs, but no additional control records. In this case we must wait for the coordinator to recover. • Blocking problem – T is blocked pending the recovery of site Si.
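The rules the surviving sites apply when the coordinator is down can be sketched as a single function. The name `resolve_without_coordinator` and the representation of each site's log as a set of record kinds for T are hypothetical choices made for this illustration.

```python
def resolve_without_coordinator(active_site_logs):
    """Decide the fate of T from the logs of the sites that are still up.

    Each element is the set of control records one active site holds for T,
    e.g. {"ready"} or {"ready", "commit"}.
    """
    if any("commit" in log for log in active_site_logs):
        return "commit"     # someone saw the decision: T must be committed
    if any("abort" in log for log in active_site_logs):
        return "abort"      # someone saw the decision: T must be aborted
    if any("ready" not in log for log in active_site_logs):
        return "abort"      # coordinator cannot have collected all votes
    return "blocked"        # everyone is ready, nobody knows the outcome


saw_commit = resolve_without_coordinator([{"ready"}, {"ready", "commit"}])
missing_vote = resolve_without_coordinator([{"ready"}, set()])
all_ready = resolve_without_coordinator([{"ready"}, {"ready"}])
```

The last case is the blocking problem: every active site has promised to commit if asked, so none of them can safely abort on its own.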

Failure Handling • Failure detected with timeouts • If a participant times out before getting a PREPARE, it can abort • If the coordinator times out waiting for a vote, it can abort • If a participant times out waiting for a decision, it is blocked! • Wait for coordinator to recover? • Punt to some other resolution protocol • If the coordinator times out waiting for DONE, it keeps its record of the outcome • Other sites may have a replica.

Failures in distributed systems • We may want to avoid relying on a single server/coordinator/boss to make progress • Thus want the decision making to be distributed among the participants (“all nodes created equal”) => the “consensus problem” in distributed systems. • However depending on what we can assume about the network, it may be impossible to reach a decision in some cases!

Impossibility of Consensus • Network characteristics: • Synchronous - some upper bound on network/processing delay. • Asynchronous - no upper bound on network/processing delay. • Fischer, Lynch and Paterson showed: • With even just one failure possible, you cannot guarantee consensus. • Cannot guarantee consensus process will terminate • Assumes asynchronous network • Essence of proof: Just before a decision is reached, we can delay a node slightly too long to reach a decision. • But we still want to do it... Right?

Distributed Decision Making Discussion • Why is distributed decision making desirable? • Fault Tolerance! Also, atomicity in distributed system. • A group of machines can come to a decision even if one or more of them fail during the process • After decision made, result recorded in multiple places • Undesirable if algorithm is blocking (e.g. two-phase commit) • One machine can be stalled until another site recovers: • A blocked site holds resources (locks on updated items, pages pinned in memory, etc) until it learns the fate of the update • To reduce blocking • Add more rounds (e.g. three-phase commit) • Add more replicas than needed (e.g. quorums) • What happens if one or more of the nodes is malicious? • Malicious: attempting to compromise the decision making • Tolerating this is known as Byzantine fault tolerance. More on this next time

Faults

Categories of failures • Crash faults, message loss • These are common in real systems • Crash failures: process simply stops, and does nothing wrong that would be externally visible before it stops • These faults can’t be directly detected

Categories of failures • Fail-stop failures • These require system support • Idea is that the process fails by crashing, and the system notifies anyone who was talking to it • With fail-stop failures we can overcome message loss by just resending packets, which must be uniquely numbered • Easy to work with… but rarely supported

Categories of failures • Non-malicious Byzantine failures • This is the best way to understand many kinds of corruption and buggy behaviors • Program can do pretty much anything, including sending corrupted messages • But it doesn’t do so with the intention of screwing up our protocols • Unfortunately, a pretty common mode of failure

Categories of failures • Malicious, true Byzantine, failures • Model is of an attacker who has studied the system and wants to break it • She can corrupt or replay messages, intercept them at will, compromise programs and substitute hacked versions • This is a worst-case scenario mindset • In practice, doesn’t actually happen • Very costly to defend against; typically used in very limited ways (e.g. key management server)

Models of failure • Question here concerns how failures appear in formal models used when proving things about protocols • Think back to Lamport’s happens-before relationship, → • Model already has processes, messages, temporal ordering • Assumes messages are reliably delivered

Two kinds of models • We tend to work within two models • Asynchronous model makes no assumptions about time • Lamport’s model is a good fit • Processes have no clocks, will wait indefinitely for messages, could run arbitrarily fast/slow • Distributed computing at an “eons” timescale • Synchronous model assumes a lock-step execution in which processes share a clock

Adding failures in Lamport’s model • Also called the asynchronous model • Normally we just assume that a failed process “crashes”: it stops doing anything • Notice that in this model, a failed process is indistinguishable from a delayed process • In fact, the decision that something has failed takes on an arbitrary flavor • “Suppose that at point e in its execution, process p decides to treat q as faulty…”

What about the synchronous model? • Here, we also have processes and messages • But communication is usually assumed to be reliable: any message sent at time t is delivered by time t+δ • Algorithms are often structured into rounds, each lasting some fixed amount of time Δ, giving time for each process to communicate with every other process • In this model, a crash failure is easily detected • When people have considered malicious failures, they often used this model

Neither model is realistic • Value of the asynchronous model is that it is so stripped down and simple • If we can do something “well” in this model we can do at least as well in the real world • So we’ll want “best” solutions • Value of the synchronous model is that it adds a lot of “unrealistic” mechanism • If we can’t solve a problem with all this help, we probably can’t solve it in a more realistic setting! • So seek impossibility results

Fischer, Lynch and Paterson • Impossibility of Consensus • A surprising result • “Impossibility of Distributed Consensus with One Faulty Process” • They prove that no asynchronous algorithm for agreeing on a one-bit value can guarantee that it will terminate in the presence of crash faults • And this is true even if no crash actually occurs! • Proof constructs infinite non-terminating runs • Essence of proof: Just before a decision is reached, we can delay a node slightly too long to reach a decision.

Tougher failure models • We’ve focused on crash failures • In the synchronous model these look like a “farewell cruel world” message • Some call it the “failstop model”. A faulty process is viewed as first saying goodbye, then crashing • What about tougher kinds of failures? • Corrupted messages • Processes that don’t follow the algorithm • Malicious processes out to cause havoc?

Here the situation is much harder • Generally we need at least 3f+1 processes in a system to tolerate f Byzantine failures • For example, to tolerate 1 failure we need 4 or more processes • We also need f+1 “rounds” • Let’s see why this happens
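The bounds on this slide are easy to check numerically. The helper names below (`min_processes`, `rounds_needed`, `max_byzantine_faults`) are hypothetical and only restate the 3f+1 and f+1 formulas.

```python
def min_processes(f):
    """Smallest system size that tolerates f Byzantine failures (3f+1)."""
    return 3 * f + 1


def rounds_needed(f):
    """Number of rounds needed to tolerate f Byzantine failures (f+1)."""
    return f + 1


def max_byzantine_faults(n):
    """Largest f with n >= 3f+1, i.e. how many traitors n processes survive."""
    return (n - 1) // 3


one_fault = (min_processes(1), rounds_needed(1))   # 4 processes, 2 rounds
tolerated_by_5 = max_byzantine_faults(5)           # five processes tolerate 1
tolerated_by_3 = max_byzantine_faults(3)           # three tolerate none
```

Note that going from four to five or six processes does not buy any extra tolerance; the next step up, f = 2, needs seven.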

Byzantine Generals scenario • Generals (N of them) surround a city • They communicate by courier • Each has an opinion: “attack” or “wait” • In fact, an attack would succeed: the city will fall. • Waiting will succeed too: the city will surrender. • But if some attack and some wait, disaster ensues • Some Generals (f of them) are traitors… it doesn’t matter if they attack or wait, but we must prevent them from disrupting the battle • Traitor can’t forge messages from other Generals

Byzantine Generals scenario • [Figure: generals surrounding the city call out conflicting orders: “Attack!”, “No, wait! Surrender!”, “Wait…”, “Attack!”, “Attack!”, “Wait…”]

A timeline perspective • Suppose that p and q favor attack, r is a traitor and s and t favor waiting… assume that in a tie vote, we attack • [Figure: timelines for generals p, q, r, s, t]

A timeline perspective • After first round collected votes are: • {attack, attack, wait, wait, traitor’s-vote} • [Figure: timelines for generals p, q, r, s, t]

What can the traitor do? • Add a legitimate vote of “attack” • Anyone with 3 votes to attack knows the outcome • Add a legitimate vote of “wait” • Vote now favors “wait” • Or send different votes to different folks • Or don’t send a vote, at all, to some

Outcomes? • Traitor simply votes: • Either all see {a,a,a,w,w} • Or all see {a,a,w,w,w} • Traitor double-votes • Some see {a,a,a,w,w} and some {a,a,w,w,w} • Traitor withholds some vote(s) • Some see {a,a,w,w}, perhaps others see {a,a,a,w,w} and still others see {a,a,w,w,w} • Notice that traitor can’t manipulate votes of loyal Generals!
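These first-round outcomes can be enumerated with a short simulation of the five-general example (p and q favor attack, s and t favor waiting, r is the traitor, ties mean attack). The function names `tally` and `first_round_views` are hypothetical, introduced only for this sketch.

```python
from collections import Counter

# Loyal generals and their true opinions, as in the timeline example.
LOYAL_VOTES = {"p": "attack", "q": "attack", "s": "wait", "t": "wait"}


def tally(votes):
    """Majority vote; a tie is resolved in favor of attacking."""
    counts = Counter(votes)
    return "attack" if counts["attack"] >= counts["wait"] else "wait"


def first_round_views(traitor_messages):
    """What each loyal general would conclude after round 1.

    `traitor_messages` maps a loyal general to the vote the traitor r
    sent them; a missing entry or None means the vote was withheld.
    """
    views = {}
    for general in LOYAL_VOTES:
        seen = list(LOYAL_VOTES.values())   # loyal votes cannot be altered
        vote = traitor_messages.get(general)
        if vote is not None:
            seen.append(vote)
        views[general] = tally(seen)
    return views


simple = first_round_views({g: "wait" for g in LOYAL_VOTES})
double = first_round_views({"p": "attack", "q": "attack", "s": "wait", "t": "wait"})
withheld = first_round_views({"s": "wait"})
```

When the traitor votes consistently, every loyal general reaches the same conclusion. Double-voting or selective withholding splits them, which is why a single round is not enough.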

What can we do? • Clearly we can’t decide yet; some loyal Generals might have contradictory data • Anyone with 4 votes can “decide” • But with 3 votes to “wait” or “attack,” a General isn’t sure (one could be a traitor…) • So: in round 2, each sends out “witness” messages: here’s what I saw in round 1 • General Smith sent me: “attack (signed) Smith”

Digital signatures • These require a cryptographic system • For example, RSA • Each player has a secret (private) key K⁻¹ and a public key K. • She can publish her public key • RSA gives us a single “encrypt” function: • Encrypt(Encrypt(M, K), K⁻¹) = Encrypt(Encrypt(M, K⁻¹), K) = M • Encrypt a hash of the message to “sign” it

With such a system • A can send a message to B that only A could have sent • A just encrypts the body with her private key • … or one that only B can read • A encrypts it with B’s public key • Or can sign it as proof she sent it • B can recompute the signature and decrypt A’s hashed signature to see if they match • These capabilities limit what our traitor can do: he can’t forge or modify a message
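A toy version of these operations fits in a few lines. This is an insecure, textbook-sized sketch: the primes 61 and 53, the exponents 17 and 2753, the SHA-256 hash reduced modulo N, and the names `encrypt`, `sign`, `verify` are all assumptions made for illustration, and real RSA uses much larger keys and padding.

```python
import hashlib

# Toy key pair: N = 61 * 53, and 17 * 2753 = 1 (mod 60 * 52).
N = 61 * 53
PUBLIC_KEY = (17, N)       # K, published
PRIVATE_KEY = (2753, N)    # K^-1, kept secret


def encrypt(m, key):
    """The single RSA function: modular exponentiation with either key."""
    exponent, modulus = key
    return pow(m, exponent, modulus)


def digest(message):
    """Hash the message, reduced into the toy modulus range."""
    return int.from_bytes(hashlib.sha256(message).digest(), "big") % N


def sign(message):
    """Sign by encrypting the hash with the private key."""
    return encrypt(digest(message), PRIVATE_KEY)


def verify(message, signature):
    """Recompute the hash and compare it with the decrypted signature."""
    return encrypt(signature, PUBLIC_KEY) == digest(message)


m = 65
round_trip_a = encrypt(encrypt(m, PUBLIC_KEY), PRIVATE_KEY)   # only B can read
round_trip_b = encrypt(encrypt(m, PRIVATE_KEY), PUBLIC_KEY)   # only A could send
signature = sign(b"attack (signed) Smith")
valid = verify(b"attack (signed) Smith", signature)
```

Both round trips return the original value, matching the identity on the previous slide. With a modulus this small, a forged or altered message has a non-negligible chance of passing `verify`, which is one reason real keys are thousands of bits long.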

A timeline perspective • In second round if the traitor didn’t behave identically for all Generals, we can weed out his faulty votes • [Figure: timelines for generals p, q, r, s, t]

A timeline perspective • We attack! • [Figure: timelines for p, q, r, s, t; loyal generals p, q, s and t each conclude “Attack!!”, while the traitor r says “Damn! They’re on to me”]

Traitor is stymied • Our loyal generals can deduce that the decision was to attack • Traitor can’t disrupt this… • Either forced to vote legitimately, or is caught • But costs were steep! • (f+1)·n² messages! • Rounds can also be slow… • “Early stopping” protocols: min(t+2, f+1) rounds; t is true number of faults
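To get a feel for these costs, the two formulas can be evaluated directly. The function names `message_cost` and `early_stopping_rounds` are hypothetical; they only restate the expressions on the slide.

```python
def message_cost(n, f):
    """Messages sent by the basic protocol: (f+1) rounds of n^2 messages."""
    return (f + 1) * n ** 2


def early_stopping_rounds(t, f):
    """Rounds used by an early-stopping protocol when t faults really occur."""
    return min(t + 2, f + 1)


five_generals = message_cost(n=5, f=1)        # the p, q, r, s, t example
larger_system = message_cost(n=10, f=3)
no_real_faults = early_stopping_rounds(t=0, f=3)
worst_case = early_stopping_rounds(t=3, f=3)
```

For the five generals with one traitor this comes to 50 messages. Early stopping helps only when the system is provisioned for more faults than actually occur: with f = 3 and no real faults it finishes in 2 rounds instead of 4.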

Distributed Snapshots

Introducing “wall clock time” • Back to the notion of time… • Distributed systems sometimes need a more precise notion of time than happens-before • There are several options • Instead of network/process identity to break ties… • “Extend” a logical clock with the clock time and use it to break ties • Makes meaningful statements like “B and D were concurrent, although B occurred first” • But unless clocks are closely synchronized such statements could be erroneous! • We use a clock synchronization algorithm to reconcile differences between clocks on various computers in the network

Synchronizing clocks • Without help, clocks will often differ by many milliseconds • Problem is that when a machine downloads time from a network clock it can’t be sure what the delay was • This is because the “uplink” and “downlink” delays are often very different in a network • Outright failures of clocks are rare…
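One common way to work around the unknown delay is to measure the round trip and assume the reply spent half of it in flight, in the style of Cristian's algorithm. The slides do not name a specific algorithm, so this is a swapped-in illustration; `estimate_offset` and its parameters are hypothetical names, and times are in milliseconds.

```python
def estimate_offset(t_send, server_time, t_receive):
    """Estimate how far the local clock is behind the server clock.

    t_send / t_receive: local clock readings when the request left and
    when the reply arrived. server_time: the time the server reported.
    Returns (offset, max_error), both in the same units as the inputs.
    """
    round_trip = t_receive - t_send
    # Assumption: uplink and downlink delays are equal.
    estimated_server_now = server_time + round_trip / 2
    offset = estimated_server_now - t_receive
    # If the delays are in fact lopsided, the estimate can be off by up
    # to half the round trip in either direction.
    max_error = round_trip / 2
    return offset, max_error


offset, max_error = estimate_offset(t_send=1000, server_time=6010, t_receive=1020)
```

The error bound is the point of the slide: with asymmetric uplink and downlink delays the midpoint guess can be wrong by as much as half the round trip, so a shorter round trip gives a tighter estimate.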
