
Reliable Distributed Systems

Presentation Transcript


  1. Reliable Distributed Systems Logical Clocks

  2. Time and Ordering • We tend to casually use temporal concepts • Example: “membership changes dynamically” • Implies a notion of time: first membership was X, later membership was Y • Challenge: relating local notion of time in a single process to a global notion of time • Will discuss this issue before developing multicast delivery ordering options in more detail

  3. Time in Distributed Systems • Three notions of time: • Time seen by external observer. A global clock of perfect accuracy • Time seen on clocks of individual processes. Each has its own clock, and clocks may drift out of sync. • Logical notion of time: event a occurs before event b and this is detectable because information about a may have reached b.

  4. External Time • The “gold standard” against which many protocols are defined • Not implementable: no system can avoid uncertain details that limit temporal precision! • Use of external time is also risky: many protocols that seek to provide properties defined by external observers are extremely costly and, sometimes, are unable to cope with failures

  5. Time seen on internal clocks • Most workstations have reasonable clocks • Clock synchronization is the big problem (will visit topic later in course): clocks can drift apart and resynchronization, in software, is inaccurate • Unpredictable speeds a feature of all computing systems, hence can’t predict how long events will take (e.g. how long it will take to send a message and be sure it was delivered to the destination)

  6. Logical notion of time • Has no clock in the sense of “real-time” • Focus is on definition of the “happens before” relationship: “a happens before b” if: • both occur at same place and a finished before b started, or • a is the send of message m, b is the delivery of m, or • a and b are linked by a chain of such events

  7. Logical time as a time-space picture • [Time-space diagram across processes p0–p3 showing events a, b, c, d] • a, b are concurrent • c happens after a, b • d happens after a, b, c

  8. Notation • Use “arrow” to represent the happens-before relation • For previous slide: • a → c, b → c, c → d • hence, a → d, b → d • a, b are concurrent • Also called the “potential causality” relation

  9. Logical clocks • Proposed by Lamport to represent causal order • Write: LT(e) to denote the logical timestamp of an event e, LT(m) for a timestamp on a message, LT(p) for the timestamp associated with process p • The algorithm ensures that if a → b, then LT(a) < LT(b)

  10. Algorithm • Each process maintains a counter, LT(p) • For each event other than message delivery: set LT(p) = LT(p)+1 • When sending message m, LT(m) = LT(p) • When delivering message m to process q, set LT(q) = max(LT(m), LT(q))+1
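The three rules above can be sketched in a few lines of Python (an illustrative sketch; the class and method names are our own, not from the slides):

```python
class LamportClock:
    """Per-process logical clock implementing the slide's three rules."""

    def __init__(self):
        self.time = 0  # LT(p)

    def local_event(self):
        # Rule 1: any event other than a message delivery increments the counter
        self.time += 1
        return self.time

    def send(self):
        # Sending is itself a local event; the message carries LT(m) = LT(p)
        self.time += 1
        return self.time

    def deliver(self, msg_time):
        # Rule 3: on delivery, set LT(q) = max(LT(m), LT(q)) + 1
        self.time = max(msg_time, self.time) + 1
        return self.time
```

If p sends a message with timestamp 1 and q (still at 0) delivers it, q jumps to 2, so LT(send) < LT(deliver) as required.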

  11. Illustration of logical timestamps • [Time-space diagram across processes p0–p3 with events a–g and timestamps LT(a)=1, LT(b)=1, LT(c)=1, LT(d)=2, LT(e)=3, LT(f)=4]

  12. Concurrent events • If a, b are concurrent, LT(a) and LT(b) may have arbitrary values! • Thus, logical time lets us determine that a potentially happened before b, but not that a definitely did so! • Example: processes p and q never communicate. Both will have events 1, 2, … but even if LT(e) < LT(e’), e may not have happened before e’

  13. What about “real-time” clocks? • Accuracy of clock synchronization is ultimately limited by uncertainty in communication latencies • These latencies are “large” compared with speed of modern processors (typical latency may be 35us to 500us, time for thousands of instructions) • Limits use of real-time clocks to “coarse-grained” applications

  14. Interpretations of temporal terms • Understand now that “a happens before b” means that information can flow from a to b • Understand that “a is concurrent with b” means that there is no information flow between a and b • What about the notion of an “instant in time”, over a set of processes?

  15. Chandy and Lamport: Consistent cuts • Draw a line across a set of processes • Line cuts each execution • Consistent cut has property that the set of included events is closed under happens-before relation: • If the cut “includes” event b, and event a happens before b, then the cut also includes event a • In practice, this means that every “delivered” message was sent within the cut
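The closure property translates directly into code. A minimal sketch (the helper name and the edge representation are our assumptions, not from the slides):

```python
def is_consistent(cut, happens_before):
    """A cut (a set of events) is consistent iff it is closed under
    happens-before: whenever b is in the cut and a -> b, a is too.
    `happens_before` is an iterable of (a, b) pairs meaning a -> b."""
    return all(a in cut for (a, b) in happens_before if b in cut)
```

In particular, a cut containing a message delivery but not the corresponding send fails the check, which is exactly the “every delivered message was sent within the cut” condition.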

  16. Illustration of consistent cuts • [Time-space diagram across p0–p3 with events a–g: the red cut is inconsistent; the green cuts are consistent]

  17. Intuition into consistent cuts • A consistent cut is a state that could have arisen during execution, depending on how processes were scheduled • An inconsistent cut could not have arisen during execution • One way to see this: think of process timelines as rubber bands. Scheduler stretches or compresses time but can’t deliver message before it was sent

  18. Illustration of consistent cuts • [Time-space diagram across p0–p3: stretching or shrinking the process timelines, as in the rubber-band analogy, leaves the green cuts consistent]

  19. There may be many consistent cuts through any point in the execution • [Time-space diagram across p0–p3 with events a–g] • The possible cuts define a range of potentially instantaneous system states

  20. Illustration of consistent cuts • [Time-space diagram across p0–p3: to make the red cut “straight”, message f would have to travel backwards in time!]

  21. Reliable Distributed Systems Quorums

  22. Quorum replication • We developed a whole architecture based on our four-step recipe • But there is a second major approach that also yields a complete group communication framework and solutions • Based on “quorum” read and write operations • Omits notion of process group views

  23. Today’s topic • Quorum methods from a mile high • Don’t have time to be equally detailed • We’ll explore • How the basic read/update protocol works • Failure considerations • State machine replication (a form of lock-step replication for deterministic objects) • Performance issues

  24. A peek at the conclusion • These methods are • Widely known and closely tied to consensus • Perhaps easier to implement • But they have serious drawbacks: • Need deterministic components • Are drastically slower (10s–100s of events/second) • Big win? • Recent systems combine quorums with Byzantine Agreement for ultra-sensitive databases

  25. Static membership • Subsets of a known set of processes • E.g. a cluster of five machines, each running replica of a database server • Machines can crash or recover but don’t depart permanently and new ones don’t join “out of the blue” • In practice the dynamic membership systems can easily be made to work this way… but usually aren’t

  26. Static membership example • Qread = 2, Qwrite = 4 • [Diagram: replicas p, q, r, s, t and a client issuing reads and writes] • To do a read, this client (or even a group member) must access at least 2 replicas • To do a write, it must update at least 4 replicas • This write will fail: the client only manages to contact 2 replicas and must “abort” the operation (we use this terminology even though we aren’t doing transactions)

  27. Quorums • Must satisfy two basic rules • A quorum read should “intersect” any prior quorum write in at least one process • A quorum write should also intersect any other quorum write • So, in a group of size N: • Qr + Qw > N, and • Qw + Qw > N
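The two intersection rules are easy to check mechanically. A small sketch (the function name is ours):

```python
def valid_quorums(n, q_read, q_write):
    """True iff the quorum sizes satisfy both rules for group size n:
    a read overlaps every prior write (Qr + Qw > N), and any two
    writes overlap each other (Qw + Qw > N)."""
    return q_read + q_write > n and q_write + q_write > n
```

For example, with N = 5 the pair Qr = 2, Qw = 4 works, but Qr = 2, Qw = 3 does not: a read of 2 replicas could miss all 3 replicas touched by the latest write.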

  28. Versions of replicated data • Replicated data items have “versions”, and these are numbered • I.e. can’t just say “Xp=3”. Instead say that Xp has timestamp [7,q] and value 3 • Timestamp must increase monotonically and includes a process id to break ties • This is NOT the pid of the update source… we’ll see where it comes from

  29. Doing a read is easy • Send RPCs until Qr processes reply • Then use the value with the largest timestamp • Break ties by looking at the pid • For example • [6,x] < [9,a] (first look at the “time”) • [7,p] < [7,q] (but use pid as a tie-breaker) • Even if a process owns a replica, it can’t just trust its own data. Every “read access” must collect Qr values first…
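The read-side rule (largest [time, pid] wins) can be sketched as below; Python compares tuples lexicographically, so the time is compared first and the pid breaks ties, matching the rule on this slide (the reply representation is our assumption):

```python
def quorum_read(replies):
    """`replies`: list of ((time, pid), value) pairs gathered from at
    least Qr replicas. max() on the pairs picks the largest timestamp,
    with pid as the tie-breaker, and we return that value."""
    timestamp, value = max(replies)
    return value
```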

  30. Doing a write is trickier • First, we can’t support incremental updates (x=x+1), since no process can “trust” its own replica. • Such updates require a read followed by a write. • When we initiate the write, we don’t know if we’ll succeed in updating a quorum of processes • We can’t update just some subset; that could confuse a reader • Hence need to use a commit protocol • Moreover, must implement a mechanism to determine the version number as part of the protocol. We’ll use a form of voting

  31. The sequence of events • Propose the write: “I would like to set X=3” • Members “lock” the variable against reads, put the request into a queue of pending writes (must store this on disk or in some form of crash-tolerant memory), and send back: • “OK. I propose time [t,pid]” • Here, time is a logical clock. Pid is the member’s own pid • Initiator collects replies, hoping to receive Qw (or more) • If it gets ≥ Qw OKs: compute the maximum of the proposed [t,pid] pairs, and commit at that time • If it gets < Qw OKs: abort

  32. Which votes got counted? • It turns out that we also need to know which votes were “counted” • E.g. suppose there are five group members, A…E and they vote: • {[17,A] [19,B] [20,C] [200,D] [21,E]} • But somehow the vote from D didn’t get through and the maximum is picked as [21,E] • We’ll need to also remember that the votes used to make this decision were from {A,B,C,E}

  33. What’s with the [t,pid] stuff? • Lamport’s suggestion: use logical clocks • Each process receives an update message • Places it in an ordered queue • And responds with a proposed time: [t,pid] using its own process id for the time • The update source takes the maximum • Commit message says “commit at [t,pid]” • Group members whose votes were considered deliver committed updates in timestamp order • Group members whose votes were not considered discard the update and don’t perform it at all.
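Putting slides 31–33 together, the initiator’s commit decision might look like the sketch below (the function name and the dict representation of votes are our assumptions):

```python
def decide_write(votes, q_write):
    """`votes` maps each responding member to its proposed (time, pid).
    Abort if fewer than Qw members replied; otherwise commit at the
    maximum proposed timestamp, and remember whose votes were counted
    (slide 32) so members whose votes weren't can discard the update."""
    if len(votes) < q_write:
        return None  # abort
    commit_time = max(votes.values())
    return commit_time, set(votes)
```

With the votes from slide 32 minus the lost vote from D, the decision is to commit at [21,E] with counted set {A, B, C, E}.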

  34. Reliable Distributed Systems Models for Distributed Computing. The Fischer, Lynch and Paterson Result

  35. Who needs failure “models”? • Role of a failure model • Lets us reduce fault-tolerance to a mathematical question • In model M, can problem P be solved? • How costly is it to do so? • What are the best solutions? • What tradeoffs arise? • And clarifies what we are saying • Lacking a model, confusion is common

  36. Categories of failures • Crash faults, message loss • These are common in real systems • Crash failures: process simply stops, and does nothing wrong that would be externally visible before it stops • These faults can’t be directly detected

  37. Categories of failures • Fail-stop failures • These require system support • Idea is that the process fails by crashing, and the system notifies anyone who was talking to it • With fail-stop failures we can overcome message loss by just resending packets, which must be uniquely numbered • Easy to work with… but rarely supported

  38. Categories of failures • Non-malicious Byzantine failures • This is the best way to understand many kinds of corruption and buggy behaviors • Program can do pretty much anything, including sending corrupted messages • But it doesn’t do so with the intention of screwing up our protocols • Unfortunately, a pretty common mode of failure

  39. Categories of failure • Malicious, true Byzantine, failures • Model is of an attacker who has studied the system and wants to break it • She can corrupt or replay messages, intercept them at will, compromise programs and substitute hacked versions • This is a worst-case scenario mindset • In practice, doesn’t actually happen • Very costly to defend against; typically used in very limited ways (e.g. key mgt. server)

  40. Models of failure • Question here concerns how failures appear in formal models used when proving things about protocols • Think back to Lamport’s happens-before relationship, → • Model already has processes, messages, temporal ordering • Assumes messages are reliably delivered

  41. Recall: Two kinds of models • We tend to work within two models • Asynchronous model makes no assumptions about time • Lamport’s model is a good fit • Processes have no clocks, will wait indefinitely for messages, could run arbitrarily fast/slow • Distributed computing at an “eons” timescale • Synchronous model assumes a lock-step execution in which processes share a clock

  42. Adding failures in Lamport’s model • Also called the asynchronous model • Normally we just assume that a failed process “crashes”: it stops doing anything • Notice that in this model, a failed process is indistinguishable from a delayed process • In fact, the decision that something has failed takes on an arbitrary flavor • Suppose that at point e in its execution, process p decides to treat q as faulty…

  43. What about the synchronous model? • Here, we also have processes and messages • But communication is usually assumed to be reliable: any message sent at time t is delivered by time t+δ • Algorithms are often structured into rounds, each lasting some fixed amount of time δ, giving time for each process to communicate with every other process • In this model, a crash failure is easily detected • When people have considered malicious failures, they often used this model

  44. Neither model is realistic • Value of the asynchronous model is that it is so stripped down and simple • If we can do something “well” in this model we can do at least as well in the real world • So we’ll want “best” solutions • Value of the synchronous model is that it adds a lot of “unrealistic” mechanism • If we can’t solve a problem with all this help, we probably can’t solve it in a more realistic setting! • So seek impossibility results

  45. Tougher failure models • We’ve focused on crash failures • In the synchronous model these look like a “farewell cruel world” message • Some call it the “failstop model”. A faulty process is viewed as first saying goodbye, then crashing • What about tougher kinds of failures? • Corrupted messages • Processes that don’t follow the algorithm • Malicious processes out to cause havoc?

  46. Here the situation is much harder • Generally we need at least 3f+1 processes in a system to tolerate f Byzantine failures • For example, to tolerate 1 failure we need 4 or more processes • We also need f+1 “rounds” • Let’s see why this happens
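The arithmetic behind the 3f+1 bound can be spelled out: quorums of size n − f are used so progress never waits on the f faulty processes, and any two such quorums overlap in n − 2f members, which exceeds f (so the overlap contains at least one correct process) exactly when n ≥ 3f + 1. A small sketch (function names are ours):

```python
def min_group_size(f):
    """Smallest group size tolerating f Byzantine failures."""
    return 3 * f + 1

def quorum_overlap(n, f):
    """Two quorums of size n - f share at least n - 2f members; with
    n >= 3f + 1 that overlap is > f, so it contains a correct process."""
    return n - 2 * f
```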

  47. Byzantine scenario • Generals (N of them) surround a city • They communicate by courier • Each has an opinion: “attack” or “wait” • In fact, an attack would succeed: the city will fall. • Waiting will succeed too: the city will surrender. • But if some attack and some wait, disaster ensues • Some Generals (f of them) are traitors… it doesn’t matter if they attack or wait, but we must prevent them from disrupting the battle • Traitor can’t forge messages from other Generals

  48. Byzantine scenario • [Cartoon: the Generals around the city shout conflicting orders: “Attack!” “No, wait!” “Surrender!” “Wait…” “Attack!”]

  49. A timeline perspective • Suppose that p and q favor attack, r is a traitor and s and t favor waiting… assume that in a tie vote, we attack • [Timeline diagram over processes p, q, r, s, t]

  50. A timeline perspective • After the first round the collected votes are: • {attack, attack, wait, wait, traitor’s-vote} • [Timeline diagram over processes p, q, r, s, t]
