CPSC 668 Distributed Algorithms and Systems

CPSC 668 Distributed Algorithms and Systems
Fall 2006
Prof. Jennifer Welch
Set 17: Fault-Tolerant Register Simulations

Fault-Tolerant Shared Memory Simulations
• What if some processors might crash?
• Can we still provide a shared read/write variable on top of message passing?
• Yes, even in an asynchronous system, if we have enough nonfaulty processors.
• First, we must specify a failure-prone shared memory.

Specification of f-Resilient Shared Memory
• Inputs are invocations on the shared object.
• Outputs are responses of the shared object.
• A sequence of inputs and outputs is allowable iff there is a partitioning of proc. indices into "faulty" and "nonfaulty" such that:
  • Correct Interaction: each proc. alternates invocations and matching responses
  • Nonfaulty Liveness: every invocation by a nonfaulty proc. has a matching response
  • Extended Linearizability: linearizability holds for all the completed operations and some subset of the pending operations (some ops might never complete)

Assumptions for Algorithm
• Each read/write variable ("register") to be simulated has
  • one reader and
  • one writer
  • (next topic will be to build more powerful variables out of these)
• There are n procs. which are cooperating to simulate a collection of such variables
• Underlying communication system is asynchronous message passing
• n > 2f (less than half the processors can crash)

Main Ideas of Algorithm
• Each simulated register has a replica stored at each of the n procs., not just at the designated reader and writer of that register.
• Use the redundant storage to provide fault-tolerance.
• Describe algorithm just for one simulated register; use a separate copy of the same algorithm in parallel for each simulated register.

Writing the Simulated Register
• generate the next sequence number
• send a message with the value and the sequence number to all the procs.
• each recipient updates its local copy of the register and sends back an ack
• wait to get back an ack from > n/2 procs.
  • safe since n - f > n/2
• respond with the ack for the write (the write operation completes)
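The "safe since n - f > n/2" step follows directly from the assumption n > 2f stated earlier; spelled out as a short derivation:

```latex
n > 2f
\;\Longrightarrow\; f < \tfrac{n}{2}
\;\Longrightarrow\; n - f > n - \tfrac{n}{2} = \tfrac{n}{2}.
```

So the at least n - f nonfaulty procs. alone already form a set of more than n/2 repliers, and the wait cannot block forever.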

Reading the Simulated Register
• send a request to all the procs.
• each recipient sends back the current value of its replica, together with its sequence number
• wait to get a reply from > n/2 procs.
• return the value associated with the largest sequence number
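The write and read steps above can be sketched in a few lines of Python. This is only an illustrative model under stated assumptions, not the course's code: message passing is replaced by direct method calls, asynchrony is modeled by a hypothetical `first_responders` scheduler that picks an arbitrary set of more than n/2 live procs. to answer (messages to the others are treated as delayed), and the names `Replica`, `SimulatedRegister`, and `first_responders` are invented for the example.

```python
import random


class Replica:
    """One processor's local copy of the simulated register."""

    def __init__(self, initial=0):
        self.seq = 0          # sequence number of the stored value
        self.value = initial

    def on_write(self, seq, value):
        # Recipient updates its local copy, then acknowledges.
        self.seq, self.value = seq, value
        return "ack"

    def on_read(self):
        # Recipient reports its current copy together with its sequence number.
        return (self.seq, self.value)


def first_responders(n, crashed, rng):
    """Hypothetical scheduler: which > n/2 live procs. happen to answer first."""
    alive = [i for i in range(n) if i not in crashed]
    need = n // 2 + 1
    if len(alive) < need:
        raise RuntimeError("operation would block: needs n > 2f")
    return rng.sample(alive, need)


class SimulatedRegister:
    def __init__(self, n, crashed=(), seed=0):
        self.replicas = [Replica() for _ in range(n)]
        self.crashed = set(crashed)
        self.rng = random.Random(seed)
        self.next_seq = 0     # kept by the single writer

    def write(self, value):
        self.next_seq += 1
        quorum = first_responders(len(self.replicas), self.crashed, self.rng)
        acks = [self.replicas[i].on_write(self.next_seq, value) for i in quorum]
        return acks[0]        # more than n/2 acks were collected

    def read(self):
        quorum = first_responders(len(self.replicas), self.crashed, self.rng)
        replies = [self.replicas[i].on_read() for i in quorum]
        seq, value = max(replies, key=lambda reply: reply[0])
        return value
```

With n = 5 and one crashed proc., a read that follows a completed write returns that write's value no matter which majorities the scheduler picks, because the two majorities share at least one replica.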

Key Idea for Correctness
• Each read should return the value of "the most recent" write.
• Each read or write communicates with > n/2 procs., so the set of procs. participating in operation O1 is guaranteed to intersect with the set of procs. participating in any other operation O2.
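The intersection claim can be checked by brute force for small n. The sketch below enumerates every set of more than n/2 proc. indices and confirms that each pair shares a member; the function names are made up for this illustration.

```python
from itertools import combinations


def majorities(n):
    """All sets of proc. indices with more than n/2 members."""
    return [
        set(c)
        for k in range(n // 2 + 1, n + 1)
        for c in combinations(range(n), k)
    ]


def all_pairs_intersect(n):
    """True iff every two majorities of n procs. have a proc. in common."""
    quorums = majorities(n)
    return all(a & b for a in quorums for b in quorums)
```

By contrast, two sets of exactly n/2 procs. (for even n) can be disjoint, for example {0, 1} and {2, 3} when n = 4, which is why "more than n/2" is required.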

But What About Asynchrony?
• The underlying communication system is asynchronous:
  • a message on behalf of one operation could be overtaken by a message on behalf of a later operation.
• Avoid such problems by adding an additional mechanism to the algorithm:
  • reader and writer keep track of the "status" of each link
  • don't send a msg on a link until the ack for the previous msg has been received

Outline of Correctness Proof
Interesting part is proving linearizability.
• Let ts(W) = sequence number of W
• Let ts(R) = sequence number of write that R reads from
• Let O1 → O2 denote that O1 finishes before O2 starts
Key lemmas:
• If W1 → W2, then ts(W1) < ts(W2)
• If W → R, then ts(W) ≤ ts(R)
• If R → W, then ts(R) < ts(W)
• If R1 → R2, then ts(R1) ≤ ts(R2)

Matching Lower Bound on Resiliency
Theorem (10.22): No simulation of a 1-reader, 1-writer read/write register using n procs and asynchronous message passing can tolerate f ≥ n/2 crash failures.
Proof: Suppose in contradiction there is an algorithm A that tolerates f = n/2 crashes and simulates a 1-reader, 1-writer linearizable register on top of asynchronous message passing.

Lower Bound Proof
• Partition procs into two sets, S0 and S1, each of size f.
• Let α0 be an admissible exec. of A s.t.
  • initial value of simulated register is 0
  • all procs. in S1 crash initially
  • proc. p0 in S0 invokes write(1) at time 0 and no other operations are invoked.
  • the write completes at some time t0 (must happen since A is supposed to tolerate f failures).

Lower Bound Proof
• Let α1 be an admissible exec. of A s.t.
  • initial value of simulated register is 0
  • all procs. in S0 crash initially
  • proc. p1 in S1 invokes a read at time t0+1 and no other operations are invoked.
  • the read completes at some time t1 (must happen since A is supposed to tolerate f failures)
  • the read returns 0 (must be since A guarantees linearizability).

Lower Bound Proof
• Now create admissible execution β by "merging" the views of procs in S0 from α0 and the views of procs in S1 from α1:
  • messages that go between S0 and S1 are delayed so that they don't arrive until after time t1.
• β is not linearizable, since read(0) follows write(1). Contradiction.

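[Figure: the two executions and their merge. In the first, all procs. in S1 crash (X) and p0 in S0 runs; in the second, all procs. in S0 crash (X) and p1 in S1 runs; in the merged execution both p0 and p1 run, and messages between S0 and S1 are delayed until after t1.]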

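[Figure: Lower Bound Diagram for n = 2. Time axis marked 0, t0, t0+1, t1. First execution: p0 performs write(1) from time 0 to t0, p1 has crashed (X). Second execution: p0 has crashed (X), p1 performs read(0) from time t0+1 to t1. Merged execution: p0 performs write(1), then p1 performs read(0).]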

Simulating R/W Registers Using R/W Registers
• The previous algorithm showed how to simulate a 1-reader, 1-writer register on top of message passing.
• How can we get more powerful (flexible) registers, i.e., with
  • more readers
  • more writers
• We'll start with a warm-up:
  • simulate multi-valued register using binary-valued registers
  • 1-reader and 1-writer

Wait-Free Register Simulations
• Asynchronous model
• Linearizable shared registers
• Wait-free
  • tolerate any number of crash failures
• We want to simulate one kind of (n-1)-resilient shared memory with another kind of (n-1)-resilient shared memory
  • recall earlier definition of f-resilient shared memory
  • recall earlier definition of one kind of communication system simulating another

Alternative Definition of Wait-Free Simulation
• Alternative definition for the wait-free case:
  • The failure-free version of one communication system simulates the failure-free version of the other, and
  • for any prefix of an admissible execution of the simulation algorithm in which pi has a pending operation, there is an extension in which the operation completes and only pi takes steps.
• Equivalent to previous definition, sometimes more convenient.

Proving Linearizability
• We've seen one approach:
  • explicitly construct a permutation and prove that it has the desired properties
• Alternative approach:
  • identify a time point for each operation, between invocation and response: linearization points
  • Linearization points give the permutation
  • Obviously real-time order is preserved
  • Just need to show that legality holds

[Figure: Overview of Register Simulations. Chain of constructions: single-reader single-writer binary-valued → single-reader single-writer multi-valued → multi-reader single-writer multi-valued → multi-reader multi-writer multi-valued.]

Multi-Valued From Binary
• Some ideas…
  • Use a different binary register to store each bit of the multi-valued register being simulated
  • Read algorithm is to read all the binary registers and return the resulting value
  • Write algorithm is to write the new bits in some order
• Difficulties arise if the reader overlaps a slow write and sees some new bits and some old bits

A Unary Approach
• Suppose the simulated register is to take on the values {0,…,K-1}.
• Use an array of K binary registers, B[0..K-1]
  • represent value v by having B[v] = 1 and the other entries 0
• Read algorithm: read B[0], B[1],…, until finding the first 1; return the index
• Write algorithm: zero out the old entry of B and set the new entry

Problems with Unary Approach
• OK if reads and writes don't overlap.
• If they do, have to worry about
  • reader never finding a 1 in B
  • new-old inversion: writer writes 1, then 2, but reader reads 2, then 1.
• Counter-example execution on next slide
  • since binary registers are linearizable, we just mark the linearization points of the reads and writes on the binary registers

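Counter-Example
Initially B[0] = B[1] = B[2] = 0 and B[3] = 1
[Figure: timeline of low-level operations at their linearization points.
Writer: write 1 (write 1 to B[1], write 0 to B[3]), then write 2 (write 1 to B[2], write 0 to B[1]).
Reader: read 2 (read 0 from B[0], read 0 from B[1], read 1 from B[2]), then read 1 (read 0 from B[0], read 1 from B[1]).
The reader's read 0 from B[1] comes before write 1 to B[1], and its read 1 from B[1] comes before write 0 to B[1].]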
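The counter-example can be replayed step by step. The sketch below is one possible interleaving consistent with the slide, with each low-level operation written out in linearization order; `NaiveReader` and `replay_naive_unary` are names invented for this illustration of the uncorrected unary read.

```python
class NaiveReader:
    """Upward scan of the uncorrected unary read, one low-level read per step."""

    def __init__(self, B):
        self.B = B
        self.i = 0

    def step(self):
        """Read B[i]; return i if it holds 1 (the read's result), else None."""
        i = self.i
        self.i += 1
        return i if self.B[i] == 1 else None


def replay_naive_unary():
    B = [0, 0, 0, 1]                 # register initially holds 3
    r = NaiveReader(B)
    assert r.step() is None          # read 0 from B[0]
    assert r.step() is None          # read 0 from B[1]
    B[1] = 1                         # write 1: write 1 to B[1]
    B[3] = 0                         # write 1: write 0 to B[3]
    B[2] = 1                         # write 2: write 1 to B[2]
    first = r.step()                 # read 1 from B[2], so this read returns 2
    r = NaiveReader(B)               # the next high-level read starts
    assert r.step() is None          # read 0 from B[0]
    second = r.step()                # read 1 from B[1], so this read returns 1
    B[1] = 0                         # write 2: write 0 to B[1]
    return first, second
```

The function returns (2, 1): the first read sees the newer value and the later read sees the older one, which is the new-old inversion.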

Corrected Multi-Valued Algorithm
• To prevent "falling off the edge" of the end of B without finding a 1, the write algorithm only clears (sets to 0) entries that are smaller than the entry that is set (to 1).
• To prevent new-old inversions, the read algorithm scans up to find the first 1, and then scans down to make sure those entries are still 0.
  • returns the smallest value associated with a 1 entry in B that is observed during the downward scan (or the value found at the top of the upward scan, if the downward scan sees only 0s)
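A minimal sequential sketch of the corrected reader and writer follows. It assumes B is a Python list standing in for the binary registers B[0..K-1], with exactly one entry initially set to 1 for the initial value; the order of the writer's clears (downward from v-1) is an assumption of this sketch, and concurrency is not modeled.

```python
def write(B, v):
    """Writer: set B[v], then clear only the entries below v (at most K low-level writes)."""
    B[v] = 1
    for i in range(v - 1, -1, -1):
        B[i] = 0


def read(B):
    """Reader: scan up to the first 1, then scan back down (at most 2K-1 low-level reads)."""
    i = 0
    while B[i] == 0:                 # upward scan; a 1 always remains at or above
        i += 1
    v = i
    for j in range(i - 1, -1, -1):   # downward scan over the entries seen as 0
        if B[j] == 1:
            v = j                    # remember the smallest index now holding 1
    return v
```

Because the writer never clears entries above the one it sets, stale 1s may remain at higher indices; the reader ignores them since its upward scan stops at the first 1.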

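[Figure: Multi-Valued Construction. The reader alg. issues low-level reads and the writer alg. issues low-level writes on the binary (0/1) registers B[0], …, B[K-1]; the reader and writer invoke read and write on the simulated multi-valued register through these algorithms.]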

Algorithm is Wait-Free
• Algorithm for writer does not involve any waiting: just do at most K (low-level) writes
• Algorithm for reader does not involve any waiting: just do at most 2K-1 (low-level) reads.

Algorithm Ensures Linearizability
• Describe an ordering of the (high-level) operations that is obviously legal (by the definition of the ordering)
• Then show that it respects real-time ordering of non-overlapping operations.
• Fix any admissible execution of the algorithm.
• Fix any linearization of the low-level operations (on the binary registers)
  • exists since the execution is admissible, which implies the underlying communication system (the binary registers) behaves properly (is linearizable)

Reads-From Relations
• Low-level read r on a binary register B[v] reads from low-level write w on the register if w is the latest write to B[v] that precedes r in the linearization of the low-level operations.
• High-level read R on the simulated multi-valued register reads from high-level write W on the register if R returns v and W contains the low-level write that R's last read of B[v] reads from.

[Figure: Reads-From Diagram. High-level write 1 consists of write 1 to B[1] and write 0 to B[0]; high-level read 1 consists of read 0 from B[0], read 1 from B[1], and read 0 from B[0]. Arrows mark the low-level reads-from relationships and the resulting high-level reads-from relationship.]

Construct Permutation
• Place all (high-level) writes in the order in which they occur
  • no concurrent writes
• Consider the (high-level) reads in the order in which they occur
  • no concurrent reads
• Suppose read R reads from write W. Place R immediately before the write that follows W in the permutation.
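The construction can be written as a small function. This is a sketch under assumptions: operations are represented by hypothetical string names, the reads-from relation is supplied as input, and a read that reads the initial value (marked with `None`) is placed before the first write, a case the slide does not spell out.

```python
def build_permutation(writes, reads_from):
    """Order high-level operations as in the construction.

    writes: write names in the order in which they occur.
    reads_from: (read, write) pairs in the order in which the reads occur;
                write is None if the read returns the initial value (assumption).
    """
    after = {w: [] for w in writes}  # reads placed between w and the next write
    before_all = []                  # reads of the initial value
    for r, w in reads_from:
        (before_all if w is None else after[w]).append(r)
    perm = list(before_all)
    for w in writes:
        perm.append(w)
        perm.extend(after[w])        # each read sits just before the following write
    return perm
```

For example, with writes W1, W2 and reads R1, R2 reading from W1 and R3 reading from W2, the result is W1, R1, R2, W2, R3; legality is immediate because each read directly follows the write it reads from (possibly after other reads of the same write).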

Correctness of Permutation
• Permutation is legal by construction
  • each read is placed after the write that it reads from
• Why does it preserve order of non-overlapping operations?
  • two writes: by construction
  • a read that precedes a write in the execution: OK, since the read cannot read from a later write.

Correctness of Permutation
Lemma (10.1): Suppose
• (high-level) read R returns v
• R reads B[u], with u < v, during its upward scan
• this read of B[u] reads from a (low-level) write contained in high-level write W1
Then R reads from a write that follows W1.

[Figure for Lemma 10.1: high-level read v reads 0 from B[u] (during upward scan, u < v) and reads 1 from B[v] (top of upward scan or during downward scan). High-level write w contains write 1 to B[w] and write 0 to B[u]; high-level write v contains write 1 to B[v]. Arrows mark the low-level reads-from relationships and the high-level reads-from relationship; one configuration is marked "can't happen".]

Correctness of Permutation
• Two cases remain to show that real-time order of non-overlapping operations is preserved:
  • a write that precedes a read in the execution
  • two reads
• Both cases are proved by contradiction, by showing that there would be a situation that violates Lemma 10.1.

Multi-Reader from Single-Reader
• First consider a simple idea:
  • Use a different single-reader register for each reader (Val[1],…,Val[n]).
    • n is number of readers
  • Write algorithm: write the new value in each of the single-reader registers
  • Read algorithm: read your own single-reader register and return that value

Counter-Example
Suppose 0 is initial value of multi-reader register. Suppose n = 2.
[Figure: pw performs write 1 as write 1 to Val[1] followed by write 1 to Val[2]. Between these two low-level writes, p1 reads 1 from Val[1] and returns 1 (read 1); afterwards p2 reads 0 from Val[2] and returns 0 (read 0): a new-old inversion.]
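The counter-example is short enough to replay directly. In this sketch `Val` is a plain dictionary standing in for the two single-reader registers, and the function name is invented for illustration.

```python
def replay_naive_multi_reader():
    Val = {1: 0, 2: 0}   # Val[i] is read only by reader p_i; initial value 0
    Val[1] = 1           # p_w: first low-level write of write(1)
    r1 = Val[1]          # p_1 reads its own register and returns the new value
    r2 = Val[2]          # p_2 reads afterwards, but its register is still old
    Val[2] = 1           # p_w: second low-level write, write(1) completes
    return r1, r2
```

The result is (1, 0): p1's read finishes before p2's read starts, yet p2 returns the older value.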

New Idea for Correct Algorithm
• Have the readers in the multi-reader algorithm write some information to the single-reader registers to prevent new-old inversions on the simulated register.
• This is provably necessary…

Readers Must Write
Theorem (10.3): In any wait-free simulation of a multi-reader single-writer register from single-reader single-writer registers, at least one reader must write.
Proof: Suppose in contradiction there is an algorithm in which readers never write.

Readers Must Write
• pw is the writer, p1 and p2 are the readers
• initial value of simulated register is 0
• S1 is the set of single-reader registers that are read by p1
• S2 is the set of single-reader registers that are read by p2

Readers Must Write
• Consider execution in which pw writes 1 to the simulated register.
• The write algorithm performs a series of writes, w1,…,wk, to the single-reader registers.
• Each wj is a write to a register in either S1 or S2.
• Let $v_j^i$ be the value that would be returned if pi were to do a read immediately after wj.

[Figure: Readers Must Write. pw performs write 1 as the low-level writes w1, …, wj, wj+1, …, wk; a read by pi immediately after wj would return $v_j^i$.]

Readers Must Write
• For each reader (p1 and p2), there is a point when the writes w1, …, wk cause the value of the simulated register, as it would be observed by that reader, to "switch" from 0 (old) to 1 (new).
• For p1:
  • $v_1^1 = v_2^1 = \dots = v_{a-1}^1 = 0$
  • $v_a^1 = \dots = v_k^1 = 1$
• For p2:
  • $v_1^2 = v_2^2 = \dots = v_{b-1}^2 = 0$
  • $v_b^2 = \dots = v_k^2 = 1$
• a cannot equal b!

Readers Must Write
• Why must a and b be different?
  • a marks the point when p1's view of the simulated register's current value changes from old to new. So wa must write to a register in S1.
  • Similarly, wb must write to a register in S2.
  • Each register has a single reader, so S1 and S2 are disjoint and one low-level write cannot be to a register in both.
• W.l.o.g., assume a < b.

[Figure: Readers Must Write. pw performs write 1 as the low-level writes w1, …, wa, wa+1, …, wk. Immediately after wa, p1 reads and returns $v_a^1 = 1$; then p2 reads and returns $v_a^2 = 0$. Not linearizable!]

Readers Must Write
• Where did we use the assumption in this proof that readers don't write?
  • The writer doing the slow write of 1 is oblivious to whether any readers are concurrently reading.
  • The readers are oblivious to each other.

Corrected Multi-Reader Algorithm
• As part of the algorithm for the read on the simulated register, announce the value to be returned.
• Before deciding what value to return, check what values have been returned by previous reads and don't pick anything earlier.
• Need timestamps to be able to determine relative age of returned values.
• Reader pi uses row i of a matrix to report its most recently returned value to all the other readers (remember, we only have single-reader variables at our disposal)

Writer's Algorithm
• get the next sequence number
  • use integers that are increased by one each time
• write value and sequence number to Val[1],…,Val[n] (one copy for each reader)

Reader pi's Algorithm
• read the value and timestamp written by the writer to Val[i]
• read the value and timestamp written by each reader pj to Report[j,i]
• choose the value-timestamp pair with the largest timestamp
• write that pair to row i of Report
• return value associated with that pair
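The writer's and reader's steps can be sketched together as follows. This is an illustrative model rather than the course's code: the class name `MultiReaderRegister` is invented, readers are indexed 0..n-1 instead of 1..n, each single-reader register is a list cell holding a (timestamp, value) pair, and the writer is a generator so that a test can pause it between low-level writes.

```python
class MultiReaderRegister:
    def __init__(self, n, initial=0):
        self.n = n
        self.seq = 0
        # val[i]: written by the writer, read only by reader i.
        self.val = [(0, initial)] * n
        # report[j][i]: written by reader j, read only by reader i.
        self.report = [[(0, initial)] * n for _ in range(n)]

    def write_steps(self, value):
        """Writer: one low-level write per step, so callers can interleave reads."""
        self.seq += 1
        for i in range(self.n):
            self.val[i] = (self.seq, value)
            yield

    def read(self, i):
        """Reader i: take the newest pair it can see, report it, then return it."""
        candidates = [self.val[i]] + [self.report[j][i] for j in range(self.n)]
        best = max(candidates, key=lambda pair: pair[0])
        for j in range(self.n):
            self.report[i][j] = best     # row i of Report
        return best[1]
```

Replaying the earlier counter-example schedule (the writer pauses after updating only the first reader's copy) now gives both readers the new value, because the second reader finds the first reader's report:

```python
reg = MultiReaderRegister(n=2)
w = reg.write_steps(1)
next(w)                 # writer has updated val[0] only
print(reg.read(0))      # 1
print(reg.read(1))      # 1, learned from report[0][1]
```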
