Applications of Probabilistic Quorums to Iterative Algorithms

HyunYoung Lee, University of Denver; Jennifer L. Welch, Texas A&M University. Presented at ICDCS 2001.

Outline • The Probabilistic Quorum Algorithm (PQA) • Abstracting PQA into Random Register (RR) • Using RRs in Iterative Convergent Algorithms • Monotone RRs and their Performance • Simulation Results • Conclusions

Distributed Shared Memory • Provides illusion of shared variables for inter-process communication on top of a message-passing distributed system • Benefits of shared memory paradigm: • familiar from uniprocessor case • supports good software development practice • Examples: Treadmarks [Amza+], DASH [Gharachorloo+], ...

Distributed Shared Memory [Figure: application processes 1 to r issue operations such as read(Y), answered by return(Y,5), and write(X,3), answered by ack(X), to their local clients 1 to r. Clients and servers 1 to n exchange send/recv messages over the network to implement shared variables X, Y, Z, ...]

Replicated Data with Quorums • Keep a copy of each shared variable at n replica servers that communicate by messages. • A quorum is a subset of replica servers. • To write: client updates copies in a quorum with new value plus timestamp. • To read: client receives copies from a quorum and returns value with latest timestamp.
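The read and write rules above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: message passing is replaced by direct calls, and the names `Replica`, `write`, `read`, and `majority_quorum` are hypothetical. It uses strict majority quorums, so every read quorum intersects every write quorum.

```python
import random

class Replica:
    """One replica server holding a (timestamp, value) copy of the variable."""
    def __init__(self):
        self.timestamp = 0
        self.value = None

def write(replicas, quorum, value, timestamp):
    """Update each copy in the quorum if the new timestamp is later."""
    for i in quorum:
        if timestamp > replicas[i].timestamp:
            replicas[i].timestamp = timestamp
            replicas[i].value = value

def read(replicas, quorum):
    """Return the value with the latest timestamp among the quorum's copies."""
    latest = max((replicas[i] for i in quorum), key=lambda r: r.timestamp)
    return latest.value

def majority_quorum(n, rng):
    """Assumed strict quorum system: any floor(n/2)+1 servers form a quorum."""
    return rng.sample(range(n), n // 2 + 1)

rng = random.Random(0)
n = 7
replicas = [Replica() for _ in range(n)]
write(replicas, majority_quorum(n, rng), value=4, timestamp=1)
write(replicas, majority_quorum(n, rng), value=10, timestamp=2)
result = read(replicas, majority_quorum(n, rng))
```

Because any two majorities share at least one server, `result` is the most recently written value regardless of which quorums the random generator picks.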

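Quorum Intersection • To ensure each read obtains latest value written, every read quorum must intersect every write quorum. [Figure: replicas holding (value, timestamp) copies such as (4, 9:00), (10, 8:00) and (12, 7:00); a write quorum and a read quorum share a replica holding the latest copy (4, 9:00).]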

Performance Measures for Quorum Systems • Availability: minimum number of servers that must fail to disable every quorum [Peleg & Wool]. • Optimal (largest) availability is $\Theta(n)$, achieved when every set of size $\lfloor n/2 \rfloor + 1$ is a quorum. • Load: probability of accessing the busiest server, in the best case [Naor & Wool]. • Optimal (smallest) load is $\Theta(1/\sqrt{n})$. Tradeoff Theorem [Naor & Wool]: For any quorum system, if load is optimal, i.e. $\Theta(1/\sqrt{n})$, then availability is at most $O(\sqrt{n})$.

Breaking the Tradeoff with Probabilistic Quorums [Malkhi, Reiter & Wright] • Relax requirement that every read quorum overlap every write quorum. • Instead, choose each quorum uniformly at random from the set of all k-sized subsets of the n replica servers, for $k < n/2$. Theorem: If $k = \Theta(\sqrt{n})$, then • availability is $n - k = \Theta(n)$ • load is $\Theta(1/\sqrt{n})$ • To handle server failures: keep trying until enough responses to form a quorum are received.

Probabilistic Quorums [MRW] Drawback: A read quorum might not overlap the most recent write quorum, causing a read to return an out-of-date value. Theorem: Probability of not overlapping is $< e^{-h^2}$ when $k = h\sqrt{n}$. [Figure: replicas holding (value, timestamp) copies (4, 9:00), (10, 8:00) and (12, 7:00); a read quorum that misses the write quorum of the latest copy (4, 9:00).]
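As a numeric check of the non-overlap theorem, the sketch below compares the exact miss probability for two independent, uniformly random k-subsets of n servers against the bound $e^{-h^2}$, assuming $h = k/\sqrt{n}$. The function names are hypothetical, and the exact expression $\binom{n-k}{k}/\binom{n}{k}$ is standard counting rather than something stated on the slide.

```python
import math

def miss_probability(n, k):
    """Exact probability that a random k-subset avoids a fixed k-subset."""
    return math.comb(n - k, k) / math.comb(n, k)

def mrw_bound(n, k):
    """Upper bound exp(-h^2), where k = h * sqrt(n)."""
    h = k / math.sqrt(n)
    return math.exp(-h * h)

# (exact, bound) pairs for n = 100 and a few quorum sizes below n/2.
comparison = {k: (miss_probability(100, k), mrw_bound(100, k))
              for k in (5, 10, 20, 30)}
```

For n = 100 and k = 20 (h = 2) the bound is $e^{-4}$, below 2%, while each operation still contacts only a fifth of the servers.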

Programming with PQs • What are the semantics of the shared variable (register) implemented by the PQA? • What kind of applications can tolerate reads returning, with low probability, out-of-date values?

Definition of Random Register One writer and multiple readers. [R1] Every read or write invocation has a response. [R2] Every read R reads from some write W: (1) W begins before R ends. (2) R's value is the same as W's value. (3) W is the latest such write. [Figure: timeline of writes W1(c), W2(b), W3(a), W4(c) and a read R(c).]

Definition of Random Register [R3] For every finite execution ending with a write W, probability that W is read from infinitely often is 0 (over all extensions with an infinite number of writes). Related Work: • Most work on randomized shared objects concerns termination, not correct responses. • [Afek+] and [Jayanti+] assumed a fixed subset of shared objects that can return incorrect values.

PQA implements an RR Theorem 1: PQA implements an RR. Proof: [R1]: Each invocation gets a response since no lost messages and only crash failures of servers. [R2]: Each read reads a value written by a previous or overlapping write, since no data corruption.

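PQA Implements an RR [R3]: Show that the probability that at least one replica in a write quorum is never overwritten is 0: $\Pr(\geq 1 \text{ replica survives } h \text{ writes}) \leq k \cdot \Pr(\text{replica } j \text{ survives } h \text{ writes}) = k \cdot \Pr(j \notin Q_1 \wedge \cdots \wedge j \notin Q_h) = k \cdot \prod_{i=1}^{h} \Pr(j \notin Q_i) = k \left( \frac{n-k}{n} \right)^{h} \to 0$ as $h \to \infty$.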
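To see how quickly the [R3] bound vanishes, the sketch below evaluates the union bound $k((n-k)/n)^h$ for a growing number h of later writes. The function name and the parameters n = 100, k = 10 are illustrative assumptions.

```python
def survival_bound(n, k, h):
    """Union bound on Pr(some replica of a write quorum survives h later writes)."""
    return k * ((n - k) / n) ** h

# Bound for n = 100 servers, quorum size k = 10, and increasing h.
bounds = [survival_bound(100, 10, h) for h in (1, 10, 50, 100, 200)]
```

For small h the bound exceeds 1 and says nothing, but it decays geometrically in h, which is all the proof of [R3] needs.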

Iterative Convergent Algorithms[Uresin & Dubois] • Repeatedly apply a function to a vector to produce another vector until reaching a fixed point. • Responsibility for vector components is distributed across several processes. • Vector component updates are based on possibly out-of-date views of the vector components.

Iterative Convergent Algorithms [UD] Requirements: [A1]: All views come from the past. [A2]: Every component is updated infinitely often. [A3]: Each view is used only finitely often. [Figure: vector components plotted against time steps 0 to 3; red views are updated ones, and arrows indicate the views used in the last update.]

Iterative Convergent Algorithms [UD] [A1], [A2], [A3] are equivalent to the existence of a partition of the update sequence into pseudocycles (p.c.'s): • at least one update per component, and • every view used was created in current or previous p.c. [Figure: update sequence partitioned into p.c. i-1 and p.c. i, with a disallowed view marked X.]

Asynchronously Contracting Operators Theorem [UD]: Sufficient condition on F for convergence to fixed point, if update sequence satisfies [A1]-[A3]: There exist an integer M and a sequence of sets $D_0, D_1, \ldots$ such that • each $D_K$ is a Cartesian product of m sets (independence) • $D_0 \supseteq D_1 \supseteq \cdots \supseteq D_M = D_{M+1} = \cdots = \{\text{fixed point}\}$ • if $x \in D_K$, then $F(x) \in D_{K+1}$, for all K. • Why? At end of K-th p.c., computed vector is in $D_K$. [Figure: nested sets $D_0, D_1, \ldots, D_{M-1}, D_M$ of m-vectors shrinking to the fixed point.]

Example: All Pairs Shortest Path • G is weighted directed graph with n nodes. • Compute n x n vector x; process i updates i-th row of x, $1 \leq i \leq n$. • Initially x is adjacency matrix for G. • F(x) computes y, where $y_{ij} = \min_{1 \leq k \leq n} \{ x_{ik} + x_{kj} \}$. Shown to be an ACO by [UD]. Claim: Worst-case number of pseudocycles for F to converge is $\lceil \log_2 \mathrm{diameter}(G) \rceil$.
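The operator F can be tried out sequentially. The sketch below is a synchronous, single-process version (no distribution, no stale views), assuming the adjacency matrix has zeros on the diagonal and infinity for missing edges. The test graph, a directed chain of 34 nodes with unit weights, is an illustrative choice whose longest shortest path has 33 edges.

```python
import math

INF = math.inf

def apply_f(x):
    """One application of F: y[i][j] = min over k of x[i][k] + x[k][j]."""
    n = len(x)
    return [[min(x[i][k] + x[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def iterate_to_fixed_point(x):
    """Apply F until nothing changes; return the fixed point and the
    number of applications that changed the matrix."""
    steps = 0
    while True:
        y = apply_f(x)
        if y == x:
            return x, steps
        x, steps = y, steps + 1

def chain_adjacency(n):
    """Directed chain 0 -> 1 -> ... -> n-1 with unit weights, zero diagonal."""
    x = [[0 if i == j else INF for j in range(n)] for i in range(n)]
    for i in range(n - 1):
        x[i][i + 1] = 1
    return x

dist, steps = iterate_to_fixed_point(chain_adjacency(34))
```

Each application doubles the number of edges a known path may use, so the 33-edge path from node 0 to node 33 is first found on the sixth application, matching $\lceil \log_2 33 \rceil = 6$.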

ACOs Correct with RRs Theorem 2: If F is an ACO, then every iterative execution using RRs for the vector components converges with probability 1. Proof: Show the sequence of updates in the execution satisfies [A1], [A2] and [A3] with probability 1. [A1]: All views are from the past by [R2]. [A2]: Application ensures every component is updated i.o. [A3] holds with probability 1: Each view is used finitely often with probability 1 by [R3].

Implications • RRs can be used to implement any ACO, which includes algorithms for • APSP • transitive closure • constraint satisfaction • solving system of linear equations • If PQA is used for the RRs, improved load and availability are provided. • Convergence is guaranteed with probability 1. • But how long does it take to converge?

Measuring Time with Rounds • A round finishes when every process has • read all the vector components • applied the function • updated its own vector components at least once. How many (expected) rounds per p.c.? We don’t know with current RR definition, so modify definition...

Monotone RR Definition [R1] - [R3] plus [R4]: If read R by process i reads from W and a later read R' by i reads from W', then W' does not precede W. [Figure: disallowed example, marked X, in which W'(b) precedes W(c) yet read R(c) is followed by read R'(b).]

Monotone RR Definition (cont'd) [R5]: There exists q s.t. for all r, $\Pr[r \text{ reads are needed until } W \text{ or a later write is read from}] \leq (1 - q)^{r-1} q$. So q is the probability of a "successful" read (w.r.t. W).

Monotone RR Algorithm Same as previous probabilistic quorum algorithm, except: • Read client keeps track of value with latest timestamp that it has seen so far. • This value is returned if its timestamp is later than all those obtained from current quorum. Theorem 3: Attains $q = 1 - \binom{n-k}{k} / \binom{n}{k}$. W's or later value is read if a subsequent read quorum overlaps W's quorum.
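The client-side change can be sketched as follows. This is an illustrative reading of the two bullets above, not the authors' code: replicas are modeled as a list of (timestamp, value) pairs, and `MonotoneReader` and `success_probability` are hypothetical names. The second function evaluates $q = 1 - \binom{n-k}{k}/\binom{n}{k}$, the chance that a random read quorum overlaps a given write quorum.

```python
import math

def success_probability(n, k):
    """q: probability that a random k-subset overlaps a fixed k-subset."""
    return 1 - math.comb(n - k, k) / math.comb(n, k)

class MonotoneReader:
    """Read client that never returns a value older than one it has seen."""
    def __init__(self):
        self.latest = (0, None)  # (timestamp, value) seen so far

    def read(self, replicas, quorum):
        # Newest copy in the current quorum, compared by timestamp only.
        seen = max((replicas[i] for i in quorum), key=lambda c: c[0])
        if seen[0] > self.latest[0]:
            self.latest = seen
        # Cached value wins if the quorum only returned older copies.
        return self.latest[1]

# Replica 0 holds the newest write; the others hold an older one.
replicas = [(2, "b"), (1, "a"), (1, "a"), (1, "a"), (1, "a")]
reader = MonotoneReader()
first = reader.read(replicas, [0, 1])   # quorum overlaps the newest write
second = reader.read(replicas, [2, 3])  # quorum misses it
```

Here the second read still returns "b" because the client remembers the newer copy, which is the behavior [R4] asks for.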

Monotone RR Rounds per P.C. Theorem 4: Expected number of rounds per pseudocycle, when implementing an ACO with monotone RRs, is at most 1/q. Proof: For p.c. h to end, each process i must read from a write no earlier than the first write in p.c. h-1. Once this read occurs for i, every later read by i is at least as recent, since monotone. Expected # rounds for first read is $\leq 1/q$ by [R5].

Messages vs. Rounds for ACOs Corollary: For monotone PQA, expected # rounds per p.c. is $\leq \left( 1 - \left( \frac{n-k}{n} \right)^{k} \right)^{-1}$. Expression is between 1 and 2 when $k = \sqrt{n}$. Strict quorum system has 1 round per p.c. Monotone PQA has > 1 expected round per p.c. but may have fewer messages per p.c. Which has better message complexity?
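The "between 1 and 2" claim can be checked numerically. The sketch below assumes $k = \sqrt{n}$ with n a perfect square, and compares the corollary's bound with $1/q$ for $q = 1 - \binom{n-k}{k}/\binom{n}{k}$. The function names are hypothetical.

```python
import math

def exact_q(n, k):
    """Probability that a random k-subset overlaps a fixed k-subset."""
    return 1 - math.comb(n - k, k) / math.comb(n, k)

def rounds_bound(n, k):
    """Corollary's bound on expected rounds per pseudocycle."""
    return 1 / (1 - ((n - k) / n) ** k)

# Rows of (n, k, 1/q, bound) for a few perfect squares n with k = sqrt(n).
table = []
for n in (4, 16, 100, 10_000):
    k = math.isqrt(n)
    table.append((n, k, 1 / exact_q(n, k), rounds_bound(n, k)))
```

As n grows, $((n-k)/n)^k$ with $k = \sqrt{n}$ approaches $1/e$, so the bound approaches $1/(1 - 1/e)$, roughly 1.58.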

Message Complexity for ACOs Messages per round in synchronous case: • Each of the m vector components is read by each of the p processes and written by one. • Each operation generates two messages to each of the k quorum members. Total: $2m(p+1)k$. MPQA: When $k = \sqrt{n}$, expected # messages per p.c. is $c \cdot 2m(p+1)\sqrt{n}$, $1 < c < 2$.

Comparing Message Complexity Recall when $k = \sqrt{n}$, expected # messages per p.c. for MPQA is $c \cdot 2m(p+1)\sqrt{n}$, $1 < c < 2$. • High availability $\Theta(n)$: • Strict: $k = \lfloor n/2 \rfloor + 1$, so # messages per p.c. is $2m(p+1)(\lfloor n/2 \rfloor + 1)$. Worse. • Low load $\Theta(1/\sqrt{n})$: • Strict: $k = \sqrt{n}$ (e.g., rows and columns of grid), so # messages per p.c. is $2m(p+1)\sqrt{n}$. Asymptotically same.
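The comparison can be made concrete with a small calculation. The sketch below assumes n is a perfect square, takes $k = \sqrt{n}$ for both MPQA and the grid-style strict system, and uses the corollary's upper bound on rounds per pseudocycle as the constant c. The function names and the sample values m = 10, p = 10, n = 100 are illustrative.

```python
import math

def messages_per_round(m, p, k):
    """m components, p reads plus 1 write each, 2 messages per quorum member."""
    return 2 * m * (p + 1) * k

def mpqa_messages_per_pc(m, p, n):
    """Upper bound on expected messages per pseudocycle for monotone PQA."""
    k = math.isqrt(n)
    c = 1 / (1 - ((n - k) / n) ** k)  # bound on expected rounds per p.c.
    return c * messages_per_round(m, p, k)

def strict_majority_messages_per_pc(m, p, n):
    """Strict system with quorum size floor(n/2) + 1, one round per p.c."""
    return messages_per_round(m, p, n // 2 + 1)

def strict_grid_messages_per_pc(m, p, n):
    """Strict system with quorum size sqrt(n), one round per p.c."""
    return messages_per_round(m, p, math.isqrt(n))

m, p, n = 10, 10, 100
mpqa = mpqa_messages_per_pc(m, p, n)
majority = strict_majority_messages_per_pc(m, p, n)
grid = strict_grid_messages_per_pc(m, p, n)
```

With these values the majority system sends 11220 messages per pseudocycle and the grid system 2200, while the MPQA bound falls between one and two times the grid figure.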

Simulation Purpose Simulated non-monotone and monotone RR implementations using PQs with APSP application to study: • difference between synchronous and asynchronous cases • expected convergence time in non-monotone case (no analysis) • actual expected convergence time in monotone case compared to computed upper bound

Simulation Details Input graph: $\lceil \log_2 33 \rceil = 6$ pseudocycles to converge. Measured rounds till convergence (when simulated results equaled precomputed actual answer). Each plotted point is average of 7 runs. [Figure: input graph with nodes labeled 1, 2, ..., 34 and edge weights 1.]

Simulation Results Computed upper bound is not tight. Synch & asynch are very similar. Monotone is better than non-monotone.

Summary • Proposed two specifications of randomized shared variables that can return wrong answers, monotone and non-monotone random read-write registers. • Both specs can be implemented with PQA of [MRW]. • Our specs can be used to implement a significant class of iterative convergent algorithms, characterized by [UD]; algorithms converge with probability 1. • Computed bounds on convergence time and message complexity for ACOs in monotone case. • Simulation results indicate monotone is faster than non-monotone, asynch and synch are similar, and computed upper bound is not tight.

Future Work • Are our specs of more general interest? Other good algs that implement them? Different specs better? • Useful applications for other shared data structures (e.g., stack) with errors? How to specify and implement them? • How to tolerate client failures? Approximate agreement as an application?
