
Specifying, Reasoning About, Optimizing, and Implementing Atomic Data Services for Dynamic Distributed Systems

Alexander A. Shvartsman, University of Connecticut, USA
FRIDA 2014, Vienna, Austria


Presentation Transcript


  1. Alexander A. Shvartsman, University of Connecticut, USA • FRIDA 2014, Vienna, Austria • Specifying, Reasoning About, Optimizing, and Implementing Atomic Data Services for Dynamic Distributed Systems

  2. “Three R’s” (Lesen, Schreiben, Rechnen) • Reading, ’Riting, and ’Rithmetic • Underlie much of human intellectual activity • Venerable foundation of computing technology • With networking, communication became a major activity • Email – the electronic counterpart of the postal service • Yet it is natural to deal with reading, writing, and computing • In the Web, a browser app may load (i.e., read) a page, perform computation, and save (i.e., write) the results • In distributed databases we retrieve and store data, and rarely talk about sending and receiving data • Arguably, it is also easier to develop distributed algorithms with readable/writeable objects than to use message passing

  3. Sharing Memory in a Networked System • Let’s place a shareable object at a node in a network • Not fault-tolerant – single point of failure • Not efficient – performance bottleneck • Not very available, does not provide longevity, etc…

  4. Sharing Memory in a Networked System • So we replicate – we’d have to anyway, since redundancy is the only means for providing fault-tolerance

  5. Sharing Memory in a Networked System • With replication come challenges: • How to preserve consistency while managing replicas? • What kind of consistency? • How to provide it? • How to use it?

  6. Sharing Memory in Networked Systems • Let’s place a shareable object at a node in a network • Not fault-tolerant – single point of failure • Not efficient – performance bottleneck • Not very available, does not provide longevity, etc… • So we replicate – we’d have to anyway, since redundancy is the only means for providing fault-tolerance • With replication come challenges: • How to preserve consistency while managing replicas? • What kind of consistency? • How to provide it? • How to use it?

  7. Consistency • Easiest for users: a single-copy view • Sequence of operations; a read sees all previous writes • Captured by atomicity [Lamport] / linearizability [Herlihy, Wing] • Not cheap to implement even without general updates • Cheapest to implement: a read sees a subset of prior writes • Not the most natural semantics for the users • Additional complications in dynamic systems • Ever-changing sets of replicas and participants • Crashes never stop, timing variations persist • Evolving topology, ultimately mobility

  8. Atomicity / Linearizability [Lamport / Herlihy, Wing] • “Shrink” the interval of each operation to a serialization point so that the behavior of the object is consistent with its sequential type • [Timeline figures: a write(8) concurrent with several reads; depending on where the serialization points fall, reads return 0 or 8, but always consistently with a sequential register]

  9. More Atomicity Examples • [Timeline figures: valid (atomic) executions, where a read overlapping write(8) may return 0 or 8; and invalid executions, e.g., after some read returns 8, a later read returns 0]

  10. Consistency Polemics • Parallel / distributed systems focus: performance, speed-up • Distributed theory focus: fault-tolerance, consistency • [Cartoon: one camp says “Yes, mine is wrong… but it is fast!”, the other “Yes, mine is slow… but it is correct!”; the user wishes they reconciled]

  11. Using Majorities/Quorums for Consistency • Consistency of replicated data: using intersecting sets • Starting with Gifford (79) and Thomas (79) • Upfal and Wigderson (85) • Majority sets of readers and writers to emulate shared memory in a synchronous distributed setting • Vitanyi and Awerbuch (86) • MW/MR registers using matrices of SW/SR registers where rows and columns are read and written • Attiya, Bar-Noy, and Dolev (91/95) • Atomic SW/MR objects in message-passing systems, majorities of processors, a minority may crash • Two-phase protocol (ABD)

  12. Quorum Systems and Examples • A quorum system Q over P, a set of processor ids: Q = {Q1, Q2, …}, where Qi ⊆ P and Qi ∩ Qj ≠ Ø for all i, j • Example: majorities [Thomas 79, Gifford 79] • Lemma: The join of quorum system Qa over Pa and system Qb over Pb, Qa ⋈ Qb, is a quorum system over Pa ∪ Pb • Matrix quorums: processor ids arranged in a matrix; a quorum is the union of a row and a column
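
To make the intersection property concrete, here is a minimal Java sketch (not from the talk; the class and method names are hypothetical) of a majority quorum system, the classic instance due to Thomas and Gifford:

    import java.util.*;

    // Minimal sketch of a majority quorum system over a set P of processor ids.
    // Any two majorities intersect, since |Q1| + |Q2| > |P|.
    final class MajorityQuorums {
        private final Set<Integer> processors;   // the universe P

        MajorityQuorums(Collection<Integer> ids) {
            this.processors = new HashSet<>(ids);
        }

        // A quorum is any subset of P containing a strict majority of P.
        boolean isQuorum(Set<Integer> q) {
            return processors.containsAll(q) && q.size() > processors.size() / 2;
        }

        public static void main(String[] args) {
            MajorityQuorums qs = new MajorityQuorums(List.of(1, 2, 3, 4, 5));
            Set<Integer> q1 = Set.of(1, 2, 3), q2 = Set.of(3, 4, 5);
            Set<Integer> meet = new HashSet<>(q1);
            meet.retainAll(q2);                  // Qi ∩ Qj, never empty here
            System.out.println(qs.isQuorum(q1) + " " + qs.isQuorum(q2) + " " + meet);
        }
    }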

  13. Main Idea: (Logical) Timestamps and Quorums • An object is represented as a pair (value, timestamp) • A write records (new-value, new-timestamp) in a quorum • A read obtains (value, timestamp) pairs from a quorum, then returns the value with the largest timestamp • [Figure: write(v1) records (v1, t1) in one quorum; a later read gathers pairs from an intersecting quorum and returns (vmax, tmax)]
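
The read-side rule, sketched in Java (hypothetical names; a real implementation would gather these replies over the network):

    import java.util.*;

    // A replica's reply: the (value, timestamp) pair it currently holds.
    record TaggedValue(long timestamp, String value) {}

    final class QuorumRead {
        // Given the replies from some quorum, return the value with the
        // largest timestamp -- the freshest value the quorum has seen.
        static String read(Collection<TaggedValue> quorumReplies) {
            return quorumReplies.stream()
                    .max(Comparator.comparingLong(TaggedValue::timestamp))
                    .map(TaggedValue::value)
                    .orElseThrow();              // a quorum reply set is non-empty
        }

        public static void main(String[] args) {
            List<TaggedValue> replies = List.of(
                    new TaggedValue(1, "v1"),
                    new TaggedValue(3, "vmax"),  // the freshest pair (vmax, tmax)
                    new TaggedValue(2, "v2"));
            System.out.println(read(replies));   // prints vmax
        }
    }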

  14. “Readers Must Write” • If operations are concurrent and a reader simply returns the latest value, then atomicity can be violated • [Figure: Write(v1) happens to be slow and has reached only one replica; one read returns v1 while a later read still returns v0 – a new/old inversion] • Solution: If readers first help the writer to record the value in a quorum, then it is safe to return the latest value

  15. Lastly: “Writers Must Read” • Writers can “read” before writing to obtain the latest timestamp, in order to compute a new timestamp • [Cartoon: two writers racing to claim the next timestamp – “I assume you’re being facetious, Josef, I distinctly yelled ‘second’ before you did!”]

  16. The ABD Algorithm [Attiya, Bar-Noy, Dolev]

  17. The Complete Algorithm [Attiya, Bar-Noy, Dolev] • Read and write use identical two-phase communication patterns: • Get phase: queries and obtains values from a quorum • Put phase: propagates values to a quorum • The only difference is in what is sent out in the Put phase • Replica hosts respond to Get and Put requests
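
The two-phase pattern can be sketched in Java as follows (hypothetical interfaces, reusing the TaggedValue record above; real ABD exchanges asynchronous messages and proceeds once a quorum of replicas has replied, rather than calling every replica synchronously):

    import java.util.*;

    // Replica host: answers Get with its current pair, adopts newer pairs on Put.
    interface Replica {
        TaggedValue get();                       // Get phase request
        void put(TaggedValue tv);                // Put phase request
    }

    final class AbdClient {
        private final List<Replica> quorum;      // stand-in for "a quorum replied"

        AbdClient(List<Replica> quorum) { this.quorum = quorum; }

        // Read: Get from a quorum, pick the max-tag pair, then Put it back to
        // a quorum before returning -- "readers must write".
        String read() {
            TaggedValue freshest = quorum.stream().map(Replica::get)
                    .max(Comparator.comparingLong(TaggedValue::timestamp))
                    .orElseThrow();
            quorum.forEach(r -> r.put(freshest));
            return freshest.value();
        }

        // Write: Get from a quorum to learn the max tag -- "writers must
        // read" -- then Put the new value with a strictly larger tag.
        void write(String v) {
            long maxTag = quorum.stream().map(Replica::get)
                    .mapToLong(TaggedValue::timestamp).max().orElseThrow();
            quorum.forEach(r -> r.put(new TaggedValue(maxTag + 1, v)));
        }
    }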

  18. New Frontiers: Dynamic Atomic Memory • Goal: Develop and analyze algorithms to solve problems of coherent data sharing in dynamic distributed environments • “Dynamic” encompasses: • Changing sets of participants: nodes come and go as they please • A wide range of failures • Asynchrony, timing variations • Crashes, message loss, weak delivery guarantees • Changes in network topology • Processor mobility

  19. Approach: Dynamic Quorums [Lynch, Shvartsman] • Objects are replicated at several network locations • To accommodate small, transient changes: • Use quorum configurations: members, quorums • Maintains atomicity during “normal operation” • Allows concurrency • To handle larger, more permanent changes: • Reconfigure • Maintains atomicity across configuration changes • Any configuration can be installed at any time • Reconfigure concurrently with reads/writes – no heavy-handed view change

  20. Algorithmic Ideas • Reconfigurable quorum systems • Quorums maintain consistency during modest and transient changes • Reconfigurations accommodate more drastic and permanent changes • Read/write operations are frequent • Use quorum access and allow concurrency • Isolate from reconfiguration • Reconfigurations are infrequent • Use consensus to impose total order (Paxos) • Optimistic dissemination without formal installation • Conservative garbage collection of obsolete configurations

  21. Related Other Approaches • Using consensus to agree on each operation [Lamport] • Performance overhead for each read and write operation • Termination of operations depends on consensus • Use a group communication service [Birman 87] with totally ordered broadcast [Amir, Dolev, Melliar-Smith, Moser 94], [Keidar, Dolev 96] • View change delays reads/writes • One change may trigger view formation • DynaStore: Reconfiguration without consensus [Aguilera, Keidar, Malkhi, Shraer 11] • Initial quorum system, incremental adds/removes • Changes yield DAGs of possibilities • Reads/writes use ABD-like phases, traverse DAGs • Assuming finitely many reconfigurations guarantees termination

  22. GCS: Group Communication Services (a Detour) • The pioneering work of Ken Birman & Co. made GCSs one of the most important distributed-system building blocks • GCS: group membership + multicast communication • Isis, Transis, Consul, Ensemble, Totem, Horus, Newtop, Relacs … • Commercial variants of Birman’s Isis were used in • The French air traffic control system • A stock exchange on Wall Street • The Swiss Electronic Bourse • For years GCSs defied a sufficiently formal specification that would both • Detail the utility of the services and • Enable compelling reasoning about correctness

  23. View-Synchrony [Fekete, Lynch, Shvartsman 01] • A new, simple formal specification for a partitionable view-oriented group communication service • To demonstrate the value of our specification, we used it to construct an algorithm for an ordered-broadcast application that reconciles information derived from different views • Atomic memory is then easily obtained • Our algorithm is based on algorithms of Amir, Dolev, Keidar, Melliar-Smith and Moser • Our specification has a simple implementation, based on the membership algorithm of Cristian and Schmuck

  24. View-Synchrony [Fekete, Lynch, Shvartsman 01] • We proved the correctness and analyzed the performance and fault-tolerance of the algorithm • Our work on GCS was (we believe) the first successful formal specification + rigorous proofs • Safety: invariants and simulation • Conditional performance analysis • Myla Archer @ NRL checked several invariants using PVS – an invaluable contribution to perfecting the proofs • Despite perhaps only modest algorithmic advances the paper was/is highly cited – the value is in the proofs!

  25. Our Ground-Up Approach: RAMBO • Reconfigurable Atomic Memory for Basic Objects [Lynch, Shvartsman 02], [Gilbert, Lynch, Shvartsman 10] • For small, transient changes, use quorum configurations • Maintain atomicity and allow concurrency, à la ABD • For larger, permanent changes, reconfigure: • Dynamically install any quorum system • Guarantee atomicity across configurations • Guarantees: • Linearizability in all cases • Good latency in common cases • High availability • Everything proceeds concurrently • Uses low-level gossip communication

  26. Methodology • Specify algorithm • Interacting state machines • Using non-deterministic “gossip” • Show correctness/safety for • arbitrary patterns of asynchrony • assuming arbitrary crash-failures and message loss • Analyze performance for a subset of timed executions • Bounded message delay, 0-time local processing • Some “gossip” becomes deliberate, some periodic • Non-failure of certain quorums for certain periods • Reason about operation latency • Of course none of this impacts safety

  27. RAMBO: Reconfigurable Atomic Memory for Basic Objects • Global service specification • Implementation: main algorithm + “Recon” service • Recon service: • “Advance reconnaissance” • Consistent sequence of configurations • Loosely coupled • Main algorithm: • Reading, writing • Receives and disseminates new configuration information; no explicit installation • Reconfigures: upgrades to new and removes old configurations • Reads/writes may use several configurations

  28. Architectural View • [Diagram: an Application runs above RAMBO; each node i, j hosts a Joiner and a Reader-Writer (R/W) component, alongside the Recon service, all communicating over the network]

  29. High-Level Functions • Joiner • Introduces new participants to the system • Reader-Writer • Routine read and write operations • Two-phase algorithm using all “known” configurations • Uses tags • Recon • Chooses the new next configuration, e.g., using Paxos • Informs members of the previous configuration • Configuration upgrade (“packaged” with Reader-Writer) • Identifies and removes obsolete configurations (garbage collection)

  30. Configurations and Reconfiguration • Configuration: a quorum system • A collection of subsets of replica host ids where any two subsets intersect • Or a collection of read-quorums and write-quorums, where any read-quorum intersects any write-quorum • The reconfiguration process involves two decoupled steps • Recon: emitting a new configuration, then, at some later time • Garbage-collecting obsolete configurations while “upgrading” to the latest (locally) known configuration • No constraints on the membership of distinct quorum systems
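
A configuration and its one safety check, sketched as a Java record (hypothetical names):

    import java.util.*;

    // A configuration: members plus read-quorums and write-quorums over them.
    // The only safety requirement is that every read-quorum intersects every
    // write-quorum; distinct configurations may have entirely disjoint members.
    record Configuration(Set<Integer> members,
                         Set<Set<Integer>> readQuorums,
                         Set<Set<Integer>> writeQuorums) {

        boolean wellFormed() {
            for (Set<Integer> r : readQuorums) {
                for (Set<Integer> w : writeQuorums) {
                    Set<Integer> meet = new HashSet<>(r);
                    meet.retainAll(w);
                    if (meet.isEmpty()) return false;  // some r ∩ w = Ø
                }
            }
            return true;
        }
    }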

  31. Implementation of Recon • Uses distributed consensus to determine new configurations 1, 2, 3, … • Note: when the universe of configurations is finite and known, consensus is not needed even with unbounded reconfiguration [GeoQuorums] • Members of the old configuration may propose a new configuration • Proposals are reconciled using consensus • Consensus is a fairly heavy mechanism, but it is • Used only for reconfigurations, which are infrequent • Does not delay read/write operations
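
The decoupling might look like this in Java (a hypothetical interface, reusing the Configuration record above; RAMBO’s Recon uses Paxos-style consensus, one instance per configuration index):

    // One consensus instance per configuration index k: all members of
    // configuration k agree on what configuration k+1 will be.
    interface Consensus<T> {
        void propose(int instance, T value);     // members of c(k) may propose
        T decide(int instance);                  // the same decision everywhere
    }

    final class Recon {
        private final Consensus<Configuration> consensus;

        Recon(Consensus<Configuration> consensus) { this.consensus = consensus; }

        // Reconcile competing proposals for configuration k+1. This may be
        // slow, but it only delays reconfiguration, never reads or writes.
        Configuration reconfigure(int k, Configuration proposal) {
            consensus.propose(k + 1, proposal);
            return consensus.decide(k + 1);
        }
    }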

  32. Configurations and Config Maps • Configuration c • members(c) – the set of members of configuration c • read-quorums(c), write-quorums(c) – the sets of quorums • Configuration map cm • A mapping from the naturals to configurations • cm(k) is configuration k • cm(k) can be defined (a configuration c), undefined (⊥), or garbage-collected (±) • [Figure: a config map as a sequence – a garbage-collected (±) prefix, then defined configurations possibly mixed with ⊥, then an undefined (⊥) tail]
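
The three entry states can be captured directly in Java (a hypothetical rendering, again reusing the Configuration record):

    import java.util.*;

    // A configuration map: index k is Undefined (the ⊥ on the slide),
    // Defined with some configuration, or GarbageCollected (the ± marker).
    final class ConfigMap {
        sealed interface Entry permits Undefined, Defined, GarbageCollected {}
        record Undefined() implements Entry {}
        record Defined(Configuration c) implements Entry {}
        record GarbageCollected() implements Entry {}

        private final Map<Integer, Entry> cm = new HashMap<>();

        Entry get(int k) { return cm.getOrDefault(k, new Undefined()); }
        void define(int k, Configuration c) { cm.put(k, new Defined(c)); }
        void garbageCollect(int k) { cm.put(k, new GarbageCollected()); }
    }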

  33. Configuration Upgrade [Gilbert, Lynch, S 10] • Reconfigure to the last configuration ck in a contiguous segment cj … ck • Phase 1: • Informs a write-quorum of each of cj … ck-1 about ck • Collects (value, tag) pairs from read-quorums of cj … ck-1 • Phase 2: • Propagates the latest (value, tag) to a write-quorum of ck • Garbage-collect: set cmap(j … k-1) to ± • Constant-time upgrade regardless of the number of obsolete configurations (conditioned on failures) • Maintains good read/write latency during system instability or frequent reconfigurations
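
A compressed Java sketch of the two phases (hypothetical helpers, reusing TaggedValue and ConfigMap from the earlier sketches; communication with each configuration’s quorums is abstracted into the arguments):

    import java.util.*;
    import java.util.function.Consumer;

    final class Upgrade {
        // oldQuorumReplies: (value, tag) pairs collected from a read-quorum of
        // each of c_j .. c_{k-1} (phase 1); writeQuorumOfCk: propagation to a
        // write-quorum of c_k (phase 2). One pass, however many obsolete
        // configurations there are.
        static void run(ConfigMap cmap, int j, int k,
                        List<Collection<TaggedValue>> oldQuorumReplies,
                        Consumer<TaggedValue> writeQuorumOfCk) {
            TaggedValue freshest = oldQuorumReplies.stream()
                    .flatMap(Collection::stream)
                    .max(Comparator.comparingLong(TaggedValue::timestamp))
                    .orElseThrow();
            writeQuorumOfCk.accept(freshest);    // phase 2: propagate to c_k
            for (int i = j; i < k; i++) {
                cmap.garbageCollect(i);          // set cmap(j .. k-1) to ±
            }
        }
    }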

  34. Configuration Map Changes (Local View) • [Figure: the local cmap evolving over time – initially only c0 is defined; new configurations c1, c2, …, ck become defined as they are learned; garbage collection then replaces the prefix c0, c1, … with ±, leaving a ± prefix, a defined segment, and an undefined (⊥) tail]

  35. On to Reads and Writes: Values and Tags • Each value v has an associated tag t (a logical timestamp) • A tag is a (sequence, processor-id) pair • Reads: • A set of value-tag pairs is obtained • The result is the value with the maximum tag • Writes: • A set of value-tag pairs is obtained • new-value is propagated with a new-tag that is a lexicographic increment of the maximum tag: new-tag := ⟨tag.seq + 1, pid⟩
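
The tag structure in Java (a sketch; names hypothetical). Including the writer’s pid breaks ties between concurrent writers that pick the same sequence number:

    // A tag: (sequence, processor id), compared lexicographically. A writer's
    // new tag increments the largest sequence it has seen and appends its own
    // pid, so tags produced by distinct writers never collide.
    record Tag(long seq, int pid) implements Comparable<Tag> {
        @Override
        public int compareTo(Tag other) {
            int bySeq = Long.compare(seq, other.seq);
            return bySeq != 0 ? bySeq : Integer.compare(pid, other.pid);
        }

        // new-tag := <tag.seq + 1, writerPid>
        Tag increment(int writerPid) {
            return new Tag(seq + 1, writerPid);
        }
    }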

  36. Dynamic Reader-Writer and Recon • The work is split between Reader-Writer and Recon • Recon emits consistent configurations • Reader-Writer processes run a two-phase quorum-based algorithm, with all “active” configurations • Background “gossip” builds fixed points • If Recon emits a new configuration, Reader-Writer continues reads/writes in progress, until a fixed point is reached

  37. Processing Reads and Writes • Reads and writes perform Query and Propagation phases using known configurations, stored in op.cmap • Query: gets “fresh” value, tag, and cmap information from read-quorums • Propagation: gives the latest (value, tag) to write-quorums • Both phases: extend op.cmap with newly discovered configurations that now must also be involved • Each phase ends with a fixed point, involving all the configurations currently in op.cmap • [Figure: a read or write operation as Start Query → End Query (fixed point) → Start Prop → End Prop (fixed point)]
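
A sketch of the fixed-point loop in Java (hypothetical names; configuration indices stand in for the configurations in op.cmap):

    import java.util.*;

    final class PhaseLoop {
        interface QuorumCall {
            // Contact a quorum of configuration k; replies may carry indices
            // of configurations this node had not yet heard of.
            Set<Integer> contact(int k);
        }

        // Run one phase (Query or Propagation) to its fixed point: whenever
        // the replies reveal a new configuration, extend op.cmap and start
        // another round, so quorums of ALL configurations in op.cmap are
        // covered before the phase ends.
        static void runToFixedPoint(Set<Integer> opCmap, QuorumCall call) {
            boolean extended = true;
            while (extended) {
                extended = false;
                for (int k : new ArrayList<>(opCmap)) {
                    if (opCmap.addAll(call.contact(k))) {
                        extended = true;
                    }
                }
            }
        }
    }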

  38. Methodology • Algorithms are presented formally, using interacting state machine models (Input/Output Automata): • Service specifications • Algorithm descriptions • Models for applications • Safety: rigorous proof of correctness (atomicity) for arbitrary patterns of asynchrony and change • Conditional performance analysis • E.g., when message latency < d and quorum configurations are “viable”, read and write operations take time between 4d and 8d, under reasonable “steady-state” assumptions

  39. Input/Output Automata (IOA) • Input/Output Automata [Lynch & Tuttle] • A language and formalism based on a simple mathematical foundation • Elegant, suitable for analysis, well documented • Supports composition, abstraction • Model system components as a collection of interacting state machines • Allows rigorous reasoning about system properties • Widely used in research – hundreds of algorithms

  40. Example Specification: Asynchronous Lossy Channel • [Diagram: automaton Channel_i,j with Input send_i,j(m), Output recv_j,i(m), and Internal lose(m)]
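
A plain-Java rendering of that automaton (a hypothetical sketch, not the IOA code from the talk): the state is the multiset of messages in transit; send and recv are the input and output actions, and lose is the internal action that may drop any in-transit message. Receiving picks an arbitrary in-transit message, since the channel guarantees no ordering:

    import java.util.*;

    final class LossyChannel<M> {
        private final List<M> inTransit = new ArrayList<>();  // messages in flight
        private final Random rng = new Random();

        void send(M m) {                         // input action send_i,j(m)
            inTransit.add(m);
        }

        Optional<M> recv() {                     // output action recv_j,i(m)
            if (inTransit.isEmpty()) return Optional.empty();
            return Optional.of(inTransit.remove(rng.nextInt(inTransit.size())));
        }

        void lose() {                            // internal action lose(m)
            if (!inTransit.isEmpty()) {
                inTransit.remove(rng.nextInt(inTransit.size()));
            }
        }
    }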

  41. Details: Reader-Writer Send and Recv Code • [Figure: the Send and Receive transitions – the specification of gossip using the Input/Output Automata of [Lynch, Tuttle]]

  42. Details: Reader-Writer Fixed Points • [Figure: the Phase 1 and Phase 2 fixed-point predicates, specified using Input/Output Automata]

  43. Showing Read/Write Atomicity • We show atomicity using a partial order • Let β be any sequence of reads and writes • Let ≺ be an irreflexive partial order of all operations in β • To prove atomicity, show that ≺ satisfies: • For any operation π, there are finitely many operations φ with φ ≺ π • If π precedes φ in β, then it is not the case that φ ≺ π • If π is a write and φ is any operation, then either π ≺ φ or φ ≺ π • Any read returns the value written by the last preceding write, per ≺ (or the initial value if there is no such write) [Lynch, Lemma 13.16] • Rigorous proof that in RAMBO the max-tag of reads and the new-tag of writes, lexicographically ordered, provide such a partial order

  44. Some Latency Analysis Details • Restrict attention to a subset of timed executions • Reminder: read and write operations are not affected by Recon delays or non-termination • Configuration upgrade (garbage collection) takes 4d • If reconfigurations are “rare”, operations take 4d • If configurations are in “steady state”, operations take 8d • Logarithmic-time “catch-up” after a bursty period of “bad timing behavior” • A recovering node joins quickly after a long absence

  45. Implementation • Experimental system implementations [Musial 07] • Platform for refinement, optimization, tuning • Observation of the algorithms in a local-area setting • A cluster of about 16 Linux machines and a fast switch • Developed by manually translating the Input/Output Automata specification to Java code • Precise rules are followed to mitigate the introduction of errors during translation • Rigorous proofs [Georgiou, Musial, Sh., Sonderegger 07, 11] • Next steps: • Specification in Tempo [Lynch, Michel, S 08] (Timed IOA) • Code generation [Georgiou, Lynch, Mavrommatis, Tauber 09]

  46. Optimizations – Rigorously Proved • Parallel upgrade [Gilbert, Lynch, S 10] • Improving communication efficiency [Georgiou, Musial, S 06] • Explicit “leave” service • Incremental gossip for long-lived data • Improving liveness & gossip efficiency [Gramoli, Musial, S 05] • Non-deterministic heuristic for prodding slow operations • Constrained gossip patterns • Providing complete shared memory [Georgiou, Musial, S 05] • Domain-oriented Rambo • Optimizing performance for groups of related objects

  47. Optimization and Development Methodology • [Diagram: Abstract Rambo [Lynch, S 02], [Gilbert, L, S 10] is proved to satisfy the atomicity properties; a simulation relates it to Rambo with graceful departures [Georgiou, Musial, S 04, 06], and a further simulation to Long-Lived Rambo; manual derivation [Musial 07] or semi-automated code generation [Georgiou + 09, 10] then yields the running system]

  48. Forward Simulation • Let C (low-level, concrete) and A (high-level, abstract) be two systems. The goal is to show that “C implements A” by trace inclusion • An execution is a finite or infinite alternating sequence s0 a1 s1 a2 s2 … of states and actions, where s0 is a start state • A trace is obtained by removing all states and internal actions • Forward simulation f from C to A: • f ⊆ states(C) × states(A) • If s ∈ start(C) then f(s) ∈ start(A) • If (s, a, s′) ∈ transitions(C), then there exists an execution fragment α = f(s), …, f(s′) such that trace(a) = trace(α) • Theorem: If there exists a simulation relation from C to A, then traces(C) ⊆ traces(A) • This implies that any safety property of A is also satisfied by C

  49. Simulation (or Abstraction) Function • A simulation function F maps each state of specification C (concrete) to a state of specification A (abstract) • F is required to satisfy the following two conditions: 1. If s is an initial state of C then F(s) is an initial state of A 2. If s and F(s) are reachable states of C and A respectively, and (s, p, s′) is a step of C, then there is a step p′ of A from F(s) to F(s′), having the same trace • [Diagram: states s, s′ of C with step p below, matched above by step p′ of A from F(s) to F(s′)]

  50. Optimization: Improving performance • Long-Lived RAMBO: Graceful Leave + Incremental Gossip • Rigorous proof of correctness by simulation • Performance study • [Georgiou, Musial, Shvartsman 06]
