Reliable Group Communication: a Mathematical Approach

Nancy Lynch, Theory of Distributed Systems, MIT LCS. Kansai chapter, IEEE, July 7, 2000

Dynamic Distributed Systems • Modern distributed systems are dynamic. • Set of clients participating in an application changes, because of: • Network, processor failure, recovery • Changing client requirements • To cope with changes: • Use abstract groups of client processes with changing membership sets. • Processes communicate with group members by sending messages to the group as a whole.

Group Communication Services • Support management of groups • Maintain membership info • Manage communication • Make guarantees about ordering, reliability of message delivery, e.g.: • Best-effort: IP Multicast • Strong consistency guarantees: Isis, Transis, Ensemble • Hide complexity of coping with changes

This Talk • Describe • Group communication systems • A mathematical approach to designing, modeling, analyzing GC systems. • Our accomplishments and ideas for future work. • Collaborators: Idit Keidar, Alan Fekete, Alex Shvartsman, Roger Khazan, Roberto De Prisco, Jason Hickey, Robert van Renesse, Carl Livadas, Ziv Bar-Joseph, Kyle Ingols, Igor Tarashchanskiy

Talk Outline I. Background: Group Communication II. Our Approach III. Projects and Results 1. View Synchrony 2. Ensemble 3. Dynamic Views 4. Scalable Group Communication IV. Future Work V. Conclusions

I. Background: Group Communication

The Setting • Dynamic distributed system, changing set of participating clients. • Applications: • Replicated databases, file systems • Distributed interactive games • Multi-media conferencing, collaborative work • …

Groups • Abstract, named groups of client processes, changing membership. • Client processes send messages to the group (multicast). • Early 80s: Group idea used in replicated data management system designs • Late 80s: Separate group communication services.

Group Communication Service • Communication middleware • Manages group membership, current views (view = membership set + identifier) • Manages multicast communication among group members • Multicasts respect views • Guarantees within each view: • Reliability constraints • Ordering constraints, e.g., FIFO from each sender, causal, common total order • Global service

[Figure: Group Communication Service (GCS) between Client A and Client B, with mcast, receive, and new-view actions at each client.]

Isis [Birman, Joseph 87] • Primary component group membership • Several reliable multicast services, different ordering guarantees, e.g.: • Atomic Broadcast: Common total order, no gaps • Causal Broadcast: • When partition is repaired, primary processes send state information to rejoining processes. • Virtually Synchronous message delivery

Example: Interactive Game • Alice, Bob, Carol, Dan in view {A,B,C,D} • Primary component membership • {A}{B,C,D} split; only {B,C,D} may continue. • Atomic Broadcast • A fires, B moves away; need consistent order

Interactive Game • Causal Broadcast • C sees A enter a room; locks door. • Virtual Synchrony • {A}{B,C,D} split; B sees A shoot; so do C, D.

Applications • Replicated data management • State machine replication [Lamport 78] , [Schneider 90] • Atomic Broadcast provides support • Same sequence of actions performed everywhere. • Example: Interactive game state machine • Stock market • Air-traffic control
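The state machine replication idea above can be illustrated with a toy version of the interactive game. This is only a sketch: the state layout, the `step` and `replay` functions, and the action tuples are hypothetical names chosen for illustration, not part of any system named in the talk. It shows that replicas applying the same totally ordered sequence of actions reach the same state, while a different order ("A fires, B moves away") can diverge.

```python
def step(state, action):
    """Deterministic transition function of a toy game state machine."""
    kind, who, arg = action
    new = {"pos": dict(state["pos"]), "hit": set(state["hit"])}
    if kind == "move":
        # The player moves to square `arg`.
        new["pos"][who] = arg
    elif kind == "fire":
        # Every other player currently standing on square `arg` is hit.
        for player, where in new["pos"].items():
            if player != who and where == arg:
                new["hit"].add(player)
    return new


def replay(initial, actions):
    """Apply a sequence of actions in order, starting from `initial`."""
    state = initial
    for action in actions:
        state = step(state, action)
    return state


initial = {"pos": {"A": 0, "B": 5}, "hit": set()}
fire = ("fire", "A", 5)   # A fires at square 5
move = ("move", "B", 6)   # B moves away to square 6

# Two replicas that receive the same total order agree.
replica1 = replay(initial, [fire, move])
replica2 = replay(initial, [fire, move])

# A replica that saw the opposite order ends in a different state.
diverged = replay(initial, [move, fire])
```

Because `step` is deterministic, agreement on the order of actions is all the replicas need; that ordering is what Atomic Broadcast is said to provide.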

Transis [Amir, Dolev, Kramer, Malkhi 92] • Partitionable group membership • When components merge, processes exchange state information. • Virtual synchrony reduces amount of data exchanged. • Applications • Highly available servers • Collaborative computing, e.g. shared whiteboard • Video, audio conferences • Distributed jam sessions • Replicated data management [Keidar, Dolev 96]

Other Systems • Totem [Amir, Melliar-Smith, Moser, et al., 95] • Transitional views, useful with virtual synchrony • Horus [Birman, van Renesse, Maffeis 96] • Ensemble [Birman, Hayden 97] • Layered architecture • Composable building blocks • Phoenix, Consul, RMP, Newtop, RELACS, … • Partitionable

Service Specifications • Precise specifications needed for GC services • Help application programmers write programs that use the services correctly, effectively • Help system maintainers make changes correctly • Safety, performance, fault-tolerance • But difficult: • Many different services; different guarantees about membership, reliability, ordering • Complicated • Specs based on implementations might not be optimal for application programmers.

Early Work on GC Service Specs • [Ricciardi 92] • [Jahanian, Fakhouri, Rajkumar 93] • [Moser, Amir, Melliar-Smith, Agrawal 94] • [Babaoglu et al. 95, 98] • [Friedman, van Renesse 95] • [Hiltunen, Schlichting 95] • [Dolev, Malkhi, Strong 96] • [Cristian 96] • [Neiger 96] • Impossibility results [Chandra, Hadzilacos, et al. 96] • But still difficult…

II. Our Approach

Approach • Model everything: • Applications • Requirements, algorithms • Service specs • Work backwards, see what the applications need • Implementations of the services • State, prove correctness theorems: • For applications, implementations. • Methods: Composition, invariants, simulation relations • Analyze performance, fault-tolerance. • Layered proofs, analyses

Math Foundation: I/O Automata • Nondeterministic state machines • Not necessarily finite-state • Input/output/internal actions (signature) • Transitions, executions, traces • System modularity: • Composition, respecting traces • Levels of abstraction, respecting traces • Language-independent, math model

Typical Examples Modeled • Distributed algorithms • Communication protocols • Distributed data management systems

Modeling Style • Describe interfaces, behavior • Program-like behavior descriptions: • Precondition/effect style • Pseudocode or IOA language • Abstract models for algorithms, services • Model several levels of abstraction, • High-level, global service specs … • Detailed distributed algorithms

Modeling Style • Very nondeterministic: • Constrain only what must be constrained. • Simpler • Allows alternative implementations

Describing Timing Features • TIOAs [Lynch, Vaandrager 93] • For describing: • Timeout-based algorithms. • Clocks, clock synchronization • Performance properties

Describing Failures • Basic or timed I/O automata, with fail, recover input actions. • Included in traces, can use them in specs.

Describing Other Features • Probabilistic behavior: PIOAs [Segala 95] • For describing: • Systems with combination of probabilistic + nondeterministic behavior • Randomized distributed algorithms • Probabilistic assumptions on environment • Dynamic systems: DIOAs [Attie, Lynch 99] • For describing: • Run-time process creation and destruction • Mobility • Agent systems [NTT collaboration]

Using I/O Automata (General) • Specify systems precisely • Validate designs: • Simulation • State, prove correctness theorems • Analyze performance • Generate validated code • Study theoretical upper and lower bounds

Using I/O Automata for Group Communication Systems • Use for global services + distributed algorithms • Define safety properties separately from performance/fault-tolerance properties. • Safety: • Basic I/O automata; trace properties • Performance/fault-tolerance: • Timed I/O automata with failure actions; timed trace properties

III. Projects and Results

Projects 1. View Synchrony 2. Ensemble 3. Dynamic Views 4. Scalable Group Communication

1. View Synchrony (VS) [Fekete, Lynch, Shvartsman 97, 00] Goals: • Develop prototypes: • Specifications for typical GC services • Descriptions for typical GC algorithms • Correctness proofs • Performance analyses • Design simple math foundation for the area. • Try out, evaluate our approach.

View Synchrony What we did: • Talked with system developers (Isis, Transis) • Defined I/O automaton models for: • VS, prototype partitionable GC service • TO, non-view-oriented totally ordered bcast service • VStoTO, application algorithm based on [Amir, Dolev, Keidar, Melliar-Smith, Moser] • Proved correctness • Analyzed performance/fault-tolerance.

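VStoTO Architecture [Figure: VStoTO processes layered over the VS service, together forming TO. Interface actions shown: bcast, brcv (TO) and gpsnd, gprcv, newview (VS).]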

TO Broadcast Specification
Delivers messages to everyone, in the same order.
Safety: TO-Machine

Signature:
  input: bcast(a,p)
  output: brcv(a,p,q)
  internal: to-order(a,p)
State:
  queue, sequence of (a,p), initially empty
  for each p:
    pending[p], sequence of a, initially empty
    next[p], positive integer, initially 1

TO-Machine Transitions:
  bcast(a,p)
    Effect: append a to pending[p]
  to-order(a,p)
    Precondition: a is head of pending[p]
    Effect: remove head of pending[p]
            append (a,p) to queue
  brcv(a,p,q)
    Precondition: queue[next[q]] = (a,p)
    Effect: next[q] := next[q] + 1
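The TO-Machine specification above translates fairly directly into executable form. The following is a minimal sketch, assuming a Python class `TOMachine` (a hypothetical name) whose methods mirror the three actions; the `*_enabled` helpers are an illustrative way to express preconditions and are not part of the original specification. The queue is 1-indexed in the spec, so the code subtracts 1 when indexing.

```python
from collections import defaultdict


class TOMachine:
    """Executable sketch of the TO-Machine safety specification."""

    def __init__(self):
        self.queue = []                     # sequence of (a, p), initially empty
        self.pending = defaultdict(list)    # pending[p]: sequence of a
        self.next = defaultdict(lambda: 1)  # next[p]: positive integer, initially 1

    def bcast(self, a, p):
        """Input action: always enabled."""
        self.pending[p].append(a)

    def to_order_enabled(self, a, p):
        """Precondition: a is head of pending[p]."""
        return bool(self.pending[p]) and self.pending[p][0] == a

    def to_order(self, a, p):
        """Internal action: move a from pending[p] to the global queue."""
        assert self.to_order_enabled(a, p), "precondition violated"
        self.pending[p].pop(0)
        self.queue.append((a, p))

    def brcv_enabled(self, a, p, q):
        """Precondition: queue[next[q]] = (a, p), with 1-based indexing."""
        i = self.next[q]
        return i <= len(self.queue) and self.queue[i - 1] == (a, p)

    def brcv(self, a, p, q):
        """Output action: deliver (a, p) to q and advance q's pointer."""
        assert self.brcv_enabled(a, p, q), "precondition violated"
        self.next[q] += 1


# One possible execution: the nondeterministic order of to-order steps
# fixes the total order that every receiver then observes.
m = TOMachine()
m.bcast("x", "p1")
m.bcast("y", "p2")
m.to_order("y", "p2")
m.to_order("x", "p1")

delivered = {"p1": [], "p2": []}
for q in delivered:
    while m.next[q] <= len(m.queue):
        a, p = m.queue[m.next[q] - 1]
        m.brcv(a, p, q)
        delivered[q].append((a, p))
```

Note that the machine never forces a step: any enabled action may occur, which matches the "very nondeterministic" modeling style, and the per-receiver `next` pointer is what rules out gaps.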

Performance/Fault-Tolerance TO-Property(b,d,C): If C stabilizes, then soon thereafter (time b), any message sent or received anywhere in C is received everywhere in C, within bounded time (time d). [Figure: timeline with stabilize, send, and receive events and intervals b and d.]

VS Specification • Partitionable view-oriented service • Safety: VS-Machine • Views presented in consistent order, possible gaps • Messages respect views • Messages in consistent order • Causality • Prefix property • Safe indication • Doesn’t guarantee Virtual Synchrony • Like TO-Machine, but per view

Performance/Fault-Tolerance VS-Property(b,d,C): If C stabilizes, then soon thereafter (time b), views known within C become consistent, and messages sent in the final view v are delivered everywhere in C, within bounded time (time d). [Figure: timeline with stabilize, newview(v), mcast(v), and receive(v) events and intervals b and d.]

VStoTO Algorithm • TO must deliver messages in order, no gaps. • VS delivers messages in order per view. • Problems arise from view changes: • Processes moving between views could have different prefixes. • Processes could skip views. • Algorithm: • Real work done in majority views only • Processes in majority views totally order messages, and deliver to clients messages that VS has said are safe. • At start of new view, processes exchange state, to reconcile progress made in different majority views.
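The "majority views only" rule can be sketched as a simple membership test. This is an illustration under assumptions, not the VStoTO algorithm itself: `is_majority_view` is a hypothetical helper, and the fixed `universe` of four processes is borrowed from the interactive game example.

```python
def is_majority_view(members, universe):
    """True if the view contains a strict majority of the fixed universe."""
    return 2 * len(set(members) & set(universe)) > len(universe)


universe = {"A", "B", "C", "D"}

# After the {A}{B,C,D} split, only one side may order and deliver messages.
views_after_split = [{"A"}, {"B", "C", "D"}]
active = [sorted(v) for v in views_after_split if is_majority_view(v, universe)]

# An even split leaves no majority view, so no side makes progress.
even_split = [{"A", "B"}, {"C", "D"}]
stalled = [sorted(v) for v in even_split if is_majority_view(v, universe)]
```

Since any two strict majorities of the same universe share a member, at most one component can be doing "real work" at a time, and a shared member can carry state forward into the state exchange at the next majority view.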

Correctness (Safety) Proof • Show composition of VS-Machine and VStoTO machines implements TO-Machine. • Trace inclusion • Use simulation relation proof: • Relate start states, steps of composition to those of TO-Machine • Invariants, e.g.: Once a message is ordered everywhere in some majority view, its order is determined forever. • Checked using PVS theorem-prover, TAME [Archer]

Conditional Performance Analysis • Assume VS satisfies VS-Property(b,d,C): • If C stabilizes, then within time b, views known within C become consistent, and messages sent in the final view are delivered everywhere in C, within time d. • And VStoTO satisfies: • Simple timing and fault-tolerance assumptions. • Then TO satisfies TO-Property(b+d,d,C): • If C stabilizes, then within time b+d, any message sent or delivered anywhere in C is delivered everywhere in C, within time d.
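The conditional claim above can be written compactly as an implication. The numeric values in the second line are hypothetical, chosen only to show how the parameters combine.

```latex
\[
\text{VS-Property}(b, d, C) \;\wedge\; \text{(timing and fault-tolerance assumptions on VStoTO)}
\;\Longrightarrow\; \text{TO-Property}(b + d,\, d,\, C)
\]
% Hypothetical values: b = 2 s, d = 0.5 s.
\[
b = 2\,\mathrm{s},\quad d = 0.5\,\mathrm{s}
\quad\Longrightarrow\quad
\text{TO stabilization bound } b + d = 2.5\,\mathrm{s},\quad \text{delivery bound } d = 0.5\,\mathrm{s}
\]
```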

Conclusions: VS • Models for VS, TO, VStoTO • Proofs, performance/f-t analyses • Tractable, understandable, modular • [PODC 97], [TOCS 00] • Follow-on work: • Algorithm for VS [Fekete, Lesley] • Load balancing using VS [Khazan] • Models for other Transis algorithms [Chockler] • But: VS is only a prototype; lacks some key features, like Virtual Synchrony • Next: Try a real system!

2. Ensemble [Hickey, Lynch, van Renesse 99] Goals: • Try, evaluate our approach on a real system • Develop techniques for modeling, verifying, analyzing more features of GC systems, including Virtual Synchrony • Improve on prior methods for system validation

Ensemble • Ensemble system [Birman, Hayden 97] • Virtual Synchrony • Layered design, building blocks • Coded in ML [Hayden] • Prior verification work for Ensemble and predecessors: • Proving local properties using Nuprl [Hickey] • [Ricciardi], [Friedman]

Ensemble • What we did: • Worked with developers • Followed VS example • Developed global specs for key layers: • Virtual Synchrony • Total Order with Virtual Synchrony • Modeled Ensemble algorithm spanning between layers • Attempted proof; found logical error in state exchange algorithm (repaired) • Developed models, proofs for repaired system

Conclusions: Ensemble • Models for two layers, algorithm • Tractable, easily understandable by developers • Error, proofs • Low-level models similar to actual ML code (4 to 1) • [TACAS 99] • Follow-on: • Same error found in Horus. • Incremental models, proofs [Hickey] • Next: Use our approach to design new services.

3. Dynamic Views [De Prisco, Fekete, Lynch, Shvartsman 98] Goals: • Define GC services that cope with both: • Long-term changes: • Permanent failures, new joins • Changes in the “universe” of processes • Transient changes • Use these to design consistent total order and consistent replicated data algorithms that tolerate both long-term and transient changes.

Dynamic Views • Many applications with strong consistency requirements make progress only in primary views: • Consistent replicated data management • Totally ordered broadcast • Can use static notion of allowable primaries, e.g., majorities of universe, quorums • All intersect. • Only one exists at a time. • Information can flow from each to the next. • But: Static notion not good for long-term changes

Dynamic Views • For long-term changes, want dynamic notion of allowable primaries. • E.g., each primary might contain majority of previous. • But: Some might not intersect. Makes it hard to maintain consistency.
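The contrast between static and dynamic primaries can be checked on a small example. The sketch below is illustrative only: `is_majority_of` and the particular sets are assumptions made for the example, not definitions from the talk. It first confirms by brute force that all majorities of a fixed five-process universe pairwise intersect, then exhibits two primaries that each contain a majority of some earlier primary yet share no member.

```python
from itertools import combinations


def is_majority_of(candidate, previous):
    """True if candidate contains a strict majority of previous's members."""
    return 2 * len(candidate & previous) > len(previous)


universe = frozenset("ABCDE")

# Static notion: every majority of the fixed universe.
majorities = [
    frozenset(c)
    for r in range(1, len(universe) + 1)
    for c in combinations(sorted(universe), r)
    if is_majority_of(frozenset(c), universe)
]
static_all_intersect = all(x & y for x, y in combinations(majorities, 2))

# Dynamic notion: each primary contains a majority of a previous primary.
p0 = universe
p1 = frozenset("ABC")      # majority of p0 (3 of 5)
p2 = frozenset("AB")       # majority of p1 (2 of 3)
p2_alt = frozenset("CDE")  # majority of p0 (3 of 5), formed without knowing p1

chain_is_valid = (
    is_majority_of(p1, p0)
    and is_majority_of(p2, p1)
    and is_majority_of(p2_alt, p0)
)
disjoint = not (p2 & p2_alt)
```

Here `p2` and `p2_alt` are both allowable under the "majority of previous" rule, but no process belongs to both, so no information can flow between them.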
