Replicated Distributed Systems

Replicated Distributed Systems By Eric C. Cooper

Overview • Introduction and Background (Queenie) • A Model of Replicated Distributed Programs • Implementing Distributed Modules and Threads • Implementing Replicated Procedure Calls (Alix) • Performance Analysis • Concurrency Control (Joanne) • Binding Agents • Troupe Reconfiguration

Background • Presents a new software architecture for fault-tolerant distributed programs • Designed by Eric C. Cooper • A co-founder of FORE Systems, a leading supplier of networks for enterprises and service providers

Introduction • Goal: address the problem of constructing highly available distributed programs • Tolerate crashes of the underlying hardware automatically • Continue to operate despite the failure of components • First approach: replicate each component of the system • By von Neumann (1955) • Drawback: costly - use reliable hardware everywhere

Introduction (contd) Eric C. Cooper’s new approach: • Replication on a per-module basis • Flexible and does not burden the programmer • Provides location and replication transparency to the programmer • Fundamental mechanisms • Troupe – a replicated module • Troupe members – the replicas • Replicated procedure call (many-to-many communication between troupes)

Introduction (contd) • Important properties that give this mechanism flexibility and power: • Individual members of a troupe do not communicate among themselves • They are unaware of one another’s existence • Each troupe member behaves as if it had no replicas

A Model of Replicated Distributed Programs • [Figure: a model of a replicated distributed program. Labels: replicated distributed program, troupe, module, procedure, state information]

A Model of Replicated Distributed Programs (contd) • Module • Packages the procedures and state information needed to implement a particular abstraction • Separates the interface to that abstraction from its implementation • Expresses the static structure of a program when it is written

A Model of Replicated Distributed Programs (contd) • Threads • Each thread has a thread ID, a unique identifier • A particular thread runs in exactly one module at a given time • Multiple threads may be running in the same module concurrently

Implementing Distributed Modules and Threads • No machine boundaries • Provides location transparency – the programmer does not need to know the eventual configuration of a program • Module • Implemented by a server whose address space contains the module’s procedures and data • Thread • Implemented by using remote procedure calls to transfer control from server to server

Adding Replication • Processor and network failures affect the distributed program • They cause partial failures • Solution: replication • Introduce replication transparency at the module level

Adding Replication (contd) • Assumption: troupe members execute on fail-stop processors • If not, complex agreement is needed • Replication transparency in the troupe model is guaranteed by requiring that: • All troupes are deterministic • (same input → same output)

Troupe Consistency • A troupe is consistent when all its members are in the same state • If a troupe is consistent, its clients do not need to know that it is replicated • => Replication transparency

Troupe Consistency (contd) • [Figure: execution of a remote procedure call (I). Labels: Client, Server, "Call P", "P: proc"]

Troupe Consistency (contd) • [Figure: execution of a remote procedure call (II). Labels: Client, Server, "Call P", "P: proc"]

Execution of a Procedure Call • Viewed as a tree of procedure invocations • The invocation trees rooted at each troupe member are identical • The server troupe members make the same procedure calls and returns, with the same arguments and results • If all troupes are initially consistent => all troupes remain consistent

Replicated Procedure Calls • Goal: allow distributed programs to be written in the same way as conventional programs for centralized computers • A replicated procedure call is a remote procedure call between troupes • Exactly-once execution at all troupe members

Circus Paired Message Protocol • Characteristics: • Paired messages (e.g. call and return) • Reliably delivered • Variable length • Call sequence numbers • Based on RPC • Uses UDP, the DARPA User Datagram Protocol • Connectionless, but with retransmission

Implementing Replicated Procedure Calls • Implemented on top of the paired message layer • Two subalgorithms in the many-to-many call • One-to-many • Many-to-one • Implemented as part of the run-time system that is linked with each user’s program

One-to-many calls • Client half of RPC performs a one-to-many call • Purpose is to guarantee that the procedure is executed at each server troupe member • Sends the same call message with the same call number to each server troupe member • Client waits for return messages • In Circus, the client waits for all the return messages before proceeding
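
The one-to-many half described above can be sketched in a few lines. This is a minimal in-process illustration, not the Circus implementation: the message layout (a dictionary with `call_number`, `procedure` and `args`) and the helper `make_member` are assumptions made for the example, and real troupe members would be reached over the paired message protocol rather than by a direct function call.

```python
def one_to_many_call(server_members, call_number, procedure_name, args):
    """Send one identical call message to every server troupe member and
    wait for all return messages (the wait-for-all policy)."""
    call_message = {
        "call_number": call_number,
        "procedure": procedure_name,
        "args": args,
    }
    results = []
    for member in server_members:
        # Every member receives the same message with the same call number.
        reply = member(call_message)
        if reply["call_number"] != call_number:
            raise RuntimeError("return message does not match the call")
        results.append(reply["result"])
    return results


def make_member(procedures):
    """Build a hypothetical in-process server troupe member."""
    def member(call_message):
        proc = procedures[call_message["procedure"]]
        return {
            "call_number": call_message["call_number"],
            "result": proc(*call_message["args"]),
        }
    return member


# Three deterministic replicas of the same module.
troupe = [make_member({"add": lambda a, b: a + b}) for _ in range(3)]
results = one_to_many_call(troupe, call_number=7, procedure_name="add", args=(2, 3))
```

Because the replicas are deterministic, every entry of `results` is the same value; the client gets one return message per server troupe member.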

Synchronization Point • After all the server troupe members have returned • Each client troupe member knows that all server troupe members have performed the procedure • Each server troupe member knows that all client troupe members have received the result

Many-to-one calls • Server will receive call messages from each client troupe member • Server executes the procedure only once • Returns the results to all the client troupe members • Two problems • Distinguishing between unrelated call messages • How many other call messages are expected? • Circus waits for all clients to send a call message before proceeding
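
A rough sketch of the server half of a many-to-one call follows. It assumes that related call messages can be grouped by the pair (thread ID, call number) and that the server knows the size of the client troupe; both are assumptions for illustration, and the class name `ManyToOneServer` is hypothetical.

```python
from collections import defaultdict


class ManyToOneServer:
    """Hypothetical server half of a many-to-one call (wait-for-all policy)."""

    def __init__(self, procedure, client_troupe_size):
        self.procedure = procedure
        self.client_troupe_size = client_troupe_size
        # (thread_id, call_number) -> {client_id: args}
        self.pending = defaultdict(dict)

    def receive_call(self, thread_id, call_number, client_id, args):
        """Record one call message.

        Returns None while call messages are still missing. Once every
        client troupe member has called, runs the procedure once and
        returns a dict mapping each client ID to the result.
        """
        key = (thread_id, call_number)
        # A retransmission from the same client overwrites its own entry.
        self.pending[key][client_id] = args
        if len(self.pending[key]) < self.client_troupe_size:
            return None
        calls = self.pending.pop(key)
        result = self.procedure(*args)  # executed exactly once per call
        return {cid: result for cid in calls}
```

Keying on the thread ID and call number is one way to tell related call messages from unrelated ones; the known troupe size answers the question of how many call messages to expect.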

Many-to-many calls • A replicated procedure call is called a many-to-many call from a client troupe to a server troupe

Many-to-many steps • A call message is sent from each client troupe member to each server troupe member. • A call message is received by each server troupe member from each client troupe member. • The requested procedure is run on each server troupe member. • A return message is sent from each server troupe member to each client troupe member. • A return message is received by each client troupe member from each server troupe member.

Multicast Implementation • Dramatic difference in efficiency • Suppose m client troupe members and n server troupe members • Point-to-point • mn messages sent • Multicast • m+n messages sent
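
To make the gap concrete, the two counts stated above can be tabulated for a few troupe sizes. The function names are hypothetical and the formulas are taken directly from the slide.

```python
def messages_point_to_point(m, n):
    """Messages sent with point-to-point communication (m clients, n servers)."""
    return m * n


def messages_multicast(m, n):
    """Messages sent when each sender uses a single multicast."""
    return m + n


# (m, n, point-to-point count, multicast count) for equal-sized troupes.
comparison = [
    (size, size, messages_point_to_point(size, size), messages_multicast(size, size))
    for size in (3, 5, 10)
]

for m, n, p2p, mcast in comparison:
    print(f"m={m:2d} n={n:2d}  point-to-point={p2p:3d}  multicast={mcast:2d}")
```

The point-to-point count grows with the product of the troupe sizes, while the multicast count grows only with their sum.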

Waiting for messages to arrive • Troupes are assumed to be deterministic, therefore all messages in a set are assumed to be identical • When should computation proceed? • As soon as the first message arrives, or only after the entire set arrives?

Waiting for all messages • Able to provide error detection and error correction • Inconsistencies are detected • Execution time is determined by the slowest member of each troupe • Default in the Circus system

First-come approach • Forfeits error detection • Computation proceeds as soon as the first message in each set arrives • Execution time is determined by the fastest member of each troupe • Requires a simple change to the one-to-many call protocol • Client can use the call sequence number to discard return messages from slow server troupe members
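
The client-side change can be sketched as follows. The class `FirstComeClient` and its methods are hypothetical names; the only idea taken from the slide is that the call sequence number lets the client recognize and discard return messages that arrive after it has already proceeded.

```python
class FirstComeClient:
    """Hypothetical client half of a first-come one-to-many call."""

    def __init__(self):
        self.current_call = 0   # sequence number of the latest call
        self.done = True        # no call is outstanding yet
        self.result = None

    def start_call(self):
        """Begin a new call and return its sequence number."""
        self.current_call += 1
        self.done = False
        self.result = None
        return self.current_call

    def receive_return(self, call_number, result):
        """Accept the first return message for the current call.

        Returns True if the message was accepted and False if it was
        discarded (stale call number, or a slower server troupe member
        answering a call that has already completed).
        """
        if call_number != self.current_call or self.done:
            return False
        self.result = result
        self.done = True
        return True
```

A return message from a slow server troupe member either carries an old call number or arrives after `done` is set, so it is dropped without affecting the computation.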

First-come approach (contd) • More complicated changes are required in the many-to-one call protocol • When a call message from another member arrives, the server cannot execute the procedure again • Would violate exactly-once execution • Server must retain the return message until all other call messages have been received from the client troupe members • The retained return message is sent as soon as a late call message is received • Execution seems instantaneous to the slower client

A better first-come approach • Buffer messages at the client rather than at the server • Server broadcasts return messages to the entire client troupe after the first call message • A client troupe member may receive a return message before sending the call message • The return message is retained until the client troupe member is ready to send the call message

Advantages of buffering at the client • Work of buffering return messages and pairing them with call messages is placed on the client rather than on a shared server • The server can use broadcast rather than point-to-point communication • No communication is required by a slow client

What about error detection? • To provide error detection and still allow computation to proceed, a watchdog scheme can be used • Create another thread of control after the first message is received • This thread will watch for remaining messages and compare • If there is an inconsistency, the main computation is aborted
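
One possible shape for the watchdog scheme is sketched below using Python threads. It assumes that a (hypothetical) transport layer places the messages of one set on a `queue.Queue`; the function name and the use of a `threading.Event` as the abort signal are choices made for this example, not details from the paper.

```python
import queue
import threading


def proceed_with_watchdog(messages, expected_count):
    """Return the first message at once; check the rest in a watchdog thread.

    The returned event is set if any later message differs from the first,
    which signals that the main computation should be aborted.
    """
    first = messages.get()
    abort = threading.Event()

    def watch():
        # Compare each remaining message of the set with the first one.
        for _ in range(expected_count - 1):
            if messages.get() != first:
                abort.set()
                return

    watchdog = threading.Thread(target=watch, daemon=True)
    watchdog.start()
    return first, abort, watchdog


# Usage: the third reply disagrees with the first two.
inbox = queue.Queue()
for reply in ("ok", "ok", "corrupt"):
    inbox.put(reply)

first, abort, watchdog = proceed_with_watchdog(inbox, expected_count=3)
watchdog.join(timeout=1)
```

The main computation continues with `first` and would check `abort` (or be interrupted) if the watchdog later finds an inconsistency.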

Crashes and Partitions • Underlying message protocol uses probing and timeouts to detect crashes • Relies on network connectivity and therefore cannot distinguish between crashes and network partitions • To prevent troupe members from diverging • Require that each troupe member receives a majority of the expected set of messages
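
The majority requirement reduces to a one-line test. The sketch below uses a hypothetical helper, `may_proceed`, and shows why two sides of a partition cannot both continue: at most one side can hold a strict majority of the expected senders.

```python
def may_proceed(received, expected):
    """True if strictly more than half of the expected messages arrived."""
    return 2 * received > expected


# A troupe of 5 senders is split 3/2 by a network partition.
expected = 5
side_a = may_proceed(3, expected)   # can reach three of the five senders
side_b = may_proceed(2, expected)   # can reach only two of the five senders
```

With an even split (for example 2/2 out of 4) neither side has a majority, so neither proceeds.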

Collators • Can relax the determinism requirement by allowing programmers to reduce a set of messages into a single message • A collator maps a set of messages into a single result • Collator needs enough messages to make a decision • Three kinds • Unanimous • Majority • First come
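
The three kinds of collator listed above might look like this when written as plain functions over a list of messages. The signatures are assumptions for illustration; in particular, passing `expected` to the majority collator and requiring hashable messages are choices made for this sketch.

```python
from collections import Counter


def unanimous(messages):
    """Return the common value; raise if the messages disagree."""
    if not messages:
        raise ValueError("no messages to collate")
    first = messages[0]
    if any(m != first for m in messages[1:]):
        raise ValueError("troupe members returned inconsistent messages")
    return first


def majority(messages, expected):
    """Return the value sent by more than half of the expected members."""
    if not messages:
        raise ValueError("no messages to collate")
    value, count = Counter(messages).most_common(1)[0]
    if 2 * count <= expected:
        raise ValueError("no value has a majority yet")
    return value


def first_come(messages):
    """Return whichever message arrived first."""
    if not messages:
        raise ValueError("no messages to collate")
    return messages[0]
```

Each collator needs a different number of messages before it can decide: the unanimous collator needs the whole set, the majority collator needs more than half of it, and the first-come collator needs only one.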

Performance Analysis • Experiments conducted at Berkeley during an inter-semester break • Measured the cost of replicated procedure calls as a function of the degree of replication • UDP and TCP echo tests used as a comparison

Performance Analysis • Performance of UDP, TCP and Circus • TCP echo test is faster than UDP echo test • Cost of TCP connection establishment is ignored • UDP test makes two alarm calls and therefore two setitimer calls • Read and write interface to TCP is more streamlined

Performance Analysis • An unreplicated Circus remote procedure call requires almost twice the time of a simple UDP exchange • Due to the extra system calls required to handle Circus • Elaborate code to handle multi-homed machines • Some Berkeley machines had as many as 4 network addresses • Design oversight by Berkeley, not a fundamental problem

Performance Analysis • Expense of a replicated procedure call increases linearly as the degree of replication increases • Each additional troupe member adds between 10 and 20 milliseconds • Smaller than the time for a UDP datagram exchange

Performance Analysis • Execution profiling tool used to analyze the Circus implementation in finer detail • Six Berkeley 4.2BSD system calls account for more than half the total CPU time needed to perform a replicated call • Most of the time required for a Circus replicated procedure call is spent in the simulation of multicasting

Concurrency Control • Server troupe controls calls from different clients using multiple threads • Conflicts arise when concurrent calls need to access the same resource

Concurrency Control • Serialization at each troupe member • Local concurrency control algorithms • Serialization in the same order among members • Preserves troupe consistency • Needs coordination between the replicated procedure call mechanism and the synchronization mechanism • => Replicated Transactions

Replicated Transactions • Requirements • Serializability • Atomicity • Ensure that aborting a transaction does not affect other concurrently executed transactions • Two-phase locking with unanimous update • Drawback: too strict • Troupe Commit Protocol

Troupe Commit Protocol • Before a server troupe member commits (or aborts) a transaction, it invokes the ready_to_commit remote procedure call on the client troupe (a call-back) • Client troupe returns whether it agrees to commit (or abort) the transaction • If server troupe members serialize transactions in different orders, a deadlock will occur • Detecting conflicting transactions is converted to deadlock detection
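
The condition that leads to deadlock, server troupe members serializing the same transactions in different orders, can be illustrated with a small offline check. This is not the troupe commit protocol itself (troupe members do not communicate with one another); it only shows, under the assumption that each member's serialization order is available as a list, which pairs of transactions would conflict. The function name is hypothetical.

```python
from itertools import combinations


def conflicting_pairs(serialization_orders):
    """Return the pairs of transactions that two members order differently.

    `serialization_orders` holds one list per server troupe member, giving
    the order in which that member serialized the transactions.
    """
    conflicts = set()
    for order_a, order_b in combinations(serialization_orders, 2):
        position_b = {txn: i for i, txn in enumerate(order_b)}
        # combinations() preserves order, so x precedes y in order_a.
        for x, y in combinations(order_a, 2):
            if x in position_b and y in position_b and position_b[y] < position_b[x]:
                conflicts.add(frozenset((x, y)))
    return conflicts


# Two members disagree about T1 and T2; a third agrees with the first.
divergent = conflicting_pairs([["T1", "T2"], ["T2", "T1"], ["T1", "T2"]])
consistent = conflicting_pairs([["T1", "T2", "T3"], ["T1", "T2", "T3"]])
```

An empty result means all members serialized the transactions in the same order; a non-empty result identifies the transactions whose commits would block each other.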
