Understanding Data Replication: Fault Tolerance, Availability, and Design Considerations

Lecture XII: Replication CMPT 401 2008 Dr. Alexandra Fedorova

Replication

Why Replicate? (I) • Fault-tolerance / High availability • As long as one replica is up, the service is available • Assume each of n replicas has the same independent probability p of failing • Availability = 1 - p^n [Figure: Fault-Tolerance: Take-Over]
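A quick worked example of the availability formula 1 - p^n. This is only a sketch: the function name `availability` and the sample failure probability 0.1 are illustrative assumptions, not part of the lecture.

```python
def availability(p: float, n: int) -> float:
    """Probability that at least one of n replicas is up, assuming each
    replica fails independently with the same probability p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    if n < 1:
        raise ValueError("n must be at least 1")
    # All n replicas are down with probability p**n.
    return 1.0 - p ** n


if __name__ == "__main__":
    # Assumed example value: each replica is down 10% of the time.
    for n in (1, 2, 3, 5):
        print(n, availability(0.1, n))
```

Under the independence assumption, each added replica multiplies the probability of total outage by p, so availability approaches 1 quickly as n grows.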

Why Replicate? (II) • Fast local access (WAN replication) • Client can always send requests to the closest replica • Goal: no communication to remote replicas necessary during request execution • Goal: client experiences location transparency since all access is fast local access [Figure: Fast local access (Rome, Toronto, Montreal)]

Why Replicate? • Scalability and load distribution (LAN replication) • Requests can be distributed among replicas • Handle increasing load by adding new replicas to the system: a cluster instead of a bigger server

Challenges: Data Consistency • We will study systems that use data replication • It is hard, because data must be kept consistent • Users submit operations against the logical copies of data • These operations must be translated into operations against one, some, or all physical copies of data • Nearly all existing approaches follow a ROWA(A) approach: • Read-one-write-all-(available) • Update has to be (eventually) executed at all replicas to keep them consistent • Read can be performed at any replica
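The read-one-write-all-available idea can be sketched in a few lines. This is a minimal in-memory illustration, not a real protocol: the class names `Replica` and `RowaaStore` and the `up` flag are assumptions made for the example, and bringing a recovered replica back up to date is deliberately left out.

```python
class Replica:
    """One physical copy of the data (hypothetical in-memory stand-in)."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}


class RowaaStore:
    """Logical copy: reads go to one replica, writes to all available ones."""

    def __init__(self, replicas):
        self.replicas = list(replicas)

    def write(self, key, value):
        available = [r for r in self.replicas if r.up]
        if not available:
            raise RuntimeError("no replica available")
        for r in available:          # write-all-(available)
            r.data[key] = value
        return len(available)        # number of copies updated

    def read(self, key):
        for r in self.replicas:      # read-one: first available replica
            if r.up:
                return r.data.get(key)
        raise RuntimeError("no replica available")


if __name__ == "__main__":
    a, b, c = Replica("A"), Replica("B"), Replica("C")
    store = RowaaStore([a, b, c])
    store.write("x", 5)
    b.up = False                     # B fails; later writes skip it
    store.write("x", 6)
    print(store.read("x"), b.data["x"])
```

Note that replica B keeps the old value while it is down, which is why a recovering replica has to be brought up to date before it serves reads again.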

Challenges: Fault Tolerance • The goal is to have data available despite failures • If one site fails others should continue providing service • How many replicas should we have? • It depends on: • How many faults we want to tolerate • The types of faults we expect • How much we are willing to pay

Roadmap • Replication architectures • Active replication • Primary-backup (passive, master-slave) replication • Design considerations for replicated services • Surviving failures

Active Replication [Figure: Client and replicated servers A, B, and C, each holding a copy of data item A]

Active Replication

Active Replication • The client sends its request to the servers using totally ordered reliable multicast (logical clocks or vector clocks) • Server coordination is given by the total order property (assumption: synchronous system) • All replicas execute the requests in the order in which they are delivered • No additional coordination necessary (assumption: determinism) • All replicas produce the same result • All replicas send the result to the client; the client waits for the first answer
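The role of total order and determinism can be illustrated with a small sketch. It assumes that each request already carries a global sequence number; how that number is assigned (the totally ordered multicast itself) is not modeled. The class `CounterReplica` and its `add`/`mul` operations are hypothetical.

```python
class CounterReplica:
    """Deterministic state machine: the same requests applied in the same
    order always produce the same state."""

    def __init__(self):
        self.value = 0
        self.next_seq = 0    # sequence number of the next request to execute
        self.pending = {}    # seq -> request, held back until its turn

    def receive(self, seq, request):
        """Accept a request that may arrive out of order; execute every
        request that is now deliverable and return the results."""
        self.pending[seq] = request
        results = []
        while self.next_seq in self.pending:
            op, arg = self.pending.pop(self.next_seq)
            if op == "add":
                self.value += arg
            elif op == "mul":
                self.value *= arg
            else:
                raise ValueError(f"unknown operation: {op}")
            results.append(self.value)
            self.next_seq += 1
        return results


if __name__ == "__main__":
    requests = [(0, ("add", 2)), (1, ("mul", 5)), (2, ("add", 1))]
    r1, r2 = CounterReplica(), CounterReplica()
    for seq, req in requests:             # arrives in order at r1
        r1.receive(seq, req)
    for seq, req in reversed(requests):   # arrives reversed at r2
        r2.receive(seq, req)
    print(r1.value, r2.value)
```

Because `add` and `mul` do not commute, the replicas would diverge if each applied requests in arrival order; holding requests back until their sequence number comes up keeps the two states identical.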

Fault Tolerance: Failstop Failures • As long as at least one replica survives, the client will continue receiving service • Assuming there are no partitions! • Suppose B and C are partitioned, so they cannot communicate • They cannot agree on how to order the client’s requests [Figure: Client and replicated servers A, B, and C, each holding a copy of data item A]

Fault Tolerance: Byzantine Failures • Can survive Byzantine failures (assuming no partitions) • The system must have n ≥ 2f + 1 replicas (f is the number of failures) • The client compares the results of all replicas and chooses the result returned by the majority (at least f + 1 non-faulty replicas) • This is the idea used in LOCKSS (Lots of Copies Keep Stuff Safe)
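The client-side comparison can be sketched as a simple vote. This is an illustration under stated assumptions: the function name `vote` is hypothetical, replies are assumed to be hashable values, and replies are assumed to have already been collected from all replicas.

```python
from collections import Counter


def vote(replies, f):
    """Return the result backed by at least f + 1 matching replies.

    replies: one result per replica; f: maximum number of faulty replicas.
    """
    if len(replies) < 2 * f + 1:
        raise ValueError("need at least 2f + 1 replies")
    value, count = Counter(replies).most_common(1)[0]
    if count < f + 1:
        # More than f replicas must have misbehaved; the assumption is broken.
        raise RuntimeError("no value has f + 1 matching replies")
    return value


if __name__ == "__main__":
    # Three replicas, at most one faulty: the two correct replies outvote it.
    print(vote([5, 5, 9], f=1))
```

With n ≥ 2f + 1 replicas and at most f faulty ones, the f + 1 or more correct replicas agree (by determinism), so a faulty minority can never assemble f + 1 matching wrong replies.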

Primary-Backup Replication (PB) • If the primary fails, a backup takes over and becomes the primary • Also known as passive replication and master-slave replication [Figure: Client and replicated servers; A is the primary, B and C are backups, each holding a copy of data item A]

System Requirements • How do we want the system to behave? • Just like a single-server system? • Must ensure that there is only one primary at a time • Data is kept consistent: • If a client received an acknowledgement of an update operation, that update must survive system crashes • Results of operations should be the same as they would be if executed on a single-server system • Can we tolerate loose data consistency? • The client eventually gets the consistent data, but not right away

Example of Data Inconsistency • Client operations: write(x = 5); read(x) // should return 5 on a single-server system • On a replicated system: write(x = 5); the primary responds to the client; the primary crashes before propagating the update to the other replicas; a new primary is selected; read(x) // may return x ≠ 5, because the new primary does not know about the update to x

Design Considerations for Replicated Services • Where to submit updates? • A designated server or any server? • When to propagate updates? • Eager or lazy? • How many replicas to install?

Where to Submit Updates? • Primary Copy: • Each object has a primary copy • Often there is a designated primary - it holds primary copies for all objects • Updates on object x have to be submitted to the primary copy of x • Primary propagates changes on x to secondary copies • Secondary copies are read-only • Also called master/slave approach

Where to Submit Updates? • Update Everywhere: • Both read and write operations can be submitted to any server • This server takes care of the execution of the operation and the propagation of updates to the other copies [Figure: transactions T1: r(x) w(y) and T2: r(y) w(y)]

When to Propagate Updates? • Eager: • Within the boundaries of the transaction • Before response is sent to client • Lazy: • After the commit of the transaction • After the response is sent to client

PB Replication with Eager Updates • The client sends the request to the primary • There is no initial coordination • The primary executes the request • The primary coordinates with the other replicas by sending the update information to the backups • The primary (or another replica) sends the answer to the client
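The eager primary-backup steps can be sketched as follows. This is a minimal single-process illustration: the class names `Backup` and `Primary` are hypothetical, message passing is replaced by direct method calls, and failures during coordination are not modeled.

```python
class Backup:
    def __init__(self):
        self.data = {}

    def apply(self, key, value):
        """Install an update sent by the primary and acknowledge it."""
        self.data[key] = value
        return True


class Primary:
    def __init__(self, backups):
        self.data = {}
        self.backups = list(backups)

    def write(self, key, value):
        self.data[key] = value                                # execution
        acks = [b.apply(key, value) for b in self.backups]    # coordination
        if not all(acks):
            raise RuntimeError("a backup did not acknowledge the update")
        return "ok"                                           # client response


if __name__ == "__main__":
    backups = [Backup(), Backup()]
    primary = Primary(backups)
    print(primary.write("x", 5), [b.data for b in backups])
```

Because the response is sent only after every backup has acknowledged, any backup that later takes over already holds every update the client was told about.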

Eager Update Propagation

Eager Update Propagation For Transactional Services

When Can a Failure Occur? • F1: Primary fails before replica coordination • Client receives no response. It will retry and will eventually get data from the new primary. • F2: Primary fails during replica coordination • Replicas may or may not have reached agreement w.r.t. the client’s transaction. The client may receive a response after the system recovers. The system may fail to recover (if the agreement protocol blocks). • F3: Primary fails after replica coordination • A new primary responds [Figure: failure points F1, F2, F3 on a timeline of Phase 1: Client Request, Phase 3: Execution, Phase 4: Replica Coordination, Phase 5: Client Response]

Lazy Update Propagation (Transactional Services) • Primary Copy: • Upon read: read locally and return to user • Upon write: write locally and return to user • Upon commit/abort: terminate locally • Sometime after commit: multicast changed objects in a single message to other sites (in FIFO)

Lazy Update Propagation (Continued) • Secondary copy: • Upon read: read locally • Upon message from primary copy: install all changes (FIFO) • Upon write from client: refuse (writing clients must submit to primary copy) • Upon commit/abort request (only for read-only txn): local commit
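The lazy primary-copy rules can be sketched as follows. This is an in-memory illustration only: the class names `LazyPrimary` and `Secondary`, the `outbox` queue, and the explicit `propagate` call (standing in for "sometime after commit") are assumptions made for the example.

```python
from collections import deque


class Secondary:
    def __init__(self):
        self.data = {}

    def read(self, key):
        return self.data.get(key)            # read locally (may be stale)

    def write(self, key, value):
        # Writing clients must submit to the primary copy.
        raise PermissionError("secondary copies are read-only")

    def install(self, changes):
        for key, value in changes:           # apply one transaction's changes
            self.data[key] = value


class LazyPrimary:
    def __init__(self, secondaries):
        self.data = {}
        self.secondaries = list(secondaries)
        self.outbox = deque()                # committed but unpropagated changes

    def commit(self, writes):
        """Commit locally and answer the client without contacting anyone."""
        self.data.update(writes)
        self.outbox.append(list(writes.items()))
        return "committed"

    def propagate(self):
        """Later: send each transaction's changes as one message, in FIFO order."""
        while self.outbox:
            changes = self.outbox.popleft()
            for s in self.secondaries:
                s.install(changes)


if __name__ == "__main__":
    s = Secondary()
    p = LazyPrimary([s])
    p.commit({"x": 5})
    print(s.read("x"))    # stale: the update has not been propagated yet
    p.propagate()
    print(s.read("x"))
```

If the primary crashed between `commit` and `propagate`, the acknowledged write to x would exist nowhere else, which is the lost-update risk of lazy propagation.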

Lazy Update Propagation A client may end up with an inconsistent view of the system

Lazy Propagation: Discussion • Lazy replication has no server/agreement coordination within response time • Faster • Transactions might be lost in case of primary crash • Weak data consistency • Simple to achieve • Secondary copies only need to apply updates in FIFO order • Data at secondary copies might be stale • Multiple Primaries possible (multi-master replication) • More locality

Fault Handling • Properties of a correct PB protocol • Property 1: There is at most one primary at any time • Property 2: Each client maintains the identity of the primary, and sends its requests only to the primary • Property 3: If a client update arrives at a backup, it is not processed • When a primary fails, we must elect a new one • Network partitions may cause election of more than one primary • We can avoid network partitions by choosing the right number of replicas (under certain failure assumptions) • How many replicas do we need to tolerate failures?

System Model • Synchronous system (useful for deriving theoretical results) • Fully connected network (exactly one FIFO link between any two processes) • Failure model: • Crash failures: also known as failstop failures • Crash+Link failures: A server may crash or a link may lose messages (but links do not delay, duplicate or corrupt messages) • Receive-Omission failures: A server may crash and also omit to receive some of the messages sent over a non-faulty link • Send-Omission failures: A server may fail not only by crashing but also by omitting to send some messages over a non-faulty link • General-Omission failures: A server may exhibit send-omission and receive-omission failures

Lower Bounds on Replication • How many replicas n do you need to tolerate f failures?

Crash Failures, Send-Omission Failures: n > f Replicas [Figure: f replicas FAILED (crashed or fail to send); the remaining replica becomes primary]

Other Failure Models • The rest of the failure models may create partitions • Partitions: Servers are divided into mutually non-communicating partitions • A primary may emerge in each partition, so we’ll have more than one primary – against the rules • To avoid partitions, we use more replication

Crash+Link Failures: n > f+1 Replicas • Scenario 1: f servers fail • Scenario 2: f links fail [Figure: replicas labeled FAILED or UNREACHABLE BUT ALIVE; several replicas each become primary. Problem: 2 primaries!]

Crash+Link Failures: n > f+1 Replicas • We need another correct node that would serve as a link between the two partitions • If the new node fails, we have f+1 failures • This is a contradiction, because we assume at most f failures [Figure: replicas labeled UNREACHABLE BUT ALIVE; Becomes primary]

What About Hard Partitions? • We showed how many replicas are needed to prevent partitions in the face of f failures • However, partitions do happen, due to router failures, for example • So having extra replicas won’t help, because they will also be on one of the sides of the faulty router • Next we’ll talk about surviving failures despite network partitions

Surviving Network Partitions • Most systems operate under the assumption that a partition will eventually be repaired • Optimistic approach: • Allow updates in all partitions • When the partition is repaired, eventually synchronize the data • OK for a distributed file system (think about your laptop in disconnected mode) • Pessimistic approach: • Allow updates only in a single partition – used where strong consistency is required (flight reservation system) • Which partition? This is usually decided by quorum consensus • After the partition is repaired, update the copies of data in the other partition

Quorum Consensus • A quorum is a sub-group of servers whose size gives it the right to carry out the operation • Usually the majority gets the quorum • Design/implementation challenges: • Replicas must agree that they are behind a partition – must rely on timeouts, failure detectors (special devices?) • If the quorum set does not contain the primary, the replicas must elect the new primary • Cost consideration: to tolerate one partition, must have at least three servers. Implement one as a simple witness? [Figure: Quorum]
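The majority rule can be written down directly. This sketch assumes a quorum means a strict majority of all servers; the function names `has_quorum` and `writable_partition` are hypothetical, and detecting the partition in the first place (timeouts, failure detectors) is not modeled.

```python
def has_quorum(partition_size, total_servers):
    """A partition holds the quorum if it contains a strict majority."""
    return 2 * partition_size > total_servers


def writable_partition(partitions, total_servers):
    """Return the one partition allowed to accept updates, or None."""
    for members in partitions:
        if has_quorum(len(members), total_servers):
            return members
    return None


if __name__ == "__main__":
    # Three servers split 2 / 1: only the side with two servers may update.
    print(writable_partition([{"A", "B"}, {"C"}], total_servers=3))
    # Four servers split 2 / 2: neither side has a majority.
    print(writable_partition([{"A", "B"}, {"C", "D"}], total_servers=4))
```

Two disjoint partitions cannot both contain more than half of the servers, so at most one side accepts updates. With only two servers, a 1 / 1 split leaves neither side with a majority, which is why at least three servers are needed to keep operating through one partition.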

Bringing Replicas Up-to-Date • Version numbers: • Each copy has a version number (or a timestamp) • Only copies that are up-to-date have the current version number • Operations should be applied only to copies with the current version number • How does a failed server find out that it is not up-to-date? • Periodically compare all version numbers? • Log sequence numbers: • Each operation is written to a log (like a transactional log) • Each log record has a log sequence number (LSN) • Replica managers compare LSNs to find out if they are not up-to-date • Used by the Berkeley DB replication system
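Catching up by LSN comparison can be sketched as below. This is a generic illustration and not the Berkeley DB API: the log layout (a list of `(lsn, (key, value))` records in ascending LSN order) and the function names `missing_records` and `catch_up` are assumptions.

```python
def missing_records(log, replica_lsn):
    """Log records the replica has not applied yet (LSN greater than its own)."""
    return [(lsn, op) for lsn, op in log if lsn > replica_lsn]


def catch_up(state, replica_lsn, log):
    """Replay the missing records in LSN order; return the replica's new LSN."""
    for lsn, (key, value) in missing_records(log, replica_lsn):
        state[key] = value
        replica_lsn = lsn
    return replica_lsn


if __name__ == "__main__":
    # Up-to-date log: each record is (LSN, (key, value)).
    log = [(1, ("x", 1)), (2, ("y", 2)), (3, ("x", 3))]
    state = {"x": 1}                 # recovering replica applied only LSN 1
    new_lsn = catch_up(state, 1, log)
    print(new_lsn, state)
```

Comparing a single LSN is enough to tell whether a replica is behind, and the gap between the two LSNs identifies exactly which log records must be replayed.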

Summary • Discussed replication • Used for performance, high availability • Active replication • Client sends updates to all replicas • Replicas co-ordinate amongst themselves, apply updates in order • Passive replication (primary copy, primary-backup) • Eager/lazy update propagation • Number of replicas to prevent partitions • Handling partitions • Optimistic • Pessimistic (quorum consensus) • Next let us look at real systems that use replication
