
Replication and Consistency



Presentation Transcript


  1. Replication and Consistency • Data and Object Replication • Data Consistency Model • Implementation Issues

  2. Replication • What to replicate? • Data, files, processes, objects, services, etc. • Why replicate? • Enhance performance: • Caching data in clients and servers to reduce latency • Server farms to share workload • Increase availability • Reduce downtime due to individual server failures through server replication • Support network partitions and disconnected operation through caching (e.g., the email cache in POP3) • Fault tolerance • Highly available data is not necessarily current data. A fault-tolerant service should guarantee correct behavior despite a certain number and type of faults. • Correctness: freshness, timeliness • If f servers can exhibit Byzantine failures, at least 2f+1 servers are needed

  3. Challenges • Replication Transparency • Clients should not be aware of the presence of multiple physical copies; data are organized as logical objects • Clients see exactly one set of return values • “We use replication techniques ubiquitously to guarantee consistent performance and high availability. Although replication brings us closer to our goals, it cannot achieve them in a perfectly transparent manner” [Werner Vogels, CTO of Amazon, Jan 2009] • Consistency • Whether the operations on a collection of replicated objects produce results that meet the specification of correctness for those objects • A read should return the value of the last write operation on the data • Atomic update operations (transactions) require global synchronization, but that is too costly • Relax the atomic update requirement so as to avoid instantaneous global synchronization • To what extent can consistency be loosened? • Without a global clock, how do we define the “last” write? What is the overhead? • Overhead of keeping copies up to date

  4. Replication of Shared Objects • Replicating a shared remote object without correctly handling concurrent invocations leads to consistency problems. • W(x) → R(x) at one site, but R(x) → W(x) at another site • Replicas need additional synchronization to ensure that concurrent invocations are performed in a correct order at each replica. • Need support for concurrent access to shared objects • Need support for coordination among local and remote accesses

  5. Consistency Models • A consistency model is a contract between processes and the data store. For replicated objects in the data store that are updated by multiple processes, the model specifies the consistency guarantees that the data store makes about the return values of read ops. • E.g., x=1; x=2; print(x) in the same process P1? • E.g., x=1 (at P1 @ 5:00pm); x=2 (at P2 @ 1 ns after 5:00pm); print(x) (at P1)?

  6. Overview of Consistency Models • Strict Consistency • Linearizable Consistency • Sequential Consistency • Causal Consistency • FIFO Consistency • Weak Consistency • Release Consistency • Entry Consistency • (Models near the top of the list are easier to program with but less efficient; models near the bottom are harder to program with but more efficient.)

  7. Strict Consistency • A read of data item x returns the value of the most recent write on x. • Most recent in the sense of absolute global time • E.g., x=1; x=2; print(x) in the same process P1? • E.g., x=1 (at P1 @ 5:00pm); x=2 (at P2 @ 1 ns after 5:00pm); print(x) (at P1)? • A strictly consistent store requires that all writes be instantaneously visible to all processes. • (Figure: (a) a strictly consistent store; (b) a store that is not strictly consistent.)

  8. Linearizability • Operations are assumed to receive a timestamp using a globally available clock, but one with only finite precision. • A data store is linearizable when each op is timestamped and i) the result of any execution is the same as if the ops by all processes on the data store were executed in some sequential order, ii) the ops of each individual process appear in this sequence in the order specified by its program, and iii) if ts(op1(x)) < ts(op2(x)), then op1(x) should precede op2(x) in this sequence.
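
A minimal Python sketch of this check (not from the slides; the operation record format and the example history are illustrative): sort the operations of a history by timestamp, verify that the timestamp order respects each process's program order, and verify that every read returns the value of the most recent preceding write.

# Operation record: (timestamp, process, kind, value), where kind is "W" or "R".
history = [
    (1.0, "P1", "W", 1),
    (2.0, "P2", "W", 2),
    (3.0, "P1", "R", 2),   # should observe the most recent write, x = 2
]

def is_linearizable(history, initial=0):
    # (iii) order all operations by their timestamps
    ordered = sorted(enumerate(history), key=lambda e: e[1][0])

    # (ii) the timestamp order must preserve each process's program order
    # (here, the order in which its ops appear in `history`)
    last_seen = {}
    for orig_index, (_, proc, _, _) in ordered:
        if orig_index < last_seen.get(proc, -1):
            return False
        last_seen[proc] = orig_index

    # (i) replaying the sequence, every read must return the latest written value
    current = initial
    for _, (_, _, kind, value) in ordered:
        if kind == "W":
            current = value
        elif value != current:
            return False
    return True

print(is_linearizable(history))   # True for the example history above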

  9. Sequential Consistency • A data store is sequentially consistent if • i) the result of any execution is the same as if the ops by all processes on the data store were executed in some sequential order, and • ii) the ops of each process appear in this sequence in the order specified by its program • (Figure: a sequentially consistent data store, and a data store that is not sequentially consistent; each diagram shows a writer, an observer, and a time axis.)

  10. Possible Executions in the SC Model • Assume initially x = y = z = 0 • Is the signature 001001 valid under the SC model?
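
The slide's figure is not included in the transcript; assuming the classic three-process example (each process sets its own variable to 1 and then prints the other two, and the signature is P1's output followed by P2's and P3's), a brute-force Python check of which signatures sequential consistency permits might look like the sketch below. It is written under that assumption, not as the slide's own solution.

from itertools import permutations

# Assumed programs (classic example; not reproduced in the transcript):
# P1: x = 1; print(y, z)   P2: y = 1; print(x, z)   P3: z = 1; print(x, y)
programs = [
    [("w", "x"), ("p", ("y", "z"))],
    [("w", "y"), ("p", ("x", "z"))],
    [("w", "z"), ("p", ("x", "y"))],
]

def sc_signatures(programs):
    """All signatures producible by interleavings that keep each process's
    program order (the signature concatenates P1's, P2's and P3's output)."""
    ops = [(p, i) for p, prog in enumerate(programs) for i in range(len(prog))]
    results = set()
    for order in permutations(ops):
        # keep only interleavings that respect program order
        nxt = [0] * len(programs)
        ok = True
        for p, i in order:
            if i != nxt[p]:
                ok = False
                break
            nxt[p] += 1
        if not ok:
            continue
        store = {"x": 0, "y": 0, "z": 0}
        out = [""] * len(programs)
        for p, i in order:
            kind, arg = programs[p][i]
            if kind == "w":
                store[arg] = 1
            else:
                out[p] += "".join(str(store[v]) for v in arg)
        results.add("".join(out))
    return results

print("001001" in sc_signatures(programs))   # can any SC interleaving print it?

Under the assumed programs, the script reports whether any program-order-preserving interleaving can produce the signature 001001.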

  11. Remarks on the SC Model • The SC model was conceived by Lamport in 1979 for the shared memory of multiprocessors. • Maintains the program order of each thread • Provides a coherent view of any data item • Low efficiency in implementation • Each read/write op must be atomic • Atomic reads/writes on distributed shared objects cost too much

  12. Causal Consistency • In SC, two ops w(x) and w(y) must present a consistent view to all processes. • In CC, the requirement is relaxed if w(x) and w(y) have no causal relationship (Hutto and Ahamad, 1990). • That is, writes that are potentially causally related must be seen by all processes in the same order; concurrent writes may be seen in a different order on different machines. • (Figure: a sequence that is valid in CC but invalid in SC.)

  13. Causal Consistency (Cont) • A violation of a causally-consistent store. • A correct sequence of events in a causally-consistent store.

  14. Remarks on Causal Consistency • The causal consistency model allows concurrent writes to be seen in a different order on different machines, so • Concurrent writes can be executed in parallel (overlapped) to improve efficiency • It relies on compiler/run-time support for constructing and maintaining a dependency graph that captures causality between operations.
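
One common way to build that dependency information is with vector clocks; the Python sketch below is illustrative only (the class and message format are not from the slides). Each replica applies a remote write only after all writes it causally depends on have been applied, while concurrent writes may be applied in different orders at different replicas.

class CausalReplica:
    def __init__(self, rid, n):
        self.rid = rid              # this replica's id (0 .. n-1)
        self.vc = [0] * n           # vector clock: writes applied per origin
        self.store = {}
        self.pending = []           # remote writes whose dependencies are unmet

    def local_write(self, key, value):
        self.vc[self.rid] += 1
        self.store[key] = value
        # message shipped to the other replicas: (key, value, origin, clock)
        return (key, value, self.rid, list(self.vc))

    def deliver(self, msg):
        self.pending.append(msg)
        self._apply_ready()

    def _ready(self, origin, vc):
        # next unseen write from `origin`, with all of its dependencies applied
        if vc[origin] != self.vc[origin] + 1:
            return False
        return all(vc[j] <= self.vc[j] for j in range(len(vc)) if j != origin)

    def _apply_ready(self):
        progress = True
        while progress:
            progress = False
            for msg in list(self.pending):
                key, value, origin, vc = msg
                if self._ready(origin, vc):
                    self.store[key] = value
                    self.vc[origin] = vc[origin]
                    self.pending.remove(msg)
                    progress = True

a, b = CausalReplica(0, 2), CausalReplica(1, 2)
m1 = a.local_write("x", 1)
b.deliver(m1)                       # b observes W(x) ...
m2 = b.local_write("y", 2)          # ... so W(y) causally depends on W(x)
a.deliver(m2)
print(a.store, b.store)             # both replicas apply the writes in causal order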

  15. Weak Consistency • Weak consistency relaxes the ordering requirements by distinguishing synchronization ops from normal ops (Dubois et al., 1988): • Accesses to the synchronization variables associated with a data store are sequentially consistent • No operation on a synchronization variable is allowed to be performed until all previous writes have been completed everywhere • No read or write ops are allowed to be performed until all previous operations on synchronization variables have been performed

  16. Weak Consistency (cont’) • (Figure: a valid sequence of events under WC, and an invalid one.) • WC enforces consistency on a group of operations, not on individual reads/writes • WC guarantees the time when consistency holds, not the form of consistency
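
A toy Python sketch of the group-of-operations idea (illustrative only; the store and method names are made up): normal writes update only the local replica, and it is the operation on the synchronization variable that forces all previous writes to complete everywhere before it returns.

class WeaklyConsistentStore:
    def __init__(self, replicas):
        self.replicas = replicas       # list of dicts standing in for replicas
        self.pending = {}              # local writes not yet propagated

    def write(self, replica_id, key, value):
        # normal write: only the local replica is updated immediately
        self.replicas[replica_id][key] = value
        self.pending[key] = value

    def read(self, replica_id, key, default=0):
        # normal read: may return a stale value at another replica
        return self.replicas[replica_id].get(key, default)

    def synchronize(self):
        # op on the synchronization variable: all previous writes are
        # completed everywhere before it returns
        for replica in self.replicas:
            replica.update(self.pending)
        self.pending.clear()

store = WeaklyConsistentStore([{}, {}])
store.write(0, "x", 1)
print(store.read(1, "x"))   # may still be 0: consistency not yet enforced
store.synchronize()
print(store.read(1, "x"))   # 1: all writes are visible after synchronization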

  17. Summary of Consistency Models • (a) Consistency models not using synchronization operations. • (b) Models with synchronization operations.

  18. Eventually Consistent Distributed Data Stores • In such special distributed data stores, all replicas gradually become consistent if there are no updates for a long time. • Examples • In distributed web servers, web pages are usually updated by a single authority • A web cache may not return the latest contents, but this inconsistency is acceptable in some situations • DNS implements EC • Usages or target applications: • Most ops involve reading data • Lack of simultaneous updates • A relatively high degree of inconsistency can be tolerated

  19. Client-Centric Eventual Consistency • EC stores require only that updates are guaranteed to propagate to all replicas eventually. Conflicts due to concurrent writes are often easy to resolve. Cheap implementation • Problems arise when different replicas are accessed by the same process at different times → causal consistency for a single client • Eventually consistent data stores work fine as long as clients always access the same replica

  20. Variations of Client-Centric EC • Monotonic reads • If a process reads the value of a data item x, any successive read on x by that process will never return an older value of x. • E.g., mail read from a mailbox in SF will also be seen in the mailbox replica in NY • Monotonic writes • A write by a process on a data item x is completed before any successive write on x by the same process. • Similar to FIFO consistency, but the monotonic-write model is about the behavior of a single process. • Read your writes • A write on data item x by a process will always be seen by a successive read on x by the same process. • E.g., an integrated editor and browser; integrated dvips and ghostview • Writes follow reads • A write on x by a process following its previous read of x is guaranteed to take place on the same or a more recent value of x than was read
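
These session guarantees are often implemented by having the client track which versions it has already seen. The Python sketch below is a hedged illustration of monotonic reads and read-your-writes using made-up per-item version numbers; it is not a description of any particular system.

class Replica:
    def __init__(self):
        self.data = {}                      # key -> (version, value)

    def version(self, key):
        return self.data.get(key, (0, None))[0]

    def read(self, key):
        return self.data.get(key, (0, None))

    def write(self, key, value, version):
        if version > self.version(key):     # keep only the newest version
            self.data[key] = (version, value)

class ClientSession:
    """Tracks, per key, the newest version this client has read or written."""
    def __init__(self):
        self.seen = {}

    def read(self, replica, key):
        # monotonic reads / read-your-writes: refuse a replica that is
        # older than what this session has already observed
        if replica.version(key) < self.seen.get(key, 0):
            raise RuntimeError("replica too stale for this session; try another")
        version, value = replica.read(key)
        self.seen[key] = version
        return value

    def write(self, replica, key, value):
        version = self.seen.get(key, 0) + 1
        replica.write(key, value, version)
        self.seen[key] = version

sf, ny = Replica(), Replica()               # e.g. the SF and NY mailbox replicas
session = ClientSession()
session.write(sf, "mbox", "mail-1")
# session.read(ny, "mbox") would raise here until the update reaches NY
print(session.read(sf, "mbox"))             # mail-1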

  21. Outline • Object Replication • Data Consistency Model (client-side) • Implementation Issues (server-side) • Update Propagation • Distributing updates to replicas, independent of consistency model • Other Consistency Model Specific Issues • Case Studies

  22. Replica Placement • A major design issue is to decide where, when, and by whom copies of the data store are to be placed. • Permanent replicas • The initial set of replicas that constitutes a distributed store • For example: • Distributed web servers, server mirrors • Distributed database systems (shared-nothing architecture vs. federated DB)

  23. Replica Placement: Server-Initiated • (Figure: an origin server in North America pushing content through a CDN distribution node to CDN servers in S. America, Asia, and Europe.) • Dynamic replica placement is a key ingredient of content delivery networks (CDNs). • A CDN company installs hundreds of CDN servers throughout the Internet • The CDN replicates its customers’ content in the CDN servers; when a provider updates content, the CDN updates its servers • Key issue: when and where replicas should be created or deleted

  24. CDN Content Placement • An algorithm (Rabinovich et al., 1999) • Each server keeps track of access counts per file, and of where access requests come from. • Assume that, given a client C, each server can determine which of the servers is closest to C.

  25. Server-Initiated Replication (cont’) • An algorithm (Rabinovich et al., 1999) • Initial placement • Migration or replication of objects to servers in the proximity of clients that issue many requests for those objects • Count access requests to file F at server S from different clients • Deletion threshold del(S,F), replication threshold rep(S,F) • If #access(S,F) <= del(S,F), remove the file, unless it is the last copy • If #access(S,F) >= rep(S,F), duplicate the file somewhere • If del(S,F) < #access(S,F) < rep(S,F), consider migrating the file
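
The threshold rule lends itself to a small Python sketch. The counting and "closest server" machinery are stubbed out, and the threshold values below are illustrative, not from the paper.

def placement_decision(access_count, del_threshold, rep_threshold, is_last_copy):
    """Return 'delete', 'replicate', 'migrate' or 'keep' for file F at server S."""
    assert del_threshold < rep_threshold
    if access_count <= del_threshold:
        return "keep" if is_last_copy else "delete"
    if access_count >= rep_threshold:
        return "replicate"      # duplicate F on a server closer to the clients
    return "migrate"            # consider moving F toward the requesting clients

print(placement_decision(access_count=3,  del_threshold=5, rep_threshold=50,
                         is_last_copy=False))   # delete
print(placement_decision(access_count=80, del_threshold=5, rep_threshold=50,
                         is_last_copy=False))   # replicate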

  26. Replica Placement: Client-Initiated • (Figure: clients issuing HTTP requests to a proxy server, which forwards them to the origin server and caches the responses.) • Cache: local storage used to temporarily hold a copy of recently accessed data, mainly to reduce request latency and traffic on the network • Client-side cache (a forward cache in the Web) • Proxy cache • Proxy cache network

  27. Update Propagation • Invalidation vs. update protocols • In an invalidation protocol, replicas are invalidated by a small message • In an update protocol, replicas are brought up to date by providing them with the modified data, or with the specific update operations (active replication) • Pull vs. push protocols • Push-based (server-based) propagation suits applications with high read-to-update ratios (why?) • Pull-based (client-based) propagation is often used by client caches • Unicasting versus multicasting • (Table: push vs. pull in a multiple-clients, single-server system)
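
A tiny Python sketch contrasting the two push protocols (the class names are illustrative, not from the slides): an invalidation message carries only the key, while an update message carries the new value itself.

class Cache(dict):
    def invalidate(self, key):
        self.pop(key, None)          # next read must go back to the server

class Server:
    def __init__(self, caches):
        self.data = {}
        self.caches = caches

    def write_invalidate(self, key, value):
        self.data[key] = value
        for cache in self.caches:
            cache.invalidate(key)    # small message: "your copy of this key is stale"

    def write_update(self, key, value):
        self.data[key] = value
        for cache in self.caches:
            cache[key] = value       # pushes the data; good for high read-to-update ratios

caches = [Cache(), Cache()]
server = Server(caches)
server.write_update("x", 1)
print(caches[0].get("x"))            # 1: the new value was pushed
server.write_invalidate("x", 2)
print(caches[0].get("x"))            # None: the copy was invalidated, re-fetch needed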

  28. Lease-Based Update Propagation • [Gray & Cheriton ’89]: a lease is a promise by the server that updates will be pushed for a specified time. When a lease expires, the client is forced to • pull the modified data from the server, if any exists, or • request a new lease for pushing updates • [Duvvuri et al.]: flexible lease system in which the lease period can be dynamically adapted • Age-based lease, based on the last time the item was modified: long-lasting leases for inactive data items • Renewal-frequency-based lease: long-term leases are granted to clients whose caches need to be refreshed often • Based on server-side state-space overhead: an overloaded server should reduce the lease period so that it needs to keep track of fewer clients, as leases expire more quickly • [Yin et al. ’99]: volume leases on objects as well as on volumes (i.e., collections of related objects)
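
As an illustration only (the weights, bounds, and parameter names below are made up, not taken from any of the cited papers), a server might combine the three adaptation policies into a single lease-duration function like this:

def lease_duration(time_since_last_modify, renewals_per_hour, tracked_clients,
                   base=60, max_lease=3600, client_capacity=10_000):
    # age-based: items that have not changed for a long time get longer leases
    age_factor = min(time_since_last_modify / 600, 10)
    # renewal-frequency-based: frequently refreshing clients get longer leases
    renewal_factor = min(1 + renewals_per_hour / 10, 5)
    # state-space overhead: an overloaded server shortens leases so that it
    # has to keep track of fewer clients
    load_factor = max(1 - tracked_clients / client_capacity, 0.1)
    return min(base * age_factor * renewal_factor * load_factor, max_lease)

print(lease_duration(time_since_last_modify=3600, renewals_per_hour=30,
                     tracked_clients=2000))   # a long lease, in seconds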

  29. Volume-Lease-Based Updates • Leases provide good performance when the cost of lease renewal is amortized over many reads • If the renewal cost is high, the lease time must be long; but a long lease implies a higher chance of reading stale data • A volume lease scheme uses a combination of long object leases and a short volume lease: • Long object leases save reading time • The short volume lease keeps objects up to date • Exploits the spatial locality between objects in a volume
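
A minimal sketch of the combined check (class and parameter names are illustrative): a cached object may be used locally only while both its long object lease and its short volume lease are valid, and one cheap volume-lease renewal revalidates every cached object in that volume.

import time

class VolumeLeaseCache:
    def __init__(self, object_lease=3600, volume_lease=30):
        self.object_lease = object_lease    # long lease on individual objects
        self.volume_lease = volume_lease    # short lease on the whole volume
        self.object_expiry = {}
        self.volume_expiry = {}

    def grant(self, volume, obj, now=None):
        now = time.time() if now is None else now
        self.object_expiry[obj] = now + self.object_lease
        self.volume_expiry[volume] = now + self.volume_lease

    def renew_volume(self, volume, now=None):
        # one cheap renewal revalidates every cached object in the volume
        now = time.time() if now is None else now
        self.volume_expiry[volume] = now + self.volume_lease

    def can_read_locally(self, volume, obj, now=None):
        # valid only while BOTH the object lease and the volume lease hold
        now = time.time() if now is None else now
        return (self.object_expiry.get(obj, 0) > now and
                self.volume_expiry.get(volume, 0) > now)

cache = VolumeLeaseCache()
cache.grant("photos", "photos/p1.jpg", now=0)
print(cache.can_read_locally("photos", "photos/p1.jpg", now=10))   # True
print(cache.can_read_locally("photos", "photos/p1.jpg", now=60))   # False: volume lease expired
cache.renew_volume("photos", now=60)
print(cache.can_read_locally("photos", "photos/p1.jpg", now=60))   # True again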

  30. Other Implementation Issues of Consistency Models • Passive replication (primary-backup organization) • A single primary replica manager at any time, plus one or more secondary replica managers • Writes can be carried out only on the primary copy • Two write strategies: • Remote-write protocols • Local-write protocols • Active replication: • There are multiple replica managers, and writes can be carried out at any replica • Cache coherence protocols

  31. Passive Replication: Remote Write • Writes are handled only by the remote primary server, and the backup servers are updated accordingly; reads are performed locally. • E.g., Sun Network Information Service (NIS, formerly Yellow Pages)

  32. Consistency Model of Passive Replication • Blocking update: wait until the backups are updated • A blocking update of the backup servers must be atomic in order to implement sequential consistency, since the primary can then sequence all incoming writes and all processes see all writes in the same order from any backup server • Non-blocking update: return as soon as the primary is updated • What happens if a backup fails after the update is acknowledged? • What consistency model does non-blocking update give? • Requires atomic multicasting in the presence of failures! • Primary replica failure • Group membership change • Virtual synchrony implementation
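
A minimal Python sketch of the blocking remote-write path (class names are made up; failure handling, atomic multicast, and view changes are omitted): the primary sequences every write and waits for all backups to acknowledge before replying, so every replica applies writes in the same order.

class BackupServer:
    def __init__(self):
        self.store = {}

    def apply(self, key, value):
        self.store[key] = value
        return True                      # acknowledgement back to the primary

    def read(self, key):
        return self.store.get(key)       # reads are served locally

class PrimaryServer(BackupServer):
    def __init__(self, backups):
        super().__init__()
        self.backups = backups

    def write(self, key, value):
        self.store[key] = value
        # blocking update: wait for every backup to acknowledge before returning
        acks = [backup.apply(key, value) for backup in self.backups]
        assert all(acks)
        return "write committed"

backups = [BackupServer(), BackupServer()]
primary = PrimaryServer(backups)
primary.write("x", 42)
print(backups[0].read("x"))   # 42: visible at every backup once the write returns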

  33. Passive Replication: Local-Write • Based on object migration • Multiple successive writes can be carried out locally • How to locate the data item? • broadcast, home-based, forwarding pointers, or hierarchical location service

  34. Remarks • Without backup servers • Implements linearizability • With backup servers • Reads and writes can be carried out simultaneously • Consistency model?? • Applicable to mobile computing • The mobile host serves as the primary server before disconnecting • While it is disconnected, all update operations are carried out locally; other processes can still perform read operations • When it connects again, updates are propagated from the primary to the backups

  35. Active Replication • In active replication, each replica has an associated process that carries out update ops. • The client makes a request (via a front end), and the request is multicast to the group of replica managers. • Totally ordered reliable multicast, based on Lamport timestamps • Implements sequential consistency • An alternative is based on a central coordinator (sequencer) • First forward each op to the sequencer for a unique sequence number • Then forward the op, together with the number, to all replicas • Another problem is replicated invocations • Object A invokes object B, which in turn invokes object C; if object B is replicated, each replica will invoke C independently, in principle • This problem can occur in any client-server system • No fully satisfactory solutions
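
The sequencer alternative can be sketched in a few lines of Python (illustrative only; a real system would also need reliable delivery and sequencer fault tolerance): the front end obtains a sequence number, multicasts the numbered op to all replica managers, and each replica applies ops strictly in sequence order, holding back anything that arrives early.

import heapq

class Sequencer:
    def __init__(self):
        self.next = 0

    def assign(self):
        seq, self.next = self.next, self.next + 1
        return seq

class ReplicaManager:
    def __init__(self):
        self.state = {}
        self.expected = 0
        self.held_back = []            # ops that arrived out of order

    def deliver(self, seq, op):
        heapq.heappush(self.held_back, (seq, op))
        while self.held_back and self.held_back[0][0] == self.expected:
            _, (key, value) = heapq.heappop(self.held_back)
            self.state[key] = value    # apply ops in the agreed total order
            self.expected += 1

sequencer = Sequencer()
replicas = [ReplicaManager(), ReplicaManager()]

def front_end_write(key, value):
    seq = sequencer.assign()           # step 1: get a unique sequence number
    for r in replicas:                 # step 2: multicast (seq, op) to all replicas
        r.deliver(seq, (key, value))

front_end_write("x", 1)
front_end_write("x", 2)
print([r.state for r in replicas])     # every replica applied the ops in the same order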

  36. Replication-Aware Solution to Replicated Invocations • Forwarding an invocation request from a replicated object. • Returning a reply to a replicated object.

  37. In Summary • Replication • Data, file (web files), and object replication • For reliability and performance • Client-perceived consistency models • Strong consistency • Weak consistency • Eventual consistency and its client-centric variations • Server-side implementation (update propagation) • CAP Theorem [Brewer ’00, Gilbert & Lynch ’02] • Of the three properties of shared-data systems (data consistency, service availability, and tolerance to network partitions), only two can be achieved at any given time.

  38. CAP Theorem • Three core systemic requirements in designing and deploying applications in a distributed environment: • Consistency, Availability, and Partition tolerance • Partition tolerance: no set of failures less than a total network failure is allowed to cause the system to respond incorrectly • Availability (minimal latency) • Amazon claims that just an extra one-tenth of a second on their response times would cost them 1% in sales; Google noted that just a half-second increase in latency caused traffic to drop by a fifth

  39. CAP • If we want A and B to be highly available (i.e., working with minimal latency), and we want our nodes N1 to Nn (where n could be hundreds or even thousands) to remain tolerant of network partitions (lost messages, undeliverable messages, hardware outages, process failures), then sometimes we are going to get cases where some nodes think that V is V0 while other nodes think that V is V1.

  40. CAP Theorem Proof • Impossibility argument • If M is an asynchronous message (no clocks at all), N1 has no way of knowing whether N2 gets the message. • Even with guaranteed delivery of M, N1 has no way of knowing whether a message is delayed by a partition event or by something failing in N2. • Making M synchronous doesn’t help, because that treats the write by A on N1 and the update event from N1 to N2 as one atomic op, which gives us the same latency issues. • Even in a partially synchronous model, atomicity cannot be guaranteed. • In a partially synchronous network, each site has a local clock and all clocks increment at the same rate, but the clocks are not synchronized to display the same value at the same instant.

  41. CAP Theorem Proof • Consider a transaction (i.e., a unit of work based around the persistent data item V) called α; then α1 could be the write op from before and α2 could be the read. • On a local system this would easily be handled by a database with some simple locking, isolating any attempt to read in α2 until α1 completes safely. • In the distributed model, though, with nodes N1 and N2 to worry about, the intermediate synchronizing message also has to complete. Unless we can control when α2 happens, we can never guarantee that it will see the same data values that α1 writes. All methods of adding that control (blocking, isolation, centralized management, etc.) will impact either partition tolerance or the availability of α1 (A) and/or α2 (B).

  42. Dealing with CAP • Drop partition tolerance • But giving up partition tolerance limits scaling • Drop availability • On encountering a partition event, affected services simply wait until the data is consistent, and therefore remain unavailable during that time; controlling this can be complex in large-scale systems • Drop consistency • Lots of inconsistencies don’t actually require much work to resolve • The eventually consistent model works for most apps • BASE: Basically Available, Soft state, Eventually consistent (the logical opposite of ACID in databases: atomicity, consistency, isolation, durability)
