Replication and Fault Tolerance

Introduction • Reasons for replication • Reliability: with multiple copies, if one replica crashes, work continues with another • Performance: divide the work among multiple servers • Place data close to the process that uses it, so data access time is reduced • Drawbacks: cost, and inconsistency between copies (e.g., a bank account balance) • Example: accessing Web pages • Pages can be cached • The cache must be kept up to date at all times

Object Replication (1) • If remote objects are replicated, operations must be performed in the same (correct) order at all replicas • First, concurrent invocations on each replica must be handled correctly • Organization of a distributed remote object shared by two different clients.

How do you prevent concurrent access to distributed objects? • Two choices • Let the object itself handle it: Java allows methods to be synchronized; in C++, use pthreads, mutexes, … • Let the middleware handle it
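
The first choice (the object protects itself) can be sketched in Python, where a `threading.Lock` plays roughly the role of a Java `synchronized` method. This is only an illustration; the `Counter` class and the `run_demo` helper are hypothetical names, not part of the slides.

```python
import threading


class Counter:
    """A shared object that serializes concurrent invocations itself."""

    def __init__(self):
        self._lock = threading.Lock()  # assumed stand-in for `synchronized`
        self._value = 0

    def increment(self):
        # Only one thread at a time may run this read-modify-write.
        with self._lock:
            self._value += 1

    def value(self):
        with self._lock:
            return self._value


def run_demo(threads=4, times=10_000):
    """Invoke increment() from several threads and return the final count."""
    counter = Counter()

    def hammer():
        for _ in range(times):
            counter.increment()

    workers = [threading.Thread(target=hammer) for _ in range(threads)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return counter.value()
```

With the lock in place, the final count equals `threads * times`. In the second choice, the same locking would live in the middleware (for example, an object adapter) instead of in the object.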

Object Replication (2) • A remote object capable of handling concurrent invocations on its own. • A remote object for which an object adapter is required to handle concurrent invocations

Object Replication (3) • A distributed system for replication-aware distributed objects. • A distributed system responsible for replica management

Data-Centric Consistency Models • A consistency model is a contract between processes and the data store (file system, shared memory, shared database): if the processes obey certain rules, the data store promises to work correctly. • E.g., a process that reads a data item expects the up-to-date value stored by the last write operation. • The general organization of a logical data store, physically distributed and replicated across multiple processes.

Strict Consistency • Any read on a data item x returns a value corresponding to the result of the most recent write on x. • Observation: without a global clock, it does not make sense to talk about "the most recent" write in a distributed environment. • Notation: all data items are assumed to be initialized to NIL • W(x)a: value a is written to x • R(x)a: reading x returns the value a • Behavior of two processes, operating on the same data item. • A strictly consistent store. • A store that is not strictly consistent.

Sequential Consistency (1) • The result of any execution is the same as if the operations of all processes were executed in some sequential order, and the operations of each individual process appear in that sequence in the order specified by its program. • Any valid interleaving of read and write operations is acceptable, but all processes must see the same interleaving of operations. • A sequentially consistent data store. • A data store that is not sequentially consistent.
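
A much simplified check of the "same interleaving" requirement, as a sketch. It only compares the order in which writes become visible to each observing process and ignores how reads are interleaved. The function name and the data layout are assumptions made for this illustration.

```python
def same_interleaving(program_order, observed):
    """Simplified sequential-consistency check on observed write orders.

    program_order: dict writer -> list of values in the order written.
    observed: dict observer -> list of (writer, value) in the order seen.
    """
    orders = list(observed.values())
    if not orders:
        return True
    # All observers must agree on one interleaving.
    if any(order != orders[0] for order in orders[1:]):
        return False
    # That interleaving must respect each writer's program order.
    for writer, writes in program_order.items():
        seen = [value for (w, value) in orders[0] if w == writer]
        if seen != writes[:len(seen)]:
            return False
    return True


program_order = {"P1": ["a"], "P2": ["b"]}
agree = {"P3": [("P2", "b"), ("P1", "a")], "P4": [("P2", "b"), ("P1", "a")]}
disagree = {"P3": [("P2", "b"), ("P1", "a")], "P4": [("P1", "a"), ("P2", "b")]}
```

Here `same_interleaving(program_order, agree)` holds because P3 and P4 see the writes in the same order, while `disagree` fails because they see them in different orders.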

Causal Consistency (1) • Necessary condition: Writes that are potentially causally related must be seen by all processes in the same order. • Concurrent writes may be seen in a different order on different machines.

Causal Consistency (2) • This sequence is allowed with a causally consistent store, but not with a sequentially or strictly consistent store. • A data store that is not sequentially consistent.

Causal Consistency (3) • Concurrent writes • A violation of a causally consistent store. • A correct sequence of events in a causally consistent store.

FIFO Consistency (1) • Necessary condition: Writes done by a single process are seen by all other processes in the order in which they were issued, but writes from different processes may be seen in a different order by different processes.
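
The necessary condition can be sketched as a small checker. It assumes each observer's view is given as a list of (writer, value) pairs and that an observer sees a writer's writes without gaps. `is_fifo_consistent` and the sample data are hypothetical, chosen only for illustration.

```python
def is_fifo_consistent(issued, observed):
    """Check that every observer sees each writer's writes in issue order.

    issued: dict writer -> list of values in the order they were written.
    observed: dict observer -> list of (writer, value) in the order seen.
    """
    for seen in observed.values():
        next_index = {writer: 0 for writer in issued}
        for writer, value in seen:
            i = next_index[writer]
            # The next write seen from this writer must be its next issued one.
            if i >= len(issued[writer]) or issued[writer][i] != value:
                return False
            next_index[writer] = i + 1
    return True


issued = {"P1": ["a"], "P2": ["b", "c"]}
# Writes from different processes may interleave differently per observer.
valid = {"P3": [("P2", "b"), ("P1", "a"), ("P2", "c")],
         "P4": [("P1", "a"), ("P2", "b"), ("P2", "c")]}
# P3 sees P2's second write before its first: not FIFO consistent.
invalid = {"P3": [("P2", "c"), ("P2", "b"), ("P1", "a")]}
```

Note that `valid` passes even though P3 and P4 disagree on the overall order, which is exactly what separates FIFO consistency from sequential consistency.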

FIFO Consistency (2) • A valid sequence of events under FIFO consistency.

FIFO • Three concurrently executing processes.

FIFO Consistency (3) • Statement execution as seen by the three processes from the previous slide. The statements in bold are the ones that generate the output shown.

FIFO Consistency (4) • Two concurrent processes. • Both processes can be killed: P1 may read y = 0 before it sees P2's write W(y)1, and P2 may likewise read a stale value before it sees P1's write.

Summary of Consistency Models • Consistency models not using synchronization operations.

Distribution Protocols • Replica Placement • Update Propagation • Epidemic Protocols

Replica Placement • The logical organization of different kinds of copies of a data store into three concentric rings.

Replica Placement • Permanent replicas: • The initial set of replicas, hosted by processes/machines that always hold a copy • Examples: Web site files, mirroring (all the content), and distributed databases

Server-initiated replicas: • A process that can dynamically host a replica at the request of another server in the data store • Also called push caches • A replica is created when there is a burst of requests from a certain location • The algorithm: • Replication takes place to reduce the load on a server • Specified files on a server can be migrated to a server close to the clients that request them.

Server-Initiated Replicas • Counting access requests from different clients. • E.g., a Web hosting service
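
A minimal sketch of the access-counting idea, assuming the server keeps one counter per (file, client region) pair and treats a region as a replication candidate once its count reaches a threshold. The class name, the region labels and the threshold value are all hypothetical; the slides do not specify them.

```python
from collections import Counter


class AccessCounter:
    """Counts requests per (file, region) to spot bursts from one location."""

    def __init__(self, replicate_threshold):
        # Hypothetical tuning knob: requests needed before replicating.
        self.replicate_threshold = replicate_threshold
        self.counts = Counter()  # (file_name, region) -> number of requests

    def record(self, file_name, region):
        self.counts[(file_name, region)] += 1

    def replication_candidates(self):
        # Pairs whose request count has reached the threshold, sorted.
        return sorted(key for key, n in self.counts.items()
                      if n >= self.replicate_threshold)


stats = AccessCounter(replicate_threshold=3)
for region in ["asia", "asia", "europe", "asia"]:
    stats.record("index.html", region)
candidates = stats.replication_candidates()
```

In this run only the pair ("index.html", "asia") reaches the threshold, so that file would be the one to copy or migrate toward clients in that region.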

Client-initiated replicas: • Client caches • Local storage capacity • Used temporarily to store a copy of data just requested • Managing the cache is left to the client • Access time improves when a cache hit occurs.
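
A client cache can be sketched as follows. The least-recently-used eviction policy, the `capacity` parameter and the `fetch` callback are assumptions made for this example; the slides only say that cache management is left to the client.

```python
from collections import OrderedDict


class ClientCache:
    """Client-side cache that keeps copies of recently requested data."""

    def __init__(self, fetch, capacity=2):
        self._fetch = fetch          # hypothetical function that asks the server
        self._capacity = capacity    # limited local storage
        self._items = OrderedDict()  # least recently used entry comes first
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._items:
            self.hits += 1                # cache hit: no server round trip
            self._items.move_to_end(key)  # mark as most recently used
            return self._items[key]
        self.misses += 1                  # cache miss: fetch from the server
        value = self._fetch(key)
        self._items[key] = value
        if len(self._items) > self._capacity:
            self._items.popitem(last=False)  # evict least recently used entry
        return value


server_calls = []


def fetch_from_server(key):
    server_calls.append(key)
    return f"contents of {key}"


cache = ClientCache(fetch_from_server, capacity=2)
for page in ["a", "a", "b", "c", "a"]:
    cache.get(page)
```

Only the second request for "a" is a hit; the last request misses again because "a" was evicted when "c" arrived.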

Update Propagation • Updates are initiated at a client • They are forwarded to one of the copies and propagated from there to the other copies • Design issues to consider in propagating updates: • State versus operations • Pull versus push protocols

Push- and Pull-Based Approaches • Push-based approach • Also referred to as a server-based protocol • Updates are propagated directly to the replicas without being requested • Pull-based approach • Also referred to as a client-based protocol • A client asks a server to send any updates it has at the moment.
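
The difference between the two approaches can be sketched with two small classes. `Server`, `Replica`, the version counter and the method names are hypothetical; a real protocol would also need networking and failure handling.

```python
class Server:
    def __init__(self):
        self.version = 0
        self.value = None
        self.subscribers = []  # replicas that receive pushed updates

    def write(self, value):
        self.version += 1
        self.value = value
        # Push: propagate to subscribed replicas without being asked.
        for replica in self.subscribers:
            replica.apply(self.version, value)

    def updates_since(self, version):
        # Pull: answer a request with the newest state, if there is one.
        if version < self.version:
            return (self.version, self.value)
        return None


class Replica:
    def __init__(self):
        self.version = 0
        self.value = None

    def apply(self, version, value):
        self.version, self.value = version, value

    def pull(self, server):
        update = server.updates_since(self.version)
        if update is not None:
            self.apply(*update)


server = Server()
pushed, pulled = Replica(), Replica()
server.subscribers.append(pushed)  # only this replica uses push
server.write("v1")
stale_value = pulled.value         # still None: nothing was requested yet
pulled.pull(server)
```

After the write, the pushed replica is already up to date, while the pulling replica stays stale until it asks the server.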

Push versus Pull Protocols • A comparison between push-based and pull-based protocols in the case of multiple-client, single-server systems.

Quorum-Based Protocols • Three examples of the voting algorithm: • A correct choice of read and write set • A choice that may lead to write-write conflicts • A correct choice, known as ROWA (read one, write all)
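
The three cases can be checked against the usual quorum constraints, read quorum + write quorum > N and write quorum > N/2. These constraints are not spelled out on the slide, and the values below (N = 12 replicas) are assumed for illustration. The function name is hypothetical.

```python
def check_quorum(n, n_read, n_write):
    """Return (no_read_write_conflict, no_write_write_conflict)."""
    # Every read set must overlap every write set.
    no_read_write_conflict = n_read + n_write > n
    # Any two write sets must overlap.
    no_write_write_conflict = 2 * n_write > n
    return no_read_write_conflict, no_write_write_conflict


correct_choice = check_quorum(12, n_read=3, n_write=10)
conflicting_choice = check_quorum(12, n_read=7, n_write=6)
rowa = check_quorum(12, n_read=1, n_write=12)
```

With these assumed values, the second choice fails the write-write check because two disjoint groups of 6 replicas could each accept a different write. ROWA is the extreme case where a read needs one replica and a write needs all of them.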

Fault Tolerance • Basic Concepts • Failure Models

Introduction • A partial failure in a distributed system occurs when one component fails • It may affect the operation of some components • while leaving other components totally unaffected • The design goal in a distributed system is to • build a system that automatically recovers from partial failures • without seriously affecting overall performance

Basic Concepts • Dependability Includes • Availability • Reliability • Safety • Maintainability

Availability • The system is ready to be used immediately • In general, the system is operating correctly at any given moment and is available to perform its functions. • Availability = (total elapsed time - sum of downtime) / total elapsed time, usually expressed as a percentage • E.g., 99.9%
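
A worked instance of the availability formula, as a sketch. The function name and the sample figures (a 720-hour month with two outages) are invented for illustration.

```python
def availability(total_elapsed, downtimes):
    """Fraction of the elapsed time during which the system was up."""
    return (total_elapsed - sum(downtimes)) / total_elapsed


# Hypothetical month of 720 hours with outages of 0.5 h and 0.22 h.
fraction = availability(720, [0.5, 0.22])
percent = round(fraction * 100, 1)
```

With 0.72 hours of downtime in 720 hours, the system was available for roughly 99.9% of the month.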

Reliability • The system can run continuously without failure. • A highly reliable system is one that will most likely continue to work without interruption during a relatively long period of time • One measure used to define a component's or system's reliability is mean time between failures (MTBF) • MTBF = (total elapsed time - sum of downtime) / number of failures • A related measurement is mean time to repair (MTTR): the average time interval (usually expressed in hours) that it takes to repair a failed component.
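
A small worked example of both measures, as a sketch. It assumes every failure is followed by one repair, so MTTR is the total downtime divided by the number of failures. The function names and the sample figures are hypothetical.

```python
def mtbf(total_elapsed, downtimes):
    """Mean time between failures: uptime divided by number of failures."""
    return (total_elapsed - sum(downtimes)) / len(downtimes)


def mttr(downtimes):
    """Mean time to repair: average length of one outage."""
    return sum(downtimes) / len(downtimes)


# Hypothetical: 1000 hours observed, three failures repaired in 2, 3 and 5 hours.
outages = [2, 3, 5]
mean_between = mtbf(1000, outages)  # 990 / 3 hours
mean_repair = mttr(outages)         # 10 / 3 hours
# Under these definitions, MTBF / (MTBF + MTTR) is uptime / total elapsed time.
ratio = mean_between / (mean_between + mean_repair)
```

Under these definitions the ratio MTBF / (MTBF + MTTR) works out to 990/1000, the same fraction the availability formula gives for this data.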

Safety • Nothing catastrophic will happen if a system temporarily fails to operate correctly.

Maintainability • Refers to how easily a failed system can be repaired

Terminology • Failure: When a component is not living up to its specifications, a failure occurs • Error: That part of a component's state that can lead to a failure • Fault: The cause of an error • Fault prevention: prevent the occurrence of a fault • Fault tolerance: build a component in such a way that it can meet its specifications in the presence of faults

Failure Models • Different types of failures.

Failure Models (cont.) • Timing failures: The output of a component is correct, but lies outside a specified real-time interval (performance failures: too slow) • Response failures: The output of a component is incorrect • Value failure: The wrong value is produced • State transition failure: Execution of the component's service brings it into a wrong state

Failure Models (cont.) • Crash failures: A component halts but behaves correctly before halting • Omission failures: A component fails to respond • Receive omission: A server fails to receive incoming messages • Send omission: A server fails to send messages • Arbitrary failures: A component may produce arbitrary output and be subject to arbitrary timing failures
