CIS 620 Advanced Operating Systems

CIS 620 Advanced Operating Systems Lecture 11 – Fault Tolerance Prof. Timothy Arndt BU 331

Fault Tolerance • Dependable systems have the following requirements • Availability • Reliability • Safety • Maintainability

Fault Tolerance • Faults can be classified as • Transient • Intermittent • Persistent

Failure Models • Different types of failures.

Failure Masking by Redundancy • Triple modular redundancy.
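
The voting step of triple modular redundancy can be sketched as follows. This is a minimal illustration, not the circuit from the slide: the replica functions `good` and `faulty` and the helper names are assumptions made for the example.

```python
from collections import Counter

def majority_vote(outputs):
    """Return the value produced by a strict majority of replicas, else None."""
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) // 2 else None

def tmr(replicas, x):
    """Run every replica on the same input and vote on the results."""
    return majority_vote([replica(x) for replica in replicas])

def good(x):
    return x * 2

def faulty(x):
    # Hypothetical faulty replica: always off by one.
    return x * 2 + 1

# One faulty replica out of three is masked by the two correct ones.
result = tmr([good, good, faulty], 21)
print(result)
```

With three replicas the voter masks a single fault; if all three outputs differ, no majority exists and the sketch returns `None`.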

Flat Groups versus Hierarchical Groups • Communication in a flat group. • Communication in a simple hierarchical group

Agreement in Faulty Systems • The Byzantine generals problem for 3 loyal generals and 1 traitor. • The generals announce their troop strengths (in units of 1 kilosoldier). • The vectors that each general assembles based on (a). • The vectors that each general receives in step 3.

Agreement in Faulty Systems • The same as in previous slide, except now with 2 loyal generals and one traitor.
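
The vector exchange described on these two slides can be simulated in a few lines. The sketch below rests on stated assumptions: generals are numbered from 0, a traitor sends a fresh arbitrary value in every message (modelled by a counter starting at 100), and `None` stands for "unknown". The function name `byzantine_agreement` is made up for the example.

```python
from collections import Counter
from itertools import count

def byzantine_agreement(values, traitors):
    """Return, for each loyal general, the vector it decides on."""
    n = len(values)
    lies = count(100)  # assumption: every lie is a new, distinct value

    def send(sender, honest_value):
        return next(lies) if sender in traitors else honest_value

    # Steps 1-2: general j assembles a vector from what each general told it.
    vectors = {j: [send(i, values[i]) for i in range(n)] for j in range(n)}

    decisions = {}
    for j in range(n):
        if j in traitors:
            continue
        # Step 3: every other general k relays its whole vector to j.
        received = [[send(k, vectors[k][i]) for i in range(n)]
                    for k in range(n) if k != j]
        # Step 4: take the majority of each element, if there is one.
        decided = []
        for i in range(n):
            column = [vec[i] for vec in received]
            value, votes = Counter(column).most_common(1)[0]
            decided.append(value if votes > len(column) // 2 else None)
        decisions[j] = decided
    return decisions

four = byzantine_agreement([1, 2, 3, 4], traitors={2})
three = byzantine_agreement([1, 2, 3], traitors={2})
print(four)
print(three)
```

With four generals the loyal ones agree on each other's values and mark the traitor's entry as unknown. With three generals each element has only two reports, one of them from the traitor, so no majority can form.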

RPC Failure Semantics • This gets hard and ugly. • Can't find the server. • Need some sort of out-of-band response from the client stub to the client. • Ada exceptions • C signals • Multithread the client and start the "exception" thread. • This loses transparency (centralized systems don't have this).

RPC Failures • Lost request message. • This is easy if known. That is, if we are sure the request was lost. • Also easy if idempotent and we think it might be lost. • Simply retransmit the request. • Assumes the client still knows the request. • Lost reply message. • If it is known the reply was lost, have server retransmit.

RPC Failures • Assumes the server still has the reply. • How long should the server hold the reply? • Wait forever for the reply to be ack'ed? No! • Discard after "enough" time. • Discard after we receive another request from this client. • Ask the client if the reply was received. • Keep resending reply. • What if we are not sure of whether we lost the request or the reply? • If the server is stateless, it doesn't know and the client can't tell! • If idempotent, simply retransmit the request.

RPC Failures • What if the request is not idempotent and we can't tell whether the request or the reply was lost? • Use sequence numbers so the server can tell that this is a new request, not a retransmission of a request it has already done. • Doesn't work for stateless servers. • Server crashes • Did it crash before or after doing some nonidempotent action? • Can't tell from messages.
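
A stateful server can filter retransmissions with per-client sequence numbers, as the slide suggests. The class below is a hypothetical sketch: `CounterServer`, its `handle` method, and the "add an amount to a running total" operation are all invented for illustration. It keeps only the most recent reply per client, which matches the "discard after we receive another request from this client" policy.

```python
class CounterServer:
    """Executes each (client, sequence number) request at most once."""

    def __init__(self):
        self.total = 0
        self.last = {}  # client_id -> (seq, cached reply)

    def handle(self, client_id, seq, amount):
        seen = self.last.get(client_id)
        if seen is not None and seen[0] == seq:
            # Retransmission of a request already done: resend the old reply.
            return seen[1]
        self.total += amount          # the nonidempotent action
        reply = self.total
        self.last[client_id] = (seq, reply)  # replaces the older cached reply
        return reply

server = CounterServer()
first = server.handle("c1", 1, 10)
retry = server.handle("c1", 1, 10)   # reply was lost, client retransmits
second = server.handle("c1", 2, 5)
print(first, retry, second, server.total)
```

The table in `self.last` is exactly the state a stateless server does not have, which is why the technique does not apply there.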

RPC Failures • From databases, we get the idea of transactions and commits. • This really does solve the problem but is not cheap. • Fairly easy to get “at least once” (try the request again if the timer expires) or “at most once” (give up if the timer expires) semantics. Hard to get “exactly once” without transactions. • To be more precise: a transaction either happens exactly once or not at all (which sounds like at most once), and the client knows which.
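
The difference between the two cheap semantics shows up in a small simulation. Everything here is an assumption made for the example: `LossyCounter` models a server whose first few replies are lost, a return value of `None` models the client's timer expiring, and the retry limit of 5 is arbitrary.

```python
class LossyCounter:
    """Hypothetical server: runs every request but loses the first replies."""

    def __init__(self, lose_replies):
        self.executions = 0
        self.lose_replies = lose_replies

    def send(self, request):
        self.executions += 1          # the request is executed
        if self.lose_replies > 0:
            self.lose_replies -= 1
            return None               # reply lost: the client times out
        return self.executions

def call_at_least_once(send, request, max_tries=5):
    """Retry on timeout; the server may execute the request several times."""
    for _ in range(max_tries):
        reply = send(request)
        if reply is not None:
            return reply
    raise TimeoutError("no reply after retries")

def call_at_most_once(send, request):
    """Give up on timeout; the request ran zero times or once."""
    reply = send(request)
    if reply is None:
        raise TimeoutError("no reply")
    return reply

a = LossyCounter(lose_replies=1)
reply = call_at_least_once(a.send, "inc")

b = LossyCounter(lose_replies=1)
try:
    call_at_most_once(b.send, "inc")
    gave_up = False
except TimeoutError:
    gave_up = True

print(reply, a.executions, gave_up, b.executions)
```

In the first case the request ran twice; in the second it ran once but the client cannot tell. Neither outcome is "exactly once", which is the gap transactions close.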

RPC Failures • Client crashes • Orphan computations exist. • Again transactions work but are expensive. • We can have the rebooted client start another epoch and all computations of previous epoch are killed and clients resubmit. • It is better to let old computations whose owners can be found continue. • This isn’t a great solution.

RPC Failures • An orphan may hold locks or might have done something not easily undone. • Serious programming is needed.

Basic Reliable-Multicasting Schemes • A simple solution to reliable multicasting when all receivers are known and are assumed not to fail • Message transmission • Reporting feedback

Nonhierarchical Feedback Control • Several receivers have scheduled a request for retransmission, but the first retransmission request leads to the suppression of others.

Hierarchical Feedback Control • The essence of hierarchical reliable multicasting. • Each local coordinator forwards the message to its children. • A local coordinator handles retransmission requests.

Virtual Synchrony • The logical organization of a distributed system to distinguish between message receipt and message delivery

Virtual Synchrony • The principle of virtually synchronous multicast.

Message Ordering • Three communicating processes in the same group. The ordering of events per process is shown along the vertical axis.

Message Ordering • Four processes in the same group with two different senders, and a possible delivery order of messages under FIFO-ordered multicasting
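
FIFO-ordered delivery is commonly implemented with per-sender sequence numbers and a hold-back queue, which also shows the gap between receiving a message and delivering it. The sketch below assumes each sender numbers its messages from 0; the class `FifoReceiver` and the message names are illustrative.

```python
class FifoReceiver:
    """Delivers each sender's messages in the order they were sent."""

    def __init__(self):
        self.next_seq = {}   # sender -> next sequence number to deliver
        self.held = {}       # (sender, seq) -> message received but not delivered
        self.delivered = []

    def receive(self, sender, seq, msg):
        expected = self.next_seq.get(sender, 0)
        if seq < expected:
            return           # duplicate of a message already delivered
        self.held[(sender, seq)] = msg
        # Deliver as many consecutive messages from this sender as possible.
        while (sender, expected) in self.held:
            self.delivered.append(self.held.pop((sender, expected)))
            expected += 1
        self.next_seq[sender] = expected

r = FifoReceiver()
r.receive("P1", 1, "m2")     # arrives early, held back
held_back = list(r.delivered)
r.receive("P4", 0, "m3")     # different sender, delivered at once
r.receive("P1", 0, "m1")     # unblocks m2
print(r.delivered)
```

Messages from different senders may be interleaved in any way; only the order per sender is constrained.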

Implementing Virtual Synchrony • Six different versions of virtually synchronous reliable multicasting.

Implementing Virtual Synchrony • Process 4 notices that process 7 has crashed, sends a view change • Process 6 sends out all its unstable messages, followed by a flush message • Process 6 installs the new view when it has received a flush message from everyone else

Transaction, Recovery and Concurrency Control • Transactions are the units to be considered in both recovery and concurrency control • Recovery techniques are needed to ensure that transactions complete successfully despite system failures • Concurrency control is needed to prevent concurrent transactions from interfering with each other • Locking and timestamping are the major techniques for concurrency control

Properties of Transactions • Atomic - a transaction is an atomic unit of processing; it is either performed in its entirety or not performed at all • Consistent - a transaction must take the database from one consistent state to another • Isolated - changes made by a transaction should not be seen by other transactions until the initial transaction is committed • Durable - changes that are made and committed by a transaction must never be lost

Atomicity • Either an operation completes fully or the operation does not happen at all • Low-level atomic operations are built into hardware • High-level atomic operations are called transactions • Transaction manager ensures that all transactions either • complete (“committed transaction”) • have no effect on the db (“aborted transaction”)

System Model • [Figure: memory and disk holding data items A and B; transactions T1 and T2 each issue read(A) and write(B) through their own local buffer.]

Assumptions • Each data item can be read and written only once by one single transaction. It can be modified many times in the local buffer • Values are written onto the global buffer space in the same sequence that the write orders are issued • If the transaction reads a value from the database, and this value was previously modified, it always gets the latest value • A data item can be a file, relation, record, physical page, etc., determined by the designer

Transaction Example • Transfer $50 from account A to account B • Transaction structure: 1. read(A) 2. A := A - 50 3. write(A) 4. read(B) 5. B := B + 50 6. write(B) • If initial values are A = 180 and B = 100, then after the execution A = 130 and B = 150 • If the system crashes between steps 3 and 6, then the database is in an inconsistent state

Transaction Model • A transaction must see a consistent database • During transaction execution the database may be inconsistent • When the transaction is committed (completed successfully), the database must be consistent • A transaction which does not complete successfully is termed “aborted” • An aborted transaction must be “rolled back”, meaning the database is returned to its state prior to the transaction
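
The $50 transfer and its rollback can be sketched in memory. This is a simplified illustration under stated assumptions: the database is a Python dict, the failure is simulated by a flag, and rollback uses an in-memory snapshot, so it models a transaction failure but not a system crash. The function name `transfer` is made up for the example.

```python
def transfer(db, src, dst, amount, fail_after_debit=False):
    """Move `amount` from src to dst; restore the prior state on failure."""
    before = dict(db)                 # snapshot used for rollback
    try:
        db[src] -= amount             # steps 1-3: read, subtract, write
        if fail_after_debit:
            raise RuntimeError("simulated failure between write(A) and write(B)")
        db[dst] += amount             # steps 4-6: read, add, write
    except RuntimeError:
        db.clear()
        db.update(before)             # roll back to the state before the transaction
        return "aborted"
    return "committed"

db = {"A": 180, "B": 100}
status_ok = transfer(db, "A", "B", 50)
after_commit = dict(db)
status_fail = transfer(db, "A", "B", 50, fail_after_debit=True)
print(status_ok, after_commit, status_fail, db)
```

After the aborted run the balances are unchanged, so the total A + B is preserved in every state another transaction could observe.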

Problems in Ensuring Atomicity • Transaction failure: transaction cannot complete due to user abort or internal error condition • System errors: the database system must terminate an active transaction due to an error condition (e.g., deadlock) • System crash: a power failure or other hardware failure causes the system to crash • Disk failure: a head crash or similar failure destroys all or part of disk storage

Management of Failure • Volatile storage • does not survive system crashes • examples: main memory, cache memory • Nonvolatile storage • survives system crashes • examples: disk; tape • Stable storage • a mythical form of storage that survives all failures • approximated by maintaining multiple copies on distinct nonvolatile media

Write-ahead Log • A log file is kept on stable storage • Each transaction Ti, when it starts, registers itself on the log by writing <Ti, starts> • Whenever Ti executes write(X), the record <Ti, X, old-value, new-value> is written sequentially on the log, and then the write(X) is executed • When Ti reaches its last statement, the record <Ti, commits> is added to the log

Write-ahead Log • If X is modified, then its corresponding log record is always first written on the log and then written on the database • Before Ti is committed, all its corresponding log records must be in stable storage
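
The logging rule can be sketched with a Python list standing in for the log on stable storage. The class `WalDatabase` and its method names are assumptions for illustration; a real system would force the log record to disk before touching the database page.

```python
class WalDatabase:
    """Toy database that appends a log record before every update."""

    def __init__(self, data):
        self.data = dict(data)
        self.log = []    # stands in for the log file on stable storage

    def start(self, tid):
        self.log.append((tid, "starts"))

    def write(self, tid, item, new_value):
        # Log first: <Ti, X, old-value, new-value>.
        self.log.append((tid, item, self.data[item], new_value))
        # Only then execute write(X).
        self.data[item] = new_value

    def commit(self, tid):
        self.log.append((tid, "commits"))

wal = WalDatabase({"A": 100, "B": 300, "C": 5})
wal.start("T1")
wal.write("T1", "B", 400)
wal.write("T1", "C", 10)
wal.write("T1", "A", 560)
wal.commit("T1")
for record in wal.log:
    print(record)
```

Because the old value is captured before the update, every record carries what is needed both to undo and to redo the write.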

Example Transactions
T1                  T2
read(A)             read(A)
A := A + 50         A := A + 10
read(B)             write(A)
B := B + 100        read(D)
write(B)            D := D - 10
read(C)             read(E)
C := 2C             read(B)
write(C)            E := E + B
A := A + B + C      write(E)
write(A)            D := D + E
                    write(D)
Initial values: A = 100, B = 300, C = 5, D = 60, E = 80

Log Records
1.  <T1, starts>
2.  <T1, B, 300, 400>
3.  <T1, C, 5, 10>
4.  <T1, A, 100, 560>
5.  <T1, commits>
6.  <T2, starts>
7.  <T2, A, 560, 570>
8.  <T2, E, 80, 480>
9.  <T2, D, 60, 530>
10. <T2, commits>
Database Values
I.   B = 400
II.  C = 10
III. A = 560
IV.  A = 570
V.   E = 480
VI.  D = 530

Possible Order of Writes • Order in time of log records (1-10) and database writes (I-VI): 1, 2, 3, 4, 5, I, II, III, 6, 7, 8, IV, 9, 10, V, VI

Consequences of a Crash • After a crash, the log is examined • Various actions are taken, depending on the last instruction written on the log • Example:
Last instruction (i)    Action
i = 0                   nothing
1 <= i <= 4             undo(T1)
5 <= i <= 9             redo(T1), undo(T2)
i >= 10                 redo(T1), redo(T2)

Algorithm • Redo all transactions for which the log has both start and commit operations • Undo all transactions for which the log has a start operation but no commit operation • Remarks: • In a multitasking system, more than one transaction may need to be undone • if a system crashes during the recovery stage, the new recovery must still give correct results • In this algorithm, a large number of transactions must be redone since we don’t know how far behind the database updates are
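
The redo/undo rule can be sketched against the log of the T1/T2 example. The tuple layout of the records and the function name `recover` are assumptions for the illustration, and D's new value is taken as 530, the value T2 computes (60 - 10 + 480). The on-disk state at the crash is also an assumed one.

```python
LOG = [
    ("T1", "starts"),
    ("T1", "B", 300, 400),
    ("T1", "C", 5, 10),
    ("T1", "A", 100, 560),
    ("T1", "commits"),
    ("T2", "starts"),
    ("T2", "A", 560, 570),
    ("T2", "E", 80, 480),
    ("T2", "D", 60, 530),
    ("T2", "commits"),
]

def recover(db, log):
    """Undo uncommitted transactions, then redo committed ones."""
    committed = {rec[0] for rec in log if rec[1:] == ("commits",)}
    # Undo: scan backwards, restoring old values.
    for rec in reversed(log):
        if len(rec) == 4 and rec[0] not in committed:
            db[rec[1]] = rec[2]
    # Redo: scan forwards, reapplying new values.
    for rec in log:
        if len(rec) == 4 and rec[0] in committed:
            db[rec[1]] = rec[3]
    return db

# Crash after record 8; assume B and the new A reached disk but C did not.
on_disk = {"A": 570, "B": 400, "C": 5, "D": 60, "E": 80}
state = recover(dict(on_disk), LOG[:8])
print(state)
```

Running `recover` a second time on its own output leaves the state unchanged, which is the property needed if the system crashes again during recovery.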

Incremental Log with Deferred Updates • Each transaction Ti, when it starts, registers itself on the log by writing <Ti, starts> • Whenever Ti executes write(X), the record <Ti, X, new-value> is written on the log • When Ti reaches its last statement, the record <Ti, commits> is added to the log • Before Ti is committed, all its corresponding log records must be in stable storage • Use the log records to perform the actual updates to the database after the commit • When the system crashes, we only need to redo transactions for which the log has both start and commit operations
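
With deferred updates, recovery reduces to a single forward redo pass, since uncommitted transactions never touched the database. The sketch below assumes three-field write records `(Ti, X, new-value)` and reuses values from the T1/T2 example; `recover_deferred` is a made-up name.

```python
def recover_deferred(db, log):
    """Redo the writes of committed transactions; ignore all others."""
    committed = {rec[0] for rec in log if rec[1:] == ("commits",)}
    for rec in log:
        if len(rec) == 3 and rec[0] in committed:
            tid, item, new_value = rec
            db[item] = new_value
    return db

log = [
    ("T1", "starts"),
    ("T1", "B", 400),
    ("T1", "C", 10),
    ("T1", "A", 560),
    ("T1", "commits"),
    ("T2", "starts"),
    ("T2", "A", 570),
    ("T2", "E", 480),   # crash here: T2 never committed
]
db = {"A": 100, "B": 300, "C": 5, "D": 60, "E": 80}
recovered = recover_deferred(dict(db), log)
print(recovered)
```

No old values are logged because there is never anything to undo; the cost is that updates reach the database only after the commit.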

Checkpointing • During execution, in addition to the activities of the previous method, periodically perform checkpointing • force log buffers on log • force database buffers on database • force “checkpoint record” on log • During recovery • undo all transactions that have not been committed • redo all transactions that have been committed after a checkpoint

Checkpoint Example • T1 okay • T2 and T3 redone • T4 undone • [Figure: timeline showing T1, T2, T3 and T4 relative to the checkpoint and the system failure.]
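
The classification in this example follows mechanically from the position of each commit record relative to the last checkpoint record. The log below is an assumed one, chosen to reproduce the slide's outcome; the exact start and commit times of T1 to T4 are not given in the text, and `classify` is a hypothetical helper.

```python
def classify(log):
    """Map each transaction to 'okay', 'redo' or 'undo' after a failure."""
    checkpoints = [i for i, rec in enumerate(log) if rec == ("checkpoint",)]
    last_checkpoint = checkpoints[-1] if checkpoints else -1
    commit_at = {rec[0]: i for i, rec in enumerate(log)
                 if rec[1:] == ("commits",)}
    actions = {}
    for rec in log:
        if rec[1:] != ("starts",):
            continue
        tid = rec[0]
        if tid not in commit_at:
            actions[tid] = "undo"     # never committed
        elif commit_at[tid] < last_checkpoint:
            actions[tid] = "okay"     # already forced to the database
        else:
            actions[tid] = "redo"     # committed after the checkpoint
    return actions

log = [
    ("T1", "starts"), ("T2", "starts"), ("T1", "commits"),
    ("checkpoint",),
    ("T3", "starts"), ("T2", "commits"), ("T4", "starts"), ("T3", "commits"),
    # system failure here
]
actions = classify(log)
print(actions)
```

The checkpoint bounds the redo work: only transactions that committed after it need to be replayed.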

Other Types of Failure • Transaction failure • undo transaction • message to user - transaction bad • Disk failure • restore the database from backup copy, rerun transactions

Concurrency Control • Concurrency control is needed to handle problems that can occur when concurrent transactions execute • Lost Update: an update to a data item by some transaction is overwritten by another interleaved transaction without knowledge of the initial update • Temporary Update (Dirty Read): a transaction reads a data item updated by another transaction that later fails • Incorrect Summary: a transaction calculating an aggregate function uses some but not all updated data items of another transaction

Lost Update Example (time runs downward)
T1                  T2
read(X)
X := X - N
                    read(X)
                    X := X + M
write(X)
read(Y)
                    write(X)
Y := Y + N
write(Y)
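
Replaying this interleaving step by step shows the update being lost. The sketch assumes T2's read(X) happens before T1's write(X), as in the standard form of the example, and the concrete values for X, Y, N and M are made up.

```python
def lost_update(x, y, n, m):
    """Replay the interleaved schedule and return the final database."""
    db = {"X": x, "Y": y}
    t1_x = db["X"]        # T1: read(X)
    t1_x -= n             # T1: X := X - N
    t2_x = db["X"]        # T2: read(X), still sees the old value
    t2_x += m             # T2: X := X + M
    db["X"] = t1_x        # T1: write(X)
    t1_y = db["Y"]        # T1: read(Y)
    db["X"] = t2_x        # T2: write(X), overwrites T1's update
    t1_y += n             # T1: Y := Y + N
    db["Y"] = t1_y        # T1: write(Y)
    return db

final = lost_update(x=80, y=20, n=5, m=4)
print(final)
```

Any serial execution would leave X = 80 - 5 + 4 = 79; the interleaved schedule leaves X = 84 because T1's subtraction is overwritten.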

Temporary Update Example (time runs downward)
T1                  T2
read(X)
X := X - N
write(X)
                    read(X)
                    X := X + M
                    write(X)
read(Y)

Incorrect Summary Example (time runs downward)
T1                  T2
read(X)
X := X - N
write(X)
                    read(X)
                    sum := sum + X
                    read(Y)
                    sum := sum + Y
read(Y)
Y := Y + N
write(Y)

Locking • The concurrency problems above can be solved by a concurrency control technique called locking • When a transaction needs to be sure that some object will not have its value changed, it acquires a lock on that object • Locks may be either exclusive locks (X locks) or shared locks (S locks)
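
The compatibility rule for S and X locks (many readers or one writer) can be sketched with a small lock table. `LockManager` and its methods are hypothetical; a request that cannot be granted simply returns False here, whereas a real lock manager would queue the transaction.

```python
class LockManager:
    """Grants shared (S) and exclusive (X) locks on named items."""

    def __init__(self):
        self.holders = {}   # item -> {transaction id: "S" or "X"}

    def acquire(self, tid, item, mode):
        """Return True if the lock is granted, False if tid must wait."""
        others = {t: m for t, m in self.holders.get(item, {}).items()
                  if t != tid}
        if mode == "S":
            granted = all(m == "S" for m in others.values())
        else:
            granted = not others        # X is compatible with nothing
        if granted:
            held = self.holders.setdefault(item, {})
            if held.get(tid) != "X":    # never downgrade an X lock
                held[tid] = mode
        return granted

    def release_all(self, tid):
        for held in self.holders.values():
            held.pop(tid, None)

lm = LockManager()
r1 = lm.acquire("T1", "X", "S")   # shared lock granted
r2 = lm.acquire("T2", "X", "S")   # shared locks are compatible
r3 = lm.acquire("T2", "X", "X")   # refused: T1 still holds S
lm.release_all("T1")
r4 = lm.acquire("T2", "X", "X")   # upgrade succeeds once T1 is gone
r5 = lm.acquire("T1", "X", "S")   # refused: T2 holds X
print(r1, r2, r3, r4, r5)
```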

Deadlock • The use of locks solves the concurrency problems seen earlier, but it may lead to the problem of deadlock. This can be resolved in one of two ways (for centralized systems): • The system can maintain a Wait-For Graph (the graph of who is waiting for whom). When there is a cycle in this graph, a state of deadlock exists and one of the transactions in the cycle must be rolled back • A simpler solution is to use a timeout mechanism to roll back transactions that have been inactive for a certain period of time
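
Detecting deadlock in a Wait-For Graph amounts to finding a cycle. The sketch below uses a depth-first search over a dict that maps each transaction to the set of transactions it waits for; the representation and the name `find_deadlock` are assumptions for illustration.

```python
def find_deadlock(wait_for):
    """Return a list of transactions forming a cycle, or None."""
    visited = set()

    def dfs(node, path):
        if node in path:
            return path[path.index(node):]   # cycle closes here
        if node in visited:
            return None                      # fully explored earlier
        visited.add(node)
        for nxt in sorted(wait_for.get(node, ())):
            cycle = dfs(nxt, path + [node])
            if cycle:
                return cycle
        return None

    for start in sorted(wait_for):
        cycle = dfs(start, [])
        if cycle:
            return cycle
    return None

deadlocked = find_deadlock({"T1": {"T2"}, "T2": {"T3"}, "T3": {"T1"}})
clean = find_deadlock({"T1": {"T2"}, "T2": {"T3"}})
print(deadlocked, clean)
```

Any transaction in the returned cycle is a candidate victim; rolling it back removes its outgoing and incoming edges and breaks the cycle.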
