Prof. Ravi K. Iyer Center for Reliable and High-Performance Computing

Design of High Availability Systems and NetworksLecture 3Information Redundancy Part Four: Checkpointing & Recovery in Networks Prof. Ravi K. Iyer Center for Reliable and High-Performance Computing Department of Electrical and Computer Engineering and Coordinated Science Laboratory University of Illinois at Urbana-Champaign iyer@crhc.uiuc.edu http://www.crhc.uiuc.edu/DEPEND

Recovery in Networked Systems • Processes cooperate by exchanging information to accomplish a task • Message passing (distributed systems) • Shared memory (e.g., multiprocessor systems) • Rollback of one process may require that other processes also roll back to an earlier state. • All cooperating processes need to establish recovery points. • Rolling back processes in concurrent systems is more difficult than for a single process due to • Domino effect • Lost messages • Livelocks

Networked/Distributed Systems: Local State • For a site (computer, process) Si, its local state L Si, at a given time is defined by the local context of the distributed application. Let’s denote: send(mij)- send event of a message mij by Si to Sj rec(mij)- receive event of message mij by site Sj time(x) - time in which state x was recorded • We say that send(mij)  LSi iff time(send(mij)) < time(LSi) rec(mij)  LSj iff time(rec(mij)) < time(LSj) • Two sets of messages are defined for sites Si and Sj • Transit transit(LSi, LSj) = {mij | send(mij)  LSi  rec(mij)  LSj} • Inconsistent inconsistent (LSi, LSj) = {mij | send(mij) LSi rec(mij)  LSj}

Networked/Distributed Systems: Global State • A global state GS is a collection of the local states of its sites, i.e., GS = {LS1, LS2, …, LSn}, where n is the number of sites in the system. • Consistent global state: A global state GS is consistent iff i, j : 1  i, j  n :: inconsistent(LSi, LSj) =  • Transitless global state: A global state GS transitless iff i, j : 1  i, j  n :: transit(LSi, LSj) =  • Strongly consistent global state: A global state that is both consistent and transitless For every received message a corresponding send event is recorded in the global state All communication channels are empty

Networked/Distributed SystemsLocal/Global State - Examples The global states: • GS1 = {LS11, LS21, LS31} is a strongly consistent global state. • GS2 = {LS12, LS23, LS33} is a consistent global state . • GS3 = {LS11, LS22, LS32} is an inconsistent global state. LS11 Time LS12 S1   LS21 LS23 LS22 S2    LS31 LS33 LS32 S3   

Domino Effect x3 x1 x2 X y1 y2 Y z2 z1 Z Time X, Y, Z - cooperating processes • Rollback of X does not affect other processes. • Rollback of Z requires all three processes to roll back to their very first recovery points. - recovery points

Lost Messages and Livelocks Lost Messages X x1 Time Message loss due to rollback recovery m failure y1 X Y Livelocks Livelock is a situation in which a single process can cause an infinite number of rollbacks, preventing the system from making progress x1 x1 x1 Time Time Time X X X n2 n1 n1 m1 y1 y1 m2 y1 Y Y X Y  failure 2nd roll back to restore consistent global state

Consistent Set of Checkpoints: Recovery Lines • A strongly consistent set of checkpoints (recovery line) corresponds to a strongly consistent global state. • There is one recovery point for each process in the set during the interval spanned by checkpoints, there is no information flow between any • Pair of processes in the set • Process in the set and any process outside the set • A consistent set of checkpoints corresponds to a consistent global state. Set {x1, y1, z1} is a strongly consistent set of checkpoints Set {x2, y2, z2} is a consistent set of checkpoints (assuming that messages are logged) x1 x2 X y1 y2 Y z2 z1 Z Time

Synchronous Checkpointing and Recovery (Koo & Toueg) • Assumptions • Processes communicate by exchanging messages through communication channels • Channels are FIFO • End-to-end protocols (such a sliding window) are assumed to cope with message loss due to rollback recovery and communication failures • Communication failures do not partition the network • Two types of checkpoints • Permanent - a local checkpoint at a process • Tentative - a temporary checkpoint that is made a permanent checkpoint on the successful termination of the checkpoint algorithm • A single process invokes the algorithm • There is no site failure during algorithm execution. • The checkpoint and the rollback recovery algorithms are not invoked concurrently.

Checkpoint Algorithm • Phase One • Initiating process X takes a tentative checkpoint and requests that all the processes take tentative checkpoints. • Each process informs X whether it succeeded in taking a tentative checkpoint. • If X learns that all processes have taken tentative checkpoints, X decides that all tentative checkpoints should be made permanent. • Otherwise, X decides that all tentative checkpoints should be discarded. • Phase Two • X propagates its decision to all processes. • On receiving the message from X, all processes act accordingly.

Checkpoint Algorithm (cont.) • Optimization of the checkpoint algorithm • A minimal number of processes take checkpoints • Processes use a labeling scheme to decide whether to take a checkpoint. • All processes from which X has received messages after it has taken its last checkpoint take a checkpoint to record the sending of those messages Tentative checkpoint x1 x2 Time X Messages to take a checkpoint m y1 y2 Y z2 z1 Z

Rollback Recovery Algorithm • Phase One: • Process Pi checks whether all processes are willing to restart from their previous checkpoints. • A process may reply “no” if it is already participating in a checkpointing or recovering process initiated by some other process. • If all processes are willing to restart from their previous checkpoints, Pi decides that they should restart. • Otherwise, Pi decides that all the processes continue with their normal activities. • Phase Two: • Pi propagates its decision to all processes. • On receiving Pi’s decision, the processes act accordingly.

Rollback Recovery Algorithm (cont.) • Optimization • A minimum number of processes roll back • Processes use a labeling scheme to decide whether they need to roll back • Y will restart from its permanent checkpoint only if X is rolling back to a state where the sending of one or more messages from X to Y is being undone x1 x2 Time X X Failure y1 y2 Y z2 z1 Z

Synchronous CheckpointingDisadvantages • Additional messages must be exchanged to coordinate checkpointing. • Synchronization delays are introduced during normal operations. • No computational messages can be sent while the checkpointing algorithm is in progress. • If failure rarely occurs between successive checkpoints, then the checkpoint algorithm places an unnecessary extra load on the system, which can significantly affect performance.

Asynchronous Checkpointing and Recovery • Checkpoints at each process are taken independently without any synchronization among the processors. • There is no guarantee that a set of local checkpoints taken will be a consistent set of checkpoints. • The recovery algorithm must search for the most recent consistent set of checkpoints before it initiates recovery. Most recent consistent recovery line Inconsistent recovery line x2 x3 x1 X y2 y3 y1 Y z2 z1 Z Time

Asynchronous Checkpointing and Recovery (cont.) • All incoming messages are logged at each process. • This minimizes the amount of computation to undo during a rollback. • The messages received after setting the recovery point can be processed again. • Message logging • Pessimistic: An incoming message is logged before it is processed • This slows down the computation, even when there are no failures. • Optimistic: Processors continue to perform the computation, and the message received are stored in volatile storage and logged at certain intervals. • Messages that are not logged (stored on stable storage) can be lost in the event of rollback. • This does not slow down the underlying computation.

A Scheme for Asynchronous Checkpoint and Recovery • Two types of log storage volatile logandstable log. • Contents of the volatile log are periodically flushed to the stable storage and cleared. • Checkpoints are committed when all the messages since the last checkpoint have been logged to the stable log (usually a disk). • Each process keeps information on (the message id and the sender name) all messages (not logged to a disk) in a dependency vector. • A process attaches (piggybacks) its dependency vector to messages sent to other processes. • Once a process has logged a message to a disk it broadcasts that result to all processes, which can accordingly update their local dependency vectors.

A Scheme for Asynchronous Checkpoint and Recovery (cont.) • On a failure, the recovery manager in OS examines the stable log to determine the last message stored by the failing process. • The recovery manager notifies all processes that all higher-number messages have been lost. • All processes that received the message will roll back to their last committed checkpoints. Optimization for single fault-tolerant systems • A process sending a message logs the message by storing it only in its own volatile log. • When a receiver fails it can recover and obtain the message again from the sender’s. • When a sender fails, since the receiver is OK, it does not need the log.

Asynchronous Checkpoint and Recovery AlgorithmAn Example • Communication channels are reliable. • Messages are delivered in the order in which they were sent. • Each process keeps track of the number of messages that were • Sent to other processes • Received from other processes • A process, upon restarting (after failure) broadcasts a message that it had failed. • All processes determine orphan messages by comparing the numbers of messages sent and received. • The process rolls back to a state where the number of messages received (at the process) is not greater than the number of messages sent (according to the state at other processes).

Asynchronous Checkpoint and Recovery AlgorithmAn Example (cont.) ex1 ex0 ex2 O X ey0 ey1 ey3 ey2 O Y ez0 O Z ez3 ez1 ez2 Time • If Y rolls back to a state ey1, then • Y has sent only one message to X • X has received two messages from Y thus far • X must roll back to a state preceding ex1 (to be consistent with Y’s state) • For similar reasons, Z must also roll back

Checkpointing in Distributed Database Systems (DDBS) • In a DDBS a set of data objects is partitioned among several sites. • Checkpoints should be taken with minimal interference with normal operations. • Sites take local checkpoints recording the state of the local database. • The checkpoints are consistent. • A consistent checkpointing requires • That the updates of a transaction (the basic unit of user activity, which may be carried at many different sites) should be included in all the checkpoints or not at all • Synchronization among all the sites • Transactions may have to be blocked while checkpointing is in progress thereby interfering with normal operations.

Checkpointing in DDBS Issues • How the sites agree, upon updates, on what transactions are to be included in their checkpoints. • How each site can take a local checkpoint in a nonblocking fashion.

Checkpointing in a DDBS Assumptions • The basic unit of user activity is a transaction. • Transactions follow some concurrency control protocol (i.e., a data base system maintains the database consistency). • No two transactions have the same timestamp. • Lamport’s logical clocks are used to assign a timestamp to each transaction. • Site failures are detectable either by network protocols or by timeout mechanisms. • Network partitioning never occurs. • The checkpoint algorithm is initiated by a special process - checkpoint coordinator (CC). • CC takes a consistent checkpoint with the help processes called checkpoint subordinates (CS) running at every site.

Checkpointing in a DDBSThe Algorithm Phase One • At the checkpoint coordinator (CC) site • CC broadcast a Checkpoint_Request with local timestamp LCcc • (Local Checkpoint Number) LCPNcc:= LCcc • CC waits for replies (LCPNs, local checkpoint numbers) from all subordinate sites • At all the checkpoint subordinates (CS) sites • A site m updates local clock LCm:= MAX(LCm, LCcc + 1) • LCPNm:= LCm • A site m send LCPNm to CC • A site m marks all transactions with timestamp not greater than LCPNm as before checkpoint transactions (BCPTs) and the rest of the transactions as temporary - after checkpoint transactions (ACPTs)

Checkpointing in a DDBSThe Algorithm (cont.) Note • All updates by ACPTs are stored in the buffers of ACPTs. • If an ACPT commits, the data objects updated are maintained as committed temporary versions (CTVs). • If another transaction wishes to use an object for which a CVT exists • For read - the data stored in the CTV is returned. • For write (updates) - another version of the object is created. Phase Two • At the checkpoint coordinator site: • Once all replies have been received, the coordinator broadcasts the global checkpoint number (GCPN) GCPN := MAX(LCPN1, LCPN2, …., LCPNn)

Checkpointing in a DDBSThe Algorithm (cont.) • At all the checkpoint subordinates sites: • A site m marks all temporary ACPTs with the timestamp not greater than GCPN as BCPT. • The updates of newly converted BCPTs are also included in the checkpoint. • The updates due to remaining ACPTs will be flushed to the database after the current checkpoint is completed. • If the site m receives a new “initiate transaction” message for a transaction with the timestamp not greater than GCPN and if all BCPTs have been identified then • Site m rejects the “initiate transaction” message. • When all the BCPTs terminate, site m takes a local checkpoint by saving the state of the data objects. • When the local checkpoint is taken, the database is updated with the committed temporary versions and the committed temporary versions are deleted.

DHCP Server ARMOR Compound DHCP Client Manager El El El El El El DHCP server Daemon Daemon DHCP Server ARMOR Micro-Checkpointing - Checkpointing of Multithreaded Processes (DHCP example) • To completely recover ARMOR process we need: • State of all elements • State of active threads (i.e, threads processing messages) • In the initial implementation, DHCP constitutes a single element within the ARMOR. • The entire state of DHCP is saved to the disk after completion of each request. • significant overhead of 140%. • Decomposition of DHCP into several functional elements to reduce the overhead.

E2 E1 E2 E3 E4 OP_A OP_B OP_C OP_B OP_C OP_B OP_C OP_C OP_C operations payload fields Processing within a Thread • Each incoming message processed in its own thread. • Advantages of the element-compound structure when checkpointing the ARMOR state are: • Elements can only access private data (and payload fields in a message). • State changes are only made during operation processing

E2 E1 E2 E3 E4 OP_A OP_B OP_C OP_B OP_C OP_B OP_C OP_C OP_C checkpoint buffer: Disk 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 Concept of Micro-checkpointing • A single checkpoint buffer is maintained per multithreaded ARMOR process. • The element state is checkpointed after each operation. • Checkpoints are committed to stable storage after processing a message. • The is no need to do process-wide checkpoints of stacks, heap, etc. • The existing locking policy of element data prevents the need to suspend all threads. • Overhead is reduced in comparison with process-wide checkpointing.

IRIX Operating System (SGI) Checkpoint and Restart • Facility for saving running process(es) and, at some other time, restarting the saved process(es) from the point already reached, without starting all over again. • A checkpoint image is saved in a set of disk files and can comprise • A set of processes (one or more), e.g., $ cpr -c ckptSep7 -p 1234 where cpr-c is the checkpoint command, ckptSep7 is the statefile name, -p option allows to specify a process ID • All processes in the process group (a set of processes that constitute a logical job) • All processes in a process session (a set of processes started from the same physical or logical terminal) • All processes in an IRIX array session (a set of related processes running on different nodes in an array) • The array service daemon supports chackpointing across the nodes. • To restart a set of processes the cpr command is used with the option -r $ cpr -r ckptSep7 • If the restart involves more than one process, all restarts must succeed before any process can run; otherwise all restarts fail.

IRIX Operating System (SGI) Checkpoint and RestartCheckpointable & Non-Checkpointable Objects • Checkpointable objects(objects that are checkpoint safe) • Process set ID • User memory (data, text, stack) • Kernel execution state ( e.g., signal mask, scheduling information, current and root directory) • System calls • Undelivered and queued signals • List of open files and devices • Pipeline setup and shared memory • Non-Checkpointable objects (objects that are not checkpoint safe) • Network sockets connections • X terminals and X11 client sessions • Graphic state • File pointers to mounted CD-ROM(s)

IRIX Operating System (SGI) Checkpoint and RestartApplication Handling of Non-Checkpointable Objects • To handle non-checkpoinable objects (e.g., network sockets, file pointers to mounted CD-ROM(s)), an application needs to: • Add an event handler to catch signals SIGCKPT & SIGRESTART • Run signal handlers to disconnect any open socket (or close open cdFiles and unmount the CD-ROM) before checkpoint and reconnect the socket (or mount the CD-ROM and reopen the cdFiles) after restart. • Two functions are provided for applications to add cpr event handlers: • atcheckpoint(my_cpt_handler())adds the application’s checkpoint handling function to the list of functions that get called upon receipt of SIGCKPT • atrestart(my_callback()) registers the application’s callback function for executing upon receipt of SIGRESTART.

Prof. Ravi K. Iyer Center for Reliable and High-Performance Computing