Design of Reliable Systems and Networks ECE 442 Checkpointing & Recovery (III)

Ravi K. Iyer
Center for Reliable and High-Performance Computing
Department of Electrical and Computer Engineering and Coordinated Science Laboratory
University of Illinois at Urbana-Champaign
iyer@crhc.uiuc.edu
http://www.crhc.uiuc.edu/DEPEND

Outline
• Asynchronous checkpointing and recovery
• Examples:
  • Checkpointing in distributed database systems
  • Micro-checkpointing: checkpointing of multithreaded processes
  • Checkpoint and restart in the IRIX operating system (SGI)

Asynchronous Checkpointing and Recovery
• Checkpoints at each process are taken independently, without any synchronization among the processes.
• There is no guarantee that a set of local checkpoints will form a consistent set of checkpoints.
• The recovery algorithm must search for the most recent consistent set of checkpoints before it initiates recovery.
[Figure: timelines of processes X, Y, and Z with checkpoints x1 to x3, y1 to y3, and z1 to z2, showing an inconsistent recovery line and the most recent consistent recovery line.]

Asynchronous Checkpointing and Recovery (cont.)
• All incoming messages are logged at each process.
  • This minimizes the amount of computation to undo during a rollback.
  • The messages received after setting the recovery point can be processed again.
• Message logging
  • Pessimistic: an incoming message is logged before it is processed.
    • This slows down the computation, even when there are no failures.
  • Optimistic: processes continue to perform the computation; received messages are stored in volatile storage and logged at certain intervals.
    • Messages that are not yet logged (stored on stable storage) can be lost in the event of a rollback.
    • This does not slow down the underlying computation.
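The difference between the two logging modes can be sketched in a few lines of Python. This is only a toy model, not an implementation from the lecture: the class name `LoggingProcess`, the `flush_every` interval, and the use of plain lists for stable and volatile storage are all assumptions made for illustration.

```python
class LoggingProcess:
    """Toy receiver that logs incoming messages pessimistically or optimistically."""

    def __init__(self, pessimistic, flush_every=3):
        self.pessimistic = pessimistic
        self.flush_every = flush_every  # hypothetical logging interval (optimistic mode)
        self.stable_log = []    # stands in for stable storage (survives a crash)
        self.volatile_buf = []  # stands in for volatile memory (lost on a crash)
        self.processed = []

    def receive(self, msg):
        if self.pessimistic:
            self.stable_log.append(msg)    # log first, then process
        else:
            self.volatile_buf.append(msg)  # buffer now, log later
            if len(self.volatile_buf) >= self.flush_every:
                self.flush()
        self.processed.append(msg)

    def flush(self):
        """Move buffered messages to the stable log."""
        self.stable_log.extend(self.volatile_buf)
        self.volatile_buf.clear()

    def crash(self):
        """Lose volatile state; return (replayable messages, lost messages)."""
        lost = list(self.volatile_buf)
        self.volatile_buf.clear()
        self.processed.clear()
        return list(self.stable_log), lost


pess = LoggingProcess(pessimistic=True)
opt = LoggingProcess(pessimistic=False, flush_every=3)
for m in ["m1", "m2", "m3", "m4", "m5"]:
    pess.receive(m)
    opt.receive(m)

pess_replay, pess_lost = pess.crash()
opt_replay, opt_lost = opt.crash()
print(pess_replay, pess_lost)
print(opt_replay, opt_lost)
```

In this run the pessimistic process can replay all five messages, while the optimistic one loses the two messages that were still in its volatile buffer when it crashed.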

Optimistic Message Logging
• Messages are not necessarily logged before being processed.
• Unlogged messages are not available during recovery.
• States in other processes that causally depend upon lost messages are called orphan states.
• Processes that have orphan states must roll back.
• Dependencies are tracked through state intervals:
  • A process consists of a sequence of state intervals.
  • Receipt of a message starts a new state interval.
  • Outgoing messages depend upon the current state interval of the sending process.
[Figure: state intervals 85 and 86 of process X and state intervals 3 and 4 of process Y.]

Optimistic Message Logging (cont.)
• Each process keeps a dependency vector:
  • One entry per process in the system.
  • The entry for process j specifies the latest state interval of process j on which the process depends.
• The dependency vector is piggybacked on outgoing messages.
• Receivers update their own dependency vector from the piggybacked vector.
• Causal dependencies are propagated through the piggybacked vector.

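Piggybacked Dependency Vector
• Example shows the dependency vector being updated as time progresses.
• The dependency vector of Z after receipt of m3 shows that Z is dependent upon state 5 of X and state 11 of Y.
[Figure: timelines of processes X, Y, and Z with dependency vectors written as (X, Y, Z). Receipt of m1 moves X from state interval 4 to 5, with vector (5, -, -); receipt of m2 moves Y from 10 to 11, with vector (5, 11, -); receipt of m3 moves Z from 3 to 4, with vector (5, 11, 4).]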
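The vector update can be sketched as follows. This is a minimal illustration that assumes m1 arrives at X from outside the three processes, m2 is sent by X to Y, and m3 is sent by Y to Z; the slide's numbers are consistent with that reading, but the senders are an assumption. The `Process` class and its method names are hypothetical.

```python
class Process:
    """Toy process that tracks state intervals and a dependency vector."""

    def __init__(self, name, names, interval):
        self.name = name
        self.interval = interval
        # None plays the role of "-" on the slide: no known dependency.
        self.dep = {n: None for n in names}
        self.dep[name] = interval

    def send(self):
        """Return a copy of the dependency vector to piggyback on a message."""
        return dict(self.dep)

    def receive(self, piggyback=None):
        """Receipt of a message starts a new state interval and merges vectors."""
        self.interval += 1
        self.dep[self.name] = self.interval
        for n, theirs in (piggyback or {}).items():
            if n == self.name or theirs is None:
                continue
            mine = self.dep[n]
            self.dep[n] = theirs if mine is None else max(mine, theirs)


names = ["X", "Y", "Z"]
x = Process("X", names, 4)
y = Process("Y", names, 10)
z = Process("Z", names, 3)

x.receive()          # m1 (assumed to come from outside): X enters interval 5
y.receive(x.send())  # m2 (assumed X -> Y): Y enters interval 11
z.receive(y.send())  # m3 (assumed Y -> Z): Z enters interval 4
print(z.dep)
```

Z never communicates with X directly, yet its vector records a dependency on interval 5 of X because Y forwarded that entry.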

Recovery
• X fails; if X has not logged m1 to disk at the time of failure, then m1 is unrecoverable.
• There is no guarantee that state 5 of X can be recreated exactly as before.
• All states dependent on state 5 of X are orphan states.
• When X recovers, it broadcasts to the other processes that it can recreate its state up to state 4.
• The other processes check their dependency vectors and roll back if they depend on a state interval of X greater than 4.
[Figure: the three timelines and dependency vectors of the previous example, (5, -, -) at X, (5, 11, -) at Y, and (5, 11, 4) at Z, with a failure marked on X.]
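The check that each surviving process performs after X's broadcast can be sketched as a small function. The function name `orphans` and the extra process W are hypothetical, added only to show a process that does not need to roll back.

```python
def orphans(dep_vectors, failed, recoverable_up_to):
    """Return the processes whose state depends on a lost interval of `failed`."""
    return sorted(
        p for p, vec in dep_vectors.items()
        if p != failed
        and vec.get(failed) is not None
        and vec[failed] > recoverable_up_to
    )


dep_vectors = {
    "Y": {"X": 5, "Y": 11, "Z": None},
    "Z": {"X": 5, "Y": 11, "Z": 4},
    # Hypothetical extra process that depends only on interval 4 of X.
    "W": {"X": 4, "Y": None, "Z": None, "W": 7},
}

# X announces that it can recreate its state up to interval 4.
must_roll_back = orphans(dep_vectors, failed="X", recoverable_up_to=4)
print(must_roll_back)
```

Y and Z depend on interval 5 of X and must roll back; W depends only on interval 4 and can keep its state.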

Asynchronous Checkpoint and Recovery Algorithm: An Example
• Communication channels are reliable.
• Messages are delivered in the order in which they were sent.
• Each process keeps track of the number of messages that were:
  • Sent to other processes
  • Received from other processes
• A process, upon restarting after a failure, broadcasts a message that it has failed.
• All processes determine orphan messages by comparing the numbers of messages sent and received.
• Each process rolls back to a state in which the number of messages it has received from every other process is not greater than the number of messages that process has sent to it (according to the state at the other process).

Asynchronous Checkpoint and Recovery Algorithm: An Example (cont.)
[Figure: timelines of processes X, Y, and Z with states ex0 to ex2, ey0 to ey3, and ez0 to ez3.]
• If Y rolls back to a state ey1, then:
  • Y has sent only one message to X
  • X has received two messages from Y thus far
  • X must roll back to a state preceding ex1 (to be consistent with Y’s state)
• For similar reasons, Z must also roll back
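The counting rule can be sketched as a fixed-point search. The message counts below are hypothetical and are not read off the slide's figure; they are chosen so that a rollback at Y forces X and then Z to roll back as well. The sketch also assumes that every process has an initial state 0 in which nothing has been sent or received, which guarantees that the search terminates.

```python
def find_recovery_line(history, start):
    """Roll processes back until no one has received more than was sent to it.

    history[p] is a list of saved states for process p; each state is
    {"sent": {q: count}, "rcvd": {q: count}}. start[p] is the index of the
    state that p holds when recovery begins.
    """
    current = dict(start)
    changed = True
    while changed:
        changed = False
        for p in history:
            for q in history:
                if p == q:
                    continue
                sent_by_q = history[q][current[q]]["sent"].get(p, 0)
                # Step p back while it has received more from q than q has sent.
                while history[p][current[p]]["rcvd"].get(q, 0) > sent_by_q:
                    current[p] -= 1
                    changed = True
    return current


# Hypothetical histories (state 0 is the initial, empty state).
history = {
    "X": [
        {"sent": {}, "rcvd": {}},
        {"sent": {"Y": 1}, "rcvd": {"Y": 1}},
        {"sent": {"Y": 1, "Z": 1}, "rcvd": {"Y": 2}},
    ],
    "Y": [
        {"sent": {}, "rcvd": {}},
        {"sent": {"X": 1}, "rcvd": {"X": 1}},
        {"sent": {"X": 2, "Z": 1}, "rcvd": {"X": 1}},
        {"sent": {"X": 2, "Z": 1}, "rcvd": {"X": 1, "Z": 1}},
    ],
    "Z": [
        {"sent": {}, "rcvd": {}},
        {"sent": {}, "rcvd": {"Y": 1}},
        {"sent": {"Y": 1}, "rcvd": {"Y": 1, "X": 1}},
    ],
}

# Y fails and restarts from its state 1; X and Z are at their latest states.
line = find_recovery_line(history, {"X": 2, "Y": 1, "Z": 2})
print(line)
```

With these counts X falls back to its state 1 because it had received two messages from Y, and Z falls back to its initial state because the messages it received from X and Y are no longer accounted for by their senders.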

Checkpointing in Distributed Database Systems (DDBS)
• In a DDBS, a set of data objects is partitioned among several sites.
• Checkpoints should be taken with minimal interference with normal operations.
• Sites take local checkpoints recording the state of the local database.
• It is desirable that the checkpoints are consistent.
• Consistent checkpointing requires that:
  • The state updates of a transaction (the basic unit of user activity, which may be carried out at many different sites) are included in all the checkpoints completely or not at all
  • All the sites synchronize
    • Transactions may have to be blocked while checkpointing is in progress, thereby interfering with normal operations.

Checkpointing in DDBS: Issues
• How the sites agree on which transactions' updates are to be included in their checkpoints.
• How each site can take a local checkpoint in a non-blocking fashion.

Checkpointing in a DDBS: Assumptions
• The basic unit of user activity is a transaction.
• Transactions follow some concurrency control protocol (i.e., the database system maintains database consistency).
• No two transactions have the same timestamp.
  • Lamport’s logical clocks are used to assign a timestamp to each transaction.
• Site failures are detectable either by network protocols or by timeout mechanisms.
• Network partitioning never occurs.
• The checkpoint algorithm is initiated by a special process, the checkpoint coordinator (CC).
• CC takes a consistent checkpoint with the help of processes called checkpoint subordinates (CS) running at every site.

Checkpointing in a DDBS: The Algorithm
Phase One
• At the checkpoint coordinator (CC) site:
  • CC broadcasts a Checkpoint_Request with its local timestamp LC_cc
  • Local checkpoint number: LCPN_cc := LC_cc
  • CONVERT_cc = false
  • CC waits for replies (LCPNs, local checkpoint numbers) from all subordinate sites
• At each checkpoint subordinate (CS) site m:
  • Site m updates its local clock: LC_m := MAX(LC_m, LC_cc + 1)
  • LCPN_m := LC_m
  • Site m sends LCPN_m to CC
  • CONVERT_m = false
  • Site m marks all transactions with a timestamp not greater than LCPN_m as before-checkpoint transactions (BCPTs) and the rest of the transactions as temporary after-checkpoint transactions (ACPTs)

Checkpointing in a DDBS: The Algorithm (cont.)
Note
• All updates by ACPTs are stored in the buffers of the ACPTs.
• If an ACPT commits, the data objects it updated are maintained as committed temporary versions (CTVs).
• If another transaction wishes to use an object for which a CTV exists:
  • For a read, the data stored in the CTV is returned.
  • For a write (update), another version of the object is created.

Checkpointing in a DDBS: The Algorithm (cont.)
Phase Two
• At the checkpoint coordinator site:
  • Once all replies have been received, the coordinator broadcasts the global checkpoint number (GCPN):
    GCPN := MAX(LCPN_1, LCPN_2, ..., LCPN_n)
• At each checkpoint subordinate site m:
  • Site m marks all temporary ACPTs with a timestamp not greater than GCPN as BCPTs.
  • The updates of the newly converted BCPTs are also included in the checkpoint.
  • The updates due to the remaining ACPTs will be flushed to the database after the current checkpoint is completed.
  • CONVERT_m = true; this indicates that GCPN is known and all BCPTs have been identified
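The clock and classification rules of the two phases can be sketched as follows. This is a simplified, single-process illustration with hypothetical site names, clock values, and transaction timestamps; message passing, CTVs, and the CONVERT flags are left out. Including the coordinator's own LCPN in the maximum is an assumption, and a harmless one, since every subordinate's LCPN is at least LC_cc + 1.

```python
def phase_one(lc_cc, site_clocks):
    """Each subordinate advances its clock past LC_cc and reports its LCPN."""
    lcpn = {}
    for site, lc in site_clocks.items():
        lc = max(lc, lc_cc + 1)  # LC_m := MAX(LC_m, LC_cc + 1)
        lcpn[site] = lc          # LCPN_m := LC_m
    return lcpn


def classify(timestamps, bound):
    """Split transaction timestamps into (BCPTs, ACPTs) relative to `bound`."""
    bcpt = sorted(t for t in timestamps if t <= bound)
    acpt = sorted(t for t in timestamps if t > bound)
    return bcpt, acpt


def phase_two(lcpn_cc, lcpn):
    """The coordinator computes the global checkpoint number."""
    return max([lcpn_cc, *lcpn.values()])


# Hypothetical values for illustration only.
lc_cc = 10
lcpn = phase_one(lc_cc, {"A": 7, "B": 15})
gcpn = phase_two(lc_cc, lcpn)

txns_at_a = [9, 12, 14, 18]
bcpt_1, temp_acpt = classify(txns_at_a, lcpn["A"])  # phase one at site A
converted, acpt = classify(temp_acpt, gcpn)         # phase two at site A
print(lcpn, gcpn, bcpt_1, converted, acpt)
```

At site A the transactions with timestamps 12 and 14 start out as temporary ACPTs and are converted to BCPTs once the larger GCPN arrives, so every site draws the line at the same timestamp.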

Checkpointing in a DDBS: The Algorithm (cont.)
• When all the BCPTs terminate, site m takes a local checkpoint by saving the state of the data objects.
• When the local checkpoint has been taken, the database is updated with the committed temporary versions, and the committed temporary versions are deleted.
Note
• If site m receives a new “initiate transaction” message for a transaction with a timestamp not greater than GCPN, and all BCPTs have already been identified, then:
  • Site m rejects the “initiate transaction” message.

Micro-checkpointing: Checkpointing of Multithreaded Processes. An Example of ARMOR State Checkpointing

ARMOR Architecture
• An ARMOR is a multithreaded process composed of replaceable, basic building blocks called elements
  • An element is a repository of replaceable functions within the ARMOR
  • A building block typically provides an elementary detection/recovery service
• An ARMOR supports:
  • A unified interface to invoke services provided by elements
  • Static and dynamic customization of services
[Figure: an ARMOR containing several elements behind a common ARMOR interface.]

Example ARMOR Configuration
[Figure: a repository of elements and an example ARMOR assembled from it behind the ARMOR interface. The elements shown are: progress indicator, HB, checkpoint, data dependency checking, text-segment signature, checksum, assertion check, control flow signature, and range-check.]

Processing Within a Thread
• Each incoming message is processed in its own thread.
• Elements can only access private data (and payload fields in a message).
• State changes are made only during operation processing.
[Figure: a message with operations (OP_A, OP_B, OP_C) and payload fields being processed by elements E1 to E4.]

Concept of Micro-checkpointing
• A single checkpoint buffer is maintained per multithreaded ARMOR process.
• The element state is checkpointed after each operation.
• Checkpoints are committed to stable storage after processing a message.
• There is no need to do process-wide checkpoints of stacks, heap, etc.
• The existing locking policy for element data removes the need to suspend all threads.
• Overhead is reduced in comparison with process-wide checkpointing.
[Figure: the message with operations OP_A, OP_B, and OP_C processed by elements E1 to E4; after each operation the element state is copied into a checkpoint buffer with slots 1 to 4, which is then written to disk.]
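The idea can be sketched in a few lines. This is a single-threaded toy model and not the ARMOR implementation: the `Element` and `Armor` classes, the shape of the element state, and the dictionary standing in for disk are all assumptions, and the threading and locking mentioned above are omitted.

```python
import copy


class Element:
    """Toy element whose private state changes only during operations."""

    def __init__(self, name):
        self.name = name
        self.state = {"count": 0}

    def operate(self, op):
        self.state["count"] += 1
        self.state["last_op"] = op


class Armor:
    """Toy process with one checkpoint buffer shared by all elements."""

    def __init__(self, elements):
        self.elements = {e.name: e for e in elements}
        self.checkpoint_buffer = {}  # one buffer per process
        self.stable_storage = {}     # stands in for disk

    def process_message(self, operations):
        for elem_name, op in operations:
            elem = self.elements[elem_name]
            elem.operate(op)
            # Micro-checkpoint: copy only this element's state after the operation.
            self.checkpoint_buffer[elem_name] = copy.deepcopy(elem.state)
        # Commit the buffer to stable storage once the message is processed.
        self.stable_storage = copy.deepcopy(self.checkpoint_buffer)


armor = Armor([Element("E1"), Element("E2"), Element("E3")])
armor.process_message([("E1", "OP_A"), ("E2", "OP_B"), ("E2", "OP_C")])
print(armor.stable_storage)
```

Only the elements that actually ran an operation appear in the committed checkpoint, which is where the saving over a process-wide snapshot of stack and heap comes from.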

IRIX Operating System (SGI): Checkpoint and Restart
• A facility for saving running process(es) and, at some other time, restarting the saved process(es) from the point already reached, without starting all over again.
• A checkpoint image is saved in a set of disk files and can comprise:
  • A set of processes (one or more), e.g.,
    $ cpr -c ckptSep7 -p 1234
    where cpr -c is the checkpoint command, ckptSep7 is the statefile name, and the -p option specifies a process ID
  • All processes in a process group (a set of processes that constitute a logical job)
  • All processes in a process session (a set of processes started from the same physical or logical terminal)
  • All processes in an IRIX array session (a set of related processes running on different nodes in an array)
    • The array service daemon supports checkpointing across the nodes.
• To restart a set of processes, the cpr command is used with the -r option:
  $ cpr -r ckptSep7
• If the restart involves more than one process, all restarts must succeed before any process can run; otherwise all restarts fail.

IRIX Operating System (SGI): Checkpointable & Non-Checkpointable Objects
• Checkpointable objects (objects that are checkpoint safe):
  • Process set ID
  • User memory (data, text, stack)
  • Kernel execution state (e.g., signal mask, scheduling information, current and root directory)
  • System calls
  • Undelivered and queued signals
  • List of open files and devices
  • Pipeline setup and shared memory
• Non-checkpointable objects (objects that are not checkpoint safe):
  • Network socket connections
  • X terminals and X11 client sessions
  • Graphics state
  • File pointers to mounted CD-ROM(s)

IRIX Operating System (SGI): Application Handling of Non-Checkpointable Objects
• To handle non-checkpointable objects (e.g., network sockets, file pointers to mounted CD-ROM(s)), an application needs to:
  • Add an event handler to catch the signals SIGCKPT and SIGRESTART
  • Run signal handlers that disconnect any open socket (or close open cdFiles and unmount the CD-ROM) before the checkpoint, and reconnect the socket (or mount the CD-ROM and reopen the cdFiles) after the restart.
• Two functions are provided for applications to add cpr event handlers:
  • atcheckpoint(my_cpt_handler()) adds the application’s checkpoint handling function to the list of functions that get called upon receipt of SIGCKPT
  • atrestart(my_callback()) registers the application’s callback function to be executed upon receipt of SIGRESTART.
