Autonomic distributed systems

Autonomic distributed systems

Think about this computer population x109 Human population 7 6 5 4 1980 1990 2000 2010 2

Think about this Machines will fail from time to time, regardless of how carefully they are designed. But who will manage these systems? Even if everyone joins IT, it is not enough! Isn’t this a crisis? Systems have to take care of themselves. Self-help is the best help. 3

What does it mean? Self-stabilizing Self-optimizing Self-healing Self-help Self-protecting Self-organizing Self-managing These are many such desirable self-- properties that be added to the Wish list. These properties collectively called self-* properties characterize an Autonomic System. 4

Self-healing The Spirit Mars rover has a radiation-hardened R6000 CPU from Lockheed-Martin Federal Systems. One day, while performing a crucial task, Spirit Mars Rover fell silent, alone on the emptiness of Mars. What next? Courtesy: Jet Propulsion Lab 5

Self-healing The problem was eventually remotely detected by ground control. The operating system tried to allocate more files than the RAM-based directory structure could accommodate. It caused an exception that suspended the task that attempted the allocation. NASA ground control deleted some files, and reformatted the entire flash memory system. On February 6, 2004 the rover was restored to its original working condition, and science activities resumed. It would have been nice if the detection and repair could be done by the rover itself … Courtesy: Jet Propulsion Lab 6

Self-stabilization • Technique for spontaneous restoration of a system predicate. • Forward error recovery (memoryless) -- does not bother about the impact of the failure as long as the recovery is guaranteed. • Guarantees eventual safety following failures. Feasibility demonstrated by Dijkstra (CACM 1974)

Self-stabilizing systems Starting from any initial configuration, the system is guaranteed to recover to a legitimate configuration (L is true) in a bounded number of steps, as long as the codes are not corrupted.

Self-stabilizing systems Transient failures perturb the global state. The ability to spontaneously recover from any initial state implies that no initialization is ever required. State space legal

Self-stabilizing systems Self-stabilizing systems exhibits non-masking fault-tolerance. It satisfies the following two criteria fault • Convergence • Closure Not L L convergence closure

Adaptive Distributed Systems System behavior spontaneously changes when the environment changes AM  L AM holds PM  L PM holds A traffic control system AM / PM L = (AM  L AM )  (PM  L PM ) defines the system invariant

Example 1: Stabilizing mutual exclusion N-1 0 1 7 2 6 3 4 5 Consider a unidirectional ring of processes. In the legal configuration, exactly one token will circulate in the network

A solution 0 1 2 3 4 n 0 The state of process j is x[j]  {0, 1, 2, K-1}, and N > K {Process 0} repeat x[0] = x[N-1]  x[0] := x[0] N 1 forever {Process j > 0} repeat x[j] ≠ x[j -1]  x[j] := x[j-1] forever Guard or condition action TOKEN = ENABLED GUARD

Does it work? First, be convinced that it works. Then think about why it will work.

Example 2: Stabilizing spanning tree • Given a connected graph G = (V,E) and a root r, design an algorithm for maintaining a spanning tree in presence of transient failures that may corrupt the local states of processes. • Let n = |V|

A solution Each process i has two variables L(i) and P(i): L(i) = Distance from the root via tree edges P(i) = parent of process i By definition L(r) = 0, and P(r) is undefined. In a legal state i  V | i ≠ r : L(i) ≠ n  L(i) = L(P(i)) +1.

Sample case 0 0 1 1 1 P(2) is corrupted 2 4 2 2 4 5 3 4 3 5 3 5

The algorithm The algorithm has three rules R0, R1, R2: (R0) (L(i) ≠ n)  (L(i) ≠ L(P(i)) +1)  (L(P) ≠ n)  L(i) :=L(P(i)) +1 (R1)(L(i)  n)  (L(P(i)) =n)  L(i):=n (R2) (L(i) =n)  (k  Neighbors(i):L(k) < n-1)  L(i) :=L(k)+1; P(i):=k

Proof of stabilization Define an edge from i to P(i) to be well-formed, when L(i) ≠ n, L(P(i) ≠ n and L(i) = L(P(i)) +1. In any configuration, the well-formed edges form a spanning forest. Delete all edges that are not well-formed. Designate each tree T(k) in the forest by the lowest value of L in it.

Example In the sample graph shown earlier.T(0) = {0, 1, T(2) = {2, 3, 4, 5} Let F(k) denote the number of T(k)’s in the forest. Define a tuple F= (F(0), F(1), F(2) …, F(n)). For the sample graph, F = (1, 0, 1, 0, 0, 0) after node 2 had the transient failure that changed P(2) from 2 to 4.

Skeleton of the proof Minimum F = (1,0,0,0,0,0) {legal configuration} Maximum F = (1, n-1, 0, 0, 0, 0). With each action, F decreases lexicographically. Verify the claim! This proves that eventually F becomes (1,0,0,0,0,0) and the spanning tree stabilizes.

Autonomic distributed systems