Developments in self-stabilizing autonomic systems for robust and stable software, focusing on automatic recovery and monitoring layers. Research activities, computational models, and implementation approaches discussed.
Self-Stabilizing Autonomic Recoverer for Eventual Byzantine Software Olga Brukman, BGU Shlomi Dolev, BGU Elliot K. Kolodner, IBM
Software Contains Bugs • Heisenbugs, corrupt states, leaked resources are common… • Correct and faultless SW is hard to produce • Long-lived running programs, e.g., OS • Software is usually tested only from its initial state and over limited-time scenarios.
Fault Model Reflecting Reality • Software packages can be trusted to work as required after restart. • Eventual Byzantine software. • System administrators and users use reboot to deal with faults.
So Reboot… • It does work in practice! • Automatic reboot (e.g., for satellites) • Be careful not to reboot without reason • Do not reboot portions that work correctly • Make sure the automatic reboot layer works…
Current Research Interest • Automatic recovery, self-managing systems, self-healing systems, evolving systems… • Imply need for robust and stable systems instead of performance optimised systems.
Current Research Activity • ROC project, Berkeley-Stanford • Kinesthetics eXtreme, Columbia • Autonomic(holistic) computing, IBM
Related Work : ROC • Hierarchical restart • minimizing MTTR instead of maximizing MTTF. • Adding layer that monitors components and restarts them upon failure.
Related Work: ROC – Drawbacks • Limited hierarchies considered • empty graph, tree. • Monitoring by heartbeats • no monitoring of system state and progress. • Monitoring-restarting layer itself may crash.
[Figure: OS, kernel, and OMR, the self-stabilizing monitoring-restarting layer]
System’s Genericness [Figure: OS kernel and OMR, with tuples <Preds,RActs>1, <Preds,RActs>2, …, <Preds,RActs>n]
<Pred,RActs>1 <Pred,RActs>2 … Monitor-Restarter for Process
<Pred,RActs>1 <Pred,RActs>2 … Monitor-Restarter for Subsystem
Restart Actions – Mature Approach • Subsystem waits for completion of a restart of its components. • Restart action may vary, depending on component internal state. • Reschedule • Roll-back • Kill & Restart • Few restart attempts with more drastic restart actions.
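The escalation idea above (a few restart attempts, each more drastic than the last) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `recover` helper, the toy `component` dictionary, and the three action functions are all hypothetical names invented for the example.

```python
def recover(is_healthy, actions):
    """Apply restart actions from mildest to most drastic.

    Returns the name of the first action after which the health
    predicate holds, or None if every action failed.
    """
    for name, act in actions:
        act()
        if is_healthy():
            return name
    return None


# Hypothetical component state: three independent kinds of damage.
component = {"queue_stuck": True, "state_corrupt": True, "leaked_handles": 3}


def is_healthy():
    return (not component["queue_stuck"]
            and not component["state_corrupt"]
            and component["leaked_handles"] == 0)


def reschedule():
    # Mildest action: only unblocks the scheduler queue.
    component["queue_stuck"] = False


def roll_back():
    # Stronger action: restores the last saved state.
    component["state_corrupt"] = False


def kill_and_restart():
    # Most drastic action: everything returns to the initial state.
    component.update(queue_stuck=False, state_corrupt=False, leaked_handles=0)


ACTIONS = [
    ("reschedule", reschedule),
    ("roll_back", roll_back),
    ("kill_and_restart", kill_and_restart),
]

result = recover(is_healthy, ACTIONS)
print(result)
```

Ordering the actions by severity keeps the cheap fixes first, so a full kill and restart happens only when the milder actions leave the predicate violated.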
Computational Model: rsf-execution • An execution E is an rsf (restart supporting fair)-execution iff E is a fair execution in which every subsystem sub_i that is initialised during E respects its specification function ss_i. Requirement: Every rsf-execution E has a suffix in which the system respects its specification function ss.
On-line Safety Assurance. • In any execution [DS01] safety can be achieved by adding monitoring layer. <Pred,RActs>1 <Pred,RActs>2 …
On-line Liveness Assurance. • In any execution E of |S_i|+1 or more configurations there exists a sub-execution E’ = c_1, c_2, …, c_j in which state_{sub_i}(c_1) = state_{sub_i}(c_j). • If sub_i makes no progress during E’, then E’ = E_circ. • If there is an E_circ, then there is an infinite execution in which liveness does not hold. * |S_i| is the number of possible states of sub_i.
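The pigeonhole argument above suggests a simple on-line check: remember each observed state together with a progress counter, and report a circular segment when the same pair shows up twice. The sketch below assumes the monitor can read a hashable state and a non-decreasing progress counter for the subsystem; `find_circular_segment` and both input lists are hypothetical names for illustration.

```python
def find_circular_segment(states, progress):
    """Look for a sub-execution that returns to the same state without progress.

    states[k]   -- state of the subsystem in configuration c_k (hashable)
    progress[k] -- cumulative, non-decreasing progress counter at c_k

    Returns (a, b) with a < b such that states[a] == states[b] and no
    progress was made between c_a and c_b, or None if no such pair exists.
    """
    first_seen = {}
    for k, key in enumerate(zip(states, progress)):
        if key in first_seen:
            # Same state and same counter value: since the counter never
            # decreases, nothing progressed between the two configurations.
            return first_seen[key], k
        first_seen[key] = k
    return None


# A subsystem that loops b -> c -> b without progressing.
stuck = find_circular_segment(["a", "b", "c", "b"], [0, 0, 0, 0])

# The same state sequence, but progress was made before the repeat.
fine = find_circular_segment(["a", "b", "c", "b"], [0, 0, 1, 1])

print(stuck, fine)
```

With at most |S_i| distinct states, any window of |S_i|+1 configurations without progress must trigger the check, which is why a bounded history suffices.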
Tools for Autonomic Recoverer Implementation – Black Box Approach • Software package is a black box. • Package is monitored by recording its I/O (e.g., strace in Linux). • Monitors are independent of the specific implementation.
Tools for Autonomic Recoverer Implementation – Transparent Box Approach • Software package implementation tool is known. • Run-Time Reflection tools are used to monitor and restart the package. • Possible in Java, C++, CORBA, COM.
Global Predicates Distributed Concerns
Self-Stabilization of the System • OMR is self-stabilizing. • Eventually each process will have monitor. • Each monitor is self-stabilizing as well. • Eventually each process/subsystem is safe. • Corrupted history causes monitor state corruption • Restarts initialize history variables. • Eventually monitor will see correct history.
Task Example: Mutual Exclusion With Tournament Algorithm [PF77] [Figure: tournament tree with nodes 1, 2, 3 and processes p1, p2, p3, p4]
Tournament Algorithm
Procedure Node(v: integer, side: {0,1})
1: Want_v[side] := 0
2: Wait until (Want_v[1-side] = 0 or Priority_v = side)
3: Want_v[side] := 1
4: If (Priority_v = 1-side) then
5:    If (Want_v[1-side] = 1) then goto Line 1
6: Else wait until (Want_v[1-side] = 0)
7: If (v = 1)
8:    <Critical Section>
9: Else Node(⌊v/2⌋, v mod 2)
10: Priority_v := 1-side
11: Want_v[side] := 0
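For readers who want to experiment, the procedure can be transcribed into Python threads almost line for line. This is a hedged sketch rather than a reference implementation: the `run_tournament` harness, the `stats` bookkeeping, and the mapping of process i to leaf n+i are assumptions made for the example, and the busy-waits yield with `time.sleep(0)` instead of spinning. Comments give the corresponding pseudocode line numbers.

```python
import threading
import time


def run_tournament(n=4, rounds=25):
    """Run n threads (n a power of two) through the tournament tree.

    Returns (critical-section entries, max threads seen inside at once).
    """
    want = {v: [0, 0] for v in range(1, n)}   # Want_v[side] per internal node
    priority = {v: 0 for v in range(1, n)}    # Priority_v per internal node
    stats = {"inside": 0, "max_inside": 0, "entries": 0}

    def critical_section():
        stats["inside"] += 1
        stats["max_inside"] = max(stats["max_inside"], stats["inside"])
        time.sleep(0)                          # invite a context switch
        stats["entries"] += 1
        stats["inside"] -= 1

    def node(v, side):
        while True:
            want[v][side] = 0                                  # line 1
            while not (want[v][1 - side] == 0
                       or priority[v] == side):                # line 2
                time.sleep(0)
            want[v][side] = 1                                  # line 3
            if priority[v] == 1 - side:                        # line 4
                if want[v][1 - side] == 1:                     # line 5
                    continue                                   # goto line 1
            else:
                while want[v][1 - side] != 0:                  # line 6
                    time.sleep(0)
            break
        if v == 1:                                             # line 7
            critical_section()                                 # line 8
        else:
            node(v // 2, v % 2)                                # line 9
        priority[v] = 1 - side                                 # line 10
        want[v][side] = 0                                      # line 11

    def worker(i):
        leaf = n + i                           # assumed leaf numbering
        for _ in range(rounds):
            node(leaf // 2, leaf % 2)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return stats["entries"], stats["max_inside"]


if __name__ == "__main__":
    print(run_tournament())
```

The sketch starts from a clean initial state, so it exercises only the fault-free behaviour; recovering from corrupted `want` or `priority` values is the job of the monitoring layer, not of this code.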
Mutual Exclusion Task in Autonomic Recoverer Context [Figure: OS kernel, OMR, and the ME task; the tuples <Preds,RActs>1, <Preds,RActs>2, …, <Preds,RActs>n include <Preds,RActs>ME]
Processes to Monitor • Tournament process – v • Location process (phantom process) – Priority, Want [Figure: tournament tree; nodes 1, 2, 3 are location processes, p1, p2, p3, p4 are tournament processes]
Subsystems to Monitor [Figure: tournament tree with nodes 1, 2, 3 over processes p1, p2, p3, p4]
ME Recovery Tuples: Examples • If there are not exactly N tournament processes, fork tournament processes. • If there are not exactly N-1 location processes, fork location processes. • If a tournament or location process has no monitor-restarter, fork a monitor-restarter for it.
ME Recovery Tuples: Examples (Cont.) • If processes are not on their correct path to the root node, restart those processes (or their subsystems). • If more than two processes competing for location, restart them. • If there is starvation in some node, restart processes in node’s subsystem. • If process is in critical section too long, restart process. • …
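Two of the predicates above (more than two processes competing for one location, and a process staying in the critical section too long) can be evaluated over a snapshot of program counters. The sketch below is illustrative only: the snapshot layout, the `since` timestamp, and the `processes_to_restart` helper are assumptions, with the critical section taken to be line 8 at node 1 of the tournament algorithm.

```python
from collections import Counter

CRITICAL_PC = 8   # line 8 of the tournament procedure
ROOT = 1          # the critical section is entered at node 1


def processes_to_restart(snapshot, now, max_cs_time):
    """Return the pids whose monitoring predicate is violated.

    snapshot -- list of dicts with keys:
        pid   : process name
        node  : tree node the process currently competes at
        pc    : program counter (pseudocode line number)
        since : time at which the process reached this pc
    """
    restart = set()
    per_node = Counter(p["node"] for p in snapshot)
    for p in snapshot:
        # At most two processes (one per side) may compete at a node.
        if per_node[p["node"]] > 2:
            restart.add(p["pid"])
        # A process may not stay in the critical section too long.
        in_cs = p["node"] == ROOT and p["pc"] == CRITICAL_PC
        if in_cs and now - p["since"] > max_cs_time:
            restart.add(p["pid"])
    return restart


snapshot = [
    {"pid": "p1", "node": 1, "pc": 8, "since": 0},
    {"pid": "p2", "node": 1, "pc": 2, "since": 7},
    {"pid": "p3", "node": 3, "pc": 2, "since": 9},
    {"pid": "p4", "node": 3, "pc": 6, "since": 9},
]
overdue = processes_to_restart(snapshot, now=10, max_cs_time=5)
print(overdue)
```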
Recovery Tuples for ME Task Monitor
<MonitorPred, RestartAct>_{ME, N}
mp1: if |processes(TP)| ≠ N
ra1: forkProcesses(TP, N)
mp2: if ps_i in processes(TP) and no monitor(ps_i)
ra2: forkMonitorRestarter(<MonitorPred,RestartAct>_{ps_i})
mp3: if |processes(LP)| ≠ N-1
ra3: forkProcesses(LP, N-1)
mp4: if lps_i in processes(LP) and no monitor(lps_i)
ra4: forkMonitorRestarter(<MonitorPred,RestartAct>_{lps_i})
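The four tuples can be exercised against a simulated process table. The following is a sketch under stated assumptions: the table layout, the pid naming scheme, and the choice that `fork_processes` both removes surplus processes and forks missing ones are invented for the example and are not taken from the slides.

```python
def count(table, kind):
    return sum(1 for p in table.values() if p["kind"] == kind)


def unmonitored(table, kind):
    return [pid for pid, p in table.items()
            if p["kind"] == kind and not p["monitored"]]


def fork_processes(table, kind, target):
    """Bring the number of `kind` processes to exactly `target` (assumed semantics)."""
    pids = sorted(pid for pid, p in table.items() if p["kind"] == kind)
    for pid in pids[target:]:              # drop surplus processes
        del table[pid]
    k = 0
    while count(table, kind) < target:     # fork missing ones under unused names
        if f"{kind}-{k}" not in table:
            table[f"{kind}-{k}"] = {"kind": kind, "monitored": False}
        k += 1


def fork_monitor_restarters(table, kind):
    for pid in unmonitored(table, kind):
        table[pid]["monitored"] = True


def me_monitor_step(table, n):
    """Evaluate mp1..mp4 in order; run the matching restart action when one holds."""
    tuples = [
        (lambda: count(table, "TP") != n,
         lambda: fork_processes(table, "TP", n)),            # mp1 / ra1
        (lambda: bool(unmonitored(table, "TP")),
         lambda: fork_monitor_restarters(table, "TP")),      # mp2 / ra2
        (lambda: count(table, "LP") != n - 1,
         lambda: fork_processes(table, "LP", n - 1)),        # mp3 / ra3
        (lambda: bool(unmonitored(table, "LP")),
         lambda: fork_monitor_restarters(table, "LP")),      # mp4 / ra4
    ]
    fired = []
    for index, (pred, action) in enumerate(tuples, start=1):
        if pred():
            action()
            fired.append(index)
    return fired


# An arbitrary (corrupted) starting table: one TP, four LPs, one unmonitored.
table = {
    "TP-7": {"kind": "TP", "monitored": True},
    "LP-0": {"kind": "LP", "monitored": False},
    "LP-1": {"kind": "LP", "monitored": True},
    "LP-2": {"kind": "LP", "monitored": True},
    "LP-3": {"kind": "LP", "monitored": True},
}
first = me_monitor_step(table, 4)
second = me_monitor_step(table, 4)
print(first, second)
```

Starting from an arbitrary table, one pass repairs the counts and monitors, and the next pass finds nothing to fix, which mirrors the "eventually each process will have a monitor" goal on a toy scale.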
ME Task Monitor Goals (N=4) [Figure: ME task with processes p1, p2, p3, p4]
Lemma “Safety”: • Every rsf-execution E has a suffix E’ such that in every configuration c ∈ E’: ¬∃ p_i, p_j (i ≠ j): pc(p_i,1,c) = 8 ∧ pc(p_j,1,c) = 8.
Lemma “Liveness”: Every rsf-execution E has infinitely many configurations c ∈ E such that pc(p,1,c) = 8 for some process p.
Lemma “No starvation”: Every rsf-execution E has a suffix E’ such that ∀ p_k ∃ c_i ∈ E’: pc(p_k,1,c_i) = 8.
Practical Experience: Printers Problem • A corrupted pdf, doc or ps file is sent to the printing server. • The printer can’t print the file. • This causes retries by the printing server. • The printer is “stuck” on one job. • Predicate for printing server: restrict the number of retries, try format conversions, send an error message to the user.
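The printing-server policy (bounded retries, then format conversions, then an error message) can be sketched as below. Everything here is hypothetical: `handle_job`, the `try_print` callback, and the `converters` mapping stand in for whatever the real printing server exposes, and the stub converters simply pass the data through.

```python
def handle_job(data, fmt, try_print, converters, max_retries=3):
    """Print a job without letting it block the queue.

    try_print(fmt, payload) -> bool   : one print attempt (assumed interface)
    converters[target](data) -> bytes : convert the job to `target` format;
                                        may raise ValueError on corrupt input

    Returns ("printed", format_used) or ("error", message_for_user).
    """
    # Try the original format first, then each conversion target.
    candidates = [(fmt, lambda d: d)]
    candidates += [(target, conv) for target, conv in converters.items()
                   if target != fmt]
    for target, convert in candidates:
        try:
            payload = convert(data)
        except ValueError:
            continue                      # conversion failed, try the next format
        for _ in range(max_retries):      # bounded retries, never an endless loop
            if try_print(target, payload):
                return ("printed", target)
    return ("error",
            f"could not print {fmt} job after {max_retries} tries per format")


# Stub printer that only accepts PostScript, and pass-through converters.
calls = []


def only_postscript(fmt, payload):
    calls.append(fmt)
    return fmt == "ps"


converters = {"pdf": lambda d: d, "ps": lambda d: d}
outcome = handle_job(b"corrupt-doc-bytes", "doc", only_postscript, converters)
print(outcome, len(calls))
```

Because every loop is bounded, the total number of print attempts is at most `max_retries` times the number of candidate formats, so a single bad file cannot hold the printer indefinitely.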
Concluding Remarks • Theory foundations of self-stabilization and restart techniques could serve as a basis for the new paradigms. • General framework for design and correctness proof for autonomic recoverer. • Printers experience coordinated with IBM.