Crash Fault Detection in Celerating Environments

Distributed Systems

A collection of autonomous computers (processes) connected through a communication network.

• But processes can crash!
• Maintain correctness despite crashes
• Fault tolerance through crash detection
• Crash detection determined by synchronism in the system

Crash Detection and System Models

• Asynchrony: Unbounded message delay and process speeds
• Synchrony: Known bounds on message delay and process speeds
• Partial Synchrony: Between synchrony and asynchrony
• Failure detectors: Distributed system service to detect process crashes.
• Failure detectors provide (potentially) incorrect information.
• Still powerful enough to solve important problems.
• E.g., distributed consensus, leader election, wait-free scheduling, contention management.
• Failure detector implementations often require partial synchrony.
• One well-known failure detector is ◊P, the eventually perfect failure detector.

[Figure: spectrum of system models. Asynchrony: permissive model, crash detection impossible. Partial synchrony: crash detection possible, greater fidelity to real-world systems. Synchrony: restrictive model, crash detection possible.]

Eventually Perfect Failure Detector

◊P may make mistakes initially, but eventually provides perfect information.

[Figure: Fault Pattern 1 and Fault Pattern 2. Each shows a process that is Live until it crashes and Crashed afterwards, together with the sequence of ◊P outputs (Live or Crashed) over time.]

Implementing ◊P Failure Detectors

• Implementable under (some models of) partial synchrony.
• Popular model: Unknown bounds on message delay (Δ) and relative process speeds (Φ).

[Figure: a local ◊P module sends a PING and receives an ACK.]

Round Trip Time (RTT) = Outgoing message delay + message processing time + incoming message delay

• Outgoing message delay (PING) ≤ Δ
• ACK generation time ≤ f(Φ)
• Incoming message delay (ACK) ≤ Δ
• RTT ≤ Δ + f(Φ) + Δ

RTT is bounded above! This bound on RTT can be adaptively estimated.

Local Adaptive Estimation of RTT

• Start timer with some arbitrary (small) value
• If timer expires without receiving a message, suspect the process
• If a message arrives after timer expiry, trust the process and increase the timer value.
• Eventually timer value exceeds the bound on RTT.
• After which correct processes will never be suspected.
• Any crashed process is permanently suspected.

Measuring Time

But how do processes measure time?

• Two techniques:
  • Action clocks: Counting the number of actions
  • Real-time clocks: Independent device to measure time (e.g., hardware clocks, NTP).
• Either technique works in environments that do NOT accelerate or decelerate arbitrarily
• But in Celerating environments, where processes can accelerate or decelerate arbitrarily, each technique fails independently.

Action Clocks in Accelerating Environments

Faster processes → More action-clock ticks per RTT → Action clock timer continually times out

[Figure: timeline of process step time as process speed increases. Estimated bound on RTT: k action ticks, shown against the de facto bound on RTT. Between message send and message receive, k action-clock ticks elapse: Timeout! False suspicion. New estimate on RTT is now 2k action ticks; 2k action-clock ticks elapse: Timeout! False suspicion. Then 4k action-clock ticks. And so on, leading to an infinite stream of mistakes!]

Real-time Clocks in Decelerating Environments

Slower processes → Longer duration to generate and process messages → Unbounded RTT (in real time)

[Figure: real-time timeline as process speed decreases. Estimate on round-trip time is k real-time ticks: Timeout! False suspicion. New estimate on round-trip time is now 2k real-time ticks: Timeout! False suspicion. And so on, leading to an infinite stream of mistakes!]

Solving the Celeration Problem

• Bi-chronal timer
  • A vectored composition of action timer and real-time timer.
  • Measures time in terms of actions as well as real-time.
• All processes use separate local bi-chronal timers.
• Timer expires only when both the action timer and the real-time timer expire.
• The action timer insulates ◊P from deceleration.
• The real-time timer insulates ◊P from acceleration.

Bi-Chronal Timers in Non-Celerating Environments

• Bi-chronal clocks insulate ◊P from transient network behavior.

Conclusion

• Hardware upgrades often accelerate process speeds
  • Action clocks precipitate ◊P mistakes during acceleration
  • Bi-chronal clocks are immune to acceleration
• Multiple process crashes (in a server farm), DoS attacks, and such can decelerate processes to a crawl
  • Real-time clocks precipitate ◊P mistakes during deceleration
  • Bi-chronal clocks are immune to deceleration
• Many existing ◊P implementations are subtly broken
  • Bi-chronal clocks provide a simple solution
  • Additionally, they insulate systems from transient behavior
• Future work:
  • Properties and behavior of Bi-chronal clocks
  • Use of Bi-chronal clocks in other applications
  • Other approaches to dealing with Celeration
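The adaptive timeout rule and the bi-chronal timer described above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the authors' implementation: the names `BiChronalTimer`, `count_mistakes`, `accelerating`, and `decelerating` are hypothetical, the timeout is doubled after each false suspicion (suggested by the k, 2k, 4k progression in the figures), and an environment is modeled as a list of PING/ACK rounds, each given as the number of local steps taken while waiting and the real time per step. An action-only or real-time-only timer is obtained by setting the other limit to zero.

```python
from dataclasses import dataclass


@dataclass
class BiChronalTimer:
    """Hypothetical timer that expires only when BOTH components have expired."""
    action_limit: int    # timeout measured in local action-clock ticks
    real_limit: float    # timeout measured in real-time units
    actions: int = 0
    elapsed: float = 0.0

    def reset(self) -> None:
        self.actions = 0
        self.elapsed = 0.0

    def tick(self, real_dt: float) -> None:
        # One local step: one action tick that takes real_dt units of real time.
        self.actions += 1
        self.elapsed += real_dt

    def expired(self) -> bool:
        return self.actions >= self.action_limit and self.elapsed >= self.real_limit

    def grow(self) -> None:
        # Assumption: double both limits after a false suspicion.
        self.action_limit *= 2
        self.real_limit *= 2


def count_mistakes(timer: BiChronalTimer, rounds) -> int:
    """Count false suspicions of a correct process over (steps, real_dt) rounds."""
    mistakes = 0
    for steps, real_dt in rounds:
        timer.reset()
        suspected = False
        for _ in range(steps):
            timer.tick(real_dt)
            if timer.expired():
                suspected = True      # timeout before the ACK: suspect
        if suspected:                 # the ACK arrives after expiry: trust again
            mistakes += 1
            timer.grow()
    return mistakes


def accelerating(n: int):
    # Real-time RTT stays at 10 units, but the monitor takes twice as many
    # steps per RTT in every round.
    return [(10 * 2**i, 1.0 / 2**i) for i in range(n)]


def decelerating(n: int):
    # 10 steps per RTT in every round, but each step takes twice as long.
    return [(10, float(2**i)) for i in range(n)]


if __name__ == "__main__":
    n = 6
    print("action-only, accelerating:", count_mistakes(BiChronalTimer(8, 0.0), accelerating(n)))
    print("real-only,   decelerating:", count_mistakes(BiChronalTimer(0, 8.0), decelerating(n)))
    print("bi-chronal,  accelerating:", count_mistakes(BiChronalTimer(8, 8.0), accelerating(n)))
    print("bi-chronal,  decelerating:", count_mistakes(BiChronalTimer(8, 8.0), decelerating(n)))
```

In this toy model the single-clock timers make a mistake in every round of the environment that defeats them, while the bi-chronal timer makes one initial mistake and then stops, because at least one of its two components no longer expires within a round trip.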
