
Fault-Tolerant Pulse Synchronization
Jennifer L. Welch
Texas A&M University
Dagstuhl: September 2008

Pulse Synchronization
• Given a set of nodes in a distributed system that pulse (or fire) repeatedly, how can we get them to fire periodically at the same times?
[Figure: node firing times, unsynchronized vs. synchronized]

Why Pulse Synchronization?
• Understand natural phenomena:
  • fireflies flashing
  • crickets chirping
  • electrically synchronous pacemaker cells
• In computer networks:
  • scheduling duty cycles in sensor networks
  • achieving clock synchronization (nodes share a common notion of a steadily increasing value)

Firing Oscillators
• Mirollo and Strogatz (1990)
  • mathematical model of a "population of identical integrate-and-fire oscillators"
  • describe a simple algorithm: when an oscillator fires, it instantaneously causes the others to jump ahead toward their next firing times according to a certain function
  • show mathematically under what conditions the system converges to synchronous firing
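
The jump-ahead rule described above can be sketched as a small simulation. This is a minimal illustration, not the model's definitive form: the concave state curve, the parameter values `B` and `EPS`, and the function names `state`, `phase_of`, and `next_firing` are all assumptions chosen for the example.

```python
import math

B = 3.0     # curvature of the state curve (assumed value)
EPS = 0.1   # state increase caused by each pulse heard (assumed value)

def state(phase, b=B):
    """Concave, increasing map from phase in [0, 1] to state in [0, 1]."""
    return math.log(1.0 + (math.exp(b) - 1.0) * phase) / b

def phase_of(x, b=B):
    """Inverse of state(): the phase at which the state equals x."""
    return (math.exp(b * x) - 1.0) / (math.exp(b) - 1.0)

def next_firing(phases, eps=EPS, b=B):
    """Advance time to the next firing and apply the instantaneous jumps.

    Returns the new phases; every oscillator that fired is reset to 0.
    """
    n = len(phases)
    dt = 1.0 - max(phases)                      # time until the leader fires
    phases = [p + dt for p in phases]
    fired = [p >= 1.0 - 1e-12 for p in phases]
    pulses = sum(fired)                         # pulses not yet delivered
    while pulses:
        newly = 0
        for i in range(n):
            if fired[i]:
                continue
            x = state(phases[i], b) + eps * pulses
            if x >= 1.0:                        # pushed over threshold: fires too
                fired[i] = True
                newly += 1
            else:                               # jumps ahead toward its next firing
                phases[i] = phase_of(x, b)
        pulses = newly                          # newly fired nodes pulse in turn
    return [0.0 if fired[i] else phases[i] for i in range(n)]
```

Calling `next_firing` repeatedly on a list of initial phases steps the population from one firing event to the next; oscillators that fire together stay together from then on.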

M&S Model  skip cycle Dagstuhl: September 2008

Sensor Networks
• Werner-Allen et al. (2005)
  • "Reachback Firefly Algorithm" (RFA): adapts M&S ideas to sensor networks under realistic communication assumptions
  • since messages are not received instantaneously, each node collects observations during a cycle and then adjusts its cycle immediately after each firing
Photo source: http://animals.howstuffworks.com/insects/firefly-info.htm
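
The collect-then-adjust structure described above can be sketched as follows. This is only an illustration of the control flow: the class name `RfaNode`, the linear jump rule `eps * phase`, and the simple summing of jumps are assumptions made for the example, not details taken from the RFA paper.

```python
class RfaNode:
    """Collects observations during a cycle and adjusts only after firing."""

    def __init__(self, period=1.0, eps=0.01):
        self.period = period   # nominal cycle length
        self.eps = eps         # coupling strength (assumed parameter)
        self.pending = 0.0     # total jump recorded during the current cycle

    def on_message(self, phase):
        """Record the jump that a firing heard at `phase` would have caused."""
        self.pending += self.eps * phase   # assumed jump rule, for illustration

    def on_fire(self):
        """Called when this node fires; returns the length of its next cycle."""
        next_cycle = max(0.0, self.period - self.pending)
        self.pending = 0.0                 # start collecting afresh
        return next_cycle
```

For example, a node with `period=1.0` and `eps=0.1` that hears firings at phases 0.5 and 0.8 shortens its next cycle by 0.13; a cycle with no messages leaves the period unchanged.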

RFA Algorithm  skip cycle Dagstuhl: September 2008

What About Fault Tolerance?
• Daliot, D. Dolev, and Parnas (2003, 2008):
  • adapt ideas about biological fault tolerance in such systems and apply them to networks
  • when a node has heard about the firing of some number of other nodes (by receiving messages), it compares the sum to a threshold function to decide whether to fire
  • proved to be self-stabilizing and tolerant of Byzantine faults in up to a third of the nodes

Alternative Approach to Fault Tolerance?
• Modify RFA, which collects data during a cycle and then uses it to update the cycle
• Apply approximate agreement ideas from D. Dolev, Lynch, Pinter, Stark and Weihl (1983, 1986), previously applied to clock synchronization (Welch & Lynch, 1984, 1988)
• Eliminate outliers and then perform the RFA calculations on the remainder to modify the cycle
• Might provide a simpler solution than DDP

Fault-Tolerant Averaging
• [DLPSW] fault-tolerant outlier-elimination method:
  • works for problems in which nodes have numerical values as inputs and want to output numerical values, such as approximate agreement and clock synchronization
  • to tolerate f Byzantine failures:
    • eliminate the f largest and f smallest values
    • for agreement-type problems, apply some kind of averaging function to the remaining values
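
A minimal sketch of the outlier-elimination step above, assuming the values are plain floats on an unbounded line. The function name `fault_tolerant_average` is hypothetical, and the arithmetic mean is just one possible averaging function; the midpoint of the surviving range is another.

```python
def fault_tolerant_average(values, f):
    """Drop the f largest and f smallest values, then average the rest."""
    if len(values) <= 2 * f:
        raise ValueError("need more than 2f values to tolerate f faults")
    kept = sorted(values)[f:len(values) - f]   # discard f from each end
    return sum(kept) / len(kept)               # assumed choice: arithmetic mean
```

With `f = 1`, the inputs `[0.48, 0.5, 0.52, 100.0, -100.0]` lose both extreme values, so a single faulty node reporting an arbitrary number cannot pull the result outside the range of the correct values.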

Applying FTA Idea to RFA
• Identifying outliers:
  • values come from a bounded range rather than an unbounded one, so wrap-around must be handled (cf. S. Dolev & Welch, 1995, 2004)
• What to do with the remaining values?
  • currently just doing the original RFA calculation
  • maybe something cleverer can be done
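
To illustrate why wrap-around matters, here is one possible way to trim outliers among phases on the circle [0, 1). It is a sketch under stated assumptions, not necessarily the method used in this work or in the cited papers: the circle is cut at the largest empty gap, the values are unwrapped into increasing order, and f values are dropped from each end. The function name `trim_on_circle` is hypothetical.

```python
def trim_on_circle(phases, f):
    """Trim f values from each end after cutting the circle at its largest gap.

    Returns the surviving values, unwrapped so they increase; entries may
    exceed 1.0 and should be reduced mod 1 after any averaging.
    """
    n = len(phases)
    if n <= 2 * f:
        raise ValueError("need more than 2f values to tolerate f faults")
    s = sorted(p % 1.0 for p in phases)
    # gaps[i] is the clockwise distance from s[i] to the next value
    gaps = [(s[(i + 1) % n] - s[i]) % 1.0 for i in range(n)]
    cut = max(range(n), key=gaps.__getitem__)   # cut just after s[cut]
    start = (cut + 1) % n
    unwrapped = [s[(start + k) % n] for k in range(n)]
    # lift values that come after the wrap so the list is non-decreasing
    unwrapped = [v if v >= unwrapped[0] else v + 1.0 for v in unwrapped]
    return unwrapped[f:n - f]
```

For the phases `[0.97, 0.99, 0.01, 0.03, 0.4]` with `f = 1`, the survivors are approximately `[0.99, 1.01, 1.03]`, whose mean reduces mod 1 to about 0.01. Sorting the same values on a line and trimming would instead keep 0.4 and split the tight cluster around 0.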

Preliminary Results
• Discrete-event simulation considering two kinds of faults:
  • no jump (faulty node never changes its cycle)
  • random jump (faulty node changes its cycle by a random amount after each firing)
• It appears that
  • the original RFA has some tolerance to these kinds of faults, in that it still converges
  • FT-RFA has better periodicity (after convergence, the time between firings is closer to 1)

Still To Do
• Understand what is going on
• Mathematical analysis to show convergence
  • if it does not converge, could try techniques from [DW] to get a more consistent set of firings
• Comparison with [DDP]
• Lower bound for pulse synchronization on the number of nodes needed to tolerate f faulty nodes?
  • does the known result for clock synchronization carry over?
• Extension to multihop?
  • known lower bounds on the required connectivity for clock synchronization are probably relevant

Acknowledgments
• Radu Stoleru
• Keerthi Deconda
