Local Tolerance to Unbounded Byzantine Faults

Mikhail Nesterenko, Kent State University
Anish Arora, Ohio State University

Faults in Systems of Large Scale
[Figure labels: faulty, affected, unaffected]
• large system size presents unique challenges and opportunities for ensuring dependability
• problem
  • faults:
    • occur often
    • affect multiple components
    • interact unpredictably
  • asynchronous execution model
  • faults are spatially/temporally unbounded, complex, and undetectable
• opportunity
  • a fault directly affects a region rather than the whole system
  • if faults are contained, the rest of the system continues to function

Difficulties Containing Unbounded Faults
• lack of spatial bound
  • an arbitrary number of processes can be faulty
  • cannot rely on limited scope of fault or number of faulty processes
• lack of temporal bound
  • a faulty process behaves incorrectly for arbitrarily long
  • cannot wait until the fault stops
• contain correctness and tolerance instead of faults
• use execution models that simplify such containment

Outline
• containing correctness and tolerance: strict fault containment and strict stabilization
• execution models and example programs
  • reactive program: dining philosophers
  • transformational execution models and programs
    • output dependent: k-independent set selection
    • output independent: lightweight spanner construction

Containing Correctness
[Figure labels: fault of class F, containment radius l, containment locality]
• address specification first
  • what does it mean for a system to be correct when an arbitrary portion of it is faulty?
  • spec defines correct sequences for each process P
  • a sequence involves states of P and possibly others
• a program is locally containing of faults of class F if there exists a constant l (containment radius) such that
  • every P conforms to its spec if faulty processes are at least l hops away from P
• problem: correctness of P depends on every process in the system conforming to spec or F

Strict Fault Containment
[Figure label: Byzantine fault]
• a strict fault containing (SFC) program is locally containing of unbounded Byzantine faults
  • a process satisfies spec regardless of actions of processes outside locality
  • an SFC-program is containing of bounded and unbounded faults of any class
• for each P the spec can only mention processes inside locality
  • a problem lacking such specs (e.g. routing) does not have SFC-solutions

Strict Stabilization
• additional tolerance properties to faults within locality for a strictly fault-containing program
• strict stabilization: stabilization from transient faults; regardless of actions outside locality, each P eventually satisfies spec

Outline
• containing correctness and tolerance: strict fault containment and strict stabilization
• execution models and example programs
  • reactive program: dining philosophers
  • transformational execution models and programs
    • output dependent: k-independent set selection
    • output independent: lightweight spanner construction

Dining Philosophers
[Figure: cycle for a requesting process: thinking (T), hungry (H), eating (E)]
problem definition
• network of processes, each may request to eat
• properties
  • mutual exclusion: no two neighbors eat together
  • liveness: each requesting process eats eventually
execution model
• interleaving
• communication via shared registers
• high-atomicity

Solution to Dining Philosophers
[Figure labels: E, H, T, any; decreasing priority]
priority-based actions
• if T & higher priority neighbors thinking → become hungry
• if H & no neighbors are eating → eat (ensures MX)
• if E & done → think & give priority to neighbors (ensures liveness)
• waiting chain ≤ 3
• optimal containment radius of 2
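The three priority-based actions can be sketched as a small fault-free simulation. This is only an illustrative reading of the slide, not the authors' program: the function name `simulate`, the round-robin scheduler, the initial priority order (larger id wins), and the assumption that every thinking process always requests to eat are all hypothetical. It does not model Byzantine processes or the waiting-chain bound.

```python
def simulate(neighbors, steps):
    """Run the T -> H -> E -> T actions under a round-robin interleaving.

    neighbors maps each process id to a list of its neighbors (symmetric).
    Returns (meals, mx_held): meals per process and whether mutual
    exclusion held after every step.
    """
    state = {p: "T" for p in neighbors}
    # higher[(p, q)] is True when p currently has priority over q.
    # Assumed initial order: the larger id has priority.
    higher = {(p, q): p > q for p in neighbors for q in neighbors[p]}
    meals = {p: 0 for p in neighbors}
    order = sorted(neighbors)
    mx_held = True

    for step in range(steps):
        p = order[step % len(order)]  # one atomic action per step
        if state[p] == "T":
            # T & higher priority neighbors thinking -> become hungry
            if all(state[q] == "T" for q in neighbors[p] if higher[(q, p)]):
                state[p] = "H"
        elif state[p] == "H":
            # H & no neighbors are eating -> eat
            if all(state[q] != "E" for q in neighbors[p]):
                state[p] = "E"
                meals[p] += 1
        else:
            # E & done -> think and give priority to all neighbors
            state[p] = "T"
            for q in neighbors[p]:
                higher[(p, q)], higher[(q, p)] = False, True
        # Mutual exclusion: no two neighbors eat together.
        for a in neighbors:
            for b in neighbors[a]:
                if state[a] == "E" and state[b] == "E":
                    mx_held = False
    return meals, mx_held


# Usage: five processes arranged in a ring.
ring = {i: [(i - 1) % 5, (i + 1) % 5] for i in range(5)}
meals, mx_held = simulate(ring, 100)
```

Because each step executes one guarded action atomically (the high-atomicity interleaving model of the previous slide), the "no neighbors are eating" guard is enough to keep two neighbors from eating together in this sketch.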

Fault Containment and Information Propagation
• fault containment leverages limit on information propagation
• idea: abstract from the process of information propagation and highlight the result
[Figure: process: a sends info to b, b sends a's info to c, c sends a's info to d; result: d reads from a]

Execution Models
[Figure labels: range; P reads input & output; P reads input only]
• transformation program: given input computes output (e.g. leader election)
• models for transformation programs: each process reads from processes within range (finite distance)
  • output dependent: each process reads all information within range: input and (atomically) output
  • output independent: each process reads only input within range
    • every program in this model is strictly fault containing

k-Independent Set Selection (cf. [HHJS01])
[Figure labels: 1-independent set; k, k; P, Q, R; joins S, leaves S, joins S]
problem: select a maximal subset of processes S such that
• for each process in S, each other process of S is at least k hops away
solution actions
• if no member of S is less than k hops away → join S
• if there exists a member of S less than k hops away → leave S
observe:
• only a faulty node P can make another process Q leave S
• if Q leaves S, it can make another process R join S
• containment radius is 2k
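The join and leave actions can be sketched in a few lines. This is a minimal fault-free illustration, assuming a sequential scheduler that visits processes in id order; the names `within` and `select_k_independent` and the adjacency-list input are hypothetical and not taken from [HHJS01].

```python
from collections import deque


def within(graph, src, max_hops):
    """Processes at hop distance 1..max_hops from src (BFS), excluding src."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if dist[u] == max_hops:
            continue  # do not explore past the range
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    del dist[src]
    return set(dist)


def select_k_independent(graph, k, members=None):
    """Apply the join/leave actions until no action is enabled."""
    members = set(members or ())
    changed = True
    while changed:
        changed = False
        for p in sorted(graph):
            # Is another member of S less than k hops away from p?
            conflict = bool(within(graph, p, k - 1) & members)
            if p not in members and not conflict:
                members.add(p)       # join S
                changed = True
            elif p in members and conflict:
                members.discard(p)   # leave S
                changed = True
    return members


# Usage: a path of 7 processes, members at least 3 hops apart.
path = {i: [j for j in (i - 1, i + 1) if 0 <= j < 7] for i in range(7)}
print(select_k_independent(path, 3))  # {0, 3, 6}
```

The `members` argument lets the run start from an arbitrary (for example, corrupted) membership. In this sketch a join never creates a conflict and each leave removes at least one, so the loop terminates.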

Outline
• containing correctness and tolerance: strict fault containment and strict stabilization
• execution models and example programs
  • reactive program: dining philosophers
  • transformational execution models and programs
    • output dependent: k-independent set selection
    • output independent: lightweight spanner construction
      • practical problem: fast routing tree construction in sensor networks
      • spanner construction with double range
      • spanner optimization with larger ranges

Experimental Platform: Wireless Sensors
• 4 MHz Atmel processor
• 8 KB of program memory
• 512 B of data memory
• 916 MHz single-channel, low-power radio
• 10 kbps of raw bandwidth
• uniform antenna length & orientation
• TinyOS as the runtime system
• fresh AA batteries

Experiment: Fast Routing Tree Construction By Flooding [G+02]
• 156 nodes are arranged in a 13x12 grid on an open parking lot, with grid spacing of 2 feet
• the base station is placed in the middle of the base of the grid and starts the flooding
• each receiving node rebroadcasts the flood message immediately upon receipt and then squelches further broadcasts
• the sender is selected as parent, thus a routing tree to the base station is formed
• expectation: a routing tree with relatively regular structure:
  • # of children, link length, path size, etc.
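The expected regular tree can be reproduced with an idealized flood. The sketch below is not the experiment's code: it assumes a perfect unit-disk radio with no collisions or losses, measures distance in grid units, and the name `flood_tree` and its parameters are hypothetical.

```python
from collections import deque


def flood_tree(cols, rows, base, radio_range):
    """Return {node: parent} built by an idealized flood started at base.

    Nodes are (x, y) grid positions. A node hears a sender when their
    Euclidean distance is at most radio_range (in grid units).
    """
    nodes = [(x, y) for x in range(cols) for y in range(rows)]
    parent = {base: None}
    queue = deque([base])  # nodes rebroadcast in the order they were reached
    while queue:
        sender = queue.popleft()
        sx, sy = sender
        for node in nodes:
            if node in parent:
                continue  # already has a parent; it ignores later broadcasts
            dx, dy = node[0] - sx, node[1] - sy
            if dx * dx + dy * dy <= radio_range ** 2:
                parent[node] = sender  # first sender heard becomes the parent
                queue.append(node)
    return parent


# Usage: the 13x12 grid with the base station in the middle of the base row.
tree = flood_tree(13, 12, (6, 0), 1.0)
print(len(tree))  # 156
```

Under these assumptions every link joins adjacent grid positions, which is the kind of regular structure the slide lists as the expectation.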

[Figure: flood progress after 1 hop, 2 hops, 3 hops, and final; annotated features: long link, backward link, straggler, clustering]

Problems and Solution Approach
problem: routing tree constructed fast over “raw” topology is inadequate
• uneven clustering (some nodes have too many neighbors)
• long links (possibly unreliable)
• suboptimal paths (backward links)
idea: pre-process the topology to mitigate the problem
• weigh links (by length, error rate, node degree, etc.)
• locally construct a connected but lightweight spanner
• link weight may be reflexive (depend on the spanner, ex: node degree)

Lightweight Spanner Construction Using 2k-Range
• spanner: connected subgraph that includes all nodes (ex: spanning tree)
• k-local spanner: there is a path within distance ≤ k to each neighbor
problem: given a weighted graph (all weights unique) and 2k-range, build a lightweight k-local spanner
solution: each process P computes the minimum spanning tree (MST) of the region of each process Q at distance no more than k from P and selects the union of the edges incident to P
[Figure: P with radii k and k and a process Q; P can compute the MST for each process Q in this region; MST for Q's region]
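The construction can be sketched centrally by looping over every process P. This is a minimal illustration under stated assumptions: Q's region is taken to be the subgraph induced by the processes within k hops of Q, "incident edges" are read as the edges incident to P, and the names `ball`, `mst`, and `local_spanner` are hypothetical. Kruskal's algorithm is used for the MST as a convenient choice; the slides do not name one.

```python
from collections import deque


def ball(adj, src, k):
    """Processes within k hops of src, including src (BFS)."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if dist[u] == k:
            continue
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return set(dist)


def mst(nodes, edges):
    """Kruskal on the subgraph induced by nodes; edges maps (u, v) -> weight."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    tree = set()
    for (u, v), _w in sorted(edges.items(), key=lambda item: item[1]):
        if u in nodes and v in nodes:
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv
                tree.add((u, v))
    return tree


def local_spanner(edges, k):
    """Union, over all P, of P's incident edges in the MST of each nearby Q's region."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    spanner = set()
    for p in adj:
        for q in ball(adj, p, k):               # every Q within k hops of P
            region_tree = mst(ball(adj, q, k), edges)
            spanner |= {e for e in region_tree if p in e}
    return spanner


# Usage: a triangle with unique weights; the heaviest edge is dropped.
triangle = {("a", "b"): 1, ("b", "c"): 2, ("a", "c"): 3}
print(local_spanner(triangle, 1))
```

With unique weights, every edge of the global MST also belongs to the MST of any region that contains both of its endpoints, so the union computed here stays connected. Each P only inspects processes within 2k hops, which matches the 2k-range in the slide title.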

Spanner Optimization Using Ranges > 2k
• each P computes the spanner's topology in the neighborhood with radius (range − k)
• P knows the complete spanner in this region
• P iteratively repeats the procedure on the resultant spanner
[Figure: P, Q, and radii k, k, k; P can compute the MST for each process Q in this region]

Conclusion
• complexity and scale of large systems force unorthodox approaches to faults
• we explored the spatial dimension of fault tolerance to complex unbounded faults and used the lack of global info propagation
• stated necessary conditions and impossibility results
• gave first examples of programs
• question: how to solve problems that do have global info propagation? is it possible to contain problems before they spread?
