Distributed Algorithms 2g1513

Distributed Algorithms 2g1513, Lecture 16: Failure Detection, by Ali Ghodsi and Seif Haridi. A failure detector is a module that uses timeouts to detect failures. It is a useful abstraction for building systems because it makes programming easier, but it may give false positives.


Presentation Transcript


  1. Distributed Algorithms 2g1513, L16: Failure Detection, by Ali Ghodsi and Seif Haridi

  2. Failure Detection
     • Failure detector: a module that uses timeouts to detect failures
     • Useful abstraction for building systems
     • Programming becomes easier
     • May give false positives
     • Process A wrongly thinks process C is dead
     • Process B thinks process C is alive
     • We will not care what the failure detectors of crashed processes think!
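A minimal sketch of a timeout-based detector, assuming peers send periodic heartbeats. The class and method names (`HeartbeatDetector`, `heartbeat`, `suspects`) and the time values are hypothetical, chosen only to illustrate how a delayed message produces a false positive:

```python
# Hypothetical sketch: all names and numbers are illustrative, not from the lecture.
class HeartbeatDetector:
    """Suspects a peer when no heartbeat arrived within `timeout` time units."""

    def __init__(self, peers, timeout, now=0.0):
        self.timeout = timeout
        # Treat construction time as the last time every peer was heard from.
        self.last_seen = {p: now for p in peers}

    def heartbeat(self, peer, now):
        """Record that a heartbeat from `peer` arrived at time `now`."""
        self.last_seen[peer] = now

    def suspects(self, now):
        """Peers whose last heartbeat is older than the timeout."""
        return {p for p, seen in self.last_seen.items()
                if now - seen > self.timeout}


# Process A monitors B and C with a timeout of 3 time units.
detector = HeartbeatDetector(peers=["B", "C"], timeout=3.0)
detector.heartbeat("B", now=2.0)
detector.heartbeat("C", now=1.0)             # C is alive, but its next heartbeat is slow
suspected_at_5 = detector.suspects(now=5.0)  # C is suspected although it is alive
detector.heartbeat("B", now=6.0)
detector.heartbeat("C", now=6.0)             # the late heartbeat finally arrives
suspected_at_7 = detector.suspects(now=7.0)  # the suspicion is withdrawn
```

A longer timeout gives fewer false positives but slower detection of real crashes.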

  3. Different Failure Scenarios
     • Failure pattern, F
     • Actual view of crashed processes at a certain time
     • Monotonic
     • F(1)=∅, F(2)=∅, F(3)={P2}, F(7)={P2,P4}
     [Figure: timelines of processes P1 to P4 over times 1 to 8]

  4. Detections
     • Suspicions, H
     • What a detector thinks at process P and time t
     • Example: what process 3 thinks at time 8
     • H(3, 8)={1,4}
     • Erroneously thinks 1 has crashed, has detected 4’s crash, has not detected 2’s crash
     [Figure: timelines of processes P1 to P4 over times 1 to 8]
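The failure pattern F and the suspicion set H can be compared directly with set operations. The sketch below is one possible representation, not the lecture's: the `crash_time` mapping is an assumption, with crash times 3 and 7 picked to be consistent with F(3)={P2} and F(7)={P2,P4}.

```python
# Assumed representation: process id -> time at which it crashed.
crash_time = {2: 3, 4: 7}


def F(t):
    """Failure pattern: the set of processes that have crashed by time t."""
    return {p for p, tc in crash_time.items() if tc <= t}


def is_monotonic(times):
    """Crashed processes never recover, so F(a) is a subset of F(b) for a <= b."""
    return all(F(a) <= F(b) for a, b in zip(times, times[1:]))


H_3_8 = {1, 4}                    # what the detector at process 3 suspects at time 8
false_suspicions = H_3_8 - F(8)   # suspected although alive
undetected = F(8) - H_3_8         # crashed but not yet suspected
monotonic = is_monotonic(list(range(1, 9)))
```

Here `false_suspicions` captures an accuracy violation and `undetected` captures a (so far) unmet completeness obligation.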

  5. Completeness and Accuracy
     • Two important types of requirements
     • Completeness: the detector will detect a crashed process
     • Accuracy: the detector will not suspect a non-crashed process
     • Trivial to satisfy only one requirement (how?)
     • Satisfying both is impossible in an asynchronous system!

  6. Practical Requirements
     • Strong Completeness
     • Every crashed process is eventually detected by all correct processes
     • For all failure patterns
     • For all possible behaviors of a detector
     • There exists a time t after which all crashed processes are detected by all correct processes
     • We will only study detectors with this property

  7. Practical Requirements
     • Strongly Accurate
     • No process is ever suspected unless it has crashed
     • For all failure patterns
     • For all possible behaviors of a detector
     • For all correct processes P and Q, P will never suspect Q
     • Quite strong assumption
     • No premature timeouts

  8. Practical Requirements
     • Weakly Accurate
     • There exists a correct process which is never suspected by anyone
     • For all failure patterns
     • For all possible behaviors of a detector
     • There exists a correct process P
     • All correct processes will never suspect P
     • Quite strong assumption
     • No premature timeouts

  9. Practical Requirements
     • Eventually Strongly Accurate
     • After some finite time t, the detector is strongly accurate
     • Eventually Weakly Accurate
     • After some finite time t, the detector is weakly accurate
     • After some time, the requirements are fulfilled
     • Prior to that, any behavior is possible!
     • Weak assumptions
     • Think about Eventually Weakly Accurate!

  10. Four Established Detectors
     • Perfect Detector (P): Complete, Strongly Accurate
     • Strong Detector (S): Complete, Weakly Accurate
     • Eventually Perfect Detector (◇P): Complete, Eventually Strongly Accurate
     • Eventually Strong Detector (◇S): Complete, Eventually Weakly Accurate
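To make the four classes concrete, the sketch below checks the properties on a small, finite trace of suspicions. This is only an approximation: the real definitions quantify over infinite runs, so "eventually" is read here as "on some suffix of the trace". The trace layout (`H[p, t]` for correct processes only) and all function names are assumptions made for illustration.

```python
INF = float("inf")


def eventually(prop, times):
    """True if `prop` holds on some non-empty suffix of the finite trace."""
    return any(prop(times[i:]) for i in range(len(times)))


def classes(H, correct, crash_time, times):
    """Return which of P, S, ◇P, ◇S the finite trace H is consistent with."""
    crashed = set(crash_time)

    def complete(ts):  # every crashed process is suspected by every correct one
        return all(crashed <= H[p, t] for p in correct for t in ts)

    def strong(ts):    # nobody is suspected before it has crashed
        return all(crash_time.get(q, INF) <= t
                   for p in correct for t in ts for q in H[p, t])

    def weak(ts):      # some correct process is never suspected
        return any(all(q not in H[p, t] for p in correct for t in ts)
                   for q in correct)

    if not eventually(complete, times):
        return set()
    found = set()
    if strong(times):
        found.add("P")
    if weak(times):
        found.add("S")
    if eventually(strong, times):
        found.add("◇P")
    if eventually(weak, times):
        found.add("◇S")
    return found


# Processes 1 and 2 are correct, process 3 crashes at time 2.
H = {
    (1, 1): {2}, (2, 1): set(),   # process 1 wrongly suspects 2 at time 1
    (1, 2): set(), (2, 2): {3},
    (1, 3): {3}, (2, 3): {3},
    (1, 4): {3}, (2, 4): {3},
}
result = classes(H, correct=[1, 2], crash_time={3: 2}, times=[1, 2, 3, 4])
```

The single premature timeout at time 1 rules out P, while process 1 is never suspected, so the trace is still consistent with S, ◇P and ◇S.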

  11. Programming Difference (1/2)
     • Programming without failure detectors
     • Can never receive from all processes (might block because of a failure)
     • General technique
     • Assume only t nodes can fail
     • Broadcast to all nodes
     • Receive N-t messages
     • See the Initially Dead Consensus
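The "receive N-t messages" technique can be sketched as follows. The in-memory `inbox` queue and the name `receive_quorum` are assumptions for illustration; a real implementation would block on the network instead of raising an error.

```python
from collections import deque


def receive_quorum(inbox, n, t):
    """Take the first n - t messages and never wait for the remaining t."""
    needed = n - t
    received = []
    while len(received) < needed:
        if not inbox:
            # A real process would block here; the sketch fails loudly instead.
            raise RuntimeError("would block: fewer than n - t messages arrived")
        received.append(inbox.popleft())
    return received


n, t = 5, 2
# Messages from four of the five processes; p2 has crashed and sent nothing.
inbox = deque([("p1", "a"), ("p3", "b"), ("p4", "c"), ("p5", "d")])
quorum = receive_quorum(inbox, n, t)
```

Because at most t processes fail, at least N-t messages are guaranteed to arrive, so waiting for exactly N-t never blocks forever.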

  12. Programming Difference (2/2)
     • Programming with failure detectors
     • General technique
     • Broadcast
     • Receive from all nodes
     • Failed nodes will time out (completeness)
     • Code:
         if collect<msg, par> from q
           print(q + " said " + msg);
         else
           print(q + " looks dead!");
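One way to read the `collect` primitive: it returns the message from q if one arrived, and gives up once the local failure detector suspects q. The sketch below assumes a dictionary inbox and a set of suspected processes; both, and the function names, are hypothetical.

```python
def collect(inbox, suspected, q):
    """Return q's message, or None once the failure detector suspects q."""
    if q in inbox:
        return inbox[q]
    if q in suspected:
        return None
    # Neither a message nor a suspicion yet: a real process would keep waiting.
    raise RuntimeError(f"still waiting for {q}")


def report(inbox, suspected, processes):
    """Mirror the slide's code: one line per process, in order."""
    lines = []
    for q in processes:
        msg = collect(inbox, suspected, q)
        if msg is not None:
            lines.append(f"{q} said {msg}")
        else:
            lines.append(f"{q} looks dead!")
    return lines


lines = report({"p1": "hello", "p3": "hi"}, {"p2"}, ["p1", "p2", "p3"])
```

Completeness is what guarantees that `collect` eventually returns for a crashed q instead of waiting forever.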

  13. Pitfalls with Detectors
     • Two pitfalls:
     • A tries to send a message to all, but fails halfway through. B might get the message, C might not!
     • A sends a message to all, B gets it, but C erroneously detects A as dead

  14. Consensus: Rotating Coordinator for S
         xi := input
         for r:=1 to N do
           if i=r then
             forall j do send <value, xi, r> to j;
           if collect<value, x', r> from p_r then
             xi := x';
         end
         decide xi
     • How many failures can this tolerate?
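The rotating coordinator can be traced with a small round-by-round simulation. This is a simplified model under stated assumptions: crashes happen before the run starts, rounds are lock-step, and `false_suspicions` lists the (process, round) pairs where a process wrongly gives up on a live coordinator. None of these names come from the lecture.

```python
def rotating_coordinator(inputs, crashed=frozenset(), false_suspicions=frozenset()):
    """Simulate N rounds; in round r process r is the coordinator."""
    n = len(inputs)
    x = dict(inputs)  # current estimate xi of each process
    for r in range(1, n + 1):
        if r in crashed:
            # Completeness: everyone eventually suspects the dead coordinator
            # and moves on without changing its estimate.
            continue
        for p in x:
            if p in crashed or (p, r) in false_suspicions:
                continue  # p is dead, or p wrongly timed out on coordinator r
            x[p] = x[r]   # collect succeeded: adopt the coordinator's value
    return {p: v for p, v in x.items() if p not in crashed}


decisions = rotating_coordinator(
    inputs={1: 0, 2: 1, 3: 1, 4: 0},
    crashed={1},
    # Process 4 wrongly suspects coordinator 2; process 3 is never suspected,
    # which is what weak accuracy promises for at least one correct process.
    false_suspicions={(4, 2)},
)
```

In this run process 4 misses the value in round 2 but adopts it in round 3, whose coordinator nobody suspects, so all surviving processes end with the same value.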

  15. Tolerance of Eventuality (1/3)
     • An eventually perfect detector cannot solve consensus with resilience t > N/2
     • Proof by contradiction:
     • Assume it is possible, and assume N=10 and t=6
     • The ◇P detector initially tolerates any behavior
     • Red nodes are dead, green nodes are alive, detectors behave perfectly
     • Consensus will be 1 at some time t1
     [Figure: four alive nodes, each labeled 1]

  16. Tolerance of Eventuality (2/3)
     • An eventually perfect detector cannot solve consensus with resilience t > N/2
     • Proof by contradiction:
     • Assume it is possible, and assume N=10 and t=6
     • The ◇P detector initially tolerates any behavior
     • Red nodes are dead, blue nodes are alive, detectors behave perfectly
     • Consensus will be 0 at some time t0
     [Figure: four alive nodes, each labeled 0]

  17. Tolerance of Eventuality (3/3)
     • An eventually perfect detector cannot solve consensus with resilience t > N/2
     • Proof by contradiction:
     • Assume it is possible, and assume N=10 and t=6
     • The ◇P detector initially tolerates any behavior
     • Up to time t1, green nodes think blue and red nodes are dead. Hence, agreement on 1
     • Up to time t0, blue nodes think green and red nodes are dead. Hence, agreement on 0
     • Contradiction: green and blue nodes are all alive, yet they decide differently
     [Figure: green nodes labeled 1, blue nodes labeled 0]
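The counting behind the N=10, t=6 example can be written out as a short derivation. The general condition in the last line is an added observation, not a statement from the slides: each group of N-t nodes must be able to decide alone, and the argument needs two such groups that do not overlap.

```latex
\begin{align*}
N - t &= 10 - 6 = 4 && \text{(size of the green group and of the blue group)} \\
2(N - t) &= 8 \le 10 = N && \text{(the two groups are disjoint)} \\
2(N - t) \le N &\iff t \ge \tfrac{N}{2} && \text{(when two disjoint groups fit)}
\end{align*}
```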

  18. Consensus: Rotating Coordinator for S • For the eventually strong detector • The trivial rotating coordinator will not work • Why? • “Eventually” might imply no consensus in first round! • Trivial solution: • Rotate forever • Eventually all nodes collect one coordinator: consensus • Problem? • Termination: How do we know when to finish? Ali Ghodsi and Seif Haridi

  19. Idea for termination
     • Bound the number of failures
     • Less than a third might fail (t < N/3)
     • Similar to rotating coordinator for S:
     • 1) Everyone sends its vote to coordinator c
     • 2) c picks the majority vote V, and broadcasts V
     • 3) Every node that gets the broadcast changes its vote to V
     • 4) Change coordinator c and go to 1)

  20. Consensus: Rotating Coordinator for ◇S
         xi := input
         r := 0
         while true do
         begin
           r := r+1
           c := (r mod N)+1                       { rotate to coordinator c }
           send <value, xi, r> to pc              { all send value to coord }

  21. Consensus: Rotating Coordinator for ◇S
         xi := input
         r := 0
         while true do
         begin
           r := r+1
           c := (r mod N)+1                       { rotate to coordinator c }
           send <value, xi, r> to pc              { all send value to coord }
           if i==c then                           { coord only }
           begin
             msgs[0]:=0; msgs[1]:=0;              { reset 0 and 1 counter }
             for x:=1 to N-t do
             begin
               receive <value, V, R> from q       { receive N-t msgs }
               msgs[V]:=msgs[V]+1;                { increase relevant counter }
             end
             if msgs[0]>msgs[1] then v:=0 else v:=1 end   { choose majority value }
             forall j do send <outcome, v, r> to pj       { send v to all }
           end

  22. Consensus: Rotating Coordinator for ◇S
         xi := input
         r := 0
         while true do
         begin
           r := r+1
           c := (r mod N)+1                       { rotate to coordinator c }
           send <value, xi, r> to pc              { all send value to coord }
           if i==c then                           { coord only }
           begin
             msgs[0]:=0; msgs[1]:=0;              { reset 0 and 1 counter }
             for x:=1 to N-t do
             begin
               receive <value, V, R> from q       { receive N-t msgs }
               msgs[V]:=msgs[V]+1;                { increase relevant counter }
             end
             if msgs[0]>msgs[1] then v:=0 else v:=1 end   { choose majority value }
             forall j do send <outcome, v, r> to pj       { send v to all }
           end
           if collect<outcome, v, r> from pc then { collect value from coord }
           begin
             xi := v                              { change input to v }
           end
         end

  23. Loop Invariant
     • If k ≥ N-t nodes agree on a value V before a round
     • Then at least k nodes agree on V after the round
     • Why?
     • At most t did not vote V
     • A node will only change its value to X if X is proposed by the coord
     • The coord only proposes X if a majority of the N-t votes were for X
     • N-t > 2N/3, so a majority of N-t is more than N/3 nodes
     • More than N/3 voted X
     • X has to be V, since at most t < N/3 nodes did not vote V
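The arithmetic in the invariant can be checked by brute force. The sketch below assumes the coordinator rule from the pseudocode (count N-t votes, pick 1 on a tie), so a value can win with as few as ceil((N-t)/2) votes; the function name is hypothetical.

```python
def smallest_winning_count(n, t):
    """Fewest votes with which a value can be chosen among n - t counted votes.

    The coordinator picks 1 on a tie, so half of the votes (rounded up) suffice.
    """
    return (n - t + 1) // 2


# For every N and every t < N/3, a winning value needs more than t votes,
# so it cannot be supported only by the at most t nodes that did not vote V.
violations = [
    (n, t)
    for n in range(1, 200)
    for t in range(n)
    if 3 * t < n and smallest_winning_count(n, t) <= t
]

example_ok = smallest_winning_count(10, 3)   # N=10, t=3: 7 votes counted
example_bad = smallest_winning_count(9, 3)   # N=9, t=3 violates t < N/3
```

With N=9 and t=3 a value could win with only 3 votes, which t dissenting nodes could supply on their own; this is why the bound t < N/3 is needed.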

  24. Enforcing Decision
     • Coordinator checks if all N-t voted the same
     • Broadcast that information
     • If coordinator says all N-t voted the same
     • Decide for that value!

  25. Consensus: Rotating Coordinator for ◇S
         xi := input
         r := 0
         while true do
         begin
           r := r+1
           c := (r mod N)+1                       { rotate to coordinator c }
           send <value, xi, r> to pc              { all send value to coord }
           if i==c then                           { coord only }
           begin
             msgs[0]:=0; msgs[1]:=0;              { reset 0 and 1 counter }
             for x:=1 to N-t do
             begin
               receive <value, V, R> from q       { receive N-t msgs }
               msgs[V]:=msgs[V]+1;                { increase relevant counter }
             end
             if msgs[0]>msgs[1] then v:=0 else v:=1 end             { choose majority value }
             if msgs[0]==0 or msgs[1]==0 then d:=1 else d:=0 end    { all same? }
             forall j do send <outcome, d, v, r> to pj              { send d and v to all }
           end
           if collect<outcome, d, v, r> from pc then   { collect value from coord }
           begin
             xi := v                              { change input to v }
             if d then decide(v)                  { decide if d is true }
           end
         end
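The complete algorithm can be exercised with a lock-step simulation. This is a sketch under several assumptions that are not in the lecture: crashes happen before the run, the coordinator counts the votes of the N-t lowest-numbered live processes, `false_suspicions` lists (process, round) pairs with a premature timeout, and the simulation stops once every live process has decided (the pseudocode loops forever).

```python
def run(inputs, t, crashed=frozenset(), false_suspicions=frozenset(), max_rounds=50):
    """Return (decisions, round in which the last live process decided)."""
    n = len(inputs)
    if len(crashed) > t:
        raise ValueError("more than t crashes: the coordinator could block")
    x = dict(inputs)                       # current vote xi of each process
    alive = [p for p in sorted(x) if p not in crashed]
    decided = {}
    for r in range(1, max_rounds + 1):
        c = (r % n) + 1                    # rotate to coordinator c
        if c in crashed:
            continue                       # completeness: everyone gives up on c
        votes = [x[p] for p in alive][: n - t]   # coordinator counts N-t votes
        counts = [votes.count(0), votes.count(1)]
        v = 0 if counts[0] > counts[1] else 1    # choose majority value
        d = counts[0] == 0 or counts[1] == 0     # did all N-t votes agree?
        for p in alive:
            if (p, r) in false_suspicions:
                continue                   # p timed out on a live coordinator
            x[p] = v                       # change own vote to v
            if d and p not in decided:
                decided[p] = v             # decide if d is true
        if len(decided) == len(alive):
            return decided, r
    return decided, max_rounds


decisions, last_round = run(
    inputs={1: 0, 2: 1, 3: 1, 4: 0},
    t=1,                                   # t < N/3 with N = 4
    crashed={4},
    false_suspicions={(1, 1)},             # process 1 wrongly suspects coordinator 2
)
```

In this run the votes are mixed in rounds 1 and 2, round 3 is skipped because its coordinator has crashed, and in round 4 the coordinator sees three identical votes, sets d, and every live process decides.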

  26. Liveness: Decide will happen
     • Eventually some node q will not be falsely suspected
     • Eventually q is coord
     • Everyone collects its outcome V
     • Everyone adopts V
     • From now on all nodes will vote V
     • Next time q is coord, d=1
     • Everyone decides

  27. Summary
     • Failure detectors simplify programming
     • Can solve consensus and many other problems
     • Two main requirements
     • Completeness (detecting failed processes)
     • Accuracy (not suspecting alive nodes)
     • Two main classes
     • Those that always behave well
     • Those that eventually behave well

  28. Summary
     • Failure detectors
     • Simple abstraction
     • Characterization
     • Completeness
     • Accuracy (strongly vs. weakly)
     • Four important classes of detectors
     • Can be used to solve consensus with high resilience
