PIROGUE, A LIGHTER DYNAMIC VERSION OF THE RAFT DISTRIBUTED CONSENSUS ALGORITHM

Presentation Transcript


  1. PIROGUE, A LIGHTER DYNAMIC VERSION OF THE RAFT DISTRIBUTED CONSENSUS ALGORITHM • Jehan-François Pâris, U. of Houston • Darrell D. E. Long, U. C. Santa Cruz

  2. Motivation • New distributed consensus algorithm: Raft (Ongaro and Ousterhout, 2014) • Easier to understand and implement than Paxos (Lamport, 1998) • Raft needs to run on five servers to tolerate the failure of two of them • High energy footprint • A problem that needed to be addressed

  3. Talk organization • Understanding how Raft works • Focus on Raft update and election quorums • Two main ideas: using dynamic-linear voting and replacing some servers by witnesses • Evaluating our proposals

  4. A Raft cluster • One leader and several followers • All client requests go through the leader • [Diagram: a client talking to the leader; each server has a consensus module, a log, and a state machine]
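
A minimal sketch of how one such node could be modeled; the class and field names (RaftServer, Role, commit_index, …) are illustrative assumptions, not anything prescribed by the slides:

```python
from enum import Enum

class Role(Enum):
    LEADER = "leader"
    FOLLOWER = "follower"
    CANDIDATE = "candidate"

class RaftServer:
    """One node of the cluster: a consensus module, a log, and a state machine."""
    def __init__(self, server_id, peers):
        self.server_id = server_id
        self.peers = peers              # ids of the other servers
        self.role = Role.FOLLOWER       # every server starts as a follower
        self.current_term = 0
        self.log = []                   # replicated log of (term, request) entries
        self.commit_index = -1          # index of the last entry applied to the state machine
        self.state_machine = {}         # application state built from committed entries

    def handle_client_request(self, request):
        # All client requests must go through the leader.
        if self.role is not Role.LEADER:
            raise RuntimeError("redirect the client to the current leader")
        self.log.append((self.current_term, request))
```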

  5. A client sends a request • The leader stores the request in its log and forwards it to its followers

  6. The followers receive the request • The followers store the request in their logs and acknowledge its receipt

  7. The leader tallies followers' ACKs • Once it ascertains that the request has been processed by a majority of the servers, it updates its state machine

  8. The leader tallies followers' ACKs • The leader's heartbeats convey the news to its followers: they update their state machines
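
Slides 5 through 8 describe one round of log replication: the leader appends and forwards, the followers store and acknowledge, the leader commits once a majority has the entry, and heartbeats let the followers commit too. A synchronous sketch of that round, building on the hypothetical RaftServer fields introduced under slide 4:

```python
def replicate(leader, followers, request):
    """Sketch of one replication round, assuming synchronous calls for clarity."""
    # 1. The leader stores the request in its log and forwards it to its followers.
    leader.log.append((leader.current_term, request))
    entry_index = len(leader.log) - 1

    # 2. Each follower stores the request in its log and acknowledges receipt.
    acks = 1                                    # the leader counts itself
    for follower in followers:
        follower.log.append((leader.current_term, request))
        acks += 1

    # 3. Once a majority of the cluster holds the entry, the leader applies it
    #    to its state machine: the entry is now committed.
    cluster_size = len(followers) + 1
    if acks > cluster_size // 2:
        leader.commit_index = entry_index
        apply_to_state_machine(leader, request)

    # 4. The next heartbeats carry the leader's commit index, so the followers
    #    apply the entry to their own state machines as well.
    for follower in followers:
        follower.commit_index = leader.commit_index
        apply_to_state_machine(follower, request)

def apply_to_state_machine(server, request):
    # Placeholder: the real effect depends on the replicated service.
    server.state_machine[len(server.log) - 1] = request
```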

  9. The leader fails • The followers notice the lack of heartbeats at different times • They decide to elect a new leader

  10. At different times? • If all followers detected a leader failure at the same time, they would all solicit the votes of the other followers and nobody would win a majority of the votes • Raft servers have randomized election timers • This greatly reduces the risk and statistically guarantees convergence
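
A sketch of the randomized election timer; the 150 to 300 millisecond range is the one commonly used in Raft implementations, not a figure taken from the slides, and Role is the hypothetical enum from the earlier sketch:

```python
import random

ELECTION_TIMEOUT_MIN = 0.150   # seconds; illustrative values
ELECTION_TIMEOUT_MAX = 0.300

def reset_election_timer():
    """Each server picks its own random timeout, so they rarely expire together."""
    return random.uniform(ELECTION_TIMEOUT_MIN, ELECTION_TIMEOUT_MAX)

def on_timer_expired(server):
    # No heartbeat arrived before the timeout: become a candidate and start an election.
    server.role = Role.CANDIDATE
    server.current_term += 1
```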

  11. An election starts • A candidate for the leader position requests the votes of the other former followers • Its request includes a summary of the state of its log

  12. Former followers reply • Former followers compare the state of their logs with the credentials of the candidate • They vote for the candidate unless their own log is more "up to date" or they have already voted for another server
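
That voting rule can be written down directly. The sketch below assumes the (term, request) log-entry layout of the earlier sketches and a hypothetical voted_for field that remembers the server's vote for the current term:

```python
def handle_request_vote(voter, candidate_term, candidate_id,
                        candidate_last_log_term, candidate_last_log_index):
    """Sketch of the vote-granting decision; returns True if the vote is granted."""
    if candidate_term < voter.current_term:
        return False                       # stale candidate
    if voter.voted_for not in (None, candidate_id):
        return False                       # already voted for another server this term

    # "More up to date" compares the last log entries: the higher term wins,
    # and for equal terms the longer log wins.
    my_last_index = len(voter.log) - 1
    my_last_term = voter.log[-1][0] if voter.log else 0
    candidate_up_to_date = (
        (candidate_last_log_term, candidate_last_log_index)
        >= (my_last_term, my_last_index)
    )
    if not candidate_up_to_date:
        return False                       # our own log is more up to date

    voter.voted_for = candidate_id
    return True
```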

  13. The new leader is in charge • The newly elected candidate forces all its followers to duplicate in their logs the contents of its own log
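
One way to read "forces all its followers to duplicate the contents of its own log" is the sketch below, which finds the last point where the two logs agree and overwrites everything after it; real Raft reaches the same result incrementally through its AppendEntries consistency check:

```python
def bring_follower_up_to_date(leader_log, follower_log):
    """Sketch: find the longest common prefix of the two logs, then overwrite the rest."""
    match = 0
    while (match < len(follower_log) and match < len(leader_log)
           and follower_log[match] == leader_log[match]):
        match += 1
    # Discard any conflicting follower entries and copy the leader's suffix.
    del follower_log[match:]
    follower_log.extend(leader_log[match:])
```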

  14. Raft fault-tolerance • Raft must run on five servers to tolerate the failure of two of them because it uses majority consensus voting
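
The arithmetic behind "five servers to tolerate two failures" is plain majority counting:

```python
def majority(n_servers):
    """Smallest number of servers that still forms a majority."""
    return n_servers // 2 + 1

def failures_tolerated(n_servers):
    """A majority must survive, so at most n - majority(n) servers may fail."""
    return n_servers - majority(n_servers)

assert failures_tolerated(5) == 2   # Raft's usual five-server cluster
assert failures_tolerated(3) == 1   # a three-server cluster tolerates only one failure
```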

  15. The good news • Even after two server failures, the data remain protected against two irrecoverable server failures

  16. The bad news • High energy footprint of the algorithm • Five servers is more than most fault-tolerant distributed systems use • Two servers: mirroring • Three servers: Google FS • Byzantine fault tolerance requires four servers

  17. Our proposal • Run a slightly modified Raft algorithm on fewer servers • Guarantee the service will tolerate all double failures without service interruptions • Let the service occasionally run on fewer than three servers • Less protection against irrecoverable server failures, but disk MTTFs are measured in decades

  18. First idea: dynamic-linear voting • Adjust quorums as the number of participants changes • Increases service availability • Provides the same or better availability with fewer servers

  19. Example (I) • Start with four servers • Quorum is three out of four

  20. Example (II) • One of the four servers fails • Service still available under current quorum • New quorum is two out of three

  21. Example (III) • A second server fails • Service still available under current quorum • New quorum is two out of two

  22. Example (IV) • A third server fails • Two options

  23. The two options • Have no tie-breaking rule: the quorum remains two out of two and the service cannot tolerate triple failures • Have a tie-breaking rule: use some fixed linear ordering of the servers; the quorum is now defined as the higher-ordered server in the pair, so the service can tolerate one half of the triple failures

  24. Example (V) • As one or more servers recover, quorums get updated • New quorum is two out of three
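
Examples (I) through (V) can be condensed into a small quorum-update routine. The sketch below recomputes the quorum from the current set of live servers and applies the optional tie-breaking rule when only two servers remain; all names are illustrative:

```python
def update_quorum(live_servers, use_tie_breaker=True):
    """Return a predicate that tells whether a set of voters forms a quorum.

    live_servers: servers currently in the majority block, e.g. {"s1", "s2"}.
    Servers are assumed to be totally ordered by their names (the linear ordering).
    """
    n = len(live_servers)
    needed = n // 2 + 1

    def has_quorum(voters):
        voters = set(voters) & set(live_servers)
        if len(voters) >= needed:
            return True
        # Tie-breaking rule: with exactly two live servers, the higher-ordered
        # one alone may continue, so half of the triple failures are survivable.
        if use_tie_breaker and n == 2 and len(voters) == 1:
            return max(live_servers) in voters
        return False

    return has_quorum

# Four servers, then failures shrink the majority block step by step.
check = update_quorum({"s1", "s2", "s3", "s4"})
assert check({"s1", "s2", "s3"})        # three out of four
check = update_quorum({"s2", "s3", "s4"})
assert check({"s3", "s4"})              # two out of three
check = update_quorum({"s3", "s4"})
assert check({"s4"})                    # tie-breaker: s4 is the higher-ordered server
assert not check({"s3"})
```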

  25. Implementation issues (I) • The cluster must keep track of the current quorum • That is not enough, because of network partitions and other transmission errors: servers that were assumed to have failed may suddenly reappear • The cluster must also maintain the list of servers that are allowed to vote (the majority block) • The best solution is to use cohort sets

  26. Implementation issues (II) • Cohort sets represent the set of servers that are allowed to participate in a leader election • Typically stored in a bitmap • Updated by the leader of the cluster • Implemented on top of the Raft consensus algorithm, which guarantees their consistency
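
A cohort set kept as a bitmap could look like the sketch below; the exact encoding (one bit per server, in a fixed order) is an assumption, since the slides only say the set is typically stored in a bitmap, updated by the leader, and made consistent through Raft itself:

```python
class CohortSet:
    """Bitmap of the servers currently allowed to vote in a leader election."""
    def __init__(self, all_servers):
        self.all_servers = sorted(all_servers)        # fixed order assigns each server a bit
        self.bits = (1 << len(self.all_servers)) - 1  # initially every server may vote

    def _bit(self, server):
        return 1 << self.all_servers.index(server)

    def remove(self, server):
        self.bits &= ~self._bit(server)               # leader excludes a failed server

    def add(self, server):
        self.bits |= self._bit(server)                # leader re-admits a recovered server

    def members(self):
        return [s for s in self.all_servers if self.bits & self._bit(s)]

# Each bitmap change would itself be replicated through the Raft log,
# which is what keeps every server's view of the cohort set consistent.
cohort = CohortSet(["s1", "s2", "s3", "s4"])
cohort.remove("s2")
assert cohort.members() == ["s1", "s3", "s4"]
```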

  27. Second idea: using witnesses • Can replace one of the four servers by a witness • Witnesses are lightweight servers that hold no data (no state machine) • They maintain the same state information: the sequence number of the current term and the indexes of all log updates, the index of the last known update applied by the leader to its state machine, and the current cohort set
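
Since a witness keeps only this bookkeeping state, it can be sketched as a stripped-down server; the field names mirror the earlier illustrative sketches and are not taken from the slides:

```python
class Witness:
    """Lightweight participant: votes and tracks metadata, but stores no data."""
    def __init__(self, witness_id):
        self.witness_id = witness_id
        self.current_term = 0          # sequence number of the current term
        self.log_indexes = []          # indexes of all log updates (no entry contents)
        self.last_applied_index = -1   # last update known to be applied by the leader
        self.cohort_bits = 0           # current cohort set, as a bitmap
        # Deliberately absent: no log entries and no state machine.

    def record_update(self, index):
        self.log_indexes.append(index)
```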

  28. Advantages and disadvantages • Witnesses can run on very low power nodes (Raspberry Pi, …) • A cluster with n servers and m witnesses has almost the same availability as a cluster with n + m servers • Replacing servers by witnesses increases the risk that the service will run with only one available server, which increases the risk of data loss

  29. Performance Analysis • We will evaluate three Pirogue configurations • PIROGUE(4): a Pirogue cluster with four servers • RESTRICTED PIROGUE(4): a Pirogue cluster with four servers that requires a minimum of two operational servers to accept updates • PIROGUE(3+1): a Pirogue cluster with three servers and one witness

  30. Benchmarks • Two benchmarks: • RAFT(3): a Raft cluster with three servers • RAFT(5): a Raft cluster with five servers

  31. Performance criteria • Availability: fraction of time the service will be operational • Exposure to double failures: fraction of time the service will run with only two operational servers • Exposure to single failures: fraction of time the service will run on a single operational server

  32. Modeling hypotheses • Device failures are mutually independent and follow a Poisson law (a reasonable approximation) • Device repairs can be performed in parallel • Device repair times follow an exponential law (not true, but the results are fairly robust) • H.-W. Kao, J.-F. Pâris, T. Schwarz, S.J., and D. D. E. Long, "A Flexible Simulation Tool for Estimating Data Loss Risks in Storage Arrays," Proc. MSST Symposium, May 2013
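
Under these hypotheses, availability can also be estimated by straightforward simulation. The sketch below draws exponential failure and repair times for independent servers and measures the fraction of time at least a fixed quorum is up; it illustrates the failure model for a static-quorum cluster such as RAFT(5) and is not the simulation tool cited above:

```python
import random

def simulate_availability(n_servers, quorum, failure_rate, repair_rate,
                          horizon=1_000_000.0, seed=1):
    """Monte Carlo sketch: fraction of time at least `quorum` servers are up,
    with exponential times to failure (rate lambda) and to repair (rate mu)."""
    rng = random.Random(seed)
    up = [True] * n_servers
    # Next event time per server: a failure if it is up, a repair completion if it is down.
    next_event = [rng.expovariate(failure_rate) for _ in range(n_servers)]
    now, available_time = 0.0, 0.0
    while now < horizon:
        i = min(range(n_servers), key=lambda k: next_event[k])
        dt = next_event[i] - now
        if sum(up) >= quorum:
            available_time += dt
        now = next_event[i]
        up[i] = not up[i]                                  # the server fails or is repaired
        rate = failure_rate if up[i] else repair_rate
        next_event[i] = now + rng.expovariate(rate)
    return available_time / horizon

# lambda/mu = 0.01, as on the "System Parameters" slide (one crash every 25 days,
# 6-hour repairs, rates per hour); a five-server Raft cluster needs a majority of 3.
print(simulate_availability(5, 3, failure_rate=1/600.0, repair_rate=1/6.0))
```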

  33. System Parameters • Only two parameters • Server failure rate λ = 1/MTTF • Server repair rate μ = 1/MTTR • A λ/μ ratio of 0.01 corresponds to a server that crashes once every 25 days and takes 6 hours to restart
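
As a quick check of the figure on this slide, λ/μ = MTTR/MTTF = 6 / (25 × 24) = 6/600 = 0.01:

```python
MTTF_HOURS = 25 * 24     # one crash every 25 days
MTTR_HOURS = 6           # six hours to restart
lam, mu = 1 / MTTF_HOURS, 1 / MTTR_HOURS
assert abs(lam / mu - 0.01) < 1e-9   # lambda/mu = MTTR/MTTF = 6/600 = 0.01
```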

  34. Markov diagram for PIROGUE(4) • [State-transition diagram distinguishing the available states from the unavailable (primed) states, with failure transitions at multiples of λ and repair transitions at multiples of μ]

  35. Markov diagram for RESTRICTED PIROGUE(4) • [State-transition diagram distinguishing the available states from the unavailable (primed) states, with failure transitions at multiples of λ and repair transitions at multiples of μ]

  36. Markov diagram for PIROGUE(3+1) • Identical to that for PIROGUE(4), as long as the witness has the lowest rank in the linear ordering of sites, so it will never win a tie-breaking decision • The sole problem is a higher exposure to single and double failures: operational cluster configurations with two servers and the witness, or one server and the witness
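
The availability numbers behind these diagrams come from solving the chain's balance equations for its steady state. The exact Pirogue transition structure is not recoverable from the slides, so the sketch below solves a simpler generic birth-death chain (independent servers, failure rate λ each, parallel repairs at rate μ each) and reports the probability that at least a quorum of servers is up; it shows the method rather than the Pirogue model itself:

```python
def birth_death_availability(n_servers, quorum, lam, mu):
    """Steady-state probability that at least `quorum` servers are up, for a
    birth-death chain where each server fails at rate lam and is repaired
    (in parallel) at rate mu. State k = number of servers that are up."""
    # Detailed balance: pi[k] * (n - k) * mu = pi[k+1] * (k + 1) * lam
    weights = [1.0]
    for k in range(n_servers):
        weights.append(weights[-1] * (n_servers - k) * mu / ((k + 1) * lam))
    total = sum(weights)
    pi = [w / total for w in weights]          # pi[k] = P(k servers up)
    return sum(pi[quorum:])

# With lam/mu = 0.01 each server is up with probability mu/(lam + mu), about 0.99,
# and the stationary distribution is binomial in the number of up servers.
print(birth_death_availability(5, 3, lam=1/600.0, mu=1/6.0))
```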

  37. Availability

  38. Exposure to double failures

  39. Exposure to single failures

  40. Conclusions • We can reduce the energy footprint of the Raft protocol by up to 40 percent • When reducing the risk of data loss is critical, use RESTRICTED PIROGUE(4) • When achieving high service availability is critical, use PIROGUE(4) • When maximizing energy savings and achieving high service availability are both critical, use PIROGUE(3+1)

  41. Thank you! • Any questions?
