PIROGUE, A LIGHTER DYNAMIC VERSION OF THE RAFT DISTRIBUTED CONSENSUS ALGORITHM

Presentation Transcript


  1. PIROGUE, A LIGHTER DYNAMIC VERSION OF THE RAFT DISTRIBUTED CONSENSUS ALGORITHM • Jehan-François Pâris, U. of Houston • Darrell D. E. Long, U. C. Santa Cruz

  2. Motivation • New distributed consensus algorithm: Raft (Ongaro and Ousterhout, 2014) • Easier to understand and implement than Paxos (Lamport, 1998) • Raft needs to run on five servers to tolerate the failure of two of them • High energy footprint • A problem that needed to be addressed

  3. Talk organization • Understanding how Raft works • Focus on Raft update and election quorums • Two main ideas: using dynamic-linear voting and replacing some servers by witnesses • Evaluating our proposals

  4. A Raft cluster • One leader and several followers • All client requests go through the leader • [Diagram: a client talking to the leader; each server has a consensus module, a log, and a state machine]
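
A minimal sketch of how one such node could be modeled; the class and field names (RaftServer, Role, commit_index, …) are illustrative assumptions, not anything prescribed by the slides:

```python
from enum import Enum

class Role(Enum):
    LEADER = "leader"
    FOLLOWER = "follower"
    CANDIDATE = "candidate"

class RaftServer:
    """One node of the cluster: a consensus module, a log, and a state machine."""
    def __init__(self, server_id, peers):
        self.server_id = server_id
        self.peers = peers              # ids of the other servers
        self.role = Role.FOLLOWER       # every server starts as a follower
        self.current_term = 0
        self.log = []                   # replicated log of (term, request) entries
        self.commit_index = -1          # index of the last entry applied to the state machine
        self.state_machine = {}         # application state built from committed entries

    def handle_client_request(self, request):
        # All client requests must go through the leader.
        if self.role is not Role.LEADER:
            raise RuntimeError("redirect the client to the current leader")
        self.log.append((self.current_term, request))
```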

  5. A client sends a request • The leader stores the request in its log and forwards it to its followers

  6. The followers receive the request • The followers store the request in their logs and acknowledge its receipt

  7. The leader tallies followers' ACKs • Once it ascertains that the request has been processed by a majority of the servers, it updates its state machine

  8. The leader tallies followers' ACKs • The leader's heartbeats convey the news to its followers: they update their state machines
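
Slides 5 through 8 describe one round of log replication: the leader appends and forwards, the followers store and acknowledge, the leader commits once a majority has the entry, and heartbeats let the followers commit too. A synchronous sketch of that round, building on the hypothetical RaftServer fields introduced under slide 4:

```python
def replicate(leader, followers, request):
    """Sketch of one replication round, assuming synchronous calls for clarity."""
    # 1. The leader stores the request in its log and forwards it to its followers.
    leader.log.append((leader.current_term, request))
    entry_index = len(leader.log) - 1

    # 2. Each follower stores the request in its log and acknowledges receipt.
    acks = 1                                    # the leader counts itself
    for follower in followers:
        follower.log.append((leader.current_term, request))
        acks += 1

    # 3. Once a majority of the cluster holds the entry, the leader applies it
    #    to its state machine: the entry is now committed.
    cluster_size = len(followers) + 1
    if acks > cluster_size // 2:
        leader.commit_index = entry_index
        apply_to_state_machine(leader, request)

    # 4. The next heartbeats carry the leader's commit index, so the followers
    #    apply the entry to their own state machines as well.
    for follower in followers:
        follower.commit_index = leader.commit_index
        apply_to_state_machine(follower, request)

def apply_to_state_machine(server, request):
    # Placeholder: the real effect depends on the replicated service.
    server.state_machine[len(server.log) - 1] = request
```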

  9. The leader fails • The followers notice the lack of heartbeats at different times • They decide to elect a new leader

  10. At different times? • If all followers detected a leader failure at the same time, they would all solicit the votes of the other followers and nobody would win a majority of the votes • Raft servers have randomized election timers • This greatly reduces the risk and statistically guarantees convergence
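
A sketch of the randomized election timer; the 150 to 300 millisecond range is the one commonly used in Raft implementations, not a figure taken from the slides, and Role is the hypothetical enum from the earlier sketch:

```python
import random

ELECTION_TIMEOUT_MIN = 0.150   # seconds; illustrative values
ELECTION_TIMEOUT_MAX = 0.300

def reset_election_timer():
    """Each server picks its own random timeout, so they rarely expire together."""
    return random.uniform(ELECTION_TIMEOUT_MIN, ELECTION_TIMEOUT_MAX)

def on_timer_expired(server):
    # No heartbeat arrived before the timeout: become a candidate and start an election.
    server.role = Role.CANDIDATE
    server.current_term += 1
```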

  11. An election starts • A candidate for the leader position requests the votes of the other former followers • Its request includes a summary of the state of its log

  12. Former followers reply • Former followers compare the state of their logs with the credentials of the candidate • They vote for the candidate unless their own log is more "up to date" or they have already voted for another server
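
That voting rule can be written down directly. The sketch below assumes the (term, request) log-entry layout of the earlier sketches and a hypothetical voted_for field that remembers the server's vote for the current term:

```python
def handle_request_vote(voter, candidate_term, candidate_id,
                        candidate_last_log_term, candidate_last_log_index):
    """Sketch of the vote-granting decision; returns True if the vote is granted."""
    if candidate_term < voter.current_term:
        return False                       # stale candidate
    if voter.voted_for not in (None, candidate_id):
        return False                       # already voted for another server this term

    # "More up to date" compares the last log entries: the higher term wins,
    # and for equal terms the longer log wins.
    my_last_index = len(voter.log) - 1
    my_last_term = voter.log[-1][0] if voter.log else 0
    candidate_up_to_date = (
        (candidate_last_log_term, candidate_last_log_index)
        >= (my_last_term, my_last_index)
    )
    if not candidate_up_to_date:
        return False                       # our own log is more up to date

    voter.voted_for = candidate_id
    return True
```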

  13. The new leader is in charge • The newly elected candidate forces all its followers to duplicate in their logs the contents of its own log
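
One way to read "forces all its followers to duplicate the contents of its own log" is the sketch below, which finds the last point where the two logs agree and overwrites everything after it; real Raft reaches the same result incrementally through its AppendEntries consistency check:

```python
def bring_follower_up_to_date(leader_log, follower_log):
    """Sketch: find the longest common prefix of the two logs, then overwrite the rest."""
    match = 0
    while (match < len(follower_log) and match < len(leader_log)
           and follower_log[match] == leader_log[match]):
        match += 1
    # Discard any conflicting follower entries and copy the leader's suffix.
    del follower_log[match:]
    follower_log.extend(leader_log[match:])
```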

  14. Raft fault-tolerance • Raft must run on five servers to tolerate the failure of two of them because it uses majority consensus voting
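
The arithmetic behind "five servers to tolerate two failures" is plain majority counting:

```python
def majority(n_servers):
    """Smallest number of servers that still forms a majority."""
    return n_servers // 2 + 1

def failures_tolerated(n_servers):
    """A majority must survive, so at most n - majority(n) servers may fail."""
    return n_servers - majority(n_servers)

assert failures_tolerated(5) == 2   # Raft's usual five-server cluster
assert failures_tolerated(3) == 1   # a three-server cluster tolerates only one failure
```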

  15. The good news • Even after two server failures, the data remain protected against two irrecoverable server failures

  16. The bad news • High energy footprint of the algorithm • Five servers is more than most fault-tolerant distributed systems use • Two servers: mirroring • Three servers: Google FS • Byzantine fault tolerance requires four servers

  17. Our proposal • Run a slightly modified Raft algorithm on fewer servers • Guarantee the service will tolerate all double failures without service interruptions • Let the service occasionally run on fewer than three servers • Less protection against irrecoverable server failures, but disk MTTFs are measured in decades

  18. First idea: dynamic-linear voting • Adjust quorums as the number of participants changes • Increases service availability • Provides the same or better availability with fewer servers

  19. Example (I) • Start with four servers • Quorum is three out of four

  20. Example (II) • One of the four servers fails • Service still available under current quorum • New quorum is two out of three

  21. Example (III) • A second server fails • Service still available under current quorum • New quorum is two out of two

  22. Example (IV) • A third server fails • Two options

  23. The two options • Have no tie-breaking rule: the quorum remains two out of two and the service cannot tolerate triple failures • Have a tie-breaking rule: use some fixed linear ordering of the servers; the quorum is now defined as the higher-ordered server in the pair, so the service can tolerate one half of the triple failures

  24. Example (V) • As one or more servers recover, quorums get updated • New quorum is two out of three
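
Examples (I) through (V) can be condensed into a small quorum-update routine. The sketch below recomputes the quorum from the current set of live servers and applies the optional tie-breaking rule when only two servers remain; all names are illustrative:

```python
def update_quorum(live_servers, use_tie_breaker=True):
    """Return a predicate that tells whether a set of voters forms a quorum.

    live_servers: servers currently in the majority block, e.g. {"s1", "s2"}.
    Servers are assumed to be totally ordered by their names (the linear ordering).
    """
    n = len(live_servers)
    needed = n // 2 + 1

    def has_quorum(voters):
        voters = set(voters) & set(live_servers)
        if len(voters) >= needed:
            return True
        # Tie-breaking rule: with exactly two live servers, the higher-ordered
        # one alone may continue, so half of the triple failures are survivable.
        if use_tie_breaker and n == 2 and len(voters) == 1:
            return max(live_servers) in voters
        return False

    return has_quorum

# Four servers, then failures shrink the majority block step by step.
check = update_quorum({"s1", "s2", "s3", "s4"})
assert check({"s1", "s2", "s3"})        # three out of four
check = update_quorum({"s2", "s3", "s4"})
assert check({"s3", "s4"})              # two out of three
check = update_quorum({"s3", "s4"})
assert check({"s4"})                    # tie-breaker: s4 is the higher-ordered server
assert not check({"s3"})
```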

  25. Implementation issues (I) • The cluster must keep track of the current quorum • That is not enough, because of network partitions and other transmission errors: servers that were assumed to have failed may suddenly reappear • The cluster must also maintain the list of servers that are allowed to vote (the majority block) • The best solution is to use cohort sets

  26. Implementation issues (II) • Cohort sets represent the set of servers that are allowed to participate in a leader election • Typically stored in a bitmap • Updated by the leader of the cluster • Implemented on top of the Raft consensus algorithm, which guarantees their consistency
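
A cohort set kept as a bitmap could look like the sketch below; the exact encoding (one bit per server, in a fixed order) is an assumption, since the slides only say the set is typically stored in a bitmap, updated by the leader, and made consistent through Raft itself:

```python
class CohortSet:
    """Bitmap of the servers currently allowed to vote in a leader election."""
    def __init__(self, all_servers):
        self.all_servers = sorted(all_servers)        # fixed order assigns each server a bit
        self.bits = (1 << len(self.all_servers)) - 1  # initially every server may vote

    def _bit(self, server):
        return 1 << self.all_servers.index(server)

    def remove(self, server):
        self.bits &= ~self._bit(server)               # leader excludes a failed server

    def add(self, server):
        self.bits |= self._bit(server)                # leader re-admits a recovered server

    def members(self):
        return [s for s in self.all_servers if self.bits & self._bit(s)]

# Each bitmap change would itself be replicated through the Raft log,
# which is what keeps every server's view of the cohort set consistent.
cohort = CohortSet(["s1", "s2", "s3", "s4"])
cohort.remove("s2")
assert cohort.members() == ["s1", "s3", "s4"]
```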

  27. Second idea: using witnesses • Can replace one of the four servers by a witness • Witnesses are lightweight servers that hold no data (no state machine) • They maintain the same state information: the sequence number of the current term and the indexes of all log updates, the index of the last known update applied by the leader to its state machine, and the current cohort set
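
Since a witness keeps only this bookkeeping state, it can be sketched as a stripped-down server; the field names mirror the earlier illustrative sketches and are not taken from the slides:

```python
class Witness:
    """Lightweight participant: votes and tracks metadata, but stores no data."""
    def __init__(self, witness_id):
        self.witness_id = witness_id
        self.current_term = 0          # sequence number of the current term
        self.log_indexes = []          # indexes of all log updates (no entry contents)
        self.last_applied_index = -1   # last update known to be applied by the leader
        self.cohort_bits = 0           # current cohort set, as a bitmap
        # Deliberately absent: no log entries and no state machine.

    def record_update(self, index):
        self.log_indexes.append(index)
```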

  28. Advantages and disadvantages • Witnesses can run on very low power nodes (Raspberry Pi, …) • A cluster with n servers and m witnesses has almost the same availability as a cluster with n + m servers • Replacing servers by witnesses increases the risk that the service will run with only one available server, which increases the risk of data loss

  29. Performance Analysis • We will evaluate three Pirogue configurations • PIROGUE(4): a Pirogue cluster with four servers • RESTRICTED PIROGUE(4): a Pirogue cluster with four servers that requires a minimum of two operational servers to accept updates • PIROGUE(3+1): a Pirogue cluster with three servers and one witness

  30. Benchmarks • Two benchmarks: • RAFT(3): a Raft cluster with three servers • RAFT(5): a Raft cluster with five servers

  31. Performance criteria • Availability: fraction of time the service will be operational • Exposure to double failures: fraction of time the service will run with only two operational servers • Exposure to single failures: fraction of time the service will run on a single operational server

  32. Modeling hypotheses • Device failures are mutually independent and follow a Poisson law (a reasonable approximation) • Device repairs can be performed in parallel • Device repair times follow an exponential law (not true, but the results are fairly robust) • H.-W. Kao, J.-F. Pâris, T. Schwarz, S.J., and D. D. E. Long, "A Flexible Simulation Tool for Estimating Data Loss Risks in Storage Arrays," Proc. MSST Symposium, May 2013
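
Under these hypotheses, availability can also be estimated by straightforward simulation. The sketch below draws exponential failure and repair times for independent servers and measures the fraction of time at least a fixed quorum is up; it illustrates the failure model for a static-quorum cluster such as RAFT(5) and is not the simulation tool cited above:

```python
import random

def simulate_availability(n_servers, quorum, failure_rate, repair_rate,
                          horizon=1_000_000.0, seed=1):
    """Monte Carlo sketch: fraction of time at least `quorum` servers are up,
    with exponential times to failure (rate lambda) and to repair (rate mu)."""
    rng = random.Random(seed)
    up = [True] * n_servers
    # Next event time per server: a failure if it is up, a repair completion if it is down.
    next_event = [rng.expovariate(failure_rate) for _ in range(n_servers)]
    now, available_time = 0.0, 0.0
    while now < horizon:
        i = min(range(n_servers), key=lambda k: next_event[k])
        dt = next_event[i] - now
        if sum(up) >= quorum:
            available_time += dt
        now = next_event[i]
        up[i] = not up[i]                                  # the server fails or is repaired
        rate = failure_rate if up[i] else repair_rate
        next_event[i] = now + rng.expovariate(rate)
    return available_time / horizon

# lambda/mu = 0.01, as on the "System Parameters" slide (one crash every 25 days,
# 6-hour repairs, rates per hour); a five-server Raft cluster needs a majority of 3.
print(simulate_availability(5, 3, failure_rate=1/600.0, repair_rate=1/6.0))
```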

  33. System Parameters • Only two parameters • Server failure rate λ = 1/MTTF • Server repair rate μ = 1/MTTR • A λ/μ ratio of 0.01 corresponds to a server that crashes once every 25 days and takes 6 hours to restart
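
As a quick check of the figure on this slide, λ/μ = MTTR/MTTF = 6 / (25 × 24) = 6/600 = 0.01:

```python
MTTF_HOURS = 25 * 24     # one crash every 25 days
MTTR_HOURS = 6           # six hours to restart
lam, mu = 1 / MTTF_HOURS, 1 / MTTR_HOURS
assert abs(lam / mu - 0.01) < 1e-9   # lambda/mu = MTTR/MTTF = 6/600 = 0.01
```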

  34. Markov diagram for PIROGUE(4) • [State-transition diagram distinguishing the available states from the unavailable (primed) states, with failure transitions at multiples of λ and repair transitions at multiples of μ]

  35. Markov diagram for RESTRICTED PIROGUE(4) • [State-transition diagram distinguishing the available states from the unavailable (primed) states, with failure transitions at multiples of λ and repair transitions at multiples of μ]

  36. Markov diagram for PIROGUE(3+1) • Identical to that for PIROGUE(4), as long as the witness has the lowest rank in the linear ordering of sites, so it will never win a tie-breaking decision • The sole problem is a higher exposure to single and double failures: operational cluster configurations with two servers and the witness, or one server and the witness
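
The availability numbers behind these diagrams come from solving the chain's balance equations for its steady state. The exact Pirogue transition structure is not recoverable from the slides, so the sketch below solves a simpler generic birth-death chain (independent servers, failure rate λ each, parallel repairs at rate μ each) and reports the probability that at least a quorum of servers is up; it shows the method rather than the Pirogue model itself:

```python
def birth_death_availability(n_servers, quorum, lam, mu):
    """Steady-state probability that at least `quorum` servers are up, for a
    birth-death chain where each server fails at rate lam and is repaired
    (in parallel) at rate mu. State k = number of servers that are up."""
    # Detailed balance: pi[k] * (n - k) * mu = pi[k+1] * (k + 1) * lam
    weights = [1.0]
    for k in range(n_servers):
        weights.append(weights[-1] * (n_servers - k) * mu / ((k + 1) * lam))
    total = sum(weights)
    pi = [w / total for w in weights]          # pi[k] = P(k servers up)
    return sum(pi[quorum:])

# With lam/mu = 0.01 each server is up with probability mu/(lam + mu), about 0.99,
# and the stationary distribution is binomial in the number of up servers.
print(birth_death_availability(5, 3, lam=1/600.0, mu=1/6.0))
```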

  37. Availability

  38. Exposure to double failures

  39. Exposure to single failures

  40. Conclusions • We can reduce the energy footprint of the Raft protocol by up to 40 percent • When reducing the risk of data loss is critical, use RESTRICTED PIROGUE(4) • When achieving high service availability is critical, use PIROGUE(4) • When maximizing energy savings and achieving high service availability are both critical, use PIROGUE(3+1)

  41. Thank you! • Any questions?
