

  1. Synchronization Part II Global State, Election, & Critical Sections Chapter 5

  2. Global State

  3. Global State – Motivation • [Figure: processes P1 and P2 issue request, allocate, and release operations on resources R1 and R2; no single site observes the complete allocation history, which is what motivates capturing a global state]

  4. Global State • We cannot determine the exact global state of the system • We can approximate it • Distributed Snapshot: a state the system might have been in [Chandy and Lamport]

  5. Distributed Snapshots • System = n processes P1 to Pn • Complete (unidirectional) graph • State of Pi is si, one of a potentially infinite set of states • Entire contents of the process address space! • State of all processes S = {s1, s2, …, sn} • Ci,j indicates the channel between Pi and Pj • Reliable FIFO channels • Contents of Ci,j = Li,j = (m1, m2, …, mk) • L = {Li,j | i, j ∈ 1, …, n} • Global State of the system is G = (S, L)
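
The definition G = (S, L) maps directly onto a small data structure. Below is a minimal Python sketch of how a recorded global state might be represented; the class and field names are illustrative assumptions, not part of the slides.

    # Hypothetical container for a recorded global state G = (S, L).
    from dataclasses import dataclass, field
    from typing import Any, Dict, List, Tuple

    @dataclass
    class GlobalState:
        # S: the saved local state s_i of every process i
        process_states: Dict[int, Any] = field(default_factory=dict)
        # L: for each ordered pair (i, j), the messages in transit on channel C_i,j
        channel_states: Dict[Tuple[int, int], List[Any]] = field(default_factory=dict)

    # Example: a 2-process system in which channel C_1,2 still carries message "m1"
    g = GlobalState(
        process_states={1: {"x": 3}, 2: {"x": 5}},
        channel_states={(1, 2): ["m1"], (2, 1): []},
    )
    print(g.process_states[1], g.channel_states[(1, 2)])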

  6. Cuts • A consistent cut (meaningful global state) • An inconsistent cut

  7. Distributed Snapshot Algorithm – Description • Provides consistent cuts • Any process can request a snapshot • Processes can request snapshots concurrently • A special message token is used to request a snapshot • The snapshot consists of a global state of the system G = (S,L) • Taken at a consistent cut

  8. Distributed Snapshot Algorithm – Requesting a Snapshot • When a process P requests a snapshot it sends a token(P) to each other process Q • When a process Q receives a token(P) message its action depends on: • If Q receives the token for the first time • Q did not save its state for this token yet • If Q received the token before • Q has already saved its state for this token

  9. Distributed Snapshot Algorithm – Receive Token for the First Time • When Q receives token(X) (from P) for the first time: • Save its state in sQ • Consider LP,Q to be empty • Reliable, FIFO channels: all messages sent before the token have already been received and are included in sQ • Cut takes effect at the token • Send token(X) to everybody else • Note: Q must save its state before receiving any subsequent messages from P. Why?

  10. Distributed Snapshot Algorithm – Receive Token Again • When Q receives token(X) (from P) NOT for the first time: • Consider all messages received from P: • After Q has saved its state • Before receiving this token • These messages are part of LP,Q • Why? • Termination: when Q has received token(X) n-1 times, Q has finished its part of the snapshot requested by X.

  11. Distributed Snapshot Example (1) • Organization of a process and channels for a distributed snapshot

  12. Distributed Snapshot Example (2) • Process Q receives a marker for the first time and records its local state • Q records all incoming messages • Q receives a marker on an incoming channel and finishes recording the state of that incoming channel

  13. Distributed Snapshot Algorithm – Variables for Process Pi • int my_version = 0 /* my snapshot version */ • int current_snap[1 .. n] = [0 .. 0] • /* If process Pi has current_snap[j] = k, then Pi has saved its state for snapshot version k initiated by Pj */ • int tokens_received[1 .. n] = [0 .. 0] • /* If process Pi has tokens_received[j] = k, then Pi has received k tokens from Pj; needed to detect termination */ • process_state Si /* Process Pi saves its state in Si */ • channel_state Li[1 .. n] • /* Process Pi saves the contents of incoming channel Cj,i in Li[j] */

  14. Distributed Snapshot Algorithm – Code for Process Pi (1) • Request a snapshot: • my_version++ • save current state into Si • current_snap[i] = my_version • for each Pj ≠ Pi • send token(i, my_version) • /* token(i, ver): Pi is requesting a snapshot of version ver */

  15. Distributed Snapshot Algorithm – Code for Process Pi (2) • Receive token(j, ver) from Pk: • if current_snap[j] < ver /* first token of Pj's snapshot */ • Save current state into Si • current_snap[j] = ver /* saved state for this token */ • Li[k] = empty /* cut starts after the token */ • for each Pr ≠ Pi • send token(j, ver) to Pr • tokens_received[j] = 1 /* received first token */ • else (next slide) …

  16. Distributed Snapshot Algorithm – Code for Process Pi (3) • else if current_snap[j] == ver /* not the first token */ • tokens_received[j]++ /* how many tokens */ • Li[k] = all messages received from Pk after saving state and before receiving this token /* essential for a consistent cut */ • if tokens_received[j] == n – 1 • local snapshot for (j, ver) is done • /* my participation in this snapshot is done */

  17. Distributed Snapshot Algorithm • When a process finishes its local snapshot, it collects its local state (Si and Li) and sends it to the initiator of the distributed snapshot • The initiator can then analyze the state
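
The pseudocode of slides 13-17 can be condensed into one per-process handler. The following is an illustrative Python sketch, simplified to a single snapshot in progress at a time; send(dst, msg) and get_state() are assumed callbacks supplied by the application, and channels are assumed reliable and FIFO as the algorithm requires.

    class SnapshotProcess:
        def __init__(self, pid, n, send, get_state):
            self.pid, self.n = pid, n
            self.send, self.get_state = send, get_state
            self.recording = False    # saved our state for the current snapshot?
            self.S = None             # saved local state s_i
            self.L = {}               # L[k]: recorded contents of incoming channel C_k,i
            self.done_from = set()    # senders whose token we have already received

        def _save_state_and_forward(self, initiator, version):
            self.recording = True
            self.S = self.get_state()
            self.L = {k: [] for k in range(1, self.n + 1) if k != self.pid}
            self.done_from = set()
            for q in range(1, self.n + 1):            # send token(X) to everybody else
                if q != self.pid:
                    self.send(q, ("token", initiator, version))

        def request_snapshot(self, version):           # slide 14
            self._save_state_and_forward(self.pid, version)

        def on_token(self, sender, initiator, version):    # slides 15-16
            if not self.recording:                     # first token: save state, forward
                self._save_state_and_forward(initiator, version)
            self.done_from.add(sender)                 # cut on C_sender,i is at this token
            if len(self.done_from) == self.n - 1:      # a token arrived on every channel
                self.recording = False
                return self.S, self.L                  # local snapshot finished (slide 17)
            return None

        def on_app_message(self, sender, msg):
            # messages arriving after we saved state but before that channel's token
            # belong to the channel state L[sender]
            if self.recording and sender not in self.done_from:
                self.L[sender].append(msg)

    # Example wiring for a 2-process system with an in-memory "network"
    inbox = {1: [], 2: []}
    p2 = SnapshotProcess(2, 2, send=lambda d, m: inbox[d].append((2, m)), get_state=lambda: "s2")
    p1 = SnapshotProcess(1, 2, send=lambda d, m: inbox[d].append((1, m)), get_state=lambda: "s1")
    p1.request_snapshot(version=1)
    sender, (_, init, ver) = inbox[2].pop(0)
    print(p2.on_token(sender, init, ver))              # p2's local snapshot: ('s2', {1: []})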

  18. Distributed Snapshot Algorithm – Correctness • Generates a consistent cut • If P receives a message m from Q after P has saved its state, then m cannot be part of P's saved state • Otherwise the cut would be inconsistent • The algorithm instead records m as part of the channel state • Any message sent to P before the token will be part of P's state • FIFO, reliable channels • When P receives a token, it saves its state before processing any subsequent messages

  19. Distributed Computing Models

  20. Synchronous Distributed Computing Model • Synchronous model: • Process execution speed: bounded • Message transmission delay: bounded • Clock drift rate: bounded • Useful for analysis of algorithms • Can be built if processes can be guaranteed • Enough CPU cycles and N/W capacity • Clocks with bounded drift rates • Can make use of time-outs to detect failures

  21. Asynchronous Distributed Computing Model • Asynchronous model: • Process execution speed: not bounded • Message transmission delay: not bounded • Clock drift rate: not bounded • More realistic (e.g., the Internet) • Harder to design algorithms for • More general • Cannot make use of time-outs to detect failures

  22. Leader Election

  23. Leader Election (1) • A central authority is often needed in a DS • Primary replica, scheduler, etc. • All processes have unique ids • The id can be any measure useful for the election • Example ids: (1/load), name, priority, etc. • At the end, only one leader is elected • All processes agree on the same leader (unanimous) • Choose the leader with the highest id • The system can be synchronous or asynchronous

  24. Leader Election (2) • Several processes can call for an election • There can be up to n concurrent elections at once • A process does not call more than one election at a time • Correctness: • Safety: when the election is over, every process Pi has leaderi = j for the same process Pj

  25. Ring Algorithm – Asynchronous Model • [Figure: processes arranged in a ring; each process x knows Predecessor(x) and Successor(x)] • Messages: election(id), leader(id) • Local variables: runningi = false, leaderi • Initiate_Election(i): runningi = true; send election(i) to Successor(i)

  26. A Ring Algorithm – Process Pi • Election(i): • Receive a message from Predecessor(i) • Case message is election(k): • if k > i then send election(k) to Successor(i) • if k < i & not runningi then send election(i) to Successor(i) • if k = i then send leader(i) to Successor(i) • Case message is leader(k): • leaderi = k • runningi = false • quit election
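
The per-process rules above can be exercised with a small simulation. The sketch below is an illustrative, single-threaded Python rendering of slides 25-26 (not taken from the course material); the ring and message delivery are modeled with an in-memory FIFO queue.

    from collections import deque

    def ring_election(ids, initiators):
        """Simulate the ring election; return the leader learned by each process."""
        n = len(ids)
        succ = {ids[k]: ids[(k + 1) % n] for k in range(n)}   # ring successor
        running = {i: False for i in ids}
        leader = {i: None for i in ids}
        q = deque()
        for i in initiators:                                   # Initiate_Election(i)
            running[i] = True
            q.append((succ[i], ("election", i)))
        while q:
            dst, (kind, k) = q.popleft()
            if kind == "election":
                if k > dst:                                    # forward the larger id
                    q.append((succ[dst], ("election", k)))
                elif k < dst and not running[dst]:             # replace with own id
                    running[dst] = True
                    q.append((succ[dst], ("election", dst)))
                elif k == dst:                                 # own id survived: I am leader
                    q.append((succ[dst], ("leader", dst)))
            else:                                              # leader(k) announcement
                if leader[dst] is None:
                    leader[dst] = k
                    running[dst] = False
                    if k != dst:                               # circulate until it returns
                        q.append((succ[dst], ("leader", k)))
        return leader

    # Example: ring 3 -> 7 -> 1 -> 5 -> 3; processes 3 and 1 both start an election
    print(ring_election([3, 7, 1, 5], initiators=[3, 1]))      # every process elects 7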

  27. Ring Algorithm – Example • [Figure: a ring of processes with ids 1–10; election messages e(id) circulate and only the highest id, e(10), survives a full round; process 10 then sends leader l(10) messages around the ring]

  28. Ring Algorithm – Analysis • What if more than one process calls Initiate_Election()? • Exercise: trace the algorithm on an example ring • Message Complexity (bandwidth utilization) O(n²): • There are always n leader messages • Best Case: n election messages, when Pn sends election(n) and nobody else sends a message; election(n) travels all the way back to Pn • Worst Case: O(n²) messages • Exercise: find a ring arrangement that gives 1 + 2 + 3 + … + (n-2) + (n-1) + n = O(n²) election messages

  29. Bully Algorithm – Synchronous Model (1) • Reliable channels • Processes can crash • Processes have minimal knowledge of each other: • Direct communication • Each process knows which processes have higher ids than itself

  30. Bully Algorithm (2) • Message types: • election: start an election • leader: announce a winner • bully: bully a nominee to quit • Synchronous model: • Can make use of time-outs to detect failures • The process with the highest id, say P, can send a leader(P) message to all others

  31. Bully Algorithm (3) • A process Q with a lower id can start an election by • sending an election message to all processes with higher ids • waiting for some time T • If after T no response is received (all higher-id processes have crashed), then send leader(Q) to all processes with lower ids • If Q receives a response (it must be a bully message) • It waits for a leader message

  32. The Bully Algorithm - Example (1) • The bully election algorithm • Process 4 holds an election • Processes 5 and 6 respond, telling 4 to stop (OK = bully) • Now 5 and 6 each hold an election

  33. The Bully Algorithm – Example (2) • Process 6 tells 5 to stop • Process 6 wins and tells everyone (coordinator = leader)

  34. Bully Algorithm – Failure Detector • Boolean Failure_Detector(int id) • send message to id • wait for T time units • /* time out */ if no response return true else return false • T = 2Ttrans + Tproc (upper bound) • Ttrans = upper bound on time to transmit a message • Tproc = upper bound on time to process a message
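
A possible Python rendering of this detector is shown below; probe(id) stands in for the "send message to id and wait for a response" step, and the bound values are made-up examples.

    import queue, threading, time

    T_TRANS = 0.5                      # assumed upper bound on one-way transmission time (s)
    T_PROC = 0.1                       # assumed upper bound on message-processing time (s)
    T = 2 * T_TRANS + T_PROC

    def failure_detector(pid, probe, timeout=T):
        """Return True iff process pid is suspected to have crashed (no reply within T)."""
        replies = queue.Queue()
        threading.Thread(target=lambda: replies.put(probe(pid)), daemon=True).start()
        try:
            replies.get(timeout=timeout)   # wait at most T for the response
            return False                   # response arrived in time: pid is alive
        except queue.Empty:
            return True                    # time-out: suspect a crash

    # A probe that never answers within T is reported as crashed; a fast one is not
    print(failure_detector(5, probe=lambda pid: time.sleep(10) or "ok"))   # True
    print(failure_detector(6, probe=lambda pid: "ok"))                     # False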

  35. Bully Algorithm – Initiating an Election (1) • Initiate_Election(int i) /* process Pi */ • runningi = true /* I am running in this election */ • if i is the highest id then • send leader(i) to all Pj, where j ≠ i • else • send election(i) to all Pj, where j > i • /* check if there are higher-id processes out there */ • wait for T time units

  36. Bully Algorithm – Initiating an Election (2) • if no response /* time out, no response */ • leaderi = i /* I am the leader */ • send leader(i) to all Pj, where j ≠ i • else /* a bully message is received */ • wait for T’ time units • if no leader(k) message then Initiate_Election(i) • else (leader(k) received from Pk) • leaderi = k • runningi = false /* leader elected */

  37. The Bully Algorithm – Process Pi • Upon receiving a message m from Pj: • Case m is leader(j): leaderi = j; runningi = false • Case m is election(j): • if j < i then send bully to Pj • if not runningi then Initiate_Election(i) • Upon noticing that the leader crashed: • Initiate_Election(i) • Upon introducing a new process Pi (replacing a crashed one): • Initiate_Election(i) • There is a problem here. Exercise: find it!
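
The overall message flow of slides 29-37 can be illustrated with a small round-based simulation; this is my own sketch, not the course code. The alive table plays the role of the failure detector, and the synchronous-model assumption is that every live higher-id process answers an election message with a bully message within the time bound.

    def bully_election(alive, starter):
        """Simulate a bully election started by `starter`; return each live process's leader."""
        ids = sorted(alive)
        leader = {i: None for i in ids if alive[i]}
        active = {starter}                        # processes currently holding an election
        while True:
            next_active = set()
            for i in sorted(active):
                higher = [j for j in ids if j > i and alive[j]]
                if not higher:                    # nobody bigger answered: i wins
                    for j in leader:
                        leader[j] = i             # leader(i) is sent to every live process
                    return leader
                next_active.update(higher)        # each bullying process holds its own election
            active = next_active

    # Example: ids 1..6, the old leader 6 has crashed, process 2 notices and starts
    alive = {1: True, 2: True, 3: True, 4: True, 5: True, 6: False}
    print(bully_election(alive, starter=2))       # {1: 5, 2: 5, 3: 5, 4: 5, 5: 5}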

  38. Bully Algorithm – Analysis • Bandwidth utilization • Message Complexity • Best Case: • When the process with the next highest id (after the leader) notices the leader crash • It sends n – 1 leader messages, O(n) • Worst Case: • When the process with the lowest id notices the leader crash • It sends (n-1)[E] + (n-2)[B] + (n-2)[E] + (n-3)[B] + … + 2[E] + 1[B] + (n-1)[L] messages, O(n²)

  39. Critical Sections

  40. Critical Sections • A critical section CS is identified by a name c • Process P: • Enter(c) • Exit(c) • Remainder(c) • Critical Section correctness • Mutual Exclusion: safety • Deadlock-Freedom: progress • Starvation-Freedom: fairness

  41. Critical Sections – Leader-based Algorithm • One leader process • Utilize leader election • Message types: • request(P,c) = P is requesting entry to CS c • release(P,c) = P is releasing CS c • acquire(P,c) = leader is telling P that it can enter c

  42. Leader-based Algorithm – Example • Process 1 asks the leader for permission to enter a critical section. Permission is granted • Process 2 then asks permission to enter the same critical section. The leader does not reply. • When process 1 exits the critical section, it tells the leader, which then replies to 2

  43. Leader-based Algorithm – Non-Leader Code • Enter(c) by process P • Send request(P,c) to the leader • Wait for acquire(P,c) message from the leader • Exit(c) by process P • Send release(P,c) to the leader

  44. Leader-based Algorithm – Leader Code (1) • Boolean mutex[M] = false /* M critical sections */ • /* mutex[c] is true means that some process • is in CS c */ • Process_Queue CSQ[M] • /* CSQ[c] is a FIFO Queue of processes waiting to enter CS c */

  45. Leader-based Algorithm – Leader Code (2) • Wait for a message (from process P) • Case message is request(P,c) /* P wants to enter CS c */ • if mutex[c] then CSQ[c].add(P) /* CS c is busy, P must wait */ • else /* P can enter CS c */ • mutex[c] = true /* CS c is taken by P now */ • send acquire(P,c) to P /* Inform P that it can enter CS c */

  46. Leader-based Algorithm – Leader Code (3) • Case message is release(P,c) /* P exited CS c */ • if CSQ[c].empty() then /* No processes are waiting for CS c */ • mutex[c] = false /* CS c is available now */ • else /* At least one process is waiting for CS c */ • Q = CSQ[c].remove() /* Take out process Q, the one at the head of CSQ[c] */ • send acquire(Q,c) to Q /* Inform Q it can enter CS c */
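
The leader code of slides 44-46 fits in a few lines. Below is a minimal in-memory sketch in Python, assuming a send_acquire(p, c) callback that would be a network send in a real deployment.

    from collections import deque

    class CSLeader:
        def __init__(self, num_cs, send_acquire):
            self.mutex = [False] * num_cs                  # mutex[c]: is CS c occupied?
            self.csq = [deque() for _ in range(num_cs)]    # CSQ[c]: FIFO queue of waiters
            self.send_acquire = send_acquire

        def on_request(self, p, c):          # request(P, c): P wants to enter CS c
            if self.mutex[c]:
                self.csq[c].append(p)        # CS c is busy: P waits in FIFO order
            else:
                self.mutex[c] = True         # CS c is free: grant immediately
                self.send_acquire(p, c)

        def on_release(self, p, c):          # release(P, c): P has left CS c
            if self.csq[c]:
                q = self.csq[c].popleft()    # hand the CS to the longest waiter
                self.send_acquire(q, c)
            else:
                self.mutex[c] = False        # nobody is waiting: mark CS c free

    # Example: two processes compete for CS 0 (three messages per entry/exit overall)
    leader = CSLeader(num_cs=1, send_acquire=lambda p, c: print(f"acquire({p},{c})"))
    leader.on_request(1, 0)   # granted at once
    leader.on_request(2, 0)   # queued behind process 1
    leader.on_release(1, 0)   # the grant now passes to process 2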

  47. Leader-based Algorithm – Analysis • What if the leader crashes? • Modify algorithm (exercise) • Correctness • Mutual Exclusion? • Progress? • Fairness? • Message Complexity • 3 messages per entry-exit

  48. Critical Sections – Timestamps-based Algorithm • No leader process • Requires a total order on messages • Utilize Lamport’s timestamps • Message types: • request(P,c,ts) = P is requesting entry to CS c at ts • acquire(P,Q,c) = P is telling Q that it can acquire c

  49. Timestamps-based Algorithm – Example • Two processes want to enter the same critical section at the same moment. • Process 0 has the lowest timestamp, so it wins. • When process 0 is done, it sends an OK also, so 2 can now enter the critical region.

  50. Timestamps-based Algorithm (1) • Process_Queue CSQ[M] /* M critical sections */ • /* FIFO Queues to wait for entry */ • Enter(c) by process P • Send request(P,c,ts) to all processes • /* request entry from all other processes */ • Wait for acquire(Q,P,c) message from all other processes Q • /* when all processes permit, enter CS c */
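
The transcript ends here, before the slides that give the receive and exit rules, so the sketch below fills them in with the standard Ricart-Agrawala behavior that slide 49's example describes (the lowest (timestamp, id) pair wins and deferred OKs are sent on exit). It is an illustrative Python sketch under that assumption, simplified to a single critical section, with send(q, msg) as an assumed transport callback.

    class TimestampMutex:
        def __init__(self, pid, peers, send):
            self.pid, self.peers, self.send = pid, peers, send
            self.clock = 0                 # Lamport clock
            self.my_request = None         # (ts, pid) of our outstanding request, if any
            self.replies_needed = set()    # peers whose acquire we are still waiting for
            self.deferred = []             # peers we answer only after exiting

        def enter(self):                   # Enter(c): request entry from everybody
            self.clock += 1
            self.my_request = (self.clock, self.pid)
            self.replies_needed = set(self.peers)
            for q in self.peers:
                self.send(q, ("request", self.pid, self.my_request))

        def on_request(self, sender, their_request):
            self.clock = max(self.clock, their_request[0]) + 1
            if self.my_request is not None and self.my_request < their_request:
                self.deferred.append(sender)               # our request is older: they wait
            else:
                self.send(sender, ("acquire", self.pid))   # let the sender go ahead

        def on_acquire(self, sender):
            self.replies_needed.discard(sender)
            return not self.replies_needed                 # True once every peer said yes

        def exit(self):                    # Exit(c): release everyone we made wait
            self.my_request = None
            for sender in self.deferred:
                self.send(sender, ("acquire", self.pid))
            self.deferred = []

A process may enter the critical section once on_acquire has returned True, i.e. all peers have answered; sending the deferred acquires on exit corresponds to the "OK also" step in slide 49's example.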
