
CS60002 Distributed Systems


Presentation Transcript


  1. CS60002 Distributed Systems

  2. Text Book: • “Advanced Concepts in Operating Systems” by Mukesh Singhal and Niranjan G. Shivaratri will cover about half the course; photocopies of papers, notes, etc. will cover the rest.

  3. What is a distributed system? A very broad definition: a set of autonomous processes communicating among themselves to perform a task. Autonomous: able to act independently. Communication: shared memory or message passing. “Concurrent system” is probably a better term.

  4. A more restricted definition: • A network of autonomous computers that communicate by message passing to perform some task A practical “distributed system” will probably have both • Computers that communicate by messages • Processes/threads on a computer that communicate by messages or shared memory

  5. Advantages • Resource Sharing • Higher Performance • Fault Tolerance • Scalability

  6. Why is it hard to design them? The usual problem of concurrent systems: • Arbitrary interleaving of actions makes the system hard to verify Plus • No globally shared memory (therefore hard to collect global state) • No global clock • Unpredictable communication delays

  7. Models for Distributed Algorithms • Topology : completely connected, ring, tree etc. • Communication : shared memory/message passing (reliable? Delay? FIFO/Causal? Broadcast/multicast?) • Synchronous/asynchronous • Failure models (fail stop, crash, omission, Byzantine…) An algorithm needs to specify the model on which it is supposed to work

  8. Complexity Measures • Message complexity : no. of messages • Communication complexity/Bit complexity : no. of bits • Time complexity : For synchronous systems, no. of rounds. For asynchronous systems, several different definitions exist.

  9. Some Fundamental Problems • Ordering events in the absence of a global clock • Capturing the global state • Mutual exclusion • Leader election • Clock synchronization • Termination detection • Constructing spanning trees • Agreement protocols

  10. Ordering of Events and Logical Clocks

  11. Ordering of Events Lamport’s Happened Before relationship: For two events a and b, a → b if • a and b are events in the same process and a occurred before b • a is a send event of a message m and b is the corresponding receive event at the destination process • a → c and c → b for some event c

  12. a → b implies a is a potential cause of b Causal ordering : potential dependencies “Happened Before” relationship causally orders events • If a → b, then a causally affects b • If a ↛ b and b ↛ a, then a and b are concurrent ( a || b)

  13. Logical Clock Each process i keeps a clock Ci. • Each event a in i is timestamped C(a), the value of Ci when a occurred • Ci is incremented by 1 for each event in i • In addition, if a is a send of message m from process i to j, then on receive of m, Cj = max(Cj, C(a)+1)
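The rules on this slide can be sketched as a small Python class (an illustrative sketch, not from the slides; `receive` below also applies the per-event increment to the receive event itself):

```python
class LamportClock:
    """Scalar logical clock for one process (slide 13's rules)."""
    def __init__(self):
        self.c = 0

    def local_event(self):
        self.c += 1                    # Ci is incremented by 1 for each event
        return self.c

    def send(self):
        return self.local_event()      # the message carries the timestamp C(a)

    def receive(self, ts):
        # On receive of m with timestamp C(a): Cj = max(Cj, C(a) + 1).
        # The receive is itself an event, hence the +1 on the left as well.
        self.c = max(self.c + 1, ts + 1)
        return self.c
```

With these rules, if a → b then C(a) < C(b), as the next slide states.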

  14. Points to note: • if a → b, then C(a) < C(b) • → is an irreflexive partial order • Total ordering possible by arbitrarily ordering concurrent events by process numbers

  15. Limitation of Lamport’s Clock a → b implies C(a) < C(b) BUT C(a) < C(b) doesn’t imply a → b !! So not a true clock !!

  16. Solution: Vector Clocks Ci is a vector of size n (no. of processes) C(a) is similarly a vector of size n Update rules: • Ci[i]++ for every event at process i • if a is send of message m from i to j with vector timestamp tm, on receive of m: Cj[k] = max(Cj[k], tm[k]) for all k

  17. For events a and b with vector timestamps ta and tb, • ta = tb iff for all i, ta[i] = tb[i] • ta ≠ tb iff for some i, ta[i] ≠ tb[i] • ta ≤ tb iff for all i, ta[i] ≤ tb[i] • ta < tb iff (ta ≤ tb and ta ≠ tb) • ta || tb iff (¬(ta < tb) and ¬(tb < ta))
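The update and comparison rules of slides 16–17 can be sketched together in Python (illustrative; the process count n and zero-based process ids are assumptions):

```python
class VectorClock:
    """Vector clock for process i out of n (slides 16-17)."""
    def __init__(self, i, n):
        self.i, self.v = i, [0] * n

    def local_event(self):
        self.v[self.i] += 1            # Ci[i]++ for every event at process i
        return list(self.v)

    def send(self):
        return self.local_event()      # the message carries the vector timestamp

    def receive(self, tm):
        self.v[self.i] += 1            # the receive is itself an event
        self.v = [max(a, b) for a, b in zip(self.v, tm)]
        return list(self.v)


def vec_leq(ta, tb):
    return all(a <= b for a, b in zip(ta, tb))

def happened_before(ta, tb):           # a -> b iff ta < tb (slide 18)
    return vec_leq(ta, tb) and ta != tb

def concurrent(ta, tb):                # neither ta < tb nor tb < ta
    return not happened_before(ta, tb) and not happened_before(tb, ta)
```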

  18. a → b iff ta < tb • Events a and b are causally related iff ta < tb or tb < ta, else they are concurrent • Note that this is still not a total order

  19. Causal ordering of messages: application of vector clocks • If send(m1)→ send(m2), then every recipient of both message m1 and m2 must “deliver” m1 before m2. “deliver” – when the message is actually given to the application for processing

  20. Birman-Schiper-Stephenson Protocol • To broadcast m from process i, increment Ci[i], and timestamp m with the vector VTm = Ci • When j ≠ i receives m, j delays delivery of m until • Cj[i] = VTm[i] − 1 and • Cj[k] ≥ VTm[k] for all k ≠ i • Delayed messages are queued in j sorted by vector time. Concurrent messages are sorted by receive time. • When m is delivered at j, Cj is updated according to the vector clock rule.
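The delay condition above reduces to a single predicate (a sketch; `Cj` and `VTm` are plain Python lists here):

```python
def deliverable(Cj, VTm, i):
    """Can process j deliver a broadcast m from process i, given j's vector
    clock Cj and m's vector timestamp VTm?  True iff m is the next broadcast
    expected from i (Cj[i] = VTm[i] - 1) and j has already delivered every
    message that causally precedes m (Cj[k] >= VTm[k] for all k != i)."""
    return Cj[i] == VTm[i] - 1 and all(
        Cj[k] >= VTm[k] for k in range(len(Cj)) if k != i)
```

A queued message is re-tested with this predicate whenever j's clock advances.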

  21. Problem of Vector Clock • message size increases since each message needs to be tagged with the vector • size can be reduced in some cases by only sending values that have changed

  22. Capturing Global State

  23. Global State Collection Applications: • Checking “stable” properties, checkpoint & recovery Issues: • Need to capture both node and channel states • system cannot be stopped • no global clock

  24. Some notations: • LSi : local state of process i • send(mij) : send event of message mij from process i to process j • rec(mij) : similar, receive instead of send • time(x) : time at which state x was recorded • time (send(m)) : time at which send(m) occurred

  25. send(mij) ∈ LSi iff time(send(mij)) < time(LSi) rec(mij) ∈ LSj iff time(rec(mij)) < time(LSj) transit(LSi, LSj) = { mij | send(mij) ∈ LSi and rec(mij) ∉ LSj } inconsistent(LSi, LSj) = { mij | send(mij) ∉ LSi and rec(mij) ∈ LSj }

  26. Global state: collection of local states GS = {LS1, LS2,…, LSn} GS is consistent iff for all i, j, 1 ≤ i, j ≤ n, inconsistent(LSi, LSj) = ∅ GS is transitless iff for all i, j, 1 ≤ i, j ≤ n, transit(LSi, LSj) = ∅ GS is strongly consistent if it is consistent and transitless.
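With each local state encoded as the sets of message ids recorded as sent/received per channel, these definitions translate directly (an illustrative encoding, not from the slides):

```python
def transit(sent_ij, recv_ji):
    # messages recorded as sent at i but not as received at j: channel state
    return sent_ij - recv_ji

def inconsistent(sent_ij, recv_ji):
    # messages recorded as received at j without a matching recorded send
    return recv_ji - sent_ij

def is_consistent(sent, received):
    # sent[i][j] / received[j][i]: sets of message ids on channel i -> j
    n = len(sent)
    return all(not inconsistent(sent[i][j], received[j][i])
               for i in range(n) for j in range(n) if i != j)

def is_transitless(sent, received):
    n = len(sent)
    return all(not transit(sent[i][j], received[j][i])
               for i in range(n) for j in range(n) if i != j)
```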

  27. Chandy-Lamport’s Algorithm • Uses special marker messages. • One process acts as initiator, starts the state collection by following the marker sending rule below. • Marker sending rule for process P: • P records its state; then for each outgoing channel C from P on which a marker has not been sent already, P sends a marker along C before any further message is sent on C

  28. When Q receives a marker along a channel C: • If Q has not recorded its state then Q records the state of C as empty; Q then follows the marker sending rule • If Q has already recorded its state, it records the state of C as the sequence of messages received along C after Q’s state was recorded and before Q received the marker along C
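Both marker rules can be sketched as a per-process state machine (illustrative; the `send` callback stands in for the real FIFO network, and the channel ids are assumptions):

```python
class Snapshot:
    """Chandy-Lamport marker rules at one process (slides 27-28)."""
    MARKER = object()

    def __init__(self, out_channels, in_channels):
        self.out = out_channels                   # ids of outgoing channels
        self.chan = {c: [] for c in in_channels}  # recorded channel states
        self.local = None                         # recorded local state
        self.open = set()                         # in-channels still recording

    def record(self, state, send):
        """Marker sending rule: record own state, then send a marker on every
        outgoing channel before any further message."""
        self.local = state
        self.open = set(self.chan)
        for c in self.out:
            send(c, Snapshot.MARKER)

    def on_receive(self, c, msg, state, send):
        if msg is Snapshot.MARKER:
            if self.local is None:       # first marker: record state now;
                self.record(state, send) # channel c is then recorded as empty
            self.open.discard(c)         # recording of c stops at the marker
        elif self.local is not None and c in self.open:
            self.chan[c].append(msg)     # in flight when the snapshot began

    def done(self):
        return self.local is not None and not self.open
```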

  29. Points to Note: • Markers sent on a channel distinguish messages sent on the channel before the sender recorded its state from the messages sent after the sender recorded its state • The state collected may not be any state that actually happened in reality, rather a state that “could have” happened • Requires FIFO channels • Network should be strongly connected (works obviously for connected, undirected also) • Message complexity O(|E|), where |E| = no. of links

  30. Lai and Young’s Algorithm • Similar to Chandy-Lamport’s, but does not require FIFO • Boolean value X at each node, False indicates state is not recorded yet, True indicates recorded • Value of X piggybacked with every application message • Value of X distinguishes pre-snapshot and post-snapshot messages, similar to the Marker

  31. Mutual Exclusion

  32. Mutual Exclusion • very well-understood in shared memory systems • Requirements: • at most one process in critical section (safety) • if more than one requesting process, someone enters (liveness) • a requesting process enters within a finite time (no starvation) • requests are granted in order (fairness)

  33. Classification of Distributed Mutual Exclusion Algorithms • Non-token based/Permission based • Permission from all processes: e.g. Lamport, Ricart-Agrawala, Roucairol-Carvalho etc. • Permission from a subset: e.g. Maekawa • Token based • e.g. Suzuki-Kasami

  34. Some Complexity Measures • No. of messages/critical section entry • Synchronization delay • Response time • Throughput

  35. Lamport’s Algorithm • Every node i has a request queue qi, keeps requests sorted by logical timestamps (total ordering enforced by including process id in the timestamps) To request critical section: • send timestamped REQUEST (tsi, i) to all other nodes • put (tsi, i) in its own queue On receiving a request (tsi, i): • send timestamped REPLY to the requesting node i • put request (tsi, i) in the queue

  36. To enter critical section: • i enters critical section if (tsi, i) is at the top of its own queue, and i has received a message (any message) with timestamp larger than (tsi, i) from ALL other nodes. To release critical section: • i removes its request from its own queue and sends a timestamped RELEASE message to all other nodes • On receiving a RELEASE message from i, i’s request is removed from the local request queue
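The queue handling of slides 35–36 at a single node might look like the sketch below (the transport of REQUEST/REPLY/RELEASE messages is assumed; `last_seen` tracks the latest timestamp received from each node, since *any* message counts for the entry condition):

```python
import heapq

class LamportMutex:
    """Request-queue logic at node i of n (slides 35-36)."""
    def __init__(self, i, n):
        self.i, self.n = i, n
        self.queue = []                  # min-heap of (timestamp, node) pairs
        self.last_seen = [0] * n         # latest timestamp seen from each node

    def request(self, ts):
        # ...and send timestamped REQUEST (ts, i) to all other nodes
        heapq.heappush(self.queue, (ts, self.i))

    def on_message(self, ts, j, kind):
        self.last_seen[j] = max(self.last_seen[j], ts)   # any message counts
        if kind == "REQUEST":
            heapq.heappush(self.queue, (ts, j))          # ...and send REPLY
        elif kind == "RELEASE":
            self.queue = [e for e in self.queue if e[1] != j]
            heapq.heapify(self.queue)

    def can_enter(self, my_ts):
        # own request at the head of the queue, plus a later-stamped message
        # received from every other node
        return (bool(self.queue) and self.queue[0] == (my_ts, self.i)
                and all(self.last_seen[j] > my_ts
                        for j in range(self.n) if j != self.i))
```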

  37. Some points to note: • Purpose of REPLY messages from node i to j is to ensure that j knows of all requests of i prior to sending the REPLY (and therefore of any request of i with timestamp lower than j’s request) • Requires FIFO channels. • 3(n − 1) messages per critical section invocation • Synchronization delay = max. message transmission time • requests are granted in order of increasing timestamps

  38. Ricart-Agrawala Algorithm • Improvement over Lamport’s • Main Idea: • node j need not send a REPLY to node i if j has a request with timestamp lower than the request of i (since i cannot enter before j anyway in this case) • Does not require FIFO • 2(n − 1) messages per critical section invocation • Synchronization delay = max. message transmission time • requests granted in order of increasing timestamps

  39. To request critical section: • send timestamped REQUEST message (tsi, i) On receiving request (tsi, i) at j: • send REPLY to i if j is neither requesting nor executing critical section or if j is requesting and i’s request timestamp is smaller than j’s request timestamp. Otherwise, defer the request. To enter critical section: • i enters critical section on receiving REPLY from all nodes To release critical section: • send REPLY to all deferred requests
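The reply decision on this slide reduces to one comparison of (timestamp, id) pairs (a sketch; `j_request is None` encodes “neither requesting nor executing”):

```python
def should_reply(j_in_cs, j_request, i_request):
    """Ricart-Agrawala reply rule at node j (slide 39).  Requests are
    (timestamp, node_id) pairs, totally ordered by tuple comparison."""
    if j_in_cs:
        return False                  # defer until j leaves the critical section
    if j_request is None:
        return True                   # j is not competing: reply immediately
    return i_request < j_request      # lower (timestamp, id) pair wins ties
```

Deferred requests are answered all at once when j releases the critical section.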

  40. Roucairol-Carvalho Algorithm • Improvement over Ricart-Agrawala • Main idea • once i has received a REPLY from j, it does not need to send a REQUEST to j again unless it sends a REPLY to j (in response to a REQUEST from j) • no. of messages required varies between 0 and 2(n − 1) depending on the request pattern • worst case message complexity still the same

  41. Maekawa’s Algorithm • Permission obtained from only a subset of other processes, called the Request Set (or Quorum) • Separate Request Set Ri for each process i • Requirements: • for all i, j: Ri ∩ Rj ≠ ∅ • for all i: i ∈ Ri • for all i: |Ri| = K, for some K • any node i is contained in exactly D Request Sets, for some D • K = D = sqrt(N) for Maekawa’s algorithm

  42. A simple version To request critical section: • i sends REQUEST message to all processes in Ri On receiving a REQUEST message: • send a REPLY message if no REPLY message has been sent since the last RELEASE message was received. Update status to indicate that a REPLY has been sent. Otherwise, queue up the REQUEST To enter critical section: • i enters critical section after receiving REPLY from all nodes in Ri

  43. To release critical section: • send RELEASE message to all nodes in Ri • On receiving a RELEASE message, send REPLY to next node in queue and delete the node from the queue. If queue is empty, update status to indicate no REPLY message has been sent.

  44. Message Complexity: 3*sqrt(N) • Synchronization delay = 2 *(max message transmission time) • Major problem: DEADLOCK possible • Need three more types of messages (FAILED, INQUIRE, YIELD) to handle deadlock. Message complexity can be 5*sqrt(N) • Building the request sets?
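One classic answer to the last question is a grid construction: arrange the N nodes in a √N × √N grid and take Ri to be node i's row plus its column. Any two such sets intersect, though |Ri| = 2√N − 1 rather than Maekawa's √N (a sketch assuming N is a perfect square; not the construction from the slides):

```python
import math

def grid_quorums(n):
    """Build request sets by placing the n nodes in a sqrt(n) x sqrt(n) grid;
    Ri = node i's row plus its column.  Any two sets share a node, every node
    is in its own set, and |Ri| = 2*sqrt(n) - 1."""
    k = math.isqrt(n)
    assert k * k == n, "n must be a perfect square in this sketch"
    quorums = []
    for i in range(n):
        r, c = divmod(i, k)
        row = {r * k + x for x in range(k)}
        col = {x * k + c for x in range(k)}
        quorums.append(row | col)
    return quorums
```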

  45. Token based Algorithms • Single token circulates, enter CS when token is present • No FIFO required • Mutual exclusion obvious • Algorithms differ in how to find and get the token • Uses sequence numbers rather than timestamps to differentiate between old and current requests

  46. Suzuki Kasami Algorithm • Broadcast a request for the token • Process with the token sends it to the requestor if it does not need it Issues: • Current vs. outdated requests • determining sites with pending requests • deciding which site to give the token to

  47. The token: • Queue (FIFO) Q of requesting processes • LN[1..n] : LN[j] is the sequence number of the request that j executed most recently • The request message: • REQUEST(i, k): request message from node i for its kth critical section execution • Other data structures • RNi[1..n] for each node i, where RNi[j] is the largest sequence number received so far by i in a REQUEST message from j.

  48. To request critical section: • If i does not have the token, increment RNi[i] and send REQUEST(i, RNi[i]) to all nodes • if i has the token already, enter critical section if the token is idle (no pending requests), else follow rule to release critical section On receiving REQUEST(i, sn) at j: • set RNj[i] = max(RNj[i], sn) • if j has the token and the token is idle, send it to i if RNj[i] = LN[i] + 1. If the token is not idle, follow rule to release critical section

  49. To enter critical section: • enter CS if token is present To release critical section: • set LN[i] = RNi[i] • For every node j which is not in Q (in token), add node j to Q if RNi[ j ] = LN[ j ] + 1 • If Q is non empty after the above, delete first node from Q and send the token to that node
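The bookkeeping of slides 47–49 at one node can be sketched as follows (illustrative; message transport is assumed, and the token holder is taken to be outside its critical section when a request arrives):

```python
class SuzukiKasami:
    """Request/token bookkeeping at node i of n (slides 47-49)."""
    def __init__(self, i, n, has_token=False):
        self.i, self.n = i, n
        self.RN = [0] * n                       # RNi[j]: latest sn seen from j
        self.token = {"Q": [], "LN": [0] * n} if has_token else None

    def request(self):
        """Return the REQUEST to broadcast, or None if i holds the token."""
        if self.token is None:
            self.RN[self.i] += 1
            return ("REQUEST", self.i, self.RN[self.i])
        return None

    def on_request(self, j, sn):
        """Handle REQUEST(j, sn); returns ("TOKEN", j, t) if handed over."""
        self.RN[j] = max(self.RN[j], sn)        # ignore outdated requests
        if self.token is not None and self.RN[j] == self.token["LN"][j] + 1:
            t, self.token = self.token, None    # token idle here: send it to j
            return ("TOKEN", j, t)
        return None

    def release(self):
        """Leave the critical section; may pass the token to the next in Q."""
        t = self.token
        t["LN"][self.i] = self.RN[self.i]       # i's request is now satisfied
        for j in range(self.n):                 # enqueue pending requesters
            if j != self.i and j not in t["Q"] and self.RN[j] == t["LN"][j] + 1:
                t["Q"].append(j)
        if t["Q"]:
            nxt = t["Q"].pop(0)
            self.token = None
            return ("TOKEN", nxt, t)
        return None                             # no pending requests: keep token
```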

  50. Points to note: • No. of messages: 0 if node holds the token already, n otherwise • Synchronization delay: 0 (node has the token) or max. message delay (token is elsewhere) • No starvation
