Chapter 7

Chapter 7 Synchronization

Topics • Physical clock synchronization • Logical clock synchronization • Causality relation • Lamport’s logical clock • Vector logical clock • Multicast • ISIS vector clock • Snapshot

New Issues in DS • Global time • Event order • e1 at 1:00pm on machine m1, e2 at 1:01pm at machine m2. Which event happens earlier? • Global state • snapshot • Mutual exclusion & Synchronization

Time and Clock • Two roles of time - Defines temporal order among events - Duration (measured by timer) • UTC (Coordinated Universal Time) is based on Cesium-133 atom oscillation; the atomic clocks are located at over 200 labs in the world • With satellites, 0.5 ms accuracy is possible (at 100 MIPS, 0.5 ms corresponds to 50,000 instructions).

Clock Skew • Skew: Clock reading (from a single clock) is location-dependent, e.g., distance from satellite or clock source on a circuit board • Drift: Multiple clocks tick at different rates, so their readings diverge over time. • t = the real time • Cp(t) = the reading of a clock p at time t (Cp(t) = t for an ideal clock) • dCp(t)/dt = ticking rate (dCp(t)/dt = 1 for an ideal clock)

Consequence: An Example • On a single machine with one clock, timestamps respect the order of events. • But it is a different story in DS: when each machine has its own clock, an event that occurred after another event may nevertheless be assigned an earlier time.

Physical Clock Synchronization • Cristian’s Algorithm • The Berkeley Algorithm • Network Time Protocol (NTP) • OSF DCE

Cristian’s Algorithm: Architecture • WWV-node receiving UTC-signals, serving as the central UTC-time server (CUTCS) for the DS • WWV is a short wave radio station in Colorado. • Periodically every node in the DS sends a time request to the central UTC time server CUTCS. • The CUTCS responds with its current time tUTC

Adjust Time • When the client gets the reply, it simply sets its clock to tUTC • Problem: time may run backward if the client's clock was ahead • Remedy: introduce the change gradually • Also account for the message propagation time.

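Adjust Time (continued) • [Figure: client sends a request at T0; the server spends time I processing it and replies with its time t; the reply reaches the client at T1; each direction takes Tp] • Estimate Tp (propagation delay) from T1 – T0 = 2 x Tp + I, where I = processing time, so Tp = (T1 – T0 – I) / 2. • Current time = t (server’s time in message) + Tp. • Assumption?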
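The estimate above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function name `cristian_estimate` and the sample readings are hypothetical, and it assumes the request and the reply take the same time Tp.

```python
def cristian_estimate(t0, t1, server_time, processing_time=0):
    """Estimate the current time from one request/reply exchange.

    t0, t1: client clock readings when the request left and the reply arrived.
    server_time: the value t carried in the reply.
    processing_time: the server-side interval I (0 if unknown).
    All values share one unit (milliseconds in the example below).
    """
    # T1 - T0 = 2 * Tp + I  =>  Tp = (T1 - T0 - I) / 2
    tp = (t1 - t0 - processing_time) / 2
    return server_time + tp, tp


# Hypothetical readings in milliseconds.
estimated_now, tp = cristian_estimate(t0=10_000, t1=10_030,
                                      server_time=12_500, processing_time=10)
print(estimated_now, tp)  # 12510.0 10.0
```

If the two directions are not equally fast, the estimate is off by up to half the difference between them.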

The Berkeley Algorithm • In Cristian’s algorithm the central time server was passive • Now it’s active, i.e. it periodically polls all other nodes for their current local times ci(t). • Based on the answers it calculates a mean and tells all other machines to advance or slow down their clocks accordingly.

Relative Clock Synchronization • Time server periodically sends its time to clients and asks for theirs. • Clients respond with how far ahead of or behind the server they are • Time server uses the estimated local times to compute the arithmetic mean • Each node is sent its deviation from this mean, so that it can slow down or speed up its clock.

The Berkeley Algorithm • The time daemon asks all the other machines for their clock values • The machines answer • The time daemon tells everyone how to adjust their clock
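The three steps can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `berkeley_adjustments`, the node names, and the offsets (in minutes) are hypothetical, and message delays are ignored.

```python
def berkeley_adjustments(reported_offsets):
    """Compute per-node clock corrections for one Berkeley round.

    reported_offsets: mapping node name -> (node clock - daemon clock).
    Returns a mapping node name -> amount to add to that node's clock.
    The daemon takes part in the average with offset 0.
    """
    offsets = dict(reported_offsets)
    offsets["daemon"] = 0
    mean_offset = sum(offsets.values()) / len(offsets)
    # Every clock is moved to the mean, not to the daemon's own time.
    return {node: mean_offset - off for node, off in offsets.items()}


# Daemon reads 3:00; node A reads 3:25 (+25 min), node B reads 2:50 (-10 min).
adjustments = berkeley_adjustments({"A": 25, "B": -10})
print(adjustments)  # {'A': -20.0, 'B': 15.0, 'daemon': 5.0}
```

All clocks end up at 3:05. A real daemon would apply a negative correction by slowing the clock down rather than setting it back.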

Summary • Cristian’s method and the Berkeley algorithm are intended for intranets • Both may be improved with fault tolerance methods • Instead of one UTC server in Cristian’s algorithm, use n time servers and always take the first answer from whichever time server replies first • Instead of taking the arithmetic mean over all clients in the Berkeley algorithm, take a fault-tolerant mean, i.e. skip deviations that exceed a certain threshold
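One way to sketch the fault-tolerant mean is shown below. The slide does not say what deviations are measured against, so using the median as the reference point is an assumption made here; the function name and sample values are hypothetical as well.

```python
def fault_tolerant_mean(offsets, threshold):
    """Average the offsets, skipping values too far from the median.

    offsets: non-empty list of clock offsets, all in the same unit.
    threshold: largest accepted distance from the median (an assumed rule).
    """
    median = sorted(offsets)[len(offsets) // 2]
    kept = [o for o in offsets if abs(o - median) <= threshold]
    return sum(kept) / len(kept)


# One node reports a wildly wrong clock (+600) and is ignored.
print(fault_tolerant_mean([0, 25, -10, 600], threshold=60))  # 5.0
```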

Network Time Protocol (NTP) • Goal • absolute (UTC) time service in large nets (e.g. Internet) • high availability (via fault tolerance) • protection against fabrication (via authentication) • Architecture • time servers build up a hierarchical synchronization subnet • all primary servers have a UTC receiver • secondary servers are synchronized by their parent primary server • all other stations are leaves on level 3, synchronized by level 2 time servers • accuracy of clocks decreases with increasing level number • the net is able to reconfigure

NTP • Reliability from redundant paths • Scalability • Authenticated time sources

Synchronization of Servers (NTP) • Synchronization subnet can reconfigure if failures occur, e.g. • Primary having lost its UTC source can become a secondary • Secondary having lost its primary can use another one • Modes of synchronization: • Multicast mode (for quick LANs, low accuracy) • A server within a LAN periodically multicasts time to other leaves in the LAN, which set their clocks assuming some delay • Procedure-call mode (Cristian’s algorithm with medium accuracy) • A server responds to requests with its actual timestamp • Symmetric mode (high accuracy) • Pairs of servers exchange messages containing times
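The slide does not spell out how the exchanged times are used. The sketch below applies the standard NTP offset and delay formulas to the four timestamps of one exchange; the function name and sample values are hypothetical, and the offset is exact only when both directions take equally long.

```python
def ntp_offset_and_delay(t1, t2, t3, t4):
    """Estimate clock offset and round-trip delay from one exchange.

    t1: request sent (local clock)     t2: request received (peer clock)
    t3: reply sent (peer clock)        t4: reply received (local clock)
    """
    offset = ((t2 - t1) + (t3 - t4)) / 2   # how far the peer is ahead of us
    delay = (t4 - t1) - (t3 - t2)          # time spent on the network
    return offset, delay


# Peer is 50 units ahead, each direction takes 10 units, peer works for 5.
print(ntp_offset_and_delay(t1=100, t2=160, t3=165, t4=125))  # (50.0, 20)
```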

OSF DCE • Time is an interval [t-e, t+e]. • Two intervals overlap → cannot say which time is earlier (in case of overlap, Unix make should recompile). • [Figure: time intervals reported by time servers TS1 to TS4; labels: Reject, New time interval]
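The overlap rule can be sketched as a three-way comparison. The function name and the returned labels are hypothetical; only the rule itself (overlap means the order is unknown) comes from the slide.

```python
def compare_intervals(t1, e1, t2, e2):
    """Compare two interval timestamps [t1-e1, t1+e1] and [t2-e2, t2+e2].

    Returns "earlier" or "later" for the first timestamp relative to the
    second, or "unknown" when the intervals overlap.
    """
    if t1 + e1 < t2 - e2:
        return "earlier"
    if t2 + e2 < t1 - e1:
        return "later"
    return "unknown"


print(compare_intervals(100, 5, 120, 5))  # earlier
print(compare_intervals(100, 5, 108, 5))  # unknown
```

A build tool following the slide's advice would treat "unknown" like "source is newer" and recompile.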

Logical Clock Synchronization • A powerful building block in DS • Duplicate detection • Cache consistency • Leases • Commitment • …

Leslie Lamport • “Time, Clocks, and the Ordering of Events in a Distributed System” • Best known for his work on 1) Temporal logic 2) LaTeX • Microsoft

The Paper • Handles problems of clock drift in a DS • Identifies main function of computer clocks, i.e. ordering of events • Indicates which conditions clocks must satisfy to fulfill their role • Introduces logical clocks • Benefits of logical clocks? • Needed for determining causality

Logical Time • In many cases it’s sufficient just to order the related relevant events, i.e. we want to be able to position these events relatively, but not absolutely. • Interesting: Relative position of an event on the time axis → no need for any scaling on this time axis

Ordering Events • Event ordering is linked with the concept of causality: • Saying that event a happened before event b is the same as saying that event a could have affected the outcome of event b • If events a and b happen on processes that do not exchange any data, their exact ordering is not important.

Causality Relationship • [Figure: time lines of processes P (events p1 to p4), Q (events q1 to q5), and R (events r1 to r4)] • p1 → p2, p1 → q2, p1 → r3 (transitive) • An event changes the state of a process. • The state remains the same until the next event occurs.

Formal Definition • a → b is defined by: • If a and b are in the same process, and a occurs earlier than b, then a → b. • If a is the sending event and b is the receiving event of the same message, then a → b. • If a → b and b → c, then a → c. [Transitive] • If a → b, then a causally precedes (or happened before) b; a and b are causally related • a and b are concurrent if neither a → b nor b → a.

Message-Related Events • Sending event • Receiving event • Message arrival (at kernel) and delivery (to user process): Kernel can control timing of delivery after arrival.

Example • e11 → e12 → e21 → e22 → e32; furthermore e31 → e32, whereas e31 is not related by happened-before to e11, e12, e21, or e22. • e31 is concurrent with e11, e12, e21, and e22.

Lamport Clock • Suppose E = {events}; each e in E gets a Lamport time stamp L(e), as follows: • 1. e is a pure local event or a sending event: if e has no local predecessor, then L(e) := 1; otherwise there is a local predecessor e’, and L(e) := L(e’) + 1 • 2. e is a receiving event with a corresponding sending event s: if e has no local predecessor, then L(e) := L(s) + 1; otherwise there is a local predecessor e’, and L(e) := max{L(s), L(e’)} + 1
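The two rules can be sketched as a small class. The class and method names are hypothetical; starting the counter at 0 makes "no local predecessor" a special case of the general rule, since the first event then gets L = 1 (or L(s) + 1 on a receive).

```python
class LamportClock:
    """Per-process Lamport counter following rules 1 and 2 above."""

    def __init__(self):
        self.time = 0  # 0 means "no local predecessor yet"

    def tick(self):
        """Rule 1: local event or sending event."""
        self.time += 1
        return self.time

    def receive(self, sender_time):
        """Rule 2: receiving event; sender_time is L(s) from the message."""
        self.time = max(sender_time, self.time) + 1
        return self.time


p, q = LamportClock(), LamportClock()
p.tick()                  # local event on p: L = 1
stamp = p.tick()          # p sends a message: L = 2
print(q.receive(stamp))   # 3, q had no predecessor so L = L(s) + 1
print(q.tick())           # 4
```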

Example • Note: Each local counter is incremented with each local event. In a communication we adjust the involved counters of the two communicating nodes to be consistent with the happened-before relation. • Remark: The same mechanism can be used to adjust clocks on different nodes. • The Lamport time is consistent with the happened-before relation, i.e. if x → y, then L(x) < L(y), but not vice versa.

Adjusting Clocks • [Figure, two panels: without adjusting local clocks; with adjusting local clocks]

Limitation on Lamport Clocks • From Lamport time values you cannot conclude whether two events are in the happened-before relationship, e.g. e11 and e32.

Total Ordering of Events • Lamport time only gives us a partial ordering of distributed events. • To implement a total ordering: • Each processor is assigned a unique id (integer) • Given two events e1 and e2, e1 is ordered before e2 if L(e1) < L(e2), or if L(e1) = L(e2) and Id(e1) < Id(e2), where Id(e) is the id of the processor on which e occurred
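The tie-breaking rule is the lexicographic order on the pair (L(e), Id(e)), which a sort key expresses directly. The event tuples below are hypothetical sample data.

```python
def total_order_key(event):
    """Sort key: Lamport time first, processor id as the tie-breaker."""
    name, lamport_time, processor_id = event
    return (lamport_time, processor_id)


# (name, L(e), Id(e)); "b" and "c" have the same Lamport time.
events = [("c", 2, 1), ("a", 1, 2), ("b", 2, 0)]
ordered = sorted(events, key=total_order_key)
print([name for name, _, _ in ordered])  # ['a', 'b', 'c']
```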

Holding Back Deliveries • Delay the delivery of messages that arrived “too soon” • Useful when delivering messages from kernel to processes • Hold back the delivery of M to process P until there is a guarantee that no message M’ with L(M’) < L(M) will arrive at P in the future.

Implementation • Assumption: messages from a particular source arrive in FIFO order • Each site maintains a set of message queues, one for each other site • When a message arrives, it is placed in the corresponding queue • When all queues are non-empty, compare the timestamps of the messages at the heads of the queues, and deliver the message with the oldest (smallest) timestamp.
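The queueing scheme can be sketched as below. The class name, method name, and sample messages are hypothetical; ties between equal timestamps are broken by source name, which is an added assumption borrowed from the total-ordering rule. At least one source is assumed.

```python
from collections import deque


class HoldBackDelivery:
    """One FIFO queue per source; deliver only while every queue has a head."""

    def __init__(self, sources):
        self.queues = {s: deque() for s in sources}

    def arrive(self, source, timestamp, payload):
        """Enqueue an arriving message; return the messages now deliverable."""
        self.queues[source].append((timestamp, source, payload))
        delivered = []
        # While all queues are non-empty, the smallest head is safe to deliver.
        while all(self.queues.values()):
            src = min(self.queues, key=lambda s: self.queues[s][0][:2])
            delivered.append(self.queues[src].popleft())
        return delivered


site = HoldBackDelivery(["A", "B"])
print(site.arrive("A", 3, "x"))  # [] because nothing from B yet
print(site.arrive("A", 5, "y"))  # []
print(site.arrive("B", 4, "z"))  # [(3, 'A', 'x'), (4, 'B', 'z')]
```

The message with timestamp 5 stays held back until something more arrives from B, which is the delay the next slide criticizes.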

Limitation • All message queues need to be non-empty. • Normally not true. • Requires multicast to solve the problem. • With Lamport clock, L(a) < L(b) does not mean a → b. • Some messages are delayed unnecessarily. • Remedy: vector clock.

Event Counter • Event counter at Pi: initialized at 0 and incremented for each event • [Figure: time lines of P1, P2, and P3 with the event counter value shown at each event]

Vector Time • Assumption: n tasks (processes) Pi in DS • Each Pi has its own local clock, an n-dimensional vector (initially zeroed) • Vi(a) is the timestamp of event a at process Pi • Vi[i] = number of local events at Pi • Vi[j] is Pi’s best guess of how many events have occurred at Pj

Rules • There is a DS with n distributed processes. The n-dimensional vector Vi is the vector time of process Pi if it is built according to the following rules: • (1) Initially, Vi = (0, …, 0) for all 1 <= i <= n • (2) For a local event on process Pi (sending and receiving count as local events): Vi[i]++ • (3) Pi includes the value t = Vi in every msg m • (4) When Pj receives a message m with timestamp t, it sets Vj[k] = max{t[k], Vj[k]} for 1 <= k <= n and k != j • Communication cost? • Little overhead compared to Lamport clock (an n-entry vector per message instead of a single integer)
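Rules (1) to (4) can be sketched as a small class. The class and method names are hypothetical, processes are numbered from 0 here, and a receive is treated as a local event of the receiver (rule 2 applied after rule 4), which is an interpretation of the rules rather than something they state outright.

```python
class VectorClock:
    """Vector time of one process Pi among n processes (indices from 0)."""

    def __init__(self, n, i):
        self.v = [0] * n   # rule (1): start at (0, ..., 0)
        self.i = i

    def local_event(self):
        self.v[self.i] += 1           # rule (2)
        return list(self.v)

    def send(self):
        return self.local_event()     # rule (3): timestamp attached to msg

    def receive(self, t):
        for k in range(len(self.v)):  # rule (4): component-wise max, k != own
            if k != self.i:
                self.v[k] = max(t[k], self.v[k])
        return self.local_event()     # the receive itself is a local event


p1, p2 = VectorClock(3, 0), VectorClock(3, 1)
p1.local_event()        # [1, 0, 0]
m = p1.send()           # [2, 0, 0]
p2.local_event()        # [0, 1, 0]
print(p2.receive(m))    # [2, 2, 0]
```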

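Example • [Figure: time lines of P1, P2, and P3, each starting at 000 and exchanging messages M1, M2, M3; vector timestamps shown: 100, 010, 001, 200, 220, 230, 300, 240, 242, 250, 450, 243, 260, 264, 550, 273]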

Notation • We define global V(e) = Vi(e) if event e happens in Pi • We write V(a) ≤ V(b) if • V(a)[k] ≤ V(b)[k] for all k. • Here V(a)[k] denotes the kth component of V(a). • We write V(a) < V(b) if • V(a)[k] ≤ V(b)[k] for all k, and • V(a)[j] < V(b)[j] for at least one j

Vector Time Characteristics • The following relationships between causality (the happened-before relation) and vector time hold: • A.) e → e’ iff V(e) < V(e’) • B.) e || e’ iff V(e) || V(e’), i.e. neither V(e) < V(e’) nor V(e’) < V(e) • The vector time is the best known estimation for global sequencing that is based only on local information.
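The comparisons behind A.) and B.) can be sketched as three small functions. The function names are hypothetical; timestamps are plain lists of equal length and are assumed to belong to distinct events.

```python
def vc_leq(a, b):
    """V(a) <= V(b): every component of a is at most that of b."""
    return all(x <= y for x, y in zip(a, b))


def vc_less(a, b):
    """V(a) < V(b): a <= b and at least one component strictly smaller."""
    return vc_leq(a, b) and a != b


def vc_concurrent(a, b):
    """V(a) || V(b): neither timestamp is less than the other."""
    return a != b and not vc_less(a, b) and not vc_less(b, a)


print(vc_less([2, 0, 0], [2, 2, 0]))        # True: happened before
print(vc_concurrent([1, 0, 0], [0, 1, 0]))  # True: concurrent
```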

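Proof • a → b iff V(a) < V(b) • [Figure: time lines of P1, P2, and P3 with events a and b; the components t1, t2, t3 of V(b) bound a shaded area] • Proof: For a fixed b, a → b iff a is in the shaded area iff each component of V(a) ≤ the corresponding component of V(b).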

Multicasting • A message is sent to all the members of a group • Sending video stream to a set of customers • Implementing a chat program • Sending updates to a group of replica managers

IPv4 Multicast Addresses • Class D (starts with bit sequence 1110) • 224.0.0.0 to 239.255.255.255 (about 2^28 ≈ 268 million addresses) • 224.0.0.1 is for “all systems on this subnet” • 224.2.0.0 ~ 224.2.127.253 are for multimedia conference calls

Causal Ordering of Messages • Suppose m1 and m2 are two messages being received at the same node i. A set of messages is causally ordered if for all pairs <m1, m2> the following holds: send(m1) → send(m2) ⇒ receive(m1) → receive(m2)

Causality Violation • [Figure: time lines of P1, P2, and P3 with messages M1 “Migrate foo to P2”, “Do you have foo?”, M2 “Given to P2”, M3 “Do you have foo?”, and the reply “Nope”] • Suppose M1’s sending event happened before M3’s sending event. • A causality violation occurs if M1 is delivered after M3 (in particular, non-FIFO delivery is a causality violation). • Remedy: delay the delivery of M3 to P2 until M1 arrives. • ISIS system using multicast

Formal Description of ISIS Clock ICi • Pi initializes its clock ICi = [0,…,0]. • For each msg sending event by Pi • ICi[i]++ • Pi attaches ICi to the message it sends. • Upon receiving msg M from Pj with M.ts, Pi checks if • 1) M.ts[j] == ICi[j] + 1 (M is the next msg expected from Pj) • 2) ICi[k] ≥ M.ts[k] for all other k (all msgs from Pk that sender Pj has received have been received by Pi) • If both are satisfied, Pi delivers M after ICi[j]++ • Otherwise, Pi puts M in the hold-back Q until they are satisfied.
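The delivery test can be sketched as follows. The class and method names are hypothetical, processes are numbered from 0, and the network is simulated by passing message tuples around by hand. The scenario replays the foo migration messages, with M2 reaching P3 before M1.

```python
class IsisProcess:
    """Process Pi with ISIS clock IC, a hold-back queue, and a delivery log."""

    def __init__(self, n, i):
        self.ic = [0] * n
        self.i = i
        self.hold_back = []
        self.delivered = []

    def send(self, payload):
        self.ic[self.i] += 1
        return (self.i, list(self.ic), payload)   # (sender j, M.ts, body)

    def _deliverable(self, j, ts):
        next_from_j = ts[j] == self.ic[j] + 1                      # check 1
        seen_rest = all(self.ic[k] >= ts[k]
                        for k in range(len(ts)) if k != j)         # check 2
        return next_from_j and seen_rest

    def receive(self, msg):
        self.hold_back.append(msg)
        progress = True
        while progress:              # one delivery may unblock held messages
            progress = False
            for m in list(self.hold_back):
                j, ts, payload = m
                if self._deliverable(j, ts):
                    self.ic[j] += 1
                    self.delivered.append(payload)
                    self.hold_back.remove(m)
                    progress = True


p1, p3 = IsisProcess(3, 0), IsisProcess(3, 2)
q = p3.send("Where is foo?")        # ts 001
m1 = p1.send("Migrate foo to P2")   # ts 100
p1.receive(q)                       # IC1 becomes 101
m2 = p1.send("foo is at P2")        # ts 201
p3.receive(m2)                      # held back: M2.ts[1] > IC3[1] + 1
p3.receive(m1)                      # delivers M1, then M2
print(p3.delivered, p3.ic)
```

Rescanning the whole queue after each delivery is simple rather than efficient; it is enough to show that held messages are released as soon as both checks pass.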

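Example • [Figure: time lines of P1, P2, and P3; P1 sends M1 “Migrate foo to P2” with timestamp 100; P3 sends “Where is foo?” with timestamp 001; P1, then at 101, sends M2 “foo is at P2” with timestamp 201] • At P3, M2 arrives first: M2.ts[1] > IC3[1] + 1 → put M2 in hold-back Q • Then M1 arrives: M1.ts[1] = IC3[1] + 1 → IC3 ← 101; deliver M1 • Now IC3 ← 201; deliver M2 • Note: the jth component of M.ts is the sequence number of the latest msg sent by Pj that is known to the sender of M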

Safety • Show that msgs are delivered in timestamp order. • Suppose not. • Let m (m’) be the event of sending message M (M’). • Assume Pi delivered msg M (from Pk) before M’ (from Pj), even though M’.ts (= ICj(m’)) < M.ts (= ICk(m)) ... (A) • (a) Just before Pi delivered M’: ICi[j] + 1 = ICj(m’)[j], hence ICi[j] < ICj(m’)[j]. • (b) Delivery of M required ICi[j]* ≥ ICk(m)[j], where ICi[j]* is the value of ICi[j] at the time of that delivery. • (a) and (b) contradict (A): since (b) took place before (a), ICi[j]* ≤ ICi[j], so ICk(m)[j] ≤ ICi[j] < ICj(m’)[j], whereas (A) requires ICj(m’)[j] ≤ ICk(m)[j].

Liveness • Show that the system is starvation-free: no message will wait forever in the hold-back Q • Assume Q is the hold-back queue in Pi and is non-empty. Let M be a msg in Q which is not preceded by any other msg in Q. Suppose M was sent by Pj.
