Eventual Consistency Jinyang
Sequential consistency • Sequential consistency properties: • A read must see the latest write (handles caching) • All writes are applied in a single order (handles concurrent writes) • Realizing sequential consistency: • Reads/writes from a single node execute one at a time • All reads/writes to address X must be ordered by one memory/storage module responsible for X
Realizing sequential consistency [Diagram: two caches or replicas; operations W(A)1, W(A)2, W(B)3 and R(B); an Invalidate message]
Disadvantages of sequential consistency • Requires highly available connections • Lots of chatter between clients/servers • Not suitable for certain scenarios: • Disconnected clients (e.g. your laptop) • Apps might prefer potential inconsistency to loss of availability
Why (not) eventual consistency? • Support disconnected operations • Better to read a stale value than nothing • Better to save writes somewhere than nothing • Potentially anomalous application behavior • Stale reads and conflicting writes…
Operating w/o total connectivity • Client writes to its local replica • No sync between clients • Sync w/ server resolves non-conflicting changes, reports conflicting ones to user [Diagram: two replicas, with writes W(A)1 and W(A)2]
Pair-wise synchronization • Pair-wise sync resolves non-conflicting changes, reports conflicting ones to users [Diagram: three replicas, with writes W(A)1, W(A)2 and W(B)3]
Example usages? • File synchronizers • One user, many gadgets
File synchronizer • Goal • All replica contents eventually become identical • No lost updates • Do not replace new versions with old ones
Prevent lost updates • Detect if updates were sequential • If so, replace old version with new one • If not, detect conflict • “Optimistic” vs. “Pessimistic” • Eventual Consistency: Let updates happen, worry about whether they can be serialized later • Sequential Consistency: Updates cannot take effect unless they are serialized first
How to prevent lost updates? • Strawman: use mtime to decide which version should replace the other • Problem w/ wall-clock time: cannot detect disagreement on ordering [Diagram: hosts H1 and H2 hold file f; writes W(f)a, W(f)b and W(f)c; mtimes shown: 12354, 15648, 16679, 23657]
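A minimal sketch of why the mtime strawman loses updates. The dictionary layout and the helper name `newer_by_mtime` are assumptions made for illustration; the mtimes are the ones shown on the slide.

```python
def newer_by_mtime(a, b):
    """Strawman rule: the version with the larger wall-clock mtime wins."""
    return a if a["mtime"] >= b["mtime"] else b

# H1 writes "b" on top of "a"; H2 writes "c" on top of "a" without seeing "b".
h1_version = {"data": "b", "mtime": 16679}  # W(f)b on H1
h2_version = {"data": "c", "mtime": 23657}  # W(f)c on H2

winner = newer_by_mtime(h1_version, h2_version)
print(winner["data"])  # "c": H1's update "b" is overwritten, no conflict reported
```

The two writes were concurrent, but comparing wall-clock numbers always yields some winner, so the conflict goes unnoticed.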
Strawman fix • Carry the entire modification history • If history X is a prefix of Y, Y is newer [Diagram: H1 performs W(f)a, giving history (H1:15648), then W(f)b, giving (H1:15648, H1:16679); H2 starts from (H1:15648) and performs W(f)c, giving (H1:15648, H2:23657)]
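The prefix rule can be sketched as follows. Representing a history as a list of (host, mtime) pairs and the function names are assumptions, not something the slides specify.

```python
def is_prefix(x, y):
    """True if history x is a prefix of history y."""
    return len(x) <= len(y) and y[:len(x)] == x

def compare_histories(x, y):
    """Classify two modification histories of the same file."""
    if x == y:
        return "equal"
    if is_prefix(x, y):
        return "y is newer"
    if is_prefix(y, x):
        return "x is newer"
    return "conflict"  # neither history extends the other

base = [("H1", 15648)]
on_h1 = [("H1", 15648), ("H1", 16679)]  # after W(f)b
on_h2 = [("H1", 15648), ("H2", 23657)]  # after W(f)c

print(compare_histories(base, on_h1))   # y is newer
print(compare_histories(on_h1, on_h2))  # conflict
```

Unlike the mtime comparison, diverging histories are reported as a conflict, at the cost of storing every modification.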
Compress version history • H1:2 implies H1:1, so we only need one number per host [Diagram: H1 performs W(f)a (H1:1) and W(f)b (H1:1 H1:2); H2 performs W(f)c (H1:1 H2:1); the history H1:1 H1:2 H2:1 compresses to H1:2 H2:1]
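Compare vector timestamp • (H1:1 H2:3 H3:2) < (H1:1 H2:5 H3:7): no entry is larger and at least one is smaller • (H1:1 H2:3 H3:2) and (H1:2 H2:1 H3:7) are not ordered: the H1 entry is smaller but the H2 entry is larger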
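A small sketch of the element-wise comparison, assuming a vector timestamp is stored as a dict from host name to counter and that a missing host counts as 0. The function names are illustrative.

```python
def leq(a, b):
    """True if every entry of a is <= the matching entry of b."""
    return all(a.get(h, 0) <= b.get(h, 0) for h in set(a) | set(b))

def compare(a, b):
    """Return 'equal', 'a < b', 'b < a' or 'concurrent'."""
    a_le_b, b_le_a = leq(a, b), leq(b, a)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "a < b"
    if b_le_a:
        return "b < a"
    return "concurrent"  # neither dominates: the updates conflict

v = {"H1": 1, "H2": 3, "H3": 2}
print(compare(v, {"H1": 1, "H2": 5, "H3": 7}))  # a < b
print(compare(v, {"H1": 2, "H2": 1, "H3": 7}))  # concurrent
```

A file synchronizer would replace the old version in the ordered cases and report a conflict in the concurrent case.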
Using vector timestamp [Diagram: H1 performs W(f)a (H1:1) and W(f)b (H1:2); H2 starts from H1:1 and performs W(f)c (H1:1 H2:1). H1:2 and H1:1 H2:1 are not ordered, so the two versions conflict]
How to deal w/ conflicts? • Easy: mailboxes w/ two different sets of messages • Medium: changes to different lines of a C source file • Hard: changes to the same line of a C source file • After conflict resolution, what should the vector timestamp be?
What about file deletion? • Can we forget about the vector timestamp for deleted files? • Simple solution: treat deletion as a write • Conflicts involving a deleted file are easy to detect • Downside: • Need to remember the vector timestamp for deleted files indefinitely
Tra [Cox, Josephson] • What are Tra’s novel properties? • Easy to compress storage of vector timestamps • No need to check every file’s version vector during sync • Allows partial sync of subtrees • No need to keep timestamp for deleted files forever
Tra’s key technique • Two vector timestamps: • One represents modification time • Tracks what a host has • One represents synchronization time • Tracks what a host knows • Sync time implies no modification has happened between the mod time and the sync time [Example: mod time H1:1 H2:5 H3:7, sync time H1:10 H2:20 H3:25]
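One way to read the two timestamps is as a three-way decision when syncing a file from a source replica to a destination replica. This is a sketch under assumptions, not Tra's actual code: the dict layout, the names `leq` and `sync_decision`, and the treatment of missing hosts as 0 are all illustrative.

```python
def leq(a, b):
    """True if vector time a <= b element-wise (missing hosts count as 0)."""
    return all(a.get(h, 0) <= b.get(h, 0) for h in set(a) | set(b))

def sync_decision(src, dst):
    """Decide what syncing src -> dst should do for one file.

    Each replica state is a dict with 'mtime' and 'synctime' vector times.
    """
    if leq(src["mtime"], dst["synctime"]):
        return "skip"      # dst already knows about src's version
    if leq(dst["mtime"], src["synctime"]):
        return "copy"      # src's version was derived from dst's
    return "conflict"      # each side has changes the other has not seen

newer = {"mtime": {"H1": 2}, "synctime": {"H1": 2}}
older = {"mtime": {"H1": 1}, "synctime": {"H1": 1, "H2": 1}}
forked = {"mtime": {"H1": 1, "H2": 1}, "synctime": {"H1": 1, "H2": 1}}

print(sync_decision(newer, older))   # copy
print(sync_decision(older, newer))   # skip
print(sync_decision(newer, forked))  # conflict
```

The "skip" case is what lets a synchronizer avoid work: what the destination knows already covers what the source has.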
Using sync time [Diagram: H1 performs W(f1)a and W(f2)b; files f1 and f2 on H1 and H2 are annotated with mod times (H1:0, H1:1, H1:2) and sync times (H1:0 H2:0, H1:1 H2:0, H1:2 H2:0)]
Compress mtime and synctime • dir synctime = element-wise min of child sync times • dir mtime = element-wise max of child mod times • Sync(d1 → d1’) • Skip d1 if mtime of d1 is less than synctime of d1’ • Can we achieve this with a single mtime? • Skip d1 if mtime of d1 is less than mtime of d1’
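The directory rules can be sketched as below. The helper names are hypothetical, vector times are dicts with missing hosts counted as 0, and "less than" is interpreted as element-wise less-than-or-equal, which is an assumption.

```python
def elementwise(op, vectors):
    """Combine vector times entry by entry with op (min or max)."""
    hosts = set().union(*vectors)
    return {h: op(v.get(h, 0) for v in vectors) for h in hosts}

def leq(a, b):
    """True if vector time a <= b element-wise."""
    return all(a.get(h, 0) <= b.get(h, 0) for h in set(a) | set(b))

def can_skip(src_dir_mtime, dst_dir_synctime):
    """The destination already knows everything the source directory has."""
    return leq(src_dir_mtime, dst_dir_synctime)

# Source directory d1 on H1: f1 last written at H1:1, f2 at H1:2.
d1_mtime = elementwise(max, [{"H1": 1}, {"H1": 2}])

# Destination d1' on H2 after syncing only f1: f2 has not been synced yet.
partial = elementwise(min, [{"H1": 2, "H2": 0}, {"H1": 0, "H2": 0}])
# Destination after syncing both files.
full = elementwise(min, [{"H1": 2, "H2": 0}, {"H1": 2, "H2": 0}])

print(can_skip(d1_mtime, partial))  # False: f2 still has to be pulled
print(can_skip(d1_mtime, full))     # True: the whole subtree can be skipped
```

Taking the min for sync time is the conservative choice: a directory only claims to know what every child knows.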
Synctime enables partial synchronization • Directory d1 contains f1 and f2; suppose a host syncs only the subtree d1/f1 • With synctime+mtime: synctime of d1 does not change; mtime of d1 increases • With mtime only: mtime of d1 increases • The host later syncs the subtree d1/f2 • With synctime+mtime: it pulls in the modifications to f2 because synctime of d1 is smaller • With mtime only: it skips d1 because mtime is high enough
Using sync time [Diagram: directory d contains f1 and f2; H1 performs W(f1)a (H1:1) and W(f2)b (H1:2); H2 syncs f1 only, then f2 only; sync times shown on H2 range from H1:0 H2:0 to H1:2 H2:0]
How to deal w/ deletion • Deletion notice for a deleted file contains its sync time [Diagram: directory d contains f1 and f2; H1 performs W(f1)a (H1:1) and D(f2), leaving a deletion notice with sync time H1:2 H2:0; H2 holds f1 and f2 with mod time H1:0]
How to deal w/ deletion • Deletion notice for a deleted file contains its sync time [Diagram: same scenario, except that H2 has modified f2 (mod time H2:1), which the deletion notice's sync time H1:2 H2:0 does not cover]
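A simplified sketch of how a deletion notice's sync time could be used when the destination still has the file. The function name and the two-outcome rule are assumptions; a real synchronizer may need further cases.

```python
def leq(a, b):
    """True if vector time a <= b element-wise (missing hosts count as 0)."""
    return all(a.get(h, 0) <= b.get(h, 0) for h in set(a) | set(b))

def apply_deletion(notice_synctime, dst_mtime):
    """Propagate a deletion to a replica that still holds the file."""
    if leq(dst_mtime, notice_synctime):
        return "delete"    # the deleter had seen this version
    return "conflict"      # the file changed in a way the deleter never saw

notice = {"H1": 2, "H2": 0}  # sync time carried by the deletion notice for f2

print(apply_deletion(notice, {"H1": 0}))  # delete
print(apply_deletion(notice, {"H2": 1}))  # conflict
```

The two calls correspond to the two deletion scenarios: H2's copy untouched since H1:0, and H2's copy modified at H2:1.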
Another definition of eventual consistency • Eventual consistency (Tra) • All replica contents are eventually identical • Do not care about individual writes, just overwrite old replica w/ new one • Eventual consistency (Bayou) • Writes are eventually applied in total order • Reads might not see most recent writes in total order
Bayou [Diagram: nodes N0, N1 and N2 each hold a write log and a version vector; every version vector starts at 0:0 1:0 2:0]
Bayou propagation [Diagram: N0 has log 1:0 W(x), 2:0 W(y), 3:0 W(z) and version vector 0:3 1:0 2:0; N1 has log 1:1 W(x) and version vector 0:0 1:1 2:0; N2 has version vector 0:0 1:0 2:0; N0's log and version vector are shown being propagated]
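Bayou propagation [Diagram: N1 now has log 1:0 W(x), 1:1 W(x), 2:0 W(y), 3:0 W(z) and version vector 0:3 1:4 2:0; N0 still has log 1:0 W(x), 2:0 W(y), 3:0 W(z) and version vector 0:3 1:0 2:0; N2 has version vector 0:0 1:0 2:0; N1's write 1:1 W(x) and version vector are shown being propagated]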
Bayou propagation • Which portion of the log is stable? [Diagram: N0 and N1 both have log 1:0 W(x), 1:1 W(x), 2:0 W(y), 3:0 W(z); version vectors: N0 0:4 1:4 2:0, N1 0:3 1:4 2:0, N2 0:0 1:0 2:0]
Bayou propagation [Diagram: N0, N1 and N2 all have log 1:0 W(x), 1:1 W(x), 2:0 W(y), 3:0 W(z); version vectors: N0 0:4 1:4 2:0, N1 0:3 1:4 2:0, N2 0:3 1:4 2:5]
Bayou propagation [Diagram: N0, N1 and N2 all have log 1:0 W(x), 1:1 W(x), 2:0 W(y), 3:0 W(z); version vectors: N0 0:4 1:4 2:0 (0:3 1:4 2:5 is also shown next to N0), N1 0:3 1:6 2:5, N2 0:4 1:4 2:5]
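The propagation slides can be approximated with a short sketch. It assumes each write is a (timestamp, node, op) tuple, that logs are kept sorted by (timestamp, node), and that a write counts as stable once every node's entry in the version vector has reached its timestamp. That stability rule is one plausible reading of the slides, not a quote from Bayou.

```python
def merge_logs(a, b):
    """Union of two write logs, ordered by (timestamp, node)."""
    return sorted(set(a) | set(b))

def stable_prefix(log, version_vector):
    """Writes that no later-arriving write can be ordered before (assumed rule)."""
    cutoff = min(version_vector.values())
    return [w for w in log if w[0] <= cutoff]

n0_log = [(1, 0, "W(x)"), (2, 0, "W(y)"), (3, 0, "W(z)")]
n1_log = [(1, 1, "W(x)")]

merged = merge_logs(n0_log, n1_log)
print(merged)
# [(1, 0, 'W(x)'), (1, 1, 'W(x)'), (2, 0, 'W(y)'), (3, 0, 'W(z)')]

# N0 has heard nothing from N2 yet, so nothing is stable there.
print(stable_prefix(merged, {0: 4, 1: 4, 2: 0}))  # []
# Once every entry is at least 3, all four writes are stable.
print(len(stable_prefix(merged, {0: 4, 1: 4, 2: 5})))  # 4
```

Under this rule a single silent node keeps the cutoff low for everyone.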
Bayou uses a primary to commit a total order • Why is it important to make the log stable? • Stable writes can be committed • Stable portion of the log can be truncated • Problem: If any node is offline, the stable portion of all logs stops growing • Bayou’s solution: • A designated primary defines a total commit order • Primary assigns CSNs (commit-seq-no) • Any write with a known CSN is stable • All stable writes are ordered before tentative writes
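A minimal sketch of CSN-based ordering, assuming each log entry is identified by CSN, timestamp and node, with an infinite CSN standing for "tentative". The `Write` and `Primary` classes are hypothetical names for illustration.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class Write:
    ts: int                 # timestamp assigned by the accepting node
    node: int               # ID of the accepting node
    op: str
    csn: float = math.inf   # math.inf marks a tentative (uncommitted) write

def log_order(w):
    """Committed writes sort by CSN; tentative ones follow, by (ts, node)."""
    return (w.csn, w.ts, w.node)

class Primary:
    """Hands out commit sequence numbers in arrival order."""
    def __init__(self):
        self.next_csn = 1

    def commit(self, w):
        committed = Write(w.ts, w.node, w.op, self.next_csn)
        self.next_csn += 1
        return committed

primary = Primary()
committed = [primary.commit(Write(t, 0, op))
             for t, op in [(1, "W(x)"), (2, "W(y)"), (3, "W(z)")]]
tentative = Write(1, 1, "W(x)")            # accepted by N1, not yet committed

log = sorted(committed + [tentative], key=log_order)
print([w.csn for w in log])                # [1, 2, 3, inf]
print(primary.commit(tentative).csn)       # 4
```

The write accepted by N1 would sort second by timestamp alone, but it commits as CSN 4, after the primary's own three writes.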
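Bayou propagation [Diagram: log entries now carry a leading CSN, with ∞ for a tentative write; N0 has log 1:1:0 W(x), 2:2:0 W(y), 3:3:0 W(z) and version vector 0:3 1:0 2:0; N1 has log ∞:1:1 W(x) and version vector 0:0 1:1 2:0; N2 has version vector 0:0 1:0 2:0; N1's write and version vector are shown being propagated]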
Bayou propagation [Diagram: N0 has log 1:1:0 W(x), 2:2:0 W(y), 3:3:0 W(z), 4:1:1 W(x) and version vector 0:4 1:1 2:0; N1 still has log ∞:1:1 W(x) and version vector 0:0 1:1 2:0; N2 has version vector 0:0 1:0 2:0; the committed write 4:1:1 W(x) is shown being propagated]
Bayou’s limitations • Primary cannot fail • Server creation & retirement make nodeIDs grow arbitrarily long • Anomalous behaviors for apps? • Calendar app