Créer une présentation
Télécharger la présentation

Download

Download Presentation

Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables

71 Vues
Download Presentation

Télécharger la présentation
## Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables

- - - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - - -

**Viceroy: Scalable Emulation of Butterfly Networks For**Distributed Hash Tables By: Dahlia Malkhi, Moni Naor & David Ratajzcak Nov. 11, 2003 Presented by Zhenlei Jia Nov. 11, 2004**Acknowledgments**Some of the following slides are adapted from the slides created by the authors of the paper**Outline**• Outline • DHT Properties • Viceroy • Structure • Routing Algorithm • Join/Leave • Bounding In-degree: Bucket Solution • Fault Tolerance • Summary**DHT**• What’s DHT • Store (key, value) pairs • Lookup • Join/Leave • Examples • CAN, Pastry, Tapestry, Chord etc.**DHT Properties**• Dilation • Efficient lookup, usually O(log(n)) • Maintenance cost • Support dynamic environment • Control messages, affected servers • Degree • Number of opened connections • Servers impacted by node join/leave • Heartbeat, graceful leave**DHT Properties (cont.)**• Congestion: • Peers should share the routing load evenly • Load (of a node): the probability that it is on a route with random source and destination. • If path length = O(log(n)) then on average, each node is on n2 x O(log(n))/n = O(nlog(n)) routes. Average load = O(nlogn)/n2 = O(log(n))/n**Intuition**• Route is a combination of links of appropriate size • Chord: Each node has ALL log(n) links • Viceroy • Each node has ONE of the long-range links • A link of length 1/2k points to a node has link of length 1/2k+1 Chrod**011**110 000 001 010 100 101 111 Level 1 Level 2 Level 3 Level 4 A Butterfly Network • Each node has ONE of the long-range links • A link of length 1/2k points to a node has link of length 1/2k+1 • Nodes “share” each other’s long link • Routing • Route to root • Route to right group • Route to right level • Path: O(log(n)) • Degree: O(1)**0101**1001 1011 1101 1111 0001 0010 0011 0100 0110 1000 1110 0 1 Level 1 Level 2 Level 3 A Viceroy network • Ideally, there should be log(n) levels • There is not a global counter • Later, we will see how a node can estimate log(n) locally**Structure: Nodes**• Node • Id: 128 bits binary string, u • Level: positive integer, u.level • Order of ids • b1b2…bk ∑i=1…k bi/2i • Each node has a SUCCESSOR and a PREDECESSOR SUCC(u), PRED(u) • Node u stores the keys k such that u≤k<SUCC(u)**Keys stored on x**0 1 PRED(x) x SUCC(x) Structure: Nodes • Lemma 2.1 Let n0 = 1/d(x, SUCC(x)), then w.h.p. (i.e. p>1-1/n1+e) that log(n)-log(log(n))-O(1) <log(n0) ≤3log(n) • Node x selects level from 1…log(n0) uniformly randomly**Structure: Links**• A node u in level k has six out links • 2 x Short: SUCCESSOR ,PREDECESSOR • 2 x Medium: (left) closest level-(k+1) node whose id matches u.id[k] and is smaller than u.id. • 1 x Long: the closest level-(k+1) node with prefix u1…uk-1(1-uk)(?) u1…uk-1(1-uk)uk+1…uw* where w=log(n0)-log(log(n0)) • 1 x Parent: closest level-(k-1) node • Also keeps track of in-bound links**0101**1001 1011 1101 1111 0001 0010 0011 0100 0110 1000 1110 0 1 Short link Long link, cross over about 1/2k Matches u[w] except kth bit. (11*) Short link Level 1 Medium link Matches x[k]0* Matches 1* Wrong! Level 2 Parent link, to level k-1 Level 3 Structure: Links**Routing: Algorithm**LOOKUP(x, y): Initialization: set cur to x Proceed to root: while cur.level > 1: cur = cur.parent Greedy search: if cur.id ≤ y < SUCC(cur).id, return cur. Otherwise, choose m from links of cur that minimize d(m, y), move to m and repeat. Demo: http://www.cs.huji.ac.il/labs/danss/anatt/viceroy.html**y**0101 1001 1011 1101 1111 0001 0010 0011 0100 0110 1000 1110 x 0 1 Level 1 Level 2 Level 3 Routing: Example**0101**1001 1011 1101 1111 0001 0010 0011 0100 0110 1000 1110 0 1 Level 1 Level 2 Level 3 One Observation**y**0101 1001 1011 1101 1111 0001 0010 0011 0100 0110 1000 1110 x 0 1 Level 1 Level 2 Level 3 Routing: Analysis (1)**Routing: Analysis (2)**• Expected path length = O(log(n)) • log(n ) to `level-1’ node • log(n ) for traveling among clusters • log(n ) for final local search**Routing: Theorems**• Theorem 4.4 The path length from x to y is O(log(n)) w.h.p. • Proof is based on several lemmas • Lemma 4.1 For every node u with a level u.level < log(n)-log(log(n)), the number of nodes between u and u.Medium-left (Medium-right), if it exists, is at most 6log2(n) w.h.p.**Routing: Theorems (2)**• Lemma 4.2 In the greedy search phase of a lookup of value Y from node x, let the j’th greedy step vj, for 1 ≤ j ≤ m, be such that vj is more than O(log2(n)) nodes away from y. Then w.h.p. node vj is reached over a Medium or Long link, and hence satisfies vj.level = j and vj[j] = Y[j]. • m = log(n)-2loglog(n)-log(3+e) • W.h.p. within m steps, we are n/2m = 6log2(n) nodes away from the destination**Routing: Theorems (3)**• Lemma 4.3 Let v be a node that is O(log2(n)) nodes away from the target y. Then w.h.p., within O(log(n)) greedy steps that target y is reached from v. • Theorem 4.4 The total length of a route from x to y is O(log(n)) w.h.p. • Theorem 4.6 Expected load on every node is O(log(n)/n). The load on every node is log2(n)/n w.h.p. • Theorem 4.7 Every node u has in-degree O(log(n)) w.h.p.**Join: Algorithm**• Choose identifier: select a random 128 bits x1x2…x128 • Setup short links: invoke LOOKUP(x), let x’ be the result node. Insert x between x’ and x’.SUUCESSOR. • Choose level: let k be the maximal number of matching prefix bits between x and either SUCC(x) or PRED(x), choose level from 1…k. • Set parent link: If SUCC(x) has level x.level-1, set x.parent to it. Otherwise, move to SUCC(x) and repeat. • Set long link: p = x1…xk-1(1-xk)xk+1…xw Invoke LOOKUP(p), stop after a node at level x.level+1 and matches p is reached.**Join: Algorithm (cont.)**6. Set medium links: Denote p = x1x2…xx.level. If SUCC(x) has prefix p and level x.level+1, set x.Medium-right link to it. Otherwise, move the SUCC(x) and repeat. 7.Set inbound links: Denote p = x1x2…xx.level. Set inbound Medium links: Following SUCC links, so long as successor y has a prefix p and a level different from x.level, if y.level = x.level-1, set y.Medium-left to x. Set inbound long links: Following SUCC links, find y that has a prefix matches p and has level x.level. Take any inbound links that is closer to x than y. Set inbound parent links: Following PRED link, find y such that y.level = x.level+1. Repeat until meet a node in same level as x.**Set Medium link: O(lg2n) w.h.p**p = x1x2…xk (01) If y[k] != p: stop If y[k]=p and y.level=k+1: set Medium link Otherwise, move to succ(y) 0101 0111 1001 1011 1101 1111 0001 0010 0011 0100 0110 1000 1110 0 1 Lookup(x) Level 1 X Level 2 Set inbound long links: Following short links, find y such that y[k]=x[k] and y.level = x.level, check y’s inbound links. STOP Level 3 Set long link P = x1…xk-1(1-xk)…xw stop at level k+1? In this case, find 00* Join: Example Set Parent link: Following SUCC link, find a node has level k-1. 0111**Join: Analysis**• LOOKUP takes O(log(n)) messages w.h.p. • Travels on short links during link setting phase is O(lg2n) w.h.p. • A Medium link is within 6log2(n) nodes from x w.h.p. • Similar for others • Theorem 5.1: A JOIN operation by a new node x incurs expected O(log(n)) number of messages, and O(log2(n)) messages w.h.p. The expected number of nodes that change their state as a result of x’s join is constant, and w.h.p is O(log(n)). Because node x has O(log(n)) in-degrees w.h.p. Similar results holds for LEAVE.**Bounding In-degrees**• Theorem 4.7 Every node has expected constant in-degree, and has O(log(n)) in-degree w.h.p. • In-degree=# of servers affected by join/leave • How to guarantee constant in-degree? • Bucket solution • A background process to balance the assignment of levels**Level k-1**Level k Bucket Solution: Intuition ~log(n) • Node x has log(n) in-degree, assuming Medium Right x • Too many nodes at level k-1; Too few nodes at level k • Improve the level selection procedure**0101**1001 1011 0011 0110 1101 1110 1111 0001 0100 1000 0010 0 1 Bucket Solution • The name space is divided into non-overlapped buckets. • A bucket contains m nodes, where log(n) ≤m ≤ clog(n), for c>2. • In a buckets, levels are NOT assigned randomly • For each 1≤j≤log(n), there are 1…c nodes at level j in each bucket • In(x) < 7c (?? 2c)**Maintaining Bucket Size**• n can be accurately estimated • When bucket size exceeds clog(n), the bucket is split into two equal size buckets. • When bucket size drops below log(n), it is merged with a neighbor bucket. Further more, if the merged bucket is greater than log(n)x(2c+2)/3, the new bucket is split into two buckets. (c+1)/3 > 1 since c>2 • Buckets are organized into a ring, which can be merged or split with O(1) message.**Maintain Level Property**• Node join/leave without merging or splitting O(1) • Join: size < clog(n), choose a level that has less that c nodes • Leave: If it is the only node in its level, find another level that has two nodes, reassign level j to one of them. • Bucket merge or split may result in a reassignment of the levels to all nodes in the bucket(s) O(log(n)) • Merging/splitting are expensive, but they do not happen very often • After a merging or splitting of buckets, at least log(n) (c-2)/3 JOIN/LEAVE must happen in this bucket until another merging or splitting of this bucket is performed Amortized Overhead = c/((c-2)/3) = O(1) for c>2**Max bucket size**New bucket size clog(n) d1 d2 Log(n) min(c/2lgn, (c+1)/3lgn) max(c/2lgn, (2c+2)/3lgn) d1, d2 > (c-2)/3 Amortized analysis**Fault Tolerance**• Viceroy has no built in support for fault tolerance • Viceroy requires graceful leave • Leaves are NOT the same as failures • Performance is sensitive to failure • External techniques: • Thickening Edges • State Machine Replication**Old**New SMR SMR State Machine Replication Viceroy nodes Super node**Related Works**• De Bruijn Graph Based Network • Distance halving • D2B • Koorde • Others • Symphony (Small world model) • Ulysses (ButterFly, log(n), log(n)/loglogn)**Summary**• Constant out-degree • Expected constant in-degree • O(log(n )) w.h.p. • O(1) with bucket solution • O(log(n )) path length w.h.p • Expected log(n )/n load: • O(log2(n)/n) w.h.p. • Weakness/improvements: • Not Locality Aware • No Fault Tolerance Support • Due to the lack of flexibility of ButterFly network**Question**Photo by Peter J. Bryant