Tapestry Deployment and Fault-tolerant Routing
Ben Y. Zhao, L. Huang, S. Rhea, J. Stribling, A. D. Joseph, J. D. Kubiatowicz
Berkeley Research Retreat, January 2003
Scaling Network Applications
• Complexities of global deployment
• Network unreliability
• BGP slow convergence, redundancy unexploited
• Lack of administrative control over components
• Constrains protocol deployment: multicast, congestion ctrl.
• Management of large-scale resources / components
• Locate, utilize resources despite failures
Enabling Technology: DOLR (Decentralized Object Location and Routing)
[Figure: a DOLR overlay locating objects by GUID (labels: GUID1, GUID2)]
What is Tapestry?
• DOLR driving OceanStore global storage (Zhao, Kubiatowicz, Joseph et al. 2000)
• Network structure
  • Nodes assigned bit-sequence nodeIds from the namespace 0–2^160, based on some radix (e.g. 16)
  • Keys drawn from the same namespace; each key dynamically maps to 1 unique live node: its root
• Base API (a Java sketch of this API follows this list)
  • Publish / Unpublish (Object ID)
  • RouteToNode (NodeId)
  • RouteToObject (Object ID)
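A minimal sketch of the base API above as a Java interface, assuming 160-bit IDs represented as BigInteger; the interface name, method signatures, and message type are illustrative assumptions, not the deployed Tapestry classes.

    import java.math.BigInteger;

    /** Illustrative sketch of the Tapestry base API listed above;
     *  names and signatures are assumptions, not the deployed code. */
    public interface TapestryApi {
        /** Announce that this node stores a replica of the object with this GUID. */
        void publish(BigInteger objectId);

        /** Withdraw a previously published replica. */
        void unpublish(BigInteger objectId);

        /** Route a message to the live node owning (or acting as surrogate for) nodeId. */
        void routeToNode(BigInteger nodeId, byte[] message);

        /** Route a message toward some replica of the object via its location pointers. */
        void routeToObject(BigInteger objectId, byte[] message);
    }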
Tapestry Mesh
[Figure: Tapestry routing mesh of nodes labeled with hex NodeIDs (0xEF34, 0xEF97, 0xEF32, 0xE399, …), with links annotated by routing level 1–4]
Object Location
Talk Outline
• Introduction
• Architecture
  • Node architecture
  • Node implementation
• Deployment Evaluation
• Fault-tolerant Routing
Single Node Architecture
[Figure: single-node software stack, top to bottom — applications (Decentralized File Systems, Application-Level Multicast, Approximate Text Matching); Application Interface / Upcall API; Router with Routing Table & Object Pointer DB and Dynamic Node Management; Network Link Management; Transport Protocols]
Single Node Implementation
[Figure: implementation components — Applications (enter/leave Tapestry); Application Programming Interface (API calls, upcalls); Dynamic Tapestry (state maintenance, node insert/delete); Core Router (route to node / object); Patchwork (link maintenance, fault detection via heartbeat messages); node insert/delete messages; Distance Map; Network Stage (UDP pings); all running on the SEDA event-driven framework over the Java Virtual Machine]
Deployment Status
• C simulator
  • Packet-level simulation
  • Scales up to 10,000 nodes
• Java implementation
  • 50,000 semicolons of Java, 270 class files
  • Deployed on local-area cluster (40 nodes)
  • Deployed on PlanetLab global network (~100 distributed nodes)
Talk Outline
• Introduction
• Architecture
• Deployment Evaluation
  • Micro-benchmarks
  • Stable network performance
  • Single and parallel node insertion
• Fault-tolerant Routing
Micro-benchmark Methodology
[Figure: sender (Control + Tapestry) and receiver (Control + Tapestry) connected by a LAN link]
• Experiment run in a LAN over GBit Ethernet
• Sender sends 60,001 messages at full speed
• Measure inter-arrival time for the last 50,000 msgs (a measurement sketch follows this list)
  • First 10,000 msgs discarded: remove cold-start effects
  • Averaging over 50,000 msgs: remove network jitter effects
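A small receiver-side sketch, under assumptions, of how the inter-arrival measurement above could be collected: timestamp each delivered message, drop the warm-up prefix, and average the gaps. The class name, the onMessage hook, and the reporting format are illustrative; only the message counts come from the slide.

    import java.util.ArrayList;
    import java.util.List;

    /** Receiver-side sketch of the inter-arrival measurement described above.
     *  The message-delivery hook is a placeholder, not Tapestry's actual upcall. */
    public class InterArrivalProbe {
        private static final int TOTAL = 60_001;     // messages sent by the sender
        private static final int MEASURED = 50_000;  // last messages actually measured
        private final List<Long> stamps = new ArrayList<>(TOTAL);

        /** Call once per delivered message. */
        public void onMessage() {
            stamps.add(System.nanoTime());
            if (stamps.size() == TOTAL) report();
        }

        private void report() {
            int first = TOTAL - MEASURED;            // skip cold-start messages
            long sumNanos = 0;
            for (int i = first + 1; i < TOTAL; i++) {
                sumNanos += stamps.get(i) - stamps.get(i - 1);
            }
            double meanMicros = sumNanos / 1e3 / (MEASURED - 1);
            System.out.printf("mean inter-arrival: %.1f microseconds%n", meanMicros);
        }
    }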
Micro-benchmark Results
• Constant per-message processing overhead of ~50 µs
• Latency dominated by byte copying
• For 5 KB messages, throughput ≈ 10,000 msgs/sec
Large Scale Methodology
• PlanetLab global network
  • 101 machines at 42 institutions in North America, Europe, and Australia (~60 machines utilized)
  • 1.26 GHz PIII (1 GB RAM), 1.8 GHz P4 (2 GB RAM)
  • North American machines (2/3) on Internet2
• Tapestry Java deployment
  • 6–7 nodes on each physical machine
  • IBM Java JDK 1.3.0
  • Node virtualization inside the JVM and SEDA
  • Scheduling between virtual nodes increases latency
Node-to-Node Routing
• Ratio of end-to-end routing latency to shortest ping distance between nodes
• All node pairs measured, placed into buckets
• Median = 31.5, 90th percentile = 135
Object Location
• Ratio of end-to-end latency for object location to shortest ping distance between client and object
• Each node publishes 10,000 objects; lookups performed on all objects
• 90th percentile = 158
Latency to Insert Node
• Latency to dynamically insert a node into an existing Tapestry, as a function of the size of the existing Tapestry
• Humps due to expected filling of each routing level
Bandwidth to Insert Node
• Cost in bandwidth of dynamically inserting a node into the Tapestry, amortized over each node in the network
• Per-node bandwidth decreases with the size of the network
Parallel Insertion Latency
• Latency to dynamically insert nodes in unison into an existing Tapestry of 200 nodes
• Shown as a function of the ratio of insertion group size to network size
• 90th percentile = 55042
Talk Outline
• Introduction
• Architecture
• Deployment Evaluation
• Fault-tolerant Routing
  • Tunneling through scalable overlays
  • Example using Tapestry
Adaptive and Resilient Routing
• Goals
  • Reachability as a service
  • Agility / adaptability in routing
  • Scalable deployment
  • Useful for all client endpoints
Existing Redundancy in DOLR/DHTs
• Fault detection via soft-state beacons
  • Periodically sent to each node in the routing table
  • Scales logarithmically with the size of the network
  • Worst case overhead: 2^40 nodes, 160-bit IDs, hex digits; 1 beacon/sec at 100 B each ≈ 240 kbps
  • Can minimize bandwidth with better techniques (Hakim, Shelley)
• Precomputed backup routes (see the sketch after this list)
  • Intermediate hops in the overlay path are flexible
  • Keep a list of backups for outgoing hops (e.g. 3 node pointers for each route entry in Tapestry)
  • Maintain backups using node membership algorithms (no additional overhead)
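A sketch, under assumptions, of the per-entry backup idea above: each routing-table entry keeps a primary next hop plus precomputed backups, and missed beacons demote the primary. The class name, field layout, and the missed-beacon threshold are illustrative, not the actual Tapestry data structures.

    import java.util.ArrayDeque;
    import java.util.Deque;

    /** Illustrative routing-table entry holding a primary next hop plus precomputed
     *  backups (e.g. 3 pointers per entry, as on the slide). Thresholds are assumptions. */
    public class RouteEntry {
        private static final int MAX_MISSED_BEACONS = 3;
        private final Deque<String> nextHops = new ArrayDeque<>();  // primary at the head
        private int missedBeacons = 0;

        public RouteEntry(String primary, String... backups) {
            nextHops.add(primary);
            for (String b : backups) nextHops.add(b);
        }

        /** Called when a periodic soft-state beacon from the primary arrives. */
        public void beaconReceived() { missedBeacons = 0; }

        /** Called by the beacon timer when no beacon arrived this period. */
        public void beaconMissed() {
            if (++missedBeacons >= MAX_MISSED_BEACONS && nextHops.size() > 1) {
                nextHops.removeFirst();   // fail over to the next precomputed backup
                missedBeacons = 0;
            }
        }

        public String currentNextHop() { return nextHops.peekFirst(); }
    }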
Bootstrapping Non-overlay Endpoints
• Goal
  • Allow non-overlay nodes to benefit
  • Endpoints communicate via overlay proxies
• Example: legacy nodes L1, L2
  • Li registers with a nearby overlay proxy Pi
  • Pi assigns Li a proxy name Di s.t. Di is the closest possible unique name to Pi (e.g. start with Pi, increment for each node); a naming sketch follows this list
  • L1 and L2 exchange their new proxy names
  • Messages route to the nodes using the proxy names
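A sketch of the "start with Pi, increment for each node" naming rule above, assuming 160-bit names held as BigInteger; the class, the registration method, and the wrap-around handling are illustrative assumptions.

    import java.math.BigInteger;
    import java.util.HashSet;
    import java.util.Set;

    /** Sketch of the proxy-name assignment rule above: start from the proxy's own ID
     *  and increment until an unused name is found. Details are illustrative. */
    public class ProxyNamer {
        private static final BigInteger NAMESPACE = BigInteger.valueOf(2).pow(160);
        private final BigInteger proxyId;
        private final Set<BigInteger> assigned = new HashSet<>();

        public ProxyNamer(BigInteger proxyId) {
            this.proxyId = proxyId;
            assigned.add(proxyId);           // the proxy keeps its own name
        }

        /** Returns the closest unused name above the proxy's ID for a registering legacy node. */
        public synchronized BigInteger register() {
            BigInteger candidate = proxyId;
            do {
                candidate = candidate.add(BigInteger.ONE).mod(NAMESPACE);
            } while (assigned.contains(candidate));
            assigned.add(candidate);
            return candidate;
        }
    }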
Tunneling through an Overlay
• L1 registers with P1 as document D1
• L2 registers with P2 as document D2
• Traffic tunnels through the overlay via the proxies
[Figure: legacy nodes L1 and L2 attached to proxies P1 and P2 at the edge of the overlay network, published as documents D1 and D2]
Failure Avoidance in Tapestry
Routing Convergence
Bandwidth Overhead for Misroute
• Status: under deployment on PlanetLab
For more information …
Tapestry and related projects (and these slides): http://www.cs.berkeley.edu/~ravenben/tapestry
OceanStore: http://oceanstore.cs.berkeley.edu
Related papers:
http://oceanstore.cs.berkeley.edu/publications
http://www.cs.berkeley.edu/~ravenben/publications
ravenben@eecs.berkeley.edu
Backup Slides Follow…
The Naming Problem
• Tracking modifiable objects
  • Examples: email, Usenet articles, tagged audio
  • Goal: verifiable names, robust to small changes
• Current approaches
  • Content-based hashed naming
  • Content-independent naming
• ADOLR project (Feng Zhou, Li Zhuang)
  • Approximate names based on feature vectors
  • Leverage them to match / search for similar content
Approximation Extension to DOLR/DHT
• Publication using features
  • Objects are described using a set of features: AO ≡ Feature Vector (FV) = {f1, f2, f3, …, fn}
  • Locate AOs in the DOLR ≡ find all AOs in the network with |FV* ∩ FV| ≥ Thres, 0 < Thres ≤ |FV| (a matching sketch follows this list)
• Driving application: decentralized spam filter
  • Humans are the only fool-proof spam filter
  • Mark spam, publish spam by text feature vector
  • Incoming mail filtered by an FV query on the P2P overlay
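A sketch of the matching rule above, |FV* ∩ FV| ≥ Thres, assuming features have already been reduced to strings (e.g. hashes of text shingles); the class and method names are illustrative.

    import java.util.HashSet;
    import java.util.Set;

    /** Sketch of the approximate-match test above: two objects match when their
     *  feature vectors share at least `thresh` features. Types are illustrative. */
    public final class FeatureMatch {
        private FeatureMatch() {}

        public static boolean matches(Set<String> queryFv, Set<String> publishedFv, int thresh) {
            Set<String> common = new HashSet<>(queryFv);
            common.retainAll(publishedFv);      // |FV* ∩ FV|
            return common.size() >= thresh;
        }
    }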
Evaluation on Real Emails
• Accuracy of feature-vector matching on real emails
  • Spam: 29,631 junk emails from www.spamarchive.org
    • 14,925 unique; 86% of spam ≤ 5 KB
  • Normal emails: 9,589 total = 50% newsgroup posts, 50% personal emails
  • "Similarity" test: 3,440 modified copies of 39 emails
  • "False positive" test: 9,589 (normal) × 14,925 (spam)
• Status
  • Prototype implemented as an Outlook plug-in
  • Interfaces with the Tapestry overlay
  • http://www.cs.berkeley.edu/~zf/spamwatch
State of the Art Routing
• High-dimensionality and coordinate-based P2P routing
  • Tapestry, Pastry, Chord, CAN, etc.
• Sub-linear storage and # of overlay hops per route
• Properties dependent on random name distribution
• Optimized for uniform mesh-style networks
Reality
• Transit-stub topology, disparate resources per node
• Result: inefficient inter-domain routing (b/w, latency)
[Figure: source S and receiver R in stub domains AS-1 and AS-3; overlay hops of the P2P overlay network repeatedly cross transit domain AS-2]
Landmark Routing on P2P
• Brocade
  • Exploit non-uniformity
  • Minimize wide-area routing hops / bandwidth
• Secondary overlay on top of Tapestry
  • Select super-nodes by administrative domain
  • Divide the network into cover sets
  • Super-nodes form a secondary Tapestry
  • Advertise each cover set as local objects
• Brocade routes directly into the destination's local network, then resumes P2P routing (see the sketch after this list)
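A sketch, under assumptions, of the Brocade forwarding decision above: traffic whose destination lies outside the local cover set is handed to the super-node overlay, which tunnels it toward the destination's domain where ordinary P2P routing resumes. The Overlay interface, class name, and cover-set representation are illustrative.

    /** Sketch of the Brocade forwarding decision above: wide-area hops go through
     *  super-nodes on the secondary overlay. Interfaces are assumptions. */
    public class BrocadeRouter {
        public interface Overlay { void route(java.math.BigInteger dest, byte[] msg); }

        private final java.util.Set<java.math.BigInteger> localCoverSet; // nodes in this admin domain
        private final Overlay localTapestry;      // primary overlay inside the domain
        private final Overlay secondaryTapestry;  // super-node overlay across domains

        public BrocadeRouter(java.util.Set<java.math.BigInteger> localCoverSet,
                             Overlay localTapestry, Overlay secondaryTapestry) {
            this.localCoverSet = localCoverSet;
            this.localTapestry = localTapestry;
            this.secondaryTapestry = secondaryTapestry;
        }

        /** Use the secondary overlay only when the destination is outside our cover set. */
        public void route(java.math.BigInteger dest, byte[] msg) {
            if (localCoverSet.contains(dest)) {
                localTapestry.route(dest, msg);        // stay inside the domain
            } else {
                secondaryTapestry.route(dest, msg);    // tunnel toward the destination's
                                                       // super-node, then resume P2P routing
            }
        }
    }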
Brocade Routing
[Figure: source S in AS-1 and destination D in AS-3; the original route wanders across the P2P network through AS-2, while the Brocade route jumps through the Brocade layer directly into D's domain]
Overlay Routing Networks
• CAN: Ratnasamy et al. (ACIRI / UCB)
  • Uses a d-dimensional coordinate space to implement a distributed hash table
  • Routes to the neighbor closest to the destination coordinate
  • Fast insertion / deletion; constant-sized routing state; unconstrained # of hops; overlay distance not prop. to physical distance; simplicity in algorithms
• Chord: Stoica, Morris, Karger, et al. (MIT / UCB)
  • Linear namespace modeled as a circular address space
  • "Finger table" points to a logarithmic # of increasingly remote hosts
  • Fast fault-recovery; log2(N) hops and routing state; overlay distance not prop. to physical distance
• Pastry: Rowstron and Druschel (Microsoft / Rice)
  • Hypercube routing similar to PRR97
  • Objects replicated to servers by name
  • Fast fault-recovery; log(N) hops and routing state; data replication required for fault-tolerance
Routing in Detail
• Example: octal digits, 2^12 namespace, route from 2175 to 0157
[Figure: each hop resolves one more digit of the destination ID 0157: 2175 → 0880 → 0123 → 0154 → 0157; each hop's routing table shows one slot per digit value 0–7]
Publish / Lookup Details
• Publish object with ObjectID:
  // route towards the "virtual root", ID = ObjectID
  For (i = 0; i < Log_2(N); i += j) {   // define hierarchy; j is the # of bits per digit (e.g. for hex digits, j = 4)
  • Insert entry into the nearest node that matches on the last i bits
  • If no matches are found, deterministically choose an alternative
  • The real root node is found when no external routes are left
  }
• Lookup object
  • Traverse the same path to the root as publish, except search for an entry at each node
  For (i = 0; i < Log_2(N); i += j) {
  • Search for a cached object location
  }
  • Once found, route via IP or Tapestry to the object
  (A sketch of this walk follows.)
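A sketch of the per-digit walk just described, assuming hex digits and string IDs. RootWalk, RoutingTable, and nextHop() are illustrative stand-ins for the real routing machinery, and the pointer cache is a plain map rather than the distributed object-pointer DB.

    import java.util.HashMap;
    import java.util.Map;

    /** Sketch of the publish/lookup walk above for hex digits (j = 4 bits).
     *  nextHop() stands in for a real routing-table lookup; names are illustrative. */
    public class RootWalk {
        static final int DIGITS = 40;                       // 160-bit IDs, hex radix
        final Map<String, Map<String, String>> pointerCache = new HashMap<>(); // node -> (objectId -> replica)

        /** One more low-order digit of objectId must match at each successive hop. */
        interface RoutingTable { String nextHop(String currentNode, String objectId, int level); }

        String publish(String startNode, String objectId, String replicaAddr, RoutingTable rt) {
            String node = startNode;
            for (int level = 1; level <= DIGITS; level++) {
                pointerCache.computeIfAbsent(node, k -> new HashMap<>()).put(objectId, replicaAddr);
                String next = rt.nextHop(node, objectId, level);
                if (next == null) break;                    // no closer match: node is the (surrogate) root
                node = next;
            }
            return node;                                    // root reached for this object
        }

        String lookup(String startNode, String objectId, RoutingTable rt) {
            String node = startNode;
            for (int level = 1; level <= DIGITS; level++) {
                Map<String, String> cache = pointerCache.get(node);
                if (cache != null && cache.containsKey(objectId)) {
                    return cache.get(objectId);             // found a location pointer en route
                }
                String next = rt.nextHop(node, objectId, level);
                if (next == null) break;                    // at the root with no pointer: object unknown
                node = next;
            }
            return null;
        }
    }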
Dynamic Insertion
• Build up the new node's routing map (see the sketch after this list)
  • Send messages to each hop along the path from the gateway to the current node N' that best approximates the new node N
  • The i-th hop along the path sends its i-th level route table to N
  • N optimizes those tables where necessary
• Notify, via acked multicast, the nodes with null entries for N's ID
  • Each notified node issues republish messages for the relevant objects
• Notify local neighbors
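A sketch, under assumptions, of the first step above: the new node walks the path from its gateway toward the closest existing approximation of its ID and seeds level i of its own routing map with level i of the i-th hop's table (later optimizing entries, e.g. by measured distance). The PathHop interface stands in for a real overlay message exchange.

    import java.util.ArrayList;
    import java.util.List;

    /** Sketch of building a new node's routing map from the hops along the path
     *  gateway -> ... -> N' (the best existing approximation of the new ID). */
    public class RouteMapBootstrap {
        interface PathHop { List<String> routeTableLevel(int level); }

        /** Level i of the new map starts as a copy of level i of the i-th hop's table. */
        static List<List<String>> build(List<PathHop> pathFromGateway) {
            List<List<String>> newMap = new ArrayList<>();
            for (int i = 0; i < pathFromGateway.size(); i++) {
                newMap.add(new ArrayList<>(pathFromGateway.get(i).routeTableLevel(i)));
            }
            return newMap;
        }
    }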
Dynamic Insertion Example
[Figure: new node 0x143FE joins through gateway 0xD73FF; insertion messages hop across the existing mesh of nodes (0x779FE, 0xA23FE, 0x6993E, 0x243FE, …), with links annotated by routing level 1–4]
Dynamic Root Mapping
• Problem: choosing a root node for every object
  • Deterministic over network changes
  • Globally consistent
• Assumptions
  • All nodes with the same matching suffix contain the same null/non-null pattern in the next level of their routing maps
  • Requires: consistent knowledge of nodes across the network
PRR Solution
• Given a desired ID N:
  • Find the set S of existing network nodes n matching the most suffix digits with N
  • Choose Si = the node in S with the highest-valued ID
• Issues:
  • The mapping must be generated statically using global knowledge
  • It must be kept as hard state in order to operate in a changing environment
  • The mapping is not well distributed; many of the network's nodes get no mappings
Tapestry Solution
• Globally consistent distributed algorithm:
  • Attempt to route to the desired ID Ni
  • Whenever a null entry is encountered, choose the next "higher" non-null pointer entry (see the sketch after this list)
  • If the current node S is the only non-null pointer in the rest of the route map, terminate the route: f(N) = S
• Assumes:
  • Routing maps across the network are up to date
  • Null/non-null properties are identical at all nodes sharing the same suffix
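A sketch of the "next higher non-null entry" rule above for a single routing-table level: scan upward from the desired digit, wrapping around the digit space, until a non-null entry is found. The class and array representation are illustrative assumptions.

    /** Sketch of surrogate digit selection described above: given one level of the routing
     *  table (indexed by digit value, null where empty), pick the desired digit if present,
     *  otherwise the next higher non-null entry, wrapping around the digit space. */
    public final class SurrogateSelect {
        private SurrogateSelect() {}

        /** Returns the chosen next hop, or null if the whole level is empty
         *  (the current node then terminates the route as the root f(N)). */
        public static String nextHop(String[] level, int desiredDigit) {
            int base = level.length;                      // e.g. 16 for hex digits
            for (int offset = 0; offset < base; offset++) {
                String entry = level[(desiredDigit + offset) % base];
                if (entry != null) return entry;
            }
            return null;
        }
    }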
Analysis
Globally consistent deterministic mapping
• Null entry ⇒ no node in the network with that suffix
• Consistent map ⇒ identical null entries across the route maps of nodes with the same suffix
Additional hops compared to the PRR solution:
• Reduces to the coupon collector problem, assuming random name distribution
• With n ln(n) + cn entries, P(all coupons) = 1 − e^(−c)
• For n = b, c = b − ln(b): P(all coupons with b^2 nodes left) = 1 − b/e^b, i.e. failure probability b/e^b ≈ 1.8×10^−6 for b = 16
• # of additional hops ≤ log_b(b^2) = 2
Distributed algorithm with minimal additional hops
Dynamic Mapping Border Cases
• Node vanishes undetected
  • Routing proceeds on the invalid link, fails
  • No backup router, so proceed to surrogate routing
• Node enters the network undetected; messages going to the surrogate node instead
  • New node checks with the surrogate after all such nodes have been notified
  • Route info at the surrogate is moved to the new node
SPAA slides follow
Network Assumption
• Nearest neighbor is hard in a general metric
• Assume the following:
  • A ball of radius 2r contains only a factor of c more nodes than a ball of radius r
  • Also, b > c^2
  • [Both assumed by PRR]
• Start knowing one node; allow distance queries
Algorithm Idea
• Call a node a level-i node if it matches the new node in i digits
• The whole network is contained in a forest of trees rooted at the highest possible level i_max
• Let list[i_max] contain the roots of all trees; then, starting at i_max, while i > 1:
  • list[i-1] = getChildren(list[i])
• Certainly, list[i] contains level-i neighbors (a sketch follows)
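A sketch of the level-by-level descent above: start from the tree roots at the highest level and expand each list of level-i nodes into its children one level down. The Node interface and the children() call stand in for real queries to remote nodes; the structure otherwise follows the list[i-1] = getChildren(list[i]) loop on the slide.

    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    /** Sketch of the descent above; children() is a stand-in for a remote query. */
    public class NeighborSearch {
        interface Node { Set<Node> children(int level); }

        static List<Set<Node>> descend(Set<Node> roots, int iMax) {
            List<Set<Node>> list = new ArrayList<>();
            for (int i = 0; i <= iMax; i++) list.add(new HashSet<Node>());
            list.set(iMax, roots);                          // list[iMax]: roots of all trees
            for (int i = iMax; i > 1; i--) {
                Set<Node> next = new HashSet<>();
                for (Node n : list.get(i)) next.addAll(n.children(i));   // getChildren(list[i])
                list.set(i - 1, next);                      // list[i-1]: next level's candidates
            }
            return list;                                    // list[i] contains level-i neighbors
        }
    }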
We Reach The Whole Network
[Figure: the descent from node 0xEF34 reaches every node in the mesh (0xEF97, 0xEF32, 0xE399, 0xEF37, …), following links at levels 1–4]