
MOAT: A Multi-Object Assignment Toolkit



  1. MOAT: A Multi-Object Assignment Toolkit Haifeng Yu Intel Research Pittsburgh / CMU Joint work with: Phillip B. Gibbons Intel Research Pittsburgh

  2. Background • Availability has become a principal design goal: • 0.1% improvement → $2M / year for Amazon and eBay [internetweek.com] • One major focus of 8 OSDI’04 papers (out of 27) • Two orthogonal efforts: • Robustness of lower-level system components • Example: disks, individual machines, Internet routing • Higher-level redundancy • Example: data replication • This talk focuses on higher-level redundancy Haifeng Yu, Intel Research Pittsburgh / CMU

  3. High Availability via Replication • Large amounts of data accessed by many users: • Distributed file systems • Network monitoring (PIER, SDIMS, IRISLOG) • Index databases for search engines (Google, p2p) • Scientific / medical databases • Data replicated across multiple machines • Object: The unit of replication • File, file block, database table, database tuple, inverted index for a certain keyword Haifeng Yu, Intel Research Pittsburgh / CMU

  4. Multi-object Accesses • Many accesses request multiple objects • Compiling a project • Writing a paper in LaTeX • Asking for aggregates of network conditions • Searching for web pages containing multiple keywords • Availability of a single object can be misleading: • An access requesting 1,000 objects can observe up to 1,000 times higher unavailability • There’s more subtlety... Haifeng Yu, Intel Research Pittsburgh / CMU

  5. A Simple Example • Compile a small project with four files, each with two replicas: A, A, B, B, C, C, D, D • Four machines fail independently with the same probability; each holds two files • Which assignment gives better availability: the partitioned layout {A,B} {A,B} {C,D} {C,D} or the spread layout {A,C} {A,B} {C,D} {B,D}? [Figure: the two candidate assignments; the partitioned one is marked better] • Assignment matters because objects are now correlated Haifeng Yu, Intel Research Pittsburgh / CMU
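The availability gap in this example can be checked exactly by enumerating all machine-failure patterns. A minimal sketch in Python (not part of MOAT; the failure probability p = 0.2 and the spread layout's exact machine contents are assumptions made to mirror the figure):

```python
# Exact availability of the two candidate assignments from the slide,
# enumerated over all 2^4 machine-failure patterns.
from itertools import product

p = 0.2  # assumed per-machine failure probability, matching the talk's simulations

# machine index -> objects it holds (each object has 2 replicas)
PTN    = [{"A", "B"}, {"A", "B"}, {"C", "D"}, {"C", "D"}]
SPREAD = [{"A", "C"}, {"A", "B"}, {"C", "D"}, {"B", "D"}]  # assumed layout

def availability(assignment, t):
    """Prob[ >= t of the 4 objects have at least one live replica ]."""
    total = 0.0
    for failed in product([False, True], repeat=len(assignment)):
        prob = 1.0
        for f in failed:
            prob *= p if f else 1 - p
        live = set().union(*(objs for objs, f in zip(assignment, failed) if not f))
        if len(live) >= t:
            total += prob
    return total

for t in (4, 3):
    print(f"t={t}: PTN={availability(PTN, t):.4f}  SPREAD={availability(SPREAD, t):.4f}")
# t=4: PTN wins (0.9216 vs 0.8704); t=3: the ranking flips (0.9216 vs 0.9728)
```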

  6. A Simple Example - Continued • Suppose the user is happy even if only three objects are available (e.g., when computing an average) [Figure: the same two assignments; now the spread layout is marked better] • Assignment makes a difference • Even when using the same machines (same amount of redundancy / resources) • Can easily be a multiple-nine difference Haifeng Yu, Intel Research Pittsburgh / CMU

  7. Goal and Contributions • MOAT (Multi-Object Assignment Toolkit): • Goal: High availability for multi-object accesses • Key issue: Replica assignment • Contributions: • First to observe the importance of replica assignment • Strong theoretical results regarding best and worst assignments • Practical designs to approximate optimal assignments • MOAT toolkit implementation for replica assignments Haifeng Yu, Intel Research Pittsburgh / CMU

  8. Outline • Motivation and MOAT contributions ✓ • System model and case studies of existing systems • Theoretical results • Designs for approximating optimal assignments • Designs for mixed accesses • Conclusions Haifeng Yu, Intel Research Pittsburgh / CMU

  9. Assumptions for This Talk • Assume: • Replication (no erasure coding) • Crash failures (no Byzantine failures) • Eventual consistency (no quorum or voting) • Most of our results hold without these assumptions • Assume same replication degree for all objects • We have results for different replication degrees as well • Talk to me if interested in the more complete story... Haifeng Yu, Intel Research Pittsburgh / CMU

  10. MOAT Architecture Overview [Figure: applications (file system, p2p DB, search engine, network monitoring) sit on top of a storage system (replication / repair / load balancing / naming / assignment), which runs over raw data on distributed machines or disks; MOAT provides the Data API (object create / delete / read / write) and the Control API (assignment policy)] Haifeng Yu, Intel Research Pittsburgh / CMU

  11. System Model [Figure: objects A–D, each with two replicas, mapped onto machines] • Basic system model: • N objects, each with k replicas • Load balancing among all machines • Machines fail independently with the same probability • An assignment is a mapping: replica → machine, for all Nk replicas Haifeng Yu, Intel Research Pittsburgh / CMU

  12. Some Simple Assignments • PTN: partition assignment • Used in most deployments of Coda [Satyanarayanan et al.’90] [Figure: for k = 2, objects A B C D E F spread across one row of machines, mirrored identically on a second row] • RAND: pick a random machine for each replica • Similar to the Google File System [Ghemawat et al.’03] Haifeng Yu, Intel Research Pittsburgh / CMU
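A minimal sketch of the two baselines (the function names and signatures are mine, not the MOAT API):

```python
import random

def ptn_assign(num_objects, k, num_machines):
    """PTN: machines are partitioned into groups of k; every machine in a
    group holds an identical set of objects (object j -> group j mod g)."""
    g = num_machines // k  # number of groups
    return {j: [(j % g) * k + r for r in range(k)] for j in range(num_objects)}

def rand_assign(num_objects, k, num_machines, rng=random.Random(0)):
    """RAND: each object's k replicas go to k distinct machines chosen at random."""
    return {j: rng.sample(range(num_machines), k) for j in range(num_objects)}
```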

  13. Assignment in Chord [Stoica et al.’01] • DHTs: • Hash machine IP to get machine id • Assignment in Chord: • Sliding window: an object’s replicas go to the consecutive machines following its hash point • Neither PTN nor RAND [Figure: Chord ring with machine ids 080, 090, 098, 101, 104, 120; hash(A) = 95, so A’s replicas land on the machines immediately following 95 (098, 101, ...)] Haifeng Yu, Intel Research Pittsburgh / CMU
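A sketch of the sliding-window placement (a simplification with assumed names; a real Chord node would also hash the object to get its ring position, which I pass in directly here for transparency):

```python
import bisect

RING = 2**16  # assumed toy id space

def chord_assign(obj_pos, machine_ids, k):
    """Place k replicas on the k machines that follow obj_pos clockwise."""
    ids = sorted(machine_ids)
    start = bisect.bisect_left(ids, obj_pos % RING)
    return [ids[(start + i) % len(ids)] for i in range(k)]

# Mirroring the figure: hash(A) = 95 on machines {80, 90, 98, 101, 104, 120}
print(chord_assign(95, [80, 90, 98, 101, 104, 120], 2))  # -> [98, 101]
```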

  14. Assignment in CAN [Ratnasamy et al.’01] • Hash the object k times • CAN uses a similar approach • Similar to RAND • But machines may hold slightly different numbers of objects [Figure: ring with machine ids 080, 090, 098, 101, 104, 120; hash1(A) = 95 places the first replica of A] Haifeng Yu, Intel Research Pittsburgh / CMU

  15. Assignment in CAN [Ratnasamy et al.’01] • Hash the object k times • CAN uses a similar approach • Similar to RAND • But machines may hold slightly different numbers of objects [Figure: same ring; hash2(A) = 119 places the second replica of A] Haifeng Yu, Intel Research Pittsburgh / CMU

  16. Assignment in CAN [Ratnasamy et al.’01] • Hash the object k times • CAN uses a similar approach • Similar to RAND • But machines may hold slightly different numbers of objects [Figure: same ring; hash1(B) = 84 and hash2(B) = 100 place B’s two replicas] Haifeng Yu, Intel Research Pittsburgh / CMU

  17. Which assignment should we use? • MOAT Goal: Improve availability of multi-object accesses • If an access requests n (n ≤ N) objects, what if only x are available? • Threshold-based success definition: • If x ≥ t, user happy → Available • If x < t, too low confidence → Unavailable • Availability for an access defined as: • Prob[ ≥ t objects available out of the n requested objects] Haifeng Yu, Intel Research Pittsburgh / CMU
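This success definition is easy to estimate by Monte Carlo simulation. A sketch (the helper name and signature are assumptions, not from the talk):

```python
import random

def estimate_availability(assignment, requested, t, p, trials=100_000,
                          rng=random.Random(1)):
    """Estimate Prob[>= t of the requested objects are available] under
    independent machine failures with probability p.
    assignment: dict object -> list of machines holding its replicas."""
    machines = {m for reps in assignment.values() for m in reps}
    successes = 0
    for _ in range(trials):
        up = {m for m in machines if rng.random() >= p}  # surviving machines
        live = sum(1 for obj in requested if any(m in up for m in assignment[obj]))
        successes += (live >= t)
    return successes / trials
```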

  18. Examples of t • t = n • File systems • Search for terrorist images in an image database • t close to n • Query for the top-10 most-loaded machines on PlanetLab • t not close to n • Sampling with confidence Haifeng Yu, Intel Research Pittsburgh / CMU

  19. Outline • Motivation and MOAT contributions ✓ • System model and case studies of existing systems ✓ • Theoretical results • Designs for approximating optimal assignments • Designs for mixed accesses • Conclusions Haifeng Yu, Intel Research Pittsburgh / CMU

  20. Formal Results • For an access requesting all N objects • Theorem: Among all assignments, when t = N: • PTN is best (within a constant) • RAND is worst (within a constant) • Difference is about a factor of c (c is #objects / machine) • Theorem: Among all assignments, when t = c+1 < N: • PTN is worst • RAND is best (within a constant) • Difference is even larger Haifeng Yu, Intel Research Pittsburgh / CMU

  21. Numerical Examples (from Simulation) • 40,000 objects, 4 replicas each, 400 machines, fail prob = 0.2 [Plot: unavailability vs. threshold for PTN, Chord, and RAND (CAN), with the unavailability of a single object for reference; roughly c times difference if p is small, where c is # obj/machine] Haifeng Yu, Intel Research Pittsburgh / CMU

  22. A Spectrum of Assignments • 40,000 objects, 4 replicas each, 400 machines, fail prob = 0.2 [Plot: unavailability vs. threshold; assignments between PTN and RAND (CAN) form a spectrum] Haifeng Yu, Intel Research Pittsburgh / CMU

  23. More Formal Arguments • The tradeoff is fundamental: • Impossible to achieve the best of RAND and PTN simultaneously • Previous results are only for accesses requesting all N objects • Similar results hold for accesses requesting n (n ≤ N) objects • But each machine may not be filled to capacity: • For PTN, use as few machines as possible • For RAND, use as many machines as possible • I have more... talk to me if you are interested Haifeng Yu, Intel Research Pittsburgh / CMU

  24. Access Requesting 500 Objects • 40,000 objects, 4 replicas each, 400 machines, fail prob = 0.2 [Plot: unavailability vs. threshold for RAND (CAN), Chord, and PTN] Haifeng Yu, Intel Research Pittsburgh / CMU

  25. Outline • Motivation and MOAT contributions ✓ • System model and case studies of existing systems ✓ • Theoretical results ✓ • Designs for approximating optimal assignments • Designs for mixed accesses • Conclusions Haifeng Yu, Intel Research Pittsburgh / CMU

  26. Design of Replica Assignment • Trivial in a static / centralized environment • Challenging in a dynamic environment: • We may not have global knowledge with many objects and many machines • Basic solution: Consistent hashing • But some re-design is necessary Haifeng Yu, Intel Research Pittsburgh / CMU

  27. Approximating RAND • Multi-hash DHT: • Hash the object k times • As in CAN [Figure: ring with machine ids 080, 090, 098, 101, 104, 120; hash1(B) = 84 and hash2(B) = 100 place B’s two replicas] Haifeng Yu, Intel Research Pittsburgh / CMU
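A sketch of the multi-hash assignment (an assumed simplification; note that two salted hashes can land in the same machine's region, which is one reason machines may end up with slightly different numbers of objects):

```python
import bisect, hashlib

RING = 2**16  # assumed toy id space

def multi_hash_assign(obj, sorted_machine_ids, k):
    """Replica i of obj goes to the machine owning hash_i(obj), where
    hash_i is the base hash salted with the replica index i."""
    owners = []
    for i in range(k):
        pos = int(hashlib.sha1(f"{i}:{obj}".encode()).hexdigest(), 16) % RING
        j = bisect.bisect_left(sorted_machine_ids, pos) % len(sorted_machine_ids)
        owners.append(sorted_machine_ids[j])
    return owners
```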

  28. Approximating PTN • Chord does not achieve PTN [Figure: Chord ring with machine ids 080, 090, 098, 101, 104, 120 and hash(A) = 95, illustrating that sliding-window placement does not partition machines into fixed groups] Haifeng Yu, Intel Research Pittsburgh / CMU

  29. Approximating PTN • Chord does not achieve PTN • Group DHT: • (Arbitrarily) group machines into groups of size k [Figure: machines grouped in pairs at ring positions 090, 101, 120; hash(A) = 95, so A’s replicas go to both machines of the successor group] Haifeng Yu, Intel Research Pittsburgh / CMU
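A sketch of Group DHT placement (details assumed: here each group joins the ring at the hash of its first member, and an object's k replicas go to every machine of the successor group, recovering a PTN-style assignment on top of consistent hashing):

```python
import bisect, hashlib

RING = 2**16  # assumed toy id space

def ring_pos(key):
    return int(hashlib.sha1(key.encode()).hexdigest(), 16) % RING

def group_assign(obj, groups):
    """groups: list of k-machine groups; returns the group that stores obj."""
    ring = sorted((ring_pos(g[0]), g) for g in groups)  # one ring point per group
    ids = [pos for pos, _ in ring]
    i = bisect.bisect_left(ids, ring_pos(obj)) % len(ring)
    return ring[i][1]  # all k machines in this group hold obj -> PTN-like
```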

  30. Node Join and Leave in Group DHT • Maintain r rendezvous points in the DHT • Diminishing Chord [Karger et al.’04] / ReDiR [Karp et al.’04] • A new node reports to a random rendezvous point • If a group can be formed, it joins the DHT • Two options upon node leave: • Dismiss the group and delete it from the DHT • The group waits to recruit a new node • Groups use the rendezvous point to decide Haifeng Yu, Intel Research Pittsburgh / CMU
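A sketch of the join side of this protocol (the class and its behavior are my assumptions about the mechanism, not the paper's specification):

```python
class RendezvousPoint:
    """Collects joining nodes; once k are waiting, they form a group that
    joins the DHT as a single virtual node."""
    def __init__(self, k):
        self.k = k
        self.waiting = []

    def report_join(self, node):
        self.waiting.append(node)
        if len(self.waiting) == self.k:
            group, self.waiting = self.waiting, []
            return group  # caller inserts this group into the DHT
        return None       # not enough nodes yet; the node waits here
```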

  31. Complexity Analysis [Table: not captured in the transcript] Haifeng Yu, Intel Research Pittsburgh / CMU

  32. Outline • Motivation and MOAT contributions ✓ • System model and case studies of existing systems ✓ • Theoretical results ✓ • Designs for approximating optimal assignments ✓ • Designs for mixed accesses • Conclusions Haifeng Yu, Intel Research Pittsburgh / CMU

  33. Mixture of Queries • The previous design is only for a single access requesting all N objects • PTN if t is close to N • RAND if t is far from N • But there are other accesses • Requesting n (n < N) objects with threshold t • How does t change with n? • Infinite possibilities • We focus on 4 large categories Haifeng Yu, Intel Research Pittsburgh / CMU

  34. Four Application Scenarios • Strict accesses: t ≈ n • Loose accesses: t < n [Figure: the four scenarios formed by strict/loose crossed with small/large n, covered on the next four slides] Haifeng Yu, Intel Research Pittsburgh / CMU

  35. Loose for both small and large n • Goal: • Approach RAND for both small and large n • Design: • Multi-hash DHT [Figure: ring with hash1(B) = 84 and hash2(B) = 100 placing B’s two replicas] Haifeng Yu, Intel Research Pittsburgh / CMU

  36. Loose for small n; Strict for large n • Goal: • Approach RAND for small n • Approach PTN for large n • Design: • Group DHT [Figure: machines grouped in pairs on the ring] Haifeng Yu, Intel Research Pittsburgh / CMU

  37. Strict for both small and large n • Goal: • Approach PTN for both small and large n • Assume accesses are tree accesses • Design: • Group DHT with item-balancing [Karger et al.’04] [Figure: grouped machines on the ring with hash(A) = 95] Haifeng Yu, Intel Research Pittsburgh / CMU

  38. Strict for small n; Loose for large n • Goal: • Approach PTN for n < R • Approach RAND for n >> R • Design: • Multi-hash DHT • But cluster objects into clusters of constant size R [Figure: objects A and B clustered together; hash1(AB) = 84 and hash2(AB) = 100 place the cluster’s two replicas] Haifeng Yu, Intel Research Pittsburgh / CMU
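A sketch of the clustered variant, reusing multi_hash_assign from the slide-27 sketch (the cluster-naming scheme is an assumption): hashing per cluster keeps any n < R related objects on the same k machines, while different clusters spread independently.

```python
def clustered_assign(obj_index, R, sorted_machine_ids, k):
    """Objects with the same obj_index // R share a cluster; the whole
    cluster is hashed k times, so members stay together (PTN-like within
    a cluster, RAND-like across clusters)."""
    cluster = obj_index // R
    return multi_hash_assign(f"cluster-{cluster}", sorted_machine_ids, k)
```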

  39. Simulation Results for Strict Accesses • Here an access needs all n objects to be successful • 400 machines, fail prob = 0.2, 40,000 objects, 4 replicas / object [Plot: unavailability vs. number (n) of objects requested by an access] Haifeng Yu, Intel Research Pittsburgh / CMU

  40. Simulation Results for Loose Accesses • Here an access needs only t = n - 150 objects to be successful • 400 machines, fail prob = 0.2, 40,000 objects, 4 replicas / object [Plot: unavailability vs. number (n) of objects requested by an access] Haifeng Yu, Intel Research Pittsburgh / CMU

  41. Current Status • Waiting for paper deadlines • Finishing the MOAT implementation • Evaluation on the IrisLog trace and file system traces Haifeng Yu, Intel Research Pittsburgh / CMU

  42. Related Work • Multi-object accesses rarely addressed • CFS [Dabek et al.’01] focuses on individual file blocks • Chain replication [Renesse et al.’04] considers a single data object • A long list... • Replica assignment largely ignored • Different DHTs (e.g., Chord, Pastry, CAN) use dramatically different replica assignments: effects not understood / studied • Replica placement [Douceur et al.’01, Li et al.’99, Qiu et al.’01, Venkataramani et al.’01, Yu et al.’04] well studied: • Typically for machines at different locations in the network • Machines are heterogeneous • Those approaches do not apply to replica assignment Haifeng Yu, Intel Research Pittsburgh / CMU

  43. Conclusions • Availability becoming key design goal • Multi-object access availability dramatically different from single-object availability • MOAT Contributions: • First to observe the importance of replica assignment • Strong theoretical results regarding the best and worst assignments • Practical designs to approximate optimal assignments • MOAT toolkit implementation Haifeng Yu, Intel Research Pittsburgh / CMU

  44. My Other Recent Work • Om [NSDI’04]: • Consistent and automatic replica regeneration • Regenerate from any single replica rather than a majority • Signed quorum systems [PODC’04]: • Constant quorum size at the cost of small prob of inconsistency • Node failure characteristics in WAN [WORLDS’04]: • Answer subtle questions regarding real-world failure properties Haifeng Yu, Intel Research Pittsburgh / CMU

  45. Haifeng Yu, Intel Research Pittsburgh / CMU

  46. Erasure Coding • Encode the object into k fragments such that any m (m < k) out of the k fragments can reconstruct the object • RAID techniques are special cases • Replication is a special case where m = 1 Haifeng Yu, Intel Research Pittsburgh / CMU

  47. Example Revisited • Need four files to compile [Figure: the two assignments from the earlier example, one marked better] • Can we treat A, B, C, D as a single object and use erasure coding, so that all files can be reconstructed from any 4 out of 8 fragments? • Erasure coding is hard to apply across large amounts of data • Updating any portion of the data requires updating k - m + 1 fragments, roughly the size of the original data • We cannot use erasure coding across 1,000 files Haifeng Yu, Intel Research Pittsburgh / CMU
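A worked instance of that update cost (numbers assumed for illustration): with the four files coded into k = 8 fragments, any m = 4 of which reconstruct, each fragment is 1/4 of the total data, so editing a single file forces rewriting k - m + 1 = 5 fragments, i.e., about 5/4 of the whole corpus. Coding 1,000 files together would make one edit rewrite on the order of the entire 1,000-file data set.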

  48. Threshold Semantics and Erasure Coding In short, they are different, orthogonal concepts Haifeng Yu, Intel Research Pittsburgh / CMU

  49. Numerical Examples (from Simulation) • 40,000 objects, 4 replicas each, 400 machines, fail prob = 0.2 [Plot: unavailability vs. threshold for PTN, Chord, CRAND (100), CRAND (10), and RAND (CAN); roughly c times difference if p is small, where c is # obj/machine] Haifeng Yu, Intel Research Pittsburgh / CMU
