1 / 41

P2P Databases

P2P Databases. P2P Today. edonkey. bittorrent. pastry. jxta. can. fiorana. napster. freenet. united devices. open cola. ?. aim. ocean store. netmeeting. farsite. gnutella. icq. ebay. morpheus. limewire. seti@home. bearshare. uddi. grove. jabber. popular power. kazaa.

miach
Télécharger la présentation

P2P Databases

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. P2P Databases

  2. P2P Today edonkey bittorrent pastry jxta can fiorana napster freenet united devices open cola ? aim ocean store netmeeting farsite gnutella icq ebay morpheus limewire seti@home bearshare uddi grove jabber popular power kazaa folding@home tapestry mojo nation process tree chord

  3. Object representation and storage Objects Attributes : Name , Artist, Album , Genre Pointer to object

  4. P2P vs. Distributed DBMS • Transactions • Distributed Query Optimization • Interoperation of heterogeneous data sources • Reliability/failure of nodes Traditional DDBMS Issues: Complex features do not scale

  5. P2P vs. Distributed DBMS Example application: file-sharing • Simple data model and query language • No complex query optimization • Easy interoperation • No guarantee on quality of results • Individual site availability unimportant • Local updates • No transactions • Network partitions OK Simple Amenable to large-scale network of PCs

  6. Example: file sharing  ? ? ? ? • Challenge #1: Performance • Asking everyone is expensive! • If I am smart, I only need to ask one peer • How can I be smart? File X?

  7. Search in P2P Both are useful to study • System can control: • Connections made by users/topology • Data placement • Query type • Tight control: “Structured” • Efficient, comprehensive • Loose control: “Unstructured” • Inefficient, not comprehensive, simple, expressive • Used in real life

  8. Centralized • Napster model • Benefits: • Efficient search • Limited bandwidth usage • No per-node state • Drawbacks: • Central point of failure • Limited scale Bob Alice Judy Jane

  9. http://www.snocap.com/

  10. Unstructured – Query Flooding = query source = forward query = processed query = found result = forward response

  11. Problems with unstructured Structured systems address these problems • Inefficient • Query messages are flooded • Even if routing is intelligent, worst case load is still O(n), where n is # nodes in system • Not comprehensive • If I do not get a result for my query, is it because none exists? • (Of course, many optimizations are possible…)

  12. Distributed Hash Table (DHTs) • Model: • Key/Object pair, the key is hashed to get an ID • Example: • Objects are files • The key is the content of the file • The ID is the hash of the file contents • Single operation: Lookup(ID) • Input: integer ID • Output: the object with the corresponding ID

  13. Identifiers • IDs are m-bit integers • Nodes are also assigned IDs • Commonly assigned by hashing a node’s IP address, although many problems with this • An object is stored on the node with the smallest ID greater than the object’s ID • This node is called the successor of the object’s ID • IDs are arranged on a circle, so 0 > 2m-1

  14. Data Placement 6 2 0 • Nodes: • 0 • 1 • 3 m = 3 7 1 1 6 6 • Data: • 1 • 2 • 6 2 2 3 5 4

  15. Connections • Distance • 20 • 21 • …. • 2m-1 0 7 1 “Finger pointers” 6 2 3 5 4

  16. Query • Lookup(objectID) • objectID is typically the ID of the object you are looking for, but not necessarily • Approach: • Find the predecessor of the object • I.e. the node with the largest ID that is smaller than the object ID • Return the successor of the predecessor

  17. Query Example • Say node 0 wants to find the object with ID = 7 • For simplicity, we will assume a node exists at every ID in the space

  18. Query Example 0 Node 0: Lookup(7) 7 1 Node 0: FindPred (7) 6 2 3 5 4

  19. Query Example 0 Node 4: FindPred(7) 7 1 6 2 3 5 4

  20. Query Example 0 Node 6: FindPred(7) 7 1 Node 6 is predecessor Return successor node 7 6 2 3 5 4

  21. Query characteristics • With high probability, a query can be answered by contacting O(log N) nodes • N total nodes in the network • Efficient! • Also notice: if an object with the ID exists in the network, it will be found • Comprehensive! • State is also O(log N) in size

  22. Query characteristics • Note that finger pointers are not required for correct operation • Only successor pointers are needed • But then cost of query increases • O(N) in worst case

  23. Advantages of Structured? • Scalability/Efficiency • load grows with O(log N) • Comprehensiveness

  24. Disadvantages? (cont) • Availability of Data • If a node dies suddenly, what happens to the data it was storing? • MUST replicate data across multiple nodes • Query Language • How can we express keyword queries efficiently? • Many useful applications require different languages

  25. Magnolia Current approach: Hash each keyword separately and store pointers at h(keyword) Seven h(some) 1100100101 Innovation h(innovation) Myths “Seven Innovation Myths” 1100100101 “Innovation” h(myths) h(title)

  26. Resulting Distribution

  27. Prefix hashing Prefix Hashing Innovation hP(innovation) hP = m’ bit hash function 100 …………. m’ m bits Partitions network into ~ n/2m’ separatesibling groups n = nodes, m’ partitioning factorFor m’=12, n= 1 million, ~ 256 nodes will share same prefix Assumption: h is uniformly distributed

  28. Balancing Innovation Balanced over the sibling group 100 Sibling group ID=100 All siblings in a group share the same prefix

  29. Lookup Group Broadcast or Multicast Replies O(1) Keyword Insert Keyword hP SiblingGroup ID Random Sibling Locate a sibling node via SIFT

  30. Advantages • Good Balancing Properties

  31. Advantages • Low Traffic Load on nodes for popular queries • Quick Lookup • Popularity Ranking of Objects • Distributed Replication for resilience

  32. Implementing Magnolia • Developed on top of a chord clone written in Python • If you’re going to write a peer-to-peer app, why not leverage existing modules and libraries? • Challenge: How do we implement group-based stores and queries without requiring additional network maintenance?

  33. Chord’s Finger Table • A chord node maintains a finger table of M IP’s pointing to nodes ahead of it in the ring. • A pointer at index i is the successor of node id + (2^i-1). This lets us reach any node in the network in O(log M) hops • We use the M’ most significant bits in a node’s id to indicate it’s group. We want to reach any group in O(log M’) hops. • Do we need another table? • Nope. The last M’ entries in our finger table provide this.

  34. Talking to Siblings • How do we propagate queries through the group? • Naïve solution: send to our predecessor and successor. • A better solution: We can send a query throughout the group by treating the sibling group as a tree.

  35. Sibling Tree N/N’ = 16; M/M’ = 4 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 0+2^3 0+1 8 1 1+2^2 8+1 8+2^2 1+1 2 12 9 5 2+2^1 2+1 5+1 5+2^1 9+1 9+2^1 12+1 12+2^1 10 11 7 3 6 4 13 14 Every edge can be found in the finger table! 14+2^0 15

  36. Sibling Tree Problems • Problems: • Not every possible node will exist • Not every node will have results to report • The query maker needs to know when the search is done • But we’re okay! • Nodes can determine if a child sub-tree is dead • Even if a child node in our sibling table is of a higher ID than expected • its sub-tree contains all existing descendents of the expected id • we can predict when a child is in a sibling our ancestor’s tree

  37. Bigger Problems • What if a pointer in our finger table fails? • We either have to find the successor to it’s id or fail to query the sub-tree • What if the lowest ID node isn’t the root of our tree? • Some of our edges won’t be in our finger table

  38. Popularity queries

  39. Yulania , Demo

  40. BitTorrent

  41. SplitStream

More Related