This document explores the intersection of Information Retrieval (IR) and Peer-to-Peer (P2P) networking, detailing fundamental concepts, models, and research issues. It defines Information Retrieval, discusses the structure and characteristics of P2P networks, and investigates search strategies such as broadcasting in unstructured P2P systems and consistent hashing in structured ones. The paper also highlights challenges like load balancing, redundancy, and ranking results. It aims to provide insights into effective resource sharing and querying methods within decentralized environments.
Information Retrieval on P2P Networking Willie Yang November 2004
What is Information Retrieval? • Select and return to the user desired documents from a large set of documents, in accordance with criteria specified by the user [Salton, 1989] • Basic model [Belkin, 1992]
Research Issues in the Basic Model
1. Format, Source, Type
2. Indexing
3. Query Expression
4. Query Model
5. Ranking
6. Feedback
More about Information Retrieval • Concepts related to searching: browsing, filtering • Technologies related to IR: information extraction, question answering, classification
What is Peer-to-Peer Networking? • Peer-to-peer is a way of structuring distributed applications such that the individual nodes have symmetric roles. Rather than being divided into clients and servers, each with quite distinct roles, in P2P applications a node may act as both a client and a server. [IETF/IRTF, 2004]
Characteristics of P2P (1) • Multiple peers participate in the network • The number of roles is small. • The number of peers is typically large. • Every peer owns some resources and pays for its participation by providing access to those resources. • Distributed and decentralized, with no distinguished roles • Autonomous, self-controlled, ad hoc participation • Dynamic (peers come and go freely) • Peers rely very little on the underlying infrastructure → they do most things on their own.
Characteristics of P2P (2) • Differences from distributed computing • More dynamic (nodes join or leave, rather than merely fail or not) • Much larger number of nodes • Differences from distributed databases and grid computing • No centralized mechanism (i.e., no integrator, dispatcher, etc.) • Research highlights • Resource sharing • Autonomy • Load balancing
Search on Unstructured P2P (figure: a peer asks "Where is X?") • Example: Gnutella • Solution: Broadcasting + TTL • Constraint: search results are not guaranteed • Research topics: exploring strategies, linking strategies, routing strategies
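The broadcasting-plus-TTL idea can be sketched as follows. This is an illustrative model, not Gnutella's actual wire protocol; the `Node` class and its fields are hypothetical names introduced for the example.

```python
class Node:
    """A minimal peer: holds some objects and knows its neighbors."""
    def __init__(self, name, objects=()):
        self.name = name
        self.objects = set(objects)
        self.neighbors = []

def flood_search(node, query, ttl, visited=None):
    """Broadcast a query to neighbors, decrementing a TTL so the flood
    eventually dies out. Returns the nodes found to hold the object.
    Because the TTL bounds the search radius, a hit is not guaranteed
    even when the object exists somewhere in the network."""
    if visited is None:
        visited = set()
    hits = []
    if node in visited:
        return hits
    visited.add(node)
    if query in node.objects:           # local hit
        hits.append(node)
    if ttl > 0:
        for peer in node.neighbors:     # forward with decremented TTL
            hits.extend(flood_search(peer, query, ttl - 1, visited))
    return hits
```

On a chain a–b–c–d with the object at d, a TTL of 3 finds it while a TTL of 2 stops one hop short, which illustrates the "no search guarantee" constraint.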
Search on Structured P2P (figure: an 8-node Chord ring; a joining node is assigned a node id; object X is published at the node given by hash(X) = 3, and looked up the same way) • Example: Chord, a DHT-based P2P system • Solution: Consistent Hashing + Routing • Constraint: supports only key-value (exact-match) lookup • Research topics: topology and routing, efficiency
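The consistent-hashing rule in the figure can be sketched in a few lines. This is a minimal model of id assignment and successor lookup, not Chord's routing protocol (which uses finger tables for O(log n) hops); the small 8-bit id space is an assumption for illustration, whereas Chord uses 160-bit SHA-1 ids.

```python
import hashlib

RING_BITS = 8  # tiny id space for illustration; Chord uses 160-bit ids

def chord_id(key):
    """Consistent hashing: map a key (or node address) onto the ring."""
    digest = hashlib.sha1(key.encode()).digest()
    return int.from_bytes(digest, "big") % (2 ** RING_BITS)

def successor(node_ids, k):
    """The node responsible for id k is its clockwise successor on the ring."""
    nodes = sorted(node_ids)
    for n in nodes:
        if n >= k:
            return n
    return nodes[0]  # wrap around past the top of the ring
```

Publishing object X stores it at `successor(nodes, chord_id("X"))`; lookup recomputes the same hash, which is why only exact key-value lookups are supported.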
Keyword Search on Structured P2P (figure: a Chord ring in which each node stores the inverted list for one keyword, e.g., 台灣 (Taiwan), 資管 (IM), 網路 (network); the query "Where is 台灣 & 資管?" is routed to the nodes holding each keyword's list) • Example: Chord + Inverted List • Solution: Routing + Merge Sort • Constraints: (1) storage redundancy (2) unbalanced load → Zipf's law (3) single point of failure (4) huge traffic (5) hard to rank the results
Keyword Search in DHT-Based Peer-to-Peer Networks Yuh-Jzer Joung, Chien-Tse Fang, and Li-Wei Yang
Outline • Background • Some Preliminaries • The Hypercube Index Scheme • Simulation • Conclusions and Related Work
Our Hypercube Indexing Scheme • Assign each node an id: an r-bit string • Hash each keyword into the range [0, r-1] to construct a doc vector • Publish each doc to the node whose id equals its doc vector • Example (figure, r = 7): Hash(台灣) = 2, Hash(網路) = 0, Hash(資管) = 3; Doc1 (keyword 台灣) → 0010000; Doc2 (keywords 台灣, 網路) → 1010000; Doc3 (keywords 台灣, 網路, 資管) → 1011000
Hypercube (figure: the 16 nodes of a 4-cube, labeled 0000 through 1111) • An r-dimensional hypercube Hr(Vr, Er) has 2^r nodes. Each node u in Vr is represented by a unique r-bit binary string. • Two nodes u, v in Vr have an edge between them iff their labels differ in exactly one bit. • An r-dimensional hypercube can be constructed from two (r-1)-dimensional hypercubes.
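The adjacency rule is easy to check with bit arithmetic. A small sketch (helper names are ours, not from the paper):

```python
def is_edge(u, v):
    """Hypercube adjacency: node labels differ in exactly one bit,
    i.e., their XOR is a power of two."""
    diff = u ^ v
    return diff != 0 and diff & (diff - 1) == 0

def neighbors(u, r):
    """The r neighbors of node u in an r-dimensional hypercube,
    obtained by flipping each of its r bits in turn."""
    return [u ^ (1 << i) for i in range(r)]
```

For instance, 0100 and 1100 are adjacent (they differ only in the leading bit), and every node of an r-cube has exactly r neighbors.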
Spanning Binomial Tree • Search and broadcast in a hypercube can be done by traversing its spanning binomial tree.
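One standard construction of the binomial tree spanning an r-cube (rooted at node 0) gives node u as children all labels obtained by setting a bit strictly below u's lowest set bit. A sketch of a broadcast traversal under that construction (the function names are ours):

```python
def children(u, r):
    """Children of u in the binomial tree spanning the r-cube, rooted at 0:
    set any bit strictly below u's lowest set bit (all r bits for the root)."""
    low = r if u == 0 else (u & -u).bit_length() - 1
    return [u | (1 << i) for i in range(low)]

def broadcast_order(r):
    """Visit every node of the r-cube exactly once via the binomial tree,
    so a broadcast reaches all 2^r nodes without duplicate messages."""
    order, stack = [], [0]
    while stack:
        u = stack.pop()
        order.append(u)
        stack.extend(children(u, r))
    return order
```

Each node is generated by exactly one parent, so the traversal covers all 2^r nodes with 2^r - 1 messages.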
Subhypercube • The subhypercube of Hr(Vr, Er) induced by u, denoted Hr(u), is the subgraph G = (U, F) of Hr such that a node w ∈ Vr is in U if and only if w contains u (i.e., w has a 1 in every bit position where u does), and an edge e ∈ Er is in F if and only if both its endpoints are in U. (figures: H3, and H4(0100))
Outline • Background • Some Preliminaries • The Hypercube Index Scheme • Simulation • Conclusions
Our Index Scheme • A conceptual r-dimensional hypercube is built over the DHT to index objects. • Each object o with keyword set Ko = {w1, w2, …, wk} is mapped to a unique r-bit vector Fh(Ko) by a hash h: W → {0, 1, …, r-1}: each keyword sets the bit at its hashed position (e.g., h(w1) = 1, h(w2) = 6). • The node Fh(Ko) in the hypercube is responsible for indexing o.
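The mapping Fh can be sketched directly from its definition. The paper only requires some hash h: W → {0, …, r-1}; the choice of MD5 below is our assumption for illustration, as is the integer bit-vector representation.

```python
import hashlib

def fh(keywords, r):
    """Map a keyword set to its r-bit index vector (the paper's F_h):
    hash each keyword to a bit position in {0, ..., r-1} and set that bit.
    Distinct keywords may collide on the same bit when |K| approaches r."""
    vec = 0
    for w in keywords:
        pos = int(hashlib.md5(w.encode()).hexdigest(), 16) % r
        vec |= 1 << pos                 # the keyword's bit position
    return vec
```

A useful property falls out immediately: if K ⊆ K', then the bits of Fh(K) are a subset of the bits of Fh(K'), which is exactly what makes superset search a subhypercube traversal.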
Object Insert/Delete/Pin Search • To insert/delete an object o with keyword set Ko into the system: • Find the node Fh(Ko) that is responsible for o • Insert/delete the index information of o at that node • (figure: node u publishes object A with KA = {w1, w2}; Fh(KA) = 0101; the index table at node 0101 maps keyword sets such as {w1, w2} to entries {(A, u), …}. A pin search for any object with exactly {w1, w2} goes to that same single node.)
Superset Search • To search for objects that can be described by a keyword set K (i.e., objects o with Ko ⊇ K), we need only search the subhypercube induced by the node Fh(K). • E.g., to search for objects that can be described by KA = {w1, w2} with Fh(KA) = 0101, we search all nodes matching the pattern x1x1 (where x is a don't-care bit).
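Enumerating the subhypercube Hr(Fh(K)) amounts to fixing the 1-bits of Fh(K) and ranging over the free bits; a minimal sketch (function name is ours):

```python
def superset_nodes(mask, r):
    """All r-bit node ids whose bit pattern contains `mask` -- the nodes of
    the subhypercube H_r(mask) that a superset search for Fh(K) = mask visits."""
    free = [i for i in range(r) if not (mask >> i) & 1]
    out = []
    for combo in range(1 << len(free)):      # every assignment of free bits
        v = mask
        for j, bit in enumerate(free):
            if (combo >> j) & 1:
                v |= 1 << bit
        out.append(v)
    return out
```

For Fh(KA) = 0101 in a 4-cube, this yields the four nodes 0101, 0111, 1101, 1111, i.e., exactly the pattern x1x1. The search cost thus shrinks exponentially as the query carries more keywords.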
Flexible Superset Search • The spanning binomial tree of the subhypercube can be visited in various ways: • Top-down: general objects first • Bottom-up: specific objects first • Priority can also be assigned among nodes at the same depth • Note that the hypercube is purely conceptual; each logical node corresponds directly to a physical node in the DHT. Tree traversal can therefore be flexible, as the underlying DHT provides the basic communication.
Simulation • Data set: 131,180 web site records from PCHome (http://www.pchome.com.tw) • Each record is maintained manually by experienced editors and contains the following fields: ID, Title, URL, Category, Description, Keyword
(figure: Keyword Frequency; logarithm in base e)
(figure: Object vs. Node Distribution; x-axis: dimensionality r of the hypercube)
(figure: Query Performance, cacheless; m = keyword set size)
Conclusions • Our hypercube index scheme has the following characteristics: • Load balancing • Fault tolerance • Efficient object insert/delete • Direct pin search • A variety of ways to perform superset search: ranking can be based on this diversity, and personalization services can also be built • The hypercube index scheme is decomposable: multiple hypercubes can be built for multi-attribute search
Future Challenges • Flexible Keyword Search • Boolean • Prefix / Range Query • Wildcard / Fuzzy Query • Semantic Query • Semantic Routing
Two Types of Services • White page service: search by name, e.g., "Lord of the rings.mpg" • Yellow page service: search by attributes, e.g., "rings", "lord", "mpg" • Keyword search is the basis for yellow page services • Both services are easily supported in unstructured P2P systems or P2P systems with a centralized server. Yellow page service, however, is not easy in DHTs.
Distributed Inverted Indexing (figure: each keyword's posting list is stored at a different node: w1 → {A, B, D}, w2 → {A, C, E}, w3 → {B, E}, w4 → {C, E}, w5 → {B, D}; a query with Keywords = {w1, w5} computes the intersection {A, B, D} ∧ {B, D})
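The figure's query evaluation can be sketched as a posting-list intersection. This is a local model of the merge step only; in the distributed setting each list would first be fetched from the node responsible for its keyword, which is the source of the traffic problem discussed below.

```python
def conjunctive_search(inverted, keywords):
    """Conjunctive keyword query over an inverted index: fetch each
    keyword's posting list and intersect them, starting from the
    shortest list to keep intermediate results small."""
    lists = [inverted[w] for w in keywords]
    lists.sort(key=len)
    result = set(lists[0])
    for postings in lists[1:]:
        result &= set(postings)
    return sorted(result)
```

On the figure's data, the query {w1, w5} intersects {A, B, D} with {B, D} and returns {B, D}.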
Zipf's Law • In a real-world corpus, keyword frequency (the count of a keyword's occurrences in objects) varies enormously: a few keywords occur very often while many others occur rarely, following a power-law distribution. • e.g., mp3, ring, lord • Zipf's law implies that a straightforward distributed implementation of an inverted index results in an extremely imbalanced load.
Other Problems • Storage redundancy: an object o containing keywords {w1, w2, …, wk} is repeatedly stored at k different sites, which increases insert/delete complexity and decreases consistency • Fault tolerance: a failure of a site blocks all queries containing a keyword handled by that site; nodes handling hot keywords may be swamped • Object ranking is difficult: ranking in general requires global knowledge, e.g., inverse document frequency (IDF)
Our Keyword Indexing Scheme • The index entries of a single keyword are deterministically handled by a set of nodes. • Fault tolerance • The population of this set depends on the popularity of the keyword • Load Balancing • An object o with a keyword set K is indexed at exactly one node, and the node is determined uniquely by K • No storage redundancy • Insert/delete is efficient
Ranking • Given a keyword set K, the set SK of nodes that may be responsible for a superset of K is fixed. The larger the size of K, the smaller the size of SK. • Within SK, the nodes are distinguished according to their responsible keyword sets as follows: • K+{w1}, K+{w2}, K+{w3}, … • K+{w1,w2}, K+{w1,w3}, K+{w1,w4}, … K+{w2,w1}, … • K+{w1,w2,w3}, K+{w1,w2,w4}, K+{w1,w2,w5}, … • … • This gives applications much leeway in visiting the nodes to retrieve objects in whatever order they require.
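The layering of SK by how many extra keywords a node indexes corresponds, in bit-vector terms, to Hamming distance from Fh(K). A minimal sketch of ordering the visit by that depth (the function and parameter names are ours, and "depth" here models the keyword-set layers, not the paper's exact ranking function):

```python
def ranked_nodes(mask, r, specific_first=False):
    """Order the superset-search nodes of H_r(mask) by how many extra
    bits (extra keywords) they carry beyond mask. General-first visits
    nodes closest to Fh(K); specific-first visits the most refined sets."""
    free = [i for i in range(r) if not (mask >> i) & 1]
    nodes = []
    for combo in range(1 << len(free)):      # enumerate the subhypercube
        v = mask
        for j, bit in enumerate(free):
            if (combo >> j) & 1:
                v |= 1 << bit
        nodes.append(v)
    depth = lambda v: bin(v ^ mask).count("1")   # extra keywords indexed
    return sorted(nodes, key=depth, reverse=specific_first)
```

Flipping `specific_first` switches between the top-down (general objects first) and bottom-up (specific objects first) traversals described earlier, which is the diversity that ranking and personalization can exploit.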