ODISSEA: Open DIStributed Search Engine Architecture
A Peer-to-Peer Architecture for Scalable Web Search and Information Retrieval
Torsten Suel, Chandan Mathur, Jo-Wen Wu, Jiangong Zhang, Alex Delis, Mehdi Kharrazi, Xiaohui Long, Kulesh Shanmugasundaram
Daniel Porta (d.porta@web.de)

Talk Outline
• Motivation
• Design Overview
• System Design Details
• Target Applications
• Implementation Details
• Efficient Query Processing
• Open Questions
Seminar "Peer-to-peer Information Systems"

Motivation
• Today, most of the web search infrastructure is supplied by only a few large crawl-based search engines
• There has been strong research in the field of P2P systems over the last few years
• Computers have become and will keep becoming faster, and network bandwidth has grown and will keep growing
• This raises two issues
  • The vast amount of data in P2P networks requires the ability to search in these networks
  • The significant computing resources provided by a P2P system could be used to search content residing inside or outside the system
• ODISSEA addresses both issues with a "distributed global indexing and query execution service"

Design Overview
• ODISSEA differs from many other approaches to P2P search
• It assumes a two-layered search engine architecture and a global index structure distributed over the nodes of the system
• In a global index, in contrast to a local index, a single node holds the entire inverted list for a particular term

Two Layer Approach
• Lower layer provides
  • Maintenance of the global index structure under document insertions and updates
  • Handling of node joins and failures
  • Efficient execution of simple search queries
• Upper layer interacts with the P2P-based lower layer via two classes of clients
  • Update clients (e.g. crawler, web server)
  • Query clients (which implement optimized query execution plans)
[Figure labels: ODISSEA, WWW, search server, crawler, queries]

Two Layer Approach
• Enables a large variety of (client-based) search tools that more fully exploit client computing resources
• Those tools could share the same lower-layer web search infrastructure
• Tools are developed using an open API, which accesses the search infrastructure
• In the most general case (i.e., where no pre-evaluation is done on the server side), processing a query could require large amounts of data to be transferred to the query client

Global vs. Local index
• Posting = [DocID, Position, additional information]
• An inverted list is a list of postings that represents all occurrences of a term in the document collection
• An inverted index for a set of terms is the set of the corresponding inverted lists
• Suppose a query "chair AND table". The figure shows how it is processed under each organization
[Figure: nodes A, B and C and a search client, shown once for each index organization; in the global index, A holds "chair" and B holds "table"]

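The difference between the two organizations can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy documents, the placement rules (document ID modulo node count, byte sum of the term modulo node count) and all function names are hypothetical and not taken from ODISSEA.

```python
from collections import defaultdict

# Toy collection: document ID -> text (made-up data).
DOCS = {1: "chair table lamp", 2: "chair sofa", 3: "table desk", 4: "chair table"}
NODES = ["A", "B", "C"]


def build_local_index(docs, nodes):
    """Local index: each node indexes only the documents it stores."""
    index = {n: defaultdict(set) for n in nodes}
    for doc_id, text in docs.items():
        node = nodes[doc_id % len(nodes)]  # assumed document placement
        for term in text.split():
            index[node][term].add(doc_id)
    return index


def build_global_index(docs, nodes):
    """Global index: the complete inverted list of a term lives on one node."""
    index = {n: defaultdict(set) for n in nodes}
    for doc_id, text in docs.items():
        for term in text.split():
            node = nodes[sum(term.encode()) % len(nodes)]  # assumed term placement
            index[node][term].add(doc_id)
    return index


def query_local(index, t1, t2):
    """The query goes to every node; each intersects its partial lists."""
    result = set()
    for node_index in index.values():
        result |= node_index.get(t1, set()) & node_index.get(t2, set())
    return result, len(index)  # (matching documents, nodes contacted)


def query_global(index, t1, t2):
    """Only the nodes that hold the two terms take part in the query."""
    holders, lists = set(), []
    for term in (t1, t2):
        for node, node_index in index.items():
            if term in node_index:
                holders.add(node)
                lists.append(node_index[term])
    result = set.intersection(*lists) if len(lists) == 2 else set()
    return result, len(holders)


print(query_local(build_local_index(DOCS, NODES), "chair", "table"))
print(query_global(build_global_index(DOCS, NODES), "chair", "table"))
```

With the local index every node has to answer, while with the global index only the two nodes holding "chair" and "table" are involved, at the price of shipping one inverted list (or part of it) between them.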
Global vs. Local index
• A local index organization is very inefficient in very large networks (e.g. the web) if result quality is the major concern, because the query has to be transmitted to all nodes and all of them have to respond
• In a global index organization, however, large amounts of data need to be transmitted between nodes when
  • Initially building the index
  • Evaluating a query (which leads to bad response times)
• This can be overcome with smart algorithmic techniques, discussed later in the talk
• The choice depends on the types of queries and the frequency of document updates, as well as on how dynamic the system is

Crawling and Fault Tolerance
Crawling approach
• Client-based, non-P2P crawlers have the advantage that they can easily be altered if web site operators complain about the bot
• Smart crawling strategies beyond BFS are hard to implement in a P2P environment unless there is a centralized scheduler
P2P systems and fault tolerance
• The system design relies on the assumption of a fairly stable P2P environment, since otherwise administration (insert, update, replication) would be too expensive

Target Applications
• Full-text search in large document collections located within P2P communities
• Search in large intranet environments
• Web search: a powerful API supports the anticipated shift towards client-based search tools, which better exploit the resources of today's desktop machines
• Search middleware: instead of inserting documents, clients could directly insert index entries. This might speed up query execution, since only certain "strong" keywords need to be inserted for a document. A drawback is that identifying such keywords is left to the client

Implementation Details
• Currently, a first system is being implemented in Java, using Pastry as the P2P substrate (lower layer) and a DHT mapping that maps IDs to the appropriate IP address
• Each node runs an indexer that stores inverted lists in compressed form in a Berkeley DB (which contains a B+ tree); each document is also stored in a Berkeley DB
• Using MD5, all documents and term lists are hashed to an 80-bit ID that is used for lookups in the system

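As a rough illustration of the ID scheme, the following Python sketch derives an 80-bit ID from an MD5 digest. MD5 produces 128 bits, and the slides do not say how the digest is shortened, so keeping the leading 80 bits is an assumption here; the function name `object_id` is hypothetical.

```python
import hashlib

ID_BITS = 80  # ID length stated on the slide


def object_id(key: str) -> int:
    """Hash a document name or term to an 80-bit integer ID."""
    digest = hashlib.md5(key.encode("utf-8")).digest()  # 16 bytes = 128 bits
    # Assumption: keep the leading 80 bits of the digest.
    return int.from_bytes(digest[: ID_BITS // 8], "big")


print(format(object_id("chair"), "020x"))
```

The same function can be applied to a term (to locate its inverted list) or to a document identifier such as a URL (to locate the stored document).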
Implementation Details: Parsing and Routing Postings
• New or updated documents are parsed at the node where they reside, as determined by the DHT mapping
• For each term, the parser generates a posting that is routed via several intermediate nodes, as determined by the topology of the Pastry network, until it reaches its destination node
• The index structure of a node is split into a small structure residing in main memory and a bigger structure on disk. The small structure is eventually merged into the bigger one, which avoids disk accesses during inserts and updates (lower amortized complexity)

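The split into a small in-memory structure and a large on-disk structure can be sketched as follows. This is a simplified model, not the ODISSEA indexer: the "disk" is an ordinary dictionary, and the class name, the merge trigger and its threshold are assumptions made for the example.

```python
from collections import defaultdict


class BufferedIndex:
    """Inserts go to memory; a merge moves them to the large structure in bulk."""

    def __init__(self, max_buffered=4):
        self.memory = defaultdict(list)  # small in-memory structure
        self.disk = defaultdict(list)    # stands in for the on-disk structure
        self.max_buffered = max_buffered  # assumed merge trigger
        self.buffered = 0
        self.merges = 0                  # number of bulk "disk" updates

    def insert(self, term, doc_id):
        self.memory[term].append(doc_id)  # no disk access here
        self.buffered += 1
        if self.buffered >= self.max_buffered:
            self.merge()

    def merge(self):
        for term, postings in self.memory.items():
            self.disk[term].extend(postings)
        self.memory.clear()
        self.buffered = 0
        self.merges += 1

    def lookup(self, term):
        # A lookup has to consult both structures.
        return sorted(self.disk.get(term, []) + self.memory.get(term, []))


index = BufferedIndex(max_buffered=4)
for doc_id in range(1, 6):
    index.insert("chair", doc_id)
print(index.lookup("chair"), index.merges)
```

Five inserts cause a single bulk merge instead of five separate disk updates, which is where the lower amortized cost comes from.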
Implementation Details: Groups and Splits
• Initially, all objects (documents, indexes) whose first w bits (here w=16) coincide are placed into a common group identified by this w-bit string
• Locally, each group maintains a Berkeley DB with all objects it contains
• When a group (of documents) becomes too large (here >1GB), it is split into two groups identified by (w+1)-bit strings, leaving a stub structure that points to the new groups, which are assigned to new nodes
• If the index structure for a term becomes too large (here >100MB), it is split into two lists according to the document IDs it contains

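A minimal sketch of the grouping rule, assuming 80-bit integer IDs and binary-string group labels. The function names are hypothetical, and the stub structure, the size checks and the assignment of new groups to nodes are left out.

```python
ID_BITS = 80


def group_label(object_id: int, w: int = 16) -> str:
    """Return the first w bits of an 80-bit ID as a binary string."""
    return format(object_id, f"0{ID_BITS}b")[:w]


def split_group(label: str, object_ids):
    """Split a group into the two groups with labels one bit longer."""
    w = len(label) + 1
    children = {label + "0": [], label + "1": []}
    for oid in object_ids:
        children[group_label(oid, w)].append(oid)
    return children


# Two made-up IDs that share their first 16 bits and differ in bit 17.
oid_a = int("1" * 16 + "0" + "0" * 63, 2)
oid_b = int("1" * 16 + "1" + "0" * 63, 2)
print(group_label(oid_a), group_label(oid_b))
print(split_group("1" * 16, [oid_a, oid_b]))
```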
Implementation Details: Replication
• Replication is performed at group level by attaching "/0", "/1", etc. to the group label (e.g. 0100101/2)
• This new label is what is actually presented to Pastry/DHT during lookups
• All replicas of a group form a "clique" whose members communicate periodically to update their status
• If a group replica fails, the others are in charge of detecting this and, if necessary, performing the repair
• Each node can hold several distinct group replicas and therefore participate in several cliques
• Postings are first routed to only one replica, which is then in charge of forwarding them to the others over a period of a few minutes

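The labelling and the delayed forwarding inside a clique can be sketched as below. The `Replica` class and its methods are hypothetical; status exchange, failure detection and the actual timing of the forwarding are not modelled.

```python
def replica_labels(group, degree):
    """Labels presented to the DHT, e.g. '0100101/0', '0100101/1', ..."""
    return [f"{group}/{i}" for i in range(degree)]


class Replica:
    def __init__(self, label):
        self.label = label
        self.postings = []
        self.pending = []  # postings not yet forwarded to the clique

    def receive(self, posting, forward=True):
        self.postings.append(posting)
        if forward:
            self.pending.append(posting)

    def flush(self, clique):
        """Forward buffered postings to all other replicas of the group."""
        for other in clique:
            if other is not self:
                for posting in self.pending:
                    other.receive(posting, forward=False)
        self.pending.clear()


clique = [Replica(label) for label in replica_labels("0100101", 3)]
clique[2].receive(("chair", 17))  # the posting is routed to one replica only
clique[2].flush(clique)           # later, it is forwarded to the others
print([r.postings for r in clique])
```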
Implementation Details: Faults, Unavailability and Synchronization
• When a node leaves the system, its group replicas eventually have to be replaced to maintain the desired degree of replication
• A node is considered failed if it has been unavailable for an extended period of time
• New replicas are created for a failed node, or if a certain number of nodes are unavailable
• Formerly unavailable nodes have to synchronize their index structures using logs of the updates they missed

Efficient Query Processing: Information-Theoretic Background
• Let d be a document, q = q0…qm-1 a query consisting of m terms, and F a function that assigns d (depending on q) a value F(d,q). Such a function is called a ranking function
• The top-k ranking problem for a query q is to find the k documents with the highest values F(d,q)
• A common form of such a function is a sum of per-term scores f(d,qi):
  $$F(d,q) = \sum_{i=0}^{m-1} f(d, q_i)$$
• Since queries typically have at most two search terms, the following algorithms focus on the top-k ranking problem for queries with exactly two search terms (for one-term queries, there is in fact nothing to do)

Efficient Query Processing: Fagin's Algorithm (FA)
• Intuitively, an item that is ranked near the top overall is likely to be ranked very high in at least one of the contributing subcategories
• Assume a query q = q0 AND q1 and postings of the form (d, f(d,qi)) that are sorted by the second component, with the highest values on top
• Also assume that the inverted lists for q0 and q1 are located on the same machine, so that no network communication is required
• Goal: compute the top k documents as fast as possible

Efficient Query Processing: Fagin's Algorithm (FA)
• Scan both lists from the beginning, reading one element from each list in every step, until there are k documents that have been encountered in both lists (here assume k=2)
• Compute the scores of these k documents. Also, for each document that was encountered in only one of the lists, perform a lookup into the other list to determine the score of the document
• Return the k documents with the highest score (here d1, d5)

List A (q0):  d1 0.9 | d2 0.8 | d3 0.7 | d4 0.69 | d5 0.67
List B (q1):  d6 0.6 | d5 0.5 | d3 0.4 | d1 0.3  | d7 0.2 | d8 0.1

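The two steps of FA can be sketched in Python on the example lists. A few assumptions are made for the sake of a runnable example: scores are stored as integer hundredths (90 stands for 0.9) to avoid floating-point rounding, a document missing from a list contributes 0, and the function name is hypothetical.

```python
import heapq

# (document ID, score in hundredths), sorted by descending score.
LIST_A = [(1, 90), (2, 80), (3, 70), (4, 69), (5, 67)]
LIST_B = [(6, 60), (5, 50), (3, 40), (1, 30), (7, 20), (8, 10)]


def fagin_top_k(list_a, list_b, k):
    score_a, score_b = dict(list_a), dict(list_b)
    seen_a, seen_b = set(), set()

    # Phase 1: scan both lists in parallel until k documents
    # have been encountered in both of them.
    for i in range(max(len(list_a), len(list_b))):
        if i < len(list_a):
            seen_a.add(list_a[i][0])
        if i < len(list_b):
            seen_b.add(list_b[i][0])
        if len(seen_a & seen_b) >= k:
            break

    # Phase 2: complete the score of every document seen so far
    # by a lookup into the other list (missing entries count as 0).
    totals = {d: score_a.get(d, 0) + score_b.get(d, 0) for d in seen_a | seen_b}
    return heapq.nlargest(k, totals.items(), key=lambda item: item[1])


print(fagin_top_k(LIST_A, LIST_B, k=2))
```

On the example data the scan stops after four steps, once d3 and d1 have been seen in both lists, and the lookups then reveal that d5 (0.67 + 0.5) beats d3 (0.7 + 0.4).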
Efficient Query Processing: Threshold Algorithm (TA)
• Scan both lists simultaneously and read (d, f(d,q0)) from the first and (d', f(d',q1)) from the second list
• Compute t = f(d,q0) + f(d',q1)
• For each document read from one of the lists, immediately perform a lookup in the other list in order to compute its complete score
• The algorithm terminates when k documents have been found that have higher scores than the current value of t

Because it does not make sense to scan two lists simultaneously while they are distributed in a P2P network, the above techniques have to be adapted. This leads to the following protocol, which aims to minimize the amount of data to be transferred.

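A hedged Python sketch of TA for two lists, using the example data from the FA slides. As before, scores are integer hundredths and a missing posting counts as 0; both are assumptions of this example, as are the function name and the returned step counter.

```python
import heapq
from itertools import zip_longest

LIST_A = [(1, 90), (2, 80), (3, 70), (4, 69), (5, 67)]
LIST_B = [(6, 60), (5, 50), (3, 40), (1, 30), (7, 20), (8, 10)]


def threshold_top_k(list_a, list_b, k):
    score_a, score_b = dict(list_a), dict(list_b)
    totals = {}
    steps = 0
    # An exhausted list contributes 0 to the threshold.
    for (da, fa), (db, fb) in zip_longest(list_a, list_b, fillvalue=(None, 0)):
        steps += 1
        t = fa + fb  # no unseen document can score higher than t
        for d in (da, db):
            if d is not None:
                # Immediate lookup in the other list.
                totals[d] = score_a.get(d, 0) + score_b.get(d, 0)
        top = heapq.nlargest(k, totals.items(), key=lambda item: item[1])
        if len(top) == k and all(score > t for _, score in top):
            return top, steps
    return heapq.nlargest(k, totals.items(), key=lambda item: item[1]), steps


print(threshold_top_k(LIST_A, LIST_B, k=2))
```

On this data TA stops after three steps: the threshold drops from 1.5 to 1.3 to 1.1, and by then d1 (1.2) and d5 (1.17) both exceed it.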
Efficient Query Processing: A simple distributed pruning protocol (DPP)
1. Node A (holding the shorter list) sends its first x postings to node B. Let rmin be the smallest value f(d,q0) transmitted
2. Node B receives the postings from A and performs lookups into its own list in order to compute the total scores. It retains the k documents with the highest scores. Let rk be the smallest value among these
3. Node B now transmits to A all postings among its first x postings with f(d,q1) > rk - rmin, together with the total scores of the k documents from the previous step
4. Node A now performs lookups into its own list for the postings received from B and determines the overall top k documents

Efficient Query Processing: DPP example for k=2 and x=3
A, holding term q0: (d1, 0.9), (d2, 0.8), (d3, 0.7), (d4, 0.69), (d5, 0.67)
B, holding term q1: (d6, 0.6), (d5, 0.5), (d3, 0.4), (d1, 0.3), (d7, 0.2), (d8, 0.1)
1. A to B: (d1, 0.9), (d2, 0.8), (d3, 0.7); rmin = 0.7
2. B computes: (d1, 0.9 + 0.3), (d2, 0.8 + ----), (d3, 0.7 + 0.4). B retains: (d1, 1.2), (d3, 1.1); rk = 1.1
3. rk - rmin = 1.1 - 0.7 = 0.4. B to A: (d6, 0.6), (d5, 0.5), because f(d6,q1) and f(d5,q1) are > 0.4, together with (d1, 1.2), (d3, 1.1)
4. A computes: (d6, 0.6 + ----), (d5, 0.5 + 0.67)
Top 2 documents: 1. (d1, 1.2) 2. (d5, 1.17)

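The four protocol steps can be replayed in Python on the example data. This is a single-process sketch in which "sending" is just passing a list; scores are integer hundredths to avoid floating-point rounding, a missing posting counts as 0, and the function name is hypothetical. It assumes x is large enough for the result to be correct and does not implement any extra round trip.

```python
import heapq

LIST_A = [(1, 90), (2, 80), (3, 70), (4, 69), (5, 67)]            # node A, term q0
LIST_B = [(6, 60), (5, 50), (3, 40), (1, 30), (7, 20), (8, 10)]  # node B, term q1


def dpp_top_k(list_a, list_b, k, x):
    score_a, score_b = dict(list_a), dict(list_b)

    # Step 1: A sends its first x postings to B.
    a_to_b = list_a[:x]
    r_min = min(f for _, f in a_to_b)

    # Step 2: B completes the scores and keeps the best k.
    b_totals = {d: f + score_b.get(d, 0) for d, f in a_to_b}
    b_top = heapq.nlargest(k, b_totals.items(), key=lambda item: item[1])
    r_k = b_top[-1][1]

    # Step 3: B sends back those of its first x postings that could
    # still reach the top k, i.e. f(d, q1) > r_k - r_min, plus b_top.
    b_to_a = [(d, f) for d, f in list_b[:x] if f > r_k - r_min]

    # Step 4: A completes the scores of the received postings.
    totals = dict(b_top)
    for d, f in b_to_a:
        totals[d] = f + score_a.get(d, 0)
    top = heapq.nlargest(k, totals.items(), key=lambda item: item[1])
    return top, a_to_b, b_to_a


top, a_to_b, b_to_a = dpp_top_k(LIST_A, LIST_B, k=2, x=3)
print(top, a_to_b, b_to_a)
```

Only three postings travel from A to B and two postings plus two total scores travel back, instead of a complete inverted list.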
Efficient Query Processing: Problems with the DPP
• It works only for queries containing two search terms
• Random lookups can cause disk accesses, since large index structures reside on hard disk (which leads to bad response times)
• How should the value of x be chosen? (x is the number of postings transmitted from A and from B; it should be chosen such that DPP works correctly without an extra round trip, and it depends on k and on the lengths of the inverted lists)
  • By deriving appropriate formulae based on extensive testing
  • By sampling-based methods that estimate the number of documents appearing in both lists

Efficient Query Processing: Evaluation of DPP
• 900 two-term queries selected from a set of over 1 million
• Test corpus: 120 million web pages (1.8TB), crawled by the authors' own crawler
• The value of x was determined by experiments on TA
• Computation within nodes is not taken into account
[Table: communication costs and estimated times of DPP for the top-10 documents and the standard cosine measure]

Future Work
• Framework for generating optimized query execution plans for multi-keyword queries
• New algorithmic techniques for the index synchronization problem
• New strategies for load balancing and rebuilding of lost replicas
• More experimental evaluation concerning different types of queries

Questions?
"The general question remains whether the near future will see massive P2P-based systems for challenging applications such as web search and large-scale IR, beyond the current simple applications such as file sharing."