XML Distributed Retrieval

Emiran Curtmola @ UCSD
K. K. Ramakrishnan @ AT&T
Alin Deutsch @ UCSD
Divesh Srivastava @ AT&T

Motivation
• Democratization of data creation on the web
  • Easy to create and publish data
• Self-organization in online communities
  • Easy to form online communities in an ad-hoc fashion
  • Members create, publish and share data items
• Need to query the overall community data collection (all the published data)

“The virtual newspaper” community
[Figure: publishers P1 to P8, each holding local data; together their data forms the community data collection.]
• Query Q1: find the articles talking about fire in San Diego
• Query Q2: find the articles about San Francisco
• Query Q3: find the articles talking about food
• Query Q4: find the articles that give the weather in New York
How can the community data collection be queried efficiently?

State-of-the-Art in Querying
• Topic-based approach
  • Users create static topics
  • A topic is a rendezvous point between consumers and publishers
  • Consumers subscribe (query) to topics of interest
  • Publishers classify content into topics
• Limitation
  • Consumer interests cannot be specified at a very fine granularity (that would require too many topics), e.g., “news about fire damage when more than 1,000 people impacted and related to Santa Ana conditions in San Diego county, and information about related government relief efforts underway”

Ad-hoc Querying
• Content-based approach: queries are evaluated on the actual content
  • E.g., search engines, hosted online communities
[Figure: a central site holds the global data gathered from the local data of publishers P1 to P8; queries Q1 to Q4 are posed against the central site.]

Limitations of Centralized Approach
• A centralized authority separates publishers from consumers
  • Publishers need to give up their data
    • this goes against the idea of a community of autonomous members
  • Publishers cannot know who is interested in, and who accesses, their data
• Insufficient timeliness
  • Freshness of data depends on crawling frequency

Decentralized Approach: Move Queries Instead of Data
[Figure: publishers P1 to P8 keep their local data; queries Q1 to Q4 are sent to the publishers across the community data collection.]

Our Goal for Querying
• Data resides with the publisher
  • Publishers maintain complete control over who accesses their data
• Consumers can send ad-hoc queries over the content of the community data collection

Challenges
• Distributed nature of the data among publishers
  • Data is not materialized globally; it resides with each publisher
• Large number of decentralized publishers and consumers
  • Publishers: “whom to tell” among the host of potential consumers?
  • Consumers: “whom to ask” among the myriad of available publishers?
• Avoid flooding the network

Proposal for Query Dissemination
• The community setup
  • Network of logical routers as infrastructure for the community
  • Publishers connect to this network at the edge
• Build an overlay network to act as a distributed index structure
  • Routers are organized into a network called a query dissemination tree (QDT)
• Use the QDT to disseminate queries
  • Queries are always posed at the root
  • Routers forward queries to relevant publishers based on summary information
    • every node contains a summary of the data stored in its subtrees

A Query Dissemination Tree (QDT)
[Figure: a QDT of routers (numbered nodes) with publishers P1 to P8 attached at the edge. Only the overlay connections between the nodes of the QDT are shown.]
• P1’s advertised set of terms: San Diego, San Francisco, stocks, food, weather
• P2’s advertised set of terms: San Diego, gold, New York, food
• Node 3’s summary (set of terms): San Diego, San Francisco, stocks, food, weather, gold, New York
  • a node’s summary is the union of its subtrees’ summaries
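The summary rule on this slide (a node's summary is the union of its subtrees' summaries) can be sketched in a few lines of Python. This is only an illustration: the `Node` class, the `summary` function, and the exact shape of the subtree under router 3 (routers 4 and 6 as parents of P1 and P2) are assumptions made for the example, not details confirmed by the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A router or publisher in a toy QDT (hypothetical structure)."""
    name: str
    terms: set = field(default_factory=set)      # advertised terms, publishers only
    children: list = field(default_factory=list)  # empty for publishers


def summary(node):
    """Return the node's summary: its own terms plus the union over its subtrees."""
    result = set(node.terms)
    for child in node.children:
        result |= summary(child)
    return result


# Advertised term sets as given on the slide.
p1 = Node("P1", {"San Diego", "San Francisco", "stocks", "food", "weather"})
p2 = Node("P2", {"San Diego", "gold", "New York", "food"})

# Assumed subtree shape below router 3 (not confirmed by the slide).
node3 = Node("3", children=[Node("4", children=[p1]), Node("6", children=[p2])])

print(sorted(summary(node3)))
```

With these inputs the computed summary for router 3 contains the seven terms listed on the slide.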

XML Content Descriptors (CDs)
• An XML document D is described (imperfectly) by a set of content descriptors, CD(D)
• A query Q is also described by a set of CDs, CD(Q)
• To estimate whether Q has a match against D
  • we check whether CD(Q) ⊆ CD(D)
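A minimal sketch of this check, assuming CDs are plain strings and a query is conjunctive (all of its descriptors must be present). The function name `may_match` is hypothetical.

```python
def may_match(query_cds, doc_cds):
    """Estimate a match: every query descriptor must appear among the document's."""
    return set(query_cds) <= set(doc_cds)


# Simple-keyword CDs taken from the sample article in this deck.
doc_cds = {"San Diego", "fire", "Jupiter", "ReutersNews", "reuters.com"}

q1 = {"fire", "San Diego"}
q3 = {"food"}

print(may_match(q1, doc_cds))
print(may_match(q3, doc_cds))
```

Because the description is imperfect, a positive answer only marks the document as a candidate; the full query still has to be evaluated against the document itself.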

Representing Documents Using CDs
[Figure: sample XML article published by P1. Element tree: rss > channel > editor (Jupiter), description (San Diego, fire …), item > title (ReutersNews), link (reuters.com).]
• CDs can be
  • all simple keywords:
    • San Diego
    • fire
    • Jupiter
    • ReutersNews
    • reuters.com
  • keywords with full path from root:
    • /rss/channel/description/San Diego
    • /rss/channel/description/fire
    • /rss/channel/editor/Jupiter
    • /rss/channel/item/title/ReutersNews
    • /rss/channel/item/link/reuters.com
    • etc.
  • keywords with only the last tag on the path:
    • description/San Diego
    • description/fire
    • editor/Jupiter
    • title/ReutersNews
    • link/reuters.com
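The three CD styles above can be derived mechanically from a document. The sketch below uses Python's standard `xml.etree.ElementTree`. The `SAMPLE` document is a reconstruction from the listed paths, and splitting element text on commas to obtain keywords is an assumption; the slides do not say how keywords are extracted.

```python
import xml.etree.ElementTree as ET

# Reconstructed sample article (assumed layout, elided content omitted).
SAMPLE = """<rss><channel>
  <editor>Jupiter</editor>
  <description>San Diego, fire</description>
  <item><title>ReutersNews</title><link>reuters.com</link></item>
</channel></rss>"""


def keywords(text):
    """Assumption: keywords are the comma-separated pieces of an element's text."""
    return [piece.strip() for piece in text.split(",") if piece.strip()]


def content_descriptors(xml_text):
    """Return (simple keywords, full-path CDs, last-tag CDs) for a document."""
    simple, full_path, last_tag = set(), set(), set()

    def walk(elem, parent_path):
        path = parent_path + "/" + elem.tag
        for kw in keywords(elem.text or ""):
            simple.add(kw)
            full_path.add(path + "/" + kw)
            last_tag.add(elem.tag + "/" + kw)
        for child in elem:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return simple, full_path, last_tag


simple, full_path, last_tag = content_descriptors(SAMPLE)
print(sorted(full_path))
```

Longer paths make descriptors more selective but also increase the number of distinct CDs each publisher has to advertise.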

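Query Routing in a QDT
• Q3=<food> is posed at the root and forwarded down the tree
• At each node, check set inclusion: the query’s terms against the node’s summary (stored as a Bloom filter)
• Only P1 and P2 publish articles about food
[Figure: Q3 travels from the root through the QDT and reaches only P1 and P2; subtrees whose summaries do not contain “food” are not visited.]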
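The routing step can be sketched as a recursive descent that prunes any subtree whose summary does not include all query terms. The toy tree below (`TREE`, with routers "1", "2", "8") and publisher P3's term set are hypothetical, and exact Python sets stand in for the Bloom filters. The sketch also assumes a router knows the summary of each child and forwards only to matching children.

```python
# Hypothetical toy QDT: router -> children. Publishers are leaves.
TREE = {"1": ["2", "8"], "2": ["P1", "P2"], "8": ["P3"]}
ADVERTISED = {
    "P1": {"San Diego", "San Francisco", "stocks", "food", "weather"},
    "P2": {"San Diego", "gold", "New York", "food"},
    "P3": {"stocks"},  # assumed for illustration
}


def build_summaries(node, summaries):
    """Fill summaries[node] bottom-up as the union of the children's summaries."""
    if node in ADVERTISED:
        summaries[node] = set(ADVERTISED[node])
    else:
        summaries[node] = set()
        for child in TREE[node]:
            summaries[node] |= build_summaries(child, summaries)
    return summaries[node]


def route(node, query, summaries, visited):
    """Forward the query only into subtrees whose summary includes it."""
    visited.append(node)
    if node in ADVERTISED:
        return [node]
    hits = []
    for child in TREE[node]:
        if query <= summaries[child]:
            hits += route(child, query, summaries, visited)
    return hits


def disseminate(root, query, summaries):
    """Pose the query at the root; return (publishers reached, nodes visited)."""
    visited = []
    if not query <= summaries[root]:
        return [], visited
    return route(root, query, summaries, visited), visited


summaries = {}
build_summaries("1", summaries)
hits, visited = disseminate("1", {"food"}, summaries)
print(hits, visited)
```

For a multi-term query a router's union summary may contain every term even though no single publisher below it does, so the final check at each publisher remains necessary.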

Traffic Congestion at Top of QDT
• The tree topology introduces congestion during query dissemination

Traffic Congestion at Top of QDT
• Routing a single query versus routing a query workload
  • each node needs non-zero time to process a query
• The root is the bottleneck (the load decreases from root to leaves due to filtering)
[Figure: the QDT with the root marked as the bottleneck while queries for “food” are routed to P1 and P2.]
How to relieve the congestion?

Techniques for Load Balancing
• Overlaying multiple logical QDTs over the same underlay network
  • a node belongs to multiple QDTs, but at different levels
• Goal: organize the nodes into QDTs such that
  • the distribution of tree levels for a node is uniform across the QDTs

Overlaying Multiple QDTs: QDT1
[Figure: QDT1 overlaid on the network of routers and publishers P1 to P8.]

Overlaying Multiple QDTs: QDT2
[Figure: QDT2 overlaid on the same network of routers and publishers P1 to P8.]

Overlaying Multiple QDTs
[Figure: QDT1, QDT2, QDT3 and QDT4 overlaid on the same network; router 1 is labeled in each tree.]

Query Routing for Multiple QDTs
• Partition the community data collection (set of CDs) into blocks
• Build one QDT per block
  • QDTi groups all publishers with CDs in block Bi
• Routing a query
  • Terms in the query determine the relevant blocks
  • Send the query to the corresponding QDT
  • Check the full query against the publishers’ storage
• Example of routing Q3=<food>
  • Q3 falls in B4 → use QDT4 (the QDT for B4)
[Figure: QDT1 to QDT4; Q3 is routed on QDT4 to the publishers holding “food”.]
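A minimal sketch of the block-based scheme, assuming CDs are assigned to blocks by hashing. The slides do not state how blocks are formed, so the hash partition and the function names (`block_of`, `qdt_membership`, `relevant_blocks`) are assumptions. Blocks are numbered from 0 here, whereas the slides use B1 to B4.

```python
import hashlib


def block_of(term, num_blocks):
    """Assumed partition: map a CD to a block by hashing it."""
    digest = hashlib.sha256(term.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_blocks


def qdt_membership(advertised, num_blocks):
    """QDT i groups every publisher that advertises at least one CD in block i."""
    members = {b: set() for b in range(num_blocks)}
    for publisher, terms in advertised.items():
        for term in terms:
            members[block_of(term, num_blocks)].add(publisher)
    return members


def relevant_blocks(query, num_blocks):
    """The terms of a query determine which blocks (and QDTs) are relevant."""
    return {block_of(term, num_blocks) for term in query}


advertised = {
    "P1": {"San Diego", "San Francisco", "stocks", "food", "weather"},
    "P2": {"San Diego", "gold", "New York", "food"},
}
members = qdt_membership(advertised, num_blocks=4)
food_block = block_of("food", 4)
print(food_block, sorted(members[food_block]))
```

Any publisher that can answer a query advertises all of its terms, so it belongs to the QDT of every relevant block; this is why one relevant tree is enough to reach it.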

Relieving the Congestion
[Figure: QDT1 to QDT4 side by side; queries Q1=<fire, San Diego> and Q3=<food> are disseminated over separate trees.]

Queries Spanning Multiple Blocks
• Q4=<New York, weather>
• Route Q4 on both trees (QDT3 and QDT4)?
  • NO: this generates redundant traffic, and therefore more messages
  • Routing on both trees can touch the same nodes
→ We show that it suffices to send the query to either one of the trees

Routing Alternatives
• Routing Q4=<New York, weather>
  • Q4: routing by <New York>
  • Q4: routing by <weather>
• Check all the query terms at each publisher!
[Figure: the two alternatives shown on QDT3 and QDT4.]

Routing Alternatives
• Routing Q4=<New York, weather>
  • Q4: routing by <New York>
  • Q4: routing by <weather>
• Ideally, route by the most selective term
• In practice this is not possible, so use informed routing
  • keep track of popular CDs
  • avoid routing by CDs with low selectivity (popular CDs)
[Figure: the two alternatives shown on QDT3 and QDT4.]
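Informed routing can be sketched as picking the query term that looks least popular according to a small table of tracked CDs. The function name `choose_routing_term`, the popularity counts, and the rule that untracked terms count as unpopular are all assumptions for illustration.

```python
def choose_routing_term(query_terms, popular_counts):
    """Pick the term to route by, avoiding popular (low-selectivity) CDs.

    popular_counts holds counts only for the tracked popular CDs.
    Terms that are not tracked are treated as count 0, i.e. selective.
    Sorting first makes the tie-break deterministic.
    """
    return min(sorted(query_terms), key=lambda term: popular_counts.get(term, 0))


# Hypothetical popularity table: only a few popular CDs are tracked.
popular_counts = {"weather": 900, "food": 700}

q4 = {"New York", "weather"}
print(choose_routing_term(q4, popular_counts))
```

Under these assumed counts Q4 is routed by <New York> rather than by the popular term <weather>, and the full query is still checked at each publisher that is reached.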

Discussion: The Design Space
• How many query dissemination trees?
• 1 tree for all published terms
  • Con: traffic congestion in the upper levels of the dissemination tree
  • Pro: queries routed in the tree are very selective
    • the more conjuncts, the more selective the query → early pruning of the subtrees to be visited
• 1 tree per term
  • Pro: congestion-free
  • Con: tree maintenance (as many trees as terms)
  • Con: single-term queries are less selective → more peers are visited unnecessarily
• The “sweet spot” is expected to lie between the above extremes (our solution)

Finding the Sweet Spot
• Empirical fact
  • the upper 2 tree levels in a QDT are the most congested
• One solution: cyclical permutation of nodes on the tree levels
• Goal: all routers appear precisely once in the top 2 levels of any QDT
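One way to read the cyclical permutation idea is sketched below. It assumes every QDT has the same shape, that routers are placed in level order, and that each new QDT shifts that order by the number of slots in the top 2 levels (1 root plus `fanout` children). These details and the function names are assumptions; the slides do not spell out the construction.

```python
import math


def cyclic_qdt_orders(routers, fanout):
    """Return one level-order router placement per QDT.

    Each successive QDT rotates the placement by the size of the top 2 levels,
    so a different group of routers occupies the congested levels each time.
    """
    top = 1 + fanout                       # slots in levels 1 and 2
    num_qdts = math.ceil(len(routers) / top)
    orders = []
    for i in range(num_qdts):
        shift = (i * top) % len(routers)
        orders.append(routers[shift:] + routers[:shift])
    return orders


def top_two_levels(order, fanout):
    """Routers sitting in the root and second level of a level-order placement."""
    return order[: 1 + fanout]


routers = list(range(1, 17))               # 16 hypothetical routers
orders = cyclic_qdt_orders(routers, fanout=3)
for order in orders:
    print(top_two_levels(order, 3))
```

With 16 routers and fanout 3 this yields 4 QDTs whose top-2-level groups are disjoint and together cover every router. If the router count is not a multiple of the group size, the last rotation wraps around and a few routers appear near the top twice.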

Sweet Spot when 4 QDTs
[Figure: QDT1 and QDT2 over the same routers and publishers P1 to P8.]

Sweet Spot when 4 QDTs
[Figure: QDT1, QDT2, QDT3 and QDT4 over the same routers and publishers P1 to P8, with the routers permuted across tree levels from one QDT to the next.]

Experimental Goals
• Effect of the number of QDTs
  • find the “sweet spot” for load balancing
• Effect of the routing strategy (informed routing)
  • optimize based on query selectivity estimation
• Effect of the QDT topology
  • study the overlay organization of the peers

Experimental Setup
• 10,000-node overlay network simulator
  • 9,400 publishers and 600 routers
• XML Wikipedia dump of 1.1M articles (8.6GB)
• Query workload: 50,000 conjunctive queries
  • each query has 1 to 10 conjunctive terms
  • each query has at least one match in the global data collection
• QDT topology
  • Multicast trees, e.g., Scribe (QDTS)
  • Balanced trees (QDTB)

Measuring the Throughput
• Processing load at each node
  • is a function of the number of messages reaching the node
• Peak load: the maximum load over all nodes
• Average load: the number of messages in the network divided by the number of routers
• The ideal load we can achieve is the average load for the 1-QDT case
→ New metric: the load reduction
  • how close the actual peak load (with k QDTs) is to the ideal load
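These metrics can be written down as small functions. Peak and average load follow the definitions above. The slide does not give a formula for load reduction, so expressing it as the ratio of ideal load to actual peak load (1.0 meaning the peak equals the ideal) is an assumption, and the message counts are made up for the example.

```python
def peak_load(messages_per_router):
    """Maximum number of messages reaching any single router."""
    return max(messages_per_router.values())


def average_load(messages_per_router):
    """Total messages in the network divided by the number of routers."""
    return sum(messages_per_router.values()) / len(messages_per_router)


def load_reduction(actual_peak, ideal):
    """Assumed formula: closeness of the actual peak load to the ideal load."""
    return ideal / actual_peak


# Hypothetical message counts per router for the same workload.
one_qdt = {"1": 120, "2": 60, "8": 30, "3": 30}
k_qdts = {"1": 80, "2": 60, "8": 50, "3": 50}

ideal = average_load(one_qdt)              # ideal load: average of the 1-QDT case
print(peak_load(one_qdt), ideal, load_reduction(peak_load(k_qdts), ideal))
```

In this toy example the single tree peaks at twice the ideal load, while the k-QDT arrangement brings the ratio to 0.75.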

Effect of Number of QDTs
• Varying the number of QDTs, we confirm that
  • the number of QDTs given by the cyclical permutation method yields the highest load reduction
→ The “sweet spot” is well defined
→ For this number of QDTs the load reduction is near the optimum

Effect of Number of QDTs
• Result: the actual peak load is brought very close to the ideal load
  • near-optimum peak load reduction at 15 QDTs for Scribe-generated topologies

Effect of Routing Strategy
• Query selectivity estimation
  • with only 1-3% of the state, we get 65-75% of the routing benefit

Effect of QDT Topology
• Fanout-balanced trees are closest to optimal throughput

Summary
• Infrastructure for ad-hoc querying in online communities where the publishers keep control over their own data
• Ongoing Work
  • Ranked results
    • Disseminate only to top-K relevant publishers
    • Find only top-K matching documents
  • Support for more expressive XML queries
  • Simulation → Build Prototype

Thank You!

Effect of Number of QDTs

Effect of Routing Strategy

Efficient Representation of Summaries
• Naïve solution
  • keep “exact node summaries” as a complete list of published terms
  • Con: memory intensive → arbitrarily large summaries
  • Con: costly to check set inclusion
• How to check term-set inclusion quickly?
• How to represent summaries using little space?
• Allow estimates
  • without false negatives: to avoid incomplete answers
  • with bounded false positives: to avoid wasting bandwidth
→ Represent summaries (term sets) using Bloom filters
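A minimal Bloom filter sketch showing the two properties named above: no false negatives and a bounded chance of false positives. The filter size, the number of hash functions, and the use of salted SHA-256 are illustrative choices, not parameters from the slides.

```python
import hashlib


class BloomFilter:
    """Compact, approximate representation of a term set."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0                      # bit array stored in one Python int

    def _positions(self, term):
        # Derive num_hashes bit positions by salting the term with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{term}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, term):
        for pos in self._positions(term):
            self.bits |= 1 << pos

    def might_contain(self, term):
        """False means definitely absent; True means possibly present."""
        return all((self.bits >> pos) & 1 for pos in self._positions(term))

    def union(self, other):
        """Summary of a parent: bitwise OR of same-sized child filters."""
        merged = BloomFilter(self.num_bits, self.num_hashes)
        merged.bits = self.bits | other.bits
        return merged


p1 = BloomFilter()
for term in ["San Diego", "San Francisco", "stocks", "food", "weather"]:
    p1.add(term)
p2 = BloomFilter()
for term in ["San Diego", "gold", "New York", "food"]:
    p2.add(term)

parent = p1.union(p2)
print(parent.might_contain("gold"), parent.might_contain("food"))
```

Filters of equal size and hash count can be merged with a bitwise OR, which fits the rule that a node's summary is the union of its subtrees' summaries. A conjunctive query is then tested by checking each of its terms with `might_contain`.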
