
For more information, please send email to uirmak@cis.poly.edu or suel@poly.edu.



Presentation Transcript


  1. EFFICIENT QUERY SUBSCRIPTION PROCESSING FOR PROSPECTIVE SEARCH ENGINES
     Utku Irmak, Svilen Mihaylov, Torsten Suel, Samrat Ganguly*, and Rauf Izmailov*
     Polytechnic University, Brooklyn
     *NEC Laboratories America, Inc., New Jersey

     A Motivating Application
     • Notify the subscriber if an “interesting” document appears on the web

     Problem Definition
     • Given a large number of subscriptions (on the order of millions), how can we efficiently match a large number of incoming documents (thousands per second) against all subscriptions?

     Challenges
     • Scalability and load balancing
     • Support for enhanced subscription capabilities
     • Automatic resource (RSS) discovery and efficient crawling
     • Improved service (a longer history of matches, ranking)

     What is RSS?
     • Rich Site Summary (version 0.91)
     • RDF Site Summary (versions 0.9 and 1.0)
     • Really Simple Syndication (version 2.0)
     • Provides:
       - Web content (or summaries)
       - Meta-data (TITLE, URL and DESCRIPTION)
     • Goals:
       - Web syndication
       - Allow readers to keep track of updates

     Comparison to Traditional Search
     • Retrospective Search:
       - On a previously crawled file collection
       - Searching the past
       - Collection of files is static
       - Queries are dynamic
     • Prospective Search:
       - On newly added or updated files
       - Searching the future
       - Files are dynamic
       - Collection of queries is static

     Query (Subscription) Types
     • AND only: All terms have to appear
     • k-out-of-n: At least k (out of all n) terms have to appear
     • Boolean: Boolean expression with AND, OR and NOT

     Internal Representations for Efficient Matching
     • Use of Inverted Index:
       - Queries are indexed by their terms
       - Reduces the number of queries examined
     • Queries, Terms and Documents are represented by unique identifiers (QIDs, TIDs, DIDs)
     • Example queries:
       - Q1: New York Yankees
       - Q2: Yankees Red Sox
       - Q3: Boston Red Sox
     • Resulting inverted index (term: queries containing it):
       - New: Q1
       - York: Q1
       - Yankees: Q1 Q2
       - Red: Q2 Q3
       - Sox: Q2 Q3
       - Boston: Q3

     For more information, please send email to uirmak@cis.poly.edu or suel@poly.edu.
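The inverted index over queries shown in the example can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the poster says queries and terms are represented by numeric identifiers (QIDs, TIDs), whereas this sketch uses plain strings for readability, and the function name `build_inverted_index` is hypothetical.

```python
from collections import defaultdict

# The three example subscriptions from the poster, keyed by query ID.
# Strings stand in for the numeric QIDs/TIDs (an assumption for readability).
queries = {
    "Q1": ["New", "York", "Yankees"],
    "Q2": ["Yankees", "Red", "Sox"],
    "Q3": ["Boston", "Red", "Sox"],
}


def build_inverted_index(queries):
    """Map each term to the list of queries that contain it."""
    index = defaultdict(list)
    for qid, terms in queries.items():
        # set() ensures a query is listed once per term even if the
        # term is repeated inside the query.
        for term in set(terms):
            index[term].append(qid)
    return dict(index)


index = build_inverted_index(queries)
```

Looking up a document term in `index` returns only the queries that mention it, which is why the index reduces the number of queries that have to be examined per document.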

  2. EFFICIENT QUERY SUBSCRIPTION PROCESSING FOR PROSPECTIVE SEARCH ENGINES (continued)
     Utku Irmak, Svilen Mihaylov, Torsten Suel, Samrat Ganguly*, and Rauf Izmailov*
     Polytechnic University, Brooklyn
     *NEC Laboratories America, Inc., New Jersey

     Datasets and Experimental Evaluations
     • Subscriptions: Query logs from excite.com
     • Documents: Crawled & parsed web pages
     • Evaluation: Throughput with various numbers of subscriptions

     A Primitive Matching Algorithm (AND only)
     • For each TID in the document
       - Find queries that contain TID (using inverted index)
       - Maintain a counter (for each query returned)
     • There is a match if (counter == query size)

     Opt 2: Use of Bloom Filters
     • Bloom Filter: A probabilistic, space-efficient method for membership queries
     • For each new item, set the corresponding bit to 1
     • False negatives are guaranteed not to occur
     • Advantage: Reduced cost of maintaining the accumulators

     Opt 3: Partitioning the Queries
     • Create multiple smaller inverted indexes
     • Repeat the matching algorithm
     • Advantage: Better locality (in the processor cache)

     A Clustering Approach
     • Queries usually have common terms and some are contained by others
     • If a query is already evaluated on a document, contained queries can be answered very efficiently
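The primitive AND-only matching algorithm described on this panel can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes string terms instead of numeric TIDs, treats "query size" as the number of distinct terms in a query, and the names `match_and_only`, `query_sizes`, and the sample document are hypothetical.

```python
from collections import defaultdict


def match_and_only(doc_terms, index, query_sizes):
    """Return the IDs of queries whose terms all appear in the document."""
    counters = defaultdict(int)
    # Each distinct document term is processed once, so a repeated
    # term cannot inflate a query's counter.
    for term in set(doc_terms):
        for qid in index.get(term, ()):
            counters[qid] += 1
    # A query matches when every one of its terms was seen.
    return sorted(q for q, c in counters.items() if c == query_sizes[q])


# Example subscriptions from the poster (strings stand in for QIDs/TIDs).
queries = {
    "Q1": ["New", "York", "Yankees"],
    "Q2": ["Yankees", "Red", "Sox"],
    "Q3": ["Boston", "Red", "Sox"],
}

# Inverted index: term -> queries containing that term.
index = defaultdict(list)
for qid, terms in queries.items():
    for term in set(terms):
        index[term].append(qid)

# Query size = number of distinct terms (an assumption of this sketch).
query_sizes = {qid: len(set(terms)) for qid, terms in queries.items()}

doc = "Boston Red Sox beat the Yankees".split()
matches = match_and_only(doc, index, query_sizes)
```

For this sample document, Q2 and Q3 reach their full term count while Q1 sees only "Yankees", so `matches` contains Q2 and Q3. The per-query counters are the accumulators whose maintenance cost the Bloom filter optimization aims to reduce.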
