Explore a probabilistic approach for ranking the top-k documents from sources that offer only a Boolean query interface without relevance ranking. A query relaxation method retrieves promising documents early, giving order-of-magnitude efficiency improvements at a small loss in result quality.
 
                
Ranked Queries over Sources with Boolean Query Interfaces without Ranking Support • Vagelis Hristidis, Florida International University; Yuheng Hu, Arizona State University; Panos Ipeirotis, New York University
Motivation: PubMed (and USPTO, and LinkedIn, and…) • PubMed offers ranking only by date, author, title, or journal • Users usually prefer ranking by relevance • Relevance is measured by an IR ranking function, such as tf.idf
Problem Definition • Input • Query Q contains terms t1,…,tn • Database D contains documents d1,…,dm • Output • Top-k documents ranked according to a relevance score function • Example of ranking function: tf.idf • Baseline: submit a disjunctive query with all query keywords, retrieve all the documents, re-rank locally • Problem with the baseline method: too many results! • “immunodeficiency virus structure” → 1,451,446 results
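The baseline can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the scoring formula is one common tf.idf variant, the three-document collection is made up, and document frequencies are computed locally from that toy collection because the source exports no statistics.

```python
import math
from collections import Counter

def tfidf_score(query_terms, doc_tokens, doc_freq, num_docs):
    """Sum of tf * idf over the query terms (one common tf.idf variant)."""
    tf = Counter(doc_tokens)
    score = 0.0
    for t in query_terms:
        df = doc_freq.get(t, 0)
        if df == 0:
            continue  # term never occurs in the collection
        score += tf[t] * math.log(num_docs / df)
    return score

def baseline_top_k(query_terms, docs, k):
    """Baseline: take every document matching ANY query term, re-rank locally."""
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))
    matching = {d: toks for d, toks in docs.items()
                if any(t in toks for t in query_terms)}
    ranked = sorted(
        matching,
        key=lambda d: tfidf_score(query_terms, matching[d], doc_freq, len(docs)),
        reverse=True,
    )
    return ranked[:k]

# Toy collection (hypothetical data, for illustration only).
docs = {
    "d1": "virus structure virus".split(),
    "d2": "immunodeficiency virus".split(),
    "d3": "protein folding".split(),
}
top = baseline_top_k(["immunodeficiency", "virus", "structure"], docs, k=2)
print(top)
```

The cost problem is visible even here: every matching document has to be fetched before any ranking can happen.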
Query Relaxation Approach • A tf.idf query has OR semantics • Using queries with AND semantics returns promising documents earlier on • Gradual query relaxation allows fast execution • Key questions: • Which (conjunctive) queries to execute? • When to stop?
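To make the space of candidate queries concrete, the sketch below enumerates every conjunctive sub-query of a keyword query, from most to least restrictive. The `" AND "` join is an assumed query syntax, and the fixed enumeration order is only for illustration; the slides pick the next query by estimated benefit rather than by position in this list.

```python
from itertools import combinations

def relaxations(terms):
    """Yield conjunctive queries from most to least restrictive."""
    for size in range(len(terms), 0, -1):
        for subset in combinations(terms, size):
            yield " AND ".join(subset)

queries = list(relaxations(["immunodeficiency", "virus", "structure"]))
for q in queries:
    print(q)
```

For n terms there are 2^n - 1 non-empty conjunctions, which is why the choice of which ones to execute matters.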
Problem Setting and Challenges • Boolean query interface (e.g., PubMed) • Limited data access through web service (quota per day) • No useful ranking functions • No indices to rely on • No statistics exported from database
Probabilistic Approach • Document Score: combines idf (easy part) and tf (challenging part) • Estimate tf (and scores) probabilistically: • The tf of a term in a database tends to follow a Poisson distribution, with one Poisson parameter for the term in the database • Document scores also follow a Poisson distribution
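As a small worked example of the Poisson assumption, the sketch below estimates the rate of one term from a sample of term frequencies and then asks how likely a high tf is. The sample values are made up, and the sample is assumed to be random; the estimator is the standard maximum-likelihood one (the sample mean).

```python
import math

def poisson_pmf(x, lam):
    """Pr{X = x} for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam ** x / math.factorial(x)

def poisson_tail(x, lam):
    """Pr{X >= x} for X ~ Poisson(lam)."""
    return 1.0 - sum(poisson_pmf(i, lam) for i in range(x))

def estimate_lambda(tf_values):
    """Maximum-likelihood estimate of the Poisson rate: the sample mean."""
    return sum(tf_values) / len(tf_values)

# Hypothetical tf values of one term in ten randomly sampled documents.
sample_tf = [0, 1, 0, 2, 1, 0, 0, 3, 1, 0]
lam = estimate_lambda(sample_tf)
p_at_least_2 = poisson_tail(2, lam)
print(lam, p_at_least_2)
```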
Probabilistic top-k with query relaxation • Querying strategy: how to pick a good query candidate? • A good query should have a high “benefit” • Benefit: probability that a document in the results of relaxed query q is in the top-k: Pr{ScoreQ(D,q) > τ}, where q is the query candidate and τ is the k-th highest score so far • The score follows a Poisson distribution that is a function of the λ parameters of the query terms in Q • We choose the query candidate q with maximum probability
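One way to evaluate Pr{ScoreQ(D,q) > τ} numerically is sketched below. It is an illustration under stated assumptions rather than the authors' estimator: the score is taken to be the sum of idf × tf over the query terms, term frequencies are treated as independent Poisson variables, terms in the candidate q are conditioned on tf ≥ 1 (they are guaranteed to appear), and the distribution is truncated at `max_tf`. The λ, idf, and τ values are made up.

```python
import math
from itertools import product

def pmf(x, lam):
    """Poisson probability mass function."""
    return math.exp(-lam) * lam ** x / math.factorial(x)

def benefit(query_terms, candidate, lam, idf, tau, max_tf=25):
    """Pr{score > tau} for a document returned by the conjunction `candidate`."""
    per_term = []
    for t in query_terms:
        # Terms in the candidate must appear, so their tf starts at 1
        # and the probabilities are renormalised (zero-truncated Poisson).
        first = 1 if t in candidate else 0
        norm = 1.0 - pmf(0, lam[t]) if first == 1 else 1.0
        per_term.append([(idf[t] * x, pmf(x, lam[t]) / norm)
                         for x in range(first, max_tf + 1)])
    prob = 0.0
    for combo in product(*per_term):
        if sum(s for s, _ in combo) > tau:
            prob += math.prod(p for _, p in combo)
    return prob

terms = ["virus", "structure"]
lam = {"virus": 0.5, "structure": 0.2}   # assumed Poisson rates
idf = {"virus": 1.0, "structure": 2.0}   # assumed idf weights
tau = 2.5                                # assumed k-th highest score so far
candidates = [("virus", "structure"), ("virus",), ("structure",)]
benefits = {c: benefit(terms, c, lam, idf, tau) for c in candidates}
best = max(benefits, key=benefits.get)
print(benefits, best)
```

Under these assumptions the full conjunction always has the highest benefit, which matches starting with it; among the relaxations, the rarer, higher-idf term wins here.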
Estimation of Poisson Parameters • Sample-based estimation: fetch documents, construct a sample, use estimates from the sample • Needs a very large sample for reliable estimates • Query-based estimation: combine sampling and query execution • Every query generates a sample and provides candidate top-k docs • Main challenge: adjust the estimates to compensate for querying bias (we are looking for top-k documents, not performing random sampling)
Query-based Sampling • The document sample returned for each query is not random! • The sample is “conditional” on the query terms (which are guaranteed to appear) • The estimates must account for the fact that queries target the top-k documents and are not designed for random sampling • Without correction, the estimates are significantly off
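The slides do not spell out the correction, so the sketch below shows only one standard adjustment, for one source of bias: if a term is guaranteed to appear, its observed tf values come from a zero-truncated Poisson, and the rate can be recovered by solving λ / (1 − e^(−λ)) = sample mean. The observed values are made up, and the bias from deliberately targeting top-k documents is not handled here.

```python
import math

def truncated_mean(lam):
    """Mean of a Poisson(lam) variable conditioned on being >= 1."""
    return lam / -math.expm1(-lam)

def corrected_lambda(observed_tf, tol=1e-10):
    """Solve truncated_mean(lam) == sample mean by bisection."""
    m = sum(observed_tf) / len(observed_tf)
    if m <= 1.0:
        return 0.0  # every observed tf is 1: the estimate degenerates to 0
    lo, hi = 1e-12, m  # truncated_mean(lam) > lam, so the root lies below m
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if truncated_mean(mid) < m:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Hypothetical tf values of a query term in documents returned by a query
# that requires the term, so no zeros can be observed.
observed = [1, 1, 2, 1, 3, 1, 1, 2]
naive = sum(observed) / len(observed)
corrected = corrected_lambda(observed)
print(naive, corrected)
```

The naive mean overstates the rate because documents with tf = 0 can never show up in such a sample.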
Top-k algorithm using query relaxation • Send a conjunctive query with all terms to the database • Update the statistics for each term using estimates from the biased sample • Compute the benefit of each possible query relaxation • If the benefit (i.e., the probability of finding a top-k document) is below the threshold, stop; else go to step 1
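The control flow of these steps can be sketched as follows. Only the loop is shown: `search`, `score`, and `benefit` are hypothetical caller-supplied functions, statistics updating is folded into `benefit` (which receives everything seen so far), and the toy benefit at the bottom is a stand-in, not a Poisson estimate. The sketch also assumes that "go to step 1" means issuing the relaxation with the highest benefit as the next query.

```python
from itertools import combinations

def top_k_by_relaxation(terms, search, score, benefit, k, threshold):
    """Relaxation loop; search, score and benefit are supplied by the caller."""
    pending = [frozenset(c) for n in range(len(terms), 0, -1)
               for c in combinations(terms, n)]
    query = pending.pop(0)      # start with the conjunction of all terms
    seen = {}                   # doc id -> score of every document retrieved
    executed = []
    while True:
        for doc_id, doc in search(query):
            seen[doc_id] = score(doc)
        executed.append(query)
        if not pending:
            break
        top = sorted(seen.values(), reverse=True)
        tau = top[k - 1] if len(top) >= k else 0.0  # k-th highest score so far
        best = max(pending, key=lambda q: benefit(q, seen, tau))
        if benefit(best, seen, tau) < threshold:
            break               # no relaxation is promising enough
        pending.remove(best)
        query = best
    ranked = sorted(seen, key=seen.get, reverse=True)[:k]
    return ranked, executed

# Toy setup (all hypothetical): an in-memory Boolean source and simple scoring.
DOCS = {"d1": ["a", "b", "a"], "d2": ["a"], "d3": ["b"], "d4": ["c"]}
TERMS = ["a", "b"]

def toy_search(query):
    return [(d, toks) for d, toks in DOCS.items()
            if all(t in toks for t in query)]

def toy_score(doc):
    return float(sum(1 for tok in doc if tok in TERMS))

def toy_benefit(query, seen, tau):
    return len(query) / (len(query) + tau)  # stand-in: drops as tau rises

ranked, executed = top_k_by_relaxation(TERMS, toy_search, toy_score,
                                       toy_benefit, k=2, threshold=0.6)
print(ranked, len(executed))
```

In the toy run the loop stops after two of the three possible queries, because the remaining relaxation falls below the threshold once two documents have been scored.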
Experiments • Datasets • PubMed • TREC • Quality Measure • Spearman’s Footrule • Algorithms • Baseline • Summary-based • Query-based
Experiments: Quality • Measured footrule distance relative to the baseline (baseline = retrieve everything, re-rank locally) • Lower values are better • Query-based sampling is consistently better than the alternatives
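For reference, Spearman's footrule sums the absolute differences between each document's positions in two rankings. The slides do not say how documents missing from one top-k list are handled; the sketch below assumes one common convention, placing a missing document at position k + 1.

```python
def footrule(ranking_a, ranking_b):
    """Spearman's footrule between two top-k lists (missing items at k + 1)."""
    k = max(len(ranking_a), len(ranking_b))
    pos_a = {d: i for i, d in enumerate(ranking_a, start=1)}
    pos_b = {d: i for i, d in enumerate(ranking_b, start=1)}
    items = set(pos_a) | set(pos_b)
    return sum(abs(pos_a.get(d, k + 1) - pos_b.get(d, k + 1)) for d in items)

baseline = ["d1", "d2", "d3"]
print(footrule(baseline, ["d1", "d2", "d3"]))  # identical rankings
print(footrule(baseline, ["d2", "d1", "d3"]))  # two documents swapped
print(footrule(baseline, ["d1", "d2", "d4"]))  # one document replaced
```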
Experiments: Efficiency • Measured the number of documents retrieved, the number of queries, and the execution time of the alternative techniques
Conclusion • Technique for top-k queries on top of document databases without ranking support • Introduction of an exploration-exploitation framework for building necessary statistics on-the-fly, during query execution • Order-of-magnitude efficiency improvements, small losses in quality
Thank you! Questions?