
Ranked Queries over sources with Boolean Query Interfaces without Ranking Support



Presentation Transcript


  1. Ranked Queries over sources with Boolean Query Interfaces without Ranking Support Vagelis Hristidis, Florida International University; Yuheng Hu, Arizona State University; Panos Ipeirotis, New York University

  2. Motivation: PubMed (and USPTO, and LinkedIn, and…) • PubMed offers only ranking by date, author, title, or journal • Usually, users prefer ranking by relevance • Measured by an IR ranking function, like tf.idf

  3. Problem Definition • Input • Query Q contains terms t1,…,tn • Database D contains documents d1,…,dm • Output • Top-k documents ranked according to a relevance score function • Example of ranking function: tf.idf • Baseline: Submit a disjunctive query with all query keywords, retrieve all the documents, re-rank locally • Problem with the baseline method: too many results! • “immunodeficiency virus structure”  1,451,446 results
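
The baseline above (retrieve everything, re-rank locally by tf.idf) can be sketched as follows. This is a minimal illustration, not the paper's implementation; the helper names `tfidf_score` and `baseline_topk` and the simple tf × log(N/df) weighting are assumptions.

```python
import math
from collections import Counter

def tfidf_score(query_terms, doc_tokens, df, num_docs):
    """tf.idf score of one document: sum over query terms of tf * idf."""
    tf = Counter(doc_tokens)
    score = 0.0
    for t in query_terms:
        if df.get(t, 0) > 0 and tf[t] > 0:
            score += tf[t] * math.log(num_docs / df[t])
    return score

def baseline_topk(query_terms, docs, k):
    """Baseline: fetch every matching document and re-rank locally."""
    num_docs = len(docs)
    df = Counter()                      # document frequency of each term
    for tokens in docs.values():
        for t in set(tokens):
            df[t] += 1
    scored = [(tfidf_score(query_terms, tokens, df, num_docs), doc_id)
              for doc_id, tokens in docs.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]
```

On a source like PubMed this strategy forces retrieval of every matching document (over a million for the example query), which is exactly what the relaxation approach avoids.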

  4. Query Relaxation Approach • A tf.idf query has OR semantics • Using queries with AND semantics returns promising documents earlier • Gradual query relaxation allows fast execution • Key questions: • Which (conjunctive) queries to execute? • When to stop?

  5. Problem Setting and Challenges • Boolean query interface (e.g., PubMed) • Limited data access through web service (quota per day) • No useful ranking functions • No indices to rely on • No statistics exported from the database

  6. Probabilistic Approach • Document score combines idf (the easy part) and tf (the challenging part) • Estimate tf (and scores) probabilistically: • The tf of a term in a database tends to follow a Poisson distribution, with a per-term parameter λ • Document scores also follow a Poisson
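
A minimal sketch of the Poisson machinery this slide relies on, assuming each term's tf is modeled as Poisson(λ_t); the helper names are hypothetical:

```python
import math

def poisson_pmf(k, lam):
    """P[X = k] for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def poisson_sf(x, lam):
    """P[X > x]: survival function, 1 minus the cdf up to floor(x)."""
    return 1.0 - sum(poisson_pmf(k, lam) for k in range(int(math.floor(x)) + 1))

def expected_score(query_lambdas, idf):
    """Expected tf.idf score when each term's tf ~ Poisson(lambda_t):
    E[score] = sum_t idf_t * lambda_t (linearity of expectation)."""
    return sum(idf[t] * lam for t, lam in query_lambdas.items())
```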

  7. Probabilistic top-k with query relaxation • Querying strategy – How to pick a good query candidate? • A good query should have a high “benefit” • Benefit: probability that a document in the results of relaxed query q enters the top-k: Pr{ScoreQ(D,q) > τ}, where τ is the k-th highest score seen so far • The score follows a Poisson, a function of the λ parameters of the query terms in Q • We choose the query candidate q with the maximum probability
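
The candidate-selection rule above can be sketched directly: compute the Poisson tail probability Pr{Score > τ} for each candidate and pick the maximum. A hedged illustration, assuming each candidate's score distribution has already been summarized by a single λ estimate:

```python
import math

def poisson_tail(tau, lam):
    """Pr{X > tau} for X ~ Poisson(lam): the 'benefit' of a candidate."""
    cdf = sum(math.exp(-lam) * lam ** k / math.factorial(k)
              for k in range(int(math.floor(tau)) + 1))
    return 1.0 - cdf

def best_candidate(candidates, tau):
    """Pick the relaxed query most likely to beat tau, the k-th highest
    score so far. `candidates` maps query -> estimated lambda of its
    score distribution (a simplifying assumption for this sketch)."""
    return max(candidates, key=lambda q: poisson_tail(tau, candidates[q]))
```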

  8. Estimation of Poisson Parameters • Sample-based estimation: Fetch documents, construct a sample, use estimates from the sample • Needs a very large sample size for reliable estimates • Query-based estimation: Combine sampling and query execution • Every query generates a sample and provides candidate top-k docs • Main challenge: Adjust estimates to compensate for querying bias (we are looking for top-k documents, we do not perform random sampling)

  9. Query-based Sampling • The document sample returned for each query is not random! • The sample is “conditional” on the query terms (they are guaranteed to appear) • Estimates must account for the fact that queries are trying to find the top-k, not to perform random sampling • Without correction, estimates are significantly off
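
The slides do not give the paper's exact bias correction, but one simple way to illustrate the conditioning is a zero-truncated Poisson: every document returned for a query term contains that term at least once, so the observed tf sample excludes zeros and its mean overestimates λ. This sketch (an assumption, not the authors' method) inverts the truncated-mean relation mean = λ / (1 − e^(−λ)) by bisection:

```python
import math

def zero_truncated_lambda(sample_mean, tol=1e-9):
    """Recover lambda from the mean of a zero-truncated Poisson sample
    by solving sample_mean = lam / (1 - exp(-lam)) with bisection.
    The truncated mean is increasing in lam and always exceeds lam,
    so [~0, sample_mean] brackets the root."""
    if sample_mean <= 1.0:
        return 0.0  # degenerate: the truncated mean is always > 1
    lo, hi = 1e-12, sample_mean
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mid / (1 - math.exp(-mid)) < sample_mean:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```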

  10. Top-k algorithm using query relaxation • Send a conjunctive query with all terms to the database • Update statistics for each term using estimates from the biased sample • Compute benefits for each possible query relaxation • If the benefit (i.e., probability of finding a top-k document) is below a threshold, stop; else go to step 1
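
The four steps above can be sketched as one loop. This is a hypothetical skeleton: `execute(q)` stands in for a conjunctive query against the source, and `benefit(q, tau)` stands in for the Poisson-based probability estimate; neither is the paper's actual code.

```python
from itertools import combinations

def topk_with_relaxation(query_terms, execute, benefit, k, threshold):
    """Relaxation loop sketch. execute(q) returns (doc_id, score) pairs
    for conjunctive query q; benefit(q, tau) estimates Pr{a result of q
    scores above tau}, the probability of improving the current top-k."""
    results = {}                                   # doc_id -> score
    pending = [c for size in range(len(query_terms), 0, -1)
               for c in combinations(query_terms, size)]
    while pending:
        # tau = k-th highest score seen so far (0 until k docs are known)
        scores = sorted(results.values(), reverse=True)
        tau = scores[k - 1] if len(scores) >= k else 0.0
        q = max(pending, key=lambda c: benefit(c, tau))
        if benefit(q, tau) < threshold:
            break                                  # no promising candidate left
        pending.remove(q)
        for doc_id, score in execute(q):           # run the conjunctive query
            results[doc_id] = max(score, results.get(doc_id, 0.0))
    return sorted(results, key=results.get, reverse=True)[:k]
```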

  11. Experiments • Datasets • PubMed • TREC • Quality Measure • Spearman’s Footrule • Algorithms • Baseline • Summary-based • Query-based

  12. Experiments: Quality • Footrule distance compared to the baseline (baseline = retrieve everything, fetch locally, re-rank) • Lower values are better • Query-based sampling is consistently better than the alternatives
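
The quality measure, Spearman's footrule, sums the displacement of each item between two rankings. A minimal sketch of the basic form for two rankings over the same item set (top-k variants that handle non-overlapping lists need an extra convention for missing items):

```python
def spearman_footrule(ranking_a, ranking_b):
    """Spearman's footrule: sum over items of |rank in A - rank in B|.
    0 means the two rankings are identical."""
    pos_b = {item: i for i, item in enumerate(ranking_b)}
    return sum(abs(i - pos_b[item]) for i, item in enumerate(ranking_a))
```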

  13. Experiments: Efficiency • Measured #documents, queries, and execution time of alternative techniques

  14. Conclusion • Technique for top-k queries on top of document databases without ranking support • Introduction of an exploration-exploitation framework for building necessary statistics on-the-fly, during query execution • Order-of-magnitude efficiency improvements, small losses in quality

  15. Thank you! Questions?
