
Efficient and Self-tuning Incremental Query Expansions for Top-k Query Processing


Presentation Transcript


  1. Efficient and Self-tuning Incremental Query Expansions for Top-k Query Processing
  Martin Theobald, Ralf Schenkel, Gerhard Weikum
  Max-Planck Institute for Informatics, Saarbrücken, Germany
  ACM SIGIR ’05

  2. An Initial Example…
  • Robust Track ’04, hard query no. 363 (Aquaint news corpus): “transportation tunnel disasters”
  • Increased robustness: count only the best match per document and expansion set
  • Increased efficiency: top-k-style query evaluations; scans on new expansion terms are opened only on demand; no threshold tuning
  • Expansion terms are drawn from relevance feedback, thesaurus lookups, Google top-10 snippets, etc.; term similarities from, e.g., Robertson & Sparck-Jones weights, concept similarities, or other correlation measures
  [Figure: the three query terms with their weighted expansion sets, matched against example documents d1 and d2.
   transportation (1.0): transit 0.9, highway 0.8, train 0.7, truck 0.6, metro 0.6, “rail car” 0.5, car 0.1, …
   tunnel (1.0): tube 0.9, underground 0.8, “Mont Blanc” 0.7, …
   disasters (1.0): catastrophe 1.0, accident 0.9, fire 0.7, flood 0.6, earthquake 0.6, “land slide” 0.5, …]

  3. Outline
  • Computational model & background on top-k algorithms
  • Incremental Merge over inverted lists
  • Probabilistic candidate pruning
  • Phrase matching
  • Experiments & Conclusions

  4. Computational Model
  • Vector space model with a Cartesian product space D1 × … × Dm and a data set D ⊆ D1 × … × Dm
  • Precomputed local scores s(ti, d) ∈ Di for all d ∈ D
    • e.g., TF*IDF variations, probabilistic models (Okapi BM25), etc.
    • typically normalized to s(ti, d) ∈ [0, 1]
  • Monotone score aggregation aggr: D1 × … × Dm → ℝ+
    • e.g., sum, max, product (sum over log sij), cosine (L2 norm)
  • Partial-match queries (aka “andish”): non-conjunctive query evaluations in which weak local matches can be compensated
  • Access model: inverted index over a large text corpus, with inverted lists sorted by decreasing local score
    • inexpensive sequential accesses to per-term lists: “getNextItem()”
    • more expensive random accesses: “getItemBy(docid)”
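As a concrete reading of this access model, here is a minimal in-memory sketch in Python; the class layout and posting representation are illustrative assumptions, only the two access operations come from the slide:

class InvertedList:
    """In-memory sketch of one per-term index list, holding (docid, score)
    postings sorted by decreasing local score s(t, d) in [0, 1]."""

    def __init__(self, postings):
        self.items = sorted(postings, key=lambda p: -p[1])
        self.by_doc = dict(self.items)   # docid -> score, for random access
        self.depth = 0                   # current sequential scan depth

    def get_next_item(self):
        """Inexpensive sequential access, as in getNextItem()."""
        if self.depth == len(self.items):
            return None                  # list exhausted
        item = self.items[self.depth]
        self.depth += 1
        return item

    def get_item_by(self, docid):
        """More expensive random access, as in getItemBy(docid)."""
        return self.by_doc.get(docid)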

  5. No-Random-Access (NRA) Algorithm [Fagin et al., PODS ‘02]
  NRA(q, L):
    scan all lists Li (i = 1..m) in parallel // e.g., round-robin
      ⟨d, s(ti, d)⟩ = Li.getNextItem()
      E(d) = E(d) ∪ {i}
      high_i = s(ti, d)
      worstscore(d) = ∑ν∈E(d) s(tν, d)
      bestscore(d) = worstscore(d) + ∑ν∉E(d) high_ν
      if worstscore(d) > min-k then
        add d to top-k
        min-k = min{worstscore(d’) | d’ ∈ top-k}
      else if bestscore(d) > min-k then
        candidates = candidates ∪ {d}
      if max{bestscore(d’) | d’ ∈ candidates} ≤ min-k then return top-k
  [Figure: query q = (t1, t2, t3) over the inverted index of a corpus d1, …, dn with k = 1; the lists
   t1: d78 0.9, d23 0.8, d10 0.8, d1 0.7, d88 0.2, …
   t2: d64 0.8, d23 0.6, d10 0.6, …, d78 0.1, …
   t3: d10 0.7, d78 0.5, d64 0.4, d99 0.2, d34 0.1, …
   are scanned round-robin and the threshold test stops the scan at depth 3. A naive join-then-sort evaluation takes between O(mn) and O(mn²) runtime.]
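Read as Python, the pseudocode above becomes the following compact sketch, assuming sum aggregation over non-empty in-memory lists; variable names mirror the slide’s E(d), high_i, worstscore(d), and bestscore(d):

from collections import defaultdict

def nra(lists, k):
    """Sketch of the NRA algorithm over in-memory lists.
    lists: {term: [(docid, score), ...]} sorted by descending score."""
    terms = list(lists)
    pos = {t: 0 for t in terms}                   # scan depth per list
    high = {t: lists[t][0][1] for t in terms}     # high_i = last score seen
    seen = defaultdict(dict)                      # d -> {t: s(t, d)}; keys = E(d)

    def worstscore(d):                            # sum over seen dimensions
        return sum(seen[d].values())

    def bestscore(d):                             # plus high_i for unseen ones
        return worstscore(d) + sum(high[t] for t in terms if t not in seen[d])

    while True:
        for t in terms:                           # round-robin sequential scan
            if pos[t] < len(lists[t]):
                d, s = lists[t][pos[t]]
                pos[t] += 1
                high[t] = s
                seen[d][t] = s

        ranked = sorted(seen, key=worstscore, reverse=True)
        topk = ranked[:k]                         # current top-k by worstscore
        min_k = worstscore(topk[-1]) if len(topk) == k else 0.0
        best_cand = max((bestscore(d) for d in seen if d not in topk),
                        default=0.0)
        exhausted = all(pos[t] == len(lists[t]) for t in terms)
        if (len(topk) == k and best_cand <= min_k) or exhausted:
            return [(d, worstscore(d)) for d in topk]

On the slide’s example lists with k = 1, the threshold test fires at scan depth 3 and returns d10 with worstscore 2.1, without reading the tails of the lists.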

  6. Outline
  • Computational model & background on top-k algorithms
  • Incremental Merge over inverted lists
  • Probabilistic candidate pruning
  • Phrase matching
  • Experiments & Conclusions

  7. Dynamic & Self-tuning Query Expansions
  • Incrementally merge the inverted lists Li1 … Lim’ in descending order of local scores
  • Dynamically add lists to the set of active expansions exp(ti) according to the combined term similarities and local scores
  • Best-match score aggregation
  • Increased retrieval robustness & fewer topic drifts
  • Increased efficiency through fewer active expansions
  • No threshold tuning required!
  [Figure: a top-k operator over (t1, t2, ~t3), where ~t3 is a virtual index list built by incrementally merging the expansion lists t3,1, t3,2, …, t3,m’.]

  8. Incremental Merge Operator
  • Expansion terms ~t = {t1, t2, t3}, obtained from relevance feedback, thesaurus lookups, …
  • Expansion similarities from correlation measures, large corpus statistics, …: sim(t, t1) = 1.0, sim(t, t2) = 0.9, sim(t, t3) = 0.5
  • Initial high-scores from index list metadata (e.g., histograms)
  • The Incremental Merge is iteratively triggered by the top-k operator’s “getNextItem()”
  [Figure: the input lists
   t1: d78 0.9, d23 0.8, d10 0.8, d1 0.4, d88 0.3, …
   t2: d64 0.8, d23 0.8, d10 0.7, d12 0.2, d78 0.1, …
   t3: d11 0.9, d78 0.9, d64 0.7, d99 0.7, d34 0.6, …
   merge into the virtual list ~t in descending order of sim(t, ti) · s(ti, d):
   d78 0.9, d23 0.8, d10 0.8, d64 0.72, d23 0.72, d10 0.63, d11 0.45, d78 0.45, d1 0.4, d88 0.3, …]
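A generator-style sketch of the operator in Python (names are illustrative): a heap holds each expansion list’s next unconsumed entry, weighted by its similarity, so the virtual list ~t is produced in descending order of sim(t, ti) · s(ti, d) and a low-similarity list is never scanned beyond its useful prefix:

import heapq

def incremental_merge(expansions):
    """Yields the postings of the virtual list ~t in descending order of
    sim(t, ti) * s(ti, d). `expansions` is a list of (sim, postings) pairs,
    each postings list sorted by descending local score."""
    heap = []   # entries: (-combined score, list number, scan position)
    for i, (sim, postings) in enumerate(expansions):
        if postings:                     # seed with each list's best entry
            heapq.heappush(heap, (-sim * postings[0][1], i, 0))
    while heap:
        neg_score, i, p = heapq.heappop(heap)
        sim, postings = expansions[i]
        d, _ = postings[p]
        yield d, -neg_score              # next item of the virtual list
        if p + 1 < len(postings):        # advance list i only when consumed
            heapq.heappush(heap, (-sim * postings[p + 1][1], i, p + 1))

# The slide's example reproduces the merged order shown above:
t1 = [("d78", 0.9), ("d23", 0.8), ("d10", 0.8), ("d1", 0.4), ("d88", 0.3)]
t2 = [("d64", 0.8), ("d23", 0.8), ("d10", 0.7), ("d12", 0.2), ("d78", 0.1)]
t3 = [("d11", 0.9), ("d78", 0.9), ("d64", 0.7), ("d99", 0.7), ("d34", 0.6)]
for d, s in incremental_merge([(1.0, t1), (0.9, t2), (0.5, t3)]):
    print(d, round(s, 2))   # d78 0.9, d23 0.8, d10 0.8, d64 0.72, ...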

  9. Outline
  • Computational model & background on top-k algorithms
  • Incremental Merge over inverted lists
  • Probabilistic candidate pruning
  • Phrase matching
  • Experiments & Conclusions

  10.–12. Probabilistic Candidate Pruning [Theobald, Schenkel, Weikum, VLDB ‘04]
  • For each physical index list Li:
    • treat each s(ti, d) ∈ [0, 1] as a random variable Si and consider the probability that Si exceeds a given score value
    • approximate the local score distribution by an equi-width histogram with n buckets
  • For a virtual index list ~Li = Li1 ∪ … ∪ Lim’:
    • consider the max-distribution of the similarity-weighted per-list score variables
    • alternatively, construct a meta histogram for the active expansions
  • For all d in the candidate queue:
    • consider the convolution over the score distributions of d’s unseen dimensions to bound its aggregated score
    • drop d from the candidates if P[worstscore(d) + ∑i∉E(d) Si > min-k] < ε for a user-tolerated error ε
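A rough NumPy sketch of this histogram machinery; the equi-width bucketing and the convolution test are from the slides, while the grid granularity and the independence assumption across lists are simplifications:

import numpy as np

def score_histogram(scores, n=10):
    """Equi-width histogram over [0, 1] with n buckets, normalized into a
    probability mass function for the score variable S of one index list."""
    counts, _ = np.histogram(scores, bins=n, range=(0.0, 1.0))
    return counts / counts.sum()

def prob_sum_exceeds(pmfs, delta, n=10):
    """Approximates P[S1 + ... + Sr > delta] by convolving the per-list
    histograms; bucket j of the result stands for the score value j / n."""
    conv = np.array([1.0])
    for pmf in pmfs:
        conv = np.convolve(conv, pmf)
    support = np.arange(len(conv)) / n
    return float(conv[support > delta].sum())

# Pruning test for a candidate d with unseen lists U (sketch):
#   p_d = prob_sum_exceeds([hist[i] for i in U], min_k - worstscore_d)
#   if p_d < eps: drop d from the candidate queue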

  13. Outline
  • Computational model & background on top-k algorithms
  • Incremental Merge over inverted lists
  • Probabilistic candidate pruning
  • Phrase matching
  • Experiments & Conclusions

  14. Incremental Merge for Multidimensional Predicates
  • Example query q = (undersea, “fiber optic cable”), with phrase expansions sim(“fiber optic cable”, “fiber optic cable”) = 1.0 and sim(“fiber optic cable”, “fiber optics”) = 0.8
  • A Nested Top-k operator iteratively prefetches & joins candidate items for each subquery condition (“getNextItem()”)
    • provides [worstscore(d), bestscore(d)] guarantees to the superordinate top-k operator
    • propagates candidates in descending order of their bestscore(d) values for monotonicity
  • The top-level top-k operator performs phrase tests only for the most promising items, via random accesses to a term-to-position index (expensive predicates & minimal probes [Chang & Hwang, SIGMOD ‘02])
  • Single threshold condition for algorithm termination (candidate pruning at the top-level queue only)
  [Figure: the top-level Top-k operator reads the list for “undersea” and an Incremental Merge over two Nested Top-k operators, one joining the lists for fiber, optic, cable and one for fiber, optics.]
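The phrase test itself, executed per random access against the term-to-position index, might look as follows; the index layout {term: {docid: positions}} is an assumption, only the operator’s role is given on the slide:

def phrase_match(pos_index, docid, phrase):
    """Tests whether the words of `phrase` occur at consecutive positions
    in document `docid`; pos_index maps term -> {docid: positions}."""
    position_sets = [set(pos_index.get(term, {}).get(docid, ()))
                     for term in phrase]
    if not all(position_sets):
        return False                     # some phrase term is missing in d
    first, rest = position_sets[0], position_sets[1:]
    return any(all(p + i + 1 in s for i, s in enumerate(rest))
               for p in first)

# Called by the top-level top-k operator only for the most promising
# candidates, e.g., phrase_match(idx, "d78", ["fiber", "optic", "cable"]).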

  15. Outline
  • Computational model & background on top-k algorithms
  • Incremental Merge over inverted lists
  • Probabilistic candidate pruning
  • Phrase matching
  • Experiments & Conclusions

  16. Experiments – Aquaint with Fixed Expansions
  • Aquaint corpus of English news articles (528,155 docs)
  • 50 “hard” queries from the TREC 2004 Robust track
  • WordNet expansions using a simple form of word sense disambiguation (WSD)
  • Okapi BM25 model for local scores, Dice coefficients as term similarities
  • Fixed expansion technique (synonyms + first-order hyponyms) with m ≤ 118
  [Table: title-only vs. static vs. dynamic expansions, reporting MAP, rPrec, P@10, # sequential accesses (SA), # random accesses (RA), CPU sec, max KB, and max m; the numbers are not recoverable from this transcript.]

  17. Experiments – Aquaint with Fixed Expansions cont’d
  • Probabilistic pruning performance: epsilon controls the pruning aggressiveness
  • Incremental Merge vs. top-k with static expansions
  [Charts: precision/runtime trade-offs under probabilistic pruning, and runtime of Incremental Merge vs. top-k with static expansions; the plotted values are not recoverable from this transcript.]

  18. Conclusions & Current Work
  • Increased efficiency
    • Incremental Merge vs. join-then-sort & top-k using static expansions
    • very good precision/runtime ratio for probabilistic pruning
  • Increased retrieval robustness
    • largely avoids topic drifts
    • models fine-grained semantic similarities (Incremental Merge & Nested Top-k operators)
  • Scalability (see paper)
    • large expansions (< 876 terms per query) on Aquaint
    • experiments on the Terabyte collection
  • Efficient support for XML-IR (INEX benchmark)
    • inverted lists for combined tag-term pairs, e.g., sec=mining
    • efficiently supports the child-or-descendant axis, e.g., //article//sec=mining
    • vague content & structure queries (VCAS), e.g., //article//~sec=~mining
    • Incremental Merge over DataGuide-like XPath locators
    • VLDB ’05, Trondheim

  19. Thank you!
