
Presentation Transcript


  1. TopX: Efficient and Versatile Top-k Query Processing for Text, Structured, and Semistructured Data. PhD Defense, May 16th 2006. Martin Theobald, Max Planck Institute for Informatics. [VLDB '05]

  2. An XML-IR Scenario (INEX IEEE)
[Figure: two example IEEE articles ("The XML Files" and "Current Approaches to XML Data Management") with nested title, abs, sec, par, bib, and item elements and their text contents.]
Example NEXI query: //article[.//bib[about(.//item, "W3C")]]//sec[about(.//, "XML retrieval")]//par[about(.//, "native XML databases")]
The scenario motivates three requirements: RANKING, VAGUENESS, PRUNING.

  3. Outline • Data & relevance scoring model • Database schema & indexing • TopX query processing • Index access scheduling & probabilistic candidate pruning • Dynamic query relaxation & expansion • Experiments & conclusions

  4. Outline • Data & relevance scoring model • Database schema & indexing • TopX query processing • Index access scheduling & probabilistic candidate pruning • Dynamic query relaxation & expansion • Experiments & conclusions

  5. Data Model
• XML tree model with pre/postorder labels for all tags and merged tag-term pairs (XPath Accelerator [Grust, SIGMOD '02])
• Redundant full-content text nodes: each element stores the concatenated (stemmed) text of its entire subtree, e.g. the sec element holds "native xml data base native xml data base system store schemaless data"
• Full-content term frequencies ftf(ti, e), e.g. ftf("xml", article1) = 4 for the example document below (see the sketch after it):
<article>
 <title>XML Data Management</title>
 <abs>XML management systems vary widely in their expressive power.</abs>
 <sec>
  <title>Native XML Data Bases.</title>
  <par>Native XML data base systems can store schemaless data.</par>
 </sec>
</article>
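As a concrete illustration of the labeling and the redundant full-content statistics, here is a minimal Python sketch (not the TopX implementation; function names are made up, and stemming is skipped) that assigns pre/postorder labels and computes ftf(t, e) for the example document:

```python
# Minimal sketch: pre/postorder labels + full-content term frequencies.
import xml.etree.ElementTree as ET
from collections import Counter

DOC = """<article><title>XML Data Management</title>
<abs>XML management systems vary widely in their expressive power.</abs>
<sec><title>Native XML Data Bases.</title>
<par>Native XML data base systems can store schemaless data.</par></sec>
</article>"""

def label_and_index(root):
    """Assign (pre, post) labels and collect the redundant
    full-content text (all terms in the subtree) per element."""
    pre_counter, post_counter = [0], [0]
    labels, full_content = {}, {}

    def visit(elem):
        pre_counter[0] += 1
        pre = pre_counter[0]
        terms = Counter((elem.text or "").lower().split())
        for child in elem:
            terms += visit(child)
            terms += Counter((child.tail or "").lower().split())
        post_counter[0] += 1
        labels[elem] = (pre, post_counter[0])
        full_content[elem] = terms      # ftf(t, e) = full_content[e][t]
        return terms

    visit(root)
    return labels, full_content

root = ET.fromstring(DOC)
labels, ftf = label_and_index(root)
print(labels[root], ftf[root]["xml"])   # (1, 6) 4
```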

  6. Full-Content Scoring Model
• Extended Okapi BM25 probabilistic model for XML with element-specific parameterization and individual element statistics [VLDB '05 & INEX '05]
• Basic scoring idea within the IR-style family of TF*IDF ranking functions, e.g. bib["transactions"] vs. par["transactions"] are scored against different element-type statistics (formula sketched below)
• Additional static score mass c for relaxable structural conditions and non-conjunctive ("andish") XPath evaluations
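To make the scoring idea concrete, the following is a sketch of an element-level BM25 variant of the kind this model extends; the exact TopX formula and its element-type-specific tuning of k1 and b are defined in the VLDB '05 paper, so treat the symbols len(e), avglen_A, N_A, and ef_A(ti) here as illustrative:

```latex
\mathit{score}(t_i, e) =
  \frac{(k_1 + 1)\,\mathit{ftf}(t_i, e)}
       {k_1\!\left((1-b) + b\,\frac{\mathrm{len}(e)}{\mathrm{avglen}_A}\right) + \mathit{ftf}(t_i, e)}
  \cdot
  \log \frac{N_A - \mathit{ef}_A(t_i) + 0.5}{\mathit{ef}_A(t_i) + 0.5}
```

Here A is the tag type of element e, N_A the number of elements with tag A, and ef_A(ti) how many of them contain ti in their full content; these per-tag statistics are what make bib["transactions"] and par["transactions"] score differently.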

  7. Outline • Data & relevance scoring model • Database schema & indexing • TopX query processing • Index access scheduling & probabilistic candidate pruning • Dynamic query relaxation & expansion • Experiments & conclusions

  8. Inverted Block-Index for Content & Structure
[Figure: sorted-access (SA) and random-access (RA) paths into the inverted lists for sec["xml"], title["native"], and par["retrieval"].]
• Combined inverted index over merged tag-term pairs (on redundant element full-contents)
• Sequential block-scans: group elements in descending order of (maxscore, docid) per list, and block-scan all elements per doc for a given (tag, term) key (see the sketch below)
• Stored as inverted files or database tables (two B+-tree indexes over the full range of attributes)
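A hedged sketch of how such a merged tag-term list can be organized; the tuple layout, field order, and names are illustrative assumptions, not the actual TopX schema:

```python
# Sketch of one inverted list per (tag, term) key with per-document blocks.
from collections import defaultdict

# Each entry: (docid, pre, post, maxscore_of_doc, score_of_element),
# where maxscore_of_doc is the best element score of that doc for this key.
index = defaultdict(list)

def sorted_access(tag, term):
    """Block-scan: yield all elements of one document at a time,
    documents ordered by descending maxscore (docid breaks ties)."""
    entries = sorted(index[(tag, term)],
                     key=lambda e: (-e[3], e[0]))
    block, current_doc = [], None
    for e in entries:
        if current_doc is not None and e[0] != current_doc:
            yield current_doc, block
            block = []
        current_doc = e[0]
        block.append(e)
    if block:
        yield current_doc, block
```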

  9. Navigational Index
[Figure: sorted accesses (SA) on title["native"] and par["retrieval"], plus random accesses (RA) on sec elements with static score mass C = 1.0.]
• Additional element directory: random accesses on a B+-tree index using (docid, tag) as key, with carefully scheduled probes
• Schema-oblivious indexing & querying over non-schematic, heterogeneous data sources (no DTD required)
• Supports full NEXI syntax and all 13 XPath axes (+ level)

  10. Outline • Data & relevance scoring model • Database schema & indexing • TopX query processing • Index access scheduling & probabilistic candidate pruning • Dynamic query relaxation & expansion • Experiments & conclusions

  11. TopX Query Processor
• Adapt the Threshold Algorithm (TA) paradigm [Fagin et al., PODS '01]: focus on inexpensive SAs & postpone expensive RAs (NRA & CA); keep the intermediate top-k & enqueue partially evaluated candidates
• Lower/upper score guarantees for each candidate d, remembering the set of evaluated query dimensions E(d) (sketched below):
worstscore(d) = ∑_{i ∈ E(d)} score(ti, ed)
bestscore(d) = worstscore(d) + ∑_{i ∉ E(d)} highi
• Early min-k threshold termination: return the current top-k iff bestscore(d) ≤ min-k for every candidate d outside the current top-k, where min-k is the worstscore of the rank-k result
• TopX core engine [VLDB '04]: SA batching & efficient queue management; multi-threaded SA & query processing; probabilistic cost model for RA scheduling; probabilistic candidate pruning for approximate top-k results
• XML engine [VLDB '05]: efficiently deals with uncertainty in structure & content ("andish" XPath); controlled amount of RAs (unique among current XML top-k engines); dynamically switches between document & element granularity
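The following is a small, self-contained NRA-style sketch of this bookkeeping (worstscore/bestscore per candidate, min-k termination); it deliberately omits TopX's SA batching, RA scheduling, and XML-specific joins:

```python
# Minimal NRA-style sketch of candidate bookkeeping and min-k termination.
import heapq

def nra_topk(lists, k):
    """lists: one score-sorted list [(docid, score), ...] per query dimension."""
    m = len(lists)
    pos = [0] * m
    high = [l[0][1] if l else 0.0 for l in lists]   # current high_i per list
    seen = {}                                       # d -> {i: score(t_i, e_d)}
    top_k = []

    def worst(d):
        return sum(seen[d].values())

    def best(d):
        return worst(d) + sum(h for i, h in enumerate(high) if i not in seen[d])

    while any(pos[i] < len(lists[i]) for i in range(m)):
        for i in range(m):                          # one round of sorted accesses
            if pos[i] < len(lists[i]):
                d, s = lists[i][pos[i]]
                pos[i] += 1
                high[i] = s
                seen.setdefault(d, {})[i] = s
        top_k = heapq.nlargest(k, seen, key=worst)
        min_k = worst(top_k[-1]) if len(top_k) == k else 0.0
        # Early min-k termination: stop iff no candidate outside the
        # current top-k can still beat min-k.
        if len(top_k) == k and all(best(d) <= min_k
                                   for d in seen if d not in top_k):
            break
    return [(d, worst(d)) for d in top_k]
```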

  12. TopX Query Processing By Example (NRA)
[Figure: animated NRA run with k = 2 over the three inverted lists sec["xml"], par["retrieval"], and title["native"]. Sorted accesses fill the candidate queue (doc17, doc1, doc5, doc3, a pseudo-doc, ...) with evolving [worstscore, bestscore] intervals; the min-2 threshold grows from 0.0 over 0.5, 0.9, and 1.0 to 1.6, until the top-2 results can be returned.]

  13. "Andish" XPath over Element Blocks
[Figure: element blocks for item=w3c, sec=xml, sec=retrieve, par=native, par=xml, par=database with their scores and [pre, post] intervals, joined under an article/bib/sec query skeleton via getParentScore()/getSubtreeScore(); static structural score masses C = 0.2 and C = 1.0 shown per block; SAs on the content lists, one RA resolving the bib condition.]
• Incremental & non-conjunctive XPath evaluation using hash joins on the content conditions and staircase joins [Grust, VLDB '03] on the structure (containment predicate sketched below)
• Tight & accurate [worstscore(d), bestscore(d)] bounds for early pruning (ensuring monotonic updates)
• Virtual support elements for navigation
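The structural checks ultimately reduce to the pre/postorder containment predicate; a tiny sketch follows (the staircase join additionally exploits the sortedness of the [pre, post] intervals to scan each element list only once):

```python
# Ancestor/descendant test on (pre, post) labels:
# a is an ancestor of e iff pre(a) < pre(e) and post(a) > post(e).
def is_descendant(e, a):
    """e, a: (pre, post) labels."""
    return a[0] < e[0] and a[1] > e[1]

# Labels from the Data Model example: article (1,6), sec (4,5), par (6,4).
article, sec, par = (1, 6), (4, 5), (6, 4)
assert is_descendant(par, sec) and is_descendant(sec, article)
```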

  14. Outline • Data & relevance scoring model • Database schema & indexing • TopX query processing • Index access scheduling & probabilistic candidate pruning • Dynamic query relaxation & expansion • Experiments & conclusions

  15. Random Access Scheduling – Minimal Probing
[Figure: query twig over article/bib/sec with content conditions item=w3c, sec=xml, sec=retrieve, par=native, par=xml, par=database; SAs on the content lists, scheduled RAs on the structure.]
• MinProbe: schedule RAs only for the most promising candidates, extending "Expensive Predicates & Minimal Probing" [Chang & Hwang, SIGMOD '02]
• Schedule a batch of RAs on d only iff worstscore(d) + rd·c > min-k, where min-k is the rank-k worstscore, worstscore(d) the evaluated content- & structure-related score, and rd·c the unresolved, static structural score mass (stated as a predicate below)
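Stated as code, the MinProbe gate is a one-line predicate; following the slide's notation, rd is the number of unresolved structural conditions of candidate d and c the static score mass per condition:

```python
# MinProbe gate: probe d only if the unresolved structural score mass
# could still lift its worstscore past the current min-k threshold.
def should_probe(worstscore_d, r_d, c, min_k):
    return worstscore_d + r_d * c > min_k
```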

  16. Cost-based Scheduling (CA) – Ben Probing
• Goal: minimize the overall execution cost #SA + (cR/cS)·#RA; access costs on d are wasted if d does not make it into the final top-k (considering both structural selectivities & content scores)
• Probabilistic cost model comparing different types of Expected Wasted Costs:
EWC-RAs(d) of looking up d in the remaining structure
EWC-RAc(d) of looking up d in the remaining content
EWC-SA(d) of not seeing d in the next batch of b SAs
• BenProbe: schedule a batch of RAs on d iff #EWC-RAs|c(d)·(cR/cS) < #EWC-SA (rule sketched below)
• Bounds the ratio between #RA and #SA; schedules RAs late & last, in ascending order of EWC-RAs|c(d)
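The decision rule itself is again a one-liner once the EWC estimates are available; how EWC-RAs|c(d) and EWC-SA are derived from selectivities and score distributions is the substance of the cost model and is only assumed here:

```python
# BenProbe rule: probe d iff the expected wasted RA cost, weighted by
# the RA/SA cost ratio, undercuts the expected wasted cost of the SAs.
def schedule_ras(ewc_ra_d, ewc_sa, c_r, c_s):
    return ewc_ra_d * (c_r / c_s) < ewc_sa
```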

  17. Selectivity Estimator [VLDB '05]
• Split the query //sec[//figure="java"][//par="xml"][//bib="vldb"] into a set of basic, characteristic XML patterns (twigs, paths & tag-term pairs), each with an estimated selectivity:
//sec[//figure]//par  p1 = 0.682
//sec[//figure]//bib  p2 = 0.001
//sec[//par]//bib  p3 = 0.002
//sec//figure  p4 = 0.688
//sec//par  p5 = 0.968
//sec//bib  p6 = 0.002
//bib="vldb"  p7 = 0.023
//par="xml"  p8 = 0.067
//figure="java"  p9 = 0.011
• Consider structural selectivities of the unresolved & non-redundant patterns Y; conjunctive case: PS[d satisfies all structural conditions Y] = ∏_{i ∈ Y} pi, assuming independence (sketched below); "andish" case: PS[d satisfies a subset Y' of the structural conditions Y]
• Consider binary correlations between structural patterns and/or tag-term pairs (estimated via data sampling, query logs, etc.)
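A minimal sketch of the conjunctive case under the independence assumption, using selectivities from the slide (the correlation-aware refinement is not shown):

```python
# P_S[d satisfies all unresolved structural conditions], independence assumed.
selectivity = {
    "//sec//figure": 0.688, "//sec//par": 0.968, "//sec//bib": 0.002,
    '//figure="java"': 0.011, '//par="xml"': 0.067, '//bib="vldb"': 0.023,
}

def p_structural(unresolved_patterns):
    p = 1.0
    for pat in unresolved_patterns:
        p *= selectivity[pat]
    return p

# e.g. probability that d satisfies both remaining conditions:
print(p_structural(["//sec//figure", '//figure="java"']))   # 0.688 * 0.011
```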

  18. Score Predictor [VLDB '04]
[Figure: score histograms f1, f2 of the content-related inverted lists title["native"] and par["retrieval"], with current positions high1, high2 and the remaining score mass δ(d).]
• Consider the score distributions of the content-related inverted lists to estimate PC[d gets into the final top-k] = P[worstscore(d) + δ(d) > min-k], where δ(d) is the score mass d can still collect from its unevaluated dimensions
• Probabilistic candidate pruning: drop d from the candidate queue iff PC[d gets into the final top-k] < ε (with probabilistic guarantees for relative precision & recall)
• Techniques: convolutions of score histograms obtained by sampling (assuming independence, sketched below); closed-form convolutions, e.g. truncated Poisson; moment-generating functions & Chernoff-Hoeffding bounds; combined score predictor & selectivity estimator
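A sketch of the histogram-based predictor: convolve the per-list score histograms of d's unevaluated dimensions (independence assumed) and read off the tail mass above min-k − worstscore(d). The shared bin grid is an illustrative assumption:

```python
# Histogram convolution to estimate P[worstscore(d) + delta(d) > min_k].
import numpy as np

def convolve_histograms(hists):
    """hists: 1-D probability mass arrays over a shared score grid."""
    acc = np.array([1.0])
    for h in hists:
        acc = np.convolve(acc, h)
    return acc                        # pmf of the sum delta(d)

def p_in_topk(worstscore_d, remaining_hists, min_k, bin_width):
    pmf = convolve_histograms(remaining_hists)
    needed = min_k - worstscore_d     # score mass d still needs
    first_bin = max(0, int(np.ceil(needed / bin_width)))
    return float(pmf[first_bin:].sum())
```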

  19. Outline • Data & relevance scoring model • Database schema & indexing • TopX query processing • Index access scheduling & probabilistic candidate pruning • Dynamic query relaxation & expansion • Experiments & conclusions

  20. Dynamic and Self-tuning Query Expansion [SIGIR '05]
[Figure: TREC Robust Topic no. 363, Top-k(transport, tunnel, ~disaster); the expansion ~disaster fans out into the inverted lists for disaster, accident, and fire, which an Incremental Merge operator scans via SAs alongside the lists for transport and tunnel.]
• Incrementally merge the inverted lists for a set of active expansions exp(t1)..exp(tm) in descending order of scores s(ti, d) (sketched below)
• Max-score aggregation for fending off topic drift
• Dynamically expand the set of active expansions only when beneficial for finding the final top-k results
• Specialized expansion operators: Incremental Merge operator & Nested Top-k operator (phrase matching)
• Supports text, structured records & XML; Boolean (but ranked) retrieval mode
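A compact sketch of the Incremental Merge idea: treat the expansion lists as one virtual list, emitted in globally descending order of similarity-weighted scores; the sim weights and list layout are illustrative assumptions, and the consuming top-k operator performs the max-score aggregation per document:

```python
# Incremental Merge over expansion lists via a max-heap of list heads.
import heapq

def incremental_merge(expansion_lists, sim):
    """expansion_lists: {term: [(docid, score), ...] sorted desc};
    sim: {term: similarity weight of the expansion to the original term}.
    Yields (docid, weighted score) in globally descending score order."""
    heap = []
    for t, lst in expansion_lists.items():
        if lst:
            heapq.heappush(heap, (-lst[0][1] * sim[t], t, 0))
    while heap:
        neg, t, i = heapq.heappop(heap)
        yield expansion_lists[t][i][0], -neg
        if i + 1 < len(expansion_lists[t]):
            nxt = expansion_lists[t][i + 1]
            heapq.heappush(heap, (-nxt[1] * sim[t], t, i + 1))
```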

  21. Outline • Data & relevance scoring model • Database schema & indexing • TopX query processing • Index access scheduling & probabilistic candidate pruning • Dynamic query relaxation & expansion • Experiments & conclusions

  22. Data Collections & Competitors
• INEX '04 Ad-hoc Track setting: IEEE collection with 12,223 docs & 12M elements in 534 MB of XML data; 46 NEXI queries with official relevance judgments and a strict quantization, e.g. //article[.//bib="QBIC" and .//par="image retrieval"]
• TREC '04 Robust Track setting: Aquaint news collection with 528,155 docs in 1,904 MB of text data; 50 "hard" queries from the TREC Robust Track '04 with official relevance judgments, e.g. "transportation tunnel disasters" or "Hubble telescope achievements"
• Competitors for the XML setup:
DBMS-style Join&Sort, using index full scans on the TopX index (Holistic Twig Joins)
StructIndex [Kaushik et al., SIGMOD '04]: top-k with separate indexes for content & structure, DataGuide-like structural index, eager RAs (Fagin's TA)
StructIndex+: extent chaining technique for DataGuide-based extent identifiers (skip scans on the content index)

  23. INEX: TopX vs. Join&Sort & StructIndex (46 NEXI queries)
                  k      epsilon  #SA        #RA        CPU sec  P@k   MAP@k  rel.Prec
Join&Sort         10     n/a      9,122,318  0          12.01
StructIndex       10     n/a      761,970    325,068    17.02
StructIndex+      10     n/a      77,482     5,074,384  80.02    0.34  0.09   1.00
TopX – MinProbe   10     0.0      635,507    64,807     1.38
TopX – BenProbe   10     0.0      723,169    84,424     3.22
TopX – BenProbe   1,000  0.0      882,929    1,902,427  16.10    0.03  0.17   1.00

  24. INEX: TopX with Probabilistic Pruning (TopX – MinProbe, 46 NEXI queries)
k   epsilon  #SA      #RA     CPU sec  P@k   MAP@k  rel.Prec
10  0.00     635,507  64,807  1.38     0.34  0.09   1.00
10  0.25     392,395  56,952  2.31     0.34  0.08   0.77
10  0.50     231,109  48,963  0.92     0.31  0.08   0.65
10  0.75     102,118  42,174  0.46     0.33  0.08   0.51
10  1.00     36,936   35,327  0.46     0.30  0.07   0.38

  25. TREC Robust: Dynamic vs. Static Query Expansion (50 keyword + phrase queries)
• Careful WordNet expansions using automatic word sense disambiguation & phrase detection [WebDB '03 & PKDD '05], with up to m < 118 expansion terms
• MinProbe RA scheduling for phrase matching (auxiliary term-offset table)
• Incremental Merge + Nested Top-k (mtop < 22) vs. static expansions (mtop < 118)

  26. Conclusions
• Efficient and versatile TopX query processor: extensible framework for XML-IR & full-text search
• Very good precision/runtime ratio for probabilistic candidate pruning
• Self-tuning solution for robust query expansions & IR-style vague search
• Combined SA and RA scheduling close to the lower bound for CA access cost [submitted for VLDB '06]
• Scalability: optimized for query-processing I/O; exploits cheap disk space for redundant index structures (constant redundancy factor of 4-5 for INEX IEEE); extensive TREC Terabyte runs with 25,000,000 text documents (426 GB)
• INEX 2006: new Wikipedia XML collection with 660,000 documents & 120,000,000 elements (~6 GB raw XML); official host for the Topic Development and Interactive Track (69 groups registered worldwide); TopX WebService available (SOAP connector)

  27. That’s it. Thank you!

  28. TREC Terabyte: Comparison of Scheduling Strategies Thanks to Holger Bast & Deb Majumdar!
