
Martin Theobald, Max-Planck-Institut für Informatik / Stanford University


Presentation Transcript


  1. TopX: Efficient & Versatile Top-k Query Processing for Text, Semistructured & Structured Data. Martin Theobald, Max-Planck-Institut für Informatik / Stanford University

  2. [Figure: two example XML article trees, “Current Approaches to XML Data Management” and “The XML Files”, with nested sec, par, bib, and item elements holding text snippets such as “Native XML data base systems can store schemaless data …”, “XML queries with an expressive power similar to that of Datalog …”, and a bib entry “XML-QL: A Query Language for XML. Proc. Query Languages Workshop, W3C, 1998.”] Example NEXI query: //article[.//bib[about(.//item, “W3C”)]]//sec[about(.//, “XML retrieval”)]//par[about(.//, “native XML databases”)]. Challenges: RANKING, VAGUENESS, PRUNING.

  3. Frontends: Web Interface, Web Service, API. TopX Query Processor (query processing time): probabilistic index access scheduling, candidate queue & candidate cache, scan threads (sequential access, SA), top-k queue, probabilistic candidate pruning, dynamic query expansion, incremental XPath engine, auxiliary predicates (random access, RA). Index metadata: selectivities, histograms, correlations. Thesaurus: WordNet, OpenCyc, etc. DBMS / inverted lists with a unified text & XML schema, populated at indexing time by the indexer/crawler.

  4. Data Model
  <article> <title>XML Data Management</title> <abs>XML management systems vary widely in their expressive power.</abs> <sec> <title>Native XML Data Bases.</title> <par>Native XML data base systems can store schemaless data.</par> </sec> </article>
  [Figure: the corresponding document tree with pre-/postorder node labels; each element additionally stores its stemmed full-content text, e.g. for article1: “xml data manage xml manage system vary wide expressive power native xml data base native xml data base system store schemaless data”, so that ftf(“xml”, article1) = 4.]
  • XML trees (no XLinks or ID/IDref attributes) • Pre-/postorder node labels • Redundant full-content text nodes
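The labeling scheme above can be sketched as follows; the tuple encoding of the document and all names are illustrative, not TopX's actual index format:

```python
def index_tree(node, state):
    """Assign pre-/postorder labels to every element and build its redundant
    full-content text (own text plus all descendant text, in document order)."""
    tag, text, children = node
    state["pre"] += 1
    pre = state["pre"]
    tokens = text.split()
    for child in children:
        tokens += index_tree(child, state)
    state["post"] += 1
    state["labels"][(tag, pre)] = (pre, state["post"], " ".join(tokens))
    return tokens

# toy document from the slide (stemmed text, hypothetical encoding)
doc = ("article", "", [
    ("title", "xml data manage", []),
    ("abs", "xml manage system vary wide expressive power", []),
    ("sec", "", [
        ("title", "native xml data base", []),
        ("par", "native xml data base system store schemaless data", []),
    ]),
])
state = {"pre": 0, "post": 0, "labels": {}}
index_tree(doc, state)
pre, post, full = state["labels"][("article", 1)]
ftf_xml = full.split().count("xml")  # full-content term frequency ftf("xml", article1)
```

The pre-/postorder labels also give constant-time ancestor tests: u is an ancestor of v iff pre(u) < pre(v) and post(u) > post(v).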

  5. Scoring Model [INEX ’06/’07] • XML-specific extension of Okapi BM25 (originating from probabilistic text IR) • ftf (full-content term frequency) instead of tf • ef (element frequency) instead of df • Element type-specific length normalization • Tunable parameters k1 and b • Example: bib[“transactions”] vs. par[“transactions”] score differently
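A minimal sketch of such an element-level BM25 score, assuming the standard Okapi form with ftf in place of tf and per-element-type statistics (the slide does not give TopX's exact constants, so the signature, defaults, and sample numbers are assumptions):

```python
import math

def element_score(term_ftf, ef, n_type, length, avg_length, k1=1.2, b=0.75):
    """BM25-style score of one term for one element: ftf replaces tf, the
    element frequency ef (among the n_type elements of the same tag type)
    replaces df, and length normalization is computed per element type."""
    K = k1 * ((1 - b) + b * length / avg_length)
    idf = math.log((n_type - ef + 0.5) / (ef + 0.5))
    return idf * term_ftf * (k1 + 1) / (K + term_ftf)

# "transactions" may be rare among bib elements but common among par elements,
# so the same term contributes very different scores under the two tag types
s_bib = element_score(term_ftf=2, ef=50, n_type=10_000, length=8, avg_length=10)
s_par = element_score(term_ftf=2, ef=4_000, n_type=100_000, length=40, avg_length=50)
```

With these made-up statistics the bib condition scores higher than the par condition, which is exactly the bib[“transactions”] vs. par[“transactions”] contrast the slide points at.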

  6. TopX Query Processing [VLDB ’05]
  Example query: //sec[about(.//, “XML”) and about(.//title, “native”)]//par[about(.//, “retrieval”)], decomposed into the content conditions sec[“xml”], title[“native”], and par[“retrieval”].
  [Figure: sorted scans over the three inverted lists; each candidate in the queue carries a worstscore (e.g. 0.5 … 2.2) and a bestscore max-q (e.g. 1.6 … 3.0), and is dropped once its bestscore falls below the min-2 threshold of the current top-2 queue (0.0 … 1.6).]
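The candidate bookkeeping can be sketched as an NRA-style loop over sorted accesses, assuming round-robin scans and content conditions that are simply summed into one aggregate score (a simplification of TopX's structural joins; all list contents are invented):

```python
def topx_nra(index_lists, k):
    """NRA-style top-k over per-condition index lists sorted by score
    descending. Each candidate carries a worstscore (sum of scores seen so
    far) and a bestscore (worstscore plus the current high score of every
    list it has not been seen in); the scan stops once no candidate outside
    the current top-k can still overtake the min-k threshold."""
    n = len(index_lists)
    high = [lst[0][1] for lst in index_lists]    # per-list look-ahead scores
    pos = [0] * n
    seen = {}                                    # doc -> {list index: score}
    ranked, worst = [], {}
    while any(pos[i] < len(index_lists[i]) for i in range(n)):
        for i, lst in enumerate(index_lists):    # one round of sorted accesses
            if pos[i] < len(lst):
                doc, score = lst[pos[i]]
                pos[i] += 1
                high[i] = lst[pos[i]][1] if pos[i] < len(lst) else 0.0
                seen.setdefault(doc, {})[i] = score
        worst = {d: sum(s.values()) for d, s in seen.items()}
        ranked = sorted(worst, key=worst.get, reverse=True)
        mink = worst[ranked[k - 1]] if len(ranked) >= k else 0.0

        def best(d):  # bestscore: add the look-ahead of every unseen list
            return worst[d] + sum(high[i] for i in range(n) if i not in seen[d])

        if sum(high) <= mink and all(best(d) <= mink for d in ranked[k:]):
            break                                # early termination
    return [(d, worst[d]) for d in ranked[:k]]

sec   = [("d1", 1.0), ("d2", 0.9), ("d3", 0.5)]    # sec["xml"]
title = [("d2", 1.0), ("d3", 0.8), ("d1", 0.1)]    # title["native"]
par   = [("d3", 1.0), ("d1", 0.85), ("d2", 0.75)]  # par["retrieval"]
top2 = topx_nra([sec, title, par], k=2)
```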

  7. Index Access Scheduling [VLDB ’06]
  [Figure: inverted block index with per-block score look-aheads, e.g. Δ1,3 = 0.8 and Δ3,3 = 0.2.]
  • SA scheduling: look-ahead Δi through precomputed score histograms; knapsack-based optimization of the expected score reduction
  • RA scheduling: 2-phase probing, scheduling RAs “late & last”
  • Extended probabilistic cost model for integrating SA & RA scheduling
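The knapsack view of SA scheduling can be sketched with a textbook 0/1 knapsack over index blocks; the Δ (expected score reduction) and cost values below are made up for illustration:

```python
def schedule_blocks(blocks, budget):
    """0/1 knapsack: choose the index blocks that maximize the total expected
    score reduction (the look-ahead delta from the score histograms) within a
    sequential-scan cost budget."""
    best = [(0.0, [])] * (budget + 1)            # cost -> (delta sum, picks)
    for name, delta, cost in blocks:
        for c in range(budget, cost - 1, -1):    # classic backwards DP sweep
            gain, picks = best[c - cost]
            if gain + delta > best[c][0]:
                best[c] = (gain + delta, picks + [name])
    return best[budget]

# hypothetical blocks: (block id, expected score reduction, scan cost)
blocks = [("sec-block", 0.8, 2), ("title-block", 0.2, 1), ("par-block", 0.5, 2)]
gain, picks = schedule_blocks(blocks, budget=3)
```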

  8. Probabilistic Pruning [VLDB ’04]
  [Figure: per-condition score distributions f1 (for title[“native”]) and f2 (for par[“retrieval”]), estimated by sampling at indexing time; at query processing time their convolution gives the distribution of a candidate’s unknown remaining score mass δ(d).]
  • Convolution of score distributions (assuming independence) yields P[d gets in the final top-k]
  • Probabilistic candidate pruning: drop d from the candidate queue if P[d gets in the final top-k] < ε
  • With probabilistic guarantees for precision & recall
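A minimal sketch of the pruning test, assuming discrete score histograms represented as {score bucket: probability} dicts (the histogram values are invented):

```python
def convolve(f1, f2):
    """Distribution of the sum of two independent discrete score
    distributions, each given as {score bucket: probability}."""
    out = {}
    for s1, p1 in f1.items():
        for s2, p2 in f2.items():
            out[s1 + s2] = out.get(s1 + s2, 0.0) + p1 * p2
    return out

def p_in_topk(worstscore, delta_dist, mink):
    """P[d gets in the final top-k] = P[worstscore + delta(d) > min-k],
    where delta(d) is the candidate's unknown remaining score mass."""
    return sum(p for s, p in delta_dist.items() if worstscore + s > mink)

# invented histograms for the two conditions candidate d has not been seen in
f1 = {0: 0.5, 1: 0.3, 2: 0.2}
f2 = {0: 0.6, 1: 0.4}
delta = convolve(f1, f2)
p = p_in_topk(worstscore=1.0, delta_dist=delta, mink=2.5)
# prune d if p < epsilon, trading a bounded loss in recall for speed
```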

  9. Dynamic Query Expansion [SIGIR ’05]
  TREC Robust Topic #363: Top-k(transport, tunnel, ~disaster), where ~disaster expands to {accident, disaster, fire, …}.
  [Figure: the sorted index lists of the expansion terms are merged incrementally into the top-k operator’s scans over transport and tunnel.]
  • Incrementally merge inverted lists for expansions ti,1 … ti,m in descending order of s(ti,j, d)
  • Best-match score aggregation
  • Specialized expansion operators: Incremental Merge operator; Nested Top-k operator (efficient phrase matching)
  • Boolean (but ranked) retrieval mode
  • Supports any sorted inverted index for text, structured records & XML

  10. Incremental Merge Operator
  Expansion ~t = {t1, t2, t3} with similarities sim(t, t1) = 1.0, sim(t, t2) = 0.9, sim(t, t3) = 0.5, obtained from thesaurus lookups, relevance feedback, or large-corpus term correlations; index-list metadata (e.g., histograms) supplies the initial high-scores.
  [Figure: the three sorted index lists are merged lazily, each entry’s score scaled by its expansion similarity, yielding the combined order d78:0.9, d23:0.8, d10:0.8, d64:0.72, d23:0.72, d10:0.63, d11:0.45, d78:0.45, d1:0.4, …]
  Meta histograms seamlessly integrate Incremental Merge into probabilistic scheduling and candidate pruning.
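The lazy merge itself can be sketched with a heap, using the similarities and index lists shown on the slide (the generator interface is an assumption, not TopX's actual operator API):

```python
import heapq

def incremental_merge(expansions):
    """Lazily merge the sorted per-term index lists of an expansion ~t,
    scaling every score by the term's expansion similarity sim(t, t_i),
    and yield (doc, scaled score) in globally descending score order."""
    heap = []
    for i, (sim, entries) in enumerate(expansions):
        it = iter(entries)
        for doc, score in it:                    # seed with each list's head
            heapq.heappush(heap, (-sim * score, i, doc, sim, it))
            break
    while heap:
        neg_score, i, doc, sim, it = heapq.heappop(heap)
        yield doc, -neg_score
        for doc2, score2 in it:                  # refill from the same list
            heapq.heappush(heap, (-sim * score2, i, doc2, sim, it))
            break

# the three expansion lists from the slide: (similarity, sorted index list)
t1 = (1.0, [("d78", 0.9), ("d23", 0.8), ("d10", 0.8), ("d1", 0.4), ("d88", 0.3)])
t2 = (0.9, [("d64", 0.8), ("d23", 0.8), ("d10", 0.7), ("d12", 0.2), ("d78", 0.1)])
t3 = (0.5, [("d11", 0.9), ("d78", 0.9), ("d64", 0.7), ("d99", 0.7), ("d34", 0.6)])
merged = list(incremental_merge([t1, t2, t3]))
```

Because only the head of each list sits in the heap, the merge consumes its inputs incrementally, which is what lets the surrounding top-k operator stop scanning expansion lists early.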

  11. Some Experiments • New XML-ified Wikipedia corpus (INEX 2006) • 660,000 documents with 130,000,000 elements • 125 INEX queries, each in a content-only (CO) and a content-and-structure (CAS) formulation • CO: +“state machine” figure Mealy Moore • CAS: //article[about(., “state machine”)]//figure[about(., Mealy) or about(., Moore)] • Primary cost metric: Cost = #SA + (cR/cS) · #RA

  12. TopX vs. Full-Merge • Significant cost savings over full-merge for large ranges of k • CAS is cheaper than CO!

  13. Efficiency vs. Effectiveness • Very good precision/runtime ratio for probabilistic pruning

  14. Static vs. Dynamic Expansions • Query expansions with up to m = 292 keywords & phrases • Balanced amount of sorted vs. random disk accesses • Adaptive scheduling w.r.t. the cR/cS cost ratio • Dynamic expansions are superior to static expansions & full-merge in both efficiency & effectiveness

  15. Thanks… Gerhard Weikum, Ralf Schenkel, Norbert Fuhr, Michalis Vazirgiannis, Holger Bast, Debapriyo Majumdar, and all the MPI & INEX folks

  16. topx.sourceforge.net • See our SIGMOD ’07 demo!
