XML Information Retrieval

• XML IR has been a hot topic in DB and IR research for more than six years:
  • XQuery/IR, XSearch, XIRQL, XXL, Niagara, …, TopX, SphereSearch, …
  • XPath/XQuery Full Text
  • More than 50 participating groups in the Initiative for the Evaluation of XML Retrieval (INEX 2006)
• Most systems allow users to specify constraints on content and structure
• Why does Google not use XML IR? ("There is no XML data" is not a valid answer)
• Why do end users not use XPath?
EDBT 2006, Munich, Germany

Database vs. IR World
• Structural query languages do not work in practice:
  • The schema is unknown or heterogeneous
  • The language is too complex
  • Humans don't think in XPath
  • Results are often unsatisfying
• Example information need: "I need information about a professor in SB who teaches IR."
  //professor[contains(., "SB") and contains(.//course, "IR")]
• System support is needed to generate "good" structured queries:
  • User interfaces ("advanced search")
  • Natural language processing
  • Interactive query refinement

Outline
• Relevance Feedback
• Structural Features for Feedback on XML
• Evaluation
• Summary and Outlook

Relevance Feedback for Interactive Query Refinement
1. User submits a query
2. User marks relevant and nonrelevant docs
3. System finds the best terms to distinguish between relevant and nonrelevant docs
4. System submits the expanded query
[Figure: ranked results 1 to 4 for the keyword query "XML IR", annotated with terms such as "IR", "index" and "Fagin"; the expanded query adds "query evaluation" and not(Fagin)]
• Feedback for XML IR:
  • Start with a keyword query
  • Find structural expansions
  • Create a structural query

Dimensions for Structural Expansion
[Figure: document tree of an article with frontmatter, body and backmatter; it contains author "Baeza-Yates", citation "Serge Abiteboul", and nested sec/subsec/p elements ("Semistructured data…", "XML has evolved…", "With the advent of XSLT…"); the user marks one element as a relevant result]
Possible dimensions:
• C: content of the result, e.g. XML
• P: path to the result, e.g. article/body/sec/subsec
• D: tag + content of other elements in the document, e.g. //author[Baeza], //citation[Abiteboul]

Weights for Content and Doc Dimensions
• Compute a Rocchio weight [1971] for each feature f (also used as the feature's weight in the query), defined in terms of:
  • r_f: number of relevant results with f
  • R: number of relevant results
  • n_f: number of nonrelevant results with f
  • N: number of nonrelevant results
• Alternatively: consider accumulated score mass instead of r_f and n_f
• Order features by weight; break ties with the mutual information of score and relevance distributions
• Select the top-N_C content features and the top-N_D document features
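The slide names the four counts that enter the Rocchio weight but the transcript does not carry the formula itself. The sketch below therefore assumes the common Rocchio shape, beta * r_f / R - gamma * n_f / N, with hypothetical parameters beta and gamma. Ties are broken alphabetically here as a simple stand-in for the mutual-information criterion; the function names are illustrative only.

```python
def rocchio_weight(r_f, R, n_f, N, beta=1.0, gamma=1.0):
    """Assumed Rocchio form: share among relevant minus share among nonrelevant."""
    positive = beta * r_f / R if R else 0.0
    negative = gamma * n_f / N if N else 0.0
    return positive - negative


def select_features(relevant, nonrelevant, top_n):
    """relevant / nonrelevant: lists of feature sets, one set per judged result.

    Returns the top_n (feature, weight) pairs, highest weight first.
    """
    R, N = len(relevant), len(nonrelevant)
    features = set().union(*relevant, *nonrelevant)
    weights = {}
    for f in features:
        r_f = sum(f in result for result in relevant)
        n_f = sum(f in result for result in nonrelevant)
        weights[f] = rocchio_weight(r_f, R, n_f, N)
    # Alphabetical tie-break is a placeholder, not the slide's mutual information.
    ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


relevant = [{"XML", "index"}, {"XML"}]
nonrelevant = [{"Fagin", "index"}]
top_features = select_features(relevant, nonrelevant, top_n=2)
```

With these toy judgments, "XML" occurs in every relevant result and no nonrelevant one, so it receives the highest weight, while "Fagin" receives a negative weight and would correspond to a not(Fagin) expansion.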

Path-based Constraints
Tag names alone cannot enhance retrieval quality, and complete paths are too strict. Use path fragments instead (example path: article/body/section/subsection):
• Prefixes: article/#, article/body/#
• Infixes: #/section/#
• Subpaths: #/body/section/#
• Paths with wildcards: article/#/section/#
• Suffixes: #/subsection
• Full paths
Weights are based on Rocchio.
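The fragment types can be enumerated mechanically from a result's full path. The sketch below is one possible way to do it, not the authors' implementation. It uses the slide's "#" notation, treats infixes and subpaths together as contiguous inner runs of tags, and leaves out the wildcard variant (article/#/section/#) to stay short. The function name path_fragments is hypothetical.

```python
def path_fragments(path):
    """Enumerate prefix, suffix, inner (infix/subpath) and full-path fragments."""
    tags = path.split("/")
    n = len(tags)
    fragments = {path}  # the full path itself
    for i in range(1, n):
        fragments.add("/".join(tags[:i]) + "/#")   # prefixes, e.g. article/body/#
        fragments.add("#/" + "/".join(tags[i:]))   # suffixes, e.g. #/subsection
    # Contiguous runs that touch neither the first nor the last tag.
    for start in range(1, n - 1):
        for end in range(start + 1, n):
            fragments.add("#/" + "/".join(tags[start:end]) + "/#")
    return fragments


fragments = path_fragments("article/body/section/subsection")
```

Each fragment then becomes a candidate feature in the path dimension and can be weighted like any other feature.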

Evaluation of Expanded Queries
Three options:
• Engine-based:
  • Generate an expanded query with structural constraints
  • Submit it to a structural query engine
• Reranking:
  • Rerank a (large) existing set of results
• Hybrid:
  • Evaluate some of the new conditions with an engine
  • Rerank the resulting set of results

Generating Expanded Queries
• Initial query: query evaluation, i.e. *[query evaluation]
• Content dimension (C: XML, content of the result): *[query evaluation XML]
• Document dimension (D: tag + content of other elements in the document): //author[Baeza], //citation[Abiteboul]
[Figure: query graph connecting *[query evaluation XML] with author[Baeza] and citation[Abiteboul]; legend: descendant-or-self axis]
• The path dimension is handled differently

Reranking Query Results
Basic approach:
• Consider the set E of results (|E| ~ 1000) for the initial keyword query, with scores s(e)
• For each element e, compute a score w_d(e) in each dimension d:
  • Compute all features of e in dimension d
  • Compute w_d(e) as the cosine of e's feature vector and the selected query features for dimension d
• Normalize all scores to [-1, 1] and add the partial scores
• Sort E by combined score
Hybrid evaluation: evaluate some dimensions (such as content) with the engine, the others with reranking
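The reranking steps above can be sketched as follows. This is a minimal illustration under stated assumptions. Feature vectors are dictionaries from feature to weight, and query features carry weights that may be negative. Normalization divides by the largest absolute score, which is only one way to map scores into [-1, 1], since the slide does not say which method is used. All function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two sparse vectors given as dicts."""
    dot = sum(weight * b.get(feature, 0.0) for feature, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def normalize(scores):
    """Scale scores into [-1, 1] by the largest absolute value (an assumption)."""
    largest = max((abs(s) for s in scores.values()), default=0.0)
    return {e: (s / largest if largest else 0.0) for e, s in scores.items()}


def rerank(initial_scores, element_features, query_features):
    """initial_scores: element -> s(e)
    element_features: element -> dimension -> feature vector
    query_features: dimension -> selected query feature vector
    """
    combined = normalize(initial_scores)
    for dimension, query_vector in query_features.items():
        partial = normalize({
            e: cosine(element_features.get(e, {}).get(dimension, {}), query_vector)
            for e in initial_scores
        })
        for e in combined:
            combined[e] += partial[e]
    return sorted(combined, key=lambda e: -combined[e])


initial_scores = {"e1": 2.0, "e2": 1.0}
element_features = {
    "e1": {"C": {"index": 1.0}},
    "e2": {"C": {"XML": 1.0}},
}
query_features = {"C": {"XML": 1.0, "Fagin": -1.0}}
ranking = rerank(initial_scores, element_features, query_features)
```

In this toy input, e2 starts with the lower engine score but matches the selected content feature "XML", so it moves ahead of e1 after reranking.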

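Architecture
[Figure: the query is sent to the XML search engine, which returns results; the query, the results and the user's feedback are passed to the feedback component with its dimension modules (Content Module, Path Module, Doc Module, …), which produces an expanded query; the results of the expanded query go through Scoring + Reranking, which outputs the reranked results]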

Outline
• Relevance Feedback
• Structural Features for Feedback on XML
• Evaluation
• Conclusion and Future Work

Evaluation Settings
• INEX collection (IEEE-CS journal and conference articles):
  • 12,107 XML docs with 12 million elements
  • Queries with manual relevance assessments
• 52 keyword queries from 2003 and 2004, evaluated with our TopX search engine [VLDB05]
• Automatic feedback for the top-k results, taken from the relevance assessments
• The evaluation ignores results used for feedback (no testing on the training data)

Experimental Results with TopX
• Content and document dimensions together are best.

Experimental Results with TopX: Paths
• Position 1 (of 15) in the INEX 2005 Relevance Feedback Track

Current and Future Work
• Consider other feedback dimensions [ECIR 2006]
• Relevance feedback for queries with structure
• Active feedback: proactively ask the user for feedback on selected elements
• Exploit correlation of expansion candidates
• Integration with a graphical user interface
• Evaluation of feedback algorithms (INEX 2006 Relevance Feedback Track):
  • Eliminate the effect of "training on data"
  • Eliminate the influence of the search engine

Conclusions
• Structural feedback is an important step towards making XML IR work.
• Even a simple choice of expansion dimensions gives reasonable results.
• Many open problems are left for future research.

Thank you!

XML Example
[Figure: element tree of a Professor with Name (Gerhard Weikum), Address (City: SB, Country: Germany), Teaching (Course with Title: IR, Syllabus, Description: Information retrieval ...) and Research (Project with Title: Intelligent Search of XML Data, Sponsor: German Science Foundation), plus Book and Article nodes]
<Professor>
  <Name>Gerhard Weikum</Name>
  <Teaching/>
  <Address>
    <City>Saarbrücken</City>
    <Country>Germany</Country>
  </Address>
  <Research/>
</Professor>

XML-IR Example
Which professors from Saarbruecken (SB) are teaching IR and have research projects on XML?
• Query pattern: Professor[SB], Course[IR], Research[XML]
• As XPath: //Professor[contains(., "SB") and contains(.//Course, "IR") and contains(.//Research, "XML")]
• Challenges:
  • Ranked retrieval of elements, not documents
  • Information spread over multiple documents
[Figure: element tree of a Professor with Name (Gerhard Weikum), Address (City: SB, Country: Germany), Teaching (Course with Title: IR) and Research (Project with Title: Intelligent Search of XML Data, Sponsor: German Science Foundation)]
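To make the Boolean semantics of the XPath query concrete, the sketch below evaluates it by hand with Python's xml.etree.ElementTree, whose XPath subset does not include contains(). It follows XPath 1.0 in using the string value of the first matching Course or Research descendant. The sample document is a hypothetical reconstruction from the slide's tree, and the second professor ("Jane Doe") is invented purely as a non-matching case.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data loosely following the slide's element tree.
DOC = """
<Staff>
  <Professor>
    <Name>Gerhard Weikum</Name>
    <Address><City>SB</City><Country>Germany</Country></Address>
    <Teaching><Course><Title>IR</Title></Course></Teaching>
    <Research><Project><Title>Intelligent Search of XML Data</Title></Project></Research>
  </Professor>
  <Professor>
    <Name>Jane Doe</Name>
    <Address><City>Munich</City><Country>Germany</Country></Address>
    <Teaching><Course><Title>Databases</Title></Course></Teaching>
  </Professor>
</Staff>
"""


def string_value(elem):
    """Concatenated text of an element and its descendants, whitespace-normalized."""
    return " ".join("".join(elem.itertext()).split())


def first_descendant(elem, tag):
    """First descendant with the given tag in document order, or None."""
    return next((d for d in elem.iter(tag) if d is not elem), None)


def contains(elem, term):
    """XPath-like contains(): an empty node-set never matches."""
    return elem is not None and term in string_value(elem)


def matching_professors(root):
    return [
        p for p in root.iter("Professor")
        if contains(p, "SB")
        and contains(first_descendant(p, "Course"), "IR")
        and contains(first_descendant(p, "Research"), "XML")
    ]


names = [p.findtext("Name") for p in matching_professors(ET.fromstring(DOC))]
```

The result is an unranked set of exact matches. An element that used a different tag or spelled out "Saarbruecken" would be missed entirely, which is the limitation that ranked XML IR addresses.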

XML-IR Example
Which professors from Saarbruecken (SB) are teaching IR and have research projects on XML?
• Exact query: //Professor[contains(., "SB") and contains(.//Course, "IR") and contains(.//Research, "XML")]
• Similarity query: //~Professor[contains(., "~SB") and contains(.//~Course, "~IR") and contains(.//~Research, "~XML")]
• Challenges:
  • Ranked retrieval of elements, not documents
  • Information spread over multiple documents
  • Heterogeneous schemas and content
  • Similarity queries with a vast number of potential results
[Figure: two heterogeneous element trees. A Professor (Name: Gerhard Weikum; Address with City: SB, Country: Germany; Teaching with Course, Title: IR; Research with Project, Title: Intelligent Search of XML Data, Sponsor: German Science Foundation) and a Lecturer (Name: Ralf Schenkel; Address: Max-Planck Institute for CS, Germany; Interests: Semistructured Data, IR; Teaching with Seminar, Contents: Ranked Search ..., Literature)]

INEX
INEX RF Track

Data Structures
Example query: Professor[SB], Course[IR], Research[XML]
• Build index lists for each tag-term pair, grouped by document and sorted by the maximum score in the document
• Block-fetch all elements for the same doc
• Create and/or update candidates, including testing PCs (path conditions) in memory
• Maintain score and best score for each candidate; prune when possible
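The idea of keeping a current score and a best possible score per candidate can be illustrated with a small top-k loop over score-sorted lists, in the spirit of the sorted-list algorithms the talk cites. This is a simplified sketch, not TopX itself. It ignores document grouping, block fetches and path-condition tests, and it assumes non-negative scores and at most one entry per candidate in each list. The function name topk_sorted_lists is hypothetical.

```python
import heapq


def topk_sorted_lists(lists, k):
    """lists: condition -> [(candidate, score), ...] sorted by descending score.

    Returns up to k (candidate, worst_score) pairs, where worst_score is the
    sum of the scores seen so far (a lower bound on the final score).
    """
    position = {c: 0 for c in lists}
    seen = {}  # candidate -> {condition: score}

    def bounds():
        # Highest score any unread entry of each list can still have.
        high = {
            c: (lists[c][position[c]][1] if position[c] < len(lists[c]) else 0.0)
            for c in lists
        }
        worst = {d: sum(s.values()) for d, s in seen.items()}
        best = {
            d: worst[d] + sum(high[c] for c in lists if c not in s)
            for d, s in seen.items()
        }
        return high, worst, best

    while any(position[c] < len(lists[c]) for c in lists):
        # One round of sorted access on every list that still has entries.
        for c, entries in lists.items():
            if position[c] < len(entries):
                candidate, score = entries[position[c]]
                position[c] += 1
                seen.setdefault(candidate, {})[c] = score
        high, worst, best = bounds()
        top = heapq.nlargest(k, worst, key=worst.get)
        if len(top) == k:
            min_k = worst[top[-1]]
            others_pruned = all(best[d] <= min_k for d in seen if d not in top)
            unseen_pruned = sum(high.values()) <= min_k
            if others_pruned and unseen_pruned:
                break  # nothing outside the top-k can still overtake it

    _, worst, _ = bounds()
    top = heapq.nlargest(k, worst, key=worst.get)
    return [(d, worst[d]) for d in top]


index_lists = {
    "Professor[SB]": [("d1", 0.9), ("d2", 0.5), ("d3", 0.1)],
    "Course[IR]": [("d2", 0.8), ("d1", 0.7), ("d3", 0.2)],
}
result = topk_sorted_lists(index_lists, k=1)
```

In the example the loop stops after two rounds without reading the entries for d3, because neither d2 nor any unseen candidate can still exceed d1's accumulated score.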

Outline
• XML Information Retrieval Area Overview and Contributions
• The TopX Search Engine
• Structural Relevance Feedback

TopX: Efficient XML IR
Goal: Efficiently compute the best results of a similarity query
• Query and scoring model for similarity queries
• Extend top-k query processing algorithms for sorted lists [Buckley85, Güntzer et al. 00, Fagin01] to XML queries and data, including similarity queries
• Exploit cheap disk space for highly redundant indexing

TopX Data Model
Example: <P> Gerhard Weikum <C>IR</C> SB <R>XML</R></P>
• Element 1: docid=1, pre=1, post=3, tag="P", content="Gerhard Weikum IR SB XML"
• Element 2: docid=1, pre=2, post=1, tag="C", content="IR"
• Element 3: docid=1, pre=3, post=2, tag="R", content="XML"
• Pure tree model, ignoring links
• Content of descendants is replicated; per-element term scores (using tf/idf scores or a variant of the Okapi BM25 model)
• Pre/postorder labels reflect the element hierarchy [Grust02]
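The labels on this slide can be reproduced with a single depth-first traversal. The sketch below uses Python's xml.etree.ElementTree and is only meant to show how pre/post numbers and replicated content arise. The row layout and the function names are illustrative assumptions, not the TopX index format.

```python
import xml.etree.ElementTree as ET


def label_elements(xml_text, docid=1):
    """Return one row per element with pre/postorder labels and full content."""
    root = ET.fromstring(xml_text)
    rows = []
    counters = {"pre": 0, "post": 0}

    def visit(elem):
        counters["pre"] += 1  # preorder number: assigned on entry
        row = {
            "docid": docid,
            "pre": counters["pre"],
            "tag": elem.tag,
            # Content includes the text of all descendants (replication).
            "content": " ".join("".join(elem.itertext()).split()),
        }
        rows.append(row)
        for child in elem:
            visit(child)
        counters["post"] += 1  # postorder number: assigned on exit
        row["post"] = counters["post"]

    visit(root)
    return rows


def is_ancestor(a, d):
    """a is an ancestor of d iff it starts earlier and finishes later."""
    return a["docid"] == d["docid"] and a["pre"] < d["pre"] and a["post"] > d["post"]


rows = label_elements("<P> Gerhard Weikum <C>IR</C> SB <R>XML</R></P>")
```

With these labels, an ancestor/descendant test needs only two integer comparisons, which is what makes checking path conditions in memory cheap.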

Queries
• Query = tree/graph pattern with
  • mandatory/optional content conditions (CC)
  • mandatory path conditions (PC)
  • a mandatory target element
• Formulated in an XPath-like language
• Example: Professor[SB], Course[IR], Research[XML]
• Special case: keyword query, e.g. *[IR XML database]

Query Scores for Content Conditions
• Basic scoring idea within the IR-style family of TF*IDF ranking functions
• Content-based scores cast into an Okapi BM25 probabilistic model with element-specific model parameterization, using element statistics:
  • tf(c_i, e): number of occurrences of c_i in element e
  • N_T: number of elements with tag T
  • ef_T(c_i): number of elements with tag T that contain c_i
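The transcript lists the statistics but not the scoring formula. As a rough illustration, the sketch below plugs the element statistics into the textbook Okapi BM25 shape: tag-specific counts N_T and ef_T take the place of document counts. The length normalization inputs (length, avg_length) and the parameters k1 and b are assumptions that do not appear on the slide, and the exact TopX parameterization may differ.

```python
import math


def element_bm25(tf, n_t, ef_t, length, avg_length, k1=1.2, b=0.75):
    """BM25-style score of one content condition T[c] for one element e.

    tf          occurrences of term c in element e
    n_t         number of elements with tag T
    ef_t        number of elements with tag T that contain c
    length      length of e (assumed input, not on the slide)
    avg_length  average length of elements with tag T (assumed input)
    """
    k = k1 * ((1 - b) + b * length / avg_length)
    idf = math.log((n_t - ef_t + 0.5) / (ef_t + 0.5))
    return (k1 + 1) * tf / (k + tf) * idf


common_term = element_bm25(tf=1, n_t=1000, ef_t=100, length=50, avg_length=50)
rare_term = element_bm25(tf=1, n_t=1000, ef_t=10, length=50, avg_length=50)
repeated_term = element_bm25(tf=2, n_t=1000, ef_t=10, length=50, avg_length=50)
```

Because the statistics are kept per tag, the same term can score differently in, say, a title element and a paragraph element.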

Query Scores
Example query: Professor[SB], Course[IR], Research[XML]
• Candidate = connected sub-pattern with element ids and scores
• Result = candidate with scores for all mandatory conditions and the target element
• Content-based score of a result with elements e_1, …, e_m for a query q with CCs T_1[c_1], …, T_m[c_m] (some e_i may be empty)
• Additional extensions for path conditions
[Figure: example candidates with element ids 171 (score 0.8) and 182 (score 0.5)]

Structural Features
[Figure: document tree of an article with frontmatter, body and backmatter; it contains author "Baeza-Yates" and nested sec/subsec/p elements ("Semistructured data…", "XML has evolved…", "With the advent of XSLT…"); the user marks one element as a relevant result]
Possible features:
• C: content of the result, e.g. XML
• D: tag + content of descendants, e.g. p[XSLT]
• A: tag + content of ancestors, e.g. sec[data]
• AD: tag + content of descendants of ancestors, e.g. article//author[Baeza]
