
INEX: Evaluating content-oriented XML retrieval


Presentation Transcript


  1. INEX: Evaluating content-oriented XML retrieval Mounia Lalmas Queen Mary University of London http://qmir.dcs.qmul.ac.uk

  2. Outline • Content-oriented XML retrieval • Evaluating XML retrieval: INEX

  3. XML Retrieval • Traditional IR is about finding documents relevant to a user’s information need, e.g. an entire book. • XML retrieval allows users to retrieve document components that are more focussed on their information needs, e.g. a chapter of a book instead of the entire book. • The structure of documents is exploited to identify which document components to retrieve.

  4. Structured Documents [Figure: a book and its logical decomposition into chapters, sections and paragraphs] • Linear order of words, sentences, paragraphs … • Hierarchy or logical structure of a book’s chapters, sections … • Links (hyperlinks), cross-references, citations … • Temporal and spatial relationships in multimedia documents

  5. Structured Documents Explicit structure formalised through document representation standards (mark-up languages) • Layout: LaTeX (publishing), HTML (Web publishing), e.g. <b><font size=+2>SDR</font></b><img src="qmir.jpg" border=0> • Structure: SGML, XML (Web publishing, engineering), MPEG-7 (broadcasting), e.g. <section> <subsection> <paragraph>… </paragraph> <paragraph>… </paragraph> </subsection> </section> • Content/Semantics: RDF, DAML+OIL, OWL (semantic web), e.g. <Book rdf:about="book"> <rdf:author="…"/> <rdf:title="…"/> </Book>

  6. XML: eXtensible Mark-up Language • Meta-language (user-defined tags) currently being adopted as the document format language by the W3C • Used to describe content and structure (not layout) • Grammar described in a DTD (used for validation) <lecture> <title> Structured Document Retrieval </title> <author> <fnm> Smith </fnm> <snm> John </snm> </author> <chapter> <title> Introduction into XML retrieval </title> <paragraph> …. </paragraph> … </chapter> … </lecture> <!ELEMENT lecture (title, author+, chapter+)> <!ELEMENT author (fnm*, snm)> <!ELEMENT fnm (#PCDATA)> …

  7. XML: eXtensible Mark-up Language Use of XPath notation to refer to the XML structure • chapter/title: a title that is a direct sub-component of a chapter • //title: any title • chapter//title: a title that is a direct or indirect sub-component of a chapter • chapter/paragraph[2]: the second direct paragraph of any chapter • chapter/*: all direct sub-components of a chapter <lecture> <title> Structured Document Retrieval </title> <author> <fnm> Smith </fnm> <snm> John </snm> </author> <chapter> <title> Introduction into SDR </title> <paragraph> …. </paragraph> … </chapter> … </lecture>
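To make the path notation concrete, here is a minimal sketch (not part of the original slides) that evaluates some of these expressions over the lecture document with Python's standard library. ElementTree only supports a limited XPath subset, so the examples stick to paths it understands; full XPath (e.g. //title at the start of a path) would need a library such as lxml.

    # Minimal sketch: evaluating XPath-like paths with xml.etree.ElementTree.
    import xml.etree.ElementTree as ET

    doc = ET.fromstring(
        "<lecture>"
        "<title>Structured Document Retrieval</title>"
        "<author><fnm>Smith</fnm><snm>John</snm></author>"
        "<chapter>"
        "<title>Introduction into SDR</title>"
        "<paragraph>first paragraph</paragraph>"
        "<paragraph>second paragraph</paragraph>"
        "</chapter>"
        "</lecture>"
    )

    print([e.text for e in doc.findall("chapter/title")])        # title as a direct sub-component of chapter
    print([e.text for e in doc.findall(".//title")])             # any title, at any level
    print([e.text for e in doc.findall("chapter/paragraph[2]")]) # second direct paragraph of a chapter
    print([e.tag for e in doc.findall("chapter/*")])             # all direct sub-components of chapter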

  8. Querying XML documents • Content-only (CO) queries: 'open standards for digital video in distance learning' • Content-and-structure (CAS) queries: //article[about(., 'formal methods verify correctness aviation systems')]/body//section[about(., 'case study application model checking theorem proving')] • Structure-only (SO) queries: /article//*section/paragraph[2]

  9. Content-oriented XML retrieval Return document components of varying granularity (e.g. a book, a chapter, a section, a paragraph, a table, a figure, etc.), relevant to the user’s information need with regard to both content and structure.

  10. Content-oriented XML retrieval Retrieve the best components according to content and structure criteria: • INEX: the most specific component that satisfies the query, while being exhaustive to the query • Shakespeare study: best entry points, i.e. components from which many relevant components can be reached through browsing • ???

  11. Challenges [Figure: an Article element with nested Title, Section 1 and Section 2 elements, each carrying weighted terms such as XML, retrieval and authoring] No fixed retrieval unit + nested elements + element types: • how to obtain document and collection statistics? • which component is a good retrieval unit? • which components contribute best to the content of the Article? • how to estimate? • how to aggregate? (see the sketch below)
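One common way to approach the estimation and aggregation questions (a sketch of the general idea, not a method prescribed by the slides; the element names, scores and the 0.6 decay factor below are made up for illustration) is to propagate scores from child elements to their parents with a decay factor, so that nested elements contribute to, but do not dominate, their ancestors:

    # Illustrative sketch: aggregate element scores bottom-up with a decay factor.
    def aggregate(own_scores, children, decay=0.6):
        """own_scores: element id -> score of the element's own content.
        children: element id -> list of child element ids."""
        def total(e):
            return own_scores.get(e, 0.0) + decay * sum(total(c) for c in children.get(e, []))
        return {e: total(e) for e in own_scores}

    own = {"article": 0.2, "title": 0.5, "section1": 0.4, "section2": 0.7}
    tree = {"article": ["title", "section1", "section2"]}
    print(aggregate(own, tree))  # the article's score combines its own content with its children's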

  12. Approaches … vector space model, Bayesian network, fusion, collection statistics, language model, cognitive model, smoothing, proximity search, tuning, belief model, Boolean model, relevance feedback, phrases, parameter estimation, probabilistic model, logistic regression, component statistics, ontology, term statistics, natural language processing, extending DB model

  13. Vector space model (IBM Haifa, INEX 2003) [Figure: separate article, abstract, section, sub-section and paragraph indices; each produces RSVs that are normalised and merged into a single ranking] tf and idf computed as for fixed and non-nested retrieval units
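A minimal sketch of the merging idea (illustrative only; the element ids, RSV values and max-normalisation below are assumptions, and the actual IBM Haifa system differs in its details): score each element type against its own index, normalise the retrieval status values (RSVs) per index, then merge everything into one ranking.

    # Sketch only: normalise per-index RSVs and merge them into a single ranking.
    def merge_runs(run_per_index):
        """run_per_index: index name -> list of (element_id, rsv) pairs."""
        merged = []
        for index_name, run in run_per_index.items():
            top = max((rsv for _, rsv in run), default=0.0)
            if top <= 0.0:
                continue
            merged += [(element_id, rsv / top) for element_id, rsv in run]  # max-normalisation
        return sorted(merged, key=lambda pair: pair[1], reverse=True)

    runs = {
        "article":   [("a1", 12.0), ("a2", 7.5)],
        "section":   [("a1/sec[1]", 4.2), ("a2/sec[3]", 3.9)],
        "paragraph": [("a1/sec[1]/p[2]", 1.8)],
    }
    print(merge_runs(runs))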

  14. Language model (University of Amsterdam, INEX 2003) • element score from a mixture of an element language model and a collection language model, with smoothing parameter λ • a high value of λ leads to an increase in the size of retrieved elements • elements ranked by combining element score, element size and article score • query expansion with blind feedback • elements with fewer than 20 terms ignored • results with λ = 0.9, 0.5 and 0.2 were similar
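The element score in such a mixture language model typically has the standard smoothed form below (the formula on the original slide is not reproduced in this transcript, so this is the usual formulation rather than the exact one used):

    P(q \mid e) = \prod_{t \in q} \big( \lambda \, P(t \mid e) + (1 - \lambda) \, P(t \mid C) \big)

where P(t|e) is the probability of term t in the element's own language model, P(t|C) its probability in the collection language model, and λ the smoothing parameter; values of λ close to 1 trust the element itself and therefore favour longer elements.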

  15. Evaluation of XML retrieval: INEX • Evaluating the effectiveness of content-oriented XML retrieval approaches • Collaborative effort: participants contribute to the development of the collection, the queries and the relevance assessments • Similar methodology to TREC, but adapted to XML retrieval • 40+ participants worldwide • Workshop in Schloss Dagstuhl in December (20+ institutions)

  16. INEX Test Collection • Documents (~500MB): 12,107 articles in XML format from the IEEE Computer Society; 8 million elements • INEX 2002: 30 CO and 30 CAS queries; inex2002 metric • INEX 2003: 36 CO and 30 CAS queries; CAS queries are defined according to an enhanced subset of XPath; inex2002 and inex2003 metrics • INEX 2004 is just starting

  17. Tasks • CO: aim is to decrease user effort by pointing the user to the most specific relevant portions of documents. • SCAS: retrieve relevant nodes that match the structure specified in the query. • VCAS: retrieve relevant nodes that may not be the same as the target elements, but are structurally similar.

  18. Relevance in XML • An element is relevant if it “has significant and demonstrable bearing on the matter at hand” • Common assumptions in IR: objectivity, topicality, binary nature, independence [Figure: an article containing sections 1–3; section 1 contains paragraphs 1–2]

  19. Relevance in INEX • Exhaustivity: how exhaustively a document component discusses the query (0, 1, 2, 3) • Specificity: how focused the component is on the query (0, 1, 2, 3) • Relevance is the pair of the two: (3,3), (2,3), (1,1), (0,0), … • all sections relevant ⇒ article very relevant • all sections relevant ⇒ article better than its sections • one section relevant ⇒ article less relevant • one section relevant ⇒ section better than article …

  20. Relevance assessment task • Completeness: assessing an element implies assessing its parent and child elements • Consistency: the parent of a relevant element must also be relevant, although possibly to a different extent • Exhaustivity increases going up the tree • Specificity decreases going up the tree • Use of an online interface • Assessing a query takes a week! • Average of 2 topics per participant [Figure: an article containing sections 1–3; section 1 contains paragraphs 1–2]
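A small illustrative check of those consistency rules (not an official INEX tool; the encoding of the rules is an assumption based on the bullets above): along a path from a leaf up to the root, exhaustivity should not decrease and specificity should not increase.

    # Illustrative check: path holds (exhaustivity, specificity) pairs ordered
    # from leaf to root, e.g. paragraph -> section -> article.
    def consistent(path):
        for (child_e, child_s), (parent_e, parent_s) in zip(path, path[1:]):
            if parent_e < child_e or parent_s > child_s:
                return False
        return True

    print(consistent([(2, 3), (3, 2), (3, 1)]))  # True: exhaustivity goes up, specificity goes down
    print(consistent([(3, 3), (1, 3)]))          # False: the parent is less exhaustive than its child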

  21. Interface [Screenshot of the online assessment interface, showing current assessments, groups and navigation]

  22. Assessments • With respect to the elements to assess: 26% of assessments were on elements in the pool (66% in INEX 2002); 68% were highly specific elements not in the pool • 7% of elements were assessed automatically • INEX 2002: 23 inconsistent assessments per query for one rule

  23. Metrics Need to consider: • Two dimensions of relevance • The independence assumption does not hold • No predefined retrieval unit • Overlap • Linear vs. clustered ranking

  24. INEX 2002 metric Quantisation: strict and generalised

  25. INEX 2002 metric Precision as defined by Raghavan et al. ’89, based on expected search length (ESL), where n (the total number of relevant units) is estimated
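The formula itself did not survive in this transcript; the usual statement of the measure (a hedged reconstruction following Raghavan et al.'s precall, which the inex_eval metric builds on) is:

    P(x) = \frac{x \cdot n}{x \cdot n + esl_x}

where x is a recall point, n the total number of relevant units, and esl_x the expected search length, i.e. the expected number of non-relevant units seen before recall x is reached.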

  26. Overlap problem

  27. INEX 2003 metric Ideal concept space (Wong & Yao ‘95)

  28. INEX 2003 metric Quantisation: strict and generalised
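The quantisation functions themselves are not reproduced in this transcript. As a reminder of their shape (the exact weights follow published INEX descriptions and should be treated as indicative rather than authoritative), they map an (exhaustivity, specificity) pair to a single relevance value:

    f_{strict}(e, s) =
      \begin{cases}
        1 & \text{if } (e, s) = (3, 3) \\
        0 & \text{otherwise}
      \end{cases}

    f_{generalised}(e, s) =
      \begin{cases}
        1.00 & (e, s) = (3, 3) \\
        0.75 & (e, s) \in \{(2, 3), (3, 2), (3, 1)\} \\
        0.50 & (e, s) \in \{(1, 3), (2, 2), (2, 1)\} \\
        0.25 & (e, s) \in \{(1, 2), (1, 1)\} \\
        0.00 & (e, s) = (0, 0)
      \end{cases}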

  29. INEX 2003 metric Ignoring overlap:
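The formulas were lost in this transcript; the following is a hedged reconstruction of the size-based recall and precision when overlap is ignored, consistent with the remarks on slide 31 (recall uses exhaustivity only, precision uses specificity only, and component size enters directly):

    recall = \frac{\sum_{i=1}^{k} e(c_i) \, |c_i|}{\sum_{c \in C} e(c) \, |c|}
    \qquad
    precision = \frac{\sum_{i=1}^{k} s(c_i) \, |c_i|}{\sum_{i=1}^{k} |c_i|}

where c_1, …, c_k is the ranked result list, |c| the size of component c, e(c) and s(c) its quantised exhaustivity and specificity, and C the set of all components in the collection.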

  30. INEX 2003 metric Considering overlap:
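Again a hedged reconstruction: when overlap is considered, only the novel (not yet retrieved) part of each returned component is counted, i.e. |c_i| is replaced by the size of the part of c_i not covered by earlier results, under the assumption that relevant information is uniformly distributed inside a component:

    recall_o = \frac{\sum_{i=1}^{k} e(c_i) \, |c_i'|}{\sum_{c \in C} e(c) \, |c|}
    \qquad
    precision_o = \frac{\sum_{i=1}^{k} s(c_i) \, |c_i'|}{\sum_{i=1}^{k} |c_i'|}

with |c_i'| the size of the part of c_i not already seen at ranks 1, …, i−1.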

  31. INEX 2003 metric • Penalises overlap by only scoring novel information in overlapping results • Assumes a uniform distribution of relevant information • Issue of stability • Size enters precision directly (is it intuitive that larger is better?) • Recall defined using exhaustivity only • Precision defined using specificity only

  32. Alternative metrics • User-effort-oriented measures: Expected Relevant Ratio, Tolerance to Irrelevance • Discounted Cumulated Gain
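For reference (the standard Järvelin & Kekäläinen definition, not taken from the slide), discounted cumulated gain accumulates the gain of each ranked result while discounting it by the logarithm of its rank:

    DCG[i] = \sum_{j=1}^{i} \frac{G[j]}{\max(1, \log_b j)}

where G[j] is the (graded) gain of the result at rank j and b (typically 2) controls how sharply late ranks are discounted.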

  33. Lessons learnt • Good definition of relevance • Expressing CAS queries was not easy • Relevance assessment process must be “improved” • Further development on metrics needed • User studies required

  34. Conclusion • XML retrieval is not just about the effective retrieval of XML documents, but also about how to evaluate effectiveness • INEX 2004 tracks • Relevance feedback • Interactive • Heterogeneous collection • Natural language query http://inex.is.informatik.uni-duisburg.de:2004/

  35. INEX: Evaluating content-oriented XML retrieval Mounia Lalmas Queen Mary University of London http://qmir.dcs.qmul.ac.uk
