
Evaluating content-oriented XML retrieval: The INEX initiative



  1. Evaluating content-oriented XML retrieval: The INEX initiative Mounia Lalmas Queen Mary University of London http://qmir.dcs.qmul.ac.uk

  2. Outline • Information retrieval • XML retrieval • Evaluating information retrieval • Evaluating XML retrieval: INEX

  3. Information retrieval Example of a user information need (e.g. on the WWW): “Find all documents about sailing charter agencies that (1) offer sailing boats in the Greek islands, and (2) are registered with the RYA. The documents should contain boat specification, price per week, e-mail and other contact details.” A formal representation of an information need constitutes a query

  4. Information retrieval IR is concerned with the representation, storage, organisation, and access to repositories of information, usually in the form of documents. Primary goal of an IR system: “Retrieve all the documents which are relevant (useful) to a user query, while retrieving as few non-relevant documents as possible.”

  5. Conceptual model for IR (diagram): documents go through indexing to produce a document representation; the query goes through formulation to produce a query representation; a retrieval function matches the two and produces the retrieval results, which can feed back into query formulation via relevance feedback.

  6. XML Retrieval • Traditional IR is about finding relevant documents to a user’s information need, e.g. an entire book. • XML allows users to retrieve document components that are more focussed on their information needs, e.g. a chapter of a book instead of the entire book. • The structure of documents is exploited to identify which document components (XML elements) to retrieve.

  7. XML: eXtensible Mark-up Language • Meta-language (user-defined tags) currently being adopted as the document format language by the W3C • Used to describe content and structure (and not layout) • Grammar described in a DTD (used for validation)
  <lecture>
    <title> Structured Document Retrieval </title>
    <author> <fnm> Smith </fnm> <snm> John </snm> </author>
    <chapter>
      <title> Introduction into XML retrieval </title>
      <paragraph> …. </paragraph>
      …
    </chapter>
    …
  </lecture>
  <!ELEMENT lecture (title, author+, chapter+)>
  <!ELEMENT author (fnm*, snm)>
  <!ELEMENT fnm (#PCDATA)>
  …

  8. XML: eXtensible Mark-up Language • Use of XPath notation to refer to the XML structure
  chapter/title: a title that is a direct sub-component of a chapter
  //title: any title
  chapter//title: a title that is a direct or indirect sub-component of a chapter
  chapter/paragraph[2]: the second direct paragraph of any chapter
  chapter/*: all direct sub-components of a chapter
  <lecture>
    <title> Structured Document Retrieval </title>
    <author> <fnm> Smith </fnm> <snm> John </snm> </author>
    <chapter>
      <title> Introduction into SDR </title>
      <paragraph> …. </paragraph>
      …
    </chapter>
    …
  </lecture>
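
  As a quick illustration of these path expressions, here is a minimal Python sketch that runs them over an abbreviated version of the lecture document using the standard library's xml.etree.ElementTree, which supports a limited XPath subset (the element contents below are placeholders, not part of the original slide).

    import xml.etree.ElementTree as ET

    # Abbreviated version of the lecture document from the slide.
    doc = """
    <lecture>
      <title>Structured Document Retrieval</title>
      <author><fnm>Smith</fnm><snm>John</snm></author>
      <chapter>
        <title>Introduction into SDR</title>
        <paragraph>First paragraph</paragraph>
        <paragraph>Second paragraph</paragraph>
      </chapter>
    </lecture>
    """

    root = ET.fromstring(doc)

    print([t.text for t in root.findall("chapter/title")])        # title directly under a chapter
    print([t.text for t in root.findall(".//title")])             # the slide's //title; ElementTree needs the leading '.'
    print([t.text for t in root.findall("chapter//title")])       # title anywhere below a chapter
    print([p.text for p in root.findall("chapter/paragraph[2]")]) # second paragraph child of each chapter
    print([e.tag for e in root.findall("chapter/*")])             # all direct children of a chapter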

  9. Queries • Content-only (CO) queries • Standard IR queries, but here we are retrieving document components • “London tube strikes” • Content-and-structure (CAS) queries • Put constraints on which types of components are to be retrieved • E.g. “Sections of an article in the Times about congestion charges” • E.g. “Articles that contain sections about congestion charges in London, and that contain a picture of Ken Livingstone”

  10. Conceptual model for XML retrieval (diagram, mirroring the IR model above): structured documents are indexed on content + structure (tf, idf, acc) into an inverted file + structure index; the query is formulated into a content + structure representation; the retrieval function matches on content + structure; the retrieval results are presented as related components; relevance feedback closes the loop.
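
  The indexing step above can be pictured as an inverted file whose postings record not just the document but the element path of each term occurrence. The Python sketch below is only an illustrative toy structure, not the implementation of any particular INEX system; the document id, element paths and texts are invented.

    from collections import defaultdict

    # Toy inverted file with a structure index: for each term we keep
    # postings of the form (document id, element path, term frequency).
    inverted = defaultdict(list)

    def index_element(doc_id, path, text):
        """Index the text of one XML element under its element path."""
        counts = defaultdict(int)
        for term in text.lower().split():
            counts[term] += 1
        for term, tf in counts.items():
            inverted[term].append((doc_id, path, tf))

    # Hypothetical content for two paragraphs of one article.
    index_element("x101", "/article[1]/sec[1]/p[1]", "XML retrieval")
    index_element("x101", "/article[1]/sec[1]/p[2]", "XML authoring")

    # A content-only query is then matched at element granularity:
    print(inverted["xml"])   # postings point at paragraphs, not whole articles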

  11. Example of XML approaches (tree diagram: an article containing sections, which contain paragraphs) • The representation of a composite element (e.g. article and section) is defined as the aggregated representation of its sub-elements • p1 is about “XML” and “retrieval”; p2 is about “XML” and “authoring” • sec3 is then also about “XML” (in fact very much about “XML”), “retrieval” and “authoring”

  12. Example of XML approaches (tree diagram: a Document with three sub-elements) • Title = {0.9 t1, 0.4 t2} • Section_1 = {0.5 t1} • Section_2 = {0.2 t1, 0.7 t3} • Document = {? t1, ? t2, ? t3}, where ? = aggregated weight of ti in Document based on the instances of ti in the sub-elements (Title, Section_1 and Section_2)
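
  A minimal sketch of this aggregation idea in Python: the weight of a term in a composite element is computed from the weights of that term in its children. The decayed sum used below (child weights scaled by an augmentation factor and capped at 1.0) is just one plausible choice for illustration, not the specific function used by any INEX participant.

    def aggregate(children_weights, factor=0.7):
        """Aggregate term weights of child elements into their parent.

        children_weights: list of dicts mapping term -> weight.
        factor: down-weighting applied to evidence coming from children
                (an assumed value for illustration only).
        """
        parent = {}
        for child in children_weights:
            for term, w in child.items():
                parent[term] = min(1.0, parent.get(term, 0.0) + factor * w)
        return parent

    title     = {"t1": 0.9, "t2": 0.4}
    section_1 = {"t1": 0.5}
    section_2 = {"t1": 0.2, "t3": 0.7}

    doc = aggregate([title, section_1, section_2])
    print({t: round(w, 2) for t, w in doc.items()})
    # {'t1': 1.0, 't2': 0.28, 't3': 0.49} -- t1 is reinforced by all three children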

  13. Evaluation • The goal of an IR system • retrieve as many relevant documents as possible and as few non-relevant documents as possible • Comparative evaluation of technical performance of IR systems = effectiveness • ability of the IR system to retrieve relevant documents and suppress non-relevant documents • Effectiveness • combination of recall and precision

  14. Relevance • A document is relevant if it “has significant and demonstrable bearing on the matter at hand”. • Common assumptions: • Objectivity • Topicality • Binary nature • Independence

  15. Recall / Precision (Venn diagram: within the document collection, the set of retrieved documents and the set of relevant documents overlap in the “retrieved and relevant” region) • Recall = |retrieved and relevant| / |relevant| • Precision = |retrieved and relevant| / |retrieved|

  16. Recall / Precision: example • Relevant documents for a given query: {d3, d5, d9, d25, d39, d44, d56, d71, d89, d123}
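
  Below is a small Python sketch of how recall and precision would be computed for this query, assuming a hypothetical ranked result list (the system output is invented purely for illustration).

    relevant = {"d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d89", "d123"}

    # Hypothetical system output, best-ranked first.
    retrieved = ["d5", "d8", "d3", "d44", "d102", "d89"]

    retrieved_and_relevant = [d for d in retrieved if d in relevant]

    precision = len(retrieved_and_relevant) / len(retrieved)   # 4 / 6 ≈ 0.67
    recall    = len(retrieved_and_relevant) / len(relevant)    # 4 / 10 = 0.40

    print(f"precision = {precision:.2f}, recall = {recall:.2f}")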

  17. Comparison of systems (plot: recall/precision curves for system 1 and system 2, with precision on the y-axis and recall on the x-axis, both from 0 to 100%)

  18. Test collection • Document collection = the documents themselves • depends on the task, e.g. evaluating web retrieval requires a collection of HTML documents • Queries / requests • simulate real user information needs • Relevance judgements • stating, for a query, which documents are relevant • See TREC

  19. Evaluation of XML retrieval: INEX • Evaluating the effectiveness of content-oriented XML retrieval approaches • Collaborative effort = participants contribute to the development of the collection • queries • relevance assessments • Similar methodology as for TREC, but adapted to XML retrieval.

  20. INEX Test Collection • The INEX test collection (2002) • Documents (~500MB): 12,107 articles in XML format from the IEEE Computer Society • 30 CO and 30 CAS queries • Relevance assessments per retrieved component, by participating groups • Relevance defined in terms of “relevance” and “coverage” • Participants: 36 active groups worldwide • In 2003, INEX had 36 CO and 30 CAS queries • Same document collection • CAS queries are defined according to a subset of XPath • Relevance assessments per retrieved component, by participating groups • Relevance defined in terms of “exhaustivity” and “specificity” • Participants: 40 active groups worldwide • INEX 2004 is just starting

  21. Example of CO topic
  <inex_topic topic_id="126" query_type="CO" ct_no="25">
    <title>Open standards for digital video in distance learning</title>
    <description>Open technologies behind media streaming in distance learning projects</description>
    <narrative>I am looking for articles/components discussing methodologies of digital video production and distribution that respect free access to media content through internet or via CD-ROMs or DVDs in connection to the learning process. Discussions of open versus proprietary standards of storing and sending digital video will be appreciated.</narrative>
    <keywords>media streaming, video streaming, audio streaming, digital video, distance learning, open standards, free access</keywords>

  22. Example of CAS topic
  <title>//article[about(.,'formal methods verify correctness aviation systems')]/body//*[about(.,'case study application model checking theorem proving')]</title>
  <description>Find documents discussing formal methods to verify correctness of aviation systems. From those articles extract parts discussing a case study of using model checking or theorem proving for the verification.</description>
  <narrative>To be considered relevant a document must be about using formal methods to verify correctness of aviation systems, such as flight traffic control systems, airplane- or helicopter-parts. From those documents a body-part must be returned (I do not want the whole body element, I want something smaller). That part should be about a case study of applying a model checker or a theorem prover to the verification.</narrative>
  <keywords>SPIN, SMV, PVS, SPARK, CWB</keywords>

  23. Relevance in XML (tree diagram: an article containing sections, which contain paragraphs) • An element is relevant if it “has significant and demonstrable bearing on the matter at hand” • Common assumptions in IR: • Objectivity • Topicality • Binary nature (no longer holds: relevance of an element is graded) • Independence (no longer holds: the relevance of an element depends on its sub-elements)

  24. Relevance in XML • Exhaustivity • how exhaustively a document component discusses the topic of the request • Specificity • how focused the component is on the topic of the request (i.e. it discusses no other, irrelevant topics) • Both are 4-graded: 0, 1, 2, 3 • grades are needed because of the structure • Relevance: (3,3), (2,3), (1,1), (0,0), etc.
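
  INEX combines the two graded dimensions into a single relevance value through a quantisation function. In the Python sketch below, the strict quantisation (only elements judged fully exhaustive and fully specific count as relevant) reflects the usual INEX definition; the "generalised" mapping is only an illustrative stand-in for the official graded function, not its actual formula.

    def strict(exhaustivity, specificity):
        """Strict quantisation: only (3,3) elements count as relevant."""
        return 1.0 if (exhaustivity, specificity) == (3, 3) else 0.0

    def generalised(exhaustivity, specificity):
        """Illustrative graded quantisation (not the official INEX formula):
        average of the two dimensions, normalised to [0, 1]."""
        return (exhaustivity + specificity) / 6.0

    for e, s in [(3, 3), (2, 3), (1, 1), (0, 0)]:
        print((e, s), strict(e, s), round(generalised(e, s), 2))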

  25. Relevance assessment task (tree diagram: an article containing sections, which contain paragraphs) • Exhaustivity • assessing an element implies also assessing its parent and children elements • Consistency • the parent of a relevant element must also be relevant, although perhaps to a different extent • Exhaustivity increases going up the tree • Specificity decreases going up the tree • Use of an online interface • Assessing a query takes a week! • Average of 2 topics per participant • Only participants that complete the assessment task have access to the collection

  26. Metrics (tree diagram: doc[23], its section sec[3], and paragraphs p[2] and p[4]) Recall/precision can be used but must take into consideration: • near misses (we do not retrieve the best component, e.g. p[4], but one near enough, e.g. p[2]) • overlap (we retrieve a component, e.g. doc[23], and one of its sub-components, e.g. sec[3])
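
  One way to picture the overlap problem: if doc[23] has already been retrieved, retrieving its descendant sec[3] should not earn full credit again. The Python sketch below discounts any element whose ancestor appears earlier in the ranking; it is only an illustration of the issue, not one of the official INEX metrics, and the ranking shown is invented.

    def novelty_credits(ranked_paths):
        """Give full credit to an element only if no ancestor of it
        was retrieved earlier in the ranking (illustrative rule)."""
        seen, credits = [], []
        for path in ranked_paths:
            overlapped = any(path.startswith(prev + "/") for prev in seen)
            credits.append(0.0 if overlapped else 1.0)
            seen.append(path)
        return credits

    ranking = ["/doc[23]", "/doc[23]/sec[3]", "/doc[23]/sec[3]/p[2]", "/doc[40]/sec[1]"]
    print(novelty_credits(ranking))   # [1.0, 0.0, 0.0, 1.0]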

  27. Conclusion • XML retrieval is not just about the effective retrieval of XML documents, but also about how to evaluate that effectiveness • INEX 2004 • More rigorous query topic format (e.g. parser) • New metrics (e.g. not based on precision/recall) • Tracks • Relevance feedback • Interactive • Heterogeneous collection • Natural language query

  28. Thank you http://inex.is.informatik.uni-duisburg.de:2004/
