

  1. XML Retrieval and Evaluation: Where are we? Mounia Lalmas Queen Mary University of London

  2. Outline • XML retrieval paradigm • Focussed retrieval • Overlapping elements • XML retrieval evaluation • INEX • Topics • Relevance • Assessments • Metrics

  3. Acknowledgements Norbert Fuhr, Shlomo Geva, Norbert Gövert, Gabriella Kazai, Saadia Malik, Benjamin Piwowarski, Thomas Rölleke, Börkur Sigurbjörnsson, Zoltan Szlavik, Vu Huyen Trang, Andrew Trotman, Arjen de Vries FERMI model (Chiaramella, Mulhem & Fourel, 1996) DELOS EU Network of Excellence (FP5 & FP6) ARC exchange programme (DAAD & British Council) EPSRC (UK funding body)

  4. XML: eXtensible Mark-up Language • Meta-language adopted as document format language by W3C • Describe content and logical structure • Use of XPath notation to refer to the XML structure <book> <title> Structured Document Retrieval </title> <author> <fnm> Smith </fnm> <snm> John </snm> </author> <chapter> <title> Introduction into XML retrieval </title> <paragraph> …. </paragraph> … </chapter> … </book> chapter/title: title is a direct sub-component of chapter //title: any title chapter//title: title is a direct or indirect sub-component of chapter chapter/paragraph[2]: any direct second paragraph of any chapter chapter/*: all direct sub-components of a chapter
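A minimal sketch of how these XPath expressions behave on the book example above, using Python's lxml library (the snippet is illustrative and not part of the original slides):

from lxml import etree

book = etree.fromstring(
    "<book>"
    "<title>Structured Document Retrieval</title>"
    "<author><fnm>Smith</fnm><snm>John</snm></author>"
    "<chapter><title>Introduction into XML retrieval</title>"
    "<paragraph>...</paragraph></chapter>"
    "</book>")

# //title: any title, at any depth
print([t.text for t in book.xpath("//title")])

# chapter/title: title as a direct sub-component of a chapter
print([t.text for t in book.xpath("chapter/title")])

# chapter//title: title anywhere below a chapter
print([t.text for t in book.xpath("chapter//title")])

# chapter/*: all direct sub-components of a chapter
print([e.tag for e in book.xpath("chapter/*")])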

  5. XML Retrieval vs Information Retrieval [Figure: a book about the World Wide Web decomposed into chapters, sections and subsections, illustrating the logical structure of a document] • Traditional information retrieval is about finding relevant documents, e.g. an entire book. • XML retrieval allows users to retrieve document components of varying granularity (XML elements) that are more focussed, e.g. a subsection of a book instead of the entire book. SEARCHING = QUERYING + BROWSING

  6. Querying XML documents • With respect to content only (CO) • Standard queries, but retrieving XML elements • “London tube strikes” • With respect to content and structure (CAS) • Constraints on the types of XML elements to be retrieved • E.g. “Sections of an article in the Times about congestion charges” • E.g. “Articles that contain sections about congestion charges in London, and that contain a picture of Ken Livingstone”
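In NEXI, the query language later adopted at INEX (see slide 8), the second CAS example might be written roughly as follows; the element names (sec, fig) are assumptions about the IEEE collection and the formulation is illustrative, not taken from the slides:

//article[about(.//sec, 'congestion charges London') and about(.//fig, 'Ken Livingstone')]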

  7. Focussed Retrieval Retrieve the best XML elements according to content and structure criteria: • Shakespeare study: best entry points, which are components from which many relevant components can be reached through browsing (Kazai, Lalmas & Reid, ECIR 2003) • INEX: most specific component that satisfies the query, while being exhaustive to the query; should not return “related” elements, i.e. elements on the same path, i.e. overlapping elements

  8. INEX: INitiative for the Evaluation of XML Retrieval • Started in 2002 and runs each year from April to December, ending with a workshop at Schloss Dagstuhl, Germany • Funded by DELOS, EU network of excellence in digital libraries • Documents (~500MB): 12,107 articles in XML format from the IEEE Computer Society; 8 million elements! • INEX 2002: 60 topics; inex_eval metric • INEX 2003: 66 topics; enhanced subset of XPath; inex_eval and inex_eval_ng metrics • INEX 2004: 75 topics; subset of the 2003 XPath subset, called NEXI (Sigurbjörnsson & Trotman, INEX 2003); inex_eval, inex_eval_ng, XCG, t2i and ERR metrics

  9. INEX Topics: Content-only <title>Open standards for digital video in distance learning</title> <description>Open technologies behind media streaming in distance learning projects</description> <narrative> I am looking for articles/components discussing methodologies of digital video production and distribution that respect free access to media content through internet or via CD-ROMs or DVDs in connection to the learning process. Discussions of open versus proprietary standards of storing and sending digital video will be appreciated. </narrative> <keywords>media streaming, video streaming, audio streaming, digital video, distance learning, open standards, free access</keywords>

  10. INEX Topics: Content-and-structure <title>//article[about(.,'formal methods verify correctness aviation systems')]//sec//* [about(.,'case study application model checking theorem proving')]</title> <description>Find documents discussing formal methods to verify correctness of aviation systems. From those articles extract parts discussing a case study of using model checking or theorem proving for the verification. </description> <narrative>To be considered relevant a document must be about using formal methods to verify correctness of aviation systems, such as flight traffic control systems, airplane or helicopter parts. From those documents a section-part must be returned (I do not want the whole section, I want something smaller). That part should be about a case study of applying a model checker or a theorem prover to the verification. </narrative> <keywords>SPIN, SMV, PVS, SPARK, CWB</keywords>

  11. Content-and-structure topics: Restrictions • Returning “attribute” type elements (e.g. author, date) is not allowed: “return authors of articles containing sections on XML retrieval approaches” • The aboutness criterion must be specified, at least, in the target elements: “return all paragraphs contained in sections that discuss XML retrieval approaches” • Branches are not allowed: “return sections about XML retrieval that are contained in articles that contain paragraphs about INEX experiments” • … Are we imposing too many restrictions?

  12. Ad hoc retrieval: Tasks • Content-only (CO): aim is to decrease user effort by pointing the user to the most relevant elements (2002, 2003, 2004) • Strict content-and-structure (SCAS): retrieve relevant elements that exactly match the structure specified in the query (2002, 2003) • Vague content-and-structure (VCAS): • retrieve relevant elements that may not be the same as the target elements, but are structurally similar (2003) • retrieve relevant elements even if they do not exactly meet the structural conditions; treat the structure specification as hints as to where to look (2004)

  13. Relevance in XML retrieval [Figure: an example article tree with sections s1, s2, s3 and subsections ss1, ss2, parts of which discuss XML retrieval and XML evaluation] • A document is relevant if it “has significant and demonstrable bearing on the matter at hand”. • Common assumptions in laboratory experimentation: • Objectivity • Topicality • Binary nature • Independence

  14. Relevance in XML retrieval: INEX 2002 [Same example article tree as above] • Relevance = (0,N) (1,S) (1,L) (1,E) (2,S) (2,L) (2,E) (3,S) (3,L) (3,E) topical relevance = how much the element discusses the query: 0, 1, 2, 3 component coverage = how focused the element is on the query: N (no coverage), S (too small), L (too large), E (exact coverage) • If an element is relevant, so must be its parent element, ... • Topicality not enough • Binary nature not enough • Independence is wrong (Kazai, Lalmas, Fuhr & Gövert, JASIST 2004)

  15. Relevance in XML retrieval: INEX 2003-4 [Same example article tree as above] • Relevance = (0,0) (1,1) (1,2) (1,3) (2,1) (2,2) (2,3) (3,1) (3,2) (3,3) exhaustivity = how much the element discusses the query: 0, 1, 2, 3 specificity = how focused the element is on the query: 0, 1, 2, 3 • If an element is relevant, so must be its parent element, ... • Topicality not enough • Binary nature not enough • Independence is wrong (based on Chiaramella, Mulhem & Fourel, FERMI fetch and browse model, 1996)

  16. Relevance assessment task • Topics are assessed by the INEX participants • Pooling technique • Completeness • Rules forcing assessors to assess related elements • E.g. if an element is assessed relevant, its parent element and children elements must also be assessed • … • Consistency • Rules to enforce consistent assessments • E.g. the parent of a relevant element must also be relevant, although to a different extent • E.g. exhaustivity increases going up the tree; specificity increases going down • … (Piwowarski & Lalmas, CIKM 2004)
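A small sketch of the monotonicity these consistency rules suggest, checked over a toy assessment tree; the tree structure and the scores are invented for illustration, and this is an interpretation of the rules, not the official INEX assessment tool:

# Toy assessment tree: each element carries (exhaustivity, specificity) on 0-3.
class Element:
    def __init__(self, name, exh, spec, children=()):
        self.name, self.exh, self.spec = name, exh, spec
        self.children = list(children)

def check_consistency(elem, violations):
    """Rough check: a parent should be at least as exhaustive as its most
    exhaustive child, and no more specific than its most specific child."""
    if elem.children:
        if elem.exh < max(c.exh for c in elem.children):
            violations.append(f"{elem.name}: exhaustivity lower than a child's")
        if elem.spec > max(c.spec for c in elem.children):
            violations.append(f"{elem.name}: specificity higher than any child's")
        for c in elem.children:
            check_consistency(c, violations)
    return violations

article = Element("article", 3, 1, [
    Element("s1", 2, 3, [Element("ss1", 2, 3), Element("ss2", 0, 0)]),
    Element("s2", 1, 1),
    Element("s3", 0, 0),
])
print(check_consistency(article, []))   # [] means no rule violated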

  17. Current assessments [Screenshot: the on-line assessment interface (groups, navigation panes)] • Assessing a topic takes a week! • Average of 2 topics per participant • Duplicate assessments (12 topics) in INEX 2004 (Piwowarski, INEX 2003)

  18. Assessments: some results • Elements assessed in INEX 2003: 26% of assessments were on elements in the pool (66% in INEX 2002); 68% of highly specific elements were not in the pool • INEX 2002: 23 inconsistent assessments per topic for one rule alone • Agreement in INEX 2004: 12.19% non-zero agreement (22% at article level); 3.42% exact agreement (7% at article level); higher agreement for CAS topics (Piwowarski & Lalmas, CORIA 2004; Kazai, Masood & Lalmas, ECIR 2004)

  19. Retrieve the best XML elements according to content and structure criteria: • Most exhaustive and the most specific = (3,3) • Near misses = (3,3) (2,3) (1,3) • Near misses = (3,3) (3,2) (3,1) • Near misses = (3,3) (2,3) (1,3) (3,2) (3,1) (1,2) … • Focussed retrieval = no overlapping elements; should not retrieve both an element and its child element, …
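One simple way to enforce non-overlapping results is to keep only the highest-ranked element on each path. The sketch below assumes elements are identified by XPath-like strings; it is an illustration, not a prescribed INEX procedure:

def remove_overlap(ranked_paths):
    """Keep an element only if no already-kept element is its ancestor or
    descendant; paths are strings like '/article[1]/sec[2]/p[3]'."""
    kept = []
    for path in ranked_paths:
        overlaps = any(p == path or p.startswith(path + "/") or path.startswith(p + "/")
                       for p in kept)
        if not overlaps:
            kept.append(path)
    return kept

run = ["/article[1]/sec[2]",
       "/article[1]/sec[2]/p[1]",   # child of an already retrieved element
       "/article[1]",               # ancestor of an already retrieved element
       "/article[2]/sec[1]"]
print(remove_overlap(run))
# ['/article[1]/sec[2]', '/article[2]/sec[1]']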

  20. Measuring effectiveness • Need to consider: • Two four-graded dimensions of relevance (exh, spec) • Overlapping elements in retrieval results • Overlapping elements in the recall-base • Metrics (Kazai, INEX 2003) • inex_eval (also known as inex2002), the official INEX 2004 metric • inex_eval_ng (also known as inex2003) (Gövert, Fuhr, Kazai & Lalmas, INEX 2003) • ERR (expected ratio of relevant units) (Piwowarski & Gallinari, INEX 2003) • XCG (XML cumulated gain) (Kazai, Lalmas & de Vries, SIGIR 2004) • t2i (tolerance to irrelevance) (de Vries, Kazai & Lalmas, RIAO 2004)

  21. Two four-graded dimensions of relevance • Near misses • How to distinguish between (2,3) and (3,3), … , when evaluating retrieval results? • Several “user models” • Impatient: only reward retrieval of highly exhaustive and specific elements (3,3) • Happy as long as specific: only reward retrieval of highly specific elements (3,3), (2,3), (1,3) • … • Very patient: reward, to a different extent, the retrieval of any relevant element, i.e. everything apart from (0,0) • Use a quantisation function for each “user model”

  22. Examples of quantisation functions [Slide showed the quantisation functions for the “impatient” (strict) and “very patient” (generalised) user models]
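A sketch of what such quantisation functions look like: the strict ("impatient") function follows the definition on the previous slide, while the graded ("very patient") weights used here are an invented placeholder, not the official INEX generalised values:

def quant_strict(exh, spec):
    """'Impatient' user: only fully exhaustive and fully specific
    elements (3,3) count as relevant."""
    return 1.0 if (exh, spec) == (3, 3) else 0.0

def quant_graded(exh, spec):
    """'Very patient' user: every relevant element earns some credit.
    The weighting below is illustrative, not the official INEX function."""
    if (exh, spec) == (0, 0):
        return 0.0
    return (exh + spec) / 6.0   # a simple monotone choice in (0, 1]

for pair in [(3, 3), (2, 3), (1, 1), (0, 0)]:
    print(pair, quant_strict(*pair), quant_graded(*pair))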

  23. inex_eval • Based on precall (Raghavan, Bollmann & Jung, TOIS 1989), itself based on expected search length (Cooper, JASIS 1968) • Uses several quantisation functions • Overall performance as a simple average across quantisation functions (INEX 2004) • Reports an overlap indicator (INEX 2004) • Does not consider overlap in retrieval results • Does not consider overlap in recall-base

  24. Overlap in retrieval results Not very focussed retrieval!

  25. Overlap in retrieval results: Official INEX 2004 Effectiveness Results for CO topics
Rank | System (run) | Avg Prec | % Overlap
1. | IBM Haifa Research Lab (CO-0.5-LAREFIENMENT) | 0.1437 | 80.89
2. | IBM Haifa Research Lab (CO-0.5) | 0.1340 | 81.46
3. | University of Waterloo (Waterloo-Baseline) | 0.1267 | 76.32
4. | University of Amsterdam (UAms-CO-T-FBack) | 0.1174 | 81.85
5. | University of Waterloo (Waterloo-Expanded) | 0.1173 | 75.62
6. | Queensland University of Technology (CO_PS_Stop50K) | 0.1073 | 75.89
7. | Queensland University of Technology (CO_PS_099_049) | 0.1072 | 76.81
8. | IBM Haifa Research Lab (CO-0.5-Clustering) | 0.1043 | 81.10
9. | University of Amsterdam (UAms-CO-T) | 0.1030 | 71.96
10. | LIP6 (simple) | 0.0921 | 64.29
Not very focussed retrieval!
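One plausible way to compute such an overlap percentage for a run is the share of retrieved elements that have an ancestor or a descendant elsewhere in the same result list. The sketch below is an assumption about the definition, again using XPath-like identifiers:

def overlap_percentage(paths):
    """Share (in %) of retrieved elements that overlap, i.e. stand in an
    ancestor/descendant relationship with another retrieved element."""
    def related(a, b):
        return a != b and (a.startswith(b + "/") or b.startswith(a + "/"))
    overlapping = sum(1 for a in paths if any(related(a, b) for b in paths))
    return 100.0 * overlapping / len(paths) if paths else 0.0

run = ["/article[1]", "/article[1]/sec[1]", "/article[2]/sec[2]"]
print(round(overlap_percentage(run), 2))   # 66.67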

  26. Overlap in recall-base 100% recall is achievable only if all relevant elements are returned, which means returning overlapping elements. Contradicts the aim of focussed retrieval!

  27. Quantisations Do we need all these quantisation functions? Do we need such a complex relevance definition and assessment process? …

  28. inex_eval_ng • Exhaustivity and specificity based on the notion of an ideal concept space (Wong & Yao, TOIS 1995) upon which precision and recall are defined. • Considers overlap in retrieval results • Does not consider overlap in recall-base

  29. Quantisations: strict vs generalised (Pearson's correlation coefficient) Do we need all these quantisation functions? Do we need such a complex relevance definition and assessment process? …

  30. inex_eval vs inex_eval_ng (Pearson's correlation coefficient) Which metric to use?
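A sketch of the comparison behind such figures: compute Pearson's correlation coefficient between the scores two metrics assign to the same set of runs (the run scores below are invented for illustration):

def pearson(x, y):
    """Pearson's correlation coefficient between two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Invented scores for the same five runs under two metrics.
inex_eval_scores    = [0.144, 0.134, 0.127, 0.117, 0.092]
inex_eval_ng_scores = [0.101, 0.120, 0.095, 0.088, 0.071]
print(round(pearson(inex_eval_scores, inex_eval_ng_scores), 3))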

  31. Overlap in recall-base As for inex_eval, 100% recall is achievable only if all relevant elements are returned, which means returning overlapping elements. Contradicts the aim of focussed retrieval!

  32. XCG: XML cumulated gain • Based on the cumulated gain measures for IR (Kekäläinen and Järvelin, TOIS 2002) • Accumulates the gain obtained by retrieving elements at fixed ranks; not based on precision and recall • Requires the construction of an ideal recall-base and an associated ideal run, with which retrieval runs are compared • Considers overlap in both retrieval results and recall-base
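A sketch of the cumulated-gain idea underlying XCG: accumulate the quantised gain of the retrieved elements rank by rank, and normalise at each rank by what an ideal run would have accumulated. The gain values are invented, and the sketch omits XCG's handling of overlap and the construction of the ideal recall-base:

def cumulated_gain(gains):
    """xCG[i] = sum of the gains of the first i retrieved elements."""
    total, out = 0.0, []
    for g in gains:
        total += g
        out.append(total)
    return out

def normalised_cg(run_gains, ideal_gains):
    """nxCG[i] = xCG[i] / xCI[i]: the run compared against the ideal run."""
    xcg = cumulated_gain(run_gains)
    xci = cumulated_gain(ideal_gains)
    return [r / i for r, i in zip(xcg, xci)]

run_gains   = [1.0, 0.0, 0.5, 0.25]   # invented quantised gains per rank
ideal_gains = [1.0, 1.0, 0.5, 0.25]   # gains of an ideal ordering
print([round(v, 2) for v in normalised_cg(run_gains, ideal_gains)])
# [1.0, 0.5, 0.6, 0.64]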

  33. XCG: Top 10 INEX 2004 runs [Table: the top 10 INEX 2004 runs under XCG, with their ranks under inex_eval] Which metric to use?

  34. Conclusion and future work • XML retrieval is not just about the effective retrieval of XML documents, but also about what and how to evaluate • Is the overlap problem a real problem? • Currently: analysis of the evaluation results (QMUL, U Duisburg-Essen, LIP6/U Chile and CMU) • INEX 2004 • Interactive track • Heterogeneous collection track • Relevance feedback track • Natural language track • INEX 2005 • Sort out how to measure performance!!!! • Understand what users want and how they behave • New collection with more realistic information needs (topics) • … • Possible tracks: document mining, question-answering and multimedia

  35. Dank jullie wel! (Thank you!)
