The Index-based XXL Search Engine for Querying XML Data with Relevance Ranking

The Index-based XXL Search Engine for Querying XML Data with Relevance Ranking
Anja Theobald and Gerhard Weikum
University of the Saarland, Saarbrücken, Germany
weikum@cs.uni-sb.de
http://www-dbs.cs.uni-sb.de

Conclusion
Problem:
• diversity of Web / Intranet data
• despite XML, global schema is a myth
• users are swamped with results or are looking for needles in haystacks
Our contribution:
• combine XML querying with relevance ranking
• demonstrate efficiency and search result quality with XXL search engine prototype

Outline
• Adding relevance to XML
• The XXL search engine: index-based query processing
• Experiments

XML Data Graph
Semistructured data: elements, attributes, links organized as labeled graph
Example document with German element names: <Uni> ETH Zürich <Fak> Nat.-Techn. Fak. I <FR> Fachrichtung Informatik <Lehre> ... <Hauptstudium> <Vorlesung> Leistungsanalyse <Dozent> ... </> <Inhalt> ... Warteschlangen ... </> <Lit href=springer/nelson.xml > <Lit href=... > </Vorlesung> <Vorlesung> Sprachverarbeitung <Inhalt> ... Markovketten ... </> </Vorlesung> ... </Lehre> ... </FR> ... </Fak> ... </Uni>
Same structure for a second university: <Uni> Uni Stuttgart <Fak> Nat.-Techn. Fak. I <FR> Fachrichtung Informatik <Lehre> ... </Lehre> ... </FR> ... </Fak> ... </Uni>
Example document with English element names: <Uni> Uni Saarland <School> Math & Engineering <Dept> CS <Teaching> ... <GradStudies> <Course> Performance analysis <Lecturer> ... </> <Content> Queueing models .. </> <Lit href=springer/nelson.xml > <Lit href=... > </Course> <Course> Speech processing <Content> ... Markov chains... </> </Course> ... </Teaching> .. </Dept> .. </School> ... </Uni>
Data graph (figure), nodes shown: Book (Title: Stochastic ..., Author: R. Nelson, Review: ... Chapter on Markov chains); Uni: Uni Saarland; Uni: Uni Stuttgart; Uni: Uni Augsburg; School; School: CS; Dept: CS; Dozent URL=...; Inhalt; Teaching; GradStudies; Curriculum: E Commerce; Course: Speech processing; Course: Performance analysis; Course: Mobile Comm.; Weekend: Data Mining; Prerequisites: ... Markov processes; Content: ... Markov chains ...; Content: ... Queueing models; Lit

XML Querying
Regular expressions over path labels + logical conditions over element contents
Data graph (figure) for www.allunis.de/unis.xml, nodes shown: Book (Title: Stochastic ..., Author: R. Nelson, Review: ... Chapter on Markov chains); Uni: Uni Saarland; Uni: Uni Stuttgart; Uni: Uni Augsburg; School; School: CS; Dept: CS; Teaching; GradStudies; Curriculum: E Commerce; Course: Speech processing; Course: Performance analysis; Course: Mobile comm.; Weekend: Data Mining; Prerequisites: ... Markov processes; Outline: ... statistical methods for classification ...; Content: ... Markov chains ...; Content: ... Queueing models; Lit
Select U, C From www.allunis.de/unis.xml Where Uni As U And U.#.School?.#.(Inst | Dept)+ As D And D Like "%CS%" And D.#.Course As C And C.# Like "%Markov chain%"

XML Querying
Data graph (figure) for www.allunis.de/unis.xml, with the nodes Uni, School, Dept, CS, Course and the content Markov chains highlighted
Select U, C From www.allunis.de/unis.xml Where Uni As U And U.#.School?.#.(Inst | Dept)+ As D And D Like "%CS%" And D.#.Course As C And C.# Like "%Markov chain%"
Conditions of the Where clause:
• Uni As U
• U.#.School?.#.(Inst | Dept)+ As D
• D Like "%CS%"
• D.#.Course As C
• C.# Like "%Markov chain%"
Result variables: U, C
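A minimal sketch of how a path condition such as U.#.School?.#.(Inst | Dept)+ could be checked against one label path of the data graph. It is not the XXL implementation: it assumes that # stands for an arbitrary (possibly empty) sequence of element names, and the helpers compile_path and matches are hypothetical names introduced only for this illustration.

```python
import re

def compile_path(expr: str) -> re.Pattern:
    """Translate an XXL-style path expression into a regex over 'a/b/c/' strings.

    Assumptions (not taken from the slides): '#' matches any sequence of
    element names, and '?', '+', '*' apply to the step they follow.
    """
    parts = []
    for step in expr.replace(" ", "").split("."):
        if step == "#":
            parts.append(r"(?:[^/]+/)*")      # arbitrary path, possibly empty
            continue
        quant = ""
        if step[-1] in "?+*":
            step, quant = step[:-1], step[-1]
        if step.startswith("(") and step.endswith(")"):
            names = step[1:-1].split("|")     # alternation such as (Inst|Dept)
        else:
            names = [step]
        alt = "|".join(re.escape(n) for n in names)
        parts.append(f"(?:(?:{alt})/){quant}")
    return re.compile("".join(parts))

def matches(expr: str, labels: list) -> bool:
    """True if the whole label path satisfies the path expression."""
    encoded = "".join(label + "/" for label in labels)
    return compile_path(expr).fullmatch(encoded) is not None

EXPR = "Uni.#.School?.#.(Inst | Dept)+"
print(matches(EXPR, ["Uni", "School", "Dept"]))      # path ends in Dept
print(matches(EXPR, ["Uni", "School", "Teaching"]))  # path does not end in Inst/Dept
```

The content conditions (Like "%CS%") would then be evaluated on the elements bound to the variables; that step is left out here.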

Boolean vs. Ranked Retrieval
There is no global schema for Intranets or the Web, so relevance ranking of results is absolutely crucial!

Ranked Retrieval with XXL
Data graph (figure) for www.allunis.de/unis.xml, nodes shown: Book (Title: Stochastic ..., Author: R. Nelson, Review: ... Chapter on Markov chains); Uni: Uni Saarland; Uni: Uni Stuttgart; Uni: Uni Augsburg; School; School: CS; Dept: CS; Teaching; GradStudies; Curriculum: E Commerce; Course: Speech processing; Course: Performance analysis; Course: Mobile comm.; Weekend: Data Mining; Prerequisites: ... Markov processes; Outline: ... statistical methods for classification ...; Content: ... Markov chains ...; Content: ... Queueing models; Lit
Select U, C From www.allunis.de/unis.xml Where Uni As U And U.# As D And D ~~ "CS" And D.#.~Course As C And C.# ~~ "Markov chain"

Ranked Retrieval with XXL
Result ranking of XML data based on semantic similarity
Data graph (figure) for www.allunis.de/unis.xml, nodes shown: Book (Title: Stochastic ..., Author: R. Nelson, Review: ... Chapter on Markov chains); Uni: Uni Saarland; Uni: Uni Stuttgart; Uni: Uni Augsburg; School; School: CS; Dept: CS; Dozent URL=...; Inhalt; Teaching; GradStudies; Curriculum: E Commerce; Course: Speech processing; Course: Performance analysis; Course: Mobile comm.; Weekend: Data Mining; Prerequisites: ... Markov processes; Outline: ... statistical methods for classification ...; Content: ... Markov chains ...; Content: ... Queueing models; Lit
Select U, C From www.allunis.de/unis.xml Where Uni As U And U.# As D And D ~~ "Computer Science" And D.#.~Course As C And C.# ~~ "Markov chain"

Outline
• Adding relevance to XML
• The XXL search engine: index-based query processing
• Experiments

XXL: Flexible XML Search Language
Extensible, simple core language
• Where clause: conjunction of regular path expressions with binding of variables
• Elementary conditions on element/attribute names and contents
Select F, D, S From www.allunis.de/unis.xml Where Uni.#.School?.#.(Inst|Dept) As F And F.#.Lecturer As D And F.#.Student As S And D.Name = S.Name And D.Area Like "%XML%"
Semantic similarity conditions on names and contents:
... F.#.~Lecturer As D And D.~Area ~~ "XML"
Based on
• tf*idf similarity of contents,
• ontological similarity of names,
• probabilistic combination of conditions
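The slides name tf*idf as the basis of content similarity but do not give the formula. The sketch below uses one common variant (relative term frequency times log(1 + N/df), normalized to [0, 1]) purely as an assumption. The function tfidf_scores and the element contents are hypothetical; only the element ids id46, id49 and id53 appear in the deck.

```python
import math
from collections import Counter

def tfidf_scores(term: str, contents: dict) -> dict:
    """Score each element's content for one query term, scaled to [0, 1].

    Assumed weighting (not from the slides): tf = relative frequency of the
    term in the element, idf = log(1 + N / df).
    """
    term = term.lower()
    tokenized = {eid: text.lower().split() for eid, text in contents.items()}
    df = sum(1 for tokens in tokenized.values() if term in tokens)
    if df == 0:
        return {eid: 0.0 for eid in contents}
    idf = math.log(1 + len(contents) / df)
    raw = {
        eid: Counter(tokens)[term] / max(len(tokens), 1) * idf
        for eid, tokens in tokenized.items()
    }
    top = max(raw.values())
    return {eid: value / top for eid, value in raw.items()}

# Hypothetical element contents, keyed by element id.
contents = {
    "id46": "XML query languages and XML indexing",
    "id49": "relational query processing",
    "id53": "XML",
}
scores = tfidf_scores("XML", contents)
for eid in sorted(scores, key=scores.get, reverse=True):
    print(eid, round(scores[eid], 3))
```

Scaling to [0, 1] lets such a score be multiplied with other condition scores, in the spirit of the probabilistic combination mentioned on the slide.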

XXL Result Ranking
Query: Where Uni.#.School?.#.(Inst|Dept)+ As F And F.#.~Lecturer As D And D.~Area ~~ "XML"
Data graph and result graph (figure), nodes shown: Uni: UniSaarland; Dept: CS; Dept: Math; Prof: GW; Teaching; Project: IR for semistruct. data; Project: Digital libraries; Course: IR; Seminar: XML
Scores along the result path: 1.0, 1.0, 0.9, 0.8, 0.6
Relevance score: 0.432 = 1.0 * 1.0 * 0.9 * 0.8 * 0.6
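The arithmetic on this slide can be reproduced in a few lines. The sketch multiplies the per-node scores of one result path, as in the slide's equation; the function name path_relevance is a hypothetical helper, and treating the scores as independent probabilities is an assumption.

```python
import math

def path_relevance(scores):
    """Combine per-node scores of one result path by multiplication."""
    return math.prod(scores)

# Scores taken from the slide: 1.0 * 1.0 * 0.9 * 0.8 * 0.6
relevance = path_relevance([1.0, 1.0, 0.9, 0.8, 0.6])
print(round(relevance, 3))
```

A single weak match lowers the whole path score, so exact matches (score 1.0) never reduce the relevance of a result.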

XXL Search Engine
Architecture (figure): WWW, XXL applet, XXL servlets, Query processor, Path indexer, Content indexer, Ontology
• Query decomposition into index-supported subexpressions
• wide range of optimizations
Example: Select ... Where Uni.#.(Inst|Dept) As F And F ~~ "Computer Science" And F.#.~Course.# ~~ "Markov Chains"
Subexpressions:
• Uni.#.(Inst|Dept) As F
• F ~~ "Computer Science"
• F.#.~Course.# ~~ "Markov Chains"
• F.#.~Seminar.# ~~ "Markov Chains"

Index Structures
Element Path Index: materializes all (parent, child) element name pairs and dynamically checks transitive connectivity
Uni, {id1, {<School, {id13, id14}>, <Prof, {id111, id117, id119}>}, id2, {<Prof, {id15}>} }
School, {id13, {<Dean, {id27}>, <Dept, {id31, id32, id33}>}, id14, { ... } }
Element Content Index: precomputes all term occurrences in element contents, with frequency statistics
Engineering, idf=..., {<id79, tf=...>, <id85, tf=...>}
XML, idf=..., {<id46, tf=...>, <id49, tf=...>, <id53, tf=...>}
Element Ontology Index: contains synonyms, hypernyms, and hyponyms of element names, and "semantic" distances
Course, {<Seminar, 0.9>, <Project, 0.7>}, {<Teaching, 0.9>}, {<Telecourse, 0.9>, <Video lecture, 0.7>, <Meditation, 0.1>}
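A possible in-memory shape for the element path index, using the entries shown on this slide, together with a dynamic check of transitive connectivity. The nested-dictionary layout, the breadth-first search, and the names children and is_connected are assumptions made for illustration; the slide does not say how XXL stores or traverses the index. The entry for id14 is elided on the slide and is therefore left out.

```python
from collections import deque

# element name -> element id -> child element name -> child ids
path_index = {
    "Uni": {
        "id1": {"School": ["id13", "id14"], "Prof": ["id111", "id117", "id119"]},
        "id2": {"Prof": ["id15"]},
    },
    "School": {
        "id13": {"Dean": ["id27"], "Dept": ["id31", "id32", "id33"]},
        # id14 is elided on the slide, so no entry is given here
    },
}

def children(elem_id):
    """Return the ids of all direct children of an element."""
    for by_id in path_index.values():
        if elem_id in by_id:
            return [c for ids in by_id[elem_id].values() for c in ids]
    return []

def is_connected(ancestor, descendant):
    """Check transitive connectivity at query time with a breadth-first search."""
    queue, seen = deque([ancestor]), {ancestor}
    while queue:
        node = queue.popleft()
        for child in children(node):
            if child == descendant:
                return True
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return False

print(is_connected("id1", "id31"))  # id1 -> id13 -> id31
print(is_connected("id2", "id31"))  # id2 only has the child id15
```

Storing only (parent, child) pairs keeps the index small; the price is a traversal whenever a # wildcard asks for descendants at arbitrary depth.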

Query Decomposition & Evaluation
• decompose query into subqueries
• choose global evaluation order of subqueries
• represent subquery as NFSA
• for each subquery choose local evaluation strategy (top-down or bottom-up)
• evaluate subexpressions using indexes
• compute subquery result paths with relevance scores
• combine result paths into result graph
Example query: Uni.#.(Inst|Dept)+ As F And F ~~ "Computer Science" And F.#.~Course.# ~~ "Markov Chains"
Example of subquery NFSA: Uni.#.(Inst|Dept)+ (figure: automaton with transitions labeled Uni, %, Inst, Dept)
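The NFSA for the subquery Uni.#.(Inst|Dept)+ can be sketched as a small transition table that is simulated over a label path. The state numbering and the function accepts are assumptions; the slide only shows the transition labels Uni, %, Inst and Dept. The % transition is read here as "any element name", which models the # wildcard.

```python
ANY = "%"  # wildcard transition, matching any element name (assumed meaning)

# (state, label) -> set of successor states
TRANSITIONS = {
    (0, "Uni"): {1},
    (1, ANY): {1},      # loop for the '#' wildcard
    (1, "Inst"): {2},
    (1, "Dept"): {2},
    (2, "Inst"): {2},   # '+' allows further Inst/Dept steps
    (2, "Dept"): {2},
}
ACCEPTING = {2}

def accepts(labels):
    """Simulate the nondeterministic automaton by tracking a set of states."""
    states = {0}
    for label in labels:
        successors = set()
        for state in states:
            successors |= TRANSITIONS.get((state, label), set())
            successors |= TRANSITIONS.get((state, ANY), set())
        states = successors
        if not states:
            return False
    return bool(states & ACCEPTING)

print(accepts(["Uni", "School", "Inst", "Dept"]))
print(accepts(["Uni", "Dept", "Course"]))
```

Run from the start state over paths beginning at a Uni element, this corresponds to a top-down strategy; a bottom-up strategy would start from Inst or Dept elements and follow the transitions in reverse.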

The Role of Ontologies
Observation: WWW / Intranet information becomes better searchable when it is more explicitly structured and canonically annotated
<Uni> Univ. Saarland <School> Engineering <Dept> Computer Science <Faculty> Prof. Dr. GW <Project> Semistructured Data ... XML</> ...
Concept graph (figure), nodes shown: University, Journal, Dept, Conference, Institute, Prof, Publication, Course, Research, Teaching, Seminar, Project
• $\forall c\,(\mathrm{Course}(c) \Rightarrow \exists s\,((\mathrm{Dept}(s) \lor \mathrm{Inst}(s)) \land \mathrm{Curriculum}(c,s)))$
"Poor man's ontology": graph of concepts capturing hypernym/hyponym relationships (e.g., from WordNet), enabling quantitative reasoning ("semantic similarity" measures)
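One way to turn a concept graph into a quantitative similarity measure is sketched below. It is an assumption, not the measure used by XXL: the similarity of two names is taken as the best product of edge weights along any path between them. The edge weights reuse the Course entries from the ontology index slide; build_graph and similarity are hypothetical helper names.

```python
import heapq

# Weighted concept edges, taken from the Course entry of the ontology index.
EDGES = {
    ("Course", "Seminar"): 0.9,
    ("Course", "Project"): 0.7,
    ("Course", "Teaching"): 0.9,
}

def build_graph(edges):
    """Build an undirected adjacency map from weighted concept pairs."""
    graph = {}
    for (a, b), weight in edges.items():
        graph.setdefault(a, {})[b] = weight
        graph.setdefault(b, {})[a] = weight
    return graph

def similarity(graph, source, target):
    """Best product of edge weights on any path (Dijkstra-style search).

    Valid because all weights lie in (0, 1], so products never increase.
    """
    best = {source: 1.0}
    heap = [(-1.0, source)]
    while heap:
        negative, node = heapq.heappop(heap)
        score = -negative
        if node == target:
            return score
        if score < best.get(node, 0.0):
            continue  # stale heap entry
        for neighbor, weight in graph.get(node, {}).items():
            candidate = score * weight
            if candidate > best.get(neighbor, 0.0):
                best[neighbor] = candidate
                heapq.heappush(heap, (-candidate, neighbor))
    return 0.0

graph = build_graph(EDGES)
print(similarity(graph, "Course", "Seminar"))
print(round(similarity(graph, "Seminar", "Teaching"), 2))
```

A condition such as ~Course could then accept an element named Seminar with the returned similarity as its score, instead of rejecting it outright.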

Outline
• Adding relevance to XML
• The XXL search engine: index-based query processing
• Experiments

Example Data

Example Query SELECT * FROM INDEX WHERE ~drama.#.scene AS C AND C.speech AS S AND (S.speaker ~ "Woman") AND S.line AS L AND (L.CONTENT ~ "leader") AND C.speech AS M AND (M.speaker = "MACBETH")

Example Ontology thane – (a feudal lord or baron in Scotland) => lord, noble, nobleman – (a titled peer of the realm) => male aristocrat – (a man who is an aristocrat) => leader – (a person who rules or guides or inspires others)

Example Ontology
woman, adult female -- (an adult female person)
=> amazon, virago -- (a large strong and aggressive woman)
=> donna -- (an Italian woman of rank)
=> geisha, geisha girl -- (...)
=> lady -- (a polite name for any woman)
...
=> wife -- (a married woman, a man's partner in marriage)
=> witch -- (a being, usually female, imagined to have special powers derived from the devil)

Example Results Relevance = 0.0070400005 <scene> <speech> <speaker> Second Witch </speaker> <line> All hail, Macbeth, hail to thee, thane of Cawdor! </line> </speech> <speech> <speaker> MACBETH </speaker> <line> ... </line> </speech> </scene>

XXL Runtime Measurements
Test data: 100 XML documents with a total of 240 000 elements (ot.xml, nt.xml, ..., hamlet.xml, macbeth.xml, ..., SigmodRecord.xml)
Q1: Select * From Index Where #.publication AS A And A.~headline ~~ "XML" And A.author% AS B
Q2: Select * From Index Where #.play AS A And A.#.personae AS B And B.~figure ~~ "King" And B.title AS C
Results (#results; top-down; bottom-up; w/ optimization; subquery order with strategy, bu = bottom-up, td = top-down):
Q1: 131; 14.3 sec; 694 sec; 2.68 sec (incl. 0.37 sec); 2bu 1bu 3td
Q2: 58; 8.5 sec; 3.7 sec; 4.64 sec (incl. 0.33 sec); 1bu 2td 3td 4td

Conclusion
Research avenue: explore and leverage synergies between XML (querying), (relevance-ranking) IR, (domain-specific or personal) ontologies, and machine learning (for classification, annotation, etc.)
Goal: for every search, should be able to find, in one day (computer time) and with < 1 min intellectual effort, the results that the best human experts can find with infinite time
• pursued in CLASSIX project (joint DFG project with Norbert Fuhr's group in Dortmund)
