Ranked Information Retrieval on XML Data

Seminar “Informationsorganisation und -suche mit XML” (Information Organisation and Search with XML)
Dr. Ralf Schenkel, summer semester (SS) 2003, Saarland University
8 July 2003
Bernadette Blum, Christian Nicolaus, Markus Uhl

Outline
1. Introduction to Information Retrieval
2. Information Retrieval on XML Data
3. Approaches
  • ELIXIR
    • The ELIXIR language
    • The ELIXIR query processing algorithm
    • Experiments, Conclusion
  • XRANK
    • Data model
    • Ranking function
    • Data structures and algorithms
    • Experiments
4. Conclusion

1. Introduction to Information Retrieval
• Definition:
  • Information Retrieval (IR) is the technology for searching collections (corpora, intranets, the Web) of weakly structured documents: text, HTML, XML, ...
  • search engines, digital libraries, similarity search on scientific data
• Vector space model (text analysis):
  • based on word occurrence frequency
  • documents and queries are vectors
  • result ranking based on a similarity metric in the vector space

1. Introduction to Information Retrieval (II)
• Link analysis (structure analysis):
  • weights documents
  • improves result ranking
PageRank approach (I):
• the Web as a directed graph G
• “random walk” of a Web surfer
• follow hyperlinks with probability (1-ε)
• “random jump” with probability ε

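1. Introduction to Information Retrieval (III)
PageRank approach (II):
(Figure: a document q in a graph of 5 documents. Each “random jump” edge carries probability ε/5; each of the 3 hyperlinks leaving a document carries probability (1-ε)/3.)
$$p(q) = \underbrace{\frac{\varepsilon}{n}}_{\text{“random jump”}} + \underbrace{(1-\varepsilon) \sum_{(d,q) \in G} \frac{p(d)}{\mathrm{out}(d)}}_{\text{hyperlinks}}$$
where n is the number of documents and out(d) is the number of hyperlinks leaving document d.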
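The random-surfer model can be tried out with a few lines of power iteration. This is a minimal sketch, not the presenters' code: the function name `pagerank`, the parameter `eps` (the random-jump probability ε), the example graph, and the uniform treatment of documents without outgoing links are all assumptions made for illustration.

```python
def pagerank(out_links, eps=0.15, iterations=50):
    """Iterate p(q) = eps/n + (1-eps) * sum over links (d,q) of p(d)/out(d)."""
    nodes = list(out_links)
    n = len(nodes)
    p = {d: 1.0 / n for d in nodes}          # start with a uniform distribution
    for _ in range(iterations):
        nxt = {d: eps / n for d in nodes}    # "random jump" share
        for d, targets in out_links.items():
            if targets:
                share = (1 - eps) * p[d] / len(targets)
                for q in targets:
                    nxt[q] += share          # hyperlink share
            else:
                # Assumption: a document without links spreads its weight uniformly.
                for q in nodes:
                    nxt[q] += (1 - eps) * p[d] / n
        p = nxt
    return p

# Hypothetical 5-document graph; "a" has 3 outgoing hyperlinks.
graph = {"a": ["b", "c", "d"], "b": ["a"], "c": ["a"], "d": ["a", "e"], "e": ["a"]}
ranks = pagerank(graph)
print(sorted(ranks, key=ranks.get, reverse=True))
```

Because every iteration redistributes the full probability mass, the values keep summing to 1, and the document with the most incoming links ends up ranked first.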

2. Information Retrieval on XML Data
• XML: standard for the exchange of structured data and documents
• existing query languages (e.g. XML-QL, Quilt, XQL, … → XQuery)
  • no ranked or weighted results based on textual similarity
  • but extensions exist (XXL, XIRQL, …)
2 Approaches:
• ELIXIR: SQL-like approach
• XRANK: keyword-based approach

3.1 ELIXIR
• ELIXIR = “expressive and efficient language for XML information retrieval”
• extension of XML-QL: similarity operator “~”
• “~” is computed by WHIRL
• returns the best r answers

ELIXIR – The ELIXIR language
• Syntax: XML-QL syntax (SQL-like)
  CONSTRUCT <item>$b</>
  WHERE <items.book year=$yb>$b</> in "db.xml",
        <items.cd>$c</> in "db.xml",
        $yb > 1990,
        $b ~ $c.
  • CONSTRUCT clause: output format
  • WHERE clause: pattern statements + predicates
  • $yb > 1990: boolean operators
  • $b ~ $c: ELIXIR’s similarity operator
• similarity calculation even between 2 variables (→ expressiveness)
• no nested queries

ELIXIR – The ELIXIR language (II)
WHIRL (I):
• Word-based Heterogeneous Information Retrieval Logic
• extends DATALOG with “~”
• only relational data
• efficiently supports ranked IR
• Syntax (Horn clause): conjunction of relational predicates
  output($y, $a, $t) :- book($y, $a, $t), $y > 1950, $t ~ $a.
  • output($y, $a, $t): output relation
  • book($y, $a, $t): input relation
  • $y > 1950: boolean operator
  • $t ~ $a: similarity operator

ELIXIR – The ELIXIR language (III)
WHIRL (II):
• Similarity computation “~”:
  • standard IR term vector techniques
  • weighting terms (TF-IDF values)
  • cosine measure:
    $$\mathrm{sim}(d, d') = \frac{\sum_{t \in V} d_t \, d'_t}{\lVert d \rVert \, \lVert d' \rVert}$$
    (V: vocabulary of distinct terms; terms t ∈ V; documents d, d' ∈ ℝ^|V|)
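To make the term-vector idea concrete, the following sketch computes TF-IDF vectors and their cosine similarity for a handful of short texts. It is an illustration under stated assumptions, not WHIRL's implementation: the whitespace tokenizer, the smoothed weight log(1 + tf) · log(1 + N/df), and the names `tfidf_vectors` and `cosine` are all choices made here.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one unit-length sparse TF-IDF vector (dict term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    df = Counter(t for tokens in tokenized for t in set(tokens))  # document frequency
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Assumed weighting: log(1 + tf) * log(1 + N/df), smoothed so weights stay positive.
        vec = {t: math.log(1 + tf[t]) * math.log(1 + n / df[t]) for t in tf}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else {})
    return vectors

def cosine(u, v):
    """Cosine measure; the vectors are already normalized, so a dot product suffices."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

docs = ["Traditional Ukrainian cookery", "Ukrainian folk music", "Shooting Elvis"]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

Texts that share a term get a positive score, texts without any common term score 0, and every text has similarity 1 with itself.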

ELIXIR – The ELIXIR query processing algorithm
Example (naïve approach): XML-QL query Q2
  <q2> {
    CONSTRUCT <tuple><b>$b</><c>$c</></>
    WHERE <items.book>$b</> in "db.xml",
          <items.cd>$c</> in "db.xml"
  } </>
→ full cross product! Similarity computation for every tuple ($b, $c)

ELIXIR – The ELIXIR query processing algorithm (II)
Problem: full cross product!

ELIXIR – The ELIXIR query processing algorithm (III)
Solution:
• do not simply map the full XML data into the relational model
• invoke WHIRL as a “subroutine” (→ efficiency)
Avoid generating the full cross product!

ELIXIR – The ELIXIR query processing algorithm (IV)
Start: query Q1. 3 stages with intermediate queries Q2, Q3, Q4:
• 1. Partition into a set, Q21 … Q2N, of XML-QL queries
  • avoids generating the full cross product
  • ordinary predicates
  • 2 pattern statements with variables that are compared with a similarity predicate => distinct Q2j queries
• 2. WHIRL query Q3
  • similarity predicates
  • ordered table of the r best answers
• 3. XML-QL query Q4
  • transformation of Q3’s output
  • into the XML structure specified by Q1

ELIXIR – The ELIXIR query processing algorithm (V)
Example (Step I – Partition into Q2j queries):
XML-QL query Q21:
  <q21> {
    CONSTRUCT <tuple><b>$b</></>
    WHERE <items.book>$b</> in "db.xml"
  } </>
Result of Q21:
  <q21><tuple><b>Traditional Ukrainian cookery</></>
       <tuple><b>Being and nothingness</></>
       <tuple><b>Shooting Elvis</></></>
XML-QL query Q22:
  <q22> {
    CONSTRUCT <tuple><c>$c</></>
    WHERE <items.cd>$c</> in "db.xml"
  } </>
Result of Q22:
  <q22><tuple><c>Ukrainian folk music</></>
       <tuple><c>Being there</></>
       <tuple><c>Milk cow blues</></></>
Avoid generating the full cross product!

ELIXIR – The ELIXIR query processing algorithm (VI)
Example (Step II – WHIRL query Q3):
Input q21:
  <q21><tuple><b>Traditional Ukrainian cookery</></>
       <tuple><b>Being and nothingness</></>
       <tuple><b>Shooting Elvis</></></>
Input q22:
  <q22><tuple><c>Ukrainian folk music</></>
       <tuple><c>Being there</></>
       <tuple><c>Milk cow blues</></></>
WHIRL query Q3:
  q3($b) :- q21($b), q22($c), $b ~ $c.
Output q3:
  <q3><tuple><b>Traditional Ukrainian cookery</></>
      <tuple><b>Being and nothingness</></></>

ELIXIR – The ELIXIR query processing algorithm (VII)
Example (Step III – XML-QL query Q4):
Input q3:
  <q3><tuple><b>Traditional Ukrainian cookery</></>
      <tuple><b>Being and nothingness</></></>
XML-QL query Q4:
  <results> {
    CONSTRUCT <item>$b</>
    WHERE <q3.tuple><b>$b</></> in "q3.xml"
  } </>
Final XML output:
  <results><item>Traditional Ukrainian cookery</>
           <item>Being and nothingness</></>
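The three stages can be mimicked in a few lines on the book/CD example. This is only a sketch under loud assumptions: `DB` is a hypothetical in-memory stand-in for "db.xml", plain word overlap (Jaccard) is swapped in for WHIRL's TF-IDF cosine, a small inverted index stands in for WHIRL's search so that only pairs sharing a word are scored, and all function names are invented here.

```python
from collections import defaultdict
from xml.sax.saxutils import escape
import heapq

# Hypothetical in-memory stand-in for "db.xml".
DB = {
    "book": ["Traditional Ukrainian cookery", "Being and nothingness", "Shooting Elvis"],
    "cd": ["Ukrainian folk music", "Being there", "Milk cow blues"],
}

def words(text):
    return set(text.lower().split())

def stage1_partition(db):
    """Q21 / Q22: one flat relation per variable used in the similarity predicate."""
    return list(db["book"]), list(db["cd"])

def stage2_similarity_join(q21, q22, r):
    """Q3: score only pairs that share a word and keep the r best-scoring books."""
    index = defaultdict(set)                  # word -> positions in q22
    for j, c in enumerate(q22):
        for w in words(c):
            index[w].add(j)
    scored = []
    for b in q21:
        bw = words(b)
        candidates = set().union(*(index[w] for w in bw if w in index))
        for j in candidates:
            cw = words(q22[j])
            scored.append((len(bw & cw) / len(bw | cw), b))  # Jaccard stand-in for "~"
    return [b for _, b in heapq.nlargest(r, scored)]

def stage3_construct(q3):
    """Q4: wrap the ranked answers in the output structure requested by Q1."""
    return "<results>" + "".join(f"<item>{escape(b)}</item>" for b in q3) + "</results>"

q21, q22 = stage1_partition(DB)
q3 = stage2_similarity_join(q21, q22, r=2)
print(stage3_construct(q3))
```

The book without any similar CD title never enters the join result, which is the point of avoiding the full cross product. The order of the two surviving books depends on the similarity measure, so it need not match the order shown on the slide.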

ELIXIR – Experiments, Conclusion
Experiments: Total processing time …
• … depends on the details of each query and of the input data
• … increases only marginally with the number of answers r
• … increases linearly with the number of similarity join predicates
• Partitioning the initial query (Step 1) dominates (expensive parsing and traversal)

ELIXIR – Experiments, Conclusion (II)
Conclusion:
• ELIXIR extends XML-QL with IR similarity features for ranking
• similarity joins even between 2 variables (expressiveness)
• Algorithm:
  • rewrites the original ELIXIR query into a series of intermediate XML-QL and WHIRL queries
  • no full cross product, only filtered tuples of variable bindings (efficiency)
• But …
  • only non-nested queries
  • the strict three-stage approach may be suboptimal in some cases (partition)

XRANK: Ranked Keyword Search over XML Documents

Introduction
• XRANK: keyword search over XML documents
• results:
  • XML elements that contain all searched keywords
• ranking:
  • at the granularity of XML elements
  • based on hyperlink structure
• advantages:
  • user does not have to learn a query language
  • no knowledge about the structure of the XML documents is needed
  • generalized keyword search engine (both HTML and XML are possible)

Data Model
• G = (V, CE, HE): collection of XML documents
  • V: set of XML elements (tags and attributes)
  • CE: set of containment edges
  • HE: set of hyperlink edges
• (u,v) ∈ CE ⇔ v is a sub-element of u
• (u,v) ∈ HE ⇔ u contains a hyperlink to v
• contains(v,k) ⇔ v (in)directly contains the keyword k

Example: XML Graph
(Figure: example XML graph; legend: XML element, value.)

Keyword Query Results (1)
How to define the results of keyword search queries over XML documents?
• elements that contain all keywords, where no sub-element contains all keywords
∪
• elements with at least one sub-element containing all keywords & at least one sub-element containing some keywords

Ranking Elements
How to rank XML elements? ElemRank:
• extension of PageRank at the granularity of elements
• objective importance of XML elements
• based on the hyperlinked and nested structure of XML elements

ElemRank (1)
• n: # XML elements
• nc(u): # sub-elements of u
• nh(u): # outgoing hyperlinks from u
• CE⁻¹ = {(v,u) | (u,v) ∈ CE}: “reverse containment edges”
• E = HE ∪ CE ∪ CE⁻¹
(Figure: example element u with nc(u) = 3 and nh(u) = 3; legend: containment edge, reverse containment edge, hyperlink edge.)

ElemRank (2)
• α: prob. of following a hyperlink
• β: prob. of using a containment edge
• γ: prob. of using a reverse containment edge
• 1-α-β-γ: prob. of a random jump (written ε in the figure)
(Figure: example graph with n = 10 elements. Edges leaving an element with 3 hyperlinks, 3 sub-elements and 1 parent are labelled with a per-edge share over 3, over 3 and over 1 respectively, each plus ε/10; elements reachable only by a random jump are labelled ε/10. Legend: containment edge, reverse containment edge, hyperlink edge.)

ElemRank (3)
ElemRank e(v):
$$e(v) = (1-\alpha-\beta-\gamma) \cdot \frac{1}{n} + \alpha \sum_{(u,v) \in HE} \frac{e(u)}{n_h(u)} + \beta \sum_{(u,v) \in CE} \frac{e(u)}{n_c(u)} + \gamma \sum_{(u,v) \in CE^{-1}} \frac{e(u)}{1} \qquad (0 \le \alpha, \beta, \gamma \le 1)$$
The four terms correspond, in order, to: random navigation, navigation via hyperlinks, via forward containment edges, via reverse containment edges.
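The ElemRank recurrence can be evaluated by fixed-point iteration, in the same way as PageRank. The sketch below is an illustration only: the function name `elem_rank`, the parameter values for the three navigation probabilities (called `alpha`, `beta`, `gamma` here for hyperlinks, containment and reverse containment), and the three-element example graph are assumptions, not values from the slides.

```python
from collections import Counter

def elem_rank(elements, HE, CE, alpha=0.3, beta=0.3, gamma=0.2, iterations=100):
    """Iterate e(v) = (1-a-b-g)/n + a*sum_HE e(u)/nh(u) + b*sum_CE e(u)/nc(u) + g*sum_CE^-1 e(u)/1."""
    n = len(elements)
    nh = Counter(u for u, _ in HE)     # outgoing hyperlinks per element
    nc = Counter(u for u, _ in CE)     # sub-elements per element
    e = {v: 1.0 / n for v in elements}
    for _ in range(iterations):
        nxt = {v: (1 - alpha - beta - gamma) / n for v in elements}  # random navigation
        for u, v in HE:
            nxt[v] += alpha * e[u] / nh[u]     # via hyperlinks
        for u, v in CE:
            nxt[v] += beta * e[u] / nc[u]      # parent u -> sub-element v
            nxt[u] += gamma * e[v] / 1         # reverse edge: sub-element v -> parent u
        e = nxt
    return e

# Hypothetical graph: root contains a and b; a has a hyperlink to b.
elements = ["root", "a", "b"]
CE = [("root", "a"), ("root", "b")]
HE = [("a", "b")]
ranks = elem_rank(elements, HE, CE)
print(ranks)
```

In this small graph `b` ends up with a higher ElemRank than its sibling `a`, because it additionally receives weight through the hyperlink. Note that with this formula the values need not sum to 1.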

Ranking Function (1)
• ranking functions should take into account:
  • result specificity
  • hyperlinks
  • keyword proximity
• contains(v,k) ⇔ ∃ sequence (v1,v2), ..., (vn-1,vn) with v1 = v s.t. vn directly contains k
• r(v,k) = ElemRank(vn) · decay^(n-1) (0 ≤ decay ≤ 1)
  • ElemRank(vn): based on hyperlinked structure
  • decay^(n-1): result specificity

Ranking Function (2)
• m occurrences of keyword k
  • computation of r1, ..., rm
  • r*(v,k) = f(r1, ..., rm) (with accumulation function f, e.g. max or sum)
• query q consists of keywords k1, ..., kn
  • $R(v,q) = \left( \sum_{1 \le i \le n} r^*(v,k_i) \right) \cdot p(v,k_1,\ldots,k_n)$
  • p = proximity measure (keyword proximity)
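A small worked example shows how decay rewards specific results. The sketch assumes that elements are identified by tuples of Dewey components, so that the number of containment edges between a result and a keyword occurrence is the difference of the tuple lengths. The ElemRank values 75, 80 and 88 are the ones used in the index examples of this deck; `decay=0.5`, the constant proximity of 1.0, and the function names are assumptions for illustration.

```python
def keyword_rank(result_id, occurrence_id, elem_rank, decay):
    """r(v,k) = ElemRank(vn) * decay^(n-1) for one occurrence of keyword k below v."""
    if occurrence_id[:len(result_id)] != result_id:
        raise ValueError("occurrence is not contained in the result element")
    steps = len(occurrence_id) - len(result_id)   # n-1 containment edges
    return elem_rank * decay ** steps

def overall_rank(result_id, occurrences, decay=0.5, f=max, proximity=1.0):
    """R(v,q) = (sum over keywords of r*(v,k)) * p, with r* = f(r1, ..., rm)."""
    total = sum(
        f(keyword_rank(result_id, occ_id, rank, decay) for occ_id, rank in occs)
        for occs in occurrences.values()
    )
    return total * proximity   # proximity is a placeholder for p(v, k1, ..., kn)

# keyword -> [(id of the element directly containing it, ElemRank), ...]
occurrences = {
    "R.E.M.": [((0, 0, 0), 75), ((0, 1, 0), 80)],
    "Religion": [((0, 0, 2, 0), 88)],
}
cd_score = overall_rank((0, 0), {k: [o for o in v if o[0][:2] == (0, 0)]
                                 for k, v in occurrences.items()})
root_score = overall_rank((0,), occurrences)
print(cd_score, root_score)   # 59.5 31.0
```

The first CD element (0.0) scores 75 · 0.5 + 88 · 0.25 = 59.5, while the root (0) scores only 80 · 0.25 + 88 · 0.125 = 31.0, so the more specific element is ranked higher.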

<CDs>
  <CD id="1">
    <title>R.E.M. – Out Of Time</title>
    <song>
      <title>Radio Song</title>
      <time>4:12</time>
    </song>
    <song>
      <title>Losing My Religion</title>
      <time>4:26</time>
    </song>
    ...
  </CD>
  <CD id="2">
    <title>R.E.M. – Automatic For...</title>
    ...
  </CD>
  ...
</CDs>

XRANK Architecture
(Figure: the XML documents feed the ElemRank computation, which yields XML elements with ElemRanks; these are stored in index structures & algorithms. The Query Evaluator receives a keyword search query, accesses the index data, and returns a ranked result list.)

Naïve Approach
• naïve inverted list:
  • contains all XML elements that contain the keyword
    key1 → elem11, elem12, ...
    key2 → elem21, elem22, ...
    etc.
• space overhead
• spurious results
• inaccurate ranking

Dewey IDs
0 <CDs>
  0.0 <CD>
    0.0.0 <title> R.E.M. – Out Of Time
    0.0.1 <song>
      0.0.1.0 <title> Radio Song
      0.0.1.1 <time> 4:12
    0.0.2 <song>
      0.0.2.0 <title> Losing My Religion
      0.0.2.1 <time> 4:26
    ...
  0.1 <CD>
    0.1.0 <title> R.E.M. – Automatic For The People
    ...
  ...
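Dewey IDs follow directly from a pre-order traversal: each element's ID is its parent's ID extended by the element's position among its siblings. The sketch below uses Python's standard `xml.etree.ElementTree`; the function name `dewey_ids` and the shortened example document are assumptions, and only elements (not attributes) are numbered.

```python
import xml.etree.ElementTree as ET

XML = """<CDs>
  <CD id="1">
    <title>R.E.M. - Out Of Time</title>
    <song><title>Radio Song</title><time>4:12</time></song>
    <song><title>Losing My Religion</title><time>4:26</time></song>
  </CD>
  <CD id="2">
    <title>R.E.M. - Automatic For The People</title>
  </CD>
</CDs>"""

def dewey_ids(xml_text):
    """Return (Dewey ID, tag) pairs in document order; the root gets ID 0."""
    result = []

    def visit(elem, path):
        result.append((".".join(map(str, path)), elem.tag))
        for i, child in enumerate(elem):      # i = position among the siblings
            visit(child, path + (i,))

    visit(ET.fromstring(xml_text), (0,))
    return result

for dewey, tag in dewey_ids(XML):
    print(dewey, tag)
```

An element's ancestors can be read off its ID: every proper prefix of 0.0.2.0 (0.0.2, 0.0, 0) is the ID of an ancestor.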

DIL – Data Structure
• Dewey inverted list:
  • contains the Dewey IDs of all XML elements that directly contain the keyword
  • sorted by Dewey ID (ascending)

Keyword  | Dewey ID | ElemRank | position list
R.E.M.   | 0.0.0    | 75       | [0]
R.E.M.   | 0.1.0    | 80       | [0]
R.E.M.   | …        |          |
Religion | 0.0.2.0  | 88       | [2]
Religion | …        |          |
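A Dewey inverted list can be built in one pass over the document. The following sketch is a simplification under stated assumptions: it tokenizes on whitespace, looks only at the text directly at the start of each element, stores Dewey IDs as integer tuples, and leaves out the ElemRank column; `build_dil` and the example document are names chosen here.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

XML = """<CDs>
  <CD id="1">
    <title>R.E.M. - Out Of Time</title>
    <song><title>Radio Song</title><time>4:12</time></song>
    <song><title>Losing My Religion</title><time>4:26</time></song>
  </CD>
  <CD id="2">
    <title>R.E.M. - Automatic For The People</title>
  </CD>
</CDs>"""

def build_dil(xml_text):
    """Map keyword -> [(Dewey ID, position list), ...], ascending by Dewey ID."""
    dil = defaultdict(list)

    def visit(elem, dewey):
        positions = defaultdict(list)
        for pos, token in enumerate((elem.text or "").split()):
            positions[token].append(pos)      # word offsets inside this element
        for token, pos_list in positions.items():
            dil[token].append((dewey, pos_list))
        for i, child in enumerate(elem):
            visit(child, dewey + (i,))

    # Pre-order traversal visits elements in ascending Dewey order,
    # so every list is already sorted.
    visit(ET.fromstring(xml_text), (0,))
    return dict(dil)

dil = build_dil(XML)
print(dil["R.E.M."])
print(dil["Religion"])
```

Only the element that directly contains a keyword is stored; its ancestors are implied by the prefixes of the Dewey ID, which avoids the space overhead of the naïve inverted list.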

DIL – Query Processing (1)
• key idea: computation of the longest common prefix (lcp) of Dewey IDs
(Figure: Dewey stack at step 1; columns: DeweyID, rank[1], rank[2], posList[1], posList[2], pot_result.)

DIL – Query Processing (2)
(Figure: Dewey stack at steps 1 and 2, with the lcp of the two Dewey IDs marked; columns: DeweyID, rank[1], rank[2], posList[1], posList[2], pot_result.)

DIL – Query Processing (3)
(Figure: Dewey stack at steps 1 to 3, with the lcp marked at each step and the potential result { 0.0 , 0 }; columns: DeweyID, rank[1], rank[2], posList[1], posList[2], pot_result.)
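The longest common prefix of two Dewey IDs identifies the deepest element that contains both. A minimal sketch, assuming Dewey IDs are dot-separated strings; the helper names `lcp` and `deepest_common_ancestor` are chosen here and are not taken from the DIL algorithm itself, which additionally maintains the stack shown in the figures.

```python
from functools import reduce

def lcp(dewey_a, dewey_b):
    """Longest common prefix of two Dewey IDs, compared component by component."""
    prefix = []
    for x, y in zip(dewey_a.split("."), dewey_b.split(".")):
        if x != y:
            break
        prefix.append(x)
    return ".".join(prefix)

def deepest_common_ancestor(dewey_ids):
    """Deepest element containing all given elements ('' if there is none)."""
    return reduce(lcp, dewey_ids)

print(lcp("0.0.0", "0.0.2.0"))    # 0.0
print(lcp("0.1.0", "0.0.2.0"))    # 0
```

Comparing whole components rather than characters matters: the IDs 0.1 and 0.10 share only the prefix 0.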

RDIL – Data Structure
• ranked Dewey inverted list:
  • each Dewey ID in the list has a position in the B+-tree
  • B+-tree sorted by Dewey ID (ascending)
  • inverted list sorted by ElemRank (descending)
Example for keyword R.E.M.:
  B+-tree on Dewey IDs: 0.0.0, …, 0.1.0
  inverted list (ElemRank, Dewey ID): (80, 0.1.0), (75, 0.0.0), …

RDIL – Query Processing (1)
(Figure: keywords key1, key2, key3, each with a B+-tree on Dewey IDs and an inverted list entry_i1, entry_i2, entry_i3, ... sorted by ElemRank.)
lcp with Dewey ID11 ⇨ result heap

RDIL – Query Processing (2)
(Figure: keywords key1, key2, key3, each with a B+-tree on Dewey IDs and an inverted list entry_i1, entry_i2, entry_i3, ... sorted by ElemRank.)
lcp with Dewey ID21 ⇨ result heap
etc.

RDIL – Query Processing (3)
(Figure: keywords key1, key2, key3, each with a B+-tree on Dewey IDs and an inverted list entry_i1, entry_i2, entry_i3, ... sorted by ElemRank.)
max. reachable ranking ≤ ∑ ranking = threshold Ω

RDIL – Query Processing (4)
The RDIL algorithm stops if threshold Ω < lowest ElemRank in the result heap, because
  max. reachable ranking ≤ Ω < lowest ElemRank in result heap
  ⇨ max. reachable ranking < lowest ElemRank in result heap!
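The stopping rule is the one known from threshold-style top-k algorithms. The sketch below is a generic threshold algorithm, not RDIL itself: it matches plain item IDs instead of computing longest common prefixes through a B+-tree, uses the sum of the per-keyword scores as the ranking, and replaces the index lookup by a dictionary. The function name `threshold_top_k` and the example lists are assumptions.

```python
import heapq

def threshold_top_k(lists, k):
    """lists: one list per keyword of (score, item), sorted by score descending."""
    if not lists or any(not lst for lst in lists):
        return []
    lookup = [{item: score for score, item in lst} for lst in lists]  # random access
    seen, heap = set(), []                    # heap: min-heap of (total, item)
    for depth in range(max(len(lst) for lst in lists)):
        for lst in lists:                     # read the lists round-robin
            if depth >= len(lst):
                continue
            _, item = lst[depth]
            if item in seen:
                continue
            seen.add(item)
            if all(item in lk for lk in lookup):   # must match every keyword
                heapq.heappush(heap, (sum(lk[item] for lk in lookup), item))
                if len(heap) > k:
                    heapq.heappop(heap)            # drop the current lowest
        # Threshold: no unseen item can score higher than the sum of the
        # scores at the current read positions.
        omega = sum(lst[min(depth, len(lst) - 1)][0] for lst in lists)
        if len(heap) == k and omega < heap[0][0]:
            break                                  # lowest kept result beats the bound
    return sorted(heap, reverse=True)

lists = [
    [(90, "x"), (80, "y"), (10, "z")],
    [(85, "x"), (20, "z"), (5, "y")],
]
print(threshold_top_k(lists, k=1))    # [(175, 'x')]
```

For k = 1 the scan stops after two rounds: the threshold has dropped to 80 + 20 = 100, which is below the score 175 of the result already in the heap, so the remaining entries need not be read.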

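XRANK Architecture
(Figure: same architecture as before, with DIL / RDIL as the index structures: the XML documents feed the ElemRank computation, the XML elements with ElemRanks are stored in DIL / RDIL, and the Query Evaluator answers a keyword search query with a ranked result list via data access to these lists.)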

Experimental Results (1)

Experimental Results (2)

Comparison DIL - RDIL
DIL:
• inverted lists sorted by Dewey ID
• computes the longest common prefix on Dewey IDs
• extracts the minimum of all remaining Dewey IDs
• all lists are completely scanned
• outperforms RDIL if keyword correlation is low
RDIL:
• inverted lists sorted by ElemRank
• chooses the next list sequentially
• stops if a certain threshold is reached
• outperforms DIL if keyword correlation is high

Conclusion
2 Approaches:
• ELIXIR:
  • SQL-like, structure-based search
  • extends XML-QL with IR similarity features for ranking
  • ranked results based only on textual similarity (even between 2 variables)
• XRANK:
  • keyword-based search à la Google
  • ranked results based on textual similarity
  • hierarchical and hyperlinked structure
