
Ranked Information Retrieval on XML Data

Ranked Information Retrieval on XML Data. Seminar “Informationsorganisation und -suche mit XML” (Information Organization and Search with XML), Dr. Ralf Schenkel, summer semester 2003, Saarland University, 8 July 2003. Bernadette Blum, Christian Nicolaus, Markus Uhl.


Presentation Transcript


  1. Ranked Information Retrieval on XML Data
     Seminar “Informationsorganisation und -suche mit XML” (Information Organization and Search with XML), Dr. Ralf Schenkel, summer semester 2003, Saarland University, 8 July 2003
     Bernadette Blum, Christian Nicolaus, Markus Uhl

  2. Outline
     1. Introduction to Information Retrieval
     2. Information Retrieval on XML Data
     3. Approaches
        • ELIXIR
          • The ELIXIR language
          • The ELIXIR query processing algorithm
          • Experiments, Conclusion
        • XRANK
          • Data model
          • Ranking function
          • Data structures and algorithms
          • Experiments
     4. Conclusion

  3. 1. Introduction to Information Retrieval
     • Definition: Information Retrieval (IR) is the technology for searching collections (corpora, intranets, the Web) of weakly structured documents: text, HTML, XML, ...
       • search engines, digital libraries, similarity search on scientific data
     • Vector space model (text analysis):
       • based on word occurrence frequency
       • documents and queries are vectors
       • result ranking based on a similarity metric in the vector space

  4. 1. Introduction to Information Retrieval (II)
     • Link analysis (structure analysis):
       • weighting documents
       • improve result ranking
     • PageRank approach (I):
       • the Web as a directed graph G
       • “random walk” of a Web surfer
       • follow hyperlinks with probability (1 - ε)
       • “random jump” with probability ε

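  5. 1. Introduction to Information Retrieval (III)
     PageRank approach (II):
     [Figure: five documents connected by hyperlinks. Each document is reached by a “random jump” with probability ε/5; a document with three outgoing hyperlinks follows each of them with probability (1 - ε)/3.]
     PageRank of a document q, with n documents in G and out(p) outgoing hyperlinks of document p:
     $$p(q) = \varepsilon \cdot \frac{1}{n} + (1 - \varepsilon) \cdot \sum_{(p,q) \in G} \frac{p(p)}{\mathrm{out}(p)}$$
     The first term is the probability of a “random jump”, the second the probability of following hyperlinks.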
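The random-surfer model can be turned into a short power iteration. The following is a minimal sketch, not the presenters' code: it writes ε for the random-jump probability, and the function name `pagerank`, the five-document graph, the value `eps=0.15` and the handling of pages without out-links are all assumptions made for illustration.

```python
def pagerank(out_links, eps=0.15, iterations=50):
    """Power iteration for p(q) = eps/n + (1 - eps) * sum of p(p)/out(p) over links p -> q."""
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Every document receives the uniform "random jump" share.
        new_rank = {v: eps / n for v in nodes}
        for p, targets in out_links.items():
            if targets:
                share = (1.0 - eps) * rank[p] / len(targets)
                for q in targets:
                    new_rank[q] += share
            else:
                # Assumption: a page without out-links jumps uniformly at random.
                for q in nodes:
                    new_rank[q] += (1.0 - eps) * rank[p] / n
        rank = new_rank
    return rank


# Hypothetical five-document web graph: document -> documents it links to.
graph = {"a": ["b", "c", "q"], "b": ["q"], "c": ["a"], "d": ["q"], "q": ["a"]}
ranks = pagerank(graph)
for doc in sorted(ranks, key=ranks.get, reverse=True):
    print(doc, round(ranks[doc], 3))
```

Document `d` has no incoming hyperlinks, so its rank is exactly the random-jump share ε/n, while `q` collects rank from three documents.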

  6. 2. Information Retrieval on XML Data
     • XML: standard for the exchange of structured data and documents
     • existing query languages (e.g. XML-QL, Quilt, XQL, … → XQuery)
       • no ranked or weighted results based on textual similarity
       • but extensions exist (XXL, XIRQL, …)
     • 2 approaches:
       • ELIXIR: SQL-like approach
       • XRANK: keyword-based approach

  7. 3.1 ELIXIR
     • ELIXIR = “expressive and efficient language for XML information retrieval”
     • extension of XML-QL: similarity operator “~”
     • “~” is computed by WHIRL
     • returns the best r answers

  8. ELIXIR – The ELIXIR language
     • Syntax: XML-QL syntax (SQL-like)

         CONSTRUCT <item>$b</>
         WHERE <items.book year=$yb>$b</> in "db.xml",
               <items.cd>$c</> in "db.xml",
               $yb > 1990,
               $b ~ $c.

       • CONSTRUCT clause: output format
       • WHERE clause: pattern statements + predicates
       • $yb > 1990: boolean operators
       • $b ~ $c: ELIXIR’s similarity operator
     • similarity calculation even between 2 variables (→ expressiveness)
     • no nested queries

  9. ELIXIR – The ELIXIR language (II)
     WHIRL (I):
     • Word-based Heterogeneous Information Retrieval Logic
     • extends DATALOG with “~”
     • only relational data
     • efficiently supports ranked IR
     • Syntax (Horn clause): conjunction of relational predicates

         output($y, $a, $t) :- book($y, $a, $t), $y > 1950, $t ~ $a.

       • output(...): output relation
       • book(...): input relation
       • $y > 1950: boolean operator
       • $t ~ $a: similarity operator

  10. ELIXIR – The ELIXIR language (III)
      WHIRL (II):
      • Similarity computation “~”:
        • standard IR term vector techniques
        • weighting terms (TF-IDF values)
        • cosine measure:
          $$\mathrm{sim}(d, d') = \frac{\sum_{t \in V} d_t \cdot d'_t}{\lVert d \rVert \cdot \lVert d' \rVert}$$
          (V: vocabulary of distinct terms; terms t ∈ V; documents d, d' ∈ R^{|V|})
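To make the similarity computation concrete, here is a minimal sketch of TF-IDF weighting followed by the cosine measure. It is an illustration under assumptions, not WHIRL's implementation: the weighting formula log(1 + tf) · log(n / df) is one common TF-IDF variant chosen here, tokenization is plain whitespace splitting, and the function names are hypothetical. The titles are taken from the example data used later in the slides.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each document to a sparse {term: weight} vector with TF-IDF weights."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Assumed weighting: log(1 + tf) * log(n / df).
        vectors.append({t: math.log(1 + c) * math.log(n / df[t]) for t, c in tf.items()})
    return vectors


def cosine(d1, d2):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(w * d2.get(t, 0.0) for t, w in d1.items())
    norm1 = math.sqrt(sum(w * w for w in d1.values()))
    norm2 = math.sqrt(sum(w * w for w in d2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0
    return dot / (norm1 * norm2)


titles = ["Traditional Ukrainian cookery", "Ukrainian folk music", "Shooting Elvis"]
vecs = tfidf_vectors(titles)
print(cosine(vecs[0], vecs[1]))  # shares the term "ukrainian"
print(cosine(vecs[0], vecs[2]))  # no shared term
```

Rare terms get a larger weight than terms that occur in many documents, so a match on a rare term contributes more to the similarity score.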

  11. ELIXIR – The ELIXIR query processing algorithm
      Example (naïve approach): XML-QL query Q2

          <q2> {
            CONSTRUCT <tuple><b>$b</><c>$c</></>
            WHERE <items.book>$b</> in "db.xml",
                  <items.cd>$c</> in "db.xml"
          } </>

      • full cross product!
      • similarity computation for every tuple ($b, $c)

  12. ELIXIR – The ELIXIR query processing algorithm (II)
      Problem: full cross product!

  13. ELIXIR – The ELIXIR query processing algorithm (III)
      Solution:
      • do not simply map the full XML data into the relational model
      • invoke WHIRL as a “subroutine” (→ efficiency)
      Avoid generating the full cross product!

  14. ELIXIR – The ELIXIR query processing algorithm (IV)
      Start: query Q1. 3 stages with intermediate queries Q2, Q3, Q4:
      1. Partition into a set Q21 … Q2N of XML-QL queries
         • avoid generating the full cross product
         • ordinary predicates
         • 2 pattern statements with variables that are compared by a similarity predicate => distinct Q2j queries
      2. WHIRL query Q3
         • similarity predicates
         • ordered table of the r best answers
      3. XML-QL query Q4
         • transformation of Q3’s output
         • XML structure as specified by Q1

  15. ELIXIR – The ELIXIR query processing algorithm (V)
      Example (Step I – partition into Q2n queries):
      XML-QL query Q21:

          <q21> { CONSTRUCT <tuple><b>$b</></> WHERE <items.book>$b</> in "db.xml" } </>

      Output of Q21:

          <q21><tuple><b>Traditional Ukrainian cookery</></>
               <tuple><b>Being and nothingness</></>
               <tuple><b>Shooting Elvis</></></>

      XML-QL query Q22:

          <q22> { CONSTRUCT <tuple><c>$c</></> WHERE <items.cd>$c</> in "db.xml" } </>

      Output of Q22:

          <q22><tuple><c>Ukrainian folk music</></>
               <tuple><c>Being there</></>
               <tuple><c>Milk cow blues</></></>

      Avoid generating the full cross product!

  16. ELIXIR – The ELIXIR query processing algorithm (VI)
      Example (Step II – WHIRL query Q3):
      Inputs:

          <q21><tuple><b>Traditional Ukrainian cookery</></>
               <tuple><b>Being and nothingness</></>
               <tuple><b>Shooting Elvis</></></>

          <q22><tuple><c>Ukrainian folk music</></>
               <tuple><c>Being there</></>
               <tuple><c>Milk cow blues</></></>

      WHIRL query Q3:

          q3($b) :- q21($b), q22($c), $b ~ $c.

      Output of Q3:

          <q3><tuple><b>Traditional Ukrainian cookery</></>
              <tuple><b>Being and nothingness</></></>

  17. ELIXIR – The ELIXIR query processing algorithm (VII)
      Example (Step III – XML-QL query Q4):
      Input:

          <q3><tuple><b>Traditional Ukrainian cookery</></>
              <tuple><b>Being and nothingness</></></>

      XML-QL query Q4:

          <results> { CONSTRUCT <item>$b</> WHERE <q3.tuple><b>$b</></> in "q3.xml" } </>

      Final XML output:

          <results><item>Traditional Ukrainian cookery</>
                   <item>Being and nothingness</></>
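The three stages of the example can be mimicked in a few lines of Python. This is only a rough sketch of the data flow under several assumptions: the XML database is replaced by a hypothetical dictionary `db`, the similarity score is a plain term-frequency cosine instead of WHIRL's TF-IDF score, and stage 2 compares all pairs by brute force, whereas the point of calling WHIRL is to find the r best answers without doing so. All function names are made up for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity of raw term-frequency vectors (stand-in for WHIRL's score)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(x * x for x in va.values())) * math.sqrt(sum(x * x for x in vb.values()))
    return dot / norm if norm else 0.0


# Hypothetical stand-in for "db.xml".
db = {
    "book": ["Traditional Ukrainian cookery", "Being and nothingness", "Shooting Elvis"],
    "cd": ["Ukrainian folk music", "Being there", "Milk cow blues"],
}

# Stage 1 (Q21, Q22): one projection per variable used in the similarity predicate.
q21 = list(db["book"])
q22 = list(db["cd"])


def stage2_similarity_join(books, cds, r):
    """Stage 2 (Q3): keep the r best-scoring books that match some CD."""
    scored = [(max(cosine(b, c) for c in cds), b) for b in books]
    best = sorted((s for s in scored if s[0] > 0), reverse=True)[:r]
    return [b for _, b in best]


def stage3_construct(items):
    """Stage 3 (Q4): wrap the ranked bindings in the requested output structure."""
    return "<results>" + "".join(f"<item>{b}</item>" for b in items) + "</results>"


q3 = stage2_similarity_join(q21, q22, r=2)
print(stage3_construct(q3))
```

“Shooting Elvis” shares no term with any CD title, so it never reaches the output, which matches the two-item result of the example.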

  18. ELIXIR – Experiments, Conclusion
      Experiments: total processing time …
      • … depends on the details of each query and the input data
      • … increases marginally with the number of answers r
      • … increases linearly with the number of similarity join predicates
      • partitioning of the initial query (step 1) dominates (expensive parsing and traversing)

  19. ELIXIR – Experiments, Conclusion (II)
      Conclusion:
      • ELIXIR extends XML-QL by supporting IR similarity features for ranking
      • similarity joins even between 2 variables (expressiveness)
      • Algorithm:
        • rewrite the original ELIXIR query into a series of intermediate XML-QL and WHIRL queries
        • no full cross product, only filtered tuples of variable bindings (efficiency)
      • But …
        • only non-nested queries
        • the strict three-stage approach may be suboptimal in some cases (partition)

  20. XRANK: Ranked Keyword Search over XML Documents

  21. Introduction
      • XRANK: keyword search over XML documents
      • results:
        • XML elements that contain all searched keywords
      • ranking:
        • at the granularity of XML elements
        • based on hyperlink structure
      • advantages:
        • the user does not have to learn a query language
        • no knowledge about the structure of the XML documents is needed
        • generalized keyword search engine (both HTML and XML are possible)

  22. Data Model
      • G = (V, CE, HE): collection of XML documents
        • V: set of XML elements (tags and attributes)
        • CE: set of containment edges
        • HE: set of hyperlink edges
      • (u,v) ∈ CE ⇔ v is a sub-element of u
      • (u,v) ∈ HE ⇔ u contains a hyperlink to v
      • contains(v,k) ⇔ v (in)directly contains the keyword k

  23. Example: XML Graph
      [Figure: example XML graph; legend: XML element, value.]

  24. Keyword Query Results (1)
      How to define the results of keyword search queries over XML documents?
      • elements that contain all keywords and have no sub-element that contains all keywords
        ⋃
      • elements with at least one sub-element containing all keywords and at least one sub-element containing some keywords

  25. Ranking Elements
      How to rank XML elements? ElemRank:
      • extension of PageRank at the granularity of elements
      • objective importance of XML elements
      • based on the hyperlinked and nested structure of XML elements

  26. ElemRank (1)
      • n: number of XML elements
      • nc(u): number of sub-elements of u
      • nh(u): number of outgoing hyperlinks from u
      • CE^-1 = {(v,u) | (u,v) ∈ CE}: “reverse containment edges”
      • E = HE ∪ CE ∪ CE^-1
      [Figure: example element u with nc(u) = 3 and nh(u) = 3; legend: containment edge, reverse containment edge, hyperlink edge.]

  27. ElemRank (2)
      • α: probability of following a hyperlink
      • β: probability of using a containment edge
      • γ: probability of using a reverse containment edge
      • 1 - α - β - γ: probability of a random jump (written ε in the figure)
      [Figure: transition probabilities out of an example element. Each edge label combines the probability of its edge type, divided by the number of edges of that type (3 or 1 in the figure), with the random-jump share ε/10; elements reached only by a random jump are labeled ε/10. Legend: containment edge, reverse containment edge, hyperlink edge.]

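  28. ElemRank (3)
      ElemRank e(v), with 0 ≤ α, β, γ ≤ 1:
      $$e(v) = (1 - \alpha - \beta - \gamma) \cdot \frac{1}{n} + \alpha \cdot \sum_{(u,v) \in HE} \frac{e(u)}{n_h(u)} + \beta \cdot \sum_{(u,v) \in CE} \frac{e(u)}{n_c(u)} + \gamma \cdot \sum_{(u,v) \in CE^{-1}} \frac{e(u)}{1}$$
      • first term: random navigation
      • α term: navigation via hyperlinks
      • β term: navigation via forward containment edges
      • γ term: navigation via reverse containment edges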
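ElemRank can be computed iteratively in the same way as PageRank. The sketch below is an illustration under assumptions, not the XRANK implementation: it writes α, β, γ for the probabilities of following a hyperlink, a forward containment edge and a reverse containment edge, the default values for them are arbitrary, and the four-element graph and the function name `elem_rank` are hypothetical. Elements are numbered 0 … n-1, and edges are (u, v) pairs.

```python
def elem_rank(n_elems, hyperlinks, containment,
              alpha=0.35, beta=0.25, gamma=0.25, iterations=100):
    """Iterate e(v) = (1-a-b-g)/n + a*sum e(u)/nh(u) + b*sum e(u)/nc(u) + g*sum e(u)/1."""
    nh = [0] * n_elems  # number of outgoing hyperlinks per element
    nc = [0] * n_elems  # number of sub-elements per element
    for u, _ in hyperlinks:
        nh[u] += 1
    for u, _ in containment:
        nc[u] += 1
    e = [1.0 / n_elems] * n_elems
    jump = (1.0 - alpha - beta - gamma) / n_elems
    for _ in range(iterations):
        new_e = [jump] * n_elems
        for u, v in hyperlinks:
            new_e[v] += alpha * e[u] / nh[u]
        for u, v in containment:
            new_e[v] += beta * e[u] / nc[u]   # forward containment edge u -> v
            new_e[u] += gamma * e[v] / 1      # reverse containment edge v -> u
        e = new_e
    return e


# Hypothetical graph: 0 contains 1 and 2, 1 contains 3, and 3 hyperlinks to 2.
ranks = elem_rank(4, hyperlinks=[(3, 2)], containment=[(0, 1), (0, 2), (1, 3)])
print([round(r, 4) for r in ranks])
```

Elements 1 and 2 receive the same share from their parent; element 2 ends up ranked higher only because the hyperlink from element 3 carries more weight (α) than the reverse containment edge from 3 to 1 (γ) under these assumed values.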

  29. Ranking Function (1)
      • ranking functions should take into account:
        • result specificity
        • hyperlinks
        • keyword proximity
      • contains(v,k) ⇔ ∃ sequence of containment edges (v1,v2), ..., (vn-1,vn) with v1 = v s.t. vn directly contains k
      • $r(v,k) = \mathrm{ElemRank}(v_n) \cdot \mathrm{decay}^{\,n-1}$ (0 ≤ decay ≤ 1)
        • ElemRank(vn): based on the hyperlinked structure
        • decay^(n-1): result specificity

  30. Ranking Function (2)
      • m occurrences of keyword k
        • computation of r1, ..., rm
        • r*(v,k) = f(r1, ..., rm) (with accumulation function f, e.g. max or sum)
      • query q consists of keywords k1, ..., kn
        • $R(v,q) = \left( \sum_{1 \le i \le n} r^*(v,k_i) \right) \cdot p(v,k_1,\ldots,k_n)$
        • p = proximity measure (keyword proximity)
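The two parts of the ranking function combine as in the following minimal sketch. The function names, the decay value 0.75 and the proximity value 1.0 are assumptions for illustration; the ElemRank values 75 and 88 are the ones shown for “R.E.M.” and “Religion” in the Dewey inverted list example of this presentation.

```python
def keyword_rank(elem_rank_vn, depth_below_v, decay=0.75):
    """r(v, k) = ElemRank(v_n) * decay^(n-1), where depth_below_v = n - 1."""
    return elem_rank_vn * decay ** depth_below_v


def overall_rank(occurrences_per_keyword, proximity, f=max):
    """R(v, q) = (sum over keywords of f(r_1, ..., r_m)) * p(v, k_1, ..., k_n).

    Each keyword contributes a list of occurrences (ElemRank of v_n, depth below v).
    """
    total = 0.0
    for occurrences in occurrences_per_keyword:
        total += f(keyword_rank(rank, depth) for rank, depth in occurrences)
    return total * proximity


# Rank of the first <CD> element for the query "R.E.M. Religion":
# "R.E.M." sits one level below it (in <title>), "Religion" two levels below.
score = overall_rank([[(75, 1)], [(88, 2)]], proximity=1.0)
print(score)
```

Passing `f=sum` instead of the default `max` switches the accumulation function when a keyword occurs several times below the same element.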

  31. Example XML document:

          <CDs>
            <CD id="1">
              <title>R.E.M. – Out Of Time</title>
              <song>
                <title>Radio Song</title>
                <time>4:12</time>
              </song>
              <song>
                <title>Losing My Religion</title>
                <time>4:26</time>
              </song>
              ...
            </CD>
            <CD id="2">
              <title>R.E.M. – Automatic For...</title>
              ...
            </CD>
            ...
          </CDs>

  32. XRANK Architecture
      [Figure: architecture diagram with the components XML documents, ElemRank computation, XML elements with ElemRanks, index structures & algorithms, data access and Query Evaluator; the Query Evaluator receives a keyword search query and returns a ranked result list.]

  33. Naïve Approach
      • naïve inverted list: contains all XML elements that contain the keyword

          key1   elem11  elem12  ...
          key2   elem21  elem22  ...
          etc.

      • space overhead
      • spurious results
      • inaccurate ranking

  34. Dewey IDs

          0        <CDs>
          0.0      <CD>
          0.0.0    <title>  R.E.M. – Out Of Time
          0.0.1    <song>
          0.0.1.0  <title>  Radio Song
          0.0.1.1  <time>   4:12
          0.0.2    <song>
          0.0.2.0  <title>  Losing My Religion
          0.0.2.1  <time>   4:26
          0.1      <CD>
          0.1.0    <title>  R.E.M. – Automatic For The People
          ...
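Dewey IDs of this kind can be assigned with a simple depth-first traversal. The sketch below uses Python's standard `xml.etree.ElementTree`; the function name `dewey_ids` and the shortened sample document are assumptions for illustration, and, as a simplification, only elements (not attributes) receive an ID.

```python
import xml.etree.ElementTree as ET


def dewey_ids(xml_text):
    """Return (Dewey ID, tag, text) triples in document order; the root gets ID 0."""
    root = ET.fromstring(xml_text)
    result = []

    def visit(elem, dewey):
        text = (elem.text or "").strip()
        result.append((".".join(map(str, dewey)), elem.tag, text))
        # The i-th child extends its parent's ID by the component i.
        for i, child in enumerate(elem):
            visit(child, dewey + [i])

    visit(root, [0])
    return result


doc = """<CDs>
  <CD id="1">
    <title>R.E.M. - Out Of Time</title>
    <song><title>Radio Song</title><time>4:12</time></song>
    <song><title>Losing My Religion</title><time>4:26</time></song>
  </CD>
  <CD id="2"><title>R.E.M. - Automatic For The People</title></CD>
</CDs>"""

for dewey, tag, text in dewey_ids(doc):
    print(dewey, tag, text)
```

Because an element's ID is a prefix of the IDs of all its descendants, ancestor relationships can be read off the IDs alone, without touching the document.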

  35. DIL – Data Structure
      • Dewey inverted list:
        • contains the Dewey IDs of all XML elements that directly contain the keyword
        • sorted by Dewey ID (ascending)

          keyword    Dewey ID   ElemRank   position list
          R.E.M.     0.0.0      75         [0]
                     0.1.0      80         [0]
                     …
          Religion   0.0.2.0    88         [2]
                     …

  36. DIL – Query Processing (1)
      • key idea: computation of the longest common prefix (lcp) of Dewey IDs
      [Figure: Dewey stack after step 1, with the columns DeweyID, rank[1], rank[2], posList[1], posList[2] and pot_result (y/n).]

  37. DIL – Query Processing (2)
      [Figure: Dewey stack after steps 1 and 2, with the columns DeweyID, rank[1], rank[2], posList[1], posList[2] and pot_result (y/n); the longest common prefix (lcp) with the next Dewey ID is marked.]

  38. DIL – Query Processing (3)
      [Figure: Dewey stack after steps 1, 2 and 3, with the columns DeweyID, rank[1], rank[2], posList[1], posList[2] and pot_result (y/n); the longest common prefix (lcp) is marked in each step, and step 3 lists { 0.0, 0 } with pot_result = y.]
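The role of the longest common prefix can be illustrated without the full Dewey stack. The following is a deliberately simplified sketch, not the DIL algorithm itself: it collects all prefixes in sets instead of merging the sorted lists with a stack, ignores ranks and position lists, and returns only the deepest elements that contain every keyword. The function names and the dictionary layout of `dil` are assumptions; the Dewey IDs come from the DIL data structure example.

```python
def lcp(a, b):
    """Longest common prefix of two Dewey IDs given as tuples of ints."""
    prefix = []
    for x, y in zip(a, b):
        if x != y:
            break
        prefix.append(x)
    return tuple(prefix)


def parse(dewey):
    """Turn '0.0.2.0' into (0, 0, 2, 0)."""
    return tuple(int(part) for part in dewey.split("."))


def deepest_common_elements(dil, keywords):
    """Return the Dewey IDs of the deepest elements that contain every keyword."""
    containing = []
    for k in keywords:
        prefixes = set()
        for dewey in dil.get(k, []):
            d = parse(dewey)
            # Every prefix of d is an ancestor-or-self element containing k.
            prefixes.update(d[:i] for i in range(1, len(d) + 1))
        containing.append(prefixes)
    common = set.intersection(*containing) if containing else set()
    # Drop elements that have a descendant which also contains all keywords.
    return sorted(e for e in common
                  if not any(o != e and lcp(e, o) == e for o in common))


dil = {"R.E.M.": ["0.0.0", "0.1.0"], "Religion": ["0.0.2.0"]}
print(lcp(parse("0.0.0"), parse("0.0.2.0")))
print(deepest_common_elements(dil, ["R.E.M.", "Religion"]))
```

For the query “R.E.M. Religion” the only deepest common element is the first `<CD>` with Dewey ID 0.0, the longest common prefix of 0.0.0 and 0.0.2.0.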

  39. RDIL – Data Structure
      • ranked Dewey inverted list:
        • each Dewey ID in the list has a position in the B+-tree
        • B+-tree sorted by Dewey ID (ascending)
        • inverted list sorted by ElemRank (descending)

          keyword R.E.M.: B+-tree on Dewey IDs (0.0.0, …, 0.1.0)
          ElemRank   Dewey ID
          80         0.1.0
          75         0.0.0
          …

  40. RDIL – Query Processing (1)
      [Figure: for each keyword key1, key2, key3, an inverted list (entry_i1, entry_i2, entry_i3, ...) sorted by ElemRank and a B+-tree on Dewey IDs. Step shown: lcp with Dewey ID11 ⇨ result heap.]

  41. RDIL – Query Processing (2)
      [Figure: for each keyword key1, key2, key3, an inverted list (entry_i1, entry_i2, entry_i3, ...) sorted by ElemRank and a B+-tree on Dewey IDs. Step shown: lcp with Dewey ID21 ⇨ result heap, etc.]

  42. RDIL – Query Processing (3)
      [Figure: for each keyword key1, key2, key3, an inverted list (entry_i1, entry_i2, entry_i3, ...) sorted by ElemRank and a B+-tree on Dewey IDs.]
      max. reachable ranking ≤ ∑ ranking = threshold Ω

  43. RDIL – Query Processing (4)
      The RDIL algorithm stops if threshold Ω < lowest ElemRank in the result heap, because
      max. reachable ranking ≤ Ω < lowest ElemRank in the result heap
      ⇨ max. reachable ranking < lowest ElemRank in the result heap!
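The stopping rule follows the pattern of a threshold algorithm over score-sorted lists. The sketch below shows only that pattern and makes strong simplifying assumptions: elements are flat IDs that appear in every list in which they match (no Dewey prefixes, no lcp computation), the overall rank is the plain sum of the per-keyword scores, and a dictionary lookup stands in for the B+-tree probe. The function name, the data and the scores are hypothetical.

```python
def top_k_threshold(lists, lookup, k):
    """Return the k best (elem, score) pairs, reading the lists round-robin.

    lists[i]: (score, elem) pairs for keyword i, sorted by score descending.
    lookup(i, elem): score of elem for keyword i, or None if it does not match.
    """
    if not lists:
        return []
    results = {}
    # An element matching all keywords occurs in the shortest list as well,
    # so reading that many rows is always enough.
    for depth in range(min(len(lst) for lst in lists)):
        for lst in lists:
            elem = lst[depth][1]
            if elem not in results:
                scores = [lookup(j, elem) for j in range(len(lists))]
                if all(s is not None for s in scores):
                    results[elem] = sum(scores)
        # No unseen element can score higher than the sum of the scores just read.
        threshold = sum(lst[depth][0] for lst in lists)
        top = sorted(results.values(), reverse=True)[:k]
        if len(top) == k and threshold < top[-1]:
            break
    return sorted(results.items(), key=lambda kv: -kv[1])[:k]


lists = [
    [(90, "a"), (80, "b"), (10, "c")],  # keyword 1, sorted by score
    [(75, "b"), (60, "a"), (20, "c")],  # keyword 2, sorted by score
]
index = [{elem: score for score, elem in lst} for lst in lists]
best = top_k_threshold(lists, lambda i, elem: index[i].get(elem), k=1)
print(best)
```

In this run the threshold after the second row is 80 + 60 = 140, which is below the best score found so far, so element `c` is never evaluated.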

  44. XRANK Architecture
      [Figure: architecture diagram with the components XML documents, ElemRank computation, XML elements with ElemRanks, DIL / RDIL, data access and Query Evaluator; the Query Evaluator receives a keyword search query and returns a ranked result list.]

  45. Experimental Results (1)

  46. Experimental Results (2)

  47. Comparison DIL - RDIL
      DIL:
      • inverted lists sorted by Dewey ID
      • computes the longest common prefix on Dewey IDs
      • extracts the minimum of all remaining Dewey IDs
      • all lists are completely scanned
      • outperforms RDIL if keyword correlation is low
      RDIL:
      • inverted lists sorted by ElemRank
      • chooses the next list sequentially
      • stops if a certain threshold is reached
      • outperforms DIL if keyword correlation is high

  48. Conclusion
      2 approaches:
      • ELIXIR:
        • SQL-like, structure-based search
        • extends XML-QL by supporting IR similarity features for ranking
        • ranked results based only on textual similarity (even between 2 variables)
      • XRANK:
        • keyword-based search à la Google
        • ranked results based on textual similarity
        • hierarchical and hyperlinked structure
