
XML Retrieval


Presentation Transcript


  1. XML Retrieval Tarık Teksen Tutal 21.07.2011

  2. Information Retrieval • XML (Extensible Markup Language) • XQuery • Text Centric vs Data Centric

  3. Basic XML Concepts

  4. XML • Ordered, Labeled Tree • XML Element • XML Attribute • XML DOM (Document Object Model): Standard for accessing and processing XML documents.

  5. XML Structure • An Example (a sample document is reconstructed in the sketch after the next slide):

  6. XML DOM Object • XML DOM Object of the Sample in the Previous Slide • Nodes in a Tree • Parse the Tree Top Down
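
The figure shown on these two slides is not reproduced in the transcript. As an illustration only, the sketch below builds a small sample document (modeled on the Shakespeare-play example commonly used for XML retrieval; all element names and content are assumptions), parses it with Python's standard DOM implementation, and walks the resulting ordered, labeled tree top-down:

    from xml.dom.minidom import parseString

    # Sample document; the element names and content are illustrative only.
    DOC = """<play>
      <author>Shakespeare</author>
      <title>Macbeth</title>
      <act number="I">
        <scene number="vii">
          <title>Macbeth's castle</title>
          <verse>Will I with wine and wassail ...</verse>
        </scene>
      </act>
    </play>"""

    def walk(node, depth=0):
        """Parse the tree top-down: print a node, then recurse into its children."""
        if node.nodeType == node.ELEMENT_NODE:
            print("  " * depth + node.tagName)             # XML element
        elif node.nodeType == node.TEXT_NODE and node.data.strip():
            print("  " * depth + repr(node.data.strip()))  # text leaf
        for child in node.childNodes:
            walk(child, depth + 1)

    dom = parseString(DOC)      # DOM: the document as an ordered, labeled tree
    walk(dom.documentElement)   # start at the root element, go top-down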

  7. XPath • Standard for enumerating paths in an XML document collection • Query language for selecting nodes from an XML document • Defined by the World Wide Web Consortium (W3C)
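
A minimal sketch of XPath-style node selection, using the limited XPath subset supported by Python's standard xml.etree.ElementTree; the document and path expressions are assumptions for illustration:

    import xml.etree.ElementTree as ET

    root = ET.fromstring(
        "<play><title>Macbeth</title>"
        "<act number='I'><scene number='vii'>"
        "<title>Macbeth's castle</title></scene></act></play>"
    )

    # Path expressions select nodes from the document tree.
    print(root.find("title").text)              # direct child <title>
    for t in root.findall(".//title"):          # all descendant <title> nodes
        print(t.text)
    print(root.find(".//scene[@number='vii']/title").text)  # attribute test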

  8. Schema • Puts Constraints on the Structure of Allowable XML • Two Standards for Schemas: • XML DTD • XML Schema
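
As one illustration of a schema constraining allowable structure, the sketch below validates documents against a DTD using the third-party lxml library; the DTD and documents are assumptions, not taken from the original slides:

    from io import StringIO
    from lxml import etree  # third-party: pip install lxml

    # DTD: a <play> must contain exactly one <author> followed by one <title>.
    dtd = etree.DTD(StringIO(
        "<!ELEMENT play (author, title)>"
        "<!ELEMENT author (#PCDATA)>"
        "<!ELEMENT title (#PCDATA)>"
    ))

    good = etree.fromstring("<play><author>Shakespeare</author>"
                            "<title>Macbeth</title></play>")
    bad = etree.fromstring("<play><title>Macbeth</title></play>")

    print(dtd.validate(good))  # True: structure matches the DTD
    print(dtd.validate(bad))   # False: required <author> is missing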

  9. Challenges in XML Retrieval

  10. Structured Document Retrieval Principle • A system should always retrieve the most specific part of a document answering the query • In a cookbook collection, if a user queries "Apple Pie", the system should return the relevant "Apple Pie" chapter of the book "Apple Desserts", not the entire book. • In the same example, however, if the user queries "Apple", the whole book should be returned instead of a single chapter.

  11. Indexing Unit • Unstructured: • Files on a PC, Pages on the Web, E-Mail Messages, etc. • Structured: • Non-Overlapping Pseudodocuments • Top-Down • Bottom-Up • All Elements

  12. Indexing Unit • Non-Overlapping Pseudodocuments • Pseudodocuments may not be coherent units that make sense to the user, since they can cut across logical element boundaries

  13. Indexing Unit • Top-Down • Start with one of the largest units (e.g. a book in a book collection) • Postprocess search results to find for each book the subelement that is the best hit • May fail to return the best element, since the relevance of a book is generally not a good predictor of the relevance of its subelements

  14. Indexing Unit • Bottom-Up • Search all leaves, select the relevant ones • Extend them to larger units in postprocessing • May fail to return the best element, since the relevance of a subelement is generally not a good predictor of the relevance of larger units

  15. Indexing Unit • Index All the Elements • Not useful to index some elements (e.g. an ISBN number) • Creates redundancy (the content of deeper-level elements is returned several times, once for each enclosing element)

  16. Nested Elements • To Get Rid of Redundancy: • Discard All Small Elements • Discard All Element Types that Users do not Look at (Working XML Retrieval System Logs) • Discard All Element Types that Assessors Generally do not Judge to be Relevant (If Relevance Assessments are Available) • Only Keep Element Types that a System Designer or Librarian has Deemed to be Useful Search Results

  17. Nested Elements • Remove Nested Elements in a Postprocessing Step • Collapse Several Nested Elements in the Results List and then Highlight Results

  18. Vector Space Model For XML Retrieval

  19. Lexicalized Subtrees • Goal: each word, together with its position within the XML tree, is encoded by a dimension of the vector space • Map XML documents to lexicalized subtrees • Take each text node (leaf) and break it into multiple nodes, one for each word, e.g. split Bill Gates into Bill and Gates • Define the dimensions of the vector space to be lexicalized subtrees of documents – subtrees that contain at least one vocabulary term

  20. Lexicalized Subtrees

  21. Lexicalized Subtrees • Queries and documents can be represented as vectors in this lexicalized-subtree space • Matches can then be computed, for example, by using the Vector Space Formalism • Vector Space Formalism: Unstructured vs Structured • Dimensions: Vocabulary Terms vs Lexicalized Subtrees

  22. Dimensions: Tradeoff • Dimensionality of Space vs Accuracy of Results • Restrict Dimensions to Vocabulary Terms • Standard Vector Space Retrieval System • Does Not Match the Structure of the Query • Separate Lexicalized Dimension for Each Subtree • Dimensionality of the Space Becomes too Large

  23. Dimensions: Compromise • Index All Paths that End with a Single Vocabulary Term (XML-Context Term Pairs) • Structural Term <c, t>: a pair of XML-context c and vocabulary term t
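
A minimal sketch of enumerating structural terms <c, t> from a document, taking the context c to be the root-to-leaf path of element names; the tokenizer and element names are assumptions:

    import xml.etree.ElementTree as ET

    def structural_terms(xml_string):
        """Yield <c, t> pairs: each vocabulary term t paired with its
        XML context c (the path of element names down to its text node)."""
        def visit(elem, path):
            path = path + [elem.tag]
            if elem.text:
                for term in elem.text.lower().split():  # naive tokenizer
                    yield ("/".join(path), term)
            for child in elem:
                yield from visit(child, path)
        yield from visit(ET.fromstring(xml_string), [])

    doc = "<book><title>XML retrieval</title><author>Bill Gates</author></book>"
    for c, t in structural_terms(doc):
        print(c, t)   # book/title xml ... book/author gates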

  24. Context Resemblance • To measure the similarity between a path in a query and a path in a document • |cq| and |cd| are the number of nodes in the query path and document path, respectively • cq matches cd if and only if we can transform cq into cd by inserting additional nodes
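
The measure itself appears only as a figure in the original slides; a reconstruction from the standard context resemblance definition, consistent with the worked example on the next slide:

    \mathrm{CR}(c_q, c_d) =
    \begin{cases}
      \dfrac{1 + |c_q|}{1 + |c_d|} & \text{if } c_q \text{ matches } c_d \\
      0 & \text{otherwise}
    \end{cases}

For instance, a query path of 2 nodes matching a document path of 3 nodes gives (1 + 2) / (1 + 3) = 0.75.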

  25. Context Resemblance • CR(cq4, cd2) = 3/4 = 0.75 • CR(cq4, cd3) = 3/5 = 0.6

  26. Document Similarity Measure • Final Score for a Document • Variant of the Cosine Measure • Also called "SimNoMerge" • Not a True Cosine Measure, Since Its Value can be Larger than 1.0

  27. Document Similarity Measure • V is the vocabulary of non-structural terms • B is the set of all XML contexts • weight(q, t, c) and weight(d, t, c) are the weights of term t in XML context c in query q and document d, respectively • standard weighting, e.g. idf_t · wf_{t,d}, where idf_t depends on which elements we use to compute df_t
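
The full scoring formula is shown only as a figure in the original; a reconstruction of the SimNoMerge score from the textbook definition these slides appear to follow (treat the exact form as an assumption):

    \mathrm{SimNoMerge}(q, d) =
    \sum_{c_k \in B} \sum_{c_l \in B} \mathrm{CR}(c_k, c_l)
    \sum_{t \in V} \mathrm{weight}(q, t, c_k)\,
    \frac{\mathrm{weight}(d, t, c_l)}
         {\sqrt{\sum_{c \in B,\, t \in V} \mathrm{weight}(d, t, c)^2}}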

  28. SimNoMerge Algorithm ScoreDocumentsWithSimNoMerge(q, B, V, N, normalizer)
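
The pseudocode itself is a figure in the original slides; below is a hedged Python sketch of the same ScoreDocumentsWithSimNoMerge procedure, with the inverted index abstracted behind a hypothetical get_postings(c, t) helper and context resemblance passed in as cr:

    def score_documents_with_sim_no_merge(q, B, N, normalizer, get_postings, cr):
        """q: iterable of (c_q, t, w_q) structural query terms with weights.
        B: all XML contexts in the index; N: number of documents.
        normalizer[d]: per-document normalization (the denominator above).
        get_postings(c, t): postings as (doc_id, weight) pairs -- hypothetical.
        cr(c_q, c): the context resemblance function CR."""
        score = [0.0] * N
        for c_q, t, w_q in q:
            for c in B:
                resemblance = cr(c_q, c)
                if resemblance == 0:
                    continue  # query context does not match this index context
                for doc_id, w_d in get_postings(c, t):
                    score[doc_id] += resemblance * w_q * w_d
        # length-normalize each document's accumulated score
        return [s / normalizer[d] for d, s in enumerate(score)]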

  29. Evaluation of XML Retrieval

  30. INEX • Initiative for the Evaluation of XML Retrieval • Yearly standard benchmark evaluation that has produced test collections (documents, sets of queries, and relevance judgments) • Based on an IEEE journal collection (since 2006 INEX uses the much larger English Wikipedia test collection) • The relevance of documents is judged by human assessors.

  31. INEX Topics • Content Only (CO) • Regular Keyword Queries Like in Unstructured IR • Content and Structure (CAS) • Structured Constraints in Addition to Keywords • Relevance Assessments are More Complicated

  32. INEX Relevance Assessments • INEX 2002 defined component coverage and topical relevance as orthogonal dimensions of relevance • Component Coverage: • Evaluates whether the element retrieved is "structurally" correct, i.e. neither too low nor too high in the tree • Topical Relevance: • Evaluates whether the retrieved content answers the information need

  33. INEX Relevance Assessments • Component Coverage: • Exact coverage (E): The information sought is the main topic of the component and the component is a meaningful unit of information • Too small (S): The information sought is the main topic of the component, but the component is not a meaningful (self-contained) unit of information • Too large (L): The information sought is present in the component, but is not the main topic • No coverage (N): The information sought is not a topic of the component • Topical Relevance: • Highly Relevant (3), Fairly Relevant (2), Marginally Relevant (1) and Nonrelevant (0)

  34. Combining The Relevance Dimensions • Not all combinations of coverage and relevance are possible, e.g. 3N (highly relevant but no coverage) is contradictory • Quantization:
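
The quantization function is a figure in the original. The strict quantization used at INEX counts only exactly-covered, highly relevant elements; a generalized variant gives partial credit (grades as reported in the IIR textbook's presentation of INEX 2002; treat the exact values as an assumption):

    Q_{\text{strict}}(rel, cov) =
    \begin{cases}
      1 & \text{if } (rel, cov) = 3\mathrm{E} \\
      0 & \text{otherwise}
    \end{cases}

    Q_{\text{gen}}(rel, cov) =
    \begin{cases}
      1.00 & \text{if } (rel, cov) = 3\mathrm{E} \\
      0.75 & \text{if } (rel, cov) \in \{2\mathrm{E}, 3\mathrm{L}\} \\
      0.50 & \text{if } (rel, cov) \in \{1\mathrm{E}, 2\mathrm{L}, 2\mathrm{S}\} \\
      0.25 & \text{if } (rel, cov) \in \{1\mathrm{S}, 1\mathrm{L}\} \\
      0.00 & \text{if } (rel, cov) = 0\mathrm{N}
    \end{cases}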

  35. INEX Evaluation Measures • Precision and Recall can be applied • Sum graded relevance values instead of counting binary relevance • Overlap is not accounted for: • Nested elements can appear in the same search result list • Recent INEX focus: • Develop algorithms and evaluation measures that return non-redundant results lists and evaluate them properly.
