
XML Retrieval

XML Retrieval, Chapter 10. Chapter outline: XML basic concepts; differences between XML and unstructured retrieval; the vector space model in XML retrieval; evaluation of XML retrieval (INEX); text-centric vs. data-centric XML retrieval.

kenny

Presentation Transcript


  1. XML Retrieval Chapter 10

  2. Introduction • Chapter Outline • XML basic concepts • Differences between XML and Unstructured Retrieval • Vector space model in XML Retrieval • Evaluation on XML retrieval: INEX • Text-centric vs. Data-centric XML Retrieval

  3. XML Retrieval • Structured retrieval • Finds information in structured documents • Mainly covers text-centric XML only (as opposed to data-centric XML) • XML retrieval vs. parametric and zone search • Parametric and zone search sits between unstructured and structured retrieval • Parametric fields and zones (author, title, …) • Flat: no nesting of attributes, and the number of attributes is small • Characteristics unique to XML • Has a more complex tree structure • Attributes can be nested • The number of attributes is greater than in parametric and zone search • In this book, structured retrieval always refers to XML retrieval

  4. Basic concept of XML • XML document • An ordered, labeled tree • Each node of the tree is an XML element • Has an opening and a closing tag • Can have attributes • Internal nodes and leaf nodes • Internal nodes encode structure • Leaf nodes hold text • XML DOM API • One of the standards for processing XML • Represents the document as a hierarchical object structure • Processing starts at the root element and descends into its children • Example: <play> <element> <author>shakespeare</author> </element> <element> <title>Macbeth</title> </element> <element> <act number="I"> <scene number="vii"> <verse>Will I with..</verse> <title>Macbeth's castle</title> </scene> </act> </element> </play>

  5. Basic concept of XML • XPath • Standard for enumerating paths in an XML document • act/scene • selects all scene elements whose parent is an act element • play//scene • selects all scene elements occurring inside a play element • /play/title • selects the play's title • /play//title • selects the play's title and the scene's title • /scene/title • selects no elements, because scene is not the root element • title#"Macbeth" • selects all titles containing the term "Macbeth" (the # notation is a convenience and is not standard XPath)
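
The path expressions on this slide can be tried out with Python's standard library. This is only a sketch: it assumes a simplified play document in which author, title and act are direct children of play (the sample data and variable names are illustrative), and it uses the limited XPath subset that `xml.etree.ElementTree` supports, so absolute paths and the `#` term test are emulated by hand.

```python
import xml.etree.ElementTree as ET

# Assumed sample document: a simplified version of the play tree.
PLAY_XML = """
<play>
  <author>Shakespeare</author>
  <title>Macbeth</title>
  <act number="I">
    <scene number="vii">
      <title>Macbeth's castle</title>
      <verse>Will I with ...</verse>
    </scene>
  </act>
</play>
"""

root = ET.fromstring(PLAY_XML)  # root is the <play> element

# act/scene: every scene whose parent is an act
act_scene = root.findall(".//act/scene")

# play//scene: every scene anywhere below the play element
play_scene = root.findall(".//scene")

# /play/title: title children of the root, valid only if the root is <play>
play_title = [t.text for t in root.findall("title")] if root.tag == "play" else []

# /play//title: every title below the play root
all_titles = [t.text for t in root.findall(".//title")]

# /scene/title: the root is not <scene>, so nothing is selected
scene_title = [t.text for t in root.findall("title")] if root.tag == "scene" else []

# title#"Macbeth": titles whose text contains the term (not real XPath, emulated here)
macbeth_titles = [t.text for t in root.iter("title") if "Macbeth" in (t.text or "")]

print(play_title)      # ['Macbeth']
print(all_titles)      # ['Macbeth', "Macbeth's castle"]
print(scene_title)     # []
print(macbeth_titles)  # ['Macbeth', "Macbeth's castle"]
```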

  6. Basic concept of XML • NEXI • Narrowed Extended XPath I • A common format for XML queries • Element + modifier • Example • //article[.//yr=2001 or .//yr=2002]//section[about(.,summer holidays)] • Path filter • Two yr conditions (arithmetic filtering) • about clause (string filtering)
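
To make the two kinds of filter in the NEXI example concrete, here is a rough sketch of how the query could be evaluated by hand. The collection, the function names and the scoring are assumptions: in particular `about` is a crude term-overlap stand-in for a real ranking function, not how any NEXI engine scores.

```python
import xml.etree.ElementTree as ET

# Assumed toy collection; element names follow the example query.
COLLECTION = """
<collection>
  <article><yr>2001</yr>
    <section>summer holidays in Spain</section>
    <section>winter sports</section>
  </article>
  <article><yr>1999</yr>
    <section>summer holidays at home</section>
  </article>
</collection>
"""

def about(element, query):
    """Hypothetical about(): fraction of query terms found in the element's text."""
    words = " ".join(element.itertext()).lower().split()
    terms = query.lower().split()
    return sum(term in words for term in terms) / len(terms)

def run_query(root):
    """Evaluate //article[.//yr=2001 or .//yr=2002]//section[about(., summer holidays)]."""
    hits = []
    for article in root.iter("article"):
        # Arithmetic filter: keep articles with a yr descendant equal to 2001 or 2002.
        years = {int(yr.text) for yr in article.iter("yr")}
        if not years & {2001, 2002}:
            continue
        # String filter: score each section below the article, drop zero scores.
        for section in article.iter("section"):
            score = about(section, "summer holidays")
            if score > 0:
                hits.append((score, section.text.strip()))
    return sorted(hits, reverse=True)

results = run_query(ET.fromstring(COLLECTION))
print(results)  # [(1.0, 'summer holidays in Spain')]
```

The yr conditions act as a hard filter, while the about clause produces a score, which is why only the second kind of condition leads to ranking.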

  7. Challenges in XML Retrieval • Structured retrieval • Queries and documents can each be structured or unstructured • ex) //article//section vs. “summer holidays” • Most users want parts of documents, not whole documents • ex) Shakespeare, “Macbeth’s castle” • Should we return the <scene>, the <act>, or the entire <play> element? • “Macbeth’s castle” is the title of a scene, so the scene is probably what the user needs • Structured document retrieval principle • A system should always retrieve the most specific part of a document that answers the query

  8. Challenges in XML Retrieval • Structured document retrieval principle • Applying the principle in practice is not easy • title#"Macbeth" • matches both /play/title#"Macbeth" and /play/act/scene/title#"Macbeth's castle" • This time, the play title is preferred • Indexing unit problem • Which parts of a document should be indexed? • In unstructured retrieval, the whole document is the indexing unit • In structured retrieval, several strategies exist

  9. Challenges in XML Retrieval • Indexing unit strategies • Group nodes into non-overlapping pseudo-documents • Use one of the largest elements as the indexing unit • then descend into its leaves in post-processing (two-step, top-down) • Use all leaves as indexing units • then extend to larger units in post-processing (two-step, bottom-up) • Index all elements • Figure: example of non-overlapping pseudo-documents

  10. Challenges in XML Retrieval • Relevant statistics for XML retrieval • Nested elements can make term statistics ambiguous • Ex) inverse document frequency • The term “Gates” occurs both as author#"Gates" and as section#"Gates" • In this case, should idf for “Gates” be computed for the term alone, or for each structure + term pair? • Schema heterogeneity • Also referred to as schema diversity • Equivalent elements may have different names: creator (d2) vs. author (d3) • Equivalent elements may have different structure: author (q3) vs. first name / last name (d3)

  11. Challenges in XML Retrieval • Schema heterogeneity • Extended query • Transform the query: q3 → q4 • In pseudo-XPath notation: book//#"Gates" • Users are often not familiar with element names and structure • The extended query allows any number of intervening nodes between book and "Gates"

  12. Challenges in XML Retrieval • Schema heterogeneity • Extended query q6 will return nothing • Structural mismatch • Extended queries do not help here • Mismatching documents should be ranked lower, but should not be omitted from the search results • Structural constraints should be interpreted as “hints”

  13. Vector space model for XML Retrieval

  14. Vector space model • Concept: structural term • A path of elements that ends in a single vocabulary term • An XML context/term pair, denoted <c, t> • The example document has 9 structural terms, 7 of which are shown in the figure • The 2 not shown: /book/author#"Bill" and /book/author#"Gates" • A lexicalized subtree that does not end in a single term is not a structural term
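
The count of nine structural terms can be reproduced with a short sketch. It assumes the example document is a book with a title "Microsoft" and an author "Bill Gates", and it assumes that every suffix of the root-to-leaf path (including the empty context) counts as an XML context. That reading gives nine pairs, but the function and the sample string are illustrative, not taken from the slides.

```python
import xml.etree.ElementTree as ET

# Assumed example document for this slide.
BOOK_XML = "<book><title>Microsoft</title><author>Bill Gates</author></book>"

def structural_terms(xml_text):
    """Return the set of (context, term) pairs for all text directly inside elements.

    Each term is paired with every suffix of its element path, e.g. for "Bill":
    "book/author", "author" and "" (the bare term with no context).
    Text that follows a child's closing tag (tail text) is ignored in this sketch.
    """
    terms = set()

    def walk(node, path):
        path = path + [node.tag]
        for token in (node.text or "").split():
            for start in range(len(path) + 1):
                terms.add(("/".join(path[start:]), token))
        for child in node:
            walk(child, path)

    walk(ET.fromstring(xml_text), [])
    return terms

pairs = structural_terms(BOOK_XML)
for context, term in sorted(pairs):
    print(f'{context}#"{term}"')
print(len(pairs))  # 9
```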

  15. Vector space model • XML query examples as sets of structural terms • General form: q = { (t1, c1), (t2, c1), (t3, c2), (t4), … } • In each pair, the first item is the non-structural term and the second is its XML context • Ex 1) <chapter><title>XML tutorials</title></chapter> • q = { (XML, chapter/title), (tutorial, chapter/title) } • Ex 2) <article> <sec>non-monotonic reasoning</sec> “belief revision” </article> • q = { (non-monotonic, article/sec), (reasoning, article/sec), (belief revision, article) }

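  16. Vector space model • SimNoMerge(q,d) • $\text{SimNoMerge}(q,d) = \sum_{c_k \in B} \sum_{c_l \in B} CR(c_k, c_l) \sum_{t \in V} \text{weight}(q,t,c_k) \, \frac{\text{weight}(d,t,c_l)}{\sqrt{\sum_{c \in B,\, t \in V} \text{weight}^2(d,t,c)}}$ • CR: context resemblance • B: the set of all XML contexts • V: the vocabulary of non-structural terms • weight(q,t,c), weight(d,t,c): weight of term t in XML context c in query q and in document d • The weight is one of the weightings from Chapter 6, such as $\text{idf}_t \cdot \text{wf}_{t,d}$ • Not a true cosine measure: the result may be larger than 1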

  17. Vector space model • Relevance scoring function • Unstructured case: cosine similarity between query q and document d • XML case: cosine similarity between XML fragment q and XML document d (from Carmel et al. 2002, An Extension of the Vector Space Model for Querying XML Document)

  18. Vector space model • Structural resemblance • CR: context resemblance • $CR(C_q, C_d) = \frac{1 + |C_q|}{1 + |C_d|}$ if $C_q$ matches $C_d$, and $CR(C_q, C_d) = 0$ if $C_q$ does not match $C_d$ • $|C_q|$, $|C_d|$: number of nodes in the query path and in the document path

  19. Vector space model • CR example • $CR(C_q, C_d) = 1$ if the query path and the document path are identical • ex) $CR(C_{q4}, C_{d2}) = 3/4 = 0.75$ • $CR(C_{q4}, C_{d3}) = 3/5 = 0.6$
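
The two example values can be checked with a small sketch of context resemblance. It assumes that a query path matches a document path when the query path can be turned into the document path by inserting extra nodes, and that paths run from the top node down to the term leaf. The concrete paths for q4, d2 and d3 below are assumptions chosen to have 2, 3 and 4 nodes.

```python
def matches(query_path, doc_path):
    """True if query_path is an in-order subsequence of doc_path,
    i.e. doc_path can be obtained from query_path by inserting nodes."""
    remaining = iter(doc_path)
    return all(node in remaining for node in query_path)

def context_resemblance(query_path, doc_path):
    """CR = (1 + |Cq|) / (1 + |Cd|) if Cq matches Cd, else 0."""
    if not matches(query_path, doc_path):
        return 0.0
    return (1 + len(query_path)) / (1 + len(doc_path))

# Assumed paths, from the top node to the term leaf.
q4 = ["book", "Gates"]
d2 = ["book", "creator", "Gates"]
d3 = ["book", "author", "lastname", "Gates"]

print(context_resemblance(q4, d2))  # 0.75
print(context_resemblance(q4, d3))  # 0.6
print(context_resemblance(q4, q4))  # 1.0
print(context_resemblance(d2, q4))  # 0.0, a longer query path cannot match a shorter one
```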

  20. Vector space model • SimNoMerge pseudo-code • N: number of documents to retrieve • B: all XML contexts • V: all (unstructured) terms • q: query (contains structural term pairs) • normalizer: sqrt(sum of (term-doc weight)^2) • Uses an inverted document index
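
Since the pseudo-code itself did not survive in this transcript, the following is a minimal sketch of how SimNoMerge scoring over an inverted index might look, not the slide's algorithm. The index layout, function names and toy data are assumptions, and contexts here are element paths without the term leaf.

```python
import math
from collections import defaultdict

def cr(query_context, doc_context):
    """Context resemblance on "/"-separated element paths (term leaf not included)."""
    q_nodes, d_nodes = query_context.split("/"), doc_context.split("/")
    remaining = iter(d_nodes)
    if not all(node in remaining for node in q_nodes):
        return 0.0
    return (1 + len(q_nodes)) / (1 + len(d_nodes))

def build_normalizer(index):
    """Per document: sqrt of the sum of squared structural-term weights."""
    squares = defaultdict(float)
    for postings in index.values():
        for doc_id, weight in postings:
            squares[doc_id] += weight * weight
    return {doc_id: math.sqrt(total) for doc_id, total in squares.items()}

def score_sim_no_merge(query, index, normalizer):
    """query: {(context, term): weight}
    index: {(context, term): [(doc_id, weight), ...]} (inverted index)"""
    contexts = {context for context, _term in index}
    scores = defaultdict(float)
    for (query_context, term), query_weight in query.items():
        for context in contexts:
            resemblance = cr(query_context, context)
            if resemblance > 0:
                for doc_id, doc_weight in index.get((context, term), []):
                    scores[doc_id] += resemblance * query_weight * doc_weight
    return {doc_id: s / normalizer[doc_id] for doc_id, s in scores.items()}

# Hypothetical toy index: one structural term per document, weight 1.0.
index = {
    ("book/author", "gates"): [("d1", 1.0)],
    ("book/editor/name", "gates"): [("d2", 1.0)],
    ("article/title", "gates"): [("d3", 1.0)],
}
scores = score_sim_no_merge({("book", "gates"): 1.0}, index, build_normalizer(index))
print(sorted(scores.items()))  # d1 scores 2/3, d2 scores 0.5, d3 is not matched
```

Contexts are not merged: a document context that only resembles the query context still contributes, but it is discounted by its CR value.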

  21. Evaluation on XML Retrieval • INEX • INitiative for the Evaluation of XML retrieval • Collection: 12,000 IEEE journal articles (2002) → en.wikipedia.org (2006) • Two types of topics • CAS (content and structure) • CO (content only) • Component coverage • Exact coverage (E) • Too small (S) • Too large (L) • No coverage (N) • Topical relevance • Four levels, from 3 (highly relevant) to 0 (non-relevant)

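  22. Evaluation on XML Retrieval • Quantizer function • Combines relevance and coverage into a single value • $Q(rel, cov) = \begin{cases} 1.00 & \text{if } (rel, cov) = 3E \\ 0.75 & \text{if } (rel, cov) \in \{2E, 3L\} \\ 0.50 & \text{if } (rel, cov) \in \{1E, 2L, 2S\} \\ 0.25 & \text{if } (rel, cov) \in \{1S, 1L\} \\ 0.00 & \text{if } (rel, cov) = 0N \end{cases}$ • Ex) a 2S component gets Q = 0.50 • $\#(\text{relevant items retrieved}) = \sum_{c \in A} Q(rel(c), cov(c))$, where A is the set of retrieved components • As an approximation, precision, recall, and F measure can be applied to this graded definition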
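
The quantizer table translates directly into a lookup. The sketch below uses illustrative names and an invented result list to sum the quantized values of retrieved components and derive a graded precision from that sum. Combinations that the table does not list (for example 3S) raise a `KeyError` rather than being guessed.

```python
# Quantization values keyed by "<relevance><coverage>", copied from the table on this slide.
QUANTIZATION = {
    "3E": 1.00,
    "2E": 0.75, "3L": 0.75,
    "1E": 0.50, "2L": 0.50, "2S": 0.50,
    "1S": 0.25, "1L": 0.25,
    "0N": 0.00,
}

def quantize(rel, cov):
    """Q(rel, cov) for a relevance level 0..3 and a coverage code E, S, L or N."""
    return QUANTIZATION[f"{rel}{cov}"]

def relevant_items_retrieved(assessments):
    """Graded count of relevant items: the sum of Q over the retrieved components."""
    return sum(quantize(rel, cov) for rel, cov in assessments)

def graded_precision(assessments):
    """Precision computed on the graded count instead of a binary count."""
    return relevant_items_retrieved(assessments) / len(assessments)

# Hypothetical assessments for four retrieved components.
retrieved = [(3, "E"), (2, "S"), (0, "N"), (1, "L")]

print(quantize(2, "S"))                     # 0.5
print(relevant_items_retrieved(retrieved))  # 1.75
print(graded_precision(retrieved))          # 0.4375
```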

  23. Evaluation on XML Retrieval • Effectiveness numbers in XML retrieval are often lower than in unstructured retrieval • XML retrieval is harder • Partial retrieval (coverage issue) • XML retrieval is scored lower • Binary relevance gives each hit 1 or 0; graded XML relevance gives 1 only in the best case • Structured retrieval scores are therefore not directly comparable with unstructured retrieval scores

  24. Evaluation on XML Retrieval • Large increase in precision at k for k = 5 and k = 10 • Structure helps to increase precision at the top of the result list • Structured retrieval is better at precision-oriented tasks • Recall may suffer

  25. Text-centric vs. Data-centric XML • Text-centric XML retrieval • Long text fields • Inexact matching • Relevance-ranked results • Assembly manuals, issues of journals, newswire articles… • Data-centric XML retrieval • No ranking • Exact matching • Commonly used for data collections with complex structure • Mainly contains non-text data • Most data-centric XML retrieval systems are extensions of relational database systems

  26. Appendix • Text-centric vs. data-centric XML documents • Also referred to as document-like vs. record-like XML documents • Document-like XML is also referred to as narrative-like XML • in XML in a Nutshell (O’Reilly, 3rd ed.) • Document-like (= text-centric) XML examples • XHTML documents: MSDN library documents, Wikipedia • Meant for human beings to read (with an appropriate schema/DTD) • Record-like (= data-centric) XML examples • SOAP, the RSS specification using XML • Commonly used in communication-type applications • cf. XML database • Does not really store native (text) XML documents • Provides the XML document as the fundamental unit of logical storage • XML-enabled RDBMS vs. native XML database

  27. Text-centric XML document • Page from INEX 2009 corpus

  28. Text-centric XML document • Page from Wikipedia

  29. Data-centric XML document • From a SOAP response • Clearly, information intended for machines

  30. Data-centric XML document • From the RSS format • Classified as record-like XML, but partially human-readable
