1 / 68

WEB BAR 2004 Advanced Retrieval and Web Mining

WEB BAR 2004 Advanced Retrieval and Web Mining. Lecture 11. Overview. Monday XML Clustering 1 Tuesday Clustering 2 Clustering 3, Interactive Retrieval Wednesday Classification 1 Classification 2 Thursday Classification 3 Information Extraction Friday Bioinformatics Projects

billr
Télécharger la présentation

WEB BAR 2004 Advanced Retrieval and Web Mining

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 11

  2. Overview • Monday • XML • Clustering 1 • Tuesday • Clustering 2 • Clustering 3, Interactive Retrieval • Wednesday • Classification 1 • Classification 2 • Thursday • Classification 3 • Information Extraction • Friday • Bioinformatics • Projects • Joker • Active learning in Text Mining

  3. Today’s Topics • Quick XML intro • XML indexing and search • Database approach • Xquery • 2 IR approaches

  4. What is XML? • eXtensible Markup Language • A framework for defining markup languages • No fixed collection of markup tags • Each XML language targeted for application • All XML languages share features • Enables building of generic tools

  5. Basic Structure • An XML document is an ordered, labeled tree • character data at the leaf nodes contain the actual data (text strings) • elementnodes are each labeled with • a name (often called the element type), and • a set of attributes, each consisting of a name and a value, • can have child nodes

  6. XML Example

  7. XML Example <chapter id="cmds"> <chaptitle>FileCab</chaptitle> <para>This chapter describes the commands that manage the <tm>FileCab</tm>inet application.</para> </chapter>

  8. Elements • Elements are denoted by markup tags • <foo attr1=“value” … > thetext </foo> • Element start tag: foo • Attribute: attr1 • The character data: thetext • Matching element end tag: </foo>

  9. XML vs HTML • Relationship?

  10. XML vs HTML • HTML is a markup language for a specific purpose (display in browsers) • XML is a framework for defining markup languages • HTML can be formalized as an XML language (XHTML) • XML defines logical structure only • HTML: same intention, but has evolved into a presentation language

  11. XML: Design Goals • Separate syntax from semantics to provide a common framework for structuring information • Allow tailor-made markup for any imaginable application domain • Support internationalization (Unicode) and platform independence • Be the future of (semi)structured information (do some of the work now done by databases)

  12. Why Use XML? • Represent semi-structured data (data that are structured, but don’t fit relational model) • XML is more flexible than DBs • XML is more structured than simple IR • You get a massive infrastructure for free

  13. Applications of XML • XHTML • CML – chemical markup language • WML – wireless markup language • ThML – theological markup language • <h3 class="s05" id="One.2.p0.2">Having a Humble Opinion of Self</h3> <p class="First" id="One.2.p0.3">EVERY man naturally desires knowledge <note place="foot" id="One.2.p0.4"> <p class="Footnote" id="One.2.p0.5"><added id="One.2.p0.6"> <name id="One.2.p0.7">Aristotle</name>, Metaphysics, i. 1. </added></p> </note>; but what good is knowledge without fear of God? Indeed a humble rustic who serves God is better than a proud intellectual who neglects his soul to study the course of the stars. <added id="One.2.p0.8"><note place="foot" id="One.2.p0.9"> <p class="Footnote" id="One.2.p0.10"> Augustine, Confessions V. 4. </p> </note></added> </p>

  14. XML Schemas • Schema = syntax definition of XML language • Schema language = formal language for expressing XML schemas • Examples • DTD • XML Schema (W3C) • Relevance for XML information retrieval • Our job is much easier if we have a (one) schema

  15. XML Tutorial • http://www.brics.dk/~amoeller/XML/index.html • (Anders Møller and Michael Schwartzbach) • Previous (and some following) slides are based on their tutorial

  16. XML Indexing and Search

  17. Native XML Database • Uses XML document as logical unit • Should support • Elements • Attributes • PCDATA (parsed character data) • Document order • Contrast with • DB modified for XML • Generic IR system modified for XML

  18. XML Indexing and Search • Most native XML databases have taken a DB approach • Exact match • Evaluate path expressions • No IR type relevance ranking • Only a few that focus on relevance ranking • Many types of XML don’t need relevance ranking • If there is a lot of text data, relevance ranking is usually needed.

  19. Timber: DB extension for XML • DB: search tuples • Timber: search trees • Main focus • Complex and variable structure of trees (vs. tuples) • Ordering • Non-native XML database • without relevance ranking • without “IR-type” handling of text

  20. Three Native XML Databases • Toxin • Xirql • IBM Haifa system

  21. ToXin • Exploits overall path structure • Supports any general path query • Query evaluation in three stages • Preselection stage • Selection stage • Postselection stage

  22. ToXin: Motivation • Strawman (Dataguides) • Index all paths occurring in database • Sufficient for simple queries: • Find all authors with last name Smith • Does not allow backward navigation • Example query: • find all the titles of articles authored by Smith

  23. Query Evaluation Stagesfor Backward Navigation • Pre-selection • First navigation down the tree • Selection • Value selection according to filter • Post-selection • Navigation up and down again

  24. ToXin

  25. Evaluation:Factors Impacting Performance • Data source (collection) specific • Document size • Number of XML nodes and values • Path complexity (degree of nesting) • Average value size • Query specific • Selectiveness of path constraint • Size of query answer • Number of elements selected by filter

  26. Test Collections

  27. Query Classification

  28. Evaluation

  29. ToXin: Summary • Efficient system supporting structured queries • All paths are indexed (not just from root) • Path index linear in corpus size • Shortcomings • Order of nodes ignored • No IR-type relevance

  30. IR/Relevance Ranking for XML • Why is this difficult?

  31. IR XML Challenge 1: Term Statistics • There is no document unit in XML • How do we compute tf and idf? • Global tf/idf over all text contexts is problematic • Consider medical collection • “new” not a discriminative term in general • Very discriminative for journal titles • New England Journal of Medicine

  32. IR XML Challenge 2: Fragments • Which fragments are legitimate to return? • Paragraph, abstract, title • Bold, italic • IR systems don’t store content (only index) • Need to go to document for displaying fragment • Problematic if fragment is not simply a node

  33. Remainder of Lecture • Queries for semi-structured text • How they differ from regular IR queries • Xquery • Two XML search systems with relevance ranking • Xirql • IBM Haifa system

  34. Types of (Semi)Structured Queries • Location/position (“chapter no.3”) • Simple attribute/value • /play/title contains “hamlet” • Path queries • title contains “hamlet” • /play//title contains “hamlet” • Complex graphs • Employees with two managers • All of the above: mixed structure/content

  35. XPath • Declarative language for • Addressing (used in XLink/XPointer and in XSLT) • Pattern matching (used in XSLT and in XQuery) • Location path • a sequence of location steps separated by / • Example: • child::section[position()<6] / descendant::cite / attribute::href

  36. Axes in XPath • ancestor, ancestor-or-self, attribute, child, descendent, descendent-or-self, following, following-sibling, namespace, parent, preceding, preceding-sibling, self

  37. Location steps • A single location step has the form: • axis :: node-test [ predicate ] • The axis selects a set of candidate nodes (e.g. the child nodes of the context node). • The node-test performs an initial filtration of the candidates based on their • types (chardata node, processing instruction, etc.), or • names (e.g. element name). • The predicates (zero or more) cause a further, more complex, filtration • child::section[position()<6]

  38. XQuery • SQL for XML • Usage scenarios • Human-readable documents • Data-oriented documents • Mixed documents (e.g., patient records) • Based on XPath

  39. XQuery Expressions • path expressions • element constructors • list expressions • conditional expressions • quantified expressions • datatype expressions

  40. FLWR Expressions • FOR $p IN document("bib.xml")//publisher LET $b := document("bib.xml”)//book[publisher = $p] WHERE count($b) > 100 RETURN $p • FOR generates an ordered list of bindings of publisher names to $p • LET associates to each binding a further binding of the list of book elements with that publisher to $b • at this stage, we have an ordered list of tuples of bindings: ($p,$b) • WHERE filters that list to retain only the desired tuples • RETURN constructs for each tuple a resulting value

  41. XQuery vs SQL • Order matters! • document("zoo.xml")//chapter[2]//figure[caption = "Tree Frogs"] • XQuery is turing complete, SQL is not.

  42. XQuery Example Møller and Schwartzbach

  43. XQuery 1.0 Standard on Order • Document order defines a total ordering among all the nodes seen by the language processor. Informally, document order corresponds to a depth-first, left-to-right traversal of the nodes in the Data Model. • … if a node in document A is before a node in document B, then every node in document A is before every node in document B. • This structure-oriented ordering can have undesirable effects. • Example: Medline

  44. Document collection = 100s of XML docs, each with thousands of abstracts <!DOCTYPE MedlineCitationSet PUBLIC "MedlineCitationSet" "http://www.nlm.nih.gov/databases/dtd/nlmmedline_001211.dtd"> <MedlineCitationSet> <MedlineCitation> <MedlineID>91060009</MedlineID> <DateCreated><Year>1991</Year><Month>01</Month><Day>10</Day></DateCreated> <Article>some content</Article> </MedlineCitation>

  45. Document collection = 100s of XML docs, each with thousands of abstracts <!DOCTYPE MedlineCitationSet PUBLIC "MedlineCitationSet" "http://www.nlm.nih.gov/databases/dtd/nlmmedline_001211.dtd"> <MedlineCitationSet> <MedlineCitation> (content) </MedlineCitation> <MedlineCitation> (content) </MedlineCitation> … </MedlineCitationSet>

  46. How XQuery makes ranking difficult • All documents in collection A must be ranked before all documents in collection B. • Fragments must be ordered in depth-first, left-to-right order.

  47. Semi-Structured Queries • More complex than “unstructured” queries • Xquery standard

  48. XIRQL • University of Dortmund • Goal: open source XML search engine • Motivation • “Returnable” fragments are special • “atomic units” • E.g., don’t return a <bold> some text </bold> fragment • Structured Document Retrieval Principle • Empower users who don’t know the schema • Enable search for any person_name no matter how schema refers to it

  49. Atomic Units • Specified in schema • Only atomic units can be returned as result of search (unless unit specified) • Tf.idf weighting is applied to atomic units • Probabilistic combination of “evidence” from atomic units

  50. XIRQL Indexing

More Related