1 / 22

Searching XML Documents via XML Fragments

Searching XML Documents via XML Fragments. D. Camel, Y. S. Maarek, M. Mandelbrod, Y. Mass and A. Soffer. Presented by Hui Fang. Database:. IR:. Schema: Papers (Title, Authors, Conf., Journal). An example document:. <title>. Title. Authors. Conf. Journal.

hayley
Télécharger la présentation

Searching XML Documents via XML Fragments

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Searching XML Documents via XML Fragments D. Camel, Y. S. Maarek, M. Mandelbrod, Y. Mass and A. Soffer Presented by Hui Fang

  2. Database: IR: Schema: Papers (Title, Authors, Conf., Journal) An example document: <title> Title Authors Conf. Journal Intel: New chip, new price war . February 1, 2004: 6:32 PM EST. Intel Corp. on Sunday said it had refreshed its line of microchips for desktop computers with a new version of the Pentium 4 processor, designed to run increasingly power-hungry office and home entertainment software faster. In 1998,….. </title> </date> <date> XIRQL N.Fuhr, K. Grobjohann SIGIR ------ <content> Protein John Smith --- Bioinformatics DB+IR: </content> Semi-structured Data Background(1) --- Data Why is semi-structured data important? Well-structured Data Un-structured Data Lack of flexibility Lack of extensibility <paper> <title> XIRQL </title> <author> N.Fuhr </author> <author> K.Grobjohann </author> <conf> SIGIR </conf> </paper> Lack of the logical structure of a document.

  3. Attribute content End tag Start tag … book element id=“25” title year author author publisher Readings in … 1997 Peter Willett Morgan Kaufmann Karen Sparck Jones XML in a nutshell <book id=“25”> <year>1997</year> <author> Karen Sparck Jones </author> <author>Peter Willett </author> <publisher>Morgan Kaufmann</publisher> <title> Readings in Information Retrieval </title> </book> … • Hierarchical data format • Nested element structure having a root • Self describing data (tags), schema is attached to the data itself.

  4. IR: Ranked Query • Keywords: • paper SIGIR • Return the ranked documents according to the relevance. Database: Boolean Query • SQL (Structured Query Language): • SELECT title • FROM papers • WHERE conf=‘SIGIR’ • Return the unranked tuples satisfying the query. Background(2) --- Query How to query semi-structured data (e.g. XML data) ?

  5. Related Work • DB-oriented approaches • E.g. XML-QL, XQL, XQUERY … WHERE <book> <title>Harry Potter </title> <author>$a</author>, <year> $y </year> </book> in “books.xml”, $y>2002 CONSTRUCT <result> <author>$t</author> </result> • DB+IR approaches • E.g. XIRQL • IR-oriented approaches • E.g. this paper

  6. Problem Refinement---CAS Search • Document collection: • XML documents • Each document is a hierarchical structure of nested elements • Markup in the document mainly serves for exposing the logical structure of a document. • Query • content + explicit references to the XML structure • specifies the target element need to be returned An example: Retrieval all articles from the years 1999-2000 and deal with works on nonmonotonic reasoning. Do not retrieve articles that are calendar/call for papers.

  7. Approach • Compare apple and apple • Recall vector space models • Both documents and queries are expressed in free text. • Compare unstructured data to unstructured data • This paper: • Search XML documents via XML fragments

  8. XQuery <results> { for $t in document (“library.xml”//book/title) where contains ($t/text(), “search”) return $t } </results> Query---XML Fragments(1) More intuitive More flexible • Topic 1: Find all books about fishing <book> fishing </book> • Topic 2: Find all books having a title about search <book> <title> fishing </title> </book>

  9. Query --- XML Fragment(2) • Limited expressiveness • E.g. “Finding figures that describe the Corba architecture and the paragraphs that refer to those figures. “ Requires a “join” operation between two elements “figures” and “paragraphs”

  10. d q • Vector Space Model • Represent doc/query by a vector of terms • Relevance between doc and query distance between two vectors Recall: Text Retrieval Task • Give a query • According to the retrieval formula, compute the relevance score for each document; • Rank the documents according to relevance score.

  11. Context resemblance measure Perfect match: ,when ; 0 ,otherwise. Partial match: ,when ci subsequence of ck; 0, otherwise Fuzzy match: Flat (ignore context): Extending the Vector Space Model(1) • Indexing unit: • E.g. (“Harry Potter ”, /book/title) • Can be matched with • (“Harry Potter ”,/book) • (“Harry Potter ”,/book/sec/title) • Retrieval Formula

  12. ,where “Merge-idf” variant: and ,where “Merge” variant: Extending the Vector Space Model(2) If c is rare, idf(t,c) would be high in spite of t being very common.

  13. Evaluation • Runs • Partial-match • Partial-match. merge-idf • Partial-match.merge • Fuzzy-match.merge-idf • Flat (ignore context)

  14. Result(1) • Result for “free-text-oriented” topics • An example topic : <yr>1995,1996,1997,1998,1999</yr> <bdy>XML Electronic commerce </bdy>

  15. Result(2) • Result for “context-oriented” topics • An example topic: <atl> Content-Based retrieval of video databases</atl>

  16. Summary • Using XML fragments with an extended vector space model is promising. • Use different solutions for different types of applications • Something wrong?

  17. Another Problem --- CO Search • Document collection: • XML documents • Query: • a set of keywords • Task: Find smallest element satisfying the query Challenge: rank the components instead of document

  18. Possible Solutions ,where Possible Method(1): treat each component as a document. Problem with this method: XML components are nested. <article> t1 <sec> <p> t2</p></sec> </article>

  19. Impossible to differentiate between the rankings of the three sections Possible Solutions (Cont.) ,where Possible Method(2): counting TF at the component level; computing N & DF at the document level. <article> <sec>t1</sec> <sec>t1</sec> <sec>t2</sec> </article>

  20. Proposed Solution • Create a index for each component type • Elements in each index are regarded as documents • Keep N, DF,TF for the specific component type • Can apply the regular vector space model on each index • Given a query • Run the query in parallel on each index • Return one ranked list of results, one from each index • Normalize the scores in each index into the range (0,1) • Achieved by computing • Merge the normalized results into a one ranked list of all components Assume the set of potential components to be returned must be known in advance. Assume no nesting of the same component.

  21. Conclusion • Possible solutions to solve the following challenges. • Challenge 1 (Information/Doc Unit): What is an appropriate information unit? • Document may no longer be the most natural unit • Components in a document may be more appropriate • Challenge 2 (Query): What is an appropriate query language? • Keyword (free text) query is no longer the only choice • Constraints on the structures can be posed

  22. References • Retrieving the most relevant XML components, by Y. Mass, M. Mandelbrod. INEX’03 workshop. • Searching XML Documents via XML fragments, by D. Carmel, Y. S.Maarek, M. Mandelbrod, Y. Mass and A. Soffer. SIGIR’03 • XIRQL: A Query Language for Information Retrieval in XML Documents by N. Fuhr, K. Großjohann. SIGIR’02

More Related