
From Semistructured Data to XML

Dan Suciu, AT&T Labs. http://www.research.att.com/~suciu/vldb99-tutorial.pdf. How the Web is today: HTML documents, all intended for human consumption, many generated automatically by applications. It is easy to fetch any Web page, from any server, on any platform.


Presentation Transcript


  1. From Semistructured Data to XML. Dan Suciu, AT&T Labs. http://www.research.att.com/~suciu/vldb99-tutorial.pdf

  2. How the Web is Today • HTML documents • all intended for human consumption • many generated automatically by applications • easy to fetch any Web page, from any server, any platform

  3. Limits of the Web Today • applications cannot consume HTML • HTML wrapper technology is brittle • screen scraping • OO technology (CORBA) requires controlled environment • companies merge, form partnerships; need interoperability fast • people are inventive: send data by fax!

  4. Paradigm Shift on the Web • new Web standard XML: • XML generated by applications • XML consumed by applications • data exchange • across platforms: enterprise interoperability • across enterprises • the Web: from a collection of documents to data and documents

  5. Database Community Can Help • query optimization, processing • views, transformations • data warehouses, data integration • mediators, query rewriting • secondary storage, indexes

  6. But Needs a Paradigm Shift Too • Web data differs from database data: • self-describing, schema-less • structure changes without notice • heterogeneous, deeply nested, irregular • documents and data mixed together • designed by document, not db experts • need Web data management

  7. What This Tutorial is About • what the database community has done: • semistructured data model • query languages, schemas • what the Web community has done: • data formats/models: XML, RDF • transformation language (XSL), schemas • where they meet and where they differ

  8. Outline • Semistructured data and XML • Query languages • Schemas • Systems issues • Conclusions

  9. Part 1: Semistructured Data and XML

  10. Semistructured Data Origins: • integration of heterogeneous sources • data sources with non-rigid structure • biological data • Web data

  11. The Semistructured Data Model: Object Exchange Model (OEM) [Figure: an edge-labeled graph rooted at Bib (&o1). Complex objects (two papers and a book: &o12, &o24, &o29) have outgoing edges labeled author, title, year, publisher, page, http and references; author objects have firstname and lastname edges, page has first and last. Atomic objects hold values such as “Serge”, “Abiteboul”, “Victor”, “Vianu”, 1997, 122 and 133.]

  12. Syntax for Semistructured Data Bib: &o1 { paper: &o12 { … }, book: &o24 { … }, paper: &o29 { author: &o52 “Abiteboul”, author: &o96 { firstname: &o243 “Victor”, lastname: &o206 “Vianu”}, title: &o93 “Regular path queries with constraints”, references: &o12, references: &o24, pages: &o25 { first: &o64 122, last: &o92 133} } }

  13. Syntax for Semistructured Data (cont’d) • may omit oids: { paper: { author: “Abiteboul”, author: { firstname: “Victor”, lastname: “Vianu”}, title: “Regular path queries …”, page: { first: 122, last: 133 } } }
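The oid-free syntax on slide 13 maps naturally onto nested data. Below is a minimal sketch, assuming an ssd object is either an atomic value or a list of (label, child) pairs; the helper name `children` is hypothetical and not part of the tutorial.

```python
# Assumption: an ssd object is an atomic value or a list of (label, child) edges.
# A list of pairs (not a dict) is used so a label such as "author" may repeat.
bib = [
    ("paper", [
        ("author", "Abiteboul"),
        ("author", [("firstname", "Victor"), ("lastname", "Vianu")]),
        ("title", "Regular path queries ..."),
        ("page", [("first", 122), ("last", 133)]),
    ]),
]


def children(obj, label):
    """Return every child reached from obj by an edge named label."""
    if not isinstance(obj, list):  # atomic objects have no outgoing edges
        return []
    return [child for (lbl, child) in obj if lbl == label]


paper = children(bib, "paper")[0]
authors = children(paper, "author")  # one atomic author, one complex author
```

A list of pairs keeps repeated labels and heterogeneous values (a string author next to a structured author), which a plain dictionary keyed by label would lose.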

  14. Characteristics of Semistructured Data • missing or additional attributes • multiple attributes • different types in different objects • heterogeneous collections • in short: self-describing, irregular data, no a priori structure

  15. Comparison with Relational Data • { row: { name: “John”, phone: 3634 }, row: { name: “Sue”, phone: 6343 }, row: { name: “Dick”, phone: 6363 } } [Figure: the same three rows drawn as a tree with three row edges, each leading to name and phone leaves: “John” 3634, “Sue” 6343, “Dick” 6363.]
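The encoding of a relation on slide 15 can be sketched as a small conversion function. This is an illustration only; `relation_to_ssd` is a hypothetical name, and the (label, child) list representation is an assumption.

```python
def relation_to_ssd(columns, rows):
    """Encode each tuple as a 'row' edge whose children are labeled by column name."""
    return [("row", [(col, val) for col, val in zip(columns, row)]) for row in rows]


ssd = relation_to_ssd(
    ["name", "phone"],
    [("John", 3634), ("Sue", 6343), ("Dick", 6363)],
)
```

The column names, which a relational system keeps once in the schema, are repeated inside every row here. That repetition is what makes the data self-describing.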

  16. XML • a W3C standard to complement HTML • origins: structured text SGML • motivation: • HTML describes presentation • XML describes content • http://www.w3.org/TR/REC-xml (2/98)

  17. From HTML to XML HTML describes the presentation

  18. HTML <h1> Bibliography </h1> <p> <i> Foundations of Databases </i> Abiteboul, Hull, Vianu <br> Addison Wesley, 1995 <p> <i> Data on the Web </i> Abiteboul, Buneman, Suciu <br> Morgan Kaufmann, 1999

  19. XML <bibliography> <book> <title> Foundations… </title> <author> Abiteboul </author> <author> Hull </author> <author> Vianu </author> <publisher> Addison Wesley </publisher> <year> 1995 </year> </book> … </bibliography> XML describes the content

  20. XML Terminology • tags: book, title, author, … • start tag: <book>, end tag: </book> • elements: <book>…</book>, <author>…</author> • elements are nested • empty element: <red></red>, abbreviated <red/> • an XML document: single root element • well-formed XML document: if it has matching tags
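Well-formedness as defined on slide 20 (matching tags, single root element) can be checked with any XML parser. A minimal sketch using Python's standard `xml.etree.ElementTree`; the function name `is_well_formed` is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if text parses as a single-rooted XML document with matching tags."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


ok = is_well_formed("<book><title>Foundations</title></book>")
mismatched = is_well_formed("<book><title>Foundations</book>")  # </title> is missing
two_roots = is_well_formed("<book/><book/>")                    # more than one root
empty_abbrev = is_well_formed("<red/>")                         # same as <red></red>
```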

  21. More XML: Attributes <book price=“55” currency=“USD”> <title> Foundations of Databases </title> <author> Abiteboul </author> … <year> 1995 </year> </book> • attributes are alternative ways to represent data

  22. More XML: Oids and References <person id=“o555”> <name> Jane </name> </person> <person id=“o456”> <name> Mary </name> <children idref=“o123 o555”/> </person> <person id=“o123” mother=“o456”> <name> John </name> </person> • oids and references in XML are just syntax
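Because oids and references are just syntax, it is up to the application to follow them. The sketch below resolves the references from slide 22 by hand; the enclosing `<people>` root element is an assumption added so the fragment parses as one document.

```python
import xml.etree.ElementTree as ET

# Assumption: a <people> root wraps the three person elements from the slide.
doc = """<people>
  <person id="o555"><name>Jane</name></person>
  <person id="o456"><name>Mary</name><children idref="o123 o555"/></person>
  <person id="o123" mother="o456"><name>John</name></person>
</people>"""

root = ET.fromstring(doc)

# Index every person by its id attribute so references can be dereferenced.
by_id = {person.get("id"): person for person in root.iter("person")}

mary = by_id["o456"]
child_refs = mary.find("children").get("idref").split()
child_names = [by_id[ref].findtext("name") for ref in child_refs]

john = by_id["o123"]
mother_name = by_id[john.get("mother")].findtext("name")
```

The parser returns a tree. The graph (John pointing back to Mary, Mary pointing to John) only exists once the application builds an index like `by_id`.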

  23. XML Data Model • does not exist • Document Object Model (DOM): • http://www.w3.org/TR/REC-DOM-Level-1 (10/98) • class hierarchy (node, element, attribute, …) • objects have behavior • defines API to inspect/modify the document

  24. XML Parsers • traditional: return data structure (DOM?) • event based: SAX (Simple API for XML) • http://www.megginson.com/SAX • write handler for start tag and for end tag
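The event-based style on slide 24 amounts to writing a handler for start tags and a handler for end tags. A minimal sketch using Python's standard `xml.sax` module (a port of the SAX interface, not the Java API the slide links to); the class name `TagLogger` and the sample document are assumptions.

```python
import xml.sax


class TagLogger(xml.sax.ContentHandler):
    """Record one event per start tag and per end tag, in document order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))


handler = TagLogger()
xml.sax.parseString(
    b"<book><title>Foundations</title><year>1995</year></book>", handler
)
events = handler.events
```

No tree is built: the handler sees each tag once and keeps only what it chooses to keep, which is why this style suits documents too large to hold in memory.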

  25. XML Namespaces • http://www.w3.org/TR/REC-xml-names (1/99) • name ::= [prefix:]localpart <book xmlns:isbn=“www.isbn-org.org/def”> <title> … </title> <number> 15 </number> <isbn:number> … </isbn:number> </book>

  26. XML Namespaces (cont’d) • syntactic: <number>, <isbn:number> • semantic: provide URL for schema <tag xmlns:mystyle=“http://…”> … <mystyle:title> … </mystyle:title> <mystyle:number> … </tag> (callout on the xmlns URL: “defined here”)
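The syntactic role of namespaces (telling <number> apart from <isbn:number>) shows up directly in a namespace-aware parser. A minimal sketch with Python's `xml.etree.ElementTree`, using the book example from slide 25; the elided element contents are kept as "..." placeholders.

```python
import xml.etree.ElementTree as ET

doc = """<book xmlns:isbn="www.isbn-org.org/def">
  <title>...</title>
  <number>15</number>
  <isbn:number>...</isbn:number>
</book>"""

root = ET.fromstring(doc)

# ElementTree expands each prefixed name to {namespace-uri}localpart.
tags = [child.tag for child in root]

plain_number = root.find("number")
isbn_number = root.find("isbn:number", {"isbn": "www.isbn-org.org/def"})
```

The prefix itself is not significant. Only the URI it is bound to distinguishes the two `number` elements.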

  27. XML vs. Semistructured Data • both described best by a graph • both are schema-less, self-describing

  28. Similarities and Differences • XML: <person id=“o123”> <name> Alan </name> <age> 42 </age> <email> ab@com </email> </person> • ssd: { person: &o123 { name: “Alan”, age: 42, email: “ab@com” } } • XML: <person father=“o123”> … </person> • ssd: { person: { father: &o123 …} } • similar on trees, different on graphs [Figure: the person objects drawn as graphs, with name, age and email leaves (Alan, 42, ab@com) and a father edge.]

  29. More Differences • XML is ordered, ssd is not • XML can mix text and elements: <talk> Making Java easier to type and easier to type <speaker> Phil Wadler </speaker> </talk> • XML has lots of other stuff: entities, processing instructions, comments
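Mixed content, as in the <talk> example on slide 29, has no direct counterpart in the ssd syntax. The sketch below shows how one parser, Python's `xml.etree.ElementTree`, exposes text interleaved with child elements. It illustrates the parser's behavior only and is not a proposed data model.

```python
import xml.etree.ElementTree as ET

talk = ET.fromstring(
    "<talk>Making Java easier to type and easier to type "
    "<speaker>Phil Wadler</speaker></talk>"
)

leading_text = talk.text          # text before the first child element
speaker = talk[0].text            # text inside <speaker>
trailing_text = talk[0].tail      # text after </speaker>; none in this example
all_text = "".join(talk.itertext())
```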

  30. RDF • http://www.w3.org/TR/REC-rdf-syntax (2/99) • purpose: metadata for Web • help search engines • syntax in XML • semantics: edge-labeled graphs

  31. RDF Syntax <rdf:Description about=“www.mypage.com”> <about> birds, butterflies, snakes </about> <author> <rdf:Description> <firstname> John </firstname> <lastname> Smith </lastname> </rdf:Description> </author> </rdf:Description>

  32. RDF Data Model [Figure: a graph in which www.mypage.com has an about edge to “birds, butterflies, snakes” and an author edge to a node with firstname John and lastname Smith.] • the RDF data model is very close to semistructured data

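  33. More RDF Examples [Figure: www.anotherpage.com has a related edge to www.mypage.com. www.mypage.com has an about edge to “birds, butterflies, snakes” and an author edge to a node with firstname John and lastname Smith; www.anotherpage.com has author edges to that same node and to “Joe Doe”.]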

  34. <rdf:Description about=“www.mypage.com”> <about> birds, butterflies, snakes </about> <author> <rdf:Description ID=“&o55”> <firstname> John </firstname> <lastname> Smith </lastname> </rdf:Description> </author> </rdf:Description> <rdf:Description about=“www.anotherpage.com”> <related> <rdf:Description about=“www.mypage.com”/> </related> <author rdf:resource=“&o55”/> <author> Joe Doe </author> </rdf:Description>

  35. RDF Terminology • statement: subject, predicate, object
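A statement is a (subject, predicate, object) triple, so the description of www.mypage.com from the RDF Syntax slide can be written as a set of triples. A minimal sketch; the identifier `_:a1` for the unnamed author node and the helper `objects` are assumptions made for illustration.

```python
# Each statement is a (subject, predicate, object) triple.
# Assumption: "_:a1" stands for the author node, which has no name in the slide.
statements = {
    ("www.mypage.com", "about", "birds, butterflies, snakes"),
    ("www.mypage.com", "author", "_:a1"),
    ("_:a1", "firstname", "John"),
    ("_:a1", "lastname", "Smith"),
}


def objects(subject, predicate):
    """Return, sorted, every object o such that (subject, predicate, o) is a statement."""
    return sorted(o for (s, p, o) in statements if s == subject and p == predicate)


author_nodes = objects("www.mypage.com", "author")
last_names = [name for node in author_nodes for name in objects(node, "lastname")]
```

Reading each triple as an edge from subject to object labeled by the predicate gives the edge-labeled graph, which is why the RDF data model sits so close to semistructured data.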

  36. More RDF: Containers • bag, sequence, alternative <rdf:Description> <a> <rdf:Bag> <rdf:li> s1 </rdf:li> <rdf:li> s2 </rdf:li> </rdf:Bag> </a> </rdf:Description>

  37. RDF Containers (cont’d) [Figure: the node reached by edge a has an rdf:type edge to Bag and edges rdf_1 and rdf_2 to s1 and s2.]

  38. More RDF: Higher Order Statements • “the author of www.thispage.com says: ‘the topic of www.thatpage.com is environment’” • RDF uses reification [Figure: the author of www.thispage.com has a says edge to a node standing for the statement that www.thatpage.com has topic environment.]

  39. Summary of Data Models • semistructured data, XML, RDF • data is self-describing, irregular • schema embedded in the data

  40. Part 2: Query Languages • Semistructured data and XML • Query languages • Schemas • Systems issues • Conclusions

  41. Query Languages: Motivation • granularity of the HTML Web: one file • granularity of Web data varies: • single data item: “get John’s salary” • entire database: “get all salaries” • aggregates: “get average salary” • need query language to define granularity

  42. Query Languages: Outline • for semistructured data: • Lorel • UnQL • StruQL • for XML: XML-QL • a different paradigm • structural recursion • XSL

  43. Lorel • part of the Lore system (Stanford) • adapts OQL to semistructured data • example: select X.title from Bib.paper X where X.year > 1995 • abbreviated to: select Bib.paper.title from Bib.paper where Bib.paper.year > 1995

  44. Lorel vs. OQL • implicit coercions: 1995 to “1995” • missing attributes • empty answer vs. type error • set-valued attributes • in X.year > 1995, X may have several years • regular path expressions (next)
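The differences from OQL listed above can be made concrete by evaluating select X.title from Bib.paper X where X.year > 1995 by hand. This is a sketch of the intended behavior over made-up sample data, not Lore's implementation; the (label, child) list representation and the helper names are assumptions.

```python
def children(obj, label):
    """Return every child reached from obj by an edge named label."""
    if not isinstance(obj, list):
        return []
    return [child for (lbl, child) in obj if lbl == label]


def as_number(value):
    """Coercion sketch: treat "1998" and 1998 alike; None if the value is not numeric."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


# Hypothetical data for the object that Bib points to.
bib = [
    ("paper", [("title", "A"), ("year", 1994)]),
    ("paper", [("title", "B"), ("year", "1998")]),                 # year stored as a string
    ("paper", [("title", "C")]),                                   # missing year
    ("paper", [("title", "D"), ("year", 1990), ("year", 1997)]),   # several years
]

titles = []
for x in children(bib, "paper"):
    years = [as_number(y) for y in children(x, "year")]
    # The condition holds if SOME year of X exceeds 1995; no year means no match.
    if any(y is not None and y > 1995 for y in years):
        titles.extend(children(x, "title"))
```

Paper C contributes nothing to the answer rather than raising a type error, and paper D qualifies because one of its two years satisfies the condition.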

  45. Regular Path Expressions Useful for: • syntactic substitute for inheritance: paper|book • navigating partially known structures: lastname? • transitive closure: reference+ select X.title from Bib.paper X, Bib.(paper|book) Y where Y.author.lastname? = “Ullman” and Y.reference+ X
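The three operators in the query above (alternation paper|book, optional step lastname?, closure reference+) can be evaluated over a small graph as follows. A sketch under stated assumptions: the graph, its oids (`bib`, `p1`, `p2`, `b1`, `a2`) and the helper names are hypothetical, and a string counts as an oid only if it is a key of `graph`.

```python
# Assumption: graph maps an oid to its outgoing (label, target) edges;
# a target that is not a key of graph is an atomic value.
graph = {
    "bib": [("paper", "p1"), ("paper", "p2"), ("book", "b1")],
    "p1": [("title", "T1"), ("author", "Smith")],
    "p2": [("title", "T2"), ("author", "a2"), ("reference", "p1")],
    "a2": [("lastname", "Ullman")],
    "b1": [("title", "T3"), ("author", "Ullman"), ("reference", "p2")],
}


def step(nodes, labels):
    """Follow one edge whose label is in labels from each node."""
    return [t for n in nodes for (lbl, t) in graph.get(n, []) if lbl in labels]


def plus(nodes, label):
    """label+ : every node reachable by one or more edges named label."""
    seen, frontier = [], step(nodes, {label})
    while frontier:
        node = frontier.pop()
        if node not in seen:          # guards against cycles in the graph
            seen.append(node)
            frontier.extend(step([node], {label}))
    return seen


papers = step(["bib"], {"paper"})
result = set()
for y in step(["bib"], {"paper", "book"}):           # Bib.(paper|book) Y
    authors = step([y], {"author"})
    names = authors + step(authors, {"lastname"})    # Y.author.lastname?
    if "Ullman" in names:
        for x in plus([y], "reference"):             # Y.reference+ X
            if x in papers:                          # Bib.paper X
                result.update(step([x], {"title"}))  # select X.title
```

The `seen` list is what keeps the closure finite: references between publications can form cycles, so a naive recursive walk might not terminate.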

  46. UnQL • Unstructured Query Language • patterns, templates, structural recursion • patterns: select T where Bib.paper: { title: T, year: Y, journal: “TODS”} and Y > 1995
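The pattern in the query above binds T and Y wherever a paper has a title edge, a year edge and a journal edge with value “TODS”. A minimal sketch of that matching over hypothetical sample data; it illustrates pattern semantics only and does not cover UnQL's structural recursion.

```python
def children(obj, label):
    """Return every child reached from obj by an edge named label."""
    if not isinstance(obj, list):
        return []
    return [child for (lbl, child) in obj if lbl == label]


# Hypothetical data for the object that Bib points to.
bib = [
    ("paper", [("title", "T1"), ("year", 1998), ("journal", "TODS")]),
    ("paper", [("title", "T2"), ("year", 1990), ("journal", "TODS")]),
    ("paper", [("title", "T3"), ("year", 1999), ("journal", "VLDBJ")]),
    ("paper", [("title", "T4"), ("journal", "TODS")]),  # no year: pattern fails
]


def select_titles(db):
    """select T where Bib.paper: {title: T, year: Y, journal: "TODS"} and Y > 1995"""
    results = []
    for paper in children(db, "paper"):
        if "TODS" not in children(paper, "journal"):
            continue
        for t in children(paper, "title"):       # bind T
            for y in children(paper, "year"):    # bind Y
                if y > 1995:
                    results.append(t)
    return results


titles = select_titles(bib)
```

Each combination of bindings for T and Y yields one result, so a paper with two qualifying years would contribute its title twice in this sketch.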

  47. UnQL: Templates select result: { fn: F, ln: L, pub: { title: T, year: Y }} where Bib.paper: { title: T, year: Y, journal: “TODS”} and Y > 1995 Result looks like: { result: { fn: “John”, ln: “Smith”, pub: { title: “P equals NP”, year: 2005}}, result: { fn: “Joe”, ln: “Doe”, pub: { title: “Errata to P=NP”, year: 2006}} … }

  48. Skolem Functions • Maier, 1986 • in OO systems • Kifer et al, 1989 • F-logic • Hull and Yoshikawa, 1990 • deductive db (ILOG) • Papakonstantinou et al., 1996 • semistructured db (MSL) • illustrate with Strudel (next)

  49. Skolem Functions in StruQL • Strudel: a Web Site Management System • StruQL: its query language

  50. Example: Bibliography Data {Bib: { paper: { author: “Jones”, author: “Smith”, title: “The Comma”, year: 1994 } }, { paper: ….. } }
