1 / 56

Organization

Semistructured data: from practice to theory Serge Abiteboul INRIA & Xyleme SA Serge.Abiteboul@inria.fr Serge.Abiteboul@xyleme.com http://www-rocq.inria.fr/verso http://www.xyleme.com . Organization. Motivations XML Typing XML Querying XML XML and the Web Illustrations: 2 problems

tamanna
Télécharger la présentation

Organization

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Semistructured data -- June 2001 Semistructured data:from practice to theorySerge AbiteboulINRIA & Xyleme SASerge.Abiteboul@inria.fr Serge.Abiteboul@xyleme.comhttp://www-rocq.inria.fr/verso http://www.xyleme.com

  2. Organization • Motivations • XML • Typing XML • Querying XML • XML and the Web • Illustrations: 2 problems • Incomplete information • Xyleme • Conclusion

  3. Semistructured data -- June 2001 Motivations

  4. Motivation: Complex data • Structure is irregular (missing/extra data…) • Schema does not exist or is unknown • Schema is rapidly evolving • Relational and ODB models are too rigid • Example: BibTex, HTML, SGML, XML, ASN.1, STEP/Express…

  5. Ontology meta-data Mediator wrapper wrapper wrapper wrapper wrapper wrapper Source Source Source Source Source Source Complex data: mediation User Many data sources coming and going

  6. Motivations: The Web today • Terabytes of data • Private web: not publicly available pages • Deep web: data hidden behind forms • A lot of public pages • Standard is a document/hypertext language HTML

  7. Browsing Search engines in: list of words out: sorted list of URLs Applis: hand-made wrappers Expensive Incomplete Short-lived, not adapted to the Web constant changes The Web today [Raghavan ’00]

  8. A new standard XML • HTML is not appropriate for data exchange on the Web • Standard database models are too constraining for the Web • The solution: a semistructured data model XML • Reminder: a data model consists of a type definition language, a query/update language + more

  9. Semistructured data -- June 2001 The most successful semistructured data model: XML

  10. The origin of XML • Parents • SGML • Relational and OO databases • SGML: markup language for documents • HTML and the Web: billions of pages • Not appropriate for data exchange • XML eXtensible Mark-up Language • W3C and most industrial companies [B2B] • Main idea: separate content and presentation • Use tags to represent structure and semantics

  11. HTML XML comes from SGML – also hypertext language – semistructured data fixed number of tags – not fixed content and presentation – not mixed are mixed very difficult to extract data – much easier from a page old standard for the Web – new standard XML: documents + databases

  12. The <b> X23 </b> new camera replaces the <b> X22 </b>. It comes equipped with a flash (worth by itself <i>53.99 $</i>) and provides great quality for only <i>359.99 $</i>. Ref Name Price X23 Camera 359.99 R2D2 Robot 19350.00 Z25 PC 1299.99 Information System HTML HTML = Hypertext Language hard Text + presentation Where is the data ?

  13. Ref Name Price X23 Camera 359.99 R2D2 Robot 19350.00 Z25 PC 1299.99 ... Information System XML = Semistructured Data <product-table> < product reference=”X23"> <designation> camera </designation> <price unit=Dollars> 359.99 </price> <description> … </description> </product> < product reference=”R2D2"> <designation> Robot </designation> <price unit=Dollars> 19350 </price> <description> … </description> ... </product-table> easy Data + Structure Semistructured: more flexible XML

  14. XML: example <dealer> <UsedCars> <ad> <model>Honda</model> <year>96</year></ad> </UsedCars> <NewCars> <ad> <model>Acura</model> </ad> </NewCars> <NewCars> <ad> <model>R406</model> </ad> </NewCars> </dealer> dealer UsedCars NewCars NewCars ad ad ad model year model model Honda 96 Acura R406 It is just an unranked tagged ordered tree

  15. XML • Tree or graph • Data and structure/semantics are mixed • Tags contain typing information • Core constructor is list of tag/value pairs • Details • Each node may have an arbitrary number of children with distinct or not tags • Nodes also have attributes that are unordered and unique per node • Standard means to represent cyclic data: Id Idrefs

  16. XML Very active/noisy field - standards • types (DTD/XML schema), style-sheet (XSL), resource description (RDF...) • DOM, SAX… • WML (wap), MathML, SMIL (multimedia), RSS (news), RDF (metadata)... • How fast will XML conquer the web? • so far rather slow (about 1% now of the visible web; much more in intranets); accelerates (e.g., with Explorer 5.5)

  17. Semistructured data -- June 2001 Typing XML

  18. Typing XML • This is heresy for the freedom of the Web • Essential for data management: query optimization, user interfaces, applications • Differences with standard database typing • Collections are sequences instead of sets • Types may be very large (e.g., from integration) • Data is more irregular so types should be more permissive • New issues sometimes: you have the data, extract its type, an approximate type

  19. Intuition : the type is a tree dealer • Semantics and structure are in paths • dealer/UsedCars/ad • dealer/UsedCars/ad/model UsedCars NewCars ad ad model year model text text text

  20. DTD: a grammar Catalog  Product* Product  Name Price? Cat (Part Quantity)* Part  BasicPart + ComposedPart BasicPart  Pame ComposedPart  Name (Part Quantity)* • Nice and simple • Shortcoming: type of an element is independent of its context

  21. dealer dealer UsedCars NewCars UsedCars NewCars adused ad adnew ad model year model year model model More complex: specialization • Type of ad depends on its context • One way to view it: homomorphism

  22. Regular tree automata • Set of accepted trees: regular tree languages • Definable in monadic second-order logic dealer q0 Acceptance: there is a computation such that all leaves are labeled qf Used New p q ad ad ad ad r r s s m y m y m m qf qf qf qf qf qf • variants: top/down bottom/up, nondeterminism, unranked trees

  23. DTDs+specialization Result: DTDs+specialization = regular tree languages • Closure (intersection, union, complement) • Tests for validation, inclusion • Static analysis

  24. Situation today • Many people are using DTDs • Nice and simple in spite of ugly syntax • New proposal: xml-schema • More powerful but too complicated? • Other proposals: Relax, Trex • Usually based on some kind of regular tree automata • From experience: one will win and not necessarily the best

  25. Semistructured data -- June 2001 Query languages for XML

  26. Query Languages for XML • Extensions of SQL • first-order-logic • Information retrieval keyword search • Navigation via regular expression + pattern matching Lorel, XML-QL, XMAS… • Structural recursion UnQL, XSLT… • No official winner – leader is Xquery

  27. Tree with variables and constraints Pattern matching between the query and the data Each match provides a valuation for X,Y,Z Pattern matching catalog product X Y name price cat=elec <200 Z subcategory

  28. Example in Lorel select <offer> Z/name, P/name, P’/price </offer> from P in catalog/product, Z in discount_stores/store, Z/storecatalog/product P’ where P/category=“camera” and P/make=“canon” and P’/id = P/id • Joins like in relational databases • Construction of complex results • Regular expressions for paths (e.g., W/*/name = “Gates”)

  29. What is new in XML queries • A bit new: limited recursion (like in deductive databases) • A bit new but no big deal: constructed answers (like in OODB) • Very new: ordered data • Bothering • Theoretical base is a bit messy: FO, tree automata, bisimulation • No yardstick like relational calculus/algebra

  30. Proposal : k-pebble transducers stack [milo,suciu,vianu]

  31. root a c a a b b b a k-pebble transducers: result

  32. Semistructured data -- June 2001 XML and the Web

  33. Why it is the same old story • Massive amounts of data • Providers export data, users access data • Query languages, indexing, optimization • Database paradigm: still effective on the Web

  34. Why it is not the same old story • Databases • rigid structure • transactions, concurrency control • data independence • controlled (e.g., known cost model) • coherent system, very • polished artifact • The Web • flexible, no schema • flexible protocols • fuzzy separation • perfect mess (and that’s why people like it?) • closer to a natural ecosystem!

  35. The principles of the Web • The uncertainty principle: you can never be sure of anything or that the data is consistent • The incompleteness principle: they do not give you all the data you want (but some you don’t want :-) • The chaos principle: you can rarely assume the existence of some global schema • The instability principle: everything keeps changing Every piece of data you got is probably wrong, incomplete, does not conform to its expected type and is probably already stale

  36. What can be reused? • Some technology? indexes, B-trees, distributed query processing (concurrency control and transactions not yet) • Database theory? little • Algebra and rewrite rules for optimization • Dependency theory • First order and other logics • Seems that because of the ordering, it opens the gates for many more tools such as regular/tree languages

  37. Metaphor [AV]: the Web is infinite • What are the pages pointing to my homepage? • Google solution: milliseconds – stale data • Freeze the Web: weeks to get exact answer • Exact answer: no means to get it • Leads to reconsider the notion of computation

  38. Computability • Finitely computable: give the answer in finite time • All pages reached from my HP in less than 3 links • Eventually computable: each solution is given in finite time; computation may be infinite • All pages reached from my HP • Not computable • Can my HP be reached starting from my HP? • Also: approximate, partial, stale, pipelined answers

  39. Tough life: the Web is huge • Relational calculus/algebra: logspace data complexity (also AC0) • What is the data complexity of an Xquery of the Web? • Complexity of computing on the Web • Logspace in the Web? • Need to trade quality for performance

  40. The Web keeps changing • Classical: versions, temporal queries • Less classical: monitoring of the Web [Xyleme] • Smart crawling of the Web: flow of docs • Query subscription: query on this flow • Continuous queries • What is the underlying theory?

  41. Semistructured data -- June 2001 Illustration: incomplete information Work with Victor Vianu

  42. Example Access to an electronic catalog Q1: name, subcat, price of electronic products with price less than $200 Q2: name, pictures of cameras at least pictured once

  43. * product product product * missing product2 product1 canon 120 elec nikon 199 elec sony 175 elec camera camera cdplayer catalog Q1: name, subcat, price of electronic products with price less than 200

  44. Missing data after Q1 product1 product2 * * name price cat picture name price cat picture =elec >200 !=elec subcategory subcategory

  45. 3 3 c.jpg akai a.jpg elec camera * product1 catalog product2 * product2b * product2c missing product product product product2a canon 120 elec nikon 199 elec sony 175 elec camera camera cdplayer Q2: name, pictures of cameras at least pictured once

  46. product + product2a Missing data name pricecat picture =elec product1 >200 * subcategory no picture name price cat picture product3 !=elec no picture subcategory name price cat product2c elec product2b subcategory * namepricecat !=camera =elec >200 namepricecatpicture =elec >200 Known data subcategory subcategory !=camera

  47. After two queries • Known information: • Prefix of the real data tree • Missing information • Complex type • Q3: name, price, pictures of cameras costing less than $100 and at least pictured once • can be completely answered using A1, A2 • Q4: list all cameras • can be partially answered using A1, A2

  48. Semistructured data -- June 2001 Illustration: Xyleme

  49. A dynamic warehouse of Web data • Warehouse • Xyleme stores huge quantities of data (teraB) • Xyleme is not a search engine (only index) or a mediator(only virtual data) • XML • Xyleme is focused on XML • Dynamic • Xyleme is interested in data evolution/changes

  50. Technical Challenges 1. Data Acquisition and Maintenance discover data of interest and maintain it up to date • Repository store and index this data 3. Efficient query Processing 4. Semantic Integration provide a simple view of each semantic domain 5. Change Control Monitor the web and offer services such as Query Subscription

More Related