Schema information for semistructured data Part II

Presentation Transcript


  1. Schema information for semistructured data, Part II • Yannakopoulos Andreas, CSE department, UCSD

  2. Semistructured Data • Where? • on the Web • Characteristics • large collections of data • too varied, irregular & mutable to be modeled in a traditional (relational, OO) approach • Objective • identify some underlying structure • complete description not always expected (e.g. type hierarchy) • Why? • formulate queries (we can’t query without a schema!) • query optimization & decomposition Schema Information for semistructured data, Yannakopoulos Andreas

  3. Another approach: type hierarchy • S. Nestorov, S. Abiteboul, R. Motwani • “Inferring Structure in Semistructured Data” • Main points • The schema is not given a priori • The derived schema is not a faithful representation of the data set, because a faithful representation is: • very complex and difficult to use • too slow to compute at query time • Extract a “reasonably small approximation” of the typing, in the form of a type hierarchy • Heuristic rules for assigning types to elements

  4. Data Model: OEM (Object Exchange Model) • rooted, directed, labeled graph • the vertices, not the labels, are the objects!

  5. Algorithm • Idea: discover the types using the relative importance of attributes • Steps • Identify candidate types • Select types and subtypes from the candidates and organize them into a type hierarchy • Derive the typing rules • Validate or type-check the type hierarchy against the data • Definitions • D: the data set, o: an object in D, S: a set of labels • attributes(o): the set of labels on the outgoing edges of o • roles(o): the set of labels on the incoming edges of o • at(S): # of objects o in D such that S = attributes(o) • above(S): # of objects o in D such that S ⊆ attributes(o) • jump(S) = at(S)/above(S) [relative importance]
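
The definitions on this slide can be made concrete with a small sketch. The edge-list encoding, the toy data, and the object names are invented for illustration; the sketch also assumes that above(S) counts every object whose attribute set contains S, including objects whose attribute set equals S.

```python
from collections import defaultdict

# Hypothetical OEM-style data set: (source object, label, target object) edges.
EDGES = [
    ("root", "person", "p1"), ("root", "person", "p2"), ("root", "employee", "p3"),
    ("p1", "name", "v1"), ("p1", "email", "v2"),
    ("p2", "name", "v3"), ("p2", "email", "v4"),
    ("p3", "name", "v5"), ("p3", "email", "v6"), ("p3", "salary", "v7"),
]

def attributes(edges):
    """Map each object with outgoing edges to the set of labels on those edges."""
    out = defaultdict(set)
    for src, label, _dst in edges:
        out[src].add(label)
    return {obj: frozenset(labels) for obj, labels in out.items()}

def at(attrs, s):
    """Number of objects whose attribute set is exactly s."""
    return sum(1 for a in attrs.values() if a == s)

def above(attrs, s):
    """Number of objects whose attribute set contains s (assumption: s itself counts)."""
    return sum(1 for a in attrs.values() if s <= a)

def jump(attrs, s):
    """Relative importance of s; defined as 0.0 when no object contains s."""
    n_above = above(attrs, s)
    return at(attrs, s) / n_above if n_above else 0.0

ATTRS = attributes(EDGES)
NAME_EMAIL = frozenset({"name", "email"})
print(at(ATTRS, NAME_EMAIL), above(ATTRS, NAME_EMAIL), jump(ATTRS, NAME_EMAIL))
```

In this toy data two objects have exactly the attributes {name, email} and a third has them plus salary, so the jump of {name, email} is 2/3.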

  6. Example • counting lattice L with an alphabet consisting of all distinct labels in D • count attributes(o) for all objects o in D

  7. Rules for the candidate types • Select all sets of labels S such that jump(S) clears a chosen threshold (the jump threshold)
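
A minimal sketch of this selection rule follows. The transcript does not preserve the comparison operator, so `>=` is an assumption, as are the restriction to label sets that actually occur as some object's attribute set, the convention that above(S) includes S itself, and the toy data.

```python
from collections import Counter

def candidate_types(attribute_sets, threshold):
    """Return {label set: jump} for label sets whose jump is at least `threshold`.

    Assumptions (not from the slides): only label sets that occur as
    attributes(o) for some object are considered, above(S) counts sets
    containing S including S itself, and the comparison is >=.
    """
    at = Counter(frozenset(s) for s in attribute_sets)
    selected = {}
    for s, at_s in at.items():
        above_s = sum(n for t, n in at.items() if s <= t)
        jump = at_s / above_s
        if jump >= threshold:
            selected[s] = jump
    return selected

# Hypothetical attribute sets, one per object.
OBJECTS = [
    {"name", "email"}, {"name", "email"}, {"name", "email"},
    {"name", "email", "salary"},
    {"name"},
]

for labels, value in candidate_types(OBJECTS, 0.5).items():
    print(sorted(labels), round(value, 2))
```

With the threshold at 0.5 the rare variant {name} (jump 0.2) is dropped; with the threshold at 0 every observed variant becomes a candidate.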

  8. Rules for the type hierarchy • Primary role: the label occurring most frequently in roles(o)
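
The primary role can be sketched by counting incoming edge labels per object. The edge list and the tie-breaking behavior (the first label seen wins) are assumptions of this sketch; the slide only gives the definition.

```python
from collections import Counter, defaultdict

def primary_roles(edges):
    """Map each object to the label that occurs most often on its incoming edges."""
    incoming = defaultdict(Counter)
    for _src, label, dst in edges:
        incoming[dst][label] += 1
    # most_common(1) returns the highest count; ties keep first-seen order.
    return {obj: counts.most_common(1)[0][0] for obj, counts in incoming.items()}

# Hypothetical edges: (source object, label, target object).
EDGES = [
    ("dept", "member", "p1"),
    ("proj", "member", "p1"),
    ("dept", "manager", "p1"),
    ("dept", "member", "p2"),
]

print(primary_roles(EDGES))
```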

  9. More rules... • Typing rules • Note: a given object may be assigned to more than one type. In many real life situations, objects do belong to more than one type. For instance, an Employee object is also a Person.

  10. Evaluation of the typing algorithm • Type size (e.g. the number of classes) • Correctness or accuracy of the typing • Example: threshold = 0, perfect typing: a separate type for each slight variation of the object structure • Example: high threshold, low accuracy • Conclusion • the algorithm is sensitive to the jump threshold • some objects may remain untyped • other objects may be assigned an inexact type

  11. Modifying & Querying using DataGuides schemas • R. Goldman, J. Widom • “DataGuides: Enabling Query Formulation and Optimization in Semistructured Databases” • S. Nestorov, J. Ullman, J. Wiener, S. Chawathe • “Representative Objects: Concise Representations of Semistructured Hierarchical Data” • Main points • The same role as traditional metadata. However they are not metadata! • dynamically generated • conform to data, rather than forcing the data to conform (compare with graph schemas) • Used as path index in querying

  12. Data Model: OEM • rooted, directed, labeled graph

  13. Label paths, target sets and querying • Queries are based on label paths (Lorel, the Lore query language) • A target set is the set of all objects that can be reached by traversing a given label path. • What is the target set of Restaurant.Entrée?
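
A target set can be computed by following the label path one edge at a time. The figure for this slide is not part of the transcript, so the graph below, its object ids, and the convention that the first label is an edge out of the root are all assumptions made for illustration.

```python
def target_set(edges, root, label_path):
    """Objects reachable from `root` by following `label_path` label by label."""
    current = {root}
    for label in label_path.split("."):
        current = {dst for src, lab, dst in edges if src in current and lab == label}
    return current

# Hypothetical restaurant guide: (source object, label, target object).
EDGES = [
    (1, "Restaurant", 2), (1, "Restaurant", 3),
    (2, "Name", 4), (2, "Entrée", 5),
    (3, "Entrée", 6), (3, "Entrée", 7), (3, "Name", 8),
]

print(sorted(target_set(EDGES, 1, "Restaurant.Entrée")))
```

In this invented graph the target set of Restaurant.Entrée is the set of all Entrée objects under either Restaurant, that is {5, 6, 7}.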

  14. DataGuides

  15. Theory behind DataGuides • Creating a DataGuide over a source database is equivalent to converting an NDFA to a DFA. • If the source database is a tree, this conversion takes linear time. • In the worst case, conversion of a graph-structured database may require time and space exponential in the number of objects and edges in the source. • “Solution”: Build a DataGuide over only the first levels. • A single NDFA may have many equivalent DFAs. • The minimal one can be found using state minimization algorithms. • But is the minimal one always the best?
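
The NDFA-to-DFA analogy can be sketched with the standard subset construction, treating source objects as NDFA states and edge labels as input symbols. This is an illustrative sketch, not Lore's implementation; the function name, the edge-list encoding, and the sample graph are assumptions. Each node of the result is identified by the set of source objects it stands for.

```python
from collections import defaultdict, deque

def build_dataguide(edges, root):
    """Subset construction over an edge list.

    Returns (start, guide), where each guide node is a frozenset of source
    objects and guide[node] maps a label to the child node.
    """
    out = defaultdict(lambda: defaultdict(set))
    for src, label, dst in edges:
        out[src][label].add(dst)

    start = frozenset({root})
    guide = {}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if state in guide:          # already expanded (shared sub-structure)
            continue
        moves = defaultdict(set)
        for obj in state:
            for label, dsts in out.get(obj, {}).items():
                moves[label] |= dsts
        guide[state] = {label: frozenset(dsts) for label, dsts in moves.items()}
        queue.extend(guide[state].values())
    return start, guide

# Hypothetical source database: (source object, label, target object).
EDGES = [
    (1, "Restaurant", 2), (1, "Restaurant", 3),
    (2, "Name", 4), (2, "Entrée", 5),
    (3, "Entrée", 6), (3, "Entrée", 7), (3, "Name", 8),
]

START, GUIDE = build_dataguide(EDGES, 1)
for node, children in GUIDE.items():
    print(sorted(node), {label: sorted(child) for label, child in children.items()})
```

Seven source objects below the root collapse into three DataGuide nodes here, one per distinct label path. On a graph with heavy sharing the number of distinct object sets can grow exponentially, which is the worst case the slide mentions.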

  16. Example • Which one is the minimal DataGuide? • Add a child object to object 10 via a new edge labeled E. What happens to the DataGuides?

  17. Strong DataGuides • Store the set of objects reachable by the label path to facilitate querying (annotations) • Nothing in the definition of a DataGuide prevents multiple label paths from reaching the same object in a DataGuide. • We want a one-to-one correspondence between source target sets and DataGuide objects • Solution: Strong DataGuides!

  18. Modifying a database • DataGuide must be consistent with the source database • Extensive sharing may cause many sub-DataGuides to be recomputed after an update

  19. Query formulation • We can store sample values in a DataGuide’s nodes

  20. Query optimization • A strong DataGuide can serve as a path index • In time proportional to the length of a label path, we can use the DataGuide to find all source objects reachable via that path, independent of the size of the source • Cost model • every object examination has the same (uniform) cost • examining an object yields its parents at no additional cost
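
The path-index behavior can be sketched as a walk that follows one DataGuide edge per label and then returns the annotation stored at the node it reaches. The nested-dictionary layout, the object ids, and the helper name `lookup` are assumptions of this sketch.

```python
# Hypothetical strong DataGuide, hand-built for illustration. Each node keeps
# its child edges and an annotation: the target set of the path leading to it.
GUIDE = {
    "targets": {"root"},
    "children": {
        "DBG": {
            "targets": {"dbg1"},
            "children": {
                "GroupMember": {
                    "targets": {"gm1", "gm2"},
                    "children": {
                        "Publication": {
                            "targets": {"pub1", "pub2", "pub3"},
                            "children": {},
                        },
                    },
                },
            },
        },
    },
}

def lookup(guide, label_path):
    """Follow one DataGuide edge per label, then return the stored target set."""
    node = guide
    for label in label_path.split("."):
        node = node["children"].get(label)
        if node is None:            # the path does not occur in the source
            return set()
    return node["targets"]

print(sorted(lookup(GUIDE, "DBG.GroupMember.Publication")))
```

The loop runs once per label in the path, so the work depends on the path length and not on how many source objects the target set contains.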

  21. Example • Select DBG.GroupMember.Publication.Word • one DBG object containing 10,000 GroupMembers • each GroupMember has an average of 100 Publications • but only one Word subobject exists in the entire database • “traditional” querying examines 1,000,000 objects! • using DataGuides we examine only the root, 4 labels and a pointer to the target set
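
The figures on this slide can be checked with a few lines of arithmetic. The sketch assumes a top-down traversal that examines the DBG object, every GroupMember, and every Publication; under that assumption the slide's 1,000,000 is the Publication count alone, and the full total is slightly higher.

```python
# Numbers taken from the slide.
group_members = 10_000
publications_per_member = 100

# Top-down traversal (assumed): examine DBG, each GroupMember, each Publication.
publications = group_members * publications_per_member
top_down_examined = 1 + group_members + publications

# DataGuide lookup as described on the slide: root, one step per label,
# then one pointer to the target set.
path_labels = "DBG.GroupMember.Publication.Word".split(".")
dataguide_steps = 1 + len(path_labels) + 1

print(publications, top_down_examined, dataguide_steps)
```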

  22. Another example • Lore includes a B-tree based value index (Vindex), but it is based only on the last label in a label path to an object • The Vindex on its own is not very helpful. • For example, a Year object may also lie along the path DBG.Project.Publication.Year • We have to check the path of all returned objects in a bottom-up way • We can still do the “stupid” top-down querying • But using the DataGuide we just make an intersection between two sets (which ones?)

  23. One more example... • Using the Vindex we can identify the candidate Year objects • Intersect them with the appropriate DataGuide (which one?) • Find their parents • Intersect them with the appropriate DataGuide (which one?) • Conclusion: direct access to target sets prevents the search space from growing needlessly large
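
One plausible reading of these steps is sketched below. The query itself is not preserved in the transcript, so the sketch assumes a lookup of group-member publications by year; the index contents, object ids, parent links, and the choice of which target sets to intersect with are all invented for illustration.

```python
# Hypothetical value index on the label Year: value -> objects holding it.
VINDEX_YEAR = {1997: {"y1", "y2", "y3"}, 1998: {"y4"}}

# Hypothetical target sets, as a strong DataGuide would store them.
TARGET_SETS = {
    "DBG.GroupMember.Publication.Year": {"y1", "y2", "y4"},
    "DBG.GroupMember.Publication": {"pub1", "pub2", "pub4"},
    "DBG.Project.Publication.Year": {"y3"},
}

# Hypothetical parent links; the cost model treats these as free to obtain.
PARENTS = {"y1": {"pub1"}, "y2": {"pub2"}, "y3": {"pub3"}, "y4": {"pub4"}}

def group_member_publications_in(year):
    """Vindex candidates, filtered bottom-up through two target-set intersections."""
    candidates = VINDEX_YEAR.get(year, set())
    years = candidates & TARGET_SETS["DBG.GroupMember.Publication.Year"]
    parents = set().union(*(PARENTS[y] for y in years))
    return parents & TARGET_SETS["DBG.GroupMember.Publication"]

print(sorted(group_member_publications_in(1997)))
```

The Year object y3 matches the value but lies on the DBG.Project path, so the first intersection discards it before any parent is examined.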

  24. Querying using graph schemas • P. Buneman, S. Davidson, M. Fernandez, D. Suciu • “Adding structure to unstructured data” • Query language: UnQL • Example: • UnQL queries are just graph transformations

  25. Question... • Can we describe the query result by a schema Q(S)? • Query optimization • Optimization against views • Trivially satisfied by taking Q(S)={true} • Q(S) should describe precisely the query result • Is there always such a Q(S)? • think of the previous example query

  26. Answer... • No! • Graph schemas cannot precisely describe all query results because they cannot impose equality constraints on edges in the database. • We can partially fix this by extending the notion of graph schema to allow equality constraints between certain values on edges • So we define extended graph schemas & their expansions

  27. Example

  28. Another example

  29. Theorem

  30. Querying using ScmDL schemas • C. Beeri, T. Milo • “Schemas for Integration and Translation of structured and semi-structured data”

  31. Example

  32. Properties of ScmDL schemas • Every graph schema has an equivalent ScmDL schema, but not vice versa. • Virtual versions of ScmDL schemas (VScmDL): ScmDL schemas with internal nodes. • ScmDL schemas correspond to regular grammars • VScmDL schemas correspond to context-free grammars • Intuition: The internal nodes are similar to the internal nodes of derivation trees of context-free grammars

  33. Query result schema • useful, because it prunes/restricts the search space • in the case of ScmDL schemas we’d like to derive schema information for node variables in the query • Find the set of all possible assignments for the variables

  34. Some results for ScmDL querying • If the schema is very loose, it may be the case that each of the variables can be associated with most of the types in the schema • So we pre-compute the possible type assignments for P, given the type of the current node
