Schema information for semistructured data Part II

Yannakopoulos Andreas, CSE department, UCSD

Semistructured Data
• Where?
  • on the Web
• Characteristics
  • large collections of data
  • too varied, irregular & mutable to be modeled in a traditional (relational, OO) approach
• Objective
  • identify some underlying structure
  • complete description not always expected (e.g. type hierarchy)
• Why?
  • formulate queries (we can’t query without a schema!)
  • query optimization & decomposition
Schema Information for semistructured data, Yannakopoulos Andreas

Another approach: type hierarchy
• S. Nestorov, S. Abiteboul, R. Motwani
  • “Inferring Structure in Semistructured Data”
• Main points
  • The schema is not given a priori
  • The derived schema is not a faithful representation of the data set, because such a representation:
    • is very complex and difficult to use
    • cannot be computed quickly if done at query time
  • Extract a “reasonably small approximation” of the typing, in the form of a type hierarchy
  • Heuristic rules for assigning types to elements

Data Model: OEM
• rooted, directed, labeled graph
• the vertices, not the labels, are the objects!

Algorithm
• Idea: discover the types using the relative importance of attributes
• Steps
  • Identify candidate types
  • Select types and subtypes from the candidates and organize them into a type hierarchy
  • Derive the typing rules
  • Validate or type-check the type hierarchy against the data
• Definitions
  • D: the data set; o: an object in D; S: a set of labels
  • attributes(o): the set of labels on the outgoing edges of o
  • roles(o): the set of labels on the incoming edges of o
  • at(S): number of objects o in D such that S = attributes(o)
  • above(S): number of objects o in D such that S ⊂ attributes(o)
  • jump(S) = at(S)/above(S) [relative importance]
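The definitions of at, above and jump can be sketched in a few lines of Python. This is an illustration only, not the authors' implementation. It assumes that above(S) counts objects whose attribute set is a proper superset of S, and it evaluates only the label sets that actually occur in the data rather than the full counting lattice. The function name `jump_table` and the sample objects are hypothetical.

```python
from collections import Counter

def jump_table(attribute_sets):
    """Map each observed label set S to (at(S), above(S), jump(S))."""
    # at(S): how many objects have exactly S as their attribute set.
    at = Counter(frozenset(a) for a in attribute_sets)
    table = {}
    for s in at:
        # above(S): objects whose attribute set strictly contains S (assumption).
        above = sum(n for t, n in at.items() if s < t)
        # jump(S) = at(S) / above(S); treated as infinite when nothing lies above S.
        jump = at[s] / above if above else float("inf")
        table[s] = (at[s], above, jump)
    return table

# Hypothetical data: attributes(o) for five objects.
objects = [
    {"name"},
    {"name", "address"},
    {"name", "address"},
    {"name", "address"},
    {"name", "address", "phone"},
]
table = jump_table(objects)
```

In this toy data set {name, address} has the highest finite jump (3 objects at it, 1 above it), which is the kind of label set the algorithm would favor as a candidate type.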

Example
• counting lattice L with an alphabet consisting of all distinct labels in D
• count attributes(o) for all objects o in D

Rules for the candidate types
• Select all sets of labels S such that jump(S) ≥ θ (threshold)

Rules for the type hierarchy
• Primary role: the label occurring most frequently in roles(o)

More rules...
• Typing rules
• Note: a given object may be assigned to more than one type. In many real-life situations, objects do belong to more than one type. For instance, an Employee object is also a Person.

Evaluation of the typing algorithm
• Type size (e.g. the number of classes)
• Correctness or accuracy of the typing
  • Example: θ = 0, perfect typing: a separate type for each slight variation of the object structure
  • Example: θ high, low accuracy
• Conclusion
  • the algorithm is sensitive to the jump threshold
  • some objects may remain untyped
  • other objects may be assigned an inexact type

Modifying & Querying using DataGuide schemas
• R. Goldman, J. Widom
  • “DataGuides: Enabling Query Formulation and Optimization in Semistructured Databases”
• S. Nestorov, J. Ullman, J. Wiener, S. Chawathe
  • “Representative Objects: Concise Representations of Semistructured Hierarchical Data”
• Main points
  • They play the same role as traditional metadata. However, they are not metadata!
    • dynamically generated
    • conform to the data, rather than forcing the data to conform (compare with graph schemas)
  • Used as a path index in querying

Data Model: OEM
• rooted, directed, labeled graph

Label paths, target sets and querying
• Queries are based on label paths (Lorel: Lore language)
• A target set is the set of all objects that can be reached by traversing a given label path.
• What is the target set of Restaurant.Entrée?
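A target set can be computed by following the label path one label at a time from the root. The sketch below assumes a simple adjacency-list encoding of an OEM graph. The object ids, the edges and the function name `target_set` are hypothetical, since the figure from the slide is not available.

```python
def target_set(graph, root, label_path):
    """Return the set of objects reachable from root via the dotted label path."""
    current = {root}
    for label in label_path.split("."):
        # Follow every outgoing edge carrying this label from every current object.
        current = {
            child
            for obj in current
            for edge_label, child in graph.get(obj, [])
            if edge_label == label
        }
    return current

# Hypothetical OEM graph: object id -> list of (label, child id) edges.
graph = {
    1: [("Restaurant", 2), ("Restaurant", 3)],
    2: [("Name", 4), ("Entrée", 5), ("Entrée", 6)],
    3: [("Name", 7), ("Entrée", 6)],
}
entrees = target_set(graph, 1, "Restaurant.Entrée")
```

Object 6 is shared by both restaurants, yet it appears only once in the result because a target set is a set of objects, not a list of paths.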

DataGuides

Theory behind DataGuides
• Creating a DataGuide over a source database is equivalent to converting an NDFA to a DFA.
  • If the source database is a tree, this conversion takes linear time.
  • In the worst case, conversion of a graph-structured database may require time and space exponential in the number of objects and edges in the source.
  • “Solution”: Build a DataGuide over only the first levels.
• A single NDFA may have many equivalent DFAs.
  • The minimal one can be found using state minimization algorithms.
  • But is the minimal one always the best?

Example
• Which one is the minimal DataGuide?
• Add an object to object 10, via an edge labeled E. What happens to the DataGuides?

Strong DataGuides
• Store the set of objects reachable by the label path to facilitate querying (annotations)
• Nothing in the definition of a DataGuide prevents multiple label paths from reaching the same object in a DataGuide.
• We want a one-to-one correspondence between source target sets and DataGuide objects
• Solution: Strong DataGuides!
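One way to picture a strong DataGuide is as the result of a subset construction in which every DataGuide node is identified with a target set of the source. The following is a minimal sketch under that reading, not the Lore implementation. The graph encoding and the names `build_strong_dataguide` and `lookup` are assumptions made for illustration.

```python
def build_strong_dataguide(graph, root):
    """Subset construction: each DataGuide node is a frozenset of source objects."""
    start = frozenset({root})
    guide = {}          # DataGuide node -> {label: DataGuide node}
    stack = [start]
    while stack:
        node = stack.pop()
        if node in guide:
            continue    # this target set already has a DataGuide node
        by_label = {}
        for obj in node:
            for label, child in graph.get(obj, []):
                by_label.setdefault(label, set()).add(child)
        guide[node] = {lbl: frozenset(kids) for lbl, kids in by_label.items()}
        stack.extend(guide[node].values())
    return start, guide

def lookup(start, guide, label_path):
    """Walk the DataGuide; the node reached is itself the target set (annotation)."""
    node = start
    for label in label_path.split("."):
        node = guide.get(node, {}).get(label)
        if node is None:
            return frozenset()
    return node

# Hypothetical OEM graph: object id -> list of (label, child id) edges.
graph = {
    1: [("Restaurant", 2), ("Restaurant", 3)],
    2: [("Name", 4), ("Entree", 5), ("Entree", 6)],
    3: [("Name", 7), ("Entree", 6)],
}
start, guide = build_strong_dataguide(graph, 1)
```

Because nodes are keyed by their target sets, two label paths with the same target set share one DataGuide node, and a lookup takes one step per label regardless of how many source objects exist.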

Modifying a database
• The DataGuide must be consistent with the source database
• Extensive sharing may cause many sub-DataGuides to be recomputed after an update

Query formulation
• We can store sample values in a DataGuide’s nodes

Query optimization
• A strong DataGuide can serve as a path index
  • In time proportional to the length of a label path, we can use the DataGuide to find all source objects reachable via that path, independent of the size of the source
• Cost model
  • uniform cost for every object examination
  • examining an object yields its parents at no additional cost

Example
• Select DBG.GroupMember.Publication.Word
  • one DBG object containing 10,000 GroupMembers
  • each GroupMember has an average of 100 Publications
  • but only one Word subobject exists in the entire database
• “traditional” querying examines 1,000,000 objects!
• using DataGuides we examine only the root, 4 labels and a pointer to the target set
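The figures in this example can be checked with a little arithmetic. The snippet below only restates the numbers given on the slide under the uniform cost model. The variable names are hypothetical, and the extra count for the intermediate levels is an assumption about how a top-down scan would be tallied.

```python
group_members = 10_000        # GroupMembers under the single DBG object
pubs_per_member = 100         # average Publications per GroupMember

# Publications that a top-down scan must examine to look for a Word subobject.
publications = group_members * pubs_per_member

# Counting the DBG object and the GroupMembers as well (assumption).
top_down_examined = 1 + group_members + publications

# With a strong DataGuide, the walk is one step per label in the path.
dataguide_steps = len("DBG.GroupMember.Publication.Word".split("."))
```

The 1,000,000 quoted on the slide matches the Publication level alone; either way the top-down cost grows with the size of the source, while the DataGuide walk depends only on the path length.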

Another example
• Lore includes a B-tree based value index (Vindex), but it is based only on the last label in a label path to an object
• The Vindex on its own is not very helpful
  • For example, a Year object may also lie along the path DBG.Project.Publication.Year
  • We have to check the path of all returned objects in a bottom-up way
• We can still do the “stupid” top-down querying
• But using the DataGuide we just make an intersection between two sets (which ones?)

One more example...
• Using the Vindex we can identify the candidate Year objects
• Intersect them with the appropriate DataGuide (which one?)
• Find their parents
• Intersect them with the appropriate DataGuide (which one?)
• Conclusion: direct access to target sets prevents the search space from growing needlessly large
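The steps above amount to alternating set intersections and parent lookups. The sketch below is one plausible reading, since the slide leaves the choice of target sets as a question: it assumes the Year candidates are intersected with the target set of DBG.GroupMember.Publication.Year and their parents with the target set of DBG.GroupMember.Publication. All object ids and the function name `publications_with_year` are hypothetical.

```python
def publications_with_year(vindex_hits, parents, ts_year, ts_publication):
    """Bottom-up filtering of Vindex hits using DataGuide target sets."""
    # Keep only Year objects that lie on the intended label path.
    years = vindex_hits & ts_year
    # Move one level up: collect the parents of the surviving Year objects.
    candidates = set().union(*(parents.get(y, set()) for y in years))
    # Keep only parents that are in the Publication target set.
    return candidates & ts_publication

# Hypothetical data.
vindex_hits = {20, 21, 22}                   # Year objects matching the value predicate
ts_year = {20, 21}                           # assumed target set of ...Publication.Year
ts_publication = {10, 11}                    # assumed target set of ...Publication
parents = {20: {10}, 21: {11, 30}, 22: {31}}  # child id -> parent ids

result = publications_with_year(vindex_hits, parents, ts_year, ts_publication)
```

Year object 22 is discarded at the first intersection and parent 30 at the second, so neither ever has to be traced back to the root.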

Querying using graph schemas
• P. Buneman, S. Davidson, M. Fernandez, D. Suciu
  • “Adding structure to unstructured data”
• Query language: UnQL
• Example:
• UnQL queries are just graph transformations

Question...
• Can we describe the query result by a schema Q(S)?
  • Query optimization
  • Optimization against views
• Trivially satisfied by taking Q(S)={true}
• Q(S) should describe precisely the query result
• Is there always such a Q(S)?
  • think of the previous example query

Answer...
• No!
• Graph schemas cannot precisely describe all query results because they cannot impose equality constraints on edges in the database.
• We can partially fix this by extending the notion of graph schema to allow equality constraints between certain values on edges
• So we define extended graph schemas & their expansions

Example

Another example

Theorem

Querying using ScmDL schemas
• C. Beeri, T. Milo
  • “Schemas for Integration and Translation of structured and semi-structured data”

Example

Properties of ScmDL schemas
• Every graph schema has an equivalent ScmDL schema, but not vice versa.
• Virtual versions of ScmDL schemas (VScmDL): ScmDL schemas with internal nodes.
• ScmDL schemas correspond to regular grammars
• VScmDL schemas correspond to context-free grammars
  • Intuition: the internal nodes are similar to the internal nodes of derivation trees of context-free grammars

Query result schema
• useful, because it prunes/restricts the search space
• in the case of ScmDL schemas we’d like to derive schema information for node variables in the query
• Find the set of all possible assignments for the variables

Some results for ScmDL querying
• If the schema is very loose, it may be the case that each of the variables can be associated with most of the types in the schema
• So we pre-compute the possible type assignments for P, given the type of the current node
