This document explores the concept of Normal Form for XML documents, focusing on the implications of functional dependencies and Document Type Definitions (DTDs). Through motivating examples, it illustrates how to enforce integrity constraints within XML by ensuring that two students with the same student number must have identical names. Additionally, it highlights the transformation of functional dependencies into a relational representation, aiming to generalize BCNF. This work provides insights into normalizing XML documents and addresses the complexities of XML data structures.
A Normal Form for XML Documents
Marcelo Arenas, Leonid Libkin
Department of Computer Science, University of Toronto
Motivating Example
[Figure: XML tree of a courses document. The courses “cs100” and “cs225” each contain a student with @sno “123” and name “Fox”, with grades “B+” and “A+”. An info element with @sno “123” and name “Fox” also appears under courses.]
Our Goal
DTD:
  courses → course*, info*
  course → @cno, student*
  student → @sno, name, grade
  name → S
  grade → S
  info → @sno, name
Integrity Constraints:
• Two students with the same @sno value must have the same name.
• @sno is the key of info.
A “Non-relational” Example
[Figure: DBLP tree. DBLP has conf children; the conf with title “ICDT” has two issue children. One issue contains articles by “Dong” and “Jarke”, both with @year “1999”; the other contains an article with @year “2001”. The figure also shows @year “1999” and “2001” attached to the issue elements.]
Problems to Address
• Functional dependencies for XML.
• Normal form for XML documents (XNF).
  • Generalizes BCNF.
• Algorithm for normalizing XML documents.
• Implication problem for functional dependencies.
Framework: DTDs
• We do not consider mixed content, IDs and IDREFs.
• Paths(D): all paths in a DTD D, for example:
  courses.course
  courses.course.@cno
  courses.course.student.name
  courses.course.student.name.S
• We distinguish three kinds of elements: attributes (@), strings (S) and element types.
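Framework: XML Trees
[Figure: XML tree. The root courses (v0) has a course child (v1) with @cno “cs100” and student children v2 and v5; a second course is elided. Student v2 has @sno “123”, name (v3, S “Fox”) and grade (v4, S “B+”). Student v5 has @sno “456”, name (v6, S “Smith”) and grade (v7, S “A-”).]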
Towards FDs for XML
• We know how to handle FDs in relational DBs.
• We need a relational representation of XML.
• We do it by considering tree tuples: a tree tuple in a DTD D is a mapping
  t : Paths(D) → Vertices ∪ Strings ∪ {⊥}
  consistent with D, that is, it could be extracted from an XML tree conforming to D.
Tree Tuples
A tree tuple represents an XML tree:
  t(courses) = v0
  t(courses.course) = v1
  t(courses.course.@cno) = “cs100”
  t(courses.course.student) = v2
  t(p) = ⊥, for the remaining paths
[Figure: the corresponding tree: courses (v0) with child course (v1), which has @cno “cs100” and child student (v2).]
XML Tree: set of Tree Tuples
[Figure: the XML tree with course v1 (@cno “cs100”) and students v2 (@sno “123”, name “Fox”, grade “B+”) and v5 (@sno “456”, name “Smith”, grade “A-”), drawn twice to show the tree tuples it contains.]
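The view of an XML tree as a set of tree tuples can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' construction: the node encoding `(vertex id, label, attributes, children)` and the helper names `node` and `tree_tuples` are hypothetical, a tuple is a dict from paths to values, and a path missing from the dict stands for ⊥. Each tuple picks one child per element label, so a course with two students yields two tuples.

```python
from itertools import product


def node(vid, label, attrs=None, children=()):
    """Hypothetical node encoding; children is a list of nodes or a string."""
    return (vid, label, attrs or {}, children)


def tree_tuples(n, prefix=None):
    """Yield the maximal tree tuples of the subtree rooted at n."""
    vid, label, attrs, children = n
    path = label if prefix is None else prefix + "." + label
    base = {path: vid}
    for name, value in attrs.items():
        base[path + ".@" + name] = value
    if isinstance(children, str):
        base[path + ".S"] = children  # string content
        yield base
        return
    # Group children by label; a tuple picks exactly one child per label.
    groups = {}
    for child in children:
        groups.setdefault(child[1], []).append(child)
    options = [
        [t for c in group for t in tree_tuples(c, path)]
        for group in groups.values()
    ]
    for combo in product(*options):
        merged = dict(base)
        for part in combo:
            merged.update(part)
        yield merged


# The example tree from the XML Trees slide (second course omitted).
T = node("v0", "courses", children=[
    node("v1", "course", {"cno": "cs100"}, [
        node("v2", "student", {"sno": "123"}, [
            node("v3", "name", children="Fox"),
            node("v4", "grade", children="B+"),
        ]),
        node("v5", "student", {"sno": "456"}, [
            node("v6", "name", children="Smith"),
            node("v7", "grade", children="A-"),
        ]),
    ]),
])

TUPLES = list(tree_tuples(T))
```

Here `TUPLES` holds two dicts, one through student v2 and one through student v5; both agree on `courses` and `courses.course`.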
Functional Dependencies
• Expressions of the form X → Y defined over a DTD D, where X, Y are finite non-empty subsets of Paths(D).
• An XML tree T can be tested for satisfaction of X → Y if X ∪ Y ⊆ Paths(T) ⊆ Paths(D).
• T ⊨ X → Y if for every pair u, v of tree tuples in T: u.X = v.X and u.X ≠ ⊥ implies u.Y = v.Y.
FD: Examples
• University DTD:
  courses → course*
  course → @cno, student*
  student → @sno, name, grade
• Two students with the same @sno value must have the same name:
  courses.course.student.@sno → courses.course.student.name.S
• Every student can have at most one grade in every course:
  { courses.course, courses.course.student.@sno } → courses.course.student.grade.S
Checking FD Satisfaction
  { courses.course, courses.course.student.@sno } → courses.course.student.grade.S
[Figure: XML tree with two courses, v1 (@cno “cs100”) and v5 (@cno “cs225”). Each has one student with @sno “123” and name “Fox”: v2 with grade “B+” and v6 with grade “A+”. The two tree tuples differ on courses.course, so the FD is satisfied.]
Checking FD Satisfaction
  { courses.course, courses.course.student.@sno } → courses.course.student.grade.S
[Figure: XML tree with a single course v1 (@cno “cs100”) that has two students, v2 and v5, both with @sno “123” and name “Fox”, with grades “B+” and “A+”. The two tree tuples agree on courses.course and @sno but differ on the grade, so the FD is violated.]
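The satisfaction test on the two trees above can be sketched in Python. This is a minimal illustration: tree tuples are written directly as dicts from paths to values, with missing paths standing for ⊥, and the helper names `satisfies` and `student_tuple` are hypothetical. For brevity the tuples omit the vertex entries for the name and grade elements.

```python
BOTTOM = None  # stands for ⊥


def satisfies(tuples, lhs, rhs):
    """Check X -> Y: tuples agreeing on a non-⊥ X must agree on Y."""
    for u in tuples:
        for v in tuples:
            ux = [u.get(p, BOTTOM) for p in lhs]
            vx = [v.get(p, BOTTOM) for p in lhs]
            if BOTTOM in ux or ux != vx:
                continue  # the FD says nothing about this pair
            if any(u.get(p, BOTTOM) != v.get(p, BOTTOM) for p in rhs):
                return False
    return True


def student_tuple(course, cno, student, sno, name, grade):
    """Build one (abbreviated) tree tuple for the University DTD."""
    c = "courses.course"
    s = c + ".student"
    return {
        "courses": "v0", c: course, c + ".@cno": cno, s: student,
        s + ".@sno": sno, s + ".name.S": name, s + ".grade.S": grade,
    }


# First slide: student 123 takes two different courses.
two_courses = [
    student_tuple("v1", "cs100", "v2", "123", "Fox", "B+"),
    student_tuple("v5", "cs225", "v6", "123", "Fox", "A+"),
]
# Second slide: student 123 appears twice in the same course.
one_course = [
    student_tuple("v1", "cs100", "v2", "123", "Fox", "B+"),
    student_tuple("v1", "cs100", "v5", "123", "Fox", "A+"),
]

GRADE_LHS = ["courses.course", "courses.course.student.@sno"]
GRADE_RHS = ["courses.course.student.grade.S"]
NAME_LHS = ["courses.course.student.@sno"]
NAME_RHS = ["courses.course.student.name.S"]

ok_two = satisfies(two_courses, GRADE_LHS, GRADE_RHS)
ok_one = satisfies(one_course, GRADE_LHS, GRADE_RHS)
```

With this data `ok_two` is true and `ok_one` is false, while the name dependency holds in both trees because every occurrence of @sno “123” carries the name “Fox”.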
Implication Problem for FD
• Given a DTD D and a set of functional dependencies Σ ∪ {φ}: (D, Σ) ⊢ φ if for any XML tree T conforming to D and satisfying Σ, it is the case that T ⊨ φ.
• (D, Σ)+ = { φ | (D, Σ) ⊢ φ }
• A functional dependency φ is trivial if it is implied by the DTD alone: (D, ∅) ⊢ φ.
XNF: XML Normal Form
• XML specification: a DTD D and a set of functional dependencies Σ.
• A relational DB is in BCNF if for every non-trivial functional dependency X → Y in the specification, X is a key.
• (D, Σ) is in XNF if: for each non-trivial FD X → p.@l or X → p.S in (D, Σ)+, X → p is in (D, Σ)+.
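For comparison, the relational BCNF condition quoted above can be checked with the standard attribute-closure computation. The sketch below covers only the relational case, not XNF itself, and the flat schema `cno, sno, name, grade` is a hypothetical relational coding of the university example, assumed here for illustration.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under a list of (lhs, rhs) FDs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def is_bcnf(schema, fds):
    """True if every non-trivial given FD has a key as its left-hand side."""
    for lhs, rhs in fds:
        if rhs <= lhs:
            continue  # trivial FD
        if closure(lhs, fds) != schema:
            return False  # lhs does not determine every attribute
    return True


# Hypothetical flat coding of the university example.
schema = frozenset({"cno", "sno", "name", "grade"})
fds = [
    (frozenset({"sno"}), frozenset({"name"})),
    (frozenset({"cno", "sno"}), frozenset({"grade"})),
]
flat_ok = is_bcnf(schema, fds)

# After moving names into a separate relation info(sno, name).
info_ok = is_bcnf(frozenset({"sno", "name"}), [fds[0]])
```

In the flat schema `sno → name` is non-trivial and `sno` is not a key, so `flat_ok` is false; the separate `info` relation passes the test, mirroring the info element introduced in the XML example.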
Back to DBLP
• DBLP is not in XNF:
  DBLP.conf.issue → DBLP.conf.issue.article.@year ∈ (D, Σ)+
  DBLP.conf.issue → DBLP.conf.issue.article ∉ (D, Σ)+
• Proposed solution is in XNF.
Normalization Algorithm
The algorithm applies two transformations until the schema is in XNF.
• If there is an anomalous FD of the form
  DBLP.conf.issue → DBLP.conf.issue.article.@year
  then apply the “DBLP example rule”.
• Otherwise: choose a minimal anomalous FD and apply the “University example rule”.
Normalizing XML Documents
• Theorem: The decomposition algorithm terminates and outputs a specification in XNF.
• It does not lose information:
  [Figure: the unnormalized XML document and the normalized XML document, related by queries Q1 and Q2.]
  Q1, Q2 are XQuery core queries.
• It works even if we cannot compute (D, Σ)+. If we know (D, Σ)+, the output is better.
Implication Problem (cont’d)
• Typically, regular expressions used in DTDs are rather simple.
• Trivial regular expression: s1, s2, …, sn
  • Each si is one of ai, ai?, ai+ or ai*
  • For i ≠ j, ai ≠ aj
• D is a simple DTD if all its productions use “permutations” of trivial regular expressions. Example: (a | b)* is a permutation of a*, b*
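The definition of a trivial regular expression can be checked mechanically. The following Python sketch assumes productions are written as comma-separated strings and that symbols are plain names, optionally prefixed with `@`; both the input format and the function name `is_trivial` are assumptions for illustration. It tests only triviality itself, not the broader “permutation” condition used for simple DTDs.

```python
import re


def is_trivial(expr):
    """True if expr is s1, ..., sn with each si one of a, a?, a+, a*
    and no symbol used twice."""
    symbols = []
    for part in expr.split(","):
        m = re.fullmatch(r"(@?\w+)([?+*]?)", part.strip())
        if m is None:
            return False  # not of the form a, a?, a+ or a*
        symbols.append(m.group(1))
    # For i != j the symbols must differ.
    return len(symbols) == len(set(symbols))


checks = {
    "a*, b*": is_trivial("a*, b*"),
    "@cno, student*": is_trivial("@cno, student*"),
    "a, a*": is_trivial("a, a*"),
    "(a | b)*": is_trivial("(a | b)*"),
}
```

Note that `(a | b)*` is rejected by this check: per the slide it is a permutation of the trivial expression `a*, b*`, not a trivial expression itself.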
Implication Problem (cont’d)
• Theorem: For simple DTDs:
  • The implication problem for FDs is solvable in O(n²).
  • Testing if a specification is in XNF can be done in O(n³).
• Other results:
  • There is a larger class of DTDs for which these problems are tractable.
  • There is an even larger class of DTDs for which they are coNP-complete.
Final Comments
• The normalization algorithm can be improved in various ways.
• Why XNF?
  • Theorem: XNF generalizes BCNF and a normal form for nested relations (called NNF) when those are coded as XML documents.
• A complete classification of the complexity of the implication problem for FDs remains open.
• Implementation.
Normalization Algorithm
• We consider FDs of the form
  { q, p1.@l1, …, pn.@ln } → p
  where n ≥ 0 and q ends with an element type.
• The algorithm applies two transformations until the schema is in XNF.
• If there is an anomalous FD with n = 0, such as
  DBLP.conf.issue → DBLP.conf.issue.article.@year
  then apply the “DBLP example rule”.
• Otherwise: choose a minimal anomalous FD and apply the “University example rule”.