An Information-Theoretic Approach to Normal Forms for Relational and XML Data

Marcelo Arenas and Leonid Libkin, University of Toronto

Motivation
• What is a good database design?
• Well-known solutions: BCNF, 4NF, …
• But what is it that makes a database design good?
• Elimination of update anomalies.
• Existence of algorithms that produce good designs: lossless decomposition, dependency preservation.
• Previous work was specific to the relational model.
• Classical problems have to be revisited in the XML context.

Motivation
• Problematic to evaluate XML normal forms.
• No XML update language has been standardized.
• No XML query language yet has the same “yardstick” status as relational algebra.
• We do not even know if implication of XML FDs is decidable!
• We need a different approach.
• It must be based on some intrinsic characteristics of the data.
• It must be applicable to new data models.
• It must be independent of query/update/constraint issues.
• Our approach is based on information theory.

Outline
• Information theory.
• A simple information-theoretic measure.
• A general information-theoretic measure.
• Definition of being well-designed.
• Relational databases.
• XML databases.

Information Theory
• Entropy measures the amount of information provided by a certain event.
• Assume that an event can have n different outcomes with probabilities p1, …, pn.
• Entropy: p1 log(1/p1) + … + pn log(1/pn), with logarithms taken base 2.
• Entropy is maximal if each pi = 1/n, in which case it equals log n.
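As a small illustration of the definition above, the following Python sketch computes the entropy of a distribution and compares a uniform distribution with a skewed one. The function name `entropy` and the two sample distributions are illustrative choices, not taken from the slides; logarithms are base 2, which matches the value log 5 ≈ 2.322 used later in the talk.

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits; outcomes with probability 0 contribute nothing."""
    return sum(p * log2(1 / p) for p in probs if p > 0)

uniform = [1 / 4] * 4            # n = 4 equally likely outcomes
skewed = [0.7, 0.1, 0.1, 0.1]    # same n, but one outcome dominates

print(entropy(uniform))  # 2.0, i.e. log2(4)
print(entropy(skewed))   # strictly smaller than 2.0
```

The uniform case reaches log n, and any other distribution over the same n outcomes stays below it.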

Entropy and Redundancies
• Database schema: R(A,B,C), A → B
• Instance I: [instance table not preserved in the transcript]
• For one position of I:
• Pick a domain properly containing adom(I): {1, …, 6}
• Probability distribution: P(4) = 0 and P(a) = 1/5, a ≠ 4
• Entropy: log 5 ≈ 2.322
• For another position of I:
• Pick a domain properly containing adom(I): {1, …, 6}
• Probability distribution: P(2) = 1 and P(a) = 0, a ≠ 2
• Entropy: log 1 = 0
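The two distributions above can be reproduced by brute force. The sketch below is only an illustration under stated assumptions: the transcript does not show the instance, so the two tuples (1,2,3) and (1,2,4) are a hypothetical choice that happens to yield exactly the quoted probabilities, and the requirement that tuples stay distinct is likewise an assumption. All function names are illustrative.

```python
from math import log2

# ASSUMPTION: the instance is not in the transcript; these two tuples are a
# hypothetical choice that reproduces the distributions quoted on the slide.
INSTANCE = [(1, 2, 3), (1, 2, 4)]
DOMAIN = range(1, 7)  # {1, ..., 6}, properly contains adom(I) = {1, 2, 3, 4}

def is_valid(rows):
    """Rows must stay distinct (assumption) and satisfy the FD A -> B."""
    if len(set(rows)) != len(rows):
        return False
    b_for_a = {}
    for a, b, _c in rows:
        if b_for_a.setdefault(a, b) != b:
            return False
    return True

def allowed_values(row, col):
    """Values that can replace position (row, col) while all others stay fixed."""
    result = []
    for v in DOMAIN:
        rows = [list(t) for t in INSTANCE]
        rows[row][col] = v
        if is_valid([tuple(t) for t in rows]):
            result.append(v)
    return result

def simple_entropy(row, col):
    """Uniform distribution over the allowed values, so entropy = log2(count)."""
    return log2(len(allowed_values(row, col)))

print(allowed_values(0, 2), simple_entropy(0, 2))  # C of the first tuple: 5 values
print(allowed_values(0, 1), simple_entropy(0, 1))  # B of the first tuple: only 2
```

Under these assumptions the C position admits every value except 4 (entropy log 5), while the B position is forced to 2 by the FD (entropy 0).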

Entropy and Normal Forms
• Let Σ be a set of FDs over a schema S.
Theorem: (S, Σ) is in BCNF if and only if for every instance I of (S, Σ) and for every domain properly containing adom(I), each position carries a non-zero amount of information (entropy > 0).
• A similar result holds for 4NF and MVDs.
• This is a clean characterization of BCNF and 4NF, but the measure is not accurate enough ...

Problems with the Measure
• The measure cannot distinguish between different types of data dependencies.
• It cannot distinguish between different instances of the same schema: for R(A,B,C), A → B, the two example instances on the slide both yield entropy = 0.

A General Measure
Instance I of schema R(A,B,C), A → B: [instance table not preserved in the transcript]
• Initial setting: pick a position p ∈ Pos(I) and pick k such that adom(I) ⊆ {1, …, k}. For example, k = 7.
• Computation: for every X ⊆ Pos(I) – {p}, compute the probability distribution P(a | X), a ∈ {1, …, k}.
• For one such X: P(2 | X) = 48 / (48 + 6 × 42) = 0.16, and for a ≠ 2, P(a | X) = 42 / (48 + 6 × 42) = 0.14.
• Entropy ≈ 2.8057 (log 7 ≈ 2.8073).
• Value: we consider the average over all sets X ⊆ Pos(I) – {p}.
• Average: 2.4558 < log 7 (maximal entropy).
• It corresponds to conditional entropy.
• It depends on the value of k ...
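The computation on this slide can be replayed by exhaustive enumeration. The following Python sketch is a reconstruction under assumptions, not the authors' code: the transcript omits the instance, so the tuples (1,2,3) and (1,2,4), the choice of p as the B value of the first tuple, the particular set X, and the rule that tuples must remain distinct are all hypothetical, chosen because together they reproduce the counts 48 and 42 quoted above. All function names are illustrative.

```python
from itertools import combinations, product
from math import log2

# ASSUMPTION: the transcript omits the instance. These tuples and the choice
# of p are hypothetical, picked because they reproduce the slide's numbers.
ROWS = [(1, 2, 3), (1, 2, 4)]
POSITIONS = [(r, c) for r in range(len(ROWS)) for c in range(3)]
P = (0, 1)  # position p: attribute B of the first tuple

def is_valid(rows):
    """Rows must stay distinct (assumption) and satisfy the FD A -> B."""
    if len(set(rows)) != len(rows):
        return False
    b_for_a = {}
    for a, b, _c in rows:
        if b_for_a.setdefault(a, b) != b:
            return False
    return True

def counts(fixed, k):
    """For each a in {1..k}: number of ways to fill the positions outside
    `fixed` (other than p) so that the instance with p = a is valid."""
    free = [q for q in POSITIONS if q != P and q not in fixed]
    result = {}
    for a in range(1, k + 1):
        n = 0
        for values in product(range(1, k + 1), repeat=len(free)):
            grid = [list(t) for t in ROWS]
            grid[P[0]][P[1]] = a
            for (r, c), v in zip(free, values):
                grid[r][c] = v
            n += is_valid([tuple(t) for t in grid])
        result[a] = n
    return result

def entropy_given(fixed, k):
    """Entropy of the distribution P(a | X) induced by the counts."""
    c = counts(fixed, k)
    total = sum(c.values())
    return sum((n / total) * log2(total / n) for n in c.values() if n > 0)

def average_entropy(k):
    """Average of entropy_given over all X that are subsets of Pos(I) - {p}."""
    others = [q for q in POSITIONS if q != P]
    subsets = [set(s) for size in range(len(others) + 1)
               for s in combinations(others, size)]
    return sum(entropy_given(x, k) for x in subsets) / len(subsets)

X = {(0, 0), (0, 2), (1, 1)}  # keep A and C of tuple 1 and B of tuple 2
c = counts(X, 7)
print(c[2], c[5])                     # 48 42
print(round(entropy_given(X, 7), 4))  # 2.8057
print(round(average_entropy(7), 4))   # 2.4558
```

Enumeration is exponential in the number of positions, so this only serves to check small examples such as the one on the slide.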

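A General Measure
• Previous value: the average entropy of position p computed for a given k.
• For each k, we consider the ratio of this value to log k, the maximal entropy.
• How close the given position p is to having the maximum possible information content.
• General measure: the limit of this ratio as k → ∞.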

Basic Properties
• The measure is well defined: for every set of first-order constraints Σ defined over a schema S, every I ∈ inst(S, Σ), and every p ∈ Pos(I), the limit that defines the measure at p exists.
• Bounds: the measure always lies between 0 and 1.

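Basic Properties
• The measure does not depend on a particular representation of constraints: if Σ1 and Σ2 are equivalent, they assign the same value to every position.
• It overcomes the limitations of the simple measure: for R(A,B,C), A → B, the two example instances now receive different values, 0.875 and 0.781.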

Well-Designed Databases
Definition: A database specification (S, Σ) is well-designed if for every I ∈ inst(S, Σ) and every p ∈ Pos(I), the general measure of p equals 1.
In other words, every position in every instance carries the maximum possible amount of information.
We would like to test this definition in the relational world ...

Relational Databases
Σ is a set of data dependencies over a schema S:
• Σ = ∅: (S, Σ) is well-designed.
• Σ is a set of FDs: (S, Σ) is well-designed if and only if (S, Σ) is in BCNF.
• Σ is a set of FDs and MVDs: (S, Σ) is well-designed if and only if (S, Σ) is in 4NF.
• Σ is a set of FDs and JDs:
• If (S, Σ) is in PJ/NF or in 5NFR, then (S, Σ) is well-designed. The converse is not true.
• A syntactic characterization of being well-designed is given in the paper.
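For the FD case, being well-designed coincides with BCNF, which can be tested with the textbook attribute-closure procedure. The sketch below is that standard test, not something taken from the talk; the names `closure` and `is_bcnf` and the encoding of attributes as single characters are illustrative assumptions. It relies on the usual fact that, for a single relation schema, it suffices to inspect the FDs as given.

```python
def closure(attrs, fds):
    """Attribute closure of `attrs` under FDs given as (lhs, rhs) string pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def is_bcnf(schema, fds):
    """True if every nontrivial FD in `fds` has a superkey on its left-hand side."""
    for lhs, rhs in fds:
        nontrivial = not set(rhs) <= set(lhs)
        if nontrivial and closure(lhs, fds) != set(schema):
            return False
    return True

print(is_bcnf("ABC", [("A", "B")]))   # False: A is not a key of R(A,B,C)
print(is_bcnf("ABC", [("A", "BC")]))  # True: A determines every attribute
```

The first call is the running example R(A,B,C), A → B, which is the schema whose instances carry redundant B values.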

Relational Databases
• The problem of verifying whether a relational schema is well-designed is undecidable.
• If the schema contains only universal constraints (FDs, MVDs, JDs, …), then the problem becomes decidable.
Now we would like to apply our definition in the XML world ...

XML Databases
• XML specification: (D, Σ).
• D is a DTD.
• Σ is a set of data dependencies over D.
• We would like to evaluate XML normal forms.
• The notion of being well-designed extends from relations to XML.
• The measure is robust; we just need to define the set of positions in an XML tree T: Pos(T).

Positions in an XML Tree
[Figure: an XML tree T rooted at DBLP, with conf nodes, a title (“ICDT”), issue nodes, and article nodes carrying author (“Dong”, “Jarke”), title (“. . .”), and @year (“1999”, “1999”, “2001”).]

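Well-Designed XML Data
• We consider k such that adom(T) ⊆ {1, …, k}.
• For each k: the average entropy of a position p ∈ Pos(T), computed as in the relational case.
• We consider the ratio of this value to log k.
• General measure: the limit of this ratio as k → ∞.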

XNF: XML Normal Form
• XNF was proposed in [AL02].
• It was defined for XML FDs:
DBLP.conf.@title → DBLP.conf
DBLP.conf.issue → DBLP.conf.issue.article.@year
• It eliminates two types of anomalies.
• One of them is inspired by the type of anomalies found in relational databases containing FDs.

XNF: XML Normal Form
[Figure: the DBLP tree with @year shown both on the article nodes (“1999”, “1999”, “2001”) and on the issue nodes (“1999”, “2001”).]

XNF: XML Normal Form
• For arbitrary XML data dependencies:
Definition: An XML specification (D, Σ) is well-designed if for every T ∈ inst(D, Σ) and every p ∈ Pos(T), the general measure of p equals 1.
• For functional dependencies:
Theorem: An XML specification (D, Σ) is in XNF if and only if (D, Σ) is well-designed.

Normalization Algorithms
• The information-theoretic measure can also be used for reasoning about normalization algorithms.
• For BCNF and XNF decomposition algorithms:
Theorem: After each step of these decomposition algorithms, the amount of information in each position does not decrease.
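To make "each step" concrete for the relational case, here is a minimal sketch of the textbook BCNF decomposition loop. It is a standard formulation and may differ from the exact variant the authors analyze; the function names and the single-character attribute encoding are illustrative assumptions. Each step picks an attribute set X whose closure within the current relation is neither X itself nor the whole relation, and splits the relation into X+ and X together with the remaining attributes.

```python
from itertools import combinations

def closure(attrs, fds):
    """Attribute closure of `attrs` under FDs given as (lhs, rhs) string pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def find_violation(rel, fds):
    """Return (X, X+ within rel) for some X that violates BCNF in rel, else None."""
    for size in range(1, len(rel)):
        for combo in combinations(sorted(rel), size):
            x = frozenset(combo)
            x_plus = frozenset(closure(x, fds) & rel)
            if x_plus != x and x_plus != rel:
                return x, x_plus
    return None

def bcnf_decompose(schema, fds):
    """Split relations until none has a violation; returns a set of attribute sets."""
    todo, done = [frozenset(schema)], set()
    while todo:
        rel = todo.pop()
        violation = find_violation(rel, fds)
        if violation is None:
            done.add(rel)
        else:
            x, x_plus = violation
            todo.append(x_plus)              # attributes determined by X
            todo.append(x | (rel - x_plus))  # X plus everything else
    return done

print(bcnf_decompose("ABC", [("A", "B")]))  # the two relations AB and AC
```

Both pieces produced by a split are strictly smaller than the relation they came from, which is why this loop terminates.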

Future Work
• We would like to consider more complex XML constraints and characterize good designs they give rise to.
• We would like to characterize 3NF by using the measure developed in this paper.
• In general, we would like to characterize “non-perfect” normal forms.
• We would like to develop better characterizations of normalization algorithms using our measure.
• Why is the “usual” BCNF decomposition algorithm good? Why does it always stop?

A Normal Form for FDs and JDs
• Let Σ be a set of FDs and JDs over a schema S.
Theorem: (S, Σ) is well-designed if and only if for every R ∈ S and every nontrivial JD ⋈[X1, …, Xm] implied by Σ, there exists M ⊆ {1, ..., m} such that:
• For every i, j ∈ M, Σ implies [dependency not preserved in the transcript]

A Normal Form for FDs and JDs (cont’d)
Schema: S = { R(A,B,C) } and Σ = { ⋈[AB,AC,BC], AB → C, AC → B }.
• (S, Σ) is not in PJ/NF: {AB → ABC, AC → ABC} does not imply ⋈[AB,AC,BC].
• (S, Σ) is not in 5NFR: ⋈[AB,AC,BC] is strong-reduced and BC is not a superkey.
• (S, Σ) is well-designed.
