
Database Normalization Revisited: An information-theoretic approach





  1. Database Normalization Revisited: An information-theoretic approach Leonid Libkin Joint work with Marcelo Arenas and Solmaz Kolahi

  2. Sources • M. Arenas, L. Libkin. An information-theoretic approach to normal forms for relational and XML data, PODS’03; J. ACM, 2005. • S. Kolahi, L. Libkin. Redundancy vs dependency-preservation in normalisation: an information-theoretic analysis of 3NF, PODS’06.

  3. Outline • Part 1: Database normalization from the 1970s and 1980s • Part 2: Classical theory re-done: new justification for normal forms: • BCNF and relatives (academic, eliminate redundancies) • 3NF (practical, may leave some redundancies) • Part 3: An XML application

  4. If you haven’t taught “Intro to DB” lately… • Design: decide how to represent the information in a particular data model. • Even for simple application domains there are many ways of representing the data of interest. • We have to design the schema of the database: • Set of relations. • Set of attributes for each relation. • Set of data dependencies.

  5. Normalization Theory Today • Normalization theory for relational databases was developed in the 70s and 80s. • Why do we need normalization theory today? • New data models have emerged: XML. • XML documents can contain redundant information. • Redundant information in XML documents: • Can be discovered if the user provides semantic information. • Can be eliminated.

  6. Designing a Database: An Example • Attributes: number, title, section, room. • Data dependency: every course number is associated with only one title. • Relational schema: a BAD alternative (a single table with all four attributes).

  7. Problems with BAD: Redundancies and Update Anomalies

  8. Deletion Anomaly CSC434 is not given in this term.

  10. Deletion Anomaly CSC434 is not given in this term. Additional effect: all the information about CSC434 was deleted.
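
The anomaly is easy to reproduce in a few lines. A minimal Python sketch with hypothetical rows for the BAD single-table design (only CSC434 appears in the deck; the other course, the titles, and the rooms are made up for illustration):

```python
# Hypothetical rows (number, title, section, room) for the BAD design;
# only CSC434 comes from the slides, the rest is illustrative.
bad = [
    ("CSC258", "Computer Organization", 1, "LM161"),
    ("CSC434", "Data Management Systems", 1, "BA1170"),
]

# CSC434 is not given this term, so its only row is deleted ...
bad = [row for row in bad if row[0] != "CSC434"]

# ... and the course's title is lost with it: the deletion anomaly.
titles = {number: title for number, title, _, _ in bad}
print("CSC434" in titles)   # False
```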

  12. Avoiding Update Anomalies The instance does not store redundant information.

  13. Avoiding Update Anomalies CSC434 is not given in this term.

  14. Avoiding Update Anomalies CSC434 is not given in this term. The title of CSC434 is not removed from the instance.

  15. Normalization Theory • Main idea: a normal form defines a condition that a well designed database should satisfy. • Normal form: syntactic condition on the database schema. • Defined for a class of data dependencies. • Main problems: • How to test whether a database schema is in a particular normal form. • How to transform a database schema into an equivalent one satisfying a particular normal form.

  16. BCNF: a Normal Form for FDs • Functional dependency (FD) over R(A1, …, An): X → Y, where X, Y ⊆ {A1, …, An}. • X → Y: two rows with the same X-values must have the same Y-values. • Number → Title in our example. • Key dependency: X → A1 … An • X is a key: two distinct rows must have distinct X-values.
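
Checking whether an instance satisfies an FD X → Y follows the two-rows definition directly. A small Python sketch (the rows and attribute names are a hypothetical instance of the course schema):

```python
def satisfies_fd(rows, lhs, rhs):
    """X -> Y holds if rows agreeing on the X attributes agree on Y."""
    seen = {}
    for row in rows:
        x = tuple(row[a] for a in lhs)
        y = tuple(row[a] for a in rhs)
        if seen.setdefault(x, y) != y:   # same X-values, different Y-values
            return False
    return True

# Hypothetical instance of the course schema:
rows = [
    {"number": "CSC434", "title": "DBMS", "section": 1, "room": "BA1170"},
    {"number": "CSC434", "title": "DBMS", "section": 2, "room": "LM161"},
]
print(satisfies_fd(rows, ["number"], ["title"]))   # True: Number -> Title
print(satisfies_fd(rows, ["number"], ["room"]))    # False
```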

  17. BCNF: a Normal Form for FDs • Σ is a set of FDs over R(A1, …, An). • A relation schema (R(A1, …, An), Σ) is in BCNF if for every nontrivial X → Y in Σ, X is a key. • A relational schema is in BCNF if every relation schema is in BCNF.
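
The BCNF test can be made concrete via attribute closures: X is a key iff its closure is the whole attribute set. A Python sketch (the schema and FD are the running course example; this is not the paper's code):

```python
def closure(attrs, fds):
    """Attribute closure: all attributes determined by attrs under the FDs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def is_bcnf(attributes, fds):
    """BCNF: every nontrivial FD X -> Y has X a key (closure(X) = all attributes)."""
    return all(
        rhs <= lhs or closure(lhs, fds) == attributes
        for lhs, rhs in fds
    )

schema = {"number", "title", "section", "room"}
fds = [(frozenset({"number"}), frozenset({"title"}))]
print(is_bcnf(schema, fds))               # False: number is not a key here
print(is_bcnf({"number", "title"}, fds))  # True: after decomposing, number is a key
```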

  18. BCNF Decomposition • Relation schema: R(X,Y,Z), Σ • Not in BCNF: Σ implies X → Y but not X → A, for every A ∈ Z. • Basic decomposition: replace R(X,Y,Z) by S(X,Y) and T(X,Z). • Example:

  19. Lossless Decomposition ∏number,title(R) and ∏number,section,room(R)

  20. Lossless Decomposition S ⋈ T
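
Losslessness can be checked directly: project BAD onto {number, title} and {number, section, room}, then rejoin on number. A Python sketch with hypothetical rows:

```python
# Hypothetical rows for BAD(number, title, section, room):
bad = [
    ("CSC434", "DBMS", 1, "BA1170"),
    ("CSC434", "DBMS", 2, "LM161"),
]

s = {(n, t) for n, t, _, _ in bad}          # projection on number, title
t = {(n, sec, r) for n, _, sec, r in bad}   # projection on number, section, room

# Natural join of S and T on number recovers the original instance:
joined = {(n1, ti, sec, r)
          for (n1, ti) in s
          for (n2, sec, r) in t
          if n1 == n2}
print(joined == set(bad))   # True
```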

  21. How to justify good designs? • What is a good database design? • Well-known solutions: BCNF, 4NF, 3NF… • But what is it that makes a database design good? • Elimination of update anomalies. • Existence of algorithms that produce good designs: lossless decomposition, dependency preservation.

  22. Problems with traditional approaches • Many papers tried to justify normal forms. • Problem: the justifications are tied very closely to the relational model. • They rely on well-defined notions of queries/updates. • These days we want to deal with other data models, in particular XML, so we need an approach that extends to them.

  23. Justification of Normal Forms • Problematic to evaluate XML normal forms. • No XML update language has been standardized. • No XML query language yet has the same “yardstick” status as relational algebra. • We do not even know if implication of XML FDs is decidable! • We need a different approach. • It must be based on some intrinsic characteristics of the data. • It must be applicable to new data models. • It must be independent of query/update/constraint issues. • Our approach is based on information theory.

  24. Information Theory • Entropy measures the amount of information provided by a certain event. • Assume that an event can have n different outcomes with probabilities p1, …, pn. Entropy: H = − Σi pi log pi. • Entropy is maximal (equal to log n) if each pi = 1/n.

  25. Entropy and Redundancies • Database schema: R(A,B,C), A → B • Instance I • Pick a domain {1, …, 6} properly containing adom(I). • For a non-redundant position: P(4) = 0 and P(a) = 1/5, a ≠ 4. Entropy: log 5 ≈ 2.322. • For a redundant position: P(2) = 1 and P(a) = 0, a ≠ 2. Entropy: log 1 = 0.
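
The two distributions on this slide can be checked numerically. A minimal sketch; the entropy function is the standard Shannon entropy with log base 2:

```python
import math

def entropy(ps):
    """Shannon entropy: sum of -p * log2(p), with 0 * log 0 taken as 0."""
    return sum(-p * math.log2(p) for p in ps if p > 0)

# Non-redundant position over the domain {1, ..., 6}: P(4) = 0 and the
# five remaining values are equally likely.
print(round(entropy([1/5] * 5), 3))   # 2.322 = log2(5)

# Redundant position: the value is forced, P(2) = 1.
print(entropy([1.0]))                 # 0.0
```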

  26. Entropy and Normal Forms • Let Σ be a set of FDs over a schema S. • Theorem: (S, Σ) is in BCNF if and only if for every instance I of (S, Σ) and for every domain properly containing adom(I), each position carries a non-zero amount of information (entropy > 0). • A similar result holds for 4NF and MVDs. • This is a clean characterization of BCNF and 4NF, but the measure is not accurate enough ...

  27. Problems with the Measure • The measure cannot distinguish between different types of data dependencies. • It cannot distinguish between different instances of the same schema: R(A,B,C), A → B (both instances shown had entropy = 0).

  28. A General Measure Instance I of schema R(A,B,C), A → B:

  29. A General Measure Initial setting: pick a position p ∈ Pos(I) and pick k such that adom(I) ⊆ {1, …, k}. For example, k = 7.

  31. A General Measure Computation: for every X ⊆ Pos(I) – {p}, compute the probability distribution P(a | X), a ∈ {1, …, k}.

  42. A General Measure P(2 | X) = 48 / (48 + 6 · 42) = 0.16. For a ≠ 2, P(a | X) = 42 / (48 + 6 · 42) = 0.14. Entropy ≈ 2.8057 (log 7 ≈ 2.8073).

  43. A General Measure Value: we consider the average over all sets X ⊆ Pos(I) – {p}. • Average: 2.4558 < log 7 (maximal entropy) • It corresponds to conditional entropy. • It depends on the value of k ...
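
The numbers in the preceding frames can be reproduced directly. A sketch, assuming k = 7 as in the slides, and assuming the final ratio is the average entropy 2.4558 normalized by log 7:

```python
import math

# Conditional probabilities for position p given an assignment X (k = 7):
denom = 48 + 6 * 42
p2 = 48 / denom                    # P(2 | X)
pa = 42 / denom                    # P(a | X) for each of the six values a != 2
print(round(p2, 2), round(pa, 2))  # 0.16 0.14

# Entropy of this conditional distribution:
h = -(p2 * math.log2(p2) + 6 * pa * math.log2(pa))
print(round(h, 4))                 # 2.8057  (log2(7) ≈ 2.807)

# Averaging over all X gives 2.4558; normalizing by log2(7):
print(round(2.4558 / math.log2(7), 3))   # 0.875
```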

  44. A General Measure: Relative Information Content (RIC) • Previous value: RICIk(Σ | p) • For each k, we consider the ratio RICIk(Σ | p) / log k: how close the given position p is to having the maximum possible information content. • General measure (Arenas, Libkin 2003): RICI(Σ | p) = limk→∞ RICIk(Σ | p) / log k

  45. Basic Properties • The measure is well defined: for every set of first-order constraints Σ, every instance I of Σ, and every position p in I, RICI(Σ | p) exists. • Bounds: 0 ≤ RICI(Σ | p) ≤ 1 • Closer to 1 = less redundancy.

  46. Basic Properties • The measure does not depend on a particular representation of the constraints. • It overcomes the limitations of the simple measure: for R(A,B,C), A → B, the two instances that the simple measure could not distinguish receive different values, 0.875 and 0.781.

  47. Well-Designed Databases • Definition: A database specification (S, Σ) is well-designed if for every I ∈ inst(S, Σ) and every p ∈ Pos(I), RICI(Σ | p) = 1. • In other words, every position in every instance carries the maximum possible amount of information.

  48. Relational Databases (Arenas, L.’03) • Σ is a set of data dependencies over a schema S: • Σ = Ø: (S, Σ) is well-designed. • Σ is a set of FDs: (S, Σ) is well-designed if and only if (S, Σ) is in BCNF. • Σ is a set of FDs and MVDs: (S, Σ) is well-designed if and only if (S, Σ) is in 4NF. • Σ is a set of FDs and JDs: • If (S, Σ) is in PJ/NF or in 5NFR, then (S, Σ) is well-designed. The converse is not true. • A syntactic characterization of being well-designed is given in [AL03].

  49. Decidability Issues • If Σ is a set of first-order integrity constraints, then the problem of verifying whether a relational schema is well-designed is undecidable. • If Σ contains only universal constraints (FDs, MVDs, JDs, …), then the problem becomes decidable. • High complexity (coNEXPTIME), by reduction to the complement of Bernays-Schönfinkel satisfiability.

  50. 3NF • BCNF is the most popular textbook normal form. • In practice, 3NF is much more common. • From Oracle's “General Database Design FAQ”: after defining 1NF, 2NF, and 3NF, it says that there are other normal forms, but “their definitions are of academic concern only, and are rarely required for practical purposes”.
