
Database Normalization Revisited: An information-theoretic approach





  1. Database Normalization Revisited: An information-theoretic approach Leonid Libkin Joint work with Marcelo Arenas and Solmaz Kolahi

  2. Sources • M. Arenas, L. Libkin. An information-theoretic approach to normal forms for relational and XML data, PODS’03; J. ACM, 2005. • S. Kolahi, L. Libkin. Redundancy vs dependency-preservation in normalisation: an information-theoretic analysis of 3NF, PODS’06.

  3. Outline • Part 1: Database normalization from the 1970s and 1980s • Part 2: Classical theory re-done: new justification for normal forms: • BCNF and relatives (academic, eliminate redundancies) • 3NF (practical, may leave some redundancies) • Part 3: An XML application

  4. If you haven’t taught “Intro to DB” lately… • Design: decide how to represent the information in a particular data model. • Even for simple application domains there are many ways of representing the data of interest. • We have to design the schema of the database: • Set of relations. • Set of attributes for each relation. • Set of data dependencies.

  5. Normalization Theory Today • Normalization theory for relational databases was developed in the 70s and 80s. • Why do we need normalization theory today? • New data models have emerged: XML. • XML documents can contain redundant information. • Redundant information in XML documents: • Can be discovered if the user provides semantic information. • Can be eliminated.

  6. Designing a Database: An Example • Attributes: number, title, section, room. • Data dependency: every course number is associated with only one title. • Relational schema: a BAD alternative (a single table with all four attributes).

  7. Problems with BAD: Redundancies and Update Anomalies

  8. Deletion Anomaly CSC434 is not given in this term.

  10. Deletion Anomaly CSC434 is not given in this term. Additional effect: all the information about CSC434 was deleted.
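
The anomaly is easy to reproduce in a few lines. A minimal Python sketch with hypothetical rows for the BAD single-table design (only CSC434 appears in the deck; the other course, the titles, and the rooms are made up for illustration):

```python
# Hypothetical rows (number, title, section, room) for the BAD design;
# only CSC434 comes from the slides, the rest is illustrative.
bad = [
    ("CSC258", "Computer Organization", 1, "LM161"),
    ("CSC434", "Data Management Systems", 1, "BA1170"),
]

# CSC434 is not given this term, so its only row is deleted ...
bad = [row for row in bad if row[0] != "CSC434"]

# ... and the course's title is lost with it: the deletion anomaly.
titles = {number: title for number, title, _, _ in bad}
print("CSC434" in titles)   # False
```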

  12. Avoiding Update Anomalies The instance does not store redundant information.

  13. Avoiding Update Anomalies CSC434 is not given in this term.

  14. Avoiding Update Anomalies CSC434 is not given in this term. The title of CSC434 is not removed from the instance.

  15. Normalization Theory • Main idea: a normal form defines a condition that a well designed database should satisfy. • Normal form: syntactic condition on the database schema. • Defined for a class of data dependencies. • Main problems: • How to test whether a database schema is in a particular normal form. • How to transform a database schema into an equivalent one satisfying a particular normal form.

  16. BCNF: a Normal Form for FDs • Functional dependency (FD) over R(A1, …, An): X → Y, where X, Y ⊆ {A1, …, An}. • X → Y: two rows with the same X-values must have the same Y-values. • Number → Title in our example. • Key dependency: X → A1 … An • X is a key: two distinct rows must have distinct X-values.
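
Checking whether an instance satisfies an FD X → Y follows the two-rows definition directly. A small Python sketch (the rows and attribute names are a hypothetical instance of the course schema):

```python
def satisfies_fd(rows, lhs, rhs):
    """X -> Y holds if rows agreeing on the X attributes agree on Y."""
    seen = {}
    for row in rows:
        x = tuple(row[a] for a in lhs)
        y = tuple(row[a] for a in rhs)
        if seen.setdefault(x, y) != y:   # same X-values, different Y-values
            return False
    return True

# Hypothetical instance of the course schema:
rows = [
    {"number": "CSC434", "title": "DBMS", "section": 1, "room": "BA1170"},
    {"number": "CSC434", "title": "DBMS", "section": 2, "room": "LM161"},
]
print(satisfies_fd(rows, ["number"], ["title"]))   # True: Number -> Title
print(satisfies_fd(rows, ["number"], ["room"]))    # False
```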

  17. BCNF: a Normal Form for FDs • Σ is a set of FDs over R(A1, …, An). • A relation schema (R(A1, …, An), Σ) is in BCNF if for every nontrivial X → Y in Σ, X is a key. • A relational schema is in BCNF if every relation schema is in BCNF.
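
The BCNF test can be made concrete via attribute closures: X is a key iff its closure is the whole attribute set. A Python sketch (the schema and FD are the running course example; this is not the paper's code):

```python
def closure(attrs, fds):
    """Attribute closure: all attributes determined by attrs under the FDs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def is_bcnf(attributes, fds):
    """BCNF: every nontrivial FD X -> Y has X a key (closure(X) = all attributes)."""
    return all(
        rhs <= lhs or closure(lhs, fds) == attributes
        for lhs, rhs in fds
    )

schema = {"number", "title", "section", "room"}
fds = [(frozenset({"number"}), frozenset({"title"}))]
print(is_bcnf(schema, fds))               # False: number is not a key here
print(is_bcnf({"number", "title"}, fds))  # True: after decomposing, number is a key
```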

  18. BCNF Decomposition • Relation schema: R(X,Y,Z), Σ • Not in BCNF: Σ implies X → Y but not X → A, for every A ∈ Z. • Basic decomposition: replace R(X,Y,Z) by S(X,Y) and T(X,Z). • Example:

  19. Lossless Decomposition ∏number,title(R) and ∏number,section,room(R)

  20. Lossless Decomposition S ⋈ T
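
Losslessness can be checked directly: project BAD onto {number, title} and {number, section, room}, then rejoin on number. A Python sketch with hypothetical rows:

```python
# Hypothetical rows for BAD(number, title, section, room):
bad = [
    ("CSC434", "DBMS", 1, "BA1170"),
    ("CSC434", "DBMS", 2, "LM161"),
]

s = {(n, t) for n, t, _, _ in bad}          # projection on number, title
t = {(n, sec, r) for n, _, sec, r in bad}   # projection on number, section, room

# Natural join of S and T on number recovers the original instance:
joined = {(n1, ti, sec, r)
          for (n1, ti) in s
          for (n2, sec, r) in t
          if n1 == n2}
print(joined == set(bad))   # True
```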

  21. How to justify good designs? • What is a good database design? • Well-known solutions: BCNF, 4NF, 3NF… • But what is it that makes a database design good? • Elimination of update anomalies. • Existence of algorithms that produce good designs: lossless decomposition, dependency preservation.

  22. Problems with traditional approaches • Many papers tried to justify normal forms. • Problem: the justifications are tied very closely to the relational model. • They rely on well-defined notions of queries/updates. • These days we want to deal with other data models, in particular XML, so we need an approach that extends to them.

  23. Justification of Normal Forms • Problematic to evaluate XML normal forms. • No XML update language has been standardized. • No XML query language yet has the same “yardstick” status as relational algebra. • We do not even know if implication of XML FDs is decidable! • We need a different approach. • It must be based on some intrinsic characteristics of the data. • It must be applicable to new data models. • It must be independent of query/update/constraint issues. • Our approach is based on information theory.

  24. Information Theory • Entropy measures the amount of information provided by a certain event. • Assume that an event can have n different outcomes with probabilities p1, …, pn. Entropy: H = − Σi pi log pi. • Entropy is maximal (equal to log n) if each pi = 1/n.

  25. Entropy and Redundancies • Database schema: R(A,B,C), A → B • Instance I • Pick a domain {1, …, 6} properly containing adom(I). • For a non-redundant position: P(4) = 0 and P(a) = 1/5, a ≠ 4. Entropy: log 5 ≈ 2.322. • For a redundant position: P(2) = 1 and P(a) = 0, a ≠ 2. Entropy: log 1 = 0.
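
The two distributions on this slide can be checked numerically. A minimal sketch; the entropy function is the standard Shannon entropy with log base 2:

```python
import math

def entropy(ps):
    """Shannon entropy: sum of -p * log2(p), with 0 * log 0 taken as 0."""
    return sum(-p * math.log2(p) for p in ps if p > 0)

# Non-redundant position over the domain {1, ..., 6}: P(4) = 0 and the
# five remaining values are equally likely.
print(round(entropy([1/5] * 5), 3))   # 2.322 = log2(5)

# Redundant position: the value is forced, P(2) = 1.
print(entropy([1.0]))                 # 0.0
```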

  26. Entropy and Normal Forms • Let Σ be a set of FDs over a schema S. • Theorem: (S, Σ) is in BCNF if and only if for every instance I of (S, Σ) and for every domain properly containing adom(I), each position carries a non-zero amount of information (entropy > 0). • A similar result holds for 4NF and MVDs. • This is a clean characterization of BCNF and 4NF, but the measure is not accurate enough ...

  27. Problems with the Measure • The measure cannot distinguish between different types of data dependencies. • It cannot distinguish between different instances of the same schema: R(A,B,C), A → B (both instances shown had entropy = 0).

  28. A General Measure Instance I of schema R(A,B,C), A → B:

  29. A General Measure Initial setting: pick a position p ∈ Pos(I) and pick k such that adom(I) ⊆ {1, …, k}. For example, k = 7.

  31. A General Measure Computation: for every X ⊆ Pos(I) – {p}, compute the probability distribution P(a | X), a ∈ {1, …, k}.

  42. A General Measure P(2 | X) = 48 / (48 + 6 · 42) = 0.16. For a ≠ 2, P(a | X) = 42 / (48 + 6 · 42) = 0.14. Entropy ≈ 2.8057 (log 7 ≈ 2.8073).

  43. A General Measure Value: we consider the average over all sets X ⊆ Pos(I) – {p}. • Average: 2.4558 < log 7 (maximal entropy) • It corresponds to conditional entropy. • It depends on the value of k ...
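
The numbers in the preceding frames can be reproduced directly. A sketch, assuming k = 7 as in the slides, and assuming the final ratio is the average entropy 2.4558 normalized by log 7:

```python
import math

# Conditional probabilities for position p given an assignment X (k = 7):
denom = 48 + 6 * 42
p2 = 48 / denom                    # P(2 | X)
pa = 42 / denom                    # P(a | X) for each of the six values a != 2
print(round(p2, 2), round(pa, 2))  # 0.16 0.14

# Entropy of this conditional distribution:
h = -(p2 * math.log2(p2) + 6 * pa * math.log2(pa))
print(round(h, 4))                 # 2.8057  (log2(7) ≈ 2.807)

# Averaging over all X gives 2.4558; normalizing by log2(7):
print(round(2.4558 / math.log2(7), 3))   # 0.875
```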

  44. A General Measure: Relative Information Content (RIC) • Previous value: RICIk(Σ | p) • For each k, we consider the ratio RICIk(Σ | p) / log k: how close the given position p is to having the maximum possible information content. • General measure (Arenas, Libkin 2003): RICI(Σ | p) = limk→∞ RICIk(Σ | p) / log k

  45. Basic Properties • The measure is well defined: for every set of first-order constraints Σ, every instance I of Σ, and every position p in I, RICI(Σ | p) exists. • Bounds: 0 ≤ RICI(Σ | p) ≤ 1 • Closer to 1 = less redundancy.

  46. Basic Properties • The measure does not depend on a particular representation of the constraints. • It overcomes the limitations of the simple measure: for R(A,B,C), A → B, the two instances that the simple measure could not distinguish receive different values, 0.875 and 0.781.

  47. Well-Designed Databases • Definition: A database specification (S, Σ) is well-designed if for every I ∈ inst(S, Σ) and every p ∈ Pos(I), RICI(Σ | p) = 1. • In other words, every position in every instance carries the maximum possible amount of information.

  48. Relational Databases (Arenas, L.’03) • Σ is a set of data dependencies over a schema S: • Σ = Ø: (S, Σ) is well-designed. • Σ is a set of FDs: (S, Σ) is well-designed if and only if (S, Σ) is in BCNF. • Σ is a set of FDs and MVDs: (S, Σ) is well-designed if and only if (S, Σ) is in 4NF. • Σ is a set of FDs and JDs: • If (S, Σ) is in PJ/NF or in 5NFR, then (S, Σ) is well-designed. The converse is not true. • A syntactic characterization of being well-designed is given in [AL03].

  49. Decidability Issues • If Σ is a set of first-order integrity constraints, then the problem of verifying whether a relational schema is well-designed is undecidable. • If Σ contains only universal constraints (FDs, MVDs, JDs, …), then the problem becomes decidable. • High complexity (coNEXPTIME), by reduction to the complement of Bernays-Schönfinkel satisfiability.

  50. 3NF • BCNF is the most popular textbook normal form. • In practice, 3NF is much more common. • From Oracle's “General Database Design FAQ”: after defining 1NF, 2NF, and 3NF, it says that there are other normal forms, but “their definitions are of academic concern only, and are rarely required for practical purposes”.
