Schema Refinement and Normal Forms

Schema Refinement and Normal Forms

Why schema refinement? • Consider relation obtained from Hourly_Emps: • Hourly_Emps (sin,rating,hourly_wages,hourly_worked) • Problems: • Update Anomaly: Can we change the wages only in the 1st tuple? • Insertion Anomaly: What if we want to insert an employee and don’t know the hourly wage for her/his rating? • Insertion Anomaly II: We can only introduce a new rating scheme when we enter an employee with this rating into the relation • Deletion Anomaly: If we delete all employees with rating 5, we lose the information about the wage for rating 5! sin wages hours rat 123-22-3661 8 10 40 231-31-5368 8 10 30 434-26-3151 5 15 37 612-37-2814 8 10 42

Functional Dependencies (FDs) • A functional dependency X  Y (X and Y are sets of attributes; X,Y schema(R)) holds over relation R if for every allowable instance r of R: • Given two tuples t1 and t2 in instance r of R, if the X values of t1 and t2 agree, then the Y values must also agree • t1  r, t2  r, X(t1) = X (t2) implies Y (t1) = Y (t2) • FDs are determined by the semantics of the application • Let schema(R) = (a1,a2,…an); a set of attributes X  schema(R) is a candidate key for a relation R if X R (= X  a1,a2,…an) i.e., X implies all attributes of R(typically, the key K of an entity-set has been chosen exactly because K  R) • Relationship sets (many-one,one-one) deduce FDs • Other example: no two courses can meet in the same room at the same time • {room, time}  {course} • Or short: room, time  course

FDs (contd.) • An FD is a constraint on all allowable instances of a relation R (on data that might appear in R) • Given some instance r of R, we can check if it violates some FD f, but we CANNOT deduce that some FD f’ holds over R only because it holds over r • Goal (very informal): a relation should only contain key induced FDs • We will see that other FDs lead to some form of redundancy • We achieve this goal by decomposing relations into smaller relations

Example FDs • Consider relation obtained from Hourly_Emps: • Hourly_Emps (sin,rating,hourly_wages,hourly_worked) • Notation: we will denote this relation schema by listing the attributes SRWH • Some FDs on Hourly_Emps: • sin is the key: S  SRWH • rating determines hourly_wages: R  W S R H W 123-22-3661 8 10 40 231-31-5368 8 10 30 434-26-3151 5 15 37 612-37-2814 8 10 42

Example Redundancy Anomalies • Update/Insertion/Deletion Anomalies due to R  W • How to solve the problems? • Wages is an entity set by itself with two attributes. H S R S R H W R W 123-22-3661 8 40 231-31-5368 8 30 434-26-3151 5 37 612-37-2814 8 42 123-22-3661 8 10 40 231-31-5368 8 10 30 434-26-3151 5 15 37 612-37-2814 8 10 42 8 10 5 15 +

Reasoning about FDs • When improving relational design, we often need to ask “does this FD hold in this relation” • Given some FDs, we can usually infer additional FDs: • Example: sin --> did, did --> location implies sin --> location • Basic set F of FDs (determined a priori by designer) • An FD f is implied by a set of FDs F if f holds whenever all FDs in F hold • F+= closure of F: set of all FDs that are implied by F • Armstrong’s Axioms (X, Y, Z are sets of attributes,i.e, X,Y,Z  schema(R)): • Reflexivity: If Y  X, then X  Y • Augmentation: If X  Y, then XZ  YZ for any Z • Transitivity: If X  Y and Y  Z, then X  Z These are sound and complete inference rules for FDs (only deduce FDs that hold and deduce all FDs) • Some additional rules (that follow from AA): • Union: If X  Y and X  Z, then X  YZ • Decomposition: If X  YZ, then X  Y and X  Z

Reasoning about FDs (Example) • Example: Contracts(C,S,J,D,P,V) • contract-id, supplier-id, proJect-id, department-id, part-id, value • C is the key: C  CSJDPV • (or more precise: {C}  {C, S, J, D, P, V}) • Application Semantics: Project (J) purchases each part using single contract: JP  C • Application Semantics: Department purchases at most one part from a supplier: SD  P • JP C, C  CSJPDPV imply JP  CSJDPV (transitivity; JP is also key candidate) • SD  P implies SDJ  JP (augmentation) • SDJ  JP, JP  CSJPDPV imply SDJ  CSJDPV (transitivity; SDJ is also key candidate)

Attribute Closure of X • Computing F+ given F is expensive • Size of closure is exponential in #attributes • Attribute closure of X = X+ wrt F: • set of all attributes A such that X  A is in F+ • Basis: X+ = X • Induction: If Z X+ , and Z  A is in F, then add A to X+ • End when X + cannot be changed • Linear time • Example F = {A  B, BC  D} : • A+ = AB, B+ = B, C+ = C, (AB) + = AB • More precise: {A}+ = {A, B}, {B}+ = {B}, {C}+ = {C}, {AB} + = {A,B} • (BC)+ = BCD, • (AC)+ = ABCD • …

Using Attribute Closure • Problem: Is FD X  Y  F+ ? • Is X  Y implied by F? • Solution: • Check whether Y in X+ wrt. F • Does F = {AB, BC, C DE } imply A  E? • i.e, is AE in the closure F+ ? • Equivalently, is E in A+ ?

Finding all key candidates • Sometimes we want to find all the key candidates for a given relation • a set of attributes X is a key candidate if X  R and there is no subset Y  X such that Y  R • Algorithmto determine the set K of all key candidates • For each set of attributes X • Set iskey = true • If X+  R then iskey = false • For each A  X while iskey • If (X-A)+ = R then iskey = false • If iskey then K = K  {X} • Exponential in the #attributes

Example Bottom Up • R(S,C,R,P) and CS  P, CS  R, CR  S • S=student-id, C=course-id, R=ranking, P=examining prof. • CS  P: a student has in a given course only one examining prof. • CS  R: in a given course a student receives exactly one ranking • CR  S: in a given course a specific ranking value can only be assigned to one student (relative evaluation of students) • Key candidates • S+ = {S},C+ = {C}, R+= {R}, P+ = {P} • (SC)+ = {S,C,R,P} • because of CS  P and CS  R • (SR)+ = {S, R}, (SP)+ = {S,P} • (CR)+ = {C,R,S,P} • because of CR  S and then CS  P • (CP)+ = {C,P}, (RP)+ = {R,P} • (SRP)+ = {S, R, P} • Not interesting: • (SCR)+ / (SCP)+ / (CRP)+ / (SCRP)+ = superkey • Informal Reasoning • No attribute alone is primary key • C is not on the right site of any FD, therefore it must be part of any candidate key • CS is key candidate as shown on the left • CR is key candidate as shown on the left • CP is not candidate • No 3-attribute or 4-attribute set is key candidate because • It either does not contain C • Or it is a superset of CS or CR

Finding the closure of F • Exponential in #attributes • Algorithm • For each set of attributes X compute X+ • Add X  A for each A in X+ - X • Ignore or drop some “obvious” dependencies that follow from others • Obvious dependencies • Trivial FDs: right side is a subset of left side • Consequence: no point in computing {}+ or closure of full set of attributes • Drop XY  A if X  A holds • Consequence: If X+ is all attributes then there is no need to compute closure of superset of X • Keep only fully decomposed FDs: keep X  Y and X  Z, ignore X  YZ

Example • F = AB  C, C  D, D  A • A+ = A; B+ = B (nothing) • C+ = ACD (add C  A) • D+ = DA (nothing new) • (AB)+ = ABCD (add AB  D; skip all supersets of AB) • (BC)+ = ABCD (nothing new; skip all supersets of BC) • (BD)+= ABCD (add BD  C; skip all supersets of BD) • (AC)+ = ACD; (AD)+ = AD; CD+ = ACD (nothing new) • All other sets contain AB, BC, or BD, so skip • Thus, the only interesting FDs that follow from F are C  A,AB  D, BD  C

Normalization • Remind the relation between redundancy anomalies and FDs • Only key-induced FDs hold: there is no redundancy • Given A  B and A is not key candidate: several tuples could have the same A value, and if so, they’ll have the same B value! • If a relation is in a certain normal form (BCNF, 3NF) it is known that only certain kind of FDs hold. Hence, some forms of anomalies are avoided or minimized.

BCNF and 3NF • A relation R with FDs F is in Boyce-Codd Normal Form (BCNF) if for all X  A in F+ • A X (trivial FDs) or • X contains a key for R (X is superkey) • In other words, R is in BCNF if the only non-trivial FDs that hold over R are key constraints. • If there are two tuples that agree upon the X value we cannot infer the A value in one tuple from the A value in the other tuple • A relation R with FDs F is in 3NF (3rd Normal Form), if for all X  A in F+ • A  X (trivial FDs) or • X contains a key for R (X is superkey) or • A is part of some key for R. • If R is in BCNF, then it is in 3NF X A x a x ?

Examples • An additional FD • Hourly_Emps (sin,rating,hourly_wages,hourly_worked) • SSRWH (sin key) and R  W (rating determines wages) • R  W violates both BCNF and 3NF conditions • Dependency only on subkey • Works In(sin,did,daddress) • Did  daddress violates both BCNF and 3NF • Hourly_Emps2 and Wages are in BCNF • Hourly_Emps2 (sin,rating,hourly_worked) • Wages (rating, hourly_wages) • In 3NF but not in BCNF • Room time  course, course  room • Key candidates are {room, time} and {time, course} • The FD course  room violates the BCNF condition. • The FD course  room does not violate 3NF since room part of a key candidate • 3NF means “third normal form”. 1NF and 2NF do exist, but only of historical interest Room Time Course 421 T 10-12 1100 421 TH 10-12 1100 305 M 2-4 Aaud

Decomposition • If a relation is not in the desired normal form we decompose it: given relation R with attributes A1 … An. A decomposition of R consists of replacing R by two or more relations such that • Each new relation scheme contains a subset of the attributes of R (and no attributes that do not appear in R), and • Every attribute of R appears as an attribute in one of the new relations • The tuples of R are split such that the corresponding subtuples appear in the new relations. • For instance, decompose Hourly_Emps SRWH into SRH and RW • Tuples of SRH: SRH(SRWH) • Tuples of RW: RW(SRWH) • Disadvantage of Decomposition: • some queries become more expensive since several relations must be joined (salary = wages * hours)

Lossless Join Decompositions • A decomposition of schema R into schemas R1 with attribute set X and R2 with attribute set Y is lossless-join w.r.t. to a set of FDs F, if for any instance r of R satisfying F, • X(r) y(r) = r • I.e., we can reconstruct r by joining the corresponding decomposed relation instances r1 and r2. • Lossless decomposition is crucial! BC(r) = r2 AB(r) = r1 Reconstruction r1 r2 C B A C B A C B B A 1 2 3 4 5 6 7 2 8 1 2 8 7 2 3 Decomp. 1 2 3 4 5 6 7 2 8 2 3 5 6 2 8 1 2 4 5 7 2

Lossless Join Decompositions • The decomposition of R into R1 and R2 is lossless-join wrt F if and only if the closure of F contains • Let {a1, … an} be the set of attributes that are both in R1 and R2: then {a1, … an}  R1 or {a1, .. an}  R2 • In particular, if U  V (non trivial) holds over R then the decomposition of R into UV and R-V is lossless-join (rejoin does not create new tuples because U is key in UV and foreign key in R-V). We denote this as a decomposition along FD U  V H S R S R H W R W 123-22-3661 8 40 231-31-5368 8 30 434-26-3151 5 37 612-37-2814 8 42 123-22-3661 8 10 40 231-31-5368 8 10 30 434-26-3151 5 15 37 612-37-2814 8 10 42 8 10 5 15 +

Decomposition into BCNF • Given relation R with FDs F and X  Y  FD violating BCNF • Compute X+ (cannot be all attributes!) • Decompose R into X+ and (R - X+)  X. • Repeat until all relations in BCNF. • If X+ = {X,Y} we denote this as a decomposition along X  Y • Decomposition is lossless join • Example CSJDPV, key C, JP  C, SD  P, J  S • Determine key candidates: C, JP • SD  P violates BCNF • (SD)+ = {S, D, P} • decompose into SDP, CSJDV along FD SD  P • Only FD in SDP is SD  P; SD key candidate for this relation, so in BCNF • FDs in CSJDV are C  CSJDV and J  S; C key candidate, • J  S violates BCNF for CSJDV • decompose CSJDV into JS and CJDV along J  S • For JS, F+ = {J  S}; J is key candidate for this relation, so in BCNF • For CJDV, F+ = {C  CSJDV}; C key candidate for this relation, so in BCNF • In general, several dependencies may cause violation of BCNF. The order in which we deal with them could lead to very different sets of relations • Note that we only need to look at F, not at F+. X X+ R

BCNF and Dependency Preservation • Dependency preserving decomposition (intuitive) • Give relation R with FDs F. A dependency preserving decomposition allows us to check each dependency in F by looking at one of the decomposed relations. • Not all BCNF decompositions are dependency preserving • E.g.,given CSJDPV, key C, JP  C, SD  P, J  S: • Decompose into SDP, JS and CJDQV (covering SD  P and J  S) • But now checking JP  C requires a join! • Example 2: time room  course, course  room • Keys are {time, room} and {time, course}, but course  room has a left side which is not a superkey • Decompose along course  room: R1(course,room) and R2(course,time) • Problem: you can only check the FD time room  course when joining R1 and R2 • Possible workaround in some situations • E.g., in CSJDPV example above, add JPC to the collection of relations; note that JPC tuples only stored for checking FD on a single relation (Redundancy!)

Formal Definition • Projection: • Let R be a relation schema that is decomposed into two schemas R1 and R2. The projection of F on R1 (denoted FR1) is the set of FDs U  V in the closure F+ (NOT ONLY F) that involve only attributes in R1. Note that a dependency U  V in F+ is in FR1 only if all the attributes in U and V are in R1. • Example: given CSJDPV, key C, JP  C, SD  P, J  S: • Decompose into SDP, JS and CJDQV • FSDP = {SD  P, other trivial}, FJS = {J  S, trivial}, FCJDQV = {C  JDQV, trivial} • Dependency preserving decomposition (formal) • Give relation R with FDs F decomposed into schemas R1, … Rn. The decomposition is dependency-preserving if (FR1  FR2  … FRn)+ = F+. • Example above: lost JP  C

3NF is Dependency Preserving • We can decompose any relation R with an arbitrary set FD, such that all decomposed relations are in 3NF and the decomposition is dependency preserving. However, it might happen that not all dependencies are key induced dependencies • Room time  course, course  room • Key candidates are {room, time} and {time, course} • The FD course  room does not violate 3NF since room part of a key. However, it cannot be guaranteed by the key condition but must be guaranteed by other forms of constraints. • Obviously, the algorithm for decomposition into BCNF can be used to obtain a decomposition into 3NF (typically can stop earlier) but it does not preserve dependencies.

Summary • If a relation is in BCNF, it is free of redundancies that can be detected using FDs. Thus, trying to ensure that all relations are in BCNF is a good heuristic. • If a relation is not in BCNF, we can try to decompose it into a collection of BCNF relations. • Must consider whether all FDs are preserved. If a lossless-join, dependency preserving decomposition into BCNF is not possible (or unsuitable, given typical queries), should consider decomposition into 3NF • Decompositions should be carried out and/or re-examined while keeping performance requirements in mind.

Overall Design Process • We have assumed schema R is given • R could have been generated when converting E-R diagram to a set of tables. • R could have been a single relation containing all attributes that are of interest (called universal relation). • Normalization breaks R into smaller relations. • R could have been the result of some ad hoc design of relations, which we then test/convert to normal form. • Might keep “bad” design with redundancy for performance reasons! • Application must guarantee consistency

ER Model and Normalization • When an E-R diagram is carefully designed, identifying all entities correctly, the tables generated from the E-R diagram should not need further normalization. • However, in a real (imperfect) design there can be FDs from non-key attributes of an entity to other attributes of the entity • E.g. employee entity with attributes department-number and department-address, and an FD department-number  department-address • Good design would have made department an entity • FDs from non-key attributes of a relationship set possible, but rare --- most relationships are binary

Other Design Issues • Some aspects of database design are not caught by normalization • Examples of bad database design, to be avoided: Instead of earnings(company-id, year, amount), use • earnings-2000, earnings-2001, earnings-2002, etc., all on the schema (company-id, earnings). • Above are in BCNF, but make querying across years difficult and needs new table each year • company-year(company-id, earnings-2000, earnings-2001, earnings-2002) • Also in BCNF, but also makes querying across years difficult and requires new attribute each year. • Is an example of a crosstab, where values for one attribute become column names • Used in spreadsheets, and in data analysis tools

Minimal Cover for a Set of FDs • Usage of a minimal cover G for a given set of FDs F • Closure of F = closure of G • Right hand side of each FD in G is a single attribute • If we modify G by deleting an FD or by deleting attributes from an FD in G, the closure changes. • Intuitively, every FD in G is needed, and “as small as possible” in order to get the same closure as F • E.g., A  B, ABCD  E, EF  GH, ACDF  EG has the following minimal cover: A  B, ACD  E, EF  G and EF  H • Rules to create the cover • Put the FDs in standard form: only a single attribute on the right side • Minimize left side of each FD: check whether attribute of left side can be deleted while preserving equivalence • Delete redundant FDs

Dependency Preserving Decomposition into 3NF • Let R be a relation with a set F of FDs that is a minimal cover • decompose R into R1…Rn according to the BCNF algorithm such that each R1…Rn is 3NF. Let Fi be the projection of F onto the attributes in Ri. • Identify the set N of dependencies in F+ that are not preserved, that is, not included in the closure of the union of Fi’s. • For each FD X  A in N, create a relation XA and add it to the decomposition of R. • If N contains several FDs with the same left side X  A1, …, X  An we can createone relation Ri(X,A1,.,.An)

Example • Concert(orchestra, conductor, city, date-of-concert, year-of-birth, opus, year-of-creation, date-of-world-premiere, composer) • e.g., (’Montreal Philharmonic Orchestra’, ‘L. Siegel’, ’Toronto’, ’16-01-2006’, 1937, ’Piano Concert No. 1 b-moll Op. 23’, 1874, ’25-10-1875’, ’P. Tschaikowsky’) • shortcuts: • Or (orchestra), Con (conductor), Ci (city), Doc (date-of-concerts), Yob (year-of-birth), Op (opus), Yoc (year-of-creation), Dow (date-of-world-premiere), Com (composer) • The following functional dependencies hold: • Con --> Yob • Or --> Con • Op --> Yoc, Dow, Com • Yoc, Com, Dow --> Op

Example • ISBN → publisher, title • ISBN, no → author • ISBN, author → no • publisher, title, no → author.

Schema Refinement and Normal Forms