SQL Unit 15 Normalization

SQL Unit 15Normalization
prepared by Kirk Scott

1. Normal Forms 2. First Normal Form 3. Second Normal Form 4. Third Normal Form 5. Boyce-Codd Normal Form 6. Fourth Normal Form 7. Fifth Normal Form 8. Domain Key Normal Form 9. Nulls and Integrity

1. Normal Forms The benefits of relational database theory can be summarized as follows: There is a step-by-step way of arriving at a correct design There is a way of detecting flaws in a design The design process has to do with the problem domain, not with computer-related questions

The database designer and user are protected from questions related to the implementation of the dbms and the hardware it’s running on Finally, if the design is correct, it will be possible to: Store all desired information in it; Update the information on an ongoing basis; Retrieve any/all of the information as needed.

Correct designs are based on what are called normal forms. This section presents the background information to the design process. It also discusses and illustrates the use of normal forms.

Identifying Entities At its most basic level, design of a database depends on determining what you want to store information about. When deciding what the base tables will be, you are trying to identify entities. From a language point of view, this involves identifying nouns which do not modify other things.

Identifying Attributes Identifying the entities leads to identifying their attributes. Attribute names usually end up being nouns too, but you figure out what they are when you try to describe entities, and the descriptions usually involve adjectives. One of the key points of database design is that you only store information about the entities and attributes you need to.

There may be many possible entities All entities may have a long list of potential attributes But you limit yourself to only those things you will need to retrieve information about in the future.

Identifying Keys You are familiar with primary keys and foreign keys. When trying to organize the attributes around entities in the design, the idea is to equate an entity with a primary key field Then group the attributes with the entities that they describe. Relationships between tables are captured by embedding the primary keys of one or more tables as foreign keys in other tables.

Functions, Determination, and Dependency When described in general, the foregoing sounds sensible enough. That’s why the book claims that if you can model successfully, the result will be a correct design In practice it can be difficult to do without formal guidelines. This is what the normal forms provide.

The normal forms are based on and described in terms of an idea taken from math. One field in a table may functionally determine another. Stated in reverse order: The other field depends functionally on the one.

This is an example of a mathematical function: y = f(x), for example, y = x2 y is a function of x. x is in the domain and y is in the range. x functionally determines y Or, y functionally depends on x.

For a mathematical function, you find the dependent value by doing some sort of computation on the determining value. The key point underlying a function is the following: For each value of x, there can only be one corresponding value of y. x uniquely determines y.

The analogy in database design is the following: The primary key of a table should functionally determine the values of the other fields in the table. In other words, the non-key fields should functionally depend on the primary key field. Just as the primary key uniquely identifies a record, it uniquely determines the values of the fields in the record

Take this small table for example: This is its schema: Person(SSN, name, dob)

You don’t find a person’s name or birthdate by doing a computation on their social security number. However, given any one social security number, there is exactly one corresponding name and exactly one corresponding date of birth. It is true that different people with different social security numbers may have the same name and the same date of birth, but this is not a problem.

The point of the primary key field is that it is the unique identifier that makes it possible to distinguish between these two people. This idea came up at the beginning of the course The point now is that the name and date of birth fields functionally depend on the social security number field.

A new notation can be used to indicate this. In this notation, the arrows go from the field that functionally determines another field, to the field that is dependent. This is illustrated on the next overhead.

Normal Forms Some of the normal forms are identified by number, for example 1st, 2nd, and 3rd normal forms. Others are identified by name, for example Boyce-Codd normal form, named after the people who discovered it. These four normal forms are abbreviated 1NF, 2NF, 3NF, and BCNF, respectively. There are also higher normal forms, 4th, 5th, and domain key normal forms (4NF, 5NF, DKNF).

The normal forms have to do with finding dependencies in tables which spring from fields other than the primary key. These dependencies are undesirable and may be referred to as stray dependencies. The normal forms make increasingly strict statements about the kinds of stray dependencies that have to be eliminated from correctly designed tables. Designs containing stray dependencies are said to violate the normal forms.

Eliminating Dependencies The design process using normal forms consists of repetitive steps: Make a design Identify stray dependencies (normal form violations) Redesign to eliminate the dependencies Once you’ve eliminated all occurrences of one type of violation, you will have promoted the design into the next higher normal form Repeat until you’ve reached the highest normal form

The rule of thumb at every stage is to remove stray dependencies in the following way: Make any field which determines other fields the primary key of a new table, and move the fields that depend on that field to the new table. Make sure that the new table is connected to the old table by a primary key, foreign key pair.

Anomalies Design problems that are based on violations of normal forms lead to what are called anomalies. The hallmark of a problematic design is that the same information is stored multiple times. In other words, there is redundancy in the database. Depending on the nature of the redundancy, this can lead to problems when inserting data, when updating data, and when deleting data.

Justifying Normal Forms The use of normal forms may seem unnecessarily theoretical at first. However, they provide a convenient way of identifying problems in designs and then eliminating them. Normal forms are what justify these claims about relation databases: There is a step-by-step way of arriving at a correct design There is a way of detecting flaws in a design

The Plan of Action for the Following Sections Each of the following sections will present a normal form in this way: A definition of the normal form will be given. A scenario for information to be held in a database will be given, with the underlying assumptions given. An example database design which violates the normal form will be given

The violation will be shown using a diagram with the notation indicating functional dependencies. The desired functional dependencies from the primary key will be shown using arrows below the field names. Undesired, stray dependencies, which need to be eliminated in order to correct the design, will be shown using arrows above the field names.

Anomalies resulting from the incorrect design will be discussed. In general, there will be insert, update, and delete anomalies Finally, a corrected design will be given.

Basis for Examples All of the examples will be based on the general topic of cars, salespeople, customers, and car sales. Some of the field names are abbreviated, and some of the fields clearly belong together in some way. Here is a little preliminary explanation regarding the fields that will be in the examples. Not all of the fields will appear in all of the examples.

vin: vehicle identification number. Vehicles have makes, models, and years. spno, spname: Salesperson number and name. custno, custname: Customer number and name. A car sale has a salesprice and a date.

2. First Normal Form 1NF Definition: Formally (Watson): A relation is in first normal form if and only if all columns are single-valued. Informally: Data is stored in flat files; there can be no repeating groups in a record. (This was mentioned in the very first unit.)

The assumptions underlying the design are that a salesperson can sell many cars, but each car can only be sold by one salesperson. In this design, each car is only sold once, so the design captures information about the sales of new cars. These assumptions don’t cause the problem. It is the implementation of them that causes the problem.

Here is the design that violates 1NF: Carsale(spno, spname, {vin, salesprice}) The example design uses {} notation to indicate repeating groups of fields. A diagram with arrows illustrating this design is given on the next overhead

In general, normal form violations have insert, update, and delete anomalies. It would be possible to consider such problems with a 1NF violation However, the general form of anomalies is much clearer if they’re introduced with 2NF and higher normal form violations The repeating group alone is a sufficient problem to make this kind of design incorrect and anomalies will not be discussed in this first case

The solution to all basic normal form violations is the same: Break out the stray dependency out into a separate table In this case, break out the information contained in the repeating group

As stated in the assumptions, one salesperson can sell many cars, but each car is sold only once, so there is a 1-m relationship between the two tables in the resulting design. The primary key of the table containing salesperson information will have to be embedded as a foreign key in the table containing car information.

Here is the corrected design: Salesperson(spno, spname) Carsale(vin, salesprice, spnof.k.)

3. Second Normal Form 2NF Definition: Formally (Watson): A relation is in second normal form if an only if it is in first normal form, and all nonkey columns are dependent on the key Informally: In a table with a concatenated primary key field, there can be no stray dependencies that originate in just part of the primary key field.

The basic idea is that all nonkey fields have to depend on the whole key. Stating it in this way will lead to a useful mnemonic device which will be given later. When you lay it out in this way, you begin to realize that 2NF deals with tables that have concatenated key fields, where a dependency from only one field of the key might be possible.

In this example the underlying assumptions are that the same car can come back to the lot and be sold more than once. It can be sold by the same salesperson more than once, but not on the same day. It can also be sold by different salespeople at different times. Although unlikely, the design is made so that two different salespeople could sell the same car on the same date.

The design doesn’t contain any information about customers, but the scenario would be that one customer brought the car back, and a different salesperson sold it again. It seems unlikely that the same customer would buy the same car twice, whether on the same date or different dates.

In summary, this design works for used car sales and both the date and the salesperson information are needed, along with the car information, to distinguish between different sales.

Here is the design that violates 2NF: Carsale(vin, spno, date, spname) A diagram with arrows illustrating this is given on the next overhead

This faulty design has insert, update, and delete anomalies. Suppose a salesperson has not yet sold a car. In this case, it is not possible to insert information about that salesperson. On the other hand, a salesperson may make many sales. This means that the same information about that salesperson would be stored in more than one record in the table. This is redundancy.

Not only is the redundancy itself wasteful, it leads to the update anomaly. Suppose the salesperson’s name changes. Then it’s necessary to update multiple records to reflect this fact, not just one.

The delete anomaly is related to the insert anomaly. Suppose that as part of the maintenance of the database, on a yearly basis the sales table is cleared. When you delete the last record containing a sale by a particular salesperson, you not only get rid of the sales record, you also lose the salesperson’s name.

As usual, the solution to the problem is to break the stray dependency out into a table of its own. Each car sale has only one salesperson, but each salesperson can be involved in many sales, so this is a 1-m many relationship. The salesperson information is stored in a table by itself, and the primary key of the salesperson table is embedded as a foreign key in the car sale table.

Here is the corrected design: Salesperson(spno, spname) Carsale(vin, date, spnof.k.)

4. Third Normal Form 3NF Definition: Formally (Watson): A relation is in third normal form if an only if it is in second normal form and has no transitive dependencies Informally: There can be no stray dependencies from one non-key field to another.

In this example, for the sake of simplicity, it is assumed that new cars are being sold and they can only be sold once. Information about the customer is also recorded with the sale. Each car can only be bought by one customer. It would be possible for a customer to buy more than one car.

Here is the design that violates 3NF: Carsale(vin, custno, custname, salesprice, date) A diagram with arrows illustrating this is given on the next overhead

This design also has insert, update, and delete anomalies and the pattern of the anomalies is the same as in the previous example. They all stem from the presence of the stray dependency in the design. If you have a potential customer who has not yet bought a car, it is impossible to insert information about that person.

If a customer has bought more than one car, the customer information is stored redundantly. In that case, if the customer’s name changes, it’s necessary to change multiple records. Finally, if the sales table is cleared on a regular basis, when you delete the last sales record for a given customer, you not only get rid of the sales record, you also lose the customer’s name.

There is a situation that can arise in database designs that appears to be a violation of 3NF, but isn’t. The most common example of this situation is a table which includes a city, state, and zip code as part of an address. The postal service has divided up the country into zones which are identified by zip codes.

None of these zones cross city or state boundaries. That means that a zip code, a non key field, determines the city and state. It is not necessary for someone to break this dependency out of their database design. The rule of thumb is that if you are not responsible for maintaining the dependency, then you can ignore it.

The postal service has a table somewhere with zip code as the primary key and all of the descriptive fields about zip code that exist. The post office maintains this. A table not maintained by the postal service can contain addresses with zip codes and completely ignore the fact that there may in reality be a dependency.

5. Boyce-Codd Normal Form BCNF Definition: Formally (Watson): A relation is in Boyce-Codd normal form if and only if every determinant is a candidate key Informally: There can be no stray dependencies from a non-key field to a field in the key

Notice that the formal definition of BCNF differs from that of the foregoing definitions If does not include the phrase “in 3NF and…” BCNF is actually a summation of 1NF through 3NF which covers one other case which is not covered by the previous normal forms. BCNF will be explained by presenting an example which includes this additional case

For the purposes of this example suppose that the same car can be sold by the same salesperson more than once, but only one sale of that car is possible per date. Suppose also that this dealership has a system for assigning prospective customers to specific salespeople, so that each salesperson is associated with an exclusive list of clients.

It would be normal to assume that this system is implemented in some sort of table. Such a table is not shown here—it will become part of the solution to the problem. The point now is to show the problem that this assumption leads to in the table of interest in the original, incorrect design.

Here is the design which violates BCNF: Carsale(vin, spno, date, custno) A diagram with arrows illustrating this is given on the next overhead

It is obvious that there is a stray dependency in the design. Technically we haven’t seen this kind of dependency before, because it goes into the primary key. If you studied the previous normal form definitions, you would discover that they don’t cover this case.

The anomalies in this design are analogous to the anomalies in the previous designs. It is not possible to insert information about the relationship between a given customer and salesperson without a sales record which matches them. If the given customer has bought from the same salesperson many times, their relationship is in multiple records.

An update would require changes in multiple records. Finally, if you’re down to the last record containing information about a particular salesperson-customer pairing, deleting the record would cause the information to be lost. As usual, the solution to the problem is to break out the stray dependency in a separate table.

Here is the corrected design: Carsale(vin, date, custnof.k.) Customer-Salesperson(custno, spno) Which p.k. should be embedded as a f.k. is clear Under the assumptions of the example, one customer can buy many cars, but not vice-versa

The original design was intuitive, but wrong It recorded information linking salespeople directly to their sales, but this turned out to be problematic The new design is correct and solves the problem However, it is not necessarily 100% intuitive

Customers are uniquely associated with a single salesperson The car sale record now tells you who bought the car If you want to know who the salesperson was for a given sale, you have to do a join with the Customer-Salesperson table. To a user who doesn’t understand normalization and the assumptions underlying the database, this will seem like an unnecessary complication The “one table database” mindset is a hard one to break

What is Allowed under BCNF There is another aspect of BCNF that needs to be explained. Consider a design which includes both a university-generated student id number and a social security number. It would seem to violate BCNF as explained above:

Here is the design which seems to violate BCNF: Student(studentIDno, SSN, name) A diagram with arrows illustrating this is given on the next overhead

The additional part of BCNF is that if the stray dependency results from another field which also could have been chosen as a primary key for the table, then it is not a normal form violation. This is what is meant by the formal definition: A relation is in Boyce-Codd normal form if and only if every determinant is a candidate key

In the example, both studentIDno and SSN are valid, unique identifiers of students. You might want to record both. It is simply necessary to choose one of them as the primary key of the field. The presence of the other one in the table does no harm.

A Summary of the Normal Forms up through BCNF Up through BCNF the normal forms can be explained in terms of stray dependencies. An easy way to remember the requirements for these normal forms is the following statement: Every field in a table has to depend on the key, the whole key, and nothing but the key.

Because they are increasingly strict, the normal forms can be thought of as nested. When checking a design, you begin with the lowest normal form, make sure there are no violations, and move on to the following ones. This is what makes the design process step-by-step.

This idea can be represented using a Venn diagram. The idea is that the set of designs which is in some normal form is always a subset of those designs which meet the conditions for a lower normal form. A diagram of this is shown on the following overhead

1NF 2NF 3NF BCNF…

6. Fourth Normal Form It was claimed earlier that there are only three kinds of relationships: 1-1, 1-m, and m-n. This is not entirely true. There may be many-to-many-to-many relationships (relationships between 3 different types of entities at the same time, m-m-m),

In theory there is no reason why there can’t be relationships among 4 or more different types of entities at the same time. Fourth and fifth normal form, 4NF and 5NF, have to do with cases like these.

4NF and 5NF have been previewed earlier by saying they are related to the question of a cyclical design vs. a star design This is still true. Observe that there are essentially two options for capturing a m-m-m relationship: either a cycle or a star.

The higher normal forms are not strictly related to the question of stray dependencies The presentation of them will not follow exactly the same plan as the lower normal forms. However, the same basic ideas will be covered: A definition plus examples and explanations of how to do it right and what can go wrong

Fourth Normal Form 4NF Definition: Formally (Watson): A relation is in fourth normal form if it is in Boyce-Codd normal form and all multi-valued dependencies on the relation are functional dependencies. Informally: An informal definition will have to wait…

The formal definition illustrates why the presentation of the higher normal forms will differ from that of the lower ones. It is not possible to understand the formal definition or give a meaningful informal definition without further explanation.

There are three things to observe about the definition: 1. It goes back to the model of, “in the previous normal form and…” The definition of 4NF depends on the definition of BCNF plus another condition

2. On the other hand, it is worded “if” rather than “if and only if” I don’t know whether the author was daydreaming or whether this change is theoretically important. In any case, it is of no practical importance We can safely consider the two halves of the proposition to be logically equivalent

3. The most important thing in the definition requiring explanation is the introduction of a new concept: multivalued dependencies. The new concept is actually “independent multivalued dependencies”. It’s hard to understand a definition when it’s based on concepts that haven’t been explained yet.

The rest of this section will be organized in this way: 1. A correct example of a m-m-m relationship implemented in a star design will be given. 2. Then the problems that underlie the definition of the normal form will be examined.

Correct Example Suppose that a given car can be sold more than one time. In other words, you’re dealing in used cars. Suppose also that salespeople can sell more than one different car, and customers can buy more than one different car. This means that there are three 1-m relationships: Car to Sale, Salesperson to Sale, and Customer to Sale.

For the three base tables, Car, Salesperson, and Customer, there could be one table in the middle, Carsale, which brought all three together. The idea can be represented using ER modeling. This results in the star shaped design shown on the next overhead. It is reminiscent of the design of the example database.

Car Sales-person Carsale Customer

The relationships are captured by embedding primary keys as foreign keys, and a valid design can be given as follows: Car(vin, make, model, year) Salesperson(spno, spname) Customer(custno, custname) Carsale(vin, spno, custno, date)

What Can Go Wrong? New assumptions are necessary in order to create an example with 4NF problems. The oddness of the assumptions will probably already suggest that the results will be problematic. 1. Assume that it is desirable to keep a record of all prospective customers salespeople have talked to. Under this assumption, also assume that a given customer only deals with one salesperson.

2. Assume also that it’s desirable to keep a record of all cars salespeople have sold. Once the car is sold, it’s not important what customer it was sold to. All that’s of interest is the salesprice, for example, so that commissions can be calculated. 3. Finally, assume that all of this data is to be stored in one table.

This set of assumptions captures the elements of the definition of 4NF. The relationship between salespeople and customers is a multivalued dependency. In other words, it’s a one-to-many relationship, not a functional relationship. If you like notation, it can be indicated with a double-headed arrow: Salesperson—>>Customer

A table capturing this relationship alone would have two columns, one for salesperson and one for customer. Each row would consist of a pairing. In the table overall, a given salesperson could appear more than once.

The relationship between salespeople and cars is also a multivalued dependency. It’s a one-to-many relationship, not a functional relationship. Salesperson—>>Car It too, could be captured in a single table that matched pairs, with salespeople potentially appearing more than once.

The two multivalued dependencies are independent of each other. As spelled out in the assumptions, we’re not keeping a record a salesperson, a car, and the customer that it was sold to. We’re keeping track of cars sold, and independently, prospective customers that have been talked to.

Finally, the kicker assumption is that both of these independent multivalued dependencies are to be capture in a single table. As someone with background in database modeling, you doubtless recognize that there are three entities in the scenario and three tables will be desirable in a solution. Just hang on; we’re not there yet.

An example table is given on the next overhead. This will be followed by explanations.

Salesperson multidetermines the other fields. In other words, there is no functional dependency from salesperson to the other fields. That means that salesperson alone isn’t the key.

The multivalued dependencies are independent of each other. There is no relationship between customers and cars. Therefore, those fields don’t determine each other. Neither one can serve independently as a key.

The conclusion is that the table is all key. Furthermore, there are no dependencies among the constituent parts of the concatenated key. Therefore, it is impossible for this design to violate any of the normal forms up through BCNF, which all depend on stray dependencies.

However, the design is full of anomalies. Look at the data in the table. Salesperson Bob talked to two customers and sold two cars. The result is four rows in the table, each row one of the four possible combinations of customer and car.

Why is this necessary? Because the table is all key, none of the individual fields can be null. In order to record a customer, for example, there has to be a car. Similarly, in order to record a car, there has to be a customer.

Why do every possible pairing? Note that this is incidentally pairing just because the example is based on two values for each field. The reality is that you’re getting a Cartesian product for the multidetermined fields, the number in one, m, times the number in the other, n. Why not do what’s shown on the following overhead instead?

The problem with this is that if you want to delete the customer Jay, you lose the car Mustang—or vice-versa. If you delete the customer Kay, you lose the car Camaro—or vice-versa. So doubling up (doing the Cartesian product) is a consistent way to protect from data loss.

However, the doubling up (Cartesian product) is what causes the classic anomalies. To insert a single customer, you have to insert as many rows as there are cars for that salesperson. To insert a single car, you have to insert as many rows as there are customers for that salesperson.

Now that there are multiple rows for each of the multidetermined fields, updating a single instance means updating every row. Also, deleting a single instance will require deleting many rows.

In short, this design doesn’t violate any of the foregoing normal forms. However, it’s full of awful anomalies. Therefore, 4NF was defined, stating that a design like this is not correct.

The assumptions imply two one-to-many relationships. Deriving a correct design seems pretty straightforward. Have three tables, one each for Salesperson, Customer and Car. Embed the primary key of Salesperson in Customer and in Car.

Books will sometimes tell you that you are unlikely to find a mess like this in practice. Anyone who makes a design by working from 1NF through BCNF, rooting out stray dependencies, will most likely have broken out all entities into their own tables. Even if that hasn’t happened due to normalization, if they reach this point, they are likely to realize intuitively that this is a bad design.

Anyone familiar with database design principles would not try to put three types of entities together in a single table in the first place. On the other hand, people who are unfamiliar with the rules sometimes think that they should try and cram as much information into a single table as possible.

If that happens, then a violation such as this is possible. This is the classic worst-case scenario of the one-table mindset. If you are ever asked to convert a “database” maintained as an Excel spreadsheet into relational form, expect to see nightmares like this.

7. Fifth Normal Form 5NF Definition: Formally (Watson): A relation is in fifth normal form if and only if every join dependency of the relation is a consequence of candidate keys of the relation Informally: See the following commentary and explanation

It is apparent that 5NF is like BCNF in this way: It doesn’t depend on the earlier normal forms. It is a summative definition of what characteristic a correct design should have. Unfortunately, because its definition is so high level and theoretical, what it means is not clear. Like the definition of 4NF, it involves a concept that hasn’t been explained yet.

What is a join dependency? Rather than trying to define this concept, an example will be pursued. After looking at what the example reveals, an informal description of 5NF will be given.

Example Consider the Salesperson, Customer, and Car entities again A different set of assumptions than those given with 4NF will be needed. For the time being, a very general assumption will be stated so that two different designs can be entertained. The assumption is that the three tables aren’t simply related pairwise as shown in the correct design for 4NF

In the 4NF example, there was no direct relationship between Customer and Car But suppose that we’re again interested in capturing some sort of relationship which directly ties together a salesperson, a customer, and a car This could be something like the sale of a car

In the sample database used for SQL practice, this kind of relationship was captured by a star design. The Carsale table was in the middle of the star Each of Car, Customer, and Salesperson were in a one-to-many relationship with it.

You may have realized that there is another way to relate together all three of the base tables. What if each pairing of Salesperson, Customer, and Car were related in an m-m relationship? This results in a design with a cycle in it. This is shown in the ER diagram on the following overhead.

Salesperson-Car Car-Customer Car Sales-person Customer-Salesperson Customer

The design could also be represented in this way: Car(vin, make, model, year) Car-Customer(vinf.k., custnof.k.) Customer(custno, custname) Customer-Salesperson(custnof.k., spnof.k.) Salesperson(spno, spname) Salesperson-Car(spnof.k., vinf.k., date)

In general, designs with cycles in them are difficult to understand. What you want to determine can be stated in two different ways: What assumptions might make a cyclical design rather than a star design correct? What can go wrong with a cyclical design which would make it incorrect for a given situation?

If you traced all of the links in the design with the cycle, you would find that every car is connected to every salesperson is connected to every customer. Put in business terms, the cyclical design for car sales embodies this assumption: At one time or another every salesperson has sold every car to every customer. In other words, all sales that could occur did occur.

Obviously, the foregoing is madness. At this point the car sale example breaks down. It is unrealistic to think that every car would be sold by every salesperson and bought by every customer. It is difficult to imagine a scenario where these new assumptions would hold true. We are dealing with assumptions that simply aren’t realistic for this example.

Cycles vs. Star Models Although not realistic, the scenario is still informative. The cyclical design is contrary to the star design in this way: In the star design the table in the middle captures information for that subset of possible sales that actually occurred.

Finally, this returns to the question of 5NF and what a correct design would be. Suppose you did have a situation where every possible pair of relationships actually does exist. Which design is better, the one with the star or the one with the cycle?

Under this scenario, the table in the middle becomes the 3-way Cartesian product of the primary keys of the three base tables. This harks back to the 4NF example. The problem there was a table consisting of Cartesian products of independent multivalued dependencies

In a sense we now have the converse problem. If we know that each pairwise relationship is valid, why are we creating a table in the middle that forms an unneeded Cartesian product? You can see where this is headed. In a theoretical sense, at least, the design with the cycle is better.

Are there anomalies in the star design under these assumptions? The anomalies aren’t unmanageable roadblocks, but consider this argument: Suppose that Salespeople are related to m Customers and n Cars. Suppose you wanted to add a salesperson to the design with a star.

That would involve adding m x n triplets to the table in the middle. In the cyclical design you would “only” have to manage m additions to the Salesperson-Customer table and n additions to the Salesperson-Car table. In the long run these differences might be of practical significance

A More Realistic Example The book comes up with a reasonably realistic example of where the cyclical design might apply Suppose firms have contracts with consultants Consultants provide advice based on their skill sets And the firms require advice on skills in consultants’ skill sets

Add this assumption to the suppositions above: If a company has a contract with a consultant, and the consultant has a particular skill, and the company requires advice that depends on that skill, then a consultant under contract, by definition, can or will or does provide advice involving the skill. In shorthand, the relationships are “universal” (or promiscuous…)

This is a more general model than saying that a consultant is only contracted to provide advice on a given skill set. Whether it is completely realistic or not, if all of the assumptions hold true, then the cyclical model is the technically correct one.

ER diagrams of this example are given on the following overheads. In the captions of the diagrams, the “rule” referred to is the last assumption given above, which said that the relationships among entities are “universal”

Cycles vs. Linear Models There is another aspect of this question that can be considered. Would it make a difference if you broke the cycle by removing one of the tables in the middle? You could still form a join query that included all of the base tables, but you would definitely lose information from the model.

Suppose you removed “Firm Requires Skill” for example. Now every company would be indiscriminately contracting with every consultant for every skill they had rather than every skill they had that the company required The same kind of argument would apply to removing any of the tables in the middle

This kind of example is rare. However, there are occasionally situations where every entity in every base table is related to every other entity in every other base table. If so, the cyclical design is theoretically better.

A Concrete Illustration Even though the car example isn’t realistic, because it’s familiar, it’s possible to develop a concrete example that illustrates that the cyclical design does ultimately capture the Cartesian product of the base tables The example also illustrates how information is lost if you forget to include one of the tables, making the design a linear one instead of a cyclical one

These tables are shown on the following overeads: Car Customer Salesperson Car-Customer Car-Salesperson Customer-Salesperson The tables in the middle contain the full mXn Cartesian product of the base tables they connect

The following overhead shows a screenshot of the cyclical relationships between the tables.

Here is a join query that goes all the way around the cycle: SELECT * FROM Car, Car-Customer, Customer, Customer-Salesperson, Salesperson, Car-Salesperson WHERE Car.carID = Car-Customer.carID And Car.carID = Car-Salesperson.carID And Customer.customerID = Car-Customer.customerID And Customer.customerID = Customer-Salesperson.customerID And Salesperson.salespersonID = Car-Salesperson.salespersonID And Salesperson.salespersonID = Customer-Salesperson.salespersonID; The results of the query are shown on the following overhead.

The main thing to note about the previous results is that there are 3X3X3 = 27 rows

Here is the same join query with the last joining condition removed: SELECT * FROM Car, Car-Customer, Customer, Customer-Salesperson, Salesperson, Car-Salesperson WHERE Car.carID = Car-Customer.carID And Car.carID = Car-Salesperson.carID And Customer.customerID = Car-Customer.customerID And Customer.customerID = Customer-Salesperson.customerID And Salesperson.salespersonID = Car-Salesperson.salespersonID; The results of the query are shown on the following overhead. They are impossible to read because there are 81 rows.

The query missing one joining condition is the equivalent of a design where one of the tables in the middle is missing In such a design you’ve lost information However, the loss of design information means that when you run the query you get more result records In other words, what’s lost is connecting information that would restrict the results

Finally, an Informal Definition of 5NF Recall the formal definition of 5NF: A relation is in fifth normal form if and only if every join dependency of the relation is a consequence of candidate keys of the relation For the purposes of comparison, also recall the definition of BCNF: A relation is in Boyce-Codd normal form if and only if every determinant is a candidate key

BCNF seems to be directed towards the definition of the correct design of a table internally, without reference to other tables 5NF seems to be directed towards the definition of the correct design of a table in relation or reference to other tables The term “join dependency” hasn’t been defined, but it will be informally defined now

A general description of what 5NF says is the following: A design is correct if two conditions are met: All real relationships between entities are captured by the design; no false relationships between entities are captured by the design.

Empirically, the informal definition can be translated into a statement about queries that involve joining (join dependencies between) 2 or more tables. In a correct design in 5NF, you can write a query that will successfully pull all “real” information (based on the tables and their relationships) out of the db. You can relate this back to the cycle vs. linear model observations. You haven’t left a (pk-fk, join) relationship out of the design.

It is also true that it is not possible to write a query that would pull out results which did not hold true In other words, you haven’t included a“false” (pk-fk, join) relationship in the design) You can relate this back to the cycle vs. star model observations.

If the star was the correct design, you would only be able to pull out those specific triplets that were correct. If you mistakenly implemented a cycle, ultimately you would be able to write a query that in a sense pulled out the full Cartesian product of the base tables. If that is not correct, the model should not make it possible.

General Rules for Normalizing Even if a cycle is theoretically the correct design, for reasons of understandability, you might go with star. The general rule with normal forms is similar to the rule with ER modeling. You take the model to the highest normal form that is desirable. If you haven’t taken it to the highest normal form possible, you document the decision.

The whole sweep of numbered normal forms can be considered now. At the very least you would expect to work a design up through BCNF. 4NF is considered kind of exceptional because when modeling from scratch you would never arrive in that situation.

As noted, you might have to deal with 4NF if you are working with an existing “database” 5NF is the final extreme. Whether you choose to implement 5NF, it is definitely worthwhile to be conscious of the implications of cycles in a final design. If you choose to avoid them, document that decision.

8. Domain Key Normal Form The highest normal form, domain-key normal form (DKNF), is a theoretical statement of which attributes belong in tables and the relationships among the attributes. This form is not numbered. Like BCNF and 5NF, it is a summative normal form. DKNF encompasses all of the other normal forms. A design in DKNF is fully normalized.

DKNF Definition: Formally (Watson): A relation is in domain-key normal form if and only if every constraint on the relation is a logical consequence of the domain constraints and the key constraints that apply to the relation. Informally: Unless you’ve steeped yourself in the relational theory of normalization, this meaning of this statement is virtually impossible to interpret.

DKNF is based on the idea of domains and constraints. These ideas have come up before, but they themselves haven’t been formally defined yet. Practically speaking, even if understood, unlike the numbered normal forms, DKNF does not give you a step-by-step procedure for creating a design, determining whether or not it has violations, and fixing them.

Although this normal form is of little practical use, it raises the ideas of domains and constraints, which are important. They are the basis for correctly capturing relationships between tables. The remainder of this section will consist of remarks on these topics.

Domains and Data Types Domains and data types are not the same thing When a table is created, a complete definition has to tell the data type of each field. Some fields may hold numeric values, some may hold strings of characters, some may hold dates, etc. If a field holds strings of characters, its length, or maximum length also has to be stated.

So, for example, a person’s last name may be defined as containing a maximum of 24 characters. TEXT(24) or CHAR(24) doesn’t tell you that the field holds a name.

The social security number came up earlier illustrating another aspect of this. A social security number has 9 digits. Although the social security number is called a number, it is never used numerically. There is no need to add, subtract, multiply, or divide it, and a good design will prevent that.

The wise choice is to define this field as a character field containing 9 characters where valid characters in this field are limited to digits. The point of this is even that although the name of a field tells you something meaningful about it, the name doesn’t necessarily accurately reflect the preferred data type for it.

Domains Most of the time, the name of a field is descriptive of the kind of information it can hold. So for a “last name” field in a table containing information about people, it is informally clear what this means. In general, a name would consist of a sequence of letters of the alphabet.

Names could come from any language or culture, translated into the English alphabet. Some names do contain numeric information, usually indicated with Roman numerals, for example, John Smith I, John Smith II, etc. It would not be possible to come up with a formula that mathematically defined all possible values. Still, the general idea is clear.

In a technical sense, the term domain refers to the whole set of values that could appear as valid data in that field. However, the domain is a semantic concept. A field’s domain is a description of the meaning of the field and the kind of data it can contain.

This goes back to the description of an attribute and the requirements on it. An attribute captures one characteristic of the entity which it belongs to. Throughout a table, each row of the table should capture the same characteristic of an entity in that field. Different records should not be recording different kinds of information in the same field.

Fields Not on the Same Domain The idea of domains can be further clarified by giving examples of cases where fields are not on the same domain. A person’s last name field may be defined as 24 characters. A city field could also be defined the same way. There may be cases where a person’s name is the same as the name of a city.

There is a city of Lincoln in England. Abraham Lincoln’s ancestors might have came from that area. There is also a city of Lincoln in Nebraska, which was named after Abraham Lincoln. Even though there may be an intersection of the values in the city and last name fields, you could argue that conceptually, city name and person last name are two distinct domains.

On the other hand, you may be maintaining some sort of database where the underlying concept of interest is name itself, whether describing a person or a place. In that case, you could argue that the two fields, city name and person last name, really are on the same domain.

Another example of two fields that are not on the same domain would be social security number and zip code. A full zip code consists of 5 plus 4, or 9 digits, like a social security number. Both might be defined as character fields containing 9 characters.

However, social security numbers and zip codes have nothing in common. There are doubtless cases where someone’s social security number matches some zip code somewhere in the country, but this is purely coincidental. On the other hand, you may be maintaining some sort of database where the underlying concept is random numeric characteristics…

Domains and Primary Key to Foreign Key Relationships The relationship between tables has been explained by the process of embedding the primary key of one table as a foreign key in another. The foreign key table has to have a field with a suitable name, that is defined to hold the same type of data as the primary key field. In other words, the primary key to foreign key relationship requires fields on the same domain.

Fields on Common Domains in General There can be other relationships between tables which are the result of domains, but not the direct result of embedding keys. Going back to one of the earlier examples, a database may distinguish between mothers and children as different kinds of entities, and store them in different tables.

Each of these tables may have social security number fields and last name fields. You would not expect a mother and child to have the same social security number. This would be a mistake.

However, in most cases you would expect mothers and children to have the same last names. The idea is that a social security number is a social security number, regardless of what table it appears in. The idea of a social security number defines a domain.

Similarly, if the last name fields in both the mother and child tables were defined as containing 24 characters, a last name is a last name, regardless of what table it appears in. Incidentally, a last name would be a last name even if the fields were not of the same length

Both social security number and last name define a domains. You might expect to see social security numbers involved in primary key to foreign key relationships. Because names aren’t unique, you don’t expect that of them.

You can have name fields in two different tables. Even though they’re not a pk to fk pair, they are still on the same domain. In theory, if you wanted to, you could write join queries on those fields

The point of this discussion is that a domain is a cross-table concept. Any given database may contain many different fields in its tables, but the database will contain fewer domains than fields because various fields are on the same domain. In a sense, the idea of a domain is more fundamental than the idea of a field. A field is just a manifestation of a domain.

The Definition of DKNY, Again A relation is in domain-key normal form if and only if every constraint on the relation is a logical consequence of the domain constraints and the key constraints that apply to the relation. What are domain constraints? They are constraints on the values that are valid in a field based on the meaning of the field.

Thinking back to SQL, these are some constraints that can be concretely identified in all cases, regardless of the particular field in question: Entity integrity (primary key) Relational integrity (foreign key) Unique Not null

In very general terms, DKNF says that a database is correctly designed if the fields in the tables, the constraints on the individual fields, and the dependencies among the tables are the result of correctly defining the domains and choosing which domain each field is on. It reaches beyond the question of whether one table is correctly designed, because the relationships between tables also depends on domain choice.

The DKNY definition is so high-level and cosmic that it boils down to saying this: A database is correctly designed if and only if it’s correctly designed. If the design stems from and captures the logical meaning of the problem, the design is correct. If the design is correct, then it must be the case that it stems from and captures the logical meaning of the problem…

9. Nulls and Integrity This section is a review of material that was explained earlier. It will not be gone over in class. However, it is provided below in its entirety in case you want to read the overheads yourself.

9. Nulls and Integrity The term “null” refers to the idea that a particular field in a particular record may not have data in it. In general, this is permissible. Cases often arise in practice where the information doesn’t exist or isn’t known.

It would be impractical to insist that all fields always contain data. If that restriction were imposed, people would get around it by putting in bogus values for information that didn’t exist or wasn’t known. However, filling a database with bogus values is not a very good idea.

When a database management system supports null values in fields, it’s important to understand what this does not mean. It does not mean that the fields contain the sequence of characters “null”. It also does not mean that the field contains invisible blanks. Blank spaces themselves are a form of character. What it means is that there is absolutely nothing in the field, and the database management system is able to recognize fields that are in that state.

The term integrity in database management systems refers to the validity and consistency of data entered into a database. The phrase “entity integrity” is the formal expression of a requirement that was stated informally earlier. Entity integrity puts the following requirement on a correctly implemented database: Every table has a primary key field, and no part of the primary key field can be null for any record in the table. Clearly, if all or part of a key were allowed to be null, that would defeat the purpose that the primary key field be the unique identifier for every record in the table.

As seen in the long discussion of normal forms, it is the primary key to foreign key relationships that support the interconnection between related entities that have been separated into different tables by the design process. Once this has been done, it is critically important that the data maintaining the relationships be valid and consistent. The phrase “referential integrity” has the following meaning: Every value that appears in a foreign key field also has to appear as a value in the corresponding primary key field. This can also be stated negatively: There can be no foreign key value that does not have a corresponding primary key value.

The meaning and importance of referential integrity can be most easily explained with a small example showing a violation of it. Consider the tables shown on the following overhead:

Child b, Ann, is shown as having a mother with mid equal to 3. There is no such mother in the Mother table. This is literally nonsense. There is no sense in which this can be correct and this is what referential integrity forbids.

This example also illustrates two other things, which are related to “non-existent” values. Observe that mother 1, Lily does not have any matching records in the Child table. This does not violate referential integrity. It suggests that the Mother table is misnamed, and should be named the Woman table, but it is reasonable to think that you might be recording information about women and children and some women will not have children.

The other thing visible in the table is that child c, June, does not have a mother listed. In other words, the foreign key field is null. This also does not violate referential integrity. As with null in any situation, it may mean that the mother is not known.

Nobody literally doesn’t have a mother, but if the woman table only records information on living women, for example, then for an orphan, the mother “wouldn’t exist”. It is unlikely that you would rename the table “Children and Orphans”—but the idea is that the null value is allowed and this in some sense affects the meaning of what kinds of entities are entered into the table.

Referential integrity leads to one last consideration. The idea behind normalization was to get the stray dependency out of one table and break it into two. The problem with stray dependencies was redundancy and anomalies. By breaking a design into two tables with a primary to foreign key pair, you introduce interrelationship constraints.

Put simply, the question is this: What do you do with foreign key values if the corresponding primary key values in another table are deleted or updated? A fully-featured database management system will enforce referential integrity constraints. The default settings for these constraints are summarized in these two phrases: On delete, restrict; on update, cascade.

If these defaults are implemented, this is a fuller explanation of what they mean in terms of the concrete mother and child example: On delete, restrict: No mother record can be deleted if she has corresponding child records in the other table. To allow the deletion would lead to a referential integrity violation.

On update, cascade: If the primary key value of a mother record is updated, if she has corresponding child records in the other table, the foreign key values in those records is automatically updated to reflect the change. This problem arises less frequently because once a primary key value is assigned to an entity, it is rarely changed.

The End

SQL Unit 15 Normalization

SQL Unit 15 Normalization

Presentation Transcript

SQL Unit 15Normalization

SQL Unit 17 Normalization

SQL Unit 15 Normalization

Unit 15

Database Normalization Sept 15 th 2009

Unit 15

Unit 15

UNIT 15

Unit 15

Unit 15

Unit 15

SQL Normalization Database Design

Unit 3 SQL

Unit-15

UNIT 15

Unit 15

Unit 15

Unit 15

Unit 15

Unit 15

Unit 15