SQL Unit 15 Normalization

SQL Unit 15Normalization
prepared by Kirk Scott

1. Normal Forms 2. First Normal Form 3. Second Normal Form 4. Third Normal Form 5. Boyce-Codd Normal Form 6. Higher Normal Forms 7. Domains 8. Nulls and Integrity

1. Normal Forms The benefits of relational database theory can be summarized as follows: There is a step-by-step way of arriving at a correct design; there is a way of detecting flaws in a design; the design process has to do with the problem domain, not with computer-related questions;

and finally, if the design is correct, it will be possible to: store all desired information in it; update the information on an ongoing basis; and retrieve the information when needed.

Correct designs are based on what are called normal forms. This section presents the background information to the design process. It also discusses and illustrates the use of normal forms.

At its most basic level, design of a database depends on determining what you want to store information about. When deciding what the base tables will be, you are trying to identify entities. From a language point of view, this involves identifying nouns which do not modify other things.

Identifying the entities leads to identifying their attributes. Attribute names usually end up being nouns too, but you figure out what they are when you try to describe entities, and the descriptions usually involve adjectives. One of the key points of database design is that you only store information about the entities and attributes you need to.

There may be many possible entities, and all entities may have a long list of potential attributes, but you limit yourself to only those things you will need to retrieve information about in the future.

You are familiar with primary keys and foreign keys. When trying to organize the attributes around entities in the design, the idea is to equate an entity with a primary key field and then group the attributes with the entities that they describe.

Relationships between tables are captured by embedding the primary keys of one or more tables as foreign keys in other tables. When described in general, this sounds sensible enough. In practice it can be difficult to do without formal guidelines. This is what the normal forms provide.

The normal forms are based on and described in terms of an idea taken from math. One field in a table may functionally determine another. Stated in reverse order: The other field depends functionally on the one. In math, an example of a function such as this might be: y = f(x), for example y = x2

In this case, x is in the domain and y is in the range. y is a function of x. In other words, x functionally determines y, or y functionally depends on x.

For a mathematical function, you find the dependent value by doing some sort of computation on the determining value. The key point underlying a function is the following: For each value of x, there can only be one corresponding value of y.

The analogy in database design is the following: The primary key of a table should functionally determine the values of the other fields in the table. In other words, the non-key fields should functionally depend on the primary key field. Take the small table on the following overhead for example:

Person(SSN, name, dob)

You don’t find a person’s name or birthdate by doing a computation on their social security number. However, assuming that the data are recorded in the table, given any one social security number, there is exactly one corresponding name and exactly one corresponding date of birth. It is true that different people with different social security numbers may have the same name and the same date of birth, but this is not a problem.

The point of the primary key field is that it is the unique identifier that makes it possible to distinguish between these two people. The bottom line is that the name and date of birth fields functionally depend on the social security number field.

A new notation can be used to indicate this. In this notation, the arrows go from the field that functionally determines another field, to the field that is dependent. This is illustrated on the next overhead.

Some of the normal forms are identified by number, for example 1st, 2nd, and 3rd normal forms. Others are identified by name, for example Boyce-Codd normal form, named after the people who discovered it. These four normal forms are abbreviated 1NF, 2NF, 3NF, and BCNF, respectively. There are also higher normal forms, 4th, 5th, and domain key normal forms (4NF, 5NF, DKNF).

The normal forms have to do with finding dependencies in tables which spring from fields other than the primary key. These dependencies are undesirable and are referred to as stray dependencies. The normal forms make increasingly strict statements about the kinds of stray dependencies that have to be eliminated from correctly designed tables. Designs containing stray dependencies are said to violate the normal forms.

It is not hard to come up with examples of how these conditions might be violated, and then use the normal form definition to come up with an improved design which doesn’t contain the violation. The rule of thumb at every stage is to remove stray dependencies in the following way: Make any field which determines other fields the primary key of a new table, and move the fields that depend on that field to the new table. Make sure that the new table is connected to the old table by a primary key, foreign key pair.

Design problems that are based on violations of normal forms lead to what are called anomalies. The hallmark of a problematic design is that the same information is stored multiple times. In other words, there is redundancy in the database. Depending on the nature of the redundancy, this can lead to problems when inserting data, when updating data, and when deleting data.

The normal forms are discussed below. They are defined informally. Then designs with violations are given, the anomalies that result will be pointed out, and the requirements of the normal form will be used to straighten out the design. The use of normal forms may seem unnecessarily theoretical at first. However, they provide a convenient way of identifying problems in designs and then eliminating them.

2. First Normal Form The various normal forms will be presented in the following way: A definition of the normal form will be given. Then a scenario for information to be held in a database will be given, with the underlying assumptions given. Then an example database design which violates the normal form will be given and it will be shown using a diagram with the notation indicating functional dependencies.

The desired functional dependencies from the primary key will be shown using arrows below the field names. Undesired, stray dependencies, which need to be eliminated in order to correct the design, will be shown using arrows above the field names. Anomalies resulting from the incorrect design will be discussed. Then a corrected design will be given.

All of the examples will be based on the general topic of cars, salespeople, customers, and car sales. Some of the field names are abbreviated, and some of the fields clearly belong together in some way. Here is a little preliminary explanation regarding the fields that will be in the examples. Not all of the fields will appear in all of the examples.

vin: vehicle identification number. Vehicles have makes, models, and years. spno, spname: Salesperson number and name. custno, custname: Customer number and name. A car sale has a salesprice and a date.

1NF can be defined as follows: Data is stored in flat files; there can be no repeating groups in a record. This has been explained in detail in a previous section. The example design uses {} to contain repeating groups.

The assumptions underlying the design are that a salesperson can sell many cars, but each car can only be sold by one salesperson. In this design, each car is only sold once, so the design captures information about the sales of new cars.

Here is the design that violates 1NF: Carsale(spno, spname, {vin, salesprice}) A diagram with arrows illustrating this is given on the next overhead

In general, normal form violations have insert, update, and delete anomalies. It would be possible to analyze such problems with a 1NF violation but it’s not necessary. The repeating group alone is a sufficient problem to make this kind of design incorrect.

As with all normal form violations, the solution is to break out the separate dependency, in this case the information contained in the repeating group, into a separate table. As stated in the assumptions, one salesperson can sell many cars, but each car is sold only once, so there is a 1-m relationship between the two tables in the resulting design.

The primary key of the table containing salesperson information will have to be embedded as a foreign key in the table containing car information. Here is the corrected design: Salesperson(spno, spname) Carsale(vin, salesprice, spnof.k.)

3. Second Normal Form 2NF can be defined as follows: In a table with a concatenated primary key field, there can be no stray dependencies that originate in just part of the primary key field. In this example the underlying assumptions are that the same car can come back to the lot and be sold more than once.

It can be sold by the same salesperson more than once, but not on the same day. It can also be sold by different salespeople at different times. Although unlikely, the design is made so that two different salespeople could sell the same car on the same date.

The design doesn’t contain any information about customers, but the scenario would be that one customer brought the car back, and a different salesperson sold it again. It seems unlikely that the same customer would buy the same car twice, whether on the same date or different dates. In summary, this design works for used car sales and both the date and the salesperson information are needed, along with the car information, to distinguish between different sales.

Here is the design that violates 2NF: Carsale(vin, spno, date, spname) A diagram with arrows illustrating this is given on the next overhead

This faulty design has insert, update, and delete anomalies. Suppose a salesperson has not yet sold a car. In this case, it is not possible to insert information about that salesperson. On the other hand, a salesperson may make many sales. This means that the same information about that salesperson would be stored in more than one record in the table. This is redundancy.

Not only is the redundancy itself wasteful, it leads to the update anomaly. Suppose the salesperson’s name changes. Then it’s necessary to update multiple records to reflect this fact, not just one.

The delete anomaly is related to the insert anomaly. Suppose that as part of the maintenance of the database, on a yearly basis the sales table is cleared. When you delete the last record containing a sale by a particular salesperson, you not only get rid of the sales record, you also lose the salesperson’s name.

As usual, the solution to the problem is to break the stray dependency out into a table of its own. Each car sale has only one salesperson, but each salesperson can be involved in many sales, so this is a 1-m many relationship. The salesperson information is stored in a table by itself, and the primary key of the salesperson table is embedded as a foreign key in the car sale table.

Here is the corrected design: Salesperson(spno, spname) Carsale(vin, date, spnof.k.)

4. Third Normal Form 3NF can be defined as follows: There can be no stray dependencies from one non-key field to another. In this example, for the sake of simplicity, it is assumed that new cars are being sold and they can only be sold once. Information about the customer is also recorded with the sale. Each car can only be bought by one customer. It would be possible for a customer to buy more than one car.

Here is the design that violates 3NF: Carsale(vin, custno, custname, salesprice, date) A diagram with arrows illustrating this is given on the next overhead

This design also has insert, update, and delete anomalies and the pattern of the anomalies is the same as in the previous example. They all stem from the presence of the stray dependency in the design. If you have a potential customer who has not yet bought a car, it is impossible to insert information about that person.

If a customer has bought more than one car, the customer information is stored redundantly. In that case, if the customer’s name changes, it’s necessary to change multiple records. Finally, if the sales table is cleared on a regular basis, when you delete the last sales record for a given customer, you not only get rid of the sales record, you also lose the customer’s name.

There is a situation that can arise in database designs that appears to be a violation of 3NF, but isn’t. The most common example of this situation is a table which includes a city, state, and zip code as part of an address. The postal service has divided up the country into zones which are identified by zip codes.

None of these zones cross city or state boundaries. That means that a zip code determines the city and state. It is not necessary for someone to break this dependency out of their database design. The rule of thumb is that if you are not responsible for maintaining the dependency, then you can ignore it.

The postal service has a table somewhere with zip code as the primary key and all of the descriptive fields about zip code that exist. The post office maintains this. A table not maintained by the postal service can contain addresses with zip codes and completely ignore the fact that there may in reality be a dependency.

5. Boyce-Codd Normal Form A formal statement of BCNF would be somewhat theoretical. Once understood, such a definition would make it clear that BCNF is a summation of 1NF through 3NF which covers one other case which is not covered by the previous normal forms. It is easier to explain BCNF by just presenting this special case and explaining it.

BCNF says that there can be no stray dependencies from a non-key field to a field in the key. For the purposes of this example suppose that the same car can be sold by the same salesperson more than once, but only one sale of that car is possible per date. Suppose also that this dealership has a system for assigning prospective customers to specific salespeople, so that each salesperson is associated with an exclusive list of clients.

It would be normal to assume that this system is implemented in some sort of table. Such a table is not shown here—it will become part of the solution to the problem. The point now is to show the problem this assumption leads to in the table of interest.

Here is the design which violates BCNF: Carsale(vin, spno, date, custno) A diagram with arrows illustrating this is given on the next overhead

The anomalies in this design are analogous to the anomalies in the previous designs. It is not possible to insert information about the relationship between a given customer and salesperson without a sales record which matches them. If the given customer has bought from the same salesperson many times, their relationship is in multiple records.

An update would require changes in multiple records. Finally, if you’re down to the last record containing information about a particular pair, deleting the record would cause the information to be lost. As usual, the solution to the problem is to break out the stray dependency in a separate table.

Here is the corrected design: Carsale(vin, date, custnof.k.) Customer-Salesperson(custno, spno)

The point is that if customers are uniquely associated with a single salesperson, if the car sale record tells you who bought the car, you can then look up the salesperson in the Customer-Salesperson table.

There is another aspect of BCNF that needs to be explained. Consider a design which includes both a university-generated student id number and a social security number. It would seem to violate BCNF as explained above:

Here is the design which seems to violate BCNF: Student(studentIDno, SSN, name) A diagram with arrows illustrating this is given on the next overhead

The additional part of BCNF is that if the stray dependency results from another field which also could have been chosen as a primary key for the table, then it is not a normal form violation. In other words, both studentIDno and SSN are valid, unique identifiers of students. You might want to record both. It is simply necessary to choose one of them as the primary key of the field. The presence of the other one in the table does no harm.

Up through BCNF the normal forms can be explained in terms of stray dependencies. An easy way to remember the requirements for these normal forms is the following statement: Every field in a table has to depend on the key, the whole key, and nothing but the key.

Because they are increasingly strict, the normal forms can be thought of as nested. When checking a design, you begin with the lowest normal form, make sure there are no violations, and move on to the following ones. This is what makes the design process step-by-step.

This idea can be represented using a Venn diagram. The idea is that the set of designs which is in some normal form is always a subset of those designs which meet the conditions for a lower normal form. A diagram of this is shown on the following overhead

1NF 2NF 3NF BCNF…

6. Higher Normal Forms It was claimed earlier that there are only three kinds of relationships: 1-1, 1-m, and m-n. This is not entirely true. There may be many-to-many-to-many relationships (relationships between 3 different types of entities at the same time, m-m-m), and in theory there is no reason why there can’t be relationships among 4 or more different types of entities at the same time. Fourth and fifth normal form, 4NF and 5NF, have to do with cases like these.

The presentation of 4NF will be done in the opposite order to the presentation of the earlier normal forms. First an example of a valid design will be given, and then a statement will be made about the nature of a design that violates 4NF.

Suppose that a given car can be sold more than one time. In other words, you’re dealing in used cars. Suppose also that salespeople can sell more than one different car, and customers can buy more than one different car. This means that there are three 1-m relationships.

For the three base tables, Car, Salesperson, and Customer, there could be one table in the middle, Carsale, which brought all three together. The idea can be represented using ER modeling. This results in the star shaped design shown on the next overhead:

Car Sales-person Carsale Customer

The relationships are captured by embedding primary keys as foreign keys, and a valid design can be given as follows: Car(vin, make, model, year) Salesperson(spno, spname) Customer(custno, custname) Carsale(vin, spno, custno, date)

If someone tried to create a design which had information on all three types of entities, cars, salespeople, and customers, in the same table, this would be a 4NF violation. No example of this is given. After working up through BCNF it should be clear that when analyzing such a table you would find more than one stray dependency. By removing each of the stray dependencies in succession, you would solve the problem.

4NF violations like this are not common. Anyone familiar with database design principles would not try to put three types of entities together in a single table in the first place. On the other hand, people who are unfamiliar with the rules sometimes think that they should try and cram as much information into a single table as possible. If that happens, then a violation such as this is possible.

You may have realized that there is another way to relate all three of the base tables together. What if each pair were related in an m-m relationship? The idea can be represented using ER modeling. This results in the design with a cycle in it shown on the next overhead:

Salesperson-Car Car-Customer Car Sales-person Customer-Salesperson Customer

The design could also be represented in this way: Car(vin, make, model, year) Car-Customer(vinf.k., custnof.k.) Customer(custno, custname) Customer-Salesperson(custnof.k., spnof.k.) Salesperson(spno, spname) Salesperson-Car(spnof.k., vinf.k., date)

This design does not violate 4NF like the previous scenario of cramming all of the information into a single table. The question is, does this design correctly capture all of the assumptions stated above? In general, designs with cycles in them are difficult to understand, and in the context of 4NF, the design with the cycle is not desirable, while the design with the star is desirable.

If you traced all of the links in the design with the cycle, you would find that every car is connected to every salesperson is connected to every customer. Put in business terms, if this design is supposed to represent car sales, at one time or another every salesperson has sold every car and every customer has bought every car. This does not agree with the assumptions underlying the star design, where the one table in the middle captures information for that subset of possible sales that actually occurred.

The cyclical design leads to a brief consideration of 5NF. The question now becomes, if every possible pair of relationships actually does exist, which design is better, the one with the star or the one with the cycle? In this case, the design with the cycle is better.

At this point the car sale example breaks down. It is unrealistic to think that every car would be sold by every salesperson and bought by every customer. However, there are occasionally situations where every entity in every base table is related to every other entity in every other base table.

A general description of what 5NF says is the following: A design is correct if two conditions are met: All real relationships between entities are captured by the design; no false relationships between entities are captured by the design.

7. Domains The highest normal form, domain-key normal form (DKNF), is a theoretical statement of how relationships are formed between tables. This form is not numbered, because the theoretical statement encompasses all of the other normal forms. This theoretical statement does not give you a step-by-step procedure for determining whether or not a design has violations and fixing them.

DKNF is based on the idea of domains, which have not been explained yet. Although this normal form is of little practical use, the idea of domains is important for correctly capturing the relationships between tables, and will be explained.

There is a preliminary point to be made before talking about domains. When a table is created, a complete definition has to tell the data type of each field. Some fields may hold numeric values, some may hold strings of characters, some may hold dates, etc. If a field holds strings of characters, its length, or maximum length also has to be stated. So, for example, a person’s last name may be defined as containing a maximum of 24 characters.

A social security number has 9 digits. Although the social security number is called a number, it is never used numerically. There is no need to add, subtract, multiply, or divide it, and a good design will prevent that. The simple way to do so is to define this field as a character field containing 9 characters where valid characters in this field are limited to digits.

A domain is a semantic concept. Most of the time, the name of a field is descriptive of the kind of information it can hold. So for a “last name” field in a table containing information about people, it is informally clear what this means. Formally, the term domain refers to the whole set of values that could appear as valid data in that field.

In general, a name would consist of a sequence of letters of the alphabet. Names could come from any language or culture, translated into the English alphabet. Some names do contain numeric information, usually indicated with Roman numerals, for example, John Smith I, John Smith II, etc. It would not be possible to come up with a formula that mathematically defined all possible values. Still, the general idea is clear.

The idea of domains can be further clarified by giving examples of cases where fields are not on the same domain. A person’s last name field may be defined as 24 characters. A city field could also be defined the same way. There may be cases where a person’s name is the same as the name of a city. There is a city of Lincoln in England.

Abraham Lincoln’s ancestors probably came from that area. There is also a city of Lincoln in Nebraska, which was named after Abraham Lincoln. Even though there may be an intersection of the values in the city and last name fields, conceptually, city name and person last name are two distinct domains.

Another example of two fields that are not on the same domain would be social security number and zip code. A full zip code consists of 5 plus 4, or 9 digits, like a social security number. Both might be defined as character fields containing 9 characters. However, social security numbers and zip codes have nothing in common. There are doubtless cases where someone’s social security number matches some zip code somewhere in the country, but this is purely coincidental.

Up to this point, the relationship between tables has been explained by the process of embedding the primary key of one table as a foreign key in another. This would mean that when defining the second table, it would have a field with a suitable name, on the same domain, that is defined to hold the same type of data as the first field. This is good as far as it goes, but there can be other relationships between tables which are the result of domains, but not the direct result of embedding keys.

Going back to one of the earlier examples, a database may distinguish between mothers and children as different kinds of entities, and store them in different tables. Each of these tables may have social security number fields and last name fields. You would not expect a mother and child to have the same social security number. This would be a mistake.

However, in most cases you would expect mothers and children to have the same last names. The idea is that a social security number is a social security number, regardless of what table it appears in. The idea of a social security number defines a domain.

Similarly, if the last name fields in both the mother and child tables were defined as containing 24 characters, a last name is a last name, regardless of what table it appears in. The idea of a last name defines a domain. The last name field in both tables has the same meaning even though they are not a primary key, foreign key pair.

As you can see, a domain is a cross-table concept. Any given database may contain many different fields in its tables, but the database will contain fewer domains because various fields are on the same domain. The idea of a domain is more fundamental than the idea of a field.

A field is just a manifestation of a domain. In very general terms, DKNF says that a database is correctly designed if the dependencies among the tables are the result of correct choices of domains for all fields, in particular the domains of the primary and foreign keys of the tables.

8. Nulls and Integrity This section is a review of material that was explained earlier. It will not be gone over in class. However, it is provided below in its entirety in case you want to read the overheads yourself.

8. Nulls and Integrity The term “null” refers to the idea that a particular field in a particular record may not have data in it. In general, this is permissible. Cases often arise in practice where the information doesn’t exist or isn’t known.

It would be impractical to insist that all fields always contain data. If that restriction were imposed, people would get around it by putting in bogus values for information that didn’t exist or wasn’t known. However, filling a database with bogus values is not a very good idea.

When a database management system supports null values in fields, it’s important to understand what this does not mean. It does not mean that the fields contain the sequence of characters “null”. It also does not mean that the field contains invisible blanks. Blank spaces themselves are a form of character. What it means is that there is absolutely nothing in the field, and the database management system is able to recognize fields that are in that state.

The term integrity in database management systems refers to the validity and consistency of data entered into a database. The phrase “entity integrity” is the formal expression of a requirement that was stated informally earlier. Entity integrity puts the following requirement on a correctly implemented database: Every table has a primary key field, and no part of the primary key field can be null for any record in the table. Clearly, if all or part of a key were allowed to be null, that would defeat the purpose that the primary key field be the unique identifier for every record in the table.

As seen in the long discussion of normal forms, it is the primary key to foreign key relationships that support the interconnection between related entities that have been separated into different tables by the design process. Once this has been done, it is critically important that the data maintaining the relationships be valid and consistent. The phrase “referential integrity” has the following meaning: Every value that appears in a foreign key field also has to appear as a value in the corresponding primary key field. This can also be stated negatively: There can be no foreign key value that does not have a corresponding primary key value.

The meaning and importance of referential integrity can be most easily explained with a small example showing a violation of it. Consider the tables shown on the following overhead:

Child b, Ann, is shown as having a mother with mid equal to 3. There is no such mother in the Mother table. This is literally nonsense. There is no sense in which this can be correct and this is what referential integrity forbids.

This example also illustrates two other things, which are related to “non-existent” values. Observe that mother 1, Lily does not have any matching records in the Child table. This does not violate referential integrity. It suggests that the Mother table is misnamed, and should be named the Woman table, but it is reasonable to think that you might be recording information about women and children and some women will not have children.

The other thing visible in the table is that child c, June, does not have a mother listed. In other words, the foreign key field is null. This also does not violate referential integrity. As with null in any situation, it may mean that the mother is not known.

Nobody literally doesn’t have a mother, but if the woman table only records information on living women, for example, then for an orphan, the mother “wouldn’t exist”. It is unlikely that you would rename the table “Children and Orphans”—but the idea is that the null value is allowed and this in some sense affects the meaning of what kinds of entities are entered into the table.

Referential integrity leads to one last consideration. The idea behind normalization was to get the stray dependency out of one table and break it into two. The problem with stray dependencies was redundancy and anomalies. By breaking a design into two tables with a primary to foreign key pair, you introduce interrelationship constraints.

Put simply, the question is this: What do you do with foreign key values if the corresponding primary key values in another table are deleted or updated? A fully-featured database management system will enforce referential integrity constraints. The default settings for these constraints are summarized in these two phrases: On delete, restrict; on update, cascade.

If these defaults are implemented, this is a fuller explanation of what they mean in terms of the concrete mother and child example: On delete, restrict: No mother record can be deleted if she has corresponding child records in the other table. To allow the deletion would lead to a referential integrity violation.

On update, cascade: If the primary key value of a mother record is updated, if she has corresponding child records in the other table, the foreign key values in those records is automatically updated to reflect the change. This problem arises less frequently because once a primary key value is assigned to an entity, it is rarely changed.

The End

SQL Unit 15 Normalization

SQL Unit 15 Normalization

Presentation Transcript

SQL Unit 15Normalization

SQL Unit 15 Normalization

SQL Unit 17 Normalization

Unit 15

Database Normalization Sept 15 th 2009

Unit 15

Unit 15

UNIT 15

Unit 15

Unit 15

Unit 15

SQL Normalization Database Design

Unit 3 SQL

Unit-15

UNIT 15

Unit 15

Unit 15

Unit 15

Unit 15

Unit 15

Unit 15