400 likes | 569 Vues
INFYS540. Normalizing Your Database and Why you WANT to do it!. Lesson 7 Chapter 5 Appendix. Data Redundancy Problems. Redundancy breeds errors Same data defined in multiple places is BAD Spelling/typographical error prone Lack of data integrity Inability to perform simple queries
E N D
INFYS540 Normalizing Your Database and Why you WANT to do it! Lesson 7 Chapter 5 Appendix
Data Redundancy Problems • Redundancy breeds errors • Same data defined in multiple places is BAD • Spelling/typographical error prone • Lack of data integrity • Inability to perform simple queries • Inflexibility and inscalability • Impossible to MAINTAIN!
Relational Database PROJECTCHIEF Project Project Chief Computing 333-22-1111 Intranet987-65-4321 Contracting 123-45-6789 CAT 333-22-1111 EMPLOYEES LNameFName SSNDept Jones Mike 123-45-6789 M&B Smith Tony 987-65-4321 C2G Lee Bruce 567-89-1234 MLD Doodle Yankee 333-22-1111 M&B 1 DEPARTMENTS Dept Dept. DirectorRoom MLD 181-94-5676 B115 C2G 987-65-4321 123 M&B 123-45-6789 147 1 What is a candidate key? What is a primary key? What is a foreign key?
Purpose of Normalization • Take advantage of the powerful tools available in a DBMS • There are five levels of Normalization • The higher the Normal Form the “better” and more efficient the database • But, increasing the levels of Normal Form takes time and effort • For most applications, 3rd Normal Form will solve most potential problems with a DB
Normalizing Database • Process of creating well-structured tables. • Improve performance, integrity of data • 5-step process (w/ 2 rules) to achieve Third Normal Form (3NF) • First two steps put DB into a form so you can normalize it
Rule #1 in Databases Never design redundant data into a Database duplicate data is not consistent duplicate data wastes space
Step 1. Primary Keys • A primary key is one or more data fields (columns) that uniquely identify each record in the table • What would the primary key be below? • “table of employees, assigned to a department.” EMPLOYEES LNameFNameSSNDept Jones Mike 123-45-6789 Math Smith Tony 987-65-4321 M&B Lee Bruce 567-89-1234 Science
Step 1. Primary Keys • Answer: The SSN • It is the only “guaranteed” unique column in the table. Names are easily repeated. EMPLOYEES LNameFName SSNDept Jones Mike 123-45-6789 Math Smith Tony 987-65-4321 M&B Lee Bruce 567-89-1234 Science
Step 1. Primary Keys • Now try the following example: • “A table of projects assigned to employees, listing the project name and the employee’s function on the project.” A Counter --The MS Access Default Key EmpProj Counter SSNProject Function 1 123-45-6789 Dining Designer 2 123-45-6789 Computing Designer 3 987-65-4321 Contracting Designer 4 444-55-6666 Intranet Webmaster 5 222-99-7777 Dining Overwatch
Step 1. Primary Keys • It is the combination of the SSN and the Project fields. Why? EMPLOYEES’ PROJECTS Counter SSNProject Function 1 123-45-6789 Dining Designer 2 123-45-6789 Computing Designer 3 987-65-4321 Contracting Designer 4 444-55-6666 IntranetWebmaster 5222-99-7777 Dining Overwatch
Step 1. Primary Keys • Because, you can have the following: EMPLOYEES’ PROJECTS Counter SSNProject Function 1 123-45-6789 Dining Designer 2 123-45-6789 Dining Designer 3 987-65-4321 Intranet Designer 4 444-55-6666 Intranet Webmaster 5 222-99-7777 Dining Overwatch • Redundant records! (Redundancy = BAD)
Rule #2 about Databases NEVER Use a Counter as a Primary Key
Step 2: Eliminate Many-to-Many Relationships • What is wrong with the following table? • “a table of personnel authorized access to a project” PROJECTS QUERY ACCESS Project Access_1 Access_2 Access_3 Dining 222-99-7777 181-94-5676 Computing 222-99-7777 181-94-5676 Intranet 987-65-4321 818-49-6765 123-45-6789
Step 2: Eliminate Many-to-Many Relationships • Here’s essentially what this table looks like within the Access relationships diagram: Employees: SSN Last Name First Name .... Projects: Project Project Chief Department Access_1 Access_2 Access_3 has access to info about
Step 2: Eliminate Many-to-Many Relationships • Here’s how you model it in a database: • Break it up into two one-to-many relationships Projects: Project Project Chief Department .... Employees: SSN Last Name First Name .... 1 1 Access to Project Info: Project SSN
Step 2: Eliminate Many-to-Many Relationships • How to do it: • The primary key of the new table is the composite of the primary keys of the existing tables. • Primary key of Projects = Project Name • Primary key of Employees = SSN • New table primary key of Project Name and SSN
Step 2: Eliminate Many-to-Many Relationships • No artificial restrictions on number of people with access • You can add attributes about the types of access granted • You can easily query who has access to information about each project PROJ QUERY ACCESS Project SSN Dining 222-99-7777 Dining 181-94-5676 Computing 222-99-7777 Computing 181-94-5676 Intranet 987-65-4321 Intranet 818-49-6765 Intranet 123-45-6789 PROJECT ProjectProjectChief Dept Computing 333-22-1111 MATH Intranet987-65-4321 M&B Contracting 123-45-6789 M&B CAT 333-22-1111 Admin EMPLOYEE LNameFName SSN Jones Mike 123-45-6789 Smith Tony 987-65-4321 Lee Bruce 567-89-1234 Doodle Yankee 333-22-1111
What is wrong with the following? “A table of PCs, which are loaded with many different applications, and assigned to a user.” PCSerial# LoadedSoftware Assigned 10291 Word, Powerpoint, ccMail Jones 10301 Word, Powerpoint, Lotus Notes Smith 10311 Word, LotusNotes, Borland C++ Hacker
Step 3: Achieving 1NF:All Data must be Atomic • “Atomic” - the data occupying a field cannot be further broken down. • i.e., no multi-data entries • i.e., “No attributes can have more than one value for a single instance of an entity” PCSerial# LoadedSoftware Assigned 10291 Word, Powerpoint, ccMail Jones • If not atomic, updating is complex and error prone • If not atomic, can not easily query the database
Step 3 Answer PCSerial# LoadedSoftware Assigned 10291 Word Jones 10291 Powerpoint Jones 10291 ccMail Jones 10301 Word Smith 10301 Powerpoint Smith 10301 LotusNotes Smith 10311 Word Hacker 10311 LotusNotes Hacker 10311 Borland C++ Hacker
Step 3. Achieving 1NF:All Data must be Atomic Another source of redundancy: calculated fields TotalYTD Age DaysRemaining Solution: Use a Query! Remove all calculated fields from table and create a query ...then use the query whenever you need up-to-date data
Step 4. Achieving 2NF:Eliminate Partial Dependencies • What is a partial dependency? • Look at the table. What’s redundant? • “A table of functions an employee is assigned to for a project, and the project chief.” EMPLOYEES’ PROJECTS SSNProject Function Project Chief 123-45-6789 Dining Designer 222-99-7777 123-45-6789 Computing Designer 333-88-5656 123-45-6789 Intranet Member 987-65-4321 987-65-4321 Intranet Designer 987-65-4321 444-55-6666 Intranet Webmaster 987-65-4321 222-99-7777 Dining Overwatch 222-99-7777
EMPLOYEES’ PROJECTS SSNProject Function ProjectChief 123-45-6789 Dining Designer 222-99-7777 123-45-6789 Computing Designer 333-88-5656 123-45-6789 Intranet Member 987-65-4321 987-65-4321 Intranet Designer 987-65-4321 444-55-6666 Intranet Webmaster 987-65-4321 222-99-7777 Dining Overwatch 222-99-7777 Step 4. Achieving 2NF:Eliminate Partial Dependencies • Function depends on the entire primary key: SSN and Project. • ProjectChief is dependent on just a portion of the primary key
Step 4. Achieving 2NF:Eliminate Partial Dependencies • Why is this bad? • Well, what’s wrong with the following? EMPLOYEES’ PROJECTS SSNProject Function Project Chief 123-45-6789 Dining Designer 222-99-7777 123-45-6789 Computing Designer 333-88-5656 123-45-6789 Intranet Member 987-65-4321 987-65-4321 Intranet Designer 987-65-4321 444-55-6666 Intranet Webmaster 222-99-7777 222-99-7777 Dining Overwatch 222-99-7777
Step 4. Achieving 2NF:Eliminate Partial Dependencies • A partial dependency (PD) occurs when a non-key field depends on only a part of the primary key, and not the whole primary key. • PDs are a relation. So, we need a new table..... EMPLOYEES’ PROJECTS SSNProject Function Project Chief 123-45-6789 Dining Designer 222-99-7777 123-45-6789 Computing Designer 333-88-5656 123-45-6789 Intranet Member 987-65-4321 987-65-4321 Intranet Designer 987-65-4321 444-55-6666 Intranet Webmaster 987-65-4321 222-99-7777 Dining Overwatch 222-99-7777
Step 4. Achieving 2NF:Eliminate Partial Dependencies • Here’s how it should look...... EMPLOYEES’ PROJECTS SSNProject Function 123-45-6789 Dining Designer 123-45-6789 Computing Designer 123-45-6789 Intranet Member 987-65-4321 Intranet Designer 444-55-6666 Intranet Webmaster 222-99-7777 Dining Overwatch PROJECTS Project Project Chief Dining 222-99-7777 Computing 333-88-5656 Intranet 987-65-4321
Step 5: Achieving 3NF:Eliminate Transitive Dependencies • What is wrong with the following table? PROJECTS Project Project Chief Dept. Dept. Director Room Dining 222-99-7777 Admin 181-94-5676 B115 Computing 333-88-5656 Admin 181-94-5676 B115 Intranet987-65-4321 M&B 818-49-6765 123 Contracting 187-87-8787 M&B 818-49-6765 123 CAT 333-22-1111 Grounds 123-45-6789 147
Step 5: Achieving 3NF:Eliminate Transitive Dependencies • We have fields dependent on a non-key field: • The Director and Room fields clearly relate to the Dept., and have nothing to do with the project. (Dept is a “determinant” that is not a candidate key) PROJECTS Project Project Chief Dept. Dept. Director Room Dining 222-99-7777 Admin 181-94-5676 B115 Computing 333-88-5656 Admin 181-94-5676 B115 Intranet987-65-4321 M&B 818-49-6765 123 Contracting 187-87-8787 M&B 818-49-6765 123 CAT 333-22-1111 GRND 123-45-6789 147
Step 5: Achieving 3NF:Eliminate Transitive Dependencies • A transitive dependency occurs when a non-key field depends on another non-key field. • Why is this bad?. • A typo appeared in the Contracting line. A database without the transitive dependency would not have allowed this to happen. PROJECTS Project Project Chief Dept. Dept. Director Room Dining 222-99-7777 Admin 181-94-5676 B115 Computing 333-88-5656 Admin 181-94-5676 B115 Intranet987-65-4321 M&B 818-49-6765 123 Contracting 187-87-8787 M&B 818-49-6765 124 CAT 333-22-1111 GRND 123-45-6789 147
Step 5: Achieving 3NF:Eliminate Transitive Dependencies • How to do it: a. Which fields are dependent on a non-key field in the table? (Director, Room) b. Which fields are these dependent on? (Dept) c. Create a new table with (b) as the primary key. d. Put (a) in the new table. e. Remove (a) from the old table.
Step 5: Achieving 3NF:Eliminate Transitive Dependencies • Here are the new tables. PROJECTS Project Project ChiefDept. Dining 222-99-7777 Admin Computing 333-88-5656 Admin Intranet987-65-4321 M&B Contracting 187-87-8787 M&B CAT 333-22-1111 GRND DEPARTMENTS Dept. NameDept. DirectorRoom Admin 181-94-5676 B115 M&B 818-49-6765 123 GRND 123-45-6789 147
Data Analysis: Normalization • An entity is in first normal form (1NF) if there are no attributes that can have more than one value for a single instance of the entity. • An entity is in second normal form (2NF) if it is already in 1NF, and if the values of all non-primary key attributes are dependent on the full primary key – not just part of it. • An entity is in third normal form (3NF) if it is already in 2NF, and if the values of its non-primary key attributes are not dependent on any other non-primary key attributes.
Conclusion • Rule1: Never design redundant data into a database • Rule2: Never use a counter as Primary Key • Identify proper primary keys (1NF) • Break up many-to-many relationships (1NF) • 1NF: Break all data into atomic components • 2NF: Identify/eliminate partial dependencies • 3NF: Eliminate transitive dependencies • Common sense test
What is a Good Data Model? • A good data model is simple. • As a general rule, the data attributes that describe an entity should describe only that entity. • A good data model is essentially non-redundant. • This means that each data attribute, other than foreign keys, describes at most one entity. • A good data model should be flexible and adaptable to future needs. • We should make the data models as application-independent as possible to encourage database structures that can be extended or modified without impact to current programs.
Data and Referential Integrity • There are at least three types of data integrity that must be designed into any database - key integrity, domain integrity and referential integrity. • Key Integrity: • Every table should have a primary key (which may be concatenated). • The primary key must be controlled such that no two records in the table have the same primary key value. • The primary key for a record must never be allowed to have a NULL value.
Data and Referential Integrity • Domain Integrity: • Appropriate controls must be designed to ensure that no field takes on a value that is outside of the range of legal values. • Referential Integrity: • A referential integrity error exists when a foreign key value in one table has no matching primary key value in the related table.
Referential Integrity: • Referential integrity is specified in the form of deletion rules as follows: • No restriction. • Any record in the table may be deleted without regard to any records in any other tables. • Delete:Cascade. • A deletion of a record in the table must be automatically followed by the deletion of matching records in a related table. • Delete:Restrict. • A deletion of a record in the table must be disallowed until any matching records are deleted from a related table. • Delete:Set Null. • A deletion of a record in the table must be automatically followed by setting any matching keys in a related table to the value NULL.
Database Capacity Planning • A database is stored on disk. • The database administrator will want an estimate of disk capacity for the new database to ensure that sufficient disk space is available. • Database capacity planning can be calculated with simple arithmetic as follows. • For each table, sum the field sizes. • This is the record size for the table. • For each table, multiply the record size times the number of entity instances to be included in the table. • This is the table size.
Database Capacity Planning • Database capacity planning can be calculated with simple arithmetic as follows. (continued) • Sum the table sizes. • This is the database size. • Optionally, add a slack capacity buffer (e.g., 10%) to account for unanticipated factors or inaccurate estimates above. • This is the anticipated database capacity.