Fundamentals of Relational Database Design andDatabase Planning J.Trumbo Fermilab CSS-DSG
Outline • Definitions • Selecting a dbms • Selecting an application layer • Relational Design • Planning • A very few words about Replication • Space
DefinitionsWhat is a database? A database is the implementation of freeware or commercial software that provides a means to organize and retrieve data. The database is the set of physical files in which all the objects and database metadata are stored. These files can usually be seen at the operating system level. This talk will focus on the organize aspect of data storage and retrieval. Commercial vendors include MicroSoft and Oracle. Freeware products include mysql and postgres. For this discussion, all points/issues apply to both commercial and freeware products.
DefinitionsInstance A database instance, or an ‘instance’ is made up of the background processes needed by the database software. These processes usually include a process monitor, session monitor, lock monitor, etc. They will vary from database vendor to database vendor.
DefinitionsWhat is a schema? A SCHEMA IS NOT A DATABASE, AND A DATABASE IS NOT A SCHEMA. A database instance controls 0 or more databases. A database contains 0 or more database application schemas. A database application schema is the set of database objects that apply to a specific application. These objects are relational in nature, and are related to each other, within a database to serve a specific functionality. For example payroll, purchasing, calibration, trigger, etc. A database application schema not a database. Usually several schemas coexist in a database. A database application is the code base to manipulate and retrieve the data stored in the database application schema.
Definitions Cont.Primary Definitions • Table, a set of columns that contain data. In the old days, a table was called a file. • Row, a set of columns from a table reflecting a record. • Index, an object that allows for fast retrieval of table rows. Every primary key and foreign key should have an index for retrieval speed. • Primary key, often designated pk, is 1 or more columns in a table that makes a record unique.
Definitions Cont.Primary Definitions • Foreign key, often designated fk, is a common column common between 2 tables that define the relationship between those 2 tables. • Foreign keys are either mandatory or optional. Mandatory forces a child to have a parent by creating a not null column at the child. Optional allows a child to exist without a parent, allowing a nullable column at the child table (not a common circumstance).
Definitions Cont.Primary Definitions Entity Relationship Diagram or ER is a pictorial representation of the application schema.
Definitions Cont.Primary Definitions Constraints are rules residing in the database’s data dictionary governing relationships and dictating the ways records are manipulated, what is a legal move vs. what is an illegal move. These are of the utmost importance for a secure and consistent set of data.
Definitions Cont.Primary Definitions Data Manipulation Language or DML, sql statements that insert, update or delete database in a database. Data Definition Language or DDL, sql used to create and modify database objects used in an application schema.
Definitions Cont.Primary Definitions A transaction is a logical unit of work that contains one or more SQL statements. A transaction is an atomic unit. The effects of all the SQL statements in a transaction can be either all committed (applied to the database) or all rolled back (undone from the database), insuring data consistency.
Definitions Cont.Primary Definitions • A view is a selective presentation of the structure of, and data in, one or more tables (or other views). A view is a ‘virtual table’, having predefined columns and joins to one or more tables, reflecting a specific facet of information.
Definitions Cont.Primary Definitions Database triggers are PL/SQL, Java, or C procedures that run implicitly whenever a table or view is modified or when some user actions or database system actions occur. Database triggers can be used in a variety of ways for managing your database. For example, they can automate data generation, audit data modifications, enforce complex integrity constraints, and customize complex security authorizations. Trigger methodology differs between databases.
Definitions Cont.Primary Definitions Replication is the process of copying and maintaining database objects, such as tables, in multiple databases that make up a distributed database system. Backups are copies of the database data in a format specific to the database. Backups are used to recover one or more files that have been physically damaged as the result of a disk failure. Media recovery requires the restoration of the damaged files from the most recent operating system backup of a database. It is of the utmost importance to perform regularly scheduled backups.
Definitions Cont. Mission Critical Applications An application is defined as mission critical, imho, if 1. there are legal implications or financial loss to the institution if the data is lost or unavailable. 2. there are safety issues if the data is lost or unavailable. 3. no data loss can be tolerated. 4. uptime must be maximized (98%+).
Definitions Cont. ‘large’ or ‘very large’ or ‘a lot’ Seems odd, but ‘large’ is a hard definition to determine. Vldb is an acronym for very large databases. Its definition varies depending on the database software one selects. Very large normally indicates data that is reaching the limits of capacity for the database software, or data that needs extraordinary measures need to be taken for operations such as backup, recovery, storage, etc.
Definitions Cont. Commercial databases do not a have a practical limit to the size of the load. Issues will be backup strategies for large databases. Freeware does limit the size of the databases, and the number of users. Documentation on these issues vary widely from the freeware sites to the user sites. Mysql supposedly can support 8T and 100 users. However, you will find arguments on the users lists that these numbers cannot be met.
Selecting a DBMS Many options, many decisions, planning, costs, criticality. For lots of good information, please refer to the urls on the last slides. Many examples of people choosing product.
Selecting a DBMSHow do I Choose? Which database product is appropriate for my application? You must make a requirements assessment. Does you database need 24x7 availability? Is your database mission critical, and no data loss can be tolerated? Is your database large? (backup recovery methods) What data types do I need? (binary, large objects?) Do I need replication? What level of replication is required? Read only? Read/Write? Read/Write is very expensive, so can I justify it?
Selecting a DBMSHow do I Choose? Cont. If your answer to any of the above is ‘yes’, I would strongly suggest purchasing and using a commercial database with support. Support includes: • 24x7 assistance with technical issues • Patches for bugs and security • The ability to report bugs, and get them resolved in a timely manner. • Priority for production issues • Upgrades/new releases • Assistance with and use of proven backup/recovery methods
Selecting a DBMSThe Freeware Choice Freeware is an alternative for applications. However, be fore warned, support for these databases is done via email to a ad hoc support group. The level of support via these groups may vary over the life of your database. Be prepared. Also expect less functionality than any commercial product. See http://www-css.fnal.gov/dsg/external/freeware/
Selecting a DBMSThe Freeware Choice Freeware is free. Freeware is open source. Freeware functionality is improving. Freeware is good for smaller non-mission critical applications.
Selecting an Application Layer Again, planning takes center stage. In the end you want stability and dependability. • How many users need access? • What will the security requirements be? • Are there software licensing issues that need consideration? • Is platform portability a requirement? • Two tier or three tier architecture?
Selecting an Application Layer • Direct access to the database layer? (probably should be avoided) • Are you replicating? How? Where? With what? • There are no utilities that will port data from 1 database to another (i.e., postgres to mysql). if database portability is a requirement, an independent code must be written to satisfy this requirement.
Selecting an Application Layer Cont. Application maintenance issues • People availability, working with users as a team, talent, and turnover? (historically a huge issue) • A ‘known’ or ‘common’ language? • Freeware? Bug fixes, patches…are they important and timely? • Documentation? Set standards, procedures, code reviews making sure the documentation exists and is clear. • Is the application flexible enough to easily accommodate business rule changes that mandate modifications? • The availability of an ER diagram at this stage is invaluable. We consider it a must have. • There are no utilities to port data from 1 type of db to another. This lack of portability means a method to move data between databases • must be written independently.
Selecting an Application Layer Misc. application definitions… This presentation is not an application presentation, but I will mention a few terms you may hear. Sql the query language for relational databases. A must learn. ODBC, open database connectivity. The software that allows a database to talk to an application. JDBC, java database connectivity.
Relational Design The design of the application schema will determine the usability and query ability of the application. Done incorrectly, the application and users will suffer until someone else is forced to rewrite it.
Relational DesignThe Setup The database group has a standard 3 tier infrastructure for developing and deploying production databases and applications. This infrastructure provides 3 database instances, development, integration and production. This infrastructure is applicable to any application schema, mission critical or not. It is designed to insure development, testing, feedback, signoff, and an protected production environment. Each of these instances contain 1 or more applications.
Relational DesignThe Setup The 3 instances are used as follows: • Development instance. Developers playground. Small in size compared to production. Much of the data is ‘invented’ and input by the developers. Usually there is not enough disk space to ever ‘refresh’ with production data.
Relational Design Cont.The Setup 2. The integration instance is used for moving what is thought to be ‘complete’ functionality to a pre production implementation. Power users and developers work in concert in integration to make sure the specs were followed. The users should use integration as their sign off area. Cuts from dev to int are frequent and common to maintain the newest releases in int for user testing.
Relational Design Cont.The Setup 3. The production instance, real data. Needs to be kept pure. NO testing allowed. Very few logons. The optimal setup of a production database server machine has ~3 operating system logons, root, the database logon (ie oracle), and a monitoring tool. In a critical 24x7 supported database, developers, development tools, web servers, log files, all should be kept off the production database server.
Relational Design Cont.The Setup Let’s talk about mission critical & 24x7 a bit. • To optimize a mission critical 24/7 database, the database server machine should be dedicated to running the database, nothing else. • All software products need maintenance and downtime. Resist putting software products on the db server machine so that their maintenance does not inhibit the running of the database. Further, if the product breaks, it could inhibit access to the database for a long period. Example, a logging application, monitoring users on the db goes wild, fills all available space and halts the database. If this logging app. were not on the dbserver machine, the db would be unaffected by the malfunction.
Relational Design Cont.The Setup 3. All database applications and database software require modifications. Most times these modification require down time because the schema or data modifications need to lock entire tables exclusively. If you are sharing your database instance with other many other applications, and 1 of those applications needs the database for an upgrade, all apps may have to take the down time. Avoid this by insuring your 24/7 database application is segregated from all other software that is not absolutely needed. In that way you insure any down times are specific to your cause.
Our 1st relational example A cpu can house 1 or more databases schema applications in d0ofprd1 (sam, runs, calib) Databases on d0ora2 (d0ofprd1, d0ofint1) CPU (d0ora2) schema applications in d0ofint1 (sam, runs, calib) An database can accommodate 1 or more instances An instance may contain 1 or more application schemas
What is a schema? One implements a schema by running scripts. These scripts can be run against multiple servers and should be archived.
Relational DesignGetting Started Using your design tool, you will begin by relating objects that will eventually become tables. All the other schema objects will fall out of this design. You will spend LOADS of time in your design tool, honing, redoing, reacting to modifications, etc. The end users and the designers need to be working almost at the same desk for this process. If the end user is the designer, the end user should involve additional users to insure an unbiased and general design. It is highly suggested that the design be kept up to date for future documentation and maintainers. Tables are related, most frequently in a 0 to many relationship. Example, 1 run will result in 0 or more events. Analyzing and defining these relationships results in an application schema.
What will a good schema design buy you? I am afraid the 80% planning 20% implementation rule applies. Gather requirements. • Discovery of data that needs to be gathered. • Fast query results • Limited application code maintenance • Data flexibility • Less painful turnover of application to new maintainers. • Fewer long term maintenance issues.
Write a requirements document. You will not be able to anticipate all requirements, but a document will be a start. A well designed schema naturally allows for additional functionality. Who are the users? What is their mission? Identify objects that need to be stored/tracked. Think about how objects relate to each other. Do not be afraid to argue/debate the relationships with others. Relational DesignLet’s get started
Relational DesignSo how do you get there? Design tools are available, however, they do not think for you. They will give you a clue that you are doing something stupid, but it won’t stop you. It is highly recommended you use a design tool. A picture says 1000 words. Create ER, entity relationship, diagrams. Get a commitment from the developer(s) to see the application through to implementation. We have seen several applications redone multiple times. A string of developers tried, left the project, and left a mess. A new developer started from scratch because there was no documentation or design.
Relational DesignHow do I get there? Adhere to the recommendations of your database vendor for setup and architecture. Don’t be afraid to ask for help or to see other examples. Don’t be afraid to pilfer others design work, if it is good, if it closely fits your requirements, then use it. Ask questions, schedule reviews with experts and users. Work with your hardware system administrators to insure you have the hardware you need for the proposed job to be done.
Relational DesignCommon Mistakes Mistakes we see ALL the time • Do not design your schema around your favorite query. A relational design will enable all queries to be speedy, not only your favorite. • Don’t design the schema around your narrow view of the application. Get other users involved from the start, ask for input and review.
Relational DesignCommon Mistakes • Create a relational structure, not a hierarchical structure. The ER diagram should not necessarily resemble a tree or a circle. It is the logical building of relationships between data. Relationships flow between subsets of data. The resulting ER diagram’s ‘look’ is not a standard by which one can judge the quality of the design.
Relational DesignCommon Mistakes • Do not create 1 huge table to hold 99% of the data. We have seen a table with 1100+ columns…unusable, unqueryable, required an entire application rewrite, took over a year, made 80 tables from the 1 table. • Do not create separate schemas for the same application or functions within an application. • Use indices and constraints, this is a MUST!
Relational DesignExamples of Common Mistakes • Using timestamp as the primary key assumes that within a second, no other record will be inserted. Actually this was not the case, and an insert operation failed. Use database generated sequences as primary keys and NON-UNIQUE index on timestamp. • A table with more than 900 columns. Such design will cause chaining since each record is not going to fit in one block. One record spanning many blocks, thus chaining, hence bad performance.
Relational DesignExamples of Common Mistakes • Do not let the application control a generated sequence. Have seen locking issues, and duplicate values issues when the application increments the sequence. Have the database increment/lock/constrain the sequence/primary key. That is why the databases have sequence mechanisms, use them. • Use indices! An Atlas table with 200,000 rows, halted during a query. Reason? No indices. Added a primary key index, instantaneous query response. Indices are not wasted space!
Relational DesignExamples of Common Mistakes USE DATABASE CONSTRAINTS!!!!!! Have examples where constraints were not used, but ‘implemented’ via the api. Bugs in the api allowed data to be deleted that should not have been deleted, and constraints would have prevented the error. Have also seen apis error with ‘cannot delete’ errors. They were trying to force an invalid delete, luckily the database constraints saved the data.