SQL: Queries, Schemas, and Joins- OH MY!!!

SQL: Queries, Schemas, and Joins- OH MY!!! Geoff Noel

You may ask yourself – Well, How did I get here…. David Byrne – Once in a lifetime

Databases Overview Databases come in all different shapes and sizes. They can be flat files of ASCII data (like Access or Q&A) or complex binary tree structures (Oracle or Sybase). In any form, a database is a data store, or a place that holds data. If a database is simply a collection of data, then what keeps track of changes to this data? That is the job of the database management system, or DBMS. Some DBMSs are relational. Those are RDBMS. The relational part refers to the fact that separate collections of data within the reaches of the RDBMS can be looked at together in unison. The RDBMS is responsible for ensuring the integrity of the database. Sometimes, things will get out of whack and the RDBMS will keep all that data in line.

What is a database and database management system. A database is a collection of related data. A database management system (DBMS) is a collection of programs that enables users to create and maintain a database. Why use a database management system? To control redundancy. To restrict unauthorized access. To provide persistent storage for program objects and data structures. To permit inference and actions using rules. To provide multiple user interfaces. To representing complex relationships among data. To enforce integrity constraints. To provide backup and recovery. What Is a Database?

Database Client/Server Methodology The evolution of relational data storage began in 1970 with the work of Dr. E. F. Codd, who proposed a set of 12 rules for identifying relationships between pieces of data. Codd's rules formed the basis for the development of systems to manage data. Today, Relational Database Management Systems (RDBMS) are the result of Codd's vision. Data in an RDBMS are stored as rows of distinct information in tables. A structured language is used to query (retrieve), store and change the data. The Structured Query Language (SQL) is an ANSI standard, and all major commercial RDBMS vendors provide mechanisms for issuing SQL commands.

Single-Tier Database Design The early development of RDBMS applications utilized an integrated model of user interface code, Application code and database libraries. This single binary model ran only on a local machine, typically a mainframe. The applications were simple but inefficient and did not work over LANs. The model did not scale the application and user interface code was tightly coupled to the database libraries. The monolithic single-tier database design

Two-Tier Database Design Two-tier model appeared with the advent of server technology. Communication-protocol development and extensive use of local and wide area networks allowed the database developer to create an application front end that accessed data through a connection (socket) to the back-end server. A two-tier database design, where the client software is connected to the database through a socket connection. The two-tier database design Client programs (applying a user interface) send SQL requests to the database server. The server returns the appropriate results, and the client is responsible for the formatting and display of the data. Clients still use a vendor-provided library of functions that manage the communication between client and server. Most of these libraries are written in either the C language or Perl. Commercial database vendors realized the potential for adding intelligence to the database server. They created proprietary techniques that allowed the database designer to develop macro programs for simple data manipulation. These macros, called stored procedures, can cause problems relating to version control and maintenance. Because a stored procedure is an executable program living on the database, it is possible for the stored procedure to attempt to access named columns of a database table after the table has been changed. For example, if a column with the name id is changed to cust_id, the meaning of the original stored procedure is lost. The advent of triggers, which are stored procedures executed automatically when some action (such as insert) happens with a particular table or tables, can compound these difficulties when the data returned from a query are not expected. Again, this can be the result of the trigger reading a table column that has been altered.

Limitations of Two-Tier Database Design • The vendor-provided library limits them. Switching from one database vendor to another requires a rewrite of a significant amount of code to the client application. • Version control is an issue. When the vendor updates the client-side libraries, the applications that utilize the database must be recompiled and redistributed. • Vendor libraries deal with low-level data manipulation. Typically, the base library only deals with queries and updates of single rows or columns of data. This can be enhanced on the server side by creating a stored procedure, but the complexity of the system then increases. • All of the intelligence associated with using and manipulating the data is implemented in the client application, creating large client-side runtimes. This drives up the cost of each client set.

Three-Tier Database Design In a multi-tier design, the client communicates with an intermediate server that provides a Layer of abstraction from the RDBMS. The intermediate layer is designed to handle multiple Client requests and manage the connection to one or more database servers. There does not have to be just three tiers, but conceptually this is the next step. Three-Tier Database Design

Advantages of Three-Tier Design • It is multithreaded to manage multiple client connections simultaneously. • It can accept connections from clients over a variety of vendor-neutral protocols (from HTTP to TCP/IP), then hand off the requests to the appropriate vendor-specific database servers, returning the replies to the appropriate clients. • It can be programmed with a set of "business rules" that manage the manipulation of the data. Business rules could include anything from restricting access to certain portions of data to making sure that data is properly formatted before being inserted or updated. • It prevents the client from becoming too heavy by centralizing process-intensive tasks and abstracting data representation to a higher level. • It isolates the client application from the database system and frees a company to switch database systems without having to rework the business rules. • It can asynchronously provide the client with the status of a current data table or row.

Different RDBMS’ - History MS SQL Server Sybase/ SQL Server 6.0 6.5 7.0 2000 2005 Oracle 7.x 8.0.x 8.1.x ( aka 8i) 9.2 10G Other Databases • DB2/UDB • NCR Teradata • Ingres • Informix • Sybase • MySQL • Gupta/Centura -SQLbase • DBase • Paradox . . . Many others

Various Methods of Connection ODBCJDBCNativeOLE/ADOBCPSQL-NETSQL-LOADERHPL

Tools Enterprise ManagerDB ArtisanMS QueryAccessCrystalData DictionaryERD (Entity Relationship Diagram) I-SQL SQL WorksheetEnterprise ManagerToadDB ArtisanQuery AnalyzerWin SQLO-SQLSQL-Plus

Schema – By Object Type

Schema – By Owner

SQLbase – SQL Talk utility

SQLbase – SQL COnsole

What is a Schema ? Pronounced -> skee-ma. The structure of a database system, described in a formal language supported by the database management system (DBMS). In a relational Database (RDBMS), the schema defines the tables, the fields in each table, and the Relationships between fields and tables. Schemas are generally stored in a data dictionary. Although a schema is defined in text database language, the term is often used to refer to a graphical depiction of the database structure.

DDL- data definition Tables Columns Data-types Indexes Primary key Foreign key Views Table-spaces Partitions Constraints Synonyms DML- data manipulation Packages Triggers Procedures Functions Constraints Sequences Database Objects

Database actions • S – Select …….. > Queries • U - Update • I - Insert • D - Delete

Table Basics • A relational database system contains one or more objects called tables. The data or information for the database are stored in these tables. Tables are uniquely identified by their names and are comprised of columns and rows. Columns contain the column name, data type, and any other attributes for the column. Rows contain the records or data for the columns. Here is a sample table called "weather". • city, state, high, and low are the columns. The rows contain the data for this table:

Primary Key (PK) A primary key is a table column that can be used to uniquely identify every row of the table. Any column that has this property will do -- these columns are called candidate keys. A table can have many candidate keys but only one primary key. The primary key cannot be null.

Composite Primary Key • A is a primary key consisting of more than one column. In the above example, the combinations (RecordNo,FirstName), (RecordNo,Lastname), (RecordNo,FirstName,Lastname), and (FirstName,LastName) are all candidate keys. Any combination including Age is not a candidate key because it contains a null. • Often, database designers add an extra column to their table designs, a column defined as an integer, which will hold a number. In Microsoft Access, this is an autonumber, in MySQL it's an auto-increment, in Oracle it's a sequence, and in SQL/Server it's an identity column. As these names suggest, this integer is automatically assigned by the database, usually incrementally, sometimes using an initial value and increment that you can specify. Some databases allow these numbers to be generated randomly. • The purpose of this type of automatically generated number is to act as the surrogate primary key, usually in those situations similar to the above where candidate keys are multi-column. The awkwardness of a multi-column candidate key becomes apparent as soon as you define a foreign key on it.

Foreign Key (FK) A foreign key is a column, or combination of columns, that contain values that are found in the primary key of some table (including, possibly, itself). A foreign key may be null, and almost always is not unique.That last statement may be counterintuitive, so let's take another example. Here we have two tables that are related via a foreign key – As you can see, the ClrFK column in the second table is a foreign key to the ClrPK primary key in the first table. Notice that the ClrPK values are unique and not null, but the ClrFK values may be null and often repeat. A null foreign key means that that particular row does not participate in the relationship. The fact that many foreign key values repeat simply reflects the fact that it's a one-to-many relationship. In a one-to-many relationship, the primary key has the "one" value, and the foreign key has the "many" values. The trick to remembering this is to keep in mind that the primary key must be unique.

It Ain’t nothing – it’s NULL Null means either "don't know" or "not applicable" -- it's not really the same as zero or any other default value, but more importantly, null is treated quite differently from other values in SQL, because it literally has no value. Here's an example of a "don't know" null -> As you can see, Fred's value is null, which you could interpret as meaning that Fred didn't take the test (maybe he has a medical exemption and will take the test another day). It would be wrong to assign a zero, because that would be interpreted as Fred having taken the test and not getting a single answer right!

Now consider the following query – SELECT AVG(TestResult) FROM Students; Aggregate functions like AVG() and SUM() ignore nulls, so this query will return (87+73+56+92)/4=77, which is certainly better than 87+73+56+0+92)/5=61.6 which you'd get using a zero default. Often a default value is just wrong for a column where you expect to take aggregates. An example of a column that would take a "not applicable" null is Date Terminated in a human resources database, where the value would be null for all active employees. To test for nulls, you can filter them out in the WHERE clause – SELECT EmployeeID , (DateTerminated - DateHired) AS LengthOfService FROM EmployeeTable WHERE DateTerminated IS NOT NULL which would give results only for terminated employees. If you didn't have the WHERE clause, the above query would return null for every active employee, because any expression involving a null yields a null result. Alternatively, you can use the COALESCE function to supply a non-null value – SELECT EmployeeID , ( COALESCE(DateTerminated,GETDATE()) - DateHired) AS LengthOfService FROM EmployeeTable where GETDATE() returns today's date and therefore provides an accurate measure for the length of service of active employees. So for terminated employees, DateTerminated is not null, and the calculation is the same as above, while for active employees, DateTerminated is null so COALESCE uses today's date instead.

SQL or SEQUEL ? SQL (pronounced "ess-que-el") The acronym SQL stands for Structured Query Language SQL is used to communicate with a database. According to ANSI (American National Standards Institute), it is the standard language for relational database management systems. SQL statements are used to perform tasks such as update data on a database, or retrieve data from a database. Some common relational database management systems that use SQL are: Oracle, Sybase, Microsoft SQL Server, Access, Ingres, etc. Although most database systems use SQL, most of them also have their own additional proprietary extensions that are usually only used on their system. However, the standard SQL commands such as "Select", "Insert", "Update", "Delete", "Create", and "Drop" can be used to accomplish almost everything that one needs to do with a database. Some people claim that this is a bad name, because SQL isn't (properly) structured it's more than just queries (e.g. insert, update, delete) it isn't a real computing language (Very Debatable) In any case, SQL is a database query language that was adopted as an industry standard in 1986. It has undergone two important revisions, SQL2 (also called SQL-92), and SQL3 (also called SQL-99). What SQL?

Selecting Data • The select statement is used to query the database and retrieve selected data that match the criteria that you specify. Here is the format of a simple select statement: select "column1" [,"column2",etc] from "tablename" [where "condition"]; [ ] = optional • The column names that follow the select keyword determine which columns will be returned in the results. You can select as many column names that you'd like, or you can use a "*" to select all columns. • The table name that follows the keyword from specifies the table that will be queried to retrieve the desired results. • The where clause (optional) specifies which data values or rows will be returned or displayed, based on the criteria described after the keyword where. • Conditional selections used in the where clause:

select first, last, city from empinfo; select last, city, age from empinfo where age > 30; • select first, last, city, state from empinfo where first LIKE 'J%'; • select * from empinfo; • select first, last, from empinfo where last LIKE '%s'; • select first, last, age from empinfo where last LIKE '%illia%'; • select * from empinfo where first = 'Eric';

select first, last, city from empinfo; select last, city, age from empinfo where age > 30;

select first, last, city, state from empinfo where first LIKE 'J%';

select * from empinfo;

select first, last, from empinfo where last LIKE '%s';

select first, last, age from empinfo where last LIKE '%illia%';

select * from empinfo where first = 'Eric';

The LIKE pattern matching operator can also be used in the conditional selection of the where clause. Like is a very powerful operator that allows you to select only rows that are "like" what you specify. The percent sign "%“ can be used as wild card to match any possible character that might appear before or after the characters specified. For example: select first, last, city from empinfo where First LIKE 'Er%'; This SQL statement will match any first names that start with 'Er'. Strings must be in single quotes. Or you can specify, select first, last from empinfo where last LIKE '%s'; This statement will match any last names that end in a 's'. select * from empinfo where first = 'Eric'; This will only select rows where the first name equals 'Eric' exactly.

The JOIN concept JOIN is a query clause that can be used with the SELECT, UPDATE, and DELETE data query statements to simultaneously affect rows from multiple tables. There are several distinct types of JOIN statements that return different data result sets. Joined tables must each include at least one field in both tables that contain comparable data. For example, if you want to join a Customer table and a Transaction table, they both must contain a common element, such as CustomerID column, to serve as a key on which the data can be matched. Tables can be joined on multiple columns so long as the columns have the potential to supply matching information. Column names across tables don't have to be the same, although for readability this standard is generally preferred. When you do use like column names in multiple tables, you must use fully qualified column names. This is a “dot” notation that combines the names of tables and columns. For example, if I have two tables, Customer and Transaction, and they both contain the column CustomerID, I’d use the dot notation, as in Customer.CustomerID and Transaction.CustomerID, to let the database know which column from which table I’m referring. Now that we’ve examined the basic theory, let’s take a look at the various types of joins and examples of each.

The basic JOIN statement A basic JOIN statement has the following format: SELECT Customer.CustomerID, TransID, TransAmt FROM Customer JOIN Transaction ON Customer.CustomerID = Transaction.CustomerID; In practice, you'd never use the example above because the type of join is not specified. In this case, SQL Server assumes an INNER JOIN. You can get the equivalent to this query by using the statement: SELECT Customer.CustomerID, TransID, TransAmt FROM Customer, Transaction; However, the example is useful to point out a few noteworthy concepts: • TransID and TransAmt do not require fully qualified names because they exist in only one of the tables. You can use fully qualified names for readability if you wish. • The Customer table is considered to be the “left” table because it was called first. Likewise, the Transaction table is the “right” table. • You can use more than two tables, in which case each one is “naturally” joined to the cumulative result in the order they are listed, unless controlled by other functionality such as “join hints” or parenthesis. • You may use WHERE and ORDER BY clauses with any JOIN statement to limit the scope of your results. Note that these clauses are applied to the results of your JOIN statement. • SQL Server does not recognize the semicolon (;), but I use it in the included examples to denote the end of each statement, as would be expected by most other RDBMSs. Another addition to your SQL toolbox Although the JOIN statement is often perceived as a complicated concept, you will see that it’s a powerful timesaving resource that’s relatively easy to understand. Use this functionality to get related information from multiple tables with a single query and to skillfully reference normalized data. Once you’ve mastered JOINs, you can elegantly maneuver within even the most complex database.

Inner join In relational databases, a join operation matches records in two tables. The two tables must be joined by at least one common field. That is, the join field is a member of both tables. Typically, a join operation is part of a SELECT query. select * from A, B where A.x = B.yThe column names (x and y in this example) are often, but not necessarily, the same.

Outer Join (database)outer join - A less commonly used variant of the inner joinrelational database operation. An inner join selects rows from two tables such that the value in one column of the first table also appears in a certain column of the second table. For an outer join, the result also includes all rows from the first operand ("left outer join", "*="), or the second operand ("right outer join", "=*"), or both ("full outer join", "*=*"). A field in a result row will be null if the corresponding input table did not contain a matching row.For example, if we want to list all employees and their employee number, but not all employees have a number, then we could say (in SQL):SELECT employee.name, empnum.number WHERE employee.id *= empnum.idThe "*=" means "left outer join" and means that all rows from the "employee" table will appear in the result, even if there is no match for their ID in the empnum table.

The notorious CROSS JOIN The CROSS JOIN has earned a bad reputation because it’s very resource intensive and returns results of questionable usefulness. When you use the CROSS JOIN, you're given a result set containing every possible combination of the rows returned from each table. Take the following example: SELECT CustomerName, TransDate, TransAmt FROM Customer CROSS JOIN Transaction; With the CROSS JOIN, you aren’t actually free to limit the results, but you can use the ORDER BY clause to control the way they are returned. If the tables joined in this example contained only five rows each, you would get 25 rows of results. Every CustomerName would be listed as associated with every TransDate and TransAmt. I really did try to come up with examples where this function was useful, and they were all very contrived. However, I’m sure someone out there is generating lists of all their products in all possible colors or something similar, or we wouldn’t have this wonderful but dangerous feature.

The INNER JOIN drops rows When you perform an INNER JOIN, only rows that match up are returned. Any time a row from either table doesn’t have corresponding values from the other table, it is disregarded. Because stray rows aren’t included, you don’t have any of the “left” and “right” nonsense to deal with and the order in which you present tables matters only if you have more than two to compare. Since this is a simple concept, here’s a simple example: SELECT CustomerName, TransDate FROM Customer INNER JOIN Transaction ON Customer.CustomerID = Transaction.CustomerID; If a row in the Transaction table contains a CustomerID that’s not listed in the Customer table, that row will not be returned as part of the result set. Likewise, if the Customer table has a CustomerID with no corresponding rows in the Transaction table, the row from the Customer table won’t be returned.

The OUTER JOINcan include mismatched rows OUTER JOINs, sometimes called “complex joins,” aren’t actually complicated. They are so-called because SQL Server performs two functions for each OUTER JOIN. The first function performed is an INNER JOIN. The second function includes the rows that the INNER JOIN would have dropped. Which rows are included depends on the type of OUTER JOIN that is used and the order the tables were presented. There are three types of an OUTER JOIN: LEFT, RIGHT, and FULL. As you’ve probably guessed, the LEFT OUTER JOIN keeps the stray rows from the “left” table (the one listed first in your query statement). In the result set, columns from the other table that have no corresponding data are filled with NULL values. Similarly, the RIGHT OUTER JOIN keeps stray rows from the right table, filling columns from the left table with NULL values. The FULL OUTER JOIN keeps all stray rows as part of the result set. Here is your example: SELECT CustomerName, TransDate, TransAmt FROM Customer LEFT OUTER JOIN Transaction ON Customer.CustomerID = Transaction.CustomerID; Customer names that have no associated transactions will still be displayed. However, transactions with no corresponding customers will not, because we used a LEFT OUTER JOIN and the Customer table was listed first. In SQL Server, the word OUTER is actually optional. The clauses LEFT JOIN, RIGHT JOIN, and FULL JOIN are equivalent to LEFT OUTER JOIN, RIGHT OUTER JOIN, and FULL OUTER JOIN, respectively.

SQL Subquery • It is possible to embed a SQL statement within another. When this is done on the WHERE or the HAVING statements, we have a subquery construct. What is subquery useful for? First, it can also be used to join tables. Also, there are cases where the only way to correlate two tables is through a subquery. • The syntax is as follows: • SELECT "column_name1" FROM "table_name" WHERE "column_name2" [Comparison Operator] (SELECT "column_name1" FROM "table_name"WHERE [Condition]) • [Comparison Operator] could be equality operators such as =, >, <, >=, <=. It can also be a text operator such as "LIKE." • Let's use the same example as we did to illustrate SQL joins:

SQL: Queries, Schemas, and Joins- OH MY!!!