Databases and SQL

Databases and SQL

Storing Data • The simplest way to store data is in a “flat” file. That is, a file written in plain ASCII text, that you can open and read. For relatively small amounts of data, this simple approach is a good idea. • However, flat files are not very efficient: they must be searched linearly, from beginning to end. Also, flat files take up a lot of space due to redundancy and to blank spaces used to make them more readable. • A better approach for moderate to large amounts of data is to use a relational database such as MySQL (“my ess-que-ell”) or Oracle.

Basic Database Theory • In 1970, IBM researcher E. F. Codd published the seminal paper “A relational model of data for large shared databases”, which specified the basic design principles still used today for designing databases. • Database design can be very complicated: it is a whole specialized field that we are going to touch on only very lightly. • Databases are composed of tables of data. Each row is an “entity”, a single object such as a gene. Entities are also called “records”.Each column is an “attribute”, a property that the entities have, for example the length of the gene or the protein it produces. Attributes are also called “fields”.

Normalization • The basic principle in database design is that we want to enter each unique piece of data a single time only. That way, if it changes we need to only change a single copy instead of having multiple copies that will get out of synch with each other. The process of designing tables to accomplish this is called “normalization”. • There are 3 “normal forms”, or normalization principals.

First Normalization Principal • A database is said to be in “first normal form” if every attribute has exactly one value. “NULL” is an acceptable value. • For example, if you are creating a database of chromosomes, you cannot have a column (attribute) named “genes”, because there are many genes on each chromosome. Instead, you would need to create a separate “genes” table. The genes table would have a column named “chromosome”.

Skip a bunch of theory...

Accessing MySQL Databases • At the command line prompt, type “mysql -u z012345 -p” . The “-u” expects your MySQL user name, and “-p” causes it to prompt you for your password. Note that your MySQL user name and password are not necessarily the same as your regular biolinx password. If all goes well, MySQL will then give you some kind of welcome message. • You need to use a specific database. For the examples here, we will be using the “GL” database or the “bios546db” database. You also have your own personal database, called something like “z012345db”. It takes special permission to create a new database, so I have done that for you. • To access a particular database type “use GL” (or whatever) at the MySQL prompt. • MySQL commands are NOT case-sensitive. Often, command keywords are written in all caps, just to distinguish them from your user-specific values such as table and column names. However, this isn’t necessary. • MySQL commands have to end with a semicolon (;). If you forget this, the system will put an arrow on the next line: ->, which means it is waiting for you to finish the command.

Accessing Data Tables • A database consists of one or more tables. To see which tables are present in the database, type “SHOW TABLES;” for the current database, or “SHOW TABLES FROM bios546db;” if you want to see the tables in bios546db and are in another database. Note that all commands must end in a semi-colon. • The GL database has tables “pets”, “clients”, and “charges”. • Tables contain columns of data. To see the columns and the type of data they contain, type “SHOW COLUMNS FROM table_name;”. Or, you can use the equivalent command “DESCRIBE table_name;”. • The “pets” table has columns pet_id, owner_id, name, species, age, sex, breed, color, temperment, and notes. Note that pet_id is listed as the primary key. This means that pet_id is a unique identifier for each entity (row, individual pet) in this table.

Accessing Data • The basic keyword used to access data is “SELECT”. • The basic syntax is: “SELECT column1, column2, ... FROM table_name WHERE qualifying_clause;” • A typical select statement is something like: “SELECT name FROM pets WHERE owner_id = 3;” This extracts data from the “name” column in the “pets” table, and only gets the data rows in which the “owner_id” column has the value 3. • The wildcard character is “*”, which will give you all the data. Thus, “SELECT * FROM pets;” extracts all data from the table.

WHERE clauses • SELECT statements are based on properly qualifying which data you want to extract. • The equals sign ( = ) matches rows. Automatic conversion of types occurs, so strings and numbers can all be compared with the same symbol. • Not equals is either <> or != • <, <=, >, >= all work • AND or &&, OR or ||, NOT or !: MySQL is very flexible. • You can also use LIKE instead of =, and use “%” as a wildcard character. For example, SELECT name FROM pets WHERE name LIKE ‘B%’; will extract all pet names that start with a B and are followed by any number of other characters. Similarly, SELECT name FROM pets WHERE name LIKE ‘%b’; will find any name that ends in a “b”. • All string values must be quoted. I usually use single quotes, but double quotes also work. In contrast, numerical values are not quoted. On the third hand (the gripping hand), database names, table names, and column names are not quoted. Very annoying, and MySQL is very picky about such things.

Functions in SELECT statements • The WHERE clause can do mathematical, string, and logical operations on either side of the comparison operator (equals sign). • Parentheses can be used to group statements appropriately. • Simple math operations: +, -, *, /, %. For example: SELECT name FROM pets WHERE owner_id % 2 = 0; will give you all the pets whose owners have ID numbers that are even. • MySQL has quite a number of built in functions as well. A few examples: ABS (absolute value), CONCAT (concatenate strings), CEILING (smallest integer larger than the operand), GREATEST (returns largest value from the list following the operator), LENGTH (number of characters in string), LOG (natural logarithm), RIGHT (returns rightmost characters in the string). See the list in the book handout. • IF statements can also be used as part of the WHERE clause. The format is: IF(test, value1, value2). If the “test” is true, value1 is returned; if false, value2 is returned.

Aggregate Functions • SELECT statements can also be used to perform operations on the returned data. The function is written into the list of columns to be used, with the column name in parentheses. • For example: SELECT AVG(age) FROM pets WHERE species = ‘feline’; gives a single value, the average age of all the cats. • Other useful aggregate functions: COUNT, MAX, MIN, SUM, STD (standard deviation). • COUNT is especially useful is determining the size of the database: SELECT COUNT(*) FROM pets;

Joining Data from Two of More Tables • The whole point of relational databases is to separate data into different tables so updates need only be done in one place. • The basic syntax needed to join data from 2 tables is how to refer to a specific column in a specific table. This is done by writing the table name followed by a dot followed by the column name, For example, “pets.name” refers to the “name” column in the “pets” table. • If a column name is unambiguous (i.e. only found in 1 table), you can usually get away with not qualifying it with the table name. • You need to list all the tables that you are using in the FROM clause. • An example: SELECT pets.name, clients.first_name, clients.last_name FROM pets, clients WHERE pets.owner_id = clients.client_id AND species = ‘canine’; The WHERE clause compares every entry in the pets.id column to the entries in clients.client_id, and returns only those rows with a match. MySQL chooses which table to read first based on efficiency. • There are other, fancier ways to join tables, such as the “LEFT JOIN”. I’m not going to talk about them.

Creating Tables • Creating a new database requires special permission, so I have created you each your own database, with name something like “z012345db”. • The basic commands for tables are: CREATE, ALTER, and DROP. • For creating a table, the syntax is: CREATE TABLE name (column1_name column1_type, column2_name column2_type, ...); Note the position of the commas in the column list: each column is given a name and a type, followed by a comma, then the next column name and type. • The statement in parentheses following the table name is the “create clause”, It consists of the column name followed by the column type, followed optionally by modifiers to the type.

Column Types • Column types determine what kind of data is stored and how much space is allocated for it. The basic types are integer, floating point numbers, dates, and text. There are a number of variations of each of these, depending on how long the values are expected to be. The table in the book helps. • In many cases you need to add the length (number of characters) in parentheses following the type. For instance CHAR(20) would set aside 20 characters of space for each data row. Note that for numerical values, the decimal point, minus sign, and “E” (for exponent) each take up a character.

More Column Types • You can get a long way on just a few types. • INT will cover all but very large integers ( between -2,147,483,648 and 2,147,483,647). • the optional parameter “length” gives the number of characters displayed, right-justified and padded with spaces. Thus, INT(5) would print the number 17 as “ 17”. • FLOAT or DOUBLE will cover floating point numbers (single or double-precision, respectively). • you can add optional parameters in the form (M, D), where M is the total number of digits displayed, and D is the number after the decimal point. Note that, unlike printf, the M parameter doesn’t count the decimal point or a “-” at the beginning. Thus, FLOAT(5,2) would print out a value such as “ 4.32”. • CHAR(length) works for text. By default it is case-insensitive. If you want case-sensitive fields, use BINARY(length) instead. The length can be up to 255 characters. • TEXT and BLOB are variable length text fields, case-insensitive and case-sensitive, respectively. They hold up to 64 KB of data. If you need more, MEDIUMTEXT (or BLOB) holds up to 16 MB, and LONGTEXT (or BLOB) holds up to 4 GB. • DATE stores dates in a YYYYMMDD format: 4 digits of the year, followed by a 2 digit month, followed by a 2 digit day. • TIMESTAMP stores the last time that data row was modified. It has quite a few formats. The default stores the time to the second, which I find worthless. I like TIMESTAMP(8), which stores YYYYMMDD format. • As usual, there are other data types--see the book.

Primary Key • Each table needs a primary key, a column with a unique entry for each row. Although you can use anything you like for this, it is usually easiest and safest to use sequentially numbered integers as the ID numbers of the primary key. • The primary key is designated as such in the “create clause”, using the keyword “PRIMARY KEY” after the column type. Also, if you use “AUTO_INCREMENT” after the column type, MySQL will automatically put in the next available value. • For example: CREATE TABLE table_name (id INT PRIMARY KEY AUTO_INCREMENT, name CHAR(20), age INT(3) );

Drop • If you want to remove a table, use DROP TABLE table_name;

Altering Tables • “ALTER TABLE table_name” can be used to add, change, or remove the columns. Various keywords follow the table name: • Adding a new column: ALTER TABLE name ADD (create clause); Use the same type of clauses as in CREATE: column_name column_type. • Dropping a column: ALTER TABLE name DROP column_name; • Changing a column type: ALTER TABLE name MODIFY column_name new_type • Changing a column name: ALTER TABLE name CHANGE old_name new_name column_type. Note that you have to put in the column type as well as the new name, even if the type stays the same! • So, MODIFY is just for changing the column type, leaving the column name the same. CHANGE is for changing the column name as well as its type: with CHANGE you must specify the type as well as the name.

Adding, Altering, and Removing Data • The basic commands for individual data items are INSERT, UPDATE, and DELETE. • If you want to insert a value into every column of a table in the proper order, the command is INSERT INTO table_name VALUES (col1_value, col2_value, col3_value, ...); The values are simply inserted left to right at the next available position in the table. • If your primary key column is an INT with AUTO_INCREMENT (which I highly recommend), use “null” as the value for that column when inserting data. This tells MySQL to invoke the AUTO_INCREMENT function and put in the next integer. • Similarly, the value of a TIMESTAMP column should be “null”, so MySQL will automatically insert the current time. • The “INTO” word isn’t really necessary.

More INSERTs • You can insert values into multiple rows by separating the values for each row (enclosed in parentheses) with commas: INSERT INTO table VALUES (val1a, val2a, val3a), (val1b, val2b, val2c), (val3a, val3b, val3c); • You can take the results of a SELECT statement and insert them into a table: INSERT INTO table (column_name1, column_name2, ...) SELECT statement; • You can insert into specific columns: the values of the other columns will be set to NULL or to a pre-determined default value. There are two ways of doing this: • INSERT INTO table_name (col1_name, col2_name) VALUES (col1_value, col2_value); • INSERT INTO table_name SET col1_name=col_value, col2_name=col2_value;

UPDATE • You can alter data in specific rows using UPDATE. It uses a WHERE clause like the ones used in SELECT statements, and it uses the “SET” keyword as in the previous INSERT statement. • UPDATE table_name SET col3_name=col3_value WHERE col2_name = whatever;

DELETE • Data rows are removed with the DELETE command. WHERE clauses are used here as well. • DELETE FROM table_name WHERE col1_name = some_value; • Note that the entire row is deleted. IF you just want to remove the value of a particular column, use UPDATE and set that column to NULL.

LOAD • To load large amounts of data into a database, the LOAD command can be used. It reads from an external file. • By default, the external file should have each data field (column) separated by tabs and each data entity (row) on a separate line. • You can modify this default behavior by adding the phrases “FIELDS TERMINATED BY ‘termination character’ “, or FIELDS ENCLOSED BY ‘enclosing character’ “, or LINES TERMINATED BY ‘termination character’ “. Note that you can only use one character in these phrases. I often use FIELDS TERMINATED BY ‘,’, a comma as termination character, instead of the default tab character. • Basic syntax: LOAD DATA LOCAL INFILE ‘path_to_file’ INTO TABLE table_name; • Be sure you have permission to read the file!

Indexes • The whole point of using a database, as opposed to a flat file, is to speed up searches. One way to markedly increase search speed is to make an index based on your most common search criteria. • For instance, I have a table of rice genes, containing the gene name, the chromosome it is on, the start and end positions on that chromosome, and its orientation. nearly all searches are based on chromosome and start position. Thus my main index uses these two criteria. • You can make as many indexes as you like. • An index is made from one or more columns, using the syntax CREATE INDEX index_name ON table_name (column_name1, column_name2, ...). • If you use more than one column, you automatically get additional indexes based on all subsets of the columns form left to right. For example, if your created an index with the columns name, rank, and serial_num, you also get an index based on name only, and another one based and name and rank. • When you create a table, an index based on the primary key is automatically constructed.

Commands Used in Class • The commands all are written with ">" in front of them. They relate to the “bios546db” database. • >show databases; • >create table student (id int auto_increment primary key, name char(30), year char(9), gender char(1), up_date timestamp(8) ); • >show columns from student; • >create table bios308 (id int not null, name char(30), quiz1 int, quiz2 int, grade char(1) ); • show columns from bios308; • >alter table student change year class char(9); • >alter table student modify class char(10); • >alter table bios308 add quiz3 int;

More Commands Used in Class • > insert into student values (null, 'fred flintstone', 'junior', 'm' , null); • >select * from student; • >insert into student values (null, 'wilma flintstone', 'senior', 'f', null); • >insert into student values (null, 'betty rubble', 'sophomore', 'f', null), (null, ‘barney rubble’, 'sophomore', 'm', null); • > insert into student (name, gender) values ('bam bam', 'm'); • >update student set name = 'bamm bamm' where name = 'bam bam'; • > update student set class = 'freshman' where name = 'bamm bamm'; • >update student set class = 'freshperson' where class = 'freshman'; • >alter table student modify class char(11); • update student set class = 'freshperson' where class = 'freshperso'; • > delete from student where class = 'senior'; • >insert into bios308 (id, name) select id, name from student where class='junior' OR class='sophomore'; • > update bios308 set quiz1 = 14 where name= 'fred flintstone'; • > update bios308 set quiz2 = 25, quiz3 = 31, grade = 'C' where name = 'fred flintstone'; • > update bios308 set quiz1 = 34, quiz2 = 27, quiz3 = 38, grade = 'B' where name like ‘%tty%'; • LOAD DATA LOCAL INFILE 'mysql_new_students.txt' INTO TABLE student;

The Perl DBI • SQL is a whole language unto itself, and database specialists become masters of it. On the other hand, we are trying to give you the basics of SQL only, and then have you use Perl to deal with many of the fancier manipulations that an SQL master could perform. • The Perl database interface (DBI) uses object-oriented syntax. You need to create objects, mainly database and statement handles, then use their associated methods with the arrow (->) operator. Note this arrow uses the hyphen ( - ). It is different from the => operator that substitutes for the comma when creating hashes. • Although object-oriented programming is a whole philosophy, it can easily be used without any further knowledge if you just consider it to be a slightly altered syntax.

Connecting to MySQL from Perl • You need to use the DBI module: put “use DBI;” at the top of your program, under “use strict;”. • You need to create a database handle, which we will call $dbh. This is a scalar variable that holds a reference to the data in the database object. my $dbh = DBI->connect(database, name, password); • Here, the “database” must be fully qualified: “dbi:mysql:bios546db” would connect you to the bios546db database. • Your personal MySQL databases don’t have a password, so just use “” for that field. • So, to access the z01235db database through Perl, use the command: my $dbh = DBI->connect(“dbi:mysql:z012345db”, “z012345”, “”); • After you are done using the database, you should disconnect from it with the command $dbh->disconnect;

Do Statements • For all commands except SELECT and SHOW, the “do” statement can be used. The syntax is: $dbh->do(“your SQL statement”); .Note that the SQL statement inside the parentheses doesn’t need to end in a semicolon. As usual, the line of Perl code must still end in a semicolon. • For example: $dbh->do(“INSERT INTO student VALUES (null, ‘Mr. Ed., ‘horse’, ‘m’, null)”); The return value of the statement is the number of rows affected. The NULL values here are for the auto_increment primary key and for the timestamp. • MySQL is annoyingly picky about quotes. Table and column names are NOT quoted, but string values are quoted. Also, numerical values are not quoted.

SELECT Statements • For each new SELECT statement you need to prepare a statement handle, then execute it, then fetch the results. You can use the same statement handle repeatedly once it is prepared. my $sth = $dbh->prepare(“SELECT * from student”); $sth->execute; while (my @row = $sth->fetchrow_array) { print “@row\n”; } • The fetchrow_array command returns an array consisting of all the fetched values in the data row. It does this repeatedly for each row that matches the select statement. You can of course access individual elements of the array.

More on SELECT • A quick alternative to fetchrow_array is “dump_results”, which just prints all the fetched data at once, with each data row on a new line. $sth->dump_results; • Statement handles should be closed when you are done with them: $sth->finish; • SHOW statements work the same way as SELECT statements: first prepare them, then execute, then fetch with fetchrow_array or dump_results.

Binding Parameters • It is possible to prepare a statement without knowing all the values in advance. You can add the values when you execute it. With this technique, different values can be put into the same statement handle during different executions. • To do this, use question marks (?) in place of the values in the “prepare” statement. • Then, put the necessary values as a list of parameters in the execute statement. The values must appear in the same order that the question marks are in. • This is the easiest way to do multiple INSERT statements into the same table. Instead of using “do”, prepare a statement handle for the INSERT statement once, then execute it repeatedly for different values of the data. • For example, assume that the INFILE consists of lines of data, separated by commas: my $sth = $dbh->prepare(“INSERT INTO student VALUES (null, ?, ?, ?, null)”); while (my $line = <INFILE>) { my ($name, $class, $gender) = split /,/, $line; $sth->execute($name, $class, $gender); }

Databases and SQL