Advanced Topics in Information Systems - Spring 2013
This course, focusing on cutting-edge topics in information systems, emphasizes conceptual understanding rather than coding skills. Students are encouraged to have a good knowledge of relational databases and SQL, with required readings from "NoSQL Distilled" and "Seven Databases in Seven Weeks." Assessments include two exams and multiple small projects, including building an application to handle large video and audio objects. Class announcements and assignments will be communicated via a dedicated blog, fostering an interactive learning environment.
Advanced Topics in Information Systems - Spring 2013
E N D
Presentation Transcript
Focus of class • Cutting edge topics in information systems • Conceptual material, as opposed to coding • If you don’t have a good knowledge of relational databases and SQL, now is a good time to study it… • A good MySQL book: murachmysql
Class blog & reaching me • http://rogerking.me • All assignments and announcements will be on the blog. • I will not send out class email. • Only read blog entries starting on December 30. • To contact me • buzzking@ymail.com • Do not use my CU mail, please. It is inundated with Spam. • 303 437 7419. • To talk to me, see me after class, or send email for an appointment. • rogerbuzzking on Skype, but I do not stay logged into Skype; you must call or email first.
Required books • books: • NoSQL Distilled and • Seven Databases in Seven Weeks
Assignments and exams • Two exams – material will be from • 2 required books • Slides posted on blog, along with corresponding lectures • Handouts posted on blog • Exam and projects • Midterm – 35 % of final grade • On March 11 • Second exam – 35 % of final grade • This will be on the last schedule day of class, NOT at the final slot • i.e., it will be on May 1 • Small projects/assignments – 30% of final grade total
The First Assignment • Build a simple application using Mongo • Groups of 1 or 2 • Due date to be posted on blog • The application will hold very large video and audio objects • More later…
Suggestions… • Please come to class. • Feel free to suggest references and/or topics. • Speak up at any time – it would be nice to have an interactive class.
Likely and possible topics • Post relational database systems • Full text processing • “Big” data • Databases and XML • Media and databases (images, video, audio, animation) • Personal information systems (not just databases) • Integrating diverse data sets • ???
Review of relational databases • Schema based technology • Small schema • Lots of data • SQL, set-based data manipulation • Multi-language application development • Strict 2 phase ACID transactions • Rigidly structured, identically structured, simply structured data • User defined types extensions for “objects in the small”
Relational basics • Tables (relations) • Columns (attributes) • Domains • Rows (tuples) • Primary and candidate keys • Foreign keys • Null values • Triggers • SQL as DDL and DML
Relational applications • A lot of data, but in amounts no longer considered huge • Transactions manipulating sets of rows • Complete accuracy is required • Not real time or interactive • If distributed, only a modest number of centrally controlled servers • If distributed, there is a single, unified schema • Complex objects built by connecting tables • Logic of data is mostly in schema and SQL, not app • Mix of reads and writes, without writes being minimal • SQL programs (run as transactions) compiled once, and optimized, run many times
Important facts about SQL DBs • Elegant, set theoretic underpinnings • Originally defined with an algebra: • Projection, join, union/difference/intersection, selection • Closed under this algebra • SQL is a tuple calculus based on the algebra • Declarative • Far more compact than algebra • Equally elegant algebra for optimization • Query results are legal view tables • Serious impedance mismatch between o-o languages and SQL
Syntax of SQL SELECT select_list FROM table_sourceWHERE search_condition
Example SELECT invoice_total, payment_total, credit_total, invoice_total - payment_total - credit_total AS balance_due FROM invoices WHERE NOT (invoice_total >= 5000 OR NOT invoice_date <= '2011-08-01')
Example SELECT 'Paid' AS source, invoice_number, invoice_date, invoice_total FROM paid_invoices WHERE invoice_date >= '2011-06-01' ORDER BY invoice_total DESC
Manipulating multiple tables • You can use the Where clause, T.color = R.color • You can use an explicit join • Equijoin, outerjoin SELECT vendor_name, invoice_number, invoice_total FROM vendors LEFT JOIN invoices ON vendors.vendor_id = invoices.vendor_id ORDER BY vendor_name
Aggregates AVG([ALL|DISTINCT] expression) SUM([ALL|DISTINCT] expression) MIN([ALL|DISTINCT] expression) MAX([ALL|DISTINCT] expression) COUNT([ALL|DISTINCT] expression) COUNT(*) – counts nulls
Example SELECT COUNT(*) AS number_of_invoices, SUM(invoice_total – payment_total – credit_total) AS total_due FROM invoices WHERE invoice_total – payment_total – credit_total > 0
Strict 2 phase transactions • Each SQL program is within a begin and end transaction pair • Each transaction has its own workspace for DB items it is going to update • Any transactions that overlap in execution time will appear to have run in some serial order • This is done by transactions requesting read and write locks (also known as shared and exclusive locks) • Read locks can be shared with other readers • Write locks cannot be shared with readers or writers • All locks held until the end of the transaction • They are released and then the changes that a transaction has made are moved to the DB
Serializability & transactions • The net effect is that the transactions that overlap in execution time appear to have run in some serial order • Transactions can be undone by throwing away the local store (conceptually, at least) • The write period at the end of the transaction must be atomic • The two phases: • Request read, write, upgrade locks (and wait on locks) and process • Release locks and move updates to the DB • There is a notion of “serializability” which means that the actual schedule of executed steps corresponds to some serial order of running the transactions
Major concept: DB “state” • Changes at the end of each transaction (potentially) • Relies on 2 Phase transactions • We tradeoff throughput in favor of having safe state changes
Normalization • Start with an entity model • Map to tables • Create PKs and FKs • Create other constraints • Normalize tables • Goals • Minimize redundant data • Minimize “update anomalies” • Minimize misinterpreted db by documenting more constraints • Problems • Skinny tables and more joins • Increases impedance mismatch
Dependencies • FD • We say that ai FD-> aj • Or “ai functionally determines aj” • MVD-> • We say that ai MVD-> aj • Or “ai multivalued determines aj” • Note: the right side of an FD or an MVD can be a set of attributes
Third normal form • First (1NF) The value stored at the intersection of each row and column must be a scalar value, and a table must not contain any repeating columns. • Second (2NF) Every non-key column must depend on the entire primary key. • Third (3NF) Every non-key column must depend only on the primary key.
Fixed 3NF and 4NF • Boyce-Codd (BCNF) A non-key column can’t be dependent on another non-key column. • Fourth (4NF) A table must not have more than one multivalued dependency, where the primary key has a one-to-many relationship to non-key columns.
Example Customer ID Address ZIP 18 112 First 80304 17 123 Ash 80303 16 123 Ash 80303 Break into two tables: Customer ID Address Address Zip
Example Mothers_PhoneFathers_PhoneChild_Name Break into: Mothers_PhoneChild_Name 3030000000 Sue 3031111111 Sue AndFathers_PhoneChild_Name 3032222222 Sue 3033333333 Sue Note: both fields needed for PK
Stored programs • Stored procedures (can be called by an application) • Stored functions (can be called by an SQL program) • Triggers (tied to an operation like INSERT) • Events (tied to a clock)