Advanced Database Management: Relational Models and Practical Applications

Lecture 1: Introduction AnHai Doan CS 511 Fall 05

Welcome to CS 511, Advanced Database Management! • Instructor: AnHai Doan, anhai@cs • 2118 Siebel • Office hours: WF 3:15-4:15 (right after each lecture) • No office hours today • Home page: google for "cs511 uiuc" • Texts and readings: • Hellerstein and Stonebraker: Readings in Database Systems, 4th ed. • Can be ordered online • Supplementary papers (will be linked via schedule)

More Administrivia • TAs • Yoonkyong Lee (also the I2CS TA) • Gorvind Kabra • Their office hours will be announced in newsgroup • class.cs511 is very important • all announcements will appear there • Slides • integrated from those of Kevin Chang, Zack Ives, Jeff Naughton • will try to have them posted before each lecture

What to Do if You are Confused? • Admin issues: • ask the newsgroup • ask your team members • ask your TA (I2CS students ask Yoonkyong) • ask me • Academic issues: • same thing

Course Objectives • Study fundamental lessons in relational data management • Examine how to adapt those lessons to other settings • data integration, data mining, IR/Web search • Why? • the painter and the flower girl • Can't learn lessons without knowing gory details

Prerequisites • Need cs 411-equivalent background • first homework will evaluate this • Strong programming skills

Course Format • I will lecture twice a week • Before each lecture you read a paper and send a brief review to the newsgroup (can miss up to 3) • Attending lecture is required (can miss up to 3) • Participating in discussion is required • Homework 1: evaluate your cs-411 background (individual) • Homework 2: programming on DBlife (team) • Programming project: team, on DBlife • Each team presents 1-2 papers in Nov. • Take-home final • At the end, you should be equipped to do research in this field, or to take ideas from databases and apply them to your field

Grading • Participation: 10% • reviews, discussion • attendance (not required for I2CS) • Two homeworks: 20% • Project: 35% • Presentation: 10% • Final exam: 25%

Rough Schedule • September • relational model, 2 homeworks • October • data integration • November • data mining • IR/Web search

For the rest of this lecture: Let's talk about the relational model and its lessons. As a sample opinion: see Zack's slides following this.

So What Is This Course About? Not how to build an Oracle-driven Web site… … nor even how to build Oracle…

What Is Unique about Data Management? • It’s been said that databases and data management focus on scalability to huge volumes of data • What is it that makes this possible – and what makes the work interesting if NOT at huge scale? • Why are data management techniques useful in situations where scale isn’t the bottleneck?

The Key Principle: Data Independence • Most methods of programming don’t separate the logical and physical representations of data • The data structures, access methods, etc. are all given via interfaces! • The relational data model was the first model for data that is independent of its data structures and implementation

What Is Data Independence? • Codd points out that previous methods had: • Order dependence • Index dependence • Access path dependence • What might you be able to do in removing those?
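One way to see what removing index dependence buys: the logical query stays fixed while the physical design changes underneath it. The sketch below uses Python's built-in sqlite3 module; the `student` table, its rows, and the index name are hypothetical, chosen only for illustration.

```python
import sqlite3

# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (sid INTEGER, name TEXT, year INTEGER)")
conn.executemany(
    "INSERT INTO student VALUES (?, ?, ?)",
    [(3, "Cy", 2), (1, "Ann", 1), (2, "Bo", 2)],
)

# The logical query: it names no index, access path, or storage order.
QUERY = "SELECT name FROM student WHERE year = 2 ORDER BY name"

# Run it before any index exists.
before = conn.execute(QUERY).fetchall()

# Change the physical design only; the query text is untouched.
conn.execute("CREATE INDEX student_year ON student (year)")
after = conn.execute(QUERY).fetchall()

print(before == after)
```

The application code never mentions the index, so the index can be added or dropped without rewriting any query.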

The Relational Data Model More than just tables! • True relations: sets of tuples • The only data representation a user/programmer “sees” • Explicit encoding of everything in values Additional integrity constraints • Key constraints, functional dependencies, … General and universal means of encoding everything! • (Semantics are pushed to queries) A secondary concept: views • Define virtual, derived relations that are always “live” • A way of encapsulating, abstracting data
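The idea that a view is a virtual, always "live" relation can be sketched with sqlite3. The `takes` table and the `enrollment` view are hypothetical names. Note also that SQL tables are bags rather than true sets of tuples, so this only approximates the model described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE takes (sid INTEGER, cid TEXT)")

# A view stores a definition, not data; it is re-derived on every use.
conn.execute(
    "CREATE VIEW enrollment AS "
    "SELECT cid, COUNT(*) AS n FROM takes GROUP BY cid"
)

conn.executemany("INSERT INTO takes VALUES (?, ?)", [(1, "cs511"), (2, "cs511")])
first = conn.execute("SELECT n FROM enrollment WHERE cid = 'cs511'").fetchone()[0]

# Update the base relation; the view reflects the change immediately.
conn.execute("INSERT INTO takes VALUES (3, 'cs511')")
second = conn.execute("SELECT n FROM enrollment WHERE cid = 'cs511'").fetchone()[0]

print(first, second)
```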

Constraints and Normalization • Fundamental idea: we don’t want to build semantics into the data model, but we want to be able to encode certain constraints • Functional dependencies, key constraints, foreign-key constraints, multivalued dependencies, join dependencies, etc. • Allows limited data validation, plus opportunities for optimization • The theory of normalization (see CSE 330, CIS 550) makes use of known constraints • Idea: eliminate redundancy, in order to maintain consistency in the presence of updates • (Note that there’s no reason for normalization of data in views!) • Ergo, XML???
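To make "functional dependency" concrete, here is a minimal sketch of a checker that tests whether a dependency lhs -> rhs holds in a given set of rows. The function name `holds` and the sample rows are assumptions made for illustration. A real system declares such constraints in the schema rather than testing them after the fact.

```python
def holds(rows, lhs, rhs):
    """Return True if the functional dependency lhs -> rhs holds in rows.

    rows is a list of dicts; lhs and rhs are lists of attribute names.
    """
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        # The first occurrence of a key fixes its value; later rows must agree.
        if seen.setdefault(key, val) != val:
            return False
    return True


# Hypothetical, unnormalized enrollment data: the instructor is repeated
# for every student in a course, which is the redundancy normalization removes.
rows = [
    {"sid": 1, "cid": "c1", "instructor": "Smith"},
    {"sid": 2, "cid": "c1", "instructor": "Smith"},
    {"sid": 1, "cid": "c2", "instructor": "Jones"},
]

print(holds(rows, ["cid"], ["instructor"]))  # cid determines instructor here
print(holds(rows, ["sid"], ["cid"]))         # student 1 takes two courses
```

Because cid -> instructor holds, the instructor could be moved to a separate course relation keyed by cid, so that an update touches one tuple instead of many.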

Relational Completeness (Plus Extensions): Declarativity What is special about relational query languages that makes them amenable to scalability? • Limited expressiveness – particularly when we consider conjunctive queries (even with recursion) • Guaranteed polytime execution in size of data • Can reason about containment, invert them, etc. • “Magic sets” • (What about XQuery’s Turing-completeness???) • Equivalence between relational calculus and algebra • Calculus → fully declarative, basis of query languages • Algebra → imperative but polytime, basis of runtime systems • Predictability of operations → cost models • Ability to supplement data with auxiliary structures for performance
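The calculus/algebra equivalence can be illustrated on a toy conjunctive query. In this sketch relations are Python sets of tuples. The data and the helper names `select`, `project`, and `join` are hypothetical and not taken from any particular system.

```python
# Relations as sets of tuples (hypothetical data).
student = {(1, "Ann"), (2, "Bo")}       # (sid, name)
takes = {(1, "cs511"), (2, "cs411")}    # (sid, cid)

# Calculus style: state which tuples qualify, not how to find them.
calculus = {
    (name,)
    for (sid, name) in student
    for (tsid, cid) in takes
    if sid == tsid and cid == "cs511"
}


# Algebra style: an explicit composition of operators, i.e. a plan.
def select(rel, pred):
    return {t for t in rel if pred(t)}


def project(rel, cols):
    return {tuple(t[c] for c in cols) for t in rel}


def join(left, right, lcol, rcol):
    return {l + r for l in left for r in right if l[lcol] == r[rcol]}


# Joined tuples have the shape (sid, name, sid, cid).
algebra = project(
    select(join(student, takes, 0, 0), lambda t: t[3] == "cs511"),
    [1],
)

print(calculus == algebra)
```

An optimizer is free to reorder the algebraic operators, for example applying the selection before the join, because every such plan is equivalent to the same declarative statement.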

Concurrency and Reliability (Generally requires full control) • Another key element of databases – ACID properties • Atomicity, Consistency, Isolation, Durability • Transaction: an atomic sequence of database actions (read/write) on data items (e.g. calendar entry) • Recoverability via a log: keeping track of all actions carried out by the database • How do distributed systems, Web services, service-oriented architectures, and the like affect these properties?
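Atomicity can be sketched with sqlite3: either every action in a transaction takes effect or none does. The `account` table, the CHECK constraint, and the `transfer` function are hypothetical, chosen only to force a failure midway through a transaction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    "name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [("a", 100), ("b", 0)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move amount from src to dst as one atomic unit."""
    try:
        # Used as a context manager, the connection commits if the block
        # succeeds and rolls back if it raises.
        with conn:
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


# The debit would overdraw "a", so the earlier credit to "b" is undone too.
failed = transfer(conn, "a", "b", 150)
after_failure = dict(conn.execute("SELECT name, balance FROM account"))

succeeded = transfer(conn, "a", "b", 30)
after_success = dict(conn.execute("SELECT name, balance FROM account"))

print(failed, after_failure)
print(succeeded, after_success)
```

This works because a single engine controls all the data. The slide's closing question is what happens when the two updates live in different systems.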

Other Data Models • Concepts from the relational data model have been adapted to form object-oriented data models (with classes and subclasses), XML models, etc. • But doesn’t this result in some loss of logical-physical independence? • GMAP and answering queries using views?

What Is a Data Management System? • Of course, there are traditional databases • The focus of most work in the past 25 years • “Tight loops” due to locally controlled data • Indexing, transactions, concurrency, recovery, optimization • But…

80% of the World’s Data is Not in Databases! Examples: • Scientific data (large images, complex programs that analyze the data) • Personal data • WWW and email (some of it is stored in something resembling a DBMS) • Network traffic logs • Sensor data • Are there benefits to declarative techniques and data independence in tackling these issues? • XML is a great way to make this data available • Also need to deal with data we don’t control and can’t guarantee consistency over

An Example of Data Management with Heterogeneity: Data Integration A layer above heterogeneous sources, to combine them under a unified logical abstraction • Some of these are databases over which we have no control • Some must be accessed in special ways • Data integration system translates queries over mediated schema to the languages of the sources; converts answers to mediated schema [Figure labels: “Mediated Schema”, XML]
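The translate-then-convert loop can be sketched in a few lines. Everything here is hypothetical: two in-memory "sources" with different native formats, a mediated schema of (title, year), and one supported query form (year at least min_year). Real systems face far harder schema-mapping and query-reformulation problems.

```python
# Source A: dicts, with the year stored under the field name "yr".
SOURCE_A = [
    {"title": "Paper A1", "yr": 2004},
    {"title": "Paper A2", "yr": 1998},
]

# Source B: positional tuples, with the year stored as a string.
SOURCE_B = [
    ("Paper B1", "2001"),
    ("Paper B2", "1995"),
]


def query_a(min_year):
    # Translate the mediated query into source A's vocabulary ("yr"),
    # then convert the answers back to the mediated schema.
    hits = [r for r in SOURCE_A if r["yr"] >= min_year]
    return [{"title": r["title"], "year": r["yr"]} for r in hits]


def query_b(min_year):
    # Source B needs a type conversion before the comparison is meaningful.
    hits = [r for r in SOURCE_B if int(r[1]) >= min_year]
    return [{"title": r[0], "year": int(r[1])} for r in hits]


def query_mediated(min_year):
    """Answer a query posed over the mediated schema (title, year)."""
    return query_a(min_year) + query_b(min_year)


print(query_mediated(2000))
```

The user of `query_mediated` never sees the field name "yr" or the string-typed years, which is the same separation of logical and physical concerns that data independence gives inside a single database.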

Other Interesting Points • Data streams and sensor data • How do we process infinite amounts of data? • Peer-to-peer architectures • What’s the best way of finding data here? • Personal information management • Can we use integration-style concepts and a bit of AI to manage associations between our data? • Web search • What’s the back-end behind Google? • Semantic Web • How do we semantically interrelate data to build a better Web?

Layers of a Typical Data Management System (Simplification!) [Figure: layered architecture diagram. Boxes: API/GUI, Query Optimizer, Exec. Engine, Access Methods, Buffer Mgr, Source. Supporting components: Stats, Schemas, Catalog, Logging/recovery. Arrow labels: Physical plan, Requests, Data/etc, Pages, Physical retrieval. Legend: red = logical, blue = physical.]

Query Answering in a Data Management System • Based on declarative query languages • Based on restricted first-order logic expressions over relations • Not procedural – defines constraints on the output • Converted into a query plan that exploits properties; run over the data by the query optimizer and query execution engine • Data may be local or remote • Data may be heterogeneous or homogeneous • Data sources may have different interfaces, access methods, etc. • Most common query languages: • SQL (based on tuple relational calculus) • Datalog (based on domain relational calculus, plus fixpoint) • XQuery (functional language; has an XML calculus core)
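The "plus fixpoint" part of Datalog can be sketched as naive bottom-up evaluation: apply the rules repeatedly until no new facts appear. The program below (an `edge` relation and a derived `path` relation) is a standard textbook example, not one from these slides, and the function name is an assumption.

```python
def transitive_closure(edges):
    """Naive bottom-up evaluation of the Datalog program

        path(x, y) :- edge(x, y).
        path(x, z) :- path(x, y), edge(y, z).

    edges is a set of (source, target) pairs.
    """
    path = set(edges)  # first rule
    while True:
        # Second rule: join the current path facts with edge on y.
        derived = {(x, z) for (x, y) in path for (y2, z) in edges if y == y2}
        if derived <= path:
            # Nothing new was derived: the fixpoint has been reached.
            return path
        path |= derived


print(sorted(transitive_closure({(1, 2), (2, 3), (3, 4)})))
```

Termination is guaranteed because the set of possible pairs over a finite set of nodes is finite and `path` only grows. This is one concrete face of the "limited expressiveness" that keeps such queries polynomial in the size of the data.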

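Processing the Query [Figure: the query passes from Web Server / UI / etc. through the Optimizer and Execution Engine to the Storage Subsystem. The example plan shows Hash and Merge operators over STUDENT, Takes, and COURSE, with two inputs labeled "by cid".] SELECT * FROM STUDENT, Takes, COURSE WHERE STUDENT.sid = Takes.sID AND Takes.cID = cid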
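The two physical join operators named on this slide can be sketched as follows. This is one possible reading of the plan (merge join of Takes and COURSE on cid, then hash join with STUDENT on sid), and the sample rows are hypothetical. A real engine works on pages and iterators rather than in-memory lists.

```python
from collections import defaultdict


def hash_join(left, right, lkey, rkey):
    """Build a hash table on left, then probe it with each right tuple."""
    table = defaultdict(list)
    for l in left:
        table[l[lkey]].append(l)
    return [l + r for r in right for l in table.get(r[rkey], [])]


def merge_join(left, right, lkey, rkey):
    """Sort both inputs on the join key, then merge them in one pass."""
    left = sorted(left, key=lambda t: t[lkey])
    right = sorted(right, key=lambda t: t[rkey])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lv, rv = left[i][lkey], right[j][rkey]
        if lv < rv:
            i += 1
        elif lv > rv:
            j += 1
        else:
            # Find the run of right tuples sharing this key value.
            j_end = j
            while j_end < len(right) and right[j_end][rkey] == lv:
                j_end += 1
            # Pair every left tuple in the matching run with that right run.
            while i < len(left) and left[i][lkey] == lv:
                out.extend(left[i] + r for r in right[j:j_end])
                i += 1
            j = j_end
    return out


# Hypothetical rows for the three relations in the query.
student = [(1, "Ann"), (2, "Bo")]                     # (sid, name)
takes = [(1, "cs511"), (2, "cs411"), (1, "cs411")]    # (sid, cid)
course = [("cs411", "Course A"), ("cs511", "Course B")]  # (cid, title)

takes_course = merge_join(takes, course, 1, 0)        # both ordered by cid
result = hash_join(student, takes_course, 0, 0)       # match on sid

print(len(result))
```

The SQL text fixes neither the join order nor the join algorithm. Choosing among plans like this one is the optimizer's job.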

DBMSs in the Real World • Big, mature relational databases • IBM, Oracle, Microsoft • “Middleware” above these • SAP, PeopleSoft, dozens of special-purpose apps • “Application servers” • Integration and warehousing systems • Current trends: • Web services; XML everywhere • Smarter, self-tuning systems • Stream systems

Our Agenda this Semester • Reading the canonical papers in the data management literature • Some are very systems-y • Some are very experimental • Some are highly algorithmic, complexity-oriented • Gaining an understanding of the principles of building systems to handle declarative queries over large volumes of data
