Introduction to Databases: Relational and XML Models and Languages

Introduction to Databases:Relational and XML Models and Languages Instructors: Bertram Ludaescher Kai Lin

Overview • 09:15-10:20 Relational Databases (1h05’) • 10:20-10:30 BREAK (10’) • 10:30-11:50 Relational Databases (1h20’) • 11:50-13:15 LUNCH (1h25’) • 13:15-13:45 Demo & Hands-on (30’) • 13:45-15:10 XML: Basics (1h25’) • 15:10-15:30 BREAK (20’) • 15:30-16:30 XML: Querying (1h) • 16:30-17:00 Demo & Hands-on (30’)

Scope • Today: Introduction to Databases, in particular • Relational database model • Relational Operations and Queries • Constraints • XML “data” model • Querying and transforming XML • Some demos & simple hands-on exercise • Tomorrow: • Introduction to Knowledge Representation and Ontologies • But first: … déjà vu …

What is a Database System? • Database (system) = • Database Instance (set of tables of rows) • Database Management System (DBMS) • Origins in the commercial world: • to organize, query, and manipulate data more effectively, efficiently, and independently • Scientific databases • often special features: • spatial, temporal, spatiotemporal, GIS, units, uncertainty, raw & derived data, …

Why not just use files as “databases”? • For some applications: yeah… why not? • But in general: • scanning & ’grep’ing large files can be veryinefficient • no language support for selecting desired data, joining them, etc. • cannot express the kinds of questions/queries you’d like to ask • ‘grep’ is no substitute for a query language • redundant and/or inconsistent storage of data • no transaction management and concurrency control among multiple users • no security • no recovery • no data independence (application data) • no data modeling support • …

Features of a Database System • A data model(relational, object-oriented, XML) prescribes how data can be organized: • as relations (tables) of tuples (rows) • as classes of (linked) objects • as XML trees • A (database) schema (stored in the “data dictionary”) defines the structure of a specific database instance: • Relational schema • OO schema • XML Schema (or XML DTD)

Features of a Database System • Data is treated uniformly and separately from the application • Efficient data access • Queries and views are expressed over the schema • Integrity constraints (checking and enforcement) • Transactions combine sets of operations into logical units (all-or-nothing) • Synchronization of concurrent user transactions • Recovery (after system crash) • not to be confused w/ backup • instead: guarantee consistency by “roll-back” of partially executed transactions (how? Hint: logging) • …

DB features, e.g., Concurrency Control • Concurrent execution of simultaneous requests • long before web servers where around... • transaction management guarantees consistency despite concurrent/interleaved execution • Transaction (= sequence of read/write operations) • Atomicity: a transaction is executed completely or not at all • Consistency: a transaction creates a new consistent DB state, i.e., in which all integrity constraints are maintained • Isolation: to the user, a transaction seems to run in isolation • Durability: the effect of a successful (“committed”) transaction remains even after system failure

Levels of Abstraction: Architecture Overview User Conceptual … Level View 1 View 2 View n Export schemas logical data independence ER-Model (Entity-Relationship) OO Models (Classes…) part of DB design  conceptual design … often lost in the process… Logical (“conceptual”) level Tables physical data independence Physical level Index structures DB instances

Name Name since Manager Salary Database Design: Entity-Relationship (ER) Model • Entities: • Relationships: • Attributes: • ER Model: • initial, high-level DB design (conceptual model) • easy to map to a relational schema (database tables) • comes with more constraints (cardinalities, aggregation) and extensions: EER (is-a => class hierarchies) • related: UML (Unified Modeling Language) class diagrams Employee Department works-for

The Relational Model Employee • Relation/Table Name: • employee, dept • Attributes = Column Names: • Emp,Salary,DeptNo, Name, Mgr • Relational Schema: • employee(Emp:string, Salary:integer, DeptNo:integer), ... • Tuple = Row of the table: • (“tom”, “60000”, “1”) • Relation = Set of tuples: • {(...), (...), ...} EmpSalaryDNo tom 60k 1 tim 57k 1 sally 45k 3 carol 30k 1 carol 35k 2 …. FK: foreign key, pointing to another key Department DNo Name Mgr 1 Toys carol 2 Comp. carol 3 Shoes sam

Ex: Creating a Relational Database in SQL CREATE TABLE employee ( ssn CHAR(11), name VARCHAR(30), deptNo INTEGER, PRIMARY KEY (ssn), FOREIGN KEY (deptNo) REFERENCES department ) CREATE TABLE department ( deptNo INTEGER, name VARCHAR(20), manager CHAR(11), PRIMARY KEY (deptNo), FOREIGN KEY (manager) REFERENCES employee(ssn) )

What is a Query? • Intuitively: • An “executable question” in terms of a database schema • Evaluating a query Q against a database instance D yields a set of answer objects: • Relational tuples or XML elements • Example: • Who are the employees in the ‘Toys’ dept.? • Who is (are) the manager(s) of ‘Tom’? • Show all pairs (Employee, Mgr) • Technically: • A mapping from an input schema (the given table schemas) to a result schema (the new columns you are interested in) defined in some query language

Why (Declarative) Query Languages? “If you have a hammer, everything looks like a nail.” ,,Die Grenzen meiner Sprache bedeuten die Grenzen meiner Welt.” “The limits of my language mean the limits of my world.” Ludwig Wittgenstein, Tractatus Logico-Philosophicus • Things we talk and think about in PLs and QLs … • Assembly languages: • registers, memory locations, jumps, ... • C and the likes: • if-then-else, for, while, memory (de-)allocation, pointers, ... • Object-oriented languages: • C++: C plus objects, methods, classes, ... • Java: objects, methods, classes, references, ... • Smalltalk: objects, objects, objects, ... • OQL: object-query language

Why (Declarative) Query Languages? • Things we talk and think about in PLs and QLs … • Functional languages (Haskell, ML): • (higher-order) functions, fold(l|r), recursion, patterns, ... => Relational languages (SQL, Datalog) • relations (tables), tuples (rows); conceptual level: ER • relational operations: , , , , ..., ,,,,,..., , , |X| => Semistructured/XML (Tree) & Graph Query Languages • trees, graphs, nodes, edges, children nodes, siblings, … • XPath, XQuery, … • Also: • Focus on what, and not how!

Employee Department result EmpSalaryDeptNo anne 62k 2 john 60k 1 DeptNoMgr 1 anne 2 anne EmpMgr john anne anne anne Example: Querying a Relational Database input tables join SELECT e.Emp, d.Mgr FROM Employee e, Department d WHERE e.DeptNo =d.DeptNo SQL query (or view def.) we don’t say how to evaluate this expression answer (or view)

Example Query: SQL vs DATALOG • “List all employees and their managers” • In SQL: SELECT e.name, d.manager FROM Employee e, Department d WHERE e.deptNo = d.deptNo • In DATALOG: q(E, M) :- employee(E, S, D), department(D, N, M). a “join” operation

Important Relational Operations • select(R, Condition) • filter rows of a table wrt. a condition • project(R, Attr) • remove unwanted columns; keep rest • join(R1, A2, R2, A2, Condition) • find “matches” in a “related” table • e.g. match R1.foreign key = R2.primary key • cartesian product(R1, R2) • union (“OR”), intersection (“AND”) • set-difference (“NOT IN”)

condition Y1=Y2Y independent same multiple rules  union results Relational Operations (in DATALOG) (query) output:– (query) input

Demo Relational Queries in DATALOG

Queries, Views, Integrity Constraints • … can all be seen as “special queries” • Query q(…) :- … ad-hoc queries • View v(…) :- … exported views; • Integrity Constraints • ic (…) :- …. MgrSal < EmpSal … • say what shouldn’t happen • if it does: alert the user (or refuse an update, …)

Query Evaluation vs Reasoning • Query evaluation • Given a database instance D and a query Q, run Q(D) • What databases do all the time • Reasoning (aka “Semantic Query Optimization”) • Given a query Q and a constraint C, “optimize” Q&C (e.g., given C, Q might be unsatisfiable) • Given Q1 and Q2 decide whether Q1 Q2 • Given Q1,Q2, C decide whether Q1 Q2 | C • Note: we are NOT given a database instance D here; just the schema and the query/IC expressions

Natural Hoin: same attribute name  add condition that values must match Summary QLs for Relational Databases

Relational Algebra

Hands-on Part DBDesigner 4

DBDesigner 4 • An open source (GPL) database design tool • Goto http://www.fabforce.net/dbdesigner4/ • Download and install (5 min) • Open the example schema Order (File -> Open -> Order), and examine the relations between the tables forumtopic and forumpost. Find the foreign key used in the relation postHasTopic (5 min) • Connect to a sample MySQL database host: geon07.sdsc.edu port: 3306 database: summer_institute username: root password: [blank] (5 min) • Select all records in the table forumtopic (2 min) • Select all records in the table forumpost, and sort the result according to their idforumpost (3min) • Find all forum posts with the topic Cars (5 min) • Insert a record into the table forumpost (5 min)

Additional Material (not presented)

Non-Relational Data Models • Relational model is “flat”: atomic data values • nesting is modeled via “pointers” (foreign keys) and “Skolem-ids” • extension: nested relational model (“tables within tables”, cf. nested HTML tables) • values can be nested lists {...}, tuples (...), sets [...] • ISO standard(s): SQL • identity is value based • Object-oriented data model: • complex (structured) objects with object-identity (oid) • class and type hierarchies (sub-/superclass, sub-/supertype) • OODB schema may be very close to “world model” (no translation into tables) (+) queries fit your OO schema (-) (new) queries that don’t fit nicely • ODMG standard, OQL (Object Query Language)

Example: Object Query Language (OQL) • Q: what does this OQL query compute? • Note the use of path expressions like e.manager.children => Semistructured/Graph Databases SELECT DISTINCT STRUCT( E: e.name, C: e.manager.name, M: ( SELECT c.name FROM c IN e.children WHERE FOR ALL d IN e.manager.children: c.age > d.age ) ) FROM e IN Employees;

A Graph Database

Querying Graphs with OO-Path Expressions ?- dblp."Inf. Systems".L."Michael E. Senko". Answer: L="Volume 1, 1975”; L="Volume 5, 1980". ?- dblp."Inf. Systems".L.P, substr("Volume",L), P : person.spouse[lives_in = P.lives_in].

Constructs for Querying Graphs Example: ?- dblp . any* . (if(vldb)| if(sigmod))

Keys, Keys, and more Keys • A key is a minimal set of attributes that: • uniquely identify a tuple • determine every other attribute value in a tuple • There may be many keys for a relation; we designate one as the primary key • The phrase candidate key is used in place of “key” where “the key” denotes the primary key • A superkey is a superset of a key (i.e., not necessarily minimal)

Normalization of Relations Example of “good” design Employee(EName, SSN, BDate, Address, DNumber) Department(DName, DNumber, DMgrSSN) Example of “bad” design (why is it bad?) Emp(EName, SSN, BDate, Address, DNum, DName, DMgrSSN)

What’s Wrong? Emp(EName, SSN, BDate, Address, DNum, DName, DMgrSSN) • The description of the department (DName, DMgrSSN) is repeated for every employee that works in that department. • Redundancy! • The department is described redundantly. • This leads to update anomalies! (… and wastes space) • Digression (for experts only): • redundancy can be used to increase performance; e.g. “materialized views”)

Update Anomalies (caused by redundancy) Insertion Anomalies If you insert an employee Need to know which department he/she works Need to know the description information for that department If you want to insert a department, you can’t … until there is at least one employee Deletion Anomalies: if you delete an employee, is that dept. gone? was this the last employee in that dept? * Modification Anomalies: if you change DName, for example, it needs to be changed everywhere!

Null values also cause problems Null values might help in special cases, but are not a general solution to update anomalies For example, they may: • Waste space • Make it hard to specify and understand joins • Make it hard to aggregate (count, sum, etc.) • Have different meanings: • Attribute does not apply to this tuple • Attribute value is unkown • Value is known but absent (not yet recorded) • Causes problems with queries (can’t interpret query answers)

Why worry about normalization? • To … • … reduce redundancy & update anomalies • … reduce the need for null values • Last not least: it’s a solved problem: • all algorithms & proofs have been worked out Here: Normalization based on FDs (there’s more…)

Functional Dependencies Statement that if two tuples agree on attributes A they must agree on attributes B AB If the value of the first attribute(s), A, is known, then the value of the second attribute(s), B, is known (read “A determines B”) We want to know if it is always true in the application

Functional Dependencies Examples of functional dependencies: social-security-number  employee-name course-number  course-title Examples that are NOT functional dependencies course-number - book course-number - car-color

What is a functional dependency? Remember what it means to be a function: x f(x) x g(x) x h(x) 1 2 1 2 1 10 1 3 2 2 2 20 2 5 3 5 3 30 3 5 f is not a function for x=1, f(x) is not unique we are looking for functionalrelationships (that must occur in a relation) among attribute values

What are the FDs? EMP(ENAME, SSN, BDATE, ADDRESS, DNUM, DNAME, DMGRSSN) EMP_PROJ(SSN, PNUM, HOURS, ENAME, PNAME, PLOCATION)

What are the FDs? EMP(ENAME, SSN, BDATE, ADDRESS, DNUM, DNAME, DMGRSSN) EMP_PROJ(SSN, PNUM, HOURS, ENAME, PNAME, PLOCATION) 5 2 1 3 6 4 9 10 7 8

Why do we care? If all FDs are “implied by the key” it means that the DBMS only enforces keys (not FDs) and the DBMS is going to enforce the keys anyway otherwise, we have to perform expensive operations to maintain consistency (code or check statements) EMPLOYEE (SSN, NAME, SALARY, JOB_DESC)

Decomposition Main refinement technique: decomposition (replacing ABCD with, say, AB and BCD, or ACD and ABD) based on the projection operator. Decomposition should be used judiciously: • Is there reason to decompose a relation? (via Normal Forms) • What problems (if any) does the decomposition cause? (lost information or dependencies?)

Introduction to Databases: Relational and XML Models and Languages