380 likes | 561 Vues
CS 519: Big Data Exploration and Analytics. Relational Query Languages. Announcements . The due data for selecting papers is tomorrow. Your top 3 choices First come, first server Some project ideas will be posted on the course Web site tomorrow. Presentations .
E N D
CS 519: Big Data Exploration and Analytics Relational Query Languages
Announcements • The due data for selecting papers is tomorrow. • Your top 3 choices • First come, first server • Some project ideas will be posted on the course Web site tomorrow.
Presentations • 1 or 2 papers: 45 - 50 minutes presentation. • 20 - 25 minutes discussion • 5 minutes quiz. • Posted articles on how to effectively read scientific papers. • Which parts to read more carefully than others? • Read and analyze the paper critically.
Presentations • The story: • Clear definition of the problem and its applications: Is it important? • What makes this paper different from others • Different version of the problem. Should we care? • Novel solution: Is it sufficiently novel? • Solution • Give a general overview. • Highlight the interesting and important parts.
Presentations • Evaluation, evaluation, evaluation! • Setting: Is it realistic? • Results: Does the paper cover both advantages and disadvantages? • Conclusion and discussion • Strong points (3 or more). How to use them in problems? • Weak points (3 or more) How to address the weak points? • Possible future directions? • Connections with your project/ research?
Presentations • David Paterson’s guide on “How to give a bad presentation” http://www.cs.berkeley.edu/~pattrsn/talks/BadTalk.pdf
Relational model and query languages • Relational model defines data organization • Relational query languages define data retrieval/manipulation operations.
Relational model Relation name Attribute names Book: Title Price Category Year MySQL $102.1 computer 2001 Cell biology $201.69 biology 1954 French cinema $53.99 art 2002 NBA History $63.65 sport 2010 tuples
Relational Model • Attributes • Atomic values • Domain: string, integer, real, • Keys: no duplicate values • Each relation must have keys • A relation does not contain duplicate tuples.
Database Schema vs. Database Instance • Schema of a Relation • Names of the relation and their attributes. • E.g.: Person (Name, Address, SSN) • Domains of / constraints on the attributes. • Schema of the database • Set of relation schemata • E.g.: Person (Name, Address, SSN) Employment(Company, SSN)
Database Schema vs. Database Instance • Schema: S:Book(Title, Price, Category, Year) • Instance: • Values of each attribute A in I: active domain of A, adom(A) Title Price Category Year MySQL $102.1 computer 2001 Cell biology $201.69 biology 1954 French cinema $53.99 art 2002 NBA History $63.65 sport 2010
SQL • A declarative language for querying data stored in relational databases. • Much easier to use than procedural languages. • Say what instead of how SELECT returned attribute(s) FROM table(s) WHERE conditions on the tuples of the table(s)
SQL Example Movie(id, title, year, total-gross) Actor(id, name, b-year) Plays(mid,aid) What movies are made in 1998? SELECT title FROM movie WHERE year = 1998;
Movie(mid, title, year, total-gross) Actor(aid, name, b-year) Plays(mid,aid) SQL Example • Find actors who played in a movie whose total gross is more than $2,000,000. SELECT * FROM Actor, Movie, Plays WHERE Movie.id= Plays.midAND Plays.aid= Actor.id AND total-gross > 2000000;
Formal Relational Query Languages • Formal languages that express queries over relational schemas. • Relational Algebra • Datalog (recursion-free with negation) • Relational calculus • Used to explore the properties of relational model. • Easier to use than SQL in some application domains.
Named versus unnamed perspective • Named perspective Movie(id, title, year, total-gross) Actor(id, name, b-year) Plays(mid,aid) • Unnamed perspective Movie: arity 4 Actor: arity 3 Plays: arity 2
Relational Algebra (RA) • Used by RDBMS to execute queries • Six operators • Selection σ • Projection Π • Join ∞ • Union • Difference – • Renaming ρ (for named perspective)
Movie(mid, title, year, total-gross) Actor(aid, name, b-year) Plays(mid,aid) Relational Algebra • Selection σ • Named perspective σtitle=‘Fargo’ (Movie) • Unnamed perspective σ2=‘Fargo’ (Movie) • Projection Π • Named perspective Πb-year (Actor) • Unnamed perspective Π3 (Actor) • Join ∞ • Named perspective Movie ∞id=mid Plays • Unnamed perspective Movie ∞1=1Plays
Datalog • First created to support recursive queries over relational databases. • Easier to use than relational algebra. • Used extensively in research and industry • Data integration, networking, logic programming, learning, distributed processing, … • We talk about the recursion-free datalog.
Movie(mid, title, year, total-gross) Actor(aid, name, b-year) Plays(mid,aid) Datalog • Each tuple in database is a fact Movie(236878, ‘Godfather I’, 1972, 40000000) Movie(879900, ‘Godfather II’, 1974, 3900000) Actor(090988,’Robert De Niro’, 1943) • Each query is a rule Movies that were produced in 1998 and made more than $2,000. Q1(y):- Movie(x,y,1998,z),z > 2000.
Movie(mid, title, year, total-gross) Actor(aid, name, b-year) Plays(mid,aid) Datalog Example • Actors who played in a movie whose total gross is more than $2,000. Q2(y):- Actor(x,y,z),Plays(t,x),Movie(t,v,w,f), f > 2000. • Actors who played in a movie whose total gross is more than $2,000 and a movie made in 1998. Q3(y):- Actor(x,y,z),Plays(t,x),Movie(t,v,w,f), f > 2000, Plays(g,x), Movie(g,l,1998,h).
Movie(mid, title, year, total-gross) Actor(aid, name, b-year) Plays(mid,aid) Datalog Q2(y):- Actor(x,y,z),Plays(t,x),Movie(t,v,w,f), f > 2000. y: head variable; x, z, t : existential variables • Extensional Database Predicates (EDB) • Movie, Actor, Plays • Intentional Database Predicate (IDB) • Q2 atom atom body head
Movie(mid, title, year, total-gross) Actor(aid, name, b-year) Plays(mid,aid) Datalog programs • A collection of rules: union • Actors who played in a movie with gross of more than $2,000 or a movie made after 1990. Q4(y):- Actor(x,y,z),Plays(t,x),Movie(t,v,w,f), f > 2000. Q4(y):- Actor(x,y,z),Plays(t,x),Movie(t,v,w,f), w > 1990.
Movie(mid, title, year, total-gross) Actor(aid, name, b-year) Plays(mid,aid) Views • Similar to views in SQL • Actors who played in a movie with gross of more than $2000 and in a movie with ‘Robert De Niro’. V(x,y,z):- Actor(x,y,z),Plays(t,x),Movie(t,v,w,f), f > 20000. Q5(y):- V(x,y,z),Plays(t,x),Plays(t,f),Actor(f,’Robert De Niro’,g, h). • Unfolding Q5(y):- Actor(x,y,z),Plays(t,x),Movie(t,v,w,f), f > 20000,Plays(u,x),Plays(u,f),Actor(f,’Robert De Niro’,g, h). Definition of V
Movie(mid, title, year, total-gross) Actor(aid, name, b-year) Plays(mid,aid) Datalog with negation • All actors who did not play in a movie with ‘Robert De Niro’. U(x,y,z):- Actor(x,y,z),Plays(t,x),Plays(t,f), Actor(f,’Robert De Niro’,g). Q6(y):- Actor(x,y,z), not U(x,y,z).
Movie(mid, title, year, total-gross) Actor(aid, name, b-year) Plays(mid,aid) Safe datalog rules • Unsafe rules: V(x,y,z):- Actor(x,y,1998), z > 200. W(x,y,z):- Actor(x,y,z), not Plays(t,x). • A datalog rule is safe if every variable appear in at least one positive predicate
Datalog to SQL • Non-recursive datalog with negation represents the core functionalities of SQL • We can translate each non-recursive datalog program to a core SQL query and vice versa.
Relational calculus • First order logic or predicate calculus • Two approaches: • Domain Relational Calculus (DRC): Unnamed perspective. • Our focus in this lecture. • Tuple Relational Calculus (TRC): Named perspective.
Relational calculus • Each relational predicate P is: • Atom • P ∧ P • P ∨ P • P P • not (P) • Each query Q(x1, …, xn) is a predicate P.
Movie(mid, title, year, total-gross) Actor(aid, name, b-year) Plays(mid,aid) Relational calculus • Actors who played in a movie with total gross of $2,000. • Actors who played only in movies produced in 1990.
Movie(mid, title, year, total-gross) Actor(aid, name, b-year) Plays(mid,aid) Relational calculus • Actors who played in some movies with only one actor.
Movie(mid, title, year, total-gross) Actor(aid, name, b-year) Plays(mid,aid) Domain independent RC • Similar to datalog, one may write unsafe queries in RC, called domain dependent. • RC queries should be domain independent.
Equivalency Theorem • RA, non-recursive datalog with negation and RC express the same set of queries. • Relational queries.
Conjunctive queries (CQ) • One datalog rule. • SELECT-DISTINCT-FROM-WHERE. • Select/project/join (σ, Π, ∞) fragment of RA. • Existential/ conjunctive fragment of RC • There is not any comparison operator (<, ≠, …) in CQ. If used the family is called CQ<, CQ≠, …
Movie(mid, title, year, total-gross) Actor(aid, name, b-year) Plays(mid,aid) CQexamples Actors who played in “LTR”. Q7(y):- Actor(x,y,z),Plays(t,x),Movie(t,’LTR’,w,f). Actors who played in a movie with total gross of $2000. Non-CQ: Actors who played in some movies with only one actor.
Query equivalency and containment • Interesting and long standing problems in database management. • Queries q1 and q2 are equivalent if and only if for every database instance I, q1(I) = q2(I) • Shown as q1 q2 • Query q1is contained in q2if and only if for every database instance I, q1(I) q2(I) • Shown as q1 q2
Containment examples Is q1 q2? q1(x):- R(x,y),R(y,z),R(z,w). q2(x):- R(x,y),R(y,z). q1(x):-R(x,y),R(y,’Joe’). q2(x):-R(x,y),R(y,z). q1(x):- R(x,y),R(y,z),R(z,x). q2(x):- R(x,y),R(y,x).