Unstructured Data Management Systems in Action - Study Attempts and Solutions

Query Optimization Lecture #5

On Homework #2 • Will study attempts to build systems that manage unstructured data • text, email, HTML pages, etc. • Apply what you have learned to • understand their limitations • propose potentially better solutions • Examples • UIMA, DBlife, iTrack, Business Intelligence systems, etc.

Example: DBlife

Entities/Relationships Raw Data The Core DBlife System • The back-endruns periodically (once a day) • Collect raw data from the web • Infer an ER graph over it • The front-endallows users to access the data and metadata • Search (e.g. via keyword search) • Browse (e.g. via hyperlinks) • Alerts, RSS, etc. (future work)

User Webpages Mentions Metadata for each mention PC-member Citation PC-chair Entities Superhomepages for each entity Metadata for each entity Datastreams XML Indexing Relationships between entities The Current DBlife System Back End Front End Crawler Get Mentions Keyword Search Get Mention Metadata Browse Compile Superhomepages Get Entities Get Entity Metadata Get Relationships

Obtaining Raw Data • Manually identify relevant data sources • Identify website • Mark up easy-to-find meta-data • Define procedure to obtain data • Example: Name = SIGMOD 2005 URL = http://http://cimic.rutgers.edu/~sigmod05/ Type = conference website Prodedure = crawl to depth 3 • Crawler downloads and archives a snapshot of the data once a day

Extracting Mentions • Given a document, find mentions of entities • A mention is a span of text referring to an entity • Many sophisticated techniques developed • But we can exploit domain knowledge to do a better job

Extracting Mentions with Root Names • Many online communities have a relatively complete datasource containing the names of relevant entities • IMDB for movies • DBLP for the DB research community • For DBlife, we exploit DBLP, an online bibliography for the DB community • Extract full names from DBLP e.g. “John Smith” • Generate variations of the full name e.g. “John Smith”, “J. Smith”, “Smith, J.”, “Dr. Smith”, etc. • Search for the variations in the data

Timestamps – When the mention first appeared Explanations – Why the mention is there Inferring Mention Metadata • For each mention, infer metadata about it: • Others?

Inferring Entities • Given a set of mentions, infer the set of real-world entities they refer to • Cluster the mentions so that a cluster corresponds to an entity • Fundamental challenge: determine if two mentions refer to same entity “John Smith” = “J. Smith”? “Dave Jones” = “David Jones”? • Active research topic for many years in the Database, Web, and AI communities • Current solution: group mentions by the root name used to extract them • Better solution: make it a CS 511 project?

Inferring Entity Metadata • We not only want to know what entities exist, but also metadata about them • Researchers: contact information, institution, research interests, year of graduation, etc. • Publications: topic, year, journal/conference, other publications citing it • Conferences: location, date, acceptance rate, number of tracks • Challenge: Lack of structured data • e.g. Is “284-3838” a phone number? A fax number? To whom does it belong to? • Current DBlife system has very limited entity metadata • Possible 511 project?

Publication Conference Researcher Inferring Relationships • We not only want to find what entities exist, but also the relationships between them • Challenge: Lack of structured data • “Co-author” relationship easier because there is some regularity in citations • “Advise” relationship harder because it can be expressed in many ways • Current DBlife has very limited relationship inference • Only one relationship: “related people” based on mention co-occurrence • Possible 511 project? Advise, Coauthor Cite Write Serve Appear

Building DBlife: Design Principles • Modularity: each component should be separated • Easy to maintain and modify • Extensibility: should be easy to add more features without effecting existing ones • E.g. should be easy to incorporate certain class projects • Reliability: should be robust to system crashes and easy recover in the event of failure • Especially critical for DBlife because 1) It processes temporal data 2) It has end-users using the system (e.g. Citeseer) • Automaticity: should require as little manual administration as possible

Optimization • Search • How does system R handle search • Subsequent efforts

Query or update User/ Application Query optimizer Query execution plan Execution engine Record, index requests Index/record mgr. Page commands Buffer manager Read/write pages Storage manager disk

Logical/Physical Query Plans SELECT P.buyer FROM Purchase P, Person Q WHERE P.buyer=Q.name AND Q.city=‘urbana’ AND Q.phone > ‘5430000’ buyer  City=‘urbana’ phone>’5430000’ • Query Plan: • logical tree • implementation choice at every node • scheduling of operations. (Simple Nested Loops) Buyer=name Person Purchase (Table scan) (Index scan) A logical plan = a set of physical plans

Conceptually What’s Going On? • Can be viewed as a search problem • Given a query Q • This + system capabilities determine a space S of physical query plans p1, p2, ... • Given a cost model M • The notion of optimal plan p* is well defined • Find p* or a close-to-optimal plan

Optimization as Search • Three fundamentals • search space S • cost measure M • exploration strategy E: brute force, hill climbing, dynamic programming, etc. • Other issues to consider • is the cost measure decomposable? • does the search space S cover all possible plans? • how plans or sets of plans are represented and how representations are manipulated

An Example • Start with user query Q • Compile it into a logical query plan g • g  “better” logical query plan g’, using “heuristics” • g’ g’’, by ordering joins using system-R style dynamic programming • g’’  “optimal” physical query plan p • Execute p

Compile from SQL to Logical Plans Select a1, …, an From R1, …, Rk Where C Pa1,…,an(sC(R1 R2 … Rk)) Select a1, …, an From R1, …, Rk Where C Group by b1, …, bl Pa1,…,an(gb1, …, bm, aggs (sC(R1 R2 … Rk)))

Converting Nested Queries Selectdistinct product.name From product Where product.maker in (Select company.name From company where company.city=“Urbana”) Selectdistinct product.name From product, company Where product.maker = company.name AND company.city=“Urbana”

Converting Nested Queries Selectdistinct x.name, x.maker From product x Where x.color= “blue” AND x.price >= ALL (Select y.price From product y Where x.maker = y.maker AND y.color=“blue”) How do we convert this one to logical plan ?

Converting Nested Queries Let’s compute the complement first: Selectdistinct x.name, x.maker From product x Where x.color= “blue” AND x.price < SOME (Select y.price From product y Where x.maker = y.maker AND y.color=“blue”)

Converting Nested Queries Select distinct x.name, x.maker From product x, product y Where x.color= “blue” AND x.maker = y.maker AND y.color=“blue” AND x.price < y.price This one becomes a SFW query: This returns exactly the products we DON’T want, so…

Converting Nested Queries (Select x.name, x.maker From product x Where x.color = “blue”) EXCEPT (Select x.name, x.maker From product x, product y Where x.color= “blue” AND x.maker = y.maker AND y.color=“blue” AND x.price < y.price)

Optimization: the Logical Query Plan • Now we have one logical plan • Can apply laws/rules that • give a semantically equivalent logical plan • but with “seemingly” lower cost • Can also do this in a systematic way (avoiding heuristics)

Algebraic Laws • Commutative and Associative Laws • R  S = S  R, R  (S  T) = (R  S)  T • R ∩ S = S ∩ R, R ∩ (S ∩ T) = (R ∩ S) ∩ T • R ⋈ S = S ⋈ R, R ⋈ (S ⋈ T) = (R ⋈ S) ⋈ T • Distributive Laws • R ⋈ (S  T) = (R ⋈ S)  (R ⋈ T)

Algebraic Laws • Laws involving selection: • sC AND C’(R) = sC(sC’(R)) = sC(R) ∩ sC’(R) • sC OR C’(R) = sC(R) U sC’(R) • sC (R ⋈ S) = sC (R) ⋈ S • When C involves only attributes of R • sC (R – S) = sC (R) – S • sC (R  S) = sC (R) sC (S) • sC (R ∩ S) = sC (R) ∩ S

Algebraic Laws • Example: R(A, B, C, D), S(E, F, G) • sF=3 (R ⋈D=E S) = ? • sA=5 AND G=9 (R ⋈D=E S) = ?

Algebraic Laws • Laws involving projections • PM(R ⋈ S) = PN(PP(R) ⋈PQ(S)) • Where N, P, Q are appropriate subsets of attributes of M • PM(PN(R)) = PM,N(R) • Example R(A,B,C,D), S(E, F, G) • PA,B,G(R ⋈ S) = P ? (P?(R) ⋈P?(S))

Algebraic Laws Laws involving grouping and aggregation: • (A, agg(B)(R)) = A, agg(B)(R) • A, agg(B)((R)) = A, agg(B)(R) if agg is “duplicate insensitive” • Which of the following are “duplicate insensitive” ?sum, count, avg, min, max • A, agg(D)(R(A,B) ⋈B=C S(C,D)) = A, agg(D)(R(A,B) ⋈B=C (B, agg(D)S(C,D))) • Why is this true ? • Why would we do it ?

Heuristic Based Optimizations • Query rewriting based on algebraic laws • Result in better queries most of the time • Heuristics number 1: • Push selections down • Heuristics number 2: • Sometimes push selections up, then down

Predicate Pushdown pname pname sprice>100 AND city=“Urbana” maker=name maker=name city=“Urbana” price>100 Product Company Company Product The earlier we process selections, less tuples we need to manipulate higher up in the tree (but may cause us to loose an important ordering of the tuples, if we use indexes).

Predicate Pushdown Select y.name, Max(x.price) From product x, company y Where x.maker = y.name GroupBy y.name Having Max(x.price) > 100 Select y.name, Max(x.price) From product x, company y Where x.maker=y.name and x.price > 100 GroupBy y.name Having Max(x.price) > 100 • For each company, find the maximal price of its products. • Advantage: the size of the join will be smaller. • Requires transformation rules specific to the grouping/aggregation • operators. • Won’t work if we replace Max by Min.

Pushing predicates up Bargain view V1: categories with some price<20, and the cheapest price Select V2.name, V2.price From V1, V2 Where V1.category = V2.category and V1.p = V2.price Create View V1 AS Select x.category, Min(x.price) AS p From product x Where x.price < 20 GroupBy x.category Create View V2 AS Select y.name, x.category, x.price From product x, company y Where x.maker=y.name

Query Rewrite:Pushing predicates up Bargain view V1: categories with some price<20, and the cheapest price Select V2.name, V2.price From V1, V2 Where V1.category = V2.category and V1.p = V2.price AND V1.p < 20 Create View V1 AS Select x.category, Min(x.price) AS p From product x Where x.price < 20 GroupBy x.category Create View V2 AS Select y.name, x.category, x.price From product x, company y Where x.maker=y.name

Query Rewrite:Pushing predicates up Bargain view V1: categories with some price<20, and the cheapest price Select V2.name, V2.price From V1, V2 Where V1.category = V2.category and V1.p = V2.price AND V1.p < 20 Create View V1 AS Select x.category, Min(x.price) AS p From product x Where x.price < 20 GroupBy x.category Create View V2 AS Select y.name, x.category, x.price From product x, company y Where x.maker=y.name AND V1.p < 20

An Example • Start with user query Q • Compile it into a logical query plan g • g  “better” logical query plan g’, using “heuristics” • g’ g’’, by ordering joins using system-R style dynamic programming • g’’  “optimal” physical query plan p • Execute p

Join Trees • R1 ⋈ R2 ⋈ …. ⋈ Rn • Join tree: • A plan = a join tree • A partial plan = a subtree of a join tree R3 R1 R2 R4

Types of Join Trees • Left deep: R4 R2 R5 R3 R1

Types of Join Trees • Bushy: R3 R2 R4 R5 R1

Types of Join Trees • Right deep: R3 R1 R5 R2 R4

Problem • Given: a query R1 ⋈ R2 ⋈ … ⋈ Rn • Assume we have a function cost() that gives us the cost of every join tree • Find the best join tree for the query

Dynamic Programming • Idea: for each subset of {R1, …, Rn}, compute the best plan for that subset • In increasing order of set cardinality: • Step 1: for {R1}, {R2}, …, {Rn} • Step 2: for {R1,R2}, {R1,R3}, …, {Rn-1, Rn} • … • Step n: for {R1, …, Rn} • It is a bottom-up strategy • A subset of {R1, …, Rn} is also called a subquery

Dynamic Programming • For each subquery Q ⊆ {R1, …, Rn} compute the following: • Size(Q) • A best plan for Q: Plan(Q) • The cost of that plan: Cost(Q)

Dynamic Programming • Step 1: For each {Ri} do: • Size({Ri}) = B(Ri) • Plan({Ri}) = Ri • Cost({Ri}) = (cost of scanning Ri)

Dynamic Programming • Step i: For each Q ⊆ {R1, …, Rn} of cardinality i do: • Compute Size(Q) (later…) • For every pair of subqueries Q’, Q’’ s.t. Q = Q’  Q’’compute cost(Plan(Q’) ⋈ Plan(Q’’)) • Cost(Q) = the smallest such cost • Plan(Q) = the corresponding plan

Dynamic Programming • Return Plan({R1, …, Rn})

Dynamic Programming To illustrate, we will make the following simplifications: • Cost(P1 ⋈ P2) = Cost(P1) + Cost(P2) + size(intermediate result(s)) • Intermediate results: • If P1 = a join, then the size of the intermediate result is size(P1), otherwise the size is 0 • Similarly for P2 • Cost of a scan = 0

Dynamic Programming • Example: • Cost(R5 ⋈ R7) = 0 (no intermediate results) • Cost((R2 ⋈ R1) ⋈ R7) = Cost(R2 ⋈ R1) + Cost(R7) + size(R2 ⋈ R1) = size(R2 ⋈ R1)

Unstructured Data Management Systems in Action - Study Attempts and Solutions

Unstructured Data Management Systems in Action - Study Attempts and Solutions

Presentation Transcript

Query Optimization

Query Optimization

Query Optimization

Query Optimization

Query Optimization

Query Optimization

Query Optimization

Query Optimization

Query Optimization

Query Optimization

Query Optimization

Query Optimization

Query Optimization

Query Optimization

Query Optimization

Query Optimization

Query Optimization

Query optimization

Query Optimization

Query Optimization

Query Optimization

Query Optimization