Query Reformulation & Processing in Data Integration

Query Reformulation & Processing in Data Integration

Recall: Data Integration Challenge Find Olympus cameras on sale and their reviews (Brand, Cameras) (Olympus, C-3000) (Cameras, Reviews) (C-3000, review-article-1) TARGET.COM EPINIONS.COM WAL-MART.COM DPREVIEW.COM CONSUMER REPORTS.ORG BESTBUY.COM

Architecture of a Data Integration System Find Olympus cameras on sale and their reviews TARGET.COM EPINIONS.COM TARGET.COM DPREVIEW.COM TARGET.COM CONSUMERREPORTS.COM WAL-MART.COM EPINIONS.COM BESTBUY.COM CONSUMERREPORTS.COM Query Reformulator logical query plans Query Optimizer physical query execution plans Answers = UNION of outputs of all logical query plans Execution Engine Must execute multiple plans!

Query Reformulation • Generate good logical query plans quickly • Recent works • bucket algorithm [Levy et. al.] • inverse rule algorithm [Duschka et. al.] • Minicon algorithm [Pottinger et. al.] • iDRIPS/Streamer [Doan et. al.] • Will illustrate with the bucket alg. and Streamer • Strongly encouraged to read others

The Bucket Algorithm [Levy et al., VLDB-96] • Collect sources into buckets • sources in a bucket can return answer to a certain part of query • Take cross product of buckets • to form logical query plans Find Olympus cameras on sale and their reviews Bucket B1 Bucket B2 V1: TARGET V4: EPINIONS V2: WAL-MART V5: DPREVIEW V3: BESTBUY V6: CONSUMERREPORT V1V4 V1V5 ... V3V5 V3V6

Streamer Reformulation Algorithm • Goal: order logical query plans • Time to & quality of first answers is important! • executing all plans is expensive or infeasible • plans tend to vary significantly in their utility • coverage, execution time, monetary cost, ... • Solution • find query plans in decreasing order of utility • execute best plans first • abort query execution as soon as • satisfactory answer is found, or • resource limits have been reached

Contributions of Streamer Work • Formally defined plan-ordering problem • does not assume any specific utility measure • models dependencies among plans • Developed three efficient solutions • GREEDY: exploits utility monotonicity • iDRIPS: exploits source similarity • STREAMER: exploits source similarity, plan independenceutility-diminishing returns • work with a broad range of utility measures • find the best plans very fast

Problem Definition • Utility measure • plan coverage: number of new answers returned by a plan • execution time, monetary fee • plan utility depends on plans previously executed! • Plan-ordering problem • modify query reformulator so thatgiven user query and utility measure, it outputs • best plan p1 • next best plan p2, assuming p1 has been executed • next best plan p3, assuming p1 & p2 have been executed, ... • focus on finding first few best plans • Similar to top-k, but with correlation between plans

Recall: Bucket Algorithm [Levy et al., VLDB-96] • Collect sources into buckets • sources in a bucket can return answer to a certain part of query • Take cross product of buckets • to form logical query plans Find Olympus cameras on sale and their reviews Bucket B1 Bucket B2 V1: TARGET V4: EPINIONS V2: WAL-MART V5: DPREVIEW V3: BESTBUY V6: CONSUMERREPORT V1V4 V1V5 ... V3V5 V3V6

GREEDY Algorithm • Properties • linear run time • broadly applicable • many practical utility measures are monotonic [Yerneni et al., EDBT-98] • Utility monotonicity • if replacing a source by a “better” source yields a better plan • e.g., cost(ViVj) = cost(Vi) + cost(Vj) • Finds best plan • by local comparison of sources B1 B2 V1 V4 V2 V5 V1 V5 V3 V6 • Removes best plan & finds next best plan, ...

Source Similarity • Two sources are similar • if replacing one by the other changes plan utility very little • Large domains often have many similar sources • similar in monetary fee, access time, coverage, etc • Key idea • similar sources can be grouped and treated as a single source V1 V4: time = 3, fee = 5 V2 V4: time = 4, fee = 6 V1: time = 2, fee = 3 V2: time = 3, fee = 4 V4: time = 1, fee = 2 utility(V1V4) = 0.5 utility(V2V4) = 0.7 Abstract source Abstract plan V12: time = [2,3], fee = [3,4] V12 V4: time = [3,4], fee = [5,6] utility(V12 V4) = [0.4,0.7]

Grouping Sources to Find Best Plan:DRIPS Algorithm [Haddawy et al., UAI-95] V123 V456 B1 B2 V1 V4 V2 V5 V3 V4 V12 V56 V3 V6 V1 V2 V5 V6 Source Grouping Branch & Bound Search Dominance graph V123 V456 V3 V4 [0.5, 0.8] [0.1, 0.7] V12 V456 V3 V456 V1 V456 V3 V56 V1 V456 V2 V456 V3 V4 V3 V56 V2 V456 0.8 [0.6, 0.7] [0.4, 0.6] [0.1, 0.3]

Extending DRIPS: iDRIPS & STREAMER • iDRIPS (iterative DRIPS) • applies DRIPS to find best plan • removes best plan, re-groups sources • applies DRIPS to find second best plan, ... • Observation • iDRIPS may re-establish dominance relations many times • Challenge: recycle dominance relations • Solution: STREAMER • applicable when utility-diminishing returns holds • exploits plan independence V3 V4 V1 V456 V3 V56 V2 V456

The STREAMER Algorithm First Iteration Second Iteration V1 V4 V1 V5 V1 V6 V3 V4 V1 V456 V3 V56 V1 V456 V3 V56 V2 V456 V2 V456 V2 V4 V2 V5 V2 V6 still true if utility-diminishing returns holds + V3V4 is independent of V1 V456

Adaptive Query Processing

Motivations for Adaptive Query Processing Many domains where cost-based query optimization fails: Complex queries in traditional databases: estimation error grows exponentially with # joins [IC91] Querying over the Internet: unpredictable access rates, delays Querying external data sources: limited information available about properties of this source Monitor real-world conditions, adapt processing strategy in response

Can We Get RDBMS-Level Optimizationfor Data Integration, without Statistics? Multiple remote sources • Described and mapped “loosely” • Data changes frequently Generally, would like to support same kinds of queries as in a local setting Results Query Data Integration System Mediated Schema Source Catalog Schema Mappings Remote, Autonomous Data Sources

What Are the Sources of Inefficiency? • Delays – we “stall” in waiting for I/O • We’ll talk about this on Monday • Bad estimation of intermediate result sizes • The focus of the Kabra and DeWitt paper • No info about source cardinalities • The focus of the eddies paper – and Monday’s paper • The latter two are closely related • Major challenges: • Trading off information acquisition (exploration) vs. use (exploitation) • Extrapolating performance based on what you’ve seen so far

Kabra and DeWitt • Goal: “minimal update” to a traditional optimizer in order to compensate for bad decisions • General approach: • Break the query plan into stages • Instrument it • Allow for re-invocation of optimizer if it’s going awry

Elements of Mid-Query Re-Optimization • Annotated Query Execution Plans • Annotate plan with estimates of size • Runtime Collection of Statistics • Statistics collectors embedded in execution tree • Keep overhead down • Dynamic Resource Re-allocation • Reallocate memory to individual operations • Query Plan Modification • May wish to re-optimize the remainder of query

Annotated Query Plans • We save at each point in the tree the expected: • Sizes and cardinalities • Selectivities of predicates • Estimates of number of groups to be aggregated

Statistics Collectors • Add into tree • Must be collectable in a single pass • Will only help with portions of query “beyond” the current pipeline

Resource Re-Allocation • Based on improved estimates, we can modify the memory allocated to each operation • Results: less I/O, better performance • Only for operations that have not yet begun executing, i.e., not in the pipeline

Plan Modification • Only re-optimize part not begun • Suspend query, save intermediate in temp file • Create new plan for remainder, treating temp as an input

Re-Optimization When to re-optimize: • Calculate time current should take (using gathered stats) • Only consider re-optimization if: • Our original estimate was off by at least some factor 2 and if • Topt, estimated < 1Tcur-plan,improved where 1  5% and cost of optimization depends on number of operators, esp. joins • Only modify the plan if the new estimate, including the cost of writing the temp file, is better

Low-Overhead Statistics Want to find “most effective” statistics • Don’t want to gather statistics for “simple” queries • Want to limit effect of algorithm to maximum overhead ratio,  • Factors: • Probability of inaccuracy • Fraction of query affected How do we know this without having stats?

Inaccuracy Potentials The following heuristics are used: • Inaccuracy potential = low, medium, high • Lower if we have more information on table value distribution • 1+max of inputs for multiple-input selection • Always high for user-defined methods • Always high for non-equijoins • For most other operators, same as worst of inputs

More Heuristics • Check fraction of query affected • Check how many other operators use the same statistic • The winner: • Higher inaccuracy potentials first • Then, if a tie, the one affecting the larger portion of the plan

Implementation • On top of Paradise (parallel database that supports ADTs, built on OO framework) • Using System-R optimizer • New SCIA (Stat Collector Insertion Algorithm) and Dynamic Re-Optimization modules

It Works! • Results are 5% worse for simple queries, much better for complex queries • Of course, we would not really collect statistics on simple queries • Data skew made a slight difference - both normal and re-optimized queries performed slightly better

Pros and Cons • Provides significant potential for improvement without adding much overhead • Biased towards exploitation, with very limited information-gathering • A great way to retrofit an existing system • In SIGMOD04, IBM had a paper that did this in DB2 • But fairly limited to traditional DB context • Relies on us knowing the (rough) cardinalities of the sources • Query plans aren’t pipelined, meaning: • If the pipeline is broken too infrequently, MQRO may not help • If the pipeline is broken too frequently, time-to-first-answer is slow

The Opposite Extreme: Eddies • The basic idea: • Query processing consists of sending tuples through a series of operators • Why not treat it like a routing problem? • Rely on “back-pressure” (i.e., queue overflow) to tell us where to send tuples • Part of the ongoing Telegraph project at Berkeley • Large-scale federated, shared-nothing data stream engine • Variations in data transfer rates • Little knowledge of data sources

Telegraph Architecture • Simple “pre-optimizer” to generate initial plan • Creates operators, e.g.: • Select: predicate(sourceA) • Join: predicate(sourceA, sourceB) • (No support for aggregation, union, etc.) • Chooses implementations Select using index, join using hash join • Goal: dataflow-driven scheduling of operations • Tuple comes into system • Adaptively routed through operators in “eddies” May be combined, discarded, etc.

Can’t Always Re-order Arbitrarily • Need “moment of symmetry” • Some operators have scheduling dependency • e.g. nested loops join: for each tuple in left table for each tuple in right table If tuples meet predicate, output result • Index joins, pipelined hash joins always symmetric • Sometimes have order restrictions e.g. request tuple from one source, ship to another

The Eddy • Represents set of possible orderings of a subplan • Each tuple may “flow” through a different ordering (which may be constrained) • N-ary module consisting of query operators • Basic unit of adaptivity • Subplan with select, project, join operators U

Eddy Pros and Cons • Mechanism for adaptively re-routing queries • Makes optimizer’s task simpler • Can do nearly as well as well-optimized plan in some cases • Handles variable costs, variable selectivities • But doesn’t really handle joins very well – attempts to address in follow-up work: • STeMs – break a join into separate data structures; requires re-computation at each step • STAIRs – create intermediate state and shuffle it back and forth

Generalizing Adaptive Query Processing • We’ve seen a range of different adaptive techniques • How do they fit together? • Can we choose points between eddies and mid-query re-optimization?

Types of Adaptive Query Processing Adaptive scheduling (q. scrambling [UF98], dyn. pipeline sched. [UF01], XJoin [UF00], PH Join [RS86][I+99], ripple join [HH99]) Changes CPU scheduling to improve feedback or reduce delays Redundant computation (competitive exec. [AZ96]) Compare two+ ways of executing the query Plan partitioning ([S+76][KD98][I+99][M+04]) Break the plan into stages; re-optimize future stages as necessary Adaptive info passing ([IT05 sub.]) Pass data between parts of an executing plan Adaptive data partitioning • Break the data into subsets; use a different plan for each subset • The only way to reduce overall computation with fine granularity • First (only) implementation has been eddies[AH00][R+03][DH04]

Eddies Combine Adaptive Scheduling and Data Partitioning Decisions Intuitively, each tuple gets its own query plan • Route to next operator based on speed and selectivity of each operator • Elegant and simple to implement But performing a join creates subresults at the next level! Local & greedy choices may result in state that needs to join with all future data! Consider long-term effects of decisions before making them – separate CPU scheduling from plan selection

Focusing Purely on Adaptive Data Partitioning Use adaptively scheduled operators to “fill CPU cycles” Now a query optimizer problem: Choose a plan that minimizes long- term cost (in CPU cycles) To allow multiple plans, distribute union through join (and select, project, etc.): If R1 = R11[ R12, R2 = R21[ R22 then: R1⋈ R2 = (R11[ R12) ⋈ (R21[ R22) = (R11⋈ R21) [ (R12⋈ R22) [ (R11⋈ R22) [ (R12⋈ R21) R11 R21 R22 R12 R2 R1 This generalizes to njoins, other SPJ + GUoperators…

Exclude R0S0T0,R1S1T1 R1 S1T1 S1T1 È T 0 T 0 1 1 T R R S 0 0 Exclude R0S0 0 0 1 1 R S S T R 0 S 0 R S 1 1 Adaptive Data Partitioning:Routing Data across Different Plans R ⋈S ⋈ T  R0 S0T0 Options for combining across phases: • New results always injected into old plan • Old results into new plan • Wait until the end – “stitch-up” plan based on best stats … R0 S0 R S T

Special Architectural Features for ADP Monitoring and re-optimization thread runs alongside execution: • System-R-like optimizer with aggregation support;uses most current selectivity estimates • Periodic monitoring and re-optimization revises selectivity estimates, recomputes expected costs Query execution with “smart router” operators Special support for efficient stitch-up plans: • Uses intermediate results from previous plans (specialized-case of answering queries using views [H01]) • Join-over-union (“stitch-up-join”) operator that excludes certain results

ADP Application 1:Correcting Cost Mis-estimates Goal: react to plans that are obviously bad • Don’t spend cycles searching for a slightly better plan • Try to avoid paths that are likely to not be promising Monitor/reoptimizer thread watches cardinalities of subresults • Re-estimate plan cost, compare to projected costs of alternatives, using several techniques & heuristics (see paper) • Our experiments: re-estimate every 1 sec. “Smart router” operator does the following: • Waits for monitor/reoptimizer to suggest replacement plan • Re-routes source data into the new plan • New plan’s output is unioned with output of previous plan; this is fed into any final aggregation operations

Correcting for Unexpected Selectivities Pentium IV 3.06 GHzWindows XP

ADP Application 2:Optimizing for Order Most general ADP approach: • Pre-generate plans for general case and each “interesting order” • “Smart router” sends tuple to the plan whose ordering constraint is followed by this tuple • But with multiple joins, MANY plans Instead: do ADP at the operator level • “Complementary join pair” • Does its own stitch-up internally • Easier to optimize for! Can also do “partial sorting” at the router (priority queue) Q ... Merge Hash h(R) h(S) h(R) h(S) Q Q Q Q Routers R S

Exploiting Partial Order in the Data Pentium IV 3.06 GHzWindows XP (1024 tuple)

SUM(T.y) GROUP BY T.x T R SUM(T.y sums) GROUP BY T.x R SUM(T.y) GROUP BY T.x, T.joinAttrib T ADP Over “Windows”:Optimizing for Aggregation • Group-by optimization [CS94]: • May be able to “pre-aggregate” some tuples before joining • Why: aggregates can be applied over union • But once we insert pre-aggregation, we’re stuck (and it’s not pipelined) • Our solution: • “Adjustable window pre-aggregation” • Change window size depending on how effectively we can aggregate • Also allows data to propagate through the plan – better info for adaptivity, early answers vs.

Pre-Aggregation Comparison

Adaptive QP in Summary A variety of different techniques, focusing on: • Scheduling • Comparison & competition • Information passing • Data + plan partitioning A field that is still fairly open – missing: • Effective exploration methods • A true theory! • What’s possible? What kinds of queries make sense to adapt? • Guarantees of optimality and convergence (perhaps under certain assumptions)

Query Reformulation & Processing in Data Integration