Join Synopses for Approximate Query Answering

Join Synopses for Approximate Query Answering Swarup Acharya Phillip B. Gibbons Viswanath Poosala Sridhar Ramaswamy

Purpose • Provide good approximate answers for join queries. • Make use of join synopsis. • Allocate space for join synopses. • Maintain join synopses.

Why????? • Reduce overhead for large DB’s. • Approximate answer is best. • Reduce access to base relation. • Fast response time.

Proposed • Make use of pre-computed samples of DJ’s. (Join Synopses). • For queries with foreign key joins, provide good approximate join aggregates using small number of join synposes. • Allocation of space among the sets. • If workload characteristics is available. • Provide confidence bounds.

Aqua System • Goal- improve response time for queries by reducing access to base relation. • Maintains smaller sized statistical summaries – “SYNOPSES” • Provides confidence bounds. • It has 3 components 1:Statistics Collection 2:Query Rewriting 3:Maintenance. • It sits on top of a DBMS.

Problem with Joins • Natural set of synopses would be random samples from each of the base relation. • Non-Uniform Result Sample.- for the join of base samples to be a uniform random sample of original relation the probability must equal. • Small join result sizes. • Using samples on base relations is not feasible.

Join Synopses • Pre-compute samples of join results. • Compute samples of the results of a small set of distinguishing joins. • Can obtain random samples for all possible joins in the schema. • This scheme is for foreign key joins. • Model the DB as a graph.

Join Synopses • There is a 1-1 correspondence between a tuple in a relation ‘r’ & a tuple in the output of any foreign key join involving ‘r’ & any of its descendants in the graph. • A sample Sr of a relation ‘r’ can be used to produce another relation J(Sr) called a join synopsis of ‘r’. ( provides random samples). • The subgraph of G on the ‘k’ nodes in any k-way foreign key join must be a connected subgraph with a single root node.

Join Synopses • For each node u in G, corresponding to a relation r1, define J(u) to be the output of the maximum foreign key join r1xr2x..xrk with source r1. • Let Su be a uniform random sample of r1. • The join synopsis J(Su) is the output of Suxr2xr3…..xrk. • J(Su) is a uniform random sample of J(u) with |Su| tuples. • Thus we can extract from our synopsis a uniform random sample of the output of any k-way foreign key join. • From 1 join synopsis for a node whose foreign key join has k relations, we can extract a URS of the output of between k-1 & pow(2,k-1)-1 distinct foreign ey joins.

Allocation • Allocate space among various join synopses when certain properties of query workload are known. • Identify heuristics for the common case when such properties are not known. • Let ‘S’ be a set of queries with selects, aggregates, group by’s & foreign key joins. • For each relation Ri, find fraction Fi of queries in S for which Ri is the source relation in a foreign key join. • Select join synopsis sizes so as to minimize the average relative error. • It is known that the error bounds are inversely proportional to sqrt(n).(n- number of tuples in join sample).

Allocation • The average relative error bound over the queries is proportional to sum(Fi/sqrt(ni)). • In the absence of query work load information heuristic strategies can be used. • There are 3- • EqJoin • CubeJoin • PropJoin

Maintenance of Join Synopses • Need to maintain synchronization when base relation is updated. • If a tuple is added do this. • Let Pu be current probability for including a newly arrived tuple for relation u in the random sample Su. • Let uxr2xr3x….xrk be the max foreign key join with source u. • We add Tp (new tuple) to Su with probability Pu. • If Tp is added to Su, we add to J(Su) the tuple Tpxr2xr3x….rk.

Maintenance of Join Synopses • On delete of a tuple Tp from u, do this • Find if Tp is in Su. If there, then delete it from Su & remove the tuple in J(Su) corresponding to Tp. • If sample becomes too small , repopulate the sample by rescanning relation u. • This algorithm performs look ups to base relation with probability Pu. • We never update join synopses for any ancestors of u.

Experimental Evaluation • Test bed – TPC-D decision support benchmark. DB of around 300 MB. • Machine – 296MHz UltraSPARC-II, 256 MB of memory, Solaris 5.6. • Query used is based on Q5 & an aggregate computed on join of Lineitem, Customer, Order, Supplier, Nation, Region. • The query used is

Join Synopsis Accuracy

Join Synopsis Maintenance

Conclusion • Focus on computing approximate answers to aggregates computed on multi-way joins. • For DB’s with schema’s that involve only foreign key joins, join synopses has been proposed as s solution. • Join synopses can be maintained effectively during updates.

Join Synopses for Approximate Query Answering