This document discusses the challenges of join sampling in database systems and presents innovative strategies to improve efficiency. It begins with an overview of traditional techniques such as Naive-Sample and Olken-Sample and then introduces novel methods like Stream Sample, Group Sample, and Frequency-Partition Sample. Experimental results indicate that these new strategies outperform previous methods in terms of effectiveness and resource management. The classification of join sampling problems is explored, highlighting the varying levels of available information.
On Random Sampling over Joins
Surajit Chaudhuri (Microsoft Research), Rajeev Motwani (Stanford University), Vivek Narasayya (Microsoft Research)
Outline:
• The difficulty of join sampling: an example
• Semantics and algorithms of sampling
• Two previous sampling strategies
• New strategies for join sampling
• Experimental results
The Difficulty of Join Sampling: Example
• Suppose that we have the relations
Black-Box U2: Given relation R with n tuples, generate an unweighted WR (with-replacement) sample of size r.
1. N ← 0.
2. Initialize reservoir array A[1..r] with r dummy values.
3. While tuples are streaming by do begin
   (a) get next tuple t;
   (b) N ← N + 1;
   (c) for j = 1 to r, set A[j] to t with probability 1/N
   end
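The Black-Box U2 loop can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the function name `black_box_u2`, the `rng` parameter, and the counter name `n_seen` (standing in for N, the number of tuples seen so far) are assumptions made for the example.

```python
import random

def black_box_u2(stream, r, rng=random):
    """One-pass unweighted with-replacement (WR) sample of size r."""
    n_seen = 0                  # N: number of tuples seen so far (assumed name)
    reservoir = [None] * r      # A[1..r], filled with dummy values
    for t in stream:
        n_seen += 1
        for j in range(r):
            # Each slot is overwritten independently with probability 1/N.
            # For the first tuple N = 1, so every slot is filled.
            if rng.random() < 1.0 / n_seen:
                reservoir[j] = t
    return reservoir
```

Because every slot is updated independently, the same tuple can occupy several slots, which is what makes this a with-replacement sample.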
Black-Box WR2: Given relation R with n tuples, generate a weighted WR sample of size r.
1. W ← 0.
2. Initialize reservoir array A[1..r] with r dummy values.
3. While tuples are streaming by do begin
   (a) get next tuple t with weight w(t);
   (b) W ← W + w(t);
   (c) for j = 1 to r, set A[j] to t with probability w(t)/W
   end
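A weighted variant of the same loop might look like the sketch below. The function name `black_box_wr2`, the `weight` callback, and the decision to skip tuples with non-positive weight are assumptions made for illustration; `total` stands in for the running weight sum W.

```python
import random

def black_box_wr2(stream, r, weight, rng=random):
    """One-pass weighted with-replacement (WR) sample of size r."""
    total = 0.0                 # W: sum of the weights seen so far
    reservoir = [None] * r      # A[1..r], filled with dummy values
    for t in stream:
        w = weight(t)
        if w <= 0:
            continue            # assumption: zero-weight tuples are never sampled
        total += w
        for j in range(r):
            # Each slot takes t with probability w(t)/W.
            if rng.random() < w / total:
                reservoir[j] = t
    return reservoir
```

With `weight=lambda t: 1` this reduces to the unweighted case, since w(t)/W then equals 1/N.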
The Classification of the Problem (for the join of relations R1 and R2):
• Case A: No information is available for either R1 or R2.
• Case B: No information is available for R1, but indexes and/or statistics are available for R2.
• Case C: Indexes and/or statistics are available for both R1 and R2.
Previous Sampling Strategies
Strategy Naive-Sample:
1. Compute the join J = R1 ⋈ R2.
2. As the tuples of J stream by, use Black-Box U1 or U2 to produce the sample.
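As a rough sketch of Naive-Sample, the code below computes the full join and streams it through a U2-style one-pass sampler. The representation is an assumption: each relation is a list of tuples whose first field is the join attribute, and the names `join_stream` and `naive_sample` are hypothetical. The hash join is only one possible way to compute J.

```python
import random
from collections import defaultdict

def join_stream(r1, r2):
    """Yield every pair (t1, t2) whose join attributes (field 0) match."""
    index = defaultdict(list)
    for t2 in r2:
        index[t2[0]].append(t2)
    for t1 in r1:
        for t2 in index.get(t1[0], ()):
            yield (t1, t2)

def naive_sample(r1, r2, r, rng=random):
    """Compute the whole join, then sample r joined tuples with replacement."""
    n_seen = 0
    reservoir = [None] * r
    for pair in join_stream(r1, r2):
        n_seen += 1
        for j in range(r):
            if rng.random() < 1.0 / n_seen:
                reservoir[j] = pair
    return reservoir
```

The cost is that of the full join, however small r is, which is the inefficiency the other strategies try to avoid.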
Previous Sampling Strategies
Strategy Olken-Sample:
1. Let M be an upper bound on m2(v) for all values v of the join attribute A, where m2(v) is the number of tuples in R2 with A = v.
2. repeat
   (a) Sample a tuple t1 from R1 uniformly at random.
   (b) Sample a random tuple t2 from among all tuples t in R2 that have t.A = t1.A.
   (c) Output t1 ⋈ t2 with probability m2(t2.A)/M, and with the remaining probability reject the sample.
   until r tuples have been produced.
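The accept/reject loop of Olken-Sample can be sketched as below. The assumptions are: relations are lists of tuples with the join attribute in field 0, an in-memory dictionary stands in for the index on the second relation, and the bound M is taken to be the largest group size. The name `olken_sample` is hypothetical.

```python
import random
from collections import defaultdict

def olken_sample(r1, r2, r, rng=random):
    """Rejection-sample r joined tuples (with replacement)."""
    index = defaultdict(list)           # stand-in for an index on r2
    for t2 in r2:
        index[t2[0]].append(t2)
    if not any(t1[0] in index for t1 in r1):
        raise ValueError("join is empty; no sample can be produced")
    m_bound = max(len(group) for group in index.values())  # M
    out = []
    while len(out) < r:
        t1 = rng.choice(r1)             # (a) uniform tuple from r1
        matches = index.get(t1[0])
        if not matches:
            continue                    # no join partner: accept probability is 0
        t2 = rng.choice(matches)        # (b) uniform matching tuple from r2
        if rng.random() < len(matches) / m_bound:
            out.append((t1, t2))        # (c) accept with probability m2/M
    return out
```

When a few join values are much more frequent than the rest, M is large relative to a typical group size and most iterations end in rejection.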
New Strategies for Join Sampling
• Strategy Stream-Sample is more efficient than Olken-Sample:
1. No information is required for R1 (Case B).
2. No tuple is rejected after the join is computed.
3. Only one iteration is needed for each output tuple.
New Strategies for Join Sampling
Strategy Stream-Sample:
1. Use Black-Box WR1 or WR2 to produce a WR sample S1 of R1 of size r, where the weight w(t) for a tuple t in R1 is set to m2(t.A), the number of tuples in R2 whose join attribute A equals t.A.
2. While tuples of S1 are streaming by do begin
   (a) get next tuple t1 and let v = t1.A;
   (b) sample a random tuple t2 from among all tuples t in R2 that have t.A = v;
   (c) output t1 ⋈ t2.
   end
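A compact sketch of Stream-Sample follows. It assumes relations are lists of tuples with the join attribute in field 0, uses an in-memory dictionary as a stand-in for the index on the second relation, and substitutes Python's `random.choices` for the weighted WR black box; the name `stream_sample` is hypothetical.

```python
import random
from collections import defaultdict

def stream_sample(r1, r2, r, rng=random):
    """Weighted WR sample of r1, then one random join partner per sampled tuple."""
    index = defaultdict(list)           # stand-in for an index on r2
    for t2 in r2:
        index[t2[0]].append(t2)
    # Weight of each r1 tuple = number of matching tuples in r2.
    weights = [len(index.get(t1[0], ())) for t1 in r1]
    if sum(weights) == 0:
        raise ValueError("join is empty; no sample can be produced")
    # Step 1: weighted WR sample (random.choices replaces Black-Box WR1/WR2 here).
    s1 = rng.choices(r1, weights=weights, k=r)
    # Step 2: each sampled tuple has weight > 0, so a partner always exists.
    return [(t1, rng.choice(index[t1[0]])) for t1 in s1]
```

Every sampled tuple yields exactly one output tuple, so there is no rejection loop.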
New Strategies for Join Sampling
Strategy Group-Sample:
1. Use Black-Box WR1 or WR2 to produce a WR sample S1 of R1 of size r, where the weight w(t) for a tuple t in R1 is set to m2(t.A), the number of tuples in R2 whose join attribute A equals t.A.
2. Let S1 consist of the tuples s1, ..., sr. Produce the join of S1 with R2, whose tuples are grouped by the tuples of S1 that generated them.
3. Use r invocations of Black-Box U1 or U2 to draw r sample tuples, one from each group.
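The three Group-Sample steps can be sketched as below, under several assumptions: relations are lists of tuples with the join attribute in field 0, the frequency statistics are computed on the fly with a `Counter`, `random.choices` stands in for the weighted WR black box, and a size-one reservoir per group plays the role of Black-Box U2. The name `group_sample` is hypothetical.

```python
import random
from collections import Counter, defaultdict

def group_sample(r1, r2, r, rng=random):
    """Weighted WR sample of r1, then one uniform pick from each join group."""
    freq = Counter(t2[0] for t2 in r2)          # frequency statistics for r2
    weights = [freq[t1[0]] for t1 in r1]
    if sum(weights) == 0:
        raise ValueError("join is empty; no sample can be produced")
    # Step 1: weighted WR sample of r1.
    s1 = rng.choices(r1, weights=weights, k=r)
    # Step 2: join the sample with r2 in one scan, grouped by sample position.
    slots_by_key = defaultdict(list)
    for i, s in enumerate(s1):
        slots_by_key[s[0]].append(i)
    picked = [None] * r
    seen = [0] * r
    for t2 in r2:
        for i in slots_by_key.get(t2[0], ()):
            seen[i] += 1
            # Step 3: size-one reservoir, uniform over the tuples in group i.
            if rng.random() < 1.0 / seen[i]:
                picked[i] = t2
    return [(s1[i], picked[i]) for i in range(r)]
```

In this sketch the second relation is read by a sequential scan rather than by index lookups.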
New Strategy for Join Sampling • Strategy Frequency-Partition-Sample
Summary
• The difficulty of join sampling: an example.
• The classification of the problem: 3 cases.
• Previous strategies: Naive-Sample, Olken-Sample.
• New strategies: Stream-Sample, Group-Sample, Frequency-Partition-Sample.
• Conclusion: The new strategies are better than the earlier techniques.