Faculty of Computer Science, Institute of System Architecture, Database Technology Group. Sampling Algorithms for Evolving Datasets. Rainer Gemulla. Defense of Ph.D. Thesis, 20.10.2008.
Application Level (external) • Clustering • Find similar groups • Often superlinear in input size • Procedure • Run k-means • Estimate mean and variance • 99% confidence interval under normal distribution • Run on sample • 5%
System Level (internal) • Selectivity Estimation • Determine percentage of tuples that satisfy a query • Key to effective query optimization • Procedure • Exact computation • 5% sample • How good is this? • Arbitrary dataset • 1% absolute error, 95% confidence • ≈20k items • Exact: 1.1% • Sample: ≈1.2% • Exact: 83.8% • Sample: ≈83.6%
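The slide does not say how the ≈20k figure is obtained. One distribution-free bound that lands in that range is Hoeffding's inequality, n ≥ ln(2/δ) / (2ε²). Treating it as the source of the slide's number is an assumption, and the function name below is made up for this sketch.

```python
import math

def hoeffding_sample_size(abs_error, confidence):
    """Smallest n with P(|estimate - truth| > abs_error) <= 1 - confidence,
    according to Hoeffding's inequality for a selectivity in [0, 1].
    Assumption: this is only one possible bound, not necessarily the slide's."""
    delta = 1.0 - confidence
    return math.ceil(math.log(2.0 / delta) / (2.0 * abs_error ** 2))

# 1% absolute error at 95% confidence, independent of the dataset
n = hoeffding_sample_size(abs_error=0.01, confidence=0.95)
print(n)
```

The result is on the order of the "≈20k items" quoted on the slide and does not depend on the size or distribution of the dataset.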
Applications • Sample Computation • Sample Maintenance • The Whole Picture • Conclusion
Option 1: Query Sampling • Advantages • No impact on traditional query processing • No storage requirements • Disadvantages • Sampling step is expensive • Supports only simple queries • Cannot handle data skew • [Diagram labels: Base data, Queries, Updates, Sampling step, Approximate queries, Estimation step, Approximate results]
Option 2: Materialized Sampling • Advantages • Quick access to the sample • Sophisticated preprocessing feasible • Disadvantages • Storage space • Impact on updates • [Diagram labels: Base data, Sample data, Queries, Updates, Sampling step, Approximate queries, Estimation step, Approximate results; annotation: "My thesis"]
Applications • Sample Computation • Sample Maintenance • The Whole Picture • Conclusion
Sample Maintenance • Maintenance Problem for Evolving Datasets • Given: a dataset, a sample, a stream of operations • Insert: Add an item to the dataset • Update: Change the value of an item in the dataset • Delete: Remove an item from the dataset • Goal: maintain the statistical validity of the sample • Uniform Sampling • Any two samples of the same size are equally likely • Example dataset: {A, B, C}
The Classic Schemes • Reservoir sampling • Computes a random sample of size M • Fixed space consumption & response time • Might produce undersized samples • Bernoulli sampling • Computes a random sample of fraction ≈q • Varying space consumption & response time • Might produce oversized samples • Problems • Support for updates & deletions • Support for multisets & projections of multisets • Support for resizing & combination • Schemes cannot be used directly! • [Figure labels: M = 800k, q = 10%]
Reservoir Sampling & Deletions • Key problem • Deletions decrease the sample size • Proposed solutions • CAR samples, backing samples, tagged samples, passive samples, purged Bernoulli samples, … • Key ideas • Refill: go to the base data and get a replacement • Recompute: let the sample shrink, but recompute occasionally • [Figure: example on dataset {A, B, C} with the deletion -C; the samples shown each have probability 33%]
Sample Size & Cost • [Figure annotations: = 2% of the data; almost constant sample size; zero base data accesses]
Random Pairing • How does it work? • Compensates deletions with subsequent insertions • Details • Pair each insertion with a deleted "partner" • Undo the deletion of the partner • Direct pairing would require the entire deletion history • Use a randomized pairing • [Figure: example on dataset {A, B, C} with the operations -C, +D, +C; each sample shown has probability 33%; paired insertions are marked "Pair!"]
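The randomized pairing can be sketched with two counters instead of a deletion history. The version below follows the random pairing scheme as it is usually described: c1 counts uncompensated deletions that hit the sample, c2 counts those that did not. The class and attribute names are illustrative, and the sketch assumes a set of distinct items in which every deleted item is actually present in the dataset.

```python
import random

class RandomPairingSample:
    """Bounded-size uniform sample under insertions and deletions (sketch)."""

    def __init__(self, capacity, rng=None):
        self.capacity = capacity      # upper bound M on the sample size
        self.sample = []              # current sample
        self.dataset_size = 0         # current size of the base data
        self.c1 = 0                   # uncompensated deletions that were in the sample
        self.c2 = 0                   # uncompensated deletions that were not
        self.rng = rng or random.Random()

    def delete(self, item):
        self.dataset_size -= 1
        if item in self.sample:
            self.sample.remove(item)
            self.c1 += 1
        else:
            self.c2 += 1

    def insert(self, item):
        self.dataset_size += 1
        if self.c1 + self.c2 == 0:
            # No deletion to compensate: ordinary reservoir sampling step.
            if len(self.sample) < self.capacity:
                self.sample.append(item)
            elif self.rng.random() < self.capacity / self.dataset_size:
                self.sample[self.rng.randrange(self.capacity)] = item
        elif self.rng.random() < self.c1 / (self.c1 + self.c2):
            # Paired with a deletion that had removed a sample item: refill.
            self.sample.append(item)
            self.c1 -= 1
        else:
            # Paired with a deletion outside the sample: item is not sampled.
            self.c2 -= 1
```

Once c1 + c2 insertions have arrived after a batch of deletions, every deletion has been compensated and the sample is back at its previous size without any access to the base data.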
Bernoulli Sampling & Multisets • Why multisets? • Only columns relevant for analysis are stored in the sample • May not include the primary key • Bernoulli sampling on multisets • Insertions • Accept with probability q, reject otherwise • Deletions • Pick a random copy and undo its insertion • Sample size is reduced when the picked copy was sampled • Occurs with probability #sample/#base • We know #sample but not #base • [Figure: copies of A with sample states S = ∅, S = {(A,1)}, S = {(A,2)}, S = {(A,3)}, S = {(A,4)}]
Augmented Bernoulli Sampling • Augmenting the sample • Count the number of insertions since first acceptance • How does this help to process deletions? • Delete right-side items first • We know the total number of A's • Naive scheme with probability (#sample-1)/(#inserts-1) • When empty, delete left-side item • [Figure: copies of A with sample states S = ∅, S = {(A,1,1)}, S = {(A,2,2)}, S = {(A,2,3)}, S = {(A,3,4)}, S = {(A,3,5)}, S = {(A,4,6)}; entries are (item, #sample, #inserts) with #inserts = #right + 1; right side: full knowledge; left side: just one sample]
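The bookkeeping on this slide can be sketched as follows. Each sampled value carries a pair (x, y), where x is #sample and y is #inserts since the first acceptance. A deletion first consumes the y - 1 "right-side" copies, of which x - 1 are sampled, and removes the entry once only the "left-side" copy remains. The class name and layout are illustrative, not taken from the thesis.

```python
import random

class AugmentedBernoulliSample:
    """Bernoulli sample of a multiset with per-value counters (sketch)."""

    def __init__(self, q, rng=None):
        self.q = q                    # sampling rate
        self.entries = {}             # value -> [x, y] = [#sample, #inserts]
        self.rng = rng or random.Random()

    def insert(self, value):
        accepted = self.rng.random() < self.q
        entry = self.entries.get(value)
        if entry is None:
            if accepted:
                self.entries[value] = [1, 1]   # first acceptance
        else:
            entry[1] += 1                      # one more insert since first acceptance
            if accepted:
                entry[0] += 1

    def delete(self, value):
        entry = self.entries.get(value)
        if entry is None:
            return                             # value not sampled: nothing to undo
        x, y = entry
        if y == 1:
            del self.entries[value]            # right side empty: drop the left-side copy
            return
        # Delete a right-side copy; it is sampled with probability (x-1)/(y-1).
        if self.rng.random() < (x - 1) / (y - 1):
            entry[0] -= 1
        entry[1] -= 1
```

With this rule the counters always satisfy 1 ≤ x ≤ y, and the size of the base multiset (#base) is never needed.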
Applications • Sample Computation • Sample Maintenance • The Whole Picture • Conclusion
Incremental Sample Maintenance • Different scenarios require different sampling schemes • [Table: columns are the base data types Set, Multiset, Projection (distinct items), and Data stream window, each split into Fraction and Size (a label "Fixed" also appears); rows are Insert, Update, and Delete; some cells are marked "?" or "n/a"; legend: Survey sampling, Previous work, Novel schemes]
Applications • Sample Computation • Sample Maintenance • The Whole Picture • Conclusion
Conclusion • Database sampling • Has a lot of applications … • … and provides us with a lot of interesting problems • Materialized sampling • Avoids performance problems of query sampling • Requires maintenance as data evolves • Efficient, incremental maintenance algorithms exist • In the thesis • Novel sampling algorithms • Improved estimators • Algorithms for resizing samples • Algorithms for combining samples
Thank you! Questions?
A Dip in the Reservoir: Maintaining Sample Synopses of Evolving Datasets. Reservoir Sampling • Reservoir sampling • computes a uniform sample of M elements • building block for many sophisticated sampling schemes • single-scan algorithm • add the first M elements • afterwards, flip a coin • ignore the element (reject) • replace a random element in the sample (accept) • accept probability of the ith element: M/i
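The single-scan procedure above can be sketched in a few lines. The sketch uses the standard acceptance probability M/i for the ith element (i > M); the function name is illustrative.

```python
import random

def reservoir_sample(stream, m, rng=None):
    """Return a uniform random sample of at most m elements from stream."""
    rng = rng or random.Random()
    sample = []
    for i, item in enumerate(stream, start=1):
        if i <= m:
            sample.append(item)                 # add the first m elements
        elif rng.random() < m / i:              # accept with probability m/i
            sample[rng.randrange(m)] = item     # replace a random element
        # otherwise: reject (ignore the element)
    return sample
```

For example, `reservoir_sample(range(1000), 10)` always returns exactly 10 elements, and a stream shorter than M is returned in full, which is the "undersized sample" case.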
Reservoir Sampling (Example) • Example • sample size M = 2
Backup: An Incorrect Approach • Idea • use arriving insertions to refill the sample • Not uniform!
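Why refilling breaks uniformity can be checked by exact enumeration on a tiny case. The concrete scenario below is an assumption chosen for illustration (dataset {A, B, C}, M = 2, delete C, then insert D); the slide's own example is not recoverable from the transcript. The naive rule adds the new item whenever the sample is undersized and otherwise performs a normal reservoir step.

```python
from fractions import Fraction
from itertools import combinations

M = 2
# Start from a uniform size-2 sample of {A, B, C}: each sample has probability 1/3.
dist = {frozenset(s): Fraction(1, 3) for s in combinations("ABC", M)}

# Delete C: the item simply disappears from every sample that contained it.
after_delete = {}
for s, p in dist.items():
    t = s - {"C"}
    after_delete[t] = after_delete.get(t, 0) + p

# Insert D with the naive refill rule.
after_insert = {}
def add(s, p):
    after_insert[s] = after_insert.get(s, 0) + p

dataset_size = 3  # dataset is {A, B, D} after the insertion
for s, p in after_delete.items():
    if len(s) < M:
        add(s | {"D"}, p)                      # undersized: always refill
    else:
        accept = Fraction(M, dataset_size)     # ordinary reservoir step
        add(s, p * (1 - accept))
        for victim in s:
            add((s - {victim}) | {"D"}, p * accept / len(s))

for s, p in sorted(after_insert.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(s), p)
```

A uniform sample of {A, B, D} would assign probability 1/3 to each of the three possible samples. Under the naive rule the samples containing D are overrepresented.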
Random Pairing • Example
Total Cost • Total cost • stable dataset, 10M operations • sample size 100k, data access 10 times more expensive than sample access • [Figure labels: Base data access, No base data access]
Types of Data Sets • Data sets • variation of data set size • influence on sampling • Stable • Goal: stable sample • Growing • Goal: controlled growing sample • Shrinking • uninteresting
Resizing • Example • resize by 30% if sampling fraction drops below 9% • dependent on costs of accessing base data • Low costs • immediate resizing • Moderate costs • combined solution • High costs • random pairing resizing
Backup: Bounded-Size Sampling • Why sampling? • performance, performance, performance • How much to sample? • influencing factors • storage consumption • response time • accuracy • choosing the sample size / sampling fraction • (1) largest sample that meets storage requirements • (2) largest sample that meets response time requirements • (3) smallest sample that meets accuracy requirements
Backup: Bounded-Size Sampling • Example • random pairing vs. Bernoulli sampling • average estimation • [Figure labels: Data set, Sample size, Standard error; annotations: "BS violates 1, 2", "BS violates 3"]
Example: Bernoulli sampling • Bernoulli sampling (coin-flip sample) • each item is included with probability q (= sampling rate) • sample size is qN in expectation, where N is the window size • not a bounded-space scheme • Example: 40-byte items, 32 kbyte space • max 819 items • q = 0.0276
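A minimal sketch of the coin-flip scheme, together with the capacity arithmetic from the example. The function name and the test stream are illustrative. The window size N behind q = 0.0276 is not given on the slide, so that value is taken as-is rather than derived.

```python
import random

def bernoulli_sample(stream, q, rng=None):
    """Include each item independently with probability q."""
    rng = rng or random.Random()
    return [item for item in stream if rng.random() < q]

# Capacity from the slide's example: 32 kbyte of space, 40-byte items.
ITEM_BYTES = 40
SPACE_BYTES = 32 * 1024
max_items = SPACE_BYTES // ITEM_BYTES   # 819 items fit

# The sample size is random: only its expectation q*N is fixed, so for a
# large enough window the sample can exceed max_items (no space bound).
sample = bernoulli_sample(range(10_000), q=0.0276, rng=random.Random(42))
print(max_items, len(sample))
```

Because the sample size varies around qN, q has to be chosen conservatively to stay below the space limit with high probability, which is what makes the scheme wasteful compared with bounded-size schemes.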
Example: Priority Sampling • [Figure labels: Sample size, Sample space] • k = 113 items
Example: Bounded Priority Sampling • [Figure labels: Sample size, Sample space] • k = 585 items
More Motivation: A Sample Warehouse • [Diagram: a full-scale warehouse of data partitions; each partition is sampled into a warehouse of samples S1,1, S1,2, …, Sn,m; samples are merged into S*,*, S1-2,3-7, etc.]