Sampling Algorithms for Evolving Datasets Rainer Gemulla Defense of Ph.D. Thesis 20.10.2008

Faculty of Computer Science, Institute of System Architecture, Database Technology Group

Application Level (external) • Clustering • Find similar groups • Often superlinear in input size • Procedure • Run k-means • Estimate mean and variance • 99% confidence interval under normal distribution • Run on sample • 5%

System Level (internal) • Selectivity Estimation • Determine percentage of tuples that satisfy a query • Key to effective query optimization • Procedure • Exact computation • 5% Sample • How good is this? • Arbitrary dataset • 1% absolute error, 95% confidence • ≈20k items • Exact: 1.1% • Sample: ≈1.2% • Exact: 83.8% • Sample: ≈83.6%
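The slide does not say how the ≈20k figure for 1% absolute error at 95% confidence was obtained. One distribution-free bound that lands in that range is Hoeffding's inequality, n ≥ ln(2/δ) / (2ε²). The sketch below is an illustration under that assumption, not necessarily the derivation used in the talk; the function name is hypothetical.

```python
import math

def hoeffding_sample_size(abs_error, confidence):
    """Sample size that keeps a proportion estimate within abs_error
    with the given confidence, by Hoeffding's inequality.
    Assumption: the slide's figure may come from a different bound."""
    delta = 1.0 - confidence  # allowed failure probability
    return math.ceil(math.log(2.0 / delta) / (2.0 * abs_error ** 2))

# 1% absolute error, 95% confidence, independent of the dataset size
n = hoeffding_sample_size(0.01, 0.95)
print(n)
```

The bound does not depend on the size or distribution of the dataset, which is consistent with the "arbitrary dataset" remark on the slide.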

Applications • Sample Computation • Sample Maintenance • The Whole Picture • Conclusion

Option 1: Query Sampling • Advantages • No impact on traditional query processing • No storage requirements • Disadvantages • Sampling step is expensive • Supports only simple queries • Cannot handle data skew • (Diagram labels: Queries, Updates, Base data, Sampling step, Approximate queries, Estimation step, Approximate results)

Option 2: Materialized Sampling • Advantages • Quick access to the sample • Sophisticated preprocessing feasible • Disadvantages • Storage space • Impact on updates • (Diagram labels: Queries, Updates, Base data, Sampling step, Sample data, Approximate queries, Estimation step, Approximate results; the sampling step under updates is marked "My thesis")

Sample Maintenance • Maintenance Problem for Evolving Datasets • Given: a dataset, a sample, a stream of operations • Insert: Add an item to the dataset • Update: Change the value of an item in the dataset • Delete: Remove an item from the dataset • Goal: maintain the statistical validity of the sample • Uniform Sampling • Any two samples of the same size are equally likely • Example dataset: {A, B, C}

The Classic Schemes • Reservoir sampling • Computes a random sample of size M • Fixed space consumption & response time • Might produce undersized samples • Bernoulli sampling • Computes a random sample of fraction ≈q • Varying space consumption & response time • Might produce oversized samples • Problems • Support for updates & deletions • Support for multisets & projections of multisets • Support for resizing & combination • Schemes cannot be used directly! • (Figure labels: M=800k, q=10%)

Reservoir Sampling & Deletions • Key problem • Deletions decrease the sample size • Proposed solutions • CAR samples, backing samples, tagged samples, passive samples, purged Bernoulli samples, … • Key ideas • Refill: go to the base data and get a replacement • Recompute: let the sample shrink, but recompute occasionally • (Figure: example dataset {A, B, C} with deletion -C; the possible samples are shown with 33% probability each)

Sample Size & Cost • =2% of the data • Almost constant sample size • Zero base data accesses

Random Pairing • How does it work? • Compensates deletions with subsequent insertions • Details • Pair each insertion with a deleted "partner" • Undo the deletion of the partner • Direct pairing would require the entire deletion history → Use a randomized pairing • (Figure: example on dataset {A, B, C} with operations -C, +C, +D; the possible samples are shown with 33% probability each; compensating insertions are marked "Pair!")
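The randomized pairing can be sketched with two counters instead of a deletion history. The code below is a minimal illustration, not the thesis implementation: the class name and the counters `c1` and `c2` (uncompensated deletions that did or did not hit the sample) are assumptions introduced here, and the sketch assumes distinct items and deletions of existing items only.

```python
import random

class RandomPairingSample:
    """Bounded-size sample over inserts and deletes (illustrative sketch)."""

    def __init__(self, capacity, rng=None):
        self.capacity = capacity   # M, upper bound on the sample size
        self.sample = []
        self.dataset_size = 0
        self.c1 = 0                # uncompensated deletions that were in the sample
        self.c2 = 0                # uncompensated deletions that were not
        self.rng = rng if rng is not None else random.Random()

    def insert(self, item):
        self.dataset_size += 1
        if self.c1 + self.c2 == 0:
            # No deletion to compensate: ordinary reservoir step.
            if len(self.sample) < self.capacity:
                self.sample.append(item)
            elif self.rng.random() < self.capacity / self.dataset_size:
                self.sample[self.rng.randrange(len(self.sample))] = item
        elif self.rng.random() < self.c1 / (self.c1 + self.c2):
            # Paired with a deleted partner that was sampled: take its place.
            self.sample.append(item)
            self.c1 -= 1
        else:
            # Paired with a deleted partner that was not sampled: stay out.
            self.c2 -= 1

    def delete(self, item):
        self.dataset_size -= 1
        if item in self.sample:
            self.sample.remove(item)
            self.c1 += 1
        else:
            self.c2 += 1

rp = RandomPairingSample(capacity=3, rng=random.Random(7))
for x in range(10):
    rp.insert(x)
for x in range(4):
    rp.delete(x)
for x in range(10, 14):
    rp.insert(x)   # each insertion compensates one earlier deletion
print(sorted(rp.sample))
```

Once every deletion has been compensated, the sample is back at its original size without any access to the base data.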

Bernoulli Sampling & Multisets • Why multisets? • Only columns relevant for analysis are stored in the sample • May not include the primary key • Bernoulli sampling on multisets • Insertions • Accept with probability q, reject otherwise • Deletions • Pick a random copy and undo its insertion • Sample size is reduced when the picked copy was sampled • Occurs with probability #sample/#base • We know #sample but not #base • (Figure: copies of A with sample states S=∅, S={(A,1)}, S={(A,2)}, S={(A,3)}, S={(A,4)})

Augmented Bernoulli Sampling • Augmenting the sample • Count the number of insertions since first acceptance • How does this help to process deletions? • Delete right-side items first • We know the total number of A's • Naive scheme with probability (#sample-1)/(#inserts-1) • When empty, delete left-side item • (Figure: copies of A with sample states S=∅, S={(A,1,1)}, S={(A,2,2)}, S={(A,2,3)}, S={(A,3,4)}, S={(A,3,5)}, S={(A,4,6)}; each entry is (item, #sample, #inserts) with #inserts = #right+1; Right: full knowledge; Left: just one sample)
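The bookkeeping for a single distinct value can be sketched as follows. This is an illustration of the rules listed on the slide under stated assumptions: the class and attribute names are hypothetical, `in_sample` plays the role of #sample, `tracked` the role of #inserts since first acceptance, and a deletion that arrives while no copy is sampled is treated as a no-op.

```python
import random

class AugmentedBernoulliCounter:
    """Tracks (#sample, #inserts since first acceptance) for one value."""

    def __init__(self, q, rng=None):
        self.q = q            # sampling rate
        self.in_sample = 0    # #sample
        self.tracked = 0      # #inserts, counted from the first accepted copy
        self.rng = rng if rng is not None else random.Random()

    def insert(self):
        accepted = self.rng.random() < self.q
        if self.in_sample == 0:
            if accepted:                  # first acceptance starts the tracking
                self.in_sample = 1
                self.tracked = 1
        else:
            self.tracked += 1             # every later insertion is counted
            if accepted:
                self.in_sample += 1

    def delete(self):
        if self.in_sample == 0:
            return                        # nothing sampled, nothing to undo
        if self.tracked == 1:
            # Right side is empty: remove the left-side (first accepted) copy.
            self.in_sample = 0
            self.tracked = 0
            return
        # Delete a right-side copy; it was sampled with this probability.
        if self.rng.random() < (self.in_sample - 1) / (self.tracked - 1):
            self.in_sample -= 1
        self.tracked -= 1

c = AugmentedBernoulliCounter(q=1.0)      # q = 1 makes the trace deterministic
for _ in range(3):
    c.insert()
state_after_inserts = (c.in_sample, c.tracked)
c.delete()
state_after_delete = (c.in_sample, c.tracked)
```

With a sampling rate below 1, the same calls produce traces like the one in the figure, where the second counter grows on every insertion and the first only on accepted ones.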

Incremental Sample Maintenance • Different scenarios require different sampling schemes • (Table: columns are the base data types Set, Multiset, Projection (distinct items), and Data stream window, each split into Fixed Size and Fraction; rows are Insert, Update, and Delete; some cells are marked "?" or "n/a"; legend: Survey sampling, Previous work, Novel schemes)

Conclusion • Database sampling • Has a lot of applications … • … and provides us with a lot of interesting problems • Materialized sampling • Avoids performance problems of query sampling • Requires maintenance as data evolves • Efficient, incremental maintenance algorithms exist • In the thesis • Novel sampling algorithms • Improved estimators • Algorithms for resizing samples • Algorithms for combining samples

Thank you! Questions?

Survey Sampling

Permuted-Data Sampling

Rough Comparison

A Dip in the Reservoir: Maintaining Sample Synopses of Evolving Datasets • Reservoir Sampling • Computes a uniform sample of M elements • Building block for many sophisticated sampling schemes • Single-scan algorithm • Add the first M elements • Afterwards, flip a coin • Ignore the element (reject) • Replace a random element in the sample (accept) • Accept probability of the i-th element: M/i
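The single-scan procedure can be sketched in a few lines. This is the textbook form of reservoir sampling, in which the i-th element is accepted with probability M/i; the function name and the optional `rng` parameter are choices made for this illustration.

```python
import random

def reservoir_sample(stream, M, rng=None):
    """Return a uniform random sample of at most M elements from stream."""
    rng = rng if rng is not None else random.Random()
    sample = []
    for i, item in enumerate(stream, start=1):
        if i <= M:
            sample.append(item)               # the first M elements always enter
        elif rng.random() < M / i:            # accept the i-th element w.p. M/i
            sample[rng.randrange(M)] = item   # replace a random sample element
        # otherwise: reject and ignore the element
    return sample

picked = reservoir_sample(range(1000), M=5, rng=random.Random(42))
print(picked)
```

The sample never holds more than M elements, which is what gives the scheme its fixed space consumption.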

Reservoir Sampling (Example) • Example • Sample size M = 2

Backup: An Incorrect Approach • Idea • Use arriving insertions to refill the sample • Not uniform!
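The loss of uniformity can be checked by exact enumeration on a tiny case. The sketch below assumes one particular reading of the refill idea (an undersized sample always takes the next insertion, a full sample performs an ordinary reservoir step) and reuses the example dataset {A, B, C} with M = 2; the slide itself does not spell out these details.

```python
from fractions import Fraction
from itertools import combinations

M = 2
# Uniform size-2 sample of {A, B, C}: each pair has probability 1/3.
dist = {frozenset(s): Fraction(1, 3) for s in combinations("ABC", M)}

# Delete C: remove it from every sample that contains it.
after_delete = {}
for s, p in dist.items():
    t = s - {"C"}
    after_delete[t] = after_delete.get(t, 0) + p

# Insert D with the naive refill rule (assumed reading of the slide).
after_insert = {}

def add(sample, p):
    after_insert[sample] = after_insert.get(sample, 0) + p

dataset_size = 3  # the dataset is {A, B, D} after the insertion
for s, p in after_delete.items():
    if len(s) < M:
        add(s | {"D"}, p)                     # undersized: D always enters
    else:
        accept = Fraction(M, dataset_size)    # ordinary reservoir step
        add(s, p * (1 - accept))
        for victim in s:
            add((s - {victim}) | {"D"}, p * accept / len(s))

for s, p in sorted(after_insert.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(s), p)
```

A uniform scheme would give each of the three possible samples probability 1/3. Under the naive rule the samples containing D are favored, because D refills every undersized sample.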

Random Pairing • Example

Total Cost • Stable dataset, 10M operations • Sample size 100k, data access 10 times more expensive than sample access • (Figure labels: Total cost, Base data access, No base data access)

Types of Data Sets • Variation of data set size • Influence on sampling • Stable • Goal: stable sample • Growing • Goal: controlled growing sample • Shrinking • Uninteresting

Resizing • Example • Resize by 30% if sampling fraction drops below 9% • Dependent on costs of accessing base data • Low costs • Immediate resizing • Moderate costs • Combined solution • High costs • Random pairing resizing

Backup: Bounded-Size Sampling • Why sampling? • Performance, performance, performance • How much to sample? • Influencing factors • Storage consumption • Response time • Accuracy • Choosing the sample size / sampling fraction • Largest sample that meets storage requirements • Largest sample that meets response time requirements • Smallest sample that meets accuracy requirements

Backup: Bounded-Size Sampling • Example • Random pairing vs. Bernoulli sampling • Average estimation • (Figure labels: Data set, Sample size, Standard error; BS violates 1, 2; BS violates 3)

Example: Bernoulli sampling • Bernoulli sampling (coin-flip sample) • Each item is included with probability q (= sampling rate) • Sample size is qN in expectation, where N is the window size • Not a bounded-space scheme • Example: 40-byte items, 32 kbyte space • Max 819 items • q = 0.0276
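The coin-flip scheme and the space arithmetic of the example can be sketched as follows. The function name is hypothetical, and reading "32 kbyte" as 32 × 1024 bytes is an assumption; it is the reading that reproduces the 819-item limit. The window size used to pick q = 0.0276 is not given in the example, so it is not reconstructed here.

```python
import random

def bernoulli_sample(items, q, rng=None):
    """Include each item independently with probability q."""
    rng = rng if rng is not None else random.Random()
    return [x for x in items if rng.random() < q]

ITEM_BYTES = 40
SPACE_BYTES = 32 * 1024                  # assumption: 1 kbyte = 1024 bytes
max_items = SPACE_BYTES // ITEM_BYTES    # items that fit into the space budget

sample = bernoulli_sample(range(10_000), q=0.0276, rng=random.Random(1))
print(max_items, len(sample))
```

Because the sample size is random, no fixed q guarantees that the sample stays below `max_items`; this is why the scheme is not bounded in space.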

Example: Priority Sampling • k = 113 items • (Figure labels: Sample size, Sample space)

Example: Bounded Priority Sampling • k = 585 items • (Figure labels: Sample size, Sample space)

More Motivation: A Sample Warehouse • (Diagram labels: Full-Scale Warehouse of Data Partitions; Sample, Sample, Sample; Warehouse of Samples S1,1, S1,2, Sn,m; merge; S*,*, S1-2,3-7, etc.)
