
Sampling Algorithms for Evolving Datasets Rainer Gemulla Defense of Ph.D. Thesis 20.10.2008

Faculty of Computer Science, Institute of System Architecture, Database Technology Group. Application level (external): clustering finds similar groups and is often superlinear in the input size.


Presentation Transcript


  1. Faculty of Computer Science, Institute of System Architecture, Database Technology Group • Sampling Algorithms for Evolving Datasets • Rainer Gemulla • Defense of Ph.D. Thesis • 20.10.2008

  2. Application Level (external) • Clustering • Find similar groups • Often superlinear in input size • Procedure • Run k-means • Estimate mean and variance • 99% confidence interval under normal distribution • Run on sample • 5%

  3. System Level (internal) • Selectivity Estimation • Determine percentage of tuples that satisfy a query • Key to effective query optimization • Procedure • Exact computation • 5% sample • How good is this? • Arbitrary dataset • 1% absolute error, 95% confidence • ≈20k items • Exact: 1.1% vs. sample: ≈1.2% • Exact: 83.8% vs. sample: ≈83.6%
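
The ≈20k figure on slide 3 is consistent with a distribution-free bound such as Hoeffding's inequality, which requires n ≥ ln(2/δ) / (2ε²) items for absolute error ε at confidence 1 − δ. Whether the talk used exactly this bound is an assumption; the function name below is illustrative only.

```python
import math

def hoeffding_sample_size(abs_error: float, confidence: float) -> int:
    """Sample size sufficient for the given absolute error and confidence,
    according to Hoeffding's inequality applied to 0/1 indicators
    (tuple satisfies the query or not). Assumed bound, not taken from the slides."""
    delta = 1.0 - confidence
    return math.ceil(math.log(2.0 / delta) / (2.0 * abs_error ** 2))

n = hoeffding_sample_size(abs_error=0.01, confidence=0.95)
print(n)  # roughly 18.4k, i.e. on the order of the slide's 20k
```

The bound does not depend on the dataset size or on the true selectivity, which matches the slide's "arbitrary dataset" remark.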

  4. Applications • Sample Computation • Sample Maintenance • The Whole Picture • Conclusion

  5. Option 1: Query Sampling • Advantages • No impact on traditional query processing • No storage requirements • Disadvantages • Sampling step is expensive • Supports only simple queries • Cannot handle data skew • [Diagram labels: base data, queries, updates, sampling step, estimation step, approximate queries, approximate results]

  6. Option 2: Materialized Sampling • Advantages • Quick access to the sample • Sophisticated preprocessing feasible • Disadvantages • Storage space • Impact on updates • [Diagram labels: base data, sample data, queries, updates, sampling step, estimation step, approximate queries, approximate results; annotation: "My thesis"]

  7. Applications • Sample Computation • Sample Maintenance • The Whole Picture • Conclusion

  8. Sample Maintenance • Maintenance Problem for Evolving Datasets • Given: a dataset, a sample, a stream of operations • Insert: Add an item to the dataset • Update: Change the value of an item in the dataset • Delete: Remove an item from the dataset • Goal: maintain the statistical validity of the sample • Uniform Sampling • Any two samples of the same size are equally likely • Example dataset: {A, B, C}

  9. The Classic Schemes • Reservoir sampling • Computes a random sample of size M • Fixed space consumption & response time • Might produce undersized samples • Bernoulli sampling • Computes a random sample of fraction ≈q • Varying space consumption & response time • Might produce oversized samples • Problems • Support for updates & deletions • Support for multisets & projections of multisets • Support for resizing & combination • Schemes cannot be used directly! • [Figure labels: M = 800k, q = 10%]

  10. Reservoir Sampling & Deletions • Key problem • Deletions decrease the sample size • Proposed solutions • CAR samples, backing samples, tagged samples, passive samples, purged Bernoulli samples, … • Key ideas • Refill: go to the base data and get a replacement • Recompute: let the sample shrink, but recompute occasionally • [Figure: samples of {A, B, C} before and after deleting C, with probability labels of 33% each]

  11. Sample Size & Cost • [Figure annotations: sample = 2% of the data; almost constant sample size; zero base data accesses]

  12. Random Pairing • How does it work? • Compensates deletions with subsequent insertions • Details • Pair each insertion with a deleted "partner" • Undo the deletion of the partner • Direct pairing would require the entire deletion history; use a randomized pairing instead • [Figure: worked example on {A, B, C} with the operations -C, +C, +D, "Pair!" annotations, and probability labels of 33% for each sample]
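
A minimal sketch of how random pairing can be realized with two counters instead of a deletion history. It follows my reading of the published random pairing scheme and assumes a set of distinct items and deletions only of items that are currently in the dataset; the class and attribute names (`RandomPairingSample`, `c1`, `c2`) are illustrative and do not come from the slides.

```python
import random

class RandomPairingSample:
    """Bounded-size uniform sample under insertions and deletions (sketch)."""

    def __init__(self, max_size, rng=None):
        self.M = max_size
        self.rng = rng or random.Random()
        self.sample = []        # current sample items
        self.dataset_size = 0   # current size of the base dataset
        self.c1 = 0             # uncompensated deletions that hit the sample
        self.c2 = 0             # uncompensated deletions that missed the sample

    def delete(self, item):
        self.dataset_size -= 1
        if item in self.sample:
            self.sample.remove(item)
            self.c1 += 1
        else:
            self.c2 += 1

    def insert(self, item):
        self.dataset_size += 1
        if self.c1 + self.c2 == 0:
            # No deletion to compensate: ordinary reservoir step.
            if len(self.sample) < self.M:
                self.sample.append(item)
            elif self.rng.random() < self.M / self.dataset_size:
                self.sample[self.rng.randrange(self.M)] = item
        elif self.rng.random() < self.c1 / (self.c1 + self.c2):
            # Paired with a deletion that had removed a sample item: refill.
            self.sample.append(item)
            self.c1 -= 1
        else:
            # Paired with a deletion outside the sample: item is not sampled.
            self.c2 -= 1
```

The randomized choice between the two counters stands in for picking a random deleted "partner", so no base data access and no deletion log are needed.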

  13. Bernoulli Sampling & Multisets • Why multisets? • Only columns relevant for analysis are stored in the sample • May not include the primary key • Bernoulli sampling on multisets • Insertions • Accept with probability q, reject otherwise • Deletions • Pick a random copy and undo its insertion • Sample size is reduced when the picked copy was sampled • Occurs with probability #sample/#base • We know #sample but not #base • [Figure: ten copies of A with sample states S = (empty), S = {(A,1)}, S = {(A,2)}, S = {(A,3)}, S = {(A,4)}]

  14. Augmented Bernoulli Sampling • Augmenting the sample • Count the number of insertions since first acceptance • How does this help to process deletions? • Delete right-side items first • We know the total number of A's • Naive scheme with probability (#sample-1)/(#inserts-1) • When empty, delete left-side item • [Figure: eleven copies of A with sample states S = (empty), {(A,1,1)}, {(A,2,2)}, {(A,2,3)}, {(A,3,4)}, {(A,3,5)}, {(A,4,6)}; tuple fields are #sample and #inserts = #right + 1; right side: full knowledge; left side: just one sample]
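
The bullet points of slide 14 can be sketched for a single distinct value as follows. This is a literal reading of the slide, not the thesis algorithm verbatim: the class name `AugmentedBernoulliCounter` and the fields `x` (#sample) and `y` (#inserts since the first accepted copy, inclusive) are assumptions made for illustration.

```python
import random

class AugmentedBernoulliCounter:
    """State (x, y) of one distinct value in an augmented Bernoulli sample (sketch)."""

    def __init__(self, q, rng=None):
        self.q = q                      # sampling rate
        self.rng = rng or random.Random()
        self.x = 0                      # copies of the value in the sample
        self.y = 0                      # insertions since the first accepted copy

    def insert(self):
        if self.x == 0:
            # Value not in the sample yet: plain Bernoulli trial.
            if self.rng.random() < self.q:
                self.x, self.y = 1, 1
        else:
            # Value is tracked: count the insertion, then run the trial.
            self.y += 1
            if self.rng.random() < self.q:
                self.x += 1

    def delete(self):
        if self.x == 0:
            return                      # nothing sampled, nothing to undo
        if self.y == 1:
            # Right side is empty: delete the left-side (first accepted) copy.
            self.x, self.y = 0, 0
            return
        # Delete a right-side copy; it was sampled with probability (x-1)/(y-1).
        if self.rng.random() < (self.x - 1) / (self.y - 1):
            self.x -= 1
        self.y -= 1
```

The extra counter is what makes the deletion probability computable without knowing the number of copies in the base data.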

  15. Applications • Sample Computation • Sample Maintenance • The Whole Picture • Conclusion

  16. Incremental Sample Maintenance • Different scenarios require different sampling schemes • [Table: columns are the base data types Set, Multiset, Projection (distinct items), and Data stream window, each split into Fraction and Size (fixed); rows are Insert, Update, Delete; some cells are marked "?" or "n/a"; legend: survey sampling, previous work, novel schemes]

  17. Applications • Sample Computation • Sample Maintenance • The Whole Picture • Conclusion

  18. Conclusion • Database sampling • Has a lot of applications … • … and provides us with a lot of interesting problems • Materialized sampling • Avoids performance problems of query sampling • Requires maintenance as data evolves • Efficient, incremental maintenance algorithms exist • In the thesis • Novel sampling algorithms • Improved estimators • Algorithms for resizing samples • Algorithms for combining samples

  19. Thank you! Questions?

  20. Survey Sampling

  21. Permuted-Data Sampling

  22. Rough Comparison

  23. A Dip in the Reservoir: Maintaining Sample Synopses of Evolving Datasets • Reservoir Sampling • computes a uniform sample of M elements • building block for many sophisticated sampling schemes • single-scan algorithm • add the first M elements • afterwards, flip a coin: ignore the element (reject) or replace a random element in the sample (accept) • accept probability of the i-th element
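
A minimal sketch of the single-scan procedure from slide 23. It uses the textbook acceptance probability M/i for the i-th element; the formula itself did not survive in the transcript, so treating it as M/i is an assumption based on the standard algorithm. The function name is illustrative.

```python
import random

def reservoir_sample(stream, M, rng=None):
    """Uniform sample of at most M elements from a single pass over `stream`."""
    rng = rng or random.Random()
    sample = []
    for i, item in enumerate(stream, start=1):
        if i <= M:
            # The first M elements are always added.
            sample.append(item)
        elif rng.random() < M / i:
            # Accept: replace a uniformly chosen element of the sample.
            sample[rng.randrange(M)] = item
        # Otherwise reject: the element is ignored.
    return sample

print(reservoir_sample(range(1000), M=5, rng=random.Random(42)))
```

Space stays fixed at M elements no matter how long the stream is, which is the "fixed space consumption" property listed for reservoir sampling.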

  24. Reservoir Sampling (Example) • Example • sample size M = 2

  25. Backup: An Incorrect Approach • Idea: use arriving insertions to refill the sample • Not uniform!
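
Why refilling breaks uniformity can be checked by exact enumeration on the running example {A, B, C} with M = 2. The concrete refill rule below (always accept an insertion while the sample is undersized, otherwise do a normal reservoir step) is an assumption about what the slide means by the incorrect approach; all function names are illustrative.

```python
from collections import defaultdict
from fractions import Fraction

M = 2  # target sample size

# Uniform size-2 samples of {A, B, C}: each has probability 1/3.
uniform = {frozenset(s): Fraction(1, 3) for s in ("AB", "AC", "BC")}

def delete(dist, item):
    """Remove `item` from every sample that contains it."""
    out = defaultdict(Fraction)
    for s, p in dist.items():
        out[s - {item}] += p
    return dict(out)

def naive_insert(dist, item, dataset_size):
    """Refill undersized samples; otherwise reservoir step with prob M/dataset_size."""
    out = defaultdict(Fraction)
    for s, p in dist.items():
        if len(s) < M:
            out[s | {item}] += p
        else:
            accept = Fraction(M, dataset_size)
            out[s] += p * (1 - accept)
            for victim in s:
                out[(s - {victim}) | {item}] += p * accept / len(s)
    return dict(out)

after_delete = delete(uniform, "C")
after_insert = naive_insert(after_delete, "D", dataset_size=3)
for s, p in sorted(after_insert.items(), key=lambda kv: sorted(kv[0])):
    print("".join(sorted(s)), p)
# AB 1/9
# AD 4/9
# BD 4/9
```

A uniform scheme would give each of AB, AD, BD probability 1/3; under this refill rule the newly inserted item D is overrepresented.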

  26. Random Pairing • Example

  27. Total Cost • stable dataset, 10M operations • sample size 100k, data access 10 times more expensive than sample access • [Figure groups: base data access, no base data access]

  28. Types of Data Sets • variation of data set size • influence on sampling • Stable: goal is a stable sample • Growing: goal is a controlled growing sample • Shrinking: uninteresting

  29. Resizing • Example: resize by 30% if sampling fraction drops below 9% • dependent on costs of accessing base data • Low costs: immediate resizing • Moderate costs: combined solution • High costs: random pairing resizing

  30. Backup: Bounded-Size Sampling • Why sampling? • performance, performance, performance • How much to sample? • influencing factors: storage consumption, response time, accuracy • choosing the sample size / sampling fraction • (1) largest sample that meets storage requirements • (2) largest sample that meets response time requirements • (3) smallest sample that meets accuracy requirements

  31. Backup: Bounded-Size Sampling • Example: random pairing vs. Bernoulli sampling • average estimation • [Figure panels: data set, sample size, standard error; annotations: "BS violates 1, 2" and "BS violates 3"]

  32. Example: Bernoulli sampling • Bernoulli sampling (coin-flip sample) • each item is included with probability q (= sampling rate) • sample size is qN in expectation, where N is the window size • not a bounded-space scheme • Example: 40-byte items, 32-kbyte space, max 819 items, q = 0.0276
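
A small sketch of the coin-flip scheme and of the space budget in the example. It assumes 1 kbyte = 1024 bytes, which reproduces the slide's limit of 819 items; the window size N used on the slide is not given, so the value of q is not re-derived here. Names are illustrative.

```python
import random

def bernoulli_sample(items, q, rng=None):
    """Include each item independently with probability q."""
    rng = rng or random.Random()
    return [x for x in items if rng.random() < q]

ITEM_BYTES = 40
SPACE_BYTES = 32 * 1024                 # assumption: 1 kbyte = 1024 bytes
max_items = SPACE_BYTES // ITEM_BYTES   # 819 items fit into the budget

# The sample size is random (qN only in expectation), so the budget is not guaranteed.
window = range(20_000)                  # hypothetical window, N chosen for illustration
sizes = [len(bernoulli_sample(window, 0.0276, random.Random(seed))) for seed in range(5)]
print(max_items, sizes)
```

Because the size fluctuates around qN, q has to be set low enough that exceeding the budget is unlikely, which is why the slide calls Bernoulli sampling "not a bounded-space scheme".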

  33. Example: Priority Sampling • [Figure panels: sample size, sample space] • k = 113 items

  34. Example: Bounded Priority Sampling • [Figure panels: sample size, sample space] • k = 585 items

  35. More Motivation: A Sample Warehouse • [Diagram: a full-scale warehouse of data partitions; one sample per partition forms a warehouse of samples S1,1, S1,2, Sn,m; samples are merged into S*,*, S1-2,3-7, etc.]
