
Data Mining CSCI 307, Spring 2019 Lecture 30

Learn about sampling techniques and the bootstrap method in data mining, and how they can be used to efficiently analyze large datasets. Understand the different types of sampling and how to estimate error using the bootstrap method.





Presentation Transcript


1. Data Mining, CSCI 307, Spring 2019, Lecture 30: Sampling and the Bootstrap

2. BACKGROUND: Sampling
• What is sampling? Obtaining a small sample s to represent the whole data set N.
• Sampling allows a mining algorithm to run in time that is potentially sub-linear in the size of the data.
• Key principle: choose a representative subset of the data.
• Simple random sampling may perform very poorly in the presence of skew.
• Remedy: adaptive sampling methods, e.g., stratified sampling.

3. Measuring the Central Tendency
Mean (algebraic measure); note that n is the sample size and N is the population size:
• Sample mean: x̄ = (1/n) Σ xᵢ; population mean: μ = (Σ xᵢ)/N.
• Weighted arithmetic mean: x̄ = (Σ wᵢxᵢ)/(Σ wᵢ).
• Also: trimmed mean (chop off extreme values first and then take the mean).
Median
• Middle value if there is an odd number of values; average of the middle two values otherwise.
• Estimated by interpolation (for grouped data): median ≈ L₁ + ((n/2 − Σ freq_below)/freq_median) × width, where L₁ is the lower boundary of the interval containing the median.
Mode
• Value that occurs most frequently in the data.
• Unimodal (one mode), bimodal (two modes), trimodal (three modes); two or more: multimodal.
• Empirical approximation (for unimodal data): mean − mode ≈ 3 × (mean − median). A short sketch of these measures follows.
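A minimal Python sketch (not from the slides) of the measures above; the sample data, weights, and trim fraction are made up for illustration.

    from statistics import mean, median, multimode

    data = [4, 7, 7, 9, 12, 15, 15, 15, 21, 58]   # hypothetical sample
    n = len(data)

    simple_mean = sum(data) / n                    # arithmetic mean

    weights = [1] * (n - 1) + [0.1]                # down-weight the outlier 58
    weighted_mean = sum(w * x for w, x in zip(weights, data)) / sum(weights)

    trim = int(0.1 * n)                            # trimmed mean: drop extremes first
    trimmed_mean = mean(sorted(data)[trim:n - trim])

    med = median(data)                             # middle value, or mean of middle two
    modes = multimode(data)                        # most frequent value(s); [15] here

    # Rough empirical estimate for unimodal data: mean - mode ≈ 3 * (mean - median)
    mode_estimate = simple_mean - 3 * (simple_mean - med)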

4. Symmetric versus Skewed Data
[Figure: median, mean, and mode of symmetric, positively skewed, and negatively skewed data]

5. Types of Sampling
• Simple random sampling: an equal probability of selecting any particular item.
• Sampling without replacement: once an object is selected, it is removed from the population.
• Sampling with replacement: a selected object is not removed from the population, so it may be drawn again.
• Stratified sampling: partition the data set, then draw samples from each partition (proportionally, i.e., approximately the same percentage of the data). Used in conjunction with skewed data. A sketch of all three schemes follows.
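A minimal Python sketch (not from the slides) of the three sampling schemes; the population and the stratum key are hypothetical.

    import random

    population = list(range(100))

    # SRSWOR: each item can be drawn at most once.
    srswor = random.sample(population, k=10)

    # SRSWR: an item may be drawn repeatedly.
    srswr = random.choices(population, k=10)

    # Stratified sampling: partition the data, then draw approximately the
    # same percentage (here 10%) from each stratum.
    def stratified_sample(data, key, fraction):
        strata = {}
        for item in data:
            strata.setdefault(key(item), []).append(item)
        sample = []
        for items in strata.values():
            k = max(1, round(fraction * len(items)))
            sample.extend(random.sample(items, k))
        return sample

    stratified = stratified_sample(population, key=lambda x: x % 3, fraction=0.1)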

6. Sampling With or Without Replacement
[Figure: raw data sampled as an SRSWOR (simple random sample without replacement) and as an SRSWR (simple random sample with replacement)]

7. Sampling: Cluster or Stratified Sampling
[Figure: raw data partitioned into clusters/strata, with a cluster/stratified sample drawn from each partition]

8. The Bootstrap
Cross-validation (CV) uses sampling without replacement:
• The same instance, once selected, cannot be selected again for a particular training/test set.
The bootstrap uses sampling with replacement to form the training set:
• Sample a dataset of n instances n times with replacement to form a new dataset of n instances.
• Use this new dataset as the training set.
• Use the instances from the original dataset that do not occur in the new training set for testing (see the sketch below).
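A minimal Python sketch (not from the slides) of forming one bootstrap training set and its leftover ("out-of-bag") test set; the instance list is a placeholder.

    import random

    def bootstrap_split(instances):
        n = len(instances)
        # Draw n indices WITH replacement to form the training set.
        picked = [random.randrange(n) for _ in range(n)]
        train = [instances[i] for i in picked]
        # Instances never picked become the test set.
        chosen = set(picked)
        test = [instances[i] for i in range(n) if i not in chosen]
        return train, test

    train, test = bootstrap_split(list(range(1000)))
    print(len(test) / 1000)   # tends toward ~0.368 as n grows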

9. The 0.632 Bootstrap
On any single draw, a particular instance has a probability of 1 − 1/n of not being picked. Its probability of ending up in the test data, i.e., of never being picked in n draws, is:
(1 − 1/n)ⁿ ≈ e⁻¹ ≈ 0.368
This means the training data will contain approximately 63.2% (i.e., 1 − 0.368) of the distinct instances.
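A quick numeric check of the limit above, in Python: (1 − 1/n)ⁿ approaches e⁻¹ ≈ 0.368 as n grows, which is where the 63.2%/36.8% split comes from.

    import math

    for n in (10, 100, 1000, 10000):
        print(n, (1 - 1 / n) ** n)    # 0.349, 0.366, 0.3677, 0.36786, ...
    print("e^-1 =", math.exp(-1))     # 0.36788...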

10. Estimating Error with the Bootstrap
• The error estimate on the test data is pessimistic: the model is trained on only ~63% of the instances (unlike the 90% training size of tenfold CV).
• To compensate, combine it with the resubstitution error, so use:
err = 0.632 × e_test_instances + 0.368 × e_training_instances
• The resubstitution error gets less weight than the error on the test data.
• Repeat the process several times with different replacement samples and average the results (a sketch follows).
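A minimal Python sketch of the weighted combination above; the error pairs are placeholder values standing in for measured test and resubstitution error rates.

    def err_632(e_test, e_train):
        # 0.632 bootstrap estimate: weight the pessimistic test error
        # more heavily than the optimistic resubstitution error.
        return 0.632 * e_test + 0.368 * e_train

    # Average over several bootstrap repetitions (hypothetical values):
    pairs = [(0.24, 0.05), (0.27, 0.04), (0.22, 0.06)]
    err = sum(err_632(et, er) for et, er in pairs) / len(pairs)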

11. Bootstrap, continued
+++ Perhaps the best way of estimating performance for very small datasets.
--- Some problems:
• Consider the (artificial) random dataset from a few slides back, with a true error rate of 50%.
• A perfect memorizer (of the training set) achieves 0% resubstitution error, i.e., e_training_instances = 0, and ~50% error on the test data.
• So the bootstrap estimate for this classifier is err = 0.632 × 50% + 0.368 × 0% = 31.6%, which is misleadingly optimistic. A small simulation follows.
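A small Python simulation (not from the slides) of this failure case, modeling the memorizer's guesses on unseen instances as coin flips against the random labels:

    import random
    random.seed(0)

    n = 200
    labels = [random.choice([0, 1]) for _ in range(n)]   # random labels: true error 50%

    def bootstrap_estimate():
        picked = [random.randrange(n) for _ in range(n)]
        chosen = set(picked)
        oob = [i for i in range(n) if i not in chosen]    # out-of-bag test set
        e_train = 0.0   # memorizer is perfect on its own training set
        # On unseen instances it can only guess, so test error ≈ 50%.
        e_test = sum(random.choice([0, 1]) != labels[i] for i in oob) / len(oob)
        return 0.632 * e_test + 0.368 * e_train

    print(sum(bootstrap_estimate() for _ in range(30)) / 30)   # ≈ 0.316, not 0.5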

12. Summary: Bootstrap
• Works well with small data sets.
• Samples the given training instances uniformly with replacement, i.e., each time an instance is selected, it is equally likely to be selected again and re-added to the training set.
• There are several bootstrap methods; a common one is the .632 bootstrap:
• A data set with d instances is sampled d times, with replacement, resulting in a training set of d samples. The data instances that did not make it into the training set form the test set.
• Repeat the sampling procedure k times; the overall accuracy of the model is then
Acc(M) = (1/k) Σᵢ₌₁ᵏ (0.632 × Acc(Mᵢ)_test + 0.368 × Acc(Mᵢ)_train)
