
No CLT – No Problem? Enter the Bootstrap!

John McGready, Department of Biostatistics, Johns Hopkins University, http://www.biostat.jhsph.edu/~jmcgread


Presentation Transcript


  1. No CLT – No Problem? Enter the Bootstrap! John McGready, Department of Biostatistics, Johns Hopkins University, http://www.biostat.jhsph.edu/~jmcgread

  2. Slide #2

  3. Goals of Inferential Statistics • Much of what we do in statistics involves trying to talk about true characteristics of a process, using an imperfect subset of information from the process • Population information (what we WANT) • Sample information (what we have)

  4. Medical Expenditures • Suppose we want to study the FY 2005 medical expenditures for the 13,000+ employees of a particular company • However, the benefits administrator will only give us one random sample of 200 employees

  5. Medical Expenditures • Population (true): median = 0.59, mean = 2.3, sd = 5.0 • Sample (n = 200): median = 0.57, mean = 2.0, sd = 4.3 (plot annotation: mean = 1.9, sd = 4.0)

  6. Medical Expenditures • Given the right skew, our first choice for estimating the center of the distribution is to work with the median • We can only estimate the true median using the sample median from our 200 observations

  7. Medical Expenditures • We are interested in how good a guess the sample median is of the true median • We would also like to estimate a range of possibilities for the true median (i.e., a confidence interval)

  8. Medical Expenditures • In order to understand how a sample median from 200 observations relates to the true median, let’s call our administrator and see if we can get 1,000 more random samples of size 200 • This way, we can compute 1,000 more sample medians and see how variable they are
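
The idea on this slide can be sketched in code if we pretend, for a moment, that the whole population is available. The sketch below is only an illustration: the population is synthetic (a lognormal with parameters chosen by assumption to be right-skewed, roughly like the expenditures described), and the names `population` and `sample_medians` are hypothetical.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# ASSUMPTION: a synthetic right-skewed "population" of 13,000 expenditures
# (in thousand $). These are not the company's real data.
population = [random.lognormvariate(-0.5, 1.5) for _ in range(13_000)]
true_median = statistics.median(population)

# Draw 1,000 random samples of 200 employees and record each sample median.
sample_medians = [
    statistics.median(random.sample(population, 200)) for _ in range(1_000)
]

print(f"true median:            {true_median:.2f}")
print(f"mean of sample medians: {statistics.mean(sample_medians):.2f}")
print(f"sd of sample medians:   {statistics.stdev(sample_medians):.2f}")
```

The standard deviation of the 1,000 sample medians is a direct look at how variable a median from 200 observations is, which is exactly what we cannot observe with a single sample.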

  9. Making the Call

  10. The Response • No Way!

  11. What to Do Now?? • Well, it seems we are out of luck • Let’s just estimate the mean instead, and use the Central Limit Theorem to estimate a range of possible values for the true mean

  12. Review: Sampling Behavior via the CLT • Standard error (spread) = σ/√n, where σ is the true standard deviation and n is the sample size

  13. Sampling Behavior via the CLT • Most (95%) of the sample means we could get from samples of 200 would fall between the 2.5th and 97.5th percentiles of this distribution • These percentiles correspond to the true mean +/- 1.96 standard errors
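
As a small worked example of the +/- 1.96 standard error range, the sketch below plugs in the true mean (2.3) and true sd (5.0) quoted on the earlier slide with n = 200. The variable names are illustrative only.

```python
import math

true_mean = 2.3  # thousand $, from the slides
true_sd = 5.0    # thousand $, from the slides
n = 200          # sample size

# Standard error of the sample mean under the CLT: sd / sqrt(n)
standard_error = true_sd / math.sqrt(n)

# Range that should contain about 95% of sample means from samples of 200
lower = true_mean - 1.96 * standard_error
upper = true_mean + 1.96 * standard_error

print(f"standard error: {standard_error:.3f}")
print(f"95% of sample means fall in roughly ({lower:.2f}, {upper:.2f})")
```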

  14. Sampling Behavior via the CLT

  15. Sampling Behavior via the CLT • Rub #1 • If we knew the true mean, we wouldn’t care about possible mean values • However, taking this one step further implies that 95% of the sample means we could get will fall within a known range of the truth

  16. Sampling Behavior via the CLT

  17. Sampling Behavior via the CLT

  18. Sampling Behavior via the CLT • Rub #2 • If we only have one sample, we don’t know the true sampling distribution • However, the CLT says it will be normal • We estimate the spread from our sample data, and center the distribution at our sample mean

  19. Sampling Behavior via the CLT • Our sample info • Sample mean: 2.0 (thousand $) • Sample standard deviation: 4.3 (thousand $) • Sample estimate of standard error (spread of sampling distribution): 4.3/√200 ≈ 0.30 (thousand $)

  20. Sampling Behavior via the CLT

  21. Sampling Behavior via the CLT

  22. Sampling Behavior via the CLT • True 95% CI • Sample mean +/- 1.96*(true standard error) • (1.3, 2.7) • Estimated 95% CI • Sample mean +/- 1.97*(estimated standard error) • (1.4, 2.6)

  23. Another Approach to Estimating the Sampling Distribution • Instead of relying on the CLT, how about we simulate the sampling distribution using just our sample of 200? • Treat our sample as “truth” • Resample multiple times (say 1,000), each time taking 200 random draws with replacement

  24. Resampling With Replacement Original sample (n=4): Potential resample of same size: S1 S1 S2 S2 S3 S3 S4
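
A minimal sketch of one resample with replacement, using the four labels from this slide. The seed and variable names are arbitrary choices for the illustration.

```python
import random

rng = random.Random(1)  # fixed seed so the sketch is reproducible

original = ["S1", "S2", "S3", "S4"]

# rng.choices draws WITH replacement, so any unit can appear more than once
# and some units may not appear at all.
resample = rng.choices(original, k=len(original))

print("original:", original)
print("resample:", resample)
```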

  25. Re-Sampling

  26. Bootstrap Estimate of Sampling Distribution • Take 1,000 resamples • Compute the mean of each re-sample • Plot a distribution of the means
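
The three steps above could look roughly like the following. The sample here is synthetic (an assumed lognormal, since the real expenditure data are not part of this transcript), so the printed numbers will not match the slides.

```python
import math
import random
import statistics

rng = random.Random(2)  # fixed seed so the sketch is reproducible

# ASSUMPTION: synthetic right-skewed sample of 200 values (thousand $).
sample = [rng.lognormvariate(-0.5, 1.5) for _ in range(200)]

# Take 1,000 resamples with replacement and compute the mean of each one.
boot_means = []
for _ in range(1_000):
    resample = rng.choices(sample, k=len(sample))
    boot_means.append(statistics.fmean(resample))

# Compare the bootstrap distribution with the CLT-style estimate s / sqrt(n).
clt_se = statistics.stdev(sample) / math.sqrt(len(sample))
print(f"sample mean:            {statistics.fmean(sample):.3f}")
print(f"mean of bootstrap means: {statistics.fmean(boot_means):.3f}")
print(f"sd of bootstrap means:   {statistics.stdev(boot_means):.3f}")
print(f"CLT standard error:      {clt_se:.3f}")
```

Plotting a histogram of `boot_means` would give the bootstrap estimate of the sampling distribution; the sketch prints summary numbers instead to stay dependency-free.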

  27. Bootstrap Estimate of Sampling Distribution

  28. Bootstrap Estimate of Sampling Distribution

  29. Bootstrap 95% CIs • How to get a 95% CI from the bootstrap distribution • Assume normality, but estimate the standard error from the bootstrap distribution (normal bootstrap method) • Pick off the 2.5th and 97.5th percentiles (bootstrap percentile method) • Pick off “adjusted” percentiles (bias-corrected and accelerated, BCa, method)
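
A minimal sketch of the first two methods (normal and percentile), assuming a synthetic sample and hypothetical helper names. The BCa method is deliberately left out because its percentile adjustment needs more machinery than fits in a short example.

```python
import random
import statistics

rng = random.Random(3)  # fixed seed so the sketch is reproducible

# ASSUMPTION: synthetic right-skewed sample of 200 values (thousand $).
sample = [rng.lognormvariate(-0.5, 1.5) for _ in range(200)]

def bootstrap_distribution(data, stat, n_resamples=1_000):
    """Evaluate `stat` on resamples of `data` drawn with replacement."""
    return [stat(rng.choices(data, k=len(data))) for _ in range(n_resamples)]

def normal_ci(data, boot_stats, stat, z=1.96):
    """Normal method: sample statistic +/- z * bootstrap standard error."""
    center = stat(data)
    boot_se = statistics.stdev(boot_stats)
    return center - z * boot_se, center + z * boot_se

def percentile_ci(boot_stats):
    """Percentile method: 2.5th and 97.5th percentiles of the bootstrap stats."""
    cuts = statistics.quantiles(boot_stats, n=40)  # cut points every 2.5%
    return cuts[0], cuts[-1]

boot_means = bootstrap_distribution(sample, statistics.fmean)
normal_interval = normal_ci(sample, boot_means, statistics.fmean)
percentile_interval = percentile_ci(boot_means)

print("bootstrap normal:     (%.2f, %.2f)" % normal_interval)
print("bootstrap percentile: (%.2f, %.2f)" % percentile_interval)
```

Because `stat` is passed in as a function, the same helpers can be reused for other statistics, such as the median.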

  30. 95% CIs • True Mean 2.3
      Method                 95% CI
      CLT Estimate           1.40 - 2.60
      Bootstrap Normal       1.39 - 2.60
      Bootstrap Percentile   1.41 - 2.58
      BCa                    1.47 - 2.68

  31. We Could Do with 10,000 Resamples

  32. Bootstrap 95% CIs: Mean • Empirical Coverage Probabilities [1]
      Method                 1K resamps   10K resamps
      CLT Estimate [2]       93.4% (single value; no resampling involved)
      Bootstrap Normal [2]   93.2%        92.5%
      Bootstrap Percentile   92.4%        91.6%
      BCa                    92.3%        93.4%
      [1] To be thorough, should also look at average width
      [2] Some intervals could contain illegal (negative) values
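
An empirical coverage probability is the fraction of repeated samples whose interval contains the true value. The sketch below shows one way to estimate it for the percentile interval of the mean. Everything about the setup is an assumption made for illustration: the population is a lognormal with made-up parameters, and the trial and resample counts are kept small so it runs in a few seconds, so the result will not reproduce the table.

```python
import math
import random
import statistics

rng = random.Random(4)  # fixed seed so the sketch is reproducible

MU, SIGMA = -0.5, 1.5                      # ASSUMED lognormal parameters
TRUE_MEAN = math.exp(MU + SIGMA**2 / 2)    # exact mean of that lognormal

N_TRIALS = 200      # number of independent samples to test
N_RESAMPLES = 200   # bootstrap resamples per sample (small, for speed)
SAMPLE_SIZE = 200

def percentile_ci_for_mean(data):
    """95% bootstrap percentile interval for the mean of `data`."""
    boot_means = [
        statistics.fmean(rng.choices(data, k=len(data)))
        for _ in range(N_RESAMPLES)
    ]
    cuts = statistics.quantiles(boot_means, n=40)  # cut points every 2.5%
    return cuts[0], cuts[-1]

hits = 0
for _ in range(N_TRIALS):
    sample = [rng.lognormvariate(MU, SIGMA) for _ in range(SAMPLE_SIZE)]
    lower, upper = percentile_ci_for_mean(sample)
    if lower <= TRUE_MEAN <= upper:
        hits += 1

coverage = hits / N_TRIALS
print(f"empirical coverage of nominal 95% interval: {coverage:.1%}")
```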

  33. What’s The Big Deal? • Why not just use the CLT? • For many statistics, we do not have a CLT-based approach (or not a good one) • Median • Ratio of mean to sd • Correlation coefficients

  34. Getting a 95% CI for A Median

  35. 95% CIs For Median • True Median 0.59
      Method                 95% CI (1,000 Reps)
      CLT Estimate           NA
      Bootstrap Normal       0.44 - 0.71
      Bootstrap Percentile   0.39 - 0.68
      BCa                    0.39 - 0.68
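
For the median, the only change to the bootstrap recipe is the statistic computed on each resample. A minimal sketch with a synthetic sample (assumed lognormal, not the real expenditures) and 1,000 resamples, covering the normal and percentile methods only:

```python
import random
import statistics

rng = random.Random(5)  # fixed seed so the sketch is reproducible

# ASSUMPTION: synthetic right-skewed sample of 200 values (thousand $).
sample = [rng.lognormvariate(-0.5, 1.5) for _ in range(200)]
sample_median = statistics.median(sample)

# Bootstrap distribution of the median: 1,000 resamples with replacement.
boot_medians = [
    statistics.median(rng.choices(sample, k=len(sample)))
    for _ in range(1_000)
]

# Normal method: sample median +/- 1.96 * bootstrap standard error
boot_se = statistics.stdev(boot_medians)
normal_interval = (sample_median - 1.96 * boot_se,
                   sample_median + 1.96 * boot_se)

# Percentile method: 2.5th and 97.5th percentiles of the bootstrap medians
cuts = statistics.quantiles(boot_medians, n=40)
percentile_interval = (cuts[0], cuts[-1])

print(f"sample median:        {sample_median:.2f}")
print("bootstrap normal:     (%.2f, %.2f)" % normal_interval)
print("bootstrap percentile: (%.2f, %.2f)" % percentile_interval)
```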

  36. Bootstrap 95% CIs: Median • Empirical Coverage Probabilities [1]
      Method                 1K resamps   10K resamps
      Bootstrap Normal [2]   94.1%        94.4%
      Bootstrap Percentile   93.9%        95.0%
      BCa                    94.0%        95.2%
      [1] To be thorough, should also look at average width
      [2] Some intervals could contain illegal (negative) values

  37. Wrap Up • Pros/Cons of bootstrap • Theoretical Justification
