Bootstrapping for Confidence Intervals: A Statistical Approach
In this presentation, John McGready from the Johns Hopkins University Department of Biostatistics explores inferential statistics and the Central Limit Theorem (CLT) in the context of estimating medical expenditures for employees. We focus on the importance of sample information, examining how to estimate the true median and mean from limited data. By utilizing the bootstrap method, resampling techniques, and confidence intervals, we can enhance our estimates and understand variability in sampling distributions, ultimately improving our inferential statistics methodology.
Presentation Transcript
No CLT – No Problem? Enter the Bootstrap! John McGready, Department of Biostatistics, Johns Hopkins University http://www.biostat.jhsph.edu/~jmcgread
Goals of Inferential Statistics • Much of what we do in statistics involves trying to talk about true characteristics of a process, using an imperfect subset of information from the process Population Information (what we WANT) Sample Information (what we have)
Medical Expenditures • Suppose we want to study the FY 2005 medical expenditures for the 13,000+ employees of a particular company • However, the benefits administrator will only give us one random sample of 200 employees
Medical Expenditures (thousand $) • Population (true): median = 0.59, mean = 2.3, sd = 5.0 • Sample (n = 200): median = 0.57, mean = 2.0, sd = 4.3 (the sample plot is also labeled mean = 1.9, sd = 4.0)
Medical Expenditures • Given the right skew, our first choice for estimating the center of the distribution is to work with the median • We can only estimate the true median using the sample median from our 200 observations
Medical Expenditures • We are interested in how “good a guess” the sample median is of the true median • We would also like to estimate a range of possibilities for the true median (i.e., a confidence interval)
Medical Expenditures • In order to understand how a sample median from 200 observations relates to the true median, let’s call our administrator and see if we can get 1,000 more random samples of size 200 • This way, we can compute 1,000 more sample medians and see how variable they are
The Response No Way!
What to Do Now?? • Well, it seems we are out of luck • Let’s just estimate the mean instead, and use the Central Limit Theorem to estimate a range of possible values for the true mean
Review: Sampling Behavior via the CLT Standard error (spread) = $\sigma / \sqrt{n}$ (true sd divided by the square root of the sample size)
Sampling Behavior via the CLT • Most (95%) of the sample means we could get from samples of 200 would fall between the 2.5th and 97.5th percentiles of this distribution • These percentiles correspond to true mean +/- 1.96 standard errors
Sampling Behavior via the CLT • Rub #1 • If we knew the true mean, we wouldn’t care about possible mean values • However, taking this one step further implies that 95% of the sample means we could get will fall within a known range of the truth
Sampling Behavior via the CLT • Rub #2 • If we only have one sample, we don’t know the true sampling distribution • However, the CLT says it will be approximately normal • We estimate the spread from our sample data, and center it at our sample mean
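The "one step further" in Rub #1 can be written out as a short derivation. This is a sketch in standard notation that the slides do not use themselves: $\mu$ is the true mean, $\bar{x}$ a sample mean, and $\mathrm{SE}$ the standard error of the sample mean.

```latex
\Pr\left(\mu - 1.96\,\mathrm{SE} \le \bar{x} \le \mu + 1.96\,\mathrm{SE}\right) \approx 0.95
\quad\Longleftrightarrow\quad
\Pr\left(\bar{x} - 1.96\,\mathrm{SE} \le \mu \le \bar{x} + 1.96\,\mathrm{SE}\right) \approx 0.95
```

Both statements describe the same event, $|\bar{x} - \mu| \le 1.96\,\mathrm{SE}$, so an interval centered at the sample mean captures the true mean for about 95% of the samples we could draw.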
Sampling Behavior via the CLT • Our Sample info • Sample mean: 2.0 (thousand $) • Sample standard deviation: 4.3 (thousand $) • Sample estimate of standard error (spread of sampling distribution): 4.3 / sqrt(200) ≈ 0.30 (thousand $)
Sampling Behavior via the CLT • True 95% CI • Sample mean +/- 1.96*(true standard error) • (1.3,2.7) • Estimated 95% CI • Sample mean +/- 1.97*(estimated standard error) • (1.4, 2.6)
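As a check on the arithmetic, the two intervals can be reproduced from the numbers on the slides (sample mean 2.0, true sd 5.0, sample sd 4.3, n = 200, all in thousands of dollars). The helper name `clt_ci` is made up for this sketch.

```python
import math

def clt_ci(mean, sd, n, crit=1.96):
    """Return (lower, upper) for mean +/- crit * sd / sqrt(n)."""
    se = sd / math.sqrt(n)  # standard error of the sample mean
    return mean - crit * se, mean + crit * se

# "True" interval: sample mean with the true standard error (true sd = 5.0).
true_ci = clt_ci(mean=2.0, sd=5.0, n=200, crit=1.96)
# Estimated interval: sample mean with the estimated standard error (sample sd = 4.3).
est_ci = clt_ci(mean=2.0, sd=4.3, n=200, crit=1.97)

print([round(x, 1) for x in true_ci])  # [1.3, 2.7]
print([round(x, 1) for x in est_ci])   # [1.4, 2.6]
```

The multiplier 1.97 on the slide is close to the t critical value for 199 degrees of freedom, which is presumably why it differs slightly from 1.96.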
Another Approach to Estimating Sampling Distribution • Instead of relying on CLT, how about we simulate sampling distribution using just our sample of 200? • Treat our sample as “truth” • Resample multiple times (say 1000) taking random draws of 200 with replacement
Resampling With Replacement Original sample (n=4): Potential resample of same size: S1 S1 S2 S2 S3 S3 S4
Bootstrap Estimate of Sampling Distribution • Take 1,000 resamples • Compute the mean of each re-sample • Plot a distribution of the means
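The resampling steps can be sketched in a few lines of Python. The expenditure data are not available here, so the sample below is a made-up right-skewed (lognormal) stand-in, and `bootstrap_distribution` is a hypothetical helper name, not something from the presentation.

```python
import random
import statistics

def bootstrap_distribution(sample, stat, n_resamples=1000, seed=0):
    """Apply `stat` to `n_resamples` resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(sample)
    # random.choices draws with replacement, so a value can appear more than once.
    return [stat(rng.choices(sample, k=n)) for _ in range(n_resamples)]

# ASSUMPTION: simulated stand-in for the 200 sampled expenditures.
data_rng = random.Random(1)
sample = [data_rng.lognormvariate(-0.5, 1.2) for _ in range(200)]

boot_means = bootstrap_distribution(sample, statistics.mean)
print(len(boot_means), statistics.mean(boot_means), statistics.stdev(boot_means))
```

The standard deviation of `boot_means` is the bootstrap estimate of the standard error, and a histogram of `boot_means` is the bootstrap estimate of the sampling distribution.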
Bootstrap 95% CIs • How to get a 95% CI from the bootstrap distribution • Assume normality, but estimate the standard error from the bootstrap distribution (normal bootstrap method) • Pick off the 2.5th and 97.5th percentiles (bootstrap percentile method) • Pick off “adjusted” percentiles (bias-corrected accelerated, BCa, method)
95% CIs • True Mean 2.3
Method                 95% CI
CLT Estimate           1.40 - 2.60
Bootstrap Normal       1.39 - 2.60
Bootstrap Percentile   1.41 - 2.58
BCa                    1.47 - 2.68
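A minimal sketch of the three interval methods, using only the Python standard library. The data are a simulated stand-in (assumption), the function names are hypothetical, and the BCa part follows the usual textbook recipe (bias correction from the share of bootstrap values below the estimate, acceleration from a jackknife); the presentation does not say which implementation it used.

```python
import random
import statistics
from statistics import NormalDist

def percentile(sorted_vals, p):
    """Linear-interpolation percentile of pre-sorted values, p in [0, 1]."""
    pos = p * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])

def bootstrap_cis(sample, stat, n_resamples=1000, level=0.95, seed=0):
    rng = random.Random(seed)
    n = len(sample)
    theta_hat = stat(sample)
    boot = sorted(stat(rng.choices(sample, k=n)) for _ in range(n_resamples))
    alpha = (1 - level) / 2
    z = NormalDist().inv_cdf(1 - alpha)  # about 1.96 for a 95% interval

    # Normal method: normal shape, standard error from the bootstrap distribution.
    se = statistics.stdev(boot)
    normal = (theta_hat - z * se, theta_hat + z * se)

    # Percentile method: read the interval straight off the bootstrap distribution.
    pct = (percentile(boot, alpha), percentile(boot, 1 - alpha))

    # BCa: shift the percentiles using a bias correction z0 and an acceleration a.
    below = sum(b < theta_hat for b in boot) / n_resamples
    below = min(max(below, 1 / n_resamples), 1 - 1 / n_resamples)  # keep inv_cdf finite
    z0 = NormalDist().inv_cdf(below)
    jack = [stat(sample[:i] + sample[i + 1:]) for i in range(n)]  # leave-one-out values
    jack_mean = statistics.mean(jack)
    num = sum((jack_mean - j) ** 3 for j in jack)
    den = 6 * sum((jack_mean - j) ** 2 for j in jack) ** 1.5
    a = num / den if den else 0.0

    def adjusted(zq):
        return NormalDist().cdf(z0 + (z0 + zq) / (1 - a * (z0 + zq)))

    bca = (percentile(boot, adjusted(-z)), percentile(boot, adjusted(z)))
    return {"normal": normal, "percentile": pct, "bca": bca}

# ASSUMPTION: simulated right-skewed stand-in for the 200 sampled expenditures.
data_rng = random.Random(1)
sample = [data_rng.lognormvariate(-0.5, 1.2) for _ in range(200)]
results = bootstrap_cis(sample, statistics.mean)
for method, (lo, hi) in results.items():
    print(f"{method:10s} {lo:.2f} - {hi:.2f}")
```

For right-skewed data the BCa interval typically sits a little to the right of the other two, which matches the pattern in the slide's table of intervals for the mean.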
Bootstrap 95% CIs : Mean • Empirical Coverage Probabilities [1]
Method                  1K resamps   10K resamps
CLT Estimate [2]        93.4%
Bootstrap Normal [2]    93.2%        92.5%
Bootstrap Percentile    92.4%        91.6%
BCa                     92.3%        93.4%
[1] To be thorough, should also look at average width
[2] Some intervals could contain illegal (negative) values
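Empirical coverage can be estimated by simulation when the whole population is known: draw many samples, build an interval from each, and count how often the interval contains the true value. The sketch below does this for the percentile method only, on a made-up lognormal population of 13,000 values (assumption), with far fewer samples and resamples than the slide so that it runs in about a second. Its output is not expected to match the table.

```python
import random

def percentile_ci_mean(sample, n_resamples, rng, level=0.95):
    """Percentile bootstrap interval for the mean (nearest-rank percentiles)."""
    n = len(sample)
    means = sorted(sum(rng.choices(sample, k=n)) / n for _ in range(n_resamples))
    alpha = (1 - level) / 2
    lo = means[round(alpha * (n_resamples - 1))]
    hi = means[round((1 - alpha) * (n_resamples - 1))]
    return lo, hi

def empirical_coverage(population, n=200, n_samples=200, n_resamples=200, seed=0):
    """Share of percentile intervals that contain the true population mean."""
    rng = random.Random(seed)
    true_mean = sum(population) / len(population)
    hits = 0
    for _ in range(n_samples):
        sample = rng.sample(population, n)  # one random sample, without replacement
        lo, hi = percentile_ci_mean(sample, n_resamples, rng)
        hits += lo <= true_mean <= hi
    return hits / n_samples

# ASSUMPTION: simulated right-skewed stand-in for the 13,000+ employee expenditures.
pop_rng = random.Random(42)
population = [pop_rng.lognormvariate(-0.5, 1.2) for _ in range(13000)]

coverage = empirical_coverage(population)
print(f"Empirical coverage of nominal 95% intervals: {coverage:.1%}")
```

With only 200 simulated samples the coverage estimate itself carries a margin of error of a few percentage points, so small differences between methods need many more repetitions to resolve.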
What’s The Big Deal? • Why not just use CLT? • For many statistics, we do not have a CLT (or good CLT) based approach • Median • Ratio of mean to sd • Correlation coefficients
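Because the bootstrap only needs a function that turns a sample into a number, the same resampling code handles statistics with no convenient CLT-based formula. A minimal sketch for the median and the ratio of mean to sd, on simulated stand-in data (assumption) and with hypothetical helper names:

```python
import random
import statistics

def percentile_ci(sample, stat, n_resamples=1000, level=0.95, seed=0):
    """Percentile bootstrap interval for any statistic `stat`."""
    rng = random.Random(seed)
    n = len(sample)
    boot = sorted(stat(rng.choices(sample, k=n)) for _ in range(n_resamples))
    alpha = (1 - level) / 2
    # Nearest-rank approximation to the 2.5th and 97.5th percentiles.
    return (boot[round(alpha * (n_resamples - 1))],
            boot[round((1 - alpha) * (n_resamples - 1))])

def mean_to_sd(values):
    """Ratio of the sample mean to the sample standard deviation."""
    return statistics.mean(values) / statistics.stdev(values)

# ASSUMPTION: simulated right-skewed stand-in for the 200 sampled expenditures.
data_rng = random.Random(1)
sample = [data_rng.lognormvariate(-0.5, 1.2) for _ in range(200)]

median_ci = percentile_ci(sample, statistics.median)
ratio_ci = percentile_ci(sample, mean_to_sd)
print("median    :", median_ci)
print("mean / sd :", ratio_ci)
```

Only the `stat` argument changes between the two calls; the resampling itself is identical.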
95% CIs For Median • True Median 0.59
Method                 95% CI (1,000 Reps)
CLT Estimate           NA
Bootstrap Normal       0.44 - 0.71
Bootstrap Percentile   0.39 - 0.68
BCa                    0.39 - 0.68
Bootstrap 95% CIs : Median • Empirical Coverage Probabilities [1]
Method                  1K resamps   10K resamps
Bootstrap Normal [2]    94.1%        94.4%
Bootstrap Percentile    93.9%        95.0%
BCa                     94.0%        95.2%
[1] To be thorough, should also look at average width
[2] Some intervals could contain illegal (negative) values
Wrap Up • Pros/Cons of bootstrap • Theoretical Justification