Bootstrapping (And other statistical trickery)
Reminder Of What We Do In Statistics • Null Hypothesis Statistical Test Logic • Assume that the “no effect” case is true and then ask if our data is probable given that case. • If we fail to reject (loosely, “accept”) the null hypothesis: Our data isn’t improbable if the null hypothesis were true • If we reject: Our data is improbable if the null hypothesis were true
Hypothesis Tests • The Null Hypothesis: • This is the hypothesis that we are looking to disprove • Usually, that there is “No Difference” • i.e. My sample is the same as the population (in the Z test) • In statistics the Null Hypothesis takes the form of the distribution of results that we would expect by chance [Figure: distribution of chance results, with regions labeled “More Likely Outcomes” and “Less Likely Outcomes”]
Hypothesis Tests • Remember, we have to reverse the logic we would normally use to think about these things. • We ask: if the null hypothesis were true, is my sample probable? [Figure: distribution of chance results, with regions labeled “More Likely Outcomes” and “Less Likely Outcomes”]
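As a minimal sketch of this logic, the snippet below runs a two-sided Z test using only Python's standard library. The population mean and standard deviation (100 and 15) and the sample values are made-up numbers for illustration, not data from these slides.

```python
from math import sqrt
from statistics import NormalDist, mean

def z_test_p_value(sample, pop_mean, pop_sd):
    """Two-sided Z test: if the null (no difference) were true, how probable
    is a sample mean at least this far from pop_mean?"""
    n = len(sample)
    standard_error = pop_sd / sqrt(n)
    z = (mean(sample) - pop_mean) / standard_error
    # Area in both tails of the null distribution beyond |z|
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

# Hypothetical data: 9 scores, null population assumed to be N(100, 15)
sample = [112, 105, 118, 99, 121, 108, 115, 110, 102]
z, p = z_test_p_value(sample, pop_mean=100, pop_sd=15)
print(f"z = {z:.2f}, p = {p:.4f}")
```

Here the sample mean is 110, so z = 2.00 and p is about 0.046: the sample would be improbable under the null at the conventional 0.05 level.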
To Make it Work • We have to make assumptions about the population from which we selected our data. • These usually take the form of parametric assumptions. • In a t-test: we assume that the null population is normal • In a multiple regression: we assume that the errors are normal • In Poisson regression: we assume that the dependent variable (DV) is Poisson
T Test (Independent Samples) • Usually, the formula looks like this:
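One common form of the independent-samples t statistic is sketched below, assuming equal population variances (the pooled version). The notation is chosen here for illustration: \bar{x}_1 and \bar{x}_2 are the group means, s_1^2 and s_2^2 the sample variances, and n_1 and n_2 the group sizes. Other variants (such as Welch's unequal-variance version) differ in the denominator.

```latex
t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}},
\qquad
s_p^2 = \frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2},
\qquad
df = n_1 + n_2 - 2
```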
The Problem • We are always having to make assumptions that we bend. • In multiple regression: errors are rarely exactly normal • In Poisson regression: the mean rarely equals the variance • Many statistical procedures assume Multivariate Normality • In path analysis: there are situations where even if the data were perfectly normal, the errors follow strange bimodal distributions
Example • Skewed Distributions violate ALL typical parametric assumptions
Early Solutions • The Monte Carlo Simulation: • Use the mean, variance and covariance of your data to define a truly normal distribution • Sample repeatedly from this idealized distribution • Run your analyses on each simulated dataset • Your CIs are the middle 95% of the resulting distribution of parameter estimates
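The steps above can be sketched roughly as follows for a single regression slope. This is a minimal illustration using only the standard library; the data values, function names, and the choice of a bivariate normal for two variables are assumptions made for the example.

```python
import random
from math import sqrt
from statistics import mean, variance

def sample_cov(xs, ys):
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def slope(xs, ys):
    return sample_cov(xs, ys) / variance(xs)

def monte_carlo_slope_ci(xs, ys, n_sims=1000, seed=0):
    """Simulate from a bivariate normal defined by the observed means,
    variances and covariance; return the middle 95% of simulated slopes."""
    rng = random.Random(seed)
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    sx, sy = sqrt(variance(xs)), sqrt(variance(ys))
    rho = sample_cov(xs, ys) / (sx * sy)
    resid = sqrt(max(0.0, 1 - rho ** 2))  # guard against rounding below 0
    slopes = []
    for _ in range(n_sims):
        sim_x, sim_y = [], []
        for _ in range(n):
            z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
            sim_x.append(mx + sx * z1)
            # Correlate y with x so the pair has correlation rho
            sim_y.append(my + sy * (rho * z1 + resid * z2))
        slopes.append(slope(sim_x, sim_y))
    slopes.sort()
    return slopes[int(0.025 * n_sims)], slopes[int(0.975 * n_sims) - 1]

# Hypothetical data for illustration only
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [2.3, 3.1, 6.8, 7.2, 9.5, 13.1, 13.0, 17.2, 17.5, 21.0]
lo, hi = monte_carlo_slope_ci(xs, ys)
print(f"observed slope = {slope(xs, ys):.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
```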
Nate Silver Example 1. Makes his best prediction of a candidate’s share of the vote (say 42%) 2. Attaches uncertainty to that guess (maybe he thinks this is ±5% with 95% confidence, which implies a standard error of roughly 2.5%)
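3. Creates this distribution of possible outcomes for this candidate [Figure: distribution centered at 42%, with 37% and 47% marked]
4. Does this for each candidate in the nation [Figure: one distribution per candidate, labeled 42%, 51%, 46%, 67%, 31%, 62%]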
6. Samples randomly from each of those distributions (each draw may represent a win or a loss for that candidate) and then determines who won the House or Senate and by how many seats • Does this 1000 times and ends up with a distribution of outcomes (who controls the chamber and by how many seats)
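A rough sketch of this kind of simulation is below. It is not Nate Silver's actual model: it assumes two-candidate races (a share above 50% is a win), independent races, and the same ±5% margin for every candidate. The six predicted shares reuse the percentages from the example above, and the function name is hypothetical.

```python
import random
from collections import Counter

def simulate_seats(predicted_shares, margin_95=5.0, n_sims=1000, seed=0):
    """Draw each candidate's vote share from a normal distribution and
    count how many seats are won in each simulated election."""
    rng = random.Random(seed)
    sd = margin_95 / 1.96  # convert a 95% margin of error to a standard error
    seat_counts = Counter()
    for _ in range(n_sims):
        seats = sum(1 for share in predicted_shares
                    if rng.gauss(share, sd) > 50.0)
        seat_counts[seats] += 1
    return seat_counts

# Predicted vote shares (%) for one party's candidates in six races
shares = [42, 51, 46, 67, 31, 62]
counts = simulate_seats(shares)
for seats in sorted(counts):
    print(f"{seats} seats: {counts[seats]} of 1000 simulations")
```

The resulting tally is the simulated distribution of seat totals; the share of simulations at or above a majority threshold gives an estimated probability of winning the chamber.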
Problems • This method assumes that the underlying data really are multivariate normal, and that the obtained data are just a messy approximation of this. • This only helps in situations where the standard errors do not follow a known distribution (but the data, in theory, do)
The Jackknife • This is a good solution if your sample has outliers that are having undue influence on your estimates. • Recalculate the estimate after leaving out 1 (or more) cases from the dataset (in the classic version, each case in turn). • Repeat many times • New parameter estimate is the mean of all obtained parameters (usually B’s) • Std. error is based on the standard deviation (not the variance) of the distribution of B’s, rescaled to account for how much the leave-out samples overlap
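A minimal sketch of the leave-one-out jackknife follows. It drops each case once in turn rather than dropping random cases, and it uses the sample mean instead of a regression B to keep the example short; the data values and function name are made up for illustration.

```python
from math import sqrt
from statistics import mean, stdev

def jackknife(data, statistic):
    """Leave-one-out jackknife: returns (estimate, standard error)."""
    n = len(data)
    # Recompute the statistic n times, each time omitting one case
    leave_one_out = [statistic(data[:i] + data[i + 1:]) for i in range(n)]
    estimate = mean(leave_one_out)
    # Standard jackknife SE: the (n - 1) / n factor inflates the spread,
    # because the leave-one-out samples overlap almost completely
    squared_devs = sum((t - estimate) ** 2 for t in leave_one_out)
    se = sqrt((n - 1) / n * squared_devs)
    return estimate, se

# Hypothetical sample with one outlier (30)
data = [2, 4, 4, 5, 7, 9, 30]
est, se = jackknife(data, mean)
print(f"jackknife estimate = {est:.3f}, SE = {se:.3f}")
```

For the sample mean, this standard error works out to the familiar s/√n, which is a handy sanity check on the formula.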
Bootstrap • This is generally agreed to be the best solution among the resampling methods • Idea is incredibly simple (usually far easier than actually computing standard errors) • Computationally intensive (by 1980s standards). With modern-day computing power you barely notice the added time.
Bootstrap Procedure 1. Sample cases from your dataset randomly with replacement to obtain a new sample (with duplicates) that matches the N of your original 2. Calculate parameter estimates (don’t worry about standard errors) 3. Repeat steps 1 and 2 200-1000 times • Every parameter will now have 200-1000 estimates • Mean of these estimates is your main parameter estimate • Middle 95% of these estimates is your 95% CI for the parameter
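The procedure can be sketched in a few lines of standard-library Python. The skewed data values and the function name are hypothetical, and the median is used as the parameter only as an example; any statistic that takes a list would work.

```python
import random
from statistics import mean, median

def bootstrap(data, statistic, n_resamples=1000, seed=0):
    """Percentile bootstrap: returns (point estimate, (lower, upper))."""
    rng = random.Random(seed)
    n = len(data)
    # Steps 1-3: resample with replacement at the original N, recompute
    # the statistic, and repeat n_resamples times
    estimates = sorted(
        statistic(rng.choices(data, k=n)) for _ in range(n_resamples)
    )
    # Middle 95% of the bootstrap estimates
    lower = estimates[int(0.025 * n_resamples)]
    upper = estimates[int(0.975 * n_resamples) - 1]
    return mean(estimates), (lower, upper)

# Hypothetical right-skewed sample
data = [1.2, 1.5, 1.7, 1.9, 2.0, 2.3, 2.4, 2.8, 3.5, 4.9, 7.6, 12.0]
point, (lower, upper) = bootstrap(data, median)
print(f"median = {point:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

Because the interval comes straight from the percentiles of the resampled estimates, it does not have to be symmetric around the point estimate.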
Advantages • Allows for non-symmetric, non-parametric distributions for both variables and parameters • You don’t even need to know what your distribution is
Disadvantages • You are assuming that your sample accurately reflects the distribution of the population that you have drawn it from. • This will be the case on average, but individual samples can deviate significantly from the population distribution • Be careful using this with small samples (my guideline is N less than 50)