Nonparametric Methods Featuring the Bootstrap Jon Atwood November 12, 2013
Laboratory for Interdisciplinary Statistical Analysis LISA helps VT researchers benefit from the use of Statistics: Experimental Design • Data Analysis • Interpreting Results • Grant Proposals • Software (R, SAS, JMP, SPSS...) Walk-In Consulting: Monday-Friday* 1-3PM, GLC (*Mon-Thurs during the summer) Collaboration: From our website, request a meeting for personalized statistical advice. Coauthor not required, but encouraged. Great advice right now: Meet with LISA before collecting your data. Short Courses: Designed to help graduate students apply statistics in their research. All services are FREE for VT researchers. We assist with research, not class projects or homework. www.lisa.stat.vt.edu
Short Course Goals By the end of this course, you will know… • The fundamental differences between nonparametric and parametric methods • In what general situations nonparametric methods would be advantageous • How to implement nonparametric alternatives to t-tests, ANOVAs, and simple linear regression • What nonparametric bootstrapping is and when you can use it effectively
What we will do today… • Give a brief overview of nonparametric statistical methods • Take a look at real-world data sets! (and some non-real-world data sets) • Implement the following methods in R… • Wilcoxon rank-sum and signed-rank tests (alternatives to t-tests) • Kruskal-Wallis (alternative to one-way ANOVA) • Spearman correlation (alternative to Pearson correlation) • Nonparametric Bootstrapping • ?Bonus Topic?
What does nonparametric mean? Well, first of all, what does parametric mean? • Parametric is a word that can have several meanings • In statistics, parametric usually means some specific probability distribution is assumed • Normal (regression, ANOVA, etc.) • Exponential (survival) • May also involve assumptions about parameters (variance)
And now a closer look… • Regression Analysis • We want to see the relationship between a set of predictor variables (x1, x2, …, xp) and a response variable • For example, suppose we wanted to see the relationship between weight and blood pressure • A usual (simple linear) regression model might look something like this: Yi = θ0 + θ1xi + εi [Figure: simple linear regression plot]
Error term: εi ~ N(0, σ²) • The error terms are assumed to come from a normal distribution with mean 0 and variance σ² • The usual methods for testing the significance of our θ estimates are based directly on this assumption of normality • If the assumption of normality is false, these methods are invalid
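As a quick aside (not from the original slides), here is a minimal R sketch of how one might check that normality assumption; the variable names and numbers are made up purely for illustration.

```r
# A minimal sketch (illustrative data): checking the normality
# assumption of regression residuals.
set.seed(1)
x <- rnorm(100, mean = 70, sd = 10)        # hypothetical predictor (e.g., weight)
y <- 50 + 0.8 * x + rexp(100, rate = 0.2)  # deliberately skewed (non-normal) errors

fit <- lm(y ~ x)

qqnorm(resid(fit))        # Q-Q plot of the residuals
qqline(resid(fit))        # reference line; strong departures suggest non-normality
shapiro.test(resid(fit))  # a formal test of normality of the residuals
```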
So onto nonparametric… • A statistical method is nonparametric if it does not require most of these assumptions • Data are not assumed to follow some underlying distribution • In some cases, it means that variables do not take predetermined forms • Nonparametric regression, for example
Assumptions nonparametric methods do make • Randomness • Independence • In multi-sample tests, distributions are of the same shape
Nonparametric Methods Advantages: • Free from distributional assumptions about the data • Easy to interpret • Usually computationally straightforward Disadvantages: • Loss of power when data do follow the usual assumptions • Reduces the data's information • Larger sample sizes needed due to lower efficiency
With that said… Rank Tests • Rank tests are a simple group of nonparametric tests • Instead of using the actual numerical value of an observation, we use its rank, i.e., its relative position among the other observations
As an example… • Here's some data (basically, just numbers picked out of the sky) • Any ideas on what we should call it? Wigs in a wig shop? • What about ties? Tied observations each receive the average of the ranks they would otherwise occupy: those ranks are added together and divided by the number of ties
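As a small illustration (not on the original slide), here is how R's rank() function handles ties with some made-up numbers:

```r
# Sketch of ranking with ties: tied values get the average of the
# ranks they would otherwise occupy.
x <- c(12, 15, 15, 20, 20, 20, 31)  # made-up data with ties
rank(x)
# 1.0 2.5 2.5 5.0 5.0 5.0 7.0
# e.g., the two 15s occupy ranks 2 and 3, so each gets (2 + 3) / 2 = 2.5
```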
Wilcoxon rank tests • These provide alternatives to the standard t-tests, which test mean differences between two groups • T-tests assume that the data is normally distributed • Can be highly sensitive to outliers (we’ll see that in an example soon), which may reduce power (ability to detect significant differences) • Wilcoxon tests alleviate these problems by using ranks, not actual values
First, the Wilcoxon Rank-Sum Test • Alternative to the independent sample t-test (recall that the independent sample t-test compares two independent samples, testing if the means are equal) • The t-test assumes normality of the data, and equal variances (though adjustments can be made for unequal variances) • The Wilcoxon rank-sum test assumes that the two samples follow continuous distributions, as well as the usual randomness and independence
Let’s try it out with some data! • The source of this data is…me, Jon Atwood • Data randomly generated from R • First group, 15 observations distributed normally, mean=30000, variance=2500. Extra observation added • Second group, 16 observations distributed normally, mean=40000, variance=2500
So what is this test…testing? • Remember, for the t-test, we are testing H0: μ1 = μ2 vs. Ha: μ1 ≠ μ2, where μ1 is the mean of group 1 and μ2 is the mean of group 2
In the Wilcoxon rank-sum test, we are testing H0: F(G1) = F(G2) vs. Ha: F(G1) = F(G2 − α) for some α ≠ 0, where F(G1) is the distribution of group 1, F(G2) is the distribution of group 2, and α is the "location shift" • What this really means is that under the alternative, F(G1) and F(G2) are basically the same shape, except that one is shifted to the right of the other by α
In our problem… • Group 2 has a rank sum 224 higher than group 1's • The question is, what is the probability of observing a combination of rank sums in the two groups where the difference is at least 224? • R computes exact p-values for small samples and uses a normal approximation for sample sizes greater than 50
So let’s do some R-Coding! • We will view graphs and results in the R program we run (this applies to all examples)
Moving to a paired situation… • Suppose we have two samples, but they are paired in some way • Suppose pairs of people are couples, or plants are paired off into different plots, dogs are paired by breeds, or whatever • Then, the rank-sum test will not be the optimal test. Instead, we use the signed rank test
Signed Rank Test • The null hypothesis is that the medians of the two distributions are equal • Procedure • Calculate the difference within each pair of observations • Take the absolute value of those differences • Convert those absolute differences to ranks (like before) • Attach a + or −, depending on whether the original difference is positive or negative • Add these signed ranks together to get the test statistic W
Example • This data comes from Laureysens et al. (2004), who measured aluminum levels in 13 poplar tree clones in both August and November • A table of the raw data, differences, and ranks (not reproduced here) gives W = 59
The p-value asks: how many rank assignments could create a W at least as extreme as the one observed? • R will give an exact p-value for sample sizes less than 50 • Otherwise, a normal approximation is used
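A sketch of the paired test in R; the aluminum values below are placeholders to show the call, not the actual Laureysens et al. (2004) measurements:

```r
# Sketch: Wilcoxon signed-rank test for paired data.
# These aluminum values are illustrative placeholders, NOT the real data.
august   <- c(2.04, 1.91, 3.40, 2.33, 4.06, 5.92, 1.22, 3.98, 2.11, 2.75, 1.85, 3.10, 2.50)
november <- c(2.41, 2.20, 3.25, 2.80, 4.63, 6.10, 1.45, 4.20, 2.05, 3.01, 2.10, 3.55, 2.80)

wilcox.test(august, november, paired = TRUE)  # signed-rank test on the 13 paired clones
```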
More than 2 groups • Suppose we are comparing more than two groups • Normally (no pun intended), we would use an analysis of variance, or ANOVA • But, of course, we need to assume normality for this as well
Kruskal-Wallis • In this situation, we use the Kruskal-Wallis test • Again, we are converting the actual numeric values to ranks, regardless of group • Ultimately, we compute the test statistic H = [12 / (N(N+1))] Σ Ri²/ni − 3(N+1), where Ri is the rank sum of group i, ni is the size of group i, and N is the total sample size
An example • We'll use the built-in R dataset airquality • It contains daily air quality measurements in New York from May to September 1973 (Chambers et al.) • The question is, does air quality differ from month to month?
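A minimal sketch of the Kruskal-Wallis test on airquality; ozone is used here as the air-quality measure, which is an assumption since the slides do not say which variable was analyzed:

```r
# Sketch: Kruskal-Wallis test on the built-in airquality data.
# Ozone is chosen as the measure for illustration.
data(airquality)
boxplot(Ozone ~ Month, data = airquality,
        xlab = "Month", ylab = "Ozone (ppb)")
kruskal.test(Ozone ~ Month, data = airquality)
```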
Onto Part 2… • In this part of the course, we will do the following things • Look at the Spearman correlation (alternative to the Pearson correlation) between an x variable and y variable • Examine nonparametric bootstrapping, and how it can help us when our data does not approximate normality
Spearman correlation • Suppose you want to discover the association between infant mortality and GDP by country • Here's a scatterplot of the 2003 data
Pearson correlation • This data comes from www.indexmundi.com • In this example, the Pearson correlation is about -.63 • Still significant, but perhaps underestimates the monotone nature of the relationship between GDP and infant mortality rate • In addition, the Pearson correlation assumes linearity, which is clearly not present here
We can use the Spearman correlation instead • This is a correlation coefficient based on ranks, which are computed for the y variable and the x variable separately, with sample size n • To calculate the coefficient, we do the following… • Convert each xi and each yi into ranks (the x-ranks and the y-ranks are computed independently of each other) • Subtract the rank of xi from the rank of yi to get di, the difference in ranks for observation i • The formula (with no ties) is rs = 1 − 6 Σ di² / (n(n² − 1))
In this case, the hypotheses are H0: Rs=0 vs. Ha: Rs≠0 • Basically, we are attempting to see whether or not the two variables are independent, or if there is evidence of an association
Let’s return to the GDP data • We will now plot the data in R, and see how to get the Spearman correlation • R tests this by using an exact p-value for small sample sizes, and an approximate t-distribution for larger ones • The test statistic in that case would follow a t-distribution with n-2 degrees of freedom
Nonparametric Bootstrapping • Suppose you are fitting a multiple linear regression model: Yi = β0 + β1x1i + β2x2i + … + βkxki + εi • Recall that εi ~ N(0, σ²) • But what if we have reason to suppose this assumption is not met? • Then, the regular way of testing the significance of our coefficients is invalid
So what do we do? • Depending on the situation, there are several options • The one we will talk about today is called nonparametric bootstrapping • Bootstrapping is a resampling method: we take the data and randomly resample from it to draw inferences • There are several types of bootstrapping; we will focus on the simpler, nonparametric type
How do we do nonparametric bootstrapping? • We assign each of the n observations a 1/n probability of being selected • We then take a random sample with replacement, usually of size n • We compute estimated coefficients based on this sample • We repeat the process many times, say 10,000, or 100,000, or 100,000,000,000,000…
For example, in regression • Suppose we want to test whether or not β1 = 0 • We sort our newly generated sample of (10,000, or however many) β1 estimates from smallest to largest • For a 95 percent confidence interval, we would look at the 250th lowest and the 9,750th lowest values (i.e., roughly the 2.5th and 97.5th percentiles) • If this interval does not contain 0, we conclude that there is evidence that β1 is not equal to 0 (see the sketch below)
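A minimal sketch of this percentile-interval bootstrap for a regression slope, written with a plain loop on simulated data (the course uses a real wage data set later; this just shows the mechanics):

```r
# Sketch: nonparametric bootstrap percentile CI for a regression slope.
set.seed(42)
n <- 100
x <- runif(n, 0, 10)
y <- 2 + 0.5 * x + rexp(n, rate = 1)  # deliberately non-normal errors

B <- 10000
boot_slopes <- numeric(B)
for (b in 1:B) {
  idx <- sample(n, size = n, replace = TRUE)   # resample rows with replacement
  boot_slopes[b] <- coef(lm(y[idx] ~ x[idx]))[2]
}

quantile(boot_slopes, c(0.025, 0.975))  # percentile CI; excluding 0 is evidence that the slope != 0
```

The boot package's boot() and boot.ci() functions automate the same idea for more involved analyses.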
Possible issue? • Theoretically, this method comes out of the idea that the distribution of a population is approximated by the distribution of its sample • This assumption becomes less and less valid the smaller your sample size is
Example • This data was taken from The Practice of Econometrics by E. R. Berndt (1991) • We will be regressing wage on education and experience • We will use graphs to check whether the residuals are approximately normal • R-Code Now!
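A hedged sketch of what that R code might look like; the data frame name `wages` and its column names (`wage`, `education`, `experience`) are assumptions for illustration, not necessarily how the Berndt (1991) data is stored:

```r
# Sketch: wage regression with a graphical residual check, then a bootstrap.
# The data frame 'wages' and its column names are assumed for illustration.
fit <- lm(wage ~ education + experience, data = wages)

hist(resid(fit), main = "Residuals", xlab = "Residual")  # check for skewness
qqnorm(resid(fit)); qqline(resid(fit))                   # check normality

# If the residuals look non-normal, bootstrap the coefficients as above:
B <- 10000
boot_coefs <- replicate(B, {
  idx <- sample(nrow(wages), replace = TRUE)
  coef(lm(wage ~ education + experience, data = wages[idx, ]))
})
apply(boot_coefs, 1, quantile, probs = c(0.025, 0.975))  # percentile CIs per coefficient
```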
In Summary • We now understand that certain parametric methods, like t-tests and regression, depend on assumptions that may or may not be met • We know that nonparametric methods do not make these distributional assumptions, and therefore are applicable when data do not meet them • We can implement these methods in R, in case our data does not meet these assumptions
Bonus Topic! Indicator Variables in Regression Used when we have categorical variables as predictors in regression
As a start, let's re-examine the wage data • Here, we will drop education and just look at experience and sex • The full model in a case like this would look like Wi = β0 + β1Ei + β2Si + β3Si*Ei + εi, where W = wage, E = experience, S = 1 if sex = "male" and 0 otherwise, and ε is the error term
Separate regressions for different sexes • For women, the reference group: Wia = β0a + β1aEi + εia • For men, the non-reference group: Wib = β0b + β1bEi + εib
With that in mind… • Let's return to the full model: Wi = β0 + β1Ei + β2Si + β3Si*Ei + εi • Suppose we want to see how women do in this model • We can set S = 0 • We are left with Wi = β0 + β1Ei + εi • But, since this is the women-only case, it is equivalent to the women-only model, Wia = β0a + β1aEi + εia
Thus… β0 = β0a and β1Ei = β1aEi, i.e., β1 = β1a
Now, look at "male" • We set S = 1:
Wi = β0 + β1Ei + β2*1 + β3*1*Ei + εi
= β0 + β1Ei + β2 + β3Ei + εi
= β0a + β1aEi + β2 + β3Ei + εi
= (β0a + β2) + (β1a + β3)Ei + εi
= β0b + β1bEi + εib
So… β2 = β0b − β0a and β3 = β1b − β1a
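A hedged sketch of fitting this interaction model in R; again, the data frame `wages` and its columns (`wage`, `experience`, `sex`) are assumed names for illustration:

```r
# Sketch: indicator (dummy) variable regression with an interaction.
# 'wages' and its column names are assumed for illustration.
wages$sex <- factor(wages$sex)                    # e.g., levels "female" (reference) and "male"

fit <- lm(wage ~ experience * sex, data = wages)  # main effects plus interaction
summary(fit)
# With "female" as the reference level, the coefficients map to the slide's notation:
# (Intercept)          -> beta0 (reference-group intercept)
# experience           -> beta1 (reference-group slope)
# sexmale              -> beta2 (difference in intercepts, beta0b - beta0a)
# experience:sexmale   -> beta3 (difference in slopes,     beta1b - beta1a)
```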