

  1. BIOL 582 Lecture Set 21 One-Way MANOVA, Part I

  2. So far we have learned two things about multivariate data:
  • That linear models work equally well with multivariate data, compared to univariate data
  • That we can visualize general patterns in multivariate data spaces in reduced dimensions
  The next two lectures will attempt to wrap up some other issues in working with multivariate data, namely:
  • Can I use ANOVA on a multivariate linear model to test group differences?
  • Can I determine which of my response variables is better for distinguishing group differences?
  • What is the difference between scatter plots that resemble data spaces and ones that resemble statistical spaces?
  Before getting into these issues, let’s take a quick look again at some issues with the Bumpus data

  3. From last time
  • This time, let’s try this simple procedure
  • It might be obvious that the plot shows the distribution of eigenvalues for the six possible PCs
  • Let’s compare this to random, uncorrelated data…

  > bumpus<-read.csv("bumpus.csv")
  > attach(bumpus)
  > # the following morphological variables will be used
  > # AE, BHL, HL, FL, TTL, SW, SKL
  > Y<-cbind(AE,BHL,FL,TTL,SW,SKL)
  > # PCA on correlation matrix
  > Y.cor.pca<-princomp(Y,cor=T)
  > plot(Y.cor.pca)
  > Y.cor.pca$sdev^2
     Comp.1    Comp.2    Comp.3    Comp.4    Comp.5    Comp.6
  3.6394870 0.7481825 0.6176918 0.4498947 0.3605078 0.1842361

  4. Random data:
  • Compare plots
  • The Bumpus data produced a principal eigenvalue that can be described as “overdispersed” (more variance than expected). Another way to describe the pattern is lower entropy (uncertainty), which means greater information content in fewer dimensions than expected by chance.
  • There are (debatable) ways to test the “significance” of PCs (and some use these methods to decide which information to keep and which to throw out). This is not advocated here.
  • Just realize that in the Bumpus data, prominent variable covariation is apparent.

  > Y.rand<-matrix(rnorm(6*nrow(Y),0,1),nrow=nrow(Y),ncol=6)
  > Y.rand.cor.pca<-princomp(Y.rand,cor=T) # correlation matrix, to match Y.cor.pca
  > par(mfrow=c(1,2))
  > plot(Y.cor.pca)
  > plot(Y.rand.cor.pca)

  Pop quiz! What is the total variance of any PCA using a correlation matrix? p, where p is the number of variables.
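The pop-quiz answer can be checked numerically; below is a minimal sketch using simulated data as a stand-in for the Bumpus traits (bumpus.csv is not bundled with this transcript):

```r
# For a PCA on a correlation matrix, the eigenvalues always sum to p,
# because each standardized variable contributes exactly 1 unit of variance.
set.seed(42)
p <- 6
Y <- matrix(rnorm(100 * p), 100, p)   # stand-in for the 6 morphological traits
pca <- princomp(Y, cor = TRUE)
sum(pca$sdev^2)                       # total variance = p = 6
```

This is why the random-data comparison works: both PCAs have the same total variance (p), so any difference in the eigenvalue profile reflects covariation, not scale.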

  5. Why was the previous page important? Sometimes the covariation of variables helps explain group differences in multivariate responses. Utilizing a multivariate linear model might help reveal this. Below is a hypothetical example to illustrate this point
  • Are there obvious differences between the two groups for each trait, independently?
  • But the two groups occupy different regions of the data space
  • By considering the covariation of responses, it is apparent that a test to compare locations should reveal a difference.
  [Figure: scatter of the two groups in the y1–y2 plane]
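The hypothetical example can be reproduced with a quick simulation (all numbers here are invented for illustration): two strongly correlated traits whose marginal distributions overlap between groups, yet whose joint distributions occupy distinct regions of the data space.

```r
# Two groups separated along the minor axis of a correlated bivariate cloud:
# each trait alone barely distinguishes the groups, but jointly they do.
set.seed(1)
n  <- 50
z  <- rnorm(2 * n)                                  # shared axis -> correlation
y1 <- z + rnorm(2 * n, sd = 0.3) + rep(c(0,  0.5), each = n)
y2 <- z + rnorm(2 * n, sd = 0.3) + rep(c(0, -0.5), each = n)
grp <- factor(rep(c("A", "B"), each = n))
# Each trait alone shows heavily overlapping groups...
p.uni <- summary(aov(y1 ~ grp))[[1]][["Pr(>F)"]][1]
# ...but the multivariate test exploits the covariation of y1 and y2
fit <- manova(cbind(y1, y2) ~ grp)
p.multi <- summary(fit)$stats["grp", "Pr(>F)"]
```

The multivariate P-value is far smaller than either univariate one, because the group shift runs perpendicular to the main axis of within-group covariation.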

  6. Let’s go through the steps of a linear model with multivariate response data, as we did with univariate response data (sort of a side-by-side comparison).
  • From each model, the error can be obtained
  • Then the error can be “squared” by computing the inner product of the residuals with themselves
  • Thus, for univariate data, the inner product of two vectors produces a scalar, which is SSE. For multivariate data, the result is a sums of squares and cross-products matrix. This is the key difference. The error is not one value; it is a matrix of measures of covariance.
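The side-by-side comparison can be seen directly in R; this sketch uses simulated responses and a made-up two-group design:

```r
# "Squaring" the error: a univariate residual vector gives a scalar SSE;
# a multivariate residual matrix gives a p x p SSCP matrix.
set.seed(2)
n <- 30
x <- gl(2, n / 2)                    # two-group factor (stand-in design)
y <- rnorm(n)                        # univariate response
Y <- matrix(rnorm(n * 3), n, 3)      # multivariate response, p = 3
e.uni  <- resid(lm(y ~ x))
e.mult <- resid(lm(Y ~ x))
SSE  <- t(e.uni) %*% e.uni           # 1 x 1: a scalar sum of squared errors
SSCP <- t(e.mult) %*% e.mult         # 3 x 3: sums of squares on the diagonal,
                                     #        cross-products off the diagonal
```

Note that the SSCP matrix is symmetric: its diagonal holds each variable's own SSE, and the off-diagonal elements measure how the residuals covary.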

  7. Recall that if one wishes to test group differences in means for univariate data, one can compare the error of a “full” linear model, with both an intercept and parameters for estimating group effects, to the error of a “reduced” model that contains just an intercept. This can be done with a likelihood ratio test (using, e.g., an F stat)
  • Inspection of this F value formula indicates that it will only work if the residuals are a single vector, such that the inner product produces a scalar. Here is a way to generalize this for matrices
  • Note that the F is bolded but not italicized, as it is a matrix
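The equations on this slide were images that did not survive the transcript; a hedged reconstruction from the surrounding definitions (Δk = difference in number of parameters between the two models, n − k = error degrees of freedom of the full model) is:

```latex
% Univariate nested-model F statistic:
F = \frac{(SSE_r - SSE_f)/\Delta k}{SSE_f/(n - k)}
% Matrix generalization: each scalar sum of squares becomes an SSCP matrix,
% and "division" by the error SSCP becomes multiplication by its inverse:
\mathbf{F} = \frac{n - k}{\Delta k}\,\mathbf{S}_f^{-1}\left(\mathbf{S}_r - \mathbf{S}_f\right)
```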

  8. Two questions: What are the dimensions of the matrix? Can we get a P-value?
  • The first answer is easy; the second, not so much
  • A sums of squares and cross-products (SSCP) matrix will ALWAYS be a p x p matrix! The difference of two SSCP matrices, like the one on the right side of the equation, is also an SSCP matrix: it is the SSCP matrix for the effect that is tested.
  • Sometimes a particular nomenclature is used to describe particular SSCPs. The error SSCP is called E (do not confuse this with eigenvectors) and the difference between the E matrices of the two models is called H, which is much like “hypothesis.” This nomenclature has some benefit. For example, one could state that the expected value of tr(H) is 0 under the null hypothesis. Thus, F can be written as
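The formula itself was lost in the transcript; a plausible reconstruction in the E/H nomenclature, consistent with the Hotelling–Lawley trace defined on slide 22, is:

```latex
\mathbf{E} = \mathbf{S}_f, \qquad
\mathbf{H} = \mathbf{S}_r - \mathbf{S}_f, \qquad
\mathbf{F} = \frac{n - k}{\Delta k}\,\mathbf{E}^{-1}\mathbf{H},
\qquad \operatorname{E}\!\left[\operatorname{tr}(\mathbf{H})\right] = 0 \ \text{under } H_0
```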

  9. Some sources take it a step further. Since SSCPs multiplied by the reciprocal of their degrees of freedom produce covariance matrices, some prefer to do the following:
  • Where B and W are the “between-group” and “within-group” covariance matrices, respectively. This approach obviously assumes one is doing a one-way “MANOVA” rather than regression. However, the original formula is more general for comparing any two nested models.
  • As for the second question, there is no way to get a P-value for a matrix…
  • But there are a few multivariate coefficients that can be converted to values that approximately follow F distributions. In some cases (i.e., when Δk = 1), the values are said to be “exact” F values.
  • We will come back to what we can do with this F matrix next lecture
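The covariance-matrix formulation on this slide was also an image; assuming Δk = k − 1 group degrees of freedom and n − k error degrees of freedom for a one-way design, one consistent reconstruction is:

```latex
% SSCPs divided by their degrees of freedom are covariance matrices:
\mathbf{W} = \frac{1}{n - k}\,\mathbf{E}, \qquad
\mathbf{B} = \frac{1}{\Delta k}\,\mathbf{H}, \qquad
\mathbf{F} = \mathbf{W}^{-1}\mathbf{B}
```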

  10. Let’s not forget assumptions!
  • Multivariate linear model assumptions are similar to univariate assumptions, with a few twists
  • The error is multivariate normal (not easy to test, but one can look at a PC plot of residuals and check for an elliptical shape of scatter; R has a multivariate Shapiro-Wilk test)
  • Homogeneity of covariance (there is a test for this – Box’s M – but it is overly sensitive and generally not used; one can instead examine the group covariance matrices – the off-diagonal elements should be similar across groups, as should the diagonal elements)
  • The relationship of dependent variables is linear
  • Independent observation vectors (not values) of subjects
  • n >> p (this matters more for MANOVA than for the linear model itself)
  • Y is not rank deficient, i.e., it is full rank. The “rank” is the number of variables. However, if one variable is a function of another, the matrix has more columns than it needs. If many variables are highly correlated, rank deficiency might occur. (One ad-hoc solution is to do PCA and keep the components that have eigenvalues > 0. This matters more for MANOVA than for the linear model; multivariate test statistics are based on a full-rank assumption.)
  Note: rank deficiency in X is also a problem for either univariate or multivariate responses. Computers have trouble inverting XᵀX if X is not full rank. This is called multicollinearity. It means that variables are inherently related, and prediction should use a smaller set. (E.g., Assignment 8 – DPY and PercipIn – stepwise regression most likely removed one of these.)
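The rank-deficiency assumption is easy to check in R; here is a small sketch with simulated data and a deliberately redundant column:

```r
# Full-rank check for Y: a redundant column lowers the matrix rank.
set.seed(3)
Y <- matrix(rnorm(50 * 3), 50, 3)
qr(Y)$rank                            # 3: full rank
Y.bad <- cbind(Y, Y[, 1] + Y[, 2])    # 4th column is a linear combination
qr(Y.bad)$rank                        # still 3: Y.bad is rank deficient
# The slide's ad-hoc fix: PCA, keeping components with nonzero eigenvalues
pc <- prcomp(Y.bad)
n.keep <- sum(pc$sdev^2 > 1e-10)      # dimensions actually carrying variance
Y.fix  <- pc$x[, 1:n.keep]            # full-rank replacement for Y.bad
```

Comparing `qr(Y)$rank` with `ncol(Y)` before fitting is a cheap way to catch this problem before a multivariate test statistic fails or misbehaves.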

  11. One-way MANOVA Steps
  • State Null/Alternative hypotheses
  • Define model (evaluate assumptions)
  • Calculate Sf for the error of the full model
  • Define the reduced “null” model (always contains just an intercept)
  • Calculate Sr for the error of the reduced model
  • Calculate a multivariate test statistic using Sf and/or Sr. (Note: one can use only Sf to get a test statistic if performing a randomization test.)
  • Evaluate the probability of the test statistic if the null hypothesis were true
    • By converting the test statistic to an F stat, which approximately follows an F distribution
    • By performing a randomization test to create an empirical probability distribution
  • Generate plots/tables
  • Maybe do a discriminant analysis

  12. Example MANOVA: Bumpus Data
  • State Null/Alternative hypotheses
  • Define model (evaluate assumptions)
  • Calculate Sf for the error of the full model
  • Define the reduced “null” model (always contains just an intercept)
  • Calculate Sr for the error of the reduced model
  • Calculate a multivariate test statistic using Sf and/or Sr. (Note: one can use only Sf to get a test statistic if performing a randomization test.)
  • Evaluate the probability of the test statistic if the null hypothesis were true
    • By converting the test statistic to an F stat, which approximately follows an F distribution
    • By performing a randomization test to create an empirical probability distribution
  • Generate plots/tables
  • Maybe do a discriminant analysis

  13. Example MANOVA: Bumpus Data
  • State Null/Alternative hypotheses
    • H0: Morphology does not differ between survivors and non-survivors, and males and females (i.e., sex-survival groups)
    • HA: Morphology differs between survivors and non-survivors, and males and females (i.e., sex-survival groups) in some way
  where ΣH is the population covariance matrix for the effect. Since V = (1/Δk)·H is a sample estimate of ΣH, the trace of H is an adequate sample statistic for testing the null hypothesis. The expected value under the null hypothesis is tr(H) = 0.

  14. Example MANOVA: Bumpus Data
  • State Null/Alternative hypotheses
  • Define model (evaluate assumptions)
  • Calculate Sf for the error of the full model
  • Define the reduced “null” model (always contains just an intercept)
  • Calculate Sr for the error of the reduced model
  • Calculate a multivariate test statistic using Sf and/or Sr. (Note: one can use only Sf to get a test statistic if performing a randomization test.)
  • Evaluate the probability of the test statistic if the null hypothesis were true
    • By converting the test statistic to an F stat, which approximately follows an F distribution
    • By performing a randomization test to create an empirical probability distribution
  • Generate plots/tables
  • Maybe do a discriminant analysis

  15. Example MANOVA: Bumpus Data
  Define model (evaluate assumptions)
  Calculate Sf for the error of the full model

  > bumpus<-read.csv("bumpus.csv")
  > attach(bumpus)
  > # the following morphological variables will be used
  > # AE, BHL, HL, FL, TTL, SW, SKL
  > Y<-cbind(AE,BHL,FL,TTL,SW,SKL)
  > group<-factor(paste(sex,survived,sep='.'))
  > lm.group<-lm(Y~group)
  > lm.group

  Call:
  lm(formula = Y ~ group)

  Coefficients:
                AE          BHL        FL         TTL        SW         SKL
  (Intercept)   241.571429  31.478571  18.027650  28.728307  15.279914  20.845236
  groupf.TRUE    -0.571429  -0.045238   0.129721   0.319617  -0.037495  -0.035379
  groupm.FALSE    5.734127   0.082540  -0.008467  -0.113796   0.015825   0.610709
  groupm.TRUE     5.840336   0.215546   0.174687   0.120116   0.065670   0.887702

  16. Example MANOVA: Bumpus Data
  Define model (evaluate assumptions)
  Calculate Sf for the error of the full model

  > # Consider assumptions......
  > # One can go crazy trying to do tests
  > # Almost any test will be too sensitive
  > # They will suggest assumptions are violated
  > # Yet the multivariate test stats are anti-conservative
  > # So testing assumptions is rather futile
  > # Better to look at residuals
  >
  > e<-resid(lm.group)
  > pca.e<-princomp(e,cor=F)
  > yh<-predict(lm.group)
  > pca.yh<-princomp(yh,cor=F)
  > par(mfrow=c(2,2))
  > plot(pca.e,main="distribution of eigenvalues")
  > plot(pca.e$scores,asp=1,main="best 2-D view of residuals")
  > plot(pca.yh$scores[,1],pca.e$scores[,1],main="best residual vs predicted view")
  > plot(group,pca.e$scores[,1],main="best residual vs group view")

  17. Example MANOVA: Bumpus Data
  Define model (evaluate assumptions)
  Calculate Sf for the error of the full model
  [Figure: the four diagnostic plots produced by the code on the previous slide]

  18. Example MANOVA: Bumpus Data
  Define model (evaluate assumptions)
  Calculate Sf for the error of the full model

  > Sf<-t(e)%*%e
  > Sf
              AE        BHL       FL        TTL       SW        SKL
  AE   2966.8490  226.59416 256.26670 433.30517 111.23737 285.34416
  BHL   226.5942   65.18760  35.11097  57.50294  18.80858  41.58504
  FL    256.2667   35.11097  49.67377  68.00522  16.19890  35.69328
  TTL   433.3052   57.50294  68.00522 141.80487  24.52523  55.61453
  SW    111.2374   18.80858  16.19890  24.52523  19.39935  18.06867
  SKL   285.3442   41.58504  35.69328  55.61453  18.06867 115.81108
  > # potential homogeneity of covariance problem…

  19. Example MANOVA: Bumpus Data
  • State Null/Alternative hypotheses
  • Define model (evaluate assumptions)
  • Calculate Sf for the error of the full model
  • Define the reduced “null” model (always contains just an intercept)
  • Calculate Sr for the error of the reduced model
  • Calculate a multivariate test statistic using Sf and/or Sr. (Note: one can use only Sf to get a test statistic if performing a randomization test.)
  • Evaluate the probability of the test statistic if the null hypothesis were true
    • By converting the test statistic to an F stat, which approximately follows an F distribution
    • By performing a randomization test to create an empirical probability distribution
  • Generate plots/tables
  • Maybe do a discriminant analysis

  20. Example MANOVA: Bumpus Data
  Define the reduced “null” model (always contains just an intercept)
  Calculate Sr for the error of the reduced model

  > lm.null<-lm(Y~1)
  > e.r<-resid(lm.null)
  > Sr<-t(e.r)%*%e.r
  > Sr
              AE        BHL       FL        TTL       SW        SKL
  AE   4115.0294  261.26912 263.98818 410.11550 123.17917 435.47590
  BHL   261.2691   66.59993  35.79883  57.34511  19.31347  46.82664
  FL    263.9882   35.79883  50.64240  69.25259  16.41615  37.77880
  TTL   410.1155   57.34511  69.25259 144.59029  24.40976  54.03802
  SW    123.1792   19.31347  16.41615  24.40976  19.58572  19.88596
  SKL   435.4759   46.82664  37.77880  54.03802  19.88596 136.92128
  > # compare to Sf
  > Sf
              AE        BHL       FL        TTL       SW        SKL
  AE   2966.8490  226.59416 256.26670 433.30517 111.23737 285.34416
  BHL   226.5942   65.18760  35.11097  57.50294  18.80858  41.58504
  FL    256.2667   35.11097  49.67377  68.00522  16.19890  35.69328
  TTL   433.3052   57.50294  68.00522 141.80487  24.52523  55.61453
  SW    111.2374   18.80858  16.19890  24.52523  19.39935  18.06867
  SKL   285.3442   41.58504  35.69328  55.61453  18.06867 115.81108

  Because AE has such large variance compared to the other variables, the Sf matrix might violate the homogeneity of covariance assumption (i.e., MANOVA devolves into ANOVA of AE)

  21. Example MANOVA: Bumpus Data
  • State Null/Alternative hypotheses
  • Define model (evaluate assumptions)
  • Calculate Sf for the error of the full model
  • Define the reduced “null” model (always contains just an intercept)
  • Calculate Sr for the error of the reduced model
  • Calculate a multivariate test statistic using Sf and/or Sr. (Note: one can use only Sf to get a test statistic if performing a randomization test.)
  • Evaluate the probability of the test statistic if the null hypothesis were true
    • By converting the test statistic to an F stat, which approximately follows an F distribution
    • By performing a randomization test to create an empirical probability distribution
  • Generate plots/tables
  • Maybe do a discriminant analysis

  22. Example MANOVA: Bumpus Data
  • Calculate a multivariate test statistic using Sf and/or Sr. (Note: one can use only Sf to get a test statistic if performing a randomization test.)
  • There are several multivariate test statistics that can be used (e.g., check out Wikipedia for “MANOVA”)
  • It can be debated which are the best (although differences only appear in small samples and when significance is questionable)
  • The following is all we will consider in order to remain practical
  • There are two basic ways to make this test stat more “recognizable”
    • Pillai’s trace: PT = tr[(H + E)^-1 H] → resembles an R² value
    • Hotelling-Lawley trace: HLT = tr[E^-1 H] → resembles an F value (without df)
  • Both of these test statistics can be converted to values that approximately follow F distributions, or one can randomize rows of Y and create error distributions of tr(H)
  where ΣH is the population covariance matrix for the effect. Since V = (1/Δk)·H is a sample estimate of ΣH, the trace of H is an adequate sample statistic for testing the null hypothesis. The expected value under the null hypothesis is tr(H) = 0.
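Both traces can be computed directly from the H and E matrices and checked against base R's manova(); the sketch below uses simulated data as a stand-in (the Bumpus CSV is not bundled with this transcript):

```r
# Hand-computed Pillai and Hotelling-Lawley traces vs. base R's manova().
set.seed(4)
n <- 60; p <- 3
group <- gl(3, n / 3)                          # three groups, n/3 each
Y <- matrix(rnorm(n * p), n, p)
Y[group == "2", 1] <- Y[group == "2", 1] + 1   # shift group 2 on the first trait
fit <- manova(Y ~ group)
E <- crossprod(resid(fit))                     # error SSCP (full model), Sf
H <- crossprod(resid(lm(Y ~ 1))) - E           # hypothesis SSCP, Sr - Sf
PT  <- sum(diag(solve(E + H) %*% H))           # Pillai's trace
HLT <- sum(diag(solve(E) %*% H))               # Hotelling-Lawley trace
PT.r <- summary(fit)$stats["group", "Pillai"]  # base R agrees with PT
```

Note that E + H = Sr, so Pillai's trace is tr(Sr⁻¹H), which is exactly the R²-like quantity computed in the randomization code on slide 25.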

  23. Example MANOVA: Bumpus Data
  • State Null/Alternative hypotheses
  • Define model (evaluate assumptions)
  • Calculate Sf for the error of the full model
  • Define the reduced “null” model (always contains just an intercept)
  • Calculate Sr for the error of the reduced model
  • Calculate a multivariate test statistic using Sf and/or Sr. (Note: one can use only Sf to get a test statistic if performing a randomization test.)
  • Evaluate the probability of the test statistic if the null hypothesis were true
    • By converting the test statistic to an F stat, which approximately follows an F distribution
    • By performing a randomization test to create an empirical probability distribution
  • Generate plots/tables
  • Maybe do a discriminant analysis

  24. Example MANOVA: Bumpus Data
  • Evaluate the probability of the test statistic if the null hypothesis were true
    • By converting the test statistic to an F stat, which approximately follows an F distribution

  > # simple solution in R
  > summary(manova(lm.group))
             Df  Pillai approx F num Df den Df    Pr(>F)
  group       3 0.51629   4.4692     18    387 7.506e-09 ***
  Residuals 132
  ---
  Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
  > summary(manova(lm.group),test="Hotelling")
             Df Hotelling-Lawley approx F num Df den Df    Pr(>F)
  group       3          0.94039   6.5653     18    377 2.758e-14 ***
  Residuals 132
  ---
  Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

  25. Example MANOVA: Bumpus Data
  • Evaluate the probability of the test statistic if the null hypothesis were true
    • By performing a randomization test to create an empirical probability distribution

  > # Randomization approach
  > H<-Sr-Sf
  > t.H<-sum(diag(H)) # this is the test stat
  > permute<-999
  > result<-t.H; P<-1
  > for(i in 1:permute){
  + Y.r<-Y[sample(nrow(Y)),] # this randomizes row vectors
  + # important to realize that all values for each subject
  + # are held together when shuffling values
  + lm.r<-lm(Y.r~group)
  + Sf.r<-t(resid(lm.r))%*%resid(lm.r)
  + t.H.r<-sum(diag(Sr-Sf.r))
  + result<-c(result,t.H.r)
  + P<-ifelse(t.H.r>=t.H,P+1,P+0)
  + }
  >
  > # error distribution
  > hist(result,xlab="trace(H)",breaks=50,col="blue")
  > rug(result)
  > rug(result[1],lwd=2,col='red')
  >
  > # Results
  > P.value<-P/(permute+1)
  > R2<-sum(diag(solve(Sr,H))) # = tr(Sr^-1 H), i.e., Pillai's trace
  > t.H
  [1] 1174.643
  > R2
  [1] 0.5162905
  > P.value
  [1] 0.001

  26. Example MANOVA: Bumpus Data
  • State Null/Alternative hypotheses
  • Define model (evaluate assumptions)
  • Calculate Sf for the error of the full model
  • Define the reduced “null” model (always contains just an intercept)
  • Calculate Sr for the error of the reduced model
  • Calculate a multivariate test statistic using Sf and/or Sr. (Note: one can use only Sf to get a test statistic if performing a randomization test.)
  • Evaluate the probability of the test statistic if the null hypothesis were true
    • By converting the test statistic to an F stat, which approximately follows an F distribution
    • By performing a randomization test to create an empirical probability distribution
  • Generate plots/tables
  • Maybe do a discriminant analysis

  27. Example MANOVA: Bumpus Data
  • Generate plots/tables
  • Maybe do a discriminant analysis
  • These two are linked because we are in a position where we might want to do multiple comparisons. Therefore, what we plot or how we present a table might be contingent upon what a multiple comparisons test reveals
  • Also, multiple comparisons can be done several ways
  • These topics will be explored in the next lecture
