BIOL 582

BIOL 582 Lecture Set 18 Analysis of frequency and categorical data Part III: Tests of Independence (cont.) Odds Ratios Loglinear Models Logistic Models

Sokal and Rohlf (2011) describe and recommend the following • We have considered examples for Models I and II. Model III lends itself well to Fisher’s Exact Test, but this test or any test of independence can be done on any Model type. The important thing to remember is that all tests tend to have inflated type I error rates and will be less robust with small samples. Correction factors are often used as a result • Fisher’s Exact Test is often used with smaller sample sizes • But it should really only be used if bothcriteria of the table (row and column totals) are fixed

Sokal and Rohlf (2011) also describe and recommend the following • A good example for illustrating the utility of this test is Dr. Bristol’s clairvoyance. This is stolen right from Wikipedia: • Fisher is said to have devised the test following a comment from Dr. Muriel Bristol, who claimed to be able to detect whether the tea or the milk was added first to her cup…. So in Fisher's original example, one criterion of classification could be whether milk or tea was put in the cup first; the other could be whether Dr Bristol thinks that the milk or tea was put in first. We want to know whether these two classifications are associated – that is, whether Dr Bristol really can tell whether milk or tea was poured in first. Most uses of the Fisher test involve, like this example, a 2 × 2 contingency table. The p-value from the test is computed as if the margins of the table are fixed, i.e. as if, in the tea-tasting example, Dr Bristol knows the number of cups with each treatment (milk or tea first) and will therefore provide guesses with the correct number in each category. As pointed out by Fisher, this leads under a null hypothesis of independence to a hypergeometric distribution of the numbers in the cells of the table.

Example Model III (Box 17.7 Sokal and Rohlf2011) • Acacia trees are cleared in an area of Central America, except for 28 lucky bushes, which are fumigated. 15 are species A; 13 are species B • 16 ant colonies are released into the experimental zone (each colony can infect exactly one tree) • Thus, 12 trees will be uninfected • Since the number of species are fixed and the number of infected and uninfected trees are fixed, this is Model III • H0: Tree infection is independent of species • Here is the result • What is the probability of this result if the null hypothesis is true? (I.e., what is the probability it could happen by chance?)

Example Model III (Box 17.7 Sokal and Rohlf2011) • First realize that the one must find all probabilities that are as rare as the observed case, which is this • And this • And this

Example Model III (Box 17.7 Sokal and Rohlf2011) • The probability of each event using the hypergeometric distribution is

Example Model III (Box 17.7 Sokal and Rohlf2011) • Which for each event is

An R x C Test of Independence means that one factor is described by rows and one factor is described by columns • And R x C test is a test of a two-way table, even if there are more than two rows and more than two columns. The format does not change. One has these three options: • Calculate expected cell values (frequencies) as r*c/n, where r and c are the marginal totals of the row and column where a cell exists, and calculate a Chi-square statistic • Calculate all f log fvalues for every cell and margin total, plus n log n. Then calculate • Calculate the exact probability of the table outcome given a known distribution of outcomes (when factor sums [margins] are fixed). This is the Fisher’s exact test we just saw • These are all more or less slightly different twists of the same theme • We might try a more complicated two-way table in R, but the concept is the same

A three-way table is much more complicated. We will not go into its complicatedness. Consult other sources if you must understand all the details • Here is an example three-way table

A three-way table is much more complicated. We will not go into its complicatedness. Consult other sources if you must understand all the details • It might be obvious that testing such a table would be an iterative process

A three-way table is much more complicated. We will not go into its complicatedness. Consult other sources if you must understand al the details • It might be obvious that testing such a table would be an iterative process • Multi-way tables are made easier to analyze with loglinear models • For example, a two-way table can be expressed by the following model • Unlike factorial ANOVA, only the interaction is important: it expresses the dependence of factors on one another, so the null hypothesis of independence assumes it is 0. A G test is (thus) a likelihood ratio test between this “full” model and a “reduced” model which lacks the interaction. Mean of logarithms of expected frequencies Effect of category i of factor A Effect of category j of factor B The dependence of category i of factor A on category j of factor B

A three-way table is much more complicated. We will not go into its complicatedness. Consult other sources if you must understand al the details • It might be obvious that testing such a table would be an iterative process • Multi-way tables are made easier to analyze with loglinear models • A three-way table can be expressed by the following model • We will leave it as sufficient that model LRTs can be performed with different interactions removed to test specific dependencies, given other dependencies. Sometimes the three-way interaction is removed as it is deemed cumbersome. • Check out how complicated these analyses can become at Quick R

Now we turn our attention to response data that can have one of two outcomes • Expressed/unexpressed • Dead/alive • Female/Male • Success/failure • Present/absent • Often these types of responses are expressed as proportions, for the obvious reason that results can be generalized to smaller or larger groups. • For example, the inoculated mouse data can be summarized as

Now we turn our attention to response data that can have one of two outcomes • For example, the inoculated mouse data can be summarized as • Proportions can be expressed as odds-ratios • For this example, , which means that the odds of mice surviving with antiserum are approximately three times higher than without.

Odds ratios can be conveniently decomposed when expressed as a logarithm. • The utility of doing this is not readily apparent. An odds ratio is convenient; a difference in logits is not as much. • So why do this? • Logits are quantities that are approximately normally distributed and can be used with linear models • When proportions are equal, logits are 0; when q (success or preferred) is larger than p, logits are positive; when q is smaller, logits are negative. • Expected values of logits from linear models can be back-transformed to get expected proportions (probabilities) of two different levels of response.

Let’s re-examine the mouse data, using R • The generalized linear model means apply a response that can be transformed to a linear model. The response, “life” cannot be used with a linear model, but the logit for this response can. > treat<-factor(c(rep("BA",57),rep("B",54))) > life<-factor(c(rep("Dead",13),rep("Alive",44),rep("Dead",25),rep("Alive",29))) > > # Use Generalized linear model > > glm.mouse<-glm(life~treat,family=binomial(link='logit')) > treat [1] BA BA BA BA BA BA BA BA BA BA BA BA BA BA BA BA BA BA BA BA BA BA [23] BA BA BA BA BA BA BA BA BA BA BA BA BA BA BA BA BA BA BA BA BA BA [45] BA BA BA BA BA BA BA BA BA BA BA BA BA B B B B B B B B B [67] B B B B B B B B B B B B B B B B B B B B B B [89] B B B B B B B B B B B B B B B B B B B B B B [111] B Levels: B BA > life [1] Dead Dead Dead Dead Dead Dead Dead Dead Dead Dead Dead [12] Dead Dead Alive Alive Alive Alive Alive Alive Alive Alive Alive [23] Alive Alive Alive Alive Alive Alive Alive Alive Alive Alive Alive [34] Alive Alive Alive Alive Alive Alive Alive Alive Alive Alive Alive [45] Alive Alive Alive Alive Alive Alive Alive Alive Alive Alive Alive [56] Alive Alive Dead Dead Dead Dead Dead Dead Dead Dead Dead [67] Dead Dead Dead Dead Dead Dead Dead Dead Dead Dead Dead [78] Dead Dead Dead Dead Dead Alive Alive Alive Alive Alive Alive [89] Alive Alive Alive Alive Alive Alive Alive Alive Alive Alive Alive [100] Alive Alive Alive Alive Alive Alive Alive Alive Alive Alive Alive [111] Alive Levels: Alive Dead

Let’s re-examine the mouse data, using R > summary(glm.mouse) Call: glm(formula = life ~ treat, family = binomial(link = "logit")) Deviance Residuals: Min 1Q Median 3Q Max -1.1151 -1.1151 -0.7195 1.2410 1.7194 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -0.1484 0.2729 -0.544 0.5866 treatBA -1.0708 0.4173 -2.566 0.0103 * --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 142.65 on 110 degrees of freedom Residual deviance: 135.77 on 109 degrees of freedom AIC: 139.77 Number of Fisher Scoring iterations: 4

Let’s re-examine the mouse data, using R > predict(glm.mouse) 1 2 3 4 5 6 7 8 9 10 -1.21924 -1.21924 -1.21924 -1.21924 -1.21924 -1.21924 -1.21924 -1.21924 -1.21924 -1.21924 11 12 13 14 15 16 17 18 19 20 -1.21924 -1.21924 -1.21924 -1.21924 -1.21924 -1.21924 -1.21924 -1.21924 -1.21924 -1.21924 21 22 23 24 25 26 27 28 29 30 -1.21924 -1.21924 -1.21924 -1.21924 -1.21924 -1.21924 -1.21924 -1.21924 -1.21924 -1.21924 31 32 33 34 35 36 37 38 39 40 -1.21924 -1.21924 -1.21924 -1.21924 -1.21924 -1.21924 -1.21924 -1.21924 -1.21924 -1.21924 41 42 43 44 45 46 47 48 49 50 -1.21924 -1.21924 -1.21924 -1.21924 -1.21924 -1.21924 -1.21924 -1.21924 -1.21924 -1.21924 51 52 53 54 55 56 57 58 59 60 -1.21924 -1.21924 -1.21924 -1.21924 -1.21924 -1.21924 -1.21924 -0.14842 -0.14842 -0.14842 61 62 63 64 65 66 67 68 69 70 -0.14842 -0.14842 -0.14842 -0.14842 -0.14842 -0.14842 -0.14842 -0.14842 -0.14842 -0.14842 71 72 73 74 75 76 77 78 79 80 -0.14842 -0.14842 -0.14842 -0.14842 -0.14842 -0.14842 -0.14842 -0.14842 -0.14842 -0.14842 81 82 83 84 85 86 87 88 89 90 -0.14842 -0.14842 -0.14842 -0.14842 -0.14842 -0.14842 -0.14842 -0.14842 -0.14842 -0.14842 91 92 93 94 95 96 97 98 99 100 -0.14842 -0.14842 -0.14842 -0.14842 -0.14842 -0.14842 -0.14842 -0.14842 -0.14842 -0.14842 101 102 103 104 105 106 107 108 109 110 -0.14842 -0.14842 -0.14842 -0.14842 -0.14842 -0.14842 -0.14842 -0.14842 -0.14842 -0.14842 111 -0.14842 > logit1<-predict(glm.mouse)[1];logit2<-predict(glm.mouse)[111] > > log.odds.ratio<-logit2-logit1 > odds.ratio<-exp(log.odds.ratio) > odds.ratio 111 2.917772

Let’s re-examine the mouse data, using R • Recall from the Goodness of Fit, G test, the unadjusted G = 6.979 > glm.null<-glm(life~1,family=binomial(link='logit')) > > AIC(glm.null,glm.mouse) df AIC glm.null 1 144.6521 glm.mouse 2 139.7738 > > logLik(glm.mouse) 'log Lik.' -67.8869 (df=2) > logLik(glm.null) 'log Lik.' -71.32604 (df=1) > LRT<-2*(logLik(glm.mouse)-logLik(glm.null)); p.value<-1-pchisq(LRT[1],1) > LRT [1] 6.878277 > p.value [1] 0.008724963

Modeling logits with linear models is typically called logistic regression • More specifically, it is a generalized linear model with logit link function • Logits can also be modeled with continuous variables • Logistic models can be compared with AIC • Logistic models can be subjected to stepwise procedures • There are various methods for parameter estimation, but least-squares estimation is not one of them. Usually parameters are estimated with maximum likelihood or restricted maximum likelihood. • One method that is popular with ecological niche modeling (species presence or absence based on environmental variables) is maximum entropy, which allows for over-fitted models.

We might come back to the generalized linear model later • For now, this is all you need to know • If a transformation of a variable allows it to be modeled with a linear model • Then the generalized linear model is one that is linked to the original variable by the inverse of the function • This seems subtle, but it is more complex than it seems. The reason is that instead of ascertaining the likelihood of the model by the error it produces, the analysis has to ask how should the parameters of the model be adjusted to produce the expected error. Thus the name “maximum likelihood”: it is the optimization of a function to estimate parameters that maximizes the likelihood to produce the error. • Simply put, maximum likelihood procedures iteratively reweight parameter estimates until no better solution produces more ideal error.

BIOL 582

BIOL 582

Presentation Transcript

BIOL 3340

BIOL 3340

Mgmt 582

BIOL 582

BIOL 582

BIOL 582

BIOL 582

BIOL 582

BIOL 582

BIOL 582

BIOL 582

BIOL 582

BIOL 582

BIOL 582

BIOL 582

BIOL 582

BIOL 582

BIOL 582

BIOL 582

BIOL 582

582

BIOL 3340