BIOL 582

1. BIOL 582 Lecture Set 18: Analysis of frequency and categorical data, Part III: Tests of Independence (cont.) • Odds Ratios • Loglinear Models • Logistic Models

2. Sokal and Rohlf (2011) describe and recommend the following • We have considered examples for Models I and II. Model III lends itself well to Fisher’s Exact Test, but this test or any other test of independence can be done on any Model type. The important thing to remember is that all of these tests tend to have inflated type I error rates and are less robust with small samples, so correction factors are often used • Fisher’s Exact Test is often used with smaller sample sizes • But it should really only be used if both criteria of the table (row and column totals) are fixed

3. Sokal and Rohlf (2011) also describe and recommend the following • A good example for illustrating the utility of this test is Dr. Bristol’s clairvoyance. This is stolen right from Wikipedia: • Fisher is said to have devised the test following a comment from Dr. Muriel Bristol, who claimed to be able to detect whether the tea or the milk was added first to her cup…. So in Fisher's original example, one criterion of classification could be whether milk or tea was put in the cup first; the other could be whether Dr Bristol thinks that the milk or tea was put in first. We want to know whether these two classifications are associated – that is, whether Dr Bristol really can tell whether milk or tea was poured in first. Most uses of the Fisher test involve, like this example, a 2 × 2 contingency table. The p-value from the test is computed as if the margins of the table are fixed, i.e. as if, in the tea-tasting example, Dr Bristol knows the number of cups with each treatment (milk or tea first) and will therefore provide guesses with the correct number in each category. As pointed out by Fisher, this leads under a null hypothesis of independence to a hypergeometric distribution of the numbers in the cells of the table.

4. Example Model III (Box 17.7, Sokal and Rohlf 2011) • Acacia trees are cleared in an area of Central America, except for 28 lucky bushes, which are fumigated. 15 are species A; 13 are species B • 16 ant colonies are released into the experimental zone (each colony can infect exactly one tree) • Thus, 12 trees will be uninfected • Since the number of trees of each species is fixed and the numbers of infected and uninfected trees are fixed, this is Model III • H0: Tree infection is independent of species • Here is the result • What is the probability of this result if the null hypothesis is true? (I.e., what is the probability it could happen by chance?)

5. Example Model III (Box 17.7, Sokal and Rohlf 2011) • First, realize that one must find the probabilities of all outcomes that are at least as rare as the observed case, which is this • And this • And this

6. Example Model III (Box 17.7, Sokal and Rohlf 2011) • The probability of each event using the hypergeometric distribution is

7. Example Model III (Box 17.7, Sokal and Rohlf 2011) • Which for each event is
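
The Acacia calculation can be sketched in a few lines of Python. With both margins fixed (15 trees of species A, 13 of species B, 16 infected, 12 uninfected), the table is determined by a single cell, say a = the number of infected species A trees, and its hypergeometric probability is P(a) = C(15, a) · C(13, 16 − a) / C(28, 16). The margins come from the example above; the observed value a = 13 used here is an assumption made for illustration, and the function and argument names are hypothetical. The two-tailed rule (sum every table no more probable than the observed one) is one common convention, not necessarily the exact procedure of Box 17.7.

```python
from math import comb


def fisher_exact_2x2(n_a, n_b, n_infected, observed_a_infected):
    """Two-tailed Fisher's exact test for a 2 x 2 table with both margins fixed.

    n_a, n_b            : fixed numbers of trees of species A and species B
    n_infected          : fixed number of infected trees
    observed_a_infected : infected species A trees in the observed table
    """
    total = n_a + n_b
    lowest = max(0, n_infected - n_b)   # fewest species A trees that can be infected
    highest = min(n_a, n_infected)      # most species A trees that can be infected
    denom = comb(total, n_infected)     # number of equally likely infection patterns

    # Number of infection patterns that produce each possible table.
    ways = {a: comb(n_a, a) * comb(n_b, n_infected - a)
            for a in range(lowest, highest + 1)}
    if observed_a_infected not in ways:
        raise ValueError("observed cell is impossible for these margins")

    probs = {a: w / denom for a, w in ways.items()}

    # Sum every table that is no more probable than the observed one.
    # Comparing integer counts avoids floating-point ties.
    observed_ways = ways[observed_a_infected]
    p_value = sum(w for w in ways.values() if w <= observed_ways) / denom
    return probs, p_value


# Margins from the example; the observed cell (13) is an assumed value.
probs, p_value = fisher_exact_2x2(n_a=15, n_b=13, n_infected=16,
                                  observed_a_infected=13)
print(f"P(observed table) = {probs[13]:.6f}")
print(f"two-tailed p      = {p_value:.6f}")
```

Changing `observed_a_infected` shows how quickly the probability falls as the table moves away from the counts expected under independence.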

8. An R x C Test of Independence means that one factor is described by rows and one factor is described by columns • An R x C test is a test of a two-way table, even if there are more than two rows and more than two columns. The format does not change. One has these three options: • Calculate expected cell values (frequencies) as r*c/n, where r and c are the marginal totals of the row and column in which a cell lies and n is the grand total, and calculate a Chi-square statistic • Calculate f log f for every cell and every marginal total, plus n log n, then combine them to calculate the G statistic • Calculate the exact probability of the table outcome given a known distribution of outcomes (when factor sums [margins] are fixed). This is the Fisher’s exact test we just saw • These are all slightly different twists on the same theme • We might try a more complicated two-way table in R, but the concept is the same
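
The first two options can be sketched as follows. This is a minimal illustration in Python, not the course's R workflow: the table is hypothetical, the function names are invented for the example, and the f ln f combination used for G (cells minus row totals minus column totals plus n ln n, all doubled) is the standard textbook form, assumed here rather than taken from the slide.

```python
import math


def margins(table):
    """Return row totals, column totals, and the grand total of a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    return row_totals, col_totals, sum(row_totals)


def expected_counts(table):
    """Expected frequency of each cell under independence: r * c / n."""
    row_totals, col_totals, n = margins(table)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_square(table):
    """Pearson chi-square statistic: sum of (observed - expected)^2 / expected."""
    expected = expected_counts(table)
    return sum((f - e) ** 2 / e
               for row, erow in zip(table, expected)
               for f, e in zip(row, erow))


def f_ln_f(f):
    """f * ln(f), with the usual convention that 0 * ln(0) = 0."""
    return f * math.log(f) if f > 0 else 0.0


def g_statistic(table):
    """G from f ln f sums over cells, row totals, column totals, and n."""
    row_totals, col_totals, n = margins(table)
    cells = sum(f_ln_f(f) for row in table for f in row)
    rows = sum(f_ln_f(r) for r in row_totals)
    cols = sum(f_ln_f(c) for c in col_totals)
    return 2 * (cells - rows - cols + f_ln_f(n))


table = [[10, 20], [30, 40]]               # hypothetical counts
df = (len(table) - 1) * (len(table[0]) - 1)  # degrees of freedom: (R - 1)(C - 1)
print(f"chi-square = {chi_square(table):.4f}, G = {g_statistic(table):.4f}, df = {df}")
```

Both statistics are compared against a chi-square distribution with (R − 1)(C − 1) degrees of freedom, and the same functions work unchanged for tables with more rows or columns.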

9. A three-way table is much more complicated. We will not go into its complexities. Consult other sources if you must understand all the details • Here is an example three-way table

10. A three-way table is much more complicated. We will not go into its complexities. Consult other sources if you must understand all the details • It might be obvious that testing such a table would be an iterative process

11. A three-way table is much more complicated. We will not go into its complexities. Consult other sources if you must understand all the details • Multi-way tables are made easier to analyze with loglinear models • For example, a two-way table can be expressed by the following model: $\ln \hat{f}_{ij} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij}$, where $\hat{f}_{ij}$ is the expected frequency in row $i$ and column $j$, $\mu$ is the mean of the logarithms of the expected frequencies, $\alpha_i$ is the effect of category $i$ of factor A, $\beta_j$ is the effect of category $j$ of factor B, and $(\alpha\beta)_{ij}$ is the dependence of category $i$ of factor A on category $j$ of factor B • Unlike factorial ANOVA, only the interaction is important: it expresses the dependence of the factors on one another, so the null hypothesis of independence assumes it is 0. A G test is thus a likelihood ratio test between this “full” model and a “reduced” model that lacks the interaction.
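
The two-way loglinear decomposition (a mean of the log expected frequencies, an effect for each category of each factor, and an interaction term) can be illustrated numerically. The sketch below assumes sum-to-zero effects, a hypothetical 2 x 2 table with no empty cells, and invented function names. It shows that the interaction terms vanish for frequencies fitted under independence (the reduced model) but not for the observed frequencies (which the full model reproduces exactly), and that G compares the two fits.

```python
import math


def loglinear_terms(freqs):
    """Split ln(f_ij) into mean + row effect + column effect + interaction.

    Effects are centred (sum-to-zero constraints); all frequencies must be > 0.
    """
    logs = [[math.log(f) for f in row] for row in freqs]
    n_rows, n_cols = len(logs), len(logs[0])
    mu = sum(map(sum, logs)) / (n_rows * n_cols)            # mean of log frequencies
    alpha = [sum(row) / n_cols - mu for row in logs]        # effect of each row category
    beta = [sum(col) / n_rows - mu for col in zip(*logs)]   # effect of each column category
    interaction = [[logs[i][j] - mu - alpha[i] - beta[j]    # what the main effects miss
                    for j in range(n_cols)]
                   for i in range(n_rows)]
    return mu, alpha, beta, interaction


def independence_fit(table):
    """Expected frequencies of the reduced (no-interaction) model: r * c / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


observed = [[10, 20], [30, 40]]        # hypothetical counts
reduced = independence_fit(observed)

_, alpha, beta, full_interaction = loglinear_terms(observed)
_, _, _, reduced_interaction = loglinear_terms(reduced)

# Likelihood ratio statistic: full model (fits observed exactly) vs reduced model.
g = 2 * sum(f * math.log(f / e)
            for row, erow in zip(observed, reduced)
            for f, e in zip(row, erow))

print("interaction, full model:   ", full_interaction)
print("interaction, reduced model:", reduced_interaction)
print(f"G = {g:.4f}")
```

For a 2 x 2 table the interaction terms are all equal in magnitude to one quarter of the log odds ratio, which ties this model to the odds ratios discussed later in the lecture.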

12. A three-way table is much more complicated. We will not go into its complexities. Consult other sources if you must understand all the details • A three-way table can be expressed by the following model: $\ln \hat{f}_{ijk} = \mu + \alpha_i + \beta_j + \gamma_k + (\alpha\beta)_{ij} + (\alpha\gamma)_{ik} + (\beta\gamma)_{jk} + (\alpha\beta\gamma)_{ijk}$, with a main effect for each of the three factors, three two-way interactions, and one three-way interaction • We will leave it as sufficient that model LRTs can be performed with different interactions removed to test specific dependencies, given other dependencies. Sometimes the three-way interaction is removed because it is deemed cumbersome. • Check out how complicated these analyses can become at Quick R

13. Now we turn our attention to response data that can have one of two outcomes • Expressed/unexpressed • Dead/alive • Female/Male • Success/failure • Present/absent • Often these types of responses are expressed as proportions, for the obvious reason that results can be generalized to smaller or larger groups. • For example, the inoculated mouse data can be summarized as

14. Now we turn our attention to response data that can have one of two outcomes • For example, the inoculated mouse data can be summarized as • Proportions can be expressed as odds ratios • For this example, the odds ratio is approximately 3, which means that the odds of mice surviving with antiserum are approximately three times the odds of surviving without it.
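
A short sketch of how an odds ratio is obtained from a 2 x 2 table of outcomes. The counts below are assumed for illustration (they are not taken from the text above) and were chosen so that the result lands near the value of about 3 quoted for the mouse example; the function names are hypothetical.

```python
def odds(successes, failures):
    """Odds of success: successes per failure (equivalently q / p)."""
    return successes / failures


def odds_ratio(successes_1, failures_1, successes_2, failures_2):
    """Ratio of the odds of success in group 1 to the odds in group 2."""
    return odds(successes_1, failures_1) / odds(successes_2, failures_2)


# Assumed counts for illustration only.
alive_with, dead_with = 44, 13          # mice given antiserum
alive_without, dead_without = 29, 25    # mice given no antiserum

prop_with = alive_with / (alive_with + dead_with)
prop_without = alive_without / (alive_without + dead_without)
omega = odds_ratio(alive_with, dead_with, alive_without, dead_without)

print(f"proportion surviving: {prop_with:.3f} with, {prop_without:.3f} without")
print(f"odds ratio = {omega:.3f}")
```

Note that the odds ratio is not the ratio of the two survival proportions: it compares survivors per death in each group, which is why it can be far from 1 even when the proportions look fairly close.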

15. Odds ratios can be conveniently decomposed when expressed as a logarithm: the log of an odds ratio is the difference between two log odds, or logits, $\ln(q/p)$. • The utility of doing this is not readily apparent. An odds ratio is convenient; a difference in logits is less so. • So why do this? • Logits are approximately normally distributed and can be used with linear models • When the two proportions are equal (p = q), the logit is 0; when q (success, or the preferred outcome) is larger than p, the logit is positive; when q is smaller, the logit is negative. • Expected values of logits from linear models can be back-transformed to get expected proportions (probabilities) of the two levels of the response.
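
The logit and its back-transformation can be sketched in a few lines. The proportions used here are hypothetical, and `logit` and `inv_logit` are illustrative helper names, not functions from any package the course uses.

```python
import math


def logit(q):
    """Log odds of a proportion q, i.e. ln(q / p) with p = 1 - q."""
    return math.log(q / (1 - q))


def inv_logit(x):
    """Back-transform a logit to a proportion between 0 and 1."""
    return 1 / (1 + math.exp(-x))


# Hypothetical success proportions in two groups.
q1, q2 = 0.77, 0.54

# The log odds ratio is the difference between the two logits.
log_odds_ratio = logit(q1) - logit(q2)
odds_ratio = math.exp(log_odds_ratio)

print(f"logit(0.5) = {logit(0.5)}")   # equal proportions give a logit of 0
print(f"logits: {logit(q1):.3f} and {logit(q2):.3f}")
print(f"difference = {log_odds_ratio:.3f}, odds ratio = {odds_ratio:.3f}")
print(f"back-transformed: {inv_logit(logit(q1)):.2f}")
```

A linear model fitted on the logit scale returns expected logits, and `inv_logit` is the step that turns those back into expected proportions.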