 Download Download Presentation Stats 330: Lecture 27

# Stats 330: Lecture 27

Télécharger la présentation ## Stats 330: Lecture 27

- - - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - - -
##### Presentation Transcript

1. Stats 330: Lecture 27 Contingency tables

2. Plan of the day In today’s lecture we apply Poisson regression to the analysis of one and two-dimensional contingency tables. Topics • Contingency tables • Sampling models • Equivalence of Poisson and multinomial • Correspondence between interactions and independence

3. Contingency tables • Contingency tables arise when we classify a number of individuals into categories using one or more criteria. • Result is a cross-tabulation or contingency table • 1 criterion gives a 1-dimensional table • 2 criteria give a 2 dimensional table • … and so on.

4. Example: one dimension Income distribution of New Zealanders 15+ 2006 census

5. Example 2: 2 dimensions New Zealanders 15+ by annual income and sex: 2006 census

6. Censuses and samples • Sometimes tables come from a census • Sometimes the table comes from a random sample from a population • In this case we often want to draw inferences about the population on the basis of the sample e.g.is income independent of sex?

7. Example: death by falling The following is a random sample of 16,976 persons who met their deaths in fatal falls, selected from a larger population of deaths from this cause. The falls classified by month are

8. The question • Question: if we regard this as a random sample from several years, are the months in which death occurs equally likely? • Or is one month more likely than the others? • To answer this, we use the idea of maximum likelihood. First, we must detour into sampling theory in order to work out the likelihood

9. Sampling models • There are two common models used for contingency tables • The multinomial sampling model assumes that a fixed number of individuals from a random sample are classified with fixed probabilities of being assigned to the different “cells”. • The Poisson sampling model assumes that the table entries have independent Poisson distributions

10. Multinomial sampling

11. Multinomial model • Suppose a table has M “cells” • n individuals classified independently (A.k.a. sampling with replacement) • Each individual has probability pi of being in cell i • Every individual is classified into exactly 1 cell, so p1 + p2 + . . . + pM = 1 • Let Yi be the number in the ith cell, so Y1 + Y2 + . . . YM = n

12. Multinomial model (cont) Y1, . . . ,Ym have a multinomial distribution This is the maximal model, as in logistic regression, making no assumptions about the probabilities.

13. Log-likelihood

14. MLE’s • If there are no restrictions on the p’s (except that they add to one) the log-likelihood is maximised when pi = yi/n • These are the MLE’s for the maximal model • Substitute these into the log-likelihood to get the maximum value of the maximal log-likelihood. Call this Log Lmax

15. Deviance • Suppose we have some model for the probabilities: this will specify the form of the probabilities, perhaps as functions of other parameters. In our example, the model says that each probability is 1/12. • Let log Lmod be the maximum value of the log-likelihood, when the probabilities are given by the model. • As for logistic regression, we define the deviance as D = 2log Lmax - 2 log Lmod

16. Deviance:Testing adequacy of the model • If n is large (all cells more than 5) and M is small, and if the model is true, the deviance will have (approximately) a chi-square distribution with M-k-1 df, where k is the number of unknown parameters in the model. • Thus, we accept the model if the deviance p-value is more than 0.05. • This is almost the same as the situation in logistic regression with strongly grouped data. • These tests are an alternative form of the chi-square tests used in stage 2.

17. Example: death by falling • Are deaths equally likely in all months? In statistical terms, is pi=1/12, i=1,2,…,12? • Log Lmax is, up to a constant, Calculated in R by >y<-c(1688,1407,1370,1309,1341,1388, 1406,1446,1322,1363, 1410,1526) > sum(y*log(y/sum(y)))  -42143.23

18. Example: death by falling Our model is pi=1/12, i=1,2,…,12. (all months equally likely). This completely specifies the probabilities, so k=0, M-k-1=11. Log Lmod is, up to a constant, > sum(y*log(1/12))  -42183.78 # now calculate deviance > D<-2*sum(y*log(y/sum(y)))-2*sum(y*log(1/12)) > D  81.09515 > 1-pchisq(D,11)  9.05831e-13Model is not plausible!

19. > Months<-c("Jan","Feb","Mar","Apr","May","Jun", "Jul","Aug","Sep","Oct","Nov","Dec") > barplot(y/(sum(y),names.arg=Months)

20. Poisson model • Assume that each cell is Poisson with a different mean: this is just a Poisson regression model, with a single explanatory variable “month” • This is the maximal model • The null model is the model with all means equal • Null model deviance will give us a test that all months have the same mean

21. R-code for Poisson regression > months<-c("Jan","Feb","Mar","Apr","May", "Jun","Jul","Aug","Sep","Oct","Nov","Dec") • > months.fac<-factor(months,levels=months) > falls.glm<-glm(y~months.fac,family=poisson) > summary(falls.glm)

22. Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 7.43130 0.02434 305.317 < 2e-16 *** months.facFeb -0.18208 0.03610 -5.044 4.56e-07 *** months.facMar -0.20873 0.03636 -5.740 9.46e-09 *** months.facApr -0.25428 0.03683 -6.904 5.04e-12 *** months.facMay -0.23013 0.03658 -6.291 3.15e-10 *** months.facJun -0.19568 0.03623 -5.401 6.64e-08 *** months.facJul -0.18280 0.03611 -5.063 4.13e-07 *** months.facAug -0.15474 0.03583 -4.318 1.57e-05 *** months.facSep -0.24440 0.03673 -6.655 2.84e-11 *** months.facOct -0.21386 0.03642 -5.873 4.29e-09 *** months.facNov -0.17995 0.03608 -4.988 6.10e-07 *** months.facDec -0.10089 0.03532 -2.856 0.00429 ** Null deviance: 8.1095e+01 on 11 degrees of freedom Residual deviance: -7.5051e-14 on 0 degrees of freedom > D  81.09515 Same as before! Why????

23. General principle: • To every Poisson model for the cell counts, there corresponds a multinomial model, obtained by conditioning on the table total. • Suppose the Poisson means are m1, … mM .... The cell probabilities in the multinomial sampling model are related to the means in the Poisson sampling model by the relationship pi = mi/(m1 + … + mM) • We can estimate the parameters in the multinomial model by fitting the Poisson model and using the relationship above. • We can test hypotheses about the multinomial model by testing the equivalent hypothesis in the Poisson regression model.

24. Example: death by falling • Poisson means are

25. Relationship between Poisson means and multinomial probs

26. Relationship between Poisson parameters and multinomial probs – alternative form

27. Testing all months equal • Clearly, all months hav equal probabilities if and only if all the deltas are zero. • Thus, testing for equal months in the multinomial model is the same as testing for deltas all zero in the Poisson model. • This is done using the null model deviance, or, equivalently, the anova table. Recall: The null deviance was 8.1095e+01 on 11 degrees of freedom, with a p-value of 9.05831e-13, so months not equally likely

28. 2x2 tables • Relationship between Snoring and Nightmares: • Data from a random sample, classified according to these 2 factors

29. Parametrising the 2x2 table of Poisson means Main effect of Nightmare Main effect of Snoring Snoring/nightmare interaction

30. Parameterising the 2x2 table of probabilities Marginal distn of Snorer Marginal distn of Nightmare

31. Independence • Events A and B are independent if the conditional probability that A occurs, given B occurs, is the same as the unconditional probability of A occuring. i.e. P(A|B)=P(A). • Since P(A|B)=P(A and B)/P(B), this is equivalent to P(A and B) = P(A) P(B)

32. Independence in a 2 x2 table • Independence of nightmares and snoring means P(Snoring and frequent nightmares)= P(Snoring) x P(frequent nightmares) (plus similar equations for the other 3 cells) Or, p1 =(p1 + p2) (p1 + p3) In fact, this equation implies the other 3 for a 2 x 2 table

33. Independence in a 2x2 table (cont) Three equivalent conditions for independence in the 2 x 2 table • p1 =(p1 + p2) (p1 + p3) • p1 p4 =p2 p3 • g = 0 (ie zero interaction in Poisson model) Thus, we can test independence in the multinomial model by testing for zero interaction in the Poisson regression.

34. Math stuff

35. More math stuff

36. Odds ratio • Prob of not being a snorer for “frequent nightmares” population is p1/(p1 + p3) • Prob of being a snorer for “frequent nightmares” population is p3/(p1 + p3) • Odds of not being a snorer for “frequent nightmares” population is (p1/(p1 + p3)) /(p3/(p1 + p3)) = p1/p3 • Similarly, the odds of not being a snorer for “occasional nightmares” population isp2/p4

37. Odds ratio (2) • Odds ratio (OR) is the ratio of these 2 odds. Thus OR = (p1/p3) / (p2/p4) = (p1p4)/(p2p3) • OR = exp(g), log(OR) = g • OR =1 (i.e. log (OR) =0 ) if and only if snoring and nightmares are independent • Acts like a “correlation coefficient” between factors, with 1 corresponding to no relationship • Interchanging rows (or columns) changes OR to 1/OR

38. Odds ratio (3) • Get a CI for the OR by • Getting a CI for g • “Exponentiating” • CI for g is estimate +/- 1.96 standard error • Get estimate, standard error from Poisson regression summary

39. Example: snoring > y<-c(11,82,12,74) > nightmares<-factor(rep(c("F","O"),2)) > snore<-factor(rep(c("N","Y"),c(2,2))) > snoring.glm<-glm(y~nightmares*snore,family=poisson) > anova(snoring.glm, test="Chisq") Analysis of Deviance Table Model: poisson, link: log Response: y Df Deviance Resid. Df Resid. Dev P(>|Chi|) NULL 3 111.304 nightmares 1 110.850 2 0.454 <2e-16 *** snore 1 0.274 1 0.180 0.6008 nightmares:snore 1 0.180 0 0.000 0.6713 No evidence of association between nightmares and snoring.

40. Odds ratio Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 2.39790 0.30151 7.953 1.82e-15 *** nightmaresO 2.00882 0.32110 6.256 3.95e-10 *** snoreY 0.08701 0.41742 0.208 0.835 nightmaresO:snoreY -0.18967 0.44716 -0.424 0.671 Estimate of g is -0.18967, standard error is 0.44716, so CI for g is -0.18967 +/- 1.96 * 0.44716, or (-1.066 , 0.6868,) CI for OR = exp(g) is (exp(-1.066), exp(0.6868) ) = (0.3443, 1.987)

41. The general I x J table • Example: In the 1996 general Social Survey, the National center for Opinion Research collected data on the following variables from 2726 respondents: • Education: Less than high school, High school, Bachelors or graduate • Religious belief: Fundamentalist, moderate or liberal. • Cross-classification is

42. Education and Religious belief

43. Testing independence in the general 2-dimensional table • We have a two-dimensional table, with the rows corresponding to Education, having I=3 levels, and the columns corresponding to Religious Belief, having J=3 levels. • Under the Poisson sampling model, let mij be the mean count for the i,j cell • Split log mij into overall level, main effects, interactions as usual

44. Testing independence in the general 2-dimensional table (ii) • Under the corresponding multinomial sampling model, let pij be the probability that a randomly chosen individual has level i of Education and level j of Religious belief, and so is classified into the i,j cell of the table. • Then, as before

45. Testing independence (iii) • Let pi+ = pi1 + … + piJ be the marginal probabilities: the probability that Education=i. • Let p+j be the same thing for Religious belief • The factors are independent if pij = pi+p+j • This is equivalent to (ab)ij = 0 for all i,j.

46. Testing independence (iv) Now we can state the principle: We can test independence in the multinomial sampling model by fitting a Poisson regression with interacting factors, and testing for zero interaction.

47. Doing it in R counts = c(178, 570, 138, 138, 648, 252, 108,442, 252) example.df = data.frame(y = counts, expand.grid(education = c("LessThanHighSchool", "HighSchool", "Bachelor"), religion = c("Fund", "Mod", "Lib"))) levels(example.df\$education) = c("LessThanHighSchool", "HighSchool", "Bachelor") levels(example.df\$religion) = c("Fund", "Mod", "Lib") example.glm = glm(y~education*religion, family=poisson, data=example.df) anova(example.glm, test="Chisq")

48. The data > example.df y education religion 1 178 LessThanHighSchool Fund 2 570 HighSchool Fund 3 138 Bachelor Fund 4 138 LessThanHighSchool Mod 5 648 HighSchool Mod 6 252 Bachelor Mod 7 108 LessThanHighSchool Lib 8 442 HighSchool Lib 9 252 Bachelor Lib

49. The analysis Degrees of freedom > example.glm = glm(y~education*religion, family=poisson, data=example.df) > anova(example.glm, test="Chisq") Analysis of Deviance Table Model: poisson, link: log Response: y Terms added sequentially (first to last) Df Deviance Resid. Df Resid. Dev P(>|Chi|) NULL 8 1009.20 education 2 908.18 6 101.02 6.179e-198 religion 2 31.20 4 69.81 1.675e-07 education:religion 4 69.81 0 -2.047e-13 2.488e-14 Chisq P-value Small p-value is evidence against independence