Stats 330: Lecture 27

Stats 330: Lecture 27 Contingency tables

Plan of the day In today’s lecture we apply Poisson regression to the analysis of one and two-dimensional contingency tables. Topics • Contingency tables • Sampling models • Equivalence of Poisson and multinomial • Correspondence between interactions and independence

Contingency tables • Contingency tables arise when we classify a number of individuals into categories using one or more criteria. • Result is a cross-tabulation or contingency table • 1 criterion gives a 1-dimensional table • 2 criteria give a 2 dimensional table • … and so on.

Example: one dimension Income distribution of New Zealanders 15+ 2006 census

Example 2: 2 dimensions New Zealanders 15+ by annual income and sex: 2006 census

Censuses and samples • Sometimes tables come from a census • Sometimes the table comes from a random sample from a population • In this case we often want to draw inferences about the population on the basis of the sample e.g.is income independent of sex?

Example: death by falling The following is a random sample of 16,976 persons who met their deaths in fatal falls, selected from a larger population of deaths from this cause. The falls classified by month are

The question • Question: if we regard this as a random sample from several years, are the months in which death occurs equally likely? • Or is one month more likely than the others? • To answer this, we use the idea of maximum likelihood. First, we must detour into sampling theory in order to work out the likelihood

Sampling models • There are two common models used for contingency tables • The multinomial sampling model assumes that a fixed number of individuals from a random sample are classified with fixed probabilities of being assigned to the different “cells”. • The Poisson sampling model assumes that the table entries have independent Poisson distributions

Multinomial sampling

Multinomial model • Suppose a table has M “cells” • n individuals classified independently (A.k.a. sampling with replacement) • Each individual has probability pi of being in cell i • Every individual is classified into exactly 1 cell, so p1 + p2 + . . . + pM = 1 • Let Yi be the number in the ith cell, so Y1 + Y2 + . . . YM = n

Multinomial model (cont) Y1, . . . ,Ym have a multinomial distribution This is the maximal model, as in logistic regression, making no assumptions about the probabilities.

Log-likelihood

MLE’s • If there are no restrictions on the p’s (except that they add to one) the log-likelihood is maximised when pi = yi/n • These are the MLE’s for the maximal model • Substitute these into the log-likelihood to get the maximum value of the maximal log-likelihood. Call this Log Lmax

Deviance • Suppose we have some model for the probabilities: this will specify the form of the probabilities, perhaps as functions of other parameters. In our example, the model says that each probability is 1/12. • Let log Lmod be the maximum value of the log-likelihood, when the probabilities are given by the model. • As for logistic regression, we define the deviance as D = 2log Lmax - 2 log Lmod

Deviance:Testing adequacy of the model • If n is large (all cells more than 5) and M is small, and if the model is true, the deviance will have (approximately) a chi-square distribution with M-k-1 df, where k is the number of unknown parameters in the model. • Thus, we accept the model if the deviance p-value is more than 0.05. • This is almost the same as the situation in logistic regression with strongly grouped data. • These tests are an alternative form of the chi-square tests used in stage 2.

Example: death by falling • Are deaths equally likely in all months? In statistical terms, is pi=1/12, i=1,2,…,12? • Log Lmax is, up to a constant, Calculated in R by >y<-c(1688,1407,1370,1309,1341,1388, 1406,1446,1322,1363, 1410,1526) > sum(y*log(y/sum(y))) [1] -42143.23

Example: death by falling Our model is pi=1/12, i=1,2,…,12. (all months equally likely). This completely specifies the probabilities, so k=0, M-k-1=11. Log Lmod is, up to a constant, > sum(y*log(1/12)) [1] -42183.78 # now calculate deviance > D<-2*sum(y*log(y/sum(y)))-2*sum(y*log(1/12)) > D [1] 81.09515 > 1-pchisq(D,11) [1] 9.05831e-13Model is not plausible!

> Months<-c("Jan","Feb","Mar","Apr","May","Jun", "Jul","Aug","Sep","Oct","Nov","Dec") > barplot(y/(sum(y),names.arg=Months)

Poisson model • Assume that each cell is Poisson with a different mean: this is just a Poisson regression model, with a single explanatory variable “month” • This is the maximal model • The null model is the model with all means equal • Null model deviance will give us a test that all months have the same mean

R-code for Poisson regression > months<-c("Jan","Feb","Mar","Apr","May", "Jun","Jul","Aug","Sep","Oct","Nov","Dec") • > months.fac<-factor(months,levels=months) > falls.glm<-glm(y~months.fac,family=poisson) > summary(falls.glm)

Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 7.43130 0.02434 305.317 < 2e-16 *** months.facFeb -0.18208 0.03610 -5.044 4.56e-07 *** months.facMar -0.20873 0.03636 -5.740 9.46e-09 *** months.facApr -0.25428 0.03683 -6.904 5.04e-12 *** months.facMay -0.23013 0.03658 -6.291 3.15e-10 *** months.facJun -0.19568 0.03623 -5.401 6.64e-08 *** months.facJul -0.18280 0.03611 -5.063 4.13e-07 *** months.facAug -0.15474 0.03583 -4.318 1.57e-05 *** months.facSep -0.24440 0.03673 -6.655 2.84e-11 *** months.facOct -0.21386 0.03642 -5.873 4.29e-09 *** months.facNov -0.17995 0.03608 -4.988 6.10e-07 *** months.facDec -0.10089 0.03532 -2.856 0.00429 ** Null deviance: 8.1095e+01 on 11 degrees of freedom Residual deviance: -7.5051e-14 on 0 degrees of freedom > D [1] 81.09515 Same as before! Why????

General principle: • To every Poisson model for the cell counts, there corresponds a multinomial model, obtained by conditioning on the table total. • Suppose the Poisson means are m1, … mM .... The cell probabilities in the multinomial sampling model are related to the means in the Poisson sampling model by the relationship pi = mi/(m1 + … + mM) • We can estimate the parameters in the multinomial model by fitting the Poisson model and using the relationship above. • We can test hypotheses about the multinomial model by testing the equivalent hypothesis in the Poisson regression model.

Example: death by falling • Poisson means are

Relationship between Poisson means and multinomial probs

Relationship between Poisson parameters and multinomial probs – alternative form

Testing all months equal • Clearly, all months hav equal probabilities if and only if all the deltas are zero. • Thus, testing for equal months in the multinomial model is the same as testing for deltas all zero in the Poisson model. • This is done using the null model deviance, or, equivalently, the anova table. Recall: The null deviance was 8.1095e+01 on 11 degrees of freedom, with a p-value of 9.05831e-13, so months not equally likely

2x2 tables • Relationship between Snoring and Nightmares: • Data from a random sample, classified according to these 2 factors

Parametrising the 2x2 table of Poisson means Main effect of Nightmare Main effect of Snoring Snoring/nightmare interaction

Parameterising the 2x2 table of probabilities Marginal distn of Snorer Marginal distn of Nightmare

Relationship between multinomial probabilities and Poisson parameters

Independence • Events A and B are independent if the conditional probability that A occurs, given B occurs, is the same as the unconditional probability of A occuring. i.e. P(A|B)=P(A). • Since P(A|B)=P(A and B)/P(B), this is equivalent to P(A and B) = P(A) P(B)

Independence in a 2 x2 table • Independence of nightmares and snoring means P(Snoring and frequent nightmares)= P(Snoring) x P(frequent nightmares) (plus similar equations for the other 3 cells) Or, p1 =(p1 + p2) (p1 + p3) In fact, this equation implies the other 3 for a 2 x 2 table

Independence in a 2x2 table (cont) Three equivalent conditions for independence in the 2 x 2 table • p1 =(p1 + p2) (p1 + p3) • p1 p4 =p2 p3 • g = 0 (ie zero interaction in Poisson model) Thus, we can test independence in the multinomial model by testing for zero interaction in the Poisson regression.

Math stuff

More math stuff

Odds ratio • Prob of not being a snorer for “frequent nightmares” population is p1/(p1 + p3) • Prob of being a snorer for “frequent nightmares” population is p3/(p1 + p3) • Odds of not being a snorer for “frequent nightmares” population is (p1/(p1 + p3)) /(p3/(p1 + p3)) = p1/p3 • Similarly, the odds of not being a snorer for “occasional nightmares” population isp2/p4

Odds ratio (2) • Odds ratio (OR) is the ratio of these 2 odds. Thus OR = (p1/p3) / (p2/p4) = (p1p4)/(p2p3) • OR = exp(g), log(OR) = g • OR =1 (i.e. log (OR) =0 ) if and only if snoring and nightmares are independent • Acts like a “correlation coefficient” between factors, with 1 corresponding to no relationship • Interchanging rows (or columns) changes OR to 1/OR

Odds ratio (3) • Get a CI for the OR by • Getting a CI for g • “Exponentiating” • CI for g is estimate +/- 1.96 standard error • Get estimate, standard error from Poisson regression summary

Example: snoring > y<-c(11,82,12,74) > nightmares<-factor(rep(c("F","O"),2)) > snore<-factor(rep(c("N","Y"),c(2,2))) > snoring.glm<-glm(y~nightmares*snore,family=poisson) > anova(snoring.glm, test="Chisq") Analysis of Deviance Table Model: poisson, link: log Response: y Df Deviance Resid. Df Resid. Dev P(>|Chi|) NULL 3 111.304 nightmares 1 110.850 2 0.454 <2e-16 *** snore 1 0.274 1 0.180 0.6008 nightmares:snore 1 0.180 0 0.000 0.6713 No evidence of association between nightmares and snoring.

Odds ratio Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 2.39790 0.30151 7.953 1.82e-15 *** nightmaresO 2.00882 0.32110 6.256 3.95e-10 *** snoreY 0.08701 0.41742 0.208 0.835 nightmaresO:snoreY -0.18967 0.44716 -0.424 0.671 Estimate of g is -0.18967, standard error is 0.44716, so CI for g is -0.18967 +/- 1.96 * 0.44716, or (-1.066 , 0.6868,) CI for OR = exp(g) is (exp(-1.066), exp(0.6868) ) = (0.3443, 1.987)

The general I x J table • Example: In the 1996 general Social Survey, the National center for Opinion Research collected data on the following variables from 2726 respondents: • Education: Less than high school, High school, Bachelors or graduate • Religious belief: Fundamentalist, moderate or liberal. • Cross-classification is

Education and Religious belief

Testing independence in the general 2-dimensional table • We have a two-dimensional table, with the rows corresponding to Education, having I=3 levels, and the columns corresponding to Religious Belief, having J=3 levels. • Under the Poisson sampling model, let mij be the mean count for the i,j cell • Split log mij into overall level, main effects, interactions as usual

Testing independence in the general 2-dimensional table (ii) • Under the corresponding multinomial sampling model, let pij be the probability that a randomly chosen individual has level i of Education and level j of Religious belief, and so is classified into the i,j cell of the table. • Then, as before

Testing independence (iii) • Let pi+ = pi1 + … + piJ be the marginal probabilities: the probability that Education=i. • Let p+j be the same thing for Religious belief • The factors are independent if pij = pi+p+j • This is equivalent to (ab)ij = 0 for all i,j.

Testing independence (iv) Now we can state the principle: We can test independence in the multinomial sampling model by fitting a Poisson regression with interacting factors, and testing for zero interaction.

Doing it in R counts = c(178, 570, 138, 138, 648, 252, 108,442, 252) example.df = data.frame(y = counts, expand.grid(education = c("LessThanHighSchool", "HighSchool", "Bachelor"), religion = c("Fund", "Mod", "Lib"))) levels(example.df$education) = c("LessThanHighSchool", "HighSchool", "Bachelor") levels(example.df$religion) = c("Fund", "Mod", "Lib") example.glm = glm(y~education*religion, family=poisson, data=example.df) anova(example.glm, test="Chisq")

The data > example.df y education religion 1 178 LessThanHighSchool Fund 2 570 HighSchool Fund 3 138 Bachelor Fund 4 138 LessThanHighSchool Mod 5 648 HighSchool Mod 6 252 Bachelor Mod 7 108 LessThanHighSchool Lib 8 442 HighSchool Lib 9 252 Bachelor Lib

The analysis Degrees of freedom > example.glm = glm(y~education*religion, family=poisson, data=example.df) > anova(example.glm, test="Chisq") Analysis of Deviance Table Model: poisson, link: log Response: y Terms added sequentially (first to last) Df Deviance Resid. Df Resid. Dev P(>|Chi|) NULL 8 1009.20 education 2 908.18 6 101.02 6.179e-198 religion 2 31.20 4 69.81 1.675e-07 education:religion 4 69.81 0 -2.047e-13 2.488e-14 Chisq P-value Small p-value is evidence against independence

Stats 330: Lecture 27