Stats 330: Lecture 29

Stats 330: Lecture 29 More on Contingency Tables

Plan of the day In today’s lecture we continue our discussion of contingency tables. Topics • Conditional and unconditional Odds ratios • The product multinomial sampling model • Testing goodness of fit

Odds ratios in three dimensional tables: Florida murders Conditional table for Victim = White OR = 23*265/(39*105) = 1.488400 log(OR) = 0.3977

Example: Florida murders glm(formula = counts ~ victim * defendant * dp, family = poisson, data = murder.df) Coefficients: Estimate (Intercept) 2.5649 victimw 0.5705 defendantw -2.5649 dpn 2.7081 victimw:defendantw 3.0930 victimw:dpn -1.1896 defendantw:dpn 0.2364 victimw:defendantw:dpn 0.1613 0.2364 + 0.1613 = 0.3977

Estimating odds ratios: 3 dimensional table In a 3-dimensional table, fitting the maximal model: • The log of the conditional odds ratio equals the corresponding interaction in the maximal model, plus the 3 factor interaction • The log of the conditional odds ratio equals the corresponding interaction in the homogeneous association model (3 f.i. is zero) • Also can calculate OR directly from the conditional tables, as on previous slide.

Example: Florida murders glm(formula = counts ~ victim * defendant * dp, family = poisson, data = murder.df) Coefficients: Estimate (Intercept) 2.5649 victimw 0.5705 defendantw -2.5649 dpn 2.7081 victimw:defendantw 3.0930 victimw:dpn -1.1896 defendantw:dpn 0.2364 victimw:defendantw:dpn 0.1613 0.2364 + 0.1613 = 0.3977

The product multinomial sampling model • So far, we have assumed that we are sampling at random from a single population. All the factors have been “responses” (although we didn’t analyse them as such; we used the “Poisson trick”) • Nevertheless, we were examining the joint distribution of the factors.

The product multinomial sampling model (cont) • Sometimes, we want to think of one factor A say, as representing different populations. We might want to compare the distribution of another factor B in the different populations, by taking a separate sample from each population Thus, some of the margins are fixed, not random. • In this case, we want to study the conditional distribution of B, given A.

Literary Example • Many literary scholars have used sentence length distributions (and similar statistical measures) to characterise the work of classical writers, with a view to attributing authorship. • For this to be sensible, the consistency of sentence length distributions across the whole body of work of a writer has to be established. • The table on the next slide gives the sentence length distributions on samples from each of 9 works by the Greek historian Herodotus. We want to test if the population distributions in the 9 works are identical. • Here, the factor B is the number of words, and the factor A the work.

Number of sentences

The product multinomial sampling model (cont) • This type of sampling, where some margins are fixed, is called product multinomial sampling. • The likelihood for this type of sampling is the same as before, so we can ignore the fact that some margins are fixed. • The distribution of B will be the same for all levels of A if and only if A and B are independent. • Thus, to see if the distribution of B is the same for all levels of A, we test the independence of A and B in the joint distribution – ie test for an interaction between A and B.

Literary example: points to note • 9 random samples, not 1 • Bottom totals are fixed • Product-multinomial: 9 separate multinomials, get likelihood by multiplying them together • Can test if all 9 distributions are the same by • Fitting Poisson model using length and work as categorical explanatory variables, count as a response • Testing for an interaction between length and work.

Getting the data into R • Often the data is presented in the form of a table. • We need to transform it into a data frame with 3 variables count, work and length. • Can do this in various ways, but using expand.grid is perhaps the easiest.

The data in table form > temp.df Length W1 W2 W3 W4 W5 W6 W7 W8 W9 1 1-5 16 6 7 8 13 8 10 9 13 2 6-10 35 33 48 38 47 30 36 42 45 3 11-14 47 34 37 49 43 35 46 43 54 4 15-20 39 39 34 38 32 41 38 33 28 5 21-25 24 26 21 24 26 29 23 27 23 6 26-30 16 22 23 15 14 30 19 18 14 7 31-35 11 19 6 15 9 10 10 10 6 8 36-40 5 5 11 8 2 3 4 9 9 9 41-45 4 4 6 2 5 6 4 4 4 10 46-50 1 6 2 0 4 5 5 3 0 11 50+ 2 6 5 3 5 3 5 2 4 12 Total 200 200 200 200 200 200 200 200 200

Creating the input data frame length.labels<-as.vector(temp.df[-12,1]) work.labels<-paste(“W”,1:9, sep=“”) input.df<-data.frame(expand.grid(lengths=length.labels, work=work.labels), counts=unlist(temp.df[-12,-1]), row.names=1:99) Ouch! What are these !! > as.vector(temp.df[-12,1]) [1] "1-5""6-10""11-14""15-20""21-25""26-30“ "31-35“ "36-40" "41-45" [10] "46-50" "50+“ > paste("W",1:9,sep="") [1] "W1" "W2" "W3" "W4" "W5" "W6" "W7" "W8" "W9“ > unlist(temp.df[-12,-1]) all the counts in the form of a vector, column by column

Form of input.df > input.df lengths work counts 1 1-5 W1 16 2 6-10 W1 35 3 11-14 W1 47 4 15-20 W1 39 5 21-25 W1 24 6 26-30 W1 16 7 31-35 W1 11 8 36-40 W1 5 9 41-45 W1 4 10 46-50 W1 1 11 50+ W1 2 12 1-5 W2 6 13 6-10 W2 33 And so on . . .

Graphical comparison of frequencies • Use barplots, also mosaic plots (see 201/8) • Need to • calculate frequencies • “transpose the matrix” • label rows and columns • Draw the plot frequencies<-t(as.matrix(temp.df[-12,-1]))/200 dimnames(frequencies)<-list(work.labels, length.labels) barplot(frequencies,beside=T, legend.text=work.labels, space=c(0,5),xlab = "sentence length", ylab="frequencies”,col=heat.colors(9))

barplot(frequencies,beside=T, legend.text=work.labels, space=c(0,5),xlab = "sentence length", ylab="frequencies”,col=heat.colors(9))

Mosaic plot my.table = matrix(input.df$counts, nrow=11, ncol=9) dimnames(my.table) = list(length.labels, work.labels) mosaicplot(t(my.table), main="mosaic plot of counts")

Analysis Maximal model > herodotus.glm<-glm(counts~work*lengths,family=poisson,data=input.df) > anova(herodotus.glm, test="Chisq") Analysis of Deviance Table Df Deviance Resid. Df Resid. Dev P(>|Chi|) NULL 98 1253.82 work 8 0.00 90 1253.82 1.00 lengths 10 1159.75 80 94.07 6.925e-243 work:lengths 80 94.07 0 1.816e-05 0.13 > 1-pchisq(94.07,80) [1] 0.1345596 Interaction is non-significant, indicating no significant differences between works

Testing goodness of fit • In the previous example, we concluded that the distribution of sentence lengths was the same over the 9 works studied. • This raises the question, can we model the common distribution by some standard distribution such as the Poisson or the negative binomial? • We will look at the general problem of deciding if a given distribution fits a set of count data.

General solution • Suppose the observations are counts, and we can represent the data in the form of a contingency table, as in the last example. • We want to test if the distribution of the count data is some given distribution, say the Poisson. • Using this distribution, we can work out the probability of “hitting” a particular cell, and hence calculate the log-likelihood log Lmod. • Using the deviance, we can test if the Poisson model is adequate.

Example Flying bombs on London • In 1944, the German army fired 6725 “flying bombs” at London. • The British Government needed to determine how accurate these bombs were. • The locations of the bomb hits in an area of South London were recorded and analysed to determine this.

Pictures

Data collection Area in South London divided up into a square 24 x 24 grid, 576 squares in all Number of bomb hits in each square recorded

Rationale • If the aiming was random, the hits should make a random pattern in the 576 squares. • In this case, theory indicates that the distribution of the number of hits in a square should be Poisson. • We can check how accurately the German forces could aim by checking if the distribution is Poisson.

Analysis • The total number of squares is 576. • The mean number of hits per square is [(0´229) + (1´211) + (2´93) + (3´35) + (4´7) + (5´1)] /576=0.9288 no.of.squares<-c(0,1,2,3,4,5) no.of.hits<-c(229,211,93,35,7,1) mean.hits<-sum(no.of.squares*no.of.hits)/ sum(no.of.hits) > mean.hits [1] 0.9288194

Analysis (cont) • The Poisson probabilities are calculated using dpois: poisson.probs<-dpois(0:5,mean.hits) freqs<-no.of.hits/576 Can compare the Poisson and maximal probabilities with a barplot

Analysis (cont) • The log-likelihoods for the maximal and Poisson models are of the form Siyi log pi where pi is estimated by the frequencies and Poisson probabilities respectively poisson.probs<-dpois(0:5,mean.hits) freqs<-no.of.hits/576 Deviance<-2*(sum(no.of.hits*log(freqs)) - sum(no.of.hits*log(poisson.probs))) > Deviance [1] 1.499450 > 1-pchisq(Deviance,4) [1] 0.8267388

Conclusions • The agreement is remarkably good, with a large p-value • Bombs seem to be aimed randomly • Degrees of freedom are 4 • maximal model has m -1=6 -1=5 df, since there are 5 unknown parameters p1, … p5 • Poisson model has 1 df, since there is 1 unknown parameter, the Poisson mean

Stats 330: Lecture 29