Contingency tables and Correspondence analysis

• Contingency table
• Pearson’s chi-squared test for association
• Correspondence analysis using SVD
• Plots
• References
• Exercises

Contingency tables

Contingency tables are often used in the social sciences (such as sociology, education and psychology). They can be considered as frequency tables whose rows and columns correspond to the levels of two categorical variables. If a variable is continuous, we can bin it and so convert it into a categorical one. Categorical variables take discrete values, for example different drugs, and the effects of those drugs rated as "excellent", "good" and so on. Contingency tables are sometimes called incidence matrices.

Example of a contingency table: a survey of the effects of four different drug types. Patients gave a score for each drug type (excellent, very good, good, fair, poor). The total number of elements is 121.

          excellent  very good  good  fair  poor
Drug A            6          8    10     1     5
Drug B           12          8     3     3     5
Drug C            0          3    12     6    10
Drug D            1          1     8    12     7

The first question is whether there is an association between columns and rows. If there is, we want to find some structure in the table. Can we order columns and rows by their closeness? Can we find associations between particular columns and rows? The problem of correspondence analysis is to find an optimal representation of the contingency table in a lower-dimensional space so that columns and rows are on the same scale.

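Pearson chi-squared test

Suppose that we have a data matrix N with I rows and J columns. The elements of the matrix are $n_{ij}$ and their total is $n = \sum_{i,j} n_{ij}$. Let us use the following notation:

$$P = N/n, \quad r = P\mathbf{1}, \quad c = P^T\mathbf{1}, \quad D_r = \mathrm{diag}(r), \quad D_c = \mathrm{diag}(c),$$
$$R = D_r^{-1}P, \quad C = D_c^{-1}P^T, \quad Q = P - rc^T.$$

Here r and c are the row and column sums of P, the rows of R and C are the row and column profiles, respectively, and Q is the difference between P and the product of the row and column sums.

More notation and relations: the row and column inertias are equal, and both are a multiple of the chi-squared statistic, which has (I-1)(J-1) degrees of freedom. The multiplier is 1/n:

$$\mathrm{in}(I) = \mathrm{in}(J) = \sum_{i,j} \frac{(p_{ij} - r_i c_j)^2}{r_i c_j} = \frac{X^2}{n}.$$

If P were a probability matrix and there were no association between rows and columns, then Q would be 0. This is equivalent to saying that rows and columns are independent.

For the example above, the chi-squared test carried out in R gives:

    Pearson's Chi-squared test
    data: Dr1
    X-squared = 47.0718, df = 12, p-value = 4.53e-06

The test shows that the null hypothesis of no association should be rejected, i.e. there is strong evidence of a row-column association.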
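The quoted figures can be checked by hand. The following R sketch assumes that the drug-rating table from the example is the `Dr1` data in the output above; it computes the statistic from the expected counts under independence and compares it with `chisq.test`. All variable names are illustrative.

```r
# Drug-rating table from the example (rows: drugs, columns: ratings)
N <- matrix(c( 6,  8, 10,  1,  5,
              12,  8,  3,  3,  5,
               0,  3, 12,  6, 10,
               1,  1,  8, 12,  7),
            nrow = 4, byrow = TRUE,
            dimnames = list(c("Drug A", "Drug B", "Drug C", "Drug D"),
                            c("excellent", "very good", "good", "fair", "poor")))

n  <- sum(N)                                  # total count (121)
E  <- outer(rowSums(N), colSums(N)) / n       # expected counts under independence
X2 <- sum((N - E)^2 / E)                      # Pearson chi-squared statistic
df <- (nrow(N) - 1) * (ncol(N) - 1)           # (I-1)(J-1)
p_value <- pchisq(X2, df, lower.tail = FALSE) # upper-tail probability

# Built-in test for comparison; it warns because some expected counts
# are below 5, so the warning is suppressed here.
res <- suppressWarnings(chisq.test(N))
print(c(manual = X2, builtin = unname(res$statistic), p = p_value))
```

Because several expected counts are small, `chisq.test(N, simulate.p.value = TRUE)` is a reasonable cross-check of the asymptotic p-value.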

Probabilistic interpretation of matrices

If the matrix P is a probability matrix, i.e. each element $p_{ij}$ is the probability of row i and column j occurring together, then the matrices involved have the following interpretation:
• Elements of r are the marginal probabilities of rows. Elements of c are the marginal probabilities of columns.
• Elements of Q are the differences between the joint probability and the product of the marginal probabilities. In this sense Q represents the degree of dependence between rows and columns.
• Elements of R are the conditional probabilities of columns when the row is known.
• Elements of C are the conditional probabilities of rows when the column is known.
• The total inertia is an overall indicator of dependence between rows and columns.

Contingency tables: homogeneity and heterogeneity

$t = \mathrm{in}(I) = X^2/n$ is the coefficient of association known as Pearson’s mean-square contingency. It is the total inertia. The total inertia is a measure of the homogeneity or heterogeneity of the table: a large t indicates heterogeneity and a small t indicates homogeneity. Homogeneity means that there is no row-column association. t can also be calculated using:

$$t = \sum_{i=1}^{I} r_i \sum_{j=1}^{J} \frac{(p_{ij}/r_i - c_j)^2}{c_j}.$$

The second summation is a weighted squared distance between the vector of relative frequencies of the ith row (i.e. the ith row profile, $p_{ij}/r_i$) and the average row profile c. The weights are the inverses of the elements of c. This is known as the chi-squared distance between the ith row profile and the average row profile. The total inertia is then a weighted sum of the I chi-squared distances, with the elements of r as weights. If all row profiles are close to the average row profile then the table is homogeneous. Otherwise the table is heterogeneous.

We can do similar calculations for the column profiles by exchanging the roles of r and c. These distances are similar to Euclidean distances, and techniques used for Euclidean distances can also be used in this case. We will learn techniques for metric scaling in one of the lectures.
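As a worked illustration of the chi-squared distances, the R sketch below builds the matrices for the drug example and computes the total inertia from the row profiles and then from the column profiles. It assumes the usual definitions (P = N/n, row profiles $p_{ij}/r_i$, column profiles $p_{ij}/c_j$); the variable names are not from the original slides.

```r
N <- matrix(c( 6,  8, 10,  1,  5,
              12,  8,  3,  3,  5,
               0,  3, 12,  6, 10,
               1,  1,  8, 12,  7), nrow = 4, byrow = TRUE)

n      <- sum(N)
P      <- N / n                          # relative frequencies
r_mass <- rowSums(P)                     # r: row sums of P
c_mass <- colSums(P)                     # c: column sums of P
R_prof <- sweep(P, 1, r_mass, "/")       # row profiles, one per row
C_prof <- sweep(t(P), 1, c_mass, "/")    # column profiles, one per row
Q      <- P - outer(r_mass, c_mass)      # departure from independence

# Chi-squared distance of each row profile from the average row profile c
chi2_dist_rows <- apply(R_prof, 1, function(p) sum((p - c_mass)^2 / c_mass))
inertia_rows   <- sum(r_mass * chi2_dist_rows)   # weighted by r

# Same calculation with the roles of r and c exchanged
chi2_dist_cols <- apply(C_prof, 1, function(p) sum((p - r_mass)^2 / r_mass))
inertia_cols   <- sum(c_mass * chi2_dist_cols)   # weighted by c

X2 <- n * inertia_rows                   # recovers the Pearson statistic
print(c(inertia_rows = inertia_rows, inertia_cols = inertia_cols, X2 = X2))
```

The two inertias should agree, and multiplying by n should return the chi-squared statistic reported by the test.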

Correspondence analysis and eigenvalues

For a given contingency table we calculate the row and column profiles. Now we want to find a vector y that, when multiplied by the row profiles from the left, gives scores Ry with the highest possible variance, using the row sums r as weights. It means that we want to maximise

$$\sum_{i} r_i \left( (Ry)_i - r^T R y \right)^2.$$

To make this problem solvable we add constraints (similar to PCA): the weighted norm of the vector should be one and its weighted mean should be 0, where the weights are the column sums:

$$y^T D_c y = 1, \quad c^T y = 0.$$

Here $D_r$ and $D_c$ denote the diagonal matrices of the row and column sums. If the mean is 0, and since $c^T y = r^T R y = 0$, the maximisation problem can be written as

$$\max_{y} \; y^T R^T D_r R \, y \quad \text{subject to} \quad y^T D_c y = 1.$$

If we use the Lagrange multipliers technique then we get:

$$R^T D_r R \, y = \lambda D_c y.$$

Thus the problem reduces to an eigenvalue problem. As a result, the scores Ry give the principal coordinates for rows. Similarly we can find the principal coordinates for columns. This problem is solved easily and compactly using the singular value decomposition. The conditions that the weighted norm is one and the weighted mean is 0 are similar to those in PCA, where the norm of the vector is one and the means of the variables are 0.
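The eigenvalue problem can be checked numerically on the drug example. The sketch below assumes the problem has the form $R^T D_r R \, y = \lambda D_c y$ (with $D_r$, $D_c$ the diagonal matrices of row and column sums) and symmetrises it by substituting $z = D_c^{1/2} y$ so that `eigen` can be used. The largest eigenvalue, 1, belongs to the trivial constant vector, which the zero-mean constraint excludes, so the second one is taken.

```r
N <- matrix(c( 6,  8, 10,  1,  5,
              12,  8,  3,  3,  5,
               0,  3, 12,  6, 10,
               1,  1,  8, 12,  7), nrow = 4, byrow = TRUE)

P      <- N / sum(N)
r_mass <- rowSums(P)
c_mass <- colSums(P)
R_prof <- sweep(P, 1, r_mass, "/")               # row profiles

# R^T D_r R equals P^T D_r^{-1} P; pre- and post-multiply by D_c^{-1/2}
M     <- t(P) %*% diag(1 / r_mass) %*% P
M_sym <- diag(1 / sqrt(c_mass)) %*% M %*% diag(1 / sqrt(c_mass))
ev    <- eigen(M_sym, symmetric = TRUE)

lambda1 <- ev$values[2]                          # first non-trivial eigenvalue
y       <- ev$vectors[, 2] / sqrt(c_mass)        # back-transform z to y
x       <- drop(R_prof %*% y)                    # row scores Ry

weighted_mean_y <- sum(c_mass * y)               # constraint: should be ~0
weighted_norm_y <- sum(c_mass * y^2)             # constraint: should be ~1
variance_x      <- sum(r_mass * x^2)             # maximised weighted variance

# Cross-check against the leading singular value of the standardised Q
Q  <- P - outer(r_mass, c_mass)
d1 <- svd(diag(1 / sqrt(r_mass)) %*% Q %*% diag(1 / sqrt(c_mass)))$d[1]
print(c(lambda1 = lambda1, variance_x = variance_x, d1_squared = d1^2))
```

The maximised variance, the eigenvalue and the squared leading singular value should coincide, which is why the SVD route is the usual one in practice.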

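Contingency table: Correspondence analysis

The problem stated above is solved using the singular value decomposition of the probability matrix minus the product of the row and column averages, $Q = P - rc^T$. With $D_r$ and $D_c$ the diagonal matrices of the row and column sums, let us use the singular value decomposition:

$$D_r^{-1/2} \, Q \, D_c^{-1/2} = U D V^T, \quad U^T U = V^T V = I.$$

It is equivalent to (generalised singular value decomposition):

$$Q = A D B^T, \quad A = D_r^{1/2} U, \quad B = D_c^{1/2} V, \quad A^T D_r^{-1} A = B^T D_c^{-1} B = I.$$

Principal row and column coordinates are:

$$F = D_r^{-1} A D, \quad G = D_c^{-1} B D.$$

The first few (one or two) columns of F and G are usually taken and plotted simultaneously. Transitions between columns and rows are given by:

$$F = R \, G \, D^{-1}, \quad G = C \, F \, D^{-1},$$

where $R = D_r^{-1} P$ and $C = D_c^{-1} P^T$ hold the row and column profiles. This relation is useful for adding supplementary rows or columns to the picture. Another useful formula is the reconstruction formula:

$$P = rc^T + D_r F D^{-1} G^T D_c, \quad \text{i.e.} \quad p_{ij} = r_i c_j \left( 1 + \sum_k \frac{f_{ik} \, g_{jk}}{d_k} \right).$$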
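The decomposition can be sketched in a few lines of R for the drug example. The code assumes the standard formulation: an ordinary SVD of $D_r^{-1/2} Q D_c^{-1/2}$, principal coordinates $F = D_r^{-1/2} U D$ and $G = D_c^{-1/2} V D$, transition formulas $F = R G D^{-1}$ and $G = C F D^{-1}$, and the reconstruction $P = rc^T + D_r F D^{-1} G^T D_c$. The names `F_rows` and `G_cols` are illustrative (`F` is avoided because it abbreviates `FALSE` in R).

```r
N <- matrix(c( 6,  8, 10,  1,  5,
              12,  8,  3,  3,  5,
               0,  3, 12,  6, 10,
               1,  1,  8, 12,  7), nrow = 4, byrow = TRUE)

P      <- N / sum(N)
r_mass <- rowSums(P)
c_mass <- colSums(P)
Q      <- P - outer(r_mass, c_mass)

# Ordinary SVD of the standardised matrix D_r^{-1/2} Q D_c^{-1/2}
sv <- svd(diag(1 / sqrt(r_mass)) %*% Q %*% diag(1 / sqrt(c_mass)))
K  <- min(dim(N)) - 1                        # at most min(I, J) - 1 dimensions
d  <- sv$d[1:K]                              # singular values

# Generalised singular vectors and principal coordinates
A      <- diag(sqrt(r_mass)) %*% sv$u[, 1:K]
B      <- diag(sqrt(c_mass)) %*% sv$v[, 1:K]
F_rows <- diag(1 / r_mass) %*% A %*% diag(d) # principal row coordinates
G_cols <- diag(1 / c_mass) %*% B %*% diag(d) # principal column coordinates

# Transition formulas: each set of coordinates from the other
F_from_G <- sweep(P, 1, r_mass, "/") %*% G_cols %*% diag(1 / d)
G_from_F <- sweep(t(P), 1, c_mass, "/") %*% F_rows %*% diag(1 / d)

# Reconstruction of P from the coordinates
P_rebuilt <- outer(r_mass, c_mass) +
  diag(r_mass) %*% F_rows %*% diag(1 / d) %*% t(G_cols) %*% diag(c_mass)

total_inertia <- sum(d^2)                    # should equal X^2 / n
print(round(F_rows[, 1:2], 3))
print(round(G_cols[, 1:2], 3))
```

The signs of the singular vectors are arbitrary, so a plot made from these coordinates may be mirrored relative to one produced by another program.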

Correspondence analysis

The elements of D are the singular values, and their squares are called the principal inertias; the principal inertias add up to the total inertia. The elements of D are also the canonical correlations reported by R. A larger value of D means that the corresponding dimension is more important. It is usual to keep one or two columns of F and G and use them for various plots. For a pictorial representation, either the columns and rows are plotted in an ordered form, or a biplot is used to look for possible associations between rows and columns as well as their order.

Correspondence analysis is widely applicable: it is used in archaeology, ecology, medicine and psychology, and it may also be useful in history and other fields. Many problems can be brought to this type of analysis. As soon as you can define two sets of categories, say cat1 and cat2, and find frequencies for all cross terms of cat1 and cat2, you can apply correspondence analysis. It should also be regarded as a dimension reduction technique and can be used together with others (for example PCA). Comparing the results of different dimension reduction techniques may give insight into the problem and the structure in the data.

Algorithm of Correspondence analysis

• Take a contingency table (N) and find the sum of all elements (total sum).
• Divide all elements by the total sum (call the result P).
• Find the row and column sums of P (r and c).
• From each element of P subtract the product of the corresponding elements of the row and column sums (call the result Q).
• Find the generalised SVD of Q. The normalisation conditions for the left and right side matrices are weighted normalisations, with weights equal to the inverses of the row and column sums.
• Find the principal row and column coordinates. Take the first few and plot them.
• If there are new elements (rows or columns), use the transition formula to find their principal coordinates. Plot them as supplementary points. (R does not allow this to be done directly.)
• Analyse the results (order and closeness of columns and rows, possible associations between columns and rows).
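The supplementary-point step can be sketched as follows. The helper `supp_row` is hypothetical (not an R or MASS function) and assumes the transition formula for rows, $f = \text{profile} \cdot G D^{-1}$. To avoid inventing data, it is checked on a row already in the drug table, which should land on its own coordinates, and on Drug A and Drug B pooled into one row, which should land on their count-weighted average.

```r
N <- matrix(c( 6,  8, 10,  1,  5,
              12,  8,  3,  3,  5,
               0,  3, 12,  6, 10,
               1,  1,  8, 12,  7), nrow = 4, byrow = TRUE)

# Steps of the algorithm: P, r, c, Q, SVD, principal coordinates
P      <- N / sum(N)
r_mass <- rowSums(P)
c_mass <- colSums(P)
Q      <- P - outer(r_mass, c_mass)
sv     <- svd(diag(1 / sqrt(r_mass)) %*% Q %*% diag(1 / sqrt(c_mass)))

k      <- 2                                   # number of dimensions kept
d      <- sv$d[1:k]
F_rows <- sweep(sv$u[, 1:k], 1, sqrt(r_mass), "/") %*% diag(d)
G_cols <- sweep(sv$v[, 1:k], 1, sqrt(c_mass), "/") %*% diag(d)

# Hypothetical helper: principal coordinates of a supplementary row,
# given its raw counts, via the transition formula
supp_row <- function(counts, G, d) {
  profile <- counts / sum(counts)             # row profile of the new row
  drop(profile %*% G) / d
}

# Sanity check 1: an existing row maps onto its own coordinates
f_drugA <- supp_row(N[1, ], G_cols, d)

# Sanity check 2: Drug A and Drug B pooled into a single row
pooled     <- N[1, ] + N[2, ]
f_pooled   <- supp_row(pooled, G_cols, d)
f_expected <- (sum(N[1, ]) * F_rows[1, ] + sum(N[2, ]) * F_rows[2, ]) / sum(pooled)

print(rbind(f_drugA, f_pooled))
```

A supplementary column would be handled the same way with the roles of rows and columns exchanged, using the column profile and `F_rows`.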

Plot of correspondence analysis: Example

This is a 1D pictorial form of the drug quality table. The positions of rows and columns correspond to the row and column scores. The size of each circle corresponds to the count in the corresponding cell of the contingency table. This picture already tells us something about the structure of the data.

Biplot for the correspondence analysis

Biplot produced by R: columns and rows are plotted simultaneously. Rows are black and columns are red. The positions of the points correspond to their scores. Again, from this picture we can deduce some structure in the data.

R commands for contingency tables and correspondence analysis

For correspondence analysis the original setup used the libraries ctest, MASS and mva. In R version 2.0.0 or higher mva and ctest are not needed, because their functions are part of the standard stats package, so only MASS has to be loaded (with older versions of R, also call library(mva) and library(ctest)):

    library(MASS)

To perform the chi-squared test we can use (load the data first):

    data(drivers)
    dr1 = matrix(drivers, ncol = 12, byrow = TRUE)
    chisq.test(dr1)
    chisq.test(dr1, simulate.p.value = TRUE)

If there is some association between rows and columns then we can start using correspondence analysis:

    cdriver = corresp(dr1, nf = 1)

nf is the number of factors we want to find. We can plot the result using the plot command:

    plot(cdriver)

If we have only 1 factor then the result will be a pictorial representation of the table. If nf=2 then the result will be the biplot.
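The same commands can be applied to the drug example from the first slides. This sketch assumes the MASS package is installed (it ships with standard R distributions) and uses the `cor` component of the object returned by `corresp`, which holds the canonical correlations; these are compared with the singular values computed directly with `svd`.

```r
library(MASS)

N <- matrix(c( 6,  8, 10,  1,  5,
              12,  8,  3,  3,  5,
               0,  3, 12,  6, 10,
               1,  1,  8, 12,  7),
            nrow = 4, byrow = TRUE,
            dimnames = list(c("Drug A", "Drug B", "Drug C", "Drug D"),
                            c("excellent", "very good", "good", "fair", "poor")))

# Two-factor correspondence analysis with MASS
ca <- corresp(N, nf = 2)
print(ca$cor)                      # canonical correlations

# Direct computation of the same quantities for comparison
P      <- N / sum(N)
r_mass <- rowSums(P)
c_mass <- colSums(P)
Q      <- P - outer(r_mass, c_mass)
d      <- svd(diag(1 / sqrt(r_mass)) %*% Q %*% diag(1 / sqrt(c_mass)))$d

# Share of the total inertia carried by each retained factor
inertia_share <- ca$cor^2 / sum(d^2)
print(round(inertia_share, 3))

# With nf = 2 the plot method draws the biplot; only drawn in an
# interactive session so that the script also runs in batch mode
if (interactive()) plot(ca)
```

Row and column scores may differ in sign from a hand computation, since the orientation of each axis is arbitrary.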

References

• Krzanowski WJ and Marriott FHC (1994) Multivariate Analysis. Kendall’s Library of Statistics
• Greenacre MJ (1984) Theory and Applications of Correspondence Analysis
