Comparing Proportions & Analysing Categorical Data

Scott Harris, October 2009

Learning outcomes By the end of this session you should be able to choose between, perform (using SPSS and CIA) and interpret the results from the following methods of analysing categorical data: • A test for association or independence (Chi-square or Fisher’s exact test). • A test for assessing if a sample proportion differs from a specified proportion (Chi-square). • A test for a change in categorical response (McNemar’s test). • A test of agreement of categories between 2 raters (Kappa test). You should also be aware of the concept of odds, odds ratios and how to calculate them from a 2x2 table.

Contents • Introduction • Refresher - types of data. • Data requirements. • The example dataset: CISR data. • Association between 2 variables • Test information. • ‘How to’ in SPSS and CIA. • One sample versus a specified proportion • Test information. • ‘How to’ in SPSS and CIA.

Contents • Change in response • Test information. • ‘How to’ in SPSS and CIA. • Quick crosstabs • Summary Data tables in SPSS: ‘How to’ • Agreement between 2 raters • Test information. • ‘How to’ in SPSS and CIA.

Refresher: Types of data • Quantitative – a measured quantity. • Continuous – Measurements from a continuous scale: Height, weight, age. • Discrete – Count data: Children in a family, number of days in hospital. • Qualitative – Assessing a quality. • Ordinal – An order to the data: Likert scale (much worse, worse, the same, better, much better), age group (18-25, 26-30…). • Categorical / Nominal – Simple categories: Blood group (O, A, B, AB). A special case is binary data (two levels): Status (Alive, dead), Infection (yes, no).

Data requirements • The Statistical tests that will be covered in this talk compare a sample with a categorical outcome against either: • a published or hypothesised proportion, • another group / another category or multiple categories, • a repeated categorical outcome from the same individual, • another measurement of the same outcome from another source or • a gold standard ‘true’ outcome. • A different type of test / method is used in each of the situations above.

Example dataset: Information CISR (Clinical Interview Schedule: Revised) data: • Measure of depression – the higher the score the worse the depression. • A CISR value of 12 or greater is used to indicate a clinical case of depression. • 3 groups of patients (each receiving a different form of treatment: GP, CMHN and CMHN problem solving). • Data collected at two time points (baseline and then a follow-up visit 6 months later). • An additional reading at 6 months was taken by another researcher.

Example CISR dataset: Raw data

Example CISR dataset: Labelled data

Association / Independence (Difference in proportions) Chi-square test or Fisher’s exact test

Chi-square statistic • The most common statistic used when dealing with categorical data. Alongside the t test, this is the most often seen statistical technique. • The Chi-square (χ²) or Pearson Chi-squared statistic compares the observed counts of a categorical response with the counts expected under the null hypothesis. • The null hypothesis (as always) is that there is no difference or no association between the variables (depending on the context). • As the difference between these observed and expected values increases, the evidence supporting a difference or an association builds up.

Theory: Chi-square statistic The following equation is used to calculate the chi-square statistic:
$$\chi^2 = \sum_{\text{cells}} \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$$
Observed = The actual count in each cell.
Expected = The number expected to be in each cell of the table if the test proportion was true or the rows and columns were unrelated / independent (i.e. assuming no difference in response).
This is distributed as χ² with (n1 - 1) x (n2 - 1) degrees of freedom, where n1 and n2 are the numbers of levels of the column variable and the row variable.

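Chi-square statistic: Example We want to see if Gender is associated with Clinical status at 6 months. Observed counts (as used in the worked calculation):
             Non case   Clinical case   Total
  Female        48           24           72
  Male          22           15           37
  Total         70           39          109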

Theory: Chi-square statistic The expected value for a cell is produced by multiplying together its 2 marginal totals (row total and column total) and then dividing by the grand total. For Male clinical cases (37 males, 39 clinical cases, 109 patients): 37 x 39 = 1443, 1443 / 109 = 13.24 (2dp)

Theory: Chi-square statistic
For Male clinical cases: (15 - 13.24)² / 13.24 = 3.0976 / 13.24 = 0.23 (2dp)
For Male non cases: (22 - 23.76)² / 23.76 = 3.0976 / 23.76 = 0.13 (2dp)
For Female clinical cases: (24 - 25.76)² / 25.76 = 3.0976 / 25.76 = 0.12 (2dp)
For Female non cases: (48 - 46.24)² / 46.24 = 3.0976 / 46.24 = 0.07 (2dp)
Chi-square statistic = 0.23 + 0.13 + 0.12 + 0.07 = 0.552 (3dp, summing the unrounded contributions)
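The hand calculation above can be checked with a few lines of code. The following is a minimal sketch in Python (not part of the original SPSS/CIA workflow); the variable names are illustrative, and the p value uses the fact that a chi-square distribution with 1 degree of freedom has upper tail erfc(sqrt(x / 2)).

```python
import math

# Observed counts from the worked example (sex by clinical status at 6 months).
observed = {
    ("Male", "Case"): 15,
    ("Male", "Non case"): 22,
    ("Female", "Case"): 24,
    ("Female", "Non case"): 48,
}
rows = ["Male", "Female"]
cols = ["Case", "Non case"]

# Marginal totals and grand total.
row_totals = {r: sum(observed[(r, c)] for c in cols) for r in rows}
col_totals = {c: sum(observed[(r, c)] for r in rows) for c in cols}
grand_total = sum(observed.values())

# Expected count = row total x column total / grand total.
expected = {
    (r, c): row_totals[r] * col_totals[c] / grand_total
    for r in rows
    for c in cols
}

# Pearson chi-square: sum of (observed - expected)^2 / expected over all cells.
chi_square = sum(
    (observed[cell] - expected[cell]) ** 2 / expected[cell] for cell in observed
)

# 2 x 2 table => (2 - 1) x (2 - 1) = 1 degree of freedom.
# For 1 df only: P(X > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi_square / 2))

print(f"Expected male clinical cases: {expected[('Male', 'Case')]:.2f}")
print(f"Chi-square = {chi_square:.3f}, p = {p_value:.3f}")
```

The erfc shortcut only applies to 1 degree of freedom; larger tables need the general chi-square distribution from a statistics package.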

Chi-square alternative: Fisher’s exact test • The chi-square test is only appropriate when the sample size is large enough that there are no ‘rare’ combinations of categories in the cross-tabulation. • A combination is considered ‘rare’ if its expected count is below 5, so the expected counts for all of the cells in the table need to be at least 5. • If at least one of the cells in the table has an expected count <5 then the Pearson Chi-square statistic should not be reported and an alternative test called Fisher’s exact test should be used instead. • Fisher’s exact test is an exact permutation test for categorical variables. It is convention to only use Fisher’s exact test when Pearson’s Chi-square is not appropriate.
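The rule of thumb and the exact test can both be sketched in code. The Python below is an illustration only, using the gender by clinical status counts from the worked example; it assumes the common two-sided definition of Fisher's exact test (summing the hypergeometric probabilities of every table that is no more likely than the observed one, with the margins held fixed). The function names are hypothetical.

```python
from math import comb


def expected_counts(table):
    """Expected counts for a table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def fisher_exact_two_sided(table):
    """Two-sided Fisher's exact p value for a 2 x 2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Sum over every table that is no more likely than the observed one
    # (the small tolerance guards against floating-point ties).
    return sum(
        p
        for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )


# Rows: Male, Female. Columns: clinical case, non case.
table = [[15, 22], [24, 48]]
expected = expected_counts(table)
smallest_expected = min(min(row) for row in expected)
p_fisher = fisher_exact_two_sided(table)

print(f"Smallest expected count: {smallest_expected:.2f}")
print(f"Chi-square appropriate: {smallest_expected >= 5}")
print(f"Fisher's exact test, two-sided p = {p_fisher:.3f}")
```

Here every expected count is well above 5, so the Pearson Chi-square would be the conventional choice; the exact p value is shown only for comparison.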

Chi-square statistic: SPSS Analyze → Descriptive Statistics → Crosstabs…
* Chi-square test for Sex and M6Cat .
CROSSTABS
  /TABLES=SEX BY M6Cat
  /FORMAT= AVALUE TABLES
  /STATISTIC=CHISQ
  /CELLS= COUNT ROW
  /COUNT ROUND CELL .

Chi-square statistic: Output 2x2 cross tabulation with suitable percentages (as we are comparing the two genders here we use row percentages). Pearson Chi-square p value = 0.457 Fisher’s exact test p value = 0.529 (Notice how only the 2-sided tests are considered.)

Fisher’s exact test: Table > 2 x 2 Everything should be set up as before, with the following additional option:

Chi-square statistic: 3 x 2 Output (Exact) 3x2 cross tabulation with suitable percentages (as we are comparing the treatment groups here we use row percentages). Pearson Chi-square p value = 0.380 Fisher’s exact test p value = 0.379 (Again notice how only the 2-sided tests are considered.)

Info: Chi-square in SPSS • From the menus select ‘Analyze’ → ‘Descriptive Statistics’ → ‘Crosstabs…’. • Put one of your categorical variables into the ‘Row(s):’ box and the other into the ‘Column(s):’ box. • Click the ‘Cells…’ button and then select the box for any percentages that you require. Then click the ‘Continue’ button. • Click the ‘Statistics…’ button and tick the option for ‘Chi-square’. Then click the ‘Continue’ button. • If your table will be bigger than a 2x2 then click the ‘Exact…’ button and tick the option for ‘Exact’. Then click the ‘Continue’ button. • Finally click ‘OK’ to produce the cross tabulation with the Chi-square statistic or ‘Paste’ to add the syntax for this into your syntax file.

Theory: Options for a 2 x 2 table Both the Chi-square test and Fisher’s exact test are tests for association between the two categorical variables. They do not quantify the effect size. For a 2 x 2 table there are a number of options that you can use to quantify effects, each of which has pros and cons. Writing p1 and p2 for the proportions with the outcome in the two groups: • Difference in proportions (absolute difference): p1 - p2 • Relative risk (multiplicative difference): p1 / p2 • Another common alternative is the Odds ratio: [p1 / (1 - p1)] / [p2 / (1 - p2)]

Theory: Odds and odds ratios Outcome of interest: Clinical case. With the cells of the 2 x 2 table labelled a (Female non cases), b (Female clinical cases), c (Male non cases) and d (Male clinical cases): Odds for Females = b/a, Odds for Males = d/c. Odds are a tricky topic for some to understand, but easy for others. They are most commonly encountered in gambling situations. The odds for females being a clinical case are 24 to 48 or 1 to 2. For females, you would expect 1 clinical case for every 2 non cases. An odds ratio is simply the ratio of the 2 odds (one divided by the other).

Theory: Odds and odds ratios Outcome of interest: Clinical case
Odds for Females = b/a = 24/48 = 0.5 (This is often reported as 2 to 1 against in gambling situations.)
Odds for Males = d/c = 15/22 = 0.68 (2dp)
Odds ratio = 0.68 / 0.5 = 1.36 (2dp)
The odds of being a clinical case are 1.36 times larger for Males than Females. Here we have taken females as the reference category (we have divided by the female result) thereby getting the relative ‘increase’ in odds for being male. Odds ratios are produced in logistic regression.
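The odds and odds ratio above can be reproduced directly from the four cell counts. This is a small Python sketch with illustrative names; it also shows the equivalent cross-product form of the odds ratio for a 2 x 2 table.

```python
def odds(events, non_events):
    """Odds of the outcome: number with the outcome per number without it."""
    return events / non_events


# Cell counts from the example (outcome of interest: clinical case).
female_cases, female_non_cases = 24, 48
male_cases, male_non_cases = 15, 22

odds_female = odds(female_cases, female_non_cases)
odds_male = odds(male_cases, male_non_cases)

# Females are the reference category, so we divide by the female odds.
odds_ratio = odds_male / odds_female

# Equivalent cross-product form for a 2 x 2 table.
cross_product = (male_cases * female_non_cases) / (male_non_cases * female_cases)

print(f"Odds (Females) = {odds_female:.2f}")
print(f"Odds (Males)   = {odds_male:.2f}")
print(f"Odds ratio (Males vs Females) = {odds_ratio:.2f}")
```

Swapping the reference category simply inverts the odds ratio (1 / 1.36), so it is worth stating which group was used as the reference when reporting.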

Reminder: 95% confidence intervals in CIA Always check that you are producing 95% CIs: Options menu in CIA

Difference in proportions: CIA Example Methods → Proportions and their differences → Unpaired samples

Difference in proportions: CIA Example It is easiest to have the group with the largest ‘feature’ proportion as Sample 1. This will produce a positive difference.

Difference in proportions: CIA Example 95% confidence interval. (This CI includes 0, therefore agreeing with the earlier p value from SPSS) Observed proportions and difference in proportions

Chi-square statistic: Presentation There was no significant difference in the proportions of clinical cases between male and female patients (Pearson Chi-square: p = 0.457; difference 7.2%, 95% CI: -11.1% to 25.9%).
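For readers without CIA to hand, the difference and its interval can be approximated in code. The sketch below uses Newcombe's method based on Wilson score intervals; that CIA uses exactly this method is an assumption on my part, although it reproduces the interval quoted above for these counts. Function names are illustrative.

```python
import math

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def wilson_interval(events, n, z=Z_95):
    """Wilson score confidence interval for a single proportion."""
    p = events / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


def newcombe_difference(events1, n1, events2, n2, z=Z_95):
    """Difference p1 - p2 with a Newcombe (Wilson-based) confidence interval."""
    p1, p2 = events1 / n1, events2 / n2
    l1, u1 = wilson_interval(events1, n1, z)
    l2, u2 = wilson_interval(events2, n2, z)
    diff = p1 - p2
    lower = diff - math.sqrt((p1 - l1) ** 2 + (u2 - p2) ** 2)
    upper = diff + math.sqrt((u1 - p1) ** 2 + (p2 - l2) ** 2)
    return diff, lower, upper


# Sample 1 = Males (15 cases of 37), Sample 2 = Females (24 cases of 72),
# so that the larger proportion comes first and the difference is positive.
diff, lower, upper = newcombe_difference(15, 37, 24, 72)

print(f"Difference = {diff:.1%}, 95% CI: {lower:.1%} to {upper:.1%}")
```

Because the interval spans 0, it tells the same story as the non-significant p value.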

One sample vs. a specific proportion Chi-square test or Exact test

Chi-square statistic for one variable • In the same way that the chi-square test can be used when you have larger than a 2 x 2 table, it can also be used when you have an n x 1 table. • In this situation you are testing whether the proportion of some event (or events) that you have seen differs from a value that you specify. • In this case the expected values are calculated from the specified proportion(s) but in all other regards the test is computed in the same way.

One variable Chi-square: Example Analyze → Nonparametric tests → Chi-square…
* Chi-square test for B0cat vs. 0.1 / 0.9 .
NPAR TEST
  /CHISQUARE=B0Cat
  /EXPECTED=0.1 0.9
  /MISSING ANALYSIS
  /METHOD=EXACT TIMER(5).
The specified proportions are entered on the /EXPECTED line; the ordering is important and is dependent on how the variable was set up.

Info: One variable Chi-square in SPSS • From the menus select ‘Analyze’ → ‘Nonparametric tests’ → ‘Chi-square…’. • Put your categorical variable into the ‘Test variable list:’ box. • Either specify expected proportions one at a time in the ‘Values’ box (in the order of the levels in the categorical variable), each time clicking the ‘Add’ button, or leave the test to compare against equal proportions. • Click the ‘Exact…’ button and tick the option for ‘Exact’. Then click the ‘Continue’ button. • Finally click ‘OK’ to produce the one variable Chi-square statistic or ‘Paste’ to add the syntax for this into your syntax file.

One variable Chi-square: Output The number of observed frequencies in each category as well as the expected number if the specified proportions of 0.1 and 0.9 were true. Pearson Chi-square p value = 0.213 Exact test p value = 0.263
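The one variable calculation can be sketched in the same way as the 2 x 2 case. The output table itself is not reproduced in this text, so the baseline counts below (7 non cases, 102 clinical cases out of 109) are an assumption, inferred from the percentages and totals quoted elsewhere in these slides; they do give the Pearson p value shown above. The function name is hypothetical.

```python
import math


def one_sample_chi_square(observed, expected_props):
    """Chi-square statistic for observed counts against specified proportions."""
    n = sum(observed)
    expected = [n * p for p in expected_props]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1
    return statistic, df, expected


# ASSUMED baseline counts (non cases, clinical cases); not given in the text.
observed = [7, 102]
# Specified proportions, in the same order as the observed categories.
statistic, df, expected = one_sample_chi_square(observed, [0.1, 0.9])

# With two categories there is 1 degree of freedom, so the erfc form applies.
p_value = math.erfc(math.sqrt(statistic / 2))

print(f"Expected counts: {expected[0]:.1f} and {expected[1]:.1f}")
print(f"Chi-square = {statistic:.3f} on {df} df, p = {p_value:.3f}")
```

As in SPSS, the order of the specified proportions has to match the order of the categories, otherwise the expected counts are attached to the wrong levels.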

Single sample Chi-square: CIA Example Methods → Proportions and their differences → Single sample

Single sample Chi-square: CIA Example Decide on which proportion you would like to produce the confidence interval for by setting that as the ‘feature’.

Single sample Chi-square: CIA Example 95% confidence interval (This CI includes 0.9, therefore agreeing with the earlier p value from SPSS) Observed proportion

Test for change in paired proportions McNemar test

The McNemar test • When you have paired or repeated binary categories then the Chi-square test is no longer appropriate and you should make use of an alternative test known as the McNemar test. • An example of this type of data is the pair of binary clinical case variables in the example dataset. Here we have the same information on an individual at both baseline and 6 months. These readings are paired categorical results. • The McNemar test looks at whether there has been a significant shift in state between the two paired results. • The focus of this test is whether there is a large shift in one direction rather than the other, as well as how much change there has actually been in the paired results.

Theory: McNemar test Looking at the example cross tabulation on the right then: • The empty cells of the table indicate where there is no change / difference in outcomes 1 and 2. • The solid red cell indicates those who didn’t have the outcome at time 1, but did have it at time 2. • Vice versa the red striped cell indicates those who had the outcome at time 1, but not at time 2. If both of the coloured cells are sufficiently small in proportion then there has been little change in response. Likewise if the proportion in each of the shaded cells is similar, then overall there has been little change in response direction. McNemar’s test takes both of these into account

McNemar test: Example Analyze → Descriptive Statistics → Crosstabs…
* McNemar test for B0Cat and M6Cat .
CROSSTABS
  /TABLES=B0Cat BY M6Cat
  /FORMAT= AVALUE TABLES
  /STATISTIC=MCNEMAR
  /CELLS= COUNT TOTAL
  /COUNT ROUND CELL .

Info: McNemar in SPSS • From the menus select ‘Analyze’ → ‘Descriptive Statistics’ → ‘Crosstabs…’. • Put one of your paired categorical variables into the ‘Row(s):’ box and the other into the ‘Column(s):’ box. • Click the ‘Cells…’ button and then select the box for ‘Total’ percentages. Then click the ‘Continue’ button. • Click the ‘Statistics…’ button and tick the option for ‘McNemar’. Then click the ‘Continue’ button. • Finally click ‘OK’ to produce the cross tabulation with the McNemar statistic or ‘Paste’ to add the syntax for this into your syntax file.

McNemar statistic: Example 2x2 cross tabulation with overall percentages. From the above table we can see that 58.7% of the total sample were clinical cases at baseline and became non cases by 6 months, whereas only 0.9% initially were non cases and became cases. McNemar test p value: p <0.001
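The McNemar test depends only on the two ‘changed’ cells. As a rough sketch in Python: the counts below (64 patients moving from case to non case, 1 moving the other way) are an assumption, back-calculated from the 58.7% and 0.9% quoted above with 109 patients and no missing data. Both the chi-square form and an exact binomial form are shown; the function name is illustrative.

```python
from math import comb


def mcnemar(b, c):
    """McNemar test from the two discordant counts b and c.

    Returns the chi-square form (b - c)^2 / (b + c), without continuity
    correction, and a two-sided exact binomial p value (under the null,
    each change is equally likely to go in either direction).
    """
    n = b + c
    chi_square = (b - c) ** 2 / n
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    exact_p = min(1.0, 2 * tail)
    return chi_square, exact_p


# ASSUMED counts inferred from the quoted percentages (n = 109).
total = 109
case_to_non_case = 64  # about 58.7% of 109
non_case_to_case = 1   # about 0.9% of 109

chi_square, exact_p = mcnemar(case_to_non_case, non_case_to_case)
net_change = (case_to_non_case - non_case_to_case) / total

print(f"McNemar chi-square = {chi_square:.2f}")
print(f"Exact two-sided p < 0.001: {exact_p < 0.001}")
print(f"Net change in proportion of clinical cases = {net_change:.1%}")
```

Patients whose status did not change contribute nothing to the test statistic; they only enter through the total when the change is expressed as a proportion.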

McNemar test: CIA Example Methods → Proportions and their differences → Paired samples

McNemar test: CIA Example There is an alternative Table view that may be easier for entering data:

McNemar test: CIA Example It is easiest to put the largest change proportion into the bottom left corner of this table. In this way you will get positive differences:

McNemar test: CIA Example 95% confidence interval (This CI excludes 0, therefore agreeing with the earlier p value from SPSS) Observed difference in proportions

McNemar test: Presentation There was a highly significant change in clinical status of depression from baseline to 6 months (McNemar: p < 0.001; difference 57.8%, 95% CI: 47.0% to 66.5%), in favour of a lessening of symptoms (reduction of clinical cases).

Quick Crosstabs Dealing with summary data in SPSS

Summary Data: SPSS If you only have access to the cross-tabulated data (or you only have to do one quick analysis) then rather than having to enter one row of data for each individual in the dataset you can enter it as a summary data table. A summary data table contains only one row for each cross combination in the table (one row per table cell), so the amount of data entry does not depend on the number of observations in the table. Summary data tables can be used for any of the categorical techniques.
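The idea of one row per table cell can be illustrated outside SPSS as well. The Python sketch below (illustrative names, using the gender by clinical status counts from earlier) holds the data as four summary rows with a count, and checks that this carries the same information as 109 individual rows.

```python
from collections import Counter

# Summary data table: one row per table cell (sex, status at 6 months, count).
summary_rows = [
    ("Male", "Case", 15),
    ("Male", "Non case", 22),
    ("Female", "Case", 24),
    ("Female", "Non case", 48),
]

# Build the cross-tabulation by treating the count as a weight for each row.
crosstab = Counter()
for sex, status, count in summary_rows:
    crosstab[(sex, status)] += count

# The equivalent individual-level data: one row per patient.
individual_rows = [
    (sex, status) for sex, status, count in summary_rows for _ in range(count)
]

print(f"Rows to enter as summary data: {len(summary_rows)}")
print(f"Rows to enter as individual data: {len(individual_rows)}")
print(f"Same cross-tabulation: {Counter(individual_rows) == crosstab}")
```

Four rows are entered whatever the sample size, which is why the summary form is convenient for quick analyses of published tables.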
