Hypothesis Testing III: Categorical Dependence
Ido Dagan
Statistical Methods in Computer Science
Testing Categorical Data
Single-factor design, with a categorical dependent variable: determine the effect of the independent variable's values (nominal/categorical as well) on the dependent variable.
treatment1: Ind1 & Ex1 & Ex2 & .... & Exn ==> Dep1
treatment2: Ind2 & Ex1 & Ex2 & .... & Exn ==> Dep2
control: Ex1 & Ex2 & .... & Exn ==> Dep3
Example: compare the performance of algorithm A to B to C ....
Dep1, Dep2, Dep3, .... are categorical data, so we cannot use the numerical ANOVA tests.
Values of dependent variable; values of independent variable
Contingency tables
We collect categorical data in contingency tables, for example:
• Different screens and their hue bias
• Coffee and tea drinkers in a sample of 100 people
• Operating system used at home
Is there a relation?
We want to know whether there exists a dependence between the variables, or between what we observed and some expectation.
e.g., are tea drinkers more likely to be coffee drinkers?
e.g., is the selection of operating systems different from expected?
To find out, we cannot ask about means, medians, .... Focus instead on proportions/distributions: does the observed distribution differ from the expected one?
Single Sample Suppose we have a-priori notion of what's expected For instance, selection of OS is expected to be uniform How do we tell whether the observed data is different?
Hypotheses
The hypotheses refer to the number of data points of each value:
H0: the expected distribution = the underlying distribution of the observed data
H1: Expected != Observed
We need a method to measure the likelihood of the null hypothesis.
Difference in Expectations
Focus on the difference between expected and observed frequencies. For instance:
Expected (out of 256 sampled homes): 25% for each operating system = 64 homes per operating system
Observed (out of 256 sampled homes)
Chi-square
• The chi-square (χ²) of given data is:
  χ² = Σ (f_o − f_e)² / f_e
• Where:
• f_e is the expected frequency of the value
• f_o is the observed frequency of the value
• The sum is over all values
• Recall χ² from the Cramer correlation
• Chi-square values of random samples are distributed according to the chi-square distribution
• A family of distributions, indexed by degrees of freedom
• A large enough chi-square indicates significance: reject the null hypothesis
• Very low values are unlikely: check for an error
In our example, this is compared to the chi-square distribution with df = 3. Result: highly significant.
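The goodness-of-fit computation above can be sketched in Python. The expected counts (64 homes per operating system out of 256, df = 3) come from the slides; the observed counts below are hypothetical, since the slide's actual table did not survive extraction.

```python
# Chi-square goodness-of-fit statistic, directly from the formula
# chi2 = sum over values of (f_o - f_e)^2 / f_e.

def chi_square(observed, expected):
    """Chi-square statistic for observed vs. expected frequencies."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Expected: a uniform choice among 4 operating systems, 256 homes sampled.
expected = [64, 64, 64, 64]
# Hypothetical observed counts (illustration only; they sum to 256).
observed = [48, 92, 56, 60]

chi2 = chi_square(observed, expected)
print(f"chi2 = {chi2:.3f}")                    # 17.500
# Critical values of the chi-square distribution with df = 3:
# 7.815 at alpha = 0.05, 11.345 at alpha = 0.01.
print("significant at 0.01:", chi2 > 11.345)   # True: reject H0
```

Note that the statistic is driven by *relative* discrepancies: each squared difference is divided by its expected frequency before summing.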
Features of chi-square
• Cannot be negative (it is a sum of squares)
• Only tests for a difference from the expected distribution, yet it is a one-tailed test (we reject only for large values)
• Equals 0 when all frequencies are exactly as expected
• What matters is not the absolute size of a discrepancy but its size relative to the expected frequency
• Depends on the number of discrepancies: this is why we must consider degrees of freedom; the more df, the more discrepancies we might get by chance
Comparing against arbitrary distributions
Do we always need to assume the expected distribution is uniform/known? No: instead of a uniform expectation, we can use different expectations (e.g., derived from past data), for instance:
Comparing against arbitrary distributions
• This is compared to the chi-square distribution with df = 3
• Result: not significant
• The same procedure may be used to test against any expected distribution (e.g., against a Normal distribution, to decide whether a t-test is applicable)
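A sketch of the same test against a non-uniform expectation. Both the past proportions and the observed counts below are hypothetical (the slide's tables were lost in extraction); the point is only that the expected frequencies need not be uniform.

```python
# Goodness-of-fit against a non-uniform expected distribution.

def chi_square(observed, expected):
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

n = 256
# Hypothetical expected proportions, e.g. from last year's survey.
past_proportions = [0.50, 0.25, 0.15, 0.10]
expected = [p * n for p in past_proportions]  # [128.0, 64.0, 38.4, 25.6]
observed = [120, 70, 40, 26]                  # hypothetical sample

chi2 = chi_square(observed, expected)
print(f"chi2 = {chi2:.3f}")
# Critical value with df = 3 at alpha = 0.05 is 7.815.
print("significant at 0.05:", chi2 > 7.815)   # False: do not reject H0
```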
Multiple-variable contingency tables
• Correlation testing: two numerical variables
• Single-factor testing (one-way ANOVA): categorical independent variable, numerical dependent variable
• Two-way contingency table: two categorical variables (as seen for the Cramer correlation)
Example: different screens and their hue bias
Hypotheses about contingency tables
Null hypothesis (H0): the variables are independent of each other (analogous to the correlation and ANOVA tests)
Alternative (H1): the values of the variables are dependent
How do we use chi-square here?
• Calculate the expected frequencies from the marginal probabilities
• Calculate the chi-square value
• Compare to the chi-square distribution
Example
Marginal probabilities:
• Any display has probability 140/400 of being bluish, 160/400 of being reddish, 100/400 of being greenish
• Any observation has probability 200/400 of being of display 1, 100/400 of display 2, 100/400 of display 3
Example: Expected Frequencies
Given the above table, under independence, what count do we expect for:
• A bluish display 1? (200/400) · (140/400) · 400 = 70 cases
• A greenish display 2? (100/400) · (100/400) · 400 = 25 cases
Translating into expected frequencies
Expected cell frequency = (row total · column total) / grand total
Observed vs. Expected
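The translation from margins to expected frequencies can be sketched as follows, using the actual margins from the displays/hue example (row totals 200/100/100, column totals 140/160/100, n = 400):

```python
# Expected cell frequencies under independence:
#   expected[i][j] = row_total[i] * col_total[j] / grand_total

def expected_frequencies(row_totals, col_totals):
    n = sum(row_totals)
    assert n == sum(col_totals), "margins must agree on the grand total"
    return [[r * c / n for c in col_totals] for r in row_totals]

row_totals = [200, 100, 100]   # display 1, display 2, display 3
col_totals = [140, 160, 100]   # bluish, reddish, greenish

expected = expected_frequencies(row_totals, col_totals)
print(expected[0][0])   # bluish display 1: 200 * 140 / 400 = 70.0
print(expected[1][2])   # greenish display 2: 100 * 100 / 400 = 25.0
```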
Chi-square of a contingency table
In this example the chi-square is computed as before: for each cell, take observed minus expected, square it, divide by the expected frequency, and sum over all cells to get the chi-square value.
Degrees of freedom of a contingency table
There are two variables; each has 2 degrees of freedom (3 possible values). Total: 2 · 2 = 4 degrees of freedom.
Knowing the marginal totals, setting 4 values in the table is enough to derive the 5 remaining ones.
In general: df = (#rows − 1) · (#columns − 1)
Comparison against the chi-square distribution with df = 4: significant, p < 0.01
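The whole contingency-table test can be sketched in one function. The observed cell counts below are hypothetical, chosen only to be consistent with the slide's margins (rows sum to 200/100/100, columns to 140/160/100); the slide's actual cells did not survive extraction.

```python
# Chi-square test of independence for a contingency table:
# build expected frequencies from the margins, sum the scaled squared
# discrepancies, and report df = (#rows - 1) * (#columns - 1).

def chi2_contingency(observed):
    rows = [sum(r) for r in observed]
    cols = [sum(c) for c in zip(*observed)]
    n = sum(rows)
    chi2 = sum((observed[i][j] - rows[i] * cols[j] / n) ** 2
               / (rows[i] * cols[j] / n)
               for i in range(len(rows)) for j in range(len(cols)))
    df = (len(rows) - 1) * (len(cols) - 1)
    return chi2, df

observed = [[90, 70, 40],    # display 1 (hypothetical counts)
            [20, 50, 30],    # display 2
            [30, 40, 30]]    # display 3

chi2, df = chi2_contingency(observed)
print(f"chi2 = {chi2:.3f}, df = {df}")       # df = (3-1)*(3-1) = 4
# Critical value with df = 4 at alpha = 0.01 is 13.277.
print("significant at 0.01:", chi2 > 13.277)
```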
Chi-square in spreadsheets
Excel and OpenOffice spreadsheets have built-in chi-square testing:
CHITEST(observed, expected)
where observed and expected are data arrays of the same size; it returns the p-value of the null hypothesis.
Interpreting dependence
Knowing that the variables are dependent does not tell us how they depend, analogous to the single-factor ANOVA testing we learned (but unlike the Pearson correlation, which is directional). We have to consider the interpretation carefully.
For instance, suppose we are investigating the following relation: are people who drink tea more likely to drink coffee? (i.e., is tea-drinking correlated with coffee-drinking?)
Interpreting Dependence
We gather the following results: the chi-square test is significant (p = 0.05). But what is the direction of the dependence?
Naively: out of 25 tea drinkers, 20 also drink coffee (80%!)
But: 90% of people drink coffee (a-priori), and 93% of non-tea drinkers drink coffee.
So: there is a negative dependence between tea and coffee! (This can be revealed by comparing observed frequencies to expected ones.)
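The direction of the dependence can be read off by comparing observed cell counts to expected ones. The 2x2 table below is reconstructed from the figures quoted on these slides (100 people, 25 tea drinkers of whom 20 drink coffee, 90 coffee drinkers overall):

```python
# Direction of dependence in the tea/coffee example: compare the observed
# tea-and-coffee count to the count expected under independence.

observed = {("tea", "coffee"): 20, ("tea", "no coffee"): 5,
            ("no tea", "coffee"): 70, ("no tea", "no coffee"): 5}

n = sum(observed.values())       # 100 people
tea_total = 20 + 5               # 25 tea drinkers
coffee_total = 20 + 70           # 90 coffee drinkers

# Expected tea-and-coffee count under independence:
expected_tea_coffee = tea_total * coffee_total / n   # 25 * 90 / 100 = 22.5

print("observed tea & coffee:", observed[("tea", "coffee")])   # 20
print("expected tea & coffee:", expected_tea_coffee)           # 22.5
# Observed < expected: tea and coffee drinking are negatively dependent,
# even though 80% of tea drinkers drink coffee.
print("negative dependence:", observed[("tea", "coffee")] < expected_tea_coffee)
```

Relying on the naive 80% figure alone misleads precisely because it ignores the high base rate of coffee drinking.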