## Chi-Square test


**Presentation 12: Chi-Square test**

**What does it mean for two categorical variables to be related?**

• Remember that the chi-square test is used to test for a relationship between two categorical variables.

$H_0$: There is no relationship between the variables.

$H_a$: There is a relationship between the variables.

• If two categorical variables are related, the chance that an individual falls into a particular category of one variable depends on the category they fall into for the other variable.

• Suppose we want to determine whether there is a relationship between religion (Christian, Jew, Muslim, Other) and smoking. Testing for a relationship between these two variables means determining whether belonging to a particular religion makes an individual more likely to be a smoker. If it does, we say that religion and smoking are related, or associated.

**Chi-Square test for 2-way tables**

• Suppose we are studying two categorical variables in a population, where the first variable has r levels (i.e., possible outcomes) and the second has c levels.

• We can summarize a sample from this population using a table with r rows and c columns.

• A two-way table, also called a contingency table, displays the counts of how many individuals fall into each possible combination of categories of the two categorical variables. Each of the r × c cells of the table represents one combination of categories.

• For the data on race and smoking (table not reproduced here), the two variables of interest have r = 4 and c = 2 levels, resulting in 4 × 2 = 8 combinations of categories.

**Chi-Square test for 2-way tables**

• By considering the number of observations falling into each category, we will see how to test hypotheses of the form:

$H_0$: The two variables are not associated.

$H_a$: The two variables are associated.
• Two different experimental situations lead to contingency tables:

• We have two populations under study, and each individual is classified by the same categorical variable. In this case the null hypothesis is a statement of homogeneity among the populations.

• We have one population under study, and we want to check the relationship between two categorical variables. In this case the null hypothesis is a statement of independence between the two variables.

• For sufficiently large samples, the same test is appropriate for both situations. It is called the chi-square test, and what follows goes over the steps for testing the relationship between two variables.

**Some Notation!**

• For i taking values from 1 to r (number of rows) and j taking values from 1 to c (number of columns), denote:

$R_i$ = total count of observations in the i-th row.

$C_j$ = total count of observations in the j-th column.

$O_{ij}$ = observed count for the cell in the i-th row and the j-th column.

$E_{ij}$ = expected count for the cell in the i-th row and the j-th column if the two variables were independent, i.e., if $H_0$ were true. These counts are calculated as

$$E_{ij} = \frac{R_i \times C_j}{n},$$

where $n$ is the total number of observations in the table.

**Example**

$E_{11} = (695 \times 1180)/1363$, $E_{12} = (695 \times 183)/1363$

$E_{21} = (281 \times 1180)/1363$, $E_{22} = (281 \times 183)/1363$

$E_{31} = (159 \times 1180)/1363$, $E_{32} = (159 \times 183)/1363$

$E_{41} = (228 \times 1180)/1363$, $E_{42} = (228 \times 183)/1363$

**Chi-Square Analysis Details**

The 5 Steps in a Chi-Square Test:

• Step 1: Write the null and alternative hypotheses. $H_0$: There is no relationship between the variables. $H_a$: There is a relationship between the variables.

• Step 2: Check conditions. A) All expected counts should be > 1. B) At least 80% of expected counts should be > 5.

• Step 3: Calculate the test statistic and p-value. The test statistic measures the difference between the observed counts and the counts expected under independence:

$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$
This statistic is called the chi-square statistic because, if the null hypothesis is true, it has approximately a chi-square distribution with (r-1)x(c-1) degrees of freedom.

**Step 3 Cont. Find the p-value.**

If the $\chi^2$ statistic is large, the observed counts are not close to the counts we would expect to see if the two variables were independent. Thus, a "large" $\chi^2$ gives evidence against the null hypothesis and supports the alternative.

The p-value of the chi-square test is the probability, computed assuming $H_0$ is true, that the $\chi^2$ statistic is as large as or larger than the value we obtained. Since the $\chi^2$ statistic has a chi-square distribution with (r-1)x(c-1) df when $H_0$ is true, the p-value for the chi-square test is ALWAYS the area to the right of the test statistic under the curve, i.e., p-value = $P(X > \chi^2)$, where $X$ has a chi-square distribution with (r-1)x(c-1) df.

To get this probability we need to use a chi-square distribution with (r-1)x(c-1) df (Table A.4). Using Minitab, or any other statistical software, you can obtain the p-value from the output. Otherwise, you can report a range for the p-value using Table A.4 (since usually you will not be able to find the exact p-value in the table).

**Chi-Square Analysis Details**

• Step 4: Decide whether or not the result is statistically significant. The result is statistically significant if the p-value is less than α, where α is the significance level (usually α = 0.05).

• Step 5: Report the conclusion in the context of the situation.

• The p-value is ______, which is < α, so the result is statistically significant. Reject $H_0$ and conclude that (the two variables) are related.

• The p-value is ______, which is > α, so the result is NOT statistically significant. We cannot reject $H_0$ and cannot conclude that (the two variables) are related.

**Detailed Example**

• Derek wants to know whether the geographical area in which a student grew up is associated with whether or not the student drinks alcohol.
The results he obtained from a random sample of PSU students are the observed counts shown in the Minitab output further down.

**Detailed Example**

1. $H_0$: There is no relationship between the geographical area in which a student grew up and whether or not the student drinks alcohol. $H_a$: There is a relationship between the geographical area in which a student grew up and whether or not the student drinks alcohol.

2. To check the conditions we need to calculate the expected count for each cell.

$E_{11} = (R_1 \times C_1)/n = (86 \times 87)/825 = 9.07$,

$E_{12} = (R_1 \times C_2)/n = (86 \times 738)/825 = 76.93$,

…

$E_{32} = (R_3 \times C_2)/n$ = ___________________, …

**Detailed Example**

Here is the Minitab output with the observed count for each cell and the expected count in parentheses:

| | No | Yes | All |
|---|---|---|---|
| Big_City | 21 (9.07) | 65 (76.93) | 86 (86.00) |
| Rural | 11 (14.87) | 130 (126.13) | 141 (141.00) |
| SmallTow | 18 (22.78) | 198 (193.22) | 216 (216.00) |
| Suburban | 37 (40.28) | 345 (341.72) | 382 (382.00) |
| All | 87 (87.00) | 738 (738.00) | 825 (825.00) |

All expected counts are greater than 5, so the conditions are satisfied.

**Detailed Example**

3. Chi-square statistic and p-value:

$$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$$

$$\begin{aligned}
\chi^2 &= \frac{(21-9.07)^2}{9.07} + \frac{(65-76.93)^2}{76.93} + \frac{(11-14.87)^2}{14.87} + \frac{(130-126.13)^2}{126.13} \\
&\quad + \frac{(18-22.78)^2}{22.78} + \frac{(198-193.22)^2}{193.22} + \frac{(37-40.28)^2}{40.28} + \frac{(345-341.72)^2}{341.72} = 20.091
\end{aligned}$$

df = (4-1)x(2-1) = 3

p-value = $P(X > 20.091) < P(X > 16.27) = 0.001$ (Table A.4)

4. Since the p-value < 0.05, the test is significant, and we can reject the null hypothesis.

5. We can conclude that there is a relationship between the geographical area in which a student grew up and whether or not the student drinks alcohol.

**Special Case - Analyzing 2x2 tables**

• In many cases the categorical variables of interest have two levels each. We can then summarize the data using a contingency table with two rows and two columns (i.e., r = c = 2).
The general form of a 2x2 table is

| | Column 1 | Column 2 | Total |
|---|---|---|---|
| Row 1 | $O_{11}$ | $O_{12}$ | $R_1$ |
| Row 2 | $O_{21}$ | $O_{22}$ | $R_2$ |
| Total | $C_1$ | $C_2$ | $n$ |

• In this case, the chi-square statistic has the following simplified form:

$$\chi^2 = \frac{n\,(O_{11} O_{22} - O_{12} O_{21})^2}{R_1 R_2 C_1 C_2}$$

• Under the null hypothesis, the $\chi^2$ statistic has a chi-square distribution with (2-1)x(2-1) = 1 degree of freedom.

**Example for 2x2 table: Is there a relationship between gender and smoking habits?**

Minitab output (observed counts, with expected counts in parentheses):

| | C1 | C2 | Total |
|---|---|---|---|
| 1 | 540 (540.17) | 52 (51.83) | 592 |
| 2 | 325 (324.83) | 31 (31.17) | 356 |
| Total | 865 | 83 | 948 |

Chi-Sq = 0.000 + 0.001 + 0.000 + 0.001 = 0.002, DF = 1, P-Value = 0.968

Minitab uses the general formula for the $\chi^2$ test statistic.

**Relationship Between Chi-Square and 2 Proportions Tests**

When do we use chi-square and when do we use a test of 2 proportions?

• Situation 1: Both categorical variables of interest have exactly 2 levels. Question: Is there a relationship between the variables, or is there a difference in the proportions? Answer: Either the chi-square test or a two-sided test of 2 proportions will lead to the same conclusion! In this case, the $\chi^2$ statistic = (z-statistic)$^2$, and the p-values of the two tests are equal, i.e., $P(X > \chi^2\text{-stat}) = 2\,P(Z > |z\text{-stat}|)$, where $X$ has a chi-square distribution with 1 df.

• Situation 2: Both categorical variables of interest have exactly 2 levels. Question: Is one proportion greater/smaller than the other? Answer: This is a one-sided test and you MUST use a test of 2 proportions.

• Situation 3: At least one of the two categorical variables of interest has MORE than 2 levels. Question: Is there a relationship between the variables? Answer: You MUST use a chi-square test.

**Examples of Chi-Square and 2-Proportions**

Q1: Is there a difference in the proportion of males and females that smoke?

Solution: Either a chi-square test or a test of 2 proportions is fine.

| 2-proportions | Chi-Square |
|---|---|
| $H_0$: $p_m - p_f = 0$ | $H_0$: There is no relationship between Gender and Smoking. |
| $H_a$: $p_m - p_f \neq 0$ | $H_a$: There is a relationship between Gender and Smoking. |

Q2: Is the proportion of males who smoke greater than the proportion of females who smoke?

Solution: Test of 2 proportions, because the alternative is one-sided!
2-proportions: $H_0$: $p_m - p_f = 0$ vs. $H_a$: $p_m - p_f > 0$

**Examples of Chi-Square and 2-Proportions**

Q: Is there a relationship between Race and Smoking? Is there a difference in the proportion of smokers among the different races?

Solution: Chi-square, because Race has more than 2 levels!

Chi-Square Test

$H_0$: There is no relationship between Race and Smoking.

$H_a$: There is a relationship between Race and Smoking.
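
The five steps can also be checked by hand-rolled code instead of Minitab. The following is a minimal sketch, not the presentation's own method: it uses only the Python standard library, and the names `chi2_sf`, `chi_square_test`, and `conditions_ok` are hypothetical helpers introduced here for illustration. It recomputes the Detailed Example (area grown up in vs. drinking) and the 2x2 gender and smoking example, and compares the 2x2 chi-square statistic with the square of the pooled two-proportion z-statistic.

```python
import math


def chi2_sf(x, df):
    """P(X > x) for a chi-square variable X with a positive integer df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum over k < df/2 of (x/2)^k / k!
        term = math.exp(-half)
        total = term
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return total
    # Odd df: erfc(sqrt(x/2)) plus a finite sum of half-integer power terms.
    total = math.erfc(math.sqrt(half))
    term = math.exp(-half) * math.sqrt(half) / math.gamma(1.5)
    for k in range(1, (df - 1) // 2 + 1):
        total += term
        term *= half / (k + 0.5)
    return total


def chi_square_test(observed):
    """Return (statistic, df, p_value, expected) for a two-way table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    # Expected count under H0: E_ij = R_i * C_j / n
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df, chi2_sf(statistic, df), expected


def conditions_ok(expected):
    """Step 2 as stated in the slides: all E > 1 and at least 80% of E > 5."""
    cells = [e for row in expected for e in row]
    share_above_5 = sum(e > 5 for e in cells) / len(cells)
    return all(e > 1 for e in cells) and share_above_5 >= 0.8


# Detailed Example: rows Big_City, Rural, SmallTow, Suburban; columns No, Yes.
drinking = [[21, 65], [11, 130], [18, 198], [37, 345]]
stat, df, p_value, expected = chi_square_test(drinking)
print(f"chi-square = {stat:.3f}, df = {df}, p-value = {p_value:.5f}")
print("conditions satisfied:", conditions_ok(expected))

# 2x2 example (gender and smoking): chi-square vs. pooled two-proportion z.
smoking = [[540, 52], [325, 31]]
stat_2x2, df_2x2, p_2x2, _ = chi_square_test(smoking)
n1, n2 = sum(smoking[0]), sum(smoking[1])
p1_hat, p2_hat = smoking[0][1] / n1, smoking[1][1] / n2
pooled = (smoking[0][1] + smoking[1][1]) / (n1 + n2)
z = (p1_hat - p2_hat) / math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
print(f"2x2: chi-square = {stat_2x2:.4f}, z^2 = {z * z:.4f}, p-value = {p_2x2:.3f}")
```

Run as a script, this gives a statistic of about 20.091 on 3 degrees of freedom for the drinking table and a p-value of about 0.968 for the 2x2 table, in line with the Minitab output quoted in the slides. The closed-form tail probability in `chi2_sf` only covers integer degrees of freedom, which is all a contingency table needs; a statistics package would normally be used instead.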