Topic-14

Topic-14 Nonparametric Methods: Chi-Square Applications

What Is “Nonparametric Statistical Testing”?

Recall that every statistical test discussed so far (z-value, t-value, or F-value) requires two key assumptions about the original data set:
* Level of Measurement: at least interval level (or ratio level), i.e., the data must be “quantitative”. These tests are therefore not applicable to the many data sets measured at the nominal or ordinal level.
* Population Distribution: the sample data must be obtained from a population that is normally distributed. If the population is believed to be “far” from normal, these tests are not applicable.

Statistical tests designed for data sets that are measured by categories only (nominal or ordinal) and that do not require a normally distributed population are called Nonparametric Tests (or Distribution-Free Tests). The test statistic used in this topic is “chi-square” (χ²).

Understanding the Chi-Square Distribution

* Observed Frequency (ƒo) vs. Expected Frequency (ƒe): For categorized data, the major numerical characteristic is the “relative frequency” of each category (not the mean, median, etc.). The frequencies observed in the sample for each category are called “observed frequencies” (analogous to the “sample mean”), while the frequencies assumed for the population are called “expected frequencies” (analogous to the assumed “population mean”).
* For nonparametric tests, the primary purpose is to test whether or not the observed frequencies are equal to the expected frequencies. The χ² statistic is designed for such a test and is defined as:

  χ² = Σ[(ƒo - ƒe)²/ƒe]   [with df = (k - 1), where k is the number of categories]

* From its definition, the χ² distribution has the following features:
  - It is non-negative and positively skewed, and it approaches the normal distribution as the df increases.
  - There is a “family” of distributions (curves), one for each number of degrees of freedom (here determined by the number of categories, not by the sample size as in the earlier t-test).
  - As with the F-test, χ² critical values can be obtained from a table.
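The definition of the statistic can be sketched in a few lines of Python. This is only an illustration: the function name `chi_square` and the four-category counts are hypothetical and do not come from the slides.

```python
def chi_square(observed, expected):
    """Return the sum of (fo - fe)^2 / fe over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))


# Hypothetical counts for k = 4 categories (illustration only).
observed = [18, 22, 30, 30]
expected = [25, 25, 25, 25]

stat = chi_square(observed, expected)
df = len(observed) - 1  # df = k - 1

print(f"chi-square = {stat:.2f}, df = {df}")
```

Each term is non-negative, so the sum can never fall below zero, which is why the distribution has no negative values.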

Goodness-of-Fit Test

• The “goodness-of-fit” test is one of the major nonparametric tests. It tests whether an observed data set (a sample) fits an expected data set (the assumed population). The two common cases are “equal expected frequencies” and “unequal expected frequencies”.
• 1) Testing for “Equal Expected Frequencies” uses a similar 5-step testing procedure:
• (1) H0: ƒo = ƒe (there is no difference between the two sets).
•     H1: ƒo ≠ ƒe (there is a significant difference between the two).
• (2) Select the significance level (α).
• (3) Identify the critical value (from the table, with df = k - 1).
• (4) Compute the χ² value (it can be obtained from a computer printout).
• (5) If the computed χ² > critical value, reject H0.
• 2) Testing for “Unequal Expected Frequencies”: If the objective is to test whether the “observed” (sample: e.g., local, regional, or before-treatment) frequencies differ from the “expected” (population: e.g., national or after-treatment) frequencies, the 5-step testing procedure is exactly the same, except for the interpretation of the null and alternative hypotheses.

Special Cautions in Using the Chi-Square Test

• A “false” conclusion may be reached in a χ² test when some expected frequencies in a data set are too small (ƒe < 5), because ƒe is in the denominator.
• To prevent such a “false” conclusion, three general rules apply when using the χ² test:
• (1) If there are only two categories (or “cells” in a table), the χ² test should be used only when the expected frequency for each category is larger than 5 (ƒe > 5).
• (2) If there are more than two categories, the χ² test should not be used when more than 20% of the categories have an expected frequency less than 5 (ƒe < 5).
• (3) When possible, categories with small expected frequencies (ƒe < 5) can be combined until no more than 20% of the categories have ƒe < 5, so that the χ² test can be used correctly.
• Note: The differences between ƒo and ƒe can always be explained in one of two ways: either there is a statistically significant difference, or the differences are the result of random sampling.
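The rules above can be sketched as a small check on a list of expected frequencies. The function name `chi_square_usable` is hypothetical, and treating exactly 5 as acceptable is an assumption (the slide writes both “ƒe < 5” for too small and “ƒe > 5” for large enough).

```python
def chi_square_usable(expected, minimum=5):
    """Apply the rules of thumb for small expected frequencies."""
    small = sum(1 for fe in expected if fe < minimum)
    if len(expected) == 2:
        # Two categories: no expected frequency may be below the minimum.
        return small == 0
    # More than two categories: at most 20% of them may be below the minimum.
    return small / len(expected) <= 0.20


# Expected counts from the Example-4 Minitab output: all are at least 5.
example_4_ok = chi_square_usable([48, 30, 12, 32, 20, 8])

# Hypothetical expected counts: 2 of 6 categories (33%) are below 5.
sparse_ok = chi_square_usable([40, 30, 20, 6, 3, 1])

# Rule (3): combine the three smallest categories (6 + 3 + 1 = 10).
combined_ok = chi_square_usable([40, 30, 20, 10])

print(example_4_ok, sparse_ok, combined_ok)
```

Combining categories also lowers k, so the degrees of freedom must be recomputed from the combined table.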

Goodness-of-Fit Test for Normality

* Because the normal distribution plays such an important role in almost all statistical studies, testing whether a population is normally (or nearly normally) distributed has always been of interest in statistical research projects.
* The goodness-of-fit test for normality is used to test, for data grouped into categories (classes), whether the “observed frequencies” (ƒo) in a frequency distribution match those expected (ƒe) from a “theoretical” normal distribution. Its procedure has a few more steps than the classical 5-step process:
  a) Determine the mean (µ) and standard deviation (σ) of the frequency distribution (from a computer program).
  b) Compute z values for the lower and upper limits of each class (category) [z = (X - µ)/σ], then determine ƒe for each class.
  c) Apply the goodness-of-fit chi-square 5-step testing procedure:
     (1) H0: the population is normally distributed. H1: the population is not normally distributed.
     (2) Select the significance level (α).
     (3) Identify the critical value (with df = k - 1).
     (4) Compute the χ² value (obtained from a computer printout).
     (5) If the computed χ² > critical value, reject H0.

Contingency Table Analysis

* As discussed in earlier topics, the contingency table is a useful tool for organizing and presenting category-only (nominal or ordinal level) data sets. The chi-square test can be used with a contingency table to test whether or not two traits (or variables) are related (called “contingency table analysis”).
* All observations are first classified by the two variables being studied (these counts are the ƒo) and organized into a contingency table with:
  - Two rows (r = 2), for two traits (or two variables), and
  - K columns (c = k), for the k categories used in the data set.
* Then follow the classic 5-step process:
  (1) H0: there is no relationship between the two variables. H1: there is a relationship between the two variables.
  (2) Select the significance level (α).
  (3) Identify the critical value [with df = (r - 1)(c - 1)].
  (4) Compute the χ² value (it can be obtained from Minitab): first compute the expected frequency (ƒe) of each cell in the table by
      ƒe = [(row total)(column total)]/(grand total),
      then χ² = Σ[(ƒo - ƒe)²/ƒe].
  (5) If the computed χ² > critical value, reject H0.

Summary

• Nominal and ordinal data sets are very common in business (and social) statistical studies, and it is often unknown before a study whether the population from which the sample is taken is normal (or at least close to normal). For these reasons, the χ² test is one of several important statistical tests in practice.
• When the mean (µ) and standard deviation (σ) of the population are unknown and have to be estimated from sample data, two more “degrees of freedom” (df) are lost (because two population parameters need to be estimated). Then, when looking up a critical value in the χ² table, you need to use df = (k - 1 - 2), where k is the number of categories. In general, if p additional population parameters need to be estimated in a test, then df = (k - 1 - p).
• Finally, for category-only data with ranking (ordinal level), more meaningful tests are available: “Analysis of Ranked Data”.
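The degrees-of-freedom rule in the summary can be written as a one-line helper. This is a minimal sketch; the function name `goodness_of_fit_df` is hypothetical.

```python
def goodness_of_fit_df(k, estimated_params=0):
    """df = k - 1 - p, where p parameters are estimated from the sample."""
    df = k - 1 - estimated_params
    if df < 1:
        raise ValueError("too few categories for this many estimated parameters")
    return df


# Six classes, mean and standard deviation specified in advance
# (as in Example-3 and Exercise C, which use df = 5).
df_given = goodness_of_fit_df(6)

# Six classes, mean and standard deviation estimated from the sample.
df_estimated = goodness_of_fit_df(6, estimated_params=2)

print(df_given, df_estimated)
```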

Example-1

• The human resource department at a small manufacturing plant collected the following data on absenteeism by day of the week. Test at the 0.05 level of significance to determine whether there is a difference in the absence rate by day of the week.
• We assume equal expected frequencies, so the expected frequency is 89, found by (120+45+60+90+130)/5 = 89.
• The computed test statistic is 60.8989 (the value shown in the Minitab Data Display for Example-1). See the next slide for computations.
• The degrees of freedom are 4 (5 - 1).
• The critical value is 9.488. Verify.

Example 1 (continued)

• H0: There is no difference between the observed and the expected frequencies of absences.
• H1: There is a difference between the observed and the expected frequencies of absences.
• Test Statistic: chi-square = 60.90.
• Decision Rule: Reject H0 if the test statistic is greater than the critical value (9.488).
• Conclusion: Reject the null hypothesis and conclude that there is a difference between the observed and expected frequencies of absences.
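As a cross-check, the statistic for Example 1 can be recomputed from the five daily counts quoted in the example. The sketch below is not the course's Minitab procedure; the variable names are illustrative, and the critical value 9.488 is copied from the slide rather than computed.

```python
# Observed absences for the five days, as quoted in Example 1.
observed = [120, 45, 60, 90, 130]

k = len(observed)
fe = sum(observed) / k  # equal expected frequency for every day

# One (fo - fe)^2 / fe term per day, then their sum.
terms = [(fo - fe) ** 2 / fe for fo in observed]
chi_sq = sum(terms)

critical = 9.488  # table value for df = 4, alpha = 0.05 (from the slide)
reject = chi_sq > critical

print(f"fe = {fe:.0f}")
print("terms:", [round(t, 4) for t in terms])
print(f"chi-square = {chi_sq:.4f}, reject H0: {reject}")
```

The individual terms and the total agree with the Minitab Data Display listed for Example-1 (10.7978, 21.7528, 9.4494, 0.0112, 18.8876, total 60.8989).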

Example-2

• The U.S. Bureau of the Census indicated that 21.5% of the population is single, 63.9% is married, 7.7% is widowed, and 6.9% is divorced. A sample of 500 adults in the Atlanta area showed that 120 were single, 310 were married, 40 were widowed, and 30 were divorced. At the 0.05 significance level, can we conclude that the Atlanta area is different from the U.S. as a whole?
• Step 1: State the null and the alternative hypotheses.
•   H0: The distribution has not changed (Atlanta matches the U.S. as a whole).
•   H1: The distribution has changed.
• Step 2: State the decision rule.
•   H0 is rejected if χ² > 7.815 (df = 3, α = 0.05).
• Step 3: Compute the value of the test statistic.
•   The expected frequencies are 107.5, 319.5, 38.5, and 34.5, so χ² ≈ 2.38.
• Step 4: What is the decision on H0?
•   Since 2.38 < 7.815, H0 is not rejected. There is not enough evidence to conclude that the Atlanta distribution differs from that of the U.S. as a whole.
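The unequal-expected-frequency calculation in Example-2 can be sketched as follows, using only the percentages and counts quoted in the example. The variable names are illustrative, and the critical value 7.815 is copied from the slide.

```python
# Census proportions (expected) and Atlanta sample counts (observed).
proportions = {"single": 0.215, "married": 0.639, "widowed": 0.077, "divorced": 0.069}
observed = {"single": 120, "married": 310, "widowed": 40, "divorced": 30}

n = sum(observed.values())  # sample size

# Expected frequency for each category: fe = n * population proportion.
expected = {status: p * n for status, p in proportions.items()}

chi_sq = sum(
    (observed[status] - expected[status]) ** 2 / expected[status]
    for status in observed
)

critical = 7.815  # table value for df = 3, alpha = 0.05 (from the slide)
reject = chi_sq > critical

print({status: round(fe, 1) for status, fe in expected.items()})
print(f"chi-square = {chi_sq:.4f}, reject H0: {reject}")
```

With these inputs the statistic is about 2.38, well below 7.815, so the decision rule does not lead to rejecting H0.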

Example-3

• A sample of the amounts spent by 500 customers at Ray’s Marathon is reported in the following frequency distribution. Is it reasonable to conclude that the amounts are normally distributed with a mean of $10 and a standard deviation of $2? Use the 0.05 significance level.

To compute ƒe for the first class, first compute the probability for this class. We need P(X < 6) = P[Z < (6 - 10)/2] = 0.0228. Thus ƒe is (0.0228)(500) = 11.4. Similarly, you can compute ƒe for the other classes.

• Step 1: State the null and the alternative hypotheses.
•   H0: The distribution is normal.
•   H1: The distribution is not normal.
• Step 2: State the decision rule.
•   H0 is rejected if χ² > 11.07 (df = 5, α = 0.05).
• Step 3: Compute the value of the test statistic.
•   χ² = 336.33
• Step 4: What is the decision on H0?
•   H0 is rejected. The distribution is not normal.
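The expected frequencies for all six classes can be sketched with the standard-library error function. The class upper limits and observed counts below are taken from the Minitab listing for Example-3 (the frequency table itself is not reproduced on the slide); the helper name `normal_cdf` is hypothetical.

```python
import math


def normal_cdf(x, mean, sd):
    """Cumulative probability P(X < x) for a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


mean, sd = 10.0, 2.0
upper_limits = [6, 8, 10, 12, 14]          # the last class is open-ended
observed = [20, 60, 140, 120, 90, 70]
n = sum(observed)

# Cumulative probability at each upper limit; the open last class ends at 1.
cum = [normal_cdf(u, mean, sd) for u in upper_limits] + [1.0]

# Area of each class = difference between successive cumulative values.
areas = [c - prev for c, prev in zip(cum, [0.0] + cum[:-1])]
expected = [n * a for a in areas]

chi_sq = sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))

print([round(fe, 3) for fe in expected])
print(f"chi-square = {chi_sq:.3f}")
```

Using unrounded probabilities gives roughly 337.32, in line with the Minitab Data Display for this example; the slide's 336.33 is consistent with rounding each probability to four decimals, as a table lookup would. Either value far exceeds 11.07.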

Example 4

• Is there a relationship between the location of an accident and the sex of the person involved in the accident? A sample of 150 accidents reported to a hospital was classified by location and gender. At the 0.05 level of significance, can we conclude that gender and the location of the accident are related?

Note: The expected frequency for the work-male intersection is computed as (90)(80)/150 = 48. Similarly, you can compute the expected frequencies for the other cells.

• Step 1: State the null and the alternative hypotheses.
•   H0: Gender and location are not related.
•   H1: Gender and location are related.
• Step 2: State the decision rule.
•   H0 is rejected if χ² > 5.991 (df = 2, α = 0.05).
• Step 3: Compute the value of the test statistic.
•   χ² = (60 - 48)²/48 + … + (10 - 8)²/8 = 16.667.
• Step 4: What is the decision on H0?
•   H0 is rejected. Gender and location are related.
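The expected frequencies and the statistic for this table can be sketched as below. The observed counts are taken from the Minitab listing for Example-4 (rows are the two genders; columns are Work, Home, Other); variable names are illustrative.

```python
# Observed counts: one row per gender, columns Work, Home, Other.
observed = [
    [60, 20, 10],
    [20, 30, 10],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# fe = (row total)(column total) / (grand total) for each cell.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

chi_sq = sum(
    (fo - fe) ** 2 / fe
    for obs_row, exp_row in zip(observed, expected)
    for fo, fe in zip(obs_row, exp_row)
)

df = (len(observed) - 1) * (len(observed[0]) - 1)

print(expected)
print(f"chi-square = {chi_sq:.3f}, df = {df}")
```

The same loop works unchanged for larger tables, since the row and column totals are derived from the data rather than typed in.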

Exercises: Topic-14

• The personnel manager is concerned about absenteeism. She decides to sample the records to determine whether absenteeism is distributed evenly throughout the six-day workweek. The null hypothesis to be tested is: Absenteeism is distributed evenly throughout the week. The 0.01 level is to be used. The sample results are:

• What are the numbers 12, 9, 11, 10, 9, and 9 called?
• How many categories (cells) are there?
• What is the expected frequency for each day?
• How many degrees of freedom are there?
• What is the chi-square critical value at the 1 percent level?
• Compute the χ² test statistic.
• Is the null hypothesis rejected?
• Specifically, what does this indicate to the personnel manager?

A group of department store buyers viewed a new line of dresses and gave their opinions of them. The results were:

Because the largest number (47) indicated the new line is outstanding, the head designer thinks that this is a mandate to go into mass production of the dresses. The head sweeper (who somehow became involved in this) believes that there is not a clear mandate and claims that the opinions are evenly distributed among the six categories. He further states that the slight differences among the various counts are probably due to chance. Test the null hypothesis that there is no significant difference among the opinions of the buyers. Test at the 0.01 level of risk. Follow a formal approach; that is, state the null hypothesis, the alternate hypothesis, and so on.

The manufacturer of a computer terminal reports in its advertising that the mean life of the terminal, under normal use, is 6 years, with a standard deviation of 1.4 years. A sample of 90 units sold 10 years ago revealed the following distribution of the lengths of life. At the 0.05 significance level, can the manufacturer conclude that the terminal lives are normally distributed?

A sociologist was researching this question: Is there any relationship between the level of education and the social activities of an individual? She decided on three levels of education: attended or completed college, attended or completed high school, and attended or completed grade school or less. Each individual kept a record of his or her social activities, such as bowling with a group, dancing, and church functions. The sociologist divided them into above-average frequency, average frequency, and below-average frequency.

• What is the table called?
• State the null hypothesis.
• Should the null hypothesis be rejected at the 0.05 significance level? Cite figures to substantiate your decision.
• What specifically does this indicate in this problem?

Topic-13: Chi-Square Testing

Procedure to use <Minitab> to calculate Chi-Square <Example-1>

1) Input your data into C1. Name C1 as 'fo', C2 as 'fe', C3 as 'Squares', and C4 as 'Chisquare'.
2) From <Minitab>, first click [Edit], then select [Command Line Editor].
3) Within the <Command Line Editor> box, type in the following (the 5 in the first command is n, the number of rows in the data):

   let c2 = sum(c1)/5
   let c3 = (c1 - c2)**2/c2
   let c4 = sum(c3)
   print c3 c4

   Then go to <Submit Commands>. You will see:

   Data Display

   Row  Squares  Chisquare
     1  10.7978    60.8989
     2  21.7528
     3   9.4494
     4   0.0112
     5  18.8876

4) Now close the <Command Line Editor> box.

Topic-13: Chi-Square <Supplement: Example-3>

1) Input your data into C1 (upper limit) and C2 (6 rows). Name C1 - "Amount", C2 - "fo", C3 - "Cum-Prob.", C4 - "Area", C5 - "fe", C6 - "Squares", and C7 - "Chisquare".
2) Go to <Calc> - <Probability Distributions> - <Normal>. On the <Normal Dist.> screen, click <cumulative probability> and type in <Mean> [10.0], <Standard deviation> [2.0], <Input column> [c1], <Optional storage> [c3]. Then click <OK>.
3) From the <Minitab> [Editor] menu, select [Command Line Editor].
4) Within the <Command Line Editor> box, type in:

   let c4(1) = c3(1)
   let c4(2) = c3(2) - c3(1)
   let c4(3) = c3(3) - c3(2)
   let c4(4) = c3(4) - c3(3)
   let c4(5) = c3(5) - c3(4)
   let c4(6) = c3(6) - c3(5)
   let c5 = 500*c4
   let c6 = (c2 - c5)**2/c5
   let c7 = sum(c6)
   print c1-c7

   Then go to <Submit Commands>. You will see:

   Data Display

   Row  Amount   Fo  Cum-Prob.      Area       Fe  Squares  Chisquare
     1       6   20    0.02275  0.022750   11.375    6.540    337.322
     2       8   60    0.15866  0.135905   67.953    0.931
     3      10  140    0.50000  0.341345  170.672    5.512
     4      12  120    0.84134  0.341345  170.672   15.045
     5      14   90    0.97725  0.135905   67.953    7.153
     6      99   70    1.00000  0.022750   11.375  302.142

5) Now close the <Command Line Editor> box.

Topic-13: Chi-Square <Supplement: Example-4>

1) Input your data into C1 - C3 (2 rows). Name C1 - "Work", C2 - "Home", C3 - "Other".
2) Go to <Stat> - <Tables> - <Chi-Square Test>. On the <Chi-Square Test> screen, in the <Columns containing the table> box, type in <C1-C3>. Then click <OK>. You will see:

   Chi-Square Test

   Expected counts are printed below observed counts

             Work    Home   Other   Total
   1           60      20      10      90
            48.00   30.00   12.00
   2           20      30      10      60
            32.00   20.00    8.00
   Total       80      50      20     150

   Chi-Sq = 3.000 + 3.333 + 0.333 + 4.500 + 5.000 + 0.500 = 16.667
   DF = 2, P-Value = 0.000

3) Since 16.667 > 5.991, H0 is rejected.

Topic-13: <Exercise B>

Step-1: State the null and alternative hypotheses:
   H0: There is no difference in the proportion of opinions.
   H1: There is a difference in the proportion of opinions.
Step-2: Given: α = 0.01, df = (k - 1) = (6 - 1) = 5; from the Table (p.A-6), Critical Value = 15.086.
Step-3: Decision Rule: If computed χ² > 15.086, reject H0.
Step-4: Calculate χ² (use Minitab):

1) Input your data into C1. Name C1 as 'fo', C2 as 'fe', C3 as 'Squares', and C4 as 'Chisquare'.
2) From the <Minitab> [Editor] menu, select [Command Line Editor].
3) Within the <Command Line Editor> box, type in the following (the 6 in the first command is n, the number of rows in the data):

   let c2 = sum(c1)/6
   let c3 = (c1 - c2)**2/c2
   let c4 = sum(c3)
   print c3 c4

   Then go to <Submit Commands>. You will see:

   Data Display

   Row  Squares  Chisquare
     1    1.225        3.4
     2    0.625
     3    0.000
     4    0.025
     5    0.625
     6    0.900

Step-5: Since 3.4 < 15.086, H0 is not rejected. There is not enough evidence to suggest that there are differences in the proportion of opinions. The observed differences can be attributed to random sampling error.

Topic-13: <Exercise C>

Step-1: State the null and alternative hypotheses:
   H0: The distribution is normal.
   H1: The distribution is not normal.
Step-2: Given: α = 0.05, df = (k - 1) = (6 - 1) = 5; from the Table (p.A-6), Critical Value = 11.07.
Step-3: Decision Rule: If computed χ² > 11.07, reject H0.
Step-4: Calculate χ² (use Minitab):

1) Input your data into C1 (upper limit) and C2 (6 rows). Name C1 - "Length", C2 - "fo", C3 - "Cum-Prob.", C4 - "Area", C5 - "fe", C6 - "Squares", and C7 - "Chisquare".
2) Go to <Calc> - <Probability Distributions> - <Normal>. On the <Normal Dist.> screen, click <cumulative probability> and type in <Mean> [6.0], <Standard deviation> [1.4], <Input column> [c1], <Optional storage> [c3]. Then click <OK>.
3) From the <Minitab> [Editor] menu, select [Command Line Editor].
4) Within the <Command Line Editor> box, type in:

   let c4(1) = c3(1)
   let c4(2) = c3(2) - c3(1)
   let c4(3) = c3(3) - c3(2)
   let c4(4) = c3(4) - c3(3)
   let c4(5) = c3(5) - c3(4)
   let c4(6) = c3(6) - c3(5)
   let c5 = 90*c4
   let c6 = (c2 - c5)**2/c5
   let c7 = sum(c6)
   print c1-c7

   Then go to <Submit Commands>. You will see:

   Data Display

   Row  Length  Fo  Cum-Prob.      Area       Fe   Squares  Chisquare
     1       4   7    0.07656  0.076564   6.8907  0.001733   0.483101
     2       5  14    0.23753  0.160962  14.4865  0.016341
     3       6  25    0.50000  0.262475  23.6227  0.080299
     4       7  22    0.76247  0.262475  23.6227  0.111471
     5       8  16    0.92344  0.160962  14.4865  0.158117
     6      99   6    1.00000  0.076564   6.8907  0.115141

Step-5: Since 0.483101 < 11.07, H0 is not rejected. The data are consistent with a normal distribution.

Topic-13: <Exercise D>

Q-1: It is a contingency table.
Q-2: State the null and alternative hypotheses:
   H0: There is no relationship between the level of education and the frequency of social activity.
   H1: There is a relationship between the level of education and the frequency of social activity.
Q-3: Given: α = 0.05, df = (r - 1)(c - 1) = 2 x 2 = 4; from the Table (p.A-6), Critical Value = 9.488.
   Decision Rule: If computed χ² > 9.488, reject H0.
   Calculate χ² (use Minitab):

1) Input your data into C1 - C3 (3 rows). Name C1 - "Above Avg.", C2 - "Average", C3 - "Below Avg.".
2) Go to <Stat> - <Tables> - <Chi-Square Test>. On the <Chi-Square Test> screen, in the <Columns containing the table> box, type in <C1-C3>. Then click <OK>. You will see:

   Chi-Square Test

   Expected counts are printed below observed counts

          Above Av  Average  Below Av  Total
   1            20       10        10     40
              6.00    12.00     22.00
   2            30       50        80    160
             24.00    48.00     88.00
   3            10       60       130    200
             30.00    60.00    110.00
   Total        60      120       220    400

   Chi-Sq = 32.667 + 0.333 + 6.545 +
             1.500 + 0.083 + 0.727 +
            13.333 + 0.000 + 3.636 = 58.826
   DF = 4, P-Value = 0.000

Q-4: Since 58.826 > 9.488, H0 is rejected. There is a relationship between the level of education and the frequency of social activity.
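The “P-Value = 0.000” line in the Minitab output can be checked by hand when the degrees of freedom are even, because the upper-tail area of the chi-square distribution then has a closed form: P(χ² > x) = exp(-x/2) · Σ (x/2)^i / i! for i = 0 … df/2 - 1. The sketch below uses that standard identity; the function name is hypothetical, and the approach does not cover odd df.

```python
import math


def chi_square_upper_tail_even_df(x, df):
    """P(chi-square > x) for a positive even number of degrees of freedom."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only holds for positive even df")
    half = x / 2.0
    terms = df // 2
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(terms))


# Exercise D: computed statistic 58.826 with df = 4.
p_exercise_d = chi_square_upper_tail_even_df(58.826, 4)

# Sanity check: the table critical value 9.488 (df = 4) should leave about 5%.
p_at_critical = chi_square_upper_tail_even_df(9.488, 4)

print(f"p-value for 58.826: {p_exercise_d:.3e}")
print(f"tail area beyond 9.488: {p_at_critical:.4f}")
```

The p-value is far smaller than 0.0005, which is why Minitab displays it as 0.000 to three decimals.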
