Introduction to the Analysis of Variance

Presentation Transcript

  1. Introduction to the Analysis of Variance • Basic Concepts, Sections 12.1 - 12.2 • One-Way ANOVA, Section 12.3

  2. ANOVA Overview • Test for a difference among several means from independently drawn samples • The extension of the two sample t-test for means to three or more samples requires the analysis of variance • Consider the negative income tax experiment in New Jersey • Tested whether there was a difference in hours of work between the control and the treatment groups • In this experiment income was supplemented by different amounts • The benefit guarantee level ranged from 50 to 125% of the poverty level • Consider then three groups of income • The control group, a first treatment group that received 50% of the poverty level, and a second treatment group that received 75% of the poverty level • The null hypothesis is that the mean annual hours over three years is the same for each group • H0: μ1 = μ2 = μ3 • H1: at least one of the population means differs from the others PP 7

  3. ANOVA Overview • Could compare the three population means by evaluating all possible pairs of sample means using the two sample t-test • Compare • Group 1 to group 2 • Group 1 to group 3 • Group 2 to group 3 • For a total of three groups the number of tests required is (3 choose 2) • Evaluated as 3!/(2!1!) = 3 • If number of groups = 10 there would be 45 different pair-wise t-tests (10 choose 2)
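
The count of pair-wise tests can be checked quickly. This is a minimal sketch; the helper name `pairwise_tests` is made up for illustration and is not part of the slides.

```python
from math import comb

def pairwise_tests(k: int) -> int:
    """Number of distinct pair-wise comparisons among k groups (k choose 2)."""
    return comb(k, 2)

three_groups = pairwise_tests(3)   # groups 1-2, 1-3, 2-3
ten_groups = pairwise_tests(10)
print(three_groups, ten_groups)
```

The count grows roughly with the square of the number of groups, which is why running every pair-wise t-test quickly becomes impractical.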

  4. ANOVA Overview • Pair-wise t-tests are likely to lead to an incorrect conclusion • Suppose that the three population means are in fact equal and that we conduct all three pair-wise tests • Assume that the tests are independent and set the significance level at 0.05 for each one • By the multiplication rule, the probability of failing to reject a null hypothesis of no difference in all three instances would be • P(fail to reject in all three tests) = (1 - 0.05)^3 = (0.95)^3 = 0.857 (probability of “accepting” all three) • Consequently, the probability of rejecting the null hypothesis in at least one of the tests is • P(reject in at least one test) = 1 - 0.857 = 0.143 • Since we know that the null hypothesis is true in each case, 0.143 is the probability of committing at least one Type I error
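
The arithmetic on this slide can be reproduced in a few lines. This sketch keeps the slide's assumption that the tests are independent; the function name `familywise_error` is hypothetical.

```python
def familywise_error(alpha: float, m: int) -> float:
    """P(at least one Type I error) across m independent tests, each at level alpha."""
    return 1.0 - (1.0 - alpha) ** m

fwe_three = familywise_error(0.05, 3)
print(round(fwe_three, 3))  # 0.143
```

Calling the same function with m = 45 shows how quickly the overall error rate rises when ten groups are compared pair by pair.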

  5. ANOVA Overview • Need a testing procedure in which the overall probability of committing a Type I error is equal to some predetermined level of alpha • One-way analysis of variance is such a technique • An experiment is a study designed for the purpose of examining the effect that one variable (the independent variable) has on the value of another variable (the dependent variable)

  6. ANOVA Overview • Negative income tax experiment was a designed experiment • Families were assigned to different treatment groups and given money (or not given money) by the Labor Dept • Intervention by researcher • Hours of work were observed for the next three years • Economists often work with observational studies rather than actual experiments • For example, we might study families from the Current Population Survey or the Census • Observe level of income and hours of work for each family and try to relate the two variables

  7. ANOVA Overview • In the NIT (negative income tax) example, hours worked is the dependent variable • What influences the hours of work? • There are three groups of families, distinguished by the amount of income they received from the government • Think of the income received as the independent variable • Income received will influence hours of work • The independent variable is also called the factor or treatment effect • Here we have an experiment in which we try to determine if various levels of a given factor (income) might have different effects on hours of work

  8. Variation between and within Groups • Looking at the data • There are three different levels of the factor income • The values of the hours worked for the different families are grouped by the factor level • We observe the group means
      Factor: Income Supplement Level (groups j = 1, 2, …, t; rows i = 1, 2, …, n)
      Measurements: hours worked for different families

                     Group 1   Group 2   Group 3
      Row 1          x11       x12       x13
      Row 2          x21       x22       x23
      ..             ..        ..        ..
      Row n          xn1       xn2       xnt
      Group mean     x̄.1       x̄.2       x̄.3

  9. Two Sources of Variation • Variation between groups reflects the effect of the factor levels, that is, of the treatment • Variation between groups is seen by looking at the three group means • If there are large differences in the group means • Suggests that the differences in income supplements have an effect on average hours worked • Variation within groups represents random error from sampling • Values within a sample will vary by chance • ANOVA uses these two kinds of variation to test whether the factor has an effect on the dependent variable

  10. The Model and Assumptions • One-way analysis of variance • Examines populations that are classified by one characteristic • In our example, the characteristic is the amount of income supplement the family receives • There are three levels of that factor, or three groups • If we had only two samples instead of t samples, one-way ANOVA would be equivalent to the two sample equal variance t-test for independent samples

  11. Assumptions • The samples have been independently selected • The population variances are equal • Not usually tested • The dependent variable follows a normal distribution in the populations

  12. Online Homework - Chapter 12 Intro to ANOVA • CengageNOW ninth assignment: Chapter 12 Intro to ANOVA

  13. Procedure • Remember each population represents a level of a factor • The hypotheses are • H0: μ1 = μ2 = … = μt • H1: Not all the means are equal • The null hypothesis would be • Supported if we observed small differences from one sample mean to the next • Rejected if at least some of the differences in sample means were large

  14. Procedure • We need a precise measure of the discrepancies among the sample means • A possible choice is the variance of the sample means • The basic idea of ANOVA is to express a measure of the total variation in a data set as a sum of two components • Variation within groups and variation between groups • If the variation within groups is small relative to the variation between the group means • Suggests that the population means are in fact different

  15. Problem – Are There Any Differences in Detergents? • Consumer Report is testing the cleansing action of three leading detergents • Cleansing action is the dependent variable • The different detergents represent the treatment • There are three levels of the factor because there are three detergents

  16. Problem – Are There Any Differences in Detergents? • There are 15 swatches of dirty cloth • We select at random 5 swatches to be washed by each of the detergents • After the swatches are cleaned, rate each on a scale of 0 to 100 • Let the level of significance be 0.01

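  17. Problem – Are There Any Differences in Detergents? • [Data table: cleansing ratings for the 15 swatches, 5 rows by 3 detergents; not preserved in the transcript] • What is the value of x23? What is x51? • x23 is the second observation for detergent 3; x51 is the fifth observation for detergent 1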

  18. Problem – Are There Any Differences in Detergents? • Consider all 15 observations as one data set for the moment • Calculate the total variation in the pooled data set • Then break the total variation into two components • Variation within groups • Variation between groups

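  19. Total Variation • Total variation is measured by the deviations of the observations from the grand mean, $x_{ij} - \bar{x}_{..}$ • Grand mean $\bar{x}_{..}$ • Where xij is the ith observation in the jth sample • j = 1, 2, …, t samples or groups or levels of the factor • i = 1, 2, …, nj observations in a group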

  20. Total Variation • Grand mean is the mean of all the pooled observations • Capital N represents the total number of observations when the data are pooled • Not necessary for each sample (group) to have the same number of observations

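  21. Summation Notation - Grand Mean • $\bar{x}_{..} = \frac{1}{N} \sum_{j=1}^{t} \sum_{i=1}^{n_j} x_{ij}$ • When we work with double summation signs, evaluate the inner summation sign first

  22. Total Sum of Squares • The total sum of squares, SST, can be found next • $SST = \sum_{j=1}^{t} \sum_{i=1}^{n_j} (x_{ij} - \bar{x}_{..})^2$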

  23. SST = SSTR + SSE • SST is divided into the variation between groups and the variation within groups (not variance) • SST = SSTR + SSE • SSTR = Variation between groups (Treatment) • SSE = Variation within groups (Error) • Also written SST = SSB + SSW (sum of squares between and within)

  24. SSTR – Treatment Sum of Squares - Variation between Groups • $SSTR = \sum_{j=1}^{t} n_j (\bar{x}_{.j} - \bar{x}_{..})^2$ • The dot in $\bar{x}_{.j}$ means that the average is carried out across the index i. We select a particular group, j, and then find the average of all the observations within that group.

  25. SSE - Error Sum of Squares - Variation within Groups • $SSE = \sum_{j=1}^{t} \sum_{i=1}^{n_j} (x_{ij} - \bar{x}_{.j})^2$ • Note that SST = SSTR + SSE: 666 = 390 + 276 • Can compute any two of the three and find the remaining Sum of Squares (SS) by subtraction
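
The partition SST = SSTR + SSE can be checked numerically. The detergent ratings themselves are not preserved in this transcript, so the sketch below uses small made-up numbers (hypothetical data, with unequal group sizes on purpose) and applies the standard sum-of-squares definitions.

```python
# Hypothetical ratings invented for illustration; NOT the detergent data.
groups = [
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0, 10.0],
    [1.0, 2.0, 3.0],
]

def mean(values):
    return sum(values) / len(values)

all_values = [x for g in groups for x in g]
grand_mean = mean(all_values)  # mean of the pooled observations

# Total variation: every observation against the grand mean.
sst = sum((x - grand_mean) ** 2 for x in all_values)
# Between groups: each group mean against the grand mean, weighted by group size.
sstr = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
# Within groups: each observation against its own group mean.
sse = sum((x - mean(g)) ** 2 for g in groups for x in g)

print(sst, sstr, sse)  # 82.5 73.5 9.0
```

Because the identity holds for any data set, only two of the three sums need to be computed directly.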

  26. SST = SSTR + SSE • Examine two different variances • One based on the SSTR • The other based on the SSE • Remember that a variance is computed by dividing the sum of squared deviations by the appropriate degrees of freedom • Do the same here • Create Variances • Also called Mean Squared Deviations

  27. Mean Square Deviations • Mean Square Deviation for Treatment: MSTR = SSTR / (t - 1), where t = the number of groups (We use up one degree of freedom in estimating the grand mean.) • Mean Square Deviation for Error: MSE = SSE / (N - t), where N = the total number of observations across all groups (Each group mean is estimated by the sample observations and uses up one degree of freedom.)

  28. Rationale of the Test • The variance within groups, MSE, measures • Variability of the values around the mean of each group • Random variation of values within groups • The variance between groups, MSTR, measures • Random variation of values within groups • Also measures differences from one group to another • If there is no real difference from group to group, the variance between groups should be close to the variance within groups • MSTR ≈ MSE • Ratio is close to 1 • However, if there is a difference between groups, then • MSTR > MSE

  29. ANOVA - Test Statistic • Test statistic: F = MSTR / MSE • If the null hypothesis is true and we draw a large number of samples from the populations and calculate the test statistic repeatedly • The sampling distribution of the test statistic follows the F distribution with t - 1 and N - t degrees of freedom, Ft-1,N-t • “Most” of the F values will be close to 1

  30. ANOVA - Test Statistic • [Figure: sampling distribution of F = MSTR/MSE, Ft-1,N-t, with the rejection region in the upper tail] • Differences among the population means inflate MSTR but not MSE, so evidence against the null hypothesis shows up as large values of F • So the test takes place in the upper tail of the distribution • Place all of the level of significance in the upper tail

  31. ANOVA – Test Statistic • [Figure: sampling distribution of F = MSTR/MSE, Ft-1,N-t, with the critical value Fα marking the rejection region] • Find critical value Fα • The decision rule is • If test statistic F > Fα, reject H0

  32. Problem - ANOVA • Calculate the MSTR • Calculate the MSE • Calculate the F test statistic • [Figure: sampling distribution of F2,12 with 0.01 in the upper-tail rejection region]
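
The three calculations asked for here follow directly from the sums of squares reported earlier (SSTR = 390, SSE = 276) with t = 3 detergents and N = 15 swatches. This is a minimal sketch; the variable names are illustrative only.

```python
sstr, sse = 390.0, 276.0   # sums of squares from the detergent problem
t, n_total = 3, 15         # number of groups, total observations

mstr = sstr / (t - 1)        # mean square for treatment, df = t - 1 = 2
mse = sse / (n_total - t)    # mean square for error, df = N - t = 12
f_stat = mstr / mse          # F test statistic

print(mstr, mse, round(f_stat, 2))  # 195.0 23.0 8.48
```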

  33. F Table ⍺ = 0.01

  34. Problem - ANOVA • Find critical value at α = 0.01: F2,12 = 6.93 • The test statistic is 8.48, which exceeds 6.93 • Reject H0, some of the means differ significantly • Some of the detergents clean better than others
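
The tabled critical value can be reproduced without an F table in this particular case. When the numerator has exactly 2 degrees of freedom, the upper-tail probability of the F distribution is P(F > x) = (1 + 2x/d2)^(-d2/2), which can be solved for x. The sketch below relies on that special case only; it is not a general F quantile routine, and the function name is hypothetical.

```python
def f_critical_df1_2(alpha: float, d2: int) -> float:
    """Upper-tail critical value of F(2, d2); valid only for 2 numerator df."""
    return (d2 / 2.0) * (alpha ** (-2.0 / d2) - 1.0)

critical = f_critical_df1_2(0.01, 12)
f_stat = 8.48                      # test statistic from the detergent problem
reject_h0 = f_stat > critical

print(round(critical, 2), reject_h0)  # 6.93 True
```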

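  35. ANOVA Table
      Source of variation    SS     df      MS                     F
      Treatment (between)    SSTR   t - 1   MSTR = SSTR/(t - 1)    MSTR/MSE
      Error (within)         SSE    N - t   MSE = SSE/(N - t)
      Total                  SST    N - 1

  36. Completed ANOVA Table
      Source of variation    SS     df    MS     F
      Treatment (between)    390     2    195    8.48
      Error (within)         276    12     23
      Total                  666    14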

  37. ANOVA p-value • Computer output provides the probability of observing an F test statistic at least as large as 8.48 if the H0 is true • This p-value is 0.0051 • To find the p-value, in a cell within a Microsoft Excel spreadsheet, type • =FDIST(Test value, t-1, N-t) • =FDIST(8.48,2,12) = .0051 • Setting our level of significance at 0.01, .0051 < 0.01 • Reject the null hypothesis
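
The Excel result can be cross-checked by hand for this problem, because an F distribution with 2 numerator degrees of freedom has a simple closed-form upper tail. The sketch assumes only that special case; `f_upper_tail_df1_2` is a made-up helper name, not an Excel or library function.

```python
def f_upper_tail_df1_2(x: float, d2: int) -> float:
    """P(F > x) for F(2, d2); closed form that holds only for 2 numerator df."""
    return (1.0 + 2.0 * x / d2) ** (-d2 / 2.0)

p_value = f_upper_tail_df1_2(8.48, 12)
print(round(p_value, 4))  # 0.0051
```

For other numerator degrees of freedom a statistical library or the spreadsheet function is the practical route.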

  38. Multiple Comparison Procedures • What happens if we reject the null hypothesis? • Conclude that the population means are not all equal • Do not know whether all of the means are different from one another or if only some of them are different • Want to conduct additional tests to find out where the differences lie • A number of multiple comparison tests are available, each with advantages and disadvantages • Simple approach is to perform a series of two sample t-tests • This increases the probability of committing a Type I error • Avoid this problem by reducing the individual α levels to ensure that the overall level of significance is kept at a predetermined level

  39. ANOVA Assumption - Homogeneity of Variances • Bartlett’s Test for Homogeneity of Variances • Most common method used to test whether the population variances are equal • Test is powerful • Can discern that the null hypothesis is false • Badly affected by non-normal populations • ANOVA is robust • Robust means that the validity of a test is not seriously affected by moderate deviations from the underlying assumptions • ANOVA operates well even with considerable heterogeneity of variances, as long as the nj are equal or nearly equal • ANOVA is also robust with respect to the assumption of the underlying populations’ normality, especially as n increases

  40. Online Homework - Chapter 12 ANOVA • CengageNOW tenth assignment: Chapter 12 ANOVA • CengageNOW eleventh assignment: Chapter 12: Overview of ANOVA

  41. Multiple Comparison Technique: Bonferroni Correction • The significance level for each of the individual comparisons depends on the number of pair-wise tests being conducted • In our problem, we set α = 0.01 and we have (3 choose 2) = 3 pair-wise comparisons • To set the overall probability of committing a Type I error at 0.01 we should use α* = 0.01/3 ≈ 0.0033 for the significance level for an individual comparison
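
The per-comparison level is just the overall level divided by the number of comparisons. The sketch below also checks, under the added assumption of independent tests, that the overall error rate stays at or below the target; the variable names are illustrative.

```python
from math import comb

overall_alpha = 0.01
n_pairs = comb(3, 2)                       # 3 pair-wise comparisons
per_test_alpha = overall_alpha / n_pairs   # Bonferroni-adjusted level

# Overall Type I error rate if the three tests were independent.
familywise = 1.0 - (1.0 - per_test_alpha) ** n_pairs

print(round(per_test_alpha, 4), familywise <= overall_alpha)  # 0.0033 True
```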

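  42. Bonferroni Correction • Instead of pooling the data from only two samples to estimate the common variance, pool all t samples • Degrees of freedom are N – t • The test statistic for comparing groups j and k is $t_{jk} = (\bar{x}_{.j} - \bar{x}_{.k}) / \sqrt{s_p^2 (1/n_j + 1/n_k)}$, where $s_p^2$ is the pooled variance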

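  43. Bonferroni Correction • The sample standard deviations are s1 = 3.937, s2 = 6.325, s3 = 3.674, so the sample variances are 15.5, 40.0 and 13.5 • The pooled variance is $s_p^2 = (4(15.5) + 4(40.0) + 4(13.5)) / 12 = 23.0$, the same value as the MSE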
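
The pooled variance can be recomputed from the three reported standard deviations, each based on 5 swatches. This is a sketch using the usual degrees-of-freedom weighting; the names are illustrative, and the small departure from 23 comes from rounding in the reported values.

```python
from math import sqrt

sds = [3.937, 6.325, 3.674]   # sample standard deviations from the slide
sizes = [5, 5, 5]             # swatches per detergent

# Weight each sample variance by its degrees of freedom (n_j - 1).
pooled_var = sum((n - 1) * s ** 2 for n, s in zip(sizes, sds)) / (sum(sizes) - len(sizes))

# Standard error of a difference between two group means with n = 5 each.
se_diff = sqrt(pooled_var * (1 / 5 + 1 / 5))

print(round(pooled_var, 1), round(se_diff, 3))
```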

  44. Bonferroni Correction: Group 1&2, Group 1&3, Group 2&3 • Perform three t-tests • Group 1 vs 2: p-value = .0118, do not reject at α = .003 • Group 1 vs 3: p-value = .342 (two-tailed; the one-tailed value is .171), do not reject at α = .003 • Group 2 vs 3: p-value = .0019, reject at α = .003 • There is a significant difference between detergents 2 and 3.

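  45. P-values from Excel • Using Excel’s statistical function • =TDIST(x,df,tails), where x must not be negative, so enter the absolute value of the t statistic • =TDIST(2.967,12,2) for t = 2.967 • =TDIST(0.989,12,2) for t = -0.989 • =TDIST(3.956,12,2) for t = -3.956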
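
The two-tailed p-values can also be approximated outside Excel. The sketch below integrates the Student t density numerically with Simpson's rule using only the standard library; it is an illustrative approximation, not how Excel computes TDIST, and the function names are made up.

```python
from math import gamma, pi, sqrt

def t_pdf(x: float, df: int) -> float:
    """Density of the Student t distribution with df degrees of freedom."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def two_tailed_p(t_stat: float, df: int, steps: int = 2000) -> float:
    """Two-tailed p-value via Simpson's rule on [0, |t|]; steps must be even."""
    a = abs(t_stat)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    central = total * h / 3          # P(0 <= T <= |t|)
    return 1 - 2 * central

t_stats = {"1 vs 2": 2.967, "1 vs 3": -0.989, "2 vs 3": -3.956}
p_values = {pair: two_tailed_p(t, 12) for pair, t in t_stats.items()}

for pair, p in p_values.items():
    print(pair, round(p, 4), "reject" if p < 0.01 / 3 else "do not reject")
```

Only the comparison of detergents 2 and 3 falls below the Bonferroni-adjusted level of 0.01/3.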