Analysis of Variance in Comparing Population Averages

Analysis of Variance
Nutan S. Mishra
Department of Mathematics and Statistics
University of South Alabama

Description of the problem
To compare the average performances of more than two populations, that is, to test
H0: µ1 = µ2 = µ3 = … = µk
where
µ1 is the average of the first population
µ2 is the average of the second population
µ3 is the average of the third population
… and so on
µk is the average of the kth population

Description of the problem
There are k populations with the following parameters:
Population 1: (µ1, σ1)
Population 2: (µ2, σ2)
Population 3: (µ3, σ3)
…
Population k: (µk, σk)
To test H0: µ1 = µ2 = µ3 = … = µk, we collect one sample from each population. Thus, if we have k populations, we need to collect k samples.
n1: size of the sample coming from population 1
n2: size of the sample coming from population 2
n3: size of the sample coming from population 3
… and so on
nk: size of the sample coming from population k

Analysis of Variance
The statistical technique used to test such a hypothesis is called Analysis of Variance, popularly known as ANOVA. The technique of ANOVA is developed under the following important assumptions:
• The populations from which the samples are drawn are all approximately normal.
• All these populations have the same variance, that is, σ1² = σ2² = σ3² = … = σk².
• All the k samples are random and independent.

Example 1
A pharmaceutical company manufactures a medicine to lower the couchpotato level in patients suffering from high couchpotato levels. A patient is supposed to take a 300 mg dose per day. The company wants to test whether the average effects of 50 mg pills, 100 mg pills and 150 mg pills are significantly different.
Three populations: patients taking 50 mg pills, 100 mg pills, and 150 mg pills
H0: µ50 = µ100 = µ150
They collected three random independent samples:
• the first from the population of all high couchpotato patients who take 50 mg pills
• the second from the population of all high couchpotato patients who take 100 mg pills
• the third from the population of all high couchpotato patients who take 150 mg pills

Example 2
At a university, a researcher wanted to compare the average GPA levels of the students who work for less than 20 hrs, the students who work between 20 and 40 hrs, and the students who work for more than 40 hrs. The three populations are as follows:
Population 1: all the students who work for less than 20 hrs
Population 2: all the students who work for 20 to 40 hrs
Population 3: all the students who work for more than 40 hrs
µ1: average GPA of population 1
µ2: average GPA of population 2
µ3: average GPA of population 3
H0: µ1 = µ2 = µ3
The researcher collected three random independent samples, one from each population, and ran ANOVA on a computer.

Principle
Suppose y is the expected response from a member of a sample. The actual observed response, say yij, is different from y, so there is a variation in the response, y − yij. This variation may have occurred due to two causes: first, the effect of the sample (the medicine or the group of students it comes from), and second, all the other factors that we are not considering in the study.
Variation in a response = effect of the sample group + effect of other factors
In other words,
Total variation in the dataset = variation due to the factor under study + variation due to error
Alternatively,
Total variation = variation between samples + variation within samples
• Total variation in the dataset is measured by the "total sum of squares", denoted by SST.
• Variation between samples (or due to the factor) is measured by the between-sample sum of squares, SSB (or SSF).
• Variation within samples (or due to error) is measured by the within-sample sum of squares, SSW (or SSE).
Thus SST = SSB + SSW, or equivalently SST = SSF + SSE.
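The decomposition above can be checked numerically. The following is a minimal sketch, assuming the usual definitions of the three sums of squares (deviations of observations from the grand mean, of sample means from the grand mean, and of observations from their own sample mean). The data are the alfalfa yields from the Minitab example in these slides; the function name sums_of_squares is only illustrative.

```python
# Alfalfa yields (pounds per plot) from the Minitab example in these slides;
# one inner list per variety (sample).
samples = [
    [3.22, 3.31, 3.26, 3.25],
    [3.04, 2.99, 3.27, 3.20],
    [3.06, 3.17, 2.93, 3.09],
    [2.64, 2.75, 2.59, 2.62],
    [3.19, 3.40, 3.11, 3.23],
    [2.49, 2.37, 2.38, 2.37],
]


def sums_of_squares(groups):
    """Return (SST, SSB, SSW) for a list of samples (illustrative helper)."""
    all_values = [y for g in groups for y in g]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(g) / len(g) for g in groups]

    # Total variation: every observation against the grand mean.
    sst = sum((y - grand_mean) ** 2 for y in all_values)
    # Between-sample variation: each sample mean against the grand mean,
    # weighted by the sample size.
    ssb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Within-sample variation: every observation against its own sample mean.
    ssw = sum((y - m) ** 2 for g, m in zip(groups, group_means) for y in g)
    return sst, ssb, ssw


sst, ssb, ssw = sums_of_squares(samples)
print(f"SST = {sst:.5f}  SSB = {ssb:.5f}  SSW = {ssw:.5f}")
print("SST - (SSB + SSW) =", sst - (ssb + ssw))  # zero up to rounding error
```

The three values should agree with the SS column of the Minitab one-way output for the alfalfa data (Variety, Error, Total) up to rounding.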

One way ANOVA table
k = total number of populations under study
n = total number of observations collected in the study (size of the grand sample)
SSB is also known as SSF, and SSW is also known as SSE.
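The entries of a one-way ANOVA table follow from SSB, SSW, k and n. The sketch below assumes the standard layout (between-sample degrees of freedom k − 1, within-sample degrees of freedom n − k, mean square = sum of squares / degrees of freedom, F = MSB / MSW). The function name one_way_table is hypothetical, and the numbers plugged in are the sums of squares reported in the Minitab alfalfa output in these slides.

```python
def one_way_table(ssb, ssw, k, n):
    """Build the rows of a one-way ANOVA table (illustrative helper).

    ssb, ssw : between-sample and within-sample sums of squares
    k        : number of populations (samples)
    n        : total number of observations
    """
    df_between, df_within = k - 1, n - k
    msb = ssb / df_between          # mean square between samples
    msw = ssw / df_within           # mean square within samples (error)
    f_value = msb / msw
    return {
        "Between": {"DF": df_between, "SS": ssb, "MS": msb, "F": f_value},
        "Within": {"DF": df_within, "SS": ssw, "MS": msw},
        "Total": {"DF": n - 1, "SS": ssb + ssw},
    }


# Sums of squares taken from the Minitab alfalfa output (k = 6, n = 24).
table = one_way_table(ssb=2.43507, ssw=0.15593, k=6, n=24)
for source, row in table.items():
    cells = "  ".join(f"{name}={value:.5g}" for name, value in row.items())
    print(f"{source:8s} {cells}")
```

A large F means the variation between samples is large relative to the variation within samples, which is evidence against H0.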

Minitab Example
sample1   3.2200   3.3100   3.2600   3.2500
sample2   3.0400   2.9900   3.2700   3.2000
sample3   3.0600   3.1700   2.9300   3.0900
sample4   2.6400   2.7500   2.5900   2.6200
sample5   3.1900   3.4000   3.1100   3.2300
sample6   2.4900   2.3700   2.3800   2.3700
This dataset was collected to study the average yields from six different varieties of alfalfa. Each number is the yield, in pounds, from a plot sown with a given variety of alfalfa seed. For example, the number 3.22 is the yield in pounds from a plot sown with variety 1 seed, and the number 2.49 is the yield in pounds from a plot sown with variety 6 seed.
µ1 = average yield from all the plots sown with variety 1 seed, and similarly for all six varieties.
H0: µ1 = µ2 = µ3 = µ4 = µ5 = µ6
The dataset above consists of six samples, each of size 4 (each row holds the data values of one sample). Thus the grand sample size is n = 6 × 4 = 24 and the number of populations is k = 6.

Minitab example
In this example the response is the yield in lbs and the factor under study is the variety (six types).
One-way Analysis of Variance
Analysis of Variance for Yield
Source     DF        SS        MS        F        P
Variety     5   2.43507   0.48701    56.22    0.000
Error      18   0.15593   0.00866
Total      23   2.59100
Since the F-value is large and the p-value is very small, we reject the null hypothesis and conclude that the average yields of the six varieties are not all equal.

Minitab example
Further analysis
Individual 95% CIs For Mean Based on Pooled StDev
Level    N     Mean    StDev
1        4   3.2600   0.0374
2        4   3.1250   0.1318
3        4   3.0625   0.0998
4        4   2.6500   0.0698
5        4   3.2325   0.1223
6        4   2.4025   0.0585
Pooled StDev = 0.0931
[Character plot of the individual 95% confidence intervals, one interval per level]
In the above analysis, the first column gives the variety, and the second column gives the size of the sample drawn from each population. In this example a sample of size 4 is drawn from each of the six populations. The third column gives the sample mean yield for each variety. Note that the sample mean yields for the fourth and sixth varieties are lower than those of the other four varieties. This agrees with our conclusion that not all the varieties give equal yields.
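The summary statistics in this output can be reproduced from the raw yields. The sketch below assumes that the pooled standard deviation is the square root of the error mean square, and that each individual interval is mean ± t × pooled StDev / √N with the error degrees of freedom (18 here). The critical value 2.101 is taken from a standard t-table (97.5th percentile, 18 degrees of freedom); it is an assumption of this sketch, not something shown in the Minitab output.

```python
import math
import statistics

# Alfalfa yields from the Minitab example, keyed by level (variety).
levels = {
    1: [3.22, 3.31, 3.26, 3.25],
    2: [3.04, 2.99, 3.27, 3.20],
    3: [3.06, 3.17, 2.93, 3.09],
    4: [2.64, 2.75, 2.59, 2.62],
    5: [3.19, 3.40, 3.11, 3.23],
    6: [2.49, 2.37, 2.38, 2.37],
}

# Pooled variance: within-sample sums of squares divided by the
# error degrees of freedom (n - k).
error_df = sum(len(v) - 1 for v in levels.values())
pooled_var = sum((len(v) - 1) * statistics.variance(v) for v in levels.values()) / error_df
pooled_sd = math.sqrt(pooled_var)

T_CRIT = 2.101  # assumed: t-table value for 95% confidence with 18 df

summary = {}
for level, values in levels.items():
    n = len(values)
    mean = statistics.mean(values)
    half_width = T_CRIT * pooled_sd / math.sqrt(n)
    summary[level] = (n, mean, statistics.stdev(values), mean - half_width, mean + half_width)
    print(f"{level}  N={n}  Mean={mean:.4f}  StDev={summary[level][2]:.4f}  "
          f"95% CI=({summary[level][3]:.3f}, {summary[level][4]:.3f})")

print(f"Pooled StDev = {pooled_sd:.4f}")
```

Under these assumptions the intervals for varieties 4 and 6 lie entirely below the intervals for the other four, which is the pattern described above.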

Two way ANOVA
All the 24 plots in the earlier example come from four different types of fields, with soil pH values PH1, PH2, PH3 and PH4. The data are tabulated as follows:
Rows: Variety   Columns: Field
          1        2        3        4
1    3.2200   3.3100   3.2600   3.2500
2    3.0400   2.9900   3.2700   3.2000
3    3.0600   3.1700   2.9300   3.0900
4    2.6400   2.7500   2.5900   2.6200
5    3.1900   3.4000   3.1100   3.2300
6    2.4900   2.3700   2.3800   2.3700
This is two-way classified data.
From each of the six varieties we took a sample of size 4.
From each of the four fields we took a sample of size 6.

Two way ANOVA
We want to test two hypotheses simultaneously:
H0 (variety): α1 = α2 = α3 = α4 = α5 = α6, where αi is the effect of the ith variety on the average yield
H0 (field): β1 = β2 = β3 = β4, where βj is the effect of the jth field on the average yield
To test these two hypotheses simultaneously, we run a two-way ANOVA based on the model
Yijk = y + αi + βj + εijk
That is, the kth observation consists of the common mean y + the effect due to the ith variety + the effect due to the jth field + an error term.
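To make the model concrete, the sketch below splits each alfalfa yield into the four pieces named above. It assumes the usual estimates for this additive model: the common mean is estimated by the grand mean, each variety effect by (row mean − grand mean), each field effect by (column mean − grand mean), and the error term by what is left over. With one plot per variety and field, the index k takes only one value, so the code drops it.

```python
# Rows: variety 1..6, columns: field 1..4 (table from these slides).
yields = [
    [3.22, 3.31, 3.26, 3.25],
    [3.04, 2.99, 3.27, 3.20],
    [3.06, 3.17, 2.93, 3.09],
    [2.64, 2.75, 2.59, 2.62],
    [3.19, 3.40, 3.11, 3.23],
    [2.49, 2.37, 2.38, 2.37],
]
n_var, n_field = len(yields), len(yields[0])

# Estimate of the common mean: the grand mean of all 24 plots.
grand_mean = sum(sum(row) for row in yields) / (n_var * n_field)

# Estimated variety effects (alpha_i) and field effects (beta_j).
alpha = [sum(row) / n_field - grand_mean for row in yields]
beta = [sum(row[j] for row in yields) / n_var - grand_mean for j in range(n_field)]

# Estimated error terms: what the additive model does not explain.
residual = [
    [yields[i][j] - grand_mean - alpha[i] - beta[j] for j in range(n_field)]
    for i in range(n_var)
]

print(f"common mean = {grand_mean:.4f}")
print("variety effects:", [round(a, 4) for a in alpha])
print("field effects:  ", [round(b, 4) for b in beta])
```

By construction the estimated variety effects sum to zero, as do the estimated field effects. The variety effects are visibly much larger in magnitude than the field effects.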

Two way ANOVA
Two-way Analysis of Variance
Analysis of Variance for Yield
Source     DF        SS        MS        F        P
Variety     5   2.43507   0.48701    53.27    0.000
Field       3   0.01878   0.00626     0.68    0.575
Error      15   0.13715   0.00914
Total      23   2.59100
The first column lists the sources of variation, namely variety, field and error. The corresponding degrees of freedom are in the second column, and the p-values are in the last column.
In the first row, the p-value is reported as 0.000 (that is, smaller than 0.0005), so we reject the null hypothesis that all varieties produce equal average yields. On the other hand, the large p-value in the second row (0.575) means we fail to reject the second null hypothesis: the sample data give no evidence that the four types of fields differ in average yield.
Thus in the linear model Yijk = y + αi + βj + εijk, the αi play a significant role whereas the βj do not.
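The DF, SS, MS and F columns of this table can be recomputed from the 6 × 4 data table. The sketch below assumes the standard formulas for a two-way layout with one observation per cell and no interaction term: the error sum of squares is SST minus the variety and field sums of squares, with (rows − 1) × (columns − 1) degrees of freedom. The function name two_way_table is illustrative. The p-values are not computed, since they need the F distribution, which the Python standard library does not provide.

```python
def two_way_table(data):
    """Two-way ANOVA (one observation per cell, no interaction).

    data: list of rows; rows are levels of the first factor,
    columns are levels of the second factor. Illustrative helper.
    """
    a, b = len(data), len(data[0])
    grand = sum(sum(row) for row in data) / (a * b)

    sst = sum((y - grand) ** 2 for row in data for y in row)
    ss_rows = b * sum((sum(row) / b - grand) ** 2 for row in data)
    ss_cols = a * sum((sum(row[j] for row in data) / a - grand) ** 2 for j in range(b))
    sse = sst - ss_rows - ss_cols

    df_rows, df_cols = a - 1, b - 1
    df_err = df_rows * df_cols
    mse = sse / df_err
    return {
        "rows": (df_rows, ss_rows, ss_rows / df_rows, ss_rows / df_rows / mse),
        "cols": (df_cols, ss_cols, ss_cols / df_cols, ss_cols / df_cols / mse),
        "error": (df_err, sse, mse),
        "total": (a * b - 1, sst),
    }


# Rows: variety 1..6, columns: field 1..4 (table from these slides).
yields = [
    [3.22, 3.31, 3.26, 3.25],
    [3.04, 2.99, 3.27, 3.20],
    [3.06, 3.17, 2.93, 3.09],
    [2.64, 2.75, 2.59, 2.62],
    [3.19, 3.40, 3.11, 3.23],
    [2.49, 2.37, 2.38, 2.37],
]
result = two_way_table(yields)
for source, label in [("rows", "Variety"), ("cols", "Field"), ("error", "Error"), ("total", "Total")]:
    print(label, " ".join(f"{v:.5f}" if isinstance(v, float) else str(v) for v in result[source]))
```

Compared with the one-way analysis, the field sum of squares is carved out of the old error sum of squares, and the error degrees of freedom drop from 18 to 15. That is why the F-value for variety changes from 56.22 to 53.27 even though its sum of squares stays the same.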
