
Topic 20: Single Factor Analysis of Variance




  1. Topic 20: Single Factor Analysis of Variance

  2. Outline
  • Analysis of Variance
  • One set of treatments (i.e., a single factor)
  • Cell means model
  • Factor effects model
  • Link to linear regression using indicator explanatory variables

  3. One-Way ANOVA
  • The response variable Y is continuous
  • The explanatory variable is categorical
  • We call it a factor; the possible values are called levels
  • This approach is a generalization of the independent two-sample pooled t-test
  • In other words, it can be used when there are more than two treatments

  4. Data for One-Way ANOVA
  • Y is the response variable
  • X is the factor (it is qualitative/discrete)
  • r is the number of levels
  • We often refer to these levels as groups or treatments
  • Yij is the jth observation in the ith group

  5. Notation
  • For Yij we use
  • i to denote the level of the factor
  • j to denote the jth observation at factor level i
  • i = 1, . . . , r levels of factor X
  • j = 1, . . . , ni observations for level i of factor X
  • ni does not need to be the same in each group

  6. KNNL Example (p 685)
  • Y is the number of cases of cereal sold
  • X is the design of the cereal package
  • There are 4 levels for X because there are 4 different package designs
  • i = 1 to 4 levels
  • j = 1 to ni stores with design i (ni = 5, 5, 4, 5)
  • We will use n when the ni are the same across groups

  7. Data for one-way ANOVA

    data a1;
      infile 'c:../data/ch16ta01.txt';
      input cases design store;
    proc print data=a1;
    run;

  8. The data

  9. Plot the data

    symbol1 v=circle i=none;
    proc gplot data=a1;
      plot cases*design;
    run;

  10. The plot

  11. Plot the means

    proc means data=a1;
      var cases;
      by design;   * requires a1 to be sorted by design;
      output out=a2 mean=avcases;
    proc print data=a2;
    symbol1 v=circle i=join;
    proc gplot data=a2;
      plot avcases*design;
    run;

  12. New Data Set

  13. Plot of the means

  14. The Model
  • We assume that the response variable is
    normally distributed, with a
    mean that may depend on the level of the factor, and a
    constant variance
  • All observations are assumed independent
  • NOTE: These are the same assumptions as linear regression, except there is no assumed linear relationship between X and E(Y|X)

  15. Cell Means Model
  • A "cell" refers to a level of the factor
  • Yij = μi + εij
  • where μi is the theoretical mean or expected value of all observations at level (or cell) i
  • The εij are iid N(0, σ²), which means
  • Yij ~ N(μi, σ²) and independent
  • This is called the cell means model
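As a concrete illustration, the cell means model can be simulated directly. This is a minimal Python sketch (not part of the original SAS code); the μi, σ, and group sizes below are hypothetical values chosen to resemble the cereal example:

```python
import random

# Cell means model: Y_ij = mu_i + eps_ij, with eps_ij iid N(0, sigma^2).
# All numeric values here are hypothetical, for illustration only.
random.seed(1)
mu = {1: 15.0, 2: 13.0, 3: 20.0, 4: 27.0}   # one cell mean per factor level
sigma = 3.0                                  # common standard deviation
n = {1: 5, 2: 5, 3: 4, 4: 5}                 # unequal group sizes are allowed

# y[i] holds the n_i observations simulated at factor level i
y = {i: [mu[i] + random.gauss(0.0, sigma) for _ in range(n[i])] for i in mu}

for i in sorted(y):
    print(i, len(y[i]), round(sum(y[i]) / n[i], 2))
```

Each group's sample mean estimates its own μi, while the spread within every group is governed by the single common σ.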

  16. Parameters
  • The parameters of the model are μ1, μ2, … , μr and σ²
  • Question (Version 1): Does our explanatory variable help explain Y?
  • Question (Version 2): Do the μi vary?
  • H0: μ1 = μ2 = … = μr = μ (a constant)
  • Ha: not all of the μi are the same

  17. Estimates
  • Estimate μi by the mean of the observations at level i (the sample mean):
    μ̂i = Ȳi· = Σj Yij / ni
  • For each level i, also get an estimate of the variance (the sample variance):
    si² = Σj (Yij − Ȳi·)² / (ni − 1)
  • We combine these to get an overall estimate of σ²
  • Same approach as the pooled t-test

  18. Pooled estimate of σ²
  • If the ni were all the same, we would average the si²
  • Do not average the si
  • In general we pool the si², giving weights proportional to the df, ni − 1
  • The pooled estimate is
    s² = Σi (ni − 1) si² / Σi (ni − 1) = SSE / (nT − r) = MSE
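The pooled estimate can be checked numerically. The sketch below is in Python rather than SAS; the data values are reconstructed from the KNNL cereal example (their group totals 73, 67, 78, 136 and SSE of 158.2 agree with the proc glm output shown later in the deck):

```python
# Pooled estimate of sigma^2: weight each group's sample variance s_i^2 by
# its df (n_i - 1), then divide by the total df.  Do not average the s_i.
groups = [
    [11, 17, 16, 14, 15],   # design 1
    [12, 10, 15, 19, 11],   # design 2
    [23, 20, 18, 17],       # design 3 (only 4 stores)
    [27, 33, 22, 26, 28],   # design 4
]

def sample_var(x):
    m = sum(x) / len(x)
    return sum((v - m) ** 2 for v in x) / (len(x) - 1)

df = [len(g) - 1 for g in groups]                      # n_i - 1
s2 = [sample_var(g) for g in groups]                   # s_i^2
pooled = sum(d * v for d, v in zip(df, s2)) / sum(df)  # = SSE / (nT - r) = MSE
print(round(pooled, 4))   # 10.5467
```

Note that each weighted term (ni − 1)si² is exactly that group's sum of squared deviations, so the numerator is SSE.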

  19. Running proc glm
  • Difference 1: need to specify factor variables in a class statement
  • Difference 2: ask for mean estimates with means and lsmeans

    proc glm data=a1;
      class design;
      model cases=design;
      means design;
      lsmeans design;
    run;

  20. Output
  • Be sure to check these important summaries

  21. SAS 9.3 default output for MEANS statement

  22. MEANS statement output Table of sample means and sample variances

  23. SAS 9.3 default output for LSMEANS statement

  24. LSMEANS statement output
  • Provides estimates based on the model (i.e., constant variance)

  25. Notation

  26. ANOVA Table

  Source  df      SS                        MS
  Model   r − 1   SSR = Σij (Ȳi· − Ȳ··)²    MSR = SSR/dfR
  Error   nT − r  SSE = Σij (Yij − Ȳi·)²    MSE = SSE/dfE
  Total   nT − 1  SST = Σij (Yij − Ȳ··)²    SST/dfT
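The ANOVA decomposition can be verified from summary statistics alone. This Python sketch uses the totals that appear in the X΄X output later in the deck (group totals 73, 67, 78, 136 with ni = 5, 5, 4, 5, and Y΄Y = 7342):

```python
# ANOVA sums of squares from group totals, group sizes, and Y'Y.
totals = [73, 67, 78, 136]   # sum of Y within each design group
ns = [5, 5, 4, 5]            # n_i
yy = 7342                    # Y'Y, the sum of squared observations
nT = sum(ns)                 # 19
r = len(ns)                  # 4

grand = sum(totals)                                    # 354
C = grand ** 2 / nT                                    # correction for the mean
sst = yy - C                                           # total SS,  df = nT - 1
ssr = sum(t ** 2 / n for t, n in zip(totals, ns)) - C  # model SS,  df = r - 1
sse = sst - ssr                                        # error SS,  df = nT - r
print(round(ssr, 2), round(sse, 2), round(sst, 2))     # 588.22 158.2 746.42
```

The SSE of 158.2 matches the value in the lower-right corner of the generalized-inverse output, confirming SST = SSR + SSE.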

  27. ANOVA SAS Output

  28. Expected Mean Squares
  • E(MSE) = σ² always
  • E(MSR) = σ² + Σi ni(μi − μ·)²/(r − 1), so E(MSR) > E(MSE) when the group means are different
  • See KNNL p 694 – 698 for more details
  • In more complicated models, these tell us how to construct the F test

  29. F test
  • F = MSR/MSE
  • H0: μ1 = μ2 = … = μr
  • Ha: not all of the μi are equal
  • Under H0, F ~ F(r − 1, nT − r)
  • Reject H0 when F is large
  • Report the P-value
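For the cereal example, the F statistic follows directly from the sums of squares derivable from the glm output (SSR ≈ 588.22, SSE = 158.2). A quick Python check; in practice the P-value is taken from the F(r − 1, nT − r) distribution, and here it is far below 0.05:

```python
# F statistic for the one-way ANOVA, from SSR and SSE.
ssr, sse = 588.2211, 158.2   # sums of squares from the cereal example
r, nT = 4, 19                # 4 designs, 19 stores

msr = ssr / (r - 1)          # mean square for the model, df = 3
mse = sse / (nT - r)         # mean square error, df = 15
f = msr / mse
print(round(f, 2))           # 18.59  -> compare to F(3, 15)
```

Since F = 18.59 greatly exceeds typical F(3, 15) critical values, H0 (equal design means) is rejected.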

  30. Maximum Likelihood Approach

    proc glimmix data=a1;
      class design;
      model cases=design / dist=normal;
      lsmeans design;
    run;

  31. GLIMMIX Output

  32. GLIMMIX Output

  33. GLIMMIX Output

  34. Factor Effects Model
  • A reparameterization of the cell means model
  • A useful way of looking at more complicated models
  • Null hypotheses are easier to state
  • Yij = μ + τi + εij
  • The εij are iid N(0, σ²)

  35. Parameters
  • The parameters of the model are μ, τ1, τ2, … , τr, and σ²
  • The cell means model had r + 1 parameters: the r μi's and σ²
  • The factor effects model has r + 2 parameters: μ, the r τi's, and σ²
  • We cannot uniquely estimate all of these parameters

  36. An example
  • Suppose r = 3; μ1 = 10, μ2 = 20, μ3 = 30
  • What is an equivalent set of parameters for the factor effects model?
  • We need to have μ + τi = μi
  • μ = 0, τ1 = 10, τ2 = 20, τ3 = 30
  • μ = 20, τ1 = −10, τ2 = 0, τ3 = 10
  • μ = 5000, τ1 = −4990, τ2 = −4980, τ3 = −4970
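The three parameterizations in this example can be checked mechanically; a small Python sketch confirming that each (μ, τ) pair reproduces the same cell means:

```python
# Non-uniqueness of the factor effects parameters: every candidate below
# satisfies mu + tau_i = mu_i for the cell means (10, 20, 30).
target = [10.0, 20.0, 30.0]
candidates = [
    (0.0,    [10.0, 20.0, 30.0]),
    (20.0,   [-10.0, 0.0, 10.0]),
    (5000.0, [-4990.0, -4980.0, -4970.0]),
]

for mu, tau in candidates:
    assert [mu + t for t in tau] == target
print("all parameterizations give the same cell means")
```

This is exactly why a constraint on the τi (or a generalized inverse) is needed to pin down a unique solution.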

  37. Problem with factor effects?
  • These parameters are not estimable, i.e., not uniquely defined
  • There are many solutions to the least squares problem
  • The X΄X matrix for this parameterization does not have an inverse (perfect multicollinearity)
  • SAS proc glm therefore flags its parameter estimates as biased

  38. Factor effects solution
  • Put a constraint on the τi
  • It is common to assume Σi τi = 0
  • This effectively reduces the number of parameters by 1
  • Numerous other constraints are possible

  39. Consequences
  • Regardless of constraint, we always have μi = μ + τi
  • The constraint Σi τi = 0 implies
    μ = (Σi μi)/r (unweighted grand mean)
    τi = μi − μ (group effect)
  • The "unweighted" mean complicates things when the ni are not all equal; see KNNL p 702-708

  40. Hypotheses
  • H0: μ1 = μ2 = … = μr
  • Ha: not all of the μi are equal
  • are translated into
  • H0: τ1 = τ2 = … = τr = 0
  • Ha: at least one τi is not 0

  41. Estimates of parameters
  • With the constraint Σi τi = 0:
    μ̂ = (Σi Ȳi·)/r
    τ̂i = Ȳi· − μ̂
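Under the sum-to-zero constraint, the estimates follow directly from the cell means. A Python sketch using the cereal example's cell means (the group totals divided by the group sizes, both taken from the glm output):

```python
# Factor effects estimates under the sum-to-zero constraint sum_i tau_i = 0.
ybar = [73 / 5, 67 / 5, 78 / 4, 136 / 5]   # cell means: 14.6, 13.4, 19.5, 27.2

mu_hat = sum(ybar) / len(ybar)             # unweighted grand mean
tau_hat = [m - mu_hat for m in ybar]       # group effects

print(round(mu_hat, 3), [round(t, 3) for t in tau_hat])
assert abs(sum(tau_hat)) < 1e-9            # the constraint is satisfied
```

Because the ni are unequal here (5, 5, 4, 5), this unweighted grand mean of 18.675 differs from the overall sample mean 354/19 ≈ 18.63, which is the complication noted on the Consequences slide.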

  42. Solution used by SAS
  • Recall, X΄X does not have an inverse
  • We can use a generalized inverse in its place
  • (X΄X)⁻ is the standard notation
  • There are many generalized inverses, each corresponding to a different constraint

  43. Solution used by SAS
  • The (X΄X)⁻ used in proc glm corresponds to the constraint τr = 0
  • Recall that μ and the τi are not individually estimable
  • But the linear combinations μ + τi are estimable
  • These are estimated by the cell means Ȳi·

  44. Cereal package example
  • Y is the number of cases of cereal sold
  • X is the design of the cereal package
  • i = 1 to 4 levels
  • j = 1 to ni stores with design i

  45. SAS coding for X
  • The class statement generates r indicator explanatory variables
  • The ith indicator variable equals 1 if the observation is from the ith group
  • In other words, with the intercept column first, the rows of X are
    1 1 0 0 0 for design=1
    1 0 1 0 0 for design=2
    1 0 0 1 0 for design=3
    1 0 0 0 1 for design=4

  46. Some options

    proc glm data=a1;
      class design;
      model cases=design / xpx inverse solution;
    run;

  47. Output
  • The bordered matrix also contains X΄Y (the cases column) and Y΄Y

  The X'X Matrix
           Int   d1   d2   d3   d4  cases
  Int       19    5    5    4    5    354
  d1         5    5    0    0    0     73
  d2         5    0    5    0    0     67
  d3         4    0    0    4    0     78
  d4         5    0    0    0    5    136
  cases    354   73   67   78  136   7342
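Apart from the cases border (which is X΄Y and Y΄Y), this X΄X matrix can be rebuilt from the indicator coding and the group sizes alone; a Python sketch:

```python
# Rebuild X'X for the intercept-plus-indicators coding, using only the
# group sizes n_i = 5, 5, 4, 5 from the cereal example.
ns = [5, 5, 4, 5]

# One row of X per observation: intercept, then one indicator per design.
rows = []
for i, n in enumerate(ns):
    for _ in range(n):
        x = [1] + [0] * len(ns)
        x[1 + i] = 1            # indicator for design i+1
        rows.append(x)

p = len(ns) + 1                 # 5 columns: Int, d1, d2, d3, d4
xtx = [[sum(r[a] * r[b] for r in rows) for b in range(p)] for a in range(p)]
for row in xtx:
    print(row)
```

The first row comes out as 19, 5, 5, 4, 5 and the rest is diagonal in the indicators, matching the printed matrix; the column of each indicator against the intercept is just ni, which is why X΄X is singular once the intercept is included.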

  48. Output

  X'X Generalized Inverse (g2)
           Int     d1     d2     d3    d4  cases
  Int      0.2   -0.2   -0.2   -0.2     0   27.2
  d1      -0.2    0.4    0.2    0.2     0  -12.6
  d2      -0.2    0.2    0.4    0.2     0  -13.8
  d3      -0.2    0.2    0.2   0.45     0   -7.7
  d4       0      0      0      0      0      0
  cases   27.2  -12.6  -13.8   -7.7     0  158.2

  49. Output matrix
  • Actually, the printed matrix is the partitioned matrix

    (X΄X)⁻        (X΄X)⁻X΄Y
    Y΄X(X΄X)⁻     Y΄Y − Y΄X(X΄X)⁻X΄Y

  • Parameter estimates are in the upper right corner; SSE is in the lower right corner (the last column on the previous slide)

  50. Parameter estimates

  Par   Est        St Err   t       P
  Int    27.2 B    1.45     18.73   <.0001
  d1    -12.6 B    2.05     -6.13   <.0001
  d2    -13.8 B    2.05     -6.72   <.0001
  d3     -7.7 B    2.17     -3.53   0.0030
  d4     0.0 B     .        .       .
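These estimates follow from proc glm's τr = 0 constraint: the intercept is the last cell mean, and each other estimate is a difference of cell means. A Python check using the cereal cell means from the output above:

```python
# proc glm's g2 inverse corresponds to tau_r = 0, so the intercept estimates
# mu + tau_4 = mu_4, and each d_i estimates mu_i - mu_4.
ybar = [73 / 5, 67 / 5, 78 / 4, 136 / 5]   # cell means: 14.6, 13.4, 19.5, 27.2

intercept = ybar[-1]                        # last cell mean (design 4)
taus = [m - intercept for m in ybar]        # tau_i estimates, tau_4 = 0

print(round(intercept, 1), [round(t, 1) for t in taus])
# 27.2 [-12.6, -13.8, -7.7, 0.0]
```

This reproduces exactly the Est column (27.2, −12.6, −13.8, −7.7, 0.0); the "B" flags mark these as one of many possible (biased) solutions, while the estimable combinations μ + τi are the cell means themselves.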
