AMS 572 ANOVA: One-Way, Two-Way, and Multiway.
Group 3
1 – Intro & Hist. - Na Chan
2 – Basics of ANOVA - Alla Tashlitsky
3 – Data Collection - Bryan Rong
4 – Checking Assumptions in SAS - Junying Zhang
5 – 1-Way ANOVA derivation - Yingying Lin and Wenyi Dong
6 – 1-Way ANOVA in SAS - Yingying Lin and Wenyi Dong
7 – 2-Way ANOVA derivation - Peng Yang
8 – 2-Way ANOVA in SAS - Phil Caffrey and Yin Diao
9 – Multi-Way ANOVA Derivation - Michael Biro
10 – ANOVA and Regression - Cris (Jiangyang) Liu
Intro & History Na Chen
USES OF T-TEST • A one-sample location test of whether the mean of a normally distributed population has a value specified in a null hypothesis • A two-sample location test of the null hypothesis that the means of two normally distributed populations are equal
USES OF T-TEST • A test of the null hypothesis that the difference between two responses measured on the same statistical unit has a mean value of zero • A test of whether the slope of a regression line differs significantly from 0
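BACKGROUND • Comparing means among more than 2 groups with t-tests requires 3 or more pairwise tests (k(k−1)/2 tests for k groups) - Time-consuming (the number of t-tests grows quickly with the number of groups) - Inherently flawed (the probability of making at least one Type I error increases with every additional test)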
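To make the two drawbacks concrete, here is a minimal Python sketch (not part of the original slides). It counts the pairwise t-tests needed for k groups and approximates the chance of at least one Type I error. The function names are hypothetical, and the formula 1 − (1 − α)^m assumes the m tests are independent. Pairwise tests on shared data are not independent, so treat the result as a rough guide only.

```python
from math import comb


def pairwise_tests(k: int) -> int:
    """Number of two-sample t-tests needed to compare every pair of k groups."""
    return comb(k, 2)


def familywise_error(m: int, alpha: float = 0.05) -> float:
    """Chance of at least one Type I error in m tests at level alpha.

    Assumes the m tests are independent (an approximation for pairwise tests).
    """
    return 1 - (1 - alpha) ** m


for k in (3, 4, 5, 10):
    m = pairwise_tests(k)
    print(f"k={k:2d} groups: {m:2d} t-tests, "
          f"approx. familywise error {familywise_error(m):.3f}")
```

With 3 groups and α = 0.05, the approximation already gives about 0.14 instead of 0.05.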
RONALD A. FISHER • ANOVA was used informally by researchers in the 1800s • It was formally proposed by Ronald A. Fisher in 1918 • Biologist • Eugenicist • Geneticist • Statistician • “A genius who almost single-handedly created the foundations for modern statistical science” - Anders Hald • “The greatest of Darwin's successors” - Richard Dawkins
HISTORY • Fisher proposed a formal analysis of variance in his 1918 paper The Correlation Between Relatives on the Supposition of Mendelian Inheritance. • His first application of the analysis of variance was published in 1921. • It became widely known after being included in Fisher's 1925 book Statistical Methods for Research Workers.
DEFINITION • An abbreviation for: ANalysis Of VAriance • A procedure for comparing the means of k independent groups, where k is 2 or greater.
ANOVA and T-TEST • ANOVA and the t-test are similar - Both compare means between groups • 2 groups: both work • More than 2 groups: ANOVA is better
TYPES • ANOVA - analysis of variance – One way (F-ratio for 1 factor) – Two way (F-ratios for 2 factors) • ANCOVA - analysis of covariance • MANOVA - multivariate analysis of variance
APPLICATION • Biology • Microbiology • Medical Science • Computer Science • Industry • Finance
Basics of ANOVA Alla Tashlitsky
Definition • ANOVA can determine whether there is a significant relationship between variables. It is also used to determine whether a measurable difference exists between two or more sample means. • Objective: To identify important independent variables (predictor variables – xi’s) and determine how they affect the response variable. • Whether one-way, two-way, or multi-way ANOVA applies depends on the number of independent variables in the experiment that affect the outcome of the hypothesis test.
Model & Assumptions • (Simple Model) • E(εi) = 0 • Var(ε1) = Var(ε2) = … = Var(εk): homoscedasticity • All εi’s are independent • εi ~ N(0, σ²)
Classes of ANOVA • Fixed Effects: the factor levels studied are the specific levels of interest (e.g. sex, age) • Random Effects: the factor levels are a representative sample from a larger population of levels (e.g. treatments, locations, tests) • Mixed Effects: a combination of fixed and random effects
Procedure • H0: µ1 = µ2 = … = µk vs. Ha: at least one of the equalities doesn’t hold • Test statistic: $F = MSR/MSE$; under H0, F follows an F distribution with $p$ and $n-(p+1)$ degrees of freedom, where p is the number of predictors in the regression formulation (p = k − 1 indicator variables for k groups) • When there are only 2 means, p = 1 and $F = t^2$, with mean square regression $MSR = SSR/1$ and mean square error $MSE = SSE/(n-2)$ • The rejection region at significance level α is $F > f_{p,\,n-(p+1),\,\alpha}$
Regression • SST (sum of squares total) = SSR (sum of squares regression) + SSE (sum of squares error) • Sample variance: $S^2 = MSE = SSE/(n-k)$ → Unbiased estimator for $\sigma^2$
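The decomposition SST = SSR + SSE and the identity F = t² for two means can be checked numerically. The Python sketch below is an illustration only (the slides themselves use SAS). The two samples are made-up numbers, and coding the groups as x = 0 and x = 1 is one assumed way to write the two-group comparison as a simple regression.

```python
from math import isclose, sqrt
from statistics import mean

# Hypothetical measurements for two groups (values invented for illustration).
group_a = [4.1, 5.0, 4.7, 5.3, 4.4]
group_b = [5.9, 6.4, 5.6, 6.8, 6.1]

# Regression formulation: indicator x = 0 for group A, x = 1 for group B.
x = [0.0] * len(group_a) + [1.0] * len(group_b)
y = group_a + group_b
n = len(y)

# Least-squares fit of y = b0 + b1 * x.
x_bar, y_bar = mean(x), mean(y)
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * xi for xi in x]

# Sums of squares: total, regression, error.
sst = sum((yi - y_bar) ** 2 for yi in y)
ssr = sum((fi - y_bar) ** 2 for fi in fitted)
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

# One predictor: MSR = SSR/1 and MSE = SSE/(n - 2).
msr = ssr / 1
mse = sse / (n - 2)
f_stat = msr / mse

# Pooled-variance two-sample t statistic for the same data.
na, nb = len(group_a), len(group_b)
mean_a, mean_b = mean(group_a), mean(group_b)
ss_a = sum((v - mean_a) ** 2 for v in group_a)
ss_b = sum((v - mean_b) ** 2 for v in group_b)
pooled_var = (ss_a + ss_b) / (na + nb - 2)
t_stat = (mean_b - mean_a) / sqrt(pooled_var * (1 / na + 1 / nb))

print("SST = SSR + SSE:", isclose(sst, ssr + sse, rel_tol=1e-9))
print("F = t^2:", isclose(f_stat, t_stat ** 2, rel_tol=1e-9))
```

Because the fitted values under this coding are simply the two group means, MSE equals the pooled variance, which is why the two statistics agree.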
Data Collection Bryan Rong
Data Collection • 3 industries – Application Software, Credit Service, Apparel Stores • Sample 15 stocks from each industry • For each stock, we observed the last 30 days and calculated • Mean daily percentage change • Mean daily percentage range • Mean Volume
Application software • CA, Inc. [CA] • Compuware Corporation [CPWR] • Deltek, Inc. [PROJ] • Epicor Software Corporation [EPIC] • Fundtech Ltd. [FNDT] • Intuit Inc. [INTU] • Lawson Software, Inc. [LWSN] • Microsoft Corporation [MSFT] • MGT Capital Investments, Inc. [MGT] • Magic Software Enterprises Ltd. [MGIC] • SAP AG [SAP] • Sonic Foundry, Inc. [SOFO] • RealPage, Inc. [RP] • Red Hat, Inc. [RHT] • VeriSign, Inc. [VRSN]
Credit Service • Advance America, Cash Advance Centers, Inc. [AEA] • Alliance Data Systems Corporation [ADS] • American Express Company [AXP] • Asset Acceptance Capital Corp. [AACC] • Capital One Financial Corporation [COF] • CapitalSource Inc. [CSE] • Cash America International, Inc. [CSH] • Discover Financial Services [DFS] • Equifax Inc. [EFX] • Global Cash Access Holdings, Inc. [GCA] • Federal Agricultural Mortgage Corporation [AGM] • Intervest Bancshares Corporation [IBCA] • Manhattan Bridge Capital, Inc. [LOAN] • MicroFinancial Incorporated [MFI] • Moody's Corporation [MCO]
APPAREL STORES • Abercrombie & Fitch Co. [ANF] • American Eagle Outfitters, Inc. [AEO] • bebe stores, inc. [BEBE] • DSW Inc. [DSW] • Express, Inc. [EXPR] • J. Crew Group, Inc. [JCG] • New York & Company, Inc. [NWY] • Nordstrom, Inc. [JWN] • Pacific Sunwear of California, Inc. [PSUN] • The Gap, Inc. [GPS] • The Buckle, Inc. [BKE] • The Children's Place Retail Stores, Inc. [PLCE] • The Dress Barn, Inc. [DBRN] • The Finish Line, Inc. [FINL] • Urban Outfitters, Inc. [URBN]
Checking Assumptions Zhang Junying
Major Assumptions of Analysis of Variance • The Assumptions • Normal populations • Independent samples • Equal (unknown) population variances • Our Purpose • Examine these assumptions by graphical analysis of the residuals
Residual plot • Violations of the basic assumptions and model adequacy can be easily investigated by the examination of residuals. • We define the residual for observation j in treatment i as $e_{ij} = y_{ij} - \hat{y}_{ij}$, where the fitted value $\hat{y}_{ij} = \bar{y}_{i\cdot}$ is the mean of treatment i. • If the model is adequate, the residuals should be structureless; that is, they should contain no obvious patterns.
Normality • Why normal? • ANOVA is an Analysis of Variance • Analysis of two variances, more specifically, the ratio of two variances • Statistical inference is based on the F distribution, which is the distribution of the ratio of two independent chi-squared variables, each divided by its degrees of freedom • No surprise that each variance in the ANOVA ratio comes from a parent normal distribution • Normality is only needed for statistical inference.
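As a small illustration of how such residuals are obtained, the Python sketch below subtracts each treatment's mean from its observations. It assumes the usual one-way definition, in which the fitted value for every observation is its treatment mean. The data values and the helper name `residuals` are hypothetical.

```python
from statistics import mean

# Hypothetical observations y_ij keyed by treatment i (values invented).
data = {
    "treatment_1": [2.1, 1.8, 2.4],
    "treatment_2": [3.0, 3.3, 2.7, 3.2],
}


def residuals(groups):
    """Return e_ij = y_ij - (mean of treatment i) for every observation."""
    result = {}
    for name, values in groups.items():
        group_mean = mean(values)  # fitted value for every observation in the group
        result[name] = [y - group_mean for y in values]
    return result


for name, resid in residuals(data).items():
    print(name, [round(e, 3) for e in resid])
```

Within each treatment the residuals sum to zero by construction, so any pattern worth looking for is in their spread and shape, not their average.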
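SAS code for getting residuals
  /* Import the spreadsheet into a SAS data set named stock. */
  PROC IMPORT datafile = 'C:\Users\junyzhang\Desktop\mydata.xls' out = stock;
  RUN;
  PROC PRINT DATA=stock;
  RUN;
  /* Fit the one-way model; save fitted values (yhat) and residuals (resid). */
  PROC GLM data=stock;
    CLASS indu;
    MODEL adpcdata = indu;
    OUTPUT out=stock1 p=yhat r=resid;
  RUN;
  PROC PRINT DATA=stock1;
  RUN;
Normality test The normal plot of the residuals is used to check the normality assumption.
  /* NORMAL requests the tests for normality; PLOT requests the normal probability plot. */
  proc univariate data=stock1 normal plot;
    var resid;
  run;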
Normality Tests

Tests for Normality
Test                  Statistic          p Value
Shapiro-Wilk          W     0.731203     Pr < W     <0.0001
Kolmogorov-Smirnov    D     0.206069     Pr > D     <0.0100
Cramer-von Mises      W-Sq  1.391667     Pr > W-Sq  <0.0050
Anderson-Darling      A-Sq  7.797847     Pr > A-Sq  <0.0050

Tests for Normality
Test                  Statistic          p Value
Shapiro-Wilk          W     0.989846     Pr < W      0.6521
Kolmogorov-Smirnov    D     0.057951     Pr > D     >0.1500
Cramer-von Mises      W-Sq  0.03225      Pr > W-Sq  >0.2500
Anderson-Darling      A-Sq  0.224264     Pr > A-Sq  >0.2500

[Figure: Normal Probability Plot; vertical axis from -2.1 to 2.3, horizontal axis from -2 to +2]
[Figure: Normal Probability Plot; vertical axis from 0.25 to 8.25]
Independence • Independent observations • No correlation between error terms • No correlation between independent variables and error • Positively correlated data inflates the standard error • The estimates of the treatment means are more accurate than the standard error shows.
SAS code for independence test The plot of the residuals against the factor is used to check independence.
  /* Plot residuals against the industry factor. */
  proc plot data=stock1;
    plot resid*indu;
  run;
Homogeneity of Variances • Eisenhart (1947) describes the problem of unequal variances as follows • The ANOVA model is based on the ratio of the mean squares of the factors to the residual mean square • The residual mean square is the unbiased estimator of σ², the variance of a single observation • The between-treatment mean square takes into account not only the differences between observations, σ², just like the residual mean square, but also the variance between treatments • If there were non-constant variance among treatments, we would have to replace the residual mean square with some overall variance, σa², and a treatment variance, σt², which is some weighted version of σa² • The “neatness” of ANOVA is lost
SAS code for Homogeneity of Variances test The plot of the residuals against the fitted values is used to check the constant variance assumption.
  /* Plot residuals against fitted values. */
  proc plot data=stock1;
    plot resid*yhat;
  run;
Results for our data • Normal populations • Nearly independent samples • Equal (unknown) population variances So we can employ ANOVA to analyze our data.
1-Way ANOVA Derivation and SAS Yingying Lin & Wenyi Dong
Derivation – 1-Way ANOVA • Hypotheses • H0: μ = μ1 = μ2 = μ3 = … = μn • H1: μi ≠ μj for some i, j • We assume that the jth observation in group i is related to the mean by xij = μ + (μi – μ) + εij, where εij is a random noise term. • We wish to separate the variability of the individual observations into parts due to differences between groups and individual variability
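Derivation – 1-Way ANOVA – Cont’ • We can show that $\sum_{i=1}^{n}\sum_{j=1}^{n_i}(x_{ij}-\bar{x})^2 = \sum_{i=1}^{n} n_i(\bar{x}_i-\bar{x})^2 + \sum_{i=1}^{n}\sum_{j=1}^{n_i}(x_{ij}-\bar{x}_i)^2$, where $n_i$ is the size of group i, $\bar{x}_i$ is the mean of group i, and $\bar{x}$ is the mean of all $N = \sum_{i=1}^{n} n_i$ observations • Using the above equation, we define $SS_{between} = \sum_{i=1}^{n} n_i(\bar{x}_i-\bar{x})^2$ and $SS_{within} = \sum_{i=1}^{n}\sum_{j=1}^{n_i}(x_{ij}-\bar{x}_i)^2$, with mean sums of squares $MSS_{between} = SS_{between}/(n-1)$ and $MSS_{within} = SS_{within}/(N-n)$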
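Derivation – 1-Way ANOVA – Cont’ • Given the distributions of the MSS values, we can reject the null hypothesis if the between-group variance is significantly higher than the within-group variance. That is, we form the ratio of the between-group to the within-group mean sum of squares, $F = MSS_{between}/MSS_{within}$ • We reject the null hypothesis if $F > f_{n-1,\,N-n,\,\alpha}$, where n is the number of groups and N is the total number of observations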
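The test statistic can be computed by hand in a few lines. The Python sketch below is an illustration under stated assumptions, not the SAS procedure used in these slides. The function name `one_way_anova` and the three small samples are hypothetical. The critical value $f_{n-1,\,N-n,\,\alpha}$ is not computed here and would still come from an F table or statistical software.

```python
from statistics import mean


def one_way_anova(groups):
    """Return (ss_between, ss_within, f_stat) for a list of samples.

    Uses n = number of groups and N = total number of observations,
    so the degrees of freedom are n - 1 and N - n.
    """
    n_groups = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [mean(g) for g in groups]

    # Between-group sum of squares: group sizes times squared mean deviations.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Within-group sum of squares: deviations from each group's own mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)

    ms_between = ss_between / (n_groups - 1)
    ms_within = ss_within / (n_total - n_groups)
    return ss_between, ss_within, ms_between / ms_within


# Hypothetical samples from three groups (values invented for illustration).
samples = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
ss_b, ss_w, f_stat = one_way_anova(samples)
print(f"SS between = {ss_b:.2f}, SS within = {ss_w:.2f}, F = {f_stat:.2f}")
```

For these made-up samples the between-group sum of squares is 26 and the within-group sum is 6, giving F = (26/2)/(6/6) = 13 on 2 and 6 degrees of freedom.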
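Brief Summary Statistics • Code
  /* BY-group processing requires the data to be sorted by industry first. */
  proc sort data=stock;
    by industry;
  run;
  proc means data=stock maxdec=5 n mean std;
    by industry;
    var ADPC;
  run;
• Gets simple summary statistics (sample size, sample mean, and SD for each industry) with at most 5 decimal places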
Brief Summary Statistics • Output