Multiple Regression
Objectives
• Maximize the predictive power of the independent variables as represented in the variate.
• Compare two or more sets of independent variables to ascertain the predictive power of each variate.
Explanation
[Diagram: independent variables X1, X2 and X3 and the predicted value Y']
• The most direct interpretation of the regression variate is a determination of the relative importance of each independent variable in the prediction of the dependent measure.
• Assess the nature of the relationships between the independent variables and the dependent variable.
• Provide insight into the relationships among independent variables.
Sample Problem (Leslie Salt Property): Finding a Fair Price for Land
Variables: ELEVATION, DISTANCE, COUNTY, SEWER, FLOOD, PRICE, DATE, SIZE
> summary(model)

Call:
lm(formula = leslie_salt[, 1] ~ leslie_salt[, 4] + leslie_salt[, 5] + leslie_salt[, 6])

Residuals:
    Min      1Q  Median      3Q     Max
-9.6076 -3.2506 -0.0281  2.8770 20.2776

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)      21.2787636  2.9203157   7.286 7.75e-08 ***
leslie_salt[, 4]  0.5614588  0.2515472   2.232 0.034107 *
leslie_salt[, 5] -0.0005871  0.0004460  -1.316 0.199129
leslie_salt[, 6]  0.1836824  0.0421712   4.356 0.000172 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.559 on 27 degrees of freedom
Multiple R-squared: 0.5327, Adjusted R-squared: 0.4807
F-statistic: 10.26 on 3 and 27 DF, p-value: 0.000111
Assumptions
• Linearity of the dependent variable in terms of the independent variables.
Linearity (cts.)
When the relationship is curved, a higher-order term of the independent variable should be included. In that case, define a new variable by taking the square (for this case) of that independent variable and use the squared values in the regression.
Use: Visual inspection
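The squared-term idea can be sketched in R as follows. This is a minimal illustration on simulated data, not the Leslie Salt data; the names x, y, x_sq, fit_linear and fit_quad are made up for the example.

```r
# Simulated data with a curved relationship (illustrative only)
set.seed(42)
x <- runif(100, 0, 10)
y <- 2 + 1.5 * x - 0.3 * x^2 + rnorm(100, sd = 1)

# Straight-line fit: its residuals show a systematic curve
fit_linear <- lm(y ~ x)

# Define a new variable holding the squared values and add it to the model
x_sq <- x^2
fit_quad <- lm(y ~ x + x_sq)

# Visual inspection: residuals against fitted values for the linear fit
if (interactive()) {
  plot(fitted(fit_linear), resid(fit_linear),
       xlab = "Fitted values", ylab = "Residuals")
  abline(h = 0, lty = 2)
}

r2_linear <- summary(fit_linear)$r.squared
r2_quad <- summary(fit_quad)$r.squared
```

A curved band in the residual plot of the linear fit is the visual cue that a squared term may be needed.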
More troublesome is the MODERATOR effect
• If an independent-dependent variable relationship is affected by another independent variable, this situation is termed a moderator effect.
• The most common moderator effect in multiple regression is the bilinear moderator, in which the slope of the relationship of one independent variable (X1) changes across values of the moderator variable (X2).
Example
Family income (X2) can be a positive moderator of the relationship between family size (X1) and credit card usage (Y). Then the expected change in credit card usage based on family size (the slope on X1) might be lower for families with low incomes and higher for families with high incomes. Without the moderator effect we are assuming that family size has a constant effect on credit card usage.
Adding Moderator Effect
The idea comes from observing a self-moderator effect. If a variable has a moderator effect onto itself, then we would assume a nonlinear (second-degree) relationship with the dependent variable. Thus, if there is a moderator effect, add X1X2 as an independent variable to the regression equation. But we will return to this!
Assumption: Homoscedasticity
Constant variance of the error terms.
Heteroscedasticity (cts.): within variables, in residuals
Heteroscedasticity (cts.)
• Use: Levene test.
Levene test: tests the equality of variances. Levene's test works by testing the null hypothesis that the variances of the groups are the same. The reported p-value is the probability of seeing differences in variance at least as large as those observed if the null hypothesis were true. If the p-value is below a selected significance level (usually 5%), equality of variances is rejected and the differences are considered too great to usefully apply parametric tests that assume equal variances.
In SPSS it is reported. In R: in the «lawstat» library, use the levene.test() function.
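The logic of the test can be sketched in base R without any extra library. This is an illustration on simulated residuals split into three hypothetical groups; it uses the median-centred variant (often called Brown-Forsythe), which is an assumption of the sketch rather than something the slides specify.

```r
# Simulated residuals in three groups; the third group has a larger spread
set.seed(1)
group <- factor(rep(c("low", "mid", "high"), each = 20))
res <- c(rnorm(20, sd = 1), rnorm(20, sd = 1), rnorm(20, sd = 4))

# Absolute deviation of each residual from its own group median
abs_dev <- abs(res - ave(res, group, FUN = median))

# Levene-type test: one-way ANOVA on the absolute deviations
levene <- anova(lm(abs_dev ~ group))
levene_f <- levene[["F value"]][1]
levene_p <- levene[["Pr(>F)"]][1]

# A p-value below 0.05 leads to rejecting equal variances
reject_equal_variance <- levene_p < 0.05
```

A small p-value here points to heteroscedasticity across the groups.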
Assumptions
• Independence of the error terms.
Check the coordinates!
Independence of Error Terms
• Use: Durbin-Watson
The value of the Durbin-Watson statistic ranges from 0 to 4. As a general rule of thumb, the residuals are uncorrelated if the Durbin-Watson statistic is approximately 2. A value close to 0 indicates strong positive correlation, while a value close to 4 indicates strong negative correlation.
• In SPSS, Durbin-Watson is reported.
• In R, under the «lmtest» library, use dwtest():
dwtest(formula, order.by = NULL, alternative = c("greater", "two.sided", "less"), iterations = 15, exact = NULL, tol = 1e-10, data = list())
For our regression model:
> dwtest(model)

        Durbin-Watson test

data:  model
DW = 2.3762, p-value = 0.7783
alternative hypothesis: true autocorrelation is greater than 0
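To see where the 0 to 4 range comes from, the statistic itself can be computed by hand from the residuals. The sketch below uses simulated data and made-up names (fit, e, dw); it reproduces only the DW value, not the p-value that dwtest() reports.

```r
# Simulated regression with independent errors (illustrative only)
set.seed(7)
x <- 1:50
y <- 3 + 0.5 * x + rnorm(50)
fit <- lm(y ~ x)

# Durbin-Watson statistic: squared successive differences of the
# residuals divided by the residual sum of squares
e <- resid(fit)
dw <- sum(diff(e)^2) / sum(e^2)
dw
```

Values near 2 are consistent with uncorrelated residuals, as stated in the rule of thumb above.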
Assumptions
• Normality of the error term distribution.
Diagnostics

Call:
lm(formula = leslie_salt[, 1] ~ leslie_salt[, 4] + leslie_salt[, 5] + leslie_salt[, 6])

Residuals:
    Min      1Q  Median      3Q     Max
-9.6076 -3.2506 -0.0281  2.8770 20.2776

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)      21.2787636  2.9203157   7.286 7.75e-08 ***
leslie_salt[, 4]  0.5614588  0.2515472   2.232 0.034107 *
leslie_salt[, 5] -0.0005871  0.0004460  -1.316 0.199129
leslie_salt[, 6]  0.1836824  0.0421712   4.356 0.000172 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.559 on 27 degrees of freedom
Multiple R-squared: 0.5327, Adjusted R-squared: 0.4807
F-statistic: 10.26 on 3 and 27 DF, p-value: 0.000111
Identifying Influential Observations
• Observations that lie outside the general patterns of the data set
• Observations that strongly influence regression results
Types of Influential Observations
• Outliers – observations that have large residuals (based on dependent variables)
• Leverage points – observations that are distinct from the remaining observations based on their independent variable values.
• Influential observations – including all observations that have a disproportionate effect on the regression results.
Outliers
• Typical boxplot test.
• In the «car» library:
> outlierTest(model)
  rstudent unadjusted p-value Bonferonni p
2 3.704906          0.0010527     0.032634
Leverage
An observation with an extreme value on a predictor variable is called a point with high leverage. Leverage is a measure of how far an IV deviates from its mean. These leverage points can have an unusually large effect on the estimate of regression coefficients. We hope to see very few (if any) points in the plot representing high values of leverage. High leverage can also point toward outliers, which are defined as observations with large residuals in regression. You should say something about the number of cases that appear to represent high leverage.
Leverage: Cutoff point (defined in terms of p and n)
p: # of independent variables
n: # of observations
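Leverage values are available in base R through hatvalues(). The sketch below uses simulated data and flags points above 2(p + 1)/n, twice the average leverage. That threshold is a commonly quoted rule of thumb and is an assumption here; it may differ from the cutoff used on the slide.

```r
# Simulated data with p = 3 independent variables (illustrative only)
set.seed(21)
n <- 31
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- 1 + dat$x1 - 0.5 * dat$x2 + rnorm(n)

fit <- lm(y ~ x1 + x2 + x3, data = dat)
p <- length(coef(fit)) - 1   # number of independent variables

# Leverage of each observation: diagonal of the hat matrix
h <- hatvalues(fit)

# Assumed rule of thumb: flag points above twice the mean leverage
cutoff <- 2 * (p + 1) / n
high_leverage <- which(h > cutoff)
length(high_leverage)        # number of cases flagged as high leverage
```

The average leverage always equals (p + 1)/n for a model with an intercept, which is why cutoffs are usually written as a multiple of that ratio.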
[Plot: cooks.distance(model) against observation index]
Cook's Distance: Cutoff point: 4/(n - p - 1)
p: # of independent variables
n: # of observations
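R-Code
# Influential Observations
library(car)  # provides avPlots() and influencePlot()
# added variable plots
avPlots(model)
# Cook's D plot
# identify D values > 4/(n-k-1)
# length(model$coefficients) is k + 1 because it includes the intercept
cutoff <- 4 / (nrow(leslie_salt) - length(model$coefficients))
plot(model, which = 4, cook.levels = cutoff)
# Influence Plot (older versions of car used id.method = "identify")
influencePlot(model, id = list(method = "identify"), main = "Influence Plot", sub = "Circle size is proportional to Cook's distance")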
[Plot: Residuals vs Leverage. Standardized residuals against leverage, with Cook's distance contours at 0.5 and 1, for lm(leslie_salt[, 1] ~ leslie_salt[, 4] + leslie_salt[, 6] + leslie_salt[, 7 ...)]
Assessing Multicollinearity
A key issue in interpreting the regression variate is the correlation among the independent variables. Our task in a regression analysis includes the following:
• Assess the degree of multicollinearity
• Determine its impact on results
• Apply the necessary remedies if needed
Assess the degree of multicollinearity
• The simplest and most obvious way: identifying collinearity in the correlation matrix. Check for correlations above 0.90.
• A direct measure of multicollinearity is tolerance (1/VIF): the amount of variability of the selected independent variable not explained by the other independent variables.
Computation:
• Take each independent variable. Treat it as the dependent variable and regress it on the other independent variables. Compute R².
• Tolerance is then 1 - R².
• For example, if the other variables explain 25% of an independent variable, then the tolerance of this variable is 75%.
Tolerance should be more than 10%.
> 1/vif(model)
leslie_salt[, 4] leslie_salt[, 6] leslie_salt[, 7] leslie_salt[, 8]
       0.8081325        0.9959058        0.7650806        0.7715437
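The computation steps above can be sketched in base R as follows. The data are simulated and the names X, x1, x2, x3 and tolerance are made up for the illustration; on a fitted model, 1/vif(model) from the «car» library should report the same quantity.

```r
# Simulated predictors; x2 is deliberately correlated with x1
set.seed(3)
n <- 200
x1 <- rnorm(n)
x2 <- 0.6 * x1 + rnorm(n, sd = 0.8)
x3 <- rnorm(n)
X <- data.frame(x1, x2, x3)

# For each predictor: regress it on the others, then take 1 - R^2
tolerance <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  aux <- lm(reformulate(others, response = v), data = X)
  1 - summary(aux)$r.squared
})

vif_manual <- 1 / tolerance
round(tolerance, 3)
```

Any predictor whose tolerance falls below 0.10 would be a candidate for the remedies mentioned earlier.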
Further…
• See http://www.statmethods.net/stats/rdiagnostics.html for diagnostic tests with R.
Partial Correlation
• A partial correlation coefficient is a way of expressing the unique relationship between the criterion and a predictor. Partial correlation represents the correlation between the criterion and a predictor after common variance with other predictors has been removed from both the criterion and the predictor of interest.
# t value of each coefficient
t.values <- model$coeff / sqrt(diag(vcov(model)))
# absolute value of the partial correlation; its sign is the sign of the coefficient
partcorr <- sqrt((t.values^2) / ((t.values^2) + model$df.residual))
partcorr
leslie_salt[, 4] leslie_salt[, 6] leslie_salt[, 7] leslie_salt[, 8]
       0.6562662        0.8043296        0.6043579        0.5740840
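The definition and the t-value shortcut can be checked against each other. The sketch below uses simulated data with two hypothetical predictors x1 and x2: it removes x2 from both the criterion and x1, correlates the two sets of residuals, and compares the result with the t-based formula.

```r
# Simulated data (illustrative only)
set.seed(11)
n <- 100
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n)
y <- 1 + 2 * x1 - 1 * x2 + rnorm(n)

fit <- lm(y ~ x1 + x2)

# Shortcut: partial correlation of x1 from its t value
t_x1 <- coef(summary(fit))["x1", "t value"]
pr_from_t <- sqrt(t_x1^2 / (t_x1^2 + df.residual(fit)))

# Definition: correlate y and x1 after removing x2 from both
pr_from_resid <- cor(resid(lm(y ~ x2)), resid(lm(x1 ~ x2)))

c(shortcut = pr_from_t, definition = abs(pr_from_resid))
```

The two numbers agree up to rounding; the residual-based version also keeps the sign.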
Part (Semi-partial) Correlation • A semipartial correlation coefficient represents the correlation between the criterion and a predictor that has been residualized with respect to all other predictors in the equation. Note that the criterion remains unaltered in the semipartial. Only the predictor is residualized. After removing variance that the predictor has in common with other predictors, the semipartial expresses the correlation between the residualized predictor and the unaltered criterion. An important advantage of the semipartial is that the denominator of the coefficient (the total variance of the criterion, Y) remains the same no matter which predictor is being examined. This makes the semipartial very interpretable. The square of the semipartial can be interpreted as the proportion of the criterion variance associated uniquely with the predictor. It is also possible to use the semipartial to fully deconstruct the variance components in a regression analysis.
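A minimal sketch of the semipartial on simulated data follows; the names x1, x2, y, fit_full and fit_reduced are made up for the example. Only the predictor is residualized, and the squared semipartial is compared with the gain in R² from adding that predictor.

```r
# Simulated data (illustrative only)
set.seed(11)
n <- 100
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n)
y <- 1 + 2 * x1 - 1 * x2 + rnorm(n)

fit_full <- lm(y ~ x1 + x2)
fit_reduced <- lm(y ~ x2)

# Residualize the predictor only; the criterion y stays unaltered
x1_resid <- resid(lm(x1 ~ x2))
sr <- cor(y, x1_resid)

# Squared semipartial = share of criterion variance unique to x1
r2_gain <- summary(fit_full)$r.squared - summary(fit_reduced)$r.squared
c(sr_squared = sr^2, r2_gain = r2_gain)
```

The match between the two values is what allows the semipartials to be used for decomposing the explained variance.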
Project (Step 1):
Go to the web page: http://luna.cas.usf.edu/~mbrannic/files/regression/Partial.html
Replicate the results there using a dataset of your own. Be creative in problem formulation. Data may be imaginary. Use at least 5 independent variables.
Comparing Regression Models
• In multiple regression the hardest problem is deciding which variables to enter into the equation, even after checking assumptions such as multicollinearity.
• Adjusted R² is not a proper way of comparing models.
• Next we learn a better way.
Stepwise Regression
• Start with the most basic model. Pick your favourite independent variable and construct the model. Test it. Remember the correlation matrix (price in logs).
Our focus is the improvement in RSS, so we need the residual sum of squares. But it is not given in the report directly (it is given in SPSS).
> anova(m1)
Analysis of Variance Table

Response: leslie_salt[, 1]
                 Df Sum Sq Mean Sq F value    Pr(>F)
leslie_salt[, 6]  1 5.9282  5.9282  18.124 0.0001982 ***
Residuals        29 9.4858  0.3271
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Call:
lm(formula = leslie_salt[, 1] ~ leslie_salt[, 6])

Residuals:
     Min       1Q   Median       3Q      Max
-1.12046 -0.34364  0.04853  0.39719  1.00081

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)      3.322336   0.269975  12.306  4.9e-13 ***
leslie_salt[, 6] 0.018124   0.004257   4.257 0.000198 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.5719 on 29 degrees of freedom
Multiple R-squared: 0.3846, Adjusted R-squared: 0.3634
F-statistic: 18.12 on 1 and 29 DF, p-value: 0.0001982
• Now let's add another variable, say SEWER, and assume we have done all testing.

Call:
lm(formula = leslie_salt[, 1] ~ leslie_salt[, 6] + leslie_salt[, 5])

Residuals:
     Min       1Q   Median       3Q      Max
-1.21681 -0.21980  0.08597  0.29875  0.81520

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)       3.442e+00  2.442e-01  14.093 3.07e-14 ***
leslie_salt[, 6]  1.643e-02  3.841e-03   4.278 0.000199 ***
leslie_salt[, 5] -1.105e-04  3.797e-05  -2.910 0.007013 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.51 on 28 degrees of freedom
Multiple R-squared: 0.5275, Adjusted R-squared: 0.4937
F-statistic: 15.63 on 2 and 28 DF, p-value: 2.766e-05

Analysis of Variance Table

Response: leslie_salt[, 1]
                 Df Sum Sq Mean Sq F value    Pr(>F)
leslie_salt[, 6]  1 5.9282  5.9282 22.7903 5.146e-05 ***
leslie_salt[, 5]  1 2.2024  2.2024  8.4671  0.007013 **
Residuals        28 7.2833  0.2601
---
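How much improvement do we have?
Our aim is to check whether the improvement in RSS is statistically significant or not. Define
F = [(RSS_old - RSS_new) / q] / [RSS_new / df_new]
where q is the number of added variables and df_new is the residual degrees of freedom of the new model. The numerator measures the average improvement as we add a new variable (we may add a bunch of new variables), and the denominator scales the improvement by the residual mean square of the new model. The degrees of freedom of the statistic are (q, df_new).
In our case, F = [(9.4858 - 7.2833) / 1] / [7.2833 / 28] ≈ 8.47 with (1, 28) degrees of freedom, which agrees with the F value of 8.4671 (p = 0.007013) reported for leslie_salt[, 5] in the anova table. So the new model is superior; the improvement is statistically significant.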
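The comparison can be reproduced in R from the residual sums of squares in the two anova tables (9.4858 on 29 df for the one-variable model, 7.2833 on 28 df after adding SEWER). The sketch assumes the standard partial F statistic, in which the improvement per added variable is divided by the residual mean square of the larger model; the variable names are made up for the illustration.

```r
# Residual sums of squares and degrees of freedom read off the anova tables
rss_old <- 9.4858; df_old <- 29   # price ~ column 6
rss_new <- 7.2833; df_new <- 28   # price ~ column 6 + column 5 (SEWER)

# Number of added variables
q <- df_old - df_new

# Partial F statistic (assumed standard form) and its p-value
f_stat <- ((rss_old - rss_new) / q) / (rss_new / df_new)
p_value <- pf(f_stat, q, df_new, lower.tail = FALSE)

round(c(F = f_stat, p = p_value), 4)
```

With both fitted lm objects at hand, anova() called on the smaller and the larger model performs the same test directly.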
Back to the moderator effect.
To test the moderator effect we use Y = b0 + b1X1 + b2X2 as the simple model and Y = b0 + b1X1 + b2X2 + b3X1X2 as the extended model, and then decide accordingly.
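A minimal sketch of this test in R follows. The data are simulated to mimic the family size, income and credit card usage example, with an interaction built in; all names and coefficient values are made up for the illustration.

```r
# Simulated data with a built-in moderator (interaction) effect
set.seed(5)
n <- 200
family_size <- sample(1:6, n, replace = TRUE)
income <- runif(n, 20, 120)
usage <- 1 + 0.2 * family_size + 0.01 * income +
  0.03 * family_size * income + rnorm(n)

# Simple model: no moderator effect
m_simple <- lm(usage ~ family_size + income)

# Extended model: adds the product term X1X2
m_ext <- lm(usage ~ family_size + income + family_size:income)

# F test on the improvement in RSS from adding the product term
cmp <- anova(m_simple, m_ext)
f_inter <- cmp[["F"]][2]
p_inter <- cmp[["Pr(>F)"]][2]
cmp
```

A small p-value in the second row of the table supports keeping the product term, that is, a moderator effect of income on the slope of family size.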
Project (Steps 2, 3 and 4):
• Find the best regression equation for your project.
• Test moderator effects.
• Test mediation effects.