Statistics for the Social Sciences

Statistics for the Social Sciences Psychology 340 Fall 2006 Prediction cont.

Outline (for week) • Simple bi-variate regression, least-squares fit line • The general linear model • Residual plots • Using SPSS • Multiple regression • Comparing models (Δr2, the change in r2) • Using SPSS

From last time • Review of last time: Y = intercept + slope(X) + error • [Figure: plot with Y axis (1 to 6) and X axis (1 to 6)]

From last time • [Figure: plot with Y axis (1 to 6) and X axis (1 to 6)] • The sum of the residuals should always equal 0. • The least squares regression line splits the data so that the residuals above the line balance the residuals below it. • Additionally, we want the residuals to be randomly distributed. • There should be no pattern to the residuals. • If there is a pattern, it may suggest that there is more than a simple linear relationship between the two variables.
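The zero-sum property of the residuals can be checked numerically. The sketch below is only an illustration: the six data points are made up (on the 1-to-6 scale of the plot), and the helper name least_squares is hypothetical, not something from the course materials.

```python
# Hypothetical data on the 1-to-6 scale used in the slides.
x = [1, 2, 3, 4, 5, 6]
y = [1, 3, 2, 5, 4, 6]

def least_squares(x, y):
    """Return (intercept, slope) of the least-squares line for y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

intercept, slope = least_squares(x, y)

# Residual = observed Y minus the Y predicted by the line.
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# The sum is zero apart from floating-point rounding.
print(abs(sum(residuals)) < 1e-9)
```

Any other data set should give the same result, as long as the line includes an intercept.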

Seeing patterns in the error • Residual plots • Useful tools to examine the relationship even further. • These are basically scatterplots of the residuals (often transformed into z-scores) against the explanatory (X) variable (or sometimes against the response variable)

Seeing patterns in the error • [Figure: scatterplot and residual plot] • The scatter plot shows a nice linear relationship. • The residual plot shows that the residuals fall randomly above and below the line. Critically, there doesn't seem to be a discernible pattern to the residuals.

Seeing patterns in the error • [Figure: scatterplot and residual plot] • The scatter plot also shows a nice linear relationship. • The residual plot shows that the residuals get larger as X increases. • This suggests that the variability around the line is not constant across values of X. • This is referred to as a violation of homogeneity of variance.
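A rough numerical counterpart to this residual plot is to compare the size of the residuals at low and high values of X. The sketch below uses made-up data in which the scatter around the line grows with X. The helper names are hypothetical, and comparing two halves of the data is an informal check, not a formal test of homogeneity of variance.

```python
def least_squares(x, y):
    """Return (intercept, slope) of the least-squares line for y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def mean_abs(values):
    """Average absolute size of a list of residuals."""
    return sum(abs(v) for v in values) / len(values)

# Made-up data: the underlying line is y = 2x, and the scatter around it
# is proportional to x (above the line for odd x, below it for even x).
x = list(range(1, 11))
y = [2 * xi + (0.3 * xi if xi % 2 else -0.3 * xi) for xi in x]

intercept, slope = least_squares(x, y)
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

low_spread = mean_abs(residuals[:5])    # residuals for X = 1..5
high_spread = mean_abs(residuals[5:])   # residuals for X = 6..10
print(low_spread < high_spread)  # prints True
```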

Seeing patterns in the error • [Figure: scatterplot and residual plot] • The scatter plot shows what may be a linear relationship. • The residual plot suggests that a non-linear relationship may be more appropriate (see how a curved pattern appears in the residual plot).
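The curved pattern can be reproduced with a small made-up example: fit a straight line to data that actually follow a curve, then look at the signs of the residuals in order of X. The data (Y equal to X squared) and the helper name are assumptions chosen only for illustration.

```python
def least_squares(x, y):
    """Return (intercept, slope) of the least-squares line for y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Made-up curved data: Y is the square of X.
x = [1, 2, 3, 4, 5, 6]
y = [xi ** 2 for xi in x]

intercept, slope = least_squares(x, y)
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Residuals are positive at both ends and negative in the middle:
# the U-shaped pattern that a residual plot would show.
signs = "".join("+" if r > 0 else "-" for r in residuals)
print(signs)  # prints +----+
```

A random scatter of plus and minus signs would be expected if the straight line were adequate.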

Using SPSS • Regression in SPSS • Variables (explanatory and response) are entered into columns • Each row is a unit of analysis (e.g., a person)

Regression in SPSS • Analyze: Regression, Linear

Regression in SPSS • Enter: • Predicted (criterion) variable into the Dependent Variable field • Predictor variable into the Independent Variable field

Regression in SPSS • The variables in the model • r • r2 • We’ll get back to these numbers in a few weeks • Unstandardized coefficients • Slope (indep var name) • Intercept (constant)

Regression in SPSS • Standardized coefficient • β (indep var name) • Recall that r = standardized β in bi-variate regression
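The claim that r equals the standardized coefficient in bi-variate regression can be checked by converting both variables to z-scores and refitting. This is a minimal sketch with made-up data. The helper names slope_of and z_scores are hypothetical, and sample standard deviations (n - 1 in the denominator) are assumed throughout.

```python
from statistics import mean, stdev

# Made-up data; Y is on a much larger scale than X.
x = [1, 2, 3, 4, 5, 6]
y = [10, 30, 20, 50, 40, 60]

def slope_of(x, y):
    """Least-squares slope for predicting y from x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def z_scores(values):
    """Convert raw scores to z-scores using the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

zx, zy = z_scores(x), z_scores(y)

b = slope_of(x, y)          # unstandardized slope (raw-score units)
beta = slope_of(zx, zy)     # standardized slope (z-score units)

# Pearson correlation computed from the z-scores.
r = sum(a * c for a, c in zip(zx, zy)) / (len(x) - 1)

print(abs(beta - r) < 1e-9)                          # beta matches r
print(abs(beta - b * stdev(x) / stdev(y)) < 1e-9)    # beta = b * SDx / SDy
```

The second check shows how the two coefficients relate: the standardized slope is the raw slope rescaled by the ratio of the standard deviations.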

Multiple Regression • Typically researchers are interested in predicting with more than one explanatory variable • In multiple regression, an additional predictor variable (or set of variables) is used to predict the residuals left over from the first predictor.

Multiple Regression • Bi-variate regression prediction models Y = intercept + slope (X) + error

Multiple Regression • Bi-variate regression prediction models Y = intercept + slope (X) + error, where intercept + slope (X) is the “fit” and error is the “residual” • Multiple regression prediction models

Multiple Regression • Multiple regression prediction models • [Diagram: the prediction model labeled with First Explanatory Variable, Second Explanatory Variable, Third Explanatory Variable, Fourth Explanatory Variable, and whatever variability is left over]

Multiple Regression • Predict test performance based on: • Study time • Test time • What you eat for breakfast • Hours of sleep • [Diagram: the prediction model labeled with First Explanatory Variable, Second Explanatory Variable, Third Explanatory Variable, Fourth Explanatory Variable, and whatever variability is left over]
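Outside SPSS, the same kind of four-predictor model can be fit with ordinary least squares. The sketch below assumes NumPy is available and uses simulated data. The variable names, the coefficients used to generate the scores, and the 0/1 coding of breakfast are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200  # simulated participants

# Simulated explanatory variables (hypothetical units and distributions).
study_time = rng.normal(10, 3, n)     # hours studied
test_time = rng.normal(50, 10, n)     # minutes spent on the test
breakfast = rng.integers(0, 2, n)     # hypothetical coding: 0 = none, 1 = ate breakfast
sleep = rng.normal(7, 1, n)           # hours of sleep

# Simulated response: breakfast is deliberately given no effect.
performance = (40 + 2.0 * study_time + 0.1 * test_time
               + 3.0 * sleep + rng.normal(0, 5, n))

# Design matrix: a column of ones for the intercept, then one column per predictor.
X = np.column_stack([np.ones(n), study_time, test_time, breakfast, sleep])
coefs, *_ = np.linalg.lstsq(X, performance, rcond=None)

fit = X @ coefs                 # the part explained by the four predictors
residual = performance - fit    # whatever variability is left over

print(coefs)  # intercept followed by one slope per explanatory variable
```

Each observed score is exactly fit plus residual, which mirrors the "fit" and "residual" labels on the bi-variate model.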

Multiple Regression • Predict test performance based on: • Study time • Test time • What you eat for breakfast • Hours of sleep • Typically your analysis consists of testing multiple regression models to see which “fits” best (comparing r2s of the models) • [Diagram: candidate models compared, one versus another] • For example:

Multiple Regression Model #1: Some co-variance between the two variables • If we know the total study time, we can predict 36% of the variance in test performance • R2 for Model = .36 • 64% variance unexplained • [Diagram: response variable (total variability in test performance); total study time, r = .6]

Multiple Regression Model #2: Add test time to the model • Little co-variance between test performance and test time • We can explain more of the variance in test performance • R2 for Model = .49 • 51% variance unexplained • [Diagram: response variable (total variability in test performance); total study time, r = .6; test time, r = .1]

Multiple Regression Model #3: No co-variance between test performance and breakfast food • Not related, so we can NOT explain more of the variance in test performance • R2 for Model = .49 • 51% variance unexplained • [Diagram: response variable (total variability in test performance); total study time, r = .6; test time, r = .1; breakfast, r = .0]

Multiple Regression Model #4: Some co-variance between test performance and hours of sleep • We can explain more of the variance • But notice what happens with the overlap (covariation between explanatory variables): we can’t just add r’s or r2’s • R2 for Model = .60 • 40% variance unexplained • [Diagram: response variable (total variability in test performance); total study time, r = .6; test time, r = .1; breakfast, r = .0; hrs of sleep, r = .45]
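The point about overlap can be illustrated with simulated data in which two explanatory variables are correlated with each other. The sketch assumes NumPy. The numbers do not reproduce the values on the slides, and the helper r_squared and the strength of the built-in correlation are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Two explanatory variables built to correlate with each other (about .7).
study_time = rng.normal(0, 1, n)
sleep = 0.7 * study_time + np.sqrt(0.51) * rng.normal(0, 1, n)

# Response depends on both, plus unexplained variability.
performance = study_time + sleep + rng.normal(0, 1.5, n)

def r_squared(predictors, y):
    """Proportion of variance in y explained by a least-squares model."""
    X = np.column_stack([np.ones(len(y))] + list(predictors))
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    residual = y - X @ coefs
    return 1 - residual.var() / y.var()

r2_study = r_squared([study_time], performance)
r2_sleep = r_squared([sleep], performance)
r2_both = r_squared([study_time, sleep], performance)

# Because the predictors overlap, the separate r2s add up to more
# than the r2 of the model that contains both.
print(r2_study + r2_sleep > r2_both)
```

Only when the explanatory variables are unrelated to each other would the separate r2 values add up to the r2 of the combined model.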

Multiple Regression in SPSS • Set up as before: variables (explanatory and response) are entered into columns • A couple of different ways to use SPSS to compare different models

Regression in SPSS • Analyze: Regression, Linear

Multiple Regression in SPSS • Method 1: enter all the explanatory variables together • Enter: • Predicted (criterion) variable into the Dependent Variable field • All of the predictor variables into the Independent Variable field

Multiple Regression in SPSS • The variables in the model • r for the entire model • r2 for the entire model • Unstandardized coefficients • Coefficient for var1 (var name) • Coefficient for var2 (var name)

Multiple Regression in SPSS • The variables in the model • r for the entire model • r2 for the entire model • Standardized coefficients • Coefficient for var1 (var name) • Coefficient for var2 (var name)

Multiple Regression • Which β to use, standardized or unstandardized? • Unstandardized β’s are easier to use if you want to predict a raw score based on raw scores (no z-scores needed). • Standardized β’s are nice to directly compare which variable is most “important” in the equation
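The difference between the two kinds of coefficient shows up most clearly when the predictors are measured on very different scales. The sketch below assumes NumPy and uses simulated data. Study time in minutes, sleep in hours, the generating coefficients, and the helper names are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300

# Predictors on very different raw scales (made-up distributions).
study_minutes = rng.normal(600, 180, n)
sleep_hours = rng.normal(7, 1, n)
score = 20 + 0.05 * study_minutes + 2.0 * sleep_hours + rng.normal(0, 5, n)

def fit(predictors, y):
    """Least-squares coefficients: intercept first, then one per predictor."""
    X = np.column_stack([np.ones(len(y))] + list(predictors))
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefs

def z(values):
    """Convert raw scores to z-scores (sample standard deviation)."""
    return (values - values.mean()) / values.std(ddof=1)

# Unstandardized: raw scores in, raw-score prediction out.
b = fit([study_minutes, sleep_hours], score)
predicted_raw = b[0] + b[1] * 600 + b[2] * 8   # 600 minutes of study, 8 hours of sleep

# Standardized: all variables converted to z-scores before fitting.
beta = fit([z(study_minutes), z(sleep_hours)], z(score))

print(b[1] < b[2])        # raw coefficient for sleep is larger (bigger units)
print(beta[1] > beta[2])  # standardized coefficient for study time is larger
```

The raw coefficients depend on the measurement units, so they cannot be ranked against each other directly; the standardized ones are on a common z-score scale.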

Multiple Regression in SPSS • Method 2: enter first model, then add another variable for second model, etc. • Enter: • Predicted (criterion) variable into the Dependent Variable field • First Predictor variable into the Independent Variable field • Click the Next button

Multiple Regression in SPSS • Method 2 cont: • Enter: • Second Predictor variable into the Independent Variable field • Click Statistics

Multiple Regression in SPSS • Click the ‘R squared change’ box

Multiple Regression in SPSS • Shows the results of two models • The variables in the first model (math SAT) • The variables in the second model (math and verbal SAT)

Multiple Regression in SPSS • Shows the results of two models • The variables in the first model (math SAT) • The variables in the second model (math and verbal SAT) • r2 for the first model • Model 1 • Coefficients for var1 (var name)

Multiple Regression in SPSS • Shows the results of two models • The variables in the first model (math SAT) • The variables in the second model (math and verbal SAT) • r2 for the second model • Model 2 • Coefficients for var1 (var name) • Coefficients for var2 (var name)

Multiple Regression in SPSS • Shows the results of two models • The variables in the first model (math SAT) • The variables in the second model (math and verbal SAT) • Change statistics: is the change in r2 from Model 1 to Model 2 statistically significant?
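The change statistic can also be computed by hand from the two r2 values using the standard F test for nested regression models. The sketch below assumes NumPy and uses simulated data. The response variable (called gpa here), the generating coefficients, and the helper r_squared are made up, since the slides name only the math and verbal SAT predictors.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100

# Simulated predictors; verbal SAT is built to correlate with math SAT.
math_sat = rng.normal(500, 100, n)
verbal_sat = 500 + 0.5 * (math_sat - 500) + rng.normal(0, 87, n)

# Hypothetical response variable.
gpa = 1.0 + 0.002 * math_sat + 0.002 * verbal_sat + rng.normal(0, 0.4, n)

def r_squared(predictors, y):
    """Proportion of variance in y explained by a least-squares model."""
    X = np.column_stack([np.ones(len(y))] + list(predictors))
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    residual = y - X @ coefs
    return 1 - residual.var() / y.var()

r2_model1 = r_squared([math_sat], gpa)               # Model 1: math SAT only
r2_model2 = r_squared([math_sat, verbal_sat], gpa)   # Model 2: math and verbal SAT

k1, k2 = 1, 2                     # number of predictors in each model
r2_change = r2_model2 - r2_model1

# F test for the change in r2 between nested models.
f_change = (r2_change / (k2 - k1)) / ((1 - r2_model2) / (n - k2 - 1))

print(r2_model1, r2_model2, r2_change, f_change)
```

The F value is evaluated with (k2 - k1) and (n - k2 - 1) degrees of freedom. If SciPy is installed, scipy.stats.f.sf(f_change, k2 - k1, n - k2 - 1) should give the corresponding p-value.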

Cautions in Multiple Regression • We can use as many predictors as we wish but we should be careful not to use more predictors than is warranted. • Simpler models are more likely to generalize to other samples. • If you use as many predictors as you have participants in your study, you can predict 100% of the variance. Although this may seem like a good thing, it is unlikely that your results would generalize to any other sample and thus they are not valid. • You probably should have at least 10 participants per predictor variable (and probably should aim for about 30).
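The warning about using too many predictors can be demonstrated with pure noise. The sketch below assumes NumPy. It uses 10 simulated participants and 9 random predictors, because with the intercept counted as an estimated term, n - 1 predictors are already enough for a perfect fit. The helper names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10        # participants
k = n - 1     # predictors; together with the intercept, n estimated terms

def design(predictors):
    """Add a column of ones (the intercept) to a matrix of predictors."""
    return np.column_stack([np.ones(len(predictors)), predictors])

def r_squared(y, predicted):
    """1 minus the ratio of unexplained to total variability."""
    return 1 - np.sum((y - predicted) ** 2) / np.sum((y - y.mean()) ** 2)

# Original sample: predictors and response are unrelated random numbers.
X_old = rng.normal(size=(n, k))
y_old = rng.normal(size=n)
coefs, *_ = np.linalg.lstsq(design(X_old), y_old, rcond=None)
r2_old = r_squared(y_old, design(X_old) @ coefs)

# New sample from the same (unrelated) population, same coefficients.
X_new = rng.normal(size=(n, k))
y_new = rng.normal(size=n)
r2_new = r_squared(y_new, design(X_new) @ coefs)

print(r2_old)  # essentially 1: the model "explains" all of the variance
print(r2_new)  # far lower in the new sample; it can even be negative
```

The near-perfect fit in the first sample reflects only the number of estimated terms, which is why it does not carry over to the second sample.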
