
Lecture 2: Linear Regression


Presentation Transcript


  1. Lecture 2: Linear Regression

  2. Bivariate Statistics (when two variables are involved)

  3. Basic Univariate Statistics • Mean • Standard deviation • Variance • (you know this stuff)

  4. Reason for doing a correlation • To determine whether two variables are related • Does one increase as the other increases? • Does one decrease as the other increases?

  5. Not causality • Correlation doesn’t distinguish between IV and DV. • Causal relationship could go either way (or both ways, or neither way)

  6. Correlations can be visualized with scatter plots: r = .00

  7. r =.15

  8. r = .40

  9. r = .81

  10. r = .99

  11. r = -.79

  12. Another way to visualize correlation: [diagram of overlapping circles showing the variance in A and the variance in B; a small overlap means a small correlation, a large overlap means a large correlation]

  13. Correlation Coefficient • A measure of degree of relationship • Degree to which large scores go with large scores, and small scores with small scores • Based on covariance • The correlation is the covariance, standardized

  14. What is covariance again? (Review) • Variance of one variable • How spread out the data points are • Average (squared) distance between the points and their mean • $s^2 = \frac{\sum (X - \bar{X})^2}{N - 1}$
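A minimal Python sketch of this variance formula (the sample values are made up for illustration; NumPy's `ddof=1` uses the same $N - 1$ denominator):

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # made-up scores

# Average squared distance between the points and their mean,
# with N - 1 in the denominator (the sample variance).
s2 = np.sum((x - x.mean()) ** 2) / (len(x) - 1)

print(s2)                 # the formula by hand
print(np.var(x, ddof=1))  # NumPy agrees
```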

  15. Bell Curve

  16. Low vs. High in Variance

  17. Variance vs. covariance • Variance = variation of a set of scores from their mean • Covariance = covariation of two sets of scores from their means • If someone’s X is a lot higher than the mean of X, their Y should be a lot higher than the mean of Y.

  18. Covariance: $\mathrm{Cov}_{XY} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{N - 1}$
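The same formula as a Python sketch, with made-up scores; NumPy's `cov` (which also divides by $N - 1$) agrees:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # made-up scores
y = np.array([2.0, 4.0, 5.0, 4.0, 7.0])

# Covariation of the two sets of scores from their means.
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)

print(cov_xy)              # 2.5, by the formula
print(np.cov(x, y)[0, 1])  # same value (off-diagonal of the 2x2 matrix)
```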

  19. Standardize the covariance to get the correlation • What does standardize mean? • Subtract the mean (so the scores are centered around 0) • Divide by the standard deviation (so the standard deviation and variance are rescaled to 1)

  20. Correlation • $r = \frac{\mathrm{Cov}_{XY}}{\mathrm{SD}_X \, \mathrm{SD}_Y}$ • Would come out the same if you standardized X and Y separately and took the covariance of them.
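A sketch of both routes to r from this slide, using the same made-up scores as above; both match NumPy's `corrcoef`:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 7.0])

cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)

# Route 1: standardize the covariance by dividing by both SDs.
r1 = cov_xy / (x.std(ddof=1) * y.std(ddof=1))

# Route 2: standardize X and Y first, then take their covariance.
zx = (x - x.mean()) / x.std(ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)
r2 = np.sum(zx * zy) / (len(x) - 1)

print(r1, r2, np.corrcoef(x, y)[0, 1])  # all three agree (~0.87)
```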

  21. R² • R² is the proportion of variance in Y that is explained by X. • Important for regression, because you’re trying to explain as much of the variance in Y as possible, by combining a lot of Xs.
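With a single predictor, R² is simply r squared; a tiny sketch, assuming the same made-up scores:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 7.0])

r = np.corrcoef(x, y)[0, 1]
print(r ** 2)  # ~0.76: x explains about 76% of the variance in y
```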

  22. [Diagram: predictors X1, X2, X3, X4, X5 each explaining part of the variance in Y]

  23. Purpose of Regression Analysis • Predict the values of a DV based on values of one or more IVs

  24. Differences between correlation and regression • Correlation just tells you whether there is a linear association between two variables. • Regression actually gives you an equation for the line. • In regression, you can have multiple predictors. • Regression implies (but doesn’t prove) directionality or causality.

  25. Simple Linear Regression • Simple: • One independent (predictor) variable • Linear: • Straight-line relationship between IV and DV

  26. Simple Linear Regression • Evaluates the relationship between two continuous variables. • X is the predictor, explanatory, independent variable. • Y is the response, outcome, dependent variable. • Which is X and which is Y depends on your hypothesis about causation.

  27. Linear Regression Model: $Y = \beta_0 + \beta_1 X + \varepsilon$ • $\beta_0$ = Y-intercept • $\beta_1$ = slope • $\varepsilon$ = disturbance (variance in Y that can’t be explained by X)

  28. β0 • The Y-intercept • Where the line crosses the Y-axis • The expected value of Y when X=0

  29. β1 • The slope • The expected change in Y for each one-unit change in X

  30. ε (Disturbance) • Variance in Y that cannot be explained by our X’s • Its expected value is 0 • (but in real data the individual residuals aren’t 0, because the points don’t sit exactly on the line) • The ε for each observation is independent of the ε’s of the other observations
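A small simulation sketch of the model from slide 27. The values β0 = 32 and β1 = 1.8 are borrowed from the function on the next slide, and the normal disturbance is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

beta0, beta1 = 32.0, 1.8           # assumed intercept and slope
x = rng.uniform(0, 100, size=50)   # predictor values
eps = rng.normal(0, 5, size=50)    # disturbances: mean 0, independent draws

y = beta0 + beta1 * x + eps        # Y = beta0 + beta1*X + epsilon

print(eps.mean())  # close to 0, but not exactly 0 in a finite sample
```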

  31. Example of a line without an error term (a function: Y = 1.8X + 32). All points lie exactly on the line.

  32. Example of a statistical relationship Points are near the line, but not exactly on it. A relationship with some “trend”, but also with some “scatter.”

  33. Goal of regression • Find the best-fitting line that describes a scatterplot • Make the line as close as possible to all the points • The best-fitting line is the one with the smallest total residuals • (Residuals are the vertical distances between each point and the line)

  34. What this looks like

  35. A scatterplot

  36. PowerPoint’s automatic trendline

  37. Equation of the trendline

  38. Most of the points are close to the line, but not exactly on the line. How do you know if this is the best-fitting line?

  39. Goal • Find the line that makes the total residuals (distance from the points to the line) as small as possible

  40. (Painless) review of calculus • How to make some quantity as small as possible: • Write an equation for it • Take the derivative (slope) • Solve for the value of X where the derivative = 0 • That will be a minimum or maximum

  41. Strategy • Write an equation for the distance between each point and the line • Sum it up over all points • Solve for the values of β0 and β1 such that the sum of the distances is a minimum

  42. Use geometry and trigonometry to find the distance between the point and the line. [Diagram: a point and the line, with the X-distance and Y-distance between them labeled]

  43. Equations… ? → [The Black Box] → $Y = \beta_0 + \beta_1 X + \varepsilon$

  44. Using calculus, minimize the sum of squared residuals: take the derivatives with respect to b0 and b1, set them to 0, and solve. This gives the least squares regression line $\hat{Y} = b_0 + b_1 X$ with the least squares estimates $b_1 = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2}$ and $b_0 = \bar{Y} - b_1 \bar{X}$.
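These closed-form estimates in a Python sketch, with made-up data; `np.polyfit` fits the same line:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # made-up data
y = np.array([2.0, 4.0, 5.0, 4.0, 7.0])

# Least squares estimates from setting the derivatives to 0.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

print(b0, b1)                   # 1.4, 1.0
print(np.polyfit(x, y, deg=1))  # [slope, intercept] -- the same line
```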

  45. Another way to think about it:Sums of Squares

  46. Each point’s Y-value depends on 2 things: • Where the regression equation predicts it will be, given its X-value • How far away it is from its predicted position (the residual)

  47. Our goal is to get the regression line to predict the point’s Y-value as closely as possible. • The unexplained part of the point’s location (residual) should be small in comparison to the explained part of the point’s location (predicted distance above or below the overall mean of Y).

  48. Sums of Squares tell you how much of the point’s Y-value was explained by the regression line, and how much was unexplained.

  49. Sum of Squares: $SS_{\text{Total}} = SS_{\text{Model}} + SS_{\text{Error}}$ • Total Sample Variability = Explained Variability + Unexplained Variability
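A sketch verifying the decomposition on the made-up data from the least squares sketch above:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 7.0])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

ss_total = np.sum((y - y.mean()) ** 2)      # total sample variability
ss_model = np.sum((y_hat - y.mean()) ** 2)  # explained by the line
ss_error = np.sum((y - y_hat) ** 2)         # unexplained (residuals)

print(ss_total, ss_model + ss_error)  # 13.2 and 13.2: they match
print(ss_model / ss_total)            # ~0.76 -- the R² from earlier
```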

  50. Mean Squares • Divide each type of Sum of Squares by its degrees of freedom • Model df = the number of predictor variables in the regression equation • Error df = the number of observations minus the number of estimated parameters (N − k − 1 for k predictors)
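A sketch of the mean squares, assuming the sums of squares and sample size from the previous sketch (five observations, one predictor):

```python
n, k = 5, 1                    # observations, predictors
ss_model, ss_error = 10.0, 3.2

df_model = k                   # one df per predictor in the equation
df_error = n - k - 1           # observations minus estimated parameters

ms_model = ss_model / df_model
ms_error = ss_error / df_error
print(ms_model, ms_error)      # 10.0 and ~1.07
```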
