Chapter 13 Linear Regression
DEFINITIONS: • Studies are often conducted to attempt to show that some explanatory variable “causes” the values of some response variable to occur. • The response or dependent variable is the response of interest, the variable we want to predict, and is usually denoted by y. • The explanatory or independent variable attempts to explain the response and is usually denoted by x. • A scatterplot shows the relationship between two quantitative variables x and y. The values of the x variable are marked on the horizontal axis, and the values of the y variable are marked on the vertical axis. Each pair of observations (xi, yi) is represented as a point in the plot. • Two variables are said to be positively associated if, as x increases, the values of y tend to increase. Two variables are said to be negatively associated if, as x increases, the values of y tend to decrease. • When a scatterplot does not show a particular direction, neither positive nor negative, we say that there is no linear association.
Let’s Do It! 4.19 Possible Explanations: For each of the following response or dependent variables, list one or more possible explanatory or independent variables. The first one is done for you. • Response Variable: Possible Explanatory Variable(s) • Height of son: height of the father, height of the mother, age • Weight: • Blood pressure: • GPA at end of a semester: • Exam 2 score: • Quantity demanded of a product:
Scatter Plots • A plot of ordered pair (bivariate) data. • A scatter plot shows the relationship between two quantitative variables, x and y. Each ordered pair represents a single data point. • Scatter plots must be done on graph paper or using a computer.
Notes of Caution • An observed relationship between two variables does not imply that there is a causal link between the variables. • A relationship between two variables can be influenced by confounding variables. • Unusual data points (outliers) can distort the apparent association, particularly when the data set is small.
Simple Linear Regression • Least squares regression is one method to get the line of best fit if the data show a linear trend. This method minimizes the sum of the squared residuals. • The residual is the difference between the observed response y and the predicted response ŷ. Each point has an associated residual.
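The calculator does this work for us, but a minimal sketch of what it computes may help. The data below are made up for illustration (not from the textbook), and the function name `least_squares` is hypothetical. The sketch assumes the standard least squares formulas: slope b = Sxy / Sxx and y-intercept a = ȳ − b·x̄.

```python
# Hypothetical data, made up for illustration (not from the textbook).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

def least_squares(x, y):
    """Return (a, b) for the line y-hat = a + b*x that minimizes
    the sum of the squared residuals."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of cross-deviations and sum of squared x-deviations.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = s_xy / s_xx          # slope
    a = y_bar - b * x_bar    # y-intercept
    return a, b

a, b = least_squares(x, y)
print(f"y-hat = {a:.2f} + {b:.2f}x")  # prints: y-hat = 2.20 + 0.60x
```

The line always passes through the point (x̄, ȳ), which is why the intercept can be computed from the slope and the two means.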
How To Regress • Open to page 816 to see the equations needed to calculate the slope and y-intercept for the linear regression equation. YUCKO! We’ll let the TI handle this messy work for us. • The TI Quick Steps are on page 912.
Let’s Do It • LDI 13.2 • LDI 13.3
Residuals • A residual is calculated by first putting the observed value of x into the regression equation and then subtracting the result from the observed y. • Let’s do example 13.4 (p824)
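As a rough illustration of the calculation just described, the sketch below uses made-up data (not example 13.4) and assumes the regression line ŷ = 2.2 + 0.6x, which is the least squares line for these five hypothetical points.

```python
# Hypothetical data and an assumed regression line y-hat = 2.2 + 0.6x.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
a, b = 2.2, 0.6

# Step 1: put each observed x into the regression equation.
predicted = [a + b * xi for xi in x]

# Step 2: subtract the predicted value from the observed y.
residuals = [yi - y_hat for yi, y_hat in zip(y, predicted)]

for xi, yi, ri in zip(x, y, residuals):
    print(f"x = {xi}, observed y = {yi}, residual = {ri:+.1f}")
```

A positive residual means the point lies above the line and a negative residual means it lies below. For a least squares line the residuals add up to 0, apart from rounding.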
Residual Analysis • If we plot each x value against its residual, we create a residual plot. If a linear model fits the data well, the plot should look like an unstructured band of points centered around 0 (page 826).
Let’s Do It! • LDI 13.4 • LDI 13.5
Outliers and Influential Data • In regression there are two types of individual data points that can lie outside the pattern. An outlier is an observation that is far from the predicted line and produces a “large” residual. An influential data point is one that, if removed, would cause the regression equation to change drastically (page 832).
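One way to check whether a point is influential is to fit the line with and without it and compare. The sketch below does that on made-up data in which the last point sits far from the others in x. The data and the helper name `least_squares` are hypothetical.

```python
def least_squares(x, y):
    """Return (a, b) for the least squares line y-hat = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = s_xy / s_xx
    return y_bar - b * x_bar, b

# Hypothetical data: the last point (20, 3) is far from the rest in x.
x = [1, 2, 3, 4, 5, 20]
y = [2, 4, 5, 4, 5, 3]

a_all, b_all = least_squares(x, y)              # fit using every point
a_drop, b_drop = least_squares(x[:-1], y[:-1])  # fit with the last point removed

print(f"with the point:    y-hat = {a_all:.2f} + ({b_all:.3f})x")
print(f"without the point: y-hat = {a_drop:.2f} + ({b_drop:.3f})x")
```

Here removing one point changes the slope from slightly negative to clearly positive, so that point is influential for this data set.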
Let’s Do It • LDI 13.6
Correlation Coefficient • The sample linear correlation coefficient r measures how strong the linear relation is between two quantitative variables. It describes the direction of the linear association and indicates how close the points in the scatterplot are to the least squares regression line. • The values of r range between -1 and 1.
Properties of r • The sign of the correlation coefficient indicates the direction of association: negative for r in [-1, 0) or positive for r in (0, +1]. • The magnitude of the correlation coefficient indicates the strength of the linear association. If the data follow a straight line, then r = +1 (if the slope is positive) or r = -1 (if the slope is negative), indicating a perfect linear association. If r = 0 then there is no linear association.
Properties of r • The correlation coefficient only measures the strength of the linear association. It is important to look at the scatterplot first to examine the type of association present. • The correlation coefficient is computed using standard scores of the two variables. It has no units of measure. The correlation coefficient will not change if x and y are reversed.
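These properties can be checked numerically. The sketch below assumes the usual standard-score form of the sample correlation, r = (sum of z_x · z_y) / (n − 1), and uses made-up data. The function name `correlation` and the centimeter conversion are only for illustration.

```python
import statistics

def correlation(x, y):
    """Sample correlation r computed from standard scores (z-scores)."""
    n = len(x)
    x_bar, y_bar = statistics.mean(x), statistics.mean(y)
    s_x, s_y = statistics.stdev(x), statistics.stdev(y)  # sample std. deviations
    z_x = [(xi - x_bar) / s_x for xi in x]
    z_y = [(yi - y_bar) / s_y for yi in y]
    return sum(zx * zy for zx, zy in zip(z_x, z_y)) / (n - 1)

# Hypothetical data, made up for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = correlation(x, y)
r_swapped = correlation(y, x)          # reverse the roles of x and y
x_cm = [2.54 * xi for xi in x]         # pretend x was in inches; convert to cm
r_rescaled = correlation(x_cm, y)

print(f"r = {r:.4f}, swapped = {r_swapped:.4f}, rescaled = {r_rescaled:.4f}")
```

All three values agree: r does not change when x and y are reversed, and changing the units of measure does not change it either, because standard scores have no units.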
Let’s Do It • Turn your diagnostics on, do example 13.11 on page 848. • LDI: 13.8, 13.11 • Take the quiz
Is the Relation Significant? • Recall that the regression equation is generated from data. So the values of a and b are statistics, estimates of the population values. Could we see a relationship that doesn’t exist just from dumb luck of our data choice?
Think About It: Suppose the y-intercept is equal to 10 and the slope is 0. • What would the value of ŷ be for any value of x? • What would it mean if the slope of the regression line were 0?
Hypothesis Test for Regression • If the slope is 0 then y has no “dependency” on x and the regression equation is useless. • The parameter for the slope is given by:
Hypothesis Test for Regression • LinRegTTest on the TI
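For a sense of what the calculator works out, here is a minimal sketch of the usual t statistic for testing whether the population slope is 0. It assumes the standard formulas: s = sqrt(SSE / (n − 2)), standard error of the slope = s / sqrt(Sxx), and t = b / (standard error), with n − 2 degrees of freedom. The data are made up, and the p-value (which comes from the t distribution) is not computed here.

```python
import math

# Hypothetical data, made up for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b = s_xy / s_xx          # sample slope
a = y_bar - b * x_bar    # sample y-intercept

# Sum of squared residuals and the standard error of the slope.
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
s = math.sqrt(sse / (n - 2))
se_b = s / math.sqrt(s_xx)

t = b / se_b             # test statistic for "slope = 0"
df = n - 2               # degrees of freedom

print(f"b = {b:.2f}, SE = {se_b:.4f}, t = {t:.3f}, df = {df}")
```

A t value far from 0 is evidence that the slope is not 0. With only five points, as here, even a fairly steep sample slope may not be statistically significant.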
Let’s Do It • LDI 13.7
Coefficient of Determination • A statistic that is widely used to determine how well a regression fits is the coefficient of determination r². It represents the fraction of variability in y that can be explained by the variability in x. In other words, r² explains how much of the variability in the y's can be explained by the fact that they are related to x, i.e., how close the points are to the line.
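The "fraction of variability explained" can be computed directly. The sketch below assumes the standard decomposition r² = 1 − SSE / SST, where SSE is the sum of squared residuals and SST is the total sum of squared deviations of y from its mean. It then checks that this matches the square of the correlation coefficient. The data are made up for illustration.

```python
import math

# Hypothetical data, made up for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)   # total variability in y (SST)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b = s_xy / s_xx
a = y_bar - b * x_bar

# Variability left over after fitting the line (SSE).
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

r_squared = 1 - sse / s_yy
r = s_xy / math.sqrt(s_xx * s_yy)

print(f"r^2 = {r_squared:.2f}, r = {r:.4f}, r*r = {r * r:.2f}")
```

For this data r² is 0.60, so about 60% of the variability in y is explained by the linear relationship with x, and the remaining 40% is left in the residuals.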
Let’s Do It • LDI 13.17 a-d