2.2 Correlation. Correlation measures the direction and strength of the linear relationship between two quantitative variables.
Idea: Given two quantitative variables, we would like to be able to associate some number to those variables which tells us “how close” the data is to forming a straight line. We will call this number the correlation coefficient. It is denoted by r.
Which graph has stronger correlation? We would like to be able to answer this mathematically rather than just appeal to the graphs.
The Formula • The formula for the correlation coefficient is difficult to motivate, so take it for granted: given n pairs of data values, the correlation coefficient r is given by the formula • r = 0.636 in the first graph, and r = 0.543 in the second.
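The formula itself does not appear in the text of this slide. One common textbook form, assumed here rather than taken from the slide, writes r in terms of the sample means x̄, ȳ and the sample standard deviations s_x, s_y of the n data pairs (x_i, y_i):

```latex
r = \frac{1}{n-1} \sum_{i=1}^{n}
    \left( \frac{x_i - \bar{x}}{s_x} \right)
    \left( \frac{y_i - \bar{y}}{s_y} \right)
```

Each factor is a standardized value, so r has no units and does not change when x or y is rescaled by a positive factor.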
What does r tell us? • Now that we have a paradigm case, we may discuss some properties of r, the correlation coefficient. • r is always between -1 and 1 (inclusive). Hence if your correlation coefficient falls outside of this range, something has gone awry. • If r=1 or -1, then all the points of the scatter diagram lie on the regression line. When r>0, the slope of the regression line is positive (positive association). When r<0, the slope of the regression line is negative (negative association). • Thus the closer r is to -1 or 1, the stronger the relationship. If r=0, then there is no linear relationship. If r is close to 0, then there is little to no linear relationship. • Let’s draw a few examples on the board to illustrate.
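Other Properties of r • r is not a resistant measure: a few outlying observations can change its value substantially. • r does not distinguish between the explanatory and the response variable. This is easily seen from the formula for r, which is unchanged when x and y are swapped.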
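The properties above can be checked numerically. The following is a minimal sketch, not the course's own code: the function name `correlation` and the data values are made up for illustration, and r is computed in the algebraically equivalent form S_xy / sqrt(S_xx · S_yy).

```python
from math import sqrt

def correlation(xs, ys):
    """Sample correlation coefficient r for paired data."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sums of squared deviations and cross-products about the means.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)

# Hypothetical data, made up for illustration only.
xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9]

r = correlation(xs, ys)
r_swapped = correlation(ys, xs)  # swap explanatory and response roles

# The same data with one outlying point appended.
r_outlier = correlation(xs + [7], ys + [0.0])

# Points lying exactly on a line give r = 1 or r = -1.
r_perfect_up = correlation([1, 2, 3], [2, 4, 6])
r_perfect_down = correlation([1, 2, 3], [6, 4, 2])

print(f"r = {r:.3f}, swapped = {r_swapped:.3f}, with outlier = {r_outlier:.3f}")
```

Swapping the two lists leaves r unchanged, while the single added outlier pulls r far away from its original value, which is what "not resistant" means in practice.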
The Questions • Given a data set, does it seem to conform to some sort of pattern? In particular, can we find an equation that more or less “fits” the data? • If so, this can be used to predict values. • The easiest equation is a linear equation (a line), so this is what we concentrate on in this section. • Here there is a distinction between explanatory and response variables.
Linear Equations • Suppose Irving Oil charges a $40 flat rate to send someone out to a job and $6 for each hour they work on that job. What equation models the data? • This is an example of a linear equation; that is, any equation of the form y = b0 + b1x, where b0 and b1 are fixed numbers. The number b1 is the slope of the linear equation and b0 is the y-intercept. When b1 > 0, we say there is a positive linear relationship between x and y. When b1 < 0, we say there is a negative linear relationship between x and y.
Tables and Graphs • Let’s make a table for the linear equation we just found on the board. • Now we’ll plot the points, and draw the line through them. Identify graphically the slope and y-intercept. • In this case, there is a perfect linear relationship between the x and y values.
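As a sketch of the table step, reading the Irving Oil example as y = 40 + 6x (flat rate plus hourly rate times hours worked): the function name `job_cost` and the choice of 0 to 5 hours are assumptions made for illustration.

```python
FLAT_RATE = 40   # dollars, charged once per job (from the example)
HOURLY_RATE = 6  # dollars per hour worked (from the example)

def job_cost(hours):
    """Total charge in dollars for a job lasting `hours` hours."""
    return FLAT_RATE + HOURLY_RATE * hours

# Tabulate a few x values; the chosen hours are arbitrary.
table = [(hours, job_cost(hours)) for hours in range(0, 6)]
for hours, cost in table:
    print(f"{hours:>2} h  ${cost}")

# Slope: change in y per unit change in x. Intercept: the value of y at x = 0.
slope = (job_cost(5) - job_cost(0)) / (5 - 0)
intercept = job_cost(0)
```

Every tabulated point lies exactly on the line, which is why the relationship is called perfect.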
Consider the following table of data values. • Draw a scatter plot of the data and a line that approximately fits the data. Use a simple method to write down an equation for the approximating line. Note that we can now make rudimentary predictions. • How do we mathematically find an equation of such a line, and how do we find the best one?
Residuals • We use the notation y to denote an observed value and ŷ (y hat) to denote an estimate of the observed value. It follows that the closer y is to ŷ, the better our estimate is. • Hence, we define the residual (or error) of an estimate to be e = y − ŷ. • Compute the residuals in the example on the board. • It now becomes clear that coming up with the line of best fit is equivalent to minimizing the residuals in some way.
How to minimize • What do we mean by “minimizing the residuals”? One idea is to add up all the residuals. • But recall that when we were discussing variance, we ran into the problem of negatives cancelling with positives when we summed over all differences. The case here is similar. • We agree to sum over the squares of the residuals. This is the idea of the least squares regression line of y on x which is the line that makes the sum of the squares of the vertical distances of the data points from the line as small as possible.
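The cancellation problem can be seen in a small sketch. The data and the two candidate lines below are hypothetical, chosen only so that both lines pass through the point of means; the helper names are likewise made up.

```python
def residuals(xs, ys, b0, b1):
    """Residuals e = y - y_hat for the candidate line y_hat = b0 + b1*x."""
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

def sum_of_squares(es):
    """Sum of squared residuals."""
    return sum(e ** 2 for e in es)

# Hypothetical data, made up for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

# Two candidate lines, both passing through the point of means (3, 4).
flat = residuals(xs, ys, b0=4.0, b1=0.0)    # y_hat = 4
sloped = residuals(xs, ys, b0=2.2, b1=0.6)  # y_hat = 2.2 + 0.6x

print("sum of residuals:", round(sum(flat), 10), round(sum(sloped), 10))
print("sum of squares:  ", round(sum_of_squares(flat), 10),
      round(sum_of_squares(sloped), 10))
```

Both lines have residuals that sum to zero, so the plain sum cannot tell them apart. The sum of squares can: it is smaller for the sloped line.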
Line of Best Fit • Using technical machinery including calculus and magic, the line of best fit can be found as follows: • Suppose we have a scatter diagram with n points. The line of best fit or regression line has the form ŷ=b0+b1x where
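The expressions for b0 and b1 are missing from the text of this slide. The standard least-squares result, assumed here and consistent with the later remark that r appears in the formula for the regression line, uses the means x̄, ȳ and standard deviations s_x, s_y of the two variables:

```latex
b_1 = r \, \frac{s_y}{s_x}, \qquad b_0 = \bar{y} - b_1 \bar{x}
```

The second equation says the regression line always passes through the point (x̄, ȳ).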
Example • Find the regression line for the data. • ŷ = 7.21 − 0.681x
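The data table for this example is not reproduced here, so the sketch below uses hypothetical data and does not reproduce the line above. It assumes the standard least-squares formulas b1 = r·s_y/s_x and b0 = ȳ − b1·x̄; the function name `least_squares_line` is made up for illustration.

```python
from math import sqrt

def least_squares_line(xs, ys):
    """Return (b0, b1) for the least-squares line y_hat = b0 + b1*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

    r = sxy / sqrt(sxx * syy)    # correlation coefficient
    s_x = sqrt(sxx / (n - 1))    # sample standard deviation of x
    s_y = sqrt(syy / (n - 1))    # sample standard deviation of y

    b1 = r * s_y / s_x           # slope
    b0 = y_bar - b1 * x_bar      # intercept
    return b0, b1

# Hypothetical data, made up for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

b0, b1 = least_squares_line(xs, ys)
print(f"y_hat = {b0:.2f} + {b1:.2f}x")

# Use the fitted line to predict y at a new x value.
prediction = b0 + b1 * 6
```

Predictions like the last line are only as trustworthy as the linear pattern itself, especially for x values outside the range of the data.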
Correlation vs. Regression • Recall that the correlation r ignores the distinction between explanatory and response variables, while regression does not. • But r is in the formula for the regression line. • It turns out that r² is the fraction of the variation in the values of y that is explained by the least-squares regression of y on x. • Let’s look at an example.
r = 0.996 • The straight-line relationship between the length of icicles and the time it takes them to grow that length explains about r² = (0.996)² ≈ 0.992, or 99.2%, of the vertical scatter in time.
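The "fraction of variation explained" reading of r² can be checked directly. This is a sketch on hypothetical data (the icicle measurements are not given here): it computes r² from the correlation and, separately, as 1 minus the ratio of the residual sum of squares to the total sum of squares about ȳ.

```python
def r_squared_two_ways(xs, ys):
    """Return (r**2, fraction of variation in y explained by the fit)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)   # total variation in y
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

    # Least-squares line; sxy / sxx is the same slope as r * s_y / s_x.
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar

    # Variation left over after fitting the line.
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

    r_sq = sxy ** 2 / (sxx * syy)
    explained = 1 - sse / syy
    return r_sq, explained

# Hypothetical data, made up for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r_sq, explained = r_squared_two_ways(xs, ys)
print(f"r^2 = {r_sq:.3f}, explained fraction = {explained:.3f}")

# The arithmetic from the icicle example: r = 0.996.
icicle_r_sq = 0.996 ** 2
print(f"{icicle_r_sq:.1%}")
```

The two numbers agree up to rounding error, which is the sense in which r² measures how much of the scatter in y the regression line accounts for.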