Explore correlations between related data sets using scatter plots and the correlation coefficient. Learn how to interpret associations, calculate correlation, and make predictions based on data relationships.
Overview • Scatter Plots • Associations • The Correlation Coefficient • Sketching Scatter Plots • Changes of Scale • Summary
Scatter Plots • Often, we are interested in comparing two related data sets • Heights and weights of students • SAT scores and freshman GPA • Age and fuel efficiency of vehicles • We can draw a scatter plot of the data set • Plot paired data points on a Cartesian plane
Scatter Plots • Scatter plot for the heights of 1,078 fathers and their adult sons • From Pearson and Lee's study of family heights
Scatter Plots • What does the dashed diagonal line represent? • Find the point representing a 5'3¼" father who has a 5'6½" son
Scatter Plots • What does the vertical dashed column represent? • Consider the families where the father was 72" tall, to the nearest inch • How tall was the tallest son? • Shortest?
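Reading off a vertical strip can be sketched in Python. This is only an illustration: the `pairs` list below is made-up data, not the study's, and the helper name `sons_in_strip` is hypothetical. It assumes "72 inches to the nearest inch" means a father's height from 71.5 up to, but not including, 72.5.

```python
# Hypothetical (father, son) heights in inches; NOT the study's data.
pairs = [
    (71.6, 73.0),
    (72.4, 69.5),
    (68.0, 70.1),
    (72.0, 74.2),
    (65.3, 66.0),
    (72.5, 71.0),
]

def sons_in_strip(pairs, father_height):
    """Return the sons' heights for fathers within half an inch of father_height."""
    lo, hi = father_height - 0.5, father_height + 0.5
    return [son for father, son in pairs if lo <= father < hi]

strip = sons_in_strip(pairs, 72)   # the vertical strip above 72 inches
tallest, shortest = max(strip), min(strip)
print(strip, tallest, shortest)
```

The spread between `tallest` and `shortest` inside one strip shows how much sons' heights vary even when the father's height is fixed.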
Scatter Plots • Was the average height of the fathers around 64", 68" or 72"? • Was the SD of the fathers’ heights around 3", 6" or 9"?
Scatter Plots • The points form a swarm that is more or less football-shaped • This indicates that there is a linear association between the fathers’ heights and the sons’ heights
Scatter Plots • Short fathers tend to have short sons, and tall fathers tend to have tall sons • We say there is a positive association between the heights of fathers and sons • What would it mean for there to be a negative association between the heights?
Scatter Plots • Does knowing the father’s height give a precise prediction of his son’s height? • Does knowing the father’s height let you better predict his son’s height?
Scatter Plots • We will generally assume the scatter plots are football-shaped • Association is linear in nature • Each data set is approximately normal
Scatter Plots • Key features of scatter plots • Given two data sets X and Y, the point of averages is the point (μX, μY) • The average of a data set is denoted by μ (Greek mu, for mean) • The subscript indicates which set is being referenced • The point of averages will be in the center of the cloud • Due to the normal approximation, the vast majority (about 95%) of the cloud should fall within 2 SDs of the average, for both X and Y
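A minimal Python sketch of the point of averages and the 2-SD rule of thumb. The height values and the helper names `point_of_averages` and `fraction_within_2_sd` are illustrative assumptions, not the study's data. It also assumes the SD meant here is the population SD (`statistics.pstdev`).

```python
from statistics import mean, pstdev

def point_of_averages(xs, ys):
    """Return (average of X, average of Y), the center of the cloud."""
    return mean(xs), mean(ys)

def fraction_within_2_sd(values):
    """Fraction of values lying within 2 SDs of their average."""
    mu, sd = mean(values), pstdev(values)
    inside = [v for v in values if abs(v - mu) <= 2 * sd]
    return len(inside) / len(values)

# Hypothetical father and son heights in inches.
fathers = [62, 64, 66, 67, 68, 68, 69, 70, 72, 74]
sons = [64, 66, 67, 68, 69, 70, 69, 71, 72, 74]

center = point_of_averages(fathers, sons)
print(center, fraction_within_2_sd(fathers), fraction_within_2_sd(sons))
```

With only ten points the fraction is a rough check; the 95% figure is a statement about large, roughly normal data sets.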
Associations • When given a value in one data set, we often want to make a prediction for the other data set • We call our given value the independent variable • We call the value we are trying to predict the dependent variable
Associations • If there is indeed a relationship between the two data sets, we can say various things about their association: • Strong: Knowing X helps you a lot in predicting Y, and vice versa • Weak: Knowing X doesn’t really help you predict Y, and vice versa • Positive: X and Y tend to rise and fall together • The higher in one you look, the higher in the other you should be • Negative: X and Y tend to move in opposite directions • The higher in one you look, the lower in the other you should be
Associations • Positive associations: study time/final grade, height/weight, SAT score/GPA, clouds in sky/chance of rain, bowling practice/bowling score, age of husband/age of wife • Negative associations: age of car/fuel efficiency, golfing practice/golf score, dental hygiene/cavities formed, pollution/air quality, speed/mile time
Associations • What kind of association is this?
Associations • What kind of association is this?
Associations • Remember that even a very strong association does not necessarily imply a causal relationship • There may be a confounding influence at play
The Correlation Coefficient • While strong/weak and positive/negative give a sense of the association, we want a way to quantify the strength and direction of the association • The correlation coefficient (r) is the statistic which accomplishes this
The Correlation Coefficient • The correlation coefficient is always between –1 and 1 • A positive r means that there is a positive association between the sets • A negative r means that there is a negative association between the sets • If r is close to 0, then there is only a weak association between the sets • If r is close to 1 or –1, then there is a strong association between the sets
The Correlation Coefficient • The following plots have the same averages and SDs, with 50 points in each • The only difference between them is the correlation coefficient • Note how the points fall into a line as r approaches 1 or –1
The Correlation Coefficient • To calculate r… • Find the average and SD of each data set • Multiply the data sets pairwise and find the average • The correlation is the average of the product minus the product of the averages, all divided by the product of the SD’s
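The calculation steps above can be sketched in Python. The function name `correlation` and the small data set are illustrative assumptions. The sketch uses the population SD (`statistics.pstdev`, dividing by n), since the "average of the product minus the product of the averages" recipe only matches r when the SDs are computed that way.

```python
from statistics import mean, pstdev

def correlation(xs, ys):
    """r = (average of products - product of averages) / (SD of X * SD of Y)."""
    if len(xs) != len(ys):
        raise ValueError("X and Y must contain the same number of values")
    avg_of_products = mean(x * y for x, y in zip(xs, ys))
    product_of_avgs = mean(xs) * mean(ys)
    return (avg_of_products - product_of_avgs) / (pstdev(xs) * pstdev(ys))

# Illustrative paired data (an assumption, chosen to keep the arithmetic easy).
xs = [1, 3, 4, 5, 7]    # average 4, SD 2
ys = [5, 9, 7, 1, 13]   # average 7, SD 4

r = correlation(xs, ys)
print(round(r, 2))      # (31.2 - 28) / (2 * 4), approximately 0.40
```

For real work a library routine such as `statistics.correlation` (Python 3.10+) gives the same value; the hand-written version is shown only because it mirrors the steps on the slide.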
The Correlation Coefficient • Worked example (table only partly recovered): the XY column of pairwise products reads 5, 27, 28, 5, 91
The Correlation Coefficient • Compute r for the following data (data tables not recovered) • Answers: r = 1 and r = 0.8214
The Correlation Coefficient • Estimate the correlation
The Correlation Coefficient • Estimate the correlation
Sketching Scatter Plots • The SD line is the line consisting of all the points where the standard score in X equals the standard score in Y • zX = zY • To sketch the SD line, draw a line running along the long axis of the football shape, splitting it in half lengthwise • Note that the SD line always goes through the point of averages
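The condition zX = zY can be turned into a small Python helper. The name `sd_line_y` and the example averages and SDs are hypothetical; the sketch assumes a positive association, as the equation on the slide does.

```python
def sd_line_y(x, avg_x, sd_x, avg_y, sd_y):
    """Y value of the point on the SD line above x, found by setting z_Y = z_X."""
    z = (x - avg_x) / sd_x      # standard score of x
    return avg_y + z * sd_y     # the Y value with the same standard score

# Hypothetical summary: fathers average 68 in (SD 3), sons average 69 in (SD 3).
at_average = sd_line_y(68, 68, 3, 69, 3)     # passes through the point of averages
one_sd_above = sd_line_y(71, 68, 3, 69, 3)   # 1 SD above in X gives 1 SD above in Y
print(at_average, one_sd_above)
```

The slope of this line is SD of Y divided by SD of X, and it does not depend on r.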
Sketching Scatter Plots • Given the five-statistic summary (averages, SDs, and correlation) for a pair of data sets, we can sketch the scatter plot • Plot the point of averages in the center • Mark two SDs in both directions, on both axes • Plot the point 1 SD above average for both data sets • Draw a line connecting this point and the point of averages • This is the SD line • Draw an ellipse with the SD line as its long axis • The ellipse should go just beyond the 2 SD marks in all directions • The value of r determines how oblong the ellipse is
Sketching Scatter Plots • A study of the IQs of husbands and wives obtained the following results • Husbands: average IQ = 100, SD = 15 • Wives: average IQ = 100, SD = 15 • r = 0.6 • Sketch the scatter plot
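As a rough aid for the sketch, the landmarks from the sketching steps can be computed from the summary statistics. The function name `sketch_landmarks` and its dictionary keys are hypothetical. The value of r is not used here, since it only controls how narrow the ellipse is drawn.

```python
def sketch_landmarks(avg_x, sd_x, avg_y, sd_y):
    """Key points to mark before drawing the ellipse by hand."""
    return {
        "point_of_averages": (avg_x, avg_y),
        "x_marks": (avg_x - 2 * sd_x, avg_x + 2 * sd_x),   # 2 SDs either side on X
        "y_marks": (avg_y - 2 * sd_y, avg_y + 2 * sd_y),   # 2 SDs either side on Y
        "sd_line_point": (avg_x + sd_x, avg_y + sd_y),     # 1 SD above average in both
    }

# Husbands on the X axis, wives on the Y axis: both average 100 with SD 15.
marks = sketch_landmarks(100, 15, 100, 15)
print(marks)
```

With r = 0.6 the ellipse around the SD line through (100, 100) and (115, 115) would be drawn moderately narrow, extending a little past 70 and 130 on each axis.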
Changes of Scale • The correlation coefficient is not affected by changes of scale • Moving: adding the same number to all of the values of one variable • Stretching: multiplying all of the values of one variable by the same positive number • Would r change if we multiplied by a negative number? • The correlation coefficient is also unaffected by interchanging the two data sets
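These claims can be checked numerically. The sketch below is self-contained, uses illustrative data (an assumption, not from the slides), and computes r with the population SD. The `correlation` helper is a hand-written stand-in, not a library API.

```python
from statistics import mean, pstdev

def correlation(xs, ys):
    """r = (average of products - product of averages) / (SD of X * SD of Y)."""
    avg_of_products = mean(x * y for x, y in zip(xs, ys))
    return (avg_of_products - mean(xs) * mean(ys)) / (pstdev(xs) * pstdev(ys))

# Illustrative paired data.
xs = [1, 3, 4, 5, 7]
ys = [5, 9, 7, 1, 13]

r = correlation(xs, ys)
moved = correlation([x + 10 for x in xs], ys)      # add the same number to every X
stretched = correlation([3 * x for x in xs], ys)   # multiply every X by a positive number
flipped = correlation([-2 * x for x in xs], ys)    # multiply every X by a negative number
swapped = correlation(ys, xs)                      # interchange the two data sets

print(r, moved, stretched, flipped, swapped)
```

Up to floating-point rounding, `moved`, `stretched` and `swapped` equal `r`, while `flipped` equals `-r`: a negative multiplier keeps the strength of the association but reverses its direction.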
Changes of Scale • Compute r for each of the following data sets • r = –0.15
Summary • The relationship between two variables, X and Y, can be graphed in a scatter plot • When the scatter plot is tightly clustered around a line, there is a strong linear association between X and Y • A scatter plot can be characterized by its five-statistic summary • Average and SD of the X values • Average and SD of the Y values • Correlation coefficient
Summary • When the correlation coefficient gets closer to 1 or –1, the points cluster more tightly around a line • Positive association has a positive r-value • Negative association has a negative r-value • Calculating the correlation coefficient • Take the average of the product • Subtract the product of the averages • Divide the difference by the product of the SD’s
Summary • The correlation coefficient is not affected by changes of scale or transposing the variables • Correlation does not measure causation!