Correlation and Simple Linear Regression


Pearson’s Product Moment Correlation (sample correlation r estimates population correlation ρ)
• Measures the strength of linear association between two numeric variables X and Y.
• The correlation (r) is a unit-less quantity.
• -1 ≤ r ≤ 1
• If r < 0 then there is a negative association between X and Y, i.e. as X increases Y generally decreases.
• If r > 0 then there is a positive association between X and Y, i.e. as X increases Y generally increases.

Pearson’s Product Moment Correlation (sample correlation r estimates population correlation ρ)
• The closer r is to 0, the weaker the linear association between X and Y.
[Figure: adjective scale for the sample correlation coefficient ( r ).]

Pearson’s Product Moment Correlation ( r ) Some examples of various positive and negative correlations.

Pearson’s Product Moment Correlation (sample correlation r estimates population correlation ρ)
• Never calculate a correlation coefficient without plotting the data!
• Correlation is NOT causation!
• Beware of influential points.
[Figure: example plots with r = .11 and r = .72.]

Pearson’s Product Moment Correlation (sample correlation r estimates population correlation ρ)
• Beware of outliers.
[Figure: example plots with r = .66 and r = .86.]

Examples where no form of correlation coefficient is appropriate

Pearson’s Product Moment Correlation ( r ) • The formula (and equivalencies)
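
A sketch of the standard formula and its common equivalent forms, assuming the usual textbook notation (sample means $\bar{x}$, $\bar{y}$; sample standard deviations $s_x$, $s_y$ with divisor $n-1$; z-scores $z_{x_i}$, $z_{y_i}$). These symbols are conventions assumed here, not notation confirmed by the slide:

```latex
r \;=\; \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}
             {\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2 \,\sum_{i=1}^{n}(y_i-\bar{y})^2}}
  \;=\; \frac{1}{n-1}\sum_{i=1}^{n}
        \left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right)
  \;=\; \frac{1}{n-1}\sum_{i=1}^{n} z_{x_i}\, z_{y_i}
```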

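Pearson’s Product Moment Correlation ( r )
[Figure: scatterplot split into quadrants at the means of X and Y. Upper right: zx, zy > 0. Upper left: zx < 0 & zy > 0. Lower right: zx > 0 & zy < 0. Lower left: zx, zy < 0.]
Points in the upper-right and lower-left quadrants have a positive product zx·zy and push r up; points in the other two quadrants have a negative product and pull r down.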

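Pearson’s Product Moment Correlation ( r )
[Figure: a second scatterplot split into quadrants at the means of X and Y. Upper left: zx < 0 & zy > 0. Upper right: zx, zy > 0. Lower left: zx, zy < 0. Lower right: zx > 0 & zy < 0.]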
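
The quadrant picture can be sketched in code: r is the sum of the products of z-scores divided by n − 1, so each point's contribution has the sign of its quadrant. The data below are made up purely for illustration, and the function names are hypothetical.

```python
import math

def z_scores(values):
    """Standardize values using the sample standard deviation (divisor n - 1)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]

def pearson_r(x, y):
    """Pearson's r as the sum of z-score products divided by n - 1."""
    zx, zy = z_scores(x), z_scores(y)
    return sum(a * b for a, b in zip(zx, zy)) / (len(x) - 1)

def quadrant_counts(x, y):
    """Count points whose z-score product is positive vs. negative."""
    zx, zy = z_scores(x), z_scores(y)
    positive = sum(1 for a, b in zip(zx, zy) if a * b > 0)
    negative = sum(1 for a, b in zip(zx, zy) if a * b < 0)
    return positive, negative

# Hypothetical illustration data (not from the slides).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
print(round(r, 4), quadrant_counts(x, y))
```

Points that sit exactly at a mean have a product of zero and contribute nothing to r, which is why they are counted in neither group.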

Nonlinear Relationships
Pulmonary Artery Pressure vs. Relative Flow Velocity Change
Not all relationships are linear. In cases where there is clear evidence of a nonlinear relationship, DO NOT use Pearson’s Product Moment Correlation ( r ) to summarize the strength of the relationship between Y and X. Here artery pressure (Y) and change in flow velocity (X) are clearly nonlinearly related.

Testing for Significant Correlation
Testing Population Correlation (ρ)
Most software packages will conduct this test for any correlation you are interested in. When looking at multiple correlations, be sure to consider a Bonferroni correction.
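
A minimal sketch of what such software does, assuming the usual t-test of H0: ρ = 0, in which t = r·sqrt((n − 2)/(1 − r²)) is referred to a t distribution with n − 2 degrees of freedom. The function names, sample size, and p-values below are hypothetical. Turning t into a p-value needs a t distribution from a statistics package, so that step is left out here.

```python
import math

def correlation_t_statistic(r, n):
    """t statistic for H0: rho = 0; compare to a t distribution with n - 2 df."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

def bonferroni_significant(p_values, alpha=0.05):
    """Flag each test as significant only if p < alpha / (number of tests)."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]

# Hypothetical example: r = 0.5 observed on n = 27 pairs.
t_stat = correlation_t_statistic(0.5, 27)
print(round(t_stat, 4))

# Hypothetical p-values from three correlations examined together.
flags = bonferroni_significant([0.001, 0.02, 0.04])
print(flags)
```

With three tests the per-test threshold drops from .05 to about .0167, so only the first hypothetical p-value remains significant.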

Other measures of correlation/association
• Spearman’s Rank Correlation ( rs ): can be used with ordinal data where the levels are in some sense equidistant. Can also be used when relationships are nonlinear but monotonic. It involves ranking the x’s and y’s and finding Pearson’s correlation based on the ranks.
• Kendall’s Tau (a, b, & c): used when X and Y are both ordinal; the levels need not be “equidistant”.
   a: does not adjust for ties
   b: adjusts for ties; X and Y must have the same number of levels
   c: adjusts for ties; X and Y need not have the same number of levels
   Data for Kendall’s Tau-a, b, or c could be summarized using a contingency table.
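
The ranking recipe for Spearman's correlation can be sketched as follows. Tied values receive the average of the ranks they span, which is the common convention and an assumption here. The data are made up for illustration.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson_r(x, y):
    """Pearson's correlation from sums of squares and cross-products."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman_rs(x, y):
    """Spearman's rank correlation: Pearson's r computed on the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Hypothetical monotonic but nonlinear relationship (y = x cubed).
x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]
print(round(pearson_r(x, y), 4), spearman_rs(x, y))
```

Because y increases whenever x does, the ranks agree exactly and rs is 1, while Pearson's r falls short of 1 because the relationship is not a straight line.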

Other measures of correlation/association
• Biserial Correlation: correlation when X is continuous and Y is either naturally dichotomous (e.g. male/female, smoker/non-smoker) or a created dichotomy (e.g. Age < 18 / Age ≥ 18).
• Polyserial Correlation: correlation when X is continuous and Y is ordinal.
• JMP and SPSS compute Spearman’s and Kendall’s tau; the others require specialized software.

Examples: NC Births Data Age of Father vs. Age of Mother The Pearson Product Moment Correlation (r = .7543, p < .0001) suggests a fairly strong correlation between father’s age and mother’s age. The line represents the mean age of the father given the mother’s age.

Examples: NC Births Data
Birth weight (g) and Gestational Age (wks)
Is the relationship between birth weight and gestational age linear? Hard to say; consider adding a smoothing spline to the plot. The smoothing spline helps us see the relationship between the mean birth weight and gestational age. The notation we use for the mean of Y given X is E(Y|X). Here we have E(Birth Weight|Gest. Age), and the spline is a smooth estimate of E(Birth Weight|Gest. Age).

Examples: NC Births Data Birth weight (g) and Gestational Age (wks) The Pearson Product Moment Correlation (r = .5515, p < .0001) suggests a moderate correlation between gestational age and birth weight. However the smooth curve estimate of the mean birth weight suggests the relationship is not linear, thus the Pearson correlation may not be appropriate. We could consider using Spearman’s rank correlation instead (rs = .3056, p < .0001).

Examples: NC Births Data
Birth Weight vs. Smoking Status
The biserial correlation between smoking status and birth weight (rbs = -.1175, p = .0002) suggests a weak negative association between smoking and birth weight. The biserial correlation is found by computing Pearson’s correlation between birth weight and smoking status, where smoking status is coded as 0 = no, 1 = yes and treated as a numeric quantity (a calculation often called the point-biserial correlation).

Medicare Survey Data – General Health at Baseline & Follow-up
Association between baseline & follow-up general health (revisited)
Kendall’s Tau can be used to measure the degree of association between two ordinal variables; here τ = .5933, p < .0001.
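
To make the idea behind Kendall's tau concrete, here is a sketch of the simplest variant, tau-a, which counts concordant and discordant pairs and makes no adjustment for ties. The ordinal codes below are hypothetical and are not the Medicare data. The tie-adjusted variants (b and c) change the denominator and are not shown.

```python
from itertools import combinations

def kendall_tau_a(x, y):
    """Tau-a: (concordant - discordant) / (total number of pairs)."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1   # the pair is ordered the same way on X and Y
        elif sign < 0:
            discordant += 1   # the pair is ordered oppositely on X and Y
        # sign == 0 means a tie on X or Y; tau-a leaves such pairs uncounted
    n = len(x)
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical ordinal codes (1 = poor ... 5 = excellent) for five people.
baseline = [1, 2, 3, 4, 5]
follow_up = [1, 3, 2, 5, 4]

tau = kendall_tau_a(baseline, follow_up)
print(tau)
```

In this toy example 8 of the 10 pairs are concordant and 2 are discordant, giving tau-a = 0.6.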

Simple Linear Regression
• Regression refers to the estimation of the mean of a response (Y) given information about a single predictor X in the case of simple regression, or multiple X’s in the case of multiple regression.
• We denote this mean as E(Y|X) in the case of simple regression and E(Y|X1,X2,…,Xp) in the case of multiple regression with p potential predictors.

Simple Linear Regression
• In the case of simple linear regression we assume that the mean can be modeled using a function that is linear in the unknown parameters, e.g. E(Y|X) = β0 + β1X, along with other mean functions that are linear in the parameters. These are all examples of simple linear regression models. When many people think of simple linear regression they think of the first mean function, E(Y|X) = β0 + β1X, because it is the equation of a line. This model will be our primary focus.

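Simple Linear Regression Model
observed y = trend + scatter
• The regression model: the mean function E(Y|X) (the trend).
• The data model: each observed y is the mean function plus a random error (the scatter).
• The fitted model: each observed y is the estimated mean function plus a residual.
[Figure: the mean function E(Y|X) with bands at E(Y|X) + SD(Y|X) and E(Y|X) - SD(Y|X).]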
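
For the straight-line mean function, the three models can be written out as below. This is a sketch in standard notation; the hats for estimates and the symbol $\hat{e}_i$ for the residual are assumed conventions.

```latex
\begin{aligned}
\text{Regression model:}\quad & E(Y \mid X) = \beta_0 + \beta_1 X \\
\text{Data model:}\quad       & y_i = \beta_0 + \beta_1 x_i + e_i \\
\text{Fitted model:}\quad     & y_i = \hat{\beta}_0 + \hat{\beta}_1 x_i + \hat{e}_i
\end{aligned}
```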

Assumptions
1. The assumed functional form for the mean function E(Y|X) is correct, e.g. we might assume a line for the mean function, E(Y|X) = β0 + β1X.
2. The random scatter, i.e. the ei’s or “random errors”: we assume that the ei ~ N(0, σ²) and are independent. This means that for a given X, Y is normal, with constant variation for all X.
3. Error standard deviation (σ)
   • Std. dev. of the random process producing the “errors” ei
   • Governs the amount of scatter about the mean E(Y|X)
   • big σ ⇒ lots of scatter; σ = 0 ⇒ no scatter

Assumptions (cont’d) Note: Normality is required for inference, i.e. t-tests, F-tests, and CI’s for model parameters and predictions.

Gestational Age and Weight
In a study of premature infants, researchers recorded each infant’s gestational age (weeks) and weight (g).

Simple Linear Regression Example 1: Gestational Age (weeks) and Weight of Premature Babies (g)
r = .80
Estimated E(Weight|Age) = β̂0 + β̂1·Age = -1404.36 + 86.06·Age
The regression equation is estimated from the data by minimizing the sum of the squared vertical distances from the data points to the fitted line. These vertical distances are called residuals. Some of the residuals have been added to the plot. The fitted values are the points that lie on the line for each of the xi values. They represent the estimated mean for that value of x.

Fitting a line by least squares
[Figure: the data with several candidate lines. Which line?]

Fitting a line by least squares
[Figure: (a) The data. (b) Which line?]
• Choose the line with the smallest sum of squared prediction errors
• i.e. the smallest Residual Sum of Squares, RSS

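Fitting a line by least squares
[Figure: data points at x1, x2, …, xi, …, xn with a candidate line; an “error” arrow runs vertically from each data point to the line.]
For the i-th data point (xi, yi), the prediction error (residual) is yi − ŷi, where ŷi is the height of the line at xi. The least-squares line is the line chosen to minimize Σ(yi − ŷi)², i.e. the line with the smallest sum of squared prediction errors, or equivalently the smallest sum of squared lengths of the “error” arrows.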

Method of Least Squares
To estimate the regression line we choose the intercept and slope estimates, β̂0 and β̂1, that minimize RSS = Σ(yi − β̂0 − β̂1xi)². This requires calculus, but the solutions are easy to express in terms of standard summary statistics. Next we look at the estimated coefficients and their interpretation.
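
A minimal sketch of those closed-form solutions, using the standard results slope = Sxy/Sxx and intercept = ȳ − slope·x̄. The data are made up for illustration and the function names are hypothetical.

```python
def least_squares_line(x, y):
    """Return (intercept, slope) of the least-squares line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

def residual_sum_of_squares(x, y, intercept, slope):
    """Sum of squared vertical distances from the points to the line."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

# Hypothetical illustration data (not the gestational age data).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b0, b1 = least_squares_line(x, y)
rss = residual_sum_of_squares(x, y, b0, b1)
print(round(b0, 4), round(b1, 4), round(rss, 4))

# Any other line has a larger RSS, e.g. after nudging the intercept.
rss_nudged = residual_sum_of_squares(x, y, b0 + 0.1, b1)
print(round(rss_nudged, 4))
```

The nudged line is included only to show that moving away from the least-squares solution increases RSS.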

The Regression Line
ŷ = β̂0 + β̂1x (i.e. y = mx + b)
β̂0 = Intercept = ŷ-value at x = 0. Interpretable only if x = 0 is a value of particular interest.
β̂1 = Slope = change in ŷ for every unit increase in x (an increase of w units in x changes ŷ by β̂1·w units). Always interpretable!

Interpretation of Coefficients and the Estimated Regression Equation
What are the units on the intercept and slope parameters? Intercept: grams. Slope: grams/week.
How do we interpret the estimated values?
β̂0 = y-intercept, the estimated mean of Y when X = 0; usually of no interest unless 0 is a reasonable value for X. Here X = 0 is meaningless.
β̂1 = slope, the estimated change in the mean of Y when X increases by 1. Here we estimate that the mean weight increases by about 86 grams per week of gestational age.

Interpretation of Coefficients and the Estimated Regression Equation
Estimating the mean weight as a function of the gestational age.
• Use the equation to estimate the mean weight of infants born with a gestational age of 30 weeks.
• Use the equation to estimate the mean weight of infants born with a gestational age of 42 weeks. This is beyond the range of the data, as none of the infants in these data were full term or longer.
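
The two estimates can be worked out directly from the fitted equation. This sketch uses intercept -1404.36 g and slope 86.06 g/week, the slope that agrees with the 86 g/week interpretation and with the centers of the intervals reported for this example. The function name is hypothetical.

```python
INTERCEPT_G = -1404.36       # estimated intercept, grams
SLOPE_G_PER_WEEK = 86.06     # estimated slope, grams per week of gestation

def estimated_mean_weight(age_weeks):
    """Estimated mean birth weight (g) at a given gestational age (weeks)."""
    return INTERCEPT_G + SLOPE_G_PER_WEEK * age_weeks

weight_30 = estimated_mean_weight(30)
weight_42 = estimated_mean_weight(42)  # extrapolation beyond the observed ages

print(round(weight_30, 2), round(weight_42, 2))
```

The 42-week figure is an extrapolation: the arithmetic works, but nothing in these data supports assuming the line continues to hold for full-term infants.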

F-test in Regression ANOVA Summary

Source       Sum of Sqs   df        Mean SS   F-statistic        p-val.
Regression   SSReg        k         MSReg     F0 = MSReg / MSE   P(F ≥ F0)
Residual     RSS          n-k-1     MSE
Total        SSTotal      n-1

Tests H0: The regression is NOT useful, i.e. H0: β1 = 0 (and β2 = 0 and … βk = 0 if we are performing multiple regression)
• Almost always significant (P-value almost always small)
• Very rare that an investigator’s intuition is so bad that none of her or his explanatory variables have any predictive value
Note: F0 = MSReg / MSE
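
A sketch of how the entries of the ANOVA table are computed for simple linear regression (k = 1), using made-up data. The dictionary keys are hypothetical names. The p-value step is omitted because it requires an F distribution with (k, n − k − 1) degrees of freedom from a statistics package.

```python
def simple_regression_anova(x, y):
    """ANOVA quantities for a least-squares line fitted to (x, y)."""
    n = len(x)
    k = 1  # one predictor in simple linear regression
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    intercept = y_bar - slope * x_bar
    fitted = [intercept + slope * xi for xi in x]

    ss_total = sum((yi - y_bar) ** 2 for yi in y)             # total variation
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))    # residual variation
    ss_reg = ss_total - rss                                   # explained variation

    ms_reg = ss_reg / k
    mse = rss / (n - k - 1)
    return {
        "df": (k, n - k - 1, n - 1),
        "ss_reg": ss_reg,
        "rss": rss,
        "ss_total": ss_total,
        "ms_reg": ms_reg,
        "mse": mse,
        "f0": ms_reg / mse,
        "r_squared": ss_reg / ss_total,
        "rmse": mse ** 0.5,
    }

# Hypothetical illustration data (not the gestational age data).
table = simple_regression_anova([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
for name, value in table.items():
    print(name, value)
```

The same quantities also give R² and the RMSE, which is why they are returned together here.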

F-test in Regression ANOVA Summary
F0 = 67.49, p-value < .0001
Thus we conclude that the regression is useful and that gestational age helps explain the variation in the observed birth weights.

Summarizing the Fit
R² = proportion of variation explained by the regression of Y on X. Here this proportion is .6398 or 63.98%. This will be discussed in the next few slides.
Root Mean Square Error (RMSE) = estimate of the residual or error standard deviation (σ).

Towards “Percent Variation Explained”
[Figure: two plots against x. One shows the actual y-observations, i.e. the variation in the y’s. The other shows the fitted or predicted values, i.e. the variation in the ŷ’s.]

Percent of Variation Explained
In a situation where we had a perfectly fitting model, we would get this much variation in the y’s transmitted from the variation in the x’s.

Percent of Variation Explained
Our data has slightly more variation in the y’s than that.

Percent of Variation Explained
Variation in the ŷ’s: this amount of variation can be “explained” as transmitted from the variation in the x’s. We see some additional variation in the y-values here. The excess (residual variation) is not explained by the model.

R-squared: Percent Variation Explained (R² is also called the “Coefficient of determination”)
• When expressed as a percentage, R² is the “percent variation explained”.
• It is the percentage of variation in the y-values that the model can explain from the variation in the x-values.
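
In terms of sums of squares, the definition can be sketched as below. The labels $SS_{Reg}$ and $SS_{Total}$ are assumed names for the regression and total sums of squares; RSS is the residual sum of squares.

```latex
R^2 \;=\; \frac{SS_{Reg}}{SS_{Total}} \;=\; 1 - \frac{RSS}{SS_{Total}},
\qquad
\text{and in simple linear regression } R^2 = r^2 .
```

For the gestational age example, r = .80 gives r² ≈ .64, in line with the reported R² = .6398 (the small gap comes from rounding r).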

Summarizing the Fit
R² = .6398: 63.98% of the variation in the birth weights can be explained by the regression on gestational age.

Inference for Model Parameters
• Testing Parameters (βj)
• Confidence Interval for βj
These both apply for multiple regression as well.
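
The usual forms of the test and interval are sketched below, assuming the standard t-based procedures with n − k − 1 residual degrees of freedom. $SE(\hat{\beta}_j)$ and the critical value $t^{*}$ are notation assumed here.

```latex
\begin{aligned}
\text{Test of } H_0\colon \beta_j = 0: \quad
  & t_0 = \frac{\hat{\beta}_j}{SE(\hat{\beta}_j)}, \quad
    \text{compared to a } t \text{ distribution with } n-k-1 \text{ df} \\[4pt]
\text{Confidence interval:} \quad
  & \hat{\beta}_j \;\pm\; t^{*}_{\,n-k-1}\; SE(\hat{\beta}_j)
\end{aligned}
```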

Inference for Model Parameters
Testing Parameters (βj)
We have strong evidence that the slope is not 0 and hence conclude that gestational age is a statistically significant predictor.

Inference for Model Parameters
Confidence Interval for βj
We estimate with 95% confidence that infants in this population gained between 64.85 g and 107.27 g per week of gestation.

Estimating E(Y|X) and Predicting for Response Values for Individuals
• We can construct CI’s for the mean of Y for a given value of X, or an interval for an individual with a given value of X. In the latter case we refer to the interval as a prediction interval (PI).
• Both intervals use the same point estimate from the regression equation as their center.

Estimating E(Y|X) and Predicting for Response Values for Individuals
• Confidence Interval for the mean birth weight of infants in this population with a gestational age of 34 weeks.
• Prediction Interval for the birth weight of an individual infant with a gestational age of 30 weeks.
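
The standard simple-regression formulas for the two intervals at a value $x_0$ are sketched below. The notation is assumed: $\hat{y}_0 = \hat{\beta}_0 + \hat{\beta}_1 x_0$, $s$ is the RMSE, $S_{xx} = \sum (x_i - \bar{x})^2$, and $t^{*}$ is the critical value with n − 2 degrees of freedom.

```latex
\begin{aligned}
\text{CI for } E(Y \mid X = x_0): \quad
  & \hat{y}_0 \;\pm\; t^{*}_{\,n-2}\; s\,
    \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}} \\[6pt]
\text{PI for an individual at } x_0: \quad
  & \hat{y}_0 \;\pm\; t^{*}_{\,n-2}\; s\,
    \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}
\end{aligned}
```

The extra 1 under the square root accounts for the scatter of an individual about the mean, which is why the prediction interval is always the wider of the two.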

Estimating E(Y|X) and Predicting for Response Values for Individuals
Gestational Age = 30 weeks
• We estimate that the mean birth weight of infants born with a gestational age of 30 weeks is between 1113.78 g and 1241.19 g. (CI)
• We estimate that 95% of all infants born with a gestational age of 30 weeks will have a birth weight between 795.89 g and 1559.08 g. (PI)
Gestational Age = 34 weeks
• We estimate that the mean birth weight of infants born with a gestational age of 34 weeks is between 1435.76 g and 1607.67 g. (CI)
• We estimate that 95% of all infants born with a gestational age of 34 weeks will have a birth weight between 1135.79 g and 1907.66 g. (PI)
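
As a quick arithmetic check on the reported intervals, the sketch below confirms that the CI and PI at each age share the same center, that this center matches the fitted value, and that the PI is wider. The fitted value uses intercept -1404.36 and slope 86.06, and the helper names are hypothetical.

```python
# Intervals as reported above, in grams, keyed by gestational age in weeks.
reported = {
    30: {"ci": (1113.78, 1241.19), "pi": (795.89, 1559.08)},
    34: {"ci": (1435.76, 1607.67), "pi": (1135.79, 1907.66)},
}

def center(interval):
    low, high = interval
    return (low + high) / 2

def width(interval):
    low, high = interval
    return high - low

def fitted_mean(age_weeks, intercept=-1404.36, slope=86.06):
    """Point estimate from the fitted regression equation (grams)."""
    return intercept + slope * age_weeks

summary = {}
for age, intervals in reported.items():
    summary[age] = {
        "fitted": fitted_mean(age),
        "ci_center": center(intervals["ci"]),
        "pi_center": center(intervals["pi"]),
        "ci_width": width(intervals["ci"]),
        "pi_width": width(intervals["pi"]),
    }
    print(age, {key: round(value, 2) for key, value in summary[age].items()})
```

The centers agree with the fitted values to within rounding of the published coefficients.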

Checking Assumptions
1. The assumed functional form for the mean function E(Y|X) is correct, e.g. we might assume a line for the mean function, E(Y|X) = β0 + β1X.
   • Plot of residuals vs. fitted values
   • Lack of fit tests (when available)
2. The errors are independent and normally distributed with constant variance, i.e. ei ~ N(0, σ²).
   • Plot of residuals vs. fitted values
   • Normal quantile plot of the residuals
