
CHAPTER 7 Linear Correlation & Regression Methods


Presentation Transcript


  1. CHAPTER 7 Linear Correlation & Regression Methods 7.1 - Motivation 7.2 - Correlation / Simple Linear Regression 7.3 - Extensions of Simple Linear Regression

  2. Testing for association between two POPULATION variables X and Y… Parameter Estimation via SAMPLE DATA …
     • Categorical variables: Chi-squared Test
       Examples: X = Disease status (D+, D–), Y = Exposure status (E+, E–)
                 X = # children in household (0, 1-2, 3-4, 5+), Y = Income level (Low, Middle, High)
     • Numerical variables: ???????
       POPULATION PARAMETERS
       • Means: μX, μY
       • Variances: σX², σY²
       • Covariance: σXY = Cov(X, Y)
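     A minimal R sketch of the categorical case; the 2×2 counts below are hypothetical, invented only to illustrate chisq.test() on a disease-by-exposure table:
       # hypothetical counts, not from the slides
       tbl = matrix(c(30, 20, 15, 35), nrow = 2,
                    dimnames = list(Exposure = c("E+", "E-"), Disease = c("D+", "D-")))
       chisq.test(tbl, correct = FALSE)   # Pearson chi-squared test of association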

  3. Parameter Estimation via SAMPLE DATA … • Numerical variables: ???????
     POPULATION PARAMETERS                      SAMPLE STATISTICS
     • Means: μX, μY                            • Means: x̄, ȳ
     • Variances: σX², σY²                      • Variances: sx², sy²
     • Covariance: σXY                          • Covariance: sxy (can be +, –, or 0)
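     The sample statistics can be computed directly in R; here x and y are typed in from the n = 10 sample that appears on the R slide later in the deck:
       x = c(1.1, 1.8, 2.1, 3.7, 4.0, 7.3, 9.1, 11.9, 12.4, 17.1)
       y = c(13.1, 18.3, 17.6, 19.1, 19.3, 3.2, 5.6, 13.6, 8.0, 3.0)
       c(mean(x), mean(y))                      # sample means: 7.05, 12.08
       c(var(x), var(y))                        # sample variances (divisor n - 1)
       cov(x, y)                                # sample covariance, here negative
       sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1)   # same value by the defining formula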

  4. Parameter Estimation via SAMPLE DATA … • Numerical variables: ???????
     [Scatterplot of Y vs. X, n data points; source: JAMA. 2003;290:1486-1493]
     POPULATION PARAMETERS                      SAMPLE STATISTICS
     • Means: μX, μY                            • Means: x̄, ȳ
     • Variances: σX², σY²                      • Variances: sx², sy²
     • Covariance: σXY                          • Covariance: sxy (can be +, –, or 0)

  5. Parameter Estimation via SAMPLE DATA … • Numerical variables: ???????
     [Scatterplot of Y vs. X, n data points; source: JAMA. 2003;290:1486-1493]
     POPULATION PARAMETERS                      SAMPLE STATISTICS
     • Means: μX, μY                            • Means: x̄, ȳ
     • Variances: σX², σY²                      • Variances: sx², sy²
     • Covariance: σXY                          • Covariance: sxy (can be +, –, or 0)
     Does this suggest a linear trend between X and Y? If so, how do we measure it?

  6. Testing for LINEAR association between two population variables X and Y…
     • Numerical variables: ???????
     POPULATION PARAMETERS
     • Means: μX, μY
     • Variances: σX², σY²
     • Covariance: σXY
     • Linear Correlation Coefficient: ρ = σXY / (σX σY), always between –1 and +1
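     A quick check in R that the sample correlation is just the covariance rescaled by both standard deviations (x and y as defined above):
       r = cov(x, y) / (sd(x) * sd(y))   # rescaling forces the value into [-1, +1]
       all.equal(r, cor(x, y))           # TRUE: cor() computes exactly this quantity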

  7. Parameter Estimation via SAMPLE DATA … • Numerical variables: ???????
     [Scatterplot of Y vs. X, n data points; source: JAMA. 2003;290:1486-1493]
     POPULATION PARAMETERS                      SAMPLE STATISTICS
     • Means: μX, μY                            • Means: x̄, ȳ
     • Variances: σX², σY²                      • Variances: sx², sy²
     • Covariance: σXY                          • Covariance: sxy (can be +, –, or 0)
     • Linear Correlation Coefficient:          • Linear Correlation Coefficient:
       ρ = σXY / (σX σY)                          r = sxy / (sx sy)
       Always between –1 and +1                   Always between –1 and +1

  8. Parameter Estimation via SAMPLE DATA … • Numerical variables
     Example in R (reformatted for brevity):
       > pop = seq(0, 20, 0.1)
       > x = sort(sample(pop, 10))
         1.1  1.8  2.1  3.7  4.0  7.3  9.1  11.9  12.4  17.1
       > y = sample(pop, 10)
         13.1  18.3  17.6  19.1  19.3  3.2  5.6  13.6  8.0  3.0
       > c(mean(x), mean(y))
         7.05  12.08
       > var(x)
         29.48944
       > var(y)
         43.76178
       > cov(x, y)
         -25.86667
       > cor(x, y)
         -0.7200451
       > plot(x, y, pch = 19)
     Scatterplot, n = 10 data points.

  9. Parameter Estimation via SAMPLE DATA … • Numerical variables
     • Linear Correlation Coefficient: r = sxy / (sx sy), always between –1 and +1
     r measures the strength of linear association.
     [Scatterplot of Y vs. X, n data points; source: JAMA. 2003;290:1486-1493]

  10. As above, with the r scale added below the scatterplot:
      r:  –1 (negative linear correlation) …… 0 …… +1 (positive linear correlation)

  11. As above.

  12. As above; r measures the strength of linear association.

  13. As above, now locating our sample on the r scale:
      > cor(x, y)
        -0.7200451

  14. Testing for linear association between two numerical population variables X and Y…
      Now that we have r, we can conduct HYPOTHESIS TESTING on ρ:  H0: ρ = 0 vs. HA: ρ ≠ 0.
      • Linear Correlation Coefficient: Test Statistic for p-value
        t = r √(n – 2) / √(1 – r²)  on n – 2 degrees of freedom
      Here t = (–0.7200451) √8 / √(1 – 0.5185) = –2.935, so
      p-value = 2 * pt(-2.935, 8) = .0189 < .05, and we reject H0.
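      The same test in R, done by hand and with the built-in cor.test() (x and y as above):
        r = cor(x, y);  n = length(x)
        t.stat = r * sqrt(n - 2) / sqrt(1 - r^2)   # -2.935
        2 * pt(-abs(t.stat), df = n - 2)           # two-sided p-value, about 0.0189
        cor.test(x, y)                             # reports the same t, df = 8, and p-value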

  15. Parameter Estimation via SAMPLE DATA …
      • Linear Correlation Coefficient: r measures the strength of linear association
        > cor(x, y)
          -0.7200451
      If such an association between X and Y exists, then it follows that for any intercept β0 and slope β1 we have
        “Response = Model + Error”:  Y = β0 + β1 X + ε
      Find estimates β̂0 and β̂1 for the “best” line ŷ = β̂0 + β̂1 x … “best” in what sense???
      [Scatterplot with a candidate line and its residuals]

  16. Parameter Estimation via SAMPLE DATA … SIMPLE LINEAR REGRESSION via the METHOD OF LEAST SQUARES
      If such an association between X and Y exists, then for any intercept β0 and slope β1 we have
        “Response = Model + Error”:  Y = β0 + β1 X + ε
      • Linear Correlation Coefficient: r measures the strength of linear association (cor(x, y) = -0.7200451)
      “Least Squares Regression Line”: find estimates β̂0 and β̂1 for the “best” line, i.e., the line that minimizes the residual sum of squares Σ (yi – ŷi)².

  17. SIMPLE LINEAR REGRESSION via the METHOD OF LEAST SQUARES
      Find estimates β̂0 and β̂1 for the “best” line, i.e., the line ŷ = β̂0 + β̂1 x that minimizes the residual sum of squares Σ (yi – ŷi)².
      Setting the partial derivatives with respect to β0 and β1 equal to zero yields
        β̂1 = sxy / sx²   and   β̂0 = ȳ – β̂1 x̄.   Check ✓
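      A numerical check of the closed-form estimates against lm() (x and y as above):
        b1 = cov(x, y) / var(x)        # slope: sxy / sx^2, about -0.8772
        b0 = mean(y) - b1 * mean(x)    # intercept: ybar - b1 * xbar, about 18.2639
        coef(lm(y ~ x))                # lm() returns the same two estimates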

  18. SIMPLE LINEAR REGRESSION via the METHOD OF LEAST SQUARES
      Predictor xi, observed response yi.
      Find estimates β̂0 and β̂1 for the “best” line, i.e., the line that minimizes the residual sum of squares Σ (yi – ŷi)².

  19. As above, adding the fitted response ŷi = β̂0 + β̂1 xi for each predictor value xi.

  20. As above: predictor xi, observed response yi, fitted response ŷi.

  21. As above, adding the residuals ei = yi – ŷi, the vertical distances between observed and fitted responses.

  22. As above: the least squares line is the one that minimizes the sum of squared residuals Σ (yi – ŷi)² over all candidate lines.

  23. Testing for linear association between two numerical population variables X and Y…
      Now that we have these, we can conduct HYPOTHESIS TESTING on β0 and β1.
      • Linear Regression Coefficients: “Response = Model + Error”
      Test Statistic for p-value?

  24. SIMPLE LINEAR REGRESSION via the METHOD OF LEAST SQUARES
      Predictor xi, observed response yi, fitted response ŷi, residuals ei = yi – ŷi; the least squares line minimizes Σ (yi – ŷi)².

  25. Testing for linear association between two numerical population variables X and Y…
      Now that we have these, we can conduct HYPOTHESIS TESTING on β0 and β1.
      • Linear Regression Coefficients: “Response = Model + Error”
      Test Statistic for p-value:  t = β̂1 / SE(β̂1)  on n – 2 degrees of freedom
      p-value = .0189. Same t-score as the test of H0: ρ = 0!
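      The slope's standard error and t-score can be reproduced directly from the residuals; a sketch, with x and y as above:
        lsreg = lm(y ~ x)
        mse   = sum(resid(lsreg)^2) / (length(x) - 2)   # residual mean square, 23.707
        se.b1 = sqrt(mse / sum((x - mean(x))^2))        # SE of the slope, about 0.2989
        coef(lsreg)["x"] / se.b1                        # t = -2.935, as in the correlation test
        2 * pt(-2.935, df = 8)                          # p-value about 0.0189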

  26. > plot(x, y, pch = 19)
      > lsreg = lm(y ~ x)      # or lsfit(x, y)
      > abline(lsreg)
      > summary(lsreg)

      Call:
      lm(formula = y ~ x)

      Residuals:
          Min      1Q  Median      3Q     Max
      -8.6607 -3.2154  0.8954  3.4649  5.7742

      Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
      (Intercept)  18.2639     2.6097   6.999 0.000113 ***
      x            -0.8772     0.2989  -2.935 0.018857 *
      ---
      Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

      Residual standard error: 4.869 on 8 degrees of freedom
      Multiple R-squared: 0.5185, Adjusted R-squared: 0.4583
      F-statistic: 8.614 on 1 and 8 DF, p-value: 0.01886

      BUT WHY HAVE TWO METHODS FOR THE SAME PROBLEM??? Because this second method generalizes…

  27. Testing for linear association between a population response variable Y and multiple predictor variables X1, X2, X3, … etc.: Multilinear Regression
      “Response = Model + Error”:  Y = β0 + β1 X1 + β2 X2 + β3 X3 + … + ε   (“main effects”)
      For now, assume the “additive model,” i.e., main effects only.

  28. Multilinear Regression
      [Figure: regression plane over predictors X1 and X2; at each point (x1i, x2i) the residual is the vertical distance between the true response yi and the fitted response on the plane.]
      Least Squares calculation of regression coefficients is computer-intensive. Formulas require Linear Algebra (matrices)!
      Once calculated, how do we then test the null hypothesis? ANOVA
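      A small sketch of the matrix calculation on hypothetical simulated predictors (the coefficients 3, 2, -1 and the variable names x1, x2, yy are made up for illustration):
        set.seed(1)
        x1 = runif(10, 0, 20);  x2 = runif(10, 0, 20)
        yy = 3 + 2*x1 - x2 + rnorm(10)
        X  = cbind(1, x1, x2)                        # design matrix with an intercept column
        beta.hat = solve(t(X) %*% X, t(X) %*% yy)    # normal equations: (X'X) b = X'y
        cbind(beta.hat, coef(lm(yy ~ x1 + x2)))      # lm() (via a QR decomposition) agrees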

  29. ANOVA Table

  30. ANOVA Table

  31. ANOVA Table. In our example, k = 2 regression coefficients and n = 10 data points.
      Source       df         SS   MS             F-ratio       p-value
      Regression   k – 1 = 1   ?   SSReg/(k – 1)  MSReg/MSErr   ?
      Error        n – k = 8   ?   SSErr/(n – k)
      Total        n – 1 = 9   ?

  32. Parameter Estimation via SAMPLE DATA … STATISTICS: means x̄, ȳ; variances sx², sy².
      [Scatterplot; source: JAMA. 2003;290:1486-1493]
      SSTot = Σ (yi – ȳ)² is a measure of the total amount of variability in the observed responses (i.e., before any model-fitting).

  33. Parameter Estimation via SAMPLE DATA … STATISTICS: means x̄, ȳ; variances sx², sy².
      [Scatterplot; source: JAMA. 2003;290:1486-1493]
      SSReg = Σ (ŷi – ȳ)² is a measure of the total amount of variability in the fitted responses (i.e., after model-fitting).

  34. Parameter Estimation via SAMPLE DATA … STATISTICS: means x̄, ȳ; variances sx², sy².
      [Scatterplot; source: JAMA. 2003;290:1486-1493]
      SSErr = Σ (yi – ŷi)² is a measure of the total amount of variability in the resulting residuals (i.e., after model-fitting).

  35. ANOVA Table. In our example, k = 2 regression coefficients and n = 10 data points; the SS, MS, and F entries (? ? ?) remain to be filled in.

  36. ANOVA Table. In our example, k = 2 regression coefficients and n = 10 data points.

  37. SIMPLE LINEAR REGRESSION via the METHOD OF LEAST SQUARES
      For our example (x and y as above, cor(x, y) = -0.7200451):
        SSReg = Σ (ŷi – ȳ)²  = 204.200
        SSErr = Σ (yi – ŷi)² = 189.656
        SSTot = Σ (yi – ȳ)²  = 9 (43.76178) = 393.856   [ = (n – 1) sy² ]

  38. SIMPLE LINEAR REGRESSION via the METHOD OF LEAST SQUARES
      SSReg = 204.200, SSErr = 189.656 (the minimum over all candidate lines), SSTot = 393.856, and
      SSTot = SSReg + SSErr.
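      Verifying the decomposition in R (a sketch, x and y as above):
        lsreg = lm(y ~ x)
        SSTot = sum((y - mean(y))^2)               # 393.856 = 9 * var(y)
        SSReg = sum((fitted(lsreg) - mean(y))^2)   # 204.200
        SSErr = sum(resid(lsreg)^2)                # 189.656
        all.equal(SSTot, SSReg + SSErr)            # TRUE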

  39. ANOVA Table. In our example, k = 2 regression coefficients and n = 10 data points.

  40. ANOVA Table. In our example, k = 2 regression coefficients and n = 10 data points. The resulting F-test p-value (.01886) is the same as before!

  41. > summary(aov(lsreg))
                  Df Sum Sq Mean Sq F value  Pr(>F)
      x            1 204.20 204.201  8.6135 0.01886 *
      Residuals    8 189.66  23.707

  42. Coefficient of Determination
      r² = SSReg / SSTot = 204.200 / 393.856 = 0.5185
      The least squares regression line accounts for 51.85% of the total variability in the observed response, with 48.15% remaining.
      Moreover, r² is the square of the linear correlation coefficient r.

  43. Coefficient of Determination
      > cor(x, y)
        -0.7200451
      and (-0.7200451)² = 0.5185, confirming that r² equals the square of the correlation coefficient.
      The least squares regression line accounts for 51.85% of the total variability in the observed response, with 48.15% remaining.
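      The same quantity three ways in R (x and y as above, SSReg and SSTot from the sketch on slide 38):
        SSReg / SSTot                  # 0.5185
        cor(x, y)^2                    # (-0.7200451)^2 = 0.5185
        summary(lm(y ~ x))$r.squared   # reported by lm() as "Multiple R-squared"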

  44. (Repeat of the plot(x, y), abline(lsreg), and summary(lsreg) output from slide 26, now highlighting Multiple R-squared: 0.5185.)
      Coefficient of Determination: the least squares regression line accounts for 51.85% of the total variability in the observed response, with 48.15% remaining.

  45. Summary of Linear Correlation and Simple Linear Regression
      Given: sample means x̄, ȳ; variances sx², sy²; covariance sxy of X and Y.
      [Scatterplot; source: JAMA. 2003;290:1486-1493]
      • Linear Correlation Coefficient: r = sxy / (sx sy), with –1 ≤ r ≤ +1; measures the strength of linear association.
      • Least Squares Regression Line: ŷ = β̂0 + β̂1 x, with β̂1 = sxy / sx² and β̂0 = ȳ – β̂1 x̄; minimizes SSErr = Σ (yi – ŷi)² = SSTot – SSReg (ANOVA).

  46. Summary of Linear Correlation and Simple Linear Regression (continued)
      • Linear Correlation Coefficient: r = sxy / (sx sy), with –1 ≤ r ≤ +1; measures the strength of linear association.
      • Least Squares Regression Line: minimizes SSErr = Σ (yi – ŷi)² = SSTot – SSReg (ANOVA).
      • Coefficient of Determination: r² = SSReg / SSTot, the proportion of total variability modeled by the regression line’s variability.
      All point estimates can be upgraded to CIs for hypothesis testing, etc.
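      In R, interval estimates come from confint() and predict(); the value x = 10 below is an arbitrary illustration point, not from the slides:
        lsreg = lm(y ~ x)
        confint(lsreg, level = 0.95)                                            # CIs for beta0 and beta1
        predict(lsreg, newdata = data.frame(x = 10), interval = "confidence")   # CI for the mean response at x = 10
        predict(lsreg, newdata = data.frame(x = 10), interval = "prediction")   # wider interval for a single new observation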

  47. Testing for linear association between a population response variable Y and multiple predictor variables X1, X2, X3, … etc.: Multilinear Regression
      “Response = Model + Error”:  Y = β0 + β1 X1 + β2 X2 + β3 X3 + … + ε   (“main effects”)
      R code example: lsreg = lm(y ~ x1 + x2 + x3)
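      A self-contained sketch with simulated data (the predictors, the response yy, and the coefficients 5, 2, -1, 0.5 are hypothetical):
        set.seed(2)
        x1 = runif(30); x2 = runif(30); x3 = runif(30)
        yy = 5 + 2*x1 - x2 + 0.5*x3 + rnorm(30, sd = 0.3)
        lsreg = lm(yy ~ x1 + x2 + x3)   # additive model: one coefficient per main effect
        summary(lsreg)                  # t-test for each coefficient, overall F-test via ANOVA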

  48. Testing for linear association between a population response variable Y and multiple predictor variables X1, X2, X3, … etc.: Multilinear Regression
      “Response = Model + Error”: “main effects”; quadratic terms, etc. (“polynomial regression”)
      R code example (main effects): lsreg = lm(y ~ x1 + x2 + x3)
      R code example (polynomial): lsreg = lm(y ~ x + I(x^2) + I(x^3))
      (In an R formula, x^2 denotes formula crossing rather than squaring, so the power terms must be wrapped in I().)
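      A short check of why the wrapper matters, using the x and y sample from earlier; poly() gives an equivalent cubic fit built from orthogonal polynomial columns:
        length(coef(lm(y ~ x + x^2 + x^3)))         # 2: the unprotected formula collapses to a straight line
        length(coef(lm(y ~ x + I(x^2) + I(x^3))))   # 4: intercept plus linear, quadratic, cubic terms
        all.equal(fitted(lm(y ~ x + I(x^2) + I(x^3))),
                  fitted(lm(y ~ poly(x, 3))))       # the two parameterizations give the same fitted curve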

  49. Testing for linear association between a population response variable Y and multiple predictor variables X1, X2, X3, … etc.: Multilinear Regression
      “Response = Model + Error”: “main effects”; quadratic terms, etc. (“polynomial regression”); “interactions”
      R code example (interaction term): lsreg = lm(y ~ x1 + x2 + x1:x2)
      R code example (equivalent shorthand): lsreg = lm(y ~ x1 * x2)
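      The two interaction formulas fit the same model, since x1*x2 expands to x1 + x2 + x1:x2; a sketch reusing the simulated x1, x2, yy from the multilinear example above:
        m1 = lm(yy ~ x1 + x2 + x1:x2)   # main effects plus the product (interaction) term
        m2 = lm(yy ~ x1 * x2)           # shorthand for the same formula
        all.equal(coef(m1), coef(m2))   # TRUE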
