
Lecture 13: Multiple linear regression



  1. Lecture 13: Multiple linear regression
  • When and why we use it
  • The general multiple regression model
  • Hypothesis testing in multiple regression
  • The problem of multicollinearity
  • Multiple regression procedures
  • Polynomial regression
  • Power analysis in multiple regression
  Bio 4118 Applied Biostatistics

  2. Some GLM procedures
  [Table: overview of GLM procedures and their variable types; footnote: *either categorical or treated as a categorical variable]

  3. When do we use multiple regression?
  • to examine the relationship between a continuous dependent (Y) variable and several continuous independent (X1, X2, …) variables
  • e.g. the relationship between lake primary production, phosphorus concentration and zooplankton abundance
  [Figure: log production vs. log [P], and log production vs. log [P] and log [Zoo]]

  4. The multiple regression model: general form
  • The general model is:

  Yi = a + b1Xi1 + b2Xi2 + … + bkXik + ei

  which defines a k-dimensional plane, where a = intercept, bj = partial regression coefficient of Y on Xj, Xij is the value of the ith observation of independent variable Xj, and ei is the residual of the ith observation.
  [Figure: regression plane of Y on X1 and X2, showing a fitted value Ŷ and the residual e for one observation]
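The general model can be fit numerically by ordinary least squares. A minimal sketch, assuming NumPy is available; the data and the true coefficient values (a = 1.0, b1 = 2.0, b2 = −0.5) are simulated for illustration and are not from the lecture:

```python
import numpy as np

# Sketch: fit Yi = a + b1*Xi1 + b2*Xi2 + ei by ordinary least squares.
# Simulated data; true a = 1.0, b1 = 2.0, b2 = -0.5 (illustrative values).
rng = np.random.default_rng(0)
n = 100
X1 = rng.normal(size=n)
X2 = rng.normal(size=n)
Y = 1.0 + 2.0 * X1 - 0.5 * X2 + rng.normal(scale=0.1, size=n)

# Design matrix: the column of ones carries the intercept a
X = np.column_stack([np.ones(n), X1, X2])
coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
a, b1, b2 = coef   # estimates close to the true values above
```

The estimated plane has one partial slope per independent variable, exactly as in the general form above.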

  5. What is the partial regression coefficient anyway?
  • bj is the rate of change in Y per unit change in Xj with all other variables held constant; this is not the slope of the regression of Y on Xj pooled over all other variables!
  [Figure: Y vs. X1 at fixed values of X2 (−3, −1, 1, 3); the simple (pooled) regression slope differs from the partial regression slopes]

  6. The effect of scale
  • Two independent variables on different scales will have different slopes, even if the proportional change in Y is the same.
  • So, if we want to measure the relative strength of the influence of each variable on Y, we must eliminate the effect of different scales.
  [Figure: Y vs. Xj on two scales; the same relationship gives bj = 2 on one scale and bj = .02 on a scale 100 times larger]

  7. The multiple regression model: standardized form
  • Since bj depends on the scale of Xj, to examine the relative effect of each independent variable we must standardize the regression coefficients: transform all variables to z-scores, then fit the regression model to the transformed variables.
  • The standardized coefficients bj* estimate the relative strength of the influence of variable Xj on Y.
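Standardization can be sketched as follows (NumPy, simulated data; the variables and scales are illustrative, not from the lecture). X1 is on a scale 100 times larger than X2, so its raw slope is 100 times smaller, but after z-scoring both variables carry roughly equal standardized coefficients:

```python
import numpy as np

# Two predictors with the same proportional effect on Y but very different
# scales: the raw slopes differ 100-fold, the standardized slopes b* do not.
rng = np.random.default_rng(1)
n = 200
X1 = rng.normal(scale=100.0, size=n)   # large-scale variable
X2 = rng.normal(scale=1.0, size=n)     # small-scale variable
Y = 0.01 * X1 + 1.0 * X2 + rng.normal(scale=0.2, size=n)

def zscore(v):
    return (v - v.mean()) / v.std(ddof=1)

# Fit the model to z-scored variables; the slopes are the standardized
# partial regression coefficients b1*, b2*
Z = np.column_stack([np.ones(n), zscore(X1), zscore(X2)])
bstar = np.linalg.lstsq(Z, zscore(Y), rcond=None)[0]
# bstar[1] and bstar[2] come out nearly equal (about 0.70 each here)
```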

  8. Regression coefficients: summary
  • Partial regression coefficient: the slope of the regression of Y on Xj when all other independent variables are held constant.
  • Standardized partial regression coefficient: the rate of change of Y in standard deviation units per one standard deviation of Xj, with all other independent variables held constant.

  9. Assumptions
  • independence of residuals
  • homoscedasticity of residuals
  • linearity (Y on all X)
  • no error on independent variables
  • normality of residuals

  10. Hypothesis testing in simple linear regression: partitioning the total sums of squares

  Total SS = Model (Explained) SS + Unexplained (Error) SS

  11. Hypothesis testing in multiple regression I: partitioning the total sums of squares
  • Partition the total sums of squares into model and residual SS:

  SStotal = SSmodel + SSresidual

  [Figure: plane of Y on X1 and X2, with total, model and residual SS indicated]

  12. Hypothesis testing I: partitioning the total sums of squares
  • If observed = expected for all i, then MSerror = 0 and MSmodel = s²Y: the model explains all of the variance in Y.
  • Calculate F = MSmodel/MSerror and compare with the F distribution with 1 and N − 2 df.
  • Under H0 (no effect of X), the expected value of F is approximately 1.
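The partition and the F-ratio can be verified numerically. A sketch with NumPy and simulated data (the slope and sample size are illustrative):

```python
import numpy as np

# Partition the total SS of a simple linear regression and form the F-ratio.
rng = np.random.default_rng(2)
n = 30
x = rng.normal(size=n)
y = 1.5 * x + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])
b = np.linalg.lstsq(X, y, rcond=None)[0]
yhat = X @ b

ss_total = np.sum((y - y.mean()) ** 2)
ss_model = np.sum((yhat - y.mean()) ** 2)
ss_error = np.sum((y - yhat) ** 2)   # ss_total = ss_model + ss_error

ms_model = ss_model / 1              # model df = 1 (one slope)
ms_error = ss_error / (n - 2)        # error df = N - 2
F = ms_model / ms_error              # compare with F(1, N - 2)
```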

  13. Hypothesis testing II: testing individual partial regression coefficients
  • Test each partial regression coefficient by a t-test:

  t = bj / SE(bj)

  • Note: these are 2-tailed hypotheses!
  [Figure: H01: b1 = 0 rejected (Y changes with X1 at fixed X2); H02: b2 = 0 accepted (Y unchanged between X2 = 1 and X2 = 2 at fixed X1)]
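The t-tests can be computed by hand from the least-squares fit, with standard errors taken from MSerror·(X′X)⁻¹. A sketch with simulated data (illustrative; here X2 has no true effect on y while X1 does):

```python
import numpy as np

# t = bj / SE(bj) for each partial regression coefficient.
rng = np.random.default_rng(3)
n = 28
X1 = rng.normal(size=n)
X2 = rng.normal(size=n)
y = 0.8 * X1 + rng.normal(size=n)       # X2 has no true effect

X = np.column_stack([np.ones(n), X1, X2])
k = 2                                   # number of independent variables
b = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ b
ms_error = resid @ resid / (n - k - 1)  # error df = N - k - 1
se = np.sqrt(np.diag(ms_error * np.linalg.inv(X.T @ X)))
t = b / se   # compare each with the t distribution, 2-tailed, N - k - 1 df
```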

  14. Multicollinearity
  • Independent variables are correlated, and therefore not independent: evaluate by inspecting the covariance or correlation matrix.
  [Figure: X1 and X2 collinear, X3 independent; covariance matrix with variances on the diagonal and covariances off it]

  15. Multicollinearity: problems
  • If two independent variables X1 and X2 are uncorrelated, the model sums of squares for a linear model with both included equals the sum of the SSmodel for each considered separately.
  • But if they are correlated, the former will be less than the latter.
  • So the real question is: given a model with X1 included, how much does SSmodel increase when X2 is also included (or vice versa)?

  16. Multicollinearity: consequences
  • inflated standard errors for regression coefficients
  • sensitivity of parameter estimates to small changes in the data
  • But estimates of the partial regression coefficients remain unbiased.
  • One or more independent variables may not appear in the final regression model not because they do not covary with Y, but because they covary with another X.

  17. Detecting multicollinearity
  • high R² but few or no significant t-tests for individual independent variables
  • high pairwise correlations between X’s
  • high partial correlations among regressors (some independent variables are close to a linear combination of others)
  • eigenvalues, condition index, tolerance and variance inflation factors
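Tolerance and the variance inflation factor (VIF) can be computed directly from the definition: the tolerance of Xj is 1 minus the R² of Xj regressed on the other X’s, and the VIF is its reciprocal. A sketch with simulated data (illustrative; X2 is nearly a copy of X1, X3 is independent):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
X1 = rng.normal(size=n)
X2 = X1 + rng.normal(scale=0.3, size=n)   # strongly collinear with X1
X3 = rng.normal(size=n)                   # independent
X = np.column_stack([X1, X2, X3])

def tolerance(X, j):
    """1 - R^2 of column j regressed on the other columns plus an intercept."""
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(X)), others])
    b = np.linalg.lstsq(A, X[:, j], rcond=None)[0]
    resid = X[:, j] - A @ b
    r2 = 1 - resid @ resid / np.sum((X[:, j] - X[:, j].mean()) ** 2)
    return 1 - r2

tol = [tolerance(X, j) for j in range(3)]
vif = [1 / t for t in tol]   # large VIF (small tolerance) flags collinearity
```

Here the collinear pair gets a large VIF while the independent X3 stays near 1.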

  18. Quantifying the effect of multicollinearity
  • Eigenvectors: a set of orthogonal axes E1, E2, …, Ek in the k-dimensional space of the independent variables.
  • Eigenvalue: the magnitude (length) λ of the corresponding eigenvector.
  [Figure: eigenvectors E1, E2 with eigenvalues λ1, λ2 for uncorrelated vs. correlated X1, X2]

  19. Quantifying the effect of multicollinearity
  • Eigenvalues: if all k eigenvalues are approximately equal, multicollinearity is low.
  • Condition index: sqrt(λlargest/λsmallest); a value near 1 indicates low multicollinearity.
  • Tolerance: 1 − the proportion of variance in each independent variable accounted for by all other independent variables; a value near 1 indicates low multicollinearity.
  [Figure: low correlation, λ1 = λ2; high correlation, λ1 >> λ2]
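The eigenvalue diagnostics can be sketched from the correlation matrix of the X’s (simulated data, illustrative; with two X’s the eigenvalues are 1 ± r, so high correlation makes one eigenvalue much larger than the other):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200
X1 = rng.normal(size=n)
X2 = X1 + rng.normal(scale=0.2, size=n)   # highly collinear pair

R = np.corrcoef(np.column_stack([X1, X2]), rowvar=False)
eig = np.linalg.eigvalsh(R)               # eigenvalues, ascending order
cond_index = np.sqrt(eig.max() / eig.min())
# near 1 would indicate low multicollinearity; here it is much larger
```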

  20. Remedial measures
  • Get more data to reduce correlations.
  • Drop some variables.
  • Use principal component or ridge regression; these yield biased estimates, but with smaller standard errors.

  21. Multiple regression: the general idea
  • Evaluate the significance of a variable by fitting two models: one with the term in (Model A), the other with it removed (Model B).
  • Test for the change in model fit (ΔMF, e.g. ΔR²) associated with removal of the term in question: if Δ is large, retain the variable; if Δ is small, delete it.
  • Unfortunately, ΔMF may depend on what other variables are in the model if there is multicollinearity!

  22. Fitting multiple regression models
  • Goal: find the “best” model, given the available data.
  • Problem 1: what is “best”?
  • highest R²?
  • lowest RMS?
  • highest R² containing only individually significant independent variables?
  • maximal R² with the minimum number of independent variables?

  23. Selection of independent variables (cont’d)
  • Problem 2: even if “best” is defined, by what method do we find it?
  • Possibilities:
  • compute all possible models (2^k − 1 of them) and choose the best one.
  • use some procedure for winnowing down the set of possible models.

  24. Strategy I: computing all possible models
  • Compute all 2^k − 1 possible models ({X1}, {X2}, {X3}, {X1, X2}, {X1, X3}, {X2, X3}, {X1, X2, X3} for k = 3) and choose the “best” one.
  • cons: time-consuming; leaves the definition of “best” to the researcher
  • pros: if the “best” model is defined, you will find it!

  25. Strategy II: forward selection
  • Start with the variable that has the highest (significant) R², i.e. the highest partial correlation coefficient r (here r2 > r1 > r3, so X2 enters first).
  • Add the others one at a time until there is no further significant increase in R², with the bjs recomputed at each step (e.g. add X1 if R²12 > R²2, otherwise the final model is {X2}; then add X3 only if R²123 > R²12).
  • Problem: once Xj is included, it stays in, even if it contributes little to SSmodel once other variables are included.

  26. Forward selection: order of entry (p to enter = .05; r2 > r1 > r3 > r4)
  • Begin with the variable with the highest partial correlation coefficient.
  • The next entry is the variable giving the largest increase in overall R², judged by an F-test of the significance of that increase against a specified F-to-enter (i.e. below a specified p to enter).
  • e.g. p[F(X2)] = .001, so X2 enters first; then p[F(X2, X1)] = .002, p[F(X2, X3)] = .04, p[F(X2, X4)] = .55, so X1 enters next and X4 is eliminated…

  27. Strategy III: backward selection
  • Start with all variables.
  • Drop variables whose removal does not significantly reduce R², one at a time, starting with the one with the lowest partial correlation coefficient (here r2 < r1 < r3, so X2 is dropped first if R²13 = R²123, then X1 if R²3 = R²13; stop dropping, e.g. with final model {X1, X3}, when R²3 < R²13).
  • But once Xj is dropped, it stays out, even if it would explain a significant amount of the remaining variability once other variables are excluded.

  28. Backward selection: order of removal (p to remove = .10; r2 > r1 > r3 > r4)
  • Begin by removing the variable with the smallest partial correlation coefficient.
  • The next removal is the variable whose loss gives the smallest reduction in overall R², judged by an F-test of the significance of that reduction against a specified F-to-remove (i.e. above a specified p to remove).
  • e.g. p[F(X2, X1, X3)] = .44, so X4 is removed and X1, X2, X3 stay in; then p[F(X2, X1)] = .25, p[F(X1, X3)] = .009, p[F(X2, X3)] = .001, so X3 is removed and X1, X2 stay in…

  29. Strategy IV: stepwise selection (p to enter = .10, p to remove = .05; r2 > r1 > r4 > r3)
  • Once a variable is included (removed), the set of remaining variables is scanned for other variables that should now be deleted (included), including those added (removed) at earlier stages.
  • To avoid infinite loops, we usually set p to enter > p to remove.
  • e.g. p[F(X2)] = .001, p[F(X2, X1)] = .002, p[F(X2, X4)] = .03, p[F(X2, X3)] = .09; then p[F(X1, X2, X4)] = .02, p[F(X1, X2, X3)] = .19
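The selection logic can be sketched in a few lines. A bare-bones forward-selection example (NumPy, simulated data); for simplicity it uses a fixed F-to-enter threshold rather than a p-to-enter, and the threshold of 4.0 is an assumption (roughly p = .05 for moderate n), not a value from the lecture:

```python
import numpy as np

def ss_error(cols, y):
    """Residual SS of y regressed on the given columns plus an intercept."""
    A = np.column_stack([np.ones(len(y))] + cols)
    b = np.linalg.lstsq(A, y, rcond=None)[0]
    r = y - A @ b
    return r @ r

def forward_select(X, y, f_to_enter=4.0):
    n, k = X.shape
    included, remaining = [], list(range(k))
    while remaining:
        sse_now = ss_error([X[:, j] for j in included], y)
        best, best_F = None, 0.0
        for j in remaining:                      # try each candidate
            sse_new = ss_error([X[:, i] for i in included + [j]], y)
            df_err = n - len(included) - 2       # n - (slopes incl. new) - 1
            F = (sse_now - sse_new) / (sse_new / df_err)
            if F > best_F:
                best, best_F = j, F
        if best is None or best_F < f_to_enter:  # no significant increase
            break
        included.append(best)
        remaining.remove(best)
    return included

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=100)  # X3 is pure noise
selected = forward_select(X, y)   # picks up the two informative variables
```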

  30. Example
  • log of herptile species richness (logherp) as a function of log wetland area (logarea), percentage of land within 1 km covered in forest (cpfor2), and density of hard-surface roads within 1 km (thtden)

  31. Example (all variables)
  DEP VAR: LOGHERP   N: 28   MULTIPLE R: 0.740   SQUARED MULTIPLE R: 0.547
  ADJUSTED SQUARED MULTIPLE R: .490   STANDARD ERROR OF ESTIMATE: 0.162

  VARIABLE   COEFF.   SE      STD COEF.   TOL.    T        P
  CONSTANT   0.285    0.191   0.000       .       1.488    0.150
  LOGAREA    0.228    0.058   0.551       0.978   3.964    0.001
  CPFOR2     0.001    0.001   0.123       0.744   0.774    0.447
  THTDEN    -0.036    0.016  -0.365       0.732  -2.276    0.032

  32. Example (cont’d)
  ANALYSIS OF VARIANCE
  SOURCE       SS      DF   MS      F-RATIO   P
  REGRESSION   0.760   3    0.253   9.662     0.000
  RESIDUAL     0.629   24   0.026

  33. Example: forward stepwise
  DEPENDENT VARIABLE: LOGHERP
  MINIMUM TOLERANCE FOR ENTRY INTO MODEL = .010000
  FORWARD STEPWISE WITH ALPHA-TO-ENTER = .10 AND ALPHA-TO-REMOVE = .05

  STEP # 0   R = .000   RSQUARE = .000
  VARIABLE      COEFF.   SE   STD COEF.   TOL.     F        'P'
  IN:
   1 CONSTANT
  OUT (PART. CORR.):
   2 LOGAREA    0.596    .    .           .1E+01   14.321   0.001
   3 CPFOR2     0.305    .    .           .1E+01    2.662   0.115
   4 THTDEN    -0.496    .    .           .1E+01    8.502   0.007

  34. Forward stepwise (cont’d)
  STEP # 1   R = .596   RSQUARE = .355   TERM ENTERED: LOGAREA
  VARIABLE      COEFF.   SE      STD COEF.   TOL.     F        'P'
  IN:
   1 CONSTANT
   2 LOGAREA    0.247    0.065   0.596       .1E+01   14.321   0.001
  OUT (PART. CORR.):
   3 CPFOR2     0.382    .       .           0.99      4.273   0.049
   4 THTDEN    -0.529    .       .           0.98      9.725   0.005

  35. Forward stepwise (cont’d)
  STEP # 2   R = .732   RSQUARE = .536   TERM ENTERED: THTDEN
  VARIABLE      COEFF.   SE      STD COEF.   TOL.   F        'P'
  IN:
   1 CONSTANT
   2 LOGAREA    0.225    0.057   0.542       0.98   15.581   0.001
   4 THTDEN    -0.042    0.013  -0.428       0.98    9.725   0.005
  OUT (PART. CORR.):
   3 CPFOR2     0.156    .       .           0.744   0.599   0.447

  36. Forward stepwise: final model
  FORWARD STEPWISE: P TO INCLUDE = .15
  DEP VAR: LOGHERP   N: 28   MULTIPLE R: 0.732   SQUARED MULTIPLE R: 0.536
  ADJUSTED SQUARED MULTIPLE R: .490   STANDARD ERROR OF ESTIMATE: 0.161

  VARIABLE   COEFF.   SE      STD COEF.   TOL.    T        P
  CONSTANT   0.376    0.149   0.000       .       2.521    0.018
  LOGAREA    0.225    0.057   0.542       0.984   3.947    0.001
  THTDEN    -0.042    0.013  -0.428       0.984  -3.118    0.005

  37. Example: backward stepwise (final model)
  BACKWARD STEPWISE: P TO REMOVE = .15
  DEP VAR: LOGHERP   N: 28   MULTIPLE R: 0.732   SQUARED MULTIPLE R: 0.536
  ADJUSTED SQUARED MULTIPLE R: .499   STANDARD ERROR OF ESTIMATE: 0.161

  VARIABLE   COEFF.   SE      STD COEF.   TOL.    T        P
  CONSTANT   0.376    0.149   0.000       .       2.521    0.018
  LOGAREA    0.225    0.057   0.542       0.984   3.947    0.001
  THTDEN    -0.042    0.013  -0.428       0.984  -3.118    0.005

  38. Example: subset model
  DEP VAR: LOGHERP   N: 28   MULTIPLE R: 0.670   SQUARED MULTIPLE R: 0.449
  ADJUSTED SQUARED MULTIPLE R: .405   STANDARD ERROR OF ESTIMATE: 0.175

  VARIABLE   COEFF.   SE      STD COEF.   TOL.    T        P
  CONSTANT   0.027    0.167   0.000       .       0.162    0.872
  LOGAREA    0.248    0.062   0.597       1.000   4.022    0.000
  CPFOR2     0.003    0.001   0.307       1.000   2.067    0.049

  39. What if the relationship between Y and one or more X’s is nonlinear?
  • Option 1: transform the data.
  • Option 2: use nonlinear regression.
  • Option 3: use polynomial regression.

  40. The polynomial regression model
  • In polynomial regression, the regression model includes terms of increasingly higher powers of the independent variable.
  [Figure: black fly biomass (mg DM/m²) vs. current velocity (cm/s), with linear and 2nd-order polynomial fits]

  41. The polynomial regression model: procedure
  • Fit a simple linear model.
  • Fit a model with a quadratic term added; test for the increase in SSmodel.
  • Continue with higher orders (cubic, quartic, etc.) until there is no further significant increase in SSmodel.
  • Include terms of order up to (the number of points of inflexion plus 1).
  [Figure: black fly biomass (mg DM/m²) vs. current velocity (cm/s), with linear and 2nd-order polynomial fits]
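The procedure can be sketched numerically (NumPy; the data are simulated from a quadratic for illustration, they are not the black fly data): fit successively higher orders and test the increase in fit at each step.

```python
import numpy as np

# Fit polynomials of increasing order and test each added term.
rng = np.random.default_rng(7)
x = np.linspace(10, 110, 60)                 # e.g. current velocity (cm/s)
y = 5 + 0.8 * x - 0.006 * x**2 + rng.normal(scale=2.0, size=60)

def sse(order):
    """Residual SS of a polynomial fit of the given order."""
    coef = np.polyfit(x, y, order)
    return np.sum((y - np.polyval(coef, x)) ** 2)

n = len(x)
sse1, sse2, sse3 = sse(1), sse(2), sse(3)
# F for the quadratic term, given the linear term already in the model
F_quad = (sse1 - sse2) / (sse2 / (n - 3))
# F for the cubic term, given linear + quadratic
F_cubic = (sse2 - sse3) / (sse3 / (n - 4))
# F_quad is large (keep the quadratic); F_cubic is small (stop at order 2)
```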

  42. Polynomial regression: caveats
  • The biological significance of the higher-order terms in a polynomial regression (if any) is generally not known.
  • By definition, polynomial terms are strongly correlated; hence standard errors will be large (precision is low), and they increase with the order of the term.
  • Extrapolation of polynomial models is always nonsense.
  [Figure: Y = a + b1X1 − b2X1², showing the fitted parabola turning down outside the range of the data]

  43. Power analysis in GLM (including multiple regression)
  • In any GLM, hypotheses are tested by means of an F-test.
  • Remember: the appropriate SSerror and dferror depend on the type of analysis and the hypothesis under investigation.
  • Knowing F, we can compute R², the proportion of the total variance in Y explained by the factor (source) under consideration.

  44. Partial and total R²
  • The total R² (R²Y•B) is the proportion of variance in Y accounted for (explained) by a set of independent variables B.
  • The partial R² (R²Y•A,B − R²Y•A) is the proportion of variance in Y accounted for by B when the variance accounted for by another set A is removed.
  [Figure: Venn diagram of Y, A and B: variance accounted for by both A and B (R²Y•A,B), by B independent of A (R²Y•A,B − R²Y•A, the partial R²), and by A only (R²Y•A, the total R² for A)]

  45. Partial and total R²
  • The total R² for set B (R²Y•B) equals the partial R² with respect to B (R²Y•A,B − R²Y•A) iff (1) the total R² for A (R²Y•A) is zero, or (2) A and B are independent (in which case R²Y•A,B = R²Y•A + R²Y•B).
  [Figure: Venn diagram of Y, A and B: variance accounted for by B (R²Y•B, total R²) vs. variance independent of A (R²Y•A,B − R²Y•A, partial R²)]

  46. Partial and total R² in multiple regression
  • Suppose we have three independent variables X1, X2 and X3.
  [Figure: log production vs. log [P] and log [Zoo]]

  47. Defining effect size in multiple regression
  • The effect size, denoted f², is the ratio of the factor (source) variance proportion R²factor to the appropriate error variance proportion R²error:

  f² = R²factor / R²error

  • Note: both R²factor and R²error depend on the null hypothesis under investigation.

  48. Defining effect size in multiple regression: case 1
  • Case 1: a set B of variables {X1, X2, …} is related to Y, and the total R² (R²Y•B) is determined. The error variance proportion is then 1 − R²Y•B.
  • H0: R²Y•B = 0
  • Example: effect of wetland area, surrounding forest cover, and surrounding road density on herptile species richness in southeastern Ontario wetlands; B = {LOGAREA, CPFOR2, THTDEN}

  49. DEP VAR: LOGHERP   N: 28   MULTIPLE R: 0.740   SQUARED MULTIPLE R: 0.547
  ADJUSTED SQUARED MULTIPLE R: .490   STANDARD ERROR OF ESTIMATE: 0.162

  VARIABLE   COEFF.   SE      STD COEF.   TOL.    T        P
  CONSTANT   0.285    0.191   0.000       .       1.488    0.150
  LOGAREA    0.228    0.058   0.551       0.978   3.964    0.001
  CPFOR2     0.001    0.001   0.123       0.744   0.774    0.447
  THTDEN    -0.036    0.016  -0.365       0.732  -2.276    0.032

  50. Defining effect size in multiple regression: case 2
  • Case 2: the proportion of variance of Y due to B over and above that due to A is determined (R²Y•A,B − R²Y•A). The error variance proportion is then 1 − R²Y•A,B.
  • H0: R²Y•A,B − R²Y•A = 0
  • Example: herptile richness in southeastern Ontario wetlands; B = {THTDEN}, A = {LOGAREA, CPFOR2}, AB = {LOGAREA, CPFOR2, THTDEN}
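Both effect-size cases can be computed directly from the R² values reported in the worked herptile example (a short sketch; only the two R² values come from the lecture output, the arithmetic is standard):

```python
# Effect size f^2 = R^2_factor / R^2_error for the herptile example.

r2_full = 0.547   # R^2 for AB = {LOGAREA, CPFOR2, THTDEN} (all variables)
r2_A = 0.449      # R^2 for A = {LOGAREA, CPFOR2} (subset model)

# Case 1: H0: R2(Y|B) = 0; error variance proportion = 1 - R2(Y|B)
f2_case1 = r2_full / (1 - r2_full)          # about 1.21

# Case 2: variance due to B = {THTDEN} over and above A;
# error variance proportion = 1 - R2(Y|A,B)
f2_case2 = (r2_full - r2_A) / (1 - r2_full) # about 0.22
```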
