Simple linear regression

margo

Presentation Transcript


  1. Simple linear regression • What regression analysis does • The simple regression model • Hypothesis testing in regression • Residual analysis • Inverse prediction, replicated regression and weighted regression • Regression caveats • Power considerations in simple linear regression Bio 4118 Applied Biostatistics

  2. What regression does • Fits a straight line through a cloud of data. • Tests and quantifies the effect of an independent variable X on a dependent variable Y. • The intensity of the effect is given by the slope of the regression, $b = \Delta Y / \Delta X$. • The importance of the effect is given by the coefficient of determination ($r^2$). [Figure: regression line of Y on X, with the slope shown as $\Delta Y / \Delta X$]

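  3. Regression and correlation coefficients • The slope b is estimated as: $b = \sum (X_i - \bar{X})(Y_i - \bar{Y}) / \sum (X_i - \bar{X})^2$ • The correlation r is: $r = \sum (X_i - \bar{X})(Y_i - \bar{Y}) / \sqrt{\sum (X_i - \bar{X})^2 \sum (Y_i - \bar{Y})^2}$ • So, $b = r \, s_Y / s_X$: b = r if X and Y have the same variance, and b = 0 if and only if r = 0.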
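
The link between the slope and the correlation can be checked numerically. The sketch below uses the usual textbook sums of squares and cross-products; the function name `slope_and_correlation` and the five data points are hypothetical, chosen only for illustration.

```python
import math

def slope_and_correlation(x, y):
    """Return (b, r, s_x, s_y) for paired samples x and y."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    b = sxy / sxx                   # least-squares slope
    r = sxy / math.sqrt(sxx * syy)  # Pearson correlation
    s_x = math.sqrt(sxx / (n - 1))  # sample standard deviations
    s_y = math.sqrt(syy / (n - 1))
    return b, r, s_x, s_y

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
b, r, s_x, s_y = slope_and_correlation(x, y)
print(f"b = {b:.3f}, r = {r:.4f}, r * s_y / s_x = {r * s_y / s_x:.3f}")
```

The last two printed values agree, which is the relation b = r s_Y / s_X; when s_X = s_Y the slope and the correlation coincide.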

  4. How it does it • By the method of least squares, which minimizes the sum of squared deviations between the observations and the regression line, i.e. the sum of squared residuals. • Residual: $e_i = Y_i - \hat{Y}_i$ • Squared deviation of an observation: $e_i^2 = (Y_i - \hat{Y}_i)^2$ [Figure: Y against X, with the residual $e_i$ of one observation from the fitted line]

  5. Regression or correlation? • Correlation: degree of association between two variables X and Y; no causal relationship assumed! • Regression: to predict the value of the dependent variable if the independent variable were changed; causal relationship assumed!

  6. When do we use regression? • Don’t use it to determine the strength of association between two variables. • Do use it if you want to predict the value of Y given X. [Figure: two panels, Correlation (X2 against X1) and Regression (Y against X)]

  7. The simple regression model • The regression model is: $Y_i = a + b X_i + e_i$ • So, all simple regression models are described by 2 parameters, the intercept (a) and the slope ($b = \Delta Y / \Delta X$). [Figure: observed and expected values of Y against X, showing the intercept a, the slope $\Delta Y / \Delta X$ and the residual $e_i$ of the observation $(X_i, Y_i)$]
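
A minimal sketch of fitting the two parameters by least squares and recovering the expected values and residuals. The helper `fit_line` and the data are hypothetical; the estimates are the standard least-squares ones, with the intercept obtained from the fact that the fitted line passes through the point of means.

```python
def fit_line(x, y):
    """Return the least-squares intercept a and slope b."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    a = mean_y - b * mean_x  # the line passes through (mean_x, mean_y)
    return a, b

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
a, b = fit_line(x, y)
expected = [a + b * xi for xi in x]                    # expected values
residuals = [yi - ei for yi, ei in zip(y, expected)]   # observed - expected
print(f"a = {a:.3f}, b = {b:.3f}")
print("residuals:", [round(e, 3) for e in residuals])
```

With an intercept in the model, the residuals always sum to zero, which is a quick sanity check on any fit.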

  8. Assumptions • Residuals are independent and normally distributed. • The variance of the residuals is equal for all X (homoscedasticity). • The relationship between Y and X is linear. • There is no measurement error on X (Model I regression).

  9. Measurement error • The assumption of no error on X can be examined beforehand, and is almost invariably violated. • Only of concern when measurement error is large relative to the magnitude of X (say, > 10%). • If the assumption is invalid, then Model II regression is required.

  10. Residual analysis I: independence • Plot residuals against estimates; look for patterns. • Plot the autocorrelation function (ACF) of the residuals. [Figure: residuals against estimates]

  11. Residual analysis II: Normality • Plot residuals against estimates; look for patterns. • Do normal probability plot. • Check with Lilliefors test. [Figure: residuals against estimates, and normal probability plots (NEDs against residuals) for normal and non-normal residuals]

  12. Residual analysis III: Homoscedasticity • Plot residuals against estimates; look for patterns. • Check with Levene’s test by grouping Y’s into several classes. [Figure: residuals against estimates, with the estimates divided into Groups 1 to 3]

  13. Residual analysis IV: Linearity • Plot residuals against estimates; look for patterns. [Figure: Y against X, and residuals against estimates]

  14. Robustness of regression with respect to violation of assumptions

  15. What to do when assumptions aren’t met • Try transforming the data, but remember: (1) for some data, no transformation will work; (2) finding an appropriate transformation may not be easy. • Use non-linear regression.

  16. Transformations in regression [Figure: weight (kg) against length (mm) in the fish Scorpaenichthys marmoratus, on arithmetic axes and on log-log axes]

  17. Transformations in regression [Figure: chirp rate (chirps/min) as a function of temperature (°C) in males of the cricket Oecanthus fultoni, on an arithmetic scale and on a log scale]

  18. Transformations in regression [Figure: electrical resistance (millivolts) as a function of illumination (relative brightness, times) in cephalopod eyes, with brightness on an arithmetic scale and on a log scale]

  19. Hypothesis testing I: partitioning the total sums of squares • Total SS = Model (Explained) SS + Unexplained (Error) SS [Figure: deviations of Y illustrating each sum of squares]

  20. Hypothesis testing I: partitioning the total sums of squares • So, if observed = expected for every point, $MS_{error} = 0$ and the model SS equals the total SS. • Calculate $F = MS_{regression} / MS_{error}$ and compare with the F distribution with 1 and N - 2 df. • H0: b = 0
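
The partition and the F ratio can be sketched as follows. The function name `regression_anova` and the data are hypothetical; the resulting F would be compared with a tabulated critical value for 1 and N - 2 df, which is not computed here.

```python
def regression_anova(x, y):
    """Return (ss_total, ss_model, ss_error, f_ratio) for a simple regression."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    ss_model = sum((a + b * xi - mean_y) ** 2 for xi in x)            # explained
    ss_error = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))  # unexplained
    ms_model = ss_model / 1        # 1 df for the slope
    ms_error = ss_error / (n - 2)  # N - 2 df for the residuals
    return ss_total, ss_model, ss_error, ms_model / ms_error

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
ss_total, ss_model, ss_error, f_ratio = regression_anova(x, y)
print(f"total = {ss_total:.3f}, model = {ss_model:.3f}, error = {ss_error:.3f}")
print(f"F(1, {len(x) - 2}) = {f_ratio:.1f}")
```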

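  21. Standard error of the slope • The standard error $s_b$ and 100(1 - α)% CIs of the slope are: $s_b = \sqrt{MS_{error} / \sum (X_i - \bar{X})^2}$ and $b \pm t_{\alpha(2), N-2} \, s_b$ • So, for fixed N, we can decrease $s_b$ by expanding the range of X values sampled. [Figure: two panels of Y against X, one with larger $s_b$ and one with smaller $s_b$]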
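
The effect of the range of X on $s_b$ can be checked numerically. This sketch assumes the usual textbook form, in which $s_b$ is the square root of the residual mean square divided by the sum of squares of X. The residual mean square, the slope and the critical t value (taken from a t table for α(2) = 0.05 and 3 df) are hypothetical inputs.

```python
import math

def slope_standard_error(x, ms_error):
    """Standard error of the slope for X values x and residual mean square."""
    mean_x = sum(x) / len(x)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return math.sqrt(ms_error / sxx)

ms_error = 0.0357  # hypothetical residual mean square
t_crit = 3.182     # tabulated t for alpha(2) = 0.05, 3 df
b = 1.99           # hypothetical slope estimate

narrow = [1, 2, 3, 4, 5]   # N = 5, narrow range of X
wide = [2, 4, 6, 8, 10]    # N = 5, range of X doubled
se_narrow = slope_standard_error(narrow, ms_error)
se_wide = slope_standard_error(wide, ms_error)
ci = (b - t_crit * se_narrow, b + t_crit * se_narrow)
print(f"s_b (narrow) = {se_narrow:.4f}, s_b (wide) = {se_wide:.4f}")
print(f"95% CI for b (narrow design): {ci[0]:.3f} to {ci[1]:.3f}")
```

Doubling the spread of X at fixed N halves $s_b$ here, provided the residual mean square stays the same.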

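  22. Standard error of the intercept • The standard error $s_a$ of the intercept a is: $s_a = \sqrt{MS_{error} \left( 1/N + \bar{X}^2 / \sum (X_i - \bar{X})^2 \right)}$ • So, for fixed N, we can decrease $s_a$ by expanding the range of X values sampled. [Figure: two panels of Y against X, one with larger $s_a$ and one with smaller $s_a$]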

  23. Hypothesis testing II: testing model parameters • H01: a = 0 • H02: b = 0 • Test each hypothesis by a t-test: $t_a = a / s_a$ and $t_b = b / s_b$, each with N - 2 df. • Note: these are 2-tailed hypotheses! [Figure: observed and expected values of Y against X under each null hypothesis]
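
A sketch of both t-tests on one data set. It assumes the usual textbook standard errors of the intercept and slope. The function name, the data and the critical value (a tabulated t for α(2) = 0.05 and 3 df) are hypothetical.

```python
import math

def parameter_t_tests(x, y):
    """Return (a, s_a, t_a, b, s_b, t_b) for a least-squares line."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    ss_error = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ms_error = ss_error / (n - 2)
    s_b = math.sqrt(ms_error / sxx)
    s_a = math.sqrt(ms_error * (1 / n + mean_x ** 2 / sxx))
    return a, s_a, a / s_a, b, s_b, b / s_b

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
a, s_a, t_a, b, s_b, t_b = parameter_t_tests(x, y)
t_crit = 3.182  # tabulated two-tailed t for alpha = 0.05, N - 2 = 3 df
print(f"t_a = {t_a:.3f}: reject H01? {abs(t_a) > t_crit}")
print(f"t_b = {t_b:.3f}: reject H02? {abs(t_b) > t_crit}")
```

For these data the slope differs clearly from zero while the intercept does not, which illustrates that the two hypotheses are tested separately.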

  24. Hypothesis testing III: one-tailed hypotheses • Biological theory predicts that Y should increase with X. • So, H0: b ≤ 0 (one-tailed). • Calculate: $t_b = b / s_b$ • Reject if $t_b > 0$ and p (one-tailed) < α. [Figure: two panels of Y against X, one where H0 is accepted and one where H0 is rejected]

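  25. Confidence intervals in regression • 100(1 - α)% CI for estimated values: $\hat{Y} \pm t_{\alpha(2), N-2} \sqrt{MS_{error} \left( 1/N + (X - \bar{X})^2 / \sum (X_i - \bar{X})^2 \right)}$ • 100(1 - α)% CI for observations: $\hat{Y} \pm t_{\alpha(2), N-2} \sqrt{MS_{error} \left( 1 + 1/N + (X - \bar{X})^2 / \sum (X_i - \bar{X})^2 \right)}$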

  26. Confidence intervals in regression • The CI for observations is wider than the CI for estimated values. • CIs for both estimated values and observations widen with increasing distance between the X value and the mean of the sample. [Figure: Y against X with confidence bands for estimates and for observations]
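
Both statements can be checked with a short calculation of the interval half-widths. The sketch assumes the usual textbook forms, where the interval for an observation carries an extra "1 +" term for the scatter of a single new point. The residual mean square and critical t value are hypothetical inputs.

```python
import math

def ci_half_widths(x, ms_error, t_crit, x0):
    """Half-widths of the CI for the estimated value and for an observation at x0."""
    n = len(x)
    mean_x = sum(x) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    core = 1 / n + (x0 - mean_x) ** 2 / sxx
    for_estimate = t_crit * math.sqrt(ms_error * core)
    for_observation = t_crit * math.sqrt(ms_error * (1 + core))
    return for_estimate, for_observation

x = [1, 2, 3, 4, 5]  # hypothetical X values (mean = 3)
ms_error = 0.0357    # hypothetical residual mean square
t_crit = 3.182       # tabulated t for alpha(2) = 0.05, 3 df

est_mean, obs_mean = ci_half_widths(x, ms_error, t_crit, x0=3)
est_far, obs_far = ci_half_widths(x, ms_error, t_crit, x0=5)
print(f"at the mean of X: estimate ±{est_mean:.3f}, observation ±{obs_mean:.3f}")
print(f"at X = 5:         estimate ±{est_far:.3f}, observation ±{obs_far:.3f}")
```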

  27. Outliers • Points that appear to lie well off the fitted line. • Issue 1: are “apparent” outliers really outliers? • Issue 2: do they significantly affect the statistical conclusions? [Figure: Y against X with two possible outliers marked]

  28. Outlier analysis I: Studentized residuals • Plot Studentized residuals against estimated values. • “Large” residuals are those with absolute value > 3.0. • Such cases make large contributions to the residual mean square of the regression.

  29. Outlier analysis II: Leverage • Leverage measures the potential influence of a case on the regression line. • It is determined by the X value only, so that points far from the mean of X have higher leverage. • “Large” = anything greater than 4/N. [Figure: Y against X with a small-leverage point and a large-leverage point]
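
A sketch of the leverage calculation for simple regression. It assumes the standard form $h_i = 1/N + (X_i - \bar{X})^2 / \sum (X_j - \bar{X})^2$ and applies the 4/N rule of thumb from the slide. The X values are hypothetical.

```python
def leverages(x):
    """Leverage of each case in a simple regression; depends on X only."""
    n = len(x)
    mean_x = sum(x) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return [1 / n + (xi - mean_x) ** 2 / sxx for xi in x]

# Hypothetical X values with one point far from the rest.
x = [1, 2, 3, 4, 20]
h = leverages(x)
threshold = 4 / len(x)
flagged = [i for i, hi in enumerate(h) if hi > threshold]
print("leverages:", [round(hi, 3) for hi in h])
print(f"cases above 4/N = {threshold:.2f}:", flagged)
```

The leverages sum to 2, the number of fitted parameters, so one extreme X value necessarily takes leverage away from the other cases.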

  30. Outlier analysis III: Cook’s distance • Cook’s distance measures both leverage and contribution to the residual mean square, i.e. the actual influence of a point. • “Large” = anything greater than 1. [Figure: Y against X with a point of smaller Cook’s distance and a point of larger Cook’s distance]
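
A sketch of Cook's distance for simple regression, assuming the common form $D_i = \frac{e_i^2}{p \, MS_{error}} \cdot \frac{h_i}{(1 - h_i)^2}$ with p = 2 fitted parameters. The function name and the data are hypothetical.

```python
def cooks_distances(x, y):
    """Cook's distance of each case in a simple regression (p = 2 parameters)."""
    n, p = len(x), 2
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    ms_error = sum(e ** 2 for e in residuals) / (n - p)
    h = [1 / n + (xi - mean_x) ** 2 / sxx for xi in x]  # leverages
    return [e ** 2 / (p * ms_error) * hi / (1 - hi) ** 2
            for e, hi in zip(residuals, h)]

# Hypothetical data: the last point is far out in X and off the trend of the rest.
x = [1, 2, 3, 4, 20]
y = [1, 2, 3, 4, 5]
cooks = cooks_distances(x, y)
print("Cook's distances:", [round(d, 3) for d in cooks])
print("cases above 1:", [i for i, d in enumerate(cooks) if d > 1])
```

The last case has a small residual because it pulls the line towards itself, yet its Cook's distance is very large. This is why residuals alone can miss influential points.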

  31. Resolving outlier problems • Do they have a significant effect on regression results? • To determine, delete them, rerun the analyses and compare results. • Are the slope and intercept estimates significantly affected, i.e. do the new estimates fall outside the 95% CIs of the original estimates? [Figure: fitted lines with outliers in and outliers out, for a case with no significant effect and a case with a significant effect]

  32. The effects of outlier deletion • Reduces sample size (N), thereby reducing power. • Decreases $MS_{error}$, so $s_b$ decreases, and power increases. • If N is small, the former effect will probably outweigh the latter unless outliers are very aberrant. [Figure: power (1 - β) as a function of b, for larger and smaller N at fixed $s_b$, and for smaller and larger $s_b$ at fixed N]

  33. Inverse prediction • We have a regression of Y on X, but want to predict X, given Y. • A regression of X on Y is not appropriate, because Y is measured with error. • E.g. calibration curves: we want to predict concentration from a reading, based on a regression of readings on known solute concentrations. [Figure: reading against concentration, showing the error in the predicted “X” (concentration) for a given reading]

  34. Inverse prediction • Regress Y on X. • Generate the predicted value of X given Y. • Calculate 95% confidence limits for the “X” estimate based on the 95% confidence limits for the “Y” estimate from the standard regression. [Figure: fitted line with upper and lower 95% limits, and the predicted “X” for a given Y]
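
One way to sketch these steps: invert the fitted line for the point estimate, then find numerically where the 95% band around the line crosses the new reading. This sketch assumes the band for a single new observation and a clearly positive slope. The function name, the calibration data and the tabulated t value are hypothetical.

```python
import math

def inverse_prediction(x, y, y0, t_crit):
    """Predict X for a new reading y0, with limits read off the prediction band.

    Assumes a clearly positive slope, so both band edges increase with X.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    ms_error = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2)
    x_hat = (y0 - a) / b  # point estimate: invert the fitted line

    def band_edge(x_val, sign):
        half = t_crit * math.sqrt(
            ms_error * (1 + 1 / n + (x_val - mean_x) ** 2 / sxx))
        return a + b * x_val + sign * half

    def crossing(sign):
        # Bisection for the X at which the chosen band edge equals y0.
        span = 10 * (max(x) - min(x))
        lo, hi = x_hat - span, x_hat + span
        for _ in range(200):
            mid = (lo + hi) / 2
            if band_edge(mid, sign) < y0:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # The upper edge reaches y0 at the smaller X, the lower edge at the larger X.
    return x_hat, crossing(+1), crossing(-1)

# Hypothetical calibration data: known concentrations and instrument readings.
conc = [1, 2, 3, 4, 5]
reading = [2.1, 3.9, 6.2, 7.8, 10.1]
x_hat, lower, upper = inverse_prediction(conc, reading, y0=7.0, t_crit=3.182)
print(f"predicted X = {x_hat:.3f}, 95% limits {lower:.3f} to {upper:.3f}")
```

The limits are not symmetric about the prediction in general, because the band widens away from the mean of X.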

  35. Regression with replication • Used when several Y’s are measured for each X. • In this case, we can test the linearity assumption directly by testing the MS due to deviations from linearity over the MS within groups. [Figure: partitioning of sums of squares into regression SS, SS due to nonlinearity, group SS, within-group SS and error SS]

  36. Weighted regression • Used when our confidence in the values of individual observations varies, e.g. different measurement error or precision. • In replicated designs, the variance of Y for a given X may vary among X’s, as may sample size (N). • So, weight by N or by the inverse of the sample variance. [Figure: Y against X]
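
A minimal sketch of weighted least squares for a straight line, using weighted means and weighted sums of squares. The function name `weighted_fit`, the group means and the group variances are hypothetical; the weights are the inverse sample variances, one of the two choices named above.

```python
def weighted_fit(x, y, w):
    """Weighted least-squares intercept and slope; w holds one weight per point."""
    sw = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / sw
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

# Hypothetical group means of Y at each X, with their sample variances.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
variances = [0.1, 0.1, 0.4, 0.4, 1.6]
weights = [1 / v for v in variances]

a_w, b_w = weighted_fit(x, y, weights)
a_u, b_u = weighted_fit(x, y, [1] * len(x))  # equal weights = ordinary fit
print(f"weighted:   a = {a_w:.3f}, b = {b_w:.3f}")
print(f"unweighted: a = {a_u:.3f}, b = {b_u:.3f}")
```

With equal weights the result reduces to the ordinary least-squares line, so the weighted fit can be read as a generalization rather than a different method.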

  37. Regression caveats I: causation • A statistically significant regression of Y on X need not imply a causal relationship between the two. • A non-significant linear regression need not imply the lack of a causal relationship if the causal relationship is non-linear. [Figure: diagram of variables X, Y and Z; plot of Y against X for which the linear H0 is accepted]

  38. Regression caveats II: small samples • Significant regressions can be obtained by chance, i.e. even when no (linear) causal relationship exists. • This is especially true if sample sizes are small. • So when doing multiple simple regressions, control the experiment-wise error rate ($\alpha_e$). [Figure: Y against X, with the true regression (H0 accepted) and a sample regression (H0 rejected)]

  39. Regression caveats III: large samples • When N is large, only very small regression coefficients are required to reject H0 (power is large). • So, be careful of “overinterpreting” the observed relationship if $R^2$ is small. [Figure: Y against X, with a true regression for which H0 is rejected but $R^2$ is small]

  40. Regression caveats IV: extrapolation and interpolation • Be careful when (1) predictions lie outside the range of the sample; (2) predictions are for values where data are sparse. [Figure: estimated and true relations of Y on X, with observations, a predicted value and the true value]

  41. The final word on extrapolation “In the space of one hundred and seventy-six years the Lower Mississippi has shortened itself two hundred and forty-two miles. That is an average of a trifle over one mile and a third per year. Therefore, any calm person, who is not blind or idiotic, can see that in the Old Oölitic Silurian period, just a million years ago next November, the Lower Mississippi River was upwards of one million three hundred thousand miles long, and stuck out over the Gulf of Mexico like a fishing rod. And by the same token, any person can see that seven hundred and forty-two years from now, the Lower Mississippi will be only a mile and three-quarters long, and Cairo and New Orleans will have joined their streets together, and be plodding comfortably along under a single mayor and a mutual board of aldermen.” Mark Twain, Life on the Mississippi

  42. Power and sample size in simple linear regression • Because the correlation coefficient r and the regression coefficient b are closely related, i.e. $b = r \, s_Y / s_X$, we can transform b to r and evaluate power using r.

  43. Power and sample size in regression • If we test H0: b = 0 with sample size n, we can determine 1 - β by calculating the z-transformed values of the critical value of the corresponding r at the specified α ($z_\alpha$) and of the sample correlation coefficient r corresponding to b ($z_r$), and the one-tailed probability of the normal deviate: $Z_{\beta(1)} = (z_r - z_\alpha) \sqrt{n - 3}$

  44. Power and sample size in regression • Once $Z_{\beta(1)}$ is determined, we can calculate the probability of obtaining a Z-value of this size or greater, i.e. β. • Power is then 1 - β. [Figure: probability density of Z, with the area β beyond $Z_{\beta(1)}$]
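
The calculation can be sketched with the Fisher z transformation, $z = \operatorname{arctanh}(r)$. The sketch assumes the normal deviate takes the form $(z_r - z_\alpha)\sqrt{n - 3}$ and obtains the critical r from a tabulated critical t. The function name and the example values of r are hypothetical and are not the wing-length data.

```python
import math
from statistics import NormalDist

def power_of_slope_test(r, n, t_crit):
    """Approximate power of the test of H0: b = 0, evaluated through r.

    t_crit is the tabulated critical t for the chosen alpha with n - 2 df.
    """
    df = n - 2
    r_crit = math.sqrt(t_crit ** 2 / (t_crit ** 2 + df))  # critical value of r
    z_r = math.atanh(abs(r))      # Fisher z of the sample r
    z_alpha = math.atanh(r_crit)  # Fisher z of the critical r
    z_beta = (z_r - z_alpha) * math.sqrt(n - 3)
    beta = 1 - NormalDist().cdf(z_beta)  # P(Z >= z_beta)
    return 1 - beta

# Hypothetical examples with n = 13; 2.201 is the tabulated t for alpha(2) = 0.05, 11 df.
moderate = power_of_slope_test(r=0.60, n=13, t_crit=2.201)
strong = power_of_slope_test(r=0.98, n=13, t_crit=2.201)
print(f"power at r = 0.60: {moderate:.3f}")
print(f"power at r = 0.98: {strong:.3f}")
```

With a very strong relationship, power is essentially 1 even in a sample of 13. A moderate r barely above the critical value gives power not far above one half.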

  45. Power and sample size in regression: an example • Changes in wing length with age in a sample of 13 birds. • So 1 - β = 1.00.

  46. Minimal sample size in regression • Given the desired power 1 - β, how large a sample is required to reject H0: b = 0 if it is false and the true regression coefficient is at least $b_0$? • To answer this, first calculate the correlation coefficient $r_0$ corresponding to $b_0$. [Figure: two panels of Y against X1, showing observed values, the line expected under H0: b = 0 and the true regression ($b_0$), each asking “Reject H0?”]

  47. Minimal sample size in regression (cont’d) • …then calculate: $n = \left( (Z_{\beta(1)} + Z_\alpha) / z_0 \right)^2 + 3$, where $z_0$ is the z-transformed value of $r_0$.
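
A sketch of the sample size calculation under the Fisher z approximation, assuming the form $n = ((Z_{\beta(1)} + Z_\alpha) / z_0)^2 + 3$ with $z_0 = \operatorname{arctanh}(r_0)$. The function name and the values of $r_0$ are hypothetical; they are not the values behind the worked example on the following slides.

```python
import math
from statistics import NormalDist

def minimal_sample_size(r0, alpha_two_tailed, power):
    """Smallest n giving the desired power to detect a correlation of at least r0."""
    z_alpha = NormalDist().inv_cdf(1 - alpha_two_tailed / 2)  # two-tailed alpha
    z_beta = NormalDist().inv_cdf(power)                      # one-tailed beta
    z0 = math.atanh(abs(r0))                                  # Fisher z of r0
    return math.ceil(((z_beta + z_alpha) / z0) ** 2 + 3)

# Hypothetical targets: alpha(2) = 0.05 and power 0.99, i.e. beta(1) = 0.01.
n_moderate = minimal_sample_size(r0=0.5, alpha_two_tailed=0.05, power=0.99)
n_strong = minimal_sample_size(r0=0.8, alpha_two_tailed=0.05, power=0.99)
print(f"r0 = 0.5 needs n >= {n_moderate}")
print(f"r0 = 0.8 needs n >= {n_strong}")
```

The required sample size falls quickly as $r_0$ grows, so the result is sensitive to how $b_0$ is converted to $r_0$.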

  48. Minimal sample size: an example • We want to reject H0: b = 0 99% of the time when $b_0$ > 0.2 and α(2) = .05. • So β(1) = .01. • For b = .20, we have...

  49. Minimal sample size (cont’d) • So… …and… • So, a sample size of at least 8 should be used.
