AMS 572 Presentation

Presentation Transcript


  1. AMS 572 Presentation CH 10 Simple Linear Regression

  2. Introduction Example: Brad Pitt: 1.83m, Angelina Jolie: 1.70m; George Bush: 1.81m, Laura Bush: ?; David Beckham: 1.83m, Victoria Beckham: 1.68m. ● Goal: to predict the height of the wife in a couple based on the husband's height. Response (outcome or dependent) variable Y: height of the wife. Predictor (explanatory or independent) variable X: height of the husband.

  3. Regression Analysis ● Regression analysis is a statistical methodology to estimate the relationship of a response variable to a set of predictor variables. ● When there is just one predictor variable, we use simple linear regression; when there are two or more predictor variables, we use multiple linear regression. ● When it is not clear which variable represents a response and which a predictor, correlation analysis is used instead to study the strength of the relationship. History: ● The earliest form of linear regression was the method of least squares, published by Legendre in 1805 and by Gauss in 1809. ● The method was extended by Francis Galton in the 19th century to describe a biological phenomenon. ● This work was extended by Karl Pearson and Udny Yule to a more general statistical context around the turn of the 20th century.

  4. A Probabilistic Model • x1, ..., xn: specific settings of the predictor variable • Y1, ..., Yn: corresponding values of the response variable

  5. ASSUME: the observed value of the random variable Yi depends on xi through Yi = β0 + β1xi + εi, where εi is a random error with E(εi) = 0, so the unknown mean of Yi is E(Yi) = μi = β0 + β1xi. The line μ = β0 + β1x is the true regression line, with unknown slope β1 and unknown intercept β0.

  6. FOUR BASIC ASSUMPTIONS (1) The means E(Yi) are a linear function of the xi. (2) The Yi have a common variance σ^2, the same for all values of x. (3) The Yi are normally distributed. (4) The Yi are independent.

  7. Comments: 1. The model is "linear" because it is linear in the parameters β0 and β1, not because of x. Example: Y = β0 + β1 log x is still linear; set x' = log x. 2. The predictor variable need not be set at predetermined fixed values; it may be random along with Y. Example: height and weight of children, where height (X) is given and weight (Y) is to be predicted. In that case the model describes the conditional expectation of Y given X = x: E(Y | X = x) = β0 + β1x.

  8. 10.2 Fitting the Simple Linear Regression Model 10.2.1 Least Squares (LS) Fit

  9. Example 10.1 (Tire Tread Wear vs. Mileage: Scatter Plot) [Figure: scatter plot of groove depth against mileage from Table 10.1.]

  10. The "best" fitting straight line is the one minimizing the sum of squared deviations Q = Σ [yi − (β0 + β1xi)]^2; the minimizing values are the LS estimates β̂0 and β̂1. One way to find them is calculus: ∂Q/∂β0 = −2 Σ [yi − (β0 + β1xi)] and ∂Q/∂β1 = −2 Σ xi[yi − (β0 + β1xi)]. Setting these partial derivatives equal to zero and simplifying, we get the normal equations Σyi = nβ0 + β1 Σxi and Σxiyi = β0 Σxi + β1 Σxi^2.

  11. Solving the two equations simultaneously, we get β̂1 = (Σxiyi − n x̄ȳ) / (Σxi^2 − n x̄^2) and β̂0 = ȳ − β̂1 x̄.

  12. To simplify, we introduce Sxx = Σ(xi − x̄)^2, Syy = Σ(yi − ȳ)^2, and Sxy = Σ(xi − x̄)(yi − ȳ). We get β̂1 = Sxy / Sxx and β̂0 = ȳ − β̂1 x̄. The equation ŷ = β̂0 + β̂1 x is known as the least squares (LS) line, which is an estimate of the true regression line.

  13. Example 10.2 (Tire Tread Wear vs. Mileage: LS Line Fit) To find the equation of the LS line for the tire tread wear data from Table 10.1, we have n = 9, and from the data we calculate x̄, ȳ, Sxx, and Sxy (the numerical sums are not preserved in this transcript).

  14. The slope and intercept estimates are β̂1 = −7.281 and β̂0 = 360.64. Therefore, the equation of the LS line is ŷ = 360.64 − 7.281x. Conclusion: there is a loss of 7.281 mils in the tire groove depth for every 1000 miles of driving. Given a particular x* = 25 (i.e., 25,000 miles), we can find ŷ* = 360.64 − 7.281(25) = 178.62, which means the mean groove depth for all tires driven for 25,000 miles is estimated to be 178.62 mils.
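  This fit is easy to reproduce in SAS. The sketch below is a minimal example; the nine (mileage, groove depth) pairs in the datalines are transcribed from Table 10.1 of the textbook as an assumption here, so verify them against your copy before relying on the output.

      data tire;
        input x y;      /* x = mileage (1000s of miles), y = groove depth (mils) */
        datalines;
      0  394.33
      4  329.50
      8  291.00
      12 255.17
      16 229.33
      20 204.83
      24 179.00
      28 163.83
      32 150.33
      ;
      run;

      proc reg data=tire;
        model y = x;    /* fits y-hat = b0 + b1*x */
      run;

  If the data match the text, the parameter estimates table should show an intercept near 360.64 and a slope near -7.281.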

  15. 10.2.2 Goodness of Fit of the LS Line • Coefficient of determination and correlation • The residuals ei = yi − ŷi, i = 1, ..., n, are used to evaluate the goodness of fit of the LS line.

  16. We define SST = Σ(yi − ȳ)^2 (total sum of squares), SSR = Σ(ŷi − ȳ)^2 (regression sum of squares), and SSE = Σ(yi − ŷi)^2 (error sum of squares). Note: SST = SSR + SSE. The ratio r^2 = SSR/SST = 1 − SSE/SST is called the coefficient of determination, and 0 ≤ r^2 ≤ 1.

  17. Example 10.3 (Tire Tread Wear vs. Mileage: Coefficient of Determination and Correlation) For the tire tread wear data, calculate r^2 and r using the results from Example 10.2. From SST and SSE we get r^2 = 0.953, and therefore r = −0.976, where the sign of r follows from the sign of β̂1. Since 95.3% of the variation in tread wear is accounted for by linear regression on mileage, the relationship between the two is strongly linear with a negative slope.

  18. 10.2.3 Estimation of σ^2 An unbiased estimate of σ^2 is given by s^2 = SSE/(n − 2). Example 10.4 (Tire Tread Wear vs. Mileage: Estimate of σ^2) Find the estimate of σ^2 for the tread wear data using the results from Example 10.3. We have SSE = 2351.3 and n − 2 = 7, therefore s^2 = 2351.3/7 = 335.9, which has 7 d.f. The estimate of σ is s = √335.9 = 18.33 mils.

  19. Statistical Inference on β0 and β1 Point estimators: β̂0 and β̂1. Sampling distributions: β̂1 ~ N(β1, σ^2/Sxx) and β̂0 ~ N(β0, σ^2 [1/n + x̄^2/Sxx]). For the mathematical derivations, please refer to the textbook, p. 331.

  20. Statistical Inference on β0 and β1, cont'd P.Q.'s: (β̂1 − β1)/SE(β̂1) ~ t(n−2) and (β̂0 − β0)/SE(β̂0) ~ t(n−2), where SE(β̂1) = s/√Sxx and SE(β̂0) = s √(1/n + x̄^2/Sxx). CI's: 100(1 − α)% confidence intervals are β̂1 ± t(n−2, α/2) SE(β̂1) and β̂0 ± t(n−2, α/2) SE(β̂0).

  21. Statistical Inference on β0 and β1, cont'd Hypothesis test of H0: β1 = β1,0 vs. H1: β1 ≠ β1,0. Test statistic: t0 = (β̂1 − β1,0)/SE(β̂1). At the significance level α, we reject H0 in favor of H1 iff |t0| > t(n−2, α/2). With β1,0 = 0, this can be used to show whether there is a linear relationship between x and y; a worked numerical check follows below.
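  As a worked check on these formulas, the SAS data-step sketch below computes the t statistic, p-value, and a 95% CI for β1 from the tire example's summary numbers. The inputs β̂1, s, and Sxx come from Examples 10.2 and 10.4 plus the assumed Table 10.1 x-values (x = 0, 4, ..., 32 gives Sxx = 960), so treat them as assumptions to be verified against the text.

      data slope_inference;
        n   = 9;
        b1  = -7.281;                       /* slope estimate (Example 10.2)    */
        s   = 18.33;                        /* sqrt(MSE) (Example 10.4)         */
        sxx = 960;                          /* sum of (x - xbar)**2, xbar = 16  */
        se_b1 = s / sqrt(sxx);              /* standard error of the slope      */
        t0    = b1 / se_b1;                 /* test statistic for H0: beta1 = 0 */
        pval  = 2 * (1 - probt(abs(t0), n - 2));
        tcrit = tinv(0.975, n - 2);         /* 97.5th percentile of t(7)        */
        ci_lo = b1 - tcrit * se_b1;
        ci_hi = b1 + tcrit * se_b1;
        put t0= pval= ci_lo= ci_hi=;
      run;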

  22. Analysis of Variance (ANOVA) Mean square: a sum of squares divided by its d.f.; here MSR = SSR/1 and MSE = SSE/(n − 2).

  23. Analysis of Variance (ANOVA) ANOVA Table:

      Source      SS    d.f.   MS                 F
      Regression  SSR   1      MSR = SSR/1        F = MSR/MSE
      Error       SSE   n-2    MSE = SSE/(n-2)
      Total       SST   n-1

  Under H0: β1 = 0, F = MSR/MSE ~ F(1, n−2), and F = t0^2. Example: [numerical ANOVA table for the tire data not preserved in this transcript].

  24. 10.4 Regression Diagnostics 10.4.1 Checking for Model Assumptions • Checking for linearity • Checking for constant variance • Checking for normality • Checking for independence

  25. Checking for Linearity For the tire data, xi = mileage and Yi = groove depth. The model is Y = β0 + β1x; the fitted line is Ŷ = β̂0 + β̂1x, with fitted values Ŷi and residuals ei = Yi − Ŷi. Plot the residuals against xi: a random scatter about zero supports linearity, while a systematic pattern suggests the straight-line model is inadequate. A SAS sketch for producing this plot follows below.
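  A minimal SAS sketch, assuming the tire dataset defined earlier: the OUTPUT statement saves residuals and fitted values, and PROC SGPLOT plots the residuals against mileage.

      proc reg data=tire;
        model y = x;
        output out=diag r=e p=yhat;   /* e = residual, yhat = fitted value */
      run;

      proc sgplot data=diag;
        scatter x=x y=e;              /* look for random scatter about zero */
        refline 0 / axis=y;
      run;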

  26. Checking for Normality [Figure: normal plot of the residuals; an approximately straight-line pattern supports the normality assumption.]

  27. Checking for Constant Variance [Figure: two residual plots; one where Var(Y) is not constant, and a sample residual plot when Var(Y) is constant.]

  28. Checking for Independence This check does not apply to the simple linear regression model with independently sampled observations; it applies only to time series data.

  29. 10.4.2 Checking for Outliers & Influential Observations • What is an outlier? • Why checking for outliers is important • Mathematical definition • How to deal with them

  30. 10.4.2-A. Intro Recall the box-and-whiskers plot (Chapter 4): • A (mild) outlier is any observation that lies outside [Q1 − 1.5·IQR, Q3 + 1.5·IQR], where IQR = Q3 − Q1 is the interquartile range. • An (extreme) outlier lies outside [Q1 − 3·IQR, Q3 + 3·IQR]. • Informally, an outlier is an observation "far away" from the rest of the data. A small SAS illustration follows below.
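  As a small illustration of these definitions, the sketch below (again assuming the tire dataset from the earlier example) computes Q1, Q3, the IQR, and the mild and extreme outlier fences for the groove-depth values.

      proc means data=tire noprint;
        var y;
        output out=quart p25=q1 p75=q3;   /* first and third quartiles */
      run;

      data fences;
        set quart;
        iqr = q3 - q1;                    /* interquartile range */
        lo_mild = q1 - 1.5*iqr;  hi_mild = q3 + 1.5*iqr;
        lo_ext  = q1 - 3*iqr;    hi_ext  = q3 + 3*iqr;
        put lo_mild= hi_mild= lo_ext= hi_ext=;
      run;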

  31. 10.4.2-B. Why are outliers a problem? • They may indicate a sample peculiarity, a data entry error, or some other problem. • The regression coefficient estimates that minimize the sum of squares for error (SSE) are very sensitive to outliers, producing bias or distortion of the estimates. • Any statistical test based on sample means and variances can be distorted in the presence of outliers, distorting p-values. • The result can be faulty conclusions. (Estimators not sensitive to outliers are said to be robust.)

  32. 10.4.2-C. Mathematical Definition • Outlier: the standardized residual is given by ei* = ei / (s √(1 − hii)), where hii is the leverage of observation i. If |ei*| > 2, then the corresponding observation may be regarded as an outlier. • Example: (Tire Tread Wear vs. Mileage) [output not preserved in this transcript]. • STUDENTIZED RESIDUAL: a type of standardized residual calculated with the current observation deleted from the analysis. • The LS fit can also be excessively influenced by an observation that is not necessarily an outlier as defined above.

  33. 10.4.2-C. Mathematical Definition • Influential observation: an observation with an extreme x-value, y-value, or both. • The leverage hii (in simple linear regression, hii = 1/n + (xi − x̄)^2/Sxx) is (k+1)/n on average; regard any hii > 2(k+1)/n as high leverage. • If xi deviates greatly from the mean x̄, then hii is large. • The standardized residual will be large for a high-leverage observation. • Influence can be thought of as the product of leverage and outlierness. • Example: [scatter plots and residual plots, with and without the unusual point, for an observation that is influential/high leverage but not an outlier].

  34. 10.4.2-C. SAS code for the examples:

      proc reg data=tire;
        model y = x;
        output out=resid rstudent=r h=lev cookd=cd dffits=dffit;
      run;

      proc print data=resid;
        where abs(r) >= 2 or lev > (4/9) or cd > (4/9)
              or abs(dffit) > (2*sqrt(1/9));
      run;

  [SAS output not preserved in this transcript]

  35. 10.4.2-D. How to deal with Outliers & Influential Observations • Investigate (Data errors? Rare events? Can they be corrected?) • Ways to accommodate outliers: nonparametric methods (robust to outliers), data transformations, or deletion (or report model results both with and without the outliers or influential observations to see how much they change).

  36. 10.4.3 Data Transformations Reasons: • To achieve linearity • To achieve homogeneity of variance • To achieve normality or symmetry about the regression equation

  37. Types of Transformation • Linearizing transformation: a transformation of the response variable, the predictor variable, or both, that produces an approximately linear relationship between the variables. • Variance-stabilizing transformation: used when the constant variance assumption is violated.

  38. Method of Linearizing Transformation • Use a mathematical operation, e.g., square root, power, log, exponential, etc. • Only one variable needs to be transformed in simple linear regression. Which one, the predictor or the response? Why?

  39. E.g., for the model Y = a exp(−bx), we take a logarithmic transformation of Y: Y = a exp(−bx) <=> log Y = log a − bx, which is linear in x with intercept log a and slope −b. A SAS sketch of this fit follows below.
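  In SAS this amounts to regressing log Y on x. A minimal sketch under stated assumptions: the dataset decay is synthetic, generated here purely for illustration with a = 100, b = 0.3, and multiplicative noise.

      data decay;                     /* synthetic data for illustration only */
        do x = 1 to 10;
          y = 100 * exp(-0.3*x) * exp(0.05 * rannor(123));
          output;
        end;
      run;

      data decay_log;
        set decay;
        logy = log(y);               /* natural log; requires y > 0 */
      run;

      proc reg data=decay_log;
        model logy = x;              /* intercept estimates log a, slope estimates -b */
      run;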

  40. Method of Variance-Stabilizing Transformation Delta method: a two-term Taylor-series approximation gives Var(h(Y)) ≈ [h'(μ)]^2 g^2(μ), where Var(Y) = g^2(μ) and E(Y) = μ. • Set [h'(μ)]^2 g^2(μ) = 1 • Then h'(μ) = 1/g(μ) • So h(μ) = ∫ dμ/g(μ), i.e., h(y) = ∫ dy/g(y). E.g., if Var(Y) = c^2 μ^2, where c > 0, then g(μ) = cμ, i.e., g(y) = cy, and h(y) = ∫ dy/(cy) = (1/c) log y. Therefore it is the logarithmic transformation that stabilizes the variance.

  41. Correlation Analysis • Correlation: a measure of how closely two variables share a linear relationship. • Useful when it is not possible to determine which variable is the predictor and which is the response. • Health vs. wealth: which is the predictor? Which is the response?

  42. Statistical Inference on the Correlation Coefficient ρ • We can derive a test on the correlation coefficient in the same way we have been doing in class. • Assumption: (X, Y) follow a bivariate normal distribution. • Start with a point estimator: R, the sample estimate of the population correlation coefficient ρ. • Get a pivotal quantity: the distribution of R itself is quite complicated, so transform the point estimator into the statistic T. • Do we know everything about the p.q.? Yes: T ~ t(n−2) under H0: ρ = 0.

  43. Bivariate Normal Distribution • pdf: f(x, y) = 1/(2π σ1 σ2 √(1 − ρ^2)) · exp{ −[ (x − μ1)^2/σ1^2 − 2ρ(x − μ1)(y − μ2)/(σ1 σ2) + (y − μ2)^2/σ2^2 ] / [2(1 − ρ^2)] } • Properties: μ1, μ2 are the means of X and Y; σ1^2, σ2^2 are the variances of X and Y; ρ is the correlation coefficient between X and Y.

  44. Derivation of T Under the bivariate normal model, the t statistic for the LS slope can be rewritten in terms of the sample correlation as T = R√(n − 2)/√(1 − R^2). Therefore, we can use T as a statistic for testing the null hypothesis H0: β1 = 0, or, equivalently, H0: ρ = 0.

  45. Exact Statistical Inference on ρ • Test: H0: ρ = 0 vs. Ha: ρ ≠ 0 • Test statistic: t0 = r√(n − 2)/√(1 − r^2) • Reject H0 if |t0| > t(n−2, α/2) • Example (from the textbook): A researcher wants to determine if two test instruments give similar results. The two test instruments are administered to a sample of 15 students, and the correlation coefficient between the two sets of scores is found to be 0.7. Is this correlation statistically significant at the .01 level? H0: ρ = 0, Ha: ρ ≠ 0. For α = .01, t0 = 0.7√13/√(1 − 0.49) = 3.534 > t(13, .005) = 3.012 ▲ Reject H0.
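  The arithmetic of this example is easy to reproduce in a short SAS data step:

      data corr_exact;
        r = 0.7;  n = 15;  alpha = 0.01;
        t0    = r * sqrt(n - 2) / sqrt(1 - r**2);   /* = 3.534 */
        tcrit = tinv(1 - alpha/2, n - 2);           /* = t(13, .005) = 3.012 */
        pval  = 2 * (1 - probt(abs(t0), n - 2));
        put t0= tcrit= pval=;
      run;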

  46. Approximate Statistical Inference on ρ • There is no exact method of testing ρ against an arbitrary ρ0: the distribution of R is very complicated, and T ~ t(n−2) only when ρ = 0. • To test ρ against an arbitrary ρ0, use Fisher's normal approximation: transform the sample estimate r to ψ̂ = (1/2) ln[(1 + r)/(1 − r)], which is approximately N(ψ, 1/(n − 3)), where ψ = (1/2) ln[(1 + ρ)/(1 − ρ)].

  47. Approximate Statistical Inference on ρ • Test: H0: ρ = ρ0 vs. Ha: ρ ≠ ρ0 • Sample estimate: ψ̂ = (1/2) ln[(1 + r)/(1 − r)] • Z statistic: z0 = (ψ̂ − ψ0)√(n − 3), where ψ0 = (1/2) ln[(1 + ρ0)/(1 − ρ0)] • Reject H0 if |z0| > z(α/2) • CI: compute ψ̂ ± z(α/2)/√(n − 3), then transform the endpoints back to the ρ scale via ρ = (e^(2ψ) − 1)/(e^(2ψ) + 1). A SAS sketch follows below.
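  The Fisher-transform test and CI can be hand-computed in a short SAS data step. This sketch reuses the textbook example (r = 0.7, n = 15) but tests against ρ0 = 0.5, where the value of ρ0 is purely an illustrative assumption; note that tanh(ψ) equals the back-transform (e^(2ψ) − 1)/(e^(2ψ) + 1).

      data corr_fisher;
        r = 0.7;  n = 15;  rho0 = 0.5;  alpha = 0.01;
        psi_hat = 0.5 * log((1 + r)/(1 - r));        /* Fisher z of r    */
        psi0    = 0.5 * log((1 + rho0)/(1 - rho0));  /* Fisher z of rho0 */
        z0   = (psi_hat - psi0) * sqrt(n - 3);
        pval = 2 * (1 - probnorm(abs(z0)));
        zc   = probit(1 - alpha/2);
        lo   = tanh(psi_hat - zc/sqrt(n - 3));       /* back-transformed CI */
        hi   = tanh(psi_hat + zc/sqrt(n - 3));
        put z0= pval= lo= hi=;
      run;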

  48. Approximate Statistical Inference on ρ using SAS • Code and output: [not preserved in this transcript; a plausible reconstruction follows below]
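  Since the slide's code was not preserved, here is a plausible reconstruction rather than the original: PROC CORR with the FISHER option reports the correlation together with Fisher z-based confidence limits and a p-value. The dataset and variable names (scores, test1, test2) and the data values are assumptions made for illustration.

      data scores;                    /* hypothetical scores for illustration */
        input test1 test2;
        datalines;
      70 65
      80 78
      90 88
      60 64
      75 70
      ;
      run;

      proc corr data=scores fisher;
        var test1 test2;              /* prints r plus the Fisher z CI and test */
      run;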
