Chapter 2: Simple Linear Regression
Ray-Bing Chen, Institute of Statistics, National University of Kaohsiung
2.1 Simple Linear Regression Model
• y = β0 + β1x + ε
• x: regressor variable
• y: response variable
• β0: the intercept, unknown
• β1: the slope, unknown
• ε: random error with E(ε) = 0 and Var(ε) = σ² (unknown)
• The errors are uncorrelated.
Given x,
• E(y|x) = E(β0 + β1x + ε) = β0 + β1x
• Var(y|x) = Var(β0 + β1x + ε) = σ²
• Responses are also uncorrelated.
• Regression coefficients: β0, β1
• β1: the change in E(y|x) for a unit change in x
• β0: E(y|x = 0)
2.2 Least-Squares Estimation of the Parameters
2.2.1 Estimation of β0 and β1
• n pairs: (yi, xi), i = 1, …, n
• Method of least squares: minimize
  S(β0, β1) = Σi (yi − β0 − β1xi)²
• The resulting least-squares estimators are
  β̂1 = Sxy/Sxx and β̂0 = ȳ − β̂1x̄,
  where Sxx = Σi (xi − x̄)² and Sxy = Σi (xi − x̄)(yi − ȳ)
The fitted simple regression model: ŷ = β̂0 + β̂1x
• A point estimate of the mean of y for a particular x
• Residual: ei = yi − ŷi
• Residuals play an important role in investigating the adequacy of the fitted regression model and in detecting departures from the underlying assumptions!
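As a quick illustrative sketch, the formulas β̂1 = Sxy/Sxx and β̂0 = ȳ − β̂1x̄ can be applied directly. The numbers below are a small hypothetical data set chosen so the arithmetic is easy to check by hand; they are not the rocket propellant data.

```python
# Least-squares fit of y = b0 + b1*x on a small hypothetical data set.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

xbar = sum(x) / n                                              # x-bar = 3.0
ybar = sum(y) / n                                              # y-bar = 4.0
Sxx = sum((xi - xbar) ** 2 for xi in x)                        # Sxx = 10.0
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))   # Sxy = 6.0

b1 = Sxy / Sxx           # slope estimate: 0.6
b0 = ybar - b1 * xbar    # intercept estimate: 2.2

fitted = [b0 + b1 * xi for xi in x]                 # fitted values yhat_i
resid = [yi - fi for yi, fi in zip(y, fitted)]      # residuals e_i = y_i - yhat_i
```

Note that the residuals sum to zero, as they must for any least-squares fit that includes an intercept.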
Example 2.1: The Rocket Propellant Data
• Shear strength is related to the age in weeks of the batch of sustainer propellant.
• 20 observations
• From the scatter diagram, there is a strong relationship between shear strength (y) and propellant age (x).
• Assumption: y = β0 + β1x + ε
• How well does this equation fit the data?
• Is the model likely to be useful as a predictor?
• Are any of the basic assumptions violated, and if so, how serious is this?
2.2.2 Properties of the Least-Squares Estimators and the Fitted Regression Model
• β̂0 and β̂1 are linear combinations of the yi.
• β̂0 and β̂1 are unbiased estimators of β0 and β1.
The Gauss–Markov Theorem: β̂0 and β̂1 are the best linear unbiased estimators (BLUE).
Some useful properties:
• The sum of the residuals in any regression model that contains an intercept β0 is always zero, i.e. Σi ei = 0.
• The regression line always passes through the centroid of the data, (x̄, ȳ).
2.2.3 Estimation of σ²
• Residual sum of squares: SSRes = Σi (yi − ŷi)² = Σi ei²
Since E(SSRes) = (n − 2)σ², the unbiased estimator of σ² is
  σ̂² = SSRes/(n − 2) = MSRes
• MSRes is called the residual mean square.
• This estimate is model-dependent.
• Example 2.2
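A minimal sketch of the estimator σ̂² = SSRes/(n − 2), again on a small hypothetical data set (not the propellant data):

```python
# Residual mean square MS_Res = SS_Res / (n - 2) on hypothetical toy data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

# Least-squares coefficients via b1 = Sxy/Sxx, b0 = ybar - b1*xbar.
Sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / Sxx
b0 = ybar - b1 * xbar

SS_res = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))  # 2.4
MS_res = SS_res / (n - 2)   # unbiased estimate of sigma^2: 0.8
```

Dividing by n − 2 rather than n reflects the two estimated parameters, which is what makes MSRes unbiased for σ².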
2.2.4 An Alternate Form of the Model
• The new regression model: yi = β0′ + β1(xi − x̄) + εi, where β0′ = β0 + β1x̄
• Normal equations: nβ̂0′ = Σi yi and β̂1 Σi (xi − x̄)² = Σi yi(xi − x̄)
• The least-squares estimators: β̂0′ = ȳ and β̂1 = Sxy/Sxx
Some advantages:
• The normal equations are easier to solve.
• β̂0′ and β̂1 are uncorrelated.
2.3 Hypothesis Testing on the Slope and Intercept
• Assume the εi are normally distributed, so yi ~ N(β0 + β1xi, σ²)
2.3.1 Use of t-Tests
• Test on the slope: H0: β1 = β10 vs. H1: β1 ≠ β10
• If σ² is known, under the null hypothesis
  Z0 = (β̂1 − β10)/√(σ²/Sxx) ~ N(0, 1)
• (n − 2)MSRes/σ² follows a χ² distribution with n − 2 degrees of freedom
• If σ² is unknown,
  t0 = (β̂1 − β10)/√(MSRes/Sxx) ~ t(n − 2) under H0
• Reject H0 if |t0| > t(α/2, n − 2)
Test on the intercept: H0: β0 = β00 vs. H1: β0 ≠ β00
• If σ² is unknown,
  t0 = (β̂0 − β00)/√(MSRes(1/n + x̄²/Sxx)) ~ t(n − 2) under H0
• Reject H0 if |t0| > t(α/2, n − 2)
2.3.2 Testing Significance of Regression
• H0: β1 = 0 vs. H1: β1 ≠ 0
• Failing to reject H0 suggests there is no linear relationship between x and y.
• Rejecting H0 indicates that x is of value in explaining the variability in y.
• Reject H0 if |t0| > t(α/2, n − 2)
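The slope t-test above can be sketched end to end on a small hypothetical data set (not the propellant data); the critical value t(0.025, 3) = 3.182 is taken from a standard t-table for n − 2 = 3 degrees of freedom.

```python
import math

# t-test for significance of regression (H0: beta1 = 0) on hypothetical toy data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / Sxx
b0 = ybar - b1 * xbar
MS_res = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2)

se_b1 = math.sqrt(MS_res / Sxx)   # estimated standard error of the slope
t0 = b1 / se_b1                   # test statistic, about 2.12

t_crit = 3.182                    # t(0.025, 3) from a t-table
reject = abs(t0) > t_crit         # False: with only 5 points, H0 is not rejected
```

With so few observations the critical value is large, so even a visible trend does not reach significance; this is why the propellant example, with n = 20, has much more power.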
Example 2.3: The Rocket Propellant Data
• Test significance of regression
• MSRes = 9244.59
• The test statistic is t0 = β̂1/√(MSRes/Sxx)
• t(0.025, 18) = 2.101
• Since |t0| > t(0.025, 18), reject H0.
2.3.3 The Analysis of Variance (ANOVA) • Use an analysis of variance approach to test significance of regression
• SST = Σi (yi − ȳ)²: the corrected sum of squares of the observations. It measures the total variability in the observations.
• SSRes = Σi (yi − ŷi)²: the residual or error sum of squares
  • The residual variation left unexplained by the regression line.
• SSR = β̂1Sxy: the regression or model sum of squares
  • The amount of variability in the observations accounted for by the regression line.
• The fundamental identity: SST = SSR + SSRes
The degrees of freedom:
• dfT = n − 1
• dfR = 1
• dfRes = n − 2
• dfT = dfR + dfRes
Testing significance of regression by ANOVA:
• SSRes/σ² = (n − 2)MSRes/σ² ~ χ²(n − 2)
• Under H0, SSR/σ² = MSR/σ² ~ χ²(1)
• SSR and SSRes are independent
• F0 = MSR/MSRes ~ F(1, n − 2) under H0
• E(MSRes) = σ²
• E(MSR) = σ² + β1²Sxx
• Reject H0 if F0 > F(α, 1, n − 2)
• If β1 ≠ 0, F0 follows a noncentral F distribution with 1 and n − 2 degrees of freedom and noncentrality parameter λ = β1²Sxx/σ²
More about the t test:
• The square of a t random variable with f degrees of freedom is an F random variable with 1 and f degrees of freedom, so t0² = F0 and the two tests of H0: β1 = 0 are equivalent.
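The ANOVA decomposition and the t²  = F equivalence can be checked numerically on a small hypothetical data set (not the propellant data):

```python
import math

# ANOVA F-test for significance of regression, and the identity F0 = t0^2.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b1 = Sxy / Sxx
b0 = ybar - b1 * xbar

SS_T = sum((yi - ybar) ** 2 for yi in y)                          # total: 6.0
SS_R = b1 * Sxy                                                   # regression: 3.6
SS_res = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))  # residual: 2.4

MS_R = SS_R / 1            # df_R = 1
MS_res = SS_res / (n - 2)  # df_Res = n - 2
F0 = MS_R / MS_res         # 4.5

t0 = b1 / math.sqrt(MS_res / Sxx)
# F0 equals t0**2: the t-test and the ANOVA F-test agree in simple regression.
```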
2.4 Interval Estimation in Simple Linear Regression
2.4.1 Confidence Intervals on β0, β1, and σ²
• Assume that the εi are normally and independently distributed.
100(1 − α)% confidence intervals on β1 and β0 are given by
  β̂1 ± t(α/2, n − 2)√(MSRes/Sxx)
  β̂0 ± t(α/2, n − 2)√(MSRes(1/n + x̄²/Sxx))
• Interpretation of the C.I.
• Confidence interval for σ²:
  (n − 2)MSRes/χ²(α/2, n − 2) ≤ σ² ≤ (n − 2)MSRes/χ²(1 − α/2, n − 2)
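A sketch of the two coefficient intervals on a small hypothetical data set (not the propellant data), again using the table value t(0.025, 3) = 3.182:

```python
import math

# 95% confidence intervals for the slope and intercept on hypothetical toy data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / Sxx
b0 = ybar - b1 * xbar
MS_res = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2)

t_crit = 3.182  # t(0.025, 3) from a t-table

se_b1 = math.sqrt(MS_res / Sxx)
ci_b1 = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)   # roughly (-0.30, 1.50)

se_b0 = math.sqrt(MS_res * (1 / n + xbar ** 2 / Sxx))
ci_b0 = (b0 - t_crit * se_b0, b0 + t_crit * se_b0)
```

The slope interval contains zero here, consistent with the earlier failure to reject H0: β1 = 0 on this tiny data set.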
2.4.2 Interval Estimation of the Mean Response
• Let x0 be the level of the regressor variable for which we wish to estimate the mean response, E(y|x0).
• x0 is in the range of the original data on x.
• An unbiased estimator of E(y|x0) is ŷ0 = β̂0 + β̂1x0
• A 100(1 − α)% confidence interval on the mean response at x0:
  ŷ0 ± t(α/2, n − 2)√(MSRes(1/n + (x0 − x̄)²/Sxx))
• The interval width is a minimum at x0 = x̄ and widens as |x0 − x̄| increases.
• Extrapolation: the interval becomes unreliable for x0 outside the range of the original data.
2.5 Prediction of New Observations
• ŷ0 = β̂0 + β̂1x0 is the point estimate of the new value of the response, y0.
• The prediction error y0 − ŷ0 follows a normal distribution with mean 0 and variance
  Var(y0 − ŷ0) = σ²(1 + 1/n + (x0 − x̄)²/Sxx)
The 100(1 − α)% confidence interval on a future observation at x0 (a prediction interval for the future observation y0):
  ŷ0 ± t(α/2, n − 2)√(MSRes(1 + 1/n + (x0 − x̄)²/Sxx))
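The contrast between the interval for the mean response and the prediction interval is easy to see numerically; the sketch below uses a small hypothetical data set (not the propellant data) and the table value t(0.025, 3) = 3.182.

```python
import math

# Confidence interval on the mean response vs. prediction interval at x0,
# on hypothetical toy data. The extra "1 +" in the PI variance makes it wider.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / Sxx
b0 = ybar - b1 * xbar
MS_res = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2)

t_crit = 3.182  # t(0.025, 3) from a t-table
x0 = 4.0
y0_hat = b0 + b1 * x0                                             # 4.6

se_mean = math.sqrt(MS_res * (1 / n + (x0 - xbar) ** 2 / Sxx))
se_pred = math.sqrt(MS_res * (1 + 1 / n + (x0 - xbar) ** 2 / Sxx))

ci = (y0_hat - t_crit * se_mean, y0_hat + t_crit * se_mean)  # mean response
pi = (y0_hat - t_crit * se_pred, y0_hat + t_crit * se_pred)  # new observation
```

The prediction interval must cover the noise in a single new observation, not just the uncertainty in the fitted line, so it always strictly contains the mean-response interval.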
2.6 Coefficient of Determination
• The coefficient of determination: R² = SSR/SST = 1 − SSRes/SST
• The proportion of variation explained by the regressor x
• 0 ≤ R² ≤ 1
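On the same kind of small hypothetical data set used above (not the propellant data), R² comes directly out of the sums of squares:

```python
# Coefficient of determination R^2 = 1 - SS_Res / SS_T on hypothetical toy data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b1 = Sxy / Sxx
b0 = ybar - b1 * xbar

SS_T = sum((yi - ybar) ** 2 for yi in y)                          # total variability
SS_res = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))  # unexplained
R2 = 1 - SS_res / SS_T    # 0.6: 60% of the variability in y is explained by x
```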
• In Example 2.1, R² = 0.9018: 90.18% of the variability in strength is accounted for by the regression model.
• R² can always be increased by adding terms to the model.
• For a simple linear regression model, E(R²) increases (decreases) as Sxx increases (decreases).
• R² does not measure the magnitude of the slope of the regression line. A large value of R² does not imply a steep slope.
• R² does not measure the appropriateness of the linear model.