Linear Regression

Linear Regression. Hypothesis testing and Estimation. Assume that we have collected data on two variables X and Y. Let $(x_1, y_1), (x_2, y_2), (x_3, y_3), \ldots, (x_n, y_n)$ denote the pairs of measurements on the two variables X and Y for n cases in a sample (or population).

ailsa

Presentation Transcript


  1. Linear Regression Hypothesis testing and Estimation

  2. Assume that we have collected data on two variables X and Y. Let $(x_1, y_1), (x_2, y_2), (x_3, y_3), \ldots, (x_n, y_n)$ denote the pairs of measurements on the two variables X and Y for n cases in a sample (or population).

  3. The Statistical Model

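  4. [Figure: the line $Y = \alpha + \beta X$ with slope $\beta$ and intercept $\alpha$, showing $y_i$ scattered about $\alpha + \beta x_i$ with standard deviation $\sigma$.] Each $y_i$ is assumed to be randomly generated from a normal distribution with mean $\mu_i = \alpha + \beta x_i$ and standard deviation $\sigma$. ($\alpha$, $\beta$ and $\sigma$ are unknown.)

  5. The Data and the Linear Regression Model • The data fall roughly about a straight line, $Y = \alpha + \beta X$, which is unseen.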

  6. The Least Squares Line Fitting the best straight line to “linear” data

  7. Let Y = a + bX denote an arbitrary equation of a straight line, where a and b are known values. This equation can be used to predict, for each value of X, the value of Y. For example, if $X = x_i$ (as for the ith case) then the predicted value of Y is: $\hat{y}_i = a + b x_i$

  8. The residual $r_i = y_i - \hat{y}_i = y_i - (a + b x_i)$ can be computed for each case in the sample. The residual sum of squares, $RSS = \sum_{i=1}^{n} r_i^2 = \sum_{i=1}^{n} (y_i - a - b x_i)^2$, is a measure of the “goodness of fit” of the line Y = a + bX to the data.

  9. The optimal choice of a and b will result in the residual sum of squares attaining a minimum. If this is the case, then the line Y = a + bX is called the Least Squares Line.
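The residual sum of squares for an arbitrary line can be sketched as follows. This is a minimal illustration, not part of the original slides: the function name `residual_sum_of_squares` and the four data points are made up for the example. It shows only that different choices of a and b give different RSS values, and that the least squares line is the choice with the smallest one.

```python
def residual_sum_of_squares(xs, ys, a, b):
    """RSS of the line Y = a + bX: the sum of squared residuals y_i - (a + b*x_i)."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Made-up illustrative data (hypothetical, not taken from the slides).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 5.0, 8.0]

# Two candidate lines; the one with the smaller RSS fits these points better.
rss_line_1 = residual_sum_of_squares(xs, ys, a=0.0, b=2.0)
rss_line_2 = residual_sum_of_squares(xs, ys, a=1.0, b=1.0)
print(rss_line_1, rss_line_2)
```

Neither candidate here is claimed to be the least squares line; the comparison only illustrates the criterion being minimized.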

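  10. The equation for the least squares line. Let $S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2$, $S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2$ and $S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$.

  11. Linear Regression Hypothesis testing and Estimation

  12. The Least Squares Line Fitting the best straight line to “linear” data

  13. Computing Formulae: $S_{xx} = \sum x_i^2 - \frac{(\sum x_i)^2}{n}$, $S_{yy} = \sum y_i^2 - \frac{(\sum y_i)^2}{n}$, $S_{xy} = \sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n}$

  14. Then the slope of the least squares line can be shown to be: $b = \frac{S_{xy}}{S_{xx}}$

  15. and the intercept of the least squares line can be shown to be: $a = \bar{y} - b\bar{x}$

  16. Computing formula for the residual sum of squares: $RSS = \sum_{i=1}^{n} (y_i - a - b x_i)^2 = S_{yy} - \frac{S_{xy}^2}{S_{xx}}$

  17. Computing formula for estimating $\sigma$, the standard deviation in the regression model: $s = \sqrt{\frac{RSS}{n - 2}} = \sqrt{\frac{1}{n - 2}\left(S_{yy} - \frac{S_{xy}^2}{S_{xx}}\right)}$. This estimate of $\sigma$ is said to be based on n − 2 degrees of freedom.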
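The computing formulae for the slope, intercept, residual sum of squares and the estimate s can be put together in a short sketch. It assumes the standard definitions $S_{xx} = \sum x^2 - (\sum x)^2/n$, $S_{yy} = \sum y^2 - (\sum y)^2/n$ and $S_{xy} = \sum xy - (\sum x)(\sum y)/n$; the function name `fit_least_squares` and the four data points are hypothetical and chosen only for illustration.

```python
from math import sqrt


def fit_least_squares(xs, ys):
    """Least squares fit of Y = a + bX using the computing formulae."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    s_yy = sum(y * y for y in ys) - sum_y ** 2 / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n

    b = s_xy / s_xx                       # slope
    a = sum_y / n - b * sum_x / n         # intercept: y_bar - b * x_bar
    rss = max(s_yy - s_xy ** 2 / s_xx, 0.0)  # guard against tiny negative rounding error
    s = sqrt(rss / (n - 2))               # estimate of sigma on n - 2 degrees of freedom
    return {"a": a, "b": b, "rss": rss, "s": s}


# Made-up illustrative data (hypothetical, not taken from the slides).
fit = fit_least_squares([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 5.0, 8.0])
print(fit)
```

Dividing by n − 2 rather than n reflects the two parameters (slope and intercept) estimated from the data.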

  18. Sampling distributions of the estimators

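  19. The sampling distribution of the slope of the least squares line: It can be shown that $b$ has a normal distribution with mean $\mu_b = \beta$ and standard deviation $\sigma_b = \frac{\sigma}{\sqrt{S_{xx}}}$, where $S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2$.

  20. Thus $z = \frac{b - \beta}{\sigma / \sqrt{S_{xx}}}$ has a standard normal distribution, and $t = \frac{b - \beta}{s / \sqrt{S_{xx}}}$ has a t distribution with df = n − 2.

  21. $(1 - \alpha)100\%$ Confidence Limits for the slope $\beta$: $b \pm t_{\alpha/2} \frac{s}{\sqrt{S_{xx}}}$, where $t_{\alpha/2}$ is the critical value for the t-distribution with n − 2 degrees of freedom.

  22. Testing the slope: $H_0: \beta = \beta_0$ versus $H_A: \beta \neq \beta_0$. The test statistic $t = \frac{b - \beta_0}{s / \sqrt{S_{xx}}}$ has a t distribution with df = n − 2 if $H_0$ is true.

  23. The Critical Region: reject $H_0$ if $t < -t_{\alpha/2}$ or $t > t_{\alpha/2}$, with df = n − 2. This is a two-tailed test. One-tailed tests are also possible.

  24. The sampling distribution of the intercept of the least squares line: It can be shown that $a$ has a normal distribution with mean $\mu_a = \alpha$ and standard deviation $\sigma_a = \sigma \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}}$.

  25. Thus $z = \frac{a - \alpha}{\sigma \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}}}$ has a standard normal distribution, and $t = \frac{a - \alpha}{s \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}}}$ has a t distribution with df = n − 2.

  26. $(1 - \alpha)100\%$ Confidence Limits for the intercept $\alpha$: $a \pm t_{\alpha/2}\, s \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}}$, where $t_{\alpha/2}$ is the critical value for the t-distribution with n − 2 degrees of freedom.

  27. Testing the intercept: $H_0: \alpha = \alpha_0$ versus $H_A: \alpha \neq \alpha_0$. The test statistic $t = \frac{a - \alpha_0}{s \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}}}$ has a t distribution with df = n − 2 if $H_0$ is true.

  28. The Critical Region: reject $H_0$ if $t < -t_{\alpha/2}$ or $t > t_{\alpha/2}$, with df = n − 2.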

  29. Example

  30. The following data show the per capita consumption of cigarettes per month (X) in various countries in 1930, and the death rates from lung cancer for men in 1950.

      TABLE: Per capita consumption of cigarettes per month (Xi) in n = 11 countries in 1930, and the death rates, Yi (per 100,000), from lung cancer for men in 1950.

      Country (i)       Xi     Yi
      Australia         48     18
      Canada            50     15
      Denmark           38     17
      Finland          110     35
      Great Britain    110     46
      Holland           49     24
      Iceland           23      6
      Norway            25      9
      Sweden            30     11
      Switzerland       51     25
      USA              130     20

  31. Fitting the Least Squares Line

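  32. Fitting the Least Squares Line. First compute the following three quantities from the table: $S_{xx} = \sum x_i^2 - \frac{(\sum x_i)^2}{n} = 54404 - \frac{664^2}{11} = 14322.55$, $S_{yy} = \sum y_i^2 - \frac{(\sum y_i)^2}{n} = 6018 - \frac{226^2}{11} = 1374.73$, $S_{xy} = \sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n} = 16914 - \frac{(664)(226)}{11} = 3271.82$

  33. Computing the estimates of the slope ($\beta$), intercept ($\alpha$) and standard deviation ($\sigma$): $b = \frac{S_{xy}}{S_{xx}} = 0.228$, $a = \bar{y} - b\bar{x} = 6.756$, $s = \sqrt{\frac{1}{n - 2}\left(S_{yy} - \frac{S_{xy}^2}{S_{xx}}\right)} = 8.35$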

  34. 95% Confidence Limits for the slope $\beta$: 0.0706 to 0.3862, where $t_{0.025} = 2.262$ is the critical value for the t-distribution with 9 degrees of freedom.

  35. 95% Confidence Limits for the intercept $\alpha$: -4.34 to 17.85, where $t_{0.025} = 2.262$ is the critical value for the t-distribution with 9 degrees of freedom.

  36. Fitted line: Y = 6.756 + (0.228)X. 95% confidence limits for the slope: 0.0706 to 0.3862. 95% confidence limits for the intercept: -4.34 to 17.85.
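The fitted line and both sets of 95% confidence limits can be reproduced from the table of cigarette consumption and lung cancer death rates. The sketch below uses only the Python standard library and takes the critical value 2.262 from the slides rather than computing it; the variable names are chosen for this illustration. It assumes the standard error of the slope is $s/\sqrt{S_{xx}}$ and that of the intercept is $s\sqrt{1/n + \bar{x}^2/S_{xx}}$.

```python
from math import sqrt

# Data from the table: cigarettes per capita per month (1930), deaths per 100,000 (1950).
x = [48, 50, 38, 110, 110, 49, 23, 25, 30, 51, 130]
y = [18, 15, 17, 35, 46, 24, 6, 9, 11, 25, 20]
T_CRIT = 2.262  # t_{0.025} with 9 degrees of freedom, as quoted in the slides

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
s_xx = sum(xi * xi for xi in x) - sum(x) ** 2 / n
s_yy = sum(yi * yi for yi in y) - sum(y) ** 2 / n
s_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n

b = s_xy / s_xx                                   # slope estimate
a = y_bar - b * x_bar                             # intercept estimate
s = sqrt((s_yy - s_xy ** 2 / s_xx) / (n - 2))     # estimate of sigma

se_b = s / sqrt(s_xx)                             # standard error of the slope
se_a = s * sqrt(1 / n + x_bar ** 2 / s_xx)        # standard error of the intercept

slope_limits = (b - T_CRIT * se_b, b + T_CRIT * se_b)
intercept_limits = (a - T_CRIT * se_a, a + T_CRIT * se_a)

print(f"Y = {a:.3f} + ({b:.3f})X")
print("slope limits:", slope_limits)
print("intercept limits:", intercept_limits)
```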

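  37. Testing for a positive slope: $H_0: \beta = 0$ versus $H_A: \beta > 0$. The test statistic is: $t = \frac{b}{s / \sqrt{S_{xx}}}$, where $S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2$.

  38. The Critical Region: reject $H_0$ if $t > t_{\alpha}$, with df = 11 − 2 = 9. This is a one-tailed test.

  39. The test statistic falls in the critical region, so we reject $H_0: \beta = 0$ and conclude $H_A: \beta > 0$.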
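The one-tailed test for a positive slope can be checked numerically on the cigarette data. In this sketch the significance level 0.05 and its critical value $t_{0.05} = 1.833$ (9 degrees of freedom) are assumptions made for illustration, since the transcript does not state the level used. The test statistic is taken to be the slope estimate divided by its standard error $s/\sqrt{S_{xx}}$.

```python
from math import sqrt

# Data from the example table (cigarette consumption X, lung cancer death rate Y).
x = [48, 50, 38, 110, 110, 49, 23, 25, 30, 51, 130]
y = [18, 15, 17, 35, 46, 24, 6, 9, 11, 25, 20]

# Assumed significance level 0.05: one-tailed critical value with 9 df.
T_CRIT_ONE_TAILED = 1.833

n = len(x)
s_xx = sum(xi * xi for xi in x) - sum(x) ** 2 / n
s_yy = sum(yi * yi for yi in y) - sum(y) ** 2 / n
s_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n

b = s_xy / s_xx
s = sqrt((s_yy - s_xy ** 2 / s_xx) / (n - 2))

# Test statistic for H0: beta = 0 against HA: beta > 0.
t_stat = b / (s / sqrt(s_xx))
reject_h0 = t_stat > T_CRIT_ONE_TAILED

print(f"t = {t_stat:.2f}, reject H0: {reject_h0}")
```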

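  40. Confidence Limits for Points on the Regression Line • The intercept $\alpha$ is a specific point on the regression line. • It is the y-coordinate of the point on the regression line when x = 0. • It is the predicted value of y when x = 0. • We may also be interested in other points on the regression line, e.g. when $x = x_0$. • In this case the y-coordinate of the point on the regression line when $x = x_0$ is $\alpha + \beta x_0$.

  41. [Figure: the regression line $y = \alpha + \beta x$ with the point $(x_0, \alpha + \beta x_0)$ marked.]

  42. $(1 - \alpha)100\%$ Confidence Limits for $\alpha + \beta x_0$: $a + b x_0 \pm t_{\alpha/2}\, s \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}$, where $S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2$ and $t_{\alpha/2}$ is the $\alpha/2$ critical value for the t-distribution with n − 2 degrees of freedom.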
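Confidence limits for a point on the regression line can be sketched with the cigarette data from the earlier example. The function name `mean_response_limits` and the choice $x_0 = 60$ are hypothetical; the critical value 2.262 is the one quoted in the slides for 9 degrees of freedom. The sketch assumes the standard half-width $t_{\alpha/2}\, s \sqrt{1/n + (x_0 - \bar{x})^2 / S_{xx}}$. Setting $x_0 = 0$ should recover the confidence limits for the intercept, which gives a convenient check.

```python
from math import sqrt

# Data from the example table (cigarette consumption X, lung cancer death rate Y).
x = [48, 50, 38, 110, 110, 49, 23, 25, 30, 51, 130]
y = [18, 15, 17, 35, 46, 24, 6, 9, 11, 25, 20]
T_CRIT = 2.262  # t_{0.025} with 9 degrees of freedom, as quoted in the slides


def mean_response_limits(xs, ys, x0, t_crit):
    """Confidence limits for the point on the regression line at x = x0."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in xs)
    s_yy = sum((yi - y_bar) ** 2 for yi in ys)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    s = sqrt((s_yy - s_xy ** 2 / s_xx) / (n - 2))
    centre = a + b * x0
    half_width = t_crit * s * sqrt(1 / n + (x0 - x_bar) ** 2 / s_xx)
    return centre - half_width, centre + half_width


limits_at_zero = mean_response_limits(x, y, 0, T_CRIT)  # same as the intercept limits
limits_at_60 = mean_response_limits(x, y, 60, T_CRIT)   # hypothetical x0
print(limits_at_zero, limits_at_60)
```

The interval is narrowest when $x_0$ equals the mean of the x values and widens as $x_0$ moves away from it.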

  43. Prediction Limits for new values of the Dependent variable y • An important application of the regression line is prediction. • Knowing the value of x ($x_0$), what is the value of y? • The predicted value of y when $x = x_0$ is: $\alpha + \beta x_0$. • This in turn can be estimated by: $\hat{y} = a + b x_0$.

  44. The predictor $\hat{y} = a + b x_0$ • Gives only a single value for y. • A more appropriate piece of information would be a range of values. • A range of values that has a fixed probability of capturing the value for y. • A $(1 - \alpha)100\%$ prediction interval for y.

  45. $(1 - \alpha)100\%$ Prediction Limits for y when $x = x_0$: $a + b x_0 \pm t_{\alpha/2}\, s \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}$, where $S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2$ and $t_{\alpha/2}$ is the $\alpha/2$ critical value for the t-distribution with n − 2 degrees of freedom.
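Prediction limits for a new observation can be sketched in the same way, again with the cigarette data and the critical value 2.262 from the slides. The function name `prediction_limits` and the value $x_0 = 60$ are hypothetical. The sketch assumes the standard half-width $t_{\alpha/2}\, s \sqrt{1 + 1/n + (x_0 - \bar{x})^2 / S_{xx}}$, which differs from the confidence limits for a point on the line only by the extra 1 under the square root.

```python
from math import sqrt

# Data from the example table (cigarette consumption X, lung cancer death rate Y).
x = [48, 50, 38, 110, 110, 49, 23, 25, 30, 51, 130]
y = [18, 15, 17, 35, 46, 24, 6, 9, 11, 25, 20]
T_CRIT = 2.262  # t_{0.025} with 9 degrees of freedom, as quoted in the slides


def prediction_limits(xs, ys, x0, t_crit):
    """Prediction limits for a new y at x = x0, plus both half-widths for comparison."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in xs)
    s_yy = sum((yi - y_bar) ** 2 for yi in ys)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    s = sqrt((s_yy - s_xy ** 2 / s_xx) / (n - 2))

    y_hat = a + b * x0                       # point predictor
    leverage = 1 / n + (x0 - x_bar) ** 2 / s_xx
    ci_half = t_crit * s * sqrt(leverage)        # half-width for the point on the line
    pi_half = t_crit * s * sqrt(1 + leverage)    # half-width for a new observation
    return {
        "y_hat": y_hat,
        "lower": y_hat - pi_half,
        "upper": y_hat + pi_half,
        "ci_half": ci_half,
        "pi_half": pi_half,
    }


result = prediction_limits(x, y, 60, T_CRIT)  # hypothetical x0
print(result)
```

The prediction interval is always wider than the confidence interval at the same $x_0$, because it also accounts for the scatter of an individual observation about the line.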

  46. Example. In this example we are studying building fires in a city and are interested in the relationship between: • X = the distance between the closest fire hall and the building that puts out the alarm, and • Y = the cost of the damage (in $1000s). The data were collected on n = 15 fires.

  47. The Data

  48. Scatter Plot

  49. Computations
