Simple linear regression is a statistical method used to model the relationship between two quantitative variables. It utilizes one predictor variable, X, to predict a response variable, Y, through a straight-line relationship. Examples include using height to predict weight or warehouse space to estimate rental costs. This method incorporates a probabilistic model to account for variation and helps estimate parameters like the y-intercept and slope, which indicate how changes in X affect Y. Understanding this foundational technique is crucial in statistics.
Relationship Between Two Quantitative Variables • If we can model the relationship between two quantitative variables, we can use one variable, X, to predict another variable, Y. • Use height to predict weight. • Use percentage of hardwood in pulp to predict the tensile strength of paper. • Use square feet of warehouse space to predict monthly rental cost. L. Wang, Department of Statistics University of South Carolina; Slide 2
Simple Linear Regression • Simple: only one predictor variable • Linear: straight-line relationship • Regression: fit data to a (straight-line) model • y: response or dependent variable • x: predictor, regressor, or independent variable
Use Scatter Plot to See Relationship
Absorbed Liquid Data • In a chemical process, batches of liquid are passed through a bed containing an ingredient that is absorbed by the liquid. • We will attempt to relate the absorbed percentage of the ingredient (y) to the amount of liquid in the batch (x).
Absorbed Liquid Data
Abs% = -1822 + 435(Amt). The regression line, or model, is deterministic.
We are going to use a probabilistic model, which accounts for the variation around the line.
Probabilistic Model • A probabilistic model is a deterministic component plus a random error component that accounts for unexplained variation.
Regression Equation • y = deterministic model + random error: $y = \beta_0 + \beta_1 x + \varepsilon$ • $\beta_0$ = y-intercept • $\beta_1$ = slope • $\varepsilon$ = random error • The regression line is an estimate of the mean value of y at a given value of x.
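To make "deterministic model + random error" concrete, here is a minimal simulation sketch. The parameter values (beta0 = 2.0, beta1 = 0.5, sigma = 1.0) and the choice of normally distributed errors are assumptions made only for illustration; they are not taken from the slides' data.

```python
import random

def simulate_response(x, beta0, beta1, sigma, rng):
    """Return one draw of y = beta0 + beta1*x + eps, with eps ~ Normal(0, sigma)."""
    deterministic = beta0 + beta1 * x   # mean of y at this x
    error = rng.gauss(0.0, sigma)       # random error component
    return deterministic + error

# Hypothetical parameter values, chosen only for illustration.
beta0, beta1, sigma = 2.0, 0.5, 1.0
rng = random.Random(0)                  # fixed seed so the run is reproducible

x = 4.0
mean_y = beta0 + beta1 * x              # E(y) at x = 4
draws = [simulate_response(x, beta0, beta1, sigma, rng) for _ in range(10_000)]
sample_mean = sum(draws) / len(draws)

print(f"E(y) at x = {x}: {mean_y}")
print(f"average of simulated y values: {sample_mean:.3f}")
```

Individual draws scatter around the line, while their average stays close to the line's value at that x, which is why the regression line is read as the mean of y.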
Interpreting parameters • Once we determine that a straight-line model is reasonable, we want to establish the best line by estimating $\beta_0$ and $\beta_1$ in $\mu = E(y) = \beta_0 + \beta_1 x$. • $\beta_1$ is the slope: the amount by which the mean of y changes with a one-unit increase in x. • $\beta_0$ is the y-intercept: the expected (mean) value of y when x = 0. (This may or may not be meaningful.)
If Amount goes up by 1 unit, Absorb% is expected to go up by 435 percentage points. If Amount = 0, the expected Absorb% is -1822, a negative percentage that is not physically meaningful.
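The interpretation of the slope and intercept can be checked by evaluating the fitted line Abs% = -1822 + 435(Amt) directly. This is only a sketch: the function name and the amounts 4 and 5 are hypothetical and are not claimed to lie inside the range of the original data.

```python
def predicted_abs_pct(amt):
    """Deterministic part of the model: Abs% = -1822 + 435 * Amt."""
    return -1822 + 435 * amt

# Change in predicted Abs% for a one-unit increase in Amt (hypothetical amounts).
slope_effect = predicted_abs_pct(5) - predicted_abs_pct(4)

# Predicted Abs% at Amt = 0, i.e. the y-intercept.
at_zero = predicted_abs_pct(0)

print("change per unit of Amt:", slope_effect)
print("prediction at Amt = 0:", at_zero)
```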
Absorbed Liquid Data • Do not use the fitted line to predict at x values outside the range of the data.
Errors of Prediction = Vertical Distance Between Points and Line
Method of Least Squares • Sum of prediction errors = 0. • Sum of the squared errors = Sum of Squares Error = SSE. • There are many lines for which the sum of errors = 0. • There is only one line for which SSE is minimized. • Least squares line = regression line = line for which SSE is minimized: $SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ or $SSE = \sum_{i=1}^{n} [y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)]^2$
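The claim that many lines have errors summing to zero, while only one minimizes SSE, can be illustrated numerically. The sketch below uses a small hypothetical data set (not the absorbed liquid data) and relies on the fact that any line through the point of means (x̄, ȳ) has errors that sum to zero.

```python
# Hypothetical data, chosen only for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
x_bar = sum(x) / len(x)
y_bar = sum(y) / len(y)

def errors(slope):
    """Prediction errors for the line with this slope passing through (x_bar, y_bar)."""
    intercept = y_bar - slope * x_bar
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

def sse(slope):
    """Sum of squared prediction errors for that line."""
    return sum(e * e for e in errors(slope))

candidate_slopes = [0.0, 0.3, 0.6, 0.9, 1.2]
error_sums = [sum(errors(m)) for m in candidate_slopes]
best_slope = min(candidate_slopes, key=sse)

for m, total in zip(candidate_slopes, error_sums):
    print(f"slope {m:.1f}: sum of errors = {total:+.2e}, SSE = {sse(m):.2f}")
print("slope with the smallest SSE among the candidates:", best_slope)
```

Every candidate line has a sum of errors that is zero up to floating-point rounding, yet their SSE values differ, so the sum of errors alone cannot single out a best line.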
Least Squares Estimates • Deviation of the ith point from its estimated value: $y_i - \hat{y}_i = y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)$ • The sum of the squared deviations for all n points: $SSE = \sum_{i=1}^{n} [y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)]^2$ • The values of $\hat{\beta}_0$ and $\hat{\beta}_1$ that minimize SSE are called the least squares estimates. Under the regression assumptions, they are also the minimum variance unbiased estimates.
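Formulas for Least Squares Estimates • $\hat{\beta}_1 = \frac{SS_{xy}}{SS_{xx}}$ and $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$, where $SS_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$ and $SS_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2$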
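The standard closed-form least squares estimates, slope = SSxy / SSxx and intercept = ȳ − slope · x̄, can be sketched in a few lines. The function name and the data below are hypothetical and serve only as an illustration.

```python
def least_squares(x, y):
    """Return (b0, b1), the least squares intercept and slope for paired data."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = ss_xy / ss_xx           # slope estimate
    b0 = y_bar - b1 * x_bar      # intercept estimate
    return b0, b1

# Hypothetical data, chosen only for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = least_squares(x, y)
print(f"fitted line: y-hat = {b0:.2f} + {b1:.2f} x")
```

For this data set SSxy = 6 and SSxx = 10, so the slope is 0.6 and the intercept is 4 − 0.6 · 3 = 2.2.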
Assumptions of a Regression Analysis • Assumptions involve the distribution of the errors. • Actual errors: $\varepsilon = y - E(y)$ • Estimated errors (residuals): $\hat{\varepsilon} = y - \hat{y}$ • Use plots of residuals to check the assumptions.
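Residuals are the observed values minus the fitted values, and they are the quantities that go into the diagnostic plots. A minimal sketch of computing them, using hypothetical data and a least squares fit obtained from the usual formulas:

```python
# Hypothetical data, chosen only for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least squares fit.
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar

# Residual = observed y minus fitted y-hat.
fitted = [b0 + b1 * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

# (x, residual) pairs are what a residuals-vs-x plot displays.
for xi, ei in zip(x, residuals):
    print(f"x = {xi}: residual = {ei:+.2f}")
```

The pairs printed here could be passed to any plotting tool to produce a residuals-versus-x plot or a QQ plot.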
There are Four Assumptions • (1) The mean of the errors is 0 at each value of x.
Plot of Residuals vs X Values
There are Four Assumptions • (2) The variance of the errors is constant across all values of x.
StatCrunch Plot of Residuals vs X Values
There are Four Assumptions • (3) The errors have a normal distribution at each x.
QQ Plot of Residuals
There are Four Assumptions • (4) The errors are independent. Checking this requires knowing how the data were gathered.
Estimate of Variance at each x, $\sigma^2$ • $s^2 = \frac{SSE}{n - 2}$ and $s = \sqrt{s^2}$ • s is the estimated standard error of the regression model.
MSE and Root MSE
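In simple linear regression the error variance is usually estimated by MSE = SSE / (n − 2), and Root MSE is its square root. The sketch below computes both for hypothetical data; it is an illustration of the arithmetic, not a reproduction of the software output shown on the slide.

```python
import math

# Hypothetical data, chosen only for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least squares fit.
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar

# Sum of squared residuals, then the variance estimate on n - 2 degrees of freedom.
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
mse = sse / (n - 2)          # estimate of sigma^2
root_mse = math.sqrt(mse)    # estimate of sigma

print(f"SSE = {sse:.3f}, MSE = {mse:.3f}, Root MSE = {root_mse:.3f}")
```

The divisor is n − 2 rather than n because two parameters, the intercept and the slope, were estimated from the data.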
If the variation predicted by the model is significantly larger than the error variation, we have a significant model.
Coefficient of Determination • The coefficient of determination, $R^2$, measures the contribution of x in predicting y. • Proportion of total sample variation explained by the linear relationship: $R^2 = \frac{SS_{yy} - SSE}{SS_{yy}} = 1 - \frac{SSE}{SS_{yy}}$
Coefficient of Determination • Recall: $SS_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2$ and $SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$. • $SS_{yy}$ is the total sample variation around $\bar{y}$. • SSE is the unexplained sample variability after fitting the regression line.
Coefficient of Determination • $R^2$ = proportion of total sample variability around $\bar{y}$ that is explained by the linear relationship between y and x. • $R^2$ varies from 0 to 1, with large values indicating a good model fit.
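The coefficient of determination follows directly from the two sums of squares, as R² = 1 − SSE / SSyy. A minimal sketch with hypothetical data:

```python
# Hypothetical data, chosen only for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least squares fit.
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar

ss_yy = sum((yi - y_bar) ** 2 for yi in y)                      # total variation around y-bar
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))   # unexplained variation
r_squared = 1 - sse / ss_yy

print(f"SSyy = {ss_yy:.1f}, SSE = {sse:.1f}, R^2 = {r_squared:.2f}")
```

Here SSyy = 6 and SSE = 2.4, so the line explains 60% of the sample variation in y.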
ANOVA Table for Simple Linear Regression
Amt and Absorb% • H0: Model is not significant • Ha: Model is significant
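The entries of a simple linear regression ANOVA table can be sketched as follows. The layout (Model, Error, Total rows) and the name SSR for the model sum of squares follow common textbook convention and are assumptions here, since the slide's table did not survive; the data are hypothetical.

```python
# Hypothetical data, chosen only for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least squares fit.
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar

ss_yy = sum((yi - y_bar) ** 2 for yi in y)                      # total
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))   # error
ssr = ss_yy - sse                                               # explained by the model

df_model, df_error, df_total = 1, n - 2, n - 1
msr = ssr / df_model
mse = sse / df_error
f_stat = msr / mse   # large F: model variation is large relative to error variation

rows = [("Model", df_model, ssr, msr),
        ("Error", df_error, sse, mse),
        ("Total", df_total, ss_yy, None)]
print(f"{'Source':<8}{'df':>4}{'SS':>9}{'MS':>9}")
for name, df, ss, ms in rows:
    ms_text = f"{ms:>9.3f}" if ms is not None else ""
    print(f"{name:<8}{df:>4}{ss:>9.3f}{ms_text}")
print(f"F = {f_stat:.3f}")
```

The F statistic would then be compared with an F distribution on 1 and n − 2 degrees of freedom to decide between H0 and Ha.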
Sampling Distribution of $\hat{\beta}_1$ • Standard error for $\hat{\beta}_1$: $s_{\hat{\beta}_1} = \frac{s}{\sqrt{SS_{xx}}}$
Test of Model Utility • H0: $\beta_1 = 0$ • Ha: $\beta_1 \neq 0$ • Test statistic: $t = \frac{\hat{\beta}_1}{s_{\hat{\beta}_1}}$, with $n - 2$ degrees of freedom • Confidence interval: $\hat{\beta}_1 \pm t_{\alpha/2} \, s_{\hat{\beta}_1}$
Amt and Absorb% • H0: $\beta_1 = 0$ • Ha: $\beta_1 \neq 0$
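A two-sided test of H0: β1 = 0 against Ha: β1 ≠ 0 uses the usual t statistic, the slope estimate divided by its standard error s / √SSxx, on n − 2 degrees of freedom. The sketch below works through it for hypothetical data. The critical value 3.182 is the standard t-table entry for an upper-tail area of 0.025 with 3 degrees of freedom; it is hard-coded here to keep the example free of external libraries.

```python
import math

# Hypothetical data, chosen only for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

ss_xx = sum((xi - x_bar) ** 2 for xi in x)
ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = ss_xy / ss_xx
b0 = y_bar - b1 * x_bar

sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s = math.sqrt(sse / (n - 2))     # Root MSE
se_b1 = s / math.sqrt(ss_xx)     # standard error of the slope estimate
t_stat = b1 / se_b1              # test statistic for H0: beta1 = 0

t_crit = 3.182                   # t-table value: alpha/2 = 0.025, df = n - 2 = 3
ci_low = b1 - t_crit * se_b1
ci_high = b1 + t_crit * se_b1
reject_h0 = abs(t_stat) > t_crit

print(f"t = {t_stat:.3f}, reject H0 at the 5% level: {reject_h0}")
print(f"95% confidence interval for the slope: ({ci_low:.3f}, {ci_high:.3f})")
```

With only five hypothetical points the interval contains 0, so H0 is not rejected, which agrees with the test decision.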
Coefficient of Correlation • Correlation measures the strength of the linear relationship between two quantitative variables. • To get a visual picture, use a scatter plot. • To assign a numeric value, use Pearson's coefficient of correlation, r. • r is a unitless scalar that ranges from -1 to +1.
Coefficient of Correlation (plots for r = -1 and r = +1)
Coefficient of Correlation (plots for r = -0.80, r = 0.95, and two cases with r = 0)
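Pearson's r can be computed with the standard formula r = SSxy / √(SSxx · SSyy). The sketch below uses hypothetical data to show the two extreme cases and an intermediate one; the function name is an illustration only.

```python
import math

def pearson_r(x, y):
    """Pearson's coefficient of correlation for paired data."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    ss_yy = sum((yi - y_bar) ** 2 for yi in y)
    return ss_xy / math.sqrt(ss_xx * ss_yy)

# Hypothetical data, chosen only for illustration.
x = [1, 2, 3, 4, 5]
r_increasing = pearson_r(x, [2 * xi for xi in x])    # points exactly on a rising line
r_decreasing = pearson_r(x, [-2 * xi for xi in x])   # points exactly on a falling line
r_scattered = pearson_r(x, [2, 4, 5, 4, 5])          # points scattered around a rising line

print(f"perfect increasing line: r = {r_increasing:.2f}")
print(f"perfect decreasing line: r = {r_decreasing:.2f}")
print(f"scattered data:          r = {r_scattered:.4f}")
```

In simple linear regression, r squared equals the coefficient of determination, and r has the same sign as the fitted slope.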