Simple Linear Regression

Relationship Between Two Quantitative Variables
• If we can model the relationship between two quantitative variables, we can use one variable, X, to predict another variable, Y.
• Use height to predict weight.
• Use percentage of hardwood in pulp to predict the tensile strength of paper.
• Use square feet of warehouse space to predict monthly rental cost.
L. Wang, Department of Statistics University of South Carolina; Slide 2

Simple Linear Regression
• Simple: only one predictor variable
• Linear: straight-line relationship
• Regression: fit data to a (straight-line) model
y: response or dependent variable
x: predictor, regressor, or independent variable

Use Scatter Plot to See Relationship

Absorbed Liquid Data
• In a chemical process, batches of liquid are passed through a bed containing an ingredient that is absorbed by the liquid.
• We will attempt to relate the absorbed percentage of the ingredient (y) to the amount of liquid in the batch (x).

Absorbed Liquid Data

Abs% = -1822 + 435(Amt)
The regression line (model) is deterministic.

We are going to use a probabilistic model, which accounts for the variation around the line.

Probabilistic Model
• Probabilistic model: deterministic model plus an error component for unexplained variation.

Regression Equation
y = deterministic model + random error
y = β0 + β1x + ε
β0 = y-intercept
β1 = slope
ε = random error
The regression line is an estimate of the mean value of y at a given value of x.
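
To make "deterministic model + random error" concrete, here is a minimal simulation sketch. The parameter values and the choice of normally distributed errors are assumptions made only for illustration; they are not taken from the absorbed liquid data.

```python
import random

def simulate_y(x, beta0, beta1, sigma, rng):
    """Draw one response: the point on the line plus a normal random error."""
    return beta0 + beta1 * x + rng.gauss(0.0, sigma)

# Hypothetical parameter values, chosen only for illustration.
BETA0, BETA1, SIGMA = -0.1, 0.7, 0.6
rng = random.Random(0)  # fixed seed so the run is repeatable

x = 3.0
draws = [simulate_y(x, BETA0, BETA1, SIGMA, rng) for _ in range(20000)]
sample_mean = sum(draws) / len(draws)
line_value = BETA0 + BETA1 * x  # E(y) at this x

print(f"mean of simulated y at x = {x}: {sample_mean:.3f}")
print(f"value on the line:             {line_value:.3f}")
```

Individual draws scatter around the line, but their average at a fixed x settles near β0 + β1x, which is the sense in which the line describes the mean of y.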

Interpreting Parameters
• Once we determine that a straight-line model is reasonable, we want to establish the best line by estimating β0 and β1.
µ = E(y) = β0 + β1x
• β1 is the slope. It is the amount by which the mean of y changes with a one-unit increase in x.
• β0 is the y-intercept. It is the expected (mean) value of y when x = 0. (This may or may not be meaningful.)

If Amount goes up by 1 unit, then Absorb% is expected to go up by 435 percentage points.
If Amount = 0, the expected Absorb% = -1822%. (A negative percentage is not meaningful here.)

Absorbed Liquid Data
Do not make predictions for x values outside the range of the data.

Errors of Prediction = Vertical Distance Between Points and Line

Method of Least Squares
• Sum of prediction errors = 0.
• Sum of the squared errors = Sum of Squares Error = SSE.
• There are many lines for which the sum of errors = 0.
• There is only one line for which SSE is minimized.
• Least squares line = regression line = line for which SSE is minimized.
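
The claim that many lines have errors summing to zero, while only one minimizes SSE, can be checked numerically. The sketch below uses made-up data (not the absorbed liquid data) and relies on the fact that any line passing through the point of means (x-bar, y-bar) has prediction errors that sum to zero.

```python
def prediction_errors(x, y, b0, b1):
    """Vertical distances between the observed points and the line b0 + b1*x."""
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Made-up illustrative data.
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]
x_bar = sum(x) / len(x)
y_bar = sum(y) / len(y)

# Try several slopes; each intercept forces the line through (x_bar, y_bar).
# For this data, 0.7 is the least squares slope.
summary = {}
for slope in (0.0, 0.5, 0.7, 1.0):
    intercept = y_bar - slope * x_bar
    e = prediction_errors(x, y, intercept, slope)
    summary[slope] = (sum(e), sum(ei ** 2 for ei in e))
    print(f"slope {slope:.1f}: sum of errors = {summary[slope][0]:+.3f}, "
          f"SSE = {summary[slope][1]:.3f}")

best_slope = min(summary, key=lambda s: summary[s][1])
```

All four candidate lines have errors that sum to zero, so that criterion alone cannot pick a line; SSE differs from line to line and is smallest for the least squares slope.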

Least Squares Estimates
• Deviation of the ith point from its estimated value: $y_i - \hat{y}_i = y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)$
• Sum of the squared deviations for all n points: $SSE = \sum_{i=1}^{n} [y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)]^2$
• The values of $\hat{\beta}_0$ and $\hat{\beta}_1$ that minimize SSE are called the least squares estimates. When the regression assumptions hold, they are also the minimum variance unbiased estimates.

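Formulas for Least Squares Estimates
$\hat{\beta}_1 = \dfrac{SS_{xy}}{SS_{xx}}$, $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$
where
$SS_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$, $SS_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2$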
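
A minimal sketch of the least squares computation, assuming the usual closed-form estimates (slope = SSxy / SSxx, intercept = y-bar minus slope times x-bar). The data and the function name `least_squares` are made up for illustration.

```python
def least_squares(x, y):
    """Return (b0, b1), the least squares intercept and slope."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = ss_xy / ss_xx       # slope
    b0 = y_bar - b1 * x_bar  # intercept
    return b0, b1

# Made-up illustrative data (not the absorbed liquid data).
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]

b0, b1 = least_squares(x, y)
print(f"fitted line: y-hat = {b0:.3f} + {b1:.3f} x")
```

For these five points SSxy = 7 and SSxx = 10, so the fitted slope is 0.7 and the intercept is 2 - 0.7(3) = -0.1.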

Assumptions of a Regression Analysis
• Assumptions involve the distribution of the errors.
• Actual errors: $\varepsilon_i = y_i - (\beta_0 + \beta_1 x_i)$
• Estimated errors (residuals): $\hat{\varepsilon}_i = y_i - \hat{y}_i$
• Use plots of residuals to check the assumptions.

There are Four Assumptions
(1) The mean of the errors is 0 at each value of x.
[Figure: two plots against X values, labeled YES and NO]

Plot of Residuals vs X Values

There are Four Assumptions
(2) The variance of the errors is constant across all values of x.
[Figure: two plots against X values, labeled YES and NO]

StatCrunch Plot of Residuals vs X Values

There are Four Assumptions
(3) The errors have a normal distribution at each x.
[Figure: two panels, labeled NO and YES]

QQ Plot of Residuals

There are Four Assumptions
(4) The errors are independent; we must know how the data were gathered.
[Figure: two panels, labeled NO and YES]
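
The residual checks start from the residuals themselves. The sketch below computes them for made-up data and prepares the two sets of points one would plot: residuals against x, and sorted residuals against standard normal quantiles for a QQ plot. The plotting positions (i - 0.5)/n are one common convention and an assumption here; the slides do not say which convention StatCrunch uses. No plotting library is required, since only the coordinates are computed.

```python
from statistics import NormalDist

def fit(x, y):
    """Least squares intercept and slope."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = ss_xy / ss_xx
    return y_bar - b1 * x_bar, b1

def residuals(x, y, b0, b1):
    """Estimated errors: observed y minus fitted y."""
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

def qq_points(res):
    """Pair standard normal quantiles at (i - 0.5)/n with the sorted residuals."""
    n = len(res)
    quantiles = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(quantiles, sorted(res)))

# Made-up illustrative data (not the absorbed liquid data).
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]

b0, b1 = fit(x, y)
res = residuals(x, y, b0, b1)
residual_vs_x = list(zip(x, res))  # points for the residuals-vs-x plot
points = qq_points(res)            # points for the QQ plot

for xi, ei in residual_vs_x:
    print(f"x = {xi}: residual = {ei:+.3f}")
```

In the residuals-vs-x plot, a band centered on 0 with roughly constant spread is consistent with assumptions (1) and (2); QQ points falling close to a straight line are consistent with assumption (3). Independence cannot be read from these points alone.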

Estimate of Variance at each x, $\sigma^2$
$s^2 = \dfrac{SSE}{n - 2} = MSE$, $s = \sqrt{s^2}$
s is the estimated standard error of the regression model.

MSE and Root MSE
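
A short sketch of how MSE and Root MSE are obtained, assuming the usual estimator MSE = SSE / (n - 2) for a one-predictor model. The data and function name are made up for illustration.

```python
import math

def mse_and_root_mse(x, y):
    """Return (MSE, Root MSE) for the least squares line through (x, y)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = ss_xy / ss_xx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    mse = sse / (n - 2)  # two parameters estimated, so n - 2 degrees of freedom
    return mse, math.sqrt(mse)

# Made-up illustrative data (not the absorbed liquid data).
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]

mse, root_mse = mse_and_root_mse(x, y)
print(f"MSE = {mse:.4f}, Root MSE = {root_mse:.4f}")
```

Root MSE is in the same units as y, which makes it easier to interpret than MSE as a typical size of the prediction errors.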

If the variation predicted by the model is significantly larger than the error variation, we have a significant model.

Coefficient of Determination
• The coefficient of determination, $R^2$, measures the contribution of x in predicting y.
• Proportion of total sample variation explained by the linear relationship: $R^2 = \dfrac{SS_{yy} - SSE}{SS_{yy}} = 1 - \dfrac{SSE}{SS_{yy}}$

Coefficient of Determination
• Recall: $R^2 = 1 - \dfrac{SSE}{SS_{yy}}$
• $SS_{yy}$ is the total sample variation around $\bar{y}$.
• SSE is the unexplained sample variability after fitting the regression line.

Coefficient of Determination
$R^2$ = proportion of total sample variability around $\bar{y}$ that is explained by the linear relationship between y and x.
$R^2$ varies from 0 to 1, with large values indicating a good model fit.
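
As a worked illustration, the sketch below computes the coefficient of determination as 1 - SSE / SSyy, where SSyy is the sum of squared deviations of y from its mean. The data and function name are made up for illustration.

```python
def r_squared(x, y):
    """Proportion of the variation in y explained by the least squares line."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    ss_yy = sum((yi - y_bar) ** 2 for yi in y)  # total variation around y-bar
    b1 = ss_xy / ss_xx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))  # unexplained
    return 1 - sse / ss_yy

# Made-up illustrative data (not the absorbed liquid data).
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]

r2 = r_squared(x, y)
print(f"R^2 = {r2:.4f}")
```

Here SSyy = 6 and SSE = 1.1, so roughly 82% of the sample variation in y is explained by the straight-line relationship with x.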

ANOVA Table for Simple Linear Regression

Amt and Absorb%
H0: Model is not significant
Ha: Model is significant
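
The entries of the ANOVA table can be sketched as follows, assuming the standard decomposition for one predictor: model sum of squares = SSyy - SSE with 1 degree of freedom, error sum of squares = SSE with n - 2 degrees of freedom, and F = (model mean square) / MSE. The data and the dictionary keys are made up for illustration; the p-value, which would come from the F distribution with (1, n - 2) degrees of freedom, is not computed here.

```python
def anova_table(x, y):
    """Sums of squares, degrees of freedom, mean squares, and F for one predictor."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    ss_yy = sum((yi - y_bar) ** 2 for yi in y)
    b1 = ss_xy / ss_xx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    ss_model = ss_yy - sse
    df_model, df_error = 1, n - 2
    ms_model = ss_model / df_model
    mse = sse / df_error
    return {
        "model": (df_model, ss_model, ms_model),
        "error": (df_error, sse, mse),
        "total": (n - 1, ss_yy),
        "F": ms_model / mse,
    }

# Made-up illustrative data (not the absorbed liquid data).
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]

table = anova_table(x, y)
print(f"Model: df = {table['model'][0]}, SS = {table['model'][1]:.3f}")
print(f"Error: df = {table['error'][0]}, SS = {table['error'][1]:.3f}")
print(f"F = {table['F']:.3f}")
```

A large F means the variation explained by the model is large relative to the error variation, which is evidence against H0.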

Sampling Distribution of $\hat{\beta}_1$
Standard error for $\hat{\beta}_1$: $s_{\hat{\beta}_1} = \dfrac{s}{\sqrt{SS_{xx}}}$

Test of Model Utility
H0: β1 = 0
Ha: β1 ≠ 0
Test statistic: $t = \dfrac{\hat{\beta}_1}{s_{\hat{\beta}_1}}$, with n - 2 degrees of freedom
Confidence interval: $\hat{\beta}_1 \pm t_{\alpha/2} \, s_{\hat{\beta}_1}$

Amt and Absorb%
H0: β1 = 0
Ha: β1 ≠ 0
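
A minimal sketch of the slope test and confidence interval, assuming the standard forms: standard error of the slope = s / sqrt(SSxx), t = slope / standard error with n - 2 degrees of freedom, and interval = slope plus or minus t-critical times standard error. The data are made up, and the critical value is passed in by hand from a t table rather than computed, so that the example needs only the standard library.

```python
import math

def slope_inference(x, y, t_crit):
    """Return (t statistic, confidence interval) for the slope."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = ss_xy / ss_xx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))  # Root MSE
    se_b1 = s / math.sqrt(ss_xx)  # standard error of the slope
    t_stat = b1 / se_b1
    margin = t_crit * se_b1
    return t_stat, (b1 - margin, b1 + margin)

# Made-up illustrative data (not the absorbed liquid data).
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]

# t table value for a 95% interval with n - 2 = 3 degrees of freedom.
T_CRIT_95_DF3 = 3.182

t_stat, (ci_low, ci_high) = slope_inference(x, y, T_CRIT_95_DF3)
print(f"t = {t_stat:.3f}")
print(f"95% CI for slope: ({ci_low:.3f}, {ci_high:.3f})")
```

If |t| exceeds the critical value, or equivalently the interval excludes 0, H0: β1 = 0 is rejected at the 5% level. In simple linear regression the square of this t statistic equals the F statistic from the ANOVA table.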

Coefficient of Correlation
• Correlation measures the linear relationship between two quantitative variables.
• To get a visual picture, use a scatter plot.
• To assign a numeric value, use Pearson’s coefficient of correlation, r. r is a scalar and varies from -1 to +1.

Coefficient of Correlation
[Figure: scatter plots with r = -1 and r = +1]

Coefficient of Correlation
[Figure: scatter plots with r = -.80, r = .95, r = 0, and r = 0]
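
Pearson's r can be sketched with the same sums of squares used for the regression line, assuming the standard formula r = SSxy / sqrt(SSxx * SSyy), which is not shown on the slides. The data and function name are made up for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson's coefficient of correlation between x and y."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    ss_yy = sum((yi - y_bar) ** 2 for yi in y)
    return ss_xy / math.sqrt(ss_xx * ss_yy)

# Made-up illustrative data (not the absorbed liquid data).
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]

r = pearson_r(x, y)
r_perfect = pearson_r(x, [2 * xi + 1 for xi in x])  # points exactly on a rising line
r_flipped = pearson_r(x, [-yi for yi in y])         # same scatter, direction reversed

print(f"r = {r:.4f}, r^2 = {r ** 2:.4f}")
print(f"perfect line: r = {r_perfect:.4f}; flipped y: r = {r_flipped:.4f}")
```

The sign of r matches the sign of the fitted slope, and in simple linear regression r squared equals the coefficient of determination.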
