Review Session: Linear Regression
Correlation • Pearson's r • Measures the strength and direction of the linear relationship between the x and y variables • Ranges from -1 to +1
Correlation printout in Minitab • Top number is the correlation • Bottom number is the p-value
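The slides read these two numbers off a Minitab printout; outside Minitab, the same pair can be reproduced in Python. A minimal sketch using scipy, with invented x and y data (not the course's dataset):

```python
import numpy as np
from scipy.stats import pearsonr

# Invented stand-ins for two Minitab columns
x = np.array([340, 400, 460, 520, 580, 640, 700])
y = np.array([2.2, 2.7, 2.9, 3.1, 3.3, 3.5, 3.8])

r, p_value = pearsonr(x, y)           # the printout's top and bottom numbers
print(f"correlation: {r:.3f}")        # strength and direction, -1 to +1
print(f"p-value:     {p_value:.3f}")
```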
Simple Linear Regression y = b0 + b1x1 + e
Simple Linear Regression: Making a Point Prediction y = b0 + b1x1 + e • GPA = 1.47 + 0.00323(GMAT) • For a person with a GMAT score of 400, what is the expected first-year GPA? • GPA = 1.47 + 0.00323(400) = 1.47 + 1.292 = 2.76
Simple Linear Regression y = b0 + b1x1 + e • GPA = 1.47 + 0.00323(GMAT) • What is the 95% CI for the GPA of a person with a GMAT score of 400? • GPA = 2.76, SE = 0.26 • 2.76 +/- 2(0.26) gives a 95% CI of (2.24, 3.28)
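A sketch of the same kind of prediction in Python with statsmodels. The df data below are invented stand-ins for the slide's GPA/GMAT dataset, so the fitted numbers will differ from 1.47 and 0.00323:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Invented GPA/GMAT data (not the slide's dataset)
df = pd.DataFrame({"GMAT": [340, 400, 460, 520, 580, 640, 700],
                   "GPA":  [2.2, 2.7, 2.9, 3.1, 3.3, 3.5, 3.8]})
model = smf.ols("GPA ~ GMAT", data=df).fit()

# Point prediction and 95% intervals at GMAT = 400
pred = model.get_prediction(pd.DataFrame({"GMAT": [400]}))
print(pred.summary_frame(alpha=0.05))
```

The slide's +/- 2(SE) is a rule of thumb; summary_frame uses the exact t multiplier, and it separates mean_ci_* (the CI for the average GPA at that score) from the wider obs_ci_* prediction interval for a single person.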
Coefficient CIs and Testing y = b0 + b1x1 + e • GPA = 1.47 + 0.00323(GMAT) • Find the 95% CI for each coefficient: • b0: 1.47 +/- 2(0.22) = 1.47 +/- 0.44 = (1.03, 1.91) • b1: 0.0032 +/- 2(0.0004) = 0.0032 +/- 0.0008 = (0.0026, 0.0040)
Coefficient Testing y = b0 + b1x1 + e • GPA = 1.47 + 0.00323(GMAT) • The p-value for each coefficient is the result of a hypothesis test: H0: b = 0 versus H1: b ≠ 0 • If the p-value is <= 0.05, reject H0 and keep the coefficient in the model
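The coefficient CIs and p-values come straight off a fitted statsmodels results object; a sketch, again with invented data:

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({"GMAT": [340, 400, 460, 520, 580, 640, 700],
                   "GPA":  [2.2, 2.7, 2.9, 3.1, 3.3, 3.5, 3.8]})
model = smf.ols("GPA ~ GMAT", data=df).fit()

print(model.conf_int(alpha=0.05))    # exact 95% CIs for b0 and b1
print(model.pvalues)                 # tests of H0: b = 0 vs H1: b != 0
# The slide's quick rule of thumb: coefficient +/- 2 standard errors
print(model.params - 2 * model.bse)
print(model.params + 2 * model.bse)
```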
R2 • r2 and R2 • In simple regression, r2 is the square of Pearson's r • Little r2 is used for simple regression • Big R2 is used for multiple regression
Sample R2 values [figure: four example scatterplots with fitted lines, at R2 = 0.80, 0.60, 0.30, and 0.20]
Regression ANOVA • H0: b1 = b2 = … = bk = 0 • Ha: at least one b ≠ 0 • The F-statistic, with degrees of freedom df1 and df2, determines the p-value • If p <= 0.05, at least one of the b's is not zero • If p > 0.05, it is possible that all of the b's are zero
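R2 and the overall F test are also attributes of a fitted statsmodels results object; a sketch with the same invented data:

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({"GMAT": [340, 400, 460, 520, 580, 640, 700],
                   "GPA":  [2.2, 2.7, 2.9, 3.1, 3.3, 3.5, 3.8]})
model = smf.ols("GPA ~ GMAT", data=df).fit()

print(model.rsquared)                  # R-squared
print(model.fvalue, model.f_pvalue)    # overall F-statistic and its p-value
print(model.df_model, model.df_resid)  # df1 and df2 for the F test
```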
Diagnostics - Residuals • Residuals = errors • Residuals should be normally distributed • Residuals should have a constant variance • Heteroscedasticity: the residual variance changes with the fitted values or with an independent variable, showing up as a pattern in the residual plot • Autocorrelation: each residual is correlated with the residuals before it, which is common in time-ordered data • Heteroscedasticity and autocorrelation indicate problems with the model • Homoscedasticity: constant residual variance, with no pattern in the residual plot • Use the four-in-one plot for these diagnostics
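Minitab's four-in-one residual plot can be approximated with matplotlib and scipy; a sketch, again on invented data:

```python
import matplotlib.pyplot as plt
import pandas as pd
import scipy.stats as stats
import statsmodels.formula.api as smf

df = pd.DataFrame({"GMAT": [340, 400, 460, 520, 580, 640, 700],
                   "GPA":  [2.2, 2.7, 2.9, 3.1, 3.3, 3.5, 3.8]})
model = smf.ols("GPA ~ GMAT", data=df).fit()

fig, ax = plt.subplots(2, 2, figsize=(8, 6))
stats.probplot(model.resid, plot=ax[0, 0])         # normal probability plot
ax[0, 1].scatter(model.fittedvalues, model.resid)  # vs fits: fans/curves = trouble
ax[0, 1].axhline(0, color="gray")
ax[1, 0].hist(model.resid)                         # histogram: roughly normal?
ax[1, 1].plot(model.resid.values, marker="o")      # vs order: look for drift
plt.tight_layout()
plt.show()
```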
Adding a Power Transformation • Each "bump" or "U" shape in a scatter plot indicates that an additional power may be involved. • 0 bumps: x • 1 bump: x2 • 2 bumps: x3 • The standard equation is y = b0 + b1x + b2x2 • Don't forget: check that b1 and b2 are each statistically significant, and that the model as a whole is also statistically significant.
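A sketch of fitting the quadratic model in statsmodels; the curve data are invented to show a single bump, and I(x**2) is the formula syntax for adding the squared term:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Invented data with a single "bump": y rises and then falls off
curve = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6, 7],
                      "y": [2.1, 3.8, 5.0, 5.7, 6.0, 5.9, 5.5]})

# I(x**2) adds the squared term to the model
quad = smf.ols("y ~ x + I(x**2)", data=curve).fit()
print(quad.pvalues)    # are b1 and b2 each statistically significant?
print(quad.f_pvalue)   # is the model as a whole significant?
```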
Categorical Variables • Occasionally it is necessary to add a categorical variable to a regression model. • Suppose that we have a car dealership, and we want to model the sale price based on the time on the lot and the salesperson (Tom, Dick, or Harry). • Time on the lot is a numeric variable. • Salesperson is a categorical variable.
Categorical Variables • Categorical variables are modeled in regression using 0/1 dummy (indicator) variables Example: y = b0 + b_time x_time + b_Tom x_Tom + b_Dick x_Dick
Categorical Variables Harry is the baseline category for the model. Tom's and Dick's performance will be gauged relative to Harry, but not to each other. Example: y = b0 + b_time x_time + b_Tom x_Tom + b_Dick x_Dick
Categorical Variables y = b0 + b_time x_time + b_Tom x_Tom + b_Dick x_Dick • Interpretation • Tom's average sale price is b_Tom more than Harry's, holding time on the lot constant • Dick's average sale price is b_Dick more than Harry's, holding time on the lot constant
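A sketch of the dealership model in statsmodels with invented data; Treatment(reference='Harry') forces Harry to be the baseline category, matching the slides:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Invented dealership data: sale price (in $1000s), days on lot, salesperson
cars = pd.DataFrame({
    "price":  [21.5, 19.8, 23.1, 20.4, 22.0, 18.9, 24.2, 19.5],
    "time":   [12, 30, 5, 25, 10, 40, 4, 35],
    "person": ["Tom", "Dick", "Harry", "Tom", "Harry", "Dick", "Tom", "Harry"],
})

# C(...) expands salesperson into 0/1 dummies; the reference level
# is the baseline category that the other coefficients are offsets from
fit = smf.ols("price ~ time + C(person, Treatment(reference='Harry'))",
              data=cars).fit()
print(fit.params)   # the Tom and Dick coefficients are offsets from Harry
```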
Multicollinearity • Multicollinearity: predictor variables are correlated with each other. • Multicollinearity causes instability in the estimation of the b's • P-values will be larger • Confidence in the b's decreases or disappears (magnitude and sign may differ from the expected values) • A small change in the data results in large variations in the coefficients • Read 11.11
VIF: Variance Inflation Factor • Measures the degree to which multicollinearity decreases the confidence in the estimate of a coefficient. • The larger the VIF, the bigger the multicollinearity problem. • If VIF > 10, there may be a problem • If VIF >= 15, there may be a serious problem
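statsmodels ships a VIF function; a sketch with invented predictors in which x1 and x2 are nearly duplicates of each other:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Invented predictors: x1 and x2 are nearly collinear
X = pd.DataFrame({"x1": [1, 2, 3, 4, 5, 6],
                  "x2": [1.1, 2.0, 3.2, 3.9, 5.1, 6.0],
                  "x3": [7, 3, 9, 1, 5, 8]})
X = sm.add_constant(X)   # compute VIFs with the constant included

for i, name in enumerate(X.columns):
    print(name, variance_inflation_factor(X.values, i))
# ignore the const row; x1 and x2 will show very large VIFs (well past 15)
```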
Model Selection • Start with everything. • Delete variables with high VIFs one at a time. • Then delete variables one at a time, removing the one with the largest p-value first. • Stop when all p-values are at or below 0.05.
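A rough sketch of the p-value elimination steps; backward_eliminate is a helper name invented here, and the VIF screening from the previous sketch is assumed to have been done first:

```python
import statsmodels.api as sm

def backward_eliminate(X, y, threshold=0.05):
    """Drop the predictor with the largest p-value, one at a time,
    until every remaining p-value is at or below the threshold."""
    X = sm.add_constant(X)
    while True:
        fit = sm.OLS(y, X).fit()
        pvals = fit.pvalues.drop("const")   # never drop the intercept
        worst = pvals.idxmax()
        if pvals[worst] <= threshold:
            return fit                      # all p-values significant: stop
        X = X.drop(columns=worst)
```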
Demand-Price Curve • The demand-price function is nonlinear: D = k * P^b1 • A log transformation makes it linear: ln(D) = ln(k) + b1 ln(P) • Run the regression on the transformed variables • Plug the fitted coefficients into the back-transformed equation: D = e^(b0) * P^b1, where b0 is the regression intercept and e^(b0) recovers k • Make your projections with this last equation.
Demand-Price Curve • Create a variable for the natural log of demand and the natural log of each independent variable • In Excel: =LN(demand), =LN(price), =LN(income), etc. • Run the regression on the transformed variables • Place the coefficients in the equation: d = e^(constant) * p^b1 * i^b2 • Simplify to d = k * p^b1 * i^b2 (note that e^(constant) = k) • If income is not included, the equation is just d = k * p^b1
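The same procedure in Python rather than Excel, as a sketch with invented price/demand numbers:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Invented price/demand observations
sales = pd.DataFrame({"price":  [2.0, 2.5, 3.0, 3.5, 4.0],
                      "demand": [980, 770, 640, 545, 480]})
sales["ln_d"] = np.log(sales["demand"])
sales["ln_p"] = np.log(sales["price"])

fit = smf.ols("ln_d ~ ln_p", data=sales).fit()
b0, b1 = fit.params                   # intercept = ln(k), slope = b1
k = np.exp(b0)
print(f"d = {k:.1f} * p^{b1:.3f}")    # the back-transformed demand curve
```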