Lecture 9: Diagnostics & Review

Lecture 9: Diagnostics & Review, February 10, 2014

Question: A least squares regression line is determined from a sample of values for variables x and y, where x = size of a listed home (in sq feet) and y = selling price of the home (in $). Which of the following is true about the model b0 + b1x? • If there is positive correlation r between x and y, then b1 must be positive. • The units of the intercept and slope will be the same as the response variable, y. • If r² = 0.85, then it is appropriate to conclude that a change in x will cause a change in y. • None of the above, more than one of the above, or not enough information to tell.

Answer: • If there is positive correlation r between x and y, then b1 must be positive. The slope is b1 = r * sy / sx, and the standard deviations sy and sx are both positive, so b1 has the same sign as r: if r > 0, then b1 > 0.
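The identity b1 = r * sy / sx can be checked numerically. The sketch below uses made-up home sizes and prices (not data from the course) and hypothetical helper names; it computes the least squares slope directly and compares it with r * sy / sx.

```python
from statistics import mean, stdev

def least_squares(x, y):
    """Return (b0, b1) for the least squares line y = b0 + b1 * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

def correlation(x, y):
    """Sample correlation r between x and y."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy / ((len(x) - 1) * stdev(x) * stdev(y))

# Made-up illustrative values: size in sq feet, price in $.
sqft = [1000, 1500, 2000, 2500, 3000]
price = [150000, 200000, 240000, 310000, 350000]

b0, b1 = least_squares(sqft, price)
r = correlation(sqft, price)
slope_from_r = r * stdev(price) / stdev(sqft)

print(b1, slope_from_r)  # the two slopes agree up to rounding
```

Because sy and sx are standard deviations, they cannot be negative, so the sign of the slope always follows the sign of r.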

Administrative • Problem set 4 due (9am) • How was it? • Next week: Multiple Regression • Exam Wednesday • Sample question • Taken from Exam 1 - #37 last year

Last time • What did we talk about? • Outliers • Sensitivity analysis • Heteroscedasticity

Common problems and fixes: Say we’re estimating the price of a lease from the size of the house: Price = β0 + β1 * SqFt + ε. Interpretation of the estimates? • β0 would be the fixed cost, and • β1 would be the marginal cost (per square foot)

Common Problems: Heteroscedasticity Heteroscedasticity: What does that mean for your analysis? • Point estimates for the β’s? • Still OK. No bias. • Prediction and confidence intervals? • Not reliable; too narrow or too wide. • Hypothesis tests regarding β0 and β1 are not reliable.

Common Problems: Heteroscedasticity Fixing the problem: • Revise the model: how will depend on the substance. • Try revising the model to estimate Price/SqFt by dividing the original equation by SqFt: Price/SqFt = β1 + β0 * (1/SqFt) + ε/SqFt • Notice the change in the intercept and slope: the intercept is now the marginal cost (β1) and the slope on 1/SqFt is the fixed cost (β0). • Don’t be locked into thinking the intercept is the fixed cost. • How to interpret them depends on the model you fit. • Think about the data!

Common Problems: Heteroscedasticity Fixing the problem: Price/SqFt = M + F * (1/SqFt) + ε, where M is the marginal cost per square foot and F is the fixed cost • Revise by thinking about the substance. • Here that meant predicting price per sq ft directly. • Don’t revise by doing weird things. • Use theory! • After revising, check whether the residuals have similar variances. • Sometimes they won’t. • In this case they do.
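A minimal sketch of the revision, on simulated data rather than the data from the slides. The fixed cost, marginal cost, and noise level are assumed values chosen for illustration, and the helper names are hypothetical. The errors are simulated with a spread proportional to SqFt, so the original model is heteroscedastic while the per-square-foot model is not.

```python
import random
from statistics import mean, stdev

def fit_line(x, y):
    """Return (intercept, slope) of the least squares line."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def fit_revised(sqft, price):
    """Fit Price/SqFt = M + F * (1/SqFt); return (M, F)."""
    inv_size = [1.0 / s for s in sqft]
    price_per_sqft = [p / s for p, s in zip(price, sqft)]
    return fit_line(inv_size, price_per_sqft)

def residual_spread(x, y, intercept, slope, sizes, cutoff):
    """SD of residuals for homes below vs. at or above a size cutoff."""
    res = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    small = [r for r, s in zip(res, sizes) if s < cutoff]
    large = [r for r, s in zip(res, sizes) if s >= cutoff]
    return stdev(small), stdev(large)

# Assumed values for the simulation (not estimates from the lecture).
random.seed(1)
F_TRUE, M_TRUE = 50_000.0, 100.0
sqft = [random.uniform(500, 4000) for _ in range(200)]
price = [F_TRUE + M_TRUE * s + random.gauss(0, 10 * s) for s in sqft]

# Original model: Price = b0 + b1 * SqFt
b0, b1 = fit_line(sqft, price)
print("original:", residual_spread(sqft, price, b0, b1, sqft, 2250))

# Revised model: Price/SqFt = M + F * (1/SqFt)
m_hat, f_hat = fit_revised(sqft, price)
inv_size = [1.0 / s for s in sqft]
price_per_sqft = [p / s for p, s in zip(price, sqft)]
print("revised: ", residual_spread(inv_size, price_per_sqft, m_hat, f_hat, sqft, 2250))
```

Comparing the two printed pairs shows whether the residual spread for small and large homes is similar in each model, which is the check the slide asks for after revising.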

Common Problems: Heteroscedasticity Comparing the revised and original model: • The revised model may have a different (and smaller) R². • So what? R² is great, but it’s only one notion of fit. • In the example, the revised model provides narrower (hence better) confidence intervals for fixed and variable costs. [Figure: confidence intervals for fixed and variable costs, Original Model vs. Revised Model]

Common Problems: Heteroscedasticity Comparing the revised and original model: • It also provides a more sensible prediction interval. • The data originally indicated that large homes varied in price more.

Common Problems: Heteroscedasticity How do you know how to remodel the problem? • Practice • Creativity; try different things. • There is no magic bullet; sometimes you can’t.

Common Problems: Correlated Errors Problem: Dependence between residuals (autocorrelation) • The amount of error (detected by the size of the residual) you make at observation x + 1 is related to the amount of error you make at observation x. • Why is this a problem? • The simple regression model (SRM) assumes that the errors, ε, are independent. • Common problem for time series data, but not just a time series problem. • Recall the U-shaped pattern in one of the residual plots before.

Common Problems: Correlated Errors Detecting the problem: • Easier with time series data: • Plot the residuals versus time and look for a pattern (is t+1 related to t?). Not guaranteed to find it, but often helpful. • Use the Durbin-Watson statistic to test for correlation between adjacent residuals (aka serial correlation or autocorrelation). • With time series data, adjacency is temporal. • In non-time-series data, we’re still talking about errors next to one another being related. • For things like spatial autocorrelation, there are more advanced tools, such as mapping the residuals and formal tests.

Durbin-Watson Statistic • Tests whether the correlation between adjacent residuals is 0 • Null hypothesis: H0: ρε = 0 • It’s calculated as: D = Σ_{t=2}^{n} (e_t − e_{t−1})² / Σ_{t=1}^{n} e_t², where e_t is the residual for observation t • From the Durbin-Watson statistic, D, and the sample size you can calculate the p-value for the hypothesis test • You’ll see this more in multiple regression and forecasting
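A short sketch of the Durbin-Watson calculation, using the standard definition (sum of squared differences between adjacent residuals, divided by the sum of squared residuals). The function name and the two residual sequences are made up for illustration. D lies between 0 and 4; values near 2 are consistent with no correlation between adjacent residuals, values well below 2 suggest positive autocorrelation, and values well above 2 suggest negative autocorrelation.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic D for residuals listed in observation order."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

# Residuals that drift steadily: adjacent errors are alike, so D is small.
trending = [-2.0, -1.0, 0.0, 1.0, 2.0]
print(durbin_watson(trending))     # 0.4

# Residuals that flip sign every step: adjacent errors are opposite, so D is large.
alternating = [1.0, -1.0, 1.0, -1.0]
print(durbin_watson(alternating))  # 3.0
```

The p-value still has to come from software or tables; this only computes D itself.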

Common Problems: Correlated Errors Consequences of Dependence: • With autocorrelation in the errors, the estimated standard errors are too small. • The estimated slope and intercept are less precise than the output indicates.

Common Problems: Correlated Errors How do you fix it? • Try to model it directly or transform the data. • Example: number of mobile phone users: • Growth isn’t linear over time; try different transformations. [Figure: Original data vs. Transformed data]
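One candidate transformation, sketched under the assumption that growth is roughly exponential (the slides do not say which transformation was used): regress ln(y) on time, then convert predictions back with the exponential. The series below is made up, not the mobile phone data, and fit_line is a hypothetical helper.

```python
from math import exp, log
from statistics import mean

def fit_line(x, y):
    """Return (intercept, slope) of the least squares line."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Made-up series that grows 30% per period on the log scale.
years = [0, 1, 2, 3, 4, 5]
users = [10.0 * exp(0.3 * t) for t in years]

# The pattern is curved in the original units but linear after taking logs.
b0, b1 = fit_line(years, [log(u) for u in users])

# Predictions come out in ln(users); exponentiate to return to users.
predicted_users = exp(b0 + b1 * 6)
print(b1, predicted_users)
```

If the residuals from the transformed fit still show a pattern over time, this transformation has not removed the dependence and another approach is needed.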

Common Problems: Correlated Errors Does this fix the problem? • Linear pattern looks better • You still need to check the other SRM conditions!! • Omitted variables? • Analysis of residuals. Might still be a problem. [Figure: Original data vs. Transformed data]

Exam Review • Download diamonds.xlsx • Regress price on weight • Are the residuals normally distributed? • Yes • No • Maybe? • I have no idea how to verify that

Exam Review • Using your regression model from the last slide, predict the price of a diamond that weighs 0.44 carats • What is the approximate 95% confidence interval? • [$877.75, $1558.61] • [$2324.80, $3014.69] • [$-97.97, $184.95] • [$2330.41, $3009.09] • I have no idea

Exam Review • Using your regression model from the last slide, predict the price of a diamond that weighs 0.28 carats • What is the prediction interval? • [$877.75, $1558.61] • [$452.57, $1129.46] • [$764.38, $1058.25] • [$345.61, $678.34] • I have no idea
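For reference, a sketch of how approximate 95% intervals at a new x can be computed by hand in simple regression. It uses a multiplier of 2 as a rough stand-in for the exact t-quantile (an assumption), made-up data rather than diamonds.xlsx, and a hypothetical function name. The confidence interval is for the mean response at x_new; the prediction interval is for a single new observation and is always wider.

```python
from math import sqrt
from statistics import mean

def approx_intervals(x, y, x_new, multiplier=2.0):
    """Return (y_hat, confidence_interval, prediction_interval) at x_new."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar

    # Residual standard deviation s_e, with n - 2 degrees of freedom.
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s_e = sqrt(sse / (n - 2))

    y_hat = b0 + b1 * x_new
    extra = 1 / n + (x_new - x_bar) ** 2 / sxx
    half_ci = multiplier * s_e * sqrt(extra)      # mean response
    half_pi = multiplier * s_e * sqrt(1 + extra)  # one new observation
    return (
        y_hat,
        (y_hat - half_ci, y_hat + half_ci),
        (y_hat - half_pi, y_hat + half_pi),
    )

# Made-up data for illustration only.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
y_hat, ci, pi = approx_intervals(x, y, 3)
print(y_hat, ci, pi)
```

Both intervals widen as x_new moves away from the mean of x, which is one reason to be cautious when predicting outside the range of the data.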

Exam Review • Question about transformations: • Again, no magic bullet. Try different ones. • How do you decide if you transform the X or Y? • Often depends on the substance.

Exam Review • Transformations • A common mistake is to forget to convert back to the appropriate units. • Say your data and your question are in km/l, and you transform the response to liters per 100 km. Don’t forget to transform predictions back to km/l. Similarly for ln(x): undo it with the exponential [in Excel, use =EXP()].
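A small sketch of converting back, with hypothetical function names and made-up numbers. Liters per 100 km is 100 divided by km per liter, and the same formula converts in the other direction; a natural log is undone with the exponential.

```python
from math import exp, log

def km_per_l_to_l_per_100km(km_per_l):
    """Convert fuel economy (km/l) to fuel consumption (l/100 km)."""
    return 100.0 / km_per_l

def l_per_100km_to_km_per_l(l_per_100km):
    """Convert fuel consumption (l/100 km) back to fuel economy (km/l)."""
    return 100.0 / l_per_100km

# A model fit on l/100 km returns a prediction in l/100 km.
predicted_l_per_100km = 5.0  # made-up prediction
print(l_per_100km_to_km_per_l(predicted_l_per_100km))  # 20.0 km/l

# A model fit on ln(price) returns a prediction in ln($).
predicted_log_price = log(1500.0)  # made-up prediction
print(exp(predicted_log_price))    # back in $
```

Interval endpoints need the same back-conversion; note that the reciprocal reverses the order of the endpoints.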

Exam Review • Conditions for the SRM • Know them. • Don’t be hesitant to try to fit a model if they are violated; just be cautious. • Some of you might think a regression model is inappropriate if you don’t see a pattern in the data: • It’s totally fine to try to fit a model. • The slope will probably be close to 0.

Exam Review Check list: • Is the association between y and x linear? • Maybe one could exist but you don’t obviously see it (much more common in multiple regression) • Have omitted/lurking variables been ruled out? • In the exam, I’ll try to give you the necessary info. • Are the errors evidently independent? • How do you verify this? • Are the variances of the residuals similar? • How do you verify this? • Are the residuals nearly normal? • How do you verify this?
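For the last item on the checklist, one common check is a normal quantile plot: sort the residuals and plot them against the quantiles a normal distribution would give. The sketch below only builds the plot coordinates; the function name, the plotting positions (i + 0.5)/n, and the residual values are assumptions for illustration. Points that fall close to a straight line are consistent with nearly normal residuals.

```python
from statistics import NormalDist

def normal_quantile_pairs(residuals):
    """Pair each sorted residual with its standard normal quantile."""
    n = len(residuals)
    ordered = sorted(residuals)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return [
        (std_normal.inv_cdf((i + 0.5) / n), e) for i, e in enumerate(ordered)
    ]

# Made-up residuals for illustration.
pairs = normal_quantile_pairs([0.06, -0.13, 0.18, -0.21, 0.10])
for quantile, residual in pairs:
    print(round(quantile, 3), residual)
```

For independence and similar variances, the analogous checks from this lecture are plotting residuals in time order (and the Durbin-Watson statistic) and plotting residuals against x.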

Exam Review • What do you need to know? • Everything from chapters 19 through 22… • No CAPM; we’ll come back to it. • What do you need to know from last semester? • Statistics builds on itself. I’ll assume you’re comfortable with some basic concepts (confidence intervals, hypothesis tests, z-scores, means, etc., etc.) • Will there be decision problems like those on Quiz 1? Maybe, but probably not. I want this to be more applied data analysis.

Exam Review • Types of Questions? • Possibly homework like. • Some business related decision making • Some non-business related analysis • Best way to study? • Do the problems. Then do more.
