Regression Models

Professor William Greene
Stern School of Business
IOMS Department
Department of Economics

Regression and Forecasting Models
Part 3 – Model Fit and Correlation

Correlation and Linear Association
Height (inches) and Income ($/mo.) in first post-MBA Job (men). WSJ, 12/30/86.
Ht.  Inc.    Ht.  Inc.    Ht.  Inc.
70   2990    68   2910    75   3150
67   2870    66   2840    68   2860
69   2950    71   3180    69   2930
70   3140    68   3020    76   3210
65   2790    73   3220    71   3180
73   3230    73   3370    66   2670
64   2880    70   3180    69   3050
70   3140    71   3340    65   2750
69   3000    69   2970    67   2960
73   3170    73   3240    70   3050
Correlation = 0.845

Correlation Coefficient for Two Variables

Correlation and Linear Association
Data: Height (inches) and Income ($/mo.) in first post-MBA Job (men). WSJ, 12/30/86.
Standard Deviation Height = 2.978
Standard Deviation Income = 176.903
Covariance of Height and Income = 445.034
Correlation = 445.034 / (2.978 × 176.903) = 0.845
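The arithmetic on this slide can be reproduced from the 30 observations. The following is a minimal sketch in plain Python, assuming sample (N − 1) divisors for the variance and covariance. The helper names `mean` and `sample_cov` are illustrative and do not come from the slides.

```python
import math

# (height in inches, monthly income in dollars) pairs from the WSJ table
data = [
    (70, 2990), (68, 2910), (75, 3150), (67, 2870), (66, 2840), (68, 2860),
    (69, 2950), (71, 3180), (69, 2930), (70, 3140), (68, 3020), (76, 3210),
    (65, 2790), (73, 3220), (71, 3180), (73, 3230), (73, 3370), (66, 2670),
    (64, 2880), (70, 3180), (69, 3050), (70, 3140), (71, 3340), (65, 2750),
    (69, 3000), (69, 2970), (67, 2960), (73, 3170), (73, 3240), (70, 3050),
]
heights = [h for h, _ in data]
incomes = [inc for _, inc in data]
n = len(data)


def mean(values):
    return sum(values) / len(values)


def sample_cov(xs, ys):
    """Sample covariance with an N - 1 divisor (assumed convention)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


# A variable's covariance with itself is its variance.
sd_height = math.sqrt(sample_cov(heights, heights))
sd_income = math.sqrt(sample_cov(incomes, incomes))
cov_hi = sample_cov(heights, incomes)

# Correlation = covariance / (product of the standard deviations)
r = cov_hi / (sd_height * sd_income)

print(f"SD height  = {sd_height:.3f}")
print(f"SD income  = {sd_income:.3f}")
print(f"Covariance = {cov_hi:.3f}")
print(f"r          = {r:.3f}")
```

The correlation is unit-free: rescaling income to dollars per year would change the covariance and the income standard deviation by the same factor, leaving r unchanged.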

Sample Correlation Coefficients
[Four scatter plots] r_xy = -0.06 (close to 0); r_xy = 0.723; r_xy = +1.000; r_xy = -0.402

Inference About a Correlation Coefficient

Correlation and Linear Association
Data: Height (inches) and Income ($/mo.) in first post-MBA Job (men). WSJ, 12/30/86.
Correlation = 0.845
t = 0.845 / sqrt((1 − 0.845²)/(30 − 2)) = 8.361
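The t statistic above is simple enough to check by hand, but a small function makes the roles of r and N explicit. This is a sketch only; the function name `corr_t_stat` is hypothetical.

```python
import math


def corr_t_stat(r, n):
    """t statistic for testing zero correlation: r / sqrt((1 - r^2)/(n - 2))."""
    return r / math.sqrt((1 - r ** 2) / (n - 2))


# Values taken from the slide: r = 0.845, N = 30
t = corr_t_stat(0.845, 30)
print(f"t = {t:.3f}")
```

For a fixed r, the statistic grows with the square root of N − 2, so the same correlation is stronger evidence in a larger sample.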

Correlation is Not Causality
Data: Height (inches) and Income ($/mo.) in first post-MBA Job (men). WSJ, 12/30/86.
Correlation = 0.845

Linear regression is about correlation
[Two panels] Regression of salary vs. years of experience; Regression of fuel bill vs. number of rooms for a sample of homes
The variables are highly correlated because the regression does a good job of predicting changes in the y variable associated with changes in the x variable.

Regression Algebra

Variance Decomposition

ANOVA Table

Fit of the Model to the Data

Explained Variation
The proportion of variation “explained” by the regression is called R-squared (R²).
It is also called the Coefficient of Determination.
(It is the square of something, to be shown later.)
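The formulas behind this definition did not survive on the preceding slides, so the following is a sketch in standard least squares notation rather than a transcription. The symbols \hat{y}_i (the fitted value a + b x_i) and e_i (the residual y_i - \hat{y}_i) are assumed names.

```latex
\underbrace{\sum_{i=1}^{N} (y_i - \bar{y})^2}_{\text{Total SS}}
  = \underbrace{\sum_{i=1}^{N} (\hat{y}_i - \bar{y})^2}_{\text{Regression SS}}
  + \underbrace{\sum_{i=1}^{N} e_i^2}_{\text{Residual SS}},
\qquad
R^2 = \frac{\text{Regression SS}}{\text{Total SS}}
    = 1 - \frac{\text{Residual SS}}{\text{Total SS}} .
```

In the Internet Buzz regression output, for example, Regression SS = 7913.6 and Total SS = 18665.1, so R² = 7913.6 / 18665.1 ≈ 0.424.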

Movie Madness Fit: R²

Pretty Good Fit: R² = 0.722
Regression of Fuel Bill on Number of Rooms

Regression Fits
[Four panels] R² = 0.924; R² = 0.522; R² = 0.424; R² = 0.880

R² is still positive even if the correlation is negative.
R² = 0.338

R Squared Benchmarks
Aggregate time series: expect 0.9+.
Cross sections: 0.5 is good. Sometimes we do much better.
Large survey data sets: 0.2 is not bad.
R² = 0.924 in this cross section.

R-Squared is r_xy²
R-squared is the square of the correlation between y_i and the predicted y_i, which is a + b x_i.
The correlation between y_i and (a + b x_i) is the same, in absolute value, as the correlation between y_i and x_i.
Therefore, …
A regression with a high R² predicts y_i well.
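The claim can be checked numerically on the height and income data. This is a minimal sketch assuming ordinary least squares for the fitted line; the variable names are illustrative. It computes R² from the residuals, then compares it with the squared correlation.

```python
import math

# (height, income) pairs from the WSJ table
data = [
    (70, 2990), (68, 2910), (75, 3150), (67, 2870), (66, 2840), (68, 2860),
    (69, 2950), (71, 3180), (69, 2930), (70, 3140), (68, 3020), (76, 3210),
    (65, 2790), (73, 3220), (71, 3180), (73, 3230), (73, 3370), (66, 2670),
    (64, 2880), (70, 3180), (69, 3050), (70, 3140), (71, 3340), (65, 2750),
    (69, 3000), (69, 2970), (67, 2960), (73, 3170), (73, 3240), (70, 3050),
]
x = [h for h, _ in data]
y = [inc for _, inc in data]
n = len(data)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Sums of squares and cross products around the means
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Least squares slope and intercept, then fitted values a + b*x_i
b = s_xy / s_xx
a = y_bar - b * x_bar
fitted = [a + b * xi for xi in x]

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
r_squared = 1 - sse / s_yy

# Correlation between y and x
r_xy = s_xy / math.sqrt(s_xx * s_yy)

# Correlation between y and the fitted values
f_bar = sum(fitted) / n
s_ff = sum((fi - f_bar) ** 2 for fi in fitted)
s_yf = sum((yi - y_bar) * (fi - f_bar) for yi, fi in zip(y, fitted))
r_y_fitted = s_yf / math.sqrt(s_yy * s_ff)

print(f"R^2        = {r_squared:.4f}")
print(f"r_xy^2     = {r_xy ** 2:.4f}")
print(f"r_y,fitted = {r_y_fitted:.4f}, r_xy = {r_xy:.4f}")
```

Here the slope is positive, so the two correlations agree exactly. With a negative slope they would differ in sign, but their squares would still match.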

Squared Correlations
[Three panels] r_xy² = 0.522; r_xy² = 0.161; r_xy² = 0.924

Regression Fits
[Two panels] Regression of salary vs. years of experience; Regression of fuel bill vs. number of rooms for a sample of homes

Is R² Large?
Is there really a relationship between x and y?
We cannot be 100% certain.
We can be “statistically certain” (within limits) by examining R².
F is used for this purpose.

The F Ratio

Is R² Large?
Since F = (N-2)R²/(1 − R²), if R² is “large,” then F will be large.
For a model with one explanatory variable in it, the standard benchmark value for a “large” F is 4.

Movie Madness Fit: R² and F

Why Use F and not R²?
When is R² “large”? We have no benchmarks to decide.
We have a table for F statistics to determine when F is statistically large: yes or no.

F Table
n2 is N-2.
The “critical value” depends on the number of observations. If F is larger than the value in the table, conclude that there is a “statistically significant” relationship.
There is a huge table on pages 826-833 of your text. Analysts now use computer programs, not tables like this, to find the critical values of F for their model/data.
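As an illustration of what such a program does, here is a standard-library-only sketch that finds the 5% critical value of F with 1 and N − 2 degrees of freedom. It relies on the fact that an F(1, n2) variable is the square of a t variable with n2 degrees of freedom, integrates the t density with Simpson's rule, and searches by bisection. The function names are hypothetical and the method is chosen for transparency; statistical packages use their own routines.

```python
import math


def t_pdf(t, df):
    """Density of Student's t with df degrees of freedom (log-gamma for stability)."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(t * t / df))


def f_upper_tail(f_value, df2, steps=2000):
    """P(F > f_value) for F(1, df2), computed as P(|T| > sqrt(f_value))."""
    c = math.sqrt(f_value)
    h = c / steps  # steps is even, as Simpson's rule requires
    total = t_pdf(0.0, df2) + t_pdf(c, df2)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df2)
    central = 2 * total * h / 3  # P(-c < T < c), using symmetry of the density
    return 1.0 - central


def f_critical(df2, alpha=0.05):
    """Value exceeded with probability alpha by an F(1, df2) variable."""
    lo, hi = 0.0, 1000.0
    for _ in range(60):  # bisection: the tail probability falls as F rises
        mid = (lo + hi) / 2
        if f_upper_tail(mid, df2) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


for df2 in (28, 60):
    print(f"n2 = {df2}: 5% critical value of F = {f_critical(df2):.3f}")
```

For n2 = 60 the result is very close to 4, which is where the rule-of-thumb benchmark for a “large” F comes from. Smaller samples need a somewhat larger F.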

Internet Buzz Regression
n2 is N-2
Regression Analysis: BoxOffice versus Buzz
The regression equation is
BoxOffice = - 14.4 + 72.7 Buzz
Predictor      Coef  SE Coef      T      P
Constant    -14.360    5.546  -2.59  0.012
Buzz          72.72    10.94   6.65  0.000
S = 13.3863   R-Sq = 42.4%   R-Sq(adj) = 41.4%
Analysis of Variance
Source          DF       SS      MS      F      P
Regression       1   7913.6  7913.6  44.16  0.000
Residual Error  60  10751.5   179.2
Total           61  18665.1
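Every summary number in this output follows from the two sums of squares and the sample size. The sketch below rebuilds them in plain Python. The variable names are illustrative, and N = 62 is inferred from the total degrees of freedom (61) rather than stated on the slide.

```python
import math

# Values read from the Analysis of Variance table
ss_regression = 7913.6
ss_residual = 10751.5
n = 62  # inferred: Total DF = N - 1 = 61

ss_total = ss_regression + ss_residual
df_regression = 1
df_residual = n - 2

# Mean squares are sums of squares divided by their degrees of freedom
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

f_ratio = ms_regression / ms_residual
r_squared = ss_regression / ss_total
adj_r_squared = 1 - (ss_residual / df_residual) / (ss_total / (n - 1))
s = math.sqrt(ms_residual)  # standard error of the regression

# The same F, computed from R^2 alone
f_from_r2 = (n - 2) * r_squared / (1 - r_squared)

print(f"Total SS  = {ss_total:.1f}")
print(f"F         = {f_ratio:.2f}  (from R^2: {f_from_r2:.2f})")
print(f"R-Sq      = {100 * r_squared:.1f}%")
print(f"R-Sq(adj) = {100 * adj_r_squared:.1f}%")
print(f"S         = {s:.4f}")
```

The F ratio here is far above the benchmark of 4, which matches the reported P value of 0.000.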

Inference About a Correlation Coefficient
This is F

$135 Million
Klimt, to Ronald Lauder
http://www.nytimes.com/2006/06/19/arts/design/19klim.html?ex=1308369600&en=37eb32381038a749&ei=5088&partner=rssnyt&emc=rss

$100 Million … sort of
Stephen Wynn with a Prized Possession, 2007

An Enduring Art Mystery
Why do larger paintings command higher prices?
Graphics show relative sizes of the two works:
The Persistence of Econometrics. Greene, 2011
The Persistence of Memory. Salvador Dali, 1931

Monet in Large and Small
Sale prices of 328 signed Monet paintings
Log of $price = a + b log surface area + e
The residuals do not show any obvious patterns that seem inconsistent with the assumptions of the model.

The Data
Note: The regression uses logs of both variables. This is common when analyzing financial measurements (e.g., price) and when percentage changes are more interesting than unit changes. (E.g., what is the % premium when the painting is 10% larger?)
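To make the “% premium” reading concrete, here is a short derivation under the log-log model Log of $price = a + b log surface area + e. It treats b as a generic slope and holds e fixed; no estimated value is assumed.

```latex
\log(\text{price}_1) - \log(\text{price}_0)
  = b\,\bigl[\log(1.10 \times \text{area}_0) - \log(\text{area}_0)\bigr]
  = b \log(1.10) \approx 0.0953\, b,
\qquad\text{so}\qquad
\frac{\text{price}_1}{\text{price}_0} = 1.10^{\,b}.
```

For small changes this is approximately “percentage change in price = b × percentage change in area,” so b can be read as an elasticity.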

Application: Monet Paintings
Does the size of the painting really explain the sale prices of Monet’s paintings?
Investigate: Compute the regression.
Hypothesis: The slope is actually zero.
Rejection region: Slope estimates that are very far from zero.
The hypothesis that β = 0 is rejected.

An Equivalent Test
Is there a relationship?
H0: No correlation
Rejection region: Large R².
Test: F = (N-2)R²/(1 − R²). Reject H0 if F > 4.
Math result: F = t².
Degrees of freedom for the F statistic are 1 and N-2.
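The identity F = t² can be checked with numbers already in these slides. This is a sketch with hypothetical helper names: the first check uses the height and income correlation, where the identity holds exactly, and the second uses the printed Internet Buzz output, where it holds up to rounding.

```python
import math


def f_from_r2(r2, n):
    """F statistic for a one-variable regression: (N - 2) R^2 / (1 - R^2)."""
    return (n - 2) * r2 / (1 - r2)


def t_from_r(r, n):
    """t statistic for the correlation: r / sqrt((1 - r^2)/(N - 2))."""
    return r / math.sqrt((1 - r ** 2) / (n - 2))


# Height and income example: r = 0.845, N = 30
t_height = t_from_r(0.845, 30)
f_height = f_from_r2(0.845 ** 2, 30)
print(f"height/income: t^2 = {t_height ** 2:.3f}, F = {f_height:.3f}")

# Internet Buzz output: slope 72.72 with standard error 10.94; printed F = 44.16
t_buzz = 72.72 / 10.94
f_buzz = 44.16
print(f"buzz: t = {t_buzz:.2f}, t^2 = {t_buzz ** 2:.2f}, printed F = {f_buzz}")
```

The small gap in the second check comes from the coefficient and standard error being rounded in the printed output, not from any failure of the identity.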

Monet Regression: There seems to be a regression. Is there a theory?
