Regression Models

Regression Models Professor William Greene Stern School of Business IOMS Department Department of Economics

Regression and Forecasting Models Part 2 – Inference About the Regression

The Linear Regression Model 1. The linear regression model 2. Sample statistics and population quantities 3. Testing the hypothesis of no relationship

A Linear Regression Predictor: Box Office = -14.36 + 72.72 Buzz

Data and Relationship • We suggested the relationship between box office and internet buzz is Box Office = -14.36 + 72.72 Buzz • Note the obvious inconsistency in the figure. This is not the relationship. The observed points do not lie on a line. • How do we reconcile the equation with the data?

Modeling the Underlying Process • A model that explains the process that produces the data that we observe: • Observed outcome = the sum of two parts • (1) Explained: The regression line • (2) Unexplained (noise): The remainder • Regression model • The “model” is the statement that part (1) is the same process from one observation to the next. Part (2) is the randomness that is part of real world observation.

The Population Regression • THE model: A specific statement about the parts of the model • (1) Explained: Explained Box Office = β0 + β1 Buzz • (2) Unexplained: The rest is “noise, ε.” Random ε has certain characteristics • Model statement • Box Office = β0 + β1 Buzz + ε
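A minimal simulation sketch of the model statement, showing how each observed outcome is built from the explained part plus noise. The parameter values (β0 = -14, β1 = 73, σ = 13) and the Buzz values are assumptions chosen only for illustration; they are not the true population values.

```python
import random

def simulate_box_office(buzz_values, beta0, beta1, sigma, seed=0):
    """Generate outcomes as explained part + noise: y = beta0 + beta1*x + eps."""
    rng = random.Random(seed)
    outcomes = []
    for x in buzz_values:
        explained = beta0 + beta1 * x   # part (1): the regression line
        noise = rng.gauss(0.0, sigma)   # part (2): random disturbance with mean zero
        outcomes.append(explained + noise)
    return outcomes

# Hypothetical Buzz values and parameters, for illustration only.
buzz = [0.1 * k for k in range(1, 11)]
box = simulate_box_office(buzz, beta0=-14.0, beta1=73.0, sigma=13.0)
for x, y in zip(buzz, box):
    print(f"Buzz = {x:.1f}  Box Office = {y:7.2f}")
```

Setting sigma to zero puts every simulated point exactly on the line, which is the case the observed data rule out.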

The Data Include the Noise

The Data Include the Noise • Explained part: β0 + β1 Buzz • Example: Box Office = 41, β0 + β1 Buzz = 10, ε = 41 - 10 = 31

Model Assumptions • yi = β0 + β1xi + εi • β0 + β1xi is the ‘regression function’ • Contains the ‘information’ about yi in xi • Unobserved because β0 and β1 are not known for certain • εi is the ‘disturbance.’ It is the unobserved random component • Observed yi is the sum of the two unobserved parts.

Regression Model Assumptions About εi • Random Variable • (1) The regression is the mean of yi for a particular xi. εi is the deviation of yi from the regression line. • (2) εi has mean zero. • (3) εi has variance σ². • ‘Random’ Noise • (4) εi is unrelated to any values of xi (no covariance): it is “random noise.” • (5) εi is unrelated to any other observation's εj (not “autocorrelated”). • (6) Normal distribution: εi is the sum of many small influences.

Regression Model

Conditional Normal Distribution of ε

A Violation of Point (4): c = β0 + β1 q + ? (Electricity Cost Data)

A Violation of Point (5) - Autocorrelation Time Trend of U.S. Gasoline Consumption

No Obvious Violations of Assumptions Auction Prices for Monet Paintings vs. Area

Samples and Populations
Population (Theory)
• yi = β0 + β1xi + εi
• Parameters: β0, β1
• Regression: β0 + β1xi, the mean of yi | xi
• Disturbance: εi
• Expected value = 0, standard deviation σ
• No correlation with xi
Sample (Observed)
• yi = b0 + b1xi + ei
• Estimates: b0, b1
• Fitted regression: b0 + b1xi, the predicted yi | xi
• Residuals: ei
• Sample mean 0, sample standard deviation se
• Sample Cov[x,e] = 0
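A short sketch checking the two sample-side properties on the residuals: least squares residuals have sample mean zero and zero sample covariance with x. The five data points and the helper name fit_line are hypothetical, used only for illustration.

```python
def fit_line(x, y):
    """Least squares intercept b0 and slope b1 for y = b0 + b1*x + e."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical toy data, not from the box office sample.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

b0, b1 = fit_line(x, y)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

n = len(x)
x_bar = sum(x) / n
residual_mean = sum(residuals) / n
sample_cov_xe = sum((xi - x_bar) * ei for xi, ei in zip(x, residuals)) / (n - 1)

print("mean of residuals:", residual_mean)   # zero up to rounding error
print("sample Cov[x, e]:", sample_cov_xe)    # zero up to rounding error
```

These two properties hold by construction for the residuals ei; for the disturbances εi they are assumptions about the population.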

Disturbances vs. Residuals • Disturbance: ε = y - β0 - β1 Buzz • Residual: e = y - b0 - b1 Buzz

Standard Deviation of Residuals • Standard deviation of εi = yi - β0 - β1xi is σ • σ = √E[εi²] (the mean of εi is zero) • Sample b0 and b1 estimate β0 and β1 • Residual ei = yi - b0 - b1xi estimates εi • Use √((1/N)Σei²) to estimate σ? Close, but not quite: use se = √(Σei²/(N-2)). • Why N-2? Two parameters (β0, β1) were estimated. This is the same reason N-1 is used to compute a sample variance.
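A minimal sketch of the computation, comparing the divisor N with the divisor N-2. The toy data and function names are hypothetical; only the formulas come from the slide.

```python
import math

def fit_line(x, y):
    """Least squares intercept b0 and slope b1."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

def residual_std_dev(x, y, b0, b1, params_estimated=2):
    """Estimate sigma as sqrt(sum of squared residuals / (N - params_estimated))."""
    sse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    return math.sqrt(sse / (len(x) - params_estimated))

# Hypothetical toy data for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

b0, b1 = fit_line(x, y)
s_e = residual_std_dev(x, y, b0, b1)                         # divides by N - 2
naive = residual_std_dev(x, y, b0, b1, params_estimated=0)   # divides by N

print(f"se with N-2: {s_e:.4f}, with N: {naive:.4f}")
```

Dividing by N always gives the smaller number, because the fitted line is chosen to make the residuals as small as possible in this sample.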

Linear Regression Sample Regression Line

Residuals

Regression Computations

Results to Report

The Reported Results

Estimated equation

Estimated coefficients b0 and b1

Sum of squared residuals, Σi ei²

S = se = estimated std. deviation of ε

Interpreting σ (Estimated by se) Remember the empirical rule: about 95% of observations lie within the mean ± 2 standard deviations. We show (b0 + b1x) ± 2se below. This point is 2.2 standard deviations from the regression. Only 3.2% of the 62 observations lie outside the bounds. (We will refine this later.)
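A small sketch of the check described above: count the share of observations that fall outside the band (b0 + b1x) ± 2se. The function name and toy data are hypothetical. The last line assumes the slide's 3.2% corresponds to 2 of the 62 observations.

```python
import math

def share_outside_band(x, y, b0, b1, s_e, width=2.0):
    """Fraction of points farther than width * s_e from the fitted line."""
    outside = sum(
        1 for xi, yi in zip(x, y)
        if abs(yi - (b0 + b1 * xi)) > width * s_e
    )
    return outside / len(x)

# Hypothetical toy data with its least squares fit (b0 = 2.2, b1 = 0.6).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
b0, b1 = 2.2, 0.6
s_e = math.sqrt(0.8)   # sqrt(sum of squared residuals / (N - 2)) for these data

two_se_share = share_outside_band(x, y, b0, b1, s_e, width=2.0)
one_se_share = share_outside_band(x, y, b0, b1, s_e, width=1.0)
print(two_se_share, one_se_share)

# Assumption: 3.2% of 62 observations is 2 points.
print(f"{2 / 62:.1%}")
```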

yi = β0 + β1xi + εi • No relationship: β1 = 0 • Relationship: β1 ≠ 0 • How to distinguish these cases statistically?

Assumptions • (Regression) The equation linking “Box Office” and “Buzz” is stable: E[Box Office | Buzz] = β0 + β1 Buzz • Another sample of movies, say from 2012, would obey the same fundamental relationship.

Sampling Variability Samples 0 and 1 are a random split of the 62 observations. Sample 0: Box Office = -16.09 + 79.11 Buzz Sample 1: Box Office = -13.25 + 68.51 Buzz
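A sketch of the same experiment on simulated data, since the 62 observations themselves are not reproduced here. The population values (β0 = -14, β1 = 73, σ = 13), the Buzz range and the seed are assumptions for illustration only.

```python
import random

def fit_line(x, y):
    """Least squares intercept b0 and slope b1."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

rng = random.Random(42)
N = 62

# Hypothetical population: one stable relationship plus random noise.
buzz = [rng.uniform(0.1, 1.0) for _ in range(N)]
box = [-14.0 + 73.0 * x + rng.gauss(0.0, 13.0) for x in buzz]

# Randomly split the observations into two subsamples.
indices = list(range(N))
rng.shuffle(indices)
half0, half1 = indices[: N // 2], indices[N // 2 :]

fit0 = fit_line([buzz[i] for i in half0], [box[i] for i in half0])
fit1 = fit_line([buzz[i] for i in half1], [box[i] for i in half1])

print("Sample 0: Box Office = %.2f + %.2f Buzz" % fit0)
print("Sample 1: Box Office = %.2f + %.2f Buzz" % fit1)
```

Both halves come from the same process, yet the two fitted lines differ. That difference is sampling variability, not a change in the underlying relationship.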

Sampling Distributions

[Figure: sampling distributions with N-2 degrees of freedom, small sample vs. large sample]

Standard Error of Regression Slope Estimator 
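A sketch using the standard textbook expression for the slope's standard error, SE(b1) = se / √Σ(xi - x̄)², where se is the residual standard deviation computed with N-2. The function name and toy data are hypothetical.

```python
import math

def slope_standard_error(x, y):
    """Return (b1, SE(b1)) for the least squares line through (x, y)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    s_e = math.sqrt(sse / (n - 2))      # residual standard deviation
    return b1, s_e / math.sqrt(sxx)     # more spread in x gives a smaller SE

# Hypothetical toy data for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
b1, se_b1 = slope_standard_error(x, y)
print(f"b1 = {b1:.3f}, SE(b1) = {se_b1:.4f}")
```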

Internet Buzz Regression
Range of uncertainty for b is 72.72 - 1.96(10.94) to 72.72 + 1.96(10.94) = [51.27 to 94.17]. If you use 2.00 from the t table, the limits would be [50.8 to 94.6].
Regression Analysis: BoxOffice versus Buzz
The regression equation is
BoxOffice = - 14.4 + 72.7 Buzz
Predictor     Coef  SE Coef      T      P
Constant   -14.360    5.546  -2.59  0.012
Buzz         72.72    10.94   6.65  0.000
S = 13.3863   R-Sq = 42.4%   R-Sq(adj) = 41.4%
Analysis of Variance
Source          DF       SS      MS      F      P
Regression       1   7913.6  7913.6  44.16  0.000
Residual Error  60  10751.5   179.2
Total           61  18665.1
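A sketch that recomputes the derived quantities from the reported coefficient, standard error and sums of squares. The variable names are illustrative. Because the inputs are the rounded values printed in the output, the interval limits can differ from the slide's figures in the second decimal place.

```python
import math

# Values copied from the regression output for the Buzz coefficient.
b1, se_b1 = 72.72, 10.94
ss_regression, ss_residual, ss_total = 7913.6, 10751.5, 18665.1
df_residual = 60   # N - 2 with N = 62

lower = b1 - 1.96 * se_b1          # large-sample 95% limits
upper = b1 + 1.96 * se_b1
t_ratio = b1 / se_b1               # T column: coefficient / standard error
s = math.sqrt(ss_residual / df_residual)       # S: residual standard deviation
r_squared = ss_regression / ss_total           # R-Sq
f_stat = ss_regression / (ss_residual / df_residual)   # F = MS regression / MS error

print(f"95% range: [{lower:.2f}, {upper:.2f}]")
print(f"t = {t_ratio:.2f}, S = {s:.4f}, R-Sq = {r_squared:.1%}, F = {f_stat:.2f}")
```

With a single regressor, F is the square of the slope's t ratio up to rounding, so the two tests of β1 = 0 agree.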

Some computer programs report confidence intervals automatically; Minitab does not.

Uncertainty About the Regression Slope
Hypothetical Regression: Fuel Bill vs. Number of Rooms
The regression equation is
Fuel Bill = -252 + 136 Number of Rooms
Predictor    Coef  SE Coef      T      P
Constant   -251.9    44.88  -5.20  0.000
Rooms       136.2     7.09   19.9  0.000
S = 144.456   R-Sq = 72.2%   R-Sq(adj) = 72.0%
• The Coef entry for Rooms (136.2) is b1, the estimate of β1.
• The “Standard Error” (SE Coef, 7.09) is the measure of uncertainty about the true value.
• The “range of uncertainty” is b ± 2 SE(b). (Actually 1.96, but people use 2.)

Sampling Distributions and Test Statistics

t Statistic for Hypothesis Test

Alternative Approach: The P value • Hypothesis: β1 = 0 • The ‘P value’ is the probability of observing evidence against the hypothesis at least as strong as what you did observe, if the null hypothesis were true. • P = Prob(|t| would be this large or larger | β1 = 0) • If the P value is less than the Type I error probability you have chosen (usually 0.05), you will reject the hypothesis. • Interpret: If the hypothesis were true, it is ‘unlikely’ that I would have observed this evidence.
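A sketch of the two-sided P value calculation, assuming the large-sample normal approximation in place of the t distribution with N-2 degrees of freedom. The function name is hypothetical, and in small samples the result will be somewhat smaller than a t-based P value.

```python
from statistics import NormalDist

def two_sided_p_value(t_stat):
    """Prob(|Z| >= |t_stat|) for standard normal Z (large-sample approximation)."""
    upper_tail = 1.0 - NormalDist().cdf(abs(t_stat))
    return 2.0 * upper_tail

# t ratios taken from the Buzz regression output.
p_buzz = two_sided_p_value(6.65)
p_constant = two_sided_p_value(-2.59)

alpha = 0.05
print(f"Buzz:     P = {p_buzz:.3g}, reject H0: {p_buzz < alpha}")
print(f"Constant: P = {p_constant:.4f}, reject H0: {p_constant < alpha}")
```

For the constant, the approximation gives about 0.0096, slightly below the 0.012 reported from the t distribution with 60 degrees of freedom. The decision at the 0.05 level is the same.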

P value for hypothesis test

Intuitive approach: Does the confidence interval contain zero? • Hypothesis: β1 = 0 • The confidence interval contains the set of plausible values of β1 based on the data and the test. • If the confidence interval does not contain 0, reject H0: β1 = 0.
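A sketch of the interval rule, using 1.96 as the large-sample critical value. The function name is hypothetical. The first call uses the Buzz coefficient and standard error from the regression output; the second uses made-up numbers.

```python
def interval_rejects_zero(b1, se_b1, critical=1.96):
    """Build b1 +/- critical * SE and report whether zero lies outside it."""
    lower = b1 - critical * se_b1
    upper = b1 + critical * se_b1
    contains_zero = lower <= 0.0 <= upper
    return (lower, upper), not contains_zero

# Buzz coefficient from the regression output.
interval, reject = interval_rejects_zero(72.72, 10.94)
print(interval, "reject H0" if reject else "do not reject H0")

# Hypothetical coefficient that is small relative to its standard error.
weak_interval, weak_reject = interval_rejects_zero(10.0, 10.0)
print(weak_interval, "reject H0" if weak_reject else "do not reject H0")
```

The rule gives the same decision as comparing |b1 / SE| with the critical value, since zero lies outside the interval exactly when |b1| exceeds critical × SE.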

More General Test

Summary: Regression Analysis
• Investigate: Is the coefficient in a regression model really nonzero?
• Testing procedure:
  • Model: y = β0 + β1x + ε
  • Hypothesis: H0: β1 = B (B = 0 for the test of no relationship)
  • Rejection region: The least squares coefficient b1 is far from B.
• Test:
  • α level for the test = 0.05 as usual
  • Compute t = (b1 - B)/StandardError
  • Reject H0 if |t| is above the critical value:
    • 1.96 if large sample
    • Value from the t table if small sample
  • Equivalently, reject H0 if the reported P value is less than the α level
• Degrees of freedom for the t statistic is N-2
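A sketch of the general test of H0: β1 = B. The function name is hypothetical, and 1.96 is the large-sample critical value; for a small sample, pass the value from the t table with N-2 degrees of freedom instead.

```python
def t_test_slope(b1, se_b1, hypothesized=0.0, critical=1.96):
    """Return (t, reject) for H0: beta1 = hypothesized, two-sided test."""
    t_stat = (b1 - hypothesized) / se_b1
    return t_stat, abs(t_stat) > critical

# Buzz regression: test of no relationship (B = 0).
t_zero, reject_zero = t_test_slope(72.72, 10.94)

# Same data, hypothetical null value B = 60 chosen for illustration.
t_sixty, reject_sixty = t_test_slope(72.72, 10.94, hypothesized=60.0)

print(f"B = 0:  t = {t_zero:.2f}, reject = {reject_zero}")
print(f"B = 60: t = {t_sixty:.2f}, reject = {reject_sixty}")
```

The test depends on |t|, so a coefficient far below B leads to rejection just as one far above B does.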
