Statistics and Data Analysis

Statistics and Data Analysis Professor William Greene Stern School of Business IOMS Department Department of Economics

Statistics and Data Analysis Part 15 – Regression Models

1/49 Linear Regression Models • Analyzing residuals • Violations of assumptions • Unusual data points • Hints for improving the model • Model building • Linear models – cost functions • Semilog models – growth models • Logs and elasticities

2/49 Model Assumptions • Assumptions about disturbances (noise) • Zero mean • Constant variance • No correlation across observations • Normality • Disturbances are assumed to be pure noise. Residuals should appear that way also.

3/49 An Enduring Art Mystery Graphics show relative sizes of the two works. The Persistence of Statistics. Hildebrand, Ott and Gray, 2005 Why do larger paintings command higher prices? The Persistence of Memory. Salvador Dali, 1931

4/49 Monet in Large and Small Sale prices of 328 signed Monet paintings. The residuals do not show any obvious patterns that seem inconsistent with the assumptions of the model. log(price in $) = a + b log(surface area) + e

5/49 Speaking of Monet… Monet. Le Pont d'Argenteuil, 1874. Modified, Musée d’Orsay, Paris, October 7, 2007, anon. vandal.

6/49 Speaking of owner modified $100,000,000 paintings…

7/49 The Data

8/49 Monet Regression

9/49 Using the Residuals • How do you know the model is “good?” • Various diagnostics to be developed over the semester. • But, the first place to look is at the residuals.
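As a minimal sketch of what "look at the residuals" means in practice, the following fits a line by least squares and computes the residuals. The data are made up for illustration, and `fit_line` is a hypothetical helper, not something from the course software.

```python
# Minimal least squares sketch; the data below are made up for illustration.

def fit_line(x, y):
    """Return the least squares intercept a and slope b for y = a + b*x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = ybar - b * xbar
    return a, b

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]

a, b = fit_line(x, y)
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

# With an intercept in the model, these two sums are zero by construction,
# so they say nothing about model quality. Patterns in a residual plot do.
sum_resid = sum(residuals)
sum_resid_x = sum(e * xi for e, xi in zip(residuals, x))
print(round(a, 3), round(b, 3))
```

Because the residuals always sum to zero and are uncorrelated with x, the useful diagnostics are the ones the following slides look at: curvature, changing spread, streaks, and isolated large values.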

10/49 Residuals Can Signal a Flawed Model • Standard application: Cost function for output of a production process. • Compare linear equation to a quadratic model (in logs) • (124 American Electric Utilities)

11/49 Candidate Model for Cost Log c = a + b log q + e (Plot annotations: in two regions of the scatter, most of the points are above the regression line; in a third region, most of the points are below the regression line.)

12/49 A Missing Variable? Residuals from the (log)linear cost model
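The residual pattern that signals a missing squared term can be reproduced with a small sketch. It assumes a noise-free cost curve that is quadratic in log output, with coefficients chosen only for illustration (they are not the utility data), and fits a straight line to it.

```python
def fit_line(x, y):
    """Return the least squares intercept a and slope b for y = a + b*x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b = sxy / sxx
    return ybar - b * xbar, b

# Hypothetical log-output values and a cost curve that is quadratic in
# log output (illustrative coefficients, no noise).
log_q = [float(i) for i in range(9)]
log_c = [1.0 + 0.5 * z + 0.1 * z ** 2 for z in log_q]

# Fit the (log)linear model, which leaves out the squared term.
a, b = fit_line(log_q, log_c)
residuals = [c - (a + b * z) for z, c in zip(log_q, log_c)]
signs = "".join("+" if e > 0 else "-" for e in residuals)
print(signs)
```

The residuals are positive at both ends of the output range and negative in the middle. That systematic curve, rather than a patternless scatter, is what suggests the linear specification is missing something.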

13/49 A Better Model? log Cost = α + β₁ log Output + β₂ [log Output]² + ε (Developed more fully after the midterm)

14/49 Candidate Models for Cost The quadratic equation is the appropriate model. log c = a + b₁ log q + b₂ (log q)² + e

15/49 Missing Variable Included Residuals from the quadratic cost model

16/49 Heteroscedasticity • Hetero - differences • Scedastic - function, variation around the mean • Arises when y is “proportional” to x • Arises sometimes when there are natural, heterogeneous groups

17/49 Heteroscedasticity Residuals from a regression of salaries on years of experience. Standard deviation of the residuals seems not to be constant.

18/49 Problem with the Model? This usually suggests the model should be defined in terms of logs of the variable.

19/49 Sometimes Heteroscedasticity Can Be Cured By Taking Logs Residuals from a regression of logs of salaries on years of experience. Salary = α e^(βt) e^ε We will explore this model below.

20/49 Sometimes Not … Countries are ordered by the standard deviation of their 19 residuals. Regression of log of per capita gasoline use on log of per capita income for 18 OECD countries for 19 years. The standard deviation varies by country. The “solution” is “weighted least squares.” (See text, page 659.)
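A rough sketch of weighted least squares for a single regressor follows. It assumes the disturbance standard deviation of each observation is known, which in practice it is not (it would be estimated, for example from each group's residuals). The data, the `sigma` values, and the helper `wls_line` are all hypothetical.

```python
def wls_line(x, y, w):
    """Weighted least squares intercept and slope for y = a + b*x.

    Minimizes the sum of w_i * (y_i - a - b*x_i)**2.
    """
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx
    return ybar - b * xbar, b

# Made-up data: two groups with assumed disturbance standard deviations.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.9, 5.1, 7.0, 11.5, 10.0, 16.0]
sigma = [0.5, 0.5, 0.5, 2.0, 2.0, 2.0]   # hypothetical group values
weights = [1.0 / s ** 2 for s in sigma]  # noisier observations count less

a_wls, b_wls = wls_line(x, y, weights)
a_ols, b_ols = wls_line(x, y, [1.0] * len(x))  # equal weights reproduce OLS
print(round(b_ols, 3), round(b_wls, 3))
```

Weighting each observation by the inverse of its variance lets the less noisy group dominate the fit. With equal weights the same function returns the ordinary least squares line.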

21/49 Should I Worry About Heteroscedasticity? • Not a problem for using least squares to estimate α or β. • But, there is a better method than least squares. • Assessment of the uncertainty of the least squares estimates may be too optimistic. • (Not contagious)

22/49 Autocorrelation • Auto – self • Correlation – correlation • Correlated with itself? Obviously? • Noise in one observation is correlated with noise in other observations. • Usually a feature of time series data • Residuals correlated with recent past residuals • Typically streaks of unusually high or low observations (measured against the regression)

23/49 Time Series Regression Regression of log Gasoline on log Income (both per capita), U.S., 1953-2004. Residuals are highly autocorrelated. Same problems as heteroscedasticity. Autocorrelation can (also) be cured. Not by taking logs, however.
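One simple way to put a number on "streaks" is the lag-1 autocorrelation of the residuals. The sketch below uses two made-up residual series, not the gasoline data; `lag1_autocorrelation` is a hypothetical helper.

```python
def lag1_autocorrelation(residuals):
    """Sample lag-1 autocorrelation of a residual series.

    Least squares residuals from a model with an intercept have mean zero,
    so no mean is subtracted here.
    """
    num = sum(residuals[t] * residuals[t - 1] for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

# Made-up residuals: runs of same-signed values versus constant sign flips.
streaky = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1, -1, -1]
alternating = [1, -1] * 6

r_streaky = lag1_autocorrelation(streaky)
r_alternating = lag1_autocorrelation(alternating)
print(round(r_streaky, 3), round(r_alternating, 3))
```

Runs of same-signed residuals give a positive value, while residuals that flip sign every period give a negative one. Values near zero are what pure noise would be expected to produce.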

24/49 Unusual Data Points Outliers have (what appear to be) very large disturbances, ε Wolf weight vs. tail length The 500 most successful movies

25/49 Outliers (?) Remember the empirical rule: about 99.7% of observations will lie within mean ± 3 standard deviations. (We show (a+bx) ± 3se below.) Titanic is 8.1 standard deviations from the regression! Only 0.86% of the 466 observations lie outside the bounds. (We will refine this later.) These observations might deserve a close look.
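The (a + bx) ± 3se screen can be sketched as follows. The data are synthetic (a clean line with small alternating noise and one planted outlier), and `flag_large_residuals` is a hypothetical helper. The multiplier `k` is left as a parameter so the same function also covers a ± 2S rule.

```python
import math

def fit_line(x, y):
    """Return the least squares intercept a and slope b for y = a + b*x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b = sxy / sxx
    return ybar - b * xbar, b

def flag_large_residuals(x, y, k=3.0):
    """Indices whose residual exceeds k standard errors of the regression."""
    a, b = fit_line(x, y)
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    se = math.sqrt(sum(e ** 2 for e in resid) / (len(x) - 2))
    return [i for i, e in enumerate(resid) if abs(e) > k * se]

# Synthetic data: a line with small alternating noise and one gross outlier.
x = [float(i) for i in range(1, 21)]
y = [2.0 + 0.5 * xi + (0.1 if i % 2 == 0 else -0.1) for i, xi in enumerate(x)]
y[9] += 10.0  # planted outlier at index 9

flagged_3se = flag_large_residuals(x, y, k=3.0)
flagged_2se = flag_large_residuals(x, y, k=2.0)
print(flagged_3se, flagged_2se)
```

Note that the outlier itself inflates se, so in a small sample even a gross error cannot sit very many standard errors from the line. This is one reason the screen is more informative in large samples.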

26/49 What to Do About Outliers (1) Examine the data. (2) Are they due to measurement error or obvious “coding errors”? If so, delete the observations. (3) Are they just unusual observations? Do nothing. (4) Generally, resist the temptation to remove outliers, especially if the sample is large. (500 movies is large. 10 wolves is not.) (5) Question why you think it is an outlier. Is it really?

27/49 High Leverage Points “High leverage” points have unusual values of x. Problem? The regression slope is strongly influenced by these points. Response: Unless you are strongly convinced that these are bad data, strongly resist the temptation to pay any attention to these observations. This phenomenon is extremely hard to detect in a moderate to large sample. It is also extremely elusive when there is more than one variable in the model. (Figure: scatter plot of Y against X.)

28/49 Highly Influential Points • High leverage outliers (unusual x and unusual y) • With Titanic: 6.693 + 1.051 Domestic • Without Titanic: 20.774 + 0.930 Domestic
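The with/without comparison can be sketched on made-up data in which one point has both an unusual x and an unusual y. These numbers are not the movie data; they only illustrate how far a single such point can move the fitted line.

```python
def fit_line(x, y):
    """Return the least squares intercept a and slope b for y = a + b*x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b = sxy / sxx
    return ybar - b * xbar, b

# Made-up data: five points near y = x, plus one high-leverage outlier.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 20.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 40.0]

a_with, b_with = fit_line(x, y)
a_without, b_without = fit_line(x[:-1], y[:-1])
print(round(b_with, 3), round(b_without, 3))
```

In this toy example the single point roughly doubles the slope. The comparison shows the point is influential. It does not by itself show that the point is wrong or should be removed.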

29/49 Regression Options

30/49 Save Residuals

31/49 Residuals

32/49 Minitab’s Opinions Minitab uses ± 2S to flag “large” residuals.

33/49 On Removing Outliers Be careful about singling out particular observations this way. The resulting model might be a product of your opinions. Removing outliers might create new outliers that were not outliers before. Statistical inferences from the model will be incorrect.

34/49 Normal Distribution of ei?

35/49 Probability Plot Graph -> Probability Plots …
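For readers without Minitab, the coordinates of a normal probability plot can be sketched with the Python standard library. The plotting position (i - 0.5)/n used here is one common convention and is an assumption. Statistical packages may use a slightly different formula, so the points need not match Minitab's exactly. The residuals are made up.

```python
from statistics import NormalDist

def normal_plot_points(residuals):
    """Pair each sorted residual with a standard normal quantile."""
    n = len(residuals)
    ordered = sorted(residuals)
    # Plotting position (i - 0.5)/n is an assumed convention, see lead-in.
    probs = [(i - 0.5) / n for i in range(1, n + 1)]
    z = [NormalDist().inv_cdf(p) for p in probs]
    return list(zip(z, ordered))

residuals = [0.3, -1.2, 0.8, -0.1, 1.5, -0.6, 0.2, -0.9, 0.4, -0.4]  # made up
points = normal_plot_points(residuals)
for z, e in points:
    print(f"{z:7.3f} {e:6.2f}")
```

If the residuals are roughly normal, plotting the second column against the first gives points close to a straight line. Systematic bending at the ends suggests skewness or heavy tails.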

36/49 Using and Interpreting the Model • Interpreting the linear model • Semilog and growth models • Log-log model and elasticities

37/49 Statistical Cost Analysis Generation cost ($M) and output (millions of KWH) for 124 American electric utilities (1970). The units of the LHS and RHS must be the same. $M cost = a + b MKWH • Y = cost in $M • a = 2.444 $M • b = 0.005291 $M/MKWH • So: a = fixed cost = total cost if MKWH = 0 • b = marginal cost = dCost/dMKWH • b × MKWH = variable cost
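A small worked example of the cost decomposition, using the slide's estimates of a and b. The output level of 1,000 MKWH is a hypothetical value chosen only to make the arithmetic concrete.

```python
FIXED_COST = 2.444         # a, in $M (estimate from the slide)
MARGINAL_COST = 0.005291   # b, in $M per MKWH (estimate from the slide)

def total_cost(mkwh):
    """Predicted generation cost in $M at a given output in MKWH."""
    return FIXED_COST + MARGINAL_COST * mkwh

output = 1000.0                          # hypothetical output level, MKWH
variable_cost = MARGINAL_COST * output   # b * MKWH, in $M
total = total_cost(output)               # fixed + variable, in $M
print(round(variable_cost, 3), round(total, 3))
```

The units check out term by term: ($M/MKWH) × MKWH gives $M, which can then be added to the fixed cost in $M.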

38/49 Semilog Models and Growth Rates LogSalary = 9.84 + 0.05 Years + e

39/49 Growth in a Semilog Model
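As a sketch of how the slope in a semilog model translates into a growth rate, take the fitted equation LogSalary = 9.84 + 0.05 Years. This assumes natural logs, consistent with the form Salary = αe^(βt)e^ε used earlier, and ignores the disturbance term.

```python
import math

a, b = 9.84, 0.05  # estimates from the fitted semilog salary equation

def predicted_salary(years):
    """Predicted salary from the semilog model (natural logs assumed)."""
    return math.exp(a + b * years)

# One more year multiplies predicted salary by exp(b), whatever the level.
growth_per_year = math.exp(b) - 1.0
ratio = predicted_salary(11) / predicted_salary(10)
print(round(100 * growth_per_year, 2), round(ratio, 4))
```

The implied growth is about 5.13% per year, slightly above the coefficient 0.05. For small b the two are close, which is why the slope of a semilog model is often read directly as a growth rate.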

40/49 Using Semilog Models for Trends Frequent Flyer Flights for 72 Months. (Text, Ex. 11.1, p. 508)

41/49 Regression Approach logFlights = α + β Months + ε a = 2.770, b = 0.03710, s = 0.06102

42/49 Loglinear Models • logY = α + βlogx + ε • Elasticities • Gasoline income elasticity • The linear and loglinear models give similar answers • Price elasticity

43/49 Elasticity and Loglinear Models • The “responsiveness” of one variable to changes in another • E.g., in economics demand elasticity = (%ΔQ) / (%ΔP) • Math: Ratio of percentage changes • %ΔQ / %ΔP = {100%[(ΔQ)/Q]} / {100%[(ΔP)/P]} • Units of measurement and the 100% fall out of this equation. • Elasticity = (ΔQ/ΔP)*(P/Q) • Elasticities are unit-free
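The claim that elasticities are unit-free can be checked with a short sketch. The prices and quantities are made up. The same change is expressed once in dollars and units, and once in cents and thousands of units.

```python
def elasticity(q0, q1, p0, p1):
    """(ΔQ/ΔP) * (P/Q), evaluated at the starting point (p0, q0)."""
    return ((q1 - q0) / (p1 - p0)) * (p0 / q0)

# Hypothetical change: price rises 10%, quantity falls 5%.
e_dollars = elasticity(100.0, 95.0, 2.00, 2.20)    # units, dollars
e_cents = elasticity(0.100, 0.095, 200.0, 220.0)   # thousands, cents
print(round(e_dollars, 4), round(e_cents, 4))
```

Both calls give an elasticity of -0.5: rescaling P or Q changes ΔQ/ΔP and P/Q by offsetting factors.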

44/49 Linear Demand Curves

45/49 Loglinear Demand Curves Q = α P^β e^ε, so log Q = a + β log P + ε, where a = log α. Then β = d log Q / d log P is the elasticity.

46/49 Demand Models
Using Logs
Regression Analysis: Log-Gas_t versus LogPG_t
The regression equation is
Log-Gas_t = 0.372 - 0.169 LogPG_t
Predictor      Coef   SE Coef      T      P
Constant   0.372140  0.008433  44.13  0.000
LogPG_t    -0.16949   0.03827  -4.43  0.000
S = 0.0608113   R-Sq = 28.2%
Using Levels
Regression Analysis: Gas_t versus PGas_t
The regression equation is
Gas_t = 1.66 - 0.199 PGas_t
Predictor      Coef   SE Coef      T      P
Constant    1.65874   0.05803  28.58  0.000
PGas_t     -0.19928   0.05516  -3.61  0.001
S = 0.0941783   R-Sq = 20.7%

47/49 Linear and Loglinear Models Elasticity in the loglinear model is b = -0.1695. Elasticity in the linear model at the mean of G of 1.4545 and mean of PG of 1.0251 is -0.1993 (1.0251/1.4545) = -0.1404.
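The comparison of the two elasticities is simple arithmetic, sketched here with the coefficients and sample means reported on the slides. The variable names are illustrative.

```python
b_loglinear = -0.1695   # slope of log Gas on log PG: already an elasticity
b_linear = -0.1993      # slope of Gas on PG: dG/dPG in level units
mean_g = 1.4545         # sample mean of G
mean_pg = 1.0251        # sample mean of PG

# In the linear model the elasticity (dG/dPG) * (PG/G) depends on where it
# is evaluated. Here it is computed at the sample means.
elasticity_linear_at_means = b_linear * mean_pg / mean_g
print(round(elasticity_linear_at_means, 3), b_loglinear)
```

In the linear model the elasticity changes along the demand curve, so the value at the means is only one summary. The loglinear model imposes a single constant elasticity instead.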

48/49 Income Elasticity
