5. Endogenous right hand side variables

5. Endogenous right hand side variables • 5.1 The problem of endogeneity bias • 5.2 The basic idea underlying the use of instrumental variables • 5.3 When the endogenous right hand side variable is continuous • 5.4 When the endogenous right hand side variable is binary

5.1 Endogeneity bias • Consider a simple OLS regression: • Yit = a0 + a1 X1it + uit • Recall that our estimate of a1 will be unbiased only if we can assume that X1it is uncorrelated with the error term (uit) • We have discussed two ways to help ensure that this assumption is true • First, we should control for any observable variables that affect Yit and which are correlated with X1it. For example, we should control for X2it if X2it affects Yit and X2it is correlated with X1it (see Chapter 2): • Yit = a0 + a1 X1it + a2 X2it + uit
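The point about controlling for X2it can be checked with a small simulation. The sketch below uses made-up data: the variable names, sample size and true coefficients (a1 = 2, a2 = 3) are all hypothetical and chosen only for illustration.

```stata
* Simulated illustration of omitted-variable bias (hypothetical data)
clear
set seed 12345
set obs 5000
generate x2 = rnormal()
generate x1 = 0.8*x2 + rnormal()
generate y  = 1 + 2*x1 + 3*x2 + rnormal()

* Short regression: x2 is omitted, so x1 picks up part of x2's effect
regress y x1
scalar b_short = _b[x1]

* Long regression: x2 is controlled for, so the x1 coefficient should be near 2
regress y x1 x2
scalar b_long = _b[x1]

display "x1 coefficient without x2: " b_short
display "x1 coefficient with x2:    " b_long
```

In this design the short regression should overstate a1 by a wide margin because x1 and x2 are positively correlated and x2 has a positive effect on y.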

5.1 Endogeneity bias • Second, if we have panel data, we can control for any unobservable firm-specific characteristics (ui) that affect Yit and which are correlated with the X variables. • From Chapter 4: • Yit = a0 + a1 X1it + a2 X2it + ui + eit • We control for the correlations between ui and the X variables by estimating fixed effects models. • Our estimates of a1 and a2 are unbiased if the X variables are uncorrelated with eit. In this case, we say that the X variables are “exogenous”.

5.1 Endogeneity bias • Unfortunately, multiple regression and fixed effects models do not always ensure that the X variables are uncorrelated with the error term: • if we do not observe all the variables that affect Y and that are correlated with X, multiple regression will not solve the problem. • if we do not have panel data, the fixed effects models cannot be estimated. • even if we have panel data, the Y and X variables may display little variation over time in which case the fixed effects models can be unreliable (Zhou, 2001). • even if we have panel data and the Y and X variables display sufficient variation over time, the unobservable variables that are correlated with X may not be constant over time in which case the fixed effects models will not solve the problem.

A variable is more likely to be correlated with the error term if it is “endogenous” • “Endogenous” means that the variable is determined within the economic model that we are trying to estimate. • For example, suppose that Y2it is an endogenous explanatory variable: • Y1it = a0 + a1 Y2it + a2 Xit + uit (1) • Y2it = b0 + b1 Xit + b2 Zit + vit (2) • Equations (1) and (2) have a “triangular” structure since Y2it is assumed to affect Y1it, but Y1it is assumed not to affect Y2it • Given this triangular structure, the OLS estimate of a1 in equation (1) is unbiased only if vit is uncorrelated with uit • If vit is correlated with uit, then Y2it is correlated with uit, which means that the OLS estimate of a1 would be biased • To avoid this bias, we must estimate equation (1) using “instrumental variables” (IV) regression rather than OLS.
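The step from "vit is correlated with uit" to "Y2it is correlated with uit" can be written out as follows. This is a short derivation under the assumption that Xit and Zit are exogenous, i.e. uncorrelated with uit.

```latex
\[
\operatorname{Cov}(Y_{2it}, u_{it})
  = \operatorname{Cov}(b_0 + b_1 X_{it} + b_2 Z_{it} + v_{it},\; u_{it})
  = b_1 \underbrace{\operatorname{Cov}(X_{it}, u_{it})}_{=0}
  + b_2 \underbrace{\operatorname{Cov}(Z_{it}, u_{it})}_{=0}
  + \operatorname{Cov}(v_{it}, u_{it})
  = \operatorname{Cov}(v_{it}, u_{it})
\]
```

So Y2it is correlated with the error term in equation (1) exactly when the two error terms are correlated.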

Equations (1) and (2) are called “structural” equations because they describe the economic relationship between Y1it and Y2it • We can obtain a “reduced-form” equation by substituting eq. (2) into eq. (1): • Y1it = a0 + a1 (b0 + b1 Xit + b2 Zit + vit) + a2 Xit + uit • In this “reduced-form” equation, all the explanatory variables (Xit and Zit) are exogenous • The basic idea underlying IV regression is to remove vit from the Y1it model so that our estimate of a1 is unbiased.
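Collecting terms in the substituted equation makes the reduced-form coefficients explicit. The grouping below follows directly from equations (1) and (2); no new symbols are introduced.

```latex
\[
Y_{1it} = (a_0 + a_1 b_0)
        + (a_1 b_1 + a_2)\, X_{it}
        + a_1 b_2\, Z_{it}
        + (a_1 v_{it} + u_{it})
\]
```

The coefficient on Zit is the product a1 b2, and the composite error term a1 vit + uit is uncorrelated with Xit and Zit as long as both are exogenous.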

5.2 The basic idea underlying the use of instrumental variables • Note that vit is removed from the Y1it model if we use the predicted rather than the actual values of Y2it on the right hand side. • We predict Y2it using all the exogenous variables in the system (in our example, we use the two exogenous variables Xit and Zit): • $\hat{Y}_{2it} = \hat{b}_0 + \hat{b}_1 X_{it} + \hat{b}_2 Z_{it}$

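5.2 The basic idea • We then use the predicted rather than the actual values of Y2it when estimating the Y1it model: • Y1it = a0 + a1 (b0 + b1 Xit + b2 Zit + vit) + a2 Xit + uit (3) • $Y_{1it} = a_0 + a_1 (\hat{b}_0 + \hat{b}_1 X_{it} + \hat{b}_2 Z_{it}) + a_2 X_{it} + u_{it}$ (4) • The a1 estimate is biased in eq. (3) but it is unbiased in eq. (4) because the vit term has been removed.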
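The two steps described above can be sketched on simulated data. Everything in this example is hypothetical: the variable names, the sample size and the true value a1 = 1 are assumptions made for illustration, and the errors u and v are made to be correlated through a shared component.

```stata
* Two-stage estimation by hand versus ivregress (hypothetical simulated data)
clear
set seed 2024
set obs 10000
generate common = rnormal()
generate u  = common + rnormal()
generate v  = common + rnormal()
generate x  = rnormal()
generate z  = rnormal()
generate y2 = 1 + 0.5*x + 1.0*z + v
generate y1 = 2 + 1.0*y2 + 0.5*x + u

* OLS ignores the correlation between y2 and u
regress y1 y2 x
scalar a1_ols = _b[y2]

* Stage 1: predict y2 from all the exogenous variables
regress y2 x z
predict double y2hat, xb

* Stage 2: use the predicted values in place of y2
regress y1 y2hat x
scalar a1_manual = _b[y2hat]

* The same point estimate in one step
ivregress 2sls y1 x (y2 = z)
scalar a1_iv = _b[y2]

display "OLS: " a1_ols "   by hand: " a1_manual "   ivregress: " a1_iv
```

The by-hand second stage should reproduce the ivregress coefficient, but its standard errors are not correct because they ignore the fact that y2hat is itself estimated. That is one reason to use ivregress in practice.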

In eq. (4) the estimated coefficient for the Zit variable is a1b2 • We already know the value of b2 from eq. (2) • Therefore a1 = (a1b2) / b2 • It is important to note that the a1 coefficient can be estimated only if there is at least one exogenous variable in the structural model for Y2it that is excluded from the structural model for Y1it • This is the Zit variable in eq. (2)

In eq. (4) the a1 coefficient is “just” identified because there is only one exogenous variable (Zit) that is in the Y2it model and that is excluded from the Y1it model

Suppose we had included Zit in both models, with coefficient a3 in the Y1it model • In this case, the a1 coefficient cannot be identified because we estimate (a1b2 + a3) and b2 • In other words, we cannot determine whether the effect of Zit on Y1it is a main effect (a3) or an indirect effect through Y2it (a1b2) • Here we say that the system of equations is “under-identified”

Suppose we had included two exogenous variables in the Y2it model (with coefficients b2 and b3) and we excluded both these variables from the Y1it model • Now we have estimates of a1b2, a1b3, b2, and b3 • Therefore a1 = (a1b2) / b2 and a1 = (a1b3) / b3, so there are two ways to obtain a1 • Here we say that the system of equations is “over-identified” • In this example, the system is “triangular” because there are two equations and one endogenous right-hand side variable
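A small numerical illustration of the over-identified case may help. The numbers below are invented purely for illustration and do not come from any dataset used in this chapter.

```latex
\[
\widehat{a_1 b_2} = 1.20,\quad \hat{b}_2 = 0.60
\;\Rightarrow\; \hat{a}_1 = \frac{1.20}{0.60} = 2.0
\qquad\qquad
\widehat{a_1 b_3} = 0.90,\quad \hat{b}_3 = 0.50
\;\Rightarrow\; \hat{a}_1 = \frac{0.90}{0.50} = 1.8
\]
```

In a finite sample the two ratios will generally differ. 2SLS combines the two excluded variables into a single estimate of a1, and the tests of over-identifying restrictions ask, loosely speaking, whether the disagreement between the implied estimates is larger than sampling error would explain.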

5.3 When the endogenous right hand side variable is continuous • When the models have a triangular structure, the models can be estimated using the ivregress command • The models can be estimated using 2SLS or LIML or GMM • 2SLS is more commonly used in practice

5.3.1 Estimating triangular models using 2SLS (ivregress) • Go to MySite • Open up the housing.dta file, which provides data on the 50 U.S. states (1980 Census) • use "J:\phd\housing.dta", clear • pct_population_urban = the % of the population that lives in urban areas • family_income = median annual family income • housing_value = median value of private housing • rent = median monthly housing rental payments • region1 – region4 = dummy variables for four regions in the U.S.

Suppose we want to estimate the following: • rent = a0 + a1 pct_population_urban + a2 housing_value + u • housing_value = b0 + b1 family_income + b2 region2 + b3 region3 + b4 region4+ v • This is a triangular system because there are two equations and one endogenous right hand side variable (housing_value) • If u and v are correlated, the OLS estimate of a2 will be biased in the rent model

If we ignore the endogeneity problem and estimate the rent model using simple OLS: • reg rent housing_value pct_population_urban • To take account of the potential endogeneity problem we use the ivregress command: • ivregress estimator depvar1 [varlist1] (depvar2 = varlistiv) • estimator is either 2sls, liml or gmm • depvar1 is the dependent variable for the model which has an endogenous regressor • varlist1 are the exogenous variables in the model that has the endogenous regressor • depvar2 is the endogenous regressor • varlistiv are the exogenous variables that are believed to affect the endogenous regressor and that are excluded from the depvar1 model

The models that we want to estimate are: • rent = a0 + a1 pct_population_urban + a2 housing_value + u • housing_value = b0 + b1 family_income + b2 region2 + b3 region3 + b4 region4 + v • The rent model has an endogenous regressor: • ivregress 2sls rent pct_population_urban (housing_value = family_income region2 region3 region4) • ivregress liml rent pct_population_urban (housing_value = family_income region2 region3 region4) • ivregress gmm rent pct_population_urban (housing_value = family_income region2 region3 region4) • The housing_value model can be estimated using OLS as there are no endogenous regressors • reg housing_value family_income region2 region3 region4

We should test whether: • our chosen instruments are exogenous (i.e., they should be uncorrelated with the error term) and • it is valid to exclude some of them from the model that has the endogenous regressor. • If they are not exogenous or they should not be excluded, they are not valid instruments.

The tests for instrument validity are also known as tests of “over-identifying” restrictions because the tests can only be performed if the model with the endogenous regressor is overidentified • the tests assume that at least one of the chosen instruments is valid (unfortunately this assumption cannot be tested) • In our example, the instrumented housing_value variable is overidentified because four of the exogenous variables (family_income region2 region3 region4) are excluded from the rent model. • If we had excluded only one of these variables, the instrumented housing_value variable would have been “just” identified in which case it would not be possible to test for instrument validity.

We obtain the tests for instrument validity by typing estat overid after we run ivregress • ivregress 2sls rent pct_population_urban (housing_value = family_income region2 region3 region4) • estat overid • These tests are statistically significant, so we reject the null hypothesis that the chosen instruments are valid.

This is not surprising because we did not have good reason to assume that they are exogenous and validly excluded from the rent model. • For example: • family_income is endogenous if family incomes depend on housing values and rents • Why would this be true? • rents may be different across the four regions, so the region dummies should not be excluded from the rent model
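To see how an instrument that should not have been excluded shows up in the test, the sketch below builds a simulated dataset in which one instrument (z2) has a direct effect on the outcome. All names and numbers are hypothetical. The statistic computed by hand is Sargan's N times R-squared from a regression of the 2SLS residuals on all the exogenous variables, which is one of the statistics that estat overid reports after 2SLS.

```stata
* Over-identification test with one invalid instrument (hypothetical simulated data)
clear
set seed 777
set obs 5000
generate common = rnormal()
generate u  = common + rnormal()
generate v  = common + rnormal()
generate x  = rnormal()
generate z1 = rnormal()
generate z2 = rnormal()
generate y2 = 1 + 0.5*x + z1 + z2 + v
* z2 enters the y1 equation directly, so excluding it is not valid
generate y1 = 2 + y2 + 0.5*x + 0.5*z2 + u

ivregress 2sls y1 x (y2 = z1 z2)
estat overid
predict double uhat, residuals

* Sargan statistic by hand: N * R-squared, with 2 instruments - 1 endogenous regressor = 1 df
regress uhat x z1 z2
scalar sargan   = e(N)*e(r2)
scalar p_sargan = chi2tail(1, sargan)
display "Sargan chi2(1) = " sargan "   p-value = " p_sargan
```

In this design the statistic should be very large, which matches the interpretation in the housing example: a rejection signals that at least one exclusion restriction is doubtful, although the test cannot say which one.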

We obtain different statistics for the tests of instrument validity if the models are estimated using LIML or GMM • However, the conclusions are the same as in our previous example: • ivregress liml rent pct_population_urban (housing_value = family_income region2 region3 region4) • estat overid • ivregress gmm rent pct_population_urban (housing_value = family_income region2 region3 region4) • estat overid

Note that we cannot test for instrument validity when the endogenous regressor is just identified • This is because the test statistics are obtained under the assumption that at least one of the instruments is valid; with only one excluded instrument, there is no additional instrument left to test • For example: • ivregress 2sls rent pct_population_urban (housing_value = family_income) • estat overid • ivregress liml rent pct_population_urban (housing_value = family_income) • estat overid • ivregress gmm rent pct_population_urban (housing_value = family_income) • estat overid

We can also test whether the coefficient of the “endogenous” regressor is biased under OLS. • We obtain two Hausman tests for endogeneity bias by typing estat endogenous after we run ivregress • ivregress 2sls rent pct_population_urban (housing_value = family_income region2 region3 region4) • estat endogenous • (The Durbin statistic uses an estimate of the error term’s variance assuming that the variable being tested is exogenous whereas the Wu-Hausman statistic assumes that the variable being tested is endogenous) • Given these results, we may be tempted to reject the hypothesis that housing_value is exogenous • However, the Hausman tests for endogeneity bias are only reliable if the chosen instruments are valid. In our example they are not, and so we cannot draw conclusions about the potential for endogeneity bias.
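The logic of the endogeneity test can also be seen in a regression-based version: save the residuals from the first-stage regression and add them to the structural equation. A significant coefficient on the residuals suggests that the regressor is endogenous. The sketch below uses hypothetical simulated data in which the instrument is valid by construction and the regressor y2 is endogenous by construction.

```stata
* Regression-based endogeneity test (hypothetical simulated data)
clear
set seed 31415
set obs 10000
generate common = rnormal()
generate u  = common + rnormal()
generate v  = common + rnormal()
generate x  = rnormal()
generate z  = rnormal()
generate y2 = 1 + 0.5*x + z + v
generate y1 = 2 + y2 + 0.5*x + u

* First stage: keep the residuals
regress y2 x z
predict double vhat, residuals

* Augmented regression: the coefficient on vhat captures the link between u and v
regress y1 y2 x vhat
scalar c_vhat = _b[vhat]
scalar a1_cf  = _b[y2]
test vhat

* For comparison, the built-in tests after 2SLS
ivregress 2sls y1 x (y2 = z)
estat endogenous
```

In this design the vhat coefficient should be close to Cov(u, v) / Var(v) = 0.5, and the coefficient on y2 in the augmented regression should be close to the true value of 1. As the slide stresses, none of this is informative when the instruments themselves are invalid.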

Class exercise 5a • Using the fees.dta file, estimate the following models for audit fees and company size: • lnaf = a0 + a1 lnta + a2 big6 + u • lnta = b0 + b1 ln_age + b2 listed + v • where lnaf is the log of audit fees, lnta is the log of total assets, ln_age is the log of the company’s age in years, and listed is a dummy variable indicating whether the company’s shares are publicly traded on a market. • Is the instrumented lnta variable over-identified, just-identified, or under-identified? Explain. • Estimate the audit fee model using 2SLS. • Test the validity of the chosen instrumental variables. • Test whether the lnta variable is affected by endogeneity bias. • Verify that the test for instrument validity is not available if you change the model so that it is just-identified.

The key to estimating IV models is to find one or more “exogenous” variables that explain the endogenous regressor and that can be safely excluded from the main equation. • Unfortunately, most accounting studies that use IV regression do not attempt to justify why their chosen instruments are exogenous or why they can be excluded from the structural model. • As a result, Larcker and Rusticus (2010) criticize the way in which accounting studies have applied IV regression • A key problem is that the IV results can be very sensitive to the researcher’s choice of which variables to exclude from the structural model and, in many studies, these variables have been chosen arbitrarily

Larcker and Rusticus (2010) recommend that researchers justify their chosen instruments using theory or economic intuition • the estat overid test should not be used to select instruments on purely statistical grounds particularly as the test is invalid if all of the chosen instruments are also invalid • When testing instrument validity (estat overid) and endogeneity bias (estat endog), it is also important to consider your sample size: • in large samples, the tests may reject a null hypothesis that is “nearly true”. • in small samples, the tests may fail to reject a null hypothesis that is “very false”.

5.3.2 Estimating simultaneous equations using 3SLS (reg3) • So far we have been examining a triangular system. For example, Y2it affects Y1it but Y1it does not affect Y2it • Y1it = a0 + a1 Y2it + a2 Xit + a3 Z2it + uit • Y2it = b0 + b2 Xit + b3 Z1it + vit • In a simultaneous system, both dependent variables affect each other • Y1it = a0 + a1 Y2it + a2 Xit + a3 Z2it + uit • Y2it = b0 + b1 Y1it + b2 Xit + b3 Z1it + vit

Y1it = a0 + a1 Y2it + a2 Xit + a3 Z2it + uit (1) • Y2it = b0 + b1 Y1it + b2 Xit + b3 Z1it + vit (2) • In this case, the OLS estimates are biased because: • Eq. (1) shows that uit affects Y1it while eq. (2) shows that Y1it affects Y2it. As a result, it must be true that uit is correlated with Y2it in eq. (1). Therefore, the OLS estimate of a1 would be biased in eq. (1). • Eq. (2) shows that vit affects Y2it while eq. (1) shows that Y2it affects Y1it. As a result, it must be true that vit is correlated with Y1it in eq. (2). Therefore, the OLS estimate of b1 would be biased in eq. (2).
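The argument above can be made explicit by solving the two simultaneous equations for Y2it. The derivation below substitutes the Y1it equation into the Y2it equation and assumes that a1 b1 is not equal to 1, so that the system has a solution.

```latex
\[
Y_{2it} = \frac{(b_0 + b_1 a_0) + (b_1 a_2 + b_2)\, X_{it} + b_1 a_3\, Z_{2it} + b_3\, Z_{1it} + b_1 u_{it} + v_{it}}{1 - a_1 b_1}
\]
\[
\operatorname{Cov}(Y_{2it}, u_{it}) = \frac{b_1 \operatorname{Var}(u_{it}) + \operatorname{Cov}(v_{it}, u_{it})}{1 - a_1 b_1}
\]
```

Because uit appears in the solution for Y2it whenever b1 is non-zero, Y2it is correlated with uit even if the two error terms are uncorrelated with each other. This is what distinguishes the simultaneous system from the triangular one.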

For example, it seems reasonable to argue that housing values depend on rents as well as rents depending on housing values: • rent = a0 + a1 housing_value + a2 pct_population_urban + u • housing_value = b0 + b1 rent + b2 family_income + b3 region2 + b4 region3 + b5 region4+ v • Note that for identification, each equation must contain at least one exogenous variable that is not included in the other equation. These are: • pct_population_urban in the rent model • family_income, region2 - region4 in the housing_value model

We estimate this kind of model using the reg3 command • reg3 (depvar1 varlist1) (depvar2 varlist2) • use "J:\phd\housing.dta", clear • reg3 (rent = housing_value pct_population_urban) (housing_value = rent family_income region2 region3 region4) • Unfortunately, the estat overid and estat endogenous commands are not currently available with reg3
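A simulated example can show reg3 recovering the parameters of a simultaneous system. The data below are hypothetical: the true values a1 = 0.5 and b1 = 0.5, the variable names and the sample size are all assumptions chosen for illustration. The data are generated from the solved form of the system, so that both structural equations hold at the same time.

```stata
* Simultaneous equations estimated by 3SLS (hypothetical simulated data)
clear
set seed 2718
set obs 10000
generate x  = rnormal()
generate z1 = rnormal()
generate z2 = rnormal()
generate u  = rnormal()
generate v  = rnormal()

* Structural model being simulated:
*   y1 = 1 + 0.5*y2 + 1*x + 1*z2 + u
*   y2 = 1 + 0.5*y1 + 1*x + 1*z1 + v
* Solving the two equations for y2 gives the next line (1 - 0.5*0.5 = 0.75)
generate y2 = (1.5 + 1.5*x + 0.5*z2 + z1 + 0.5*u + v) / 0.75
generate y1 = 1 + 0.5*y2 + x + z2 + u

* z1 is excluded from the y1 equation and z2 is excluded from the y2 equation
reg3 (y1 y2 x z2) (y2 y1 x z1)
scalar a1_3sls = _b[y1:y2]
scalar b1_3sls = _b[y2:y1]
display "a1 = " a1_3sls "   b1 = " b1_3sls
```

Each equation is identified here only because it omits one exogenous variable that appears in the other equation, which mirrors the identification condition stated for the rent and housing_value models.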

5.4 When the endogenous right hand side variable is binary • So far we have been dealing with the case where the endogenous regressor is continuous. • We may want to estimate a model in which the endogenous regressor is binary. • This brings us to a special class of models which are known as “self-selection” or “Heckman” models. “Selectivity” = “Endogeneity” where the endogenous regressor is binary • The basic idea is similar to the instrumental variable techniques that we have already discussed.

Examples of endogenous binary variables in accounting: • Companies decide whether to use hedge contracts (Barton, 2001; Pincus and Rajgopal, 2002). • Companies decide whether to grant stock options (Core and Guay, 1999). • Companies decide whether to hire Big 5 or non-Big 5 auditors (e.g., Chaney et al., 2004). • Governments decide whether to fully or partially privatize (Guedhami and Pittman, 2006). • Companies decide whether to follow international financial reporting strategy (Leuz and Verrecchia, 2000). • Companies decide whether to recognize financial instruments at fair value or disclose (Ahmed et al., 2006). • Companies decide whether or not to go private (Engel et al., 2002).

Selection model • Concerns about selectivity arise when the RHS dummy variable (D) is endogenous: • Y = a0 + a1 X + a2 D + u • Endogeneity results in bias if E(u | D) ≠ 0. If u is correlated with v, the error term in the model that determines D, then E(u | D) ≠ 0, in which case the OLS estimate of the effect of D on Y would be biased.

Selection model • The intuition underlying Heckman is to estimate and then control for E(u | D). First model the choice of D: • D = 1 if b0 + b1 Z + v > 0, and D = 0 otherwise • Z is a vector of exogenous variables that affect D but have no direct effect on Y.

Selection model • [Diagram: Z affects D, and D affects Y; Z has no direct effect on Y]

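Selection model • Estimate E(u | D) and include it as a control variable on the RHS of the Y model: • E(u | D) = ρσ IMR, where ρ captures the correlation between u and v, σ is the standard deviation of u, and IMR is the inverse Mills ratio: • IMR = φ / Φ if D = 1 • IMR = −φ / (1 − Φ) if D = 0 • Here φ and Φ are the standard normal density and distribution functions, evaluated at the fitted index from the model for D.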

Selection model • The MILLS variable is added as a “control for selectivity” in the Y model: • Y = a0 + a1 X + a2 D + ρσ MILLS + ε • The OLS estimate of the effect of D on Y is now unbiased because E(ε | D) = 0. • The D and Y models can be estimated in two steps or estimated jointly using maximum likelihood (ML) • ML yields separate estimates of ρ and σ. • The two-step procedure yields an estimate of the product ρσ. • Under the null of no selectivity bias, ρ = 0 and ρσ = 0.

Class exercise 5b • We are going to look at a fictional dataset on 2,000 women. • use "J:\phd\heckman.dta", clear • sum age education married children wage • Suppose we believe that older and more highly educated women earn higher wages. Why would it be wrong to estimate the following model? • reg wage age education • Estimate a probit model to test whether women are more likely to be employed if they are married, have children, are older and more highly educated.

5.4 When the endogenous right hand side variable is binary (heckman) • It is easy to estimate the two-step Heckman model in STATA: • heckman depvar1 [varlist1], select(depvar2 = varlist2) twostep • where depvar1 is the dependent variable in the main equation, depvar2 is the dependent variable in the selection model, and varlist2 contains the explanatory variables in the selection model • Going back to our dataset on female wages: • heckman wage education age, select(emp = married children education age) twostep

The 657 censored observations are the women who are not in employment. • The Wald chi2 tests the overall significance of the model. • Women’s wages are higher if they are older and more highly educated • The probit model of employment is exactly the same as what we had before • Women are more likely to be in employment if they are married, have children, are more highly educated or older.

The lambda variable is simply the IMR that was estimated from the emp model • The IMR coefficient is 4.00 and statistically significant, so there is statistically significant evidence of a selection effect. • The IMR coefficient is the product of rho and sigma (ρσ) • Thus, 4.00 ≈ 0.67 × 5.95 (using the rounded estimates of rho and sigma)
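The two steps behind the heckman output can be reproduced by hand, which shows where the lambda coefficient comes from. The sketch below uses hypothetical simulated data rather than the wage dataset: the outcome y is observed only when d = 1, the true correlation between the errors is 0.5 and the standard deviation of u is 1, so the IMR coefficient should be near 0.5 × 1 = 0.5.

```stata
* Heckman two-step by hand (hypothetical simulated data)
clear
set seed 4242
set obs 20000
generate z = rnormal()
generate x = rnormal()
generate v = rnormal()
* u has standard deviation 1 and correlation 0.5 with v
generate u = 0.5*v + sqrt(0.75)*rnormal()
generate d = (0.5 + z + 0.5*x + v > 0)
generate y = 1 + 2*x + u if d == 1

* Step 1: probit for selection, then the inverse Mills ratio for the selected cases
probit d z x
predict double zg, xb
generate double imr = normalden(zg)/normal(zg)

* Step 2: add the IMR to the outcome regression on the selected sample
regress y x imr
scalar lam_manual = _b[imr]
display "IMR coefficient by hand: " lam_manual

* For comparison, the built-in two-step estimator
heckman y x, select(d = z x) twostep
```

The by-hand IMR coefficient should match the lambda reported by heckman, but the standard errors from the second regression are not corrected for the estimated IMR, which heckman does take into account.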

Class exercise 5c • Estimate the following audit fee models separately for Big 6 and Non-Big 6 audit clients: • lnaf = a0 + a1 lnta + u (1) • lnaf = a0 + a1 lnsales + u (2) • where lnaf = log of audit fees, lnta = log of total assets, lnsales = log of sales • Use the heckman command to “control” for endogeneity with respect to the company’s selected auditor. Your auditor choice models are as follows: • big6 = b0 + b1 lnsales + b2 lnta + v • nbig6 = c0 + c1 lnsales + c2 lnta + w • where big6 = 1 (big6 = 0) if the company chooses a Big 6 (Non-Big 6) auditor; and nbig6 = 1 (nbig6 = 0) if the company chooses a Non-Big 6 (Big 6) auditor.

Class exercise 5c • What exclusion restrictions are you imposing in equations (1) and (2)? • Is there statistically significant evidence of selectivity? • For the two different specifications of the audit fee model: • what are the signs of the MILLS coefficients? • what are the signs of rho?

Treatment effects model • In exercise 5c, we estimated the audit fee models separately for the Big 6 and non-Big 6 audit clients • To do this, we used the heckman command • Suppose that we want to estimate one audit fee model with Big 6 on the right hand side of the equation (i.e., we assume that the X coefficients are the same in the two equations)

Treatment effects model • We can estimate this model using the treatreg command • treatreg lnaf lnta, treat(big6 = lnta lnsales) twostep • treatreg lnaf lnsales, treat(big6 = lnta lnsales) twostep • If we don’t specify the twostep option we will get the ML estimates • sometimes the ML model will not converge due to a nonconcave likelihood function • treatreg lnaf lnta, treat(big6 = lnta lnsales)
