Presentation Transcript


  1. “Missing” Something? Using the Heckman Selection Model in Policy Research. GWIPP Policy Research Methods Workshop, March 3, 2010. Matt Dull (mdull@vt.edu), Center for Public Administration & Policy

  2. Sample Selection Bias • Selection bias is a pervasive problem in policy research. • This presentation offers a nontechnical introduction to models designed to correct sample selection bias in a regression context. • I’ll describe how variations on Heckman’s (1976) classic model, designed to correct for bias due to missingness in a regression model’s dependent variable, produce unbiased parameter estimates and yield potentially rich opportunities for (cautious) inference.

  3. Problem = Opportunity • I’ll show how the full maximum likelihood Heckman model is implemented in Stata. • I’ll describe two applications from my own research, where theory predicts missingness in the dependent variable and variations on the Heckman model yield substantively interesting results. • The first application comes from analysis of survey data with a large number of “I don’t know” or “No-basis to judge” responses; • The second looks at the allocation of resources through a federal competitive grant program.
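As a preview of the Stata syntax (a minimal sketch, not the models from the applications below; y, x1, x2, and z1 are hypothetical variable names, and observed is a 0/1 indicator for non-missing values of y):

* full maximum likelihood Heckman selection model;
* z1 is an exclusion restriction appearing only in the selection equation
heckman y x1 x2, select(observed = x1 x2 z1)

If select() is given only a variable list, Stata builds the selection indicator itself from whether y is missing.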

  4. Missing! • Anyone who performs statistical analysis eventually encounters problems of missing data. • In Stata and other statistical packages the default strategy for dealing with missing observations is listwise deletion. Cases with missing values are dropped from the analysis. • There are advantages to this strategy. It is simple, can be applied to any kind of statistical analysis, and under a range of circumstances yields unbiased estimates (Allison 2001).
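To see listwise deletion in action, a quick sketch using Stata’s bundled auto dataset, where rep78 is missing for five of the 74 cars:

sysuse auto, clear
regress price mpg rep78
* Number of obs = 69, not 74: the five cases with
* rep78 missing are silently dropped

The same silent dropping happens in any estimation command, which is why it is worth checking the reported sample size against the full dataset.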

  5. Missing! • There are also some clear disadvantages to listwise deletion. • Listwise deletion wastes information, often resulting in the loss of a substantial number of observations. • If missingness in the dependent variable does not meet fairly strict assumptions of randomness, listwise deletion yields biased parameter estimates. • The assumption that data on the dependent variable are “missing at random” is defined in quite precise terms in Allison (2001) and Rubin (1976). For today’s purposes it is enough to say that bias arises if missingness on Y is related to the value of Y itself, controlling for the other variables in the model.
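Stated a bit more formally (my sketch of the Rubin/Allison definition, with M a 0/1 indicator for missingness on Y and X the other variables in the model), Y is missing at random when

\Pr(M = 1 \mid Y, X) = \Pr(M = 1 \mid X)

In words: once X is taken into account, whether Y is missing carries no further information about the value of Y. When this condition fails, listwise deletion yields biased estimates.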

  6. Why “No Basis”? • Contemporary survey researchers frame the decision to register “no basis to judge” (NB) or other non-response variants such as “don’t know” or “no opinion” as a function of three factors: cognitive availability, or whether a clear answer can be easily retrieved; a judgment about the adequacy of an answer given expectations; and communicative intent, or motivation (Beatty and Herrmann 2002). • NB respondents may feel uncertain or believe they lack necessary information, and in this sense the NB category enhances the validity of the measure.

  7. Why “No Basis”? • Or, an NB response may instead indicate ambivalence; the respondent may feel less uncertain than conflicted about the prospects and usefulness of reform. NB respondents may also wish to avoid sending an undesirable or unflattering signal. • Or, they may engage in “survey satisficing,” responding NB to avoid the effort of synthesizing opinions for which they have all the necessary ingredients (Krosnick 2002; Krosnick et al. 2002).

  8. tab gpra_answer

gpra_answer |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |      1,064       42.44       42.44
          1 |      1,443       57.56      100.00
------------+-----------------------------------
      Total |      2,507      100.00

probit gpra_answer leadership conflict_index

Iteration 0:   log likelihood = -1614.6834
Iteration 1:   log likelihood = -1597.7919
Iteration 2:   log likelihood = -1597.7897

Probit regression                             Number of obs   =       2387
                                              LR chi2(2)      =      33.79
                                              Prob > chi2     =     0.0000
Log likelihood = -1597.7897                   Pseudo R2       =     0.0105

------------------------------------------------------------------------------
 gpra_answer |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
  leadership |   .1476403   .0254997     5.79   0.000     .0976619    .1976187
conflict_i~x |   .0512077   .0220635     2.32   0.020     .0079641    .0944513
       _cons |  -.4254927   .1288101    -3.30   0.001    -.6779558   -.1730296
------------------------------------------------------------------------------

  9. The Heckman Model • The Heckman technique estimates a two-stage model: • First, a selection equation with a dichotomous dependent variable equaling 1 for observed and 0 for missing values of Y; • Second, an outcome equation predicting the model’s dependent variable. • The second stage includes an additional variable – the inverse Mills ratio – derived from the probit estimates.
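In notation (a standard statement of the model, added here for reference), the two equations are

z_i^* = w_i\gamma + u_i, \qquad z_i = 1 \text{ if } z_i^* > 0
y_i = x_i\beta + \varepsilon_i, \qquad y_i \text{ observed only when } z_i = 1

with (u_i, \varepsilon_i) assumed bivariate normal with correlation \rho. The first stage is a probit for z_i; the inverse Mills ratio computed from its estimates, \lambda_i = \phi(w_i\hat{\gamma})/\Phi(w_i\hat{\gamma}), enters the second stage as an additional regressor.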

  10. Some Cautions Kennedy (1998) states the two-stage Heckman model does not perform well when: • The errors are not distributed normally; • The sample size is small; • The amount of censoring is small; • The correlation between the errors of the regression and selection equations is small; or • The degree of collinearity between the explanatory variables in the regression and selection models is high. NOTE: The heckman and heckprob commands in Stata do not estimate Heckman’s original two-stage model, but full maximum likelihood censored regression and censored probit models.
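For anyone who does want Heckman’s original two-step estimates, Stata exposes them through the twostep option (same hypothetical variable names as the earlier sketch):

* two-step consistent estimates instead of full maximum likelihood
heckman y x1 x2, select(observed = x1 x2 z1) twostep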

  11. A Few Model Elements • Lambda – The residuals produced by the first-stage estimates generate a new variable, the inverse Mills ratio or lambda, which is included as a control variable in the second-stage equation. • Rho – The correlation between the errors in the two equations. If rho = 0, the likelihood function can be split into two parts: a probit for the probability of being selected and an OLS regression for the expected value of Y in the selected subsample. • Sigma – The standard deviation of the error in the outcome equation.
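These three elements fit together in one standard expression (added here for reference): among the selected observations,

E[y_i \mid z_i = 1] = x_i\beta + \rho\sigma\,\lambda(w_i\gamma), \qquad \lambda(c) = \phi(c)/\Phi(c)

so the second-stage coefficient on lambda estimates \rho\sigma, and it vanishes, making the selection ignorable, exactly when \rho = 0.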

  12. Censoring (Outcome Equation)

heckman USE_PM_WMIS conflict_index hcong data_index leadership resources ///
    know_gpra employ super_year, ///
    select(gpra_answer = conflict_index hcong data_index clim_lead resources ///
    know_gpra employ super_year gpra_inv_data gpra_inv_measure ///
    gpra_inv_goals head) nshazard(NS_Use) robust

Iteration 0:   log pseudolikelihood = -4059.8985
Iteration 1:   log pseudolikelihood = -4057.3885
Iteration 2:   log pseudolikelihood = -4057.1623
Iteration 3:   log pseudolikelihood = -4057.1623

Heckman selection model                       Number of obs   =       1778
(regression model with sample selection)      Censored obs    =        670
                                              Uncensored obs  =       1108
                                              Wald chi2(8)    =     242.08
Log pseudolikelihood = -4057.162              Prob > chi2     =     0.0000

------------------------------------------------------------------------------
             |               Robust
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
USE_PM_WMIS  |
conflict_i~x |   .1928649    .168657     1.14   0.253    -.1376969    .5234266
       hcong |   .3237611   .1536926     2.11   0.035     .0225292     .624993
  data_index |  -1.477338   .1812304    -8.15   0.000    -1.832543   -1.122133
  leadership |   1.622898     .20833     7.79   0.000     1.214578    2.031217
   resources |  -.2301818   .1710368    -1.35   0.178    -.5654079    .1050442
   know_gpra |  -1.000888    .380547    -2.63   0.009    -1.746746   -.2550295
      employ |   .4258168   .1341343     3.17   0.002     .1629184    .6887152
  super_year |   .0006872   .1701473     0.00   0.997    -.3327954    .3341699
       _cons |   23.39185   2.590072     9.03   0.000      18.3154     28.4683

  13. Selection Equation

-------------+----------------------------------------------------------------
gpra_answer  |
conflict_i~x |  -.0245066   .0357184    -0.69   0.493    -.0945134    .0455002
       hcong |   .0300361    .030737     0.98   0.328    -.0302072    .0902794
  data_index |   .1263006   .0402627     3.14   0.002     .0473871    .2052141
   clim_lead |  -.0007559   .0400518    -0.02   0.985     -.079256    .0777442
   resources |   .0215541   .0376913     0.57   0.567    -.0523195    .0954276
   know_gpra |   .7684442   .0442381    17.37   0.000     .6817391    .8551492
      employ |  -.0048647   .0388891    -0.13   0.900    -.0810858    .0713564
  super_year |   .0757006   .0375131     2.02   0.044     .0021764    .1492249
gpra_inv_d~a |   .1376488   .0954819     1.44   0.149    -.0494924    .3247899
gpra_inv_m~e |   .4304551   .1000318     4.30   0.000     .2343964    .6265138
gpra_inv_g~s |   .4039991   .0995959     4.06   0.000     .2087947    .5992034
        head |  -.2251226   .0770209    -2.92   0.003    -.3760809   -.0741644
       _cons |  -3.210954   .3191416   -10.06   0.000     -3.83646   -2.585448
-------------+----------------------------------------------------------------
     /athrho |  -.8394572   .1788731    -4.69   0.000    -1.190042   -.4888723
    /lnsigma |   1.726785   .0363139    47.55   0.000     1.655611    1.797959
-------------+----------------------------------------------------------------
         rho |  -.6855214   .0948136                     -.8305919   -.4533209
       sigma |    5.62255   .2041765                       5.23628    6.037313
      lambda |  -3.854378   .6499205                     -5.128199   -2.580558
------------------------------------------------------------------------------
Wald test of indep. eqns. (rho = 0):   chi2(1) = 22.02   Prob > chi2 = 0.0000
------------------------------------------------------------------------------

rho is significant!

  14. Interpretation: Another Caution Sweeney notes: “If a variable appears ONLY in the outcome equation the coefficient on it can be interpreted as the marginal effect of a one unit change in that variable on Y. If, on the other hand, the variable appears in both the selection and outcome equations the coefficient in the outcome equation is affected by its presence in the selection equation as well.”
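The reason can be stated compactly (a standard derivation, added here for reference). For a variable x_k with coefficient \beta_k in the outcome equation and \gamma_k in the selection equation, the marginal effect on the expected value of Y among the selected is

\frac{\partial\, E[y \mid z = 1]}{\partial x_k} = \beta_k + \rho\sigma\,\gamma_k\,\lambda'(w\gamma), \qquad \lambda'(c) = -\lambda(c)\left(c + \lambda(c)\right)

Only when the variable is excluded from the selection equation (\gamma_k = 0) does \beta_k alone give the marginal effect.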
