Labour Market Evaluation: Theory and Practice Seamus McGuinness 8th November 2013
Why is Evaluation Necessary? • It assesses the extent to which policy initiatives are achieving their expected targets and goals. • Drawing on this, the evaluator will identify the nature of any shortfalls in either programme delivery or the stated objectives. • Value for money from the perspective of the taxpayer is also likely to be a dominant feature of any evaluation. • It is a key mechanism for policy challenge within society: it helps ensure that policy is evidence based and that ineffective programmes are modified or closed.
What are the most common forms of labour market evaluation? • Generally, labour economists tend to focus on impact evaluation (is the programme achieving its desired impacts?). • Process evaluation (is the programme being delivered as intended?) is less common. • However, in practice most impact evaluations will also consider the efficiency of programme delivery and implementation. • The bulk of impact evaluations focus on labour market programmes that are designed to improve outcomes related to employment, earnings and labour market participation.
Main Barriers to Effective Independent Evaluation • Lack of an evaluation culture: policy makers may view evaluation as a threat and actively seek a less rigorous form of assessment. • The organisation being evaluated has the power to set the terms of reference and is invariably involved in choosing the evaluating body. • Stemming from this, little consideration is often given to programme evaluation at the programme design and implementation stage (often there is no viable control group with which to assess the counterfactual). • Data constraints: the lack of available and “linkable” administrative datasets also makes proper evaluation difficult.
Measuring a programme's impact • Not at all straightforward: there have been instances where different researchers have arrived at very different conclusions regarding a programme's impact. • We basically need to know what would have happened to individuals had the programme not been in place, i.e. we attempt to measure the counterfactual. • There are various methods for estimating the counterfactual; however, they all generally rely on measuring the difference in outcomes between people participating in the programme (the treatment group) and those eligible for the programme but not participating in it (the control group).
The Selection Problem • Comparison of a treatment and control group is not straightforward: assignment to either is rarely random, so substantial differences may exist between the two groups that must be factored out. • Such differences can also arise as a consequence of ineffective control group construction. • Non-random selection refers to the possibility that (a) programme administrators engaged in “picking winners” in order to ensure the programme's success or (b) more capable individuals are more likely to put themselves forward for intervention. • Failure to account for this will result in a serious over-estimate of the programme's effectiveness.
Programme design and control group delivery • Piloting the programme. • Rolling out the programme to different areas at different times. • Ensuring access to administrative data on the targeted population (for instance Live Register data). • Keeping records of unsuccessful applicants to the programme in instances where the demand for programme places exceeds supply.
Ineffective Control Group Construction • In evaluating the National Employment Action Plan (NEAP) in 2005, Indecon consultants compared a treatment group of 1,000 NEAP claimants (by definition first-time claimants) with a control group of 225 unemployed (non-NEAP) individuals taken from the ECHPS, 58% of whom were already long-term (LT) unemployed at the initial point of observation. By definition, none of the NEAP treatment group will have been LT unemployed. • Indecon then compared the unemployment rates of the control and treatment groups 24 months down the line and concluded that the treatment group fared much better and that the NEAP programme was, therefore, effective. • Does this represent a like-for-like comparison?
Methods Used for Overcoming the Selection Problem • Difference in Difference Estimator: this is a two-period estimator and requires that the treatment is introduced in the second time period. It is more powerful than it may seem, as it will eradicate non-random selection based on unobserved attributes (picking winners etc). • Matching Estimators: these try to match control and treatment group members on observable characteristics (education, age, labour market history etc) to ensure a like-for-like comparison (consider the earlier NEAP example). They may still be prone to unobserved influences. • Other methods do exist, such as the use of controlled experiments, but these are rarely seen in the context of labour market evaluation.
Difference in Difference • Period 1: Outcome Y (say earnings) is determined by observable characteristics X (age, education, labour market experience etc) and by unobservable factors I that do not change over time (innate ability, motivation etc). • Period 2: The outcome variable is determined as in period 1, but now a labour market training programme (a treatment T) is present. • By differencing across the same individuals in the two periods we can both isolate T and remove the impact of time-invariant (and often unobserved) factors.
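The differencing argument above can be written out as follows. This is only a sketch: the coefficient symbols (\alpha, \beta, \gamma, \delta), the error term \varepsilon and the assumption of a linear model with period-constant coefficients are notation introduced here, not taken from the slides.

```latex
\begin{align}
Y_{i1} &= \alpha_1 + \beta X_{i1} + \gamma I_i + \varepsilon_{i1}
  && \text{(period 1, no programme)} \\
Y_{i2} &= \alpha_2 + \beta X_{i2} + \gamma I_i + \delta T_i + \varepsilon_{i2}
  && \text{(period 2, programme present)} \\
Y_{i2} - Y_{i1} &= (\alpha_2 - \alpha_1) + \beta (X_{i2} - X_{i1}) + \delta T_i
  + (\varepsilon_{i2} - \varepsilon_{i1})
  && \text{(difference)}
\end{align}
```

The term \gamma I_i appears in both periods and so cancels in the difference, which is why time-invariant unobserved factors drop out while the treatment effect \delta remains.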
Example of a difference in difference approach • Say we plan to introduce a new unemployment activation measure in County Dublin in June 2013. • Our control group would be the rest of the country, which would not receive the measure (until perhaps 2014). We would estimate a model comparing exits from unemployment in Dublin with those in the rest of Ireland over both periods (2012 and 2013). • Any change in the margin of difference between Dublin exit rates and those in the rest of Ireland across the two periods will be interpreted as the impact of the programme.
Model Estimation • dt = dummy variable for the treatment group (Dublin area); it will pick up any differences between the treatment and control groups prior to the policy change. • T is a dummy variable for time period 2 and measures the extent to which the value of y increased or fell in period 2 independent of anything else (here T denotes the time period, not the treatment). • dt*T will equal 1 for those individuals in the treatment group receiving intervention in the second period. Its coefficient is therefore a measure of the impact of the policy.
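A minimal sketch of how the three terms above might be estimated, assuming the regression takes the form y = b0 + b1*dt + b2*T + b3*(dt*T) + e and is fitted by ordinary least squares as a linear probability model. The coefficient names, the simulated exit probabilities and the 0.10 "true" effect are all hypothetical and chosen only for illustration.

```python
import numpy as np

def did_estimate(y, dt, T):
    """Fit y = b0 + b1*dt + b2*T + b3*(dt*T) + e by OLS; return (b0, b1, b2, b3)."""
    y = np.asarray(y, dtype=float)
    dt = np.asarray(dt, dtype=float)
    T = np.asarray(T, dtype=float)
    X = np.column_stack([np.ones_like(y), dt, T, dt * T])
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefs

# Hypothetical simulated data: y = 1 if the claimant exited unemployment.
rng = np.random.default_rng(0)
n = 40_000
dt = rng.integers(0, 2, size=n)   # 1 = Dublin (treatment area), 0 = rest of Ireland
T = rng.integers(0, 2, size=n)    # 1 = period 2 (2013), 0 = period 1 (2012)
TRUE_EFFECT = 0.10                # assumed programme impact on the exit probability
p_exit = 0.30 + 0.05 * dt - 0.04 * T + TRUE_EFFECT * dt * T
y = (rng.random(n) < p_exit).astype(float)

b0, b1, b2, b3 = did_estimate(y, dt, T)

def exit_rate(d, t):
    """Mean exit rate for area d in period t."""
    return y[(dt == d) & (T == t)].mean()

# The same quantity computed directly from the four group means.
did_by_hand = (exit_rate(1, 1) - exit_rate(1, 0)) - (exit_rate(0, 1) - exit_rate(0, 0))

print(f"interaction coefficient b3:        {b3:.4f}")
print(f"difference in differences of means: {did_by_hand:.4f}")
```

With only the two dummies and their interaction in the model, the interaction coefficient is numerically identical to the change in the Dublin exit rate minus the change in the rest-of-Ireland exit rate. In practice the model would also include the observable characteristics X.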
Difference in Difference • Really powerful tool for eradicating unobserved bias (“picking winners”, self-selection etc). • Requires little data. • Requires that policy be implemented in a rolled-out fashion, e.g. across regions and across time; this is not always appreciated by policy makers. • Is it sufficient to deal with selection bias on observables?
Propensity Score Matching • This technique allows us to deal explicitly with the problem of differences in the characteristic make-up of the control and treatment groups that have the potential to bias our estimate of the programme impact. • For example, say we have an active labour market programme aimed at reducing unemployment and the control group contains a higher proportion of LT unemployed. Failure to control for this will upwardly bias the estimated programme impact as the control group, almost by definition, has a lower likelihood of labour market success even before any programme has started. • Basically, chances are that if you compare the proportions in employment of both groups at a future point in time, in the absence of any labour market programme, the treatment group will have performed better. • Thus the problem we must confront is that the estimated programme impact is simply being driven up by, or is entirely attributable to, differences in the characteristic make-up of our control and treatment groups.
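The size of this composition effect can be illustrated with simple arithmetic. The sketch below assumes a programme with no effect at all; the employment probabilities (0.60 for short-term and 0.30 for LT unemployed) are hypothetical, and the 58% LT share in the control group simply echoes the earlier NEAP example.

```python
def expected_employment_rate(share_lt, p_short, p_long):
    """Expected employment rate of a group given its share of LT unemployed."""
    return (1 - share_lt) * p_short + share_lt * p_long

# Hypothetical probabilities of being in employment at follow-up,
# identical for treated and untreated people (i.e. the programme does nothing).
P_SHORT = 0.60   # short-term unemployed
P_LONG = 0.30    # long-term (LT) unemployed

treatment_rate = expected_employment_rate(0.00, P_SHORT, P_LONG)  # no LT unemployed
control_rate = expected_employment_rate(0.58, P_SHORT, P_LONG)    # 58% LT unemployed

naive_impact = treatment_rate - control_rate
print(round(treatment_rate, 3), round(control_rate, 3), round(naive_impact, 3))
```

Under these assumed numbers the naive comparison attributes an impact of about 17 percentage points to a programme that has none, purely because of the different make-up of the two groups.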
What does PSM do? • It is a method that allows us to match the treatment and control groups on the basis of observable characteristics to ensure we are making a like-for-like comparison. • After matching has been completed, we simply compare the mean outcomes (e.g. employment rates) of the control and treatment groups to see which is higher.
How do we match? • We estimate a probit (1,0) model of treatment group membership. This identifies the main characteristics that separate the control group from the treatment group. • Every member of the control and treatment groups is then given a probability of being assigned to the treatment group based on their characteristics. • Each member of the treatment group is then “matched” with a member of the control group with a similar probability score. • It can be shown that matching on the probability score is equivalent to matching on actual characteristics. • This process ensures that the treatment and control groups are similar in terms of their observable characteristics.
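The matching steps above can be sketched as follows. This is an illustration under stated assumptions rather than the procedure used in any particular study: the data are simulated, the covariates `x1` and `x2` and the true effect of 1.0 are hypothetical, the probit is fitted with a hand-written Fisher scoring routine, and matching is single nearest neighbour with replacement.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)

def norm_cdf(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + _erf(z / math.sqrt(2.0)))

def norm_pdf(z):
    """Standard normal density."""
    return np.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi)

def fit_probit(X, d, n_iter=25):
    """Probit P(d=1|X) = Phi(X b), fitted by Fisher scoring. X includes a constant."""
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        z = X @ b
        p = np.clip(norm_cdf(z), 1e-8, 1 - 1e-8)
        phi = norm_pdf(z)
        weight = phi ** 2 / (p * (1 - p))
        score = X.T @ ((d - p) * phi / (p * (1 - p)))
        info = (X * weight[:, None]).T @ X
        b = b + np.linalg.solve(info, score)
    return b

def match_nearest(pscore, d):
    """For each treated unit, find the control with the closest score (with replacement)."""
    treated = np.flatnonzero(d == 1)
    controls = np.flatnonzero(d == 0)
    gaps = np.abs(pscore[treated][:, None] - pscore[controls][None, :])
    return treated, controls[gaps.argmin(axis=1)]

# Hypothetical data: disadvantaged people (high x1, low x2) are less likely to be treated.
rng = np.random.default_rng(1)
n = 2000
x1 = rng.normal(size=n)   # e.g. standardised prior unemployment duration
x2 = rng.normal(size=n)   # e.g. standardised years of education
d = (-0.8 * x1 + 0.5 * x2 + rng.normal(size=n) > 0).astype(int)
TRUE_EFFECT = 1.0
y = 2.0 + TRUE_EFFECT * d - 1.5 * x1 + 1.0 * x2 + rng.normal(scale=0.5, size=n)

X = np.column_stack([np.ones(n), x1, x2])
pscore = norm_cdf(X @ fit_probit(X, d))          # probability of treatment
treated, matched = match_nearest(pscore, d)

naive = y[d == 1].mean() - y[d == 0].mean()      # unmatched comparison of means
att = (y[treated] - y[matched]).mean()           # matched comparison of means
gap_before = x1[d == 1].mean() - x1[d == 0].mean()
gap_after = x1[treated].mean() - x1[matched].mean()

print(f"naive difference: {naive:.2f}, matched estimate: {att:.2f}")
print(f"x1 gap before matching: {gap_before:.2f}, after: {gap_after:.2f}")
```

The last line is a simple balance check of the kind needed to confirm that matching worked: the gap in `x1` between the two groups should shrink sharply once each treated person is compared only with their matched control. Because selection here depends only on observed covariates, matching can recover the assumed effect; it would not do so if selection also depended on unobserved factors.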
Matching • Clearly again a powerful tool and the most effective for tackling the sample selection problem. • Requires a lot of data and additional checks to ensure that matching was successful and all observable differences between the control and treatment groups were eradicated. • Does not deal with unobserved bias.
Carrots without sticks: an evaluation of active labour market policy in Ireland Seamus McGuinness, Philip O'Connell & Elish Kelly
Overview • This study focuses on assessing the effectiveness of the Job Search Assistance (JSA) component of the National Employment Action Plan (NEAP). The NEAP is Ireland's principal tool for activating unemployed individuals back into the labour market. • Under the NEAP, individuals registering for unemployment benefit are “automatically” referred to FÁS for an interview after 13 weeks on the system. The FÁS interview is aimed at helping claimants back into work through advice and placement, and at referring others for further training. • Individuals with previous exposure to NEAP (i.e. those with a previous history of unemployment) are excluded and will not be referred to FÁS for a second time. • NEAP was distinct in an international sense in that it was characterised by an almost complete absence of monitoring and sanctions. Unusually, it did not appear to hinge on the principle of mutual obligation.
Evaluation Objectives • To assess the extent to which individuals participating in the NEAP were more likely to find employment relative to non-participants. • To assess the extent to which individuals in receipt of both interview and training had enhanced employment prospects relative to those in receipt of interview only (impact of training). • We are going to focus on the effectiveness of the referral and interview process.
Problem 1: No control group? • Selection under the NEAP is automated and universal. If all claimants are automatically sent for interview at week 13 of their claim, then how can we construct a counterfactual? • Remember that the counterfactual assesses what happens to individuals in the absence of the programme. • The only people not exposed to the programme are those already in employment by week 13. This rules out difference-in-difference for a start. • The problem illustrates very clearly that the need for proper evaluation was not a major consideration in the programme's design or implementation.
What can we do? • The only option is to utilise the fact that individuals with previous exposure to NEAP can’t access it again (a totally counter-intuitive rule, as those most in need of support were basically being excluded from the outset). • We take as an initial control group individuals who had previous exposure to NEAP more than two years prior to the study and whose contact was limited to a FÁS interview. • Given the time lapse and changing macroeconomic conditions, any advice received by the control group should have declined in relevance, allowing some assessment of the impact of the programme. • Even if the above were true, we are still left with a selection problem as, prior to the study, all of the control group will have had a previous unemployment spell of at least 13 weeks whereas none of the treatment group will. • This difference cannot be eradicated by matching and our estimates are unlikely to be free of bias.
Construction of the Evaluation Data • [Figure: data-linkage diagram with the following boxes: Profiling Questionnaire Information for Claimant Population, Issued June to September 2006; Weekly Population of Live Register Claimants (September 2006 to June 2008); Live Register Claimant Population; Weekly Population of Live Register Claimant Closure Files; FÁS Events Histories; Dataset for NEAP Evaluation.]
New Control Group Found? • On linking the data we found that around 25% of new claimants were not being referred by DSP to FÁS after 13 weeks of unemployment, despite these individuals having no previous exposure to the NEAP. • We need to establish what is going on here: are we missing something in terms of the referral process and, if not, what factors are driving the omission and are they random? • A list containing the PPS numbers of our potential new control group was sent to DSP for validation.
Validation checks • DSP confirmed that these individuals had fallen through the net. • No concrete explanation was found. Most likely, individuals were not referred when the number of referrals in a DSP office exceeded the slots in the local FÁS office, and they were subsequently overlooked when slots became available. • Even before we begin we have uncovered major problems with programme processes, i.e. 25% of potential claimants excluded and a further 25% missed. • Clear example of how process evaluation becomes a component of an impact evaluation.
The control group • A natural experiment?
Data and methods • In terms of econometrics, we estimate probit and matching models augmented by additional checks for unobserved heterogeneity bias. • All models contain a wide range of controls for educational attainment, health, location attributes, access to transport, age, marital status, labour market history etc, that were available to us as a consequence of the profiling data.
What are the descriptives telling us? • The treatment group and control group I look very similar, which would suggest that the “process” that generated control group I was random in nature. • There are more substantial differences between the treatment group and control group II, in that the latter tends to be more disadvantaged in terms of observable characteristics. Potential for selection bias here.
Regular Probit • These will give us an initial estimate, which may or may not be biased, of the effectiveness of NEAP. • We want to see that the data are sensible and that relationships move in the correct direction. This is important both in terms of the probit estimate and the reliability of any subsequent matching. • Provides us with an assurance that there is nothing weird happening with our data. • Note: the models measure the impact of variables on the claimant's probability of exiting the Live Register before 12 months.
Summary and conclusions - I • Strong and consistent evidence that JSA delivered under the NEAP was highly ineffective and actively reduced transitions off the Live Register to employment. • Two possibilities arise: (i) claimants received poor advice or (ii) claimants relaxed the intensity of job-search on learning of the absence of monitoring and sanctions. • Advice explanation not supported by results as we would expect the negative impact to fall away in medium term models as claimants adjust behaviour.
Summary and conclusions - II • We conclude that participants attending the interview quickly learnt that their prior fears with respect to the extent of job search monitoring and sanctions were unjustified, and consequently lowered their job search activity levels. • Note: - The analysis was found to be robust to the influences of both sample selection and unobserved heterogeneity. - Strong negative JSA effects were also generated using other estimation techniques (Cox Proportional Hazard Model).
How Reliable are our results? • We controlled for a wide range of observables, implying that unobserved factors should be less of a concern. • Sensitivity tests seemed to confirm this. • We had a highly representative control group. • Still, while the PSM framework allows us to test the sensitivity of estimates to unobserved bias, it does not eradicate it. • We are seeing the increased use of combined PSM and diff-in-diff methods as a means of ensuring that evaluation estimates are free from both selection bias (on observables) and unobserved bias (picking winners etc).