
Are Teacher-Level Value-Added Estimates Biased? An Experimental Validation of Non-Experimental Estimates


Presentation Transcript


  1. Are Teacher-Level Value-Added Estimates Biased? An Experimental Validation of Non-Experimental Estimates
Thomas J. Kane (HGSE) and Douglas O. Staiger (Dartmouth College)

  2. LAUSD Data
• Grades 2 through 5
• Three Time Periods:
   • Years before Random Assignment: Spring 2000 through Spring 2003
   • Years of Random Assignment: either Spring 2004 or Spring 2005
   • Years after Random Assignment: Spring 2005 (or 2006) through Spring 2007
• Outcomes (all standardized by grade and year):
   • California Standards Test (Spring 2004 through Spring 2007)
   • Stanford 9 Tests (Spring 2000 through Spring 2002)
   • California Achievement Test (Spring 2003)
• Covariates:
   • Student: baseline math and reading scores (interacted with grade), race/ethnicity (Hispanic, white, black, other or missing), ever retained, Title I, eligible for free lunch, gifted and talented, special education, English language development (levels 1-5)
   • Peers: classroom means of all of the above
• Fixed Effects: School x Grade x Track x Year
• Sample Exclusions:
   • Special education exclusion: classes with more than 20 percent special education students
   • Class size exclusion: classes with fewer than 5 or more than 36 students

  3. Experimental Design
• Sample of NBPTS applicants from the Los Angeles area.
• Sample of comparison teachers working in the same school, grade, and calendar track.
• The LAUSD chief of staff wrote letters to principals inviting them to draw up two classrooms that they would be willing to assign to either teacher.
• If the principal agreed, classroom rosters (not individual students) were randomly assigned to teachers by LAUSD on the day of the switch.
• Yielded 78 pairs of teachers (156 classrooms and 3,500 students) for whom we had estimates of “value-added” impacts from the pre-experimental period.

  4. Step 1: Estimate a Variety of Non-Experimental Specifications Using Pre-Experimental Data
• Generate Empirical Bayes estimates (VAj) of teacher effects using a variety of specifications of A, X.
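A minimal sketch of the kind of specification implied by the covariates on Slide 2 and the discussion of β0 on Slide 20 (the exact functional form on the original slide may differ):

A_{ijt} = \beta_0 A_{ij,t-1} + X_{ijt}'\gamma + \bar{X}_{c(i,t)}'\lambda + \theta_{sgvt} + \mu_j + \varepsilon_{ijt}

Here A_{ijt} is the test score of student i with teacher j in year t, X are the student covariates, \bar{X} the classroom peer means, \theta_{sgvt} the school x grade x track x year fixed effect, and \mu_j the teacher effect that is shrunk in the Empirical Bayes step. The specifications differ mainly in how the lagged score enters: β0 = 0 (no controls), β0 = 1 (gains), or β0 estimated.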

  5. Step 2: Test Validity of VAj in Predicting Within-Pair Experimental Differences
• At the classroom level: differencing within each pair, p = 1 through 78.
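A minimal sketch of the within-pair test, assuming end-of-year classroom mean scores on the left-hand side (the regression actually run may include additional adjustments): for pair p, with experimentally assigned classrooms 1p and 2p,

\bar{A}_{2p} - \bar{A}_{1p} = \alpha + \beta\,\big(\widehat{VA}_{2p} - \widehat{VA}_{1p}\big) + u_p, \qquad p = 1,\dots,78

An unbiased non-experimental estimator implies β = 1; β < 1 would indicate that the non-experimental estimates overstate true differences between teachers.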

  6. Summary of Sample Comparisons
• The experimental sample of teachers was more experienced (15 vs. 10.5 years in LAUSD).
• The pre-experimental mean and s.d. of VAj were similar in the experimental and non-experimental samples.
• Could not reject the hypothesis of no relationship between VA2p − VA1p and differences in mean baseline characteristics.
• Could not reject the hypothesis of no differential attrition or teacher switching.

  7. Why would student fixed-effect models underestimate differences in teacher value-added?
• When we demean student data, we subtract off 1/T of the current teacher's effect (T = number of years of data on each student; see the sketch below).
   ⇒ We underestimate the magnitude of the teacher effect by 1/T (i.e., a degrees-of-freedom correction is needed).
• In our data, the typical student had 2 to 4 years of data, so the magnitude is biased down by 1/2 to 1/4.
• We subtract off even more of the teacher effect if some of the current teacher's effect persists into scores in future years (the FE model assumes no persistence).
   ⇒ Underestimate the magnitude by 1/T for the teacher in year T (since this teacher's effect appears only in the last year's score).
   ⇒ Underestimate the magnitude by more than 1/T for teachers in earlier years, with the downward bias largest for the first teacher.
• If the first teacher's effect were completely persistent, we would subtract off all of the effect and estimate no variance in first-year teacher effects.
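A small worked version of the no-persistence case, assuming T years of data per student and that the current teacher's effect \mu_j enters only the current year's score:

\tilde{A}_{iT} = A_{iT} - \frac{1}{T}\sum_{s=1}^{T} A_{is} = \Big(1 - \frac{1}{T}\Big)\mu_j + (\text{demeaned non-teacher components})

so the within-student estimator recovers only (1 − 1/T) of \mu_j: half of it with two years of data, three quarters with four.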

  8. Structural Model for Estimating Fade-out Parameter, δ
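The structural model itself is not transcribed here; a minimal sketch consistent with the notation on Slides 9 and 17, assuming geometric fade-out at rate δ per year, would be:

A_{ijt} = \mu_{j(i,t)} + \delta\,\mu_{j(i,t-1)} + \delta^{2}\,\mu_{j(i,t-2)} + \dots + \varepsilon_{ijt}

where \mu_{j(i,t)} is the value-added of the teacher student i has in year t, and δ is the fraction of a teacher's effect remaining one year later (a_1 on Slide 17).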

  9. IV Strategy for Estimating the Fade-Out Parameter (δ) in Non-Experimental Data
• We can rewrite the error component model with Aij,t−1 on the right-hand side (see the sketch below).
• OLS estimates of δ are biased, because Aij,t−1 is correlated with the error.
• Use prior-year teacher dummies to instrument for Aij,t−1.
• Assumes that prior-year teacher assignment is not correlated with the error term.
• Control for teacher or classroom fixed effects to capture current teacher/classroom effects.
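A minimal sketch of the rewritten (quasi-differenced) form and the IV logic, assuming the geometric fade-out model sketched above (the equation on the original slide may differ):

A_{ijt} = \delta\,A_{ij,t-1} + \mu_{j(i,t)} + \big(\varepsilon_{ijt} - \delta\,\varepsilon_{ij,t-1}\big)

Because A_{ij,t-1} contains \varepsilon_{ij,t-1}, it is correlated with the composite error, so OLS estimates of δ are biased. Dummies for the year t−1 teacher shift A_{ij,t-1} through \mu_{j(i,t-1)} and can serve as instruments, provided prior-year teacher assignment is uncorrelated with the composite error.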

  10. Joint Validity of Non-Experimental Estimates of δ and VAj

  11. Potential Sources of Fade-out
• Unused knowledge may become inoperable.
• Grade-specific content is not entirely reflected in future achievement (e.g., even if you have not forgotten logarithms, that knowledge may not help you in calculus).

  12. Potential Sources of Fade-out
• Unused knowledge becomes inoperable.
• Grade-specific content is not entirely relevant for future achievement (e.g., even if you have not forgotten logarithms, that knowledge may not help you in calculus).
• It takes more effort to keep students at a high performance level than at a low performance level.
• Students of the best teachers are mixed with students of the worst teachers in the following year, and the new teacher will focus effort on the students who are behind. (There would be no fade-out if all teachers were equally effective.)

  13. Is Teacher-Student Sorting Different in Los Angeles?

  14. Summary of Main Findings
• All non-experimental specifications provided information regarding the experimental outcomes, but those controlling for baseline score yielded unbiased predictions with the highest explanatory power.
• The experimental impacts in both math and English language arts seem to fade out at an annual rate of 0.4 to 0.6.
• Similar fade-out was observed non-experimentally.
• Depending on its source, fade-out has important implications for calculations of the long-term benefits of improvements in average teacher effects (see the illustration below).
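A hypothetical illustration of the last point (the numbers are illustrative, not results from the paper): suppose a teacher raises achievement by 0.2 student-level standard deviations in year t and roughly half of the effect persists each subsequent year (a_1 ≈ 0.5 on Slide 17). The effect remaining after k years is then

0.2 \times 0.5^{k}

about 0.10 SD after one year and 0.05 SD after two. Long-run benefit calculations that ignore fade-out would overstate the gains; how much depends on the source of the fade-out (e.g., if later teachers reallocate effort toward students who are behind, the underlying learning may not actually be lost).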

  15. Next steps:
• Test for “complementarities” in teacher effects across years (e.g., what is the effect of having a high- or low-value-added teacher in two consecutive years?). The current experiment won't help, but the STAR experiment might.

  16. Empirical Methods: 2. Generating Empirical Bayes Estimates of Non-Experimental Teacher Effects
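The formulas from this slide are not transcribed; a minimal sketch of a standard Empirical Bayes shrinkage estimator of the kind typically used for teacher value-added (the estimator actually used, e.g. one allowing for classroom-level shocks, may differ):

\widehat{VA}_j = \left(\frac{\hat{\sigma}^2_{\mu}}{\hat{\sigma}^2_{\mu} + \hat{\sigma}^2_{\varepsilon}/n_j}\right)\bar{r}_j

where \bar{r}_j is the average Step 1 residual for teacher j's students, n_j is the number of students, \hat{\sigma}^2_{\mu} is the estimated variance of true teacher effects, and \hat{\sigma}^2_{\varepsilon} is the residual variance. Teachers observed with fewer students are shrunk more heavily toward zero.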

  17. Why would current gains be related to prior teacher assignments?
• We find the teacher effect fading out.
• Let VA_t = value-added of the teacher in year t, and a_k = the fraction left after k years.
• Then A_t = VA_t + a_1 VA_{t-1} + a_2 VA_{t-2} + …, which implies gains include a fraction of the prior teacher's effect:
   (A_t − A_{t-1}) = VA_t + (a_1 − 1) VA_{t-1} + (a_2 − a_1) VA_{t-2} + …
• Our estimate of a_1 ≈ 0.5 implies:
   • The variance of the prior teacher effect would be roughly 25% of the variance of the current teacher effect (see the calculation below).
   • The prior teacher effect would enter with a negative sign.
• Does fade-out mean the non-structural approach would be biased? Do we need to estimate the full human capital production function?
   • Depends partially on the correlations among VA_{j,t}, VA_{j,t-1}, VA_{a,t-1}, …
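The 25% figure follows directly from the gain equation above, assuming teacher value-added is uncorrelated across years:

\mathrm{Var}\big[(a_1 - 1)\,VA_{t-1}\big] = (0.5 - 1)^2\,\mathrm{Var}(VA) = 0.25\,\mathrm{Var}(VA)

so the prior-year teacher contributes about one quarter of the variance contributed by the current teacher, and enters with coefficient a_1 − 1 = −0.5, i.e., a negative sign.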

  18. Why would current gains be related to future teacher assignments?
• Students are assigned to future teachers based on current performance (e.g., tracking, student sorting).
• This is why the unadjusted mean end-of-year score was a biased measure of teacher effects. (If differences in baseline scores were just random noise, mean student scores from the non-experimental period would have been a noisy but unbiased estimator.)
• In a value-added regression, this generates a relationship between future teacher assignment (in t+1) and the current end-of-year score (in t); that is, future teacher assignments are endogenous to current-year gains (a sketch follows below).
• We would therefore expect future teacher assignments to be related to current gains, as Rothstein (2007) reports.
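One minimal way to formalize the endogeneity, with notation assumed here rather than taken from the slide: let j(i, t+1) denote the teacher student i is assigned to in year t+1. If that assignment depends on the current score A_{it}, then

E\big[A_{it} - A_{i,t-1} \mid j(i,t+1) = k\big] \neq 0 \quad \text{for some } k

even if teacher k has no causal effect on year-t gains, so dummies for future teachers will “predict” current gains, which is the pattern Rothstein (2007) documents.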

  19. What is the variance in teacher effects on student achievement?
Non-Experimental Studies:
• Armour (1971), Hanushek (1976), McCaffrey et al. (2004), Murnane and Phillips (1981), Rockoff (2004), Hanushek, Rivkin and Kain (2005), Jacob and Lefgren (2005), Aaronson, Barrow and Sander (2007), Kane, Rockoff and Staiger (2006), Gordon, Kane and Staiger (2006)
• Estimated standard deviation of teacher effects: 0.10 to 0.25 student-level standard deviations.
Experimental Study (TN Class-Size Experiment):
• Nye, Konstantopoulos and Hedges (2004)
• Teachers and students were randomly assigned to classes of various sizes, grades K through 3.
• Looked at teacher effects, net of class-size category effects and school effects.
• Estimated standard deviation of teacher effects: 0.08 to 0.11 student-level standard deviations, and even higher (0.10 to 0.18) in low-SES schools.

  20. Interpretation of the Coefficient on Lagged Student Performance
• We estimate several non-experimental specifications: β0 = 0 (no controls), β0 = 1 (“gains”), β0 < 1 (“quasi-gains”), and ask:
   • Which yields unbiased estimates of teacher effects (μj)?
   • Which minimizes the mean squared error in predicting student outcomes?
• We place no structural interpretation on β0.
• β0 presumably reflects a number of different factors: (i) systematic selection of students to teachers, (ii) fade-out of prior educational inputs, (iii) measurement error.
• These separate roles are difficult to identify.
• The various biases introduced may or may not be offsetting.
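As an illustration of role (iii) alone (a textbook attenuation result, not stated on the slide): if the lagged score measures true achievement A* with classical error v, then in a regression of the current score on the lagged score with no other covariates,

\mathrm{plim}\,\hat{\beta}_0 = \beta_0\,\frac{\sigma^2_{A^*}}{\sigma^2_{A^*} + \sigma^2_{v}} < \beta_0

so measurement error alone pushes the estimated coefficient below one even when underlying persistence is complete.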
