
Issues in Teacher Evaluation and Validity: Conceptual, Methodological, and Practical


Presentation Transcript


  1. Issues in Teacher Evaluation and Validity: Conceptual, Methodological, and Practical UCLA Graduate School of Education & Information Studies Jose Felipe Martinez University of California, Los Angeles Graduate School of Education New Mexico Teacher Evaluation Advisory Council (NMTEACH) New Mexico Public Education Department

  2. Overview • Teacher Evaluation: The Policy Context • Teacher Evaluation • Conceptual/Methodological Issues: Why, What, How • Constructs and methods • Teacher Evaluation with Multiple Measures • Multiple Measures and Validity • Models for combining indicators • Validation Frameworks and Sources of Evidence • Consequences, additional issues

  3. Teacher Evaluation: The Policy Context

  4. Teacher Evaluation: A New Silver Bullet? • Teacher evaluation systems undergoing reform • Tied to perceptions of performance in national or international evaluations, • Reverse Lake Wobegon; all below avg. (Feuer, 2012) • …assumptions about the role of “good/bad” teachers in explaining/improving the results and • …about our ability to identify these teachers • Related to perceptions of teaching profession • …quality of existing teacher evaluation systems

  5. Many Prominent Examples • United States • Los Angeles, New York, Chicago (2012) • Denver (2010) • Tennessee (1992, 2012) • Toledo, Cincinnati (1990s) • Worldwide • Singapore (2006) • Chile (2003) • Mexico (1993, 2009) • Australia (2013)

  6. Teacher Evaluation: Conceptual/Methodological Issues

  7. Why Evaluate? • Motivations, inferences and uses • Identify struggling teachers to help them improve • Identify recurrent struggling teachers for sanction • Provide incentives to the best teachers • Inform school practice/district policies on Teacher Preparation and Professional Development • Identify and scale effective teacher practice • Or typically a combination... (e.g. NMTEACH)

  8. Teacher Evaluation Conceptual/Methodological Issues: Why, What, How

  9. What to Evaluate? • Teacher competence (Reynolds, 1999): • Knowledge: Subject, Pedagogical • Skill: Ability, applied knowledge • Disposition: Attitudes, Perceptions, Beliefs • Practice: Classroom processes (e.g. instruction, assessment, management) • And.. • Seniority, Credentials • School citizenship, contributions to community… • “Effectiveness”: Ability to raise student test scores

  10. What to Evaluate? All of the above? • “We fully understand that standardized tests don't capture all of the subtle qualities of successful teaching. That's why we call for multiple measures in evaluating teachers. In an ideal world, that data should also drive instruction and drive useful professional development.“ Arne Duncan U.S. Secretary of Education

  11. How to Evaluate? (Reynolds, 1999)

  12. Which is Best? Which should we use? • No method is inherently preferable • Each illuminates a different aspect of Teacher [insert euphemism here]. • Different kind of information from different sources • Pros and cons in reliability, validity, credibility… • Here I will briefly discuss: • Value Added Models • Observations • Surveys • Portfolios

  13. Value Added Models • Culture changing towards using student achievement to evaluate teachers • Simple Logic: • Students do better (grow more) in some classrooms (Weisberg et al., 2009; Kane et al., 2011) • Student learning should be a (the?) key criterion to evaluate teacher quality • Seemingly Simple Method: • With longitudinal data…compare teachers on the progress of their students, not their achievement level • Estimate teachers' unique contributions to student academic growth, net of factors outside teacher control
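
A minimal covariate-adjustment sketch of that logic, for illustration only: the data frame, the column names (score, prior_score, frl, teacher_id), and the use of ordinary least squares with teacher fixed effects are assumptions, not the specification of any operational state VAM.

```python
# Illustrative value-added sketch (not any operational VAM specification).
# Assumes a hypothetical long-format data frame with one row per student:
#   score        current-year test score
#   prior_score  prior-year test score
#   frl          example background covariate (free/reduced lunch flag)
#   teacher_id   identifier linking each student to a teacher
import pandas as pd
import statsmodels.formula.api as smf

def naive_value_added(df: pd.DataFrame) -> pd.Series:
    """Regress current scores on prior scores and a covariate, with teacher
    fixed effects; the teacher coefficients serve as descriptive (not causal)
    value-added estimates."""
    fit = smf.ols("score ~ prior_score + frl + C(teacher_id)", data=df).fit()
    return fit.params.filter(like="C(teacher_id)")
```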

  14. Value Added Models • A family of statistical models • e.g. TVAAS, Growth Percentiles, (Variable) Persistence • Estimates correlated across models; the achievement measure used also matters (Lockwood et al., 2007) • A variety of issues: • Partial view of student learning (Baker et al., 2010) • Unstable estimates (Schochet & Chiang, 2010) • Descriptive, not causal (Stuart, Rubin, Zanutto, 2004), nor explanatory/diagnostic (Goe, 2011) • Available only for some teachers (30-40% in the US) • "…VAM estimates best used in combination with other indicators" (Braun et al., 2010)

  15. Classroom Observations • Widely used to assess the quality of teaching practice • Explanatory + Formative counterpart to VAM • Identify areas in need of improvement → Inform PD • Expensive if standardized (training, time) • Error from complex rubrics, human judgment • Bias/Subjectivity in construct definition/emphasis • Lower reliability than traditional instruments (live or video) • Weak correlations with other indicators including student achievement (Kane et al., 2010)

  16. Classroom Observation: Constructs
  Singapore's Competencies
  • Nurturing the Whole Child (Core Competency!): Share values with the student • Take action to develop the student • Act consistently in the student's interest
  • Cultivating Knowledge: Subject Mastery • Analytical Thinking • Initiative • Teaching Creatively
  • Working with Others: Partnering with Parents • Working in Teams
  • Winning Hearts and Minds: Understanding the Environment • Developing Others
  • Knowing Self and Others: Emotional Intelligence
  Danielson Framework
  • Planning and Preparation: Demonstrating Knowledge of Content and Pedagogy • Demonstrating Knowledge of Students • Selecting Instructional Goals • Demonstrating Knowledge of Resources • Designing Coherent Instruction • Assessing Student Learning
  • Classroom Environment: Creating an Environment of Respect and Rapport • Establishing a Culture for Learning • Managing Classroom Procedures • Managing Student Behavior • Organizing Physical Space
  • Instruction: Communicating Clearly and Accurately • Using Questioning and Discussion Techniques • Engaging Students in Learning • Providing Feedback to Students • Demonstrating Flexibility and Responsiveness
  • Professional Responsibilities

  17. Classroom Observation: Reliability (Source: Bill and Melinda Gates Foundation, 2011)

  18. Classroom Observation: Reliability (Source: Bill and Melinda Gates Foundation, 2011)

  19. Teacher Surveys • Common method for collecting data on teacher (classroom) practice on a large scale • Good coverage; Low cost; low burden for teachers • Adequate reliability • Questionable Validity • Error from inconsistency in interpretation of questions • …and social desirability • e.g. Emphasis on higher order thinking • Weak correlations with other indicators including student achievement (Kane et al. 2010)

  20. Student Surveys • Increasingly popular for teacher evaluation • Coverage; cost; perceived validity • Adequate reliability when aggregated by classroom • Correlated with student achievement as much or more than teacher surveys (Kane et al., 2010) • Additional information at the student level • Variance reflects differentiated teacher practice with different students (Martínez, 2012; Muthén, 1995) • Correlated with achievement also within classrooms

  21. Student Surveys: Remaining Issues • Memory errors, inconsistency in interpretation • Particularly with younger children • Concerns for high stakes teacher evaluation • Social desirability, pressure, other validity issues • Cost Issues • Unit of measurement, construct invariance • “My teacher asks me to read books” • vs. “Our teacher asks us to read books”

  22. Student Surveys: Correlation with Achievement

  23. Teacher Portfolios • Compile evidence of teacher practice over a period of time
  What's in a Teacher Portfolio? • Classroom Artifacts (lesson plans, assignments, samples of student work, etc.) • Teacher Reflections (on the practice reflected in the artifacts) • Student/Teacher Survey/Log (classroom practice, attitudes, perceptions)
  vs. Surveys: + Richer, better validity, PD value; − Higher cost, rater/rubric error, burden on teachers
  vs. Observations: Debate taking form

  24. Portfolios vs. Observations
  1. Cost to collect & score? Similar to or lower than observations
  2. Score reliability? Similar to observations/video (see MET study) • May need to re-examine ideas of "acceptable reliability" • Better coverage and validity for some aspects of practice • Interesting possibilities with newer technologies
  3. More burdensome for teachers? Yes, much more so (20-30+ hour effort) • But with burden comes Professional Development • So far used mostly for "National Certification" • Growing interest: EdTPA, PACT • May be feasible as integral to an evaluation/PD cycle

  25. Teacher Evaluation and Multiple Measures (Validity)

  26. Validity • How do we know we are doing a good job of evaluating teachers? • Are our inferences and decisions valid? "An integrated evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores or other modes of assessment." Messick (1989)

  27. “In educational settings, a decision or characterization that will have major impact [on a student] should not be made on the basis of a single score. Other relevant information should be taken into account if it will enhance the overall validity of the decision.” Standards for Educational and Psychological Testing, Standard 13.7 (AERA, APA, & NCME, 1999)

  28. What to Evaluate? All of the above • New Mexico’s teacher evaluation system should utilize a matrix in which multiple components of a teacher’s evaluation combine to determine a teacher’s overall effectiveness rating. • Effectiveness levels should only be assigned after careful consideration of multiple measures, including student achievement data, observations, and other proven measures [emphasis added] New Mexico Effective Teaching Task Force

  29. Multiple Measures: Logic and Assumptions • General Assumption: Combining multiple measures leads to better informed (more valid) decisions about teachers and teaching
  1. Accuracy: Teachers classified into finer, more stable categories (De Pascale, 2012; Steele et al., 2010)
  2. Validity: More complete picture of performance (Goe, 2011); less incentive for test preparation (Steele et al., 2010)
  3. Feedback: Information to help teachers adjust and improve instruction and classroom strategies (Duncan, 2011)
  4. Relevance: Greater confidence in results of evaluation among the public and stakeholders (Glazerman et al., 2011)

  30. Combining Multiple Measures: Conceptual Issues • When/where do these assumptions hold? In what situations? • Depends on several factors: assumptions about the nature of the constructs involved; intended inferences and uses; what is meant exactly by combining (Brookhart, 2009) • Not self-explanatory. A variety of models is available • Substantial literature in psychology, personnel evaluation, and student assessment • Only starting to be applied to teacher evaluation

  31. Models for Combining Multiple Measures

  32. Combination Model 0: Do Not Combine! • May consider not combining the indicators! • Summary indices not essential to formative or summative evaluation • Key measures may be collected, maintained, and reported separately • All used to illuminate a side of the picture (improve teaching, communication, citizenship, achievement?) • And used jointly as needed where summative judgments are sought (Mehrens, 1989; Brookhart, 2009) • Making combined use of multiple indicators ≠ combining multiple indicators

  33. Combination Model 1: Conjunctive, Disjunctive • Portfolio • Classroom Observation • Student Survey • Teacher Test • Other Indicators • Student Achievement

  34. Decision Rules and Reliability • Error in multiple measures may cancel out or compound • Assume Teacher A's true scores on T1 and T2 are both passes • Because of unreliability, the probability of an observed pass score is estimated at 0.80 and 0.90, respectively • Probability of pass scores on both tests (Conjunctive Model): 0.8 × 0.9 = 0.72 • Probability of a pass score on either test (Disjunctive Model): 1 − [0.2 × 0.1] = 0.98 (see e.g. Cronbach, Linn, Brennan, & Haertel, 1997; Douglas & Mislevy, 2010)
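
A quick numerical check of the arithmetic above, as a minimal sketch (the pass probabilities come from the slide; independence of the two measures is an assumption):

```python
# Independent pass probabilities on two measures, combined under a
# conjunctive (pass both) vs. disjunctive (pass either) decision rule.
p1, p2 = 0.80, 0.90

p_conjunctive = p1 * p2                    # must pass both: 0.72
p_disjunctive = 1 - (1 - p1) * (1 - p2)    # may pass either: 0.98

print(f"Conjunctive: {p_conjunctive:.2f}  Disjunctive: {p_disjunctive:.2f}")
```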

  35. Decision Rules and Reliability • Simplistic scenario. Complex rules often used in practice according to policy context and goals • E.g.: Teachers must pass Measure 1 or 2, AND not rank lowest in Measure 3 (e.g., New Haven; sketched below) • Choice of decision rule more important for accuracy and validity than the reliability of the component measures chosen (Chester, 2003) • Importantly: Models are not "objective"; each involves judgment • Why satisfy k criteria, not k−1? Why those criteria?
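
A rule of the kind quoted above can be encoded directly; a hypothetical sketch (the function name, rank coding, and threshold convention are illustrative assumptions, not New Haven's actual scoring rules):

```python
# Hypothetical hybrid rule: pass Measure 1 OR Measure 2,
# AND do not rank in the lowest category on Measure 3.
def meets_rule(m1_pass: bool, m2_pass: bool, m3_rank: int, lowest_rank: int = 1) -> bool:
    return (m1_pass or m2_pass) and (m3_rank != lowest_rank)

# Example: fails Measure 1, passes Measure 2, ranks 3rd of 5 on Measure 3.
print(meets_rule(m1_pass=False, m2_pass=True, m3_rank=3))  # True
```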

  36. Hybrid system : e.g. New Haven • Synthesizes three component measures (each on 5-pt. scale): • Teacher instructional practice • Teacher professional values • Student learning outcomes

  37. Combination Model 2 (Compensatory): Principal Components / Factor Analysis Other Measures Portfolio Student achievement Global Construct Classroom Observation Student/ Parent Survey Teacher Survey
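
One way a compensatory Model 2 composite could be computed, as a rough sketch (the teacher-by-measure matrix, its column composition, and the choice of a single principal component are assumptions for illustration):

```python
# Model 2 sketch: score each teacher on the first principal component of
# their standardized indicators (observation, surveys, portfolio, etc.).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def pca_composite(indicators: np.ndarray) -> np.ndarray:
    """indicators: (n_teachers, n_measures) matrix; returns one composite per teacher."""
    z = StandardScaler().fit_transform(indicators)
    return PCA(n_components=1).fit_transform(z).ravel()
```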

  38. Combination Model 3 (Compensatory): Optimal Weight (Achievement as Criterion) Other Measures Artifacts/ Portfolio Teacher Construct Student Achievement Classroom Observation Student/ Parent Survey Teacher Survey

  39. Combination Model 3 (Compensatory): Optimal Weight (Achievement as Criterion) Other Measures Artifacts/ Portfolio β β β Teacher Construct Student Achievement Classroom Observation β Student/ Parent Survey β Teacher Survey
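
A sketch of how the βs in Model 3 could be estimated empirically, under the (strong) assumption that student achievement is an adequate criterion; the matrix layout and the use of linear regression are illustrative:

```python
# Model 3 sketch: derive "optimal" weights by regressing the criterion
# (student achievement) on the standardized non-achievement indicators.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

def optimal_weights(indicators: np.ndarray, achievement: np.ndarray) -> np.ndarray:
    """Returns one regression coefficient (beta) per indicator."""
    z = StandardScaler().fit_transform(indicators)
    return LinearRegression().fit(z, achievement).coef_
```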

  40. MM Combination Model 4 (Compensatory): PC/FA, Student Achievement as Indicator Other Measures Artifacts/ Portfolio Teacher Construct Student Achievement Classroom Observation Student/ Parent Survey Teacher Survey

  41. MM Combination Model 5 (Compensatory): SEM/Canonical Correlates Other Measures Artifacts/ Portfolio Student Measure #1 Classroom Observation Student Measure #2 Teacher Construct Student Outcomes Other (e.g. non- cognitive) Student/ Parent Survey Teacher Survey
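
Model 5 relates a set of teacher measures to a set of student outcomes; a rough stand-in for the SEM/canonical-correlates idea, using scikit-learn's CCA (the two input matrices and the single-component choice are assumptions):

```python
# Model 5 sketch: first canonical variate pair linking teacher measures
# to multiple student outcomes (cognitive and non-cognitive).
import numpy as np
from sklearn.cross_decomposition import CCA

def first_canonical_pair(teacher_measures: np.ndarray, student_outcomes: np.ndarray):
    """Both inputs are (n_teachers, n_variables) matrices; returns paired scores."""
    t_scores, s_scores = CCA(n_components=1).fit_transform(teacher_measures, student_outcomes)
    return t_scores.ravel(), s_scores.ravel()
```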

  42. MM Combination Model 6 : (Darlington, 1970) Unmeasured Criterion, theoretical weights Other Measures Student Achievement Artifacts/ Portfolio Classroom Observation Unmeasured Teacher Construct Student/ Parent Survey Teacher Survey
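
Model 6 weights are set by judgment rather than estimation; a minimal sketch of such a theoretically weighted composite (the 50/25/25 split is purely illustrative, not a recommendation):

```python
# Model 6 sketch: fixed, policy-chosen weights applied to standardized measures.
import numpy as np

def theoretical_composite(z_measures: np.ndarray, weights=(0.50, 0.25, 0.25)) -> np.ndarray:
    """z_measures: (n_teachers, n_measures) matrix of standardized scores."""
    w = np.asarray(weights, dtype=float)
    return z_measures @ (w / w.sum())
```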

  43. Empirical vs. Theoretical Weighting • Model 6 is most likely scenario in practice • Policy assumptions/values (consensual) inform the system, alongside technical considerations • It really is the only feasible scenario • Empirical weights cannot be derived • Ultimate criterion measure is NOT available • Note model 3 assumes such measure is available • But does not give “correct” weight for criterion • Exposure to Validity shrinkage (weight change over time)

  44. Multiple Measures and Validity • Models may lead to different inferences • Little guidance available; so… • LOCAL VALIDITY STUDIES NEEDED (lots of them) • As with single measures, need to set up testable validation hypotheses (Kane, 2006) • Whatever the construct: Teacher [euphemism] • 1. Describe intended inferences, uses, AND CONSEQUENCES • 2. Collect empirical evidence to support them • 2012, 2013 MET reports will be influential. May force the field to broaden our lens and revise assumptions and expectations • No getting around conducting local validation studies

  45. What KINDS of EVIDENCE? • All of them: Validity is a unitary notion • Theoretical support • Consistency and accuracy (Reliability) • Correlations, Internal structure • Predictive power • Consequences of use • Validity becomes a rather empty academic topic if the consequences are not considered • Or if they differ markedly from expectation

  46. What Consequences? • Intended and Unintended Effects • On teaching practice • On different student outcomes • On recruitment and retention • On motivation, competition, fraud • On perceptions of validity, fairness, utility • On the dynamics of relationships with parents and community • Etc., etc.

  47. Final Remarks. Teacher Evaluation: Why are we doing this again? • Some good reasons • Make student achievement a priority • Monitor & assess teacher performance • Develop a culture of accountability • and of reflection and improvement • Inform PD to improve teacher performance • However • Multiple fallible indicators do not automatically yield better, less fallible inferences. But they always yield more complex ones • Using indicators in combination involves technical but also conceptual and policy assumptions

  48. Final Remarks. Teacher Evaluation: Why are we doing this again? • Because “the stakes are high, and the future of our children is at stake” (insert public official name here, circa 2012) we should proceed carefully and deliberately. • Good measures take time to develop. • Solid systems based on these measures take longer to test and implement. • The consequences of implementing these systems are unknown and will take longer to assess. • Experience suggests moving too fast to implement may shortchange the system

  49. Final Remarks. Teacher Evaluation: Why are we doing this again? • Most important goal in my view is not only to avoid unfair decisions, and negative unintended consequences (though the potential for both should give us pause) • Greatest risk is missing an opportunity to enact sound teacher evaluation policy with great potential to positively impact educational practice and outcomes

  50. Thank you jfmtz@ucla.edu
