Howard Wainer National Board of Medical Examiners

Uneducated Guesses: Three examples of how mistreating missing data yields misguided educational policy
Howard Wainer, National Board of Medical Examiners
An Invited Talk Given to the Institute of Education Science in the Graduate School of Education of the University of Pennsylvania, February 13, 2012

“In general we look for a new law by the following process. First we guess it. Then we compute the consequences of the guess to see what would be implied if this law that we guessed is right. Then we compare the result of the computation to nature, with experiment or experience, compare it directly with observation, to see if it works. If it disagrees with experiment it is wrong. In that simple statement is the key to science. It does not make any difference how beautiful your guess is. It does not make any difference how smart you are, who made the guess, or what his name is - if it disagrees with experiment it is wrong. That is all there is to it.” Richard P. Feynman (1964)

Outline
I. Introduction – Mistreating missing data can have a huge effect
   A. Lombard’s most dangerous profession
   B. Getting younger in Princeton’s cemetery
   C. Wald’s model for armoring planes
II. Case 1. What happens if the SAT is made optional: Bowdoin College as an example
III. Case 2. Allowing choice on exams
   A. Some history – especially 1921 English
   B. The mystery of 1968 AP Chemistry
   C. Women suffer in 1988 US History
   D. The only unambiguous solution to missing data
   E. Indiana Jones and a wonderful workaround
      1. 1989 Chemistry as proof of concept
IV. Case 3. Using student test scores to evaluate teachers: Value-Added Models
   A. VAM and missing scores – Gaming the system by using missing data imputations
   B. VAM and Counterfactuals – How would Freddy have done if he hadn’t had Ms. Jones?
V. Conclusions

I will illustrate my talk today with three principal examples:
• A September 2008 report published by the National Association for College Admission Counseling in which one of the principal recommendations was for colleges and universities to reconsider requiring the SAT or the ACT for applicants.
• Increasingly often ‘standardized’ exams provide a set of possible questions and allow the examinee to pick which ones to answer.
• “Race to the Top” provides funds to states that amend their educational system in specific ways. But all must somehow use the change in student test scores to evaluate teachers.

In all three of these, the issue of missing data looms large.
The issue of missing data is too often assumed, even by people who ought to know better, to be a small technical one that is not likely to have any serious effect. How we understand and treat missing data can have an enormous effect on the conclusions we draw.

MD1. The most dangerous profession

MD2. The 20th Century was a dangerous time

MD3. Bullet holes and a model for missing data From Abraham Wald

Example 1. National Association for College Admission Counseling’s September 2008 report on admissions testing
On September 22, 2008, the New York Times carried the first of three articles about a report, commissioned by the National Association for College Admission Counseling, that was critical of the current, widely used, college admissions exams, the SAT and the ACT. The commission was chaired by William R. Fitzsimmons, the dean of admissions and financial aid at Harvard. The report was reasonably wide-ranging and drew many conclusions while offering alternatives. Although well-meaning, many of the suggestions only make sense if you say them very fast.

Among their conclusions were:
1. Schools should consider making their admissions “SAT optional,” that is, allowing applicants to submit their SAT/ACT scores if they wish, without making them mandatory. The commission cites the success that pioneering schools with this policy have had in the past as proof of concept.
2. Schools should consider eliminating the SAT/ACT altogether and substituting achievement tests instead. They cite the unfair effect of coaching as the motivation for this. They were not naïve enough to suggest that, because there is no coaching for achievement tests now, none would be offered if those tests became higher stakes; rather, they argue that such coaching would be directly related to schooling and hence more beneficial to education than coaching that focuses on test-taking skills.
3. The use of the PSAT with a rigid qualification cut-score for such scholarship programs as the Merit Scholarships should be immediately halted.

Recommendation 1. Make SAT optional
It is useful to examine those schools that have instituted “SAT Optional” policies and see if the admissions process has been hampered in those schools. The first reasonably competitive school to institute such a policy was Bowdoin College, in 1969. Bowdoin is a small, highly competitive, liberal arts college in Brunswick, Maine. A shade under 400 students a year elect to matriculate at Bowdoin, and roughly a quarter of them choose not to submit SAT scores. The following table summarizes the classes at Bowdoin and five other institutions whose entering freshman class had approximately the same average SAT score. At the other five institutions the students who didn’t submit SAT scores used ACT scores instead.

Table 1: Six Colleges/Universities with similar observed mean SAT scores for the entering class of 1999.

To know how Bowdoin’s SAT policy is working we will need to know two things:
• How did the students who didn’t submit SAT scores do at Bowdoin in comparison to those students who did submit them?
• Would the non-submitters’ performance at Bowdoin have been better predicted by their SAT scores, had the admissions office had access to them?

The first question is easily answered by looking at their first year grades at Bowdoin.

But would their SAT scores have provided information missing from other submitted information? This would depend on why these students chose not to submit their scores. Some possibilities are:
• If I don’t need to submit them, why bother to take them?
• I took them, and did really well, but so what?
• I took them, but did worse than the typical student who was accepted by Bowdoin in the past. Submitting them wouldn’t help my cause.

Although we may have some opinions on the likelihood of each of these options, under typical circumstances we have no data to help us decide, for these students did not submit their SAT scores.

However, all of these students actually took the SAT, and through a special data-gathering effort at the Educational Testing Service, we found that the students who didn’t submit these scores behaved sensibly. They realized that their lower-than-average scores would not help their cause at Bowdoin, and hence chose not to submit them. Here is the distribution of SAT scores for those who submitted them as well as those who did not.

As it turns out, the SAT scores for the students who did not submit them would have accurately predicted their lower performance at Bowdoin. In fact, the correlation between grades and SAT scores was higher for those who didn’t submit them (0.9) than for those who did (0.8).

So not having this information does not improve the academic performance of Bowdoin’s entering class – on the contrary it diminishes it. Why would a school opt for such a policy? Why is less information preferred to more?

There are surely many answers to this, but one is seen in an augmented version of the earlier table 1: We see that if all of the students in Bowdoin’s entering class had their SAT scores included, the average SAT at Bowdoin would shrink from 1323 to 1288, and instead of being second among these six schools they would have been tied for next to last.

Since mean SAT scores are a key component in school rankings, a school can game those rankings by allowing its lowest scoring students to be left out of the average. I believe that Bowdoin’s adoption of this policy pre-dates US News & World Report’s rankings, so that was unlikely to have been their motivation, but I cannot say the same thing for schools that have chosen such a policy more recently.
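The arithmetic behind this effect is just a weighted mean. The sketch below is illustrative only: the function names are made up for this example, and the share of non-submitters is set to exactly 0.25 as a stand-in for the talk's "roughly a quarter". The mean it implies for non-submitters is therefore a back-of-the-envelope value, not a figure reported from the ETS data.

```python
def combined_mean(mean_submitted, mean_withheld, share_withheld):
    """Mean over the whole class, given the means of the two groups."""
    return (1 - share_withheld) * mean_submitted + share_withheld * mean_withheld


def implied_withheld_mean(mean_submitted, mean_all, share_withheld):
    """Solve the weighted-mean identity for the mean of the withheld scores."""
    return (mean_all - (1 - share_withheld) * mean_submitted) / share_withheld


reported_mean = 1323   # mean of submitted scores, from the talk
full_class_mean = 1288 # mean once every student is included, from the talk
share_withheld = 0.25  # ASSUMPTION: "roughly a quarter" taken as exactly 25%

implied = implied_withheld_mean(reported_mean, full_class_mean, share_withheld)
print(implied)
print(combined_mean(reported_mean, implied, share_withheld))
```

Under that assumed share, leaving the lower-scoring quarter out of the published average lifts it by 35 points, which is the whole of the ranking gain described above.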

Is inferring such a nefarious goal just the paranoid ravings of an aging cynic? Or are colleges actively engaged in trying to game college rankings?

Some evidence:
• The January 31, 2012 NY Times reported that Richard C. Vos, VP and dean of admissions of Claremont McKenna College, has, for the past six years, been adding points to the mean SAT scores that the school reported to USN&WR.
• The February 1, 2012 NY Times reported that Iona College “has lied for years about test scores, graduation rates, freshman retention, student-faculty ratio, acceptance rates and alumni giving.”
• “Baylor University paid admitted students to retake the SATs in hopes of increasing scores.” This seems like an inefficient approach -- easier, cheaper and more sure to use Claremont’s approach and just falsify them.

Case 2. Allowing choice on exams
If you allow choice, you will regret it; if you don't allow choice, you will regret it; whether you allow choice or not, you will regret both. (Søren Kierkegaard, 1986, p. 24)

It is common practice to allow choice on exams. Why?
If a test is made up of multiple-choice questions, answering any one of them takes very little time and so there can be lots of them. If we ask essay questions, or other kinds of big problems, it is impractical to ask more than a few of them, and so some students may be disadvantaged by the specific topic selected.

So we offer a choice: “Answer 2 of the following 6.” Is this a good idea? Historically, such an approach was most common almost a century ago, but its popularity rapidly declined. It is currently enjoying a resurgence.

Number of possible test forms generated by examinee choice patterns in College Entrance Exams

How did they arrive at the unlikely number of test forms for the 1921 English exam?
Section I - Answer 1 of 3 questions; 3 forms.
Section II - Answer 5 of 26 questions; 65,780 forms.
Section III - Answer 1 of 15 questions; 15 forms.
3 x 65,780 x 15 = 2,960,100
Voila!
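The count can be checked directly: each section contributes "n choose k" forms, and the sections multiply. This is a minimal sketch; the function name `count_forms` and the list-of-pairs layout are illustrative choices, not anything taken from the exam documentation.

```python
from math import comb


def count_forms(sections):
    """Multiply the number of ways to pick k of n questions across sections."""
    total = 1
    for n_offered, k_answered in sections:
        total *= comb(n_offered, k_answered)
    return total


# 1921 English exam as described in the talk:
# (questions offered, questions to answer) for Sections I, II and III.
english_1921 = [(3, 1), (26, 5), (15, 1)]

print(comb(26, 5))                # 65780
print(count_forms(english_1921))  # 2960100
```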

Are choice items of equal difficulty?
Average Scores on AP Chemistry 1968
While their scores on the common multiple-choice (MC) section were about the same (11.7 vs. 11.2 out of a possible 25), their scores on the choice problem were very different (8.2 vs. 2.7 on a 10-point scale).

There are several possible conclusions to be drawn from this; four among them are:
1. Problem 5 is a good deal more difficult than problem 4.
2. Small differences in performance on the multiple-choice section translate into much larger differences on the free response questions.
3. The proficiency required to do the two problems is not strongly related to that required to do well on the multiple-choice section.
4. Item 5 is selected by those who are less likely to do well on it.

1988 AP United States History Exam

The only unambiguous data on choice and difficulty
Xiang-bo Wang and his colleagues repeatedly presented examinees with a choice of two items, but then required them to answer both.

The proportion of students getting each item correct shown conditional on which item they preferred to answer

The conclusion drawn from many results like this is that, as examinees’ ability increases, they tend to choose more wisely: they know enough to be able to determine which choices are likely to be the least difficult. As ability declines, choice becomes closer and closer to random. On average, lower ability students, when given choice, are more likely to choose more difficult items than their competitors at the higher end of the proficiency scale. Thus allowing choice will tend to exacerbate group differences.

How can we allow choice?
• Adjust for differential difficulty after administering items to random samples of examinees - equate (but that makes the examinee’s job more difficult). And, if we are successful, it renders choice unnecessary.
OR

What if we make the choice part of the test?
“But choose wisely, for while the true Grail will bring you life, the false Grail will take it from you.”
Grail Knight in Indiana Jones and the Last Crusade, 1989

The alternative to trying to make all examinee-selected choices within a choice question of equal difficulty is to consider the entire set of questions with choices as a single item. Thus the choice is part of the item. If you make a poor choice and select an especially difficult option to respond to, that is considered in exactly the same way as if you wrote a poor answer.

Under what circumstances is this a plausible and fair approach?
1. We must believe that choosing wisely uses the same knowledge and skills that are required for answering the question.
2. We must believe that the choice is being made by the examinees and not by their teachers.

If we agree to adopt this strategy a remarkable result ensues! Let us consider data from Section D of the 1989 Advanced Placement Examination in Chemistry. Section D has five problems (Problems 1, 2, 3, 4 and 5) of which the examinee must answer just three. ETS calculates the reliability of Section D as 0.60.

Scores of examinees as a function of the problems they chose

Suppose we think of Section D as a single ‘item’ with an examinee falling into one of ten possible categories, and the estimated score of each examinee is the mean score of everyone in their category. How reliable is this one item test?

After doing the appropriate calculation we discover that the reliability of this ‘choice item’ is .15. While .15 is less than .60, it is also larger than zero, and it is easier to obtain. We don’t have to score the examinees’ answers, we just note which problems they chose. In fact, they don’t even have to answer them -- just indicate which three they would answer, if they were forced to.

Of course with a reliability of only .15, this is not much of a test. But suppose we had two such items, each with a reliability of .15? This new test would have a reliability of .26. And, to get to the end of the story, if we had eight such ‘items’ it would have a reliability of .60, the same as the current form. Such a test would be easier on examinees and much cheaper for the testing company. A win-win.
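The talk does not say which formula produces these figures. They are consistent with the Spearman-Brown prophecy formula for a test lengthened with parallel items, so that formula is assumed in the sketch below; the function name is illustrative. The same snippet confirms that answering three of five problems gives ten possible choice patterns.

```python
from math import comb


def spearman_brown(r_single, k):
    """Projected reliability of a test built from k parallel items,
    each with reliability r_single (Spearman-Brown; assumed, see above)."""
    return k * r_single / (1 + (k - 1) * r_single)


# Answering 3 of the 5 problems in Section D gives this many choice patterns.
print(comb(5, 3))

# Projected reliability for 1, 2 and 8 'choice items' of reliability .15.
for k in (1, 2, 8):
    print(k, round(spearman_brown(0.15, k), 2))
```

With a single-item reliability of exactly .15 the eight-item value comes out at about .59, just under the .60 quoted; the small gap may simply reflect rounding in the .15.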

This is what I like best about science: with only a small investment in fact, we can garner such huge dividends in conjecture.

Case 3. Using student test scores to evaluate teachers
“Some professors are justly renowned for their bravura performances as Grand Expositor on the podium, Agent Provocateur in the preceptorial, or Kindly Old Mentor in the corridors. These familiar roles in the standard faculty repertoire, however, should not be mistaken for teaching, except as they are validated by the transformation of the minds and persons of the intended audience.”
“Good teachers evaluate themselves with a pitiless gaze and measure their successes not by their virtuosity as performers but by their contribution to the transformation of students.”
(Marvin Bressler, 1991)

Value Added Models (VAMs)
$y_{i1} = \mu_1 + \theta_1 + \epsilon_{i1}$   (1)
$y_{i2} = \mu_2 + \theta_1 + \theta_2 + \epsilon_{i2}$   (2)
Hence the change, the value-added, is simply the difference between the scores from year 1 to year 2, or
$y_{i2} - y_{i1} = (\mu_2 - \mu_1) + \theta_2 + (\epsilon_{i2} - \epsilon_{i1})$   (3)
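A small simulation can make the differencing step concrete. The slide does not define its symbols, so the reading used here is an assumption: `mu1` and `mu2` are the year means, `theta1` and `theta2` are the effects that enter in year 1 and year 2, and the error terms are independent normal draws. All numeric values are hypothetical.

```python
import random


def simulate_gains(mu1, mu2, theta1, theta2, n_students, error_sd, seed=0):
    """Generate year-1 and year-2 scores from equations (1) and (2)
    and return each student's gain, as in equation (3)."""
    rng = random.Random(seed)
    gains = []
    for _ in range(n_students):
        y1 = mu1 + theta1 + rng.gauss(0, error_sd)           # equation (1)
        y2 = mu2 + theta1 + theta2 + rng.gauss(0, error_sd)  # equation (2)
        gains.append(y2 - y1)                                # equation (3)
    return gains


# Hypothetical values chosen only for illustration.
gains = simulate_gains(mu1=50, mu2=55, theta1=3, theta2=-2,
                       n_students=10_000, error_sd=5)
mean_gain = sum(gains) / len(gains)

# theta1 cancels in the difference, so the mean gain should sit
# near (mu2 - mu1) + theta2 = 3, whatever value theta1 takes.
print(round(mean_gain, 2))
```

The year-1 effect drops out of every gain score, which is why the difference is read as the value added in year 2. That reading holds only when both scores are actually observed for each student.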

“The child in me was delighted. The adult was skeptical.”
Saul Bellow, 1977
“I was impressed, not because it did it well, but that it could do it at all.”
Samuel Johnson after watching a dog walk on its hind legs
