
Designing Trustworthy & Reliable GME Evaluations


Presentation Transcript


  1. Designing Trustworthy & Reliable GME Evaluations Conference Session: SES85 2011 ACGME Annual Education Conference Nancy Piro, PhD, Program Manager/Ed Specialist Alice Edler, MD, MPH, MA (Educ) Ann Dohn, MA, DIO Bardia Behravesh, EdD, Manager/Ed Specialist Stanford Hospital & Clinics Department of Graduate Medical Education

  2. Overall Questions • What is an assessment? An evaluation? • How are they different? • What are they used for? • Why do we evaluate? • How do we construct a useful evaluation? • What is cognitive bias? • How do we eliminate bias from our evaluations? • What is validity? • What is reliability?

  3. Defining the Rules of the “Game”

  4. Assessment - Evaluation: What’s the difference and what are they used for? Assessment…is the analysis and use of data by residents or sub-specialty residents (trainees), faculty, program directors and/or departments to make decisions about improvements in teaching and learning. 

  5. Assessment - Evaluation: What’s the difference and what are they used for? Evaluation is the analysis and use of data by faculty to make judgments about trainee performance. Evaluation includes obtaining accurate, performance-based, empirical information which is used to make competency decisions on trainees across the six domains.

  6. Evaluation Examples • Example 1: A trainee delivers an oral presentation at a Journal Club. The faculty member provides a critique of the delivery and content accompanied by a rating for the assignment. • Example 2: A program director provides a final evaluation to a resident accompanied by an attestation that the resident has demonstrated sufficient ability and acquired the appropriate clinical and procedural skills to practice competently and independently.

  7. Why do we assess and evaluate? (Besides the fact it is required…) • Demonstrate and improve trainee competence in core and related competency areas - knowledge and application • Ensure our programs produce graduates, each of whom “has demonstrated sufficient ability and acquired the appropriate clinical and procedural skills to practice competently and independently.” • Track the impact of curriculum/organizational change • Gain feedback on program, curriculum and faculty effectiveness • Provide residents/fellows a means to communicate confidentially • Provide an early warning system • Identify gaps between competency-based goals and individual performance

  8. So What’s the Game Plan for Constructing Effective Evaluations? Without a plan… evaluations can take on a life of their own!!

  9. How do we construct a useful evaluation?

  10. How do we construct a useful evaluation? STEP 1. Create the Evaluation (Plan): curriculum (competency) goals, objectives and outcomes; question and scale development. STEP 2. Deploy (Do): online or in-person (paper). STEP 3. Analyze (Study/Check): reporting, benchmarking and statistical analysis; rank order/norms (within the institution or nationally). STEP 4. Take Action (Act): develop and implement learning/action plans; measure progress against learning goals; adjust learning/action plans.

  11. Question and Response Scale Construction Two Basic Goals: • Construct unbiased, unconfounded, and non-leading questions that produce valid data • Design and use unbiased and valid response scales

  12. What is cognitive bias… • Cognitive bias is distortion in the way we perceive reality or information. • Response bias is a particular type of cognitive bias which can affect the results of an evaluation if respondents answer questions in the way they think they are designed to be answered, or with a positive or negative bias toward the examinee.

  13. Where does response bias occur? • Response bias most often occurs in the wording of the question. • Response bias is present when a question contains a leading phrase or words. • Response bias can also occur in rating scales. • Response bias can also reside in the raters themselves: • Halo Effect • Devil Effect • Similarity Effect • First Impressions

  14. Step 1: Create the Evaluation - Question Construction • Example (1): • "I can always talk to my Program Director about residency related problems.” • Example (2): • “Sufficient career planning resources are available to me and my program director supports my professional aspirations.”

  15. Question Construction • Example (3): • "Communication in my sub-specialty program is good." • Example (4): • “Incomplete, inaccurate medical interviews, physical examinations; incomplete review and summary of other data sources. Fails to analyze data to make decisions; poor clinical judgment.”

  16. Create the Evaluation - Question Construction • Example (5): • "The pace on our service is chaotic."

  17. Exercise One • Review each question and share your thinking of what makes it a good or bad question.

  18. Question Construction - Test Your Knowledge • Example 1: "I can always talk to my Program Director about residency related problems." • Problem: Terms such as "always" and "never" will bias the response in the opposite direction. • Result: Data will be skewed.

  19. Question Construction - Test Your Knowledge • Example 2: “Career planning resources are available to me and my program director supports my professional aspirations." • Problem: Double-barreled: resources and aspirations. Respondents may agree with one and not the other. The researcher cannot make valid assumptions about which part of the question respondents were rating. • Result: Data is useless.

  20. Question Construction - Test Your Knowledge • Example 3: "Communication in my sub-specialty program is good." • Problem: Question is too broad. If score is less than 100% positive, researcher/evaluator still does not know what aspect of communication needs improvement. • Result: Data is of little or no usefulness.

  21. Question Construction - Test Your Knowledge • Example 4: “Evidences incomplete, inaccurate medical interviews, physical examinations; incomplete review and summary of other data sources. Fails to analyze data to make decisions; poor clinical judgment.” • Problem: Septuple-barreled: respondents may agree with some parts and not others. The evaluator cannot make assumptions about which part of the question respondents were rating. • Result: Data is useless.

  22. Question Construction - Test Your Knowledge • Example 5: • "The pace on our service is chaotic." • Problem: The question is negative, and broadcasts a bad message about the rotation/program. • Result: Data will be skewed, and the climate may be negatively impacted.

  23. Evaluation Question Design Principles Avoid ‘double-barreled’ questions • A double-barreled question combines two or more issues or “attitudinal objects” in a single question.

  24. Avoiding Double-Barreled Questions • Example: Patient Care Core Competency “Resident provides sensitive support to patients with serious illness and to their families, and arranges for on-going support or preventive services if needed.” Rating options: Minimal Progress / Progressing / Competent

  25. Evaluation Question Design Principles • Combining two or more questions into one makes it unclear which attitudinal object is being measured, as each part may elicit a different perception of the resident’s performance. • Result: Respondents are confused and results are confounded, leading to unreliable or misleading results. • Tip: If the word “and” or the word “or” appears in a question, check whether it is double-barreled.
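
The “and/or” tip above lends itself to a quick automated screen. The sketch below is an illustration, not part of the session materials: it flags questions containing “and” or “or” as whole words so a reviewer can re-read them. A match is a prompt for review, not proof that the question is double-barreled.

```python
import re

def flag_double_barreled(question: str) -> bool:
    """Flag questions containing 'and' or 'or' as whole words for human review."""
    return re.search(r"\b(and|or)\b", question, flags=re.IGNORECASE) is not None

questions = [
    "Career planning resources are available to me and my program director "
    "supports my professional aspirations.",
    "The resident communicates clearly with patients.",
]
for q in questions:
    print("CHECK" if flag_double_barreled(q) else "ok", "-", q)
```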

  26. Evaluation Question Design Principles • Avoid questions with double negatives… • When respondents are asked for their agreement with a negatively phrased statement, double negatives can occur. • Example: Do you agree or disagree with the following statement?

  27. Evaluation Question Design Principles • “Attendings should not be required to supervise their residents during night call.” • If you respond that you disagree, you are saying you do not think attendings should not supervise residents. In other words, you believe that attendings should supervise residents. • If you do use a negative word like “not”, consider highlighting the word by underlining or bolding it to catch the respondent’s attention.

  28. Evaluation Question Design Principles • Because every question is measuring something, it’s important for each to be clear and precise. • Remember…Your goal is for each respondent to interpret the meaning of each question in exactly the same way.

  29. Evaluation Question Design Principles • If your respondents are not clear on what is being asked in a question, their responses may result in data that cannot or should not be applied to your evaluation results… • "For me, further development of my medical competence, it is important enough to take risks" – Does this mean to take risks with patient safety, risks to one's pride, or something else?

  30. Evaluation Question Design Principles • Keep questions short. Long questions can be confusing. • Bottom line: Focus on short, concise, clearly written statements that get right to the point, producing actionable data that can inform individual learning plans (ILPs). • Take only seconds to respond to/rate • Easily interpreted.

  31. Evaluation Question Design Principles • Do not use “loaded” or “leading” questions • A loaded or leading question biases the response given by the respondent. A loaded question is one that contains loaded words. • For example: “I’m concerned about doing a procedure if my performance would reveal that I had low ability” (Disagree / Agree)

  32. Evaluation Question Design Principles "I’m concerned about doing a procedure on my unit if my performance would reveal that I had low ability" • How can this be answered with “agree or disagree” if you think you have good abilities in appropriate tasks for your area?

  33. Evaluation Question Design Principles • A leading question is phrased in such a way that suggests to the respondent that a certain answer is expected: • Example: Don’t you agree that nurses should show more respect to residents and attendings? • Yes, they should show more respect • No, they should not show more respect

  34. Evaluation Question Design Principles • Use of Open-Ended Questions • Comment boxes after negative ratings • To explain the reasoning and target areas for focus and improvement • General, open-ended questions at the end of the evaluation • Can prove beneficial • Responses often reveal entire topics that should have been included but were omitted from the evaluation.

  35. Evaluation Question Design Principles – Exercise 2 “Post Test” 1. Please rate the general surgery resident’s communication and technical skills 2. Rate the resident’s ability to communicate with patients and their families 3. Rate the resident’s abilities with respect to case familiarization; effort in reading about patient’s disease process and familiarizing with operative care and post op care 4. Residents deserve higher pay for all the hours they put in, don’t they?

  36. Evaluation Question Design Principles – Exercise 2 “Post Test” 5. Explains and performs steps in resuscitation and stabilization 6. Do you agree or disagree that residents shouldn’t have to pay for their meals when on-call? 7. Demonstrates an awareness of and responsiveness to the larger context of health care 8. Demonstrates ability to communicate with faculty and staff

  37. Bias in the Rating Scales for Questions The scale you construct can also skew your data, much as the question wording can.

  38. Evaluation Design Principles: Rating Scales • By far the most popular scale asks respondents to rate their agreement with the evaluation questions or statements – “stems”. • After you decide what you want respondents to rate (competence, agreement, etc.), you need to decide how many levels of rating you want them to be able to make.

  39. Evaluation Design Principles: Rating Scales • Using too few levels gives less precise information, while using too many can make the question hard to read and answer (do you really need a 9- or 10-point scale?) • Determine how fine a distinction you want to be able to make between agreement and disagreement.

  40. Evaluation Design Principles: Rating Scales • Psychological research has shown that a 6-point scale with three levels of agreement and three levels of disagreement works best. An example would be: • Disagree Strongly • Disagree Moderately • Disagree Slightly • Agree Slightly • Agree Moderately • Agree Strongly
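
As an illustration (the numeric coding below is an assumption, not part of the slides), a balanced 6-point scale can be coded so that the sign of the centered score carries the direction of the response, since there is no neutral midpoint:

```python
# Numeric coding for the balanced 6-point agreement scale above.
SCALE = {
    "Disagree Strongly": 1,
    "Disagree Moderately": 2,
    "Disagree Slightly": 3,
    "Agree Slightly": 4,
    "Agree Moderately": 5,
    "Agree Strongly": 6,
}

def centered(code: int) -> float:
    """Center on the scale midpoint (3.5): negative = disagree, positive = agree."""
    return code - 3.5

responses = ["Agree Slightly", "Disagree Moderately", "Agree Strongly"]
print([centered(SCALE[r]) for r in responses])  # [0.5, -1.5, 2.5]
```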

  41. Evaluation Design Principles: Rating Scales • This scale affords you ample flexibility for data analysis. • Depending on the questions, other scales may be appropriate, but the important thing to remember is that it must be balanced, or you will build in a biasing factor. • Avoid “neutral” and “neither agree nor disagree” options; on a 5-point scale you’re just giving up 20% of your evaluation ‘real estate’.

  42. Evaluation Design Principles: Rating Scales 1. Please rate the volume and variety of patients available to the program for educational purposes. (Poor / Fair / Good / Very Good / Excellent) 2. Please rate the performance of your faculty members. (Poor / Fair / Good / Very Good / Excellent) 3. Please rate the competence and knowledge in general medicine. (Poor / Fair / Good / Very Good / Excellent)

  43. Evaluation Design Principles: Rating Scales The data will be artificially skewed in the positive direction using this scale because there are far more (4:1) positive than negative rating options….Yet we see this scale being used all the time!
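
A quick way to make the imbalance visible (a minimal sketch that follows the slide's 4:1 reading, in which only "Poor" counts as clearly negative):

```python
UNBALANCED = ["Poor", "Fair", "Good", "Very Good", "Excellent"]
NEGATIVE = {"Poor"}  # assumption: only "Poor" reads as clearly negative

positives = [a for a in UNBALANCED if a not in NEGATIVE]
print(f"{len(positives)}:{len(NEGATIVE)} positive-to-negative anchors")  # 4:1

# The middle option (code 3 of 1..5) is "Good", so even a rater choosing
# the midpoint records a positive judgment.
print("Scale midpoint:", UNBALANCED[len(UNBALANCED) // 2])  # Good
```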

  44. Gentle Words of Wisdom…. Avoid large numbers of questions…. • Respondent fatigue: the respondent tends to give similar ratings to all items without giving much thought to individual items, just wanting to finish • In situations where many items are considered important, a large number can receive very similar ratings at the top end of the scale • Items are not traded off against each other, so items that are similarly important, or that do not sit at the extreme ends of the scale, are given similar ratings

  45. Gentle Words of Wisdom…. Avoid large numbers of questions…. but ensure your evaluation is valid and has enough questions to be reliable….

  46. How many questions (raters) are enough? • Not intuitive • A little bit of math is necessary (sorry) • True Score = Observed Score ± Error Score
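
One standard answer to "how many questions or raters are enough?" within this true-score framework is the Spearman-Brown prophecy formula. The sketch below is an illustration rather than part of the slides: it predicts the reliability of an average over k parallel items or raters, and inverts the formula to find the k needed to reach a target reliability.

```python
def spearman_brown(r_single: float, k: float) -> float:
    """Predicted reliability of the average of k parallel items/raters."""
    return k * r_single / (1 + (k - 1) * r_single)

def items_needed(r_single: float, r_target: float) -> float:
    """Lengthening factor k needed to move reliability from r_single to r_target."""
    return r_target * (1 - r_single) / (r_single * (1 - r_target))

# Example: one rater with reliability 0.40; how many raters for 0.80?
print(round(items_needed(0.40, 0.80), 1))  # 6.0 raters
print(round(spearman_brown(0.40, 6), 2))   # 0.8 (check)
```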

  47. Why are we talking about reliability in a question-writing session? • To create your own evaluation questions and ensure their reliability • To share/use other evaluations that are assuredly reliable • To read the evaluation literature

  48. Reliability • Reliability is the "consistency" or "repeatability" of your measures. • If you could create 1 perfect test question (unbiased and perfectly representative of the task), you would need only that one question • OR if you could find 1 perfect rater (unbiased and fully understanding the task), you would need only one rater

  49. Reliability Estimates • Test designers use four correlational methods to check the reliability of an evaluation: • the test-retest method (pre-test/post-test), • alternate forms, • internal consistency, • and inter-rater reliability.
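
For the internal-consistency method, Cronbach's alpha is the usual correlational statistic. A minimal sketch with made-up data, assuming rows are respondents and columns are evaluation items scored numerically:

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)      # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)  # variance of respondents' totals
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

ratings = np.array([
    [5, 4, 5, 4],
    [3, 3, 2, 3],
    [4, 4, 4, 5],
    [2, 1, 2, 2],
])
print(round(cronbach_alpha(ratings), 2))  # ~0.96: these items hang together well
```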

  50. Generalizability • One measure, based on score variances • Generalizability Theory
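
A minimal sketch of that variance-based measure: in a one-facet persons-by-raters design, the generalizability (G) coefficient compares true person (trainee) variance to rater-linked noise averaged over the number of raters. The variance components below are assumed for illustration; in practice they are estimated from an ANOVA of the actual rating data.

```python
var_person = 0.50       # true differences among trainees (signal); assumed value
var_rater_error = 0.30  # rater-by-trainee interaction plus error (noise); assumed

def g_coefficient(n_raters: int) -> float:
    """G = signal / (signal + noise averaged over n_raters)."""
    return var_person / (var_person + var_rater_error / n_raters)

for n in (1, 3, 5):
    print(n, round(g_coefficient(n), 2))  # reliability rises as raters are added
```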
