
Measuring Measuring: Developing a scale of proficiency for the CM framework


Presentation Transcript


    1. Measuring Measuring: Developing a scale of proficiency for the CM framework Presented at the 13th Biennial International Objective Measurement Workshop, April 7, 2006 Brent Duckor, Ph.D. Candidate, UC Berkeley

    2. 2 Item 106: Practice Test Five: Curriculum, Instruction, and Assessment (Kaplan, Praxis Edition, 2006) A new principal in an open-minded school approaches the teachers about including at least one objective test in each subject each quarter because of what he terms the need for accountability. The requirement for this accountability has probably come about because of the school board's concern about: (A) The depth of content assessed in performance tasks (B) The instructional planning time lost to grading essays and projects (C) The possibility of teacher bias in evaluating students (D) The teachers' skill in creating performance-type assessments

    3. 3 Answers and Explanations (Kaplan, Praxis Edition, p. 391, 2006) 106 (D). (A), (B), and (C) are all possibilities but probably not the driving force. (D) is the correct answer because the school board is probably concerned that the teachers' assessments are not rigorous enough, and they want to make sure there is a professional tool involved. The received wisdom and subtle message is that teachers' knowledge of educational measurement is deficient. It is assumed that professionals who make professional tools have the knowledge to make assessments (instruments) rigorous. In other words, experts are expert. But how do they (and we as teachers) learn to be rigorous, and what is the learning progression on the path to acquiring knowledge of educational measurement and assessment? My research explicitly addresses these questions: What does learning educational measurement look like? What are the steps in the learning progression? Which variables constitute the core of educational measurement knowledge?

    4. Background on the study Defining the universe of measurement knowledge Listening to the authorities and professionals Domains beyond reliability and validity Limitations to content-mining and item hunting Cognition-based approaches (how we think) Knowledge types (Shavelson et al., 2004) Construct modeling Building blocks framework (Wilson, 2005) Evidentiary approach (Mislevy et al., 2003) Assessment triangle (NRC, 2000) What is MK and who defines it? How do we think about MK? What types of latent traits or constructs might constitute the MK universe?

    5. 5 Wilson's Constructing Measures (2005) framework for understanding educational measurement Wilson's CM framework provides both a definition of MK and a method for measuring it

    6. Study Posits the existence of multidimensional proficiencies underlying the constructing measures (CM) framework Develops 5 construct maps, pools of items, and a strategy for scoring item responses Persons sampled from diverse but construct-proximal populations Fits the partial credit Rasch measurement model to empirical data to test hypotheses about the structure of proficiencies Examines evidence for reliability and validity of inferences drawn from scores on the CM instrument Explores relations of the CM scale to other variables that may explain variations in individual proficiencies
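The measurement model named above is the partial credit Rasch model, under which the probability that a respondent lands in score category k of an item depends on the cumulative gaps between the person's proficiency and the item's step difficulties. The sketch below is a minimal illustration of those category probabilities, not the study's code (the study's calibration was run in ConQuest), and the step difficulties in the example are invented.

```python
import numpy as np

def pcm_probabilities(theta, deltas):
    """Category probabilities P(X = 0), ..., P(X = m) under the partial
    credit model: the log-odds of scoring k rather than k-1 on the item
    is theta - delta_k.

    theta  : person proficiency (logits)
    deltas : step difficulties delta_1 ... delta_m (logits), one per step
    """
    # Cumulative sums of (theta - delta_j); category 0 contributes 0 to the exponent
    exponents = np.concatenate(([0.0], np.cumsum(theta - np.asarray(deltas, dtype=float))))
    numerators = np.exp(exponents)
    return numerators / numerators.sum()

# Hypothetical 3-category item (scores 0, 1, 2) and a respondent at 0.5 logits
print(pcm_probabilities(0.5, [-1.0, 1.2]))
```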

    7. 5 constructs/dimensions under investigation in this study Understanding Construct Maps (UCM) Understanding the Items Design (UID) Understanding the Outcome Space (UOS) Understanding Wright Maps (UWM) Understanding Quality Control (UQC) Evidence for Validity Evidence for Reliability

    8. Research Questions Quality of the CM instrument: Evidence for Validity R1: What validity evidence is there for the content of the CM instrument? R2: What validity evidence is there based on the response processes of the CM instrument? R3: What validity evidence is there based on the internal structure of the CM instrument? R4: What validity evidence is there based on relations to external variables of the CM instrument?

    9. Research Questions Quality of the CM instrument: Reliability R5: Is there evidence that the CM instrument has sufficient internal consistency? R6: Is there evidence that the CM instrument has sufficient inter-rater consistency? Factors associated with proficiency on the CM instrument R7: What is the relationship, if any, between performance on the CM instrument and other factors such as research, professional and course experience?

    10. Methods Instruments: CM instrument (n=72) 18 open-ended items 8 fixed choice items 4 exit interview items 53 demographic items Embedded assessments (n=8) Construct map homework Items design homework Data collection homework Final report Semi-structured interviews (n=5) Subjects: Three sample pools EDU274A alumni CAESL Work Circle professional development IOMW 2004 72 participants Characteristics: Female (58%) Under 40 years (69.4%) Graduate students (48.6%) Have a Master's degree (59.7%)

    11. 11 Procedures

    12. Results (RQ1): Validity evidence for CM instrument content The construct maps Theoretical descriptions of locations of respondents and responses to items for each sub-dimension The items design Mixed item format Task analysis The outcome space General rubrics Item-specific rubrics The measurement model Technically calibrated Wright Map employed to test hypotheses about respondent and item locations More than just a test blueprint
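As a loose illustration of the first building block, a construct map can be thought of as a set of ordered levels, each pairing a description of respondents at that location with a description of the item responses expected from them. The level wording below is invented for illustration and is not drawn from the study's UCM construct map.

```python
from dataclasses import dataclass

@dataclass
class Level:
    """One ordered location on a construct map."""
    score: int
    respondents: str   # what respondents at this location understand or can do
    responses: str     # what their responses to items typically look like

# Hypothetical, abbreviated levels for a UCM-like sub-dimension
ucm_map = [
    Level(0, "No notion of an underlying continuum", "Lists topics with no ordering"),
    Level(1, "Recognizes ordered categories but not a latent variable", "Orders categories inconsistently"),
    Level(2, "Describes ordered locations for respondents and responses", "Produces a coherent, hierarchical map"),
]

# Print from highest to lowest location, as construct maps are usually read
for level in sorted(ucm_map, key=lambda l: l.score, reverse=True):
    print(level.score, "|", level.respondents, "|", level.responses)
```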

    13. 13 Construct Map (UCM)

    14. 14 Items design Mixed format Open-ended Visual and/or verbal prompt Extended response Fixed choice Stem Partially ordered distractors Task analysis Task demands Cognitive demands Item openness and complexity

    15. 15 Open-ended item (UCM1) OE1 Stem: An educational consultant is asked to develop an instrument to measure understanding of a "Living the Civil War" after-school program. The consultant proposes to measure the following: Figure: Textbox titled "Participants' level of historical knowledge" Two columns titled "Respondents" and "Responses to items" Each column contains descriptions for a given level Two prompts: Is this a good example of a construct map? Please explain. What advice, if any, would you give to improve this construct map?

    16. 16 UCM1: Item analysis

    17. 17 General Scoring Guide (UCM)


    19. 19 Wright Map (UCM)

    20. Results (RQ2): Validity evidence for response processes Of the 84.7% reporting, four out of five respondents did not find the CM instrument confusing Of the 91.7% reporting, respondents did identify several factors that they believed affected their ability to give their best response to the CM instrument: Content domain and/or prior knowledge (41%) Time and length (38%) Memory (13%) Administration and/or format (8%)

    21. 21 Results (RQ2): Validity evidence for response processes Of the 88.9% reporting, two out of three respondents did not want to go back and change any of their responses, although some reported using test-taking strategies and guessing on the fixed choice items Of the 76.4% reporting, three out of four respondents believed the CM instrument could be improved, while the rest either did not think it needed improvement (18%) or were not sure (6%) Respondents suggested the following areas for improvement: Shorten time and length (43%) Item format and wording, e.g. fixed choice distractors (33%) Terminology and content coverage, e.g. reliability scenarios (19%) Standardize administration conditions (5%)

    22. Results (RQ3): Validity evidence for internal structure Did the evidence support the constructs? Wright Map (CM Scale) suggests the structure predicted by the construct map(s) Yet there is evidence for multidimensionality, given the low correlations between separately calibrated maps (.263 ≤ r ≤ .538) Did the evidence support the items design? Item analysis: For each item, the mean location of the item thresholds increases as the score increases Respondents higher on the construct are, in fact, also scoring higher on each item Differential Item Functioning (DIF): Female respondents scored 0.088 logits lower than male respondents, but this parameter estimate is not statistically significant (chi-square test of parameter equality = 0.46, df = 1, p = 0.499) While overall no statistically significant evidence of DIF was found, one item (MC6) did display largish DIF (.766 > .638), which is likely due to sampling effects
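A small illustration of the item-analysis check described above: for each item, confirm that the threshold estimates (in logits) increase with the score category, so that higher scores correspond to higher locations on the construct. The threshold values below are fabricated for illustration; the study's values came from its own calibration of the CM instrument.

```python
# Hypothetical threshold estimates (logits) per item, ordered by score category
item_thresholds = {
    "OE1": [-1.3, -0.2, 0.9],
    "OE2": [-0.8, 0.4, 1.6],
    "MC6": [-0.5, 0.7],
}

for item, thresholds in item_thresholds.items():
    ordered = all(a < b for a, b in zip(thresholds, thresholds[1:]))
    print(f"{item}: thresholds {thresholds} increase with score: {ordered}")
```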

    23. CM Scale

    24. CM Instrument Partial Credit Model fit Did the Rasch measurement model fit the item data? Overall, weighted mean square statistics indicated good item fit (.75 < MNSQ < 1.33) Only two generalized item thresholds (OE12.0 and OE14.0) showed evidence of misfit (.58), but neither was statistically significant (-0.8) Did the Rasch measurement model fit the person data? Overall, weighted mean square fit statistics (.75 < MNSQ < 1.33) indicated relatively good person fit, with some exceptions 6 out of 72 respondents did show evidence of statistically significant person misfit Two cases (MNSQ = .49) indicated better than expected model fit Four cases (MNSQ = 1.89, 2.05, 2.44, 3.14) showed worse than expected fit, indicating that the expected order may be wrong, i.e. the model did not account for much of the variability in these individuals' scores For these four cases, the response process evidence from the exit interviews confirmed that at least two of these respondents found the instrument confusing and/or difficult to engage with
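For readers unfamiliar with the fit statistic quoted above, the weighted (infit) mean square compares squared residuals between observed and model-expected scores to the model variances; values near 1.0 indicate adequate fit, and the slide uses roughly .75 to 1.33 as the acceptable band. The sketch below shows the computation with invented numbers; in practice the expectations and variances would come from the calibrated partial credit model (e.g., ConQuest output).

```python
import numpy as np

def infit_mnsq(observed, expected, variance):
    """Information-weighted (infit) mean square for one item or person:
    sum of squared residuals divided by the sum of model variances."""
    observed, expected, variance = (np.asarray(a, dtype=float)
                                    for a in (observed, expected, variance))
    return np.sum((observed - expected) ** 2) / np.sum(variance)

# Illustrative numbers only
print(infit_mnsq(observed=[1, 2, 0, 2],
                 expected=[0.8, 1.6, 0.5, 1.7],
                 variance=[0.5, 0.4, 0.4, 0.3]))
```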

    25. 25 Results (RQ4): Validity evidence for relations to other variables Correlations between 274A course grades and EAP, MLE, and raw scores were all low(*) Ranking of post-interview (SSI) responses and 274A final reports (EA) corresponded with patterns of CM proficiency (*) This may be due to restriction of range or attenuation effects, since grades were only available for 35 of the 72 respondents
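One way to read the attenuation caveat above: Spearman's correction for attenuation estimates how large a correlation would be if both measures were free of measurement error. The observed correlation and reliabilities in the sketch below are hypothetical, since the slide reports only that the observed correlations were low.

```python
import math

def disattenuated_r(r_observed, reliability_x, reliability_y):
    """Spearman's correction for attenuation:
    r_true is approximately r_observed / sqrt(rel_x * rel_y)."""
    return r_observed / math.sqrt(reliability_x * reliability_y)

# Hypothetical values: observed r = .20, grade reliability .80, CM score reliability .70
print(round(disattenuated_r(0.20, 0.80, 0.70), 3))  # about 0.267
```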

    26. 26 Results (RQ5): Evidence for reliability

    27. Results (RQ5): Reliability evidence for CM instruments internal consistency

    28. Results (RQ5): Reliability evidence for alternate forms

    29. Results (RQ6): Reliability evidence for rater agreement

    30. 30 Results (RQ6): Reliability evidence for rater consistency

    31. Results (RQ7): Factors associated with proficiency on CM scale These four independent variables explain about 35% of the variation in proficiency; that is, they seem to affect proficiency on the CM scale. It may be the case that other factors also affect proficiency, or there may be unaccounted-for random measurement error in the CM instrument scores. Fitting a unidimensional latent regression model with ConQuest might address the latter concern.

    32. 32 Results (RQ7): Factors associated with proficiency on CM scale Those respondents who have taken 274A, and have research, paid professional, and/or consulting experience in the field, do seem to score higher on the CM scale The coefficient of .477 for the dummy variable (274A experience) indicates that, on average, individuals who took the course score .477 logits higher on the CM scale compared to those who did not The coefficient of .275 for the dummy variable (research experience) indicates that, on average, individuals who have research experience score .275 logits higher on the CM scale compared to those who did not The coefficient of .232 for the dummy variable (paid professional/consulting experience) indicates that, on average, individuals who have professional experience score .232 logits higher on the CM scale compared to those who did not
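The coefficients above come from a regression of CM proficiency (in logits) on dummy-coded experience variables. The sketch below shows the general form of such a dummy-variable regression on simulated data; the sample values and "true" coefficients are fabricated and only loosely echo the reported estimates.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 72  # matches the study's sample size, but the data below are simulated

X = np.column_stack([
    np.ones(n),                 # intercept
    rng.integers(0, 2, n),      # took 274A (1 = yes)
    rng.integers(0, 2, n),      # research experience (1 = yes)
    rng.integers(0, 2, n),      # paid professional/consulting experience (1 = yes)
])
beta_true = np.array([-0.30, 0.48, 0.28, 0.23])   # invented values
y = X @ beta_true + rng.normal(0, 0.5, n)         # simulated CM logit scores

# Ordinary least squares fit; each slope is the average logit advantage
# associated with that experience indicator, holding the others fixed
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print("estimated coefficients:", np.round(beta_hat, 3))
```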

    33. 33 Results (RQ7): Factors associated with proficiency on CM scale (cont.) All variance inflation factors (VIF) are low. Allison (1999) suggests values less than 2.50 indicate no or little evidence of multicollinearity.
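For reference, the variance inflation factor for predictor j is 1 / (1 - R²_j), where R²_j comes from regressing that predictor on the remaining predictors. The sketch below computes VIFs for a fabricated set of binary predictors; under the rule of thumb cited above, values below about 2.50 would be read as little evidence of multicollinearity.

```python
import numpy as np

def vif(X):
    """Variance inflation factors for the columns of a predictor matrix X
    (no intercept column): VIF_j = 1 / (1 - R^2_j)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    factors = []
    for j in range(p):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r_squared = 1.0 - resid.var() / y.var()
        factors.append(1.0 / (1.0 - r_squared))
    return factors

# Fabricated binary predictors standing in for the study's experience dummies
rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(72, 4))
print([round(v, 2) for v in vif(X)])
```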

    34. 34 Discussion/Next steps Construct design Improve theory of overall CM proficiency Items design/outcome space Revise or remove several fixed choice items Change stems to use terms (e.g. "score interpretation") consistently Clarify language in distractors Develop more open-ended items targeted on the UOS and UWM dimensions Augment task analysis and think-aloud protocols on open-ended items to ensure better understanding of cognitive processes and the possible role of construct-irrelevant noise Measurement model Fit unidimensional latent regression model Fit multidimensional model to (n > 72) data set
