
Overview


Presentation Transcript


1. The Statistics Concept Inventory • Presenters: Kirk Allen, School of Industrial Engineering; Robert Terry, Department of Psychology • Other team members: Teri Reed Rhoads, Director of Engineering Education; Teri Jo Murphy, Associate Professor of Mathematics; Andrea Stone, Ph.D. student in Mathematics; Maria Cohenour, Ph.D. student in Psychology

  2. Overview • Background • Big picture analysis • reliability, validity • Item analysis

  3. Background • Statistics Concept Inventory (SCI) project began in Fall 2002 • Based on the format of the Force Concept Inventory (FCI) • Shifts focus away from problem solving, which is the typical classroom format • Focus on conceptual understanding • Multiple choice, around 30 items

  4. Force Concept Inventory • Focuses on Newton’s three laws and related concepts • Scores and gains on initial testing much lower than expected • Led to evaluating teaching styles • Interactive engagement found to be most effective at increasing student understanding

  5. Other Concept Inventories • Many engineering disciplines are developing concept inventories • e.g., thermodynamics, circuits, materials, dynamics, statics, systems & signals • Foundation Coalition • http://www.foundationcoalition.org/home/keycomponents/concept/index.html

  6. Statistics Concept Inventory • Currently at 4 dimensions – 38 items total • Descriptive statistics - 11 items • Inferential statistics - 11 items • Probability - 9 items • Graphical - 7 items

  7. SCI Pilot Study (2003 FIE) • Pilot version of the instrument tested in 6 classes during Fall 2002, near the end of the semester • Intro classes: Engineering, Communications, Mathematics (2) • Advanced classes: Regression, Design of Experiments • 32 questions • Results • Males significantly outperformed females on the SCI • Mathematics majors outperformed social science majors, but no other pairs of majors differed significantly • Most of the social science majors were in a class with poor testing conditions, which may be the reason for their low scores • SCI scores positively correlated with statistics experience and a statistics attitudinal measure

  8. Further Work (2004 ASEE) • Based on results from the revised SCI from Summer 2003 and Fall 2003 • Focus on assessing and improving the validity, reliability, and discriminatory power of the instrument • Utilized focus groups, psychometric analyses, and expert opinions

  9. Results – Spring 2005

  10. Reliability • A reliable instrument is one whose measurement error is small; equivalently, reliability is the extent to which the instrument's scores are repeatable • Most commonly measured using internal consistency • Cronbach’s alpha, which is a generalization of Kuder-Richardson equation 20 (KR-20) • Alpha above 0.80 is reliable by virtually any standard • Alpha 0.60 to 0.80 is considered reliable for classroom tests according to some references (e.g., Oosterhof)
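
As a concrete illustration, here is a minimal sketch of Cronbach's alpha in Python, assuming a respondents × items score matrix (0/1 for multiple-choice items, in which case alpha reduces to KR-20). The function name and data layout are illustrative, not the SCI team's actual code:

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for a respondents x items score matrix.

    For dichotomous (0/1) items this reduces to KR-20.
    """
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                          # number of items
    item_vars = scores.var(axis=0, ddof=1)       # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)
```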

  11. Reliability – Big picture

  12. Reliability – More detailed (Spring 2005)

  13. Reliability Observations • Reliability generally increases from pre-test to post-test • Guessing (pre) tends to lower alpha • Reliability varies by class, especially the lower value for the Psych Gen Ed course • Similar results for other algebra-based courses • Different math background • Questions are context-dependent? • Reliability at other universities has been lower at times but not always • Reliability generalization?

  14. Reliability – other measures • For multi-dimensional data, alpha underestimates the true reliability • Theta – based on the largest eigenvalue from principal components • Value 0.77 • Omega – based on communalities from the common factor model • Value 0.86 • In general α ≤ θ ≤ Ω; here 0.70 ≤ 0.77 ≤ 0.86 • Force Concept Inventory is “more” uni-dimensional (α = 0.89) • SCI designed to measure four concepts
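
For reference, the theta coefficient (Armor's theta) can be computed from the largest eigenvalue of the item correlation matrix; a sketch under the same respondents × items assumption as the alpha example above:

```python
import numpy as np

def armor_theta(scores: np.ndarray) -> float:
    """Armor's theta: alpha-like reliability based on the first principal component."""
    scores = np.asarray(scores, dtype=float)
    p = scores.shape[1]                           # number of items
    corr = np.corrcoef(scores, rowvar=False)      # item correlation matrix
    lam1 = np.linalg.eigvalsh(corr)[-1]           # largest eigenvalue
    return (p / (p - 1)) * (1 - 1 / lam1)
```

Because theta uses only the first principal component while omega draws on all common-factor variance, the ordering α ≤ θ ≤ Ω is expected for multi-dimensional tests like the SCI.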

  15. Validity • Many types of validity (e.g., face, concurrent, predictive, incremental, construct) • Focused on content, concurrent, and predictive because they are broad validity concepts and are commonly used in the literature

  16. Content Validity • Content validity refers to the extent to which items are (1) representative of the knowledge base being tested and (2) constructed in a “sensible” manner (Nunnally) • Focus groups – ensure that each question is interpreted as intended and help develop useful distractors

  17. Content Validity • Faculty survey – statistics topics were rated for their importance to the faculty – helps provide a list of which topics to include on the SCI • Planning to conduct a new survey (online) using faculty from outside OU as well as non-engineering faculty • AP Statistics course outline – also consulted for topic coverage • Gibbs’ criteria – identify poorly written questions

  18. Concurrent Validity • Concurrent validity is “assessed by correlating the test with other tests” (Klein) • For the “other test”, we used the overall course grade.
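
A minimal sketch of this check in Python, with hypothetical SCI scores and course grades standing in for real class data:

```python
from scipy.stats import pearsonr

# Hypothetical data: SCI post-test scores and final course grades (percent)
sci_scores   = [22, 15, 30, 27, 18, 25, 29, 20]
course_grade = [88, 70, 95, 90, 74, 85, 93, 79]

r, p = pearsonr(sci_scores, course_grade)
print(f"r = {r:.2f}, p = {p:.3f}")  # a significant positive r supports concurrent validity
```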

  19. Concurrent Validity • For Fall 2003 – Added 2 external universities (3 classes total, intro level, Engineering depts.), along with Engr & Math • Valid as a post-test for all four engineering stats courses, but again not for Math

  20. Concurrent Validity • For Spring 2004 – Three courses: 1 Engr, 2 Math • Different results: Math course now valid but not Engr • Engr had a different professor and textbook

  21. More observations • In general, pre-test performance on the SCI is less predictive of end-of-course performance than the post-test SCI performance, as expected • This analysis could serve as a diagnostic to determine which instructors focus on concepts vs. calculations

  22. Construct Validity • Three-factor and four-factor FIML models, each with a general factor • Descriptive, inferential, probability, and graphical sub-tests • Graphical grouped a priori with Descriptive in the 3-factor confirmatory model • Overall results: item uniqueness 70.1% and 70.4%, i.e., most items share about 30% of their variance with the factors • Preference is for the four-factor model because the graphical items form a separate sub-test • Next step: verify the analysis with more recent data, now that more graphical items have been added to the SCI • These results are based on Fall 2003 (largest dataset thus far)
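
The slide's FIML confirmatory analysis is beyond a short sketch, but an exploratory stand-in shows where the uniqueness figures come from (uniqueness = 1 − communality). This assumes the third-party factor_analyzer package and uses fabricated 0/1 response data:

```python
import numpy as np
from factor_analyzer import FactorAnalyzer

# Hypothetical respondents x items matrix of 0/1 item scores
X = np.random.default_rng(0).integers(0, 2, size=(200, 38))

fa = FactorAnalyzer(n_factors=4, rotation="oblimin")
fa.fit(X)

uniqueness = 1 - fa.get_communalities()   # variance not shared with the factors
print(f"mean item uniqueness: {uniqueness.mean():.1%}")
```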

  23. Test Information Curve for Probability Subtest

  24. Test Information Curve for Descriptive Subtest

  25. Test Information Curve for Inferential Subtest

  26. Test Information Curve for Graphical Subtest (2 items)

  27. Item Discrimination Index • Compares the top quartile to the bottom quartile on each item • Generally around 1/3 of the items fall into each of the ranges: poor (< 0.20), moderate (0.20 to 0.40), and high (> 0.40)
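
A minimal sketch of the index, assuming an item's 0/1 score vector and each student's total score; the quartile convention (top vs. bottom 25%) follows the slide:

```python
import numpy as np

def discrimination_index(item: np.ndarray, total: np.ndarray) -> float:
    """Proportion correct in the top quartile minus the bottom quartile."""
    item = np.asarray(item, dtype=float)
    total = np.asarray(total, dtype=float)
    hi = item[total >= np.percentile(total, 75)]   # top quartile by total score
    lo = item[total <= np.percentile(total, 25)]   # bottom quartile by total score
    return hi.mean() - lo.mean()
```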

  28. Percentage of items falling into each range

  29. Item Analysis • Discrimination index • Alpha-if-deleted • Reported by SPSS or SAS • Shows how the overall alpha would change if that one item were deleted (see the sketch after this list) • Answer distribution • Try to eliminate or improve choices which are consistently not chosen • Focus group comments • IRT curves
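
A sketch of alpha-if-deleted, repeating the cronbach_alpha definition from the earlier reliability sketch so the block is self-contained; SPSS and SAS report this statistic directly, so this only illustrates what they compute:

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    k = scores.shape[1]
    return (k / (k - 1)) * (1 - scores.var(axis=0, ddof=1).sum()
                            / scores.sum(axis=1).var(ddof=1))

def alpha_if_deleted(scores: np.ndarray) -> np.ndarray:
    """Overall alpha recomputed with each item removed in turn."""
    scores = np.asarray(scores, dtype=float)
    return np.array([cronbach_alpha(np.delete(scores, j, axis=1))
                     for j in range(scores.shape[1])])
```

An item whose deleted alpha exceeds the overall alpha is dragging reliability down and is a candidate for revision, as with the conditional-probability item below.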

  30. Item Analysis If P(A|B) = 0.70, what is P(B|A)? a) 0.70 b) 0.30 c) 1.00 d) 0 e) Not enough information (** correct **) f) Other: __________________________ • Question which was totally changed • Fall 2002: • Discrimination index poor (0.16) • Alpha-if-deleted above the overall alpha (deleting the item would increase alpha) • Too symbol-oriented, not focused on the concept • Topic of conditional probability too important to delete (faculty survey)
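
For reference, Bayes' rule gives P(B|A) = P(A|B) · P(B) / P(A), so P(B|A) cannot be recovered from P(A|B) alone; without P(A) and P(B), answer e) is the only defensible choice.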

  31. Item Analysis • Replacement item In a manufacturing process, the error rate is 1 in 1000. However, errors often occur in bursts. Given that the previous output contained an error, what is the probability that the next unit will also contain an error? a) Less than 1 in 1000 b) Greater than 1 in 1000 (** correct **) c) Equal to 1 in 1000 d) Insufficient information

  32. Item Analysis • Summer 2003: • Three of four classes have discriminatory indices above 0.30 (max 0.55) • Same three also have positive effect on alpha • Focus groups: comments on “non-memoryless” property and bursts would “throw off the odds” • Possible problem: some students chose D because unsure how a “burst” is defined

  33. Upon Further Revision • In a manufacturing process, the error rate is 1 in 1000. However, errors often occur in groups, that is, they are not independent. Given that the previous output contained an error, what is the probability that the next unit will also contain an error? a) Less than 1 in 1000 b) Greater than 1 in 1000 (** correct **) c) Equal to 1 in 1000 d) Insufficient information

  34. New data • Item discrimination improved compared to Summer 2003 • Fall 2004, average 0.63 • Spring 2005, average 0.56 • Around 40% to 50% correct • Other options are chosen in approximately equal proportions, with A only slightly less

  35. IRT curves – First Version

  36. IRT curves – Second Version

  37. Law of Large Numbers • Which would be more likely to have 70% boys born on a given day: A small rural hospital or a large urban hospital? • a) Rural • b) Urban • c) Equally likely • d) Both are extremely unlikely

  38. Results from 3 classes, percent of students choosing each letter (Spring 2004). Misconception: students do not realize the effect of sample size on the variability of means. Discrimination index on the post-test is 0.44, 0.27, and 0.50, so the question can be considered psychometrically “good” while still demonstrating the lack of knowledge gain.
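
The sample-size effect behind the intended answer can be checked directly; a sketch with hypothetical hospital sizes (10 vs. 100 births per day):

```python
from scipy.stats import binom

# P(at least 70% boys) under a fair 50/50 birth rate
for name, n in [("rural (n=10)", 10), ("urban (n=100)", 100)]:
    p = binom.sf(int(0.7 * n) - 1, n, 0.5)   # sf(k-1) = P(X >= k)
    print(f"{name}: P(>=70% boys) = {p:.2e}")
```

For n = 10 the probability is about 0.17, while for n = 100 it is on the order of 10⁻⁵: the smaller hospital is far more likely to see such a deviation, which is exactly what answer a) requires.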

  39. A fair coin is flipped four times in a row, each time landing with heads up. What is the most likely outcome if the coin is flipped a fifth time? a) Tails, because even though for each flip heads and tails are equally likely, since there have been four heads, tails is slightly more likely b) Heads, because this coin has a pattern of landing heads up c) Tails, because in any sequence of tosses, there should be about the same number of heads and tails d) Heads and tails are equally likely (** correct **)

  40. Results • Almost everyone gets this question correct • Fall 2003, Six courses • 83% to 98% correct • Discrimination -0.06, 0.02, 0.08, 0.14, 0.19, 0.31

  41. New Version (same choices) • A coin of unknown origin is flipped twelve times in a row, each time landing with heads up. What is the most likely outcome if the coin is flipped a thirteenth time? • Tails, because even though for each flip heads and tails are equally likely, since there have been twelve heads, tails is slightly more likely • Heads, because this coin has a pattern of landing heads up (** correct **) • Tails, because in any sequence of tosses, there should be about the same number of heads and tails • Heads and tails are equally likely

  42. Results • Fall 2004 and Spring 2005 • Engineering class does reasonably well (38% correct), discrimination 0.67 • Other classes poor (10–15% correct), low discrimination • Nearly all students still choose D • Students seem to be “trained” to answer that coins are fair, but they cannot adapt to a situation where the coin (or the flipper) is most likely unfair.

  43. Understanding p-values • A researcher performs a t-test of a null against an alternative hypothesis (the hypotheses were displayed on the slide). He rejects the null hypothesis and reports a p-value of 0.10. Which of the following must be correct? • The test statistic fell within the rejection region at the significance level • The power of the test statistic used was 90% • Assuming H0 is true, there is a 10% possibility that the observed value is due to chance • The probability that the null hypothesis is not true is 0.10 • The probability that the null hypothesis is actually true is 0.9

  44. Results for 4 classes

  45. Analysis • Discrimination • Pre: 0.25, -0.17, 0.52, 0.15 • Post: 0.00, -0.14, 0.25, 0.33

  46. P-value question • Problems? • too definitional • p-value taught from an interpretive standpoint • when to reject or not reject the null hypothesis • Therefore …

  47. New question (not a replacement) • An engineer performs a hypothesis test and reports a p-value of 0.03. Based on a significance level of 0.05, what is the correct conclusion? • The null hypothesis is true. • The alternate hypothesis is true. • Do not reject the null hypothesis. • Reject the null hypothesis. (** correct **)
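
The decision rule this item targets (reject H0 when p < α) in a runnable sketch; the sample data and hypothesized mean are fabricated for illustration:

```python
import numpy as np
from scipy.stats import ttest_1samp

# Hypothetical sample; H0: population mean = 5.0
rng = np.random.default_rng(1)
sample = rng.normal(loc=5.4, scale=1.0, size=30)

t, p = ttest_1samp(sample, popmean=5.0)
alpha = 0.05
print(f"p = {p:.3f} ->", "reject H0" if p < alpha else "do not reject H0")
```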
