

1. Using statistics to evaluate your test Gerard Seinhorst STANAG 6001 Testing Workshop 2018 Workshop C2 Kranjska Gora, Slovenia

2. WORKSHOP OBJECTIVES
• Understand how to describe and analyze test data, and draw conclusions from it
• Learn how to calculate and interpret the B-Index
• Learn how to create a test summary report from raw test data
• Understand how quantitative data analysis can support your claims about the test

3. B-INDEX The B-Index is an item statistic that indicates the degree to which the Masters (those who passed, e.g., a Level 3 test) outperformed the Non-Masters (test takers who failed the Level 3 test) on each item.
• Calculation of the B-Index:
  • Determine what the cut score for passing the test is, e.g., 70%
  • Split the scores into a group of Masters (at least 70% correct on the test) and Non-Masters
  • For each item, subtract the FV (facility value: the proportion of the group answering the item correctly) for the Non-Masters from the FV for the Masters
• Interpretation of the B-Index is similar to that for the DI.
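
As an illustration, a minimal Python sketch of these steps; the response matrix and the 70% cut score are invented for the example, not taken from the workshop data file:

```python
# Minimal sketch: computing the B-Index for each item.
# `responses` holds one row per test taker, with 1 = correct, 0 = incorrect
# per item; the data and the cut score are illustrative.

responses = [
    [1, 1, 0, 1], [1, 1, 1, 1], [1, 0, 1, 1],   # higher scorers
    [0, 1, 0, 0], [1, 0, 0, 1], [0, 0, 1, 0],   # lower scorers
]
CUT = 0.70  # pass mark: at least 70% of items correct

def facility_values(group):
    """Proportion of the group answering each item correctly (the FV)."""
    n_items = len(group[0])
    return [sum(row[i] for row in group) / len(group) for i in range(n_items)]

masters = [r for r in responses if sum(r) / len(r) >= CUT]
non_masters = [r for r in responses if sum(r) / len(r) < CUT]

fv_m, fv_nm = facility_values(masters), facility_values(non_masters)
b_index = [round(m - nm, 2) for m, nm in zip(fv_m, fv_nm)]
print(b_index)   # one B-Index per item, interpreted like the DI
```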

4. SMALL-GROUP WORK
• Work in small groups (2-4 persons)
• Each group should have:
  • a handout
  • a flash drive with the data file
  • a laptop with MS Excel (preferably in English!)
  • at least one group member who is familiar with doing calculations in MS Excel
• The data file is named Workshop C2_Dataset Activity.xlsx and can be found on the flash drive
• The handout gives some guidance, but ask for help whenever needed
• Work on the activities until 11.45hrs
• At 11.45hrs, discussion of findings in plenary

  5. Remember… Numbers are like people: torture them enough and they’ll tell you anything. ANONYMOUS

6. ANALYSIS OF QUANTITATIVE TEST DATA: Descriptive Statistics
• TEST ANALYSIS – describing/analyzing test results and test population
  • Measures of Central Tendency
  • Measures of Dispersion
  • Reliability estimates
• ITEM ANALYSIS – describing/analyzing individual item characteristics
  • Item Difficulty
  • Item Discrimination
  • Distractor Efficiency

7. TEST ANALYSIS: Describing/Analyzing test results – Measures of Central Tendency
• Gives us an indication of the typical score on a test
• Answers questions such as:
  • In general, how did the test takers do on the test?
  • Was the test easy or difficult for this group?
  • How many test takers passed the test?
• Statistics:
  • Mean (average score)
  • Mode (most frequent score)
  • Median (the middle point in a rank-ordered set of scores)

8. TEST ANALYSIS: Describing/Analyzing test results – Measures of Central Tendency
• When the mean, mode and median are all very similar, we have a "normal distribution" of scores (bell-shaped curve)
• When they are not similar, the results are 'skewed'

9. MEAN, MEDIAN, MODE – Which measure of central tendency should you use? Depends on your data:
• If there are no extreme scores, use the MEAN {8, 9, 10, 10, 11, 11, 12, 13, 14}
• If there are extreme scores, use the MEDIAN {2, 9, 10, 10, 11, 12, 12, 12, 13}
• If your data cannot be rank-ordered (nominal variables, e.g., gender or occupation), or if one score occurs substantially more often than any other score, use the MODE {8, 10, 11, 12, 13, 13, 13, 13, 13}
Use the measure that best indicates the 'typical' score in your data set.
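
The data sets below are the ones from the slide; Python's built-in statistics module computes each measure directly:

```python
# The three slide data sets, with the recommended measure for each.
import statistics

no_extremes  = [8, 9, 10, 10, 11, 11, 12, 13, 14]
with_extreme = [2, 9, 10, 10, 11, 12, 12, 12, 13]
one_frequent = [8, 10, 11, 12, 13, 13, 13, 13, 13]

print(round(statistics.mean(no_extremes), 2))  # 10.89: the typical score
print(statistics.median(with_extreme))         # 11: robust against the extreme 2
print(statistics.mode(one_frequent))           # 13: the most frequent score
```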

10. TEST ANALYSIS: Describing/Analyzing test results – Measures of Dispersion
• Gives us an indication of how similar or spread out the scores are
• Answers questions such as:
  • How much difference is there between the highest and lowest score?
  • How similar were the test takers' results?
  • Are there any extreme scores ('outliers')?
• Statistics:
  • Range (difference between highest and lowest score)
  • Standard Deviation (average distance of the scores from the mean)

11. TEST ANALYSIS: Describing/Analyzing test results – Standard Deviation (SD, s.d. or σ)
• Small SD: scores are mostly close to the mean
• Large SD: scores are spread out
• Example: test 1 scores: 48, 49, 50, 51, 52; test 2 scores: 10, 20, 40, 80, 100
  • MEAN of both tests (250 / 5) = 50
  • RANGE: test 1 (52 minus 48) = 4; test 2 (100 minus 10) = 90
  • STANDARD DEVIATION: test 1 = 1.58; test 2 = 38.73
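
The slide's figures can be checked with the statistics module; note that statistics.stdev uses the sample formula (dividing by n - 1), which is what reproduces 1.58 and 38.73:

```python
# Verifying the slide's example: same mean, very different spread.
import statistics

test1 = [48, 49, 50, 51, 52]
test2 = [10, 20, 40, 80, 100]

for scores in (test1, test2):
    print(statistics.mean(scores),              # 50 for both tests
          max(scores) - min(scores),            # range: 4 vs 90
          round(statistics.stdev(scores), 2))   # SD: 1.58 vs 38.73
```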

  12. VISUALIZING DATA – Bar Chart

13. VISUALIZING DATA – Box Plot [Box plot of n = 32 scores, annotated with the minimum and maximum score, the median, the mean, an outlier, and four quartile segments of 25% of the scores each]
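
A minimal matplotlib sketch of such a box plot; the 32 scores are randomly generated stand-ins for real test data:

```python
# Box plot: box = middle 50% of scores, line = median, marker = mean,
# whiskers = range, points beyond the whiskers = outliers.
import random
import matplotlib.pyplot as plt

random.seed(1)
scores = [random.gauss(70, 8) for _ in range(31)] + [35]  # 35 = an outlier

plt.boxplot(scores, showmeans=True)
plt.title("Test scores (n = 32)")
plt.ylabel("Score")
plt.show()
```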

14. ITEM DISCRIMINATION (DI) The degree to which test takers with high overall test scores also got a particular item correct
• Indicates how well an item distinguishes between high achievers and low achievers
• Calculation: DI = FV(upper) - FV(lower), i.e., the FV of the top group (1/3 of test takers with the highest test scores) minus the FV of the bottom group (1/3 of test takers with the lowest test scores)
• Ranges from -1.00 to +1.00
• Optimal values:
  • .40 and above: very good item
  • .30 - .39: reasonably good item, possibly room for improvement
  • .20 - .29: acceptable, but needing improvement
  • < .20: poor item, to be rejected or revised
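
A minimal sketch of the upper/lower-third calculation; the (total score, item result) pairs are invented for the example:

```python
# Each tuple: (test taker's total score, 1 if this item correct else 0).
takers = [
    (95, 1), (90, 1), (88, 1), (72, 1), (70, 0), (65, 1),
    (60, 0), (45, 0), (40, 0),
]
takers.sort(key=lambda t: t[0], reverse=True)  # rank by total score

third = len(takers) // 3
upper, lower = takers[:third], takers[-third:]

fv_upper = sum(c for _, c in upper) / len(upper)   # 3/3 = 1.00
fv_lower = sum(c for _, c in lower) / len(lower)   # 0/3 = 0.00
print(fv_upper - fv_lower)                         # DI = 1.00: very good item
```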

15. DISTRACTOR ANALYSIS Distractor Efficiency is the degree to which a distractor worked as intended, i.e., attracting the low achievers but not the high achievers.
• The Distractor Efficiency is the number of test takers who selected that particular distractor, divided by the total number of test takers
• Example: see the sketch below
• A distractor that is chosen by fewer than 7% of the test takers (less than 0.07) is normally not functioning well and should be revised.
• However, bear in mind that the easier the item, the lower the distractor efficiency will be.
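
A small sketch of this count, using a made-up answer key and set of responses; the 7% rule flags the under-chosen distractor:

```python
# Distractor efficiency: choices per option divided by total test takers.
from collections import Counter

answers = list("BBABC" "CBABB" "DBABB" "BBCBB")  # 20 test takers' choices
key = "B"                                         # the correct option

counts = Counter(answers)
n = len(answers)
for option in "ABCD":
    share = counts[option] / n
    flag = "" if option == key or share >= 0.07 else "  <- revise: under 7%"
    print(option, round(share, 2), flag)   # D is chosen by only 5%
```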

16. OPTIMAL VALUES
• Mean, mode, median: N/A * (affected by test taker ability; should be interpreted in relation to the maximum possible score)
• Range, SD: N/A *
• FV: 0.30 - 0.70 (depends on test population, test type/purpose)
• DI: > 0.40 (is affected by the range of test takers' ability)
• Distractor Efficiency: ≥ 0.07 (indicates only how often a distractor was chosen, not if it was chosen by a high achiever or low achiever)
* Note: Descriptive statistics do not have an optimal value – they merely describe and summarize test or population characteristics without one value a priori being 'better' than another

17. TEST RELIABILITY (Alpha) Test score reliability is an estimate of the likelihood that scores would remain consistent over time if the same test were administered repeatedly to the same learners. A reliability coefficient of .85 indicates that 85% of the variation in observed scores was due to variation in the "true" scores, and that 15% cannot be accounted for and is called 'error' (owing to chance). Reliability coefficients range from .00 to 1.00. Ideal score reliabilities are > .80. Higher reliability = less measurement error.
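
The slides do not show how alpha is computed, but Cronbach's alpha is the usual internal-consistency estimate; a minimal sketch, assuming a matrix of item scores with one row per test taker (population variances used throughout, data illustrative):

```python
# Cronbach's alpha: k/(k-1) * (1 - sum(item variances) / variance(totals)).
import statistics

def cronbach_alpha(matrix):
    k = len(matrix[0])                                # number of items
    item_vars = [statistics.pvariance([row[i] for row in matrix])
                 for i in range(k)]
    total_var = statistics.pvariance([sum(row) for row in matrix])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

scores = [[1, 1, 1, 0], [1, 1, 0, 1], [1, 0, 1, 1],
          [0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 0]]
print(round(cronbach_alpha(scores), 2))   # ~0.50: well below the .80 ideal
```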

18. STANDARD ERROR of MEASUREMENT (SEM)
• An obtained test score is an estimate of a person's "true" test score
• The "true" score is the score that a test taker would get if s/he took the test an infinite number of times
• The SEM indicates how accurate a test taker's obtained score is. An obtained score is more accurate if it is closer to the test taker's "true" score
• The smaller the SEM, the less error and the greater the precision of the test score
• As the reliability of a test increases, the SEM decreases
• A test with a reliability coefficient of 1.00 has a SEM of zero – there is no error

19. STANDARD ERROR of MEASUREMENT (SEM) In a normal distribution it can be expected that
• there is a 68% chance that the true score is between 1 SEM below or above the obtained score
• there is a 95% chance that the true score is between 2 SEMs below or above the obtained score

20. STANDARD ERROR of MEASUREMENT (SEM) Example: obtained score = 70, SEM = 4 (SEMs are expressed in the same units as test scores)
• there is a 68% chance that the test taker's true score is between 66 and 74 points (70 minus or plus 4 [-/+ 1 SEM])
• we can be 95% certain that his true score is between 62 and 78 points (70 minus or plus 8 [-/+ 2 SEMs])
If SEM = 2
• there is a 68% chance that his true score is between 68 and 72 points (70 minus or plus 2 [-/+ 1 SEM])
[Number line from 62 to 78 marking the obtained score of 70, the 68% band at -/+ 1 SEM, and the 95% band at -/+ 2 SEMs]
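
The bands above, computed in code; the last lines also apply the standard formula SEM = SD * sqrt(1 - reliability), with illustrative SD and alpha values (the slide states the SEM directly rather than deriving it):

```python
# Confidence bands around an obtained score, in score units.
import math

obtained, sem = 70, 4
print(obtained - sem, obtained + sem)          # 66 74: 68% band (+/- 1 SEM)
print(obtained - 2 * sem, obtained + 2 * sem)  # 62 78: 95% band (+/- 2 SEMs)

sd, alpha = 12.6, 0.90                         # illustrative values
print(round(sd * math.sqrt(1 - alpha), 1))     # SEM ~ 4.0
```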

21. STANDARD ERROR of MEASUREMENT (SEM) The SEM not only indicates how accurate the test is, but can also be used to adjust your cut score (pass point) based on that accuracy. Another example:
100-item test (max. obtainable score: 100); pass point: 70 (70%); reliability (alpha): 0.69; SEM: 3
• Due to the comparatively low reliability, you can be less confident that the pass score truly represents the pass/fail point.
• There is a fair chance that a test taker with an obtained score of 69 might have a "true" score of 70 or 71
• Potentially this leads to a higher number of false negatives (Masters who fail)
• Dropping the pass point by 1 SEM would change the passing score to 67 (67%).
• This will diminish the number of false negatives, but increase the number of false positives.
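
A small sketch of this adjustment using the slide's numbers; the obtained scores in the list are illustrative:

```python
# Lowering the pass point by 1 SEM, as on the slide: fewer false
# negatives, at the price of more false positives.
cut, sem = 70, 3
adjusted_cut = cut - 1 * sem                # 70 - 3 = 67

scores = [66, 67, 69, 70, 74]               # illustrative obtained scores
print([s >= cut for s in scores])           # strict cut: 69 fails
print([s >= adjusted_cut for s in scores])  # adjusted cut: 67 and 69 pass
```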
