TESTING A TEST

Ian McDowell
Department of Epidemiology & Community Medicine
January 2008

The Challenge of Clinical Measurement
• Diagnoses are based on information, from formal measurements and/or from your clinical judgment
• This information is seldom perfectly accurate:
  • Random errors can occur (machine not working?)
  • Biases in judgment or measurement can occur (“this kid doesn’t look sick”)
  • Due to biological variability, this patient may not fit the general rule
• Diagnosis (e.g., hypertension) involves a categorical judgment; this often requires dividing a continuous score (blood pressure) into categories. Choosing the cutting-point is challenging

Therefore…
• You need to be aware…
  • That we express these complexities in terms of probabilities
  • That using a quantitative approach is better than just guessing!
  • That you will gradually become familiar with the typical accuracy of measurements in your chosen clinical field
  • That the principles apply to both diagnostic and screening tests
  • Of some of the ways to describe the accuracy of a measurement

Attributes of Tests or Measures
• Safety, Acceptability, Cost, etc.
• Reliability: consistency or reproducibility; this considers chance or random errors (which sometimes increase, sometimes decrease, scores)
• Validity: “Is it measuring what it is supposed to measure?” By extension, “what diagnostic conclusion can I draw from a particular score on this test?” Validity may be affected by bias, which refers to systematic errors (these fall in a certain direction)

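Reliability and Validity
[Figure: 2 x 2 grid of target diagrams crossing Reliability (Low, High) with Validity (Low, High). High reliability with low validity: biased result! High reliability with high validity: ☺]
• Low reliability with high validity: the average of these inaccurate results is not bad. This is probably how screening questionnaires (e.g., for depression) work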

Ways of Assessing Validity
• Content or “Face” validity: does it make clinical or biological sense? Does it include the relevant symptoms?
• Criterion: comparison to a “gold standard” definitive measure (e.g., biopsy, autopsy)
  • Expressed as sensitivity and specificity
• Construct validity (this is used with abstract themes, such as “quality of life”, for which there is no definitive standard)

Criterion, or “Gold Standard”
The clinical observation or simple test is judged against
• More definitive (but expensive or invasive) tests, such as a complete work-up,
Or against
• Eventual outcome (for screening tests, when workup of well patients is unethical)
Sensitivity and specificity are calculated

2 x 2 Table for Testing a Test

                       Gold standard
                       Disease Present    Disease Absent
Test score:
  Test positive        a (TP)             b (FP)
  Test negative        c (FN)             d (TN)

• Validity:
  • Sensitivity = a/(a+c)
  • Specificity = d/(b+d)
TP = true positive; FP = false positive…
Golden Rule: always calculate based on the gold standard
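The two column-wise formulas can be sketched in a few lines of Python. The cell counts below are hypothetical, chosen only to illustrate the arithmetic; the function names are not from the slides.

```python
def sensitivity(a: int, c: int) -> float:
    """Proportion of gold-standard diseased people who test positive: a/(a+c)."""
    return a / (a + c)


def specificity(b: int, d: int) -> float:
    """Proportion of gold-standard non-diseased people who test negative: d/(b+d)."""
    return d / (b + d)


# Hypothetical 2 x 2 table: a = TP, b = FP, c = FN, d = TN
a, b, c, d = 90, 30, 10, 170

print(f"Sensitivity = {sensitivity(a, c):.0%}")  # 90 / (90 + 10)
print(f"Specificity = {specificity(b, d):.0%}")  # 170 / (30 + 170)
```

Both functions divide by a column total, which is the "Golden Rule" in code form: the denominator always comes from the gold standard, never from the test result.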

A Bit More on Sensitivity
= Test’s ability to detect disease when it is present
a/(a+c) = TP/(TP+FN)
Mnemonics:
• A sensitive person is one who is aware of your feelings
• (1 – seNsitivity) = false Negative rate = how many cases are missed by the screening test?

…and More on Specificity
Ability to detect absence of disease when it is truly absent (can it detect non-disease?)
d/(b+d) = TN/(FP+TN)
• Mnemonics:
  • A specific test would identify only that type of disease. “Nothing else looks like this”
  • (1 – sPecificity) = false Positive rate (How many are falsely classified as having the disease?)
• The FP idea will arise again, so keep it in mind!

Most Tests Provide a Continuous Score. Selecting a Cutting Point
[Figure: overlapping distributions of test scores for a healthy population (healthy scores) and a sick population (pathological scores), with a possible cut-point in the overlap. Moving the cut-point one way increases sensitivity (includes more of the sick group); moving it the other way increases specificity (excludes healthy people).]
Crucial issue: changing cut-point can improve sensitivity or specificity, but never both
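A small sketch of the trade-off, using made-up scores for eight healthy and eight sick people. It assumes that higher scores are more pathological and that a score at or above the cut-point counts as a positive test; neither assumption comes from the slides.

```python
# Hypothetical test scores (higher = more pathological)
healthy = [3, 4, 5, 5, 6, 7, 8, 9]
sick = [6, 8, 9, 10, 11, 12, 13, 14]


def sens_spec(cut, sick_scores, healthy_scores):
    """Return (sensitivity, specificity) when score >= cut is called positive."""
    true_pos = sum(s >= cut for s in sick_scores)
    true_neg = sum(s < cut for s in healthy_scores)
    return true_pos / len(sick_scores), true_neg / len(healthy_scores)


for cut in (6, 8, 10):
    sens, spec = sens_spec(cut, sick, healthy)
    print(f"cut-point {cut:>2}: sensitivity {sens:.3f}, specificity {spec:.3f}")
```

With these numbers, raising the cut-point from 6 to 10 lifts specificity from 0.5 to 1.0 while sensitivity falls from 1.0 to 0.625: one improves only at the expense of the other.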

Clinical applications
[2 x 2 table reminder: columns D+, D-; row T+ holds a, b; row T- holds c, d]
• A specific test can be useful to rule in a disease. Why?
  • Very specific tests give few false positives. So, if the result is positive, you can be sure the patient has the condition (‘nothing else would give this result’): “SpPin”
• A sensitive test can be useful for ruling a disease out:
  • A negative result on a very sensitive test (which detects all true cases) reassures you that the patient does not have the disease: “SnNout”

Problems with Wrong Results
• False Positives can arise due to other factors (such as taking other medications, diet, etc.) They entail cost and danger of investigations, labeling, worry
  • This is similar to Type I or alpha error in a test of statistical significance: the possibility of falsely concluding that there is an effect of an intervention.
• False Negatives imply missed cases, so potentially bad outcomes if untreated
  • cf. Type II or beta error: the chance of missing a true difference

Practical Question: “Doctor, how likely am I to have this disease?” = Predictive Values
• Sensitivity & specificity don’t tell you this, because they work from the gold standard.
• Now you need to work from the test result, but you won’t know whether this person is a true positive or a false positive (or a true or false negative). Hmmm…
How accurately will a positive (or negative) result predict disease (or health)?

Positive and Negative Predictive Values
[2 x 2 table reminder: columns D+, D-; row T+ holds a, b; row T- holds c, d]
• Based on rows, not columns
• Positive Predictive Value (PPV) = a/(a+b) = Probability that a positive score is a true positive
• NPV = d/(c+d); same for a negative test result
• BUT… there’s a big catch:
  • We are now working across the columns, so PPV & NPV depend critically on how many cases of disease there are (prevalence).
  • As prevalence goes down, PPV goes down (it’s harder to find the smaller number of cases) and NPV rises.
• So, PPV and NPV must be determined for each clinical setting,
• But this is then immediately useful to the clinician: it reflects this population, so it tells us about this patient

Prevalence and Predictive Values

A. Specialist referral hospital
          D+      D-
  T+      50      10
  T-       5     100
Sensitivity = 50/55 = 91%; Specificity = 100/110 = 91%; Prevalence = 55/165 = 33%
PPV = 50/60 = 83%; NPV = 100/105 = 95%

B. Primary care
          D+      D-
  T+      50     100
  T-       5    1000
Sensitivity = 50/55 = 91%; Specificity = 1000/1100 = 91%; Prevalence = 55/1155 = 5%
PPV = 50/150 = 33%; NPV = 1000/1005 = 99.5%
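The two settings can be recomputed directly from their cell counts. This sketch uses the counts from the slide; only the function and variable names are invented for illustration.

```python
def predictive_values(a, b, c, d):
    """Row-wise measures from a 2 x 2 table (a = TP, b = FP, c = FN, d = TN)."""
    ppv = a / (a + b)                       # positive predictive value
    npv = d / (c + d)                       # negative predictive value
    prevalence = (a + c) / (a + b + c + d)  # share of everyone who has the disease
    return ppv, npv, prevalence


settings = {
    "A. Specialist referral hospital": (50, 10, 5, 100),
    "B. Primary care": (50, 100, 5, 1000),
}

for name, cells in settings.items():
    ppv, npv, prev = predictive_values(*cells)
    print(f"{name}: prevalence {prev:.1%}, PPV {ppv:.1%}, NPV {npv:.1%}")
```

Sensitivity and specificity are the same in both settings (about 91%), so the drop in PPV from roughly 83% to 33% comes entirely from the lower prevalence.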

Imagine you know Sensitivity & Specificity. To work out PPV and NPV you need to guess prevalence, then work backwards. Fill cells in the following order:

                    “Truth”
              Disease Present    Disease Absent    Total    PV
  Test Pos    4th                7th               8th      10th
  Test Neg    5th                6th               9th      11th
  Total       2nd                3rd               1st

The totals come from the estimated prevalence; the cells in the body of the table come from sensitivity and from specificity.
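The working-backwards procedure might look like this in Python. The fill order in the comments is one plausible reading of the slide, and the total of 1000 people and the example inputs are arbitrary assumptions.

```python
def fill_table(sens, spec, prevalence, n=1000):
    """Rebuild a 2 x 2 table from sensitivity, specificity and a guessed prevalence."""
    diseased = prevalence * n   # column total, from estimated prevalence
    healthy = n - diseased      # the other column total
    a = sens * diseased         # true positives, from sensitivity
    c = diseased - a            # false negatives, by subtraction
    d = spec * healthy          # true negatives, from specificity
    b = healthy - d             # false positives, by subtraction
    ppv = a / (a + b)           # row-wise: positives that are true
    npv = d / (c + d)           # row-wise: negatives that are true
    return ppv, npv


# Hypothetical test: 90% sensitive, 90% specific, guessed prevalence 5%
ppv, npv = fill_table(sens=0.90, spec=0.90, prevalence=0.05)
print(f"PPV = {ppv:.1%}, NPV = {npv:.1%}")
```

The choice of n does not affect the result, because PPV and NPV are ratios; a round number simply keeps the intermediate cells easy to read.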

Gasp…! Isn’t there an easier way to do all this…?
Yes (good!) But first, you need a couple more concepts (less good…)
• Before you apply a diagnostic test, prevalence gives your best guess about the chances that this patient has the disease.
• This is known as “Pretest Probability of Disease”: (a+c) / N in the 2 x 2 table (cells a, b, c, d; total N)
• It can also be expressed as odds of disease: (a+c) / (b+d). As long as the disease is rare, the odds are close to the probability

Test scores are continuous scales. You can use this to combine sensitivity and specificity: Meet Receiver Operating Characteristic Curves
Work out Sen and Spec for every possible cut-point, then plot these. Area under the curve indicates the information provided by the test
[Figure: ROC curve. Vertical axis: Sensitivity (0 to 1). Horizontal axis: 1-Specificity ( = false positives) (0 to 1).]
Note: the theme of sensitivity & (1-specificity) will appear again!
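A rough sketch of how the curve and its area could be computed by hand. The scores are hypothetical, a score at or above the cut-point is assumed to be a positive test, and the area uses the trapezoidal rule, which is one common choice rather than anything specified on the slide.

```python
# Hypothetical test scores (higher = more pathological)
healthy = [3, 4, 5, 5, 6, 7, 8, 9]
sick = [6, 8, 9, 10, 11, 12, 13, 14]


def roc_points(sick_scores, healthy_scores):
    """(1 - specificity, sensitivity) at every possible cut-point."""
    scores = sorted(set(sick_scores) | set(healthy_scores))
    cuts = scores + [scores[-1] + 1]  # final cut-point calls everyone negative
    points = []
    for cut in cuts:
        sens = sum(s >= cut for s in sick_scores) / len(sick_scores)
        false_pos_rate = sum(s >= cut for s in healthy_scores) / len(healthy_scores)
        points.append((false_pos_rate, sens))
    return sorted(points)  # left to right along the horizontal axis


def area_under_curve(points):
    """Trapezoidal area under the ROC points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


points = roc_points(sick, healthy)
print(f"AUC = {area_under_curve(points):.3f}")
```

An area of 0.5 would mean the test carries no information (the diagonal), while 1.0 would mean some cut-point separates the two groups perfectly.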

This Leads to … Likelihood Ratios
• Defined as the odds that a given level of a diagnostic test result would be expected in a patient with the disease, as opposed to a patient without: true positive rate / false positive rate. [TP rate / FP rate]
• Advantages:
  • Combines sensitivity and specificity into one number
  • Can be calculated for many levels of the test
  • Can be turned into predictive values
• LR for positive test = Sensitivity / (1-Specificity)
• LR for negative test = (1-Sensitivity) / Specificity
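The two formulas translate directly into code. The sensitivity and specificity below are hypothetical values picked for illustration.

```python
def likelihood_ratios(sens, spec):
    """Return (LR+, LR-) from sensitivity and specificity."""
    lr_pos = sens / (1 - spec)   # true positive rate / false positive rate
    lr_neg = (1 - sens) / spec   # false negative rate / true negative rate
    return lr_pos, lr_neg


# Hypothetical test: 90% sensitive, 95% specific
lr_pos, lr_neg = likelihood_ratios(sens=0.90, spec=0.95)
print(f"LR+ = {lr_pos:.1f}, LR- = {lr_neg:.2f}")
```

An LR+ well above 1 means a positive result makes disease more likely; an LR- close to 0 means a negative result makes it much less likely.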

Practical application: a Nomogram
1) You need the LR for this test
2) Plot the likelihood ratio on center axis (e.g., LR+ = 20)
3) Select pretest probability (prevalence) on left axis (e.g. Prevalence = 30%)
4) Draw line through these points to right axis to indicate post-test probability of disease
Example: Post-test probability ≈ 90%
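The nomogram is a graphical shortcut for a short piece of arithmetic: convert the pretest probability to odds, multiply by the LR, and convert back. A minimal sketch using the slide's example inputs (the function name is invented for illustration):

```python
def post_test_probability(pretest_prob, lr):
    """Pretest probability -> pretest odds -> post-test odds -> post-test probability."""
    pretest_odds = pretest_prob / (1 - pretest_prob)
    post_test_odds = pretest_odds * lr
    return post_test_odds / (1 + post_test_odds)


# Slide example: prevalence 30%, LR+ = 20
print(f"Post-test probability = {post_test_probability(0.30, 20):.1%}")
```

A value read off the printed nomogram can differ by a percentage point or so from the calculated one, since it depends on how precisely the line is drawn.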

Chaining LRs Together (1)
• Example: 45 year-old woman presents with “chest pain”
• Based on her age, pretest probability that a vague chest pain indicates CAD is about 1%
• Take a fuller history. She reports a 1-month history of intermittent chest pain, suggesting angina (substernal pain; radiating down arm; induced by effort; relieved by rest…)
• LR of this history for angina is about 100

The previous example: 1. From the History
[Nomogram: pretest probability, LR 100, post-test probability]
• She’s young; pretest probability about 1%
• LR 100
• Post-test probability rises to 50% based on history

Chaining LRs Together (2)
45 year-old woman with 1-month history of intermittent chest pain…
After the history, post-test probability is now about 50%. What will you do? Something more precise (but also more costly):
• Record an ECG
• Results = 2.2 mm ST-segment depression. LR for ECG 2.2 mm result = 10.
• Overall post-test probability is now >90% for coronary artery disease (see next slide)

The previous example: 2. ECG Results
[Nomogram: pretest probability, LR 10, post-test probability]
• Now start at the new pretest probability (i.e. 50%, prior to ECG, based on history)
• Post-test probability now rises to 90%
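The chest-pain example can be traced numerically by carrying the odds forward from one finding to the next. This sketch uses the figures quoted in the example (1% pretest, LR 100 for the history, LR 10 for the ECG) and assumes the two findings are independent given disease status, which simple chaining requires.

```python
def apply_lr(prob, lr):
    """Update a probability with one likelihood ratio, working through odds."""
    odds = prob / (1 - prob) * lr
    return odds / (1 + odds)


probability = 0.01  # pretest probability from age and vague chest pain
findings = [("history suggesting angina", 100), ("ECG, 2.2 mm ST depression", 10)]

for label, lr in findings:
    # The post-test probability of one step is the pretest probability of the next
    probability = apply_lr(probability, lr)
    print(f"After {label} (LR {lr}): {probability:.1%}")
```

Each post-test probability becomes the pretest probability for the next test, which is why the history result (about 50%) is the starting point on the ECG nomogram.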
