EPI-820 Evidence-Based Medicine (EBM) LECTURE 2: MEDICAL MEASUREMENT Mat Reeves BVSc, PhD Department of Epidemiology Michigan State University
Objectives: • 1. Understand biological and measurement variation and its effects on precision and validity. • 2. Understand the components of variability • biological and measurement • between- and within-person/observer • 3. Understand measures of variation and measures of agreement. • 4. Understand the calculation and application of kappa (K). • 5. Understand the consequences of variability in clinical data and possible remedies to ameliorate them. • 6. Understand regression to the mean.
I. Variation in Clinical Data • 1. Biologic Variation = variation in the actual entity being measured • derives from the dynamic nature of physiology, homeostasis and pathophysiology. • within (intra-person) biologic variability and, • between (inter-person) biologic variability
Within (day-to-day variation) and Between Person Biological Variation: Coefficient of Variation (%) (see Winkel et al, 1974)
Variable        CV (Within)    CV (Between)
Na              0.7%           0.8%
K               4.3%           4.3%
Cl              2.1%           1.2%
Ca              1.7%           2.8%
BUN             12.3%          16.4%
Creatinine      4.3%           9.5%
Cholesterol     5.3%           13.6%
SGOT (AST)      24.2%          24.8%
TP              2.9%           5.7%
I. Variation in Clinical Data • 2. Measurement Variation = variation due to the measurement process • inaccuracy of the instrument (instrument error), and/or, • inaccuracy of the person (operator error) • can introduce both random error and bias
Analytical Variation - Coefficient of Variation (%) of Duplicate Samples
Variable        CV (Analytical)
Na              1.1%
K               2.6%
Cl              2.1%
Ca              2.1%
BUN             2.2%
Creatinine      3.4%
Cholesterol     3.1%
SGOT (AST)      7.3%
TP              1.7%
Validity • Degree to which a measurement process measures what is intended i.e., accuracy. • Lack of systematic error or bias. • A valid instrument will, on average, be close to the underlying true value. • Assessment of validity requires a “gold standard” (a reference).
What if no gold standard? (e.g., pain, nausea or anxiety) • Use instrument or clinical scale to measure a specific phenomenon or construct. • Criterion Validity - the degree to which the scale predicts a directly observable phenomenon e.g. APGAR score and neonatal survival. • Content Validity - the extent to which the instrument includes all of the dimensions of the construct being measured e.g. does APGAR include all relevant patho-physiological parameters? • Construct Validity - the degree to which the scale correlates with other known measures of the phenomenon e.g. how well does a new “Neonatal assessment scale” correlate with APGAR score?
How do you measure validity? • Dichotomous data • sensitivity, specificity, and predictive values. • Continuous data • mean and standard deviation of the difference between surrogate measure and gold standard (see Bland and Altman, 1986).
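For continuous data, the mean and standard deviation of the differences can be sketched as below. The paired readings are made-up illustrative values and the function name `bland_altman` is hypothetical; the 1.96 x SD "limits of agreement" follow the Bland and Altman approach and are not stated on the slide.

```python
from statistics import mean, stdev

def bland_altman(surrogate, gold):
    """Return (bias, sd_diff, lower_limit, upper_limit) for paired readings."""
    diffs = [s - g for s, g in zip(surrogate, gold)]
    bias = mean(diffs)      # average difference: systematic error (validity)
    sd_diff = stdev(diffs)  # sample SD of the differences: random error
    return bias, sd_diff, bias - 1.96 * sd_diff, bias + 1.96 * sd_diff

# Hypothetical paired readings on the same five patients
gold = [120, 135, 150, 110, 142]
surrogate = [124, 137, 155, 113, 147]

bias, sd_diff, lower, upper = bland_altman(surrogate, gold)
print(f"bias={bias:.2f}, SD of differences={sd_diff:.2f}")
print(f"limits of agreement: {lower:.2f} to {upper:.2f}")
```

A bias far from zero suggests systematic error in the surrogate measure, while wide limits suggest poor precision.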
Precision (or reliability or reproducibility) • the extent that repeated measurements of a phenomenon tend to yield the same results (regardless of their accuracy!). • Precision refers to the lack of random error • Precision ~ 1 / random error
Hard versus Soft Data? • Blood chloride level • Left ventricular ejection volume • Migraine severity • 28-d stroke case-fatality rate • Indirect costs of school absenteeism • Direct costs of school absenteeism • Degree of depression • Alzheimer severity • Self-reported ability to do domestic chores • Self-reported ability to climb stairs • Patient preferences for induced labour • Self-reported assessment of health
Hard versus Soft Data • No specific criteria to define “hard” data, attributes include: • Consistency: the ability to preserve basic evidence (repeated observations are consistent) (most important attribute). • Objectivity: observations are free of subjective influences. • Quantifiable: the ability to express the result as a number.
Hard versus Soft Data • Usually hard data are numeric measures, such as lab data, but not always (e.g., histology, cancer stage) • Hard (numeric) data preferred to softer (qualitative) measures because they are more objective and reliable? (but see Feinstein AR et al, 1985, Will Rogers phenomenon)
Between and Within Person Variation • Four categories of clinical variability: • 1. Between-person biological variability • 2. Within-person biological variability • 3. Between-observer measurement variability • 4. Within-observer measurement variability
ANOVA Model Conceptualization • $y_{ijkl} = \mu_i + \beta_{ij} + \gamma_{ik} + \varepsilon_{il}$ • where: • $y_{ijkl}$ = the observed measurement for individual i, measured at time j, by the kth observer at the lth replication. • $\mu_i$ = the individual's usual true mean (between-person biological variation). • $\beta_{ij}$ = perturbation due to biological variation at time j (within-person biological variation). • $\gamma_{ik}$ = perturbation due to measurement error by the kth observer (between-observer measurement variation). • $\varepsilon_{il}$ = perturbation due to measurement error at the lth replication (within-observer measurement variation).
II. Statistical aspects of variability • A. Measures of Variation • 1. Variance and Standard Deviation • Variance = average squared difference of individual values from the overall mean; SD = square root of the variance. • Normal distribution: about 68%, 95% and 99.7% of values lie within ±1, ±2 and ±3 SD of the mean. • Example: • Average US cholesterol = 220 mg/dl, SD = 15 mg/dl • About 95% of individual readings are expected to fall within 190-250 mg/dl (mean ± 2 SD)
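The cholesterol example can be checked with a short sketch. The function names and the three sample readings are illustrative assumptions; the mean of 220 mg/dl and SD of 15 mg/dl come from the slide, and the range shown is the mean ± 2 SD.

```python
from math import sqrt

def standard_deviation(values):
    """Population SD: square root of the mean squared deviation from the mean."""
    m = sum(values) / len(values)
    return sqrt(sum((v - m) ** 2 for v in values) / len(values))

def sd_range(mean, sd, k=2):
    """Interval mean ± k SD (k=2 covers about 95% of a normal distribution)."""
    return mean - k * sd, mean + k * sd

# Slide example: mean 220 mg/dl, SD 15 mg/dl
low, high = sd_range(220, 15)
print(low, high)  # 190 250

# Hypothetical readings, only to show the SD calculation itself
print(round(standard_deviation([205, 220, 235]), 2))
```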
A. Measures of Variation • 2. Coefficient of Variation (CV) • represents the % variation of a set of measurements around their mean • CV = (SD / mean) x 100% • conceptualized as a “noise-to-signal ratio” • useful index for comparing the precision of different instruments, individuals and/or laboratories.
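A minimal sketch of the CV as SD divided by the mean, expressed as a percentage. The replicate sodium readings are hypothetical and the function name is an assumption; the sample SD (n - 1 denominator) is used here, which is one common choice.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """CV (%) = 100 * sample SD / mean."""
    return 100 * stdev(values) / mean(values)

# Hypothetical replicate sodium readings (mmol/l) on one sample
sodium = [140, 142, 138, 141, 139]
cv = coefficient_of_variation(sodium)
print(f"CV = {cv:.2f}%")
```

Because the CV is unitless, it can be compared across analytes measured on different scales, which is how the tables of within-person, between-person and analytical CVs are read.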
B. Measures of Agreement • 1. Correlation (r) • Pearson product moment correlation and Spearman’s rank correlation • measures the degree of linear relationship between two variables (-1, +1) • correlation between two sets of continuous measurements (= reliability) or extent of replication
1. Correlation (Cont’d) • Two observers, same time period = inter-rater reliability. • Single observer, two time periods = intra-rater reliability (test-retest reliability). • Can have very high values of r, but little direct agreement between raters or instruments. • Can only be used as a test of validity if the actual true values are known.
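The point that r can be very high with little direct agreement can be illustrated with a sketch in which one rater reads consistently 10 units higher than the other. The readings and the function name `pearson_r` are hypothetical.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product moment correlation between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical systolic pressures: rater B always reads 10 mmHg higher
rater_a = [110, 120, 130, 140, 150]
rater_b = [a + 10 for a in rater_a]

r = pearson_r(rater_a, rater_b)
exact_agreements = sum(a == b for a, b in zip(rater_a, rater_b))
print(f"r = {r:.2f}, exact agreements = {exact_agreements}")
```

The correlation is perfect because the relationship is perfectly linear, yet the two raters never record the same value.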
B. Measures of Agreement • 2. Intra-class Correlation Coefficient (R or reliability) • a measure of reliability for continuous or quantitative data • an observed value (X) consists of two parts: • X = T + e • where: • T = the “True” unknown level or “error-free” score or “steady state” or “signal” • e = error (whether “biologic” or “measurement” error) • the true error-free value varies about some unknown mean ($\mu$) with a variance of $\sigma^2_T$.
2. R (Cont’d) • the error term is regarded as iid (mean = 0, variance = $\sigma^2_e$). • Variance of X: $\sigma^2_X = \sigma^2_T + \sigma^2_e$ • the relative size of the error variance ($\sigma^2_e$) in relation to the variance of the true value ($\sigma^2_T$) is a measure of the imprecision. • $R = \frac{\sigma^2_T}{\sigma^2_T + \sigma^2_e}$ • R = the proportion of the total variance due to subject-to-subject (or between-person) variability in the “true” value. • As random error decreases, the value of R increases
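The ratio of true variance to total variance can be sketched as below. The variance values are hypothetical and the function name `reliability` is an assumption; in practice the variance components would be estimated from repeated measurements.

```python
def reliability(var_true, var_error):
    """R = true (between-person) variance / total variance."""
    return var_true / (var_true + var_error)

# Hypothetical variance components
r_low_error = reliability(var_true=100.0, var_error=25.0)
r_high_error = reliability(var_true=100.0, var_error=100.0)
print(r_low_error, r_high_error)  # 0.8 0.5
```

Holding the true variance fixed, a smaller error variance gives a larger R, which matches the last bullet above.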
2. Categorical data – Kappa (K) • A measure of reliability for categorical or qualitative data. • Kappa corrects for the degree of chance in the overall level of agreement, and is preferred over other measures (like overall percent agreement). • K = (Po - Pe) / (1 - Pe) = (actual agreement beyond chance) / (potential agreement beyond chance) • Po = the total proportion of observations on which there is agreement • Pe = the proportion of agreement expected by chance alone.
Agreement matrix for kappa statistic (inter-rater agreement, 2 observers, dichotomous data)
                   OBSERVER B
OBSERVER A      Yes     No      TOTALS
Yes             a       b       f1
No              c       d       f2
TOTALS          n1      n2      N
Agreement matrix for kappa statistic (2 observers, dichotomous data)
                   OBSERVER B
OBSERVER A      Yes     No      TOTALS
Yes             69      15      84
No              18      48      66
TOTALS          87      63      150
K (Cont’d) • Observed agreement (Po) = 78% • (69 + 48)/150 = 0.78 or 78%. • Agreement expected due to chance (Pe) = 51%. • Calculated from the products of the marginal totals for cells a and d [(87 x 84)/150 = 48.72 and (63 x 66)/150 = 27.72] • Then divide the sum [76.44] by 150 to get Pe = 0.51 or 51%.
K (Cont’d) • K = (Po - Pe) / (1 - Pe) = (0.78 - 0.51) / (1 - 0.51) = 0.27 / 0.49 = 0.55 or 55% • Kappa varies from -1 to +1, with a value of zero denoting agreement no better than chance (negative values denote agreement worse than chance!)
Value of K      Strength of agreement
<0              Poor
0 - 0.20        Slight
0.21 - 0.40     Fair
0.41 - 0.60     Moderate
0.61 - 0.80     Substantial
0.81 - 1.0      Almost perfect
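The kappa calculation for the 2 x 2 agreement matrix can be sketched as follows. The function name `cohen_kappa` is an assumption; the cell counts (69, 15, 18, 48) are the ones in the agreement matrix.

```python
def cohen_kappa(a, b, c, d):
    """Return (Po, Pe, kappa) for a 2 x 2 agreement table.

    a = both Yes, b = A Yes / B No, c = A No / B Yes, d = both No.
    """
    n = a + b + c + d
    po = (a + d) / n                                         # observed agreement
    pe = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2    # chance agreement
    return po, pe, (po - pe) / (1 - pe)

po, pe, kappa = cohen_kappa(69, 15, 18, 48)
print(f"Po={po:.2f}, Pe={pe:.2f}, kappa={kappa:.2f}")  # Po=0.78, Pe=0.51, kappa=0.55
```

A kappa of 0.55 falls in the 0.41 - 0.60 band, that is, moderate agreement.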
K - Issue of Prevalence • The prevalence of the condition affects the likelihood that observers will agree purely by chance, hence the importance of using kappa. • Example: • Observer A classified 120/150 patients as positive • Observer B classified 130/150 patients as positive • Pe is now 72% [(120 x 130 + 30 x 20)/150² = 0.72].
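The effect of prevalence on chance agreement can be sketched from the marginal totals alone. The function name is an assumption, and the final kappa value assumes, purely for illustration, that observed agreement stayed at 78%; the slide does not give Po for this scenario.

```python
def expected_agreement(yes_a, yes_b, n):
    """Chance agreement Pe from each observer's number of 'Yes' calls."""
    no_a, no_b = n - yes_a, n - yes_b
    return (yes_a * yes_b + no_a * no_b) / n ** 2

pe_original = expected_agreement(84, 87, 150)    # marginals from the 2 x 2 table
pe_skewed = expected_agreement(120, 130, 150)    # mostly 'Yes' calls
print(f"Pe original={pe_original:.2f}, Pe skewed={pe_skewed:.2f}")

# Hypothetical: same observed agreement (0.78) under the skewed marginals
po = 0.78
kappa_skewed = (po - pe_skewed) / (1 - pe_skewed)
print(f"kappa with Po=0.78: {kappa_skewed:.2f}")
```

With the same percent agreement, kappa drops sharply once chance agreement rises, which is why percent agreement alone can mislead.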
K - More Complicated Scenarios • Overall (summary) kappa: • several observers or raters and/or where the subjects are classified into several different categories. • Weighted kappa: • measuring the relative degree of disagreement when subjects are classified into several ordinal categories (e.g., normal, slightly abnormal and very abnormal). • Maclure and Willett (1987): • Use kappa for dichotomous data or nominal polytomous data only. • For ordinal data use either Spearman’s rank correlation or R.
IV. Consequences of variability of clinical data • A. Clinical impact • Errors in diagnosis, prognosis and even treatment. • Clinical disagreement between clinicians. • B. Research Impact • Between-person biological variability is a prerequisite for etiologic studies. • Random within-person variability (a form of unreliability) results in non-differential misclassification - with a resulting dilution or attenuation of effect.
B. Research impact • Generally, imprecision has less impact in the research setting than in the individual clinical setting because one can average over a large number of observations (but the measure must still be valid). • Variability and misclassification result in the need for larger sample sizes (and increased costs). • Measurement errors can introduce bias if they do not occur at random - differential misclassification
Regression Dilution Bias • Example: MacMahon et al., (1990) • imprecision resulting from a single measurement of diastolic blood pressure resulted in a 60% attenuation of RR’s (for the effect of elevated blood pressure on stroke and MI). • “regression dilution bias”.
C. Regression towards the mean • Group of individuals selected based on the results of an “abnormal” test can be divided into: • a) those with a true underlying abnormal value, and • b) those with a true underlying normal value (but random fluctuations resulted in an outlying [abnormal] value). • On retesting, patients in group b are closer to their typical (normal) values, so, the overall mean is less extreme (= regression to the mean). • Occurs when repeated observations are performed on a variable that is inherently variable.
C. RTTM • Often interpreted as a sign of clinical improvement, regardless of effectiveness of treatment (an important explanation for the placebo effect) • If the first reading is d units higher than the true value ($\mu$), then on average, the next value will be closer to the mean by d(1 - r) units, • where r is the correlation between the two measurements • RTTM increases if d is large and r is small. • RTTM is a general tendency for describing the average behaviour of a group, not necessarily individuals!!
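The d(1 - r) rule can be sketched numerically. The function name and the example values of d and r are hypothetical.

```python
def expected_regression(d, r):
    """Average amount by which the next reading moves back towards the mean."""
    return d * (1 - r)

# Hypothetical: first reading 20 units above the mean
shift_reliable = expected_regression(d=20, r=0.9)    # highly repeatable test
shift_unreliable = expected_regression(d=20, r=0.5)  # poorly repeatable test
print(round(shift_reliable, 1), round(shift_unreliable, 1))  # 2.0 10.0
```

The same extreme first reading regresses five times as far when the test-retest correlation falls from 0.9 to 0.5, without any treatment effect at all.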
V. Remedies for variability of clinical data • A. Within-person biologic variation • Standardized measurements: use a standard protocol i.e., time of day, body position etc. • Average repeated tests e.g., take several blood pressure readings. • Use a less variable test e.g., for diabetes use glycosylated Hb, rather than blood glucose. • Plot the data - what is the trend? • Develop reference values for each individual - especially if: • within-person variability <<< between-person variability • this results in a wide reference range which makes it difficult to identify individual deviations • e.g., body weight, PSA, EKG
B. Measurement Error • Measurement imprecision corrected by adjusting the machine or re-training the tester, (or, average several values?). • Measurement error that causes bias requires quality assurance testing. Fix by re-calibration (don’t average!!).
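Why averaging helps with imprecision but not with bias can be shown with a small sketch. It assumes independent readings with the same bias and random-error SD; the numbers and the function name are hypothetical.

```python
from math import sqrt

def error_after_averaging(bias, sd, n):
    """Expected bias and random-error SD of the mean of n independent readings."""
    # Random error shrinks by sqrt(n); systematic error is unchanged.
    return bias, sd / sqrt(n)

bias_1, sd_1 = error_after_averaging(bias=5.0, sd=8.0, n=1)
bias_4, sd_4 = error_after_averaging(bias=5.0, sd=8.0, n=4)
print(bias_1, sd_1)  # 5.0 8.0
print(bias_4, sd_4)  # 5.0 4.0
```

Four readings halve the random error, but the 5-unit bias is still there, so a biased instrument needs re-calibration rather than more readings.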
Sackett - Six strategies for preventing or minimizing clinical disagreements • 1. Match diagnostic environment to the diagnostic task. • 2. Corroborate key findings by: • repeating observations and questions • confirming information with other sources (e.g., family members) • confirming key findings using appropriate diagnostic tests • seeking confirmation from “blinded” colleagues • 3. Report actual findings, then report inferences. • 4. Use appropriate technical aids to avoid imprecision (e.g., ruler). • 5. Use “blinded” assessments of diagnostic findings. • 6. Apply skills of social sciences (establish understanding, follow a logical order, listen, observe, interrupt only where necessary).