Understanding P-values and Confidence Intervals • Thomas B. Newman, MD, MPH • \Clinepi 2004\Understanding P- values and CI 10Nov04
Overview • Introduction and justification • What P-values and Confidence Intervals don’t mean • What they do mean: analogy between diagnostic tests and clinical research • Useful confidence interval tips • CI for “negative” studies; absolute vs relative risk • Confidence intervals for small numerators
Why cover this material here? • P-values and confidence intervals are ubiquitous in clinical research • Widely misunderstood and mistaught • Pedagogical argument: • Is it important? • Can you handle it?
Example: Douglas Altman Definition of 95% Confidence Intervals* • "A strictly correct definition of a 95% CI is, somewhat opaquely, that 95% of such intervals will contain the true population value. [Slide annotation: Hard to understand] • Little is lost by the less pure interpretation of the CI as the range of values within which we can be 95% sure that the population value lies." [Slide annotation: Wrong!]
Understanding P-values and confidence intervals is important because • It explains things that otherwise seem paradoxical, e.g. the need to state hypotheses in advance and to correct for multiple hypothesis testing • You will be using them all the time • You are future leaders in clinical research
You can handle it because • We have already covered the important concepts at length earlier in this course • Prior probability • Posterior probability • What you thought before + new information = what you think now • We will support you through the process
Review of traditional statistical significance testing • State null (H0) and alternative (Ha) hypotheses • Choose α • Calculate value of test statistic from your study • Calculate P-value from test statistic • If P-value < α, reject H0
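The steps above can be sketched in a few lines of Python. This is only an illustration, not part of the original slides: it assumes a Pearson chi-square test without continuity correction as the test statistic, and it borrows the 2x2 counts from the International Reflux Study example later in this deck. The function names are hypothetical.

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for a 2x2 table.

    a, b = cases among exposed / unexposed
    c, d = noncases among exposed / unexposed
    """
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def p_value_1df(chi2):
    """Upper-tail P-value for a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))

alpha = 0.05                      # step 2: choose alpha
stat = chi2_2x2(2, 28, 18, 68)    # step 3: test statistic from the study
p = p_value_1df(stat)             # step 4: P-value from the test statistic
reject_h0 = p < alpha             # step 5: compare with alpha
print(f"chi2(1) = {stat:.2f}, P = {p:.4f}, reject H0: {reject_h0}")
```

The statistic and P-value should agree with the chi2(1) and Pr>chi2 values in the Stata output for the same table.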
Problem: • Traditional statistical significance testing has led to widespread misinterpretation of P-values
What P-values don’t mean • If the P-value is 0.05, that means that there is a 95% probability that… • The results did not occur by chance • The null hypothesis is false • There really is a difference between the groups
Chalk board: • Easy illustration of why non-Bayesian approach is wrong • Analogy with diagnostic tests: 2x2 tables and “false positive confusion” • Extending the analogy to understand a priori vs post hoc hypotheses, multiple hypotheses, etc. • (This is covered step-by-step in the course book.)
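A rough sketch of the diagnostic-test analogy, under assumed numbers: treat "P < alpha" as a positive test, power as sensitivity, and alpha as the false-positive rate. The prior probabilities, the power of 0.8, and the function name below are illustrative assumptions, not values from the slides or the course book.

```python
def prob_true_given_significant(prior, power, alpha):
    """Posterior probability that the alternative is true given P < alpha.

    prior: probability the alternative hypothesis is true before the study
    power: P(significant | alternative true), like test sensitivity
    alpha: P(significant | null true), like 1 - specificity
    """
    true_positives = prior * power
    false_positives = (1 - prior) * alpha
    return true_positives / (true_positives + false_positives)

# The same "P < 0.05" means very different things for different priors.
for prior in (0.5, 0.1, 0.01):
    posterior = prob_true_given_significant(prior, power=0.8, alpha=0.05)
    print(f"prior = {prior:<4}  posterior = {posterior:.2f}")
```

A low prior (for example a post hoc hypothesis, or one of many hypotheses tested) leaves a low posterior even after a "significant" result, which is why a P-value of 0.05 cannot by itself mean a 95% probability that the null hypothesis is false.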
Bonferroni • Inequality: If we do k different tests, each with significance level alpha, the probability that one or more will be significant when all the null hypotheses are true is less than or equal to k*alpha • Correction: If we test k different hypotheses and want our total Type I error rate to be no more than alpha, then we should reject H0 only if P < alpha/k
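A small numerical check of the inequality and the correction. The sketch assumes the k tests are independent so that the exact probability has a closed form (the inequality itself does not need independence); the function names are hypothetical.

```python
def prob_any_significant(k, alpha):
    """P(at least one of k independent tests has P < alpha) when every null is true."""
    return 1 - (1 - alpha) ** k

def bonferroni_threshold(k, alpha):
    """Per-test cutoff that keeps the total Type I error rate at or below alpha."""
    return alpha / k

alpha = 0.05
for k in (1, 5, 10, 20):
    exact = prob_any_significant(k, alpha)
    bound = k * alpha
    cutoff = bonferroni_threshold(k, alpha)
    corrected = prob_any_significant(k, cutoff)
    print(f"k={k:>2}  uncorrected={exact:.3f} (bound {bound:.2f})  "
          f"cutoff={cutoff:.4f}  corrected={corrected:.3f}")
```

With the corrected cutoff, the chance of at least one false-positive result stays at or just under alpha for every k.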
Confidence Intervals for negative studies: 5 levels of sophistication • Example 1: Oral amoxicillin to treat possible occult bacteremia in febrile children* • Randomized, double-blind trial • 3-36 month old children with T > 39°C (N = 955) • Treatment: Amox 125 mg tid (< 10 kg) or 250 mg tid (> 10 kg) • Outcome: major infectious morbidity • *Jaffe et al., New Engl J Med 1987;317:1175-80
Amoxicillin for possible occult bacteremia 2: Results • Overall 27 children (~3%) bacteremic • Of these 27, major infectious morbidity occurred in 3: 2 persistent bacteremia, 1 periorbital cellulitis • 2/19 (10.5%) with amoxicillin vs 1/8 (12.5%) with placebo (P = 0.9) • Conclusion: “Data do not support routine use of standard doses of amoxicillin…”
5 levels of sophistication • Level 1: P > 0.05 = treatment does not work • Level 2: Look at power for study. (Authors reported power = 0.24 for OR = 4. Therefore, study underpowered and negative study uninformative.)
5 levels of sophistication, cont’d • Level 3: Look at 95% CI for RR. RR = 0.84; 95% CI (0.09 to 8.0). (This was the level of the TBN and RHP letter to the editor, 1987. Note authors calculated OR = 1.2 and 95% CI 0.02 to 30.4.) • Level 4: Make sure you do ITT analysis! (Not OK to restrict attention to bacteremic patients!) So it’s 2/507 vs 1/448; RR = 1.8 (amoxicillin worse); 95% CI (0.16 to 19.4)
Level 5: the clinically relevant quantity is the Absolute Risk Reduction (ARR)! • 2/507 (0.4%) with amoxicillin vs 1/448 (0.2%) with placebo • ARR = -0.17% {amoxicillin worse} • 95% CI (-0.9% {harm} to +0.5% {benefit}) • Therefore, the best case for benefit is the upper limit of the 95% CI for ARR (+0.5%), which gives the LOWER limit for NNT: 1/0.5% = 200 • So this study suggests need to treat >= 200 children to prevent major infectious morbidity in one
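The Level 5 arithmetic can be reproduced with a short Python sketch. It assumes a simple Wald interval with an unpooled standard error, which is a choice made here for illustration (it happens to match the risk-difference interval in the Stata output for this table); the function name is hypothetical.

```python
import math
from statistics import NormalDist

def arr_with_ci(events_trt, n_trt, events_ctl, n_ctl, level=0.95):
    """Absolute risk reduction (control risk minus treatment risk) with a Wald CI."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    risk_trt = events_trt / n_trt
    risk_ctl = events_ctl / n_ctl
    arr = risk_ctl - risk_trt
    se = math.sqrt(risk_trt * (1 - risk_trt) / n_trt
                   + risk_ctl * (1 - risk_ctl) / n_ctl)
    return arr, arr - z * se, arr + z * se

# Intention-to-treat counts: 2/507 with amoxicillin vs 1/448 with placebo.
arr, lower, upper = arr_with_ci(2, 507, 1, 448)
best_case_nnt = 1 / upper   # most favorable ARR gives the smallest plausible NNT

print(f"ARR = {arr:.2%}, 95% CI ({lower:.2%} to {upper:.2%})")
print(f"Best-case NNT is about {best_case_nnt:.0f}")
```

Using the unrounded upper limit gives a best-case NNT of about 190; rounding that limit to 0.5% first, as on the slide, gives 200.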
Stata output
. csi 2 1 505 447

                 |   Exposed   Unexposed  |      Total
-----------------+------------------------+-----------
           Cases |         2           1  |          3
        Noncases |       505         447  |        952
-----------------+------------------------+-----------
           Total |       507         448  |        955
                 |                        |
            Risk |  .0039448    .0022321  |   .0031414
                 |                        |
                 |      Point estimate    |    [95% Conf. Interval]
                 |------------------------+------------------------
 Risk difference |         .0017126       |   -.005278    .0087032
      Risk ratio |         1.767258       |   .1607894    19.42418
 Attr. frac. ex. |         .4341518       |  -5.219315    .9485178
 Attr. frac. pop |         .2894345       |
                 +-------------------------------------------------
                               chi2(1) =     0.22  Pr>chi2 = 0.6369
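For readers without Stata, the risk ratio line can be approximated in Python. The sketch assumes the usual large-sample interval built on the log scale (standard error of log RR from the four cell counts); that method is an assumption here, chosen because it reproduces the interval shown, and the function name is hypothetical.

```python
import math
from statistics import NormalDist

def risk_ratio_ci(cases_exp, n_exp, cases_unexp, n_unexp, level=0.95):
    """Risk ratio with a confidence interval computed on the log scale."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    rr = (cases_exp / n_exp) / (cases_unexp / n_unexp)
    se_log_rr = math.sqrt(1 / cases_exp - 1 / n_exp
                          + 1 / cases_unexp - 1 / n_unexp)
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper

# Same table as `csi 2 1 505 447`: 2/507 exposed (amoxicillin), 1/448 unexposed.
rr, lower, upper = risk_ratio_ci(2, 507, 1, 448)
print(f"RR = {rr:.2f}, 95% CI ({lower:.2f} to {upper:.1f})")
```

With only 3 events in total, the interval spans more than two orders of magnitude, which is why the ratio alone says little about this "negative" study.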
Example 2: Pyelonephritis and new renal scarring in the International Reflux Study in Children* • RCT of ureteral reimplantation vs prophylactic antibiotics for children with vesicoureteral reflux • Overall result: surgery group had fewer episodes of pyelonephritis (8% vs 22%; NNT = 7; P < 0.05) but more new scarring (31% vs 22%; P = 0.4) • This raises questions about whether new scarring is caused by pyelonephritis • *Weiss et al. J Urol 1992;148:1667-73
Within groups no association between new pyelo and new scarring • Trend goes in the OPPOSITE direction • RR = 0.34; 95% CI (0.09-1.32) • Weiss, J Urol 1992;148:1672
Stata output to get 95% CI:
. csi 2 28 18 68

                 |   Exposed   Unexposed  |      Total
-----------------+------------------------+-----------
           Cases |         2          28  |         30
        Noncases |        18          68  |         86
-----------------+------------------------+-----------
           Total |        20          96  |        116
                 |                        |
            Risk |        .1    .2916667  |   .2586207
                 |                        |
                 |      Point estimate    |    [95% Conf. Interval]
                 |------------------------+------------------------
 Risk difference |        -.1916667       |  -.3515216   -.0318118
      Risk ratio |         .3428571       |   .0887727     1.32418
 Prev. frac. ex. |         .6571429       |  -.3241804    .9112273
 Prev. frac. pop |         .1133005       |
                 +-------------------------------------------------
                               chi2(1) =     3.17  Pr>chi2 = 0.0749
Conclusions • No evidence that new pyelonephritis causes scarring • Some evidence that it does not • P-values and confidence intervals are approximate, especially for small sample sizes (and subject to manipulation) • Key concept: calculate 95% CI for ARR for negative studies
P-values and Confidence Intervals • Probably won’t cover this, but FYI: • Usually P < 0.05 means 95% CI excludes null value. • But both 95% CI and P-values are based on approximations, so this may not be the case • Illustrated by the IRSC slide above: the 95% CI for the risk difference (-0.35 to -0.03) excludes 0, yet P = 0.07 • If you want 95% CI and P-values to agree, use “test-based” confidence intervals – see next slide
Alternative Stata output: Test-based CI
. csi 2 28 18 68, tb

                 |   Exposed   Unexposed  |      Total
-----------------+------------------------+-----------
           Cases |         2          28  |         30
        Noncases |        18          68  |         86
-----------------+------------------------+-----------
           Total |        20          96  |        116
                 |                        |
            Risk |        .1    .2916667  |   .2586207
                 |                        |
                 |      Point estimate    |    [95% Conf. Interval]
                 |------------------------+------------------------
 Risk difference |        -.1916667       |  -.4035313    .0201979  (tb)
      Risk ratio |         .3428571       |   .1050114    1.119412  (tb)
 Prev. frac. ex. |         .6571429       |  -.1194122    .8949886  (tb)
 Prev. frac. pop |         .1133005       |
                 +-------------------------------------------------
                               chi2(1) =     3.17  Pr>chi2 = 0.0749