Use and abuse of P values

Clinical Research Methodology Course • Randomized Clinical Trials and the “REAL WORLD” • NY, 14 December 2007 • Emmanuel Lesaffre • Biostatistical Centre, K.U.Leuven, Leuven, Belgium • Dept of Biostatistics, Erasmus MC, Rotterdam, the Netherlands

Contents • P-value: What is it? • Type I error • Multiple testing • Type II error • Sample size calculation • Negative studies • Testing at baseline • Statistical significance versus clinical relevance • Confidence interval versus P-value • P-value of clinical trial versus of epidemiological study • Take home messages

1. P-value: What is it?

1. P-value: What is it? • Etoricoxib versus Placebo • WOMAC Pain Subscale: difference in means = -15.07 • What does this result mean? • What do you expect if etoricoxib = placebo? Difference ≈ 0 • But even if etoricoxib = placebo, the result will vary around 0 • What is a large/small difference? • What is the play of chance? • The same questions apply to the other scores & comparisons

1. P-value: What is it? • Etoricoxib versus Placebo • Suppose H0: E=P • P=0.05 ⇒ result belongs to the 5% extreme results that could happen under H0 (if H0 is true) • P=0.01 ⇒ result belongs to the 5% extreme results that could happen under H0 (if H0 is true) and only 1% of results are MORE EXTREME • P<0.0001 ⇒ result belongs to the 5% extreme results that could happen under H0 (if H0 is true) and IS VERY EXTREME

1. P-value: What is it? • GENERAL RULE • When P < 0.05 (= significance level α): • Result is considered to be TOO EXTREME to believe that H0 is true • H0 is rejected ⇒ we do NOT believe that E=P • Significant at 0.05 (*, **, ***) • When P ≥ 0.05: • Result could have happened when H0 is true • H0 is NOT rejected ⇒ it is possible that E=P • Result is ≠ 0, but we believe that this is due to PLAY OF CHANCE • NOT significant at 0.05 (NS)

1. P-value: What is it? • Results: E versus C versus P • E versus P, WOMAC Pain • P < 0.0001 ⇒ Significant at 0.05 (***) • We do NOT believe that E=P • E versus C, WOMAC Physical Function • P = 0.367 ⇒ NS • It could be that E=C, result is PLAY of CHANCE • E versus C, Patient Global Assessment • P = 0.051 ⇒ NS • It could be that E=C, result is PLAY of CHANCE

1. P-value: What is it? • Previous decision rule = hypothesis testing • Test H0: E=P versus HA: E≠P • Using a statistical test (t-test, χ²-test, etc) • With 2-sided significance level = α = 0.05 • In clinical trial setting: • Above test is interpreted as: H0: E ≤ P versus HA: E > P • And at 1-sided significance level = α/2 = 0.05/2 = 0.025 (2.5%) • When the result is on the wrong side (E < P) with P < 0.05, efficacy of E over P is not demonstrated

1. P-value: What is it? • What if H0: E=P is true & P=0.023? • We will reject H0 • We will make an ERROR = Type I error • P(Type I error) = False-positive rate = Probability that the result belongs to the 5% extreme results if H0 is true = 0.05

2. Type I error • Type I error: Practical implications • Suppose H0 is TRUE • Risk = 5% ⇒ implications: • 100 studies ⇒ on average 5 studies with a wrong conclusion • ⇒ Prob(at least 1 study with a wrong conclusion) ≈ 1 • Regulatory agencies mandate a strict control of the overall false-positive rate • False positive trial findings could lead to approval of inefficacious drugs
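The "on average 5 out of 100" statement can be checked with a small simulation. The sketch below is an illustration only: it assumes normally distributed outcomes with a known SD, 30 patients per arm and a two-sided z-test, none of which come from the slides, and the function name `simulate_null_study` is made up for this example.

```python
import random
from statistics import NormalDist

def simulate_null_study(rng, n_per_arm=30, sd=1.0):
    """Simulate one two-arm study in which H0 (E = P) is true.

    Returns the two-sided P-value of a z-test with known SD
    (an assumption made to keep the sketch short).
    """
    arm_e = [rng.gauss(0.0, sd) for _ in range(n_per_arm)]
    arm_p = [rng.gauss(0.0, sd) for _ in range(n_per_arm)]
    diff = sum(arm_e) / n_per_arm - sum(arm_p) / n_per_arm
    std_error = sd * (2.0 / n_per_arm) ** 0.5
    z = diff / std_error
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

rng = random.Random(2007)  # fixed seed so the run is reproducible
p_values = [simulate_null_study(rng) for _ in range(4000)]
false_positive_rate = sum(p < 0.05 for p in p_values) / len(p_values)
print(f"Share of 'significant' studies under H0: {false_positive_rate:.3f}")
```

The printed share should land close to 0.05: although no study has a true treatment difference, roughly 1 in 20 still reaches P < 0.05.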

3. Multiple testing • Multiple testing: Definition • Suppose H0 is TRUE • Test 1 (WOMAC pain subscale): risk = 5% • Test 2 (WOMAC Physical Function Subscale): risk = 5% • Test 1 & Test 2: risk ≈ 5% + 5% = 10% of claiming that the 2 treatments differ (on at least one of the tests) when they do not • If no adjustment: multiple testing problem
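How the overall risk grows with the number of tests can be written down directly. The sketch below assumes the tests are independent, which is a simplification (the two WOMAC subscales are correlated, so the real figure lies somewhere between 5% and 10%); the function names are illustrative.

```python
def familywise_error_rate(alpha, n_tests):
    """P(at least one false positive) for n independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests

def bonferroni_bound(alpha, n_tests):
    """Upper bound on the same probability, valid without independence."""
    return min(1.0, alpha * n_tests)

for k in (1, 2, 5, 20, 100):
    exact = familywise_error_rate(0.05, k)
    bound = bonferroni_bound(0.05, k)
    print(f"{k:3d} tests: risk = {exact:.4f} (bound {bound:.2f})")
```

For 2 tests this gives 0.0975, which the slide rounds to 10%; for 100 tests it gives about 0.994, the "≈ 1" of the Type I error slide.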

3. Multiple testing • Multiple testing: Typical cases • 2 treatments are compared for several endpoints • More than 2 treatments are compared • 2 treatments are compared in several subgroups • 2 treatments are compared at several time points

3. Multiple testing: example • 2 treatments are compared for several endpoints

3. Multiple testing: example • More than 2 treatments are compared

3. Multiple testing: example • 2 treatments are compared in several subgroups • Treatments were not significantly different overall • Then, treatments were compared in subgroups: • Males & Females • < 60 yrs & ≥ 60 yrs • Diabetes & no-diabetes • .... • Suppose in 1 subgroup: P < 0.05, meaning???? ⇒ Significant result is likely a play of chance

3. Multiple testing: example • 2 treatments are compared at several time points • Comparison at each time point: PLAY OF CHANCE!

3. Multiple testing: example • Protocol specified: • 2.2 Administration of visits: Patients will be examined at baseline (day 0), day 7, day 14 and day 28. At each visit the systolic BP, etc... will be measured. • 9.4 Primary endpoint: The primary endpoint for the comparison of treatment A versus B is systolic BP.

3. Multiple testing: example • This “scientific finding” was printed in the Belgian newspapers! It was even stated that those who awake before 7.21 AM have a statistically significantly higher stress level during the day than those who awake after 7.21 AM!

3. Multiple testing: example • Interesting finding? • “Signs of the times”, The Economist print edition, Feb 22nd 2007, San Francisco: “People born under the astrological sign of Leo are 15% more likely to be admitted to hospital with gastric bleeding than those born under the other 11 signs. Sagittarians are 38% more likely than others to land up there because of a broken arm. Those are the conclusions that many medical researchers would be forced to make from a set of data presented to the American Association for the Advancement of Science by Peter Austin of the Institute for Clinical Evaluative Sciences in Toronto. At least, they would be forced to draw them if they applied the lax statistical methods of their own work to the records of hospital admissions in Ontario, Canada, used by Dr Austin.”

3. Multiple testing • Multiple testing: Solution?? • Choose 1 primary endpoint ⇒ risk = 5% • What if more than one endpoint is needed? • Construct combined endpoint based on clinical/statistical reasoning • Correct for multiple testing • What for other (secondary + tertiary) endpoints? • Call analyses EXPLORATORY • Correct for multiple testing

3. Multiple testing • Multiple testing: Solution?? • Test 1 (WOMAC pain subscale): risk = 5% • Test 2 (WOMAC Physical Function Subscale): risk = 5% • Test 1 & Test 2: risk ≈ 10% • Unadjusted: both tests claim significance if P < 0.05 • Bonferroni adjustment: significance if P < 0.05/2 = 0.025 ⇒ Family-wise error rate ≤ 0.05 (2.5% + 2.5% = 5%) • More sophisticated approaches of Simes, Holm, Hochberg and Hommel, Closed Testing procedures, ...
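The Bonferroni rule, and Holm's step-down procedure as one of the "more sophisticated approaches", can be sketched in a few lines. The P-values used here (0.030 and 0.020) are hypothetical and not taken from the trial; the strict "<" follows the slide's wording.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H0_i when p_i < alpha / m, with m the number of tests."""
    m = len(p_values)
    return [p < alpha / m for p in p_values]

def holm(p_values, alpha=0.05):
    """Holm step-down: compare ordered P-values with alpha/m, alpha/(m-1), ..."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] < alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one test fails, all larger P-values fail too
    return reject

hypothetical_p = [0.030, 0.020]  # illustrative values only
print("Bonferroni:", bonferroni(hypothetical_p))
print("Holm:      ", holm(hypothetical_p))
```

With these two values Bonferroni rejects only the second hypothesis (0.020 < 0.025), while Holm rejects both: after the smallest P-value passes 0.025, the remaining one is compared with 0.05. Both procedures keep the family-wise error rate at or below 5%.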

3. Multiple testing • CPMP guidance document “Points to consider on multiplicity issues in clinical trials” (Sept 19, 2002) “A clinical study that requires no adjustment of the Type I error is one that consists of two treatment groups, that uses a single primary variable, and has a confirmatory statistical strategy that pre-specifies just one single null hypothesis relating to the primary variable and no interim analysis”

4. Type II error • Type I error: • Result is statistically significant (P < 0.05) • Risk of making an error when H0 is true = 5% • (We do NOT know if H0 is true) • Type II error: • Result is NOT statistically significant (P ≥ 0.05) • Risk of making an error when H0 is NOT true = ??? • (We do NOT know if H0 is NOT true)

5. Sample size calculation • P(Type II error) = β = 1 − Power • LARGE(R) in small studies • Can be controlled by adapting study (sample) size • Calculation sample size: • Determine clinically important difference Δ • Search for information • % rate control group • SD of measurements • Fix P(Type II) ≤ 0.20 ⇒ Power ≥ 0.80 (80%) • Look for statistician ((s)he will look for computer program) • Pray • Let computer work ⇒ sample size
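What the "computer program" does can be sketched with the textbook normal-approximation formulas. This is a simplified illustration, not the method of any particular package: it assumes two equally sized arms and a two-sided test, and the input values in the example calls (difference 10 with SD 20; rates 50% versus 30%) are hypothetical.

```python
import math
from statistics import NormalDist

def _z_sum(alpha, power):
    """z_(1 - alpha/2) + z_(power) for a two-sided test."""
    norm = NormalDist()
    return norm.inv_cdf(1.0 - alpha / 2.0) + norm.inv_cdf(power)

def n_per_arm_means(delta, sd, alpha=0.05, power=0.80):
    """Patients per arm to detect a difference in means of size delta."""
    n = 2.0 * (_z_sum(alpha, power) * sd / delta) ** 2
    return math.ceil(n)

def n_per_arm_proportions(p_control, p_treated, alpha=0.05, power=0.80):
    """Patients per arm to detect a difference between two event rates."""
    variance = p_control * (1 - p_control) + p_treated * (1 - p_treated)
    n = _z_sum(alpha, power) ** 2 * variance / (p_control - p_treated) ** 2
    return math.ceil(n)

print(n_per_arm_means(delta=10.0, sd=20.0))   # hypothetical inputs
print(n_per_arm_proportions(0.50, 0.30))      # hypothetical inputs
```

Halving the clinically important difference roughly quadruples the required sample size, which is why the choice of that difference deserves the most attention. Dedicated software typically applies small-sample corrections and gives slightly larger numbers.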

5. Sample size calculation: example • power = 0.95, α = 0.05, Δ = 20%, n = 2 x 300

5. Sample size calculation: example??

6. Negative studies • Negative study: Not significant study • Sample size calculation done (power at least 80%)? • Yes: • Difference between treatments is probably smaller than Δ • No: • Message ???? • DOES NOT imply: NO difference between treatments

6. Negative studies: example Sample size calculation???? Message????

6. Negative studies: “Trend” • Trend in the data: • P > 0.05, but difference is in the good direction • One speaks of a “trend in the data” • OK? • No, for confirmatory study • Perhaps, for pilot study or exploratory studies

7. Testing at baseline Why no P-values? How many significant (at 0.05) tests would you expect?

8. Statistical significance versus clinical relevance • Statistical significance: • P < 0.05 • Message: two treatments are (probably/possibly) different • Clinical relevance: • Difference is clinically relevant

8. Statistical significance versus clinical relevance: Example • Compare two treatments • Response = 10-year mortality • 2 x 200 patients • A: 2%, B: 10% • Chi-square test: P < 0.001 • Measures of effect • ar = 10% − 2% = 8% (abs risk reduction) • rr = 10%/2% = 5 (risk ratio)

8. Statistical significance versus clinical relevance: Example • Compare two treatments • Response = 10-year mortality • 2 x 100,000 patients • A: 0.002%, B: 0.010% • Chi-square test: P ≈ 0.02 (< 0.05) • Measures of effect • ar = 0.010% − 0.002% = 0.008% (abs risk reduction) • rr = 0.010%/0.002% = 5 (risk ratio)
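The two examples can be recomputed with a plain Pearson chi-square test. The sketch below makes some assumptions: no continuity correction is applied, the percentages are converted to event counts (4 and 20 deaths out of 200; 2 and 10 deaths out of 100,000), and B's rate in the large study is taken as 0.010%, the value consistent with the stated risk difference and risk ratio.

```python
import math

def chi_square_2x2(events_a, n_a, events_b, n_b):
    """Pearson chi-square (1 df, no continuity correction) for two proportions."""
    table = [[events_a, n_a - events_a], [events_b, n_b - events_b]]
    total = n_a + n_b
    row_totals = [n_a, n_b]
    col_totals = [events_a + events_b, total - events_a - events_b]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

def effect_measures(events_a, n_a, events_b, n_b):
    """Absolute risk difference and risk ratio of B relative to A."""
    risk_a, risk_b = events_a / n_a, events_b / n_b
    return risk_b - risk_a, risk_b / risk_a

small = (4, 200, 20, 200)           # A: 2%, B: 10%
large = (2, 100_000, 10, 100_000)   # A: 0.002%, B: 0.010% (assumed)
for label, study in (("2 x 200", small), ("2 x 100,000", large)):
    chi2, p = chi_square_2x2(*study)
    ar, rr = effect_measures(*study)
    print(f"{label}: chi2 = {chi2:.2f}, P = {p:.4f}, ar = {ar:.5%}, rr = {rr:.1f}")
```

Under these assumptions the risk ratio is 5 in both settings, while the absolute risk difference shrinks from 8% to 0.008%; both P-values fall below 0.05, so statistical significance alone says nothing about the size of the benefit.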

8. Statistical significance versus clinical relevance: Conclusion • Conclusion • For each (small) Δ (≠0), there is a sample size such that H0 is rejected with high probability • Implications • Clinical trials are often too small to detect rare safety issues • When registered and on the market, after several years a safety issue appears (Vioxx story)

8. Statistical significance versus clinical relevance: Further reflections • Practical conclusions • Even if the result is not significant, we will NOT conclude that H0 is true • Why do the significance test, if we don’t believe in it? • Better: estimate the difference in treatment effect + uncertainty • [Table: classical table indicating the two types of errors (decision-theoretic approach of Neyman-Pearson)] • The table suggests that we can conclude in practice that the 2 treatments are equally good • But it is not possible in statistics to show that 2 treatments are equally good (non-inferiority talk) • We even DO NOT BELIEVE that H0 is TRUE in practice!

9. Confidence interval versus P-value

9. Confidence interval versus P-value • 95% confidence interval • Expresses uncertainty about true difference • When narrow ⇒ good idea about true treatment effect • Examples • WOMAC Pain Subscale: • E versus C: 95% CI = [-7.02, 0.77] ⇒ 0 is possible • E versus P: 95% CI = [-19.72, -10.41] ⇒ E is better • C versus P: 95% CI = [-16.57, -7.32] ⇒ C is better • GENERAL RESULT: P < 0.05 ⇔ 95% CI does not contain 0
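The link between a 95% CI and the P-value can be made concrete by working backwards from the three intervals above. The sketch assumes a symmetric, normal-theory interval (estimate ± 1.96 × SE); the trial's actual analysis model is not given in the slides, so the resulting P-values are approximations only.

```python
from statistics import NormalDist

Z_975 = NormalDist().inv_cdf(0.975)  # about 1.96

def p_value_from_ci(lower, upper):
    """Approximate two-sided P-value implied by a symmetric 95% CI."""
    estimate = (lower + upper) / 2.0
    std_error = (upper - lower) / (2.0 * Z_975)
    z = estimate / std_error
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

def excludes_zero(lower, upper):
    """True when the interval lies entirely on one side of 0."""
    return lower > 0.0 or upper < 0.0

intervals = {
    "E versus C": (-7.02, 0.77),
    "E versus P": (-19.72, -10.41),
    "C versus P": (-16.57, -7.32),
}
for label, (lo, hi) in intervals.items():
    p = p_value_from_ci(lo, hi)
    print(f"{label}: CI excludes 0: {excludes_zero(lo, hi)}, approx. P = {p:.2g}")
```

The midpoint of the E versus P interval, about -15.07, reproduces the difference in means quoted at the start of the talk. In all three comparisons the approximate P-value is below 0.05 exactly when the interval excludes 0, and the interval additionally shows how large the difference could plausibly be.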

9. Confidence interval versus P-value • Two anti-hypertensive drugs • [Figure: 95% confidence intervals by study and medication] • 95% CI gives a clearer message

10. P-value: clinical trial versus epi study • Clinical trial • Randomized • No confounding • P < 0.05 ⇒ causal effect of treatment on patient’s condition • Epidemiological study • Observational • Possible confounding • P < 0.05 ⇒ at most an association; correction for confounding needed

10. P-value: clinical trial versus epi study

11. Biased set up & reporting

11. Biased set up & reporting • Bias in set up of studies, e.g. inappropriate doses of competing drug • Choice of patient populations, e.g. exclusion of patients who were previously nonresponders to treatment • Noninferiority designs with different thresholds • Biased reporting, e.g. minimal information on negative aspects of the sponsor's drug

12. Take home messages • If possible, take 1 primary endpoint • Always determine necessary sample size • Always WATCH OUT for problem of multiple testing • Always and ONLY interpret NS as NOT possible to show “difference” • Always be careful when talking about “trend” • Always determine 95% confidence intervals

Thank you for your attention
