Non-inferiority Trials - part “D” of the general Phase III lecture

Rick Chappell, Ph.D., Professor, Department of Statistics and Department of Biostatistics and Medical Informatics, University of Wisconsin – Madison. chappell@stat.wisc.edu. Stat 542 – Spring 2018

Outline I. Definition of Equivalence Trials with a Motivating Example II. Specifying the Null Hypothesis • Consequences of the Choice of Null Hypothesis • Consequences of not Trying to Find an Effect

I. A Motivating Example - SPORTIF III (Darius, 2002) Company: AstraZeneca Comparison: Ximelagatran (T; experimental treatment) vs. Warfarin (C; active control) in a randomized, two parallel arm clinical trial Treatment and Followup: Up to twenty-six months Primary Outcome: Stroke and other events measured by annual incidence rate

I. SPORTIF III, cont. Blinding Status: Open label, blinded assessment Type: Non-inferiority, margin 2% per year (5.1% vs. 3.1%) Patient Population: Age ≥ 18 y.o., atrial fibrillation, high-risk (at least one of: hypertension; age ≥ 75 y.o.; previous stroke, TIA, or systemic embolism; left ventricular dysfunction; or age ≥ 65 y.o. and diabetes mellitus or coronary artery disease)

I. SPORTIF III, cont. Size of Test for Equivalence: alpha = .05, two-sided Power of Test for Equivalence: 90% for primary event rate of 3.1% / year in each treatment group Planned Sample Size: 3407 patients at 259 sites in 23 countries Planned Average Duration of Treatment / Followup: 16 months

Definition of Equivalence Trial Consider treatment rate T and control rate C: • A Superiority Trial examines the treatment effect T - C and attempts to show it to be negative (when small values are good): ximelagatran has fewer events than warfarin. • A higher sample size n yields more power to precisely estimate T - C and detect a difference in effects.

An Equivalence (non-inferiority, “active control”) Trial examines T - C and attempts to show it to be not too large: ximelagatran doesn’t result in many more events than warfarin. • Naively, one can put a confidence interval on T - C and claim success if it does contain 0. The chance of this is maximized with lower n.

Solution: Require that the trial show the difference to be smaller than a "prespecified degree of inferiority" D [ICH E-3]: T - C < D. T might be a little bigger (worse) than C but not much. Temple and Ellenberg (2000) point out that this is a "not-too-much-inferiority trial". Here also a higher n, by giving more precision to estimate T - C, raises the chance of success.

A comment on terminology - “Active Control”≠ “Non-inferiority” Superiority trials can be active control (comparing one treatment to another); or Inactive control (comparing a treatment to placebo or standard of care). Non-inferiority trials are usually active control because we usually don’t want to show a treatment non-inferior to nothing. “Usually”?

II. Specifying the Null Hypothesis Hypotheses in Superiority Trials: H0: T - C ≥ 0 (Treatment may have no effect or cause harm) vs. HA: T - C < 0 (Treatment has a good effect). Hypotheses in Equivalence Trials: H0: T - C ≥ D (Treatment might be much worse) vs. HA: T - C < D (Treatment isn’t much worse).
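A minimal sketch of how the equivalence-trial hypothesis above might be tested, assuming a normal approximation for the estimated difference T - C. The function name, the standard error, and the numbers are hypothetical illustrations, not SPORTIF III data or that trial's actual analysis. H0 is rejected exactly when the one-sided upper confidence bound for T - C falls below the margin D.

```python
from statistics import NormalDist

def noninferiority_test(diff_hat, se, margin, alpha=0.025):
    """One-sided test of H0: T - C >= margin vs. HA: T - C < margin.

    diff_hat: estimated T - C (small values are good)
    se:       standard error of diff_hat (treated as known here)
    Returns (upper confidence bound, p-value, reject H0?).
    """
    z_crit = NormalDist().inv_cdf(1 - alpha)
    upper = diff_hat + z_crit * se
    p_value = NormalDist().cdf((diff_hat - margin) / se)
    return upper, p_value, upper < margin

# Hypothetical numbers, in % per year
upper, p, reject = noninferiority_test(diff_hat=0.5, se=0.6, margin=2.0)
print(f"upper bound = {upper:.2f}, p = {p:.4f}, non-inferior: {reject}")
```

Setting margin=0 in the same function gives the superiority test of H0: T - C ≥ 0.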

Ways Effects can be Defined for a Time-to-Event Trial 1. Survival function at all followup times 2. Hazard (event rate) function at all followup times 3. Survival by a given time such as one year. 4. Median time to event H0 in a superiority trial is the same for definitions 1. and 2. Definitions 3. and 4. are weaker, but are implied by 1. and 2. But they all differ for non-inferiority trials.

Margins are not zero in equivalence studies, so we must pick the scale carefully. This ought to be the scale on which the specified margin is relevant. It may not result in the most convenient statistical analysis: how many clinical trials’ primary question pertains to a difference in annual event rates (hazards)? But for purposes of interpretability, this scale was chosen for SPORTIF III.

Back to a Motivating Example - SPORTIF III Company: AstraZeneca Comparison: H 376/95 (ximelagatran) vs. Warfarin Treatment and Followup: Up to twenty-six months Outcome: Stroke and other events measured by annual incidence rate Type: Equivalency, margin 2% per year Sample size: 3000 patients

In the SPORTIF example, the outcome is the annual incidence rate, so that the hypotheses are H0: T - C ≥ 2% per year vs. HA: T - C < 2% per year

If event rates are constant then the incidence is just the exponential rate parameter, estimated by Observed Incidence = # of events / total time at risk. • But typically the event rate will change with followup. Then Observed Incidence becomes an average annual rate over twenty-six months. • How to estimate it efficiently - using full two years of data? What if followup varies? • How to estimate it robustly - not assuming any parametric distribution?
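The Observed Incidence formula can be sketched in a few lines. The function name and the follow-up times below are hypothetical. Dividing by total person-time is what lets patients with different lengths of followup contribute to the same estimate; under a constant event rate this ratio is the maximum likelihood estimate of the exponential rate parameter.

```python
def observed_incidence(events, follow_up_years):
    """Events per person-year: # of events / total time at risk."""
    total_time = sum(follow_up_years)
    if total_time <= 0:
        raise ValueError("total time at risk must be positive")
    return events / total_time

# Hypothetical follow-up times (years) for five patients, two of whom had events
follow_up = [2.0, 1.5, 0.5, 2.0, 1.0]
rate = observed_incidence(events=2, follow_up_years=follow_up)
print(f"{100 * rate:.1f}% per year")
```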

[Figure: Interpretation of possible trial results. Horizontal axis: Event Rate Difference (ximelagatran minus warfarin), % per year, from -6 to 6; negative values are labeled "Ximelagatran better" and positive values "Warfarin better". Possible results shown: Clinical equivalence, Non-inferiority, Inconclusive, Superiority, Inferiority.]

III. Consequences of the Choice(s) In the above example, we chose • The non-inferiority margin, D = 2%/year; and • The scale of comparison in H0, the annual incidence rate. There is much literature on choice #1, almost none on #2.

Choosing D = 2%/year: • D should be smaller than the originally estimated effect of Warfarin: if the event rate with Warfarin was 4%, reduced from say 7% for placebo, it would be nonsensical to consider D ≥ 3%. That would imply that therapeutic equivalence could be as bad as no treatment. • D should also be clinically relevant. But how do we choose the scale of comparison?

A. Scale choice and balanced randomization Consider the simple two-sample continuous outcome case. Then the usual (unstandardized) test statistic for superiority is the difference in sample means. The allocation which minimizes its variance and yields greatest power for the trial is a balanced (1:1) one.

Suppose we are designing an equivalence trial to test the hypothesis H0: T - C ≥ 2. Then the optimal allocation is still balanced. However, if we want to test H0: T / C ≥ 2, which is equivalent to H0: T - 2 × C ≥ 0, instead, then the optimal allocation is 2 : 1 in favor of the control group - a big difference.
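The allocation claim can be checked numerically. This sketch assumes equal outcome variance in both arms (an assumption not stated on the slide) and a fixed total sample size. The statistic mean_T - k × mean_C then has variance σ²/n_T + k²σ²/n_C, which is minimized when n_C : n_T = k : 1. Function names are illustrative.

```python
def variance(frac_control, k, n_total=1.0, sigma2=1.0):
    """Variance of mean_T - k * mean_C for a given control fraction."""
    n_c = frac_control * n_total
    n_t = (1 - frac_control) * n_total
    return sigma2 / n_t + k**2 * sigma2 / n_c

def best_control_fraction(k, grid=9999):
    """Grid search for the control fraction minimizing the variance."""
    fracs = [i / (grid + 1) for i in range(1, grid + 1)]
    return min(fracs, key=lambda f: variance(f, k))

# k = 1 is the additive hypothesis, k = 2 the ratio hypothesis T - 2C >= 0
for k in (1, 2):
    f = best_control_fraction(k)
    print(f"k = {k}: control fraction {f:.3f} (closed form {k / (1 + k):.3f})")
```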

B. Scale choice and power Suppose, in a trial (percutaneous coronary intervention, PCI) with binary “failure” outcome, we are deciding between the null hypothesis of additive inferiority in rates, H0: T - C ≥ .004, and that of multiplicative inferiority, H0: T / C ≥ 1.5. These are identical at C = .008, T = .012. Should power be the same for the common alternative HA: T = C = .008?

Answer: no (surprisingly, to me and the trial’s principal investigator). For proportions less than .01, the range in which we are interested, the hypothesis of multiplicative non-inferiority is much more demanding and requires a larger sample size: About 21,000 instead of 14,000!
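A rough check of this answer, using textbook normal approximations: a Wald test on the difference of proportions for the additive hypothesis, and a Wald test on the log ratio (delta method) for the multiplicative one. The significance level and power below are assumptions, since the slide does not state them, so the totals will not match the quoted figures exactly. Under these approximations the ratio of the two sample sizes does not depend on that choice, and it comes out near 1.5, in line with 21,000 vs. 14,000.

```python
from math import log, ceil
from statistics import NormalDist

def n_per_arm_additive(p, margin, alpha, power):
    """Per-arm n for H0: T - C >= margin when truly T = C = p."""
    z = NormalDist().inv_cdf(1 - alpha) + NormalDist().inv_cdf(power)
    return ceil(z**2 * 2 * p * (1 - p) / margin**2)

def n_per_arm_multiplicative(p, ratio, alpha, power):
    """Per-arm n for H0: T / C >= ratio when truly T = C = p (log scale)."""
    z = NormalDist().inv_cdf(1 - alpha) + NormalDist().inv_cdf(power)
    return ceil(z**2 * 2 * (1 - p) / p / log(ratio)**2)

alpha, power = 0.025, 0.80   # assumed; not given in the source
n_add = n_per_arm_additive(0.008, 0.004, alpha, power)
n_mult = n_per_arm_multiplicative(0.008, 1.5, alpha, power)
print(n_add, n_mult, round(n_mult / n_add, 2))
```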

IV. Consequences of “Not Trying to Find an Effect” A. Low quality can naïvely appear good (help prove non-inferiority): - noncompliance - drug impurity - loss to follow-up - enrollment of ineligibles - other protocol violations

Obviously we want to minimize protocol violations in all trials. However, they have fundamentally different effects depending on the type of study they afflict: • In Superiority Trials conducted with proper randomization and blinding, these violations degrade the treatment effect T - C and are thus usually conservative: they bias results towards a conclusion of no effect. • In Equivalence Trials, even with proper randomization and blinding, these violations can degrade the treatment effect and are thus anti-conservative: they can bias results towards a conclusion of equivalence.

Intent to Treat Revisited Is it still the ironclad standard for primary analysis? "No participants should be withdrawn from the analysis due to lack of adherence. The price to be paid is a possible decrease in power." - Friedman, Furberg and DeMets, referring to superiority trials. General agreement, including in ICH guidelines: "An analysis using all available data should be carried out for all studies intended to establish efficacy" [ICH E-3].

“Intent to Treat” Analysis in the presence of noncompliance • Has decreased power compared to situation with full compliance and • Results in estimate of T - C biased towards 0 - conservative in superiority trials - anticonservative in non-inferiority trials

“Intent to Treat” Analysis in the presence of noncompliance, cont. Results in estimate of T - C biased towards 0 • Conservative in superiority trials • Anticonservative in non-inferiority trials?
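A deliberately simplified sketch of why bias towards 0 is anticonservative in a non-inferiority trial. It assumes that a fraction of patients randomized to T do not comply and experience the control rate C instead; this is one hypothetical noncompliance mechanism among many, and the function name and rates are illustrative only. A treatment that is truly worse than the margin can then appear to fall inside it.

```python
def itt_difference(rate_t, rate_c, noncompliance):
    """Expected ITT estimate of T - C when a fraction of the T arm
    actually experiences the control rate (assumed mechanism)."""
    observed_t = (1 - noncompliance) * rate_t + noncompliance * rate_c
    return observed_t - rate_c

margin = 2.0                       # % per year
true_t, true_c = 5.5, 3.1          # hypothetical: true T - C = 2.4 > margin
for q in (0.0, 0.1, 0.2, 0.3):
    d = itt_difference(true_t, true_c, q)
    print(f"noncompliance {q:.0%}: expected T - C = {d:.2f}, "
          f"below margin: {d < margin}")
```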

"As Treated" Analysis in the presence of noncompliance • Also has decreased power compared to situation with full compliance: and • Biases the estimate of T - C in an unknown fashion - ? in superiority trials - ? in non-inferiority trials

Recommendation: Stick to Intent to Treat in equivalence (non-inferiority) trials but • Take care to maximize quality of the data • Pay attention to patterns of quality during the trial • Summarize aspects of quality in the report. Quality "Before, During and After".

ICH E-9 says that in an equivalence trial, the role of the full analysis (intent-to-treat) data set "should be considered very carefully." What percent of noncompliance is unacceptable?

ICH E-10 states: "The trial should also be conducted with high quality (e.g., good compliance, few losses to follow-up)." • It also has useful advice: "The trial conduct should also adhere closely to that of the historical trials." • That is, the design and patient population should be similar to previous trials used to determine evidence of sensitivity to drug effects. • My conclusion: If noncompliance is less than that achieved in prior trials which showed efficacy, good. But if not, beware (same logic as in choice of D).

Other interesting problems: B. Finding the true treatment effect of Ximelagatran compared to placebo: we don’t just want to know if Ximelagatran is non-inferior to Warfarin, we want to know if it “works” - if it is better than Placebo. We infer about: Effect of Xi. vs. Pl. = Effect of Xi. vs. Wa. + Effect of Wa. vs. Pl. But combining two clinical trials invokes an assumption:

[Diagram: two trials side by side. HISTORICAL TRIAL: nH patients, randomization, placebo (nH/2) vs. old drug (nH/2), compare. EQUIVALENCE TRIAL: nE patients, randomization, old drug (nE/2) vs. new drug (nE/2), compare. The diagram also carries a "?".]

The inference’s validity depends on randomization and on comparison with a past trial, in order to estimate the treatment effect without direct comparison with placebo • Past trials give Historical Evidence of Sensitivity to Drug Effects (HESDE) • HESDE is relevant only if populations in the two trials are similar • What if neither drug works in the current population? Then they’re certainly equivalent!
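The additive decomposition, Effect of Xi. vs. Pl. = Effect of Xi. vs. Wa. + Effect of Wa. vs. Pl., can be sketched numerically. This assumes that the two trials are independent and that the historical warfarin effect carries over unchanged to the current population, which is exactly the assumption the slides question. The function name and all numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def indirect_effect(est_new_vs_old, se_new_vs_old, est_old_vs_pl, se_old_vs_pl,
                    level=0.95):
    """Sum two independent effect estimates; their variances add."""
    est = est_new_vs_old + est_old_vs_pl
    se = sqrt(se_new_vs_old**2 + se_old_vs_pl**2)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return est, (est - z * se, est + z * se)

# Hypothetical event-rate differences, % per year (negative = fewer events)
est, ci = indirect_effect(0.5, 0.6, -3.0, 0.8)
print(f"new vs. placebo: {est:.1f} ({ci[0]:.2f}, {ci[1]:.2f})")
```

The combined interval is wider than either input interval, and no amount of precision repairs it if the historical effect does not hold in the current population.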

This contrasts with a superiority trial's validity, which depends upon randomization: arms are drawn from the same population (Lachin, 1988). [Diagram: n patients, randomization, old drug (n/2) vs. new drug (n/2), compare.]

But populations change: • Age distribution changes • Other characteristics change • Adjuvant therapies arise • Earlier diagnosis is possible • The disease itself may change These imply that we should use a recent trial for comparison.

C. “Biocreep”: There is a problem with continuously comparing to the most recent trials: “Equivalency Drift” (referred to as “Bio-creep” in one FDA guidance).

The Problem of Equivalency Drift [Figure: two panels with a vertical BENEFIT axis running from 0 to +4%, Margin of Equivalency = 2%. Drugs 1 through 4 are plotted in each panel, with pairs of drugs labeled EQUIVALENT.]
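A worst-case reading of equivalency drift, as a small sketch. It assumes each new drug is compared only with its immediate predecessor and may be worse by up to the full margin; the benefits the figure actually shows for drugs 2 through 4 are not recoverable from the transcript, and the function name is illustrative.

```python
def worst_case_benefit(initial_benefit, margin, generations):
    """Lower bound on benefit after each successive non-inferiority
    comparison, if each new drug may be worse by up to `margin`."""
    return [initial_benefit - k * margin for k in range(generations + 1)]

# Values from the slide: +4% benefit for drug 1, margin of equivalency 2%
for drug, benefit in enumerate(worst_case_benefit(4.0, 2.0, 3), start=1):
    print(f"drug {drug}: benefit could be as low as {benefit:+.0f}%")
```

After two such steps the guaranteed benefit is already 0, even though every drug was declared equivalent to the one before it.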

Another Thorny Consideration Suppose you represent a drug manufacturer conducting a clinical trial and you know that the trial’s results would be used to help a future competitor show its drug to be non-inferior to yours. Then the narrower your confidence intervals, the easier you make it for your competitor! You are motivated to make your results as imprecise as possible, while still permitting FDA approval. [Figure: BENEFIT axis from 0 to +4%, showing Drug 1 and Drug 2 labeled EQUIVALENT.]

Chappell’s prediction In 30 years (2 years longer than my residual life expectancy), all positive clinical trials will have significance p = .049.

“I guess I should warn you that if I turn out to be particularly clear, you've probably misunderstood me.” - Alan Greenspan at his 1988 confirmation hearings for Fed. Reserve chair.

References
Chappell, R. “Non-inferiority Trials.” In Clinical Trials in Neurology, ed. Ravina, B., Cummings, J., McDermott, M.P., and Poole, M. Cambridge University Press, Cambridge, UK (2012).
Friedman, L.M., Furberg, C., and DeMets, D.L. Fundamentals of Clinical Trials. Springer-Verlag, New York (1998).
Halperin, J.L. “Ximelagatran compared with warfarin for prevention of thromboembolism in patients with nonvalvular atrial fibrillation: Rationale, objectives, and design of a pair of clinical studies and baseline patient characteristics (SPORTIF III and V).” Am. Heart J. 146, pp. 431-8 (2003).
International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use. Guidances E3: Structure and Content of Clinical Study Reports (1995); E9: Statistical Principles for Clinical Trials (1998); and E10: Choice of Control Group in Clinical Trials (2000). http://www.ich.org/ich5e.html#Reports
Lachin, J.M. “Statistical properties of randomization in clinical trials.” Controlled Clinical Trials 9, pp. 289-311 (1988).
Temple, R. and Ellenberg, S.S. “Placebo-controlled trials and active-control trials in the evaluation of new treatments. Part 1: ethical and scientific issues.” Annals of Internal Medicine 133, pp. 455-63 (2000).
