Statistical Issues in Contraceptive Trials Daniel L. Gillen, PhD Department of Statistics University of California, Irvine FDA Reproductive Drugs Advisory Committee Meeting, Jan 23-24
Minimum requirements of a clinical trial • Appropriate target population • Use of appropriate comparison groups • Use of appropriate outcome measure • Ability to maintain statistical criteria for evidence • Controlling type I and II errors in the Frequentist setting
Outline • Outcome measures • Pearl Index vs. life-table methods • Comparison populations • Historical vs. active control trials • Defining statistical evidence • Testing for superiority vs. non-inferiority
The Pearl Index • The Pearl Index (number of pregnancies per 100 woman years) is a common measure used to summarize contraceptive effectiveness • However, a drawback of the Pearl Index is that in most situations it is dependent on time and must be interpreted accordingly • Such dependence occurs because of the changing baseline risk of pregnancy within study samples as time marches forward
Ex: Sensitivity of Pearl Index to duration of follow-up • Suppose our study population consists of two groups • “Low risk” group (90% of population): • Constant risk of pregnancy • 1 year probability of pregnancy is 5% • “High risk” group (10% of population): • Constant risk of pregnancy • 1 year probability of pregnancy is 50%
Ex (cont’d): One-year Pearl Index • Now consider the Pearl Index calculated over the first year for a cohort of 5000 women • Expected number of pregnancies • 5000*(0.90*0.05 + 0.10*0.50) = 475 • Expected person-years at risk with censoring for pregnancy (each pregnancy assumed to occur, on average, halfway through the year) • 4525*1 + 475*.5 = 4762.5 • Pearl Index • (475 / 4762.5)*100 = 9.97 pregnancies per 100 per year
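The arithmetic on this slide can be reproduced with a short Python sketch. The cohort size of 5000 comes from the slide's own calculation. The function name `pearl_index` and the half-year contribution for each pregnancy are illustrative assumptions, not part of the original talk.

```python
def pearl_index(pregnancies, woman_years):
    """Pregnancies per 100 woman-years of exposure."""
    return 100.0 * pregnancies / woman_years

n = 5000                                  # cohort size used in the slide's arithmetic
share = {"low": 0.90, "high": 0.10}       # composition of the cohort
p_year = {"low": 0.05, "high": 0.50}      # 1-year pregnancy probabilities

# Expected pregnancies in year 1, summed over the two risk groups.
expected_preg = sum(n * share[g] * p_year[g] for g in share)

# Women who become pregnant are assumed to contribute half a year on average;
# everyone else contributes a full year.
woman_years = (n - expected_preg) * 1.0 + expected_preg * 0.5

pi_year1 = pearl_index(expected_preg, woman_years)
print(round(expected_preg, 1), round(woman_years, 1), round(pi_year1, 2))
```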
Ex (cont’d): Two-year Pearl Index • For the Pearl Index calculated over 2 years, we need to consider the impact of censoring the “high risk” group at pregnancy • By the end of one year • Number left in low risk group: 5000*0.90*(1-0.05) = 4275 • Number left in high risk group: 5000*0.10*(1-0.50) = 250 • High risk women at one year, relative to the 4275 low risk women remaining: 250/4275 = 5.8% (as a share of all 4525 women still at risk, 250/4525 = 5.5%)
Ex (cont’d): Two-year Pearl Index Now consider the Pearl Index calculated between years 1 and 2 • Expected number of pregnancies occurring between 1 and 2 years of follow-up • 4525*(0.942*0.05 + 0.058*0.50) = 344.4 • Expected person-years at risk between year 1 and year 2 • 4180.6*1 + 344.4*.5 = 4352.8 person-years • Pearl Index calculated between years 1 and 2 • (344.4 / 4352.8)*100 = 7.92 pregnancies per 100 per year
Ex (cont’d): Two-year Pearl Index Now consider the Pearl Index calculated over 2 years • Expected number of pregnancies observed over 2 years • 475 + 344.4 = 819.4 • Expected person-years at risk over 2 years • 4762.5 + 4352.8 = 9115.3 person-years • Pearl Index calculated over 2 years • (819.4 / 9115.3)*100 = 8.99 pregnancies per 100 per year
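The decline of the Pearl Index with longer follow-up can be sketched by carrying the two subgroups forward year by year. This is a minimal illustration under the example's assumptions (constant within-group risk, pregnancies contributing half a year on average). The variable names are hypothetical. Because the sketch carries the subgroup counts (4275 and 250) forward directly rather than applying a 94.2% / 5.8% split, its year-two and cumulative values come out slightly lower than the figures above. The downward pattern is the same.

```python
def pearl_index(pregnancies, woman_years):
    """Pregnancies per 100 woman-years of exposure."""
    return 100.0 * pregnancies / woman_years

at_risk = {"low": 4500.0, "high": 500.0}   # 5000 women split 90% / 10%
p_year = {"low": 0.05, "high": 0.50}       # constant 1-year pregnancy probabilities

yearly = []          # Pearl Index within each year of follow-up
total_preg = 0.0
total_wy = 0.0
for year in (1, 2):
    preg = {g: at_risk[g] * p_year[g] for g in at_risk}
    n_preg = sum(preg.values())
    n_start = sum(at_risk.values())
    # Full year for women who do not conceive, half a year for those who do.
    wy = (n_start - n_preg) * 1.0 + n_preg * 0.5
    yearly.append(pearl_index(n_preg, wy))
    total_preg += n_preg
    total_wy += wy
    # Censor at pregnancy: only women who did not conceive remain at risk.
    at_risk = {g: at_risk[g] - preg[g] for g in at_risk}

cumulative = pearl_index(total_preg, total_wy)
print([round(v, 2) for v in yearly], round(cumulative, 2))
```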
When is the Pearl Index independent of study support? • The Pearl Index will change with the length of follow-up unless: • The rate of pregnancies is homogeneous across all possible subgroups • This rate remains constant with time
When is the Pearl Index independent of study support? • In the previous example, it should be noted that even if we allow participants with failures to re-enter the risk set the Pearl Index will still depend upon time • This is because a failure results in less at-risk time, thus total years of follow-up will be proportionately less in the “high risk” group as duration of maximal follow-up increases
A further issue in quantifying the Pearl Index… • Most confidence intervals for the Pearl Index assume a Poisson distribution • This distribution is defined as having variance equal to the mean (or rate) • However, count or rate data are typically characterized as stemming from an overdispersed Poisson distribution • That is, the true variance in the rate that we observe is greater than we assume from the Poisson distribution • Overdispersion in Poisson rates typically arises from heterogeneity of patient populations
Computation of confidence intervals for the Pearl Index • Consider our previous example with a “low risk” and a “high risk” group • Low risk group (90% of population): • Constant risk of pregnancy • 1 year probability of pregnancy is 5% • High risk group (10% of population): • Constant risk of pregnancy • 1 year probability of pregnancy is 50%
Computation of confidence intervals for the Pearl Index • We previously calculated the (true) 1 year Pearl Index to be 9.97 pregnancies per 100 per year • Suppose that in reality, we observed 457 pregnancies over 1 year with a total of 4763 years of followup, resulting in a Pearl Index of 9.60 per 100 per year • Assuming a Poisson distribution the corresponding 95% confidence interval for the 1 year Pearl Index would be (8.73, 10.51)
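A Poisson-based interval of this kind can be sketched as follows. The sketch uses a normal approximation on the log-rate scale, which is an assumption here. The interval quoted above may come from a different (for example, exact) method, so the bounds agree only approximately. The function name `pearl_ci_poisson` is hypothetical.

```python
import math

def pearl_ci_poisson(pregnancies, woman_years, z=1.96):
    """Pearl Index with an approximate 95% CI assuming Poisson counts.

    Uses Var(log rate) ~ 1 / (number of events), which holds under a
    single Poisson model.
    """
    rate = pregnancies / woman_years
    se_log = 1.0 / math.sqrt(pregnancies)
    lower = rate * math.exp(-z * se_log)
    upper = rate * math.exp(z * se_log)
    return 100.0 * rate, 100.0 * lower, 100.0 * upper

est, lo, hi = pearl_ci_poisson(457, 4763)
print(round(est, 2), round(lo, 2), round(hi, 2))
```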
Computation of confidence intervals for the Pearl Index • However, because the Pearl Index is really composed of a mixture of Poisson distributions (from the high and low risk groups) the true variance is actually 19.2% larger than assumed by the usual (single) Poisson model • This means that we have underestimated the variance, i.e., our confidence interval is shorter than it should be! • In this case, a 95% confidence interval accounting for the heterogeneity of groups is (8.63, 10.55). • This is approximately 8% wider than the previous interval
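One way to arrive at a variance inflation of this size is the law of total variance for a mixed Poisson: the variance equals the mean rate plus the between-group variance of the rates. The sketch below treats the 1-year probabilities (0.05 and 0.50) as the group rates. That this is how the 19.2% figure was obtained is an assumption.

```python
share = [0.90, 0.10]     # low risk, high risk proportions
rate = [0.05, 0.50]      # annual probabilities treated as Poisson rates (assumption)

# Mean rate across the mixture.
mean = sum(w * r for w, r in zip(share, rate))

# Between-group variance of the rates.
var_between = sum(w * (r - mean) ** 2 for w, r in zip(share, rate))

# Mixed Poisson: Var = E[rate] + Var[rate]; a single Poisson would have Var = E[rate].
mixture_var = mean + var_between
inflation = mixture_var / mean
print(round(100 * (inflation - 1), 1), "% larger than the single-Poisson variance")
```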
How to deal with the changing composition of the risk set? • We illustrated one way in our example • Consider the probability of failure at specific time points by using conditional probability • For example, if T is the time of failure we can compute the probability of failure within two years as Pr[T<2] = 1-Pr[T>2] = 1 - Pr[T>2|T>1]Pr[T>1] = 1-(1-0.0792)*(1-0.0997) = 0.171
How to deal with the changing composition of the risk set? • This is called a life-table estimate • In the setting of contraceptive failure, these conditional probabilities are typically computed monthly to more accurately incorporate the risk set (see eg. Potter, 1966) • When the life-table estimate is evaluated at all (distinct) failure times, this is called a Kaplan-Meier estimate.
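A life-table estimate multiplies conditional survival probabilities across intervals. The sketch below uses yearly intervals and the expected counts from the two-group example (5000 women at the start, 475 pregnancies in year one, then 4525 at risk with 338.75 pregnancies in year two, from carrying each subgroup forward). The function name is hypothetical, and the only censoring modelled is pregnancy itself. The two-year figure it gives is a little lower than the 0.171 obtained from the Pearl Index rates, since a rate per woman-year slightly exceeds the corresponding one-year probability.

```python
def life_table_failure(intervals):
    """Cumulative failure probability at the end of each interval.

    intervals: list of (number at risk at start, number of failures) pairs,
    one per interval, in time order.
    """
    surv = 1.0
    cumulative = []
    for at_risk, failures in intervals:
        surv *= 1.0 - failures / at_risk      # Pr[survive interval | at risk at start]
        cumulative.append(1.0 - surv)
    return cumulative

cum = life_table_failure([(5000, 475.0), (4525, 338.75)])
print([round(c, 4) for c in cum])
```

With monthly rather than yearly intervals the same function applies unchanged. Only the input rows become finer.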
Are there any benefits to using the Pearl Index? • Clearly, the Pearl Index has been in wide use • The reasons for this are • Ease of interpretation • Although the Kaplan-Meier estimator also has a clinically relevant interpretation (probability of failure over T years of use) • For historically controlled trials, there is a great deal of data summarized in terms of the Pearl Index • This will, of course, change as the popularity of Kaplan-Meier estimates grows in the field
Can we incorporate changing treatment regimens? • Patients may discontinue use or use additional contraceptives for some intervals of time • Technically, the Kaplan-Meier estimator could incorporate such left and right censoring. • However, it is not clear when patients should re-enter the risk set
Can we incorporate changing treatment regimens? • For example, consider the case where a participant uses back-up contraception during the interval (t1, t2). • This individual could be considered at risk for the interval (0, t1) then re-entered into the risk set at time t2. • However, by doing this we are implicitly making the assumption that this person’s hazard (or risk of pregnancy) at time t2 is the same as all others who have been at risk from (0, t2) • This is not a reasonable assumption to me and I would advise against it
Can we incorporate changing treatment regimens? • Another option for incorporating changing treatment regimens would come from post-hoc analyses • Stratified Kaplan-Meier estimates • Number of strata could become large • Time-dependent covariates • E.g., consider a proportional hazards framework
Regardless of the measure, what defines a failure and who is at risk? • For all new interventions we must consider: • Safety: Are there adverse effects that clearly outweigh any potential benefit? • Efficacy: Can the intervention reduce the probability of unintended pregnancy in a beneficial way? • Effectiveness: Would adoption of the intervention as a standard reduce the probability of unintended pregnancy in the population?
Regardless of the measure, what defines a failure and who is at risk? • One difference between evaluation of efficacy and effectiveness is in what defines a failure and who should be included in the risk set • In a clinical trial setting we can truly only evaluate efficacy because of possible selection bias of patients entering contraceptive trials • However, even in the clinical trial setting it is useful to evaluate • Intervention failure rates during actual use (including inconsistent or incorrect use) • Intervention failure rates during perfect use • (see eg. Trussell, Contraception, 2004)
Regardless of the measure, what defines a failure and who is at risk? • To assess true method efficacy, counting only “method failures” during perfect use, we must include only exposure during perfect use in the risk set • We also need to consider whether those who are lost to follow-up should be considered at risk all the way up to the time of drop-out • One reasonable approach is to censor patients three months prior to the time at which they become lost to follow-up (Trussell, SIM, 1991)
Historical control trials vs. active control trials • In the past many methods have been assessed via a historical control trial • E.g., a Pearl Index of 1.5 (or more recently 2) or less has been used as an efficacy criterion • Such criteria stem from the experience of historical controls • However, biases resulting from historical control studies can be numerous, particularly when study samples are not comparable with respect to baseline risk, evaluative measure of outcome, or duration of study.
Criteria for superiority in historical control trials • As noted, past studies have considered point estimates of the (one year) Pearl Index of less than 1.5 or 2 unintended pregnancies per 100 per year • However, we must also acknowledge uncertainty of these estimates • EMEA requires sufficient sample size to guarantee the width of the 95% CI for the Pearl Index to be no larger than 1 • Better (in my opinion) to require that upper bound of CI is less than the chosen threshold • In either case, if the Pearl Index is used the previous notes on computation of the CI need to be considered
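The two criteria on this slide can be compared numerically. The sketch below assumes a single Poisson model and a simple Wald interval, so the caveats about overdispersion apply. The function names and the trial counts (15 pregnancies in 1000 woman-years) are hypothetical.

```python
import math

def woman_years_for_ci_width(pearl_index, max_width=1.0, z=1.96):
    """Approximate woman-years needed so the 95% Wald CI is no wider than max_width.

    On the Pearl Index scale the Wald width is 2*z*100*sqrt(rate / woman_years).
    """
    rate = pearl_index / 100.0
    return rate * (2.0 * z * 100.0 / max_width) ** 2

def upper_bound(pregnancies, woman_years, z=1.96):
    """Upper limit of the Wald 95% CI for the Pearl Index."""
    rate = pregnancies / woman_years
    return 100.0 * (rate + z * math.sqrt(pregnancies) / woman_years)

wy_needed = woman_years_for_ci_width(1.5)      # width criterion at a Pearl Index of 1.5

# Hypothetical trial: point estimate 1.5 meets a threshold of 2,
# but the upper confidence bound does not.
ub = upper_bound(15, 1000)
meets_threshold = ub < 2.0
print(round(wy_needed), round(ub, 2), meets_threshold)
```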
Historical control trials vs. active control trials • Because it is impossible to guarantee comparability between historical controls and current study samples, it is almost always advantageous to employ randomization when ethically feasible • Given a wide use of standard contraceptives, it is not feasible to consider a placebo controlled trial • However, one can (and should) consider the use of an active control when comparable interventions are in use • Also allows for comparison of entire survival curve (logrank test or proportional hazards model?)
Superiority vs. non-inferiority in active control trials • Statistical criteria for evidence in a superiority trial • Evidence to rule out equality of effect as measured by the chosen parameter (eg. Pearl Index, 1-year survival estimate, or a hazard ratio) Example: • Contrast may be difference in 1-year failure rates as measured by the Kaplan-Meier estimator • KM_Tx(1) - KM_AC(1) • Test: H0: KM_Tx(1) - KM_AC(1) ≥ 0 vs. H1: KM_Tx(1) - KM_AC(1) < 0 • Rejection of null hypothesis corresponds to upper bound of CI for KM_Tx(1) - KM_AC(1) being less than 0
Superiority vs. non-inferiority in active control trials • Statistical criteria for evidence in a non-inferiority trial • Evidence to rule out some margin of efficacy less than the active control Example: • Contrast may be difference in 1-year failure rates as measured by the Kaplan-Meier estimator • KM_Tx(1) - KM_AC(1) • Test: H0: KM_Tx(1) - KM_AC(1) ≥ δ vs. H1: KM_Tx(1) - KM_AC(1) < δ for some δ > 0 • Rejection of null hypothesis corresponds to upper bound of CI for KM_Tx(1) - KM_AC(1) being less than δ
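Both decision rules reduce to comparing the upper confidence bound of the difference in 1-year failure probabilities with either 0 or the non-inferiority margin. The sketch below assumes independent arms and a normal approximation, with standard errors supplied by the analyst (for example from Greenwood's formula). All numbers and function names are hypothetical. A two-sided 95% interval is used, which corresponds to a one-sided test at the 2.5% level.

```python
import math

def upper_bound_difference(km_tx, se_tx, km_ac, se_ac, z=1.96):
    """Upper 95% confidence bound for KM_Tx(1) - KM_AC(1), independent arms."""
    diff = km_tx - km_ac
    se = math.sqrt(se_tx ** 2 + se_ac ** 2)
    return diff + z * se

def conclude(upper, margin):
    """Classify the result given the upper bound and a positive margin."""
    if upper < 0:
        return "superior"
    if upper < margin:
        return "non-inferior"
    return "inconclusive"

# Hypothetical 1-year failure probabilities and standard errors.
ub = upper_bound_difference(km_tx=0.020, se_tx=0.004, km_ac=0.022, se_ac=0.004)
print(round(ub, 4), conclude(ub, margin=0.015), conclude(ub, margin=0.005))
```

The same data support non-inferiority under the wider margin and nothing under the narrower one, which is why the margin must be fixed before the trial.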
Superiority vs. non-inferiority in active control trials • When is it reasonable to consider non-inferiority instead of superiority? • ICH E-10 Guidelines • Active control treatment must truly be active in the study population • If active control is truly active in the study population • Can a margin to define non-inferiority be established? • If active control is standard of care, is new treatment also superior on secondary endpoints?
Superiority vs. non-inferiority in active control trials • Issues in setting the non-inferiority “margin”? • What measure compares distributions? • Is the treatment effect random? • How much of a decrease in effect is acceptable? • How to account for variability in the estimate(s) from historical trials?
Superiority vs. non-inferiority in active control trials • Precedent for setting the non-inferiority “margin” • Is the treatment effect random? • Ideally use meta-analysis of multiple trials • Careful! Do trials have same duration of follow-up? • How much of a decrease in effect is acceptable? • 10%, 20%, 50% of active control effect? • How to account for variability in the estimate(s) from historical trials? • Use worst case from historical 95% CI? • Explicitly account for variability in historical trial
Summary • Need to define appropriate target population, comparison group, outcome measure, and maintain statistical criteria for evidence • Pearl Index is (usually) implicitly dependent on the length of follow-up, whereas Kaplan-Meier (life table) estimates make this dependence explicit • In either case, we need to obtain correct inference (CIs) and the definition of the risk set must correspond to the definition of failure • When ethically and logistically possible, active controls should be used • If historical controls are used, uncertainty should be accounted for in defining superiority criteria