H.M. James Hung (DB1/OB/OPaSS/CDER/FDA) Lu Cui (Aventis Pharmaceuticals)

Inference and Operational Conduct Issues with Sample Size Adjustment Based on Interim Observed Effect Size
H.M. James Hung (DB1/OB/OPaSS/CDER/FDA), Lu Cui (Aventis Pharmaceuticals), Sue-Jane Wang (DB2/OB/OPaSS/CDER/FDA), John Lawrence (DB1/OB/OPaSS/CDER/FDA)
Presented at the Annual Symposium of the New Jersey Chapter of ASA, Piscataway, NJ, June 4, 2002

Disclaimer
The views expressed in this presentation are not those of the U.S. Food and Drug Administration, nor of Aventis Pharmaceuticals. Dr. Lu Cui was one of the primary investigators of this research during his tenure at FDA.

Acknowledgments
The research was supported by FDA/CDER RSR Funds, #96-010A and #99/00-008. Thanks are due to Dr. Lu Cui for sharing some of his slides.

Selected References in Adaptive Design/Interim Analysis
• Bauer & Köhne (1994, Biometrics)
• Bauer & Röhmel (1995, Stat. in Med.)
• Lan & Trost (1997, ASA Proceedings)
• Fisher (1998, Stat. in Med.)
• Posch & Bauer (1999, Biometrical J.)
• Kieser, Bauer & Lehmacher (1999, Biometrical J.)
• Lehmacher & Wassmer (1999, Biometrics)
• Müller & Schäfer (2001, Biometrics)
• Berry (2002, ASA Biopharmaceutical Report)
• Brannath, Posch & Bauer (2002, JASA)
• ………. etc.

The materials of this presentation are selected from the main results of our RSR research work.
• Cui, Hung, Wang (1997 ASA; 1999 Biometrics)
• Lawrence & Hung (2002 ENAR talk)

Background
• Sample size (or amount of statistical information) is one of the design specifications vital to the success of Phase II/III (confirmatory?) clinical trials.
• It relates directly and closely to the true effect size (treatment difference normalized by the measure of variability) of the targeted response variable.

Background
Common recommendation
• Make an “educated guess” about the effect size and plan the sample size to detect this effect size (or a range of plausible effect sizes) with sufficient power [e.g., > 90% - Hung et al (1997 Biometrics)].
• This is always good because the fixed-information design
  1) provides statistics that have important good statistical properties
  2) avoids data-driven adjustments that may induce biases (statistical or operational) and make the results uninterpretable

Biases use internal data to: change selection of patients, tune-up endpoints, drop/select centers, change patient mixture, do more data dredging or torturing to adjust analysis to get the desired conclusion, eliminate potential dropouts, change design to make treatment-related problems go away, tune up any design element to make the treatments easily differentiated, adjust or sample to a foregone conclusion …………...

Background
But ….
• The effect size depends on a primary parameter (e.g., mean treatment difference) and nuisance parameters (e.g., standard deviation, background event rate).
• The effect size for detection may need to be clinically significant or meaningful (sometimes minimum clinically meaningful) → benefit/risk assessment (subjective) that might not be doable when designing the trial; consensus is hard to reach.

Background
But ….
• The effect size may depend on patient mixtures → potential heterogeneous effects in subpopulations.
• For a hard clinical outcome endpoint, an “educated guess” about effect size is difficult; e.g., a composite event endpoint requires an “educated” guess of where the potential signal lies and what the noises may be.

Background
But ….
• The effect size for detection may depend on $$ → benefit/risk/cost consideration
• ………………. etc.
Practical considerations
• The effect size for detection can be a moving target and change as background circumstances change, and the maximum amount of statistical information one can commit to may also change.

Background
Experiences:
• We often oversimplify clinical trial designs and inferences and impose too many restrictions on the designs.
• If a trial fails, it is difficult to know whether it is because the treatment does not have an important effect or because the study was underpowered for detecting it.

Background
• Lan (2001, FDA/OB Mini-Symposium): If we know the values of design elements (e.g., effect size) a priori, No Need and Not Ethical to conduct a confirmatory trial.
• Bauer et al (2002, Method Inform Med): “….. It does not make sense to apply uniformly most powerful test in an unchanged design even if we have convincing evidence that this ‘best’ test in the preplanned design may be severely underpowered …….”

Need to enhance flexibility in traditional clinical trial design/analysis strategy because practical considerations may change and may often be unpredictable at the design stage

Emerging Strategy
Mid-course modification of design specifications
- adjust sample size
- change tested hypothesis from superiority to non-inferiority or vice versa
- change from one pre-specified primary endpoint to another pre-specified endpoint
- change test method
- drop a treatment arm
- ….. etc.

Impact of Design Modification Based on Interim Observed Data
• Type I error rate may greatly exceed the acceptable level
• Statistical power may be compromised
• Traditional estimate may be severely biased

Sample Size Re-estimation
• Literature on sample size re-estimation is abundant.
• Increasing sample size (or amount of statistical information) based on nuisance parameters without breaking the blind
  - has little effect on type I error
  - may preserve the intended power level
  - needs little or mild statistical adjustment (e.g., estimate, CI)
Wittes & Brittain (1990), Gould (1992), Gould & Shih (1992), Shih (1992, 1993, 1995), Birkett & Day (1994), Jennison & Turnbull (1999, book), ……… etc.

Sample Size Re-estimation
But …..
Lan (1997, ASA talk), Liu (2000, ICSA talk): e.g., knowing the components of the variance can lead to an estimate of the treatment difference; hence sample size re-estimation based on variance might affect type I error, depending on how the variance is processed (e.g., by obtaining TSS & WSS).
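A minimal sketch of why variance components can leak the treatment difference. It assumes TSS and WSS stand for the total and within-group sums of squares (the slide does not expand the abbreviations) and two equal-sized groups; the function names and the simulated data are hypothetical. For two groups of n subjects each, TSS - WSS equals the between-group sum of squares n*d^2/2, so the magnitude of the mean difference d can be recovered, although not its sign.

```python
import random
from math import sqrt

def sum_sq(values, center):
    """Sum of squared deviations of values from center."""
    return sum((v - center) ** 2 for v in values)

def abs_difference_from_ss(tss, wss, n_per_group):
    """|mean_T - mean_C| recovered from TSS and WSS (two equal-sized groups)."""
    between = max(tss - wss, 0.0)            # between-group sum of squares = n * d^2 / 2
    return sqrt(2.0 * between / n_per_group)

# Hypothetical interim data: 40 subjects per group, sigma = 1
rng = random.Random(1)
n = 40
treat = [rng.gauss(0.4, 1.0) for _ in range(n)]
ctrl = [rng.gauss(0.0, 1.0) for _ in range(n)]

mean_t, mean_c = sum(treat) / n, sum(ctrl) / n
grand_mean = (sum(treat) + sum(ctrl)) / (2 * n)
tss = sum_sq(treat + ctrl, grand_mean)                   # needs only pooled (blinded) data
wss = sum_sq(treat, mean_t) + sum_sq(ctrl, mean_c)       # within-group sum of squares

recovered = abs_difference_from_ss(tss, wss, n)
print(abs(mean_t - mean_c), recovered)                   # the two values agree up to rounding
```

Anyone who sees both quantities therefore effectively sees the size of the interim treatment difference, which is the concern raised on the slide.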

Sample Size Re-estimation
• Increasing sample size (or amount of statistical information) based on the internal data path
  - may substantially inflate type I error, bias the estimate, invalidate the CI
  - a crude estimate of the maximum amount of inflation is obtainable, at least by simulation
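As a rough illustration of how large the inflation can get, the sketch below uses numerical integration rather than simulation. The setting is an assumption chosen for illustration, not taken from the slides: one interim look after n1 = 40 of N = 100 subjects per group, known variance, a naive final z-test that pools all data with their actual sample sizes, and an adversarial rule that picks the final size M between N and 4N to maximize the chance of rejecting given the interim statistic.

```python
from math import sqrt
from statistics import NormalDist

STD = NormalDist()

def naive_conditional_error(z1, n1, m_total, z_alpha):
    """P(naive final Z > z_alpha | interim Z = z1) under H0 with final size m_total per group."""
    m2 = m_total - n1                                   # subjects per group enrolled after the interim
    return 1.0 - STD.cdf((z_alpha * sqrt(m_total) - sqrt(n1) * z1) / sqrt(m2))

def worst_case_type1(n1=40, n_planned=100, max_factor=4, alpha=0.025, step=0.02):
    """Type I error of the naive test when M in [N, max_factor*N] is chosen adversarially."""
    z_alpha = STD.inv_cdf(1.0 - alpha)
    candidates = range(n_planned, max_factor * n_planned + 1)
    half = round(8.0 / step)                            # integrate the interim Z over [-8, 8]
    total = 0.0
    for i in range(-half, half + 1):
        z1 = i * step
        worst = max(naive_conditional_error(z1, n1, m, z_alpha) for m in candidates)
        total += worst * STD.pdf(z1) * step             # Riemann sum against the N(0,1) density
    return total

inflated = worst_case_type1()
print(f"worst-case type I error of the naive test: {inflated:.4f} (nominal 0.025)")
```

With `max_factor=1` the same function returns the nominal level, which gives a quick sanity check of the integration.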

Sample Size Re-estimation Question: At an interim time of a trial, if the observed treatment difference is far smaller than expected, we wish to increase sample size. Then, what adjustments are needed to perform valid statistical testing?

Selected References
• Bauer & Köhne (1994, Biometrics)
• Proschan & Hunsberger (1995, Biometrics)
• Lan & Trost (1997, ASA Proceedings)
• Cui, Hung & Wang (1997, ASA Proceedings; 1999, Biometrics)
• Fisher (1998, Stat. in Med.)
• Shen & Fisher (1999, Biometrics)
• Lehmacher & Wassmer (1999, Biometrics)
• Müller & Schäfer (2001, Biometrics)
• Liu & Chi (2001, Biometrics)
• Lan (2001, FDA/CDER/OB mini-symposium)
• Lan (2002, FDA/ASA workshop)
• Brannath, Posch & Bauer (2002, JASA)
• Lawrence & Hung (2002, ENAR)

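Sample Size Re-estimation
• Test $H_0: \Delta = 0$ vs. $H_1: \Delta > 0$
• Two arms from baseline: Experimental (T) with N subjects, Control (C) with N subjects
• $\sigma = 1$, $\Delta = \mu_T - \mu_C$
• To detect $\Delta = \delta$ at significance level $\alpha$ and power $1-\beta$: $N$ (per group) $= 2(z_\alpha + z_\beta)^2/\delta^2$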
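The per-group formula can be evaluated directly. The sketch below is only an illustration; the function name and the `sigma` argument are additions (the slide fixes the standard deviation at 1), and a one-sided significance level is assumed, as in the hypotheses above.

```python
from math import ceil
from statistics import NormalDist

def per_group_sample_size(delta, alpha=0.025, power=0.90, sigma=1.0):
    """N per group = 2 * sigma^2 * (z_alpha + z_beta)^2 / delta^2, one-sided alpha."""
    z_alpha = NormalDist().inv_cdf(1.0 - alpha)   # upper alpha quantile of N(0,1)
    z_beta = NormalDist().inv_cdf(power)          # upper beta quantile, beta = 1 - power
    return 2.0 * (sigma * (z_alpha + z_beta) / delta) ** 2

n_exact = per_group_sample_size(0.46)
print(round(n_exact, 1), ceil(n_exact))           # about 99.3, rounded up to 100
```

With an effect size of 0.46, a one-sided level of 0.025 and 90% power, this gives the 100 subjects per group used in the examples of this talk.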

Sample Size Re-estimation (non-sequential trial)
• Plan to enroll N = 100 subjects/group to detect $\delta = 0.46$ at $\alpha = 0.025$ and power 90%
• After 40 subjects per group contribute data, the estimate of $\Delta$ leads to $\delta^* = 0.37$
• Re-estimate total sample size: M = 150/group

Sample Size Re-estimation (non-sequential trial)
• At the end of the trial (M = 150), compute the CHW adaptive test [Cui, Hung & Wang, 1999]:
  $U = (40/100)^{1/2} Z_{0.40} + (60/100)^{1/2} W_{0.60}$
• $W_{0.60}$: normalized test for the additional 110 subjects per group after the interim time t = 0.4
Lan (2001 FDA/CDER/OB Mini-symposium, 2002 FDA/ASA workshop)

Sample Size Re-estimation (non-sequential trial)
• U is standard normal under $H_0$
• If U > 1.96, then conclude $\Delta > 0$
• Significance level = 0.025
• U is more powerful than the original Z w/o increasing N
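The claim that U keeps the 0.025 level can be checked by a small Monte Carlo experiment. The sketch below simulates at the level of z-statistics with known variance and uses a hypothetical adaptation rule that is not from the slides: the final size goes from 100 to 400 per group whenever the interim z-statistic (after 40 per group) is below 1. It compares the CHW statistic, whose weights come from the original plan, with a naive statistic whose weights come from the actual sample sizes.

```python
import random
from math import sqrt

def simulate_type1(n_sim=200_000, n1=40, n_planned=100, n_increased=400,
                   z_trigger=1.0, crit=1.96, seed=2002):
    """Empirical type I error of the CHW test and of a naive pooled z-test under H0."""
    rng = random.Random(seed)
    w1 = sqrt(n1 / n_planned)                  # CHW weight for the interim data (fixed in advance)
    w2 = sqrt(1.0 - n1 / n_planned)            # CHW weight for the post-interim data
    chw_rejections = naive_rejections = 0
    for _ in range(n_sim):
        z1 = rng.gauss(0.0, 1.0)               # interim statistic under H0
        m = n_increased if z1 < z_trigger else n_planned   # data-driven final size (hypothetical rule)
        w = rng.gauss(0.0, 1.0)                # normalized statistic of the post-interim data under H0
        u = w1 * z1 + w2 * w                   # CHW adaptive statistic
        z_naive = sqrt(n1 / m) * z1 + sqrt(1.0 - n1 / m) * w   # weights follow the actual sizes
        chw_rejections += u > crit
        naive_rejections += z_naive > crit
    return chw_rejections / n_sim, naive_rejections / n_sim

chw_rate, naive_rate = simulate_type1()
print(f"CHW: {chw_rate:.4f}   naive: {naive_rate:.4f}   nominal: 0.025")
```

Under this rule the CHW rate should stay close to 0.025 while the naive rate should sit visibly above it. Other adaptation rules would change the naive figure but not the CHW one, since the CHW weights do not depend on M.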

Sample Size Re-estimation (non-sequential trial)
Estimation & CI for $\Delta$ [Lawrence & Hung (2002, ENAR talk)]
• Construct a consistent estimator and a valid CI for $\Delta$
• The CHW test is the Z-ratio of the consistent estimator

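Sample Size Re-estimation (group sequential trial)
• Test $H_0: \Delta = 0$ vs. $H_1: \Delta > 0$, where $\Delta = \mu_T - \mu_C$
• [Diagram: from baseline, Experimental (T) with N subjects and Control (C) with N subjects; interim analysis IA-1 at N/5 (20%), IA-2 at 2N/5 (40%), final analysis at N (100%)]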

Sample Size Re-estimation (group sequential trial)
• N is planned to detect $\Delta = \delta$ at level $\alpha$ and with power $1-\beta$
• At interim time s, estimate $\hat{\Delta}_s$ → $0 < \delta^* < \delta$ (say, based on conditional power) → increase sample size from N to M, approximately $M = N(\delta/\delta^*)^2$
• Total information changes from 1 to $M/N$
• Compute $b = (M/N - s)/(1 - s)$

Sample Size Re-estimation (group sequential trial)
• For an interim analysis at time t, when $M_t$ subjects contribute information, compute $N_t = (M_t - N_s)/b + N_s$ and $t = N_t/N$, where $N_s = sN$ is the number of subjects per group at the interim time s.
• Adapt the traditional repeated significance test:
  Traditional: $Z_t = Z_s (N_s/M_t)^{1/2} + W_{t-s} (1 - N_s/M_t)^{1/2}$
  New: $U_t = Z_s (N_s/N_t)^{1/2} + W_{t-s} (1 - N_s/N_t)^{1/2}$
Cui, Hung, Wang (1999, Biometrics)
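The bookkeeping above can be sketched in a few lines. The function names are hypothetical; the inputs are the planned size N, the re-estimated size M, the adaptation time s, and the actual per-group size M_t at a later look. The usage lines plug in the numbers of the group sequential example given later in this talk (N = 100, M = 150, s = 0.5), while the interim statistics 2.0 and 1.5 are made-up values.

```python
from math import sqrt

def adjusted_look(m_t, n, m, s):
    """Map the actual per-group size m_t at a post-adaptation look to the original scale.

    Returns (N_t, t, N_s/N_t) with N_s = s*N, b = (M/N - s)/(1 - s), N_t = (M_t - N_s)/b + N_s.
    """
    n_s = s * n                       # subjects per group at the adaptation time s
    b = (m / n - s) / (1.0 - s)       # stretch factor of the post-adaptation information
    n_t = (m_t - n_s) / b + n_s       # equivalent size on the original N-scale
    return n_t, n_t / n, n_s / n_t

def chw_statistic(z_s, w, weight_ratio):
    """U_t = Z_s*sqrt(N_s/N_t) + W_{t-s}*sqrt(1 - N_s/N_t)."""
    return z_s * sqrt(weight_ratio) + w * sqrt(1.0 - weight_ratio)

n_t, t, ratio = adjusted_look(m_t=100, n=100, m=150, s=0.5)
print(n_t, t, ratio)                      # N_t = 75, t = 0.75, N_s/N_t = 2/3
print(adjusted_look(150, 100, 150, 0.5))  # final look: N_t = 100, t = 1, N_s/N_t = 1/2
u = chw_statistic(z_s=2.0, w=1.5, weight_ratio=ratio)   # hypothetical Z_s and W values
```

When M = N the factor b equals 1 and N_t = M_t, so U_t coincides with the traditional Z_t.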

Sample Size Re-estimation (group sequential trial)
• $\{U_t\}$ w/ N possibly changed to M and $\{Z_t\}$ w/o change of N have identical distributions.
• Find the critical value $C_t$ at time t based on the initially selected alpha-spending function
• Reject $H_0$ if $U_t > C_t$; otherwise, the trial continues
Cui, Hung, Wang (1999, Biometrics)

Empirical type I error rate (adaptive test; type I error = 0.0255 w/o N increase; Gaussian) (increase N by <= 4x; O’Brien-Fleming boundary)

Empirical power (adaptive test; $1-\beta = 0.587$ w/o N increase; Gaussian) (increase N by <= 4x; O’Brien-Fleming boundary)

Empirical type I error rate (adaptive test; type I error = 0.0249 w/o N increase; Binomial, $p_C = 0.20$) (increase N by <= 4x; O’Brien-Fleming boundary)

Empirical power (adaptive test; $1-\beta = 0.60$ w/o N increase; Binomial, $p_C = 0.20$) (increase N by <= 4x; O’Brien-Fleming boundary)

Sample Size Re-estimation (Example: group sequential trial)
• Plan to have 100 subjects/group to detect $\delta = 0.46$ at $\alpha = 0.025$ and power 90%
• After 50 subjects per group contribute data, the estimate suggests detecting $\delta^* = 0.46/2$
• Re-estimate sample size: M = 150/group
• $M/N = 1.5$, $b = (1.5 - 0.5)/(1 - 0.5) = 2$

Sample Size Re-estimation (Example: group sequential trial)
• Suppose that an interim analysis will be done when an additional 50 subjects/group contribute data ($M_t = 100$)
• $N_t = (100 - 50)/2 + 50 = 75$ and $t = 0.75$
• Suppose that the O-F alpha spending function was originally chosen for interim analysis
• Then the critical value for the adaptive test at t = 0.75 is $C_{0.75} = 2.36$

Sample Size Re-estimation (Example: group sequential trial)
• The adaptive test at $M_{0.75} = 100$ is $U_{0.75} = Z_{0.50}(2/3)^{1/2} + W_{0.25}(1/3)^{1/2}$
• $W_{0.25}$ is the normalized test performed on the additional 50 subjects per group
• If $U_{0.75} > 2.36$, then stop the trial and conclude that the experimental treatment is superior to control

Sample Size Re-estimation (Example: group sequential trial)
• If the trial continues to the end, then the final adaptive test (i.e., at $M_1 = 150$) is $U_1 = Z_{0.50}(1/2)^{1/2} + W_{0.50}(1/2)^{1/2}$
• $W_{0.50}$ is the normalized test performed on the additional 100 subjects per group
• If $U_1 > 2.01$, then conclude that the experimental treatment is superior to control
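The critical values 2.36 and 2.01 can be approximately reproduced numerically. The sketch below rests on two assumptions not spelled out in the slides: that "O-F alpha spending function" means the Lan-DeMets O'Brien-Fleming-type function $2 - 2\Phi(z_{\alpha/2}/\sqrt{t})$ with one-sided $\alpha = 0.025$, and that looks occur at information times 0.5, 0.75 and 1.0 on the original scale. It tracks the sub-density of the score $S_t = Z_t\sqrt{t}$ on a grid and finds each boundary by bisection; the function names and grid settings are illustrative choices.

```python
from math import sqrt
from statistics import NormalDist

STD = NormalDist()

def of_spending(t, alpha=0.025):
    """Cumulative one-sided alpha spent by information time t (O'Brien-Fleming type)."""
    return 2.0 * (1.0 - STD.cdf(STD.inv_cdf(1.0 - alpha / 2.0) / sqrt(t)))

def trapezoid(values, h):
    """Trapezoidal rule on an equally spaced grid with spacing h."""
    return h * (sum(values) - 0.5 * (values[0] + values[-1]))

def crossing_prob(c, t, sd, grid, dens, h):
    """P(no earlier rejection and Z_t > c) under H0; sd is the sd of the score increment."""
    b = c * sqrt(t)                                   # boundary on the score scale
    if grid is None:                                  # first look: score ~ N(0, t)
        return 1.0 - STD.cdf(b / sd)
    return trapezoid([d * (1.0 - STD.cdf((b - x) / sd)) for x, d in zip(grid, dens)], h)

def boundaries(times, alpha=0.025, n_grid=400, lower=-6.0):
    """Critical values C_t on the Z scale for looks at the given information times."""
    crit, grid, dens, h = [], None, None, None
    spent, t_prev = 0.0, 0.0
    for t in times:
        sd = sqrt(t - t_prev)
        target = of_spending(t, alpha) - spent        # alpha to be spent at this look
        lo, hi = 0.0, 10.0
        for _ in range(60):                           # bisection: crossing_prob decreases in c
            mid = 0.5 * (lo + hi)
            if crossing_prob(mid, t, sd, grid, dens, h) > target:
                lo = mid
            else:
                hi = mid
        c = 0.5 * (lo + hi)
        crit.append(c)
        # sub-density of the score on the continuation region (lower, c*sqrt(t))
        new_h = (c * sqrt(t) - lower) / n_grid
        new_grid = [lower + i * new_h for i in range(n_grid + 1)]
        if grid is None:
            new_dens = [STD.pdf(y / sd) / sd for y in new_grid]
        else:
            new_dens = [trapezoid([d * STD.pdf((y - x) / sd) / sd
                                   for x, d in zip(grid, dens)], h) for y in new_grid]
        grid, dens, h = new_grid, new_dens, new_h
        spent, t_prev = of_spending(t, alpha), t
    return crit

crit = boundaries([0.5, 0.75, 1.0])
print([round(c, 2) for c in crit])
```

Under these assumptions the second and third values should come out close to the 2.36 and 2.01 quoted in the example; a different spending function or look schedule would give different numbers.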

Sample Size Re-estimation
• The CHW adaptive test attains the targeted type I error rate, gives a large power increase (relative to no re-estimation), and is very easy to implement
• A consistent estimator and a confidence interval compatible with the CHW adaptive test are readily available
• All the above discussions are based on asymptotic (i.e., ‘sufficiently large’ sample size) theory
Cui, Hung, Wang (1999, Biometrics); Lawrence & Hung (2002, ENAR talk)

Sample Size Re-estimation
• The CHW adaptive test reduces to the conventional test if the sample size is not changed.
• So do the consistent estimator and the confidence interval compatible with the CHW adaptive test.
Cui, Hung, Wang (1999, Biometrics); Lawrence & Hung (2002, ENAR talk)

Sample Size Re-estimation
The CHW adaptive test can also be viewed as a combination of p-values from the incremental group data [Lehmacher & Wassmer (1999), Brannath, Posch & Bauer (2002)]

Sample Size Re-estimation
Sample size re-estimation criterion
• After obtaining the observed $\hat{\Delta}_s$ at time s, one could recalculate the sample size M such that the conditional power
  $CP(\delta^*) = \Pr\{\text{CHW rejects } H_0 \text{ at the end} \mid \hat{\Delta}_s, \Delta = \delta^*\} = 1-\beta$
  for the new intended $\delta^*$.
• Then the power of CHW for detecting $\delta^*$ is at least $1-\beta$
Lawrence (2002, personal communication)
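A minimal sketch of this criterion for the simplest case, under assumptions that go beyond the slide: a single interim look with no further interim analyses, known standard deviation 1, and a final CHW test $U = \sqrt{t}\,Z_s + \sqrt{1-t}\,W$ rejecting when $U > z_\alpha$. The function names, the rule of never going below the planned remaining sample size, and the interim value z = 1.0 in the usage lines are all hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD = NormalDist()

def conditional_power(z_s, n_s, n_planned, m2, delta_star, alpha=0.025):
    """Pr{U > z_alpha | interim Z = z_s, Delta = delta_star} with m2 further subjects per group."""
    t = n_s / n_planned
    z_alpha = STD.inv_cdf(1.0 - alpha)
    needed_w = (z_alpha - sqrt(t) * z_s) / sqrt(1.0 - t)   # W must exceed this for U > z_alpha
    drift = delta_star * sqrt(m2 / 2.0)                     # mean of W when Delta = delta_star, sigma = 1
    return 1.0 - STD.cdf(needed_w - drift)

def additional_subjects(z_s, n_s, n_planned, delta_star, alpha=0.025, power=0.90):
    """Smallest m2 per group giving conditional power >= power, never below the planned remainder."""
    t = n_s / n_planned
    gap = (STD.inv_cdf(1.0 - alpha) - sqrt(t) * z_s) / sqrt(1.0 - t) + STD.inv_cdf(power)
    planned_remainder = n_planned - n_s                     # assumption: sample size is never reduced
    if gap <= 0:
        return planned_remainder
    return max(planned_remainder, ceil(2.0 * (gap / delta_star) ** 2))

# Hypothetical interim result: z = 1.0 after 40 of 100 planned subjects per group, target 0.37
m2 = additional_subjects(z_s=1.0, n_s=40, n_planned=100, delta_star=0.37)
cp = conditional_power(1.0, 40, 100, m2, 0.37)
print(m2, 40 + m2, round(cp, 3))    # additional subjects, new total M per group, achieved CP
```

Because M is chosen so that the conditional power reaches the target for every interim outcome, the unconditional power against the new target is at least the same level, which is the point made on the slide.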

Sample Size Re-estimation
Sample size re-estimation criterion
• Better to look for a more stable signal via examination of the sample path over time
  - reduces the chance of being misled by possible aberration of early data

Operational Conduct Issues
• Sample size re-estimation based on unblinded data opens room for operational biases.
• During an adaptive change, unblind only the data that must be unblinded, in order to avoid operational bias. A Standard Operating Procedure (SOP) must be in place in the protocol, and trial conduct must comply with the SOP.

Operational Conduct Issues
• Sample size re-estimation based on unblinded data opens room for multiple analyses that may lead to more protocol amendments and other changes of design elements.
• These types of changes or amendments (potentially driven by current data) may introduce problems in the interpretation of the results. Attention should be given to this potential hazard of the design.

Operational Conduct Issues But …. Most of the operational conduct issues related to sample size re-estimation based on unblinded data are also encountered in traditional designs with interim analyses

Operational Conduct Issues
• Recommend that sample size modification be done, if needed, by an independent third party that has no conflict of interest
• What took place after a sample size (or any design) modification needs to be documented fully

Other Issues
• Estimation following a data-driven sample size change is an important issue, particularly because the effect estimate may be used to plan future superiority trials or active-control non-inferiority trials.
• Careful consideration of the benefit and risk of such sample size modification is needed in practical applications

Other Issues Bauer et al (2002, Method Inform Med) “….. Clearly in such designs more logistics must be put in to properly handle all problems of interim analyses including their consequences for the design. They need rigid rather than flexible planning modalities. ……”

Summary
• The conventional fixed-information design has made tremendous contributions to clinical science. Its statistics have good statistical properties.
• This design needs to permit sample size adjustment when it falls short (perhaps by surprise) in many applications, e.g., in studying endpoints for which
  - prior data are poor, or it is hard to provide a reasonably good educated guess about effect size
  - a minimum clinically meaningful effect is not available
