Biostatistics in Practice

Biostatistics in Practice Session 4: Study Size and Power Peter D. Christenson Biostatistician http://gcrc.LABioMed.org/Biostat

Readings for Session 4 from StatisticalPractice.com • Sample Size Calculations • Some underlying theory and some practical advice. • Controlled trials

Outline for this Session • Example from a current local protocol. • Review statistical hypothesis testing. • Formulate example as hypothesis test. • Software for study size and power. • Other issues.

Local Protocol Example Project #10038: Dan Kelly & Pejman Cohan Hypopituitarism after Moderate and Severe Head Injury Brief study outline: Subjects arrive at ER with TBI (traumatic brain injury). Those with low cortisol, indicating possible adrenal insufficiency and pituitary damage, may or may not recover better if given hydrocortisone (HC) injections. Subjects who consent are randomized to receive HC or placebo for 4 days. Changes in recovery status from pre to post injection periods are compared between HC and placebo groups.

Local Protocol Example, Cont’d Project #10038: Dan Kelly & Pejman Cohan Hypopituitarism after Moderate and Severe Head Injury • “The primary outcomes for the hydrocortisone trial are changes in mean MAP and vasopressor use from the 12 hours prior to initiation of randomized treatment to the 96 hours after initiation.” • Mean changes in placebo subjects will be compared with hydrocortisone subjects using a two sample t-test. Before examining the study size, let’s first discuss how the results will be analyzed.

Recall Statistical (t) test From Last Session Suppose results from the study are plotted as: [Figure: change in MAP for each subject, plotted by group (Placebo, HC), with Δ marked between the two groups.] Each point is the change in MAP for an individual subject. [Of course, the real study will have many more subjects.] Is Δ large enough to claim that HC is more effective? Use t-test.

Local Protocol Example: Analysis with t-test We are testing: H0: μHC − μPlacebo = 0 vs. HA: μHC − μPlacebo ≠ 0, where μHC is the expected post-pre change in “all potential TBI patients” if HC therapy is applied as in this study, and μPlacebo is the corresponding expected change under placebo. Our decision rule is: Choose HA if the estimate of μHC − μPlacebo from our limited sample, i.e., the observed mean change under HC minus the observed mean change under placebo, call it Δ, is too far from 0 (the value specified by H0). “Too far” is > tc*SE or < −tc*SE, where tc is usually about 2. SE is SE(Δ), calculated from the data, and is smaller for larger N and smaller SD. In other words, choose HA if |Δ| > tc*SE, or |t| = |Δ/SE| > tc. By following this rule, there is only a 5% probability of choosing HA if in fact H0 is true.
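The decision rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data values are made up (they are not study data), the function name `two_sample_t` is hypothetical, the pooled-variance form assumes equal SD in both groups, and the cutoff is fixed at the rough value of 2 mentioned in the text rather than taken from a t table.

```python
import math
import statistics

def two_sample_t(hc, placebo):
    """Return (delta, se, t) for a pooled-variance two-sample t-test."""
    n1, n2 = len(hc), len(placebo)
    delta = statistics.mean(hc) - statistics.mean(placebo)
    # Pooled variance assumes the same SD in both groups.
    pooled_var = ((n1 - 1) * statistics.variance(hc)
                  + (n2 - 1) * statistics.variance(placebo)) / (n1 + n2 - 2)
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return delta, se, delta / se

# Made-up changes in MAP (mm Hg), for illustration only; not study data.
hc_changes = [6.0, 9.5, 2.0, 11.0, 7.5, 4.0]
placebo_changes = [1.0, 5.0, -2.0, 3.5, 6.0, 0.5]

delta, se, t = two_sample_t(hc_changes, placebo_changes)
T_CRITICAL = 2.0  # "about 2", as in the text; the exact value depends on N
decision = "HA" if abs(t) > T_CRITICAL else "H0"
print(f"delta={delta:.2f}, SE={se:.2f}, t={t:.2f}, choose {decision}")
```

Larger groups or less variable subjects shrink SE, so the same Δ yields a larger |t|.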

Potentially Underpowered Studies From the previous slide: By following this rule, there is only a 5% probability of choosing HA if in fact H0 is true. So, the probability is small (5%) that our study will (incorrectly) recommend that TBI subjects receive HC if it is worthless. But, is it able to correctly recommend that TBI subjects receive HC if it is effective? The probability of this is called the power of the study. Actually, there is not a single value for power. The study may have, say, 59% power if the true mean HC effect is 3 mmHg in MAP, but will have more power if the true effect is 4, since the subjects are more likely to reflect this greater effectiveness. Let’s go back to last session’s graph to see this.

Graphical Representation of Power [Figure: distributions of the estimated effect (HC change – Placebo change) under H0 (true effect = 0) and under HA (true effect = 3), with the effect observed in the study (1.13) marked. The area shaded \\\ (5%) is the probability of concluding HA if H0 is true; the area shaded /// (41%) is the probability of concluding H0 if HA is true.] Power = 100 − 41 = 59%. Note greater power if larger N, and/or if true effect > 3.
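How power grows with N and with the true effect can be sketched numerically. The code below is a rough illustration, not the calculation behind the figure: it uses a normal approximation in place of the t distribution, the function name `power_two_sample` is hypothetical, and the SD of 8.16 is borrowed from the pilot data quoted later in this session.

```python
import math
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution

def power_two_sample(true_effect, sd, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-sample test (normal approximation)."""
    se = sd * math.sqrt(2.0 / n_per_group)
    z_crit = Z.inv_cdf(1 - alpha / 2)
    shift = true_effect / se
    # Probability that the estimate lands in either rejection region.
    return Z.cdf(shift - z_crit) + Z.cdf(-shift - z_crit)

SD = 8.16  # assumption: SD of MAP change from the pilot data quoted later
for n in (20, 40, 80):
    for effect in (3.0, 4.0, 5.2):
        pw = power_two_sample(effect, SD, n)
        print(f"N/group={n:3d}  true effect={effect:3.1f}  power={pw:.2f}")
```

With a true effect of 0 the function returns the 5% significance level, which is the \\\ area in the figure.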

P-Value Recall that our decision rule is: Choose HA if |Δ| > tc*SE, or |t| = |Δ/SE| > tc. By following this rule, there is only a 5% probability of choosing HA if in fact H0 is true. In practice, though, we do not just report our decision as HA or H0. The p-value is the probability, if H0 is correct, that we would observe a Δ at least as far from 0 as the one that actually occurred in the study. Here, the right-tail part is Prob(Δ > 1.13), the area under H0 to the right of the green line in the previous figure; for the two-sided rule above, the p-value also includes the matching left-tail area, Prob(Δ < −1.13). Small p-values support HA. Choosing HA is equivalent to p < 0.05, so the study result is reported as the p-value. HC is declared to have an effect if p < 0.05.
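A two-sided p-value for the observed effect of 1.13 can be sketched as follows. The slides do not state the SE, so the code backs out an SE from the statement "59% power at a true effect of 3" under a normal approximation; that implied SE, and the function name `two_sided_p`, are assumptions made only for this illustration.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution

def two_sided_p(delta, se):
    """Two-sided p-value for an observed difference (normal approximation)."""
    return 2 * (1 - Z.cdf(abs(delta) / se))

# Assumption: SE implied by "59% power at a true effect of 3" under a
# normal approximation; the slides do not give the SE directly.
se_implied = 3.0 / (Z.inv_cdf(0.975) + Z.inv_cdf(0.59))

p = two_sided_p(1.13, se_implied)
print(f"implied SE={se_implied:.2f}, p={p:.2f}")
```

Under these assumptions p is far above 0.05, so an observed effect of 1.13 would not lead to choosing HA.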

Summary: Factors that Determine Study Size • Five factors including power are inter-related. Fixing four of these specifies the fifth: • Study size, N. • Power (often 80% is desirable). • Significance level (the p-value threshold, e.g., 0.05). • Magnitude of treatment effect to be detected. • Heterogeneity among subjects (standard deviation, SD). The next slide shows how these factors (except SD) are typically presented in a study protocol.
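As an illustration of "fixing four specifies the fifth", the sketch below solves for N per group given power, significance level, effect, and SD. It uses the common normal-approximation formula, which tends to give a slightly smaller N than t-based software; the function name `n_per_group` and the input values are hypothetical.

```python
import math
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution

def n_per_group(effect, sd, alpha=0.05, power=0.80):
    """Subjects per group for a two-sided two-sample comparison of means."""
    z_alpha = Z.inv_cdf(1 - alpha / 2)
    z_beta = Z.inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) * sd / effect) ** 2)

# Hypothetical inputs: SD of 8 and effects of 4, 6 and 8 in the same units.
for effect in (4.0, 6.0, 8.0):
    print(f"effect={effect}: about {n_per_group(effect, sd=8.0)} per group")
```

Halving the effect to be detected roughly quadruples the required N, since N scales with (SD/effect) squared.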

Quote from Local Protocol Example Thus, with a total of the planned 80 subjects, we are 80% sure to detect (p<0.05) group differences if treatments actually differ by at least 5.2 mm Hg in MAP change, or by a mean 0.34 change in number of vasopressors.

Comments on the Table on Previous Slide • Typically power=80% and almost always p<0.05 are fixed. • SD was not mentioned. If available, several estimates of SD may be used (different populations, intervention characteristics such as dosage, time, etc). Here, a pilot study exactly like the trial was performed by the investigators. • Detectable difference refers to the unknown true difference, μHC- μPlacebo , not the difference that will eventually be seen in the study. • N ↑ as detectable difference ↓. • So, the major consideration is usually a tradeoff between N and the detectable difference.

Software for Study Size Calculations • Calculations depend on the specific statistical method. We are using the t-test as an example, but the same concepts apply for, say, comparing % subjects who respond to treatment using another method such as a chi-square test. • In software, you specify the method, and 4 of the 5 factors. The value of the fifth factor is calculated. • Two free sites for calculations: • 1. http://calculators.stat.ucla.edu/powercalc • 2. http://www.stat.uiowa.edu/~rlenth/Power

A Software Site for Study Size Calculations

Local Protocol Example: Calculations Pilot data: SD=8.16 for ΔMAP in 36 subjects. For p-value<0.05, power=80%, N=40/group, the detectable Δ of 5.2 in the previous table is found by entering these values in the software. [Screenshot of the software output is not reproduced here.]
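The same figure can be approximated by hand. The sketch below is not the software's algorithm: it uses the common approximation that the detectable difference is (t for the significance level + t for the power) × SE, with t quantiles taken from a series expansion so that only the Python standard library is needed. The function names are hypothetical.

```python
import math
from statistics import NormalDist

def t_quantile(p, df):
    """Approximate Student-t quantile via a Cornish-Fisher expansion in 1/df."""
    z = NormalDist().inv_cdf(p)
    g1 = (z**3 + z) / 4
    g2 = (5 * z**5 + 16 * z**3 + 3 * z) / 96
    return z + g1 / df + g2 / df**2

def detectable_difference(sd, n_per_group, alpha=0.05, power=0.80):
    """Approximate true difference detectable with the given power (two-sided)."""
    df = 2 * n_per_group - 2
    se = sd * math.sqrt(2.0 / n_per_group)
    return (t_quantile(1 - alpha / 2, df) + t_quantile(power, df)) * se

delta = detectable_difference(sd=8.16, n_per_group=40)
print(f"Detectable difference: {delta:.1f} mm Hg")
```

With SD=8.16 and 40 subjects per group this comes to roughly 5.2 mm Hg, in line with the protocol; dedicated software may differ slightly in later decimals.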

Summary: Study Size and Power • Power analysis assures that effects of a specified magnitude can be detected. • Five factors including power are inter-related. Fixing four of these specifies the fifth. • For comparing means, need pilot data or data from other studies on variability of subjects for the outcome measure [e.g., standard deviation from a previous study]. Comparing rates (%s) does not require pilot variability data, so a rate-based outcome can be used if no pilot data are available for means. • Helps support the believability of (superiority) studies if the conclusions turn out to be negative. • To prove no effect (e.g., that a less invasive therapy is as effective as standard care), use an equivalence study design.
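Why rates need no pilot SD can be seen in a sample size sketch: the variability of a proportion p is p(1 − p), so it follows from the assumed rates themselves. The code uses a simple unpooled normal-approximation formula; software may use a pooled variance or a continuity correction and give a somewhat different N. The function name and the 20% and 40% rates are hypothetical.

```python
import math
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution

def n_per_group_proportions(p1, p2, alpha=0.05, power=0.80):
    """Subjects per group to compare two proportions (normal approximation)."""
    z_alpha = Z.inv_cdf(1 - alpha / 2)
    z_beta = Z.inv_cdf(power)
    # The variance follows from the proportions themselves: p * (1 - p).
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance_sum / (p1 - p2) ** 2)

# Hypothetical response rates of 20% and 40%, chosen only for illustration.
print(n_per_group_proportions(0.20, 0.40))
```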

Self-Test Exercise #1 Go to www.stat.uiowa.edu/~rlenth/Power. Select Type of Analysis = “Two sample t-test”. Reproduce the detectable difference between placebo and HC treatment in change in MAP of 6.0 mm Hg when using 30+30=60 subjects in the local example table that appears 6 slides back. Note that SD for MAP change was 8.16 in the pilot study. Notes: “Sigma” refers to SD. Check “equal sigmas”, which assumes that SD is the same for both HC and placebo groups. Do not check “equivalence”.

Self-Test Exercise #2 Go to www.stat.uiowa.edu/~rlenth/Power. Select Type of Analysis = “Test comparing two proportions”. Suppose that the outcome for the local example is not mean magnitude of change in MAP or vasopressor use, but instead is the proportion of subjects with an MAP reduction of at least 5 mm Hg. If the true proportions among “all” potential TBI subjects who do or don’t receive HC therapy are 30% and 60%, how many subjects are needed to be 80% sure to declare an HC effect with p<0.05? Notes: We have not studied proportions. The statistical test is a “binomial” test, rather than a t-test (for means only), but the concept for study size is the same. SD is not used here.

Self-Test Exercise #3 A study was powered to detect a 10 point mean reduction in LDL cholesterol. A colleague claims that this means that if the subjects decrease LDL cholesterol by a mean 10 points, then p<0.05 and this will be a significant reduction. Explain.

Self-Test Exercise #4 True story: A protocol was designed with 80% power to detect (p<0.05) a 10% disease incidence in subjects receiving placebo vs. a 3.5% incidence in subjects receiving a new drug. This corresponds to a 65% reduction in disease incidence. A comment on the study was: “… there may not be a large enough sample to see the effect size required for a successful outcome. Power calculations indicate that the study is looking for a 65% reduction in incidence of … [disease]. Wouldn’t it also be of interest if there were only a 50% or 40% reduction, thus requiring smaller numbers and making the trial more feasible?” What is your comment on the comment?
