Introduction to Biostatistics for Clinical Researchers

University of Kansas Department of Biostatistics & University of Kansas Medical Center Department of Internal Medicine

Schedule • 5th lecture, TBD

Materials • PowerPoint files can be downloaded from the Department of Biostatistics website at http://biostatistics.kumc.edu • A link to the recorded lectures will be posted in the same location

Topics • Comparing Two (or more) Population Means (continued) • Simple Linear Regression • Comparing Two (or more) Independent Proportions

Comparing Two (or More) Population Means (Continued)

Sampling Distribution Detail • What exactly is the sampling distribution of the difference in sample means? • A Student’s t distribution is used with n1 + n2 - 2 degrees of freedom (total sample size minus two)

Two-Sample t-test • In a randomized design, 23 patients with hyperlipidemia were randomized to either treatment A or treatment B for 12 weeks • 12 to A • 11 to B • LDL cholesterol levels (mmol/L) measured on each subject at baseline and 12 weeks • The 12-week change in LDL cholesterol was computed for each subject

Two-Sample t-test • Is there a difference in LDL change between the two treatment groups? • Methods of inference • CI for the difference in mean LDL cholesterol change between the two groups • Statistical hypothesis test

95% CI for Difference in Means

95% CI for Difference in Means • How many standard errors to add and subtract (i.e., what is the correct multiplier)? • The number we need comes from a t distribution with 12 + 11 - 2 = 21 degrees of freedom • From a t table or Excel, this value is 2.08 • The 95% CI for the true mean difference in change in LDL cholesterol, drug A versus drug B, is: difference in sample means ± 2.08 × SE of the difference
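The interval arithmetic can be sketched in Python. The group means (-1.41 and -0.32 mmol/L, written as negative changes) and the multiplier 2.08 come from this lecture; the standard deviations of 0.6 are placeholder assumptions, not study values, chosen only so that the standard error comes out near the 0.25 implied by the interval reported later. The function name `pooled_t_interval` is hypothetical.

```python
import math

def pooled_t_interval(mean1, sd1, n1, mean2, sd2, n2, t_multiplier):
    """CI for mean1 - mean2 using the pooled (equal-variance) standard error."""
    df = n1 + n2 - 2
    # Pooled variance: weighted average of the two sample variances
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    diff = mean1 - mean2
    return diff, se, (diff - t_multiplier * se, diff + t_multiplier * se)

# Means are from the slides; sd = 0.6 is an ASSUMED placeholder, not study data.
diff, se, (lower, upper) = pooled_t_interval(
    mean1=-1.41, sd1=0.6, n1=12,   # treatment A
    mean2=-0.32, sd2=0.6, n2=11,   # treatment B
    t_multiplier=2.08,             # t with 21 df, from the slide
)
print(f"difference = {diff:.2f}, SE = {se:.2f}, 95% CI: {lower:.2f} to {upper:.2f}")
```

Because the changes are entered as negative numbers, the interval comes out negative (about -1.61 to -0.57); stated as a difference in LDL decrease, that is the same interval with the signs flipped.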

Hypothesis Test to Compare Two Independent Groups • Two-sample (unpaired) t-test: getting a p-value • Is the change in LDL cholesterol the same in the two treatment groups? • H0: μ1 = μ2, equivalently H0: μ1 - μ2 = 0 • HA: μ1 ≠ μ2, equivalently HA: μ1 - μ2 ≠ 0

Hypothesis Test to Compare Two Independent Groups • Recall the general “recipe” for hypothesis testing: • Assume H0 is true • Measure the distance of the sample result from the hypothesized result (here, it’s 0) in standard errors • Compare the test statistic (distance) to the appropriate distribution to get the p-value
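Written out, the "distance" in the recipe is the usual two-sample t statistic. The slides do not show the formula, so the pooled standard error below is the standard equal-variance form and should be read as an assumption about what was intended; the symbols s1, s2 (sample standard deviations) and s_p (pooled standard deviation) are introduced here for illustration.

```latex
\[
t = \frac{(\bar{x}_1 - \bar{x}_2) - 0}{\widehat{SE}(\bar{x}_1 - \bar{x}_2)},
\qquad
\widehat{SE}(\bar{x}_1 - \bar{x}_2) = s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}},
\qquad
s_p^2 = \frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}
\]
```

Under the null hypothesis, t is compared to a Student's t distribution with n1 + n2 - 2 degrees of freedom.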

Hyperlipidemia Example • Recall the hyperlipidemia study (treatment A versus treatment B) • The study result was 4.4 standard errors below the null mean of 0

How are p-values Calculated? • Is a result 4.4 standard errors below 0 unusual? • It depends on what kind of distribution we are dealing with • The p-value is the probability of getting a result as extreme or more extreme than what was observed (-4.4) by chance, if the null hypothesis were true • The p-value comes from the sampling distribution of the difference in two sample means • What is the sampling distribution of the difference in sample means? • A t distribution with 12 + 11 - 2 = 21 degrees of freedom

Hyperlipidemia Example • To compute a p-value, we need to compute the probability of being 4.4 or more SEs away from 0 on the t distribution with 21 degrees of freedom • P = 0.0003
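For readers without a t table at hand, the tail probability can be approximated with the Python standard library alone. This is a minimal sketch that integrates the Student's t density numerically with Simpson's rule; the helper names `t_density` and `two_sided_p` are hypothetical, and statistical software would normally be used instead.

```python
import math

def t_density(x, df):
    """Probability density of Student's t with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_p(t_stat, df, steps=2000):
    """Two-sided p-value: total area in both tails beyond |t_stat|.

    The area between 0 and |t_stat| is approximated with Simpson's rule
    (steps must be even); doubling it gives the central area.
    """
    upper = abs(t_stat)
    h = upper / steps
    total = t_density(0.0, df) + t_density(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    half_central_area = total * h / 3
    return 1 - 2 * half_central_area

p = two_sided_p(-4.4, df=21)
print(f"p = {p:.4f}")
```

With t = -4.4 and 21 degrees of freedom this gives a p-value of a few ten-thousandths, the same order as the P = 0.0003 on the slide; small differences are expected because 4.4 is itself a rounded test statistic.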

Summary: Hyperlipidemia Example • Statistical Methods • Twenty-three patients with hyperlipidemia were randomly assigned to one of two treatment groups: A or B • 12 patients were assigned to receive A • 11 patients were assigned to receive B • Baseline LDL cholesterol measurements were taken on each subject and LDL was again measured after 12 weeks of treatment • The change in LDL cholesterol was computed for each subject • The mean LDL changes in the two treatment groups were compared using an unpaired t-test and a 95% confidence interval was constructed for the difference in mean LDL changes

Summary: Hyperlipidemia Example • Result • Patients on treatment A showed a mean decrease in LDL cholesterol of 1.41 mmol/L and patients on treatment B showed a mean decrease of 0.32 mmol/L (a difference of 1.09 mmol/L, 95% CI: 0.57 to 1.61 mmol/L) • The difference in LDL changes was statistically significant (p < 0.001)

FYI: Equal Variances Assumption • The “traditional” t-test assumes equal variances in the two groups • This can be formally tested using another hypothesis test • But why not just compare observed values of s1 to s2? • There is a slight modification to allow for unequal variances; this modification adjusts the degrees of freedom for the test and uses a slightly different SE computation • If you want to be truly ‘safe’, it is more conservative to use the test that allows for unequal variances • Makes little to no difference in large samples

FYI: Equal Variances Assumption • If underlying population level standard deviations are equal, both approaches give valid confidence intervals, but intervals assuming unequal standard deviations are slightly wider (p-values slightly larger) • If underlying population level standard deviations are unequal, the approach assuming equal variances does not give valid confidence intervals and can severely under-cover the goal of 95%
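The unequal-variance modification is usually implemented as Welch's t-test with Welch-Satterthwaite degrees of freedom; the slides do not name the method, so treating it as Welch's version is an assumption. The sketch below computes the standard error and adjusted degrees of freedom from summary statistics. The function name and all standard deviations are hypothetical illustrations.

```python
import math

def welch_se_and_df(sd1, n1, sd2, n2):
    """Unpooled standard error and Welch-Satterthwaite degrees of freedom."""
    v1 = sd1**2 / n1   # squared standard error of the first sample mean
    v2 = sd2**2 / n2   # squared standard error of the second sample mean
    se = math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return se, df

# ASSUMED standard deviations, for illustration only.
se_equal, df_equal = welch_se_and_df(sd1=2.0, n1=10, sd2=2.0, n2=10)
se_unequal, df_unequal = welch_se_and_df(sd1=1.0, n1=12, sd2=3.0, n2=11)

print(f"equal SDs:   SE = {se_equal:.3f}, df = {df_equal:.1f}")
print(f"unequal SDs: SE = {se_unequal:.3f}, df = {df_unequal:.1f}")
```

With equal standard deviations and equal group sizes the adjusted degrees of freedom equal n1 + n2 - 2, so the two approaches coincide. With very different standard deviations the degrees of freedom drop well below n1 + n2 - 2, which is what widens the interval.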

Non-Parametric Analogue to the Two-Sample t

Alternative to the Two Sample t-test • “Non-parametric” refers to a class of tests that do not assume anything about the distribution of the data • Nonparametric tests for comparing two groups • Mann-Whitney Rank-Sum test (Wilcoxon Rank Sum Test) • Also called Wilcoxon-Mann-Whitney Test • Attempts to answer: “Are the two population distributions different?” • Advantages: does not assume populations being compared are normally distributed, uses only ranks, and is not sensitive to outliers

Alternative to the Two Sample t-test • Disadvantages: • often less sensitive (powerful) for finding true differences because they throw away information (by using only ranks rather than the raw data) • need the full data set, not just summary statistics • results do not include any CI quantifying range of possibility for true difference between populations

Health Education Study • Evaluate an intervention to educate high school students about health and lifestyle over a two-month period • 10 students randomized to intervention or control group • X = post-test score - pre-test score • Compare between the two groups

Health Education Study • Only five individuals in each sample • We want to compare the control and intervention groups to assess whether the ‘improvement’ in scores is different, taking random sampling error into account • With such a small sample size, we need to be sure score improvements are normally distributed if we want to use the t test (BIG assumption) • Possible approach: Wilcoxon-Mann-Whitney test

Health Education Study • Step 1: rank the pooled data, ignoring groups • Step 2: reattach group status • Step 3: find the average rank in each of the two groups
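The three steps can be sketched in Python. The score changes below are illustrative values, not data shown on the slides; they were chosen so that the average ranks match the 6.8 and 4.2 quoted in this lecture. The helper `average_ranks` is a hypothetical name.

```python
def average_ranks(values):
    """Rank values from smallest (rank 1) upward; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks

# ASSUMED post-minus-pre score changes, for illustration only.
intervention = [5, 0, 7, 2, 19]
control = [6, -5, -6, 1, 4]

# Step 1: rank the pooled data, ignoring groups
ranks = average_ranks(intervention + control)

# Step 2: reattach group status
intervention_ranks = ranks[:len(intervention)]
control_ranks = ranks[len(intervention):]

# Step 3: find the average rank in each group
mean_rank_intervention = sum(intervention_ranks) / len(intervention_ranks)
mean_rank_control = sum(control_ranks) / len(control_ranks)
print(mean_rank_intervention, mean_rank_control)
```

Note that the largest value (19) could be replaced by any larger number without changing its rank of 10, which is why rank-based tests are insensitive to outliers.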

Health Education Study • Statisticians have developed formulas and tables to determine the probability of observing such an extreme discrepancy in ranks (6.8 versus 4.2) by chance alone (p) • The p-value here is 0.17 • The interpretation is that the Mann-Whitney test did not show any significant difference in test score ‘improvement’ between the intervention and control group (p = 0.17) • The two-sample t test would give a different answer (p = 0.14) • Different statistical methods give different p-values • If the largest observation were made even larger, the Mann-Whitney p-value would not change but the t-test p-value would
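One common formula is the large-sample normal approximation to the rank-sum statistic. The slides do not say which method produced p = 0.17, so the sketch below is an assumption: it uses no continuity correction and assumes no tied scores. The rank sum of 34 is the average rank of 6.8 times the 5 students in that group; the function name is hypothetical.

```python
import math

def rank_sum_normal_p(rank_sum, n_group, n_other):
    """Two-sided p-value for a rank sum via the normal approximation (no ties)."""
    n_total = n_group + n_other
    expected = n_group * (n_total + 1) / 2          # mean of the rank sum under H0
    variance = n_group * n_other * (n_total + 1) / 12
    z = (rank_sum - expected) / math.sqrt(variance)
    # erfc(|z| / sqrt(2)) equals the two-sided standard normal tail area
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p

z, p = rank_sum_normal_p(rank_sum=6.8 * 5, n_group=5, n_other=5)
print(f"z = {z:.2f}, p = {p:.2f}")
```

Under these assumptions the p-value rounds to 0.17. With only five subjects per group, an exact permutation p-value would differ somewhat from this approximation.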

Notes • The t or the nonparametric test? • Statisticians will not always agree, but there are some guidelines • Use the nonparametric test if: • the sample size is small and you have no reason to believe the data are ‘well-behaved’ (normally distributed) • only ranks are available

Summary: Educational Intervention Example • Statistical methods • 10 high school students were randomized to either receive a two-month health and lifestyle education program or no program • Each student was administered a test regarding health and lifestyle issues prior to randomization and after the two-month period • Differences in the two test scores were computed for each student • Mean and median test score changes were computed for each of the two study groups • A Mann-Whitney rank sum test was used to determine if there was a statistically significant difference in test score change between the intervention and control groups at the end of the two-month study period

Summary: Educational Intervention Example • Results • Participants randomized to the educational intervention scored a median five points higher on the test given at the end of the two-month study period, as compared to the test administered prior to the intervention • Participants randomized to receive no educational intervention scored a median one point higher on the test given at the end of the two-month study period • The difference in test score improvements between the intervention and control groups was not statistically significant (p = 0.17)

Comparing Means between More than Two Independent Populations

Motivating Example • Suppose you are interested in the relationship between smoking and mid-expiratory flow (FEF), a measure of pulmonary health • Suppose you recruit study subjects and classify them into one of six smoking categories • Nonsmokers (NS) • Passive smokers (PS) • Non-inhaling smokers (NI) • Light smokers (LS) • Moderate smokers (MS) • Heavy smokers (HS)

Motivating Example • You are interested in whether differences exist in mean FEF among the six groups • Main outcome variable is FEF in liters per second

Motivating Example • One strategy is to perform lots of two-sample t-tests (for each possible two-group comparison) • In this example, there would be 15 comparisons you would need to do: • NS-PS • NS-NI • . . . • MS-HS • It would be nice to have one “catch-all” test • Something that would tell you whether there were any differences among the six groups • If so, you could then do group-to-group comparisons to look for specific differences
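The count of 15 is the number of ways to choose 2 groups from 6. A short sketch that enumerates the pairs, using the group abbreviations from the slides:

```python
from itertools import combinations

groups = ["NS", "PS", "NI", "LS", "MS", "HS"]

# Every unordered pair of distinct groups is one two-sample comparison
pairs = list(combinations(groups, 2))

print(len(pairs))
for first, second in pairs:
    print(f"{first}-{second}")
```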

Extension of the Two-Sample t-test • Analysis of Variance (ANOVA) • The t-test compares means in two populations • ANOVA compares means among more than two populations with one test • The p-value from ANOVA answers the question: • “Are there any differences in the means among the populations?”

Extension of the Two-Sample t-test • General idea behind ANOVA, comparing means for k > 2 groups: • H0: μ1 = μ2 = . . . = μk • HA: At least one μj is different

Example • Smoking and FEF (Forced Mid-Expiratory Flow Rate) [1] • A sample of over 3,000 persons was classified into one of six smoking categories based on responses to smoking-related questions • [1] White, J.R., Froeb, H.F. (1980). Small-airways dysfunction in non-smokers chronically exposed to tobacco smoke, NEJM 302: 13.

Example • Nonsmokers (NS) • Passive smokers (PS) • Non-inhaling smokers (NI) • Light smokers (LS) • Moderate smokers (MS) • Heavy smokers (HS)

Example • Smoking and FEF • From each smoking group, a random sample of 200 men was drawn (except for the non-inhalers, as there were only 50 male non-inhalers in the entire sample of 3,000) • FEF measurements were taken on each of the subjects

Data Summary • Based on a one-way analysis of variance, there are statistically significant differences in FEF levels among the six smoking groups (p < 0.001)

What’s the Rationale? • In the simplest case, the variation in subject responses is broken down into two parts: variation attributed to the treatment (group/sample) and variation attributed to error (subject characteristics plus everything else not controlled for) • The variation among the treatment (group/sample) means is compared to the variation within a treatment (group/sample) • If the between-treatment variation is much larger than the within-treatment variation, that suggests the treatments have different effects

Example: Scenarios • [Figure: boxplots for scenarios 1, 2, and 3]

Example: Scenarios • There is an obvious difference between scenarios 1 and 2. What is it? • Just looking at the boxplots, which of the two scenarios (1 or 2) do you think would provide more evidence that at least one of the populations is different from the others? Why?

F Distribution Properties, F(dfnum, dfden) • The total area under the curve is one. • The distribution is skewed to the right. • The values are non-negative, start at zero and extend to the right, approaching but never touching the horizontal axis. • The distribution of F changes as the degrees of freedom change.

F Statistic • Case A: If all the sample means were exactly the same, what would be the value of the numerator of the F statistic? • Case B: If all the sample means were spread out and very different, how would the variation between sample means compare to the value in A?

F Statistic • So what values could the F statistic take on? • Could you get an F that is negative? Why not? • What type of values of F would support the alternative hypothesis?

Example: F Statistic Three independent random samples • Scenario 1: means 60, 65, 70; s = 1.5 • Scenario 2: means 60, 65, 70; s = 3 • Scenario 3: means 65, 65, 65; s = 3
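The F statistic for these scenarios can be sketched from the summary statistics alone. The slide gives the group means and a common standard deviation but no sample sizes, so the 10 observations per group used below is an assumption for illustration, as is the function name `f_statistic`. The sketch also assumes equal group sizes and the same standard deviation in every group.

```python
def f_statistic(means, sd, n_per_group):
    """One-way ANOVA F from group means, a common SD, and equal group sizes."""
    k = len(means)
    grand_mean = sum(means) / k
    # Between-group mean square: variation of group means around the grand mean
    ms_between = n_per_group * sum((m - grand_mean) ** 2 for m in means) / (k - 1)
    # Within-group mean square: with a common SD this is just the variance
    ms_within = sd ** 2
    return ms_between / ms_within

n = 10  # ASSUMED group size; not given on the slide
f1 = f_statistic([60, 65, 70], sd=1.5, n_per_group=n)
f2 = f_statistic([60, 65, 70], sd=3.0, n_per_group=n)
f3 = f_statistic([65, 65, 65], sd=3.0, n_per_group=n)
print(f"Scenario 1: F = {f1:.2f}")
print(f"Scenario 2: F = {f2:.2f}")
print(f"Scenario 3: F = {f3:.2f}")
```

Scenarios 1 and 2 have the same spread among the means, but the smaller within-group variation in scenario 1 makes its F four times larger. In scenario 3 the means are identical, so the numerator and F are both zero, the smallest value F can take.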

Summary: Smoking and FEF • Statistical Methods • 200 men were randomly selected from each of five smoking classification groups (non-smokers, passive smokers, light smokers, moderate smokers, and heavy smokers), as well as 50 men classified as non-inhaling smokers, for a study designed to analyze the relationship between smoking and respiratory function

Summary: Smoking and FEF • Statistical Methods • Analysis of variance was used to test for any differences in FEF levels among the six groups of men • Individual group comparisons were performed with a series of two-sample t-tests and 95% confidence intervals were constructed for the mean difference in FEF between each combination of groups

Summary: Smoking and FEF • Results • Analysis of variance showed statistically significant (p < 0.001) differences in FEF among the six groups of smokers • Non-smokers had the highest mean FEF value (3.78 L/s), and this was statistically significantly larger than the means of the five other smoking-classification groups • The mean FEF value for non-smokers was 1.19 L/s higher than the mean FEF for heavy smokers (95% CI: 1.03-1.35 L/s), the largest mean difference between any two smoking groups • Confidence intervals for all smoking group FEF comparisons are in Table 1

Example • FEV1 and three medical centers [1] • Data were collected on 63 patients with coronary artery disease at 3 different medical centers (Johns Hopkins, Rancho Los Amigos Medical Center, and St. Louis University School of Medicine) • The purpose of the study was to investigate effects of carbon monoxide exposure on these patients • Prior to analyzing CO effects data, researchers wished to compare the respiratory health of these patients across the three medical centers • [1] Pagano, M., Gauvreau, K. (2000). Principles of Biostatistics. Duxbury Press
