Understanding Biases in Medical Statistics: The Case of Snus and Ulcer Association

Traps and pitfalls in medical statistics Arvid Sjölander

Motivating example • You are involved in a project to find out if snus causes ulcer. • A questionnaire is sent out to 300 randomly chosen subjects. • 200 subjects respond: • We can use the relative risk (RR) to measure the association between snus and ulcer: • Can we safely conclude that snus prevents ulcer? Arvid Sjölander

Outline • Systematic errors • Selection bias • Confounding • Randomization • Reverse causation • Random errors • Confidence interval • P-value • Hypothesis test • Significance level • Power Arvid Sjölander

One possible explanation • It is a wide spread hypothesis that snus causes ulcer. • Snus users who develop ulcer may therefore feel somewhat guilty, and may therefore be reluctant to participate in the study • Hence, RR<1 may be (partly) explained by an underrepresentation of snus users with ulcer among the responders. • This is a case of selection bias. Arvid Sjölander

Selection bias Population • We only observe the RR among the potential responders. • The RR among the responders (observed) may not be equal to the population RR (unobserved). Sample Potential non- responders Potential responders Arvid Sjölander

How do we avoid selection bias? Population • Make sure that the sample is drawn randomly from the whole population of interest - must trace the non-responders. • Send out the questionnaire again, follow up phone calls etc. Sample Potential non- responders Potential responders Arvid Sjölander

Another possible explanation • Because of age-trends, young people use snus more often than old people. • For biological reasons, young people have a smaller risks for ulcer than old people. • Hence, RR<1 may be (partly) explained by snus-users being in “better shape” than non-users. • This is a case of confounding, and age is called a confounder. Arvid Sjölander

? Confounding • The RR measures the association between snus and ulcer. • The association depends on both the causal effect, and the influence of age. • In particular, even in the absence of a causal effect, there will be an (inverse) association between snus and ulcer (RR  1). Arvid Sjölander

? How do we avoid confounding? • At the design stage: randomization, i.e. assigning “snus” and “no snus” by “the flip of a coin”. • + reliable; it eliminates the influence of all confounders. • - expensive and possibly unethical. • At the analysis stage: adjust (the observed association) for (the influence of) age, e.g. stratification, matching, regression modeling. • + cheap and ethical. • - not fully reliable; cannot adjust for unknown or unmeasured confounders. Arvid Sjölander

Yet another explanation • It is a wide spread hypothesis among physicians that snus causes and aggravates ulcer. • Snus users who suffers from ulcer may therefore be advised by their physicians to quit. • Hence, RR<1 may be (partly) explained by a tendency among people with ulcer to quit using snus. • This is a case of reverse causation. Arvid Sjölander

Reverse causation • Reverse causation can be avoided by randomization. ? Snus Ulcer Arvid Sjölander

Systematic errors • Selection bias, confounding, and reverse causation, are referred to as systematic errors, or bias. • “You don’t measure what you are interested in”. • How can you tell if your study is biased? • You can’t! (At least not from the observed data). • It is important to design the study carefully and “think ahead” to avoid bias. • What may the reason be for potential response/non-response? • How can we trace the non-responders? • Which are the possible confounders? • Do we need to randomize the study? Would randomization be ethical and practically possible? Arvid Sjölander

Example cont’d • Assume that we believe that the study is unbiased (no selection bias, no confounding and no reverse causation). • Can we safely conclude that snus prevents ulcer? Arvid Sjölander

Random errors Population Sample • True RR = observed RR? • True RR  observed RR! True RR Observed RR=0.7 Arvid Sjölander

Confidence interval • Where can we expect the true RR to be? • The 95% Confidence Interval (CI) answers this question. • It is a range of plausible values for the true RR. • Example: RR=0.7, 95% CI: (0.5,0.9). • The narrower CI, the less uncertainty in the true RR. • The width of the CI depends on the sample size, the larger sample, the narrower CI. • How do we compute a CI?Ask a statistician! Arvid Sjölander

CI for our data • RR=0.7, 95% CI: (0.16,2.74). • Conclusion? Arvid Sjölander

P-value • Often, we specifically want to know whether the true RR is equal to 1 (no association between snus and ulcer). • The hypothesis that the true RR = 1 is called the “null hypothesis”; H0. • The p-value (p) is an objective measure of the strength of evidence in the observed data against H0. • 0 < p < 1. • The smaller p-value, the stronger evidence against H0. • How do we compute p?Ask a statistician? Arvid Sjölander

Factors that determine the p-value • What do you think p depends on? • The sample size: the larger sample, the smaller p. • The magnitude of the observed association: the stronger association, the smaller p. • A common mistake: “The p-value is low, but the sample size is small so we cannot trust the results”. • Yes you can! • The p-value takes the sample size into account. Once the p-value is computed, the sample size carries no further information. Arvid Sjölander

P-value for our data • P = 0.81 • Conclusion? Arvid Sjölander

Making a decision • The p-value is an objective measure of the strenght of evidence against H0. • The smaller p-value, the stronger evidence against H0. • Sometimes, we have to make a formal decision of whether or not to reject H0. • This decision process is formally called hypothesis testing. • We reject H0 when the evidence against H0 are “strong enough”. • i.e. when the p-value is “small enough”. Arvid Sjölander

Significance level • The rejection threshold is called the significance level. • E.g. “5% significance level” means that we have decided to reject H0 if p<0.05. • That we use a low significance level level means that we require strong evidence against H0 for rejection. • That we use a high significance level means that we are satisfied with weak evidence against H0 for rejection. • What is the advantage of using a low significance level? What about a high significance level? Arvid Sjölander

A parallell to the court room • H0 = the prosecuted is innocent. • p value = the strength of evidence against H0. • Low significance level = need strong evidence to condemn to jail. • Few innocent in jail, but many guilty in freedom. • High significance level = weak evidence sufficient to condemn to jail. • Many guilty in jail, but many innocent in jail as well. Arvid Sjölander

Type I and type II errors • There is always a trade-off between the risk for type I and the risk for type II errors. • Low significance level (difficult to reject H0)  small risk for type I errors, but large risk for type II errors. • High significance level (easy to reject H0)  small risk for type II errors, but large risk for type I errors. • By convention, we use 5% significance level (reject H0 if p<0.05). Arvid Sjölander

Relation between significance level and type I errors • In fact, the significance level = the risk for type I errors. • If we follow the convention and use 5% significance level (reject H0 if p<0.05) then we have 5% risk of type I errors. • What does this mean, more concretely? Sig level Arvid Sjölander

Power Power • Power = the chance of being able to reject H0, when H0 is false. • Relation between significance level and power: • High significance level (easy to reject H0)  high power. • Low significance level (difficult to reject H0)  low power. Sig level Arvid Sjölander

Power calculations • It is important to determine the power of the study before data is collected. • That the power is low means that we will probably not find what we are looking for. • Direct calculation of the power is beyond the scope of this course • Ask a statistician! Arvid Sjölander

Power calculations, cont’d • Heuristically, the power of the study is determined by three factors: • The significance level; higher significance level gives higher power. • The true RR; stronger association gives higher power. • The sample size; larger sample gives higher power. • Typically, we want to have a power of at least 80%. • In practice, the significance level is fixed at 5%. • We also typically have an idea of what deviations from H0 that are scientifically relevant to detect (e.g. RR > 1.5). • We determine the sample size that we need, to have the desired power. Arvid Sjölander

Systematic vs random errors • There are two qualitative differences between systematic and random errors. • #1 • Data can tell us if an observed association is possibly due to random errors - check the p-value. • Data can never tell us if an observed association is due to systematic errors. • #2 • Uncertainty due to random errors can be reduced by increasing the sample size  narrower confidence intervals. • Systematic errors results from a poor study design, and can not be reduced by increasing the sample size. Arvid Sjölander

Summary • In medical research, we are often interested in the causal effect of one variable on another. • An observed association between two variables does not necessarily imply that one causes the other. • Always be aware of the following pitfalls: • Selection bias • Confounding • Reverse causation • Random errors Arvid Sjölander

Understanding Biases in Medical Statistics: The Case of Snus and Ulcer Association