Introduction to Biostatistics (ZJU 2008)

Wenjiang Fu, Ph.D., Associate Professor
Division of Biostatistics, Department of Epidemiology
Michigan State University, East Lansing, Michigan 48824, USA
Email: fuw@msu.edu
www: http://www.msu.edu/~fuw

Homework 1
• Correction to Problem 2 (referring to Table 3.4): the three people are a 77-year-old man, a 76-year-old woman, and an 82-year-old woman.

Parameter estimation
What we have learned so far:
• Random variables.
• Distributions of random variables (Bin, Pois, Gaussian).
• Calculation of probability based on distributions, including approximation methods.
• Application of probability theory (small probability events).
• All of the above assume a known distribution: known type of distribution and known parameters of the distribution.

Parameter estimation
• Distributions of random variables (Bin, Pois, Gaussian).
• Calculation of probability based on distributions, including approximation methods.
• Examples:
  D.B.P. $\sim N(80, 12.5^2)$
  # cases of cancer $\sim \mathrm{Pois}(\mu)$, $\mu = 6$
  # lymphocytes $\sim B(100, .34)$
• Calculate probability based on assumptions about the distribution and its parameters.
• Application of probability theory (small probability events).
• All of the above assume a known distribution: known type of distribution and known parameters of the distribution.
• Where do we find the information about the parameters?
• The only answer is from the data (or samples)!

Parameter estimation
• The only answer is from the data (or samples).
• Data set → statistical inference: estimation of parameters and hypothesis testing.
• Estimation: point estimation and interval estimation (confidence interval, CI).

Relation between population and sample
• Random sample: a selection of some members of the population such that each member is independently chosen and has a known non-zero probability of being selected.
  Example 1: 10 birth weights $x_1, \dots, x_{10}$ are a sample from the entire population of birth weights.
  Example 2: WBC counts of 30 students independently selected from MSU, $x_1, \dots, x_{30}$, are a sample from the population of WBC counts of all MSU students.
• Simple random sample: a random sample in which each member has the same probability of being selected. Here "random sample" refers to a simple random sample.
• Some non-simple random samples: cluster sampling. Within a state, choose clusters (geographic locations, regions, sparse populations), then draw random samples within the selected clusters.
• The reference, target, or study population is the group we wish to study (to make inference about). The random sample is selected from the study population, in the hope that it represents the study population well enough to draw conclusions from.

Estimation of the Mean
• Estimation of the mean of a distribution, $\mu = E(X)$.
• Take a random sample $x_1, \dots, x_n$ from the distribution of $X$.
• The natural estimate of $\mu$ is the sample mean $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$.
• Let $x_1, \dots, x_n$ be a random sample drawn from the same population with mean $\mu$. Then the sample mean satisfies $E(\bar{x}) = \mu$.
• An estimator $e$ of a parameter $\theta$ is unbiased if $E(e) = \theta$.
• Hence $\bar{x}$ is an unbiased estimator of $\mu$.

Estimation of the Mean
• For the normal distribution $N(\mu, \sigma^2)$, $\bar{x}$ is the "best" unbiased estimator: it has the smallest variance among unbiased estimators.
• Standard error of the mean: let $x_1, \dots, x_n$ be a random sample from an underlying distribution with mean $\mu$ and variance $\sigma^2$. Then $\mathrm{Var}(\bar{x}) = \sigma^2/n$.
• The standard error of the mean (SEM) is the standard deviation of the sample mean $\bar{x}$, which is equal to $\sigma/\sqrt{n}$. It is estimated by $S/\sqrt{n}$, where $S^2$ is the sample variance.
• I.I.D. (or i.i.d.): independent and identically distributed. $x_1, \dots, x_n$ are iid r.v.'s if they are independent r.v.'s with the same distribution (same mean, variance, quantiles, etc.).
• A random sample from a population is iid.
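The variance step behind the SEM can be written out as a short derivation. It uses only the independence of the $x_i$ and their common variance $\sigma^2$:

```latex
\begin{aligned}
\operatorname{Var}(\bar{x})
  &= \operatorname{Var}\Big(\frac{1}{n}\sum_{i=1}^{n} x_i\Big)
   = \frac{1}{n^2}\sum_{i=1}^{n}\operatorname{Var}(x_i)
   = \frac{n\sigma^2}{n^2}
   = \frac{\sigma^2}{n}, \\
\operatorname{se}(\bar{x})
  &= \sqrt{\operatorname{Var}(\bar{x})} = \frac{\sigma}{\sqrt{n}} .
\end{aligned}
```

The second equality is where independence is needed: without it, covariance terms would appear in the sum.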

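Estimation of the standard error
• Standard error (s.e.): $\mathrm{se}(\bar{x}) = \sigma/\sqrt{n}$, where $n$ is the sample size (usually known) and $\sigma$ may be known or unknown.
• If $\sigma^2$ is unknown, use the sample variance $S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$.
• Use $S$ to estimate $\sigma$ and use $S/\sqrt{n}$ to estimate $\mathrm{se}(\bar{x})$.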
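A minimal R sketch of this estimate, assuming a small hypothetical sample `x` (the values are illustrative only, not course data):

```r
# Hypothetical sample of n = 10 measurements (illustrative values only)
x <- c(118, 125, 131, 122, 140, 128, 119, 135, 127, 130)
n <- length(x)
s <- sd(x)              # sample standard deviation S (denominator n - 1)
se_hat <- s / sqrt(n)   # estimated standard error of the mean, S / sqrt(n)
se_hat
```

Note that `sd()` and `var()` in R already use the denominator $n - 1$, so no correction is needed.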

Central Limit Theorem
• Notation: a hat (^) denotes an estimate of a parameter. $\hat{\mu}$ is the estimate of the mean $\mu$; $\hat{\sigma}^2$ is the estimate of the variance $\sigma^2$.
• Central Limit Theorem for normal r.v.'s: if $x_1, \dots, x_n$ are iid $N(\mu, \sigma^2)$, then $\bar{x} \sim N(\mu, \sigma^2/n)$. In fact, for large $n$, the result still holds approximately even for iid r.v.'s $x_1, \dots, x_n$ that are not normally distributed.
• Central limit theorem: if $x_1, \dots, x_n$ are independent r.v.'s with the same mean $\mu$ and variance $\sigma^2$, then for large $n$ the mean $\bar{x}$ is approximately normally distributed with mean $\mu$ and variance $\sigma^2/n$.
• Point estimation: $\hat{\mu} \pm \mathrm{se}(\bar{x})$.
• Interval estimation: the confidence interval (C.I.), an estimate of precision.
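The theorem can be checked by simulation. The sketch below is only an illustration: the exponential distribution, the sample size `n = 50`, the number of replicates `B`, and the seed are all arbitrary choices, not taken from the lecture.

```r
set.seed(1)                      # arbitrary seed, for reproducibility only
n <- 50                          # size of each sample
B <- 5000                        # number of simulated samples
# Exponential(rate = 1) has mean 1 and variance 1 and is clearly non-normal
means <- replicate(B, mean(rexp(n, rate = 1)))
mean(means)                      # should be close to mu = 1
var(means)                       # should be close to sigma^2 / n = 0.02
# Histogram of the sample means with the N(mu, sigma^2 / n) density overlaid
hist(means, breaks = 40, freq = FALSE)
curve(dnorm(x, mean = 1, sd = sqrt(1 / n)), add = TRUE, col = 2)
```

The histogram of the sample means should look close to the overlaid normal curve even though the individual observations are strongly skewed.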

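Interval Estimation
• Interval estimation, known variance: assume the population follows the normal distribution $N(\mu, \sigma^2)$. A random sample $x_1, \dots, x_n$ has mean $\bar{x} \sim N(\mu, \sigma^2/n)$.
• If $\mu$ and $\sigma^2$ are known, then
  $\Pr\left(-1.96 < \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} < 1.96\right) = .95$,
  or equivalently
  $\Pr(\mu - 1.96\,\sigma/\sqrt{n} < \bar{x} < \mu + 1.96\,\sigma/\sqrt{n}) = .95$,
  or equivalently
  $\Pr(\bar{x} - 1.96\,\sigma/\sqrt{n} < \mu < \bar{x} + 1.96\,\sigma/\sqrt{n}) = .95$.
• Definition (Confidence Interval): a 95% confidence interval (C.I.) for $\mu$ when $\sigma^2$ is known is the interval $(\bar{x} - 1.96\,\sigma/\sqrt{n},\ \bar{x} + 1.96\,\sigma/\sqrt{n})$.
• Interpretation: we are 95% confident that the population mean $\mu$ is in the CI $(\bar{x} - 1.96\,\sigma/\sqrt{n},\ \bar{x} + 1.96\,\sigma/\sqrt{n})$.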

Confidence Interval
• Note: the CI $(\bar{x} - 1.96\,\sigma/\sqrt{n},\ \bar{x} + 1.96\,\sigma/\sqrt{n})$ is random, since it depends on $\bar{x}$, which is random and depends on the random sample.
• If many samples are drawn from the population and the CI is calculated for each sample, then over the collection of all 95% CIs that could be constructed from repeated random samples of size $n$, 95% will contain the population parameter $\mu$.

95% Confidence Intervals

## Plot the standard normal N(0,1) density and construct 95% CIs
## for random samples of size 20 from N(0,1)

#### Plot of density N(0,1)
a <- c(-100:100) / 25
plot(a, dnorm(a), type = "l", ylim = c(-1, .5))
abline(v = 0, col = 2)

## Construct a 95% CI for a random sample of size 20
## and repeat 1000 times
B <- 1000
size <- 20
CImat <- matrix(NA, B, 3)
for (i in 1:B) {
  samp <- rnorm(size, mean = 0, sd = 1)
  samp.mean <- mean(samp)
  ## normal N(0,1) distribution with known variance 1: use sigma = 1
  CImat[i, 1:2] <- c(samp.mean - 1.96 * 1 / sqrt(size),
                     samp.mean + 1.96 * 1 / sqrt(size))
  ## normal distribution with unknown variance: use S and the t quantile
  ## CImat[i, 1:2] <- c(samp.mean - qt(p = .975, df = size - 1) * sd(samp) / sqrt(size),
  ##                    samp.mean + qt(p = .975, df = size - 1) * sd(samp) / sqrt(size))
  ## indicator: 1 if the CI covers the true mean 0
  CImat[i, 3] <- 1 * (CImat[i, 1] * CImat[i, 2] <= 0)
  ## plot a segment for the CI at a random height, cycling through colors 1 to 5
  lines(CImat[i, 1:2], rep(runif(1, min = -1, max = 0), 2), col = i %% 5 + 1)
}
## proportion of the B intervals that cover 0 (should be close to .95)
sum(CImat[, 3]) / B

95% Confidence Intervals [figure: output of the code above]

Confidence Intervals of Mean
• Length of CI: the wider the CI, the less precise the estimate. The CI is a safeguard against mistakes in estimation. A very wide CI is rarely wrong, but it is of little use.
• Example: SBP. If $\sigma_1^2 = 100$, $\sigma_2^2 = 400$, $n = 9$, and both sample means are 150, then the 95% CI = ?
  Sample 1: $\bar{x}_1 \pm 1.96\,\sigma_1/\sqrt{9} = 150 \pm 1.96 \times 10/3 = 150 \pm 6.53 = (143.47, 156.53)$
  Sample 2: $\bar{x}_2 \pm 1.96\,\sigma_2/\sqrt{9} = 150 \pm 1.96 \times 20/3 = 150 \pm 13.07 = (136.93, 163.07)$
• CI at any $\alpha$-level: $\bar{x} \pm Z_{1-\alpha/2}\,\sigma/\sqrt{n}$.
• Factors affecting the length (width) of the CI:
  1) $n$, the sample size: as $n$ increases, the CI becomes narrower;
  2) $\sigma$, the standard deviation of the population: as $\sigma$ increases, the CI becomes wider;
  3) $\alpha$, via the confidence level $(1-\alpha)$: as $(1-\alpha)$ increases, the CI becomes wider.
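The SBP arithmetic can be reproduced in R. This is a sketch only: `ci_known_sigma` is a hypothetical helper name introduced here for illustration, and `qnorm` supplies the normal percentile in place of the tabulated 1.96.

```r
# z-based CI for the mean when sigma is known
# (ci_known_sigma is a hypothetical helper, not a built-in R function)
ci_known_sigma <- function(xbar, sigma, n, conf = 0.95) {
  z <- qnorm(1 - (1 - conf) / 2)     # about 1.96 for conf = 0.95
  half <- z * sigma / sqrt(n)        # half-width of the interval
  c(lower = xbar - half, upper = xbar + half)
}
ci1 <- ci_known_sigma(150, sigma = 10, n = 9)   # sigma_1^2 = 100
ci2 <- ci_known_sigma(150, sigma = 20, n = 9)   # sigma_2^2 = 400
round(ci1, 2)
round(ci2, 2)
```

Doubling $\sigma$ doubles the width of the interval, while quadrupling $n$ would halve it; trying `n = 36` in the call shows the second effect.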

Confidence Intervals of Mean
• CI at any $\alpha$-level: use the percentile $Z_u$, defined by $\Pr(X \le Z_u) = u$ for $X \sim N(0, 1)$.
  Left and right tail probabilities: $\Pr(X \le -Z_{1-\alpha/2}) = \Pr(X \ge Z_{1-\alpha/2}) = \alpha/2$.
  Total tail probability: $\Pr(|X| \ge Z_{1-\alpha/2}) = \alpha$.
• The $(1-\alpha) \times 100\%$ CI for $\mu$ is $(\bar{x} - Z_{1-\alpha/2}\,\sigma/\sqrt{n},\ \bar{x} + Z_{1-\alpha/2}\,\sigma/\sqrt{n})$.
• $\alpha$ can be any level; the most frequently used are .01, .05, and .1.
• Interval estimation, $\sigma^2$ unknown: use the estimate $S^2$ in place of $\sigma^2$ for the CI.
  For $\sigma^2$ known: $\frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$.
  For $\sigma^2$ unknown: $\frac{\bar{x} - \mu}{S/\sqrt{n}} \sim t_{n-1}$, the t-distribution (Student's t-distribution, W. S. Gosset) with $(n-1)$ degrees of freedom (df).
• Percentile of the t-distribution: $t_{d,u}$ is the $(100 \times u)$th percentile, $\Pr(t_d \le t_{d,u}) = u$. See the t-distribution table.

Confidence Intervals of Mean
• Note that $t_d \to N(0, 1)$ for very large $d$. When $d < 30$, there is a visible difference between $t_{d,u}$ and $Z_u$. When $d > 30$, the difference is small.
• CI of $\mu$ with $\sigma^2$ unknown: estimate $\sigma^2$ by $S^2$, replace $Z_u$ by $t_{n-1,u}$, and follow the same procedure as for $\sigma^2$ known.
• The $(1-\alpha) \times 100\%$ C.I. for $\mu$ when $\sigma^2$ is unknown is $(\bar{x} - t_{n-1,1-\alpha/2}\,S/\sqrt{n},\ \bar{x} + t_{n-1,1-\alpha/2}\,S/\sqrt{n})$.
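The convergence of the t percentiles to the normal percentile can be seen directly with `qt` and `qnorm`. The degrees of freedom listed below are an arbitrary illustrative choice:

```r
d <- c(5, 10, 30, 100, 1000)         # degrees of freedom (illustrative choice)
t_q <- qt(0.975, df = d)             # t_{d, .975}
z_q <- qnorm(0.975)                  # Z_{.975}, about 1.96
# Side-by-side comparison of the t and normal percentiles
data.frame(df = d,
           t = round(t_q, 3),
           z = round(z_q, 3),
           difference = round(t_q - z_q, 3))
```

The t percentile is always larger than the normal one, so the t-based CI is wider, and the gap shrinks steadily as the degrees of freedom grow.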

Confidence Interval of Mean
• Example: Table 6.9, 27 rats with LVEF. It is known that $\sum x_i = 6.05$ and $\sum x_i^2 = 1.522$. Assume a normal distribution. Calculate the mean, $S^2$, s.e., and 95% CI.
• $\bar{x} = \sum x_i / n = 6.05/27 = .224$
  $S^2 = .0064$, $S = .08$
  $\mathrm{s.e.}(\bar{x}) = S/\sqrt{27} = .0154$
• 95% CI: $\bar{x} \pm t_{26,.975}\,S/\sqrt{27} = .224 \pm 2.056 \times .0154 = .224 \pm .0317 = (.1923, .2557)$
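The same calculation can be sketched in R from the two sums alone. The shortcut formula $S^2 = (\sum x_i^2 - n\bar{x}^2)/(n-1)$ used below is a standard identity, stated here because the slide gives only the result:

```r
n <- 27
sum_x  <- 6.05      # sum of x_i, from the slide
sum_x2 <- 1.522     # sum of x_i^2, from the slide
xbar <- sum_x / n
s2 <- (sum_x2 - n * xbar^2) / (n - 1)   # shortcut formula for S^2
s  <- sqrt(s2)
se <- s / sqrt(n)                       # estimated s.e. of the mean
tq <- qt(0.975, df = n - 1)             # t_{26, .975}
ci <- xbar + c(-1, 1) * tq * se         # 95% CI for the mean
round(c(mean = xbar, S2 = s2, S = s, se = se), 4)
round(ci, 4)
```

The results agree with the slide up to rounding, since the slide rounds $S$ and the s.e. before multiplying.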

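Estimation of Variance
• Point estimation: the natural estimate is the sample variance $S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$. Is $E(S^2) = \sigma^2$?
• Theorem: if $x_1, \dots, x_n$ is a random sample from a population with mean $\mu$ and variance $\sigma^2$, then $E(S^2) = \sigma^2$, i.e. $S^2$ is an unbiased estimator of $\sigma^2$.
• If we use the denominator $n$ rather than $(n-1)$ to estimate $\sigma^2$, i.e. $\tilde{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$, then
  $E\{\tilde{\sigma}^2\} = E\{[(n-1)/n]\,S^2\} = [(n-1)/n]\,E\{S^2\} = [(n-1)/n]\,\sigma^2 < \sigma^2$,
  i.e. the average of the squared distances from the sample mean is a biased estimator of $\sigma^2$.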
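The theorem is stated without proof. One standard argument, sketched here under the same assumptions (independent $x_i$ with common mean $\mu$ and variance $\sigma^2$), centers the sum of squares at $\mu$:

```latex
\begin{aligned}
\sum_{i=1}^{n}(x_i-\bar{x})^2
  &= \sum_{i=1}^{n}(x_i-\mu)^2 - n(\bar{x}-\mu)^2, \\
E\Big[\sum_{i=1}^{n}(x_i-\bar{x})^2\Big]
  &= n\sigma^2 - n\operatorname{Var}(\bar{x})
   = n\sigma^2 - n\cdot\frac{\sigma^2}{n}
   = (n-1)\sigma^2, \\
E(S^2)
  &= \frac{1}{n-1}\,E\Big[\sum_{i=1}^{n}(x_i-\bar{x})^2\Big]
   = \sigma^2 .
\end{aligned}
```

The lost "degree of freedom" comes from the term $n(\bar{x}-\mu)^2$: measuring distances from $\bar{x}$ instead of $\mu$ makes the sum of squares smaller on average by exactly $\sigma^2$.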

Interval estimation of variance
• Chi-square distribution: if $G = X_1^2 + \dots + X_n^2$, where $X_1, \dots, X_n$ are iid $N(0, 1)$, then $G$ is said to follow a chi-square distribution with $n$ degrees of freedom.
• Denote $G \sim \chi^2_n$. It takes only positive values and has mean $E(\chi^2_n) = n$.
• The $u$-th percentile of $\chi^2_n$, denoted by $\chi^2_{n,u}$, satisfies $\Pr(\chi^2_n < \chi^2_{n,u}) = u$ and can be obtained from the $\chi^2_n$ table.
• Distribution of $S^2$: let $x_1, \dots, x_n$ be an iid random sample from $N(\mu, \sigma^2)$. Then
  $\frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}$,
  or equivalently
  $S^2 \sim \frac{\sigma^2}{n-1}\,\chi^2_{n-1}$.

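Confidence Interval of Variance
• Similar to the derivation of the CI for $\mu$, we have
  $\Pr\left(\chi^2_{n-1,\alpha/2} < \frac{(n-1)S^2}{\sigma^2} < \chi^2_{n-1,1-\alpha/2}\right) = 1-\alpha$.
• The $(1-\alpha) \times 100\%$ CI for $\sigma^2$ is
  $\left[\frac{(n-1)S^2}{\chi^2_{n-1,1-\alpha/2}},\ \frac{(n-1)S^2}{\chi^2_{n-1,\alpha/2}}\right]$.
• Example: S.B.P. $\sim N(\mu, \sigma^2)$
  Sample 1: $\bar{x}_1 = 150$, $S_1^2 = 250$, $n = 5$
  Sample 2: $\bar{x}_2 = 150$, $S_2^2 = 1700$, $n = 5$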

95% CI for $\sigma^2$
• 95% CI: $(1-\alpha) = .95$, $\alpha = .05$
  95% CI $= [\,(n-1)S^2/\chi^2_{4,.975},\ (n-1)S^2/\chi^2_{4,.025}\,]$
  Sample 1: $[4 \times 250 / 11.14,\ 4 \times 250 / .484] = [89.77, 2066.12]$
  Sample 2: $[4 \times 1700 / 11.14,\ 4 \times 1700 / .484] = [610.4, 14049.6]$
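These intervals can be recomputed with `qchisq` instead of the table. This is a sketch; `ci_var` is a hypothetical helper name introduced only for illustration:

```r
# Chi-square based CI for the variance of a normal population
# (ci_var is a hypothetical helper, not a built-in R function)
ci_var <- function(s2, n, conf = 0.95) {
  alpha <- 1 - conf
  df <- n - 1
  c(lower = df * s2 / qchisq(1 - alpha / 2, df),   # divide by the upper percentile
    upper = df * s2 / qchisq(alpha / 2, df))       # divide by the lower percentile
}
ci_s1 <- ci_var(250, n = 5)     # sample 1
ci_s2 <- ci_var(1700, n = 5)    # sample 2
round(ci_s1, 2)
round(ci_s2, 2)
```

The endpoints differ slightly from the slide because the table values 11.14 and .484 are rounded. Note how asymmetric and wide the intervals are with only 4 degrees of freedom.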

Estimation for Bin(n, p)
• Example: a random sample of 1000 adults, among whom 30 had had heart attack(s). How do we estimate $p$?
• $p = \Pr(\text{having had heart attack(s) before})$, estimated by $\hat{p}$ = relative frequency = 30/1000 = .03.
• Q: Is this a good estimator?
• A: $X \sim B(n, p)$. Let $X_1, \dots, X_n$ be independent Bernoulli trials with $\Pr(X_i = 1) = p$ and $\Pr(X_i = 0) = 1 - p$, $1 \le i \le n$. Then $X = \sum_{i=1}^{n} X_i$ and $X/n = \sum_{i=1}^{n} X_i / n = \bar{X}$, the sample mean. Since $E(X_i) = p$, we have $E(\bar{X}) = E(\sum X_i / n) = p$.
• So $\bar{X} = X/n$ is an unbiased estimator for $p$.
• $\hat{p} = X/n$, $\mathrm{s.e.}(\hat{p}) = ?$

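Estimation for Bin(n, p)
• $\mathrm{var}(\hat{p}) = \mathrm{var}(X/n) = \mathrm{var}(X)/n^2 = npq/n^2 = pq/n$, so $\mathrm{s.e.}(\hat{p}) = \sqrt{pq/n}$ with $q = 1 - p$.
• Estimate $\mathrm{s.e.}(\hat{p})$: replace $p$ with $\hat{p}$.
• $\widehat{\mathrm{s.e.}}(\hat{p}) = (\hat{p}\hat{q}/n)^{1/2}$
• Example: $n = 1000$, $X = 30$, $\hat{p} = X/n = .03$.
• $\widehat{\mathrm{s.e.}}(\hat{p}) = .00539$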
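The heart attack example takes only a few lines of R. The variable names are illustrative:

```r
n <- 1000                                 # sample size
x <- 30                                   # number with heart attack(s)
p_hat <- x / n                            # point estimate of p
se_hat <- sqrt(p_hat * (1 - p_hat) / n)   # estimated s.e., sqrt(p_hat * q_hat / n)
round(c(p_hat = p_hat, se = se_hat), 5)
```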

Interval estimation of Binomial p
• Normal theory method.
• $X \sim B(n, p)$, so $X = \sum_{i=1}^{n} X_i$ with independent Bernoulli trials $X_1, \dots, X_n$.
• $\hat{p} = X/n$ is the sample mean of the Bernoulli trials.
• By the central limit theorem (CLT), $\hat{p}$ is approximately $N(p, pq/n)$, so we can use the normal distribution.
• Condition: $npq \ge 5$.

Confidence Interval for p
• The 95% CI for $p$ with normal theory ($n\hat{p}\hat{q} \ge 5$) is $(\hat{p} - 1.96\,(\hat{p}\hat{q}/n)^{1/2},\ \hat{p} + 1.96\,(\hat{p}\hat{q}/n)^{1/2})$.
• The $(1-\alpha) \times 100\%$ CI for $p$ with normal theory ($n\hat{p}\hat{q} \ge 5$) is $(\hat{p} - z_{1-\alpha/2}\,(\hat{p}\hat{q}/n)^{1/2},\ \hat{p} + z_{1-\alpha/2}\,(\hat{p}\hat{q}/n)^{1/2})$.
• Example: eosinophils. $\hat{p} = 2/100 = .02$, $n\hat{p}(1-\hat{p}) = 100 \times .02 \times .98 = 1.96 < 5$.
• The normal approximation does not work!
• Exact method: use Table 7 for the 95% CI.
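When no table is at hand, an exact interval can be computed in R. As an assumption, the sketch below uses the Clopper-Pearson interval returned by `binom.test`, which may or may not be the method tabulated in Table 7:

```r
x <- 2                                   # eosinophils observed
n <- 100                                 # cells counted
p_hat <- x / n
n * p_hat * (1 - p_hat)                  # 1.96 < 5: normal approximation not advised
# Exact (Clopper-Pearson) 95% CI from the binomial distribution itself
exact_ci <- binom.test(x, n, conf.level = 0.95)$conf.int
exact_ci
```

Unlike the normal-theory interval, the exact interval is not symmetric about $\hat{p}$ and never extends below 0.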

One-sided CI
• Example: hypertensive treatment to lower BP, comparing the standard drug vs. a new drug.
• Suppose that out of 100 hypertensives, the new drug brings 40 subjects' BP down to normal, while the standard drug has 30% efficacy.
  Q: 1) Is the new drug different from the standard? 2) Is the new drug better than the standard?
• A: 1) Two-sided: if different, it can be better or worse. 2) One-sided: it can be better or no better.
• Upper one-sided $(1-\alpha) \times 100\%$ CI for $p$ of $B(n, p)$: $p > \hat{p} - Z_{1-\alpha}\,(\hat{p}\hat{q}/n)^{1/2}$ for $n\hat{p}\hat{q} \ge 5$.
• Lower one-sided $(1-\alpha) \times 100\%$ CI for $p$ of $B(n, p)$: $p < \hat{p} + Z_{1-\alpha}\,(\hat{p}\hat{q}/n)^{1/2}$ for $n\hat{p}\hat{q} \ge 5$.

CI: one-sided vs two-sided
• Example: hypertension study. 100 people receive a drug for treatment of high BP, and 20 of them have their BP lowered by the drug. For reference, BP is also lowered by placebo in 10% of people.
  Q1: Is there any drug effect? Q2: Is the drug better than placebo?
• A: $\hat{p} = \Pr(\text{lowering BP in drug group}) = 20/100 = .2 > .1$ of placebo.
  Use $\hat{p} = .2$ to calculate $n\hat{p}(1-\hat{p}) = 100 \times .2 \times .8 = 16 > 5$.
  The normal approximation is valid: $\hat{p}$ is approximately $N(p, pq/n)$.
  The 95% CI of $p$ is $\hat{p} \pm 1.96\,(\hat{p}\hat{q}/n)^{1/2} = .2 \pm 1.96 \times .04 = .2 \pm .0784 = (.1216, .2784)$.
• Since $p = .1$ for placebo is not in the 95% CI for $p$, we are 95% confident that the drug effect is different from the placebo effect ($p = .1$).

CI: one-sided vs two-sided
• Example (continued): hypertension study. 20 of 100 people treated for high BP have their BP lowered by the drug; placebo lowers BP in 10% of people.
• Q2: Is the drug better than placebo, i.e. is $p > .1$?
  A: The one-sided 95% CI for $p$ is $p > \hat{p} - Z_{1-.05}\,(\hat{p}\hat{q}/n)^{1/2} = .2 - 1.645 \times .04 = .2 - .0658 = .1342$.
  One-sided 95% CI: $(0.1342, +\infty)$.
  Compare with the two-sided CI $(.1216, .2784)$.
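Both intervals for the hypertension example can be reproduced in R, using `qnorm` for the percentiles 1.96 and 1.645. The variable names are illustrative:

```r
x <- 20; n <- 100                           # BP lowered in 20 of 100 treated subjects
p0 <- 0.10                                  # placebo reference rate
p_hat <- x / n
se <- sqrt(p_hat * (1 - p_hat) / n)         # (p_hat * q_hat / n)^(1/2) = 0.04
two_sided <- p_hat + c(-1, 1) * qnorm(0.975) * se   # two-sided 95% CI
one_sided_lower <- p_hat - qnorm(0.95) * se         # bound of the one-sided 95% CI
round(two_sided, 4)
round(one_sided_lower, 4)
p0 < two_sided[1]        # is the placebo rate below the two-sided interval?
p0 < one_sided_lower     # is the placebo rate below the one-sided bound?
```

The one-sided bound lies above the lower end of the two-sided interval because all of the 5% error probability is placed in one tail.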

Estimation for Poisson Distribution
• $X \sim \mathrm{Pois}(\mu)$, where $\mu = \lambda t$ and $\lambda$ is the intensity.
• Estimator: $\hat{\lambda} = \hat{\mu}/t$, estimated by $X/t$.
• Instead of estimating the mean, estimate the intensity, where $t$ can be an area, a time duration, etc.
