
Introduction to Biostatistics (ZJU 2008)


Presentation Transcript


  1. Introduction to Biostatistics (ZJU 2008) Wenjiang Fu, Ph.D Associate Professor Division of Biostatistics, Department of Epidemiology Michigan State University East Lansing, Michigan 48824, USA Email: fuw@msu.edu www: http://www.msu.edu/~fuw

  2. Chapters 4-5 Probability distribution Random Variable (r.v.) • Definition 1. A random variable (r.v.) is a numerical quantity that takes different values with specified probabilities. • Definition 2. A discrete r.v. is a r.v. for which there exists a discrete set of values, each taken with specified probability. • Examples of discrete r.v.'s: number of episodes of a disease/symptoms, heart attacks, diarrhea, blood cell counts. • Definition 3. A continuous r.v. is a r.v. whose values form a continuum, with ranges of values occurring with specified probabilities. • Examples: height, weight, temperature, FEV, blood pressure, etc.

  3. Probability distribution • Definition 4. A probability mass function (PMF) is a mathematical relationship/rule that assigns a probability to each possible outcome: Pr(X = r), the probability distribution. • Example: Hepatitis A. Household with 4 people; r = # of people who contracted H.A.
  r         |  0    1    2    3    4
  Pr(X = r) | .3   .1   .1   .3   .2

  4. Probability distribution Properties of probability: 1. 0 ≤ Pr(X = r) ≤ 1; 2. Total probability ∑_r Pr(X = r) = 1. • Expected value of a discrete r.v.: E(X) = μ = ∑_i x_i Pr(X = x_i), where the x_i's are the values X assumes with positive probability. • Example: μ = 0×.3 + 1×.1 + 2×.1 + 3×.3 + 4×.2 = 2.0. Interpretation: on average, a household with 4 people has 2 people who contracted Hepatitis A; i.e., 2 people are expected to contract H.A. in a household of 4.
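
  A quick R sketch of this calculation (not part of the original slides; it simply encodes the Hepatitis A table above):
    r <- 0:4                      # number of people who contracted H.A.
    p <- c(.3, .1, .1, .3, .2)    # Pr(X = r)
    sum(p)                        # total probability, should be 1
    mu <- sum(r * p)              # expected value E(X) = 2.0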

  5. Probability distribution • Variance of a discrete r.v. X: Var(X) = σ² = ∑_i (x_i − μ)² Pr(X = x_i), where the x_i are the values that X takes with positive probability. • Variance is the expected value of (X − μ)², i.e. Var(X) = E[(X − μ)²], the expected squared distance from the mean. • A short-cut formula: Var(X) = E(X²) − μ² = ∑_i x_i² Pr(X = x_i) − μ².

  6. Probability distribution Example: σ² = (0−2)²×.3 + (1−2)²×.1 + (2−2)²×.1 + (3−2)²×.3 + (4−2)²×.2 = 2.4, or σ² = 0²×.3 + 1²×.1 + 2²×.1 + 3²×.3 + 4²×.2 − 2² = 2.4; SD(X) = σ = 1.55. • A rule that holds in most cases: approximately 95% of the probability mass falls within two SDs of the mean (expected value) of a r.v.: μ ± 2σ = 2.0 ± 2×1.55 = 2.0 ± 3.1 = [−1.1, 5.1].
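
  Continuing the same R sketch (assuming the r, p and mu objects defined above), the variance, SD and two-SD interval:
    sigma2 <- sum((r - mu)^2 * p)        # definition formula: 2.4
    sum(r^2 * p) - mu^2                  # short-cut formula: also 2.4
    sigma <- sqrt(sigma2)                # SD, about 1.55
    c(mu - 2 * sigma, mu + 2 * sigma)    # roughly [-1.1, 5.1]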

  7. Probability distribution • Cumulative distribution function (CDF) of a discrete r.v.: F(x) = Pr(X ≤ x). Example: F(0) = Pr(X ≤ 0) = Pr(X = 0) = .3; F(0.8) = Pr(X ≤ 0.8) = Pr(X = 0) = .3; F(1) = Pr(X ≤ 1) = Pr(X < 1) + Pr(X = 1) = .3 + .1 = .4; F(2) = Pr(X ≤ 2) = Pr(X < 2) + Pr(X = 2) = .4 + .1 = .5; F(3) = Pr(X ≤ 3) = Pr(X < 3) + Pr(X = 3) = .5 + .3 = .8; F(4) = Pr(X ≤ 4) = Pr(X < 4) + Pr(X = 4) = .8 + .2 = 1.
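
  The discrete CDF can be tabulated with cumsum, or evaluated at any x by summing the mass at or below x (same r and p as in the sketch above):
    cumsum(p)                            # F at r = 0,1,2,3,4:  .3 .4 .5 .8 1.0
    cdf <- function(x) sum(p[r <= x])    # F(x) = Pr(X <= x)
    cdf(0.8)                             # .3
    cdf(2)                               # .5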

  8. Probability distribution • Properties of the mean and variance of a distribution • Let X, Y and Z be random variables. If Y = aX + b with constants a and b, then E(Y) = aE(X) + b, Var(Y) = a² Var(X), SD(Y) = |a| SD(X). If Z = aX + bY, then E(Z) = a E(X) + b E(Y).
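
  A small numerical check of these linearity rules, with hypothetical constants a = 2 and b = 3 applied to the Hepatitis A r.v. (reusing the r and p vectors from the sketches above):
    a <- 2; b <- 3
    y <- a * r + b
    sum(y * p)            # E(Y) = a*E(X) + b = 7
    sum((y - 7)^2 * p)    # Var(Y) = a^2 * Var(X) = 9.6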

  9. Binomial distribution • Permutation: the number of permutations of n things taken k at a time is nPk = n(n−1)(n−2)…(n−k+1). • It represents the number of ways of selecting k items out of n when the order of selection matters. 8P5 = 8 × 7 × 6 × 5 × 4 = 6720. nPn = n(n−1)(n−2) × … × 2 × 1 = n!, where n! = n factorial = n × (n−1) × … × 2 × 1. By definition, 0! = 1.

  10. Binomial distribution • Combination: the number of combinations of n things taken k at a time is nCk = n! / [k!(n−k)!] = nPk / k!. • The number of ways to choose 5 doctors from 8 doctors is 8C5 = 8! / (5! 3!) = 56.
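
  These counting rules are built into R (a quick illustrative check):
    prod(8:4)        # 8P5 = 8*7*6*5*4 = 6720
    factorial(8)     # 8! = 40320
    choose(8, 5)     # 8C5 = 56 ways to choose 5 doctors from 8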

  11. Binomial distribution • The binomial distribution: n independent trials, each trial with two outcomes, success (1) or failure (0), with constant probability Pr(success) = p and Pr(failure) = 1 − p = q. • Example: flu infection. 5 independent individuals were together; Pr(contracting flu) = .6. Pr(2 out of 5 contract flu) = ? • 1st question: does order matter (who becomes the 1st case, who the 2nd)? • No. So use combinations.

  12. Binomial distribution • Each possibility: let Fi = {i-th subject got flu} and Fi′ = {i-th subject did not get flu}. • Pr(F1 ∩ F2 ∩ F3′ ∩ F4′ ∩ F5′) = p·p·q·q·q = p²(1−p)³ = .6²×(1−.6)³ = .02304. • Pr(2 out of 5 get flu) = 5C2 p²(1−p)³ = 10 × .02304 = .2304. • The combination number 5C2 = 10 counts the different events of this type: F1∩F2∩F3′∩F4′∩F5′, F1∩F3∩F2′∩F4′∩F5′, F1∩F4∩F2′∩F3′∩F5′, etc.

  13. Binomial distribution • Binomial distribution B(n, p): the distribution of the number of successes in n statistically independent trials, each with probability of success p, is a binomial distribution with probability mass function Pr(X = k) = nCk p^k (1−p)^(n−k), k = 0, 1, …, n. • Example: the ratio of # boys to girls is 2:3 in one classroom. Pr(having 3 boys out of 5 children) = ? p = # boys / # children = 2 / (2+3) = .4. Pr(3 boys out of 5) = 5C3 p³(1−p)² = 10 × .4³ × (1−.4)² = .2304. Pr(having at least 3 boys out of 5 children) = ? Pr(3 boys out of 5) + Pr(4 boys out of 5) + Pr(5 boys out of 5) = .2304 + 5×.0256×.6 + 1×.01024×1 = .3174.
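
  A brief R check of the classroom example (dbinom gives the B(n, p) mass function):
    dbinom(3, size = 5, prob = .4)           # Pr(3 boys out of 5) = .2304
    sum(dbinom(3:5, size = 5, prob = .4))    # Pr(at least 3 boys), about .3174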

  14. Binomial distribution • Binomial distribution B(n, p) • Binomial table: n = 2, 3, …, 20; p = .05, .10, .15, …, .50. • Recursion rule (to simplify the calculation of binomial probabilities in the old days): Pr(X = k+1) = [(n−k)/(k+1)] × (p/q) × Pr(X = k), k = 0, 1, 2, … If Pr(X = 0) is known, then Pr(X = 1) is known, then Pr(X = 2), … • New approach: computer programs such as Splus, R, SAS, etc. In R: dbinom(x, n, p), pbinom(x, n, p), qbinom(x, n, p).
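
  A short R sketch of the recursion rule, checked against dbinom (illustrative values n = 5, p = .4):
    n <- 5; p <- .4; q <- 1 - p
    pr <- numeric(n + 1)
    pr[1] <- q^n                                 # Pr(X = 0)
    for (k in 0:(n - 1))                         # get Pr(X = k+1) from Pr(X = k)
      pr[k + 2] <- (n - k) / (k + 1) * p / q * pr[k + 1]
    all.equal(pr, dbinom(0:n, n, p))             # TRUE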

  15. Binomial distribution • Expected value and variance of the binomial distribution B(n, p): E(X) = np, Var(X) = np(1−p). • Some important points about B(n, p): 1. The mean and variance depend on n and p. 2. The larger the number of trials, the larger the mean and variance. 3. The larger the probability of success p, the larger the mean. 4. Var(X) is small for p very close to 0 or very close to 1; it attains its maximum at p = .5, with Var(X) = n/4. • p = .5: success and failure are equally likely to occur, e.g. tossing a fair coin.
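
  A one-line R illustration of point 4, with a hypothetical n = 10 (the variance np(1−p) peaks at p = .5):
    p <- seq(0, 1, by = .1)
    round(10 * p * (1 - p), 2)    # largest value 2.5 = n/4 occurs at p = .5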

  16. Binomial distribution • Example: asthma possibly caused by pollution from nearby industry. p = Pr(having asthma nationwide) = .03 (reference level). • In a small community of n = 100, we observe 8 cases. Is this alarming evidence of asthma? • Pr(having asthma in the community) = 8/100 = .08 > reference level. • Is this much higher? Usually or unusually high? With a small probability? Is {having 8 cases} a small-probability event? • If {8 cases} is unusually high, then {9 cases}, {10 cases}, … are all unusually high. • Criterion to use: is {having at least 8 cases} a small-probability event?

  17. Binomial distribution • So we need Pr(at least 8 cases) = Pr(X ≥ 8). • Direct calculation would need 100 − 8 + 1 = 93 probabilities, so use the complement: Pr(at least 8 cases) = 1 − Pr(at most 7 cases) = 1 − ∑_{k=0}^{7} Pr(X = k) ≈ 0.01, a small-probability event (< 0.05) based on the reference. Such a small-probability event is very unlikely to occur by chance. Interpretation: since we have nevertheless observed it, we believe this is an unusual observation; i.e., observing 8 cases out of 100 in the community is alarmingly high.
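
  The same tail probability in one line of R (pbinom is the binomial CDF):
    1 - pbinom(7, size = 100, prob = .03)    # Pr(X >= 8) under B(100, .03), about .01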

  18. Poisson distribution • Three assumptions for the Poisson distribution: 1. The event is rare, i.e. Pr(observing 1 event in a small time interval Δt) ≈ λΔt and Pr(observing > 1 event in Δt) ≈ 0, so Pr(0 events in Δt) ≈ 1 − λΔt. 2. Stationarity: the expected number of events per unit time remains the same during the entire duration of time. 3. Independence: the outcome in one time interval does not affect the probability in any non-overlapping time interval. • Poisson distribution: Pr(X = k) = e^(−μ) μ^k / k!, k = 0, 1, 2, …, where μ = λt, t is the time duration or area, and λ is the intensity per unit. • Notice that k has no upper bound or ceiling.
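
  A quick R check that dpois matches the formula above (hypothetical values k = 3, mu = 2.5):
    k <- 3; mu <- 2.5
    exp(-mu) * mu^k / factorial(k)    # Poisson mass from the formula
    dpois(k, mu)                      # same value from R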

  19. Poisson distribution • Mean and variance of the Poisson distribution: E(X) = ∑_k k Pr(X = k) = ∑_k k e^(−μ) μ^k / k! = μ; Var(X) = ∑_k k² Pr(X = k) − [E(X)]² = μ. • That is why the Poisson distribution has only one parameter, μ. • Example: assume 3 traffic accidents are expected in the city of Detroit every day in the winter (Nov – Feb), while only 1 accident is expected per day the rest of the year, due to the weather conditions. If 5 accidents were observed on one day, was this an alarming event? • Which distribution? There is no cap or upper bound on the number of events (no fixed n), so use the Poisson distribution. • Warning: only use this distribution over a time period in which the intensity λ remains constant; do not use it for the whole year! Why?

  20. Poisson distribution • Notice the 2 different intensities: winter (λ = 3) and other seasons (λ = 1). • If the day was in winter, λ = 3: Pr(k ≥ 5) = 1 − [Pr(k=0) + … + Pr(k=4)] = 1 − 0.815 = 0.185. If the day was in summer, λ = 1: Pr(k ≥ 5) = 1 − [Pr(k=0) + … + Pr(k=4)] = 1 − 0.996 = 0.004, a small-probability event. • Conclusion: if 5 accidents were observed on a summer day, it was alarmingly high, but not on a winter day. • How would you work out the probability of observing 20 accidents in 2 consecutive months (Feb & March) together?
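
  The two tail probabilities in R (ppois is the Poisson CDF):
    1 - ppois(4, lambda = 3)    # winter day: Pr(k >= 5), about .185
    1 - ppois(4, lambda = 1)    # other seasons: Pr(k >= 5), about .004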

  21. Gaussian (Normal) distribution • Continuous r.v. X ~ Gaussian distribution N(μ, σ²) with probability density function (PDF) f(x) = [1/(σ√(2π))] exp[−(x − μ)²/(2σ²)], for parameters μ and σ with σ > 0. The 2 parameters μ and σ² determine the distribution: the mean μ for location and the variance σ² for shape. • X may take any real number, either > 0 or < 0. • f is symmetric about μ. • F: the cumulative distribution function (CDF), F(x) = Pr(X ≤ x).

  22. Standard normal distribution • X ~ standard normal distribution N(0, 1), with probability density function (PDF) φ(x) = (1/√(2π)) exp(−x²/2). X may take any real number, either > 0 or < 0. • φ is symmetric about 0. • Φ: the cumulative distribution function (CDF), Φ(x) = Pr(X ≤ x), a very useful function in statistics. Φ(−x) = 1 − Φ(x), frequently used for the calculation of p-values.
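
  In R, dnorm is the standard normal density and pnorm is the CDF (a minimal sketch):
    dnorm(0)         # phi(0) = 1/sqrt(2*pi), about .3989
    pnorm(1.96)      # Phi(1.96), about .975
    pnorm(-1.96)     # equals 1 - pnorm(1.96), about .025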

  23. Properties of N(0, 1) • Properties of the standard normal N(0, 1) PDF f(x), −∞ < x < ∞: • 1) Symmetric about 0: f(x) = f(−x). • 2) Pr(−1 ≤ X ≤ 1) = .6827, i.e. about 68% (more than 2/3) of the area lies in [−1, 1]. Pr(−1.96 ≤ X ≤ 1.96) = .95, i.e. about 95% of the area lies between −1.96 and 1.96. Pr(−2.576 ≤ X ≤ 2.576) = .99, i.e. about 99% of the area lies between −2.576 and 2.576.
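
  These three areas can be verified directly with pnorm (an illustrative check):
    pnorm(1) - pnorm(-1)            # about .6827
    pnorm(1.96) - pnorm(-1.96)      # about .95
    pnorm(2.576) - pnorm(-2.576)    # about .99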

  24. Illustration of N(μ, σ²) [Figure: normal density curves marking that about 68% of the area lies within 1 SD of the mean, about 95% (95.4% within 2 SDs), and about 99.7% within 3 SDs.]

  25. Special notation for N(0, 1) • Definition: the 100 × u-th percentile of N(0, 1) is denoted by Zu: Pr(X ≤ Zu) = u, where X ~ N(0, 1); then Φ(Zu) = Pr(X ≤ Zu) = u. • Frequently used quantiles: Z.975, Z.95, Z.5, Z.05, Z.025. Φ(1.96) = .975, Φ(1.645) = .95, Φ(0) = .5, Φ(−1.645) = 1 − Φ(1.645) = 1 − .95 = .05, Φ(−1.96) = 1 − Φ(1.96) = 1 − .975 = .025. So Z.975 = 1.96, Z.95 = 1.645, Z.5 = 0, Z.05 = −1.645, Z.025 = −1.96.
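
  The same quantiles from qnorm, the inverse of Phi (values rounded to the figures quoted above):
    qnorm(c(.975, .95, .5, .05, .025))    # 1.96  1.645  0  -1.645  -1.96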

  26. Calculation of probability • X ~ N(μ, σ²). Calculate Pr(a < X < b). • Let Z = (X − μ)/σ; then Z ~ N(0, 1), so use Φ. • Pr(a < X < b) = Pr{[(a−μ)/σ] < [(X−μ)/σ] < [(b−μ)/σ]} = Pr{[(a−μ)/σ] < Z < [(b−μ)/σ]} = Φ[(b−μ)/σ] − Φ[(a−μ)/σ]. • Example: hypertension. SBP X ~ N(80, 144). Pr(90 < X < 95) = Pr[(90 − 80)/12 < Z < (95 − 80)/12] = Pr(.83 < Z < 1.25) = Φ(1.25) − Φ(.83) = .8944 − .7967 = .098.
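
  The hypertension probability in R, via standardized z-scores and directly with mean/sd arguments (the small difference comes from rounding z to two decimals in the table calculation):
    pnorm(1.25) - pnorm(.83)                                        # about .098
    pnorm(95, mean = 80, sd = 12) - pnorm(90, mean = 80, sd = 12)   # about .097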

  27. Calculation of probability • Example: cerebrovascular disease. To diagnose stroke, use cerebral blood flow (CBF); clinically, a patient is diagnosed as at risk if CBF < 40. • Assume normal people's CBF has a normal distribution with mean 75 and SD = 17. Find the percentage of normal people mistakenly classified by CBF as stroke patients. • Let X be CBF in a normal person. Then X ~ N(75, 17²). We need Pr(X < 40). • Pr(X < 40) = Pr(Z < (40 − 75)/17) = Pr(Z < −2.06) = Φ(−2.06) = 1 − Φ(2.06) = 1 − .9803 = .02 = 2%.

  28. Approximation of distributions • Bin(n, p) can be difficult to calculate, and Pois(μ) can also be difficult to calculate, but N(0, 1) is easy. • Goal: use the normal distribution to approximate the others, so that the computation is much easier while the accuracy of the probabilities is not compromised much.

  29. Pois(μ) approximation to Bin(n, p) • Basic judgment: the two distributions must be close enough to each other, i.e. their parameters are close: μ1 and μ2 are close, and σ1² and σ2² are close. • Bin(n, p) has mean np and variance npq; Pois(μ) has mean μ and variance μ. • So we need np ≈ npq, with np moderate. • When n is large, p is small (q = 1 − p close to 1) and np is moderate, we can approximate X ~ Bin(n, p) with Y ~ Pois(np).
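
  A quick R comparison of the two mass functions when n is large and p is small (using n = 100, p = .03 from the asthma example):
    k <- 0:10
    round(cbind(binomial = dbinom(k, 100, .03), poisson = dpois(k, 3)), 4)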

  30. Normal approximation to Bin(n, p) • Bin(n, p) has mean np and variance npq; N(μ, σ²) has mean μ and variance σ². • Bin(n, p) needs to be roughly symmetric. • When npq ≥ 5, we can approximate X ~ Bin(n, p) with Y ~ N(np, npq), using the continuity correction: Pr(X = k) ≈ Pr(k − .5 < Y < k + .5); Pr(X ≥ k) ≈ Pr(Y > k − .5); Pr(X ≤ k) ≈ Pr(Y < k + .5). • To use the normal approximation we also want n ≥ 20. Why? • If p is so small that npq < 5, do not use the normal approximation; use the Poisson approximation instead.
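
  A short R sketch of the continuity correction against the exact binomial, with hypothetical n = 50, p = .4 (so npq = 12 ≥ 5):
    n <- 50; p <- .4; mu <- n * p; sigma <- sqrt(n * p * (1 - p))
    dbinom(20, n, p)                                   # exact Pr(X = 20)
    pnorm(20.5, mu, sigma) - pnorm(19.5, mu, sigma)    # normal approximation
    1 - pbinom(24, n, p)                               # exact Pr(X >= 25)
    1 - pnorm(24.5, mu, sigma)                         # approximate Pr(Y > 24.5)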

  31. Normal approximation to Pois(μ) • Pois(μ) has mean μ and variance μ; N(μ, σ²) has mean μ and variance σ². • Pois(μ) needs to be roughly symmetric. • When μ ≥ 10, we can approximate X ~ Pois(μ) with Y ~ N(μ, μ), using the continuity correction: Pr(X = k) ≈ Pr(k − .5 < Y < k + .5); Pr(X ≥ k) ≈ Pr(Y > k − .5); Pr(X ≤ k) ≈ Pr(Y < k + .5).
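
  The analogous check for the Poisson case, with a hypothetical μ = 10:
    1 - ppois(14, lambda = 10)                   # exact Pr(X >= 15)
    1 - pnorm(14.5, mean = 10, sd = sqrt(10))    # normal approximation with continuity correction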
