Section II
Descriptive stats for continuous data
Descriptive stats for binary data and bivariate associations in binary data

Types of data
Numerical:
  Continuous: age, SBP, glucose
  Interval: parity, number of infections
Ordinal (ranks): cancer stage, Apgar score
Nominal (no order): gender, ethnicity, treatment

Dataset used to illustrate some statistics in this section
Stomach cancer survival times in controls (Cameron & Pauling, PNAS, Oct 1976)
Days from end of treatment to death: 4, 6, 8, 8, 12, 14, 15, 17, 19, 22, 24, 34, 45
n = 13 subjects

Measures of central tendency (middle)
Data: 4, 6, 8, 8, 12, 14, 15, 17, 19, 22, 24, 34, 45
mean = 17.5 days
median = 15 days
mode = 8 days
Geometric mean: $\mathrm{GM} = \sqrt[13]{4 \times 6 \times 8 \times 8 \times \cdots \times 45} = 14.25$ days
If we delete the most extreme value, 45, the mean is now 15.25, the median is 14.5, and GM = 13.0. The median changes least.

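The figures on this slide can be checked with Python's standard `statistics` module. This is a minimal sketch, assuming Python 3.8 or later (needed for `statistics.geometric_mean`); the helper name `summarize` and the variable names are illustrative and not part of the slides.

```python
import statistics

# Stomach cancer survival times in days, as listed on the slide (already sorted)
days = [4, 6, 8, 8, 12, 14, 15, 17, 19, 22, 24, 34, 45]

def summarize(values):
    """Return the mean, median and geometric mean of positive numbers."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "gm": statistics.geometric_mean(values),
    }

full = summarize(days)
trimmed = summarize(days[:-1])  # drop the most extreme value, 45
mode = statistics.mode(days)

# Show how much each summary moves when the extreme value is removed
for name in ("mean", "median", "gm"):
    change = full[name] - trimmed[name]
    print(f"{name:6s} full={full[name]:6.2f} trimmed={trimmed[name]:6.2f} change={change:5.2f}")
print("mode =", mode)
```

The printed change is smallest for the median, which is the point of the slide.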
Mean versus Median (lesson #1 in how to lie with statistics)
Yearly income data from n = 11 persons: one income is for Dr Brilliant, the other 10 incomes are from her 10 graduate students.
Yearly income in dollars:
950, 960, 970, 980, 990, 1010, 1020, 1030, 1040, 1050, 100,000
Total = $110,000
mean = 110,000/11 = $10,000, median = $1010 (the sixth ordered value)
Which is the better summary of the "typical" value?

Example - Survival times in women with advanced Breast Cancer
Survival time in days after end of radiotherapy

woman    after 275 days f/u    after 305 days f/u
1        14                    14
2        26                    26
3        43                    43
4        45                    45
5        50                    50
6        58                    58
7        60                    60
8        62                    62
9        70                    70
10       70                    70
11       83                    83
12       98*                   128*
13       104*                  134*
14       124*                  154*
15       125*                  155*
16       275*                  305*
mean     75.6                  83.1
median   66.0                  66.0
SD       55.8                  66.3

* still alive (censored)
The median is still a valid measure when less than half the data are censored.

Cumulative frequencies & survival

Days     num dead   pct dead   cum dead   cum pct dead   cum pct alive = S
1-10     4          30.8       4          30.8           69.2
11-20    5          38.5       9          69.2           30.8
21-30    2          15.4       11         84.6           15.4
31-40    1          7.7        12         92.3           7.7
41-50    1          7.7        13         100.0          0
total    13

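The cumulative frequency table can be rebuilt from the 13 stomach cancer survival times. The sketch below assumes 10-day bins starting at day 1, as in the table; the tuple layout and variable names are illustrative.

```python
# Stomach cancer survival times in days (n = 13, all died)
days = [4, 6, 8, 8, 12, 14, 15, 17, 19, 22, 24, 34, 45]
n = len(days)

rows = []
cum_dead = 0
for lo in range(1, 51, 10):          # bins 1-10, 11-20, ..., 41-50
    hi = lo + 9
    dead = sum(lo <= d <= hi for d in days)
    cum_dead += dead
    rows.append((
        f"{lo}-{hi}",
        dead,
        100 * dead / n,              # percent dying in this interval
        cum_dead,
        100 * cum_dead / n,          # cumulative percent dead
        100 * (n - cum_dead) / n,    # cumulative percent alive = S
    ))

for label, dead, pct, cum, cum_pct, alive in rows:
    print(f"{label:6s} {dead:2d} {pct:5.1f} {cum:3d} {cum_pct:6.1f} {alive:6.1f}")
```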
Summarizing mortality: hazard rates
$h = \dfrac{\text{number of persons with outcome}}{\text{total person-time of follow-up in all at risk}}$
This is a rate per person-time. It is NOT a probability (not a risk).
In stomach cancer, n = 13 with 13 deaths; total follow-up is
4+6+8+8+12+14+15+17+19+22+24+34+45 = 228 person-days
Hazard rate = mortality rate = 13/228 = 0.057, or 5.7 deaths per 100 person-days of follow-up.
Do NOT report this as 5.7%; that is wrong.

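As a quick check of the arithmetic, the hazard rate can be computed directly from the survival times. This is a small sketch with illustrative variable names; it relies on the fact that every subject in this dataset died, so each survival time is also that subject's full follow-up time.

```python
# Stomach cancer survival times in days; all 13 subjects died (no censoring)
days = [4, 6, 8, 8, 12, 14, 15, 17, 19, 22, 24, 34, 45]

deaths = len(days)            # number of persons with the outcome
person_days = sum(days)       # total person-time of follow-up
hazard = deaths / person_days

print(f"total follow-up = {person_days} person-days")
print(f"hazard = {hazard:.3f} deaths per person-day")
print(f"       = {100 * hazard:.1f} deaths per 100 person-days")
# With no censoring, 1/hazard is the mean time to event
print(f"1/hazard = {1 / hazard:.2f} days")
```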
Example: Why hazard rates?

Group   n     num dead   mean f/u   total f/u   rate per 1000
A       100   7          36         3600        7/3600 = 1.94
B       100   2          3          300         2/300 = 6.67

The mortality rate is higher for B than for A, even though the number of persons in each group is the same and more people died in group A: group B accumulated far less follow-up time.
The hazard rate ratio for A/B is 1.94/6.67 = 0.291.
When ALL patients are followed to the endpoint (no censoring), mean time to event = 1/hazard.

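The group comparison can be reproduced as follows. The function name `hazard_rate` and its `per` argument are hypothetical helpers introduced only for this sketch.

```python
def hazard_rate(events, person_time, per=1000):
    """Events per `per` units of person-time (a rate, not a probability)."""
    return per * events / person_time

# Group A: 7 deaths over 3600 units of follow-up; group B: 2 deaths over 300
rate_a = hazard_rate(7, 3600)
rate_b = hazard_rate(2, 300)
hr_a_vs_b = rate_a / rate_b   # hazard rate ratio, A compared to B

print(f"A: {rate_a:.2f} per 1000, B: {rate_b:.2f} per 1000")
print(f"hazard rate ratio A/B = {hr_a_vs_b:.3f}")
```

Computed from the unrounded rates, the ratio is about 0.292; the slide's 0.291 comes from dividing the rounded rates.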
Hazard rates & survival curves
$-\log_e(S) = \text{cumulative hazard} = h\,t$, so h is the (average) slope of $-\log_e(S)$ vs t.

Hazard rate ratios & Survival curves
ha = hazard rate in group A
hb = hazard rate in group B
The hazard rate ratio (HR) for A compared to B is HR = ha/hb.
If HR is constant over time, one can compute the survival in group A from the survival in group B:
$S_a = S_b^{HR}$
Ex: HR = 0.291 and S at t = 12 months is 90% in group B, so $S_a = 0.90^{0.291} = 0.970$, or 97.0% in group A at t = 12 months.
A "protective" HR < 1 increases survival. HR > 1 decreases survival.

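The conversion from a reference survival to the survival in the comparison group is a one-line power calculation. In the sketch below the function name `survival_from_hr` is illustrative, and the HR = 2 case is a hypothetical value added only to show the opposite direction.

```python
def survival_from_hr(s_ref, hr):
    """Survival in the comparison group, assuming HR is constant over time."""
    return s_ref ** hr

s_b = 0.90                                     # survival in group B at 12 months
s_a = survival_from_hr(s_b, 0.291)             # protective HR < 1 from the slide
s_harm = survival_from_hr(s_b, 2.0)            # hypothetical harmful HR > 1

print(f"HR = 0.291: S_a = {s_a:.3f}")          # higher than 0.90
print(f"HR = 2.0:   S   = {s_harm:.3f}")       # lower than 0.90
```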
Cumulative hazard rate
$-\log_e S(t) = \text{cumulative hazard} = \sum_t h_i = \int h(t)\,dt$
If h is constant over time, cumulative hazard = h T, where T is the follow-up time.
In this case, h = cumulative hazard / T, and h is the slope of the cumulative hazard vs t plot.

From: Risks and Benefits of Estrogen Plus Progestin in Healthy Postmenopausal Women: Principal Results From the Women's Health Initiative Randomized Controlled Trial. JAMA. 2002;288(3):321-333.
HR indicates hazard ratio; nCI, nominal confidence interval; and aCI, adjusted confidence interval.
Global index = first occurrence of CHD, cancer, stroke, pulmonary embolism, hip fracture or death.

Distribution skewness Long right tailed distribution median < mean (common for survival data)
Example: ICU length of stay (Howard)
n = 94, mean = 11.3 days, median = 6 days, min = 1 day, max = 80 days
Skewness Long left tailed distribution median > mean (not as common in biology/medicine)
Symmetric (common in biology): mean = median
A distribution can be symmetric without being bell-curve shaped; the example here has one mode.
When data has a skewed distribution, must use "non parametric" methods.
Measures of variation, spread IQR – interquartile range
Box-whisker plot
[Figure labels: min, Q1, median, mean, Q3, max]
Variation: Variance & SD
Mean = Ȳ = 17.54 days

Y      Y − Ȳ     (Y − Ȳ)²
4      -13.54    183.3
6      -11.54    133.2
8      -9.54     91.0
8      -9.54     91.0
12     -5.54     30.7
14     -3.54     12.5
15     -2.54     6.5
17     -0.54     0.3
19     1.46      2.1
22     4.46      19.9
24     6.46      41.7
34     16.46     270.9
45     27.46     754.1
sum    0         1637.2

$\text{Variance} = \dfrac{\sum_i (Y_i - \bar{Y})^2}{n-1}$
Var = 1637.2/12 = 136.4
SD = √Variance = √136.4 = 11.7 days

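The variance and SD can be verified both by the definition and with the standard library. This is a minimal sketch; `statistics.variance` and `statistics.stdev` use the same n − 1 divisor as the slide, and the variable names are illustrative.

```python
import math
import statistics

days = [4, 6, 8, 8, 12, 14, 15, 17, 19, 22, 24, 34, 45]
n = len(days)
mean = sum(days) / n

# Sum of squared deviations from the mean, then divide by n - 1
sum_sq = sum((y - mean) ** 2 for y in days)
variance = sum_sq / (n - 1)
sd = math.sqrt(variance)

print(f"sum of squares = {sum_sq:.1f}")
print(f"variance = {variance:.1f}, SD = {sd:.1f} days")

# The library functions give the same sample variance and SD
lib_variance = statistics.variance(days)
lib_sd = statistics.stdev(days)
```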
Variation: Interpreting the SD
Rule of thumb from Gaussian ("Normal") theory (will study more shortly). The rule is OK if the data has a unimodal symmetric distribution.
Range of middle 2/3 of the data: mean +/- SD
Range of middle 95% of the data: mean +/- 2 SD
Implies SD ≈ range/4 (after extreme values are removed from the range)

SD of differences: paired data
chol in mmol/L

person   chol at start   chol at end   difference
1        12.6            10.0          2.6
2        8.5             7.5           1.0
3        7.0             5.8           1.2
4        6.9             4.9           2.0
5        5.8             4.0           1.8
6        4.1             3.8           0.3
mean     7.48            6.00          1.48
SD       2.90            2.38          0.82

Corr of start vs end: r = 0.971

If authors only report (mmol/L)

        start   end     change
mean    7.48    6.00    ??
SD      2.90    2.38    ??

Easy to get the mean difference = 7.48 − 6.00 = 1.48
But can't get the SD of differences by subtraction: 2.90 − 2.38 = 0.52 ≠ 0.82
The 1.48 mean diff is the average response. The 0.82 diff SD is the variation in response.
$SD_{diff} = \sqrt{SD_{start}^2 + SD_{end}^2 - 2\,r\,SD_{start}\,SD_{end}}$, where r = correlation coefficient

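The formula for the SD of paired differences can be checked against the raw cholesterol data. The sketch below computes the Pearson correlation by hand so that it runs on older Python versions; all variable names are illustrative.

```python
import math
import statistics

start = [12.6, 8.5, 7.0, 6.9, 5.8, 4.1]   # chol at start, mmol/L
end = [10.0, 7.5, 5.8, 4.9, 4.0, 3.8]     # chol at end, mmol/L
diff = [s - e for s, e in zip(start, end)]

sd_start = statistics.stdev(start)
sd_end = statistics.stdev(end)
sd_diff_direct = statistics.stdev(diff)   # SD computed from the differences

# Pearson correlation r = sample covariance / (SD_start * SD_end)
n = len(start)
m_start, m_end = statistics.mean(start), statistics.mean(end)
cov = sum((s - m_start) * (e - m_end) for s, e in zip(start, end)) / (n - 1)
r = cov / (sd_start * sd_end)

# SD of differences recovered from the two SDs and r only
sd_diff_formula = math.sqrt(sd_start**2 + sd_end**2 - 2 * r * sd_start * sd_end)

print(f"r = {r:.3f}")
print(f"SD of differences: direct = {sd_diff_direct:.2f}, formula = {sd_diff_formula:.2f}")
```

Both routes give the same SD, which is why a paper that reports only the two SDs without r does not give enough information to recover the variation in response.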
SD of differences: two independent groups
Comparing ages in groups A vs B
[Table: ages in groups A and B]

Rule for SD of differences: two independent groups
Var(Y − X) = Var(Y) + Var(X)
Var(Y + X) = Var(Y) + Var(X)
$SD(Y - X) = \sqrt{SD^2(Y) + SD^2(X)}$
$SD(Y + X) = \sqrt{SD^2(Y) + SD^2(X)}$
[Figure labels: SD(Y−X), SD(Y), SD(X)]

BINARY DATA Statistics
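Associations for Binary data
2 x 2 table layout implied by the formulas:

             disease   no disease
exposed      a         b
unexposed    c         d

risk = P, odds = O
Pe = a/(a+b)    Oe = a/b
Pu = c/(c+d)    Ou = c/d
RR = Pe/Pu      OR = Oe/Ou
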
Risk vs Odds
P = risk, O = odds
O = P/(1 − P), P = O/(1 + O)
Ex: P = 1/10 gives O = 1/9.
Risk = num sick / total
Odds = num sick / num not sick
RR = OR/(1 − Pu + OR × Pu)
When Pu is small, RR ≈ OR.
In general, OR is more extreme (farther from 1) than RR.

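The conversions between risk and odds, and from OR to RR, are short enough to write out. In this sketch the function names are illustrative, and the risks of 40% (exposed) and 20% (unexposed) are example values chosen to give RR = 2.

```python
def odds(p):
    """Convert a risk (probability) to odds."""
    return p / (1 - p)

def risk(o):
    """Convert odds back to a risk (probability)."""
    return o / (1 + o)

def rr_from_or(odds_ratio, p_unexposed):
    """Risk ratio implied by an odds ratio and the unexposed risk Pu."""
    return odds_ratio / (1 - p_unexposed + odds_ratio * p_unexposed)

print(odds(0.1))                       # P = 1/10 gives O = 1/9

# Example risks: 40% in exposed, 20% in unexposed
p_e, p_u = 0.40, 0.20
odds_ratio = odds(p_e) / odds(p_u)
risk_ratio = p_e / p_u
print(f"OR = {odds_ratio:.2f}, RR = {risk_ratio:.2f}")
print(f"RR recovered from OR = {rr_from_or(odds_ratio, p_u):.2f}")
```

Here the OR (about 2.67) is farther from 1 than the RR of 2, as the slide states.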
Oral Contraceptive exposure vs Cancer Prospective study (unbiased est of pop)
Ratios and differences
For rare events or diseases: Pe = 1/10,000, Pu = 1/100,000
RR = 10, risk difference = 9/100,000
Misleading to only report the ratio and not the actual risks.

Why use ORs?
1. In a prospective study, we usually quote disease risk & risk ratio (RR). In case-control, we always quote OR, not RR. The case-control OR of exposure in disease/no disease equals the prospective OR of disease in exposed/unexposed in the population if the probability of exposure is the same as in the target population. (Not necessarily true if there is confounding or bias.)
2. OR is more "stable" (universal) across studies.
If unexposed risk = 20% and RR = 2, exposed risk = 40%.
If unexposed risk = 60%, RR can't be 2, because the exposed risk would have to be 120%.

Independence rule for ORs
ORs for heart attack (MI)
For smokers/non smokers: OR = 4
For alcohol/no alcohol: OR = 2
If independent, the OR for those who smoke AND drink alcohol is 4 x 2 = 8 (relative to no smoke, no alcohol).
Only true if smoking and drinking are independent influences on MI. However, smoking & drinking can be correlated with each other.

NNT: number needed to treat (or harm) (clinical trials)
Pc (like Pu) = proportion with disease in control group
Pt (like Pe) = proportion with disease in treatment group
ARR = absolute risk reduction = risk difference = RD = Pc − Pt
RRR = relative risk reduction = (Pc − Pt)/Pc = ARR/Pc = 1 − RR
NNT = number needed to treat = 1/ARR

NNT Example
Pc = 0.36 = 36%, Pt = 0.34 = 34%
ARR = RD = 0.02 = 2%
RRR = 0.02/0.36 = 5.6% (a percent of a percent)
NNT = 1/0.02 = 50
So 50 patients must be given the treatment to prevent one additional case of disease.
Can be extended to more complex stats.

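The NNT example can be reproduced in a few lines. This is a sketch with an illustrative function name; the values Pc = 0.36 and Pt = 0.34 are the ones from the slide.

```python
def treatment_effect(p_control, p_treat):
    """Return (ARR, RRR, NNT) from the control and treatment disease risks."""
    arr = p_control - p_treat      # absolute risk reduction = risk difference
    rrr = arr / p_control          # relative risk reduction = 1 - RR
    nnt = 1 / arr                  # number needed to treat
    return arr, rrr, nnt

arr, rrr, nnt = treatment_effect(0.36, 0.34)
print(f"ARR = {100 * arr:.1f}%")   # percentage points
print(f"RRR = {100 * rrr:.1f}%")   # a percent of a percent
print(f"NNT = {nnt:.0f}")
```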
NNT–Ovarian Ca screening “Tests commonly recommended to screen healthy women for ovarian cancer do more harm than good and should not be performed, a panel of medical experts said on Monday. The screenings —blood tests for a substance linked to cancer and ultrasound scans to examine the ovaries — do not lower the death rate from the disease, and they yield many false-positive results that lead to unnecessary operations with high complication rates, the panel said. … “To find one case of ovarian cancer, 20 women had to undergo surgery. “ (NY Times–10 Sept 2012)
Summary: Ratios

          Risk          Odds          Hazard
symbol    P             O             h
ratio     RR = Pe/Pu    OR = Oe/Ou    HR = he/hu

All have the null value of 1.0 when there is no association.
The distribution of the logs of these ratios from study to study is usually bell-curve shaped around the true log scale value.

Sensitivity and Specificity
a = true positives, b = false positives, c = false negatives, d = true negatives
Sensitivity = a/(a+c), false negative rate = c/(a+c)
Specificity = d/(b+d), false positive rate = b/(b+d)
Positive predictive value = PPV = a/(a+b) *
Negative predictive value = NPV = d/(c+d) *
* Depends on disease prevalence; not just an attribute of the test

Sensitivity, Specificity, Accuracy
Accuracy = W × Sensitivity + (1 − W) × Specificity, where 0 < W < 1. Often W = 0.5 (unweighted accuracy).
We wish to maximize accuracy, which is the same as minimizing misclassification = 1 − Accuracy.
Choose W depending on "costs".

ROC curve–choose continuous data cutpoint (threshold) for highest accuracy, best “separation”
“Modern” format for ROC Highest accuracy is NOT necessarily where sens=spec, only when SD1=SD2
C (concordance) statistic for ROC
C = area under the "traditional" ROC curve
0.5 (bad) < C < 1.0 (good)
n_d = a + c = true number with disease
n_nd = b + d = true number without disease
From all possible n_d x n_nd pairs with one diseased and one not, call a pair "concordant" if the diseased is positive and the non diseased is negative. C is the proportion of the pairs that are concordant.

Positive and Negative predictive value
Positive predictive value (PPV) & negative predictive value (NPV) depend on sensitivity (sens), specificity (spec) & disease prevalence (P).
Sensitivity and specificity do NOT depend on disease prevalence.
Can only compute PPV = a/(a+b) & NPV = d/(c+d) directly from the table when the disease prevalence is P = (a+c)/(a+b+c+d) = (a+c)/n.
Bayes formulas for PPV and NPV
Let P = prevalence of disease
PPV = test true pos / (test true pos + test false pos) = sens x P / [ sens x P + (1 − spec) x (1 − P) ]
NPV = test true neg / (test true neg + test false neg) = spec x (1 − P) / [ spec x (1 − P) + (1 − sens) x P ]
But don't use these formulas; there is an easier way.

Example
Sens = 95/100 = 0.95, Spec = 1980/2000 = 0.99, disease prevalence = P = 100/2100 = 0.0476
PPV = (0.95 x 0.0476) / [ 0.95 x 0.0476 + 0.01 x 0.9524 ] = 0.826
PPV = 95/115 = 0.826
NPV = (0.99 x 0.9524) / [ 0.99 x 0.9524 + 0.05 x 0.0476 ] = 0.9974
NPV = 1980/1985 = 0.9974
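Both routes to PPV and NPV in this example, the Bayes formulas and the direct counts, can be compared in code. This is a minimal sketch: the function names are illustrative, the cell counts (95 true positives, 20 false positives, 5 false negatives, 1980 true negatives) follow from the example's fractions, and the 1% prevalence at the end is a hypothetical value added only to show how strongly PPV depends on prevalence.

```python
def ppv(sens, spec, prev):
    """Positive predictive value from Bayes' formula."""
    return sens * prev / (sens * prev + (1 - spec) * (1 - prev))

def npv(sens, spec, prev):
    """Negative predictive value from Bayes' formula."""
    return spec * (1 - prev) / (spec * (1 - prev) + (1 - sens) * prev)

# Cell counts implied by the example
tp, fp, fn, tn = 95, 20, 5, 1980
sens = tp / (tp + fn)                    # 95/100
spec = tn / (tn + fp)                    # 1980/2000
prev = (tp + fn) / (tp + fp + fn + tn)   # 100/2100

print(f"PPV: Bayes = {ppv(sens, spec, prev):.3f}, counts = {tp / (tp + fp):.3f}")
print(f"NPV: Bayes = {npv(sens, spec, prev):.4f}, counts = {tn / (tn + fn):.4f}")

# Same test, hypothetical prevalence of 1%: PPV drops sharply
ppv_low_prev = ppv(sens, spec, 0.01)
print(f"PPV at 1% prevalence = {ppv_low_prev:.2f}")
```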