Introductory Statistics John Matthews, Professor of Medical Statistics, School of Mathematics and Statistics Janine Gray, Senior Lecturer and Deputy Director, Newcastle Clinical Trials Unit University of Newcastle-upon-Tyne
Course Outline • Data Description • Mean, Median, Standard Deviation • Graphs • The Normal Distribution • Populations and Samples • Confidence intervals and p-values • Estimation and Hypothesis testing • Continuous data • Categorical data • Regression and Correlation
Course Objectives • To have an understanding of the Normal distribution and its relationship to common statistical analyses • To have an understanding of basic statistical concepts such as confidence intervals and p-values • To know which analysis is appropriate for different types of data
Recommended Textbooks • Swinscow TDV and Campbell MJ. Statistics at Square One (10th edn). BMJ Books • Altman DG. Practical Statistics for Medical Research. Chapman and Hall • Bland M. An Introduction to Medical Statistics. Oxford Medical Publications • Campbell MJ & Machin D. Medical Statistics A Commonsense Approach. Wiley
Other reading • Chinn S. Statistics for the European Respiratory Journal. Eur Respir J 2001; 18:393-401 • www.mas.ncl.ac.uk/~njnsm/medfac/MDPhD/notes.htm • BMJ statistics notes
Types of Data • Numerical Data • discrete • number of lesions • number of visits to GP • continuous • height • lesion area
Types of Data • Categorical • unordered • Pregnant/Not pregnant • married/single/divorced/separated/widowed • ordered (ordinal) • minimal/moderate/severe/unbearable • Stage of breast cancer: I II III IV
Exercise • What type are the following variables? a) sex b) diastolic blood pressure c) diagnosis d) height e) family size f) cancer stage
Types of Data • Outcome/Dependent variable • outcome of interest • e.g. survival, recovery • Explanatory/Independent variable • treatment group • age • sex
Summary Statistics • Location • Mean (average value) • Median (middle value) • Mode (most frequently occurring value) • Variability • Variance/SD • Range • Centiles
Birthweights (g) at 40 weeks Gestation • mean = 3441g • median = 3428g • sd = 434g • min = 2050g • max = 4975g • range = 2925g
Symmetric Data • mean = median (approx) • standard deviation
Skew Data • median = "typical" value • mean affected by extreme values - larger than median • SD fairly meaningless • centiles (less affected by extreme values/outliers)
Half of all doctors are below average…. • Even if all surgeons are equally good, about half will have below average results, one will have the worst results, and the worst results will be a long way below average • Ref. BMJ 1998; 316:1734-1736
Summarising data - Summary • Choosing the appropriate summary statistics and graph depends upon the type of variable you have • Categorical (unordered/ordered) • Continuous (symmetric/skew)
The Normal Distribution • N(μ, σ²) • μ: unknown population mean - estimate using sample mean • σ: unknown population SD - estimate using sample SD • Birthweight is N(3441, 434²)
N(0,1) - Standard Normal Distribution • 68% within ± 1 SD units • 95% within ± 1.96 SD units • 99% within ± 2.58 SD units • z = SD units
Birthweight (g) at 40 weeks • 95% within 1.96 SDs: 2590 - 4292 grams • 99% within 2.58 SDs: 2321 - 4561 grams
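The birthweight ranges above follow directly from mean ± z × SD. A minimal Python sketch, using the slide's multipliers 1.96 and 2.58 and the standard library's `statistics.NormalDist` to check their coverage (the function names `normal_range` and `coverage` are illustrative, not part of the course material):

```python
from statistics import NormalDist

def normal_range(mean, sd, z):
    """Return (low, high) = mean -/+ z standard deviations."""
    return mean - z * sd, mean + z * sd

def coverage(z):
    """Proportion of a Normal distribution lying within +/- z SDs of its mean."""
    std = NormalDist()  # standard Normal, N(0, 1)
    return std.cdf(z) - std.cdf(-z)

# Birthweight at 40 weeks is taken as N(3441, 434^2), as on the slide.
low95, high95 = normal_range(3441, 434, 1.96)
low99, high99 = normal_range(3441, 434, 2.58)

print(f"95% range: {low95:.0f} to {high95:.0f} g (coverage {coverage(1.96):.3f})")
print(f"99% range: {low99:.0f} to {high99:.0f} g (coverage {coverage(2.58):.3f})")
```

These ranges describe individual babies, so they only make sense if birthweight really is close to Normal.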
Further Reading • http://www.mas.ncl.ac.uk/~njnsm/medfac/docs/intro.pdf • Altman DG, Bland JM (1996) Presentation of numerical data. BMJ 312, 572 • Altman DG, Bland JM. (1995) The normal distribution. BMJ 310, 298.
Samples and Populations • Use samples to estimate population quantities (parameters) such as disease prevalence, mean cholesterol level etc • Samples are not interesting in their own right - only to infer information about the population from which they are drawn • Sampling Variation • Populations are unique - samples are not.
Samples and Populations • How much might these estimates vary from sample to sample? • Determine precision of estimates (how close to/far away from the population value?)
(Artificial) example • Have 5000 measurements of diastolic blood pressure from airline pilots. This accounts for ALL airline pilots and is the population of airline pilots. • (Artificial example - if we had the whole population we wouldn’t need to sample!!) • Since we have the population, we know the true population characteristics. It is these we are trying to estimate from a sample.
Population distribution of diastolic BP from Airline Pilots (in mmHg) • True mean = 78.2 • True SD = 9.4
Example • Write each measurement on a piece of paper and put into a hat. • Draw 5 pieces of paper and calculate the mean of the BP. • Replace the pieces of paper and repeat 49 more times. • End up with 50 (different) estimates of mean BP
Sampling Distribution • Each estimate of the mean will be different. • Treat this as a random sample of means • Plot a histogram of the means. • This is an estimate of the sampling distribution of the mean. • Can get the sampling distribution of any parameter in a similar way.
Distribution of the mean • Population: μ = 78.2, σ = 9.4 • [Figure: histograms of the population and of the means of 50 samples with N=5, N=10 and N=100]
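The hat-drawing experiment can be imitated in a few lines of Python. This is only a sketch: the real pilot measurements are not reproduced in these notes, so a stand-in population of 5000 values is simulated from a Normal distribution with the stated mean and SD (an assumption), and `sample_means` is an illustrative helper name.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Assumption: stand-in population of 5000 values from Normal(78.2, 9.4);
# the real pilot data are not available here.
population = [random.gauss(78.2, 9.4) for _ in range(5000)]

def sample_means(population, n, repeats=50):
    """Draw `repeats` random samples of size n and return their means."""
    return [statistics.mean(random.sample(population, n)) for _ in range(repeats)]

# One set of 50 sample means for each sample size shown on the slide.
results = {n: sample_means(population, n) for n in (5, 10, 100)}

pop_sd = statistics.pstdev(population)
for n, means in results.items():
    print(f"N={n:3d}: mean of means = {statistics.mean(means):.1f}, "
          f"SD of means = {statistics.stdev(means):.2f}, "
          f"population SD/sqrt(N) = {pop_sd / n ** 0.5:.2f}")
```

The spread of the sample means should shrink as N grows, roughly in line with the population SD divided by √N.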
Distribution of the Mean • BUT! Don’t need to take multiple samples • Standard error of the mean = SD/√N • SE of the mean is the SD of the distribution of the sample mean
Distribution of Sample Mean • Distribution of the sample mean is Normal regardless of the distribution of the data (unless the sample is small or the data are very skew) • SO: can apply Normal theory to the sample mean also
Distribution of Sample Mean • i.e. 95% of sample means lie within 1.96 SEs of (unknown) true mean • This is the basis for a 95% confidence interval (CI) • 95% CI is an interval which on 95% of occasions includes the population mean
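The "95% of occasions" statement can be checked by simulation. The sketch below assumes individual readings are Normal with the airline pilot population values (mean 78.2, SD 9.4), draws many samples of size 100, and counts how often the interval mean ± 1.96 SE contains the true mean; the helper names and the choices of 100 and 2000 are illustrative.

```python
import math
import random
import statistics

random.seed(2)  # fixed seed so the sketch is repeatable

TRUE_MEAN, TRUE_SD = 78.2, 9.4  # population values from the airline pilot example

def ci_95(sample):
    """Large-sample 95% CI for the mean: sample mean +/- 1.96 standard errors."""
    m = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(len(sample))
    return m - 1.96 * se, m + 1.96 * se

def coverage(n=100, repeats=2000):
    """Fraction of simulated samples whose 95% CI contains the true mean."""
    hits = 0
    for _ in range(repeats):
        # Assumption: individual readings are Normal(TRUE_MEAN, TRUE_SD).
        sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(n)]
        low, high = ci_95(sample)
        if low <= TRUE_MEAN <= high:
            hits += 1
    return hits / repeats

observed = coverage()
print(f"Proportion of intervals containing the true mean: {observed:.3f}")
```

The proportion should come out close to 0.95; it is the procedure, repeated over many samples, that has the 95% property.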
Example • 57 measurements of FEV1 in male medical students
Example • 95% of population lie within mean ± 1.96 SD, i.e. within 4.06 ± 1.96 × 0.67, from 2.75 to 5.38 litres
Example • Thus for FEV1 data, 95% chance that the interval mean ± 1.96 SE contains the true population mean, i.e. 4.06 ± 1.96 × 0.67/√57, between 3.89 and 4.23 litres • This is the 95% confidence interval for the mean
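The FEV1 interval can be reproduced from the summary figures alone: mean 4.06 litres, SD 0.67 litres, N = 57. A short Python sketch (the function name `mean_ci` is illustrative):

```python
import math

def mean_ci(mean, sd, n, z=1.96):
    """Normal-theory confidence interval for a mean: mean +/- z * SD / sqrt(n)."""
    se = sd / math.sqrt(n)  # standard error of the mean
    return mean - z * se, mean + z * se

# Summary figures for the 57 FEV1 measurements, as quoted on the slides.
low, high = mean_ci(4.06, 0.67, 57)
print(f"95% CI for mean FEV1: {low:.2f} to {high:.2f} litres")
```

Note the contrast with the previous slide: mean ± 1.96 SD describes where individual students lie, while mean ± 1.96 SE describes the uncertainty in the mean.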
Confidence Intervals • The confidence interval (CI) measures uncertainty. The 95% confidence interval is the range of values within which we can be 95% sure that the true value lies for the whole of the population of patients from whom the study patients were selected. The CI narrows as the number of patients on which it is based increases.
Standard Deviations & Standard Errors • The SE is the SD of the sampling distribution (of the mean, say) • SE = SD/√N • Use SE to describe the precision of estimates (for example Confidence intervals) • Use SD to describe the variability of samples, populations or distributions (for example reference ranges)
The t-distribution • When N is small, estimate of SD is particularly unreliable and the distribution of sample mean is not Normal • Distribution is more variable - longer tails • Shape of distribution depends upon sample size • This distribution is called the t-distribution
N=2: t(1) • 95% within ± 12.7 • [Figure: N(0,1) and t(1) distributions compared]
N=10: t(9) • 95% within ± 2.26 • [Figure: N(0,1) and t(9) distributions compared]
N=30: t(29) • 95% within ± 2.04
t-distribution • As N becomes larger, t-distribution becomes more similar to Normal distribution • Degrees of Freedom (DF) = sample size - 1 • DF is a measure of the amount of information contained in the data set
Implications • Confidence interval for the mean • Sample size < 30: use t-distribution • Sample size > 30: use either Normal or t-distribution • Note: Stats packages (generally) will automatically use the correct distribution for confidence intervals
Example • Numbers of hours of relief obtained by 7 arthritic patients after receiving a new drug: 2.2, 2.4, 4.9, 3.3, 2.5, 3.7, 4.3 • Mean = 3.33, SD = 1.03, DF = 6, t(5%) = 2.45 • 95% CI = 3.33 ± 2.45 × 1.03/√7, i.e. 2.38 to 4.28 hours • Normal 95% CI = 3.33 ± 1.96 × 1.03/√7, i.e. 2.57 to 4.09 hours: TOO NARROW!!
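The arthritis example can be recomputed from the seven raw values. The sketch below uses only the Python standard library, which has no t-distribution quantile function, so the critical value for 6 DF is typed in from tables (2.447, which the slide rounds to 2.45); the variable names are illustrative.

```python
import math
import statistics

hours = [2.2, 2.4, 4.9, 3.3, 2.5, 3.7, 4.3]  # hours of relief, from the slide

n = len(hours)
mean = statistics.mean(hours)
sd = statistics.stdev(hours)  # sample SD (divisor n - 1)
se = sd / math.sqrt(n)        # standard error of the mean

T_CRIT = 2.447  # tabulated two-sided 5% point of t with 6 DF (2.45 on the slide)
Z_CRIT = 1.96   # corresponding value for the Normal distribution

t_ci = (mean - T_CRIT * se, mean + T_CRIT * se)
z_ci = (mean - Z_CRIT * se, mean + Z_CRIT * se)

print(f"mean = {mean:.2f}, SD = {sd:.2f}, SE = {se:.3f}")
print(f"t-based 95% CI:      {t_ci[0]:.2f} to {t_ci[1]:.2f} hours")
print(f"Normal-based 95% CI: {z_ci[0]:.2f} to {z_ci[1]:.2f} hours")
```

With only 7 observations the t-based interval is noticeably wider, which is the point of the slide.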
Hypothesis Testing • Enables us to measure the strength of evidence supplied by the data concerning a proposition of interest • In a trial comparing two treatments there will ALWAYS be a difference between the estimates for each treatment - a real difference or random variation?
Null Hypothesis • Study hypothesis - hypothesis in the mind of the investigator (patients with diabetes have raised blood pressure) • Null hypothesis is the converse of the study hypothesis - aim to disprove it (patients with diabetes do not have raised blood pressure) • Hypothesis of no effect/difference
Two-Sample t-test • Two independent samples • Can the two samples be considered to be the same with respect to the variable you are measuring or are they different? • Sample means will ALWAYS be different - real difference or random variation? • ASSUMPTION: Data are normally distributed and SD in each group similar
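The slide does not give the test statistic itself. As a sketch, assuming the usual pooled-variance form (which matches the stated assumption of similar SDs in each group), it could be computed as below. The function name `two_sample_t` is illustrative, and the two small groups are placeholder numbers chosen only to exercise the function; they are not study data.

```python
import math
import statistics

def two_sample_t(a, b):
    """Pooled-variance two-sample t statistic and its degrees of freedom."""
    n_a, n_b = len(a), len(b)
    df = n_a + n_b - 2
    # Pooled variance: average of the two sample variances weighted by their DF.
    pooled_var = ((n_a - 1) * statistics.variance(a)
                  + (n_b - 1) * statistics.variance(b)) / df
    # Standard error of the difference between the two sample means.
    se_diff = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (statistics.mean(a) - statistics.mean(b)) / se_diff, df

# Placeholder numbers chosen only to exercise the function; not study data.
group_a = [1.0, 2.0, 3.0]
group_b = [2.0, 3.0, 4.0]
t_stat, df = two_sample_t(group_a, group_b)
print(f"t = {t_stat:.3f} on {df} DF")
```

The statistic is then compared with the t-distribution on N₁ + N₂ - 2 degrees of freedom to obtain a p-value.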
Two-Sample t-test • 24 hour total energy expenditure (MJ/day) in groups of lean and obese women • Do the women differ in their energy expenditure? • Null hypothesis: energy expenditure in lean and obese women is the same