
Introduction to Biostatistics Descriptive Statistics and Sample Size Justification

Introduction to Biostatistics: Descriptive Statistics and Sample Size Justification. Julie A. Stoner, PhD. October 18, 2004. Statistics Seminars. Goal: interpret and critically evaluate biomedical literature. Topics: sample size justification, exploratory data analysis, hypothesis testing.


Presentation Transcript


  1. Introduction to Biostatistics: Descriptive Statistics and Sample Size Justification • Julie A. Stoner, PhD • October 18, 2004

  2. Statistics Seminars • Goal: Interpret and critically evaluate biomedical literature • Topics: • Sample size justification • Exploratory data analysis • Hypothesis testing

  3. Example #1 • Aim: Compare two antihypertensive strategies for lowering blood pressure • Double-blind, randomized study • Treatments compared: 5 mg Enalapril + 5 mg Felodipine ER versus 10 mg Enalapril • 6-week treatment period • 217 patients • AJH, 1999;12:691-696

  4. Example #2 • Aim: Demonstrate that D-penicillamine (DPA) is effective in prolonging the overall survival of patients with primary biliary cirrhosis of the liver (PBC) • Mayo Clinic • Double-blind, placebo controlled, randomized trial • 312 patients • Collect clinical and biochemical data on patients • Reference: NEJM. 312:1011-1015.1985.

  5. Example #2 • Patients enrolled over 10 years, between January 1974 and May 1984 • Data were analyzed in July 1986 • Event: death (x) • Censoring: some patients are still alive at end of study (o) • [Timeline figure: axis marked 1/1974, 5/1984, and 6/1986; each patient is a horizontal line ending in x (death) or o (censored)]

  6. Statistical Inference • Goal: describe factors associated with particular outcomes in the population at large • Not feasible to study entire population • Samples of subjects drawn from population • Make inferences about population based on sample subset

  7. Why are descriptive statistics important? • Identify signals/patterns from noise • Understand relationships among variables • Formal hypothesis testing should agree with descriptive results

  8. Outline • Types of data • Categorical data • Numerical data • Descriptive statistics • Measures of location • Measures of spread • Descriptive plots

  9. Types of Data • Categorical data: provides qualitative description • Dichotomous or binary data • Observations fall into 1 of 2 categories • Example: male/female, smoker/non-smoker • More than 2 categories • Nominal: no obvious ordering of the categories • Example: blood types A/B/AB/O • Ordinal: there is a natural ordering • Example: never-smoker/ex-smoker/light smoker/heavy smoker

  10. Types of Data • Numerical data (interval/ratio data) • Provides quantitative description • Discrete data • Observations can only take certain numeric values • Often counts of events • Example: number of doctor visits in a year • Continuous data • Not restricted to take on certain values • Often measurements • Example: height, weight, age

  11. Descriptive Statistics: Numerical Data • Measures of location • Mean: average value • For n data points x_1, x_2, …, x_n, the mean is the sum of the observations divided by the number of observations: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$

  12. Descriptive Statistics: Numerical Data • Measures of location • Mean: • Example: Find the mean triglyceride level (in mg/100 ml) of the following patients 159, 121, 130, 164, 148, 148, 152 Sum = 1022, Count = 7, Mean = 1022/7 = 146
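The arithmetic on this slide can be checked with a few lines of Python. This is only a sketch; the variable names are illustrative and do not come from the presentation.

```python
# Triglyceride levels (mg/100 ml) from the slide.
triglycerides = [159, 121, 130, 164, 148, 148, 152]

total = sum(triglycerides)    # sum of the observations
count = len(triglycerides)    # number of observations
sample_mean = total / count   # mean = sum / count

print(total, count, sample_mean)
```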

  13. Descriptive Statistics: Numerical Data • Measures of location • Percentile: value that is greater than a particular percentage of the data values • Order data • Pth percentile has rank r = (n+1)*(P/100) • Median: the 50th percentile, 50% of the data values lie below the median

  14. Descriptive Statistics: Numerical Data • Measures of location • Median • Example: Find the median triglyceride level from the sample 159, 121, 130, 164, 148, 148, 152 • Order: 121, 130, 148, 148, 152, 159, 164 • Median: rank = (7+1) * (50/100) = 4 • The 4th ordered observation is 148
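The rank rule r = (n+1)*(P/100) can be sketched as a small function. The slides do not say what to do when the rank is not a whole number, so the linear interpolation between neighbouring ordered values below is an assumption, as is the function name `percentile_by_rank`.

```python
def percentile_by_rank(values, p):
    """P-th percentile using the slide's rank rule r = (n + 1) * (P / 100)."""
    ordered = sorted(values)
    n = len(ordered)
    rank = (n + 1) * p / 100
    rank = min(max(rank, 1), n)   # keep the rank inside 1..n
    lower = int(rank)             # whole part of the 1-based rank
    fraction = rank - lower
    if fraction == 0:
        return ordered[lower - 1]
    # Assumption: interpolate linearly when the rank falls between two observations.
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])


triglycerides = [159, 121, 130, 164, 148, 148, 152]
median_value = percentile_by_rank(triglycerides, 50)
print(median_value)
```

For this sample the rank is exactly 4, so no interpolation is needed and the function returns the 4th ordered observation.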

  15. Descriptive Statistics: Numerical Data • Measures of location • Mode: most common element of a set • Example: Find the mode of the triglyceride values 159, 121, 130, 164, 148, 148, 152 Mode = 148

  16. Descriptive Statistics: Numerical Data • Measures of location: comparison of mean and median • Example: Compare the mean and median from the sample of triglyceride levels 159, 141, 130, 230, 148, 148, 152 Mean = 1108/7=158.29, Median = 148 • The mean may be influenced by extreme data points.
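A short sketch using Python's standard `statistics` module shows the same comparison for both samples: the one extreme value (230) pulls the mean upward while the median stays at 148. The variable names are illustrative.

```python
from statistics import mean, median

original = [159, 121, 130, 164, 148, 148, 152]      # sample from the earlier slides
with_extreme = [159, 141, 130, 230, 148, 148, 152]  # sample from this slide

for label, sample in (("original", original), ("with extreme value", with_extreme)):
    print(f"{label}: mean = {mean(sample):.2f}, median = {median(sample)}")
```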

  17. Skewed Distributions • Data whose distribution is not symmetric are skewed. • Mean may not be a good measure of central tendency. Why? • Positive skew, or skewed to the right: mean > median • Negative skew, or skewed to the left: mean < median

  18. Motivation • Example: • Set 1: 2, 60, 100 (mean = 54) • Set 2: 53, 54, 55 (mean = 54) • Both data sets have a mean of 54, but the scores in set 1 have a larger range and variation than the scores in set 2.

  19. Descriptive Statistics: Numerical Data • Measures of spread • Variance: average squared deviation from the mean • For n data points x_1, x_2, …, x_n, the sample variance is $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$ • Standard deviation: square root of the variance, in the same units as the original data

  20. Descriptive Statistics: Numerical Data • Measures of spread • Standard Deviation: • Example: find the standard deviation of the triglyceride values 159, 121, 130, 164, 148, 148, 152 Distance from mean: 13, -25, -16, 18, 2, 2, 6 Sum of squared differences: 1418 Standard deviation: sqrt(1418/6)=15.37
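The steps of this example can be reproduced as a sketch in Python. It divides by n - 1 = 6, as the slide does, and cross-checks the result against `statistics.stdev`, which uses the same divisor. The variable names are illustrative.

```python
from math import sqrt
from statistics import stdev

triglycerides = [159, 121, 130, 164, 148, 148, 152]
n = len(triglycerides)
sample_mean = sum(triglycerides) / n

deviations = [x - sample_mean for x in triglycerides]  # distance from the mean
sum_sq = sum(d ** 2 for d in deviations)               # sum of squared differences
variance = sum_sq / (n - 1)                            # sample variance
sd = sqrt(variance)                                    # standard deviation

print(deviations)
print(sum_sq, round(sd, 2))
```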

  21. Descriptive Statistics: Numerical Data • Standard deviation: How much variability can we expect among individual responses? • Standard error of the mean: How much variability can we expect in the mean response among various samples?

  22. Descriptive Statistics: Numerical Data • The standard error of the mean is estimated as $\text{s.e.m.} = \text{s.d.} / \sqrt{n}$, where s.d. is the estimated standard deviation and n is the sample size • Based on the formula, will the standard error of the mean always be smaller or larger than the standard deviation of the data? • Answer: smaller (for any sample with n > 1)
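As a sketch, the standard error of the mean for the triglyceride sample used in the earlier slides can be computed as follows. Applying the formula to that sample is an illustration added here, not a calculation shown in the presentation.

```python
from math import sqrt
from statistics import stdev

triglycerides = [159, 121, 130, 164, 148, 148, 152]
n = len(triglycerides)

sd = stdev(triglycerides)   # estimated standard deviation (n - 1 divisor)
sem = sd / sqrt(n)          # standard error of the mean

print(round(sd, 2), round(sem, 2))
```

Because the standard deviation is divided by the square root of n, the standard error is smaller than the standard deviation whenever n > 1.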

  23. Descriptive Statistics: Numerical Data • Measures of spread • Minimum, maximum • Range: maximum-minimum • Interquartile range: difference between 25th and 75th percentile, values that encompass middle 50% of data

  24. Descriptive Statistics: Numerical Data • Measures of spread • Example: find the range and the interquartile range for the triglyceride values 159, 121, 130, 164, 148, 148, 152 Range: 164 - 121 = 43 Interquartile Range: Order: 121, 130, 148, 148, 152, 159, 164 IQR: 159 - 130 = 29
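A sketch of the same calculation with Python's `statistics.quantiles` follows. Its default `"exclusive"` method places the quartiles at positions based on n + 1, which agrees with the rank rule r = (n+1)*(P/100) used in these slides; other software may use a different percentile convention and give slightly different quartiles.

```python
from statistics import quantiles

triglycerides = [159, 121, 130, 164, 148, 148, 152]

data_range = max(triglycerides) - min(triglycerides)

# n=4 splits the data into quartiles: 25th, 50th, and 75th percentiles.
q1, q2, q3 = quantiles(triglycerides, n=4)
iqr = q3 - q1

print(data_range, q1, q3, iqr)
```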

  25. Descriptive Statistics: Numerical Data • Helpful to describe both location and spread of data • Location: mean; Spread: standard deviation • Location: median; Spread: min, max, range, interquartile range, quartiles

  26. Descriptive Statistics: Categorical Data • Measures of distribution • Proportion = (number of subjects with the characteristic) / (total number of subjects) • Percentage = Proportion * 100%

  27. Descriptive Statistics: Categorical Data • Measures of distribution: example • What percentage of vaccinated individuals developed the flu? • 198/400 = 0.495, or 49.5%

  28. Example • Consider the table of descriptive statistics for characteristics at baseline • What do we conclude about comparability of the groups at baseline in terms of gender and age?

  29. Descriptive Plots: • Single variable • Bar plot • Histogram • Box-plot • Multiple variables • Box-plot • Scatter plot • Kaplan-Meier survival plots

  30. Barplot • Goal: Describe the distribution of values for a categorical variable • Method: • Determine categories of response • For each category, draw a bar with height equal to the number or proportion of responses

  31. Barplot

  32. Histogram • Goal: Describe the distribution of values for a continuous variable • Method: • Determine intervals of response (bins) • For each interval, draw a bar with height equal to the number or proportion of responses
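The binning step can be sketched in Python without a plotting library. The example reuses the triglyceride values from the earlier slides; the bin start (120) and bin width (10) are hypothetical choices made here for illustration, not values from the presentation.

```python
from collections import Counter

triglycerides = [159, 121, 130, 164, 148, 148, 152]
bin_start, bin_width = 120, 10   # hypothetical bin layout

# Map each value to the left edge of the interval [left, left + width) it falls in.
counts = Counter(
    bin_start + ((x - bin_start) // bin_width) * bin_width for x in triglycerides
)

# Draw one text bar per interval, with length equal to the number of responses.
for left in range(bin_start, max(triglycerides) + 1, bin_width):
    print(f"[{left}, {left + bin_width}): {'#' * counts[left]}")
```

Changing the bin width changes the apparent shape of the distribution, which is one reason to note the scale when reading a histogram.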

  33. Histogram

  34. Box-plot • Goal: Describe the distribution of values for a continuous variable • Method: • Determine 25th, 50th, and 75th percentiles of distribution • Determine outlying and extreme values • Draw a box with lower line at the 25th percentile, middle line at the median, and upper line at the 75th percentile • Draw whiskers to represent outlying and extreme values
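The quantities a box-plot displays can be sketched as below. The slide does not define "outlying", so the common convention of flagging points more than 1.5 × IQR beyond the quartiles is an assumption here; the data are the triglyceride sample with the extreme value from the mean-versus-median slide.

```python
from statistics import quantiles

values = [159, 141, 130, 230, 148, 148, 152]

# Box: 25th percentile, median, 75th percentile.
q1, med, q3 = quantiles(values, n=4)
iqr = q3 - q1

# Assumed convention: points beyond 1.5 * IQR from the box are outlying.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

inside = [x for x in values if lower_fence <= x <= upper_fence]
outliers = sorted(x for x in values if x < lower_fence or x > upper_fence)
whisker_low, whisker_high = min(inside), max(inside)

print(q1, med, q3, whisker_low, whisker_high, outliers)
```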

  35. Boxplot • [Figure: annotated box-plot labeling the 75th percentile, the median, and the 25th percentile]

  36. Box-plot

  37. Scatter Plot • Goal: Describe joint distribution of values from 2 continuous variables • Method: • Create a 2-dimensional grid (horizontal and vertical axis) • For each subject in the dataset, plot the pair of observations from the 2 variables on the grid

  38. Scatter Plot

  39. Scatter Plot

  40. Kaplan-Meier Survival Curves • Goal: Summarize the distribution of times to an event • Method: • Estimate survival probabilities while accounting for censoring • Plot the survival probability corresponding to each time an event occurred
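The method can be sketched with the standard product-limit calculation: at each time where a death occurs, the running survival probability is multiplied by (1 - deaths / number at risk), and censored subjects simply leave the risk set. The function name and the six follow-up times below are hypothetical and are not data from the PBC trial.

```python
def kaplan_meier(times, events):
    """Return [(event_time, survival_probability)] for each time with a death.

    times  -- follow-up time for each subject
    events -- 1 if the subject died at that time, 0 if censored
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e)
        leaving = sum(1 for ti in times if ti == t)  # deaths plus censored at t
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Hypothetical follow-up times (years) and event indicators.
times = [2, 3, 3, 5, 8, 9]
events = [1, 1, 0, 1, 0, 1]

curve = kaplan_meier(times, events)
for t, s in curve:
    print(t, round(s, 4))
```

Plotting these points as a step function gives the survival curve; the censored subjects at times 3 and 8 do not create a step but reduce the number at risk for later steps.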

  41. Kaplan-Meier Survival Curves

  42. Kaplan-Meier Survival Curves

  43. Kaplan-Meier Survival Curves

  44. Descriptive Plots Guidelines • Clearly label axes • Indicate unit of measurement • Note the scale when interpreting graphs

  45. Descriptive Statistics Exercises

  46. Example • Below are some descriptive plots and statistics from a study designed to investigate the effect of smoking on the pulmonary function of children • Tager et al. (1979) American Journal of Epidemiology. 110:15-26

  47. Example • The primary question, for this exercise, is whether or not smoking is associated with decreased pulmonary function in children, where pulmonary function is measured by forced expiratory volume (FEV) in liters per second. • The data consist of observations on 654 children aged 3 to 19.

  48. Proportion Male: • (336/654) × 100% = 51.4% • Proportion Smokers: • (65/654) × 100% = 9.9% • Proportion of Smokers who are Male: • (26/65) × 100% = 40%
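These percentages can be verified with a small helper. The function name `percentage` is illustrative; the counts are the ones given on the slide.

```python
def percentage(count, total):
    """Proportion of subjects with a characteristic, expressed as a percentage."""
    return 100 * count / total


pct_male = percentage(336, 654)          # males among all children
pct_smokers = percentage(65, 654)        # smokers among all children
pct_male_smokers = percentage(26, 65)    # males among smokers only

print(round(pct_male, 1), round(pct_smokers, 1), round(pct_male_smokers, 1))
```

Note that the denominator changes for the last figure: it is the number of smokers, not the total number of children.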

  49. Compare the FEV1 distribution between smokers and non-smokers • Answer • The smokers appear to have higher FEV values and therefore better lung function. Specifically, the median FEV for smokers is 3.2 liters/sec. (IQR 3.75-3=0.75) compared to a median FEV of 2.5 liters/sec. (IQR 3-2=1) for non-smokers.

  50. Compare the age distribution between smokers and non-smokers. • Answer: • The smokers are older than the non-smokers in general. Specifically, the median age for the smokers is 13 years (IQR 15-12=3) compared to 9 years (IQR 11-8=3) for the non-smokers.
