Statistics for Non-Statisticians Kay M. Larholt, Sc.D. Vice President, Biometrics & Clinical Operations Abt Bio-Pharma Solutions
Topics 1) Basic Statistical Concepts 2) Study Design 3) Blinding and Randomization 4) Hypothesis testing 5) Power and Sample Size
Statistics Per the American Heritage dictionary - “The mathematics of the collection, organization, and interpretation of numerical data, especially the analysis of population characteristics by inference from sampling.” • Two broad areas • Descriptive – Science of summarizing data • Inferential – Science of interpreting data from a sample in order to make estimates, test hypotheses, and make predictions or decisions about the target population.
Introduction to Clinical Statistics • Statistics - The science of making decisions in the face of uncertainty • Probability - The mathematics of uncertainty • The probability of an event is a measure of how likely the event is to happen
Clinical Statistics • Biostatisticians are statisticians who apply statistics to the biological sciences. • Clinical statistics are statistics that are applied to clinical trials
Basic Statistical Concepts • Types of data • Descriptive statistics • Graphs • Basic probability concepts • Types of probability distributions in clinical statistics • Sample vs. population
Continuous Data • Data should be collected in its “rawest” form. We can always categorize data later. (We can never “uncategorize” data.) • e.g. If you measure prostate size as part of the clinical trial, then capture the size in mm on the CRF. • We can categorize later into: 0-20 mm, 21-40 mm, etc.
Patient | Size (mm) | Category
1 | 24 | Between 21 and 40
2 | 45 | Between 41 and 60
3 | 26 | Between 21 and 40
4 | 23 | Between 21 and 40
5 | 67 | Between 61 and 80
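The point that raw measurements can always be categorized later can be illustrated with a minimal Python sketch. The helper name `categorize` is hypothetical, the bins are assumed to continue in 20 mm steps as the categories above suggest, and sizes are assumed to be recorded in whole millimetres.

```python
# Raw prostate sizes (mm) for patients 1-5, as they would be captured on the CRF.
sizes_mm = {1: 24, 2: 45, 3: 26, 4: 23, 5: 67}

# Assumed bins in whole millimetres: 0-20, 21-40, 41-60, 61-80.
BINS = [(0, 20), (21, 40), (41, 60), (61, 80)]


def categorize(size_mm):
    """Return the bin label for a whole-millimetre size (hypothetical helper)."""
    for low, high in BINS:
        if low <= size_mm <= high:
            return f"{low}-{high} mm"
    return "outside defined bins"


# Derive the categories from the raw values.
categories = {patient: categorize(size) for patient, size in sizes_mm.items()}

for patient, label in categories.items():
    print(patient, sizes_mm[patient], label)
```

The reverse step is not possible: from the label "21-40 mm" alone, the original values 24, 26, and 23 cannot be recovered.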
Basic Data Summarization Techniques • The objective of data summarization is to describe the characteristics of a data set. Ultimately, we want to make the data set more comprehensible and meaningful. • To put data in a concise form, use • Summary descriptive statistics • Graphs • Tables
Descriptive Statistics for Continuous Variables • Measures of central tendency: Mean, Median, Mode • Measures of dispersion: Range, Variance, Standard deviation • Measures of relative standing: Lower quartile (Q1), Upper quartile (Q3), Interquartile range (IQR)
Mean • Arithmetic average: sum of all observations divided by the number of observations. • Example: The average age of a group of 10 people is 24.2 years. Who are they?
Mean • Answer: They could be ten “twenty-somethings” who go out to dinner together: Pete aged 24, Jane aged 26, Louise aged 21, Bob aged 22, Julie aged 23, Sue aged 22, Jenn aged 27, John aged 28, Jeff aged 20 and Mark aged 29. • The mean age for these 10 people is: (24+26+21+22+23+22+27+28+20+29)/10 = 24.2 years
Mean • Or alternatively: They could be Mr. & Mrs. Smith and their 8 grandchildren: Susie aged 3, Abby aged 5, Max aged 8, Laura aged 10, Joshua aged 10, Emma aged 12, Jane aged 13, Sarah aged 18, Mrs. Smith aged 80, Mr. Smith aged 83. • The mean age for these 10 people is: (3+5+8+10+10+12+13+18+80+83)/10 = 24.2 years
Mean • Presenting the average alone does not give you much information about the data you are looking at.
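A short Python sketch makes the same point with the two groups of ten people described in the Mean examples. The variable names are illustrative only; the ages are the ones listed there.

```python
from statistics import mean

# Ages from the two examples: ten friends, and the Smiths with 8 grandchildren.
friends = [24, 26, 21, 22, 23, 22, 27, 28, 20, 29]
smith_family = [3, 5, 8, 10, 10, 12, 13, 18, 80, 83]

friends_mean = mean(friends)
family_mean = mean(smith_family)

# Same mean, very different groups: compare the youngest and oldest member.
print(friends_mean, min(friends), max(friends))
print(family_mean, min(smith_family), max(smith_family))
```

Both groups have the same mean, yet the first spans ages 20 to 29 and the second spans ages 3 to 83.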
Median • The midpoint of the values after they have been ordered from the smallest to the largest, or the largest to the smallest. • There are as many values above the median as below it in the data array.
Median • Example • The age of the people in our data set is: • 24, 26, 21, 23, 22, 27, 28, 20, 29 (I took out one of the 22-year-olds to make this example easier) • Arranging the data in ascending order gives: • 20, 21, 22, 23, 24, 26, 27, 28, 29 • The median is 24
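The median example can be reproduced with a few lines of Python. This is a sketch for an odd number of values, as in the example; the by-hand index calculation would need adjusting for an even count.

```python
from statistics import median

# The nine ages from the median example.
ages = [24, 26, 21, 23, 22, 27, 28, 20, 29]

# Step 1: arrange the data in ascending order.
ordered = sorted(ages)

# Step 2: with an odd number of values, the median is the single middle value.
middle_index = len(ordered) // 2
median_by_hand = ordered[middle_index]

print(ordered)
print(median_by_hand, median(ages))
```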
There are three kinds of lies: lies, damned lies, and statistics. This well-known saying is part of a phrase attributed to Benjamin Disraeli and popularized in the U.S. by Mark Twain.
Median Home Price Connecticut: Darien • Median home price: $1,295,000 • Location: about 40 miles northeast of midtown Manhattan • Population: 20,209, households 6,592
Properties of Mean and Median • There are unique means and medians for each variable in the data set. • Median is not affected by extremely large or small values and is therefore a valuable measure of central tendency when such values occur. • Mean is a poor measure of central tendency in skewed distributions.
Mode • The value of the observation that appears most frequently. • Example: The exam scores for ten students are: 81, 93, 84, 75, 68, 87, 81, 75, 81, 87. Since the score of 81 occurs the most, the modal score is 81.
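As a quick check of the mode example, the scores can be tallied in Python. This sketch uses the standard library's `Counter`; the variable names are illustrative.

```python
from collections import Counter

# Exam scores for the ten students in the example.
scores = [81, 93, 84, 75, 68, 87, 81, 75, 81, 87]

# Count how often each score appears and take the most frequent one.
counts = Counter(scores)
modal_score, frequency = counts.most_common(1)[0]

print(modal_score, frequency)
```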
Averages and What Else? • As we have seen, just knowing the mean or even the median of a data set does not tell us enough about the data. We need more information to really describe the data.
Measures of Dispersion • Once we know something about the centre of the data we need to understand how the data are dispersed around this centre. • How variable are the data?
Range • Maximum value in the data set minus minimum value in the data set • 1. The age of the patients in our data set is: 21, 25, 19, 20, 22. Range = 25 – 19 = 6 • 2. The age of the patients in our data set is: 21, 45, 19, 20, 22. Range = 45 – 19 = 26 • When max and min are unusual values, range may be a misleading measure of dispersion. • The range only uses the 2 extreme values in the data.
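The two range calculations can be sketched in Python as follows. The function name `data_range` is a hypothetical helper, not part of any library.

```python
def data_range(values):
    """Maximum value minus minimum value (hypothetical helper)."""
    return max(values) - min(values)


# The two patient age data sets from the example.
ages_1 = [21, 25, 19, 20, 22]
ages_2 = [21, 45, 19, 20, 22]

range_1 = data_range(ages_1)
range_2 = data_range(ages_2)

print(range_1, range_2)
```

Changing a single value (25 to 45) more than quadruples the range, which shows how strongly it depends on the two extreme values.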
Variance and Standard Deviation • The variance of a data set measures how far each data point is from the mean of the data set. • It provides a measure of how spread out the data points are • The Standard Deviation is the square root of the variance
Variance and Standard Deviation • Variance: • Measure of dispersion: the sum of the squared deviations of the data from the mean, divided by n-1 • Standard deviation: • positive square root of the variance • Small std dev: • observations are clustered tightly around the mean • Large std dev: • observations are scattered widely about the mean
Standard Deviation 1) Take each observation and subtract it from the mean of the observations 2) Square the answer 3) Sum up all the results 4) Divide by n-1 (this gives the variance) 5) Take the square root
Example – Standard Deviation • 1. The age of the patients in our data set is: • 21, 25, 19, 20, 22 • Mean = 21.4, Median = 21, StdDev = 2.302 • 2. The age of the patients in our data set is: • 21, 45, 19, 20, 22 • Mean = 25.4, Median = 21, StdDev = 11.014
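The standard deviation recipe (subtract, square, sum, divide by n-1, take the square root) can be followed step by step in Python. This is a minimal sketch; `sample_std_dev` is a hypothetical helper, and the result is compared with the standard library's `statistics.stdev`, which also divides by n-1.

```python
from math import sqrt
from statistics import stdev


def sample_std_dev(values):
    """Sample standard deviation, computed one step at a time."""
    n = len(values)
    mean = sum(values) / n
    deviations = [mean - x for x in values]   # subtract each observation from the mean
    squared = [d ** 2 for d in deviations]    # square the answer
    total = sum(squared)                      # sum up all the results
    variance = total / (n - 1)                # divide by n-1
    return sqrt(variance)                     # take the square root


ages_1 = [21, 25, 19, 20, 22]
ages_2 = [21, 45, 19, 20, 22]

sd_1 = sample_std_dev(ages_1)
sd_2 = sample_std_dev(ages_2)

print(round(sd_1, 3), round(sd_2, 3))
print(round(stdev(ages_1), 3), round(stdev(ages_2), 3))
```

A single unusual value (45 instead of 25) moves the standard deviation from about 2.3 to about 11.0, while the median stays at 21.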
Choosing an Appropriate Measure of Central Tendency • The mean is ordinarily the preferred measure of central tendency. The mean should always be presented along with the variance or the standard deviation • There are situations when a median might be more appropriate: • - a skewed distribution • - a small number of subjects
Measures of Relative Standing • Descriptive measures that locate the relative position of an observation in relation to the other observations.
Measures of Relative Standing • The pth percentile is a number such that p% of the observations of the data set fall below and (100-p)% of the observations fall above it. • Lower quartile = 25th percentile (Q1) • Mid-quartile = 50th percentile (median or Q2) • Upper quartile = 75th percentile (Q3) • Interquartile range (IQR = Q3-Q1)
Measures of Relative Standing… an Example • 1. The age of the patients in our data set is: 21, 25, 19, 20, 22 • Q1 = 20, Q2 = 21, Q3 = 22, IQR = 2 • 2. The age of the patients in our data set is: 21, 45, 19, 20, 22 • Q1 = 20, Q2 = 21, Q3 = 22, IQR = 2
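The quartiles in this example can be reproduced in Python, with one caveat: several conventions exist for computing quartiles, and they can give different answers on small data sets. The sketch below assumes the "inclusive" method of `statistics.quantiles`, which happens to match the values shown here; `quartile_summary` is a hypothetical helper.

```python
from statistics import quantiles


def quartile_summary(values):
    """Return (Q1, Q2, Q3, IQR) using the 'inclusive' quartile convention."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return q1, q2, q3, q3 - q1


ages_1 = [21, 25, 19, 20, 22]
ages_2 = [21, 45, 19, 20, 22]

summary_1 = quartile_summary(ages_1)
summary_2 = quartile_summary(ages_2)

print(summary_1)
print(summary_2)
```

Unlike the range and the standard deviation, the quartiles and the IQR are identical for the two data sets, because the unusual value 45 lies above Q3 and does not affect them.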
Definitions • Statistics - The science of making decisions in the face of uncertainty • Probability - The mathematics of uncertainty • The probability of an event is a measure of how likely the event is to happen
Basic Probability Concepts Sample spaces and events Simple probability Joint probability
Sample Spaces • Collection of all possible outcomes Example: All six faces of a die Example: All 52 cards in a deck
Sample Space Gumballs in a gumball machine 60 red 50 green 40 yellow 30 white 25 pink 20 blue 16 purple Total: 241 gumballs
Events • Simple event: Outcome from a sample space with one characteristic • Examples: A red card from a deck of cards; a purple gumball from the gumball machine • Joint event: Involves two outcomes simultaneously • Example: An ace that is also red from a deck of cards
Events • Mutually exclusive events: two events that cannot occur together • Example: Drawing one card from a deck • A: Drawing a queen of diamonds • B: Drawing a queen of clubs • As only one of these can happen, events A and B are mutually exclusive
Probability • Probability is the numerical measure of the likelihood that an event will occur • Value is between 0 and 1 • Scale: 0 = impossible, 0.5 = midpoint, 1 = certain
Computing Probabilities • The probability of an event E: $P(E) = \frac{\text{Number of event outcomes}}{\text{Total number of possible outcomes in the sample space}}$ • Assumes each of the outcomes in the sample space is equally likely to occur
Computing Probabilities • Example: What is the probability of rolling a 4 when you roll a die? • Number of possible outcomes in the sample space = 6 • Number of 4s in the sample space = 1 • Prob (rolling a 4 when you roll a die) = 1/6
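The same counting rule can be applied to the gumball machine sample space. The sketch below assumes every gumball is equally likely to be dispensed, which the deck does not state explicitly; `probability` is a hypothetical helper, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

# Gumball counts from the sample space example.
gumballs = {
    "red": 60, "green": 50, "yellow": 40, "white": 30,
    "pink": 25, "blue": 20, "purple": 16,
}
total = sum(gumballs.values())


def probability(colour):
    """P(colour) = number of gumballs of that colour / total gumballs.

    Assumes each individual gumball is equally likely to come out.
    """
    return Fraction(gumballs[colour], total)


# Rolling a 4 with one die: 1 favourable outcome out of 6.
p_four = Fraction(1, 6)
p_purple = probability("purple")

print(total, p_four, p_purple)
```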
Computing Probabilities • Example: What is the probability of rolling a six and a four when you roll 2 dice? • Number of possible outcomes in the sample space = 36 • Number of ways to roll one 6 and one 4 = 2 • P(one 6 and one 4) = 2/36 = 1/18 ≈ 0.0556
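For the two-dice example, the 36 outcomes are few enough to enumerate directly. The following Python sketch lists every ordered pair and counts the pairs that contain one 6 and one 4; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

# All ordered outcomes of rolling two dice: (first die, second die).
outcomes = list(product(range(1, 7), repeat=2))

# Outcomes with one 6 and one 4, in either order.
favourable = [roll for roll in outcomes if sorted(roll) == [4, 6]]

p_six_and_four = Fraction(len(favourable), len(outcomes))

print(len(outcomes), favourable, p_six_and_four, float(p_six_and_four))
```

The enumeration shows why the count is 2 rather than 1: (4, 6) and (6, 4) are different outcomes in the sample space.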
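Computing Joint Probability • The probability of a joint event, A and B: $P(A \text{ and } B) = \frac{\text{Number of outcomes satisfying both } A \text{ and } B}{\text{Total number of possible outcomes in the sample space}}$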
Computing Joint Probability • $P(\text{Red Card and an Ace}) = \frac{\text{2 Red Aces}}{\text{Total number of cards}} = \frac{2}{52} = \frac{1}{26}$
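The red-ace calculation can be checked by building a standard 52-card deck and counting the cards that satisfy both conditions. This is a sketch; the rank and suit labels are illustrative choices, and each card is assumed equally likely to be drawn.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suit_colours = {"hearts": "red", "diamonds": "red", "clubs": "black", "spades": "black"}

# The sample space: every (rank, suit) combination.
deck = list(product(ranks, suit_colours))

# Joint event: the card is an ace AND the card is red.
red_aces = [
    (rank, suit) for rank, suit in deck
    if rank == "A" and suit_colours[suit] == "red"
]

p_red_ace = Fraction(len(red_aces), len(deck))

print(len(deck), red_aces, p_red_ace)
```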
Types of Probability Distributions in Clinical Statistics • Bernoulli • Binomial • Normal
Bernoulli Distribution • The Bernoulli distribution is the “coin flip” distribution. • X is Bernoulli if its probability function is: $P(X = 1) = p$, $P(X = 0) = 1 - p$ • Examples: X=1 for heads in coin toss • X=1 for male in survey • X=1 for defective in a test of product
Binomial Distribution • The binomial distribution is just n independent Bernoullis added up. • It is the number of “successes” in n trials. • Probability of success is usually denoted by p, and therefore probability of failure is 1-p. • Example: Number of heads when we flip a coin 10 times. Here n = 10, p = 0.5 (the probability of getting a head when we toss the coin once).
Binomial Distribution • The binomial probability function: $P(X = x) = \frac{n!}{x!\,(n-x)!}\, p^{x} (1-p)^{n-x}$, for $x = 0, 1, \ldots, n$ • Example: X = Number of heads when we flip a coin 10 times. Here X ~ Binomial (n = 10, p = 0.5) • n! = n factorial = n·(n-1)·(n-2)·…·1 • 10! = 10·9·8·7·6·5·4·3·2·1 = 3,628,800
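The coin-flip example can be worked through in Python. The sketch assumes the standard binomial probability function, P(X = x) = n! / (x! (n-x)!) × p^x × (1-p)^(n-x); `binomial_pmf` is a hypothetical helper, and `math.comb(n, x)` supplies the factorial term n! / (x! (n-x)!).

```python
from math import comb, factorial


def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


# X = number of heads in 10 flips of a fair coin.
n, p = 10, 0.5
pmf = [binomial_pmf(x, n, p) for x in range(n + 1)]

print(factorial(10))   # 10! as used in the factorial term
for x, prob in enumerate(pmf):
    print(x, round(prob, 4))
```

The probabilities over x = 0 to 10 sum to 1, and the most likely single outcome is 5 heads, with probability 252/1024, about 0.246.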