Measures of Central Tendency and Dispersion: Analyzing Populations versus Analyzing Samples
Presentation Transcript
Chapter 3 Numerically Summarizing Data
Chapter 3 – Numerically Summarizing Data • 3.1 Measures of Central Tendency • 3.2 Measures of Dispersion • 3.3 Measures of Central Tendency and Dispersion from Grouped Data • 3.4 Measures of Position • 3.5 The Five Number Summary and Boxplots
Chapter 3 Section 1 Measures of Central Tendency
Chapter 3 – Section 1 • Analyzing populations versus analyzing samples • For populations • We know all of the data • Descriptive measures of populations are called parameters • Parameters are often written using Greek letters ( μ ) • For samples • We know only part of the entire data • Descriptive measures of samples are called statistics • Statistics are often written using Roman letters ( x̄ )
Chapter 3 – Section 1 • The arithmetic mean of a variable is often what people mean by the “average” … add up all the values and divide by how many there are • Compute the arithmetic mean of 6, 1, 5 • Add up the three numbers and divide by 3: (6 + 1 + 5) / 3 = 4.0 • The arithmetic mean is 4.0
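A minimal Python sketch of the "add up and divide" computation on this slide. The function name arithmetic_mean is an illustrative choice, not something from the slides.

```python
def arithmetic_mean(values):
    """Add up all the values and divide by how many there are."""
    if not values:
        raise ValueError("need at least one value")
    return sum(values) / len(values)


# The slide's example: (6 + 1 + 5) / 3
print(arithmetic_mean([6, 1, 5]))  # 4.0
```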
Chapter 3 – Section 1 • The arithmetic mean is usually called the mean • For a population … the population mean • Is computed using all the observations in a population • Is denoted μ • Is a parameter • For a sample … the sample mean • Is computed using only the observations in a sample • Is denoted x̄ • Is a statistic
Chapter 3 – Section 1 • The median of a variable is the “center” • When the data is sorted in order, the median is the middle value • The calculation of the median of a variable is slightly different depending on • If there are an odd number of points, or • If there are an even number of points
Chapter 3 – Section 1 • To calculate the median (M) of a data set • Arrange the data in order • Count the number of observations, n • If n is odd • There is a value that’s exactly in the middle • That value is the median M • If n is even • There are two middle values, one on either side of the exact middle • Take their mean to be the median M
Chapter 3 – Section 1 • An example with an odd number of observations (5 observations) • Compute the median of 6, 1, 11, 2, 11 • Sort them in order 1, 2, 6, 11, 11 • The middle number is 6, so the median is 6
Chapter 3 – Section 1 • An example with an even number of observations (4 observations) • Compute the median of 6, 1, 11, 2 • Sort them in order 1, 2, 6, 11 • Take the mean of the two middle values (2 + 6) / 2 = 4 • The median is 4
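The odd and even cases above can be sketched in Python as follows. This is an illustration of the steps on the slides (sort, count, pick or average the middle); the function name median is an assumption made for the example.

```python
def median(values):
    """Sort the data, then take the middle value (n odd)
    or the mean of the two middle values (n even)."""
    data = sorted(values)
    n = len(data)
    if n == 0:
        raise ValueError("need at least one value")
    mid = n // 2
    if n % 2 == 1:
        return data[mid]                    # exactly one middle value
    return (data[mid - 1] + data[mid]) / 2  # mean of the two middle values


print(median([6, 1, 11, 2, 11]))  # 6
print(median([6, 1, 11, 2]))      # 4.0
```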
Chapter 3 – Section 1 • One interpretation • The median splits the data into halves • For 62, 68, 71, 74, 77, 82, 84, 88, 90, 94 the median is M = (77 + 82) / 2 = 79.5 • 5 values on the left: 62, 68, 71, 74, 77 • 5 values on the right: 82, 84, 88, 90, 94
Chapter 3 – Section 1 • The mode of a variable is the most frequently occurring value • Find the mode of 6, 1, 2, 6, 11, 7, 3 • The distinct values, in order, are 1, 2, 3, 6, 7, 11 • The value 6 occurs twice, all the other values occur only once • The mode is 6
Chapter 3 – Section 1 • Qualitative data • Values are one of a set of categories • Cannot add or order them … the mean and median do not exist • The mode is the only one of these three measurements that exists • Find the mode of blue, blue, blue, red, green • The mode is “blue” because it is the value that occurs the most often
Chapter 3 – Section 1 • Quantitative data • The mode can be computed but sometimes it is not meaningful • Sometimes each value will only occur once (which can often happen with precise measurements) • Find the mode of 5.1, 6.6, 6.8, 9.3, 1.9 • Each value occurs only once • The mode is not a meaningful measurement
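A short Python sketch covering the three mode examples above. Returning an empty list when every value occurs only once is a choice made for this illustration, to mirror the slide's "not a meaningful measurement"; the function name modes is likewise an assumption.

```python
from collections import Counter


def modes(values):
    """Return the most frequently occurring value(s).

    Returns an empty list when several values are present and each
    occurs only once (illustrative convention, not from the slides).
    """
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1 and len(counts) > 1:
        return []
    return [value for value, count in counts.items() if count == top]


print(modes([6, 1, 2, 6, 11, 7, 3]))                   # [6]
print(modes(["blue", "blue", "blue", "red", "green"]))  # ['blue']
print(modes([5.1, 6.6, 6.8, 9.3, 1.9]))                 # []
```

The same function works for qualitative data because counting occurrences needs no arithmetic or ordering.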
Chapter 3 – Section 1 • One interpretation • In primary elections, the candidate who receives the most votes is often called “the winner” • Votes (data values) are [vote table not preserved in this transcript] • The mode is “Kayla” … Kayla is the winner
Chapter 3 – Section 1 • The mean and the median are often different • This difference gives us clues about the shape of the distribution • Is it symmetric? • Is it skewed left? • Is it skewed right? • Are there any extreme values?
Chapter 3 – Section 1 • Symmetric – the mean will usually be close to the median • Skewed left – the mean will usually be smaller than the median • Skewed right – the mean will usually be larger than the median
Chapter 3 – Section 1 • If a distribution is symmetric, the data values above and below the mean will balance • The mean will be in the “middle” • The median will be in the “middle” • Thus the mean will be close to the median, in general, for a distribution that is symmetric
Chapter 3 – Section 1 • If a distribution is skewed left, there will be some data values that are much smaller than the others • The mean will decrease • The median will not decrease as much • Thus the mean will be smaller than the median, in general, for a distribution that is skewed left
Chapter 3 – Section 1 • If a distribution is skewed right, there will be some data values that are much larger than the others • The mean will increase • The median will not increase as much • Thus the mean will be larger than the median, in general, for a distribution that is skewed right
Chapter 3 – Section 1 • For a mostly symmetric distribution, the mean and the median will be roughly equal • Many variables, such as birth weights below, are approximately symmetric
Chapter 3 – Section 1 • What if one value is extremely different from the others (such a value is called an outlier)? • What if we made a mistake and 6, 1, 2 was recorded as 6000, 1, 2 • The mean is now ( 6000 + 1 + 2 ) / 3 = 2001 • The median is still 2 • The median is “resistant to extreme values”
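The recording-error example can be checked with Python's standard statistics module. This is a small sketch; the variable names correct and mistyped are illustrative.

```python
import statistics

correct = [6, 1, 2]
mistyped = [6000, 1, 2]  # the recording error from the slide

for label, data in (("correct", correct), ("mistyped", mistyped)):
    # The mean moves from 3 to 2001; the median stays at 2.
    print(label, "mean:", statistics.mean(data),
          "median:", statistics.median(data))
```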
Summary: Chapter 3 – Section 1 • Mean • The center of gravity • Useful for roughly symmetric quantitative data • Median • Splits the data into halves • Useful for highly skewed quantitative data • Mode • The most frequent value • Useful for qualitative data
Chapter 3 Section 2 Measures of Dispersion
Chapter 3 – Section 2 • Learning objectives • 1. The range of a variable • 2. The variance of a variable • 3. The standard deviation of a variable • 4. Use the Empirical Rule • 5. Use Chebyshev’s inequality
Chapter 3 – Section 2 • Comparing two sets of data • The measures of central tendency (mean, median, mode) compare the “average” or “typical” values of two sets of data • The measures of dispersion in this section compare how far “spread out” the data values are
Chapter 3 – Section 2 • The range of a variable is the largest data value minus the smallest data value • Compute the range of 6, 1, 2, 6, 11, 7, 3, 3 • The largest value is 11 • The smallest value is 1 • Subtracting the two … 11 – 1 = 10 … the range is 10
Chapter 3 – Section 2 • The range only uses two values in the data set – the largest value and the smallest value • The range is not resistant • If we made a mistake and 6, 1, 2 was recorded as 6000, 1, 2 • The range is now ( 6000 – 1 ) = 5999
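A brief Python sketch of the range and its lack of resistance, using the two data sets from the slides. The name value_range is chosen for the example (it avoids shadowing Python's built-in range).

```python
def value_range(values):
    """Largest data value minus smallest data value."""
    if not values:
        raise ValueError("need at least one value")
    return max(values) - min(values)


print(value_range([6, 1, 2, 6, 11, 7, 3, 3]))  # 10
print(value_range([6, 1, 2]))                  # 5
print(value_range([6000, 1, 2]))               # 5999
```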
Chapter 3 – Section 2 • The variance is based on the deviation from the mean • ( xi – μ ) for populations • ( xi – x̄ ) for samples • To treat positive differences and negative differences alike, we square the deviations • ( xi – μ )² for populations • ( xi – x̄ )² for samples
Chapter 3 – Section 2 • The population variance of a variable is the sum of these squared deviations divided by the number in the population, N • σ² = [ ( x1 – μ )² + ( x2 – μ )² + … + ( xN – μ )² ] / N • The population variance is represented by σ² • Note: For accuracy, use as many decimal places as allowed by your calculator
Chapter 3 – Section 2 • Compute the population variance of 6, 1, 2, 11 • Compute the population mean first μ = (6 + 1 + 2 + 11) / 4 = 5 • Now compute the squared deviations (1–5)² = 16, (2–5)² = 9, (6–5)² = 1, (11–5)² = 36 • Average the squared deviations (16 + 9 + 1 + 36) / 4 = 15.5 • The population variance σ² is 15.5
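The three steps of this worked example (mean, squared deviations, average) can be sketched in Python. The function name population_variance is illustrative.

```python
def population_variance(values):
    """Average of the squared deviations from the population mean."""
    n = len(values)
    if n == 0:
        raise ValueError("need at least one value")
    mu = sum(values) / n                                # population mean
    squared_deviations = [(x - mu) ** 2 for x in values]
    return sum(squared_deviations) / n                  # divide by N


print(population_variance([6, 1, 2, 11]))  # 15.5
```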
Chapter 3 – Section 2 • The sample variance of a variable is the sum of these squared deviations divided by one less than the number in the sample, n – 1 • s² = [ ( x1 – x̄ )² + ( x2 – x̄ )² + … + ( xn – x̄ )² ] / ( n – 1 ) • The sample variance is represented by s² • We say that this statistic has n – 1 degrees of freedom
Chapter 3 – Section 2 • Compute the sample variance of 6, 1, 2, 11 • Compute the sample mean first x̄ = (6 + 1 + 2 + 11) / 4 = 5 • Now compute the squared deviations (1–5)² = 16, (2–5)² = 9, (6–5)² = 1, (11–5)² = 36 • Sum the squared deviations and divide by n – 1 = 3: (16 + 9 + 1 + 36) / 3 ≈ 20.7 • The sample variance s² is approximately 20.7
Chapter 3 – Section 2 • Why are the population variance (15.5) and the sample variance (20.7) different for the same set of numbers? • In the first case, { 6, 1, 2, 11 } was the entire population (divide by N) • In the second case, { 6, 1, 2, 11 } was just a sample from the population (divide by n – 1) • These are two different situations
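To see the two situations side by side, here is a small Python sketch. The hand-written sample_variance function is an illustration; statistics.pvariance and statistics.variance are the standard-library functions for the population and sample cases.

```python
import statistics


def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


data = [6, 1, 2, 11]
print(round(sample_variance(data), 1))  # 20.7  (62 / 3, treating data as a sample)
print(statistics.pvariance(data))       # 15.5  (62 / 4, treating data as a population)
```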
Chapter 3 – Section 2 • Why do we use different formulas? • The reason is that the sample mean x̄ is computed from the same sample, so the deviations from x̄ tend to be smaller than the deviations from the population mean μ • If we used “n” in the denominator for the sample variance calculation, we would get a “biased” result • Bias here means that we would tend to underestimate the true variance
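The bias can be illustrated exactly on a tiny case. The sketch below treats { 6, 1, 2, 11 } as the whole population and enumerates every ordered sample of size 2 drawn with replacement; the sampling scheme and the helper name variance are assumptions made for this illustration, not part of the slides.

```python
from itertools import product

population = [6, 1, 2, 11]


def variance(values, divisor):
    """Sum of squared deviations from the mean of `values`, over `divisor`."""
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values) / divisor


sigma2 = variance(population, len(population))  # true variance, 15.5

# All 16 ordered samples of size 2, drawn with replacement.
samples = list(product(population, repeat=2))
avg_divide_by_n = sum(variance(s, 2) for s in samples) / len(samples)
avg_divide_by_n_minus_1 = sum(variance(s, 1) for s in samples) / len(samples)

print(sigma2)                   # 15.5
print(avg_divide_by_n)          # 7.75  -> underestimates on average
print(avg_divide_by_n_minus_1)  # 15.5  -> matches the true variance on average
```

Averaged over all possible samples, dividing by n gives only half of the true variance here, while dividing by n – 1 recovers it.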
Chapter 3 – Section 2 • The standard deviation is the square root of the variance • The population standard deviation • Is the square root of the population variance (σ²) • Is represented by σ • The sample standard deviation • Is the square root of the sample variance (s²) • Is represented by s
Chapter 3 – Section 2 • If the population is { 6, 1, 2, 11 } • The population variance σ² = 15.5 • The population standard deviation σ = √15.5 ≈ 3.9 • If the sample is { 6, 1, 2, 11 } • The sample variance s² = 20.7 • The sample standard deviation s = √20.7 ≈ 4.5 • The population standard deviation and the sample standard deviation apply in different situations
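Both standard deviations can be checked with Python's standard statistics module (pstdev for a population, stdev for a sample). A minimal sketch, with illustrative variable names:

```python
import statistics

data = [6, 1, 2, 11]

sigma = statistics.pstdev(data)  # square root of the population variance
s = statistics.stdev(data)       # square root of the sample variance

print(round(sigma, 1))  # 3.9
print(round(s, 1))      # 4.5
```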
Chapter 3 – Section 2 • The standard deviation is very useful for estimating probabilities
Chapter 3 – Section 2 • The empirical rule • If the distribution is roughly bell shaped, then • Approximately 68% of the data will lie within 1 standard deviation of the mean • Approximately 95% of the data will lie within 2 standard deviations of the mean • Approximately 99.7% of the data (i.e. almost all) will lie within 3 standard deviations of the mean
Chapter 3 – Section 2 • For a variable with mean 17 and standard deviation 3.4 • Approximately 68% of the values will lie between (17 – 3.4) and (17 + 3.4), i.e. 13.6 and 20.4 • Approximately 95% of the values will lie between (17 – 2 × 3.4) and (17 + 2 × 3.4), i.e. 10.2 and 23.8 • Approximately 99.7% of the values will lie between (17 – 3 × 3.4) and (17 + 3 × 3.4), i.e. 6.8 and 27.2 • A value of 2.1 and a value of 33.2 would both be very unusual
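The intervals above can be reproduced with a short Python sketch. It assumes a roughly bell-shaped distribution, as the empirical rule requires; the function names and the "more than 3 standard deviations away" test for "very unusual" are choices made for this illustration.

```python
def empirical_rule_intervals(mean, sd):
    """Map each approximate percentage to its (low, high) interval."""
    coverage = {1: 68.0, 2: 95.0, 3: 99.7}  # k standard deviations -> percent
    return {pct: (mean - k * sd, mean + k * sd) for k, pct in coverage.items()}


def is_unusual(value, mean, sd):
    """True when the value lies more than 3 standard deviations from the mean."""
    return abs(value - mean) > 3 * sd


for pct, (low, high) in empirical_rule_intervals(17, 3.4).items():
    print(f"about {pct}% of values between {low:.1f} and {high:.1f}")

print(is_unusual(2.1, 17, 3.4), is_unusual(33.2, 17, 3.4))  # True True
```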
Chapter 3 – Section 2 • Chebyshev’s inequality gives a lower bound on the percentage of observations that lie within k standard deviations of the mean (where k > 1) • This lower bound is • An estimated percentage • The actual percentage for any variable cannot be lower than this number • Therefore the actual percentage must be this value or higher
Chapter 3 – Section 2 • Chebyshev’s inequality • For any data set, at least ( 1 – 1/k² ) · 100% of the observations will lie within k standard deviations of the mean, where k is any number greater than 1
Chapter 3 – Section 2 • How much of the data lies within 1.5 standard deviations of the mean? • From Chebyshev’s inequality with k = 1.5, ( 1 – 1/1.5² ) · 100% = ( 1 – 1/2.25 ) · 100% ≈ 55.6%, so that at least 55.6% of the data will lie within 1.5 standard deviations of the mean
Chapter 3 – Section 2 • If the mean is equal to 20 and the standard deviation is equal to 4, how much of the data lies between 14 and 26? • 14 and 26 are each 6 = 1.5 × 4 away from 20, i.e. 1.5 standard deviations from the mean, so that at least 55.6% of the data will lie between 14 and 26
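Both Chebyshev examples can be sketched in Python. The function names are illustrative, and the interval helper assumes the interval is symmetric about the mean, as in the slide's example.

```python
def chebyshev_lower_bound(k):
    """Minimum fraction of observations within k standard deviations (k > 1)."""
    if k <= 1:
        raise ValueError("Chebyshev's inequality is stated here for k > 1")
    return 1 - 1 / k ** 2


def chebyshev_for_interval(low, high, mean, sd):
    """Lower bound for an interval of the form mean - k*sd to mean + k*sd."""
    k = (high - mean) / sd
    if abs((mean - low) / sd - k) > 1e-9:
        raise ValueError("interval must be symmetric about the mean")
    return chebyshev_lower_bound(k)


print(f"{chebyshev_lower_bound(1.5):.1%}")               # 55.6%
print(f"{chebyshev_for_interval(14, 26, 20, 4):.1%}")    # 55.6%
```

Unlike the empirical rule, this bound makes no assumption about the shape of the distribution, which is why it is weaker (at least 75% within 2 standard deviations, versus approximately 95% for bell-shaped data).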
Summary: Chapter 3 – Section 2 • Range • The maximum minus the minimum • Not a resistant measurement • Variance and standard deviation • Measures deviations from the mean • Not a resistant measurement • Empirical rule • About 68% of the data is within 1 standard deviation • About 95% of the data is within 2 standard deviations
Chapter 3 Section 3 Measures of Central Tendency and Dispersion from Grouped Data
Chapter 3 – Section 3 • Learning objectives • 1. The mean from grouped data • 2. The weighted mean • 3. The variance and standard deviation for grouped data
Chapter 3 – Section 3 • Data may come in groups rather than individually • The values may have been summarized in frequency distributions • Ranges of ages (20 – 29, 30 – 39, ...) • Ranges of incomes ($10,000 – $19,999, $20,000 – $39,999, $40,000 – $79,999, ...) • The exact values for the mean, variance, and standard deviation cannot be calculated
Chapter 3 – Section 3 • To compute the mean for grouped data • Assume that, within each class, the mean of the data is equal to the class midpoint • Use the class midpoint in the formula for the mean • The number of times the class midpoint value is used is equal to the frequency of the class • If 6 values are in the interval [ 8, 10 ], then we assume that all 6 values are equal to 9 (the midpoint of [ 8, 10 ])
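The midpoint approach can be sketched in Python as follows. The class [ 8, 10 ] with frequency 6 comes from the slide; the second class [ 11, 13 ] with frequency 4 is hypothetical, added only so the example has more than one class. The function name grouped_mean is also an assumption.

```python
def grouped_mean(classes):
    """Approximate mean from (lower, upper, frequency) classes.

    Every value in a class is assumed to equal the class midpoint,
    so each midpoint is counted `frequency` times.
    """
    total_frequency = sum(freq for _, _, freq in classes)
    if total_frequency == 0:
        raise ValueError("need at least one observation")
    weighted_sum = sum((lower + upper) / 2 * freq
                       for lower, upper, freq in classes)
    return weighted_sum / total_frequency


print(grouped_mean([(8, 10, 6)]))               # 9.0
# Hypothetical second class: 4 values in [11, 13], midpoint 12.
print(grouped_mean([(8, 10, 6), (11, 13, 4)]))  # (6*9 + 4*12) / 10 = 10.2
```

The result is an approximation, since the exact values inside each class are unknown.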