Chapter 3 Using Numbers to Describe Distributions of Data
Measures of Central Location • The measure of central location reflects the locations of all the actual data points. • How? With one data point, the central location is clearly at the point itself. With two data points, the central location should fall in the middle between them (in order to reflect the location of both of them). But if a third data point appears to the left of the midrange, it should “pull” the central location to the left.
The Arithmetic Mean • This is the most popular and useful measure of central location. • It is often called the average. • Mean = $\frac{\text{Sum of the observations}}{\text{Number of observations}}$
Useful Notation
• x: lowercase letter x, represents any measurement in a sample of data.
• n: lowercase letter n, the number of measurements in a sample.
• ∑: uppercase Greek letter sigma, represents a sum.
• ∑x: “add all the measurements in a sample.”
• x̄: lowercase x with a bar over it, denotes the sample mean.
• µ: lowercase Greek letter mu, denotes the population mean.
The Arithmetic Mean • Sample mean: $\bar{x} = \frac{\sum x}{n}$, where n is the sample size. • Population mean: $\mu = \frac{\sum x}{N}$, where N is the population size.
The Arithmetic Mean • Example 1: The reported times on the Internet of 10 adults are 0, 7, 12, 5, 33, 14, 8, 0, 9, 22 hours. Find the mean time on the Internet. $\bar{x} = \frac{0 + 7 + \cdots + 22}{10} = \frac{110}{10} = 11.0$ hours.
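The arithmetic in Example 1 can be checked with a few lines of Python. This is only a minimal sketch; the function name `sample_mean` is an illustrative choice, not something from the slides.

```python
# Reported hours on the Internet for the 10 adults in Example 1.
hours = [0, 7, 12, 5, 33, 14, 8, 0, 9, 22]


def sample_mean(values):
    """Arithmetic mean: sum of the observations / number of observations."""
    return sum(values) / len(values)


mean_hours = sample_mean(hours)
print(mean_hours)  # 11.0
```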
The Median • The median of a set of observations is the value that falls in the middle when the observations are arranged in order of magnitude. • Example: Find the median of the time on the Internet for the 10 adults of Example 1. Even number of observations: 0, 0, 5, 7, 8, 9, 12, 14, 22, 33. The median is the average of the two middle values, (8 + 9)/2 = 8.5. • Comment: Suppose only 9 adults were sampled (exclude, say, the longest time, 33). Odd number of observations: 0, 0, 5, 7, 8, 9, 12, 14, 22. The median is the middle value, 8.
Measures of Center 1) Sample Mean: $\bar{x} = \frac{\sum x}{n}$, where n is the sample size. 2) Sample Median: First, put the data in order. Then median = the middle number for odd sample sizes, or the average of the two middle values for even sample sizes.
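The odd/even rule for the median can be sketched as follows, using the Internet-time data from Example 1. The helper name `sample_median` is an assumption made for illustration.

```python
def sample_median(values):
    """Median: sort first, then take the middle value (odd n)
    or the average of the two middle values (even n)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


hours = [0, 7, 12, 5, 33, 14, 8, 0, 9, 22]

median_even = sample_median(hours)                         # 10 observations
median_odd = sample_median([h for h in hours if h != 33])  # 9 observations

print(median_even)  # 8.5
print(median_odd)   # 8
```

Dropping the largest observation (33) moves the median only from 8.5 to 8, which illustrates how little the median reacts to one extreme value.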
Examples – Time to Complete an Exam A random sample of times, in minutes, to complete a statistics exam yielded the following times. Compute the mean and median for this data. 33, 29, 45, 60, 42, 19, 52, 38, 36 The mean is $\bar{x} = \frac{354}{9} \approx 39.3$ minutes. Recall, we must rank (sort) the data before finding the median. 19, 29, 33, 36, 38, 42, 45, 52, 60 Since there are 9 (odd) data points, the 5th point is the median. The median is 38 minutes.
Examples – Miles Jogged Last Week A random sample of 12 joggers was asked to keep track of the distance they ran (in miles) over a week’s time. Compute the mean and median for this data. 5.5, 7.2, 1.6, 22.0, 8.7, 2.8, 5.3, 3.4, 12.5, 18.6, 8.3, 6.6 The mean is $\bar{x} = \frac{102.5}{12} \approx 8.54$ miles.
Examples – Miles Jogged Last Week (Cont) Recall, we must rank (sort) the data before finding the median. 1.6, 2.8, 3.4, 5.3, 5.5, 6.6, 7.2, 8.3, 8.7, 12.5, 18.6, 22.0 Since there are 12 (even) data points, the median is the average of the 6th and 7th points: (6.6 + 7.2)/2 = 6.9 miles.
Mean and Median Comparisons Recall the mean (8.54 miles) is larger than the median (6.9 miles) for this data. This occurs when the data is skewed to the right.
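The comparison for the jogging data can be reproduced with Python's standard `statistics` module. This is a sketch only; a mean above the median is a hint of right skew, not a formal test.

```python
import statistics

# Miles jogged last week by the 12 joggers in the sample.
miles = [5.5, 7.2, 1.6, 22.0, 8.7, 2.8, 5.3, 3.4, 12.5, 18.6, 8.3, 6.6]

mean_miles = statistics.mean(miles)
median_miles = statistics.median(miles)  # average of the 6th and 7th sorted values

print(round(mean_miles, 2))    # 8.54
print(round(median_miles, 2))  # 6.9

# The few long distances (18.6, 22.0) pull the mean above the median.
mean_exceeds_median = mean_miles > median_miles
print(mean_exceeds_median)     # True
```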
Mean and Median Comparisons • If the data is symmetric, the mean and the median are approximately the same (example: mean = -0.0373, median = -0.0173). • If the data is skewed to the right, the mean is larger than the median (example: mean = 10.71, median = 7.75). • If the data is skewed to the left, the mean is smaller than the median (example: mean = 4.829, median = 6.629).
Relationship among Mean, Median, and Mode • If a distribution is symmetrical, the mean, median and mode coincide. • If a distribution is asymmetrical, and skewed to the left or to the right, the three measures differ. • In a positively skewed distribution (“skewed to the right”), the usual ordering is Mode < Median < Mean.
3.2 Measures of variability • Measures of central location fail to tell the whole story about the distribution. • A question of interest still remains unanswered: How much are the observations spread out around the mean value?
3.2 Measures of variability Observe two hypothetical data sets: • Small variability: the average value provides a good representation of the observations in the data set. • Larger variability: the same average value does not provide as good a representation of the observations in the data set as before.
The Range • The range of a set of observations is the difference between the largest and smallest observations: Range = Largest observation − Smallest observation. • Its major advantage is the ease with which it can be computed. • Its major shortcoming is its failure to provide information on the dispersion of the observations between the two end points. But how do all the observations spread out? The range cannot assist in answering this question.
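A short sketch of the range and of its shortcoming. The Internet-time data comes from Example 1; the lists `clustered` and `spread_out` are hypothetical data sets invented here only to show two different spreads sharing one range.

```python
def data_range(values):
    """Range = largest observation - smallest observation."""
    return max(values) - min(values)


hours = [0, 7, 12, 5, 33, 14, 8, 0, 9, 22]
hours_range = data_range(hours)
print(hours_range)  # 33

# Hypothetical data sets (not from the slides): same end points,
# very different behaviour between them.
clustered = [0, 5, 5, 5, 10]
spread_out = [0, 0, 5, 10, 10]

same_range = data_range(clustered) == data_range(spread_out)
print(same_range)  # True: the range cannot tell these two apart
```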
Notation for Samples and Populations Recall, we will use statistics to make inferences about population values.
Sample Descriptive Measures          Population Descriptive Measures
x̄ = sample mean                      µ = population mean
s² = sample variance                 σ² = population variance
s = sample standard deviation        σ = population standard deviation
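The Variance • This measure reflects the dispersion of all the observations. • The variance of a population of size N, x1, x2, …, xN, whose mean is µ is defined as $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$. • The variance of a sample of n observations x1, x2, …, xn whose mean is x̄ is defined as $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$.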
Why not use the sum of deviations? Consider two small populations: A: 8, 9, 10, 11, 12 and B: 4, 7, 10, 13, 16. The mean of both populations is 10, but measurements in B are more dispersed than those in A. A measure of dispersion should agree with this observation. Can the sum of deviations be a good measure of dispersion? • Deviations in A: 8-10 = -2, 9-10 = -1, 10-10 = 0, 11-10 = +1, 12-10 = +2; Sum = 0. • Deviations in B: 4-10 = -6, 7-10 = -3, 10-10 = 0, 13-10 = +3, 16-10 = +6; Sum = 0. • The sum of deviations is zero for both populations and is therefore not a good measure of dispersion.
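The cancellation can be verified directly. A minimal sketch, with the helper name `deviations` chosen for illustration:

```python
population_a = [8, 9, 10, 11, 12]
population_b = [4, 7, 10, 13, 16]


def deviations(values):
    """Deviation of each observation from the mean of the data set."""
    mean = sum(values) / len(values)
    return [x - mean for x in values]


sum_dev_a = sum(deviations(population_a))
sum_dev_b = sum(deviations(population_b))

# Positive and negative deviations cancel, so both sums are zero
# even though B is clearly more spread out than A.
print(sum_dev_a, sum_dev_b)  # 0.0 0.0
```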
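The Variance • Let us calculate the variance of the two populations: $\sigma_A^2 = \frac{(8-10)^2 + (9-10)^2 + (10-10)^2 + (11-10)^2 + (12-10)^2}{5} = 2$ and $\sigma_B^2 = \frac{(4-10)^2 + (7-10)^2 + (10-10)^2 + (13-10)^2 + (16-10)^2}{5} = 18$. • Why is the variance defined as the average squared deviation? Why not use the sum of squared deviations as a measure of variation instead? After all, the sum of squared deviations increases in magnitude when the variation of a data set increases!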
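A sketch of the population variance as the average squared deviation, applied to populations A (8, 9, 10, 11, 12) and B (4, 7, 10, 13, 16). The function name `population_variance` is illustrative; the result is cross-checked against `statistics.pvariance` from the standard library.

```python
import statistics

population_a = [8, 9, 10, 11, 12]
population_b = [4, 7, 10, 13, 16]


def population_variance(values):
    """Average squared deviation from the population mean (divide by N)."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)


var_a = population_variance(population_a)
var_b = population_variance(population_b)
print(var_a, var_b)  # 2.0 18.0

# The standard library computes the same quantity.
agrees_with_stdlib = (
    var_a == statistics.pvariance(population_a)
    and var_b == statistics.pvariance(population_b)
)
print(agrees_with_stdlib)  # True
```

Unlike the sum of deviations, the variance ranks B as more dispersed than A.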
The Variance • Let us calculate the sum of squared deviations for both data sets. Data set A has observations at 1 and 3 (mean 2); data set B has observations at 1 and 5 (mean 3). • Which data set has a larger dispersion? Data set B is more dispersed around the mean.
The Variance • $\text{Sum}_A = (1-2)^2 + \cdots + (1-2)^2 + (3-2)^2 + \cdots + (3-2)^2 = 10$ • $\text{Sum}_B = (1-3)^2 + (5-3)^2 = 8$ • Sum_A > Sum_B. This is inconsistent with the observation that set B is more dispersed.
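The Variance • However, when calculated on a “per observation” basis (variance), the data set dispersions are properly ranked. • Every term in Sum_A equals 1, so a sum of 10 means data set A has N = 10 observations: $\sigma_A^2 = \text{Sum}_A / N = 10/10 = 1$ • Data set B has N = 2 observations: $\sigma_B^2 = \text{Sum}_B / N = 8/2 = 4$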
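The Variance • Example: The following sample consists of the number of jobs six students applied for: 17, 15, 23, 7, 9, 13. Find its mean and variance. • Solution: $\bar{x} = \frac{17 + 15 + 23 + 7 + 9 + 13}{6} = \frac{84}{6} = 14$ jobs. $s^2 = \frac{(17-14)^2 + (15-14)^2 + (23-14)^2 + (7-14)^2 + (9-14)^2 + (13-14)^2}{6 - 1} = \frac{166}{5} = 33.2$ jobs².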
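The job-application example can be worked through in Python. This sketch assumes the usual sample-variance definition with divisor n − 1, which is also what `statistics.variance` uses.

```python
import statistics

# Number of jobs each of the six students applied for.
jobs = [17, 15, 23, 7, 9, 13]

n = len(jobs)
mean_jobs = sum(jobs) / n

# Sum of squared deviations from the sample mean.
sum_squares = sum((x - mean_jobs) ** 2 for x in jobs)

# Sample variance divides by n - 1, not n.
var_jobs = sum_squares / (n - 1)

print(mean_jobs)    # 14.0
print(sum_squares)  # 166.0
print(var_jobs)     # 33.2

# Cross-check against the standard library's sample variance.
matches_stdlib = abs(var_jobs - statistics.variance(jobs)) < 1e-9
```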
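Standard Deviation • The standard deviation of a set of observations is the square root of the variance. • Sample standard deviation: $s = \sqrt{s^2}$ • Population standard deviation: $\sigma = \sqrt{\sigma^2}$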
Properties of the Standard Deviation, s 1. s measures the variability in a sample of measurements. It is a measure of how much the sample values deviate from the sample mean. 2. s is a nonnegative number. If all the numbers in a sample are equal, the value of the standard deviation will be zero. This is the smallest possible value for the standard deviation. 3. When comparing 2 samples of data, the sample that is more variable will have a larger standard deviation.
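These properties can be illustrated with `statistics.stdev`, which computes the sample standard deviation. The sketch reuses data that appears earlier in the chapter and, as an assumption made for illustration, treats the two small data sets 8, 9, 10, 11, 12 and 4, 7, 10, 13, 16 as samples; the list `constant` is hypothetical.

```python
import statistics

# 1. s is the square root of the sample variance.
jobs = [17, 15, 23, 7, 9, 13]
s_jobs = statistics.stdev(jobs)
print(round(s_jobs, 2))  # 5.76

# 2. s is zero when every value in the sample is the same.
constant = [5, 5, 5, 5]  # hypothetical sample
s_constant = statistics.stdev(constant)
print(s_constant)  # 0.0

# 3. The more variable sample has the larger standard deviation.
set_a = [8, 9, 10, 11, 12]
set_b = [4, 7, 10, 13, 16]
s_a = statistics.stdev(set_a)
s_b = statistics.stdev(set_b)
print(s_b > s_a)  # True
```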
Standard Deviation • Example: To examine the consistency of shots for a new innovative golf club, a golfer was asked to hit 150 shots, 75 with a currently used (7-iron) club, and 75 with the new club. • The distances were recorded. • Which 7-iron is more consistent?
Standard Deviation • Example – solution: Excel printout, from the “Descriptive Statistics” sub-menu. The innovation club is more consistent and, because the means are close, is considered the better club.
Interpreting Standard Deviation • For sets of quantitative data that result from real-life experiments, the following statements are generally true: • 1. Most of the measurements will be within 2 standard deviations of the mean • 2. All, or almost all of the measurements will be within 3 standard deviations of the mean.
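Interpreting Standard Deviation • The standard deviation can be used to compare the variability of several distributions and to make a statement about the general shape of a distribution. • The empirical rule: If a sample of observations has a mound-shaped distribution, the interval $(\bar{x} - s, \bar{x} + s)$ contains approximately 68% of the measurements, $(\bar{x} - 2s, \bar{x} + 2s)$ contains approximately 95% of the measurements, and $(\bar{x} - 3s, \bar{x} + 3s)$ contains approximately 99.7% of the measurements.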
Empirical Rule Example A sample of n = 40 students who were asked for their one-way commute times to campus yielded a mean of 13.6 minutes with a standard deviation of 2.1 minutes. Empirical Rule: Most students commute between 9.4 and 17.8 minutes to campus (mean ± 2 standard deviations). Almost all students commute between 7.3 and 19.9 minutes to campus (mean ± 3 standard deviations).
Empirical Rule Example #2 The construction time for a 3-bedroom house for a local builder is known to follow a mound-shaped and symmetric distribution with a mean of 84 days and a standard deviation of 7 days. a) Most 3-bedroom houses take between 70 and 98 days to be completed for this builder. b) Almost all 3-bedroom houses take between 63 and 105 days to be completed for this builder.
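The intervals in both empirical-rule examples are simply mean ± 2 and mean ± 3 standard deviations. A minimal sketch; the function name `interval` and the rounding to one decimal place are illustrative choices.

```python
def interval(mean, sd, k):
    """Return (mean - k*sd, mean + k*sd), rounded to one decimal place."""
    return (round(mean - k * sd, 1), round(mean + k * sd, 1))


# Commute times: mean 13.6 minutes, standard deviation 2.1 minutes.
commute_most = interval(13.6, 2.1, 2)
commute_almost_all = interval(13.6, 2.1, 3)
print(commute_most)        # (9.4, 17.8)
print(commute_almost_all)  # (7.3, 19.9)

# House construction times: mean 84 days, standard deviation 7 days.
house_most = interval(84, 7, 2)
house_almost_all = interval(84, 7, 3)
print(house_most)          # (70, 98)
print(house_almost_all)    # (63, 105)
```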
Measures of Variability (Spread) for Samples We wish to quantify how spread out from the center the data is. • Sample range: R = largest value – smallest value • Sample variance: $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$ • Sample standard deviation: $s = \sqrt{s^2}$ StatCrunch will be used to calculate the standard deviation for most of our data sets.
A Complete Analysis for a Data Set Bone density loss measurements were taken for a sample of 125 women aged 50 or over. Complete an analysis of the data and describe the results. This data was entered into StatCrunch. First, we generated the basic descriptive statistics by the commands: Stat > Basic Statistics > Display Descriptive Statistics With the cursor in the variables box, double click the variable “Bone Density Loss”. Then click OK.
Descriptive Statistics: Bone Density Loss
Variable    N     Mean     Median   TrMean   StDev   SE Mean
Bone Den    125   35.008   36.000   35.071   7.684   0.687
Variable    Minimum   Maximum   Q1       Q3
Bone Den    15.000    53.000    30.000   41.000
A Complete Analysis for a Data Set (Cont)
Descriptive Statistics: Bone Density Loss (Modified)
Variable    N     Mean     Median   StDev   Minimum   Maximum
Bone Den    125   35.008   36.000   7.684   15.000    53.000
The sample mean is 35.008 and the median is 36, so we expect a roughly symmetric or slightly skewed left distribution. The typical bone density loss is around 35 to 36 units. The histogram is given below.
A Complete Analysis for a Data Set (Cont) With a sample mean of 35.008 and a standard deviation of 7.684: 35.008 - 2(7.684) = 19.640 and 35.008 + 2(7.684) = 50.376. Most women aged 50 and over have between 19.640 and 50.376 units of bone density loss. Out of the 125 measurements in the sample, 118 were between these two numbers. This represents 94.4% of the data points, which tends to agree with the empirical rule. The range is 53 - 15 = 38, and 38/4 = 9.5. Since 7.684 and 9.5 are not drastically different values, s was probably calculated properly.
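The checks in this analysis can be reproduced from the summary statistics alone. A sketch, assuming only the values printed in the descriptive-statistics output and the count of 118 stated above; the variable names are illustrative.

```python
# Summary statistics reported for the bone density loss sample.
mean, sd = 35.008, 7.684
minimum, maximum = 15.0, 53.0
n, n_within = 125, 118  # 118 of the 125 measurements fell inside mean +/- 2 sd

# Interval of mean +/- 2 standard deviations.
lower = mean - 2 * sd
upper = mean + 2 * sd
print(round(lower, 3), round(upper, 3))  # 19.64 50.376

# Share of the sample inside that interval (empirical rule suggests about 95%).
share_within = n_within / n
print(round(share_within * 100, 1))  # 94.4

# Rough plausibility check on s: range / 4 should be of similar size to s.
rough_sd = (maximum - minimum) / 4
print(rough_sd)  # 9.5
```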