
Chapter 4


Presentation Transcript


  1. Chapter 4 Numerical Methods for Describing Data

  2. Describing the Center of a Data Set with the arithmetic mean The sample mean of a numerical sample, x1, x2, x3, …, xn, denoted x̄, is x̄ = (x1 + x2 + … + xn)/n = Σxi/n. The population mean is denoted by µ.

  3. Example calculations • During a two-week period 10 houses were sold in Fancytown. The “average” or mean price for this sample of 10 houses in Fancytown is $291,000.
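As a quick illustration, here is a minimal Python check of this mean, using the ten Fancytown selling prices that are listed later on slide 7:

    prices = [225_000, 285_000, 287_000, 287_000, 291_000,
              299_000, 300_000, 310_000, 311_000, 315_000]
    mean_price = sum(prices) / len(prices)   # arithmetic mean
    print(mean_price)                        # 291000.0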

  4. Example calculations • During a two-week period 10 houses were sold in Lowtown. One of these sold for an extremely high price (an outlier). The “average” or mean price for this sample of 10 houses in Lowtown is $295,000.

  5. Comments • In the previous example of the house prices in the sample of 10 houses from Lowtown, the mean was affected very strongly by the one house with the extremely high price. • The other 9 houses had selling prices around $100,000. • This illustrates that the mean can be very sensitive to a few extreme values.

  6. Describing the Center of a Data Set with the median The sample median is obtained by first ordering the n observations from smallest to largest (with any repeated values included, so that every sample observation appears in the ordered list). Then the sample median is the single middle value if n is odd, and the average of the two middle values if n is even.

  7. Examples of median calculation Consider the Fancytown data. First, we put the data in increasing order to get 225,000 285,000 287,000 287,000 291,000 299,000 300,000 310,000 311,000 315,000 Since there are 10 (even) data values, the median is the mean of the two middle values: (291,000 + 299,000)/2 = $295,000.
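A short Python sketch of the same median calculation (the statistics module averages the two middle values when n is even):

    import statistics

    prices = [225_000, 285_000, 287_000, 287_000, 291_000,
              299_000, 300_000, 310_000, 311_000, 315_000]
    print(statistics.median(prices))     # 295000.0
    print((prices[4] + prices[5]) / 2)   # same: (291,000 + 299,000) / 2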

  8. Comparing the sample mean & sample median

  9. Comparing the sample mean & sample median

  10. Comparing the sample mean & sample median Notice from the preceding pictures that the median splits the area in the distribution in half and the mean is the point of balance. • Typically, • when a distribution is skewed positively, the mean is larger than the median, • when a distribution is skewed negatively, the mean is smaller than the median, and • when a distribution is symmetric, the mean and the median are equal.

  11. The Trimmed Mean • A trimmed mean is computed by first ordering the data values from smallest to largest, deleting a selected number of values from each end of the ordered list, and finally computing the mean of the remaining values. • The trimming percentage is the percentage of values deleted from each end of the ordered list.
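A minimal Python sketch of this definition (not from the slides); rounding the trim count is one possible convention, and SciPy's scipy.stats.trim_mean offers a ready-made version whose rounding may differ:

    def trimmed_mean(data, trim_pct):
        """Mean of the data after deleting trim_pct (e.g. 0.10 for 10%)
        of the ordered values from each end of the list."""
        ordered = sorted(data)
        k = int(round(len(ordered) * trim_pct))   # number cut from each end
        kept = ordered[k:len(ordered) - k]
        return sum(kept) / len(kept)

    # Example with the work-hours data from slide 24, trimming 10%:
    hours = [19, 12, 14, 10, 12, 10, 25, 9, 8, 4, 2, 10, 7, 11, 15]
    print(trimmed_mean(hours, 0.10))   # about 10.73 after cutting two values from each end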

  12. Example of Trimmed Mean

  13. Another Example • Here’s an example of what happens if you compute the mean, median, and 5% & 10% trimmed means for the ages of the 79 students taking Data Analysis.

  14. Categorical Data – Sample Proportion The sample proportion of successes, denoted by p, is p = (number of S’s in the sample)/n, where S is the label used for the response designated as success. The population proportion of successes is denoted by π.

  15. Categorical Data – Sample Proportion If we look at the student data sample, consider the variable gender, and treat being female as a success, the sample proportion (of females) is the number of female students in the sample divided by 79.
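A tiny Python sketch of the sample proportion; the gender list below is hypothetical, since the full student data set is not reproduced in this transcript:

    # Hypothetical gender labels; "F" (female) is the success category S.
    gender = ["F", "M", "F", "F", "M", "F", "M", "M", "F", "F"]
    p = gender.count("F") / len(gender)
    print(p)   # 0.6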

  16. Describing Variability • The simplest numerical measure of the variability of a numerical data set is the range, which is defined to be the difference between the largest and smallest data values. range = maximum - minimum

  17. Describing Variability • The n deviations from the sample mean are the differences (x1 - x̄), (x2 - x̄), … , (xn - x̄). Note: The sum of all of the deviations from the sample mean will be equal to 0, except possibly for the effects of rounding the numbers.
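A quick numerical check of this note in Python, again using the Fancytown prices from slide 7:

    prices = [225_000, 285_000, 287_000, 287_000, 291_000,
              299_000, 300_000, 310_000, 311_000, 315_000]
    x_bar = sum(prices) / len(prices)
    deviations = [x - x_bar for x in prices]
    print(sum(deviations))   # 0.0 (up to floating-point rounding)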

  18. Sample Variance • The sample variance, denoted s², is the sum of the squared deviations from the mean divided by n - 1: s² = Σ(xi - x̄)² / (n - 1).

  19. Sample Standard Deviation • The sample standard deviation, denoted s, is the positive square root of the sample variance: s = √s². The population standard deviation is denoted by σ.
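Both definitions can be checked with a short Python sketch (the statistics module also uses the n - 1 divisor), shown here with the Fancytown prices:

    import statistics

    prices = [225_000, 285_000, 287_000, 287_000, 291_000,
              299_000, 300_000, 310_000, 311_000, 315_000]
    n = len(prices)
    x_bar = sum(prices) / n
    s2 = sum((x - x_bar) ** 2 for x in prices) / (n - 1)   # sample variance
    s = s2 ** 0.5                                          # sample standard deviation
    print(s2, s)
    print(statistics.variance(prices), statistics.stdev(prices))   # same values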

  20. Example calculations • 10 Macintosh Apples were randomly selected and weighed (in ounces).

  21. Calculator Formula for s² and s A little algebra can establish a shortcut for the sum of the squared deviations: Σ(xi - x̄)² = Σxi² - (Σxi)²/n. • A computational formula for the sample variance is then s² = [Σxi² - (Σxi)²/n] / (n - 1).
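A small Python check that the computational (calculator) formula agrees with the defining formula:

    import math

    prices = [225_000, 285_000, 287_000, 287_000, 291_000,
              299_000, 300_000, 310_000, 311_000, 315_000]
    n = len(prices)
    sum_x = sum(prices)
    sum_x2 = sum(x * x for x in prices)
    s2_shortcut = (sum_x2 - sum_x ** 2 / n) / (n - 1)   # calculator formula
    x_bar = sum_x / n
    s2_direct = sum((x - x_bar) ** 2 for x in prices) / (n - 1)
    print(math.isclose(s2_shortcut, s2_direct))   # True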

  22. Calculations Revisited The values for s² and s are exactly the same as were obtained earlier.

  23. Quartiles and the Interquartile Range • Lower quartile (Q1) = median of the lower half of the data set. • Upper quartile (Q3) = median of the upper half of the data set. • The interquartile range (iqr), a resistant measure of variability, is given by iqr = upper quartile - lower quartile = Q3 - Q1. Note: If n is odd, the median is excluded from both the lower and upper halves of the data.

  24. Quartiles and IQR Example • 15 students with part time jobs were randomly selected and the number of hours worked last week was recorded. 19, 12, 14, 10, 12, 10, 25, 9, 8, 4, 2, 10, 7, 11, 15 The data is then ordered from smallest to largest to get 2, 4, 7, 8, 9, 10, 10, 10, 11, 12, 12, 14, 15, 19, 25

  25. Quartiles and IQR Example With 15 data values, the median is the 8th value in the ordered list; specifically, the median is 10. 2, 4, 7, 8, 9, 10, 10, 10, 11, 12, 12, 14, 15, 19, 25 The lower half gives the lower quartile Q1 = 8, the upper half gives the upper quartile Q3 = 14, and iqr = 14 - 8 = 6.
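The slides' quartile convention (median of each half, excluding the overall median when n is odd) can be written as a short Python sketch; note that library routines such as numpy.percentile use different conventions and can give slightly different quartiles:

    import statistics

    def quartiles(data):
        """Q1 and Q3 as the medians of the lower and upper halves,
        excluding the overall median from both halves when n is odd."""
        ordered = sorted(data)
        n = len(ordered)
        lower = ordered[:n // 2]
        upper = ordered[n // 2 + 1:] if n % 2 else ordered[n // 2:]
        return statistics.median(lower), statistics.median(upper)

    hours = [19, 12, 14, 10, 12, 10, 25, 9, 8, 4, 2, 10, 7, 11, 15]
    q1, q3 = quartiles(hours)
    print(q1, q3, q3 - q1)   # 8 14 6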

  26. Boxplots • Constructing a Skeletal Boxplot • Draw a horizontal (or vertical) scale. • Construct a rectangular box whose left (or lower) edge is at the lower quartile and whose right (or upper) edge is at the upper quartile (so box width = iqr). Draw a vertical (or horizontal) line segment inside the box at the location of the median. • Extend horizontal (or vertical) line segments from each end of the box to the smallest and largest observations in the data set. (These lines are called whiskers.)

  27. Skeletal Boxplot Example • Using the student work hours data, we have the skeletal boxplot shown (drawn on a scale from 0 to 25).

  28. Outliers • An observation is an outlier if it is more than 1.5 iqr away from the closest end of the box (less than the lower quartile minus 1.5 iqr or more than the upper quartile plus 1.5 iqr). • An outlier is extreme if it is more than 3 iqr from the closest end of the box, and it is mild otherwise.
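Applied to the work-hours data from slide 25 (Q1 = 8, Q3 = 14, iqr = 6), a minimal Python sketch of this rule flags only the value 25, as a mild outlier:

    hours = [19, 12, 14, 10, 12, 10, 25, 9, 8, 4, 2, 10, 7, 11, 15]
    q1, q3 = 8, 14            # quartiles computed on slide 25
    iqr = q3 - q1
    for x in sorted(hours):
        if x < q1 - 3 * iqr or x > q3 + 3 * iqr:
            print(x, "extreme outlier")
        elif x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr:
            print(x, "mild outlier")   # only 25 is printed (25 > 23 but < 32)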

  29. Modified Boxplots • A modified boxplot represents mild outliers by shaded circles and extreme outliers by open circles. Whiskers extend on each end to the most extreme observations that are not outliers.
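For reference, matplotlib's default boxplot follows essentially this rule: whiskers extend to the most extreme points within 1.5 iqr of the box, and the remaining points are drawn individually. It computes quartiles by interpolation and does not distinguish mild from extreme outliers, so the picture may differ slightly from the slides. A sketch using the work-hours data:

    import matplotlib.pyplot as plt

    hours = [19, 12, 14, 10, 12, 10, 25, 9, 8, 4, 2, 10, 7, 11, 15]
    plt.boxplot(hours, whis=1.5)   # whis=1.5 is the default outlier rule
    plt.ylabel("Hours worked last week")
    plt.show()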

  30. Modified Boxplot Example • Using the student work hours data we have: Upper quartile + 1.5 iqr = 14 + 1.5(6) = 23 Upper quartile + 3 iqr = 14 + 3(6) = 32 Lower quartile - 1.5 iqr = 8 - 1.5(6) = -1 The whiskers extend to the smallest and largest data values that aren’t outliers, and 25 appears as a mild outlier (it is greater than 23 but less than 32).

  31. Modified Boxplot Example • Consider the ages of the 79 students from the classroom data set in the Chapter 3 slideshow: 17 18 18 18 18 18 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 20 20 20 20 20 20 20 20 20 20 21 21 21 21 21 21 21 21 21 21 21 21 21 21 22 22 22 22 22 22 22 22 22 22 22 23 23 23 23 23 23 24 24 24 25 26 28 28 30 37 38 44 47 Here the lower quartile is 19, the median is 21, and the upper quartile is 22, so Iqr = 22 - 19 = 3. Lower quartile - 3 iqr = 10 Lower quartile - 1.5 iqr = 14.5 Upper quartile + 1.5 iqr = 26.5 Upper quartile + 3 iqr = 31 The moderate (mild) outliers are 28, 28, and 30; the extreme outliers are 37, 38, 44, and 47.

  32. Modified Boxplot Example Here is the same boxplot reproduced with a vertical orientation (scale from 15 to 50).

  33. Modified Boxplot Example The same boxplot (scale from 15 to 50), annotated with the smallest and largest data values that aren’t outliers, the mild outliers, and the extreme outliers.

  34. Comparative Boxplot Example Side-by-side boxplots of student weight (scale from 100 to 240), one for males and one for females.

  35. Comparative Boxplot Example

  36. Interpreting Variability: Chebyshev’s Rule Consider any number k, where k ≥ 1. Then the percentage of observations that are within k standard deviations of the mean is at least 100(1 - 1/k²)%.

  37. Interpreting Variability: Chebyshev’s Rule • For specific values of k, Chebyshev’s Rule reads • At least 75% of the observations are within 2 standard deviations of the mean. • At least 89% of the observations are within 3 standard deviations of the mean. • At least 90% of the observations are within 3.16 standard deviations of the mean. • At least 94% of the observations are within 4 standard deviations of the mean. • At least 96% of the observations are within 5 standard deviations of the mean. • At least 99% of the observations are within 10 standard deviations of the mean.
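These specific percentages come straight from the 100(1 - 1/k²)% bound, as a short Python loop shows:

    for k in (2, 3, 3.16, 4, 5, 10):
        print(f"k = {k}: at least {100 * (1 - 1 / k**2):.1f}% within k standard deviations")
    # 75.0, 88.9, 90.0, 93.8, 96.0, 99.0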

  38. Example - Chebyshev’s Rule • Consider the student age data: 17 18 18 18 18 18 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 20 20 20 20 20 20 20 20 20 20 21 21 21 21 21 21 21 21 21 21 21 21 21 21 22 22 22 22 22 22 22 22 22 22 22 23 23 23 23 23 23 24 24 24 25 26 28 28 30 37 38 44 47 Color code: within 1 standard deviation of the mean; within 2 standard deviations of the mean; within 3 standard deviations of the mean; within 4 standard deviations of the mean; within 5 standard deviations of the mean.

  39. Example - Chebyshev’s Rule • Summarizing the student age data Notice that Chebyshev gives very conservative lower bounds and the values aren’t very close to the actual percentages.
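This comparison can be reproduced with a short Python sketch that counts how many of the 79 ages fall within k sample standard deviations of the sample mean and sets the result beside Chebyshev's lower bound:

    import statistics

    ages = ([17] + [18] * 5 + [19] * 20 + [20] * 10 + [21] * 14 + [22] * 11
            + [23] * 6 + [24] * 3 + [25, 26, 28, 28, 30, 37, 38, 44, 47])
    x_bar, s = statistics.mean(ages), statistics.stdev(ages)
    for k in (1, 2, 3, 4, 5):
        actual = sum(abs(x - x_bar) <= k * s for x in ages) / len(ages)
        bound = max(0.0, 1 - 1 / k ** 2)
        print(f"k = {k}: actual {actual:.0%}, Chebyshev lower bound {bound:.0%}")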

  40. Empirical Rule • If the histogram of values in a data set is reasonably symmetric and unimodal (specifically, is reasonably approximated by a normal curve), then • Approximately 68% of the observations are within 1 standard deviation of the mean. • Approximately 95% of the observations are within 2 standard deviations of the mean. • Approximately 99.7% of the observations are within 3 standard deviations of the mean.
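The 68/95/99.7 figures are the exact normal-curve probabilities, which Python's math.erf can confirm, since P(|Z| ≤ k) = erf(k/√2) for a standard normal Z:

    from math import erf, sqrt

    for k in (1, 2, 3):
        print(f"within {k} sd of the mean: {erf(k / sqrt(2)):.4f}")
    # 0.6827, 0.9545, 0.9973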

  41. Z Scores The z score corresponding to a particular observation in a data set is z = (observation - mean) / (standard deviation). The z score is how many standard deviations the observation is from the mean. A positive z score indicates the observation is above the mean and a negative z score indicates the observation is below the mean.

  42. Z Scores Computing the z score is often referred to as standardization and the z score is called a standardized score. The formula used with sample data is z = (x - x̄) / s.

  43. Example A sample of GPAs of 38 statistics students appear below (sorted in increasing order) 2.00 2.25 2.36 2.37 2.50 2.50 2.60 2.67 2.70 2.70 2.75 2.78 2.80 2.80 2.82 2.90 2.90 3.00 3.02 3.07 3.15 3.20 3.20 3.20 3.23 3.29 3.30 3.30 3.42 3.46 3.48 3.50 3.50 3.58 3.75 3.80 3.83 3.97
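A Python sketch that standardizes these 38 GPAs with the sample mean and sample standard deviation; it should reproduce the z scores listed on slide 45 (about -2.21 for the 2.00 GPA and 1.96 for 3.97):

    import statistics

    gpas = [2.00, 2.25, 2.36, 2.37, 2.50, 2.50, 2.60, 2.67, 2.70, 2.70,
            2.75, 2.78, 2.80, 2.80, 2.82, 2.90, 2.90, 3.00, 3.02, 3.07,
            3.15, 3.20, 3.20, 3.20, 3.23, 3.29, 3.30, 3.30, 3.42, 3.46,
            3.48, 3.50, 3.50, 3.58, 3.75, 3.80, 3.83, 3.97]
    x_bar, s = statistics.mean(gpas), statistics.stdev(gpas)
    z_scores = [round((x - x_bar) / s, 2) for x in gpas]
    print(z_scores[0], z_scores[-1])   # roughly -2.21 and 1.96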

  44. Example The following stem-and-leaf display indicates that the data is reasonably symmetric and unimodal (each stem is split, appearing once for every pair of leaf digits):
    2 | 0
    2 | 233
    2 | 55
    2 | 667777
    2 | 88899
    3 | 0001
    3 | 2222233
    3 | 444555
    3 | 7
    3 | 889
    Stem: Units digit   Leaf: Tenths digit

  45. Example Using the formula we compute the z scores and color code the values as we did in an earlier example. -2.21 -1.68 -1.45 -1.43 -1.15 -1.15 -0.94 -0.79 -0.73 -0.73 -0.62 -0.56 -0.52 -0.52 -0.47 -0.30 -0.30 -0.09 -0.05 0.06 0.23 0.33 0.33 0.33 0.40 0.52 0.54 0.54 0.80 0.88 0.93 0.97 0.97 1.14 1.50 1.60 1.67 1.96

  46. Example Notice that the empirical rule gives reasonably good estimates for this example.

  47. Comparison of Chebyshev’s Rule and the Empirical Rule The following refers to the weights in the sample of 79 students. Notice that the stem-and-leaf diagram suggests the data distribution is unimodal but positively skewed because of the outliers on the high side. Nevertheless, the results for the Empirical Rule are good.
    10 | 3
    11 | 37
    12 | 011444555
    13 | 000000455589
    14 | 000000000555
    15 | 000000555567
    16 | 000005558
    17 | 0000005555
    18 | 0358
    19 | 5
    20 | 00
    21 | 0
    22 | 55
    23 | 79
    Stem: Hundreds & tens digits   Leaf: Units digit

  48. Comparison of Chebyshev’s Rule and the Empirical Rule Notice that even with moderate positive skewing of the data, the Empirical Rule gave a much more usable and meaningful result.
