Chapter 3 Numerical Methods for Describing Data Distributions Created by Kathy Fritz
Suppose that you have just received your score on an exam in one of your classes. What would you want to know about the distribution of scores for this exam? Measures of center Measures of spread
The stress of the final years of medical training can contribute to depression and burnout. The authors of the paper “Rates of Medication Errors Among Depressed and Burnt Out Residents” (British Medical Journal : 488) studied 24 residents in pediatrics. Medical records of patients treated by these residents during a fixed time period were examined for errors in ordering or administering medications. The accompanying dotplot displays the total number of medication errors for each of the 24 residents.
Choosing Appropriate Measures for Describing Center and Spread If the shape of the data distribution is … Describe Center and Spread Using …
Describing Center and Spread For Data Distributions That Are Approximately Symmetric Mean Standard Deviation
Mean Some notation:
x = the variable of interest
n = the sample size
x1, x2, …, xn are the individual observations in the data set
Definition: The sample mean, x̄, is the arithmetic average of the observations in a sample: x̄ = (x1 + x2 + … + xn) / n. In mathematics, the capital Greek letter Σ is short for “add them all up.” Therefore, the formula for the mean can be written in more compact notation: x̄ = Σx / n. The population mean, μ (the Greek letter mu), is the arithmetic average of all the x values in an entire population.
Measuring Center Use the data below to calculate the mean of the commuting times (in minutes) of 20 randomly selected New York workers.
0 | 5
1 | 005555
2 | 0005
3 | 00
4 | 005
5 |
6 | 005
7 |
8 | 5
Key: 4|5 represents a New York worker who reported a 45-minute travel time to work.
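The mean of the commuting times can be checked with a few lines of Python. This is a minimal sketch: the list is read off the stemplot above, and the names `commute_times` and `sample_mean` are illustrative choices, not part of the slides.

```python
# Commuting times (minutes) read off the stemplot; key: 4|5 = 45 minutes.
commute_times = [5, 10, 10, 15, 15, 15, 15, 20, 20, 20,
                 25, 30, 30, 40, 40, 45, 60, 60, 65, 85]


def sample_mean(values):
    """Return the arithmetic average: the sum of the values divided by n."""
    return sum(values) / len(values)


mean_time = sample_mean(commute_times)
print(f"n = {len(commute_times)}, mean = {mean_time} minutes")
```

The long commute of 85 minutes pulls the mean above most of the values in the stemplot, which is worth keeping in mind when the median is introduced later.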
Measuring Variability Consider the three sets of six exam scores displayed below: Each data set has a mean exam score of 75. Does that completely describe these data sets?
Deviations The most widely used measures of variability
Variance and Standard Deviation Suppose that we are interested in finding the “typical” or average deviation from the mean. The deviations from the mean were -25, -15, -5, 5, 15, and 25. These deviations sum to zero, so to calculate the “typical” or average deviation from the mean, we must first square each deviation. Then all the squared deviations are positive. The squares of these deviations from the mean are 625, 225, 25, 25, 225, and 625. Now we can average these.
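The arithmetic for these six deviations can be sketched as follows. The variable names are illustrative. Both divisors are shown because the slides distinguish between treating the data as a whole population (divide by n) and as a sample (divide by n – 1).

```python
import math

# Deviations from the mean given on the slide.
deviations = [-25, -15, -5, 5, 15, 25]
n = len(deviations)

# The raw deviations cancel out, so their plain average is useless.
sum_of_deviations = sum(deviations)

# Squaring removes the signs.
squared = [d ** 2 for d in deviations]
sum_of_squares = sum(squared)

variance_div_n = sum_of_squares / n                # data treated as a population
variance_div_n_minus_1 = sum_of_squares / (n - 1)  # data treated as a sample
sd_sample = math.sqrt(variance_div_n_minus_1)

print("sum of deviations:", sum_of_deviations)
print("squared deviations:", squared)
print("variance (divide by n):", variance_div_n)
print("variance (divide by n - 1):", variance_div_n_minus_1)
print("sample standard deviation:", sd_sample)
```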
Variance and Standard Deviation Consider the following data on the number of pets owned by a group of 9 children. Example deviation: 8 - 5 = 3. Can we just calculate the arithmetic average of the deviations from the mean? Why or why not? Since the sum of the deviations from the mean is always zero, you cannot just add the deviations and then divide by the number of deviations. What do you do? Wait a minute . . . If the data values represented the entire population, then we would divide the sum of the squared deviations by the number of data values (n). However, more often than not, the data values represent a sample from the population, and we divide by (n – 1). Why? If the values in the population ranged from 50 to 100, a sample would rarely cover that whole range, so samples tend to show less variability than the population. By dividing by the smaller number n – 1, we get a better estimate of the true “typical” deviation from the mean.
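One way to see why n – 1 is the better divisor is to list every possible sample from a small population and average the resulting variances. The sketch below makes several assumptions that are not in the slides: a hypothetical six-value population between 50 and 100, samples of size 3, and sampling with replacement.

```python
from itertools import product


def variance(values, divisor):
    """Sum of squared deviations from the mean, divided by the given divisor."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / divisor


# Hypothetical population (an assumption for illustration only).
population = [50, 60, 70, 80, 90, 100]
population_variance = variance(population, len(population))

# Every ordered sample of size 3 drawn with replacement: 6**3 = 216 samples.
n = 3
samples = list(product(population, repeat=n))

avg_div_n = sum(variance(s, n) for s in samples) / len(samples)
avg_div_n_minus_1 = sum(variance(s, n - 1) for s in samples) / len(samples)

print("population variance:       ", round(population_variance, 2))
print("average when dividing by n:", round(avg_div_n, 2))
print("average when dividing by n - 1:", round(avg_div_n_minus_1, 2))
```

Under these assumptions, dividing by n gives variances that are too small on average, while dividing by n – 1 gives variances that average out to the population variance.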
Measuring Spread: The Standard Deviation “Average” squared deviation = 52/(9 – 1) = 6.5. This is the variance. Standard deviation = square root of variance = √6.5 ≈ 2.55
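The two numbers on this slide follow directly from the sum of squared deviations (52) and the sample size (9). A short check, with illustrative variable names:

```python
import math

sum_of_squared_deviations = 52  # value used on the slide
n = 9                           # number of children

variance = sum_of_squared_deviations / (n - 1)
standard_deviation = math.sqrt(variance)

print("variance:", variance)
print("standard deviation:", round(standard_deviation, 2))
```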
Describing Center and Spread For Data Distributions That Are Skewed or Have Outliers Median Interquartile Range
Median The median M The sample median is obtained by first ordering the n observations from smallest to largest (with any repeated values included, so that every sample observation appears in the ordered list). Then . . .
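A minimal sketch of the usual rule for the sample median: after ordering, take the single middle value when n is odd, or the average of the two middle values when n is even. The function name `sample_median` is illustrative. The data are the 20 New York commuting times from the earlier stemplot.

```python
def sample_median(values):
    """Median of a list: middle value (n odd) or mean of the two middle values (n even)."""
    ordered = sorted(values)  # repeated values stay in the list
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


commute_times = [5, 10, 10, 15, 15, 15, 15, 20, 20, 20,
                 25, 30, 30, 40, 40, 45, 60, 60, 65, 85]

median_time = sample_median(commute_times)
print("median commuting time:", median_time, "minutes")
```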
Forty students were enrolled in a statistical reasoning course at a California college. The instructor made course materials, grades, and lecture notes available to students on a class web site. Course management software kept track of how often each student accessed any of these web pages. The data set below (in order from smallest to largest) is the number of times each of the 40 students had accessed the class web page during the first month.
Comparing the Mean and the Median • The mean and median measure center in different ways, and both are useful. • Don’t confuse the “average” value of a variable (the mean) with its “typical” value, which we might describe by the median.
Measuring Spread - Interquartile Range The sample standard deviation, s, can also be greatly affected by the presence of even one outlier. The interquartile range is a measure of variability that is resistant to the effects of outliers. The interquartile range (iqr) is based on quantities called quartiles, which divide the data set into four equal parts (quarters). Lower quartile (Q1) = median of the lower half of the data set. Upper quartile (Q3) = median of the upper half of the data set. If n is odd, the median of the entire data set is excluded from both halves when computing quartiles.
Measuring Spread: The Interquartile Range • A measure of center alone can be misleading. • A useful numerical description of a distribution requires both a measure of center and a measure of spread. How to Calculate the Quartiles and the Interquartile Range To calculate the quartiles:
Recall the website data set: The lower quartile (Q1) is the median of the lower 20 data values. The upper quartile (Q3) is the median of the upper 20 data values. The interquartile range (iqr) is the difference between the upper and lower quartiles.
Putting it Together The Chronicle of Higher Education (Almanac Issue, 2009-2010) published the accompanying data on the percentage of the population with a bachelor’s degree or graduate degree in 2007 for each of the 50 U.S. states and the District of Columbia. The data distribution is shown in the histogram below. Step 1: Select
Putting it Together Step 3: Interpret Step 2: Calculations
Find and Interpret the IQR Travel times to work for 20 randomly selected New Yorkers: Q1 = 15, M = 22.5, Q3 = 42.5. IQR = Q3 – Q1 = 42.5 – 15 = 27.5 minutes. Interpretation: The range of the middle half of travel times for the New Yorkers in the sample is 27.5 minutes.
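The quartile rule from the slides (median of the lower half, median of the upper half, with the overall median left out of both halves when n is odd) can be sketched as below. The function names are illustrative. Statistical software often uses other quartile conventions and may give slightly different values.

```python
def median(values):
    """Middle value (n odd) or mean of the two middle values (n even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves."""
    ordered = sorted(values)
    half = len(ordered) // 2          # when n is odd, the middle value is skipped
    lower_half = ordered[:half]
    upper_half = ordered[len(ordered) - half:]
    return median(lower_half), median(upper_half)


commute_times = [5, 10, 10, 15, 15, 15, 15, 20, 20, 20,
                 25, 30, 30, 40, 40, 45, 60, 60, 65, 85]

q1, q3 = quartiles(commute_times)
iqr = q3 - q1
print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr} minutes")
```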
Boxplots General Boxplots Modified Boxplots
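Five-Number Summary The five-number summary consists of the following: the smallest observation, the lower quartile, the median, the upper quartile, and the largest observation.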
Boxplots When to Use Univariate numerical data How to construct What to look for center, spread, and shape of the data distribution and if there are any unusual features
Comparative Boxplots A comparative boxplot is Recall the video game study. There were two groups: 1) told to improve total score or 2) told to improve a different aspect, such as speed.
Identifying Outliers In addition to serving as a measure of spread, the interquartile range (IQR) is used as part of a rule of thumb for identifying outliers. Definition: The 1.5 x IQR Rule for Outliers Call an observation an outlier if it falls more than 1.5 x IQR above the upper quartile or below the lower quartile.
0 | 5
1 | 005555
2 | 0005
3 | 00
4 | 005
5 |
6 | 005
7 |
8 | 5
In the New York travel time data, we found Q1 = 15 minutes, Q3 = 42.5 minutes, and IQR = 27.5 minutes.
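Applying the 1.5 x IQR rule to the New York travel times takes only a few steps. This sketch uses the quartiles quoted on the slide. The names `lower_cutoff` and `upper_cutoff` are illustrative.

```python
commute_times = [5, 10, 10, 15, 15, 15, 15, 20, 20, 20,
                 25, 30, 30, 40, 40, 45, 60, 60, 65, 85]

q1, q3 = 15, 42.5          # quartiles quoted on the slide
iqr = q3 - q1              # 27.5 minutes

# Any observation beyond these cutoffs is flagged as an outlier.
lower_cutoff = q1 - 1.5 * iqr
upper_cutoff = q3 + 1.5 * iqr

outliers = [t for t in commute_times if t < lower_cutoff or t > upper_cutoff]

print("lower cutoff:", lower_cutoff)
print("upper cutoff:", upper_cutoff)
print("outliers:", outliers)
```

No travel time can fall below a negative cutoff, so only the upper cutoff matters here. The 85-minute commute is the one value that exceeds it.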
Modified boxplots How to construct
1. Compute the values in the five-number summary.
2. Draw a horizontal line and add an appropriate scale.
3. Draw a box above the line that extends from the lower quartile (Q1) to the upper quartile (Q3).
4. Draw a line segment inside the box at the location of the median.
Construct a Boxplot Consider our NY travel times data. Construct a boxplot.
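Before drawing, it helps to collect the numbers the boxplot needs. The sketch below computes the five-number summary for the travel times. It also assumes the common convention for a modified boxplot, where each whisker stops at the most extreme observation that is not an outlier. The slides do not spell out that convention, so treat it as an assumption.

```python
from statistics import median

commute_times = [5, 10, 10, 15, 15, 15, 15, 20, 20, 20,
                 25, 30, 30, 40, 40, 45, 60, 60, 65, 85]

q1, q3 = 15, 42.5  # quartiles found earlier in the slides
iqr = q3 - q1

five_number_summary = (min(commute_times), q1, median(commute_times),
                       q3, max(commute_times))

# Assumed convention: whiskers end at the most extreme non-outliers.
low_cutoff, high_cutoff = q1 - 1.5 * iqr, q3 + 1.5 * iqr
not_outliers = [t for t in commute_times if low_cutoff <= t <= high_cutoff]
whisker_ends = (min(not_outliers), max(not_outliers))
plotted_separately = [t for t in commute_times if t not in not_outliers]

print("five-number summary:", five_number_summary)
print("whisker ends:", whisker_ends)
print("points plotted individually:", plotted_separately)
```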
Big Mac prices in U.S. dollars for 44 different countries were given in the article “Big Mac Index 2010”. The following 44 Big Mac prices are arranged in order from the lowest price (Ukraine) to the highest price (Norway).
Big Mac Prices Continued . . . Smallest observation = Lower quartile = Median = Upper quartile = Largest observation =
The 2009-2010 salaries of NBA players published on the web site hoopshype.com were used to construct the comparative boxplot of salary data for five teams.
Measures of Relative Standing z-scores Percentiles
Percentiles For a number r between 0 and 100, the rth percentile is a value such that r percent of the observations fall AT or BELOW that value. This diagram illustrates the 90th percentile.
Measuring Position: Percentiles One way to describe the location of a value in a distribution is to tell what percent of observations are less than it. Definition: Jenny earned a score of 86 on her test. How did she perform relative to the rest of the class?
6 | 7
7 | 2334
7 | 5777899
8 | 00123334
8 | 569
9 | 03
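Jenny's position can be worked out by counting the scores below hers. This sketch reads the 25 scores off the stemplot and uses the "percent of observations less than it" description from this slide. The function name `percent_below` is illustrative.

```python
# Test scores read off the stemplot (stems are tens, leaves are ones).
scores = [67,
          72, 73, 73, 74,
          75, 77, 77, 77, 78, 79, 79,
          80, 80, 81, 82, 83, 83, 83, 84,
          85, 86, 89,
          90, 93]


def percent_below(values, x):
    """Percent of observations strictly less than x."""
    count_below = sum(1 for v in values if v < x)
    return 100 * count_below / len(values)


jenny_percentile = percent_below(scores, 86)
print(f"{jenny_percentile:.0f}% of the class scored below Jenny")
```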
In addition to weight and length, head circumference is another measure of health in newborn babies. The National Center for Health Statistics reports the following summary values for head circumference (in cm) at birth for boys. What value of head circumference is at the 75th percentile? What is the median value of head circumference?
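z-scores Definition: z = (data value – mean) / standard deviation. The z-score tells you how many standard deviations the data value is from the mean, and in which direction.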
Measuring Position: z-Scores Jenny earned a score of 86 on her test. The class mean is 80 and the standard deviation is 6.07. What is her standardized score?
Using z-scores for Comparison We can use z-scores to compare the position of individuals in different distributions. Jenny earned a score of 86 on her statistics test. The class mean was 80 and the standard deviation was 6.07. She earned a score of 82 on her chemistry test. The chemistry scores had a fairly symmetric distribution with a mean 76 and standard deviation of 4. On which test did Jenny perform better relative to the rest of her class?
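Both of Jenny's standardized scores come from the same formula, z = (value – mean) / standard deviation. A minimal sketch using the numbers on these slides, with an illustrative helper named `z_score`:

```python
def z_score(value, mean, standard_deviation):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / standard_deviation


statistics_z = z_score(86, mean=80, standard_deviation=6.07)
chemistry_z = z_score(82, mean=76, standard_deviation=4)

print(f"statistics z-score: {statistics_z:.2f}")
print(f"chemistry z-score:  {chemistry_z:.2f}")

better = "chemistry" if chemistry_z > statistics_z else "statistics"
print("Better relative performance:", better)
```

Jenny's raw score is higher in statistics, but her chemistry score sits further above its class mean when measured in standard deviations.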
What do these z-scores mean? -2.3 1.8
Suppose that two graduating seniors, one a marketing major and one an accounting major, are comparing job offers. The accounting major has an offer for $45,000 per year, and the marketing major has an offer for $43,000 per year. Accounting: mean = 46,000 standard deviation = 1500 Marketing: mean = 42,500 standard deviation = 1000
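The two job offers can be compared the same way, by standardizing each offer against the salaries in its own field. The sketch below uses only the figures on the slide. The variable names are illustrative.

```python
def z_score(value, mean, standard_deviation):
    """Standardized value: (value - mean) / standard deviation."""
    return (value - mean) / standard_deviation


accounting_z = z_score(45_000, mean=46_000, standard_deviation=1_500)
marketing_z = z_score(43_000, mean=42_500, standard_deviation=1_000)

print(f"accounting offer z-score: {accounting_z:.2f}")
print(f"marketing offer z-score:  {marketing_z:.2f}")
```

A negative z-score means the offer is below the mean for that major, and a positive z-score means it is above. So the smaller dollar amount is the stronger offer relative to its field.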
Density Curve • Definition: A density curve is a curve that is always on or above the horizontal axis and has area exactly 1 underneath it. • A density curve describes the overall pattern of a distribution. The area under the curve and above any interval of values on the horizontal axis is the proportion of all observations that fall in that interval. The overall pattern of this histogram of the scores of all 947 seventh-grade students in Gary, Indiana, on the vocabulary part of the Iowa Test of Basic Skills (ITBS) can be described by a smooth curve drawn through the tops of the bars.
Normal Distributions One particularly important class of density curves is the Normal curves, which describe Normal distributions. All Normal curves are symmetric, single-peaked, and bell-shaped. A specific Normal curve is described by giving its mean µ and standard deviation σ. Figure: Two Normal curves, showing the mean µ and standard deviation σ.
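Because a Normal curve is a density curve, the area above an interval gives the proportion of observations in that interval. The sketch below uses `NormalDist` from Python's standard `statistics` module. The two curves (µ = 80, σ = 6 and µ = 0, σ = 1) are hypothetical values chosen for illustration, not taken from the slides.

```python
from statistics import NormalDist


def proportion_between(curve, low, high):
    """Area under the Normal curve between low and high."""
    return curve.cdf(high) - curve.cdf(low)


# Two hypothetical Normal curves, each described by its mean and standard deviation.
curve_a = NormalDist(mu=80, sigma=6)
curve_b = NormalDist(mu=0, sigma=1)

# Proportion of observations within one standard deviation of the mean.
within_one_sd_a = proportion_between(curve_a, 80 - 6, 80 + 6)
within_one_sd_b = proportion_between(curve_b, -1, 1)

# Area over a very wide interval, which should be essentially all of it.
nearly_total_area = proportion_between(curve_a, 80 - 60, 80 + 60)

print(f"curve A, within 1 SD: {within_one_sd_a:.4f}")
print(f"curve B, within 1 SD: {within_one_sd_b:.4f}")
print(f"curve A, area over a very wide interval: {nearly_total_area:.4f}")
```

The two curves have different centers and spreads, yet the proportion within one standard deviation of the mean is the same for both.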