Chapter 2: Descriptive Analysis and Presentation of Single-Variable Data

Chapter Goals
• Learn how to present and describe sets of data.
• Learn measures of central tendency, measures of dispersion (spread), measures of position, and types of distributions.
• Learn how to interpret findings so that we know what the data is telling us about the sampled population.

2.1: Graphic Presentation of Data
• Use initial exploratory data-analysis techniques to produce a pictorial representation of the data.
• The resulting displays reveal patterns of behavior of the variable being studied.
• The method used is determined by the type of data and the idea to be presented.
• There is no single correct answer when constructing a graphic display.

Circle Graphs and Bar Graphs: Graphs that are used to summarize attribute data. Circle graphs (pie diagrams) show the amount of data that belongs to each category as a proportional part of a circle. Bar graphs show the amount of data that belongs to each category as proportionally sized rectangular areas.

Example: The table below lists the number of automobiles sold last week, by day, for a local dealership.

Day          Number Sold
Monday       15
Tuesday      23
Wednesday    35
Thursday     11
Friday       12
Saturday     42

Describe the data using a circle graph and a bar graph.

[Figure: circle graph and bar graph, "Automobiles Sold Last Week"]
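The proportional parts behind a circle graph can be worked out before any drawing is done. The sketch below is an illustration, not part of the original slides: it takes the dealership counts from the example and computes each day's share of the weekly total and the central angle of its slice. The variable names are chosen here for illustration.

```python
# Weekly sales from the example table.
sales = {"Monday": 15, "Tuesday": 23, "Wednesday": 35,
         "Thursday": 11, "Friday": 12, "Saturday": 42}

total = sum(sales.values())

# Share of the week's sales, and the matching central angle of the pie slice.
shares = {day: count / total for day, count in sales.items()}
angles = {day: 360 * share for day, share in shares.items()}

for day in sales:
    print(f"{day:<10} {sales[day]:>3}  {shares[day]:6.1%}  {angles[day]:6.1f} deg")
```

The same shares, read as bar heights, give the bar graph.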

Pareto Diagram: A bar graph with the bars arranged from the most numerous category to the least numerous category. It includes a line graph displaying the cumulative percentages and counts for the bars.
Note:
• The Pareto diagram is often used in quality control applications.
• It is used to identify the number and type of defects that occur within a product or service.

Example: The final daily inspection defect report for a cabinet manufacturer is given in the table below.

Defect     Number
Dent       5
Stain      12
Blemish    43
Chip       25
Scratch    40
Others     10

Construct a Pareto diagram for this defect report. Management has given the cabinet production line the goal of reducing its defects by 50%. Which two defects should the line give special attention to in working toward this goal?

Solution: The production line should try to eliminate blemishes and scratches. Together they account for 43 + 40 = 83 of the 135 defects (about 61%), so eliminating them would cut defects by more than 50%.
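The ordering and cumulative line of a Pareto diagram can be tabulated directly. The following is a minimal sketch using the defect counts from the example. Sorting "Others" by size along with the named categories is a choice made here for simplicity; some charts pin it to the end instead.

```python
defects = {"Dent": 5, "Stain": 12, "Blemish": 43,
           "Chip": 25, "Scratch": 40, "Others": 10}

# Order categories from most to least numerous.
ordered = sorted(defects.items(), key=lambda item: item[1], reverse=True)
total = sum(defects.values())

# Each row: (category, count, cumulative count, cumulative percent).
pareto = []
running = 0
for name, count in ordered:
    running += count
    pareto.append((name, count, running, 100 * running / total))

for name, count, cum_count, cum_pct in pareto:
    print(f"{name:<8} {count:>3} {cum_count:>4} {cum_pct:6.1f}%")
```

Reading down the cumulative percent column shows where the 50% goal is first exceeded.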

Quantitative Data: One reason for constructing a graph of quantitative data is to examine the distribution: is the data compact, spread out, skewed, symmetric, etc.?
Distribution: The pattern of variability displayed by the data of a variable. The distribution displays the frequency of each value of the variable.
Dotplot Display: Displays the data of a sample by representing each piece of data with a dot positioned along a scale. This scale can be either horizontal or vertical. The frequency of the values is represented along the other scale.

Example: A random sample of the lifetime (in years) of 50 home washing machines is given below.

2.5   8.9  12.2   4.1  18.1   1.6  12.2  16.9   2.5   3.5
0.4   2.6   2.2   4.0   4.5   6.4   2.9   3.3   4.4   9.2
4.1   0.9  14.5   4.0   0.9   7.2   5.2   1.8   1.5   0.7
3.7   4.2   6.9  15.3  21.8  17.8   7.3   6.8   3.3   7.0
4.0  18.3   8.5   1.4   7.4   4.7   0.7  10.4   3.6

[Figure: dotplot of the lifetimes; horizontal scale from 0.0 to 20.0 in steps of 4.0]

Notice how the data is "bunched" near the lower extreme and more "spread out" near the higher extreme.

Background: The stem-and-leaf display has become very popular for summarizing numerical data.
• It is a combination of graphing and sorting.
• The actual data is part of the graph.
• It is well-suited for computers.
Stem-and-Leaf Display: Pictures the data of a sample using the actual digits that make up the data values. Each numerical value is divided into two parts: the leading digit(s) become the stem, and the trailing digit(s) become the leaf. The stems are located along the main axis, and a leaf for each piece of data is located so as to display the distribution of the data.

Example: A city police officer, using radar, checked the speed of cars as they were traveling down the main street in town:

41 31 33 35 36 37 39 49 33 19
26 27 24 32 40 39 16 55 38 36

Construct a stem-and-leaf plot for this data.
Solution: All the speeds are in the 10s, 20s, 30s, 40s, and 50s. Use the first digit of each speed as the stem and the second digit as the leaf. Draw a vertical line and list the stems, in order, to the left of the line. Place each leaf on its stem: place the trailing digit on the right side of the vertical line opposite its corresponding leading digit.

20 Speeds
---------------------------------------
1 | 6 9
2 | 4 6 7
3 | 1 2 3 3 5 6 6 7 8 9 9
4 | 0 1 9
5 | 5
---------------------------------------
The speeds are centered around the 30s.
Note: The display could be constructed so that only five possible values (instead of ten) could fall in each stem. What would the stems look like? Would there be a difference in appearance?
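The construction steps (split each value into a stem and a leaf, then list the leaves in order beside their stems) can be sketched in a few lines. This is an illustrative sketch that assumes two-digit whole-number data, as in the speed example. The function name `stem_and_leaf` is made up for this illustration.

```python
from collections import defaultdict

speeds = [41, 31, 33, 35, 36, 37, 39, 49, 33, 19,
          26, 27, 24, 32, 40, 39, 16, 55, 38, 36]

def stem_and_leaf(values):
    """Group values by tens digit (stem); the leaves are the units digits, in order."""
    stems = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        stems[stem].append(leaf)
    return dict(stems)

display = stem_and_leaf(speeds)

# Print every stem in the range, even one that happens to have no leaves.
for stem in range(min(display), max(display) + 1):
    leaves = " ".join(str(leaf) for leaf in display.get(stem, []))
    print(f"{stem} | {leaves}")
```

Sorting before splitting is what makes the display a combination of graphing and sorting.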

Note:
1. It is fairly typical of many variables to display a distribution that is concentrated (mounded) about a central value and then in some manner dispersed in both directions. (Why?)
2. A display that indicates two "mounds" may really be two overlapping distributions.
3. A back-to-back stem-and-leaf display makes it possible to compare two distributions graphically.
4. A side-by-side dotplot is also useful for comparing two distributions.

2.2: Frequency Distributions and Histograms
• Stem-and-leaf plots often present adequate summaries, but they can get very big, very fast.
• Other techniques are needed for summarizing data.
• Frequency distributions and histograms are used to summarize large data sets.

Frequency Distribution: A listing, often expressed in chart form, that pairs each value of a variable with its frequency.
Ungrouped Frequency Distribution: Each value of x in the distribution stands alone.
Grouped Frequency Distribution: The values are grouped into a set of classes.
1. A grouped frequency distribution is a table that summarizes data by classes, or class intervals.
2. In a typical grouped frequency distribution, there are usually 5-12 classes of equal width.
3. The table may contain columns for class number, class interval, tally (if constructing by hand), frequency, relative frequency, cumulative relative frequency, and class mark.
4. In an ungrouped frequency distribution, each class consists of a single value.

Guidelines for constructing a frequency distribution:
1. Each class should be of the same width.
2. Classes should be set up so that they do not overlap and so that each piece of data belongs to exactly one class.
3. For problems in the text, 5-12 classes are most desirable. The square root of n is a reasonable guideline for the number of classes if n is less than 150.
4. Use a system that takes advantage of a number pattern, to guarantee accuracy.
5. If possible, an even class width is often advantageous.

Procedure for constructing a frequency distribution:
1. Identify the high (H) and low (L) scores. Find the range: Range = H - L.
2. Select a number of classes and a class width so that the product is a bit larger than the range.
3. Pick a starting point a little smaller than L. Count from the starting point by the width to obtain the class boundaries. Observations that fall on class boundaries are placed into the class interval to the right.
Note:
1. The class width is the difference between the upper- and lower-class boundaries.
2. There is no best choice for class widths, number of classes, and starting points.

Example: The hemoglobin test, a blood test given to diabetics during their periodic checkups, indicates the level of control of blood sugar during the past two to three months. The data in the table below was obtained for 40 different diabetics at a university clinic that treats diabetic patients. Construct a grouped frequency distribution using the classes 3.7 - <4.7, 4.7 - <5.7, 5.7 - <6.7, etc. Which class has the highest frequency?

6.5 5.0 5.6 7.6 4.8 8.0 7.5 7.9 8.0 9.2
6.4 6.0 5.6 6.0 5.7 9.2 8.1 8.0 6.5 6.6
5.0 8.0 6.5 6.1 6.4 6.6 7.2 5.9 4.0 5.7
7.9 6.0 5.6 6.0 6.2 7.7 6.7 7.7 8.2 9.0

Solution:

Class         Frequency   Relative    Cumulative       Class
Boundaries    f           Frequency   Rel. Frequency   Mark, x
---------------------------------------------------------------
3.7 - <4.7     1          .025        .025             4.2
4.7 - <5.7     6          .150        .175             5.2
5.7 - <6.7    16          .400        .575             6.2
6.7 - <7.7     4          .100        .675             7.2
7.7 - <8.7    10          .250        .925             8.2
8.7 - <9.7     3          .075        1.000            9.2

The class 5.7 - <6.7 has the highest frequency. The frequency is 16 and the relative frequency is .40.
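The tally behind a grouped frequency distribution can be checked mechanically. The sketch below is an illustration, not part of the original slides. It uses the 40 hemoglobin values and the classes 3.7 - <4.7 through 8.7 - <9.7, and places a value that falls on a boundary into the class to its right. Converting to whole tenths is an implementation choice made here so that boundary values such as 5.7 are compared exactly.

```python
from collections import Counter

hemoglobin = [6.5, 5.0, 5.6, 7.6, 4.8, 8.0, 7.5, 7.9, 8.0, 9.2,
              6.4, 6.0, 5.6, 6.0, 5.7, 9.2, 8.1, 8.0, 6.5, 6.6,
              5.0, 8.0, 6.5, 6.1, 6.4, 6.6, 7.2, 5.9, 4.0, 5.7,
              7.9, 6.0, 5.6, 6.0, 6.2, 7.7, 6.7, 7.7, 8.2, 9.0]

# Start, width, and number of classes, expressed in tenths (3.7 -> 37, 1.0 -> 10).
START, WIDTH, NUM_CLASSES = 37, 10, 6

# Class index of each observation; a boundary value lands in the class to the right.
tenths = [round(x * 10) for x in hemoglobin]
counts = Counter((t - START) // WIDTH for t in tenths)

n = len(hemoglobin)
table = []
cumulative = 0
for i in range(NUM_CLASSES):
    lower = (START + i * WIDTH) / 10
    upper = (START + (i + 1) * WIDTH) / 10
    f = counts.get(i, 0)
    cumulative += f
    # Row: lower boundary, upper boundary, f, rel. freq., cum. rel. freq., class mark.
    table.append((lower, upper, f, f / n, cumulative / n, (lower + upper) / 2))

for lower, upper, f, rel, cum_rel, mark in table:
    print(f"{lower:.1f} - <{upper:.1f} {f:>3} {rel:.3f} {cum_rel:.3f} {mark:.1f}")
```

The cumulative relative frequency column must be non-decreasing and end at 1.000, which is a quick check on any hand-built table.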

Histogram: A bar graph representing a frequency distribution of a quantitative variable. A histogram is made up of the following components:
1. A title, which identifies the population of interest.
2. A vertical scale, which identifies the frequencies in the various classes.
3. A horizontal scale, which identifies the variable x. Values for the class boundaries or class marks may be labeled along the x-axis. Use whichever method of labeling the axis best presents the variable.
Note:
1. The relative frequency is sometimes used on the vertical scale.
2. It is possible to create a histogram based on class marks.

Example: Construct a histogram for the blood test results given in the previous example. Solution:

Example: A recent survey of Roman Catholic nuns summarized their ages in the table below.

Age            Frequency   Class Mark
-------------------------------------
20 up to 30     34         25
30 up to 40     58         35
40 up to 50     76         45
50 up to 60    187         55
60 up to 70    254         65
70 up to 80    241         75
80 up to 90    147         85

Construct a histogram for this age data.

Solution:

Terms used to describe histograms:
Symmetrical: Both sides of the distribution are identical. There is a line of symmetry.
Uniform (rectangular): Every value appears with equal frequency.
Skewed: One tail is stretched out longer than the other. The direction of skewness is on the side of the longer tail (positively skewed vs. negatively skewed).
J-shaped: There is no tail on the side of the class with the highest frequency.
Bimodal: The two largest classes are separated by one or more classes. This often implies that two populations were sampled.
Normal: A symmetrical distribution that is mounded about the mean and becomes sparse at the extremes.

Note:
1. The mode is the value that occurs with greatest frequency (discussed in Section 2.3).
2. The modal class is the class with the greatest frequency.
3. A bimodal distribution has two high-frequency classes separated by classes with lower frequencies.
4. Graphical representations of data should include a descriptive, meaningful title and proper identification of the vertical and horizontal scales.

2.3: Measures of Central Tendency
• Numerical values used to locate the middle of a set of data, or where the data is clustered.
• The term average is often associated with all measures of central tendency.

Mean: The type of average with which you are probably most familiar. The mean is the sum of all the values divided by the total number of values, n. For a sample, the mean is written $\bar{x}$ ("x-bar"): $\bar{x} = \frac{\sum x}{n}$.
Note:
1. The population mean, $\mu$ (lowercase mu, Greek alphabet), is the mean of all x values for the entire population.
2. We usually cannot measure $\mu$ but would like to estimate its value.
3. A physical representation: the mean is the value that balances the weights on the number line.

Example: The data below represents the number of accidents in each of the last 8 years at a dangerous intersection.
8, 9, 3, 5, 2, 6, 4, 5
Find the mean number of accidents.
Solution: $\bar{x} = \frac{8 + 9 + 3 + 5 + 2 + 6 + 4 + 5}{8} = \frac{42}{8} = 5.25$
Note: In the data above, change 6 to 26. The mean becomes $\frac{62}{8} = 7.75$. The mean can be greatly influenced by outliers.
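The effect of a single extreme value on the mean is easy to see numerically. The sketch below uses the eight accident counts listed in the example and then swaps the 6 for 26, as the note suggests. The helper name `mean` is defined here for illustration.

```python
accidents = [8, 9, 3, 5, 2, 6, 4, 5]

def mean(values):
    """Sum of the values divided by the number of values."""
    return sum(values) / len(values)

original_mean = mean(accidents)

# Replace the single value 6 with 26 to mimic an outlier.
with_outlier = [26 if x == 6 else x for x in accidents]
outlier_mean = mean(with_outlier)

print(original_mean, outlier_mean)
```

Only one of the eight values changed, yet the mean moved by 2.5 accidents per year.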

Median: The value of the data that occupies the middle position when the data are ranked in order according to size.
Note:
1. Denoted by "x tilde": $\tilde{x}$
2. The population median, M (uppercase mu, Greek alphabet), is the data value in the middle position of the entire population.
To find the median:
1. Rank the data.
2. Determine the depth of the median: $d(\tilde{x}) = \frac{n + 1}{2}$
3. Determine the value of the median.

Example: Find the median for the set of data {4, 8, 3, 8, 2, 9, 2, 11, 3}.
Solution:
1. Rank the data: 2, 2, 3, 3, 4, 8, 8, 9, 11
2. Find the depth: $d(\tilde{x}) = \frac{n + 1}{2} = \frac{9 + 1}{2} = 5$
3. The median is the fifth number from either end in the ranked data: $\tilde{x} = 4$

Suppose the data set is {4, 8, 3, 8, 2, 9, 2, 11, 3, 15}.
1. Rank the data: 2, 2, 3, 3, 4, 8, 8, 9, 11, 15
2. Find the depth: $d(\tilde{x}) = \frac{n + 1}{2} = \frac{10 + 1}{2} = 5.5$
3. The median is halfway between the fifth and sixth observations: $\tilde{x} = \frac{4 + 8}{2} = 6$
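The three steps (rank, find the depth, read off the value) translate directly into code. This is a minimal sketch assuming the depth is taken as (n + 1)/2, with a depth ending in .5 meaning "halfway between the two neighboring ranked values". The function name `median_by_depth` is made up for this illustration.

```python
def median_by_depth(data):
    """Median found by ranking the data and locating the depth (n + 1) / 2."""
    ranked = sorted(data)
    n = len(ranked)
    depth = (n + 1) / 2
    position = int(depth)          # whole-number part of the depth (1-based)
    if depth == position:
        # Odd n: the median is the value at that position.
        return ranked[position - 1]
    # Even n: average the values at positions `position` and `position + 1`.
    return (ranked[position - 1] + ranked[position]) / 2

odd_sample = [4, 8, 3, 8, 2, 9, 2, 11, 3]
even_sample = [4, 8, 3, 8, 2, 9, 2, 11, 3, 15]

print(median_by_depth(odd_sample), median_by_depth(even_sample))
```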

Mode: The mode is the value of x that occurs most frequently.
Note: If two or more values in a sample are tied for the highest frequency (number of occurrences), there is no mode.
Midrange: The number exactly midway between the lowest data value L and the highest data value H. It is found by averaging the low and the high values: $\text{midrange} = \frac{L + H}{2}$

Example: Consider the data set {12.7, 27.1, 35.6, 44.2, 18.0}. The midrange is $\frac{L + H}{2} = \frac{12.7 + 44.2}{2} = 28.45$
Note:
1. When rounding off an answer, a common rule of thumb is to keep one more decimal place in the answer than was present in the original data.
2. To avoid round-off buildup, round off only the final answer, not intermediate steps.
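The mode and the midrange can be sketched in the same spirit. The code below follows the convention stated in these notes, that a tie for the highest frequency means there is no mode; other texts report every tied value instead. The function names `mode_or_none` and `midrange` are made up for this illustration, and both assume a non-empty data set.

```python
from collections import Counter

def mode_or_none(values):
    """Return the single most frequent value, or None when the top frequency is tied."""
    counts = Counter(values).most_common()   # (value, frequency), highest frequency first
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]

def midrange(values):
    """Average of the lowest (L) and highest (H) data values."""
    return (min(values) + max(values)) / 2

print(mode_or_none([8, 9, 3, 5, 2, 6, 4, 5]))      # 5 occurs twice, everything else once
print(mode_or_none([4, 8, 3, 8, 2, 9, 2, 11, 3]))  # 2, 3, and 8 are tied
print(midrange([12.7, 27.1, 35.6, 44.2, 18.0]))
```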

2.4: Measures of Dispersion
• Measures of central tendency alone cannot completely characterize a set of data. Two very different data sets may have similar measures of central tendency.
• Measures of dispersion are used to describe the spread, or variability, of a distribution.
• Common measures of dispersion: range, variance, and standard deviation.

Range: The difference in value between the highest-valued (H) and the lowest-valued (L) pieces of data: $\text{range} = H - L$
Other measures of dispersion are based on the following quantity.
Deviation from the Mean: A deviation from the mean, $x - \bar{x}$, is the difference between the value of x and the mean $\bar{x}$.

Example: Consider the sample {12, 23, 17, 15, 18}. Find the range and each deviation from the mean.
Solution: range = H - L = 23 - 12 = 11, and $\bar{x} = \frac{85}{5} = 17$.

Data   Deviation
_______________
12     -5
23      6
17      0
15     -2
18      1

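Note: $\sum (x - \bar{x}) = 0$ (Always!)
Mean Absolute Deviation: The mean of the absolute values of the deviations from the mean: $\text{mean absolute deviation} = \frac{\sum |x - \bar{x}|}{n}$
For the previous example: $\frac{5 + 6 + 0 + 2 + 1}{5} = \frac{14}{5} = 2.8$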
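The deviations, their zero sum, and the mean absolute deviation for the sample {12, 23, 17, 15, 18} can be verified with a short sketch. Variable names are chosen here for illustration.

```python
sample = [12, 23, 17, 15, 18]

n = len(sample)
x_bar = sum(sample) / n

# Deviation of each value from the mean.
deviations = [x - x_bar for x in sample]

# The signed deviations cancel; the absolute deviations do not.
deviation_sum = sum(deviations)
mean_abs_deviation = sum(abs(d) for d in deviations) / n

print(deviations, deviation_sum, mean_abs_deviation)
```

Because the signed deviations always cancel, their plain average says nothing about spread, which is why absolute values (or squares) are used instead.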

Sample Variance: The sample variance, $s^2$, is the mean of the squared deviations, calculated using n - 1 as the divisor: $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$, where n is the sample size.
Note: The numerator for the sample variance is called the sum of squares for x, denoted SS(x): $s^2 = \frac{SS(x)}{n - 1}$, where $SS(x) = \sum (x - \bar{x})^2 = \sum x^2 - \frac{(\sum x)^2}{n}$
Standard Deviation: The standard deviation of a sample, s, is the positive square root of the variance: $s = \sqrt{s^2}$

Example: Find the variance and standard deviation for the data {5, 7, 1, 3, 8}.

Note:
1. The shortcut formula for the sample variance: $s^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1}$
2. The unit of measure for the standard deviation is the same as the unit of measure for the data. The unit of measure for the variance might then be thought of as units squared.
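For the data {5, 7, 1, 3, 8}, the variance and standard deviation can be computed both from the squared deviations and from the shortcut form, which works from the sum of the squared values. The sketch below is an illustration with variable names chosen here. Both routes give a sum of squares of 32.8, so the variance is 32.8/4 = 8.2 and the standard deviation is about 2.86.

```python
import math

data = [5, 7, 1, 3, 8]
n = len(data)
x_bar = sum(data) / n

# Definition: sum of squared deviations from the mean.
ss_definition = sum((x - x_bar) ** 2 for x in data)

# Shortcut: sum of x^2 minus (sum of x)^2 / n.
ss_shortcut = sum(x * x for x in data) - sum(data) ** 2 / n

variance = ss_definition / (n - 1)   # divisor is n - 1 for a sample
std_dev = math.sqrt(variance)

print(round(variance, 2), round(std_dev, 2))
```

The shortcut avoids computing each deviation, which helps when the mean is not a round number.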

2.5: Mean and Standard Deviation of a Frequency Distribution
• If the data is given in the form of a frequency distribution, we need to make a few changes to the formulas for the mean, variance, and standard deviation.
• Complete the extension table in order to find these summary statistics.

In order to calculate the mean, variance, and standard deviation for data:
1. In an ungrouped frequency distribution, use the frequency of occurrence, f, of each observation.
2. In a grouped frequency distribution, use the frequency of occurrence associated with each class mark.
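One common way to carry this out is to weight each value (or class mark) by its frequency: the mean is the sum of xf divided by the sum of f, and the variance uses the sum of x²f in the shortcut form with the sum of f - 1 as the divisor. The sketch below assumes those formulas, since the slide formulas are not reproduced in the text. The frequency table in it is hypothetical, made up purely for illustration, and is not the table from any example in these notes.

```python
import math

# Hypothetical ungrouped frequency distribution (illustration only).
x_values = [0, 1, 2, 3, 4]
frequencies = [3, 8, 6, 2, 1]

# Extension-table column totals: sum of f, sum of x*f, sum of x^2*f.
sum_f = sum(frequencies)
sum_xf = sum(x * f for x, f in zip(x_values, frequencies))
sum_x2f = sum(x * x * f for x, f in zip(x_values, frequencies))

mean = sum_xf / sum_f
variance = (sum_x2f - sum_xf ** 2 / sum_f) / (sum_f - 1)
std_dev = math.sqrt(variance)

# Cross-check: expand the table back into raw data and use the definition.
raw = [x for x, f in zip(x_values, frequencies) for _ in range(f)]
raw_mean = sum(raw) / len(raw)
raw_variance = sum((x - raw_mean) ** 2 for x in raw) / (len(raw) - 1)

print(mean, round(variance, 3), round(std_dev, 3))
```

For a grouped distribution the same code applies with class marks in place of `x_values`, in which case the results are approximations to the raw-data values.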

Example: A survey of students in the first grade at a local school asked for the number of brothers and/or sisters for each child. The results are summarized in the table below. Find the mean, variance, and standard deviation.

2.6: Measures of Position
• Measures of position are used to describe the relative location of an observation.
• Quartiles and percentiles are two of the most popular measures of position.
• An additional measure of central tendency, the midquartile, is defined using quartiles.
• Quartiles are part of the 5-number summary.

Quartiles: Values of the variable that divide the ranked data into quarters; each set of data has three quartiles.
1. The first quartile, Q1, is a number such that at most 25% of the data are smaller in value than Q1 and at most 75% are larger.
2. The second quartile is the median.
3. The third quartile, Q3, is a number such that at most 75% of the data are smaller in value than Q3 and at most 25% are larger.
[Figure: ranked data, increasing order]

Percentiles: Values of the variable that divide a set of ranked data into 100 equal subsets; each set of data has 99 percentiles. The kth percentile, Pk, is a value such that at most k% of the data is smaller in value than Pk and at most (100 - k)% of the data is larger.
Note:
1. The 1st quartile and the 25th percentile are the same: Q1 = P25.
2. The median, the 2nd quartile, and the 50th percentile are all the same: $\tilde{x}$ = Q2 = P50.

Procedure for finding Pk (and quartiles):
1. Rank the n observations, lowest to highest.
2. Compute A = (nk)/100.
3. If A is an integer: d(Pk) = A.5 (depth). Pk is halfway between the value of the data in the Ath position and the value of the next data.
   If A is a fraction: d(Pk) = B, the next larger integer. Pk is the value of the data in the Bth position.
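The procedure above can be sketched as a short function. To have concrete numbers, the sketch reuses the 20 radar speeds from the stem-and-leaf example in Section 2.1. The function name `percentile` is made up for this illustration, and k is assumed to be a whole number from 1 to 99.

```python
def percentile(data, k):
    """k-th percentile by the depth rule: A = n*k/100, then A.5 or the next larger integer."""
    ranked = sorted(data)
    n = len(ranked)
    if (n * k) % 100 == 0:
        # A is an integer: halfway between the A-th and (A + 1)-th ranked values.
        a = n * k // 100
        return (ranked[a - 1] + ranked[a]) / 2
    # A is a fraction: take the value in the B-th position, B = next larger integer.
    b = n * k // 100 + 1
    return ranked[b - 1]

speeds = [41, 31, 33, 35, 36, 37, 39, 49, 33, 19,
          26, 27, 24, 32, 40, 39, 16, 55, 38, 36]

q1 = percentile(speeds, 25)    # A = 5, depth 5.5
q3 = percentile(speeds, 75)    # A = 15, depth 15.5
p35 = percentile(speeds, 35)   # A = 7, depth 7.5
p33 = percentile(speeds, 33)   # A = 6.6, depth 7

print(q1, q3, p35, p33)
```

Testing `(n * k) % 100` with whole numbers avoids deciding whether a floating-point A is "really" an integer.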

Example: The following data represents the pH levels of a random sample of swimming pools in a California town. Find the first and third quartiles, and the 35th percentile.
k = 25: (20)(25)/100 = 5, depth = 5.5, Q1 = 6
k = 75: (20)(75)/100 = 15, depth = 15.5, Q3 = 6.95
k = 35: (20)(35)/100 = 7, depth = 7.5, P35 = 6.15
