
Chapter 3 Using Statistics to Summarize Data Sets


Presentation Transcript


  1. Chapter 3 Using Statistics to Summarize Data Sets

  2. Chapter 3 Using Statistics to Summarize Data Sets • 3.1 Introduction • 3.2 Sample Mean • 3.3 Sample Median • 3.4 Sample Mode • 3.5 Sample Variance and Sample Standard Deviation • 3.6 Normal Data Sets and the Empirical Rule • 3.7 Sample Correlation Coefficient

  3. Introduction • To obtain a feel for a large data set, it is often necessary to summarize it by some suitably chosen measures. • In this chapter, we introduce different statistics that can be used to summarize certain features of data sets. • These summary measures are called statistics, where by a statistic we mean any numerical quantity whose value is determined by the data. • Definition: Numerical quantities computed from a data set are called statistics.

  4. Sample Mean • Suppose we have a sample of n data points whose values we designate by x1, x2, . . . , xn. • One statistic for indicating the center of this data set is the sample mean, defined to equal the arithmetic average of the data values: x̄ = (x1 + x2 + · · · + xn)/n.

  5. Example 3.1 • The average fuel efficiencies, in miles per gallon, of cars sold in the United States in the years 1999 to 2003 were 28.2, 28.3, 28.4, 28.5, 29.0. Find the sample mean of this set of data. • Solution: x̄ = (28.2 + 28.3 + 28.4 + 28.5 + 29.0)/5 = 142.4/5 = 28.48 miles per gallon.

  6. Example 3.2 • The winning scores in the U.S. Masters Golf Tournament in the years from 1981 to 1990 were as follows: 280, 284, 280, 277, 282, 279, 285, 281, 283, 278. Find the sample mean of these winning scores. • Solution: x̄ = 2809/10 = 280.9.
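
The arithmetic in Examples 3.1 and 3.2 can be checked with a few lines of Python. This is only a minimal sketch; the function name `sample_mean` is an illustrative choice and does not come from the slides.

```python
def sample_mean(values):
    """Arithmetic average of the data values: (x1 + ... + xn) / n."""
    if not values:
        raise ValueError("sample mean is undefined for an empty data set")
    return sum(values) / len(values)


# Example 3.1: fuel efficiencies in miles per gallon, 1999 to 2003
mpg = [28.2, 28.3, 28.4, 28.5, 29.0]
# Example 3.2: winning Masters scores, 1981 to 1990
scores = [280, 284, 280, 277, 282, 279, 285, 281, 283, 278]

print(round(sample_mean(mpg), 2))  # 28.48
print(sample_mean(scores))         # 280.9
```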

  7. Example 3.3 • The number of suits sold daily by a women’s boutique for the past 6 days has been arranged in the following frequency table. • What is the sample mean? • Solution: Since the original data set consists of the 6 values 3, 3, 4, 5, 5, 5, it follows that the sample mean is x̄ = (3 + 3 + 4 + 5 + 5 + 5)/6 = 25/6 ≈ 4.17.
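
When data arrive as a frequency table, the sample mean can be computed directly from the table as a frequency-weighted average. The sketch below assumes the frequencies implied by the six values listed in the solution (3 twice, 4 once, 5 three times); the function name `mean_from_frequencies` is hypothetical.

```python
def mean_from_frequencies(freq):
    """Sample mean from a {value: frequency} table."""
    n = sum(freq.values())  # total number of observations
    if n == 0:
        raise ValueError("frequency table is empty")
    total = sum(value * count for value, count in freq.items())
    return total / n


# Frequencies inferred from the values 3, 3, 4, 5, 5, 5 in the solution
suits_sold = {3: 2, 4: 1, 5: 3}
print(round(mean_from_frequencies(suits_sold), 3))  # 4.167
```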

  8. Example 3.4 • A. Weiss analyzed a sample of 770 similar motorcycle accidents that occurred in the Los Angeles area in 1976 and 1977. • Find the sample mean of the head severity classifications for those operators who wore helmets and for those who did not.

  9. Example 3.4 • Solution Therefore, the data indicate that those cyclists who were wearing a helmet suffered, on average, less severe head injuries than those who were not wearing a helmet.

  10. Deviations • The differences between each of the data values and the sample mean are called deviations. • The sum of all the deviations must equal 0, since (x1 − x̄) + · · · + (xn − x̄) = (x1 + · · · + xn) − n x̄ = 0.

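  11. Example 3.5 • Recall the data of Example 3.1: the average fuel efficiencies, in miles per gallon, of cars sold in the United States in the years 1999 to 2003 were 28.2, 28.3, 28.4, 28.5, 29.0. • With sample mean 28.48, the deviations are −0.28, −0.18, −0.08, 0.02, 0.52, and their sum is 0.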
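
The zero-sum property of the deviations can be checked numerically on the fuel-efficiency data. This is a small sketch; because floating-point arithmetic leaves a tiny rounding error, the sum is compared with 0 using a tolerance rather than exact equality.

```python
import math

mpg = [28.2, 28.3, 28.4, 28.5, 29.0]
x_bar = sum(mpg) / len(mpg)

# Deviation of each data value from the sample mean
deviations = [x - x_bar for x in mpg]

print([round(d, 2) for d in deviations])  # [-0.28, -0.18, -0.08, 0.02, 0.52]
print(math.isclose(sum(deviations), 0.0, abs_tol=1e-9))  # True
```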

  12. Center of Gravity • The sample mean is a balancing point called the center of gravity. • For example, the center of gravity of 0, 1, 2, 6, 10, 11 is (0 + 1 + 2 + 6 + 10 + 11)/6 = 30/6 = 5.

  13. Exercise (p.80, 10)

  14. Sample Median • The sample mean indicates the center of a data set, but its value is greatly affected by extreme data values. • For example, the sample mean of the data set {2, 110, 5, 7, 6, 7, 3} is 20, even though six of the seven values are at most 7. • A statistic that is also used to indicate the center of a data set but that is not affected by extreme values is the sample median, defined as the middle value when the data are ranked in order from smallest to largest. • Definition: Order the data values from smallest to largest. If the number of data values is odd, then the sample median is the middle value in the ordered list; if it is even, then the sample median is the average of the two middle values.

  15. Example 3.6 • The following data represent the number of weeks it took seven individuals to obtain their driver’s licenses. Find the sample median. 2, 110, 5, 7, 6, 7, 3 Solution • First arrange the data in increasing order. 2, 3, 5, 6, 7, 7, 110 • Since the sample size is 7, it follows that the sample median is the fourth smallest value. • The sample median number of weeks it took to obtain a driver’s license is m = 6 weeks.

  16. Example 3.7 • The following data represent the number of days it took 6 individuals to quit smoking after completing a course designed for this purpose: 1, 2, 3, 5, 8, 100. What is the sample median? • Solution: Since the sample size is 6, the sample median is the average of the two middle values; thus, m = (3 + 5)/2 = 4. The sample median is 4 days.
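
The odd and even cases of the median definition translate directly into code. The following is a minimal sketch (the name `sample_median` is illustrative) applied to the data of Examples 3.6 and 3.7.

```python
def sample_median(values):
    """Middle value of the sorted data, or the average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("sample median is undefined for an empty data set")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd n: the single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even n: average of the two middle values


print(sample_median([2, 110, 5, 7, 6, 7, 3]))  # 6
print(sample_median([1, 2, 3, 5, 8, 100]))     # 4.0
```

Note that the extreme values 110 and 100 have no influence on either result.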

  17. Example 3.8 • The following data give the names of the National Basketball Association (NBA) individual scoring champions and their season scoring averages in each of the seasons from 1992 to 2008. (a) Find the sample median of the scoring averages. (b) Find the sample mean of the scoring averages. • Solution: (a) m = 30.2 (b) x̄ ≈ 30.435

  18. Sample Mean vs. Sample Median • The question as to which of the two summarizing statistics is the more informative depends on what you are interested in learning from the data set. • If a city government has a flat-rate income tax and is trying to figure out how much income it can expect, then it would be more interested in the sample mean of the income of its citizens than in the sample median. • If the city government were planning to construct some middle-income housing and were interested in the proportion of its citizens who would be able to afford such housing, then the sample median might be more informative.

  19. Exercise (p. 86, 3)

  20. Sample Percentiles • Definition (Sample Percentiles): The sample 100p percentile is that data value having the property that at least 100p percent of the data are less than or equal to it and at least 100(1 − p) percent of the data values are greater than or equal to it. If two data values satisfy this condition (which happens when np is an integer, n being the sample size), then the sample 100p percentile is the arithmetic average of these two values. Here p is any fraction between 0 and 1. • Note that the sample median is the sample 50th percentile (p = 0.50).

  21. Sample Percentiles

  22. Example 3.9 • Which data value is the sample 90th percentile when the sample size is (a) 8, (b) 16, and (c) 100? • Solution: (a) Since 0.9 × 8 = 7.2, the sample 90th percentile value would be the 8th-smallest value (that is, the largest value). (b) Since 0.9 × 16 = 14.4, the sample 90th percentile would be the 15th-smallest value. (c) Since 0.9 × 100 = 90 is an integer, the sample 90th percentile value is the average of the 90th and the 91st values when the data are arranged from smallest to largest.
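
The rule used in this example (round np up when it is not an integer, otherwise average the np-th and (np + 1)-th values) can be sketched as a function that returns the ranks involved. The name `percentile_ranks` and the choice to pass the percentile as a whole number k, so that p = k/100, are assumptions made here to keep the integer test exact.

```python
def percentile_ranks(n, k):
    """1-based ranks, in sorted order, that define the sample k-th percentile of n values.

    k is a whole-number percentage with 0 < k < 100, so p = k / 100.
    Integer arithmetic avoids floating-point trouble when testing whether n * p is an integer.
    """
    if n < 1 or not 0 < k < 100:
        raise ValueError("need n >= 1 and 0 < k < 100")
    whole, remainder = divmod(n * k, 100)
    if remainder == 0:
        return (whole, whole + 1)  # n * p is an integer: average these two values
    return (whole + 1,)            # otherwise: round n * p up to the next rank


print(percentile_ranks(8, 90))    # (8,)
print(percentile_ranks(16, 90))   # (15,)
print(percentile_ranks(100, 90))  # (90, 91)
```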

  23. Quartiles • Definition: The sample 25th percentile is called the first quartile. The sample 50th percentile is called the median or the second quartile. The sample 75th percentile is called the third quartile.

  24. Example 3.11 • Find the sample quartiles for the following 18 data values, which represent the ordered values of a sample of scores from a league bowling tournament: 122, 126, 133, 140, 145, 145, 149, 150, 157, 162, 166, 175, 177, 177, 183, 188, 199, 212 • Solution • Since 0.25 × 18 = 4.5, the sample 25th percentile is the fifth-smallest value, which is 145. • Since 0.50 × 18 = 9, the second quartile (or sample median) is the average of the 9th- and 10th-smallest values and so is (157 + 162)/2 = 159.5. • Since 0.75 × 18 = 13.5, the third quartile is the 14th-smallest value, which is 177.
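
The three quartiles of the bowling scores can be reproduced with a short function that follows the percentile definition of this chapter. This is a sketch under that definition only; library routines such as NumPy's percentile functions use interpolation rules that can give slightly different values. The name `sample_percentile` is illustrative.

```python
def sample_percentile(values, k):
    """Sample k-th percentile (k a whole number, 0 < k < 100) under the slide definition."""
    if not values or not 0 < k < 100:
        raise ValueError("need a non-empty data set and 0 < k < 100")
    ordered = sorted(values)
    whole, remainder = divmod(len(ordered) * k, 100)
    if remainder == 0:
        # n * p is an integer: average the (n*p)-th and (n*p + 1)-th smallest values
        return (ordered[whole - 1] + ordered[whole]) / 2
    # otherwise take the value whose rank is n * p rounded up (0-based index `whole`)
    return ordered[whole]


bowling = [122, 126, 133, 140, 145, 145, 149, 150, 157,
           162, 166, 175, 177, 177, 183, 188, 199, 212]

print(sample_percentile(bowling, 25))  # 145
print(sample_percentile(bowling, 50))  # 159.5
print(sample_percentile(bowling, 75))  # 177
```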

  25. Exercise (p.93, 1) • Seventy-five values are arranged in increasing order. How would you determine the sample (a) 80th percentile (b) 60th percentile (c) 30th percentile of this data set?

  26. Sample Mode • The sample mode is the data value that occurs most frequently in the data set. • Example 3.12: The following are the sizes of the last 8 dresses sold at a women’s boutique: 8, 10, 6, 4, 10, 12, 14, 10. What is the sample mode? • Solution: The sample mode is 10, since the value of 10 occurs most frequently. • If no single value occurs most frequently, then all the values that occur at the highest frequency are called modal values.
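
Counting frequencies is enough to find the mode or, when there is a tie, all modal values. The sketch below uses `collections.Counter` from the Python standard library; the function name `modal_values` is hypothetical.

```python
from collections import Counter


def modal_values(values):
    """All values that occur with the highest frequency, in increasing order."""
    counts = Counter(values)
    if not counts:
        raise ValueError("mode is undefined for an empty data set")
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


# Example 3.12: sizes of the last 8 dresses sold
print(modal_values([8, 10, 6, 4, 10, 12, 14, 10]))  # [10]
```

A one-element result is the sample mode; a longer list means no single value occurs most frequently.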

  27. Example 3.14 • The following frequency table gives the values obtained in 30 throws of a die. • It is easy to pick out the modal value from a frequency table, since it is just that value having the largest frequency. • For these data, find the (a) sample mode, (b) sample median, (c) sample mean. • Solution: (a) The sample mode is 4. (b) The sample median is 3.5. (c) The sample mean is 3.333.

  28. Exercise (p. 98, 1; p. 99, 6) • Match each statement in the left-hand column with the correct data set from the right-hand column. • Statements: 1. Sample mode is 9; 2. Sample mean is 9; 3. Sample median is 9 • Data sets: A: 5, 7, 8, 10, 13, 14; B: 1, 2, 5, 9, 9, 15; C: 1, 2, 9, 12, 12, 18

  29. Sample Variance and Sample Standard Deviation • Consider the two data sets A: 1, 2, 5, 6, 6 and B: −40, 0, 5, 20, 35. • Although data sets A and B have the same sample mean and sample median, there is clearly more spread in the values of B than in those of A. • One way of measuring the variability of a data set is to consider the deviations of the data values from a central value. • The sample variance is a measure of the “average” of the squared deviations from the sample mean.

  30. Example 3.15 • Find the sample variance of data set A. A: 1, 2, 5, 6, 6

  31. Example 3.16 • Find the sample variance of data set B. B: −40, 0, 5, 20, 35
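
Examples 3.15 and 3.16 can be worked numerically. The sketch below assumes the usual definition of the sample variance, in which the sum of squared deviations is divided by n − 1 rather than n; the function name `sample_variance` is illustrative.

```python
def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1 (assumed divisor)."""
    n = len(values)
    if n < 2:
        raise ValueError("sample variance needs at least two data values")
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


data_a = [1, 2, 5, 6, 6]
data_b = [-40, 0, 5, 20, 35]

print(sample_variance(data_a))  # 5.5
print(sample_variance(data_b))  # 792.5
```

Both data sets have sample mean 4, but the variance of B is far larger, which matches the greater spread of its values.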

  32. Example 3.17 • Check that identity (3.2) holds for data set A. A: 1, 2, 5, 6, 6
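
Identity (3.2) is not reproduced on this slide. Assuming it is the usual computational identity, namely that the sum of squared deviations equals the sum of the squared data values minus n times the square of the sample mean, the check for data set A might look like this.

```python
data_a = [1, 2, 5, 6, 6]
n = len(data_a)
x_bar = sum(data_a) / n  # 4.0

# Left side: sum of squared deviations from the sample mean
lhs = sum((x - x_bar) ** 2 for x in data_a)
# Right side (assumed form of identity 3.2): sum of squares minus n * x_bar^2 = 102 - 5 * 16
rhs = sum(x ** 2 for x in data_a) - n * x_bar ** 2

print(lhs, rhs)  # 22.0 22.0
```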

  33. Discussion

  34. Sample Standard Deviation • The positive square root of the sample variance is called the sample standard deviation. • The sample standard deviation is measured in the same units as the original data. • For instance, if the data are in feet, then the sample variance will be expressed in units of square feet and the sample standard deviation in units of feet.
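
As a quick illustration, the standard deviations of data sets A and B from the earlier variance slides can be obtained with `statistics.stdev` from the Python standard library, which divides by n − 1 before taking the square root.

```python
import statistics

data_a = [1, 2, 5, 6, 6]
data_b = [-40, 0, 5, 20, 35]

# statistics.stdev is the square root of statistics.variance (n - 1 divisor)
print(round(statistics.stdev(data_a), 2))  # 2.35
print(round(statistics.stdev(data_b), 2))  # 28.15
```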

  35. Discussion • Another indicator of the variability of a data set is the interquartile range, which is equal to the third minus the first quartile. • The interquartile range is the length of the interval in which the middle half of the data values lie.

  36. Example 3.19 • The Miller Analogies Test (MAT) is a standardized test that is taken by a variety of students applying to graduate and professional schools. • The MAT consists of 120 questions to be answered in 60 minutes. • Table 3.2 presents some of the percentile scores on this examination for students, classified according to the graduate fields they are entering. • Determine the interquartile ranges of the scores of students in the five specified categories.

  37. Example 3.19 • Solution: Since the interquartile range is the difference between the 75th and the 25th sample percentiles, it follows that its value is • 80 − 55 = 25 for scores of physical science students • 71 − 45 = 26 for scores of medical school students • 74 − 49 = 25 for scores of social science students • 73 − 43 = 30 for scores of language and literature students • 60 − 37 = 23 for scores of law school students
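
The same subtraction can be written once and applied to every category. The sketch below takes the 25th and 75th percentile scores exactly as they appear in the solution above; the dictionary layout is an illustrative choice.

```python
# (25th percentile, 75th percentile) for each field, as used in the solution
quartiles = {
    "physical science": (55, 80),
    "medical school": (45, 71),
    "social science": (49, 74),
    "language and literature": (43, 73),
    "law school": (37, 60),
}

# Interquartile range = third quartile minus first quartile
iqr = {field: q3 - q1 for field, (q1, q3) in quartiles.items()}

for field, value in iqr.items():
    print(f"{field}: {value}")
```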

  38. A Box Plot • A box plot is often used to plot some of the summarizing statistics of a data set. • A straight-line segment stretching from the smallest to the largest data value is drawn on a horizontal axis; imposed on the line is a “box,” which starts at the first and continues to the third quartile, with the value of the second quartile indicated by a vertical line. • For instance, the following frequency table gives the starting salaries of a sample of 42 graduating seniors of a liberal arts college. • The salaries go from a low of 47 to a high of 60. The value of the first quartile is 50; the value of the second quartile is 51.5; and the value of the third quartile is 54.
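
To make the construction concrete, here is a rough text-only sketch that draws a box plot from the five summary values quoted above (47, 50, 51.5, 54, 60). The function `text_box_plot`, its character choices, and the default width are all assumptions made for illustration; a plotting library would normally be used instead.

```python
def text_box_plot(lo, q1, q2, q3, hi, width=53):
    """Return a one-line text box plot: '-' for the line, '=' for the box, '|' for the second quartile."""
    if not lo <= q1 <= q2 <= q3 <= hi or lo == hi:
        raise ValueError("need lo <= q1 <= q2 <= q3 <= hi with lo < hi")

    def column(value):
        # Map a data value to a character column in [0, width - 1]
        return round((value - lo) * (width - 1) / (hi - lo))

    box_start, median_col, box_end = column(q1), column(q2), column(q3)
    cells = []
    for c in range(width):
        if c == median_col:
            cells.append("|")
        elif box_start <= c <= box_end:
            cells.append("=")
        else:
            cells.append("-")
    return "".join(cells)


# Starting-salary summary from the slide: low, Q1, Q2, Q3, high
print(text_box_plot(47, 50, 51.5, 54, 60))
```

The box sits toward the left of the line, reflecting that the upper quarter of the salaries (54 to 60) is more spread out than the lower quarter (47 to 50).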

  39. Exercise (p. 107, 10; p.108, 16)

  40. Normal Data Sets and the Empirical Rule • Definition A data set is said to be normal if a histogram describing it has the following properties: 1. It is highest at the middle interval. 2. Moving from the middle interval in either direction, the height decreases in such a way that the entire histogram is bell-shaped. 3. The histogram is symmetric about its middle interval. Figure 3.2 shows the histogram of a normal data set.

  41. Histogram

  42. Empirical Rule

  43. Example 3.20 • The scores of 25 students on a history examination are listed on the following stem-and-leaf plot. • By standing this figure on its side, we can see that the corresponding histogram is approximately normal. • Use it to assess the empirical rule.
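
As the empirical rule is usually stated, an approximately normal data set has about 68 percent of its values within one sample standard deviation of the sample mean, about 95 percent within two, and about 99.7 percent within three. Assessing the rule means computing these proportions for the data at hand. The sketch below shows one way to do that; the function name `fraction_within` is hypothetical, and the list `placeholder_scores` is made-up data used only to show the call, not the 25 examination scores of this example.

```python
import statistics


def fraction_within(values, k):
    """Fraction of data values within k sample standard deviations of the sample mean."""
    x_bar = statistics.mean(values)
    s = statistics.stdev(values)  # n - 1 divisor
    inside = sum(1 for x in values if abs(x - x_bar) <= k * s)
    return inside / len(values)


# Made-up placeholder data; substitute the actual examination scores
placeholder_scores = [52, 61, 65, 68, 70, 72, 75, 79, 85, 93]

for k in (1, 2, 3):
    print(k, fraction_within(placeholder_scores, k))
```

The printed proportions are then compared with 0.68, 0.95, and 0.997.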

  44. Example 3.20

  45. Bimodal • A data set that is obtained by sampling from a population that is itself made up of subpopulations of different types is usually not normal. • The histogram from such a data set often resembles a combination of normal histograms and thus will often have more than one local peak. • A data set whose histogram has two local peaks is said to be bimodal. • The data set represented in Fig. 3.6 is bimodal.

  46. Exercise (p. 118, 9; p. 118, 8)

  47. Sample Correlation Coefficient • The sample correlation coefficient measures the degree to which larger x values go with larger y values and smaller x values go with smaller y values. • Consider the data set of paired values (x1, y1), (x2, y2), . . . , (xn, yn). • (Figure: positive correlation.) • A free radical is a single atom of oxygen. It is believed to be potentially harmful because it is highly reactive and has a strong tendency to combine with other atoms within the body.

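  48. Free Radical • A free radical is “an atom, molecule, or ion that carries a single unpaired electron.” • There are many kinds of free radicals in the human body: some are synthesized by the body itself and serve important functions, some are produced during metabolism, and some come from the external environment. • Some free radicals are quite reactive; these more reactive free radicals are unstable and tend to seize electrons from other substances so that their own unpaired electron becomes paired (more stable). • A substance that loses an electron may itself become unstable and seize electrons from other substances in turn, setting off a chain reaction that damages the substances involved. • Aging and disease in the human body very likely begin at this point. • In particular, free radicals are the chief culprit behind cancer, which in recent years has ranked first among the ten leading causes of death. • Source: http://www.mmh.org.tw/nutrition/chao/064antioxid.htm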

  49. Sample Correlation Coefficient • The data of Table 3.4 represent the years of schooling (variable x) and the resting pulse rate in beats per minute (variable y) of 10 individuals. • A scatter diagram of these data is presented in Fig. 3.10; it shows a negative correlation.

  50. Correlation Coefficient
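
The formula itself is not reproduced on this slide. Assuming the standard definition, r is the sum of the products of the x and y deviations divided by (n − 1) times the two sample standard deviations, which simplifies to the expression used below. The function name `sample_correlation` and the tiny data sets are illustrative only; they are not the data of Table 3.4.

```python
import math


def sample_correlation(xs, ys):
    """Sample correlation coefficient r of paired data (standard Pearson form, assumed)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long samples with at least two pairs")
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when one variable is constant")
    # The (n - 1) factors in the covariance and standard deviations cancel
    return sxy / math.sqrt(sxx * syy)


# Illustrative pairs: y rises exactly with x, then falls exactly with x
print(sample_correlation([1, 2, 3], [2, 4, 6]))  # 1.0
print(sample_correlation([1, 2, 3], [6, 4, 2]))  # -1.0
```

Values of r near 1 correspond to a positive correlation, values near −1 to a negative correlation, and r always lies between −1 and 1.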
