
Topic 2




  1. Topic 2 Describing Distributions with Graphs and Numbers

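  2. [Diagram: a target population (size = N) produces data (size = n) through sampling or an experiment. The data are summarized and visualized, and inference (estimation, testing) carries conclusions back to the population.]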

  3. Parameter and Statistic • A parameter (in statistics) is a quantity that describes a certain characteristic of a population, for example the average birthweight of all new-born babies. • Parameters are estimated from a sample. • A statistic is a summary measure computed from sample data, whereas a parameter is a summary measure for an entire population. • A key use of a statistic is as an estimator of a parameter.

  4. Distributions • When we say that 62% of TAMUK students are Hispanic, 32% are white, 3% are African-American, and 3% are others, we mean that the DISTRIBUTION of TAMUK students according to race is

     Race               Percent
     Hispanic           62%
     White              32%
     African-American   3%
     Others             3%

  5. The DISTRIBUTION of grades for a class could be

     Grade   Percent
     A       20%
     B       45%
     C       22%
     D       10%
     F       3%

  6. The DISTRIBUTION of weights of all men aged 30 in Texas could be

     Weight             Percent
     Less than 130 lb   3%
     130 to 140 lb      6%
     140 to 150 lb      15%
     150 to 160 lb      25%
     160 to 170 lb      30%
     170 to 180 lb      17%
     180 lb or over     4%

  7. So, the DISTRIBUTION of a population describes how the population is made up according to some characteristic. If the characteristic of interest is described by a categorical variable, e.g., race, one may be interested in what percent of subjects fall in each category. If the characteristic is described by a continuous variable, e.g., weight, one may be interested in what proportion of subjects fall in each interval of values.
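The idea of a distribution as a table of percentages can be sketched in R. The `grades` vector below is hypothetical data made up for illustration (it is not the class from the slides); `table()` counts each category and `prop.table()` converts the counts to proportions.

```r
# Hypothetical grades for a class of 20 students (illustrative data only)
grades <- c(rep("A", 4), rep("B", 9), rep("C", 4), rep("D", 2), rep("F", 1))

counts  <- table(grades)                       # frequency of each category
percent <- round(100 * prop.table(counts), 1)  # relative frequency in percent

print(percent)
```

For a continuous variable such as weight, the same approach should work after grouping the values into intervals, for example with `cut()`.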

  8. Histograms • A histogram is a bar graph in which the horizontal scale represents classes of data values and the vertical scale represents frequencies (or relative frequencies). The heights of the bars correspond to the frequency (or the relative frequency) values, and the bars are drawn adjacent to each other without gaps.

  9. Example Construct a histogram for the systolic blood pressures (SBP) of 20 men: 93 104 105 108 109 112 114 115 117 119 119 120 121 123 127 130 135 139 139 158

  10. R Code

      # Systolic blood pressures of the 20 men
      SBP = c(93,104,105,108,109,112,114,115,117,119,
              119,120,121,123,127,130,135,139,139,158)
      # Classes of width 10, with boundaries chosen so that no value falls on a boundary
      hist(SBP, breaks=c(89.5,99.5,109.5,119.5,129.5,139.5,
                         149.5,159.5,169.5), col=3)

      Copy and paste this code into R to see the histogram.
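The frequencies behind the bars can also be tabulated without drawing anything. This is a small sketch that reuses the same class boundaries; the names `class_breaks` and `freq_table` are introduced here only for illustration.

```r
SBP <- c(93, 104, 105, 108, 109, 112, 114, 115, 117, 119,
         119, 120, 121, 123, 127, 130, 135, 139, 139, 158)

# Same class boundaries as in the histogram: 89.5, 99.5, ..., 169.5
class_breaks <- seq(89.5, 169.5, by = 10)

# plot = FALSE returns the class counts instead of drawing the histogram
h <- hist(SBP, breaks = class_breaks, plot = FALSE)

freq_table <- data.frame(
  lower    = head(class_breaks, -1),
  upper    = tail(class_breaks, -1),
  count    = h$counts,
  rel_freq = h$counts / length(SBP)
)
print(freq_table)
```

The `count` column gives the bar heights for a frequency histogram, and `rel_freq` gives them for a relative-frequency histogram.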

  11. Pie Charts A pie chart is a circle with a “slice of the pie” for each category. The size of each slice corresponds to the percentage of observations in that category.

  12. Bar Graph for European Parliament in 2004

  13. Pareto Chart: Bar Graph with categories Ordered by Their Frequency from the Tallest Bar to Shortest
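A Pareto chart can be sketched in R by sorting the category percentages before drawing the bars. The example below uses the grade distribution from slide 5; the variable names are chosen here for illustration.

```r
# Grade distribution from slide 5 (percent of the class)
grade_percent <- c(A = 20, B = 45, C = 22, D = 10, F = 3)

# Order the categories from the tallest bar to the shortest
pareto_order <- sort(grade_percent, decreasing = TRUE)

barplot(pareto_order, ylab = "Percent",
        main = "Pareto chart of grades")
```

Calling `barplot(grade_percent)` without sorting would give an ordinary bar graph in the original category order.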

  14. Measuring the Center: the Mean and Median The distribution of data or a population can be displayed graphically. In practice, we also want to know where the center of a distribution is. The mean and median are common measures of the center of a distribution. The mean of n observations x1, x2, …, xn, denoted $\bar{x}$, is defined as $\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$. Example: The selling prices (in dollars) of 5 single-family homes are 198000, 219000, 175000, 260000, 630000. Find the mean price.
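As a quick check of the example, the mean can be computed in R either directly from the definition (sum divided by n) or with the built-in `mean()`. The variable names are chosen here for illustration.

```r
prices <- c(198000, 219000, 175000, 260000, 630000)

# Definition: add up the observations and divide by how many there are
mean_by_formula <- sum(prices) / length(prices)

# Built-in equivalent
mean_builtin <- mean(prices)

print(c(mean_by_formula, mean_builtin))  # both are 296400
```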

  15. The Mean is Sensitive to Outliers If the 5th home were $360000, then the mean price would be ___. The significant difference in means is due primarily to the 5th price, which is called an outlier. If we construct a histogram or a stem plot for the data of these 5 prices, the distribution of the data can be seen to be skewed to the right. This skewness is caused by the outlier.
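The effect of the outlier on the mean can be seen by computing both versions side by side. This is only a sketch; the two vector names are made up for the comparison.

```r
prices_with_outlier <- c(198000, 219000, 175000, 260000, 630000)
prices_replaced     <- c(198000, 219000, 175000, 260000, 360000)

mean_with_outlier <- mean(prices_with_outlier)  # 5th home at $630000
mean_replaced     <- mean(prices_replaced)      # 5th home at $360000

# Changing one observation out of five moves the mean by this much
shift <- mean_with_outlier - mean_replaced
print(c(mean_with_outlier, mean_replaced, shift))
```

A histogram of `prices_with_outlier` (for example via `hist()`) should show the long right tail that the slide describes.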

  16. The Median Another measure of the center of a distribution is the median. Given n observations x1, x2, …, xn, the median, denoted M, is the number such that half the observations are smaller and half are larger. To find the median of n observations, we first sort the observations in increasing order, then pick the “midpoint”. Example: Find the median of the 5 prices 198000, 219000, 175000, 260000, and 630000. What if we have 6 prices: 198000, 219000, 175000, 260000, 630000, and 230000?
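The two price examples can be checked in R. Sorting first makes the "midpoint" visible, and the built-in `median()` returns it directly; the vector names are chosen here for illustration.

```r
five_prices <- c(198000, 219000, 175000, 260000, 630000)
six_prices  <- c(five_prices, 230000)

print(sort(five_prices))  # the middle (3rd) value is the median
median_five <- median(five_prices)

print(sort(six_prices))   # no single middle value: average the 3rd and 4th
median_six <- median(six_prices)

print(c(median_five, median_six))
```

Note that the outlier 630000 has no effect on either median, in contrast to its effect on the mean.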

  17. Location of the Median Given n observations, the location of the median in the ordered list is always (n+1)/2. When is the location of the median an integer? When is it a decimal? If the location of the median is 4.5, the median is halfway between the 4th and 5th observations in the ordered list. What does it mean if the location is 7? Find the median and its location for the data: 2, 5, 1, 0, 9. Find the median and its location for the data: 0, 3, 1, -2, 7, 4.
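The (n+1)/2 rule can be turned into a short R function. This is a minimal sketch; `median_by_location` is a name made up here and is not a built-in R function.

```r
# Return the location (n+1)/2 in the ordered list and the median found there.
median_by_location <- function(x) {
  s   <- sort(x)
  loc <- (length(s) + 1) / 2
  # Integer location: floor and ceiling coincide, so this is s[loc].
  # Location ending in .5: this averages the two neighbouring observations.
  m <- (s[floor(loc)] + s[ceiling(loc)]) / 2
  list(location = loc, median = m)
}

odd_case  <- median_by_location(c(2, 5, 1, 0, 9))
even_case <- median_by_location(c(0, 3, 1, -2, 7, 4))

print(odd_case)
print(even_case)
```

The location is an integer when n is odd and ends in .5 when n is even.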

  18. Example: Find the Mean and Median from a Stem Plot

      1 | 69
      2 | 455
      3 | 334477
      4 | 0255669
      5 |
      6 |
      7 | 3

      (a) What are the observations? (b) Find the mean. (c) Find the median and its location.

  19. Comparing the Mean and Median For a symmetric distribution, mean = median. For a right-skewed distribution, the mean is typically greater than the median. For a left-skewed distribution, the mean is typically less than the median.

  20. Mean, Median, and Mode [Figure with three panels: the distribution is symmetric; the distribution is skewed to the left; the distribution is skewed to the right.]

  21. Measuring the Spread: The Quartiles • The spread of a distribution measures how widely its values are dispersed. • The middle half of a distribution is marked out by two quartiles: the 1st quartile Q1 is the number such that 25% of all values are smaller; the 3rd quartile Q3 is the number such that 75% of all values are smaller. • The median of a distribution is also called the 2nd quartile, the number such that 50% of all values are smaller. • Note that quartiles defined this way are not unique, so different rules (and different software) can give slightly different values. • To find the quartiles, we sort the data and then find the location of each quartile.

  22. Example: Find Quartiles 1. Given data 16 25 24 19 33 25 34 46 37 33 42 40 37 34 49 73 46 45 45, find Q1, M, and Q3. 2. Given data 16 25 24 19 33 25 34 46 37 33 42 40 37 34 49 73 46 45 45 31, find Q1, M, and Q3.
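Because quartiles are not uniquely defined, any code has to commit to one rule. The sketch below assumes one common textbook rule: Q1 is the median of the observations below the location of the overall median, and Q3 is the median of those above it. The slides do not say which rule they intend, so this choice, and the function name `quartiles_by_halves`, are assumptions. R's built-in `quantile()` uses a different interpolation rule by default and can return different values.

```r
# Quartiles as medians of the lower and upper halves of the sorted data.
# When n is odd, the overall median is left out of both halves (assumed rule).
quartiles_by_halves <- function(x) {
  s    <- sort(x)
  n    <- length(s)
  half <- floor(n / 2)
  lower <- s[seq_len(half)]       # observations below the median's location
  upper <- s[(n - half + 1):n]    # observations above the median's location
  c(Q1 = median(lower), M = median(s), Q3 = median(upper))
}

data19 <- c(16, 25, 24, 19, 33, 25, 34, 46, 37, 33,
            42, 40, 37, 34, 49, 73, 46, 45, 45)
data20 <- c(data19, 31)

q19 <- quartiles_by_halves(data19)
q20 <- quartiles_by_halves(data20)
print(q19)
print(q20)
```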

  23. The Five-Number Summary and Boxplots Q1, M, and Q3 describe the middle half of a distribution; the tails can be described by the smallest and largest values. Together these five values give a quick picture of a distribution and are called the 5-number summary. The Five-Number Summary describes both the center and the spread of a distribution. The 5 numbers can be displayed in an (ordinary) boxplot, which consists of (a) a central box spanning the quartiles Q1 and Q3, (b) a line in the box marking the median M, and (c) two lines extending from the box out to the smallest and largest observations. Compared with its competitors, histograms and stem plots, a boxplot shows less detail about the distribution. Boxplots are best used for side-by-side comparison of more than one distribution. The boxplot of a distribution should be interpreted in terms of skewness, center, and spread.
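In R, `fivenum()` returns a five-number summary and `boxplot()` draws the corresponding plot. The sketch below applies them to the 19 observations from the quartile example. One caveat: `fivenum()` reports Tukey's hinges, which follow a slightly different rule from the quartile definitions above, so its second and fourth values may not match quartiles found by hand.

```r
obs <- c(16, 25, 24, 19, 33, 25, 34, 46, 37, 33,
         42, 40, 37, 34, 49, 73, 46, 45, 45)

# minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(obs)
print(five)

# Ordinary boxplot: range = 0 extends the whiskers to the smallest and
# largest observations instead of flagging suspected outliers
boxplot(obs, range = 0, horizontal = TRUE)
```

Here the lower hinge is 29, whereas taking the median of the nine smallest observations would give 25; both are acceptable values for Q1 under different rules.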

  24. Compare the two boxplots in terms of skewness, spread, and center. The side-by-side boxplot is produced with the following R code:

      # Grades in Section 01
      x = c(86, 91, 72, 79, 74, 83, 73, 92, 76, 72, 67,
            88, 70, 79, 93, 65, 75, 83, 90, 75, 100, 63)
      # Grades in Section 02
      y = c(74, 84, 86, 90, 78, 85, 75, 72, 97, 84, 87,
            76, 78, 79, 82, 63, 95, 79, 82, 69, 96, 73)
      # One row per student, labelled by section
      z = data.frame(Grade = c(x, y),
                     Section = c(rep('Section 01', length(x)),
                                 rep('Section 02', length(y))))
      # Pass the data frame directly instead of attach()-ing it
      boxplot(Grade ~ Section, data = z, col = 2:3)

  25. Spotting Suspected Outliers: The 1.5 × IQR Rule • In a boxplot, the distance between Q1 and Q3 (the range of the middle half of the data) is a more resistant measure of spread than the full range. This distance is called the inter-quartile range, denoted IQR; that is, IQR = Q3 - Q1. • The 1.5 × IQR Rule for outliers: An observation is called a suspected outlier if it falls more than 1.5 × IQR above Q3 or below Q1. • Example: Find Q1, Q3, and IQR of the data: 72 83 91 84 84 78 90 85 67 91 80 85 67 65 95. Identify any suspected outliers.
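The 1.5 × IQR rule can be sketched as a small R function. The function name `suspected_outliers` is made up here, and the quartiles come from R's `quantile()` with its default settings, which is an assumption: hand-computed quartiles may differ slightly, so the fences may differ slightly too.

```r
# Return the observations lying more than 1.5 x IQR below Q1 or above Q3.
suspected_outliers <- function(x) {
  q   <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]
  lower_fence <- q[1] - 1.5 * iqr
  upper_fence <- q[2] + 1.5 * iqr
  x[x < lower_fence | x > upper_fence]
}

scores <- c(72, 83, 91, 84, 84, 78, 90, 85, 67, 91, 80, 85, 67, 65, 95)

flagged <- suspected_outliers(scores)
print(flagged)  # empty: no observation lies outside the fences

# Adding an extreme value (150, made up for illustration) gets it flagged
flagged_extra <- suspected_outliers(c(scores, 150))
print(flagged_extra)
```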

  26. A Modified Boxplot

  27. R Code

      # Draw a boxplot of x annotated with its five-number summary,
      # the 1.5 x IQR fences, and any suspected outliers.
      myBoxPlot = function(x, col = 'gray'){
        boxplot(x, col = col)
        # Label the five-number summary beside the box
        text(rep(1.3, 5), fivenum(x),
             labels = c('minimum', 'lower hinge', 'median', 'upper hinge', 'maximum'),
             col = 'blue')
        q = quantile(x, probs = c(0.25, 0.5, 0.75))
        iqr = q[3] - q[1]               # named iqr so it does not mask the built-in IQR()
        lowerfence = q[1] - 1.5*iqr
        upperfence = q[3] + 1.5*iqr
        abline(h = c(lowerfence, upperfence), col = 'green', lty = 2)
        # Two fences, so two label positions
        text(rep(1.3, 2), c(lowerfence, upperfence),
             labels = c('lower fence', 'upper fence'), col = 'blue')
        # Observations outside the fences are suspected outliers
        Outliers = which((x - lowerfence)*(x - upperfence) > 0)
        if (length(Outliers) != 0)
          text(rep(0.63, length(Outliers)), x[Outliers],
               labels = paste('Obs.', Outliers), col = 'red')
      }

      Rainfall = c(9.6, 12.9, 9.9, 8.7, 6.8, 12.5, 13.0, 10.1, 10.1, 10.1,
                   10.8, 7.8, 14.1, 10.6, 10.0, 11.5, 13.6, 12.1, 12.0, 9.3,
                   7.7, 11.0, 6.9, 9.5, 16.5, 9.3, 9.4, 8.7, 9.5, 11.6,
                   12.1, 8.0, 10.7, 13.9, 11.3, 11.6, 10.4)
      myBoxPlot(Rainfall)

  28. Measuring Spread: the Standard Deviation Interestingly, the mean is not among the 5-number summary of a distribution. The natural partner of the mean is the standard deviation, which is another measure of the spread of a distribution. The standard deviation measures how far the observations are from their mean.

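  29. Calculation of Standard Deviations • The variance $s^2$ of a set of n observations is an average of the squared deviations from the mean, computed with divisor n - 1: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$. • The standard deviation $s$ is the square root of the variance: $s = \sqrt{s^2}$.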

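  30. Example (Calculating the standard deviation s) Metabolic rates of 7 men who took part in a study of dieting. The units are calories per 24 hours. 1792 1666 1362 1614 1460 1867 1439 Find the mean first: $\bar{x} = (1792 + 1666 + 1362 + 1614 + 1460 + 1867 + 1439)/7 = 11200/7 = 1600$ calories. The standard deviation is then found from the deviations of the observations from this mean.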

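  31. Cont’d Deviations of the 7 metabolic rates from their mean, 1600:

      Observation   Deviation             Squared deviation
      1792          1792 - 1600 =  192    36864
      1666          1666 - 1600 =   66     4356
      1362          1362 - 1600 = -238    56644
      1614          1614 - 1600 =   14      196
      1460          1460 - 1600 = -140    19600
      1867          1867 - 1600 =  267    71289
      1439          1439 - 1600 = -161    25921
                    sum = 0               sum = 214870

      The variance: $s^2 = 214870/(7 - 1) \approx 35811.67$. The standard deviation: $s = \sqrt{35811.67} \approx 189.24$ calories.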
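The hand calculation can be reproduced step by step in R and compared with the built-in `sd()`, which also uses the divisor n - 1. The variable names below are chosen for illustration.

```r
rates <- c(1792, 1666, 1362, 1614, 1460, 1867, 1439)

xbar       <- mean(rates)        # mean of the 7 observations
deviations <- rates - xbar       # these always sum to 0
sum_sq     <- sum(deviations^2)  # sum of squared deviations

variance <- sum_sq / (length(rates) - 1)  # divide by n - 1
s        <- sqrt(variance)

print(c(mean = xbar, sum_sq = sum_sq, variance = variance, sd = s))
print(sd(rates))  # built-in result for comparison
```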

  32. Summary of Strategies for Exploring Data on a Single Quantitative Variable The 5-number summary is always good for describing the distribution of quantitative data. The mean and its partner standard deviation should be used to describe the center and spread of the distribution of quantitative data only when the distribution is known to be symmetric, since both are sensitive to outliers. The shape of the distribution of quantitative data is better described using graphical displays such as histograms.
