
Descriptive Statistics Introduction to Summary Statistics


rupali

Presentation Transcript


  1. “In God we trust. All others must use data” – W. Edwards Deming • Descriptive Statistics: Introduction to Summary Statistics

  2. Overview • Data types • Summary statistics • Central tendency • Dispersion • Distribution shape • Relative position • Exercises

  3. Data Types • Different types of concepts are represented with different types of data • Level of measurement determines the kinds of statistical analysis that can be performed with the data • Discrete & Continuous • Discrete data can only assume values within some finite or countable set • Continuous data can take any value within some interval

  4. Data Types • Scale: definition; examples; descriptive statistics • Nominal: non-ordered categories; race, gender, marital status; percentages, mode • Ordinal: ordered relation between categories; attitudes, social class; percentiles, median • Interval: ordered relation, equality of differences; temperature; range, mean, standard deviation • Ratio: ordered relation, equality of differences, absolute zero; elapsed time, costs, number of customers; all of the above, coefficient of variation

  5. Summary Statistics • Summarizing a set of data typically involves describing three main attributes • Central tendency • Dispersion • Shape

  6. Central Tendency • Measures of central tendency provide a focal point for making decisions based on the data • Types of measures • Mean (average) • Median • Mode • Trimmed means • Address problems with outliers

  7. Mean • Mean is the arithmetic average of a data set • Data: $x_1, x_2, \dots, x_n$ • Average: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ • Can be applied to ratio and interval data • Exercise • Calculate averages of data sets in summary stats.xls, sheet mean

  8. Mean • Mean serves as a measure of central tendency since it is the value that balances positive and negative deviations • Example: data 5, 9, 14, 16; average 11 • Deviations from the average: 5 - 11 = -6, 9 - 11 = -2, 14 - 11 = 3, 16 - 11 = 5, which sum to 0
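
The balancing property on this slide can be checked directly. The sketch below uses the slide's own four data points; the variable names are illustrative only.

```python
data = [5, 9, 14, 16]  # data set shown on the slide

mean = sum(data) / len(data)
deviations = [x - mean for x in data]  # signed distance of each point from the mean

print(mean)             # 11.0
print(deviations)       # [-6.0, -2.0, 3.0, 5.0]
print(sum(deviations))  # 0.0 (negative and positive deviations cancel)
```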

  9. Mean • The mean is sensitive to outlier values in the data set • Mean can change substantially because of a few very large or small data points • Mean is not a robust estimator of central tendency • Mean is sensitive to data entry errors in data set • Exercise • Vary first data point of data sets in summary stats.xls, sheet mean and note changes in mean values

  10. Mean • Always check integrity of data before calculating statistics • Check reasonableness of maximum and minimum of data set • Exercise • Calculate maximum and minimum of flight time data in summary stats.xls, sheet flt time • Calculate mean with and without anomalous data

  11. Mean and outliers • When possible, plot data in the order that it was collected to help spot outliers and identify possible data collection errors • In the plotted example: mean = 170.35; mean without outliers = 150.14

  12. Median • Median is that value such that half the data is less than the median and half is greater • Can be applied to ratio, interval and ordinal data

  13. Median • Median is a more robust measure of central tendency than mean • Exercise • Calculate median of flight time data with and without anomalous data in summary stats.xls, sheet flt time • Median with outliers = 151; median without outliers = 149
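
A small sketch of why the median is called robust. The spreadsheet's flight time data is not reproduced in this transcript, so the values below are hypothetical, with one deliberately bad entry standing in for a data entry error.

```python
import statistics

# Hypothetical flight times in minutes (not the data from summary stats.xls).
clean = [148, 149, 150, 151, 152]
with_outlier = clean + [1500]  # one anomalous value, e.g. a mistyped entry

mean_clean = statistics.mean(clean)
mean_outlier = statistics.mean(with_outlier)
median_clean = statistics.median(clean)
median_outlier = statistics.median(with_outlier)

print(mean_clean, mean_outlier)      # 150 375
print(median_clean, median_outlier)  # 150 150.5
```

One bad point moves the mean from 150 to 375, while the median shifts by only half a minute.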

  14. Trimmed Mean • Trimmed mean is the arithmetic mean after excluding the smallest and greatest x% of the data • More robust to outliers than standard mean • Typically eliminate smallest/greatest 5% or 10% • Exercise • Calculate 5% and 10% trimmed mean for flight time data in summary stats.xls, sheet flt time
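
A minimal sketch of a trimmed mean, assuming the trim count at each end is rounded down to a whole number of points (other conventions exist). The function name `trimmed_mean` and the data are hypothetical.

```python
def trimmed_mean(values, percent):
    """Mean after dropping the smallest and largest `percent`% of the values."""
    if not 0 <= percent < 50:
        raise ValueError("percent must be in [0, 50)")
    ordered = sorted(values)
    k = int(len(ordered) * percent / 100)  # points removed from EACH end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

# Hypothetical data: 20 values, one of them an obvious outlier (900).
data = [150] * 9 + [148, 149, 151, 152, 147, 153, 146, 154, 145, 155, 900]

print(trimmed_mean(data, 0))   # 187.5 (ordinary mean, pulled up by the outlier)
print(trimmed_mean(data, 10))  # 150.25
```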

  15. Which Central Tendency Measure? • Use median if ordinal data • If ratio or interval data, can calculate mean and median • Check data integrity • Plot data • If data is analyzed without outliers, report and explain the outliers • Use median or trimmed means if robust measure needed

  16. Which Central Tendency Measure? • Create histogram to check shape of data • Many statistical studies involve studying the difference between population means • So reporting the mean may be dictated by the objective of the study

  17. Which Central Tendency Measure? • If data is unimodal and fairly symmetric • Mean is approximately equal to median • Then mean is a reasonable measure of central tendency

  18. Which Central Tendency Measure? • If data is unimodal and asymmetric • Median is better measure of central tendency • May report both median and mean • Difference between mean and median indicative of asymmetry

  19. Asymmetric Distributions • Median better indicator of central tendency for asymmetric distributions • Life expectancy • U.S. males: mean = 80.1, median = 83 • U.S. females: mean = 84.3, median = 87 • Household income • Mean = $51,855, median = $38,885 • .3% account for 12% of income • Net worth • Mean = $282,500, median = $71,600

  20. Which Central Tendency Measure? • If data is not unimodal • Then there is not a central tendency to the data • Neither mean nor median provide good summaries of data set • Analyze data for distinct groups • Identify groups and consider providing summary statistics for each group

  21. Central Tendency and Time Series • Time series data is collected periodically over some time interval • Types of time series • Stationary processes • Data varies around some central value with approximately same variation over time • Nonstationary processes • Data has trend and/or changes in variation over time

  22. Central Tendency and Time Series • Standard mean or median can be used as central tendency for stationary time series • Moving averages can be used to provide a (moving) central tendency value for nonstationary time series • Tends to smooth out random variations in data • Control amount of historical data used in average

  23. Central Tendency and Time Series • Arithmetic moving average • Average of consecutive data points for a specified number of periods
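
A sketch of an arithmetic moving average as described on this slide: each output value is the plain average of `window` consecutive points. The function name and the sample series are hypothetical.

```python
def moving_average(series, window):
    """Arithmetic mean of each run of `window` consecutive points."""
    if window < 1 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]

series = [10, 12, 11, 15, 14, 18, 17, 21]  # hypothetical trending data

short = moving_average(series, 2)  # follows the data closely
long = moving_average(series, 4)   # smoother, but reacts more slowly
print(short)
print(long)
```

A longer window uses more history and smooths more, at the cost of lagging further behind a trend.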

  24. Central Tendency and Time Series • Exercise • Calculate moving averages for data in summary stats.xls, sheet time series • Vary length of averaging interval

  25. Lack of Central Tendency • Central tendency measures can be misleading or non-informative if there is not a “central tendency” in the data • Bi or multi-modal • U-shaped distributions • Uniform distributions • Highly skewed • Heavy tails

  26-30. Lack of Central Tendency • [Figure slides; images not included in transcript]

  31. Limitations of Central Tendency • Any single number summary may not adequately represent data and may hide differences between data sets • Example

  32. Measures of Dispersion • Measures of dispersion provide ways to quantify the amount of variation within a data set • Dispersion measures also provide context to evaluate significance of departures from central tendency • Types of measures • Range • Standard deviation • IQR

  33. Range • Range = max - min

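  34. Standard Deviation • Root mean square difference from the mean • Data: $x_1, x_2, \dots, x_n$ • Calculate mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ • Standard deviation: $s = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2}$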
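
A sketch of the standard deviation computed step by step. It takes "root mean square difference from the mean" literally and divides by n (the population form); that choice is an assumption, since many packages default to the sample form with n - 1, which is shown alongside for comparison. The data set is the small one used earlier for the mean.

```python
import math
import statistics

data = [5, 9, 14, 16]

mean = sum(data) / len(data)
squared_diffs = [(x - mean) ** 2 for x in data]  # squared difference from the mean

sd_population = math.sqrt(sum(squared_diffs) / len(data))    # divide by n
sd_sample = math.sqrt(sum(squared_diffs) / (len(data) - 1))  # divide by n - 1

# The standard library provides both forms.
print(sd_population, statistics.pstdev(data))
print(sd_sample, statistics.stdev(data))
```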

  35. Standard Deviation • Example: two data sets, each with mean m = 100

  36. Standard Deviation • While the form of the standard deviation is not particularly intuitive, many data sets can be characterized using just the mean and SD • If the values of the data set are distributed in an approximately bell shape, then ~68% of the data will be within 1 SD unit of the mean, ~95% will be within 2 SD units, and nearly all will be within 3 SD units
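
The 68/95/nearly-all rule can be checked empirically. The sketch below draws a hypothetical bell-shaped sample (normal with mean 100 and SD 15, both arbitrary choices) and counts the share of points within 1, 2 and 3 SD units of the mean.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable
sample = [rng.gauss(100, 15) for _ in range(10_000)]  # hypothetical bell-shaped data

mean = statistics.fmean(sample)
sd = statistics.pstdev(sample)

def fraction_within(k):
    """Share of the sample lying within k SD units of the mean."""
    return sum(abs(x - mean) <= k * sd for x in sample) / len(sample)

for k in (1, 2, 3):
    print(k, round(fraction_within(k), 3))
```

The printed shares should land near 0.68, 0.95 and 0.997; the rule does not carry over to strongly skewed or multi-modal data.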

  37. SD → Coefficient of Variation • When comparing relative variation between data sets, it is often useful to adjust SD to a common scale • Coefficient of variation adjusts the scale of SD using the mean: $CV = s / \bar{x}$, where $s$ is the SD and $\bar{x}$ is the mean
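
A sketch of the coefficient of variation as SD divided by the mean, assuming the population SD and ratio-scale data with a positive mean. The data are hypothetical: the same measurements expressed in minutes and in seconds.

```python
import statistics

def coefficient_of_variation(values):
    """SD divided by the mean; only meaningful for ratio data with a positive mean."""
    mean = statistics.fmean(values)
    if mean <= 0:
        raise ValueError("coefficient of variation needs a positive mean")
    return statistics.pstdev(values) / mean

minutes = [50, 55, 60, 65, 70]       # hypothetical elapsed times
seconds = [m * 60 for m in minutes]  # same measurements in different units

print(statistics.pstdev(minutes), statistics.pstdev(seconds))  # SD scales with the units
print(coefficient_of_variation(minutes), coefficient_of_variation(seconds))
```

The SD in seconds is 60 times the SD in minutes, while the coefficient of variation is the same for both, which is what makes it usable for comparing data sets on different scales.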

  38. Coefficient of Variation • Example

  39. Standard Deviation • Exercise • Calculate range and standard deviation for data in summary stats.xls, sheet dispersion • Both range and standard deviation are sensitive to outliers • Exercise • Vary first data point of data sets in summary stats.xls, sheet dispersion and note changes in range and standard deviation

  40. Measures of Dispersion • A robust measure of dispersion is the interquartile range (IQR) • The IQR specifies the range over which the middle 50% of the data is spread • Q1 or 25th percentile: value such that 25% of data less than, and 75% greater than • Q3: value such that 75% less than, and 25% greater than • IQR = Q3 - Q1

  41. IQR • Example • With outlier: 1, 98, 99, 100, 100, 100, 102, 102, 104 • Without outlier: 95, 98, 99, 100, 100, 100, 102, 102, 104 • In both cases Q1 = 98.5, Q3 = 102, and IQR = 102 – 98.5 = 3.5 • Like the median, the IQR is less sensitive to outliers since it is based on relative ranking of data points as opposed to their actual values
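
The slide's two data sets can be run through Python's `statistics.quantiles`. Its default "exclusive" method happens to reproduce the slide's quartiles for this data; quartile conventions differ between tools, so other software may report slightly different values.

```python
import statistics

with_outlier = [1, 98, 99, 100, 100, 100, 102, 102, 104]
without_outlier = [95, 98, 99, 100, 100, 100, 102, 102, 104]

def iqr(values):
    """Interquartile range Q3 - Q1 using the default 'exclusive' quartile method."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    return q3 - q1

print(iqr(with_outlier), iqr(without_outlier))  # 3.5 3.5
print(max(with_outlier) - min(with_outlier))    # 103 (range reacts to the outlier)
print(max(without_outlier) - min(without_outlier))  # 9
```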

  42. IQR • Exercise • Calculate IQR for data in summary stats.xls, sheet dispersion • Vary first data point of data sets in summary stats.xls, sheet dispersion and note change (or lack of) in IQR

  43. Dispersion • The more spread out or dispersed the data, the larger the range, SD and IQR • The more concentrated or homogeneous the data, the smaller the range, SD, and IQR • If all the data elements are the same, then the dispersion will be 0 • Note neither range, SD nor IQR can be negative

  44. Grouped Data • Often summary measures are given for groups of data • Then statistics are needed for the data aggregated together • Means • SD’s • Frequencies

  45. Grouped Data • Aggregate mean • Aggregate SD

  46. Grouped Data • Exercise • Suppose average salary of group of 50 employees is $65K with SD of $2K, and average salary of a second group of 30 employees is $85K with SD of $4K • Find mean and SD of salary for entire group of 80 employees
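
One way to work this exercise, as a sketch. The aggregate mean is the size-weighted mean of the group means. For the SD, the code assumes each group SD is a population SD (divide by n) and combines within-group variance with the squared offset of each group mean from the aggregate mean; a sample-SD convention would give a slightly different result. The function name `combine` is hypothetical.

```python
import math

def combine(groups):
    """Combine (n, mean, sd) group summaries, treating each sd as a population SD."""
    total_n = sum(n for n, _, _ in groups)
    mean = sum(n * m for n, m, _ in groups) / total_n
    # Variance of the pooled data = within-group variance
    # plus the squared distance of each group mean from the overall mean.
    variance = sum(n * (sd ** 2 + (m - mean) ** 2) for n, m, sd in groups) / total_n
    return total_n, mean, math.sqrt(variance)

# Salaries in $K: 50 employees at 65 (SD 2) and 30 employees at 85 (SD 4).
n, mean, sd = combine([(50, 65.0, 2.0), (30, 85.0, 4.0)])
print(n, mean, round(sd, 2))  # 80 72.5 10.11
```

The aggregate SD is much larger than either group SD because most of the spread comes from the gap between the two group means.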

  47. Relative Position • An important aspect of data analysis is examining the relative position of individual data points within the entire data set • Standard units • Percentile

  48. Standard Units and Z-scores • Translating a data point into standard units indicates the position of the data relative to the mean with respect to standard deviation units • The z-score of a data point $x$ is given by $z = (x - \bar{x}) / s$, where $\bar{x}$ is the mean and $s$ is the standard deviation

  49. Z-scores • A z-score greater than 0 indicates the data point is greater than the mean • A z-score less than 0 indicates the data point is less than the mean • A z-score equal to 0 indicates the data point is equal to the mean • A z-score between –1 and 1 indicates that the data point is a fairly typical value • A z-score greater than ~ 2 or less than ~ –2 indicates a less typical (unusual) value
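
A sketch of z-scores used to flag less typical values, assuming the population SD and a cutoff of 2 (the slide's approximate threshold). The data are hypothetical and the function name `z_scores` is illustrative.

```python
import statistics

def z_scores(values):
    """Distance of each value from the mean, in SD units (population SD assumed)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # zero if all values are identical, so guard in real use
    return [(x - mean) / sd for x in values]

data = [95, 98, 99, 100, 100, 100, 102, 102, 104, 130]  # hypothetical measurements

scores = z_scores(data)
unusual = [x for x, z in zip(data, scores) if abs(z) > 2]

for x, z in zip(data, scores):
    print(x, round(z, 2))
print(unusual)  # [130]
```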

  50. Typical Z-Scores
