
Descriptive Statistics

Descriptive Statistics. Statistics. Faculty of Information Technology King Mongkut’s University of Technology North Bangkok. Content. Data Preparation Data Presentation Descriptive Statistics. Data Preparation. Data checking for accuracy Data cleaning


Presentation Transcript


  1. Descriptive Statistics Statistics Faculty of Information Technology King Mongkut’s University of Technology North Bangkok

  2. Content • Data Preparation • Data Presentation • Descriptive Statistics

  3. Data Preparation • Data checking for accuracy • Data cleaning • Removal of inaccurate data, errors, and outliers • Dealing with missing data • Data transformation • Application of a deterministic mathematical function to each point in a data set • The function used to transform the data is invertible and generally continuous

  4. Data Transformation • To comply with the requirements of statistical analysis • For better understanding of graphs • Ease of interpretation of data • Common methods • The logarithm and square root transformations are commonly used for positive data • The multiplicative inverse (reciprocal) transformation can be used for non-zero data

  5. Example • Populations • See http://en.wikipedia.org/wiki/Data_transformation_(statistics) • Fuel consumption • Kilometers per litre • 10 km/l • Reciprocal: litres per 100 kilometers • 10 l/100 km • Why?
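
The fuel consumption example is a reciprocal transformation rescaled to 100 km, and the logarithm is the usual choice for positive data such as populations. A minimal Python sketch follows; the function names are illustrative and do not come from the slides.

```python
import math

def km_per_litre_to_litres_per_100km(km_per_l: float) -> float:
    """Reciprocal transformation, rescaled to a distance of 100 km."""
    if km_per_l == 0:
        raise ValueError("the reciprocal is undefined for zero")
    return 100.0 / km_per_l

def log10_transform(values):
    """Base-10 logarithm of each value; only valid for strictly positive data."""
    return [math.log10(v) for v in values]

print(km_per_litre_to_litres_per_100km(10))   # 10.0, the slide's 10 l/100 km
print(log10_transform([1, 10, 100, 1000]))    # evenly spaced after the transform
```

Both functions are invertible on their valid domain, which is the property the Data Preparation slide asks of a transformation.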

  6. Data Presentation • Text • Table • Graphical • Pictograph • Bar Chart • Pie Chart • Line Chart • Histogram • Stem and Leaf • Scatter Plot • Box Plot • What is the difference between Bar Chart and Histogram?

  7. Normal Curve and Skewed Curves • Positive Skewed Curve • Normal or Symmetrical Curve • Negative Skewed Curve

  8. J-Shaped Curve • U-Shaped Curve • Multimodal Curve • J-Reversed Shaped Curve • Bimodal Curve

  9. Cumulative Frequency Curve • Stem and Leaf • Scatter Plot

  10. Box Plot • Shows data distribution and skewness • Right/Positive Skewed • Left/Negative Skewed • Normal

  11. Descriptive Statistic • A descriptive statistic is a numerical index that describes or summarizes some characteristic of a frequency or relative frequency distribution. (Frank & Althoen, “Statistics: Concepts and applications”, 1994)

  12. Descriptive Statistics • Frequency distribution table • Describe • Location of distribution – Mode, Median, Mean • Dispersion of distribution – Range, SD, Variance • Shape of distribution – Skewness, Kurtosis • Individuals in distributions – Percentile, Decile, Quartile • Joint distributions of data • Scatter Diagram • Correlation Coefficient • Linear Regression

  13. Frequency Distribution • Grouped • Ungrouped • Raw data: 42, 45, 82, 32, 91, 76, 55, 58, 55, 62, 60, … • Can be visualized using graphs and charts • Determining number of intervals • k = 1 + 3.3 log10(N), where N is the number of data entries • Interval width = Range / k
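
The interval rule on this slide can be sketched in Python. The sample uses only the eleven raw values the slide actually lists (the elided remainder is not filled in), and rounding k up to a whole number is an assumption, since the slide does not say how to round.

```python
import math

def number_of_intervals(n: int) -> float:
    """k = 1 + 3.3 * log10(N), as given on the slide (unrounded)."""
    return 1 + 3.3 * math.log10(n)

def interval_width(values, k: int) -> float:
    """Interval width = range / k."""
    return (max(values) - min(values)) / k

# Only the values visible on the slide; the real data set is longer.
sample = [42, 45, 82, 32, 91, 76, 55, 58, 55, 62, 60]

k = math.ceil(number_of_intervals(len(sample)))  # rounding up is an assumption
print(k, interval_width(sample, k))
```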

  14. Frequency Distribution Table • One-way • One variable – often used with percentage • Two-way • Two variables – shows rough relation between two variables • Etc.

  15. Describing Location of distribution • Mode • The value with the highest frequency • Applicable to nominal-scale data (and higher scales) • A data set can have more than one mode • fx : MODE
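
For readers working outside Excel, a small Python sketch of the mode using the standard library's `statistics.multimode` (Python 3.8+), which returns every value tied for the highest frequency. The colour labels are hypothetical.

```python
from statistics import multimode

# multimode returns all values with the highest frequency,
# in the order they first appear in the data.
scores = [42, 45, 82, 32, 91, 76, 55, 58, 55, 62, 60]
print(multimode(scores))        # [55]

two_modes = [1, 2, 2, 3, 3]
print(multimode(two_modes))     # [2, 3]

# The mode also applies to nominal data such as category labels.
colours = ["red", "blue", "red", "green"]   # hypothetical labels
print(multimode(colours))       # ['red']
```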

  16. Arithmetic Mean • Considered the best of the three measures • Sum of all values divided by the total frequency (number of entries) • Can be affected by extreme values • Changing the value of any entry also changes the mean • Adding / subtracting a constant to / from all entries changes the mean by the same constant • Multiplying / dividing all entries by a constant multiplies / divides the mean by the same constant • Sum of the differences between each entry and the mean is always zero • For grouped data, use the sum of the products of each interval's midpoint and frequency, divided by the total frequency • fx : AVERAGE
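
The properties of the mean listed on this slide can be checked numerically. The sketch below uses illustrative values and hypothetical class intervals; none of the numbers come from the slides.

```python
from math import isclose
from statistics import fmean

data = [42, 45, 82, 32, 91]      # illustrative values
mean = fmean(data)               # 292 / 5

# Shifting every entry by a constant shifts the mean by that constant.
mean_shifted = fmean([x + 10 for x in data])
# Scaling every entry by a constant scales the mean by that constant.
mean_scaled = fmean([x * 2 for x in data])
# Deviations from the mean sum to zero (up to floating-point rounding).
deviation_sum = sum(x - mean for x in data)

def grouped_mean(midpoints, frequencies):
    """Mean of grouped data: sum(midpoint * frequency) / total frequency."""
    total = sum(frequencies)
    return sum(m * f for m, f in zip(midpoints, frequencies)) / total

# Hypothetical intervals 31-40, 41-50, 51-60 with midpoints 35.5, 45.5, 55.5.
print(grouped_mean([35.5, 45.5, 55.5], [2, 5, 3]))
```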

  17. Median • Better for data with extreme values • Ungrouped data • The value in the middle of the distribution after sorting • N is odd: the value at position (N+1) / 2 • N is even: the average of the two middle values, at positions N/2 and N/2 + 1 • fx : MEDIAN • Grouped data • See percentile
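
The position rule for ungrouped data can be written out directly. This is a sketch that follows the slide's odd/even cases step by step; in practice `statistics.median` gives the same result.

```python
from statistics import median

def median_by_position(values):
    """Median of ungrouped data using the slide's position rule."""
    if not values:
        raise ValueError("median of an empty data set is undefined")
    s = sorted(values)
    n = len(s)
    if n % 2 == 1:
        # Odd N: the value at 1-based position (N + 1) / 2.
        return s[(n + 1) // 2 - 1]
    # Even N: average of the values at 1-based positions N/2 and N/2 + 1.
    return (s[n // 2 - 1] + s[n // 2]) / 2

print(median_by_position([33, 18, 31, 29, 32]))   # 31
print(median_by_position([18, 29, 31, 32]))       # 30.0
```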

  18. Describing Dispersion • Range • Ungrouped: Max – Min (fx MAX – fx MIN) • Grouped: true upper bound of the highest interval – true lower bound of the lowest interval • The true upper bound is the average of the interval's upper bound and the (expected) lower bound of the next higher interval • The true lower bound is the average of the interval's lower bound and the (expected) upper bound of the next lower interval

  19. Standard Deviation & Variance • Square root of the average squared difference between each entry and the arithmetic mean (a higher value means the data are more dispersed) • Standard Deviation (or SD, S.D., S) is the most popular measure of dispersion • [Formulas: two equivalent forms, each given separately for N >= 30 and N < 30]
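
The formula images on this slide did not survive in the transcript. The block below is a reconstruction under an assumption: that the N >= 30 and N < 30 labels follow the common textbook convention of dividing by N for large data sets and by N - 1 for small ones. The first form in each pair is the definition; the second is the algebraically equivalent computational form.

```latex
% Assumed convention: divide by N when N >= 30
SD = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N}}
   = \sqrt{\frac{N \sum x_i^2 - \left(\sum x_i\right)^2}{N^2}}

% Assumed convention: divide by N - 1 when N < 30
SD = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N - 1}}
   = \sqrt{\frac{N \sum x_i^2 - \left(\sum x_i\right)^2}{N (N - 1)}}
```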

  20. Standard Deviation & Variance • SD is always >= 0 • An SD of 0 means that all data entries have the same value • Adding / subtracting a constant to / from all entries does not affect SD • Multiplying / dividing all entries by a value m multiplies / divides SD by the absolute value of m • Variance is equal to SD² • Only the positive square root is used for SD • fx : STDEV and VARA
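
These properties can be verified with Python's standard library. In this sketch `statistics.stdev` divides by n - 1 (the sample form) and `statistics.pstdev` divides by n; the data values are reused from the percentile slides purely for illustration.

```python
from math import isclose
from statistics import pstdev, stdev, variance

data = [18, 29, 31, 32, 33]

sd = stdev(data)                                  # sample SD, divides by n - 1
sd_shifted = stdev([x + 100 for x in data])       # add a constant to every entry
sd_scaled = stdev([x * -3 for x in data])         # multiply every entry by m = -3

print(isclose(sd_shifted, sd))                    # True: shifting leaves SD unchanged
print(isclose(sd_scaled, abs(-3) * sd))           # True: SD scales by |m|
print(isclose(variance(data), sd ** 2))           # True: variance is SD squared
print(pstdev(data) < sd)                          # True: dividing by n gives a smaller value
```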

  21. Shape of Distribution • Skewness • 0 means there is no skewness (a symmetric distribution, e.g. the normal distribution) • Positive value means positive/right skewed • Negative value means negative/left skewed • Calculation? • Just use Excel or SPSS • fx : SKEW

  22. Shape of Distribution • Kurtosis • 0 means the same peakedness as a normal distribution • Positive value means more peaked (less dispersed) • Negative value means less peaked (more dispersed) • Calculation? • Just use Excel or SPSS • fx : KURT
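
For readers who want to see what the spreadsheet functions compute, here is a Python sketch of the bias-adjusted sample skewness and excess kurtosis. These are, to the best of my knowledge, the formulas documented for Excel's SKEW and KURT; treat that correspondence as an assumption and check it against your own Excel version. The function names and test data are illustrative.

```python
from statistics import fmean, stdev

def sample_skewness(values):
    """Bias-adjusted sample skewness; 0 for perfectly symmetric data."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least 3 values")
    m, s = fmean(values), stdev(values)   # stdev divides by n - 1
    return n / ((n - 1) * (n - 2)) * sum(((x - m) / s) ** 3 for x in values)

def sample_excess_kurtosis(values):
    """Bias-adjusted sample excess kurtosis; about 0 for normal data."""
    n = len(values)
    if n < 4:
        raise ValueError("need at least 4 values")
    m, s = fmean(values), stdev(values)
    fourth = sum(((x - m) / s) ** 4 for x in values)
    return (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * fourth
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))

data = [3, 4, 5, 2, 3, 4, 5, 6, 4, 7]     # illustrative values
print(sample_skewness(data))              # positive: slightly right skewed
print(sample_excess_kurtosis(data))       # negative: flatter than normal
```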

  23. Describing Individuals in Distributions • Percentile • Quartile • Decile • Performed on data sorted in ascending order • Divide the data into 100, 4, or 10 parts respectively and identify the value at the desired position

  24. Percentile Rank • “The percentile rank of any particular score x is the percentage of observations equal to or less than x” • Divide sorted data set into 100 parts • “cent” = 100, thus “per” “cent” = /100 • Percentile rank of entry x_i = 100 * (cumulative frequency_i / N) • e.g. 18, 29, 31, 32, 33 • Percentile rank of 31 = 100 * (3/5) = 60 • Be careful! • Percentile rank determines the rank from a data value • Excel uses 0.00 – 1.00 for fx: PERCENTRANK
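
The definition on this slide translates directly into code. A minimal Python sketch, with an illustrative function name:

```python
def percentile_rank(values, x):
    """Percentage of observations equal to or less than x."""
    if not values:
        raise ValueError("percentile rank of an empty data set is undefined")
    at_or_below = sum(1 for v in values if v <= x)   # cumulative frequency of x
    return 100 * at_or_below / len(values)

print(percentile_rank([18, 29, 31, 32, 33], 31))   # 60.0, as on the slide
```

Note that this follows the slide's definition; spreadsheet functions may use a different convention and return a different value for the same data.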

  25. Percentile • “The kth percentile is the x-value at or below which fall k percent of observations” • Roughly • Position of the data entry at the kth percentile = k(n+1)/100 • e.g. 18, 29, 31, 32, 33 • 80th percentile: position = (80/100)(5+1) = 4.8 ≈ 5th position • Be careful! • Percentile determines the data value from a percentile rank • Excel uses 0.00 – 1.00 for fx: PERCENTILE

  26. Quartile • The kth quartile is the x-value at or below which fall k quarters of observations • Roughly • Position of the data entry at the kth quartile = k(n+1)/4 • e.g. 18, 29, 31, 32, 33 • 3rd quartile: position = (3/4)(5+1) = 4.5, i.e. between the 4th and 5th positions • fx: QUARTILE

  27. Decile • The kth decile is the x-value at or below which fall k tenths of observations • Roughly • Position of the data entry at the kth decile = k(n+1)/10 • e.g. 18, 29, 31, 32, 33 • 5th decile: position = (5/10)(5+1) = 3rd position • Excel does not have a direct decile function • Use fx: PERCENTILE with 0.1, 0.2, 0.3, … instead
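
Percentiles, quartiles, and deciles all use the same position rule, k(n+1)/parts, with parts equal to 100, 4, or 10. The Python sketch below computes that position and then reads off a value. Linear interpolation between neighbouring entries is an assumption here; the slides round 4.8 to the 5th position instead, and other software may use yet another convention.

```python
def position(k, n, parts):
    """1-based position of the k-th percentile (parts=100), quartile (4) or decile (10)."""
    return k * (n + 1) / parts

def value_at(sorted_values, pos):
    """Value at a fractional 1-based position, by linear interpolation (an assumption)."""
    n = len(sorted_values)
    pos = min(max(pos, 1), n)          # clamp to the valid range of positions
    lower = int(pos)                   # position of the entry just below
    frac = pos - lower
    if frac == 0 or lower >= n:
        return sorted_values[lower - 1]
    below, above = sorted_values[lower - 1], sorted_values[lower]
    return below + frac * (above - below)

data = [18, 29, 31, 32, 33]
print(position(80, len(data), 100))                   # 4.8
print(position(3, len(data), 4))                      # 4.5
print(value_at(data, position(3, len(data), 4)))      # 32.5, halfway between 32 and 33
print(value_at(data, position(5, len(data), 10)))     # 31, the 3rd entry
```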

  28. Percentile for Grouped Data • P = L + I{(n*r/100) - Σf} / fr • r: The percentile • P: Data value at given percentile r • L: True lower bound of the interval in which percentile r falls • I: Interval width • n: Number of data entries • Σf: Cumulative frequency of intervals below L • fr: Frequency of the L interval • Determine the interval in which the percentile falls using (n*r)/100

  29. Example • 60th percentile • n = 72, thus P60 is at around (60/100)*72 = 43.2, i.e. about the 43rd entry, which falls in the interval 61 – 70 • Thus • P60 = 60.5 + (10{(60/100*72) - 36}) / 17 = 64.74
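
The P60 calculation can be reproduced in Python. The parameter names mirror the symbols defined on the previous slide (L, I, n, Σf, fr); the function name is illustrative, and the frequency table itself is not in the transcript, so only the numbers quoted in the example are used.

```python
def grouped_percentile(r, n, lower_bound, width, cum_freq_below, freq):
    """P = L + I * ((n * r / 100) - cumulative frequency below L) / interval frequency."""
    return lower_bound + width * ((n * r / 100) - cum_freq_below) / freq

# Values quoted in the example: n = 72, L = 60.5, I = 10, cumulative f = 36, fr = 17.
p60 = grouped_percentile(r=60, n=72, lower_bound=60.5, width=10,
                         cum_freq_below=36, freq=17)
print(round(p60, 2))   # 64.74
```

Setting r = 50 in the same function gives the median of grouped data.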

  30. Median

  31. Joint Distribution of Data • Scatter Diagram • [Figure: three scatter diagrams labelled Negatively related, Not related, and Positively related, with imaginary lines showing the relation]

  32. Correlation Coefficient • Pearson Product-Moment Correlation Coefficient • Denoted as rxy or r • Measures the linear correlation between two data sets • Takes values from -1 to 1 • Value of 1: the two data sets have a perfect positive linear relation • Value of -1: the two data sets have a perfect negative linear relation • Value of 0: the two data sets have no linear relation

  33. Correlation Coefficient • Formula • fx: PEARSON (do not use in MS Excel earlier than 2003) • fx: CORREL
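
The formula shown on this slide is not preserved in the transcript. The Python sketch below uses the standard computational form of the Pearson coefficient, which may differ in layout from what the slide displayed; the function name and test data are illustrative.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson product-moment correlation, computational (sums) form."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences with at least 2 points")
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))   # 1.0, perfect positive relation
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))   # -1.0, perfect negative relation
```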

  34. Correlation for Ordinal Scale • Spearman Rank Correlation Coefficient • Two variables • Kendall’s Tau Rank Correlation Coefficient • Also two variables; for three or more variables, Kendall’s coefficient of concordance (W) is used

  35. Linear Regression • Describes the relation between two interval-scale variables in the form of a regression equation • y = bx + a (straight line) • y = a + bx + cx² (parabola) • y = ab^x (exponential) • x: independent variable • y: dependent variable • a: Y-intercept (where the line crosses the Y axis) • b: Slope

  36. Simple Linear Regression • Find b then a • Then write the equation • y = bx + a • E.g. b = 31.4, a = 4.52 • y = 31.4x + 4.52

  37. Example • Table shows the time (in minutes) each student spends reading for the exam and his/her score • b = {10(45885) – (1035)(413)} / {10(123375) – (1035)²} • = 31395 / 162525 = 0.1932 • a = 41.3 – (0.1932)(103.5) = 21.3038 • y = 0.1932x + 21.3038 • Meaning • Each additional minute of reading increases the predicted score by 0.1932 marks • If you don’t read at all, the predicted score is 21.3038 marks
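
The arithmetic in this example can be reproduced from the five sums it quotes (n = 10, Σx = 1035, Σy = 413, Σxy = 45885, Σx² = 123375). The underlying table is not in the transcript, so the Python sketch below works from the sums only; the function name is illustrative.

```python
def slope_and_intercept(n, sum_x, sum_y, sum_xy, sum_x2):
    """Least-squares slope b and intercept a for y = bx + a, from summary sums."""
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * (sum_x / n)      # a = mean(y) - b * mean(x)
    return b, a

b, a = slope_and_intercept(n=10, sum_x=1035, sum_y=413,
                           sum_xy=45885, sum_x2=123375)
print(round(b, 4))   # 0.1932
print(round(a, 4))   # 21.3069
```

The intercept differs slightly from the slide's 21.3038 because the slide computes a from b already rounded to 0.1932, while the code keeps full precision.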

  38. Multiple Linear Regression • More than one independent variable • Equation Y = a + b1x1 + b2x2 + b3x3… • Requirements • Normal distribution • No multicollinearity (independent variables do not depend on each other) • Selecting independent variables • All Entry – when you are not sure which variables have an effect • Stepwise – only use variables tested to be significant

  39. [Figure: annotated regression output] • How much of the dependent variable can be explained by the independent variable • Simple correlation • Is the model good (significant)? (yes, Sig. < 0.05) • Coefficient labels: a, b, b1, b2

  40. Summary
