
Stor 155, Section 2, Last Time


horace

Presentation Transcript


  1. Stor 155, Section 2, Last Time • Distributions (how are data “spread out”?) • Visual Display: Histograms • Binwidth is critical • Time Plots = Time Series • Course Organization & Website http://stat-or.unc.edu/webspace/postscript/marron/Teaching/stor155-2007/Stor155-07Home.html

  2. Reading In Textbook Approximate Reading for Today’s Material: Pages 40-55 Approximate Reading for Next Class: Pages 64-83

  3. And now for something completely different Is this class too “monotone”? • Easier to understand? • Calm environment enhances learning? • Or does it induce somnolence? What is “somnolence”? Google definition: Sleepiness, a condition of semiconsciousness approaching coma.

  4. And now for something completely different An experiment: • Pull out any coins you have with you • How many of you have: • >= 1 penny? • >= 1 nickel? • >= 1 dime? • >= 1 quarter? • Choose most frequent denomination

  5. And now for something completely different Collect data (into Spreadsheet): • Years stamped on coins (chosen denomination) • Many as person has • Enter into spreadsheet • Look at “distribution” using histogram

  6. And now for something completely different • Predicted Answer • From Text Book, Problem 1.32 • Distribution is Left Skewed • Works out as predicted? • Why? • Note: most skewed dist’ns seem to be: Right Skewed

  7. Exploratory Data Analysis 4 Numerical Summaries of Quant. Variables: Idea: Summarize distributional information (“center”, “spread”, “skewed”) In Text, Sec. 1.2 for data $x_1, x_2, \ldots, x_n$ (subscripts allow “indexing numbers” in list)

  8. Numerical Summaries • “Centers” (note there are several) • “Mean” = Average = $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$ • Greek letter “Sigma” ($\Sigma$), for “sum” In EXCEL, use “AVERAGE” function

  9. Numerical Summaries of Center • “Median” = Value in middle (of sorted list) Unsorted E.g.: 3, 1, 27, 2, 0 (27 “in middle”? no) Sorted E.g.: 0, 1, 2, 3, 27 (2 is a better “middle”!) EXCEL: use function “MEDIAN”
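
A minimal Python sketch of the two “centers”, using the five toy values from this slide. The choice of Python's standard `statistics` module in place of EXCEL's AVERAGE and MEDIAN is an assumption made here for illustration, and the variable names are hypothetical.

```python
from statistics import mean, median

data = [3, 1, 27, 2, 0]            # unsorted toy list from the slide

toy_mean = mean(data)              # sum of the values divided by their count
toy_median = median(data)          # middle value of the sorted list

# With an even number of values, the median averages the two center points
even_median = median([0, 1, 2, 3])

print(sorted(data), toy_mean, toy_median, even_median)
```

Here the median (2) sits in the bulk of the data, while the mean (6.6) is pulled upward by the value 27.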

  10. Difference Betw’n Mean & Median Symmetric Distribution: Essentially no difference Right Skewed Distribution: median $M$ splits the area 50% / 50%; mean $\bar{x}$ is bigger since it “feels tails more strongly”

  11. Difference Betw’n Mean & Median Outliers (unusual values): Simple Web Example: http://www.stat.sc.edu/~west/applets/box.html • Mean feels outliers much more strongly • Leaves “range of most of data” • Good notion of “center”? (perhaps not) • Median affected very minimally • Robustness Terminology: Median is “resistant to the effect of outliers”
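
A small sketch of the robustness point, assuming two hypothetical five-value data sets (they are not from the applet or the textbook): replacing the largest value by an outlier moves the mean a lot and the median not at all.

```python
from statistics import mean, median

clean = [1, 2, 3, 4, 5]             # hypothetical symmetric data
with_outlier = [1, 2, 3, 4, 100]    # same data, largest value replaced by an outlier

# How far each "center" moves when the outlier is introduced
mean_shift = mean(with_outlier) - mean(clean)
median_shift = median(with_outlier) - median(clean)

print("mean moves by", mean_shift)
print("median moves by", median_shift)
```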

  12. Difference Betw’n Mean & Median A richer web example: Publisher’s Web Site: Statistical Applets: Mean & Median • For Symmetric distributions: • Both are same • Add an outlier: • Mean feels it much more strongly • Implication for “bad data”: can be very bad • Two Clusters: • Median jumps more quickly • Mean more stable (better?)

  13. Computation using Excel Some Toy Examples: http://stat-or.unc.edu/webspace/postscript/marron/Teaching/stor155-2007/Stor155Eg3Done.xls • Compute Using Excel Functions • Mean feels location of data on number line • Median feels location of data in sorted list • Median breaks tie by averaging center points

  14. Numerical Centerpoint HW HW: 1.46 a, 1.47, 1.49 • Use EXCEL

  15. And now for something completely different Check out this small quick movie clip:

  16. And now for something completely different Suggestions for other things to show here are very welcome…. • Movie Clips… • Music… • Jokes… • Cartoons… • …

  17. Numerical Summaries (cont.) • “Spreads” (again there are several) 1. Range = biggest - smallest range Problems: • Feels only “outliers” • Not “bulk of data” • Very non-resistant to outliers
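
A short sketch of why the range “feels only outliers”, again with hypothetical toy data; `data_range` is an illustrative helper name, not an EXCEL function.

```python
def data_range(values):
    """Range = biggest value minus smallest value."""
    return max(values) - min(values)

bulk = [1, 2, 3, 4, 5]              # hypothetical data
with_outlier = [1, 2, 3, 4, 100]    # one value replaced by an outlier

range_bulk = data_range(bulk)
range_outlier = data_range(with_outlier)

# A single unusual value changes the range completely
print(range_bulk, range_outlier)
```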

  18. Numerical Summaries of Spread • Variance = $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$ = “average squared distance to $\bar{x}$” EXCEL: VAR Drawback: units are wrong, e.g. for $x_i$ in feet, $s^2$ is in square feet

  19. Numerical Summaries of Spread • Standard Deviation $s = \sqrt{s^2}$ (square root of the variance) EXCEL: STDEV • Scale is right • But not resistant to outliers • Will use quite a lot later (for reasons described later)
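
A sketch of the variance and standard deviation computed two ways, assuming the sample versions with divisor n - 1 (the convention of EXCEL's VAR and STDEV). The toy data and variable names are hypothetical.

```python
from math import sqrt
from statistics import mean, variance, stdev

data = [1, 2, 3, 4, 5]              # hypothetical toy data
n = len(data)
xbar = mean(data)

# By hand: sum of squared distances to the mean, divided by n - 1
var_by_hand = sum((x - xbar) ** 2 for x in data) / (n - 1)
sd_by_hand = sqrt(var_by_hand)

# Library versions (sample variance and sample standard deviation)
var_lib = variance(data)
sd_lib = stdev(data)

print(var_by_hand, var_lib)
print(sd_by_hand, sd_lib)
```

The variance is in squared units of the data; taking the square root puts the standard deviation back on the original scale.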

  20. Interactive View of S. D. Interesting web example (manipulate histogram): http://www.ruf.rice.edu/~lane/stat_sim/descriptive/index.html • Note SD range centered at mean • Can put SD “right near middle” (densely packed data) • Can put SD at “edges of data” (U shaped data) • Can put SD “outside of data” (big spike + outlier) • But generally “sensible measure of spread”

  21. Variance – S. D. HW C3: For the data set in 1.46 (i.e. 1.37), find the: • Variance (1620) • Standard Deviation (40.2) • Use EXCEL

  22. Numerical Summaries of Spread • Interquartile Range = IQR Based on “quartiles”, Q1 and Q3 (idea: shows where are 25% & 75% “through the data”) 25% 25% 25% 25% Q1 Q2 = median Q3 IQR = Q3 – Q1
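
A minimal sketch of quartiles and the IQR, using `statistics.quantiles` (Python 3.8+) on hypothetical data. The `"inclusive"` method interpolates linearly between sorted data points; treating it as comparable to EXCEL's QUARTILE is an assumption here, and hand rules from a textbook can give slightly different values on small data sets.

```python
from statistics import quantiles

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]   # hypothetical toy data

# Three cut points that split the sorted data into four parts of about 25% each
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1                         # IQR = Q3 - Q1

print(q1, q2, q3, iqr)
```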

  23. Quartiles Example Revisit Hidalgo Stamp Thickness example: http://stat-or.unc.edu/webspace/postscript/marron/Teaching/stor155-2007/Stor155Eg6Done.xls Right skewness gives: • Median < Mean (mean “feels farther points more strongly”) • Q1 near median • Q3 quite far (makes sense from histogram)

  24. Quartiles Example A look under the hood: http://stat-or.unc.edu/webspace/postscript/marron/Teaching/stor155-2007/Stor155Eg6Raw.xls • Can compute as separate functions for each • Or use: Tools → Data Analysis → Descriptive Stats • Which gives many other measures as well • Use “k-th largest & smallest” to get quartiles

  25. 5 Number Summary • Minimum • Q1 - 1st Quartile • Median • Q3 - 3rd Quartile • Maximum Summarize Information About: • Center - from 3 • Spread - from 2 & 4 (maybe 1 & 5) • Skewness - from 2, 3 & 4 • Outliers - from 1 & 5

  26. 5 Number Summary How to Compute? http://stat-or.unc.edu/webspace/postscript/marron/Teaching/stor155-2007/Stor155Eg6Done.xls • EXCEL function QUARTILE • “One stop shopping” • IQR seems to need explicit calculation
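
A sketch of “one stop shopping” for the 5 number summary outside EXCEL. The helper `five_number_summary` and its input list are hypothetical, and the quartile method (`"inclusive"`, linear interpolation) is an assumption that may differ slightly from a textbook hand rule.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    return (min(values), q1, med, q3, max(values))

# Input does not need to be sorted
summary = five_number_summary([7, 1, 9, 3, 5, 2, 8, 4, 6])
print(summary)
```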

  27. Rule for Defining “Outliers” Caution: There are many of these Textbook version: Above Q3 + 1.5 * IQR Below Q1 – 1.5 * IQR For stamps data: http://stat-or.unc.edu/webspace/postscript/marron/Teaching/stor155-2007/Stor155Eg6Done.xls • No outliers at “low end” • Some at “high end”
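
The textbook rule above can be sketched as follows. The helper name `find_outliers` and the data are hypothetical (this is not the stamps data), and the quartiles use the `"inclusive"` interpolation method as an assumption.

```python
from statistics import quantiles

def find_outliers(values):
    """Flag values above Q3 + 1.5 * IQR or below Q1 - 1.5 * IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [v for v in values if v < low_fence or v > high_fence]

# Hypothetical data with one unusually large value
outliers = find_outliers([1, 2, 3, 4, 5, 6, 7, 8, 30])
print(outliers)
```

For this list Q1 = 3 and Q3 = 7, so the fences are -3 and 13, and only 30 is flagged.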

  28. 5 Number Sum. & Outliers HW 1.43

  29. Box Plot • Additional Visual Display Device • Again legacy from pencil & paper days • Not supported in EXCEL • So we won’t do • Main use: comparing populations • Example: Figure from text

  30. Box Plot

  31. Box Plot • Main use: comparing populations • Example: Figure from text • Want to do this? Find better software package than Excel

  32. And now for something completely different Recall Distribution of majors of students in this course:

  33. And now for something completely different How about a business manager joke? How many managers does it take to replace a light bulb?

  34. And now for something completely different How about a business manager joke? How many managers does it take to replace a light bulb? Two. One to find out if it needs changing, and one to tell an employee to change it. Source: http://www.joblatino.com/jokes/managers.html

  35. Linear Transformations Idea: What happens to data & summaries, when data are: “shifted and scaled” i.e. “panned and zoomed” Math: $y_i = a x_i + b$ (scaled by $a$, shifted by $b$)

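  36. Linear Transformations Effect on linear summaries, for $y_i = a x_i + b$: • Centerpoints, $\bar{x}$ and $M$, “follow data”: $\bar{y} = a \bar{x} + b$, $M_y = a M_x + b$. • Spreads, $s$ and IQR, “feel scale, not shift”: $s_y = |a| s_x$, $\mathrm{IQR}_y = |a| \, \mathrm{IQR}_x$.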
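
A numerical check of these effects, assuming a transformation y = a*x + b with hypothetical values a = -2 and b = 10 on toy data. The negative scale factor is chosen to show that spreads pick up |a| rather than a.

```python
from math import isclose
from statistics import mean, median, stdev

x = [1, 2, 3, 4, 5]                 # hypothetical toy data
a, b = -2, 10                       # hypothetical scale and shift
y = [a * xi + b for xi in x]

# Centerpoints follow the data: same scale and shift
mean_follows = isclose(mean(y), a * mean(x) + b)
median_follows = isclose(median(y), a * median(x) + b)

# Spread feels the scale (in absolute value) but not the shift
sd_scales = isclose(stdev(y), abs(a) * stdev(x))

print(mean_follows, median_follows, sd_scales)
```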

  37. Most Useful Linear Transfo. “Standardization” Goal: put data sets on “common scale” Approach: • Subtract Mean $\bar{x}$, to “center at 0” • Divide by S.D. $s$, to “give common SD = 1”

  38. Standardization Result is called “z-score”: $z_i = \frac{x_i - \bar{x}}{s}$ Note that $x_i = \bar{x} + z_i s$ Thus $z_i$ is interpreted as: “number of SDs from the mean”
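
A minimal sketch of the z-score, assuming the sample standard deviation (divisor n - 1) and hypothetical toy data; `z_scores` is an illustrative helper name.

```python
from math import isclose
from statistics import mean, stdev

def z_scores(values):
    """Subtract the mean and divide by the SD: number of SDs from the mean."""
    xbar = mean(values)
    s = stdev(values)
    return [(v - xbar) / s for v in values]

data = [1, 2, 3, 4, 5]              # hypothetical toy data
z = z_scores(data)

# Each original value is recovered as mean + z * SD
recovered = [mean(data) + zi * stdev(data) for zi in z]
round_trip_ok = all(isclose(r, v) for r, v in zip(recovered, data))

print(z, round_trip_ok)
```

The value equal to the mean (here 3) gets z-score 0, and values below the mean get negative z-scores.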

  39. Standardization Example Next time: work in Excel command: STANDARDIZE

  40. Standardization Example Buffalo Snowfall Data: http://stat-or.unc.edu/webspace/postscript/marron/Teaching/stor155-2007/Stor155Eg7Done.xls • Standardized data have same (EXCEL default) histogram shape as raw data. (Since axes and bin edges just follow the transformation) • i.e. “shape” doesn’t depend on “scaling”

  41. Standardization Example A look under the hood: http://stat-or.unc.edu/webspace/postscript/marron/Teaching/stor155-2007/Stor155Eg7Raw.xls 1. Compute AVERAGE and SD 2. Standardize by: • Create Formula in cell B2 • Drag downwards • Keep Mean and SD cells fixed using $s 3. Check stand’d data have mean 0 & SD 1 (note that “8.247E-16 = 0”, i.e. zero up to rounding error)
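
The three steps above can be sketched outside EXCEL as follows. The numbers are hypothetical stand-ins, not the Buffalo snowfall data, and the tolerance in the check is an assumption chosen to absorb floating-point round-off of the “8.247E-16” kind.

```python
from math import isclose
from statistics import mean, stdev

raw = [25.0, 40.5, 53.5, 69.3, 78.1, 110.5]   # hypothetical stand-in data

# Step 1: compute the mean and the SD once (the "fixed cells")
xbar = mean(raw)
s = stdev(raw)

# Step 2: apply the same formula to every value (the "dragged" formula)
standardized = [(x - xbar) / s for x in raw]

# Step 3: check mean 0 and SD 1, allowing for tiny rounding error
mean_is_zero = isclose(mean(standardized), 0.0, abs_tol=1e-9)
sd_is_one = isclose(stdev(standardized), 1.0, rel_tol=1e-9)

print(mean(standardized), stdev(standardized))
```

The printed mean may show a tiny nonzero value rather than exactly 0, which is why the check uses a tolerance instead of strict equality.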

  42. Standardization HW C4: For data in 1.17, use EXCEL to: a. Give the list of standardized scores b. Give the Z-score for: (i) the mean (0) (ii) the median (-0.223) (iii) the smallest (-1.21) (iv) the largest (2.77) 1.59a, 1.73
