
Unit 2: Some Basics


lbriggs


Presentation Transcript


  1. Unit 2: Some Basics

  2. The whole vs. the part • population vs. sample • means (avgs) and “std devs” [defined later] of these are denoted by different letters • description vs. inference • mean of a population describes it (partly) • mean of a sample gives a basis to infer (guess) mean of the population

  3. Computers are ubiquitous in stats • Warning: We’ll compute with Excel because you can see the data and it’s available all over campus... • but its stats results are unreliable (see next slide) • In the real world, use a statistical program (SAS, SPSS, ...)

  4. Data Types • categorical vs. numerical (vs. ordinal, …) • ex of categorical: red, blue, green • ex of numerical: person’s height in inches • (ex of ordinal: strongly agree, agree, neutral, ...) • [numerical: discrete vs. continuous] Types of Statistics • graphical • numerical

  5. Graphs for categorical data • pie charts: for fractions of total data in each category • bar (Excel: “column”) graphs: for counts or fractions of data in each category • (mode = most frequent value)

  6. Example: Hair color at NYS Fair

  7. Numerical variables: Graphs of frequencies I • dot plots • stem-and-leaf plots • Ex: 79, 91, 59, 52, 94, 74, 75, 87, 67, 35, 91, 89, 96, 92, 92
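A stem-and-leaf plot of the list on slide 7 can be sketched in a few lines of Python. This is only an illustration, not part of the original slides: the function name `stem_and_leaf` is hypothetical, and it assumes two-digit whole numbers, with the tens digit as the stem and the ones digit as the leaf.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return text rows: tens digit as stem, sorted ones digits as leaves."""
    groups = defaultdict(list)
    for v in sorted(values):
        groups[v // 10].append(v % 10)
    rows = []
    # Include empty stems so that gaps in the data stay visible.
    for stem in range(min(groups), max(groups) + 1):
        leaves = "".join(str(d) for d in groups.get(stem, []))
        rows.append(f"{stem} | {leaves}")
    return rows

# The list from slide 7.
scores = [79, 91, 59, 52, 94, 74, 75, 87, 67, 35, 91, 89, 96, 92, 92]
rows = stem_and_leaf(scores)
for row in rows:
    print(row)
```

Turned on its side, the output has the same shape as a dot plot or a histogram with bins of width 10.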

  8. Numerical variables: Graphs of frequencies II • histogram • 79, 91, 59, 52, 94, 74, 75, 87, 67, 35, 91, 89, 96, 92, 92 • (Note similarity to dotplot and stem-and-leaf plot) • Excel’s “histograms” aren’t, unless bar widths are equal

  9. More on Histograms • Frequencies are represented by areas, not heights • Vertical axis should have “density scale” (in % per horizontal unit, not frequency) • So (if widths differ), bar height is % / h.u. • Result: total area under histogram = 100%
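The density scale on slide 9 can be checked numerically. The sketch below reuses the list from slide 7, but the unequal bin edges (30, 60, 80, 90, 100) and the left-closed, right-open boundary convention are assumptions made for this illustration. The function name `density_heights` is hypothetical.

```python
def density_heights(values, edges):
    """Bar heights in % per horizontal unit for (possibly unequal) bins.

    Each bin is [left, right): a value on a boundary goes to the right-hand bin.
    """
    n = len(values)
    heights = []
    for left, right in zip(edges, edges[1:]):
        count = sum(1 for v in values if left <= v < right)
        percent = 100 * count / n
        heights.append(percent / (right - left))  # area = percent, so height = percent / width
    return heights

scores = [79, 91, 59, 52, 94, 74, 75, 87, 67, 35, 91, 89, 96, 92, 92]
edges = [30, 60, 80, 90, 100]  # assumed bins of unequal width
heights = density_heights(scores, edges)

# Total area = sum of height * width over all bars.
total_area = sum(h * (right - left)
                 for h, left, right in zip(heights, edges, edges[1:]))
print(heights, total_area)
```

The total area should come out to 100%, whatever bins are chosen, as long as they cover all the data.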

  10. Density scale • Weekly salaries in a company • Vertical axis is in % per $200 • Where is the high point?

  11. Reading a histogram: Hrs slept by CU students (data from questionnaire) • Which given interval contains the most students? • Which 1-hr period contains the most students (i.e., is most “crowded”)? • About what % slept 8 hr? • About what % slept 3 or 4 hr? • If there were 240 surveyed, about how many slept 6-7 hr? • Sum = 20 • So maybe 5% slept 1 hr, 5% 2 hr and 5% 3 hr; or maybe 0% slept 1 hr, 7% 2 hr, 8% 3 hr

  12. Drawing a histogram: Horses’ weights in kg • Sum = 250

  13. Frequencies by area?

  14. We are given lists of numbers taken from a large number of residents of New York State. The numbers represent • the person’s height • the distance from the person’s home to the nearest airport • the distance from the person’s home to Syracuse • the size of the person’s savings • From the following histograms, choose the one that you think best approximates the true histogram of each list.

  15. Remarks on histograms • If data is continuous, we need a convention about what to do with data on boundaries: Does it fall into the left or the right interval? • If data is discrete, it’s natural to center bars on values (which Excel does even when it shouldn’t).

  16. Possible traits of a number list • “skewed” right or left (the skew is the tail) • incomes are usually skewed right • test scores are often skewed left • IQs are usually symmetric • outlier values (far from others – sometimes a rule defines one, but ...) • unimodal vs. bimodal • ex: heights of a group of men vs. hts of a group of men and women

  17. Centers of numerical data • average [mean] (Greek letter µ, or the variable with an overbar, x̄): sum of data divided by number of numbers in list • this is the “arithmetic” mean • (others include geometric, harmonic) • “center of gravity” of histogram • median: value that is greater than half of the data and less than the other half • half the area of the histogram lies to each side • more “resistant” to skew & outliers

  18. Weekly salaries in a small factory: 30 workers • 17 at $200, 5 at $400, 6 at $500, 1 at $2000, 1 at $4600 • avg = $500, median = $200
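The figures on slide 18 can be reproduced with Python's standard `statistics` module. This is a quick check rather than anything from the original slides, and it reads the slide as 17 workers at $200, 5 at $400, and so on.

```python
from statistics import mean, median

# 30 weekly salaries, read from slide 18 as "count at amount".
salaries = [200] * 17 + [400] * 5 + [500] * 6 + [2000] + [4600]

avg = mean(salaries)
med = median(salaries)
print(len(salaries), avg, med)
```

The two large salaries pull the average up to $500, while the median stays at $200 because more than half of the workers earn $200.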

  19. Test for skew • median < avg : skewed to right • median > avg : skewed to left • (These can be used as the def of skew left or right)

  20. Pop Quiz Is the average or the median likely to be greater for a list of • heights of people (including children)? • heights of adults? • salaries?

  21. Measures of spread • standard deviation σ (maybe with subscript of variable) = √[Σ(x − x̄)²/n] • this is “population std dev”, SD in text, stdevp in Excel • “sample std dev” s = √[Σ(x − x̄)²/(n−1)] is SD+ in text (much later), stdev in Excel • IQR (“inter-quartile range”): difference between third and first quartiles • first quartile: 1/4 of data is less, 3/4 greater • third quartile: [Guess!] • (n-th percentile: n% of data is less, (100−n)% greater) • (range: max value − min value)
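The two standard deviation formulas on slide 21 differ only in the divisor. A minimal sketch, written directly from the formulas and compared against Python's `statistics.pstdev` and `statistics.stdev`; the helper names `population_sd` and `sample_sd` are hypothetical, and the data is the list from slide 7, reused here purely for illustration.

```python
import math
from statistics import pstdev, stdev

def population_sd(values):
    """sigma: divide the sum of squared deviations by n."""
    avg = sum(values) / len(values)
    return math.sqrt(sum((x - avg) ** 2 for x in values) / len(values))

def sample_sd(values):
    """s: divide the sum of squared deviations by n - 1."""
    avg = sum(values) / len(values)
    return math.sqrt(sum((x - avg) ** 2 for x in values) / (len(values) - 1))

scores = [79, 91, 59, 52, 94, 74, 75, 87, 67, 35, 91, 89, 96, 92, 92]
sigma = population_sd(scores)
s = sample_sd(scores)
print(sigma, s)
```

Because s divides by n − 1 instead of n, it is always a little larger than σ for the same list; the gap shrinks as n grows.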

  22. Rough estimates of σ • Steps: Estimate avg, estimate deviations of data values from avg, estimate avg of deviations (guess high) • Ex: 41, 48, 50, 50, 54, 57. Is σ closest to 0, 5, 10 or 20? • Ex: Is σ closest to 0, 3, 8 or 15?

  23. Rule of Thumb (best for, but not limited to, bell-shaped data) • 68% of values are within 1σ of mean • 95% of values are within 2σ of mean • almost all values are within 3σ of mean

  24. So values within one σ of the average are fairly common, ... • while values around two σ away from average are mildly surprising, ... • and values more than three σ away from the average are “a minor miracle”. • “z-score” (“std units”): z = (x − x̄) / σ • the number of σ’s above average • (if negative, below average)
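The z-score and the rule of thumb from slides 23 and 24 can be tried on a real list. The sketch below uses the scores from slide 7 as an assumed example; the helper names `z_score` and `share_within` are hypothetical.

```python
from statistics import mean, pstdev

scores = [79, 91, 59, 52, 94, 74, 75, 87, 67, 35, 91, 89, 96, 92, 92]
avg = mean(scores)
sigma = pstdev(scores)  # population std dev, as on slide 21

def z_score(x):
    """Number of sigmas that x lies above (or, if negative, below) the average."""
    return (x - avg) / sigma

def share_within(k):
    """Fraction of the list lying within k sigmas of the average."""
    inside = [x for x in scores if abs(z_score(x)) <= k]
    return len(inside) / len(scores)

for k in (1, 2, 3):
    print(k, share_within(k))
print(z_score(35))
```

For this short, left-skewed list the shares come out near, but not exactly at, 68% and 95%, which fits the slide's remark that the rule works best for bell-shaped data.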

  25. Effect of outliers • 1, 2, 3, 3, 4, 5: avg = 3, σ = √(10/6) ≈ 1.3, median = 3, IQR = 3.75 − 2.25 (Excel values) = 1.5 • 1, 2, 3, 3, 4, 100: avg ≈ 18.8, σ ≈ 36.3, median = 3, IQR = 3.75 − 2.25 (Excel values) = 1.5
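The numbers on slide 25 can be recomputed in Python. One assumption worth flagging: the quartiles here use `statistics.quantiles` with `method="inclusive"`, which interpolates between data points and gives the same 2.25 and 3.75 that the slide labels as Excel values. Other quartile conventions give slightly different numbers. The helper name `summary` is hypothetical.

```python
from statistics import mean, median, pstdev, quantiles

def summary(values):
    """Average, population std dev, median and IQR of a list."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "avg": mean(values),
        "sigma": pstdev(values),
        "median": median(values),
        "iqr": q3 - q1,
    }

clean = [1, 2, 3, 3, 4, 5]
with_outlier = [1, 2, 3, 3, 4, 100]

for name, data in (("clean", clean), ("with outlier", with_outlier)):
    print(name, summary(data))
```

Replacing 5 by 100 moves the average and σ a great deal, while the median and IQR do not move at all.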

  26. Changing the list (I): “Linear” changes of variable (i.e., changing units of measure) • Basic list: 6, 9, 15 • µ = [6+9+15]/3 = 10 • σ = √[((6−10)² + (9−10)² + (15−10)²)/3] = √14 • Add 5: 11, 14, 20 • µ = [11+14+20]/3 = 15 • σ = √[((11−15)² + (14−15)² + (20−15)²)/3] = √14 • Multiply by 12: 72, 108, 180 • µ = [72+108+180]/3 = 120 • σ = √[((72−120)² + (108−120)² + (180−120)²)/3] = 12√14
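The pattern on slide 26 (adding a constant shifts µ and leaves σ alone; multiplying by a constant multiplies both) can be confirmed with a short check. This is a sketch using the slide's own list and Python's `statistics` module.

```python
import math
from statistics import mean, pstdev

base = [6, 9, 15]
shifted = [x + 5 for x in base]   # add 5 to every value
scaled = [12 * x for x in base]   # multiply every value by 12

for name, data in (("base", base), ("shifted", shifted), ("scaled", scaled)):
    print(name, mean(data), pstdev(data))

print(math.sqrt(14), 12 * math.sqrt(14))  # the values given on the slide
```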

  27. Changing the list (II): • Adjoining copies of the average: 6, 9, 15, 10, 10, 10, 10 • µ = (6+9+15+4(10))/7 = 10 • σ = √[((6−10)² + (9−10)² + (15−10)² + 4(10−10)²)/7] = √(3/7)∙√14 < √14 • Combining two lists • A: 25, 35 B: 40, 40, 50, 50 • µA = 30, σA = 5, µB = 45, σB = 5 • A and B: 25, 35, 40, 40, 50, 50 • µ = 40, σ = 5√3 > 5
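Both examples on slide 27 can be verified the same way. The sketch below uses the slide's lists; only the variable names are invented.

```python
import math
from statistics import mean, pstdev

# Adjoining four copies of the average (10) to the list 6, 9, 15.
base = [6, 9, 15]
padded = base + [10] * 4
print(mean(padded), pstdev(padded), pstdev(base))

# Combining two lists that each have sigma = 5.
list_a = [25, 35]
list_b = [40, 40, 50, 50]
combined = list_a + list_b
print(pstdev(list_a), pstdev(list_b), mean(combined), pstdev(combined))
```

Adding copies of the average keeps µ fixed and shrinks σ, while merging two lists with different averages can give a σ larger than either list had on its own.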

  28. Box-&-whisker plots • Uses all of min, first quartile, median, third quartile, max • “Normal” just shows them all • “Modified” puts limit on whisker length -- 1.5 IQR • 3rd quartile +1.5 IQR = “(inner) fence” • whisker ends at last value before or on fence • beyond the fence is an “outlier” • reject only “for cause” (?) • beyond 3rd quartile + 3 IQR (“outer fence” or “bound”) is “extreme outlier”
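The numbers behind a modified box-and-whisker plot can be computed as follows. Several things here are assumptions rather than content from the slide: the data is the list from slide 7, the quartiles use the interpolating `method="inclusive"` convention of `statistics.quantiles`, and a lower fence at first quartile − 1.5 IQR is added by symmetry with the upper fence the slide describes. The function name `modified_box_stats` is hypothetical.

```python
from statistics import quantiles

def modified_box_stats(values):
    """Quartiles, inner fences, whisker ends and outliers for a modified box plot."""
    data = sorted(values)
    q1, med, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr   # assumed mirror image of the upper fence
    high_fence = q3 + 1.5 * iqr  # "(inner) fence" from the slide
    inside = [x for x in data if low_fence <= x <= high_fence]
    outliers = [x for x in data if x < low_fence or x > high_fence]
    return {
        "q1": q1, "median": med, "q3": q3,
        "fences": (low_fence, high_fence),
        "whiskers": (inside[0], inside[-1]),  # last values before or on the fences
        "outliers": outliers,
    }

scores = [79, 91, 59, 52, 94, 74, 75, 87, 67, 35, 91, 89, 96, 92, 92]
stats = modified_box_stats(scores)
print(stats)
```

With these conventions the score of 35 falls below the lower fence, so the lower whisker stops at 52 and 35 is drawn as a separate outlier point.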

  29. [Figures: “Normal” and “Modified” box-&-whisker plots]
