Unit 2: Some Basics

The whole vs. the part • population vs. sample • means (avgs) and “std devs” [defined later] of these are denoted by different letters • description vs. inference • mean of a population describes it (partly) • mean of a sample gives a basis to infer (guess) mean of the population

Computers are ubiquitous in stats • Warning: We’ll compute with Excel because you can see the data and it’s available all over campus... • but its stats results are unreliable (see next slide) • In the real world, use a statistical program (SAS, SPSS, ...)

Data Types • categorical vs. numerical (vs. ordinal, …) • ex of categorical: red, blue, green • ex of numerical: person’s height in inches • (ex of ordinal: strongly agree, agree, neutral, ...) • [numerical: discrete vs. continuous]

Types of Statistics • graphical • numerical

Graphs for categorical data • pie charts: for fractions of total data in each category • bar (Excel: “column”) graphs: for counts or fractions of data in each category • (mode = most frequent value)

Example: Hair color at NYS Fair

Numerical variables: Graphs of frequencies I • dot plots • stem-and-leaf plots • Ex: 79, 91, 59, 52, 94, 74, 75, 87, 67, 35, 91, 89, 96, 92, 92
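A stem-and-leaf plot for the example list can be sketched in a few lines of Python. This is only an illustration, assuming two-digit values with the tens digit as the stem; the function name `stem_and_leaf` is made up for this example.

```python
from collections import defaultdict

scores = [79, 91, 59, 52, 94, 74, 75, 87, 67, 35, 91, 89, 96, 92, 92]

def stem_and_leaf(values):
    """Group values by tens digit (stem); leaves are the sorted ones digits."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    # Keep empty stems so that gaps in the data stay visible.
    return {s: stems.get(s, []) for s in range(min(stems), max(stems) + 1)}

plot = stem_and_leaf(scores)
for stem, leaves in plot.items():
    print(f"{stem} | {''.join(str(d) for d in leaves)}")
```

Turned on its side, the printed plot has the same shape as a dot plot or a histogram with bars of width 10.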

Numerical variables: Graphs of frequencies II • histogram • 79, 91, 59, 52, 94, 74, 75, 87, 67, 35, 91, 89, 96, 92, 92 • (Note similarity to dotplot and stem-and-leaf plot) • Excel’s “histograms” aren’t, unless bar widths are equal

More on Histograms • Frequencies are represented by areas, not heights • Vertical axis should have “density scale” (in % per horizontal unit, not frequency) • So (if widths differ), bar height is % / h.u. • Result: total area under histogram = 100%
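The density scale can be checked numerically. The sketch below reuses the number list from the earlier slides and assumes a hypothetical set of class intervals of unequal width (`edges`), plus the convention that a value on a boundary goes into the interval to its right. Neither choice comes from the slides.

```python
scores = [79, 91, 59, 52, 94, 74, 75, 87, 67, 35, 91, 89, 96, 92, 92]
edges = [30, 50, 70, 80, 90, 100]  # hypothetical intervals, unequal widths

def density_heights(values, edges):
    """Return (left, right, height) per bar, with height in % per horizontal unit."""
    n = len(values)
    bars = []
    for left, right in zip(edges, edges[1:]):
        count = sum(left <= v < right for v in values)  # boundary goes right
        percent = 100 * count / n
        bars.append((left, right, percent / (right - left)))
    return bars

bars = density_heights(scores, edges)
total_area = sum((right - left) * height for left, right, height in bars)
for left, right, height in bars:
    print(f"{left}-{right}: {height:.2f} % per unit")
print(f"total area = {total_area:.0f}%")
```

The intervals 50-70 and 70-80 each hold 3 of the 15 values (20%), but the narrower bar is twice as tall because the same area is spread over half the width.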

Density scale • Weekly salaries in a company • Vertical axis is in % per $200 • Where is the high point?

Reading a histogram: Hrs slept by CU students (Data from questionnaire) • Which given interval contains the most students? • Which 1-hr period contains the most students (i.e., is most “crowded”)? • About what % slept 8 hr? • About what % slept 3 or 4 hr? • If there were 240 surveyed, about how many slept 6-7 hr? • Sum = 20 • So maybe 5% slept 1 hr, 5% 2 hr and 5% 3 hr; or maybe 0% slept 1 hr, 7% 2 hr, 8% 3 hr

Drawing a histogram: Horses’ weights in kg • Sum = 250

Frequencies by area?

We are given lists of numbers taken from a large number of residents of New York State. The numbers represent • the person’s height • the distance from the person’s home to the nearest airport • the distance from the person’s home to Syracuse • the size of the person’s savings • From the following histograms, choose the one that you think best approximates the true histogram of each list.

Remarks on histograms • If data is continuous, need a convention about what to do with data on boundaries: Does it fall into the left or the right interval? • If data is discrete, it’s natural to center bars on values (which Excel does even when it shouldn’t).

Possible traits of a number list • “skewed” right or left (the skew is the tail) • incomes are usually skewed right • test scores are often skewed left • IQs are usually symmetric • outlier values (far from others – sometimes a rule defines one, but ...) • unimodal vs. bimodal • ex: heights of a group of men vs. hts of a group of men and women

Centers of numerical data • average [mean] (Greek letter µ, or the variable with an overbar, x̄): sum of data divided by number of numbers in list • this is the “arithmetic” mean • (others include geometric, harmonic) • “center of gravity” of histogram • median: value that is greater than half of the data and less than the other half • half the area of the histogram lies to each side • more “resistant” to skew & outliers

Weekly salaries in a small factory: 30 workers • 17 at $200, 5 at $400, 6 at $500, 1 at $2000, 1 at $4600 • avg = $500, median = $200
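The two centers on this slide can be reproduced with Python's standard `statistics` module. A minimal sketch; the variable names are chosen only for this example.

```python
from statistics import mean, median

# (number of workers, weekly salary in dollars), as listed on the slide
payroll = [(17, 200), (5, 400), (6, 500), (1, 2000), (1, 4600)]
salaries = [salary for count, salary in payroll for _ in range(count)]

avg = mean(salaries)
med = median(salaries)
print(len(salaries), avg, med)
```

Two large salaries pull the average up to $500, while the median stays at $200 because more than half the workers earn exactly that.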

Test for skew • median < avg : skewed to right • median > avg : skewed to left • (These can be used as the def of skew left or right)
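The median-versus-average test is simple enough to write as a function. The sketch below applies it to the number list from the graphing slides and to the factory salaries; the function name `skew_direction` and the `"none"` label for a tie are assumptions made for this example.

```python
from statistics import mean, median

def skew_direction(values):
    """Apply the slide's rule: compare the median with the average."""
    avg, med = mean(values), median(values)
    if med < avg:
        return "right"
    if med > avg:
        return "left"
    return "none"  # median equals average: the rule detects no skew

scores = [79, 91, 59, 52, 94, 74, 75, 87, 67, 35, 91, 89, 96, 92, 92]
salaries = [200] * 17 + [400] * 5 + [500] * 6 + [2000, 4600]

print(skew_direction(scores))    # median 87 vs. average 78.2
print(skew_direction(salaries))  # median 200 vs. average 500
```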

Pop Quiz Is the average or the median likely to be greater for a list of • heights of people (including children)? • heights of adults? • salaries?

Measures of spread • standard deviation σ (maybe with subscript of variable) = √[Σ(x − x̄)²/n] • this is “population std dev”, SD in text, stdevp in Excel • “sample std dev” s = √[Σ(x − x̄)²/(n−1)] is SD+ in text (much later), stdev in Excel • IQR (“inter-quartile range”): difference between third and first quartiles • first quartile: 1/4 of data is less, 3/4 greater • third quartile: [Guess!] • (n-th percentile: n% of data is less, (100−n)% greater) • (range: max value − min value)
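The measures of spread can be computed directly from their definitions and compared with Python's `statistics` module. This is a sketch, not a reference implementation. It assumes that `quantiles(..., method="inclusive")` is an acceptable quartile convention; that method reproduces the Excel quartiles 2.25 and 3.75 quoted elsewhere in these slides for the list 1, 2, 3, 3, 4, 5, but other programs use other conventions.

```python
from math import sqrt
from statistics import pstdev, stdev, quantiles

scores = [79, 91, 59, 52, 94, 74, 75, 87, 67, 35, 91, 89, 96, 92, 92]

n = len(scores)
avg = sum(scores) / n
sum_sq = sum((x - avg) ** 2 for x in scores)  # sum of squared deviations

sigma = sqrt(sum_sq / n)    # population std dev (Excel: stdevp)
s = sqrt(sum_sq / (n - 1))  # sample std dev (Excel: stdev)

# Quartiles by linear interpolation between data values (assumed convention).
q1, q2, q3 = quantiles(scores, n=4, method="inclusive")
iqr = q3 - q1
data_range = max(scores) - min(scores)

print(f"sigma = {sigma:.2f}, s = {s:.2f}, IQR = {iqr}, range = {data_range}")
```

Dividing by n − 1 instead of n always makes s slightly larger than σ; the difference shrinks as the list gets longer.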

Rough estimates of σ • Steps: Estimate avg, estimate deviations of data values from avg, estimate avg of deviations (guess high) • Ex: 41, 48, 50, 50, 54, 57. Is σ closest to 0, 5, 10 or 20? • Ex: Is σ closest to 0, 3, 8 or 15?
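The advice to “guess high” can be checked on the first example list. The sketch below compares the plain average of the deviations with σ; the list of candidate answers comes from the slide and everything else is named only for this example. Squaring gives large deviations extra weight, so σ is never smaller than the average deviation.

```python
from math import sqrt

values = [41, 48, 50, 50, 54, 57]

avg = sum(values) / len(values)
deviations = [abs(x - avg) for x in values]

mean_abs_dev = sum(deviations) / len(values)               # rough estimate
sigma = sqrt(sum(d ** 2 for d in deviations) / len(values))  # exact value

candidates = [0, 5, 10, 20]
closest = min(candidates, key=lambda c: abs(c - sigma))
print(f"avg deviation = {mean_abs_dev:.2f}, sigma = {sigma:.2f}, closest = {closest}")
```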

Rule of Thumb (best for, but not limited to, bell-shaped data) • 68% of values are within 1σ of mean • 95% of values are within 2σ of mean • almost all values are within 3σ of mean
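To see how closely a particular list follows the rule of thumb, count the fraction of values within k standard deviations of the mean. The sketch below does this for the number list from the graphing slides. That list is short and not bell-shaped, so only rough agreement should be expected; `fraction_within` is a name invented for this example.

```python
from statistics import mean, pstdev

def fraction_within(values, k):
    """Fraction of values lying within k population std devs of the mean."""
    avg, sigma = mean(values), pstdev(values)
    inside = sum(abs(x - avg) <= k * sigma for x in values)
    return inside / len(values)

scores = [79, 91, 59, 52, 94, 74, 75, 87, 67, 35, 91, 89, 96, 92, 92]

for k in (1, 2, 3):
    print(f"within {k} sigma: {100 * fraction_within(scores, k):.0f}%")
```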

So values within one σ of the average are fairly common, ... • while values around two σ away from average are mildly surprising, ... • and values more than three σ away from the average are “a minor miracle”. • “z-score” (“std units”): z = (x − x̄) / σ • the number of σ’s above average • (if negative, below average)
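Converting a list to standard units is a one-line calculation. A minimal sketch using the population std dev and a short list from the estimation exercise; the helper name `z_scores` is an assumption for this example.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Convert each value to standard units: (x - average) / sigma."""
    avg, sigma = mean(values), pstdev(values)
    return [(x - avg) / sigma for x in values]

values = [41, 48, 50, 50, 54, 57]
zs = z_scores(values)
for x, z in zip(values, zs):
    print(f"{x}: z = {z:+.1f}")
```

Here the average is 50 and σ is 5, so 57 is 1.4 σ above average and 41 is 1.8 σ below it.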

Effect of outliers • 1, 2, 3, 3, 4, 5: avg = 3, σ = √(10/6) ≈ 1.3, median = 3, IQR = 3.75 − 2.25 (Excel values) = 1.5 • 1, 2, 3, 3, 4, 100: avg ≈ 18.8, σ ≈ 36.3, median = 3, IQR = 3.75 − 2.25 (Excel values) = 1.5
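The comparison on this slide can be reproduced as follows. This sketch assumes that the `"inclusive"` method of `statistics.quantiles` matches the Excel quartiles quoted here, which it does for these two lists; `summarize` is a name made up for this example.

```python
from statistics import mean, median, pstdev, quantiles

def summarize(values):
    """Average, population std dev, median and IQR of a list."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "avg": mean(values),
        "sd": pstdev(values),
        "median": median(values),
        "iqr": q3 - q1,
    }

base = summarize([1, 2, 3, 3, 4, 5])
with_outlier = summarize([1, 2, 3, 3, 4, 100])

for name, stats in (("base", base), ("with outlier", with_outlier)):
    print(name, {key: round(value, 1) for key, value in stats.items()})
```

Replacing 5 by 100 changes the average and σ drastically, while the median and IQR do not move at all.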

Changing the list (I): “Linear” changes of variable (i.e., changing units of measure) • Basic list: 6, 9, 15 • µ = [6+9+15]/3 = 10 • σ = √[((6−10)²+(9−10)²+(15−10)²)/3] = √14 • Add 5: 11, 14, 20 • µ = [11+14+20]/3 = 15 • σ = √[((11−15)²+(14−15)²+(20−15)²)/3] = √14 • Multiply by 12: 72, 108, 180 • µ = [72+108+180]/3 = 120 • σ = √[((72−120)²+(108−120)²+(180−120)²)/3] = 12√14
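The pattern in these three calculations is general: replacing each x by a·x + b changes the mean to a·µ + b and the std dev to |a|·σ. The sketch below checks this for the slide's lists; the helper `transform` is introduced only for this example.

```python
from math import isclose, sqrt
from statistics import mean, pstdev

def transform(values, a, b):
    """Linear change of variable: replace each x by a*x + b."""
    return [a * x + b for x in values]

basic = [6, 9, 15]
shifted = transform(basic, 1, 5)   # add 5
scaled = transform(basic, 12, 0)   # multiply by 12

for name, data in (("basic", basic), ("add 5", shifted), ("times 12", scaled)):
    print(f"{name}: mean = {mean(data)}, sigma = {pstdev(data):.3f}")
```

Adding a constant slides the whole list along the number line without changing its spread; multiplying stretches both the center and the spread.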

Changing the list (II) • Adjoining copies of the average: 6, 9, 15, 10, 10, 10, 10 • µ = (6+9+15+4(10))/7 = 10 • σ = √[((6−10)²+(9−10)²+(15−10)²+4(10−10)²)/7] = √(3/7)∙√14 < √14 • Combining two lists • A: 25, 35 B: 40, 40, 50, 50 • µA = 30, σA = 5, µB = 45, σB = 5 • A and B: 25, 35, 40, 40, 50, 50 • µ = 40, σ = 5√3 > 5
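Both effects can be verified numerically. A short check of the slide's numbers with the standard `statistics` module; the variable names are chosen for this example.

```python
from math import isclose, sqrt
from statistics import mean, pstdev

# Adjoining four copies of the average keeps the mean and shrinks sigma.
padded = [6, 9, 15] + [10] * 4
print(mean(padded), pstdev(padded))

# Combining two lists with different means can increase sigma.
a = [25, 35]
b = [40, 40, 50, 50]
combined = a + b
print(pstdev(a), pstdev(b), mean(combined), pstdev(combined))
```

Each list has σ = 5 on its own, but the combined list also has to account for the gap between the two means (30 and 45), so its σ is larger.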

Box-&-whisker plots • Uses all of min, first quartile, median, third quartile, max • “Normal” just shows them all • “Modified” puts limit on whisker length -- 1.5 IQR • 3rd quartile +1.5 IQR = “(inner) fence” • whisker ends at last value before or on fence • beyond the fence is an “outlier” • reject only “for cause” (?) • beyond 3rd quartile + 3 IQR (“outer fence” or “bound”) is “extreme outlier”
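The numbers behind a modified box plot can be computed as sketched below. Two assumptions go beyond the slide: the lower fences are placed symmetrically at first quartile − 1.5 IQR and first quartile − 3 IQR, and quartiles use the `"inclusive"` interpolation of `statistics.quantiles`. The function name and dictionary keys are made up for this example.

```python
from statistics import quantiles

def modified_box_plot(values):
    """Five-number summary plus fences, whisker ends and outliers."""
    data = sorted(values)
    q1, med, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # inner fences
    lower_outer, upper_outer = q1 - 3 * iqr, q3 + 3 * iqr      # outer fences
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    return {
        "five_numbers": (data[0], q1, med, q3, data[-1]),
        "fences": (lower_fence, upper_fence),
        "whiskers": (inside[0], inside[-1]),  # last values before or on a fence
        "outliers": [x for x in data if x < lower_fence or x > upper_fence],
        "extreme": [x for x in data if x < lower_outer or x > upper_outer],
    }

scores = [79, 91, 59, 52, 94, 74, 75, 87, 67, 35, 91, 89, 96, 92, 92]
box = modified_box_plot(scores)
for key, value in box.items():
    print(key, value)
```

For this list the quartiles are 70.5 and 91.5, so the inner fences sit at 39 and 123. The value 35 falls below the lower fence and is drawn as an outlier, and the lower whisker stops at 52.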

[Figures: a normal box-and-whisker plot and a modified box-and-whisker plot]
