STAT3600

STAT3600 Lecture 2 Measures of Location and Variability

Histogram Shapes • Modality: • Unimodal: Rises to a single peak, then declines. • Bimodal: Two distinct peaks. • Multimodal: More than two peaks. • Skewness: • Symmetric: Left half is the mirror image of the right half. • Positively skewed: Right or upper tail is stretched out. • Negatively skewed: Left or lower tail is stretched out.

Histogram Shapes (Smoothed)

Measures of Location • Visual methods provide good preliminary insights about the data. • More formal analysis requires calculating a few measures that describe the data numerically (quantitatively). • Some measures of location: Mean, Geometric mean, Harmonic mean, Median, Mode, Quartiles, Percentiles, Trimmed mean, etc.

The Mean • The most familiar and useful measure of center. • The mean is the arithmetic average of the values in the data set (x1, x2, x3, …, xn). • Denoted by x̄ = (x1 + x2 + … + xn)/n.

Population Mean • Just like the sample mean, the average of all values in the entire population can also be calculated. • The population mean is denoted by μ. • Inferential statistics are needed to draw conclusions about the population mean from the sample information.

The Median Examples: • Even sample size: Observed data x = {13, 56, 43, 36, 57, 44, 56, 67, 45, 67, 32, 12, 46, 76}. Order statistics x(i) = {12, 13, 32, 36, 43, 44, 45, 46, 56, 56, 57, 67, 67, 76}. n = 14, so 7 values lie on each side of the middle, and the median is the average of x(7) = 45 and x(8) = 46, i.e. 45.5. • Odd sample size: Observed data x = {13, 56, 43, 36, 57, 44, 56, 67, 45, 67, 32, 12, 46}. Order statistics x(i) = {12, 13, 32, 36, 43, 44, 45, 46, 56, 56, 57, 67, 67}. n = 13, so 6 values lie on each side of the middle value x(7) = 45, which is the median.
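The two median examples can be checked with a short Python sketch. The function names are illustrative and not part of the lecture; the data are the observed values listed on the slide.

```python
def sample_mean(values):
    """Arithmetic average of the observations."""
    return sum(values) / len(values)


def sample_median(values):
    """Middle order statistic (odd n) or average of the two middle ones (even n)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                      # x((n+1)/2) in 1-based notation
    return (ordered[mid - 1] + ordered[mid]) / 2  # average of x(n/2) and x(n/2 + 1)


even_sample = [13, 56, 43, 36, 57, 44, 56, 67, 45, 67, 32, 12, 46, 76]
odd_sample = even_sample[:-1]  # the slide's odd-size example drops the last value

print(sample_median(even_sample))  # 45.5
print(sample_median(odd_sample))   # 45
print(sample_mean(even_sample))
```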

The Median

Mean & Median

Geometric Mean

Harmonic Mean

Trimmed Mean • A 10% trimmed mean, x̄tr(10), is calculated by deleting the smallest 0.10n and the largest 0.10n observations from the data and calculating the mean of the remaining 0.80n observations (the remaining 80% of the data).

Trimmed Mean • Example: • Consider Example 1.1 on page 4 of Devore, n = 36. To calculate the 10% trimmed mean, 0.10 × 36 = 3.6 observations should be deleted from each end. Since 3.6 is not an integer, we use interpolation.
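One way to carry out such an interpolation is sketched below in Python. It is an assumption about the intended method, not a transcription of the slide: the trimmed mean is computed with the trimming count rounded down and rounded up, and the two results are combined linearly. The function name is hypothetical, and the data are the 14 observations from the median example, not Devore's Example 1.1.

```python
import math


def trimmed_mean(values, proportion):
    """Trimmed mean with linear interpolation when proportion * n is not an integer.

    Assumed scheme: drop floor(k) and ceil(k) observations from each end,
    where k = proportion * n, then interpolate between the two means.
    """
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    n = len(ordered)
    k = proportion * n
    lower, upper = math.floor(k), math.ceil(k)

    def mean_after_dropping(count):
        kept = ordered[count:n - count]
        return sum(kept) / len(kept)

    mean_low = mean_after_dropping(lower)
    if lower == upper:
        return mean_low
    mean_high = mean_after_dropping(upper)
    weight = k - lower  # fractional part of k
    return mean_low + weight * (mean_high - mean_low)


data = [13, 56, 43, 36, 57, 44, 56, 67, 45, 67, 32, 12, 46, 76]
print(trimmed_mean(data, 0.10))  # k = 1.4: between dropping 1 and 2 values per end
```

For n = 36 and a 10% trim the same sketch would interpolate between dropping 3 and 4 observations from each end, with weight 0.6.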

Mode • The mode is the value that has the highest frequency. For the data of Example 1.1, Mode1 = 67 and Mode2 = 70 because both values occur with the same highest frequency, which is 4. • Most populations have a single mode. • In calculus terms, the mode corresponds to the maximum of the frequency curve.
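Finding every value tied for the highest frequency can be sketched in Python as follows. The function name is illustrative, and the data are the 14 observations from the median example (not Example 1.1), which happen to have two modes as well.

```python
from collections import Counter


def sample_modes(values):
    """Return (sorted list of modes, their shared frequency)."""
    counts = Counter(values)
    highest = max(counts.values())
    modes = sorted(v for v, c in counts.items() if c == highest)
    return modes, highest


data = [13, 56, 43, 36, 57, 44, 56, 67, 45, 67, 32, 12, 46, 76]
modes, frequency = sample_modes(data)
print(modes, frequency)  # [56, 67] 2
```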

Sample Percentiles • The 100p-th sample percentile, xp, is computed as follows: • Multiply p by n. If n × p is an exact integer, say i, then xp = (x(i) + x(i+1))/2. • If n × p is not an integer, round n × p up to the next higher integer and let that be i. Then xp = x(i).
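The percentile rule above translates almost directly into code. The Python sketch below is a minimal illustration; the function name and the floating-point tolerance are assumptions, and the data are the 14 observations from the median example.

```python
import math


def sample_percentile(values, p):
    """100p-th sample percentile using the rule stated above (0 < p < 1)."""
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    ordered = sorted(values)
    n = len(ordered)
    position = n * p
    nearest = round(position)
    # Treat n * p as an integer if it is within a small tolerance of one.
    if 1 <= nearest <= n - 1 and math.isclose(position, nearest, abs_tol=1e-9):
        i = nearest
        return (ordered[i - 1] + ordered[i]) / 2  # (x(i) + x(i+1)) / 2, 1-based
    i = math.ceil(position)
    return ordered[i - 1]                          # x(i), 1-based


data = [13, 56, 43, 36, 57, 44, 56, 67, 45, 67, 32, 12, 46, 76]
q1 = sample_percentile(data, 0.25)  # 14 * 0.25 = 3.5 -> i = 4
q2 = sample_percentile(data, 0.50)  # 14 * 0.50 = 7   -> average of x(7), x(8)
q3 = sample_percentile(data, 0.75)  # 14 * 0.75 = 10.5 -> i = 11
print(q1, q2, q3)
```

Note that this rule differs from the default interpolation used by many software packages, so library quantile functions may return slightly different values.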

Sample Percentile • For the data of Example 1.1, the 10th, 25th, 50th, 75th and 90th percentiles are computed as: • x0.10: 0.10 × 36 = 3.6, rounded up to 4, so x0.10 = x(4) = 49.0 • Q1 = x0.25: 0.25 × 36 = 9, so x0.25 = (x(9) + x(10))/2 = (58 + 60)/2 = 59.0 • Q2 = x0.50: 0.50 × 36 = 18, so x0.50 = (x(18) + x(19))/2 = (67 + 68)/2 = 67.5 • Q3 = x0.75: 0.75 × 36 = 27, so x0.75 = (x(27) + x(28))/2 = (75 + 75)/2 = 75.0 • x0.90: 0.90 × 36 = 32.4, rounded up to 33, so x0.90 = x(33) = 80.0

Sample Percentile • The IQR (interquartile range) is defined as IQR = x0.75 - x0.25 = Q3 - Q1 (Devore calls this the fourth spread and denotes it fs). Its value is 16.0 for the data of Example 1.1. • If x(i) < Q1 - 1.5 × IQR or x(i) > Q3 + 1.5 × IQR, then the ith order statistic x(i) is a (mild) outlier. If x(i) < Q1 - 3 × IQR or x(i) > Q3 + 3 × IQR, then x(i) is an extreme outlier. • For Example 1.1 of Devore, Q1 - 1.5 × 16 = 59 - 24 = 35, and only x(1) = 31 < 35, so the data contain a single mild outlier on the left-hand side.
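The outlier rule can be written as a small classification function. This Python sketch uses the quartiles quoted above for Example 1.1 (Q1 = 59.0, Q3 = 75.0); the function name and the returned labels are illustrative choices.

```python
def classify_observation(x, q1, q3):
    """Label x using the 1.5 * IQR (mild) and 3 * IQR (extreme) fences."""
    iqr = q3 - q1
    if x < q1 - 3 * iqr or x > q3 + 3 * iqr:
        return "extreme outlier"
    if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr:
        return "mild outlier"
    return "not an outlier"


# Quartiles quoted in the lecture for Example 1.1
Q1, Q3 = 59.0, 75.0
print(classify_observation(31, Q1, Q3))  # mild outlier (31 < 35, but not < 11)
print(classify_observation(40, Q1, Q3))  # not an outlier
```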

Ex15p24.xls

Measures of Variability • There are three quantitative measures for the variability of the data: • the range R = x(n) - x(1), the interquartile range IQR = x0.75 - x0.25, and the standard deviation (stdev) s. • We have already defined R and IQR. To compute the stdev, s, we need to compute the variance first.

Variance • Definition: The variance, v, is the average of the squared deviations of the n observations from their own mean. • A deviation is positive if the observation is larger than the sample mean, and negative if the observation is smaller than the sample mean. • There is one problem: if we simply add the n deviations from the sample mean, they always sum to zero. This is why we square the deviations first to calculate the variance.

When the deviations are squared, each term is non-negative, so positive and negative deviations no longer cancel and the problem is resolved.
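A short Python sketch makes the cancellation visible. It assumes the usual sample convention of dividing the sum of squared deviations by n - 1 (the divisor is not shown in the surviving slide text); the function names are illustrative and the data are the 14 observations from the median example.

```python
def deviations(values):
    """Deviations of each observation from the sample mean."""
    mean = sum(values) / len(values)
    return [x - mean for x in values]


def sample_variance(values):
    """Sum of squared deviations divided by n - 1 (assumed sample convention)."""
    devs = deviations(values)
    return sum(d * d for d in devs) / (len(values) - 1)


data = [13, 56, 43, 36, 57, 44, 56, 67, 45, 67, 32, 12, 46, 76]
print(sum(deviations(data)))   # essentially zero, up to floating-point rounding
print(sample_variance(data))   # strictly positive unless all values are equal
```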

Sample standard deviation, s, standard error of the mean, se(x̄), and coefficient of variation • The unit of s² is [unit of x]². • Sample standard deviation: the unit of s is the same as the unit of x. • The coefficient of variation (cv) is expressed as a percentage. • The population variance is denoted by σ², and the population standard deviation is denoted by σ.
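The formulas for these quantities did not survive in the slide text. The Python sketch below uses the standard textbook definitions as an assumption: s is the square root of the sample variance (divisor n - 1), se(x̄) = s/√n, and cv = 100 × s/x̄. The function name and the small data set are made up for illustration.

```python
import math


def spread_summary(values):
    """Return (s, standard error of the mean, coefficient of variation in %)."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / (n - 1)
    s = math.sqrt(variance)          # same unit as x
    se_mean = s / math.sqrt(n)       # assumed definition: s / sqrt(n)
    cv_percent = 100 * s / mean      # assumed definition: s relative to the mean
    return s, se_mean, cv_percent


s, se_mean, cv_percent = spread_summary([2, 4, 6, 8])
print(s, se_mean, cv_percent)
```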

Properties of variance (var(x) or s²) • var(x) is guaranteed to be non-negative. • If you calculated s² as a negative number, you have made a computational error. • var(x + c) = var(x), where c is a constant. • var(cx) = c² var(x), where c is a constant.
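The shift and scale properties are easy to confirm numerically. This Python sketch uses the standard library's statistics.variance (which divides by n - 1) on the 14 observations from the median example; the constant c = 3 is an arbitrary choice.

```python
import math
import statistics

data = [13, 56, 43, 36, 57, 44, 56, 67, 45, 67, 32, 12, 46, 76]
c = 3  # arbitrary constant chosen for the check

base = statistics.variance(data)
shifted = statistics.variance([x + c for x in data])  # var(x + c)
scaled = statistics.variance([c * x for x in data])   # var(c x)

print(math.isclose(shifted, base))         # adding a constant leaves variance unchanged
print(math.isclose(scaled, c ** 2 * base))  # scaling by c multiplies variance by c squared
```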

Graphical Measure for Variability, the Boxplot • Step 1. Draw a vertical line through the median, x0.50. • Step 2. Draw vertical lines through Q1 = x0.25 and Q3 = x0.75 and connect them at the bottom and the top to make a (rectangular) box. For the data of Example 1.1, the box runs from x0.25 = 59.0 to x0.75 = 75.0, with the median line at x0.50 = 67.5. [Figure: box for Example 1.1]

• Step 3. Compute both 1.5 × IQR and 3 × IQR. For Example 1.1, 1.5 × IQR = 24 and 3 × IQR = 48. All data points less than Q1 - 1.5 × IQR = 59 - 24 = 35 or larger than Q3 + 1.5 × IQR = 75 + 24 = 99 are (mild) outliers. Example 1.1 on page 4 of Devore has no extreme outliers. • Step 4. Draw whiskers from Q1 and Q3 to the smallest and largest order statistics that are not outliers. [Figure: boxplot for Example 1.1 with x0.25 = 59.0, x0.50 = 67.5, x0.75 = 75.0 and whiskers ending at 40 and 84] Note that the dark dot on the left-hand side of the boxplot represents the mild outlier x(1) = 31. Extreme outliers are represented by clear dots.
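Steps 3 and 4 can be sketched in Python as below. The function name is hypothetical, and the list of values is only a handful of numbers quoted in the lecture (31, 40, 84 and the three quartiles), not the full Example 1.1 data set.

```python
def whisker_ends(values, q1, q3):
    """Return (lower whisker end, upper whisker end, sorted list of outliers)."""
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [x for x in values if low_fence <= x <= high_fence]
    outliers = sorted(x for x in values if x < low_fence or x > high_fence)
    # Whiskers stop at the most extreme observations that are not outliers.
    return min(inside), max(inside), outliers


# Illustrative subset of values mentioned on the slides, not the full data
subset = [31, 40, 59.0, 67.5, 75.0, 84]
low, high, outliers = whisker_ends(subset, q1=59.0, q3=75.0)
print(low, high, outliers)  # 40 84 [31]
```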
