Review of Top 10 Concepts in Statistics

NOTE: This PowerPoint file is not an introduction, but rather a checklist of topics to review.

Top Ten #1 • Descriptive Statistics

Measures of Central Location • Mean • Median • Mode

Mean • Population mean = µ = Σx/N = (5+1+6)/3 = 12/3 = 4 • Algebra: Σx = N*µ = 3*4 = 12 • Sample mean = x-bar = Σx/n • Example: the number of hours spent on the Internet: 4, 8, and 9, so x-bar = (4+8+9)/3 = 7 hours • Do NOT rely on the mean if the number of observations is small or the data contain extreme values • Ex: Do NOT use the mean if 3 houses were sold this week and one was a mansion

Median • Median = middle value • Example: 5,1,6 • Step 1: Sort data: 1,5,6 • Step 2: Middle value = 5 • When there is an even number of observations, the median is the average of the two middle observations • OK even if there are extreme values • Home sales: 100K,200K,900K, so mean = 400K, but median = 200K

Mode • Mode: most frequent value • Ex: female, male, female • Mode = female • Ex: 1,1,2,3,5,8 • Mode = 1 • It may not be a very good measure, see the following example

Measures of Central Location - Example Sample: 0, 0, 5, 7, 8, 9, 12, 14, 22, 23 • Sample Mean = x-bar = Σx/n = 100/10 = 10 • Median = (8+9)/2 = 8.5 • Mode = 0
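The three measures for this sample can be checked with a short Python sketch. It assumes Python's standard `statistics` module; the variable names are illustrative only.

```python
import statistics

# Sample from the slide above
sample = [0, 0, 5, 7, 8, 9, 12, 14, 22, 23]

mean = statistics.mean(sample)      # sum / n = 100 / 10
median = statistics.median(sample)  # even n: average of the two middle values (8 and 9)
mode = statistics.mode(sample)      # most frequent value

print("mean:", mean)
print("median:", median)
print("mode:", mode)
```

Note how the mode (0) sits far from the mean and median here, which is why it can be a poor measure of central location.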

Relationship • Case 1: if the distribution is symmetric (ex. bell-shaped, normal distribution), • Mean = Median = Mode • Case 2: if the distribution is positively skewed (skewed to the right) (ex. incomes of employees in a large firm: a large number of relatively low-paid workers and a small number of high-paid executives), • Mode < Median < Mean

Relationship – cont’d • Case 3: if the distribution is negatively skewed (skewed to the left) (ex. the time taken by students to write exams: few students hand in their exams early, and the majority turn in their exams at the end of the exam), • Mean < Median < Mode

Dispersion – Measures of Variability • How much spread of data • How much uncertainty • Measures • Range • Variance • Standard deviation

Range • Range = Max-Min ≥ 0 • But the range is affected by unusual values • Ex: Santa Monica may reach a high of 105 degrees or a low of 30 only once a century, yet the range would still be 105-30 = 75

Standard Deviation (SD) • Better than range because all data are used • Population SD = Square root of variance = sigma = σ • SD ≥ 0 (zero only when all values are identical)

Empirical Rule • Applies to mound or bell-shaped curves Ex: normal distribution • About 68% of data within ± one SD of mean • About 95% of data within ± two SD of mean • About 99.7% of data within ± three SD of mean
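For an exactly normal distribution, the three percentages can be reproduced from the normal cumulative distribution function. This is a minimal sketch assuming `statistics.NormalDist` (Python 3.8 or later); `share_within` is a hypothetical helper name.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Share of a normal population lying within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within ±{k} SD: {share_within(k):.4f}")
```

Real data that are only roughly bell-shaped will match these shares only approximately.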

Standard Deviation = Square Root of Variance

Sample Standard Deviation • Sample variance = s² = Σ(x − x-bar)²/(n − 1) • Sample SD = s = square root of sample variance

Standard Deviation Total variation = 34 • Sample variance = 34/4 = 8.5 • Sample standard deviation = square root of 8.5 = 2.9

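Measures of Variability - Example The hourly wages earned by a sample of five students are: $7, $5, $11, $8, and $6 • Range: 11 – 5 = 6 • Sample mean: x-bar = 37/5 = 7.4 • Variance: s² = Σ(x − x-bar)²/(n − 1) = 21.2/4 = 5.3 • Standard deviation: s = square root of 5.3 = 2.3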
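The wage example can be worked through in Python as a check. The sketch below assumes the standard `statistics` module, whose `variance` and `stdev` functions use the sample (n − 1) divisor; variable names are illustrative.

```python
import statistics

# Hourly wages of the five sampled students
wages = [7, 5, 11, 8, 6]

data_range = max(wages) - min(wages)          # 11 - 5
sample_variance = statistics.variance(wages)  # sum of squared deviations / (n - 1)
sample_sd = statistics.stdev(wages)           # square root of the sample variance

print("range:", data_range)
print("sample variance:", round(sample_variance, 2))
print("sample SD:", round(sample_sd, 2))
```

For population data, `statistics.pvariance` and `statistics.pstdev` divide by N instead.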

Graphical Tools • Line chart: trend over time • Scatter diagram: relationship between two variables • Bar chart: frequency for each category • Histogram: frequency for each class of measured data (graph of frequency distr.) • Box plot: graphical display based on quartiles, which divide data into 4 parts

Top Ten #2 • Hypothesis Testing

H0: Null Hypothesis • Population mean=µ • Population proportion=π • A statement about the value of a population parameter • Never include sample statistic (such as, x-bar) in hypothesis

HA or H1: Alternative Hypothesis • ONE TAIL ALTERNATIVE – Right tail: µ > number (smog check), π > fraction (% defectives) – Left tail: µ < number (weight in box of crackers), π < fraction (unpopular President’s % approval low)

One-Tailed Tests A test is one-tailed when the alternate hypothesis, H1 or HA, states a direction, such as: • H1: The mean yearly salary earned by full-time employees is more than $45,000. (µ > $45,000) • H1: The average speed of cars traveling on the freeway is less than 75 miles per hour. (µ < 75) • H1: Less than 20 percent of the customers pay cash for their gasoline purchase. (π < 0.2)

Two-Tail Alternative • Population mean not equal to number (too hot or too cold) • Population proportion not equal to fraction (% alcohol too weak or too strong)

Two-Tailed Tests A test is two-tailed when no direction is specified in the alternate hypothesis • H1: The mean amount of time spent on the Internet is not equal to 5 hours. (µ ≠ 5) • H1: The mean price for a gallon of gasoline is not equal to $2.54. (µ ≠ $2.54)

Reject Null Hypothesis (H0) If • Absolute value of test statistic* > critical value* • Reject H0 if |Z Value| > critical Z • Reject H0 if | t Value| > critical t • Reject H0 if p-value < significance level (alpha) • Note that direction of inequality is reversed! • Reject H0 if very large difference between sample statistic and population parameter in H0 * Test statistic: A value, determined from sample information, used to determine whether or not to reject the null hypothesis. * Critical value: The dividing point between the region where the null hypothesis is rejected and the region where it is not rejected.

Example: Smog Check • H0 : µ = 80 • HA: µ > 80 • If test statistic =2.2 and critical value = 1.96, reject H0, and conclude that the population mean is likely > 80 • If test statistic = 1.6 and critical value = 1.96, do not reject H0, and reserve judgment about H0
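The two decision rules (test statistic vs. critical value, and p-value vs. alpha) can be sketched side by side for a right-tail Z test. The slide gives only the critical value 1.96, so the sketch assumes a right-tail significance level of 0.025, which is what yields that critical value; `right_tail_decision` is a hypothetical helper, and `statistics.NormalDist` requires Python 3.8 or later.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()
ALPHA = 0.025  # assumption: right-tail alpha that gives a critical Z of about 1.96

def right_tail_decision(z_stat, alpha=ALPHA):
    """Return (reject by critical value, reject by p-value, critical Z, p-value)."""
    critical = STD_NORMAL.inv_cdf(1 - alpha)  # dividing point of the rejection region
    p_value = 1 - STD_NORMAL.cdf(z_stat)      # right-tail area beyond the statistic
    return z_stat > critical, p_value < alpha, critical, p_value

for z_stat in (2.2, 1.6):
    by_critical, by_p, critical, p_value = right_tail_decision(z_stat)
    verdict = "reject H0" if by_critical else "do not reject H0"
    print(f"z = {z_stat}: critical = {critical:.2f}, p-value = {p_value:.4f}, {verdict}")
```

Both rules give the same verdict: a test statistic above the critical value corresponds to a p-value below alpha, which is why the inequality direction is reversed.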

Type I vs Type II Error • Alpha=α = P(type I error) = Significance level = probability that you reject true null hypothesis • Beta= β = P(type II error) = probability you do not reject a null hypothesis, given H0 false Ex: H0 : Defendant innocent • α = P(jury convicts innocent person) • β =P(jury acquits guilty person)


Example: Smog Check • H0 : µ = 80 • HA: µ > 80 • If p-value = 0.01 and alpha = 0.05, reject H0, and conclude that the population mean is likely > 80 • If p-value = 0.07 and alpha = 0.05, do not reject H0, and reserve judgment about H0

Test Statistic • When testing a population mean with a large sample and a known population standard deviation, the test statistic is: z = (x-bar − µ) / (σ/√n)

Example The processors of Best Mayo indicate on the label that the bottle contains 16 ounces of mayo. The standard deviation of the process is 0.5 ounces. A sample of 36 bottles from last hour’s production showed a mean weight of 16.12 ounces per bottle. At the .05 significance level, can we conclude that the mean amount per bottle is greater than 16 ounces?

Example – cont’d 1. State the null and the alternative hypotheses: H0: μ = 16, H1: μ > 16 2. Select the level of significance. In this case, we selected the .05 significance level. 3. Identify the test statistic. Because we know the population standard deviation, the test statistic is z. 4. State the decision rule. Because this is a right-tail test, reject H0 if z > 1.645 (= z0.05)

Example – cont’d 5. Compute the value of the test statistic: z = (16.12 − 16)/(0.5/√36) = 0.12/0.0833 = 1.44 6. Conclusion: Because 1.44 < 1.645, do not reject the null hypothesis. We cannot conclude the mean is greater than 16 ounces.
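The Best Mayo test can be reproduced numerically. The sketch below uses the figures from the example (µ0 = 16, σ = 0.5, n = 36, x-bar = 16.12, alpha = .05) and assumes `statistics.NormalDist` (Python 3.8 or later); the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu_0 = 16.0    # hypothesized mean (ounces)
sigma = 0.5    # known population standard deviation
n = 36         # sample size
x_bar = 16.12  # sample mean
alpha = 0.05   # significance level

z = (x_bar - mu_0) / (sigma / sqrt(n))      # test statistic
critical = NormalDist().inv_cdf(1 - alpha)  # right-tail critical value, about 1.645
p_value = 1 - NormalDist().cdf(z)           # right-tail p-value
reject = z > critical

print(f"z = {z:.2f}, critical = {critical:.3f}, p-value = {p_value:.4f}")
print("reject H0" if reject else "do not reject H0")
```

The p-value route agrees with the critical-value route: the p-value exceeds .05, so H0 is not rejected.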

Top Ten #3 • Confidence Intervals: Mean and Proportion

Confidence Interval A confidence interval is a range of values within which the population parameter is expected to occur.

Factors for Confidence Interval The factors that determine the width of a confidence interval are: • The sample size, n • The variability in the population, usually estimated by standard deviation. • The desired level of confidence.

Confidence Interval: Mean • Use normal distribution (Z table) if: population standard deviation (sigma) known and either (1) or (2): • (1) Normal population • (2) Sample size > 30

Confidence Interval: Mean • If normal table, then the interval is x-bar ± Z·σ/√n, where Z is the table value for the chosen confidence level

Normal Table • Tail = .5(1 – confidence level) • NOTE! Different statistics texts have different normal tables • This review uses the tail of the bell curve • Ex: 95% confidence: tail = .5(1-.95)= .025 • Z.025 = 1.96
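The table lookup can be mimicked in code: compute the tail area from the confidence level, then find the Z value that leaves that area in the upper tail. This is a sketch assuming `statistics.NormalDist` (Python 3.8 or later); `z_for_confidence` is a hypothetical helper name.

```python
from statistics import NormalDist

def z_for_confidence(confidence):
    """Z value that leaves tail = .5(1 - confidence) in the upper tail."""
    tail = 0.5 * (1 - confidence)
    return NormalDist().inv_cdf(1 - tail)

for level in (0.90, 0.95, 0.98, 0.99):
    print(f"{level:.0%} confidence: Z = {z_for_confidence(level):.3f}")
```

Printed tables usually round these values, for example 1.96 for 95% confidence.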

Example • n=49, Σx=490, σ=2, 95% confidence • 9.44 < µ < 10.56

Another Example One of the SOM professors wants to estimate the mean number of hours worked per week by students. A sample of 49 students showed a mean of 24 hours. It is assumed that the population standard deviation is 4 hours. What is the population mean?

Another Example – cont’d 95 percent confidence interval for the population mean: 24 ± 1.96(4/√49) = 24 ± 1.12. The confidence limits range from 22.88 to 25.12. We estimate with 95 percent confidence that the average number of hours worked per week by students lies between these two values.
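Both Z-based interval examples (n=49, Σx=490, σ=2, and the hours-worked sample) can be checked with one small function. This is a sketch assuming a known population standard deviation and `statistics.NormalDist` (Python 3.8 or later); `mean_confidence_interval` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist

def mean_confidence_interval(x_bar, sigma, n, confidence=0.95):
    """Z-based interval for a population mean when sigma is known."""
    tail = 0.5 * (1 - confidence)
    z = NormalDist().inv_cdf(1 - tail)
    error = z * sigma / sqrt(n)
    return x_bar - error, x_bar + error

# First example: n = 49, sum of x = 490, sigma = 2
first = mean_confidence_interval(490 / 49, 2, 49)
# Second example: hours worked per week, x-bar = 24, sigma = 4, n = 49
second = mean_confidence_interval(24, 4, 49)

print("first: ", tuple(round(limit, 2) for limit in first))
print("second:", tuple(round(limit, 2) for limit in second))
```

The function uses the unrounded Z value, so results can differ from table-based arithmetic in the third decimal place.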

Confidence Interval: Mean t distribution • Use if normal population but population standard deviation (σ) not known • If you are given the sample standard deviation (s), use t table, assuming normal population • If one population, n-1 degrees of freedom

Confidence Interval: Mean t distribution • Interval: x-bar ± t·s/√n, where t comes from the t table with n-1 degrees of freedom

Confidence Interval: Proportion • Use if success or failure (ex: defective or not-defective, satisfactory or unsatisfactory) • Normal approximation to binomial OK if (n)(π) > 5 and (n)(1-π) > 5, where n = sample size and π = population proportion • NOTE: NEVER use the t table for a proportion!

Confidence Interval: Proportion Ex: 8 defectives out of 100, so p = .08 and n = 100, 95% confidence • p ± Z·√(p(1-p)/n) = .08 ± 1.96·√((.08)(.92)/100) = .08 ± .053 • .027 < π < .133

Confidence Interval: Proportion A sample of 500 people who own their house revealed that 175 planned to sell their homes within five years. Develop a 98% confidence interval for the proportion of people who plan to sell their house within five years.
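The two proportion examples (8 defectives out of 100 at 95%, and 175 of 500 homeowners at 98%) can be worked with the usual normal-approximation interval p ± Z·√(p(1-p)/n). The sketch below is one way to do that; `proportion_confidence_interval` is a hypothetical helper, and using the sample proportion p in place of π for the n·π > 5 check is an assumption of this sketch.

```python
from math import sqrt
from statistics import NormalDist

def proportion_confidence_interval(successes, n, confidence):
    """Normal-approximation interval for a population proportion."""
    p = successes / n
    # Assumption: check the normal-approximation condition with p standing in for pi
    if n * p <= 5 or n * (1 - p) <= 5:
        raise ValueError("normal approximation to the binomial is not appropriate")
    tail = 0.5 * (1 - confidence)
    z = NormalDist().inv_cdf(1 - tail)
    error = z * sqrt(p * (1 - p) / n)
    return p - error, p + error

defectives = proportion_confidence_interval(8, 100, 0.95)
home_sales = proportion_confidence_interval(175, 500, 0.98)

print("defectives:", tuple(round(limit, 3) for limit in defectives))
print("home sales:", tuple(round(limit, 3) for limit in home_sales))
```

For the home-sales sample, p = 175/500 = .35, and the 98% interval runs from roughly .30 to .40.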

Interpretation • If 95% confidence, then about 95% of all confidence intervals constructed this way from repeated samples will include the true population parameter • NOTE! Never use the term “probability” when estimating a parameter! (ex: Do NOT say “Probability that population mean is between 23 and 32 is .95” because a parameter is not a random variable. In fact, the population mean is a fixed but unknown quantity.)

Point vs Interval Estimate • Point estimate: statistic (single number) • Ex: sample mean, sample proportion • Each sample gives a different point estimate • Interval estimate: range of values • Ex: Population mean = sample mean ± error • Parameter = statistic ± error

Width of Interval • Ex: sample mean = 23, error = 3 • Point estimate = 23 • Interval estimate = 23 ± 3, or (20,26) • Width of interval = 26-20 = 6 • Wide interval: Point estimate unreliable
