Basic Statistics

Basic Statistics Introduction to Inferential Statistics

Structure of Statistics • Descriptive statistics: tabular, graphical, numerical • Inferential statistics: estimation, tests of hypothesis

Introduction to Inferential Statistics • Inferential statistics about the population mean are usually used to answer one of two types of questions. • The first question is: What is the average “something”? This is Estimation. • “Something” could be hours spent studying by online students, speed driven by teenagers, distance people commute to work or school, or any number of other things.

Introduction to Inferential Statistics • The second type of question about the population mean is: • “Am I right or wrong if I guess (hypothesize) the mean ‘something’ to be {value}?” This is Hypothesis Testing. • Again, “something” could be hours spent studying by online students (10 hours), speed driven by teenagers (too fast*), distance people commute to work or school (12 miles), or any number of other things. *The hypothesized value must be a value, not a value judgment!

Inferential Statistics

Relating to the Textbook • Your textbook treats these two types of questions as distinctly different, with Hypothesis Testing taking a predominant role. • I see them as closely linked and, in fact, I will show you how to do both with one technique.

REMINDER!! • Much of what we will cover from here until the end of the course is not in sequence with your book. The material is all there but I will be referring you to many sections of many chapters as we progress. You will need to pay careful attention to the PowerPoint lessons and be able to use your textbook as a reference.

Some Definitions for Estimation • Estimation: Using sample statistics to estimate population parameters. • Point Estimate: Use of a single number as the estimate for the unknown parameter (almost never exactly correct!). • Interval Estimate: A range of values as the estimate for the unknown parameter. • Confidence Interval: An interval estimate accompanied by a specific level of confidence, stated as a probability (e.g., 95%).

An Example of Estimation Suppose a university administrator is interested in determining the average IQ of all professors at her university. It is too costly to test all professors, so she selects a random sample of 20 professors. Each is given an IQ test and the results show a sample mean of 135. Since the test is nationally standardized, she knows that σ for the population is 15. • How would the administrator estimate the average IQ for ALL professors at her university?

Constructing a Confidence Interval • The general formula for a confidence interval uses information from the sample and our knowledge of the sampling distribution from the Central Limit Theorem. • We then construct an interval in which we think the population parameter will be.

Confidence Interval Formula
$$\bar{x} \pm z \frac{\sigma}{\sqrt{n}}$$
In words, the confidence interval is determined by adding and subtracting the bound on the estimate (the z score representing the level of confidence times the standard error) to and from the sample mean.

[Figure: The Area Under the Normal Curve and the Sampling Distribution of the Means, with the central 95% of the area marked.]

The Sampling Distribution of the Means • From the previous slide we can see that, given our knowledge of the sampling distribution of the means, 95% of all the sample means that we would obtain from numerous samples will fall within 1.96 standard errors of the unknown μ.

Our University Administrator • Let’s return to our university administrator and see how she will estimate the average IQ (μ) of all professors at her university. • She will need to know the shape, mean, and standard deviation of the sampling distribution!

What about the Shape of the Sampling Distribution? • If we repeatedly took hundreds of samples of 20 professors, what does the CLT tell us the shape of the distribution of the sample means would be? • It would be approximately normal, or mound-shaped.

What about the Mean of the Sampling Distribution? • What does the CLT tell us the mean of the sampling distribution would be? • It would be the same as the population mean, μ, which is unknown.

What about the Standard Deviation of the Sampling Distribution? • What does the CLT tell us the standard deviation (standard error) of the sampling distribution would be? • It would be the population standard deviation, 15, divided by the square root of the sample size, 20, which is about 3.35.
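
The standard error above can be checked with a few lines of Python. This is a minimal sketch; the function name `standard_error` is illustrative and does not come from the slides or the textbook.

```python
import math

def standard_error(sigma: float, n: int) -> float:
    """Standard deviation of the sampling distribution of the mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

# Administrator example: population sigma = 15, sample size n = 20.
se = standard_error(15, 20)
print(round(se, 2))  # 3.35
```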

We can display this graphically. [Figure: the sampling distribution of the mean, with one standard error (3.35) marked on each side of μ.]

Using the standard deviation of the sampling distribution (the standard error), we can determine a bound on the error of estimation. We know that 95% of the observations in a normal distribution fall within 1.96 standard deviations of the mean. Therefore, if we take 1.96 standard errors (1.96 × 3.35 = 6.57), we know the maximum distance by which our estimate will miss the population parameter 95% of the time.

Using this information, we can determine the points on this graph between which the sample mean would occur 95% of the time. [Figure: the central 95% of the sampling distribution, extending 6.57 on each side of μ.]

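Let’s illustrate the computations. First, we compute the bound on the error of estimation:
$$z \frac{\sigma}{\sqrt{n}} = 1.96 \times \frac{15}{\sqrt{20}} = 1.96 \times 3.35 = 6.57$$
We then subtract it from and add it to the sample mean:
$$135 - 6.57 = 128.43 \qquad 135 + 6.57 = 141.57$$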

The Answer! • Based on our calculations, the way to state the estimate is: • The administrator is 95% confident that the mean IQ for all professors at her university is between 128.4 and 141.6.
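
The same interval can be reproduced in Python. This is a sketch under the slide's assumptions (known population σ, z = 1.96 for 95% confidence); the function name `z_confidence_interval` is hypothetical.

```python
import math

def z_confidence_interval(sample_mean, sigma, n, z=1.96):
    """Return (lower, upper) limits: sample mean minus/plus z standard errors."""
    bound = z * sigma / math.sqrt(n)  # bound on the error of estimation
    return sample_mean - bound, sample_mean + bound

# Administrator example: sample mean 135, sigma 15, n 20.
low, high = z_confidence_interval(135, 15, 20)
print(f"{low:.1f} to {high:.1f}")  # 128.4 to 141.6
```

Passing a different `z` (for example 2.575) widens or narrows the interval without changing anything else in the calculation.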

We can show the concept of the confidence interval graphically. Since there is a 95% chance that the sample mean will fall within 6.57 points of μ, an interval extending 6.57 points on either side of the sample mean will capture the population mean (μ) 95% of the time.

Important Concept • When we construct a confidence interval, we are not saying that the parameter μ is in the middle, merely that it is somewhere in that interval!! It is like throwing a net into the sea: we hope to catch the fish, but we do not know where the fish is. If we are really hungry, we had better throw a big net! (Statistically, that means choosing a higher degree of confidence.)

How often will the 95% confidence interval capture μ? Answer: 95% of the time

Here, the sample mean is as far left as it will fall 95% of the time. Please note that it still captures (barely) the population mean.

Here, the sample mean is as far right as it will fall 95% of the time. Please note that it still captures (barely) the population mean.

Only 5 times in 100 samples will the obtained sample mean be so far away that a 95% confidence interval will not capture μ.
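
The "95% of the time" claim can be checked by simulation. The sketch below is an illustration only: it assumes a normally distributed population with a made-up true mean of 130 (in practice μ is unknown), draws many samples of 20, and counts how often the interval around each sample mean captures μ. The function name and the seed are arbitrary choices.

```python
import math
import random

def coverage_rate(mu, sigma, n, z=1.96, trials=10_000, seed=1):
    """Fraction of simulated samples whose z-interval contains the true mean mu."""
    rng = random.Random(seed)
    bound = z * sigma / math.sqrt(n)  # same bound for every sample (sigma is known)
    hits = 0
    for _ in range(trials):
        sample_mean = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        if sample_mean - bound <= mu <= sample_mean + bound:
            hits += 1
    return hits / trials

# Hypothetical population: mu = 130 is an assumption made only for this demo.
rate = coverage_rate(mu=130, sigma=15, n=20)
print(rate)  # should land close to 0.95
```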

Summary The sample mean is the point estimate for the population mean. The standard deviation of the sampling distribution is also called the standard error for the estimate of the mean. 1.96 standard errors provide the 95% bound on the error of estimation. If we add and subtract this bound from the sample mean, we can create a confidence interval. Finally, we can change the confidence level (from 95%) by choosing a different number of standard errors on either side of the mean.

Summary in Symbols for Estimating μ Confidence Interval:
$$\bar{x} \pm z \frac{\sigma}{\sqrt{n}}$$
Note: z would be 1.96 for the 95% Bound, 2.575 for the 99% Bound, 1.645 for the 90% Bound, etc.
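
The z values in the note come from the standard normal curve: for a given confidence level, z is the point that leaves half of the remaining area in each tail. A short sketch using `statistics.NormalDist` from the Python standard library (Python 3.8+); the helper name `z_for_confidence` is illustrative.

```python
from statistics import NormalDist

def z_for_confidence(level: float) -> float:
    """z value that leaves (1 - level) / 2 of the area in each tail."""
    alpha = 1 - level
    return NormalDist().inv_cdf(1 - alpha / 2)

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%}: z = {z_for_confidence(level):.3f}")
```

The 99% value computes to about 2.576; tables often quote it as 2.575 or 2.58, and the difference is negligible for these examples.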

One-Sample Test of Hypothesis: Hypothesis Testing on a Population Mean [Figure: a sample drawn from a population.]

Research Situation • A particular test has a national mean of 100 and a standard deviation of 15. The superintendent of a particular school system wants to know whether the average IQ in her school system is different from the national average on this test.

Definitions Related to Hypothesis Testing • Null Hypothesis: The hypothesis that we will test statistically. In a single-sample problem it is the “guess” about the population mean (μ). • Written as: Ho: μ = ‘value’. • Alternative Hypothesis: If the null is not tenable, then the alternative must be. • Written as: Ha: μ ≠ ‘value’, or μ < ‘value’, or μ > ‘value’.

Step by Step: The One-Sample Test of Hypothesis Using the z-test. • State Research Question • Establish the Hypotheses • Establish Level of Significance • Collect Data • Calculate Statistical Test • Interpret the Results

1. Stating the Research Problem Is the average IQ of students in that particular system different from the national average? [Figure: a sample is drawn from the population; is there a difference?]

2. Establish the Research Hypothesis • Research or Alternative Hypothesis: The mean IQ is not equal to 100 (Ha: μ ≠ 100). • Null Hypothesis: The mean IQ is 100 (Ho: μ = 100).

Alternative or Research Hypotheses • The Alternative Hypothesis may take a non-directional form, μ ≠ ‘value’. • The Alternative Hypothesis may instead be directional, μ > ‘value’ or μ < ‘value’. • The decision to use a directional alternative is based on the research question under investigation.

Errors in Decisions

3. Establish the Level of Significance • α is the probability of rejecting a true null hypothesis. It equals the area under the curve that lies outside the region where we would expect to find our sample mean (e.g., if we use 95% of the area under the curve, then α is .05). • α defines what is called the “Rejection Region” because we will reject the Null if our calculated z statistic is in that region.

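[Figure: Graphical Depiction of Rejection Region. For a non-directional test, the rejection regions lie in both tails of the sampling distribution, on either side of the hypothesized mean.]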

Rejection Region: Directional Hypotheses [Figure: rejection region in the upper tail, beyond z = +1.645 above μ.] This would represent a directional hypothesis, μ > ‘value’. The total area (e.g., .05) would be on only one side, so the critical value of z would be 1.645 rather than 1.96, giving a greater likelihood of rejecting the Null Hypothesis when the true mean lies in the hypothesized direction.

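Rejection Region: Directional Hypotheses [Figure: rejection region in the lower tail, beyond z = -1.645 below μ.] This would represent a directional hypothesis, μ < ‘value’. The total area (e.g., .05) would be on only one side, and again the critical value of z would be 1.645 in absolute value (here -1.645) rather than 1.96, giving a greater likelihood of rejecting the Null Hypothesis when the true mean lies in the hypothesized direction.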

Rejection Region • It is determined by the Alpha (α) selected. α defines how much of the area under the curve will be in the rejection region. • The probability of rejecting a TRUE Null Hypothesis is equal to the area in the rejection region, because if the Null Hypothesis is true, a sample mean will fall in that region only that often.
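
The rejection-region rule for the three forms of the alternative can be sketched in Python. This is an illustration, not a procedure taken from the textbook: the function name and the `alternative` labels ("two-sided", "greater", "less") are made up here, and critical values come from `statistics.NormalDist`.

```python
from statistics import NormalDist

def in_rejection_region(z, alpha=0.05, alternative="two-sided"):
    """True if the calculated z statistic falls in the rejection region."""
    std_normal = NormalDist()
    if alternative == "two-sided":   # Ha: mu != value, alpha split across both tails
        return abs(z) > std_normal.inv_cdf(1 - alpha / 2)
    if alternative == "greater":     # Ha: mu > value, all of alpha in the upper tail
        return z > std_normal.inv_cdf(1 - alpha)
    if alternative == "less":        # Ha: mu < value, all of alpha in the lower tail
        return z < std_normal.inv_cdf(alpha)
    raise ValueError("alternative must be 'two-sided', 'greater', or 'less'")

# A z of 1.8 is beyond +1.645 but not beyond +/-1.96.
print(in_rejection_region(1.8))                         # False
print(in_rejection_region(1.8, alternative="greater"))  # True
```

The example shows why a directional alternative rejects more readily in its chosen direction: the same z of 1.8 is significant one-tailed but not two-tailed.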

4. Collecting the Data • A random sample of 81 students was given the IQ test, and their scores were recorded. The mean of the sample was 105. [Table: Student and IQ columns listing the individual scores.]

5. Analyzing the Data: Calculating the Test Statistic The statistic we will calculate to determine if the Research Hypothesis is tenable is a modification of our z score:
$$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$$
Notice that the new formula takes the difference between the sample mean and the hypothesized population mean μ and divides it by the standard error. These are exactly the quantities we discussed in Estimation.

Calculating the Z Statistic • From our sample of 81 students, we calculated the sample mean to be 105. • The population standard deviation is 15. • Using our Z test formula, we can determine where our sample mean would fall if the population mean μ is 100.

The Z Statistic
$$z = \frac{105 - 100}{15 / \sqrt{81}} = \frac{5}{1.667} = 3.00$$
Thus, our sample mean lies 3.00 standard errors above the hypothesized population mean of 100.

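Locating our Mean on the Sampling Distribution of Means [Figure: sampling distribution of means centered at μ = 100, with one standard error (1.667) marked on each side at 98.33 and 101.67; the sample mean of 105 falls outside the region containing 95% of all means.]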

6. Interpreting the Results • Since our mean is not in the area where we would expect 95% of all sample means from a distribution where the population mean is 100, we would reject the Null Hypothesis that μ = 100 and accept the alternative that it is different, μ ≠ 100. • We would state that we reject the Null Hypothesis at the .05 level of significance.
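
The calculation and decision in steps 5 and 6 can be put together in a short Python sketch. It assumes a non-directional alternative and a known population σ, as in the superintendent example (sample mean 105, n = 81, σ = 15, hypothesized mean 100); the function name `one_sample_z_test` is illustrative.

```python
import math
from statistics import NormalDist

def one_sample_z_test(sample_mean, mu0, sigma, n, alpha=0.05):
    """Two-sided one-sample z test. Returns (z, critical value, reject H0?)."""
    se = sigma / math.sqrt(n)                        # standard error of the mean
    z = (sample_mean - mu0) / se                     # distance from mu0 in standard errors
    critical = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 when alpha = .05
    return z, critical, abs(z) > critical

z, critical, reject = one_sample_z_test(sample_mean=105, mu0=100, sigma=15, n=81)
print(f"z = {z:.2f}, critical = {critical:.2f}, reject H0: {reject}")
# z = 3.00, critical = 1.96, reject H0: True
```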

Problems with the Z Test • The z test requires that we know the population standard deviation, which we usually do not know. • The z test is designed for large samples (n > 30), which, again, we don’t always have. • What is the solution?
