Chapter 6

Chapter 6 Introduction to Inference

Introduction • We use statistical inference to draw conclusions from data • Our conclusions must account for the natural variability in the data • To account for the variability, formal inference relies on probability to describe chance variation • We can then correct our “eyeball” judgement by formal calculation

Example • Scenario with the US draft. • Supposedly there should be no correlation between a draft number and birth-date. • Imagine that a sample shows the correlation was r = -0.226 • If the correlation between draft number and birth-date really is zero, how likely is it to get a correlation this far from zero just by chance? • Does a sample correlation of r = -0.226 put the claim that the population correlation is 0 in doubt?
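
The question on this slide can be explored with a small simulation. The sketch below is illustrative only: it assumes 366 birth dates (the source does not state the number) and models a fair lottery as a random shuffle of the draft numbers. The function names pearson_r and share_at_least are hypothetical.

```python
import random

def pearson_r(xs, ys):
    """Sample correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def share_at_least(observed_r, n_dates=366, n_sims=2000, seed=1):
    """Fraction of simulated fair lotteries whose |r| is at least |observed_r|."""
    rng = random.Random(seed)
    birth_dates = list(range(1, n_dates + 1))  # day of year; 366 is an assumption
    draft_numbers = birth_dates[:]
    hits = 0
    for _ in range(n_sims):
        rng.shuffle(draft_numbers)  # fair lottery: numbers assigned at random
        if abs(pearson_r(birth_dates, draft_numbers)) >= abs(observed_r):
            hits += 1
    return hits / n_sims

fraction = share_at_least(-0.226)
print(f"Share of fair lotteries with |r| >= 0.226: {fraction:.4f}")
```

If the printed share is very small, a correlation as far from zero as -0.226 would rarely arise by chance alone, which is the kind of reasoning formal inference makes precise.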

TV Ads

Cautions

Statistical Inference • Chapter 6 introduces the reasoning of statistical inference • A major assumption is that the data come from a random sample • No statistical methodology can rescue bad data • We temporarily make the generally unrealistic assumption that we know the standard deviation of the population (σ) • We will try to motivate inference using the ideas we have developed in class, especially sampling distributions

Section 6.1 Estimating with Confidence

Inference on Mean [Diagram: a sample is drawn from the population; the sample statistic is used to make an inference about the unknown population parameter.]

Introduction • One way to characterize a collection of businesses is to determine the average of some measure of size • Total assets is one commonly used measure • Asset Turnover, Manufacturing Defect Rate, etc. • If the collection of businesses is large, we generally take a sample and use the information gathered to make an inference about the entire collection • We use the term population to refer to the entire collection of interest

Case 6.1 • Community banks are banks with less than a billion dollars of assets. • There are approximately 7500 such banks in the US • The Community Bankers Council of the American Bankers Association (ABA) conducts an annual survey of community banks • For the n=110 banks that make up the sample in a recent survey, the mean assets are 220 million dollars

Review • Recall • The mean of the sampling distribution of x-bar is μ • x-bar is an unbiased estimator of μ • This says that there is no systematic tendency to underestimate or overestimate the truth • By the Law of Large Numbers (LLN), as the sample size gets larger, the sample mean gets closer to the population mean

Review • Because x-bar is an unbiased estimator of μ, and because of the Law of Large Numbers, the value x-bar = 220 appears to be a reasonable estimate of the mean assets μ for all community banks

Statistical Inference • But how reliable is this estimate? • A second sample would surely not give a mean of 220 again • Unbiasedness only says that there is no systematic tendency to underestimate or overestimate the truth • Could we plausibly get a sample mean of 250 or 200 on repeated samples? Of course! • An estimate without an indication of its variability is of limited value!

Statistical Inference • We can answer questions about variation by looking at the spread • Recall: • The variation of x-bar is measured by its standard deviation, which is σ/√n • And because of the Central Limit Theorem, x-bar is approximately Normal, N(μ, σ/√n), if the sample size is large enough

Statistical Inference • In the majority of situations we will not know σ. So what do you do? • For now let's suppose that in our example the true standard deviation σ is equal to the sample standard deviation s = 161 • This assumption is not realistic, although it will give reasonably accurate results for large samples (n = 110 is probably large enough) • In the next chapter we will learn how to proceed when σ is not known

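Statistical Inference • Therefore, through the Central Limit Theorem and our large sample size of 110 banks, we can reasonably assume that x-bar is approximately N(μ, 161/√110), that is, Normal with mean μ and a standard deviation of about 15 million dollars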

Statistical Inference • Recall the 68-95-99.7 rule: the probability that x-bar is between μ - 2σ/√n and μ + 2σ/√n is about 0.95 • Thus 95% of random samples will produce an x-bar that lies within 2σ/√n of μ
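
As a quick check of the 68-95-99.7 rule, the sketch below computes the exact Normal probabilities using Python's standard-library NormalDist. The helper name within_k is hypothetical.

```python
from statistics import NormalDist

def within_k(k):
    """Probability that a Normal variable lies within k standard deviations of its mean."""
    z = NormalDist()  # standard Normal: mean 0, standard deviation 1
    return z.cdf(k) - z.cdf(-k)

for k in (1, 2, 3):
    print(f"P(within {k} standard deviations) = {within_k(k):.4f}")
```

The value for k = 2 is slightly above 0.95, which is why "2 standard deviations" is only an approximation to the more exact 1.96 used later.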

Statistical Inference • To say that x-bar lies within 2σ/√n of μ is the same as to say that μ is within 2σ/√n of x-bar • In our example 2σ/√n is about 30 million dollars, so saying that x-bar lies within 30 million dollars of μ is the same as saying that μ is within 30 million dollars of x-bar • We can say that in 95% of all samples the interval x-bar ± 30 will capture the true μ • This is the same as saying, “We are 95% confident that μ is in the interval x-bar ± 30” • We can express “confidence” in the results from ANY ONE sample.

Statistical Confidence • If we repeat what we have done many, many times, then we will catch the true μ 95% of the time • Confidence describes what happens in the long run • It does not mean that the probability that the true mean, μ, falls between 190 and 250 is 95% • μ is not random. Rather, μ is fixed and does not change. μ is either in the interval or not.

Statistical Inference • We cannot know whether our sample is one of the 95% for which the interval catches μ or one of the unlucky 5% • The statement “we are 95% confident that the unknown μ lies between 190 and 250” is shorthand for saying “we arrived at these numbers by a process that gives correct results 95% of the time”

Confidence Intervals • The interval of values between 190 and 250 is called a 95% confidence interval for μ • In general a confidence interval has the following form: • estimate +/- margin of error • Margin of error: • Is evaluated based on the variability of the estimate • Shows how precise our guess is • Estimate: • Is our guess for the value of the unknown parameter

Confidence Intervals • A level C confidence interval for a parameter has two parts • An interval calculated from the data, of the form: estimate +/- margin of error • A confidence level C, which gives the probability that the interval will capture the true parameter value in repeated samples • You can choose the confidence level • Commonly, statisticians choose 95% • 90% and 99% are also popular depending on your needs

Finish Example • So to finalize our example • We want to know the mean assets of the 7500 community banks • We take a random sample of 110 and find that x-bar = 220 • The margin of error is: 2σ/√n = 2(161)/√110, or about 30 • Thus a 95% confidence interval is: • (220 - 30, 220 + 30) = (190, 250) • We say “We are 95% confident that the mean assets of all 7500 community banks is between 190 million dollars and 250 million dollars.”
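
The arithmetic in this example can be sketched as follows. The variable names are illustrative, and σ = 161 is the slide's working assumption rather than a known value. Without intermediate rounding the margin comes out a little above 30; the slide's figures appear to come from rounding the standard deviation of x-bar to 15 first.

```python
import math

n = 110          # sample size
x_bar = 220.0    # sample mean assets, in millions of dollars
sigma = 161.0    # population standard deviation, assumed known

standard_error = sigma / math.sqrt(n)   # standard deviation of x-bar
margin = 2 * standard_error             # "2 standard deviations" rule
lower, upper = x_bar - margin, x_bar + margin

print(f"standard deviation of x-bar = {standard_error:.2f}")
print(f"margin of error = {margin:.2f}")
print(f"approximate 95% interval = ({lower:.2f}, {upper:.2f})")
```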

Statistical Confidence • But what would happen if we took another random sample of 110 banks? • Most likely x-bar will be different, which means that we would get a different confidence interval • In the long run, if we repeated the sampling process many times, 95% of the constructed confidence intervals would contain the population mean.
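
The long-run interpretation can be illustrated with a simulation. The sketch below makes loud assumptions that are not in the source: the population is taken to be Normal, and its true mean is set to a hypothetical 220 so that we can check whether each interval captures it. The function name coverage is also hypothetical.

```python
import math
import random

def coverage(mu=220.0, sigma=161.0, n=110, z_star=1.96, n_samples=2000, seed=7):
    """Share of repeated samples whose interval x-bar +/- margin contains mu."""
    rng = random.Random(seed)
    margin = z_star * sigma / math.sqrt(n)
    hits = 0
    for _ in range(n_samples):
        # Draw one sample of size n from the assumed Normal population.
        sample_mean = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        if sample_mean - margin <= mu <= sample_mean + margin:
            hits += 1
    return hits / n_samples

rate = coverage()
print(f"Share of intervals that captured the true mean: {rate:.3f}")
```

Each simulated sample gives a different interval, and the share that capture μ should land close to 0.95. Any single interval either contains μ or does not.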

Confidence Interval for a Population Mean • Now we will generalize the idea to get a confidence interval for any confidence level C. • We will use what we know about the sampling distribution of x-bar • Recall: • x-bar has mean μ and standard deviation σ/√n, and when n is large, by the CLT, x-bar is approximately N(μ, σ/√n)

Confidence Interval for a Population Mean • The area between the critical values -z* and z* under the standard Normal curve is C. [Figure: standard Normal curve with area C between -z* and +z* and area (1-C)/2 in each tail.]

Confidence Interval for a Population Mean • We want to find lower and upper “critical values”, -z* and z*, so that the area between them is C. • We have already found these critical values for some areas. (review) • Here are a few common z*’s and corresponding C: C = 90%, z* = 1.645; C = 95%, z* = 1.960; C = 99%, z* = 2.576
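
Critical values for any confidence level can be looked up from the standard Normal distribution. A minimal sketch using Python's standard-library NormalDist; the function name critical_value is hypothetical.

```python
from statistics import NormalDist

def critical_value(confidence):
    """z* such that the area between -z* and z* under N(0, 1) equals confidence."""
    # The area to the left of z* is C plus the lower tail, (1 - C)/2.
    return NormalDist().inv_cdf((1 + confidence) / 2)

for c in (0.80, 0.90, 0.95, 0.99):
    print(f"C = {c:.2f}: z* = {critical_value(c):.3f}")
```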

Confidence Interval for a Population Mean • Example: C = 90% • The area between the critical values -z* and z* under the standard Normal curve is C = 0.90. [Figure: standard Normal curve with area 0.90 between -z* and +z* and area (1-0.90)/2 = 0.05 in each tail.]

Confidence Interval for a Population Mean • We choose some z* so that P(-z* ≤ (x-bar - μ)/(σ/√n) ≤ z*) = C • After some algebra we get P(x-bar - z*σ/√n ≤ μ ≤ x-bar + z*σ/√n) = C

Confidence Interval for a Population Mean • Thus, if we choose an SRS of size n from a population having unknown mean μ and known standard deviation σ • A level C confidence interval for μ is x-bar ± z*σ/√n • The quantity z*σ/√n is the margin of error

Example • The 110 banks in the ABA survey had mean assets of 220 million dollars. Assume that the standard deviation is 161 million dollars. Give a 99% confidence interval for μ, the mean assets for all community banks.

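Answer • z* = 2.576 for C = 0.99 • The interval is 220 ± 2.576(161)/√110 = 220 ± 39.54 = (180.46, 259.54) • We are 99% confident that the mean assets for all community banks is between 180.46 and 259.54 million dollars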
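
The general formula x-bar ± z*σ/√n can be wrapped in a small function. This is a sketch under the slide's assumptions (SRS, known σ, large n); the function name z_interval is hypothetical.

```python
import math
from statistics import NormalDist

def z_interval(x_bar, sigma, n, confidence):
    """Level-C confidence interval for a population mean when sigma is known."""
    z_star = NormalDist().inv_cdf((1 + confidence) / 2)
    margin = z_star * sigma / math.sqrt(n)
    return x_bar - margin, x_bar + margin

# Community bank example: n = 110, x-bar = 220, sigma assumed to be 161.
lower, upper = z_interval(220, 161, 110, 0.99)
print(f"{lower:.2f} {upper:.2f}")  # 180.46 259.54
```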

How Confidence Intervals Behave • Recall: the margin of error is z*σ/√n • Relation between the confidence level and margin of error • A higher confidence level -> increases z* -> larger margin of error. • But we would like to have a high level of confidence and a small margin of error. • Other ways to reduce margin of error • Reduce σ • Increase the sample size (larger n)

Reduce Level of Confidence • The common choices of confidence level are 99%, 95%, and 90% • The critical values z* for these levels are 2.576, 1.960, and 1.645 • Notice these decrease as the confidence level drops • If n and σ are unchanged, settling for lower confidence will reduce the margin of error
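
To see the effect of the confidence level in numbers, the sketch below reuses the community bank figures (n = 110, σ assumed to be 161) and computes the margin of error at each common level. The variable names are illustrative.

```python
import math
from statistics import NormalDist

sigma, n = 161.0, 110                   # bank example; sigma treated as known
standard_error = sigma / math.sqrt(n)   # standard deviation of x-bar

margins = {}
for confidence in (0.99, 0.95, 0.90):
    z_star = NormalDist().inv_cdf((1 + confidence) / 2)
    margins[confidence] = z_star * standard_error
    print(f"C = {confidence:.0%}: z* = {z_star:.3f}, "
          f"margin of error = {margins[confidence]:.2f}")
```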

Reduce σ • The standard deviation σ measures variation in the population • Think of the variation among individuals in the population as noise that obscures the average value μ • Sometimes we can reduce σ by carefully controlling the measurement process or by restricting our attention to only part of a large population

Increase the Sample Size n • Suppose we want to cut the margin of error in half • The square root in the formula implies that we must have four times as many observations, not just twice as many • E.g., cutting the margin of error in half means dividing it by 2. The square root of 4 is 2. Hence, we must multiply the sample size by 4 to cut the margin of error in half.
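
A short numerical check of the square-root effect, using the bank example's values purely for illustration (any z*, σ, and n would show the same ratios). The function name margin_of_error is hypothetical.

```python
import math

def margin_of_error(z_star, sigma, n):
    """Margin of error z* * sigma / sqrt(n) for a known-sigma interval."""
    return z_star * sigma / math.sqrt(n)

base = margin_of_error(1.96, 161, 110)
doubled = margin_of_error(1.96, 161, 2 * 110)      # twice as many observations
quadrupled = margin_of_error(1.96, 161, 4 * 110)   # four times as many

print(f"n = 110: {base:.2f}")
print(f"n = 220: {doubled:.2f} (ratio {doubled / base:.3f})")
print(f"n = 440: {quadrupled:.2f} (ratio {quadrupled / base:.3f})")
```

Doubling n only shrinks the margin by a factor of 1/√2, while quadrupling n halves it.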

Example • An SRS of 100 ISU students. The average bus waiting time is 2.5 min. • Assume we know that σ = 1.2 min • Find a 95% confidence interval for the population mean μ (the average bus waiting time for all ISU students).

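Example Cont. • z* = 1.96 • Margin of error: 1.96(1.2)/√100, about 0.24 • A 95% confidence interval: 2.5 ± 0.24 = (2.26, 2.74) • We are 95% confident that the mean bus waiting time for all ISU students is between 2.26 and 2.74 minutes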

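Example Cont. • Now find an 80% confidence interval • z* = 1.282 • Margin of error: 1.282(1.2)/√100, about 0.15 • An 80% confidence interval: 2.5 ± 0.15 = (2.35, 2.65) • We are 80% confident that the mean bus waiting time for all ISU students is between 2.35 and 2.65 minutes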

Example Cont. • Now take a different SRS of 1000 ISU students. Assume that the average bus waiting time remains 2.5 min. • Find a 95% confidence interval • z* = 1.96 • Margin of error: 1.96(1.2)/√1000, about 0.074 • A 95% confidence interval: 2.5 ± 0.074 = (2.426, 2.574) • We are 95% confident that the mean bus waiting time for all ISU students is between 2.43 and 2.57 minutes.
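
The three bus waiting time intervals can be computed side by side to compare the effect of confidence level and sample size. A sketch assuming σ = 1.2 is known in every case; the helper name z_interval is hypothetical.

```python
import math
from statistics import NormalDist

def z_interval(x_bar, sigma, n, confidence):
    """Level-C confidence interval for a mean with known sigma."""
    z_star = NormalDist().inv_cdf((1 + confidence) / 2)
    margin = z_star * sigma / math.sqrt(n)
    return x_bar - margin, x_bar + margin

ci_95_n100 = z_interval(2.5, 1.2, 100, 0.95)
ci_80_n100 = z_interval(2.5, 1.2, 100, 0.80)
ci_95_n1000 = z_interval(2.5, 1.2, 1000, 0.95)

for label, (lo, hi) in [("95%, n = 100", ci_95_n100),
                        ("80%, n = 100", ci_80_n100),
                        ("95%, n = 1000", ci_95_n1000)]:
    print(f"{label}: ({lo:.3f}, {hi:.3f})")
```

Lowering the confidence level or raising the sample size both narrow the interval around the same estimate of 2.5 minutes.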

Choosing the Sample Size • Planning ahead, we can choose a sample size to get a desired margin of error and confidence level • To obtain a desired margin of error m, set the margin of error z*σ/√n equal to m, substitute the critical value z* for your desired confidence level, and solve for the sample size n

Sample Size for Desired Margin of Error • The confidence interval for a population mean will have a specified margin of error m when the sample size is: n = (z*σ/m)^2

Sample Size for Desired Margin of Error • In practice, observations cost time and money • The sample size you calculate from this formula may turn out to be too expensive • Always round your answer up to the next higher whole number • In practice we often calculate the margins of error corresponding to a range of values of n, and then decide what margin of error we can afford

Example • You are planning a survey of starting salaries for recent business major graduates from your college. From a pilot study, you estimate that the standard deviation is about $8000. What sample size do you need to have a margin of error equal to $500 with 95% confidence?

Answer • n = (z*σ/m)^2 = (1.96 × 8000/500)^2 = (31.36)^2 = 983.45, which rounds up to 984 • We would need to survey 984 business major graduates for our estimate to be within $500 of the true mean with 95% confidence
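
The sample size calculation, including the rule of always rounding up, can be sketched as below. The function name required_sample_size is hypothetical, and σ = 8000 is the pilot-study estimate treated as if it were known.

```python
import math
from statistics import NormalDist

def required_sample_size(sigma, margin, confidence):
    """Smallest n whose margin of error z* * sigma / sqrt(n) is at most `margin`."""
    z_star = NormalDist().inv_cdf((1 + confidence) / 2)
    return math.ceil((z_star * sigma / margin) ** 2)  # always round up

n_needed = required_sample_size(sigma=8000, margin=500, confidence=0.95)
print(n_needed)  # 984
```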

Conclusion • Keeping the sample size fixed, if the confidence level increases, the margin of error will be larger. A larger margin of error produces a wider confidence interval. • Keeping the confidence level fixed, if the sample size gets larger, the margin of error will be smaller. A smaller margin of error gives a narrower confidence interval. • Hence we can achieve both a high confidence level and a narrow confidence interval by increasing the sample size.

Some Cautions • We have already seen that small margins of error and high confidence can require large numbers of observations. • You should also be aware that any formula for inference is correct only in specific circumstances…. (next slide!)

Some Cautions • The data must be an SRS from the population • We are completely safe if we actually did a randomization and drew an SRS • We are not in great danger if the data can plausibly be thought of as independent observations from a population • The formula is not correct for probability sampling designs more complex than an SRS • Correct methods for other designs are available • If you plan such samples, be sure that you (or your statistical consultant) know how to carry out the inference you desire

Some Cautions • There is no correct method for inference from data haphazardly collected with bias of unknown size • Fancy formulas cannot rescue badly produced data • Because x-bar is not resistant, outliers can have a large effect on the confidence interval • You can search for outliers and try to correct them or justify their removal before computing the interval • If the outliers cannot be removed, ask your statistical consultant about procedures that are not sensitive to outliers
