Understanding Confidence Intervals and Sampling in Statistical Analysis

Stats 120A Review of CIs, hypothesis tests and more

Sample/Population • Last time we collected height/armspan data. Is this a sample or a population?

Gallup Poll, 1/9/07 "As you may know, the Bush administration is considering a temporary but significant increase in the number of U.S. troops in Iraq to help stabilize the situation there. Would you favor or oppose this?"

Results • Results based on 1004 randomly selected adults (> 18 years) interviewed Jan 5-7, 2007. • 61% are opposed. • "For results based on this sample, one can say with 95% confidence that the maximum error attributable to sampling and other random effects is ±3 percentage points. "

Pop Quiz • Is the value 61% a statistic or a parameter? • The margin of error is given as 3%. What does the margin of error measure? a) the variability in the sample b) the variability in the population c) the variability in repeated sampling

Sampling paradigm • In the U.S., the proportion of adults who are opposed to a surge is p, (or p*100%). • We take a random sample of n = 1004. • The proportion of our sample ("p hat") is an estimate of the proportion in the population.

A simulation: • Choose a value to serve as p (say p = .6) • Our "data" consist of 1004 numbers: 0's represent those in favor, 1's are those opposed. • x = 589 out of 1004 say "opposed", so p-hat = 589/1004 = .5866 • mean(x) = .5866 • sd(x) = .4926

xbar=.5866, s = .493

How do we know sample proportion is a good estimate of population proportion? • Law of Large Numbers: sample averages (and proportions) converge on population values •implying that for finite values, the sample proportion might be close if the sample size is large

Coin flips: sample proportion "settles down" to 0.5

So if we stop earlier, say n = 10 p-hat = .60

Which raises the question: • If we stop early, how far away will our sample proportion be from the true value? • Or, in a survey setting, if we take a finite sample of n=1004, how far off from the population proportion are we likely to be?

A simulation might help: • Assume p = .60 (population proportion) • Take sample of n = 1004 and find p-hat. • Save this value • Repeat above 3 steps 10000 times.

The R code (for the record) • phat <- c() for (i in 1:10000){ x <- sample(c(0,1),1004,replace=T,prob=c(.4, .6)) temp <- sum(x)/1004 phat <- c(phat,temp)} • hist(phat)

each dot represents one survey of 1004 people

10,000 sample proportions, n = 1004

Observe that... • sample proportions are centered on the true population value: p = .60 • variability is not great: smallest is .54, biggest is .66 • distribution is bell-shaped

We've just witnessed the Central Limit Theorem If samples are independent and random and sufficiently large • means (and proportions) follow a nearly Normal distribution • the mean of the Normal is the mean of the population • the SD of the Normal (aka the standard error) is the population SD divided by sqrt(n)

CLT applied to sample proportions • phat is distributed with an approx Normal • mean is p • SE is sqrt(p*(1-p)/n) • For our simulation, p = .60 so our p-hats will be centered on .6 with a SD of sqrt(.6*.4/1004) = 0.0155

We saw • Normal • mean(phat) = 0.600(expected .6) • sd(phat) = 0.01554(expected 0.0155)

In practice, we don't know p but we can get a good approximation to the standard error using sqrt(phat * (1-phat)/n) rather than sqrt(p*(1-p)/n)

So if we take a random sample of n = 1004 and we see p-hat = .61, we know that: • The true value of p can't be far away. SE = sqrt(.61*.39/1004) = 0.0154 •So 68% of the time we do this, p will be within 0.0154 of phat •And 95% of the time it will be with 2*.0154 = 0.03

Which leads us to conclude that the true proportion of the population that opposes a surge is somewhere in the interval.61 - .03 = 0.58 to .61+.03 = 0.64

Confidence intervals • This is an example of a 95% confidence interval. • Because 95% of all samples will produce a p-hat that is within 2 standard errors of the true value, we are 95% confident that ours is a "good" interval.

Formula A 95% CI for a proportion is estimate +/- 2 * (Standard Error) p-hat +/- 2*sqrt(phat*(1-phat)/n) 0.61 +/- 2*sqrt(.61*.39/1004) (.58, .64) note: our replacing phat for p in SE means we get an approximate value

What does 95% mean? • If we repeat this infinitely many times: • take a sample of n = 1004 from population • calculate sample proportion • find an interval using +/- 2 * SE • then 95% of these CIs will contain the truth and 5% will not. • We see only one: (.58, .64). It is either good or bad, but we are confident it is good.

Where did the 95% come from? • It came from the normal curve. • The CLT told us that p-hat followed a (approx) normal distribution. • For Normal's, 68% of probability is within 1 standard deviation of mean, 95% within 2, 99.7% within 3. • A normal table gives other probabilities

Change confidence level by changing the width of margin of error -0.015 +.015 1 SE 68% 2 SEs 95% 3 SEs 99.7% 90% 1.6 SE phat =0.61

The CLT applies to • any linear combination of the observations • assuming observations are randomly sampled, and independent • it does NOT matter what the distribution of the population looks like • if n is small, the distribution will be only approximately normal, and this might be a very poor approximation

the CLT does NOT apply to • non-linear combinations, such as the sample median or the standard deviation • non-random samples • samples that are dependent

simulation • http://onlinestatbook.com/stat_sim/sampling_dist/index.html

Summary • Confidence Level is a statement about the sampling process, not the sample • Margin of error is determined to achieve the desired confidence level • We can calculate the confidence level only if we know the sampling distribution: the probability distribution of the sample

Pop Quiz • Is the value 61% a statistic or a parameter? • The margin of error is given as 3%. What does the margin of error measure? a) the variability in the sample b) the variability in the population c) the variability in repeated sampling

For next time: • In WWII, German army produced tanks with sequential serial numbers. The allies captured a few tanks, and wanted to infer the total number of tanks produced. • Suppose you had captured 10 tanks. Come up with three estimators for the total number of tanks. • Data: 911 5146 6083 944 11944 9365 6087 6647 7076 12275

Understanding Confidence Intervals and Sampling in Statistical Analysis