This lecture explores the distinction between statistics and parameters using a random sample of 1014 voters who express opinions on the President's political stance. We analyze how the observed statistic (51% thinking the President is "too liberal") relates to the unknown population parameter. A sampling experiment illustrates how models can be checked for consistency with the data. Emphasis is placed on the importance of using sophisticated statistics to estimate uncertainty and interpret sampling results effectively.
Parameters and statistics • Example: A random sample of 1014 voters is asked whether they think the President is too liberal, too conservative, or about right. • 514 (51%) say ‘too liberal’. • The observed percentage in the sample, 51%, is a statistic. • The unknown percentage of the population who would say ‘too liberal’ is a parameter. • Presumably it’s close to 51%, but we will never know it exactly. • Question we’d like to ask: “How close to the parameter is the statistic likely to be?”
Sampling experiment • http://demonstrations.wolfram.com/UrnProblem/
Experiment • Population (urn) – our class (22 people) • Question: dogs vs. cats • We will sample 5 • Assign numbers to everyone • Use R to sample • Repeat 5 times • Ascertain the truth and run R
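The class experiment can be sketched in code. The slides use R; this sketch uses Python's standard library instead, and the 12/10 split between dog and cat answers is invented purely for illustration, since the slides do not record the class's real answers.

```python
import random

def run_class_experiment(population, sample_size=5, repeats=5, seed=None):
    """Draw `repeats` simple random samples without replacement and
    return the sample proportion of 'dog' answers for each one."""
    rng = random.Random(seed)
    proportions = []
    for _ in range(repeats):
        sample = rng.sample(population, sample_size)  # sampling without replacement
        proportions.append(sample.count("dog") / sample_size)
    return proportions

# Hypothetical class of 22 people: the 12/10 split is an assumption.
class_answers = ["dog"] * 12 + ["cat"] * 10

# The parameter: known here only because we can ask the whole class.
true_proportion = class_answers.count("dog") / len(class_answers)

# Five statistics, one per repeated sample of 5.
sample_proportions = run_class_experiment(class_answers, seed=1)

print("parameter:", round(true_proportion, 3))
print("statistics:", sample_proportions)
```

With samples of only 5, the statistics can land far from the parameter, which is the point of repeating the draw.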
US Polls • 2012 estimate of the number of eligible voters is 206,072,000. • We sampled 1014 people at random and got 514 yes.
Main idea • We learned that given a model we can check if the data is consistent with it • Idea: Find models that are consistent with the data.
US Polls • 2012 estimate of the number of eligible voters is 206,072,000. • We sampled 1014 people at random and got 514 yes. • We will consider various models: • True proportion is p = 0.05, 0.10, 0.15, … • Which of the models is the data in agreement with?
“P-value” • For each model, consider the proportion of fake samples with fewer than 514 yes • Values close to 0 or 1 mean the data are not consistent with the model
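A minimal sketch of this check over the coarse grid of models, in Python rather than the R used in class. It assumes each fake sample can be drawn as 1014 independent yes/no answers with probability p, which is a close stand-in for drawing without replacement from an urn of about 206 million voters. The function names and the number of repetitions are illustrative choices, not taken from the slides.

```python
import random

N_SAMPLE = 1014   # voters sampled (from the slides)
OBSERVED = 514    # yes answers observed (from the slides)

def fake_sample_yes_count(p, n, rng):
    """Simulate one fake sample: number of yes answers among n voters."""
    return sum(rng.random() < p for _ in range(n))

def proportion_below(p, n=N_SAMPLE, observed=OBSERVED, reps=300, seed=0):
    """Fraction of fake samples under model p with fewer than `observed` yes."""
    rng = random.Random(seed)  # same seed for every model: common random numbers
    below = sum(fake_sample_yes_count(p, n, rng) < observed for _ in range(reps))
    return below / reps

# Coarse grid of models: p = 0.05, 0.10, ..., 0.95
models = [round(0.05 * i, 2) for i in range(1, 20)]
p_values = {p: proportion_below(p) for p in models}

for p, value in p_values.items():
    print(f"p = {p:.2f}: proportion of fake samples < {OBSERVED} is {value:.3f}")
```

On a grid this coarse, only models near p = 0.50 should give a value away from 0 and 1, which motivates looking at a finer grid.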
Cutoff • Use a finer grid of models • A usual cutoff is 0.05, split between both sides (0.025 and 0.975) • Models selected: [.477, .538]
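The cutoff rule can be sketched on a finer grid. Note the swap: instead of simulating fake samples, this Python sketch computes the exact binomial probability of seeing fewer than 514 yes answers, again assuming the binomial is an adequate stand-in for the urn model. The grid range 0.400 to 0.600 and the function names are assumptions made for illustration.

```python
import math

N_SAMPLE = 1014   # voters sampled (from the slides)
OBSERVED = 514    # yes answers observed (from the slides)

def binom_log_pmf(k, n, p):
    """Log of P(X = k) for X ~ Binomial(n, p), computed via log-gamma."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))

def prob_below(p, n=N_SAMPLE, observed=OBSERVED):
    """Exact P(X < observed) under model p."""
    return sum(math.exp(binom_log_pmf(k, n, p)) for k in range(observed))

def consistent_models(lo=0.025, hi=0.975):
    """Models on a 0.001-step grid whose tail probability lies in [lo, hi]."""
    grid = [i / 1000 for i in range(400, 601)]  # p = 0.400, 0.401, ..., 0.600
    return [p for p in grid if lo <= prob_below(p) <= hi]

selected = consistent_models()
print(f"models selected: [{selected[0]:.3f}, {selected[-1]:.3f}]")
```

The endpoints should land close to the slide's [.477, .538]; small differences are expected because the slide's values come from simulation rather than an exact calculation.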
Fake data • 3 sub-groups • 42,343,562 (R) • 59,280,986 (D) • 104,447,452 (I) • We sample roughly in proportion: • Sampled 220, 201 yes (R) • Sampled 284, 31 yes (D) • Sampled 510, 282 yes (I)
Issue • Too many levers to “fiddle” (a different K for each of the three groups) • Cannot simply look at how well the data fits • Needs more sophisticated statistics
Naïve parametric bootstrap • Compute one or more numbers from the data (point estimates). • Use these numbers to simulate a lot of fake data and see how the fake data can vary. • Use this variability to estimate the uncertainty in our estimator.
Original problem • Estimated K (number of yes in the population) = 206,072,000 × (514/1014) ≈ 104,458,588 • Estimated p in (.476, .537)
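The naïve parametric bootstrap for the unstratified poll can be sketched as follows, in Python rather than R. The assumptions here are that fake samples are drawn as independent yes/no answers with probability equal to the point estimate, that 2000 repetitions are enough, and that the interval is read off the 2.5% and 97.5% percentiles of the fake estimates. None of these details are specified on the slides.

```python
import random

N_POPULATION = 206_072_000  # eligible voters (from the slides)
N_SAMPLE = 1014             # voters sampled (from the slides)
OBSERVED = 514              # yes answers observed (from the slides)

def parametric_bootstrap_interval(yes, n, reps=2000, seed=0):
    """Return (p_hat, lower, upper) using a percentile bootstrap interval."""
    rng = random.Random(seed)
    p_hat = yes / n  # step 1: point estimate from the data

    # Step 2: simulate fake samples from the fitted model, re-estimate p each time.
    fake_estimates = sorted(
        sum(rng.random() < p_hat for _ in range(n)) / n
        for _ in range(reps)
    )

    # Step 3: use the spread of the fake estimates as the uncertainty.
    lower = fake_estimates[reps * 25 // 1000]
    upper = fake_estimates[reps * 975 // 1000 - 1]
    return p_hat, lower, upper

p_hat, lower, upper = parametric_bootstrap_interval(OBSERVED, N_SAMPLE)
k_hat = round(N_POPULATION * p_hat)  # estimated number of yes in the population

print(f"estimated K: {k_hat:,}")
print(f"estimated p in ({lower:.3f}, {upper:.3f})")
```

The interval varies a little with the seed and the number of repetitions, so it should be near, not identical to, the slide's (.476, .537).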
Stratified problem • Estimate K for each group, then combine to get a joint p estimate • p_combined = .20*.91 + .29*.11 + .51*.55 ≈ .50 (using unrounded weights and proportions) • This is not that much different from ignoring stratification • There is a (small) reduction in uncertainty • Bootstrap interval (.474, .525)
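A sketch of the stratified version, using the group sizes and sample counts from the “Fake data” slide. As before, this is Python rather than R, and it assumes independent yes/no draws within each group, 2000 repetitions, and a percentile interval. The dictionary layout and function names are illustrative.

```python
import random

# Group sizes and sample results as given on the "Fake data" slide.
STRATA = {
    "R": {"size": 42_343_562, "sampled": 220, "yes": 201},
    "D": {"size": 59_280_986, "sampled": 284, "yes": 31},
    "I": {"size": 104_447_452, "sampled": 510, "yes": 282},
}
TOTAL = sum(g["size"] for g in STRATA.values())

def combined_estimate(yes_counts):
    """Weight each group's sample proportion by its share of the population."""
    return sum(
        g["size"] / TOTAL * yes_counts[name] / g["sampled"]
        for name, g in STRATA.items()
    )

def stratified_bootstrap_interval(reps=2000, seed=0):
    """Percentile interval for the combined p from per-group fake samples."""
    rng = random.Random(seed)
    p_hats = {name: g["yes"] / g["sampled"] for name, g in STRATA.items()}
    estimates = []
    for _ in range(reps):
        # Simulate each group separately at its own estimated proportion.
        fake_yes = {
            name: sum(rng.random() < p_hats[name] for _ in range(g["sampled"]))
            for name, g in STRATA.items()
        }
        estimates.append(combined_estimate(fake_yes))
    estimates.sort()
    return estimates[reps * 25 // 1000], estimates[reps * 975 // 1000 - 1]

p_combined = combined_estimate({name: g["yes"] for name, g in STRATA.items()})
lower, upper = stratified_bootstrap_interval()

print(f"p_combined = {p_combined:.3f}")
print(f"bootstrap interval ({lower:.3f}, {upper:.3f})")
```

The interval should come out somewhat narrower than the unstratified one, because within each group the answers are far more uniform than in the population as a whole.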