Survey Methodology EPID 626

Survey Methodology EPID 626 Sampling, Part II Manya Magnus, Ph.D. Fall 2001

Lecture overview • Comments about Assignment I • More sampling techniques • Sampling error • Sample sizes

Comments about Assignment I • Late policy • Location of mailbox • Randomization vs. random selection • Validity, reliability • Sampling frames • Physician responses=?=“gold standard” • Research questions vs. survey questions • Registering for class

Comments about Assignment I • Grading: looked for completeness in answering questions, care in discussion of the survey, effort, basically correct information, and synthesis rather than cut-and-paste. • Questions about grade: email manyadm@tulane.edu

Comments about Assignment I • Grading: • ++ 90-100% • + 80-89% •  70-79% • - 60-69% • -- <60% • 0 not turned in

Random digit dialing (1) • Delineate the geographic boundaries of the sampling area • Identify all of the exchanges used in the geographic area • Identify the distribution of prefixes within the sampling area • Example: There may be 8 exchanges, but you may find that 3 of them are used for nearly two-thirds of residential lines.

Random digit dialing (2) • You may stratify based on the distribution of prefixes • Ex. Take more samples of the 3 exchanges that account for the most residential lines • Try to identify vacuous suffixes • These are suffixes not yet assigned or assigned in large groups to a business • Usually consider suffixes in 100s • ex. 0000-0099, 0100-0199

Random digit dialing (3) • May randomly select the four-digit suffixes • ex. use a random-numbers table • Alternatively, you may use a plus-one approach • When you reach a residence, use the number as a seed, and add fixed digits (one or two) to get the next sample
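The selection steps above can be sketched in Python. This is only an illustration under stated assumptions: the exchange prefixes, their residential shares, and the vacuous hundreds-blocks are made-up values, and `draw_rdd_numbers` and `plus_one` are hypothetical helper names. Exchanges are drawn with probability proportional to their share of residential lines, which is one simple way to take more samples from the busiest exchanges.

```python
import random

def draw_rdd_numbers(exchange_shares, vacuous_blocks, n, seed=0):
    """Draw n numbers of the form 'exchange-suffix'.

    exchange_shares: dict mapping exchange -> share of residential lines
        (hypothetical values, used as selection weights).
    vacuous_blocks: set of (exchange, hundreds_block) pairs to skip, where
        block 0 means suffixes 0000-0099, block 1 means 0100-0199, etc.
    """
    rng = random.Random(seed)
    exchanges = list(exchange_shares)
    weights = [exchange_shares[e] for e in exchanges]
    numbers = []
    while len(numbers) < n:
        exchange = rng.choices(exchanges, weights=weights, k=1)[0]
        suffix = rng.randrange(10000)          # random four-digit suffix
        if (exchange, suffix // 100) in vacuous_blocks:
            continue                           # skip unassigned/business blocks
        numbers.append(f"{exchange}-{suffix:04d}")
    return numbers

def plus_one(number, step=1):
    """Plus-one approach: add a fixed digit to a number that reached a residence."""
    exchange, suffix = number.split("-")
    return f"{exchange}-{(int(suffix) + step) % 10000:04d}"

# Illustrative inputs only; these are not real exchange data.
shares = {"555": 0.7, "556": 0.3}
vacuous = {("555", 0)}                         # 555-0000 through 555-0099
print(draw_rdd_numbers(shares, vacuous, 5))
print(plus_one("555-0199"))
```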

Random digit dialing (4) • Provides a nonzero chance of reaching any household within a sampling area that has a telephone line regardless of whether the number is listed • Is the probability of reaching every household equal? • No. Households with more than one phone line will have a greater probability than households with one phone line. • Adjust for unequal probability by weighting
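A minimal sketch of the weighting adjustment, assuming each responding household reports how many phone lines it has (the data below are invented for illustration, and `weighted_proportion` is a hypothetical helper). A household with k lines is k times as likely to be dialed, so it gets weight 1/k.

```python
def weighted_proportion(responses):
    """responses: list of (answer, n_lines), answer is 0 or 1, n_lines >= 1."""
    total_weight = sum(1.0 / lines for _, lines in responses)
    weighted_yes = sum(answer / lines for answer, lines in responses)
    return weighted_yes / total_weight

# Two one-line households and two two-line households (illustrative data).
sample = [(1, 1), (0, 1), (1, 2), (1, 2)]
unweighted = sum(a for a, _ in sample) / len(sample)
print(unweighted, weighted_proportion(sample))
```

Here the two-line households are down-weighted, so the weighted estimate is lower than the unweighted one.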

Random Digit Dialing (5) • Advantages: Inexpensive and easy to do • Disadvantages: 1. Large number of unfruitful calls 2. Will exclude individuals without phones 3. May be difficult to ascertain geographic area

Sampling distributions • The central limit theorem: In a sequence of samples of a population, for a particular estimate (say a mean), there will be a normal distribution around the true population value • As sample size increases, distribution becomes increasingly normal

This variation around the true value is the sampling error—it stems from the fact that, by chance, samples may differ from the population as a whole.

The larger the sample size and the less variance of what is being measured, the more tightly the sample estimates will “bunch” around the true population value, and the more accurate the sample-based estimate will be.

Example (1) (adapted from Babbie) • Survey at TUSPHTM • Approval of new Lundi Gras holiday • Dichotomous outcome: approve/disapprove • Survey population—aggregation of students • Sampling frame—student list • Random sample of students; representative sample of student body

Example (2) (adapted from Babbie) • Extremes and all combinations in between possible: 100% approve; 100% disapprove; 1% approve, 99% disapprove; etc. • First random sample: 48% approve, 52% disapprove • Second random sample: 20% approve, 80% disapprove • And so forth

Example (3) (adapted from Babbie) • What results from this exercise is a distribution of samples, or a sampling distribution. • As more independent random samples are selected, the sample statistics obtained will be distributed around the true population value in a known way.

Example (4) (adapted from Babbie) • They will be clustered about the true value within a certain range. • The range is given by the standard error. • We do not know if the value in our sample is within the range, just that if many similar samples were taken in the same fashion, X% would fall within the specified range; this one may or may not.

Example (5) (adapted from Babbie) • Probability theory says that 68% of sample estimates will fall within one standard error of the parameter and 95% will fall within two standard errors of the parameter • Confidence increases as the range widens
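The 68%/95% behaviour can be checked with a small simulation. This is a sketch under assumptions not in the slides: a true approval rate of 50%, samples of 400 students, 2000 repeated samples, and the usual formula sqrt(p(1-p)/n) for the spread of a sample proportion. `coverage` is a hypothetical helper name.

```python
import math
import random

def coverage(p=0.5, n=400, reps=2000, seed=0):
    """Share of simulated sample estimates within 1 and 2 standard errors of p."""
    rng = random.Random(seed)
    se = math.sqrt(p * (1 - p) / n)
    within_one = within_two = 0
    for _ in range(reps):
        # One random sample: count how many of n respondents approve.
        approve = sum(1 for _ in range(n) if rng.random() < p)
        error = abs(approve / n - p)
        within_one += error <= se
        within_two += error <= 2 * se
    return within_one / reps, within_two / reps

one_se, two_se = coverage()
print(f"within 1 SE: {one_se:.2f}, within 2 SE: {two_se:.2f}")
```

The printed shares should come out roughly near 0.68 and 0.95; they will not match exactly because the number of approvals in a sample is a whole number and the simulation is finite.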

Note difference between standard errors & standard deviations

Standard error of a mean

Standard error of a mean • The standard deviation of the distribution of sample estimates of the mean that would be formed if an infinite number of samples of a given size were drawn.
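In symbols, one standard way to write this definition is shown below. The notation is an assumption, not taken from the slides: $s^2$ is the sample variance of the measure and $n$ is the sample size.

```latex
SE(\bar{x}) = \sqrt{\frac{s^{2}}{n}} = \frac{s}{\sqrt{n}}
```

The formula shows both effects described earlier: a larger $n$ or a smaller variance shrinks the standard error.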

Proportions • Mean of a two-value (binomial) distribution • Var of a proportion = p(1-p) • So the standard error of a proportion is $\sqrt{p(1-p)/n}$

Table 2.1 Confidence Ranges for Variability Attributable to Sampling • Trends • If sample size=75 and p=0.20,

Confidence intervals • In a survey of 100 respondents, 20% say yes. What is the confidence interval for a 95% confidence level? • In a survey of 250 respondents, 10% say yes. What is the confidence interval for a 95% confidence level? What if 50% said yes?

In a survey of 100 respondents, 20% say yes. What is the confidence interval for a 95% confidence level? • Interval is ±8 percentage points (two standard errors of 4 points each). • 95% CI=(12%, 28%)

In a survey of 250 respondents, 10% say yes. What is the confidence interval for a 95% confidence level? What if 50% said yes? • Interval is about ±3.8 percentage points. • 95% CI is about (6.2%, 13.8%) • If 50% said yes, CI is about (43.7%, 56.3%)
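The three answers above can be reproduced with a few lines of Python. This sketch assumes the interval is the estimate plus or minus two standard errors, with the standard error of a proportion taken as sqrt(p(1-p)/n); `proportion_ci` is a hypothetical helper name.

```python
import math

def proportion_ci(p, n, z=2.0):
    """Approximate 95% CI for a proportion: p +/- z standard errors (z=2 assumed)."""
    se = math.sqrt(p * (1 - p) / n)
    half_width = z * se
    return p - half_width, p + half_width

for p, n in [(0.20, 100), (0.10, 250), (0.50, 250)]:
    low, high = proportion_ci(p, n)
    print(f"n={n}, p={p:.0%}: ({low:.1%}, {high:.1%})")
# n=100, p=20%: (12.0%, 28.0%)
# n=250, p=10%: (6.2%, 13.8%)
# n=250, p=50%: (43.7%, 56.3%)
```

Using z=1.96 instead of 2 would give slightly narrower intervals.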

Sampling error and sampling strategy • Sampling error under SRS is approximated by the standard error • Systematic sampling • If not stratified, sampling error is the same as in SRS. • If stratified, errors are lower than those associated with SRS of the same size for variables that differ (on average) by stratum, if rates of selection are constant across strata.

Sampling error and sampling strategy (2) • Unequal rates of selection decrease sampling error for oversampled groups. • They will generally produce sampling errors for the whole sample that are higher than those associated with SRS of the same size for variables that differ by stratum.

Sampling error and sampling strategy (3) • Clusters will produce sampling errors that are higher than SRS of the same size for variables that are more homogeneous within clusters than in the population as a whole. • You must look at the nature of the clusters to evaluate the effect on the sampling error.
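The effect of clustering can be illustrated with a simulation. Everything here is an assumption made for the sketch: a population of 200 clusters of 20 people, where each cluster has an approval rate of either 10% or 90% (so members of a cluster resemble each other), and a total sample of 200 people drawn either by SRS or as 10 whole clusters. `simulate` is a hypothetical helper name.

```python
import random
import statistics

def simulate(n_clusters=200, cluster_size=20, sample_clusters=10, reps=500, seed=0):
    """Return (spread of SRS estimates, spread of cluster-sample estimates)."""
    rng = random.Random(seed)
    # Build a population that is homogeneous within clusters.
    population = []
    for _ in range(n_clusters):
        rate = rng.choice([0.1, 0.9])
        population.append([1 if rng.random() < rate else 0
                           for _ in range(cluster_size)])
    everyone = [y for cluster in population for y in cluster]
    n = sample_clusters * cluster_size       # same total size for both designs

    srs_estimates, cluster_estimates = [], []
    for _ in range(reps):
        srs = rng.sample(everyone, n)                      # simple random sample
        srs_estimates.append(sum(srs) / n)
        chosen = rng.sample(population, sample_clusters)   # whole clusters
        cluster_estimates.append(sum(sum(c) for c in chosen) / n)
    return statistics.stdev(srs_estimates), statistics.stdev(cluster_estimates)

srs_se, cluster_se = simulate()
print(f"SRS: {srs_se:.3f}  cluster: {cluster_se:.3f}")
```

With clusters this homogeneous, the cluster design's estimates vary far more than the SRS estimates of the same size. If every cluster had the same approval rate, the two spreads would be about equal, which is why the nature of the clusters matters.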

Caveats • Sampling error is in no way the only source of error. • Non-sampling error, bias, error resulting from incorrect specification of sampling frame, etc., etc., are also sources of error. • Often the latter are more insidious as they are seldom quantifiable • Total survey approach useful in this regard.

Sample size (1) • Very important to consider prior to undertaking study • Consult a biostatistician • Many references in texts, available spreadsheets, stat programs, EpiInfo, etc. • Never feel bad asking for assistance

Sample size (2) • What not to do • Sample size does not rely on the fraction of the population that is sampled. Nor does it depend on the size of the population you want to describe. • Sample size should not be decided solely based on what others have previously done. • Sample size should not be based on the desired level of precision for just one estimate.

Sample size (3) • What to do • Develop analysis plan • Desired precision of estimates for subgroups • Consider research questions • Affordability • Feasibility • And to some extent, previous studies

Sample size (5) • Parameters required to calculate sample size: • Null hypothesis: what precisely are you asking/testing? • α [Pr(type I error)] • β [Pr(type II error)], usually included as 1-β = power • What difference between groups do you want to observe? (e.g., μ1 - μ2) • What is a good estimate of variance in population?
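As one illustration of how these parameters combine, the sketch below uses a common normal-approximation formula for comparing two group means. The assumptions are not from the slides: a two-sided test, equal group sizes, and a common known standard deviation. `n_per_group` is a hypothetical helper name. A biostatistician or dedicated software may use a more exact method.

```python
import math
from statistics import NormalDist

def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Approximate sample size per group to detect a mean difference `delta`.

    delta: difference between group means you want to detect
    sigma: assumed common standard deviation in the population
    alpha: Pr(type I error), two-sided
    power: 1 - Pr(type II error)
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * (sigma * (z_alpha + z_beta) / delta) ** 2)

print(n_per_group(delta=5, sigma=10))    # 63
print(n_per_group(delta=5, sigma=20))    # more variance, larger sample needed
print(n_per_group(delta=10, sigma=10))   # larger difference, smaller sample
```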

Sample size (6) • How sample size works—some examples

Sample size (7) • Sample size and power: Group A vs. Group B [figure]

Sample size (8) • Sample size and power: A vs. B [figure]

Sample size (9) • Variability and power: A vs. B [figure]

Sample size (10) • Variability and power: A vs. B [figure]
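How power moves with sample size and variability can be checked numerically. This is a sketch using a normal approximation for a two-sided comparison of two group means with equal group sizes, ignoring the negligible chance of rejecting in the wrong direction. The numbers are illustrative and `power_two_means` is a hypothetical helper name.

```python
from statistics import NormalDist

def power_two_means(n, delta, sigma, alpha=0.05):
    """Approximate power with n per group, true difference delta, common SD sigma."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    se_diff = sigma * (2 / n) ** 0.5     # standard error of the difference in means
    return NormalDist().cdf(abs(delta) / se_diff - z_alpha)

# Larger samples raise power (delta=5, sigma=10 held fixed).
for n in (20, 63, 150):
    print(f"n={n}: power={power_two_means(n, 5, 10):.2f}")

# More variability lowers power (n=63, delta=5 held fixed).
for sigma in (5, 10, 20):
    print(f"sigma={sigma}: power={power_two_means(63, 5, sigma):.2f}")
```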

Non-response (1) • Very big issue • Source of non-sampling error • Can lead to bias, uninterpretability of results • Violates whole point of probability sample, yet unavoidable

Non-response (2) • Issue in probability as well as non-probability samples • Exists on many levels

Non-response (3) • Whole sample: not reached vs. reached

Non-response (4) • Reached: cannot participate vs. can participate
