Survey Experiments: Past, Present, Future
Thomas M. Guterbock, Director, Center for Survey Research, University of Virginia
Harvard Program in Survey Research, 2010 Annual Conference, October 22, 2010
Overview • Why survey experiments are so cool • Defining the survey experiment • Methods vs. substance • Scan of survey methods experiments • Substantive experiments • Key design issues in survey experiments • Factorial (“vignette”) surveys • Example: dirty bomb scenarios in the National Capital Region • A look at the future of survey experiments
Why survey experiments are so cool! • Sample surveys: generalizability (external validity) • Experiments: valid causal inference (internal validity) • Survey experiments: • Generalizability • Valid causal inference • External & internal validity • The best of both worlds!
Two knowledge gaps • Psych experimenters don’t always know a lot about doing surveys • Some don’t think sampling is very important • They don’t think surveys measure things well • Survey researchers don’t always know a lot about experiments • And they question the external validity or relevance of many lab experiments • Assumption: this audience, like the author, is more likely to be in the latter group
What’s a survey experiment? (and what’s not)
Experiments generally • An intervention, treatment or stimulus is varied • Subjects randomly assigned to treatment vs. control • Outcomes are measured • Because of random assignment, any variation in outcomes can be attributed to the treatment • Absent various threats to internal validity • The ‘classic experiment’ involves pre- and post-tests (measurements of outcome variables)
Survey experiments • Systematically vary one or more elements of the survey across subjects • Usually do not include ‘pre-test’ measurement • Thus, most survey experiments are not ‘classic’ in design • “Posttest-only control group design” • Random assignment is critical to the design
Diagram of Basic Experimental Design Source: Babbie textbook
An inclusive definition It’s still a survey experiment even if: • Sample is small • Sample is not probability based • Sample is not representative • It’s done in a lab setting • It’s only part of a pre-test for a survey project • Any aspect of the survey protocol is varied Large, probability-based samples do make the survey experiment better!
What’s NOT a survey experiment • General tinkering . . . • “Let’s experiment!” • One shot trial of a new method • Mid-stream change in method • No true randomization of cases when this happens • Experiments that only use a survey to measure outcomes, pre- or post-intervention • But do not vary the survey itself
Is this classical experiment a survey experiment? • The outcome is measured with a questionnaire • But this is not a survey experiment, because the survey itself is not varied (Source: Babbie textbook)
Methods vs. Substance A slippery distinction
Prevailing narrative . . . • Methods experiments have been around a long time • Mostly split-ballot question wording experiments • New trend is: use survey experiments for testing theories about substantive social science problems • The field is moving from methods experiments to substantive experiments • And from applied to basic research . . . This is only a partial picture.
In fact . . . • Methods experiments are burgeoning in number • Methods experiments deal with far more than question wording • Some methods experiments are quite complex • The line between ‘methods’ and ‘substantive’ research is increasingly blurred • As theories are developed to explain variations in survey response, methods experiments are used to test these theories. • The same theories may underlie some ‘substantive’ experimentation
Survey experiments about survey methods: A quick scan of the landscape
What’s a methods experiment? • Purpose: improve survey methods • Lower the cost • Deliver quicker results • Increase usability • Decrease survey error • Decrease: • Sampling error • Coverage error • Nonresponse error • Measurement error
Independent variables (factors) in methods experiments • Mode comparisons • Phone versus personal interviews • Web versus mail • Usually address several types of error • Coverage, nonresponse, measurement • Sampling and coverage experiments • RDD versus Listed sample • ABS versus area-probability • Methods of selection within households
More factors . . . • Unit non-response • Dillman’s “Total Design” research • Response rate research in mailed surveys • Reminders, advance letters, stamps, length, color • Response rate research for web surveys • Paper reminders, progress indicators • Advance letters to boost telephone response • Cash incentives research • Item non-response • Arrows, visual design, skip instructions
Still more factors . . . • Measurement error experiments • Classic (and newer) experiments in • Question wording • Question sequencing • Offering a “don’t know” response or not • Question formats, response scales • Unfolding questions • Numbering, labeling of scales • Cell phone versus landline interviewing • Interviewer, context effects • Race, gender of interviewer • School versus home setting • Conversational vs. structured interviewing
Outcomes measured in methods experiments • Response rates • Completion, cooperation, refusal, mid-survey break-off rates • Responses to the survey questions • Level of reporting of sensitive behaviors • % who say they “don’t know” • % giving extreme responses, standard deviations • Length of open-ends • Data quality measures • Rate of skip errors • Missing responses • Interview length • Usability and cost measures • Including results from para-data
In short . . . The primary corpus of accepted research in survey methods today is almost entirely based on: Survey experiments
Substantive survey experiments • Most notable in the field of race relations • Cf. Sniderman, Gilens, Kuklinski, et al. • "Mere mention" experiment • Unbalanced list technique • Activation of racial identity affecting minority responses • Movement spreading to other substantive areas • But methods experiments are still more common • TESS has fostered much experimentation • Over 200 experiments by 100 researchers by 2007 • Won 2007 AAPOR Warren Mitofsky Innovators Award • Factorial "vignette" technique: a long tradition • (more on this later)
Split-ballot vs. within-subject • The vast majority of survey experiments use Split-Sample designs • “Randomized Posttest/Control Group” design • Statistical tests based on independent samples • Needs a lot of cases; most surveys have plenty • Some use within-subjects designs • Greater statistical power (paired comparisons) • But later responses may be influenced by earlier questions • Factorial vignette surveys often combine these • Vignettes vary across subjects • But each subject scores several vignettes
Statistical power issues • Power of a split-sample design is greatest when cases are evenly divided • If groups are equal in size, critical value = ME × √2 (≈ 1.41) • Example: N = 1,200, split into two groups of 600 each • ME for each group = +/- 4% • Critical value for contrast = 4% × 1.41 = +/- 5.6% • Sometimes the control group needs to be larger • To preserve comparability with an earlier survey • Because the experimental treatment is expensive • Many experiments use more than one treatment • Are pre-tests big enough for an experiment?
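The √2 rule above can be checked with a few lines of arithmetic. This is a minimal sketch assuming simple random sampling, a 95% confidence level, and the most conservative proportion p = 0.5; the function names are illustrative, not from any survey package. It also shows why an even split is most powerful: the same 1,200 cases divided 300/900 need a larger difference to reach significance.

```python
import math

Z_95 = 1.96  # two-sided critical z for 95% confidence


def margin_of_error(n, p=0.5, z=Z_95):
    """Margin of error for one group's proportion under simple random sampling."""
    return z * math.sqrt(p * (1 - p) / n)


def contrast_critical_value(n1, n2, p=0.5, z=Z_95):
    """Smallest difference between two independent groups' proportions that
    is significant at the chosen level (same p assumed in both groups)."""
    return z * math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))


me_group = margin_of_error(600)
crit_equal = contrast_critical_value(600, 600)    # equals me_group * sqrt(2)
crit_unequal = contrast_critical_value(300, 900)  # same N, uneven split
print(f"ME per group:               +/- {me_group:.1%}")
print(f"Critical value, 600 vs 600: +/- {crit_equal:.1%}")
print(f"Critical value, 300 vs 900: +/- {crit_unequal:.1%}")
```

The exact product 4.0% × √2 is about 5.66 points; the slide's 5.6% comes from rounding √2 to 1.41.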
Randomization issues • Full randomization between groups is crucial to the experiment’s validity • Difficult to get people to carry out randomization • If possible, have the computer do it • In CATI systems, don’t randomize within the interview • Pre-assign all values and store in the sample database • Be sure to track which group is which! • Don’t confound experimental effects with interviewer effects • Randomize across interviewers • Control for interviewer effects in analysis • Keep interviewers blind to your hypotheses
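Pre-assigning conditions can be as simple as shuffling the sample file once and dealing conditions out in turn, then storing the result with each case so the CATI system only reads it. The sketch below assumes a hypothetical list of case IDs and a two-version split ballot; real sample-management systems have their own file formats. A fixed seed makes the assignment reproducible, which helps with "be sure to track which group is which."

```python
import random
from collections import Counter


def preassign_conditions(case_ids, conditions, seed):
    """Assign every sample case to a condition before fielding begins.

    Cases are shuffled, then conditions are dealt out round-robin, so group
    sizes differ by at most one and no interviewer decides who gets what.
    """
    rng = random.Random(seed)  # fixed seed: the assignment can be regenerated and audited
    ids = list(case_ids)
    rng.shuffle(ids)
    return {case_id: conditions[i % len(conditions)] for i, case_id in enumerate(ids)}


# Hypothetical sample of 1,200 cases and a two-version split ballot.
sample = [f"case{n:04d}" for n in range(1200)]
assignment = preassign_conditions(sample, ["A", "B"], seed=20101022)
counts = Counter(assignment.values())
print(counts)
```

Because the condition travels with the case rather than being drawn inside the interview, cases of both versions flow to every interviewer, which keeps experimental effects from being confounded with interviewer effects.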
More design issues • Lab experiment or field experiment • Lab gives better control over background variables • Usually lower cost • Easier to do complex measurements • Field experiments give greater realism, representativeness • Better external validity • “Packages” vs. factorial design • Best design depends on study purposes
Factorial (vignette) surveys (with thanks to the late Steven L. Nock, my co-author)
Factorial surveys • Substantive survey experiments about factors that affect • Judgments • Decisions • Evaluations • These studies tell us: • What elements of information enter into the judgment? • How much weight does each element receive? • How closely do people agree about the above?
More on factorial surveys . . . • Respondents evaluate hypothetical situations or objects, known as ‘vignettes.’ • Experimental stimuli: vignettes • Outcomes of interest: judgments about the vignettes • Judgments will vary across the vignettes • But also across respondents
Why factorial surveys are cool • When values of factors are allocated independently across vignettes, the factors are uncorrelated. • This simplifies modeling of the effects on judgments • Factors are also independent of respondent characteristics • Respondents can be given quite complex vignettes to consider • Unusual combinations presented more easily as vignettes than in the real world
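The claim that independently allocated factors are uncorrelated can be illustrated by simulation. This sketch draws two hypothetical factors (five and four values) independently for each of 10,000 vignettes and computes their Pearson correlation, which should sit near zero apart from chance variation.

```python
import random


def pearson_r(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(7)
n_vignettes = 10_000
# Two hypothetical factors; each vignette's value on each is drawn independently.
income_level = [rng.randrange(5) for _ in range(n_vignettes)]
age_level = [rng.randrange(4) for _ in range(n_vignettes)]
r = pearson_r(income_level, age_level)
print(f"r = {r:.3f}")
```

With independent draws the sample correlation has a standard error of roughly 1/√n, here about 0.01, so each factor's effect can be estimated without adjusting for the others.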
Key design choices in factorial surveys • How many factors? How many values? • More factors make the respondent's task more difficult • More values on more factors yield a larger number of possible unique vignettes • Phone surveys need simpler vignettes • Example: Nock's study, with 10 factors of 2 to 9 values each, had 270,000 possible vignettes (the product of the numbers of values) • These are sampled (by SRS) and randomly assigned across respondents
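The size of the vignette universe is the product of the numbers of values, and SRS plus random assignment can be done without ever listing the universe. This is a sketch under stated assumptions: the five factors and their value counts are hypothetical (the slide gives only the 2-to-9 range for Nock's ten factors), and vignettes are drawn without replacement so no two respondents see the same one.

```python
import math
import random

# HYPOTHETICAL factors and numbers of values (not the design of Nock's study).
LEVELS = {"age": 6, "sex": 2, "income": 5, "education": 4, "occupation": 9}

UNIVERSE_SIZE = math.prod(LEVELS.values())  # count of unique vignettes


def decode(index):
    """Map an integer in [0, UNIVERSE_SIZE) to one vignette (a value index per factor)."""
    vignette = {}
    for factor, n_values in LEVELS.items():
        index, vignette[factor] = divmod(index, n_values)
    return vignette


def draw_decks(n_respondents, per_respondent, seed):
    """SRS of vignettes from the full universe, randomly dealt out to respondents."""
    rng = random.Random(seed)
    # random.sample draws without replacement and returns the draws in random
    # order, so slicing into consecutive blocks is itself a random assignment.
    drawn = rng.sample(range(UNIVERSE_SIZE), n_respondents * per_respondent)
    return [
        [decode(i) for i in drawn[r * per_respondent:(r + 1) * per_respondent]]
        for r in range(n_respondents)
    ]


decks = draw_decks(n_respondents=100, per_respondent=5, seed=42)
print(UNIVERSE_SIZE, decks[0][0])
```

Encoding each vignette as an integer keeps the approach practical even for a universe of 270,000 combinations, since only the sampled indices are ever decoded.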
More design choices . . . • Which vignettes to present? • When there are a lot of vignettes, these must be sampled • SRS across all values yields uncorrelated factors • But distribution on some factors may not be like the real world • Some randomly created vignettes are implausible • The number to present to the respondent • Need to avoid fatigue, boredom, and satisficing • Later judgments may be more affected by just a few factors, to which respondents become increasingly attentive • This choice depends on mode, sample, type of respondent • How many assessments? • One judgment, or a series of questions about each vignette?
Another design choice • What survey mode to use? • Paper, self-administered is possible • use Mail Merge to create unique set of vignettes on each questionnaire • Phone is possible • But number of vignettes and number of factors is restricted due to oral administration • Face-to-face with paper vignettes • CASI (with interviewer guidance) • Internet
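The Mail Merge idea for paper questionnaires amounts to filling one wording template with each vignette's values. The sketch below does the same thing with Python's `string.Template`; the vignette wording, factor names, and value labels are all hypothetical and stand in for whatever a real study would use.

```python
from string import Template

# Hypothetical vignette wording and value labels, for illustration only.
VIGNETTE = Template(
    "$name is a $age-year-old $occupation who earns $income a year. "
    "How fair is this income?"
)

LABELS = {
    "name": ["Mr. Adams", "Ms. Baker"],
    "age": ["25", "45", "65"],
    "occupation": ["teacher", "nurse", "electrician"],
    "income": ["$30,000", "$60,000", "$90,000"],
}


def render(vignette):
    """Fill the template with the labels for one vignette's value indices."""
    return VIGNETTE.substitute({f: LABELS[f][i] for f, i in vignette.items()})


# One respondent's questionnaire: each dict gives value indices for one vignette.
deck = [
    {"name": 0, "age": 1, "occupation": 2, "income": 0},
    {"name": 1, "age": 2, "occupation": 0, "income": 2},
]
for number, vignette in enumerate(deck, start=1):
    print(f"Vignette {number}: {render(vignette)}")
```

Keeping the wording in one template and the values in one table means every printed questionnaire differs only in the experimentally varied elements.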
Analysis can be complex • If 500 respondents each rate 5 vignettes . . . • Then 2,500 vignettes are rated • Data must be converted to a vignette file of n= 2,500 • But: ratings are not independent! • Each respondent is a ‘cluster’ of interdependent ratings • Solution: • Multi-level analysis • Analyze models using HLM
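Converting the respondent file into a vignette file is a reshape from one row per respondent to one row per rated vignette, carrying the respondent ID onto every row so the clustering is preserved. This is a minimal sketch with hypothetical field names and randomly generated 1-to-9 ratings; the position column records the order in which each vignette was rated.

```python
import random
from collections import Counter


def to_vignette_file(respondents, per_respondent):
    """Reshape one-row-per-respondent records into one row per rated vignette.

    Each record is assumed to hold parallel lists 'vignette_ids' and 'ratings'.
    """
    rows = []
    for resp in respondents:
        vids, ratings = resp["vignette_ids"], resp["ratings"]
        if not len(vids) == len(ratings) == per_respondent:
            raise ValueError(f"respondent {resp['id']} does not have {per_respondent} ratings")
        for position, (vignette_id, rating) in enumerate(zip(vids, ratings), start=1):
            rows.append({
                "respondent_id": resp["id"],  # the cluster identifier
                "position": position,         # order in which the vignette was rated
                "vignette_id": vignette_id,
                "rating": rating,
            })
    return rows


# Hypothetical data: 500 respondents each rate 5 vignettes on a 1-9 scale.
rng = random.Random(1)
respondents = [
    {"id": r,
     "vignette_ids": [r * 5 + k for k in range(5)],
     "ratings": [rng.randint(1, 9) for _ in range(5)]}
    for r in range(500)
]
vignette_file = to_vignette_file(respondents, per_respondent=5)
cluster_sizes = Counter(row["respondent_id"] for row in vignette_file)
print(len(vignette_file), len(cluster_sizes))
```

The resulting long file is what a multilevel routine expects; for example, statsmodels' `mixedlm` takes a `groups` argument that would be set to the respondent ID.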
2009 Survey of Behavioral Aspects of Sheltering and Evacuation in the National Capital Region • Sponsors: Virginia Dept. of Emergency Management; U.S. Dept. of Homeland Security
Features of the Survey • In-depth survey: average interview length 28 minutes • Fully supported Spanish-language interviews as needed • Data collection using CATI (Computer-Assisted Telephone Interviewing) • 2,657 interviews conducted by CSR, Sept.-Dec. 2009 • Triple-frame sample design includes cellphones, landline RDD, and listed phones • Inclusion of cellphones increases representativeness • Margin of error: +/- 2.3 percentage points • Weighting by ownership, race, gender, geography, and type of telephone service
Event Scenarios • Focus: dirty bomb(s) in the NCR • Will residents decide to stay or to go? • 3 scenarios at increasing hazard levels: minimum, moderate, maximum • Each respondent is presented with only two of the three tested scenarios • Over 5,000 scenario tests in the study
Factorial Design • Four aspects ("factors") of the scenarios were experimentally varied using random assignment • PATH: which two hazard levels are asked • NOTICE: whether the event is preceded by prior notice or threats • LOCATION: the respondent's location when the event occurs • SOURCE: the source of the information about the event • NOTICE, LOCATION, and SOURCE are kept constant for both scenarios asked
Three Levels of Hazard
Factors: four levels of SOURCE • The four factors result in 48 different possible versions of the scenario, randomly assigned.
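The rule that NOTICE, LOCATION, and SOURCE stay fixed across a respondent's two scenarios means the random draw happens once per respondent, not once per scenario. The sketch below illustrates that. Only the four SOURCE levels and the total of 48 versions are stated above; the level labels are placeholders, and the split of the remaining factor of 12 (three possible hazard-level pairs for PATH, asked in increasing order, and two levels each for NOTICE and LOCATION) is an assumption made for illustration.

```python
import itertools
import random

HAZARD_LEVELS = ["minimum", "moderate", "maximum"]

# ASSUMED level sets, for illustration only (labels are placeholders).
FACTORS = {
    "path": list(itertools.combinations(HAZARD_LEVELS, 2)),  # two hazard levels, increasing order
    "notice": ["no prior notice", "prior notice or threats"],
    "location": ["location 1", "location 2"],
    "source": ["source 1", "source 2", "source 3", "source 4"],
}

# Every combination of one level per factor is one version of the scenario.
VERSIONS = list(itertools.product(*FACTORS.values()))


def assign_version(rng):
    """Draw one version per respondent and expand it into two scenarios.

    NOTICE, LOCATION and SOURCE are drawn once and reused, so they are held
    constant across both scenarios the respondent hears.
    """
    path, notice, location, source = rng.choice(VERSIONS)
    return [
        {"hazard": hazard, "notice": notice, "location": location, "source": source}
        for hazard in path
    ]


rng = random.Random(2009)
scenarios = assign_version(rng)
print(len(VERSIONS), scenarios)
```

In a CATI study this draw would be made in advance and stored with each case, as on the randomization slide, rather than inside the interview.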
Detailed Follow-up Questions • Follow-up questions were asked about the decision to shelter in place or evacuate, as appropriate (once only) • Shelter-in-place detail: willingness to remain at location, reasons for leaving, what would aid staying • Evacuation detail: reason for leaving, destination, mode of travel, needs, use of designated route • Mandatory evacuation: everyone was eventually asked the evacuation detail