A Stata program for Respondent Driven Sampling Matthias Schonlau, DIW, RAND (USA) Elisabeth Liebau, DIW Stata User Conference Berlin, June 25, 2010
What is RDS? • RDS = Respondent Driven Sampling • Invented by a sociologist (Heckathorn, 1997) • RDS is a chain referral sampling procedure • Sampling probabilities can be calculated • It is the only alternative yielding a probability sample when traditional methods do not work.
Typical RDS populations RDS is employed where traditional probabilistic sampling methods do not work well: • Sampling frame cannot easily be constructed • e.g. no registry available • Low prevalence • screening is ineffective/expensive • e.g. jazz musicians • Anonymity is an issue • e.g. questions about illegal drugs
RDS Sampling Procedure • Approach several seed respondents • Each respondent approaches 3 further respondents from their social network • Respondents are paid for participating and for each referral who contacts the interviewers • Stop when the desired sample size is reached
Differences from snowball sampling • Respondents recruit directly and do not give contact information to the interviewer • Length of referral chain is crucial to reach equilibrium • Formal theory requires keeping track of who recruits whom • No theory in snowball sampling • Theory attaches different sampling weights to recruits depending on their network size and the transition matrix • Snowball sampling does not use sampling weights
Red/blue example [Figure: referral network with a single seed, 3 recruits, max chain length = 3 (not counting seed). Example data from Heckathorn et al. 2002.] The name “red/blue” is explained later.
Motivation for Theory • If the referral chains are sufficiently long, characteristics of the eventual sample will be independent of the seeds • The recruitment distribution reaches an equilibrium • The probability of recruiting someone from a certain group (e.g. “white female”) can be derived.
Example: 2 groups (red/blue) [Tables: transition counts and transition probabilities]
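The step from transition counts to transition probabilities, and from there to the equilibrium, can be sketched in Python. This is only an illustration: the counts below are hypothetical (the numbers from the slide's tables are not reproduced here), and the closed-form equilibrium used is the standard result for a two-state Markov chain, not necessarily the computation inside the rds program.

```python
# Hypothetical recruitment counts (NOT the slide's data):
# outer key = recruiter's group, inner key = recruit's group.
counts = {
    "red": {"red": 30, "blue": 10},
    "blue": {"red": 20, "blue": 40},
}


def transition_probabilities(counts):
    """Divide each row of counts by its row total."""
    probs = {}
    for recruiter, row in counts.items():
        total = sum(row.values())
        probs[recruiter] = {recruit: n / total for recruit, n in row.items()}
    return probs


def two_group_equilibrium(probs, a="red", b="blue"):
    """Equilibrium group shares of a two-state Markov chain."""
    p_ab = probs[a][b]  # probability that group a recruits from group b
    p_ba = probs[b][a]  # probability that group b recruits from group a
    share_a = p_ba / (p_ab + p_ba)
    return {a: share_a, b: 1 - share_a}


probs = transition_probabilities(counts)
equilibrium = two_group_equilibrium(probs)
print(probs)
print(equilibrium)
```

At equilibrium, one further recruitment wave leaves the group shares unchanged, which is what "independent of the seeds" means in practice.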
Data required • id: respondent coupon • ref1, ref2, ref3: referral coupons • degree: network size • key: analysis variable
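A small Python sketch of this data layout may help. It assumes that a recruit's id equals one of the referral coupons handed out by the recruiter; the slide does not state this explicitly. The records are invented, and the derived names p_id and p_key are chosen only to mirror the recruiter variables passed to rds later in the slides.

```python
# Hypothetical records using the variable names from the slide.
respondents = [
    {"id": 100, "ref1": 101, "ref2": 102, "ref3": 103, "degree": 12, "key": "red"},  # seed
    {"id": 101, "ref1": 104, "ref2": 105, "ref3": 106, "degree": 5, "key": "blue"},
    {"id": 102, "ref1": 107, "ref2": 108, "ref3": 109, "degree": 8, "key": "red"},
    {"id": 104, "ref1": 110, "ref2": 111, "ref3": 112, "degree": 3, "key": "blue"},
]


def link_recruiters(respondents):
    """Attach the recruiter's id and key to every respondent (None for seeds)."""
    # Map each referral coupon to the respondent who handed it out.
    coupon_owner = {}
    for person in respondents:
        for field in ("ref1", "ref2", "ref3"):
            if person.get(field) is not None:
                coupon_owner[person[field]] = person
    linked = []
    for person in respondents:
        # Assumption: a recruit's id is the coupon received from the recruiter.
        recruiter = coupon_owner.get(person["id"])
        linked.append({
            **person,
            "p_id": recruiter["id"] if recruiter else None,
            "p_key": recruiter["key"] if recruiter else None,
        })
    return linked


linked = link_recruiters(respondents)
for row in linked:
    print(row["id"], row["key"], row["p_id"], row["p_key"])
```

Respondents whose coupon was not handed out by anyone in the data come out with p_id of None; in this sketch those are the seeds.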
rds syntax Two steps: • rds_network analyzes the network • rds does the estimation
Example: Iguchi et al. study • Large US study of men who have sex with men, drug users, and their sex partners • Innovative design, multiple sites • For illustration, we look at data from Los Angeles (Phase II) • Iguchi, M., Ober, A., Berry, S., Fain, T., Heckathorn, D., Gorbach, P., et al. (2009). Simultaneous Recruitment of Drug Users and Men Who Have Sex with Men in the United States and Russia Using Respondent-Driven Sampling: Sampling Methods and Implications. Journal of Urban Health, 86, 5-31.
Large number of seed respondents. The longest referral chain has length 18.
Required referral length (5) is smaller than the longest chain (18, previous slide), so convergence has been reached. If there were only two categories (here there are 4), both transition matrices would be identical.
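One way to think about a required referral length is to count how many waves a Markov chain needs before its group shares are close to equilibrium, whatever the seed's group. The Python sketch below does this under assumptions that are not in the slides: the transition matrix is hypothetical, the tolerance of 0.02 is arbitrary, and the stopping rule may differ from the one the rds program uses.

```python
def next_wave(shares, T):
    """Expected group shares of the next wave, given current shares and matrix T."""
    k = len(T)
    return [sum(shares[i] * T[i][j] for i in range(k)) for j in range(k)]


def equilibrium(T, iterations=10_000):
    """Approximate the equilibrium by applying the transition matrix repeatedly."""
    shares = [1.0 / len(T)] * len(T)
    for _ in range(iterations):
        shares = next_wave(shares, T)
    return shares


def required_waves(T, tol=0.02, max_waves=1_000):
    """Worst case over single-group seeds: waves until every share is within tol."""
    target = equilibrium(T)
    k = len(T)
    worst = 0
    for seed_group in range(k):
        # Start from a seed population made up of one group only.
        shares = [1.0 if g == seed_group else 0.0 for g in range(k)]
        waves = 0
        while max(abs(s - t) for s, t in zip(shares, target)) >= tol:
            shares = next_wave(shares, T)
            waves += 1
            if waves >= max_waves:
                break
        worst = max(worst, waves)
    return worst


# Hypothetical two-group transition matrix (rows sum to 1).
T = [[0.75, 0.25], [1 / 3, 2 / 3]]
waves_needed = required_waves(T)
print(waves_needed)
```

Comparing this number with the longest observed chain is the check described on the slide: if the chains are longer than the required length, the seeds should no longer matter.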
Cumulative sample proportions for increasing number of waves (Los Angeles) Theoretically, sample proportions should stabilize after 5 waves (see program output). In practice, cumulative sample proportions stabilize later, perhaps after 13 waves. (In practice, assumptions are never perfectly met.)
Population + Sample proportion The estimated population proportions are the main result. The sample proportions are surprisingly similar here. This is because the multiplicity degree does not vary much by group.
Equilibrium If all assumptions are met, the sample proportions will eventually converge to the equilibrium. The equilibrium does not equal the population proportion, because groups that are better networked (larger degree) are sampled more often.
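The relation between equilibrium and population proportion can be sketched in Python. The sketch assumes reciprocal ties, in which case a group's equilibrium share is proportional to its population share times its degree, so dividing by degree and renormalising recovers the population shares. The numbers are hypothetical, and the slides do not confirm that the rds program computes its estimates this way.

```python
def population_proportions(equilibrium, degree):
    """Down-weight each group's equilibrium share by its degree and renormalise."""
    raw = {group: equilibrium[group] / degree[group] for group in equilibrium}
    total = sum(raw.values())
    return {group: value / total for group, value in raw.items()}


# Hypothetical inputs: red is better networked than blue.
equilibrium = {"red": 0.6, "blue": 0.4}
degree = {"red": 20.0, "blue": 10.0}

estimate = population_proportions(equilibrium, degree)
print(estimate)
```

Here the better-networked red group makes up 60% of the equilibrium sample but less than half of the estimated population. With equal degrees the two sets of proportions coincide.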
Degree In the sample, each Hispanic reports an average of 15 connections in the target population. By design, the average degree is always greater than the multiplicity degree.
Homophily Race “other” recruits at random 96% of the time. Race “black” recruits other blacks 47% of the time and recruits at random 53% of the time.
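This reading of homophily matches the homophily index commonly used in the RDS literature, sketched below in Python. Whether the rds program computes it exactly this way is an assumption, and the input values are hypothetical. A value of 0.47 means that in-group recruitment behaves like a mixture: 47% of the time the recruit comes from the own group, and 53% of the time the recruit is drawn at random from the population.

```python
def homophily(s_aa, p_a):
    """Homophily index for one group.

    s_aa: probability that the group recruits from itself
    p_a:  the group's population proportion (must satisfy 0 < p_a < 1)
    """
    if s_aa >= p_a:
        # In-group preference: share of recruitments beyond random mixing.
        return (s_aa - p_a) / (1 - p_a)
    # Out-group preference: negative values, bounded below by -1.
    return (s_aa - p_a) / p_a


# Hypothetical values: a group that is 20% of the population
# and recruits its own members 57.6% of the time.
h = homophily(0.576, 0.2)
print(round(h, 2))
```

A group that recruits its own members exactly in proportion to its population share has homophily 0, that is, it recruits at random.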
Weight For example, each Hispanic receives the weight 1.0954048. These weights can be exported using the wgt option.
Weights Weights reproduce the estimated proportions:
rds ethnic, id(id) degree(netsize) recruiter_id(p_id) recruiter_var(p_key) wgt(wgt)
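Why weights reproduce the estimated proportions can be seen in a short Python sketch. It assumes that each group's weight is its estimated population proportion divided by its sample proportion; the slides do not give the formula, so this is an illustration rather than the program's definition. The sample and the estimates are hypothetical.

```python
# Hypothetical sample (6 red, 4 blue) and hypothetical estimated proportions.
sample = ["red"] * 6 + ["blue"] * 4
estimated = {"red": 0.45, "blue": 0.55}


def group_weights(sample, estimated):
    """Assumed weight: estimated population share / observed sample share."""
    n = len(sample)
    sample_share = {group: sample.count(group) / n for group in estimated}
    return {group: estimated[group] / sample_share[group] for group in estimated}


def weighted_proportions(sample, weights):
    """Group shares after weighting every respondent by the group's weight."""
    total = sum(weights[group] for group in sample)
    return {
        group: sum(weights[g] for g in sample if g == group) / total
        for group in weights
    }


weights = group_weights(sample, estimated)
reproduced = weighted_proportions(sample, weights)
print(weights)
print(reproduced)
```

Under this assumption a group that is over-represented in the sample gets a weight below 1, an under-represented group gets a weight above 1, and the weighted shares equal the estimates.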
Bootstrap results Bootstrapping is a method for obtaining confidence intervals.
bootstrap _b, reps(1000): ///
    rds ethnic, id(id) degree(netsize) recruiter_id(p_id) recruiter_var(p_key)
estat bootstrap, percentile
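The idea of a percentile bootstrap interval can be sketched in Python. The sketch resamples respondents with replacement and uses a plain sample proportion as the statistic; it is a stand-in for the rds estimate, it ignores any dependence created by the referral chains, and the data, seed, and function names are hypothetical.

```python
import random


def percentile_ci(data, statistic, reps=1000, level=0.95, seed=12345):
    """Percentile bootstrap interval: resample with replacement, take quantiles."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(data)
    estimates = sorted(
        statistic([data[rng.randrange(n)] for _ in range(n)])
        for _ in range(reps)
    )
    alpha = (1 - level) / 2
    # Approximate percentile positions in the sorted bootstrap estimates.
    lower = estimates[int(alpha * reps)]
    upper = estimates[min(reps - 1, int((1 - alpha) * reps))]
    return lower, upper


def share_red(sample):
    """Proportion of respondents in the red group."""
    return sum(1 for group in sample if group == "red") / len(sample)


# Hypothetical sample: 60 red and 40 blue respondents.
data = ["red"] * 60 + ["blue"] * 40
lower, upper = percentile_ci(data, share_red)
print(lower, upper)
```

The interval is read directly off the sorted bootstrap estimates, which is what distinguishes the percentile method from intervals based on a normal approximation.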
Outlook • Currently working on a paper • Software will be downloadable in about a month from within Stata by typing net search rds and following the link. For now, please email me and I will send the code.
THE END Contact : Matt Schonlau: email@example.com (until August) firstname.lastname@example.org Elisabeth Liebau: email@example.com Acknowledgement: We are grateful to Martin Iguchi, Sandy Berry, Allison Ober, Terry Fain for giving us access to the data for the example. The group is preparing a public release version of the data after additional publications are written.