The Network Scale-Up Method

The Network Scale-Up Method Counting the Uncountable

The network scale-up team • H. Russell Bernard (Univ. of Florida) • Peter D. Killworth (Southampton Oceanography Centre)† • Christopher McCarty (Univ. of Florida) • Eugene Johnsen (UC-Santa Barbara) • Gene A. Shelley (Georgia State Univ.)

Necessary background • Original work funded by NSF Methods, Measurement and Statistics • Justification was for estimating size of hard-to-count populations • Grant was called “Counting the uncountable” • Unstated reason was learning something about the distribution of personal network size • This is a building block of social structure • This leads to some confusion in the literature • Now has been picked up by UNAIDS and WHO as another method

Methods to estimate the size of Most At Risk Populations (MARPS) • Estimation using a cross-sectional survey • Capture-recapture • Multiplier • Network Scale-up Method (NSUM)

Conference on surveying hard to reach populations • www.amstat.org/meetings/h2r/2012/

A primitive model • t= the size of a population (e.g. the U.S.) • e = the size of some subpopulation within it (e.g. diabetics). • m = the number of members of the subpopulation known by any one person • c = personal network size • Bibliography in notes, below and at end

A maximum likelihood estimate of an individual’s network size: where there are L known subpopulations. (Here i is the individual, who knows mij in subpopulation j.) Network size is (the sum of all the people you say you know in some subpopulations of known size, divided by the total size of those subpopulations) times the population within which the subpopulations are embedded. Killworth, P. D., C. McCarty, H. R. Bernard, G. A. Shelley, and E. C. Johnsen. Estimation of Seroprevalence, Rape and Homelessness in the U.S. Using a Social Network Approach. Evaluation Review 22:289–308.

Consequences • If: • …you know the network size for each of a set of respondents • …and you know how many people each knows in a hard to count population • you can estimate the size of the hard to count population

Two step process • Estimate personal network size for a set of respondents • Back estimation from populations of known size • Summation method • Apply this to reported m for unknowns to estimate the size of the unknown subpopulation

Our Surveys in the 1990s • We did seven telephone surveys in the U.S. • We did another set in Mexico • Across seven surveys in the U.S. , we consistently found an average network size for the US of 290 (sd 232, median 231). • In Mexico the network sizes were smaller

Is 290 is an artifact of the method? • We test this in three ways. • (1) Make the estimates using a different method. • (2) Experiment with parameters and see if the outcome varies in expected ways. • (3) Compare values of c across populations of known relative sizes.

Reliability I: Use a different method • (1) In one survey, we estimated c by asking people how many people they know in each of 17 relation categories – people who are in their immediate family, people who are co-workers, people who provide a service – and summing. • This summation method once again produced a mean for c of 290. • McCarty, C., P. D. Killworth, H. R. Bernard, E. Johnsen, and G. A. Shelley. Comparing Two Methods for Estimating Network Size. Human Organization 60:38–39

Reliability II: Change the data • (2) We changed reported values at or above 5 to a value of 5 precisely. The mean dropped to 206, a change of 29%. • We set values of at least 5 to a uniformly distributed random value between 5 and 15. We repeated the random change (5 – 15), but only for large subpopulations (with >1 million). • The mean increased to 402, a change of 38% -- in the opposite direction.

Reliability III: Survey clergy • (3) We surveyed a national sample of 159 members of the clergy – people who are widely thought to have large networks. • Mean c = 598 for the scale-up method • Mean c = 948 for the summation method

So, 290 is not a coincidence • 1. Two different methods of counting produce the same result. • 2. Changing the data produces large changes in the results, and in the expected directions. • 3. People who are widely thought to have large networks do have large networks.

The data track

Distribution of network size

r =.79 … rises to .94 without the outliers

Over- and under-estimation • There is a tendency for people to overestimate small populations (<2 million) and to underestimate large ones (>3 million). • The two largest populations are people who have a twin brother or sister and diabetics. • These are highly overestimated. • Without these two outliers, the correlation rises from r = .79 to r = .94

Stigma vs. not newsworthy • Being a twin or a diabetic is neither stigmatizing, nor newsworthy. • From ethnographic evidence, personal information about close co-workers or business associates can take a decade or more to be transmitted ... and in the case of being a twin or a diabetic, may never be transmitted. • 1990 Gene Anne Shelley, H. R. Bernard, and P.D. Killworth. Information Flow in Social Networks. J. of Quantitative Anthropology 2:201–25. • 1995 Shelley, G.A., H. R. Bernard, P. D. Killworth, E. C. Johnsen, and C. McCarty. Who Knows Your HIV Status? What HIV+ Patients and Their Network Members Know About Each Other. Social Networks , 17, 189-217. • 2006 Shelley, G. A., P. D. Killworth, H. R. Bernard, C. McCarty, E. C. Johnsen, and R. E. Rice. Who knows your HIV status II: Information propagation within social networks of seropositive people. Human Organization 65:430-444.

Another encouraging result • Charles Kadushin ran a national survey to estimate the prevalence of crimes in 14 cities, large and small, in the U.S. • He asked 17,000 people to report the number of people they knew who had been victims of six kinds of crime and the number of people they knew who used heroin regularly. • 2006 C. Kadushin, P. D. Killworth, H. Russell Bernard, and A. Beveridge. Scale-up methods as applied to estimates of heroin use. Journal of Drug Issues 36:417-440.

Here are the estimates for the number of heroin users in each of the 14 cities, along with the estimates from the UCR.

Heroin user by NSUM versus Unified Crime Reports

Compromising assumptions • 1. Transmission effects: Everyone knows everything about everyone they know. • 2. Barrier effects: Everyone in t has an equal chance of knowing someone in e. • 3. Inaccurate recall. People don’t recall accurately the number of people they know in the subpopulations we ask them about. • The accuracy problem is discussed earlier.

Our estimates using NSUM • RDD telephone survey of 1554 adults in the U.S. in 1994. • Seroprevalence: 800,000 ± 43,000; • Homeless: 526,000 ± 35,000; • Women raped in the last 12 months: 194,000 ± 21,000. • These are all close to other estimates made with various enumeration or surveillance methods. • 1998 P. D. Killworth, C. McCarty, H. R. Bernard, G. A. Shelley, and E. C. Johnsen. Estimation of Seroprevalence, Rape and Homelessness in the U.S. Using a Social Network Approach. Evaluation Review 22:289–308.

How to do it

Network scale-up begins like most surveys • Define respondent population • Choose sample frame • Choose survey mode • Choose sample size • Design questionnaire (This is the part that’s different)

Selecting respondent population • Respondent population is not the same as the population to be estimated (target population) • U.S. respondents to estimate homeless population • Urban population to estimate heroin users • You must know the size of the respondent population • Do transmission and barrier errors suggest using a respondent population with more ties to target population? • This opportunity to do this research in multiple countries could help solve this problem

Choose sample frame • The sample frame represents the respondent population • For our work we used random digit dial telephone numbers • For face-to-face a general population survey may rely on census or voter registration data

Choosing mode • There are five survey modes • Face-to-face • Telephone (this is what we used) • Mail • Drop and collect • Web • There is a large literature on mode effects in surveys • For the populations of interest to UNAIDS a face-to-face or mixed mode makes sense

Choose sample size • As with any survey, the sample size should be based on expected margins of error • For this survey we have margins of error associated with network size • Although estimates of network size are remarkably reliable, they have large standard deviations • Our data suggest that a survey of 400 respondents would generate a margin of error of ±26 alters • A survey of 1,000 would generate a margin of error of ±16 alters • Keep in mind these are based on variance for U.S. respondents

Design questionnaire • Network scale-up questionnaire has three parts • Demographics used to estimate bias • Question to estimate the number of alters respondents knows in the target population • Questions to estimate network size (c) • Steps 2 and 3 require a boundary definition of who is counted as a network alter

Alter boundary • Definition of who is an alter can have enormous effects on the estimate • Our definition: • You know them and they know you by sight or by name. You have had some form of contact with them in the past two years and you could contact them if you had to • Question: Should respondents be instructed to exclude those met on networking sites such as Facebook?

There are two ways to estimate c • Scaling from known populations • The summation method

Using known populations • Select a set of known populations, the more the better • Populations should vary in size and type • Limiting the study to populations related to health conditions, although plentiful, may introduce barrier error • Using only large populations (such as men or people over age 65) introduces a lot of estimation error • Using only small populations introduces error from very few hits • Populations are often related to transmission and barrier effects • In the past we assumed that by using populations of multiple size and type these effects are cancelled out

Examples of populations we used • In the U.S. there are a variety of sources for known populations: • The U.S. Statistical Abstract • The U.S. Census • The FBI Crime Statistics • Ideally collection of sub-population data will be recurring so that they can be used in subsequent years • It is important that the data all reflect the same year (be aware that some population data lags) • Known populations are very susceptible to transmission and barrier error

Relationship between number known and demographic characteristics

We experimented with names • Census provides estimates of both first names and last names • We experimented with both types and found problems with each • The advantage of names is that they vary in size and are typically ascribed • Countries and cultures vary in the way they use names • They are prone to barrier error

Relationship between number known and demographic characteristics

Summation method • We can estimate network size (c) directly by asking respondents to tell us how many people they know • This is an unreasonable task unless it is broken into reasonable subtasks • We use culturally relevant categories of relation types that are mutually exclusive and exhaustive • These are small enough that respondents can estimate them reliably

Relation categories • Immediate family • Other birth family • Family of spouse or significant other • Co-workers • People at work but don't work with directly • Best friends/confidantes • People know through hobbies/recreation • People from religious organization • People from other organization • School relations • Neighbors • Just friends • People known through others • Childhood relations • People who provide a service • Other

Estimates of network size from two methods (scaling from known and summation) are very close • Scaling from known populations • 290.8 (SD 264.4) • Summation method • 290.7 (SD 258.8) • We checked in multiple ways to see whether this was an artifact of the method • It wasn’t

Advantages of the summation method • It is quicker, taking about half the time or less than estimating from known sub-populations • It should not be subject to transmission or barrier error • It does not require finding known populations, which could be a problem in some countries

Disadvantages of summation method • It cannot be verified statistically • It may be easy for respondents to double count network alters as they are multiplex relations (such as co-worker and social contact) • Network size calculated from scaling known populations can be checked by back-estimating each known with the other knowns

Modeling issues • At this point in our work we are convinced that our estimates of network size are relatively reliable, but not absolutely reliable • If my network is 300 then I am confident it is half as large as that of someone with a network of size 600 • I am not confident that the network size is actually 300 • This compromises our ability to estimate the absolute size of a population • This does not seem to matter in the current discussion

New directions • Problems with under-estimated m • Using perceptions of stigma • Using game of contacts to create weights • Using Demographic Health Survey (DHS) to create knowns

The Network Scale-Up Method

The Network Scale-Up Method

Presentation Transcript

Scaleability Scale Up and Scale Out

Network Simplex Method

Scale-up/down

The Network Scale-Up Method (NSUM)

Scale-up

The Network Simplex Method

Animal Scale-Up

SIZING UP SCALE

Scale-Up

The Large Scale Clinical Trial Network

EXAMPLES OF SCALE UP

Large – Scale Sensor network

The Network Simplex Method

Scaleability Scale Up and Scale Out

Network Reprogramming at Scale

Process Scale-Up Training

SCALE-UP Multivariate Calculus

The Network Beacon Announcement scanning method