Dual System Estimation and Census Adjustment

Dual System Estimation and Census Adjustment • Stephen E. Fienberg • Statistics 36-149 • Department of Statistics, Carnegie Mellon University • November 27-29, 2001

What Do the Following Populations Have in Common? • fish* • penguins • homeless • prostitutes in Glasgow • Italians with diabetes* • people in the U.S.** • people with HIV • adolescent injuries in Pittsburgh, PA • WWW

Example 1: Diabetes Prevalence • Bruno et al. (1994) used 4 sources for ascertainment of diabetes in Casale Monferrato, Northern Italy: • s1: diabetes clinic and/or family physicians • s2: patients discharged from hospitals with a diabetes diagnosis • s3: insulin or oral hypoglycaemic prescriptions • s4: requests for reimbursement for insulin and reagent strips

Example 1: Diabetes (cont.)

                 s1:  Yes   Yes   No    No
                 s2:  Yes   No    Yes   No
    s3    s4
    Yes   Yes         58    46    14     8
    Yes   No         157   650    20   182
    No    Yes         18    12     7    10
    No    No         104   709    74     -

n = 2069

Example 2: Fish in a Lake • 200 fish caught 1st time • 150 fish caught 2nd time • Of 150 fish in 2nd sample, 125 were among 200 counted in 1st sample • Total number of fish caught = 200 + (150 - 125) = 225 • But how many fish have gone undetected?

Example 2: Fish in a Lake • Proportion of fish in 2nd sample also in 1st = 125/150 = 5/6 • Generalize from sample to population: $(5/6)\hat{N} = 200$, so $\hat{N} = (6/5) \times 200 = 240$ • This is the method of capture-recapture due to Petersen, Lincoln, Schnabel, etc.
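The fish arithmetic can be checked with a few lines of Python. This is a minimal sketch; the function name `lincoln_petersen` is an illustrative choice, not something from the slides.

```python
def lincoln_petersen(n1: int, n2: int, recaptured: int) -> float:
    """Two-sample capture-recapture estimate of population size, n1 * n2 / a."""
    if recaptured <= 0:
        raise ValueError("estimate is undefined when the samples do not overlap")
    return n1 * n2 / recaptured


n1, n2, a = 200, 150, 125          # first catch, second catch, overlap
seen = n1 + (n2 - a)               # distinct fish caught at least once
n_hat = lincoln_petersen(n1, n2, a)
print(seen, n_hat, n_hat - seen)   # caught, estimated total, estimated undetected
```

With the slide's numbers, the estimate of 240 implies about 15 fish that neither sample caught.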

Capture-Recapture Model

                        Sample 2
                   In      Out       Total
    Sample 1  In   a       b         n1
              Out  c       d ??      N - n1
            Total  n2      N - n2    N ??

$\hat{N} = n_1 n_2 / a$

Role of Independence

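Some Formal Details • Alternatively, we think in terms of the ratio of the odds for row 1 to the odds for row 2:

$$\frac{P\{A \text{ and } B\} \,/\, P\{A \text{ and } B^c\}}{P\{A^c \text{ and } B\} \,/\, P\{A^c \text{ and } B^c\}} = \frac{P\{A \text{ and } B\}\, P\{A^c \text{ and } B^c\}}{P\{A^c \text{ and } B\}\, P\{A \text{ and } B^c\}}$$

• and under independence this equals 1.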
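A quick numerical check of the claim that the odds ratio equals 1 under independence. This is a sketch only: the probabilities 0.8 and 0.6 and both function names are made up for illustration.

```python
import math


def cross_product_ratio(p11, p10, p01, p00):
    """Odds for row 1 divided by odds for row 2 in a 2x2 table."""
    return (p11 * p00) / (p01 * p10)


def independent_table(p_a, p_b):
    """Cell probabilities (AB, AB^c, A^cB, A^cB^c) when A and B are independent."""
    return (p_a * p_b, p_a * (1 - p_b), (1 - p_a) * p_b, (1 - p_a) * (1 - p_b))


# Hypothetical capture probabilities for the two samples.
ratio = cross_product_ratio(*independent_table(0.8, 0.6))
print(math.isclose(ratio, 1.0))
```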

Some Formal Details • Back to data. • We think of independence in terms of equality of odds, and we set $ad/bc = 1$ • and estimate the unobserved d by $\hat{d} = bc/a$ • so that $\hat{N} = a + b + c + bc/a = n_1 n_2 / a$
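The last step on this slide uses the margins $n_1 = a + b$ and $n_2 = a + c$. Written out in full:

```latex
\begin{align*}
\hat{N} &= a + b + c + \frac{bc}{a} \\
        &= \frac{a^2 + ab + ac + bc}{a} \\
        &= \frac{(a + b)(a + c)}{a} \\
        &= \frac{n_1 n_2}{a}.
\end{align*}
```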

More Formal Version

                        Sample 2
                   In      Out     Total
    Sample 1  In   125     75      200
              Out  25      ?
            Total  150

$\hat{N} = n_1 n_2 / a = 200 \times 150 / 125 = 240$

Example 1: Diabetes, Looking at Pairs of Lists

    Pair       $\hat{N}$
    s1, s2     2,351
    s1, s3     2,185
    s1, s4     2,262
    s2, s3     2,057
    s2, s4       803
    s3, s4     1,555

Estimated s.e.'s are on the order of 100. Only 3 of 6 estimates exceed n = 2069.
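The pairwise figures can be approximately reproduced from the four-list table by collapsing over the other two lists and applying $n_1 n_2 / a$ to each pair. The sketch below assumes that plain estimator. It lands within a few units of the slide's values, and the small gaps may reflect a bias-corrected variant, which is a guess rather than something the slides state.

```python
from itertools import combinations

# Observed counts keyed by (s1, s2, s3, s4); 1 = on the list, 0 = not on it.
# The (0, 0, 0, 0) cell is unobserved and therefore absent.
counts = {
    (1, 1, 1, 1): 58,  (1, 0, 1, 1): 46,  (0, 1, 1, 1): 14, (0, 0, 1, 1): 8,
    (1, 1, 1, 0): 157, (1, 0, 1, 0): 650, (0, 1, 1, 0): 20, (0, 0, 1, 0): 182,
    (1, 1, 0, 1): 18,  (1, 0, 0, 1): 12,  (0, 1, 0, 1): 7,  (0, 0, 0, 1): 10,
    (1, 1, 0, 0): 104, (1, 0, 0, 0): 709, (0, 1, 0, 0): 74,
}


def pair_estimate(i: int, j: int) -> float:
    """n_i * n_j / a for lists i and j (0-based), ignoring the other lists."""
    n_i = sum(c for key, c in counts.items() if key[i])
    n_j = sum(c for key, c in counts.items() if key[j])
    both = sum(c for key, c in counts.items() if key[i] and key[j])
    return n_i * n_j / both


n_observed = sum(counts.values())
estimates = {(i + 1, j + 1): pair_estimate(i, j)
             for i, j in combinations(range(4), 2)}
for (i, j), est in estimates.items():
    flag = "exceeds n" if est > n_observed else "below n"
    print(f"s{i}, s{j}: {est:8.1f}  ({flag})")
```

Any estimate below the 2,069 people actually observed is impossible as a population size, which shows that the independence assumption fails for that pair.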

Diabetes Example: What Is Going Wrong? • Independence of lists in the pairs!

Capture-Recapture Assumptions • Random samples • Independence • Closed population • Perfect matching (no tag loss) • Homogeneity • How do we check on assumptions? • The problem of the “wily trout.”

Accuracy and Coverage Evaluation (ACE) Survey • Survey of approximately 314,000 households (HH) in 11,000 blocks. • Used to correct raw census counts using “capture-recapture” or dual systems estimation (DSE) methodology. • Corrects for omissions AND erroneous enumerations (EEs).

ACE Design • Two parts to ACE sample of blocks: • sample of population -- P-sample • used to estimate omissions • matched records against those for census • sample of census -- E-sample • used to estimate erroneous enumerations (EEs) • subtract out EEs from census counts before using DSE

Dual Systems Components

DSE With Same Values As Fish

    125      75    |  200
     25       ?    |
    ---------------+
    150

$n_{CEN}$ = census count - EEs

$\hat{N} = n_{CEN}\, n_{ACE} / a = 150 \times 200 / 125 = 240$
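A minimal sketch of the census version of the estimator follows. The raw census count of 210 and the 10 erroneous enumerations are hypothetical values chosen so that the corrected count matches the fish example. Which margin of the table is the census and which is ACE is also an assumption here, and it does not affect the product.

```python
def dual_system_estimate(raw_census: int, erroneous: int,
                         n_ace: int, matched: int) -> float:
    """DSE: (census count - EEs) * ACE count / matched count."""
    n_cen = raw_census - erroneous      # remove erroneous enumerations first
    if matched <= 0:
        raise ValueError("estimate is undefined without matched records")
    return n_cen * n_ace / matched


# Hypothetical inputs: 210 raw census records, 10 of them erroneous.
n_hat = dual_system_estimate(raw_census=210, erroneous=10, n_ace=150, matched=125)
print(n_hat)
```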

DSE Features in 2000 • Excluded homeless/shelters and group quarters from calculations in 2000 • Adjusted sample counts for movers • Searching in adjacent blocks

Some Practical Issues • How big is d relative to c? • Within HH vs between HH omissions • Counts of zero • “Negative” adjustment factors (factors less than 1) • some blocks go up in size after DSE and some go down

Dual Systems Assumptions • Perfect matching • idea of probabilistic matching with variable probabilities for different individuals • Homogeneity • Dependence between sample and census • heterogeneity and dependence get combined in what is called correlation bias • Errorless assessment of erroneous enumerations
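The effect of heterogeneity can be illustrated without simulation by plugging expected counts into the estimator. This is a sketch under made-up numbers: a population of 1,000 with one easy-to-count group and one hard-to-count group, and captures independent within each group.

```python
import math


def expected_dse(groups):
    """Dual system estimate computed from expected counts.

    groups: list of (size, p_first, p_second). Captures are independent
    within each group, but the probabilities may differ between groups.
    """
    n1 = sum(size * p1 for size, p1, _ in groups)
    n2 = sum(size * p2 for size, _, p2 in groups)
    a = sum(size * p1 * p2 for size, p1, p2 in groups)
    return n1 * n2 / a


# Hypothetical populations of 1,000 with the same overall capture rate (0.78).
homogeneous = expected_dse([(1000, 0.78, 0.78)])
heterogeneous = expected_dse([(800, 0.9, 0.9), (200, 0.3, 0.3)])
print(round(homogeneous), round(heterogeneous))
```

With equal capture probabilities the estimator recovers 1,000. With the mixed groups it falls to roughly 914, even though the two systems are independent within each group. That is the direction of error usually described as correlation bias.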

ACE Implementation • Aggregate counts from census blocks for various demographic and racial/ethnic groups. • Apply DSE for these aggregates (called post-strata). • Generalizing from adjustments for the ACE sample of blocks and strata to the nation. • synthetic error

Post-strata • Instead of doing DSE at the block level, we reorganize the data by grouping parts of blocks according to • age • race/ethnicity • sex • occupancy status • mail return rate • Results in over 480 post-strata, and we apply DSE in each.

What Do We Know About Dual Systems Assumptions at Post-strata Level?

Synthetic Assumption • Carrying the adjustments back to the individual blocks not in the ACE sample: • Assumes the homogeneity of all of those parts of blocks in each post-stratum. • Result is that some blocks increase and some blocks decrease in estimated population size • decreases total 1 million • increases total 4.3 million
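A sketch of how a synthetic adjustment might be carried back to blocks follows. It assumes that each post-stratum gets a single factor, defined here as the DSE divided by the raw census count, and that every part of a block in that post-stratum is scaled by it. All counts and the names `A`, `B`, `block 1` and `block 2` are hypothetical.

```python
def coverage_factor(raw_count, erroneous, n_ace, matched):
    """Post-stratum adjustment factor: DSE divided by the raw census count."""
    n_cen = raw_count - erroneous
    dse = n_cen * n_ace / matched
    return dse / raw_count


# Hypothetical post-stratum totals from the ACE sample blocks.
post_strata = {
    "A": dict(raw_count=1000, erroneous=30, n_ace=900, matched=850),
    "B": dict(raw_count=1000, erroneous=80, n_ace=900, matched=880),
}
factors = {name: coverage_factor(**v) for name, v in post_strata.items()}

# Hypothetical census counts by post-stratum for two blocks outside the sample.
blocks = {
    "block 1": {"A": 40, "B": 10},
    "block 2": {"A": 5, "B": 45},
}
adjusted = {
    name: sum(count * factors[ps] for ps, count in parts.items())
    for name, parts in blocks.items()
}
for name, total in adjusted.items():
    print(name, sum(blocks[name].values()), round(total, 1))
```

In this toy setup post-stratum B has a factor below 1, because its erroneous enumerations outweigh its omissions. The block dominated by B therefore shrinks while the other grows, as the slide describes.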

March 2001 Adjustment Decision • Not ready to adjust using DSE. • Concerns: • DA (demographic analysis) • loss functions • counties under 100,000 • balancing error • synthetic error

Oct. 2001 Adjustment Decision • Still not ready to adjust! • Old concerns: • DA • loss functions? • balancing error - no • synthetic error - no • New concern: • missed EEs in ACE
