Estimating Identification Risks for Microdata

Jerome P. Reiter, Institute of Statistics and Decision Sciences, Duke University, Durham NC, USA

Measures of identification disclosure risk • Number of population uniques: Does not incorporate intruders’ knowledge. May not be useful for continuous data. Hard to gauge effects of SDL procedures. Hard to estimate accurately. • Probability-based methods (direct matching using external databases; indirect matching using existing data set): Require assumptions about intruder behavior. May be costly to obtain external databases.

Notation for methods • Actual record j : • Released record j : • Available data: • Unavailable + perturbed data combined:

Probability of identification • Let J = j when record j in Z matches the target record, t. • J = r + 1 when target is not in Z.

Calculating CASE 1: Target assumed to be in Z: • Units whose values do not match the target’s values have zero probability. • For matches, probability equals 1/n_t, where n_t is the number of matches in Z. • Probability equals zero for j = r + 1.
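The Case 1 rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `case1_probabilities`, the tuple representation of records, and the toy data are hypothetical and do not come from the slides, and matching is taken to mean exact equality on the compared values.

```python
def case1_probabilities(released, target):
    """Return [P(J = 1), ..., P(J = r), P(J = r + 1)] when the target is assumed to be in Z.

    `released` is a list of records (tuples of the compared values) and
    `target` is the intruder's record for the target. Both are hypothetical
    representations chosen for this sketch.
    """
    matches = [record == target for record in released]
    n_t = sum(matches)  # number of matching records in Z
    if n_t == 0:
        # Assumption of this sketch: with no match, Case 1 cannot be applied.
        raise ValueError("no released record matches the target")
    probs = [1.0 / n_t if is_match else 0.0 for is_match in matches]
    probs.append(0.0)  # j = r + 1: "target not in Z" is ruled out in Case 1
    return probs


# Toy data: two of the three released records match the target.
released = [("30-34", "F"), ("30-34", "F"), ("40-44", "M")]
target = ("30-34", "F")
probs = case1_probabilities(released, target)
```

Each matching record receives equal probability, so the more released records share the target's values, the lower the risk for any single one.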

Calculating CASE 2: Target not assumed to be in Z: • Units whose values do not match the target’s values have zero probability. • For matches, probability is 1/N_t, where N_t is the number of matches in the population. • For j = r + 1, probability is (N_t – n_t) / N_t.
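A comparable sketch for Case 2 follows. The function name `case2_probabilities` and the toy data are hypothetical. The population count of matches is passed in as a known number purely for illustration; in practice it would typically have to be estimated.

```python
def case2_probabilities(released, target, population_matches):
    """Return [P(J = 1), ..., P(J = r), P(J = r + 1)] when the target may not be in Z.

    `population_matches` plays the role of the number of population units
    that match the target. Treating it as known is an assumption of this sketch.
    """
    matches = [record == target for record in released]
    n_matches_in_z = sum(matches)
    if population_matches <= 0 or population_matches < n_matches_in_z:
        raise ValueError("population count must be positive and at least the count in Z")
    probs = [1.0 / population_matches if is_match else 0.0 for is_match in matches]
    # Remaining mass goes to j = r + 1: the target is one of the matching
    # population units that were not released.
    probs.append((population_matches - n_matches_in_z) / population_matches)
    return probs


# Toy data: two matches in Z, eight matches assumed in the population.
released = [("30-34", "F"), ("30-34", "F"), ("40-44", "M")]
target = ("30-34", "F")
probs = case2_probabilities(released, target, population_matches=8)
```

With two matches in Z and eight in the population, each matching released record gets probability 1/8 and the "not in Z" outcome gets 6/8, so the probabilities still sum to one.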

Splitting

Calculating • Data swapping: Repeatedly simulate the swapping mechanism using Z. Estimate probabilities for combinations of original + swapped values.
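One way to picture the simulation step is the sketch below. The slides do not specify the swapping mechanism, so the version here is an assumption: a fixed fraction of records is selected at random and their values are shuffled among themselves. The function names, the toy column, and the number of simulations are all hypothetical.

```python
import random
from collections import Counter, defaultdict


def swap_values(values, rate, rng):
    """Hypothetical swap: pick round(rate * n) records and shuffle their values among themselves."""
    n = len(values)
    selected = rng.sample(range(n), round(rate * n))
    destinations = selected[:]
    rng.shuffle(destinations)
    swapped = list(values)
    for source, destination in zip(selected, destinations):
        swapped[destination] = values[source]
    return swapped


def estimate_swap_probabilities(values, rate, n_sims, seed=0):
    """Estimate P(value after swapping | value before) by repeated simulation on one column."""
    rng = random.Random(seed)
    counts = defaultdict(Counter)
    for _ in range(n_sims):
        swapped = swap_values(values, rate, rng)
        for before, after in zip(values, swapped):
            counts[before][after] += 1
    return {
        before: {after: c / sum(counter.values()) for after, c in counter.items()}
        for before, counter in counts.items()
    }


# Toy categorical column standing in for one released variable in Z.
column = ["A"] * 70 + ["B"] * 20 + ["C"] * 10
transition = estimate_swap_probabilities(column, rate=0.3, n_sims=200)
```

The resulting table gives, for each value, the estimated chance that it appears unchanged or as another category after swapping, which is the kind of quantity the matching probabilities need for swapped variables.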

Calculating • Noise addition: Assume variable k perturbed using Gaussian noise with mean zero and known variance σ².
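Under the Gaussian assumption, a released value can be scored by the normal density centered at the target's true value. The sketch below illustrates that idea only; the formula on the original slide is not available, so the normalization across records (which implicitly treats the target as being in Z and all records as equally likely beforehand), the function names, and the toy numbers are assumptions.

```python
import math


def gaussian_density(value, mean, sigma):
    """Density of N(mean, sigma^2) evaluated at `value`."""
    z = (value - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def noise_match_weights(released_values, target_value, sigma):
    """Relative weight of each released record for one noise-perturbed variable.

    Each weight is proportional to the density of the released value given
    that the true value equals the target's value.
    """
    densities = [gaussian_density(z, target_value, sigma) for z in released_values]
    total = sum(densities)
    if total == 0.0:
        raise ValueError("all densities underflowed to zero")
    return [d / total for d in densities]


# Toy released values; the target's true value is assumed known to the intruder.
weights = noise_match_weights([1200.0, 1500.0, 4000.0], target_value=1250.0, sigma=290.0)
```

Released values close to the target's true value dominate the weights, while values many standard deviations away contribute almost nothing.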

Calculating • First distribution is for SDL methods. • Second distribution is best model for predicting unavailable variables given what is known.

Calculating when values in U are not perturbed. Intruders may act this way to avoid computations. It is prudent to evaluate risk assuming they do.

Calculating • Assume independence to obtain: where

Simulations • 51,016 heads of household from 2000 CPS. • Potentially available variables: Age, Sex, Race, Marital Status, Property Tax • Unavailable variables: Education, Income, Social Security, Child Support Payments

Simulations: SDL Procedures • Age: Group in five-year intervals. • Race and Marital Status: Swap randomly 30% of values for each variable. • Property taxes: For positive taxes, add noise from N(0, 290²). Constrain values to be positive. Do not alter 0s. • Other variables: Leave at original values.
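The age recode and the property tax perturbation can be sketched as follows. Several details are assumptions made for illustration: the interval boundaries (0-4, 5-9, and so on), the use of redrawing to keep perturbed taxes positive, and the function names. Swapping of race and marital status is not shown here.

```python
import random


def group_age(age):
    """Recode age into a five-year interval label; the boundaries are an assumption."""
    low = (age // 5) * 5
    return f"{low}-{low + 4}"


def perturb_property_tax(tax, sd, rng):
    """Add N(0, sd^2) noise to positive taxes and leave zeros unaltered.

    Positivity is enforced here by redrawing the noise, which is one
    possible reading of "constrain values to be positive".
    """
    if tax == 0:
        return 0.0
    while True:
        noisy = tax + rng.gauss(0.0, sd)
        if noisy > 0:
            return noisy


rng = random.Random(0)
age_group = group_age(32)
zero_tax = perturb_property_tax(0, 290.0, rng)
noisy_taxes = [perturb_property_tax(150.0, 290.0, rng) for _ in range(1000)]
```

Redrawing truncates the noise distribution for small tax values, so a disclosure risk calculation that assumes plain Gaussian noise would only be approximate for those records.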

Simulations: Targets • Everyman: Has values near median for all variables. • Unique: Sample unique on combination of age, sex, race, marital status. • Big I: Highest income in data set. • Big P: Highest property tax in data set.

Simulations: Summary of results • Swaps needed to protect Unique. • Age recode plus swaps good protection. • Knowing property taxes greatly increases probabilities of identification. • Adding noise to positive tax values is not sufficient. (Top-coding helps.)
