This study examines the identification disclosure risks associated with microdata, focusing on statistical measures of risk evaluation. It highlights the challenges of matching records through probability-based methods and outlines different cases for estimating the risk of identification. Moreover, it discusses data swapping and noise addition as strategies to mitigate these risks, using simulations based on extensive datasets. The analysis points to the importance of demographic variables in enhancing identification probabilities and suggests effective protection measures for sensitive data.
Estimating Identification Risks for Microdata • Jerome P. Reiter • Institute of Statistics and Decision Sciences, Duke University, Durham NC, USA
Measures of identification disclosure risk • Number of population uniques: Does not incorporate intruders’ knowledge. May not be useful for continuous data. Hard to gauge effects of SDL procedures. Hard to estimate accurately. • Probability-based methods (direct matching using external databases; indirect matching using existing data set): Require assumptions about intruder behavior. May be costly to obtain external databases.
Notation for methods • Actual record j : • Released record j : • Available data: • Unavailable + perturbed data combined:
Probability of identification • Let J = j when record j in Z matches the target record, t. • J = r + 1 when target is not in Z.
Calculating CASE 1: Target assumed to be in Z: • Units whose values do not match the target’s values have zero probability. • For matches, probability equals 1/n_t, where n_t is the number of matches in Z. • Probability equals zero for j = r + 1.
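The Case 1 rule can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the function name `case1_match_probabilities`, the tuple representation of records, and the toy data are all assumptions introduced here.

```python
from fractions import Fraction

def case1_match_probabilities(released, target):
    """Return Pr(J = j) for each released record, plus a final entry for j = r + 1.

    Hypothetical helper: `released` is a list of tuples holding the values an
    intruder can match on, and `target` is the tuple of the target's values.
    """
    matches = [record == target for record in released]
    n_t = sum(matches)  # number of released records matching the target
    if n_t == 0:
        raise ValueError("Case 1 assumes the target is in Z, but nothing matches")
    # Matching records share the probability equally; non-matches get zero.
    probs = [Fraction(1, n_t) if is_match else Fraction(0) for is_match in matches]
    probs.append(Fraction(0))  # j = r + 1: target is assumed to be in Z
    return probs

# Toy released file Z with (age group, sex) as the matching variables.
released = [("30-34", "F"), ("30-34", "F"), ("40-44", "M"), ("30-34", "M")]
probs = case1_match_probabilities(released, ("30-34", "F"))
```

Exact fractions are used so that the probabilities visibly sum to one.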
Calculating CASE 2: Target not assumed to be in Z: • Units whose values do not match the target’s values have zero probability. • For matches, probability is 1/N_t, where N_t is the number of matches in the population. • For j = r + 1, probability is (N_t – n_t) / N_t.
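A minimal Python sketch of the Case 2 rule follows. It assumes the population count of matches is known or has been estimated elsewhere; the function name `case2_match_probabilities`, the argument `pop_matches`, and the toy data are hypothetical.

```python
from fractions import Fraction

def case2_match_probabilities(released, target, pop_matches):
    """Return Pr(J = j) for each released record, plus a final entry for j = r + 1.

    Hypothetical helper: `pop_matches` plays the role of the number of
    population units matching the target, assumed to be supplied by the caller.
    """
    matches = [record == target for record in released]
    n_t = sum(matches)  # matches among the released records
    if pop_matches < max(n_t, 1):
        raise ValueError("population matches must be at least max(sample matches, 1)")
    # Each matching released record gets 1 / (population matches).
    probs = [Fraction(1, pop_matches) if is_match else Fraction(0) for is_match in matches]
    # Remaining mass: the target is one of the matching units not in Z.
    probs.append(Fraction(pop_matches - n_t, pop_matches))
    return probs

released = [("30-34", "F"), ("30-34", "F"), ("40-44", "M"), ("30-34", "M")]
probs = case2_match_probabilities(released, ("30-34", "F"), pop_matches=10)
```

With two matches in the released file and ten in the population, most of the probability falls on the event that the target is not in the file.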
Calculating • Data swapping: Repeatedly simulate swapping mechanism using Z. Estimate probabilities for combinations of original + swapped values.
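The simulation idea can be sketched as follows. The swapping mechanism here is an assumption made for illustration (pick a fixed fraction of records and shuffle their values among themselves); the slides do not specify the mechanism, and the names `swap_once` and `estimate_swap_probs` are hypothetical.

```python
import random
from collections import Counter, defaultdict

def swap_once(values, rate, rng):
    """Assumed mechanism: choose round(rate * n) records and permute their values."""
    n = len(values)
    chosen = rng.sample(range(n), round(rate * n))
    destinations = chosen[:]
    rng.shuffle(destinations)
    out = list(values)
    for src, dst in zip(chosen, destinations):
        out[dst] = values[src]
    return out

def estimate_swap_probs(values, rate, n_sims=200, seed=0):
    """Estimate Pr(swapped value = b | starting value = a) by repeated simulation."""
    rng = random.Random(seed)
    counts = defaultdict(Counter)
    for _ in range(n_sims):
        swapped = swap_once(values, rate, rng)
        for before, after in zip(values, swapped):
            counts[before][after] += 1
    return {
        a: {b: c / sum(row.values()) for b, c in row.items()}
        for a, row in counts.items()
    }

# Toy released variable standing in for a column of Z.
z_race = ["A"] * 70 + ["B"] * 20 + ["C"] * 10
probs = estimate_swap_probs(z_race, rate=0.30)
```

Each row of `probs` is an estimated conditional distribution over swapped values for one starting value, which is one way to tabulate the original-plus-swapped combinations.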
Calculating • Noise addition: Assume variable k perturbed using Gaussian noise with mean zero and known variance σ².
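Under the Gaussian noise assumption, the density of a released value given the target's true value is a normal density centered at the true value. The sketch below evaluates that density for each released record and normalizes it, which additionally assumes the target is in the file and that all records are equally likely a priori. The names `gaussian_density` and `noise_match_weights` and the toy values are hypothetical.

```python
import math

def gaussian_density(x, mean, sd):
    """Normal density with the given mean and standard deviation, evaluated at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def noise_match_weights(released_values, target_value, sd):
    """Normalized weights proportional to N(z_j; target_value, sd^2).

    Assumes the target is in the file and equal prior weight on every record.
    """
    densities = [gaussian_density(z, target_value, sd) for z in released_values]
    total = sum(densities)
    if total == 0:
        raise ValueError("all densities underflowed; released values are too far from target")
    return [d / total for d in densities]

# Toy example: three released values, target's true value 1450, noise sd 290.
weights = noise_match_weights([1200.0, 1500.0, 3000.0], target_value=1450.0, sd=290.0)
```

Records whose released value lies many standard deviations from the target's true value receive weight close to zero.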
Calculating • First distribution is for SDL methods. • Second distribution is best model for predicting unavailable variables given what is known.
Calculating when values in U are not perturbed. Intruders may act this way to avoid computations. It is prudent to evaluate risk assuming they do.
Calculating • Assume independence to obtain: where
Simulations • 51,016 heads of household from 2000 CPS. • Potentially available variables: Age, Sex, Race, Marital Status, Property Tax • Unavailable variables: Education, Income, Social Security, Child Support Payments
Simulations: SDL Procedures • Age: Group in five-year intervals. • Race and Marital Status: Swap randomly 30% of values for each variable. • Property taxes: For positive taxes, add noise from N(0, 290²). Constrain values to be positive. Do not alter 0s. • Other variables: Leave at original values.
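The procedures listed above can be sketched in Python as follows. Several details are assumptions made only for illustration: the swap is implemented by shuffling values among a randomly chosen 30% of records, the noise standard deviation is taken to be 290, positivity is enforced by redrawing the noise, and the record fields and function names are hypothetical.

```python
import random

def group_age(age, width=5):
    """Recode an exact age into a fixed-width interval label, e.g. 37 -> '35-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def swap_fraction(values, rate, rng):
    """Assumed swap: shuffle values among a random subset of round(rate * n) records."""
    chosen = rng.sample(range(len(values)), round(rate * len(values)))
    destinations = chosen[:]
    rng.shuffle(destinations)
    out = list(values)
    for src, dst in zip(chosen, destinations):
        out[dst] = values[src]
    return out

def perturb_tax(tax, sd, rng):
    """Add N(0, sd^2) noise to positive taxes, redrawing until positive; keep 0s."""
    if tax <= 0:
        return tax
    while True:
        noisy = tax + rng.gauss(0.0, sd)
        if noisy > 0:
            return noisy

def apply_sdl(records, swap_rate=0.30, tax_sd=290.0, seed=0):
    rng = random.Random(seed)
    race = swap_fraction([r["race"] for r in records], swap_rate, rng)
    marital = swap_fraction([r["marital"] for r in records], swap_rate, rng)
    released = []
    for r, new_race, new_marital in zip(records, race, marital):
        z = dict(r)  # other variables are left at their original values
        z["age"] = group_age(r["age"])
        z["race"] = new_race
        z["marital"] = new_marital
        z["tax"] = perturb_tax(r["tax"], tax_sd, rng)
        released.append(z)
    return released

# Synthetic toy records, not CPS data.
records = [
    {"age": 20 + 3 * i, "sex": "FM"[i % 2], "race": "ABC"[i % 3],
     "marital": ["married", "single"][i % 2],
     "tax": 0 if i % 4 == 0 else 500 + 100 * i, "income": 30000 + 1000 * i}
    for i in range(20)
]
released = apply_sdl(records)
```

Swapping by permutation keeps the marginal counts of race and marital status unchanged, which is a property of this particular assumed mechanism rather than something stated in the slides.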
Simulations: Targets • Everyman: Has values near median for all variables. • Unique: Sample unique on combination of age, sex, race, marital status. • Big I: Highest income in data set. • Big P: Highest property tax in data set.
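Selecting targets such as Unique, Big I, and Big P from a data file can be sketched as below. The helper names `sample_uniques` and `index_of_max`, the field names, and the three toy records are hypothetical and are not taken from the CPS file.

```python
from collections import Counter

def sample_uniques(records, keys):
    """Indices of records whose combination of `keys` appears exactly once."""
    combos = [tuple(r[k] for k in keys) for r in records]
    counts = Counter(combos)
    return [i for i, combo in enumerate(combos) if counts[combo] == 1]

def index_of_max(records, field):
    """Index of the record with the largest value of `field`."""
    return max(range(len(records)), key=lambda i: records[i][field])

sample = [
    {"age": "30-34", "sex": "F", "race": "A", "marital": "married", "income": 40000, "tax": 1200},
    {"age": "30-34", "sex": "F", "race": "A", "marital": "married", "income": 90000, "tax": 0},
    {"age": "60-64", "sex": "M", "race": "B", "marital": "widowed", "income": 25000, "tax": 5400},
]
keys = ("age", "sex", "race", "marital")
unique_idx = sample_uniques(sample, keys)   # candidates for the "Unique" target
big_i = index_of_max(sample, "income")      # candidate for "Big I"
big_p = index_of_max(sample, "tax")         # candidate for "Big P"
```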
Simulations: Summary of results • Swaps are needed to protect Unique. • Age recode plus swaps gives good protection. • Knowing property taxes greatly increases probabilities of identification. • Adding noise to positive tax values is not sufficient. (Top-coding helps.)