Estimating Identification Risks for Microdata

Estimating Identification Risks for Microdata. Jerome P. Reiter Institute of Statistics and Decision Sciences Duke University, Durham NC, USA. Measures of identification disclosure risk.


Presentation Transcript


  1. Estimating Identification Risks for Microdata Jerome P. Reiter Institute of Statistics and Decision Sciences Duke University, Durham NC, USA

  2. Measures of identification disclosure risk • Number of population uniques: Does not incorporate intruders’ knowledge. May not be useful for continuous data. Hard to gauge effects of SDL procedures. Hard to estimate accurately. • Probability-based methods (direct matching using external databases; indirect matching using the existing data set): Require assumptions about intruder behavior. May be costly to obtain external databases.

  3. Notation for methods • Actual record j : • Released record j : • Available data: • Unavailable + perturbed data combined:

  4. Probability of identification • Let J = j when record j in Z matches the target record, t. • J = r + 1 when target is not in Z.

  5. Calculating CASE 1: Target assumed to be in Z • Units whose values do not match the target’s values have zero probability. • For matches, the probability equals 1/n_t, where n_t is the number of matches in Z. • The probability equals zero for j = r + 1.
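A minimal sketch of the Case 1 rule, assuming each released record is represented as a tuple of its matching variables and that a "match" means exact equality with the target's values. The function name `case1_probabilities` and the toy records are hypothetical and are not taken from the presentation.

```python
def case1_probabilities(released, target):
    """Return Pr(J = j) for j = 1..r, followed by Pr(J = r + 1).

    Assumes the target is in the released file Z, so the last entry is 0.
    """
    matches = [record == target for record in released]
    n_t = sum(matches)  # number of released records matching the target
    if n_t == 0:
        raise ValueError("no released record matches the target")
    probs = [1.0 / n_t if is_match else 0.0 for is_match in matches]
    probs.append(0.0)  # J = r + 1: target not in Z, ruled out in Case 1
    return probs


# Toy released file: (age group, sex). Values are illustrative only.
released = [("30-34", "F"), ("30-34", "F"), ("40-44", "M"), ("30-34", "M")]
probs = case1_probabilities(released, ("30-34", "F"))
print(probs)
```

Two records match the target here, so each receives probability 1/2 and every other outcome receives zero.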

  6. Calculating CASE 2: Target not assumed to be in Z • Units whose values do not match the target’s values have zero probability. • For matches, the probability is 1/N_t, where N_t is the number of matches in the population. • For j = r + 1, the probability is (N_t - n_t) / N_t.
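The Case 2 rule can be sketched the same way, assuming the population match count is known or has been estimated separately. The function name `case2_probabilities`, the argument `pop_matches`, and the toy numbers are hypothetical.

```python
def case2_probabilities(released, target, pop_matches):
    """Return Pr(J = j) for j = 1..r, followed by Pr(J = r + 1).

    pop_matches is the number of population units matching the target;
    it is treated as known here, which is an assumption of this sketch.
    """
    matches = [record == target for record in released]
    n_t = sum(matches)  # matches in the released file Z
    if pop_matches < max(n_t, 1):
        raise ValueError("population count must be at least the sample count")
    probs = [1.0 / pop_matches if is_match else 0.0 for is_match in matches]
    # J = r + 1: the target is one of the matching units outside Z
    probs.append((pop_matches - n_t) / pop_matches)
    return probs


# Toy released file: (age group, sex). Values are illustrative only.
released = [("30-34", "F"), ("30-34", "F"), ("40-44", "M"), ("30-34", "M")]
probs = case2_probabilities(released, ("30-34", "F"), pop_matches=8)
print(probs)
```

With 2 matches in the file and 8 in the population, each matching record gets 1/8 and the remaining 6/8 goes to the event that the target is not in the file.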

  7. Splitting

  8. Calculating • Data swapping: Repeatedly simulate the swapping mechanism using Z. Estimate probabilities for combinations of original + swapped values.
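One way to sketch the simulation idea for data swapping, under stated assumptions: the swapping mechanism is taken to be "pick a random fraction of records and permute their values among themselves", and the released values stand in for the originals when simulating. The function names, the 30% rate used in the example, and the toy category counts are illustrative and not the presenter's implementation.

```python
import random
from collections import Counter, defaultdict


def swap_once(values, rate, rng):
    """Permute the values of a randomly chosen fraction of records."""
    n = len(values)
    chosen = rng.sample(range(n), int(round(rate * n)))
    destinations = chosen[:]
    rng.shuffle(destinations)
    out = list(values)
    for src, dst in zip(chosen, destinations):
        out[dst] = values[src]
    return out


def estimate_swap_probs(values, rate, n_sims=2000, seed=0):
    """Estimate Pr(value after swapping = b | value before swapping = a)."""
    rng = random.Random(seed)
    counts = defaultdict(Counter)
    for _ in range(n_sims):
        swapped = swap_once(values, rate, rng)
        for before, after in zip(values, swapped):
            counts[before][after] += 1
    return {
        a: {b: c / sum(row.values()) for b, c in row.items()}
        for a, row in counts.items()
    }


# Hypothetical categorical variable with three levels.
z_values = ["A"] * 70 + ["B"] * 20 + ["C"] * 10
probs = estimate_swap_probs(z_values, rate=0.30)
for before, row in sorted(probs.items()):
    print(before, {k: round(v, 3) for k, v in sorted(row.items())})
```

Each row of the result is an estimated conditional distribution, so rare categories show a noticeably higher chance of having been changed into a common one.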

  9. Calculating • Noise addition: Assume variable k is perturbed using Gaussian noise with mean zero and known variance σ².
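Under the Gaussian noise assumption, a released value z for a unit whose true value is y has density Normal(y, σ²). The sketch below scores each released record against a target's known true value and normalizes the densities. The normalization assumes the target is in the file and that all records are equally likely a priori; both are assumptions of this illustration, as are the function names and the numbers.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a Normal(mean, sd^2) distribution at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def noise_match_weights(released_values, target_value, sd):
    """Relative plausibility that each released value is the target's noisy value."""
    densities = [normal_pdf(z, target_value, sd) for z in released_values]
    total = sum(densities)
    return [d / total for d in densities]


# Hypothetical released values and a target whose true value is 1450.
weights = noise_match_weights([1200.0, 1500.0, 4000.0], target_value=1450.0, sd=290.0)
print([round(w, 4) for w in weights])
```

Released values close to the target's true value receive most of the weight, which is why noise alone may offer limited protection for units with extreme values.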

  10. Calculating • First distribution is for SDL methods. • Second distribution is best model for predicting unavailable variables given what is known.

  11. Calculating when values in U are not perturbed. Intruders may act this way to avoid computations. It is prudent to evaluate risk assuming they do.

  12. Calculating • Assume independence to obtain: where

  13. Simulations • 51,016 heads of household from 2000 CPS. • Potentially available variables: Age, Sex, Race, Marital Status, Property Tax. • Unavailable variables: Education, Income, Social Security, Child Support Payments.

  14. Simulations: SDL Procedures • Age: Group in five-year intervals. • Race and Marital Status: Swap randomly 30% of values for each variable. • Property taxes: For positive taxes, add noise from N(0, 290²). Constrain values to be positive. Do not alter 0s. • Other variables: Leave at original values.
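The three procedures can be sketched as below. Several details are assumptions, not taken from the slides: the noise standard deviation is read as 290, swapping is implemented as a random permutation among the selected records, and the positivity constraint is enforced by redrawing the noise. All function names and sample values are hypothetical.

```python
import random


def group_age(age, width=5):
    """Recode an age into a five-year interval label such as '35-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def swap_fraction(values, rate, rng):
    """Permute the values of a randomly chosen fraction of records."""
    chosen = rng.sample(range(len(values)), int(round(rate * len(values))))
    destinations = chosen[:]
    rng.shuffle(destinations)
    out = list(values)
    for src, dst in zip(chosen, destinations):
        out[dst] = values[src]
    return out


def perturb_tax(tax, sd, rng):
    """Add Gaussian noise to a positive tax value; leave zeros unchanged."""
    if tax == 0:
        return 0.0
    while True:  # redraw until the noisy value is positive (an assumption)
        noisy = tax + rng.gauss(0.0, sd)
        if noisy > 0:
            return noisy


rng = random.Random(42)
ages = [37, 52, 29, 64, 41, 33, 58, 45, 70, 26]
race = ["W", "W", "B", "W", "O", "W", "B", "W", "W", "W"]
taxes = [0, 1200, 0, 3400, 150, 0, 980, 2100, 0, 60]

released_ages = [group_age(a) for a in ages]
released_race = swap_fraction(race, 0.30, rng)
released_taxes = [perturb_tax(t, 290.0, rng) for t in taxes]
print(released_ages[:3], released_race[:3], released_taxes[:3])
```

Swapping preserves the marginal counts of each category, while the noise step leaves the pattern of zero taxes intact.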

  15. Simulations: Targets • Everyman: Has values near the median for all variables. • Unique: Sample unique on the combination of age, sex, race, marital status. • Big I: Highest income in the data set. • Big P: Highest property tax in the data set.

  16. Simulations: Summary of results • Swaps are needed to protect Unique. • Age recoding plus swaps provides good protection. • Knowing property taxes greatly increases probabilities of identification. • Adding noise to positive tax values is not sufficient. (Top-coding helps.)
