On Privacy-Preserving Utility-Based Statistical Disclosure Limitation Methods
Daniela Ichim
Outline
• Dissemination of microdata files
• Confidentiality issues
• Quality/utility issues
• Strategies to balance confidentiality and utility
Microdata release
• Official statistics
• Dissemination portfolio
• Purposes
  • educational
  • research
  • policy-making
• Examples – MFR, PUF (European Statistical law)
• Consider both confidentiality and utility
• Dissemination at European level: comparability
Confidentiality issues
• Official statistics
• Law: “confidentiality of any statistical unit should not be breached when account is taken of all relevant means that might reasonably be used”
• There is a law: risk definition + risk assessment + protection
• Privacy is an individual concept
• Risk management, not risk avoidance
  • evaluate risk in realistic scenarios
  • apply risk reduction methods
  • ...
• Design different microdata releases
• Consider the needs of the counterpart: users
Statistical disclosure limitation methods
• Local/global recoding
• Local/global suppression
• Subsampling
• Top/Bottom coding
• Rounding
• Adding noise
• Swapping
• Shuffling
• Microaggregation
• Post-Randomization (PRAM)
• IPSO
• ROMM
• Model-based perturbation methods
• …
• Guided by utility measures!!!
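To make three of the listed methods concrete (top coding, adding noise, microaggregation), here is a minimal sketch using only the Python standard library. The function names, the noise scale `relative_sd`, and the fixed group size `k` are illustrative assumptions, not the parameterisation used in the presentation or in any particular SDL package.

```python
import random
import statistics


def top_code(values, threshold):
    """Top coding: replace every value above `threshold` with the threshold."""
    return [min(v, threshold) for v in values]


def add_noise(values, relative_sd=0.1, seed=0):
    """Additive noise: add zero-mean Gaussian noise whose standard deviation
    is `relative_sd` times the standard deviation of the original values.
    (The 10% default is an arbitrary illustrative choice.)"""
    rng = random.Random(seed)
    sd = statistics.pstdev(values) * relative_sd
    return [v + rng.gauss(0.0, sd) for v in values]


def microaggregate(values, k=3):
    """Univariate fixed-size microaggregation: sort the records, form groups
    of k consecutive values, and replace each value by its group mean.
    The last group absorbs any leftover records, so every group has
    at least k members."""
    n = len(values)
    if n < k:
        raise ValueError("need at least k records")
    order = sorted(range(n), key=lambda i: values[i])
    out = [0.0] * n
    start = 0
    while start < n:
        end = start + k
        if n - end < k:  # fewer than k records would remain: merge them in
            end = n
        group = order[start:end]
        mean = sum(values[i] for i in group) / len(group)
        for i in group:
            out[i] = mean
        start = end
    return out


incomes = [10, 12, 11, 50, 52, 51, 300]
print(top_code(incomes, 100))
print(microaggregate(incomes, k=3))
```

Microaggregation with group means leaves the overall mean unchanged while hiding individual values inside groups of at least `k` records, which is one simple way a method can be tied to a utility measure.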
Data utility
• No law
• Utility depends on
  • survey
  • user
  • the most important variables (users and survey)
• Measure it w.r.t. original data
• Utility is an aggregate concept
• Main approaches (no widely accepted measure):
  • Mathematical approach
  • Statistical approach
  • Analyses (model) based
Data utility: mathematical approach
• Information theory (Shannon)
  • Entropy change
  • Kullback-Leibler divergence
• Issues:
  • Rigorous formulation (Willenborg)
  • High implementation costs
  • How do users perceive these measures?
  • How do these measures relate to the analyses?
• [Figure: distribution before protection vs. distribution after protection]
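As an illustration of the two information-theoretic measures named on this slide, the sketch below computes the entropy change and the Kullback-Leibler divergence between the empirical distribution of a categorical variable before and after protection. The helper names and the small constant `eps`, used when a category disappears from the protected data, are assumptions made for this example.

```python
import math
from collections import Counter


def distribution(values):
    """Empirical relative frequencies of a categorical variable."""
    counts = Counter(values)
    n = len(values)
    return {category: c / n for category, c in counts.items()}


def entropy(p):
    """Shannon entropy in bits."""
    return -sum(q * math.log2(q) for q in p.values() if q > 0)


def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) in bits. Categories missing from q are given the tiny
    probability `eps` to avoid division by zero (an illustrative assumption)."""
    return sum(
        pi * math.log2(pi / max(q.get(category, 0.0), eps))
        for category, pi in p.items()
        if pi > 0
    )


original = ["a", "a", "b", "b"]
protected = ["a", "a", "a", "b"]  # e.g. after recoding or PRAM

p = distribution(original)
q = distribution(protected)
print("entropy change:", entropy(q) - entropy(p))
print("KL divergence:", kl_divergence(p, q))
```

Both quantities are zero when the protected distribution equals the original one; how large a value is still acceptable is exactly the interpretation issue raised on the slide.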
Data utility: statistical approach
• Continuous vars:
  • e.g. Hellinger distance
  • e.g. Total variation distance
• Goodness of fit tests (qualitative indication)
  • e.g. Kolmogorov-Smirnov
• Categorical vars: measures of association (concordant and discordant pairs)
  • e.g. Gamma
  • e.g. Goodman
• Issues:
  • Statistical foundations
  • Very easy to implement
  • How should/could they be used by analysts?
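The measures on this slide can be sketched in a few lines. The code below assumes that continuous variables are first binned into a common histogram, and that the "Gamma" on the slide refers to the Goodman-Kruskal gamma computed from concordant and discordant pairs; both are assumptions made for this example, as are the function names.

```python
import math


def histogram(values, lo, hi, bins):
    """Relative frequencies over `bins` equal-width bins on [lo, hi];
    values outside the range are counted in the nearest edge bin."""
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        idx = min(max(int((v - lo) / width), 0), bins - 1)
        counts[idx] += 1
    return [c / len(values) for c in counts]


def hellinger(p, q):
    """Hellinger distance between two discrete distributions (0 to 1)."""
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)) / 2)


def total_variation(p, q):
    """Total variation distance between two discrete distributions (0 to 1)."""
    return sum(abs(a - b) for a, b in zip(p, q)) / 2


def ks_statistic(x, y):
    """Two-sample Kolmogorov-Smirnov statistic: the largest absolute
    difference between the two empirical distribution functions."""
    xs, ys = sorted(x), sorted(y)
    i = j = 0
    d = 0.0
    while i < len(xs) and j < len(ys):
        v = min(xs[i], ys[j])
        while i < len(xs) and xs[i] <= v:
            i += 1
        while j < len(ys) and ys[j] <= v:
            j += 1
        d = max(d, abs(i / len(xs) - j / len(ys)))
    return d


def goodman_kruskal_gamma(x, y):
    """Gamma = (C - D) / (C + D), where C and D count concordant and
    discordant pairs of observations; tied pairs are ignored."""
    concordant = discordant = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    if concordant + discordant == 0:
        raise ValueError("gamma is undefined when all pairs are tied")
    return (concordant - discordant) / (concordant + discordant)
```

For continuous variables, one would compare `histogram(original, ...)` with `histogram(protected, ...)`; for ordinal categorical variables, one would compare the gamma of a pair of variables before and after protection.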
Data utility: analyses
• The idea is to simulate how data would be used in future analyses.
• Compare the two analyses (one using the original data and the other using protected data)
• In case the difference is acceptable, the information loss is considered negligible.
• Example (Karr)
• difficult to imagine all possible data analyses
• multiple outputs from a single analysis
• Collaboration with expert users!!!
• [Figure: CI for the protected data vs. CI for the original data]
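One way to quantify the comparison of confidence intervals shown on this slide is an interval-overlap score: the length of the overlap, averaged relative to each interval's own length. The slide does not spell out a formula, so the definition below, the normal-approximation interval for a mean, and the function names are assumptions for this sketch.

```python
import math
import statistics


def mean_ci(values, z=1.96):
    """Approximate 95% confidence interval for the mean
    (normal approximation, an assumption for this sketch)."""
    m = statistics.mean(values)
    se = statistics.stdev(values) / math.sqrt(len(values))
    return (m - z * se, m + z * se)


def ci_overlap(ci_orig, ci_prot):
    """Average relative overlap of two intervals:
    1.0 for identical intervals, 0.0 for disjoint ones."""
    lo = max(ci_orig[0], ci_prot[0])
    hi = min(ci_orig[1], ci_prot[1])
    if hi <= lo:
        return 0.0
    length = hi - lo
    return 0.5 * (
        length / (ci_orig[1] - ci_orig[0]) + length / (ci_prot[1] - ci_prot[0])
    )


original = [10.0, 12.0, 11.0, 13.0, 9.0, 14.0]
protected = [10.5, 11.5, 11.0, 13.5, 9.5, 13.0]  # hypothetical perturbed values
print(ci_overlap(mean_ci(original), mean_ci(protected)))
```

The same score can be computed for any estimate an expert user cares about, such as a regression coefficient, by replacing `mean_ci` with the interval from that analysis.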
Dissemination strategies
• Microdata
  • apply an SDL method
  • evaluate utility
• Microdata
  • risk assessment
  • apply an SDL method to reduce the risk, maintaining some utility
  • measure utility
• Business Process Model: User Needs -> Design -> Build -> Collect -> Process -> Analyse -> Disseminate
Utility-based SDL methods
• Adding noise
• ROMM
• IPSO
• Data swapping
• Priority-driven approaches
• Model-based perturbations
• Regression
• Classification
• Descriptive statistics
• …
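As one example of a method that preserves a utility measure by construction, the sketch below implements a simple form of data swapping restricted to nearby ranks. Because values are only exchanged between records, the univariate distribution (mean, variance, quantiles) is unchanged, while the link between a value and its record is perturbed. The `window` parameter and the pairing rule are assumptions for this sketch, not the specific algorithm referred to in the presentation.

```python
import random


def rank_swap(values, window=2, seed=0):
    """Swap each not-yet-swapped value with a randomly chosen, not-yet-swapped
    partner whose rank is at most `window` positions higher.
    The multiset of values is preserved exactly."""
    rng = random.Random(seed)
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = list(values)
    swapped = set()
    for pos, i in enumerate(order):
        if i in swapped:
            continue
        candidates = [
            j for j in order[pos + 1 : pos + 1 + window] if j not in swapped
        ]
        if not candidates:
            continue  # no free partner nearby: leave this value in place
        j = rng.choice(candidates)
        out[i], out[j] = values[j], values[i]
        swapped.update((i, j))
    return out


ages = [23, 45, 31, 52, 38, 29, 61]
print(rank_swap(ages, window=2))
```

A smaller `window` keeps swapped values closer to the originals, which tends to preserve relationships with other variables better at the cost of weaker protection.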
Utility-based SDL methods
• Advantage:
  • Utility indicators = minimum standards
  • Comparable dissemination
  • Harmonised dissemination (EU level)
• Identify user needs -> apply flexible SDL methods -> deliver quality