WP 9 Assessing Disclosure Risk in Microdata using Record Level Measures

Chris Skinner, University of Southampton, C.J.Skinner@soton.ac.uk
Natalie Shlomo, University of Southampton and Office for National Statistics, n.shlomo@soton.ac.uk

Disclosure Risk Assessment for Microdata
• Assume:
  - sample
  - categorical key variables
  - no measurement error
• Seek:
  - record level risk measures
  - aggregated to file level measures

Record Level Measures
• Record with combination $k$ of key variable values
• Sample count with same combination = $f_k$
• Population count with same combination = $F_k$
• Only consider sample unique records, i.e. $f_k = 1$
• Pr(population unique) = $\Pr(F_k = 1 \mid f_k = 1)$
• Pr(correct match) = $E(1/F_k \mid f_k = 1)$

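Aggregated File-level Measures
• Expected number of population uniques in sample: $\tau_1 = \sum_{k \in SU} \Pr(F_k = 1 \mid f_k = 1)$
• Expected number of correct matches among sample uniques to the population: $\tau_2 = \sum_{k \in SU} E(1/F_k \mid f_k = 1)$
• Note: $SU = \{k : f_k = 1\}$ is the set of sample uniques, where $f_k$ and $F_k$ are the sample and population counts for key combination $k$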
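When both the population and the sample are available, as in a census-based evaluation, the true file level values can be counted directly. The sketch below is illustrative only: the function name, the names `tau1` and `tau2`, and the toy records are assumptions, not taken from the slides. It counts the sample uniques that are also population uniques, and sums 1/(population count) over the sample uniques.

```python
from collections import Counter

def true_file_level_risk(population_keys, sample_keys):
    """Return (tau1, tau2) for key-value combinations of population and sample records.

    Assumes every sample record is also a population record, so each
    sampled combination has a population count of at least 1.
    """
    F = Counter(population_keys)  # population count per key combination
    f = Counter(sample_keys)      # sample count per key combination
    sample_uniques = [k for k, count in f.items() if count == 1]
    # number of sample uniques that are also unique in the population
    tau1 = sum(1 for k in sample_uniques if F[k] == 1)
    # sum of 1/F_k over sample uniques (expected number of correct matches)
    tau2 = sum(1.0 / F[k] for k in sample_uniques)
    return tau1, tau2

# Toy data (made up): each tuple is one record's key variable values.
population = [("A", 1), ("A", 1), ("B", 2), ("C", 3), ("C", 3), ("C", 3), ("D", 4)]
sample = [("A", 1), ("B", 2), ("C", 3), ("C", 3)]
tau1, tau2 = true_file_level_risk(population, sample)
print(tau1, tau2)
```

In the toy data the sample uniques are ("A", 1) and ("B", 2); only the second is unique in the population.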

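Estimation Problem
• To make inference about:
  - Record level measures $\Pr(F_k = 1 \mid f_k = 1)$ and $E(1/F_k \mid f_k = 1)$ for sample unique $k$, where $f_k$ and $F_k$ are the sample and population counts
  - File level measures $\tau_1$ and $\tau_2$, the sums of these two record level measures over the sample uniques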

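Log-linear Model
• $F_k \sim \text{Poisson}(\lambda_k)$ and $f_k \mid F_k \sim \text{Bin}(F_k, \pi)$, independent across $k$ given $\lambda_k$, with $F_k$ and $f_k$ the population and sample counts for key combination $k$
• $\log \lambda_k = x_k'\beta$, where $x_k$ is the design vector for combination $k$ and $\pi$ is the sampling fraction
• Estimate $\beta$ by maximum likelihood, using $f_k \sim \text{Poisson}(\pi\lambda_k)$
• $\widehat{\Pr}(F_k = 1 \mid f_k = 1) = \exp\{-(1-\pi)\hat\lambda_k\}$ and $\hat{E}(1/F_k \mid f_k = 1) = [1 - \exp\{-(1-\pi)\hat\lambda_k\}] / [(1-\pi)\hat\lambda_k]$; summing over sample uniques gives the file level estimates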
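A minimal sketch of how record level risk estimates follow from a fitted Poisson model. It assumes the population count in a cell is Poisson with mean `lam` and that sampling is Bernoulli with fraction `pi`, so that for a sample unique the number of unsampled population units in the cell is Poisson with mean (1 - pi) * lam. The function names and the labels `theta1`, `theta2` are illustrative and do not come from the slides.

```python
import math

def record_level_risk(lam, pi):
    """Risk estimates for one sample-unique cell.

    lam: fitted population mean for the cell (assumed already estimated)
    pi:  sampling fraction
    Returns (theta1, theta2):
      theta1 = Pr(population unique | sample unique) = exp(-mu)
      theta2 = E(1 / population count | sample unique) = (1 - exp(-mu)) / mu
    with mu = (1 - pi) * lam.
    """
    mu = (1.0 - pi) * lam
    theta1 = math.exp(-mu)
    # -expm1(-mu) equals 1 - exp(-mu) and stays accurate for small mu;
    # the limit as mu -> 0 is 1.
    theta2 = -math.expm1(-mu) / mu if mu > 0 else 1.0
    return theta1, theta2

def file_level_risk(lams_of_sample_uniques, pi):
    """Sum the record level estimates over the sample-unique cells."""
    pairs = [record_level_risk(lam, pi) for lam in lams_of_sample_uniques]
    return sum(p[0] for p in pairs), sum(p[1] for p in pairs)

theta1, theta2 = record_level_risk(lam=2.0, pi=0.5)
print(theta1, theta2)
```

The second measure is never smaller than the first, because a correct match is certain for a population unique and still possible otherwise.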

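Some Literature
• Skinner and Holmes (1998, JOS): good properties of risk estimates under the all two-way interactions log-linear model with an added error term, where: $\log \lambda_k = x_k'\beta + \epsilon_k$
• Elamir and Skinner (2006, JOS): good properties of risk estimates under the all two-way interactions model, but no need for the $\epsilon_k$ term.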

Model Sensitivity
• All two-way interactions model performs well, but there is still evidence of some model-dependence of the file level risk estimates $\hat\tau_1$ and $\hat\tau_2$ in the neighborhood of this model.
• Tendency for estimated risk to decrease as model complexity increases.

Model Choice
• Goodness of fit tests?
  - Pearson?
  - Likelihood ratio?
• AIC, BIC?
• Problems with very large and sparse tables

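Bias Criterion
• Allow for small departures from the log-linear model
• Estimate the bias of the file level risk estimates
• Choose model to minimise the estimated bias
• Similar to choosing model to minimise over- or under-dispersion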

Minimising Over- (Under-) Dispersion
• Criterion estimates the degree of over- or under-dispersion of the sample counts under the model
• Standardised criterion tests the hypothesis of equal dispersion, Cameron and Trivedi (1998)
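The formula on this slide was not preserved. As a hedged illustration only, the sketch below uses a score-type statistic for Poisson equidispersion of the kind discussed by Cameron and Trivedi (1998): the sum over cells of (f - mu)^2 - f, divided by sqrt(2 * sum of mu^2), where f is the observed sample count and mu the fitted sample mean. Whether the slides used exactly this normalisation is an assumption, and the function name and toy data are made up.

```python
import math

def dispersion_statistic(f, mu_hat):
    """Score-type dispersion statistic for fitted Poisson means.

    f:      observed sample counts per cell
    mu_hat: fitted sample means per cell under the model
    Positive values point to over-dispersion, negative values to
    under-dispersion; values near 0 are consistent with equal dispersion.
    """
    numerator = sum((fk - mk) ** 2 - fk for fk, mk in zip(f, mu_hat))
    denominator = math.sqrt(2.0 * sum(mk ** 2 for mk in mu_hat))
    return numerator / denominator

# Toy cells (made up): same fitted mean, different spread in the counts.
over = dispersion_statistic([0, 1, 2, 5], [2.0, 2.0, 2.0, 2.0])
under = dispersion_statistic([2, 2, 2, 2], [2.0, 2.0, 2.0, 2.0])
print(over, under)
```

A model search could then prefer candidate models whose statistic is closest to zero.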

Samples from 2001 UK Census
• Two areas with population of 944,793
• ‘Large’ key: Area (2), Sex (2), Age (101), Marital Status (6), Ethnicity (17), Economic Activity (10); 412,080 cells
• ‘Small’ key: same except Age (18); 73,440 cells
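The cell counts are the products of the category counts of the key variables, and the sample sizes quoted on the results slides (18,896, 9,448 and 4,724) correspond to sampling fractions of about 2%, 1% and 0.5%. A short check, with variable names chosen here for illustration:

```python
from math import prod

# Number of categories per key variable, as listed on the slide.
large_key = {"Area": 2, "Sex": 2, "Age": 101, "Marital Status": 6,
             "Ethnicity": 17, "Economic Activity": 10}
small_key = {**large_key, "Age": 18}  # same key with Age grouped into 18 bands

population_size = 944_793
large_cells = prod(large_key.values())
small_cells = prod(small_key.values())
print(large_cells, small_cells)

# Sampling fractions implied by the sample sizes on the results slides.
fractions = {n: round(n / population_size, 3) for n in (18_896, 9_448, 4_724)}
print(fractions)

# Average population count per cell, a rough indicator of table sparsity.
print(population_size / large_cells, population_size / small_cells)
```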

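Small key, simple random sample of size 18,896
True values reported (numbers not preserved in this transcript): the number of population uniques in the sample, and the sum of $1/F_k$ over sample uniques, where $F_k$ is the population count for the record's key combination.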

Large key, simple random sample of size 4,724
True values: (not preserved in this transcript)

Model Search Algorithm
• Starting solution: all 2-way interactions log-linear model
• Search by:
  - Removing terms
  - Adding terms
  - Swapping terms
• TABU method of Drezner, Marcoulides and Salhi (1999)
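The search moves can be sketched as a generic tabu search over sets of interaction terms. This is not the algorithm of Drezner, Marcoulides and Salhi (1999) itself: the function names, the tabu list length, the term labels and the toy scoring function are all assumptions, and the sketch ignores the hierarchy constraints a real log-linear model search would need. In practice `score` would fit the model and return a criterion such as the absolute dispersion statistic.

```python
def neighbours(model, candidates):
    """Yield models reachable by removing, adding, or swapping one term."""
    inside = sorted(model)
    outside = sorted(candidates - model)
    for term in inside:                      # remove one term
        yield model - {term}
    for term in outside:                     # add one term
        yield model | {term}
    for term_out in inside:                  # swap one term for another
        for term_in in outside:
            yield (model - {term_out}) | {term_in}

def tabu_search(start, candidates, score, n_iter=50, tabu_size=10):
    """Minimise score(model), never revisiting recently visited models."""
    current = best = frozenset(start)
    tabu = [current]
    for _ in range(n_iter):
        options = [m for m in neighbours(current, candidates) if m not in tabu]
        if not options:
            break
        # Move to the best non-tabu neighbour, even if it is worse than current.
        current = min(options, key=score)
        tabu = (tabu + [current])[-tabu_size:]
        if score(current) < score(best):
            best = current
    return best

# Toy run: the "true" model is known, and the score counts mismatched terms.
candidates = frozenset({"s*a", "s*m", "a*m"})
target = frozenset({"s*a"})
best = tabu_search(candidates, candidates, score=lambda m: len(m ^ target))
print(sorted(best))
```

Accepting a worse neighbour when no better one is available is what lets the search leave a local minimum, while the tabu list stops it from cycling straight back.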

Large key, simple random sample of size 9,448
True values: (not preserved in this transcript)

Record Level Risk Measures
Preferred Model: {ea}{s*a}{s*m}{s*et}{s*ec}{a*m}{a*et}{a*ec}{m*et}{m*ec}
True Global Risk and Estimated Global Risk: (values not preserved in this transcript)

Conclusions
• Model selection by assessing over- and under-dispersion
• Similar risk estimates for models with nearly Poisson dispersion
• Further work:
  - stratification of files
  - complex survey designs
