
Presentation Transcript


  1. Statistical Disclosure Control Methods for Census Outputs Natalie Shlomo SDC Centre, ONS January 11, 2005

  2. Topics for Discussion • Introduction • Developing SDC methods • Output requirements, risk management • Risk-Utility decision problem • Methods for protecting census outputs • Pre- and post-tabular methods • Safe settings, contracts • Work plan for Census 2011 • Assessment of Census 2001 SDC methods • User and other stakeholders' requirements • Assessment of alternative methods

  3. Introduction • Previous ONS talk on SDC implementation for Census 2001 and lessons learnt • This talk will focus on developing SDC methodology for Census 2011 based on census output requirements and user needs more generally. The planned approach is a risk-utility decision framework. • Goal: • To provide adequate protection against the risk of re-identification, taking into account user needs, output requirements and the usefulness of the data • Improve on existing and develop new SDC methods, assess SDC impact on data quality, and clarify advantages and disadvantages of each method.

  4. Developing SDC Methodology • What are the Census output requirements? • Variables to be disseminated • Standard and flexible (web generated?) tables • Origin-Destination tables (workplace tables) • SARs microdata • What are the disclosure risk scenarios, i.e. realistic assumptions about information available to the public that increases the probability of disclosure? • Comment: Note that the problem is cells of size 1 and 2 in tables, not the 0's (except in extreme cases)

  5. Disclosure risk measures - quantify the risk of re-identification: • probability that a sample unique is a population unique, • probability of a correct match between a record in the microdata and an external file • probability that a record is perturbed • Information loss measures - quantify the loss of information content in the data as a result of the SDC method. Utility depends on the user needs and use of the data: • distortion to distributions (bias) • weaknesses in the measures of association • impact on the variance of estimates • changes to the likelihood functions.

  6. Developing SDC Methodology • Methods for protecting outputs: • Data masking: • perturbative methods - methods that alter the data: swapping, random noise, over-imputation, rounding • non-perturbative methods - methods that preserve data integrity: recoding, sub-sampling, suppression • Data access under contract in a safe setting, and additionally ensuring non-disclosive outputs • Need to develop SDC methods for checking outputs, e.g. residuals of regression analyses, bandwidths for kernel estimation of distributions.

  7. Developing SDC Methodology • In some cases, parameters can be given to users to make corrections in their analysis, or they are embedded in the SDC method to minimize information loss. • SDC is an optimization problem: choose the SDC method, as a function of the data, that maximizes the utility of the data, subject to the constraint that the disclosure risk is below a threshold. • Risk-Utility decision framework for choosing the optimum method (a sketch follows)
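A minimal sketch of this risk-utility selection in Python; the candidate methods, risk and utility functions and the threshold are hypothetical placeholders, not the ONS implementation:

```python
# Illustrative risk-utility selection: apply each candidate SDC method to the data
# and keep the one with the highest utility among those meeting the risk threshold.
def choose_sdc_method(data, candidates, risk_fn, utility_fn, risk_threshold):
    """Return the candidate method maximising utility subject to risk <= threshold."""
    best_method, best_utility = None, float("-inf")
    for method in candidates:
        protected = method(data)               # apply the SDC method to the data
        risk = risk_fn(protected, data)        # e.g. share of unperturbed small cells
        utility = utility_fn(protected, data)  # e.g. negative distortion to distributions
        if risk <= risk_threshold and utility > best_utility:
            best_method, best_utility = method, utility
    return best_method
```

In practice the risk and utility functions would be disclosure risk and information loss measures of the kind listed on the previous slide.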

  8. SDC for Census Outputs: Pre-tabular Methods 1. Random Record Swapping (UK 2001, USA 1991) Small percentage of records have geographical identifiers (or other variables) swapped with other records matching on control variables (larger geographical areas, household size and age-sex distribution)
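A hedged sketch of random record swapping; the record layout, field names and pairing rule below are illustrative assumptions, not the 2001 production algorithm:

```python
import random

def random_record_swap(records, swap_rate, control_key, geo_field="district"):
    """Swap the geographical identifier between pairs of records that share the
    same control variables (e.g. broad region and household size)."""
    records = [dict(r) for r in records]              # work on a copy
    n_swap_pairs = int(len(records) * swap_rate / 2)  # swap_rate = share of records swapped
    by_stratum = {}
    for idx, rec in enumerate(records):
        by_stratum.setdefault(control_key(rec), []).append(idx)
    pairs_done = 0
    for idxs in by_stratum.values():
        random.shuffle(idxs)
        while len(idxs) >= 2 and pairs_done < n_swap_pairs:
            i, j = idxs.pop(), idxs.pop()
            records[i][geo_field], records[j][geo_field] = (
                records[j][geo_field], records[i][geo_field])
            pairs_done += 1
    return records

# Example: swap 10% of households, pairing within (region, household size) strata.
sample = [{"district": d, "region": 1, "hh_size": s}
          for d, s in [("A", 3), ("B", 3), ("C", 4), ("D", 4)] * 5]
swapped = random_record_swap(sample, 0.10, lambda r: (r["region"], r["hh_size"]))
```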

  9. Pre-tabular Methods 2. Targeted Record Swapping (USA 2001) A large percentage of records that are unique on a set of key variables (large households, ethnicity) have geographical identifiers (or other variables) swapped with other records matching on control variables.
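Targeting could, for example, restrict swapping to records that are unique on the key variables; a minimal illustrative selection rule (the key function is an assumption):

```python
from collections import Counter

def targeted_swap_candidates(records, key):
    """Return indices of records that are unique on the given key variables and
    would therefore be prioritised for swapping (illustrative rule only)."""
    counts = Counter(key(r) for r in records)
    return [i for i, r in enumerate(records) if counts[key(r)] == 1]
```

The resulting indices could then be fed into a swapping routine like the sketch under random record swapping.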

  10. Pre-tabular Methods 3. Over-Imputation A percentage of randomly selected records have certain variables erased and standard imputation methods are applied by selecting donors matching on control variables.
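A hedged sketch of over-imputation via hot-deck donors; the field names, selection of records and donor rule are illustrative assumptions:

```python
import random

def over_impute(records, rate, target_field, control_key):
    """Erase `target_field` for a random subset of records and re-impute it by
    hot-deck: copy the value from a random donor in the same control stratum."""
    records = [dict(r) for r in records]
    selected = random.sample(range(len(records)), int(len(records) * rate))
    selected_set = set(selected)
    # Donor pools: values from records not selected for erasure, by control stratum.
    donors = {}
    for idx, rec in enumerate(records):
        if idx not in selected_set:
            donors.setdefault(control_key(rec), []).append(rec[target_field])
    for idx in selected:
        pool = donors.get(control_key(records[idx]))
        if pool:  # leave the value unchanged if no donor exists in the stratum
            records[idx][target_field] = random.choice(pool)
    return records
```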

  11. Pre-tabular Methods 4. Post-Randomisation Method (PRAM) (UK 2001) A percentage of records have certain variables misclassified according to prescribed probabilities. Includes a variant that preserves marginal (compounded) distributions and edit constraints.
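A minimal sketch of PRAM-style misclassification using a transition matrix; the toy matrix below is not calibrated to preserve marginal distributions or edit constraints:

```python
import random

def pram(values, categories, transition):
    """Misclassify each value according to `transition[value]`, a list of
    probabilities over `categories` that sums to 1."""
    return [random.choices(categories, weights=transition[v], k=1)[0] for v in values]

# Toy example: keep each category with probability 0.9, move to each other with 0.05.
cats = ["A", "B", "C"]
T = {"A": [0.90, 0.05, 0.05],
     "B": [0.05, 0.90, 0.05],
     "C": [0.05, 0.05, 0.90]}
perturbed = pram(["A", "A", "B", "C"], cats, T)
```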

  12. Preliminary Evaluation of Record Swapping • 16,120 households from the 1995 Israel CBS Census sample file • Households randomly swapped within control strata: broad region, type of locality, age groups (10) and sex • Strata collapsed for unique records with no swap pair • Disclosure risk measure: based on cells of size 1 and 2 • Information loss measure: distortion to distributions (illustrative versions of both measures are sketched below)
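The exact measures used in the study are not shown in the transcript; a hedged sketch assuming the risk measure is the share of cells of size 1 or 2 left unchanged by the perturbation, and the information loss measure is the total absolute distortion to a frequency distribution:

```python
from collections import Counter

def small_cell_risk(original_keys, perturbed_keys):
    """Assumed risk measure: share of cells of size 1 or 2 in the original table
    whose counts are unchanged after perturbation (illustrative only)."""
    orig, pert = Counter(original_keys), Counter(perturbed_keys)
    small = [k for k, c in orig.items() if c in (1, 2)]
    if not small:
        return 0.0
    return sum(1 for k in small if pert.get(k, 0) == orig[k]) / len(small)

def distribution_distortion(original_keys, perturbed_keys):
    """Assumed information loss measure: total absolute difference between the
    original and perturbed frequency distributions (e.g. household size by district)."""
    orig, pert = Counter(original_keys), Counter(perturbed_keys)
    return sum(abs(orig.get(k, 0) - pert.get(k, 0)) for k in set(orig) | set(pert))
```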

  13. Preliminary Evaluation of Record Swapping • Risk-Information Loss (R-IL) Map: • Swapping rates 5%, 10%, 15%, 20% • Information Loss - distortion to the distribution: Number of persons in household in each district

  14. Preliminary Evaluation of Record Swapping • Random record swapping vs. targeted record swapping (on uniques in control strata, i.e. large households). • Swapping rate: 10% • Information Loss - distortion to the distribution: Number of persons in household in each district

  15. Preliminary Evaluation of Over-Imputation • 10% of households had geographic identifier erased: • Random selection of households • Targeted selection of households from unique control strata (i.e., large households) • Geographic identifier imputed using hot-deck imputation within strata: sex and age groups • Risk and information loss measures: as in the record swapping evaluation

  16. Preliminary Evaluation of Over-Imputation • Risk-Information Loss (R-IL) Map: • 10% Selected Records (Random and Targeted on Uniques) • Information Loss - distortion to the distribution: Number of persons in household in each district • [Figure] Risk-Information Loss Assessment: 10% random and targeted record swapping (blue) and 10% random and targeted over-imputation (pink); vertical axis: Risk (0.4 to 1), horizontal axis: Information Loss (0 to 140)

  17. Final Comments for Pre-tabular Methods • Geographies are swapped because they introduce fewer edit failures and are generally less correlated with other variables. If other variables were swapped (or over-imputed), such as age, the data would be badly damaged, a large amount of re-editing would be necessary and further imputations carried out. • Swapping does not affect higher (geographical) level distributions within which the records are swapped. This is an advantage and not a disadvantage. • Over-imputation is similar to record swapping but causes more damage to the data. The assumption of “missing at random” is problematic for the analysis of full data sets.

  18. SDC for Census Outputs: Post-tabular Methods 1. Barnardization (UK 1991) Every internal cell in an output table modified by (+1, 0, -1) according to prescribed probabilities (q, 1-2q, q). No adjustments made to zero cells.
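A minimal sketch of Barnardization; the probability q would be a prescribed parameter and is left as an argument here:

```python
import random

def barnardize(cell_counts, q):
    """Add +1, 0 or -1 to each nonzero internal cell with probabilities (q, 1-2q, q);
    zero cells are left unchanged (cells of size 1 may fall to 0)."""
    out = []
    for c in cell_counts:
        if c == 0:
            out.append(0)
        else:
            out.append(c + random.choices([1, 0, -1], weights=[q, 1 - 2 * q, q], k=1)[0])
    return out

# Example: barnardize a row of internal cells with q = 0.1.
noisy = barnardize([0, 1, 2, 7, 0, 3], 0.1)
```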

  19. Post-tabular Methods 2. Small Cell Adjustments (UK 2001, Australia) Small cells randomly adjusted upwards or downwards to a base using an unbiased stochastic method with prescribed probabilities.
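A hedged sketch assuming a base of 3 and an unbiased scheme in which a small cell of count c is rounded up to the base with probability c/base and down to 0 otherwise; the actual 2001 parameters are not given in the slide:

```python
import random

def small_cell_adjust(cell_counts, base=3):
    """Adjust cells smaller than `base` up to `base` or down to 0 so that the
    expected value of each adjusted cell equals its original count."""
    out = []
    for c in cell_counts:
        if 0 < c < base:
            out.append(base if random.random() < c / base else 0)
        else:
            out.append(c)
    return out
```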

  20. Post-tabular Methods 3. Unbiased Random Rounding (UK NeSS, New Zealand, Canada) All cells in tables rounded up or down according to an unbiased prescribed probability scheme. • Advantages: provides good protection against disclosure by differencing (although not a 100% guarantee); easy to apply; totals are consistent between tables within the rounding base. • Disadvantages: rounds all cells, including safe cells; requires complex auditing to ensure protection; totals rounded independently from internal cells, so tables are not additive.
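A minimal sketch of unbiased random rounding of a single count; base 5 is an illustrative choice:

```python
import random

def unbiased_random_round(count, base=5):
    """Round `count` to a multiple of `base`: up with probability r/base and down
    with probability 1 - r/base, where r = count % base, so that the expected
    value equals the original count."""
    r = count % base
    if r == 0:
        return count
    return count + (base - r) if random.random() < r / base else count - r
```

Because every cell and every total is rounded independently, the rounded table is generally not additive, which is the disadvantage noted above.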

  21. Post-tabular Methods 4. Controlled Rounding (UK NeSS) All cells in tables rounded up or down by an optimal method that maintains the marginal totals (up to the base). • Advantages: fully protects against disclosure by differencing; tables fully additive; minimal information loss; works with linked tables and external constraints. • Disadvantages: rounds all cells, including safe cells; requires the complex SDC tool Tau Argus (and a licence); would require more development to work with Census-size tables.

  22. Post-tabular Methods • 5. Table Design Methods • Population thresholds • Level of detail and number of dimensions in the table • Minimum average cell size • 6. Further development of SDC methods • Controlled small cell adjustments, controlled rounding • Better implementation and benchmarking techniques for maintaining totals at higher aggregated levels.
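A hedged sketch of a simple table-design screen; the population threshold and minimum average cell size below are placeholder values, not ONS rules:

```python
def table_design_ok(area_population, n_cells, min_population=100, min_avg_cell_size=1.0):
    """Check two assumed table-design rules: the area population meets a release
    threshold and the average cell size (population / number of cells) is not
    too small for the requested level of detail."""
    return (area_population >= min_population
            and area_population / n_cells >= min_avg_cell_size)
```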

  23. Evaluation Study • Origin-Destination (Workplace) Tables and Small Cell Adjustments • Totals in tables obtained by aggregating internal perturbed cells • Different tables produced different results; the number of flows differed between tables • ONS guidelines: (1) use the table with the minimum number of categories; (2) combine the minimum number of smaller geographical areas when obtaining estimates for larger areas • Some problems in implementation for origin-destination tables

  24. Evaluation Study • Workplace (ward-to-ward) Table W206 for the West Midlands: the small cell adjustment method is unbiased (errors within the confidence intervals of the perturbation scheme); ward-to-ward totals are not badly damaged; skewness appears at lower geographical levels.

  25. The optimum SDC method is a mixture of different methods depending on risk-utility management, output requirements and user needs more generally. • What is the optimum balance between perturbative and non-perturbative methods of SDC? • How transparent should the SDC method be? Pre-tabular methods have hidden effects and users are not able to make adjustments in their analysis. • What are the data used for, and how do we measure information loss and the impact of the SDC method on data quality? • Can we improve on post-tabular methods? • Policies and strategies for access to data through contracts and safe settings? Work has started on optimal methods as part of the overall planning for the 2011 Census.

  26. Work Plan Census 2011 • I. Assessment of Census 2001 SDC Methods: • Risk-Utility analysis • Comprehensive report, forums and discussion groups on SDC methods with users and other agencies • II. Alternative methods for SDC based on results of phase I, user requirements for census outputs and feedback

  27. Final Remarks: • We are evaluating our methods and planning future improvements • Our SDC methodology is based on a scientific approach, understanding the needs and requirements of the users, and international best practice • Methods for SDC are greatly enhanced by the cooperation and feedback of the user community!

  28. Contact Details Natalie Shlomo, SDC Centre, Methodology Directorate, Office for National Statistics, Segensworth Road, Titchfield, Fareham PO15 5RR, 01329 812612, natalie.shlomo@ons.gsi.gov.uk
