
WP 19 Assessment of Statistical Disclosure Control Methods for the 2001 UK Census

Natalie Shlomo, University of Southampton, Office for National Statistics, n.shlomo@soton.ac.uk

Presentation Transcript


  1. WP 19 Assessment of Statistical Disclosure Control Methods for the 2001 UK Census. Natalie Shlomo, University of Southampton, Office for National Statistics, n.shlomo@soton.ac.uk

  2. Topics of Discussion • Introduction • SDC Methods used in the 2001 UK Census • Data Used • Disclosure risk assessment • Data utility assessment • R-U confidentiality maps • Conclusions

  3. Introduction • 1. SDC methods for the 2001 UK Census: • Original methods: random record swapping for all of the UK and higher thresholds in England and Wales (E and W) • Re-assessment of disclosure risk for the 2001 Census: 100% of the questionnaire was coded; advancing technology; small area statistics and other external files available on the web; perception of risk • Additional method of small cell rounding for E and W (led to differential SDC methods across UK Statistical Offices) • 2. Need to assess the SDC methods with respect to a disclosure risk-data utility framework in order to develop strategies for the 2011 Census

  4. SDC Methods • Pre-tabular method of random record swapping: • A random sample of households is selected at a fixed swapping rate within control strata defined by local authority, household size, sex and broad age distribution, and hard-to-count index • For each selected household, a paired household within the same control stratum is selected (if none is found, the search extends beyond the local authority) • All geographical variables are swapped: - fewer edit failures (assumes conditional independence of census target variables and geographies given the control variables) - perturbs a highly matchable variable
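
The swapping step on this slide can be illustrated with a minimal sketch. It is not the ONS implementation: the record keys `id`, `stratum` and `oa` are hypothetical, a single `oa` field stands in for all geographical variables, partners are drawn at random from unselected households in the same stratum, and the fallback search beyond the local authority is left out.

```python
import random
from collections import defaultdict


def random_record_swap(households, rate, seed=0):
    """Swap the geography of randomly selected households with a partner
    from the same control stratum. Returns new records; input is untouched.

    households: list of dicts with hypothetical keys 'id', 'stratum', 'oa'.
    rate: fraction of households selected in each stratum (assumption).
    """
    rng = random.Random(seed)
    out = [dict(h) for h in households]

    # Group record positions by control stratum.
    strata = defaultdict(list)
    for i, h in enumerate(out):
        strata[h["stratum"]].append(i)

    for members in strata.values():
        # Each selected household needs a distinct partner, so at most
        # half of the stratum can be selected.
        n_sel = min(int(rate * len(members)), len(members) // 2)
        picked = rng.sample(members, 2 * n_sel)
        selected, partners = picked[:n_sel], picked[n_sel:]
        for a, b in zip(selected, partners):
            # Only the geography moves; all other variables stay in place.
            out[a]["oa"], out[b]["oa"] = out[b]["oa"], out[a]["oa"]
    return out
```

Because only geographies are exchanged within a stratum, the number of households per output area is unchanged within each stratum, which is why totals stay consistent across tables.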

  5. SDC Methods • Pre-tabular method of random record swapping (cont): • For this analysis we also examined: • Targeted Record Swapping where households are selected and paired with other households that are in small cells (ones and twos) of selected tables • Random record swapping not including imputed records since imputation gives a priori protection and there is no need to perturb them or take them into account in the risk assessment • 3 swapping rates: 1%, 10%, 20%

  6. SDC Methods • Pre-tabular method of record swapping (cont.): • Advantages: - Consistent totals for all tables - Preserves marginal distributions at higher aggregated levels - Some protection against disclosure by differencing two non-coterminous tables - Fewer edit failures when swapping geographies - Targeted swapping lowers disclosure risk • Disadvantages: - Leaves a high proportion of risky (unique) records unperturbed - Errors (bias) in data, in particular joint distributions distorted - Effects of perturbation hidden and can't be measured or accounted for in statistical analysis, i.e. a number in a table is not the true value - Method not transparent to users and appears as if no SDC method used - Targeted swapping causes more distortion in the distributions of the table

  7. SDC Methods • Post Tabular method of small cell rounding • Small Cell Rounding (SCA) – random rounding to base 3 of small cells: • Perturbation has a mean of zero and variance of 2. • Marginal totals obtained by adding perturbed and non-perturbed cells
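
The slide does not state the rounding probabilities. A minimal sketch, assuming the usual unbiased scheme in which a cell with residual r is rounded up with probability r/3 and down otherwise, reproduces the stated mean of zero and variance of 2. The function names are illustrative only.

```python
import random
from fractions import Fraction

BASE = 3


def round_small_cell(value, rng):
    """Unbiased random rounding of one count to a multiple of BASE."""
    r = value % BASE
    if r == 0:
        return value
    lower = value - r
    # Round up with probability r / BASE, down otherwise (assumption).
    return lower + BASE if rng.random() < r / BASE else lower


def small_cell_rounding(cells, rng=None):
    """Round only the small cells (1 and 2); other cells are left as they are."""
    rng = rng or random.Random(0)
    return [round_small_cell(v, rng) if 0 < v < BASE else v for v in cells]


def perturbation_moments(residual):
    """Exact mean and variance of the perturbation for a given residual."""
    p_up = Fraction(residual, BASE)
    up, down = BASE - residual, -residual
    mean = p_up * up + (1 - p_up) * down
    var = p_up * up ** 2 + (1 - p_up) * down ** 2 - mean ** 2
    return mean, var
```

Under this assumption a cell of 1 moves by +2 with probability 1/3 and by -1 with probability 2/3, and a cell of 2 moves by +1 with probability 2/3 and by -2 with probability 1/3, so both cases give mean 0 and variance 2.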

  8. SDC Methods • Post Tabular method of small cell rounding (cont): • For this analysis we also examined: • Full Random Rounding (CRND) – random rounding to base 3 for all entries. First turn all entries into residuals of base 3 and apply same method as SCA. Preserve overall total of the table by controlling the stochastic process • Semi-controlled random rounding (CSCA)– Preserve overall total of the table by controlling the stochastic process
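
The slide does not say how the stochastic process is controlled. One possible approach, sketched here purely as an assumption, is to fix in advance how many small cells are rounded up so that the rounded small cells sum as closely as possible to their original sum. The weighted draw and the function name are illustrative choices, not the method used for the Census.

```python
import random

BASE = 3


def semi_controlled_small_cell_rounding(cells, seed=0):
    """Round small cells (1, 2) to 0 or BASE while fixing the number of
    cells rounded up, so the table total changes by at most 1."""
    rng = random.Random(seed)
    small = [i for i, v in enumerate(cells) if 0 < v < BASE]

    # Number of cells that must go up for the small cells to keep their sum.
    n_up = round(sum(cells[i] for i in small) / BASE)

    # Weighted draw without replacement: a cell of 2 is more likely to be
    # rounded up than a cell of 1 (key = u ** (1 / weight), largest keys win).
    keys = {i: rng.random() ** (1.0 / cells[i]) for i in small}
    up = set(sorted(small, key=keys.get, reverse=True)[:n_up])

    out = list(cells)
    for i in small:
        out[i] = BASE if i in up else 0
    return out
```

In this sketch the total is preserved exactly when the small cells sum to a multiple of 3 and is otherwise off by one count. The individual rounding probabilities are only approximately those of independent random rounding.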

  9. SDC Methods • Post Tabular method of small cell rounding (cont): • Advantages: - Full protection for the high-risk (unique) cells - Full rounding protects against disclosure by differencing two non-coterminous tables - Small cell rounding has less information loss - Methods clear and transparent to users - Stochastic methods can be accounted for in statistical analysis • Disadvantages: - Inconsistent totals between tables when margins aggregated from perturbed cells - Small cell rounding gives little protection against disclosure by differencing so only one set of geographies and other variables disseminated - Full rounding has margins rounded separately and tables aren't additive - Stochastic methods of rounding are easier to unpick and tables may need to be audited prior to release

  10. Data Used • Estimation Area: SJ (Southwest England) • 437,744 persons, 182,337 households, 1,487 output areas (OAs) • 5 standard census tables (the number of categories is in parentheses): • Religion (9) * Age-sex (6) * OA • Travel to work (12) * Age-sex (12) * OA • Country of birth (17) * Sex (2) * OA • Economic Activity (9) * Sex (2) * Long term illness (2) * OA • Health status (5) * Age-sex (14) * OA

  11. Disclosure Risk • Assume disclosure risk only arises when small cells are in the table, i.e. record swapping has disclosure risk since small cells are not eliminated but rounding has no disclosure risk • Assume there is no risk of disclosure by differencing since only one set of variables and geographies are disseminated • Take into account that imputed records have no disclosure risk • Disclosure risk measure - Number of records that were perturbed or imputed in the small cells of the tables out of all the records in the small cells.
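
The measure in the last bullet can be sketched as follows. The record layout (keys `cell`, `swapped`, `imputed`) is hypothetical, and small cells are taken to be cells containing one or two records. The function returns the proportion exactly as worded on the slide, that is, the share of small-cell records that were perturbed or imputed.

```python
from collections import Counter


def protected_share_in_small_cells(records, small=(1, 2)):
    """Share of records in small cells that were swapped or imputed.

    records: list of dicts with hypothetical keys
      'cell'    - identifier of the table cell the record falls in
      'swapped' - True if the record was perturbed by record swapping
      'imputed' - True if the record was imputed
    Returns None when the table has no small cells.
    """
    counts = Counter(r["cell"] for r in records)
    in_small = [r for r in records if counts[r["cell"]] in small]
    if not in_small:
        return None
    protected = sum(1 for r in in_small if r["swapped"] or r["imputed"])
    return protected / len(in_small)
```

One minus this value is the share of small-cell records left unperturbed and not imputed.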

  12. Disclosure Risk • 16% a priori protection due to imputation • No impact on disclosure risk at 1% swap • Targeted record swapping lowers disclosure risk

  13. Data Utility • Distance metrics between original and perturbed cells in each OA, averaged across all OAs • Let $D^k$ be a table for OA $k$, $n_k$ the number of cells in OA $k$, $n_{OA}$ the number of OAs in the area, and $D^k(c)$ the cell frequency for cell $c$: • Average Absolute Distance per Cell (AAD): $AAD(D_{orig}, D_{pert}) = \frac{1}{n_{OA}} \sum_{k=1}^{n_{OA}} \frac{1}{n_k} \sum_{c} |D^k_{pert}(c) - D^k_{orig}(c)|$ • Aggregation of perturbed cells and effects on sub-totals: • Users aggregate lower level geographies which are perturbed to obtain non-standard geographies • Calculate the sub-total of each cell over the aggregated OAs
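
A minimal sketch of the two utility calculations described on this slide, assuming each table is held as a dictionary from OA code to a list of cell counts in a fixed cell order. The function names and data layout are illustrative, and the sub-total formula itself did not survive in the transcript, so `aggregate` only shows the cell-wise sum a user would form.

```python
def average_absolute_distance(original, perturbed):
    """Mean over OAs of the mean absolute cell difference within each OA."""
    per_oa = []
    for oa, orig_cells in original.items():
        pert_cells = perturbed[oa]
        total = sum(abs(p - o) for o, p in zip(orig_cells, pert_cells))
        per_oa.append(total / len(orig_cells))
    return sum(per_oa) / len(per_oa)


def aggregate(table, oas):
    """Cell-wise sub-totals over a user-defined group of OAs."""
    return [sum(cells) for cells in zip(*(table[oa] for oa in oas))]


original = {"a": [1, 2, 5, 0], "b": [3, 1, 0, 2]}
perturbed = {"a": [0, 3, 5, 0], "b": [3, 0, 0, 3]}

aad = average_absolute_distance(original, perturbed)
sub_orig = aggregate(original, ["a", "b"])
sub_pert = aggregate(perturbed, ["a", "b"])
```

In this toy example the perturbations in the second cell cancel when the two OAs are added together, while those in the first and last cells do not, which is the kind of effect the sub-total comparison is meant to capture.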

  14. Data Utility

  15. Data Utility

  16. R-U Confidentiality Map • 1% swapping rates have high utility but very high disclosure risk • 10% targeted record swapping has same disclosure risk as the 20% random record swapping but much more utility • Higher utility for random record swapping not including imputed records

  17. Conclusions • SDC methods of record swapping and rounding used for the 2001 UK Census managed the disclosure risk • Random record swapping alone gives little protection against disclosure risk. Targeted record swapping lowers risk but has higher information loss because of hidden biases • Small cell adjustments give protection against disclosure risk but produce different totals for tables with the same population base. Raise utility by controlled rounding (if possible) or semi-controlled rounding • To avoid disclosure by differencing, only one set of standard geographies and other variables is disseminated. This also lowers the utility of the census tables

  18. Developing Strategies for Census 2011 • Consistent SDC methods across all UK Statistical Offices that disseminate Census data • Methods need to ensure that sufficient statistics (totals, averages and variances) are not compromised • Flexible table generating software should be developed where the SDC method would be applied only once on the final outputted table and not aggregated from lower level geographies • Improved GIS systems may allow more flexible dissemination of non-nested geographies • SDC methods should be tailored to the type of output: standard tables, microdata, origin-destination tables, etc.
