WP 33: Information Loss Measures for Frequency Tables. Caroline Young, University of Southampton, Office for National Statistics, cjy@soton.ac.uk. Natalie Shlomo, University of Southampton, Office for National Statistics, n.shlomo@soton.ac.uk.
Topics of Discussion • Introduction • Methods for perturbing frequency tables containing whole population counts • Information loss measures for assessing the impact of SDC methods on utility and quality • Data description and definition of tables • Examples and analysis of results • Conclusions and future research
Introduction • Focus on frequency tables containing whole population counts: • 1. UK Neighborhood Statistics (NeSS) website, which disseminates small area statistics from census and administrative data • 2. Tables are intentionally perturbed for statistical disclosure control (SDC), causing information loss • 3. Develop quantitative information loss measures for choosing optimal SDC methods that preserve high utility in the tables • 4. Information loss depends on the SDC method, the characteristics of the table and the use of the data
SDC Methods for Frequency Tables • SDC for frequency tables containing population counts: • Small Cell Adjustments (SCA) – random rounding to base 3 of small cells: • The perturbation has a mean of zero and a variance of 2. Marginal totals are obtained by adding perturbed and non-perturbed cells • Full Random Rounding (RaRo) – random rounding to base 3 for all entries, using the same method as above after converting each entry to its residual modulo 3 • Marginal totals are rounded separately, so the tables are not additive. Utility can be improved by semi-controlling for marginal totals
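The two rounding schemes above can be sketched in a few lines. This is a minimal illustration rather than the ONS implementation: the function names `random_round` and `small_cell_adjust` are hypothetical, and treating "small cells" as counts below the base is an assumption. Rounding up with probability residual/base is what makes the perturbation unbiased, and for base 3 it gives a variance of 2 for any cell that is not already a multiple of 3.

```python
import numpy as np

def random_round(counts, base=3, rng=None):
    """Unbiased random rounding of non-negative integer counts to a multiple of `base`."""
    rng = np.random.default_rng() if rng is None else rng
    counts = np.asarray(counts, dtype=np.int64)
    residual = counts % base  # distance above the lower multiple of `base`
    # Round up with probability residual/base, otherwise round down.
    round_up = rng.random(counts.shape) < residual / base
    return counts - residual + base * round_up

def small_cell_adjust(counts, base=3, rng=None):
    """Randomly round only the small cells (assumed here: counts below `base`)."""
    counts = np.asarray(counts, dtype=np.int64)
    rounded = random_round(counts, base, rng)
    return np.where(counts < base, rounded, counts)
```

Margins are not handled here: under SCA they would be re-summed from the adjusted cells, while under full random rounding they would be rounded separately.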
SDC Methods for Frequency Tables • SDC for frequency tables containing population counts (cont.): • Controlled Rounding (CR3) – all entries are rounded to base 3 according to the solution of a linear programme, ensuring that the aggregated rounded internal cells equal the rounded margins • Controlled rounding via Tau-Argus (standard tool for NeSS tables) • Cell suppression – small cells (ones and twos) are suppressed, and secondary suppressions are found to protect against recalculation through the margins • Cell suppression via Tau-Argus and the hypercube method
SDC Methods for Frequency Tables • SDC for frequency tables containing population counts (cont.): • Imputation methods for cell suppression: • Margins are known, and the total of the suppressed cells is known • S-A: impute each suppressed cell by the average of the total of the suppressed cells in its row • S-WA: impute each suppressed cell by a weighted average of the total of the suppressed cells in its row, where the weights are the column totals
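One plausible reading of the two imputation rules is sketched below. The function name `impute_suppressed` and the use of NaN to mark suppressed cells are assumptions made for illustration. The hidden total in a row is recovered as the row margin minus the published cells, then shared out either equally (S-A) or in proportion to the column totals of the suppressed columns (S-WA).

```python
import numpy as np

def impute_suppressed(published, row_totals, col_totals, weighted=False):
    """Fill NaN (suppressed) cells row by row so that each row sums to its margin."""
    published = np.asarray(published, dtype=float)
    col_totals = np.asarray(col_totals, dtype=float)
    out = published.copy()
    for i, row in enumerate(published):
        missing = np.isnan(row)
        if not missing.any():
            continue
        # Total hidden in this row = row margin minus the published cells.
        hidden_total = row_totals[i] - np.nansum(row)
        if weighted:
            # S-WA: share the hidden total in proportion to the column totals.
            w = col_totals[missing]
            out[i, missing] = hidden_total * w / w.sum()
        else:
            # S-A: share the hidden total equally among the suppressed cells.
            out[i, missing] = hidden_total / missing.sum()
    return out
```

Both variants preserve the row margins; neither is guaranteed to preserve the column margins.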
Information Loss Measures • Measuring distortion to distributions: distance metrics between original and perturbed cells in each geography (i.e., ward (NUTS5)), averaged across all wards • Each metric is defined in terms of the table for ward k, the number of cells in the ward, the number of wards, and the cell frequency for cell c • Hellinger's Distance (HD) • Relative Absolute Distance (RAD) • Average Absolute Distance per Cell (AAD)
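The three metrics can be sketched as follows. The exact formulas do not appear in the text, so the forms used here are assumptions based on common definitions applied directly to counts: HD = sqrt(0.5 * sum over c of (sqrt(pert_c) - sqrt(orig_c))^2), RAD = sum over c of |pert_c - orig_c| / orig_c, and AAD = sum over c of |pert_c - orig_c| divided by the number of cells. All function names are hypothetical, and skipping empty original cells in RAD is a choice made here to avoid division by zero.

```python
import numpy as np

def hellinger(orig, pert):
    """Hellinger-type distance between original and perturbed counts of one ward."""
    orig, pert = np.asarray(orig, float), np.asarray(pert, float)
    return np.sqrt(0.5 * np.sum((np.sqrt(pert) - np.sqrt(orig)) ** 2))

def relative_absolute_distance(orig, pert):
    """Sum of absolute changes relative to the original count (non-empty cells only)."""
    orig, pert = np.asarray(orig, float), np.asarray(pert, float)
    nonzero = orig > 0  # assumption: cells that were originally empty are skipped
    return np.sum(np.abs(pert[nonzero] - orig[nonzero]) / orig[nonzero])

def average_absolute_distance(orig, pert):
    """Mean absolute change per cell."""
    orig, pert = np.asarray(orig, float), np.asarray(pert, float)
    return np.mean(np.abs(pert - orig))

def average_over_wards(metric, orig_tables, pert_tables):
    """Apply `metric` to each ward's table and average across wards."""
    return float(np.mean([metric(o, p) for o, p in zip(orig_tables, pert_tables)]))
```

For example, `average_over_wards(hellinger, orig_tables, pert_tables)` takes two equal-length lists of per-ward tables and returns the ward-averaged distance.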
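Information Loss Measures • Aggregation of perturbed cells and effects on sub-totals: • Users aggregate perturbed lower-level geographies to obtain non-standard geographies • Calculate the sub-total over the aggregated wards in the original and perturbed tables • Impact on tests for independence: Cramér's V measure of association: $V = \sqrt{\chi^2 / (n \min(r-1, c-1))}$, where $\chi^2$ is the Pearson chi-square statistic, $n$ is the table total, and $r$ and $c$ are the numbers of rows and columns • Information loss measure: percent relative difference in Cramér's V between the perturbed and original tables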
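A short sketch of the association measure may help. It computes Cramér's V from the Pearson chi-square statistic of a two-way table and then the percent relative difference between a perturbed and an original value. The function names are hypothetical, and the code assumes the table has no empty rows or columns (otherwise the expected counts contain zeros).

```python
import numpy as np

def cramers_v(table):
    """Cramér's V for a two-way contingency table with no empty rows or columns."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    # Expected counts under independence: outer product of the margins over n.
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    chi2 = np.sum((table - expected) ** 2 / expected)
    r, c = table.shape
    return np.sqrt(chi2 / (n * min(r - 1, c - 1)))

def percent_relative_difference(orig_value, pert_value):
    """100 * (perturbed - original) / original."""
    return 100.0 * (pert_value - orig_value) / orig_value
```

A positive result from `percent_relative_difference(cramers_v(orig), cramers_v(pert))` means the perturbation increased the apparent association; a negative result means it decreased it.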
Information Loss Measures • Impact on variance: • Little impact on the variance of cell counts • "Between" variance of target variables for proportions in wards, defined from the proportion in each ward k and the overall proportion • Information loss measure: based on the change in the between variance after perturbation • Effects on this information loss measure are mixed
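The between-ward variance could be computed along the following lines. The formula is an assumption, since the text does not give it: the proportion in ward k is taken as the target count divided by the ward total, the overall proportion as the pooled count over the pooled total, and the between variance as the sum of squared deviations divided by the number of wards minus one. The function name `between_variance` is hypothetical.

```python
import numpy as np

def between_variance(target_counts, ward_totals):
    """Variance of ward-level proportions around the overall (pooled) proportion."""
    target_counts = np.asarray(target_counts, dtype=float)
    ward_totals = np.asarray(ward_totals, dtype=float)
    p_k = target_counts / ward_totals               # proportion in each ward
    p = target_counts.sum() / ward_totals.sum()     # overall proportion
    return np.sum((p_k - p) ** 2) / (len(p_k) - 1)
```

One way to turn this into an information loss measure, again as an assumption, is the percent relative change `100 * (bv_pert - bv_orig) / bv_orig`.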
Information Loss Measures • Impact on rank correlations: • Sort the original cell counts and define deciles; repeat on the perturbed cell counts • Information loss measure: the percentage of cells that fall in a different decile after perturbation, computed with the indicator function I over the number of wards • Log-linear analysis: • Information loss measure based on the ratio of the deviance (likelihood ratio test statistic) between the perturbed table and the original table for a given model • Different models also need to be compared, since the model for the original table may differ from the model for the perturbed table
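Both measures can be illustrated with a small sketch. The function names are hypothetical, ties in the decile ranking are broken by position (an assumption that matters for sparse columns with many equal counts), and the deviance is shown only for the independence model of a two-way table as one example of "a given model".

```python
import numpy as np

def decile_groups(values):
    """Assign each value to a decile (0-9) by its rank; ties are broken by position."""
    values = np.asarray(values, dtype=float)
    ranks = np.argsort(np.argsort(values, kind="stable"), kind="stable")
    return (10 * ranks) // len(values)

def percent_changed_decile(orig, pert):
    """Percentage of cells whose decile differs between original and perturbed counts."""
    return 100.0 * np.mean(decile_groups(orig) != decile_groups(pert))

def independence_deviance(table):
    """Likelihood ratio statistic G^2 for the independence model of a two-way table."""
    table = np.asarray(table, dtype=float)
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / table.sum()
    pos = table > 0  # empty cells contribute zero to the sum
    return 2.0 * np.sum(table[pos] * np.log(table[pos] / expected[pos]))
```

The deviance-based measure would then be `independence_deviance(pert) / independence_deviance(orig)`, with values far from 1 indicating that perturbation changed the fit of the model.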
Data Used • Estimation Area Southwest England: 437,744 persons, 182,337 households, 70 wards (on average 6,250 persons per ward) • The tables were the following: • Tenure (3) * Age (7) * Health (4) * Ward • Ethnicity (17) * Ward • Economic Activity (9) * Sex (2) * Long-Term Illness (2) * Ward
Distance Metrics (figure, three panels by perturbation method: RaRo, CR3, SA, SCA, SWA): (Left) Hellinger's Distance, (Centre) Relative Absolute Distance, (Right) Average Absolute Distance per Cell
Box Plots (figure): Difference between Perturbed and Original Subtotals of Three Consecutive Wards (ADs) PAs for Number of Unemployed Females with Long Term Illness (internal cells). Horizontal axis: perturbation method
Change in Cramér's V Measure of Association after Perturbation (figure). Axis: percent relative difference; labelled values: 48.27 and 2.36; regions marked as increase in association and decrease in association
Percentage of Cells in a Different Decile after Perturbation (figure, by method: CR3, RaRo, SA, SCA, SWA): Male (column 1) and Female (column 2) Students with Long Term Illness. Vertical axis: percentage of cells. N.B. The selected columns are very sparse, with approximately 70% of cells having counts < 4.
Log-Linear Models: Effect of Perturbation on Model Selection Original Model: Choose a better model?
Conclusions • Inconsistent results for some of the information loss measures (Cramér's V, "between" variance), showing that stochastic processes for SDC will have varying effects on the quality of the data • Emergence of some guidelines: • Skewed tables (one or two large columns and the rest small columns): prefer rounding to cell suppression • Uniform tables: less information loss due to SDC methods, so choose the method with the fewest changes to the table • Sparse tables: benchmarked totals are needed, so use controlled rounding (if possible) or semi-controlled random rounding • Improve utility by: designing tables to avoid disclosive cells; controlling for totals when applying random or small cell rounding; giving clear guidance to users on how best to impute suppressed cells
Future Research • Determine optimal methods of SDC depending on the use of the data and the characteristics of the table (skewed, sparse, uniform) • Generalize and expand information loss measures to all types of statistical data (tabular and microdata) and statistical analysis • Develop software that data suppliers can use to assess information loss under different SDC methods and to choose the optimal method, one that gives high-utility tables