On the Anonymization of Sparse High-Dimensional Data

On the Anonymization of Sparse High-Dimensional Data • 1 National University of Singapore, {ghinitag,kalnis}@comp.nus.edu.sg • 2 Chinese University of Hong Kong, taoyf@cse.cuhk.edu.hk

Publishing Transaction Data • Retail chain-owned shopping cart data • Infer consumer spending patterns • Correlations among purchased items • e.g., 90% of cereal buyers also buy milk • What about privacy?

Privacy Threat • Quasi-identifying items • Sensitive items

Privacy Paradigm • ℓ-diversity • Prevent association between quasi-identifier (QID) and sensitive attributes (SA) • Create groups of transactions • Frequency of any SA value in a group < 1/p • Objective • Enforce privacy • Preserve correlations among items • Challenge: high data dimensionality
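The group-level privacy condition on this slide can be checked mechanically. The following is a minimal sketch, assuming each transaction is a Python set of item names and that the set of sensitive items is known in advance. The function name `satisfies_privacy` and the `strict` flag are illustrative and do not come from the slides. The slide writes the bound as a strict inequality, so that is the default.

```python
from collections import Counter
from fractions import Fraction


def satisfies_privacy(group, sensitive_items, p, strict=True):
    """Check the per-group frequency bound for every sensitive item.

    group           -- list of transactions, each a set of item names
    sensitive_items -- set of item names treated as sensitive (assumed input)
    p               -- privacy degree from the slide
    strict          -- True: frequency < 1/p (as on the slide);
                       False: frequency <= 1/p (alternative reading, an assumption)
    """
    if not group:
        return True
    # Count in how many transactions of the group each sensitive item occurs.
    counts = Counter()
    for transaction in group:
        counts.update(transaction & sensitive_items)
    limit = Fraction(1, p)
    for count in counts.values():
        freq = Fraction(count, len(group))  # exact arithmetic, no rounding
        if freq > limit or (strict and freq == limit):
            return False
    return True
```

For instance, with p = 2 a group of three transactions in which one contains the sensitive item passes (1/3 < 1/2), while a group of two does not under the strict reading (1/2 is not below 1/2).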

Data Re-organization • Band matrix organization • PRESERVES CORRELATIONS!

Published Data • Summary of sensitive items

Contributions • Novel data representation • Preserves correlation among items • Efficient heuristic for group formation • Running time linear in the data size • Supports multiple sensitive items

State-of-the-art: Mondrian [FWR06] • Generalization-based • data-space partitioning • similar to k-d-trees • split recursively as long as the privacy condition still holds • constrained global recoding • [Figure: example partitioning of an Age vs. Weight data space, k = 2] • GENERALIZATION + HIGH DIMENSIONALITY = UNACCEPTABLE INFORMATION LOSS • [FWR06] K. LeFevre et al. Mondrian Multidimensional k-Anonymity, Proceedings of the 22nd International Conference on Data Engineering (ICDE), 2006

State-of-the-art: Anatomy [XT06] • Permutation-based method • discloses exact QID values • “Anatomized” table • |G|! permutations • RANDOM GROUP FORMATION DOES NOT PRESERVE CORRELATIONS • [XT06] X. Xiao and Y. Tao. Anatomy: Simple and Effective Privacy Preservation, Proceedings of the 32nd International Conference on Very Large Data Bases (VLDB), 2006

Band Matrix Representation • Bandwidth = U+L+1 • Minimizing bandwidth is NP-hard
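To make the bandwidth formula concrete, here is a small sketch. It assumes the usual reading of the two terms, which the slide does not spell out: U is the largest distance of a nonzero entry above the main diagonal and L the largest distance below it. The function name `bandwidth` and the list-of-rows matrix layout are illustrative choices.

```python
def bandwidth(matrix):
    """Return U + L + 1 for a 0/1 matrix given as a list of rows.

    U (upper bandwidth): max (j - i) over nonzero entries with j > i.
    L (lower bandwidth): max (i - j) over nonzero entries with i > j.
    These definitions are an assumption; the slide only gives the sum.
    """
    upper = lower = 0
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value:
                upper = max(upper, j - i)
                lower = max(lower, i - j)
    return upper + lower + 1


# A tridiagonal matrix has U = L = 1, so its bandwidth is 3.
tridiagonal = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 1, 1],
    [0, 0, 1, 1],
]
print(bandwidth(tridiagonal))
```

A single nonzero entry far from the diagonal is enough to inflate the bandwidth, which is why reordering rows and columns matters.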

Reverse Cuthill-McKee (RCM) • Heuristic bandwidth minimization • Solves corresponding graph labeling problem • Permutes rows and columns • Complexity O(N · D · log D) • N = number of matrix rows (# transactions) • D = maximum degree of any vertex
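The graph-labeling view of RCM can be sketched in a few lines. This is a textbook-style version under stated assumptions, not the presenters' implementation: the input is an undirected graph given as a dict mapping each vertex to the set of its neighbours, and how that graph is derived from the transaction matrix is not specified on the slide. Each component is traversed breadth-first from a lowest-degree vertex, neighbours are visited in order of increasing degree, and the final order is reversed. Sorting each vertex's neighbours by degree is where a D log D factor per vertex would come from.

```python
from collections import deque


def reverse_cuthill_mckee(adjacency):
    """Return a vertex ordering that tends to reduce bandwidth.

    adjacency -- dict: vertex -> set of neighbouring vertices (undirected).
    Vertices must be mutually comparable (e.g. ints) for tie-breaking.
    """
    degree = {v: len(neighbours) for v, neighbours in adjacency.items()}
    visited = set()
    order = []
    # Start each connected component from its lowest-degree unvisited vertex.
    for start in sorted(adjacency, key=lambda v: (degree[v], v)):
        if start in visited:
            continue
        visited.add(start)
        queue = deque([start])
        while queue:
            v = queue.popleft()
            order.append(v)
            # Enqueue unvisited neighbours by increasing degree.
            for w in sorted(adjacency[v] - visited, key=lambda u: (degree[u], u)):
                visited.add(w)
                queue.append(w)
    order.reverse()  # the "reverse" step of RCM
    return order


# A path 0-3-1-4-2 whose vertex labels are scrambled.
path = {0: {3}, 3: {0, 1}, 1: {3, 4}, 4: {1, 2}, 2: {4}}
print(reverse_cuthill_mckee(path))
```

In the relabelled path every edge joins vertices at adjacent positions, whereas the original labels place neighbours up to three positions apart.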

Group Formation • Correlation-aware Anonymization of High-Dimensional Data (CAHD) • Use the order given by RCM • Consecutive transactions highly correlated • O(pN) complexity
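The idea of grouping each sensitive transaction with its neighbours in the RCM order can be sketched as follows. This is a simplified illustration, not the CAHD heuristic itself: it uses positional closeness as a stand-in for correlation, it admits each sensitive item at most once per group of p transactions (frequency exactly 1/p, which assumes the privacy bound is read as "at most 1/p"), and it does not verify that the leftover transactions can still be published safely. The names `form_groups` and `window` are hypothetical.

```python
def form_groups(sensitive, p, window=3):
    """Greedy grouping over transactions already sorted in RCM order.

    sensitive -- list where sensitive[i] is the set of sensitive items of
                 transaction i (empty set for a non-sensitive transaction)
    p         -- group size / privacy degree
    window    -- hypothetical knob: search at most window * p positions
                 on each side of a sensitive transaction
    Returns (groups, leftover) as lists of transaction indices.
    """
    n = len(sensitive)
    free = [True] * n
    groups = []
    for i in range(n):
        if not free[i] or not sensitive[i]:
            continue
        group, seen = [i], set(sensitive[i])
        # Examine neighbours in order of increasing distance from row i.
        for dist in range(1, window * p + 1):
            if len(group) == p:
                break
            for j in (i - dist, i + dist):
                if (len(group) < p and 0 <= j < n and free[j]
                        and not (sensitive[j] & seen)):
                    group.append(j)
                    seen |= sensitive[j]
        if len(group) == p:  # commit only complete groups
            for j in group:
                free[j] = False
            groups.append(sorted(group))
    leftover = [j for j in range(n) if free[j]]
    return groups, leftover


rows = [set(), {"a"}, set(), set(), {"a"}, set(), {"b"}, set()]
print(form_groups(rows, p=2))
```

With a constant `window`, each sensitive transaction inspects a number of neighbours proportional to p, which is in line with the O(pN) bound quoted on the slide. The caller is expected to check the leftover set separately.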

Group Formation

Experimental Evaluation

RCM Visualization

Experimental Setting • BMS dataset • Compare with hybrid PermMondrian (PM) • Combines Mondrian with Anatomy • Query workload • Reconstruction error

Reconstruction Error vs p

Execution Time

Conclusions • Anonymizing transaction data • High-dimensionality • Preserving correlation • Future work • Different encodings for data representation • Enhance correlation among consecutive rows
