
Privacy-Aware Publishing of Netflix Data

Brian Thompson¹, Chih-Cheng Chang¹, Hui (Wendy) Wang², Danfeng Yao¹
¹ Department of Computer Science, Rutgers University
² Department of Computer Science, Stevens Institute of Technology
{bthom,geniusc,danfeng}@cs.rutgers.edu, hwang@cs.stevens.edu





Presentation Transcript


Motivation

• Recommender systems like Netflix store a database of users and items, and corresponding ratings.
• Being able to predict which items a user will enjoy can lead to happier customers and better business.
• The Netflix Challenge: given the Netflix dataset, devise an algorithm with high prediction accuracy.
• Problem: Netflix replaces names with random ID numbers before publishing to protect privacy, but users can still be re-identified by comparison with external sources.
• How can Netflix ensure users’ privacy without degrading the accuracy of prediction results?
• Challenges:
• The sparsity of recommendation data presents a two-fold problem: existing anonymization techniques do not perform well on sparse data, and at the same time users are more easily identifiable [Narayanan et al.].
• Algorithms must be scalable: the Netflix Challenge dataset is a matrix with over 8.5 billion cells!

Our Approach

We develop a novel approach, Predictive Anonymization, which effectively anonymizes sparse recommendation datasets by introducing a padding phase that reduces sparsity while maintaining the integrity of the data.

Analysis

• The Netflix dataset has 480,189 users and 17,770 movies.
• We use singular value decomposition (SVD) for the padding step, and also for prediction when necessary.
• We measure prediction accuracy by comparing the root mean squared error (RMSE) of the published anonymized datasets with that of the original data.

Model

We model recommender systems as labeled bipartite review graphs. Malicious parties may employ (a) structure-based or (b) label-based attacks to re-identify users and thus learn sensitive information about their rating history. We say a user is k-anonymous if there are at least k-1 other users with identical item ratings.

Figure: An example review graph, with users Ben and Tim connected to the movies Godfather, Star Wars, and English Patient by rating edges (e.g., a rating of 5), illustrating (a) a structure-based attack and (b) a label-based attack.

Anonymization Algorithm

Padding. To eliminate the sparsity problem, we perform a pre-processing step that replaces null entries with predicted values before anonymizing. Predictive padding helps uncover latent trends in the original data, leading to more accurate clustering results. (A minimal padding sketch appears after this section.)

Table: Prediction accuracy (RMSE) before and after anonymization.
Chart: Distribution of ratings before and after padding.

Clustering. To increase efficiency, we apply a sampling procedure (see the clustering sketch after this section):
1. Choose a random sample from the padded dataset.
2. Apply the Bounded t-Means Algorithm to the sample to generate t same-size clusters.
3. Put each user in the dataset into one of t bins corresponding to its nearest cluster.
4. Once again, apply the Bounded t-Means Algorithm to partition each bin into smaller clusters, which will correspond to the final anonymization groups.

Homogenization. To defend against structure-based and label-based attacks, we homogenize the k users in each anonymization group so that they have identical review graphs, including the ratings. In Simple Anonymization, we average original movie ratings over the users in a group; in Padded Anonymization, the average is taken over the padded data. (A homogenization sketch follows the clustering sketch below.)
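Below is a minimal sketch of the padding step. The poster states only that SVD is used to replace null entries with predicted values; the rank, the mean-filling initialization, and the function names here are our illustrative assumptions. A dense SVD is shown purely for readability: the full 480,189 × 17,770 Netflix matrix would require a sparse, iterative SVD. The rmse helper mirrors the accuracy measure named in the Analysis section.

```python
import numpy as np

def pad_with_svd(ratings, rank=10):
    """Fill null (NaN) cells of a user-by-item rating matrix with
    predictions from a rank-limited SVD approximation.
    (Sketch; the poster does not specify the SVD details.)"""
    observed = ~np.isnan(ratings)
    # Start from a dense matrix: fill null cells with each movie's mean rating.
    col_means = np.nanmean(ratings, axis=0)
    filled = np.where(observed, ratings, col_means)
    # A truncated SVD captures latent user/movie trends.
    U, s, Vt = np.linalg.svd(filled, full_matrices=False)
    approx = (U[:, :rank] * s[:rank]) @ Vt[:rank, :]
    # Keep observed ratings; predict only where the entry was null.
    padded = np.where(observed, ratings, approx)
    return np.clip(padded, 1.0, 5.0)  # Netflix ratings are 1-5 stars

def rmse(predicted, actual, mask):
    """Root mean squared error over the cells selected by mask."""
    diff = (predicted - actual)[mask]
    return float(np.sqrt(np.mean(diff ** 2)))
```

Leaving observed ratings untouched and predicting only the null cells is what lets padding reduce sparsity while preserving the integrity of the original data.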
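The next sketch mirrors the four clustering steps. The poster names the Bounded t-Means Algorithm but does not define it, so bounded_t_means below is an assumed greedy, same-size k-means variant; all identifiers (anonymization_groups, sample_size, and so on) are hypothetical.

```python
import numpy as np

def bounded_t_means(X, t, iters=10, rng=None):
    """Partition the rows of X into t clusters of (nearly) equal size.
    (Assumed same-size k-means variant; not the authors' exact algorithm.)"""
    if rng is None:
        rng = np.random.default_rng(0)
    n = len(X)
    cap = int(np.ceil(n / t))  # bound: no cluster may exceed ~n/t points
    centers = X[rng.choice(n, size=t, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None] - centers[None], axis=2)
        labels = np.full(n, -1)
        counts = np.zeros(t, dtype=int)
        # Greedily assign points, closest first, to the nearest open cluster.
        for i in np.argsort(dists.min(axis=1)):
            for c in np.argsort(dists[i]):
                if counts[c] < cap:
                    labels[i] = c
                    counts[c] += 1
                    break
        centers = np.array([X[labels == c].mean(axis=0) if counts[c] else centers[c]
                            for c in range(t)])
    return labels, centers

def anonymization_groups(padded, t, k, sample_size, rng=None):
    """Steps 1-4: cluster a random sample, bin every user by nearest
    sample cluster, then split each bin into groups of about k users."""
    if rng is None:
        rng = np.random.default_rng(0)
    idx = rng.choice(len(padded), size=sample_size, replace=False)
    _, centers = bounded_t_means(padded[idx], t, rng=rng)       # steps 1-2
    dists = np.linalg.norm(padded[:, None] - centers[None], axis=2)
    bins = dists.argmin(axis=1)                                 # step 3
    groups = []
    for b in range(t):                                          # step 4
        members = np.flatnonzero(bins == b)
        if len(members) == 0:
            continue
        sub = max(1, len(members) // k)
        sub_labels, _ = bounded_t_means(padded[members], sub, rng=rng)
        groups += [members[sub_labels == s] for s in range(sub)]
    return groups
```

Clustering only a sample in steps 1-2, then making a single nearest-center pass over all users, is what keeps the procedure scalable to a matrix with billions of cells.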
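Finally, a homogenization sketch, assuming that each anonymization group is released with the group's average rating vector, as the Simple vs. Padded Anonymization description above states. The is_k_anonymous checker encodes the poster's definition of k-anonymity; both function names are ours.

```python
import numpy as np

def homogenize(data, groups):
    """Release each user with their group's average rating vector, so all
    k users in a group have identical review graphs. Pass the original
    matrix for Simple Anonymization or the padded matrix for Padded
    Anonymization (a dense, already-padded matrix is assumed here)."""
    out = np.array(data, dtype=float, copy=True)
    for members in groups:
        out[members] = data[members].mean(axis=0)
    return out

def is_k_anonymous(released, groups, k):
    """Check the poster's definition: every user shares identical item
    ratings with at least k-1 other users."""
    return all(len(g) >= k and np.allclose(released[g], released[g[0]])
               for g in groups)
```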
Our Predictive Anonymization algorithm guarantees node re-identification privacy in the form of k-anonymity against the stronger label-based attack. Our methods also provide privacy against link disclosure attacks. Details and further discussion can be found in our technical report [Chang et al.].

Conclusions and Future Work

• Our experiments show that Padded Anonymization yields high prediction accuracy while also providing theoretical privacy guarantees. However, sparsity is reduced in the released dataset, which affects the integrity of the data.
• Future work:
• Maintain sparsity without loss in prediction accuracy.
• Extend our methods to prevent link disclosure attacks.

References

A. Narayanan and V. Shmatikov. “Robust De-anonymization of Large Sparse Datasets.” IEEE Symposium on Security and Privacy (S&P), 2008.
C. Chang, B. Thompson, W. Wang, and D. Yao. “Predictive Anonymization: Utility-Preserving Publishing of Sparse Recommendation Data.” Technical Report DCS-tr-647, April 2009.
