
Privacy-Aware Publishing of Netflix Data

Brian Thompson¹, Chih-Cheng Chang¹, Hui (Wendy) Wang², Danfeng Yao¹
¹ Department of Computer Science, Rutgers University
² Department of Computer Science, Stevens Institute of Technology
{bthom,geniusc,danfeng}@cs.rutgers.edu, hwang@cs.stevens.edu





Presentation Transcript


Motivation

• Recommender systems like Netflix store a database of users and items, and corresponding ratings.
• Being able to predict which items a user will enjoy can lead to happier customers and better business.
• The Netflix Challenge: given the Netflix dataset, devise an algorithm with high prediction accuracy.
• Problem: Netflix replaces names with random ID numbers before publishing to protect privacy, but users can still be re-identified by comparison with external sources.
• How can Netflix ensure users’ privacy without degrading the accuracy of prediction results?
• Challenges:
• The sparsity of recommendation data presents a two-fold problem: existing anonymization techniques do not perform well on sparse data, and at the same time users are more easily identifiable [Narayanan et al.].
• Algorithms must be scalable: the Netflix Challenge dataset is a matrix with over 8.5 billion cells!

Our Approach

We develop a novel approach, Predictive Anonymization, which effectively anonymizes sparse recommendation datasets by introducing a padding phase that reduces sparsity while maintaining the integrity of the data.

Analysis

• The Netflix dataset has 480,189 users and 17,770 movies.
• We use singular value decomposition (SVD) for the padding step, and also for prediction when necessary.
• We measure prediction accuracy by comparing the root mean squared error (RMSE) of the published anonymized datasets with that of the original data.

Model

We model recommender systems as labeled bipartite review graphs. Malicious parties may employ (a) structure-based or (b) label-based attacks to re-identify users and thus learn sensitive information about their rating history. We say a user is k-anonymous if there are at least k-1 other users with identical item ratings.

Figure: An example review graph, with users Ben and Tim connected to the movies Godfather, Star Wars, and English Patient by rating edges (e.g., a rating of 5), illustrating (a) a structure-based attack and (b) a label-based attack.

Anonymization Algorithm

Padding. To eliminate the sparsity problem, we perform a pre-processing step that replaces null entries with predicted values before anonymizing. Predictive padding helps uncover latent trends in the original data, leading to more accurate clustering results. (A minimal padding sketch appears after this section.)

Table: Prediction accuracy (RMSE) before and after anonymization.
Chart: Distribution of ratings before and after padding.

Clustering. To increase efficiency, we apply a sampling procedure (see the clustering sketch after this section):
1. Choose a random sample from the padded dataset.
2. Apply the Bounded t-Means Algorithm to the sample to generate t same-size clusters.
3. Put each user in the dataset into one of t bins corresponding to its nearest cluster.
4. Once again, apply the Bounded t-Means Algorithm to partition each bin into smaller clusters, which will correspond to the final anonymization groups.

Homogenization. To defend against structure-based and label-based attacks, we homogenize the k users in each anonymization group so that they have identical review graphs, including the ratings. In Simple Anonymization, we average original movie ratings over the users in a group; in Padded Anonymization, the average is taken over the padded data. (A homogenization sketch follows the clustering sketch below.)
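Below is a minimal sketch of the padding step. The poster states only that SVD is used to replace null entries with predicted values; the rank, the mean-filling initialization, and the function names here are our illustrative assumptions. A dense SVD is shown purely for readability: the full 480,189 × 17,770 Netflix matrix would require a sparse, iterative SVD. The rmse helper mirrors the accuracy measure named in the Analysis section.

```python
import numpy as np

def pad_with_svd(ratings, rank=10):
    """Fill null (NaN) cells of a user-by-item rating matrix with
    predictions from a rank-limited SVD approximation.
    (Sketch; the poster does not specify the SVD details.)"""
    observed = ~np.isnan(ratings)
    # Start from a dense matrix: fill null cells with each movie's mean rating.
    col_means = np.nanmean(ratings, axis=0)
    filled = np.where(observed, ratings, col_means)
    # A truncated SVD captures latent user/movie trends.
    U, s, Vt = np.linalg.svd(filled, full_matrices=False)
    approx = (U[:, :rank] * s[:rank]) @ Vt[:rank, :]
    # Keep observed ratings; predict only where the entry was null.
    padded = np.where(observed, ratings, approx)
    return np.clip(padded, 1.0, 5.0)  # Netflix ratings are 1-5 stars

def rmse(predicted, actual, mask):
    """Root mean squared error over the cells selected by mask."""
    diff = (predicted - actual)[mask]
    return float(np.sqrt(np.mean(diff ** 2)))
```

Leaving observed ratings untouched and predicting only the null cells is what lets padding reduce sparsity while preserving the integrity of the original data.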
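The next sketch mirrors the four clustering steps. The poster names the Bounded t-Means Algorithm but does not define it, so bounded_t_means below is an assumed greedy, same-size k-means variant; all identifiers (anonymization_groups, sample_size, and so on) are hypothetical.

```python
import numpy as np

def bounded_t_means(X, t, iters=10, rng=None):
    """Partition the rows of X into t clusters of (nearly) equal size.
    (Assumed same-size k-means variant; not the authors' exact algorithm.)"""
    if rng is None:
        rng = np.random.default_rng(0)
    n = len(X)
    cap = int(np.ceil(n / t))  # bound: no cluster may exceed ~n/t points
    centers = X[rng.choice(n, size=t, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None] - centers[None], axis=2)
        labels = np.full(n, -1)
        counts = np.zeros(t, dtype=int)
        # Greedily assign points, closest first, to the nearest open cluster.
        for i in np.argsort(dists.min(axis=1)):
            for c in np.argsort(dists[i]):
                if counts[c] < cap:
                    labels[i] = c
                    counts[c] += 1
                    break
        centers = np.array([X[labels == c].mean(axis=0) if counts[c] else centers[c]
                            for c in range(t)])
    return labels, centers

def anonymization_groups(padded, t, k, sample_size, rng=None):
    """Steps 1-4: cluster a random sample, bin every user by nearest
    sample cluster, then split each bin into groups of about k users."""
    if rng is None:
        rng = np.random.default_rng(0)
    idx = rng.choice(len(padded), size=sample_size, replace=False)
    _, centers = bounded_t_means(padded[idx], t, rng=rng)       # steps 1-2
    dists = np.linalg.norm(padded[:, None] - centers[None], axis=2)
    bins = dists.argmin(axis=1)                                 # step 3
    groups = []
    for b in range(t):                                          # step 4
        members = np.flatnonzero(bins == b)
        if len(members) == 0:
            continue
        sub = max(1, len(members) // k)
        sub_labels, _ = bounded_t_means(padded[members], sub, rng=rng)
        groups += [members[sub_labels == s] for s in range(sub)]
    return groups
```

Clustering only a sample in steps 1-2, then making a single nearest-center pass over all users, is what keeps the procedure scalable to a matrix with billions of cells.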
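Finally, a homogenization sketch, assuming that each anonymization group is released with the group's average rating vector, as the Simple vs. Padded Anonymization description above states. The is_k_anonymous checker encodes the poster's definition of k-anonymity; both function names are ours.

```python
import numpy as np

def homogenize(data, groups):
    """Release each user with their group's average rating vector, so all
    k users in a group have identical review graphs. Pass the original
    matrix for Simple Anonymization or the padded matrix for Padded
    Anonymization (a dense, already-padded matrix is assumed here)."""
    out = np.array(data, dtype=float, copy=True)
    for members in groups:
        out[members] = data[members].mean(axis=0)
    return out

def is_k_anonymous(released, groups, k):
    """Check the poster's definition: every user shares identical item
    ratings with at least k-1 other users."""
    return all(len(g) >= k and np.allclose(released[g], released[g[0]])
               for g in groups)
```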
Our Predictive Anonymization algorithm guarantees node re-identification privacy in the form of k-anonymity against the stronger label-based attack. Our methods also provide privacy against link disclosure attacks. Details and further discussion can be found in our technical report [Chang et al.].

Conclusions and Future Work

• Our experiments show that Padded Anonymization yields high prediction accuracy while also providing theoretical privacy guarantees. However, sparsity is reduced in the released dataset, which affects the integrity of the data.
• Future work:
• Maintain sparsity without loss in prediction accuracy.
• Extend our methods to prevent link disclosure attacks.

References

A. Narayanan and V. Shmatikov. “Robust De-anonymization of Large Sparse Datasets.” IEEE Symposium on Security and Privacy (S&P), 2008.
C. Chang, B. Thompson, W. Wang, and D. Yao. “Predictive Anonymization: Utility-Preserving Publishing of Sparse Recommendation Data.” Technical Report DCS-tr-647, April 2009.
