
Towards Publishing Recommendation Data With Predictive Anonymization

Chih-Cheng Chang†, Brian Thompson†, Hui Wang‡, Danfeng Yao†. ACM Symposium on Information, Computer, and Communications Security (ASIACCS 2010).


Presentation Transcript


  1. Towards Publishing Recommendation Data With Predictive Anonymization • Chih-Cheng Chang†, Brian Thompson†, Hui Wang‡, Danfeng Yao† • ACM Symposium on Information, Computer, and Communications Security (ASIACCS 2010)

  2. Outline • Introduction • Privacy in recommender systems • Predictive Anonymization • Experimental results • Conclusions and future work

  3. Motivation • Inevitable trend towards data sharing: medical records, social networks, web search data, online shopping and ads • Databases contain sensitive information • Growing need to protect privacy

  4. Privacy in Relational Databases [Figure: example table whose columns are marked as identifiers and as sensitive information]

  5. Privacy in Relational Databases • “Pseudo-identifiers”: 87% of the U.S. population can be uniquely identified by date of birth, gender, and ZIP code! [S00]

  6. Approaches to Achieving Privacy • Statistical databases: only aggregate queries are allowed (e.g., “What is the average salary?”) • Differential privacy [Dinur-Nissim ’03, Dwork ’06]: adaptively add random noise to the output so that a querier cannot determine whether a user is in the database • Quality decreases over multiple queries • Publishing of anonymized databases: no restriction on how the data is used, which is good for complex data mining applications • How to address privacy concerns?
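The noise-adding idea on this slide can be illustrated with a small sketch. The talk does not name a specific mechanism, so the Laplace mechanism is swapped in here as a common textbook choice; the function names, the clipping bounds, and the `epsilon` parameter are all assumptions made for illustration.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_average(values, lower, upper, epsilon, rng):
    """Answer an 'average' query with Laplace noise (illustrative only).

    Assumes the number of records is public and that neighbouring
    databases differ by replacing one record, so the average changes
    by at most (upper - lower) / n.
    """
    clipped = [min(max(v, lower), upper) for v in values]
    true_avg = sum(clipped) / len(clipped)
    sensitivity = (upper - lower) / len(clipped)
    return true_avg + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(0)
salaries = [50_000, 60_000, 70_000]  # hypothetical data
answer = noisy_average(salaries, 0, 100_000, epsilon=0.5, rng=rng)
```

A smaller `epsilon` means more noise per answer, which is one way to see why answer quality degrades when a fixed privacy budget is spread over many queries.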

  7. Anonymization of Databases • Techniques: Perturbation [Figure: example ages 52, 24, and 45 shown alongside perturbed values 53, 26, and 42]

  8. Anonymization of Databases • Techniques: Perturbation, Swapping [Figure: example ages 52 and 45 swapped]

  9. Anonymization of Databases • Techniques: Perturbation, Swapping, Generalization [Figure: example ages 52, 24, and 45 generalized to 50s, 20s, and 40s] • Def. A database entry is k-anonymous if at least k-1 other entries match it identically on the insensitive attributes. [SS98]
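The definition above can be checked mechanically. The following is a minimal sketch, assuming rows are plain dictionaries; the helper names and the sample rows are hypothetical, and the decade-based generalization mirrors the 52 → 50s example on the slide.

```python
from collections import Counter


def is_k_anonymous(rows, attributes, k):
    """True if every row matches at least k-1 other rows on `attributes`."""
    groups = Counter(tuple(row[a] for a in attributes) for row in rows)
    return all(count >= k for count in groups.values())


def generalize_age(age):
    """Generalize an exact age to its decade, e.g. 52 -> '50s'."""
    return f"{age // 10 * 10}s"


# Hypothetical records: 'age' is the matching attribute,
# 'diagnosis' is the sensitive value.
rows = [
    {"age": 52, "diagnosis": "A"},
    {"age": 57, "diagnosis": "B"},
    {"age": 24, "diagnosis": "C"},
    {"age": 28, "diagnosis": "A"},
]
generalized = [{**r, "age": generalize_age(r["age"])} for r in rows]

print(is_k_anonymous(rows, ["age"], 2))         # exact ages are all unique
print(is_k_anonymous(generalized, ["age"], 2))  # two rows per decade
```

With exact ages every row is unique, so the check fails for k = 2; after generalization each decade holds two rows and the check passes.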

  10. The Generalization Approach

  11. Outline • Introduction • Privacy in recommender systems • Predictive Anonymization • Experimental results • Conclusions and future work

  12. Recommender Systems • Users register for the service • After buying a good, they submit a rating for it • They get recommendations based on their own and others’ ratings

  13. Recommender Systems • The Netflix Challenge: “anonymized” Netflix data is released to the public, with a $1 million prize for the best movie prediction algorithm. • Question: Is privacy really protected? NO!

  14. Privacy in Recommender Systems • Narayanan and Shmatikov [NS08] exploited external information to re-identify users in the released Netflix Challenge dataset. Privacy breach!

  15. News Timeline • How can we enable sharing of recommendation data without compromising users’ privacy?

  16. Challenges in Anonymization of Recommender Systems • All data may be considered “sensitive” by users. • All data could be used as quasi-identifiers. • Data sparsity helps re-identification attacks and makes anonymization difficult. [NS08] • Scalability: the Netflix matrix has 8.5 billion cells!
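The scale and sparsity figures can be reproduced with simple arithmetic. This sketch uses the user and movie counts reported on the Experiments slide; the rating count is taken as the stated lower bound of 100 million, so the computed density is a lower bound as well.

```python
users = 480_189
movies = 17_770
ratings = 100_000_000  # lower bound: the talk says "more than 100 million"

cells = users * movies     # size of the full user-by-movie matrix
density = ratings / cells  # fraction of cells that hold a rating

print(f"{cells:,} cells")          # about 8.5 billion
print(f"{density:.2%} populated")  # slightly over 1%
```

Roughly 99% of the matrix is empty, which is the sparsity that the bullets above identify as both an aid to re-identification and an obstacle to anonymization.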

  17. Attack Models • We represent the recommendation database as a labeled bipartite graph. [Figure: bipartite graph linking users 0001–0004 to the movies Star Wars, Godfather, English Patient, and Pretty in Pink, with ratings as edge labels; an adversary’s background knowledge about “Ben” and “Tim” illustrates the “structure-based attack” and the “label-based attack”]
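A labeled bipartite graph of this kind can be sketched as a nested dictionary. The ratings below are hypothetical, and the two attack functions rest on an assumed reading of the slide: a structure-based attack uses only which movies a target rated, while a label-based attack also uses the rating values.

```python
# user id -> {movie title: rating}; each entry is one labeled edge.
graph = {
    "0001": {"Star Wars": 3, "Godfather": 5, "English Patient": 4},
    "0002": {"Star Wars": 4, "Godfather": 5},
    "0003": {"Godfather": 4, "English Patient": 5, "Pretty in Pink": 1},
    "0004": {"English Patient": 1, "Pretty in Pink": 5},
}


def structure_candidates(graph, known_movies):
    """Users whose edges include every movie the adversary knows about."""
    return sorted(u for u, edges in graph.items()
                  if set(known_movies).issubset(edges))


def label_candidates(graph, known_ratings):
    """Users whose edge labels match every (movie, rating) pair known."""
    return sorted(u for u, edges in graph.items()
                  if all(edges.get(m) == r for m, r in known_ratings.items()))


# Knowing three movies the target has rated narrows the search to one user.
print(structure_candidates(graph, ["Godfather", "English Patient", "Star Wars"]))
# Knowing one movie together with its rating can be just as identifying.
print(label_candidates(graph, {"English Patient": 1}))
```

In both calls the candidate set shrinks to a single user, which is the re-identification outcome the privacy models are meant to prevent.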

  18. Privacy Models • Node re-identification privacy: it should not be possible to re-identify individuals. • Link existence privacy: it should not be possible to infer whether a user has seen a particular movie. • Our approach, Predictive Anonymization, provides these notions of privacy against both the structure-based and label-based attacks.

  19. Outline • Introduction • Privacy in recommender systems • Predictive Anonymization • Experimental results • Conclusions and future work

  20. Predictive Anonymization • Our solution takes a 3-step approach: • Use predictive padding to reduce sparsity. • Cluster users into groups of size k. • Perform homogenization by assigning users in each group to have the same ratings. • Achieves k-anonymity!

  21. Predictive Anonymization • Want to cluster users, but there is not enough information due to data sparsity. • Solution: Fill empty cells with predicted values. • Cluster users based on similar tastes, not necessarily similar lists of movies rated.

  22. Predictive Anonymization • Use predictive padding to reduce sparsity. • Cluster users into groups of size k. • Perform homogenization by assigning users in each group to have the same ratings. • The final step, homogenization, can be done in one of several ways. We describe two methods, “padded” and “pure” homogenization.
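The three steps can be sketched end to end on a toy matrix. This is an illustration under stated assumptions, not the authors' implementation: padding uses per-movie means as a simple stand-in for the SVD predictor reported in the experiments, the greedy nearest-neighbour grouping is a hypothetical clustering choice, and all function names are invented for this sketch.

```python
import math


def pad_with_item_means(matrix):
    """Step 1 (stand-in): fill missing ratings (None) with the movie's mean."""
    n_items = len(matrix[0])
    means = []
    for j in range(n_items):
        seen = [row[j] for row in matrix if row[j] is not None]
        means.append(sum(seen) / len(seen) if seen else 0.0)
    return [[means[j] if row[j] is None else row[j] for j in range(n_items)]
            for row in matrix]


def cluster_into_groups(padded, k):
    """Step 2 (assumed method): greedily group each seed user with its
    k-1 nearest unassigned users; the last group keeps the remainder,
    so every group has at least k members."""
    if k < 1 or len(padded) < k:
        raise ValueError("need k >= 1 and at least k users")
    unassigned = list(range(len(padded)))
    groups = []
    while len(unassigned) >= 2 * k:
        seed = unassigned.pop(0)
        unassigned.sort(key=lambda u: math.dist(padded[seed], padded[u]))
        groups.append([seed] + unassigned[:k - 1])
        unassigned = unassigned[k - 1:]
    groups.append(unassigned)
    return groups


def homogenize(padded, groups):
    """Step 3: give every user in a group the group's average ratings."""
    n_items = len(padded[0])
    published = [None] * len(padded)
    for group in groups:
        avg = [sum(padded[u][j] for u in group) / len(group)
               for j in range(n_items)]
        for u in group:
            published[u] = list(avg)
    return published


# Hypothetical 4-user, 3-movie rating matrix; None marks an unrated movie.
matrix = [
    [5, None, 1],
    [4, 5, None],
    [None, 1, 5],
    [1, 2, 4],
]
padded = pad_with_item_means(matrix)
groups = cluster_into_groups(padded, k=2)
published = homogenize(padded, groups)
```

After the last step, users in the same group have identical published rows, so each row is indistinguishable from at least k-1 others.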

  23. Predictive Anonymization: “Padded Homogenization” [Figure: bipartite graph of users 0001–0004 and the movies Star Wars, Godfather, English Patient, and Pretty in Pink, with homogenized ratings as edge labels] • All edges are added to the recommendation graph. • Each cluster is averaged using the padded data.

  24. Predictive Anonymization: “Pure Homogenization” [Figure: bipartite graph of users 0001–0004 and the movies Star Wars, Godfather, English Patient, and Pretty in Pink, with homogenized ratings as edge labels] • Only necessary edges are added to the graph. • Each cluster is averaged using the original data.
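The contrast between the two homogenization methods can be made concrete on a single cluster. The sketch below rests on an assumed reading of the slides: padded homogenization averages the padded rows, so every cell is filled, while pure homogenization averages only the original ratings and leaves a movie empty when no cluster member rated it. The data and function names are hypothetical.

```python
def padded_homogenize(padded, groups):
    """Average each cluster over the padded (fully filled) ratings."""
    n_items = len(padded[0])
    out = [None] * len(padded)
    for group in groups:
        avg = [sum(padded[u][j] for u in group) / len(group)
               for j in range(n_items)]
        for u in group:
            out[u] = list(avg)
    return out


def pure_homogenize(original, groups):
    """Average each cluster over original ratings only; a movie that no
    member rated stays empty (None), so no extra edge is added for it."""
    n_items = len(original[0])
    out = [None] * len(original)
    for group in groups:
        row = []
        for j in range(n_items):
            seen = [original[u][j] for u in group
                    if original[u][j] is not None]
            row.append(sum(seen) / len(seen) if seen else None)
        for u in group:
            out[u] = list(row)
    return out


original = [[5, None, None], [4, 3, None]]  # None marks an unrated movie
padded = [[5, 3.0, 2.0], [4, 3, 2.0]]       # hypothetical predicted fills
groups = [[0, 1]]                           # one cluster of size k = 2

print(padded_homogenize(padded, groups)[0])  # every movie gets a rating
print(pure_homogenize(original, groups)[0])  # third movie stays unrated
```

Under this reading, the pure variant keeps the published data sparse, at the cost of dropping the predicted values.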

  25. Outline • Introduction • Privacy in recommender systems • Predictive Anonymization • Experimental results • Conclusions and future work

  26. Experiments • Performed on the Netflix Challenge dataset: 480,189 users and 17,770 movies, with more than 100 million ratings • Singular value decomposition (SVD) is used for padding and prediction. • We compute the root mean squared error (RMSE) for a test set of 1 million ratings on the original and anonymized data. • $\mathrm{RMSE} = \sqrt{\frac{1}{|T|} \sum_{(u,m) \in T} (\hat{r}_{u,m} - r_{u,m})^2}$, where $T$ is the test set, $r_{u,m}$ is the actual rating of movie $m$ by user $u$, and $\hat{r}_{u,m}$ is the predicted rating.
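For reference, RMSE can be computed in a few lines. This is a generic sketch of the standard definition, not the evaluation code used in the experiments; the sample ratings are hypothetical.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between two equal-length rating lists."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))


predicted = [4.0, 3.0, 5.0]  # hypothetical predictions
actual = [5, 3, 3]           # hypothetical test-set ratings
print(rmse(predicted, actual))
```

A lower value means predictions are closer to the held-out ratings, so comparing RMSE on the original and anonymized data indicates how much predictive utility anonymization costs.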

  27. Analysis: Prediction Accuracy • Padded Anonymization preserves prediction accuracy. • However, sparsity is eliminated, which affects the utility of the published dataset for data mining applications.

  28. Summary [Table: Utility, Privacy]

  29. Outline • Introduction • Privacy in recommender systems • Predictive Anonymization • Experimental results • Conclusions and future work

  30. Conclusions • We have formalized privacy and attack models for recommender systems. • Our solutions show that privacy-preserving publishing of anonymized recommendation data is feasible. • More work is required to find a practical solution that satisfies real-world privacy and utility goals.

  31. Future Work • Investigate the use of differential privacy-like guarantees for recommendation databases. • Analyze how to protect against more complex attacks with greater background knowledge. • Evaluate the utility of anonymized recommendation data for advanced data mining applications.

  32. Thank you!
