
You are what you say: Privacy risks of public mentions

Presentation Transcript


  1. You are what you say: Privacy risks of public mentions Dan Frankowski et al., University of Minnesota, SIGIR 2006 Presenter: Chun-Yuan Teng, Natural Language Processing Lab, National Taiwan University

  2. Motivation • “Public data” + “Private data” + “IR Algorithm” = Privacy risk

  3. Example of privacy risk • Privacy risk: linking datasets with overlapping users • “blog” + “purchase history” = “someone” • Ex: mentions of 吳若權 (a Taiwanese author) or 紫微斗數 (Zi Wei Dou Shu, a Chinese astrology system)

  4. Examples of privacy encroachment • People are judged by their preferences • Ratings + mentions of porn on a forum?

  5. Research questions • Risks of dataset release • What are the risks to user privacy when releasing a dataset? • Altering the dataset • How can dataset owners alter the dataset they release to preserve user privacy? • Self defense • How can users protect their own privacy?

  6. Experimental setup • Ratings • Large • 140K users: max 6K ratings per user, average 90, median 33 • 9K movies: max 49K ratings per movie, average 1,403, median 207 • 12.6M ratings in total • Forum mentions • Small • 133 forum posters • 1,685 different movies • 3,828 movie mentions

  7. RQ1: Risks of dataset release • How do we evaluate the risk? • Which algorithms are risky?

  8. K-anonymity & K-identification • K-anonymity (from the data privacy literature) • Sweeney: “A dataset release provides k-anonymity protection if the information for each person contained in the data cannot be distinguished from at least k-1 individuals in the data” • K-identification • K-identification is a measure of how well an algorithm can narrow each user in a dataset down to one of k users in another dataset

  9. K-identification (cont.) • Likely list • u1, s1 • u2, s2 • u3, s3 (t) • u4, s4 • … • Above, t is 3-identified, also 4-identified, 5-identified, etc., but NOT 2-identified • For evaluation, we also know which user in the ratings data is the target user t • t is k-identified if it is at position k or higher on the likely list • In the paper, k = 1, 5, 10, 100. We'll talk about 1-identification, because it's the scariest
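A minimal sketch (not from the paper or the slides) of how k-identification could be checked against a likely list of (user, score) pairs; the user IDs and scores below are hypothetical.

```python
# Minimal sketch: check whether the target user is k-identified, given a
# likely list of (user, score) pairs. The IDs and scores are hypothetical.

def is_k_identified(likely_list, target, k):
    """True if the target sits at position k or higher (1-indexed) when the
    list is sorted by descending score."""
    ranked = sorted(likely_list, key=lambda pair: pair[1], reverse=True)
    for position, (user, _score) in enumerate(ranked, start=1):
        if user == target:
            return position <= k
    return False

likely_list = [("u1", 0.9), ("u2", 0.7), ("u3", 0.5), ("u4", 0.2)]
print(is_k_identified(likely_list, "u3", 3))  # True: u3 is 3-identified
print(is_k_identified(likely_list, "u3", 2))  # False: not 2-identified
```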

  10. An observation of the data • Rarely rated items may be good indicators of identity

  11. Algorithms to identify users • Set Intersection algorithm • TF-IDF algorithm • Scoring algorithm • Scoring algorithm with ratings

  12. Set Intersection algorithm

  13. Set Intersection algorithm • Find users who rated EVERY movie the target user mentioned • They all get the same likeliness score • Rating values are ignored entirely • RESULT: 1-identification rate of 7% • MEANING: 7% of the time there was exactly one user at the top of the likely list, and it was the target user • Room for improvement: for a target user with many mentions, no ratings user may have rated them all, so no candidate is found
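A minimal sketch of the Set Intersection idea as described on this slide: keep only the ratings users who rated every movie the target mentioned (all of them share the same score). The dict-of-sets layout and the toy data are illustrative assumptions.

```python
# Sketch of the Set Intersection algorithm: candidates are ratings users who
# rated EVERY movie the target mentioned; rating values are ignored.
# Data layout (user id -> set of rated movie ids) is an illustrative choice.

def set_intersection_candidates(mentioned_movies, ratings_by_user):
    mentioned = set(mentioned_movies)
    return [user for user, rated in ratings_by_user.items()
            if mentioned <= rated]  # the user rated every mentioned movie

ratings_by_user = {
    "u1": {"A", "B", "C", "D"},
    "u2": {"A", "C"},
    "u3": {"B", "C"},
}
print(set_intersection_candidates(["A", "C"], ratings_by_user))  # ['u1', 'u2']
```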

  14. TF-IDF algorithm • Score each user by similarity to the target user. Score more highly if • The user has rated more of the target's mentions • The user has rated mentions of rarely rated movies • For us: a "word" is a movie, a "document" (bag of words) is a user • Score is cosine similarity to the target user • RESULTS: 1-identification rate of 20% (compared to 7% for Set Intersection) • Room for improvement: over-weights any mention for a ratings user who has rated few movies; high-scoring users had only 4 ratings and 1 mention
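A rough sketch of the TF-IDF variant described here: each movie is treated as a "word", each user's movie set as a "document", and ratings users are ranked by cosine similarity to the target's mentions, with rarely rated movies up-weighted by IDF. The smoothing, the toy data, and the exact weighting are assumptions; the paper's formulation may differ in detail.

```python
import math

# Sketch of the TF-IDF idea: movies are "words", users are "documents";
# score each ratings user by cosine similarity to the target's mentions,
# up-weighting rarely rated movies via IDF. Weighting details are assumed.

ratings_by_user = {              # illustrative toy data
    "u1": {"A", "B", "C", "D"},  # rated many movies, including both mentions
    "u2": {"A", "C"},            # rated only the two mentioned movies
}
mentions = {"A", "C"}            # movies the target mentioned on the forum

def idf(movie):
    raters = sum(1 for rated in ratings_by_user.values() if movie in rated)
    return math.log(len(ratings_by_user) / (1 + raters)) + 1.0  # smoothed IDF

def cosine(movies_a, movies_b):
    dot = sum(idf(m) ** 2 for m in movies_a & movies_b)
    norm_a = math.sqrt(sum(idf(m) ** 2 for m in movies_a))
    norm_b = math.sqrt(sum(idf(m) ** 2 for m in movies_b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

scores = {u: cosine(mentions, rated) for u, rated in ratings_by_user.items()}
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
# u2 outranks u1, illustrating the weakness noted above: a user with very few
# ratings that happen to match the mentions gets an inflated score.
```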

  15. Scoring algorithm • Emphasizes mentions of rarely-rated movies, de-emphasizes the number of ratings a user has • A user who has rated a mention is 10-20 times more likely to be the target user than one who has not

  16. Examples • Target user t mentioned A, B, C, rated 20, 50, and 1,000 times respectively (out of 10,000 users) • User u1 rated A; user u2 rated B and C • u1 score: 0.9981 * 0.05 * 0.05 = 0.0025 • u2 score: 0.05 * 0.9501 * 0.9001 = 0.043 • u2 is more likely to be target t • Rating a mention is good; rating a rarely rated mention is even better
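The same arithmetic as this slide's example, written out as a short sketch. The per-movie factors (0.9981, 0.9501, 0.9001 when a user has rated the mentioned movie, 0.05 when not) are copied from the slide rather than re-derived from the paper's formula.

```python
# Worked example from this slide: each mentioned movie contributes one factor
# to a candidate's score; the factor is high if the candidate rated the movie
# (higher still for rarely rated movies) and low (0.05) if not.
# Factor values are taken from the slide, not re-derived from the paper.

RATED = {"A": 0.9981, "B": 0.9501, "C": 0.9001}  # A rated 20, B 50, C 1,000 times
NOT_RATED = 0.05

def score(rated_movies, mentioned=("A", "B", "C")):
    s = 1.0
    for movie in mentioned:
        s *= RATED[movie] if movie in rated_movies else NOT_RATED
    return s

print(round(score({"A"}), 4))       # u1 rated only A   -> 0.0025
print(round(score({"B", "C"}), 3))  # u2 rated B and C  -> 0.043, more likely t
```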

  17. Scoring algorithm with ratings • The same as the scoring algorithm above • Adds a threshold on the rating value to incorporate the rating feature

  18. Percent of users k-identified

  19. RQ2: Altering the dataset • Perturbation: change rating values • Not helpful here, since rating values are not needed to identify users • Generalization: group items • Makes the dataset less useful • Suppression: hide data • Used in the following experiments

  20. RQ2: Altering the dataset • We won't modify the forum data (users wouldn't like it); focus on the ratings data • Rarely-rated items are identifying • IDEA: Release a ratings dataset that suppresses all "rarely-rated" items
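A minimal sketch of the suppression idea on this slide: before releasing the ratings dataset, drop every rating of any movie that has fewer than some threshold of ratings. The threshold value and the (user, movie, rating) tuple layout are illustrative assumptions.

```python
from collections import Counter

# Sketch of suppression for dataset release: drop all ratings of "rarely
# rated" movies. The threshold and the tuple layout are illustrative.

def suppress_rare_items(ratings, min_ratings=5):
    """ratings: iterable of (user, movie, rating) tuples."""
    counts = Counter(movie for _user, movie, _rating in ratings)
    return [(u, m, r) for u, m, r in ratings if counts[m] >= min_ratings]

ratings = [("u1", "A", 4), ("u2", "A", 5), ("u1", "B", 2)]
print(suppress_rare_items(ratings, min_ratings=2))  # B's rating is suppressed
```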

  21. RQ2: Altering the dataset

  22. RQ3: Self Defense • The question is how users can protect their own privacy • Suppression: suppress mentions of rarely-rated movies • May not be acceptable to users • Misdirection

  23. Suppression • The drop in identification is not significant unless more than 20% of mentions are suppressed

  24. Misdirection • Mentioning popular items is more effective • When a popular item is mentioned, more users increase their score (so the target stands out less)
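A small sketch of why misdirection with popular items helps, using simplified score factors in the spirit of the earlier scoring sketch (a fixed 0.9 for a rated mention, 0.05 otherwise; both values and the toy data are assumptions). When the target mentions a popular movie they have not rated, many other users receive the high factor for that mention and the target no longer stands out.

```python
# Sketch of misdirection: the target's true mention is the rare movie R;
# misdirecting with popular movie P (which the target has NOT rated) boosts
# the relative scores of the many users who did rate P.
# Factor values and data are simplified, illustrative assumptions.

def score(rated_movies, mentioned, rated_f=0.9, not_rated_f=0.05):
    s = 1.0
    for movie in mentioned:
        s *= rated_f if movie in rated_movies else not_rated_f
    return s

users = {"target": {"R"}, "u1": {"P"}, "u2": {"P"}, "u3": {"P", "R"}}
print({u: score(m, {"R"}) for u, m in users.items()})       # target ties for top
print({u: score(m, {"R", "P"}) for u, m in users.items()})  # target now ties for last
```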

  25. Conclusion • A new problem in IR • Interesting and hard • Hard to preserve privacy: a large amount of data must be suppressed
