
You are what you say: Privacy risks of public mentions

Presentation Transcript


  1. You are what you say: Privacy risks of public mentions Dan Frankowski et al., University of Minnesota, SIGIR 2006 Presenter: Chun-Yuan Teng, Natural Language Processing Lab, National Taiwan University

  2. Motivation • “Public data” + “Private data” + “IR Algorithm” = Privacy risk

  3. Example of privacy risk • Privacy risk: linking datasets with overlapping users • “blog” + “purchase history” = “someone” • Ex: mentions of 吳若權 (a Taiwanese author) or 紫微斗數 (Zi Wei Dou Shu, a Chinese astrology system)

  4. Examples of privacy encroachment • People are judged by their preferences • Ratings + mentions of porn on a forum?

  5. Research questions • Risks of dataset release • What are the risks to user privacy when releasing a dataset? • Altering the dataset • How can dataset owners alter the dataset they release to preserve user privacy? • Self defense • How can users protect their own privacy?

  6. Experimental setup • Ratings • Large • 140K users: max 6K ratings per user, average 90, median 33 • 9K movies: max 49K ratings per movie, average 1,403, median 207 • 12.6M ratings in total • Forum mentions • Small • 133 forum posters • 1,685 different movies • 3,828 movie mentions

  7. RQ1: Risks of dataset release • How do we evaluate the risk? • Which algorithms are risky?

  8. K-anonymity & K-identification • K-anonymity (from the data privacy literature) • Sweeney: “A dataset release provides k-anonymity protection if the information for each person contained in the data cannot be distinguished from at least k-1 individuals in the data” • K-identification • K-identification is a measure of how well an algorithm can narrow each user in a dataset down to one of k users in another dataset

  9. K-identification (cont.) • Likely list • u1, s1 • u2, s2 • u3, s3 (t) • u4, s4 • … • Above, t is 3-identified, also 4-identified, 5-identified, etc., but NOT 2-identified • For evaluation, we also know which user in the ratings data is the target user t • t is k-identified if it is at position k or higher on the likely list • In the paper, k = 1, 5, 10, 100. We'll talk about 1-identification, because it's the scariest
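A minimal sketch (not from the paper or the slides) of how k-identification could be checked against a likely list of (user, score) pairs; the user IDs and scores below are hypothetical.

```python
# Minimal sketch: check whether the target user is k-identified, given a
# likely list of (user, score) pairs. The IDs and scores are hypothetical.

def is_k_identified(likely_list, target, k):
    """True if the target sits at position k or higher (1-indexed) when the
    list is sorted by descending score."""
    ranked = sorted(likely_list, key=lambda pair: pair[1], reverse=True)
    for position, (user, _score) in enumerate(ranked, start=1):
        if user == target:
            return position <= k
    return False

likely_list = [("u1", 0.9), ("u2", 0.7), ("u3", 0.5), ("u4", 0.2)]
print(is_k_identified(likely_list, "u3", 3))  # True: u3 is 3-identified
print(is_k_identified(likely_list, "u3", 2))  # False: not 2-identified
```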

  10. An observation of the data • Rarely rated items may be good indicators of identity

  11. Algorithms to identify users • Set Intersection algorithm • TF-IDF algorithm • Scoring algorithm • Scoring algorithm with ratings

  12. Set Intersection algorithm

  13. Set Intersection algorithm • Find users who rated EVERY movie the target user mentioned • They all get the same likeliness score • Rating values are ignored entirely • RESULT: 1-identification rate of 7% • MEANING: 7% of the time there was exactly one user at the top of the likely list, and it was the target user • Room for improvement: for a target user with many mentions, no ratings user may have rated them all, so no candidate is found
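A minimal sketch of the Set Intersection idea as described on this slide: keep only the ratings users who rated every movie the target mentioned (all of them share the same score). The dict-of-sets layout and the toy data are illustrative assumptions.

```python
# Sketch of the Set Intersection algorithm: candidates are ratings users who
# rated EVERY movie the target mentioned; rating values are ignored.
# Data layout (user id -> set of rated movie ids) is an illustrative choice.

def set_intersection_candidates(mentioned_movies, ratings_by_user):
    mentioned = set(mentioned_movies)
    return [user for user, rated in ratings_by_user.items()
            if mentioned <= rated]  # the user rated every mentioned movie

ratings_by_user = {
    "u1": {"A", "B", "C", "D"},
    "u2": {"A", "C"},
    "u3": {"B", "C"},
}
print(set_intersection_candidates(["A", "C"], ratings_by_user))  # ['u1', 'u2']
```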

  14. TF-IDF algorithm • Score each user by similarity to the target user. Score more highly if • The user has rated more of the target's mentions • The user has rated mentions of rarely rated movies • For us: a "word" is a movie, a "document" (bag of words) is a user • Score is cosine similarity to the target user • RESULTS: 1-identification rate of 20% (compared to 7% for Set Intersection) • Room for improvement: over-weights any mention for a ratings user who has rated few movies; high-scoring users had only 4 ratings and 1 mention
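A rough sketch of the TF-IDF variant described here: each movie is treated as a "word", each user's movie set as a "document", and ratings users are ranked by cosine similarity to the target's mentions, with rarely rated movies up-weighted by IDF. The smoothing, the toy data, and the exact weighting are assumptions; the paper's formulation may differ in detail.

```python
import math

# Sketch of the TF-IDF idea: movies are "words", users are "documents";
# score each ratings user by cosine similarity to the target's mentions,
# up-weighting rarely rated movies via IDF. Weighting details are assumed.

ratings_by_user = {              # illustrative toy data
    "u1": {"A", "B", "C", "D"},  # rated many movies, including both mentions
    "u2": {"A", "C"},            # rated only the two mentioned movies
}
mentions = {"A", "C"}            # movies the target mentioned on the forum

def idf(movie):
    raters = sum(1 for rated in ratings_by_user.values() if movie in rated)
    return math.log(len(ratings_by_user) / (1 + raters)) + 1.0  # smoothed IDF

def cosine(movies_a, movies_b):
    dot = sum(idf(m) ** 2 for m in movies_a & movies_b)
    norm_a = math.sqrt(sum(idf(m) ** 2 for m in movies_a))
    norm_b = math.sqrt(sum(idf(m) ** 2 for m in movies_b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

scores = {u: cosine(mentions, rated) for u, rated in ratings_by_user.items()}
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
# u2 outranks u1, illustrating the weakness noted above: a user with very few
# ratings that happen to match the mentions gets an inflated score.
```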

  15. Scoring algorithm • Emphasizes mentions of rarely-rated movies, de-emphasizes the number of ratings a user has • A user who has rated a mention is 10-20 times more likely to be the target user than one who has not

  16. Examples • Target user t mentioned A, B, C, rated 20, 50, and 1,000 times respectively (out of 10,000 users) • User u1 rated A; user u2 rated B and C • u1 score: 0.9981 * 0.05 * 0.05 = 0.0025 • u2 score: 0.05 * 0.9501 * 0.9001 = 0.043 • u2 is more likely to be target t • Rating a mention is good; rating a rarely rated mention is even better
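The same arithmetic as this slide's example, written out as a short sketch. The per-movie factors (0.9981, 0.9501, 0.9001 when a user has rated the mentioned movie, 0.05 when not) are copied from the slide rather than re-derived from the paper's formula.

```python
# Worked example from this slide: each mentioned movie contributes one factor
# to a candidate's score; the factor is high if the candidate rated the movie
# (higher still for rarely rated movies) and low (0.05) if not.
# Factor values are taken from the slide, not re-derived from the paper.

RATED = {"A": 0.9981, "B": 0.9501, "C": 0.9001}  # A rated 20, B 50, C 1,000 times
NOT_RATED = 0.05

def score(rated_movies, mentioned=("A", "B", "C")):
    s = 1.0
    for movie in mentioned:
        s *= RATED[movie] if movie in rated_movies else NOT_RATED
    return s

print(round(score({"A"}), 4))       # u1 rated only A   -> 0.0025
print(round(score({"B", "C"}), 3))  # u2 rated B and C  -> 0.043, more likely t
```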

  17. Scoring algorithm with ratings • The same as the scoring algorithm above • Adds a threshold on the rating value to incorporate the rating feature

  18. Percent of users k-identified

  19. RQ2: Altering the dataset • Perturbation: change rating values • Not helpful here, since rating values are not needed to identify users • Generalization: group items • Makes the dataset less useful • Suppression: hide data • Used in the following experiments

  20. RQ2: Altering the dataset • We won't modify the forum data (users wouldn't like it); focus on the ratings data • Rarely-rated items are identifying • IDEA: Release a ratings dataset that suppresses all "rarely-rated" items
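A minimal sketch of the suppression idea on this slide: before releasing the ratings dataset, drop every rating of any movie that has fewer than some threshold of ratings. The threshold value and the (user, movie, rating) tuple layout are illustrative assumptions.

```python
from collections import Counter

# Sketch of suppression for dataset release: drop all ratings of "rarely
# rated" movies. The threshold and the tuple layout are illustrative.

def suppress_rare_items(ratings, min_ratings=5):
    """ratings: iterable of (user, movie, rating) tuples."""
    counts = Counter(movie for _user, movie, _rating in ratings)
    return [(u, m, r) for u, m, r in ratings if counts[m] >= min_ratings]

ratings = [("u1", "A", 4), ("u2", "A", 5), ("u1", "B", 2)]
print(suppress_rare_items(ratings, min_ratings=2))  # B's rating is suppressed
```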

  21. RQ2: Altering the dataset

  22. RQ3: Self Defense • The question is how users can protect their own privacy • Suppression: suppress mentions of rarely-rated movies • May not be acceptable to users • Misdirection

  23. Suppression • The drop in identification is not significant unless more than 20% of mentions are suppressed

  24. Misdirection • Mentioning popular items is more effective • When a popular item is mentioned, more users increase their score (so the target stands out less)
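A small sketch of why misdirection with popular items helps, using simplified score factors in the spirit of the earlier scoring sketch (a fixed 0.9 for a rated mention, 0.05 otherwise; both values and the toy data are assumptions). When the target mentions a popular movie they have not rated, many other users receive the high factor for that mention and the target no longer stands out.

```python
# Sketch of misdirection: the target's true mention is the rare movie R;
# misdirecting with popular movie P (which the target has NOT rated) boosts
# the relative scores of the many users who did rate P.
# Factor values and data are simplified, illustrative assumptions.

def score(rated_movies, mentioned, rated_f=0.9, not_rated_f=0.05):
    s = 1.0
    for movie in mentioned:
        s *= rated_f if movie in rated_movies else not_rated_f
    return s

users = {"target": {"R"}, "u1": {"P"}, "u2": {"P"}, "u3": {"P", "R"}}
print({u: score(m, {"R"}) for u, m in users.items()})       # target ties for top
print({u: score(m, {"R", "P"}) for u, m in users.items()})  # target now ties for last
```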

  25. Conclusion • A new problem in IR • Interesting and hard • Hard to preserve privacy: a large amount of data must be suppressed
