
You Are What You Say: Privacy Risks of Public Mentions


Presentation Transcript


  1. You Are What You Say: Privacy Risks of Public Mentions Written By: Dan Frankowski, Dan Cosley, Shilad Sen, Loren Terveen, John Riedl Presented by: David Keppel, Jorly Metzger

  2. Table of Contents • Introduction • Related Work • Experiment Setup • Evaluation • Algorithms • Altering The Dataset • Self Defense • Conclusion

  3. Did you ever use the term… • “Did you ever use the term Long Dong Silver in conversation with Professor Hill?” • Clarence Thomas’ confirmation hearing for the U.S. Supreme Court • Video rental history • Not permissible in court (1988 Video Privacy Protection Act)

  4. I wish I could get… • Tom Owad downloaded 260,000 Amazon wish lists • Flagged several “dangerous” books • Amazon wish list contained name, city, and state • Yahoo! PeopleSearch • Found the complete address of one of four wish list owners

  5. Some more examples • Blogs • Forums • Myspace • Netflix • AOL – oops: released users’ search logs • Users could be identified using this data

  6. You Are What You Say • People have several online identities • Each leaves traces in a sparse relation space – the movies, journal articles, or authors you mention • Those traces allow re-identification • Re-identification may lead to other privacy violations, e.g., revealing a name and address • …and to unforeseen consequences

  7. A lot of information about you is out there • Many people reveal their preferences to organizations • Organizations keep people’s preference, purchase, or usage data • Many people believe this information should be private • These datasets usually are private to the organization that collects them

  8. Why do I get all that Spam? • Why doesn’t that information stay private? • Research groups demand data (AOL search logs!) • Pool or trade data for mutual benefit • Government agencies forced to release data • Sell data to other businesses. • Bankrupt businesses may be forced to sell data

  9. Quasi-identifier • Even if obvious identifiers have been removed, a dataset might still contain a uniquely identifying quasi-identifier • 87% of the 248 million people in the 1990 U.S. census were likely to be uniquely identified based only on their 5-digit ZIP, gender, and birth date • Quasi-identifiers can be linked to other databases • Example: the medical records of a former governor of Massachusetts were re-identified by linking public voter registration data to a database of supposedly anonymous medical records sold to industry

  10. Basically… • Dataset 1 – sensitive information (medical history): Name, Zip code, Gender, Birthday, Medical Record • Dataset 2 – identifying information (voter registration data): Name, Zip code, Gender, Birthday • Dataset 3 – privacy compromised (Dataset 1 linked with Dataset 2): Name, Zip code, Gender, Birthday, Medical Record
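
A minimal sketch of the kind of linking shown on the last two slides, using pandas; the frames, columns, and example rows are hypothetical, and the medical table is assumed to carry no names (as in the Sweeney example on the previous slide):

```python
# Sketch of the linking attack: join "anonymous" sensitive data to identified
# public data on the shared quasi-identifier (zip, gender, birthday).
import pandas as pd

medical = pd.DataFrame({                       # sensitive, no direct identifier
    "zip": ["02138", "55455"],
    "gender": ["F", "M"],
    "birthday": ["1945-07-31", "1970-01-02"],
    "diagnosis": ["hypertension", "asthma"],
})
voters = pd.DataFrame({                        # public, identified
    "name": ["Jane Doe", "John Smith"],
    "zip": ["02138", "55455"],
    "gender": ["F", "M"],
    "birthday": ["1945-07-31", "1970-01-02"],
})

# Joining on the quasi-identifier re-attaches names to the medical records.
linked = medical.merge(voters, on=["zip", "gender", "birthday"])
print(linked[["name", "diagnosis"]])
```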

  11. What the research paper proposes • Re-identification of users from a public web movie forum in a private movie-ratings dataset • Three major results: • Algorithms for re-identifying users • An evaluation of whether private dataset owners can protect user privacy by hiding data • An evaluation of two methods for users in a public forum to protect their own privacy

  12. The usefulness of re-identification • The importance of re-identification • amount of data available electronically is increasing rapidly - IR techniques can be applied. • Serious privacy risks for users • Re-identification may prove valuable • Identifying shills • even fighting terrorism!

  13. Linking People in Sparse Relation Spaces • sparse relation spaces • Purchase data, online music player, Wikipedia • differ from traditional databases • Identified vs. Non-identified datasets • Accessible vs inaccessible datasets • Amazon might re-identify customers on competitors’ websites by comparing their purchase history against reviews written on those sites, and decide to market (or withhold) special offers from them.

  14. Burning Questions • RISKS OF DATASET RELEASE: What are the risks to user privacy when releasing a dataset? • ALTERING THE DATASET: How can dataset owners alter the dataset they release to preserve user privacy? • SELF DEFENSE: How can users protect their own privacy?

  15. Related Work • Studies ([1], [18]) show that a large majority of internet users are concerned about their privacy • Opinion mining: Novak et al. [10] investigated re-identifying multiple aliases of a user in a forum based on general properties of their post text • The authors suggest that marrying their algorithms to opinion-mining methods would improve their ability to re-identify people

  16. Related Work (cont.) • Identified a number of ways to modify data to preserve privacy • perturbing attribute values by adding random noise (Agrawal et al. [2]) • Techniques for preserving k-anonymity: (Sweeney [17]) • suppression (hiding data) • generalization (reducing the fidelity of attribute values).

  17. K-identification • K-anonymity: "A [dataset] release provides k-anonymity protection if the information for each person contained in the release cannot be distinguished from at least k-1 individuals whose information also appears in the release." • K-identification: the analogue of k-anonymity across 2 datasets • A measure of how well an algorithm can narrow each user in a dataset to one of k users in another dataset • If k is large, or if k is small and the k-identification rate is low, users can plausibly deny being identified

  18. Experiment Setup • Offline experiments using two sparse relation spaces, both drawn from a snapshot of the MovieLens database (January 2006) • Show that re-identification of users is possible by combining information from both datasets • Information available: • a set of movie ratings • a forum for referencing movies

  19. Experiment Setup (cont.) • MovieLens movie recommender: • a set of movie ratings – assigned by a user • a set of movie mentions – derived from the forum

  20. Experiment Dataset • Drawn from posts in the MovieLens forums and from the MovieLens users’ dataset of movie ratings • Dataset includes: • 12,565,530 ratings • 140,132 users • 8,957 items • Users can make movie references while posting to the forum • Manual • Automated

  21. The power law • A typical and important feature of real-world sparse relation spaces • (Review) A sparse relation space: A) relates people to items; B) is sparse, having relatively few relationships recorded per person; C) has a large space of items • The data roughly follow a power law, both in the ratings dataset (the distribution of ratings per user) and in the mentions dataset
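
A quick way to eyeball the power-law claim on a ratings dump is a log-log plot of the ratings-per-user distribution; this is only a sketch, and the file name and user_id column are assumptions:

```python
# A rough check of the power-law claim: plot the ratings-per-user distribution
# on log-log axes. File name and column name are assumptions for illustration.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

ratings = pd.read_csv("ratings.csv")            # assumed: one row per rating
counts = ratings["user_id"].value_counts()      # number of ratings per user

values, freqs = np.unique(counts.to_numpy(), return_counts=True)
plt.loglog(values, freqs, marker=".", linestyle="none")
plt.xlabel("ratings per user")
plt.ylabel("number of users")
plt.title("Approximately linear on log-log axes suggests a power law")
plt.show()
```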

  22. Binning Strategy • The bins contain similar numbers of users and have intuitive meaning • Hypothesize that identifying a user depends on the number of mentions: • Users with more mentions disclose more information.

  23. Experiment Overview • Objective • Evaluation Criteria • Re-Identification Algorithms • Altering the Dataset • Self-Defense

  24. Objective • Take a user from the public dataset of mentions and attempt to re-identify them within the private dataset of ratings • Example (from the slide’s diagram): forum user jmetzger mentions movie2, movie6, and movie4; the candidate ratings users are member43 (movie62, movie12), member65 (movie4, movie2, movie6, movie15), and member21 (movie4, movie95, movie6) – which one is jmetzger?

  25. Evaluation Criteria: Overview • 133 users selected from the public mentions dataset to be target users; each target user has at least one mention • Users to be re-identified reside in the private ratings dataset • K-identification will be evaluated for k = 1, 5, 10, and 100

  26. Evaluation Criteria: K-Identification • K-identification measures how well an algorithm can narrow each user in a dataset to one of k users in another dataset • Let t be the target user and j the rank at which the algorithm returns t’s true ratings account; then t is k-identified for all k ≥ j (for ties involving t, j is the highest rank among the tied users) • The k-identification rate is the fraction of k-identified target users • Example: t = jmetzger, Mt = {movie4, movie10, movie12}; reIdentAlg(Mt) returns a list of ratings users ordered by likelihood of being t • If jmetzger’s account (Member45) is returned at rank 4, then t is 4-identified and 5-identified • If jmetzger’s account is returned in a tie at rank 3 (Member9 tied with member94), then t is 3-identified, 4-identified, and 5-identified

  27. Evaluation Criteria: K-Identification Rate • Evaluate for K = 1, 2, 4 • Four target users selected from a public dataset: t1 = Member4, t2 = Member1, t3 = Member5, t4 = Member8 • K = 1: K-identification rate = 2 / 4 = 50% • K = 2: K-identification rate = 2 / 4 = 50% • K = 4: K-identification rate = 3 / 4 = 75% (see the sketch below)
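
A minimal sketch of the k-identification rate computation from the two slides above; the ranks in the example are hypothetical values chosen only to be consistent with the 2/4, 2/4, 3/4 figures, not the actual ranks from the study:

```python
# Sketch: compute k-identification rates from the rank of each target's true
# account in the algorithm's output (ties resolved to the highest tied rank).
def k_identification_rate(target_ranks, k):
    """Fraction of targets whose true account is ranked within the top k."""
    hits = sum(1 for rank in target_ranks if rank is not None and rank <= k)
    return hits / len(target_ranks)

# Hypothetical ranks, chosen only to reproduce the slide's 2/4, 2/4, 3/4
# example (two targets at rank 1, one at rank 3, one outside the top 4).
ranks = [1, 1, 3, 9]
for k in (1, 2, 4):
    print(f"k = {k}: rate = {k_identification_rate(ranks, k):.0%}")
```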

  28. Algorithms: Set Intersection – Basic Concept • Finds all users in the ratings dataset who rated every item mentioned by the target user t • Each returned user gets the same likeness score • The actual rating values given by users are ignored
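
A sketch of the Set Intersection idea under assumed data structures (sets of item ids keyed by user id); it is not the authors' code:

```python
# Sketch of Set Intersection: keep every ratings user whose rated-item set
# contains all of the target's mentioned items; all matches score equally.
def set_intersection(target_mentions, rated_items_by_user):
    """target_mentions: set of item ids the target mentioned in the forum.
    rated_items_by_user: {ratings_user_id: set of item ids that user rated}."""
    return [
        user for user, rated in rated_items_by_user.items()
        if target_mentions <= rated          # user rated every mentioned item
    ]

# Using the example from the Objective slide: only member65 rated all three.
ratings = {
    "member43": {"movie62", "movie12"},
    "member65": {"movie4", "movie2", "movie6", "movie15"},
    "member21": {"movie4", "movie95", "movie6"},
}
print(set_intersection({"movie2", "movie6", "movie4"}, ratings))   # ['member65']
```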

  29. Algorithms: Set Intersection – Evaluation • Behavior at 1-identification • Failure scenarios: • “Not Narrow” – more than k users matched • “No One Possible” – no users matched • “Misdirected” – users found, but none matched target user t

  30. Algorithms: TF-IDF – Basic Concept • Can still rank candidates when the target mentions items they have not rated (a case where Set Intersection fails) • Desired properties: • Users who rate more of the mentions score higher – concerned with ratings users that contain the most mentions • Users who rate rarely rated mentioned movies score higher than users who rate commonly rated mentioned movies – concerned with the number of ratings users that have a particular mention

  31. Algorithms: TF-IDF – Formula • Notation: t = target user, u = a user, m = a movie, U = the set of all users • Term frequency tf_um for mentions users: 1 if user u mentioned m, 0 otherwise • Term frequency tf_um for ratings users: 1 if user u rated m, 0 otherwise • Weight: $w_{um} = tf_{um} \cdot \log_2 \frac{|U|}{|\{u' \in U \text{ who rated } m\}|}$ • Similarity: $\mathrm{sim}(t,u) = \frac{w_t \cdot w_u}{\lVert w_t \rVert \, \lVert w_u \rVert}$
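
A sketch of the TF-IDF ranking implied by the formula above, under an assumed data layout (a set of rated items per ratings user); the function name tfidf_rank is ours:

```python
# Sketch of TF-IDF re-identification: binary term frequencies,
# idf = log2(|U| / #raters), candidates ordered by cosine similarity
# to the target's mention vector.
import math

def tfidf_rank(target_mentions, rated_items_by_user):
    n_users = len(rated_items_by_user)
    raters = {}                                   # item -> number of users who rated it
    for items in rated_items_by_user.values():
        for m in items:
            raters[m] = raters.get(m, 0) + 1

    def idf(m):
        return math.log2(n_users / raters.get(m, 1))

    w_t = {m: idf(m) for m in target_mentions}    # tf = 1 for mentioned items
    norm_t = math.sqrt(sum(w * w for w in w_t.values()))

    scores = {}
    for user, items in rated_items_by_user.items():
        w_u = {m: idf(m) for m in items}          # tf = 1 for rated items
        dot = sum(w_t[m] * w_u[m] for m in target_mentions if m in w_u)
        norm_u = math.sqrt(sum(w * w for w in w_u.values()))
        scores[user] = dot / (norm_t * norm_u) if norm_t and norm_u else 0.0
    return sorted(scores, key=scores.get, reverse=True)
```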

  32. Algorithms: TF-IDF – Evaluation • Better performance than Set Intersection • Over-weighted any mention for a ratings user who had rated few movies

  33. Algorithms: Scoring – Basic Concept • Emphasizes mentions of rarely rated movies • De-emphasizes the number of ratings a user has • Assuming that scores are separable, sub-scores are calculated for each mention and then multiplied to get an overall score

  34. Algorithms: Scoring – Formula • Sub-score: $ss(u,m) = 1 - \frac{|\{u' \in U \text{ who rated } m\}|}{|U|}$ if u rated m; $ss(u,m) = 0.05$ otherwise • Score: $s(u,t) = \prod_{m_i \in T} ss(u, m_i)$, where T is the set of the target’s mentions • The sub-score ss(u,m) gives more weight to rarely rated movies • Users who rated more than 1/3 of the movies were discarded (12 users total)
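
A short sketch of the product-of-sub-scores computation as reconstructed above; the data structures (dicts and sets) are assumptions:

```python
# Sketch of the Scoring algorithm: multiply per-mention sub-scores; rarely
# rated mentions contribute close to 1, unrated mentions a flat 0.05.
def score(target_mentions, user_rated, raters_per_item, n_users):
    """target_mentions: items the target mentioned; user_rated: items this
    ratings user rated; raters_per_item[m]: number of users who rated m."""
    s = 1.0
    for m in target_mentions:
        if m in user_rated:
            s *= 1.0 - raters_per_item.get(m, 0) / n_users
        else:
            s *= 0.05
    return s
```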

  35. Algorithms: Scoring – Evaluation • Outperformed TF-IDF • Including the heaviest-rating users reduced 1-identification performance • Using a flat sub-score of 1 for rated mentions reduced 1-identification performance

  36. Algorithms: Scoring – With Ratings • The mined rating value is used to restrict the scoring algorithm • r(t,m) = rating given by target user t for mention m; r(u,m) = rating given by ratings user u for m • Sub-score: $ss(u,m) = 1 - \frac{|\{u' \in U \text{ who rated } m\}|}{|U|}$ if u rated m and $|r(u,m) - r(t,m)| \le \delta$; $ss(u,m) = 0.05$ otherwise • Exact rating: $\delta = 0$ • Fuzzy rating: $\delta = 1$
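
The ratings-restricted variant only changes the sub-score; a sketch under the same assumptions, with delta = 0 for exact matching and delta = 1 for fuzzy matching:

```python
# Sketch of the ratings-restricted sub-score: the rarity-based value is awarded
# only when the user's rating is within delta of the target's mined rating.
def sub_score_with_rating(u_rating, t_rating, n_raters, n_users, delta=1):
    if u_rating is not None and t_rating is not None \
            and abs(u_rating - t_rating) <= delta:
        return 1.0 - n_raters / n_users          # rated m with a close-enough rating
    return 0.05                                  # unrated, or rating too far off
```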

  37. Algorithms: Comparison

  38. Altering the Dataset – Overview • Question: How can dataset owners alter the dataset they release to preserve user privacy? • Suggestions: • Perturbation • Generalization • Suppression

  39. Altering the Dataset – Suppression • Drop movies that are rarely rated • Drop ratings of items that have fewer ratings than a specified threshold
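
A minimal sketch of owner-side suppression under an assumed pandas layout with an item_id column; items with fewer ratings than the threshold are dropped entirely:

```python
# Sketch of owner-side suppression: release only ratings of items that have
# at least `threshold` ratings overall.
import pandas as pd

def suppress_rare_items(ratings: pd.DataFrame, threshold: int) -> pd.DataFrame:
    counts = ratings["item_id"].value_counts()
    common_items = counts[counts >= threshold].index
    return ratings[ratings["item_id"].isin(common_items)]
```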

  40. Self Defense – Overview • Question: How can users protect their own privacy? • Suggestions: • Suppression • Misdirection

  41. Self Defense – Suppression • The same behavior seen in Altering the Dataset holds here • Workaround: • List the items that have been both mentioned and rated • Order the items by how many times each has been rated (rarest first) • Suppress mentions of only the top portion of the list
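
A sketch of this workaround; the fraction parameter (how much of the rarest-first list to suppress) is an assumption, since the slide only says "top portion":

```python
# Sketch of the user-side workaround: among items the user both mentioned and
# rated, suppress the rarest-rated mentions first.
def mentions_to_suppress(mentioned, rated, raters_per_item, fraction=0.5):
    overlap = [m for m in mentioned if m in rated]
    overlap.sort(key=lambda m: raters_per_item.get(m, 0))   # rarest first
    cutoff = int(len(overlap) * fraction)
    return overlap[:cutoff]          # suppress these mentions from public posts
```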

  42. Self Defense – Misdirection • A user intentionally mentions items they have not rated • Procedure: • A misdirection item list is created: choose items with more ratings than a threshold and order them by increasing or by decreasing popularity • Thresholds vary from 1 to 8192, in powers of 2 • Each user takes the first item from the list that they have not rated and mentions it • K-identification is re-computed • This is repeated for each K-identification level
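
A sketch of one step of the misdirection procedure under assumed data structures, where raters_per_item maps each item to its number of ratings:

```python
# Sketch of one misdirection step: from items with more ratings than the
# threshold, ordered by increasing (or decreasing) popularity, each user
# mentions the first item they have not actually rated.
def misdirection_item(user_rated, raters_per_item, threshold, increasing=True):
    candidates = [m for m, n in raters_per_item.items() if n > threshold]
    candidates.sort(key=lambda m: raters_per_item[m], reverse=not increasing)
    for m in candidates:
        if m not in user_rated:
            return m                 # the item this user will falsely mention
    return None                      # nothing left to misdirect with
```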

  43. Conclusions • Re-identification in a sparse relation space can violate privacy • Relationships to items in a sparse relation space can be a quasi-identifier • As prevention, suppression of datasets is impractical • User-level misdirection does provide some anonymity at a fairly low cost

  44. Future Work • How to determine that a user in one sparse dataset exists in another sparse dataset • Design a re-identifying algorithm that ignores the most popular mentions entirely • Construct an Intelligent Interface that helps people manage their privacy • If people were convinced to intentionally misdirect data, how would this change the nature of public discourse in sparse relation spaces?

  45. Critiques • Explain how user matches were determined • In TF-IDF Algorithm, notation was not clear in differentiating users for weights • Graphs were very useful in understanding behavior of algorithms

  46. References • [1] Ackerman, M. S., Cranor, L. F., and Reagle, J. 1999. Privacy in e-commerce: examining user scenarios and privacy preferences. In Proc. EC99, pp. 1-8. • [2] Agrawal, R. and Srikant, R. 2000. Privacy-preserving data mining. In Proc. SIGMOD00, pp. 439-450. • [3] Berkovsky, S., Eytani, Y., Kuflik, T., and Ricci, R. 2005. Privacy-Enhanced Collaborative Filtering. In Proc. User Modeling Workshop on Privacy-Enhanced Personalization. • [4] Canny, J. 2002. Collaborative filtering with privacy via factor analysis. In Proc. SIGIR02, pp. 238-245. • [5] Dave, K., Lawrence, S., and Pennock, D. M. 2003. Mining the peanut gallery: opinion extraction and semantic classi-fication of product reviews. In Proc. WWW03, pp. 519-528. • [6] Drenner, S., Harper, M., Frankowski, D., Terveen, L., and Riedl, J. 2006. Insert Movie Reference Here: A System to Bridge Conversation and Item-Oriented Web Sites. Accepted for Proc. CHI06. • [7] Evfimievski, A., Srikant, R., Agrawal, R., and Gehrke, J. 2002. Privacy preserving mining of association rules. In Proc. KDD02, pp. 217-228. • [8] Hong, J.I. and J.A. Landay. An Architecture for Privacy- Sensitive Ubiquitous Computing. In Mobisys04 pp. 177- • [9] Lam, S.K. and Riedl, J. 2004. Shilling recommender systems for fun and profit. In Proc. WWW04, pp. 393-402.

  47. References (cont.) • [10] Novak, J., Raghavan, P., and Tomkins, A. 2004. Anti-aliasing on the Web. In Proc. WWW04, pp. 30-39. • [11] Pang, B., Lee, L., and Vaithyanathan, S. 2002. Thumbs up? Sentiment classification using machine learning techniques. In Proc. Empirical Methods in NLP, pp. 79-86. • [12] Polat, H., Du, W. 2003. Privacy-Preserving Collaborative Filtering Using Randomized Perturbation Techniques. In Proc. ICDM03, p. 625. • [13] Ramakrishnan, N., Keller, B. J., Mirza, B. J., Grama, A. and Karypis, G. 2001. Privacy Risks in Recommender Systems. IEEE Internet Computing 5(6):54-62. • [14] Rizvi, S., and Haritsa, J. 2002. Maintaining Privacy in Association Rule Mining. In Proc. VLDB02, pp. 682- • [15] Sarwar, B. M., Karypis, G., Konstan, J.A., and Riedl, J. 2001. Item-based collaborative filtering recommendation algorithms. In Proc. WWW01. • [16] Sweeney, L. 2002. Achieving k-anonymity privacy protection using generalization and suppression. Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 10(5):571-588. • [17] Sweeney, L. 2002. k-anonymity: a model for protecting privacy. Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 10(5):557-570. • [18] Taylor, H. 2003. Most People Are “Privacy Pragmatists.” Harris Poll #17. Harris Interactive (March 19, 2003). • [19] Terveen, L., et al. 1997. PHOAKS: a system for sharing recommendations. CACM 40(3):59-62. • [20] Verykios, V. S., et al. 2004. State-of-the-art in privacy preserving data mining. SIGMOD Rec. 33(1):50-57.
