
Publishing Search Query logs

Publishing Search Query Logs. CompSci 590.03. Instructor: Ashwin Machanavajjhala. Outline: uses of search query logs; privacy and search logs; K-anonymity; differentially private algorithms. Search query log: <anonid, query, querytime, itemrank, clickurl>.





Presentation Transcript


  1. Publishing Search Query Logs • CompSci 590.03 • Instructor: Ashwin Machanavajjhala • Lecture 17: 590.03 Fall 12

  2. Outline • Uses of search query logs • Privacy and search logs • K-Anonymity • Differentially private algorithms

  3. Search Query Log • <anonid, query, querytime, itemrank, clickurl>
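To make the five-field record concrete, here is a minimal sketch of reading one log line into a typed record. The field names come from the slide; the tab delimiter, the timestamp format, the `LogRecord` and `parse_record` names, and the convention that the last two fields are empty when no result was clicked are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LogRecord:
    anonid: str
    query: str
    querytime: str
    itemrank: Optional[int]   # None when the user did not click a result
    clickurl: Optional[str]   # None when the user did not click a result


def parse_record(line: str) -> LogRecord:
    """Parse one tab-separated log line (the delimiter is an assumption)."""
    fields = line.rstrip("\n").split("\t")
    fields += [""] * (5 - len(fields))  # pad missing click fields
    anonid, query, querytime, itemrank, clickurl = fields[:5]
    return LogRecord(
        anonid,
        query,
        querytime,
        int(itemrank) if itemrank else None,
        clickurl or None,
    )


# Hypothetical example lines, not taken from any real log.
clicked = parse_record("42\tflu symptoms\t2006-03-01 10:15:00\t1\thttp://example.org")
no_click = parse_record("42\tflu symptoms\t2006-03-01 10:14:00")
```

Note that even without names or IP addresses, every record for one `anonid` can be grouped together, which is what the privacy discussion later in the lecture builds on.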

  4. Uses of search query logs • Search result caching • Query Recommendation • Synonym identification • Reranking search results • Search advertising and Keyword popularity estimation

  5. Uses of search query logs [Silvestri FnT ‘10]

  6. Google Flu [Ginsberg Nature ‘09] “We've found that certain search terms are good indicators of flu activity. Google Flu Trends uses aggregated Google search data to estimate current flu activity around the world in near real-time.” http://www.google.org/flutrends/ • Predictions by Google Flu are 1-2 weeks ahead of CDC’s ILI (Influenza-like illness) surveillance reports

  7. Google Flu [Ginsberg Nature ‘09] • Identify single ILI-related search queries that could most accurately model the CDC ILI visit percentages in 9 regions • P: probability of an ILI-related physician visit in a region (based on CDC data) • Q: ILI-related query fraction • Pick the 45 highest-scoring queries, and fit a linear model to predict ILI visit rates.
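The slide only says "fit a linear model" of P on Q. A minimal sketch of that step follows, assuming the fit is done on the log-odds scale, logit(P) = β0 + β1 · logit(Q), which is the form reported by Ginsberg et al. The data points are made up for illustration, and `fit_line` and `predict_visit_rate` are hypothetical helper names.

```python
import math


def logit(p: float) -> float:
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1


def predict_visit_rate(q: float, b0: float, b1: float) -> float:
    """Map a query fraction Q to an estimated ILI visit probability P."""
    z = b0 + b1 * logit(q)
    return 1.0 / (1.0 + math.exp(-z))


# Made-up weekly observations: ILI-related query fraction and CDC visit rate.
query_fraction = [0.001, 0.002, 0.004, 0.008]
visit_rate = [0.010, 0.015, 0.025, 0.040]

b0, b1 = fit_line([logit(q) for q in query_fraction],
                  [logit(p) for p in visit_rate])
estimate = predict_visit_rate(0.003, b0, b1)
```

Because the model has a single explanatory variable, the aggregate query fraction over the selected queries, it can be refit cheaply whenever new CDC data arrives.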

  8. Google Flu [Ginsberg Nature ‘09]

  9. Flu Trends [figure: Google Flu estimate vs. national data, for the U.S. and Australia]

  10. Outline • Uses of search query logs • Privacy and search logs • K-Anonymity • Differentially private algorithms

  11. Privacy and Search Logs [NYTimes 2006]

  12. Sensitive Information [Chen et al FnT ‘09] • Obtain sensitive information directly from queries (user1) • Identifying users via demographic attributes (user2) • Identifying users by following URLs (user3) • Identification leads to learning sensitive queries (user2)

  13. Challenges • Not clear which queries are identifying and which queries are sensitive • Users' queries are almost always unique • Adversaries may launch active attacks, e.g., create many queries from different accounts to test whether some user searched for sensitive queries.

  14. Outline • Uses of search query logs • Privacy and search logs • Privacy-Enhancing Techniques • Differentially private algorithms

  15. Identifier Deletion • Delete personally identifying information like IP addresses and cookies, names, social security numbers … • Even if we remove age, gender, zip code from search logs, one can estimate these from the remaining log [Jones et al ‘07]

  16. Hashing Queries • Replace queries with hash values • One can estimate the words based on co-occurrence analysis if token-based hashing schemes are used [Kumar et al WWW ‘07] • Utility is lost …
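A minimal sketch of token-based hashing shows why co-occurrence analysis works against it: each word is hashed independently, so the same word always maps to the same token and word frequencies and co-occurrences survive. The salt value, the 8-character truncation, and the `hash_token` and `hash_query` names are assumptions for illustration, not the scheme analysed by Kumar et al.

```python
import hashlib


def hash_token(token: str, salt: str = "secret-salt") -> str:
    """Hash a single word; the same word always yields the same output."""
    digest = hashlib.sha256((salt + token).encode("utf-8")).hexdigest()
    return digest[:8]  # truncated only to keep the example readable


def hash_query(query: str) -> str:
    """Token-based hashing: hash each whitespace-separated word separately."""
    return " ".join(hash_token(t) for t in query.lower().split())


a = hash_query("flu symptoms")
b = hash_query("flu vaccine")

# The hashed token for "flu" is identical in both queries, so an attacker
# with an unhashed reference corpus can match tokens by how often they occur
# and which other tokens they occur with.
shared_first_token = a.split()[0] == b.split()[0]
```

Hashing the whole query string instead would hide the shared word, but then queries could no longer be compared or grouped by content at all, which is the utility loss the slide mentions.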

  17. K-anonymity and deleting infrequent queries [Adar WWW ‘07] • Unlikely that many people search for a specific individual’s identifiers. • Algorithm: Suppress all queries which are posed by at most K users. • … But a combination of frequent queries can still identify an individual … • Solution: Split a user's log into smaller ones … based on query sessions • A query session is a set of queries that are related to each other.
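The suppression rule on this slide can be sketched as follows: count the distinct users who posed each query and drop every query posed by at most K users. The `(user, query)` tuple layout and the `suppress_infrequent` name are assumptions for illustration; session splitting is not shown.

```python
from collections import defaultdict


def suppress_infrequent(log, k):
    """Keep only records whose query was posed by more than k distinct users."""
    users_per_query = defaultdict(set)
    for user, query in log:
        users_per_query[query].add(user)
    return [(user, query) for user, query in log
            if len(users_per_query[query]) > k]


# Hypothetical log: "weather" is posed by three users, the address by one.
log = [
    ("u1", "weather"),
    ("u2", "weather"),
    ("u3", "weather"),
    ("u1", "jane doe 123 main st"),
    ("u1", "weather"),  # a repeat by the same user does not add a new user
]
kept = suppress_infrequent(log, k=2)
```

Counting distinct users rather than raw occurrences matters: a single user repeating an identifying query many times should not make it look frequent.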

  18. TrackMeNot [Howe & Nissenbaum ‘08] • Users send noise queries in addition to real queries • TrackMeNot is a browser plugin which posts queries to search engines • Problems: • Distribution of noisy queries is different from distribution of actual queries … so noise can be removed • Imposes load on the search engine • Query log loses utility …

  19. Outline • Uses of search query logs • Privacy and search logs • Privacy-Enhancing Techniques • Differentially private algorithms

  20. Differential Privacy and Search Logs • Consider two databases that differ in the log of one user • In the worst case all queries are by the same user, so sensitivity = |query log| • Noise calibrated to that sensitivity can guarantee no utility! • Remedy: pick at most m queries from each user

  21. Differential Privacy and Search Logs • Domain of search terms is very large. Hence no differentially private algorithm is “useful”. • Consider the problem of publishing all queries posted by at least τ users each. • Theorem [Gotz et al TKDE 2012]: For a sufficiently large domain size, the accuracy of any differentially private algorithm is worse than that of an algorithm that always returns the empty set!

  22. Probabilistic Differential Privacy • For every pair of inputs D1, D2 that differ in one value, and for every probable output O: $\Pr[D_1 \to O] \,/\, \Pr[D_2 \to O] < e^{\varepsilon}$ • Adversary may distinguish between D1 and D2 based on a set of unlikely outputs with probability at most δ • Formally: $\Pr\big[\, O \;:\; \Pr[D_1 \to O] \,/\, \Pr[D_2 \to O] < e^{\varepsilon} \,\big] > 1 - \delta$

  23. Publishing Frequent queries/clicks [Korolova et al WWW 2009, Gotz et al TKDE 2012]

  24. Privacy • The algorithm presented in the previous slide guarantees (ε,δ)-probabilistic differential privacy if [condition given as an equation on the slide; not preserved in this transcript] • where U is the number of users, m is the maximum number of queries per user, λ is the Laplace noise parameter, and τ, τ’ are the two thresholds used by the algorithm
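The algorithm itself appears only as a figure on slide 23. A minimal sketch consistent with the parameters named here (m, λ, τ, τ’), and along the lines of the two-threshold approach in Gotz et al., might look like the following. The order of the steps, the choice of each user's first m distinct queries, and the `laplace` and `publish_frequent` names are assumptions; the sketch does not check the privacy condition relating ε, δ, U, m, λ, τ and τ’.

```python
import random
from collections import Counter


def laplace(scale: float, rng: random.Random) -> float:
    """Sample Lap(scale) as the difference of two exponential variables."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def publish_frequent(user_logs, m, tau, tau_prime, lam, rng):
    """Return {query: noisy count} for queries that pass both thresholds."""
    counts = Counter()
    for queries in user_logs.values():
        # Assumption: keep each user's first m distinct queries.
        distinct = list(dict.fromkeys(queries))[:m]
        counts.update(distinct)

    published = {}
    for query, count in counts.items():
        if count < tau:            # first threshold, on the true count
            continue
        noisy = count + laplace(lam, rng)
        if noisy > tau_prime:      # second threshold, on the noisy count
            published[query] = noisy
    return published


# Hypothetical data: 100 users search "weather"; one also poses a rare query.
user_logs = {f"u{i}": ["weather"] for i in range(100)}
user_logs["u0"] = ["weather", "jane doe 123 main st"]

released = publish_frequent(user_logs, m=2, tau=5, tau_prime=10,
                            lam=1.0, rng=random.Random(0))
```

The first threshold is what keeps rare, potentially identifying queries out of the output regardless of the noise drawn; the Laplace noise and second threshold then hide whether any one user contributed to a published count.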

  25. Utility • Let ξ = (τ’ − τ)/3, and let τ* = τ + ξ • Any query that appears with frequency < τ* − ξ … has frequency less than τ, and is published in the output with probability 0 • Any query that appears with frequency > τ* + ξ … is published if τ* + ξ + Lap(λ) > τ’ • That is, noise > ξ • That is, query is published with probability $1 - 0.5\,e^{-\xi/\lambda}$.

  26. Utility [Gotz et al TKDE 2012] • Distributions are significantly different

  27. Web Caching scenario [Gotz et al TKDE 2012] • Speed up web search by storing the results for the most frequent queries. • Each keyword is given a score based on frequency in the (anonymous) log. • Top few keywords are maintained in memory …
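As a rough sketch of this caching scenario, the published (noisy) counts can serve directly as keyword scores, with results precomputed for the top few. The `build_cache` and `fake_search` names, the cache capacity, and the sample scores are hypothetical; the slide does not specify how scores are computed beyond being frequency-based.

```python
from collections import Counter


def build_cache(keyword_scores, search, capacity):
    """Precompute results for the `capacity` highest-scoring keywords."""
    top = Counter(keyword_scores).most_common(capacity)
    return {keyword: search(keyword) for keyword, _ in top}


def fake_search(keyword):
    """Stand-in for the real search backend."""
    return [f"result for {keyword}"]


# Hypothetical scores taken from an anonymized log.
keyword_scores = {"weather": 980.2, "news": 512.7, "flu symptoms": 44.1}
cache = build_cache(keyword_scores, fake_search, capacity=2)
```

Since the cache needs only the ranking of the most frequent keywords, it is a natural application for a release that discards tail queries.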

  28. Summary • Publishing search logs can lead to very useful applications • Web • Social Science • … • Logs contain very sensitive information, and individuals are easily identifiable. • Simple techniques do not provide sufficient protection • Differentially private techniques throw away a significant amount of data • Only m queries per person • All tail queries (with low frequency) are thrown away

  29. References • F. Silvestri, “Mining Query Logs: Turning Search Usage Data into Knowledge”, Foundations and Trends 4 (1-2), 2010 • J. Ginsberg, M. Mohebbi, R. Patel, L. Brammer, M. Smolinski, L. Brilliant, “Detecting influenza epidemics using search engine query data”, Nature, vol. 457, Feb 2009 • B.-C. Chen, D. Kifer, K. LeFevre, A. Machanavajjhala, “Privacy-Preserving Data Publishing”, Foundations and Trends in Databases, Vol. 2, No. 1-2, pp. 1-167, 2009 • R. Jones, R. Kumar, B. Pang, A. Tomkins, “I know what you did last summer: query logs and user privacy”, CIKM 2007 • R. Kumar, J. Novak, B. Pang, A. Tomkins, “On anonymizing query logs via token-based hashing”, WWW 2007 • E. Adar, “User 4xxxx9: Anonymizing query logs”, WWW 2007 • D. Howe, H. Nissenbaum, “TrackMeNot: Resisting surveillance in web search”, 2008 • A. Korolova, K. Kenthapadi, N. Mishra, A. Ntoulas, “Releasing Search Queries and Clicks Privately”, WWW 2009 • M. Gotz, A. Machanavajjhala, G. Wang, X. Xiao, J. Gehrke, “Publishing Search Logs”, TKDE 2012
