
Privacy Preserving Data Mining Lecture 3 Non-Cryptographic Approaches for Preserving Privacy


Presentation Transcript


  1. Privacy Preserving Data Mining Lecture 3 Non-Cryptographic Approaches for Preserving Privacy (Based on Slides of Kobbi Nissim) Benny Pinkas HP Labs, Israel 10th Estonian Winter School in Computer Science

2. Why not use cryptographic methods? • Many users contribute data. Cannot require them to participate in a cryptographic protocol. • In particular, cannot require p2p communication between users. • Cryptographic protocols incur considerable overhead. …

3. Data Privacy • [Diagram: users query the data through an access mechanism; the question is whether the answers breach the privacy of individuals in the data.]

4. An Easy, Tempting Solution (a Bad Solution) • Idea: (a) remove identifying information (name, SSN, …); (b) publish the data. [Illustration: records of Mr. Brown, Ms. John, Mr. Doe with identifiers removed.] • But 'harmless' attributes uniquely identify many patients (gender, age, approximate weight, ethnicity, marital status, …). • Recall: DOB + gender + zip code identify people with high probability. • Worse: 'rare' attributes (e.g., a disease with probability ≈ 1/3000).

5. What is Privacy? • Intuition: privacy is breached if it is possible to compute someone's private information from his identity. • Something should not be computable from query answers, e.g., the mapping Joe ↦ {Joe's private data}. • The definition should take into account the adversary's power (computational, # of queries, prior knowledge, …). • Quite often it is much easier to say what is surely non-private. • E.g., strong breaking of privacy: the adversary is able to retrieve (almost) everybody's private data.

6. The Data Privacy Game: an Information-Privacy Tradeoff • Private functions: want to hide π_x(DB) = d_x. • Information functions: want to reveal f(q, DB) for queries q. • Here: an explicit definition of the private functions; the question is which information functions may be allowed. • Different from crypto (secure function evaluation): there, we want to reveal f() (explicit definition of the information function) and to hide all functions π() not computable from f() (implicit definition of the private functions); the question whether f() should be revealed is not asked.

7. A Simplistic Model: Statistical Database (SDB) • Database: d ∈ {0,1}^n (one bit per person: Mr. Fox 0/1, Ms. John 0/1, …, Mr. Doe 0/1). • Query: a subset q ⊆ [n]. • Answer: a_q = Σ_{i∈q} d_i.

8. Approaches to SDB Privacy • Studied extensively since the 70s. • Perturbation: add randomness, give 'noisy' or 'approximate' answers. Techniques: data perturbation (perturb the data, then answer queries as usual) [Reiss 84, Liew Choi Liew 85, Traub Yemini Wozniakowski 84] …; output perturbation (perturb the answers to queries) [Denning 80, Beck 80, Achugbue Chin 79, Fellegi Phillips 74] …. Recent interest: [Agrawal, Srikant 00], [Agrawal, Aggarwal 01], … • Query restriction: answer queries accurately but sometimes disallow queries. Require queries to obey some structure [Dobkin Jones Lipton 79]; restrict the number of queries; auditing [Chin Ozsoyoglu 82, Kleinberg Papadimitriou Raghavan 01].

9. Some Recent Privacy Definitions • X – data, Y – (noisy) observation of X. • [Agrawal, Srikant '00] Interval of confidence: let Y = X + noise (e.g., noise uniform in [-100,100]). Perturb the input data; one can still estimate the underlying distribution. Tradeoff: more noise → less accuracy but more privacy. • Intuition: a large possible interval → privacy preserved. Given Y, we know that with c% confidence X is in [a1,a2]; for example, for Y=200, with 50% confidence X is in [150,250]. a2-a1 defines the amount of privacy at c% confidence. • Problem: there might be a-priori information about X. E.g., X is someone's age and Y = -97: since age is non-negative, X must lie in a tiny interval near 0, so the noise hides almost nothing.
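A minimal sketch of this style of input perturbation and of how a-priori knowledge shrinks the confidence interval; the distribution and parameter names below are illustrative assumptions, not from [AS].

```python
# Hedged sketch of [Agrawal-Srikant]-style randomization: Y = X + U(-w, w).
import numpy as np

rng = np.random.default_rng(0)
w = 100.0                                   # noise half-width
ages = rng.integers(0, 90, size=10_000)     # hypothetical private values X
released = ages + rng.uniform(-w, w, size=ages.shape)  # published Y

# Interval of confidence at 100%: given Y, X lies in [Y - w, Y + w] ...
y = released[0]
lo, hi = y - w, y + w
# ... but a-priori knowledge shrinks it: ages are non-negative and < 120.
lo, hi = max(lo, 0.0), min(hi, 120.0)
print(f"Y={y:.0f}: X in [{lo:.0f}, {hi:.0f}] (width {hi - lo:.0f}, not 2w={2 * w:.0f})")
```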

10. The [AS] scheme can be turned against itself • Assume the number of records N is large. Even if the data miner has no a-priori information about X, it can estimate the distribution of X from the randomized data Y. • Example: the perturbation is uniform in [-1,1], so [AS] guarantees a privacy interval of width 2 with 100% confidence. Let f_X put probability 50% on x ∈ [0,1] and 50% on x ∈ [4,5]. • But after learning f_X, the value of X can be localized within an interval of size at most 1 (e.g., any Y < 2 implies X ∈ [0,1]). • Problem: aggregate information provides information that can be used to attack individual data.
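A hedged demo of this attack, using the bimodal distribution from the slide; the sample sizes and variable names are illustrative.

```python
# Y = X + U(-1, 1) with X bimodal on [0,1] and [4,5]. Once the (estimable)
# support is known, every Y pins X to an interval of width <= 1, despite
# the nominal privacy interval of width 2.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
x = np.where(rng.random(n) < 0.5, rng.uniform(0, 1, n), rng.uniform(4, 5, n))
y = x + rng.uniform(-1, 1, n)

# Any y < 2 is only consistent with X in [0,1]; any y > 3 only with [4,5].
inferred_low = y < 2
assert np.all((x <= 1) == inferred_low)    # every record is localized
print("all", n, "records localized to an interval of width 1")
```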

11. Some Recent Privacy Definitions (cont.) • X – data, Y – (noisy) observation of X. • [Agrawal, Aggarwal '01] Mutual information: I(X;Y) = H(X) - H(X|Y). Intuition: high conditional entropy is good; small I(X;Y) → privacy preserved (Y provides little information about X). • Problem [EGS]: this is an average notion. Privacy loss can happen with low but significant probability without noticeably affecting I(X;Y), so sometimes I(X;Y) looks good but privacy is breached.
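A worked example in the spirit of the [EGS] critique (an assumed channel, not taken from the slides): a mechanism that leaks everything with small probability still has small mutual information.

```latex
\text{Let } Y = \begin{cases} X & \text{w.p. } p \\ \bot & \text{w.p. } 1-p \end{cases}
\quad\Longrightarrow\quad
H(X \mid Y) = (1-p)\,H(X), \qquad I(X;Y) = p\,H(X).
% For p = 1/100 the mutual information is only 1% of H(X),
% yet one record in a hundred is fully exposed.
```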

12. Output Perturbation (Randomization Approach) • Exact answer to query q: a_q = Σ_{i∈q} d_i. • Actual SDB answer: â_q. • Perturbation magnitude E: for all q, |â_q - a_q| ≤ E. • Questions: does perturbation give any privacy? How much perturbation is needed for privacy? Usability?

13. Privacy Preserved by Perturbation ≈ √n • Database: d ∈_R {0,1}^n (uniform input distribution!). • Algorithm: on query q, (1) let a_q = Σ_{i∈q} d_i; (2) if |a_q - |q|/2| < E, return â_q = |q|/2; (3) otherwise return â_q = a_q. • Claim: with E ≈ √n·(lg n)², privacy is preserved. Assume poly(n) queries; then whp rule (2) always applies, so no information about d is given (but the database is completely useless…). • Shows that sometimes a perturbation of ≈ √n is enough for privacy. Can we do better?
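A sketch of this "useless database" mechanism, under the reading above that E ≈ √n·(lg n)²; the concrete sizes are illustrative.

```python
# For a uniform database, every subset sum concentrates within O(sqrt(n) lg n)
# of |q|/2, so with E ~ sqrt(n) (lg n)^2 rule (2) fires whp on every query.
import math
import random

n = 10_000
d = [random.randint(0, 1) for _ in range(n)]          # uniform database
E = math.isqrt(n) * int(math.log2(n)) ** 2            # perturbation magnitude

def answer(q):
    a = sum(d[i] for i in q)
    return len(q) / 2 if abs(a - len(q) / 2) < E else a

q = random.sample(range(n), n // 2)
print(answer(q))      # 2500.0 -- independent of d: private but useless
```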

14. Perturbation E = o(√n) Implies no Privacy • The previous useless database achieves the best possible perturbation. • Theorem [Dinur-Nissim]: given any DB and any DB response algorithm with perturbation E = o(√n), there is a poly-time reconstruction algorithm that outputs a database d' s.t. dist(d,d') = o(n). • This is a strong breaking of privacy.

15. The Adversary as a Decoding Algorithm • [Diagram: the database d is 'encoded' as partial sums a_q1,…,a_qt; the perturbation turns them into perturbed sums â_q1,…,â_qt; the adversary 'decodes' the perturbed sums back to d.]

16. Proof of Theorem [DN03]: the Adversary's Reconstruction Algorithm • Query phase: get â_qj for t random subsets q1,…,qt. • Weeding phase: solve the linear program (over ℝ): 0 ≤ x_i ≤ 1 and |Σ_{i∈qj} x_i - â_qj| ≤ E for all j. • Rounding: let c_i = round(x_i); output c. • Observation: a solution always exists, e.g., x = d.
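A minimal sketch of this reconstruction attack. The perturbed-sum oracle `noisy_sum` is an assumption (any answer within eps of the true subset sum), and scipy's LP solver stands in for an arbitrary LP solver.

```python
# Sketch of the [DN03] LP reconstruction attack.
import numpy as np
from scipy.optimize import linprog

def reconstruct(noisy_sum, n, t, eps, seed=0):
    rng = np.random.default_rng(seed)
    # Query phase: t random subsets of [n], as 0/1 indicator rows.
    Q = rng.integers(0, 2, size=(t, n))
    a_hat = np.array([noisy_sum(np.flatnonzero(row)) for row in Q])
    # Weeding phase: any x in [0,1]^n with |Qx - a_hat| <= eps coordinatewise.
    A_ub = np.vstack([Q, -Q])
    b_ub = np.concatenate([a_hat + eps, eps - a_hat])
    res = linprog(np.zeros(n), A_ub=A_ub, b_ub=b_ub, bounds=[(0, 1)] * n)
    # Rounding phase: round each coordinate to the nearer of 0/1.
    return np.rint(res.x).astype(int)
```

Against a 0/1 NumPy vector d, `noisy_sum` could be, e.g., `lambda q: d[q].sum() + rng.uniform(-eps, eps)`; the argument on the next slide is what guarantees that any feasible x is o(n)-close to d.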

17. Why does the Reconstruction Algorithm Work? • Consider x ∈ {0,1}^n s.t. dist(x,d) = c·n = Ω(n). • Observation: a random q contains c'·n coordinates in which x ≠ d, and the difference between the sums over these coordinates is, with constant probability, at least Ω(√n) (> E = o(√n)). • Such a q disqualifies x as a solution for the LP. • Since polynomially many queries are asked and each far-away x is disqualified by a random query with constant probability, all such vectors x are disqualified with overwhelming probability.

18. Summary of Results (statistical database) • [Dinur, Nissim 03]: • Unlimited adversary: perturbation of magnitude Ω(n) required. • Polynomial-time adversary: perturbation of magnitude Ω(√n) required (shown above). • In both cases the adversary may reconstruct a good approximation of the database, which disallows even very weak notions of privacy. • Bounded adversary, restricted to T << n queries (SuLQ): there is a privacy-preserving access mechanism with perturbation << √T. A chance for usability; a reasonable model as databases grow larger and larger.

19. SuLQ for Multi-Attribute Statistical Databases (SDB) • Database: {d_{i,j}}, a 0/1 matrix of n persons × k attributes. • Query: (q, f), where q ⊆ [n] and f : {0,1}^k → {0,1}. • Answer: a_{q,f} = Σ_{i∈q} f(d_i). • Row distribution: D = (D_1, D_2, …, D_n).

20. Privacy and Usability Concerns for the Multi-Attribute Model [DN] • Rich set of queries: subset sums over any property of the k attributes. Obviously increases usability, but how is privacy affected? • More to protect: functions of the k attributes. • Relevant factors: what is the adversary's goal? Row dependency. • Vertically split data (between k or fewer databases): can privacy still be maintained with independently operating databases?

21. Privacy Definition – Intuition • A 3-phase adversary: • Phase 0: defines a target set G of poly(n) functions g : {0,1}^k → {0,1}; it will try to learn some of this information about someone. • Phase 1: adaptively queries the database T = o(n) times. • Phase 2: using all the information gained, chooses an index i of a row it intends to attack and a function g ∈ G. • Attack: given d_{-i} (all rows but the i-th), try to guess g(d_{i,1}…d_{i,k}).

22. The Privacy Definition • p⁰_{i,g} – the a-priori probability that g(d_{i,1}…d_{i,k}) = 1. • p^T_{i,g} – the a-posteriori probability that g(d_{i,1}…d_{i,k}) = 1, given the answers to the T queries and d_{-i}. • Define conf(p) = log(p / (1-p)); a 1-1 relationship between p and conf(p): conf(1/2) = 0, conf(2/3) = 1, conf(1) = ∞. • Δconf_{i,g} = conf(p^T_{i,g}) - conf(p⁰_{i,g}). • (ε,T)-privacy ('relative privacy'): for all distributions D_1…D_n, every row i, every function g, and any adversary making at most T queries, Pr[Δconf_{i,g} > ε] = neg(n).
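In clean notation (log base 2, which is what makes conf(2/3) = 1 once odds of 2 are taken), the definition reads:

```latex
\mathrm{conf}(p) = \log_2 \frac{p}{1-p}, \qquad
\Delta\mathrm{conf}_{i,g} = \mathrm{conf}\!\big(p^{T}_{i,g}\big) - \mathrm{conf}\!\big(p^{0}_{i,g}\big),
\qquad
(\varepsilon,T)\text{-privacy:}\ \Pr\big[\Delta\mathrm{conf}_{i,g} > \varepsilon\big] = \mathrm{neg}(n).
```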

23. The SuLQ* Database • The adversary is restricted to T << n queries. • On query (q, f), with q ⊆ [n] and f : {0,1}^k → {0,1} a binary function: • Let a_{q,f} = Σ_{i∈q} f(d_{i,1}…d_{i,k}). • Let N be binomial noise of mean 0 and magnitude ≈ √T. • Return a_{q,f} + N. • (*SuLQ – Sub-Linear Queries)
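A minimal sketch of such a mechanism, under the assumption that the noise is a centered Binomial(T, 1/2) (mean 0, standard deviation √T/2); class and parameter names are illustrative, not from the SuLQ paper.

```python
# Hedged sketch of a SuLQ-style database: noisy subset sums of a row
# property, with a sublinear query budget T << n.
import numpy as np

class SuLQDatabase:
    def __init__(self, rows, T, seed=0):
        self.rows = rows                      # list of k-bit tuples d_i
        self.T = T                            # query budget, T << n
        self.asked = 0
        self.rng = np.random.default_rng(seed)

    def query(self, q, f):
        if self.asked >= self.T:
            raise RuntimeError("query budget T exhausted")
        self.asked += 1
        a = sum(f(self.rows[i]) for i in q)   # exact answer a_{q,f}
        noise = self.rng.binomial(self.T, 0.5) - self.T / 2  # mean 0, sd ~sqrt(T)
        return a + noise

db = SuLQDatabase([(1, 0, 1), (0, 0, 1), (1, 1, 1)], T=100)
print(db.query({0, 2}, lambda d: d[0] & d[2]))   # noisy count over rows 0 and 2
```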

24. Privacy Analysis of the SuLQ Database • p^m_{i,g} – the a-posteriori probability that g(d_{i,1}…d_{i,k}) = 1, given d_{-i} and the answers to the first m queries. • conf(p^m_{i,g}) describes a random walk on the line with starting point conf(p⁰_{i,g}); compromise corresponds to the walk crossing conf(p⁰_{i,g}) + ε. • W.h.p. more than T steps are needed to reach compromise.

25. Usability: One Multi-Attribute SuLQ DB • Statistics of any property f of the k attributes, i.e., for what fraction of the (sub)population does f(d_1…d_k) hold? Easy: just put f in the query. • Other applications: k independent multi-attribute SuLQ DBs; vertically partitioned SuLQ DBs; testing whether Pr[β|α] ≥ Pr[β] + Δ. • Caveat: we hide g() about a specific row (not about multiple rows).

26. Overview of Methods • Input perturbation: the SDB is perturbed into SDB'; the user's query is answered from SDB'. • Output perturbation: the user's (restricted) query goes to the SDB, which returns a perturbed response. • Query restriction: the user's (restricted) query goes to the SDB, which returns an exact response or a denial.

27. Query Restriction • The user's (restricted) query receives an exact response or a denial. • The decision whether to answer or deny the query can be based on the content of the query and on the answers to previous queries; or it can be based on the above and on the content of the database.

28. Auditing • [AW89] classify auditing as a query restriction method: 'Auditing of an SDB involves keeping up-to-date logs of all queries made by each user (not the data involved) and constantly checking for possible compromise whenever a new query is issued.' • Partial motivation: may allow more queries to be posed, if no privacy threat occurs. • Early work: Hofmann 1977; Schlorer 1976; Chin, Ozsoyoglu 1981, 1986. • Recent interest: Kleinberg, Papadimitriou, Raghavan 2000; Li, Wang, Wang, Jajodia 2002; Jonsson, Krokhin 2003.

29. How Auditors may Inadvertently Compromise Privacy

30. The Setting • Statistical database: dataset d = {d_1,…,d_n}; entries d_i are real, integer, or Boolean. • Query: q = (f, i_1,…,i_k), answered by f(d_{i1},…,d_{ik}), where f ∈ {min, max, median, sum, average, count, …}. • Bad users will try to breach the privacy of individuals. • Compromise ⇔ uniquely determine some d_i (a very weak definition).

31. Auditing • [Diagram: the user sends a new query q_{i+1}; the auditor, holding the query log q_1,…,q_i and the statistical database, replies either 'Here's the answer' or 'Query denied' (as the answer would cause privacy loss).]

32. Example 1: Sum/Max Auditing • d_i real; sum/max queries; privacy breached if some d_i is learned. • User: q1 = sum(d1,d2,d3). Auditor: sum(d1,d2,d3) = 15. • User: q2 = max(d1,d2,d3). Auditor: denied (the answer would cause privacy loss). • User: there must be a reason for the denial… q2 is denied iff d1 = d2 = d3 = 5 (only then would answering the max pin down a value). I win! • Auditor: oh well…
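A toy rendering of this dialogue; the auditor model (deny iff answering would uniquely determine some d_i) is the hypothetical one from the slide.

```python
# With sum(d1,d2,d3) = s already answered, the max query would determine a
# value iff d1 = d2 = d3 = s/3 -- which is exactly when the auditor denies,
# so the denial itself reveals every d_i.
def sum_then_max(d):
    s = sum(d)                       # q1: answered
    if d[0] == d[1] == d[2]:         # q2 would pin all values down
        return s, "denied"           # ...but denying reveals d_i = s/3
    return s, max(d)                 # otherwise max is safe to answer

print(sum_then_max((5, 5, 5)))       # (15, 'denied') -> attacker infers 5,5,5
print(sum_then_max((4, 5, 6)))       # (15, 6) -> many datasets still possible
```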

33. Sounds Familiar? • David Duncan, former auditor for Enron and partner in Andersen: 'Mr. Chairman, I would like to answer the committee's questions, but on the advice of my counsel I respectfully decline to answer the question based on the protection afforded me under the Constitution of the United States.'

34. Max Auditing • d_i real; dataset d1,…,dn. • q1 = max(d1,d2,d3,d4): answered, M1234. • q2 = max(d1,d2,d3): answered M123, or denied; if denied, the user learns d4 = M1234. • q3 = max(d1,d2): answered M12, or denied; if denied, the user learns d3 = M123. • Continuing this way down the list, the user learns an item with probability ½.

35. Boolean Auditing? • d_i Boolean. • q1 = sum(d1,d2): '1' or denied; q2 = sum(d2,d3): '1' or denied; … • A sum of two bits is answered only if it equals 1 (an answer of 0 or 2 would reveal both bits), so q_i is denied iff d_i = d_{i+1}. • Either way the user learns whether d_i = d_{i+1}, and hence the whole database or its complement.
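A sketch of this attack against the (hypothetical) careless auditor, which denies exactly when the two bits are equal:

```python
# Boolean auditing attack: denials of sum(d_i, d_{i+1}) encode equality of
# adjacent bits, which determines d up to complement.
def careless_auditor(d, i):
    s = d[i] + d[i + 1]
    return s if s == 1 else "denied"     # answering 0 or 2 would reveal bits

def reconstruct(d):
    guess = [0]                          # assume d[0] = 0; else we get the complement
    for i in range(len(d) - 1):
        ans = careless_auditor(d, i)     # denial <=> d[i] == d[i+1]
        guess.append(guess[-1] if ans == "denied" else 1 - guess[-1])
    return guess

d = [1, 0, 0, 1, 1]
print(reconstruct(d))                    # [0, 1, 1, 0, 0] -- complement of d
```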

36. The Problem • Query denials leak (potentially sensitive) information, and users cannot decide denials by themselves. • [Diagram: within the space of possible assignments to {d1,…,dn}, the denial of q_{i+1} cuts out part of the set of assignments consistent with (q1,…,qi, a1,…,ai).]

37. Solution to the Problem: Simulatable Auditing • An auditor is simulatable if there exists a simulator that, given only q1,…,qi, a1,…,ai and the new query q_{i+1} (but no access to the statistical database), makes the same deny/answer decision as the auditor. • Simulation ⇒ denials do not leak information.

38. Why do Simulatable Auditors not Leak Information? • [Diagram: since the deny/allow decision on q_{i+1} is computable from (q1,…,qi, a1,…,ai) alone, it does not shrink the set of possible assignments to {d1,…,dn} consistent with the answers so far.]

39. Simulatable Auditing

40. Query Restriction for Sum Queries • Given: a dataset D = {x_1,…,x_n}, x_i ∈ ℝ; a query specifies a subset S of the dataset and is answered by Σ_{xi∈S} x_i. • Is it possible to compromise D? Here compromise means: uniquely determine some x_i from the queries. • Compromise is trivial if query sets may be arbitrarily small: sum(x9) = x9.

41. Query Set Size Control • Do not permit queries that involve a small subset of the database. • Compromise is still possible: to discover x, ask two large queries and subtract, sum(x,y1,…,yk) - sum(y1,…,yk) = x (a sketch follows below). • Issue: overlap. In general, bounding overlap alone is not enough; the number of queries must also be restricted. • Note that overlap itself sometimes restricts the number of queries (e.g., if query size = cn and overlap = const, there are only about 1/c possible queries).
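A minimal sketch of this subtraction attack; the `db_sum` oracle (which enforces only a minimum query-set size) and the data are illustrative assumptions.

```python
# Size control alone fails: two allowed queries differing in one element
# isolate that element.
d = [3, 1, 4, 1, 5, 9, 2, 6]                 # hypothetical private values
MIN_SIZE = 5                                 # query set size control

def db_sum(S):
    if len(S) < MIN_SIZE:
        raise PermissionError("query set too small")
    return sum(d[i] for i in S)

padding = {1, 2, 3, 4, 5}                    # any k elements besides the target
print(db_sum(padding | {0}) - db_sum(padding))   # 3 == d[0]
```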

42. Restricting Set-Sum Queries • Restrict sum queries based on: the number of database elements in the sum; the overlap with previous sum queries; the total number of queries. • Note that these criteria are known to the user: they do not depend on the contents of the database. Therefore the user can simulate the deny/answer decision given by the DB, i.e., this is simulatable auditing (see the sketch below).
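A sketch of such a simulatable restriction auditor (thresholds are illustrative); note that the decision function never touches the data, so any user can predict every denial.

```python
# Simulatable query-restriction auditor for subset-sum queries: allow/deny
# depends only on query size, overlap with past queries, and query count.
def make_auditor(min_size, max_overlap, max_queries):
    log = []
    def allow(q):                        # q: set of indices
        ok = (len(log) < max_queries
              and len(q) >= min_size
              and all(len(q & p) <= max_overlap for p in log))
        if ok:
            log.append(set(q))
        return ok
    return allow

allow = make_auditor(min_size=20, max_overlap=2, max_queries=10)
print(allow(set(range(0, 20))))     # True  (size 20, no prior overlap)
print(allow(set(range(10, 30))))    # False (overlaps the first query in 10 spots)
```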

43. Restricting Overlap and Number of Queries • Assume: |Q_i| ≥ k for every query; |Q_i ∩ Q_j| ≤ r for every pair of queries; the adversary knows a priori at most L values, with L + 1 < k. • Claim: the data cannot be compromised with fewer than 1 + (2k-L)/r sum queries. • [Illustration: queries Q1,…,Qt as 0/1 rows over x1,…,xn, each of weight ≥ k, with pairwise overlap ≤ r.]

44. Overlap + Number of Queries • Claim [Dobkin, Jones, Lipton] [Reiss]: the data cannot be compromised with fewer than 1 + (2k-L)/r sum queries (k ≤ query size, r ≥ overlap, L a-priori known items). • Suppose x_c is compromised after t queries, each query represented by Q_i = x_{i1} + x_{i2} + … + x_{ik} for i = 1,…,t. • This implies x_c = Σ_{i=1..t} α_i Q_i for some reals α_i. • Let η_{iℓ} = 1 if x_ℓ appears in query i, and 0 otherwise. Then x_c = Σ_{i=1..t} α_i Σ_{ℓ=1..n} η_{iℓ} x_ℓ = Σ_{ℓ=1..n} (Σ_{i=1..t} α_i η_{iℓ}) x_ℓ.

45. Overlap + Number of Queries (cont.) • We have x_c = Σ_{ℓ=1..n} (Σ_{i=1..t} α_i η_{iℓ}) x_ℓ. • For x_c to be compromised, the coefficient Σ_{i=1..t} α_i η_{iℓ} must be 0 for every x_ℓ except x_c. • For a given ℓ, this happens iff η_{iℓ} = 0 for all i; or if η_{iℓ} = η_{jℓ} = 1 and α_i, α_j have opposite signs; or α_i = 0, in which case the i-th query didn't matter.

46. Overlap + Number of Queries (cont.) • W.l.o.g. the first query contains x_c and the second query has the opposite sign. • The first query probes k elements; the second adds at least k - r new elements. • Elements from the first and second queries cannot be canceled within the same additional query (the opposite signs prevent it), so each new query cancels items from the first or from the second query, but not from both. • 2k - r - L elements must be canceled, at most r per query, giving 2 + (2k-r-L)/r = 1 + (2k-L)/r queries in total.

47. Notes • The number of queries satisfying |Q_i| ≥ k and |Q_i ∩ Q_j| ≤ r is small. • If k = n/c for some constant c and r = const, there are only ~c such queries; hence the permitted query sequence may be uncomfortably short. • Alternatively, if r = k/c (overlap is a constant fraction of the query size), then the number of queries, 1 + (2k-L)/r, is O(c).

48. Conclusions • Privacy should be defined and analyzed rigorously; in particular, assuming that randomization ⇒ privacy is dangerous. • High perturbation is needed for privacy against polynomial adversaries: a threshold phenomenon – perturbation above √n gives total privacy, below √n gives no privacy (for a poly-time adversary). The main tool is a reconstruction algorithm. • Careless auditing might leak private information. • Self-auditing (simulatable auditors) is safe: the decision whether to allow a query is based on previous 'good' queries and their answers, without access to the DB contents, so users may apply the decision procedure by themselves.

49. To Do • Come up with a good model and requirements for database privacy. • Learn from crypto. • Protect against more general loss of privacy. • Simulatable auditors are a starting point for designing more reasonable audit mechanisms.

50. References • Course web pages: • A Study of Perturbation Techniques for Data Privacy, Cynthia Dwork, Nina Mishra, and Kobbi Nissim, http://theory.stanford.edu/~nmishra/cs369-2004.html • Privacy and Databases, http://theory.stanford.edu/~rajeev/privacy.html
