
CS573 Data Privacy and Security Anonymization methods



  1. CS573 Data Privacy and Security: Anonymization Methods • Li Xiong

  2. Today • Permutation based anonymization methods (cont.) • Other privacy principles for microdata publishing • Statistical databases

  3. Anonymization methods • Non-perturbative: don't distort the data • Generalization • Suppression • Perturbative: distort the data • Microaggregation/clustering • Additive noise • Anatomization and permutation • De-associate relationship between QID and sensitive attribute

  4. Concept of the Anatomy Algorithm • Release two tables: a quasi-identifier table (QIT) and a sensitive table (ST) • Keep the QI groups of an l-diverse partition, but in the QIT replace the sensitive attribute values with a Group-ID column • Produce the ST with the per-group statistics of the sensitive attribute (e.g., Disease counts)

  5. Specifications of Anatomy cont. • DEFINITION 3 (Anatomy). Given an l-diverse partition, anatomy creates a QIT and an ST as follows: • QIT has schema (A^qi_1, A^qi_2, ..., A^qi_d, Group-ID) • ST has schema (Group-ID, A^s, Count)
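A minimal sketch of the two-table release described above, assuming the l-diverse partition (the list of QI groups) has already been computed; the record layout, attribute names, and toy data are hypothetical, not from the paper.

```python
from collections import Counter

# Given an already-computed l-diverse partition (a list of QI groups,
# each a list of records), emit the QIT and ST tables of Definition 3.
def anatomize(groups, qi_attrs, sensitive_attr):
    qit, st = [], []
    for gid, group in enumerate(groups, start=1):
        # QIT keeps the exact QI values plus the Group-ID
        for rec in group:
            qit.append(tuple(rec[a] for a in qi_attrs) + (gid,))
        # ST keeps, per group, each sensitive value and its count
        for value, count in Counter(rec[sensitive_attr] for rec in group).items():
            st.append((gid, value, count))
    return qit, st

# Toy 2-diverse partition of four records
groups = [
    [{"Age": 23, "Zip": "11000", "Disease": "flu"},
     {"Age": 27, "Zip": "11200", "Disease": "dyspepsia"}],
    [{"Age": 41, "Zip": "26000", "Disease": "flu"},
     {"Age": 45, "Zip": "27000", "Disease": "gastritis"}],
]
qit, st = anatomize(groups, ["Age", "Zip"], "Disease")
print(qit)  # [(23, '11000', 1), (27, '11200', 1), (41, '26000', 2), (45, '27000', 2)]
print(st)   # [(1, 'flu', 1), (1, 'dyspepsia', 1), (2, 'flu', 1), (2, 'gastritis', 1)]
```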

  6. Privacy properties • THEOREM 1. Given a pair of QIT and ST, the probability of inferring the sensitive value of any individual is at most 1/l
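A sketch of the reasoning behind this bound, assuming the l-eligibility condition that no sensitive value occupies more than a 1/l fraction of any QI group; the notation below is mine, not the paper's.

```latex
% The adversary can link the target to its QI group j via the QIT,
% but the ST only reveals the per-group counts c_j(v) of each sensitive value v.
\Pr\bigl[\,t[A^{s}] = v \mid \mathrm{QIT}, \mathrm{ST}\,\bigr]
    = \frac{c_j(v)}{|QI_j|}
    \le \frac{1}{l}
% The last inequality uses the l-diverse (l-eligible) partition requirement
% c_j(v) \le |QI_j| / l for every sensitive value v.
```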

  7. Comparison with generalization • Compare with generalization under two assumptions: • A1: the adversary has the QI-values of the target individual • A2: the adversary also knows that the individual is definitely in the microdata • If A1 and A2 are true, anatomy is as good as generalization: the 1/l bound holds for both • If A1 is true and A2 is false, generalization is stronger • If A1 and A2 are both false, generalization is still stronger

  8. Preserving Data Correlation • Examine the correlation between Age and Disease in T using the probability density function (pdf) of each tuple • Example: tuple t1

  9. Preserving Data Correlation cont. • To reconstruct an approximate pdf of t1 from the generalized table:

  10. Preserving Data Correlation cont. • To reconstruct an approximate pdf of t1 from the QIT and ST tables:

  11. Preserving Data Correlation cont. • For a more rigorous comparison, compute the L2 distance between the reconstructed pdf and the exact pdf (a sketch of this computation follows below): • The distance for anatomy is 0.5, while the distance for generalization is 22.5
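The L2-distance formula itself appears only as an image on the slide; the sketch below shows the shape of the comparison with placeholder pdf vectors over a discretized domain. The numbers are illustrative and are not the paper's 0.5 / 22.5 values.

```python
# Treat each tuple's exact and reconstructed pdfs as vectors over a
# discretized (Age x Disease) domain and compare them with a squared
# L2 distance. The vectors below are placeholders for illustration.
def l2_distance_sq(exact_pdf, approx_pdf):
    assert len(exact_pdf) == len(approx_pdf)
    return sum((p - q) ** 2 for p, q in zip(exact_pdf, approx_pdf))

exact       = [0.0, 1.0, 0.0, 0.0]       # t1 has probability 1 at its true cell
anatomy     = [0.0, 0.5, 0.5, 0.0]       # anatomy spreads mass within t1's QI group
generalized = [0.25, 0.25, 0.25, 0.25]   # generalization spreads mass over a larger region

print(l2_distance_sq(exact, anatomy))      # 0.5
print(l2_distance_sq(exact, generalized))  # 0.75
```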

  12. Preserving Data Correlation cont. • Idea: measure the error for each tuple using the per-tuple error formula • Objective: minimize the total re-construction error (RCE) over all tuples t in T • Algorithm: the Nearly-Optimal Anatomizing Algorithm
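The formulas were shown as images on the slide; the following is a hedged reconstruction consistent with the L2-distance comparison above, with the per-tuple error written as the squared L2 distance between the reconstructed pdf and the exact pdf, summed over the table. The notation is mine.

```latex
% Per-tuple reconstruction error over the attribute domain D
\mathrm{Err}(t) = \int_{x \in D} \bigl(\tilde{p}_t(x) - p_t(x)\bigr)^{2}\,dx

% Objective: choose the partition that minimizes the total re-construction error
\mathrm{RCE} = \sum_{t \in T} \mathrm{Err}(t)
```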

  13. Experiments • Dataset CENSUS containing the personal information of 500k American adults, with 9 discrete attributes • Created two sets of microdata tables • Set 1: 5 tables denoted OCC-3, ..., OCC-7, where OCC-d (3 ≤ d ≤ 7) uses the first d attributes as QI-attributes and Occupation as the sensitive attribute A^s • Set 2: 5 tables denoted SAL-3, ..., SAL-7, where SAL-d (3 ≤ d ≤ 7) uses the first d attributes as QI-attributes and Salary-class as the sensitive attribute A^s

  14. Experiments cont.

  15. Today • Permutation based anonymization methods (cont.) • Other privacy principles for microdata publishing • Statistical databases • Differential privacy

  16. Attacks on k-Anonymity • k-Anonymity does not provide privacy if • Sensitive values in an equivalence class lack diversity (homogeneity attack) • The attacker has background knowledge (background knowledge attack) • Illustrated on a 3-anonymous patient table

  17. l-Diversity [Machanavajjhala et al. ICDE ‘06] Sensitive attributes must be “diverse” within each quasi-identifier equivalence class

  18. Distinct l-Diversity • Each equivalence class has at least l well-represented sensitive values • Doesn't prevent probabilistic inference attacks • Example: in an equivalence class of 10 records where 8 have HIV and 2 have other values, an attacker infers HIV with 80% confidence

  19. Other Versions of l-Diversity • Probabilistic l-diversity • The frequency of the most frequent value in an equivalence class is bounded by 1/l • Entropy l-diversity • The entropy of the distribution of sensitive values in each equivalence class is at least log(l) • Recursive (c,l)-diversity • r_1 < c(r_l + r_{l+1} + ... + r_m), where r_i is the frequency of the i-th most frequent value • Intuition: the most frequent value does not appear too frequently • (A small sketch of these checks follows below)
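A small sketch of checking these variants for a single equivalence class; the helper names are mine, and the class is the 8-HIV / 2-other example from the previous slide (the two "other" values are made up).

```python
import math
from collections import Counter

def distinct_l_diversity(values, l):
    return len(set(values)) >= l

def probabilistic_l_diversity(values, l):
    # frequency of the most frequent value bounded by 1/l
    return max(Counter(values).values()) / len(values) <= 1 / l

def entropy_l_diversity(values, l):
    n = len(values)
    entropy = -sum((c / n) * math.log(c / n) for c in Counter(values).values())
    return entropy >= math.log(l)

def recursive_cl_diversity(values, c, l):
    # r_1 < c * (r_l + r_{l+1} + ... + r_m), counts sorted in descending order
    r = sorted(Counter(values).values(), reverse=True)
    return len(r) >= l and r[0] < c * sum(r[l - 1:])

eq_class = ["HIV"] * 8 + ["flu", "cancer"]
print(distinct_l_diversity(eq_class, 2))           # True  (3 distinct values)
print(probabilistic_l_diversity(eq_class, 2))      # False (0.8 > 1/2)
print(entropy_l_diversity(eq_class, 2))            # False (entropy ~0.64 < log 2)
print(recursive_cl_diversity(eq_class, c=3, l=2))  # False (8 >= 3 * (1 + 1))
```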

  20. Neither Necessary, Nor Sufficient • Original dataset: 99% have cancer

  21. Neither Necessary, Nor Sufficient • Original dataset: 99% have cancer • Anonymization A: 50% cancer in each quasi-identifier group → the group is "diverse"

  22. Neither Necessary, Nor Sufficient • Original dataset: 99% have cancer • Anonymization A: 50% cancer → the quasi-identifier group is "diverse", yet this leaks a ton of information (it deviates sharply from the 99% baseline) • Anonymization B: 99% cancer → the quasi-identifier group is not "diverse", yet it matches the original distribution

  23. Limitations of l-Diversity • Example: sensitive attribute is HIV+ (1%) or HIV- (99%) • Very different degrees of sensitivity! • l-diversity is unnecessary • 2-diversity is unnecessary for an equivalence class that contains only HIV- records • l-diversity is difficult to achieve • Suppose there are 10000 records in total • To have distinct 2-diversity, there can be at most 10000*1%=100 equivalence classes

  24. Skewness Attack • Example: sensitive attribute is HIV+ (1%) or HIV- (99%) • Consider an equivalence class that contains an equal number of HIV+ and HIV- records • Diverse, but potentially violates privacy! • l-diversity does not differentiate: • Equivalence class 1: 49 HIV+ and 1 HIV- • Equivalence class 2: 1 HIV+ and 49 HIV- l-diversity does not consider overall distribution of sensitive values!

  25. Sensitive Attribute Disclosure • Similarity attack on a 3-diverse patient table • Conclusion: Bob's salary is in [20k,40k], which is relatively low, and Bob has some stomach-related disease • l-diversity does not consider semantics of sensitive values!

  26. t-Closeness: A New Privacy Measure • Rationale/Observations: • The overall distribution Q of sensitive values is public or can be derived (external knowledge) • Comparing Q with the distribution Pi of sensitive values in each equivalence class can yield knowledge about specific individuals • Principle: the distance between Q and Pi should be bounded by a threshold t

  27. t-Closeness [Li et al. ICDE ‘07] Distribution of sensitive attributes within each quasi-identifier group should be “close” to their distribution in the entire original database

  28. Distance Measures • P = (p1, p2, ..., pm), Q = (q1, q2, ..., qm) • Trace distance • KL divergence (both formulas sketched below) • None of these measures reflects the semantic distance among values • Q: {3k,4k,5k,6k,7k,8k,9k,10k,11k}, P1: {3k,4k,5k}, P2: {5k,7k,10k} • Intuitively, D[P1,Q] > D[P2,Q]
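The two formulas referenced above were displayed as images; their standard definitions are:

```latex
% Trace (total variation) distance
D_{\mathrm{trace}}[P, Q] = \frac{1}{2} \sum_{i=1}^{m} \lvert p_i - q_i \rvert

% Kullback-Leibler divergence
D_{\mathrm{KL}}[P, Q] = \sum_{i=1}^{m} p_i \log \frac{p_i}{q_i}
```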

  29. Earth Mover's Distance • If the distributions are interpreted as two different ways of piling up a certain amount of dirt over a region D, EMD is the minimum cost of turning one pile into the other • The cost is the amount of dirt moved times the distance by which it is moved • Assumes the two piles have the same amount of dirt • Extensions exist for comparing distributions with different total masses: • allow a partial match and discard leftover "dirt" without cost • allow mass to be created or destroyed, but with a cost penalty

  30. Earth Mover's Distance • Formulation • P = (p1, p2, ..., pm), Q = (q1, q2, ..., qm) • d_ij: the ground distance between element i of P and element j of Q • Find a flow F = [f_ij], where f_ij is the flow of mass from element i of P to element j of Q, that minimizes the overall work subject to the constraints sketched below
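The objective and constraints were shown as images on the slide; the following is a reconstruction of the standard EMD linear program as I understand it is used in the t-closeness paper.

```latex
% Minimize the total work of moving probability mass from P to Q
\mathrm{WORK}(P, Q, F) = \sum_{i=1}^{m} \sum_{j=1}^{m} d_{ij}\, f_{ij}

% subject to
f_{ij} \ge 0 \qquad (1 \le i, j \le m)
p_i - \sum_{j=1}^{m} f_{ij} + \sum_{j=1}^{m} f_{ji} = q_i \qquad (1 \le i \le m)
\sum_{i=1}^{m} \sum_{j=1}^{m} f_{ij} = \sum_{i=1}^{m} p_i = \sum_{i=1}^{m} q_i = 1
```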

  31. How to Calculate EMD (cont'd) • EMD for categorical attributes • Hierarchical distance (one common definition is sketched below) • Hierarchical distance is a metric
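A sketch assuming one common definition of the hierarchical ground distance: the height of the lowest common ancestor of the two values, divided by the height of the whole taxonomy. The small disease taxonomy and function names below are made up for illustration.

```python
# Toy taxonomy: each value maps to its parent; the root has no entry.
parent = {
    "gastric ulcer": "stomach disease", "gastritis": "stomach disease",
    "flu": "respiratory infection", "pneumonia": "respiratory infection",
    "stomach disease": "any disease", "respiratory infection": "any disease",
}

def ancestors(v):
    chain = [v]
    while v in parent:
        v = parent[v]
        chain.append(v)
    return chain  # v, its parent, ..., the root

def hierarchical_distance(v1, v2):
    a1, a2 = ancestors(v1), ancestors(v2)
    height = max(len(a1), len(a2)) - 1           # height of the taxonomy below the root
    lca_level = next(i for i, a in enumerate(a1) if a in a2)  # levels up to the LCA
    return lca_level / height

print(hierarchical_distance("gastritis", "gastric ulcer"))  # 0.5 (siblings)
print(hierarchical_distance("gastritis", "flu"))            # 1.0 (only the root in common)
print(hierarchical_distance("flu", "flu"))                  # 0.0
```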

  32. Earth Mover's Distance • Example • P1 = {3k,4k,5k} and Q = {3k,4k,5k,6k,7k,8k,9k,10k,11k} • Move 1/9 probability for each of the following pairs • 3k→6k, 3k→7k, cost: 1/9 * (3+4)/8 • 4k→8k, 4k→9k, cost: 1/9 * (4+5)/8 • 5k→10k, 5k→11k, cost: 1/9 * (5+6)/8 • Total cost: 1/9 * 27/8 = 0.375 • With P2 = {6k,8k,11k}, the total cost is 1/9 * 12/8 = 0.167 < 0.375 • This makes more sense than the other two distance measures (a sketch reproducing these numbers follows below)
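A sketch that reproduces the slide's numbers, using the closed form of one-dimensional EMD for a numerical attribute with m equally spaced values and ground distance |i - j| / (m - 1): the surplus of "dirt" carried past each position, summed and normalized.

```python
def ordered_emd(p, q):
    # D[P, Q] = (1 / (m - 1)) * sum_i | sum_{j <= i} (p_j - q_j) |
    m = len(p)
    carried, total = 0.0, 0.0
    for pi, qi in zip(p, q):
        carried += pi - qi   # surplus (or deficit) pushed to the next position
        total += abs(carried)
    return total / (m - 1)

# Domain: salaries 3k, 4k, ..., 11k (9 equally spaced values)
Q  = [1 / 9] * 9                              # overall distribution
P1 = [1 / 3] * 3 + [0] * 6                    # {3k, 4k, 5k}
P2 = [0, 0, 0, 1 / 3, 0, 1 / 3, 0, 0, 1 / 3]  # {6k, 8k, 11k}

print(round(ordered_emd(P1, Q), 3))  # 0.375
print(round(ordered_emd(P2, Q), 3))  # 0.167
```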

  33. Experiments • Goal • To show l-diversity does not provide sufficient privacy protection (the similarity attack). • To show the efficiency and data quality of using t-closeness are comparable with other privacy measures. • Setup • Adult dataset from UC Irvine ML repository • 30162 tuples, 9 attributes (2 sensitive attributes) • Algorithm: Incognito

  34. Experiments • Comparisons of privacy measurements • k-Anonymity • Entropy l-diversity • Recursive (c,l)-diversity • k-Anonymity with t-closeness

  35. Experiments • Efficiency • The efficiency of using t-closeness is comparable with other privacy measurements

  36. Experiments • Data utility • Discernibility metric; Minimum average group size • The data quality of using t-closeness is comparable with other privacy measurements

  37. Anonymous, “t-Close” Dataset This is k-anonymous, l-diverse and t-close… …so secure, right?

  38. What Does Attacker Know? Bob is Caucasian and I heard he was admitted to hospital with flu…

  39. What Does Attacker Know? Bob is Caucasian and I heard he was admitted to hospital … And I know three other Caucasians admitted to hospital with Acne or Shingles …

  40. k-Anonymity and Partition-based notions • Syntactic • Focuses on data transformation, not on what can be learned from the anonymized dataset • “k-anonymous” dataset can leak sensitive information • “Quasi-identifier” fallacy • Assumes a priori that attacker will not know certain information about his target

  41. Today • Permutation based anonymization methods (cont.) • Other privacy principles for microdata publishing • Statistical databases • Definitions and early methods • Output perturbation and differential privacy

  42. Statistical Data Release • Originated from the study of statistical databases • A statistical database is a database that provides statistics on subsets of records • OLAP vs. OLTP • Statistics may be computed as the SUM, MEAN, MEDIAN, COUNT, MAX, and MIN of records (a tiny sketch follows below)
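For concreteness, a tiny sketch of what such a database answers: an aggregate over a query-selected subset of records. The records, the predicate, and the helper name stat_query are made up for illustration.

```python
from statistics import mean, median

records = [
    {"age": 34, "zip": "30322", "salary": 52_000},
    {"age": 41, "zip": "30322", "salary": 67_000},
    {"age": 29, "zip": "30030", "salary": 48_000},
    {"age": 57, "zip": "30030", "salary": 91_000},
]

def stat_query(records, predicate, attribute, stat):
    # Answer an aggregate query over the subset selected by the predicate,
    # without releasing the individual records themselves.
    values = [r[attribute] for r in records if predicate(r)]
    return {"SUM": sum, "MEAN": mean, "MEDIAN": median,
            "COUNT": len, "MAX": max, "MIN": min}[stat](values)

# e.g., "mean salary of everyone in zip 30322"
print(stat_query(records, lambda r: r["zip"] == "30322", "salary", "MEAN"))  # 59500
```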

  43. Types of Statistical Databases • Static: made once and never changes (example: U.S. Census) • Dynamic: changes continuously to reflect real-time data (example: most online research databases)

  44. Types of Statistical Databases • Centralized: one database • Decentralized: multiple databases • General purpose: e.g., census • Special purpose: e.g., bank, hospital, academia

  45. Data Compromise • Exact compromise – a user is able to determine the exact value of a sensitive attribute of an individual • Partial compromise – a user is able to obtain an estimator for a sensitive attribute with a bounded variance • Positive compromise – determine an attribute has a particular value • Negative compromise – determine an attribute does not have a particular value • Relative compromise – determine the ranking of some confidential values

  46. Statistical Quality of Information • Bias – difference between the unperturbed statistic and the expected value of its perturbed estimate • Precision – variance of the estimators obtained by users • Consistency – lack of contradictions and paradoxes • Contradictions: different responses to same query; average differs from sum/count • Paradox: negative count

  47. Methods • Query restriction • Data perturbation/anonymization • Output perturbation

  48. Data Perturbation

  49. Output Perturbation • Diagram: the user issues a query, the system computes the true results, and perturbed results are returned
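A minimal sketch of output perturbation, using Laplace noise as one concrete choice of perturbation (this anticipates the differential-privacy material listed on the earlier agenda slide). The noise scale and toy data are illustrative, not a calibrated privacy guarantee.

```python
import random

def laplace_noise(scale):
    # Difference of two iid exponentials with the given scale is Laplace(0, scale)
    return scale * (random.expovariate(1.0) - random.expovariate(1.0))

def perturbed_count(records, predicate, scale=1.0):
    true_answer = sum(1 for r in records if predicate(r))  # true query result
    return true_answer + laplace_noise(scale)              # perturbed result returned

# Toy data: ages 20..69, a made-up sensitive flag
records = [{"age": a, "hiv": a % 7 == 0} for a in range(20, 70)]
print(perturbed_count(records, lambda r: r["hiv"]))  # true count (7) plus noise
```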

  50. Statistical data release vs. data anonymization • Data anonymization is one technique that can be used to build a statistical database • Other techniques such as query restriction and output perturbation can be used to build a statistical database or release statistical data • Different privacy principles can be used
