
Anonymization Algorithms - Other techniques, metrics, and extended scenarios

Li Xiong, CS573 Data Privacy and Anonymity


Presentation Transcript


  1. Anonymization Algorithms - Other techniques, metrics, and extended scenarios Li Xiong CS573 Data Privacy and Anonymity

  2. So far • k-anonymity (protect identity disclosure) • Anonymization algorithms • Generalization and suppression • Microaggregation and clustering • Privacy principles beyond k-anonymity • l-diversity, t-closeness (protect attribute disclosure) • m-invariance (protect continuous publishing)

  3. Agenda • Other anonymization techniques • Anatomization • Information metrics • Extended scenarios

  4. Anonymization methods • Non-perturbative: don't distort the data • Generalization • Suppression • Perturbative: distort the data • Microaggregation/clustering • Additive noise • Anatomization and permutation • De-associate the relationship between the QID and the sensitive attribute

  5. Problems with k-anonymity and l-diversity Query A: SELECT COUNT(*) FROM Microdata WHERE Disease = 'pneumonia' AND Age <= 30 AND Zipcode IN [10001,20000]

  6. Querying generalized table • R1 and R2 are the anonymized QID groups • RQ is the range of query A • p = Area(R1 ∩ RQ)/Area(R1) = (10*10)/(50*40) = 0.05 • Estimated answer for A: 2 × 0.05 = 0.1 (the 2 pneumonia tuples of R1 are assumed to be uniformly distributed within R1)
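
The arithmetic on this slide can be sketched in a few lines of Python. This is only an illustration under the uniform-distribution assumption; the rectangle coordinates (Zipcode in thousands, Age in years) are assumed values chosen to reproduce the slide's 10*10 and 50*40 areas, and `overlap_fraction` is a hypothetical helper name.

```python
def overlap_fraction(group_rect, query_rect):
    """Fraction of group_rect's area that query_rect covers.

    Each rectangle is a tuple of (low, high) ranges, one per QI attribute.
    """
    frac = 1.0
    for (g_lo, g_hi), (q_lo, q_hi) in zip(group_rect, query_rect):
        overlap = max(0.0, min(g_hi, q_hi) - max(g_lo, q_lo))
        frac *= overlap / (g_hi - g_lo)
    return frac


# Assumed coordinates: (Zipcode range in thousands, Age range).
r1 = ((10, 60), (20, 60))   # generalized group R1: 50 wide, 40 high
rq = ((10, 20), (20, 30))   # query range RQ: overlaps R1 on a 10 x 10 area

p = overlap_fraction(r1, rq)
pneumonia_in_r1 = 2          # number of pneumonia tuples in R1, per the slide
estimate = pneumonia_in_r1 * p
print(p, estimate)
```

The estimate is far from an integer count because the generalized table gives no hint about where inside R1 the tuples actually lie.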

  7. Concept of the Anatomy Algorithm • Release 2 tables, quasi-identifier table (QIT) and sensitive table (ST) • Use the same QI groups (satisfy l-diversity), replace the sensitive attribute values with a Group-ID column • Then produce a sensitive table with Disease statistics

  8. Concept of the Anatomy Algorithm • Does it satisfy k-anonymity? l-diversity? • Query results? SELECT COUNT(*) FROM Microdata WHERE Disease = 'pneumonia' AND Age <= 30 AND Zipcode IN [10001,20000]
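
To make the "Query results?" question concrete, here is a hedged sketch of how query A might be estimated from a QIT/ST pair. The table contents are illustrative values (not taken from the slides), and `estimate_count` is a hypothetical helper: for each group it scales the ST count of the sensitive value by the fraction of QIT tuples in that group that satisfy the QI predicate.

```python
from collections import Counter


def estimate_count(qit, st, qi_predicate, sensitive_value):
    """Estimate COUNT(*) for a query with a QI predicate and a sensitive value."""
    size, hits = Counter(), Counter()
    for age, zipcode, gid in qit:
        size[gid] += 1
        if qi_predicate(age, zipcode):
            hits[gid] += 1
    total = 0.0
    for gid, value, count in st:
        if value == sensitive_value:
            # Spread the group's count evenly over the group's tuples.
            total += count * hits[gid] / size[gid]
    return total


# Illustrative QIT rows: (Age, Zipcode, Group-ID). QI values are published exactly.
qit = [(23, 11000, 1), (27, 13000, 1), (35, 59000, 1), (59, 12000, 1),
       (61, 54000, 2), (65, 25000, 2), (65, 25000, 2), (70, 30000, 2)]
# Illustrative ST rows: (Group-ID, Disease, Count).
st = [(1, "dyspepsia", 2), (1, "pneumonia", 2),
      (2, "bronchitis", 1), (2, "flu", 2), (2, "gastritis", 1)]

answer = estimate_count(
    qit, st,
    qi_predicate=lambda age, z: age <= 30 and 10001 <= z <= 20000,
    sensitive_value="pneumonia",
)
print(answer)
```

Because the QI values are not generalized, the predicate is evaluated exactly and only the link to the sensitive value is estimated.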

  9. Specifications of Anatomy • T is the microdata table to be published • T has d QI attributes Aqi1, Aqi2, ..., Aqid and a sensitive attribute As • Each Aqii (1 ≤ i ≤ d) is either numerical or categorical, but As can only be categorical because of l-diversity • For a tuple t in T, t[i] (1 ≤ i ≤ d) is its Aqii value and t[d + 1] is its As value • With the above stated, t can be regarded as a point in a (d + 1)-dimensional data space DS

  10. Specifications of Anatomy cont. DEFINITION 1. (Partition/QI-group) A partition consists of several subsets of T such that each tuple belongs to exactly one subset. The subsets are known as QI-groups and are denoted QI1, QI2, ..., QIm

  11. Specifications of Anatomy cont. DEFINITION 2. (l-diverse partition) A partition is l-diverse if every QI-group QIj satisfies cj(v)/|QIj| ≤ 1/l, where v is the most frequent sensitive value in QIj, cj(v) is the number of tuples in QIj with value v, and |QIj| is the number of tuples in QIj. Example: c1(dyspepsia) = c1(pneumonia) = 2, c2(flu) = 2 and |QI1| = |QI2| = 4, so the condition 2/4 ≤ 1/2 holds and the partition is 2-diverse
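
Definition 2 translates almost directly into code. The sketch below is a minimal check, with each QI-group represented only by its list of sensitive values; `is_l_diverse` is a hypothetical name, and the two non-flu values in the second group are assumed for illustration (the slide only states c2(flu) = 2 and |QI2| = 4).

```python
from collections import Counter


def is_l_diverse(partition, l):
    """True if, in every QI-group, the most frequent sensitive value
    accounts for at most a 1/l fraction of the group's tuples."""
    for group in partition:
        most_frequent = max(Counter(group).values())
        # c_j(v) / |QI_j| <= 1/l, written without division.
        if most_frequent * l > len(group):
            return False
    return True


partition = [
    ["pneumonia", "dyspepsia", "dyspepsia", "pneumonia"],  # QI1
    ["flu", "gastritis", "flu", "bronchitis"],             # QI2 (partly assumed)
]
print(is_l_diverse(partition, 2), is_l_diverse(partition, 3))
```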

  12. Specifications of Anatomy cont. DEFINITION 3. (Anatomy) Given an l-diverse partition, anatomy creates a QIT and an ST • QIT has the schema (Aqi1, Aqi2, ..., Aqid, Group-ID) • ST has the schema (Group-ID, As, Count)
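
As a rough sketch of Definition 3, the two tables can be derived from a partition as follows. The tuple values are illustrative, `build_qit_st` is a hypothetical helper name, and each tuple is assumed to be a pair (QI values, sensitive value).

```python
from collections import Counter


def build_qit_st(partition):
    """Return (QIT, ST) for a partition given as a list of QI-groups."""
    qit, st = [], []
    for gid, group in enumerate(partition, start=1):
        for qi_values, _sensitive in group:
            # QIT row: exact QI values plus the Group-ID, no sensitive value.
            qit.append((*qi_values, gid))
        counts = Counter(sensitive for _qi, sensitive in group)
        for value, count in sorted(counts.items()):
            # ST row: (Group-ID, As, Count).
            st.append((gid, value, count))
    return qit, st


# Illustrative microdata: ((Age, Zipcode), Disease), already split into two groups.
partition = [
    [((23, 11000), "pneumonia"), ((27, 13000), "dyspepsia"),
     ((35, 59000), "dyspepsia"), ((59, 12000), "pneumonia")],
    [((61, 54000), "flu"), ((65, 25000), "gastritis"),
     ((65, 25000), "flu"), ((70, 30000), "bronchitis")],
]
qit, st = build_qit_st(partition)
for row in st:
    print(row)
```

Joining QIT and ST on Group-ID pairs every tuple of a group with every sensitive value of that group, which is what breaks the exact link between an individual and a disease.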

  13. Privacy properties THEOREM 1. Given a pair of QIT and ST, an adversary can correctly infer the sensitive value of any individual with probability at most 1/l

  14. Comparison with generalization • Compare with generalization under two assumptions: • A1: the adversary has the QI-values of the target individual • A2: the adversary also knows that the individual is definitely in the microdata • If A1 and A2 are true, anatomy is as good as generalization: the 1/l bound holds • If A1 is true and A2 is false, generalization is stronger • If A1 and A2 are false, generalization is still stronger

  15. Preserving Data Correlation • Examine the correlation between Age and Disease in T using probability density function pdf • Example: t1

  16. Preserving Data Correlation cont. • To re-construct an approximate pdf of t1 from the generalization table:

  17. Preserving Data Correlation cont. • To re-construct an approximate pdf of t1 from the QIT and ST tables:

  18. Preserving Data Correlation cont. • For a more rigorous comparison, calculate the “L2 distance” between the re-constructed pdf and the exact pdf of t1 • The distance for anatomy is 0.5, while the distance for generalization is 22.5 • Anatomy provides better re-constructions of the probability density functions of all tuples

  19. Preserving Data Correlation cont. • Measure the error of each re-constructed pdf by its distance from the exact pdf • Objective: obtain a minimal re-construction error (RCE), accumulated over all tuples t in T
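
The error measure can be illustrated on a discrete domain. The sketch below assumes the error is the sum of squared differences between the re-constructed and the exact pdf, and that the RCE is the sum of these errors over all tuples. It also assumes that anatomy re-constructs t1 with its exact age and an even split over the two diseases of its QI-group; the age value 23 is illustrative. Under these assumptions the anatomy error for t1 comes out at the 0.5 quoted in the slides; the generalization figure is not re-derived here.

```python
def pdf_error(exact, approx):
    """Sum of squared differences between two discrete pdfs (dict: point -> mass)."""
    points = set(exact) | set(approx)
    return sum((approx.get(x, 0.0) - exact.get(x, 0.0)) ** 2 for x in points)


def rce(pdf_pairs):
    """Re-construction error: total pdf_error over all (exact, approx) pairs."""
    return sum(pdf_error(exact, approx) for exact, approx in pdf_pairs)


# Exact pdf of t1: all mass on its true (Age, Disease) point.
exact_t1 = {(23, "pneumonia"): 1.0}
# Assumed anatomy re-construction: exact age, disease split evenly within the group.
anatomy_t1 = {(23, "pneumonia"): 0.5, (23, "dyspepsia"): 0.5}

err_t1 = pdf_error(exact_t1, anatomy_t1)
print(err_t1)
```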

  20. Nearly-Optimal Anatomizing Algorithm • The authors propose an efficient algorithm for anatomizing tables that nearly minimizes the RCE • The resulting QIT and ST achieve an RCE that deviates from the lower bound by a factor of less than 1 + 1/n, where n is the size of T • The algorithm has linear I/O complexity O(n/b), where b is the page size

  21. Nearly-Optimal Anatomizing Algorithm cont. PROPERTY 1. At the end of the group-creation phase, each non-empty bucket has only one tuple. PROPERTY 2. The set S' always includes at least one QI-group. PROPERTY 3. After the residue-assignment phase, each QI-group has at least l tuples, all with distinct sensitive attribute values.
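
The two phases named in these properties can be sketched in memory as follows. This is a simplified illustration rather than the authors' exact pseudocode: it ignores the I/O aspects, `anatomize` is a hypothetical name, tuples are assumed to be (QI values, sensitive value) pairs, and the input is assumed to be eligible (no sensitive value occurs in more than n/l tuples).

```python
import random
from collections import defaultdict


def anatomize(tuples, l, seed=0):
    """Partition tuples into QI-groups with at least l distinct sensitive values each."""
    rng = random.Random(seed)
    buckets = defaultdict(list)            # one bucket per sensitive value
    for t in tuples:
        buckets[t[1]].append(t)

    groups = []
    # Group-creation phase: while at least l buckets are non-empty,
    # take one random tuple from each of the l largest buckets.
    while sum(1 for b in buckets.values() if b) >= l:
        largest = sorted((s for s in buckets if buckets[s]),
                         key=lambda s: len(buckets[s]), reverse=True)[:l]
        group = [buckets[s].pop(rng.randrange(len(buckets[s]))) for s in largest]
        groups.append(group)

    # Residue-assignment phase: put each leftover tuple into a random
    # group that does not yet contain its sensitive value (the set S').
    for value, bucket in buckets.items():
        for t in bucket:
            candidates = [g for g in groups if all(u[1] != value for u in g)]
            if not candidates:
                raise ValueError("input is not eligible for this l")
            rng.choice(candidates).append(t)
    return groups


data = [((23,), "a"), ((27,), "a"), ((35,), "b"), ((59,), "b"), ((61,), "c")]
groups = anatomize(data, l=2)
for g in groups:
    print(g)
```

Always drawing from the largest buckets is what keeps any single sensitive value from piling up in the residue.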

  22. Experiments • Dataset: CENSUS, containing the personal information of 500k American adults, with 9 discrete attributes • Two sets of microdata tables were created • Set 1: 5 tables denoted as OCC-3, ..., OCC-7, so that OCC-d (3 ≤ d ≤ 7) uses the first d attributes as QI-attributes and Occupation as the sensitive attribute As • Set 2: 5 tables denoted as SAL-3, ..., SAL-7, so that SAL-d (3 ≤ d ≤ 7) uses the first d attributes as QI-attributes and Salary-class as the sensitive attribute As

  23. Experiments cont.

  24. Experiments cont.

  25. Experiments cont.

  26. Experiments cont.

  27. Conclusion • Anatomy was designed to avoid the heavy information loss of generalization while still preserving privacy • Anatomy has a significantly lower error rate than generalization • Several items require further research • Multiple sensitive attributes • Effective mining of patterns in microdata

  28. Agenda • Other anonymization techniques • Anatomization • Information metrics • Extended scenarios

  29. Information Metrics • General purpose metrics • Special purpose metrics • Trade-off metrics

  30. General Purpose Metrics • General idea: measure “similarity” between the original data and the anonymized data • Minimal distortion metric (Samarati 2001; Sweeney 2002; Wang and Fung 2006) • Charge a penalty to each instance of a value generalized or suppressed (independently of other records) • ILoss (Xiao and Tao 2006) • Charge a penalty when a specific value is generalized
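
As a small illustration of the ILoss idea, a commonly quoted form charges a generalized value (|v_g| − 1)/|D_A|, where |v_g| is the number of domain values covered by the generalized value and |D_A| is the size of the attribute's domain. The sketch assumes that form; the domain size and interval width are made-up numbers and `iloss` is a hypothetical name.

```python
def iloss(values_covered, domain_size):
    """Penalty for one generalized value: 0 if not generalized, close to 1 if fully generalized."""
    return (values_covered - 1) / domain_size


# Assumed example: Age has 50 distinct values and a cell is generalized
# to an interval covering 10 of them.
print(iloss(10, 50))   # partially generalized
print(iloss(1, 50))    # original value kept: no loss
```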

  31. General Purpose Metrics cont. • Discernibility Metric (DM) (K-OPTIMIZE, Mondrian, l-diversity …) • Charge a penalty to each record for being indistinguishable from other records
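
A minimal sketch of the discernibility metric, assuming its usual form: each record is charged the size of the group it is indistinguishable within, and each suppressed record is charged the size of the whole table. `discernibility` is a hypothetical name.

```python
def discernibility(group_sizes, num_suppressed=0):
    """DM penalty for a table described by its QI-group sizes."""
    total_records = sum(group_sizes) + num_suppressed
    # Each of the s records in a group of size s is charged s.
    penalty = sum(s * s for s in group_sizes)
    # Each suppressed record is charged the full table size.
    penalty += num_suppressed * total_records
    return penalty


print(discernibility([4, 4]))        # two groups of four
print(discernibility([2, 2, 2, 2]))  # finer groups give a lower penalty
```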

  32. Special Purpose Metrics • Classification: Classification metric (CM) (Iyengar 2002) • Charge a penalty for each record suppressed or generalized to a group in which the record’s class is not the majority class • Query • Query error: count queries • Query imprecision: overlapped range
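
The classification metric can be sketched in the same spirit. The version below assumes the penalty is normalized by the number of records and that each group is given as the list of class labels of its records; `classification_metric` is a hypothetical name and the labels are made up.

```python
from collections import Counter


def classification_metric(groups, num_suppressed=0):
    """Fraction of records that are suppressed or not in their group's majority class."""
    total = num_suppressed
    penalty = num_suppressed              # every suppressed record is penalized
    for labels in groups:
        total += len(labels)
        majority_count = Counter(labels).most_common(1)[0][1]
        penalty += len(labels) - majority_count
    return penalty / total


groups = [["yes", "yes", "no"], ["no", "no"]]
print(classification_metric(groups))
```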

  33. Extended Scenarios • Multiple release publishing • Continuous release publishing • Collaborative/distributed publishing

  34. Other types of data • High dimensional transaction data • Market basket, web queries • Moving objects data • Location based services • Textual data
