Towards Achieving Anonymity
Towards Achieving Anonymity An Zhu
Introduction • Collect and analyze personal data • Infer trends and patterns • Making the personal data “public” • Joining multiple sources • Third party involvement • Privacy concerns • Q: How to share such data?
Not Sufficient! [Sweeney 00'] • Removing unique identifiers is not enough: rows can still be re-identified by joining the remaining quasi-identifiers with a public database. [figure: de-identified records linked to a public database]
Anonymize the Quasi-Identifiers! [figure: with the quasi-identifiers anonymized, the join with the public database no longer re-identifies rows]
Q: How to share such data? • Anonymize the quasi-identifiers • Suppress information • Privacy guarantee: anonymity • Quality: the amount of suppressed information • Clustering • Privacy guarantee: cluster size • Quality: various clustering measures
k-anonymized Table [Samarati 01’] Each row is identical to at least k-1 other rows
Definition: k-anonymity • Input: a table consisting of n rows, each with m attributes (quasi-identifiers) • Output: suppress some entries so that each row is identical to at least k-1 other rows • Objective: minimize the number of suppressed entries
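The definition is easy to check mechanically: a table is k-anonymous when every row's tuple of (possibly suppressed) values occurs at least k times, and the cost is the number of suppressed entries. A minimal sketch; the toy table and the '*' wildcard convention are illustrative, not from the talk:

```python
from collections import Counter

def is_k_anonymous(table, k):
    """Every row must be identical to at least k-1 other rows."""
    counts = Counter(tuple(row) for row in table)
    return all(c >= k for c in counts.values())

def suppression_cost(table):
    """Number of entries replaced by the wildcard '*'."""
    return sum(cell == '*' for row in table for cell in row)

# Toy 2-anonymized table: rows were made identical in pairs by
# suppressing the entries where they disagreed.
suppressed = [
    ['M', '*', '02139'],
    ['M', '*', '02139'],
    ['F', '34', '*'],
    ['F', '34', '*'],
]
print(is_k_anonymous(suppressed, 2))  # True
print(suppression_cost(suppressed))   # 4
```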
Past Work and New Results • [MW 04'] • NP-hardness for a large alphabet • O(k log k)-approximation • [AFKMPTZ 05'] • NP-hardness even for a ternary alphabet • O(k)-approximation • 1.5-approximation for 2-anonymity • 2-approximation for 3-anonymity
Graph Representation • One vertex per row (A–F) • W(e) = Hamming distance between the two rows. [figure: weighted graph on vertices A–F]
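The construction can be set up directly: vertices are rows, edge weights are pairwise Hamming distances. A small sketch with made-up binary rows (the actual rows behind the slide's figure are not given):

```python
from itertools import combinations

def hamming(r1, r2):
    """Hamming distance: number of positions where the rows differ."""
    return sum(a != b for a, b in zip(r1, r2))

rows = {'A': '0110', 'B': '0100', 'C': '1100', 'D': '1111'}
edges = {frozenset((u, v)): hamming(rows[u], rows[v])
         for u, v in combinations(rows, 2)}
print(edges[frozenset(('A', 'B'))])  # 1: these rows differ in one position
```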
Edge Selection I (k = 3) • Each node selects its lightest-weight incident edge. [figure: selected edges among vertices A–F]
Edge Selection II (k = 3) • For components with < k vertices, add more edges. [figure: enlarged components among vertices A–F]
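The two selection phases can be sketched with a union-find structure. This is my reading of the slides (each node keeps its lightest incident edge, then undersized components absorb their lightest outgoing edge), not the authors' code:

```python
from collections import Counter

def anonymity_forest(nodes, w, k):
    """Phase I: each node selects its lightest incident edge.
    Phase II: components with < k vertices absorb their lightest
    outgoing edge. w[frozenset((u, v))] = edge weight."""
    parent = {v: v for v in nodes}

    def find(v):  # union-find with path halving
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    edges = set()
    for u in nodes:  # Phase I
        v = min((x for x in nodes if x != u),
                key=lambda x: w[frozenset((u, x))])
        edges.add(frozenset((u, v)))
        parent[find(u)] = find(v)

    while True:  # Phase II
        sizes = Counter(find(v) for v in nodes)
        small = [root for root, s in sizes.items() if s < k]
        if not small or len(sizes) == 1:
            break
        root = small[0]
        u, v = min(((a, b) for a in nodes for b in nodes
                    if find(a) == root and find(b) != root),
                   key=lambda e: w[frozenset(e)])
        edges.add(frozenset((u, v)))
        parent[find(u)] = find(v)
    return edges

def hamming(r1, r2):
    return sum(a != b for a, b in zip(r1, r2))

rows = {'A': '00000', 'B': '00001', 'C': '00011',
        'D': '11100', 'E': '11110', 'F': '11111'}
w = {frozenset((u, v)): hamming(rows[u], rows[v])
     for u in rows for v in rows if u != v}
forest = anonymity_forest(list(rows), w, 3)
# two components of size 3: {A, B, C} and {D, E, F}
```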
Lemma • The total weight of the selected edges is at most OPT • In the optimal solution, each vertex pays at least the weight of its (k-1)st lightest incident edge • The selection forms a forest: at most one edge is charged per vertex • By construction, each selected edge weighs no more than that vertex's (k-1)st lightest incident edge
Grouping • Ideally, each connected component forms a group • Anonymize the vertices within a group • Total cost of a group ≤ (total edge weight) × (number of nodes), e.g. (2 + 2 + 3 + 3) × 6 • Small components give the O(k) guarantee. [figure: component on A–F with edge weights 2, 3, 0, 3, 2]
Dividing a Component • Root the tree arbitrarily • Divide if the sub-tree and the rest both have ≥ k vertices • Aim: all remaining sub-trees have < k vertices. [figure: sub-trees of size < k grouped into pieces of size ≥ k]
Dividing a Component • Root the tree arbitrarily • Divide if the sub-tree and the rest both have ≥ k vertices • Rotate (re-root) the tree if necessary. [figure: pieces of size ≥ k after rotation]
Dividing a Component • Root the tree arbitrarily • Divide if the sub-tree and the rest both have ≥ k vertices • Terminating condition: each remaining piece has at most max(2k-1, 3k-5) vertices. [figure: remaining sub-trees of size < k]
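A simplified sketch of the division step: cut off a piece as soon as it has gathered at least k vertices, provided at least k vertices remain. This greedy post-order version guarantees every group has ≥ k vertices but not the max(2k-1, 3k-5) bound above, which needs the rotation argument; the tree encoding is mine:

```python
def split_tree(children, root, n, k):
    """Cut a rooted tree into vertex groups of size >= k (assumes n >= k).
    children: dict mapping a node to its list of children; n = #nodes."""
    groups, remaining = [], [n]

    def walk(v):
        comp = [v]  # not-yet-grouped vertices in v's subtree
        for c in children.get(v, []):
            comp += walk(c)
            # cut as soon as the piece and the rest both reach size k
            if len(comp) >= k and remaining[0] - len(comp) >= k:
                groups.append(comp)
                remaining[0] -= len(comp)
                comp = []
        return comp

    leftover = walk(root)
    if leftover:
        groups.append(leftover)
    return groups

# Path 1-2-3-4-5-6-7 rooted at 1, with k = 3.
chain = {i: [i + 1] for i in range(1, 7)}
print(split_tree(chain, 1, 7, 3))  # [[5, 6, 7], [1, 2, 3, 4]]
```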
An Example • [figure: component on A–F with edge weights 2, 3, 0, 3, 2]
An Example • [figure: the component re-rooted at C and divided]
An Example • [figure: the resulting groups] • Estimated cost: 4 × 3 + 3 × 3 • Optimal cost: 3 × 3 + 3 × 3
Past Work and New Results • [MW 04'] • NP-hardness for a large alphabet • O(k log k)-approximation • [AFKMPTZ 05'] • NP-hardness even for a ternary alphabet • O(k)-approximation • 1.5-approximation for 2-anonymity • 2-approximation for 3-anonymity
1.5-approximation • W(e) = Hamming distance between the two rows. [figure: weighted graph on vertices A–F]
Minimum {1,2}-matching • Each vertex is matched to 1 or 2 other vertices. [figure: matching edges among vertices A–F]
Properties • Each component has at most 3 nodes • Components with > 3 nodes are either not possible (each vertex has degree ≤ 2) or not optimal
Qualities • Cost ≤ 2·OPT • For a binary alphabet: ≤ 1.5·OPT • Pair at distance a: OPT pays 2a, we pay 2a • Triple with pairwise distances p, q, r (p, q the two smallest): OPT pays ≥ p + q + r, we pay ≤ 3(p + q) ≤ 2(p + q + r)
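The accounting can be checked concretely: anonymizing a group means suppressing, in every row, each column where the rows disagree. Over a binary alphabet each such column contributes to exactly two of a triple's three pairwise distances, which is where the 1.5 factor comes from. The example rows are mine:

```python
def hamming(r1, r2):
    return sum(a != b for a, b in zip(r1, r2))

def group_cost(rows):
    """Suppress, in every row of the group, each column where
    the rows do not all agree."""
    bad = sum(len(set(col)) > 1 for col in zip(*rows))
    return len(rows) * bad

x, y, z = '00011', '00110', '01100'
p, q, r = hamming(x, y), hamming(y, z), hamming(x, z)
# Binary alphabet: each bad column is counted by exactly 2 of the
# 3 pairwise distances, so cost = 3 * (p + q + r) / 2.
print(group_cost([x, y, z]), 3 * (p + q + r) // 2)  # 12 12
```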
Open Problems • Can we improve O(k)? • Ω(k) lower bound for the graph representation
Open Problems • Can we improve O(k)? • Ω(k) lower bound for the graph representation:
1111111100000000000000000000000000000000
0000000011111111000000000000000000000000
0000000000000000111111110000000000000000
0000000000000000000000001111111100000000
0000000000000000000000000000000011111111
k = 5, d = 16, c = k·d/2
Open Problems • Can we improve O(k)? • Ω(k) lower bound for the graph representation:
10101010101010101010101010101010
11001100110011001100110011001100
11110000111100001111000011110000
11111111000000001111111100000000
11111111111111110000000000000000
k = 5, d = 16, c = 2d
Q: How to share such data? • Anonymize the quasi-identifiers • Suppress information • Privacy guarantee: anonymity • Quality: the amount of suppressed information • Clustering • Privacy guarantee: cluster size • Quality: various clustering measures
Measure • How good are the clusters? • "Tight" clusters are better • Minimize max radius: Gather-k • Minimize max distortion error: Cellular-k (radius × num_nodes) • [figure: example clustering; cost Gather-k: 10, Cellular-k: 624]
Measure • How good are the clusters? • "Tight" clusters are better • Minimize max radius: Gather-k • Minimize max distortion error: Cellular-k (radius × num_nodes) • Handle outliers • Constant approximations!
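The two objectives can be written down directly. Here clusters are (center, members) pairs and points live on a line for simplicity; the toy instance is mine, and I read the Cellular-k cost as summing radius × size over clusters:

```python
def radius(center, members, dist):
    """Largest distance from the center to a member."""
    return max(dist(center, m) for m in members)

def gather_cost(clusters, dist):
    """Gather-k objective: maximum cluster radius."""
    return max(radius(c, ms, dist) for c, ms in clusters)

def cellular_cost(clusters, dist):
    """Cellular-k objective: sum of radius * num_nodes per cluster."""
    return sum(radius(c, ms, dist) * len(ms) for c, ms in clusters)

dist = lambda a, b: abs(a - b)
clusters = [(0, [0, 1, 2]), (10, [8, 10, 12, 13])]
print(gather_cost(clusters, dist))    # 3  (radius of the second cluster)
print(cellular_cost(clusters, dist))  # 18 = 2*3 + 3*4
```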
Comparison • k = 5 • 5-anonymity • Suppress all entries • More distortion • Clustering • Can pick R5 as the center • Less distortion • Distortion is directly related to pair-wise distances
Results [AFKKPTZ 06'] • Gather-k • Tight 2-approximation • Extension to outliers: 4-approximation • Cellular-k • Primal-dual constant-factor approximation • Extensions as well
2-approximation • Assume an optimal value R • Make sure each node has at least k – 1 neighbors within distance 2R. [figure: node A with balls of radius R and 2R]
2-approximation • Assume an optimal value R • Make sure each node has at least k – 1 neighbors within distance 2R. • Pick an arbitrary node as a center and remove all remaining nodes within distance 2R. Repeat until all nodes are gone. • Make sure we can reassign nodes to the selected centers.
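The selection step above can be sketched as follows, assuming the points form a metric and R is a guessed optimal radius (in practice one searches over candidate values of R); the final reassignment step is omitted:

```python
def pick_centers(points, dist, R, k):
    """If every point has >= k-1 others within 2R (necessary when the
    optimal radius is <= R), greedily pick centers that are pairwise
    more than 2R apart; every point then lies within 2R of a center."""
    for p in points:
        if sum(1 for q in points if q != p and dist(p, q) <= 2 * R) < k - 1:
            return None  # the guess R was too small
    centers, remaining = [], list(points)
    while remaining:
        c = remaining[0]  # an arbitrary remaining point becomes a center
        centers.append(c)
        remaining = [q for q in remaining if dist(c, q) > 2 * R]
    return centers

dist = lambda a, b: abs(a - b)
points = [0, 1, 2, 10, 11, 12]
print(pick_centers(points, dist, 1, 3))  # [0, 10]
```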
Optimal Solution • [figure: optimal clusters of radius R, labeled 1 and 2]