Towards Achieving Anonymity
Towards Achieving Anonymity An Zhu
Introduction • Collect and analyze personal data • Infer trends and patterns • Making the personal data “public” • Joining multiple sources • Third party involvement • Privacy concerns • Q: How to share such data?
Not Sufficient! [Sweeney 00'] • Removing unique identifiers is not enough: rows can still be re-identified by joining the remaining quasi-identifiers with a public database. [figure: de-identified records linked to a public database]
Anonymize the Quasi-Identifiers! [figure: with the quasi-identifiers anonymized, the join with the public database no longer re-identifies rows]
Q: How to share such data? • Anonymize the quasi-identifiers • Suppress information • Privacy guarantee: anonymity • Quality: the amount of suppressed information • Clustering • Privacy guarantee: cluster size • Quality: various clustering measures
k-anonymized Table [Samarati 01’] Each row is identical to at least k-1 other rows
Definition: k-anonymity • Input: a table consisting of n rows, each with m attributes (quasi-identifiers) • Output: suppress some entries so that each row is identical to at least k-1 other rows • Objective: minimize the number of suppressed entries
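The definition is easy to check mechanically: a table is k-anonymous when every row's tuple of (possibly suppressed) values occurs at least k times, and the cost is the number of suppressed entries. A minimal sketch; the toy table and the '*' wildcard convention are illustrative, not from the talk:

```python
from collections import Counter

def is_k_anonymous(table, k):
    """Every row must be identical to at least k-1 other rows."""
    counts = Counter(tuple(row) for row in table)
    return all(c >= k for c in counts.values())

def suppression_cost(table):
    """Number of entries replaced by the wildcard '*'."""
    return sum(cell == '*' for row in table for cell in row)

# Toy 2-anonymized table: rows were made identical in pairs by
# suppressing the entries where they disagreed.
suppressed = [
    ['M', '*', '02139'],
    ['M', '*', '02139'],
    ['F', '34', '*'],
    ['F', '34', '*'],
]
print(is_k_anonymous(suppressed, 2))  # True
print(suppression_cost(suppressed))   # 4
```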
Past Work and New Results • [MW 04'] • NP-hardness for a large alphabet • O(k log k)-approximation • [AFKMPTZ 05'] • NP-hardness even for a ternary alphabet • O(k)-approximation • 1.5-approximation for 2-anonymity • 2-approximation for 3-anonymity
Graph Representation • One vertex per row (A–F) • W(e) = Hamming distance between the two rows. [figure: weighted graph on vertices A–F]
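The construction can be set up directly: vertices are rows, edge weights are pairwise Hamming distances. A small sketch with made-up binary rows (the actual rows behind the slide's figure are not given):

```python
from itertools import combinations

def hamming(r1, r2):
    """Hamming distance: number of positions where the rows differ."""
    return sum(a != b for a, b in zip(r1, r2))

rows = {'A': '0110', 'B': '0100', 'C': '1100', 'D': '1111'}
edges = {frozenset((u, v)): hamming(rows[u], rows[v])
         for u, v in combinations(rows, 2)}
print(edges[frozenset(('A', 'B'))])  # 1: these rows differ in one position
```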
Edge Selection I (k = 3) • Each node selects its lightest-weight incident edge. [figure: selected edges among vertices A–F]
Edge Selection II (k = 3) • For components with < k vertices, add more edges. [figure: enlarged components among vertices A–F]
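The two selection phases can be sketched with a union-find structure. This is my reading of the slides (each node keeps its lightest incident edge, then undersized components absorb their lightest outgoing edge), not the authors' code:

```python
from collections import Counter

def anonymity_forest(nodes, w, k):
    """Phase I: each node selects its lightest incident edge.
    Phase II: components with < k vertices absorb their lightest
    outgoing edge. w[frozenset((u, v))] = edge weight."""
    parent = {v: v for v in nodes}

    def find(v):  # union-find with path halving
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    edges = set()
    for u in nodes:  # Phase I
        v = min((x for x in nodes if x != u),
                key=lambda x: w[frozenset((u, x))])
        edges.add(frozenset((u, v)))
        parent[find(u)] = find(v)

    while True:  # Phase II
        sizes = Counter(find(v) for v in nodes)
        small = [root for root, s in sizes.items() if s < k]
        if not small or len(sizes) == 1:
            break
        root = small[0]
        u, v = min(((a, b) for a in nodes for b in nodes
                    if find(a) == root and find(b) != root),
                   key=lambda e: w[frozenset(e)])
        edges.add(frozenset((u, v)))
        parent[find(u)] = find(v)
    return edges

def hamming(r1, r2):
    return sum(a != b for a, b in zip(r1, r2))

rows = {'A': '00000', 'B': '00001', 'C': '00011',
        'D': '11100', 'E': '11110', 'F': '11111'}
w = {frozenset((u, v)): hamming(rows[u], rows[v])
     for u in rows for v in rows if u != v}
forest = anonymity_forest(list(rows), w, 3)
# two components of size 3: {A, B, C} and {D, E, F}
```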
Lemma • The total weight of the selected edges is at most OPT • In the optimal solution, each vertex pays at least the weight of its (k-1)st lightest incident edge • The selection forms a forest: at most one edge is charged per vertex • By construction, each selected edge weighs no more than that vertex's (k-1)st lightest incident edge
Grouping • Ideally, each connected component forms a group • Anonymize the vertices within a group • Total cost of a group ≤ (total edge weight) × (number of nodes), e.g. (2 + 2 + 3 + 3) × 6 • Small components give the O(k) guarantee. [figure: component on A–F with edge weights 2, 3, 0, 3, 2]
Dividing a Component • Root the tree arbitrarily • Divide if the sub-tree and the rest both have ≥ k vertices • Aim: all remaining sub-trees have < k vertices. [figure: sub-trees of size < k grouped into pieces of size ≥ k]
Dividing a Component • Root the tree arbitrarily • Divide if the sub-tree and the rest both have ≥ k vertices • Rotate (re-root) the tree if necessary. [figure: pieces of size ≥ k after rotation]
Dividing a Component • Root the tree arbitrarily • Divide if the sub-tree and the rest both have ≥ k vertices • Terminating condition: each remaining piece has at most max(2k-1, 3k-5) vertices. [figure: remaining sub-trees of size < k]
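A simplified sketch of the division step: cut off a piece as soon as it has gathered at least k vertices, provided at least k vertices remain. This greedy post-order version guarantees every group has ≥ k vertices but not the max(2k-1, 3k-5) bound above, which needs the rotation argument; the tree encoding is mine:

```python
def split_tree(children, root, n, k):
    """Cut a rooted tree into vertex groups of size >= k (assumes n >= k).
    children: dict mapping a node to its list of children; n = #nodes."""
    groups, remaining = [], [n]

    def walk(v):
        comp = [v]  # not-yet-grouped vertices in v's subtree
        for c in children.get(v, []):
            comp += walk(c)
            # cut as soon as the piece and the rest both reach size k
            if len(comp) >= k and remaining[0] - len(comp) >= k:
                groups.append(comp)
                remaining[0] -= len(comp)
                comp = []
        return comp

    leftover = walk(root)
    if leftover:
        groups.append(leftover)
    return groups

# Path 1-2-3-4-5-6-7 rooted at 1, with k = 3.
chain = {i: [i + 1] for i in range(1, 7)}
print(split_tree(chain, 1, 7, 3))  # [[5, 6, 7], [1, 2, 3, 4]]
```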
An Example • [figure: component on A–F with edge weights 2, 3, 0, 3, 2]
An Example • [figure: the component re-rooted at C and divided]
An Example • [figure: the resulting groups] • Estimated cost: 4 × 3 + 3 × 3 • Optimal cost: 3 × 3 + 3 × 3
Past Work and New Results • [MW 04'] • NP-hardness for a large alphabet • O(k log k)-approximation • [AFKMPTZ 05'] • NP-hardness even for a ternary alphabet • O(k)-approximation • 1.5-approximation for 2-anonymity • 2-approximation for 3-anonymity
1.5-approximation • W(e) = Hamming distance between the two rows. [figure: weighted graph on vertices A–F]
Minimum {1,2}-matching • Each vertex is matched to 1 or 2 other vertices. [figure: matching edges among vertices A–F]
Properties • Each component has at most 3 nodes • Components with > 3 nodes are either not possible (each vertex has degree ≤ 2) or not optimal
Qualities • Cost ≤ 2·OPT • For a binary alphabet: ≤ 1.5·OPT • Pair at distance a: OPT pays 2a, we pay 2a • Triple with pairwise distances p, q, r (p, q the two smallest): OPT pays ≥ p + q + r, we pay ≤ 3(p + q) ≤ 2(p + q + r)
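The accounting can be checked concretely: anonymizing a group means suppressing, in every row, each column where the rows disagree. Over a binary alphabet each such column contributes to exactly two of a triple's three pairwise distances, which is where the 1.5 factor comes from. The example rows are mine:

```python
def hamming(r1, r2):
    return sum(a != b for a, b in zip(r1, r2))

def group_cost(rows):
    """Suppress, in every row of the group, each column where
    the rows do not all agree."""
    bad = sum(len(set(col)) > 1 for col in zip(*rows))
    return len(rows) * bad

x, y, z = '00011', '00110', '01100'
p, q, r = hamming(x, y), hamming(y, z), hamming(x, z)
# Binary alphabet: each bad column is counted by exactly 2 of the
# 3 pairwise distances, so cost = 3 * (p + q + r) / 2.
print(group_cost([x, y, z]), 3 * (p + q + r) // 2)  # 12 12
```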
Open Problems • Can we improve O(k)? • Ω(k) lower bound for the graph representation
Open Problems • Can we improve O(k)? • Ω(k) lower bound for the graph representation:
1111111100000000000000000000000000000000
0000000011111111000000000000000000000000
0000000000000000111111110000000000000000
0000000000000000000000001111111100000000
0000000000000000000000000000000011111111
k = 5, d = 16, c = k·d/2
Open Problems • Can we improve O(k)? • Ω(k) lower bound for the graph representation:
10101010101010101010101010101010
11001100110011001100110011001100
11110000111100001111000011110000
11111111000000001111111100000000
11111111111111110000000000000000
k = 5, d = 16, c = 2d
Q: How to share such data? • Anonymize the quasi-identifiers • Suppress information • Privacy guarantee: anonymity • Quality: the amount of suppressed information • Clustering • Privacy guarantee: cluster size • Quality: various clustering measures
Measure • How good are the clusters? • "Tight" clusters are better • Minimize max radius: Gather-k • Minimize max distortion error: Cellular-k (radius × num_nodes) • [figure: example clustering; cost Gather-k: 10, Cellular-k: 624]
Measure • How good are the clusters? • "Tight" clusters are better • Minimize max radius: Gather-k • Minimize max distortion error: Cellular-k (radius × num_nodes) • Handle outliers • Constant approximations!
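The two objectives can be written down directly. Here clusters are (center, members) pairs and points live on a line for simplicity; the toy instance is mine, and I read the Cellular-k cost as summing radius × size over clusters:

```python
def radius(center, members, dist):
    """Largest distance from the center to a member."""
    return max(dist(center, m) for m in members)

def gather_cost(clusters, dist):
    """Gather-k objective: maximum cluster radius."""
    return max(radius(c, ms, dist) for c, ms in clusters)

def cellular_cost(clusters, dist):
    """Cellular-k objective: sum of radius * num_nodes per cluster."""
    return sum(radius(c, ms, dist) * len(ms) for c, ms in clusters)

dist = lambda a, b: abs(a - b)
clusters = [(0, [0, 1, 2]), (10, [8, 10, 12, 13])]
print(gather_cost(clusters, dist))    # 3  (radius of the second cluster)
print(cellular_cost(clusters, dist))  # 18 = 2*3 + 3*4
```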
Comparison • k = 5 • 5-anonymity • Suppress all entries • More distortion • Clustering • Can pick R5 as the center • Less distortion • Distortion is directly related to pair-wise distances
Results [AFKKPTZ 06'] • Gather-k • Tight 2-approximation • Extension to outliers: 4-approximation • Cellular-k • Primal-dual constant-factor approximation • Extensions as well
2-approximation • Assume an optimal value R • Make sure each node has at least k – 1 neighbors within distance 2R. [figure: node A with balls of radius R and 2R]
2-approximation • Assume an optimal value R • Make sure each node has at least k – 1 neighbors within distance 2R. • Pick an arbitrary node as a center and remove all remaining nodes within distance 2R. Repeat until all nodes are gone. • Make sure we can reassign nodes to the selected centers.
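The selection step above can be sketched as follows, assuming the points form a metric and R is a guessed optimal radius (in practice one searches over candidate values of R); the final reassignment step is omitted:

```python
def pick_centers(points, dist, R, k):
    """If every point has >= k-1 others within 2R (necessary when the
    optimal radius is <= R), greedily pick centers that are pairwise
    more than 2R apart; every point then lies within 2R of a center."""
    for p in points:
        if sum(1 for q in points if q != p and dist(p, q) <= 2 * R) < k - 1:
            return None  # the guess R was too small
    centers, remaining = [], list(points)
    while remaining:
        c = remaining[0]  # an arbitrary remaining point becomes a center
        centers.append(c)
        remaining = [q for q in remaining if dist(c, q) > 2 * R]
    return centers

dist = lambda a, b: abs(a - b)
points = [0, 1, 2, 10, 11, 12]
print(pick_centers(points, dist, 1, 3))  # [0, 10]
```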
Optimal Solution • [figure: optimal clusters of radius R, labeled 1 and 2]