Privacy-Preserving Data Publishing

Privacy-Preserving Data Publishing • Donghui Zhang, Northeastern University • Acknowledgement: some slides come from Yufei Tao and Dimitris Sacharidis.

Motivation • Several agencies, institutions, bureaus, and organizations make (sensitive) data involving people publicly available • Such data is termed microdata (as opposed to aggregated macrodata) and is used for analysis • Publication is often required and imposed by law • To protect privacy, microdata are sanitized: explicit identifiers (SSN, name, phone #) are removed • Is this sufficient for preserving privacy? No! Sanitized microdata remain susceptible to link attacks • Publicly available databases (voter lists, city directories) can reveal the “hidden” identity

Link attack example • [Sweeney01] managed to re-identify the medical record of the governor of Massachusetts • MA collects and publishes sanitized medical data for state employees (microdata; left circle) • The voter registration list of MA is publicly available data (right circle) • Looking for the governor’s record, join the tables: 6 people had his birth date, 3 of them were men, and 1 lived in his zipcode • According to the 1990 US census data, 87% of the population are unique based on (zipcode, gender, dob)
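The join behind a link attack can be sketched in a few lines of Python. All records, names, and field names below are invented for illustration and do not come from the slides; the point is only that a unique match on (zipcode, gender, dob) ties a name to a diagnosis.

```python
# Hypothetical toy data: neither table alone links a name to a diagnosis.
medical = [  # sanitized microdata, explicit identifiers already removed
    {"zipcode": "02138", "gender": "M", "dob": "1950-01-01", "diagnosis": "flu"},
    {"zipcode": "02139", "gender": "F", "dob": "1962-03-15", "diagnosis": "gastritis"},
    {"zipcode": "02138", "gender": "F", "dob": "1950-01-01", "diagnosis": "bronchitis"},
]
voters = [  # public voter list: names plus the same quasi-identifiers
    {"name": "Person A", "zipcode": "02138", "gender": "M", "dob": "1950-01-01"},
    {"name": "Person B", "zipcode": "02139", "gender": "F", "dob": "1962-03-15"},
]

QI = ("zipcode", "gender", "dob")

def link(medical, voters, qi=QI):
    """Join the two tables on the quasi-identifier; keep only unique matches."""
    index = {}
    for rec in medical:
        index.setdefault(tuple(rec[a] for a in qi), []).append(rec)
    reidentified = {}
    for voter in voters:
        matches = index.get(tuple(voter[a] for a in qi), [])
        if len(matches) == 1:  # a unique match re-identifies the person
            reidentified[voter["name"]] = matches[0]["diagnosis"]
    return reidentified

result = link(medical, voters)
```

Removing names from `medical` did not help here: every voter matches exactly one medical record on the quasi-identifier.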

Microdata

Inference attack • [Figure: a published table with its quasi-identifier (QI) attributes, and an adversary]

k-anonymity [Samarati and Sweeney02] • Transform (generalize) the QI values into less specific forms, so that every tuple shares its QI values with at least k-1 other tuples

Generalization • Transform each QI value into a less specific form • [Figure: a generalized table, and an adversary]
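A minimal sketch of generalization followed by a k-anonymity check, assuming two QI attributes (age and zipcode), ages coarsened to fixed-width intervals, and zipcodes coarsened to a prefix. The function names, parameters, and the four records are illustrative assumptions, not taken from the slides.

```python
from collections import Counter

def generalize(record, age_width=10, zip_digits=2):
    """Coarsen an (age, zipcode) pair: age to an interval, zipcode to a prefix."""
    age, zipcode = record
    lo = (age // age_width) * age_width
    masked = zipcode[:zip_digits] + "*" * (len(zipcode) - zip_digits)
    return ((lo, lo + age_width - 1), masked)

def is_k_anonymous(qi_values, k):
    """True if every distinct QI combination occurs at least k times."""
    return all(count >= k for count in Counter(qi_values).values())

records = [(21, "12000"), (23, "12500"), (36, "27000"), (37, "27900")]
generalized = [generalize(r) for r in records]
```

The raw records are all distinct, so they are not 2-anonymous; after generalization each record shares its QI values with one other record.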

Graphically… • [Figure: tuples plotted as points, with axis ticks spanning 21 to 56 and 12000 to 35000; Alice and Bob are marked]

Why not… • How many people with age in [30, 50] contracted flu? • [Figure: scatter plot of the tuples, with axis ticks spanning 21 to 56 and 12000 to 35000]

k-anonymity • Query: how many people with age in [30, 50] contracted flu? • A generalization with low utility answers queries less accurately: [0..3] • A generalization with high utility answers queries more accurately: 2
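The gap between the two answers can be reproduced with a small sketch. It assumes each QI-group is published as an age interval together with the number of flu tuples in it; the two groupings below are hypothetical and chosen only to reproduce the answers [0..3] and 2.

```python
def count_bounds(groups, lo, hi):
    """groups: list of ((age_lo, age_hi), n_flu). Return (min, max) possible count."""
    lower = upper = 0
    for (a, b), n_flu in groups:
        if lo <= a and b <= hi:
            # Group lies entirely inside the query range: all tuples qualify.
            lower += n_flu
            upper += n_flu
        elif a <= hi and b >= lo:
            # Partial overlap: anywhere from 0 to n_flu tuples may qualify.
            upper += n_flu
    return lower, upper

coarse = [((21, 60), 3)]                  # one wide rectangle (low utility)
fine = [((30, 45), 2), ((51, 60), 1)]     # tighter rectangles (high utility)

coarse_answer = count_bounds(coarse, 30, 50)
fine_answer = count_bounds(fine, 30, 50)
```

With the wide rectangle the analyst can only say "between 0 and 3"; with the tighter rectangles the bounds coincide.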

k-anonymity with utility • Among all generalizations that enforce k-anonymity, we should maximize utility by minimizing the “rectangle” sizes! • Several measures exist, e.g., minimizing the maximal perimeter of the rectangles.
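As a rough illustration of the maximal-perimeter measure, the sketch below treats each QI-group as a set of 2D points and scores a grouping by the largest bounding-rectangle perimeter. The coordinates are abstract and hypothetical; real QI attributes would typically be normalized first.

```python
def perimeter(group):
    """Perimeter of the minimum bounding rectangle of a group of 2D points."""
    xs = [x for x, _ in group]
    ys = [y for _, y in group]
    return 2 * ((max(xs) - min(xs)) + (max(ys) - min(ys)))

def max_perimeter(groups):
    """The utility measure: smaller is better."""
    return max(perimeter(g) for g in groups)

# Two 2-anonymous groupings of the same four points.
tight = [[(0, 0), (1, 1)], [(4, 4), (5, 5)]]
loose = [[(0, 0), (5, 5)], [(1, 1), (4, 4)]]
```

Both groupings satisfy 2-anonymity, but `tight` has a much smaller maximal perimeter and therefore higher utility under this measure.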

Mondrian [LDR06] • Recursive half-plane partitioning, alternating dimensions • Example: let k=2
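The recursive partitioning can be sketched as follows. This is a simplified reading of the slide (median split, strictly alternating between two dimensions, stop when a split would leave fewer than k points on a side), not the full algorithm of [LDR06]; points with equal coordinates at the median may end up on either side.

```python
def mondrian(points, k, dim=0):
    """Recursively split 2D points at the median, alternating dimensions."""
    if len(points) < 2 * k:
        return [points]            # cannot split without violating k-anonymity
    pts = sorted(points, key=lambda p: p[dim])
    mid = len(pts) // 2            # both halves keep at least k points
    nxt = 1 - dim                  # alternate between dimension 0 and 1
    return mondrian(pts[:mid], k, nxt) + mondrian(pts[mid:], k, nxt)

# Hypothetical points; k = 2 as on the slide.
points = [(1, 1), (2, 5), (3, 2), (4, 6), (5, 1), (6, 5), (7, 2), (8, 6)]
groups = mondrian(points, k=2)
```

Each returned group becomes one QI-group, generalized to the bounding rectangle of its points.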

Mondrian [LDR06] • Unbounded approximation ratio! • Example: let k=4

Our contributions [DXT+07] • Proved that finding the optimal partitioning is NP-hard. • Proved that finding a partitioning with approximation ratio less than 1.25 is also NP-hard. • Provided three algorithms with tradeoffs between complexity and approximation ratio.

Divide-And-Group (DAG) • Divide the space into square cells of a proper size • Find a set of non-overlapping tiles of 2 x 2 cells to cover the points, such that each tile covers at least k points • Assign the remaining (uncovered) points to the nearest tile

Min-MBR-Group (MMG) • For each point p, find the smallest MBR (minimum bounding rectangle) that covers at least k points including p • Find a set of non-overlapping MBRs from the result of the previous step • Assign the remaining points to the nearest MBR

Nearest-Neighbor-Group (NNG) • For each point p, find the MBR that covers p and its k-1 nearest neighbors • Find a set of non-overlapping MBRs from the result of the previous step • Assign the remaining points to the nearest MBR
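The three NNG steps can be sketched as below. The slide does not say how the non-overlapping set is chosen, so this sketch assumes a greedy rule (smallest MBRs first) and Euclidean distance; the function names are illustrative and the code is not the implementation from [DXT+07]. It needs Python 3.8+ for `math.dist`.

```python
import math

def mbr(pts):
    """Minimum bounding rectangle as (x_min, y_min, x_max, y_max)."""
    xs, ys = zip(*pts)
    return (min(xs), min(ys), max(xs), max(ys))

def overlaps(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def dist_to_rect(p, r):
    """Euclidean distance from point p to rectangle r (0 if p is inside)."""
    dx = max(r[0] - p[0], 0, p[0] - r[2])
    dy = max(r[1] - p[1], 0, p[1] - r[3])
    return math.hypot(dx, dy)

def nng(points, k):
    # Step 1: one candidate MBR per point, covering it and its k-1 nearest neighbors.
    candidates = set()
    for p in points:
        nearest = sorted(points, key=lambda q: math.dist(p, q))[:k]  # includes p
        candidates.add(mbr(nearest))
    # Step 2 (assumed greedy rule): take small MBRs first, skip overlapping ones.
    chosen = []
    for r in sorted(candidates, key=lambda r: (r[2] - r[0]) + (r[3] - r[1])):
        if not any(overlaps(r, c) for c in chosen):
            chosen.append(r)
    # Step 3: assign every point to its nearest chosen MBR.
    groups = {r: [] for r in chosen}
    for p in points:
        groups[min(chosen, key=lambda r: dist_to_rect(p, r))].append(p)
    return groups

points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11), (5, 5)]
groups = nng(points, k=3)
```

Because the chosen MBRs are disjoint and each covers at least k points, every resulting group has at least k members; the stray point (5, 5) joins the closer of the two clusters.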

Analysis

Drawback of k-anonymity • In a QI group, if many records have the same sensitive attribute value, an adversary can infer that value with high confidence • [Table labels: quasi-identifier (QI) attributes; sensitive attribute] • If Bob is in this group, he must have pneumonia.

l-diversity [ICDE06] • A QI-group with m tuples is l-diverse, iff each sensitive value appears no more than m/l times in the QI-group. • A table is l-diverse, iff all of its QI-groups are l-diverse. • The above table is 2-diverse. • [Table labels: quasi-identifier (QI) attributes; sensitive attribute; 2 QI-groups]
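The definition above translates directly into a check. In this sketch a table is represented simply as a list of QI-groups, each a list of sensitive values; that representation and the sample values are assumptions made for illustration.

```python
from collections import Counter

def group_is_l_diverse(sensitive_values, l):
    """Each sensitive value may appear at most m/l times in a group of m tuples."""
    m = len(sensitive_values)
    # "count <= m / l" is written as "count * l <= m" to avoid division.
    return all(count * l <= m for count in Counter(sensitive_values).values())

def table_is_l_diverse(qi_groups, l):
    return all(group_is_l_diverse(group, l) for group in qi_groups)

table = [
    ["dyspepsia", "bronchitis"],
    ["flu", "gastritis", "flu", "gastritis"],
]
```

This hypothetical table is 2-diverse but not 3-diverse, while a group such as `["pneumonia", "pneumonia"]` fails even for l = 2.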

What l-diversity guarantees • From an l-diverse generalized table, an adversary (without any prior knowledge) can infer the sensitive value of each individual with confidence at most 1/l • [Table: a 2-diverse generalized table] • Reference: A. Machanavajjhala et al. l-Diversity: Privacy Beyond k-Anonymity. ICDE 2006
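The 1/l bound follows from the definition: in a group of m tuples no value appears more than m/l times, so the best guess is right with probability at most (m/l)/m = 1/l. A small sketch, assuming the adversary knows only which QI-group the individual falls in and using a hypothetical group:

```python
from collections import Counter
from fractions import Fraction

def max_confidence(sensitive_values):
    """Largest share of the group held by any single sensitive value."""
    counts = Counter(sensitive_values)
    return Fraction(max(counts.values()), len(sensitive_values))

group = ["flu", "gastritis", "flu", "bronchitis"]   # hypothetical 2-diverse group
confidence = max_confidence(group)
```

Here the most frequent value covers half of the group, so the adversary's confidence is exactly 1/2, matching the bound for l = 2.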

Problem with multi-publishing • A hospital keeps track of the medical records collected in the last three months. • The microdata table T(1) and its generalization T*(1) are published in Apr. 2007. • [Tables: microdata T(1); 2-diverse generalization T*(1)]

Problem with multi-publishing • Bob was hospitalized in Mar. 2007 2-diverse Generalization T*(1)

Problem with multi-publishing • One month later, in May 2007 • Some obsolete tuples are deleted from the microdata. Microdata T(1)

Problem with multi-publishing • Bob’s tuple stays. Microdata T(1)

Problem with multi-publishing • Some new records are inserted. Microdata T(2)

Problem with multi-publishing • The hospital published T*(2). 2-diverse Generalization T*(2) Microdata T(2)

Problem with multi-publishing • Consider the previous adversary. 2-diverse Generalization T*(2)

Problem with multi-publishing • What the adversary learns from T*(1). • What the adversary learns from T*(2). • So Bob must have contracted dyspepsia! • A new generalization principle is needed.
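The adversary's reasoning amounts to intersecting the candidate diseases seen in each release. In the sketch below the first set matches the sensitive values of Bob's QI-group in T*(1) as listed on the Signature slide; the second set is an assumption, since the contents of T*(2) are not reproduced in this transcript.

```python
def intersect_candidates(*published_groups):
    """Sensitive values still possible after seeing every release."""
    remaining = set(published_groups[0])
    for group in published_groups[1:]:
        remaining &= set(group)
    return remaining

bob_in_t1 = {"dyspepsia", "bronchitis"}   # Bob's QI-group in T*(1)
bob_in_t2 = {"dyspepsia", "gastritis"}    # assumed contents of Bob's QI-group in T*(2)

bob_disease = intersect_candidates(bob_in_t1, bob_in_t2)
```

Each release is 2-diverse on its own, yet only one candidate survives the intersection.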

m-invariance [SIGMOD07] • A sequence of generalized tables T*(1), …, T*(n) is m-invariant, if and only if • T*(1), …, T*(n) are m-unique, and • each individual has the same signature in every generalized table in which s/he is involved. • Explanation • m-unique: every QI group contains at least m tuples, all with different sensitive values • signature: the set of all sensitive values in the individual’s QI group.

m-unique • A generalized table T*(j) is m-unique, if and only if • each QI-group in T*(j) contains at least m tuples • all tuples in the same QI-group have different sensitive values. A 2-unique generalized table

Signature • The signature of Bob in T*(1) is {dyspepsia, bronchitis} • The signature of Jane in T*(1) is {dyspepsia, flu, gastritis} T*(1)

The m-invariance principle • Lemma: if a sequence of generalized tables {T*(1), …, T*(n)} is m-invariant, then for any individual o involved in any of these tables, we have risk(o) <= 1/m

The m-invariance principle • Lemma: let {T*(1), …, T*(n-1)} be m-invariant. {T*(1), …, T*(n-1), T*(n)} is also m-invariant, if and only if {T*(n-1), T*(n)} is m-invariant • Only T*(n-1) is needed for the generation of T*(n) • T*(1), T*(2), …, T*(n-2) can be discarded
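The definitions of m-unique and signature, together with the lemma above, suggest a pairwise check that a publisher could run before releasing T*(n). The sketch assumes a table is stored as a dict mapping a QI-group id to a list of (individual, sensitive value) pairs, which is the publisher's internal view rather than the published form; the toy tables and the counterfeit tuple `c1` are hypothetical.

```python
def is_m_unique(table, m):
    """Every QI-group has at least m tuples, all with distinct sensitive values."""
    for tuples in table.values():
        values = [value for _, value in tuples]
        if len(values) < m or len(set(values)) != len(values):
            return False
    return True

def signatures(table):
    """Map each individual to the set of sensitive values in their QI-group."""
    result = {}
    for tuples in table.values():
        signature = frozenset(value for _, value in tuples)
        for person, _ in tuples:
            result[person] = signature
    return result

def pair_is_m_invariant(prev, cur, m):
    """Check {T*(n-1), T*(n)}; by the lemma this suffices for the whole sequence."""
    if not (is_m_unique(prev, m) and is_m_unique(cur, m)):
        return False
    sig_prev, sig_cur = signatures(prev), signatures(cur)
    shared = sig_prev.keys() & sig_cur.keys()   # individuals present in both tables
    return all(sig_prev[p] == sig_cur[p] for p in shared)

t1 = {"g1": [("Bob", "dyspepsia"), ("Alice", "bronchitis")]}
t2_bad = {"g1": [("Bob", "dyspepsia"), ("Emily", "gastritis")]}
t2_ok = {"g1": [("Bob", "dyspepsia"), ("c1", "bronchitis")]}   # c1: counterfeit tuple
```

In `t2_bad` Bob's signature changes from {dyspepsia, bronchitis} to {dyspepsia, gastritis}, so the pair is rejected; in `t2_ok` a counterfeit bronchitis tuple keeps his signature intact.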

Solution idea • Goal: Given T(n) and T*(n-1), create T*(n) such that {T*(n-1), T*(n)} is m-invariant. • Idea: create counterfeits. • Optimization goal: impose as little generalization as possible.

Microdata T(2) Counterfeited generalization T*(2) The auxiliary relation R(2) for T*(2)

Generalization T*(1) Counterfeited Generalization T*(2) The auxiliary relation R(2) for T*(2)

A sequence of generalized tables T*(1), …, T*(n) is m-invariant, if and only if • T*(1), …, T*(n) are m-unique, and • each individual has the same signature in every generalized table in which s/he is involved. • [Tables: generalization T*(1); generalization T*(2)]

In case of corruption… • If an adversary knows from Alice that she has bronchitis, he can conclude that Bob has dyspepsia. 2-diverse Generalization Microdata

Anti-corruption publishing [ICDE08] • We formalized anti-corruption publishing, by modeling the degree of privacy preservation as a function of an adversary’s background knowledge. • We proposed a solution, by integrating generalization with • perturbation: switch selected records’ sensitive information. • stratified sampling: sample some records from each QI group.

Summary • Introduced the problem of privacy-preserving publishing. • Two principles: k-anonymity and l-diversity • Two extensions: multi-publishing and corruption
