Differential Privacy Xintao Wu Oct 31, 2012

Differential PrivacyXintao WuOct 31, 2012

Sanitization approaches • Input perturbation • Add noise to data • Generalize data • Summary statistics • Means, variances • Marginal totals • Model parameters • Output perturbation • Add noise to summary statistics

Blending/hiding into a crowd • K-anonymity based approaches • Adversary may have various background knowledge to breach privacy • Privacy models often assume “the adversary’s background knowledge is given”

Classic intuition for privacy • Privacy means that anything can be learned about a respondent from the statistical database can be learned without access to the database. • Security of encryption • Anything about the plaintext that can be learned from a ciphertext can be learned without the ciphertext. • Prior and posterior views about an individual should not change much

Motivation • Publicly release statistical information about a dataset without compromising the privacy of any individual

Requirement • Anything that can be learned about a respondent from a statistical database should be learnable without access to the database • Reduce the knowledge gain of joining the database • Require that the probability distribution on the public results is essentially the same independent of whether any individual opts in to, or opts out of the dataset

Definition

Sensitivity function • Captures how great a difference must be hidden by the additive noise

LAP distribution noise

Guassian noise

Adding LAP noise

Proof sketch

Delta_f=1, epsilon varies

Delta_f=1 epsilon=0.01

Delta_f=1 epsilon=0.1

Delta_f=1 epsilon=1

Composition • Sequential composition • Parallel composition --for disjoint sets, the ultimate privacy guarantee depends only on the worst of the guarantees of each analysis, not the sum.

Example • Let us assume a table with 1000 customers and each record has attributes: name, gender, city, cancer, salary. • For attribute city, we assume the domain size is 10; • for attribute cancer, we only record Yes or No for each customer; • for attribute salary, the domain range is 0-10k. • The privacy threshold \epsilon is a constant 0.1 set by data owner. • For one single query “How many customers got cancer?” • The adversary is allowed to ask three times of the query shown the above.

Example (continued) • “How many customers got cancer in each city?” • For one single query “What is the sum of salaries across all customers?”

Type of computing (query) • some are very sensitive, others are not • single query vs. query sequence • query on disjoint sets or not • outcome expected: number vs. arbitrary • interactive vs. not interactive

Sensitivity • Global sensitivity • Local sensitivity • Smooth sensitivity

Different areas of DP • PINQ • DM with DP • Optimizing linear counting queries under differential privacy. -Matrix mechanism for answering a workload of predicate counting queries

PPDM interface--PINQ • A programmable privacy preserving layer • Add calibrated noise to each query • Need to assign privacy cost budget

Data Mining with DP • Previous study—privacy preserving interface ensures everything about DP • Problems—inferior results if the interface is utilized simply during data mining • Solution—consider both together • DP ID3 —noisy count —evaluate all attributes in one exponential mechanism query using entire budget instead of splitting budget among multiple

DP in Social Networks • Page 97-120 of pakdd11 tutorial

Differential Privacy Xintao Wu Oct 31, 2012