Exploring Randomized Response Techniques for Privacy Preservation in Data Collection

Other Perturbation Techniques

Outline • Randomized Responses • Sketch • Project ideas

Randomized Responses • Problem description • A provides the answer to B’s question • A wants to preserve his/her privacy • Question/answer can be sensitive • The method • Assume the answer can be “yes” or “no” • A answers honestly with probability θ and gives the opposite answer with probability 1-θ • We can estimate the real probability of “yes” and “no” from the randomized responses

Notations: • O(yes): observed probability of yes from the randomized responses, i.e. (# of yes) / (total # of responses) • P(yes): real probability of yes • Inference • O(yes) = P(yes) * θ + P(no) * (1-θ) = P(yes) * θ + (1-P(yes)) * (1-θ) ⇒ P(yes) = (O(yes) + θ - 1) / (2θ - 1), which requires θ ≠ 1/2
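
The inference step can be checked with a small simulation. This is a minimal sketch, assuming the respondent reports the opposite answer with probability 1-θ (which is what the formula above implies); the function names `randomize` and `estimate_p_yes`, the value θ = 0.8, and the true rate 0.3 are illustrative choices, not part of the slides.

```python
import random

def randomize(answer: bool, theta: float, rng: random.Random) -> bool:
    """Report the true answer with probability theta, otherwise its opposite."""
    return answer if rng.random() < theta else not answer

def estimate_p_yes(o_yes: float, theta: float) -> float:
    """Invert O(yes) = P(yes)*theta + (1 - P(yes))*(1 - theta)."""
    if abs(2 * theta - 1) < 1e-12:
        raise ValueError("theta = 0.5 makes the responses uninformative")
    return (o_yes + theta - 1) / (2 * theta - 1)

rng = random.Random(0)
theta = 0.8
# Hypothetical population in which about 30% truly answer "yes".
true_answers = [rng.random() < 0.3 for _ in range(100_000)]
reported = [randomize(a, theta, rng) for a in true_answers]

o_yes = sum(reported) / len(reported)   # observed probability of "yes"
p_hat = estimate_p_yes(o_yes, theta)    # estimated real probability of "yes"
print(f"O(yes) = {o_yes:.3f}, estimated P(yes) = {p_hat:.3f}")
```

The estimate gets noisier as θ approaches 1/2, because the denominator 2θ - 1 shrinks and amplifies sampling error in O(yes).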

Extend to multiple categories • The answer ci is changed to cj with probability θij • O = (O(c1),O(c2),…,O(cn)): observed prob of each ci • P = (P(c1),P(c2),…,P(cn)): real prob of each ci • The relationship between O and P: O(cj) = Σi θij * P(ci), with Θ = (θij) • Note: When Θ is invertible, use matrix inversion to solve for P. Otherwise, use iterative methods similar to that in Rakesh’s paper
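
The multi-category case amounts to solving a small linear system. The following is a rough sketch, assuming the relationship O(cj) = Σi θij * P(ci) implied by the definition of θij; the 3x3 matrix, the true distribution, and the helper names are made up for illustration, and the solver is plain Gaussian elimination rather than any method named in the slides.

```python
def observed_from_true(theta, p):
    """O(c_j) = sum_i theta[i][j] * P(c_i)."""
    n = len(p)
    return [sum(theta[i][j] * p[i] for i in range(n)) for j in range(n)]

def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[k]] for k, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular; an iterative method is needed")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def estimate_true(theta, o):
    """Recover P from O by solving (theta transposed) P = O."""
    n = len(o)
    theta_t = [[theta[i][j] for i in range(n)] for j in range(n)]
    return solve(theta_t, o)

# Hypothetical perturbation matrix: theta[i][j] = prob. that c_i is reported as c_j.
theta = [[0.7, 0.2, 0.1],
         [0.1, 0.8, 0.1],
         [0.2, 0.2, 0.6]]
p_true = [0.5, 0.3, 0.2]

o = observed_from_true(theta, p_true)
p_hat = estimate_true(theta, o)
print("observed:", o)
print("recovered:", p_hat)
```

With a uniform matrix (every entry 1/n) the system is singular and `solve` raises an error, which matches the intuition that uniform randomization leaves nothing to recover.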

Different perturbation matrices can be used. Which one is the best? • Balance between privacy and utility? • No perturbation (identity matrix): zero privacy is preserved, while full data utility is preserved • Uniform randomization: privacy is fully preserved, while no data utility is left

Optimizing both privacy & utility • Read paper 33 • Privacy: similar to previous discussion • Based on accuracy of estimation • A Bayes method: • C = {c1,c2,…,cn} • Y is the perturbed value, X is the original value, and X^ is the estimated value • Accuracy of estimation • It can be calculated by checking the original data, the perturbed data and the estimated data
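
The accuracy of estimation can be measured empirically, as the slide suggests, by comparing original and estimated values record by record. The slide's own formulas are not present in this transcript, so the sketch below assumes one common Bayes rule: guess the x that maximizes P(X=x) * θxy for the observed y. The names `map_estimate`, `perturb`, and `estimation_accuracy`, and all numeric values, are hypothetical.

```python
import random

def map_estimate(y, theta, prior):
    """Guess the original value: the x maximizing prior[x] * theta[x][y]."""
    return max(range(len(prior)), key=lambda x: prior[x] * theta[x][y])

def perturb(x, theta, rng):
    """Draw the reported value y according to row x of the RR matrix."""
    return rng.choices(range(len(theta[x])), weights=theta[x])[0]

def estimation_accuracy(originals, theta, prior, rng):
    """Fraction of records whose estimated value equals the original value."""
    hits = 0
    for x in originals:
        y = perturb(x, theta, rng)
        hits += map_estimate(y, theta, prior) == x
    return hits / len(originals)

rng = random.Random(1)
prior = [0.6, 0.4]                      # assumed real distribution of X
theta = [[0.8, 0.2], [0.2, 0.8]]        # assumed RR matrix
originals = rng.choices([0, 1], weights=prior, k=10_000)

acc = estimation_accuracy(originals, theta, prior, rng)
print(f"accuracy of estimation: {acc:.3f}")
```

Under these assumed values the best guess is always the reported value itself, so the measured accuracy lands close to the 0.8 on the diagonal of the matrix.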

Privacy • Average: 1- (accuracy of estimation) • Worst case: • Utility • P(ci) the original prob, O(ci) the prob on perturbed data, P^(ci) is the estimated prob • Utility depends on the difference between the original prob and the estimated prob
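
The two metrics can be written as small functions. The average privacy follows the definition above directly; the utility measure is an assumption, since the slide only says utility depends on the difference between the original and estimated probabilities. Mean squared difference is used here as one plausible choice, and the function names are illustrative.

```python
def average_privacy(accuracy: float) -> float:
    """Average privacy = 1 - (accuracy of estimation)."""
    return 1.0 - accuracy

def utility_error(p_original, p_estimated):
    """Assumed utility measure: mean squared difference between the original
    distribution P(ci) and the estimated distribution P^(ci). Lower is better."""
    diffs = [(p - q) ** 2 for p, q in zip(p_original, p_estimated)]
    return sum(diffs) / len(diffs)

# Hypothetical numbers for illustration only.
print(average_privacy(0.8))                               # privacy when 80% of guesses are right
print(utility_error([0.5, 0.3, 0.2], [0.48, 0.33, 0.19])) # small error, high utility
```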

Optimization algorithm • Find the perturbation that balances the two metrics • The evolutionary algorithm • Start with a set of initial RR matrices • Repeat the following steps in each iteration • Mating: select two RR matrices from the pool • Crossover: exchange several columns between the two RR matrices • Mutation: change some values in an RR matrix • Meet the privacy bound: filter the resulting matrices • Evaluate the fitness value for the new RR matrices. Note: the fitness value is defined in terms of the privacy and utility metrics
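
The loop above can be sketched in a few dozen lines. This is an illustration under stated assumptions, not the algorithm from paper 33: a matrix is stored as a list of "columns", one probability distribution per original category (so `cols[i][j]` plays the role of θij); privacy is 1 minus the accuracy of a Bayes guess; and the fitness is a hypothetical utility proxy (the expected fraction of truthful reports). All names and parameter values are invented for the example.

```python
import random

def random_column(n, rng):
    w = [rng.random() for _ in range(n)]
    s = sum(w)
    return [v / s for v in w]

def privacy(cols, prior):
    """Assumed metric: 1 - accuracy of the best Bayes guess of the original value."""
    n = len(prior)
    accuracy = sum(max(prior[x] * cols[x][y] for x in range(n)) for y in range(n))
    return 1.0 - accuracy

def utility(cols, prior):
    """Hypothetical fitness proxy: expected fraction of truthful reports."""
    return sum(prior[x] * cols[x][x] for x in range(len(prior)))

def crossover(a, b, rng):
    """Exchange a random subset of columns between two matrices."""
    child_a, child_b = [c[:] for c in a], [c[:] for c in b]
    for k in range(len(a)):
        if rng.random() < 0.5:
            child_a[k], child_b[k] = child_b[k], child_a[k]
    return child_a, child_b

def mutate(cols, rng, scale=0.1):
    """Nudge one entry, then renormalize its column so it still sums to 1."""
    out = [c[:] for c in cols]
    k, j = rng.randrange(len(out)), rng.randrange(len(out))
    out[k][j] = max(1e-9, out[k][j] + rng.uniform(-scale, scale))
    s = sum(out[k])
    out[k] = [v / s for v in out[k]]
    return out

def evolve(prior, privacy_bound, pool_size=20, iterations=200, seed=0):
    rng = random.Random(seed)
    n = len(prior)
    uniform = [[1.0 / n] * n for _ in range(n)]   # feasible seed when bound <= 1 - max(prior)
    pool = [uniform] + [[random_column(n, rng) for _ in range(n)]
                        for _ in range(pool_size - 1)]
    pool = [m for m in pool if privacy(m, prior) >= privacy_bound]
    for _ in range(iterations):
        a, b = rng.choice(pool), rng.choice(pool)               # mating
        children = crossover(a, b, rng)                         # crossover
        children = [mutate(c, rng) for c in children]           # mutation
        children = [c for c in children
                    if privacy(c, prior) >= privacy_bound]      # privacy bound
        pool = sorted(pool + children,                          # keep the fittest
                      key=lambda m: utility(m, prior), reverse=True)[:pool_size]
    return pool[0]

prior = [0.5, 0.3, 0.2]
best = evolve(prior, privacy_bound=0.4)
print("privacy:", round(privacy(best, prior), 3), "utility:", round(utility(best, prior), 3))
```

Seeding the pool with the uniform matrix guarantees at least one candidate that meets the bound, so the filter never empties the pool in this setup.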

Summary • Randomized response is the basic technique for perturbing categorical data • Boolean • Multi-category

Sketch • Address the problem of high-dimensional sparse data • Multiplicative perturbation • Randomized responses • Market basket data • Bag of words

Definition of sketch • Similar to projection perturbation • Map d-dimensional data → r-dimensional data, r << d • Difference: for each record the mapping matrix is different • Definition • X = (x1,…,xd); its sketch S = (s1,…,sr) is computed from X using random values drawn from {-1, +1}

Property • The dot product of two original records X and Y can be approximated from their sketches • Dot product is important in calculating Euclidean distances!
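
The dot-product property can be illustrated with a standard ±1 sketch. The exact definition from the slides is not present in this transcript, so this is a sketch under assumptions: each component is sj = Σi xi * rij with rij drawn from {-1, +1}, and the two records being compared share the same random values for the components used. The names `make_signs`, `sketch`, and `estimate_dot` are illustrative.

```python
import random

def make_signs(d, r, seed=0):
    """r random vectors of length d with entries drawn from {-1, +1}."""
    rng = random.Random(seed)
    return [[rng.choice((-1, 1)) for _ in range(d)] for _ in range(r)]

def sketch(x, signs):
    """Component j of the sketch: s_j = sum_i x_i * signs[j][i]."""
    return [sum(xi * sij for xi, sij in zip(x, row)) for row in signs]

def estimate_dot(sx, sy):
    """Average of s_j(X) * s_j(Y) over the r components."""
    return sum(a * b for a, b in zip(sx, sy)) / len(sx)

d, r = 50, 2000
signs = make_signs(d, r)

# Two sparse boolean records that share three items (true dot product = 3).
x = [1 if i in range(0, 5) else 0 for i in range(d)]
y = [1 if i in range(2, 7) else 0 for i in range(d)]

est = estimate_dot(sketch(x, signs), sketch(y, signs))
print(f"true dot product = 3, estimate = {est:.2f}")
```

Cross terms xi * yk with i ≠ k carry a random sign and average out over the r components, which is why the mean converges to the true dot product.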

Accuracy of the dot product estimation • Larger r → smaller variance → better quality • However, larger r → lower privacy

Privacy • Original data value can be estimated • Sparse data • Most are canceled in sketch • Estimate of xk :

Privacy • δ-anonymity • Suppress the record if this condition is not satisfied… • Another concept: k-variance • See paper 29 for more details.

Applications: • Dot product estimation • Determine the length of sparse transaction (# of non-zero items in boolean vector) • Determine Euclidean distance • Average of a set of records (centroid of a cluster)
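
The remaining applications follow from the dot product, because a ±1 sketch is linear in the record. The example below is a hedged illustration using the same assumed sketch definition (sj = Σi xi * rij, shared random values across records); the helper names are hypothetical.

```python
import random

def sketch(x, signs):
    """Component j of the sketch: s_j = sum_i x_i * signs[j][i]."""
    return [sum(xi * s for xi, s in zip(x, row)) for row in signs]

def estimate_sq_norm(sx):
    """Estimates X.X; for a boolean vector this is the number of non-zero items."""
    return sum(a * a for a in sx) / len(sx)

def estimate_sq_distance(sx, sy):
    """Estimates the squared Euclidean distance via the sketch of X - Y."""
    return sum((a - b) ** 2 for a, b in zip(sx, sy)) / len(sx)

def centroid_sketch(sketches):
    """Sketch of the average record equals the average of the sketches."""
    n = len(sketches)
    return [sum(col) / n for col in zip(*sketches)]

rng = random.Random(0)
d, r = 50, 2000
signs = [[rng.choice((-1, 1)) for _ in range(d)] for _ in range(r)]

x = [1 if i in range(0, 5) else 0 for i in range(d)]   # 5 items
y = [1 if i in range(2, 7) else 0 for i in range(d)]   # 5 items, 3 shared with x
sx, sy = sketch(x, signs), sketch(y, signs)

print("transaction length of x (true 5):", round(estimate_sq_norm(sx), 2))
print("squared distance (true 4):", round(estimate_sq_distance(sx, sy), 2))
```

The centroid case needs no estimation at all: averaging sketches gives exactly the sketch of the averaged records, so cluster centroids can be maintained entirely in sketch space.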
