This outline presents an overview of Randomized Response (RR) techniques for addressing privacy concerns in sensitive data collection. It explains how RR methods protect respondents' confidentiality while still allowing the true proportions of "yes" and "no" answers to be estimated. The document covers the mathematical foundations, including observed versus real probabilities, an optimization algorithm for balancing privacy and utility, and sketch-based perturbation for high-dimensional sparse data. It is intended for researchers and practitioners working with sensitive information.
Outline • Randomized Responses • Sketch • Project ideas
Randomized Responses • Problem description • A provides the answer to B's question • A wants to preserve his/her privacy • Question/answer can be sensitive • The method • Assume the answer can be "yes" or "no" • A answers honestly with probability $\theta$ and gives the opposite answer with probability $1-\theta$ • We can estimate the real probabilities of "yes" and "no" from the randomized responses
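Notations: • O(yes): observed probability of yes from the randomized responses • # of yes / total # of responses • P(yes): real probability of yes • Inference, where $\theta$ is the probability of an honest answer • $O(\text{yes}) = P(\text{yes})\,\theta + P(\text{no})\,(1-\theta) = P(\text{yes})\,\theta + (1-P(\text{yes}))\,(1-\theta)$ • $P(\text{yes}) = \dfrac{O(\text{yes}) + \theta - 1}{2\theta - 1}$, for $\theta \neq 1/2$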
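The inference step can be checked numerically. The following is a minimal Python sketch, assuming the flip model implied by the formula for O(yes): the true answer is reported with probability theta and the opposite answer otherwise. The function names `randomize`, `estimate_p_yes` and `simulate` are hypothetical, chosen for illustration only.

```python
import random

def randomize(truth, theta, rng):
    """Report the true answer with probability theta, otherwise the opposite."""
    return truth if rng.random() < theta else not truth

def estimate_p_yes(observed_yes, theta):
    """Invert O(yes) = P(yes)*theta + (1 - P(yes))*(1 - theta)."""
    if abs(2 * theta - 1) < 1e-12:
        raise ValueError("theta = 0.5 makes the responses uninformative")
    return (observed_yes + theta - 1) / (2 * theta - 1)

def simulate(p_yes, theta, n, seed=0):
    """Draw n true answers, randomize each one, and estimate P(yes)."""
    rng = random.Random(seed)
    reports = [randomize(rng.random() < p_yes, theta, rng) for _ in range(n)]
    observed = sum(reports) / n  # O(yes) = # of yes / total # of responses
    return observed, estimate_p_yes(observed, theta)
```

For example, with P(yes) = 0.7 and theta = 0.8 the expected observed rate is 0.7 * 0.8 + 0.3 * 0.2 = 0.62, and `estimate_p_yes(0.62, 0.8)` maps it back to 0.7. The estimate from `simulate` fluctuates around that value and gets noisier as theta approaches 0.5.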
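Extend to multiple categories • The answer $c_i$ is changed to $c_j$ with probability $\theta_{ij}$ • $O = (O(c_1), O(c_2), \ldots, O(c_n))$: $O(c_i)$ is the observed probability of $c_i$ • $P = (P(c_1), P(c_2), \ldots, P(c_n))$: $P(c_i)$ is the real probability of $c_i$ • The relationship between O and P: $O(c_j) = \sum_{i=1}^{n} \theta_{ij}\, P(c_i)$, i.e. $O = \Theta^{T} P$ with $\Theta = (\theta_{ij})$ • Note: when $\Theta$ is invertible, use matrix inversion to solve for P. Otherwise, use iterative methods similar to that in Rakesh's paper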
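The inversion route can be sketched in plain Python. This is an illustration under the assumption that `theta[i][j]` holds the probability that category i is reported as category j, so each row sums to 1; the helper names are hypothetical, and the solver is a textbook Gauss-Jordan elimination rather than anything prescribed by the slides.

```python
def observed_from_real(theta, p):
    """O(c_j) = sum_i theta[i][j] * P(c_i)."""
    n = len(p)
    return [sum(theta[i][j] * p[i] for i in range(n)) for j in range(n)]

def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular; an iterative method is needed")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(n):
            if row != col:
                factor = aug[row][col] / aug[col][col]
                for k in range(col, n + 1):
                    aug[row][k] -= factor * aug[col][k]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def estimate_real(theta, o):
    """Recover P from O by solving (theta transposed) P = O."""
    n = len(o)
    theta_t = [[theta[i][j] for i in range(n)] for j in range(n)]
    return solve_linear(theta_t, o)
```

With `theta = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]` and `P = [0.5, 0.3, 0.2]`, the observed distribution is `[0.45, 0.31, 0.24]`, and `estimate_real` returns the original P up to rounding. On real data O is an empirical frequency, so the recovered values are only estimates and may fall slightly outside [0, 1].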
Different perturbation matrices can be used. Which one is the best? • Balance between privacy and utility? • No perturbation (identity matrix): zero privacy is preserved, while full data utility is preserved • Uniform randomization (every entry equal to $1/n$): privacy is fully preserved, while no data utility is left
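The two extremes can be made concrete with a small Python sketch. It assumes `theta[i][j]` is the probability that category i is reported as category j; the two example distributions `p_a` and `p_b` are made up for illustration.

```python
def observed(theta, p):
    """Distribution of reported categories: O(c_j) = sum_i theta[i][j] * P(c_i)."""
    n = len(p)
    return [sum(theta[i][j] * p[i] for i in range(n)) for j in range(n)]

n = 3
# Identity matrix: every answer is reported unchanged.
identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
# Uniform matrix: the reported category ignores the true one.
uniform = [[1.0 / n] * n for _ in range(n)]

p_a = [0.5, 0.3, 0.2]  # hypothetical real distributions
p_b = [0.1, 0.1, 0.8]

same_as_input = observed(identity, p_a)   # equals p_a: no privacy, full utility
flat_a = observed(uniform, p_a)           # about [1/3, 1/3, 1/3]
flat_b = observed(uniform, p_b)           # also about [1/3, 1/3, 1/3]
```

Under the uniform matrix, two very different populations produce the same observed distribution, so nothing about P can be recovered. Useful matrices lie between these two extremes.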
Optimizing both privacy & utility • Read paper 33 • Privacy: similar to previous discussion • Based on accuracy of estimation • A Bayes method: • C = {c1, c2, …, cn} • Y is the perturbed value, X is the original value, and $\hat{X}$ is the estimated value • Accuracy of estimation • It can be calculated by checking the original data, the perturbed data and the estimated data
Privacy • Average: 1 − (accuracy of estimation) • Worst case: • Utility • $P(c_i)$ is the original probability, $O(c_i)$ the probability on the perturbed data, and $\hat{P}(c_i)$ the estimated probability • Utility depends on the difference between the original probability and the estimated probability
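The two metrics can be sketched in Python under explicit assumptions, since the slides do not give the formulas. Here the "Bayes method" is assumed to be a maximum a posteriori guess (pick the original value x that maximizes P(X = x | Y = y)), and utility is assumed to be measured by mean squared error. Both choices, and all function names, are illustrative rather than taken from paper 33.

```python
def map_accuracy(theta, p):
    """Probability that a MAP guess of X from Y is correct.

    theta[i][j] = P(Y = c_j | X = c_i), p[i] = P(X = c_i).
    For each reported value, the adversary picks the most likely original
    value, so accuracy = sum_j max_i p[i] * theta[i][j].
    """
    n = len(p)
    return sum(max(p[i] * theta[i][j] for i in range(n)) for j in range(n))

def average_privacy(theta, p):
    """Average privacy = 1 - (accuracy of estimation)."""
    return 1.0 - map_accuracy(theta, p)

def utility_error(p, p_hat):
    """Assumed utility loss: mean squared difference between P and its estimate."""
    return sum((a - b) ** 2 for a, b in zip(p, p_hat)) / len(p)
```

For `p = [0.5, 0.3, 0.2]`, the identity matrix gives accuracy 1 (privacy 0), while the uniform matrix gives accuracy 0.5, which is what an adversary gets by always guessing the most common category.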
Optimization algorithm • Find the perturbation that balances the two metrics • The evolutionary algorithm • Start with a set of initial RR matrices • Repeat the following steps in each iteration • Mating: select two RR matrices from the pool • Crossover: exchange several columns between the two RR matrices • Mutation: change some values in an RR matrix • Meet the privacy bound: filter the resulting matrices • Evaluate the fitness value for the new RR matrices. Note: the fitness value is defined in terms of the privacy and utility metrics
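The loop above can be sketched as a toy Python program. Several things here are assumptions, not the method of the cited paper: a matrix is stored as a list of columns, where `cols[x]` is the distribution of the reported category for original category x (so exchanging columns keeps the matrix valid); privacy is 1 minus the accuracy of a MAP guess; and `fitness` is a hypothetical placeholder that mixes privacy with the rate of truthful reports instead of a real utility metric.

```python
import random

def privacy(prior, cols):
    """1 - accuracy of a MAP guess of the original category."""
    n = len(prior)
    acc = sum(max(prior[x] * cols[x][y] for x in range(n)) for y in range(n))
    return 1.0 - acc

def fitness(prior, cols, weight=0.5):
    """Placeholder fitness: privacy traded off against the truthful-report rate."""
    truthful = sum(prior[x] * cols[x][x] for x in range(len(prior)))
    return weight * privacy(prior, cols) + (1 - weight) * truthful

def random_column(n, rng):
    raw = [rng.random() for _ in range(n)]
    total = sum(raw)
    return [v / total for v in raw]

def crossover(a, b, rng):
    """Exchange a random subset of columns between two matrices."""
    swap = [rng.random() < 0.5 for _ in a]
    child1 = [list(b[x]) if s else list(a[x]) for x, s in enumerate(swap)]
    child2 = [list(a[x]) if s else list(b[x]) for x, s in enumerate(swap)]
    return child1, child2

def mutate(cols, rng, rate=0.2):
    """Perturb some columns with small noise, then renormalize them."""
    out = []
    for col in cols:
        if rng.random() < rate:
            noisy = [v + 0.1 * rng.random() for v in col]
            total = sum(noisy)
            out.append([v / total for v in noisy])
        else:
            out.append(list(col))
    return out

def evolve(prior, pool_size=10, iterations=30, privacy_bound=0.3, seed=0):
    """Assumes privacy_bound <= 1 - max(prior), so the uniform matrix is feasible."""
    rng = random.Random(seed)
    n = len(prior)
    uniform = [[1.0 / n] * n for _ in range(n)]
    pool = [uniform]
    while len(pool) < pool_size:
        cand = [random_column(n, rng) for _ in range(n)]
        # Fall back to the uniform matrix when a candidate breaks the bound.
        pool.append(cand if privacy(prior, cand) >= privacy_bound else uniform)
    for _ in range(iterations):
        a, b = rng.sample(pool, 2)                                   # mating
        children = [mutate(c, rng) for c in crossover(a, b, rng)]    # crossover + mutation
        children = [c for c in children
                    if privacy(prior, c) >= privacy_bound]           # privacy bound
        ranked = sorted(pool + children,
                        key=lambda m: fitness(prior, m), reverse=True)
        pool = ranked[:pool_size]                                    # keep the fittest
    return pool
```

Calling `evolve([0.5, 0.3, 0.2])` returns a pool of ten matrices that all satisfy the privacy bound, ordered by the placeholder fitness. A faithful implementation would replace `fitness` with the privacy and utility metrics defined earlier.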
Summary • Randomized response is the basic technique for perturbing categorical data • Boolean • Multi-category
Sketch • Address the problem of high-dimensional sparse data • Multiplicative perturbation • Randomized responses • Market basket data • Bag of words
Definition of sketch • Similar to projection perturbation • Map d-dimensional data to r-dimensional data, r << d • Difference: for each record the mapping matrix is different • Definition • X = (x1, …, xd) is the original record and S = (s1, …, sr) is its sketch, computed from X using random values drawn from {-1, +1}
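A minimal Python sketch of the mapping follows. It assumes the common form of the construction, in which each sketch component is a signed sum s_j = sum_i x_i * r_ij with r_ij drawn uniformly from {-1, +1}; the slide's own formula did not survive, so this form and the names `make_signs` and `sketch` are assumptions.

```python
import random

def make_signs(d, r, seed=0):
    """d x r matrix of independent random signs from {-1, +1}."""
    rng = random.Random(seed)
    return [[rng.choice((-1, 1)) for _ in range(r)] for _ in range(d)]

def sketch(x, signs):
    """Map a d-dimensional record to r components: s_j = sum_i x[i] * signs[i][j].

    Zero entries are skipped, which is what makes this cheap for sparse data.
    """
    r = len(signs[0])
    return [sum(v * signs[i][j] for i, v in enumerate(x) if v) for j in range(r)]
```

For a market-basket record with d = 1000 items and r = 20, `sketch(x, make_signs(1000, 20))` returns 20 integers, each a signed count of the items present in the basket.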
Property • The dot product of two original records X and Y can be approximated with their sketches • Dot product is important in calculating Euclidean distances!
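The property can be illustrated in Python. This sketch assumes that both records are sketched with the same sign matrix (the estimator relies on shared signs) and that a component has the form s_j = sum_i x_i * r_ij; neither detail is spelled out in the slides, and the function names are hypothetical.

```python
import random

def make_signs(d, r, seed=0):
    """d x r matrix of independent random signs from {-1, +1}."""
    rng = random.Random(seed)
    return [[rng.choice((-1, 1)) for _ in range(r)] for _ in range(d)]

def sketch(x, signs):
    """s_j = sum_i x[i] * signs[i][j], skipping zero entries."""
    r = len(signs[0])
    return [sum(v * signs[i][j] for i, v in enumerate(x) if v) for j in range(r)]

def dot_estimate(s, t):
    """Estimate X . Y as the average of s_j * t_j over the r components."""
    return sum(a * b for a, b in zip(s, t)) / len(s)

def sq_distance_estimate(s, t):
    """||X - Y||^2 = X.X + Y.Y - 2 X.Y, with each term estimated from sketches."""
    return dot_estimate(s, s) + dot_estimate(t, t) - 2 * dot_estimate(s, t)
```

Each product s_j * t_j has expectation X . Y because the cross terms r_ij * r_lj with i != l average to zero, so the mean over r components is an unbiased estimate. The squared-distance estimate follows from the same three dot products.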
Accuracy of the dot product estimation • Larger r → smaller variance → better quality; however, lower privacy
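The dependence on r can be made explicit with a short derivation. It assumes sketch components of the form $s_j = \sum_i x_i r_{ij}$ and $t_j = \sum_i y_i r_{ij}$, with the signs $r_{ij}$ independent, equally likely to be $-1$ or $+1$, and shared by the two records; the symbol $\hat{D}$ is introduced here for the estimator and is not taken from the slides.

```latex
\hat{D} = \frac{1}{r} \sum_{j=1}^{r} s_j t_j,
\qquad
E[\hat{D}] = \sum_{i=1}^{d} x_i y_i = X \cdot Y,
\qquad
\mathrm{Var}[\hat{D}] = \frac{1}{r}
  \left( \|X\|^2 \|Y\|^2 + (X \cdot Y)^2 - 2 \sum_{i=1}^{d} x_i^2 y_i^2 \right)
```

Under these assumptions the variance shrinks as $1/r$, which is the quality gain from a larger sketch. The same extra components also give an adversary more equations about the original record, which is the privacy cost.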
Privacy • Original data value can be estimated • Sparse data • Most are canceled in sketch • Estimate of xk :
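The estimation risk can be illustrated with one natural estimator, shown here as an assumption because the slide's formula is missing. If s_j = sum_i x_i * r_ij and the adversary knows the signs, then s_j * r_kj has expectation x_k, and averaging over the r components reduces the noise. The names below are hypothetical.

```python
import random

def make_signs(d, r, seed=0):
    """d x r matrix of independent random signs from {-1, +1}."""
    rng = random.Random(seed)
    return [[rng.choice((-1, 1)) for _ in range(r)] for _ in range(d)]

def sketch(x, signs):
    """s_j = sum_i x[i] * signs[i][j], skipping zero entries."""
    r = len(signs[0])
    return [sum(v * signs[i][j] for i, v in enumerate(x) if v) for j in range(r)]

def estimate_value(s, signs, k):
    """Assumed estimator of x_k: the average of s_j * signs[k][j] over j."""
    return sum(s_j * signs[k][j] for j, s_j in enumerate(s)) / len(s)
```

The noise in this estimate comes from the other non-zero entries of the record. For a record with a single non-zero entry the estimate of that entry is exact, which is why very sparse records are the hardest to protect.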
Privacy • $\delta$-anonymity • Suppress the record if this condition is not satisfied… • Another concept: k-variance. See paper 29 for more details.
Applications: • Dot product estimation • Determine the length of a sparse transaction (# of non-zero items in a Boolean vector) • Determine Euclidean distance • Average of a set of records (centroid of a cluster)