PrivBayes: Private Data Release via Bayesian Networks

Presentation Transcript


  1. PrivBayes: Private Data Release via Bayesian Networks • Jun Zhang, Graham Cormode, Cecilia M. Procopiuc, Divesh Srivastava, Xiaokui Xiao

  2. Overview • The Problem: Private Data Release • Differential Privacy • Challenges • The Algorithm: PrivBayes • Bayesian Network • Details of PrivBayes • Function $F$: Linear vs. Logarithmic • Experiments

  4. Data Release • [Diagram labels: company / institute, sensitive database, public, adversary]

  5. Private Data Release • [Diagram labels: company, sensitive database, synthetic database with similar properties, accurate inference, adversary] • How can we design such a private data release algorithm?

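  7. Differential Privacy [TCC’06] • Definition of $\varepsilon$-differential privacy • A randomized data release algorithm $A$ satisfies $\varepsilon$-differential privacy if, for any two neighboring datasets $D_1$ and $D_2$ and for any possible synthetic data $D^*$, $\Pr[A(D_1) = D^*] \le e^{\varepsilon} \cdot \Pr[A(D_2) = D^*]$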

  8. Differential Privacy [TCC’06] • A general approach to achieving differential privacy is to inject Laplace noise into the output, enough to mask the impact of any single individual • More details in the Preliminaries section of the paper
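
The Laplace mechanism mentioned on slide 8 can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names, the example counts, and the choice of L1 sensitivity 2 (one tuple replaced in a count histogram) are assumptions made here.

```python
import random

def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) sample as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def laplace_mechanism(counts, epsilon, sensitivity, rng=None):
    """Perturb each count with Laplace noise of scale sensitivity / epsilon."""
    rng = rng or random.Random()
    scale = sensitivity / epsilon
    return [c + laplace_noise(scale, rng) for c in counts]

# Hypothetical histogram of four cells; sensitivity 2 is an assumption.
noisy = laplace_mechanism([120, 45, 3, 0], epsilon=0.5, sensitivity=2.0,
                          rng=random.Random(0))
print(noisy)
```

A smaller `epsilon` gives a larger noise scale, which is the privacy/accuracy trade-off the rest of the talk works around.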

  9. Our Target • Design a data release algorithm with a differential privacy guarantee.

  11. Challenges of Private Data Release • To build synthetic data, we need to understand the tuple distribution of the sensitive data. • Pipeline: sensitive database -> (convert) -> full-dim tuple distribution -> (+ noise) -> noisy distribution -> (sample) -> synthetic database

  12. Challenges of Private Data Release • Example: a database with 10M tuples, 10 attributes (dimensions), and 20 values per attribute • Scalability: the full distribution has $20^{10} \approx 10^{13}$ cells • most of them have non-zero counts after noise injection • privacy is expensive (computation, storage) • Signal-to-noise: the average information in each cell is $10^{7} / 10^{13} = 10^{-6}$, while the average noise magnitude is on the order of $1/\varepsilon$, which is far larger • Previous solutions suffer from either the scalability problem or the signal-to-noise problem
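
The arithmetic behind this example can be checked with a few lines. The sketch below is illustrative only: `full_domain_stats` is a hypothetical helper, and the values `epsilon=0.1` and `sensitivity=2.0` are assumptions chosen for the demonstration, not figures taken from the slide.

```python
def full_domain_stats(n_tuples, n_attrs, values_per_attr, epsilon, sensitivity=2.0):
    """Return (cells, average count per cell, expected |Laplace noise| per cell)."""
    cells = values_per_attr ** n_attrs
    avg_count = n_tuples / cells
    # The mean absolute value of Laplace(0, b) noise is b = sensitivity / epsilon.
    avg_noise = sensitivity / epsilon
    return cells, avg_count, avg_noise

cells, avg_count, avg_noise = full_domain_stats(10_000_000, 10, 20, epsilon=0.1)
print(cells)                  # 10240000000000, i.e. about 10**13 cells
print(avg_count)              # roughly 1e-6 tuples per cell
print(avg_noise / avg_count)  # noise exceeds signal by many orders of magnitude
```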

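  14. PrivBayes: Dimension Reduction • Direct pipeline: sensitive database -> (convert) -> full-dim tuple distribution -> (+ noise) -> noisy distribution -> (sample) -> synthetic database • PrivBayes pipeline: sensitive database -> (convert) -> full-dim tuple distribution -> (approximate) -> a set of low-dim distributions -> (+ noise) -> noisy low-dim distributions -> (sample) -> synthetic database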

  15. PrivBayes: Dimension Reduction • The advantages of using low-dimensional distributions • easy to compute • small domain -> high signal density -> robust against noise • But how do we find a set of low-dim distributions that provides a good approximation to the full distribution?

  17. Bayesian Network • A 5-dimensional database with attributes: workclass, age, income, title, education

  19. Bayesian Network • [Diagram: a Bayesian network over workclass, age, income, title, education] • The quality of the Bayesian network determines the quality of the approximation

  21. Outline of the Algorithm • STEP 1: Choose a suitable Bayesian network $\mathcal{N}$ • must be done in a differentially private way • STEP 2: Compute the conditional distributions implied by $\mathcal{N}$ • straightforward to do under differential privacy • inject noise with the Laplace mechanism • STEP 3: Generate synthetic data by sampling from $\mathcal{N}$ • post-processing: no privacy issues
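
Steps 2 and 3 of this outline can be sketched for a network that is already given (Step 1 is not shown). This is a rough illustration under stated assumptions, not the paper's implementation: the function names, the representation of the network as `(child, parent)` pairs, the equal split of `epsilon` across marginals, and the count sensitivity of 2 are all choices made here.

```python
import random
from collections import Counter

def laplace(scale, rng):
    """Laplace(0, scale) sample as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_conditionals(rows, network, domains, epsilon, rng):
    """Step 2: noisy Pr[child | parent] for each (child, parent) pair.

    network lists (child, parent) in sampling order; parent is None for the root.
    Assumption: each marginal gets epsilon / len(network), and a count
    histogram has L1 sensitivity 2 (one tuple replaced).
    """
    scale = 2.0 * len(network) / epsilon
    tables = {}
    for child, parent in network:
        parent_values = [None] if parent is None else domains[parent]
        counts = Counter((None if parent is None else r[parent], r[child]) for r in rows)
        table = {}
        for p in parent_values:
            # Perturb each cell, clamp negatives to zero, normalize per parent value.
            noisy = [max(counts[(p, c)] + laplace(scale, rng), 0.0) for c in domains[child]]
            total = sum(noisy)
            if total == 0.0:
                table[p] = [1.0 / len(noisy)] * len(noisy)
            else:
                table[p] = [v / total for v in noisy]
        tables[child] = table
    return tables

def sample_synthetic(network, domains, tables, n, rng):
    """Step 3: sample each attribute given its already-sampled parent."""
    rows = []
    for _ in range(n):
        row = {}
        for child, parent in network:
            key = None if parent is None else row[parent]
            row[child] = rng.choices(domains[child], weights=tables[child][key])[0]
        rows.append(row)
    return rows

# Hypothetical two-attribute example in which income copies age.
rng = random.Random(0)
domains = {"age": [0, 1], "income": [0, 1]}
rows = [{"age": a, "income": a} for a in [0, 1] * 500]
network = [("age", None), ("income", "age")]
tables = noisy_conditionals(rows, network, domains, epsilon=10.0, rng=rng)
synthetic = sample_synthetic(network, domains, tables, 1000, rng)
```

Sampling touches only the noisy tables, never the original rows, which is why Step 3 counts as post-processing.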

  22. Optimal Bayesian Network • Finding the optimal 1-degree Bayesian network was solved in [Chow-Liu’68]. It is a DAG of maximum in-degree 1 that maximizes the sum of the mutual information $I$ of its edges, $\sum_{(X, Y) \in \text{edges}} I(X, Y)$, where $I(X, Y) = \sum_{x, y} \Pr[x, y] \log \frac{\Pr[x, y]}{\Pr[x] \Pr[y]}$

  23. Optimal Bayesian Network • Finding the optimal 1-degree Bayesian network was solved in [Chow-Liu’68]. It is a DAG of maximum in-degree 1 that maximizes the sum of the mutual information $I$ of its edges. This is equivalent to finding the maximum spanning tree, where the weight of edge $(X, Y)$ is the mutual information $I(X, Y)$.
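
The edge weight used here, mutual information, can be estimated from data as in the sketch below. The function name and the use of base-2 logarithms (bits) are choices made for this illustration.

```python
import math
from collections import Counter

def mutual_information(pairs):
    """Empirical mutual information (in bits) of a list of (x, y) observations."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    # Pr[x,y] / (Pr[x] Pr[y]) simplifies to c * n / (px[x] * py[y]) for counts.
    return sum(
        (c / n) * math.log2(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )

identical = [(0, 0), (1, 1)] * 50                      # y always equals x
independent = [(0, 0), (0, 1), (1, 0), (1, 1)] * 25    # y unrelated to x
print(mutual_information(identical))    # 1.0
print(mutual_information(independent))  # 0.0
```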

  24. Build a Bayesian Network • Build a 1-degree BN for the database

  25. Build a Bayesian Network • Start from a random attribute [Diagram: attributes A, B, C, D]

  26–29. Build a Bayesian Network • Repeatedly select the next tree edge by its mutual information; the candidates are the edges that connect an attribute already in the tree to one not yet in it [Diagram: attributes A, B, C, D]

  30. Build a Bayesian Network • Done once every attribute is connected [Diagram: attributes A, B, C, D]
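
The greedy construction walked through on these slides can be sketched as a Prim-style maximum spanning tree. The edge weights below are made-up numbers standing in for mutual information values, and `build_tree` is a hypothetical name; this version is non-private.

```python
def build_tree(attributes, weight, start):
    """Greedy (Prim-style) maximum spanning tree; returns (parent, child) edges."""
    in_tree = [start]
    remaining = [a for a in attributes if a != start]
    edges = []
    while remaining:
        # Candidates connect an attribute already in the tree to one outside it.
        parent, child = max(
            ((p, c) for p in in_tree for c in remaining),
            key=lambda e: weight[frozenset(e)],
        )
        edges.append((parent, child))
        in_tree.append(child)
        remaining.remove(child)
    return edges

# Hypothetical mutual information values for the attribute pairs.
weight = {
    frozenset("AB"): 0.5, frozenset("AC"): 0.1, frozenset("AD"): 0.2,
    frozenset("BC"): 0.4, frozenset("BD"): 0.05, frozenset("CD"): 0.3,
}
tree = build_tree(["A", "B", "C", "D"], weight, start="A")
print(tree)  # [('A', 'B'), ('B', 'C'), ('C', 'D')]
```

Each returned pair is read as "parent -> child", so every attribute other than the start has exactly one parent, matching a 1-degree network.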

  31. $k$-degree Bayesian Network • It is NP-hard to train the optimal $k$-degree Bayesian network when $k > 1$ [JMLR’04]. • Most approximation algorithms are too complicated to be converted into private algorithms. • In our paper, we find a way to extend the Chow-Liu solution (1-degree) to higher-degree cases. • In this talk, we focus on the 1-degree case for simplicity.

  32. Private Bayesian Network • Do it under differential privacy! • (Non-private) select the edge with maximum $I$ • (Private) $I$ is data-sensitive -> the best edge is also data-sensitive • Solution: randomized edge selection!

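  33. Exponential Mechanism [FOCS’07] • [Diagram: databases and edges] • Define a quality function $q(D, e)$: how good edge $e$ is as the result of the selection, given database $D$ • Return $e$ with probability $\Pr[e] \propto \exp\left(\frac{\varepsilon \, q(D, e)}{2 \Delta_q}\right)$, where $\Delta_q = \max_{e, D_1, D_2} |q(D_1, e) - q(D_2, e)|$ over neighboring databases $D_1, D_2$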
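
A generic exponential mechanism can be sketched as below. The candidate edges, their quality scores, and the sensitivity value are hypothetical inputs chosen for illustration; the function name is also an assumption.

```python
import math
import random

def exponential_mechanism(candidates, quality, epsilon, sensitivity, rng=None):
    """Pick a candidate with probability proportional to
    exp(epsilon * quality / (2 * sensitivity))."""
    rng = rng or random.Random()
    scores = [quality(c) for c in candidates]
    top = max(scores)
    # Subtracting the maximum score leaves the distribution unchanged
    # but keeps exp() from overflowing.
    weights = [math.exp(epsilon * (s - top) / (2.0 * sensitivity)) for s in scores]
    return rng.choices(candidates, weights=weights)[0]

# Hypothetical candidate edges with made-up quality scores.
scores = {"A-B": 0.5, "A-C": 0.1, "A-D": 0.2}
edge = exponential_mechanism(list(scores), scores.get, epsilon=1.0,
                             sensitivity=0.01, rng=random.Random(0))
print(edge)
```

The ratio `epsilon / sensitivity` controls how sharply the selection favors high-quality candidates: a large sensitivity flattens the distribution toward uniform, which is why a low-sensitivity quality function matters.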

  34. Private Bayesian Network • Do it under differential privacy! • Select edges with the exponential mechanism • define $q(\text{edge}) = I(\text{edge})$ • we prove that the sensitivity of $I$ is $\Theta(\log n / n)$, where $n = |D|$ (Lemma 1) • Problem solved? No: the sensitivity (noise scale) is too large for $I$

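  36. Basic Facts • $F$ and $I$ have a strong positive correlation

  37. Function $F$ • Idea: define the score $F$ to agree with $I$ at its maximum values and interpolate linearly in between • consider the “optimal” distributions over the attribute pair that maximize $I$; $F$ measures how far the actual distribution is from them • Range of $F$: $\Theta(1)$ • Sensitivity of $F$: $\Theta(1/n)$

  38. Function $F$ • [Figure illustrating $F$; labeled values 0.4 and 1.6]

  39. $F$ vs. $I$ • [Figure: $F$ and $I$ of random distributions, with their correlation coefficient]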

  41. $F$ vs. $I$ • [Figure: results on the Adult dataset]

  42. Dataset • We use four datasets in our experiments • Adult, NLTCS, TPC-E, BR2000 • Adult dataset • census data of 45,222 individuals • 15 attributes: age, workclass, education, marital status, etc. • tuple domain size (full-dimensional): about

  43. Counting Queries Query: all -way marginals Query: all -way marginals

  44. Multiple SVMs • Query: build 4 classifiers • [Figure panels: Adult, education; Adult, gender]

  46. Concluding Remarks • Differential privacy can be applied effectively to data release • Key ideas of the solution: • Bayesian networks for dimension reduction • a carefully designed linear quality function for the exponential mechanism • Many open problems remain: • extending to other forms of data: graph data, mobility data • obtaining alternative (workable) privacy definitions • Thanks!

  47. Appendix

  48. Previous Work • Privacy, accuracy, and consistency too: a holistic solution to contingency table release [PODS’07] • incurs an exponential running time • only optimized for low-dimensional marginals • Differentially private publication of sparse data [ICDT’12] • achieves scalability, but does not help with the signal-to-noise problem • Differentially private spatial decompositions [ICDE’12] • coarsens the histogram H to control the number of cells • has some limits, e.g., range queries, ordinal domain

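  49. Optimal Distributions • Assume that $|dom(X)| \le |dom(\Pi)|$. A distribution $\Pr[X, \Pi]$ maximizes the mutual information between $X$ and $\Pi$ if and only if • $\Pr[x] = 1/|dom(X)|$ for any $x \in dom(X)$; • for each $\pi \in dom(\Pi)$, there is at most one $x \in dom(X)$ with $\Pr[x, \pi] > 0$.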
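
The two conditions can be checked numerically on a small case. In the sketch below, the joint distributions and value labels are invented for illustration, with a 2-valued first attribute and a 3-valued second attribute; a distribution meeting both conditions reaches the upper bound of log2(2) = 1 bit, while one with a non-uniform first attribute falls short.

```python
import math
from collections import defaultdict

def mutual_information_of_joint(joint):
    """Mutual information (in bits) of a joint distribution {(x, pi): prob}."""
    px = defaultdict(float)
    ppi = defaultdict(float)
    for (x, pi), p in joint.items():
        px[x] += p
        ppi[pi] += p
    return sum(
        p * math.log2(p / (px[x] * ppi[pi]))
        for (x, pi), p in joint.items() if p > 0
    )

# Satisfies both conditions: x is uniform, and each pi co-occurs with one x only.
optimal = {(0, "a"): 0.5, (1, "b"): 0.3, (1, "c"): 0.2}
# Violates the first condition: x is not uniform (0.7 vs 0.3).
skewed = {(0, "a"): 0.7, (1, "b"): 0.3}

print(mutual_information_of_joint(optimal))  # reaches the 1-bit upper bound
print(mutual_information_of_joint(skewed))   # strictly below 1 bit
```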

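  50. Analogy: Logarithmic vs. Linear • consider two score functions of a real argument, one logarithmic and one linear • evaluate them on neighboring databases • sensitivity (noise) is governed by the maximum of the derivative of each function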
