
Randomization in Privacy Preserving Data Mining


Presentation Transcript


  1. Randomization in Privacy Preserving Data Mining • Agrawal, R., and Srikant, R. Privacy-Preserving Data Mining, ACM SIGMOD ’00 • The following slides include material from this paper

  2. Privacy-Preserving Data Mining • Problem: How do we publish data without compromising individual privacy? • Solution: randomization, anonymization

  3. Randomization • Adding random noise to the original dataset • Challenge: is the data still useful for further analysis?

  4. Randomization • Model: data is distorted by adding random noise • Original data X = {x1, ..., xN}; to each record xi ∈ X a random value yi from Y = {y1, ..., yN} is added, so the new data is Z = {z1, ..., zN}, with zi = xi + yi • yi is a random value drawn from either • Uniform, [-α, +α] • Gaussian, N(0, σ²)
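A minimal sketch of this additive perturbation in NumPy, using made-up data (the values and noise parameters are illustrative assumptions, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(50, 10, size=1000)   # hypothetical original values (e.g. ages)

# uniform perturbation: y_i drawn from [-alpha, +alpha]
alpha = 5.0
z_uniform = x + rng.uniform(-alpha, alpha, size=x.shape)

# Gaussian perturbation: y_i drawn from N(0, sigma^2)
sigma = 2.0
z_gauss = x + rng.normal(0.0, sigma, size=x.shape)
```

Each record is perturbed independently, so this can be done at collection time, before the data ever leaves the individual.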

  5. Reconstruction • Perturbed data hides the data distribution, which needs to be reconstructed before data mining • Given • x1+y1, x2+y2, ..., xn+yn • the probability distribution of Y • Estimate the probability distribution of X (Clifton, AusDM ’11)

  6. Reconstruction • Use Bayes’ rule to iteratively estimate the density of X • Initialize fX⁰ to the uniform distribution • Repeat the Bayesian update until a stopping criterion is met (the reconstruction algorithm)
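A discretized sketch of this iterative Bayesian reconstruction (NumPy; the bin grid and fixed iteration count are assumptions for illustration — the paper iterates until the estimate stops changing):

```python
import numpy as np

def reconstruct(z, noise_pdf, bins, iters=100):
    """Estimate the density of X from perturbed values z = x + y,
    given the noise density f_Y, via iterative Bayesian updates."""
    centers = (bins[:-1] + bins[1:]) / 2
    f = np.ones(len(centers)) / len(centers)      # f_X^0: uniform start
    w = noise_pdf(z[:, None] - centers[None, :])  # f_Y(z_i - a) per (record, bin)
    for _ in range(iters):
        post = w * f[None, :]                     # Bayes numerator per record
        post /= post.sum(axis=1, keepdims=True)   # posterior over bins
        f = post.mean(axis=0)                     # average posteriors -> new f_X
    return centers, f
```

For uniform noise in [-α, +α], the noise density passed in would be `lambda y: (np.abs(y) <= alpha) / (2 * alpha)`.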

  7. [Figure: original, randomized, and reconstructed distributions, for uniform noise on (-0.5, 0.5) and Gaussian noise N(0, 0.25)]

  8. Privacy Metric • If a value x can be estimated to lie in an interval [α, β] with c% confidence, then the interval width (β − α) defines the amount of privacy at c% confidence • Example: Age in [20, 40]; for 50% privacy at 95% confidence with uniform noise, 2α = 20 × 0.5 / 0.95 ≈ 10.5
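Unpacking the slide’s arithmetic (my reading of the garbled example): for uniform noise in [-α, +α], the c-confidence interval around a value has width 2αc, so the noise width needed to reach a target privacy fraction of the attribute range is:

```python
W = 40 - 20       # width of the Age range [20, 40]
privacy = 0.5     # target privacy: interval covering 50% of the range (10 years)
c = 0.95          # required confidence
two_alpha = privacy * W / c   # solve 2*alpha*c = privacy * W for 2*alpha
print(round(two_alpha, 1))    # -> 10.5
```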

  9. Decision Tree

  10. Training Decision Trees • Split points: interval boundaries • Reconstruction algorithms: Global, ByClass, Local • Dataset: synthetic, with a training set of 100,000 records and a test set of 5,000 records, equally split into two classes

  11. [Figure: decision-tree accuracy on original, randomized, and reconstructed data under the Global, ByClass, and Local algorithms]

  12. Extended Work • ’02: a method to quantify information loss, based on mutual information • ’07: evaluated randomization when combined with public information • Gaussian noise is better than uniform • Datasets with an inherent cluster pattern improve randomization performance • Varying density and outliers decrease performance

  13. Multiplicative Randomization • Rotation randomization: distort the data with an orthogonal matrix • Projection randomization: project a high-dimensional dataset into a low-dimensional space • Preserves Euclidean distance, so it can be applied with distance-based classification (k-NN, SVM) and clustering (k-means)
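A sketch of rotation randomization in NumPy. The QR-based construction of a random orthogonal matrix is one common choice, assumed here for illustration; the key property is that any orthogonal Q preserves pairwise Euclidean distances exactly:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))          # hypothetical 5-dimensional records

# random orthogonal matrix Q (Q.T @ Q = I) via QR decomposition
Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))
X_rot = X @ Q                          # distorted (rotated) dataset

# Euclidean distances between records are unchanged
d_before = np.linalg.norm(X[0] - X[1])
d_after = np.linalg.norm(X_rot[0] - X_rot[1])
```

Because distances survive the rotation, k-NN, SVM, or k-means run on `X_rot` produce the same neighborhoods and clusters as on `X`, without revealing the original coordinates.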

  14. Summary • Pros: noise is independent of the data, can be applied at data-collection time, useful for stream data • Cons: information loss, curse of dimensionality

  15. Questions?
