1 / 22

Privacy-Preserving Data Mining

Privacy-Preserving Data Mining. Rakesh Agrawal Ramakrishnan Srikant IBM Almaden Research Center 650 Harry Road, San Jose, CA 95120 Published in: ACM SIGMOD International Conference on Management of Data, 2000. Slides by: Adam Kaplan (for CS259, Fall’03). What is Data Mining?.

cheng
Télécharger la présentation

Privacy-Preserving Data Mining

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Privacy-Preserving Data Mining Rakesh Agrawal Ramakrishnan Srikant IBM Almaden Research Center 650 Harry Road, San Jose, CA 95120 Published in: ACM SIGMOD International Conference on Management of Data, 2000. Slides by: Adam Kaplan (for CS259, Fall’03)

  2. What is Data Mining? • Information Extraction • Performed on databases. • Combines theory from • machine learning • statistical analysis • database technology • Finds patterns/relationships in data • Trains the system to predict future results • Typical applications: • customer profiling • fraud detection • credit risk analysis. Definition borrowed from the Two Crows corporate website: http://www.twocrows.com

  3. CREDIT RISK Age < 25 Salary < 50k High High Low A Simple Example of Data Mining TRAINING DATA Data Mining • Recursively partition training data into decision tree classifier • Non-leaf nodes = split points; test a specific data attribute • Leaf nodes = entirely or “mostly” represent data from the same class • Previous well-known methods to automate classification • [Mehta, Agrawal, Rissanen EDBT’96] – SLIQ paper • [Shafer, Agrawal, Mehta VLDB’96] – SPRINT paper

  4. Where does privacy fit in? • Data mining performed on databases • Many of them contain sensitive/personal data • e.g. Salary, Age, Address, Credit History • Much of data mining concerned with aggregates • Statistical models of many records • May not require precise information from each record • Is it possible to… • Build a decision-tree classifier accurately • without accessing actual fields from user records? • (…thus protecting privacy of the user)

  5. Preserving Data Privacy (1) • Value-Class Membership • Discretization:values for an attribute are discretized into intervals • Intervals need not be of equal width. • Use the interval covering the data in computation, rather than the data itself. • Example: • Perhaps Adam doesn’t want people to know he makes $4000/year. • Maybe he’s more comfortable saying he makes between $0 - $20,000 per year. • The most often used method for hiding individual values.

  6. Preserving Data Privacy (2) • Value Distortion • Instead of using the actual data xi • Use xi + r, where r is a random value from a distribution. • Uniform Distribution • r is uniformly distributed between [-α, +α] • Average r is 0. • Gaussian Distribution • r has a normal distribution • Mean μ(r) is 0. • Standard_deviation(r) is σ

  7. What do we mean by “private?” W = width of intervals in discretization • If we can estimate with c% confidence • The value x lies within the interval [x1, x2] • Privacy = (x2 - x1), the size of the range. • If we want very high privacy • 2α> W • Value distortion methods (Uniform, Gaussian) provide more privacy than discretization at higher confidence levels.

  8. Reconstructing Original Distribution From Distorted Values (1) • Define: • Original data values: x1, x2, …, xn • Random variable distortion: y1, y2, …, yn • Distorted samples: x1+y1, x2+y2, …, xn+yn • FY : The Cumulative Distribution Function (CDF) of random distortion variables yi • FX : The CDF of original data values xi

  9. Reconstructing Original Distribution From Distorted Values (2) • The Reconstruction Problem • Given • FY • distorted samples (x1+y1,…, xn+yn) • Estimate FX

  10. Reconstruction Algorithm (1) • How it works (incremental refinement of FX ) : • The f(x, 0) initialized to uniform distribution • For j=0 until stopping, do • Find f(x, j+1) as a function of f(x, j) and FY • When loop stops, f(x) estimates FX

  11. Reconstruction Algorithm (2) • Stopping Criterion • Compare successive estimates f(x, j). • Stop when difference between successive estimates very small.

  12. Distribution Reconstruction Results (1) Original = original distribution Randomized = effect of randomization on original dist. Reconstructed = reconstructed distribution

  13. Distribution Reconstruction Results (2) Original = original distribution Randomized = effect of randomization on original dist. Reconstructed = reconstructed distribution

  14. Summary of Reconstruction Experiments • Authors are able to reconstruct • Original shape of data • Almost same aggregate distribution • This can be done even when randomized data distribution looks nothing like the original.

  15. CREDIT RISK Age < 25 Salary < 50k High High Low Decision-Tree Classifiers w/ Perturbed Data • When/how to recover original distributions in order to build tree? • Global - for each attribute, reconstruct original distribution before building tree • ByClass – for each attribute, split the training data into classes, and reconstruct distributions separately for each class; then build tree • Local – like ByClass, reconstruct distribution separately for each class, but do this reconstruction while building decision tree

  16. Experimental Results – Classification w/ Perturbed Data • Compare Global, ByClass, Local algorithms against control series: • Original – result of classification of unperturbed training data • Randomized– result of classification on perturbed data with no correction • Run on five classification functions Fn1 through Fn5. (classify data into groups based on attributes)

  17. Results – Classification Accuracy (1)

  18. Results – Classification Accuracy (2)

  19. Experimental Results – Varying Privacy • Using ByClass algorithm on each classification function (except Fn4) • Vary privacy level from 10% - 200% • Show • Original – unperturbed data • ByClass(G) – ByClass with Gaussian perturbation • ByClass(U) – ByClass with Uniform perturbation • Random(G) – uncorrected data with Gaussian perturbation • Random(U) – uncorrected data with Uniform perturbation

  20. Results – Accuracy vs. Privacy (1)

  21. Results – Accuracy vs. Privacy (2) Note: Function 4 skipped because almost same results as Function 5.

  22. Conclusion • Perturb sensitive values in a user record by adding random noise. • Ensures privacy because original values are unknown. • Aggregate functions/classifiers can be accurately performed on perturbed values. • Gaussian distribution of noise provides more privacy than Uniform distribution at higher confidence levels.

More Related