
Technical Seminar On Privacy Preserving Data Mining

Technical Seminar on Privacy Preserving Data Mining. Under the guidance of Indraneel Mukhopadhyay. By Sarmila Acharya, Roll No.: 200157041, Branch: IT.

niyati


Presentation Transcript


  1. Technical Seminar on Privacy Preserving Data Mining. Under the guidance of Indraneel Mukhopadhyay. By Sarmila Acharya, Roll No.: 200157041, Branch: IT.

  2. Database Privacy. Statistical approaches: alter the frequency (PRAN/DS/PERT) of particular features while preserving means; additionally, erase values that reveal too much. Query-based approaches: involve a permanent trusted third party. Query monitoring: disallow queries that breach privacy. Perturbation: add noise to the query output. Statistical perturbation + adversarial analysis: combine statistical techniques with analysis similar to query-based approaches.

  3. Growing Privacy Concerns. Popular press: Economist, "The End of Privacy" (May 99); Time, "The Death of Privacy" (Aug 97). Govt. directives/commissions: European directive on privacy protection (Oct 98); Canadian Personal Information Protection Act (Jan 2001). Surveys of web users: 17% privacy fundamentalists, 56% pragmatic majority, 27% marginally concerned (Understanding net users' attitude about online privacy, April 99); 82% said having a privacy policy would matter.

  4. Privacy Preserving Methods • Two methods were used for modifying values: • Value-Class Membership: the values for an attribute are partitioned into a set of disjoint, mutually exclusive classes. • Value Distortion: return a value xi + r instead of xi, where r is a random value drawn from some distribution. • Two random distributions were used: • Uniform: the random variable has a uniform distribution between [-α, +α]. The mean value of the random variable is 0. • Gaussian: the random variable has a normal distribution with mean μ = 0 and standard deviation σ.
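
The two modification methods on this slide can be sketched in a few lines of Python. This is only an illustration: the function names, the sample ages, and the parameter values (`lower`, `width`, `alpha`, `sigma`) are assumptions chosen for the example, not values from the presentation.

```python
import random

def value_class_membership(x, lower, width):
    """Return the (low, high) interval of the class that x falls into.

    Classes are disjoint intervals of equal width starting at `lower`.
    """
    index = int((x - lower) // width)
    low = lower + index * width
    return (low, low + width)

def distort_uniform(x, alpha, rng=random):
    """Return x + r with r drawn uniformly from [-alpha, +alpha] (mean 0)."""
    return x + rng.uniform(-alpha, alpha)

def distort_gaussian(x, sigma, rng=random):
    """Return x + r with r drawn from a normal distribution, mean 0, std sigma."""
    return x + rng.gauss(0.0, sigma)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
ages = [23, 37, 41, 58]  # hypothetical sensitive values
print([value_class_membership(a, lower=20, width=10) for a in ages])
print([round(distort_uniform(a, alpha=5, rng=rng), 1) for a in ages])
print([round(distort_gaussian(a, sigma=5, rng=rng), 1) for a in ages])
```

Value-class membership hides a value by reporting only its interval, while value distortion keeps a numeric value but shifts it by zero-mean noise.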

  5. Quantifying Privacy: For quantifying privacy provided by a method, we use a measure based on how closely the original values of a modified attribute can be estimated.

     Confidence       50%         95%          99.9%
     Discretization   0.5 × W     0.95 × W     0.999 × W
     Uniform          0.5 × 2α    0.95 × 2α    0.999 × 2α
     Gaussian         1.34 × σ    3.92 × σ     6.8 × σ
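
A minimal sketch of how the widths listed above could be computed, assuming W is the width of a discretization interval, α the half-range of the uniform noise, and σ the standard deviation of the Gaussian noise (the function names are illustrative). The Gaussian case uses the two-sided normal quantile; computed this way, the 99.9% width comes out near 6.58 × σ, slightly below the 6.8 × σ quoted on the slide.

```python
from statistics import NormalDist

def width_discretization(confidence, W):
    """Interval width for a value known only to lie in a class of width W."""
    return confidence * W

def width_uniform(confidence, alpha):
    """Interval width for uniform noise on [-alpha, +alpha] (total range 2*alpha)."""
    return confidence * 2 * alpha

def width_gaussian(confidence, sigma):
    """Width of the central interval holding `confidence` of N(0, sigma^2)."""
    z = NormalDist().inv_cdf((1 + confidence) / 2)
    return 2 * z * sigma

for c in (0.5, 0.95, 0.999):
    print(c,
          round(width_discretization(c, W=1.0), 3),
          round(width_uniform(c, alpha=1.0), 3),
          round(width_gaussian(c, sigma=1.0), 3))
```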

  6. Reconstruction Problem: Original values x1, x2, ..., xn come from an unknown probability distribution X. To hide these values, we use y1, y2, ..., yn from a probability distribution Y. Given x1+y1, x2+y2, ..., xn+yn and the probability distribution of Y, estimate the probability distribution of X.

  7. Intuition (Reconstruct single point): Use Bayes' rule for density functions.
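
As a sketch of the single-point intuition: write w = x + y for one observed perturbed value, and let f_X and f_Y be the densities of X and Y (this notation is assumed here for illustration; the slide's own formula is not in the transcript). Bayes' rule for densities then gives the posterior density of the original value as:

```latex
f_X(a \mid X + Y = w)
  = \frac{f_Y(w - a)\, f_X(a)}
         {\int_{-\infty}^{\infty} f_Y(w - z)\, f_X(z)\, dz}
```

The numerator weighs each candidate original value a by how plausible the required noise w − a is, and the denominator normalizes the result into a density.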

  8. Reconstructing the Distribution • Combine estimates of where point came from for all the points: • Gives estimate of original distribution.

  9. Reconstruction Algorithm:
       fX0 := Uniform distribution
       j := 0 // Iteration number
       repeat
         fXj+1(a) := (Bayes' rule)
         j := j+1
       until (stopping criterion met)
     Converges to maximum likelihood estimate.
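
The iteration above can be sketched in Python on a discretized grid. This is an illustrative version under stated assumptions, not the presenters' implementation: the candidate values are restricted to a hypothetical `grid`, the Bayes update is averaged over all observed values, the stopping criterion is a small change between iterations, and the Gaussian noise and sample data are made up for the example.

```python
import math
import random

def gaussian_pdf(t, sigma):
    """Density of N(0, sigma^2) at t."""
    return math.exp(-0.5 * (t / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def reconstruct(w, noise_pdf, grid, max_iters=200, tol=1e-6):
    """Estimate the probability mass of X on each grid point from perturbed values w."""
    k = len(grid)
    fx = [1.0 / k] * k  # f_X^0 := uniform distribution
    # Precompute f_Y(w_i - a) for every observation w_i and grid point a.
    like = [[noise_pdf(wi - a) for a in grid] for wi in w]
    for _ in range(max_iters):
        new = [0.0] * k
        for row in like:
            denom = sum(l * p for l, p in zip(row, fx))
            if denom == 0.0:
                continue  # observation not explained by any grid point
            for idx in range(k):
                # Bayes' rule: posterior mass of grid point idx given w_i
                new[idx] += row[idx] * fx[idx] / denom
        total = sum(new)
        new = [v / total for v in new]
        change = max(abs(n - o) for n, o in zip(new, fx))
        fx = new
        if change < tol:  # assumed stopping criterion
            break
    return fx

rng = random.Random(0)
sigma = 1.0
original = [rng.choice([2.0, 8.0]) for _ in range(1000)]  # hidden values
perturbed = [x + rng.gauss(0.0, sigma) for x in original]  # what is observed
grid = [float(g) for g in range(11)]
estimate = reconstruct(perturbed, lambda t: gaussian_pdf(t, sigma), grid)
print([round(p, 2) for p in estimate])
```

Only the perturbed values and the noise density enter `reconstruct`; the original values are used solely to generate the test data.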

  10. Decision Tree Classification: Randomized Data
       Algorithm Partition(Data S)
       begin
         if (most points in S belong to same class)
           return;
         for each attribute A
           evaluate splits on attribute A;
         Use best split to partition S into S1 and S2;
         Partition(S1);
         Partition(S2);
       end
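
A runnable sketch of the Partition procedure, under assumptions the slide does not specify: splits are evaluated with the gini index, "most points" is taken to mean a majority share of at least `purity`, and the tree is returned as nested dicts. The sample rows and the "low"/"high" labels are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return (attribute index, threshold) with the lowest weighted gini, or None."""
    best, best_score = None, gini(labels)
    n = len(rows)
    for attr in range(len(rows[0])):
        # Skip the largest value so both sides of a split are non-empty.
        for threshold in sorted({r[attr] for r in rows})[:-1]:
            left = [l for r, l in zip(rows, labels) if r[attr] <= threshold]
            right = [l for r, l in zip(rows, labels) if r[attr] > threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:
                best, best_score = (attr, threshold), score
    return best

def partition(rows, labels, purity=0.95):
    """Recursively build a tree; leaves are class labels, inner nodes are dicts."""
    majority, count = Counter(labels).most_common(1)[0]
    if count / len(labels) >= purity:  # most points belong to the same class
        return majority
    split = best_split(rows, labels)
    if split is None:  # no split improves impurity
        return majority
    attr, threshold = split
    left = [(r, l) for r, l in zip(rows, labels) if r[attr] <= threshold]
    right = [(r, l) for r, l in zip(rows, labels) if r[attr] > threshold]
    return {
        "attr": attr,
        "threshold": threshold,
        "left": partition([r for r, _ in left], [l for _, l in left], purity),
        "right": partition([r for r, _ in right], [l for _, l in right], purity),
    }

def classify(tree, row):
    """Walk the tree from the root to a leaf label."""
    while isinstance(tree, dict):
        branch = "left" if row[tree["attr"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree

rows = [(25, 30000), (30, 40000), (45, 80000), (50, 90000)]  # (age, salary)
labels = ["low", "low", "high", "high"]
tree = partition(rows, labels)
print(tree, classify(tree, (28, 35000)))
```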

  11. Classification Example

  12. Training using Randomized Data. Need to modify two key operations: determining the split point and partitioning the data. Reconstructing the original distribution: reconstruct using the whole data (Global) or reconstruct separately for each class (ByClass); reconstruct once at the root node or at every node (Local).

  13. Reconstructing the Original Distribution. We consider three different algorithms that differ in when and how distributions are reconstructed. Global: reconstruct the distribution for each attribute once at the beginning using the complete perturbed training data; induce the decision tree using the reconstructed data. ByClass: for each attribute, first split the training data by class, then reconstruct the distributions separately for each class; induce the decision tree using the reconstructed data. Local: as in ByClass, for each attribute, split the training data by class and reconstruct distributions separately for each class; however, instead of doing reconstruction only once, reconstruction is done at each node.
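
One way to picture the difference between the three algorithms is to look at which perturbed values are handed to the reconstruction step together. The helper below is a hypothetical sketch (its name, the strategy strings, and the sample data are assumptions): Global reconstructs from one pooled group, while ByClass and Local reconstruct from one group per class. Local would additionally call it again at every tree node on the records that reach that node.

```python
from collections import defaultdict

def reconstruction_groups(values, labels, strategy):
    """Return the lists of perturbed values that are reconstructed together."""
    if strategy == "global":
        # One reconstruction over the complete perturbed training data.
        return {"all": list(values)}
    if strategy in ("byclass", "local"):
        # One reconstruction per class; "local" repeats this at each node.
        groups = defaultdict(list)
        for value, label in zip(values, labels):
            groups[label].append(value)
        return dict(groups)
    raise ValueError(f"unknown strategy: {strategy}")

perturbed_ages = [27.3, 41.8, 33.1, 58.4]  # hypothetical perturbed attribute
classes = ["A", "B", "A", "B"]
print(reconstruction_groups(perturbed_ages, classes, "global"))
print(reconstruction_groups(perturbed_ages, classes, "byclass"))
```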

  14. Experimental Methodology • Compare accuracy against • Original: unperturbed data without randomization. • Randomized: perturbed data but without making any corrections for randomization. • Test data not randomized. • Training set of 100,000 records, split equally between the two classes.

  15. Synthetic Data Functions

  16. Acceptable loss in accuracy

  17. Accuracy vs. Randomization Level

  18. Inter-Enterprise Data Mining. Problem: two parties owning confidential databases wish to build a decision-tree classifier on the union of their databases, without revealing any unnecessary information. Horizontally partitioned: records (users) are split across companies. Example: credit card fraud detection model. Vertically partitioned: attributes are split across companies. Example: associations across websites.
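
The two partitioning schemes can be illustrated on a toy table. The records, field names, and helper functions below are hypothetical and only show how the data would be divided between two parties; they do not implement any privacy-preserving protocol.

```python
def horizontal_partition(records, belongs_to_a):
    """Split whole records between two parties; both hold the same attributes."""
    part_a = [r for r in records if belongs_to_a(r)]
    part_b = [r for r in records if not belongs_to_a(r)]
    return part_a, part_b

def vertical_partition(records, key, attrs_a, attrs_b):
    """Split attributes between two parties; both hold every record's key."""
    part_a = [{key: r[key], **{a: r[a] for a in attrs_a}} for r in records]
    part_b = [{key: r[key], **{a: r[a] for a in attrs_b}} for r in records]
    return part_a, part_b

records = [
    {"id": 1, "age": 34, "amount": 120.0},
    {"id": 2, "age": 51, "amount": 980.5},
    {"id": 3, "age": 27, "amount": 45.0},
]

# Horizontal: each company holds some of the users.
h_a, h_b = horizontal_partition(records, lambda r: r["id"] % 2 == 1)
# Vertical: each company holds some of the attributes for all users.
v_a, v_b = vertical_partition(records, "id", ["age"], ["amount"])
print(h_a, h_b)
print(v_a, v_b)
```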

  19. Conclusion: In this paper, we studied the technical feasibility of realizing privacy-preserving data mining. The basic premise was that the sensitive values in a user's record will be perturbed using a randomizing function so that they cannot be estimated with sufficient precision. Randomization can be done using Gaussian or Uniform perturbations. For the specific case of decision-tree classification, we found two effective algorithms, ByClass and Local. The algorithms rely on a Bayesian procedure for correcting perturbed distributions.

  20. Thank You
