
Deriving Private Information from Perturbed Data Using IQR-based Approach

Presentation Transcript


  1. Deriving Private Information from Perturbed Data Using IQR-based Approach. Songtao Guo (UNC Charlotte), Xintao Wu (UNC Charlotte), Yingjiu Li (Singapore Management Univ). PDM Workshop, April 8, 2006

  2. Source: http://www.privacyinternational.org/issues/foia/foia-laws.jpg

  3. [World map of data protection laws] • PIPEDA 2000 (Canada) • European Union (Directive 95/46/EC) • HIPAA for health care • California State Bill 1386 • Gramm-Leach-Bliley Act for financial institutions • COPPA for children’s online privacy • etc. • Source: http://www.privacyinternational.org/survey/dpmap.jpg

  4. Mining vs. Privacy • Data mining • The goal of data mining is summary results (e.g., classification, clustering, association rules, etc.) from the data (distribution) • Individual privacy • Individual values in the database must not be disclosed, or at least no close estimation can be derived by attackers • Privacy Preserving Data Mining (PPDM) • How to “perturb” the data such that • we can build a good data mining model (data utility) • while preserving individuals’ privacy at the record level (privacy)?

  5. Our Focus • [Diagram of privacy-preserving approaches: the focus in this talk is additive-noise / randomization-based perturbation; other approaches include k-anonymity, L-diversity, SDC, etc.]

  6. Additive Noise based PPDM • Distribution reconstruction • AS method, Agrawal and Srikant, SIGMOD 00 • EM method, Agrawal and Aggarwal, PODS 01 • Individual value reconstruction • Spectral Filtering (SF), Kargupta et al., ICDM 03 • PCA, Huang, Du and Chen, SIGMOD 05

  7. Additive Randomization (Y = X + R) • R. Agrawal and R. Srikant, SIGMOD 00 • [Diagram: original records (Alice’s age: 30 | 70K | ..., 50 | 40K | ...) pass through a randomizer that adds a random number to each value, so 30 becomes 65 (30 + 35); from the perturbed records (65 | 20K | ..., 25 | 60K | ...) the distributions of Age and Salary are reconstructed and fed to a classification algorithm to build a model]
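
A minimal sketch of the Y = X + R step on made-up values (the noise range Uniform(−50, 50) is illustrative, chosen only so that 30 can become 65 as in the slide):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical original sensitive column (e.g., ages); values are made up.
x = np.array([30.0, 50.0, 42.0, 27.0, 61.0])

# Additive randomization Y = X + R: independent uniform noise per record.
r = rng.uniform(-50, 50, size=x.shape)
y = x + r   # perturbed values handed to the miner; only the noise
            # distribution (not the individual r values) is disclosed
print(y)
```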

  8. Distribution Reconstruction • Algorithm • fX0 := uniform distribution • j := 0 // iteration number • repeat • fXj+1(a) := (1/n) Σi=1..n [ fR(yi − a) fXj(a) / ∫ fR(yi − z) fXj(z) dz ], using the perturbed values yi and the known noise density fR • j := j + 1 • until (stopping criterion met) • Converges to the maximum likelihood estimate – Agrawal and Aggarwal, PODS 01
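
Below is a runnable sketch of the iterative update on a discretized domain; the grid, noise density, and toy data are my own illustrative choices, with fR the noise density and yi the perturbed values as above:

```python
import numpy as np

def as_reconstruct(y, noise_pdf, grid, n_iter=100):
    """Iterative Bayesian reconstruction of f_X on a discrete grid from
    perturbed values y = x + r and the known noise density f_R."""
    fx = np.full(len(grid), 1.0 / len(grid))       # f_X^0: uniform
    k = noise_pdf(y[:, None] - grid[None, :])      # f_R(y_i - a), shape (n, m)
    for _ in range(n_iter):
        post = k * fx[None, :]                     # numerator of the update
        post /= post.sum(axis=1, keepdims=True)    # denominator (discrete integral)
        fx = post.mean(axis=0)                     # average over the n records
        fx /= fx.sum()
    return fx

# Toy usage: X ~ N(5, 1), uniform noise on [-1, 1].
rng = np.random.default_rng(1)
x = rng.normal(5, 1, size=2000)
y = x + rng.uniform(-1, 1, size=x.size)
grid = np.linspace(0, 10, 200)
fx_hat = as_reconstruct(y, lambda t: ((t >= -1) & (t <= 1)) / 2.0, grid)
```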

  9. Individual Reconstruction • Spectral Filtering Technique (Kargupta et al., ICDM 03) • Apply EVD to the covariance matrix of the perturbed data • Using the covariance of the noise V, extract the first k principal components: λ1 ≥ λ2 ≥ ··· ≥ λk ≥ λe, with e1, e2, ···, ek the corresponding eigenvectors • Qk = [e1 e2 ··· ek] forms an orthonormal basis of a subspace X • Find the orthogonal projection onto X (projection matrix Qk QkT) • Estimate the data as the projection of the perturbed data onto this subspace • PCA Technique, Huang, Du and Chen, SIGMOD 05
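
A simplified sketch of the spectral-filtering idea (rows of Y are records; thresholding eigenvalues at the noise variance is a crude stand-in for the λe bound from random matrix theory used in the paper, so this is not the authors' exact procedure):

```python
import numpy as np

def spectral_filter(Y, noise_var, k=None):
    """Project the perturbed data onto the top-k eigenvectors of its
    covariance matrix and use the projection as the estimate of X."""
    mean = Y.mean(axis=0)
    Yc = Y - mean
    cov = np.cov(Yc, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)               # ascending eigenvalues
    vals, vecs = vals[::-1], vecs[:, ::-1]         # sort descending
    if k is None:
        k = max(1, int(np.sum(vals > noise_var)))  # keep lambda_i above the noise level
    Qk = vecs[:, :k]                               # orthonormal basis of the subspace
    return Yc @ Qk @ Qk.T + mean                   # orthogonal projection = X_hat
```

The filtering only helps when the attributes are correlated, so that the signal concentrates in a few principal directions while the noise spreads over all of them.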

  10. Motivation • The goal of randomization-based perturbation • To hide the sensitive data by randomly modifying the data values using some additive noise • To keep the aggregate characteristics or distribution unchanged or recoverable • Do those aggregate characteristics or that distribution contain confidential (private) information which may be exploited by snoopers to derive individuals’ sensitive data?

  11. Our Scenario • A single party (data holder) holds a collection of original individual data • Each individual value is associated with one privacy interval, specified by • privacy policies • corporate agreements • The data holder can utilize or release the data to a third party for analysis; however, he is required not to disclose any individual value within its privacy interval

  12. Inter-Quantile Range (IQR) • The inter-quantile range [xα1, xα2] is defined by P(xα1 ≤ x ≤ xα2) ≥ c%, where c = α2 − α1 denotes the confidence • IQR measures the amount of spread and variability of the variable; hence it can be used by attackers to estimate the range of each individual value • The IQR we use: [x(1−c)/2, x(1+c)/2] • [Figure: density curve with the quantiles xα1 and xα2 marked]
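
A small sketch of how an attacker could read the c% inter-quantile range off a reconstructed density defined on a grid (function and variable names are mine, not from the paper):

```python
import numpy as np

def iqr_from_density(grid, fx, c=0.95):
    """Return [x_(1-c)/2, x_(1+c)/2]: the central interval carrying at
    least c% of the probability mass of the (reconstructed) density fx."""
    cdf = np.cumsum(fx)
    cdf = cdf / cdf[-1]
    lo = grid[np.searchsorted(cdf, (1 - c) / 2)]
    hi = grid[np.searchsorted(cdf, (1 + c) / 2)]
    return lo, hi

# e.g. iqr_from_density(grid, fx_hat) on the reconstructed density above
# yields the 95% range the attacker claims each original value lies in.
```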

  13. Comparison with Other Privacy Definitions • Interval privacy (Agrawal and Srikant, SIGMOD 00) • If the original value can be estimated with c% confidence to lie in the interval [a, b], then the interval width (b − a) defines the amount of privacy at the c% confidence level • Mutual information (Agrawal and Aggarwal, PODS 01) • Reconstruction privacy (Rizvi and Haritsa, VLDB 02) • ρ1-to-ρ2 privacy breach (Evfimievski et al., PODS 03)

  14. Disclosure Measure • [Diagram: measure the similarity between the attacker’s estimated range and the individual’s privacy interval] • A point is completely disclosed if its estimated range • contains the original value • fully falls within the pre-specified privacy interval
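
A direct encoding of the completely-disclosed test as a hypothetical helper; the estimated range and privacy interval are passed in explicitly:

```python
def completely_disclosed(x, est, priv):
    """True if the attacker's estimated range `est` = (lo, hi) contains the
    original value x AND falls entirely inside the individual's
    pre-specified privacy interval `priv` = (lo, hi)."""
    (e_lo, e_hi), (p_lo, p_hi) = est, priv
    return e_lo <= x <= e_hi and p_lo <= e_lo and e_hi <= p_hi

# The ratio of completely disclosed points over all records is the
# disclosure measure reported in the experiments that follow.
```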

  15. Empirical Evaluation • Data sets: • Bank • 5 attributes (Home Equity, Stock/Bonds, Liabilities, Savings, CDs) • 50,000 tuples • Signal • 35 correlated features (sinusoidal, square, triangle, normal distributions) • 30,000 tuples • Pre-specified individual privacy intervals: • [ui(1−p), ui(1+p)] • p is varied

  16. IQR from Reconstructed Dist. Using AS with Uniform noise • IQR direct inference --- perturbed data • IQR with AS inference --- reconstructed distribution • IQR ideal inference --- original data • Uniform noise: [-125, 125] • Bank data set • Attribute: Stock/Bonds • 95% IQR • Information loss for AS: 14.6% • [Plot: ratio of completely disclosed points for the three inference methods]

  17. IQR from Reconstructed Dist. Using AS with Uniform noise

  18. AS vs. SF with Gaussian Noise • Gaussian noise N(0, 8) • Signal data set • Feature 2 (sinusoidally distributed) • 95% IQR • Information loss for AS: 32.9% • Information loss for SF: 47.0%

  19. Disclosure vs. Noise • Uniform noise with varied range • Bank data set • Attribute: Stock/Bonds • 95% IQR

  20. Extend to Multivariate Cases • In practice, the joint distribution of multiple numerical attributes is often modeled by one multivariate normal distribution, N(μ, Σ) • The ellipsoid {z : (z − μ)′ Σ−1(z − μ) ≤ χ2p(α)} contains a fixed percentage, (1 − α)100%, of the data values • The projection of this ellipsoid onto axis zi has bounds μi ± sqrt(χ2p(α) Σii)
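
A sketch of the per-axis bound using scipy's chi-square quantile; here χ2p(α) is read as the upper-α critical value, i.e. the (1 − α) quantile:

```python
import numpy as np
from scipy import stats

def axis_bounds(mu, sigma, alpha=0.05):
    """Project the (1 - alpha)100% ellipsoid of N(mu, sigma) onto each
    coordinate axis: bounds are mu_i +/- sqrt(chi2_p(alpha) * sigma_ii)."""
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    p = mu.shape[0]
    crit = stats.chi2.ppf(1 - alpha, df=p)     # upper-alpha chi-square value
    half = np.sqrt(crit * np.diag(sigma))
    return mu - half, mu + half

# e.g. axis_bounds([0, 0], [[4, 1], [1, 2]], alpha=0.05) gives the 95%
# range an attacker would quote for each of the two attributes.
```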

  21. Related Work • Rotation based approach: Y = RX • When R is an orthonormal matrix (RRT = I) • Vector length: |Rx| = |x| • Euclidean distance: |Rx − Ry| = |x − y| • Inner product: <Rx, Ry> = <x, y> • Popular classifiers and clustering methods are invariant to this perturbation • K. Liu, H. Kargupta, et al. Random projection based multiplicative data perturbation for privacy preserving distributed data mining. TKDE 2006 • K. Chen and L. Liu. Privacy preserving data classification with rotation perturbation. ICDM 2005
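
A quick numerical check of the invariances listed above (a sketch: R is obtained from a QR factorization simply to get an orthonormal matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 100
X = rng.normal(size=(d, n))                   # columns are records

R, _ = np.linalg.qr(rng.normal(size=(d, d)))  # random orthonormal matrix, R @ R.T = I
Y = R @ X                                     # rotation perturbation Y = RX

x0, x1, y0, y1 = X[:, 0], X[:, 1], Y[:, 0], Y[:, 1]
print(np.allclose(np.linalg.norm(x0), np.linalg.norm(y0)))            # |Rx| = |x|
print(np.allclose(np.linalg.norm(x0 - x1), np.linalg.norm(y0 - y1)))  # distances preserved
print(np.allclose(np.dot(x0, x1), np.dot(y0, y1)))                    # inner products preserved
```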

  22. Is Y = RX Secure? • [Diagram: Y = RX shown as a matrix product, with R orthonormal: RRT = RTR = I]

  23. Our Preliminary Results • Even Y = RX + E is NOT secure when some a-priori knowledge is available to attackers • [Diagram: Y = RX + E as a matrix equation] • R can be any random matrix

  24. A-priori Knowledge ICA Based Attack • Privacy can be breached when a small subset of the original data X is available to attackers

  25. Summary • The reconstructed distribution can be exploited by attackers to derive sensitive individual information • We present a simple IQR-based attacking method • Complex and effective attacking methods exist • More research is needed on attacking methods from the attacker’s point of view

  26. Acknowledgement • NSF Grants • CCR-0310974 • IIS-0546027 • Personnel • Xintao Wu • Songtao Guo • Ling Guo • More Info • http://www.cs.uncc.edu/~xwu/ • xwu@uncc.edu

  27. Questions? Thank you!

  28. Information Loss • Distribution level • Individual value level
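
The slide's formulas are not in the transcript; the metrics below are common stand-ins that I am assuming, not necessarily the authors' definitions: half the L1 distance between original and reconstructed densities at the distribution level, and RMSE at the individual value level:

```python
import numpy as np

def distribution_loss(fx_true, fx_est, grid):
    """Distribution-level loss: half the L1 distance between densities
    on a common grid (0 = identical, 1 = disjoint)."""
    dx = grid[1] - grid[0]
    return 0.5 * float(np.sum(np.abs(fx_true - fx_est)) * dx)

def value_loss(x, x_hat):
    """Individual-value-level loss: RMSE between original and
    reconstructed records."""
    x, x_hat = np.asarray(x, dtype=float), np.asarray(x_hat, dtype=float)
    return float(np.sqrt(np.mean((x - x_hat) ** 2)))
```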

  29. National Laws • US • HIPAA for health care • Passed August 21, 1996 • Sets the lowest bar; states are welcome to enact more stringent rules • California State Bill 1386 • Gramm-Leach-Bliley Act of 1999 for financial institutions • COPPA for children’s online privacy • etc. • Canada • PIPEDA 2000 • Personal Information Protection and Electronic Documents Act • Effective from Jan 2004 • European Union (Directive 95/46/EC) • Passed by the European Parliament in Oct 1995 and effective from Oct 1998 • Provides guidelines for member state legislation • Forbids sharing data with states that do not protect privacy

  30. ICA Direct Attack? • Can we get X when only Y is available? • It seems Independent Component Analysis can help • General linear perturbation model: Y = RX + E • ICA model: X = AS + N
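
As a sketch of what a direct ICA attack would look like, the snippet below runs scikit-learn's FastICA on toy perturbed data; this is not the authors' AK-ICA method, the data are idealized (independent, non-Gaussian columns), and the recovered components come back only up to permutation and scaling. Slide 32 explains why this direct attempt fails on realistic data:

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
n, d = 2000, 3

# Toy "original" data with independent, non-Gaussian columns (ICA's ideal case).
X = np.column_stack([rng.laplace(size=n),
                     rng.uniform(-1, 1, size=n),
                     rng.exponential(size=n)])

R = rng.normal(size=(d, d))          # arbitrary random perturbation matrix
E = 0.01 * rng.normal(size=(n, d))   # small additive noise
Y = X @ R.T + E                      # perturbed data Y = RX + E (records as rows)

ica = FastICA(n_components=d, random_state=0)
X_hat = ica.fit_transform(Y)         # "unmixed" components: close to the columns of X
                                     # only up to an unknown permutation and scaling
```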

  31. ICA • [Block diagram: a linear mixing process applies the mixing matrix to the source signals to produce the observed data; a separation process applies a demixing matrix to produce the separated components; the demixing matrix is found by optimizing a cost function that measures how independent the separated outputs are]

  32. Restrictions of ICA • Restrictions: • All the components si should be independent; they must be non-Gaussian, with the possible exception of one component • The number of observed linear mixtures m must be at least as large as the number of independent components n • The matrix A must be of full column rank • Can we apply ICA directly (X = AS + N vs. Y = RX + E)? No • There are correlations among the attributes of X • More than one attribute may have a Gaussian distribution

  33. A-priori Knowledge based ICA (AK-ICA) Attack

  34. Correctness of AK-ICA • We prove that J exists such that … • J represents the connection between the distributions of … and …
