Privacy Preservation for Data Streams

Feifei Li, Boston University. Joint work with Jimeng Sun (CMU), Spiros Papadimitriou, George A. Mihaila, and Ioana Stanoi (IBM T.J. Watson Research Center).


Presentation Transcript


  1. Privacy Preservation for Data Streams. Feifei Li, Boston University. Joint work with: Jimeng Sun (CMU), Spiros Papadimitriou, George A. Mihaila and Ioana Stanoi (IBM T.J. Watson Research Center)

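  2. Application (1): Corp. A, Corp. B, and Corp. C each hold sensitive data, which passes through a step labeled P (one per corporation) before reaching the analytical services. The services find trends, clusters, patterns, and aggregations.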

  3. Application (2): Publish data as a service. Corp. A publishes its data through P to an information hub; Client A and Client B subscribe to the data to identify trends, patterns, and classes.

  4. Target Application: identify trends and perform clustering/classification over several streams (stream 1 through stream 4, each plotted as value over time).

  5. Problem Formulation: streams A1, A2, ..., AN evolve over time. At each time t, online generated noise is added to the current values, one vector at a time.

  6. Problem Formulation (continued): Given σ², obtain A* online such that D(A, A*) = σ², and, for a given reconstruction R (offline or online), D(A, A~) is close to σ².

  7. Data Perturbation: random i.i.d. noise is added to each stream over time (i.i.d.: independent and identically distributed).

  8. Principal Component Analysis (PCA): i.i.d. Noise

  9. Principal Component Analysis (PCA): Correlated Noise

  10. PCA Based Data Reconstruction (A: original data, A*: perturbed data, A~: reconstructed data). Figure labels: added noise σ² from A to A* (utility), removed noise, projection error, remaining noise (privacy), principal direction.

  11. PCA Based Data Reconstruction with correlated noise (A: original data, A*: perturbed data, A~: reconstructed data). Figure labels: added noise σ² from A to A* (utility), projection error, remaining noise (privacy), principal direction.

  12. Data Perturbation: main idea • Observations • The amount of random noise controls the privacy/utility tradeoff • i.i.d. (independent and identically distributed) noise does not preserve privacy well enough • Lesson learned • Noise should be correlated with the original data • Z. Huang et al. SIGMOD 05.
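The observation on this slide can be checked with a small experiment. The sketch below is not the authors' code: the synthetic rank-1 data, the helper names `pca_reconstruct` and `discrepancy`, and the choice of D as mean squared distance per time step are all assumptions made for illustration. It adds the same total noise variance twice, once as i.i.d. noise and once along the data's principal direction, then lets an adversary reconstruct with PCA.

```python
import numpy as np

def pca_reconstruct(X, k):
    """Project the rows of X onto its top-k principal directions (no centering)."""
    # The right singular vectors of X are its principal directions.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    U = Vt[:k].T                      # shape (n_streams, k)
    return X @ U @ U.T

def discrepancy(A, B):
    """Assumed definition of D: squared Frobenius distance per time step."""
    return np.sum((A - B) ** 2) / A.shape[0]

rng = np.random.default_rng(0)
T, n, sigma2 = 2000, 20, 1.0

# Hypothetical data: n streams driven by one hidden trend, so they are strongly correlated.
trend = rng.normal(size=(T, 1)) * 5.0
direction = rng.normal(size=(1, n))
direction /= np.linalg.norm(direction)
A = trend @ direction

# i.i.d. noise: total variance sigma2 spread evenly over all n streams.
E_iid = rng.normal(scale=np.sqrt(sigma2 / n), size=(T, n))
# Correlated noise: same total variance, concentrated along the principal direction.
E_cor = rng.normal(scale=np.sqrt(sigma2), size=(T, 1)) @ direction

results = {}
for name, E in [("iid", E_iid), ("correlated", E_cor)]:
    A_star = A + E                               # perturbed data
    A_tilde = pca_reconstruct(A_star, k=1)       # adversary's reconstruction
    results[name] = (discrepancy(A, A_star), discrepancy(A, A_tilde))
    print(name, "added:", results[name][0], "remaining:", results[name][1])
```

In this toy setting the PCA projection filters out most of the i.i.d. noise, while the correlated noise lies inside the principal subspace and survives the reconstruction almost untouched.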

  13. Challenge 1: Dynamic Correlation

  14. Challenge 1: Dynamic Correlation

  15. Challenge 2: Dynamic Autocorrelation

  16. Challenge 2: Dynamic Autocorrelation

  17. Online Random Noise for Autocorrelation: Stock

  18. State of the Art • Privacy Preservation • Given a utility requirement, maximize the privacy • Existing Work (Z. Huang et al. SIGMOD 05) • Batch mode, static data • And many other works (see our paper for a detailed literature review)

  19. Adding Dynamic Correlated Noise (streams A1, A2, A3): as each vector At arrives, update U (3x3), the online estimate of the principal components (S. Papadimitriou et al. VLDB 05); generate noise Et distributed along U; add Et to At and publish the result A~t.

  20. Put it into Algorithm: Distribute Noise (k=3; U: eigenvectors, V: eigenvalues). Noise with total variance σ² is distributed in the principal components' subspace, rotated back to the data space, and added to At.
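The noise-distribution step can be sketched as follows. This is a minimal illustration under stated assumptions, not the algorithm from the paper: the function name `distribute_noise` is hypothetical, the variance is assumed to be split across components in proportion to the eigenvalues, and U is computed here in batch from synthetic data rather than tracked online as on the previous slide.

```python
import numpy as np

def distribute_noise(U, eigvals, sigma2, rng):
    """Draw one noise vector with total variance sigma2 inside span(U).

    U: (n, k) orthonormal principal directions; eigvals: (k,) their energies.
    Assumption: the variance is split in proportion to the eigenvalues.
    """
    weights = eigvals / eigvals.sum()
    # Noise in principal-component coordinates.
    z = rng.normal(size=U.shape[1]) * np.sqrt(sigma2 * weights)
    # Rotate back to the data space.
    return U @ z

rng = np.random.default_rng(1)
T, n, k, sigma2 = 5000, 5, 3, 0.5

# Hypothetical data: n streams that live in a k-dimensional subspace.
A = rng.normal(size=(T, k)) @ rng.normal(size=(k, n))

# Batch stand-in for the online estimate of the principal components.
vals, vecs = np.linalg.eigh(A.T @ A / T)     # eigenvalues in ascending order
U, V = vecs[:, ::-1][:, :k], vals[::-1][:k]  # keep the top-k pairs

# One noise vector per time step, added to A_t.
E = np.stack([distribute_noise(U, V, sigma2, rng) for _ in range(T)])
A_star = A + E

print("mean noise energy:", np.mean(np.sum(E ** 2, axis=1)))
```

Because every noise vector is built as `U @ z`, it stays inside the principal subspace, so projecting the perturbed data onto that subspace does not remove it.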

  21. Why is our algorithm better (state of the art)? When noise is added offline along the global principal component (PC), an online reconstruction that follows the local principal components can remove it.

  22. Online Reconstruction vs. Offline Reconstruction • Choice of adversary: • Offline reconstruction based on global principal components • Online tracking of the principal components, followed by local reconstruction • Please see the details in the paper

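  23. Tracking Autocorrelation: a single stream a = [1 2 3 4 5 6]^T is turned into h streams by sliding a window over time. The windows w1 = (1, 2, 3), w2 = (2, 3, 4), w3 = (3, 4, 5), w4 = (4, 5, 6) form the matrix W.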
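The window matrix from this slide can be built in a few lines. The sketch below assumes h = 3 (inferred from the window contents) and stacks the windows as rows of W; the orientation is an assumption, since the slide layout is not recoverable from the transcript.

```python
import numpy as np

a = np.arange(1, 7)   # the stream a = [1 2 3 4 5 6]
h = 3                 # window length, assumed from the slide's example

# Slide a length-h window over the stream; each window becomes one row of W.
W = np.array([a[i:i + h] for i in range(len(a) - h + 1)])
print(W)

# Autocorrelation shows up as low rank: this toy stream is a straight line,
# so every window is a combination of just two vectors.
rank = np.linalg.matrix_rank(W)
print("rank of W:", rank)
```

Treating the columns of W as h separate streams lets the same principal component machinery used for cross-stream correlation be applied to a single stream's autocorrelation.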

  24. Distribute Noise: avoid adding noise above the allowed threshold, while keeping the noise auto-correlated with the stream. Idea: constrain the next k noise values based on the previous h-k noise values and the current estimate of U; this becomes a linear system. (The slide repeats the window matrix W from the previous slide.)

  25. Experiments • Three Real Data Streams • Sensor streams, Lab: Light, Humidity, Volt, Temperature. 7712x198 • Chlorine environmental streams: 4310x166 • Stock streams: 8000x2

  26. Perturbation vs. Reconstruction • Perturbation methods: streaming auto-correlated additive noise; streaming correlated additive noise; noise correlated with global principal components • Reconstruction methods: take perturbed data as the reconstruction; streaming auto-correlated online reconstruction; streaming correlated online reconstruction; offline reconstruction based on global principal components • Noise (discrepancy) is represented by its relative energy as a percentage of the original data streams, i.e., D(A, A*)/||A||
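As a small worked example of the relative-energy measure, the sketch below computes the noise energy as a percentage of the data energy. The slide does not spell out D or the norm, so the use of squared Frobenius norms here is an assumption, as are the function name and the toy values.

```python
import numpy as np

def relative_discrepancy(A, A_star):
    """Noise energy as a percentage of data energy (squared Frobenius norms, assumed)."""
    return 100.0 * np.sum((A - A_star) ** 2) / np.sum(A ** 2)

# Toy data with energy 3^2 + 4^2 = 25 and noise with energy 0.3^2 + 0.4^2 = 0.25.
A = np.array([[3.0, 4.0], [0.0, 0.0]])
A_star = A + np.array([[0.3, 0.4], [0.0, 0.0]])

pct = relative_discrepancy(A, A_star)
print(f"relative discrepancy: {pct:.2f}%")   # about 1 percent
```

Expressing the noise relative to the data energy makes settings such as "10% noise" comparable across datasets with different scales.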

  27. Reconstruction Error: Online-R vs. Offline-R (10% noise, k=10) • Online reconstruction achieves better accuracy as it minimizes the projection error

  28. Reconstruction Error: vary k • online reconstruction achieves better accuracy • large k reduces projection error

  29. Privacy vs. Discrepancy, online-R: Lab data

  30. Privacy vs. Discrepancy, online-R: Chlorine

  31. Online Random Noise for Autocorrelation: Chlorine

  32. Online Random Noise for Autocorrelation: Stock

  33. Privacy vs. Discrepancy: Online-R (Chlorine)

  34. Privacy vs. Discrepancy: Online-R (Stock)

  35. Running Time Analysis

  36. Running Time Analysis

  37. Future Work • Combining correlation and autocorrelation • Other types of data streams beyond numeric data, such as categorical data

  38. Questions • Thank you!
