
Architectures and Algorithms for Data Privacy


Presentation Transcript


  1. Architectures and Algorithms for Data Privacy Dilys Thomas Stanford University, April 30th, 2007 Advisor: Rajeev Motwani

  2. RoadMap • Motivation for Data Privacy Research • Sanitizing Data for Privacy • Privacy Preserving OLAP • K-Anonymity/ Clustering for Anonymity • Probabilistic Anonymity • Masketeer • Auditing for Privacy • Distributed Architectures for Privacy

  3. Motivation 1: Data Privacy in Enterprises • Health: personal medical details, disease history, clinical research data • Govt. Agencies: census records, economic surveys, hospital records • Banking: bank statements, loan details, transaction history • Manufacturing: process details, blueprints, production data • Finance: portfolio information, credit history, transaction records, investment details • Outsourcing: customer data for testing, remote DB administration, BPO & KPO • Insurance: claims records, accident history, policy details • Retail: inventory records, individual credit card details, audits

  4. Motivation 2: Government Regulations

  5. Motivation 3: Personal Information • Emails • Searches on Google/Yahoo • Profiles on Social Networking sites • Passwords / Credit Card / Personal information at multiple E-commerce sites / Organizations • Documents on the Computer / Network

  6. Losses due to Lack of Privacy: ID-Theft • 3% of households in the US affected by ID-Theft • US $5-50B losses/year • UK £1.7B losses/year • AUS $1-4B losses/year

  7. RoadMap • Motivation for Data Privacy Research • Sanitizing Data for Privacy • Privacy Preserving OLAP • K-Anonymity/ Clustering for Anonymity • Probabilistic Anonymity • Masketeer • Auditing for Privacy • Distributed Architectures for Privacy

  8. Privacy Preserving Data Analysis, i.e., Online Analytical Processing (OLAP): computing statistics of data collected from multiple data sources while maintaining the privacy of each individual source. Agrawal, Srikant, Thomas. SIGMOD 2005

  9. Privacy Preserving OLAP • Motivation • Problem Definition • Query Reconstruction: inversion method (single attribute, multiple attributes), iterative method • Privacy Guarantees • Experiments

  10. Horizontally Partitioned Personal Information Each client Ci holds an original row ri and sends only a perturbed row pi; the server assembles the perturbed rows into a table T for analysis. EXAMPLE: What number of children in this county go to college?

  11. Vertically Partitioned Enterprise Information Each original relation Di is published as a perturbed relation D’i; the server joins these into a perturbed relation D’ for analysis. EXAMPLE: What fraction of United customers to New York fly Virgin Atlantic to travel to London?

  12. Privacy Preserving OLAP: Problem Definition Compute select count(*) from T where P1 and P2 and P3 and … Pk E.g., find the # of people with age in [30-50] and salary in [80-150], i.e. COUNTT(P1 and P2 and P3 and … Pk) Goals: provide error bounds to the analyst, provide privacy guarantees to the data sources, and scale to a larger # of attributes

  13. Perturbation Example: Uniform Retention Replacement For each value, throw a biased coin (bias = 0.2). HEADS: retain the original value. TAILS: replace it with a random number from a predefined pdf, here uniform at random from [1-5]. (The slide's figure walks a small column of values in [1-5] through these coin flips.)

  14. Retention Replacement Perturbation • Done independently for each column • The replacing pdf need not be uniform • Best to use the original pdf if available/estimable • Different columns can have different biases for retention (a minimal sketch follows below)
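A minimal sketch of this perturbation in Python/NumPy; the "age" column data and the uniform replacing pdf are illustrative assumptions, not the thesis implementation:

```python
import numpy as np

def retention_replace(column, p, low, high, rng):
    """Keep each value with probability p (heads); otherwise
    replace it with a draw from Uniform[low, high] (tails)."""
    column = np.asarray(column, dtype=float)
    heads = rng.random(column.shape) < p               # biased coin per row
    replacements = rng.uniform(low, high, column.shape)
    return np.where(heads, column, replacements)

rng = np.random.default_rng(0)
ages = rng.integers(0, 100, size=1000)                 # hypothetical "age" column
perturbed = retention_replace(ages, p=0.2, low=0, high=100, rng=rng)
```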

  15. Single Attribute Example What is the fraction of people in this building with age 30-50? • Assume age is between 0-100 • Whenever a person enters the building, they flip a coin with heads probability p=0.2 • Heads -- report true age RETAIN • Tails -- report a random number uniform in 0-100 PERTURB • In total, 100 randomized numbers are collected • Of these, 22 are in 30-50 • How many among the original are 30-50?

  16. Privacy Preserving OLAP • Motivation • Problem Definition • Query Reconstruction: inversion method (single attribute, multiple attributes), iterative method • Privacy Guarantees • Experiments

  17. Analysis Out of 100 rows: 80 perturbed (0.8 fraction), 20 retained (0.2 fraction)

  18. Analysis Contd. In expectation, 20% of the 80 perturbed rows, i.e. 16 of them, satisfy Age[30-50]; the remaining 64 do not

  19. Analysis Contd. Since 22 of the randomized rows are in [30-50], 22-16 = 6 of them must come from the 20 retained rows; the other 14 retained rows are NOT in Age[30-50]

  20. Scaling up The 6 retained rows in [30-50] are a p=0.2 sample of the original table, so 6/0.2 = 30 people had age 30-50 in expectation
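The whole reconstruction of slides 17-20 is a few lines of arithmetic; this sketch just replays it:

```python
n, p, b = 100, 0.2, 0.2     # rows, retention probability, P(replacement lands in [30-50])
observed = 22               # randomized rows with age in [30-50]
from_perturbed = n * (1 - p) * b           # 80 * 0.2 = 16 expected perturbed hits
from_retained = observed - from_perturbed  # 22 - 16 = 6 hits from retained rows
print(from_retained / p)                   # 6 / 0.2 = 30.0 people, in expectation
```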

  21. Multiple Attributes (k=2) P1 = Age[30-50], P2 = Salary[80-150]

  22. Architecture

  23. Formally : Select count(*) from R where Pred p = retention probability (0.2 in the example) 1-p = probability that an element is replaced by the replacing p.d.f. b = probability that an element from the replacing p.d.f. satisfies predicate Pred (0.2 in the example, since [30-50] covers 20% of [0-100]) a = 1-b

  24. Transition matrix A = [ (1-p)a+p   (1-p)b ; (1-p)a   (1-p)b+p ], i.e. solve xA = y A00 = probability that the original element satisfies ¬P and after perturbation satisfies ¬P: p = probability it was retained, (1-p)a = probability it was perturbed and the replacement satisfies ¬P, so A00 = (1-p)a + p

  25. Multiple Attributes For k attributes, • x, y are vectors of size 2^k • x = y A^-1, where A = A1 ⊗ A2 ⊗ … ⊗ Ak [tensor product] and Ai is the transition matrix for column i (see the sketch below)
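A minimal sketch of the inversion method in Python/NumPy, assuming per-column retention probabilities ps and replacing-pdf hit probabilities bs; np.kron realizes the tensor product, and state 0 / state 1 mean the predicate fails / holds, matching the transition matrix on the previous slide:

```python
import numpy as np

def transition_matrix(p, b):
    """2x2 per-column matrix; state 0 = fails the predicate, 1 = satisfies it."""
    a = 1.0 - b
    return np.array([[(1 - p) * a + p, (1 - p) * b],
                     [(1 - p) * a,     (1 - p) * b + p]])

def invert(y, ps, bs):
    """Estimate the original distribution x over all 2^k predicate
    combinations from the perturbed distribution y, via x = y A^-1."""
    A = np.array([[1.0]])
    for p, b in zip(ps, bs):
        A = np.kron(A, transition_matrix(p, b))  # A = A1 (x) A2 (x) ... (x) Ak
    return y @ np.linalg.inv(A)

# Single-attribute running example: y = (fraction outside [30-50], fraction inside)
print(invert(np.array([0.78, 0.22]), ps=[0.2], bs=[0.2]))  # -> [0.7, 0.3]
```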

  26. Error Bounds • In our example, we want to say that when the estimated answer is 30, the actual answer lies in [28-32] with probability greater than 0.9 • Given T → T’, with n rows, f(T) is (n,ε,δ)-reconstructible by g(T’) if |f(T) – g(T’)| < max(ε, ε·f(T)) with probability greater than (1-δ) • ε·f(T) = 2, δ = 0.1 in the example above
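As a tiny, purely illustrative check of the definition (ignoring the probability-(1-δ) qualifier, which is over the randomness of the perturbation):

```python
def reconstructible(f_T, g_T_prime, eps):
    """|f(T) - g(T')| < max(eps, eps * f(T)): absolute error for
    small answers, relative error for large ones."""
    return abs(f_T - g_T_prime) < max(eps, eps * f_T)

print(reconstructible(f_T=30, g_T_prime=28.5, eps=2/30))  # True: within +/-2
```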

  27. Theoretical Basis and Results Theorem: The fraction f of rows in [low,high] in the original table, estimated by matrix inversion on the table obtained after uniform perturbation, is an (n,ε,δ) estimator for f if n > 4 log(2/δ)(pε)^-2, by Chernoff bounds Theorem: The vector x obtained by matrix inversion is the MLE (maximum likelihood estimator), by using the Lagrange multiplier method and showing that the Hessian is negative definite

  28. Iterative Algorithm [AS00] Initialize: x^0 = y Iterate: x_p^(T+1) = Σ_{q=0..t} y_q · (a_pq · x_p^T) / (Σ_{r=0..t} a_rq · x_r^T), with t = 2^k - 1 [by application of Bayes' rule] Stop condition: two consecutive x iterates do not differ much (see the sketch below)
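A sketch of this iteration in Python/NumPy, reusing the transition matrix A from the inversion sketch; the tolerance and iteration cap are illustrative choices. On the running example it converges to the same [0.7, 0.3] as inversion, but it stays non-negative even when sampling noise would drive the inversion estimate below zero:

```python
import numpy as np

def iterative_reconstruct(y, A, tol=1e-9, max_iter=10_000):
    """x_p <- sum_q y_q * (A[p,q] * x_p) / (sum_r A[r,q] * x_r), starting from x = y."""
    x = y.copy()
    for _ in range(max_iter):
        denom = x @ A                      # denom[q] = sum_r A[r,q] x_r
        x_new = x * (A @ (y / denom))      # Bayes-rule update for every state p
        if np.abs(x_new - x).sum() < tol:  # stop: consecutive iterates agree
            return x_new
        x = x_new
    return x
```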

  29. Iterative Algorithm We had proved, • Theorem: The inversion algorithm gives the MLE • Theorem [AA01]: The iterative algorithm gives the MLE under the additional constraint that 0 ≤ xi, ∀ 0 ≤ i ≤ 2^k-1 • This models the fact that probabilities are non-negative • Its results are better, as shown in the experiments

  30. Privacy Guarantees Say we initially know with probability < 0.3 that Alice’s age > 25 If, after seeing the perturbed value, we can say so with probability > 0.95, then we say there is a (0.3,0.95) privacy breach More subtle differential privacy guarantees appear in the thesis

  31. Privacy Preserving OLAP • Motivation • Problem Definition • Query Reconstruction • Privacy Guarantees • Experiments

  32. Experiments • Real data: Census data from the UCI Machine Learning Repository, with 32,000 rows • Synthetic data: multiple columns of Zipfian data; the number of rows varied between 1,000 and 1,000,000 • Error metric: the l1 norm of the difference between the two probability distributions x and y, e.g. for 1-dim queries |x1 – y1| + |x0 – y0|

  33. Inversion vs Iterative Reconstruction (Census data, 2 and 3 attributes) The iterative algorithm (MLE on the constrained space) outperforms inversion (the global MLE)

  34. Error as a function of Number of Columns: Iterative Algorithm: Zipf Data The error of the iterative algorithm flattens out, as its maximum possible value (the l1 distance between two probability distributions) is bounded by 2

  35. Error as a function of Number of Columns (Census data, inversion and iterative algorithms) Error increases exponentially with the number of columns

  36. Error as a function of Number of Rows Error decreases as the number of rows n increases

  37. Conclusion It is possible to run OLAP on data across multiple servers so that probabilistically approximate answers are obtained and data privacy is maintained. The techniques have been tested experimentally on real and synthetic data; more experiments appear in the paper. Privacy Preserving OLAP is practical

  38. RoadMap • Motivation for Data Privacy Research • Sanitizing Data for Privacy • Privacy Preserving OLAP • K-Anonymity/ Clustering for Anonymity • Probabilistic Anonymity • Masketeer • Auditing for Privacy • Distributed Architectures for Privacy

  39. Anonymizing Tables [ICDT 2005] Creating tables that do not identify individuals, for research or outsourced software development purposes Aggarwal, Feder, Kenthapadi, Motwani, Panigrahy, Thomas, Zhu

  40. Achieving Anonymity via Clustering [PODS 2006] Aggarwal, Feder, Kenthapadi, Khuller, Panigrahy, Thomas, Zhu Probabilistic Anonymity (submitted) Lodha, Thomas

  41. Data Privacy • Value disclosure: what is the value of attribute salary of person X; addressed by perturbation (Privacy Preserving OLAP) • Identity disclosure: whether an individual is present in the database table; addressed by randomization, k-anonymity, etc. (data for outsourcing / research)

  42. Original Dataset

  43. Randomized Dataset

  44. Quasi-Identifiers Quasi-identifiers can uniquely identify you! They act as approximate foreign keys that link records to external data sources

  45. k-Anonymity Model [Swe00] • Modify some entries of the quasi-identifiers • Each modified row becomes identical to at least k-1 other rows with respect to the quasi-identifiers • Individual records hide in a crowd of size k (a minimal check is sketched below)
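As a concrete illustration of the model (not code from the thesis), a minimal sketch that checks k-anonymity with respect to a chosen set of quasi-identifier columns; the column names are hypothetical:

```python
from collections import Counter

def is_k_anonymous(rows, quasi_identifiers, k):
    """True iff every combination of quasi-identifier values occurs in
    at least k rows, i.e. each record hides in a crowd of size k."""
    counts = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(c >= k for c in counts.values())

rows = [
    {"age": 27, "zip": "94305", "disease": "flu"},
    {"age": 27, "zip": "94305", "disease": "cold"},
    {"age": 35, "zip": "94041", "disease": "flu"},
]
print(is_k_anonymous(rows, ["age", "zip"], k=2))  # False: the (35, 94041) row is unique
```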

  46. Original Table

  47. Suppressing all entries: No Utility

  48. 2-Anonymity with Clustering Cluster centers are published: for one cluster, age 27 = (25+27+29)/3 and salary 70 = (50+60+100)/3; for the other, age 37 = (35+39)/2 and salary 115 = (110+120)/2 The clustering formulation is NP-hard (see the sketch below)
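A minimal sketch of the publishing step, assuming the clusters have already been found (finding good clusters is the NP-hard part): every record is replaced by its cluster's attribute-wise mean, reproducing the numbers on the slide:

```python
import numpy as np

def publish_cluster_centers(table, clusters):
    """table: n x d numeric array; clusters: list of row-index lists,
    each of size >= k. Returns the table with rows replaced by centers."""
    out = np.empty(table.shape, dtype=float)
    for idx in clusters:
        out[idx] = table[idx].mean(axis=0)  # attribute-wise cluster center
    return out

ages_salaries = np.array([[25, 50], [27, 60], [29, 100], [35, 110], [39, 120]])
print(publish_cluster_centers(ages_salaries, clusters=[[0, 1, 2], [3, 4]]))
# rows become [27, 70] / [27, 70] / [27, 70] / [37, 115] / [37, 115]
```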

  49. Clustering Metrics (figure: three clusters of 10, 50, and 20 points, with radii 5, 15, and 10)

  50. r-center Clustering: Minimize Maximum Cluster Size (figure: three clusters, each of diameter 2d)
