
CS573 Data Privacy and Security Statistical Databases


Presentation Transcript


  1. CS573 Data Privacy and Security Statistical Databases Li Xiong

  2. Today • Statistical databases • Definitions • Early query restriction methods • Output perturbation and differential privacy

  3. Statistical Data Release [Figure: population counts by city, age (20 to 50), and diagnosis] • Release statistical summary of the data (vs. individual records) • Useful for analysis and learning • Medical statistics • Query log statistics – frequent search terms • Still need rigorous inference control

  4. Statistical Database • A statistical database is a database which provides statistics on subsets of records • Statistics may be performed to compute SUM, MEAN, MEDIAN, COUNT, MAX AND MIN of records • Inference control to prevent inference from statistics to individual records

  5. Methods • Data perturbation/anonymization • Query restriction • Output perturbation

  6. Data Perturbation

  7. Query Restriction

  8. Output Perturbation [Figure: query and results exchanged between the user and the database]

  9. Methods • Data perturbation/anonymization • Query restriction • Query set size control • Query set overlap control • Query auditing • Output perturbation

  10. Query Set Size Control • A query-set size control limits the number of records that must be in the result set • A query result is released only if the size of the query set |C| satisfies K ≤ |C| ≤ L − K, where L is the size of the database and K is a parameter that satisfies 0 ≤ K ≤ L/2
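The size condition on slide 10 can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the function name `restricted_count` and the dictionary record layout are assumptions made for the example.

```python
def restricted_count(records, predicate, k):
    """Answer a COUNT query only if k <= |C| <= L - k; otherwise refuse (None)."""
    L = len(records)                      # L: size of the database
    if not 0 <= k <= L / 2:
        raise ValueError("k must satisfy 0 <= k <= L/2")
    size = sum(1 for r in records if predicate(r))   # |C|, the query-set size
    if k <= size <= L - k:
        return size
    return None                           # query set too small or too large


# Hypothetical toy database: 6 female and 4 male records.
demo = [{"sex": "F"}] * 6 + [{"sex": "M"}] * 4
```

With `k = 2`, a count over all females is answered, while a count over every record (or over no record) is refused because its complement would otherwise isolate too few individuals.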

  11. Query Set Size Control

  12. Tracker • Q1: Count ( Sex = Female ) = A • Q2: Count ( Sex = Female OR (Age = 42 & Sex = Male & Employer = ABC) ) = B • What if B = A+1?

  13. Tracker • Q1: Count ( Sex = Female ) = A • Q2: Count ( Sex = Female OR (Age = 42 & Sex = Male & Employer = ABC) ) = B • If B = A+1, exactly one record matches (Age = 42 & Sex = Male & Employer = ABC) • Q3: Count ( Sex = Female OR (Age = 42 & Sex = Male & Employer = ABC & Diagnosis = Schizophrenia) ) • Q3 = B means that individual has the diagnosis; Q3 = A means they do not • Positively or negatively compromised!
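The tracker on slides 12 and 13 can be replayed against a size-controlled COUNT interface. The sketch below is illustrative only: the toy database, the field names, and the helpers `count`, `make_demo_db`, and `tracker_attack` are all hypothetical.

```python
def count(records, predicate, k):
    """COUNT under query-set size control: answered only if k <= |C| <= L - k."""
    size = sum(1 for r in records if predicate(r))
    return size if k <= size <= len(records) - k else None


def is_female(r):
    return r["sex"] == "F"


def is_target(r):
    return r["age"] == 42 and r["sex"] == "M" and r["employer"] == "ABC"


def has_dx(r):
    return r["diagnosis"] == "Schizophrenia"


def make_demo_db(target_diagnosis):
    """Hypothetical database: 5 women, 4 other men, and one uniquely identifiable man."""
    women = [{"sex": "F", "age": 35, "employer": "XYZ", "diagnosis": "Flu"}] * 5
    men = [{"sex": "M", "age": 30, "employer": "XYZ", "diagnosis": "Flu"}] * 4
    target = {"sex": "M", "age": 42, "employer": "ABC", "diagnosis": target_diagnosis}
    return women + men + [target]


def tracker_attack(records, k):
    """Return True/False for the target's diagnosis, or None if the tracker fails."""
    a = count(records, is_female, k)                                    # Q1
    b = count(records, lambda r: is_female(r) or is_target(r), k)       # Q2
    c = count(records, lambda r: is_female(r) or (is_target(r) and has_dx(r)), k)  # Q3
    if None in (a, b, c) or b != a + 1:
        return None            # target not unique, or a query was refused
    return c == b              # Q3 = B: has the diagnosis; Q3 = A: does not
```

Asking about the target directly is refused because the query set has size 1, yet the three padded queries all pass the size check and together reveal the answer.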

  14. Query set size control • If the threshold value k is large, then it will restrict too many queries • And still does not guarantee protection from compromise • The database can be easily compromised within a frame of 4-5 queries

  15. Query Set Overlap Control • Basic idea: each new query is checked against previous queries for the number of records they have in common • If the number of records shared with any previous query exceeds a given threshold, the requested statistic is not released • A query q(C) is only allowed if |q(C) ∩ q(D)| ≤ r, r > 0, for every previously answered query q(D), where r is set by the administrator
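A minimal sketch of the overlap check, assuming each query set is available as a set of record identifiers. The class name `OverlapController` and its `request` method are hypothetical.

```python
class OverlapController:
    """Release a statistic only if its query set shares at most r records
    with every previously answered query set."""

    def __init__(self, r):
        self.r = r
        self.history = []          # query sets answered so far

    def request(self, query_set):
        qs = frozenset(query_set)
        if any(len(qs & prev) > self.r for prev in self.history):
            return False           # overlap exceeds r: refuse
        self.history.append(qs)    # only answered queries are remembered
        return True
```

Every request is compared against the full history, which is the processing overhead the next slide points out, and a subset of an answered query set is refused as soon as it has more than r records.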

  16. Query-set-overlap control • Statistics for a set and its subset cannot be released – limiting usefulness • High processing overhead – every new query compared with all previous ones • Multiple users - need to keep user profile, need to consider collusion between users • Still no formal privacy guarantee

  17. Auditing • Keeping up-to-date logs of all queries made by each user and check for possible compromise when a new query is issued • Excessive computation and storage requirements • Only “efficient” methods for special types of queries

  18. Audit Expert (Chin 1982) • Query auditing method for SUM queries • A SUM query can be considered as a linear equation $\sum_{i=1}^{L} a_i x_i = q$, where $a_i \in \{0, 1\}$ indicates whether record i belongs to the query set, $x_i$ is the sensitive value, and $q$ is the query result • A set of SUM queries can be thought of as a system of linear equations • Maintains the binary matrix representing linearly independent queries and updates it when a new query is issued • A row with all 0s except for the ith column indicates disclosure of $x_i$
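The linear-algebra view can be sketched with plain Gaussian elimination. This is a simplified illustration of the idea, not Chin's actual Audit Expert data structure: the names `rref` and `SumAuditor` are hypothetical, and the matrix is reduced from scratch on every query rather than updated incrementally.

```python
from fractions import Fraction


def rref(rows):
    """Reduced row echelon form over the rationals, with zero rows dropped."""
    rows = [[Fraction(x) for x in row] for row in rows]
    pivot_row = 0
    ncols = len(rows[0]) if rows else 0
    for col in range(ncols):
        pr = next((i for i in range(pivot_row, len(rows)) if rows[i][col] != 0), None)
        if pr is None:
            continue
        rows[pivot_row], rows[pr] = rows[pr], rows[pivot_row]
        pivot = rows[pivot_row][col]
        rows[pivot_row] = [x / pivot for x in rows[pivot_row]]
        for i in range(len(rows)):
            if i != pivot_row and rows[i][col] != 0:
                factor = rows[i][col]
                rows[i] = [x - factor * y for x, y in zip(rows[i], rows[pivot_row])]
        pivot_row += 1
    return [row for row in rows if any(x != 0 for x in row)]


class SumAuditor:
    """Deny a SUM query if answering it would let some x_i be solved for."""

    def __init__(self, n_records):
        self.n = n_records
        self.basis = []            # linearly independent answered queries

    def allow(self, query_set):
        row = [1 if i in query_set else 0 for i in range(self.n)]
        candidate = rref(self.basis + [row])
        # A row with a single nonzero entry pins down one record's value.
        if any(sum(1 for x in r if x != 0) == 1 for r in candidate):
            return False
        self.basis = candidate     # dependent queries leave the basis unchanged
        return True
```

Working in reduced row echelon form matters here: a unit vector lies in the row space exactly when it appears as one of the reduced rows, so the single-nonzero check catches every exact disclosure.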

  19. Audit Expert • Only stores linearly independent queries • Not all queries are linearly independent, e.g. Q1 = Q2 + Q3 below • Q1: Sum(Sex=M) • Q2: Sum(Sex=M AND Age>20) • Q3: Sum(Sex=M AND Age<=20)

  20. Audit Expert • $O(L^2)$ time complexity • Further work reduced this to O(L) time and space when the number of queries < L • Only for SUM queries

  21. Auditing – recent developments • Online auditing • “Detect and deny” queries that violate the privacy requirement • Denials themselves may implicitly disclose sensitive information • Offline auditing • Check if a privacy requirement has been violated after the queries have been executed • Detects violations but does not prevent them

  22. Methods • Data perturbation/anonymization • Query restriction • Output perturbation • Differential privacy

  23. Differential Privacy [Figure: a user sends query Q to an output-perturbation interface; D1 contains Bob's record and yields A(D1), D2 omits it and yields A(D2)] • Differential privacy requires the outcome to be formally indistinguishable when run with and without any particular record in the data set • E.g.: Q = select count() where Age = [20,30] and Diagnosis = B

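  24. Differential Privacy [Figure: a user sends query Q to a differentially private interface; D1 contains Bob's record, D2 omits it] • Differential privacy • Laplace mechanism: release Q(D) + Y, where Y is drawn from $\mathrm{Lap}(S(Q)/\alpha)$ and $\alpha$ is the privacy parameter • Query sensitivity $S(Q)$: the largest change in Q between two data sets that differ in one record • A(D1) = Q(D1) + Y1 • A(D2) = Q(D2) + Y2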
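A minimal sketch of the Laplace mechanism for a count query, using only the Python standard library. The function names are hypothetical, and the default sensitivity of 1 is an assumption that holds for counts when neighboring data sets differ by one record. The noise is sampled as the difference of two exponential draws, which follows a Laplace distribution with the given scale.

```python
import random


def laplace_noise(scale, rng):
    """Sample Lap(scale) as the difference of two i.i.d. exponential variables."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def laplace_count(true_count, alpha, rng, sensitivity=1.0):
    """Return Q(D) + Y with Y ~ Lap(sensitivity / alpha)."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return true_count + laplace_noise(sensitivity / alpha, rng)


rng = random.Random(0)                     # seeded only to make the demo repeatable
noisy = laplace_count(100, alpha=0.5, rng=rng)
```

A smaller alpha gives a larger noise scale and therefore stronger privacy at the cost of accuracy. A production implementation would need a cryptographically sound random source and care around floating-point effects, which this sketch ignores.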

  25. Composition of Differential Privacy [Figure: a user sends queries Q1, Q2, … to a differentially private interface; D1 contains Bob's record and yields A1(D1), A2(D1), …; D2 omits it and yields A1(D2), A2(D2), …] • Sequential composition [McSherry SIGMOD 09] • Let each Mi provide $\alpha_i$-differential privacy. The sequence of Mi provides $(\sum_i \alpha_i)$-differential privacy • Parallel composition • If Di are disjoint subsets of the original database and Mi provides $\alpha$-differential privacy for each Di, then the sequence of Mi provides $\alpha$-differential privacy

  26. Differential Privacy [Figure: raw data (counts by city, age, diagnosis) passes through a privacy mechanism to produce released data for the user] • Is unfettered access to raw data truly essential? • Is released data sufficient (does it provide a sufficient utility guarantee)?
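The two composition rules translate directly into budget accounting. The sketch below is an assumption-laden illustration: the class `PrivacyBudget` and its methods are hypothetical, and the parallel rule is written in its general form, where mechanisms on disjoint subsets cost the largest individual alpha.

```python
class PrivacyBudget:
    """Track how much of a total privacy budget remains."""

    def __init__(self, total_alpha):
        self.remaining = total_alpha

    def _spend(self, cost):
        if cost > self.remaining:
            raise ValueError("privacy budget exceeded")
        self.remaining -= cost
        return cost

    def sequential(self, alphas):
        """Mechanisms run on the same data: the costs add up."""
        return self._spend(sum(alphas))

    def parallel(self, alphas):
        """Mechanisms run on disjoint subsets: only the largest cost counts."""
        return self._spend(max(alphas))


budget = PrivacyBudget(1.0)
budget.sequential([0.25, 0.25])        # two queries on the whole database
budget.parallel([0.25, 0.25, 0.25])    # one query per disjoint subset
```

After these two calls 0.25 of the budget is left, which shows why answering many queries over the same records exhausts the budget far faster than answering queries over disjoint partitions.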

  27. Challenges • Differential privacy cost accumulates quickly with number of queries • Typical tasks require multiple queries or multiple steps • Need to support multiple users • Impossible to guarantee utility for all (any) data or all (any) applications

  28. Possible Middle Ground [Figure: raw data passes through a privacy mechanism, informed by prior or domain knowledge, target applications, and intermediate results, to produce released data for the user] • Guaranteed utility for certain applications • Counting queries, classification, logistic regression • Guaranteed utility for certain kinds of data • Use prior or domain knowledge about data • Use intermediate results (differentially private)

  29. Our Research: Adaptive Differentially Private Data Release • Data knowledge • Dense and “smooth” data • High dimensional and sparse data • Dynamic data • Application knowledge • Query workload • Specific tasks

  30. Histogram Example ?

  31. Strategy I: Baseline Cell Partitioning [Figure: a cell histogram over Age (20, 30) and Diagnosis (A, B), with each cell count perturbed under privacy budget alpha] • Q1: count() where Age = 20, Diagnosis = A • Q2: count() where Age = 20, Diagnosis = B • … • Goal: to release a differentially private histogram to support random predicate queries • Q: select count() where Age = [20,30] and Income = 40K • If a query predicate spans multiple cells or partitions, the perturbation errors of those cells add up
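The aggregation of perturbation error can be made concrete under two assumptions that the slide does not state: each cell count $c_j$ receives independent Laplace noise of scale $1/\alpha$ (count sensitivity 1), and the query covers exactly $m$ cells. The noisy answer $\tilde{Q}$ then sums $m$ noise terms, so its error variance grows linearly in $m$:

```latex
\[
\tilde{Q} = \sum_{j=1}^{m} (c_j + Y_j), \qquad Y_j \sim \mathrm{Lap}(1/\alpha) \ \text{i.i.d.}
\]
\[
\mathrm{Var}\bigl(\tilde{Q} - Q\bigr) = \sum_{j=1}^{m} \mathrm{Var}(Y_j) = \frac{2m}{\alpha^{2}}
\]
```

The second line uses the fact that a Laplace variable with scale $b$ has variance $2b^2$.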

  32. Strategy II: Hierarchical Partitioning [Figure: a three-level hierarchy of histograms over Age (20, 30) and Diagnosis (A, B), each level allotted alpha/3] • Large perturbation error because the privacy budget is divided into a small share at each level

  33. DPCube Strategy: Two phase partitioning [Figure: a cell histogram over Age (20, 30) and Diagnosis (A, B) and its partitioned version] • If a query predicate is contained in a published partition, the answer has to be estimated, typically based on a uniform distribution assumption. This introduces an approximation error.

  34. DPCube Strategy: Two phase partitioning • 1. Cell partitioning: produces the cell histogram over Age (20, 30) and Diagnosis (A, B) • 2. Multi-dimensional partitioning: produces the partition histogram
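As a rough worked figure for the budget split, assume an equal split of $\alpha$ over $h$ levels and count sensitivity 1 at each level (neither is stated on the slide). Each node count then carries Laplace noise of scale $h/\alpha$:

```latex
\[
Y_{\text{node}} \sim \mathrm{Lap}\!\left(\frac{h}{\alpha}\right), \qquad
\mathrm{Var}(Y_{\text{node}}) = \frac{2h^{2}}{\alpha^{2}}
\]
\[
h = 3: \quad \mathrm{Var}(Y_{\text{node}}) = \frac{18}{\alpha^{2}}
\quad \text{versus} \quad \frac{2}{\alpha^{2}} \ \text{with the undivided budget}
\]
```

Under these assumptions each individual count in the three-level hierarchy is nine times noisier, in variance, than a count published with the full budget.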
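The uniform distribution assumption amounts to scaling a partition's noisy count by the fraction of its cells that the query covers. The sketch below assumes integer-valued attributes with inclusive ranges; the function name `estimate_count` and the range encoding are hypothetical.

```python
def estimate_count(partition, noisy_count, query):
    """Estimate a query's count from one published partition, assuming the
    partition's records are spread uniformly over its cells.

    partition, query: dict mapping attribute -> (lo, hi), inclusive integer ranges.
    Attributes missing from the query are unconstrained.
    """
    total_cells = 1
    overlap_cells = 1
    for attr, (lo, hi) in partition.items():
        q_lo, q_hi = query.get(attr, (lo, hi))
        o_lo, o_hi = max(lo, q_lo), min(hi, q_hi)
        if o_lo > o_hi:
            return 0.0                     # query does not intersect the partition
        total_cells *= hi - lo + 1
        overlap_cells *= o_hi - o_lo + 1
    return noisy_count * overlap_cells / total_cells


# Hypothetical partition: ages 20 to 29, two diagnosis codes, noisy count 40.
part = {"age": (20, 29), "diagnosis": (0, 1)}
half = estimate_count(part, 40, {"age": (20, 24)})
```

Here the query covers 10 of the partition's 20 cells, so the estimate is half the noisy count. The estimate is exact only when the data really is uniform inside the partition, which is the approximation error the slide refers to.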

  35. Partitioning Algorithm • Define a uniformity (randomness) measure for a partition H(Dt) • information gain, variance • Recursive algorithm Partition(Dt) for a given partition Dt • Find the best splitting point (e.g. largest information gain) and Partition the data into Dt1 and Dt2 • Partition(Dt1) and Partition(Dt2)
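The recursion can be sketched on a one-dimensional histogram, using the sum of squared deviations from the mean as the uniformity measure (the variance option on the slide). This is a simplified, non-private illustration: the names `sse` and `partition` and the `threshold` parameter are hypothetical, and a differentially private version would operate on noisy counts.

```python
def sse(counts):
    """Sum of squared deviations from the mean: 0 for a perfectly uniform range."""
    mean = sum(counts) / len(counts)
    return sum((c - mean) ** 2 for c in counts)


def partition(counts, lo, hi, threshold):
    """Recursively split cells [lo, hi) until each part is uniform enough.

    Returns a list of half-open index ranges (lo, hi).
    """
    if hi - lo <= 1 or sse(counts[lo:hi]) <= threshold:
        return [(lo, hi)]
    # Best splitting point: the one leaving the two sides most uniform.
    best = min(range(lo + 1, hi),
               key=lambda s: sse(counts[lo:s]) + sse(counts[s:hi]))
    return (partition(counts, lo, best, threshold)
            + partition(counts, best, hi, threshold))


cells = [10, 10, 10, 50, 50, 50]
parts = partition(cells, 0, len(cells), threshold=0)
```

On this toy histogram the algorithm makes a single cut between the low and high cells, after which both sides are uniform and recursion stops.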

  36. Privacy and Utility of the Released Histogram • The released data satisfies $\alpha$-differential privacy • Support for count queries and other OLAP queries and learning tasks • Formal utility results • (epsilon,delta)-usefulness • Experimental results for partition histogram • CENSUS dataset, 1M tuples, 4 attributes: Age (79), Education (14), Occupation (23), and Income (100) • Report absolute error and relative error for random count queries

  37. DPCube Result Example [Figure panels: original histogram • differentially private cell histogram • differentially private partition histogram • differentially private estimated cell histogram]

  38. Experimental Results: Comparison with other partitioning strategies • Higher alpha (lower privacy) results in lower error (higher utility) • Kd tree based approach outperforms others • Cell partitioning is comparable in absolute error but suffers in relative error due to the sparsity of the data

  39. High dimensional sparse data • Many real-world data are high dimensional and sparse • Web search log data, web transactions, etc. • A direct application of the 2-phase approach • Cell histogram highly inaccurate • Computationally not scalable

  40. Top-down recursive partitioning • Recursively partition the spaces that have sufficient density • Use a context free taxonomy tree • Dynamically allocate and keep track of the budget

  41. Adaptive Hierarchical Strategy • Data is sparse and high dimensional • 1a. Overall count • 1b. Partitioning of non-sparse regions • 2a. Partition count • 2b. Partitioning of non-sparse regions • n. Partition count

  42. Today • Statistical databases • Definitions • Early query restriction methods • Output perturbation and differential privacy
