
Data Transformation for Privacy-Preserving Data Mining


Presentation Transcript


  1. Data Transformation for Privacy-Preserving Data Mining
  Stanley R. M. Oliveira
  Database Systems Laboratory, Computing Science Department
  University of Alberta, Canada
  Graduate Seminar, November 26th, 2004

  2. Introduction: Motivation
  • Changes in technology are making privacy harder to protect.
  • The new challenge faced by Statistical Offices.
  • Data mining plays an outstanding role in business collaboration.
  • The traditional "all or nothing" solution has been too rigid.
  • The need for techniques to enforce privacy concerns when data are shared for mining.

  3. Introduction: PPDM: Increasing Number of Papers
  [Timeline figure showing three milestones in the evolution of PPDM: the Conceptive Landmark, the Deployment Landmark, and the Prospective Landmark.]

  4. Introduction: PPDM: Privacy Violation
  • Privacy violation in data mining: misuse of data.
  • Defining privacy preservation in data mining:
  • Individual privacy preservation: protection of personally identifiable information.
  • Collective privacy preservation: protection of users' collective activity.

  5. Introduction: A Few Examples of Scenarios in PPDM
  • Scenario 1: A hospital shares some data for research purposes.
  • Scenario 2: Outsourcing the data mining process.
  • Scenario 3: A collaboration between an Internet marketing company and an on-line retail company.

  6. Introduction: Contributions: A Taxonomy of the Existing Solutions
  • Data Partitioning
  • Data Modification
  • Data Restriction
  • Data Ownership
  [Fig. 1: A Taxonomy of PPDM Techniques.]

  7. Framework: Problem Definition
  • To transform a database into a new one that conceals sensitive information while preserving general patterns and trends from the original database.
  [Figure: the transformation process maps the original database to a transformed database, from which data mining extracts non-sensitive patterns and trends.]

  8. Framework: Problem Definition (cont.)
  • Problem 1: Privacy-Preserving Association Rule Mining
  • I do not address the privacy of individuals but the problem of protecting sensitive knowledge.
  • Assumptions:
  • The data owners have to know in advance the knowledge (rules) that they want to protect.
  • It is not the individual data values (e.g., a specific item) that are restricted, but the relationships among items.

  9. Framework: Problem Definition (cont.)
  • Problem 2: Privacy-Preserving Clustering
  • I protect the underlying attribute values of objects subjected to clustering analysis.
  • Assumptions:
  • Given a data matrix D (m×n), the goal is to transform D into D' so that the following restrictions hold:
  • A transformation T: D → D' must preserve the privacy of individual records.
  • The similarity between objects in D and D' must be the same, or only slightly altered, by the transformation process.

  10. Framework: A Framework for PPDM
  [Figure: a schematic view of the framework for privacy preservation. A client applies individual or collective transformations to the original data and sends the transformed database to the server; the framework comprises a library of algorithms, PPDT methods, retrieval facilities, and sanitization metrics.]

  11. PP-Assoc. Rules: Privacy-Preserving Association Rule Mining
  The sanitization process: a transactional database D passes through four steps to produce the sanitized database D':
  • Step 1: Identify the discovered patterns.
  • Step 2: Classify the discovered patterns.
  • Step 3: Select the sensitive transactions.
  • Step 4: Modify some sensitive transactions.

  A sample transactional database:
  TID | List of Items
  T1  | A B C D
  T2  | A B C
  T3  | A B D
  T4  | A C D
  T5  | A B C
  T6  | B D

  The corresponding inverted file (sensitive rules and their sensitive transaction IDs):
  A,B → D | T1, T3
  A,C → D | T1, T4
  (A sketch of how such an inverted file can be built appears below.)
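The following is a minimal Python sketch (my own illustration, not the thesis code; the data come from the example above) of building that inverted file in a single scan over D:

```python
# Build an inverted file mapping each sensitive rule to the IDs of the
# transactions that contain all of its items (the sensitive transactions).

def build_inverted_file(transactions, sensitive_rules):
    """transactions: dict TID -> set of items;
    sensitive_rules: dict rule label -> set of items (antecedent + consequent)."""
    inverted = {rule: [] for rule in sensitive_rules}
    for tid, items in sorted(transactions.items()):   # one scan over D
        for rule, rule_items in sensitive_rules.items():
            if rule_items <= items:                   # transaction supports the rule
                inverted[rule].append(tid)
    return inverted

D = {"T1": {"A", "B", "C", "D"}, "T2": {"A", "B", "C"}, "T3": {"A", "B", "D"},
     "T4": {"A", "C", "D"}, "T5": {"A", "B", "C"}, "T6": {"B", "D"}}
SR = {"A,B -> D": {"A", "B", "D"}, "A,C -> D": {"A", "C", "D"}}

print(build_inverted_file(D, SR))
# {'A,B -> D': ['T1', 'T3'], 'A,C -> D': ['T1', 'T4']}
```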

  12. PP-Assoc. Rules: Privacy-Preserving Association Rule Mining
  [Figure: a taxonomy of sanitizing algorithms, organized under Heuristic 1, Heuristic 2, and Heuristic 3.]

  13. PP-Assoc. Rules: Heuristic 1: Degree of Sensitive Transactions
  (The sample transactional database and its inverted file are the same as on slide 11.)
  • Definition: Let D be a transactional database and ST the set of all sensitive transactions in D. The degree of a sensitive transaction t, such that t ∈ ST, is defined as the number of sensitive association rules that can be found in t.
  • Degree(T1) = 2; Degree(T3) = 1; Degree(T4) = 1. (A sketch computing these degrees follows.)
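Given such an inverted file, the degree of each sensitive transaction falls out directly; a small illustrative helper (assuming the build_inverted_file sketch above):

```python
from collections import Counter

def transaction_degrees(inverted_file):
    """Degree of a sensitive transaction t = number of sensitive rules found in t."""
    degrees = Counter()
    for tids in inverted_file.values():
        degrees.update(tids)
    return dict(degrees)

# With the inverted file from the previous sketch:
# transaction_degrees(...) -> {'T1': 2, 'T3': 1, 'T4': 1}
```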

  14. PP-Assoc. Rules: Data Sharing-Based Algorithms
  • Step 1: Scan the database and identify the sensitive transactions for each restrictive pattern;
  • Step 2: Based on the disclosure threshold ψ, compute the number of sensitive transactions to be sanitized;
  • Step 3: For each restrictive pattern, identify a candidate item that should be eliminated (the victim item);
  • Step 4: Based on the number found in Step 2, remove the victim items from the sensitive transactions.
  (A generic sketch of these four steps appears below.)
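A generic sketch of the four steps, under my own simplifying assumptions (in-memory data, one victim item per transaction, and a pluggable pick_victim strategy standing in for the heuristic-specific choices):

```python
import math

def sanitize(transactions, sensitive_rules, psi, pick_victim):
    """Data sharing-based sanitization skeleton (illustrative).
    psi: disclosure threshold in [0, 1]; psi = 0 sanitizes every sensitive transaction.
    pick_victim(rule_items, transaction_items) -> the item to remove."""
    sanitized = {tid: set(items) for tid, items in transactions.items()}
    for rule_items in sensitive_rules:
        # Step 1: the sensitive transactions for this restrictive pattern.
        sensitive = sorted(t for t, items in sanitized.items() if rule_items <= items)
        # Step 2: how many of them must be sanitized, given psi.
        n_sanitize = len(sensitive) - math.floor(psi * len(sensitive))
        # Steps 3 and 4: pick a victim item and remove it from the marked transactions.
        for tid in sensitive[:n_sanitize]:
            sanitized[tid].discard(pick_victim(rule_items, sanitized[tid]))
    return sanitized
```

RRA, IGA, and SWA differ mainly in how pick_victim chooses items and in which sensitive transactions are marked first.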

  15. PP-Assoc. Rules: Data Sharing-Based Algorithms: The Round Robin Algorithm (RRA)
  Sensitive Rules (SR): Rule 1: A,B → D; Rule 2: A,C → D

  TID | Transactional Database | Partial Sanitization (ψ = 50%) | Full Sanitization (ψ = 0%)
  T1  | A B C D                | B C D                          | B C D
  T2  | A B C                  | A B C                          | A B C
  T3  | A B D                  | A B D                          | A D
  T4  | A C D                  | A C D                          | A D
  T5  | A B C                  | A B C                          | A B C
  T6  | B D                    | B D                            | B D

  • Step 1: Sensitive transactions: A,B → D = {T1, T3}; A,C → D = {T1, T4}.
  • Step 2: Select the number of sensitive transactions: (a) ψ = 50%; (b) ψ = 0%.
  • Step 3: Identify the victim items (taking turns):
  • ψ = 50%: Victim(T1) = A (partial sanitization);
  • ψ = 0%: Victim(T1) = A; Victim(T3) = B; Victim(T4) = C (full sanitization).
  • Step 4: Sanitize the marked sensitive transactions. (A sketch of the round-robin victim selection follows.)
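For the RRA specifically, the victim items can be chosen by taking turns over the rule's items, as in this illustrative helper (the item ordering is my assumption):

```python
from itertools import cycle

def round_robin_victims(rule_items, sensitive_tids):
    """Assign victim items by cycling through the rule's items, one per transaction."""
    turn = cycle(sorted(rule_items))
    return {tid: next(turn) for tid in sensitive_tids}

# round_robin_victims({"A", "B", "D"}, ["T1", "T3"])  ->  {'T1': 'A', 'T3': 'B'}
```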

  16. PP-Assoc. Rules: Data Sharing-Based Algorithms (cont.): The Item Grouping Algorithm (IGA)
  Sensitive Rules (SR): Rule 1: A,B → D; Rule 2: A,C → D
  IGA groups sensitive rules that share items ({A, B, D} and {A, C, D} share A and D) and picks a shared item as the victim: here D, since Sup(D) ≤ Sup(A).

  TID | Transactional Database | Sanitized Database (ψ = 0%)
  T1  | A B C D                | A B C
  T2  | A B C                  | A B C
  T3  | A B D                  | A B
  T4  | A C D                  | A C
  T5  | A B C                  | A B C
  T6  | B D                    | B D

  • Step 1: Sensitive transactions: A,B → D = {T1, T3}; A,C → D = {T1, T4}.
  • Step 2: Select the number of sensitive transactions: ψ = 0%.
  • Step 3: Identify the victim items (grouping sensitive rules): Victim(T1) = D; Victim(T3) = D; Victim(T4) = D (full sanitization).
  • Step 4: Sanitize the marked sensitive transactions.

  17. PP-Assoc. Rules: Heuristic 2: Size of Sensitive Transactions
  • For every group of K transactions:
  • Step 1: Distinguish the sensitive transactions from the non-sensitive ones;
  • Step 2: Select the victim item for each sensitive rule;
  • Step 3: Compute the number of sensitive transactions to be sanitized;
  • Step 4: Sort the sensitive transactions by size;
  • Step 5: Sanitize the sensitive transactions.

  18. PP-Assoc. Rules: Novelties of This Approach
  • The notion of a disclosure threshold for every single pattern → Mining Permissions (MP).
  • Each mining permission mp = ⟨sr_i, ψ_i⟩, where:
  • ∀i, sr_i belongs to the set of sensitive rules SR; and
  • ψ_i ∈ [0, 1].
  • Mining permissions allow a DBA to assign different weights to different rules to hide.
  • All the thresholds ψ_i can also be set to the same value, if needed.
  (A minimal data-structure sketch follows.)
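A mining permission is just a pair; one minimal way to represent it (an illustration, not the thesis' data structure; rule contents and thresholds below are arbitrary):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MiningPermission:
    """mp = <sr_i, psi_i>: a sensitive rule and its own disclosure threshold."""
    rule: frozenset   # the items of the sensitive rule sr_i, with sr_i in SR
    psi: float        # disclosure threshold psi_i in [0, 1]

# Different weights for different rules to hide:
MP = [MiningPermission(frozenset({"A", "B", "D"}), 0.30),
      MiningPermission(frozenset({"A", "C", "D"}), 0.25)]
```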

  19. PP-Assoc. Rules: Data Sharing-Based Algorithms (cont.): The Sliding Window Algorithm (SWA)
  Sensitive Rules (SR): Rule 1: A,B → D; Rule 2: A,C → D

  TID | Transactional Database | Sanitized Database (ψ = 0%)
  T1  | A B C D                | A B C
  T2  | A B C                  | A B C
  T3  | A B D                  | A D
  T4  | A C D                  | C D
  T5  | A B C                  | A B C
  T6  | B D                    | B D

  • Step 1: Sensitive transactions: A,B → D = {T1, T3}; A,C → D = {T1, T4}.
  • Step 2: Identify the victim items (based on the frequencies of the items in SR): Victim(T3) = B; Victim(T4) = A; Victim(T1) = D.
  • Step 3: Select the number of sensitive transactions: ψ = 0%.
  • Step 4: Sort the sensitive transactions by size: A,B → D = {T3, T1}; A,C → D = {T4, T1}.
  • Step 5: Sanitize the marked sensitive transactions.

  20. PP-Assoc. Rules: Data Sharing-Based Metrics (comparing the source database D with the sanitized database D')
  • 1. Hiding Failure: the fraction of the sensitive patterns that can still be discovered in D'.
  • 2. Misses Cost: the fraction of the non-sensitive patterns that are accidentally hidden in D'.
  • 3. Artifactual Patterns: the fraction of the patterns discovered in D' that were not present in D.
  • 4. Difference between D and D': the item-level dissimilarity between the two databases.
  (The formulas are reconstructed below.)
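The slide's formulas did not survive extraction. The forms below follow the standard definitions used in the sanitization literature this deck draws on, so treat them as a reconstruction rather than a quotation: R_P(X) is the set of restrictive (sensitive) patterns mined from database X, ~R_P(X) the non-restrictive ones, R and R' all patterns mined from D and D', and f_X(i) the frequency of item i in X.

```latex
\begin{aligned}
\text{Hiding Failure:}\quad
  \mathrm{HF} &= \frac{|R_P(D')|}{|R_P(D)|} \\[4pt]
\text{Misses Cost:}\quad
  \mathrm{MC} &= \frac{|{\sim}R_P(D)| - |{\sim}R_P(D')|}{|{\sim}R_P(D)|} \\[4pt]
\text{Artifactual Patterns:}\quad
  \mathrm{AP} &= \frac{|R'| - |R \cap R'|}{|R'|} \\[4pt]
\text{Difference:}\quad
  \mathrm{Dif}(D, D') &= \frac{1}{\sum_i f_D(i)} \sum_i \bigl|\, f_D(i) - f_{D'}(i) \,\bigr|
\end{aligned}
```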

  21. PP-Assoc. Rules: Pattern Sharing-Based Algorithm
  • Data sharing-based approach: the database D is sanitized into D'; D' is shared; association rule generation on D' yields the discovered patterns of D'.
  • Pattern sharing-based approach: association rule generation runs on D; the resulting rules are sanitized into the association rules of D'; the rules themselves are shared.

  22. PP-Assoc. Rules: Possible Inference Channels
  • Inferences: based on non-restrictive rules, someone tries to deduce one or more restrictive rules that are not supposed to be discovered.

  Transactional database:
  TID | List of Items
  T1  | A B C D
  T2  | A B C
  T3  | A C D
  T4  | A B C
  T5  | A B

  [Figures: (a) an example of forward-inference; (b) an example of backward-inference.]

  23. PP-Assoc. Rules: Pattern Sharing-Based Metrics
  • Notation: R is the set of all rules mined from D; R' is the set of rules to share; SR are the sensitive rules (rules to hide) and ~SR the non-sensitive ones.
  • Problem 1 (side effect): non-sensitive rules that end up hidden along with the rules to hide.
  • Problem 2 (inference): hidden rules that can be recovered from the shared ones (the recovery factor).
  • 1. Side Effect Factor (SEF): the proportion of non-sensitive rules removed as a side effect of the sanitization.
  • 2. Recovery Factor (RF): expresses the possibility of recovering a sensitive rule from the rules shared; RF ∈ [0, 1].
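The SEF formula was also lost in extraction; measuring the fraction of non-sensitive rules removed as a side effect, the form consistent with the notation above (a reconstruction under that assumption, with S_R denoting the set of sensitive rules) is:

```latex
\mathrm{SEF} = \frac{|R| - \bigl(|R'| + |S_R|\bigr)}{|R| - |S_R|}
```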

  24. PP-Assoc. Rules: Heuristic 3: Rule Sanitization
  • The Downright Sanitizing Algorithm (DSA):
  • Step 1: Identify the sensitive itemsets.
  • Step 2: Select the subsets to sanitize.
  • Step 3: Sanitize the set of supersets of the marked pairs in level 1.
  [Figure: a frequent pattern lattice with level 0 (items A, B, C, D), level 1 (pairs AB, AC, BC, AD, CD), and level 2 (itemsets ABC and ACD, marked * as sensitive), annotated with the three steps.]

  25. PP-Clustering: Privacy-Preserving Clustering (PPC)
  • PPC over Centralized Data:
  • The attribute values subjected to clustering are available in a central repository.
  • PPC over Vertically Partitioned Data:
  • There are k parties sharing data for clustering, where k ≥ 2;
  • The attribute values of the objects are split across the k parties;
  • Object IDs are revealed for join purposes only. The values of the associated attributes are private.

  26. PP-Clustering: Object Similarity-Based Representation (OSBR)
  • Example 1: sharing data for research purposes (OSBR).

  Original data: a sample of the cardiac arrhythmia database (UCI Machine Learning Repository):
  ID  | age | weight | heart rate | Int_def | QRS | PR_int
  123 | 75  | 80     | 63         | 32      | 91  | 193
  342 | 56  | 64     | 53         | 24      | 81  | 174
  254 | 40  | 52     | 70         | 24      | 77  | 129
  446 | 28  | 58     | 76         | 40      | 83  | 251
  286 | 44  | 90     | 68         | 44      | 109 | 128

  [Figure: the transformed data, i.e., the corresponding dissimilarity matrix.]

  27. PP-Clustering: Object Similarity-Based Representation (OSBR)
  • The Security of the OSBR:
  • Lemma 1: Let DM (m×m) be a dissimilarity matrix, where m is the number of objects. It is impossible to determine the coordinates of two objects by knowing only the distance between them.
  • The Complexity of the OSBR:
  • The communication cost is of the order O(m²), where m is the number of objects under analysis.

  28. PP-Clustering: Object Similarity-Based Representation (OSBR)
  • Limitations of the OSBR:
  • Lemma 2: Knowing the coordinates of a particular object i and the distance r between i and any other object j, it is possible to estimate the attribute values of j.
  • The OSBR is vulnerable to attacks in the case of vertically partitioned data (Lemma 2).
  • Conclusion → the OSBR is effective for PPC over centralized data only, but it is expensive.

  29. PP-Clustering: Dimensionality Reduction Transformation (DRBT)
  • General assumptions:
  • The attribute values subjected to clustering are numerical only.
  • In PPC over centralized data, object IDs should be replaced by fictitious identifiers.
  • In PPC over vertically partitioned data, object IDs are used for join purposes between the parties involved in the solution.
  • The transformation (random projection) applied to the data might slightly modify the distances between data points.

  30. PP-Clustering: Dimensionality Reduction Transformation (DRBT)
  • Random projection from d to k dimensions:
  • D' (n×k) = D (n×d) · R (d×k), a linear transformation, where D is the original data, D' is the reduced data, and R is a random matrix.
  • R is generated by first setting each entry r_ij as follows:
  • (R1): r_ij is drawn from an i.i.d. N(0, 1), and the columns of R are then normalized to unit length;
  • (R2): r_ij = √3 · (+1 with probability 1/6; 0 with probability 2/3; −1 with probability 1/6). The slide's formula was truncated in extraction; this is Achlioptas' simpler distribution, the one RP2 on slide 32 refers to.
  (A NumPy sketch of both choices follows.)
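A minimal NumPy sketch of the projection (my own illustration; the function names, seed, and data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_matrix_r1(d, k):
    """R1: i.i.d. N(0, 1) entries, columns normalized to unit length."""
    R = rng.normal(0.0, 1.0, size=(d, k))
    return R / np.linalg.norm(R, axis=0)

def random_matrix_r2(d, k):
    """R2: sqrt(3) * {+1 w.p. 1/6, 0 w.p. 2/3, -1 w.p. 1/6} (Achlioptas)."""
    return np.sqrt(3) * rng.choice([1.0, 0.0, -1.0], size=(d, k),
                                   p=[1/6, 2/3, 1/6])

def drbt(D, k, make_R=random_matrix_r2):
    """Reduce D (n x d) to D' (n x k) via the linear map D' = D . R."""
    n, d = D.shape
    return D @ make_R(d, k)

# Example: project 5 records with 6 attributes down to 3 dimensions.
D = rng.random((5, 6))
D_reduced = drbt(D, k=3)
```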

  31. PP-Clustering: Dimensionality Reduction Transformation (DRBT)
  • PPC over Centralized Data (general approach):
  • Step 1: Suppress the identifiers.
  • Step 2: Normalize the attribute values subjected to clustering.
  • Step 3: Reduce the dimension of the original dataset by using random projection.
  • Step 4: Compute the error that the distances in the k-dimensional space suffer from (see below).
  • PPC over Vertically Partitioned Data:
  • It is a generalization of the solution for PPC over centralized data.
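The error formula on this slide did not survive extraction. A common distortion measure for Step 4, which I am assuming here rather than quoting from the deck, is the stress between the original pairwise distances d_{ij} and the reduced-space distances \hat{d}_{ij}:

```latex
\mathrm{Stress} = \sqrt{ \frac{\sum_{i<j} \bigl( \hat{d}_{ij} - d_{ij} \bigr)^{2}}
                              {\sum_{i<j} d_{ij}^{2}} }
```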

  32. PP-Clustering: Dimensionality Reduction Transformation (DRBT)
  • Original data: the sample of the cardiac arrhythmia database (UCI Machine Learning Repository) shown on slide 26.
  [Figure: the transformed data produced by the two random projections.]
  • RP1: the random matrix is based on the Normal distribution.
  • RP2: the random matrix is based on a much simpler distribution (R2 on slide 30).

  33. PP-Clustering: Dimensionality Reduction Transformation (DRBT)
  • The Security of the DRBT:
  • Lemma 3: A random projection from d to k dimensions, where k < d, is a non-invertible linear transformation.
  • The Complexity of the DRBT:
  • The space requirements are of order O(m), where m is the number of objects.
  • The communication cost is of order O(m·l·k), where l represents the size (in bits) required to transmit a dataset from one party to a central or third party.

  34. PP-Clustering: Dimensionality Reduction Transformation (DRBT)
  • The Accuracy of the DRBT is evaluated with the clustering F-measure. The slide's own formulas were lost in extraction; the standard forms, for a cluster C_i found in the original data, a cluster C'_j found in the reduced data, and n objects overall, are:
  • Precision: P(i, j) = |C_i ∩ C'_j| / |C'_j|
  • Recall: R(i, j) = |C_i ∩ C'_j| / |C_i|
  • F-measure: F(i, j) = 2 · P(i, j) · R(i, j) / (P(i, j) + R(i, j))
  • Overall F-measure: F = Σ_i (|C_i| / n) · max_j F(i, j)
  (A sketch computing this appears below.)
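A short sketch of the overall F-measure computation under the standard definition above (illustrative; both inputs are lists of sets of object IDs):

```python
def overall_f_measure(original_clusters, reduced_clusters):
    """Weighted best-match F-measure between clusterings of D and D'."""
    n = sum(len(c) for c in original_clusters)
    total = 0.0
    for C in original_clusters:            # clusters found in the original data
        best = 0.0
        for K in reduced_clusters:         # clusters found in the reduced data
            overlap = len(C & K)
            if overlap:
                p, r = overlap / len(K), overlap / len(C)
                best = max(best, 2 * p * r / (p + r))
        total += (len(C) / n) * best       # weight each cluster by its size
    return total

# overall_f_measure([{1, 2, 3}, {4, 5}], [{1, 2}, {3, 4, 5}])  ->  0.8
```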

  35. Results: Results and Evaluation
  [Table: the datasets used in our performance evaluation, for both association rules and clustering.]

  36. Results: Data Sharing-Based Algorithms
  • Item Grouping Algorithm (IGA) [Oliveira & Zaïane, PSDM 2002].
  • Sliding Window Algorithm (SWA) [Oliveira & Zaïane, ICDM 2003].
  • Round Robin Algorithm (RRA) [Oliveira & Zaïane, IDEAS 2003].
  • Random Algorithm (RA) [Oliveira & Zaïane, IDEAS 2003].
  • Algo2a [E. Dasseni et al., IHW 2001].

  37. Results: Methodology
  • The sensitive rules were selected based on four scenarios:
  • S1: rules with mutually exclusive items.
  • S2: rules selected randomly.
  • S3: rules with very high support.
  • S4: rules with low support.
  • The effectiveness of the algorithms was measured under three conditions:
  • C1: ψ = 0%, with the minimum support and minimum confidence fixed.
  • C2: the same as C1, but varying the number of sensitive rules.
  • C3: ψ = 0%, with the minimum confidence and the number of sensitive rules fixed, varying the minimum support.

  38. Results: Measuring Effectiveness
  [Four summary tables, one per measure: misses cost under condition C1 (ψ = 0%, 6 sensitive rules); misses cost under condition C2 (ψ = 0%, varying the number of rules); misses cost under condition C3 (ψ = 0%, varying values of the minimum support); and Dif(D, D') under conditions C1 and C3. Each table reports the best algorithm per dataset (Kosarak, Retail, Reuters, BMS-1) and scenario (S1 to S4); IGA and SWA are the best performers in most cells, with RA, RRA, and Algo2a appearing occasionally.]

  39. Results: Special Cases of Data Sanitization
  • An example of different thresholds for the sensitive rules in scenario S3.
  • SWA with a different threshold per rule: { [rule1, 30%], [rule2, 25%], [rule3, 15%], [rule4, 45%], [rule5, 15%], [rule6, 20%] }, with window size K = 100,000.
  [Figure: effect of ψ on misses cost and hiding failure in the dataset Retail.]

  40. Results: CPU Time
  [Figure: results of CPU time for the sanitization process.]

  41. Results: Pattern Sharing-Based Algorithm
  • Downright Sanitizing Algorithm (DSA) [Oliveira & Zaïane, PAKDD 2004].
  • We used the data-sharing algorithm IGA for our comparison study.
  • Methodology:
  • IGA: we used IGA to sanitize the datasets, then used Apriori to extract the rules to share (all the datasets).
  • DSA: we used Apriori to extract the rules from the datasets, then used DSA to sanitize the rules mined in the previous step.

  42. Results: Measuring Effectiveness
  [Figures: the best algorithm in terms of misses cost; the best algorithm in terms of misses cost varying the number of rules to sanitize; the best algorithm in terms of side effect factor.]

  43. Results: Lessons Learned
  • Large datasets are our friends.
  • The benefit of an index: at most two scans to sanitize a dataset.
  • The data sanitization paradox.
  • The outstanding performance of IGA and DSA.
  • Rule sanitization reduces inference channels and does not change the support and confidence of the shared rules.
  • DSA reduces the flexibility of information sharing.

  44. Results: Evaluation: DRBT
  • Methodology:
  • Step 1: Attribute normalization.
  • Step 2: Dimensionality reduction (two approaches).
  • Step 3: Computation of the error produced on the reduced datasets.
  • Step 4: Run K-means to find the clusters in the original and reduced datasets.
  • Step 5: Computation of the F-measure (experiments repeated 10 times).
  • Step 6: Comparison of the clusters generated from the original and the reduced datasets.

  45. Results: DRBT: PPC over Centralized Data
  [Figures: the error produced on the dataset Chess (d_o = 37); average F-measure (10 trials) for the dataset Accidents (d_o = 18, d_r = 12); average F-measure (10 trials) for the dataset Iris (d_o = 5, d_r = 3).]

  46. Results: DRBT: PPC over Vertically Partitioned Data
  [Figures: the error produced on the dataset Pumsb (d_o = 74); average F-measure (10 trials) for the dataset Pumsb (d_o = 74, d_r = 38).]

  47. Conclusions: Contributions of this Research
  • Foundations for further research in PPDM.
  • A taxonomy of PPDM techniques.
  • A family of privacy-preserving methods.
  • A library of sanitizing algorithms.
  • Retrieval facilities.
  • A set of metrics.

  48. Conclusions: Future Research
  • Privacy definition in data mining.
  • Combining sanitization and randomization.
  • A new method for PPC (k-anonymity + isometries + data distortion).
  • Sanitization of document repositories.

  49. Conclusions: Thank You!
