Data Transformation for Privacy-Preserving Data Mining

Stanley R. M. Oliveira
Database Systems Laboratory, Computing Science Department
University of Alberta, Canada
PhD Thesis - Final Examination, November 29th, 2004

Introduction: Motivation
• Scenario 1: A collaboration between an Internet marketing company and an on-line retail company.
  • Objective: find optimal customer targets.
• Scenario 2: Companies sharing their transactions to build a recommender system.
  • Objective: provide recommendations to their customers.

Introduction: Contributions
A Taxonomy of the Existing Solutions
[Fig. 1: A Taxonomy of PPDM Techniques. Branches: Data Partitioning, Data Modification, Data Restriction, Data Ownership.]

Framework: Problem Definition
• To transform a database into a new one that conceals sensitive information while preserving general patterns and trends from the original database.
[Figure: The transformation process. Original Database → transformation (data sanitization, dimensionality reduction) → Transformed Database → data mining → non-sensitive patterns and trends.]

Framework: Problem Definition (cont.)
• Sub-Problem 1: Privacy-Preserving Association Rule Mining
  • I do not address the privacy of individuals, but the problem of protecting sensitive knowledge.
• Sub-Problem 2: Privacy-Preserving Clustering
  • I protect the underlying attribute values of objects subjected to clustering analysis.

Framework: A Framework for PPDM
[Figure: A schematic view of the framework for privacy preservation. Labels: Client, Server, Original data, Transformed Database, Individual Transformation, Collective Transformation. Components of the privacy-preserving framework: Library of Algorithms, PPDT Methods, Retrieval Facilities, Sanitization Metrics.]

PP-Assoc. Rules: Privacy-Preserving Association Rule Mining
[Figure: A taxonomy of sanitizing algorithms (Heuristic 1, Heuristic 2, Heuristic 3).]

PP-Assoc. Rules: Data Sharing-Based Algorithms: Problems
Database D is transformed into database D’. Four problems are measured:
1. Hiding Failure
2. Misses Cost
3. Artifactual Patterns
4. Difference between D and D’
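The formulas for these four measures did not survive on the slide. As a rough sketch only, assuming definitions in the spirit of the sanitization literature (hiding failure as the fraction of sensitive rules still discoverable in D’, misses cost as the fraction of non-sensitive rules lost, artifactual patterns as the fraction of rules mined from D’ that did not exist in D, and the difference as the relative drop in item frequencies), they could be computed as follows. All function and variable names are illustrative and not taken from the thesis.

```python
from collections import Counter

def hiding_failure(sensitive, rules_after):
    """Assumed definition: share of sensitive rules still mined from D'."""
    if not sensitive:
        return 0.0
    return len(sensitive & rules_after) / len(sensitive)

def misses_cost(rules_before, rules_after, sensitive):
    """Assumed definition: share of non-sensitive rules lost by sanitization."""
    non_sensitive = rules_before - sensitive
    if not non_sensitive:
        return 0.0
    return len(non_sensitive - rules_after) / len(non_sensitive)

def artifactual_patterns(rules_before, rules_after):
    """Assumed definition: share of rules in D' that were not in D."""
    if not rules_after:
        return 0.0
    return len(rules_after - rules_before) / len(rules_after)

def dif(db_before, db_after):
    """Assumed definition: relative drop in total item frequency."""
    freq_before = Counter(item for t in db_before for item in t)
    freq_after = Counter(item for t in db_after for item in t)
    total = sum(freq_before.values())
    removed = sum(freq_before[i] - freq_after[i] for i in freq_before)
    return removed / total

# Toy example (made-up rules and transactions).
rules_d = {"A->B", "A->C", "B->C", "C->D"}
rules_d_prime = {"A->C", "C->D"}
sensitive_rules = {"A->B"}
d = [{"A", "B", "C"}, {"A", "B"}, {"C", "D"}]
d_prime = [{"A", "C"}, {"A", "B"}, {"C", "D"}]

hf = hiding_failure(sensitive_rules, rules_d_prime)
mc = misses_cost(rules_d, rules_d_prime, sensitive_rules)
ap = artifactual_patterns(rules_d, rules_d_prime)
df = dif(d, d_prime)
```

In the toy example the sensitive rule is hidden, one of the three non-sensitive rules is lost, no new rule appears, and one of the seven item occurrences is removed.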

PP-Assoc. Rules: Data Sharing-Based Algorithms
1. Scan the database and identify the sensitive transactions for each sensitive rule.
2. Based on the disclosure threshold, compute the number of sensitive transactions to be sanitized.
3. For each sensitive rule, identify a candidate item that should be eliminated (the victim item).
4. Based on the number computed in step 2, remove the victim items from the sensitive transactions.
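A minimal sketch of the four steps above, under simplifying assumptions that the slide does not state: a sensitive rule is represented by its itemset, the disclosure threshold is the fraction of sensitive transactions left untouched (so 0 means all of them are sanitized), the victim is the rule item with the highest support in the database, and the transactions to sanitize are simply the first ones found. The thesis algorithms (IGA, SWA, and others) make these choices in their own ways; `sanitize` and its parameters are hypothetical names.

```python
import math

def sanitize(transactions, sensitive_rules, disclosure_threshold):
    """Return a sanitized copy of the transactions.

    transactions: list of sets of items
    sensitive_rules: list of frozensets (itemset of each sensitive rule)
    disclosure_threshold: float in [0, 1], assumed to be the fraction of
        sensitive transactions that may be left as they are
    """
    sanitized = [set(t) for t in transactions]
    for rule in sensitive_rules:
        # Step 1: sensitive transactions are those containing the rule's items.
        sensitive_ids = [i for i, t in enumerate(sanitized) if rule <= t]
        # Step 2: number of sensitive transactions to sanitize.
        n_to_sanitize = math.ceil(len(sensitive_ids) * (1 - disclosure_threshold))
        if n_to_sanitize == 0:
            continue
        # Step 3: pick the victim item (assumption: highest overall support,
        # ties broken alphabetically).
        support = {item: sum(item in t for t in sanitized) for item in rule}
        victim = max(sorted(rule), key=lambda item: support[item])
        # Step 4: remove the victim from the chosen sensitive transactions.
        for i in sensitive_ids[:n_to_sanitize]:
            sanitized[i].discard(victim)
    return sanitized

# Toy example (made-up data).
db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
rule = frozenset({"a", "b"})
fully_hidden = sanitize(db, [rule], disclosure_threshold=0.0)
half_hidden = sanitize(db, [rule], disclosure_threshold=0.5)
```

With a threshold of 0 no transaction supports the sensitive itemset any more; with 0.5 one of the two sensitive transactions is left as it was.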

PP-Assoc. Rules: Measuring Effectiveness
[Tables: best-performing algorithm for each dataset (Kosarak, Retail, Reuters, BMS-1) in scenarios S1 to S4. Panels: Misses Cost under condition C1; Misses Cost under condition C2; Misses Cost under condition C3; Dif (D, D’) under conditions C1 and C3. Algorithms appearing in the cells: IGA, SWA, RA, RRA, Algo2a. The scenario headers refer to a parameter set to 0% with 6 sensitive rules, with varying parameter values, or with a varying number of rules; the parameter symbols and the cell layout are not recoverable.]

PP-Assoc. Rules: Pattern Sharing-Based Algorithm
• Data Sharing-Based Approach (IGA): D → Sanitization → D’ → Share → D’ → AR generation → Discovered Patterns (D’)
• Pattern Sharing-Based Approach (DSA): D → AR generation → Association Rules (D) → Sanitization → Association Rules (D’) → Share → Association Rules (D’)

PP-Assoc. Rules: Pattern Sharing-Based Algorithms: Problems
[Figure: R is the set of all rules, split into non-sensitive rules (~SR) and sensitive rules (SR); R’ is the set of rules to share. Marked regions: rules to hide, rules hidden, Problem 1: side effect (RSE), Problem 2: inference (recovery factor).]
• 1. Side Effect Factor (SEF)
• 2. Recovery Factor (RF), with RF in [0, 1]
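The SEF formula did not survive on the slide. Reading the diagram, the side effect is the set of non-sensitive rules that disappear along with the sensitive ones, so one plausible way to quantify it, under that assumption, is the fraction of non-sensitive rules missing from the shared set R’. The function name and the rule labels below are illustrative.

```python
def side_effect_factor(all_rules, sensitive_rules, shared_rules):
    """Assumed definition: share of non-sensitive rules that are not shared."""
    non_sensitive = all_rules - sensitive_rules
    if not non_sensitive:
        return 0.0
    lost = non_sensitive - shared_rules
    return len(lost) / len(non_sensitive)

# Toy example (made-up rule labels): r1 is sensitive, r5 is lost as a side effect.
R = {"r1", "r2", "r3", "r4", "r5"}
S_R = {"r1"}
R_shared = {"r2", "r3", "r4"}
sef = side_effect_factor(R, S_R, R_shared)
```

When R’ contains only non-sensitive rules, this equals (|R| - (|R’| + |SR|)) / (|R| - |SR|).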

PP-Assoc. Rules: Measuring Effectiveness
[Tables: the best algorithm in terms of misses cost; the best algorithm in terms of misses cost, varying the number of rules to sanitize; the best algorithm in terms of side effect factor.]

PP-Assoc. Rules: Lessons Learned
• Large datasets are our friends.
• The benefit of the index: at most two scans to sanitize a dataset.
• The data sanitization paradox.
• The outstanding performance of IGA and DSA.
• Rule sanitization does not change the support and confidence of the shared rules.
• DSA reduces the flexibility of information sharing.

PP-Clustering: Privacy-Preserving Clustering (PPC)
• PPC over Centralized Data:
  • The attribute values subjected to clustering are available in a central repository.
• PPC over Vertically Partitioned Data:
  • There are k parties sharing data for clustering, where k ≥ 2.
  • The attribute values of the objects are split across the k parties.
  • Object IDs are revealed for join purposes only. The values of the associated attributes are private.

PP-Clustering: Object Similarity-Based Representation (OSBR)
Example 1: Sharing data for research purposes (OSBR).
Original Data: a sample of the cardiac arrhythmia database (UCI Machine Learning Repository)
ID    age   weight   heart rate   Int_def   QRS   PR_int
123   75    80       63           32        91    193
342   56    64       53           24        81    174
254   40    52       70           24        77    129
446   28    58       76           40        83    251
286   44    90       68           44        109   128
Transformed Data: the corresponding dissimilarity matrix [matrix values not present in the extracted text]
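To illustrate the idea, the sketch below builds a dissimilarity matrix for the five sample objects. It assumes plain Euclidean distance on the raw attribute values; the slide does not state which distance function is used or whether the attributes are normalized first, so the numbers it produces are not the ones from the original slide.

```python
import math

# Sample rows from the slide: age, weight, heart rate, Int_def, QRS, PR_int.
objects = {
    123: (75, 80, 63, 32, 91, 193),
    342: (56, 64, 53, 24, 81, 174),
    254: (40, 52, 70, 24, 77, 129),
    446: (28, 58, 76, 40, 83, 251),
    286: (44, 90, 68, 44, 109, 128),
}

def dissimilarity_matrix(rows):
    """Pairwise Euclidean distances between all rows (assumed metric)."""
    return [[math.dist(a, b) for b in rows] for a in rows]

dm = dissimilarity_matrix(list(objects.values()))

# Only this m x m matrix would be shared, not the attribute values.
for row in dm:
    print(" ".join(f"{value:7.2f}" for value in row))
```

A distance-based clustering algorithm can run on `dm` alone, which is why the attribute values themselves never need to leave the data owner.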

PP-Clustering: Object Similarity-Based Representation (OSBR)
• Limitations of the OSBR:
  • Expensive in terms of communication cost: O(m^2), where m is the number of objects under analysis.
  • Vulnerable to attacks in the case of vertically partitioned data.
• Conclusion: The OSBR is effective for PPC over centralized data only.

PP-Clustering: Dimensionality Reduction Transformation (DRBT)
• Random projection from d to k dimensions:
  • D’_{n×k} = D_{n×d} · R_{d×k} (linear transformation), where D is the original data, D’ is the reduced data, and R is a random matrix.
• R is generated by setting each entry r_ij in one of two ways:
  • (R1): r_ij is drawn i.i.d. from N(0,1), and the columns are then normalized to unit length;
  • (R2): r_ij = [formula missing from the extracted slide]

PP-Clustering: Dimensionality Reduction Transformation (DRBT)
Original Data: a sample of the cardiac arrhythmia database (UCI Machine Learning Repository)
ID    age   weight   heart rate   Int_def   QRS   PR_int
123   75    80       63           32        91    193
342   56    64       53           24        81    174
254   40    52       70           24        77    129
446   28    58       76           40        83    251
286   44    90       68           44        109   128
Transformed Data: two projections of the sample, RP1 and RP2 [projected values not present in the extracted text]
• RP1: The random matrix is based on the Normal distribution.
• RP2: The random matrix is based on a much simpler distribution.
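A minimal sketch of an RP1-style projection of the sample from 6 to 3 dimensions: the random matrix has N(0,1) entries and its columns are normalized to unit length. The target dimension k = 3, the fixed seed, and the function names are assumptions made for illustration, and the sketch projects the raw values without any preprocessing, so its output will not match the slide's transformed data. RP2 is not shown because its distribution is not given in the extracted text.

```python
import math
import random

def random_matrix_r1(d, k, rng):
    """d x k matrix with N(0,1) entries, columns normalized to unit length."""
    r = [[rng.gauss(0.0, 1.0) for _ in range(k)] for _ in range(d)]
    for j in range(k):
        norm = math.sqrt(sum(r[i][j] ** 2 for i in range(d)))
        for i in range(d):
            r[i][j] /= norm
    return r

def project(data, r):
    """Compute D' = D . R for row-major lists of numbers."""
    k = len(r[0])
    return [
        [sum(x * r[i][j] for i, x in enumerate(row)) for j in range(k)]
        for row in data
    ]

# Sample rows from the slide: age, weight, heart rate, Int_def, QRS, PR_int.
data = [
    (75, 80, 63, 32, 91, 193),
    (56, 64, 53, 24, 81, 174),
    (40, 52, 70, 24, 77, 129),
    (28, 58, 76, 40, 83, 251),
    (44, 90, 68, 44, 109, 128),
]

rng = random.Random(0)  # fixed seed only to make the sketch repeatable
R = random_matrix_r1(d=6, k=3, rng=rng)
reduced = project(data, R)
```

Each object keeps its row position (and hence its ID) in `reduced`, but the three new columns are mixtures of the original six attributes.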

PP-Clustering: Dimensionality Reduction Transformation (DRBT)
• Security: A random projection from d to k dimensions, where k < d, is a non-invertible linear transformation.
• Space requirement is of the order O(m), where m is the number of objects.
• Communication cost is of the order O(mkl), where l represents the size (in bits) required to transmit a dataset from one party to a central or third party.
• Conclusion: The DRBT is effective for PPC over centralized data and vertically partitioned data.

PP-Clustering: DRBT: PPC over Centralized Data
• [Figure] The error produced on the dataset Chess (d_o = 37).
• [Table] Average of F-measure (10 trials) for the dataset Accidents (d_o = 18, d_r = 12).
• [Table] Average of F-measure (10 trials) for the dataset Iris (d_o = 5, d_r = 3).

PP-Clustering: DRBT: PPC over Vertically Partitioned Data
• [Figure] The error produced on the dataset Pumsb (d_o = 74).
• [Table] Average of F-measure (10 trials) for the dataset Pumsb (d_o = 74, d_r = 38).

Conclusions: Contributions of this Research
• Foundations for further research in PPDM.
• A taxonomy of PPDM techniques.
• A family of privacy-preserving methods.
• A library of sanitizing algorithms.
• Retrieval facilities.
• A set of metrics.

Conclusions: Future Research
• Privacy definition in data mining.
• Combining sanitization and randomization for PPARM.
• Transforming data using one-way functions and learning from the distorted data.
• Privacy preservation in spoken language databases.
• Sanitization of document repositories on the Web.
