A Novel Method for Protecting Sensitive Knowledge in Association Rule Mining

A Novel Method for Protecting Sensitive Knowledge in Association Rule Mining

Introduction • Why privacy preserving data mining? • Data privacy V.S. Information privacy • Data privacy • Randomization approach • Secure multi-party computation approach • Information privacy • Sensitive Knowledge

Background Knowledge • Support of {A, C}: the probability that a transaction contains {A, C} • eg: support of {A, C} = 3/4 • {A, C} is frequent: the support of {A, C} is no smaller than minimum support defined by user • if mimSup = 0.4 then {A, C} is a frequent pattern

Related Work • Oliveira_Zaiane proposed 6 algorithms • Step1:Find sensitive transactions • Step2: Choose victim items • Step3: Compute how many sensitive transactions should be changed • Step4: Select victim transactions • SWA of Oliveira_Zaiane • Almost the same with the others but each step applied to K transactions, K is the window size • With the best performance

Related Work (cont.) • Published pattern sets • Forward-Inference Attack

Preliminary • Represent TDB as a binary matrix • P : frequent patterns • PH : frequent patterns with security policies • ~PH : frequent patterns without security policies • PH ∪~PH = P • Pair-Subset : • eg: {1, 2, 3} is frequent; {{1, 2}, {1, 3}, {2, 3}} is the Pair-Subset of {1, 2, 3} and {1, 2} is a pair-subpattern of {1, 2, 3}

Problem Definition • Transform D into D', such that PH are hidden and ~PH are still mined in D' and also avoid Forward-Inference Attack • Kernel Ideal • D is multiplied by a sanitization matrix S • The problem is transformed to how to define S

Matrix Multiplication • If Dij = 0, D'ij is set to 0 directly • If ≥ 1, D'ij is set to 1 • If ≤ 0, D'ij is set to 0

Matrix Observation • Setting of “– 1” • If Sij id set to – 1, for the row that Dti and Dtjare both equal to 1, D'ij will become 0

Matrix Observation (cont.) • Setting of “1” • Setting Sij to 1 can keep the relation between item i and item j by enhancing the strength of item j

Sanitization Process

Marked-Set Generation • 1. Put the patterns with length = 2 in PH into Marked-Set directly • 2. for all remainder P in PH do • if (P has no Pair-Subsets included in Marked-Set) • Generate k groups, k = # of all Pair-Subsets of P • Class label of group is named by each Pair-Subsets of P • P is stored in each group • 3.Merge the groups with same class label

Marked-Set Generation (cont.) • 4.for all NP in ~PH do • Generate their all Pair-Subsets • Count the frequencies of all Pair-Subsets • 5. for all groups do • If the class label of the group ≠ any Pair-Subset generated in Step1 • The frequency of the group = 0 • If the class label of the group = one Pair-Subset generated in Step1 • The frequency of the group = the frequency of the Pair-Subset

Marked-Set Generation (cont.) • 6. Sort the groups by frequency in the increasing order • 7.for (i =1 to number of groups -1) for ( j = i +1 to number of groups) Compare groups pair-wise, Gi, Gj for all overlap in Gi∩Gj do If the size of Gi≠ the size of Gj Remove overlap from the small one else if Check the frequency Remove overlap from the large one else Remove overlap form the group chosen randomly • 8.for all groups do If number of patterns stored in group > 0 Put the class label into Marked-Set

An overall example

Sanitization Matrix Setting • 1. Sii = 1 • 2. for all {i, j} in Marked-Set do • if(# of i in ~PH < # of j in ~PH) • Sji= –1 • If(# of i in ~PH > # of j in ~PH) • Sij = –1 • else • if(# of i in Marked-Set > # of j in Marked-Set) • Sji = –1 • if(# of i in Marked-Set < # of j in Marked-Set) • Sij = –1 • else • Sji = –1 or Sij= –1 randomly

Sanitization Matrix Setting (cont.) • 3.for all {i, j} in (large2- Marked-Set) do • Set Sij= 1, Sji= 1 • 4.Sij= 0, otherwise

Probability Policies • Distortion Probabilityρ • Used when only one “1” in the column j and works if ; D'ij hasρj to be set to 1 and 1–ρj to be set to 0

Probability Policies (cont.) • Lemma1: Give a minimum supportσand a level of confidence c. Let {i, j} be a pattern in Marked-Set nij be the support count of {i, j} ρ is the probability of column j. W.L.O.G we assume that Sij = –1. If ρ satisfies and where |D| is the number of transaction in D, we can say that we are c confident that {i, j} isn’t frequent in D'

Probability Policies (cont.) • Conformity Probabilityμ • Used when the column j of S contains at least two “1s”, works if , and at least one –1 in j is multiplied by 1 in D, D'ij is set to 1 withμand 0 with 1–μ

Probability Policies (cont.) • Lemma 2: Given a minimum support σ, and a level of confidence c. Let {i, j} be a pattern in Marked-Set, and {k, j} be a pattern which belongs to {large2 – Marked-Set}, nikj be the support count of {i, k, j}. W.L.O.G, we assume that Sij = –1.μis the Conformity probability of column j. If μ is set according to the following rule, we can say that we are c confident that {i, j} isn’t frequent in D'.

Sanitization Algorithm for (i =1 to n){ for (j =1 to m){ if (Dij = 0) D'ij = 0 else{ temp = if (col j in S with some –1, one 1, temp ≤ 0) D'ij = 1 (ρj), 0 ( 1–ρj) else if (col j in S with some –1, more than one 1,at least one –1 multiplied by 1 in D, temp ≥ 1) D'ij = 1 (μi), 0 (1–μi) else if (temp ≤ 0) D'ij = 0 else if (temp ≥ 1) D'ij = 1 } } }

Performance Quantifying • Hiding Failure: • Miss Cost: • Dissimilarity: • Weakness:

Performance Evaluation

Performance Evaluation (cont.)

Conclusion • A probability based approach to solve sensitive knowledge problem is proposed • In some conditions, the miss cost and the dissimilarity is little higher than SWA, but overall, better performance than SWA and could not suffer from Forward-Inference Attack • Any Questions?

Reference • [LCC04]Guanling Lee, Chien-Yu Chang and Arbee L.P Chen. Hiding sensitive patterns in association rules mining. The 28th Annual International Computer Software and Applications Conference (COMPSAC 2004) • [OZ02]S. R. M. Oliveira and O. R. Zaïane. Privacy Preserving Frequent Itemset Mining. In Proc. of the IEEE ICDM Workshop on Privacy, Security, and Data Mining Japan, December 2002. • [OZ03a]S. R. M. Oliveira and O. R. Zaïane. Algorithms for Balancing Privacy and Knowledge Discovery in Association Rule Mining. In Proc. of the 7th International Database Engineering and Applications Symposium (IDEAS’03), Hong Kong, China, July 2003. • [OZ03b]S. R. M. Oliveira and O. R. Zaïane. Protecting Sensitive Knowledge By Data Sanitization. In Proc. of the 3rd IEEE International Conference on Data Mining (ICDM’03). • [OZS04]S. R. M. Oliveira, O. R. Zaïane and Yücel Saygin. Secure Association Rule Sharing The 8th Pacific-Asia Conference on Knowledge Discovery and Data Mining 2004(PAKDD-04). • [VAE04]Verykios, V.S.; Elmagarmid, A.K.; Bertino, E.; Saygin, Y.; Dasseni, E. Association rule hiding. IEEE Transactions On Knowledge And Data Engineering, Vol. 16, No. 4, April 2004.

A Novel Method for Protecting Sensitive Knowledge in Association Rule Mining

A Novel Method for Protecting Sensitive Knowledge in Association Rule Mining

Presentation Transcript

Association Rule Mining Constraint Based Association Rule Mining

Association Rule Mining

A distributed method for mining association rules

Association Rule Mining

Association Rule Mining

Association Rule Mining

Association Rule Mining

Association Rule Mining

Association Rule Mining

Association Rule Mining

Association Rule Mining

Association Rule Mining

Association Rule Mining

Association Rule Mining

Association Rule Mining

Association rule mining

Mining Association Rule

Association Rule Mining

Association Rule Mining

Association Rule Mining

Data Mining: Association Rule Mining

Data Mining Association Rule Mining