Reconstruction-Based Association Rule Hiding
Author: Yuhong Guo (MS-Ph.D. Candidate, Peking Univ., China), yhguo@pku.edu.cn
Advisor: Prof. Shiwei Tang
Co-Advisors: Prof. Dongqing Yang, Jian Pei
Sunday, June 10, 2007
Association Rule Hiding: what? why?? and how???
• Problem: hide sensitive association rules in data without losing non-sensitive ones
• Motivations: large repositories of data contain confidential rules whose disclosure can have serious adverse effects
• Solutions
  • Data modification (traditional: fine-tuning, controls the hiding effects indirectly)
    • distortion
    • blocking
  • Data reconstruction (new and promising: knowledge sanitization, controls the hiding effects directly)
SIGMOD Ph.D. Workshop IDAR’07
Outline
• Background
  • Motivation
  • Problem statement
  • Related work
• Proposed Solution
• Current Progress
• Evaluation Plan
Background: Motivation
Privacy Preserving Data Mining (PPDM)
• Two problems addressed in PPDM
  • the protection of private data
  • the protection of sensitive rules (knowledge) contained in the data
[Diagram labels: Data mining, Data sharing, Privacy preserving]
Background: Problem statement
• Given
  • a database D to be released
  • minimum thresholds “MST” (minimum support threshold) and “MCT” (minimum confidence threshold)
  • a set of association rules R mined from D
  • a set of sensitive rules Rh ⊂ R to be hidden
• Find a new database D’ such that
  • the rules in Rh cannot be mined from D’
  • as many of the rules in R − Rh as possible can still be mined from D’
This is the KHD (Knowledge Hiding in Database) problem.
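The problem statement can be written compactly using the standard definitions of support and confidence. This is a sketch for orientation only: the notation supp, conf and R(D) (the rule set mined from a database D at the given thresholds) is an assumption and does not appear on the slides.

```latex
\[
\mathrm{supp}_D(X) = \frac{\lvert \{\, t \in D : X \subseteq t \,\} \rvert}{\lvert D \rvert},
\qquad
\mathrm{conf}_D(X \Rightarrow Y) = \frac{\mathrm{supp}_D(X \cup Y)}{\mathrm{supp}_D(X)}
\]
\[
R(D) = \{\, X \Rightarrow Y : \mathrm{supp}_D(X \cup Y) \ge \mathrm{MST},\
\mathrm{conf}_D(X \Rightarrow Y) \ge \mathrm{MCT} \,\}
\]
\[
\text{Find } D' \text{ such that } R_h \cap R(D') = \emptyset
\text{ and } \lvert (R \setminus R_h) \cap R(D') \rvert \text{ is as large as possible.}
\]
```

For instance, in the worked example of this deck, AB occurs in 4 of 6 transactions and B in 4, so supp(AB) = 4/6 (shown as 66% on the slide) and conf(B ⇒ A) = 4/4 = 100%.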
Background: Related work
• Data modification approaches
  • Basic idea: data sanitization, D → D’
  • Current status: distortion and blocking; an active area with many algorithms
  • Drawbacks
    • Cannot control the hiding effects intuitively; lots of I/O
• Data reconstruction approaches
  • Basic idea: knowledge sanitization, D → K → D’
  • Current status: limited, 3 papers
  • Advantages
    • Can easily control the availability of rules and control the hiding effects directly and intuitively
Background: Classification of current algorithms
[Figure: algorithms classified by hiding target (hide rules / hide large itemsets) and by approach]
• Data modification
  • Data-Distortion: Algo1a, Algo1b, Algo2a, WSDA, PDA, Algo2b, Algo2c, Naïve, MinFIA, MaxFIA, IGA, RRA, RA, SWA, Border-Based, Integer-Programming, Sanitization-Matrix
  • Data-Blocking: CR, CR2, GIH
• Data reconstruction: CIILM
Much more reconstruction-based work is expected.
Outline
• Background
• Proposed Solution
  • Framework
  • Example
  • Discussion
• Current Progress
• Evaluation Plan
Proposed Solution: Framework of our approach
[Figure: three-phase pipeline]
1. Frequent set mining: D → FS
2. Perform sanitization algorithm: FS, R, Rh → FS’ (from which R − Rh can be derived)
3. FP-tree-based inverse frequent set mining: FS’ → FP-tree → D’
Proposed Solution: The first two phases
• 1. Frequent set mining
  • Generate all frequent itemsets, with their supports and support counts (FS), from the original database D
• 2. Perform sanitization algorithm
  • Input: FS (the output of phase 1), R, Rh
  • Output: sanitized frequent itemsets FS’
  • Process
    • Select hiding strategy
    • Identify sensitive frequent sets
    • Perform sanitization
In the best case, the sanitization algorithm ensures that exactly the non-sensitive rule set R − Rh can be derived from FS’.
Proposed Method: The third phase (FP-tree-based inverse mining)
• Basic idea: use the FP-tree as a transition “bridge”, which reduces the gap between a database and its frequent itemsets and makes the transformation easier
[Figure: Frequent itemsets FS → (i) → FP-tree → (ii) → temporary database TempD → (iii) → a set of compatible databases D1, D2, ...]
(i) Generate a compatible FP-tree
(ii) Generate a TempD that only includes frequent items
(iii) Scatter infrequent items into TempD
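Step (iii) can be illustrated with a minimal sketch. It is not the authors' algorithm: the function name `scatter`, the random placement, and the `seed` parameter are assumptions made for illustration. The idea it shows is that an infrequent item may be added to any transactions of TempD as long as its count stays below the minimum support count σ, because then no itemset containing it becomes frequent and the counts of itemsets over frequent items are untouched.

```python
import random

def scatter(temp_d, infrequent_items, min_count, seed=0):
    """Return a copy of temp_d with each infrequent item added to fewer
    than min_count transactions (hypothetical random placement)."""
    rng = random.Random(seed)
    released = [set(t) for t in temp_d]
    # An infrequent item may appear in at most min_count - 1 transactions.
    upper = min(min_count - 1, len(released))
    for item in infrequent_items:
        if upper < 1:
            continue  # no room to place the item without making it frequent
        k = rng.randint(1, upper)
        for idx in rng.sample(range(len(released)), k):
            released[idx].add(item)
    return released

# TempD from the worked example in this deck, with sigma = 4 and infrequent item E.
temp_d = [("A", "C", "D"), ("A", "C", "D"), ("A", "C"),
          ("A", "C"), ("A", "D"), ("A", "D")]
for transaction in scatter(temp_d, ["E"], min_count=4):
    print(sorted(transaction))
```

Different seeds give different compatible databases, which matches the slide's point that the method produces a set of outputs rather than a single one.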
Proposed Solution: Example (the first two phases)
Thresholds: σ = 4, MST = 66%, MCT = 75%

Original database D:
TID  Items
T1   A B C E
T2   A B C
T3   A B C D
T4   A B D
T5   A D
T6   A C D

1. Frequent set mining

Frequent itemsets FS:
Itemset  Count  Support
A        6      100%
B        4      66%
C        4      66%
D        4      66%
AB       4      66%
AC       4      66%
AD       4      66%

Association rules R:
Rule    Confidence  Support
B ⇒ A   100%        66%
C ⇒ A   100%        66%
D ⇒ A   100%        66%

2. Perform sanitization algorithm

Frequent itemsets FS’:
Itemset  Count  Support
A        6      100%
C        4      66%
D        4      66%
AC       4      66%
AD       4      66%

Association rules R − Rh:
Rule    Confidence  Support
C ⇒ A   100%        66%
D ⇒ A   100%        66%
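Phase 1 of this example can be reproduced with a short brute-force sketch. The slides do not say which mining algorithm is used, so the exhaustive enumeration below is only an assumption that is practical for a six-transaction toy database; the function names `frequent_itemsets` and `rules` are hypothetical.

```python
from itertools import combinations

# Original database D from the slide (TID -> items).
D = {
    "T1": {"A", "B", "C", "E"},
    "T2": {"A", "B", "C"},
    "T3": {"A", "B", "C", "D"},
    "T4": {"A", "B", "D"},
    "T5": {"A", "D"},
    "T6": {"A", "C", "D"},
}

def frequent_itemsets(db, min_count):
    """Return {itemset: support count} for all itemsets with count >= min_count."""
    items = sorted(set().union(*db.values()))
    result = {}
    for size in range(1, len(items) + 1):
        found = False
        for combo in combinations(items, size):
            count = sum(1 for t in db.values() if set(combo) <= t)
            if count >= min_count:
                result[frozenset(combo)] = count
                found = True
        if not found:
            break  # no frequent itemset of this size, so none of a larger size
    return result

def rules(fs, min_conf):
    """Return {(lhs, rhs): confidence} for rules meeting min_conf."""
    out = {}
    for itemset, count in fs.items():
        for k in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), k):
                lhs = frozenset(lhs)
                conf = count / fs[lhs]  # every subset of a frequent set is in fs
                if conf >= min_conf:
                    out[(lhs, itemset - lhs)] = conf
    return out

FS = frequent_itemsets(D, min_count=4)   # sigma = 4
R = rules(FS, min_conf=0.75)             # MCT = 75%
for (lhs, rhs), conf in sorted(R.items(), key=lambda kv: sorted(kv[0][0])):
    print("".join(sorted(lhs)), "=>", "".join(sorted(rhs)), f"{conf:.0%}")
```

With σ = 4 and MCT = 75% this yields the seven itemsets in FS and the three rules B ⇒ A, C ⇒ A and D ⇒ A listed on the slide.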
Proposed Solution: Example (the third phase), σ = 4

Frequent itemsets FS’:
Itemset  Count  Support
A        6      100%
C        4      66%
D        4      66%
AC       4      66%
AD       4      66%

FP-tree:
A:6
  C:4
    D:2
  D:2

Temporary database (frequent items only):
TID  Items
T1   A C D
T2   A C D
T3   A C
T4   A C
T5   A D
T6   A D

Released database D’:
[Figure: compatible databases D’1, D’2, ..., D’p, ..., D’q, each consisting of the six transactions above with the infrequent item E scattered into some of them]

• Difficulties:
  • How to find the target FP-tree
  • How to control |D’|
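Step (ii), turning the FP-tree into a database of frequent items, can be sketched as follows. The nested-tuple tree encoding and the names `expand` and `support_count` are assumptions for illustration, not the authors' data structures. The sketch relies on the usual FP-tree reading: a node's count minus the sum of its children's counts is the number of transactions that end at that node.

```python
from collections import Counter

# FP-tree from the slide, encoded as (item, count, children).
FP_TREE = ("A", 6, [
    ("C", 4, [("D", 2, [])]),
    ("D", 2, []),
])

def expand(node, prefix=()):
    """Yield one transaction (tuple of items) for every path ending at a node."""
    item, count, children = node
    path = prefix + (item,)
    ending_here = count - sum(child[1] for child in children)
    for _ in range(ending_here):
        yield path
    for child in children:
        yield from expand(child, path)

def support_count(db, itemset):
    """Number of transactions in db that contain every item of itemset."""
    return sum(1 for t in db if set(itemset) <= set(t))

temp_d = list(expand(FP_TREE))
print(Counter(temp_d))
for itemset in ("A", "C", "D", "AC", "AD"):
    print(itemset, support_count(temp_d, itemset))
```

The six transactions produced are two each of AC, ACD and AD, and their support counts match FS’ (A: 6; C, D, AC, AD: 4 each), while CD has count 2 and stays below σ = 4.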
Proposed Solution: Discussion
• Sanitization algorithm
  • Compared with the earlier, popular data sanitization approaches: performs sanitization directly at the knowledge level of the data
• Inverse frequent set mining algorithm
  • Deals with frequent items and infrequent items separately: more efficient, and yields a large number of outputs
Our solution provides the user with a knowledge-level window for performing sanitization conveniently, and generates a number of secure databases.
Outline
• Background
• Proposed Solution
• Current Progress
  • Work to date
  • Future work
  • Expected contributions
• Evaluation Plan
Current Progress: Work to date
• FP-tree-based method for inverse frequent set mining (used in the 3rd phase of our framework)
  • First effort
    • Published in Proc. of BNCOD'06
    • Provides a good heuristic search strategy to rapidly find an FP-tree satisfying the given constraints, which in turn quickly yields a set of compatible databases
  • Further work
    • Accepted by Journal of Software (JOS)
    • A more mature and well-designed FP-tree-based method for inverse frequent set mining by iteratively solving a sub linear constraint problem
Current Progress: Future work
• Develop a sound sanitization algorithm with the following considerations
  • The support and confidence of the rules in R − Rh should remain unchanged as far as possible
  • Can select appropriate hiding strategies according to the different kinds of correlations among the rules in R and Rh
  • Can prevent rule-based reasoning
• Investigate how to restrict the number of transactions in the newly released database
• Develop an integrated secure association rule mining tool
  • Can protect private data
  • Can protect sensitive rules contained in the data
[Diagram labels: DHD, KHD, Integrated secure tool]
Current Progress: Expected contributions
CHART: Credible Hiding Association Rule Tool
• Reconstruction-based ARH Framework
• Rule sanitization Algorithm
• Inverse Frequent Set Mining Algorithm
• ARH Evaluation Metrics
Outline
• Background
• Proposed Solution
• Current Progress
• Evaluation Plan
Evaluation Plan
• Dataset
  • BMS-POS
  • BMS-WebView-1
  • BMS-WebView-2
  • …
• Evaluation
  • Hiding effects
    ① Hiding Failure Ratio: |Rh(D’)| / |Rh(D)|
    ② Lost Rules Ratio: (|~Rh(D)| − |~Rh(D’)|) / |~Rh(D)|
    ③ Ghost Rules Ratio: (|R’| − |R ∩ R’|) / |R’|
  • Data utility
  • Time performance
[Figure: Venn diagram of R (split into Rh and ~Rh) and R’, marking ① hiding failure, ② lost rules, ③ ghost rules]
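The three hiding-effect ratios can be computed from rule sets alone. The sketch below rests on assumptions the slide does not spell out: Rh(D’) is read as the sensitive rules still mined from D’, ~Rh(D’) as the non-sensitive rules of D still mined from D’, and a ratio with an empty denominator is reported as 0.0. The names `hiding_metrics` and `ratio` are hypothetical.

```python
def ratio(numerator, denominator):
    """Safe division; an empty denominator is reported as 0.0 (assumption)."""
    return numerator / denominator if denominator else 0.0

def hiding_metrics(R, R_prime, Rh):
    """Return (hiding failure, lost rules, ghost rules) ratios.

    R       -- rules mined from the original database D
    R_prime -- rules mined from the released database D'
    Rh      -- sensitive rules, a subset of R
    """
    R, R_prime, Rh = set(R), set(R_prime), set(Rh)
    non_sensitive = R - Rh                      # ~Rh(D)
    hiding_failure = ratio(len(Rh & R_prime), len(Rh))
    lost = ratio(len(non_sensitive) - len(non_sensitive & R_prime),
                 len(non_sensitive))
    ghost = ratio(len(R_prime) - len(R & R_prime), len(R_prime))
    return hiding_failure, lost, ghost

# Rule sets from the worked example in this deck, written as (lhs, rhs) pairs.
R = {("B", "A"), ("C", "A"), ("D", "A")}
R_prime = {("C", "A"), ("D", "A")}
Rh = {("B", "A")}
print(hiding_metrics(R, R_prime, Rh))
```

For the example, all three ratios are 0: the sensitive rule is hidden, no non-sensitive rule is lost, and no new rule appears.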
Reconstruction-based Association Rule Hiding: Summary
[Figure: the three-phase framework annotated with progress]
1. Frequent set mining: D → FS
2. Perform sanitization algorithm: FS, R, Rh → FS’ (ongoing)
3. FP-tree-based inverse frequent set mining: FS’ → FP-tree → D’ (basically completed)
Any suggestions or questions? yhguo@pku.edu.cn
Thanks for your attention.