
Fast Algorithms for Mining Frequent Itemsets



  1. Fast Algorithms for Mining Frequent Itemsets (探勘頻繁項目集合之快速演算法研究) PhD Dissertation Draft Advisor: Prof. Chin-Chen Chang (張真誠) Student: Yu-Chiang Li (李育強) Dept. of Computer Science and Information Engineering, National Chung Cheng University Date: May 31, 2007

  2. Outline • Introduction • Background and Related Work • NFP-Tree Structure • Fast Share Measure (FSM) Algorithm • Three Efficient Algorithms • Direct Candidate Generation (DCG) Algorithm • Isolated Items Discarding Strategy (IIDS) • Maximum Item Conflict First (MICF) Sanitization Method • Conclusions

  3. Introduction • Data mining techniques have been developed to find a small set of precious nuggets from reams of data (Cabena et al., 1998; Kantardzic, 2002) • Mining association rules constitutes one of the most important data mining problems • Two sub-problems (Agrawal & Srikant, 1994) • Identifying all frequent itemsets • Using these frequent itemsets to generate association rules • The first sub-problem plays an essential role in mining association rules

  4. Introduction (cont'd) • Mining frequent itemsets • Mining share-frequent itemsets • Mining high utility itemsets • Hiding sensitive patterns

  5. Support-Confidence Framework (1/4) Apriori algorithm (Agrawal and Srikant, 1994): minSup = 40%
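To make the generate-and-test loop behind Apriori concrete, here is a minimal Python sketch; the toy database at the bottom is an illustrative assumption, not the table from the slide.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Minimal Apriori sketch: return {frequent itemset: support}."""
    n = len(transactions)
    transactions = [frozenset(t) for t in transactions]
    current = {frozenset([i]) for t in transactions for i in t}
    frequent = {}
    k = 1
    while current:
        # One database scan: count transactions containing each candidate
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: s / n for c, s in counts.items() if s / n >= min_sup}
        frequent.update(level)
        # Join step + Apriori prune: every k-subset of a (k+1)-candidate
        # must itself be frequent at level k
        keys = list(level)
        current = {a | b for a in keys for b in keys
                   if len(a | b) == k + 1
                   and all(frozenset(s) in level
                           for s in combinations(a | b, k))}
        k += 1
    return frequent

# Toy database (an assumption, not the slide's table), minSup = 40%
db = [{'A', 'B', 'C'}, {'B', 'D'}, {'B', 'C', 'D'}, {'A', 'B'}, {'B', 'C'}]
print(apriori(db, 0.4))
```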

  6. Support-Confidence Framework (2/4) • FP-growth algorithm (Han et al., 2000; Han et al., 2004)

  7. Support-Confidence Framework (3/4)

  8. Support-Confidence Framework (4/4) Conditional FP-tree of “D” Conditional FP-tree of “BD”
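Slides 6–8 illustrate FP-growth's central data structure. The following Python sketch covers only the FP-tree construction step (class and function names are mine, not from the dissertation); FP-growth then mines the tree by following the header links bottom-up and recursing on conditional FP-trees such as those of "D" and "BD" above.

```python
from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_sup_count):
    """FP-tree construction sketch: keep only globally frequent items,
    sort each transaction by descending frequency, and insert it as a
    path so common prefixes are shared and counted once per occurrence."""
    freq = defaultdict(int)
    for t in transactions:                 # first scan: item frequencies
        for i in t:
            freq[i] += 1
    root = FPNode(None, None)
    header = defaultdict(list)             # item -> its nodes in the tree
    for t in transactions:                 # second scan: build the tree
        path = sorted((i for i in t if freq[i] >= min_sup_count),
                      key=lambda i: (-freq[i], i))
        node = root
        for i in path:
            if i not in node.children:
                node.children[i] = FPNode(i, node)
                header[i].append(node.children[i])
            node = node.children[i]
            node.count += 1
    return root, header
```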

  9. Share-Confidence Framework (1/4) • Measure value: mv(ip, Tq) • mv({D}, T01) = 1 • mv({C}, T03) = 3 • Transaction measure value: tmv(Tq) = Σ_{ip∈Tq} mv(ip, Tq) • tmv(T02) = 10 • Total measure value: Tmv(DB) = Σ_{Tq∈DB} tmv(Tq) • Tmv(DB) = 47 • Itemset measure value: imv(X, Tq) = Σ_{ip∈X} mv(ip, Tq) • imv({A, E}, T02) = 5 • Local measure value: lmv(X) = Σ_{Tq∈dbX} imv(X, Tq), where dbX is the set of transactions containing X • lmv({BC}) = 2+5+5 = 12

  10. Share-Confidence Framework (2/4) • Itemset share: SH(X) = lmv(X) / Tmv(DB) • SH({BC}) = 12/47 = 25.5% • SH-frequent: if SH(X) >= minShare, X is a share-frequent (SH-frequent) itemset • Example threshold: minShare = 30%
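As a sanity check on these definitions, here is a small Python sketch that computes tmv, lmv, and SH over a database stored as {item: measure value} dicts; the toy transactions are an assumption, not the slides' example table.

```python
def tmv(t):
    """Transaction measure value: sum of the measure values in t."""
    return sum(t.values())

def lmv(X, db):
    """Local measure value: total measure value of X's items over the
    transactions that contain all of X."""
    X = set(X)
    return sum(sum(t[i] for i in X) for t in db if X <= t.keys())

def SH(X, db):
    """Itemset share: SH(X) = lmv(X) / Tmv(DB)."""
    return lmv(X, db) / sum(tmv(t) for t in db)

# Toy database (assumed, not the slides' table): item -> measure value
db = [{'B': 2, 'C': 4}, {'A': 3, 'B': 1, 'C': 1}, {'C': 3, 'D': 1}]
print(SH({'B', 'C'}, db))          # 8/15 ≈ 0.533
print(SH({'B', 'C'}, db) >= 0.30)  # SH-frequent at minShare = 30%?
```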

  11. Share-Confidence Framework (3/4) • ZP (Zero Pruning) and ZSP (Zero Subset Pruning) (Barber & Hamilton, 2003) • Variants of exhaustive search • Prune only the candidate itemsets whose local measure values are exactly zero • SIP (Share Infrequent Pruning) (Barber & Hamilton, 2003) • Apriori-like, but with errors (may miss some SH-frequent itemsets) • These three algorithms are either inefficient or do not discover the complete set of share-frequent (SH-frequent) itemsets

  12. Share-Confidence Framework (4/4) ZSP Algorithm SIP Algorithm

  13. Utility Mining (1/2) • Internal utility: iu(ip, Tq) • iu({D}, T01) = 1 • iu({C}, T03) = 3 • External utility: eu(ip) • eu({D}) = 3 • eu({C}) = 1 • Utility value in a transaction: util(X, Tq) = Σ_{ip∈X} iu(ip, Tq)×eu(ip) • util({C, E, F}, T02) = util(C, T02) + util(E, T02) + util(F, T02) = 3×1 + 1×5 + 2×2 = 12 • Local utility: Lutil(X) = Σ_{Tq∈dbX} util(X, Tq) • Lutil({C, D}) = util({C, D}, T01) + util({C, D}, T04) + util({C, D}, T06) = 4 + 7 + 5 = 16

  14. Utility Mining (2/2) • Total utility: Tutil(DB) = Σ_{Tq∈DB} Σ_{ip∈Tq} iu(ip, Tq)×eu(ip) • Tutil(DB) = 122 • The utility value of X in DB: UTIL(X) = Lutil(X) / Tutil(DB) • UTIL({C, D}) = 16/122 = 13.1% • High utility itemset: if UTIL(X) >= minUtil, X is a high utility itemset
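The utility measures mirror the share measures, with each item's quantity weighted by its external utility. A minimal sketch under the same dict-per-transaction representation as before, with a hypothetical price table eu:

```python
def util(X, t, eu):
    """util(X, Tq) = sum of iu(ip, Tq) * eu(ip) over the items of X."""
    return sum(t[i] * eu[i] for i in X)

def UTIL(X, db, eu):
    """UTIL(X) = Lutil(X) / Tutil(DB); X is a high utility itemset if
    this ratio reaches minUtil."""
    X = set(X)
    lutil = sum(util(X, t, eu) for t in db if X <= t.keys())
    tutil = sum(util(t.keys(), t, eu) for t in db)
    return lutil / tutil

# Hypothetical quantities (internal utility) and prices (external utility)
db = [{'C': 1, 'D': 1}, {'C': 3, 'E': 1}, {'C': 2, 'D': 1}]
eu = {'C': 1, 'D': 3, 'E': 5}
print(UTIL({'C', 'D'}, db, eu))    # 9/17 ≈ 0.529
```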

  15. Privacy-Preserving in Mining Frequent Itemsets • NP-hard problem (Atallah et al., 1999) • DB: original database, DB': released (sanitized) database • RI: the set of restrictive itemsets • ~RI: the set of non-restrictive itemsets • Misses cost = (|~RI(DB)| − |~RI(DB')|) / |~RI(DB)| • Sanitization algorithms (Oliveira and Zaïane, 2002; Oliveira and Zaïane, 2003; Saygin et al., 2001)
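The misses cost transcribes directly into code. In this hypothetical helper, freq_db and freq_db2 are the frequent itemsets mined from DB and DB', and RI is the set of restrictive itemsets, all as sets of frozensets:

```python
def misses_cost(freq_db, freq_db2, RI):
    """Misses cost = (|~RI(DB)| - |~RI(DB')|) / |~RI(DB)|: the fraction
    of non-restrictive frequent itemsets that sanitization hid as a
    side effect."""
    nonres = freq_db - RI     # non-restrictive frequent itemsets in DB
    nonres2 = freq_db2 - RI   # ...still frequent in the released DB'
    return (len(nonres) - len(nonres2)) / len(nonres)
```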

  16. NFP-Tree (1/4) • NFP-growth Algorithm • NFP-tree construction

  17. NFP-Tree (2/4)

  18. NFP-Tree (3/4)

  19. NFP-Tree (4/4) Conditional NFP-tree of “D(3,4)”

  20. Experimental Results (1/3) • PC: Pentium IV 1.5 GHz, 1 GB SDRAM, running Windows 2000 Professional • All algorithms were coded in VC++ 6.0 • Datasets: • Real: BMS-WebView-1, BMS-WebView-2, Connect-4 • Artificial: generated by the IBM synthetic data generator

  21. Experimental Results (2/3)

  22. Experimental Results (3/3)

  23. Fast Share Measure (FSM) Algorithm • FSM: Fast Share Measure algorithm • ML: maximum transaction length in DB • MV: maximum measure value in DB • min_lmv = minShare×Tmv(DB) • Level Closure Property: given a minShare and a k-itemset X • Theorem 1. If lmv(X) + (lmv(X)/k)×MV < min_lmv, all supersets of X with length k+1 are infrequent • Theorem 2. If lmv(X) + (lmv(X)/k)×MV×k' < min_lmv, all supersets of X with length k+k' are infrequent • Corollary 1. If lmv(X) + (lmv(X)/k)×MV×(ML−k) < min_lmv, all supersets of X are infrequent

  24. FSM Pruning Example (minShare = 30%) • Let CF(X) = lmv(X) + (lmv(X)/k)×MV×(ML−k) • Prune X if CF(X) < min_lmv • CF({ABC}) = 3 + (3/3)×3×(6−3) = 12 < 14.1 = min_lmv
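The test from Corollary 1 is cheap to implement, since CF(X) needs only lmv(X) and the database constants MV and ML. A minimal sketch reproducing the slide's {ABC} example (function names are mine):

```python
def cf(lmv_x, k, MV, ML):
    """Critical function from Corollary 1: an upper bound on the local
    measure value any superset of a k-itemset X could accumulate."""
    return lmv_x + (lmv_x / k) * MV * (ML - k)

def fsm_prunable(lmv_x, k, MV, ML, min_share, tmv_db):
    """Prune X (and all of its supersets) when CF(X) < min_lmv."""
    return cf(lmv_x, k, MV, ML) < min_share * tmv_db

# Slide example: lmv({ABC}) = 3, k = 3, MV = 3, ML = 6, Tmv(DB) = 47
print(cf(3, 3, 3, 6))                      # 12.0
print(fsm_prunable(3, 3, 3, 6, 0.30, 47))  # True: 12 < 14.1
```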

  25. Experimental Results (1/2) • T4.I2.D100k.N50.S10 • minShare = 0.8% • ML = 14

  26. Experimental Results (2/2)

  27. Three Efficient Algorithms • EFSM (Enhanced FSM): instead of joining two arbitrary itemsets in RC_{k−1}, EFSM joins each itemset of RC_{k−1} with a single item in RC_1 to generate C_k efficiently (see the sketch below) • Reduces the time complexity of candidate generation from O(n^{2k−2}) to O(n^k)
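A sketch of the EFSM join step: each surviving (k−1)-itemset is extended by each surviving single item, so at most |RC_{k−1}|×|RC_1| candidates are formed (the function name is mine):

```python
def efsm_join(rc_prev, rc_1):
    """EFSM candidate generation sketch: form C_k by extending each
    (k-1)-itemset that survived pruning (RC_{k-1}) with each surviving
    single item (RC_1); the set comprehension removes duplicates."""
    return {x | i for x in rc_prev for i in rc_1 if not i <= x}

rc_1 = {frozenset('A'), frozenset('B'), frozenset('C')}
rc_2 = {frozenset('AB'), frozenset('BC')}
print(efsm_join(rc_2, rc_1))   # {frozenset({'A', 'B', 'C'})}
```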

  28. • X_{k+1}: an arbitrary superset of X with length k+1 in DB • S(X_{k+1}): the set of all X_{k+1} in DB • dbS(X_{k+1}): the set of transactions each of which contains at least one X_{k+1} • SuFSM and ShFSM are derived from EFSM and prune candidates more efficiently than FSM • SuFSM (Support-counted FSM): • Theorem 3. If lmv(X) + Sup(S(X_{k+1}))×MV×(ML−k) < min_lmv, all supersets of X are infrequent

  29. SuFSM (Support-counted FSM) • lmv(X)/k ≥ Sup(X) ≥ Sup(S(X_{k+1})) • Ex. lmv({BCD})/k = 15/3 = 5, Sup({BCD}) = 3, Sup(S({BCD}_{k+1})) = 2 • If no superset of X is an SH-frequent itemset, then the following inequalities hold: • lmv(X) + (lmv(X)/k)×MV×(ML−k) < min_lmv • lmv(X) + Sup(X)×MV×(ML−k) < min_lmv • lmv(X) + Sup(S(X_{k+1}))×MV×(ML−k) < min_lmv

  30. ShFSM (Share-counted FSM) • Theorem 4. If Tmv(dbS(X_{k+1})) < min_lmv, all supersets of X are infrequent • FSM: lmv(X) + (lmv(X)/k)×MV×(ML−k) < min_lmv • SuFSM: lmv(X) + Sup(S(X_{k+1}))×MV×(ML−k) < min_lmv • ShFSM: Tmv(dbS(X_{k+1})) < min_lmv

  31. ShFSM (Share-counted FSM) • Ex. X = {AB} • Tmv(dbS(X_{k+1})) = tmv(T01) + tmv(T05) = 6+6 = 12 < 14 = min_lmv
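ShFSM's test is the simplest of the three bounds: sum the transaction measure values of every transaction that contains X together with at least one further item, and prune if that sum cannot reach min_lmv. A sketch using the same dict-per-transaction representation as earlier (the function name is mine):

```python
def shfsm_prunable(X, db, min_lmv):
    """ShFSM pruning sketch (Theorem 4): prune X and all of its
    supersets when Tmv(dbS(X_{k+1})) < min_lmv. A transaction belongs
    to dbS(X_{k+1}) iff it contains X plus at least one extra item."""
    X = set(X)
    total = sum(sum(t.values()) for t in db
                if X <= t.keys() and len(t) > len(X))
    return total < min_lmv

# On slide 31, for X = {AB} only T01 and T05 contain a (k+1)-superset
# of X, so the sum is tmv(T01) + tmv(T05) = 12 < 14 = min_lmv.
```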

  32. Experimental Results (1/3) minShare=0.3%

  33. Experimental Results (2/3) minShare=0.3%

  34. Experimental Results (3/3) • T6.I4.D100k.N200.S10 • minShare = 0.1% • ML=20

  35. Direct Candidate Generation (DCG) Algorithm

  36. Experimental Results (1/3)

  37. Experimental Results (2/3)

  38. Experimental Results (3/3)

  39. Isolated Items Discarding Strategy (IIDS) for Utility Mining

  40. IIDS (1/2) ShFSM, minUtil = 30%

  41. IIDS (2/2) FUM, minUtil = 30%

  42. Experimental Results (1/5)

  43. Experimental Results (2/5)

  44. Experimental Results (3/5)

  45. Experimental Results (5/5) minUtil = 0.12%

  46. Maximum Item Conflict First (MICF) Sanitization Method • Tdegree(Tq): the conflict degree of a sensitive transaction Tq, i.e., the number of restrictive itemsets contained in Tq • If Tdegree(Tq) > 1, Tq is a conflicting transaction

  47. • Idegree({D}, {D, F}, T05) = 1 • Idegree({F}, {D, F}, T05) = 0 • MaxIdegree: stores the maximum conflict degree among the items in a transaction • MICF: in each iteration, selects an item with MaxIdegree to delete

  48. • Idegree({D}, {D, F}, T06) = 1 • Idegree({F}, {D, F}, T06) = 0
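The slides give examples of Idegree without its formal definition. Under the assumption that an item's conflict degree is the number of restrictive itemsets it appears in within the transaction, a MICF-style victim-selection step might look like this (all names are mine):

```python
def micf_pick(transaction, restrictive):
    """MICF-style selection sketch: among the items of a conflicting
    transaction, return one that occurs in the most restrictive itemsets
    contained in the transaction, so deleting it resolves the most
    conflicts at once (assumed reading of Idegree/MaxIdegree)."""
    t = set(transaction)
    conflicts = [set(r) for r in restrictive if set(r) <= t]
    degree = {i: sum(1 for r in conflicts if i in r) for i in t}
    return max(degree, key=degree.get)   # an item with MaxIdegree

# Hypothetical example: {B, C} and {C, D} are both restrictive and both
# occur in Tq, so 'C' (Idegree 2) is chosen for deletion.
print(micf_pick({'B', 'C', 'D'}, [{'B', 'C'}, {'C', 'D'}]))
```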

  49. Experimental Results (1/5)
