1 / 26

Mining Frequent Patterns from Very High Dimensional Data: A Top-Down Row Enumeration Approach

Mining Frequent Patterns from Very High Dimensional Data: A Top-Down Row Enumeration Approach. Hongyan Liu Jiawei Han Dong Xin Zheng Shao2 From SDM06. Outline. Introduction Preliminaries Algorithm Experimental Conclusion. Introduction. What is very high data ?

shaw
Télécharger la présentation

Mining Frequent Patterns from Very High Dimensional Data: A Top-Down Row Enumeration Approach

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Mining Frequent Patterns from Very High Dimensional Data: A Top-Down Row Enumeration Approach Hongyan Liu Jiawei Han Dong Xin Zheng Shao2 From SDM06 Yi-Husi Lee

  2. Outline • Introduction • Preliminaries • Algorithm • Experimental • Conclusion Yi-Husi Lee

  3. Introduction • What is very high data ? Different from transactional data set, very high dimensional data like microarray data usually does not have somany rows (samples) but have a large number of columns (genes). Yi-Husi Lee

  4. Preliminaries • Table and Transposed Table Let minsup = 2, Table T Transposed Table TT Yi-Husi Lee

  5. Preliminaries(cont.) • Closure Given an itemset I ⊆ I,a rowset S ⊆ S and R ⊆ S × I is a relation, we define r(I) = { ri ∈ S | ∀ij ∈ I , (ri, ij) ∈ R } i(S) = { ij∈ I | ∀ri ∈ S , (ri, ij) ∈ R } then C(I) = i(r(I)) C(S) = r(i(S)) • Closed itemset and closed rowset An itemset I is called a closed itemsetiff I = C(I). Likewise, a rowset S is called a closed rowsetiff S=C(S). Yi-Husi Lee

  6. Preliminaries(cont.) • Example EX1:{b1, c2} r({b1, c2}) = {2, 4} i({2, 4}) = {b1, c2, d2} Then C({b1, c2}) = {b1, c2, d2} ≠ {b1, c2} Therefore, {b1, c2} is not a closed itemset. EX2:{1,2} i({1, 2}) = {a1, b1} r({a1, b1}) = {1, 2, 3} Then C({1, 2}) = {1, 2, 3} ≠ {1,2} So rowset {1, 2} is not a Closed rowset, but apparently {1, 2, 3} is. Yi-Husi Lee

  7. Preliminaries(cont.) • Bottom-up Search Strategy By bottom-up we mean that along every search the row enumeration space from small rowsets to large ones. It is monotonic in terms of bottom-up search order, it is hard to prune the row enumeration search space early. Memory cost for this kind of bottom-up search is also big. As the minsup increase , the time need to complete the mining process can not decrease rapidly. Yi-Husi Lee

  8. Preliminaries(cont.) • Top-down Search Strategy A top-down search method and a novel row enumeration tree are propose to take advantage of the pruning power of minimum support threshold. A new method, called closeness-checking, is developed to check efficiently and effectively whether a pattern is closed. Unlike other existing closeness-checking methods, it does not need to scan the mining data set, nor the result set, and is easy to integrate with the top-down search process. Yi-Husi Lee

  9. Preliminaries(cont.) • X-excluded transposed table Given a rowset x = {ri1, ri2, …, rik} with an order such that ri1 > ri2 … > rik, a minimum support threshold minsup and its parent table TT|p, an x-excluded transposed table TT|x is a table in which each tuple contains rids less than any of rids in x, and at the same time contains all of the rids greater than any of rids in x. Rowset x is called an excluded rowset. Yi-Husi Lee

  10. Preliminaries(cont.) • Procedure to get X-excluded transposed table Example: find table TT|5, TT|54 & TT|4, let the minsup=2 The original transposed table TT|Φ Yi-Husi Lee

  11. Preliminaries(cont.) • EX1: TT|5 TT|5 is easy to find, for each tuple in table TT|Φ containing rid 5, delete 5 and copy it to TT|5, then chick if the size of every tuple is great than minsup. If not delete it from TT|5 TT|5 Yi-Husi Lee

  12. Preliminaries(cont.) • EX2: TT|54 First: Each tuple in table TT|5 is a candidate tuple of TT|54 as there is no rid greater than 5 Second: After excluding rids not less than 4, the table is show below Last: Since tuple c2 only contains one rid, it does not satisfy minsup and delete it from TT|54 after prune TT|54 without prune TT|54 Yi-Husi Lee

  13. Preliminaries(cont.) • EX3:TT|4 TT|4 is not the same of TT|54 and TT|5, because it is not contains the greatest rid. Follow the definition (and at the same time containsall of the rids greater than any of rids in x), so we can find the tuple which has both 4 and 5. But tuple a2 does not satisfy minsup after excluding rid 4, so only tuple left in TT|4. Although the current size of tuple c2 in TT|4 is 1, but it actual size is 2 since it contains rid 5 which is not listed in TT|4. after prune Yi-Husi Lee

  14. Algorithm • Closeness-Checking Method To avoid generating all frequent itemsets during the mining process, it is important to perform the closeness checking as early as possible during mining. • Lemma1. Given an itemset I ⊆ I and a rowset S ⊆ S, the following two equations hold: r(I) = r(i(r(I))) i(S) = i(r(i(S))) Yi-Husi Lee

  15. Algorithm(cont.) • Lemma2. In transposed table TT, a rowset S ⊆ S is closed iff it can be represented by an intersection of a set of tuples, that is: ∃I ⊆ I, s.t. S = r(I) = ∩j r({ij}) where ij ∈ I, and I = {i1, i2, …, il} • Lemma3. Given a rowset S ⊆ S, in transposed table TT, for every tuple ij containing S, which means ij ∈ i(S), if S ≠ ∩jr({ij}), where ij ∈ i(S), then S is not closed. Yi-Husi Lee

  16. Algorithm(cont.) • Skip-rowset and merge of x-excluded transposed table In order to speed up this kind of checking, we add some additional information for each x-excluded transposed table. TT|54 TT|54 with skip rowset TT|54 after merge Yi-Husi Lee

  17. Algorithm(cont.) • TD-Closed n: the number of rowset FCP: frequent closed patterns excludedsize: is the number of rids excluded from current table cMinsup: A dynamically changing minimum support threshold Yi-Husi Lee

  18. Algorithm(cont.) • Pruning strategy 1: If excludedSize is equal to or greater than (n–minsup), then there is no need to do any further recursive call of TopDownMine. ExcludedSize is the number of rids excluded from current table. If it is not less than (n–minsup), the size of each rowset in current transposed table must be less than minsup, so these rowsets are impossible to become large. Yi-Husi Lee

  19. Algorithm(cont.) • Pruning strategy 2: If an x-excluded transposed table contains only one tuple, it is not necessary to do further recursive call to deal with its child transposed tables. Of course, before return according to pruning strategy 2, the current rowset S might be a closed rowset, so if the skip-rowset is empty, we need to output it first. • EX1.(pruning strategy 2) For table TT|4, there is only one tuple in this table. After we check this tuple to see if it is closed (it is not closed apparently as its skip-rowset is not empty), we do not need to produce any child excluded transposed table from it. Yi-Husi Lee

  20. Algorithm(cont.) • Pruning strategy 3: Each tuple t containing rid y in TT|x will be deleted from TT|x∪y if size of t (that is the number of rids t contains) equals cMinsup. • We can get these two tables, TT|x∪y and TT|x’, bysteps as follows. First, for each tuple t in TT|x containing rid y, delete y from it, and copy it to TT|x’.And then check if the size of t is less than cMinsup. Ifnot, keep it and in the meantime put y in the skiprowset,otherwise get rid of it from TT|x. Finally TT|x becomes TT|x∪y, and at the same time TT|x’ is obtained. Yi-Husi Lee

  21. Algorithm(cont.) • Ex2.(pruning strategy 2) Let cMinsup=2 TT|x’ TT|x TT|x∪y Yi-Husi Lee

  22. Algorithm(cont.) • Pruning strategy 4: The step is the major output step. If there is a tuple with empty skip-rowset and the largest size, say k, and containing rid k in TT|x∪y, then output it. Since this tuple has the largest size among all of the tuples in TT|x∪y, there is no other tuple that will contribute to it.Therefore, it can be output. • EX3.(pruning strategy 3) In table TT|54, itemset a1b1 corresponding to rowset {1,2,3} can be output as its size is 3 which is larger than the size of the other two tuples, and it also contains the largest rid 3. • Pruning strategy 5: It conducts two recursive calls to deal with TT|x∪y and TT|x’ respectively. Yi-Husi Lee

  23. Experimental • Synthetic Data Set Yi-Husi Lee

  24. Experimental(cont.) Yi-Husi Lee

  25. Experimental(cont.) • Real Data Sets Yi-Husi Lee

  26. Conclusion • A new algorithm, TD-Close, is designed and implemented for mining a complete set of frequent closed itemsets from high dimensional data. • Several pruning strategies are developed to speed up searching. Both analysis and experimental study show that these methods are effective and useful. • Future work includes integrating top-down row enumeration method and column row enumeration method for frequent pattern mining from both long and deep large datasets, and mining classification rules based on association rules using top-down searching strategy. Yi-Husi Lee

More Related