
Pincer-Search: A New Algorithm for Discovering the Maximum Frequent Set Dao-I Lin and Zvi M. Kedem

Dao-I Lin and Zvi M. Kedem. Department of Computer Science, Courant Institute of Mathematical Sciences, New York University.


Presentation Transcript


  1. Pincer-Search: A New Algorithm for Discovering the Maximum Frequent Set • Dao-I Lin and Zvi M. Kedem • Department of Computer Science, Courant Institute of Mathematical Sciences, New York University • http://www/?

  2. Overview • The importance of the maximum frequent set • Structural properties • Traditional one-way search algorithms • Pincer-Search algorithm • Experiments on synthetic and census databases • Conclusions

  3. Setting • Basic terms: • {1, 2, …, n}: The set of all items • Transaction: A set of items • Database: A set of transactions • User-defined threshold (suppmin): A number in [0,1] • Frequent itemset: A combination of items (an itemset) occurring in at least a suppmin fraction of the transactions in the database • Maximum frequent set • Maximal frequent itemset: A frequent itemset that has no frequent proper superset • An itemset is frequent if and only if it is a subset of a maximal frequent itemset • Maximum frequent set: The set of all maximal frequent itemsets • Discovering the maximum frequent set is a key problem in many data mining applications • Association rules, strong rules, episodes, and minimal keys

  4. An Example • Database (transaction: itemset): 1: {1,2,3,5}; 2: {1,5}; 3: {1,2}; 4: {1,2,3} • Set suppmin to 0.5 • Frequent itemsets are {1}, {2}, {3}, {5}, {1,2}, {1,3}, {1,5}, {2,3}, and {1,2,3}, since they occur in at least 2 out of 4 transactions • Maximum frequent set is {{1,2,3},{1,5}} • [Figure: lattice of the itemsets in this example]
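The example on this slide is small enough to check by brute force. The following is a minimal sketch, not the method the slides propose: it enumerates every itemset, keeps the frequent ones, and then filters out those with a frequent proper superset. The names `support`, `frequent_itemsets`, and `maximal` are illustrative choices, not taken from the presentation.

```python
from itertools import combinations

# Database from the slide: one set of items per transaction.
DATABASE = [{1, 2, 3, 5}, {1, 5}, {1, 2}, {1, 2, 3}]
SUPP_MIN = 0.5


def support(itemset, database):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(itemset <= t for t in database) / len(database)


def frequent_itemsets(database, supp_min):
    """Brute force: test every non-empty itemset over the items that occur."""
    items = sorted(set().union(*database))
    return [
        frozenset(c)
        for k in range(1, len(items) + 1)
        for c in combinations(items, k)
        if support(frozenset(c), database) >= supp_min
    ]


def maximal(itemsets):
    """Keep only the itemsets with no proper superset in the collection."""
    return [a for a in itemsets if not any(a < b for b in itemsets)]


frequent = frequent_itemsets(DATABASE, SUPP_MIN)
mfs = maximal(frequent)
print(len(frequent))                    # 9
print(sorted(sorted(s) for s in mfs))   # [[1, 2, 3], [1, 5]]
```

Brute force is exponential in the number of items, which is why the search strategies on the later slides matter.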

  5. An Example • Database (transaction: itemset): 1: {1,2,3,4,5}; 2: {1,3}; 3: {1,2}; 4: {1,2,3,4} • Set suppmin to 0.5 • Frequent itemsets are {1}, {2}, {3}, {4}, {1,2}, {1,3}, {1,4}, {2,3}, {2,4}, {3,4}, {1,2,3}, {1,2,4}, {1,3,4}, {2,3,4}, and {1,2,3,4}, since they occur in at least 2 out of 4 transactions • Maximum frequent set is {{1,2,3,4}} • [Figure: itemset lattice over items 1 to 5. Blue: frequent itemsets; red: maximal frequent itemsets; black: infrequent itemsets]


  7. Two Observations • Let A and B be two itemsets with A ⊆ B • Observation-1: A infrequent ⇒ B infrequent (if a transaction does not contain A, it cannot contain B) • Observation-2: B frequent ⇒ A frequent (if a transaction contains B, it must contain A) • [Figure: itemset lattice over items 1 to 5 with two itemsets labeled A and B]
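The two observations are contrapositives of each other, so a single exhaustive check covers both. The sketch below assumes the four-transaction database and the 0.5 threshold from the earlier example and looks for any pair A ⊆ B where B is frequent but A is not. The helper name `is_frequent` is illustrative.

```python
from itertools import combinations

# Assumed input: the four-transaction example database, suppmin = 0.5.
DATABASE = [{1, 2, 3, 4, 5}, {1, 3}, {1, 2}, {1, 2, 3, 4}]
SUPP_MIN = 0.5


def is_frequent(itemset):
    """True if `itemset` occurs in at least a SUPP_MIN fraction of transactions."""
    return sum(itemset <= t for t in DATABASE) / len(DATABASE) >= SUPP_MIN


items = sorted(set().union(*DATABASE))
all_itemsets = [
    frozenset(c)
    for k in range(1, len(items) + 1)
    for c in combinations(items, k)
]

# A violation would be a frequent B with an infrequent subset A.
violations = [
    (a, b)
    for a in all_itemsets
    for b in all_itemsets
    if a <= b and is_frequent(b) and not is_frequent(a)
]
print(len(violations))  # 0
```

No violation can exist for any database, because every transaction that contains B also contains each subset A of B.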

  8. Computing the Maximum Frequent Set • Observation-1 leads to bottom-up search algorithms, such as AIS (AIS93), Apriori (AS94), OCD (MTV94), SETM (HS95), DHP (PCY95), Partition (SON95), ML-T2+ (HF95), Sampling (T96), DIC (BMUT97), Clique (ZPOL97) • Observation-2 leads to top-down search algorithms, such as TopDown (ZPOL97) and guess-and-correct (MT97) • [Figure: itemset lattice over items 1 to 5. Blue: frequent itemsets; red: maximal frequent itemsets; black: infrequent itemsets]

  9. Complexity of One-Way Search • For bottom-up search, every frequent itemset is explicitly examined (in the example, the search continues until {1,2,3,4} is examined) • For top-down search, every infrequent itemset is explicitly examined (in the example, the search continues until {5} is examined) • [Figure: itemset lattice over items 1 to 5. Blue: frequent itemsets; red: maximal frequent itemsets; black: infrequent itemsets]
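The cost of each one-way search can be counted on the example database. The sketch below is a simplified illustration rather than any of the cited algorithms: `bottom_up` is a plain level-wise search that prunes with Observation-1, and `top_down` starts from the full itemset and prunes with Observation-2. Both function names and the candidate-generation details are assumptions made for this example.

```python
from itertools import combinations

DATABASE = [{1, 2, 3, 4, 5}, {1, 3}, {1, 2}, {1, 2, 3, 4}]
SUPP_MIN = 0.5
ITEMS = sorted(set().union(*DATABASE))


def is_frequent(itemset):
    return sum(itemset <= t for t in DATABASE) / len(DATABASE) >= SUPP_MIN


def bottom_up():
    """Level-wise search from single items upward; returns (examined, frequent)."""
    examined, frequent, k = 0, set(), 1
    level = [frozenset([i]) for i in ITEMS]
    while level:
        examined += len(level)
        current = {c for c in level if is_frequent(c)}
        frequent |= current
        k += 1
        pool = sorted(set().union(*current))
        # Observation-1: keep a candidate only if all its (k-1)-subsets are frequent.
        level = [frozenset(c) for c in combinations(pool, k)
                 if all(frozenset(s) in current for s in combinations(c, k - 1))]
    return examined, frequent


def top_down():
    """Search from the full itemset downward; returns (examined, maximal)."""
    examined, maximal = 0, set()
    level = {frozenset(ITEMS)}
    while level:
        examined += len(level)
        infrequent = set()
        for c in level:
            (maximal if is_frequent(c) else infrequent).add(c)
        # Observation-2: skip subsets of itemsets already known to be frequent.
        level = {c - {e} for c in infrequent if len(c) > 1 for e in c
                 if not any(c - {e} <= m for m in maximal)}
    return examined, maximal


bottom_up_examined, frequent = bottom_up()
top_down_examined, maximal = top_down()
print(bottom_up_examined, top_down_examined)  # 16 17
```

On this database the bottom-up search examines all 15 frequent itemsets plus {5}, and the top-down search examines all 16 infrequent itemsets plus {1,2,3,4}.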

  10. Pincer Search: Combining Top-down and Bottom-up Searches • Use Observation-1 to eliminate candidates in the top-down search • Use Observation-2 to eliminate candidates in the bottom-up search • This example shows how combining both searches could dramatically reduce • the number of candidates examined • the number of passes over the database • [Figure: itemset lattice over items 1 to 5. Blue: frequent itemsets; red: maximal frequent itemsets; black: infrequent itemsets; green: itemsets not examined]

  11. MFCS: A New Data Structure • For bottom-up search: Candidate set (as usual) • For top-down search: Use a new dynamically maintained data structure: maximum frequent candidate set (MFCS) • MFCS is a set of itemsets: • Every known frequent itemset is a subset of some element of MFCS • No currently known infrequent itemset is a subset of any element of MFCS • It is of minimum cardinality • MFCS supports efficient coordination between bottom-up and top-down searches
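One plausible way to maintain MFCS, sketched under the assumption that each newly found infrequent itemset is used to split the MFCS elements that contain it: every such element is replaced by the itemsets obtained by dropping one item of the infrequent itemset, and a replacement is kept only if no remaining element already covers it. The function name `update_mfcs` is hypothetical. The input mirrors the "by {2,5}", "by {3,5}", "by {4,5}" labels that appear on the search-path slide.

```python
def update_mfcs(mfcs, infrequent):
    """Return a new MFCS in which no element contains a known infrequent itemset."""
    mfcs = set(mfcs)
    for s in infrequent:
        # Snapshot the elements that contain s, since the set is modified below.
        for m in [m for m in mfcs if s <= m]:
            mfcs.remove(m)
            for e in s:
                candidate = m - {e}  # no longer contains s
                # Keep the set small: skip candidates already covered by an element.
                if not any(candidate <= other for other in mfcs):
                    mfcs.add(candidate)
    return mfcs


start = {frozenset({1, 2, 3, 4, 5})}
found_infrequent = [frozenset({2, 5}), frozenset({3, 5}), frozenset({4, 5})]
result = update_mfcs(start, found_infrequent)
print(sorted(sorted(m) for m in result))  # [[1, 2, 3, 4], [1, 5]]
```

Splitting by {2,5} yields {1,2,3,4} and {1,3,4,5}; splitting by {3,5} replaces {1,3,4,5} with {1,4,5}, because {1,3,4} is already covered by {1,2,3,4}; splitting by {4,5} then leaves {1,5}.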

  12. Pincer-Search: Search Path • [Figure: itemset lattice over items 1 to 5 tracing the search path. Labels mark top-down candidates being split "by {2,5}", "by {3,5}", and "by {4,5}", with {1,2,3,4}, {1,3,4,5}, {1,3,4}, and {1,4,5} shown as intermediate candidates]

  13. Pincer-Search Algorithm
      L0 := ∅; k := 1; C1 := {{i} | i ∈ {1,2,...,n}}
      MFCS := {{1,2,...,n}}; MFS := ∅
      while Ck ≠ ∅
          read database and count supports for Ck and MFCS
          MFS := MFS ∪ {frequent itemsets in MFCS}
          determine frequent set Lk and infrequent set Sk
          use Sk to update MFCS
          generate new candidate set Ck+1 (join, recover, and prune)
          k := k + 1
      return MFS
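The loop above can be sketched in Python under one loudly stated simplification: this version does not implement the recovery step. Instead of removing candidates that are subsets of a known maximal frequent itemset and recovering them later, it keeps them in Lk and only skips their support count. It therefore saves database counts but not candidate generation, and it does not terminate early. The function name `pincer_search_sketch` and the counter `counted` are illustrative.

```python
from itertools import combinations


def pincer_search_sketch(database, supp_min):
    """Simplified two-way search; returns (maximum frequent set, itemsets counted)."""
    counted = 0

    def is_frequent(itemset):
        nonlocal counted
        counted += 1  # one support count against the database
        return sum(itemset <= t for t in database) >= supp_min * len(database)

    items = sorted(set().union(*database))
    mfcs = {frozenset(items)}                      # top-down candidates
    mfs, rejected = set(), set()                   # MFCS elements already counted
    candidates = {frozenset([i]) for i in items}   # C1
    k = 1
    while candidates:
        # Top-down: count only MFCS elements whose status is still unknown.
        for m in mfcs - mfs - rejected:
            (mfs if is_frequent(m) else rejected).add(m)
        # Bottom-up: a candidate covered by MFS is frequent (Observation-2),
        # so its count is skipped; it stays in L_k (simplification, see lead-in).
        frequent_k = {c for c in candidates
                      if any(c <= m for m in mfs) or is_frequent(c)}
        # Use the infrequent candidates S_k to split MFCS elements (Observation-1).
        for s in candidates - frequent_k:
            for m in [m for m in mfcs if s <= m]:
                mfcs.remove(m)
                for e in s:
                    if not any(m - {e} <= other for other in mfcs):
                        mfcs.add(m - {e})
        # Join and prune: (k+1)-itemsets whose k-subsets are all frequent.
        k += 1
        pool = sorted(set().union(*frequent_k))
        candidates = {frozenset(c) for c in combinations(pool, k)
                      if all(frozenset(sub) in frequent_k
                             for sub in combinations(c, k - 1))}
    # Confirm any MFCS element created during the last pass.
    for m in mfcs - mfs - rejected:
        if is_frequent(m):
            mfs.add(m)
    return mfs, counted


database = [{1, 2, 3, 4, 5}, {1, 3}, {1, 2}, {1, 2, 3, 4}]
mfs, counted = pincer_search_sketch(database, 0.5)
print(sorted(sorted(m) for m in mfs), counted)  # [[1, 2, 3, 4]] 7
```

On the four-transaction example this sketch counts 7 itemsets: {1,2,3,4,5} and the five single items in the first pass, then {1,2,3,4} in the second. A plain level-wise search counts 16 on the same data.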

  14. Performance: Observations and Experiments • Non-monotone property of the maximum frequent set • Both the number of candidates and the number of frequent itemsets increase as suppmin decreases • This is NOT true for the number of maximal frequent itemsets • If MFS is {{1,2},{2,3},{3,4}} when suppmin is 9% • If suppmin decreases to 6%, then MFS could become {{1,2,3,4}} • This property will NOT help bottom-up search algorithms • However, this property may help the Pincer-Search algorithm • Concentrated and scattered distributions • Concentrated: on each level, the frequent itemsets have many common items; the frequent items tend to cluster (narrow and tall) • Scattered: the frequent itemsets do not have many common items (wide and flat)
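The non-monotone property can be observed on the four-transaction database from the earlier example. The sketch below uses brute-force enumeration, with the illustrative helper name `frequent_and_maximal`, and compares thresholds 0.75 and 0.5, which are chosen here for the demonstration and do not come from the slides.

```python
from itertools import combinations

DATABASE = [{1, 2, 3, 4, 5}, {1, 3}, {1, 2}, {1, 2, 3, 4}]


def frequent_and_maximal(database, supp_min):
    """Brute-force lists of the frequent and the maximal frequent itemsets."""
    items = sorted(set().union(*database))
    frequent = [
        frozenset(c)
        for k in range(1, len(items) + 1)
        for c in combinations(items, k)
        if sum(set(c) <= t for t in database) / len(database) >= supp_min
    ]
    maximal = [a for a in frequent if not any(a < b for b in frequent)]
    return frequent, maximal


for supp_min in (0.75, 0.5):
    frequent, maximal = frequent_and_maximal(DATABASE, supp_min)
    print(supp_min, len(frequent), len(maximal))
# 0.75 5 2
# 0.5 15 1
```

Lowering the threshold from 0.75 to 0.5 triples the number of frequent itemsets, from 5 to 15, while the number of maximal frequent itemsets drops from 2 ({1,2} and {1,3}) to 1 ({1,2,3,4}).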

  15. Scattered Distributions

  16. Scattered Distributions

  17. Concentrated Distributions

  18. Concentrated Distributions

  19. Census Data

  20. Conclusions • Pincer-Search is good for concentrated distributions • In general, can use Adaptive Pincer-Search • More experiments on real-life databases needed
