This paper proposes a method for finding recent frequent itemsets in online data streams using a prefix-tree lattice structure called monitoring lattice. The method decays old occurrence counts of itemsets over time and minimizes the number of significant itemsets through delayed-insertion and pruning operations.
 
                
Finding Recent Frequent Itemsets Adaptively over Online Data Streams
J. H. Chang and W. S. Lee, in Proc. of the 9th ACM International Conference on Knowledge Discovery and Data Mining, 2003.
Adviser: Jia-Ling Koh
Speaker: Shu-Ning Shin
Date: 2004.8.12
Introduction
• This paper proposes a method for finding recent frequent itemsets:
  • Significant itemsets are maintained in a prefix-tree lattice structure called a monitoring lattice.
  • The old occurrence count of each itemset is decayed as time goes by.
  • The number of significant itemsets is minimized by two operations:
    • delayed insertion
    • pruning
Preliminaries (1)
• A data stream is defined as follows:
  • $I = \{i_1, i_2, \ldots, i_n\}$: the set of current items.
  • $e$: an itemset, i.e. a set of items.
  • Tid: transaction id; transaction $T_k$ is generated at the $k$th turn.
  • $D_k = \langle T_1, T_2, \ldots, T_k \rangle$: the current data stream when a new transaction $T_k$ is generated.
  • $|D|_k$: the number of transactions in $D_k$.
  • $C_k(e)$: the number of transactions in $D_k$ that contain the itemset $e$.
  • $S_k(e)$: the support of itemset $e$ in $D_k$, i.e. the ratio of $C_k(e)$ to $|D|_k$.
Preliminaries (2)
• Decay rate $d$: the rate at which a weight is reduced per decay-unit.
  • $d = b^{-(1/h)}$, where $b > 1$, $h \ge 1$, and $b^{-1} \le d < 1$.
• decay-unit: the chunk of information that is decayed together.
• decay-base $b$: the amount of weight reduction per decay-unit; it is greater than 1.
• decay-base-life $h$: the number of decay-units after which the current weight becomes $b^{-1}$.
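As a quick numeric check of the decay-rate definition, the following minimal Python sketch computes d from b and h and confirms that a weight falls to 1/b after h decay-units. The function name `decay_rate` and the sample values b = 2, h = 10000 are illustrative choices, not taken from the paper.

```python
def decay_rate(b: float, h: float) -> float:
    """Decay rate d = b ** (-1 / h) for decay-base b > 1 and decay-base-life h >= 1."""
    if b <= 1 or h < 1:
        raise ValueError("requires b > 1 and h >= 1")
    return b ** (-1.0 / h)


# Illustrative values: the weight should halve after 10000 decay-units.
d = decay_rate(b=2, h=10000)
weight_after_h = d ** 10000  # approximately 1 / b
```

With h = 1 the rate reaches its lower bound 1/b; larger values of h push d toward 1, so old transactions are forgotten more slowly.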
Preliminaries (3)
• The total number of transactions $|D|_k$ in the current data stream $D_k$:
  • The value of $|D|_k$ converges to $1/(1-d)$ as $k$ increases infinitely.
• The count $C_k(e)$ of an itemset $e$ in the current data stream $D_k$:
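The formulas for the decayed totals are missing here. The sketch below assumes the simplest recurrence that is consistent with the stated limit 1/(1-d): each new transaction multiplies the running value by d and adds 1 (for the count, only when the transaction contains e). Both function names are hypothetical, and the recurrence is an assumption rather than a quotation from the paper.

```python
def decayed_total(k: int, d: float) -> float:
    """Assumed recurrence: |D|_1 = 1 and |D|_k = |D|_{k-1} * d + 1."""
    total = 0.0
    for _ in range(k):
        total = total * d + 1.0
    return total


def decayed_count(stream, itemset, d: float) -> float:
    """Assumed recurrence: C_k(e) = C_{k-1}(e) * d + (1 if e is in T_k else 0)."""
    e = frozenset(itemset)
    count = 0.0
    for transaction in stream:
        count = count * d + (1.0 if e <= set(transaction) else 0.0)
    return count


# Illustrative stream of three transactions.
stream = [{"a", "b"}, {"b"}, {"a", "b", "c"}]
count_ab = decayed_count(stream, {"a", "b"}, d=0.5)
limit = decayed_total(500, d=0.9)  # close to 1 / (1 - 0.9) = 10
```

Under this assumption the total is a geometric series, which is why it approaches 1/(1-d) for large k.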
Count Estimation of an itemset (1)
• The maximum possible count of an itemset is estimated as the minimum among the maximum possible counts of all of its subsets.
Count Estimation of an itemset (2)
• Definition 1 introduces notation for:
  • the set of all subsets of an itemset $e$
  • the set of $e$'s $m$-subsets
  • the set of counts of $e$'s $m$-subsets
• Definition 2:
  • The union-itemset of $e_1$ and $e_2$ is composed of all items that are members of either $e_1$ or $e_2$.
  • The intersection-itemset of $e_1$ and $e_2$ is composed of all items that are members of both $e_1$ and $e_2$.
Count Estimation of an itemset (3)
• Least exclusively distributed (LED): the items of an itemset appear together in as many transactions as possible.
• Most exclusively distributed (MED): the items of an itemset appear exclusively in as many transactions as possible.
• The maximum count of an $n$-itemset $e$:
Count Estimation of an itemset (4)
• Two itemsets $e_1$, $e_2$:
• The minimum count $C_{\min}(e)$ can be estimated from the unions of $e$'s $(n-1)$-subsets:
• Estimation error:
  • $E(e) = C_{\max}(e) - C_{\min}(e)$
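The bounds described in the count-estimation slides can be sketched as follows. The upper bound takes the minimum count among the (n-1)-subsets. For the lower bound the exact formula is not shown, so the sketch assumes the standard inclusion-exclusion bound C(e1 ∪ e2) ≥ max(0, C(e1) + C(e2) - C(e1 ∩ e2)) applied to every pair of (n-1)-subsets. The function name `estimate_count`, the `counts` dictionary, and the `total` parameter (used as the count of the empty itemset) are hypothetical.

```python
from itertools import combinations


def estimate_count(itemset, counts, total):
    """Return (c_max, c_min, error) for an n-itemset, n >= 2.

    counts maps frozensets to known counts of the (n-1)-subsets and of their
    pairwise intersections; total stands in for the count of the empty itemset.
    """
    e = frozenset(itemset)
    n = len(e)
    if n < 2:
        raise ValueError("estimation applies to itemsets with at least 2 items")
    subsets = [frozenset(s) for s in combinations(sorted(e), n - 1)]
    # Upper bound: e cannot occur more often than its rarest (n-1)-subset.
    c_max = min(counts[s] for s in subsets)
    # Lower bound (assumed): inclusion-exclusion over each pair of subsets,
    # whose union is always e itself.
    c_min = 0.0
    for a, b in combinations(subsets, 2):
        common = a & b
        c_common = counts[common] if common else total
        c_min = max(c_min, counts[a] + counts[b] - c_common)
    return c_max, c_min, c_max - c_min


# Illustrative counts: 8 transactions, "a" in 6 of them, "b" in 5.
counts = {frozenset("a"): 6, frozenset("b"): 5}
c_max, c_min, error = estimate_count({"a", "b"}, counts, total=8)
```

In this example at least 3 and at most 5 transactions can contain both items, so the estimation error E(e) is 2.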
estDec Method (1)
• Every node in a monitoring lattice maintains a triple (cnt, err, MRtid) for its corresponding itemset e:
  • cnt: the count of e.
  • err: the maximum error count of e.
  • MRtid: the id of the most recent transaction that contains e.
estDec Method (2)
• The estDec method is composed of four phases:
  • Phase I: parameter updating phase
  • Phase II: count updating phase
  • Phase III: delayed-insertion phase
  • Phase IV: frequent itemset selection phase
estDec Method (3)
• Phase I: $|D|_k$ is updated.
• Phase II: the counts of those itemsets in ML (the monitoring lattice) that appear in $T_k$ are updated.
  • $S_{prn}$: the threshold for pruning.
  • If a 1-itemset is pruned from ML, it is impossible to estimate its count later.
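Phases I and II can be sketched around the (cnt, err, MRtid) triple. The update rules are not spelled out here, so the sketch assumes that a stored count is decayed lazily by d^(k - MRtid) when its itemset reappears in transaction k, which matches the decay factor used in Phase IV. The names `Node`, `update_total`, `update_node`, and `should_prune` are hypothetical, and exempting 1-itemsets from pruning is our reading of the remark that a pruned 1-itemset could not be estimated later.

```python
from dataclasses import dataclass


@dataclass
class Node:
    cnt: float   # decayed count as of transaction mrtid
    err: float   # maximum error count as of transaction mrtid
    mrtid: int   # id of the most recent transaction containing the itemset


def update_total(total: float, d: float) -> float:
    """Phase I (assumed): decay the running total and count the new transaction."""
    return total * d + 1.0


def update_node(node: Node, k: int, d: float) -> None:
    """Phase II (assumed): decay cnt and err for the elapsed gap, then add T_k."""
    factor = d ** (k - node.mrtid)
    node.cnt = node.cnt * factor + 1.0
    node.err = node.err * factor
    node.mrtid = k


def should_prune(node: Node, total: float, s_prn: float, itemset_size: int) -> bool:
    """Prune when support drops below S_prn; 1-itemsets are kept (assumption)."""
    return itemset_size > 1 and node.cnt / total < s_prn


# Illustrative node last seen at transaction 3, updated at transaction 5.
node = Node(cnt=4.0, err=1.0, mrtid=3)
update_node(node, k=5, d=0.5)
```

Decaying lazily means a node is only touched when its itemset occurs, so the cost per transaction depends on the transaction, not on the size of the lattice.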
estDec Method (4)
• Phase III: find new itemsets that have a high possibility of becoming frequent. A new itemset is inserted into ML in two cases:
  • A new 1-itemset; the cnt of a 1-itemset is its actual count.
  • An itemset $e$ with $C_{\max}(e)/|D|_k \ge S_{ins}$, where $S_{ins}$ is the threshold for delayed insertion.
    • $\mathit{cnt\_for\_subsets} = (1 - d^{|e|-1})/(1 - d)$
    • $\mathit{max\_cnt\_before\_subsets} = S_{ins} \times |D|_{k-(|e|-1)} \times d^{|e|-1}$
    • $C_{upper}(e) = \mathit{max\_cnt\_before\_subsets} + \mathit{cnt\_for\_subsets}$
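The upper-bound arithmetic of Phase III can be worked through in a few lines. This sketch reads the garbled middle term as S_ins times the decayed total at transaction k - (|e| - 1), decayed a further |e| - 1 steps, and it assumes the geometric-series closed form for the decayed total; both readings are assumptions. The function names are hypothetical.

```python
def decayed_total(k: int, d: float) -> float:
    """Assumed closed form of |D|_k: 1 + d + ... + d**(k-1), for 0 < d < 1."""
    return (1.0 - d ** k) / (1.0 - d)


def upper_bound_count(size: int, k: int, d: float, s_ins: float) -> float:
    """C_upper(e) for an itemset with `size` items at transaction k."""
    m = size - 1
    # Most the itemset can have gained in the last m transactions.
    cnt_for_subsets = (1.0 - d ** m) / (1.0 - d)
    # Most it can have held before that while staying below S_ins, decayed to now.
    max_cnt_before_subsets = s_ins * decayed_total(k - m, d) * d ** m
    return max_cnt_before_subsets + cnt_for_subsets


def qualifies_for_insertion(c_max: float, total: float, s_ins: float) -> bool:
    """Delayed-insertion test: estimated support reaches S_ins."""
    return c_max / total >= s_ins


# Illustrative values: a 3-itemset at transaction 10.
c_upper = upper_bound_count(size=3, k=10, d=0.5, s_ins=0.1)
```

The first term is itself a geometric sum of d^0 through d^(|e|-2), which is where the (1 - d^(|e|-1))/(1 - d) form comes from.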
estDec Method (5)
• Phase IV: produces all current frequent itemsets in ML.
  • An itemset $e$ is frequent if its current support $(cnt \times d^{k - MRtid})/|D|_k$ is greater than $S_{min}$.
  • Its current support error is $(err \times d^{k - MRtid})/|D|_k$.
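The selection rule of Phase IV translates directly into a filter over the lattice. In this minimal sketch the lattice is represented as a plain dictionary from itemsets to (cnt, err, MRtid) triples, which is a simplification of the prefix-tree structure; `select_frequent` is a hypothetical name.

```python
def select_frequent(lattice, k: int, d: float, total: float, s_min: float):
    """Return {itemset: (support, support_error)} for itemsets above S_min.

    lattice maps each itemset to its (cnt, err, mrtid) triple; total is |D|_k.
    """
    frequent = {}
    for itemset, (cnt, err, mrtid) in lattice.items():
        # Bring the stored values up to date for the transactions since mrtid.
        factor = d ** (k - mrtid)
        support = cnt * factor / total
        if support > s_min:
            frequent[itemset] = (support, err * factor / total)
    return frequent


# Illustrative lattice at transaction 10.
lattice = {
    frozenset({"a"}): (4.0, 0.0, 10),
    frozenset({"a", "b"}): (2.0, 1.0, 8),
}
result = select_frequent(lattice, k=10, d=0.5, total=8.0, s_min=0.1)
```

Here {a} has support 0.5 and is reported, while {a, b} was last seen two transactions ago, so its count has decayed to a support of 0.0625 and it is left out.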
estDec Method (6)
• Force-pruning operation:
  • All insignificant itemsets in ML can be pruned.
  • It is performed when the current size of ML reaches a threshold.
Experimental (1)
• Performance of the estDec method on the data set T10.I4.D1000K
• $S_{ins}$ is denoted as p%; its actual value is $S_{min} \times p\%$.
• The force-pruning operation is performed every 1,000 transactions.
• Figure panels: (a) memory usage, (b) processing time of Phases I to III, (c) processing time of Phase IV
Experimental (2)
• Accuracy of the mining result
• Average support error
  • ASE(RestDec|RdApriori)
Experimental (3)
• The adaptability of the estDec method to changes of information in a data stream.
• Coverage rate CR(X)
• |R|: the total number of frequent itemsets in ML