
Finding Maximal Frequent Itemsets over Online Data Streams Adaptively

Daesu Lee, Wonsuk Lee. IEEE ICDM ’05. Presenter: 林靜怡. 2006/05/05.


Presentation Transcript


  1. Finding Maximal Frequent Itemsets over Online Data Streams Adaptively
     • Daesu Lee, Wonsuk Lee
     • IEEE ICDM ’05
     • Presenter: 林靜怡
     • 2006/05/05

  2. Introduction
     • Goal: confine the memory usage of a data mining process
     • estDec: uses a prefix tree
     • estDec+: uses a CP-tree

  3. CP-tree
     • Compressed-prefix tree
     • Used effectively to find frequent or maximal frequent itemsets
     • A node of a CP-tree can maintain the information of several itemsets together
     • The size of a CP-tree can be flexibly controlled by merging or splitting nodes

  4. CP-tree (cont.)
     • Two consecutive nodes of a prefix tree are merged in a CP-tree when the current support difference between their corresponding itemsets is less than or equal to a merging gap threshold δ ∈ (0,1)

  5. Definition
     • […]: a prefix tree
     • S: a subtree of […]
     • […]: the itemset represented by the root of S
     • […]: an itemset represented by a node of S
     • δ: merging gap threshold
     • |S|: number of nodes in S
     • […]: the total number of transactions in the current data stream
     • Subtree S is a mergeable subtree and is compressed into a node of a CP-tree that is equivalent to […]

  6. CP-node structure
     • A node m of a CP-tree maintains four entries: m(τ, π, […], […])
     • Item-list τ
       - m.τ[1]: the root node of S; the shortest itemset of the node m, denoted by […]
       - m.τ[|S|]: the right-most leaf node in the lowest level of S; the longest itemset of the node m, denoted by […]

  7. CP-node structure
     • Parent-index list π
       - y’s parent is x
       - […]
     • Largest count: the current count of the shortest itemset
     • Smallest count: the current count of the longest itemset
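
The four entries listed on slides 6 and 7 can be pictured as a small record. The following is a minimal sketch only, not the authors' implementation: the class name `CPNode`, the field names, the use of -1 as the parent index of the root entry, and the 0-based indexing (the slides use 1-based indices) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class CPNode:
    """Illustrative CP-tree node holding a merged subtree S (hypothetical layout)."""
    items: List[str]         # item-list τ: one item per merged prefix-tree node
    parent_index: List[int]  # parent-index list π: position of each entry's parent,
                             # -1 for the root entry (an assumption of this sketch)
    largest_count: float     # current count of the shortest itemset (root of S)
    smallest_count: float    # current count of the longest itemset

    def path(self, j: int) -> Tuple[str, ...]:
        """Items from the root entry of this node down to entry j (0-based)."""
        out = []
        while j != -1:
            out.append(self.items[j])
            j = self.parent_index[j]
        return tuple(reversed(out))


# A node that merges the chain a -> ab -> abc, with counts 10 and 9 at the two ends.
node = CPNode(items=["a", "b", "c"], parent_index=[-1, 0, 1],
              largest_count=10, smallest_count=9)
print(node.path(2))  # ('a', 'b', 'c')
```

Only the two extreme counts are stored, which is why the counts of the itemsets in between have to be estimated rather than read off directly.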

  8. CP-tree (example with δ = 0.2)
     • |10 − 9| / 10 = 0.1 and 0.1 ≤ 0.2, so the nodes are merged
     • |10 − 5| / 10 = 0.5 and 0.5 > 0.2, so the nodes are not merged
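
The arithmetic on slide 8 can be reproduced with a short check. This is a sketch under the assumption that the divisor 10 is the total number of transactions and that 0.2 is the merging gap threshold δ; the function name `within_merging_gap` is hypothetical.

```python
def within_merging_gap(count_a: float, count_b: float,
                       total: float, delta: float) -> bool:
    """True when the support difference |count_a - count_b| / total is at most delta."""
    return abs(count_a - count_b) / total <= delta


print(within_merging_gap(10, 9, 10, 0.2))  # True  (0.1 <= 0.2)
print(within_merging_gap(10, 5, 10, 0.2))  # False (0.5 > 0.2)
```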

  9. Merged-count Estimation
     • Given the item-list of a node m in a CP-tree
     • Each entry j of the item-list (1 ≤ j ≤ n) represents an itemset
     • Only the counts of the shortest and longest itemsets are maintained; the current counts of the remaining itemsets can be estimated by a formula
     • f(m, j): a count estimation function that models the count of the j-th itemset in terms of the counts of the shortest and longest itemsets
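
The transcript does not preserve the form of f(m, j). As one possible illustration, the sketch below uses plain linear interpolation between the largest and smallest counts; this particular formula and the name `estimate_count` are assumptions and should not be read as the function used in the paper.

```python
def estimate_count(largest: float, smallest: float, j: int, n: int) -> float:
    """Estimate the count of the j-th itemset (1-based) of a node with n itemsets.

    Assumption of this sketch: counts fall linearly from `largest` (j = 1,
    the shortest itemset) to `smallest` (j = n, the longest itemset).
    """
    if not 1 <= j <= n:
        raise ValueError("j must satisfy 1 <= j <= n")
    if n == 1:
        return largest
    return largest - (largest - smallest) * (j - 1) / (n - 1)


print(estimate_count(10, 6, 1, 3))  # 10.0
print(estimate_count(10, 6, 2, 3))  # 8.0
print(estimate_count(10, 6, 3, 3))  # 6.0
```

Whatever function is chosen, the estimate is exact at both ends, and its error for the itemsets in between is bounded by the gap between the two stored counts.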

  10. CP-tree Maintenance
      • […]: the parent node of m
      • Node-merge
        - […]
        - A new significant itemset e is identified by the inserting-count estimation process, so a new node for the itemset needs to be inserted as a child of the node m
      • Node-split
        - […]

  11. Finding Maximal Frequent Itemsets
      • estDec+ Method
      • Adaptive Memory Utilization

  12. estDec+ Method
      • Parameter updating phase
      • Count updating & node restructuring phase
      • Itemset insertion phase
      • Maximal frequent itemset selection phase

  13. Parameter updating phase
      • The total number of transactions in the current data stream is updated
      Count updating & node restructuring phase
      • […]: m’s parent
      • Prune: […]
      • Split: […]
      • Merge: […]

  14. Itemset insertion phase
      • Insert any new significant itemset that has not been maintained in […]
      • Any insignificant item whose current support is less than […] is filtered out of the transaction => […]
      • […] is traversed to find any new significant itemset induced by the items in […]
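
The filtering step of the insertion phase can be sketched as follows. The threshold symbol is not preserved in the transcript, so the parameter `sig_threshold`, the function name `filter_transaction`, and the flat dictionary of item counts are all assumptions made for illustration.

```python
from typing import Dict, List


def filter_transaction(transaction: List[str], item_counts: Dict[str, float],
                       total: float, sig_threshold: float) -> List[str]:
    """Keep only the items whose current support (count / total) reaches sig_threshold."""
    return [item for item in transaction
            if item_counts.get(item, 0.0) / total >= sig_threshold]


counts = {"a": 50, "b": 2, "c": 30}
print(filter_transaction(["a", "b", "c"], counts, total=100, sig_threshold=0.1))
# ['a', 'c']
```

Only the items that survive the filter need to be considered when looking for new significant itemsets, which keeps the traversal short.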

  15. Maximal frequent itemset selection phase
      • Retrieves all the currently frequent or maximal frequent itemsets by traversing the monitoring tree
      • Force-pruning
        - Prunes all the nodes whose largest counts are less than […]
        - Performed periodically
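
What "maximal frequent" means can be illustrated on a flat table of counts. This sketch does not traverse a monitoring tree as the method on the slide does; it simply keeps every frequent itemset that has no frequent proper superset. The names `maximal_frequent` and `min_support` are hypothetical.

```python
from typing import Dict, FrozenSet, List


def maximal_frequent(counts: Dict[FrozenSet[str], float], total: float,
                     min_support: float) -> List[FrozenSet[str]]:
    """Return the frequent itemsets that have no frequent proper superset."""
    frequent = [s for s, c in counts.items() if c / total >= min_support]
    # `s < other` is the proper-subset test on frozensets.
    return [s for s in frequent if not any(s < other for other in frequent)]


counts = {
    frozenset("a"): 8,
    frozenset("b"): 7,
    frozenset("ab"): 6,
    frozenset("c"): 5,
    frozenset("ac"): 1,
}
result = maximal_frequent(counts, total=10, min_support=0.5)
print([sorted(s) for s in result])  # [['a', 'b'], ['c']]
```

The pairwise comparison is quadratic in the number of frequent itemsets, which is acceptable for an illustration but is the kind of cost a tree traversal avoids.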

  16. Adaptive Memory Utilization
      • To minimize the estimation error caused by merged-count estimation, it is very important to keep the value of δ as small as possible
      • The size of a CP-tree is inversely proportional to the value of δ
      • The value of δ should be changed dynamically in the parameter updating phase

  17. Adaptive Memory Utilization
      • […]: the upper bound of desired memory usage
      • […]: the lower bound of desired memory usage
      • […]: confined memory space
      • […]: current memory usage
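
The update rule itself is not preserved in the transcript. One plausible reading of slides 16 and 17 is sketched below: δ is raised when current memory usage exceeds the upper bound (a larger δ merges more nodes and shrinks the tree) and lowered when usage falls below the lower bound (a smaller δ reduces the estimation error). The function name `adjust_delta`, the fixed `step`, and the clamping to [0, 1] are assumptions, not details taken from the paper.

```python
def adjust_delta(delta: float, current_usage: float,
                 lower: float, upper: float, step: float = 0.001) -> float:
    """Return an updated merging gap threshold based on current memory usage."""
    if current_usage > upper:
        # Too much memory in use: allow more merging.
        return min(1.0, delta + step)
    if current_usage < lower:
        # Memory to spare: merge less to reduce the estimation error.
        return max(0.0, delta - step)
    return delta


delta = 0.2
delta = adjust_delta(delta, current_usage=120, lower=50, upper=100)  # raised by one step
delta = adjust_delta(delta, current_usage=80, lower=50, upper=100)   # unchanged
print(delta > 0.2)  # True
```

Calling such a function once per parameter updating phase would let δ drift toward the smallest value that keeps memory usage inside the desired band.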

  18. Experiment
      • Data sets: T10.I4.D1000K and WebLog
      • 1.8 GHz Pentium PC
      • 512 MB memory
      • Linux 7.3
      • […] = 0.001
      • Count estimation function f(m, j)

  19. Experiment

  20. Experiment

  21. Experiment
