Finding Maximal Frequent Itemsets over Online Data Streams Adaptively

Daesu Lee, Wonsuk Lee. IEEE ICDM’05. Presenter: 林靜怡, 2006/05/05

Introduction
• Goal: confine the memory usage of a data mining process
• estDec: based on a prefix tree
• estDec+: based on a CP-tree

CP-tree
• Compressed-prefix tree
• Used effectively in finding frequent or maximal frequent itemsets
• A node of a CP-tree can maintain the information of several itemsets together
• The size of a CP-tree can be flexibly controlled by merging or splitting nodes

CP-tree (cont.)
• Two consecutive nodes of a prefix tree are merged in a CP-tree when the current support difference between their corresponding itemsets is less than or equal to a merging gap threshold δ ∈ (0, 1)

Definition
• S: a subtree of a prefix tree
• Root itemset: the itemset represented by the root of S
• Node itemset: an itemset represented by a node of S
• δ: merging gap threshold
• |S|: number of nodes in S
• Total number of transactions in the current data stream: the denominator used when turning a count into a support
• When the support differences within S are at most δ, subtree S is a mergeable subtree and is compressed into a single node of a CP-tree

CP-node structure
• A node m of a CP-tree maintains four entries: m(τ, π, largest count, smallest count)
• Item-list τ
  - m.τ[1]: the root node of S, the shortest itemset of node m
  - m.τ[|S|]: the right-most leaf node in the lowest level of S, the longest itemset of node m

CP-node structure (cont.)
• Parent-index list π
  - records which entry of the item-list is the parent of each entry (e.g. y's parent is x)
• Largest count: the current count of the shortest itemset
• Smallest count: the current count of the longest itemset
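The four entries of a CP-tree node could be sketched as a small Python record. This is only an illustration under assumptions: the class name `CPNode`, the field names, the use of 0-based positions, and storing `None` as the parent of the first entry are all hypothetical choices, not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Itemset = Tuple[str, ...]  # assumption: an itemset is an ordered tuple of item names


@dataclass
class CPNode:
    """Hypothetical container for the four entries of a CP-tree node."""
    item_list: List[Itemset]           # tau: itemsets merged into this node, shortest first
    parent_index: List[Optional[int]]  # pi: position in item_list of each entry's parent
    largest_count: float               # current count of the shortest itemset
    smallest_count: float              # current count of the longest itemset

    def shortest_itemset(self) -> Itemset:
        # Corresponds to m.tau[1] on the slide (1-based there, 0-based here).
        return self.item_list[0]

    def longest_itemset(self) -> Itemset:
        # Corresponds to m.tau[|S|] on the slide.
        return self.item_list[-1]


# Three itemsets compressed into one node: "ab" hangs under "a", "abc" under "ab".
node = CPNode(
    item_list=[("a",), ("a", "b"), ("a", "b", "c")],
    parent_index=[None, 0, 1],
    largest_count=10,
    smallest_count=9,
)
print(node.shortest_itemset(), node.longest_itemset())
```

Only two counts are kept per node, which is what makes the node smaller than the subtree it replaces.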

CP-tree (example, δ = 0.2)
• |10-9|/10 = 0.1, and 0.1 <= 0.2: the two nodes are merged
• |10-5|/10 = 0.5, and 0.5 > 0.2: the two nodes are not merged
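The two comparisons in this example can be recomputed with a short Python sketch. It assumes that the denominator 10 is the total number of transactions and that the threshold being compared against is δ = 0.2; the function name `is_mergeable` and its parameter names are hypothetical.

```python
def is_mergeable(count_a: float, count_b: float,
                 total_transactions: float, delta: float) -> bool:
    """Return True when the support gap of two adjacent itemsets is at most delta."""
    if total_transactions <= 0:
        raise ValueError("total_transactions must be positive")
    # Support = count / total, so the support gap is the count gap over the total.
    support_gap = abs(count_a - count_b) / total_transactions
    return support_gap <= delta


print(is_mergeable(10, 9, 10, 0.2))  # gap 0.1, within the threshold
print(is_mergeable(10, 5, 10, 0.2))  # gap 0.5, exceeds the threshold
```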

Merged-count Estimation
• Given the item-list of a node m in a CP-tree, its itemsets are indexed by j (1 <= j <= n)
• Only the counts of the shortest and longest itemsets are stored; the current counts of the remaining itemsets are estimated by a formula
• f(m, j): a count estimation function that models the count of the j-th itemset in terms of the largest and smallest counts, i.e. the counts of the shortest and longest itemsets
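The slide does not say which form f(m, j) takes. Purely as an illustration of the idea, the sketch below uses linear interpolation between the largest and smallest counts; this choice is an assumption and should not be read as the estimation function used by the authors. The name `estimate_count` and its parameters are hypothetical.

```python
def estimate_count(largest_count: float, smallest_count: float, j: int, n: int) -> float:
    """Estimate the count of the j-th of n itemsets in a merged node.

    Assumption: counts fall off linearly from the shortest itemset (j = 1)
    to the longest itemset (j = n). This is an illustrative stand-in only.
    """
    if not 1 <= j <= n:
        raise ValueError("j must satisfy 1 <= j <= n")
    if n == 1:
        # A node holding a single itemset has nothing to interpolate.
        return float(largest_count)
    fraction = (j - 1) / (n - 1)
    return largest_count - (largest_count - smallest_count) * fraction


# Endpoints reproduce the stored counts; interior positions are interpolated.
for position in range(1, 6):
    print(position, estimate_count(10, 6, position, 5))
```

Any estimate of this kind introduces error for the interior itemsets, which is why the gap between the two stored counts is kept small.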

CP-tree Maintenance
• The operations below refer to a node m and the parent node of m
• Node-merge
  - a new significant itemset e is identified by the inserting-count estimation process, so that a new node for the itemset needs to be inserted as a child of node m
• Node-split

Finding Maximal Frequent Itemsets
• estDec+ Method
• Adaptive Memory Utilization

estDec+ Method
• Parameter updating phase
• Count updating & node restructuring phase
• Itemset insertion phase
• Maximal frequent itemset selection phase

Parameter updating phase
• The total number of transactions in the current data stream is updated

Count updating & node restructuring phase
• Operations on a node m, defined relative to m's parent:
  - prune
  - split
  - merge

Itemset insertion phase
• Inserts any new significant itemset that is not yet maintained in the monitoring tree
• Any insignificant item, i.e. one whose current support is below the significance threshold, is filtered out of the transaction, giving a filtered transaction
• The monitoring tree is traversed to find any new significant itemset induced by the items of the filtered transaction

Maximal frequent itemset selection phase
• Retrieves all the currently frequent or maximal frequent itemsets by traversing the monitoring tree
• Force-pruning
  - removes all the nodes whose largest counts are less than the pruning threshold
  - performed periodically
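To make the term "maximal" concrete: a frequent itemset is maximal when no proper superset of it is also frequent. The sketch below applies that definition to a plain collection of frequent itemsets. It does not traverse a CP-tree, and the function name `maximal_itemsets` is hypothetical; it assumes the frequent itemsets have already been retrieved.

```python
from typing import FrozenSet, Iterable, List


def maximal_itemsets(frequent: Iterable[Iterable[str]]) -> List[FrozenSet[str]]:
    """Keep only the itemsets that have no proper superset in the input."""
    unique = {frozenset(itemset) for itemset in frequent}
    maximal = [
        itemset for itemset in unique
        # "<" on frozensets tests for a proper subset.
        if not any(itemset < other for other in unique)
    ]
    # Sort for a stable, readable order: by size, then alphabetically.
    return sorted(maximal, key=lambda s: (len(s), sorted(s)))


frequent_sets = [{"a"}, {"b"}, {"a", "b"}, {"c"}]
for itemset in maximal_itemsets(frequent_sets):
    print(sorted(itemset))
```

This pairwise check is quadratic in the number of itemsets, so it is meant for illustration rather than for large result sets.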

Adaptive Memory Utilization
• To minimize the estimation error caused by merged-count estimation, it is very important to keep the value of δ as small as possible
• The size of a CP-tree is inversely proportional to the value of δ
• The value of δ should therefore be changed dynamically in the parameter updating phase

Adaptive Memory Utilization (cont.)
• Upper bound of the desired memory usage
• Lower bound of the desired memory usage
• Confined memory space
• Current memory usage
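The slides do not give the exact rule for changing δ. One way the quantities above might interact is sketched below: δ grows when current usage exceeds the upper bound (more merging, smaller tree) and shrinks when usage drops below the lower bound (less merging, smaller estimation error). The fixed step size, the clamping range, and the name `adjust_delta` are all assumptions made for illustration.

```python
def adjust_delta(delta: float, current_usage: float,
                 upper_bound: float, lower_bound: float,
                 step: float = 0.01,
                 delta_min: float = 0.0, delta_max: float = 1.0) -> float:
    """Nudge the merging gap threshold to keep memory usage within bounds.

    Assumption: a fixed additive step; the paper's actual update rule
    is not shown on the slides.
    """
    if lower_bound > upper_bound:
        raise ValueError("lower_bound must not exceed upper_bound")
    if current_usage > upper_bound:
        delta += step  # too much memory: allow more nodes to merge
    elif current_usage < lower_bound:
        delta -= step  # memory to spare: merge less to reduce estimation error
    # Keep delta inside its allowed range.
    return min(max(delta, delta_min), delta_max)


print(adjust_delta(0.10, current_usage=900, upper_bound=800, lower_bound=600))
print(adjust_delta(0.10, current_usage=500, upper_bound=800, lower_bound=600))
print(adjust_delta(0.10, current_usage=700, upper_bound=800, lower_bound=600))
```

Leaving δ unchanged between the two bounds avoids restructuring the tree on every small fluctuation in memory usage.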

Experiment
• Data sets: T10.I4.D1000K and WebLog
• 1.8 GHz Pentium PC
• 512 MB memory
• Linux 7.3
• = 0.001
• Count estimation function f(m, j)

Experiment
