

  1. Mining Frequent Patterns without Candidate Generation. Proc. 2000 ACM-SIGMOD Int. Conf. on Management of Data (SIGMOD'00), Dallas, TX, May 2000. Jiawei Han, Jian Pei, and Yiwen Yin. June 23, 2000, DE Lab. 윤지영

  2. 1. Introduction
• The core of the Apriori algorithm: if any length-k pattern is not frequent in the database, its length-(k+1) super-patterns can never be frequent -> use the frequent (k-1)-itemsets to generate the candidate frequent k-itemsets (a sketch of this step follows)
• Huge candidate sets:
- 10^4 frequent 1-itemsets will generate more than 10^7 candidate 2-itemsets
- to discover a frequent pattern of size 100, e.g., {a1, a2, …, a100}, one needs to generate 2^100 ≈ 10^30 candidates
• Multiple scans of the database:
- needs (n + 1) scans, where n is the length of the longest pattern
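
As a concrete, hedged illustration of where the candidate explosion comes from, here is a minimal Python sketch of Apriori's join-and-prune candidate generation; it is not the paper's or Apriori's reference code, and the names (apriori_gen, freq) are invented for illustration.

from itertools import combinations

def apriori_gen(freq):
    # Join + prune: build candidate k-itemsets from frequent (k-1)-itemsets,
    # each represented as a sorted tuple of items.
    cands = set()
    for a in freq:
        for b in freq:
            u = tuple(sorted(set(a) | set(b)))
            # keep unions that grow the itemset by exactly one item and whose
            # every (k-1)-subset is itself frequent (the Apriori property)
            if len(u) == len(a) + 1 and all(s in freq for s in combinations(u, len(a))):
                cands.add(u)
    return cands

freq1 = {('a1',), ('a2',), ('a3',)}
print(sorted(apriori_gen(freq1)))  # the C(3,2) = 3 candidate 2-itemsets

With 10^4 frequent 1-itemsets the same join step emits on the order of C(10^4, 2) ≈ 5 * 10^7 candidate pairs, which is the blowup the slide refers to.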

  3. Our Approach: Mining Frequent Patterns Without Candidate Generation
• Construct the frequent-pattern tree (FP-tree), an extended prefix-tree structure
• Develop an FP-tree-based pattern-fragment-growth mining method:
- start from a frequent length-1 pattern (as an initial suffix pattern)
- construct its conditional FP-tree
- achieve pattern growth by concatenating the suffix pattern with patterns mined from its conditional FP-tree
• Use a partitioning-based, divide-and-conquer method

  4. 2. Frequent Pattern Tree (Design and Construction)
• <Example 1> Table 1 (ξ = 3), Fig. 1
[Fig. 1: the FP-tree built from Table 1, with header table f:4, c:4, a:3, b:3, m:3, p:3]

TID   Items bought                  (ordered) frequent items
100   {f, a, c, d, g, i, m, p}      {f, c, a, m, p}
200   {a, b, c, f, l, m, o}         {f, c, a, b, m}
300   {b, f, h, j, o}               {f, b}
400   {b, c, k, s, p}               {c, b, p}
500   {a, f, c, e, l, p, m, n}      {f, c, a, m, p}

• <Algorithm 1>
1. Scan DB once, find the frequent 1-itemsets (single-item patterns), and sort the frequent items in support-descending order
2. Scan DB again and construct the FP-tree (the two scans are sketched below)
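
The following Python sketch walks through the two scans of Algorithm 1 on the Table 1 transactions; it is an illustration under the stated ξ = 3, not the authors' code, and ties in the support-descending order are broken arbitrarily (so the order may differ from the paper's f, c, a, b, m, p).

from collections import Counter

transactions = [
    ['f','a','c','d','g','i','m','p'],
    ['a','b','c','f','l','m','o'],
    ['b','f','h','j','o'],
    ['b','c','k','s','p'],
    ['a','f','c','e','l','p','m','n'],
]
xi = 3  # minimum support threshold from Example 1

# Scan 1: count every item and keep those with support >= xi,
# fixing a support-descending order (the frequent-item list).
counts = Counter(i for t in transactions for i in t)
f_list = [i for i, c in counts.most_common() if c >= xi]
rank = {item: r for r, item in enumerate(f_list)}

# Scan 2 (per transaction): select the frequent items and sort them by the
# fixed order; these ordered lists are what get inserted into the FP-tree.
ordered = [sorted((i for i in t if i in rank), key=rank.get) for t in transactions]
print(f_list)   # six items with support >= 3: f, c, a and b, m, p (tie order may vary)
print(ordered)  # matches the "(ordered) frequent items" column of Table 1 up to tie order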

  5. Definition of FP-tree
• Consists of:
- one root, labeled null
- a set of item-prefix subtrees as the children of the root
- a frequent-item header table
• Each node in an item-prefix subtree has three fields: item-name, count, and node-link
• Each entry in the frequent-item header table has two fields: item-name and head of node-link, which points to the first node in the FP-tree carrying that item-name
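
A minimal Python sketch of the structure just defined; FPNode and insert are invented names, and where the paper's header entry points at the first node of each item's node-link chain, this sketch prepends new nodes for simplicity (either chain end works for traversal).

class FPNode:
    # One FP-tree node: item-name, count, node-link, plus parent/children
    # pointers used later when walking prefix paths.
    def __init__(self, item, parent=None):
        self.item = item          # item-name (None for the root)
        self.count = 0            # count field
        self.node_link = None     # next node carrying the same item-name
        self.parent = parent
        self.children = {}        # item-name -> child FPNode

def insert(root, ordered_items, header):
    # Insert one ordered frequent-item list, maintaining the header table.
    node = root
    for item in ordered_items:
        child = node.children.get(item)
        if child is None:
            child = FPNode(item, parent=node)
            node.children[item] = child
            child.node_link = header.get(item)  # link into item's chain
            header[item] = child
        child.count += 1
        node = child

root, header = FPNode(None), {}
for t in [('f','c','a','m','p'), ('f','c','a','b','m'), ('f','b'),
          ('c','b','p'), ('f','c','a','m','p')]:
    insert(root, t, header)
print(root.children['f'].count)  # 4, as on the f:4 node of Fig. 1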

  6. 2.2 Completeness and Compactness of FP-tree
• <Lemma 2.1> (Completeness)
- never breaks a long pattern of any transaction
- preserves complete information for frequent-pattern mining
• <Lemma 2.2> (Compactness)
- reduces irrelevant information: infrequent items are gone
- frequency-descending ordering: more frequent items are more likely to be shared
- never larger than the original database (not counting node-links and counts)

  7. 3. Mining Frequent Patterns Using the FP-tree
• General idea (divide-and-conquer):
- recursively grow frequent patterns along the paths of the FP-tree
• Method:
- for each item, construct its conditional pattern base, and then its conditional FP-tree
- repeat the process on each newly created conditional FP-tree
- stop when the resulting FP-tree is empty or contains only one path (a single path generates all the combinations of its sub-paths, each of which is a frequent pattern)

  8. Conditional Pattern Base & Conditional FP-tree
• Construct a conditional pattern base for each node in the FP-tree
ex) For node m, the two prefix paths are <f:4, c:3, a:3> (ending in m:2) and <f:4, c:3, a:3, b:1> (ending in m:1); carrying m's counts along each path gives m's conditional pattern base => {(fca:2), (fcab:1)}
• Construct a conditional FP-tree from each conditional pattern base
ex) m's conditional FP-tree => {(f:3, c:3, a:3)} | m (the count arithmetic is sketched below)
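
The count arithmetic behind m's conditional FP-tree can be checked in a few lines of Python (a sketch over the base stated above, with ξ = 3):

from collections import Counter

# m's conditional pattern base: each prefix path carries m's count
base_m = [(('f','c','a'), 2), (('f','c','a','b'), 1)]
xi = 3

acc = Counter()
for path, cnt in base_m:
    for item in path:
        acc[item] += cnt
print(dict(acc))  # {'f': 3, 'c': 3, 'a': 3, 'b': 1}

# b falls below xi and is dropped, leaving the single path f:3, c:3, a:3,
# i.e. m's conditional FP-tree {(f:3, c:3, a:3)} | m
print({i: c for i, c in acc.items() if c >= xi})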

  9. Properties
<Property 3.1> (Node-link property) For any frequent item ai, all the possible frequent patterns that contain ai can be obtained by following ai's node-links, starting from ai's head in the FP-tree header table.
<Property 3.2> (Prefix-path property) To calculate the frequent patterns for a node ai in a path P, only the prefix sub-path of ai in P needs to be accumulated, and its frequency count should carry the same count as node ai.

  10. Step 1: From FP-tree to Conditional Pattern Base
[Fig.: the FP-tree of Fig. 1 with its header table, traversed via node-links]
• Start at the frequent-item header table of the FP-tree
• Traverse the FP-tree by following the node-links of each frequent item
• Accumulate all the transformed prefix paths of that item to form its conditional pattern base (a traversal sketch follows)

Conditional pattern bases:
item   conditional pattern base
c      f:3
a      fc:3
b      fca:1, f:1, c:1
m      fca:2, fcab:1
p      fcam:2, cb:1
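
A hedged sketch of the traversal, reusing the FPNode fields (parent, node_link, count, item) from the earlier structure sketch; per Property 3.2, each emitted prefix path carries the count of the node it was reached from.

def conditional_pattern_base(item, header):
    # Follow item's node-link chain; for each node walk up to the root
    # and record the prefix path with that node's count.
    base = []
    node = header[item]            # head of item's node-link chain
    while node is not None:
        path, p = [], node.parent
        while p is not None and p.item is not None:
            path.append(p.item)
            p = p.parent
        if path:
            base.append((tuple(reversed(path)), node.count))
        node = node.node_link
    return base

# On the tree built in the earlier sketch:
# conditional_pattern_base('m', header)
#   -> [(('f','c','a','b'), 1), (('f','c','a'), 2)], i.e. fcab:1 and fca:2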

  11. Step 2: Construct Conditional FP-tree
• For each pattern base:
- accumulate the count of each item in the base
- construct the FP-tree over the frequent items of the pattern base
ex) conditional pattern base of "m": fca:2, fcab:1 -> m-conditional FP-tree: the single path f:3, c:3, a:3
All frequent patterns concerning m: m, fm, cm, am, fcm, fam, cam, fcam
(a construction sketch follows)
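
A sketch of Step 2, representing the conditional FP-tree as nested dicts (item -> [count, children]) rather than full node objects; items whose accumulated count falls below ξ are dropped before insertion, and ties in the support ordering are broken by first appearance.

from collections import Counter

def conditional_fp_tree(base, xi):
    acc = Counter()
    for path, cnt in base:
        for item in path:
            acc[item] += cnt
    # support-descending order over the surviving items (stable in ties)
    seen = list(dict.fromkeys(i for path, _ in base for i in path))
    order = sorted((i for i in seen if acc[i] >= xi), key=lambda i: -acc[i])
    rank = {i: r for r, i in enumerate(order)}
    tree = {}
    for path, cnt in base:
        node = tree
        for item in sorted((i for i in path if i in rank), key=rank.get):
            entry = node.setdefault(item, [0, {}])
            entry[0] += cnt
            node = entry[1]
    return tree

base_m = [(('f','c','a'), 2), (('f','c','a','b'), 1)]
print(conditional_fp_tree(base_m, 3))
# {'f': [3, {'c': [3, {'a': [3, {}]}]}]} -- the single path f:3, c:3, a:3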

  12. Table 2. Conditional pattern bases and conditional FP-trees (ξ = 3)
item   conditional pattern base   conditional FP-tree
c      f:3                        {(f:3)} | c
a      fc:3                       {(f:3, c:3)} | a
b      fca:1, f:1, c:1            empty
m      fca:2, fcab:1              {(f:3, c:3, a:3)} | m
p      fcam:2, cb:1               {(c:3)} | p

  13. Lemma 3.1
<Lemma 3.1> (Fragment growth) Let α be a frequent itemset in DB, B be α's conditional pattern base, and β be an itemset in B. Then the support of α ∪ β in DB is equivalent to the support of β in B.
<Corollary 3.1> (Pattern growth) Let α be a frequent itemset in DB, B be α's conditional pattern base, and β be an itemset in B. Then α ∪ β is a frequent itemset in DB if β is frequent in B.

  14. Step 3: Recursively Mine the Conditional FP-trees
m-conditional FP-tree: the single path f:3, c:3, a:3 (header table: f, c, a)
• cond. pattern base of "am": (fc:3) -> am-conditional FP-tree: f:3, c:3
• cond. pattern base of "cm": (f:3) -> cm-conditional FP-tree: f:3
• cond. pattern base of "cam": (f:3) -> cam-conditional FP-tree: f:3

  15. Lemma 3.2
<Lemma 3.2> (Single FP-tree path pattern generation) Suppose an FP-tree T has a single path P. The complete set of frequent patterns of T can be generated by enumerating all the combinations of the sub-paths of P, with the support of each pattern being the minimum support of the items contained in the sub-path.

  16. Step 4: Single FP-tree Path Generation
• Suppose an FP-tree T has a single path P
• The complete set of frequent patterns of T can be generated by enumerating all the combinations of the sub-paths of P (sketched below)
ex) m-conditional FP-tree: the single path f:3, c:3, a:3
All frequent patterns concerning m: m, fm, cm, am, fcm, fam, cam, fcam
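
Lemma 3.2 reduces single-path mining to plain combinatorics; here is a short Python sketch (the suffix parameter stands for the already-accumulated pattern, e.g. m, so the sketch reproduces the list above except for m itself, which is emitted one recursion level earlier):

from itertools import combinations

def single_path_patterns(path, suffix=()):
    # Enumerate every combination of the (item, count) pairs on the path;
    # each pattern's support is the minimum count among its items.
    patterns = {}
    for k in range(1, len(path) + 1):
        for combo in combinations(path, k):
            items = tuple(i for i, _ in combo) + suffix
            patterns[items] = min(c for _, c in combo)
    return patterns

# m-conditional FP-tree: the single path f:3, c:3, a:3
print(single_path_patterns([('f', 3), ('c', 3), ('a', 3)], suffix=('m',)))
# fm, cm, am, fcm, fam, cam, fcam -- all with support 3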

  17. Algorithm 2 (FP-growth)
Procedure FP-growth(Tree, α) {
(1) if Tree contains a single path P
(2) then for each combination (denoted as β) of the nodes in the path P do
(3) generate pattern β ∪ α with support = minimum support of the nodes in β;
(4) else for each ai in the header of Tree do {
(5) generate pattern β = ai ∪ α with support = ai.support;
(6) construct β's conditional pattern base and then β's conditional FP-tree Treeβ;
(7) if Treeβ ≠ ∅
(8) then call FP-growth(Treeβ, β) } }
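
Below is a hedged executable rendering of this recursion in Python. To stay compact it represents each conditional FP-tree Treeβ by its conditional pattern base (a list of (prefix-path, count) pairs, never compressed into an actual tree) and omits the single-path shortcut of lines (1)-(3); under those simplifications it still yields the same patterns and supports on the Example 1 data.

from collections import Counter

def fp_growth(paths, xi, alpha=()):
    # paths: conditional pattern base standing in for Tree_alpha;
    # alpha: the suffix pattern grown so far.
    acc = Counter()
    for path, cnt in paths:
        for item in path:
            acc[item] += cnt
    results = {}
    for a_i in sorted((i for i, c in acc.items() if c >= xi), key=lambda i: acc[i]):
        beta = (a_i,) + alpha                 # pattern beta = {a_i} U alpha
        results[beta] = acc[a_i]              # support = a_i.support (Lemma 3.1)
        cond = []                             # beta's conditional pattern base
        for path, cnt in paths:
            if a_i in path:
                prefix = tuple(i for i in path[:path.index(a_i)] if acc[i] >= xi)
                if prefix:
                    cond.append((prefix, cnt))
        if cond:                              # Tree_beta not empty
            results.update(fp_growth(cond, xi, beta))
    return results

# ordered frequent-item lists of Table 1 (the output of Algorithm 1)
db = [('f','c','a','m','p'), ('f','c','a','b','m'), ('f','b'),
      ('c','b','p'), ('f','c','a','m','p')]
print(fp_growth([(t, 1) for t in db], xi=3))
# includes e.g. ('f','c','a','m'): 3 and ('c','p'): 3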

  18. FP-growth vs. Apriori: Scalability with the Support Threshold
Data set (D1): T25I20D10K / Data set (D2): T25I20D100K

  19. FP-growth vs. Apriori: Scalability with the Number of Transactions
Data set (D2): T25I20D100K (support threshold 1.5%)

  20. Run Time of FP-growth per Itemset vs. Support Threshold
Data set (D1): T25I20D10K / Data set (D2): T25I20D100K
