Data Mining: Concepts and Techniques — Chapter 5 — Mining Frequent Patterns
Slide credits: Jiawei Han and Micheline Kamber; George Kollios
Chapter 5: Mining Frequent Patterns, Associations and Correlations
• Basic concepts
• Efficient and scalable frequent itemset mining methods
• Mining various kinds of association rules
• From association mining to correlation analysis
• Constraint-based association mining
• Summary
What Is Frequent Pattern Analysis?
• Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
  • Frequent sequential pattern
  • Frequent structured pattern
• Motivation: finding inherent regularities in data
  • What products were often purchased together? Beer and diapers?!
  • What are the subsequent purchases after buying a PC?
  • What kinds of DNA are sensitive to this new drug?
  • Can we automatically classify web documents?
• Applications: basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis
Why Is Frequent Pattern Mining Important?
• Discloses an intrinsic and important property of data sets
• Forms the foundation for many essential data mining tasks:
  • Association, correlation, and causality analysis
  • Sequential and structural (e.g., sub-graph) patterns
  • Pattern analysis in spatiotemporal, multimedia, time-series, and stream data
  • Classification: associative classification
  • Cluster analysis: frequent-pattern-based clustering
  • Data warehousing: iceberg cube and cube-gradient
  • Semantic data compression: fascicles
• Broad applications
Frequent Itemset Mining
• Frequent itemset mining: finding frequent sets of items in a transaction data set
• First proposed by Agrawal, Imielinski, and Swami in SIGMOD 1993
  • SIGMOD Test of Time Award 2003: “This paper started a field of research. In addition to containing an innovative algorithm, its subject matter brought data mining to the attention of the database community … even led several years ago to an IBM commercial, featuring supermodels, that touted the importance of work such as that contained in this paper.”
• Apriori algorithm in VLDB 1994
  • #4 in the top 10 data mining algorithms in ICDM 2006
• R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In SIGMOD ’93.
• Apriori: Rakesh Agrawal and Ramakrishnan Srikant. Fast algorithms for mining association rules. In VLDB ’94.
Basic Concepts: Frequent Patterns and Association Rules
• Itemset: X = {x1, …, xk} (a k-itemset)
• Frequent itemset: an itemset X that meets a minimum support count
  • Support count (absolute support): the number of transactions containing X
• Association rule: A ⇒ B, with minimum support and confidence
  • Support: probability that a transaction contains A ∪ B, i.e., s = P(A ∪ B)
  • Confidence: conditional probability that a transaction containing A also contains B, i.e., c = P(B | A)
• Association rule mining process:
  • Find all frequent patterns (the more costly step)
  • Generate strong association rules from them
[Figure: Venn diagram of transactions showing customers who buy beer, customers who buy diapers, and the overlap who buy both]
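Both measures can be computed directly from a transaction list. A minimal Python sketch follows; the toy transactions and helper names are our own illustration, not from the slides:

    transactions = [
        {"beer", "diaper", "nuts"},
        {"beer", "diaper"},
        {"beer", "cola"},
        {"diaper", "milk"},
    ]

    def support(itemset, db):
        """Fraction of transactions that contain every item of `itemset`."""
        itemset = set(itemset)
        return sum(itemset <= t for t in db) / len(db)

    def confidence(antecedent, consequent, db):
        """c = P(consequent | antecedent) = sup(A union B) / sup(A)."""
        return (support(set(antecedent) | set(consequent), db)
                / support(antecedent, db))

    print(support({"beer", "diaper"}, transactions))       # 0.5
    print(confidence({"beer"}, {"diaper"}, transactions))  # 0.666...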
Illustration of Frequent Itemsets and Association Rules
• Frequent itemsets (minimum support count = 3)? {A:3, B:3, D:4, E:3, AD:3}
• Association rules (minimum support = 50%, minimum confidence = 50%)?
  • A ⇒ D (support 60%, confidence 100%)
  • D ⇒ A (support 60%, confidence 75%)
[Figure: the five-transaction database these counts refer to]
Chapter 5: Mining Frequent Patterns, Associations and Correlations
• Basic concepts
• Efficient and scalable frequent itemset mining methods
• Mining various kinds of association rules
• From association mining to correlation analysis
• Constraint-based association mining
• Summary
Scalable Methods for Mining Frequent Patterns
• Apriori (Agrawal & Srikant @ VLDB ’94) and variations
• Frequent pattern growth (FP-growth; Han, Pei & Yin @ SIGMOD ’00)
• Algorithms using vertical data format
• Closed and maximal patterns and their mining methods
• FIMI Workshop and implementation repository
Apriori: The Apriori Property
• The Apriori property of frequent patterns: any nonempty subset of a frequent itemset must also be frequent
  • If {beer, diaper, nuts} is frequent, so is {beer, diaper}: every transaction containing {beer, diaper, nuts} also contains {beer, diaper}
• Apriori pruning principle: if any itemset is infrequent, its supersets should not be generated or tested!
Apriori: Level-Wise Search Method
• Initially, scan the DB once to get the frequent 1-itemsets
• Generate length-(k+1) candidate itemsets from the length-k frequent itemsets
• Test the candidates against the DB
• Terminate when no frequent or candidate set can be generated
The Apriori Algorithm
Pseudo-code (Ck: candidate k-itemsets; Lk: frequent k-itemsets):

    L1 = frequent 1-itemsets;
    for (k = 2; Lk-1 ≠ ∅; k++) {
        Ck = candidates generated from Lk-1;
        for each transaction t in the database
            find all candidates in Ck that are subsets of t
            and increment their counts;
        Lk = candidates in Ck with min_support;
    }
    return ∪k Lk;
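As a concrete, hedged rendering of the pseudo-code, here is a compact runnable Python version. The candidate generation is a simplified union-based join without the prune step (the full join-and-prune refinement appears two slides below), and the four-transaction toy database is our own addition:

    from collections import defaultdict

    def apriori(db, min_count):
        """Level-wise Apriori over a list of transaction sets.
        Returns {frozenset: support count} for all frequent itemsets."""
        counts = defaultdict(int)
        for t in db:                        # scan 1: frequent 1-itemsets
            for item in t:
                counts[frozenset([item])] += 1
        Lk = {s: c for s, c in counts.items() if c >= min_count}
        frequent, k = dict(Lk), 2
        while Lk:
            prev = list(Lk)
            # simplified join: unions of two frequent (k-1)-itemsets of size k
            Ck = {a | b for i, a in enumerate(prev)
                  for b in prev[i + 1:] if len(a | b) == k}
            counts = defaultdict(int)
            for t in db:                    # scan k: count candidates in t
                for c in Ck:
                    if c <= t:
                        counts[c] += 1
            Lk = {s: c for s, c in counts.items() if c >= min_count}
            frequent.update(Lk)
            k += 1
        return frequent

    db = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
    print(apriori(db, min_count=2))   # includes {B,C,E}: 2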
The Apriori Algorithm — An Example
Sup_min = 2. [Figure: a small transaction DB is scanned three times: the 1st scan produces C1 and filters it to L1; self-joining L1 gives C2, and the 2nd scan filters it to L2; self-joining L2 gives C3, and the 3rd scan filters it to L3.]
Important Details of Apriori
• How to generate candidate sets?
• How to count supports for candidate sets?
Candidate Set Generation
• Step 1: self-join Lk-1
  • Assuming items within itemsets are sorted, two (k-1)-itemsets are joinable only if their first k-2 items are in common
• Step 2: prune
  • Prune a candidate if it has an infrequent (k-1)-subset
• Example: generate C4 from L3 = {abc, abd, acd, ace, bcd}
  • Step 1, self-joining L3 ⋈ L3: abcd from abc and abd; acde from acd and ace; abce? (no: abe is not in L3, so there is no pair to join)
  • Step 2, pruning: acde is removed because ade is not in L3
  • C4 = {abcd}
A sketch of this join-and-prune step appears below.
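A hedged Python sketch of the generation step (the function name gen_candidates is our own), run on the slide's own example:

    from itertools import combinations

    def gen_candidates(Lk_1, k):
        """Ck from L(k-1): self-join on the first k-2 items, then prune
        any candidate that has an infrequent (k-1)-subset."""
        prev = sorted(tuple(sorted(s)) for s in Lk_1)
        prev_set = set(prev)
        Ck = set()
        for i, a in enumerate(prev):
            for b in prev[i + 1:]:
                if a[:k - 2] == b[:k - 2]:                   # join step
                    cand = tuple(sorted(set(a) | set(b)))
                    if all(sub in prev_set                    # prune step
                           for sub in combinations(cand, k - 1)):
                        Ck.add(cand)
        return Ck

    L3 = ["abc", "abd", "acd", "ace", "bcd"]
    print(gen_candidates(L3, 4))   # {('a','b','c','d')}; acde is pruned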
How to Count Supports of Candidates?
• Why is counting supports of candidates a problem?
  • The total number of candidates can be very large
  • One transaction may contain many candidates
• Method: build a hash tree for the candidate itemsets
  • A leaf node contains a list of itemsets and counts
  • An interior node contains a hash function determining which branch to follow
• Subset function: for each transaction, find all candidates contained in the transaction using the hash tree
Example: Counting Supports of Candidates
[Figure: a hash tree over candidate 3-itemsets, with a hash function branching on items 1,4,7 / 2,5,8 / 3,6,9; leaves hold candidate lists such as {1 4 5}, {1 2 4}, {4 5 7}. A transaction is matched against the tree to visit exactly the leaves that can contain its subsets.]
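In lieu of a full hash-tree implementation, here is a simplified counting sketch that keeps the candidates in a hash set and enumerates each transaction's k-subsets. The hash tree in the figure exists precisely to avoid materializing all of those subsets for long transactions; names here are our own:

    from collections import defaultdict
    from itertools import combinations

    def count_supports(db, candidates, k):
        """Count each candidate k-itemset (candidates as sorted tuples)."""
        cand_set = set(candidates)
        counts = defaultdict(int)
        for t in db:
            for sub in combinations(sorted(t), k):   # all k-subsets of t
                if sub in cand_set:
                    counts[sub] += 1
        return counts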
Improving the Efficiency of Apriori
• Bottlenecks:
  • Multiple scans of the transaction database
  • Huge number of candidates
  • Tedious workload of support counting for candidates
• General ideas for improving Apriori:
  • Reduce the number of transaction database scans
  • Reduce the number of transactions
  • Shrink the number of candidates
  • Facilitate support counting of candidates
Partitioning: Reduce the Number of Scans
• Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
• Scan 1: partition the database into n partitions and find the local frequent patterns (what is the local minimum support count? The global support fraction scaled to the partition size)
• Scan 2: determine the global frequent patterns from the collection of all local frequent patterns
• A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. In VLDB ’95.
A sketch of the two scans follows.
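A hedged sketch of the two-scan scheme, assuming any in-memory miner (e.g., the Apriori sketch above) is passed in as the mine parameter:

    import math

    def partitioned_mining(db, min_sup_frac, n_parts, mine):
        """Two-scan mining in the style of Savasere et al.
        `mine(part, min_count)` returns {frozenset: count} for the
        partition's locally frequent itemsets."""
        size = math.ceil(len(db) / n_parts)
        parts = [db[i:i + size] for i in range(0, len(db), size)]
        cands = set()                    # scan 1: local frequent itemsets
        for part in parts:
            cands |= set(mine(part, math.ceil(min_sup_frac * len(part))))
        min_count = math.ceil(min_sup_frac * len(db))
        counts = {c: sum(c <= t for t in db) for c in cands}   # scan 2
        return {c: n for c, n in counts.items() if n >= min_count}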
DIC: Reduce the Number of Scans
• DIC (dynamic itemset counting): add new candidate itemsets at partition points of the database, rather than only between full scans
  • Once both A and D are determined frequent, the counting of AD begins
  • Once all length-2 subsets of BCD are determined frequent, the counting of BCD begins
[Figure: the itemset lattice from {} up to ABCD, and a timeline showing that Apriori starts counting 2-itemsets only after a full scan for 1-itemsets, while DIC starts counting 2- and 3-itemsets part-way through earlier scans]
• S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In SIGMOD ’97.
DHP: Reduce the Number of Candidates
• DHP (direct hashing and pruning): hash k-itemsets into buckets; a k-itemset whose bucket count is below the threshold cannot be frequent
• Especially useful for 2-itemsets: generate a hash table of 2-itemsets during the scan for frequent 1-itemsets
• Example: if the minimum support count is 3, the 2-itemsets falling in buckets whose counts are below 3 (buckets 0, 1, 3, 4 in the slide's hash table) need not be included in the candidate 2-itemsets
• J. Park, M. Chen, and P. Yu. An effective hash-based algorithm for mining association rules. In SIGMOD ’95.
A first-scan sketch appears below.
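A sketch of the combined first scan, under the assumption of simple Python-hash bucketing (the paper's hash function differs):

    from collections import defaultdict
    from itertools import combinations

    def dhp_first_scan(db, min_count, n_buckets=7):
        """Count 1-itemsets and, in the same pass, hash every 2-itemset
        of each transaction into a bucket; light buckets later prune C2."""
        item_counts = defaultdict(int)
        buckets = [0] * n_buckets
        for t in db:
            for item in t:
                item_counts[item] += 1
            for pair in combinations(sorted(t), 2):
                buckets[hash(pair) % n_buckets] += 1

        def may_be_frequent(pair):           # pruning test for C2
            return buckets[hash(pair) % n_buckets] >= min_count

        return item_counts, may_be_frequent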
Sampling for Frequent Patterns
• Select a sample of the original database and mine frequent patterns within the sample using Apriori
• Scan the database once to verify the frequent itemsets found in the sample; only the closure of the frequent patterns needs to be checked
  • Example: check abcd instead of ab, ac, …, etc.
• Use a support threshold lower than the minimum support on the sample
• Trades accuracy against efficiency
• H. Toivonen. Sampling large databases for association rules. In VLDB ’96.
Assignment 2
• Implementation and evaluation of Apriori
• Performance competition with prizes!
Scalable Methods for Mining Frequent Patterns
• Apriori (Agrawal & Srikant @ VLDB ’94) and variations
• Frequent pattern growth (FP-growth; Han, Pei & Yin @ SIGMOD ’00)
• Algorithms using vertical data format
• Closed and maximal patterns and their mining methods
• FIMI Workshop and implementation repository
Mining Frequent Patterns Without Candidate Generation
• Basic idea: grow long patterns from short ones using local frequent items
  • “abc” is a frequent pattern
  • Get all transactions containing “abc”: DB|abc
  • If “d” is a local frequent item in DB|abc, then abcd is a frequent pattern
• FP-growth:
  • Construct the FP-tree
  • Divide the compressed database into a set of conditional databases and mine them separately
Construct the FP-tree from a Transaction Database
min_support = 3

TID  Items bought               (Ordered) frequent items
100  {f, a, c, d, g, i, m, p}   {f, c, a, m, p}
200  {a, b, c, f, l, m, o}      {f, c, a, b, m}
300  {b, f, h, j, o, w}         {f, b}
400  {b, c, k, s, p}            {c, b, p}
500  {a, f, c, e, l, p, m, n}   {f, c, a, m, p}

• Scan the DB once and find the frequent 1-itemsets (single-item patterns)
• Sort the frequent items in descending frequency order: the f-list, here f-c-a-b-m-p
• Scan the DB again and construct the FP-tree
[Figure: the resulting FP-tree with header table (f:4, c:4, a:3, b:3, m:3, p:3) and shared prefix paths such as f:4, c:3, a:3, m:2, p:2]
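A minimal FP-tree construction sketch in Python; the class and function names are our own, and ties in the f-list (f and c both have count 4) may order differently than the slide's f-c-a-b-m-p:

    from collections import Counter, defaultdict

    class Node:
        def __init__(self, item, parent):
            self.item, self.parent, self.count = item, parent, 0
            self.children = {}

    def build_fp_tree(db, min_count):
        """Scan 1: build the f-list. Scan 2: insert each transaction's
        frequent items, in f-list order, into a shared prefix tree."""
        freq = Counter(item for t in db for item in t)
        rank = {item: r for r, (item, c) in enumerate(freq.most_common())
                if c >= min_count}
        root, header = Node(None, None), defaultdict(list)
        for t in db:
            node = root
            for item in sorted((i for i in t if i in rank), key=rank.get):
                child = node.children.get(item)
                if child is None:
                    child = Node(item, node)
                    node.children[item] = child
                    header[item].append(child)   # the item's node-links
                child.count += 1
                node = child
        return root, header

    db = [set("facdgimp"), set("abcflmo"), set("bfhjow"),
          set("bcksp"), set("afcelpmn")]
    root, header = build_fp_tree(db, min_count=3)
    print(sum(n.count for n in header["f"]))   # 4 transactions contain f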
Prefix Tree (Trie)
• Prefix tree:
  • Keys are usually strings
  • All descendants of a node share a common prefix
• Advantages:
  • Fast lookup: O(m), where m is the key length
  • Less space with a large number of short strings
  • Helps with longest-prefix matching
• Applications:
  • Storing dictionaries
  • Approximate matching algorithms, including spell checking
A minimal trie sketch follows.
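A minimal dict-based trie in Python, illustrating the O(key-length) lookup:

    class Trie:
        def __init__(self):
            self.root = {}

        def insert(self, word):
            node = self.root
            for ch in word:
                node = node.setdefault(ch, {})
            node["$"] = True                  # end-of-word marker

        def contains(self, word):
            """O(len(word)) lookup, independent of how many keys are stored."""
            node = self.root
            for ch in word:
                if ch not in node:
                    return False
                node = node[ch]
            return "$" in node

    t = Trie()
    t.insert("frequent")
    print(t.contains("frequent"), t.contains("freq"))   # True False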
Benefits of the FP-tree Structure
• Completeness:
  • Preserves complete information for frequent pattern mining
  • Never breaks a long pattern of any transaction
• Compactness:
  • Removes irrelevant information: infrequent items are gone
  • Items appear in descending frequency order, so the more frequently an item occurs, the more likely its nodes are shared
  • Never larger than the original database (not counting node-links and count fields)
  • For the Connect-4 DB, the compression ratio can exceed 100
Mining Frequent Patterns with FP-trees
• Idea: frequent pattern growth; recursively grow frequent patterns by partitioning both the pattern space and the database
• Method:
  • For each frequent item, construct its conditional pattern base and then its conditional FP-tree
  • Repeat the process on each newly created conditional FP-tree
  • Stop when the resulting FP-tree is empty or contains only a single path; a single path generates all the combinations of its sub-paths, each of which is a frequent pattern (see the sketch below)
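The single-path case is simple enough to sketch directly: given the path and a suffix already mined, each combination of path nodes is frequent with the minimum count along it. The function name is our own:

    from itertools import combinations

    def single_path_patterns(path, suffix=()):
        """`path` is a single-path FP-tree given as [(item, count), ...]."""
        for r in range(1, len(path) + 1):
            for combo in combinations(path, r):
                items = tuple(i for i, _ in combo) + suffix
                yield items, min(c for _, c in combo)

    # The m-conditional FP-tree from the upcoming example: f:3 -> c:3 -> a:3
    for pattern, count in single_path_patterns(
            [("f", 3), ("c", 3), ("a", 3)], suffix=("m",)):
        print(pattern, count)   # fm:3, cm:3, am:3, fcm:3, ..., fcam:3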
Partition Patterns and Databases
• Frequent patterns can be partitioned into subsets according to the f-list f-c-a-b-m-p:
  • Patterns containing p
  • Patterns containing m but not p
  • …
  • Patterns containing c but none of a, b, m, p
  • Pattern f
• This partitioning is complete and non-redundant
Find Patterns Containing p from p's Conditional Pattern Base
• Start at the frequent-item header table of the FP-tree
• Traverse the FP-tree by following the node-links of each frequent item p
• Accumulate all transformed prefix paths of item p to form p's conditional pattern base

Conditional pattern bases (from the example FP-tree):
item  conditional pattern base
c     f:3
a     fc:3
b     fca:1, f:1, c:1
m     fca:2, fcab:1
p     fcam:2, cb:1
From Conditional Pattern Bases to Conditional FP-trees
• For each pattern base:
  • Accumulate the count for each item in the base
  • Construct the FP-tree for the frequent items of the pattern base
• Repeat the process on each newly created conditional FP-tree until the resulting tree is empty or has a single path
• Example: the p-conditional pattern base is {fcam:2, cb:1}; with min_support = 3 only c survives, so the p-conditional FP-tree is the single node c:3, and all frequent patterns containing p are p and cp
Finding Patterns Containing m
• Construct the m-conditional pattern base, then its conditional FP-tree
  • m-conditional pattern base: fca:2, fcab:1
  • m-conditional FP-tree (min_support = 3): the single path {} → f:3 → c:3 → a:3
• Repeat the process on each newly created conditional FP-tree until the tree is empty or has a single path
• All frequent patterns containing m: m, fm, cm, am, fcm, fam, cam, fcam
Recursion: Mining Each Conditional FP-tree
• From the m-conditional FP-tree ({} → f:3 → c:3 → a:3):
  • Conditional pattern base of “am”: (fc:3) → am-conditional FP-tree {} → f:3 → c:3
  • Conditional pattern base of “cm”: (f:3) → cm-conditional FP-tree {} → f:3
  • Conditional pattern base of “cam”: (f:3) → cam-conditional FP-tree {} → f:3
FP-Growth vs. Apriori: Scalability with the Support Threshold
[Figure: run time vs. support threshold on data set T25I20D10K; FP-growth stays fast as the threshold drops, while Apriori's run time grows sharply]
Why Is FP-Growth the Winner?
• Divide and conquer: it decomposes both the mining task and the DB, leading to focused search of smaller databases
  • It uses the least frequent items as suffixes (offering good selectivity), finds shorter patterns recursively, and concatenates them with the suffix
• Other factors:
  • No candidate generation and no candidate tests
  • Compressed database: the FP-tree structure
  • No repeated scans of the entire database
  • Basic operations are counting local frequent items and building sub-FP-trees; there is no pattern search and matching
Scalable Methods for Mining Frequent Patterns
• Apriori (Agrawal & Srikant @ VLDB ’94) and variations
• Frequent pattern growth (FP-growth; Han, Pei & Yin @ SIGMOD ’00)
• Algorithms using vertical data format (ECLAT)
• Closed and maximal patterns and their mining methods
• FIMI Workshop and implementation repository
ECLAT
• M. J. Zaki. Scalable algorithms for association mining. IEEE TKDE, 12, 2000.
• For each item, store a list of transaction ids (tids): the tid-list

Horizontal data layout:
TID  Items
1    A, B, E
2    B, C, D
3    C, E
4    A, C, D
5    A, B, C, D
6    A, E
7    A, B
8    A, B, C
9    A, C, D
10   B

Vertical data layout (tid-lists):
A: 1, 4, 5, 6, 7, 8, 9
B: 1, 2, 5, 7, 8, 10
C: 2, 3, 4, 5, 8, 9
D: 2, 4, 5, 9
E: 1, 3, 6
ECLAT (cont.)
• Determine the support of any k-itemset by intersecting the tid-lists of two of its (k-1)-subsets
• Three traversal approaches: top-down, bottom-up, and hybrid
• Advantage: very fast support counting
• Disadvantage: intermediate tid-lists may become too large for memory
A bottom-up sketch follows.
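A minimal bottom-up ECLAT sketch on the vertical layout above; the function and variable names are our own:

    def eclat(prefix, items, min_count, out):
        """`items` is a list of (item, tidset) pairs; extend `prefix`
        item by item, intersecting tid-lists instead of rescanning."""
        for i, (item, tids) in enumerate(items):
            if len(tids) >= min_count:
                pattern = prefix + (item,)
                out[pattern] = len(tids)
                suffix = [(j, tids & tj) for j, tj in items[i + 1:]]
                eclat(pattern, suffix, min_count, out)
        return out

    vertical = {"A": {1, 4, 5, 6, 7, 8, 9}, "B": {1, 2, 5, 7, 8, 10},
                "C": {2, 3, 4, 5, 8, 9}, "D": {2, 4, 5, 9}, "E": {1, 3, 6}}
    out = eclat((), sorted(vertical.items()), min_count=3, out={})
    print(out[("A", "C", "D")])   # 3 (tids 4, 5, 9)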
Scalable Methods for Mining Frequent Patterns
• Apriori (Agrawal & Srikant @ VLDB ’94) and variations
• Frequent pattern growth (FP-growth; Han, Pei & Yin @ SIGMOD ’00)
• Algorithms using vertical data format (ECLAT)
• Closed and maximal patterns and their mining methods
  • Concepts
  • Max-patterns: MaxMiner, MAFIA
  • Closed patterns: CLOSET, CLOSET+, CARPENTER
• FIMI Workshop
Closed Patterns and Max-Patterns
• A long pattern contains a combinatorial number of sub-patterns: {a1, …, a100} contains 2^100 − 1 ≈ 1.27 × 10^30 sub-patterns!
• Solution: mine “boundary” patterns
• An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X (proposed by Pasquier et al. @ ICDT ’99)
• An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X (proposed by Bayardo @ SIGMOD ’98)
• Closed patterns are a lossless compression of the frequent patterns and their support counts
• Both reduce the number of patterns and rules
Max-Patterns
• Frequent patterns without frequent super-patterns
• Example (min_sup = 2): BCDE and ACD are max-patterns; BCD is not a max-pattern
[Figure: the small transaction table this example is drawn from]
Max-Patterns Illustration
• An itemset is maximal frequent if none of its immediate supersets is frequent
[Figure: the itemset lattice with the border between frequent and infrequent itemsets; the maximal itemsets sit just inside the border]
Closed Patterns
• An itemset is closed if none of its immediate supersets has the same support as the itemset
A small test sketch follows.
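Given all frequent itemsets and their supports, both properties are easy to test. A hedged sketch (checking all supersets rather than only immediate ones is equivalent, since support is anti-monotone):

    def closed_and_maximal(supports):
        """`supports` maps frozenset -> support count for all frequent
        itemsets; returns (closed itemsets, maximal itemsets)."""
        closed, maximal = set(), set()
        for x, sx in supports.items():
            supersets = [y for y in supports if x < y]
            if all(supports[y] != sx for y in supersets):
                closed.add(x)              # no superset ties its support
            if not supersets:
                maximal.add(x)             # no frequent superset at all
        return closed, maximal

    # The earlier illustration: A:3, D:4, AD:3
    sup = {frozenset("A"): 3, frozenset("D"): 4, frozenset("AD"): 3}
    print(closed_and_maximal(sup))   # closed: {D}, {A,D}; maximal: {A,D}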
Exercise: Closed Patterns and Max-Patterns
• DB = {<a1, …, a100>, <a1, …, a50>}, min_sup = 1
  • What is the set of closed itemsets? {a1, …, a100} with support 1, and {a1, …, a50} with support 2
  • What is the set of max-patterns? {a1, …, a100} with support 1
  • What is the set of all patterns? All 2^100 − 1 nonempty subsets of {a1, …, a100}!
Scalable Methods for Mining Frequent Patterns
• Apriori (Agrawal & Srikant @ VLDB ’94) and variations
• Frequent pattern growth (FP-growth; Han, Pei & Yin @ SIGMOD ’00)
• Algorithms using vertical data format (ECLAT)
• Closed and maximal patterns and their mining methods
  • Concepts
  • Max-pattern mining: MaxMiner, MAFIA
  • Closed pattern mining: CLOSET, CLOSET+, CARPENTER
• FIMI Workshop
MaxMiner: Mining Max-Patterns
• R. Bayardo. Efficiently mining long patterns from databases. In SIGMOD ’98.
• Idea: generate the complete set-enumeration tree one level at a time, pruning whenever applicable
[Figure: the set-enumeration tree over items A, B, C, D; each node lists its head with its candidate tail in parentheses, e.g. the root (ABCD), then A (BCD), B (CD), C (D), D (), down through AB (CD), …, ABC, …, ABCD ()]
• (Aside: Bayardo et al. Data privacy through optimal k-anonymization. ICDE 2005.)
Algorithm MaxMiner
• Initially, generate one node N with head h(N) = ∅ and tail t(N) = {A, B, C, D}
• Recursively expand N, with two kinds of pruning:
  • Local pruning:
    • If h(N) ∪ t(N) is frequent, do not expand N (every itemset in the subtree is frequent, so only h(N) ∪ t(N) matters)
    • If for some i ∈ t(N), h(N) ∪ {i} is not frequent, remove i from t(N) before expanding N
  • Global pruning
Local Pruning Techniques (e.g., at node A)
• Check the frequency of ABCD and of AB, AC, AD
  • If ABCD is frequent, prune the whole subtree below A
  • If AC is not frequent, remove C from the parenthesis (A's tail) before expanding
[Figure: the same set-enumeration tree, from A (BCD) down to ABCD ()]