
Mining Association Rules



  1. Mining Association Rules Charis Ermopoulos Qian Yang Yong Yang Hengzhi Zhong

  2. Outline • Basic concepts and road map • Scalable frequent pattern mining methods • Association rules generation • Research Problems

  3. Frequent Pattern Analysis • Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set • Frequent pattern mining • Finding inherent regularities in data • The foundation of many data mining tasks (association, correlation, classification, etc.) • Applications: basket data analysis, cross-marketing, catalog design, web log analysis.

  4. What is association rule mining? • Given: (1) a database of transactions, (2) each transaction is a list of items (purchased by a customer in one visit) • Find: all rules that correlate the presence of one set of items with that of another set of items • E.g., 98% of people who purchase tires and auto accessories also get automotive services done • Itemset X = {x1, …, xk} • Find all the rules X → Y with minimum support and confidence

  5. What is association rule mining? • Support: the rule X => Y has support s in the transaction set D if s% of the transactions in D contain X ∪ Y • Confidence: the rule X => Y has confidence c if c% of the transactions in D that contain X also contain Y • Association rule mining • Find all itemsets that meet the minimum support (frequent pattern mining) • Generate association rules from these itemsets
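
A minimal Python sketch of these two measures (the transaction data and the rule below are made up for illustration):

transactions = [
    {"tires", "auto_accessories", "auto_services"},
    {"tires", "auto_accessories", "auto_services"},
    {"tires", "battery"},
    {"auto_accessories", "auto_services"},
]

def support(itemset, transactions):
    # fraction of transactions that contain every item in `itemset`
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(X, Y, transactions):
    # fraction of transactions containing X that also contain Y
    return support(X | Y, transactions) / support(X, transactions)

X = {"tires", "auto_accessories"}
Y = {"auto_services"}
print(support(X | Y, transactions))    # support of the rule X => Y
print(confidence(X, Y, transactions))  # confidence of X => Y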

  6. Generate Frequent Itemsets • 3 major approaches • Apriori (Agrawal & Srikant@VLDB’94) • Freq. pattern growth (FPgrowth—Han, Pei & Yin @SIGMOD’00) • Vertical data format approach (Charm—Zaki & Hsiao @SDM’02)

  7. Apriori • Initially, scan the DB once to get the frequent 1-itemsets • Generate length-(k+1) candidate itemsets from the length-k frequent itemsets • Test the candidates against the DB • Terminate when no frequent or candidate set can be generated
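
A compact, runnable sketch of this level-wise loop (names and the use of Python frozensets are illustrative; min_sup is taken as an absolute count, and the join is a simplified union-based one rather than the classic prefix join):

from itertools import combinations

def apriori(transactions, min_sup):
    transactions = [set(t) for t in transactions]
    # 1st scan: find the frequent 1-itemsets
    items = {i for t in transactions for i in t}
    Lk = {frozenset([i]) for i in items
          if sum(i in t for t in transactions) >= min_sup}
    frequent = set(Lk)
    k = 1
    while Lk:
        # generate length-(k+1) candidates from length-k frequent itemsets
        Ck = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        # Apriori pruning: every k-subset of a candidate must be frequent
        Ck = {c for c in Ck
              if all(frozenset(s) in Lk for s in combinations(c, k))}
        # test the surviving candidates against the DB (one scan per level)
        Lk = {c for c in Ck if sum(c <= t for t in transactions) >= min_sup}
        frequent |= Lk
        k += 1  # loop ends when no frequent or candidate set is generated
    return frequent

# reuses the transaction data from the FPGrowth slides below, min support 3
db = ["facdgimp", "abcflmo", "bfhjow", "bcksp", "afcelpmn"]
print(sorted("".join(sorted(s)) for s in apriori(db, 3)))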

  8. [Walk-through diagram with Sup_min = 2: 1st scan produces C1, pruned to L1; self-join gives C2; 2nd scan counts C2, pruned to L2; self-join gives C3; 3rd scan counts C3, pruned to L3]

  9. Apriori candidate generation • How to generate candidates? • Step 1: self-joining Lk • Step 2: pruning • Example of Candidate-generation • L3={abc, abd, acd, ace, bcd} • Self-joining: L3*L3 • abcd from abc and abd • acde from acd and ace • Pruning: • acde is removed because ade is not in L3 • C4={abcd}
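
The same two steps run on the slide's example, as a self-contained sketch (the join is again the simplified union-based variant; pruning removes everything the classic prefix join would also avoid):

from itertools import combinations

L3 = {frozenset(s) for s in ("abc", "abd", "acd", "ace", "bcd")}
k = 3

# Step 1: self-join L3 * L3 (union pairs whose result has k+1 items)
C4 = {a | b for a in L3 for b in L3 if len(a | b) == k + 1}

# Step 2: prune candidates that have an infrequent k-subset
C4 = {c for c in C4 if all(frozenset(s) in L3 for s in combinations(c, k))}

print(sorted("".join(sorted(c)) for c in C4))  # ['abcd']; acde is pruned (ade not in L3)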

  10. Drawbacks • Multiple scans of the transaction database • Multiple database scans are costly • Huge number of candidates • To find the frequent itemset i1i2…i100 • # of scans: 100 • # of candidates: C(100,1) + C(100,2) + … + C(100,100) = 2^100 − 1 ≈ 1.27 × 10^30 !

  11. FPGrowth • Uses the Apriori Pruning Principle • Scan DB only twice! • Once to find frequent 1-itemset (single item pattern) • Once to construct FP-tree, the data structure of FPGrowth

  12. FPGrowth: scan 1 finds the frequent items; scan 2 inserts each transaction's frequent items in descending frequency order:

TID | Items bought             | (ordered) frequent items
100 | {f, a, c, d, g, i, m, p} | {f, c, a, m, p}
200 | {a, b, c, f, l, m, o}    | {f, c, a, b, m}
300 | {b, f, h, j, o, w}       | {f, b}
400 | {b, c, k, s, p}          | {c, b, p}
500 | {a, f, c, e, l, p, m, n} | {f, c, a, m, p}

Header table (item : frequency): f:4, c:4, a:3, b:3, m:3, p:3

[FP-tree after inserting TID 100: {} → f:1 → c:1 → a:1 → m:1 → p:1]

  13. FPGrowth: the completed FP-tree after all five transactions are inserted (transaction and header tables as on slide 12):

{}
├─ f:4
│  ├─ c:3
│  │  └─ a:3
│  │     ├─ m:2
│  │     │  └─ p:2
│  │     └─ b:1
│  │        └─ m:1
│  └─ b:1
└─ c:1
   └─ b:1
      └─ p:1
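
A minimal sketch of this two-scan construction (class and variable names are illustrative; ties between equally frequent items are broken alphabetically here, so the exact tree shape can differ from the slide's):

from collections import Counter, defaultdict

transactions = ["facdgimp", "abcflmo", "bfhjow", "bcksp", "afcelpmn"]
min_sup = 3

# scan 1: count item frequencies and keep the frequent items
counts = Counter(i for t in transactions for i in t)
freq = {i: c for i, c in counts.items() if c >= min_sup}

class Node:
    def __init__(self, item, parent):
        self.item, self.parent, self.count = item, parent, 1
        self.children = {}

root = Node(None, None)
header = defaultdict(list)  # item -> node-links into the tree

# scan 2: insert each transaction's frequent items in descending frequency order
for t in transactions:
    path = sorted((i for i in t if i in freq), key=lambda i: (-freq[i], i))
    node = root
    for item in path:
        if item in node.children:
            node.children[item].count += 1
        else:
            node.children[item] = Node(item, node)
            header[item].append(node.children[item])
        node = node.children[item]

# each item's total count is recoverable from its node-links
print({i: sum(n.count for n in header[i]) for i in freq})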

  14. FPGrowth: each header-table entry (f:4, c:4, a:3, b:3, m:3, p:3) also stores a node-link head that threads together all nodes for that item in the FP-tree above.

  15. FPGrowth: mining with conditional pattern bases (header table and FP-tree as above):

Item | conditional pattern base | frequent itemsets
p    | fcam:2, cb:1             | fp, cp, ap, mp, pfc, pfa, pfm, pca, pcm, pam, pfcam
m    | fca:2, fcab:1            | fm, cm, am, fcm, fam, cam, fcam
b    | fca:1, f:1, c:1          | …
a    | fc:3                     | …
c    | f:3                      | …

  16. FPGrowth vs Apriori • No candidate generation, no candidate test • Compressed database: the FP-tree structure • No repeated scans of the entire database • Basic operations are counting local frequent items and building sub-FP-trees; no pattern search and matching

  17. FPGrowth vs Apriori [Run-time comparison chart on the synthetic data set T25I20D10K]

  18. FPGrowth vs Apriori • Run-time comparison on a dense data set (http://www.cs.yorku.ca/course_archive/2005-06/F/6412/lecnotes/assorule3-2.pdf)

  19. Scaling • DB projection • What if the FP-tree cannot fit in memory? • Partition the database into a set of projected DBs • Construct and mine an FP-tree for each projected DB

  20. DB Projection

Tran. DB  : fcamp, fcabm, fb, cbp, fcamp
p-proj DB : fcam, cb, fcam
m-proj DB : fcab, fca, fca
b-proj DB : f, cb, …
a-proj DB : fc, …
c-proj DB : f, …
f-proj DB : …
am-proj DB: fc, fc, fc
cm-proj DB: f, f, f
…
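
A sketch of how these projected DBs arise: for every transaction (items already in frequency order) that contains the projection item, keep the prefix that precedes it. The data is the ordered transaction list from the slides:

trans = ["fcamp", "fcabm", "fb", "cbp", "fcamp"]

def project(db, item):
    # prefix of each transaction before `item`, for transactions containing it
    return [t[:t.index(item)] for t in db if item in t]

print(project(trans, "p"))                # ['fcam', 'cb', 'fcam'] : p-proj DB
print(project(trans, "m"))                # ['fca', 'fcab', 'fca'] : m-proj DB
print(project(project(trans, "m"), "a"))  # ['fc', 'fc', 'fc']     : am-proj DB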

  21. Charm: Closed Itemset Mining • A frequent itemset of size s has 2^s − 2 frequent subsets • s is large in many real-world problems • biosequences, census data, etc. • Only generate itemsets that cannot be subsumed by others with the same support • If A ⊂ B and sup(A) = sup(B), A will not be in the result • But sup(A) can be inferred from B and others

  22. Closed Itemset • An itemset X is closed if X is frequent and X has no superset Y such that support(Y) = support(X) • Lossless compression of the set of frequent itemsets • Divides frequent itemsets into equivalence classes
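
A small sketch of this definition as a filter: keep an itemset only if no proper superset has the same support (the support table below is made up for illustration):

# freq maps frequent itemset -> support count (illustrative numbers)
freq = {
    frozenset("a"): 3, frozenset("ab"): 3, frozenset("abc"): 2,
    frozenset("b"): 4, frozenset("c"): 2,
}

closed = {X: s for X, s in freq.items()
          if not any(X < Y and s == sY for Y, sY in freq.items())}

print(sorted(("".join(sorted(X)), s) for X, s in closed.items()))
# [('ab', 3), ('abc', 2), ('b', 4)]: 'a' and 'c' are subsumed, losslessly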

  23. Charm: Search in IT-Tree • Each node holds an itemset together with its tidset (the ids of the transactions that contain it)

  24. Charm: properties used for pruning • For itemsets Xi, Xj and their tidsets t(Xi), t(Xj): • If t(Xi) = t(Xj): sup(Xi) = sup(Xj) = sup(Xi ∪ Xj); Xi and Xj always occur together • If t(Xi) ⊂ t(Xj): sup(Xj) ≠ sup(Xi) = sup(Xi ∪ Xj); whenever Xi occurs, Xj also occurs • If t(Xj) ⊂ t(Xi): sup(Xi) ≠ sup(Xj) = sup(Xi ∪ Xj); whenever Xj occurs, Xi also occurs • Otherwise (t(Xi) and t(Xj) incomparable): sup(Xi) ≠ sup(Xi ∪ Xj) and sup(Xj) ≠ sup(Xi ∪ Xj)

  25. [IT-tree search example, minimum support = 3; each node is itemset × tidset]

Level 1: T × 1356, A × 1345, D × 2456, C × 123456, W × 12345
Level 2: TA × 135, TC × 1356, TW × 135, DT × 56, DA × 45, DC × 2456, DW × 245, AW × 1345, WC × 12345
Level 3: TAC × 135, TWC × 135, AWC × 1345, DWC × 245
Level 4: TAWC × 135

  26. Charm: diffsets for fast counting • Tidset approach: maintain a (disk-based) tidset for each item (vertical format) • Support is easy to compute: the cardinality of the tidset • But intersection is expensive when tidsets are large • Diffset: track only the tids in which a child node differs from its parent • Saves memory when tidsets are large and the differences are small [Diagram: t(X), t(Y), and the diffset d(XY)]
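
A tiny sketch of the diffset arithmetic, with made-up tidsets: store d(XY) = t(X) − t(Y) instead of the intersection, and recover support by subtraction:

t_X = {1, 2, 3, 4, 5}          # tids of transactions containing X
t_Y = {1, 3, 5, 6}             # tids of transactions containing Y

d_XY = t_X - t_Y               # diffset: tids that contain X but not Y
sup_XY = len(t_X) - len(d_XY)  # |t(X) ∩ t(Y)| without materializing it

print(d_XY, sup_XY)            # {2, 4} 3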

  27. Association Rules • A → B reads "A implies B" • The easiest way to mine association rules is to first mine frequent itemsets.

  28. Mining Association Rules • Quantity Problem • Many frequent itemsets, many rules • Redundant rules • Quality Problem • Not all rules are “interesting”

  29. Association Rules • Frequent itemset: AB • Derived rules: A → B and B → A • Support(A → B) = P(A, B) = |AB| / N • Confidence(A → B) = P(B|A) = P(A, B) / P(A) = |AB| / |A| • |AB|: count of transactions containing both A and B; |A|: count of those containing A; |B|: count of those containing B; N: total number of records
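
Given the supports, rule generation is a loop over the nonempty proper subsets of each frequent itemset, keeping the rules that pass the confidence threshold. A sketch with made-up counts:

from itertools import combinations

support = {frozenset("A"): 6, frozenset("B"): 5, frozenset("AB"): 4}
N, min_conf = 10, 0.6
itemset = frozenset("AB")

for r in range(1, len(itemset)):
    for X in map(frozenset, combinations(itemset, r)):
        Y = itemset - X
        conf = support[itemset] / support[X]   # |AB| / |X|
        if conf >= min_conf:
            print(f"{set(X)} -> {set(Y)}  sup={support[itemset]/N:.0%}"
                  f"  conf={conf:.0%}")
# {'A'} -> {'B'}  sup=40%  conf=67%
# {'B'} -> {'A'}  sup=40%  conf=80%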

  30. Other Measures of Interestingness • play basketball → eat cereal [40%, 66.7%] is misleading • The overall % of students eating cereal is 75% > 66.7% • play basketball → not eat cereal [20%, 33.3%] is more accurate, although it has lower support and confidence • Measure of dependent/correlated events: lift(A → B) = P(A, B) / (P(A) · P(B))
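
Plugging the slide's numbers into lift makes the point concrete: P(basketball, cereal) = 0.4 (the support), P(basketball) = 0.4 / 0.667 ≈ 0.6, and P(cereal) = 0.75, so lift ≈ 0.4 / (0.6 × 0.75) ≈ 0.89. A lift below 1 signals negative correlation, which is exactly why the 66.7%-confidence rule is misleading.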

  31. Various Kinds of Association Rules • Multi-level association: data categorized in a hierarchy • Multi-dimensional association: age(X, "19-25") ∧ occupation(X, "student") → buys(X, "coke") • Quantitative association: rules with numerical attributes • "Interesting" correlation patterns: e.g., some items (such as diamonds) may occur rarely but are valuable

  32. Constraint-based Mining • Finding all the patterns in a database autonomously? — unrealistic! • The patterns could be too many and not focused! • Data mining should be an interactive process • The user directs what is to be mined using a data mining query language (or a graphical user interface) • Constraint-based mining • User flexibility: the user provides constraints on what to mine • System optimization: the system exploits such constraints for efficient mining

  33. Constraints in Data Mining • Data constraint: using SQL-like queries • find product pairs sold together in stores in Chicago in Dec.'02 • Dimension/level constraint • in relevance to region, price, brand, customer category • Rule (or pattern) constraint • small sales (price < $10) • Interestingness constraint • strong rules: min_support ≥ 3%, min_confidence ≥ 60%

  34. Anti-Monotonicity in Constraint Pushing • Anti-monotonicity: when an itemset S violates the constraint, so does every superset of S • sum(S.Price) ≤ v is anti-monotone • sum(S.Price) ≥ v is not anti-monotone • Example. C: range(S.profit) ≤ 15 is anti-monotone • Itemset ab violates C • So does every superset of ab
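
A minimal sketch of pushing an anti-monotone constraint into the search: once sum(S.price) ≤ v fails for S, no superset of S needs to be generated or counted (prices and the threshold are made up):

price = {"a": 2, "b": 4, "c": 7, "d": 1}
v = 8

def satisfies(S):
    # anti-monotone constraint: sum(S.price) <= v
    return sum(price[i] for i in S) <= v

candidates = [{"a"}, {"a", "b"}, {"a", "b", "c"}, {"a", "d"}]
print([S for S in candidates if satisfies(S)])
# {'a','b','c'} is pruned (sum = 13 > 8), and so is every superset of it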

  35. Monotonicity for Constraint Pushing • Monotonicity: when an itemset S satisfies the constraint, so does every superset of S • sum(S.Price) ≥ v is monotone • min(S.Price) ≤ v is monotone • Example. C: range(S.profit) ≥ 15 • Itemset ab satisfies C • So does every superset of ab

  36. The Constrained Apriori Algorithm: Push an Anti-monotone Constraint Deep [Diagram: the Apriori scans over database D (1st scan: C1 → L1; 2nd scan: C2 → L2; 3rd scan: C3 → L3) with the constraint sum(S.price) < 5 applied during candidate generation]

  37. Converting "Tough" Constraints • Convert tough constraints into anti-monotone or monotone ones by properly ordering the items • Examine C: avg(S.profit) ≥ 25 • Order items in value-descending order: <a, f, g, d, b, h, c, e> • If an itemset afb violates C, so does afbh and every extension afb* • The constraint becomes anti-monotone!

  38. Constraint-Based Mining—A General Picture

  39. Most slides are from Jiawei Han
