## DATA MINING LECTURE 3


Frequent Itemsets and Association Rules

**Frequent Itemsets**
• Given a set of transactions, find combinations of items (itemsets) that occur frequently
• The support of an itemset I is the number of transactions that contain I; I is frequent if its support is at least a threshold minsup
• Examples of frequent itemsets over market-basket transactions:
  • {Diaper, Beer}: 3
  • {Milk, Bread}: 3
  • {Milk, Bread, Diaper}: 2

**Mining Frequent Itemsets Task**
• Input: a set of transactions T, over a set of items I
• Output: all itemsets with items in I having support ≥ the minsup threshold
• Problem parameters:
  • N = |T|: number of transactions
  • d = |I|: number of (distinct) items
  • w: maximum width of a transaction
  • Number of possible itemsets = 2^d

**The Itemset Lattice**
• Given d items, there are 2^d possible itemsets. Too expensive to test them all!

**Reducing the Number of Candidates**
• Apriori principle (main observation):
  • If an itemset is frequent, then all of its subsets must also be frequent
  • If an itemset is not frequent, then none of its supersets can be frequent
• The support of an itemset never exceeds the support of its subsets
• This is known as the anti-monotone property of support

**Illustration of the Apriori Principle**
(Figure: the itemset lattice; all subsets of an itemset found to be frequent are frequent, and once an itemset is found to be infrequent, all of its supersets are pruned.)

**The Apriori Algorithm**
Level-wise approach:
1. Find the frequent 1-itemsets and put them in L_k (k = 1)
2. Use L_k to generate a collection C_{k+1} of candidate itemsets of size k+1
3. Scan the database to find which itemsets in C_{k+1} are frequent and put them into L_{k+1}
4. If L_{k+1} is not empty, set k = k+1 and go to step 2

R. Agrawal, R. Srikant: "Fast Algorithms for Mining Association Rules", Proc. of the 20th Int'l Conference on Very Large Data Bases, 1994.

**The Apriori Algorithm (pseudocode)**
• C_k: candidate itemsets of size k
• L_k: frequent itemsets of size k

  L_1 = {frequent 1-itemsets};
  for (k = 1; L_k ≠ ∅; k++)
    C_{k+1} = GenerateCandidates(L_k)
    for each transaction t in the database do
      increment the count of the candidates in C_{k+1} that are contained in t
    end for
    L_{k+1} = candidates in C_{k+1} with support ≥ minsup
  end for
  return ∪_k L_k;

**Generate Candidates C_{k+1}**
• Assume the items in L_k are listed in an order (e.g., lexicographic)
• Step 1: self-join L_k (in SQL):

  insert into C_{k+1}
  select p.item_1, p.item_2, …, p.item_k, q.item_k
  from L_k p, L_k q
  where p.item_1 = q.item_1, …, p.item_{k-1} = q.item_{k-1}, p.item_k < q.item_k

• This creates an itemset of size k+1 by joining two itemsets of size k that share their first k-1 items

**Example I**
• L_3 = {abc, abd, acd, ace, bcd}
• Self-joining L_3 * L_3, with join condition p.item_1 = q.item_1, p.item_2 = q.item_2, p.item_3 < q.item_3:
  • abcd from abc and abd ({a,b,c} joined with {a,b,d} gives {a,b,c,d})
  • acde from acd and ace ({a,c,d} joined with {a,c,e} gives {a,c,d,e})

**Generate Candidates C_{k+1} (continued)**
• Step 2: pruning:

  for all itemsets c in C_{k+1} do
    for all k-subsets s of c do
      if (s is not in L_k) then delete c from C_{k+1}

• All itemsets of size k that are subsets of a new (k+1)-itemset must be frequent, otherwise the candidate is deleted
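To make the two steps concrete, here is a minimal Python sketch of candidate generation (self-join plus pruning). The tuple-based itemset representation and the function name are our own illustration choices; the slides only specify the SQL join and the pruning loop.

```python
from itertools import combinations

def generate_candidates(L_k):
    """Self-join L_k and prune, as in the Apriori candidate-generation step.

    L_k is a set of frequent k-itemsets, each a lexicographically sorted tuple.
    Returns the candidate (k+1)-itemsets C_{k+1}.
    """
    L_k = set(L_k)
    candidates = set()
    for p in L_k:
        for q in L_k:
            # Step 1 (self-join): p and q share the first k-1 items,
            # and p's last item precedes q's last item.
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                candidates.add(p + (q[-1],))
    # Step 2 (prune): every k-subset of a candidate must itself be frequent.
    return {c for c in candidates
            if all(s in L_k for s in combinations(c, len(c) - 1))}

# Example I from the slides: abcd survives, acde is pruned (ade not in L_3).
L3 = {('a','b','c'), ('a','b','d'), ('a','c','d'), ('a','c','e'), ('b','c','d')}
print(generate_candidates(L3))  # {('a', 'b', 'c', 'd')}
```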
**Example I (pruning)**
• Pruning the self-join result {abcd, acde}:
  • abcd is kept, since all of its 3-subsets (abc, abd, acd, bcd) are in L_3
  • acde is removed, because its subset ade is not in L_3
• C_4 = {abcd}

**Reducing the Number of Comparisons**
• Candidate counting: scan the database of transactions to determine the support of each candidate itemset
• To reduce the number of comparisons, store the candidates in a hash structure
• Instead of matching each transaction against every candidate, match it only against the candidates contained in the hashed buckets

**Association Rule Mining**
• Given a set of transactions, find rules that predict the occurrence of an item based on the occurrences of other items in the transaction
• Examples of association rules over market-basket transactions:
  • {Diaper} → {Beer}
  • {Milk, Bread} → {Eggs, Coke}
  • {Beer, Bread} → {Milk}
• Implication means co-occurrence, not causality!

**Definition: Association Rule**
• An association rule is an implication expression of the form X → Y, where X and Y are itemsets
• Example: {Milk, Diaper} → {Beer}
• Rule evaluation metrics:
  • Support (s): the fraction of transactions that contain both X and Y, i.e., the probability P(X,Y) that X and Y occur together
  • Confidence (c): how often items in Y appear in transactions that contain X, i.e., the conditional probability P(Y|X) that Y occurs given that X has occurred

**Association Rule Mining Task**
• Input: a set of transactions T, over a set of items I
• Output: all rules with items in I having
  • support ≥ the minsup threshold
  • confidence ≥ the minconf threshold

**Mining Association Rules**
Two-step approach:
1. Frequent itemset generation: generate all itemsets whose support ≥ minsup
2. Rule generation: generate high-confidence rules from each frequent itemset, where each rule is a partitioning of the frequent itemset into a Left-Hand Side (LHS) and a Right-Hand Side (RHS)
• Example: frequent itemset {A,B,C,D}; rule AB → CD

**Rule Generation**
• We have all the frequent itemsets; how do we get the rules?
• For every frequent itemset S, we find rules of the form L → S − L, where L ⊂ S, that satisfy the minimum confidence requirement
• Example: S = {A,B,C,D}; candidate rules:
  • A → BCD, B → ACD, C → ABD, D → ABC
  • AB → CD, AC → BD, AD → BC, BC → AD, BD → AC, CD → AB
  • ABC → D, ABD → C, ACD → B, BCD → A
• If |S| = k, then there are 2^k − 2 candidate association rules (ignoring S → ∅ and ∅ → S); a brute-force sketch of this enumeration follows after the next slide

**Rule Generation (efficiency)**
• How do we efficiently generate rules from frequent itemsets?
• In general, confidence does not have an anti-monotone property: c(ABC → D) can be larger or smaller than c(AB → D)
• But the confidence of rules generated from the same itemset does have an anti-monotone property
  • e.g., for S = {A,B,C,D}: c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)
• Confidence is anti-monotone with respect to the number of items on the RHS of the rule
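The following is a minimal brute-force sketch of this enumeration in Python: it generates every L → S − L split and keeps the rules that meet minconf. The support counts below are hypothetical, invented only to make the example runnable, and the anti-monotone pruning described above is not applied here.

```python
from itertools import combinations

def rules_from_itemset(S, support, minconf):
    """Enumerate candidate rules L -> S - L from frequent itemset S and keep
    those whose confidence support(S) / support(L) is at least minconf.

    `support` maps frozensets to (hypothetical) support counts.
    """
    S = frozenset(S)
    rules = []
    for r in range(1, len(S)):            # LHS size from 1 to k-1: 2^k - 2 rules
        for lhs in combinations(sorted(S), r):
            lhs = frozenset(lhs)
            conf = support[S] / support[lhs]
            if conf >= minconf:
                rules.append((set(lhs), set(S - lhs), conf))
    return rules

# Hypothetical support counts for S = {A,B,C,D} and all of its subsets.
support = {frozenset(s): c for s, c in {
    ('A','B','C','D'): 2, ('A','B','C'): 2, ('A','B','D'): 2, ('A','C','D'): 3,
    ('B','C','D'): 2, ('A','B'): 3, ('A','C'): 4, ('A','D'): 3, ('B','C'): 3,
    ('B','D'): 4, ('C','D'): 4, ('A',): 5, ('B',): 5, ('C',): 5, ('D',): 5,
}.items()}

for lhs, rhs, conf in rules_from_itemset({'A','B','C','D'}, support, minconf=0.6):
    print(lhs, '->', rhs, f'confidence={conf:.2f}')
```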
**Rule Generation for the Apriori Algorithm**
(Figure: the lattice of rules from one frequent itemset, organized by RHS; once a low-confidence rule is found, all rules below it in the lattice are pruned.)
• A candidate rule is generated by merging two rules that share the same prefix in the RHS
  • join(CD → AB, BD → AC) produces the candidate rule D → ABC
  • Prune rule D → ABC if its subset-rule AD → BC does not have high confidence
• Essentially we are doing Apriori on the RHS

**Compact Representation of Frequent Itemsets**
• Some itemsets are redundant because they have identical support to their supersets
• The number of frequent itemsets can be very large
• We need a compact representation

**Maximal Frequent Itemsets**
• An itemset is maximal frequent if none of its immediate supersets is frequent
• The maximal itemsets form the positive border between the frequent and the infrequent itemsets

**Negative Border**
• Itemsets that are not frequent, but all of whose immediate subsets are frequent
• Minimal: these are the smallest itemsets with this property

**Border**
• Border = positive border + negative border
• Border itemsets are such that all their immediate subsets are frequent and all their immediate supersets are infrequent
• Either the positive or the negative border is sufficient to summarize all frequent itemsets

**Closed Itemsets**
• An itemset is closed if none of its immediate supersets has the same support as the itemset

**Maximal vs Closed Itemsets**
(Figure: the itemset lattice annotated with the transaction ids supporting each itemset; itemsets not supported by any transaction are marked.)

**Maximal vs Closed Frequent Itemsets**
(Figure: with minimum support = 2, the lattice contains 9 closed itemsets, of which 4 are maximal; closed-but-not-maximal itemsets are marked separately.)

**Pattern Evaluation**
• Association rule algorithms tend to produce too many rules
  • Many of them are uninteresting or redundant
  • Redundant if, e.g., {A,B,C} → {D} and {A,B} → {D} have the same support and confidence
• Interestingness measures can be used to prune or rank the derived patterns
• In the original formulation of association rules, support and confidence are the only measures used

**Computing an Interestingness Measure**
• Given a rule X → Y, the information needed to compute the rule's interestingness can be obtained from the contingency table for X → Y:
  • f_11: support of X and Y
  • f_10: support of X and ¬Y
  • f_01: support of ¬X and Y
  • f_00: support of ¬X and ¬Y
• These counts are used to define various measures: support, confidence, lift, Gini, J-measure, etc.

**Drawback of Confidence**
• Association rule: Tea → Coffee
• Confidence = P(Coffee|Tea) = 0.75
• but P(Coffee) = 0.9, and P(Coffee|¬Tea) = 0.9375
• Although the confidence is high, the rule is misleading: knowing that a person drinks tea actually lowers the probability that they drink coffee

**Statistical Independence**
• Population of 1000 students: 600 know how to swim (S), 700 know how to bike (B)
• If 420 students know how to swim and bike: P(S,B) = 420/1000 = 0.42 = 0.6 × 0.7 = P(S) P(B) ⇒ statistical independence
• If 500 students know how to swim and bike: P(S,B) = 0.5 > 0.42 = P(S) P(B) ⇒ positively correlated
• If 300 students know how to swim and bike: P(S,B) = 0.3 < 0.42 = P(S) P(B) ⇒ negatively correlated

**Statistical-based Measures**
• Measures that take statistical dependence into account:
  • Lift = P(Y|X) / P(Y)
  • Interest = P(X,Y) / (P(X) P(Y))
  • In text mining: Pointwise Mutual Information, PMI(X,Y) = log P(X,Y) / (P(X) P(Y))

**Example: Lift/Interest**
• Association rule: Tea → Coffee
• Confidence = P(Coffee|Tea) = 0.75
• but P(Coffee) = 0.9
• Lift = 0.75/0.9 = 0.8333 (< 1, therefore Tea and Coffee are negatively associated)
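Below is a small Python sketch that derives support, confidence, and lift from the four contingency-table counts. The concrete counts are a hypothetical reconstruction chosen to match the probabilities quoted on the slide (P(Coffee|Tea) = 0.75, P(Coffee) = 0.9, P(Coffee|¬Tea) = 0.9375); the original table did not survive extraction.

```python
def interestingness(f11, f10, f01, f00):
    """Compute support, confidence, and lift for a rule X -> Y
    from the four contingency-table counts."""
    n = f11 + f10 + f01 + f00
    p_xy = f11 / n                 # P(X, Y)
    p_x = (f11 + f10) / n          # P(X)
    p_y = (f11 + f01) / n          # P(Y)
    support = p_xy
    confidence = p_xy / p_x        # P(Y | X)
    lift = confidence / p_y        # P(Y | X) / P(Y)
    return support, confidence, lift

# Hypothetical counts consistent with the Tea -> Coffee example:
# P(Coffee|Tea) = 15/20 = 0.75, P(Coffee) = 90/100 = 0.9,
# P(Coffee|not Tea) = 75/80 = 0.9375.
s, c, lift = interestingness(f11=15, f10=5, f01=75, f00=5)
print(f'support={s:.2f} confidence={c:.2f} lift={lift:.4f}')  # lift = 0.8333
```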
**Drawback of Lift & Interest**
• Statistical independence: if P(X,Y) = P(X) P(Y), then Lift = 1
• Drawback: rare co-occurrences are deemed more interesting, since a small P(X) P(Y) in the denominator inflates the ratio

**ALTERNATIVE FREQUENT ITEMSET COMPUTATION**
Slides taken from the Mining Massive Datasets course by Anand Rajaraman and Jeff Ullman.

**Efficient Computation of Pairs**
• The operation that takes the most resources (mostly memory, but also time) is the computation of frequent pairs
• How can we make this faster?

**PCY Algorithm**
• During Pass 1 of A-Priori (computing the frequent items), most memory is idle
• Use that memory to keep counts of buckets into which pairs of items are hashed
• Just the count, not the pairs themselves

**Needed Extensions**
• Pairs of items need to be generated from the input file; they are not present in the file
• We are not just interested in the presence of a pair; we need to see whether it is present at least s (support) times

**PCY Algorithm – (2)**
• A bucket is frequent if its count is at least the support threshold
• If a bucket is not frequent, no pair that hashes to that bucket can possibly be a frequent pair
• On Pass 2 (frequent pairs), we count only the pairs that hash to frequent buckets
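A minimal Python sketch of the two PCY passes follows. The number of buckets and the use of Python's built-in hash are illustrative assumptions; a real implementation would size the bucket array to the available memory and would typically compress the frequent-bucket information into a bitmap between the passes.

```python
from itertools import combinations

NUM_BUCKETS = 1_000_003  # illustrative; in practice, as many as memory allows

def pcy_pass1(baskets, support):
    """PCY Pass 1: count single items and, with the otherwise idle memory,
    hash every pair in every basket into a bucket and count the buckets.
    Python's hash() is stable within one run, so both passes agree."""
    item_counts, bucket_counts = {}, [0] * NUM_BUCKETS
    for basket in baskets:
        for item in basket:
            item_counts[item] = item_counts.get(item, 0) + 1
        for pair in combinations(sorted(basket), 2):
            bucket_counts[hash(pair) % NUM_BUCKETS] += 1
    frequent_items = {i for i, c in item_counts.items() if c >= support}
    frequent_buckets = {b for b, c in enumerate(bucket_counts) if c >= support}
    return frequent_items, frequent_buckets

def pcy_pass2(baskets, support, frequent_items, frequent_buckets):
    """PCY Pass 2: count only pairs of frequent items that hash to a
    frequent bucket; no other pair can be frequent."""
    pair_counts = {}
    for basket in baskets:
        items = sorted(i for i in basket if i in frequent_items)
        for pair in combinations(items, 2):
            if hash(pair) % NUM_BUCKETS in frequent_buckets:
                pair_counts[pair] = pair_counts.get(pair, 0) + 1
    return {p: c for p, c in pair_counts.items() if c >= support}

# Toy market-basket data (hypothetical):
baskets = [{'bread', 'milk', 'beer'}, {'bread', 'milk'},
           {'milk', 'beer'}, {'bread', 'beer'}]
items, buckets = pcy_pass1(baskets, support=2)
print(pcy_pass2(baskets, 2, items, buckets))  # each pair occurs twice
```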