## CSE 634 Data Mining Concepts and Techniques Association Rule Mining


**CSE 634 Data Mining Concepts and Techniques: Association Rule Mining**
Barbara Mucha, Tania Irani, Irem Incekoy, Mikhail Bautin
Course Instructor: Prof. Anita Wasilewska
State University of New York, Stony Brook (Group 6)

**References**
- Data Mining: Concepts and Techniques by Jiawei Han and Micheline Kamber
- Presentation slides of Prateek Duble
- Presentation slides of the course book
- Mining Topic-Specific Concepts and Definitions on the Web
- Effective Personalization Based on Association Rule Discovery from Web Usage Data

**Overview**
- Basic concepts of association rule mining
- Association rules and the Apriori algorithm
- Paper: Mining Topic-Specific Concepts and Definitions on the Web
- Paper: Effective Personalization Based on Association Rule Discovery from Web Usage Data

**Outline**
- What is association rule mining?
- Methods for association rule mining
- Examples
- Extensions of association rules

**What Is Association Rule Mining?**
- Frequent patterns: patterns (sets of items, sequences, etc.) that occur frequently in a database
- Frequent pattern mining: finding regularities in data
  - What products were often purchased together? Beer and diapers?!
  - What are the subsequent purchases after buying a car?
  - Can we automatically profile customers?

**Basic Concepts of Association Rule Mining**
- Given: (1) a database of transactions; (2) each transaction is a list of items (purchased by a customer in one visit)
- Find: all rules that correlate the presence of one set of items with that of another set of items
  - E.g., 98% of people who purchase tires and auto accessories also get automotive services done
- Applications:
  - * → Maintenance Agreement (what should the store do to boost Maintenance Agreement sales?)
  - Home Electronics → * (what other products should the store stock up on?)
  - Attached mailing in direct marketing

**Association Rule Definitions**
- Set of items: I = {I1, I2, …, Im}
- Transactions: D = {t1, t2, …, tn} is a set of transactions, where each transaction t is a set of items
- Itemset: {Ii1, Ii2, …, Iik} ⊆ I
- Support of an itemset: the percentage of transactions that contain the itemset
- Large (frequent) itemset: an itemset whose number of occurrences is above a threshold

**Rule Measures: Support and Confidence**
- An association rule has the form X → Y, where X, Y ⊆ I and X ∩ Y = ∅
- Each rule has two measures of value: support and confidence
- Support indicates the frequency of the occurring pattern; confidence denotes the strength of the implication in the rule
- The support of the rule X → Y is support(X ∪ Y)
- c is the confidence of the rule X → Y if c% of the transactions that contain X also contain Y, which can be written as the ratio support(X ∪ Y) / support(X)

**Support and Confidence: An Example**
Let minimum support be 50% and minimum confidence 50%. Then we have (a short sketch below reproduces these numbers):
- A → C (support 50%, confidence 66.6%)
- C → A (support 50%, confidence 100%)

**Types of Association Rule Mining**
- Boolean vs. quantitative associations (based on the types of values handled):
  - buys(x, "computer") → buys(x, "financial software") [0.2%, 60%]
  - age(x, "30..39") ∧ income(x, "42..48K") → buys(x, "PC") [1%, 75%]
- Single-dimensional vs. multi-dimensional associations:
  - buys(x, "computer") → buys(x, "financial software") [0.2%, 60%]
  - age(x, "30..39") ∧ income(x, "42..48K") → buys(x, "PC") [1%, 75%]
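The support and confidence definitions above are easy to make concrete in code. The following is a minimal Python sketch: the four transactions are an assumption (the example slide's transaction table is not preserved in this text), chosen so that the output reproduces the numbers quoted for A → C and C → A; the helper names are illustrative.

```python
# Toy transaction database, assumed for illustration; it is chosen so the
# results match the example above: A -> C (50%, 66.6%), C -> A (50%, 100%).
transactions = [
    {"A", "B", "C"},
    {"A", "C"},
    {"A", "D"},
    {"B", "E", "F"},
]

def support(itemset, db):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(1 for t in db if itemset <= t) / len(db)

def confidence(lhs, rhs, db):
    """confidence(lhs -> rhs) = support(lhs UNION rhs) / support(lhs)."""
    return support(lhs | rhs, db) / support(lhs, db)

print(support({"A", "C"}, transactions))       # 0.5    -> 50% support
print(confidence({"A"}, {"C"}, transactions))  # 0.666… -> 66.6% confidence
print(confidence({"C"}, {"A"}, transactions))  # 1.0    -> 100% confidence
```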
**Types of Association Rule Mining**
- Single-level vs. multiple-level analysis:
  - What brands of beers are associated with what brands of diapers?
- Various extensions:
  - Correlation and causality analysis (association does not necessarily imply correlation or causality)
  - Constraints enforced (e.g., do small sales (sum < 100) trigger big buys (sum > 1,000)?)

**Association Discovery**
- Given a user-specified minimum support (called MINSUP) and minimum confidence (called MINCONF), an important problem is to find all high-confidence rules over large itemsets (frequent sets, i.e., sets with high support), where support and confidence exceed MINSUP and MINCONF.
- This problem can be decomposed into two subproblems:
  1. Find all large itemsets, i.e., those with support > MINSUP (the frequent sets).
  2. For each large itemset X and each B ⊂ X (equivalently, Y ⊂ X), find those rules X \ {B} → B (X − Y → Y) for which confidence > MINCONF.

**Basics**
- Itemset: a set of items, e.g., acm = {a, c, m}
- Support of an itemset: e.g., sup(acm) = 3 in the slide's transaction database TDB
- Given min_sup = 3, acm is a frequent pattern
- Frequent pattern mining: find all frequent patterns in a database

**Mining Association Rules: An Example**
- Min. support 50%, min. confidence 50%
- For the rule A → C: support = support({A, C}) = 50%; confidence = support({A, C}) / support({A}) = 66.6%
- The Apriori principle: any subset of a frequent itemset must be frequent

**Rules from Frequent Sets**
- X = {mustard, sausage, beer}; frequency = 0.4
- Y = {mustard, sausage, beer, chips}; frequency = 0.2
- If a customer buys mustard, sausage, and beer, then the probability that he or she also buys chips is 0.2 / 0.4 = 0.5

**Applications**
- Sequential patterns: find inter-transaction patterns such that the presence of a set of items is followed by another item in the time-stamp-ordered transaction set.
- Periodic patterns: can be envisioned as a tool for forecasting and predicting the future behavior of time-series data.
- Structural patterns: describe how classes and objects can be combined to form larger structures.

**Application Difficulties**
- Wal-Mart knows that customers who buy Barbie dolls have a 60% likelihood of buying one of three types of candy bars. What does Wal-Mart do with information like that? "I don't have a clue," says Wal-Mart's chief of merchandising, Lee Scott. (www.kdnuggets.com/news/98/n01.html)
- The diapers-and-beer urban legend: http://web.onetel.net.uk/~hibou/Beer%20and%20Nappies.html

**Thank You!**
Barbara Mucha

**CSE 634 Data Mining Concepts and Techniques: Association & Apriori Algorithm**
Tania Irani (105573836)
Course Instructor: Prof. Anita Wasilewska
State University of New York, Stony Brook

**References**
- Data Mining: Concepts and Techniques by Jiawei Han and Micheline Kamber
- Presentation slides of Prof. Anita Wasilewska

**Agenda**
- The Apriori algorithm (mining single-dimensional Boolean association rules)
- The Frequent-Pattern Growth (FP-Growth) method
- Summary

**The Apriori Algorithm: Key Concepts**
- k-itemset: an itemset containing k items.
- Support (frequency): the number of transactions that contain a particular itemset.
- Frequent itemset: an itemset that satisfies minimum support (Lk denotes the set of frequent k-itemsets).
- Apriori property: all non-empty subsets of a frequent itemset must be frequent.
- Join operation: Ck, the set of candidate k-itemsets, is generated by joining Lk-1 with itself (L1: frequent 1-itemsets; Lk: frequent k-itemsets).
- Prune operation: Lk, the set of frequent k-itemsets, is extracted from Ck by pruning it, i.e., getting rid of all the non-frequent k-itemsets in Ck.
- Iterative level-wise approach: k-itemsets are used to explore (k+1)-itemsets; the Apriori algorithm finds all frequent k-itemsets. (A sketch of the join and prune operations follows.)
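Here is a minimal Python sketch of the join and prune operations just described. It is an illustration rather than the textbook's exact pseudocode, and the sample L2 at the bottom is made up to show one candidate surviving the prune step.

```python
from itertools import combinations

def apriori_gen(L_prev, k):
    """Generate candidate k-itemsets C_k from the frequent (k-1)-itemsets."""
    L_prev = set(L_prev)
    # Join step: union pairs of frequent (k-1)-itemsets whose union has k items.
    candidates = {a | b for a in L_prev for b in L_prev if len(a | b) == k}
    # Prune step (Apriori property): drop any candidate that has an
    # infrequent (k-1)-subset.
    return {c for c in candidates
            if all(frozenset(s) in L_prev for s in combinations(c, k - 1))}

# Made-up illustration: with L2 = {ab, ac, bc, bd}, the join step proposes
# abc, abd and bcd; the prune step keeps only abc, because ad and cd are
# not frequent.
L2 = {frozenset("ab"), frozenset("ac"), frozenset("bc"), frozenset("bd")}
print(apriori_gen(L2, 3))  # {frozenset({'a', 'b', 'c'})}
```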
**How Is the Apriori Property Used in the Algorithm?**
Mining single-dimensional Boolean association rules is a two-step process:
1. Use the Apriori property to find the frequent itemsets: each iteration generates Ck (the candidate k-itemsets, built from Lk-1) and Lk (the frequent k-itemsets).
2. Use the frequent k-itemsets to generate association rules.

**Finding Frequent Itemsets Using the Apriori Algorithm: Example**
- Consider a database D consisting of 9 transactions, each represented by an itemset.
- Suppose the minimum support count required is 2 (2 out of 9 ≈ 22%), and say the minimum confidence required is 70%.
- We first find the frequent itemsets using the Apriori algorithm; association rules are then generated using the minimum support and minimum confidence.

**Step 1: Generating candidate and frequent 1-itemsets (min. support count = 2)**
- In the first iteration of the algorithm, each item is a member of the set of candidates C1, along with its support count (one scan of D).
- The set of frequent 1-itemsets, L1, consists of the candidate 1-itemsets satisfying minimum support.

**Step 2: Generating candidate and frequent 2-itemsets (min. support count = 2)**
- Generate the candidates C2 by joining L1 with itself (L1 join L1), scan D for the count of each candidate, and keep the candidates meeting the minimum support count as L2.
- Note: we haven't used the Apriori property yet!

**Step 3: Generating candidate and frequent 3-itemsets (min. support count = 2)**
- Generate the candidates C3 from L2, scan D for the count of each candidate, and keep the candidates meeting the minimum support count as L3.
- Generating C3 involves the Apriori property: when the join step is complete, the prune step removes the candidates that contain non-frequent 2-itemset subsets. The prune step helps avoid heavy computation due to a large Ck.

**Step 4: Generating frequent 4-itemsets**
- L3 join L3 gives C4 = {{I1, I2, I3, I5}}.
- This itemset is pruned, since its subset {I2, I3, I5} is not frequent.
- Thus C4 = ∅, and the algorithm terminates, having found all of the frequent itemsets. This completes our Apriori algorithm.
- What's next? These frequent itemsets are used to generate strong association rules (rules that satisfy both minimum support and minimum confidence).

**Step 5: Generating Association Rules from Frequent k-Itemsets**
- Procedure: for each frequent itemset l, generate all nonempty proper subsets of l; for every such subset s, output the rule "s → (l − s)" if support_count(l) / support_count(s) ≥ min_conf, where min_conf is the minimum confidence threshold (70% in our case).
- Back to the example: let l = {I1, I2, I5}. Its nonempty proper subsets are {I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2}, {I5}.
- The resulting association rules are:
  - R1: I1 ∧ I2 → I5; confidence = sc{I1, I2, I5} / sc{I1, I2} = 2/4 = 50%. R1 is rejected.
  - R2: I1 ∧ I5 → I2; confidence = sc{I1, I2, I5} / sc{I1, I5} = 2/2 = 100%. R2 is selected.
  - R3: I2 ∧ I5 → I1; confidence = sc{I1, I2, I5} / sc{I2, I5} = 2/2 = 100%. R3 is selected.
  - R4: I1 → I2 ∧ I5; confidence = sc{I1, I2, I5} / sc{I1} = 2/6 = 33%. R4 is rejected.
  - R5: I2 → I1 ∧ I5; confidence = sc{I1, I2, I5} / sc{I2} = 2/7 = 29%. R5 is rejected.
  - R6: I5 → I1 ∧ I2; confidence = sc{I1, I2, I5} / sc{I5} = 2/2 = 100%. R6 is selected.
- We have found three strong association rules. (A runnable sketch tying Steps 1–5 together follows.)
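The whole procedure of Steps 1–5 fits in a short Python sketch. The nine transactions below are an assumption, since the example database itself is not listed in this text; they were chosen to be consistent with every count quoted above (e.g., sc{I1, I2, I5} = 2, sc{I1, I2} = 4, and C4 = ∅).

```python
from collections import Counter
from itertools import combinations

# Transaction database assumed for illustration; its support counts are
# consistent with all of the counts quoted in Steps 1-5 above.
D = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
MIN_SUP, MIN_CONF = 2, 0.7  # min. support count 2, min. confidence 70%

def apriori(db, min_sup):
    """Steps 1-4: level-wise mining of all frequent itemsets with counts."""
    counts = Counter(frozenset([i]) for t in db for i in t)
    L = {s: c for s, c in counts.items() if c >= min_sup}  # L1
    freq, k = dict(L), 2
    while L:
        # Join L(k-1) with itself, then prune by the Apriori property.
        cands = {a | b for a in L for b in L if len(a | b) == k}
        cands = {c for c in cands
                 if all(frozenset(s) in L for s in combinations(c, k - 1))}
        counts = Counter(c for t in db for c in cands if c <= t)
        L = {s: n for s, n in counts.items() if n >= min_sup}  # Lk
        freq.update(L)
        k += 1
    return freq

def gen_rules(freq, min_conf):
    """Step 5: for each frequent l, emit s -> (l - s) if confidence passes."""
    for l in (x for x in freq if len(x) > 1):
        for r in range(1, len(l)):
            for s in map(frozenset, combinations(l, r)):
                conf = freq[l] / freq[s]
                if conf >= min_conf:
                    yield set(s), set(l - s), conf

freq = apriori(D, MIN_SUP)
for lhs, rhs, conf in gen_rules(freq, MIN_CONF):
    print(sorted(lhs), "->", sorted(rhs), f"({conf:.0%})")
```

Restricted to l = {I1, I2, I5}, this emits exactly R2, R3, and R6; rules from the other frequent itemsets are printed as well, since the procedure applies to every frequent itemset.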
**Agenda**
- The Apriori algorithm (mining single-dimensional Boolean association rules)
- The Frequent-Pattern Growth (FP-Growth) method
- Summary

**Mining Frequent Patterns Without Candidate Generation**
- Compress a large database into a compact Frequent-Pattern tree (FP-tree) structure:
  - Highly condensed, but complete for frequent pattern mining
  - Avoids costly database scans
- Develop an efficient, FP-tree-based frequent pattern mining method, a divide-and-conquer methodology:
  - Compress the DB into an FP-tree, retaining the itemset associations
  - Divide the new DB into a set of conditional DBs, each associated with one frequent item
  - Mine each such database separately
  - Avoid candidate generation

**FP-Growth Method: An Example**
- Consider the previous example of a database D consisting of 9 transactions, with minimum support count 2 (min_sup = 2/9 ≈ 22%).
- The first scan of the database is the same as in Apriori: it derives the set of 1-itemsets and their support counts.
- The set of frequent items is sorted in descending order of support count; the resulting set is denoted L = {I2:7, I1:6, I3:6, I4:2, I5:2}.

**FP-Growth Method: Construction of the FP-Tree**
- First, create the root of the tree, labeled "null".
- Scan the database D a second time (the first scan created the 1-itemsets and L); this generates the complete tree.
- The items in each transaction are processed in L order (i.e., sorted by descending support count).
- A branch is created for each transaction, with each item carrying its support count (separated by a colon).
- Whenever the same node is encountered in another transaction, we simply increment the support count of the common node (the prefix).
- To facilitate tree traversal, an item header table is built so that each item points to its occurrences in the tree via a chain of node-links.
- The problem of mining frequent patterns in the database is thus transformed into mining the FP-tree.

[Figure: the resulting FP-tree, which registers compressed, frequent pattern information; its nodes are null (root), I2:7, I1:4, I1:2, I3:2 (three nodes), I4:1 (two nodes), and I5:1 (two nodes).]

**Mining the FP-Tree by Creating Conditional (Sub-)Pattern Bases**
1. Start from each frequent length-1 pattern (as an initial suffix pattern).
2. Construct its conditional pattern base, which consists of the set of prefix paths in the FP-tree co-occurring with the suffix pattern.
3. Then construct its conditional FP-tree and perform mining on this tree.
4. The pattern growth is achieved by concatenating the suffix pattern with the frequent patterns generated from its conditional FP-tree.
5. The union of all frequent patterns (generated by step 4) gives the required frequent itemsets.

**FP-Tree Example Continued**
Following the steps above:
- Let's start from I5. I5 is involved in two branches, namely {I2 I1 I5 : 1} and {I2 I1 I3 I5 : 1}.
- Therefore, considering I5 as the suffix, its two corresponding prefix paths are {I2 I1 : 1} and {I2 I1 I3 : 1}, which form its conditional pattern base. (A construction sketch follows.)
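The construction just described, and the extraction of I5's conditional pattern base, can be sketched as follows. This again assumes the same nine-transaction database as in the Apriori sketch; the class and variable names are illustrative.

```python
class FPNode:
    """One FP-tree node: an item, a count, a parent link, and children."""
    def __init__(self, item, parent):
        self.item, self.parent, self.count = item, parent, 0
        self.children = {}

# Same nine-transaction database assumed earlier.
D = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
L_ORDER = ["I2", "I1", "I3", "I4", "I5"]  # L = {I2:7, I1:6, I3:6, I4:2, I5:2}

root = FPNode(None, None)  # the root, labeled "null"
header = {}                # header table: item -> chain of node-links

# Second database scan: insert each transaction's items in L order,
# incrementing the count of every shared prefix node.
for t in D:
    node = root
    for item in (i for i in L_ORDER if i in t):
        if item not in node.children:
            node.children[item] = FPNode(item, node)
            header.setdefault(item, []).append(node.children[item])
        node = node.children[item]
        node.count += 1

# Conditional pattern base for suffix I5: the prefix path of every I5 node.
for leaf in header["I5"]:
    path, node = [], leaf.parent
    while node.item is not None:
        path.append(node.item)
        node = node.parent
    print(list(reversed(path)), ":", leaf.count)
# -> ['I2', 'I1'] : 1   and   ['I2', 'I1', 'I3'] : 1, as on the slide
```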
[Table: the conditional pattern base and conditional FP-tree obtained for each suffix item.]

**FP-Tree Example Continued**
- Out of these items, only I1 and I2 are kept in the conditional FP-tree, because I3 does not satisfy the minimum support count:
  - For I1, the support count in the conditional pattern base is 1 + 1 = 2.
  - For I2, the support count in the conditional pattern base is 1 + 1 = 2.
  - For I3, the support count is 1, less than the required min_sup of 2.
- Now we have a conditional FP-tree.
- All frequent patterns corresponding to suffix I5 are generated by considering all possible combinations of I5 with the conditional FP-tree (here: {I2, I5}, {I1, I5}, and {I2, I1, I5}, each with support 2).
- The same procedure is applied to suffixes I4, I3, and I1.
- Note: I2 is not considered as a suffix because it has no prefix at all.

**Why Is Frequent-Pattern Growth Fast?**
- A performance study shows that FP-growth is an order of magnitude faster than Apriori.
- Reasoning:
  - No candidate generation, no candidate tests
  - A compact data structure
  - No repeated database scans
  - The basic operations are counting and FP-tree building

**Agenda**
- The Apriori algorithm (mining single-dimensional Boolean association rules)
- The Frequent-Pattern Growth (FP-Growth) method
- Summary

**Summary**
- Association rules are generated from frequent itemsets.
- Frequent itemsets are mined using the Apriori algorithm or the Frequent-Pattern Growth method.
- The Apriori property states that all subsets of a frequent itemset must also be frequent.
- The Apriori algorithm uses frequent itemsets, the join and prune operations, and the Apriori property to derive strong association rules.
- The Frequent-Pattern Growth method avoids the repeated database scans of the Apriori algorithm.
- The FP-Growth method is faster than the Apriori algorithm.

**Mining Topic-Specific Concepts and Definitions on the Web**
Presented by Irem Incekoy
May 2003, Proceedings of the 12th International Conference on World Wide Web, ACM Press
Bing Liu, University of Illinois at Chicago, 851 S. Morgan Street, Chicago, IL 60607-7053
Chee Wee Chin and Hwee Tou Ng, National University of Singapore, 3 Science Drive 2, Singapore

**References**
- Agrawal, R. and Srikant, R., "Fast Algorithms for Mining Association Rules", VLDB-94, 1994.
- Anderson, C. and Horvitz, E., "Web Montage: A Dynamic Personalized Start Page", WWW-02, 2002.
- Brin, S. and Page, L., "The Anatomy of a Large-Scale Hypertextual Web Search Engine", WWW7, 1998.

**Introduction**
- When one wants to learn about a topic, one reads a book or a survey paper.
- One can also read the research papers about the topic.
- None of these is very practical.
- Learning from the Web, by contrast, is convenient, intuitive, and diverse.

**Purpose of the Paper**
- The paper's task is "mining topic-specific knowledge on the Web".
- The goal is to help people learn in-depth knowledge of a topic systematically from the Web.

**Learning about a New Topic**
- One needs to find definitions and descriptions of the topic.
- One also needs to know the sub-topics and salient concepts of the topic.
- Thus, one wants the knowledge as it would be presented in a traditional book; the task of this paper can be summarized as "compiling a book on the Web".

**Proposed Technique**
- First, identify the sub-topics or salient concepts of the given topic.
- Then, find and organize the informative pages containing definitions and descriptions of the topic and its sub-topics.

**Why Are the Current Search Techniques Not Sufficient?**
- For definitions and descriptions of the topic: existing search engines rank web pages based on keyword matching and hyperlink structures,
  which are not very useful for measuring the informative value of a page.
- For sub-topics and salient concepts of the topic: a single web page is unlikely to contain information about all the key concepts or sub-topics. Sub-topics therefore need to be discovered from multiple web pages, and current search engine systems do not perform this task.