
Advanced Topics in Data Mining: Association Rules


Presentation Transcript


  1. Advanced Topics in Data Mining: Association Rules

  2. What Is Association Mining? • Association Rule Mining • Finding frequent patterns, associations, correlations, or causal structures among item sets in transaction databases, relational databases, and other information repositories • Applications • Market basket analysis (marketing strategy: items to put on sale at reduced prices), cross-marketing, catalog design, shelf space layout design, etc. • Examples • Rule form: Body ⇒ Head [Support, Confidence] • buys(x, “Computer”) ⇒ buys(x, “Software”) [2%, 60%] • major(x, “CS”) ∧ takes(x, “DB”) ⇒ grade(x, “A”) [1%, 75%]

  3. Market Basket Analysis Typically, association rules are considered interesting if they satisfy both a minimum support threshold and a minimum confidence threshold.

  4. Rule Measures: Support and Confidence • Let minimum support = 50% and minimum confidence = 50%; then we have • A ⇒ C [50%, 66.6%] • C ⇒ A [50%, 100%]
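
The numbers above can be reproduced directly. A minimal Python sketch, assuming a four-transaction toy database that is consistent with these figures (the slide's own table is a figure and is not in the transcript, so the TIDs and items below are illustrative assumptions):

    # Assumed toy database; chosen so that A => C comes out as [50%, 66.6%].
    transactions = [
        {"A", "B", "C"},   # TID 2000
        {"A", "C"},        # TID 1000
        {"A", "D"},        # TID 4000
        {"B", "E", "F"},   # TID 5000
    ]

    def support(itemset, db):
        # Fraction of transactions containing every item of `itemset`.
        return sum(itemset <= t for t in db) / len(db)

    def confidence(body, head, db):
        # support(body ∪ head) / support(body)
        return support(body | head, db) / support(body, db)

    print(support({"A", "C"}, transactions))       # 0.5    -> A => C support 50%
    print(confidence({"A"}, {"C"}, transactions))  # 0.666  -> A => C confidence 66.6%
    print(confidence({"C"}, {"A"}, transactions))  # 1.0    -> C => A confidence 100%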

  5. Support & Confidence

  6. Association Rule: Basic Concepts • Given • (1) a database of transactions, • (2) each transaction is a list of items (purchased by a customer in a visit) • Find all rules that correlate the presence of one set of items with that of another set of items • Find all rules A ⇒ B with minimum confidence and support • support, s = P(A ∪ B) • confidence, c = P(B|A)
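
Written out as formulas (a restatement of the two definitions above, where σ(X) is the number of transactions containing X and |D| is the total number of transactions):

    s(A ⇒ B) = P(A ∪ B) = σ(A ∪ B) / |D|
    c(A ⇒ B) = P(B | A) = σ(A ∪ B) / σ(A)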

  7. Association Rule Mining: A Road Map • Boolean vs. quantitative associations (based on the types of values handled in the rule set) • buys(x, “SQLServer”) ∧ buys(x, “DM Book”) ⇒ buys(x, “DBMiner”) [0.2%, 60%] • age(x, “30..39”) ∧ income(x, “42..48K”) ⇒ buys(x, “PC”) [1%, 75%] • Single-dimensional vs. multi-dimensional associations • Single-level vs. multiple-level analysis (based on the levels of abstraction involved in the rule set)

  8. Terminologies • Item • I1, I2, I3, … • A, B, C, … • Itemset • {I1}, {I1, I7}, {I2, I3, I5}, … • {A}, {A, G}, {B, C, E}, … • 1-Itemset • {I1}, {I2}, {A}, … • 2-Itemset • {I1, I7}, {I3, I5}, {A, G}, …

  9. Terminologies • K-Itemset • An itemset of length K • Frequent K-Itemset • An itemset of length K that satisfies a minimum support threshold • Association Rule • A rule that satisfies both a minimum support threshold and a minimum confidence threshold

  10. Analysis • The number of itemsets grows combinatorially: n items yield C(n, k) itemsets of cardinality k, and 2^n − 1 non-empty itemsets in total

  11. Mining Association Rules: Apriori Principle • Min. support 50%, min. confidence 50% • For rule A ⇒ C: • support = support({A, C}) = 50% • confidence = support({A, C})/support({A}) = 66.6% • The Apriori principle: • Any subset of a frequent itemset must be frequent

  12. Mining Frequent Itemsets: the Key Step • Find the frequent itemsets: the sets of items that have minimum support • A subset of a frequent itemset must also be a frequent itemset • i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets • Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets) • Use the frequent itemsets to generate association rules (a sketch of this last step follows)
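
A minimal sketch of that rule-generation step, assuming the frequent itemsets and their supports have already been mined (the toy support values below are the ones from the A ⇒ C example, used here only for illustration):

    from itertools import combinations

    def generate_rules(frequent, min_conf):
        # `frequent` maps frozenset itemsets to support; by the Apriori
        # property it contains every subset of each of its itemsets.
        rules = []
        for itemset, sup in frequent.items():
            for r in range(1, len(itemset)):
                for body in map(frozenset, combinations(itemset, r)):
                    conf = sup / frequent[body]
                    if conf >= min_conf:
                        rules.append((set(body), set(itemset - body), sup, conf))
        return rules

    # Toy supports from the earlier A => C example (assumed values).
    frequent = {frozenset("A"): 0.75, frozenset("C"): 0.5, frozenset("AC"): 0.5}
    for body, head, sup, conf in generate_rules(frequent, min_conf=0.5):
        print(body, "=>", head, f"[{sup:.0%}, {conf:.1%}]")
    # {'A'} => {'C'} [50%, 66.7%]
    # {'C'} => {'A'} [50%, 100.0%]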

  13. Example

  14. Apriori Algorithm

  15. Apriori Algorithm

  16. Apriori Algorithm
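
The algorithm on these three slides is shown in figures that did not survive the transcript. As a substitute, here is a compact Python sketch of the standard level-wise Apriori loop (a generic rendering, not necessarily the slides' exact pseudocode):

    from itertools import combinations

    def apriori(db, min_sup):
        # db: list of transactions (sets of items); min_sup: minimum support count.
        # Returns a dict mapping each frequent itemset (frozenset) to its count.
        counts = {}
        for t in db:                      # scan 1: count the 1-itemsets
            for item in t:
                key = frozenset([item])
                counts[key] = counts.get(key, 0) + 1
        frequent = {s: c for s, c in counts.items() if c >= min_sup}
        level, k = set(frequent), 2
        while level:
            # Candidate generation: join L(k-1) with itself, keeping unions of
            # size k, then prune candidates with an infrequent (k-1)-subset.
            candidates = {
                a | b for a in level for b in level
                if len(a | b) == k
                and all(frozenset(s) in level for s in combinations(a | b, k - 1))
            }
            # One scan of the database counts all size-k candidates at once.
            counts = {c: sum(c <= t for t in db) for c in candidates}
            level = {c for c, n in counts.items() if n >= min_sup}
            frequent.update({c: counts[c] for c in level})
            k += 1
        return frequent

    # The four-transaction database from the "Another Example 1" slide below.
    db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
    print(sorted(tuple(sorted(s)) for s in apriori(db, min_sup=2)))
    # [(1,), (1, 3), (2,), (2, 3), (2, 3, 5), (2, 5), (3,), (3, 5), (5,)]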

  17. Example of Generating Candidates • L3={abc, abd, acd, ace, bcd} • Self-joining: L3*L3 • abcd from abc and abd • acde from acd and ace • Pruning: • acde is removed because ade is not in L3 • C4={abcd}
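
The join-and-prune on this slide can be checked mechanically. A small sketch reproducing the slide's numbers (itemsets written as strings of single-letter items; the join here takes any two (k−1)-itemsets whose union has size k, a simplification of the prefix-based self-join that yields the same C4 after pruning):

    from itertools import combinations

    def apriori_gen(prev_level, k):
        prev = set(map(frozenset, prev_level))
        joined = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune: drop any candidate with an infrequent (k-1)-subset.
        return [c for c in joined
                if all(frozenset(s) in prev for s in combinations(c, k - 1))]

    L3 = ["abc", "abd", "acd", "ace", "bcd"]
    print(["".join(sorted(c)) for c in apriori_gen(L3, 4)])
    # ['abcd'] -- acde (from acd + ace) is pruned because ade is not in L3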

  18. Another Example 1 • Database D (four transactions): T1 = {1, 3, 4}, T2 = {2, 3, 5}, T3 = {1, 2, 3, 5}, T4 = {2, 5} • Scan D, count C1: {1}: 2, {2}: 3, {3}: 3, {4}: 1, {5}: 3 • Generate L1 = {{1}, {2}, {3}, {5}} • Generate C2, scan D, count C2: {1,2}: 1, {1,3}: 2, {1,5}: 1, {2,3}: 2, {2,5}: 3, {3,5}: 2 • Generate L2 = {{1,3}, {2,3}, {2,5}, {3,5}} • Generate C3 = {{2,3,5}}, scan D, count C3: {2,3,5}: 2 • Generate L3 = {{2,3,5}}

  19. Another Example 2

  20. Is Apriori Fast Enough? — Performance Bottlenecks • The core of the Apriori algorithm: • Use frequent (k–1)-itemsets to generate candidate frequent k-itemsets • Use database scans to collect counts for the candidate itemsets • The bottleneck of Apriori: • Huge candidate sets: • 10^4 frequent 1-itemsets will generate about 10^7 candidate 2-itemsets • To discover a frequent pattern of size 100, e.g., {a1, a2, …, a100}, one needs to generate 2^100 ≈ 10^30 candidates • Multiple scans of the database: • Needs (n+1) scans, where n is the length of the longest pattern
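
Both counts can be checked directly (a quick verification of the slide's figures):

    C(10^4, 2) = 10^4 · (10^4 − 1) / 2 ≈ 5 × 10^7   (on the order of 10^7 candidate 2-itemsets)
    2^100 ≈ 1.27 × 10^30                            (subsets of a size-100 pattern)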

  21. Demo: IBM Intelligent Miner

  22. Demo Database

  23. Methods to Improve Apriori’s Efficiency • Hash-based itemset counting: a k-itemset whose corresponding hash-bucket count is below the threshold cannot be frequent (a counting sketch follows this list) • Transaction reduction: a transaction that does not contain any frequent k-itemset is useless in subsequent scans • Partitioning: any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB • Sampling: mine a subset of the given data with a lowered support threshold, plus a method to determine completeness
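
As an illustration of the first idea, a minimal sketch of DHP-style hash-based counting of 2-itemsets on the four-transaction database used earlier; the hash function (sum of the two items mod 7) is an arbitrary assumption for illustration, not the one from the DHP paper's figure:

    from itertools import combinations

    db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
    min_sup = 2        # 50% of four transactions
    n_buckets = 7

    # While scanning for 1-itemset counts, also hash every 2-itemset of
    # every transaction into a small table of bucket counts.
    buckets = [0] * n_buckets
    for t in db:
        for x, y in combinations(sorted(t), 2):
            buckets[(x + y) % n_buckets] += 1

    # A 2-itemset whose bucket count is below min_sup cannot be frequent,
    # so it is excluded from C2 before the counting scan even starts.
    def possibly_frequent(pair):
        return buckets[sum(pair) % n_buckets] >= min_sup

    C2 = [p for p in combinations([1, 2, 3, 5], 2) if possibly_frequent(p)]
    print(C2)  # [(1, 3), (2, 3), (2, 5), (3, 5)] -- (1,2) and (1,5) never counted

Buckets can collide, so a surviving pair is only possibly frequent; the counting scan still verifies the true supports.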

  24. Partitioning

  25. Hash-Based Itemset Counting

  26. Compare Apriori & DHP (Direct Hash & Pruning): Apriori

  27. Compare Apriori & DHP: DHP

  28. DHP: Database Trimming

  29. DHP (Direct Hash & Pruning) • A database has four transactions • Let min_sup = 50%

  30. Example: Apriori

  31. Example: DHP

  32. Example: DHP

  33. Example: DHP

  34. Mining Frequent Patterns Without Candidate Generation • Compress a large database into a compact Frequent-Pattern tree (FP-tree) structure • highly condensed, but complete for frequent pattern mining • avoids costly database scans • Develop an efficient FP-tree-based frequent pattern mining method • A divide-and-conquer methodology: decompose mining tasks into smaller ones • Avoid candidate generation: sub-database tests only!

  35. Construct FP-tree from a Transaction DB

  36. Construction Steps • Scan DB once, find the frequent 1-itemsets (single-item patterns) • Order frequent items in descending order of frequency • Sort the items of each transaction in DB into this order • Scan DB again and construct the FP-tree (a construction sketch follows this list)
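
A runnable sketch of these steps, using the nine-transaction database whose conditional pattern bases match the slides that follow (an assumption inferred from those slides, since the DB table itself is in a figure):

    from collections import Counter

    class Node:
        def __init__(self, item, parent):
            self.item, self.parent, self.count, self.children = item, parent, 0, {}

    def build_fp_tree(db, min_sup):
        # Scan 1: count single items, keep the frequent ones.
        freq = Counter(i for t in db for i in t)
        freq = {i: c for i, c in freq.items() if c >= min_sup}
        # Global frequency-descending order (ties broken alphabetically).
        rank = {i: r for r, i in enumerate(sorted(freq, key=lambda i: (-freq[i], i)))}
        # Scan 2: insert each transaction's frequent items in that order.
        root, header = Node(None, None), {}
        for t in db:
            node = root
            for item in sorted((i for i in t if i in rank), key=rank.get):
                if item not in node.children:
                    node.children[item] = Node(item, node)
                    header.setdefault(item, []).append(node.children[item])
                node = node.children[item]
                node.count += 1
        return root, header

    db = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
          {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"},
          {"I1", "I2", "I3"}]
    root, header = build_fp_tree(db, min_sup=2)
    print({i: sum(n.count for n in nodes) for i, nodes in header.items()})
    # {'I2': 7, 'I1': 6, 'I5': 2, 'I4': 2, 'I3': 6}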

  37. Benefits of the FP-Tree Structure • Completeness • never breaks a long pattern of any transaction • preserves complete information for frequent pattern mining • Compactness • reduces irrelevant information: infrequent items are gone • frequency-descending ordering: more frequent items are more likely to be shared • never larger than the original database (not counting node-links and counts) • Compression ratio can be over 100

  38. Frequent Pattern Growth • Order frequent items in frequency descending order • For I5: {I1, I5}, {I2, I5}, {I2, I1, I5} • For I4: {I2, I4} • For I3: {I1, I3}, {I2, I3}, {I2, I1, I3} • For I1: {I2, I1}

  39. Frequent Pattern Growth • Trim the database into one sub-DB (conditional pattern base) per frequent item, then build a conditional FP-tree from each conditional pattern base • For I5: {I1, I5}, {I2, I5}, {I2, I1, I5} • For I4: {I2, I4} • For I3: {I1, I3}, {I2, I3}, {I2, I1, I3} • For I1: {I2, I1}

  40. Conditional FP-tree • Building the conditional FP-tree from the conditional pattern base for I3 (a counting sketch follows)
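
A minimal sketch of this step, assuming the conditional pattern base has already been read off the FP-tree as (prefix path, count) pairs; it recovers the item counts that make up I3's conditional FP-tree on the slide below:

    from collections import Counter

    # Conditional pattern base for I3: each entry is a prefix path from the
    # root together with the count of the I3 node below it.
    pattern_base = [(("I2", "I1"), 2), (("I2",), 2), (("I1",), 2)]
    min_sup = 2

    counts = Counter()
    for path, n in pattern_base:
        for item in path:
            counts[item] += n
    # Items reaching min_sup form the nodes of the conditional FP-tree.
    print({i: c for i, c in counts.items() if c >= min_sup})
    # {'I2': 4, 'I1': 4} -- in the tree, I1's 4 splits into 2 under I2 and 2 at the root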

  41. Mining Results Using FP-tree: For I5 (the NULL pattern is not generated) • Conditional pattern base: {(I2 I1: 1), (I2 I1 I3: 1)} • Conditional FP-tree: NULL: 2, I2: 2, I1: 2 • Generated frequent itemsets: I2: 2 gives rule {I2, I5}: 2; I1: 2 gives rule {I1, I5}: 2; I2 I1: 2 gives rule {I2, I1, I5}: 2

  42. Mining Results Using FP-tree: For I4 • Conditional pattern base: {(I2 I1: 1), (I2: 1)} • Conditional FP-tree: NULL: 2, I2: 2 • Generated frequent itemsets: I2: 2 gives rule {I2, I4}: 2

  43. Mining Results Using FP-tree: For I3 • Conditional pattern base: {(I2 I1: 2), (I2: 2), (I1: 2)} • Conditional FP-tree: NULL: 4; I2: 4 with child I1: 2; I1: 2 under the root

  44. Mining Results Using FP-tree: For I1/I3 • Conditional pattern base: {(NULL: 2), (I2: 2)} • Conditional FP-tree: NULL: 4, I2: 2 • Generated frequent itemsets: NULL: 4 gives rule {I1, I3}: 4; I2: 2 gives rule {I2, I1, I3}: 2

  45. Mining Results Using FP-tree: For I2/I3 • Conditional pattern base: {(NULL: 4)} • Conditional FP-tree: NULL: 4 • Generated frequent itemsets: NULL: 4 gives rule {I2, I3}: 4

  46. Mining Results Using FP-tree: For I1 • Conditional pattern base: {(NULL: 2), (I2: 4)} • Conditional FP-tree: NULL: 6, I2: 4 • Generated frequent itemsets: I2: 4 gives rule {I2, I1}: 4

  47. Mining Results Using FP-tree
