
Sampling Large Databases for Association Rules (Toivonen’s Approach, 1996)

This paper discusses Toivonen's approach to sampling large databases for association rules. It covers the algorithm, analysis, and experimental results.



Presentation Transcript


  1. Sampling Large Databases for Association Rules (Toivonen’s Approach, 1996) Farzaneh Mirzazadeh Fall 2007

  2. Outline • Introduction • Preliminaries • Definitions, and Problem Statement • Two General Approaches • Sampling Method for Mining Association Rules • The algorithm • Analysis • Experimental Results

  3. Introduction • Problem: Discovery of Association Rules • Domain: Very Large Databases • Bottleneck: Time • Main-Memory Processing: Negligible • Disk I/O: An Influential Factor • Suggestion: Minimize the Number of Scans of the Database; Ideally Only One Full Pass Over the Database

  4. Introduction (Cont’d): Overview of Toivonen’s Method Main Steps: • Pick a random sample from the database. • Use the sample to determine all probable association rules. • Verify the results with the rest of the database, i.e. eliminate incorrectly detected association rules and add missing association rules. The Main Contribution: To show that all exact frequencies can be found efficiently, by analyzing first a random sample and then the whole database with the proposed method.

  5. Preliminaries • Items • I = {I1, I2, …, Im} • Transactions • r = {t1, t2, …, tn}, tj ⊆ I • Support of an itemset • Percentage of transactions which contain that itemset. • Frequent Itemsets • Association Rules • Strong Association Rules

  6. Preliminaries • Association Rule: implication X ⇒ Y where X, Y ⊆ I and X ∩ Y = Ø • Support of Association Rule X ⇒ Y: Percentage of transactions that contain X ∪ Y • Confidence of Association Rule X ⇒ Y: Ratio of the number of transactions that contain X ∪ Y to the number that contain X • Problem: Find the strong association rules over a given set of items I with respect to the frequency threshold min_fr and the confidence threshold min_conf.
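
The two definitions above can be made concrete with a small Python sketch. The function names `support` and `confidence` and the toy transactions are illustrative assumptions, not taken from the slides.

```python
from typing import FrozenSet, Iterable, Sequence


def support(itemset: Iterable[str], transactions: Sequence[FrozenSet[str]]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    wanted = frozenset(itemset)
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if wanted <= t)
    return hits / len(transactions)


def confidence(x: Iterable[str], y: Iterable[str],
               transactions: Sequence[FrozenSet[str]]) -> float:
    """support(X ∪ Y) / support(X); defined as 0.0 when X never occurs."""
    x_set, y_set = frozenset(x), frozenset(y)
    sx = support(x_set, transactions)
    if sx == 0:
        return 0.0
    return support(x_set | y_set, transactions) / sx


# Toy database (hypothetical data, for illustration only).
transactions = [
    frozenset({"A", "B", "C"}),
    frozenset({"A", "C"}),
    frozenset({"A", "D"}),
    frozenset({"B", "C"}),
]

sup_ac = support({"A", "C"}, transactions)        # 2 of 4 transactions
conf_a_c = confidence({"A"}, {"C"}, transactions)  # 2 of the 3 containing A
```

A rule X ⇒ Y is then strong when its support is at least min_fr and its confidence is at least min_conf.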

  7. Algorithms for Mining Association Rules • Level-wise Algorithms Idea: If a set is not frequent then its supersets cannot be frequent. On level k, candidate itemsets X of size k are generated such that all subsets of X are frequent. • Partition Algorithm Idea: Partition the data into sections small enough to be handled in main memory. First Pass: Find the locally frequent itemsets of each section. Second Pass: Count the union of the locally frequent itemsets over the whole database.
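
A minimal sketch of the level-wise idea, assuming transactions held in memory as frozensets. The name `frequent_itemsets` is hypothetical, and the code recounts frequencies naively instead of using the optimized counting a real implementation would need.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Sequence


def frequent_itemsets(transactions: Sequence[FrozenSet[str]],
                      min_fr: float) -> Dict[FrozenSet[str], float]:
    """Level-wise search: map each frequent itemset to its frequency."""
    n = len(transactions)
    if n == 0:
        return {}

    def fr(itemset: FrozenSet[str]) -> float:
        return sum(1 for t in transactions if itemset <= t) / n

    items = {i for t in transactions for i in t}
    # Level 1: frequent single items.
    level = {frozenset([i]) for i in items if fr(frozenset([i])) >= min_fr}
    result: Dict[FrozenSet[str], float] = {}
    k = 1
    while level:
        for x in level:
            result[x] = fr(x)
        k += 1
        # Candidates of size k: unions of two frequent (k-1)-sets ...
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # ... kept only if every (k-1)-subset is frequent (the pruning step).
        candidates = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
        level = {c for c in candidates if fr(c) >= min_fr}
    return result


# Toy database (hypothetical data, for illustration only).
db = [
    frozenset({"A", "B", "C"}),
    frozenset({"A", "C"}),
    frozenset({"A", "D"}),
    frozenset({"B", "C"}),
]
found = frequent_itemsets(db, min_fr=0.5)
```

On this toy data the candidate {A, B, C} is never counted, because its subset {A, B} is not frequent.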

  8. Sampling for Frequent Sets • Major Steps • Random sampling • Finding the frequent itemsets of the sample • Finding other probable candidates using the concept of Negative Border • Using the rest of the database to check the candidates

  9. Negative Border • All sets which are not in our collection of frequent itemsets, but all of whose proper subsets are, i.e. the minimal itemsets not in S, where S is the collection of frequent itemsets. • Example (items A to F): • S = {{A}, {B}, {C}, {F}, {A,B}, {A,C}, {A,F}, {C,F}, {A,C,F}} • Negative border of S = {{B, C}, {B, F}, {D}, {E}}
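
The example above can be checked with a short sketch. It assumes S is closed under subsets, as any collection of frequent itemsets is, and that the empty set counts as a member, so single items missing from S belong to the border. The function name `negative_border` is illustrative.

```python
from itertools import combinations
from typing import FrozenSet, Iterable, Set


def negative_border(S: Iterable[Iterable[str]],
                    items: Iterable[str]) -> Set[FrozenSet[str]]:
    """Minimal itemsets over `items` that are not in the subset-closed collection S."""
    members = {frozenset(x) for x in S}
    # Single items outside S are minimal: their only proper subset is the empty set.
    border = {frozenset([i]) for i in items} - members
    for a in members:
        for b in members:
            if len(a) != len(b):
                continue
            c = a | b
            # c is one item larger than a and b, is not in S,
            # and all of its immediate subsets are in S.
            if (len(c) == len(a) + 1 and c not in members
                    and all(frozenset(s) in members
                            for s in combinations(c, len(a)))):
                border.add(c)
    return border


S = [{"A"}, {"B"}, {"C"}, {"F"}, {"A", "B"}, {"A", "C"},
     {"A", "F"}, {"C", "F"}, {"A", "C", "F"}]
border = negative_border(S, items="ABCDEF")
```

For instance, {A, B, C} is not in the border because its subset {B, C} is already missing from S, so {A, B, C} is not minimal.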

  10. Frequent Set Discovery • Intuition: Given a collection S of sets that are frequent, the negative border contains the closest itemsets that could be frequent too. • After finding the collection S of itemsets that are frequent in the sample, we check the negative border of S against the database: • Success: If no itemset in the negative border is frequent => We can conclude that all frequent sets have already been found. (Why?) • Failure: If at least one itemset in the negative border is frequent => Some of its supersets may be frequent too and may have been missed. (Why?) • To increase the chance of success, decrease the minimum support used on the sample. • In the case of failure, we can either report failure and stop, or scan the database again and check the supersets to find the exact result.

  11. Toivonen’s Algorithm
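
The steps from slides 8 and 10 can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the pseudocode from the original slide or paper: all function and parameter names (`sample_then_verify`, `low_fr`, `sample_size`, `seed`) are hypothetical, the database is an in-memory list, and counting is done naively.

```python
import random
from itertools import combinations


def frequency(itemset, rows):
    """Fraction of rows containing the itemset."""
    return sum(1 for t in rows if itemset <= t) / len(rows)


def levelwise(rows, threshold):
    """All itemsets whose frequency in `rows` is at least `threshold`."""
    items = {i for t in rows for i in t}
    level = {frozenset([i]) for i in items}
    level = {x for x in level if frequency(x, rows) >= threshold}
    found, k = set(), 1
    while level:
        found |= level
        k += 1
        cands = {a | b for a in level for b in level if len(a | b) == k}
        level = {c for c in cands
                 if all(frozenset(s) in found for s in combinations(c, k - 1))
                 and frequency(c, rows) >= threshold}
    return found


def negative_border(S, items):
    """Minimal itemsets not in the subset-closed collection S."""
    border = {frozenset([i]) for i in items} - S
    for a in S:
        for b in S:
            c = a | b
            if (len(a) == len(b) and len(c) == len(a) + 1 and c not in S
                    and all(frozenset(s) in S for s in combinations(c, len(a)))):
                border.add(c)
    return border


def sample_then_verify(db, items, min_fr, low_fr, sample_size, seed=0):
    """Return (frequent itemsets confirmed on db, possible-failure flag)."""
    rng = random.Random(seed)
    sample = rng.sample(db, sample_size)       # 1. random sample
    S = levelwise(sample, low_fr)              # 2. frequent sets of the sample
    border = negative_border(S, items)         # 3. extra candidates
    counts = {x: 0 for x in S | border}
    for t in db:                               # 4. one pass over the database
        for x in counts:
            if x <= t:
                counts[x] += 1
    frequent = {x for x, c in counts.items() if c / len(db) >= min_fr}
    # A frequent border set means some frequent superset may have been missed.
    failure = any(x in frequent for x in border)
    return frequent, failure


# Toy database (hypothetical data, for illustration only).
db = [frozenset("ABC"), frozenset("AC"), frozenset("AD"), frozenset("BC")]
result, failed = sample_then_verify(db, "ABCD", min_fr=0.5, low_fr=0.5,
                                    sample_size=len(db))
```

Passing a `low_fr` below `min_fr` corresponds to the lowered minimum support from slide 10: more itemsets survive the sample step, so fewer frequent sets are missed, at the cost of more candidates to count in the full pass.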

  12. Failure Handling • In the fraction of cases where a possible failure is reported, all frequent sets can be found by making a second pass over the database: The algorithm simply computes the collection of all sets that could possibly be frequent.

  13. Analysis of Sampling • Sample Size and Probability of Failure
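
The slide keeps only its heading. As a hedged illustration, the sketch below assumes the Chernoff-type bound given in [1]: with a sample of at least ln(2/δ) / (2ε²) transactions, the sample frequency of a given itemset deviates from its true frequency by more than ε with probability at most δ. The function name `sufficient_sample_size` is hypothetical.

```python
import math


def sufficient_sample_size(eps: float, delta: float) -> int:
    """Smallest integer n with n >= ln(2 / delta) / (2 * eps**2).

    eps   -- tolerated absolute error of an itemset's sample frequency
    delta -- tolerated probability of exceeding that error
    """
    if not (0 < eps < 1 and 0 < delta < 1):
        raise ValueError("eps and delta must lie strictly between 0 and 1")
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))


n = sufficient_sample_size(eps=0.01, delta=0.01)
```

Under this bound the required sample size depends only on ε and δ, not on the size of the database, and it grows quadratically as ε shrinks.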

  14. Experimental Results

  15. Conclusion • Advantages: Reduced failure probability, while keeping the candidate count low enough for memory • Disadvantages: Potentially large number of candidates in the second pass

  16. References [1] H. Toivonen, Sampling Large Databases for Association Rules, Proc. of VLDB Conference, India, 1996.

  17. Questions ?

  18. Thank you
