
Is Sampling Useful in Data Mining? A Case in the Maintenance of Discovered Association Rules


Presentation Transcript


  1. Is Sampling Useful in Data Mining? A Case in the Maintenance of Discovered Association Rules S.D. Lee David W. Cheung Ben Kao The University of Hong Kong Data Mining and Knowledge Discovery, 1998 Presenter: Tri Tran CS331 – Spring 2006

  2. Outline • Introduction • Problem Descriptions and Solutions • Mining of Association Rules • Update of Association Rules • Scheduling Update of Association Rules • DELI Algorithm • Example of DELI Algorithm • Experimental Results • Conclusions

  3. Introduction • Data mining is applicable in many areas, such as decision support, market strategy and financial forecasting • It enables us to extract useful information from huge databases • It enables marketers to develop and implement customized marketing programs and strategies • Mining of association rules is one of the most common data mining problems

  4. Introduction (cont.) • The database keeps changing over time; hence, the set of discovered association rules needs to be updated to reflect the changes ⇒ Maintenance of discovered association rules is also an important problem • Existing solutions scan the database multiple times to discover the exact association rules • Apriori algorithm: discovers the set of association rules • FUP2 algorithm: efficiently updates the discovered association rules when transactions are added to, deleted from or modified in the database • The authors propose an algorithm, DELI, to determine when a rule update should be applied • The algorithm uses sampling techniques to estimate the maximum amount of change in the set of rules caused by the database update

  5. Problem 1: Mining of Association Rules • Given a database D of transactions over a set of possible items, find the large itemsets • Large itemsets: itemsets whose transaction support is above a pre-specified support threshold, s% • Transaction: a non-empty set of items • Association rule: X => Y, where X and Y are itemsets • By examining the large itemsets, find the association rules whose confidence is above a confidence threshold, c%
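The support and confidence definitions above can be sketched in code (a toy illustration; the transaction data and helper names are invented for the example, not taken from the paper):

```python
from itertools import combinations

def support_count(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    s = set(itemset)
    return sum(1 for t in transactions if s <= set(t))

def rules_from_itemset(itemset, transactions, min_conf):
    """Emit rules X => Y (with X ∪ Y = itemset) whose confidence,
    support(itemset) / support(X), meets the threshold min_conf."""
    items, rules = list(itemset), []
    whole = support_count(itemset, transactions)
    for r in range(1, len(items)):
        for lhs in combinations(items, r):
            conf = whole / support_count(lhs, transactions)
            if conf >= min_conf:
                rhs = tuple(i for i in items if i not in lhs)
                rules.append((lhs, rhs, conf))
    return rules

transactions = [("A", "B", "C"), ("A", "B"), ("A", "C"), ("B", "C"), ("A", "B", "C")]
# On this toy data both A => B and B => A hold with confidence 3/4.
rules = rules_from_itemset(("A", "B"), transactions, min_conf=0.7)
```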

  6. Solution: Apriori Algorithm • Finds the large itemsets iteratively • At iteration k: • Use the large (k-1)-itemsets, Lk-1, to find the candidate itemsets of size k, Ck • Check which candidates have a support above the pre-specified threshold and add them to the large k-itemsets, Lk • At every iteration, it scans the database to count the transactions that contain each candidate itemset • A large amount of time is spent scanning the whole database
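A minimal sketch of this iterative structure (illustrative only; a real implementation would use hash trees and other optimizations, and min_support is given here as an absolute count):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Minimal Apriori: grow the large k-itemsets level by level."""
    tsets = [frozenset(t) for t in transactions]
    items = sorted({i for t in tsets for i in t})
    # L1: large 1-itemsets
    large = {frozenset([i]) for i in items
             if sum(i in t for t in tsets) >= min_support}
    result, k = {}, 1
    while large:
        for s in large:
            result[tuple(sorted(s))] = sum(s <= t for t in tsets)
        # Candidate generation: join large k-itemsets, then keep only
        # candidates all of whose k-subsets are large (Apriori property).
        cands = {a | b for a in large for b in large if len(a | b) == k + 1}
        cands = {c for c in cands
                 if all(frozenset(sub) in large for sub in combinations(c, k))}
        # One full database scan per level to count the candidates.
        large = {c for c in cands if sum(c <= t for t in tsets) >= min_support}
        k += 1
    return result

transactions = [("A", "B", "C"), ("A", "B"), ("A", "C"), ("B", "C"), ("A", "B", "C")]
counts = apriori(transactions, min_support=3)
```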

  7. Problem 2: Update of Association Rules • After some updates have been applied to a database, find the new large itemsets and their support counts in an efficient manner • All database updates are either insertions or deletions • Association Rule Maintenance Problem • Efficiently update the discovered association rules by using the old database mining results

  8. Update of Association Rules • Δ-: set of deleted transactions • Δ+: set of added transactions • D: old database • D': updated database • D*: set of unchanged transactions (D - Δ-) • σX: support count of itemset X in D • σ'X: new support count of itemset X in D' • δX-: support count of itemset X in Δ- • δX+: support count of itemset X in Δ+ • D' = (D - Δ-) ∪ Δ+ • σ'X = (σX - δX-) + δX+
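The update formula σ'X = (σX - δX-) + δX+ means only the changed transactions ever need to be scanned; a small sketch (names invented here):

```python
def delta_count(itemset, transactions):
    """Support count of an itemset within a batch of transactions."""
    s = set(itemset)
    return sum(1 for t in transactions if s <= set(t))

def new_support(itemset, old_count, deleted, added):
    """sigma'X = (sigmaX - deltaX-) + deltaX+ : scan only the deleted and
    added batches, never the unchanged bulk of the database."""
    return old_count - delta_count(itemset, deleted) + delta_count(itemset, added)

# If X = {A, B} had old count 10, one deleted and two added transactions contain it:
sigma_new = new_support(("A", "B"), 10,
                        deleted=[("A", "B"), ("C",)],
                        added=[("A", "B", "C"), ("A", "B")])  # 10 - 1 + 2 = 11
```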

  9. FUP2 Algorithm • Addresses the maintenance problem • Apriori fails to reuse the old data mining results • FUP2 reduces the amount of work that needs to be done • FUP2 works similarly to Apriori by generating large itemsets iteratively • For the old large itemsets, it scans only the updated part of the database • For the rest of the candidates, it scans the whole database

  10. FUP2 Algorithm • Finds the large itemsets iteratively by reusing the results of the previous mining • At iteration k: • Use the new large (k-1)-itemsets L'k-1 (w.r.t. D') to find the candidate itemsets of size k, Ck • Find the support counts of the candidate itemsets in Ck • Divide Ck into two partitions: Pk = Ck ∩ Lk and Qk = Ck - Pk • For X in Pk, calculate σ'X = (σX - δX-) + δX+ • For X in Qk, eliminate candidates with δX+ - δX- <= (|Δ+| - |Δ-|) · s% • For the remaining candidates X in Qk, scan D* to find their counts and add δX+ to get σ'X
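The Pk/Qk case split above can be sketched for one level (a simplified illustration, not the authors' full pseudocode; candidate generation and bookkeeping across levels are omitted):

```python
def fup2_level(candidates, old_counts, new_db_size, deleted, added, unchanged, s):
    """One FUP2 iteration: itemsets large in the old database (Pk) need only
    the changed parts scanned; new candidates (Qk) are pruned before D* is
    scanned. old_counts maps old large k-itemsets to their support counts in D."""
    def cnt(itemset, txns):
        it = set(itemset)
        return sum(1 for t in txns if it <= set(t))

    min_new = new_db_size * s
    new_large = {}
    for X in candidates:
        d_minus, d_plus = cnt(X, deleted), cnt(X, added)
        if X in old_counts:                         # X in Pk: was large in D
            new_count = old_counts[X] - d_minus + d_plus
        else:                                       # X in Qk: was not large in D
            # Safe prune: sigma'X = sigmaX + (d+ - d-) < |D|s% + (|Δ+| - |Δ-|)s%
            if d_plus - d_minus <= (len(added) - len(deleted)) * s:
                continue
            new_count = cnt(X, unchanged) + d_plus  # scan D* only for survivors
        if new_count >= min_new:
            new_large[X] = new_count
    return new_large

# Toy run: the old database had 4 transactions, ("A","B") was large with count 2.
res = fup2_level(
    candidates=[("A", "B"), ("B", "C")],
    old_counts={("A", "B"): 2},
    new_db_size=5, s=0.5,
    deleted=[("A", "C")],
    added=[("B", "C"), ("B", "C")],
    unchanged=[("A", "B"), ("A", "B"), ("B", "C")],
)
```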

  11. Problem 3: Find the Difference Between the Old and New Association Rules • Before doing the update to find L', we want to know the difference between L and L' • Symmetric difference: measures how many large itemsets have been added and deleted after the database update • If too many => it is time to update the association rules • If too few => the old association rules are a good approximation of those of the updated database
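The symmetric-difference measure can be written directly (a small helper; the name is invented here):

```python
def change_fraction(old_large, new_large):
    """|L Δ L'| / |L ∪ L'|: the fraction of large itemsets added
    or deleted by the database update."""
    old_large, new_large = set(old_large), set(new_large)
    union = old_large | new_large
    return len(old_large ^ new_large) / len(union) if union else 0.0
```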

  12. DELI Algorithm • Difference Estimation for Large Itemsets • Purpose: to estimate the difference between the association rules in a database before and after it is updated • Decides whether to update the association rules • Key idea: it approximates an upper bound on the size of the association-rule change by examining samples of the database • Advantage: DELI saves machine resources and time

  13. DELI Algorithm • Input: the old support counts, D, Δ+ and Δ- • Output: a Boolean value indicating whether a rule update is needed • Iterative algorithm: construct Ck from ~Lk-1, which is an approximation of L'k-1 • In each iteration, estimate the support counts of the itemsets in Ck using a sample S of m random transactions drawn from database D

  14. DELI Algorithm – Step 1 • Obtain a random sample S of size m from database D • In each iteration: • generate a candidate set Ck: C1 = I (all 1-itemsets) for k = 1, Ck = apriori_gen(~Lk-1) for k > 1 • divide Ck into 2 partitions: Pk = Ck ∩ Lk and Qk = Ck - Pk

  15. DELI Algorithm – Step 2 • Pk - the itemsets of size k that were large (support count > |D| · s%) in the old database and are potentially large in the new one • For each itemset X ∈ Pk: • σ'X = (σX - δX-) + δX+ (scan only Δ- and Δ+) • If σ'X >= |D'| · s%, then add X to Lk» (Lk» - itemsets large in both the old and new databases)

  16. DELI Algorithm – Step 3 • Qk - the itemsets of size k that were not large (support count < |D| · s%) in the old database but are potentially large in the new one (> |D'| · s%) • For each itemset X ∈ Qk: • If (δX+ - δX-) <= (|Δ+| - |Δ-|) · s%, then delete X from Qk • This prunes away candidate itemsets whose support counts cannot be large (< |D'| · s%) in the new database • For each remaining itemset X ∈ Qk: • Find the support count of X in the sample S, TX (a binomially distributed random variable) • Estimate the support count of X in D, σX, and obtain an interval [aX, bX] with 100(1-α)% confidence • σ'X ∈ [aX + dX, bX + dX], where dX = δX+ - δX- Reason: σ'X = σX + (δX+ - δX-)
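The interval [aX, bX] comes from treating TX as a binomial variable and scaling the sample count up to the full database. A normal-approximation sketch (z = 1.96 for 95% confidence; the paper's exact interval construction may differ):

```python
import math

def support_interval(tx, m, db_size, z=1.96):
    """Estimate sigmaX from a sample of m transactions in which X occurred
    tx times, with a normal-approximation confidence interval for the
    binomially distributed sample count."""
    p = tx / m
    half = z * math.sqrt(m * p * (1 - p))  # z times the std-dev of the count
    scale = db_size / m                    # blow the sample count up to D
    return ((tx - half) * scale, (tx + half) * scale)

# With tx = 202 in a sample of 10000 from a database of 10^6 transactions,
# this reproduces the interval of the worked example later in the slides
# (the sample size there is inferred, not stated on the slide).
a, b = support_interval(202, 10_000, 1_000_000)
```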

  17. DELI Algorithm – Step 3 (cont.) • For each itemset X ∈ Qk: • Compare the estimated interval σ'X ∈ [aX + dX, bX + dX] with |D'| · s% • Lk> - itemsets that were not large in D but are large in D' with 100(1-α)% confidence (the whole interval lies above |D'| · s%) • Lk≈ - itemsets that were not large in D and may be large in D' (the interval straddles |D'| · s%)
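The comparison in this step amounts to checking where the interval sits relative to the threshold |D'| · s% (a sketch; the function and labels are illustrative):

```python
def classify(interval, threshold):
    """Place an itemset according to its estimated sigma'X interval:
    entirely above the threshold -> confidently large (Lk>);
    straddling it -> uncertain (Lk≈); entirely below -> pruned."""
    a, b = interval
    if a >= threshold:
        return "large"      # Lk>
    if b >= threshold:
        return "uncertain"  # Lk≈
    return "small"          # dropped

# |D'| * s% = 20020 in the slides' example; BE's interval straddles it:
verdict = classify((17677, 23191), 20020)
```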

  18. DELI Algorithm – Step 4 • Obtain the estimated set of large itemsets of size k: ~Lk = Lk» ∪ Lk> ∪ Lk≈ Itemsets: Lk» - large in D, large in D' (Step 2) Lk> - not large in D, large in D' with a certain confidence (Step 3) Lk≈ - not large in D, maybe large in D' (Step 3) • ~Lk is an approximation of the new Lk. ~Lk is an overestimate; therefore, the difference between ~Lk and Lk gives an upper bound on the true change.

  19. DELI Algorithm – Step 5 • Decide whether an association-rule update is needed • IF the uncertainty (|Lk≈| / |~Lk|) is too large => DELI halts, update is needed • IF the symmetric difference of the large itemsets is too large => DELI halts, update is needed • IF ~Lk is empty => DELI halts, no update is necessary • IF ~Lk is non-empty => k = k + 1, go to Step 1
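The halting tests of Step 5 can be sketched as a predicate (the threshold values here are illustrative defaults; the paper leaves them as user parameters):

```python
def needs_update(approx_large, uncertain, old_large,
                 max_uncertainty=0.2, max_difference=0.2):
    """Recommend a full FUP2 update when either the uncertain fraction
    |Lk≈|/|~Lk| or the estimated symmetric difference between the old
    and approximated large itemsets exceeds its threshold."""
    approx_large, old_large = set(approx_large), set(old_large)
    if approx_large and len(set(uncertain)) / len(approx_large) > max_uncertainty:
        return True
    union = approx_large | old_large
    diff = len(approx_large ^ old_large) / len(union) if union else 0.0
    return diff > max_difference
```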

  20. DELI Algorithm – Example • |D| = 10^6, |Δ-| = 9000, |Δ+| = 10000, s% = 2%

  21. DELI Algorithm – Example • k = 1: • C1 = {A, B, C, D, E, F}, P1 = {A, B, C, D, E}, Q1 = {F} • P1: |D'| · s% = 20020 => L1» = {A, B, C, D, E} • Q1: (δF+ - δF-) = 17, (|Δ+| - |Δ-|) · s% = 20; 17 < 20 => drop F • ~L1 = L1 = {A, B, C, D, E} • Update? No. • k = 2, proceed to Step 1

  22. DELI Algorithm – Example • k = 2: • ~L1 = {A, B, C, D, E}, P2 = {AB, AC, AD, AE, BC, BD, CD}, Q2 = {BE, CE, DE} • P2: |D'| · s% = 20020 => L2» = {AB, AC, AD, BC, BD, CD} • Q2: drop CE and DE, because (δX+ - δX-) <= (|Δ+| - |Δ-|) · s% • For BE: assume the support count of BE in S is TX = 202 => estimated σX = 20200; the 95% confidence interval for σX is [20200 - 2757, 20200 + 2757] = [17443, 22957] • For σ'X, the confidence interval is [17677, 23191]; since 17677 < |D'| · s% = 20020 < 23191 => L2≈ = {BE}, L2> = Ø
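The arithmetic on this slide can be reproduced numerically. The sample size m = 10,000, z = 1.96, and the shift δBE+ - δBE- = 234 are inferred from the numbers shown, since the slide does not state them explicitly:

```python
import math

tx, m, db_size, z = 202, 10_000, 1_000_000, 1.96  # m and z inferred
scale = db_size / m
sigma_est = tx * scale                             # 202 * 100 = 20200
half = z * math.sqrt(m * (tx / m) * (1 - tx / m)) * scale
a, b = sigma_est - half, sigma_est + half          # ≈ [17443, 22957]
shift = 234                                        # deltaBE+ - deltaBE-, inferred
interval_new = (round(a + shift), round(b + shift))
print(interval_new)  # (17677, 23191), matching the slide
```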

  23. DELI Algorithm – Example • k=2: 4) ~L2 = {AB, AC, AD, BC, BD, CD, BE} 5) Update? No. (uncertainty=1/7 and difference=2/15). k = 3, proceed to Step 1. • k=3: … 4) ~L3 = {ABC, ACD, BCD} 5) Update? No. (uncertainty=0 and difference=2/15) • k=4: C4 = Ø STOP. • Returns: False (no update of association rules is needed).

  24. Experimental Results • Synthetic databases – generate D, Δ+, Δ- • Use Apriori to find the large itemsets • FUP2 is invoked to find the large itemsets in the updated database – record the time • Run DELI – record the time • |D| = 100000, |Δ+| = |Δ-| = 5000, confidence = 95%, s% = 2%, m = 20000

  25. Experimental Results

  26. Experimental Results • [Figure: results as the level of confidence varies from 90% to 99%]

  27. Experimental Results

  28. Conclusions • Real-world databases are updated constantly; therefore, the knowledge extracted from them changes too • The authors proposed the DELI algorithm to determine whether the change is significant, i.e., when to update the extracted association rules • The algorithm applies sampling techniques and statistical methods to efficiently estimate an approximation of the new large itemsets

  29. Final Exam Questions • Q1: Compare and contrast FUP2 and DELI • Both algorithms are used in Association Analysis • Goal: DELI decides when to update the association rules while FUP2 provides an efficient way of updating them • Technique: DELI scans a small portion of the database (sample) and approximates the large itemsets whereas FUP2 scans the whole database and returns the large itemsets exactly • DELI saves machine resources and time

  30. Final Exam Questions • Q2: What does DELI stand for? Difference Estimation for Large Itemsets • Q3: Difference between Apriori and FUP2: • Apriori scans the whole database to find the association rules and does not use the old data mining results • For most itemsets, FUP2 scans only the updated part of the database, taking advantage of the old association analysis results
