1 / 22

MINING FREQUENT ITEMSETS IN A STREAM TOON CALDERS, NELE DEXTERS, BART GOETHALS ICDM2007

MINING FREQUENT ITEMSETS IN A STREAM TOON CALDERS, NELE DEXTERS, BART GOETHALS ICDM2007. Date: 5 June 2008 Speaker: Li, Huei-Jyun Advisor: Dr. Koh, Jia-Ling. OUTLINE. Introduction Problem Statement Property of Max-Frequency Algorithm Experiments Conclusion. INTRODUCTION.

haroun
Télécharger la présentation

MINING FREQUENT ITEMSETS IN A STREAM TOON CALDERS, NELE DEXTERS, BART GOETHALS ICDM2007

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. MINING FREQUENT ITEMSETS IN A STREAMTOON CALDERS, NELE DEXTERS, BART GOETHALSICDM2007 Date: 5 June 2008 Speaker: Li, Huei-Jyun Advisor: Dr. Koh, Jia-Ling

  2. OUTLINE • Introduction • Problem Statement • Property of Max-Frequency • Algorithm • Experiments • Conclusion

  3. INTRODUCTION • Most previous work on mining frequently occurring itemsets over data streams either focuses on • The sliding window model • The time-fading model • The landmark model • Each of these models requires a fixed window length or decay factor given by the user • In many applications, however, choosing such parameters that are most appropriate for every itemset at every timepoint in an evolving stream is almost impossible

  4. INTRODUCTION • We propose to consider for each itemset the window in which it has the highest frequency • We define the current frequency of an itemset as the maximum over all windows from the past until the current state that satisfy a minimal size constraint • When a stream evolves, the length of the window containing the highest frequency for a given itemset can change continuously • This new stream measure turns out to be very suitable to early detect sudden bursts of occurrences of itemsets, while still taking into account the history of the itemset

  5. PROBLEM STATEMENT* STREAMS AND MAX-FREQUENCY • : a stream 〈I1 I2 … In〉 is a sequence of itemsets • is the length of the stream • I1 is considered the first and oldest itemset in the stream, and In the latest and most recent • : the number of sets in a stream that contain itemset I • : the sub-stream of the window 〈Is Is+1 … It〉 • : the sub-stream of consisting of the last k items of ,

  6. PROBLEM STATEMENT* STREAMS AND MAX-FREQUENCY • Definition 1. • Given a minimal window size mwl, the max-frequency of itemset I in a stream is defined as the maximum of the frequencies of I over all windows, of size at least mwl, extending the end of the stream; that is • If the length of the stream is less than mwl, the max-frequency is defined to be 0

  7. PROBLEM STATEMENT* STREAMS AND MAX-FREQUENCY • Definition 1. (cont.) • The longest window in which the maximum frequency is reached is called the maximal window for I in , and its starting point is denoted • That is, is the smallest index such that • mwl will be omitted when clear form the context

  8. PROPERTIES OF MAX-FREQUENCY

  9. PROPERTIES OF MAX-FREQUENCY

  10. ALGORITHM * THE SUMMARY • Let p1 < p2 < … < pr be the borders for itemset A in the stream , ordered from oldest to most recent • Let be the number of occurrences of the target itemset A in between two subsequent border positions pi and pi+1( for i = 1, …, r-1 ). Denotes the number of occurrences of A since the last border • The summary St of is defined as the array

  11. ALGORITHM * THE SUMMARY • We can easily compute the frequencies of itemset A for any of the border positions form this summary:

  12. ALGORITHM * THE SUMMARY • The fractions in the blocks in between two subsequent border positions are increasing, and as a consequence, among all borders pi, we have that is maximal for i equal to r

  13. ALGORITHM * THE SUMMARY

  14. ALGORITHM * MINIMAL FREQUENCY • Until now, we assumed that for the target itemset we need to be able to report its frequency exactly. We will now relax this requirement by setting a minimal frequency threshold minfreq • Let be a stream with , and suppose that • Then we can remove( p1, a1 ) from the left-side of the summary

  15. ALGORITHM * MINIMAL WINDOW LENGTH • In the algorithm without minimal window length, a border q in stream can be pruned of we can find two blocks and such that the frequency of the target in is higher than • When we are working with a minimal window length, it could be the case that the suffix of the stream starting at r + 1 does not meet the minimal window length requirement • In that case, even though the window starting at q has lower frequency than the window starting r + 1, it can still have the highest frequency of all windows that meet the minimal window requirement!

  16. ALGORITHM * MINIMAL WINDOW LENGTH

  17. ALGORITHM * MINIMAL WINDOW LENGTH • In order to know the maximal frequency with a minimal window length mwl, it suffices to apply the method without any minimal window length to keep track of the borders for the stream • Then, when we need the max-frequency, we check the borders of in the complete stream , and the minimal window itself,

  18. ALGORITHM * MINING ALL ITEMSETS

  19. ALGORITHM * MINING ALL ITEMSETS • We do not need to maintain the summaries of all itemsets, but only those that were once frequent in the minimal window, and that are, at the same time, frequent now within the part of the stream • Furthermore, we need to find the frequent itemsets in the mwl windows

  20. EXPERIMENTS

  21. EXPERIMENTS

  22. CONCLUSION • We presented a new frequency measure for itemsets in streams that does not rely on a fixed window length or a time-decaying factor • An experimental evaluation supported the claim that the new measure can be computed from a summary with extremely small memory requirements, that can be maintained and updated efficiently • The summary of the stream consists of the borders and their corresponding frequencies

More Related