Approximate Frequency Counts over Data Streams Gurmeet Singh Manku, Rajeev Motwani Proceedings of the 28th VLDB Conference, 2002 Presenter: 吳建良
Motivation • In some new applications, data come as a continuous “stream” • The sheer volume of a stream over its lifetime is huge • Response times of queries should be small • Examples: • Network traffic measurements • Market data
Network Traffic Management • Frequent items: frequent flow identification at an IP router • short-term monitoring • long-term management • Example alert: "RED flow exceeds 1% of all traffic through me, check it!"
Mining Market Data • Frequent itemsets at a supermarket • store layout • catalog design … • Example: among 100 million records, (1) at least 1% of customers buy beer and diapers at the same time; (2) 51% of customers who buy beer also buy diapers!
Challenges • Single pass • Limited memory (network management) • Enumeration of itemsets (mining market data)
General Solution [Diagram: data streams enter a stream processing engine, which maintains a summary in memory and returns (approximate) answers]
Approximate Algorithms • Two algorithms are proposed for frequent items • Sticky Sampling • Lossy Counting • One algorithm is proposed for frequent itemsets • Extended Lossy Counting for frequent itemsets
Properties of the proposed algorithms • All item(set)s whose true frequency exceeds sN are output • No item(set) whose true frequency is less than (s − ε)N is output • Estimated frequencies are less than the true frequencies by at most εN • (The paper suggests ε = 0.1s as a rule of thumb, e.g., s = 1% with ε = 0.1%)
Sticky Sampling Algorithm • User input includes three parameters, namely: • Support threshold s • Error parameter ε • Probability of failure δ • Counts are kept in a data structure S • Each entry in S is of the form (e, f), where: • e is the item • f is the estimated frequency of e in the stream • When queried about the frequent items, all entries (e, f) such that f ≥ (s − ε)N are output • N denotes the current length of the stream
Sticky Sampling Algorithm (cont'd) [Figure: example of a stream being sampled into an initially empty S]
When to prune S: at each sampling-rate change
Sticky Sampling Algorithm (cont'd)
1. S ← ∅; N ← 0; t ← (1/ε) log(1/(sδ)); r ← 1
2. e ← next item; N ← N + 1
3. if (e, f) exists in S do
4.   increment the count f
5. else if random(0,1) < 1/r do
6.   insert (e, 1) into S
7. endif
8. if N = 2t·2^n for some n ≥ 0 do
9.   r ← 2r
10.  Prune(S)
11. endif
12. Goto 2
S: the set of all counts · e: item · N: current length of the stream · r: sampling rate · t: (1/ε) log(1/(sδ))
Sticky Sampling Algorithm: Prune S
function Prune(S)
  for every entry (e, f) in S do
    while random(0,1) < 0.5 and f > 0 do
      f ← f − 1
    if f = 0 do
      remove the entry from S
    endif
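A minimal Python sketch of the algorithm above, assuming natural logarithms and the rate schedule from the paper (r doubles after 2t, 4t, 8t, … elements); the name sticky_sampling and its parameter layout are ours:

import math
import random

def sticky_sampling(stream, s, eps, delta):
    # Sticky Sampling sketch: S maps item -> estimated frequency f.
    t = (1.0 / eps) * math.log(1.0 / (s * delta))
    S = {}
    r = 1                  # current sampling rate: insert with probability 1/r
    N = 0                  # stream length so far
    next_change = 2 * t    # rate doubles when N reaches 2t, 4t, 8t, ...
    for e in stream:
        N += 1
        if e in S:
            S[e] += 1
        elif random.random() < 1.0 / r:
            S[e] = 1
        if N >= next_change:
            r *= 2
            next_change *= 2
            # Prune(S): toss coins to emulate sampling at the coarser rate
            for item in list(S):
                while random.random() < 0.5 and S[item] > 0:
                    S[item] -= 1
                if S[item] == 0:
                    del S[item]
    # Query: report entries with f >= (s - eps) * N
    return {e: f for e, f in S.items() if f >= (s - eps) * N}

Because insertions are probabilistic, repeated runs can differ slightly; the (s − ε)N query threshold is what makes the answer ε-deficient with probability at least 1 − δ.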
Lossy Counting Algorithm • Incoming data stream is conceptually divided into buckets of w = ⌈1/ε⌉ transactions • The current bucket id is denoted bcurrent = ⌈N/w⌉ • fe: the true frequency of e in the stream • Counts are kept in a data structure D • Each entry in D is of the form (e, f, Δ), where: • e is the item • f is the estimated frequency of e in the stream • Δ is the maximum possible error in f
Lossy Counting Algorithm (cont'd)
Example: ε = 0.2, w = 5, N = 17, bcurrent = 4
Stream: A B C A B | E A C C D | D A B E D | F C
• After bucket 1: D = (A,2,0) (B,2,0) (C,1,0); prune removes (C,1,0), leaving (A,2,0) (B,2,0)
• After bucket 2: D = (A,3,0) (B,2,0) (C,2,1) (E,1,1) (D,1,1); prune removes (B,2,0), (E,1,1), (D,1,1), leaving (A,3,0) (C,2,1)
• After bucket 3: D = (A,4,0) (B,1,2) (C,2,1) (D,2,2) (E,1,2); prune removes (B,1,2), (C,2,1), (E,1,2), leaving (A,4,0) (D,2,2)
• After the partial fourth bucket (F, C): D = (A,4,0) (C,1,3) (D,2,2) (F,1,3)
When to prune D: at each bucket boundary
Lossy Counting Algorithm (cont'd)
1. D ← ∅; N ← 0
2. w ← ⌈1/ε⌉; bcurrent ← 1
3. e ← next item; N ← N + 1
4. if (e, f, Δ) exists in D do
5.   f ← f + 1
6. else do
7.   insert (e, 1, bcurrent − 1) into D
8. endif
9. if N mod w = 0 do
10.  prune(D, bcurrent)
11.  bcurrent ← bcurrent + 1
12. endif
13. Goto 3
D: the set of all counts · N: current length of the stream · e: item · w: bucket width · bcurrent: current bucket id
Lossy Counting Algorithm: prune D
function prune(D, bcurrent)
  for each entry (e, f, Δ) in D do
    if f + Δ ≤ bcurrent do
      remove the entry from D
    endif
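The same loop in runnable form; a sketch only, where lossy_counting and the dict-of-pairs layout are our choices:

import math

def lossy_counting(stream, s, eps):
    # Lossy Counting sketch: D maps item -> (f, delta).
    w = math.ceil(1.0 / eps)    # bucket width
    D = {}
    N = 0
    b_current = 1
    for e in stream:
        N += 1
        if e in D:
            f, delta = D[e]
            D[e] = (f + 1, delta)
        else:
            D[e] = (1, b_current - 1)   # delta = maximum possible undercount
        if N % w == 0:
            # bucket boundary: drop entries with f + delta <= b_current
            D = {item: fd for item, fd in D.items()
                 if fd[0] + fd[1] > b_current}
            b_current += 1
    # Query: report entries with f >= (s - eps) * N
    return {e: fd[0] for e, fd in D.items() if fd[0] >= (s - eps) * N}

Running this on the 17-item stream from the example above with eps = 0.2 reproduces the final entries for A, C, D, and F.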
Lossy Counting Algorithm (cont'd) • Four Lemmas
Lemma 1: Whenever deletions occur, bcurrent ≤ εN
Lemma 2: Whenever an entry (e, f, Δ) gets deleted, fe ≤ bcurrent
Lemma 3: If e does not appear in D, then fe ≤ εN
Lemma 4: If (e, f, Δ) ∈ D, then f ≤ fe ≤ f + εN
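A short derivation (our paraphrase of the paper's argument) of why Lemma 4 follows from the earlier lemmas:

\begin{align*}
f &\le f_e && \text{(counts are only incremented on true occurrences of } e\text{)}\\
f_e &\le f + \Delta && \text{(Lemma 2: at most } \Delta \text{ occurrences were missed before insertion)}\\
\Delta &\le b_{\mathrm{current}} - 1 \le \varepsilon N && \text{(since } b_{\mathrm{current}} = \lceil N/w \rceil \text{ and } w = \lceil 1/\varepsilon \rceil\text{)}
\end{align*}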
Extended Lossy Counting for Frequent Itemsets • Incoming data stream is conceptually divided into buckets of w = ⌈1/ε⌉ transactions • Counts are kept in a data structure D • Multiple buckets (β of them, say) are processed in a batch • Each entry in D is of the form (set, f, Δ), where: • set is the itemset • f is the approximate frequency of set in the stream • Δ is the maximum possible error in f
Extended Lossy Counting for Frequent Itemsets (cont'd) [Figure: buckets 1–3 of the stream are read into main memory as a single batch, i.e., β = 3 buckets at a time]
Overview of the algorithm • D is updated by the operations UPDATE_SET and NEW_SET (a sketch follows below) • UPDATE_SET updates and deletes entries in D • For each entry (set, f, Δ), count the occurrences of set in the batch and update the entry • If an updated entry satisfies f + Δ ≤ bcurrent, the entry is removed from D • NEW_SET inserts new entries into D • If a set set has frequency f ≥ β in the batch and does not occur in D, create a new entry (set, f, bcurrent − β)
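A compact Python sketch of the two operations, assuming the batch counts have already been produced by SetGen; update_d, batch_counts, and beta are our names:

def update_d(D, batch_counts, b_current, beta):
    # D: itemset -> (f, delta); batch_counts: itemset -> occurrences in batch.
    # UPDATE_SET: refresh existing entries, delete those that fall below
    # the threshold f + delta <= b_current.
    for itemset in list(D):
        f, delta = D[itemset]
        f += batch_counts.get(itemset, 0)
        if f + delta <= b_current:
            del D[itemset]
        else:
            D[itemset] = (f, delta)
    # NEW_SET: insert untracked itemsets seen at least beta times in the
    # batch, with error bound delta = b_current - beta.
    for itemset, f in batch_counts.items():
        if itemset not in D and f >= beta:
            D[itemset] = (f, b_current - beta)
    return D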
Implementation • Challenges: • Avoid enumerating all subsets of a transaction • The data structure must be compact for better space efficiency • 3 major modules: • Buffer • Trie • SetGen
Implementation (cont'd) • Buffer: repeatedly reads a batch of buckets of transactions into available main memory • Trie: maintains the data structure D • SetGen: generates subsets of item-ids along with their frequency counts in the current batch • Not all possible subsets need to be generated • If a subset S is not inserted into D after applying both UPDATE_SET and NEW_SET, then no superset of S needs to be considered (see the sketch below)
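A sketch of that level-wise generation with pruning; setgen and the process callback are our framing (the paper's SetGen instead walks a trie over sorted item-ids, but the pruning rule is the same):

from itertools import combinations

def setgen(batch, process):
    # batch: list of transactions (iterables of item ids).
    # process(subset, f): applies UPDATE_SET/NEW_SET for subset with batch
    # frequency f; returns True iff the subset remains in D afterwards.
    size = 1
    alive = None   # survivors of the previous level; None on the first level
    while True:
        counts = {}
        for t in batch:
            for c in combinations(sorted(set(t)), size):
                # Prune: if any (size-1)-subset of c was dropped from D,
                # no superset of it needs to be considered, so skip c.
                if alive is not None and any(
                        sub not in alive
                        for sub in combinations(c, size - 1)):
                    continue
                counts[c] = counts.get(c, 0) + 1
        if not counts:
            return
        alive = {c for c, f in counts.items() if process(c, f)}
        if not alive:
            return
        size += 1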
Example
Main memory holds one batch of β = 2 buckets: bucket 3 = {ACE, BCD, AB}, bucket 4 = {ABC, AD, BCE}; bcurrent = 4
SetGen (with pruning) generates: ACE: AC, A, C · BCD: BC, B, C · AB: AB, A, B · ABC: AB, AC, BC, A, B, C · AD: A · BCE: BC, B, C
D before the batch: (A,5,0) (B,3,0) (C,3,0) (D,2,0) (AB,2,0) (AC,3,0) (AD,2,0) (BC,2,0)
After UPDATE_SET (entries with f + Δ ≤ bcurrent = 4 are deleted): (A,9,0) (B,7,0) (C,7,0) (AC,5,0) (BC,5,0)
NEW_SET then adds (AB,2,2) into D, since AB occurs β = 2 times in the batch
Experiments • IBM synthetic dataset T10.I4.1000K: N = 1 million, avg. transaction size = 10, input size = 49 MB • IBM synthetic dataset T15.I6.1000K: N = 1 million, avg. transaction size = 15, input size = 69 MB • Frequent word pairs in 100K web documents: N = 100K, avg. transaction size = 134, input size = 54 MB • Frequent word pairs in 806K Reuters news reports: N = 806K, avg. transaction size = 61, input size = 210 MB
Varying support s and BUFFER B
Fixed: stream length N · Varying: BUFFER size B and support threshold s
[Charts: running time in seconds vs. BUFFER size in MB for several support thresholds (s = 0.001–0.020); left: IBM 1M transactions, right: Reuters 806K docs]
Varying length N and support s
Fixed: BUFFER size B · Varying: stream length N and support threshold s
[Charts: running time in seconds vs. stream length in thousands for s = 0.001, 0.002, 0.004; left: IBM 1M transactions, right: Reuters 806K docs]
Varying BUFFER B and support s
Fixed: stream length N · Varying: BUFFER size B and support threshold s
[Charts: running time in seconds vs. support threshold s for B = 4, 16, 28, 40 MB; left: IBM 1M transactions, right: Reuters 806K docs]
Comparison with fast Apriori Dataset: IBM T10.I4.1000K with 1M transactions, average transaction size 10.
Number of counters
Support s = 1%, error ε = 0.1%
[Chart: number of counters vs. stream length N (log10 scale). Sticky Sampling, expected: (2/ε) log(1/(sδ)); Lossy Counting, worst case: (1/ε) log(εN)]
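To make the two bounds concrete, a small back-of-the-envelope computation; natural logarithms and the value of δ are our assumptions, since the slide does not state them:

import math

s, eps, delta = 0.01, 0.001, 1e-4    # support 1%, error 0.1%; delta assumed
sticky = (2 / eps) * math.log(1 / (s * delta))   # expected, independent of N
for N in (10**4, 10**6, 10**8):
    lossy = (1 / eps) * math.log(eps * N)        # worst case, grows with log N
    print(f"N={N:>9}: sticky ~ {sticky:,.0f} counters, lossy <= {lossy:,.0f}")

This matches the qualitative picture in the chart: Sticky Sampling's expected space is independent of N, while Lossy Counting's worst case grows only logarithmically with N and stays far smaller in practice.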