Approximate Frequency Counts over Data Streams Gurmeet Singh Manku, Rajeev Motwani Proceedings of the 28th VLDB Conference, 2002 Presenter: 吳建良
Motivation • In some new applications, data come as a continuous “stream” • The sheer volume of a stream over its lifetime is huge • Response times of queries should be small • Examples: • Network traffic measurements • Market data
Network Traffic Management • Frequent items: frequent flow identification at an IP router • short-term monitoring • long-term management • Example alert: "RED flow exceeds 1% of all traffic through me, check it!"
Mining Market Data • Frequent itemsets at a supermarket • store layout • catalog design … • Example: among 100 million records, (1) at least 1% of customers buy beer and diapers at the same time; (2) 51% of customers who buy beer also buy diapers!
Challenges • Single pass • Limited memory (network management) • Enumeration of itemsets (mining market data)
General Solution [Diagram: data streams enter a stream processing engine, which maintains a summary in memory and returns (approximate) answers]
Approximate Algorithms • Two algorithms are proposed for frequent items • Sticky Sampling • Lossy Counting • One algorithm is proposed for frequent itemsets • Extended Lossy Counting for frequent itemsets
Properties of the proposed algorithms • All item(set)s whose true frequency exceeds sN are output • No item(set) whose true frequency is less than (s − ε)N is output • Estimated frequencies are less than the true frequencies by at most εN • (The paper suggests ε = 0.1s as a rule of thumb, e.g., s = 1% with ε = 0.1%)
Sticky Sampling Algorithm • User input includes three parameters, namely: • Support threshold s • Error parameter ε • Probability of failure δ • Counts are kept in a data structure S • Each entry in S is of the form (e, f), where: • e is the item • f is the estimated frequency of e in the stream • When queried about the frequent items, all entries (e, f) such that f ≥ (s − ε)N are output • N denotes the current length of the stream
Sticky Sampling Algorithm (cont'd) [Figure: example of a stream being sampled into an initially empty S]
When to prune S: at each sampling-rate change
Sticky Sampling Algorithm (cont'd)
1. S ← ∅; N ← 0; t ← (1/ε) log(1/(sδ)); r ← 1
2. e ← next item; N ← N + 1
3. if (e, f) exists in S do
4.   increment the count f
5. else if random(0,1) < 1/r do
6.   insert (e, 1) into S
7. endif
8. if N = 2t·2^n for some n ≥ 0 do
9.   r ← 2r
10.  Prune(S)
11. endif
12. Goto 2
S: the set of all counts · e: item · N: current length of the stream · r: sampling rate · t: (1/ε) log(1/(sδ))
Sticky Sampling Algorithm: Prune S
function Prune(S)
  for every entry (e, f) in S do
    while random(0,1) < 0.5 and f > 0 do
      f ← f − 1
    if f = 0 do
      remove the entry from S
    endif
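A minimal Python sketch of the algorithm above, assuming natural logarithms and the rate schedule from the paper (r doubles after 2t, 4t, 8t, … elements); the name sticky_sampling and its parameter layout are ours:

import math
import random

def sticky_sampling(stream, s, eps, delta):
    # Sticky Sampling sketch: S maps item -> estimated frequency f.
    t = (1.0 / eps) * math.log(1.0 / (s * delta))
    S = {}
    r = 1                  # current sampling rate: insert with probability 1/r
    N = 0                  # stream length so far
    next_change = 2 * t    # rate doubles when N reaches 2t, 4t, 8t, ...
    for e in stream:
        N += 1
        if e in S:
            S[e] += 1
        elif random.random() < 1.0 / r:
            S[e] = 1
        if N >= next_change:
            r *= 2
            next_change *= 2
            # Prune(S): toss coins to emulate sampling at the coarser rate
            for item in list(S):
                while random.random() < 0.5 and S[item] > 0:
                    S[item] -= 1
                if S[item] == 0:
                    del S[item]
    # Query: report entries with f >= (s - eps) * N
    return {e: f for e, f in S.items() if f >= (s - eps) * N}

Because insertions are probabilistic, repeated runs can differ slightly; the (s − ε)N query threshold is what makes the answer ε-deficient with probability at least 1 − δ.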
Lossy Counting Algorithm • Incoming data stream is conceptually divided into buckets of w = ⌈1/ε⌉ transactions • The current bucket id is denoted bcurrent = ⌈N/w⌉ • fe: the true frequency of e in the stream • Counts are kept in a data structure D • Each entry in D is of the form (e, f, Δ), where: • e is the item • f is the estimated frequency of e in the stream • Δ is the maximum possible error in f
Lossy Counting Algorithm (cont'd)
Example: ε = 0.2, w = 5, N = 17, bcurrent = 4
Stream: A B C A B | E A C C D | D A B E D | F C
• After bucket 1: D = (A,2,0) (B,2,0) (C,1,0); prune removes (C,1,0), leaving (A,2,0) (B,2,0)
• After bucket 2: D = (A,3,0) (B,2,0) (C,2,1) (E,1,1) (D,1,1); prune removes (B,2,0), (E,1,1), (D,1,1), leaving (A,3,0) (C,2,1)
• After bucket 3: D = (A,4,0) (B,1,2) (C,2,1) (D,2,2) (E,1,2); prune removes (B,1,2), (C,2,1), (E,1,2), leaving (A,4,0) (D,2,2)
• After the partial fourth bucket (F, C): D = (A,4,0) (C,1,3) (D,2,2) (F,1,3)
When to prune D: at each bucket boundary
Lossy Counting Algorithm (cont'd)
1. D ← ∅; N ← 0
2. w ← ⌈1/ε⌉; bcurrent ← 1
3. e ← next item; N ← N + 1
4. if (e, f, Δ) exists in D do
5.   f ← f + 1
6. else do
7.   insert (e, 1, bcurrent − 1) into D
8. endif
9. if N mod w = 0 do
10.  prune(D, bcurrent)
11.  bcurrent ← bcurrent + 1
12. endif
13. Goto 3
D: the set of all counts · N: current length of the stream · e: item · w: bucket width · bcurrent: current bucket id
Lossy Counting Algorithm: prune D
function prune(D, bcurrent)
  for each entry (e, f, Δ) in D do
    if f + Δ ≤ bcurrent do
      remove the entry from D
    endif
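The same loop in runnable form; a sketch only, where lossy_counting and the dict-of-pairs layout are our choices:

import math

def lossy_counting(stream, s, eps):
    # Lossy Counting sketch: D maps item -> (f, delta).
    w = math.ceil(1.0 / eps)    # bucket width
    D = {}
    N = 0
    b_current = 1
    for e in stream:
        N += 1
        if e in D:
            f, delta = D[e]
            D[e] = (f + 1, delta)
        else:
            D[e] = (1, b_current - 1)   # delta = maximum possible undercount
        if N % w == 0:
            # bucket boundary: drop entries with f + delta <= b_current
            D = {item: fd for item, fd in D.items()
                 if fd[0] + fd[1] > b_current}
            b_current += 1
    # Query: report entries with f >= (s - eps) * N
    return {e: fd[0] for e, fd in D.items() if fd[0] >= (s - eps) * N}

Running this on the 17-item stream from the example above with eps = 0.2 reproduces the final entries for A, C, D, and F.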
Lossy Counting Algorithm (cont'd) • Four Lemmas
Lemma 1: Whenever deletions occur, bcurrent ≤ εN
Lemma 2: Whenever an entry (e, f, Δ) gets deleted, fe ≤ bcurrent
Lemma 3: If e does not appear in D, then fe ≤ εN
Lemma 4: If (e, f, Δ) ∈ D, then f ≤ fe ≤ f + εN
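A short derivation (our paraphrase of the paper's argument) of why Lemma 4 follows from the earlier lemmas:

\begin{align*}
f &\le f_e && \text{(counts are only incremented on true occurrences of } e\text{)}\\
f_e &\le f + \Delta && \text{(Lemma 2: at most } \Delta \text{ occurrences were missed before insertion)}\\
\Delta &\le b_{\mathrm{current}} - 1 \le \varepsilon N && \text{(since } b_{\mathrm{current}} = \lceil N/w \rceil \text{ and } w = \lceil 1/\varepsilon \rceil\text{)}
\end{align*}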
Extended Lossy Counting for Frequent Itemsets • Incoming data stream is conceptually divided into buckets of w = ⌈1/ε⌉ transactions • Counts are kept in a data structure D • Multiple buckets (β of them, say) are processed in a batch • Each entry in D is of the form (set, f, Δ), where: • set is the itemset • f is the approximate frequency of set in the stream • Δ is the maximum possible error in f
Extended Lossy Counting for Frequent Itemsets (cont'd) [Figure: buckets 1–3 of the stream are read into main memory as a single batch, i.e., β = 3 buckets at a time]
Overview of the algorithm • D is updated by the operations UPDATE_SET and NEW_SET (a sketch follows below) • UPDATE_SET updates and deletes entries in D • For each entry (set, f, Δ), count the occurrences of set in the batch and update the entry • If an updated entry satisfies f + Δ ≤ bcurrent, the entry is removed from D • NEW_SET inserts new entries into D • If a set set has frequency f ≥ β in the batch and does not occur in D, create a new entry (set, f, bcurrent − β)
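A compact Python sketch of the two operations, assuming the batch counts have already been produced by SetGen; update_d, batch_counts, and beta are our names:

def update_d(D, batch_counts, b_current, beta):
    # D: itemset -> (f, delta); batch_counts: itemset -> occurrences in batch.
    # UPDATE_SET: refresh existing entries, delete those that fall below
    # the threshold f + delta <= b_current.
    for itemset in list(D):
        f, delta = D[itemset]
        f += batch_counts.get(itemset, 0)
        if f + delta <= b_current:
            del D[itemset]
        else:
            D[itemset] = (f, delta)
    # NEW_SET: insert untracked itemsets seen at least beta times in the
    # batch, with error bound delta = b_current - beta.
    for itemset, f in batch_counts.items():
        if itemset not in D and f >= beta:
            D[itemset] = (f, b_current - beta)
    return D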
Implementation • Challenges: • Avoid enumerating all subsets of a transaction • The data structure must be compact for better space efficiency • 3 major modules: • Buffer • Trie • SetGen
Implementation (cont'd) • Buffer: repeatedly reads a batch of buckets of transactions into available main memory • Trie: maintains the data structure D • SetGen: generates subsets of item-ids along with their frequency counts in the current batch • Not all possible subsets need to be generated • If a subset S is not inserted into D after applying both UPDATE_SET and NEW_SET, then no superset of S needs to be considered (see the sketch below)
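A sketch of that level-wise generation with pruning; setgen and the process callback are our framing (the paper's SetGen instead walks a trie over sorted item-ids, but the pruning rule is the same):

from itertools import combinations

def setgen(batch, process):
    # batch: list of transactions (iterables of item ids).
    # process(subset, f): applies UPDATE_SET/NEW_SET for subset with batch
    # frequency f; returns True iff the subset remains in D afterwards.
    size = 1
    alive = None   # survivors of the previous level; None on the first level
    while True:
        counts = {}
        for t in batch:
            for c in combinations(sorted(set(t)), size):
                # Prune: if any (size-1)-subset of c was dropped from D,
                # no superset of it needs to be considered, so skip c.
                if alive is not None and any(
                        sub not in alive
                        for sub in combinations(c, size - 1)):
                    continue
                counts[c] = counts.get(c, 0) + 1
        if not counts:
            return
        alive = {c for c, f in counts.items() if process(c, f)}
        if not alive:
            return
        size += 1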
Example
Main memory holds one batch of β = 2 buckets: bucket 3 = {ACE, BCD, AB}, bucket 4 = {ABC, AD, BCE}; bcurrent = 4
SetGen (with pruning) generates: ACE: AC, A, C · BCD: BC, B, C · AB: AB, A, B · ABC: AB, AC, BC, A, B, C · AD: A · BCE: BC, B, C
D before the batch: (A,5,0) (B,3,0) (C,3,0) (D,2,0) (AB,2,0) (AC,3,0) (AD,2,0) (BC,2,0)
After UPDATE_SET (entries with f + Δ ≤ bcurrent = 4 are deleted): (A,9,0) (B,7,0) (C,7,0) (AC,5,0) (BC,5,0)
NEW_SET then adds (AB,2,2) into D, since AB occurs β = 2 times in the batch
Experiments • IBM synthetic dataset T10.I4.1000K: N = 1 million, avg. transaction size = 10, input size = 49 MB • IBM synthetic dataset T15.I6.1000K: N = 1 million, avg. transaction size = 15, input size = 69 MB • Frequent word pairs in 100K web documents: N = 100K, avg. transaction size = 134, input size = 54 MB • Frequent word pairs in 806K Reuters news reports: N = 806K, avg. transaction size = 61, input size = 210 MB
Varying support s and BUFFER B
Fixed: stream length N · Varying: BUFFER size B and support threshold s
[Charts: running time in seconds vs. BUFFER size in MB for several support thresholds (s = 0.001–0.020); left: IBM 1M transactions, right: Reuters 806K docs]
Varying length N and support s
Fixed: BUFFER size B · Varying: stream length N and support threshold s
[Charts: running time in seconds vs. stream length in thousands for s = 0.001, 0.002, 0.004; left: IBM 1M transactions, right: Reuters 806K docs]
Varying BUFFER B and support s
Fixed: stream length N · Varying: BUFFER size B and support threshold s
[Charts: running time in seconds vs. support threshold s for B = 4, 16, 28, 40 MB; left: IBM 1M transactions, right: Reuters 806K docs]
Comparison with fast Apriori Dataset: IBM T10.I4.1000K with 1M transactions, average transaction size 10.
Number of counters
Support s = 1%, error ε = 0.1%
[Chart: number of counters vs. stream length N (log10 scale). Sticky Sampling, expected: (2/ε) log(1/(sδ)); Lossy Counting, worst case: (1/ε) log(εN)]
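To make the two bounds concrete, a small back-of-the-envelope computation; natural logarithms and the value of δ are our assumptions, since the slide does not state them:

import math

s, eps, delta = 0.01, 0.001, 1e-4    # support 1%, error 0.1%; delta assumed
sticky = (2 / eps) * math.log(1 / (s * delta))   # expected, independent of N
for N in (10**4, 10**6, 10**8):
    lossy = (1 / eps) * math.log(eps * N)        # worst case, grows with log N
    print(f"N={N:>9}: sticky ~ {sticky:,.0f} counters, lossy <= {lossy:,.0f}")

This matches the qualitative picture in the chart: Sticky Sampling's expected space is independent of N, while Lossy Counting's worst case grows only logarithmically with N and stays far smaller in practice.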