Efficient Computation of Frequent and Top-k Elements in Data Streams



  1. Efficient Computation of Frequent and Top-k Elements in Data Streams Ahmed Metwally Divyakant Agrawal Amr El Abbadi Department of Computer Science University of California, Santa Barbara

  2. Motivation • Motivated by Internet advertising commissioners • Before rendering an advertisement for a user, query the click stream to decide which advertisements to display. • If the user's profile is not a frequent “clicker”, then s/he will probably not click any displayed advertisement. • Show Pay-Per-Impression advertisements. • If the user's profile is a frequent “clicker”, then s/he may click a displayed advertisement. • Show Pay-Per-Click advertisements. • Retrieve the top advertisements to choose what to display.

  3. Problem Definition • Given an alphabet A and a stream S of size N, a frequent element, E, is an element whose frequency, F, exceeds a user-specified support, φN • Top-k elements are the k elements with the highest frequency • Both problems: • Are closely related, though no integrated solution has been proposed • An exact solution requires O(min(N, A)) space • Hence, approximate variations are considered
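
To make the two definitions concrete, here is a minimal sketch of the exact baseline, which keeps one counter per distinct element and therefore needs the O(min(N, A)) space mentioned above. The function names `exact_frequent` and `exact_top_k` are illustrative and do not come from the slides; the stream is the one used later in the deck.

```python
from collections import Counter

def exact_frequent(stream, phi):
    """Elements whose frequency F exceeds phi * N (exact, one counter per distinct element)."""
    counts = Counter(stream)
    n = sum(counts.values())
    return {element for element, f in counts.items() if f > phi * n}

def exact_top_k(stream, k):
    """The k elements with the highest frequency, as (element, frequency) pairs."""
    return Counter(stream).most_common(k)

stream = list("ABBACABBDDBEC")          # N = 13
frequent = exact_frequent(stream, 0.2)  # support is 0.2 * 13 = 2.6 hits
top2 = exact_top_k(stream, 2)
```

The approximate algorithms in the rest of the deck aim to answer the same two queries without storing a counter for every distinct element.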

  4. φN (φ - ) N Practical Frequent Elements • -Deficient Frequent Elements [Manku ‘02]: • All frequent elements output should have F > (φ - )N, where  is the user-defined error.

  5. F4 (1 - ) F4 Practical Top-k • FindApproxTop(S, k, ) [Charikar ‘02]: • Retrieve a list of k elements such that every element, Ei, in the list has Fi > (1 - ) Fk, where Ek is the kthranked element.

  6. Related Work • Algorithms Classification • Counter-Based techniques • Keep an individual counter for each monitored element • If the observed ID is monitored, its counter is updated • If the observed ID is not monitored, an algorithm-dependent action is taken • Sketch-Based techniques • Estimate frequency for all elements using bit-maps of counters • Each element is hashed into the counters’ space using a family of hash functions. • The counters an element hashes to are queried to estimate its frequency
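
As an illustration of the sketch-based family only, the following is a minimal Count-Min-style sketch. It is a swapped-in example, not one of the algorithms compared in this deck, and the class name, the default `depth` and `width`, and the SHA-256-based hashing are all assumptions made for the illustration.

```python
import hashlib

class CountMinSketch:
    """depth rows of width counters; each row uses its own hash function."""

    def __init__(self, depth=4, width=64):
        self.depth = depth
        self.width = width
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        # Salt the hash with the row number to emulate a family of hash functions.
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item):
        # Every element is hashed into the counters' space, one counter per row.
        for r in range(self.depth):
            self.rows[r][self._index(r, item)] += 1

    def estimate(self, item):
        # Query the hashed-to counters; collisions can only inflate a counter,
        # so the smallest one is the tightest over-estimate.
        return min(self.rows[r][self._index(r, item)] for r in range(self.depth))

cms = CountMinSketch()
for element in "ABBACABBDDBEC":
    cms.add(element)
estimate_b = cms.estimate("B")  # never below the true frequency of B, which is 5
```

Unlike a counter-based summary, the sketch stores no element identifiers, so finding the top-k elements requires extra bookkeeping on top of it.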

  7. Recent Work (Comparison)

  8. Outline • Problem Definition • Space-Saving: Summarizing the Data Stream • Answering Frequent Elements Queries • Answering Top-k Queries • Experimental Results • Conclusion

  9. The Space-Saving Algorithm • Space-Saving is counter-based • Monitor only m elements • Only over-estimation errors • Frequency estimation is more accurate for significant elements • Keep track of max. possible errors

  10. Space-Saving By Example • Stream: A B B A C A B B D D B E C • Space-Saving Algorithm • For every element in the stream S • If a monitored element is observed • Increment its Count • If a non-monitored element is observed • Replace the element with minimum hits, min • Set the new element's Count to min + 1 • Record its maximum possible over-estimation, error = min
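
The steps on this slide can be sketched in a few lines. This is a simplified illustration, not the authors' implementation: it uses a plain dictionary and a linear scan for the minimum, and the value m = 3 is an assumption inferred from the later observation that min = 4 for this stream (the transcript does not state m).

```python
def space_saving(stream, m):
    """One pass over the stream, monitoring at most m elements.

    Returns {element: [count, error]}, where error is the maximum possible
    over-estimation of count.
    """
    summary = {}
    for element in stream:
        if element in summary:
            # Monitored element: increment its Count.
            summary[element][0] += 1
        elif len(summary) < m:
            # A free counter is still available.
            summary[element] = [1, 0]
        else:
            # Non-monitored element: replace the element with minimum hits.
            # (Linear scan here; a constant-time structure is possible.)
            victim = min(summary, key=lambda e: summary[e][0])
            min_count = summary.pop(victim)[0]
            summary[element] = [min_count + 1, min_count]
    return summary

summary = space_saving("ABBACABBDDBEC", m=3)  # m = 3 is an assumed value
```

For this stream the final summary holds B with Count 5 and error 0, and E and C each with Count 4 and error 3, whichever way ties for the minimum are broken.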

  11. Space-Saving Observations S = ABBACABBDDBEC N = 13 • Observations: • The summation of the Counts is N • Minimum number of hits, min ≤ N/m • In this example, min = 4 • The minimum number of hits, min, is an upper bound on the error of any element
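
The bound on min follows from the first observation by an averaging argument. The derivation below is a short reconstruction, assuming all m counters are in use; the value m = 3 is not stated in the transcript and is inferred from min = 4 and N = 13.

```latex
\begin{align*}
\sum_{i=1}^{m} \mathrm{Count}_i &= N
  && \text{(each observed element adds exactly 1 to the total)} \\
m \cdot \mathit{min} &\le \sum_{i=1}^{m} \mathrm{Count}_i = N
  && \text{(every counter is at least } \mathit{min}\text{)} \\
\mathit{min} &\le N/m \\
\text{With } N = 13,\ m = 3:\quad \mathit{min} &\le 13/3 \approx 4.33,
  && \text{consistent with } \mathit{min} = 4 .
\end{align*}
```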

  12. Space-Saving Proved Properties • S = ABBACABBDDBEC, N = 13 • If element E has frequency F > min, then E must be in Stream-Summary. Example: F(B) = F1 = 5, min = 4. • The Count at position i in Stream-Summary is no less than Fi, the frequency of the ith ranked element. Example: F(A) = F2 = 3, Count2 = 4.

  13. Space-Saving Data Structure • We need a data structure that • Increments counters in constant time • Keeps elements sorted by their counters • We propose the Stream-Summary structure, similar to the data structure in [Demaine ’02]
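
One way to meet both requirements is to group elements with equal counts into buckets and keep the buckets in a doubly linked list sorted by count. The sketch below illustrates that idea under stated assumptions: the class and method names are hypothetical, error tracking is omitted for brevity, and this is not necessarily how the authors' Stream-Summary is laid out.

```python
class Bucket:
    """All monitored elements sharing one count."""
    def __init__(self, count):
        self.count = count
        self.elements = set()
        self.prev = None  # neighbour with a smaller count
        self.next = None  # neighbour with a larger count

class StreamSummary:
    def __init__(self):
        self.head = None      # bucket holding the minimum count
        self.bucket_of = {}   # element -> bucket, for hash lookup

    def insert_new(self, element):
        """Start monitoring an element with count 1."""
        if self.head is None or self.head.count != 1:
            bucket = Bucket(1)
            bucket.next = self.head
            if self.head is not None:
                self.head.prev = bucket
            self.head = bucket
        self.head.elements.add(element)
        self.bucket_of[element] = self.head

    def increment(self, element):
        """Move element from the bucket for count c to the bucket for c + 1."""
        old = self.bucket_of[element]
        target = old.next
        if target is None or target.count != old.count + 1:
            # Splice a fresh bucket directly after the old one.
            new = Bucket(old.count + 1)
            new.prev, new.next = old, target
            old.next = new
            if target is not None:
                target.prev = new
            target = new
        old.elements.remove(element)
        target.elements.add(element)
        self.bucket_of[element] = target
        if not old.elements:
            self._unlink(old)

    def replace_min(self, element):
        """Evict one minimum-count element and give its counter, plus one, to element."""
        victim = next(iter(self.head.elements))
        self.head.elements.remove(victim)
        del self.bucket_of[victim]
        self.head.elements.add(element)
        self.bucket_of[element] = self.head
        self.increment(element)
        return victim

    def _unlink(self, bucket):
        if bucket.prev is not None:
            bucket.prev.next = bucket.next
        else:
            self.head = bucket.next
        if bucket.next is not None:
            bucket.next.prev = bucket.prev

    def sorted_counts(self):
        """(element, count) pairs in ascending count order."""
        out, bucket = [], self.head
        while bucket is not None:
            out.extend((e, bucket.count) for e in sorted(bucket.elements))
            bucket = bucket.next
        return out

ss = StreamSummary()
for element in "ABBACAB":
    if element in ss.bucket_of:
        ss.increment(element)
    else:
        ss.insert_new(element)
before = ss.sorted_counts()
evicted = ss.replace_min("D")
after = ss.sorted_counts()
```

Because an increment only ever moves an element to the neighbouring bucket, no re-sorting is needed, and the minimum is always available at the head of the list.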

  14. Frequent Elements Queries • Traverse Stream-Summary, and report all elements that satisfy the user support • Any element whose guaranteed hits = (Count – error) > φN is guaranteed to be a frequent element

  15. Frequent Elements Example • For N = 73, m = 8, φ = 0.15: • Frequent elements must exceed φN = 10.95 hits, i.e., have at least 11 hits. • Candidate frequent elements are B, D, and G. • Guaranteed frequent elements are B and D, since their guaranteed hits exceed φN.
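
The query described on the two slides above can be sketched as a single traversal. The per-element counts behind the N = 73 example are not in the transcript, so the usage below instead takes the summary that Space-Saving produces for the stream ABBACABBDDBEC with an assumed m = 3; the function name and the choice φ = 0.3 are illustrative.

```python
def frequent_elements(summary, n, phi):
    """summary: iterable of (element, count, error) triples.

    Returns (candidates, guaranteed): candidates have Count > phi * N;
    guaranteed elements also have (Count - error) > phi * N.
    """
    threshold = phi * n
    candidates, guaranteed = [], []
    for element, count, error in summary:
        if count > threshold:
            candidates.append(element)
            if count - error > threshold:
                guaranteed.append(element)
    return candidates, guaranteed

# Summary for ABBACABBDDBEC with m = 3 (assumed), N = 13.
summary = [("B", 5, 0), ("E", 4, 3), ("C", 4, 3)]
candidates, guaranteed = frequent_elements(summary, n=13, phi=0.3)
```

Here E and C are only candidates: their counts exceed the support, but their guaranteed hits (4 - 3 = 1) do not.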

  16. Frequent Elements Space Bounds

  17. Top-k Elements Queries • Traverse the Stream-Summary, and report top-k elements. • From Property 2, we assert: • Guaranteed top-k elements: • Any element whose guaranteed hits = (Count - error) ≥ Count_{k+1} is guaranteed to be in the top-k. • Guaranteed top-k’ (where k’ ≈ k): • The top-k’ elements reported are guaranteed to be the correct top-k’ iff for every element in the top-k’, guaranteed hits = (Count - error) ≥ Count_{k’+1}.

  18. Top-k Elements Example • For k = 3, m = 8: • B, D, and G are the top-3 candidates. • B and D are guaranteed to be in the top-3. • B, D, G, and A are guaranteed to be the top-4. Here k’ = 4. • B and D are guaranteed to be the top-2. Another k’ = 2.
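
The two guarantees can be checked mechanically. In the sketch below the function names are hypothetical, and the counts and errors are invented values chosen only to reproduce the pattern of this example (the slide's actual figures are not in the transcript, and only five of the m counters are shown).

```python
def ranked(summary):
    """Sort (element, count, error) triples by count, highest first."""
    return sorted(summary, key=lambda t: t[1], reverse=True)

def guaranteed_in_top_k(summary, k):
    """Elements among the first k whose (Count - error) >= Count_{k+1}."""
    order = ranked(summary)
    if len(order) <= k:
        raise ValueError("need at least k + 1 monitored elements")
    count_next = order[k][1]
    return [e for e, count, error in order[:k] if count - error >= count_next]

def is_guaranteed_top(summary, k_prime):
    """True iff every one of the first k' elements has (Count - error) >= Count_{k'+1}."""
    order = ranked(summary)
    if len(order) <= k_prime:
        raise ValueError("need at least k' + 1 monitored elements")
    count_next = order[k_prime][1]
    return all(count - error >= count_next for _, count, error in order[:k_prime])

# Hypothetical summary: (element, Count, error).
summary = [("B", 20, 0), ("D", 15, 2), ("G", 12, 5), ("A", 11, 1), ("Q", 6, 4)]
in_top3 = guaranteed_in_top_k(summary, 3)
```

With these values G is a top-3 candidate but not guaranteed, since its guaranteed hits (7) fall below the fourth count (11), while k’ = 2 and k’ = 4 both pass the check and k’ = 3 does not.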

  19. Top-k Elements Space Bounds

  20. Outline • Problem Definition • Space-Saving: Summarizing the Data Stream • Answering Frequent Elements Queries • Answering Top-k Queries • Experimental Results • Conclusion

  21. Experimental Results - Setup • Synthetic data: • Zipf(α), α varied: 0.0, 0.5, 1.0, …, 2.5, 3.0 • N = 10^7 hits. • Real Data (ValueClick, Inc.): Similar results • Precision: • number of correct elements found / size of the entire output • Recall: • number of correct elements found / number of actual correct elements • Run time: • Processing Stream + Query Time • Space used: • Including hash table
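
The two quality metrics defined above translate directly into code. This is a minimal sketch; the function name and the convention of returning 1.0 for an empty denominator are assumptions, and the sample inputs are made up.

```python
def precision_recall(reported, actual):
    """precision = correct found / entire output; recall = correct found / actual correct."""
    reported, actual = set(reported), set(actual)
    correct = len(reported & actual)
    precision = correct / len(reported) if reported else 1.0
    recall = correct / len(actual) if actual else 1.0
    return precision, recall

# An algorithm reports B, D, G while the truly frequent elements are B and D.
precision, recall = precision_recall(["B", "D", "G"], ["B", "D"])
```

In this made-up case recall is 1 because no correct element is missed, while precision is 2/3 because one reported element is wrong.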

  22. Frequent Elements Results • Query: φ = 10^-2, ε = 10^-4, and δ = 10^-2 • We compared with • GroupTest and Frequent • All algorithms had a recall of 1. • That is, every correct element appeared in their output. • Space-Saving was able to guarantee all its output to be correct

  23. Frequent Elements Precision

  24. Frequent Elements Run Time

  25. Frequent Elements Space Used

  26. Top-k Elements Results • Query: k = 100, ε = 10^-4, and δ = 10^-2 • We compared with • CountSketch: CountSketch was re-run several times. The hidden constant was estimated to be 16, in order to have output of competitive quality. • Probabilistic-InPlace: was allowed the same number of counters as Space-Saving • Space-Saving was able to guarantee all its output to be correct

  27. Top-k Elements Precision

  28. Top-k Elements Recall

  29. Top-k Elements Run Time

  30. Top-k Elements Space Used

  31. Conclusion • Contributions: • An integrated approach to solve an interesting family of problems • Strict error bounds using little space • Guarantees on results • Special attention was given to Zipfian data • Experimental validation • Future Work: • Incremental frequent and top-k elements reporting
