
Streaming Algorithms


Presentation Transcript


  1. Joe Kelley, Data Engineer. July 2013. Streaming Algorithms.

  2. Leading Provider of Data Science & Engineering for Big Analytics. Accelerating Your Time to Value. IMAGINE. ILLUMINATE. IMPLEMENT. Strategy and Roadmap. Training and Education. Hands-On Data Science and Data Engineering.

  3. What is a Streaming Algorithm? • Operates on a continuous stream of data • Unknown or infinite size • Only one pass; options: • Store it • Lose it • Store an approximation • Limited processing time per item • Limited total memory

  4. Why use a Streaming Algorithm? • Compare to typical “Big Data” approach: store everything, analyze later, scale linearly • Streaming Pros: • Lower latency • Lower storage cost • Streaming Cons: • Less flexibility • Lower precision (sometimes) • Answer? • Why not both?

  5. General Techniques • Tunable Approximation • Sampling • Sliding window • Fixed number • Fixed percentage • Hashing: useful randomness

  6. Example 1: Sampling device error rates • Stream of (device_id, event, timestamp) • Scenario: • Not enough space to store everything • For simple queries, storing 1% is good enough

  7. Example 1: Sampling device error rates • Stream of (device_id, event, timestamp) • Scenario: • Not enough space to store everything • For simple queries, storing 1% is good enough • Algorithm:

      for each element e:
          with probability 0.01:
              store e
          else:
              throw out e

  Can lead to some insidious statistical “bugs”…
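
  As a concrete illustration, here is a minimal Python sketch of the per-event sampling above (the sample_events helper and its rate parameter are illustrative names, not from the slides):

      import random

      def sample_events(stream, rate=0.01):
          # Keep each element independently with probability `rate` (1% here),
          # regardless of which device it came from.
          sample = []
          for e in stream:
              if random.random() < rate:
                  sample.append(e)   # store e
              # else: throw out e
          return sample

  Because each event is kept or dropped independently of its device, per-device statistics get distorted, which is exactly the “bug” the next slide runs into.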

  8. Example 1: Sampling device error rates • Stream of (device_id, event, timestamp) • Scenario: • Not enough space to store everything • For simple queries, storing 1% is good enough • Query: • How many errors has the average device encountered? • Answer:

      SELECT AVG(n) FROM (
          SELECT COUNT(*) AS n FROM events
          WHERE event = 'ERROR'
          GROUP BY device_id
      )

  Simple… but off by up to 100x, because each device had only 1% of its events sampled. Can we just multiply by 100?

  9. Example 1: Sampling device error rates • Stream of (device_id, event, timestamp) • Scenario: • Not enough space to store everything • For simple queries, storing 1% is good enough • Better Algorithm:

      for each element e:
          if (hash(e.device_id) mod 100) == 0:
              store e
          else:
              throw out e

  Choose how to hash carefully… or hash every different way
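
  A rough Python version of the per-device sampling; using MD5 as a stand-in hash and bucket 0 as the “keep” bucket are my choices, not the slides’:

      import hashlib

      def keep_device(device_id, buckets=100):
          # A device is kept iff its id hashes into bucket 0, i.e. about 1% of devices.
          h = int(hashlib.md5(str(device_id).encode()).hexdigest(), 16)
          return h % buckets == 0

      def sample_by_device(stream, buckets=100):
          # stream yields (device_id, event, timestamp) tuples
          return [e for e in stream if keep_device(e[0], buckets)]

  Every sampled device keeps 100% of its events, so per-device aggregates like the error-count query above are computed over complete histories.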

  10. Example 2: Sampling fixed number Want to sample a fixed count (k), not a fixed percentage. Algorithm:

      let arr = array of size k
      for each element e:
          if arr is not yet full:
              add e to arr
          else:
              with probability p:
                  replace a random element of arr with e
              else:
                  throw out e

  • Choice of p is crucial: • p = constant → prefer more recent elements; higher p = more recent • p = k/n → sample uniformly from the entire stream (n = number of elements seen so far)
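
  The p = k/n case is classic reservoir sampling; a minimal Python sketch under that assumption (the function name is mine):

      import random

      def reservoir_sample(stream, k):
          # Maintain a uniform random sample of k elements from a stream of unknown length.
          arr = []
          n = 0
          for e in stream:
              n += 1
              if len(arr) < k:
                  arr.append(e)
              else:
                  # Keep e with probability k/n by choosing a random slot in 0..n-1
                  # and replacing only if that slot falls inside the reservoir.
                  j = random.randrange(n)
                  if j < k:
                      arr[j] = e
          return arr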

  11. Example 2: Sampling fixed number

  12. Example 3: Counting unique users • Input: stream of (user_id, action, timestamp) • Want to know how many distinct users are seen over a time period • Naïve approach: • Store all user_id’s in a list/tree/hashtable • Millions of users = a lot of memory • Better approach: • Store all user_id’s in a database • Good, but maybe it’s not fast enough… • What if an approximate count is ok?

  13. Example 3: Counting unique users • Input: stream of (user_id, action, timestamp) • Want to know how many distinct users are seen over a time period • Approximate count is ok • Flajolet-Martin Idea: • Hash each user_id into a bit string • Count the trailing zeros • Remember maximum number of trailing zeros seen

  14. Example 3: Counting unique users • Input: stream of (user_id, action, timestamp) • Want to know how many distinct users are seen over a time period • Intuition: • If we had seen 2 distinct users, we would expect 1 trailing zero • If we had seen 4, we would expect 2 trailing zeros • If we had seen 2^r, we would expect r trailing zeros • In general, if the maximum number of trailing zeros seen is R, then 2^R is a reasonable estimate of the number of distinct users • Want more precision? Use more independent hash functions, and combine the results • Median = only get powers of two • Mean = subject to skew • Median of means of groups works well in practice

  15. Example 3: Counting unique users • Input: stream of (user_id, action, timestamp) • Want to know how many distinct users are seen over a time period • Flajolet-Martin, all together:

      arr = int[k]
      for each item e:
          for i in 0...k-1:
              z = trailing_zeros(hash_i(e))
              if z > arr[i]:
                  arr[i] = z
      means = group_means(arr)
      median = median(means)
      return pow(2, median)
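
  A runnable Python sketch of the same procedure; salting a single MD5 hash to simulate k “independent” hash functions and the default group size are my simplifications:

      import hashlib
      import statistics

      def trailing_zeros(x):
          # Count trailing zero bits (treat 0 as a very long run).
          if x == 0:
              return 64
          z = 0
          while x & 1 == 0:
              x >>= 1
              z += 1
          return z

      def fm_estimate(stream, k=64, group_size=8):
          arr = [0] * k
          for e in stream:
              for i in range(k):
                  h = int(hashlib.md5(f"{i}:{e}".encode()).hexdigest(), 16)
                  z = trailing_zeros(h)
                  if z > arr[i]:
                      arr[i] = z
          # Median of per-group means, as on the slide above.
          groups = [arr[j:j + group_size] for j in range(0, k, group_size)]
          means = [sum(g) / len(g) for g in groups]
          return 2 ** statistics.median(means)

  For example, fm_estimate(user_id for (user_id, action, ts) in events) would give an approximate distinct-user count.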

  16. Example 3: Counting unique users • Flajolet-Martin in practice • Devil is in the details • Tunable precision • more hash functions = more precise • See the paper for bounds on precision • Tunable latency • more hash functions = higher latency • faster hash functions = lower latency • faster hash functions = more possibility of correlation = less precision • Remember: streaming algorithm for quick, imprecise answer. Back-end batch algorithm for slower, exact answer

  17. Example 4: Counting Individual Item Frequencies • Want to keep track of how many times each item has appeared in the stream • Many applications: • How popular is each search term? • How many times has this hashtag been tweeted? • Which IP addresses are DDoS’ing me? • Again, two obvious approaches: • In-memory hashmap of item → count • Database • But can we be more clever?

  18. Example 4: Counting Individual Item Frequencies • Want to keep track of how many times each item has appeared in the stream • Idea: • Maintain array of counts • Hash each item, increment array at that index • To check the count of an item, hash again and check array at that index • Over-estimates because of hash “collisions”
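
  A tiny Python sketch of this single-hash version (class name and array width are illustrative):

      import hashlib

      class SingleHashCounter:
          # One hash function, one array of counts; collisions can only inflate counts.
          def __init__(self, width=1024):
              self.width = width
              self.counts = [0] * width

          def _index(self, item):
              return int(hashlib.md5(str(item).encode()).hexdigest(), 16) % self.width

          def add(self, item):
              self.counts[self._index(item)] += 1

          def count(self, item):
              return self.counts[self._index(item)]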

  19. Example 4: Counting Individual Item Frequencies • Count-Min Sketch algorithm: • Maintain 2-d array of size w x d • Choose d different hash functions; each row in array corresponds to one hash function • Hash each item with every hash function, increment the appropriate position in each row • To query an item, hash it d times again, take the minimum value from all rows

  20. Example 4: Counting Individual Item Frequencies Want to keep track of how many times each item has appeared in the stream Count-Min Sketch, all together:

      arr = int[d][w]
      for each item e:
          for i in 0...d-1:
              j = hash_i(e) mod w
              arr[i][j]++

      def frequency(q):
          min = +infinity
          for i in 0...d-1:
              j = hash_i(q) mod w
              if arr[i][j] < min:
                  min = arr[i][j]
          return min
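
  The same logic as a small Python class; salted MD5 again stands in for the d hash functions, and the default w and d are arbitrary:

      import hashlib

      class CountMinSketch:
          def __init__(self, w=1024, d=4):
              self.w, self.d = w, d
              self.arr = [[0] * w for _ in range(d)]

          def _hash(self, i, item):
              # Row i gets its own hash function by salting the input with i.
              return int(hashlib.md5(f"{i}:{item}".encode()).hexdigest(), 16) % self.w

          def add(self, item):
              for i in range(self.d):
                  self.arr[i][self._hash(i, item)] += 1

          def frequency(self, item):
              # Collisions only ever add to a cell, so the minimum over the d rows
              # is the tightest (still over-estimating) answer.
              return min(self.arr[i][self._hash(i, item)] for i in range(self.d))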

  21. Example 4: Counting Individual Item Frequencies • Count-Min Sketch in practice • Devil is in the details • Tunable precision • Bigger array = more precise • See the paper for bounds on precision • Tunable latency • more hash functions = higher latency • Better at estimating more frequent items • Can subtract out estimation of collisions • Remember: streaming algorithm for quick, imprecise answer. Back-end batch algorithm for slower, exact answer

  22. Questions? • Feel free to reach out • www.thinkbiganalytics.com • joe.kelley@thinkbiganalytics.com • www.slideshare.net/jfkelley1 • References: • http://dimacs.rutgers.edu/~graham/pubs/papers/cm-full.pdf • http://infolab.stanford.edu/~ullman/mmds.html • We’re hiring! Engineers and Data Scientists
