
Introduction to Data Stream

Introduction to Data Stream. DBG@UNSW. Acknowledgement: some of the slides are modified from Nikos Koudas (Toronto U), Minos Garofalakis (Yahoo! Research), Divesh Srivastava (AT&T), S. Muthukrishnan (Rutgers), and Georges Hébrail (ENST Paris).


Presentation Transcript


  1. Introduction to Data Stream DBG@UNSW introduction to data stream DBG@UNSW

  2. Acknowledgement Some of the slides are modified from • Nikos Koudas (Toronto U) • Minos Garofalakis (Yahoo! Research) • Divesh Srivastava (AT&T) • S. Muthukrishnan (Rutgers) • Georges Hébrail (ENST Paris)

  3. Outline • Introduction and Applications • Data Stream Management System • Modeling for Data Streams • Approximation Technique • Data Stream Computation introduction to data stream DBG@UNSW

  4. What is a data stream? • Golab & Özsu (2003): “A data stream is a real-time, continuous, ordered (implicitly by arrival time or explicitly by timestamp) sequence of items. It is impossible to control the order in which items arrive, nor is it feasible to locally store a stream in its entirety.” • Structured records, as opposed to audio or video data • Massive volumes of data; records arrive at a high rate

  5. Data stream applications • Transactional data streams: log interactions between entities • Credit card: purchases by consumers from merchants • Telecommunications: phone calls by callers to dialed parties • Web: accesses by clients of resources at servers • Measurement data streams: monitor the evolution of entity states • IP network: traffic at router interfaces • Sensor networks: physical phenomena, road traffic • Earth climate: temperature, moisture at weather stations

  6. Applications: Network Management • Network management involves monitoring and configuring network hardware and software to ensure smooth operation • Quickly detect faults, congestion, attacks • QoS • Load balancing; improve utilization of network resources [Figure: network supervision center]

  7. (More details) • Traffic estimation • What fraction of network IP addresses are active? • How many bytes were sent between a pair of IP addresses? • List the top 100 IP addresses in terms of traffic • Traffic analysis • What is the average duration of an IP session? • What is the median of the number of bytes in each IP session? • Fraud detection • List all sessions that transmitted more than 1000 bytes • Identify all sessions whose duration was more than twice the normal • Security/Denial of Service • List all IP addresses that have witnessed a sudden spike in traffic • Identify IP addresses involved in more than 1000 sessions

  8. Application : stock monitoring • Stream of price and sales volume of stocks over time • Technical analysis/charting for stock investors • Support trading decisions • Notify me when some stock goes up by at least 5%. • Notify me when the price of any stock increases monotonically for ≥ 40 min. introduction to data stream DBG@UNSW

  9. Challenges • Massive in volume, or even infinite • AT&T long-distance: ~300M call tuples/day • AT&T IP backbone: ~50B IP flows/day • Rapid arrival rate • Real-time monitoring (response) required • Continuous queries

  10. Outline • Introduction and Applications • Data Stream Management System • Modeling for Data Streams • Approximation Technique • Data Stream Computation introduction to data stream DBG@UNSW

  11. DSMS • DBMS (Database Management System) • Data model (relational) • Data is stored on disk • SQL language • Creating structures • Inserting/updating/deleting data • Retrieving data (query) • Good performance even with large volumes of data • DSMS (Data Stream Management System) • Data model (streams and permanent relations) • Permanent relations are stored on disk, but streams are processed on the fly • SQL-like query language • Standard SQL on permanent relations • Extended SQL on streams with windowing features • New paradigm of queries (continuous queries) • Tools for capturing input streams and producing output streams • Good performance: optimization of computer resources

  12. Existing DSMS • Principal specialized DSMSs • Gigascope and Hancock: AT&T • Network monitoring • Analysis of telecommunication calls • NiagaraCQ: University of Wisconsin-Madison • Large number of continuous queries on web content (XML-QL) • Tradebot (finance), StatStream (statistics) • Principal general-purpose DSMSs • STREAM: Stanford University • TelegraphCQ: UC Berkeley • Aurora: Brown University, MIT, Brandeis • Sensor networks • Cougar: Cornell University • TinyDB: UC Berkeley

  13. STREAM from Stanford [Figure: DSMS architecture with the labels Register Query, Input streams, Streamed Result, Stored Result, Archive, Scratch Store, Stored Relations]

  14. STREAM (cont.) • General-purpose DSMS for streams and stored data • Relational (unlikely to change) • Centralized server model (likely to change) • Single-threaded and parallel versions • Declarative language for registering continuous queries (CQL) • Query optimization with good memory management • Approximate answers with synopsis management

  15. STREAM ( cont. ) Some Implementation Issues • Designed to cope with: • Stream rates that may be high, variable, bursty • Continuous query loads that may be high, volatile • Primary coping techniques • Continuous self-monitoring and reoptimization • Graceful approximation as necessary • Careful resource allocation and use introduction to data stream DBG@UNSW

  16. Outline • Introduction and Applications • Data Stream Management System • Modeling for Data Streams • Approximation Technique • Data Stream Computation introduction to data stream DBG@UNSW

  17. Models for data streams • Structure of a stream • Infinite sequence of items (elements) • One item: structured information, i.e. tuple or object • Same structure for all items in a stream • Timestamping • Explicit ( date field in data ) • Implicit  ( timestamp given when items arrive ) • Representation of time • Physical  (date) • Logical  (integer) introduction to data stream DBG@UNSW

  18. Models for data streams (cont.) • One-dimensional array A[1…N] with values A[i] all initially zero • Signal is implicitly represented via a stream of updates • j-th update is <k, c[j]> implying A[k] := A[k] + c[j] (c[j] can be >=0, <0) • Goal: Compute functions on A[ ] subject to • Small space • Fast processing of updates • Fast function computation • … introduction to data stream DBG@UNSW

  19. Models for data streams (cont.) • Time-Series Model • Only the j-th update updates A[j] (i.e., A[j] := c[j]) • Cash-Register Model • c[j] is always >= 0 (i.e., increment-only) • Typically c[j] = 1, so we see a multi-set of items in one pass • Turnstile Model • Most general streaming model • c[j] can be >= 0 or < 0 (i.e., increment or decrement) • Problem difficulty varies depending on the model • E.g., MIN/MAX in Time-Series vs. Turnstile!
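As a rough illustration of the update model above, the sketch below applies `<k, c>` updates to an initially zero signal. The function name `apply_updates` and the `model` argument are hypothetical, chosen only for this example, and the signal is kept exactly in a dictionary, which a real streaming algorithm would avoid for large N.

```python
from collections import defaultdict

def apply_updates(updates, model="turnstile"):
    """Apply <k, c> updates to an initially-zero signal A and return it as a dict."""
    A = defaultdict(int)
    for k, c in updates:
        # The cash-register model only allows increments.
        if model == "cash-register" and c < 0:
            raise ValueError("cash-register model allows increments only")
        A[k] += c
    return dict(A)

# Unit updates for items 3, 1, 2, 4, then one deletion of item 2, then 3, 5.
stream = [(3, 1), (1, 1), (2, 1), (4, 1), (2, -1), (3, 1), (5, 1)]
signal = apply_updates(stream)
print(signal)
```

The same stream is rejected under the cash-register model because of the single decrement, which is one way to see why the turnstile model is the more general of the two.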

  20. Windowing • Applying queries/mining tasks to the whole stream (from the beginning to the current time) • Applying queries/mining tasks to a portion of the stream (a window on the stream) [Figure: timeline from the beginning of the stream to the current date t, with a window on the stream]

  21. Windowing ( cont.) Definition of windows of interest on streams • Fixed windows: September 2007 • Sliding windows: last 3 hours ( n of N window ) • Landmark windows: from September 1st, 2007 Window specification • Physical time: last 3 hours • Logical time: last 1000 items Refreshing rate • Rate of producing results (every item, every 10 items, every minute, …) introduction to data stream DBG@UNSW
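A sliding window in logical time ("last N items") can be sketched as below. The class name `SlidingWindowAverage` is hypothetical, the aggregate (a running average) is just one possible query, and the refresh rate here is "every item".

```python
from collections import deque

class SlidingWindowAverage:
    """Average over the last `size` items of a stream (logical-time window)."""

    def __init__(self, size):
        self.size = size
        self.window = deque()
        self.total = 0.0

    def add(self, value):
        """Insert one item, expire the oldest if needed, return the current average."""
        self.window.append(value)
        self.total += value
        if len(self.window) > self.size:
            self.total -= self.window.popleft()
        return self.total / len(self.window)

w = SlidingWindowAverage(size=3)
averages = [w.add(v) for v in [1, 2, 3, 4, 5]]
print(averages)
```

A physical-time window ("last 3 hours") would expire items by timestamp rather than by count, but the structure is otherwise the same.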

  22. Computation Model • Stream processing requirements • Single pass: each record is examined at most once • Bounded storage: limited memory (M) for storing the synopsis • Real-time: per-record processing time (to maintain the synopsis) must be low • Data independent: no a priori knowledge required about the data set (size, range, distribution, order) [Figure: data streams enter a stream processing engine that keeps a synopsis in memory and outputs an (approximate) answer]

  23. Outline • Introduction and Applications • Data Stream Management System • Modeling for Data Streams • Approximation Technique • Data Stream Computation introduction to data stream DBG@UNSW

  24. Approximation • Computing an exact answer can be too expensive • It may need more memory than is affordable (e.g., distinct count, median) • It may take too long to complete • An approximate answer is acceptable in many applications • ε-approximate answers (absolute error / relative error), e.g., for an exact answer E = 100 and relative error ε = 0.1, any answer in [90, 110] is acceptable • Only a small amount of memory is needed • Computed very quickly • Error is guaranteed to be small

  25. Approximation (cont.) • Deterministic approximate methods • Deterministic algorithms carefully control the error • Non-deterministic approximate methods • Randomization, sampling, etc. • Provide a good approximation with high probability

  26. Basic synopses • Basic stream synopses computation • Samples: answering queries using samples; reservoir sampling, inverse sampling • Histograms: equi-depth histograms, on-line quantile computation • Wavelets: Haar-wavelet histogram construction and maintenance • Sketches: AMS sketch, CM sketch, FM sketch

  27. Sampling • Idea: a small random sample S of the data often represents all the data well • For a fast approximate answer, apply a “modified” query to S • Example: select agg from R (n = 12) • If agg is avg, return the average of the elements in S • Number of odd elements? • Data stream: 9 3 5 2 7 1 6 5 8 4 9 1 • Sample S: 9 5 1 8 • answer: 11.5
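One common way to draw such a sample in a single pass is reservoir sampling, which the deck lists among the basic synopses. The sketch below is a minimal version; the helper `estimate_count`, which scales a count observed in the sample up to a stream of length n, is a hypothetical name introduced here for illustration.

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = rng or random.Random()
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)          # fill the reservoir first
        else:
            j = rng.randint(0, i)        # inclusive on both ends
            if j < k:
                sample[j] = item         # replace with probability k / (i + 1)
    return sample

def estimate_count(sample, n, predicate):
    """Scale the fraction of sample items satisfying `predicate` up to n items."""
    return n * sum(1 for x in sample if predicate(x)) / len(sample)

stream = [9, 3, 5, 2, 7, 1, 6, 5, 8, 4, 9, 1]
sample = reservoir_sample(stream, k=4, rng=random.Random(0))
odd_estimate = estimate_count(sample, len(stream), lambda x: x % 2 == 1)
print(sample, odd_estimate)
```

The estimate varies with the random sample drawn; tail inequalities are what turn that variability into a probabilistic guarantee.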

  28. Probabilistic Guarantees • Example: the actual answer is within 11.5 ± 1 with probability ≥ 0.9 • Randomized algorithms: the answer returned is a specially built random variable • Use tail inequalities to give probabilistic bounds on the returned answer • Markov inequality • Chebyshev’s inequality • Chernoff/Hoeffding bound

  29. Basic Tools: Tail Inequalities • General bounds on the tail probability of a random variable (that is, the probability that a random variable deviates far from its expectation) • Basic inequalities: let $X$ be a random variable with expectation $\mu$ and variance $\mathrm{Var}[X]$. Then for any $\epsilon > 0$: • Markov (for $X \ge 0$): $\Pr[X \ge \epsilon] \le \mu / \epsilon$ • Chebyshev: $\Pr[|X - \mu| \ge \epsilon] \le \mathrm{Var}[X] / \epsilon^2$ [Figure: probability distribution and its tail probability]

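  30. Tail Inequalities (cont.) • Hoeffding’s inequality: let $X_1, \ldots, X_m$ be independent random variables with $0 \le X_i \le r$. Let $\bar{X} = \frac{1}{m} \sum_{i=1}^{m} X_i$ and let $\mu$ be the expectation of $\bar{X}$. Then, for any $\epsilon > 0$: $\Pr[|\bar{X} - \mu| \ge \epsilon] \le 2 \exp(-2 m \epsilon^2 / r^2)$ • Chernoff bound (…)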
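To see how such a bound is used, take the standard two-sided Hoeffding bound for the mean of m independent variables in [0, r], Pr[|mean − μ| ≥ ε] ≤ 2·exp(−2mε²/r²). Setting the right-hand side to at most δ and solving for m gives a sufficient sample size. The function name `hoeffding_sample_size` is hypothetical.

```python
import math

def hoeffding_sample_size(epsilon, delta, r=1.0):
    """Smallest m with 2 * exp(-2 * m * epsilon**2 / r**2) <= delta."""
    return math.ceil(r * r * math.log(2.0 / delta) / (2.0 * epsilon * epsilon))

# Absolute error 0.1 with failure probability 0.1, values in [0, 1].
m = hoeffding_sample_size(epsilon=0.1, delta=0.1)
print(m)
```

Note that the required sample size does not depend on the length of the stream, only on ε, δ, and the value range r.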

  31. Histogram • Histograms approximate the frequency distribution of element values in a stream • A histogram (typically) consists of • A partitioning of element domain values into buckets • A count per bucket B (of the number of elements in B) [Figure: count per bucket over domain values 1 to 20]

  32. Wavelet • Wavelets: mathematical tool for hierarchical decomposition of functions/signals • Haar wavelets: simplest wavelet, easy to understand and implement • Recursive pairwise averaging and differencing at different resolutions

      Resolution   Averages                    Detail coefficients
      3            [2, 2, 0, 2, 3, 5, 4, 4]    ----
      2            [2, 1, 4, 4]                [0, -1, -1, 0]
      1            [1.5, 4]                    [0.5, 0]
      0            [2.75]                      [-1.25]

      Haar wavelet decomposition: [2.75, -1.25, 0.5, 0, 0, -1, -1, 0]
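The recursive pairwise averaging and differencing can be sketched in a few lines. This is a minimal, unnormalized version that assumes the input length is a power of two and uses (a − b)/2 as the detail coefficient, which matches the numbers on the slide; the function name `haar_decompose` is hypothetical.

```python
def haar_decompose(values):
    """Unnormalized Haar decomposition; len(values) must be a power of two."""
    averages = list(values)
    details = []
    while len(averages) > 1:
        pairs = list(zip(averages[0::2], averages[1::2]))
        level_details = [(a - b) / 2 for a, b in pairs]
        averages = [(a + b) / 2 for a, b in pairs]
        # Coarser levels go in front of the finer ones computed earlier.
        details = level_details + details
    return averages + details

coefficients = haar_decompose([2, 2, 0, 2, 3, 5, 4, 4])
print(coefficients)  # [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
```

A wavelet synopsis would then keep only a few of the largest coefficients rather than all of them.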

  33. Outline • Introduction and Applications • Data Stream Management System • Modeling for Data Streams • Approximation Technique • Data Stream Computation introduction to data stream DBG@UNSW

  34. Frequency-Related Problems • Top-k most frequent elements • Find all elements with frequency > 0.1% • Find elements that occupy 0.1% of the tail • What is the frequency of element 3? • What is the total frequency of elements between 8 and 14? • How many elements have non-zero frequency? (distinct count) [Figure: frequency histogram over domain values 1 to 20]

  35. An Old Chestnut: Majority • A sequence of N items • You have constant memory • In one pass, decide if some item is in the majority (occurs > N/2 times) • Example: 2 9 9 9 7 6 4 9 9 9 3 9 (N = 12; item 9 is the majority) • Any idea?

  36. Misra-Gries Algorithm (‘82) • Keep a counter and an ID • If the counter is 0, store the new item as the ID with count = 1 • Otherwise, if the new item is the same as the stored ID, increment the counter • Otherwise, decrement the counter • At the end, if the counter is > 0, the stored item is the only candidate for the majority
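The single-counter scheme (often also called the Boyer-Moore majority vote) can be sketched as follows. The function names are hypothetical, and the second function is the extra verification pass needed to confirm that the candidate really is a majority.

```python
def majority_candidate(stream):
    """One pass, constant memory: return the only possible majority item, or None."""
    candidate, count = None, 0
    for item in stream:
        if count == 0:
            candidate, count = item, 1   # adopt a new candidate
        elif item == candidate:
            count += 1
        else:
            count -= 1                   # cancel one occurrence of the candidate
    return candidate if count > 0 else None

def is_majority(stream, candidate):
    """Second pass: check that the candidate occurs more than N/2 times."""
    items = list(stream)
    return items.count(candidate) > len(items) / 2

sequence = [2, 9, 9, 9, 7, 6, 4, 9, 9, 9, 3, 9]
candidate = majority_candidate(sequence)
print(candidate, is_majority(sequence, candidate))  # 9 True
```

If no item is in the majority, the first pass may still return some candidate, which is why the second pass matters.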

  37. A generalization: Frequent Items (Karp 03) • Find the (at most k) items that each occur more than N/(k+1) times • Algorithm: • Maintain k items and their counters • If the next item x is one of the k, increment its counter • Else if some counter is zero, put x there with count = 1 • Else (all counters non-zero), decrement all k counters [Figure: table of k entries ID1 … IDk, each with a count]

  38. Frequent Elements: Analysis • A frequent item’s count is decremented if all counters are full: it erases k+1 items. • If x occurs > N/(k+1) times, then it cannot be completely erased. • Similarly, x must get inserted at some point, because there are not enough items to keep it away. introduction to data stream DBG@UNSW
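The k-counter generalization can be sketched as below. A dictionary stands in for the table of k (ID, count) slots, and a slot whose counter reaches zero is simply deleted so that it becomes free; the function name `frequent_items` is hypothetical.

```python
def frequent_items(stream, k):
    """One pass with at most k counters; the result contains every item
    that occurs more than N/(k+1) times (and possibly some that do not)."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k:
            counters[x] = 1              # a free (zero) slot is available
        else:
            # All k slots are in use: decrement every counter.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

stream = [1, 1, 1, 2, 3, 1, 4, 1, 5, 1]   # N = 10, item 1 occurs 6 times
print(frequent_items(stream, k=2))        # {1: 4}
```

With k = 1 this reduces to the single-counter majority algorithm. The stored counts are lower bounds on the true frequencies, not exact values.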

  39. Problem of False Positives • False positives in Misra-Gries(MG) algorithm • It identifies all true heavy hitters, but not all reported items are necessarily heavy hitters. • How can we tell if the non-zero counters correspond to true heavy hitters or not? • A second pass is needed to verify. • False positives are problematic if heavy hitters are used for billing or punishment. • What guarantees can we achieve in one pass? introduction to data stream DBG@UNSW

  40. Approximation Guarantees • Find heavy hitters with a guaranteed approximation error [MM02] • Manku-Motwani (Lossy Counting) • Suppose you want φ-heavy hitters: items with frequency > φN • An approximation parameter ε, where ε << φ (e.g., φ = .01 and ε = .0001; φ = 1% and ε = .01%) • Identify all items with frequency > φN • No reported item has frequency < (φ − ε)N • The algorithm uses O((1/ε) log(εN)) memory

  41. MM02 Algorithm 1: Lossy Counting • Step 1: divide the stream into “windows” • Window size W is a function of the support s (specified later) [Figure: stream divided into Window 1, Window 2, Window 3, …]

  42. Lossy Counting in Action • Start with empty frequency counts and add the counts from the first window • At the window boundary, decrement all counters by 1 [Figure: Empty + First Window = Frequency Counts]

  43. Lossy Counting continued • Add the counts from the next window to the current frequency counts • At the window boundary, decrement all counters by 1 [Figure: Frequency Counts + Next Window = Frequency Counts]

  44. Error Analysis • How much do we undercount? • If the current size of the stream is N and the window size is W = 1/ε, then the number of windows is εN, so the frequency error is at most εN (one decrement per window) • Rule of thumb: set ε = 10% of the support s • Example: given support frequency s = 1%, set error frequency ε = 0.1%

  45. Putting it all together… • Output: elements with counter values exceeding (s − ε)N • Approximation guarantees • Frequencies are underestimated by at most εN • No false negatives • False positives have true frequency at least (s − ε)N • How many counters do we need? • Worst-case bound: (1/ε) log(εN) counters
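The window-based description above can be sketched as follows. This is the simplified form shown on the slides (count every item, decrement all counters at each window boundary), not necessarily the exact bookkeeping of the published algorithm; the function name `lossy_counting` is hypothetical, and the output test uses ≥ (s − ε)N so that an item sitting exactly on the threshold is not dropped.

```python
import math

def lossy_counting(stream, s, epsilon):
    """Report items whose (under)count reaches (s - epsilon) * N."""
    width = math.ceil(1 / epsilon)       # window size W = 1/epsilon
    counts = {}
    n = 0
    for item in stream:
        n += 1
        counts[item] = counts.get(item, 0) + 1
        if n % width == 0:               # window boundary
            for key in list(counts):
                counts[key] -= 1
                if counts[key] == 0:
                    del counts[key]
    threshold = (s - epsilon) * n
    return {item: c for item, c in counts.items() if c >= threshold}

# 100 items: "a" 50 times, "b" 30 times, and 20 items that occur once each.
stream = []
for i in range(10):
    stream += ["a"] * 5 + ["b"] * 3 + [f"u{2 * i}", f"u{2 * i + 1}"]
result = lossy_counting(stream, s=0.25, epsilon=0.125)
print(result)
```

Each counter is decremented at most once per window, so with εN windows no count is low by more than εN, which is where the guarantee of no false negatives comes from.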

  46. Frequent items (Turnstile) • Data stream: 3, 1, 2, 4, -2, 3, 5, … (here -2 denotes a deletion of item 2) • Ask for f(1) = ? f(4) = ? • AMS-based algorithm • Count-Min sketch [Figure: frequency histogram of f(1) … f(5)]

  47. AMS (sketch) based algorithm • Key intuition: use randomized linear projections of f() to define a random variable $Z$ such that, for a given element $i$, $E(Z \cdot \xi_i) = f_i$; similarly, $E(Z \cdot \xi_j) = f_j$ • Basic idea: define a family of 4-wise independent $\{-1, +1\}$ random variables $\{\xi_i\}$ with $\Pr[\xi_i = +1] = \Pr[\xi_i = -1] = 1/2$ • Let $Z = \sum_j f_j \xi_j$ • So $E(Z \cdot \xi_i) = f_i \, E(\xi_i^2) + \sum_{j \ne i} f_j \, E(\xi_j \xi_i) = f_i$, since $\xi_i^2 = 1$ and $E(\xi_j \xi_i) = 0$ for $j \ne i$

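  48. AMS cont. • Keep an array of w × d counters Z[i, j] • Use d hash functions $h_1, \ldots, h_d$ to map element x to [1..w] • Update for element a: $Z[i, h_i(a)] \mathrel{+}= \xi_i(a)$ for each row i • Estimate: $\mathrm{Est}(f_a) = \mathrm{median}_i \, (Z[i, h_i(a)] \cdot \xi_i(a))$ [Figure: element a hashed by h_1(a) … h_d(a) into one counter in each of d rows of width w]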
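A sketch of this w × d structure, in the form usually called a Count Sketch, is shown below. It assumes each row i has a bucket hash h_i and a ±1 sign hash ξ_i; the class name, the modular hash constructions, and the restriction to integer items are all assumptions made for illustration, and taking the low bit of a degree-3 polynomial hash is only an approximation of a 4-wise independent sign.

```python
import random
from statistics import median

class CountSketch:
    PRIME = 2**31 - 1

    def __init__(self, w, d, seed=0):
        rng = random.Random(seed)
        self.w, self.d = w, d
        self.table = [[0] * w for _ in range(d)]
        # Per-row parameters for the bucket hash and the sign hash.
        self.bucket_params = [(rng.randrange(1, self.PRIME), rng.randrange(self.PRIME))
                              for _ in range(d)]
        self.sign_params = [[rng.randrange(self.PRIME) for _ in range(4)]
                            for _ in range(d)]

    def _bucket(self, i, x):
        a, b = self.bucket_params[i]
        return ((a * x + b) % self.PRIME) % self.w

    def _sign(self, i, x):
        c0, c1, c2, c3 = self.sign_params[i]
        v = (((c3 * x + c2) * x + c1) * x + c0) % self.PRIME
        return 1 if v & 1 else -1

    def update(self, x, c=1):
        """Process update <x, c>; c may be negative (turnstile model)."""
        for i in range(self.d):
            self.table[i][self._bucket(i, x)] += c * self._sign(i, x)

    def estimate(self, x):
        """Median over rows of the signed counter that x maps to."""
        return median(self.table[i][self._bucket(i, x)] * self._sign(i, x)
                      for i in range(self.d))

sketch = CountSketch(w=64, d=5)
for item in [3, 1, 2, 4, 3, 5]:
    sketch.update(item)
sketch.update(2, -1)                      # deletion of item 2
print(sketch.estimate(3))
```

The estimate is unbiased but can be off in either direction when items collide in a bucket; the median over d rows is what makes large errors unlikely.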

  49. The Count-Min (CM) Sketch • Simple sketch idea; can be used for point queries (f_i), range queries, quantiles, join size estimation • Creates a small summary as an array of w × d counters C • Uses d hash functions to map each element to [1..w] • Related to the Bloom filter technique • w = …, d = …

  50. CM Sketch Structure • Each element x_i is mapped to one counter per row • C[k, h_k(x_i)] = C[k, h_k(x_i)] + 1 (−1 for deletion), or + c[j] if the incoming update is <j, c[j]> • Estimate A[j] by taking min_k C[k, h_k(j)] [Figure: element x_i hashed by h_1(x_i) … h_d(x_i) into one counter in each of d rows of width w, each incremented by +1]
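The update and estimate rules above can be sketched as follows. The class name, the modular hash functions, and the restriction to integer items are assumptions made for illustration; the sizes w and d are left as free parameters.

```python
import random

class CountMinSketch:
    PRIME = 2**31 - 1

    def __init__(self, w, d, seed=0):
        rng = random.Random(seed)
        self.w, self.d = w, d
        self.table = [[0] * w for _ in range(d)]
        # One (a, b) pair per row for the hash ((a*x + b) mod PRIME) mod w.
        self.params = [(rng.randrange(1, self.PRIME), rng.randrange(self.PRIME))
                       for _ in range(d)]

    def _bucket(self, k, item):
        a, b = self.params[k]
        return ((a * item + b) % self.PRIME) % self.w

    def update(self, item, c=1):
        """Process update <item, c>; use c = -1 for a deletion."""
        for k in range(self.d):
            self.table[k][self._bucket(k, item)] += c

    def estimate(self, item):
        """Minimum over rows of the counter that the item maps to."""
        return min(self.table[k][self._bucket(k, item)] for k in range(self.d))

cm = CountMinSketch(w=64, d=4)
for item in [3, 1, 2, 4, 3, 5]:
    cm.update(item)
cm.update(2, -1)                          # deletion of item 2
print(cm.estimate(3), cm.estimate(1))
```

As long as no true frequency goes negative, collisions can only add to a counter, so the estimate never falls below the true frequency; taking the minimum over rows picks the counter with the least collision noise.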
