Streaming Pattern Discovery in Multiple Time-Series

Discover hidden variables that summarize the main trends in multiple time-series data streams, support efficient forecasting and outlier detection, and can be computed incrementally in real time.

Presentation Transcript


  1. Streaming Pattern Discovery in Multiple Time-Series • Jimeng Sun, Spiros Papadimitriou, Christos Faloutsos • PARALLEL DATA LABORATORY, Carnegie Mellon University

  2. Motivation • Co-evolving time series (data streams) appear in many different applications, e.g.: • Disk access traffic in network clusters • Internet flow traffic in a network • Temperatures in a large building • Chlorine concentration in a water distribution network • Values are typically correlated • It would be very useful if we could summarize them on the fly http://www.pdl.cmu.edu/

  3. Example • [Figure: chlorine concentrations over time in a water distribution network, split into Phase 1, Phase 2, and Phase 3; curves for sensors near the leak and sensors away from the leak; annotation: normal operation]

  4. Goals • Discover “hidden” (latent) variables for: • Summarization of main trends for users • Efficient forecasting, spotting outliers/anomalies • Incremental, real-time computation • Limited memory requirements

  5. Example: chlorine measurements • [Figure: chlorine concentrations in a water distribution network across Phase 1, Phase 2, and Phase 3; curves for sensors near the leak and sensors away from the leak; annotations: normal operation, major leak]

  6. Example: hidden variable • [Figure: Phase 1; left panel: actual measurements (n streams) of chlorine concentrations; right panel: k hidden variable(s), k = 1] • We would like to discover a few “hidden (latent) variables” that summarize the key trends

  7. Example: hidden variable tracking • [Figure: Phase 1 and Phase 2; left panel: actual measurements (n streams) of chlorine concentrations; right panel: k hidden variable(s), k = 2] • We would like to discover a few “hidden (latent) variables” that summarize the key trends

  8. Example: hidden variable tracking • [Figure: Phase 1, Phase 2, and Phase 3; left panel: actual measurements (n streams) of chlorine concentrations; right panel: k hidden variable(s), k = 1] • We would like to discover a few “hidden (latent) variables” that summarize the key trends

  9. Method outline • Step 1: How to capture correlations? • Step 2: How to do it incrementally, when we have a very large number of points? • Step 3: How to dynamically adjust the number of hidden variables?

  10. Step 1: How to capture correlations? • First sensor • [Figure: Temperature T1 over time, vertical axis from 20°C to 30°C]

  11. Step 1: How to capture correlations? • First sensor • Second sensor • [Figure: Temperature T2 over time, vertical axis from 20°C to 30°C]

  12. Step 1: How to capture correlations? • Correlations: let’s take a closer look at the first three value-pairs… • [Figure: Temperature T2 plotted against Temperature T1, both axes from 20°C to 30°C]

  13. Step 1: How to capture correlations? • The first three value-pairs lie (almost) on a line in the space of value-pairs… • Offset along the line = “hidden variable” • O(n) numbers for the slope, and • One number for each value-pair (offset on line) • [Figure: Temperature T2 plotted against Temperature T1, both axes from 20°C to 30°C, with points labeled time=1, time=2, time=3]

  14. Step 1: How to capture correlations? • Other pairs also follow the same pattern: they lie (approximately) on this line • [Figure: Temperature T2 plotted against Temperature T1, both axes from 20°C to 30°C]
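
The idea on slides 13 and 14 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the readings in `pairs` are made up, the line is assumed to pass through the origin with both sensors weighted equally, and the names `project` and `reconstruct` are hypothetical helpers, not part of the presented method.

```python
import math

def project(pair, direction):
    """Offset of a value-pair along a unit-length direction (the hidden variable)."""
    return pair[0] * direction[0] + pair[1] * direction[1]

def reconstruct(offset, direction):
    """Approximate the original value-pair from its offset on the line."""
    return (offset * direction[0], offset * direction[1])

# Assumed direction of the line: T1 and T2 move together (unit vector along (1, 1)).
direction = (1.0 / math.sqrt(2.0), 1.0 / math.sqrt(2.0))

# Made-up temperature readings in °C for (T1, T2); not data from the slides.
pairs = [(21.0, 21.2), (24.0, 23.9), (28.0, 28.1)]

offsets = [project(p, direction) for p in pairs]        # one number per value-pair
approx = [reconstruct(o, direction) for o in offsets]   # back to (T1, T2)

for original, rebuilt in zip(pairs, approx):
    print(original, "->", tuple(round(v, 2) for v in rebuilt))
```

Each value-pair is stored as a single offset, and the direction is shared by all pairs, which is where the saving comes from when the streams are correlated.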

  15. Method outline • Step 1: How to capture correlations? • Step 2: How to do it incrementally, when we have a very large number of points? • Step 3: How to dynamically adjust the number of hidden variables?

  16. Experiments: chlorine concentration • [Figure: sensor measurements vs. reconstruction from hidden variables] • 166 streams • 2 hidden variables (~4% error) • [CMU Civil Engineering]

  17. Experiments: chlorine concentration • Both hidden variables capture the global, periodic pattern • The second is similar to the first, but “phase-shifted” • Can express any “phase-shift”… • [Figure: hidden variables] • [CMU Civil Engineering]

  18. Conclusion • Many settings have hundreds of streams, but • Stream values are, by nature, related • We proposed a method that • discovers hidden variables as a summarization of the main trends for users • requires only incremental computation, without buffering any past data • Future work: • Apply to more applications, e.g., performance monitoring for storage systems and network systems

  19. Related work • Stream SVD [Guha, Gunopulos, Koudas / KDD03] • StatStream [Zhu, Shasha / VLDB02] • Clustering: [Aggarwal, Han, Yu / VLDB03], [Guha, Meyerson, et al. / TKDE], [Lin, Vlachos, Keogh, Gunopulos / EDBT04] • Classification: [Wang, Fan, et al. / KDD03], [Hulten, Spencer, Domingos / KDD01] • Piecewise approximations: [Palpanas, Vlachos, Keogh, et al. / ICDE 2004]

  20. Experiments: Light measurements • [Figure: measurement vs. reconstruction] • 54 sensors • 2-4 hidden variables (~6% error)

  21. Experiments: Light measurements • Hidden variables 1 & 2: main trend (as before) • Hidden variables 3 & 4: potential anomalies and outliers (labeled “intermittent” in the figure) • [Figure: hidden variables]

  22. Stream correlations • Step 1: How to capture correlations? • Step 2: How to do it incrementally, when we have a very large number of points? • Step 3: How to dynamically adjust the number of hidden variables?

  23. Step 2: Incremental update • For each new point • Project onto current line • Estimate error • [Figure: Temperature T2 plotted against Temperature T1, both axes from 20°C to 30°C, showing the new value and its error relative to the current line]

  24. Step 2: Incremental update • For each new point • Project onto current line • Estimate error • Rotate line in the direction of the error and in proportion to its magnitude • O(n) time • [Figure: Temperature T2 plotted against Temperature T1, both axes from 20°C to 30°C, showing the new value and its error relative to the current line]

  25. Step 2: Incremental update • For each new point • Project onto current line • Estimate error • Rotate line in the direction of the error and in proportion to its magnitude • [Figure: Temperature T2 plotted against Temperature T1, both axes from 20°C to 30°C]

  26. Stream correlations: Principal Component Analysis (PCA) • The “line” is the first principal component (PC) vector • This line is optimal: it minimizes the sum of squared projection errors
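
The optimality claim on slide 26 can be written out as follows. The notation is an assumption made for this note: $x_t$ denotes the vector of stream values at time $t$ and $w$ a unit-length direction; the slide itself states the property only in words.

```latex
w_1 \;=\; \arg\min_{\|w\| = 1} \sum_{t} \bigl\| x_t - (w^{T} x_t)\, w \bigr\|^{2}
    \;=\; \arg\max_{\|w\| = 1} \sum_{t} \bigl( w^{T} x_t \bigr)^{2}
```

The two forms agree because, for any unit vector $w$, $\| x_t - (w^{T} x_t)\, w \|^{2} = \| x_t \|^{2} - (w^{T} x_t)^{2}$, so minimizing the squared projection error is the same as maximizing the energy captured along $w$.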

  27. Step 2: Incremental update, given number of hidden variables k • Assuming k is known • We know how to update the slope (detailed equations in paper) • For each new point $x$ and for $i = 1, \dots, k$: • $y_i := w_i^{T} x$ (projection onto $w_i$) • $d_i \leftarrow d_i + y_i^{2}$ (energy $\propto$ $i$-th eigenvalue) • $e_i := x - y_i w_i$ (error) • $w_i \leftarrow w_i + \frac{1}{d_i} y_i e_i$ (update estimate) • $x \leftarrow x - y_i w_i$ (repeat with remainder) • [Figure: point $x$, direction $w_1$, projection $y_1$, error $e_1$, and the updated $w_1$]
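
The update rules on slide 27 can be sketched in plain Python as below. This is a minimal sketch rather than the authors' implementation: the function name `update_directions`, the list-based representation, the starting values, and the synthetic two-stream input are all assumptions, and the remainder step uses the freshly updated direction, which the slide does not spell out.

```python
def update_directions(x, W, d):
    """Apply one round of the slide-27 update rules for a new point x.

    x: list of n floats, one value per stream
    W: list of k direction estimates, each a list of n floats (updated in place)
    d: list of k positive energy estimates (updated in place)
    Returns the k hidden-variable values for this point.
    """
    x = list(x)  # work on a copy; it becomes the running remainder
    y = []
    for i in range(len(W)):
        w = W[i]
        y_i = sum(w_j * x_j for w_j, x_j in zip(w, x))        # projection onto w_i
        d[i] += y_i * y_i                                      # accumulate energy
        e = [x_j - y_i * w_j for x_j, w_j in zip(x, w)]        # projection error
        W[i] = [w_j + (y_i / d[i]) * e_j for w_j, e_j in zip(w, e)]  # update estimate
        # Remainder for the next direction, computed with the updated w_i
        # (an assumption; the slide does not say which version of w_i is used).
        x = [x_j - y_i * w_j for x_j, w_j in zip(x, W[i])]
        y.append(y_i)
    return y

# Synthetic input (made up): two perfectly correlated streams, k = 1.
W = [[1.0, 0.0]]   # assumed initial direction
d = [0.01]         # assumed small positive initial energy
for t in range(300):
    s = 1.0 + 0.5 * ((t % 3) - 1)   # cycles through 0.5, 1.0, 1.5
    update_directions([s, s], W, d)

print([round(v, 3) for v in W[0]])
```

On this input the estimate drifts from the starting direction toward the diagonal, where both streams carry equal weight. Each update touches every stream once per hidden variable, so the cost per new point grows linearly with the number of streams for a fixed k.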

  28. Stream correlations • Step 1: How to capture correlations? • Step 2: How to do it incrementally, when we have a very large number of points? • Step 3: How to dynamically adjust k, the number of hidden variables?

  29. Step 3: Number of hidden variables • If we had three sensors with similar measurements • Again: points would lie on a line (i.e., one hidden variable, k = 1), but in 3-D space • [Figure: value-tuple space with axes T1, T2, T3]

  30. Step 3: Number of hidden variables • Assume one sensor intermittently gets stuck • Now, no line can give a good approximation • [Figure: value-tuple space with axes T1, T2, T3]

  31. Step 3: Number of hidden variables • Assume one sensor intermittently gets stuck • Now, no line can give a good approximation • But a plane will do (two hidden variables, k = 2) • [Figure: value-tuple space with axes T1, T2, T3]

  32. Number of hidden variables (PCs) • Keep track of energy maintained by approximation with k variables (PCs): • Reconstruction accuracy, w.r.t. total squared error • Increment (or decrement) k if fraction of energy maintained goes below (or above) a threshold • If below 95%, $k \leftarrow k + 1$ • If above 98%, $k \leftarrow k - 1$
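
The threshold rule on slide 32 could look roughly like this in Python. The 95% and 98% thresholds come from the slide; the function name `adjust_k`, the way the two energy totals are passed in, and the bounds keeping k between 1 and the number of streams n are assumptions added for the sketch.

```python
def adjust_k(k, total_energy, kept_energy, n, low=0.95, high=0.98):
    """Return the new number of hidden variables.

    total_energy: running sum of squared stream values
    kept_energy:  running sum of squared hidden-variable values
    n:            number of streams (assumed upper bound for k)
    """
    if total_energy <= 0.0:
        return k                      # nothing observed yet
    fraction = kept_energy / total_energy
    if fraction < low and k < n:
        return k + 1                  # too little energy kept: add a variable
    if fraction > high and k > 1:
        return k - 1                  # more than enough kept: drop a variable
    return k

print(adjust_k(1, 100.0, 90.0, n=3))   # fraction 0.90, below 95%
print(adjust_k(2, 100.0, 99.0, n=3))   # fraction 0.99, above 98%
print(adjust_k(2, 100.0, 96.5, n=3))   # fraction between the thresholds
```

Keeping a gap between the two thresholds means k stays put while the fraction sits between 95% and 98%, which avoids flipping back and forth on small fluctuations.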
