
Time-Series Data Management


Presentation Transcript


  1. Time-Series Data Management Yonsei University 2nd Semester, 2014 Sanghyun Park * The slides were extracted from the material presented at ICDM’01 by Eamonn Keogh

  2. Contents • Introduction, motivation • Utility of similarity measurements • Indexing time series • Summary, conclusions

  3. What Are Time Series? • A time series is a collection of observations made sequentially in time • Example values from such a series: 25.2250, 25.2500, 25.2500, 25.2750, 25.3250, 25.3500, 25.3500, 25.4000, 25.4000, 25.3250, 25.2250, 25.2000, 25.1750, …, 24.6250, 24.6750, 24.6750, 24.6250, 24.6250, 24.6250, 24.6750, 24.7500 (figure: the series plotted over roughly 500 time points)

  4. Time Series Are Ubiquitous (1/2) • People measure things … • The president’s approval rating • Their blood pressure • The annual rainfall in Riverside • The value of their Yahoo stock • The number of web hits per second • And things change over time, and thus time series occur in virtually every medical, scientific and business domain

  5. Time Series Are Ubiquitous (2/2) • A random sample of 4,000 graphics from 15 of the world’s newspapers published from 1974 to 1989 found that more than 75% of all graphics were time series

  6. Time Series Similarity • Defining the similarity between two time series is at the heart of most time series data mining applications/tasks • Thus time series similarity will be the primary focus of this lecture

  7. Utility Of Similarity Search (1/2) • Classification • Clustering (figure: example time-series classification and clustering)

  8. Utility Of Similarity Search (2/2) • Rule Discovery (an example rule with support s = 0.5 and confidence c = 0.3) • Query by Content (a query Q given as a template) (figure: a discovered rule and a content-based query)

  9. Challenges Of Research On Time Series (1/3) • How do we work with very large databases? • 1 hour of ECG data: 1 gigabyte • Typical web log: 5 gigabytes per week • Space shuttle database: 158 gigabytes and growing • Macho database: 2 terabytes, updated with 3 gigabytes per day • Since most of the data lives on disk (or tape), we need a representation of the data we can efficiently manipulate

  10. Challenges Of Research On Time Series (2/3) • We are dealing with subjective notions of similarity • The definition of similarity depends on the user, the domain, and the task at hand. We need to handle this subjectivity

  11. Challenges Of Research On Time Series (3/3) • Miscellaneous data handling problems • Differing data formats • Differing sampling rates • Noise, missing values, etc.

  12. Whole Matching vs. Subsequence Matching (1/2) • Whole matching: given a query Q, a reference database C, and a distance measure, find the Ci that best matches Q (figure: Query Q as a template compared against database sequences C1–C10; C6 is the best match)

  13. Whole Matching vs. Subsequence Matching (2/2) • Subsequence matching: given a query Q, a reference database C, and a distance measure, find the location that best matches Q (figure: Query Q as a template slid along a long database sequence C; the best matching subsection is highlighted)

  14. Motivation Of Similarity Search • You go to the doctor because of chest pains. Your ECG looks strange … • Your doctor wants to search a database to find similar ECGs, in the hope that they will offer clues about your condition … • Two questions • How do we define similar? • How do we search quickly?

  15. Defining Distance Measures • Definition: Let O1 and O2 be two objects from the universe of possible objects. Their distance is denoted as D(O1,O2) • What properties should a distance measure have? • D(A,B) = D(B,A) (Symmetry) • D(A,A) = 0 (Constancy of self-similarity) • D(A,B) = 0 iff A = B (Positivity) • D(A,B) ≤ D(A,C) + D(B,C) (Triangle inequality)

  16. The Minkowski Metrics • D(Q,C) = ( Σ i=1..n |qi − ci|^p )^(1/p) • p = 1: Manhattan (Rectilinear, City Block) • p = 2: Euclidean • p = ∞: Max (Supremum, “sup”)

  17. Euclidean Distance Metric • Given two time series Q = q1…qn and C = c1…cn, their Euclidean distance is defined as: D(Q,C) = sqrt( Σ i=1..n (qi − ci)^2 ) (figure: Q and C plotted together, with D(Q,C) computed point by point)
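A minimal Python sketch of these metrics (NumPy assumed; the function name minkowski and the sample values, taken from slide 3, are only illustrative):

```python
import numpy as np

def minkowski(q, c, p=2.0):
    """Minkowski distance between two equal-length series.
    p = 1 -> Manhattan, p = 2 -> Euclidean, p = inf -> Max."""
    q = np.asarray(q, dtype=float)
    c = np.asarray(c, dtype=float)
    diff = np.abs(q - c)
    if np.isinf(p):
        return diff.max()
    return (diff ** p).sum() ** (1.0 / p)

# Euclidean distance is the p = 2 case
Q = [25.2250, 25.2500, 25.2500, 25.2750]
C = [24.6250, 24.6750, 24.6750, 24.6250]
print(minkowski(Q, C, p=2))       # D(Q, C)
print(minkowski(Q, C, p=np.inf))  # max metric
```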

  18. Processing The Data Before Distance Calculation • If we naively try to measure the distance between two “raw” time series, we may get very unintuitive results • This is because Euclidean distance is very sensitive to some distortions in the data • For most problems these distortions are not meaningful, and thus we can and should remove them • Four most common distortions • Offset translation • Amplitude scaling • Linear trend • Noise

  19. Offset Translation • Q = Q − mean(Q) • C = C − mean(C) • D(Q,C) is then computed on the translated series (figure: the same pair of series before and after offset translation)

  20. Amplitude Scaling • Q = (Q − mean(Q)) / std(Q) • C = (C − mean(C)) / std(C) • D(Q,C) is then computed on the rescaled series (figure: the same pair of series before and after amplitude scaling)
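A short sketch of both corrections together (offset translation from slide 19 followed by amplitude scaling from slide 20), assuming NumPy; the name znorm is illustrative:

```python
import numpy as np

def znorm(x):
    """Offset translation (subtract the mean) plus amplitude scaling
    (divide by the standard deviation)."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()  # assumes the series is not constant

# Compare normalized series rather than the raw ones:
# D(znorm(Q), znorm(C)) instead of D(Q, C)
```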

  21. Linear Trend • The intuition behind removing linear trend is this: fit the best fitting straight line to the time series, then subtract that line from the time series (figure: the series after removing offset translation and amplitude scaling, and after additionally removing the linear trend)
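A sketch of this step, assuming NumPy; the least-squares line stands in for “the best fitting straight line” on the slide:

```python
import numpy as np

def remove_linear_trend(x):
    """Fit the best-fitting straight line to the series, then subtract it."""
    x = np.asarray(x, dtype=float)
    t = np.arange(len(x))
    slope, intercept = np.polyfit(t, x, deg=1)  # least-squares fit
    return x - (slope * t + intercept)
```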

  22. Noise • The intuition behind removing noise is this: average each datapoint value with its neighbors • Q = smooth(Q) • C = smooth(C) • D(Q,C) is then computed on the smoothed series (figure: the same pair of series before and after smoothing)
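A sketch of the smoothing step as a simple moving average (the exact smoother used on the slide is not specified; the window size is an illustrative choice):

```python
import numpy as np

def smooth(x, window=5):
    """Average each datapoint with its neighbors (simple moving average)."""
    x = np.asarray(x, dtype=float)
    kernel = np.ones(window) / window
    return np.convolve(x, kernel, mode='same')  # keeps the original length
```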

  23. Dynamic Time Warping • Fixed time axis: sequences are aligned “one to one” • “Warped” time axis: nonlinear alignments are possible • We will first see the utility of DTW, then see how it is calculated

  24. Utility of DTW: Example I, Machine Learning • The Cylinder-Bell-Funnel dataset has been studied in a machine learning context by many researchers • Recall that, by definition, the instances of Cylinder-Bell-Funnel are warped in the time axis (figure: example Cylinder, Bell, and Funnel instances)

  25. Classification Experiment on C-B-F Dataset (1/2) • Experimental settings • Training data consists of 10 exemplars from each class • (One) nearest neighbor algorithm • Leave-one-out evaluation, averaged over 100 runs • Results • Error rate using Euclidean Distance: 26.10% • Error rate using Dynamic Time Warping: 2.87% • Time to classify one instance using Euclidean Distance: 1 sec • Time to classify one instance using Dynamic Time Warping: 4,320 sec
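A minimal sketch of the evaluation protocol described above: one-nearest-neighbor classification with leave-one-out, where the distance function is a parameter so either Euclidean distance or DTW can be plugged in (function and variable names are illustrative):

```python
import numpy as np

def euclidean(q, c):
    return float(np.sqrt(np.sum((np.asarray(q) - np.asarray(c)) ** 2)))

def loo_1nn_error_rate(series, labels, dist=euclidean):
    """Classify each series by its nearest other series; return the error rate."""
    errors = 0
    for i, q in enumerate(series):
        best_j = min((j for j in range(len(series)) if j != i),
                     key=lambda j: dist(q, series[j]))
        errors += (labels[best_j] != labels[i])
    return errors / len(series)
```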

  26. Classification Experiment onC-B-F Dataset (2/2) • Dynamic time warping can reduce the error rate by an order of magnitude • Its classification accuracy is competitive with sophisticated approaches like decision tree, boosting, neural networks, and Bayesian techniques • But, it is slow …

  27. Utility of DTW: Example II, Data Mining • Power-demand time series: each sequence corresponds to a week’s demand for power in a Dutch research facility in 1997 (figure: one week of demand labeled Monday through Sunday; that Wednesday was a national holiday)

  28. Hierarchical Clustering with Euclidean Distance • The two 5-day weeks are correctly grouped • Note, however, that the three 4-day weeks are not clustered together • Also, the two 3-day weeks are not clustered together (figure: dendrogram with leaf order 4 5 3 6 7 2 1)

  29. Hierarchical Clustering with Dynamic Time Warping • The two 5-day weeks are correctly grouped • The three 4-day weeks are clustered together • The two 3-day weeks are also clustered together (figure: dendrogram with leaf order 6 4 7 5 3 2 1)

  30. Time Taken to Create Hierarchical Clustering of Power-Demand Time Series • Time to create dendrogram using Euclidean Distance: 1.2 seconds • Time to create dendrogram using Dynamic Time Warping: 3.40 hours

  31. Computing the Dynamic Time Warping Distance (1/2) • Note that the input sequences can be of different lengths: Q = q1…qn and C = c1…cp (figure: the n × p search matrix with a warping path w1…wk)

  32. Computing the Dynamic Time Warping Distance (2/2) • Every possible mapping from Q to C can be represented as a warping path in the n × p search matrix • We simply want to find the cheapest one … • Although there are exponentially many such paths, we can find one in only quadratic time using dynamic programming: γ(i,j) = d(qi,cj) + min{ γ(i−1,j−1), γ(i−1,j), γ(i,j−1) }
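A straightforward quadratic-time Python sketch of this dynamic program (no warping-window constraint; the squared local cost and final square root are one common convention, not dictated by the slide):

```python
import numpy as np

def dtw_distance(q, c):
    """Dynamic time warping distance between series of lengths n and p."""
    n, p = len(q), len(c)
    gamma = np.full((n + 1, p + 1), np.inf)
    gamma[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, p + 1):
            d = (q[i - 1] - c[j - 1]) ** 2              # local cost d(qi, cj)
            gamma[i, j] = d + min(gamma[i - 1, j - 1],  # diagonal step
                                  gamma[i - 1, j],      # step in Q only
                                  gamma[i, j - 1])      # step in C only
    return float(np.sqrt(gamma[n, p]))                  # cost of the cheapest warping path
```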

  33. Fast Approximation to Dynamic Time Warping Distance (1/2) • Simple idea: approximate the time series with some compressed or downsampled representation, and do DTW on the new representation • How well does this work …

  34. Fast Approximation to Dynamic Time Warping Distance (2/2) • … Strong visual evidence to suggest it works well (figure: alignments produced by full DTW in 22.7 sec and by the approximation in 1.3 sec)
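A sketch of the approximation, assuming the “compressed or downsampled representation” is simply the mean of each block of adjacent points (the slide does not name a specific representation):

```python
import numpy as np

def downsample(x, factor=8):
    """Replace each block of `factor` adjacent points by its mean."""
    x = np.asarray(x, dtype=float)
    usable = len(x) - (len(x) % factor)        # drop any ragged tail
    return x[:usable].reshape(-1, factor).mean(axis=1)

# Approximate DTW on the much shorter series, e.g. with dtw_distance from above:
# approx = dtw_distance(downsample(Q), downsample(C))
```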

  35. Weighted Distance Measures (1/3) • Intuition: for some queries different parts of the sequence are more important

  36. Weighted Distance Measures (2/3) • D(Q,C) becomes D(Q,C,W), where W is a vector of weights • The height of this histogram indicates the relative importance of that part of the query (figure: query Q with the weight histogram W drawn beneath it)
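A sketch of one natural reading of D(Q,C,W): a Euclidean distance in which each squared difference is scaled by the weight of that part of the query (the slide shows the weights only as a histogram, so this exact form is an assumption):

```python
import numpy as np

def weighted_euclidean(q, c, w):
    """D(Q, C, W): higher-weighted regions of the query count for more."""
    q, c, w = (np.asarray(a, dtype=float) for a in (q, c, w))
    return float(np.sqrt(np.sum(w * (q - c) ** 2)))
```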

  37. Weighted Distance Measures (3/3) • How do we set the weights? • One possibility: relevance feedback, which is the reformulation of a query in response to feedback provided by the user for the results of the previous query (figure: a Search → Display Results → Gather Feedback → Update Weights loop; e.g., the term vector [Jordan, Cow, Bull, River] starts with weights [1, 1, 1, 1] and is updated to [1.1, 1.7, 0.3, 0.9])

  38. Indexing Time Series (1/6) • We have seen techniques for assessing the similarity of two time series • However, we have not addressed the problem of finding the best match to a query in a large database … • The obvious solution, to retrieve and examine every item (sequential scanning), simply does not scale to large datasets • We need some way to index the data

  39. Indexing Time Series (2/6) • We can project a time series of length n into n-dimensional space • The first value in C is the X-axis, the second value in C is the Y-axis, etc. • One advantage of doing this is that we have abstracted away the details of “time series”; now all query processing can be imagined as finding points in space …

  40. Indexing Time Series (3/6) • We can project the query time series Q into the same n-dimensional space and simply look for the nearest points • The problem is that we have to look at every point to find the nearest neighbor

  41. Indexing Time Series (4/6) • The Minkowski metrics have simple geometric interpretations (figure: the query regions induced by the Euclidean, Weighted Euclidean, Manhattan, and Max metrics)

  42. Indexing Time Series (5/6) • We can group clusters of datapoints with “boxes” called Minimum Bounding Rectangles (MBRs) • We can further recursively group MBRs into larger MBRs (figure: points grouped into MBRs R1–R9)
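A minimal sketch of an MBR and of the standard lower bound on the distance from a query point to anything inside it (names are illustrative; this is the pruning idea behind MBR-based indexes, not a full R-tree):

```python
import numpy as np

def mbr(points):
    """Minimum bounding rectangle: per-dimension minima and maxima."""
    pts = np.asarray(points, dtype=float)
    return pts.min(axis=0), pts.max(axis=0)

def mindist(query, low, high):
    """Distance from a query point to the nearest face of the MBR
    (zero if the query lies inside it); no point in the MBR can be closer."""
    q = np.asarray(query, dtype=float)
    gap = np.maximum(np.maximum(low - q, q - high), 0.0)
    return float(np.sqrt(np.sum(gap ** 2)))
```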

  43. Indexing Time Series (6/6) • These nested MBRs are organized as a tree (called a spatial access tree or a multidimensional tree). Examples include the R-tree, Hybrid-tree, etc. (figure: a tree whose upper level holds R10–R12, whose next level holds R1–R9, and whose data nodes contain the points)

  44. Dimensionality Curse (1/4) • If we project a query into n-dimensional space, how many additional MBRs must we examine before we are guaranteed to find the best match? • For the one dimensional space, the answer is clearly 2

  45. Dimensionality Curse (2/4) • If we project a query into n-dimensional space, how many additional MBRs must we examine before we are guaranteed to find the best match? • For the two dimensional case, the answer is 8

  46. Dimensionality Curse (3/4) • If we project a query into n-dimensional space, how many additional MBRs must we examine before we are guaranteed to find the best match? • For the three dimensional case, the answer is 26

  47. Dimensionality Curse (4/4) • If we project a query into n-dimensional space, how many additional MBRs must we examine before we are guaranteed to find the best match? • More generally, in n-dimensional space we must examine 3^n − 1 MBRs; for n = 21 that is 3^21 − 1 = 10,460,353,202 MBRs • This is known as the curse of dimensionality

  48. Spatial Access Methods • We can use Spatial Access Methods like the R-tree to index our data, but … • The performance of R-trees degrades exponentially with the number of dimensions. Somewhere above 6-20 dimensions the R-tree degrades to linear scanning • Often we want to index time series with hundreds, perhaps even thousands of features

  49. GEMINI (GEneric Multimedia INdexIng) {Christos Faloutsos} (1/8) • Establish a distance metric from a domain expert • Produce a dimensionality reduction technique that reduces the dimensionality of the data from n to N, where N can be efficiently handled by your favorite SAM • Produce a distance measure defined on the N-dimensional representation of the data, and prove that it obeys Dindexspace(A,B) ≤ Dtrue(A,B) (the lower bounding lemma) • Plug into an off-the-shelf SAM
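As an illustration of steps 2 and 3, here is a sketch that uses piecewise averaging as the dimensionality-reduction technique; this particular choice is an assumption made for illustration (the slide does not prescribe one), but for it the scaled distance below is known to lower-bound the true Euclidean distance:

```python
import numpy as np

def reduce_series(x, N):
    """Reduce an n-point series to N dimensions by averaging equal-length frames."""
    x = np.asarray(x, dtype=float)
    return x.reshape(N, -1).mean(axis=1)      # assumes N divides len(x)

def d_indexspace(q_reduced, c_reduced, n):
    """Distance in the reduced space, scaled so that
    D_indexspace(A, B) <= D_true(A, B) for this reduction."""
    q_reduced = np.asarray(q_reduced, dtype=float)
    c_reduced = np.asarray(c_reduced, dtype=float)
    N = len(q_reduced)
    return float(np.sqrt((n / N) * np.sum((q_reduced - c_reduced) ** 2)))
```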

  50. GEMINI (GEneric Multimedia INdexIng) {Christos Faloutsos} (2/8) • We have 6 objects (A–F) in 3-D space. We issue a query to find all objects within 1 unit of the point (−3, 0, −2) (figure: the six points plotted in 3-D)
