
Time Series Indexing II


Presentation Transcript


  1. Time Series Indexing II

  2. Time Series Data A time series is a collection of observations made sequentially in time. [Figure: a sample series plotted with time on the x-axis (0-500) and value on the y-axis (23-29); its raw observations begin 25.1750, 25.1750, 25.2250, 25.2500, 25.2500, 25.2750, 25.3250, … and end …, 24.6250, 24.6750, 24.7500.]

  3. TS Databases • A Time Series Database stores a large number of time series • Similarity queries • Exact match or sub-sequence match • Range or nearest neighbor • But first we must define the similarity model, e.g. a distance function D(X,Y) for X = x1, x2, …, xn and Y = y1, y2, …, yn

  4. Similarity Models • Euclidean and Lp based • Edit Distance and LCS based • Probabilistic (using Markov Models) • Landmarks • The appropriate similarity model depends on the application

  5. Euclidean model Euclidean Distance between two time series Q = {q1, q2, …, qn} and S = {s1, s2, …, sn}: D(Q, S) = sqrt( Σi (qi − si)² ) [Figure: a query Q of n datapoints is matched against each n-datapoint series S in the database; the series are ranked by distance, e.g. distances 0.07, 0.21, 0.43, 0.98 give ranks 1, 2, 3, 4.]
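As a concrete reference, here is a minimal NumPy sketch of this distance and of ranking a toy database by it (the data values are illustrative):

```python
import numpy as np

def euclidean(q, s):
    """Euclidean distance D(Q, S) between two equal-length time series."""
    q, s = np.asarray(q, dtype=float), np.asarray(s, dtype=float)
    return np.sqrt(np.sum((q - s) ** 2))

# Rank a toy database of n-datapoint series by distance to the query.
query = np.array([25.15, 25.20, 25.30, 25.25])
database = [np.array([25.20, 25.25, 25.35, 25.30]),   # close to the query
            np.array([24.60, 24.65, 24.70, 24.60])]   # farther away
ranking = sorted(range(len(database)), key=lambda i: euclidean(query, database[i]))
print(ranking)   # indices of the series, nearest first
```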

  6. Similarity Retrieval • Range Query • Find all time series S where D(Q, S) ≤ ε • Nearest Neighbor query • Find the k most similar time series to Q • A method to answer the above queries: linear scan … very slow • A better approach: GEMINI

  7. GEMINI Solution: a 'quick-and-dirty' filter: • extract m features (numbers, e.g., average, etc.) • map into a point in m-d feature space • organize points with an off-the-shelf spatial access method ('SAM') • discard false alarms

  8. GEMINI Range Queries Build an index for the database in a feature space using an R-tree Algorithm RangeQuery(Q, ε) • Project the query Q into a point in the feature space • Find all candidate objects in the index within distance ε • Retrieve from disk the actual sequences • Compute the actual distances and discard false alarms
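A minimal sketch of the algorithm, under two stated assumptions: the feature map F keeps the first m DFT coefficients, and a linear scan over the feature points stands in for the R-tree/SAM:

```python
import numpy as np

def F(series, m=4):
    """Feature map: first m DFT coefficients (orthonormal convention)."""
    return np.fft.fft(np.asarray(series, dtype=float), norm="ortho")[:m]

def range_query(query, db, eps, m=4):
    """GEMINI range query: filter in feature space, then discard false alarms."""
    q, q_feat = np.asarray(query, dtype=float), F(query, m)
    # Steps 1-2: find candidates within eps in feature space
    # (a linear scan stands in for the R-tree here).
    candidates = [s for s in db if np.linalg.norm(F(s, m) - q_feat) <= eps]
    # Steps 3-4: fetch the actual sequences, discard false alarms.
    return [s for s in candidates if np.linalg.norm(np.asarray(s) - q) <= eps]
```

Because the feature distance lower-bounds the true distance (see slide 10), the filter can produce false alarms but never false dismissals.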

  9. GEMINI NN Query Algorithm K_NNQuery(Q, K) • Project the query Q into the same feature space • Find the K candidate nearest neighbors in the index • Retrieve from disk the actual sequences pointed to by the candidates • Compute the actual distances and record the maximum εmax • Issue a RangeQuery(Q, εmax) • Compute the actual distances, keep the K closest
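A matching sketch of the two-phase K-NN search, reusing F and range_query from the previous sketch:

```python
import numpy as np

def knn_query(query, db, K, m=4):
    """GEMINI K-NN: K candidates in feature space, then a range query with emax."""
    q, q_feat = np.asarray(query, dtype=float), F(query, m)
    # Steps 1-2: the K nearest candidates in feature space.
    cand = sorted(db, key=lambda s: np.linalg.norm(F(s, m) - q_feat))[:K]
    # Steps 3-4: actual distances; their maximum bounds the true K-NN distance.
    emax = max(np.linalg.norm(np.asarray(s) - q) for s in cand)
    # Steps 5-6: range query with emax, then keep the K closest.
    hits = range_query(query, db, emax, m)
    return sorted(hits, key=lambda s: np.linalg.norm(np.asarray(s) - q))[:K]
```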

  10. GEMINI • GEMINI works when: Dfeature(F(x), F(y)) ≤ D(x, y) • Note that the closer the feature distance is to the actual one, the better: fewer false alarms, so fewer wasted disk reads.

  11. Problem • How to extract the features? How to define the feature space? • Fourier transform • Wavelets transform • Averages of segments (Histograms or APCA)

  12. Fourier transform • DFT (Discrete Fourier Transform) • Transform the data from the time domain to the frequency domain • highlights the periodicities • SO?

  13. DFT A: several real sequences are periodic Q: Such as? A: • sales patterns follow seasons • the economy follows a 50-year cycle • temperature follows daily and yearly cycles Many real signals follow (multiple) cycles

  14. How does it work? Decomposes the signal into a sum of sine (and cosine) waves. Q: How to assess the 'similarity' of x = {x0, x1, …, xn−1} with a wave? [Figure: the series x plotted as value vs. time, t = 0 … n−1.]

  15. How does it work? A: consider the waves with frequency 0, 1, …; use the inner product (~ cosine similarity) [Figure: the constant wave (freq. f = 0) and the wave sin(t · 2π/n) (freq. f = 1), each plotted over t = 0 … n−1.]

  16. How does it work? A: consider the waves with frequency 0, 1, …; use the inner product (~ cosine similarity) [Figure: the wave of frequency f = 2 over t = 0 … n−1.]

  17. How does it work? [Figure: the 'basis' functions plotted over t = 0 … n−1: cosine and sine of frequency 1; cosine and sine of frequency 2.]

  18. How does it work? • Basis functions are actually n-dim vectors, orthogonal to each other • ‘similarity’ of x with each of them: inner product • DFT: ~ all the similarities of x with the basis functions

  19. How does it work? Since e^(jφ) = cos(φ) + j·sin(φ) (j = √−1), the cosine and sine similarities at each frequency combine into a single complex coefficient, giving the compact definition on the next slide.

  20. DFT: definition • Discrete Fourier Transform (n-point): Xf = (1/√n) · Σt=0…n−1 xt · exp(−j 2π t f / n), for f = 0, …, n−1 • inverse DFT: xt = (1/√n) · Σf=0…n−1 Xf · exp(+j 2π t f / n)

  21. DFT: definition • Good news: available in all symbolic math packages, e.g., in Mathematica: x = {1, 2, 1, 2}; X = Fourier[x]; ListPlot[Abs[X]];
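The same computation in NumPy, for readers without Mathematica (norm="ortho" selects the 1/√n convention used on the previous slide):

```python
import numpy as np

x = np.array([1, 2, 1, 2], dtype=float)
X = np.fft.fft(x, norm="ortho")   # 4-point DFT
print(np.abs(X))                  # amplitude spectrum: [3. 0. 1. 0.]
```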

  22. DFT: properties Observation - SYMMETRY property: Xf = (Xn−f)* ('*': complex conjugate; (a + b·j)* = a − b·j) Thus we use only the first half of the coefficients
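A quick numerical check of the symmetry property:

```python
import numpy as np

x = np.random.rand(8)             # any real-valued series
X = np.fft.fft(x)
# For real input, X[f] is the complex conjugate of X[n-f]:
print(np.allclose(X[1:], np.conj(X[1:][::-1])))   # True
```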

  23. DFT: Amplitude spectrum • Amplitude: Af = |Xf| = sqrt( Re(Xf)² + Im(Xf)² ) • Intuition: strength of frequency 'f' [Figure: a sample count-vs.-time series with a strong cycle, and its amplitude spectrum Af vs. frequency f, peaking at freq. 12.]

  24. DFT: Amplitude spectrum • excellent approximation, with only 2 frequencies! • so what?

  25. DFT: Amplitude spectrum • excellent approximation, with only 2 frequencies! • so what? • A1: compression • A2: pattern discovery • A3: forecasting

  26. DFT: Parseval's theorem Σt xt² = Σf |Xf|² I.e., the DFT preserves the 'energy'; alternatively, it performs an axis rotation. [Figure: a 2-d point x = {x0, x1} in the (x0, x1) plane; the DFT rotates the coordinate axes.]
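A one-line numerical check of Parseval's theorem, using the orthonormal DFT, which is exactly the axis rotation mentioned above:

```python
import numpy as np

x = np.random.rand(16)
X = np.fft.fft(x, norm="ortho")   # unitary DFT: a rotation of the axes
print(np.allclose(np.sum(x ** 2), np.sum(np.abs(X) ** 2)))   # True: energy preserved
```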

  27. Lower Bounding lemma • Using Parseval's theorem we can prove the lower bounding property! • So, apply the DFT to each time series, keep the first 3-10 coefficients as a vector, and use an R-tree to index the vectors • The R-tree works with Euclidean distance, so this fits.
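A sketch of the resulting pipeline: keep the first few DFT coefficients as the feature vector, and observe that the feature distance never exceeds the true distance (truncation can only discard energy, by Parseval):

```python
import numpy as np

def dft_features(x, k=6):
    """Feature vector: the first k DFT coefficients (orthonormal convention)."""
    return np.fft.fft(np.asarray(x, dtype=float), norm="ortho")[:k]

x, y = np.random.rand(64), np.random.rand(64)
d_true = np.linalg.norm(x - y)
d_feat = np.linalg.norm(dft_features(x) - dft_features(y))
print(d_feat <= d_true)   # True: the lower bounding property holds
```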

  28. Wavelets - DWT • DFT is great - but how about compressing opera? (baritone, silence, soprano?) [Figure: a value-vs.-time signal whose frequency content changes over time.]

  29. Wavelets - DWT • Solution #1: Short-window Fourier transform • But: how short should the window be?

  30. Wavelets - DWT • Answer: multiple window sizes! -> DWT

  31. Haar Wavelets • subtract the sum of the left half from the right half • repeat recursively for quarters, eighths, ...

  32. Wavelets - construction [Figure: the input signal x0 x1 x2 x3 x4 x5 x6 x7.]

  33. Wavelets - construction Level 1: pairwise '+' boxes produce the smooth coefficients s1,0, s1,1, …; pairwise '−' boxes produce the detail coefficients d1,0, d1,1, … [Figure: level 1 of the pyramid over x0 … x7.]

  34. Wavelets - construction Level 2: repeat on the level-1 smooths, producing s2,0 and d2,0. [Figure: level 2 of the pyramid.]

  35. Wavelets - construction etc. ... [Figure: the full pyramid over x0 … x7.]

  36. Wavelets - construction Q: map each coefficient on the time-freq. plane [Figure: the pyramid next to an empty time-frequency (t, f) plane.]

  37. Wavelets - construction A: each coefficient covers a tile of the (t, f) plane: smooth (low-frequency) coefficients cover wide, short tiles; detail (high-frequency) coefficients cover narrow, tall tiles. [Figure: the tiling of the time-frequency plane.]
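A minimal sketch of this construction (pairwise averages as the '+' boxes and pairwise differences as the '−' boxes; assumes the length is a power of two, and uses the unnormalized Haar convention):

```python
import numpy as np

def haar_dwt(x):
    """Haar DWT: recursively split into pairwise averages and differences."""
    x = np.asarray(x, dtype=float)
    details = []
    while len(x) > 1:
        smooth = (x[0::2] + x[1::2]) / 2   # '+' boxes: pairwise averages
        detail = (x[0::2] - x[1::2]) / 2   # '-' boxes: pairwise differences
        details.append(detail)
        x = smooth                         # recurse on the smooth half
    # coefficients ordered coarsest first: s, then details level by level
    return np.concatenate([x] + details[::-1])

print(haar_dwt([1, 2, 3, 4, 5, 6, 7, 8]))
# [ 4.5 -2.  -1.  -1.  -0.5 -0.5 -0.5 -0.5]
```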

  38. Wavelets - Drill: • Q: baritone/silence/soprano - what does the DWT look like? [Figure: the value-vs.-time signal and an empty time-frequency (t, f) plane to fill in.]

  39. Wavelets - Drill: • Q: baritone/soprano - what does the DWT look like? [Figure: the value-vs.-time signal and an empty time-frequency (t, f) plane.]

  40. Wavelets - construction Observation 1: '+' can be some weighted addition, with '−' the corresponding weighted difference ('quadrature mirror filters') Observation 2: unlike the DFT/DCT, there are *many* wavelet bases: Haar, Daubechies-4, Daubechies-6, ...

  41. Advantages of Wavelets • Better compression (better RMSE with same number of coefficients) • closely related to the processing of the mammalian eye and ear • Good for progressive transmission • handle spikes well • usually, fast to compute (O(n)!)

  42. Feature space • Keep the d most "important" wavelet coefficients • Normalize and keep the largest • Lower bounding lemma: the same as for the DFT
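A sketch of this feature-extraction step, applicable to any coefficient vector (such as the output of the haar_dwt sketch above):

```python
import numpy as np

def top_d_features(coeffs, d):
    """Normalize, then keep only the d largest-magnitude coefficients."""
    c = np.asarray(coeffs, dtype=float)
    c = c / np.linalg.norm(c)                 # normalize the energy to 1
    keep = np.argsort(np.abs(c))[-d:]         # indices of the d largest
    feat = np.zeros_like(c)
    feat[keep] = c[keep]                      # zero out everything else
    return feat
```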

  43. PAA and APCA • Another approach: segment the time series into equal parts, store the average value for each part. • Use an index to store the averages and the segment end points

  44. Feature Spaces [Figure: a time series X and its reconstruction X' in three feature spaces, each using 8 basis functions over t = 0 … 140: SVD (eigenwave 0 … eigenwave 7; Korn, Jagadish & Faloutsos 1997), DFT (coefficients 0 … 7; Agrawal, Faloutsos & Swami 1993), and DWT (Haar 0 … Haar 7; Chan & Fu 1999).]

  45. Piecewise Aggregate Approximation (PAA) Original time series (n-dimensional vector): S = {s1, s2, …, sn} n'-segment PAA representation (n'-d vector): S' = {sv1, sv2, …, svn'} The PAA representation satisfies the lower bounding lemma (Keogh, Chakrabarti, Mehrotra and Pazzani, 2000; Yi and Faloutsos, 2000) [Figure: the series on the value/time axes approximated by 8 equal-length constant segments sv1 … sv8.]
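A minimal PAA sketch (np.array_split makes the last part slightly shorter when n is not divisible by the segment count):

```python
import numpy as np

def paa(series, n_segments):
    """PAA: split into (nearly) equal parts and store each part's average."""
    parts = np.array_split(np.asarray(series, dtype=float), n_segments)
    return np.array([p.mean() for p in parts])

print(paa([1, 2, 3, 4, 5, 6, 7, 8], 4))   # [1.5 3.5 5.5 7.5]
```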

  46. Adaptive Piecewise Constant Approximation (APCA) Can we improve upon PAA? n'-segment PAA representation (n'-d vector): S' = {sv1, sv2, …, svn'} n'/2-segment APCA representation (n'-d vector): S' = {sv1, sr1, sv2, sr2, …, svM, srM}, where M = n'/2 is the number of segments, svi is the value of segment i, and sri is its right endpoint [Figure: the same series approximated by 4 variable-length segments (sv1, sr1) … (sv4, sr4).]
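For intuition, here is a simple bottom-up merge heuristic for an adaptive segmentation (a sketch only; this is not the near-optimal algorithm of the next slide): start with one segment per point and repeatedly merge the adjacent pair whose merge increases the squared error least, until M segments remain.

```python
import numpy as np

def apca_sketch(series, M):
    """Greedy APCA sketch: returns M pairs (sv_i, sr_i) = (mean, right endpoint)."""
    s = np.asarray(series, dtype=float)
    segs = [(i, i + 1) for i in range(len(s))]        # half-open [a, b) segments
    sse = lambda a, b: np.sum((s[a:b] - s[a:b].mean()) ** 2)
    while len(segs) > M:
        # cost of merging each adjacent pair = increase in squared error
        costs = [sse(segs[i][0], segs[i + 1][1]) - sse(*segs[i]) - sse(*segs[i + 1])
                 for i in range(len(segs) - 1)]
        i = int(np.argmin(costs))
        segs[i:i + 2] = [(segs[i][0], segs[i + 1][1])]   # merge the cheapest pair
    return [(s[a:b].mean(), b - 1) for a, b in segs]

print(apca_sketch([1, 1, 1, 5, 5, 9, 9, 9], 3))   # [(1.0, 2), (5.0, 4), (9.0, 7)]
```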

  47. APCA approximates the original signal better than PAA [Figure: reconstruction error of PAA vs. APCA on sample datasets; improvement factors 3.77, 3.02, 1.75, 1.69, 1.21, 1.03.]

  48. The APCA representation can be computed efficiently • A near-optimal representation can be computed in O(n log n) time • The optimal representation can be computed in O(n²M) time (Koudas et al.)

  49. Distance Measures • Exact (Euclidean) distance D(Q, S) between the query Q and a stored series S • Lower bounding distance DLB(Q', S) between the APCA projection Q' of the query and S, with DLB(Q', S) ≤ D(Q, S) [Figure: Q overlaid on S illustrating D(Q, S); the piecewise-constant Q' overlaid on S illustrating DLB(Q', S).]

  50. Index on the 2M-dimensional APCA space Each series S1 … S9 maps to a point in the 2M-dimensional APCA space, and the points are grouped into bounding regions R1 … R4. Any feature-based index structure can be used (e.g., R-tree, X-tree, Hybrid Tree). [Figure: the APCA points in feature space and the corresponding tree of regions R1-R4.]
