Subsequence Matching in Time Series Databases

Subsequence Matching in Time Series Databases Xiaojin Xu 04-25-2006

Papers • Online Event-driven Subsequence Matching over Financial Data Streams • Huanmei Wu, Betty Salzberg, Donghui Zhang • Fast Subsequence Matching in Time-Series Databases • C. Faloutsos, M. Ranganathan, Y. Manolopoulos

Challenges of Subsequence Matching over Financial Data Streams • Existing techniques of Subsequence Matching • Mainly focus on discovering the similarity between an online querying subsequence and a traditional database • Queried data are static • Subsequence Similarities of Financial Data Streams • Data changing constantly, single pass search required • Movement can be predicted by observing a repetitive pattern of waves (zigzag shapes) • The relative position of the upper and lower end points is important in subsequence similarity. • Subsequence similarity should be flexible with regard to time shifting and scaling, amplitude rescaling…

Our online event-driven subsequence matching meets the requirements of financial data analysis • Database is a dynamic stream database which stores recent financial data. • 3-tier online segmentation and pruning • Similarity measure: distance function is defined based on a permutation of the subsequence • Event-driven matching over an up-to-date database: query will be carried out only when there is a new end point • A new definition of trend for financial data stream

Processing Online Data Stream • Translating massive data streams into manageable data for database before matching • Aggregation and Smoothing • Piecewise linear representation • Online segmentation and pruning

Aggregation and Smoothing • One unique value for each time instance over a fixed time interval • Use a p-interval moving average to filter out noise and generate a clean trend signal • Moving average = (X(1) + X(2) + ... + X(n)) / n • X(i) is the value in period i, for i = 1, 2, ..., n • n is the number of periods

Piecewise Linear Representation (PLR) • Segment over Bollinger Band Percent (%b) • %b indicator • middle_band = p-period moving average • upper_band = middle_band + 2 × (p-period standard deviation) • lower_band = middle_band − 2 × (p-period standard deviation) • %b = (close price − lower_band) / (upper_band − lower_band) • Advantages of the %b indicator • Smoothed moving trend similar to the price movement • Normalized value of the real price • Sensitive to price change
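The %b formula above can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function name `percent_b`, the use of the population standard deviation, and the value 0.5 returned for a completely flat window are all assumptions made here.

```python
from statistics import mean, pstdev


def percent_b(closes, p):
    """Return %b for every index i >= p - 1, using the last p closing prices."""
    result = []
    for i in range(p - 1, len(closes)):
        window = closes[i - p + 1 : i + 1]
        middle_band = mean(window)
        # Assumption: population standard deviation over the p-period window.
        deviation = pstdev(window)
        upper_band = middle_band + 2 * deviation
        lower_band = middle_band - 2 * deviation
        if upper_band == lower_band:
            # Assumption: a flat window has zero band width; report the midpoint.
            result.append(0.5)
        else:
            result.append((closes[i] - lower_band) / (upper_band - lower_band))
    return result
```

A close equal to the middle band gives %b = 0.5, a close on the upper band gives 1, and a close on the lower band gives 0, which is why %b acts as a normalized view of the price.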

Segmentation • Use a sliding window which • Contains at most m points • Begins after the last identified end point and ends right before the current point • Contains only the last m points if more than m points are available • Segmentation over %b finds a possible upper or lower end point in the current sliding window • If the current point is Pj(Xj,tj), the upper point Pi(Xi,ti) is a point in the sliding window that satisfies: 1. Xi = max(X values of the current sliding window) 2. Xi > Xj + δ (δ is the given error threshold) 3. Pi(Xi,ti) is the last point satisfying the above two conditions
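The three conditions for an upper end point can be sketched as follows. The function name `find_upper_point` and the index-based interface are hypothetical; a lower end point would presumably use the mirrored test (minimum of the window, Xi < Xj − δ), which the slide does not spell out.

```python
def find_upper_point(values, last_end, j, m, delta):
    """Return the index of the upper end point in the sliding window, or None.

    values   -- list of %b values, one per time step
    last_end -- index of the last identified end point (-1 if there is none)
    j        -- index of the current point
    m        -- maximum number of points in the sliding window
    delta    -- error threshold
    """
    # Window starts after the last end point, ends right before j,
    # and keeps only the last m points.
    start = max(last_end + 1, j - m)
    window = range(start, j)
    if len(window) == 0:
        return None
    peak = max(values[i] for i in window)           # condition 1
    if peak <= values[j] + delta:                   # condition 2
        return None
    return max(i for i in window if values[i] == peak)  # condition 3: last one
```

For example, with `values = [0.2, 0.5, 0.9, 0.9, 0.6]`, `last_end = 0`, `j = 4`, `m = 10`, and `delta = 0.1`, the window holds indices 1 to 3 and the function returns index 3, the later of the two equal maxima.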

Segmentation (Cont’d)

Pruning • Purpose: smoothing over recently identified end points • Two steps • Filter: pruning on %b • Refinement: pruning on the raw data stream • Pruning rule: if the %b values (or raw data values) of two adjacent end points differ in absolute value by less than a given threshold, the line segment between them is removed.

Pruning (Cont’d)

Online segmentation and pruning • Whenever an upper/lower point is identified, the previous line segment is checked for pruning • First check the need for pruning on %b • If the segment is pruned on %b, no pruning on raw data is done, and the system waits for the next stream data to arrive • If no pruning is done on %b, the same line segment is checked for pruning on raw data • Which point is kept after pruning? • Compare the last end point with the third-last end point. If they are upper points, the one with the larger value is kept. Otherwise, the one with the smaller value is kept.
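A rough sketch of one pruning step on a single value series (%b or raw data) is given below. It rests on assumptions not stated on the slides: end points are stored as hypothetical `(value, time, kind)` tuples that alternate between "upper" and "lower", and the segment under test is the one between the two most recent end points.

```python
def prune_last_segment(end_points, threshold):
    """Apply the pruning rule once and return the resulting list of end points.

    end_points -- list of (value, time, kind) with kind "upper" or "lower"
    threshold  -- minimum absolute difference for a segment to survive
    """
    if len(end_points) < 3:
        return list(end_points)
    third, second, last = end_points[-3:]
    # Pruning rule: keep the segment if its two end points differ enough.
    if abs(last[0] - second[0]) >= threshold:
        return list(end_points)
    # The segment is removed, so `second` disappears and `third` and `last`
    # (both of the same kind) collapse into one point.
    pick = max if last[2] == "upper" else min
    keep = pick(third, last, key=lambda point: point[0])
    return list(end_points[:-3]) + [keep]
```

In the two-step scheme, this check would run first on %b with the pruning threshold for %b and, only if nothing was pruned, again on the raw values with the raw-data threshold.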

Online segmentation and pruning

Online segmentation and pruning • Strategy for identifying end points • A smaller threshold δs for segmentation over %b, to ensure sensitivity and reduce delay • A larger threshold δpb for pruning over %b, to filter out noise • A separate threshold δpd for pruning over raw stream data • Online segmentation and pruning run simultaneously • At most three end points need to be kept for the segmentation and pruning procedure • All fixed end points are written to the database in real time

Permutation • Subsequence matching • Find the subsequences of end points that are similar to the query subsequence • Permutation • The stream of end points S = {(X1,t1), (X2,t2),…, (Xn,tn)} is divided into two subsets, of upper and lower end points respectively, giving S’ = {[(X1,t1), (X3,t3),…, (Xn-1,tn-1)], [(X2,t2), (X4,t4),…, (Xn,tn)]} • Sorting the X values of each subset gives S” = {[Xi1,Xi3,…,Xin-1], [Xi2,Xi4,…,Xin]}, where Xi1 ≤ Xi3 ≤ … ≤ Xin-1 and Xi2 ≤ Xi4 ≤ … ≤ Xin • {i1, i3,…, in-1, i2, i4,…, in} is the permutation of S
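The construction above can be sketched directly. The function name `permutation` is hypothetical, positions are reported 1-based to match the slide's indices, and ties are broken by original order (an assumption; the slide does not say how equal values are handled).

```python
def permutation(xs):
    """Return the permutation of a sequence of end-point values.

    xs holds X1, X2, ..., Xn in stream order, so positions 1, 3, 5, ...
    form one subset and positions 2, 4, 6, ... form the other.
    """
    odd_positions = range(0, len(xs), 2)    # X1, X3, ... (0-based indices)
    even_positions = range(1, len(xs), 2)   # X2, X4, ...

    def by_value(i):
        return xs[i]

    # sorted() is stable, so equal values keep their original order.
    first = [i + 1 for i in sorted(odd_positions, key=by_value)]
    second = [i + 1 for i in sorted(even_positions, key=by_value)]
    return tuple(first + second)
```

For `xs = [5, 1, 7, 2, 6, 3]` the result is `(1, 5, 3, 2, 4, 6)`, and multiplying every value by 10 leaves the permutation unchanged, which illustrates why the permutation is insensitive to amplitude rescaling.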

Subsequence Similarity • Definition: given S = {(X1,t1), (X2,t2),…, (Xn,tn)} and S’ = {(X1’,t1’), (X2’,t2’),…, (Xn’,tn’)}, S and S’ are similar if two conditions are satisfied: (1) S and S’ have the same permutation (2) d(S,S’) < γ • α, β, and γ are user-defined parameters, each ≥ 0 • The permutation provides flexibility with respect to time scaling and amplitude rescaling

Event-driven subsequence matching • Stream data are massive and arrive in real time. Running the similarity search only at fixed time intervals may lose potentially important information • Event: a new potential end point is identified and no pruning is needed • Event-driven subsequence matching • Performs the subsequence similarity search automatically, only when there is a new event • The generated query subsequence consists of the most recent n fixed and potential end points • Advantage: greatly reduces the computational burden while maintaining sensitivity to changes

Application: Trend Prediction • Trend of an end point: tendency of the raw stream k end points after the current end point E, where Ek is the end point k positions after E and ε is a user-defined parameter • If Ek.X ≥ E.X + ε, E.trend = UP • If Ek.X ≤ E.X − ε, E.trend = DOWN • If E.X − ε < Ek.X < E.X + ε, E.trend = NOTREND • If Ek does not exist, E.trend = UNDEFINED • Predicting the trend of a query event • The subsequence similarity search returns a list of retrieved end points • F(D) = (# of retrieved end points with trend D) / (total # of retrieved end points) × 100% • If |F(UP) − F(DOWN)| < F(NOTREND) + λ, predict NOTREND; else if F(UP) > F(DOWN), predict UP; else predict DOWN (λ is a user-defined threshold)
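The labelling and voting rules above translate almost line for line into code. The names `label_trend` and `predict_trend` are hypothetical, and returning UNDEFINED for an empty result list is an assumption; the slide does not cover that case.

```python
def label_trend(e_x, ek_x, eps):
    """Trend of end point E, given the value of Ek (None if Ek does not exist)."""
    if ek_x is None:
        return "UNDEFINED"
    if ek_x >= e_x + eps:
        return "UP"
    if ek_x <= e_x - eps:
        return "DOWN"
    return "NOTREND"


def predict_trend(retrieved_trends, lam):
    """Predict the trend of a query event from the trends of retrieved end points."""
    total = len(retrieved_trends)
    if total == 0:
        return "UNDEFINED"  # assumption: nothing retrieved, nothing to predict

    def f(trend):
        # Percentage of retrieved end points carrying the given trend.
        return 100.0 * sum(t == trend for t in retrieved_trends) / total

    if abs(f("UP") - f("DOWN")) < f("NOTREND") + lam:
        return "NOTREND"
    return "UP" if f("UP") > f("DOWN") else "DOWN"
```

With seven UP, two DOWN, and one NOTREND result, F(UP) = 70, F(DOWN) = 20, and F(NOTREND) = 10, so λ = 5 yields UP while λ = 45 yields NOTREND.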

Conclusion • The online simultaneous segmentation and pruning algorithm for PLR identifies new end points quickly while maintaining accurate segmentation • The new similarity measure, which combines a permutation with a distance function, performs better than measures based on Euclidean distance • Experiments demonstrated that event-driven search outperformed searches run at any fixed time period

Fast Subsequence Matching in Time-Series Databases • Whole matching • Given N data sequences S1, S2, …, SN and a query sequence Q, find those sequences that are within distance ε of Q. Si and Q have the same length. • Subsequence matching • Given N data sequences S1, S2, …, SN of arbitrary lengths, a query sequence Q, and a tolerance ε, find the data sequences Si that contain matching subsequences (with distance < ε from Q)

Whole matching • Use a distance-preserving transform (e.g. DFT) to extract f features from each sequence • Map the f features to a point in the f-dimensional feature space • Use a spatial access method (e.g. R*-tree) to answer range/approximate queries • Precondition: data sequences and query sequences all have the same length

Definition of Subsequence Matching • Given N data sequences of real numbers S1, S2, …, SN of potentially different lengths • The user specifies a query subsequence Q of length Len(Q) and the tolerance ε (maximum distance) • Quickly find all the sequences Si and the correct offsets k such that the subsequence Si[k: k+Len(Q)-1] matches the query sequence: D(Q, Si[k: k+Len(Q)-1]) ≤ ε • Sequential scan is inefficient in both space and time
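For reference, the sequential-scan baseline that the index is meant to beat can be sketched as below. Euclidean distance is assumed for D, the function name `sequential_scan` is hypothetical, and offsets are 0-based here.

```python
import math


def sequential_scan(sequences, query, eps):
    """Return every (sequence_id, offset) whose subsequence is within eps of query."""
    matches = []
    n = len(query)
    for sequence_id, seq in enumerate(sequences):
        # Try every possible offset k of the query inside this sequence.
        for k in range(len(seq) - n + 1):
            candidate = seq[k : k + n]
            distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(query, candidate)))
            if distance <= eps:
                matches.append((sequence_id, k))
    return matches
```

Every offset of every sequence is compared against the query, so the cost grows with the total length of the data times Len(Q). The same distance test is what an index-based method would still use at the end to discard false alarms.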

ST-index • Assume the minimum query length is w • Use a sliding window of size w and place it at every possible position on every data sequence • Extract the features of the subsequence inside the window for each placement • A data sequence of length Len(S) is mapped to a trail in feature space • The trail consists of Len(S)-w+1 points, one for each possible offset of the sliding window
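A minimal sketch of turning a data sequence into a trail follows. It assumes the features are the real and imaginary parts of the first few DFT coefficients, normalized by 1/√w so that the transform preserves Euclidean distance; the names `dft_features` and `trail`, and the default of two coefficients, are choices made for this example.

```python
import cmath
import math


def dft_features(window, n_coeffs):
    """Real and imaginary parts of the first n_coeffs DFT coefficients of window."""
    w = len(window)
    features = []
    for freq in range(n_coeffs):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * freq * t / w) for t, x in enumerate(window)
        ) / math.sqrt(w)
        features.extend((coeff.real, coeff.imag))
    return tuple(features)


def trail(sequence, w, n_coeffs=2):
    """Map a sequence to its trail: one feature point per sliding-window offset."""
    return [
        dft_features(sequence[k : k + w], n_coeffs)
        for k in range(len(sequence) - w + 1)
    ]
```

A sequence of length 10 with w = 4 produces a trail of 10 − 4 + 1 = 7 points, each with 2 × `n_coeffs` coordinates. Recomputing the DFT for every window is done here only for clarity.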

How to index the trails • A straightforward way: I-naive • Keep track of the individual points of each trail and store them in a spatial access method • Problem • Storing the individual points of a trail in an R*-tree is inefficient in both space and speed • Almost every point in a data sequence corresponds to a point in the f-dimensional feature space, so storage increases by a factor of about f (1 : f)

MBR • Divide the trail into sub-trails. Each sub-trail is represented by its minimum bounding (hyper-)rectangle (MBR). • Only a few MBRs need to be stored • When a query arrives, retrieve all the MBRs that intersect the query region • Some false alarms are included (their MBRs intersect the query region, but the sub-trails do not) • MBRs belonging to the same trail may overlap

MBR(Cont’d) • Information of MBR • tstart, tend: offsets of first and last positionings • sequence_id: unique identifier of the data sequence • (F1low,F1high,F2low,F2high,…) : extent of the MBR
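The leaf-level record listed above, together with the test for whether an MBR intersects a spherical query region, could look roughly like this. The class layout and the names `make_mbr` and `intersects_sphere` are hypothetical; only the stored fields come from the slide.

```python
from dataclasses import dataclass


@dataclass
class MBR:
    sequence_id: int   # unique identifier of the data sequence
    tstart: int        # offset of the first window position in the sub-trail
    tend: int          # offset of the last window position in the sub-trail
    low: tuple         # (F1low, F2low, ...)
    high: tuple        # (F1high, F2high, ...)


def make_mbr(sequence_id, tstart, points):
    """Bound a sub-trail (consecutive feature points starting at offset tstart)."""
    low = tuple(min(coords) for coords in zip(*points))
    high = tuple(max(coords) for coords in zip(*points))
    return MBR(sequence_id, tstart, tstart + len(points) - 1, low, high)


def intersects_sphere(mbr, center, radius):
    """True if the sphere (center, radius) touches or overlaps the rectangle."""
    # Squared distance from the center to the nearest point of the rectangle.
    squared = sum(
        max(lo - x, 0.0, x - hi) ** 2
        for lo, hi, x in zip(mbr.low, mbr.high, center)
    )
    return squared <= radius ** 2
```

An MBR that passes `intersects_sphere` is only a candidate: the sub-trail inside it may still miss the query region, which is the source of the false alarms mentioned earlier.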

MBR(Cont’d) • Group the MBRs to form MBRs at a higher level • Non-leaf nodes do not store sequence_id or offsets

Insertion – How to divide trails into sub-trails • I-fixed method • Sub-trail size is a fixed number or a simple function of Len(S) • The resulting MBRs can be poor, because a fixed size ignores how the points of the trail are distributed

I-adaptive method • Goal: Adapt to the distribution of points of the trail • Cost function • L = (L1, L2 ,…, Ln) : sides of n-dimensional MBR of a node in an R-tree • Average number of disk accesses DA(L) • Marginal cost of each point in the sub-trail of k points with the MBR • mc = DA(L)/k

I-adaptive method: Algorithm • Assign the first point of the trail to a trivial sub-trail • FOR each successive point • IF it increases the marginal cost of the current sub-trail • THEN start another sub-trail • ELSE include it in the current sub-trail
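The greedy loop above can be sketched as follows. One assumption is made loudly here: the slides do not give a formula for DA(L), so the sketch estimates it as the product of (Li + 0.5) over all dimensions, with feature values taken to lie in a unit hypercube. The names `marginal_cost` and `split_trail` are hypothetical.

```python
def marginal_cost(points):
    """mc = DA(L) / k for a sub-trail of k points bounded by its MBR."""
    disk_accesses = 1.0
    for coords in zip(*points):
        side = max(coords) - min(coords)
        # Assumption: DA(L) is estimated as the product of (L_i + 0.5).
        disk_accesses *= side + 0.5
    return disk_accesses / len(points)


def split_trail(trail):
    """Divide a trail into sub-trails using the greedy marginal-cost rule."""
    if not trail:
        return []
    sub_trails = [[trail[0]]]           # first point forms a trivial sub-trail
    for point in trail[1:]:
        current = sub_trails[-1]
        if marginal_cost(current + [point]) > marginal_cost(current):
            sub_trails.append([point])  # cost went up: start another sub-trail
        else:
            current.append(point)       # cost did not go up: extend current one
    return sub_trails
```

On the trail `[(0.0, 0.0), (0.01, 0.0), (0.02, 0.01), (0.9, 0.9), (0.91, 0.9)]` the jump to (0.9, 0.9) raises the marginal cost, so the result is one sub-trail of three points followed by one of two points.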

Searching : Len(Q) = w • Q is mapped to a point qf in feature space; the query corresponds to a sphere in feature space with center qf and radius ε • Retrieve the sub-trails whose MBRs intersect the query region, using the ST-index • Examine the corresponding subsequences of the data sequences to discard the false alarms

Searching : Len(Q) = pw • If Q and S agree within tolerance ε, then at least one of the pairs (si, qi) of corresponding subsequences agrees within tolerance ε/√p • Q is broken into p sub-queries, which correspond to p spheres in feature space, each with radius ε/√p • Retrieve the sub-trails whose MBRs intersect at least one sub-query region, using the ST-index • Examine the corresponding subsequences of the data sequences to discard the false alarms
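Splitting a long query into p pieces with the reduced radius ε/√p can be sketched as below. The helper names `sub_queries` and `euclidean` are hypothetical, and the query length is assumed to be an exact multiple of w.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def sub_queries(query, w, eps):
    """Break a query of length p*w into p pieces, each with radius eps / sqrt(p)."""
    if len(query) == 0 or len(query) % w != 0:
        raise ValueError("query length must be a positive multiple of w")
    p = len(query) // w
    radius = eps / math.sqrt(p)
    return [(query[i * w : (i + 1) * w], radius) for i in range(p)]
```

The reduced radius follows from the squared distance being the sum of the squared piece distances: if all p pieces were farther than ε/√p from their counterparts, the total would exceed ε. For instance, with Q = [0]*8, S = [1, 0, 0, 0, 1, 1, 0, 0], w = 4, and ε = 2, the whole distance is √3 ≤ 2 and the first piece is at distance 1 ≤ 2/√2.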

Conclusion • Designed a method that efficiently handles approximate queries for subsequence matching • It fulfills the following requirements: • Fast: experimental results showed orders-of-magnitude savings over sequential scanning • Small space overhead • Dynamic • Correct: no false dismissals

Thank you!
