
Presentation Transcript


  1. Fair Use Agreement • This agreement covers the use of all slides on this CD-ROM; please read carefully. • You may freely use these slides for teaching, if: • You send me an email telling me the class number/university in advance. • My name and email address appear on the first slide (if you are using all or most of the slides), or on each slide (if you are just taking a few slides). • You may freely use these slides for a conference presentation, if: • You send me an email telling me the conference name in advance. • My name appears on each slide you use. • You may not use these slides for tutorials, or in a published work (tech report/conference paper/thesis/journal, etc.). If you wish to do this, email me first; it is highly likely I will grant you permission. • (c) Eamonn Keogh, eamonn@cs.ucr.edu

  2. Everything you know about Dynamic Time Warping is Wrong Chotirat Ann Ratanamahatana, Eamonn Keogh Computer Science & Engineering Department, University of California - Riverside, Riverside, CA 92521 eamonn@cs.ucr.edu

  3. Outline of Talk • Introduction to dynamic time warping (DTW) • Why DTW is important • Introduction/review of the LB_Keogh solution • Three popular beliefs about DTW • Why popular belief 1 is wrong • Why popular belief 2 is wrong • Why popular belief 3 is wrong • Conclusions

  4. Excellent! Here is a simple visual example to help you develop an intuition for DTW. We are looking at nuclear power data. [Figure: the same pair of nuclear power time series aligned by Dynamic Time Warping and by Euclidean distance.]

  5. Let us compare Euclidean Distance and DTW on some problems. [Figures: sample time series from eight datasets: Leaves, Faces, Gun, Sign language, Control, Trace, 2-Patterns, Word Spotting.]

  6. Results: Error Rate, using 1-nearest-neighbor, leave-one-out evaluation.

  7. How is DTW Calculated? Every possible warping between two time series is a path through the matrix. We want the best one… This recursive function gives us the minimum cost warping path w, where γ(i,j) is the cumulative distance: γ(i,j) = d(qi, cj) + min{ γ(i-1, j-1), γ(i-1, j), γ(i, j-1) } [Figure: time series Q and C, their warping matrix, and the warping path w.]
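As a concrete illustration of the recurrence above, here is a minimal sketch in Python (not from the original slides); it assumes NumPy, uses the squared point-wise distance d(qi, cj) = (qi - cj)^2, and simply fills the full warping matrix.

```python
import numpy as np

def dtw(Q, C):
    """Minimal DTW: fill the full warping matrix and return the
    cumulative cost of the best warping path between Q and C."""
    n, m = len(Q), len(C)
    gamma = np.full((n + 1, m + 1), np.inf)   # cumulative cost matrix
    gamma[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (Q[i - 1] - C[j - 1]) ** 2    # point-wise distance d(q_i, c_j)
            # gamma(i,j) = d(q_i,c_j) + min{gamma(i-1,j-1), gamma(i-1,j), gamma(i,j-1)}
            gamma[i, j] = d + min(gamma[i - 1, j - 1],
                                  gamma[i - 1, j],
                                  gamma[i, j - 1])
    return gamma[n, m]
```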

  8. Important note: the time series can be of different lengths. [Figure: two time series Q and C of different lengths, and their warping path w.]

  9. Global Constraints • Slightly speed up the calculations • Prevent pathological warpings [Figure: the Sakoe-Chiba Band restricting the warping path between Q and C.]
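A sketch of how the Sakoe-Chiba Band changes the calculation (building on the dtw() sketch above; r is the band half-width in cells, an assumption here, whereas the slides quote it as a percentage of the series length): only cells with |i - j| <= r are ever filled, which is cleanest when the two series have equal length.

```python
import numpy as np

def dtw_band(Q, C, r):
    """DTW restricted to a Sakoe-Chiba Band of half-width r cells.
    Cells outside the band are never computed."""
    n, m = len(Q), len(C)
    gamma = np.full((n + 1, m + 1), np.inf)
    gamma[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - r), min(m, i + r) + 1):   # stay inside the band
            d = (Q[i - 1] - C[j - 1]) ** 2
            gamma[i, j] = d + min(gamma[i - 1, j - 1],
                                  gamma[i - 1, j],
                                  gamma[i, j - 1])
    return gamma[n, m]
```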

  10. In general, it's hard to speed up a single DTW calculation. However, if we have to make many DTW calculations (which is almost always the case), we can potentially speed up the whole process by lower bounding. Keep in mind that the lower bounding trick works for any situation where you have an expensive calculation that can be lower bounded (string edit distance, graph edit distance, etc.). I will explain how lower bounding works in a generic fashion in the next two slides, then show concretely how lower bounding makes dealing with massive time series under DTW possible…

  11. Lower Bounding I • Assume that we have two functions: • DTW(A,B): the true DTW function, which is very slow. • lower_bound_distance(A,B): the lower bound function, which is very fast. • By definition, for all A, B, we have lower_bound_distance(A,B) ≤ DTW(A,B).

  12. Lower Bounding II • We can speed up similarity search under DTW by using a lower bounding function. • Try to use a cheap lower bounding calculation as often as possible. • Only do the expensive, full calculations when it is absolutely necessary.

Algorithm Lower_Bounding_Sequential_Scan(Q)
    best_so_far = infinity;
    for all sequences Ci in database
        LB_dist = lower_bound_distance(Ci, Q);
        if LB_dist < best_so_far
            true_dist = DTW(Ci, Q);
            if true_dist < best_so_far
                best_so_far = true_dist;
                index_of_best_match = i;
            endif
        endif
    endfor
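The pseudocode above, rendered as a Python sketch. The two distance functions are passed in as parameters to keep it generic; the dtw_band() and lb_keogh() sketches added after slides 9 and 13 would fit, but any valid (lower bound, true distance) pair works.

```python
import math

def lower_bounding_sequential_scan(Q, database, lower_bound_distance, DTW):
    """1-NN search over a list of candidate series: pay for the full DTW
    only when the cheap lower bound fails to prune the candidate."""
    best_so_far = math.inf
    index_of_best_match = None
    for i, C in enumerate(database):
        LB_dist = lower_bound_distance(C, Q)       # cheap
        if LB_dist < best_so_far:                  # cannot prune: do the real work
            true_dist = DTW(C, Q)                  # expensive
            if true_dist < best_so_far:
                best_so_far = true_dist
                index_of_best_match = i
    return index_of_best_match, best_so_far
```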

  13. Lower Bound of Keogh (LB_Keogh) • Ui = max(qi-r : qi+r) • Li = min(qi-r : qi+r) [Figure: the upper and lower envelopes U and L built around Q within the Sakoe-Chiba Band, and the candidate C compared against that envelope.]
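A sketch of LB_Keogh under the envelope definitions above, assuming equal-length series and a Sakoe-Chiba half-width r in cells. It returns a sum of squared distances, so it is directly comparable to the dtw_band() sketch above (neither takes a square root), and lb_keogh(Q, C, r) ≤ dtw_band(Q, C, r), which is exactly the property the sequential scan relies on.

```python
def lb_keogh(Q, C, r):
    """Lower Bound of Keogh: build the envelope U, L around Q and sum the
    squared amounts by which C escapes the envelope."""
    n = len(Q)
    total = 0.0
    for i in range(n):
        window = Q[max(0, i - r): i + r + 1]
        U_i = max(window)        # U_i = max(q_{i-r} : q_{i+r})
        L_i = min(window)        # L_i = min(q_{i-r} : q_{i+r})
        if C[i] > U_i:
            total += (C[i] - U_i) ** 2
        elif C[i] < L_i:
            total += (C[i] - L_i) ** 2
    return total
```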

  14. Important Note • The LB_Keogh lower bound only works for time series of the same length, and with constraints. • However, we can always normalize the length of one of the time series. [Figure: time series Q and C, reinterpolated to the same length.]
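One simple way to do the length normalization mentioned above (a sketch, not the authors' code) is linear reinterpolation, for example with NumPy:

```python
import numpy as np

def reinterpolate(C, new_length):
    """Linearly re-sample series C onto new_length equally spaced points."""
    old_x = np.linspace(0.0, 1.0, num=len(C))
    new_x = np.linspace(0.0, 1.0, num=new_length)
    return np.interp(new_x, old_x, C)

# e.g. make C the same length as Q before applying LB_Keogh:
# C = reinterpolate(C, len(Q))
```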

  15. Popular Belief 1 The ability of DTW to handle sequences of different lengths is a great advantage, and therefore the simple lower bound that requires different-length sequences to be reinterpolated to equal lengths is of limited utility. Examples: • "Time warping enables sequences with similar patterns to be found even when they are of different lengths" • "(DTW is) a more robust distance measure than Euclidean distance in many situations, where sequences may have different lengths" • "(DTW) can be used to measure similarity between sequences of different lengths"

  16. Popular Belief 2 Constraining the warping paths is a necessary evil that we inherited from the speech processing community to make DTW tractable, and that we should find ways to speed up DTW with no (or larger) constraints. Examples: • "LB_Keogh cannot be applied when the warping path is not constrained" • "search techniques for wide constraints are required"

  17. Popular Belief 3 There is a need for (and room for) improvements in the speed of DTW for data mining applications. Examples • “DTW incurs a heavy CPU cost” • “DTW is limited to only small time series datasets” • “(DTW) quadratic cost makes its application on databases of long time series very expensive” • “(DTW suffers from ) serious performance degradation in large databases”

  18. Popular Belief 1 The ability of DTW to handle sequences of different lengths is a great advantage, and therefore the simple lower bound that requires different-length sequences to be reinterpolated to equal lengths is of limited utility. Is this true? These claims are surprising in that they are not supported by any empirical results in the papers in question. Furthermore, an extensive literature search through more than 500 papers dating back to the 1960s failed to produce any theoretical or empirical results to suggest that simply making the sequences have the same length has any detrimental effect. Let us test this.

  19. A Simple Experiment I • For all datasets which naturally have different lengths, let us compare the 1-nearest-neighbor classification rate, for all possible warping constraints: • After simply re-normalizing the lengths. • Using DTW's "wonderful" ability to support queries of different lengths. • The latter case has at least five "flavors"; to be fair, we try all and report only the best.

  20. A Simple Experiment II [Figures: 1-nearest-neighbor accuracy (%) versus warping window size (%) for the Trace, Face, and Leaf datasets, comparing the variable-length and equal-length versions of each dataset.] A two-tailed t-test with 0.05 significance level between each variable-length and equal-length pair indicates that there is no statistically significant difference between the accuracy of the two sets of experiments.
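For reference, a significance test of this kind can be run along these lines (a sketch with made-up accuracy values, using SciPy; the real experiment pairs the variable-length and equal-length accuracies across the tested warping-window settings):

```python
from scipy.stats import ttest_rel

# hypothetical paired accuracies (%), one entry per warping-window setting
acc_variable_length = [95.0, 95.5, 96.0, 96.2, 96.1]
acc_equal_length    = [95.1, 95.4, 96.1, 96.2, 96.0]

t_stat, p_value = ttest_rel(acc_variable_length, acc_equal_length)  # paired t-test
print("two-tailed p-value:", p_value)
print("significant at the 0.05 level?", p_value < 0.05)
```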

  21. Popular Belief 1 is a Myth! The ability of DTW to handle sequences of different lengths is NOT a great advantage. So while Wong and Wong claim in IDEAS-03 that "DTW is useful to measure similarity between sequences of different lengths", we must recall that two Wongs don't make a right.

  22. Popular Belief 2 Constraining the warping paths is a necessary evil that we inherited from the speech processing community to make DTW tractable, and that we should find ways to speed up DTW with no (or larger) constraints. Is this true? The vast majority of data mining researchers have used a Sakoe-Chiba Band with a 10% width for the global constraint, but the last year has seen many papers that advocate wider constraints, or none at all.

  23. A Simple Experiment For all classification datasets, let us compare the 1-nearest-neighbor classification rate, for all possible warping constraints. If Popular Belief 2 is correct, the accuracy should grow for wider constraints. In particular, the accuracy should get better for values greater than 10%.

  24. Accuracy vs. Width of Warping Window [Figure: 1-nearest-neighbor accuracy (%) versus warping width W (%) for seven datasets.] Warping width that achieves maximum accuracy: FACE 2%, GUNX 3%, LEAF 8%, Control Chart 4%, TRACE 3%, 2-Patterns 3%, WordSpotting 3%.

  25. Popular Belief 2 is a myth! Constraining the warping paths WILL give higher accuracy for classification/clustering/query by content. This result can be summarized by the Keogh-Ratanamahatana Maxim: “a little warping is a good thing, but too much warping is a bad thing”.

  26. Popular Belief 3 There is a need for (and room for) improvements in the speed of DTW for data mining applications. Is this true? Do papers published since the introduction of LB_Keogh really speed up DTW data mining?

  27. A Simple Experiment Let's do some experiments! We will measure the average fraction of the n² matrix that we must calculate, for a one-nearest-neighbor search. We will do this for every possible value of W, the warping window width. By testing this way, we are deliberately ignoring implementation details, like index structure, buffer size, etc.
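A sketch of how such a measurement could be instrumented (assuming the dtw_band() and lb_keogh() sketches above and equal-length series; this illustrates the bookkeeping only, not the authors' experimental code): a pruned candidate costs zero matrix cells, an unpruned one costs every cell inside the band.

```python
import math

def fraction_of_matrix_computed(Q, database, r, use_lower_bound=True):
    """Average fraction of the n*n warping matrix filled during a 1-NN search."""
    n = len(Q)
    # number of cells inside a Sakoe-Chiba Band of half-width r
    cells_in_band = sum(min(n, i + r) - max(1, i - r) + 1 for i in range(1, n + 1))
    best_so_far, cells_computed = math.inf, 0
    for C in database:
        if use_lower_bound and lb_keogh(Q, C, r) >= best_so_far:
            continue                          # pruned: no matrix cells computed
        cells_computed += cells_in_band       # full banded DTW was needed
        best_so_far = min(best_so_far, dtw_band(Q, C, r))
    return cells_computed / (len(database) * n * n)
```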

  28. Nuclear Trace Dataset [Figure: fraction of warping matrix accessed versus warping window size (%), with no lower bound and with LB_Keogh; a zoom-in marks the maximum-accuracy warping window.] This plot tells us that although DTW is O(n²), after we set the warping window for maximum accuracy for this problem, we only have to do 6% of the work, and if we use the LB_Keogh lower bound, we only have to do 0.3% of the work!

  29. Gun Dataset [Figure: fraction of warping matrix accessed versus warping window size (%), with no lower bound and with LB_Keogh; a zoom-in marks the maximum-accuracy warping window.] This plot tells us that although DTW is O(n²), after we set the warping window for maximum accuracy for this problem, we only have to do 6% of the work, and if we use the LB_Keogh lower bound, we only have to do 0.21% of the work!

  30. The results in the previous slides are pessimistic! As the size of the dataset gets larger, the lower bounds become more important and can prune a larger fraction of the data. From a similarity search/classification point of view, DTW is linear! [Figure: fraction of warping matrix accessed versus warping window size (%) on the Gun dataset, for database sizes of 2, 6, 12, 24, 50, 100, and 200 instances, with a zoom-in at the maximum-accuracy warping window.]

  31. Let us consider larger datasets… [Figure: amortized percentage of the calculations required versus database size (10 to 40,960 objects), with no lower bound and with LB_Keogh.] On a (still small, by data mining standards) dataset of 40,960 objects, just ten lines of code (LB_Keogh) eliminates 99.369% of the CPU effort!

  32. Popular Belief 3 is a Myth There is NO need for (and NO room for) improvements in the speed of DTW for data mining applications. We are very close to the asymptotic limit of speedup for DTW. The time taken for searching a terabyte of data is about the same for Euclidean Distance or DTW.

  33. Conclusions We have shown that there is much misunderstanding about dynamic time warping, an important data mining tool. These misunderstandings have led to much wasted research effort, which is a pity, because there are several important DTW problems to be solved (see paper). Are there other major misunderstandings about other data mining problems?

  34. Questions? All datasets and code used in this tutorial can be found at www.cs.ucr.edu/~eamonn/TSDMA/index.html
