
Towards a Minimum Description Length Based Stopping Criterion for Semi-Supervised Time Series Classification

Nurjahan Begum, Bing Hu, Thanawin Rakthanmanon, and Eamonn Keogh

Presentation Transcript


  1. Towards a Minimum Description Length Based Stopping Criterion for Semi-Supervised Time Series Classification. Nurjahan Begum, Bing Hu, Thanawin Rakthanmanon, and Eamonn Keogh

  2. Outline • Introduction • Motivation of Stopping Criterion for Semi-Supervised Classification • Proposed Stopping Criterion • Minimum Description Length (MDL) technique • Our Approach • Experimental Results • Conclusion

  3. Introduction We have developed a Minimum Description Length based Stopping Criterion for Semi-supervised Time Series Classification • Why Semi-Supervised Learning? • Why do we need a Stopping Criterion?

  4. Why Semi-Supervised Learning? • Labeled data • Scarce and extremely expensive* • Requires human intervention • Unlabeled data • Abundant • The PhysioBank archive* has more than 700 GB of digitized signals and time series freely available • Semi-supervised classification • Needs less labeled data • Needs less human effort and usually obtains higher accuracy* • * F. Florea et al., Medical image categorization with MedIC and MedGIFT (2006) * A. L. Goldberger et al., PhysioBank, PhysioToolkit, and PhysioNet: Components of a New Research Resource for Complex Physiologic Signals (2000) * L. Wei et al., Semi-Supervised Time Series Classification (2006)

  5. Why do we need a Stopping Criterion? Cardiac Tamponade Patient Normal Patient Use Semi-Supervised Classification

  6. Why do we need a Stopping Criterion? Cardiac Tamponade Patient Normal Patient Use Semi-Supervised Classification

  7. Why do we need a Stopping Criterion? Cardiac Tamponade Patient Normal Patient Use Semi-Supervised Classification

  8. Why do we need a Stopping Criterion? Cardiac Tamponade Patient Normal Patient Oops… We are adding false positives!

  9. Our Contribution • A novel, parameter-free stopping criterion using Minimum Description Length (MDL) for semi-supervised time series classification • Allows easy adaptation by experts in the medical community

  10. Minimum Description Length (MDL) • MDL is a formalization of Occam's Razor • The best hypothesis for a given set of data is the one that leads to the best compression of the data.

  11. Minimum Description Length (MDL) • MDL is a formalization of Occam's Razor • The best hypothesis for a given set of data is the one that leads to the best compression of the data. • Why MDL? • Intrinsically parameter free • Leverages the true underlying structure of data • Avoids needing to explain all of the data • Has recently shown great potential for real-valued time series data

  12. Our Approach [Figure: the given positive instance and the original time series]

  13. Our Approach • Discretize the time series • Repeat: find the nearest neighbor of the positive instance set, then calculate the BitCount • Until the BitCount increases [Figure: the given positive instance and the original time series]

  14. Discrete Normalization (Why?) • MDL is defined in discrete space • Time series are real-valued • Need to normalize real-valued data into a space of reduced cardinality • Won't such a drastic reduction lose meaningful information?

  15. Will Discrete Normalization lose meaningful information? • The answer is NO! • Justification? • A time series clustering experiment* [1][2] [Figure: real-valued time series vs. discretized time series (cardinality = 16)] [1] B. Hu et al., Discovering the Intrinsic Cardinality and Dimensionality of Time Series using MDL (2011) [2] T. Rakthanmanon et al., Time Series Epenthesis: Clustering Time Series Streams Requires Ignoring Some Data (2011) * Incartdb dataset (Record I70, Signal II) [www.physionet.org]
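The discretization step can be illustrated with a minimal sketch. The slides do not give the exact normalization formula, so the min-max scaling below, the function name `discretize`, and the handling of constant series are assumptions made for illustration only.

```python
import numpy as np

def discretize(ts, cardinality=16):
    """Map a real-valued series to integers in [0, cardinality - 1].

    Assumption: plain min-max scaling followed by rounding; the slides
    only state that the data is reduced to a cardinality of 16.
    """
    ts = np.asarray(ts, dtype=float)
    lo, hi = ts.min(), ts.max()
    if hi == lo:
        # A constant series carries no shape information; map it to zeros.
        return np.zeros(ts.shape, dtype=int)
    scaled = (ts - lo) / (hi - lo)  # now in [0, 1]
    return np.round(scaled * (cardinality - 1)).astype(int)
```

With `cardinality=16`, each discretized value can be stored in log2(16) = 4 bits, which is the per-value cost used in the bit counts on the later slides.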

  16. Our Approach • Discretize the time series • Repeat: find the nearest neighbor of the positive instance set, then calculate the BitCount • Until the BitCount increases [Figure: the given positive instance and the original time series]
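The loop on this slide can be sketched as follows. This is a hedged illustration, not the authors' implementation: it assumes the hypothesis H is the given positive instance, that instances are already discretized and of equal length, that Euclidean distance defines the nearest neighbor, and that an encoded instance costs one (position, value) pair per mismatch against H. The names `bit_count` and `grow_positive_set` are hypothetical.

```python
import math
import numpy as np

def bit_count(hypothesis, encoded, n_unencoded, cardinality=16):
    """Bits for H, plus mismatches of encoded instances, plus raw instances."""
    m = len(hypothesis)
    bits_per_value = math.log2(cardinality)
    bits_per_mismatch = math.ceil(math.log2(m)) + bits_per_value
    mismatches = sum(int(np.sum(inst != hypothesis)) for inst in encoded)
    return (m * bits_per_value
            + mismatches * bits_per_mismatch
            + n_unencoded * m * bits_per_value)

def grow_positive_set(instances, seed_index, cardinality=16):
    """Add nearest neighbors to the positive set until BitCount increases."""
    instances = np.asarray(instances, dtype=np.int64)
    hypothesis = instances[seed_index]  # assumption: H is the seed instance
    positive = [seed_index]
    remaining = [i for i in range(len(instances)) if i != seed_index]
    best = bit_count(hypothesis, [], len(remaining), cardinality)
    while remaining:
        # Nearest neighbor of the positive set: closest to any current member.
        def dist_to_set(i):
            return min(np.linalg.norm(instances[i] - instances[p]) for p in positive)
        nn = min(remaining, key=dist_to_set)
        encoded = [instances[i] for i in positive[1:]] + [instances[nn]]
        candidate = bit_count(hypothesis, encoded, len(remaining) - 1, cardinality)
        if candidate > best:
            break  # BitCount increased: stop without adding this instance
        best = candidate
        positive.append(nn)
        remaining.remove(nn)
    return positive, best
```

For example, with three length-8 instances where one differs from the seed in a single value and another differs everywhere, `grow_positive_set` adds the first and stops before the second, because encoding eight mismatches costs more than storing that instance raw.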

  17. Our Approach [Figure: hypothesis H, iteration 0]

  18. Our Approach [Figure: hypothesis H, iteration 0]

  19. Our Approach • Iteration 0 • Bit Count $= 100 \log_2 16 + 6 \cdot 100 \log_2 16 = 2800$ [Figure: hypothesis H; plot of BitCount vs. number of instances encoded]

  20. Our Approach • Iteration 1 • Bit Count $= 100 \log_2 16 + 6\,(\lceil \log_2 100 \rceil + \log_2 16) + 5 \cdot 100 \log_2 16 = 2466$ [Figure: hypothesis H; plot of BitCount vs. number of instances encoded, iterations 0 and 1]

  21. Our Approach • Iteration 2 • Bit Count $= 100 \log_2 16 + 22\,(\lceil \log_2 100 \rceil + \log_2 16) + 4 \cdot 100 \log_2 16 = 2242$ [Figure: hypothesis H; plot of BitCount vs. number of instances encoded, iterations 0 to 2]

  22. Our Approach • Iteration 3 • Bit Count $= 100 \log_2 16 + 37\,(\lceil \log_2 100 \rceil + \log_2 16) + 3 \cdot 100 \log_2 16 = 2007$ [Figure: hypothesis H; plot of BitCount vs. number of instances encoded]

  23. Our Approach • Iteration 4 • Bit Count $= 100 \log_2 16 + 115\,(\lceil \log_2 100 \rceil + \log_2 16) + 2 \cdot 100 \log_2 16 = 2465$ [Figure: hypothesis H; plot of BitCount vs. number of instances encoded, iterations 3 and 4]

  24. Our Approach • Iteration 5 • Bit Count $= 100 \log_2 16 + 192\,(\lceil \log_2 100 \rceil + \log_2 16) + 1 \cdot 100 \log_2 16 = 2912$ [Figure: hypothesis H; plot of BitCount vs. number of instances encoded, iterations 3 to 5, with the stopping point marked]
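The arithmetic on slides 19 to 24 can be checked with a short script. It assumes one reading of the formulas: the coefficients 6, 22, 37, 115, and 192 are the cumulative numbers of mismatches encoded so far, each costing $\lceil \log_2 100 \rceil + \log_2 16 = 11$ bits, while each instance not yet encoded costs $100 \log_2 16 = 400$ bits. The function name `slide_bit_count` is hypothetical.

```python
import math

def slide_bit_count(mismatches, n_unencoded, length=100, cardinality=16):
    """Reproduce the Bit Count formula shown on the slides."""
    bits_per_value = math.log2(cardinality)                            # 4 bits
    bits_per_mismatch = math.ceil(math.log2(length)) + bits_per_value  # 7 + 4 = 11 bits
    total = (length * bits_per_value                  # the hypothesis H
             + mismatches * bits_per_mismatch         # encoded instances
             + n_unencoded * length * bits_per_value) # instances stored raw
    return int(total)

# (cumulative mismatches, instances still unencoded) for iterations 0..5
steps = [(0, 6), (6, 5), (22, 4), (37, 3), (115, 2), (192, 1)]
counts = [slide_bit_count(k, u) for k, u in steps]
print(counts)  # [2800, 2466, 2242, 2007, 2465, 2912]

# First iteration at which the BitCount goes up relative to the previous one.
first_increase = next(i for i in range(1, len(counts)) if counts[i] > counts[i - 1])
print(first_increase)  # 4
```

Under this reading the BitCount falls through iteration 3 (2007 bits) and first rises at iteration 4, so the minimum is reached after three instances have been encoded.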

  25. Experimental Results * We worked with ~1 hour long data

  26. Interpreting the plots • Ideal • Bad, adds false positives • Bad, misses true positives • Really bad [Figure: four panels of BitCount vs. number of instances encoded, each showing where the stopping point falls relative to the positive and negative instances]

  27. Experimental Results [Figure: BitCount (×10^5) vs. number of instances encoded for the svdb, incartdb, and sddb datasets, each showing the positive and negative instances and the stopping point]

  28. Experimental Results (Contd.) [Figure: BitCount (×10^5) vs. number of instances encoded for Fish_test, Swedish_leaf, and FaceAll_test, each with the stopping point marked]

  29. Comparison with the state-of-the-art algorithm • Fish_test [Figure, top: minimal distance vs. number of instances classified for Li et al.'s method*, showing too-early stopping] [Figure, bottom: BitCount (×10^5) vs. number of instances encoded, with the stopping point marked] * L. Wei et al., Semi-Supervised Time Series Classification (2006)

  30. Conclusions • Novel way of semi-supervised classification with only one labeled instance. • Previous approaches to stopping semi-supervised classification • required extensive parameter tuning, • remained something of a black art. • Stopping criterion for semi-supervised classification based on MDL. • To our knowledge, our stopping criterion is the first parameter-free criterion that • mitigates the early stopping problem, • leverages the inherent structure of the data.

  31. Thank you! If you have any questions, please contact me: Name: Nurjahan Begum Email: nbegu001@ucr.edu
