200 likes | 362 Vues
Positive Unlabeled Learning for Time Series Classification. Nguyen Minh Nhut Xiao-Li Li See-Kiong Ng Institute for Infocomm Research, Singapore. Outline. Introduction and Related Work The Proposed Technique Learning from Common Local Clusters (LCLC ) Evaluation Experiments Conclusions
E N D
Positive Unlabeled Learning for Time Series Classification Nguyen Minh Nhut Xiao-Li Li See-Kiong Ng Institute for Infocomm Research, Singapore
Outline • Introduction and Related Work • The Proposed Technique Learning from Common Local Clusters (LCLC ) • Evaluation Experiments • Conclusions • Discussion and question
1. Introduction • Traditional Supervised Learning • Given a set of labeled training examples of n classes, the system uses this set to build a classifier. • The classifier is then used to classify new examples into the n classes. • Typically require a large number of labeled examples: • Human labels can be expensive, time consuming and sometimes even impossible.
Unlabeled Data • Unlabeled data are usually plentiful. • Can we label only a small number of examples and make use of a large number of unlabeled examples to learn? • Unlabeled data contain information which can be used to improve the classification accuracy.
Positive-Unlabeled (PU) Learning • Positive examples: We have a set of examples of interesting class P, and • Unlabeled set: also has a set U of unlabeled (or mixed) examples with instances from P and also not from P (negative examples). • Build a classifier: usingPandUto classify the data inUas well as future test data.
PU Learning for Time Series Classification • PU learning is applicable in a wide range of application domains such as: text classification, bio-medical informatics, pattern recognition, and recommendation system… • However, the application of PU learning for time series data has been relatively less explored due to: • High feature correlation. • Lack of joint probability distribution over words and collocations (in texts).
Existing PU Learning for Time Series Classification • Wei, L. and E. Keogh (2006). "Semi-supervised time series classification." ACM SIGKDD. • The idea is: Let the classifier teach itself by its own predication. • Unfortunately, without a good stopping criterion, the method often stops too early, resulting in high precision but low recall.
The Proposed Technique: LCLC ( Learning from Common Local Clusters) • The proposed LCLC algorithm addresses two specific issues in PU learning for time series classification: • Select independent and relevant features from the time series data using cluster-based approach. • Accurately extract reliable positive and negatives from the given unlabeled data.
Local Clustering and Feature Selection Yoon, H., K. Yang, et al. (2005). "Feature subset selection and feature ranking for multivariate time series." IEEE Transactions on Knowledge and Data Engineering
Local Clustering and Feature Selection The cluster-based approach of the proposed LCLC method offers the following advantages: • The cluster-based approach is much more robust than instance-based methods for extracting the likely positives and negatives from U. • The similarity between two time series data can be effectively measured using a well-selected subset of the common principal features that can capture the underlying characteristics of both positive and unlabeled clusters.
Extracting Reliable Negative set Algorithm 2. Extracting Reliable Negative Examples Input: positive data P, K unlabeled local clusters ULCi • RN = ,AMBI= ; • Fori=1 to K • Compute the distance between local cluster ULCi to P; • Sort d(ULCi , P) (i=1, 2, …, K) in a decreasing order; • dMedian= the median distance of d(ULCi , P) (i=1, 2,…, K); • Fori=1 to K • If (d(ULCi , P)> dMedian) • RN= RNULCi; • Else • AMBI = AMBIULCi;
Boundary decision using Cluster chaining approach Algorithm 3. Identifying likely positive clusters LPand likely negative clusters LN Input: ambiguous clusters AMBI, positive cluster P, reliable negative clusters set in RN
Boundary decision using Cluster chaining approach • Minimize the effect of possible noisy examples. • Offer a robust solution for the cases of severely unbalanced positive and negative examples in the unlabeled dataset U .
EMPIRICAL EVALUATION Datasets Wei, L. (2007). "Self Training dataset." http://alumni.cs.ucr.edu/~wli/selfTraining/. Keogh, E. (2008). "The UCR Time Series Classification/ Clustering Homepage" http://www.cs.ucr.edu/~eamonn/time_series_data/.
Experiment setting • We randomly select just oneseed instance from the positive class for the learning phase, the rest of training data are treated as unlabeled data. • We build a 1-NN classifier using P together with LP as a positive training set, and RN together with LN as a negative training set. • We repeat our experiments 30 times and report the average values of the 30 results.
Overall performance Wei, L. and E. Keogh (2006) Ratanamahatana, C. and D. Wanichsan (2008).
Sensitivity of the size of local clusters We have set the number of clusters K= Size(U)/ULC_size, where ULC_sizeis size of the unlabeled clusters.
Conclusions There are three key approaches that underlie LCLC’s improved classification performance over existing methods. • First, LCLC adopts a cluster-based method that is much more robust than instance-based PU learning methods. • Secondly, we have adopted a feature selection strategy that can take the characteristics of both positive and unlabeled clusters. • Finally, we have devised a novel cluster chaining approach to extract the boundary positive and negative clusters.