
Feature Selection on Time-Series Cab Data


Presentation Transcript


  1. Feature Selection on Time-Series Cab Data Yingkit (Keith) Chow

  2. Contents • Introduction • Features Considered • FCBF (Filter-type feature selection) • FCBF-PCA (my variation) • Conclusion

  3. All Features Considered • Each time sample consists of the following features • Day of Week, Time of Day (first two features) • taxis[t, 6:9], taxis[t-1, 6:9], …, taxis[t-5, 6:9] • [6:9] indexes four columns of the matrix taxis: cabs entering with meter off, cabs entering with meter on, cabs exiting with meter off, cabs exiting with meter on • Not all of these features will be relevant to classifying whether a game is present.
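The lagged feature construction above can be sketched as follows. This is a minimal illustrative sketch, not the author's code; the slide writes the column range MATLAB-style as 6:9 (inclusive, four columns), which I assume corresponds to the slice 6:10 in zero-based Python:

```python
import numpy as np

def build_features(taxis, day_of_week, time_of_day, lags=6):
    """Assemble one feature vector per time sample.

    taxis       : (T, >=10) array; columns 6:10 are assumed to hold the four
                  cab counts (enter meter-off, enter meter-on,
                  exit meter-off, exit meter-on)
    day_of_week : (T,) array, first feature
    time_of_day : (T,) array, second feature
    """
    T = taxis.shape[0]
    rows = []
    for t in range(lags - 1, T):
        # current sample plus 5 lagged samples: taxis[t], taxis[t-1], ..., taxis[t-5]
        lagged = [taxis[t - k, 6:10] for k in range(lags)]
        rows.append(np.concatenate([[day_of_week[t], time_of_day[t]], *lagged]))
    return np.asarray(rows)
```

Each resulting row has 2 + 6×4 = 26 candidate features, which is why a filter step is needed to discard the irrelevant ones.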

  4. Fast Correlation-Based Filter Algorithm: • Find features that are relevant (SU(i, C) > threshold), • where SU is symmetric uncertainty, described on the next slide • Remove redundant features by comparing the remaining features pairwise • Remove feature j if a more relevant feature i satisfies SU(i, j) >= SU(j, C)
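Assuming the SU values have already been computed, the two-step selection above can be sketched as follows (an illustrative sketch, not the author's MATLAB code; `su_fc` and `su_ff` are hypothetical precomputed tables):

```python
def fcbf_select(su_fc, su_ff, threshold=0.01):
    """su_fc[i]   : SU between feature i and the class C
       su_ff[i][j]: SU between features i and j
    """
    # Step 1: keep only relevant features, ordered by SU with the class
    relevant = [i for i, s in enumerate(su_fc) if s > threshold]
    relevant.sort(key=lambda i: su_fc[i], reverse=True)

    # Step 2: drop feature j if an already-kept, more relevant feature i
    # is at least as predictive of j as j is of the class
    selected = []
    for j in relevant:
        if all(su_ff[i][j] < su_fc[j] for i in selected):
            selected.append(j)
    return selected
```

Processing features in decreasing order of SU(i, C) ensures that a redundant feature is always removed in favor of a more relevant one.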

  5. Equations[1] • Information Gain (IG) • IG(X|Y) = H(X) – H(X|Y) • Symmetric Uncertainty (SU) • SU(X,Y) = 2 * IG(X|Y) / [H(X)+H(Y)] • SU is used instead of IG because it compensates for IG's bias toward features with more values and normalizes the score to [0, 1][1]
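These equations can be estimated from a 2-D histogram of the two (discretized) variables, which matches the binning step the slides use. A minimal sketch, assuming histogram-based entropy estimates in bits:

```python
import numpy as np

def entropy(counts):
    # Shannon entropy (bits) of a histogram of counts
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())

def symmetric_uncertainty(x, y, bins=96):
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    hx = entropy(joint.sum(axis=1))   # H(X)
    hy = entropy(joint.sum(axis=0))   # H(Y)
    hxy = entropy(joint.ravel())      # H(X,Y)
    # IG(X|Y) = H(X) - H(X|Y) = H(X) + H(Y) - H(X,Y)
    ig = hx + hy - hxy
    # SU(X,Y) = 2 * IG(X|Y) / [H(X) + H(Y)], in [0, 1]
    return 2.0 * ig / (hx + hy) if hx + hy > 0 else 0.0
```

SU reaches 1 when the two variables carry identical information and 0 when they are independent.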

  6. FCBF • Classifier (MATLAB Classify- Linear) • Number Bins = 96 • Threshold = 0.01 • Accuracy = 91.9%

  7. Choice of Number Bins • Num Bins = 96 results shown in the previous slide (red is the ground truth of a game; blue is my classification) • Num Bins = 20 • Accuracy = 58.6% • Here the algorithm breaks down and chooses only feature 2, the “time of day”. The blue curve is periodic: the same time segment of each day, every day, is classified as a game.
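The sensitivity to the bin count is easy to see with equal-width discretization: 96 bins keep every value of a 96-level time-of-day variable (15-minute slots) distinct, while 20 bins merge slots together. A sketch, assuming equal-width binning (the slides do not state which discretizer was used):

```python
import numpy as np

def discretize(x, num_bins):
    # equal-width binning into integer labels 0 .. num_bins-1
    edges = np.linspace(x.min(), x.max(), num_bins + 1)
    return np.digitize(x, edges[1:-1])  # interior edges only
```

With too few bins, a high-cardinality feature like time of day dominates the SU scores after discretization, which is one plausible reason the filter collapses onto feature 2.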

  8. FCBF - PCA • FCBF compares individual features with each other • We can use PCA to try to capture a group of features (for example, one eigenvector might capture the rise in cabs entering with meters on before a game, or the rise in cabs entering with meters off toward the end of a game) • Example shown in the next slide
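The idea of projecting groups of features onto eigenvectors before filtering can be sketched with a plain SVD-based PCA. This is an illustrative sketch, not the author's implementation; the projected scores would then be fed to FCBF in place of the raw columns:

```python
import numpy as np

def pca_project(X, n_components):
    # center the data, take the top right-singular vectors (principal axes),
    # and return the projections (scores) used as new features for FCBF
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]          # (n_components, n_features)
    return Xc @ components.T, components    # scores, eigenvectors
```

Because each score mixes all original columns, a single projected feature can summarize a whole traffic pattern (e.g. the joint before-game rise across several cab counts) that no individual column captures alone.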

  9. Cab Traffic Behavior • Before Start of Game • Cab On Enter and Cab Off Exit are high • Towards End of Game • Cab Off Enter and Cab On Exit are high

  10. FCBF-PCA • Classifier (MATLAB Classify- Linear) • Number Bins = 20 • Threshold = 0.01 • Accuracy = 92.9% • Note: the features here are projections onto the eigenvectors and not the original feature dimension

  11. Conclusions • The choice of the number of bins has an enormous impact on performance (possibly because the time-of-day variable takes 96 discrete values) • FCBF-PCA was less susceptible to the choice of numBins (10, 20, and 100 bins all resulted in approximately 91% accuracy)

  12. Future Work • Currently the labels are simply game / not game. • Next, I’ll try one classifier to detect the first sample of a game and another to detect the last sample, since mid-game traffic generally has an entirely different characteristic from the beginning and end of a game. However, I might be limited by the number of samples.

  13. Questions • I’m not currently in NYC so please send questions or comments to: • yingkit.chow@gmail.com

  14. Citations • [1] Lei Yu and Huan Liu, “Feature Selection for High Dimensional Data: A Fast Correlation-Based Filter Solution,” ICML, 2003 • Lei Yu and Huan Liu, “Efficient Feature Selection via Analysis of Relevance and Redundancy,” Journal of Machine Learning Research 5, 2004
