
Feature Selection on Time-Series Cab Data


Presentation Transcript


  1. Feature Selection on Time-Series Cab Data Yingkit (Keith) Chow

  2. Contents • Introduction • Features Considered • FCBF (Filter-type feature selection) • FCBF-PCA (my variation) • Conclusion

  3. All Features Considered • Each time sample consists of the following features • Day of Week, Time of Day (first two features) • taxis[t, 6:9], taxis[t-1, 6:9], …, taxis[t-5, 6:9] • [6:9] indexes four columns of the matrix taxis: cabs entering with meter off, cabs entering with meter on, cabs exiting with meter off, cabs exiting with meter on • Not all of these features will be relevant to classifying whether a game is present.
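The lagged feature construction above can be sketched as follows. This is a minimal illustrative sketch, not the author's code; the slide writes the column range MATLAB-style as 6:9 (inclusive, four columns), which I assume corresponds to the slice 6:10 in zero-based Python:

```python
import numpy as np

def build_features(taxis, day_of_week, time_of_day, lags=6):
    """Assemble one feature vector per time sample.

    taxis       : (T, >=10) array; columns 6:10 are assumed to hold the four
                  cab counts (enter meter-off, enter meter-on,
                  exit meter-off, exit meter-on)
    day_of_week : (T,) array, first feature
    time_of_day : (T,) array, second feature
    """
    T = taxis.shape[0]
    rows = []
    for t in range(lags - 1, T):
        # current sample plus 5 lagged samples: taxis[t], taxis[t-1], ..., taxis[t-5]
        lagged = [taxis[t - k, 6:10] for k in range(lags)]
        rows.append(np.concatenate([[day_of_week[t], time_of_day[t]], *lagged]))
    return np.asarray(rows)
```

Each resulting row has 2 + 6×4 = 26 candidate features, which is why a filter step is needed to discard the irrelevant ones.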

  4. Fast Correlation-Based Filter Algorithm: • Find features that are relevant (SU(i, C) > threshold), • where SU is symmetric uncertainty, described on the next slide • Remove redundant features by comparing the remaining features pairwise • Remove feature j if a more relevant feature i satisfies SU(i, j) >= SU(j, C)
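Assuming the SU values have already been computed, the two-step selection above can be sketched as follows (an illustrative sketch, not the author's MATLAB code; `su_fc` and `su_ff` are hypothetical precomputed tables):

```python
def fcbf_select(su_fc, su_ff, threshold=0.01):
    """su_fc[i]   : SU between feature i and the class C
       su_ff[i][j]: SU between features i and j
    """
    # Step 1: keep only relevant features, ordered by SU with the class
    relevant = [i for i, s in enumerate(su_fc) if s > threshold]
    relevant.sort(key=lambda i: su_fc[i], reverse=True)

    # Step 2: drop feature j if an already-kept, more relevant feature i
    # is at least as predictive of j as j is of the class
    selected = []
    for j in relevant:
        if all(su_ff[i][j] < su_fc[j] for i in selected):
            selected.append(j)
    return selected
```

Processing features in decreasing order of SU(i, C) ensures that a redundant feature is always removed in favor of a more relevant one.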

  5. Equations[1] • Information Gain (IG) • IG(X|Y) = H(X) – H(X|Y) • Symmetric Uncertainty (SU) • SU(X,Y) = 2 * IG(X|Y) / [H(X)+H(Y)] • SU is used instead of IG because it compensates for IG's bias toward features with more values and normalizes the score to [0, 1][1]
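These equations can be estimated from a 2-D histogram of the two (discretized) variables, which matches the binning step the slides use. A minimal sketch, assuming histogram-based entropy estimates in bits:

```python
import numpy as np

def entropy(counts):
    # Shannon entropy (bits) of a histogram of counts
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())

def symmetric_uncertainty(x, y, bins=96):
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    hx = entropy(joint.sum(axis=1))   # H(X)
    hy = entropy(joint.sum(axis=0))   # H(Y)
    hxy = entropy(joint.ravel())      # H(X,Y)
    # IG(X|Y) = H(X) - H(X|Y) = H(X) + H(Y) - H(X,Y)
    ig = hx + hy - hxy
    # SU(X,Y) = 2 * IG(X|Y) / [H(X) + H(Y)], in [0, 1]
    return 2.0 * ig / (hx + hy) if hx + hy > 0 else 0.0
```

SU reaches 1 when the two variables carry identical information and 0 when they are independent.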

  6. FCBF • Classifier (MATLAB Classify- Linear) • Number Bins = 96 • Threshold = 0.01 • Accuracy = 91.9%

  7. Choice of Number Bins • Num Bins = 96 results shown in the previous slide (red is the ground truth of a game; blue is my classification) • Num Bins = 20 • Accuracy = 58.6% • Here the algorithm breaks down and chooses only feature 2, the “time of day”. The blue curve is periodic: the same time segment of each day, every day, is classified as a game.
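The sensitivity to the bin count is easy to see with equal-width discretization: 96 bins keep every value of a 96-level time-of-day variable (15-minute slots) distinct, while 20 bins merge slots together. A sketch, assuming equal-width binning (the slides do not state which discretizer was used):

```python
import numpy as np

def discretize(x, num_bins):
    # equal-width binning into integer labels 0 .. num_bins-1
    edges = np.linspace(x.min(), x.max(), num_bins + 1)
    return np.digitize(x, edges[1:-1])  # interior edges only
```

With too few bins, a high-cardinality feature like time of day dominates the SU scores after discretization, which is one plausible reason the filter collapses onto feature 2.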

  8. FCBF - PCA • FCBF compares individual features with each other • We can use PCA to try to capture a group of features (for example, one eigenvector might capture the rise in cabs entering with meters on before a game, or the rise in cabs entering with meters off toward the end of a game) • Example shown in the next slide
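The idea of projecting groups of features onto eigenvectors before filtering can be sketched with a plain SVD-based PCA. This is an illustrative sketch, not the author's implementation; the projected scores would then be fed to FCBF in place of the raw columns:

```python
import numpy as np

def pca_project(X, n_components):
    # center the data, take the top right-singular vectors (principal axes),
    # and return the projections (scores) used as new features for FCBF
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]          # (n_components, n_features)
    return Xc @ components.T, components    # scores, eigenvectors
```

Because each score mixes all original columns, a single projected feature can summarize a whole traffic pattern (e.g. the joint before-game rise across several cab counts) that no individual column captures alone.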

  9. Cab Traffic Behavior • Before Start of Game • Cab On Enter and Cab Off Exit are high • Towards End of Game • Cab Off Enter and Cab On Exit are high

  10. FCBF-PCA • Classifier (MATLAB Classify- Linear) • Number Bins = 20 • Threshold = 0.01 • Accuracy = 92.9% • Note: the features here are projections onto the eigenvectors and not the original feature dimension

  11. Conclusions • The choice of the number of bins has an enormous impact on performance (possibly because the time-of-day variable takes 96 discrete values) • FCBF-PCA was less susceptible to the choice of numBins (10, 20, and 100 bins all resulted in approximately 91% accuracy)

  12. Future Work • Currently the labels are simply game / not game. • Next, I’ll try one classifier to detect the first sample of a game and another to detect the last sample, since mid-game traffic generally has an entirely different characteristic from the beginning and end of a game. However, I might be limited by the number of samples.

  13. Questions • I’m not currently in NYC so please send questions or comments to: • yingkit.chow@gmail.com

  14. Citations • [1] Lei Yu and Huan Liu, “Feature Selection for High Dimensional Data: A Fast Correlation-Based Filter Solution,” ICML, 2003 • Lei Yu and Huan Liu, “Efficient Feature Selection via Analysis of Relevance and Redundancy,” Journal of Machine Learning Research 5, 2004
