
(Rare) Category Detection Using Hierarchical Mean Shift

Pavan Vatturi (vatturi@eecs.oregonstate.edu), Weng-Keen Wong (wong@eecs.oregonstate.edu)







  1. (Rare) Category Detection Using Hierarchical Mean Shift. Pavan Vatturi (vatturi@eecs.oregonstate.edu), Weng-Keen Wong (wong@eecs.oregonstate.edu)

  2. 1. Introduction • Applications for surveillance, scientific discovery, and data cleaning require anomaly detection • Anomalies are often identified as statistically unusual data points • Many detected anomalies are simply uninteresting or correspond to known sources of noise

  3. 1. Introduction • [Figure: example images split into known objects (99.9% of the data) and anomalies (0.1% of the data); the anomalies are further split into uninteresting (99% of anomalies) and interesting (1% of anomalies)] • Pictures from: Sloan Digital Sky Survey (http://www.sdss.org/iotw/archive.html); Pelleg, D. (2004). Scalable and Practical Probability Density Estimators for Scientific Anomaly Detection. PhD Thesis, Carnegie Mellon University.

  4. 1. Introduction • Category Detection [Pelleg and Moore 2004]: human-in-the-loop exploratory data analysis • [Diagram: Data Set -> Build Model -> Spot Interesting Data Points -> Ask User to Label Categories of Interesting Data Points -> Update Model with Labels -> back to Spot Interesting Data Points]

  5. 1. Introduction • The user can: • Label a query data point under an existing category • Or declare that the data point belongs to a previously undeclared category • [Diagram: the same query loop as on slide 4]

  6. 1. Introduction • Goal: present to the user a single instance from each category in as few queries as possible • Rare categories are difficult to detect if class imbalance is severe • Rare categories are the ones of interest for anomaly detection

  7. Outline • Introduction • Related Work • Background • Methodology • Results • Conclusion / Future Work

  8. 2. Related Work • Interleave [Pelleg and Moore 2004] • Nearest-Neighbor-based active learning for rare-category detection for multiple classes [He and Carbonell 2008] • Multiple output identification [Fine and Mansour 2006]

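  9. 3. Background: Mean Shift [Fukunaga and Hostetler 1975] • The mean shift vector points from the query point to the center of mass of the reference data set and follows the density gradient • [Figure labels: reference data set, query point, center of mass, mean shift vector with kernel k]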
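The formula for the mean shift vector did not survive in this transcript. For orientation only, a common textbook form is given below; the symbols (query point x, reference points x_1, ..., x_n, kernel weight K, bandwidth h) are conventional notation assumed here and are not taken from the slide.

```latex
m(x) \;=\; \frac{\sum_{i=1}^{n} K\!\left(\frac{x_i - x}{h}\right) x_i}
                {\sum_{i=1}^{n} K\!\left(\frac{x_i - x}{h}\right)} \;-\; x
```

The first term is the kernel-weighted center of mass of the reference points, so m(x) is the displacement from the query point to that center.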

  10. 3. Background: Mean Shift [Fukunaga and Hostetler 1975] • Repeatedly moving the query point to the center of mass leads to convergence to a cluster center • [Figure labels: reference data set, query point, center of mass]
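The iteration described on slides 9 and 10 can be sketched as follows. This is a minimal illustration, assuming a Gaussian kernel and a fixed bandwidth; the function name, the tolerance, and the toy data are hypothetical and not from the presentation.

```python
import numpy as np

def mean_shift_point(query, reference, bandwidth, tol=1e-6, max_iter=500):
    """Move one query point uphill in density until it stops moving."""
    x = np.asarray(query, dtype=float)
    ref = np.asarray(reference, dtype=float)
    for _ in range(max_iter):
        # Gaussian kernel weight of every reference point relative to x
        # (the Gaussian kernel is an assumption of this sketch).
        sq_dist = np.sum((ref - x) ** 2, axis=1)
        weights = np.exp(-0.5 * sq_dist / bandwidth ** 2)
        # Kernel-weighted center of mass; the mean shift vector is center - x.
        center = weights @ ref / weights.sum()
        if np.linalg.norm(center - x) < tol:
            return center
        x = center
    return x

# Two well-separated 1-D blobs; a query near the left blob converges to it.
reference = np.array([[0.0], [0.1], [-0.1], [5.0], [5.1], [4.9]])
mode = mean_shift_point([0.4], reference, bandwidth=0.5)
```

The reference data set stays fixed here; only the query point moves.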

  11. 3. Background: Mean Shift Blurring • Blurring: the query points are the same as the reference data set • Each step progressively blurs the original data set • [Figure labels: reference data set, query point, center of mass]
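A single blurring step can be sketched as below: every point is both a query point and a reference point, so the whole data set is replaced by its kernel-weighted means. The Gaussian kernel, the function name, and the toy data are assumptions made for illustration.

```python
import numpy as np

def blur_step(points, bandwidth):
    """One blurring mean shift step: each point moves to its kernel-weighted mean."""
    pts = np.asarray(points, dtype=float)
    # Pairwise squared distances between all points (n x n).
    diff = pts[:, None, :] - pts[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=2)
    weights = np.exp(-0.5 * sq_dist / bandwidth ** 2)
    # Row-normalise so each new point is a weighted average of the old points.
    return weights @ pts / weights.sum(axis=1, keepdims=True)

points = np.array([[0.0], [0.2], [0.4], [5.0], [5.2]])
blurred = points
for _ in range(20):
    blurred = blur_step(blurred, bandwidth=0.5)
```

After a few steps the two groups in the toy data each collapse onto a single location, which is the "blurring" effect named on the slide.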

  12. 3. Background: Mean Shift End result of applying mean shift to a synthetic data set

  13. 4. Methodology: Overview • Sphere the data • Hierarchical Mean Shift • Query user
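The slides do not say how the data are sphered. A minimal sketch, assuming sphering means the usual whitening transform (zero mean, identity covariance) and at least two features, could look like this; the function name and the `eps` guard are hypothetical.

```python
import numpy as np

def sphere(data, eps=1e-12):
    """Whiten data: zero mean and (approximately) identity covariance."""
    x = np.asarray(data, dtype=float)
    centered = x - x.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    # Symmetric eigendecomposition of the covariance matrix.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Scale each principal direction by 1/sqrt(variance); eps guards tiny variances.
    whitening = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return centered @ whitening

rng = np.random.default_rng(0)
raw = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])
sphered = sphere(raw)
```

Sphering puts all features on a comparable scale, which matters because a single bandwidth is applied in every direction.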

  14. 4. Methodology: Hierarchical Mean Shift • Repeatedly blur the data using Mean Shift with increasing bandwidth: h_new = k * h_old
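The bandwidth schedule on this slide can be sketched as a loop that blurs the data until it stops moving, records which points have collapsed together, and then multiplies the bandwidth by k. This is an illustration under stated assumptions, not the authors' implementation: the Gaussian kernel, the tolerances, the iteration cap, and all function names are hypothetical.

```python
import numpy as np

def blur_to_convergence(points, bandwidth, tol=1e-5, max_iter=200):
    """Apply blurring mean shift steps until no point moves more than tol."""
    pts = points.copy()
    for _ in range(max_iter):
        diff = pts[:, None, :] - pts[None, :, :]
        weights = np.exp(-0.5 * np.sum(diff ** 2, axis=2) / bandwidth ** 2)
        new_pts = weights @ pts / weights.sum(axis=1, keepdims=True)
        if np.max(np.abs(new_pts - pts)) < tol:
            return new_pts
        pts = new_pts
    return pts

def label_clusters(points, merge_tol=1e-3):
    """Give the same label to points that have collapsed onto one location."""
    labels = -np.ones(len(points), dtype=int)
    centers = []
    for i, p in enumerate(points):
        for j, c in enumerate(centers):
            if np.linalg.norm(p - c) < merge_tol:
                labels[i] = j
                break
        else:
            # No existing center is close enough: start a new cluster.
            centers.append(p)
            labels[i] = len(centers) - 1
    return labels

def hierarchical_mean_shift(data, h0, k, levels):
    """Return a list of (bandwidth, labels) pairs, one per hierarchy level."""
    pts = np.asarray(data, dtype=float)
    h = h0
    hierarchy = []
    for _ in range(levels):
        pts = blur_to_convergence(pts, h)
        hierarchy.append((h, label_clusters(pts)))
        h = k * h  # h_new = k * h_old
    return hierarchy

data = np.array([[0.0], [0.2], [1.5], [1.7], [9.0]])
hierarchy = hierarchical_mean_shift(data, h0=0.2, k=2.0, levels=5)
cluster_counts = [len(np.unique(labels)) for _, labels in hierarchy]
```

Each level starts from the blurred output of the previous level, so clusters can only merge as the bandwidth grows; the bandwidths at which a cluster appears and disappears are what the later ranking step needs.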

  15. 4. Methodology: Querying the User • The data point closest to the cluster center is the representative data point • Rank representative data points for querying the user according to: • Outlierness [Leung et al. 2000] for cluster C_i: lifetime of C_i = log(bandwidth when cluster C_i is merged with other clusters - bandwidth when cluster C_i is formed)
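A small sketch of ranking by lifetime, following the formula exactly as worded on the slide. The cluster names and bandwidth values are made up, and the assumption that a longer lifetime should be queried first is this sketch's reading of "outlierness", not something the slide states.

```python
import math

def lifetime(h_formed, h_merged):
    """Lifetime as worded on the slide: log(h_merged - h_formed)."""
    return math.log(h_merged - h_formed)

# Hypothetical clusters: name -> (bandwidth when formed, bandwidth when merged).
clusters = {"A": (0.2, 0.4), "B": (0.2, 3.2), "C": (0.4, 1.6)}

# Assumption: clusters that survive longest are queried first.
ranked = sorted(clusters, key=lambda c: lifetime(*clusters[c]), reverse=True)
```

Here cluster B survives across the widest range of bandwidths, so its representative point would be shown to the user first.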

  16. 4. Methodology: Querying the User • Rank representative data points for querying the user according to: • Compactness + Isolation [Leung et al. 2000] for cluster C_i: [formula not preserved in the transcript]

  17. 4. Methodology: Tiebreaker • Ties may occur in Outlierness or Compactness/Isolation values. • Highest Average Distance heuristic: choose representative data point with highest average distance from user-labeled points.
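The Highest Average Distance heuristic can be sketched in a few lines. Euclidean distance, the function name, and the example points are assumptions; the sketch also assumes at least one point has already been labeled.

```python
import math

def break_tie(candidates, labeled):
    """Pick the candidate with the highest average distance to labeled points."""
    def avg_dist(p):
        return sum(math.dist(p, q) for q in labeled) / len(labeled)
    return max(candidates, key=avg_dist)

labeled = [(0.0, 0.0), (1.0, 0.0)]   # points the user has already labeled
tied = [(0.5, 0.5), (4.0, 3.0)]      # representatives with equal rank scores
choice = break_tie(tied, labeled)
```

The heuristic favours representatives far from everything the user has seen so far, which makes a not-yet-discovered category more likely.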

  18. 5. Results • Data sets used in experiments [table not preserved in the transcript] • Shuttle, OptDigits, OptLetters, and Statlog were subsampled to simulate class imbalance.

  19. 5. Results (Yeast) • Category detection metric: the number of queries needed before the user has been presented with at least one example from every category
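The metric can be computed from the sequence of true category labels of the queried points. A minimal sketch, with a hypothetical function name and made-up labels:

```python
def queries_to_find_all(query_labels, categories):
    """Number of queries until every category has been seen once, else None."""
    remaining = set(categories)
    for n, label in enumerate(query_labels, start=1):
        remaining.discard(label)
        if not remaining:
            return n
    return None

# Category "c" is first seen on the fifth query.
n_queries = queries_to_find_all(["a", "a", "b", "a", "c", "b"], {"a", "b", "c"})
```

Lower values are better: a method that surfaces rare categories early needs fewer queries.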

  20. 5. Results Number of hints to discover all classes

  21. 5. Results Area under the category detection curve
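The transcript does not say how the area is computed. One plausible reading, assumed here, is that the category detection curve plots the number of distinct categories discovered against the number of queries, and the area is a normalised sum under that curve. The function names and the normalisation are hypothetical.

```python
def detection_curve(query_labels):
    """Cumulative number of distinct categories seen after each query."""
    seen, curve = set(), []
    for label in query_labels:
        seen.add(label)
        curve.append(len(seen))
    return curve

def area_under_curve(curve, n_categories):
    """Normalised rectangle-rule area in (0, 1]; larger means earlier discovery."""
    return sum(curve) / (len(curve) * n_categories)

curve = detection_curve(["a", "a", "b", "a", "c", "b"])
auc = area_under_curve(curve, n_categories=3)
```

Unlike the single query count, the area also rewards methods that find most categories early even if the last one takes many queries.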

  22. 6. Conclusion / Future Work Conclusions • HMS-based methods consistently discover more categories in fewer queries than existing methods • HMS-based methods do not need a priori knowledge of dataset properties

  23. 6. Conclusion / Future Work Future Work • Better use of user feedback • Presentation of an entire cluster to the user instead of a representative data point • Improved computational efficiency • Theoretical analysis
