
Learning Classifiers from Distributional Data


Presentation Transcript


1. Learning Classifiers from Distributional Data
Harris T. Lin, Sanghack Lee, Ngot Bui and Vasant Honavar
Artificial Intelligence Research Laboratory, Department of Computer Science, Iowa State University
htlin@iastate.edu

2. Introduction
• Traditional classification: an instance is represented as a tuple of feature values
• BUT, due to
  • variability in sample measurements,
  • differences in the sampling rate of each feature, and
  • advances in tools and storage,
  one may want to repeat the measurement for each feature and for each individual, for reliability
• Example domains: electronic health records, sensor readings, extracted text features, …
• How to represent?
[Figure: example dataset with features White blood cell, Cholesterol, Heart Rate, Temperature and class Healthy?; Patient1: Y, Patient2: N, Patient3: N]

3. Introduction
• How to represent?
• Align samples
[Figure: the same example dataset, now showing each patient's repeated measurements per feature]

4. Introduction
• How to represent?
• Align samples
[Figure: zoom on Patient1's row of repeated measurements]

5. Introduction
• How to represent?
• Align samples (see the sketch below)
  • Measurements may not be synchronous, so data go missing
  • Unnecessarily big and sparse dataset
  • Need to adjust for weights
[Figure: Patient1's aligned table, with many cells marked "?" for missing values]
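A minimal sketch of the alignment problem in Python (all names and values hypothetical): padding each feature's bag of repeated measurements to a common length makes the sparsity explicit.

    def align(bags):
        """Pad each feature's bag of repeated measurements to a common
        length with None, mimicking the '?' cells in the slide."""
        width = max(len(bag) for bag in bags)
        return [list(bag) + [None] * (width - len(bag)) for bag in bags]

    # Patient1: four features, each measured a different number of times.
    patient1 = [
        [6.1, 5.8, 6.4],     # white blood cell (3 samples)
        [190.0],             # cholesterol      (1 sample)
        [72, 75],            # heart rate       (2 samples)
        [36.8, 37.0, 36.9],  # temperature      (3 samples)
    ]
    print(align(patient1))
    # [[6.1, 5.8, 6.4], [190.0, None, None], [72, 75, None], [36.8, 37.0, 36.9]]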

6. Introduction
• How to represent?
• Aggregation
[Figure: the same example dataset, with each patient's bags collapsed into single values]

7. Introduction
• How to represent?
• Aggregation (see the sketch below)
  • May lose valuable information
  • Which aggregation function?
  • The distribution of each sample set may itself contain information
[Figure: Patient1's bags collapsed via max, max, avg, avg into a single tuple]
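A minimal sketch of the aggregation alternative (aggregators chosen arbitrarily): each bag collapses to a single statistic, so the resulting tuple fits a standard learner but discards the shape of each sample set. For example, a stable bag [6.0, 6.0, 6.0] and a volatile bag [2.0, 6.0, 10.0] both average to 6.0.

    from statistics import mean

    def aggregate(instance, fns):
        """Collapse each feature's bag with its own aggregation function."""
        return tuple(fn(bag) for bag, fn in zip(instance, fns))

    patient1 = [[6.1, 5.8, 6.4], [190.0], [72, 75], [36.8, 37.0, 36.9]]
    print(aggregate(patient1, [max, max, mean, mean]))
    # (6.4, 190.0, 73.5, 36.9)  # the within-bag variability is gone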

8. Introduction
• How to represent?
• Proposed approach: just as drawn
  • Bag of feature values
  • "Distributional" representation
  • Adapt learning models to this new representation
• Contributions
  • Introduce the problem of learning from distributional data
  • Offer three basic solution approaches
[Figure: the example dataset kept as-is, each cell holding a bag of values]

9. Problem Formulation
• Distributional instance: x = (B1, …, BK), where Bk is a bag of values of the kth feature
• Distributional dataset: D = {(x1, c1), …, (xn, cn)}
• Distributional classifier learning problem: given D, learn a classifier that maps a new distributional instance to a predicted class
[Figure: (x1, c1), …, (xn, cn) → Learner → Classifier → predicted class for a new instance]
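A minimal sketch of this formulation in Python (feature names and values illustrative): a distributional instance is a K-tuple of bags, here represented as Counter multisets over a discrete domain, paired with its class label.

    from collections import Counter

    # x = (B1, ..., BK): one bag per feature, here K = 2 discrete features.
    x1 = (Counter({"high": 2, "normal": 1}),  # B1: white blood cell readings
          Counter({"elevated": 1}))           # B2: cholesterol readings
    x2 = (Counter({"normal": 3}),
          Counter({"normal": 2}))

    # D = {(x1, c1), ..., (xn, cn)}
    D = [(x1, "N"), (x2, "Y")]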

10. Distributional Learning Algorithms
• Considers the discrete domain, for simplicity
• Three basic approaches:
  • Aggregation
    • Simple aggregation (max, min, avg, etc.)
    • Vector distance aggregation (Perlich and Provost [2006])
  • Generative models
    • Naïve Bayes, with 4 different distributions: Bernoulli, Multinomial, Dirichlet, Polya (Dirichlet-Multinomial)
  • Discriminative models
    • Standard techniques transform the above generative models into their discriminative counterparts
(A sketch of the multinomial generative variant follows.)
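As a concrete illustration (a sketch, not the authors' exact estimator), a multinomial naïve Bayes variant can treat each feature's bag like a "document" of discrete values, scored under a per-class, per-feature multinomial with Laplace smoothing.

    import math
    from collections import Counter, defaultdict

    def train(data):
        """data: list of (instance, label); instance: tuple of Counters."""
        class_count = Counter(label for _, label in data)
        value_count = defaultdict(Counter)  # (feature, class) -> value counts
        vocab = defaultdict(set)            # feature -> observed values
        for instance, label in data:
            for k, bag in enumerate(instance):
                value_count[(k, label)].update(bag)
                vocab[k].update(bag)
        return class_count, value_count, vocab

    def log_score(instance, label, class_count, value_count, vocab):
        """log P(c) + sum_k log P(Bk | c) under per-feature multinomials."""
        lp = math.log(class_count[label] / sum(class_count.values()))
        for k, bag in enumerate(instance):
            counts = value_count[(k, label)]
            denom = sum(counts.values()) + len(vocab[k])  # Laplace smoothing
            for value, n in bag.items():
                lp += n * math.log((counts[value] + 1) / denom)
        return lp

    def predict(instance, model):
        class_count, value_count, vocab = model
        return max(class_count, key=lambda c: log_score(
            instance, c, class_count, value_count, vocab))

    # Using D from the earlier sketch:
    # model = train(D); predict(x1, model)  -> a predicted class label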

11. Result Summary
• Datasets: 2 real-world datasets and 1 synthetic dataset
  • Dataset sizes: [table not preserved in the transcript]
• Results: DIL algorithms that take advantage of the information available in the distributional instance representation outperform, or match the performance of, their counterparts that fail to fully exploit such information
• Main criticism: results from the discrete domain may not carry over to numerical features

12. Related Work
[Figure: a taxonomy of instance representations. A distributional instance is a tuple of bags of features, whereas multiple instance learning uses a bag of tuples of features; with # features = 1, the distributional case reduces to the document setting (supervised and multi-modal topic models); with size of bag = 1, it reduces to standard tabular data. This work addresses discrete domains; numerical domains remain open.]

13. Future Work
• Consider ordinal and numerical features
• Consider dependencies between features
• Adapt other existing machine learning methods (e.g., kernel methods, SVMs, decision trees, nearest neighbors)
• Unsupervised setting: clustering distributional data

14. Conclusion
• Opportunities
  • Variability in sample measurements
  • Differences in the sampling rate of each feature
  • One may want to repeat the measurement for each feature and for each individual, for reliability
• Contributions
  • Introduce the problem of learning from distributional data
  • Offer three basic solution approaches
  • Suggest that the distribution embedded in the distributional representation may improve performance
