
Thesis Proposal





Presentation Transcript


  1. Thesis Proposal PrActive Learning: Practical Active Learning, Generalizing Active Learning for Real-World Deployments

  2. Generic example system flow for interactive classification problems
  [Flow diagram: a large volume (in the millions) of incoming transactions passes through domain system pricing and validation and a Machine Learning model; the majority of transactions are cleared and paid, while a minority are flagged for auditing]
  • Common characteristics
  • Skewed class distribution (minority events)
  • Concept/feature drift
  • Expensive domain experts
  • Biased sampling of labeled historical data

  3. Interactive Classification Applications
  • Fraud detection
  • Network intrusion detection
  • Video surveillance
  • Information filtering / recommender systems
  • Error prediction / quality control

  4. Interactive Classification Setting
  [Flow diagram: unlabeled + labeled data → trained classifier → ranked list scored by the classifier]
  • Classifier trained from labeled data
  • Human (user/expert) in the loop, using the results but also providing feedback at a cost
  • Goal: maximize the return on investment, which is equivalent to the productivity of the human
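The loop on this slide (train a classifier on the labeled data, score the unlabeled pool, and present a ranked list for the expert to label at a cost) can be sketched as follows. This is a minimal illustration, assuming a fixed linear scoring model; the function and variable names are hypothetical.

```python
import numpy as np

def rank_unlabeled(weights, X_unlabeled):
    """Score each unlabeled example with a linear model and rank descending,
    so the examples most likely to be minority/flagged come first."""
    scores = X_unlabeled @ weights     # higher score = more likely to need auditing
    order = np.argsort(-scores)        # indices of examples, best first
    return order, scores[order]

# The expert would then label the top-k of this ranked list each round.
w = np.array([1.0, -0.5])                            # illustrative model weights
X = np.array([[0.2, 0.1], [0.9, 0.0], [0.1, 0.8]])   # three unlabeled examples
order, ranked_scores = rank_unlabeled(w, X)
```

In a deployed system the weights would come from retraining on the accumulated labeled data each round, closing the feedback loop the slide describes.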

  5. Factorization of the problem
  [Diagram relating: cost-sensitive exploitation, cost-sensitive active learning, exploration–exploitation tradeoffs, standard ranking / relevance feedback, and active learning]

  6. Interactive Classification - High-Level Picture
  [Flow diagram at round t: unlabeled data (t) and labeled data (1,…,t-1) feed a classifier trained on rounds 1,…,t-1, which produces a ranked list; expert feedback yields labeled data (1,…,t)]

  7. Thesis Contributions
  • Problem Statement: How to generalize active learning to incorporate crucial factors, namely the differential utility of a labeled example (dynamic/variable exploitation), the dynamic cost of labeling an example, and concept drift, in a unified framework that makes the deployment of such learning systems practical
  • Contributions
  • Generalization of active learning along the following dimensions:
  • Differential utility of a labeled example
  • Dynamic cost of labeling an example
  • Tackling concept drift
  • A unified framework to solve these considerations jointly
  • First solution: optimizing a joint utility function based on cost, exploration utility, and exploitation utility
  • Second solution: using an Upper Confidence Bound approach with a contextual multi-armed bandit setup to incorporate the different factors
  • Empirical evaluation of the proposed framework
  • Using an evaluation metric motivated by real business tasks
  • Datasets:
  • Synthetic dataset
  • Real-world dataset: health insurance claims rework
  • Cost-sensitive exploitation

  8. Situating the thesis work with respect to related work
  • Knowledge-Based Learning: feature-level knowledge encoding (GE)
  • Cost-Sensitive Active Learning
  • PrActive Learning: differential utility, dynamic cost, concept drift
  • Proactive Learning: unreliable oracle, oracle variation

  9. Factorization: Cost/Exploitation/Exploration
  • Type of model is pre-determined by the domain need
  • The following are the 3 possible types of models:
  • Uniform: each example gets the same value from the model
  • Variable: each example can potentially have a different value that is a function of its features
  • Markovian: each example has a variable value which is a function of its features and the (ordered) history of examples already labeled
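The three value models on this slide can be illustrated in code. This is a sketch; the specific value functions (a claim's dollar amount, diminishing returns for repeated example kinds) are hypothetical stand-ins for whatever the domain supplies.

```python
def uniform_value(example):
    """Uniform: every labeled example is worth the same fixed amount."""
    return 1.0

def variable_value(example):
    """Variable: value is a static function of the example's own features
    (here, illustratively, a claim's dollar amount)."""
    return example["amount"]

def markovian_value(example, history):
    """Markovian: value depends on the features AND the ordered history of
    examples already labeled (here, illustratively, value shrinks as more
    examples of the same kind have already been labeled)."""
    repeats = sum(1 for past in history if past["kind"] == example["kind"])
    return example["amount"] / (1 + repeats)

claim = {"amount": 100.0, "kind": "duplicate-billing"}
v_after_one = markovian_value(claim, [{"kind": "duplicate-billing"}])
```

The Markovian case is what makes selection order matter: labeling the second near-duplicate claim is worth less than labeling the first.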

  10. Utility Function
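The utility function itself appears only as an image in the original deck and is not recoverable from the transcript. As a plausible sketch consistent with slide 5's factors, one can imagine a linear combination of exploitation value, exploration value, and labeling cost; the additive form and the `lambda_explore` weight are assumptions, not the thesis's actual formula.

```python
def joint_utility(exploit, explore, cost, lambda_explore=0.5):
    """Hypothetical joint utility of labeling one example:
    U(x) = exploitation(x) + lambda * exploration(x) - cost(x)."""
    return exploit + lambda_explore * explore - cost

u = joint_utility(exploit=10.0, explore=4.0, cost=3.0)
```

A weight like `lambda_explore` would let the system trade off long-term model improvement (exploration) against immediate business value (exploitation) net of expert cost.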

  11. Joint Optimization Algorithm
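The algorithm on this slide is likewise an image in the original deck. A minimal greedy sketch, assuming the hypothetical additive utility above and a per-round labeling budget: rank candidates by joint utility and label the best affordable ones. The 0.5 exploration weight and the budget mechanics are assumptions for illustration.

```python
def greedy_select(candidates, budget):
    """candidates: list of (id, exploit_value, explore_value, cost) tuples.
    Greedily pick the highest-utility examples that fit the labeling budget."""
    chosen, spent = [], 0.0
    # Rank by (assumed) joint utility, highest first.
    ranked = sorted(candidates,
                    key=lambda c: c[1] + 0.5 * c[2] - c[3],
                    reverse=True)
    for cid, exploit, explore, cost in ranked:
        if spent + cost <= budget:   # skip examples the budget cannot cover
            chosen.append(cid)
            spent += cost
    return chosen

picked = greedy_select([("a", 10, 4, 3), ("b", 2, 0, 1), ("c", 8, 2, 6)], budget=8)
```

A real joint optimizer could of course be more sophisticated (e.g. solving a knapsack rather than going greedily), but this shows how the three factors enter one objective.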

  12. Evaluation Metric
  • Return on Investment: the net dollar amount saved by auditing the claim per dollar amount invested/spent
  • ROI = Net Savings / Net Cost, where Net Savings = net dollar amount saved - cost
  • The net dollar amount can be determined independently of the exploitation model
  • For claims rework: admin cost savings, or medical cost savings + admin cost savings
  • Long-term evaluation (it is difficult to see the exploration effects in short time windows)
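The ROI metric above made concrete; the dollar figures in the example are purely illustrative.

```python
def roi(dollars_saved, audit_cost):
    """Return on Investment = Net Savings / Net Cost,
    where Net Savings = dollars saved by auditing - cost of auditing."""
    net_savings = dollars_saved - audit_cost
    return net_savings / audit_cost

# E.g., auditing that costs $500 and recovers $1,500 in rework savings:
r = roi(dollars_saved=1500.0, audit_cost=500.0)
```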

  13. Factorization - continued
  • Each factor (cost, exploration, exploitation) can have the following 3 setups:
  • Uniform
  • Variable: static/pre-determined
  • Variable: dynamic/online

  14. Thesis Contributions
  • Define the novel area of interactive classification for skewed class distributions
  • Problem definition/setup
  • Characterization of the problem
  • Factorization of the problem
  • Hypothesis: jointly managing the different factors involved will lead to a better overall performance metric over time than considering the factors in isolation
  • Framework for solving interactive skewed classification problems
  • Modules: cost model, exploitation model, exploration model, utility function, joint optimization algorithm, evaluation metric
  • Demonstrate the usefulness of the framework for:
  • Synthetic data
  • Generalization
  • Health claims error prediction problem
  • Temporal active learning
  • Cost-sensitive exploitation

  15. [Flow diagram: unlabeled + labeled data → trained classifier → ranked list]

  16. [Flow diagram at round t: unlabeled data (t) and labeled data (t-1) feed the classifier trained at round t-1, producing a ranked list and new labeled data (t)]

  17. Thesis Contributions
  • Problem Statement: What are the considerations for developing/deploying a long-running system for an interactive classification task where the system is assisting human experts in solving business tasks?
  • Contributions
  • Define the area of interactive classification for skewed class problems, motivated by deploying these learning systems that run over time
  • Framework for solving interactive skewed classification problems
  • Defining the trade-offs between exploitation, exploration, and cost
  • What are the relevant metrics to evaluate such systems?
  • Hypothesis: jointly managing the different factors involved will lead to a better overall performance metric over time than considering the factors in isolation
  • Explore, evaluate, and compare solutions for the framework
  • First approach: defining a joint utility function and optimizing for it
  • Second approach: using upper confidence bounds with a contextual multi-armed bandit
  • Demonstrate the usefulness of the framework for:
  • Synthetic data
  • Health claims error prediction problem
  • For demonstrating generalization
  • Handling temporal drift with active sampling from an evolving unlabeled pool
  • Cost-sensitive exploitation
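The second approach mentioned on this slide (upper confidence bounds with a contextual multi-armed bandit) can be sketched in a LinUCB-style form, where the point estimate supplies the exploitation signal and the confidence-width bonus supplies the exploration signal. This is a generic textbook sketch, not the thesis's actual algorithm; the ridge prior and `alpha` are assumptions.

```python
import numpy as np

class LinUCBScorer:
    """Minimal LinUCB-style scorer: score(x) = theta^T x + alpha * sqrt(x^T A^-1 x)."""

    def __init__(self, dim, alpha=1.0):
        self.A = np.eye(dim)       # ridge-regularized design matrix (sum of x x^T + I)
        self.b = np.zeros(dim)     # reward-weighted feature sum
        self.alpha = alpha         # exploration weight

    def score(self, x):
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b                        # exploitation: reward estimate
        bonus = self.alpha * np.sqrt(x @ A_inv @ x)   # exploration: uncertainty width
        return theta @ x + bonus

    def update(self, x, reward):
        """Incorporate the expert's observed reward for a labeled example."""
        self.A += np.outer(x, x)
        self.b += reward * x

scorer = LinUCBScorer(dim=2)
x = np.array([1.0, 0.0])
before = scorer.score(x)       # all bonus: nothing observed yet
scorer.update(x, reward=1.0)
after = scorer.score(x)        # estimate rises, uncertainty bonus shrinks
```

Differential utility and dynamic labeling cost could enter by scaling the reward and subtracting the cost from the score, which is one way the slide's "different factors" fit the bandit view.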
