
Proactive Learning: Cost-Sensitive Active Learning with Multiple Imperfect Oracles




Presentation Transcript


  1. Proactive Learning: Cost-Sensitive Active Learning with Multiple Imperfect Oracles Pinar Donmez and Jaime Carbonell Language Technologies Institute, School of Computer Science, Carnegie Mellon University CIKM ’08, Napa Valley, October 2008

  2. Active Learning Assumptions vs. the Real World • Active learning assumes a unique, perfect oracle: always right, never tired, works for free or charges uniformly • The real world instead offers multiple sources of information: imperfect oracles that are unreliable, reluctant, and expensive or charge non-uniformly

  3. Solution: Proactive Learning • Proactive learning is a generalization of active learning that relaxes these assumptions • A decision-theoretic framework to jointly optimize the instance-oracle pair • A utility optimization problem under a fixed budget constraint

  4. Outline • Methodology • 3 Scenarios • Reluctance • Fallibility • Variable and Fixed Cost • Evaluation • Problem Setup • Datasets • Results • Conclusion

  5. Scenario 1: Reluctance • 2 oracles: • a reliable oracle: expensive but always answers with a correct label • a reluctant oracle: cheap but may not respond to some queries • Define a utility score as the expected value of information at unit cost
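The utility score can be sketched as follows; this is a minimal illustration assuming the form U(x, k) = P(ans | x, k) · V(x) / C_k for Scenario 1, with all numeric values invented for the example:

```python
def utility(value_of_info, p_answer, cost):
    """Expected value of information at unit cost:
    U(x, k) = P(ans | x, k) * V(x) / C_k."""
    return p_answer * value_of_info / cost

# The reliable oracle always answers (P(ans) = 1) but charges more;
# the reluctant one is cheap but may refuse. Values are invented.
u_reliable  = utility(value_of_info=0.8, p_answer=1.0, cost=3.0)
u_reluctant = utility(value_of_info=0.8, p_answer=0.6, cost=1.0)
```

With these numbers the cheap, reluctant oracle wins per unit cost, which is exactly the tradeoff the joint instance-oracle optimization navigates.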

  6. How to simulate oracle unreliability? • Unreliability depends on factors such as query difficulty (hard-to-classify instances), complexity of the data (requiring long, time-consuming analysis), etc. In this work, we model it based on query difficulty • Assumptions: • perfect oracle ~ a classifier with zero training error on the entire data • imperfect oracle ~ a weak classifier trained on a subset of the entire data • Train a logistic regression classifier on the subset to obtain the posterior P(y|x) • Identify the instances on which the weak classifier is least confident: these are the unreliable instances • Challenge: tradeoff between the information value of an instance and the reliability of the oracle
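A minimal sketch of this simulation, assuming scikit-learn's LogisticRegression as the weak classifier; the synthetic data, subset size of 30, and confidence threshold of 0.7 are illustrative choices, not values from the paper:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)      # synthetic ground truth

# Imperfect oracle ~ weak classifier trained on a small subset
subset = rng.choice(len(X), size=30, replace=False)
weak = LogisticRegression().fit(X[subset], y[subset])

# Instances where the weak classifier is least confident stand in
# for the queries the oracle is unreliable on (hard to classify)
confidence = weak.predict_proba(X).max(axis=1)
unreliable = confidence < 0.7
```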

  7. How to estimate P(ans|x)? • Cluster the unlabeled data using k-means • Ask the reluctant oracle for the label of each cluster centroid: • label received: increase P(ans|x) of nearby points • no label: decrease P(ans|x) of nearby points • An indicator equals 1 when a label is received, -1 otherwise • The number of clusters depends on the clustering budget and the oracle fee
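The centroid-querying step might be sketched like this, using scikit-learn's KMeans; the additive update, step size, and 0.5 prior are illustrative assumptions rather than the paper's exact estimator:

```python
import numpy as np
from sklearn.cluster import KMeans

def estimate_p_answer(X, oracle_answers, n_clusters=5, prior=0.5, step=0.2):
    """Query the reluctant oracle at each cluster centroid and nudge the
    P(ans|x) estimate of that cluster's points up (label received) or
    down (no label). `oracle_answers(point)` returns True on a label."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    p_ans = np.full(len(X), prior)
    for c, centroid in enumerate(km.cluster_centers_):
        direction = 1.0 if oracle_answers(centroid) else -1.0  # +1/-1 indicator
        members = km.labels_ == c
        p_ans[members] = np.clip(p_ans[members] + step * direction, 0.0, 1.0)
    return p_ans

# Two well-separated blobs; a hypothetical oracle that only answers
# queries in the right half of the space.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=m, size=(50, 2)) for m in (-2.0, 2.0)])
p_ans = estimate_p_answer(X, oracle_answers=lambda c: c[0] > 0, n_clusters=2)
```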

  8. Penalization of the Reluctant Oracle • The algorithm works in rounds until the budget is exhausted • At each round, sampling continues until a label is obtained • Be careful: you may spend the entire budget on a single attempt • If no label is received, decrease the utility of the remaining instances: this penalization is adaptive
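One such round can be sketched as below, assuming a multiplicative decay as the adaptive penalty (an illustrative stand-in; the paper's exact penalization formula may differ):

```python
def proactive_round(utilities, cost, budget, oracle, penalty=0.8):
    """One elicitation round: repeatedly query the highest-utility
    instance until a label arrives or the budget runs out. After each
    refusal the remaining utilities are decayed, adaptively penalizing
    the reluctant oracle.

    `oracle(x)` returns a label, or None on refusal.
    Returns (labeled_instance, label, remaining_budget)."""
    utilities = dict(utilities)
    while budget >= cost and utilities:
        x = max(utilities, key=utilities.get)    # most useful instance
        budget -= cost                           # pay even if refused
        label = oracle(x)
        if label is not None:
            return x, label, budget
        del utilities[x]
        for k in utilities:                      # adaptive penalization
            utilities[k] *= penalty
    return None, None, budget                    # budget exhausted

# A toy oracle that refuses everything except instance 'b'.
x, label, remaining = proactive_round(
    {'a': 0.9, 'b': 0.5}, cost=1, budget=5,
    oracle=lambda q: 1 if q == 'b' else None)
```

The loop makes the warning above concrete: every refused query still consumes `cost` from the budget.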

  9. Algorithm for Scenario 1

  10. Scenario 2: Fallibility • 2 oracles: • one perfect but expensive oracle • one fallible but cheap oracle that always answers • Algorithm similar to Scenario 1, with slight modifications • During exploration: • the fallible oracle provides the label together with its confidence • confidence = the posterior estimate of the fallible oracle • if the confidence falls below a threshold, we don't use the label, but we still update the estimate
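This thresholding rule can be sketched as follows; the exponential-moving-average update of the oracle's confidence estimate and the parameter values are illustrative assumptions, not the paper's exact update:

```python
def process_fallible(label, confidence, conf_estimate,
                     threshold=0.6, alpha=0.1):
    """Scenario 2 sketch: the fallible oracle always answers, but a label
    whose confidence falls below `threshold` is discarded. Either way we
    still update a running estimate of the oracle's confidence.

    Returns (label_to_use_or_None, updated_confidence_estimate)."""
    conf_estimate = (1 - alpha) * conf_estimate + alpha * confidence
    used = label if confidence >= threshold else None
    return used, conf_estimate

# A confident answer is kept; a shaky one is dropped but still
# informs the running confidence estimate.
kept, est = process_fallible(label=1, confidence=0.9, conf_estimate=0.5)
dropped, est = process_fallible(label=0, confidence=0.3, conf_estimate=est)
```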

  11. Outline of Scenario 2

  12. Scenario 3: Non-uniform Cost • Uniform cost: Fraud detection, face recognition, etc. • Non-uniform cost: text categorization, medical diagnosis, protein structure prediction, etc. • 2 oracles: • Fixed-cost Oracle • Variable-cost Oracle

  13. Outline of Scenario 3

  14. Evaluation • Datasets: Face detection, UCI Letter (V-vs-Y), Spambase, and UCI Adult

  15. Oracle Properties and Costs • The cost is inversely proportional to reliability • Higher costs for the fallible oracle since a noisy label should be penalized more than no label at all • Cost ratio creates an incentive to choose between oracles

  16. Underlying Sampling Strategy • Conditional entropy based sampling, weighted by a density measure • Captures the information content of a close neighborhood of x
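A sketch of such a density-weighted entropy score for the binary case, using mean Gaussian similarity over the k nearest neighbours as the density surrogate (the paper's exact weighting may differ):

```python
import numpy as np

def density_weighted_entropy(p_pos, X, k=5):
    """Score = binary conditional entropy of each instance, weighted by
    the mean Gaussian similarity to its k nearest neighbours."""
    p = np.clip(p_pos, 1e-12, 1 - 1e-12)
    entropy = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
    # pairwise squared distances -> Gaussian similarity
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    sim = np.exp(-d2)
    np.fill_diagonal(sim, -np.inf)               # exclude self-similarity
    knn_sim = np.sort(sim, axis=1)[:, -k:].mean(axis=1)
    return entropy * knn_sim

# An uncertain instance (p = 0.5) in a dense neighborhood scores highest.
scores = density_weighted_entropy(np.array([0.5, 0.99, 0.01]),
                                  np.array([[0.0], [0.1], [0.2]]), k=2)
```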

  17. Results: Overall and Reluctance on Spambase Data

  18. Results: Reluctance

  19. Results: Cost Varies Non-uniformly • Statistically significant results (p < 0.01)

  20. A Closer Look at the Clustering Step • Run each baseline without the clustering step • The entire budget is spent in rounds on data elicitation • No separate clustering budget • Results on Spambase under Scenario 1, cost ratio 1:3

  21. Conclusion • Addressed issues with the standard assumptions of active learning • Introduced the Proactive Learning framework • Analyzed imperfect oracles with differing properties and costs • Expected utility maximization across oracle-instance pairs • More effective than exploiting any single oracle
