Proactive Learning: Cost-Sensitive Active Learning with Multiple Imperfect Oracles. Pinar Donmez and Jaime Carbonell, Language Technologies Institute, School of Computer Science, Carnegie Mellon University. CIKM ’08, Napa Valley, October 2008.
Active Learning Assumptions and the Real World • Active learning assumes: • a unique oracle • a perfect oracle: always right, never tired • an oracle that works for free or charges uniformly • The real world has: • multiple sources of information • imperfect oracles: unreliable, reluctant • oracles that are expensive or charge non-uniformly
Solution: Proactive Learning • Proactive learning is a generalization of active learning that relaxes these assumptions • A decision-theoretic framework to jointly optimize the instance-oracle pair • Cast as a utility optimization problem under a fixed budget constraint
Outline • Methodology • 3 Scenarios • Reluctance • Fallibility • Variable and Fixed Cost • Evaluation • Problem Setup • Datasets • Results • Conclusion
Scenario 1: Reluctance • 2 oracles: • reliable oracle: expensive but always answers with a correct label • reluctant oracle: cheap but may not respond to some queries • Define a utility score as expected value of information at unit cost
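The utility formula itself is not preserved on this slide. One plausible reading of "expected value of information at unit cost" is sketched below, assuming the utility of an instance-oracle pair is the probability that the oracle answers, times the information value of the instance, divided by the oracle's fee. The function names, oracle names, and numbers are hypothetical.

```python
from typing import Dict, List, Tuple


def utility(answer_prob: float, value: float, cost: float) -> float:
    """Expected value of information per unit cost (assumed form)."""
    if cost <= 0:
        raise ValueError("cost must be positive")
    return answer_prob * value / cost


def best_pair(values: List[float],
              answer_probs: Dict[str, List[float]],
              costs: Dict[str, float]) -> Tuple[int, str]:
    """Return the (instance index, oracle name) pair with the highest utility."""
    candidates = [(i, k) for k in costs for i in range(len(values))]
    return max(
        candidates,
        key=lambda pair: utility(answer_probs[pair[1]][pair[0]],
                                 values[pair[0]],
                                 costs[pair[1]]),
    )


# Hypothetical numbers: two instances, two oracles.
values = [0.2, 0.9]                      # information value of each instance
answer_probs = {
    "reliable": [1.0, 1.0],              # always answers
    "reluctant": [0.9, 0.3],             # unlikely to answer the second instance
}
costs = {"reliable": 3.0, "reluctant": 1.0}

print(best_pair(values, answer_probs, costs))  # (1, 'reliable')
```

With these numbers the expensive oracle wins for the most informative instance, because the cheap oracle is unlikely to answer it.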
How to simulate oracle unreliability? • Unreliability depends on factors such as query difficulty (hard to classify) and complexity of the data (requires long and time-consuming analysis). In this work, we model it based on query difficulty • Assumptions • Perfect oracle ~ classifier having zero training error on the entire data • Imperfect oracle ~ weak classifier trained on a subset of the entire data • Train a logistic regression classifier on the subset to obtain posterior class probabilities • Identify instances with uncertain posteriors (hard to classify) • These are the unreliable instances • Challenge: tradeoff between the information value of an instance and the reliability of the oracle
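How to estimate the probability that the reluctant oracle answers? • Cluster unlabeled data using k-means • Ask the reluctant oracle for the label of each cluster centroid • label received: increase the estimated answer probability of nearby points • no label: decrease the estimated answer probability of nearby points • The update indicator equals 1 when a label is received, -1 otherwise • The number of clusters depends on the clustering budget and the oracle fee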
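The exact update rule is not preserved on this slide. The sketch below only illustrates the idea: after querying a cluster centroid, the estimated answer probability of nearby points is moved up (label received) or down (no label), with an effect that fades with distance. The exponential weighting and the `step` and `scale` parameters are assumptions, not the authors' formula.

```python
import math
from typing import List, Sequence


def update_answer_probs(points: List[Sequence[float]],
                        probs: List[float],
                        centroid: Sequence[float],
                        answered: bool,
                        step: float = 0.2,
                        scale: float = 1.0) -> List[float]:
    """Shift answer-probability estimates of points near a queried centroid."""
    h = 1.0 if answered else -1.0  # +1 when a label was received, -1 otherwise
    updated = []
    for x, p in zip(points, probs):
        # Closer points receive a larger share of the update (assumed weighting).
        weight = math.exp(-math.dist(x, centroid) / scale)
        updated.append(min(1.0, max(0.0, p + h * step * weight)))
    return updated


points = [(0.0, 0.0), (10.0, 0.0)]
priors = [0.5, 0.5]
after_answer = update_answer_probs(points, priors, (0.0, 0.0), answered=True)
after_refusal = update_answer_probs(points, priors, (0.0, 0.0), answered=False)
```

The point at the centroid moves by the full step, while the distant point is left almost unchanged.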
The algorithm works in rounds until the budget is exhausted • At each round, sampling continues until a label is obtained • Be careful: you may spend the entire budget on a single attempt • If no label is received, decrease the utility of the remaining instances • This is adaptive penalization of the reluctant oracle
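The round structure can be sketched as a loop that keeps choosing the highest-utility instance-oracle pair while the budget lasts. This is a minimal sketch under stated assumptions: the fee is paid on every attempt, utility is answer probability times value divided by cost, and a refusal multiplies that oracle's answer-probability estimates by a fixed `penalty`. The `ask` callback and all numbers are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple


def proactive_rounds(values: List[float],
                     answer_probs: Dict[str, List[float]],
                     costs: Dict[str, float],
                     ask: Callable[[str, int], Optional[int]],
                     budget: float,
                     penalty: float = 0.5) -> Tuple[Dict[int, int], float]:
    """Query oracles until the budget cannot cover any fee; return labels and leftover budget."""
    probs = {k: list(v) for k, v in answer_probs.items()}  # work on a copy
    remaining = set(range(len(values)))
    labels: Dict[int, int] = {}
    while remaining and budget >= min(costs.values()):
        affordable = [k for k in costs if costs[k] <= budget]
        i, k = max(
            ((i, k) for i in remaining for k in affordable),
            key=lambda pair: probs[pair[1]][pair[0]] * values[pair[0]] / costs[pair[1]],
        )
        budget -= costs[k]          # assumed: the fee is paid even if no label comes back
        label = ask(k, i)
        if label is None:
            # Adaptive penalization: lower this oracle's utility for remaining instances.
            for j in remaining:
                probs[k][j] *= penalty
        else:
            labels[i] = label
            remaining.discard(i)
    return labels, budget


def stubborn_ask(oracle: str, instance: int) -> Optional[int]:
    """Hypothetical oracles: the reluctant one never answers, the reliable one returns 1."""
    return 1 if oracle == "reliable" else None


labels, left = proactive_rounds(
    values=[0.9, 0.5],
    answer_probs={"reliable": [1.0, 1.0], "reluctant": [0.8, 0.8]},
    costs={"reliable": 3.0, "reluctant": 1.0},
    ask=stubborn_ask,
    budget=5.0,
)
```

In this run the reluctant oracle is tried twice, penalized each time, and the loop then switches to the reliable oracle for the same instance.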
Scenario 2: Fallibility • 2 oracles: • One perfect but expensive oracle • One fallible but cheap oracle that always answers • The algorithm is similar to Scenario 1, with slight modifications • During exploration: • The fallible oracle provides the label together with its confidence • Confidence = of fallible oracle • If the confidence is too low, we don’t use the label, but we still update the estimate
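The confidence definition and the rejection condition are missing from this slide, so the following is only an illustration of the exploration step: every answer contributes its confidence to the reliability estimate, but a label is kept only when the confidence clears a threshold. The threshold value and the direction of the comparison are assumptions.

```python
from typing import List, Optional, Tuple


def filter_fallible_answers(answers: List[Tuple[int, float]],
                            threshold: float = 0.6
                            ) -> Tuple[List[Optional[int]], List[float]]:
    """Split fallible-oracle answers into usable labels and confidence observations."""
    labels: List[Optional[int]] = []
    confidences: List[float] = []
    for label, confidence in answers:
        confidences.append(confidence)  # always recorded, used to update the estimate
        labels.append(label if confidence >= threshold else None)  # low confidence: discard label
    return labels, confidences


# Hypothetical answers as (label, confidence) pairs for two cluster centroids.
kept, observed = filter_fallible_answers([(1, 0.9), (0, 0.55)])
```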
Scenario 3: Non-uniform Cost • Uniform cost: Fraud detection, face recognition, etc. • Non-uniform cost: text categorization, medical diagnosis, protein structure prediction, etc. • 2 oracles: • Fixed-cost Oracle • Variable-cost Oracle
Evaluation • Datasets: Face detection, UCI Letter (V-vs-Y), Spambase, and UCI Adult
Oracle Properties and Costs • The cost is inversely proportional to reliability • Higher costs for the fallible oracle since a noisy label should be penalized more than no label at all • Cost ratio creates an incentive to choose between oracles
Underlying Sampling Strategy • Conditional entropy based sampling, weighted by a density measure • Captures the information content of the close neighborhood of x
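The scoring formula is not preserved on this slide. A minimal sketch of density-weighted entropy sampling, assuming binary labels, a Gaussian-style distance weight, and a fixed neighborhood `radius` (all assumptions), could look like this:

```python
import math
from typing import List, Sequence


def binary_entropy(p: float) -> float:
    """Entropy in bits of a binary posterior P(y=1|x) = p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def density_weighted_entropy(x: Sequence[float],
                             points: List[Sequence[float]],
                             posteriors: List[float],
                             radius: float = 1.0) -> float:
    """Sum the entropy of x's close neighbors, weighted by their proximity to x."""
    score = 0.0
    for k, p in zip(points, posteriors):
        d = math.dist(x, k)
        if d <= radius:  # only close neighbors of x contribute
            score += math.exp(-d * d) * binary_entropy(p)
    return score


points = [(0.0, 0.0), (0.5, 0.0), (5.0, 0.0)]
posteriors = [0.5, 0.5, 0.5]  # hypothetical classifier outputs
dense_score = density_weighted_entropy(points[0], points, posteriors)
isolated_score = density_weighted_entropy(points[2], points, posteriors)
```

An uncertain point in a dense region scores higher than an equally uncertain isolated point, which is the effect the density weighting is meant to capture.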
Cost varies non-uniformly • Statistically significant results (p < 0.01)
More light on the clustering step • Run each baseline without the clustering step • Entire budget is spent in rounds for data elicitation • No separate clustering budget • Results on Spambase under Scenario 1, cost 1:3
Conclusion • Address issues with the assumptions of active learning • Introduction to a Proactive Learning framework • Analysis of imperfect oracles with differing properties and costs • Expected utility maximization across oracle-instance pairs • Effective against exploitation of a single oracle