
Presentation Transcript


  1. Active Learning with Feedback on Both Features and Instances. H. Raghavan, O. Madani and R. Jones. Journal of Machine Learning Research 7 (2006). Presented by: John Paisley

  2. Outline • Discuss problem • Discuss proposed solution • Discuss results • Conclusion

  3. Problem of Paper • Imagine you want to filter junk email via some classifier and you’re willing to help train that classifier by labeling things, but you want to do it quickly because you’re impatient. • Imagine you want to sort a database of news articles, etc. • This paper is concerned with speeding this process up, i.e., reaching high performance in fewer labeling iterations.

  4. Suggestion of Paper • Traditionally, active learning will query a user about “instances” (articles, emails, etc.) and the user will provide a label for that instance (one-vs-rest in this paper). • This paper suggests that the user also be queried about features (words) and their relevance for distinguishing classes, to speed up the learning process. • The reason is that, apparently, in typical applications, all words of a document are used as features in classification. Therefore the feature space is very high dimensional and, with only a few labeled documents, it’s hard to build a good classifier. • By asking about features, the dimensionality is (effectively) reduced early on, with the “nuisance” dimensions (effectively) removed.

  5. Traditional Active Learning • Several instances are selected at random and labeled by a user • A model is built (SVM using “direct kernel” here) • Sequentially, the most uncertain instances (those closest to the decision boundary, hence “uncertainty sampling”) are selected, labeled, and the model updated. • The algorithm terminates at some point (when a high enough level of performance is reached).
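
A minimal sketch of this loop in Python, assuming a scikit-learn linear SVM over a pool of vectorized documents. The names (X_pool, y_pool, n_seed, n_queries) and the seed size are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np
from sklearn.svm import SVC

def uncertainty_sampling(X_pool, y_pool, n_seed=5, n_queries=42, rng=None):
    rng = rng or np.random.default_rng(0)
    # Step 1: label a few instances chosen at random
    # (the seed is assumed to contain both classes).
    labeled = list(rng.choice(len(X_pool), size=n_seed, replace=False))
    unlabeled = [i for i in range(len(X_pool)) if i not in labeled]

    for _ in range(n_queries - n_seed):
        # Step 2: fit the model on the currently labeled instances.
        clf = SVC(kernel="linear").fit(X_pool[labeled], y_pool[labeled])
        # Step 3: query the most uncertain instance, i.e. the one with the
        # smallest absolute distance to the decision boundary.
        margins = np.abs(clf.decision_function(X_pool[unlabeled]))
        pick = unlabeled[int(np.argmin(margins))]
        labeled.append(pick)      # the "user" supplies the label y_pool[pick]
        unlabeled.remove(pick)
    return clf, labeled
```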

  6. Their “Feature Feedback” Addition • (same) Several instances are selected at random and labeled by a user • (same) A model (SVM using “direct kernel” here) is built. • (same) Sequentially, the most uncertain instances (those closest to the decision boundary, hence “uncertainty sampling”) are selected, labeled, and the model updated. • Then, the user is shown a list of features (words) and asked whether they are relevant to distinguishing this class from others. Their algorithm then incorporates this in further training by simply multiplying those feature dimensions by 10 (an arbitrary constant) to increase the impact they have on classification (because of the direct kernel, I assume).
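
A minimal sketch of that reweighting step, assuming a dense feature matrix; the function name and the list of relevant column indices are illustrative, not the authors' code.

```python
import numpy as np

def apply_feature_feedback(X, relevant_feature_idx, scale=10.0):
    """Return a copy of X with the user-flagged feature columns scaled up,
    so those dimensions dominate the (linear-kernel) inner products."""
    X_scaled = np.array(X, dtype=float, copy=True)
    X_scaled[:, relevant_feature_idx] *= scale
    return X_scaled
```

The scaled matrix would then replace X in the retraining step of the loop above.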

  7. How They Assess Performance (1) • Before humans are involved, they create an “oracle” that can rank features by importance (it has all labels a priori) as determined via Information Gain • where P(c) is the probability of the class of interest, P(t) is the probability of the word of interest appearing in an article, and P(c,t) is their joint probability. The larger the IG, the more informative the word is for determining the class (e.g. “football” is informative for sports).
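
The IG formula itself appears as an image on the slide; in the standard (mutual-information) form consistent with the definitions above, it is roughly:

```latex
IG(t, c) \;=\; \sum_{c' \in \{c,\,\bar{c}\}} \; \sum_{t' \in \{t,\,\bar{t}\}}
P(c', t') \,\log \frac{P(c', t')}{P(c')\,P(t')}
```

where \(\bar{c}\) means "not the class of interest" and \(\bar{t}\) means "the word does not appear"; the paper's exact notation may differ.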

  8. How They Assess Performance (2) • They devise their performance metric called “efficiency” • F1 is the harmonic mean of the precision and recall, where “precision” is the fraction of (e.g.) articles classified as “1” that are correct and “recall” is the fraction of articles with true label “1” that are correctly classified as “1” • They set M = 1000, assuming that the classifier will be nearly perfect at that point, and they’re measuring how far active learning (ACT) is from that perfection compared with random sampling. [Figure at right: efficiency is defined as one minus the blue area divided by the grey area. Throughout the paper they only measure efficiency after seeing 42 documents.]
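
Hedged reconstructions of the two quantities described above, with M = 1000; the efficiency expression simply follows the "one minus blue area over grey area" description and may differ from the paper's exact definition:

```latex
F1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}},
\qquad
\mathrm{efficiency} \;\approx\; 1 -
\frac{\sum_{t=1}^{M}\bigl(F1_M - F1_t(\mathrm{ACT})\bigr)}
     {\sum_{t=1}^{M}\bigl(F1_M - F1_t(\mathrm{RND})\bigr)}
```

Here F1_t(ACT) is the F1 score of active learning after t labeled documents, F1_t(RND) the same for random sampling, and F1_M the (near-perfect) score after M documents.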

  9. Results with Oracle • These results show the ideal performance of feature feedback to see if it’s worthwhile to begin with. • Basically, they select the top n features that maximize performance (via Information Gain) and do active learning, reporting the efficiency after 42 documents, as well as the F1 score after 7 and 22 documents. The F1 results are upper bounded by the far right column. The results indicate that selecting the most informative features speeds up learning (the uninformative features are “distractions” for the classifier in the early stages when there are only a few labels).
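
A minimal sketch of how such an oracle could be built, assuming binary term-presence features and the IG form given above; the authors' implementation details may differ.

```python
import numpy as np

def information_gain(X_binary, y, eps=1e-12):
    """IG between each binary term-presence feature and the binary label."""
    y = np.asarray(y)
    ig = np.zeros(X_binary.shape[1])
    for j in range(X_binary.shape[1]):
        t = X_binary[:, j] > 0
        ig_j = 0.0
        for c_val in (0, 1):          # class absent / present
            for t_val in (False, True):  # term absent / present
                p_ct = np.mean((y == c_val) & (t == t_val)) + eps
                p_c = np.mean(y == c_val) + eps
                p_t = np.mean(t == t_val) + eps
                ig_j += p_ct * np.log(p_ct / (p_c * p_t))
        ig[j] = ig_j
    return ig

def oracle_top_n(X_binary, y, n):
    # Rank features on the fully labeled corpus and keep the top-n columns
    # before running active learning (i.e. restrict X to X[:, idx]).
    return np.argsort(information_gain(X_binary, y))[::-1][:n]
```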

  10. Results with Human • How well can a human label features compared with the oracle and, if not as well, is it still beneficial? • Experiment: Have a human read an article and show them the top 20 words from the oracle mixed in with some other words. Have the user mark “relevant” or “not relevant/don’t know” for each. The table below compares the humans with the oracle. Also shown is the ability of 50 labeled documents (picked via uncertainty sampling) to select the top 20 words (via Information Gain), i.e., traditional active learning after 50 labels. • What it says is that after seeing one document, a human can tell the relevant features better than the classifier can after 50. Kappa is a measure of how well the human annotators agree with each other (which they say is good).
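
The kappa statistic here is presumably the standard Cohen's kappa for inter-annotator agreement:

```latex
\kappa = \frac{p_o - p_e}{1 - p_e}
```

where p_o is the observed fraction of feature judgments on which two annotators agree and p_e is the agreement expected by chance; values near 1 indicate strong agreement.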

  11. Putting Humans “In the Loop” • They then took the human responses and simulated active learning with feature feedback. The experimenters were shown an article and the feature queries for it (“relevant” or “not”) and entered the answers the humans of the previous slide gave. UNC is no feature feedback, ORA is the oracle (correct “answers” for the feature queries) and HIL is the human response (as opposed to the oracle). • It says that humans speed up the active learning process.
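
A minimal sketch of how the three conditions could be simulated, assuming the helpers sketched earlier and a dict of pre-recorded human answers keyed by feature index; all names here are illustrative assumptions, not the authors' code.

```python
def answer_feature_query(feature_idx, condition, oracle_relevant, human_answers):
    """Answer one 'is this feature relevant?' query under a given condition."""
    if condition == "UNC":                 # no feature feedback at all
        return False
    if condition == "ORA":                 # oracle supplies the correct answer
        return feature_idx in oracle_relevant
    if condition == "HIL":                 # replay the recorded human response
        return human_answers.get(feature_idx, False)
    raise ValueError(condition)
```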

  12. Conclusions • Knowing which features are relevant at the early stages of active learning will help speed up the process of building an accurate classifier. • Far fewer instances will need to be labeled for the classifier to reach a high performance. • Humans are able to identify these features (in this case, relevant words for documents).
