This work explores a multi-criteria approach to active learning aimed at improving Named Entity Recognition (NER). Traditional supervised learning for NER faces significant challenges because it requires large annotated corpora, which are time-consuming to create. Our research introduces a framework that uses multiple criteria, specifically informativeness, representativeness, and diversity, to select the most useful examples for annotation. The paper presents several active learning strategies, experiments, and results that demonstrate the effectiveness of the approach, which minimizes human annotation effort while maintaining performance.
Multi-Criteria-based Active Learning for Named Entity Recognition Dan Shen†‡, Jie Zhang†‡, Jian Su†, Guodong Zhou†, Chew-Lim Tan‡ †Institute for Infocomm Research ‡National University of Singapore
Outline • Introduction • Active Learning in NER • Multiple Criteria for Active Learning • Informativeness • Representativeness • Diversity • Active Learning Strategies • Experiments and Results • Conclusion ACL 2004
Motivation • Named Entity Recognition (NER) • most current work: supervised learning • requires a large annotated corpus • MUC-6 / MUC-7 corpus (newswire domain) • GENIA corpus (biomedical domain) • Limitations of supervised NER • corpus annotation: tedious and time-consuming • adaptability: limited • Target of our work • explore active learning in NER • minimize the human annotation effort • without degrading performance ACL 2004
Active Learning Framework • Given • a small labeled data set L • a large unlabeled data set U • Repeat • Train a model M on L • Use M to test U • Select the most useful example from U (our research focus) • Require a human expert to label it • Add the labeled example to L • Until M achieves a certain performance level ACL 2004
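The loop above can be written down compactly. Below is a minimal sketch in Python, assuming an SVM learner, a margin-based uncertainty score, and a fixed number of rounds instead of a performance-based stopping test; none of these details are prescribed by the slide.

```python
# Minimal sketch of pool-based active learning (assumed SVM learner and
# margin-based uncertainty; the stopping condition is simplified to a round count).
import numpy as np
from sklearn.svm import SVC

def active_learning_loop(X_labeled, y_labeled, X_unlabeled, oracle,
                         rounds=20, batch_size=10):
    """Repeatedly train on L, score U, query the oracle for the most uncertain examples."""
    X_l, y_l = list(X_labeled), list(y_labeled)
    pool = list(X_unlabeled)
    model = SVC(kernel="linear")
    for _ in range(rounds):
        model.fit(np.array(X_l), np.array(y_l))                     # train M on L
        margins = np.abs(model.decision_function(np.array(pool)))   # use M to test U
        picked = np.argsort(margins)[:batch_size]                   # smallest margin = most uncertain
        for i in sorted(picked, reverse=True):
            x = pool.pop(i)
            X_l.append(x)
            y_l.append(oracle(x))                                   # human expert labels it
    return model
```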
Related Work • Active Learning Criteria • Most current work: informativeness • A few works: representativeness, e.g. [McCallum and Nigam 1998] and [Tang et al. 2002], or diversity, e.g. [Brinker 2003] • NO work has explored multiple criteria in active learning • Active Learning in NLP • E.g. POS tagging / text classification / statistical parsing • NO work has explored active learning in NER ACL 2004
Outline • Introduction • Active Learning in NER • Multiple Criteria for Active Learning • Informativeness • Representativeness • Diversity • Active Learning Strategies • Experiments and Results • Conclusion ACL 2004
Active Learning in NER • SVM-based NER system • Recognize one class of NEs at a time • Features: different from supervised learning! • cannot be derived statistically from the training data set • no gazetteers or dictionaries • Active Learning in the NER system • Select the most useful batch of examples • An example = a word sequence • e.g. a named entity and its context • Measurements: from words to NEs • only word-based scores are available from the SVM ACL 2004
Outline • Introduction • Active Learning in NER • Multiple Criteria for Active Learning • Informativeness • Representativeness • Diversity • Active Learning Strategies • Experiments and Results • Conclusion ACL 2004
1. Informativeness Criterion Most informative example: the one the existing model is most uncertain about. Most previous work is based only on this criterion.
Informativeness Measurement • Informativeness for a Word • change / induce support vectors in the SVM • distance of the word's feature vector to the separating hyperplane • Informativeness for an NE • an NE is a sequence of words: NE = w1w2…wN, where wi is the ith word of the NE • Heuristic scoring functions combine the word scores • e.g. by averaging them: Info(NE) = (1/N) Σi Info(wi) ACL 2004
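As an illustration, here is a hedged sketch of how the word-level margin could be turned into an NE-level informativeness score; the mapping from hyperplane distance to an informativeness value is an assumed monotone transform, not the paper's exact formula, and only the average and min aggregations are shown.

```python
# Sketch: word informativeness from the SVM margin, aggregated over the words of an NE.
import numpy as np

def word_informativeness(svm_model, word_vec):
    """Smaller distance to the separating hyperplane = more uncertain = more informative."""
    dist = abs(svm_model.decision_function(word_vec.reshape(1, -1))[0])
    return 1.0 / (1.0 + dist)   # assumed monotone mapping, not the paper's formula

def ne_informativeness(svm_model, ne_word_vecs, how="avg"):
    """Aggregate word scores over NE = w1 w2 ... wN (average or min)."""
    scores = [word_informativeness(svm_model, w) for w in ne_word_vecs]
    return min(scores) if how == "min" else sum(scores) / len(scores)
```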
2. Representativeness Criterion Most representative example: the one that represents the most examples Only a few works [McCallum and Nigam 1998; Tang et al. 2002] consider this criterion.
Similarity Measurement • Similarity between Words • cosine-similarity measurement • adapted to the SVM • Similarity between NEs • alignment of the two word sequences • Dynamic Time Warping (DTW) algorithm • given a point-by-point distance, find an optimal path that minimizes the accumulated distance along the path ACL 2004
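A sketch of the two similarity levels, assuming word feature vectors as input; the conversion of cosine similarity into a point-by-point distance and the normalization of the DTW path cost are assumptions, not the paper's exact formulation.

```python
# Sketch: cosine similarity between word vectors, and a DTW alignment that
# accumulates word-level distances over two NE word sequences.
import numpy as np

def word_sim(u, v):
    """Cosine similarity between two word feature vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def ne_sim_dtw(ne_a, ne_b):
    """DTW over two NE word sequences; similarity = 1 - normalized accumulated distance."""
    n, m = len(ne_a), len(ne_b)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 1.0 - word_sim(ne_a[i - 1], ne_b[j - 1])   # point-by-point distance
            acc[i, j] = cost + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return 1.0 - acc[n, m] / (n + m)    # assumed path-length normalization
```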
Representativeness Measurement • Representativeness of NEi in NESet • NESet = {NE1, …, NEi, …, NEN} • quantified by its density • the average similarity between NEi and all other NEj (j ≠ i) in NESet • Most representative NE • the one with the largest density among all NEs in NESet • the centroid of NESet ACL 2004
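The density and centroid computations follow directly from the definitions above; in this sketch `sim` stands for any NE-level similarity, e.g. the DTW-based one sketched earlier.

```python
# Sketch: density = average similarity to all other NEs; the centroid is the densest NE.
def density(i, ne_set, sim):
    others = [sim(ne_set[i], ne_set[j]) for j in range(len(ne_set)) if j != i]
    return sum(others) / len(others) if others else 0.0

def centroid(ne_set, sim):
    scores = [density(i, ne_set, sim) for i in range(len(ne_set))]
    return max(range(len(ne_set)), key=lambda i: scores[i])   # index of the most representative NE
```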
3. Diversity Criterion Maximize the training utility of a batch: the members of the batch should differ from each other as much as possible Only one work [Brinker 2003] has considered this criterion.
Global Consideration • Consider the examples over the whole sample space • K-means clustering • cluster all named entities in NESet • assumption: the examples in one cluster are quite similar to each other • select examples from different clusters for a batch • Drawback: time-consuming • for efficiency, filter out NEs before clustering ACL 2004
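One possible rendering of this global step, assuming the candidate NEs are represented as fixed-length vectors so that scikit-learn's KMeans can be applied (the slide clusters NEs by their pairwise similarity, so this is only an approximation):

```python
# Sketch: cluster candidate NE vectors into k clusters and keep one example per cluster.
import numpy as np
from sklearn.cluster import KMeans

def one_per_cluster(candidate_vecs, k):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(np.array(candidate_vecs))
    chosen = {}
    for idx, lab in enumerate(labels):
        chosen.setdefault(lab, idx)   # keep the first member seen in each cluster
    return sorted(chosen.values())
```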
Local Consideration • Consider only the examples in the current batch • For an example candidate: • compare it with all previously selected examples in the batch one by one • add it to the batch only if its similarity to each of them is below a threshold • More efficient! ACL 2004
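The local check reduces to a one-line predicate; the threshold value used here is an illustrative assumption.

```python
# Sketch: a candidate enters the batch only if it is sufficiently dissimilar
# to every example already selected.
def is_diverse(candidate, batch, sim, threshold=0.8):
    return all(sim(candidate, member) < threshold for member in batch)
```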
Outline • Introduction • Active Learning in NER • Multiple Criteria for Active Learning • Informativeness • Representativeness • Diversity • Active Learning Strategies • Experiments and Results • Conclusion ACL 2004
Strategy 1 (sketched in code below): Unlabeled Data Set → select the M most informative examples (Informativeness Criterion) → Intermediate Set → clustering into K clusters (Diversity Criterion) → select the centroid of each cluster (Representativeness Criterion) → Batch ACL 2004
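A pipeline-level sketch of Strategy 1 under the same assumptions as before: NEs are represented as fixed-length vectors, and the informativeness scores and similarity function are supplied externally. The values of M and K are left as parameters.

```python
# Sketch of Strategy 1: informativeness pre-selection -> K-means clustering ->
# the densest member (centroid) of each cluster goes into the batch.
import numpy as np
from sklearn.cluster import KMeans

def strategy1(ne_vecs, info_scores, sim, m, k):
    """ne_vecs: 2-D array of NE vectors; info_scores: per-NE informativeness."""
    # 1) Informativeness: keep the M most informative NEs as an intermediate set.
    inter = np.argsort(info_scores)[::-1][:m]
    # 2) Diversity: cluster the intermediate set into K clusters.
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(ne_vecs[inter])
    # 3) Representativeness: from each cluster, pick the member with the highest
    #    total similarity to the rest of its cluster (the cluster centroid).
    batch = []
    for c in range(k):
        members = [inter[i] for i in range(len(inter)) if labels[i] == c]
        if not members:
            continue
        dens = lambda a: sum(sim(ne_vecs[a], ne_vecs[b]) for b in members if b != a)
        batch.append(max(members, key=dens))
    return batch
```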
Strategy 2 (sketched in code below): From the Unlabeled Data Set, select the example candidate with the maximum combined score λ·Info + (1 − λ)·Rep (Informativeness & Representativeness Criteria); compare the candidate with each example already in the Batch: IF any of the similarity values > threshold THEN reject ELSE add to Batch (Diversity Criterion) ACL 2004
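A corresponding sketch of Strategy 2; the values of λ and the similarity threshold are illustrative, not the ones used in the experiments.

```python
# Sketch of Strategy 2: rank candidates by lam*Info + (1-lam)*Rep, then accept a
# candidate only if it is not too similar to any example already in the batch.
import numpy as np

def strategy2(ne_vecs, info_scores, rep_scores, sim, batch_size,
              lam=0.6, threshold=0.8):
    combined = lam * np.asarray(info_scores) + (1.0 - lam) * np.asarray(rep_scores)
    batch = []
    for idx in np.argsort(combined)[::-1]:            # best combined score first
        if all(sim(ne_vecs[idx], ne_vecs[b]) < threshold for b in batch):
            batch.append(idx)                         # diverse enough -> accept
        if len(batch) == batch_size:
            break
    return batch
```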
Outline • Introduction • Active Learning in NER • Multiple Criteria for Active Learning • Informativeness • Representativeness • Diversity • Active Learning Strategies • Experiments and Results • Conclusion ACL 2004
Experimental Setting 1 • Corpus • MUC-6 Corpus • To recognize Person, Location and Organization • GENIA Corpus V1.1 • To recognize Protein • Corpus Split • Initial training data set • Test data set • Unlabeled data set • Size of each data set • Batch size K • = 50 in biomedical domain • = 10 in newswire domain ACL 2004
Experimental Setting 2 • Supervised learning • trained on the entire annotated corpus • newswire: 408 WSJ articles • biomedical: 590 MEDLINE abstracts • Random selection (baseline) • a batch of examples is randomly selected in each round • Evaluation: F-measure ACL 2004
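For reference, the F-measure used here is the standard harmonic mean of precision P and recall R: F = 2 · P · R / (P + R).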
Experimental Results 1 • Effectiveness of Single-Criterion-based Active Learning • [Learning-curve figure; legend: Supervised (223K words), Info_based (52K words, i.e. 62% of Random and 23% of Supervised), Random (83K words)] ACL 2004
Experimental Results 2 • Overall Results of Multi-Criteria-based Active Learning • [Annotations from the results table:] • Avg(3.5K/11.5K, 2.1K/13.6K, 7.8K/20.2K) = 28% • Avg(3.5K/157K, 2.1K/157K, 7.8K/157K) = 5% • 31K/83K = 37% • 31K/223K = 14% • Strategy 1 needs 9K more words than Strategy 2 ACL 2004
Experimental Results 3 • Effectiveness of Multi-Criteria-based Active Learning • [Learning-curve figure; legend: Supervised (223K), Info_based (52K), Strategy 1 (40K, 77% of Info_based), Strategy 2 (31K, 60% of Info_based)] ACL 2004
Outline • Introduction • Active Learning in NER • Multiple Criteria for Active Learning • Informativeness • Representativeness • Diversity • Active Learning Strategies • Experiments and Results • Conclusion ACL 2004
Conclusion • The first work on multiple criteria in active learning • informativeness, representativeness & diversity • two active learning strategies • they outperform the single-criterion-based method (Strategy 2 needs only 60% of its annotated words) • The first work on active learning in NER • measurements for the multiple criteria • greatly reduce the annotation cost • General measurements and strategies • can be easily adapted to other NLP tasks ACL 2004
Future Work • How to automatically decide the optimal values of the parameters? • batch size K • linear interpolation parameter λ • When to stop the active learning process? • e.g. monitor the change of support vectors ACL 2004
Acknowledgement Thanks to International Post-Graduate College (IGK) at Saarland University for the generous travel support. The first author is now a Ph.D. student in this program. ACL 2004
The End Thank You !