
CS 6998 NLP for the Web Columbia University 04/22/2010






Presentation Transcript


  1. Analyzing Wikipedia and Gold-Standard Corpora for NER Training Nothman et al. 2009, EACL William Y. Wang Computer Science CS 6998 NLP for the Web Columbia University 04/22/2010

  2. Outline • Motivation • NER and Gold-Standard Corpora • The Problem: Cross-corpora Performance • Wikipedia for NER • Results • Conclusion and My Observation

  3. Motivation • Manual annotation is “expensive”: (1) it costs money, (2) it takes time, and (3) it raises extra problems • Can we use existing linguistic resources to create an NER corpus automatically? • What is the cross-corpora NER performance? • How can we utilize Web resources (e.g. Wikipedia) to improve NER?

  4. NER Gold Corpora • MUC-7: locations (LOC), organizations (ORG), personal names (PER) • CoNLL-03: LOC, ORG, PER, miscellaneous (MISC) • BBN: 54 entity tags annotated over the Penn Treebank

  5. Problem: Poor Cross-corpus Performance

  6. Corpus and Error Analysis • N-gram tag variation: • Check the tags of all n-grams that appear multiple times to see whether their NE tags are consistent • Entity type frequency: • (1) POS tag together with its NE tag • (e.g. nationalities often carry JJ or NNPS) • (2) Wordtypes • (3) Wordtypes with function words (e.g. Bank of New England -> Aaa of Aaa Aaa) • Tag sequence confusion: • Looking into the details of the confusion matrix
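The wordtype abstraction above (Bank of New England -> Aaa of Aaa Aaa) can be sketched in a few lines of Python. This is a minimal illustration only; the whitespace tokenization and the treatment of digit strings are my assumptions, not the paper's exact rules:

```python
def wordtype(token):
    # Capitalized words collapse to the shape "Aaa"; digit strings to "000";
    # lowercase function words (e.g. "of") are kept as-is.
    if token[0].isupper() and token.isalpha():
        return "Aaa"
    if token.isdigit():
        return "000"
    return token

def entity_wordtype(phrase):
    # Map a whole entity mention to its wordtype pattern,
    # e.g. "Bank of New England" -> "Aaa of Aaa Aaa".
    return " ".join(wordtype(t) for t in phrase.split())
```

Comparing entity mentions by wordtype pattern rather than surface string is what lets the analysis group confusions such as "Aaa of Aaa Aaa" entities across corpora.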

  7. Using Wikipedia to Build an NER Corpus • 1. Classify all articles into entity classes • 2. Split Wikipedia articles into sentences • 3. Label NEs according to link targets • 4. Select sentences for inclusion in the corpus
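The link-labelling step of the pipeline above can be sketched roughly as follows. The data shapes (`links`, `article_class`), the naive whitespace tokenization, and the crude span matching are assumptions for illustration, not the authors' implementation:

```python
def label_sentence(sentence, links, article_class):
    """Tag each token with the entity class of the Wikipedia article its
    link points to, or 'O' if the token is not inside any linked span.

    links: maps surface strings in the sentence to target article titles.
    article_class: maps article titles to entity classes (from step 1).
    """
    tagged = []
    for token in sentence.split():
        tag = "O"
        for surface, target in links.items():
            if token in surface.split():  # crude per-token span matching
                tag = article_class.get(target, "O")
        tagged.append((token, tag))
    return tagged
```

A real implementation would match whole link spans (and assign B-/I- style boundaries) and would then apply the sentence-selection criteria of step 4 before adding the sentence to the corpus.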

  8. Improving Wikipedia NER • Baseline: 58.9% and 62.3% on CoNLL and BBN • 1. Inferring extra links using Wikipedia disambiguation pages • 2. Personal titles: not all preceding titles indicate PER (e.g. Prime Minister of Australia) • 3. Previously missed JJ entities (e.g. American / MISC) • 4. Miscellaneous changes
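The personal-title fix could be expressed as a heuristic like the one below. The `TITLES` set and the function are hypothetical illustrations of the idea, not the paper's actual rule set:

```python
TITLES = {"President", "Prime", "Minister", "Dr.", "Sir"}  # assumed, illustrative list

def title_indicates_per(tokens, i):
    """Return True if the title starting at tokens[i] precedes a personal
    name, False for office/role phrases such as 'Prime Minister of Australia'.
    """
    j = i
    while j < len(tokens) and tokens[j] in TITLES:
        j += 1  # skip over the run of title words
    # A title followed by 'of' names an office, not a person.
    return j < len(tokens) and tokens[j] != "of"
```

For example, the heuristic accepts "Prime Minister John Howard" as introducing a PER entity but rejects "Prime Minister of Australia".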

  9. Results • DEV set results (higher than, but similar in pattern to, the test set results)

  10. Conclusion • The impact of the NER training corpus on its corresponding test set is huge • An annotation-free Wikipedia NER corpus was created • Wikipedia data performs better in the cross-corpora NER task • Still much room for improvement

  11. Comments • What I like about this paper: • The scope of this paper is unique (analogy: cross-cultural studies) • Utilizing novel linguistic resources to solve basic NLP problems • Good results • Relatively clear and easy to understand • What I don’t like about this paper: • The overall method to improve Wikipedia NER training is not a principled approach

  12. Overall Assessment: 8/10

  13. Thank you!
