The third Chinese Language Processing Bakeoff focused on Word Segmentation (WS) and Named Entity Recognition (NER) challenges. This event brought together participants from various regions, showcasing significant advancements and diverse approaches in Chinese language processing. Through the evaluation of multiple runs and corpora, the bakeoff highlighted crucial issues like out-of-vocabulary (OOV) words and annotation consistency. The results showed that the best-performing models achieved high F-scores, emphasizing the need for effective methodologies to tackle the inherent complexities in Chinese language processing.
The Third Chinese Language Processing Bakeoff: Word Segmentation and Named Entity Recognition
Gina-Anne Levow
Fifth SIGHAN Workshop, July 22, 2006
Roadmap
• Bakeoff Task Motivation
• Bakeoff Structure:
  • Materials and annotations
  • Tasks and conditions
  • Participants and timeline
• Results & Discussion:
  • Word Segmentation
  • Named Entity Recognition
• Observations & Conclusions
• Thanks
Bakeoff Task Motivation
• Core enabling technologies for Chinese language processing
• Word Segmentation (WS)
  • Crucial tokenization step in the absence of whitespace
  • Supports POS tagging, parsing, reference resolution, etc.
  • Fundamental challenges:
    • "Word" is not well or consistently defined; human annotators disagree
    • Unknown words impede performance
• Named Entity Recognition (NER)
  • Essential for reference resolution, information retrieval, etc.
  • A common class of new, unknown words
Data Source Characterization
• Five corpora from five providers
• Annotation guidelines available, but varied across providers
• Simplified and traditional characters
• A range of encodings; all corpora available in Unicode (UTF-8)
• Provided in a common XML format, converted to train/test form (LDC)
Tasks and Tracks
• Tasks:
  • Word Segmentation:
    • Training and truth data: whitespace-delimited
    • End-of-word tags replaced with a space; no other markup
  • Named Entity Recognition:
    • Training and truth data: similar to the CoNLL two-column format (see the sketch below)
    • NAMEX only: LOC, PER, ORG (LDC adds GPE)
• Tracks:
  • Closed: only the provided materials may be used
  • Open: any materials may be used, but they must be documented
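To make the two training formats concrete, here is a minimal Python sketch of how such files might be read. The layout details (UTF-8 encoding, one sentence per line for segmentation data, blank-line-separated sentences and tab-separated token/tag columns for NER data) are illustrative assumptions, not taken from the bakeoff specification.

```python
# Illustrative readers for the two training formats described above.
# Assumptions (not from the bakeoff spec): UTF-8 files, one sentence per
# line for segmentation data, and a two-column "token<TAB>tag" layout
# with blank lines between sentences for the NER data.

def read_segmentation(path):
    """Each line is a sentence with words separated by whitespace."""
    with open(path, encoding="utf-8") as f:
        return [line.split() for line in f if line.strip()]

def read_ner(path):
    """CoNLL-style two-column data: token and tag, blank line between sentences."""
    sentences, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:                      # sentence boundary
                if current:
                    sentences.append(current)
                    current = []
                continue
            token, tag = line.split("\t")
            current.append((token, tag))
    if current:
        sentences.append(current)
    return sentences
```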
Structure: Participants & Timeline
• Participants:
  • 29 sites submitted runs for evaluation (36 initially registered)
  • 144 runs submitted: roughly 2/3 WS, 1/3 NER
  • Diverse groups: 11 PRC, 7 Taiwan, 5 US, 2 Japan, and 1 each from Singapore, Korea, Hong Kong, and Canada
  • A mix of commercial sites (MSRA, Yahoo!, Alias-i, FR Telecom, etc.) and academic sites
• Timeline:
  • March 15: Registration opened
  • April 17: Training data released
  • May 15: Test data released
  • May 17: Results due
Word Segmentation: Results
• Contrast: left-to-right maximal match segmentation (a sketch of this greedy method follows below)
  • Baseline: uses only the training vocabulary
  • Topline: uses only the testing vocabulary
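A minimal sketch of the left-to-right maximal-match contrast: a greedy longest-match segmenter over a fixed vocabulary. The function name, the fall-back to single characters, and the maximum word length are illustrative choices; the baseline would pass the training vocabulary as `vocab`, the topline the test vocabulary.

```python
def max_match(sentence, vocab, max_word_len=6):
    """Greedy left-to-right maximal match segmentation.

    At each position, take the longest dictionary word that matches;
    fall back to a single character if nothing matches (covers OOV items).
    """
    words, i = [], 0
    while i < len(sentence):
        for length in range(min(max_word_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + length]
            if length == 1 or candidate in vocab:
                words.append(candidate)
                i += length
                break
    return words

# Toy example (hypothetical vocabulary, for illustration only):
# vocab = {"研究", "研究生", "生命", "起源"}
# max_match("研究生命起源", vocab) -> ['研究生', '命', '起源']  (a typical greedy error)
```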
Word Segmentation: CityU (results charts for the CityU Closed and CityU Open tracks)
Word Segmentation: CKIP (results charts for the CKIP Closed and CKIP Open tracks)
Word Segmentation: MSRA (results charts for the MSRA Closed and MSRA Open tracks)
Word Segmentation: UPUC (results charts for the UPUC Closed and UPUC Open tracks)
Word Segmentation: Overview
• F-scores: 0.481-0.797 (see the metric sketch below)
• Best score: MSRA Open task (FR Telecom)
• Best relative to topline: CityU Open, >99% of topline
• Most frequent top rank: MSRA
• Both F-scores and OOV recall were higher in the Open track
• Overall good results: most systems outperform the baseline
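For orientation, the segmentation scores above are word-level precision, recall, and their harmonic mean F, plus recall restricted to out-of-vocabulary words. A minimal sketch of these metrics, assuming gold and system outputs are aligned sentence by sentence and words are compared by character spans (an illustrative assumption, not the official scoring script):

```python
def word_spans(words):
    """Convert a word sequence into (start, end) character spans."""
    spans, pos = set(), 0
    for w in words:
        spans.add((pos, pos + len(w)))
        pos += len(w)
    return spans

def prf(gold_sents, sys_sents, training_vocab=None):
    """Word-level precision/recall/F over aligned sentences; optional OOV recall."""
    correct = gold_total = sys_total = 0
    oov_correct = oov_total = 0
    for gold, sys in zip(gold_sents, sys_sents):
        g, s = word_spans(gold), word_spans(sys)
        correct += len(g & s)
        gold_total += len(g)
        sys_total += len(s)
        if training_vocab is not None:
            pos = 0
            for w in gold:
                span = (pos, pos + len(w))
                pos += len(w)
                if w not in training_vocab:   # gold word unseen in training
                    oov_total += 1
                    oov_correct += span in s
    p = correct / sys_total
    r = correct / gold_total
    f = 2 * p * r / (p + r)
    oov_recall = oov_correct / oov_total if oov_total else None
    return p, r, f, oov_recall
```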
Word Segmentation: Discussion
• Continuing OOV challenges
• Highest F-scores on MSRA
  • Also the highest topline and baseline
  • Lowest OOV rate
• Lowest F-scores on UPUC
  • Also the lowest topline and baseline
  • Highest OOV rate (more than double that of any other corpus)
  • Smallest corpus (roughly 1/3 the size of MSRA)
• Best scores on the most consistent corpus
  • Consistency of vocabulary and annotation
• UPUC also varies in genre: training from CTB; test from CTB, newswire (NW), and broadcast news (BN)
NER Results
• Contrast: baseline
  • Label a token as a named entity if it occurred with a single, unique tag in the training data (sketched below)
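A sketch of that baseline under stated assumptions: collect, for each training token, the set of tags it occurs with, and at test time assign an entity tag only to tokens seen with exactly one non-"O" tag; everything else is left unlabeled. The "O" label and per-token (rather than per-character) handling are illustrative assumptions, not details from the bakeoff description.

```python
from collections import defaultdict

def train_unique_tag_baseline(train_sents):
    """train_sents: list of sentences, each a list of (token, tag) pairs."""
    tags_seen = defaultdict(set)
    for sent in train_sents:
        for token, tag in sent:
            tags_seen[token].add(tag)
    # Keep only tokens that occur with exactly one tag, and that tag is an entity tag.
    return {tok: next(iter(tags)) for tok, tags in tags_seen.items()
            if len(tags) == 1 and "O" not in tags}

def tag_with_baseline(tokens, unique_tags):
    """Label each test token with its unique training tag, otherwise 'O'."""
    return [(tok, unique_tags.get(tok, "O")) for tok in tokens]
```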
NER Results: CityU (results charts for the CityU Closed and CityU Open tracks)
NER Results: LDC (results charts for the LDC Closed and LDC Open tracks)
NER Results: MSRA (results charts for the MSRA Closed and MSRA Open tracks)
NER: Overview
• Overall results:
  • Best F-score: MSRA Open track, 0.91
  • Strong overall performance: only two results fell below the baseline
• Direct comparison of NER Open vs. Closed:
  • Difficult: only two sites participated in both tracks
  • Only MSRA had a large number of runs
  • There, Open outperformed Closed: the top three Open runs scored above the Closed runs
NER Observations
• Named Entity Recognition challenges: tagsets, annotation variation, and corpus size
• Results on MSRA/CityU were much better than on LDC
  • The LDC corpus is substantially smaller
  • It also uses a larger tagset, adding GPE
  • GPE is easily confused with ORG or LOC
• NER results are sensitive to corpus size, tagset, and genre
Conclusions & Future Challenges
• Strong, diverse participation in WS & NER
  • Many effective, competitive results
• Cross-task, cross-evaluation comparisons
  • Still difficult
  • Scores are sensitive to corpus size, annotation consistency, tagset, genre, etc.
  • Need a corpus- and configuration-independent measure of progress
  • Encourage submissions that support such comparisons
  • Extrinsic, task-oriented evaluation of WS/NER
• Continuing challenges: OOV words, annotation consistency, encoding combinations and variation, code-switching
Thanks
• Data providers:
  • Chinese Knowledge Information Processing Group, Academia Sinica, Taiwan: Keh-Jiann Chen, Henning Chiu
  • City University of Hong Kong: Benjamin K. Tsou, Olivia Oi Yee Kwong
  • Linguistic Data Consortium: Stephanie Strassel
  • Microsoft Research Asia: Mu Li
  • University of Pennsylvania / University of Colorado: Martha Palmer, Nianwen Xue
• Workshop co-chairs: Hwee Tou Ng and Olivia Oi Yee Kwong
• All participants!