20th of May 2004. Mixed-Lingual Entity Recognition. Beatrice Alex, School of Informatics, The University of Edinburgh. Named Entity Recognition. What is a named entity (NE)? A string that refers to a particular kind of object in the world, e.g.
20th of May 2004
Mixed-Lingual Entity Recognition
Beatrice Alex
School of Informatics, The University of Edinburgh
Named Entity Recognition
• What is a named entity (NE)? A string that refers to a particular kind of object in the world, e.g.
  “John Lennon” = NE of type person
  “T-Mobile” = NE of type organisation
  “Edinburgh” = NE of type location
• How are they recognised? By using internal context (evidence within the string itself) and external context (the surrounding words)
NER Methods
• Rule-based
  • hand-written patterns
  • rely on punctuation, capitalisation and other features in the text
• Statistical
  • data-driven approaches
  • exploit the statistical properties of real language to learn models
• Hybrid methods
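To make the rule-based bullet concrete, here is a minimal sketch of one hand-written capitalisation pattern. The rule itself (two or more consecutive capitalised words form a candidate NE) and the name `find_candidates` are illustrative assumptions, not patterns taken from the talk.

```python
import re
from typing import List

# Hypothetical rule: a run of two or more capitalised words is a candidate NE.
CANDIDATE = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+\b")


def find_candidates(text: str) -> List[str]:
    """Return every span of consecutive capitalised words found in the text."""
    return CANDIDATE.findall(text)


print(find_candidates("We met John Lennon in Edinburgh."))
```

A rule this simple misses single-word names such as "Edinburgh", and it would over-generate on German text, where all nouns are capitalised.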
PhD Proposal
Supervisors: Claire Grover, Stephen Clark
• Proposed research topic: mixed-lingual NER, i.e. the detection and classification of NEs that are in a language different from the base language of the text
• Examples:
  „Das Central Command erklärte, das Schicksal des Piloten sei noch ungeklärt.“
  “Germany's Die Welt reports that four people died in the heat wave last week.”
Background and Motivation
• Multi-lingual and language-independent NER: an active research area in NLP circles (MET-1/2, CoNLL02/03)
• Many errors in German NER are due to the amount of foreign-language material in German articles (Rössler, 2002)
• Mixed-lingual NER: unspecified in, or beyond the capabilities of, existing approaches
Beneficiaries
• Performance improvements for applications where NER is routinely applied (IE, QA, text summarisation, topic identification)
• Valuable information for polyglot TTS synthesis
• Pre-processing tool for MT systems
Denglish
• English: dominant language of science & technology, air-traffic control, advertising
• Increasing influence on German
  The live event was really cool. There were tickets, fast food, drinks in the basement.
Preliminary Research
• Analysis of English inclusions in German newspaper articles from three domains: (1) Internet & Telecoms, (2) EU and (3) space travel
• Corpus: 16,000 tokens per domain from a German newspaper (FAZ)
• Automatic classification of English tokens tagged NN (noun) or FM (foreign material) by means of a simple lookup procedure
• More than 90% of all English inclusions are nouns (Yang, 1999; Yeandle, 2001; Corr, 2003)
1. Lookup Procedure
• CELEX lookup (NN|FM) in German and English databases
  • only in German database → DE
  • only in English database → EN
  • in both databases:
    Computer, Trend, Monster
    Generation, Union, Mission
    Art, Tag, Rat, Fall, All
  • in neither database → 2. lookup procedure
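The decision logic of this first lookup can be sketched as follows. The two small word sets are toy stand-ins for the German and English CELEX databases, and the labels `BOTH` and `UNKNOWN` are assumptions for illustration; the slide does not say how tokens found in both databases are resolved.

```python
# Toy stand-ins for the German and English CELEX word lists (assumed data).
GERMAN_LEXICON = {"schicksal", "computer", "tag", "union"}
ENGLISH_LEXICON = {"central", "command", "computer", "tag", "union"}


def lexicon_lookup(token: str) -> str:
    """Classify a token by membership in the two lexicons."""
    word = token.lower()
    in_german = word in GERMAN_LEXICON
    in_english = word in ENGLISH_LEXICON
    if in_german and in_english:
        return "BOTH"      # ambiguous, e.g. Computer, Tag, Union
    if in_german:
        return "DE"
    if in_english:
        return "EN"
    return "UNKNOWN"       # handed on to the second lookup procedure


for token in ["Schicksal", "Command", "Computer", "Mausklick"]:
    print(token, lexicon_lookup(token))
```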
2. Lookup Procedure
• Google lookup with language preference, applied to tokens such as:
  • German compounds: Mausklick (mouse click)
  • English unhyphenated compounds: Homepage
  • Mixed-lingual unhyphenated compounds: Shuttleflug (shuttle flight)
  • English nouns with German inflections: Receivern
  • Abbreviations and acronyms: GPS, UKW
  • Words with spelling mistakes: Abruch (for Abbruch, abortion)
  • English words with American spelling: Center
• Classification based on number of hits
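A minimal sketch of classification by hit counts, assuming the decision is a plain comparison of the number of hits returned under each language preference. No real search API is called: `toy_hit_count` and the numbers in `TOY_HITS` are invented for illustration and are not results from the study.

```python
from typing import Callable, Dict, Tuple

# Invented hit counts for illustration only; these are NOT real search results.
TOY_HITS: Dict[Tuple[str, str], int] = {
    ("Mausklick", "de"): 900, ("Mausklick", "en"): 30,
    ("Homepage", "de"): 400, ("Homepage", "en"): 5000,
}


def toy_hit_count(token: str, language: str) -> int:
    """Hypothetical stand-in for a search restricted to pages in one language."""
    return TOY_HITS.get((token, language), 0)


def classify_by_hits(token: str,
                     hit_count: Callable[[str, str], int] = toy_hit_count) -> str:
    """Label the token with the language that returns more hits."""
    german_hits = hit_count(token, "de")
    english_hits = hit_count(token, "en")
    if german_hits == english_hits:
        return "UNKNOWN"   # no evidence either way
    return "DE" if german_hits > english_hits else "EN"


print(classify_by_hits("Mausklick"), classify_by_hits("Homepage"))
```

Passing the hit-count function as a parameter keeps the comparison logic testable without network access.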
Results
• Output: Das <EN>Central</EN> <EN>Command</EN> erklärte, das <DE>Schicksal</DE> des <DE>Piloten</DE> sei noch ungeklärt.
  EN (English translation): Central Command explained, the fate of the pilot is still unclear.
  MT (machine translation): CentralCommand explained, the fate of the pilot was still unsettled.
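The inline tag format of the output above can be produced from per-token labels roughly as follows. The function name `tag_sentence` and the convention of using `None` for unclassified tokens are assumptions for this sketch.

```python
from typing import List, Optional


def tag_sentence(tokens: List[str], labels: List[Optional[str]]) -> str:
    """Wrap each labelled token in <LABEL>...</LABEL>; leave the rest as is."""
    pieces = []
    for token, label in zip(tokens, labels):
        if label is None:
            pieces.append(token)
        else:
            pieces.append(f"<{label}>{token}</{label}>")
    return " ".join(pieces)


tokens = ["Das", "Central", "Command", "erklärte"]
labels = [None, "EN", "EN", None]
print(tag_sentence(tokens, labels))
```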
Error Analysis
• Sources of error:
  • Wrong POS tags
  • Mixed-lingual unhyphenated compounds
  • New internationalisms
  • Abbreviations with several expansions
  • Unreliable Google hits
  • Inclusions from other languages
• Need for better handling of NEs
• Morpheme-level analysis for compounds
• Extension to other POS tags
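One possible form of morpheme-level analysis for unhyphenated compounds is to try every two-way split and look up both parts. This is only a sketch under stated assumptions: the lexicons are toy sets, `split_compound` is a hypothetical helper, and only two-part compounds without linking elements are handled.

```python
from typing import List, Optional, Set, Tuple


def split_compound(word: str,
                   german_lexicon: Set[str],
                   english_lexicon: Set[str]) -> Optional[List[Tuple[str, str]]]:
    """Return [(part, language), (part, language)] for the first split
    whose two parts are both found in a lexicon, or None if there is none."""
    lower = word.lower()
    lexicons = (("EN", english_lexicon), ("DE", german_lexicon))
    for i in range(2, len(lower) - 1):          # each part has at least 2 letters
        head, tail = lower[:i], lower[i:]
        for head_lang, head_lexicon in lexicons:
            if head not in head_lexicon:
                continue
            for tail_lang, tail_lexicon in lexicons:
                if tail in tail_lexicon:
                    return [(head, head_lang), (tail, tail_lang)]
    return None


german = {"flug", "maus", "klick"}      # toy data
english = {"shuttle", "mouse"}          # toy data
print(split_compound("Shuttleflug", german, english))
```

A compound whose parts carry different language labels, as with Shuttleflug, could then be flagged as mixed-lingual rather than forced into a single class.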
Future Work
• Collection of more data and annotation for training and evaluation
• Development of a sequence-modelling classifier, e.g. maximum entropy
• Extension to other languages
• Application-based evaluation (e.g. MT)