Named Entity Recognition and Transliteration for 50 Languages Richard Sproat, Dan Roth, ChengXiang Zhai, Elabbas Benmamoun, Andrew Fister, Nadia Karlinsky, Alex Klementiev, Chongwon Park, Vasin Punyakanok, Tao Tao, Su-youn Yoon University of Illinois at Urbana-Champaign http://compling.ai.uiuc.edu/reflex The Second Midwest Computational Linguistics Colloquium (MCLC-2005) May 14-15 The Ohio State University
General Goals • Develop multilingual named entity recognition technology: focus on persons, places, organizations • Produce seed rules and (small) corpora for several LCTLs (Less Commonly Taught Languages) • Develop methods for automatic named entity transliteration • Develop methods for tracking names in comparable corpora Sproat et al.: NER and Transliteration for 50 Languages
Languages • Languages for seed rules: Chinese, English, Spanish, Arabic, Hindi, Portuguese, Russian, Japanese, German, Marathi, French, Korean, Urdu, Italian, Turkish, Thai, Polish, Farsi, Hausa, Burmese, Sindhi, Yoruba, Serbo-Croatian, Pashto, Amharic, Indonesian, Tagalog, Hungarian, Greek, Czech, Swahili, Somali, Zulu, Bulgarian, Quechua, Berber, Lingala, Catalan, Mongolian, Danish, Hebrew, Kashmiri, Norwegian, Wolof, Bamanankan, Twi, Basque. • Languages for (small) corpora: Chinese, Arabic, Hindi, Marathi, Thai, Farsi, Amharic, Indonesian, Swahili, Quechua.
Milestones • Resources for various languages: • NER seed rules for: Armenian, Persian, Swahili, Zulu, Hindi, Russian, Thai • Tagged corpora for: Chinese, Arabic, Korean • Small tagged corpora for: Armenian, Persian, Russian (10-20K words) • Named Entity recognition technology: • Ported NER technology from English to Chinese, Arabic, Russian and German • Name transliteration: Chinese-English, Arabic-English, Korean-English
Linguistic/Orthographic Issues • Capitalization • Word boundaries • Phonetic vs. orthographic issues in transliteration
Named Entity Recognition
Multi-lingual Text Annotator Annotate any word in a sentence by selecting the word and an available category. It's also possible to create new categories. http://l2r.cs.uiuc.edu/~cogcomp/ner_applet.php
Multi-lingual Text Annotator View text in other encodings. New language encodings are easily added in a simple text file mapping. http://l2r.cs.uiuc.edu/~cogcomp/ner_applet.php
Motivation for Seed Rules “The only supervision is in the form of 7 seed rules (namely, that New York, California and U.S. are locations; that any name containing Mr. is a person; that any name containing Incorporated is an organization; and that I.B.M. and Microsoft are organizations).” [Collins and Singer, 1999]
Seed Rules: Thai
• Something including and to the right of นาย is likely to be a person
• Something including and to the right of นาง is likely to be a person
• Something including and to the right of นางสาว is likely to be a person
• Something including and to the right of น.ส. is likely to be a person
• Something including and to the right of คุณ is likely to be a person
• Something including and to the right of เด็กหญิง is likely to be a person
• Something including and to the right of ด.ญ. is likely to be a person
• Something including and to the right of พ.ต.อ. is likely to be a person
• Something including and to the right of พล.ต.ต. is likely to be a person
• Something including and to the right of พล.ต.ท. is likely to be a person
• Something including and to the right of พล.ต.อ. is likely to be a person
• Something including and to the right of ส.ส. is likely to be a person
• ทักษิณ ชินวัตร is a person
• ทักษิณ is likely a person
• ชวน หลีกภัย is a person
• บรรหาร ศิลปอาชา is a person
Seed Rules: Thai
• Something including and in between บริษัท and จำกัด is likely to be an organization
• Something including and to the right of บจก. is likely to be an organization
• Something including and in between บริษัท and จำกัด (มหาชน) is likely to be an organization
• Something including and in between บจก. and (มหาชน) is likely to be an organization
• Something including and to the right of ห้างหุ้นส่วนจำกัด is likely to be an organization
• Something including and to the right of หจก. is likely to be an organization
• สำนักนายกรัฐมนตรี is an organization
• วุฒิสภา is an organization
• แพทยสภา is an organization
• พรรคไทยรักไทย is an organization
• พรรคประชาธิปัตย์ is an organization
• พรรคชาติไทย is an organization
• Something including and to the right of จังหวัด is likely to be a location
• Something including and to the right of จ. is likely to be a location
• Something including and to the right of อำเภอ is likely to be a location
• Something including and to the right of ตำบล is likely to be a location
• กรุงเทพมหานคร is a location
• เชียงใหม่ is a location
• เชียงราย is a location
• ขอนแก่น is a location
Seed Rules: Armenian
• CityName = CapWord [ քաղաք | մայրաքաղաք ]
• StateName = CapWord նահանգ
• CountryName1 = CapWord երկիր
• PersonName1 = TITLE? FirstName? LastName
• LastName = [Ա-Ֆ].*յան
• FirstName = [FirstName1 | FirstName2]
• FirstName1 = [Ա-Ֆ]\.
• FirstName2 = [Ա-Ֆ].*
• PersonNameForeign = TITLE FirstName? CapWord? CapWord
• PersonAny = PersonName1 | PersonNameForeign
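The Armenian person-name patterns are close to regular expressions, so two of them can be tried out directly. The following is a minimal Python sketch, assuming input that is already split into whitespace-delimited tokens; the function name `looks_like_person` and the sample tokens are illustrative and not part of the original rule set, and TITLE and FirstName2 are left out for brevity.

```python
import re

# LastName = [Ա-Ֆ].*յան : an Armenian capital letter, anything, then the suffix յան.
LAST_NAME = re.compile(r"[Ա-Ֆ].*յան")
# FirstName1 = [Ա-Ֆ]\. : a capital initial followed by a period.
INITIAL = re.compile(r"[Ա-Ֆ]\.")

def looks_like_person(tokens):
    """True if tokens match (FirstName1)? LastName; a simplification of PersonName1."""
    if len(tokens) == 1:
        return bool(LAST_NAME.fullmatch(tokens[0]))
    if len(tokens) == 2:
        return bool(INITIAL.fullmatch(tokens[0]) and LAST_NAME.fullmatch(tokens[1]))
    return False

# Illustrative tokens: a surname ending in յան, with and without an initial.
print(looks_like_person(["Սարգսյան"]))
print(looks_like_person(["Ս.", "Սարգսյան"]))
print(looks_like_person(["երկիր"]))
```

Using `fullmatch` on single tokens keeps the greedy `.*` in the LastName rule from running across word boundaries.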
Armenian Lexicon
• Lexicon GEODESC: արեւելյան արեւմտյան …
• Lexicon PLACEDESC: պանդոկ պալատ …
• Lexicon ORGDESC: միություն ժողով …
• Lexicon COMPDESC: գործակալություն ընկերություն…
• Lexicon TITLE: տիկին Տկն…
Seed Rules: Persian
• Lexicon TITLE: آقاي, دکتر, خانم, جناب, بانو, مهندس
• Lexicon OrgDesc: استانداري, وزارت, دولت, رژيم, شهرداري, انجمن
• Lexicon POSITION: رئيس جمهور, رييس جمهوري, پرزيدنت, ديپلمات
Descriptors for named entities
• Lexicon PerDesc: سابق, آينده
• Lexicon CityDesc: شهر, شهرک, پايتخت
• Lexicon CountryDesc: کشور
Seed Rules: Swahili
People Rules
• Something including and to the right of Bw. is likely to be a person.
• Something including and to the right of Bi. is likely to be a person.
• A capitalized word to the right of bwana, together with the word bwana, is likely to be a person.
• A capitalized word to the right of bibi, together with the word bibi, is likely to designate a person.
Place Rules
• A capitalized word to the right of a word ending with -jini is likely to be a place.
• A capitalized word starting with the letter U is likely to be a place.
• A word ending in ni is likely to be a place.
• A sequence of words including and following the capitalized word Uwanja is likely a place.
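Seed rules of this kind translate almost directly into patterns. The sketch below encodes three of the Swahili rules as Python regular expressions; the `SEED_RULES` table, the `apply_seed_rules` function, and the sample sentence are assumptions made for illustration, not the project's actual rule engine.

```python
import re

# A capitalized word, approximated here as an ASCII capital followed by lowercase letters.
CAP = r"[A-Z][a-z]+"

# (label, pattern) pairs; each pattern mirrors one seed rule.
SEED_RULES = [
    # Something including and to the right of Bw. or Bi. is likely a person.
    ("PER", re.compile(rf"\b(?:Bw|Bi)\.\s+{CAP}")),
    # A capitalized word to the right of bwana/bibi, together with that word.
    ("PER", re.compile(rf"\b(?:[Bb]wana|[Bb]ibi)\s+{CAP}")),
    # A capitalized word to the right of a word ending with -jini.
    ("LOC", re.compile(rf"(?<=jini\s){CAP}")),
]

def apply_seed_rules(text):
    """Return (label, matched text) pairs in rule order."""
    hits = []
    for label, pattern in SEED_RULES:
        for match in pattern.finditer(text):
            hits.append((label, match.group(0)))
    return hits

# Made-up sentence used only to exercise the patterns.
print(apply_seed_rules("Bw. Juma alikwenda mjini Arusha na bibi Asha"))
```

The broader rules (any word ending in ni, any capitalized word starting with U) are omitted because they overgenerate on their own; in a bootstrapping setup such as Collins and Singer's they would serve only as weak initial evidence.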
Named Entity Recognition • Identify entities of specific types in text (e.g. people, locations, dates, organizations, etc.) After receiving his M.B.A. from [ORG Harvard Business School], [PER Richard F. America] accepted a faculty position at the [ORG McDonough School of Business] in [LOC Washington].
Named Entity Recognition • Not an easy problem since entities: • Are inherently ambiguous (e.g. JFK can be both location and a person depending on the context) • Can appear in various forms (e.g. abbreviations) • Can be nested, etc. • Are too numerous and constantly evolving (cf. Baayen, H. 2000. Word Frequency Distributions. Kluwer. Dordrecht.)
Named Entity Recognition Two tasks (sometimes, done simultaneously): • Identify the named entity phrase boundaries (segmentation) • May need to respect constraints: • Phrases do not overlap • Phrase order • Phrase length • Classify the phrases (classification)
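One common way to express segmentation and classification together is a per-token B/I/O encoding, where B- and I- mark phrase boundaries and the suffix carries the class. The sketch below converts the bracketed notation used in the example sentence into such tags; the B/I/O scheme and the `to_bio` function are assumptions for illustration, since the slides do not say which encoding the system uses.

```python
import re

# Either a bracketed phrase "[LABEL word word ...]" or a plain whitespace-delimited token.
TOKEN = re.compile(r"\[(\w+) ([^\]]+)\]|(\S+)")

def to_bio(annotated):
    """Convert bracket-annotated text into (token, tag) pairs with B-/I-/O tags."""
    tagged = []
    for match in TOKEN.finditer(annotated):
        label, phrase, plain = match.groups()
        if label:
            words = phrase.split()
            tagged.append((words[0], "B-" + label))               # phrase start
            tagged.extend((w, "I-" + label) for w in words[1:])   # phrase continuation
        else:
            tagged.append((plain, "O"))                           # outside any phrase
    return tagged

for token, tag in to_bio("at the [ORG McDonough School of Business] in [LOC Washington]."):
    print(token, tag)
```

Because every token receives exactly one tag, the "phrases do not overlap" constraint holds by construction; nested entities would need a richer encoding.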
Identifying phrase properties with sequential constraints
• View as an inference-with-classifiers problem. Three models [Punyakanok & Roth NIPS’01] http://l2r.cs.uiuc.edu/~danr/Papers/iwclong.pdf
• HMMs
  • HMM with classifiers
• Conditional Models
  • Projection-based Markov model
• Constraint Satisfaction Models
  • Constraint satisfaction with classifiers
• Other models proposed
  • CRF
  • Structured Perceptron
• A model comparison in the context of the SRL problem [Punyakanok et al IJCAI’05]
[Figure: two chain models over states s1 to s6 and observations o1 to o6, with the annotation "Most common"]
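For the HMM case, inference over the state sequence is typically done with the Viterbi algorithm. The following is a textbook Viterbi sketch in Python, not the authors' classifier-based system; the two states and all probabilities in the usage example are invented for illustration.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Most likely HMM state sequence (plain probabilities, no log space)."""
    # best[t][s] = (probability of the best path ending in state s at time t, previous state)
    best = [{s: (start_p[s] * emit_p[s].get(observations[0], 0.0), None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            column[s] = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s].get(obs, 0.0), p)
                for p in states
            )
        best.append(column)
    # Trace back from the most probable final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return list(reversed(path))

# Toy model: O = outside any name, PER = inside a person name (all numbers invented).
states = ["O", "PER"]
start_p = {"O": 0.8, "PER": 0.2}
trans_p = {"O": {"O": 0.8, "PER": 0.2}, "PER": {"O": 0.5, "PER": 0.5}}
emit_p = {
    "O": {"Mr.": 0.1, "Smith": 0.01, "said": 0.5},
    "PER": {"Mr.": 0.3, "Smith": 0.4, "said": 0.01},
}
print(viterbi(["Mr.", "Smith", "said"], states, start_p, trans_p, emit_p))
```

In the "HMM with classifiers" variant listed on the slide, the emission and transition tables would be replaced by scores from learned classifiers; the dynamic program itself stays the same.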
Adaptation • Most approaches in NER are targeted toward a specific setting: language, subject, set of tags, etc. • Labeled data may be hard to acquire for each particular setting • Trained classifiers tend to be brittle when moved even just to a related subject • We consider the problem of exploiting the hypothesis learned in one setting to improve learning in another. • Kinds of adaptation that can be considered: • Across corpora within a domain • Across domains • Across annotation methodologies • Across languages
Adaptation Example
• Train on:
  • Reuters + increasing amounts of NYT
  • No Reuters, just increasing amounts of NYT
• Test on: NYT
• Performance on NYT increases quickly as the classifier is trained on examples from NYT
• Starting with an existing classifier trained on a related corpus (Reuters) is better than starting from scratch
[Figure: results for a classifier trained on Reuters and tested on NYT, and for one trained on Reuters + 13% NYT and tested on NYT]
Current Architecture - Training
• Pre-process annotated corpus
• Extract features
• Train classifier
[Diagram components: Annotated Corpus, Sentence Splitter, Word Splitter, FEX, NER (SNoW-based), Network file, Honorifics, Features script, Gazetteers. Legend: italics mark setting-specific components; some components are marked optional]
Current Architecture - Tagging
• Pre-process corpus
• Extract features
• Run NER
[Diagram components: Corpus, Sentence Splitter, Word Splitter, FEX, NER (SNoW-based), Honorifics, Features script, Gazetteers, Network file, Annotated Corpus]
Extending Current Architecture to Multiple Settings
• Choose setting
• Pre-process, extract features and run NER
[Diagram components: Corpus, Document Classifier, settings (Chinese newswire, German biological, English news), Knowledge Engineering Components, Sentence Splitter, Word Splitter, Honorifics, Features script, Gazetteers, Network file, Annotated Corpus, FEX, NER (SNoW-based)]
Extending Current Architecture to Multiple Settings: Issues For each setting, we need: • Honorifics and gazetteers • Tuned sentence and word splitters • Types of features • Tagged training corpus • Work is being done to move tags across parallel corpora (if available)
Extending Current Architecture to Multiple Settings: Issues If parallel corpora are available and one is annotated, may be able to use Stochastic Inversion Transduction Grammars to move tags across corpora [Wu, Computational Linguistics ‘97] • Generate bilingual (annotated and unannotated parallel corpora) parses • Use ITGs as a filter to deem sentence/phrase pairs as parallel enough • For those that are, simply move the label from annotated to the unannotated phrase in same parse tree node. • Use the now tagged examples as training corpus
Extending Current Architecture to Multiple Settings • Baseline experiments with Arabic, German, and Russian: • E.g. For Russian with no honorifics, gazetteers, features tuned for English, and imperfect sentence splitter we still get about 77% precision and 36% recall. NB: Used small hand-constructed corpus of approx. 15K wds, 1,300 NE (80/20 split)
Summary • Seed rules and corpora for subset of 50 languages • Adapted NER system for English to other languages • Demonstrated adaptation of NER system to other settings • Experimenting with ITG as basis for annotation transplantation
Methods of Transliteration
Comparable Corpora
三号种子龚睿那今晚以两个11:1轻取丹麦选手蒂・拉斯姆森,张宁在上午以11:2和11:9淘汰了荷兰的于・默伦迪克斯,周蜜在下午以11:4和11:1战胜了中国香港选手凌婉婷。
In the day's other matches, second seed Zhou Mi overwhelmed Ling Wan Ting of Hong Kong, China 11-4, 11-4, Zhang Ning defeat Judith Meulendijks of Netherlands 11-2, 11-9 and third seed Gong Ruina took 21 minutes to eliminate Tine Rasmussen of Denmark 11-1, 11-1, enabling China to claim five quarterfinal places in the women's singles.
Transliteration in Comparable Corpora • Take the newspapers for a day in any set of languages: a lot of them will have names in common. • Given a name in one language, find its transliteration in a similar text in another language. • How can we make use of: • Linguistic factors such as similar pronunciations • Distributional factors • Right now we use partly supervised methods (e.g. we assume small training dictionaries) • We are aiming for largely unsupervised methods (in particular, no training dictionary)
Some Comparable Corpora • We have (from the LDC) comparable text corpora for: • English (19M words) • Chinese (22M characters) • Arabic (8M words) • Many more such corpora can, in principle, be collected from the web
How Chinese Transliteration Works • About 500 characters tend to be used for foreign words • Attempt to mimic the pronunciation • But lots of alternative ways of doing it
Transliteration Problem • Many applications of transliteration have been in machine translation [Knight&Graehl, 1998; Al-Onaizan&Knight, 2002; Gao, 2004]: • What’s the best translation of this Chinese name? • Our problem is slightly different: • Are these two names the same? • Want to be able to reject correspondences • Assign 0 probability to some unseen cases in training data
Approaches to Transliteration • Much work using the source-channel approach: • Cast as a problem where you have a clean “source” (e.g. a Chinese name) and a “noisy channel” that “corrupts” the source into the observed form (e.g. an English name): • $P(E \mid C)\,P(C)$ • E.g.: $P(f_{i,E}\, f_{i+1,E}\, f_{i+2,E} \dots f_{i+n,E} \mid s_C)$ • Chinese characters represent syllables ($s$); we match these to sequences of English phonemes ($f$)
Resources • Small dictionary of 721 (mostly English) names and their Chinese transliterations • Large dictionary of about 1.6 million names from LDC
General Approach • Train a tight transliteration model from a dictionary of known transliterations • Identify names in English news text for a given day using an existing named entity recognizer • Process same day of Chinese text looking for sequences of characters used in foreign names • Do an all-pairs match using the transliteration model to find possible transliteration pairs
Model Estimation • Seek to estimate P(e|c) where e is a sequence of words in Roman script and c is a sequence of Chinese characters • We actually estimate P(e’|c’), where e’ is the pronunciation of e and c’ is the pronunciation of c. • We decompose the estimate of P(e’|c’) as: • Chinese transliteration matches syllables to similar-sounding spans of foreign phones. So the c’_i are syllables, and the e’_i are subsequences of the English phone string
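A per-syllable factorization consistent with this description can be written out as below. It is stated as an assumption about the form of the decomposition, with $n$ the number of Chinese syllables and each $e'_i$ the English phone subsequence aligned to syllable $c'_i$; it is not a reproduction of the authors' own equation.

```latex
\[
  P(e' \mid c') \;\approx\; \prod_{i=1}^{n} P(e'_i \mid c'_i)
\]
```

Under this form, each factor is a syllable-to-phone-span probability of the kind estimated from aligned training pairs.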
Model Estimation • Align phone strings using modified Sankoff/Kruskal algorithm • For each Chinese s, allow an English phone string f to correspond just in case the initial of s corresponds to the initial of f some minimum number of times in training • Smooth probabilities using Good-Turing • Distribute unseen probability mass over unseen cases non-uniformly according to a weighting scheme
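In Good-Turing smoothing, the total probability mass reserved for unseen events is estimated as N1/N, the number of event types seen exactly once divided by the total number of observations. The sketch below computes that quantity; the function name and the sample alignment counts are made up for illustration, and the non-uniform redistribution of this mass is not shown.

```python
def unseen_mass(pair_counts):
    """Good-Turing estimate of the probability mass of unseen events: N1 / N."""
    total = sum(pair_counts.values())                              # N: all observations
    singletons = sum(1 for c in pair_counts.values() if c == 1)    # N1: types seen once
    return singletons / total if total else 0.0

# Hypothetical counts of (Chinese syllable, English phone string) alignments.
counts = {("ma", "m aa"): 3, ("li", "l iy"): 1, ("ke", "k ey"): 1, ("si", "s ih"): 5}
print(unseen_mass(counts))
```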
Model Estimation • We estimate the probability for a given unseen case as follows: • Where: • P(n0) is the probability of unseen cases according to the Good-Turing smoothing • P(len(e)=m|len(c)=n) is the probability of a Chinese syllable of length n corresponding to an English phone sequence of length m • count(len(e)=m) is the type count of phone sequences of length m (estimated from 194,000 pronunciations produced by the Festival TTS system on the XTag dictionary)
Some Automatically Found Pairs Pairs found in same day of newswire text
Further Pairs
Time Correlations • When some major event happens (e.g., the tsunami disaster), it is very likely covered by news articles in multiple languages • Each event/topic tends to have its own “associated vocabulary” (e.g., names such as Sri Lanka, India may occur in recent news articles) • We thus will likely see that the frequency of a name such as Sri Lanka will peak as compared with other time periods and the pattern is likely the same across languages • cf. [Kay and Roscheisen, CL, 1993; Kupiec, ACL, 1993; Rapp, ACL, 1995; Fung, WVLC, 1995]
Construct Term Distributions over Time
[Figure: documents arranged by day (Day 1, Day 2, Day 3, …, Day n) along a time line; the frequency of a term on each day is normalized to obtain a distribution]
Measure Correlations of English and Chinese Word Pairs
• Pearson correlation scores lie in [-1, 1]
• Good correlation: Megawati-English vs. Megawati-Chinese, corr = 0.885
• Bad correlation: Megawati-English vs. Arafat-Chinese, corr = 0.0324
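The two steps, normalizing daily counts into a distribution and then correlating an English term with a Chinese candidate, can be sketched in plain Python as follows. The function names and the daily counts are invented for illustration; only the use of Pearson correlation comes from the slides.

```python
from math import sqrt

def to_distribution(daily_counts):
    """Normalize per-day term frequencies so that they sum to 1."""
    total = sum(daily_counts)
    return [c / total for c in daily_counts] if total else [0.0] * len(daily_counts)

def pearson(x, y):
    """Pearson correlation of two equal-length sequences, in [-1, 1]."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0

# Invented daily counts over five days for one English term and two Chinese terms.
english = to_distribution([0, 1, 8, 3, 0])
chinese_same_name = to_distribution([0, 2, 16, 6, 0])
chinese_other_name = to_distribution([5, 0, 0, 1, 6])
print(pearson(english, chinese_same_name))
print(pearson(english, chinese_other_name))
```

Normalizing first makes the comparison insensitive to the overall size difference between the English and Chinese corpora, although Pearson correlation is itself already invariant to positive scaling.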
Chinese Transliteration
• English term: Edmonton
• Candidate Chinese names from Chinese documents: 埃德蒙顿, 阿勒泰, 埃丁顿, 阿马纳, 阿亚德, 埃蒂纳罗, …
• Rank candidates: 埃德蒙顿 0.96, 阿勒泰 0.91, 埃丁顿 0.88, 阿马纳 0.75, …
• Methods:
  • Phonetic approach
  • Frequency correlation
  • Combination
Evaluation
• Method1 (Freq+PhoneticFilter): use the phonetic method to select Chinese candidates for an English term (e.g. Edmonton: 埃德蒙顿, 阿勒泰, 埃丁顿, 阿马纳, 阿亚德, 埃蒂纳罗, …), compute the correlation, and rank the candidates by correlation score
• Method2 (Freq+PhoneticScore): linearly combine the correlation scores with the phonetic scores (half/half)
• MRR: Mean Reciprocal Rank
• AllMRR: evaluation over all English names
• CoreMRR: evaluation over just names with a found Chinese correspondence
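The half/half combination and the MRR metric are both short computations. The sketch below shows one way to write them, assuming the two scores are already on comparable scales; the function names, the second test term, and the rankings are hypothetical.

```python
def combine(corr_score, phon_score, weight=0.5):
    """Linear combination of correlation and phonetic scores (half/half by default)."""
    return weight * corr_score + (1.0 - weight) * phon_score

def reciprocal_rank(ranked_candidates, correct):
    """1/rank of the correct candidate, or 0 if it is not in the list."""
    for rank, candidate in enumerate(ranked_candidates, start=1):
        if candidate == correct:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(rankings, gold):
    """Average reciprocal rank over all terms that have a gold answer."""
    return sum(reciprocal_rank(rankings.get(t, []), gold[t]) for t in gold) / len(gold)

# Hypothetical rankings: the correct answer is first for one term, second for the other.
rankings = {"Edmonton": ["埃德蒙顿", "阿勒泰", "埃丁顿"], "Eddington": ["埃德蒙顿", "埃丁顿"]}
gold = {"Edmonton": "埃德蒙顿", "Eddington": "埃丁顿"}
print(mean_reciprocal_rank(rankings, gold))
```

Under this reading, AllMRR and CoreMRR differ only in which terms appear in `gold`: all English names, or only those whose Chinese counterpart was actually found.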
Summary and Future Work • So far: • Phonetic transliteration models • Time correlation between name distributions • Work in progress: • Linguistic models: • Develop graphical model approach to transliteration • Semantic aspects of transliteration in Chinese: female names ending in –ia transliterated with 娅 ya rather than 亚 • Resource-poor transliteration for any pair of languages • Document alignment • Coordinated mixture models for document/word-level alignment
Graphical Models [Bilmes & Zweig 2002]
[Figure: graphical model with nodes labeled character counter, end, character, character transition, Chinese phone transition, Chinese phone, English phone]
Semantic Aspects of Transliteration • Phonological model doesn’t capture semantic/orthographic features of transliteration: • Saint, San, Sao, … use 圣 sheng ‘holy’ • Female names ending in –ia transliterated with 娅 ya rather than 亚 ya • Such information boosts evidence that two strings are transliterations of each other • Consider gender. For each character c: • compute log-likelihood ratio abs(log(P(f|c)/P(m|c))) • build a decision list ranked by decreasing LLR
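The gender decision list can be sketched in a few lines: for each character, estimate P(f|c) and P(m|c) from counts in female and male names, take abs(log(P(f|c)/P(m|c))), and sort. The add-alpha smoothing, the function name, and the counts below are assumptions for illustration; the slides do not specify how the probabilities are smoothed.

```python
from math import log

def gender_decision_list(female_counts, male_counts, alpha=0.5):
    """Return (character, predicted gender, LLR) triples, strongest evidence first."""
    rules = []
    for c in set(female_counts) | set(male_counts):
        f = female_counts.get(c, 0) + alpha   # smoothed count in female names
        m = male_counts.get(c, 0) + alpha     # smoothed count in male names
        p_f, p_m = f / (f + m), m / (f + m)   # P(f|c) and P(m|c)
        llr = abs(log(p_f / p_m))
        rules.append((c, "f" if p_f > p_m else "m", llr))
    rules.sort(key=lambda rule: rule[2], reverse=True)
    return rules

# Invented counts: 娅 occurs almost only in female names, 亚 mostly in male names.
female = {"娅": 9, "亚": 2}
male = {"亚": 8}
for char, gender, llr in gender_decision_list(female, male):
    print(char, gender, round(llr, 3))
```

Smoothing keeps the ratio finite for characters seen with only one gender, which would otherwise produce a division by zero or log of zero.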