Named Entity Recognition

1. Named Entity Recognition Beto Boullosa

2. Introduction Presentation Motivation Contents Information Extraction Named Entity Recognition (NER) An experiment with NER Conclusions

3. Information Extraction "Automatic identification of selected types of entities, relations or events in free text" (GRISHMAN, 2003) Related areas Information Retrieval, Knowledge Extraction IE x IR

4. Information Extraction Applications "Processing of natural language texts for the extraction of relevant content pieces" (MARTÍ AND CASTELLÓN, 2000) Raw texts => structured databases Template filling Improving search engines Auxiliary tool for other language applications

5. IE History Early projects Knowledge-based, rule-based FRUMP, 1979 Newswire LSP (Linguistic String Project), 1981 AMA (American Medical Association) Patient summaries

6. IE History MUC: Message Understanding Conferences (1987) DARPA, NRAD Standardization Evaluation Dissemination DARPA's TIPSTER Program: Document Detection, Summarization and Information Extraction, until 1998 TREC (Text Retrieval Conferences)

7. IE History MUC Evaluation standards (for the 1st time in MUC-2) Recall: $R = \frac{\text{correct units}}{\text{total units}}$ Precision: $P = \frac{\text{correct units}}{\text{units found}}$ F-Measure: $F = \frac{(\beta^2 + 1)\,P\,R}{\beta^2 P + R}$
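The three MUC scores above can be computed directly from counts. The following is a minimal sketch; the function names are illustrative and not taken from any MUC scoring tool.

```python
def precision(correct: int, found: int) -> float:
    """Correct units divided by units the system found."""
    return correct / found if found else 0.0


def recall(correct: int, total: int) -> float:
    """Correct units divided by units present in the answer key."""
    return correct / total if total else 0.0


def f_measure(p: float, r: float, beta: float = 1.0) -> float:
    """F = (beta^2 + 1) * P * R / (beta^2 * P + R); beta = 1 weighs P and R equally."""
    denom = beta ** 2 * p + r
    return (beta ** 2 + 1) * p * r / denom if denom else 0.0
```

With beta = 1, a precision of 70% and a recall of 75% (the figures reported for the Catalan experiment on slide 49) give `f_measure(0.70, 0.75)` of about 0.72.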

8. IE History MUC Template filling Mr. John Smith was appointed CEO of ACME last December 31.

9. IE History MUC-6 (1995) Extraction of Named Entities names of persons, organizations, locations temporal expressions, currency and percentages Extraction of Template Elements grouping of entity attributes together into entity "objects" Extraction of events (or Scenario Templates) Extraction of coreferences

10. IE History MUC-6 ENAMEX ("entity name expression") tag people, organizations and locations NUMEX ("numeric expression") tag currency and percentages TIMEX ("time expression") tag temporal expressions: dates and times

11. IE History MUC-6 Andrew Johnson was appointed last Sunday president of ACME, the biggest company in Santa Barbara, California, with an estimated $300 million market capacity. <ENAMEX TYPE="PERSON">Andrew Johnson</ENAMEX> was appointed <TIMEX TYPE="DATE">last Sunday</TIMEX> president of <ENAMEX TYPE="ORGANIZATION">ACME</ENAMEX>, the biggest company in <ENAMEX TYPE="LOCATION">Santa Barbara</ENAMEX>, <ENAMEX TYPE="LOCATION">California</ENAMEX>, with an estimated <NUMEX TYPE="MONEY">$300 million</NUMEX> market capacity.
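MUC-6 style inline annotation like the example above is easy to read back into a list of entities. The sketch below assumes straight double quotes around the TYPE attribute and no nested tags; the regular expression and function names are illustrative, not part of any MUC tool.

```python
import re

# Matches <ENAMEX TYPE="...">text</ENAMEX> and the TIMEX / NUMEX equivalents.
TAG_RE = re.compile(r'<(ENAMEX|TIMEX|NUMEX) TYPE="([A-Z]+)">(.*?)</\1>')

SAMPLE = (
    '<ENAMEX TYPE="PERSON">Andrew Johnson</ENAMEX> was appointed '
    '<TIMEX TYPE="DATE">last Sunday</TIMEX> president of '
    '<ENAMEX TYPE="ORGANIZATION">ACME</ENAMEX>'
)


def extract_entities(annotated: str) -> list[tuple[str, str, str]]:
    """Return (tag, type, text) for every annotated span, in document order."""
    return [(m.group(1), m.group(2), m.group(3)) for m in TAG_RE.finditer(annotated)]


def strip_tags(annotated: str) -> str:
    """Recover the raw sentence by keeping only the text inside each tag."""
    return TAG_RE.sub(lambda m: m.group(3), annotated)
```

Calling `extract_entities(SAMPLE)` yields one tuple per tagged span, and `strip_tags(SAMPLE)` returns the untagged sentence.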

12. IE History MUC-7 (1998) Tasks Named Entities (NE task) Template Element (TE task) Scenario Template (ST task) Template Relation (TR task) Coreferences (CO task) System portability among domains

13. IE History Domains used in MUCs:

14. IE History Results in MUC-6:

15. IE History Other conferences MET (Multilingual Entity Task Evaluation) Japanese NEs IREX Japan, 1998 Organization, Person, Location, Artifact, Date, Time, Money and Percent

16. IE History Other conferences HUB-4 and ACE (Automatic Content Extraction) NIST (National Institute of Standards and Technology) Spoken and printed text CoNLL (Conference on Natural Language Learning) Since 1997 NEs in the 2002 and 2003 editions Multilingual Person (PER), location (LOC), organization (ORG) and miscellaneous (MISC) classes

17. IE Techniques and tasks IE techniques: Document indexing + text understanding Document Indexing Tags texts with different descriptors, giving a kind of semantic representation for its contents Text Understanding Builds a knowledge representation of texts IE history: TU => DI More tractable perspective

18. IE Techniques and tasks FC Barcelona sold goalkeeper Valdés to Espanyol last August 14

19. IE Techniques and tasks Compare with:

20. IE Techniques and tasks Events and relations extraction Knowledge-based techniques Regular expressions and patterns Knowledge-poor approaches Machine learning, statistics Coreferences Anaphora resolution Cross-document

21. IE Techniques and tasks Performance: Events and relations extraction x Named entities extraction Why?

22. Named Entity Recognition Recognition x Classification "Name Identification and Classification" NER as: a tool or component of IE and IR an input module for a robust shallow parsing engine Component technology for other areas Question Answering (QA) Summarization Automatic translation Document indexing Text data mining Genetics ...

23. Named Entity Recognition NE Hierarchies Person Organization Location But also: Artifact Facility Geopolitical entity Vehicle Weapon Etc. SEKINE & NOBATA (2004) 150 types Domain-dependent

24. Named Entity Recognition Internal and external features (or evidence) Capitalization not all languages speech data trigger words "El senyor Balaguer vol comprar-se un cotxe nou." (Mr. Balaguer wants to buy himself a new car.) "La ciutat de Balaguer és tot un compendi de història de Catalunya." (The city of Balaguer is a whole compendium of the history of Catalonia.)
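The two Balaguer sentences show why external evidence matters: the name is identical, and only the trigger word to its left decides the class. A minimal sketch of that idea follows; the trigger list is a tiny hypothetical one built from the two examples, not the list used by any system mentioned here.

```python
# Hypothetical trigger phrases, lower-cased, mapped to the class they signal.
TRIGGERS = {
    "senyor": "PERSON",       # "El senyor Balaguer ..."
    "ciutat de": "LOCATION",  # "La ciutat de Balaguer ..."
}


def classify_by_trigger(sentence: str, name: str) -> str:
    """Classify `name` by the trigger phrase immediately to its left, if any."""
    idx = sentence.find(name)
    if idx < 0:
        return "UNKNOWN"
    left_context = sentence[:idx].lower().rstrip()
    for trigger, label in TRIGGERS.items():
        if left_context.endswith(trigger):
            return label
    return "UNKNOWN"


PERSON_EXAMPLE = "El senyor Balaguer vol comprar-se un cotxe nou."
LOCATION_EXAMPLE = "La ciutat de Balaguer és tot un compendi de història de Catalunya."
```

Here `classify_by_trigger(PERSON_EXAMPLE, "Balaguer")` returns `"PERSON"` and the same call on `LOCATION_EXAMPLE` returns `"LOCATION"`. A real system would combine this with internal evidence such as capitalization.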

25. Named Entity Recognition Handcrafted systems Knowledge (rule) based Patterns Gazetteers Automatic systems Statistical Machine learning Unsupervised Analyze: char type, POS, lexical info, dictionaries Hybrid systems

26. Named Entity Recognition Handcrafted systems LTG F-measure of 93.39 in MUC-7 (the best) Ltquery, XML internal representation Tokenizer, POS-tagger, SGML transducer Nominator (1997) IBM Heavy heuristics Cross-document co-reference resolution Used later in IBM Intelligent Miner

27. Named Entity Recognition Handcrafted systems LaSIE (Large Scale Information Extraction) MUC-6 (LaSIE II in MUC-7) Univ. of Sheffield's GATE architecture (General Architecture for Text Engineering) JAPE language FACILE (1998) NEA language (Named Entity Analysis) Context-sensitive rules NetOwl (MUC-7) Commercial product C++ engine, extraction rules

28. NER: automatic approaches Learning of statistical models or symbolic rules Use of annotated text corpus Manually annotated Automatically annotated "BIO" tagging Tags: Begin, Inside, Outside an NE Probabilities: Simple: P(tag_i | token_i) With external evidence: P(tag_i | token_i-1, token_i, token_i+1) "OpenClose" tagging Two classifiers: one for the beginning, one for the end
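BIO tagging turns entity spans into one tag per token, which is what a per-token classifier is trained on. The sketch below converts between the two representations; suffixing the class to the tag (for example `B-PER`) is a common convention assumed here, not something the slide specifies.

```python
def to_bio(tokens: list[str], spans: list[tuple[int, int, str]]) -> list[str]:
    """Convert (start, end_exclusive, label) token spans into BIO tags."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the entity
    return tags


def from_bio(tags: list[str]) -> list[tuple[int, int, str]]:
    """Recover (start, end_exclusive, label) spans from a BIO tag sequence."""
    spans = []
    start, label = None, None
    # A trailing sentinel "O" closes an entity that runs to the end.
    for i, tag in enumerate(tags + ["O"]):
        if start is not None and not tag.startswith("I-"):
            spans.append((start, i, label))
            start = None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


TOKENS = "Andrew Johnson was appointed president of ACME".split()
SPANS = [(0, 2, "PER"), (6, 7, "ORG")]
```

`to_bio(TOKENS, SPANS)` tags the first two tokens `B-PER`, `I-PER`, the last one `B-ORG`, and everything else `O`; `from_bio` inverts it.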

29. NER: automatic approaches Decision trees Tree-oriented sequence of tests on every word Determine probabilities of having a BIO tag Use training corpus Viterbi, ID3, C4.5 algorithms Select most probable tag sequence SEKINE et al. (1998) BALUJA et al. (1999) F-measure: 90%

30. NER: automatic approaches HMM Markov models, Viterbi Separate statistical model for each NE category + model for words outside NEs Nymble (1997) / IdentiFinder (1999) Maximum Entropy (ME) Separate, independent probabilities for every evidence (external and internal features) are merged multiplicatively MENE (NYU, 1998) Capitalization, many lexical features, type of text F-Measure: 89%
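Viterbi search picks the most probable tag sequence under an HMM. The sketch below is a deliberately small illustration: it uses one state per class rather than the separate per-category word models described for Nymble, and every probability in the toy model is invented for the example.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p, floor=1e-6):
    """Return the most probable state sequence for `obs` (log-space Viterbi)."""
    def lg(x):
        # Unseen events get a small floor probability instead of zero.
        return math.log(max(x, floor))

    # best[t][s] = (log-probability of the best path ending in s at t, previous state)
    best = [{s: (lg(start_p[s]) + lg(emit_p[s].get(obs[0], 0.0)), None) for s in states}]
    for t in range(1, len(obs)):
        column = {}
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p][0] + lg(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            column[s] = (score + lg(emit_p[s].get(obs[t], 0.0)), prev)
        best.append(column)

    # Backtrack from the best final state.
    path = [max(states, key=lambda s: best[-1][s][0])]
    for t in range(len(obs) - 1, 0, -1):
        path.append(best[t][path[-1]][1])
    return path[::-1]


# Toy model with made-up numbers: PER = inside a person name, O = outside any NE.
STATES = ["PER", "O"]
START_P = {"PER": 0.3, "O": 0.7}
TRANS_P = {"PER": {"PER": 0.5, "O": 0.5}, "O": {"PER": 0.2, "O": 0.8}}
EMIT_P = {
    "PER": {"john": 0.5, "smith": 0.5},
    "O": {"was": 0.5, "appointed": 0.5},
}
```

For the observation sequence `["john", "smith", "was", "appointed"]` this toy model decodes to `PER PER O O`. In a trained system the three probability tables would be estimated from an annotated corpus.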

31. NER: other approaches Hybrid systems Combination of techniques IBM's Intelligent Miner: Nominator + DB/2 data mining WordNet hierarchies MAGNINI et al. (2002) Stacks of classifiers Adaboost algorithm Bootstrapping approaches Small set of seeds Memory-based ML, etc.

32. Named Entity Recognition Handcrafted systems x automatic systems Ease of change Portability (domains and languages) Scalability Language resources Cost-effectiveness

33. NER in various languages Arabic TAGARAB (1998) Pattern-matching engine + morphological analysis Lots of morphological info (no differences in orthographic case) Bulgarian OSENOVA & KOLKOVSKA (2002) Handcrafted cascaded regular NE grammar Pre-compiled lexicon and gazetteers Catalan CARRERAS et al. (2003b) and MÀRQUEZ et al. (2003) Extract Catalan NEs with Spanish resources (F-measure 93%) Bootstrap using Catalan texts

34. NER in various languages Chinese & Japanese Many works Special characteristics Character or word-based No capitalization CHINERS (2003) Sports domain Machine learning Shallow parsing technique ASAHARA & MATSUMOTO (2003) Character-based method Support Vector Machine 87.2% F-measure in IREX (outperformed most word-based systems)

35. NER in various languages Dutch DE MEULDER et al. (2002) Hybrid system Gazetteers, grammars of names Machine Learning Ripper algorithm French BÉCHET et al. (2000) Decision trees Le Monde news corpus German Non-proper nouns also capitalized THIELEN (1995) Incremental statistical approach 65% of proper names correctly disambiguated

36. NER in various languages Greek KARKALETSIS et al. (1998) English-Greek GIE (Greek Information Extraction) project GATE platform Italian CUCCHIARELLI et al. (1998) Merge rule-based and statistical approaches Gazetteers Context-dependent heuristics ECRAN (Extraction of Content: Research at Near Market) GATE architecture Lack of linguistic resources: 20% of NEs undetected Korean CHUNG et al. (2003) Rule-based model, Hidden Markov Model, boosting approach over unannotated data

37. NER in various languages Portuguese SOLORIO & LÓPEZ (2004, 2005) Adapted the CARRERAS et al. (2002b) Spanish NER Brazilian newspapers Serbo-Croatian NENADIC & SPASIC (2000) Hand-written grammar rules Highly inflective language Lots of lexical and lemmatization pre-processing Dual alphabet (Cyrillic and Latin) Pre-processing stores the text in an independent format

38. NER in various languages Spanish CARRERAS et al. (2002b) Machine Learning, AdaBoost algorithm BIO and OpenClose approaches Swedish SweNam system (DALIANIS & ASTROM, 2001) Perl Machine Learning techniques and matching rules Turkish TUR et al (2000) Hidden Markov Model and Viterbi search Lexical, morphological and context clues

39. Named Entity Recognition Multilingual approaches Goals - CUCERZAN & YAROWSKY (1999) To handle basic language-specific evidence To learn from small NE lists (about 100 names) To process large and small texts To have good class scalability (to allow the definition of different classes of entities, according to the language or to the purpose) To learn incrementally, storing learned information for future use

40. Named Entity Recognition Multilingual approaches GALLIPI (1996) Machine Learning English, Spanish, Portuguese ECRAN (Extraction of Content: Research at Near Market) REFLEX project (2005) the US National Business Center

41. Named Entity Recognition Multilingual approaches POIBEAU (2003) Arabic, Chinese, English, French, German, Japanese, Finnish, Malagasy, Persian, Polish, Russian, Spanish and Swedish UNICODE Language independent architecture Rule-based, machine-learning Sharing of resources (dictionary, grammar rules...) for some languages BOAS II (2004) University of Maryland Baltimore County Web-based Pattern-matching No large corpora

42. NER: other topics Character x word-based JING et al. (2003) Hidden Markov Model classifier Character-based model better than word-based model NER translation Cross-language Information Retrieval (CLIR), Machine Translation (MT) and Question Answering (QA) NER in speech No punctuation, no capitalization KIM & WOODLAND (2000) Up to 88.58% F-measure NER in Web pages wrappers

43. NER: an experiment in Catalan General architecture Common API Segmentation module POS-tagger Disambiguator Grammar module Module for accessing the system dictionaries

44. NER: an experiment in Catalan General architecture Typographical error detection module Spelling error detection module Grammatical error detection module NER module

45. NER: an experiment in Catalan NER Module Dictionary Multi tokens WORD FORM#LEMMA:TAG:FREQUENCY|WORD FORM:FREQUENCY|WORD FORM:FREQUENCY ... can#can:N5-FP:444|barbet:42|barça:23|Barceló:4 Categories PERSON Names and surnames LOCATION Common indicators ORGANIZATION Common indicators UNKNOWN
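The multi-token dictionary line above follows a regular layout, so it can be split mechanically. The sketch below parses one entry; reading the `WORD FORM:FREQUENCY` items after the first `|` as word forms that may follow the head token is an assumption, and the field names are hypothetical.

```python
def parse_entry(line: str) -> dict:
    """Parse 'FORM#LEMMA:TAG:FREQ|FORM:FREQ|FORM:FREQ...' into a dictionary."""
    head, rest = line.split("#", 1)
    first, *others = rest.split("|")
    lemma, tag, freq = first.split(":")
    following = []
    for item in others:
        form, count = item.rsplit(":", 1)   # frequency is the last field
        following.append((form, int(count)))
    return {
        "form": head,
        "lemma": lemma,
        "tag": tag,
        "frequency": int(freq),
        "following_forms": following,       # assumed meaning, see lead-in
    }


ENTRY = "can#can:N5-FP:444|barbet:42|barça:23|Barceló:4"
```

`parse_entry(ENTRY)` gives the head form `can` with tag `N5-FP` and frequency 444, plus three following forms with their own counts.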

46. NER: an experiment in Catalan NER Module Rules Locations "Verb_viure a location": Exiliat novament, Macià viu a Bèlgica. (Exiled again, Macià lives in Belgium.) "Verb_néixer a location": Joan neix a Barcelona. (Joan is born in Barcelona.) Persons "Sr. person": El Sr. Companys va sortir. (Mr. Companys left.) "El position de location, person": El alcalde de Barcelona, Joan Clos. (The mayor of Barcelona, Joan Clos.)

47. NER: an experiment in Catalan NER Module Rules Organizations "El position de organization": El president de Cases Rives. (The president of Cases Rives.) "Organization, verb_fundat el": El club Orfeas Smyrna, fundat el 1890 per jònics que residien a la ciutat turca. (The club Orfeas Smyrna, founded in 1890 by Ionians who lived in the Turkish city.) Combinations For persons, organizations and locations
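Context rules like the ones on these two slides can be approximated with regular expressions. This is only a rough sketch: the rules in the slides refer to verb lemmas and position words, which presumably come from the POS-tagger and dictionaries, whereas the patterns below hard-code a few inflected forms taken from the example sentences.

```python
import re

# One or more capitalized words (accented capitals included).
NAME = r"([A-ZÀ-Ý][\w'·-]*(?: [A-ZÀ-Ý][\w'·-]*)*)"

# (class, pattern) pairs; the word lists are hypothetical stand-ins for lemma lookups.
RULES = [
    ("LOCATION", re.compile(r"\b(?:viu|neix) a " + NAME)),          # Verb_viure / Verb_néixer a location
    ("PERSON", re.compile(r"\bSr\. " + NAME)),                      # Sr. person
    ("ORGANIZATION", re.compile(r"\b[Ee]l president de " + NAME)),  # El position de organization
]


def apply_rules(sentence: str) -> list[tuple[str, str]]:
    """Return (entity text, class) for every rule that fires on the sentence."""
    found = []
    for label, pattern in RULES:
        for match in pattern.finditer(sentence):
            found.append((match.group(1), label))
    return found
```

For example, `apply_rules("El Sr. Companys va sortir.")` returns `[("Companys", "PERSON")]`. Note that "El president de ..." is ambiguous between an organization and a location, which is one reason the slides list combined rules.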

48. NER: an experiment in Catalan NER Module Error detection and suggestion Pre-defined spelling rules Inserting try characters before every letter of the word Swapping characters one by one Inserting try characters in their places The NER correction as input for the Grammar module
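The three edit operations listed above can be sketched as a candidate generator that is then filtered against a name lexicon. The try-character set and the lexicon are hypothetical, and "swapping characters one by one" is interpreted here as swapping adjacent characters.

```python
TRY_CHARS = "aeiou"  # hypothetical set of characters to try


def candidates(word: str, try_chars: str = TRY_CHARS) -> set[str]:
    """Generate spelling variants by insertion, adjacent swap and substitution."""
    out = set()
    for i in range(len(word)):
        for c in try_chars:
            out.add(word[:i] + c + word[i:])          # insert c before letter i
            out.add(word[:i] + c + word[i + 1:])      # put c in place of letter i
    for i in range(len(word) - 1):
        out.add(word[:i] + word[i + 1] + word[i] + word[i + 2:])  # swap neighbours
    out.discard(word)
    return out


def suggest(word: str, lexicon: set[str], try_chars: str = TRY_CHARS) -> list[str]:
    """Keep only the candidates that are known names, in sorted order."""
    return sorted(candidates(word, try_chars) & lexicon)


LEXICON = {"Barcelona", "Badalona"}  # hypothetical name list
```

With this lexicon, `suggest("Barcelna", LEXICON)`, `suggest("Bracelona", LEXICON)` and `suggest("Barcelina", LEXICON)` each return `["Barcelona"]`, covering the insertion, swap and substitution cases respectively.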

49. NER: an experiment in Catalan Results 20 Catalan texts Wikipedia, El Periòdic 10,000 words Various domains Precision: 70% Recall: 75% F-Measure: 72% Error correction and suggestions

50. Conclusions Needs better tuning Rules Dictionary can#P0 can benet#Can Benet:N4BMS:9#can:N4BMS:9#benet:N4BMS:9:10000000#P1 can benet de#P0 can benet de la#P0 can benet de la prua#Can Benet de la Prua#can:N4BMS#benet:N4BMS#de:P#el:EA--FS#prua:N4BFS#P1 Test a statistics-based engine? Treatment of gender, number Expand to full IE system
