
Named Entity Recognition

Named Entity Recognition. Sobha Lalitha Devi AU-KBC Research Centre Chennai. Named Entity(NE) Recognition. What is NE and What is not an NE How to identify NE Tagset and Annotation Guidelines Methods Used in developing NER. Why do NER?. Key part of Information Extraction system


Presentation Transcript


  1. Named Entity Recognition. Sobha Lalitha Devi, AU-KBC Research Centre, Chennai

  2. Named Entity (NE) Recognition • What is an NE and what is not an NE • How to identify NEs • Tagset and annotation guidelines • Methods used in developing NER IIIT Summer School

  3. Why do NER? • Key part of an Information Extraction (IE) system • Robust handling of proper names is essential for many applications such as summarization, information retrieval (IR) and anaphora resolution • Pre-processing for different classification levels • Information filtering • Information linking

  4. What is NER? • NER involves identification of proper names in texts, and their classification into a set of predefined categories of interest. • Three universally accepted categories: person, location and organisation • Other common tasks: recognition of date/time expressions, measures (percent, money, weight etc.), email addresses etc. • Other domain-specific entities: names of drugs, genes, medical conditions, names of ships, bibliographic references etc.

  5. NER Definition • Named entity recognition (NER), also known as entity identification (EI) and entity extraction, is the task of locating and classifying atomic elements in text into predefined categories such as the names of persons, organizations and locations, expressions of time, quantities, monetary values, percentages, etc. Example: John sold 5 companies in 2002. Tagged: <ENAMEX TYPE="PERSON">John</ENAMEX> sold <NUMEX TYPE="QUANTITY">5</NUMEX> companies in <TIMEX TYPE="DATE">2002</TIMEX>.
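
The inline markup on this slide can be read back into plain text plus entity offsets. The following is a minimal sketch, assuming flat (non-nested) elements with a single TYPE attribute written with straight quotes; the function name `extract_entities` is hypothetical and not part of any annotation toolkit.

```python
import re

# One flat MUC-style element, e.g. <ENAMEX TYPE="PERSON">John</ENAMEX>.
TAG_RE = re.compile(r'<(ENAMEX|NUMEX|TIMEX)\s+TYPE="([^"]+)">(.*?)</\1>')

def extract_entities(tagged):
    """Return (plain_text, entities); each entity is
    (start, end, element, type, text) with offsets into plain_text."""
    parts = []
    entities = []
    pos = 0        # read position in the tagged string
    plain_len = 0  # length of the plain text built so far
    for m in TAG_RE.finditer(tagged):
        before = tagged[pos:m.start()]
        parts.append(before)
        plain_len += len(before)
        text = m.group(3)
        entities.append((plain_len, plain_len + len(text),
                         m.group(1), m.group(2), text))
        parts.append(text)
        plain_len += len(text)
        pos = m.end()
    parts.append(tagged[pos:])
    return "".join(parts), entities

tagged = ('<ENAMEX TYPE="PERSON">John</ENAMEX> sold '
          '<NUMEX TYPE="QUANTITY">5</NUMEX> companies in '
          '<TIMEX TYPE="DATE">2002</TIMEX>.')
plain, ents = extract_entities(tagged)
for start, end, element, etype, text in ents:
    print(element, etype, repr(plain[start:end]))
```

Keeping character offsets rather than only the entity strings makes it possible to compare two annotations of the same sentence span by span.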

  6. What is not NER? • NER is not event recognition. • NER does not create templates. • NER does not perform co-reference resolution or entity linking, though these processes are often implemented alongside NER as part of a larger IE system. • NER is not just matching text strings against pre-defined lists of names. It recognises entities that are being used as entities in a given context. • NER is not an easy task!

  7. Named Entity and Philosophy of Language • Proper names are defined by: • Descriptivist theory of names: Frege, Russell, Ludwig Wittgenstein and John Searle • Causal theory of reference: Saul Kripke

  8. Descriptivist theory of names: Proper names either are synonymous with descriptions, or have their reference determined by virtue of the name's being associated with a description or cluster of descriptions that an object uniquely satisfies. Causal theory of reference: Proper names refer to an object by virtue of a causal connection with the object, as mediated through communities of speakers. That is, proper names, in contrast to descriptions, are rigid designators. Rigid designators: a proper name refers to the named object in every possible world in which the object exists. Descriptions, by contrast, may designate different objects in different possible worlds.

  9. Proper Names and Definite Descriptions • On the descriptivist view, a proper name in a sentence can be replaced by a contextually appropriate description without changing the meaning, e.g. Otto von Bismarck can be known or described as the first Chancellor of the German Empire. • Kripke argues that definite descriptions are in general not rigid designators, because a description need not pick out the same object in every possible world. • For more on Kripke's account of proper names, see Naming and Necessity (1980).

  10. What is a Named Entity • Named entities are: • Noun phrases • Rigid designators: a rigid designator denotes the same thing in all possible worlds in which that thing exists, and does not designate anything else in those possible worlds in which that thing does not exist

  11. Examples of what is not a Named Entity and what is (common noun & named entity) • Hotel & Taj Hotel • Flower & Rose Flower • Beach & Kovalam Beach • Airport & Indira Gandhi International Airport • The School & Good Shepherd School • Prime Minister & Mr. Manmohan Singh

  12. Some problems in identifying NEs • Variation of NEs: Manmohan Singh, Manmohan, Dr. Manmohan Singh • Ambiguity of NE types: • 1945 (date vs. time) • Washington (location vs. person) • May (person vs. month) • Tata (person vs. organization)

  13. Ambiguity Examples • Person vs. Location • Sir C.P. Ramaswamy was the Divan of Travancore (Per) • Sir C.P. Ramaswamy Road is in Chennai (Loc) • Person vs. Organization • Anil Ambani opened Reliance Fresh (Per) • Reliance Fresh is under Anil Ambani Group Ltd (Org)

  14. More complex problems in NER • Issues of style, structure, domain, genre etc. • Punctuation, spelling, spacing and formatting all have an impact • Example (an address spread over several lines): Dept. of Computing and Information Science / Manchester Metropolitan University / Manchester / United Kingdom • Example (a name split across quoted lines): "> Tell me more about Leonardo" / "> Da Vinci"

  15. Problems in NE Task Definition • Category definitions are intuitively quite clear, but there are many grey areas. • Many of these grey areas are caused by metonymy: Person vs. Artefact, Organisation vs. Location, Company vs. Artefact, Location vs. Organisation

  16. Tagset for Named Entities • The ACE (Automatic Content Extraction) tagset is hierarchical • The CLIA tagset is also hierarchical, similar to ACE • It was developed for two domains: tourism and health

  17. Tagset: ENAMEX • Person • Individual • Family name • Title • Group • Organization • Government • Public/private company • Religious • Non-government • Political Party • Para military • Charitable • Association • GPE (Geo-political Social Entity) • Media • Location • Place • District • City • State • Nation • Continent • Address • Water-bodies • Landscapes • Celestial Bodies • Manmade • Religious Places • Roads/Highways • Museum • Theme parks/Parks/Gardens • Monuments • Facilities • Hospitals • Institutes • Library • Hotel/Restaurants/Lodges • Plant/Factories • Police Station/Fire Services • Public Comfort Stations • Airports • Ports • Bus-Stations • Locomotives • Artifacts • Implements • Ammunition • Paintings • Sculptures • Clothes • Gems & Stones • Entertainment • Dance • Music • Drama/Cinema • Sports • Events/Exhibitions/Conferences • Cuisines • Animals • Plants

  18. Tagset Continued • NUMEX: Distance, Money, Quantity, Count • TIMEX: Time, Date, Day, Period • Tagset counts: first-level tags 3, second-level 43, third-level 40, total 86

  19. How to Annotate • 1. ENAMEX • 1.1 Person • 1.1.1 Individual • This refers to the names of individual persons, including names of fictional characters found in stories, novels etc. Tag structure: <ENAMEX TYPE="PERSON" SUBTYPE_1="INDIVIDUAL">abc</ENAMEX> Example (English): <ENAMEX TYPE="PERSON" SUBTYPE_1="INDIVIDUAL">Abdul Kalam</ENAMEX>

  20. Annotation continued • 1.1.1.1 Family name • A person name often includes a family name. Whenever an individual name occurs with a family name, the part of the name that refers to the family name must be tagged specifically with the subtag "FAMILYNAME", as shown here. Tag structure: <ENAMEX TYPE="PERSON" SUBTYPE_1="INDIVIDUAL" SUBTYPE_2="FAMILYNAME">abc</ENAMEX> Example (English): <ENAMEX TYPE="PERSON" SUBTYPE_1="INDIVIDUAL">Lalu Prasad <ENAMEX TYPE="PERSON" SUBTYPE_1="INDIVIDUAL" SUBTYPE_2="FAMILYNAME">Yadav</ENAMEX></ENAMEX>
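
Because the family-name tag is nested inside the individual-name tag, a flat pattern match is not enough to read this annotation. A minimal sketch using Python's standard XML parser is given here, assuming the annotation is well-formed and uses straight quotes; `read_annotations` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

def read_annotations(tagged_sentence):
    """Return (text, type, subtypes) for every ENAMEX element,
    outer elements first, inner (nested) elements after."""
    # Wrap in a dummy root so a sentence with several top-level tags parses.
    root = ET.fromstring("<s>" + tagged_sentence + "</s>")
    result = []
    for el in root.iter("ENAMEX"):
        # itertext() also collects the text of nested child elements.
        text = "".join(el.itertext()).strip()
        subtypes = [el.attrib[k] for k in sorted(el.attrib)
                    if k.startswith("SUBTYPE_")]
        result.append((text, el.attrib.get("TYPE"), subtypes))
    return result

sentence = ('<ENAMEX TYPE="PERSON" SUBTYPE_1="INDIVIDUAL">Lalu Prasad '
            '<ENAMEX TYPE="PERSON" SUBTYPE_1="INDIVIDUAL" '
            'SUBTYPE_2="FAMILYNAME">Yadav</ENAMEX></ENAMEX>')
for text, etype, subtypes in read_annotations(sentence):
    print(text, etype, subtypes)
```

The outer element yields the full name and the inner element yields only the family name, which mirrors the two levels of the guideline.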

  21. NE Types (diagram: NE types ENAMEX, NUMEX and TIMEX) The named entity hierarchy is divided into three major classes: entity names (ENAMEX), time expressions (TIMEX) and numerical expressions (NUMEX).

  22. Entity Types

  23. Entity Name Types • Person entities are limited to humans. A person may be a single individual or a group. Individual refers to the name of an individual person. Group refers to a set of individuals. • Location entities are limited to geographical entities such as countries, cities, continents and landmasses, bodies of water, and geological formations. • Organization entities are limited to corporations, agencies, and other groups of people defined by an established organizational structure.

  24. Examples for Entity Name Types • En: [Sita]PERSON is working at [HCL]ORGANIZATION, which is in [Chennai]LOCATION • Ta: [Seetha]PERSON [chennaiyilrukkira]LOCATION [HCLlil]ORGANIZATION velaiseikirAl. (gloss: Sita Chennai HCL working) • Ml: [Seetha]PERSON [chennaiyillula]LOCATION [HCLlil]ORGANIZATION jolicheyyunnu. (gloss: Sita Chennai HCL working) • Hi: [Seetha]PERSON [HCL]ORGANIZATION main kaamkarrahahai, jo [chennai]LOCATION main hain. (gloss: Sita HCL work is which Chennai in)
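
The bracket notation used in these examples can be converted into per-token labels for a sequence tagger. The sketch below uses the common BIO encoding (B- for the first token of an entity, I- for later tokens, O for everything else), which is an assumption here since the slides do not name an encoding. It also assumes flat brackets and a label that is followed by whitespace; `to_bio` is a hypothetical name.

```python
import re

# [entity text]LABEL, with optional space between "]" and the label.
ENTITY_RE = re.compile(r"\[([^\[\]]+)\]\s*([A-Z]+)")

def to_bio(annotated):
    """Convert '[Sita]PERSON is ...' into a list of (token, tag) pairs."""
    tokens = []
    pos = 0
    for m in ENTITY_RE.finditer(annotated):
        # Text before the entity is outside any entity.
        tokens += [(w, "O") for w in annotated[pos:m.start()].split()]
        label = m.group(2)
        for i, w in enumerate(m.group(1).split()):
            tokens.append((w, ("B-" if i == 0 else "I-") + label))
        pos = m.end()
    tokens += [(w, "O") for w in annotated[pos:].split()]
    return tokens

for token, tag in to_bio("[Appolo Hospital]FACILITY is in [Chennai]LOCATION"):
    print(token, tag)
```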

  25. Entity Name Types • Facilities are limited to buildings and other permanent man-made structures and real estate improvements such as hospitals, airports, colleges, libraries etc. • En: [Appolo Hospital]FACILITY is in [Chennai]LOCATION • Ta: [AppallomaruthuvamanAi]FACILITY [Chennaiyil]LOCATION irukkirathu • Ml: [AppoloAsupathri]FACILITY [chennaiyil]LOCATION aaN • Hi: [Appoloaspathaal]FACILITY [chennai]LOCATION meinhaim.

  26. Entity Name Types • A locomotive entity is a physical device primarily designed to move an object from one location to another, by carrying, pulling, or pushing the transported object. • En: [Ananthapuri Express]LOCOMOTIVE departs from [Chennai]LOCATION at [7.30pm]TIME. • Hi: [Ananthapuri express]LOCOMOTIVE [Chennai]LOCATION se [rAth 7.30]TIME koravanahoga • Ml: [Ananthapurieksprass]LOCOMOTIVE [chennaiyilninn]LOCATION [raathri 7.30 maNikk]TIME puRappetum. • Ta: [Ananthapuriviraivurayil]LOCOMOTIVE [chennaiyilirunthu]LOCATION [iRavu 7.30 maNikku]TIME puRappatukirathu

  27. Entity Name Types • Artifact entities are objects or things produced or shaped by human craft, such as tools, weapons/ammunition, paintings, clothes, ornaments and medicines. • En: [Vinayaga Statue]ARTIFACT is looking beautiful • Ta: [VinayakarinSilai]ARTIFACT pArpatharkkualakAkAkairukkirathu • Ml: [ganapathivigraham]ARTIFACT baMgiyaayiirikkunnu. • Hi: [Vinayakamoorthi]ARTIFACT achilaghrahihaim.

  28. Entity Name Types • Entertainment entities denote activities that are diverting and hold human attention or interest, giving pleasure, happiness or amusement, especially performances of some kind such as dance, music, sports and events. • En: [Flower Exhibition]ENTERTAINMENT is held at [Hyderabad]LOCATION • Ta: [Malar kankAtchi]ENTERTAINMENT [hyderabaadil]LOCATION Nadaiperukirathu • Ml: [pushpapradarshanam]ENTERTAINMENT [hyderabaadil]LOCATION natakkunnu • Hi: [phoolpradarshnii]ENTERTAINMENT [hyderabad]LOCATION meNAyojithkiyaajAthAhai

  29. Entity Name Types • Materials refer to the names of food items, cuisines, chemicals and cosmetics. • En: [Honey]MATERIALS is good for face • Ta: [ThEn]MATERIALS mukaththiRkunallathu • Ml: [Madhu]MATERIALS mukaththinunallathAN • Hi: [Shahad]MATERIALS cheharekeliyeachchahai.

  30. Entity Name Types • Organisms: these are the names of different animal species, including birds, reptiles, viruses and bacteria, and the names of herbs, medicinal plants, shrubs, trees, fruits, flowers etc. • En: [Peacock]ORGANISM is the national bird of [India]LOCATION • Ta: [Mayil]ORGANISM [InthiyAvin]LOCATION thEciyappaRavaiAkum. • Ml: [Mayil]ORGANISM [indyayute]LOCATION raashtrapakshi AN. • Hi: [Mor]ORGANISM [bhaarath]LOCATION kaaraashtrIyapakshihai.

  31. Entity Name Types • Disease: names of diseases, symptoms, diagnoses and treatments come under this type. • En: Smoking causes [Cancer]DISEASE • Ta: PukaippithithalAl [puRRuNoi]DISEASE varukiRathu • Ml: pukavali [aRbhudham]DISEASE uNtAkkunnu • Hi: dhumrapan [kaansar]DISEASE kakaaraNbanaathahai.

  32. Numerical Expressions (diagram: NUMEX with subtypes DISTANCE, QUANTITY, MONEY and COUNT)

  33. Numerical Expressions • Distance refers to measures such as kilometers, centimeters, meters, acres, feet etc. Example: 10 cm., twenty feet, 15 hectares • Money specifies currency values such as rupee, euro, dinar, dollar etc. Example: Rs. 1000, 250 Euro, $160 • Count denotes the number (or count) of items, articles, things etc. Example: 5 subjects, 12 students, 20 books • Quantity covers measurements such as liters, tons, grams, volts etc. Example: 20 litres, 22 kg, 50g, 100 volts
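
The four NUMEX subtypes can be told apart, to a first approximation, by the unit or currency marker next to the number. The sketch below is a rough illustration built only from the examples on this slide: the unit lists are hypothetical and incomplete, spelled-out numbers such as "twenty feet" are not handled, and `classify_numex` is not part of any real tagger.

```python
import re

# Unit lists follow the slide's grouping (acres and hectares are listed under distance there).
DISTANCE_UNITS = {"km", "kilometers", "cm", "centimeters", "m", "meters",
                  "feet", "acres", "hectares"}
QUANTITY_UNITS = {"litres", "liters", "kg", "g", "grams", "tons", "volts"}

NUMBER = r"\d+(?:\.\d+)?"
# Currency marker before the number (Rs. 1000, $160) or after it (250 Euro).
MONEY_RE = re.compile(
    r"^(?:(?:rs\.?|\$)\s*" + NUMBER + r"|"
    + NUMBER + r"\s*(?:euros?|rupees?|dollars?|dinars?))$",
    re.IGNORECASE,
)
# A number followed by one word, with an optional trailing full stop.
MEASURE_RE = re.compile(r"^" + NUMBER + r"\s*([a-z]+)\.?$", re.IGNORECASE)

def classify_numex(expr):
    """Return DISTANCE, MONEY, QUANTITY, COUNT, or None if nothing matches."""
    expr = expr.strip()
    if MONEY_RE.match(expr):
        return "MONEY"
    m = MEASURE_RE.match(expr)
    if m is None:
        return None
    unit = m.group(1).lower()
    if unit in DISTANCE_UNITS:
        return "DISTANCE"
    if unit in QUANTITY_UNITS:
        return "QUANTITY"
    # Any other noun after a number is treated as a counted item.
    return "COUNT"

for example in ["10 cm.", "Rs. 1000", "250 Euro", "5 subjects", "50g"]:
    print(example, classify_numex(example))
```

Treating every unknown noun as COUNT is a deliberate simplification; a real system would need context to rule out cases such as a house number followed by a street name.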

  34. Time Expressions (diagram: TIMEX with subtypes DATE, YEAR, MONTH, TIME, SPECIAL DAY, DAY and PERIOD)

  35. Temporal Expressions • Temporal expressions are entities that refer to time, date, year, month and day. • Time: refers to expressions of time in their different forms, including hours, minutes and seconds. Examples: 5 o'clock in the morning • 9.30 a.m. • Evening 6.30 p.m. • Date: refers to expressions of date in different forms, such as 13/12/2001, including month, date and year. Examples: August 15 1947 • 1956 • September 11

  36. Temporal Expressions • Day: expressions that convey days in a year. This can also include days recurring weekly, fortnightly, monthly, quarterly, biennially etc. Examples: Sunday • Tomorrow • Today • Yesterday • Special Day: refers to special days in a year. Examples: Gandhi Jayanthi • Rama Navami

  37. Temporal Expressions • Period: refers to expressions of duration, time periods or time intervals. Examples: 17th century • 10 minutes • 10 a.m. to 12 p.m. • One year
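
The temporal subtypes above lend themselves to an ordered list of patterns, where more specific patterns (a time range) are tried before more general ones (a single clock time). The following is a sketch under that assumption; the patterns cover only the example forms on these slides, the special-day list is a two-entry stand-in for a real list, and names such as `classify_timex` and the `SPECIAL_DAY` label spelling are hypothetical.

```python
import re

MONTHS = ("january|february|march|april|may|june|july|august|"
          "september|october|november|december")
DAYS = ("sunday|monday|tuesday|wednesday|thursday|friday|saturday|"
        "today|tomorrow|yesterday")
# Clock time such as "9.30 a.m." or "12 p.m."
CLOCK = r"\d{1,2}(?:[.:]\d{2})?\s*(?:a\.m\.|p\.m\.)"

# Order matters: a range of clock times must be tried before a single time.
PATTERNS = [
    ("PERIOD", CLOCK + r"\s+to\s+" + CLOCK),
    ("PERIOD", r"\d+\s*(?:st|nd|rd|th)\s+century"),
    ("PERIOD", r"(?:\d+|one)\s+(?:second|minute|hour|day|week|month|year)s?"),
    ("TIME", CLOCK),
    ("DATE", r"\d{1,2}/\d{1,2}/\d{4}"),
    ("DATE", r"(?:" + MONTHS + r")\s+\d{1,2}(?:\s+\d{4})?"),
    ("DATE", r"\d{4}"),  # a bare year, as in the slide's "1956"
    ("SPECIAL_DAY", r"gandhi jayanthi|rama navami"),
    ("DAY", DAYS),
]
COMPILED = [(label, re.compile(p, re.IGNORECASE)) for label, p in PATTERNS]

def classify_timex(expr):
    """Return the first label whose pattern matches the whole expression."""
    expr = expr.strip()
    for label, pattern in COMPILED:
        if pattern.fullmatch(expr):
            return label
    return None

for example in ["9.30 a.m.", "10 a.m. to 12 p.m.", "August 15 1947", "Sunday"]:
    print(example, classify_timex(example))
```

A bare four-digit number is labelled DATE here, which reflects the slide's "1956" example; as the earlier ambiguity slide notes, a string like 1945 could also be a time, so context would be needed in practice.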

  38. Methodologies • Methods: • Rule based • Machine learning: Hidden Markov Model (HMM), Naïve Bayes classifier, Maximum Entropy Markov Model (MEMM), Conditional Random Fields (CRF) • Hybrid approach

  39. Challenges of NER in Indian Languages • The major challenges encountered in Indian languages are: • Agglutination • Ambiguity • Between proper and common nouns • Between named entities • Lack of capitalization

  40. Challenges of NER in Indian Languages • Agglutination: in Dravidian languages, words consist of a lexical root to which one or more affixes are attached. Examples in Tamil: 1) Ta: Ramanaiththavira (other than Raman) 2) Ta: Cevvaiyandru (on Tuesday) 3) Ta: Inthiyavilllula (in India) 4) Ta: KannanaippaRRikkondu (hold onto Kannan)

  41. Challenges of NER in Indian Languages • Examples in Malayalam: 1) Ml: hemayiluNtaayirunna (that which Hema has) 2) Ml: Chennaiyilethunna (reaching Chennai) 3) Ml: arabikatalinaBimukhamaayi (towards the Arabian Sea) 4) Ml: kaaSiyilekkozhukunna (flowing towards kaaSi)
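
Since the name sits at the start of these agglutinated word forms, one crude way to recover it is to look for the longest known name that is a prefix of the word. The sketch below illustrates that idea only; it is not the method used by the authors, the gazetteer entries are hypothetical, and it will over-match any word that merely happens to begin with a listed name. A real system would use a morphological analyser.

```python
def find_root_entity(word, gazetteer):
    """Return (name, label, remainder) for the longest gazetteer name
    that is a prefix of word, or None if no name matches."""
    lowered = word.lower()
    best = None
    for name, label in gazetteer.items():
        if lowered.startswith(name.lower()):
            if best is None or len(name) > len(best[0]):
                # remainder holds the attached affixes, left unanalysed here
                best = (name, label, word[len(name):])
    return best

# Hypothetical name list using the roots from the examples above.
GAZETTEER = {
    "Rama": "PERSON",
    "Raman": "PERSON",
    "Kannan": "PERSON",
    "Chennai": "LOCATION",
}

for word in ["Ramanaiththavira", "Chennaiyilethunna", "KannanaippaRRikkondu"]:
    print(word, find_root_entity(word, GAZETTEER))
```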

  42. Challenges of NER in Indian Languages • Ambiguity • Indian languages suffer comparatively more from the ambiguity that exists between common and proper nouns, and between named entity types themselves. In some cases the same word can refer to different named entity types. Such instances can be resolved using contextual information. • Examples: • Hi: Akash – Person name and Sky • Hi: Sooraj – Person name and Sun • Hi: Chaanth – Moon and Silver • Hi: Aam – Mango and Common • Ml: Roopa – Person name and Rupee • Ml: Madhu – Person name and Honey • Ml: Mala – Person name and Garland

  43. Challenges of NER in Indian Languages • Ta: Thinkal – Day and Month • Ta: Malar – Person name and Flower • Ta: Chevvai – Day and Planet • Ta: Shakthi – Person name and Power • Ta: MAlai – Evening and Garland • Ta & Ml: Velli – Silver, Planet, Day

  44. Challenges of NER in Indian Languages • Spelling variation: because of different writing styles, the same entity is represented by various word forms. In Tamil, Sanskrit letters such as "ja", "sha", "sri" and "ha" are replaced by "sa", "ciri" and "ka". Examples: Roja can be written as Rosa • Srimathi as cirimathi • Raja as rasa • ShajahAn as sajakAn
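
One way to make such variants match each other is to map every spelling to a shared comparison key. The sketch below reads the slide's replacements as "ja" and "sha" to "sa", "sri" to "ciri", and "ha" to "ka", and applies them to both spellings; this reading, the rule order and the name `spelling_key` are assumptions. The key is for comparison only and is not itself a valid spelling, and unrelated names can collapse to the same key.

```python
# Applied in order: "sha" must be handled before "ha" so that
# "sha" becomes "sa" rather than "ska".
REPLACEMENTS = [("sri", "ciri"), ("sha", "sa"), ("ja", "sa"), ("ha", "ka")]

def spelling_key(name):
    """Return a lowercase key that is identical for the variant
    spellings listed on the slide (vowel-length capitals are dropped)."""
    key = name.lower()
    for old, new in REPLACEMENTS:
        key = key.replace(old, new)
    return key

pairs = [("Roja", "Rosa"), ("Srimathi", "cirimathi"),
         ("Raja", "rasa"), ("ShajahAn", "sajakAn")]
for a, b in pairs:
    print(a, b, spelling_key(a) == spelling_key(b))
```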

  45. Challenges of NER in Indian Languages • Lack of capitalization • In English and some other European languages, capitalization is an important feature for identifying proper nouns. • It plays a major role in NE identification. • Unlike English, Indian languages have no concept of capitalization.

  46. Nested Entities • Nested entities are named entities that occur within another named entity. They are also called embedded entities. • Ta: [[Mathurai]LOCATION [MeenAtchi Amman]PERSON Koyil]RELPLACE (En: Mathurai Meenatchi Amman Temple) • Ml: [[Nittoor]PERSON Srinivasarao]PERSON (En: Nittoor Srinivasarao) • Hi: [[Rajeev]PERSON MArg]ROAD (En: Rajeev Road)
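
Nested brackets cannot be read with a single flat pattern, but a stack of open-bracket positions handles them directly. This is a minimal sketch, assuming each closing bracket is followed by an upper-case label and that the label is separated from any following word by a space; `parse_nested` is a hypothetical name.

```python
import re

LABEL_RE = re.compile(r"\s*([A-Z]+)")

def parse_nested(annotated):
    """Return (plain_text, spans); spans are (start, end, label) offsets
    into plain_text, with inner entities listed before outer ones."""
    out = []     # characters of the plain text
    stack = []   # plain-text offsets where each open bracket started
    spans = []
    i = 0
    while i < len(annotated):
        ch = annotated[i]
        if ch == "[":
            stack.append(len(out))
            i += 1
        elif ch == "]":
            start = stack.pop()
            m = LABEL_RE.match(annotated, i + 1)
            if m is None:
                raise ValueError("closing bracket without a label")
            spans.append((start, len(out), m.group(1)))
            i = m.end()
        else:
            out.append(ch)
            i += 1
    return "".join(out), spans

text, spans = parse_nested(
    "[[Mathurai]LOCATION [MeenAtchi Amman]PERSON Koyil]RELPLACE")
for start, end, label in spans:
    print(label, repr(text[start:end]))
```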

  47. Approaches in Named Entity Recognition • Dictionary look-up • Rule based (using lexical, contextual and morphological information) • Maximum entropy based • Hidden Markov Model • Conditional Random Fields • Hybrid methods (statistical + linguistic)

  48. Dictionary (Gazetteer) Look-up Approach • Uses dictionaries (gazetteers) to identify NEs • The gazetteer contains NEs from all domains • Advantages: • Very simple approach • Gives very high precision

  49. Disadvantages of Dictionary Approach • Preparing an exhaustive dictionary is a tedious and expensive process. • The dictionary must cover the different spellings of the same place.
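
The dictionary approach can be illustrated with a greedy longest-match look-up over tokens. This is a sketch with a hypothetical four-entry gazetteer built from names that appear in these slides; the function name `gazetteer_tag` and the `max_len` window are assumptions.

```python
def gazetteer_tag(tokens, gazetteer, max_len=4):
    """Greedy longest-match look-up.
    gazetteer maps a tuple of lowercase tokens to a label.
    Returns a list of (start, end, label) token spans."""
    entities = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest window first so longer names win over shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in gazetteer:
                match = (i, i + n, gazetteer[key])
                break
        if match:
            entities.append(match)
            i = match[1]
        else:
            i += 1
    return entities

GAZETTEER = {
    ("indira", "gandhi"): "PERSON",
    ("indira", "gandhi", "international", "airport"): "FACILITY",
    ("kovalam", "beach"): "LOCATION",
    ("chennai",): "LOCATION",
}

tokens = "Indira Gandhi International Airport is far from Kovalam Beach".split()
print(gazetteer_tag(tokens, GAZETTEER))
```

The sketch also makes the listed disadvantages concrete: a name missing from the dictionary, or spelled differently, is simply not found, and an ambiguous entry always receives the one label stored for it.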

  50. Rule Based Approach • Rule-based system • Needs many rules to tag all kinds of NEs • Advantages: • Rich and expressive rules • Good results • Disadvantages: • Requires extensive experience and grammatical knowledge • Experts to craft rules are expensive • Highly domain specific (not portable to a new domain)
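
A handful of contextual rules is enough to resolve the "Sir C.P Ramaswamy" versus "Sir C.P Ramaswamy Road" ambiguity shown earlier in the deck. The following is a sketch of the idea for English text only; the trigger-word lists and the rule order are hypothetical and rely on capitalization, which the deck notes is unavailable in Indian languages.

```python
import re

# Rules are tried in order; earlier rules win when spans overlap.
RULES = [
    # Capitalized words ending in a road-like head word.
    ("LOCATION", re.compile(r"\b(?:[A-Z][\w.]*\s+)+(?:Road|Street|Beach)\b")),
    # Capitalized words ending in a facility head word.
    ("FACILITY", re.compile(r"\b(?:[A-Z][\w.]*\s+)+(?:Hospital|Airport|School)\b")),
    # A title followed by one or more capitalized words.
    ("PERSON", re.compile(
        r"\b(?:Mr|Dr|Sir|Mrs)\.?\s+[A-Z][\w.]*(?:\s+[A-Z][\w.]*)*")),
]

def apply_rules(text):
    """Return [(entity_text, label)] in order of appearance."""
    found = []
    for label, pattern in RULES:
        for m in pattern.finditer(text):
            overlaps = any(m.start() < e and s < m.end() for s, e, _ in found)
            if not overlaps:
                found.append((m.start(), m.end(), label))
    return [(text[s:e], label) for s, e, label in sorted(found)]

print(apply_rules("Sir C.P Ramaswamy was the Divan of Travancore"))
print(apply_rules("Sir C.P Ramaswamy Road is in Chennai"))
```

Even this small example shows why rule sets grow quickly: "Travancore" and "Chennai" are missed because no rule covers bare place names, and each new entity type or domain needs its own trigger words.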
