
Named Entity Recognition based on three different machine learning techniques

Zornitsa Kozareva, zkozareva@dlsi.ua.es. JRC Workshop, September 27, 2005. Research Group on Language Processing and Information Systems (GPLSI).


Presentation Transcript


  1. Named Entity Recognition based on three different machine learning techniques • Zornitsa Kozareva, zkozareva@dlsi.ua.es • JRC Workshop, September 27, 2005 • Research Group on Language Processing and Information Systems (GPLSI)

  2. Outline • Named Entity Recognition • task definition • applications • Machine learning approach • Classifier combination • Feature description and experimental evaluation • for NE detection • for NE classification • NERUA at GeoCLEF • Conclusions and future work

  3. Named Entity Recognition – task definition • Identification of proper names in text, using the BIO scheme • B starts an entity • I continues the entity • O marks words outside any entity • Classification into a predefined set of categories • Person names • Organizations (companies, governmental organizations, etc.) • Locations (cities, countries, etc.) • Miscellaneous (movie titles, sport events, etc.) • Example: Adam_B-PER Smith_I-PER works_O for_O IBM_B-ORG ,_O London_B-LOC ._O
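The BIO example on this slide can be decoded back into entities with a few lines of Python. This is a minimal sketch, not part of the original talk: the function names `parse_tagged` and `bio_to_entities` are hypothetical, and it assumes that an I- tag without a matching open entity is treated like O.

```python
from typing import List, Tuple


def parse_tagged(text: str) -> List[Tuple[str, str]]:
    """Split whitespace-separated 'token_TAG' items into (token, tag) pairs."""
    pairs = []
    for item in text.split():
        token, tag = item.rsplit("_", 1)  # split on the last underscore only
        pairs.append((token, tag))
    return pairs


def bio_to_entities(pairs: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity text, category) tuples."""
    entities = []
    current_tokens: List[str] = []
    current_type = None
    for token, tag in pairs:
        if tag.startswith("B-"):
            # A B- tag closes any open entity and starts a new one.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_tokens and tag[2:] == current_type:
            # An I- tag of the same category continues the open entity.
            current_tokens.append(token)
        else:
            # O (or a stray I-, by assumption) closes the open entity.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


sentence = "Adam_B-PER Smith_I-PER works_O for_O IBM_B-ORG ,_O London_B-LOC ._O"
entities = bio_to_entities(parse_tagged(sentence))
print(entities)
```

Detection corresponds to recovering the B/I/O boundaries; classification corresponds to assigning the PER, ORG, LOC or MISC suffix.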

  4. Named Entity Recognition – applications • Information Extraction • Question Answering • Document classification • Automatic indexing of books • Increasing the accuracy of Internet search results (e.g. the location Clinton, South Carolina vs. President Clinton)

  5. Outline • Named Entity Recognition • task definition • applications • Machine learning approach • Classifier combination • Feature description and experimental evaluation • for NE detection • for NE classification • NERUA at GeoCLEF • Conclusions and future work

  6. Machine learning approach • Given: • NER task • tagged corpus • Select classification methods • Memory-based learning • Maximum Entropy • Hidden Markov Models • Construct set of characteristics • detection phase • classification phase

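  7. [Architecture diagram: Text → Detection (HMM, TiMBL; Voting) → Classification (HMM, MXE, TiMBL; Voting) → NER Text] Source: NERUA: sistema de detección y clasificación de entidades utilizando aprendizaje automático (NERUA: a system for entity detection and classification using machine learning), Ferrández et al.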

  8. Classification method 1 • Memory-based learning (k-nearest neighbours) • toolkit • TiMBL package • time performance • quick training phase • slow during testing • features • various types of features • irrelevant features impede performance
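The time profile on this slide (quick training, slow testing) follows from how memory-based learning works: training only stores the instances, while classification compares the new instance against everything in memory. The sketch below illustrates this with a plain k-nearest-neighbour classifier over symbolic features. It is not TiMBL's API; the class name, the simple overlap distance, and the toy instances are assumptions made for illustration.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def overlap_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Number of feature positions where the two instances differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


class MemoryBasedClassifier:
    """Toy k-NN learner: stores all training instances verbatim."""

    def __init__(self, k: int = 1) -> None:
        self.k = k
        self.memory: List[Tuple[Sequence[str], str]] = []

    def train(self, instances, labels) -> None:
        # "Training" is only storage, which is why it is fast.
        self.memory.extend(zip(instances, labels))

    def classify(self, instance: Sequence[str]) -> str:
        # Every query scans the whole memory, which is why testing is slow.
        neighbours = sorted(
            self.memory, key=lambda m: overlap_distance(m[0], instance)
        )[: self.k]
        votes = Counter(label for _, label in neighbours)
        return votes.most_common(1)[0][0]


# Hypothetical instances: (lowercased word, capitalization, previous word).
instances = [
    ("london", "cap", "in"),
    ("works", "low", "smith"),
    ("ibm", "cap", "for"),
    ("for", "low", "works"),
]
labels = ["B", "O", "B", "O"]

clf = MemoryBasedClassifier(k=1)
clf.train(instances, labels)
print(clf.classify(("paris", "cap", "in")))
print(clf.classify(("runs", "low", "he")))
```

Because every feature contributes equally to this distance, an irrelevant feature can pull in the wrong neighbours, which matches the slide's remark that irrelevant features impede performance.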

  9. Classification method 2 • Maximum Entropy • toolkit • MaxEnt • time performance • slow training phase • slow testing phase • feature management • string, missing values

  10. Classification method 3 • Hidden Markov Models • toolkit • ICOPOST • time performance • quick training phase • quick testing phase • feature management • cannot handle as many features as the other two methods • need corpus or label transformation

  11. Outline • Named Entity Recognition • task definition • applications • Machine learning approach • Classifier combination • Feature description and experimental evaluation • for NE detection • for NE classification • NERUA at GeoCLEF • Conclusions and future work

  12. Classifier combination • Majority voting • give each classifier one vote
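Majority voting with one vote per classifier can be sketched as follows. The function names and the example tag sequences are hypothetical, and the tie-breaking rule (prefer the classifier listed first) is an assumption, since the slide does not say how ties are resolved.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions: Sequence[str]) -> str:
    """Return the most frequent tag; ties go to the earliest classifier."""
    counts = Counter(predictions)
    top = max(counts.values())
    for tag in predictions:  # list order defines the tie-break (assumption)
        if counts[tag] == top:
            return tag
    raise ValueError("no predictions given")


def combine(per_classifier_tags: Sequence[Sequence[str]]) -> List[str]:
    """Vote token by token over the tag sequences of several classifiers."""
    return [majority_vote(tags) for tags in zip(*per_classifier_tags)]


# Hypothetical outputs of the three classifiers for a four-token sentence.
hmm_tags = ["B-PER", "I-PER", "O", "B-ORG"]
timbl_tags = ["B-PER", "I-PER", "O", "B-LOC"]
maxent_tags = ["B-PER", "O", "O", "B-ORG"]

voted = combine([hmm_tags, timbl_tags, maxent_tags])
print(voted)
```

With three classifiers a three-way disagreement is still possible, so some tie-breaking policy is needed in practice.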

  13. Outline • Named Entity Recognition • task definition • applications • Machine learning approach • Classifier combination • Feature description and experimental evaluation • for NE detection • for NE classification • NERUA at GeoCLEF • Conclusions and future work

  14. Features for NE detection • Contextual • anchor word (i.e. the word to be classified) • words in a [-3,…,+3] window • Orthographic • capitalization at positions [-3,…,+3], including position 0 • whole anchor word in capitals (e.g. IBM) • position of the anchor word in the sentence • Substring extraction • 2- and 3-letter substrings from the left and right sides of the anchor word • Gazetteer list • word at position 0, +1, +2, +3 seen in the list • Trigger word list • word at position 0 or in the [-3,…,+3] window seen in the list • Source: Using Language Resource Independent Detection for Spanish NER, Kozareva et al., RANLP’05
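The detection features listed on this slide could be extracted roughly as below. This is a sketch under stated assumptions, not the system's actual code: the function name, the feature key names, the `<PAD>` placeholder for positions outside the sentence, and the toy gazetteer and trigger lists are all hypothetical.

```python
from typing import Dict, List, Set


def detection_features(
    tokens: List[str],
    i: int,
    gazetteer: Set[str],
    triggers: Set[str],
    window: int = 3,
) -> Dict[str, object]:
    """Build a feature dictionary for the anchor word tokens[i]."""
    word = tokens[i]
    feats: Dict[str, object] = {
        "anchor": word,               # contextual: the word to be classified
        "position": i,                # orthographic: position in the sentence
        "all_caps": word.isupper(),   # orthographic: e.g. IBM
    }
    for off in range(-window, window + 1):
        j = i + off
        inside = 0 <= j < len(tokens)
        ctx = tokens[j] if inside else "<PAD>"
        if off != 0:
            feats[f"word[{off:+d}]"] = ctx
        feats[f"cap[{off:+d}]"] = inside and ctx[:1].isupper()
        feats[f"trigger[{off:+d}]"] = inside and ctx.lower() in triggers
        if off >= 0:  # gazetteer lookup only at positions 0..+3
            feats[f"gaz[{off:+d}]"] = inside and ctx.lower() in gazetteer
    for n in (2, 3):  # substrings from the left and right side of the anchor
        feats[f"prefix{n}"] = word[:n]
        feats[f"suffix{n}"] = word[-n:]
    return feats


tokens = ["Adam", "Smith", "works", "for", "IBM", ",", "London", "."]
feats = detection_features(
    tokens, 4, gazetteer={"london"}, triggers={"works"}
)
print(feats["word[-1]"], feats["all_caps"], feats["gaz[+2]"])
```

Each token of a sentence yields one such dictionary, which can then be serialized into whatever instance format the chosen classifier expects.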

  15. Results for NE detection

  16. Index • Named Entity Recognition • task definition • applications • Machine learning approach • Classifier combination • Feature description and experimental evaluation • for NE detection • for NE classification • NERUA at GeoCLEF • Conclusions and future work

  17. Features for NE classification • Contextual • whole entity • first word of the entity • second word of the entity if present • words around the entity in [-3,…,+3] window • Orthographic • position of anchor word in a sentence • capital, lowercase or other symbol • Gazetteer list • part of entity in the list • whole entity in the list • whole entity is not in any of these lists • Trigger lists • anchor word • words in [-1,+1] window
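Unlike detection, classification works on a whole detected entity rather than a single anchor word. The following sketch shows one way to build the entity-level features from this slide; the function name, the feature keys, the `<PAD>` and `<NONE>` placeholders, and the toy gazetteers are assumptions, and the trigger-list features are omitted for brevity.

```python
from typing import Dict, List, Set


def classification_features(
    tokens: List[str],
    start: int,
    end: int,
    gazetteers: Dict[str, Set[str]],
    window: int = 3,
) -> Dict[str, object]:
    """Features for the entity spanning tokens[start:end] (end exclusive)."""
    entity = tokens[start:end]
    whole = " ".join(entity)
    feats: Dict[str, object] = {
        "entity": whole,
        "first": entity[0],
        "second": entity[1] if len(entity) > 1 else "<NONE>",
        "position": start,
    }
    # Words around the entity in a [-3, +3] window.
    for off in range(1, window + 1):
        left, right = start - off, end - 1 + off
        feats[f"left[-{off}]"] = tokens[left] if left >= 0 else "<PAD>"
        feats[f"right[+{off}]"] = (
            tokens[right] if right < len(tokens) else "<PAD>"
        )
    # Gazetteer features: part of the entity, or the whole entity, in a list.
    in_any = False
    for name, entries in gazetteers.items():
        whole_hit = whole.lower() in entries
        feats[f"whole_in_{name}"] = whole_hit
        feats[f"part_in_{name}"] = any(w.lower() in entries for w in entity)
        in_any = in_any or whole_hit
    feats["whole_in_no_list"] = not in_any
    return feats


tokens = ["Adam", "Smith", "works", "for", "IBM", ",", "London", "."]
gazetteers = {"PER": {"adam"}, "LOC": {"london"}}
feats = classification_features(tokens, 0, 2, gazetteers)
print(feats["second"], feats["part_in_PER"], feats["whole_in_no_list"])
```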

  18. Results for NE classification F-score for Spanish classification
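For reference, an entity-level F-score can be computed as sketched below. This assumes the common convention that an entity counts as correct only when both its boundaries and its category match; the slide does not state which scoring convention was used, and the gold and predicted spans here are invented for illustration.

```python
from typing import Iterable, Tuple

Span = Tuple[int, int, str]  # (start, end, category), hypothetical encoding


def f_score(
    gold: Iterable[Span], predicted: Iterable[Span], beta: float = 1.0
) -> float:
    """F-beta over exact-match entities; beta=1 weighs P and R equally."""
    gold_set, pred_set = set(gold), set(predicted)
    correct = len(gold_set & pred_set)
    precision = correct / len(pred_set) if pred_set else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


gold = [(0, 2, "PER"), (4, 5, "ORG"), (6, 7, "LOC")]
predicted = [(0, 2, "PER"), (4, 5, "LOC"), (6, 7, "LOC")]
score = f_score(gold, predicted)
print(round(score, 4))
```

In this toy case the entity at span (4, 5) is detected correctly but misclassified, so it counts against both precision and recall.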

  19. Outline • Named Entity Recognition – task definition, applications • Machine learning approach • Classifier combination • Feature description and experimental evaluation • for NE detection • for NE classification • NERUA at GeoCLEF • Conclusions and future work

  20. NERUA at GeoCLEF • For English, the feature sets constructed for Spanish were used directly • NERUA outperformed the rule-based system Dramneri, although both consulted the same gazetteer and trigger word lists • NERUA took more processing time • Source: University of Alicante at GeoCLEF 2005, Ferrández et al., CLEF’05

  21. Conclusions and future work • We found a language resource independent feature set for NE detection • 92.96% of Spanish entities • 78.86% of Portuguese entities • Classifier combination improved NE classification • Good coverage of the PER, LOC and ORG classes is maintained • Machine learning systems may outperform rule-based systems; however, they need more processing time and hand-labeled resources, which are not available for all languages

  22. Future work • Find discriminative features for the MISC class • Perform NER relying on unlabeled data • Divide the four categories into more detailed ones • Adapt the system to other languages • Study ways of constructing gazetteers automatically

  23. Thank you for your attention! Questions? • Named Entity Recognition based on three different machine learning techniques • Zornitsa Kozareva, zkozareva@dlsi.ua.es • JRC Workshop, September 27, 2005 • Research Group on Language Processing and Information Systems (GPLSI)
