Character Gazetteer for Named Entity Recognition with Linear Matching Complexity

Dlugolinsky S., Nguyen G., Laclavik M., Seleng M. Institute of Informatics, Slovak Academy of Sciences giang.ui@savba.sk

Content • Context: Big Data, Natural Language Processing (NLP), Named Entity Recognition (NER) • Gazetteers • Tree structures: design and realizations • NER with linear matching complexity • Evaluations • Future work

Work context NER is an important task for extracting information from the Big Data produced daily in • Social media: Twitter, Google+, Facebook, Instagram, etc. • Wikipedia, Wikia, newspapers … • Other internal sources like transactions, logs, emails, … Knowledge and information hidden in (un|semi-)structured data • useful for • business or political sentiment analysis • public opinion assessment • emergency response, etc. • text, images, audio, video Text → NLP → Information

Natural Language Processing (NLP) • Incoming text arrives continuously from websites, portals, social media, etc. • Need to recognize well-known NEs and their occurrences with references • NER is an important task for gaining information

Gazetteers • Basic, independent and very effective NER technique for identifying NEs in text • Processing approaches • Token-based: split the input text into a sequence of tokens (words) • Character-based: process the input text character by character • NE recognition • Machine learning techniques • Finite-state machines (FSM)

Related work Ontotext Hash Gazetteer • Based on hash tables • Authors: “3x faster and 4x less memory than FSM equivalent” • Available only as part of GATE Ontotext Stand-Alone Gazetteer • Stand-alone version of the Hash Gazetteer • No longer available Ontotext Large Knowledge Base Gazetteer • Support for ontology-aware NLP • Available only as part of GATE Other gazetteers are implemented as proprietary look-up code or as parts of complex solutions

Our requirements Standalone • no 3rd party libraries needed • does not rely on external preprocessing; e.g. tokenization Linear complexity lookup algorithm • fast and effective processing of input text as a stream, especially for Big Data Editable data structure • add/remove NEs between lookups Memory efficient data structure • “learn” tens of millions of entities Robust • input texts of any size • any language

Gazetteer tree data structures for HMT and CST realizations

Named entity recognition: character-based with linear matching complexity
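The slide's figure is not preserved in this transcript, so the following is only a minimal sketch of how a character-based, single-pass lookup over a character tree could work. It is not the authors' algorithm. The class `CharGazetteerSketch`, its methods, and the `start:end:text` result format are hypothetical names chosen for illustration. The sketch keeps a list of partial matches in progress and advances each of them by one character, so the text can be consumed as a stream without tokenization. The work per input character is bounded by the length of the longest entity, not by the number of entities.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a character-level gazetteer; not the authors' API. */
class CharGazetteerSketch {

    /** One tree node: children keyed by character, plus an end-of-entity flag. */
    static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean terminal;
    }

    private final Node root = new Node();

    /** Adds an entity by walking (or creating) one node per character. */
    void add(String entity) {
        Node node = root;
        for (int i = 0; i < entity.length(); i++) {
            node = node.children.computeIfAbsent(entity.charAt(i), c -> new Node());
        }
        node.terminal = true;
    }

    /** Scans the text once, left to right; each match is reported as "start:end:text". */
    List<String> match(String text) {
        List<String> found = new ArrayList<>();
        List<Node> activeNodes = new ArrayList<>();      // partial matches in progress
        List<Integer> activeStarts = new ArrayList<>();  // where each partial match began
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // Any position may be the first character of an entity.
            activeNodes.add(root);
            activeStarts.add(i);
            List<Node> nextNodes = new ArrayList<>();
            List<Integer> nextStarts = new ArrayList<>();
            for (int k = 0; k < activeNodes.size(); k++) {
                Node child = activeNodes.get(k).children.get(c);
                if (child == null) {
                    continue; // this partial match cannot be extended; drop it
                }
                int start = activeStarts.get(k);
                if (child.terminal) {
                    found.add(start + ":" + (i + 1) + ":" + text.substring(start, i + 1));
                }
                nextNodes.add(child);
                nextStarts.add(start);
            }
            activeNodes = nextNodes;
            activeStarts = nextStarts;
        }
        return found;
    }
}
```

With the entities "New York", "New York City" and "York" loaded, scanning "I love New York City." reports all three, which is one way to surface overlapping and prefix cases. The sketch does no word-boundary check, so it would also report "York" inside "Yorkshire"; a real gazetteer would need a policy for that.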

HMT and CST realizations • HMT: Hash Map Tree (multi-way tree) • implemented with Java HashMap, constant-time O(1) performance on average for the basic operations (get and put) • (-) consumes a lot of memory • (+) very fast • CST: Child-Sibling Tree • pure and simple Java structure for nodes • (+) memory efficient (only 25% of HMT) • (-) slower (approx. 10x vs. HMT for big data) • Deal with overlapping, prefix, postfix NE cases
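The memory versus speed trade-off above can be illustrated with a child-sibling node layout. This is a minimal sketch under assumptions, not the authors' implementation: `ChildSiblingTrieSketch`, `findChild`, `add` and `contains` are hypothetical names. Each node holds only two references (first child and next sibling) instead of a hash map, which is where the memory saving would come from. Finding a child then means walking the sibling list, which is the step a `HashMap` performs in O(1) on average.

```java
/** Hypothetical child-sibling tree sketch; not the authors' implementation. */
class ChildSiblingTrieSketch {

    /** Each node stores two references, however many children it has. */
    static final class Node {
        final char ch;
        boolean terminal;
        Node firstChild;   // head of this node's child list
        Node nextSibling;  // next node on the same level

        Node(char ch) {
            this.ch = ch;
        }
    }

    private final Node root = new Node('\0');

    /** Linear scan over the sibling list of a parent node. */
    private static Node findChild(Node parent, char c) {
        for (Node n = parent.firstChild; n != null; n = n.nextSibling) {
            if (n.ch == c) {
                return n;
            }
        }
        return null;
    }

    /** Adds an entity, creating missing nodes along its character path. */
    void add(String entity) {
        Node node = root;
        for (int i = 0; i < entity.length(); i++) {
            char c = entity.charAt(i);
            Node child = findChild(node, c);
            if (child == null) {
                child = new Node(c);
                child.nextSibling = node.firstChild; // prepend to the sibling list
                node.firstChild = child;
            }
            node = child;
        }
        node.terminal = true;
    }

    /** Returns true only if the exact entity was added before. */
    boolean contains(String entity) {
        Node node = root;
        for (int i = 0; i < entity.length() && node != null; i++) {
            node = findChild(node, entity.charAt(i));
        }
        return node != null && node.terminal;
    }
}
```

After adding "Bratislava" and "Brno", `contains("Brno")` is true while `contains("Br")` is false, because the shared prefix node is not marked as the end of an entity.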

Evaluation datasets • Gazetteer datasets: • Freebase organizations: 778,814 unique entities • Freebase locations: 1,256,552 unique entities • Freebase persons: 2,614,401 unique entities • Wikipedia titles and alternative names: 9,319,611 unique entities • Incoming data sets • 9,909 documents acquired from CoNLL-2003 datasets (Reuters’ text) with approximately 29MB of text

Memory consumption

Rating characters per node

Matching time

Simple output example

Next steps • Improving the tree data structure in order to • Decrease memory requirements • Make traversing and matching more efficient • Possible direction is collapsing nodes: • PHT - Patricia Hash Map Trie • Completing the work • Integration into our projects and existing complex tools • Open source at http://ikt.ui.sav.sk/gazetteer
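The slide names collapsing nodes as a possible direction but gives no design for PHT, so the following is only a generic Patricia-style (radix) sketch of the idea, under stated assumptions. `CollapsedTrieSketch` and all of its members are hypothetical names. Chains of single-child nodes are merged into one node that carries a whole substring as its edge label, and an edge is split only when a new entity diverges in the middle of it.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a collapsed (radix-style) tree; not the planned PHT design. */
class CollapsedTrieSketch {

    static final class Node {
        String label;      // characters on the edge leading into this node
        boolean terminal;
        final Map<Character, Node> children = new HashMap<>();

        Node(String label) {
            this.label = label;
        }
    }

    private final Node root = new Node("");
    private int nodeCount = 1;

    int nodeCount() {
        return nodeCount;
    }

    /** Length of the common prefix of label and text starting at offset. */
    private static int commonPrefixLength(String label, String text, int offset) {
        int n = Math.min(label.length(), text.length() - offset);
        int i = 0;
        while (i < n && label.charAt(i) == text.charAt(offset + i)) {
            i++;
        }
        return i;
    }

    void add(String entity) {
        Node node = root;
        int pos = 0;
        while (pos < entity.length()) {
            Node child = node.children.get(entity.charAt(pos));
            if (child == null) {
                // No edge starts with this character: store the whole rest in one node.
                Node leaf = new Node(entity.substring(pos));
                leaf.terminal = true;
                node.children.put(entity.charAt(pos), leaf);
                nodeCount++;
                return;
            }
            int common = commonPrefixLength(child.label, entity, pos);
            if (common < child.label.length()) {
                // The entity diverges inside the edge: split it at the shared prefix.
                Node middle = new Node(child.label.substring(0, common));
                node.children.put(middle.label.charAt(0), middle);
                child.label = child.label.substring(common);
                middle.children.put(child.label.charAt(0), child);
                nodeCount++;
                child = middle;
            }
            node = child;
            pos += common;
        }
        node.terminal = true;
    }

    boolean contains(String entity) {
        Node node = root;
        int pos = 0;
        while (pos < entity.length()) {
            Node child = node.children.get(entity.charAt(pos));
            if (child == null || !entity.startsWith(child.label, pos)) {
                return false;
            }
            pos += child.label.length();
            node = child;
        }
        return node.terminal;
    }
}
```

Adding "Bratislava" and "Brno" yields 4 nodes here (root, "Br", "atislava", "no"), whereas a one-character-per-node tree would need 13 for the same two entities. Whether the saving carries over to the datasets above is not something this sketch can show.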

Thank you for your attention Giang Nguyen giang.ui@savba.sk Cite: Stefan Dlugolinsky, Giang Nguyen, Michal Laclavik, Martin Seleng: "Character Gazetteer for Named Entity Recognition with Linear Matching Complexity", 3rd World Congress on Information and Communication Technologies, WICT'2013, pp. 364-368, IEEE Catalog Number: CFP1395H-ART, ISBN: 978-1-4799-3230-6
