Japanese Named Entity Extraction with Redundant Morphological Analysis

Masayuki Asahara and Yuji Matsumoto • Graduate School of Information Science, Nara Institute of Science and Technology, Japan • HLT-NAACL 2003

Abstract • Named Entity (NE) extraction is an important subtask of applications such as Information Extraction and Question Answering. • Typical method for Japanese texts: • The text is segmented into words and annotated with POS tags by a morphological analyzer. • Chunking is then applied to the resulting word sequence. • In some cases the segmentation granularity of the morphological analysis contradicts the NE boundaries; extracting such NEs is inherently impossible. • We propose a character-based chunking method: • The input sentence is processed by a statistical morphological analyzer that produces multiple (n-best) answers. • Every character is annotated with: • its character type • POS tags from the top n-best answers • An SVM-based chunker extracts NEs from the annotated sentence. • Applied to the IREX NE extraction task, the method achieves an F-measure of 87.2.

Introduction (1/2) • Example that the typical method cannot solve: “小泉純一郎首相が９月に訪朝” is segmented as “小泉／純一郎／首相／が／９月／に／訪朝”. “朝” (an abbreviation of North Korea) cannot be extracted as a location name because it is contained in the word unit “訪朝”. • Some previous work tries to cope with this word unit problem. • Uchimoto et al., 2000: transformation rules modify the word units given by a morphological analyzer. • Isozaki and Kazawa, 2002: the parameters of a statistical morphological analyzer are controlled so that it produces more fine-grained output. • These methods are used as preprocessing for chunking. • By contrast, we propose a more straightforward method in which the chunking process is performed on character units.

Introduction (2/2) • Each character receives annotations: • its character type • multiple POS information found by a morphological analyzer. • The redundant outputs of the morphological analysis serve as base features for the chunker, giving it more information-rich features. • The SVM-based chunker yamcha (Kudo and Matsumoto, 2001) is used for the chunking process. • Our method achieves a better score than all systems previously reported for the IREX NE extraction task.

IREX NE extraction task (1/4)

IREX NE extraction task (2/4) • The chunking problem is solved by annotating tokens with chunk tags. • Five chunk tag sets: IOB1, IOB2, IOE1, IOE2 (Ramshaw and Marcus, 1995) and SE (Uchimoto et al., 2000)

IREX NE extraction task (3/4) • In the IOB1 and IOB2 models: • I: inside of a chunk. • O: outside of a chunk. • B: beginning of a chunk. • IOB1: B is used only at the beginning of a chunk that immediately follows another chunk. • IOB2: B is always used at the beginning of a chunk. • In the IOE1 and IOE2 models: • The E tag is used instead of B; E marks the end point of a chunk.
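
The tag definitions above can be made concrete with a small encoder. This is a minimal sketch, not the authors' code: the function name `encode` and the span format are hypothetical, and the IOE1 rule (E only at the end of a chunk that is immediately followed by another chunk) is assumed by symmetry with IOB1, since the slide does not spell it out.

```python
from typing import List, Tuple

# A chunk is (start, end_exclusive, label); this span format is an assumption.
Span = Tuple[int, int, str]


def encode(n_tokens: int, spans: List[Span], scheme: str = "IOB2") -> List[str]:
    """Turn non-overlapping chunk spans into one tag per token."""
    if scheme not in ("IOB1", "IOB2", "IOE1", "IOE2"):
        raise ValueError("unknown scheme: " + scheme)
    tags = ["O"] * n_tokens
    spans = sorted(spans)
    prev_end = None
    for k, (start, end, label) in enumerate(spans):
        # Every token inside the chunk starts out as I.
        for i in range(start, end):
            tags[i] = "I-" + label
        next_start = spans[k + 1][0] if k + 1 < len(spans) else None
        if scheme == "IOB2":
            tags[start] = "B-" + label          # B on every chunk start
        elif scheme == "IOB1" and prev_end == start:
            tags[start] = "B-" + label          # B only right after another chunk
        elif scheme == "IOE2":
            tags[end - 1] = "E-" + label        # E on every chunk end
        elif scheme == "IOE1" and next_start == end:
            tags[end - 1] = "E-" + label        # E only right before another chunk
        prev_end = end
    return tags


spans = [(0, 2, "PERSON"), (2, 3, "LOCATION")]
for scheme in ("IOB1", "IOB2", "IOE1", "IOE2"):
    print(scheme, encode(5, spans, scheme))
```

The two adjacent chunks in the usage lines are chosen so that IOB1 and IOE1 actually emit a B or E tag; with isolated chunks those two schemes would produce only I and O.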

IREX NE extraction task (4/4) • In the SE model: • Generally, the tagging unit is a chunk. By contrast, we take characters as the units and assign a tag to each character.
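
To see why character units help, the sketch below checks whether an NE span lines up with word boundaries and then tags the example sentence from the introduction character by character. It is illustrative only: the helper names are hypothetical, IOB2 is used for the character tags, and the PERSON and DATE annotations are assumptions added for the example (only the LOCATION reading of “朝” comes from the slides).

```python
from typing import List, Set, Tuple


def word_boundaries(words: List[str]) -> Set[int]:
    """Character offsets at which a word starts or ends."""
    bounds, pos = {0}, 0
    for w in words:
        pos += len(w)
        bounds.add(pos)
    return bounds


def representable(words: List[str], start: int, end: int) -> bool:
    """True if a word-level chunker could mark the span [start, end)."""
    bounds = word_boundaries(words)
    return start in bounds and end in bounds


def char_iob2(sentence: str, entities: List[Tuple[int, int, str]]) -> List[str]:
    """One IOB2 tag per character."""
    tags = ["O"] * len(sentence)
    for start, end, label in entities:
        tags[start] = "B-" + label
        for i in range(start + 1, end):
            tags[i] = "I-" + label
    return tags


words = ["小泉", "純一郎", "首相", "が", "９月", "に", "訪朝"]
sentence = "".join(words)
# Illustrative annotation (character offsets, end exclusive).
entities = [(0, 5, "PERSON"), (8, 10, "DATE"), (12, 13, "LOCATION")]

for start, end, label in entities:
    print(sentence[start:end], label, representable(words, start, end))
print(list(zip(sentence, char_iob2(sentence, entities))))
```

The span for “朝” starts in the middle of the word “訪朝”, so no word-level tagging can isolate it, while the character-level tagging assigns it its own tag.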

Method • Step 1: A statistical morphological/POS analyzer is applied to the input sentence and produces the POS tags of the n-best answers. • Step 2: Each character in the sentence is annotated with its character type and with multiple POS tags according to the n-best answers. • Step 3: Using the annotated features, NEs are extracted by an SVM-based chunker.

Japanese Morphological Analysis (1/2) • Our morphological/POS analysis is based on a Markov model. • Morphological/POS analysis can be defined as the determination of the POS tag sequence T once a segmentation into a word sequence W is given. • The goal is to find the POS and word sequences T and W that maximize the probability of T given W, which is rewritten using Bayes' rule.
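
The equation on this slide did not survive in the transcript. The block below is a hedged reconstruction of the usual Bayes' rule step for a Markov-model tagger, using the symbols T and W from the slide. The factorization into word emission and tag transition probabilities, and the first-order (bigram) transition in particular, are assumptions made for illustration; the order of the model used by the authors is not recoverable from the text.

```latex
\begin{align}
\hat{T} &= \arg\max_{T} P(T \mid W)
         = \arg\max_{T} \frac{P(W \mid T)\,P(T)}{P(W)}
         = \arg\max_{T} P(W \mid T)\,P(T) \\
P(W \mid T) &\approx \prod_{i=1}^{n} P(w_i \mid t_i), \qquad
P(T) \approx \prod_{i=1}^{n} P(t_i \mid t_{i-1})
\end{align}
```

Here W = w_1 ... w_n and T = t_1 ... t_n. P(W) is dropped in the last step of the first line because it does not depend on T.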

Japanese Morphological Analysis (2/2) • The probabilities are estimated by maximum likelihood estimation. • The best answer is found with the Viterbi algorithm. • Log likelihoods are used as costs, so maximizing probabilities means minimizing costs. • Redundant analysis output means the top n-best answers within a certain cost width.
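
The idea of "n-best answers within a certain cost width" can be sketched on a toy lattice. Everything here is hypothetical: the lexicon, the POS tag names, and the integer costs are made up, connection (transition) costs are ignored, and exhaustive enumeration stands in for the Viterbi and n-best search that a real analyzer would use.

```python
from typing import Dict, List, Tuple

Analysis = List[Tuple[str, str]]  # [(word, POS tag), ...]

# Hypothetical toy lexicon: surface form -> [(POS tag, cost)].
# Costs play the role of negative log likelihoods; lower is better.
LEXICON: Dict[str, List[Tuple[str, int]]] = {
    "訪朝": [("Noun-Verbal", 3)],
    "訪": [("Prefix", 3)],
    "朝": [("Noun-Proper-Location", 2), ("Noun-General", 4)],
}


def all_analyses(sentence: str, lexicon) -> List[Tuple[int, Analysis]]:
    """Enumerate every segmentation/tagging with its total cost, best first."""
    results: List[Tuple[int, Analysis]] = []

    def walk(pos: int, path: Analysis, total: int) -> None:
        if pos == len(sentence):
            results.append((total, list(path)))
            return
        for end in range(pos + 1, len(sentence) + 1):
            word = sentence[pos:end]
            for tag, cost in lexicon.get(word, []):
                path.append((word, tag))
                walk(end, path, total + cost)
                path.pop()

    walk(0, [], 0)
    return sorted(results, key=lambda r: r[0])


def redundant_analyses(sentence: str, lexicon, width: int) -> List[Tuple[int, Analysis]]:
    """Keep every analysis whose cost is within `width` of the best one."""
    ranked = all_analyses(sentence, lexicon)
    if not ranked:
        return []
    best_cost = ranked[0][0]
    return [r for r in ranked if r[0] <= best_cost + width]


for cost, analysis in redundant_analyses("訪朝", LEXICON, width=2):
    print(cost, analysis)
```

With a width of 0 only the single-word analysis “訪朝” survives; widening the window lets the analysis that splits off “朝” through, which is the information the chunker later needs.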

Feature Extraction for Chunking (1/2) • To encode the relative positions of characters within a word, we employ the SE tag model. • A character is tagged with a pair of: • POS tag • Position tag within a word • Example: the character at the initial position of a common noun is tagged “Noun-General-B”. • Character types are also used as features.
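
A minimal sketch of this annotation step follows. It assumes the position tags are S (single-character word), B (first), I (middle) and E (last), which is a common reading of the SE model; the coarse character-type classes based on Unicode ranges and the POS tag names are likewise assumptions, not the inventory used by the authors.

```python
from typing import Dict, List, Tuple

Analysis = List[Tuple[str, str]]  # [(word, POS tag), ...]


def position_tags(word_len: int) -> List[str]:
    """Position of each character within a word (assumed S/B/I/E tags)."""
    if word_len == 1:
        return ["S"]
    return ["B"] + ["I"] * (word_len - 2) + ["E"]


def char_type(ch: str) -> str:
    """Coarse character type from Unicode ranges (illustrative classes)."""
    if ch.isdigit():
        return "DIGIT"
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:
        return "HIRAGANA"
    if 0x30A0 <= code <= 0x30FF:
        return "KATAKANA"
    if 0x4E00 <= code <= 0x9FFF:
        return "KANJI"
    return "OTHER"


def char_features(sentence: str, nbest: List[Analysis]) -> List[Dict]:
    """One feature dict per character; 'pos' holds one entry per n-best answer."""
    feats = [{"char": ch, "type": char_type(ch), "pos": []} for ch in sentence]
    for analysis in nbest:
        i = 0
        for word, pos in analysis:
            for tag in position_tags(len(word)):
                feats[i]["pos"].append(pos + "-" + tag)
                i += 1
    return feats


nbest = [
    [("訪朝", "Noun-Verbal")],                            # best answer
    [("訪", "Prefix"), ("朝", "Noun-Proper-Location")],   # second-best answer
]
for f in char_features("訪朝", nbest):
    print(f)
```

In the printed features, “朝” carries both the tag from the single-word analysis and the tag from the split analysis, so the chunker sees the location reading even though the best answer hides it.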

Feature Extraction for Chunking (2/2)

Support Vector Machine-based Chunking (1/3) • We use the chunker yamcha (Kudo and Matsumoto, 2001), which is based on SVMs (Vapnik, 1998). • Training data for a binary class problem: (x_1, y_1), ..., (x_l, y_l), where x_i is the feature vector of the i-th sample in the training data and y_i ∈ {+1, -1} is the label of that sample. The goal is to find a decision function which accurately predicts y for an unseen x.

Support Vector Machine-based Chunking (2/3) • The decision function for an input vector x is f(x) = sign( Σ_i α_i y_i K(x, z_i) + b ), where f(x) = +1 means that x is a positive member and f(x) = -1 means that x is a negative member. • The z_i are called support vectors; the α_i are their weights and b is a bias term. • K(x, z) is a kernel function; we use the polynomial kernel of degree 2, given by K(x, z) = (1 + x·z)^2.
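
The decision rule can be written out in a few lines. This sketch assumes the usual kernel SVM form f(x) = sign( Σ_i α_i y_i K(x, z_i) + b ) with the degree-2 polynomial kernel from the slide; the support vectors, weights and bias in the usage lines are made-up numbers, not trained values.

```python
from typing import List, Sequence


def poly_kernel(x: Sequence[float], z: Sequence[float], degree: int = 2) -> float:
    """Polynomial kernel K(x, z) = (1 + x.z)^degree."""
    dot = sum(a * b for a, b in zip(x, z))
    return (1 + dot) ** degree


def decide(x: Sequence[float],
           support_vectors: List[Sequence[float]],
           alphas: List[float],
           labels: List[int],
           bias: float) -> int:
    """Return +1 or -1 for the input vector x."""
    score = sum(a * y * poly_kernel(x, z)
                for a, y, z in zip(alphas, labels, support_vectors)) + bias
    return 1 if score >= 0 else -1


# Toy model with two support vectors (illustrative values only).
svs = [[1, 0], [0, 1]]
alphas = [1.0, 1.0]
labels = [+1, -1]
print(decide([1, 0], svs, alphas, labels, bias=0.0))
print(decide([0, 1], svs, alphas, labels, bias=0.0))
```

With binary features, the degree-2 kernel implicitly scores pairs of features that fire together, which is one reason it is a popular choice for chunking.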

Support Vector Machine-based Chunking (3/3) • To apply SVMs to chunking tasks, we have to extend binary classifiers to n-class classifiers. Two well-known methods: • One-vs-Rest method • We prepare n binary classifiers, each separating one class from the rest of the classes. • Pairwise method • We prepare nC2 = n(n-1)/2 binary classifiers, one for each pair of classes.
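
The two multi-class strategies differ mainly in how many binary problems they create. The sketch below only enumerates those problems and shows a simple majority vote for the pairwise case; the function names and the `prefer` callback (standing in for a trained pairwise classifier) are hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import Callable, List, Sequence, Tuple


def one_vs_rest_tasks(classes: Sequence[str]) -> List[Tuple[str, List[str]]]:
    """n binary problems: each class against all the others."""
    return [(c, [o for o in classes if o != c]) for c in classes]


def pairwise_tasks(classes: Sequence[str]) -> List[Tuple[str, str]]:
    """n(n-1)/2 binary problems: one per unordered pair of classes."""
    return list(combinations(classes, 2))


def pairwise_vote(classes: Sequence[str], prefer: Callable[[str, str], str]) -> str:
    """Majority vote over pairwise decisions; ties go to the class seen first."""
    votes = Counter(prefer(a, b) for a, b in pairwise_tasks(classes))
    return votes.most_common(1)[0][0]


classes = ["B-LOCATION", "I-LOCATION", "O"]
print(len(one_vs_rest_tasks(classes)), len(pairwise_tasks(classes)))
# Stand-in for trained classifiers: "O" wins any pair it takes part in.
print(pairwise_vote(classes, lambda a, b: "O" if "O" in (a, b) else a))
```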

The effect of n-best answers • They address cases where a longer word hides shorter words (ambiguity). • They address unknown words.

Evaluation • Data: CRL NE data (IREX, 1999) • 1,174 newspaper articles • 19,262 NEs • Settings examined: • The length of contextual features • The size of redundant morphological analysis • Feature selection • The degree of the polynomial kernel function • Chunk tag scheme: we use IOB2 since it gave the best result in a pilot study. • Measure: F-measure (β = 1)
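
For reference, the F-measure with β = 1 is the harmonic mean of precision and recall. The sketch below computes it over sets of extracted spans, assuming that a prediction counts as correct only when both its boundaries and its type match the gold annotation exactly; the function name and span format are hypothetical.

```python
from typing import Iterable, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, NE type)


def f_measure(gold: Iterable[Span], predicted: Iterable[Span], beta: float = 1.0) -> float:
    """F_beta over exact-match spans; returns 0.0 when nothing is correct."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


gold = [(0, 5, "PERSON"), (8, 10, "DATE"), (12, 13, "LOCATION")]
predicted = [(0, 5, "PERSON"), (8, 10, "DATE")]  # the location is missed
print(f_measure(gold, predicted))
```

In the usage lines precision is 1 and recall is 2/3, so the missed location alone pulls the score down to 0.8.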

The length of contextual feature • Change the length of contextual features and the direction of chunking.
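
The two settings varied here can be pictured as a window around the current character. The sketch below is an assumption about how such features are commonly built for a sequential chunker, not a description of yamcha's internals: `length` characters of context are taken on each side, and the tags already decided are taken from the left when chunking forward and from the right when chunking backward. All names are hypothetical.

```python
from typing import Dict, List, Optional

PAD = "<PAD>"


def window_features(chars: List[str],
                    decided_tags: List[Optional[str]],
                    i: int,
                    length: int = 2,
                    direction: str = "forward") -> Dict[str, str]:
    """Features for position i: surrounding characters plus already-decided tags."""
    feats: Dict[str, str] = {}
    for off in range(-length, length + 1):
        j = i + off
        feats["char[%d]" % off] = chars[j] if 0 <= j < len(chars) else PAD
    # Forward chunking has tags to the left; backward chunking has them to the right.
    step = -1 if direction == "forward" else 1
    for k in range(1, length + 1):
        j = i + step * k
        known = 0 <= j < len(chars) and decided_tags[j] is not None
        feats["tag[%d]" % (step * k)] = decided_tags[j] if known else PAD
    return feats


chars = list("に訪朝")
# Forward pass: the first two characters are already tagged, the third is next.
print(window_features(chars, ["O", "O", None], i=2, length=1, direction="forward"))
# Backward pass: only the last character has been tagged so far.
print(window_features(chars, [None, None, "B-LOCATION"], i=1, length=1, direction="backward"))
```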

The depth of redundant morphological analysis

Feature selection

The degree of polynomial Kernel functions

The effect of a thesaurus • Thesaurus: NTT Goi Taikei (Ikehara et al., 1999)

Discussion

Conclusions • F-measure of 87.21 (best) on the CRL NE data. • The method uses character-level information together with the redundant outputs of a statistical morphological analyzer in an SVM-based chunker. • The method is applicable to any other language that has the word unit problem in NE extraction.
