Outline

Chinese Information Extraction
Tianfang Yao
Department of Computer Science and Engineering
Shanghai Jiao Tong University
1954 Hua Shan Road, Shanghai 200030, China
Chinese Information Extraction by Tianfang Yao

Outline
- Introduction
- Word Segmentation
- Named Entity Extraction
- Entity Relation Extraction
- Conclusion

Introduction (1)
- Chinese Language
- Difficulties in Chinese NLP
- State-of-the-Art for Chinese Information Extraction

Introduction (2)
Chinese Language
- Chinese is typologically different from English or German.
- It has a large character set of about 44,908 characters.
- Although Chinese has a history of more than 6,000 years, a complete grammar standard for Chinese has not yet been established.

Introduction (3)
Chinese Language
- The form of a Chinese character is related to its meaning. Characters include pictographs, e.g. 日 (sun) and 月 (moon); self-explanatory characters, e.g. 上 (above) and 下 (below); and associative compounds, e.g. 信 (believe), a character made up of 人 (man) and 言 (word), which means a message or something that can be believed or trusted.
- Chinese has many homophones, e.g. 锣 (gong), 螺 (spiral shell), 骡 (mule), 箩 (bamboo basket), etc.
- A Chinese word can be split apart or expanded, and the order of its characters can be changed, e.g. 吃饭 (take a meal) vs. 吃了一顿饭; 理发 (haircut) 了 vs. 发理了.

Introduction (4)
Difficulties in Chinese NLP
- Because there are no spaces between the characters in a Chinese sentence, the sentence has to be segmented into words before its structure can be analyzed.
- Because Chinese words are not inflected, semantic structure is more important than syntactic structure for understanding Chinese sentences.
- The combination of Chinese words is flexible, changeable, succinct and implicit.
- Sometimes constituents are omitted from the sentence.
- A Chinese sentence sometimes contains several consecutive nouns or consecutive verbs.

Introduction (5)
State-of-the-Art for Chinese Information Extraction
- Knowledge engineering approaches
- Automatically trainable approaches
- Statistical approaches
- Hybrid approaches

Word Segmentation (1)
Research on Automatic Chinese Word Segmentation (Kaiying Liu, Computer Science Department, Shan Xi University, China)
1. Definitions
Definition 1: Ambiguous Phrase of Overlap Type
Assume that AJB is a character string and W is a word list. If AJ ∈ W and JB ∈ W, then AJB is called an ambiguous phrase of overlap type.
e.g. In the string “当代表 (act as a delegate)”, both “当代 (of our time)” and “代表 (delegate)” are words, so this string is an ambiguous phrase of overlap type.

Word Segmentation (2)
Definition 2: Chain Length
The number of ambiguous (overlapping) strings in an ambiguous phrase of overlap type is called its chain length.
e.g. There is one ambiguous string in the string “当代表”, so its chain length is 1.
Definition 3: Ambiguous Phrase of Combination Type
Assume that AB is a character string and W is a word list. If A ∈ W, B ∈ W and AB ∈ W, then AB is called an ambiguous phrase of combination type.
e.g. In the string “个人 (individual)”, “个 (quantifier)”, “人 (man)” and “个人” are all words, so this string is an ambiguous phrase of combination type.
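Definitions 1 and 3 can be checked mechanically against a word list. The following is a minimal sketch, assuming a toy word list and hypothetical function names (`overlap_splits`, `combination_splits`); it illustrates the definitions and is not the code of the system described here.

```python
def overlap_splits(s, words):
    """Return every (A, J, B) split of s with A+J and J+B both in the word list (Definition 1)."""
    found = []
    n = len(s)
    for i in range(1, n):          # A = s[:i] is non-empty
        for j in range(i + 1, n):  # J = s[i:j] and B = s[j:] are non-empty
            a, mid, b = s[:i], s[i:j], s[j:]
            if a + mid in words and mid + b in words:
                found.append((a, mid, b))
    return found


def combination_splits(s, words):
    """Return every (A, B) split of s with A, B and AB all in the word list (Definition 3)."""
    if s not in words:
        return []
    return [(s[:i], s[i:]) for i in range(1, len(s))
            if s[:i] in words and s[i:] in words]


# Toy word list built from the two examples on the slides.
WORDS = {"当代", "代表", "个", "人", "个人"}

print(overlap_splits("当代表", WORDS))    # the overlap-type example
print(combination_splits("个人", WORDS))  # the combination-type example
```

An empty result means the string is not ambiguous in the corresponding sense with respect to the given word list.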

Word Segmentation (3)
2. Build the ambiguous phrase libraries
- 78,000 phrases of overlap type
- More than 3,000 phrases of combination type
Statistical results for overlap type:
- Chain lengths are mostly 1 or 2, which together account for about 95% of all phrases.
- Among the ambiguous phrases like “ABCD” with a chain length of 2, 98% can be segmented into “AB|CD”.
- The segmentation of about 82% of the ambiguous phrases like “ABCDE” with a chain length of 3 depends on the leftmost three characters “ABC”.
- False ambiguous phrases: 94%
- Real ambiguous phrases: 6%

Word Segmentation (4)
False ambiguous phrase: a phrase that actually has only one segmentation result in real texts.
e.g. “挨 (be given)|批评 (a criticism)”
Real ambiguous phrase: a phrase that has more than one applicable segmentation result.
Case 1: the results have almost equal occurrence probabilities.
e.g. “应用于 (apply to)” can be segmented into “应用|于 (apply … to …)” or “应|用于 (should be used in …)”
Case 2: the phrase is mostly segmented into only one result in real texts.
e.g. “解除了 (have dismissed)” should mostly be segmented into “解除|了 (have dismissed)”

Word Segmentation (5)
3. Approaches for segmenting ambiguous phrases of overlap type
Statistics-based approach
Build a wording capacity library. It holds frequency information for ambiguous phrases “AJB” with a chain length of 1, that is, the frequencies with which the strings form words: FreqLeft(AJ), FreqRight(B), FreqLeft(A) and FreqRight(JB).
Rule 1: If FreqLeft(AJ) + FreqRight(B) > FreqLeft(A) + FreqRight(JB), “AJB” is segmented into “AJ|B”; otherwise it is segmented into “A|JB”.
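Rule 1 is a single comparison of two frequency sums. A minimal sketch follows, assuming the wording capacity library is available as two dictionaries (`freq_left`, `freq_right`, both hypothetical names); the counts in the example are invented for illustration and are not taken from the library described here.

```python
def segment_chain1(a, j, b, freq_left, freq_right):
    """Apply Rule 1 to an overlap phrase A+J+B with a chain length of 1.

    Returns [AJ, B] if FreqLeft(AJ) + FreqRight(B) > FreqLeft(A) + FreqRight(JB),
    otherwise [A, JB]. Missing entries count as frequency 0 (an assumption).
    """
    score_aj_b = freq_left.get(a + j, 0) + freq_right.get(b, 0)
    score_a_jb = freq_left.get(a, 0) + freq_right.get(j + b, 0)
    return [a + j, b] if score_aj_b > score_a_jb else [a, j + b]


# Invented counts, for illustration only.
FREQ_LEFT = {"当代": 3, "当": 40}
FREQ_RIGHT = {"表": 5, "代表": 60}

print(segment_chain1("当", "代", "表", FREQ_LEFT, FREQ_RIGHT))
```

With these made-up counts the right-hand sum is larger, so the phrase is split as A|JB.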

Word Segmentation (6)
(Based on the statistical results for the ambiguous phrase library)
Rule 2: An ambiguous phrase with a chain length of 2, like “ABCD”, is segmented into “AB|CD”.
Rule 3: An ambiguous phrase with a chain length of 3, like “ABCDE”, is first segmented into “ABC|DE”; then the front part “ABC” is segmented as an ambiguous phrase with a chain length of 1.
Rule 4: An ambiguous phrase with a chain length of 4, like “ABCDEF”, is segmented into “AB|CD|EF”.
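Rules 2 to 4 dispatch on the chain length. The sketch below assumes, for simplicity, that each letter in the patterns “ABCD”, “ABCDE” and “ABCDEF” stands for exactly one character, and it takes the chain-length-1 decision as a callback (`resolve_chain1`, a hypothetical name). The stub `split_left` only stands in for Rule 1 and is not part of the original method.

```python
def segment_by_chain_length(phrase, chain_length, resolve_chain1):
    """Segment an overlap-type phrase according to its chain length (Rules 2-4).

    Assumes one character per pattern letter, so a phrase with chain
    length n has n + 2 characters.
    """
    units = list(phrase)
    if len(units) != chain_length + 2:
        raise ValueError("phrase length does not match the chain length")
    if chain_length == 1:
        return resolve_chain1(*units)
    if chain_length == 2:                      # Rule 2: AB|CD
        a, b, c, d = units
        return [a + b, c + d]
    if chain_length == 3:                      # Rule 3: ABC|DE, then resolve ABC
        a, b, c, d, e = units
        return resolve_chain1(a, b, c) + [d + e]
    if chain_length == 4:                      # Rule 4: AB|CD|EF
        a, b, c, d, e, f = units
        return [a + b, c + d, e + f]
    raise ValueError("no rule for this chain length")


def split_left(a, j, b):
    """Illustrative stand-in for Rule 1 that always answers A|JB."""
    return [a, j + b]


print(segment_by_chain_length("ABCD", 2, split_left))
print(segment_by_chain_length("ABCDE", 3, split_left))
print(segment_by_chain_length("ABCDEF", 4, split_left))
```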

Word Segmentation (7)
Rule-based approach
Rule 1: If an ambiguous phrase contains an appulsive (directional) verb whose preceding word is a verb, the appulsive verb is segmented as a separate word.
e.g. “真正体现出 (really embody)” should be segmented into “真正|体现|出”, because “出 (come up)” is an appulsive verb and “体现” is a verb.
Rule 2: If the first character of an ambiguous phrase is a quantifier and the word preceding the phrase is a numeral, the quantifier is segmented as a separate word.
e.g. “65层高楼 (a high building of 65 stories)” should be segmented into “65|层|高楼”, because “层” is a quantifier and “65” is a numeral.

Word Segmentation (8)
4. Approaches for segmenting ambiguous phrases of combination type
Statistics-based approach
Among all ambiguous phrases, 30% usually have only one segmentation result. Therefore, a library of 133 phrases is built. The structure of the database is as follows:

FIELD NAME   TYPE     LENGTH   EXPLANATION
word         char     4        AB
nh           number   3        number of times segmented into AB
nf           number   3        number of times segmented into A|B

Let freq = nh/(nh+nf), with thresholds α1 and α2, where α1 > α2. If freq > α1, “AB” is segmented into “AB”; if freq < α2, it is segmented into “A|B”.
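The threshold test can be written directly from the record layout above. This is a minimal sketch: the class name `ComboEntry`, the threshold values and the counts are assumptions, and because the slide does not say what happens when freq lies between α2 and α1, the sketch returns "undecided" for that case.

```python
from dataclasses import dataclass


@dataclass
class ComboEntry:
    word: str  # the string AB
    nh: int    # number of times segmented into AB (kept whole)
    nf: int    # number of times segmented into A|B


def decide(entry, alpha1, alpha2):
    """Return "AB", "A|B" or "undecided" from freq = nh / (nh + nf)."""
    if not alpha1 > alpha2:
        raise ValueError("alpha1 must be greater than alpha2")
    total = entry.nh + entry.nf
    if total == 0:
        return "undecided"          # no evidence at all (assumption)
    freq = entry.nh / total
    if freq > alpha1:
        return "AB"
    if freq < alpha2:
        return "A|B"
    return "undecided"              # case not specified in the source


# Hypothetical thresholds and counts.
print(decide(ComboEntry("个人", nh=90, nf=10), alpha1=0.8, alpha2=0.2))
```

An undecided phrase could then be handed to a context-sensitive rule, which is a design choice of this sketch rather than something stated in the source.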

Word Segmentation (9)
POS-rule-based approach
The segmentation of a word depends on the POS of its context words. If the word preceding “AB” is a numeral, “AB” is segmented into “A|B”; otherwise it is kept as “AB”.
e.g. In the sentence “他一个人睡在屋里 (He sleeps in his room by himself)”, AB = 个人. Because “一” is a numeral, “个人” should be segmented into “个|人”. But in the phrase “农民个人利益 (the individual interests of the peasantry)”, “个人” should not be segmented.
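The POS rule reduces to one test on the tag of the preceding word. A minimal sketch, assuming the tagger emits the hypothetical tag name "numeral" and that the split point inside AB is known (`split_at`, also hypothetical):

```python
def segment_combination(prev_pos, ab, split_at=1):
    """Split AB into A|B when the preceding word is a numeral, else keep AB whole."""
    if prev_pos == "numeral":
        return [ab[:split_at], ab[split_at:]]
    return [ab]


# "一" precedes "个人" in the first example, "农民" in the second.
print(segment_combination("numeral", "个人"))
print(segment_combination("noun", "个人"))
```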

Word Segmentation (10)
5. System architecture
[Figure: system architecture. Box labels: Text Pre-Processing; Basic Lexicon; Special Lexicon; User’s Lexicon; Word Matching; Ambiguous Segmentation; Rule Set of Ambiguity of Overlap Type; Wording Capacity Lib.; Ambiguous Segmentation of Overlap Type; Rule Set of Ambiguity of Combination Type; Ambiguous Phrase Lib. of Combination Type; Ambiguous Segmentation of Combination Type]

Word Segmentation (11)
6. System test results
The system has been tested on a corpus randomly chosen from Beijing Youth, which contains 607 ambiguous phrases of overlap type and 2292 ambiguous phrases of combination type. The precisions are 97% and 87% respectively.

Named Entity Extraction (1)
Description of the NTU System used for MET2 (Hsin-Hsi Chen et al., Natural Language Processing Lab., Department of Computer Science and Information Engineering, National Taiwan University)
Processing Steps of Named Entity Extraction
(1) Transform Chinese texts in GB codes into texts in Big-5 codes
(2) Segment Chinese texts into a sequence of tokens
(3) Identify named people
(4) Identify named organizations
(5) Identify named locations
(6) Use n-gram model to identify named organizations/locations
(7) Identify the rest of named expressions
(8) Transform the results in Big-5 codes into the results in GB codes

Named Entity Extraction (2)
(1) Transform Chinese texts in GB codes into texts in Big-5 codes
GB code is an internal code for the simplified Chinese character set, which is used in mainland China. Big-5, on the other hand, is an internal code for the traditional Chinese character set, which is used in Taiwan and Hong Kong.
e.g. simplified Chinese vs. traditional Chinese:
- 人工智能 vs. 人工智慧 (Artificial Intelligence)
- 软件 vs. 軟體 (Software)
- 报道 vs. 報導 (Report)
- 新西兰 vs. 紐西蘭 (New Zealand)
The NTU system is designed for traditional Chinese text, while the test texts in MET2 are in GB code, so the GB code of the test texts must be transformed into Big-5 code. This mapping is not always one-to-one; sometimes it is one-to-many.

Named Entity Extraction (3)
(2) Segment Chinese texts into a sequence of tokens
- List all possible words by dictionary look-up, and then resolve ambiguities with segmentation strategies.
- The dictionary is trained from the CKIP corpus, whose articles are collected from Taiwan newspapers, magazines, etc.
(3) Identify named people
Chinese person names
- Most Han Chinese surnames are a single character, but some are two characters.
- Most given names are two characters, but some are a single character.
- Theoretically, every character can be used in a name.
- Thus the length of Chinese names ranges from 2 to 6 characters.
Three kinds of recognition strategies are adopted:
- Name-formulation rules
- Context clues, e.g. titles, positions, speech-act verbs, etc.
- Cache

Named Entity Extraction (4)
Name-formulation rules
They are trained from a person name corpus in Taiwan, which contains 1 million Chinese names. Each entry contains surname, name and sex.
Possible candidates:
Model 1. Single character for surname
- P(C1)*P(C2)*P(C3) using male (female) training table > threshold1 (threshold3), and
- P(C2)*P(C3) using male (female) training table > threshold2 (threshold4)
Model 2. Two characters for surname
- P(C2)*P(C3) using male (female) training table > threshold2 (threshold4)
Model 3. Two surnames together
- P(C12)*P(C2)*P(C3) using female training table > threshold3,
- P(C2)*P(C3) using female training table > threshold4, and
- P(C12)*P(C2)*P(C3) using female training table > P(C12)*P(C2)*P(C3) using male training table
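The Model 1 test can be sketched as two products compared against thresholds. The layout of the training tables (separate "surname" and "given" dictionaries), the probabilities and the threshold values below are all assumptions made for illustration; the source only says that male and female training tables and four thresholds are used.

```python
def passes_model1(chars, table, t_full, t_given):
    """Check P(C1)*P(C2)*P(C3) > t_full and P(C2)*P(C3) > t_given for one table."""
    c1, c2, c3 = chars
    p_given = table["given"].get(c2, 0.0) * table["given"].get(c3, 0.0)
    p_full = table["surname"].get(c1, 0.0) * p_given
    return p_full > t_full and p_given > t_given


def is_candidate_name(chars, male, female, thresholds):
    """Accept if the male table passes (threshold1, threshold2)
    or the female table passes (threshold3, threshold4)."""
    t1, t2, t3, t4 = thresholds
    return (passes_model1(chars, male, t1, t2)
            or passes_model1(chars, female, t3, t4))


# Invented probabilities and thresholds, for illustration only.
MALE = {"surname": {"王": 0.07}, "given": {"建": 0.02, "国": 0.03}}
FEMALE = {"surname": {"王": 0.07}, "given": {"建": 0.001, "国": 0.002}}
THRESHOLDS = (1e-6, 1e-5, 1e-6, 1e-5)

print(is_candidate_name("王建国", MALE, FEMALE, THRESHOLDS))
print(is_candidate_name("王的了", MALE, FEMALE, THRESHOLDS))
```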

Named Entity Extraction (5)
Context clues, e.g. titles, positions, speech-act verbs, etc.
- Titles: 博士 (Dr.); 教授 (Prof.); 女士 (Mrs./Ms.); 小姐 (Miss); 先生 (Mr.)
- Positions: 总统 (President); 导演 (Director); 总经理 (General Manager)
- Speech-act verbs: 发言 (speak); 说 (say); 提出 (bring up)
Cache
The cache provides a global clue, because a person name may appear more than once in a document. The cache is used to store the identified candidates. Four cases arise when the cache is used:
(1) C1C2C3 and C1C2C4 are in the cache, and C1C2 is correct.
(2) C1C2C3 and C1C2C4 are in the cache, and both are correct.
(3) C1C2C3 and C1C2 are in the cache, and C1C2C3 is correct.
(4) C1C2C3 and C1C2 are in the cache, and C1C2 is correct.

Named Entity Extraction (6)
Transliterated person names
Transliterated person names denote foreigners. Their length is not restricted to 2 to 6 characters.
Main strategies:
- Transliterated name set: the transliterated names trained from MET data are regarded as a built-in name set.
- Character condition: two special character sets are retrieved from MET training data. The first character of a name must belong to a 280-character set, and the remaining characters must appear in a 411-character set. The character condition is a loose restriction and should be employed together with other clues.
- Titles: the titles used with Chinese person names are also applicable to transliterated person names.
- Name introducers, such as 叫 (be called), 名叫 (Her/His name is …), 尊称 (respectfully call sb. …)
- Special verbs, e.g. 发表 (issue/express/deliver), 暗示 (hint/imply)
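The character condition is a membership test on two character sets. In this minimal sketch the tiny sets `FIRST_CHARS` and `OTHER_CHARS` are stand-ins invented for illustration; the actual 280-character and 411-character sets come from the MET training data and are not reproduced here.

```python
def satisfies_character_condition(candidate, first_chars, other_chars):
    """True if the first character is in first_chars and all others are in other_chars."""
    if not candidate:
        return False
    return (candidate[0] in first_chars
            and all(ch in other_chars for ch in candidate[1:]))


# Toy stand-ins for the two character sets.
FIRST_CHARS = set("克布罗")
OTHER_CHARS = set("林顿什伯逊")

print(satisfies_character_condition("克林顿", FIRST_CHARS, OTHER_CHARS))
print(satisfies_character_condition("林克", FIRST_CHARS, OTHER_CHARS))
```

Because the condition is loose, a positive result is only a candidate and would still need support from titles, name introducers or special verbs.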

Named Entity Extraction (7)
(4) Identify named organizations
The structure of organization names is more complex than that of person names. Basically, a complete organization name can be divided into a name and a keyword, such as:
- names: 联合国 (UN), 美国 (USA), 罗伯逊 (Robertson)
- keywords: 部队 (Army), 大使馆 (Embassy), 基金会 (Foundation)
There are some rules to recognize organization names:
- OrganizationName -> OrganizationName + OrganizationNameKeyword
- OrganizationName -> CountryName + OrganizationNameKeyword
- OrganizationName -> PersonName + OrganizationNameKeyword
- OrganizationName -> CountryName + {D|DD} + OrganizationNameKeyword
- OrganizationName -> PersonName + {D|DD} + OrganizationNameKeyword
- OrganizationName -> LocationName + {D|DD} + OrganizationNameKeyword
- OrganizationName -> CountryName + OrganizationName
- OrganizationName -> LocationName + OrganizationName
where D is a content word, such as 国际 (International), 文教 (culture and education), etc.
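The rewrite rules can be tried out on a sequence of token tags. The sketch below encodes each tag as one letter and matches the whole sequence against one regular expression per rule. The tag names, the one-letter codes and the reading of the D slot as one or two content words are assumptions of this sketch; rules with OrganizationName on the right-hand side expect the inner organization to be tagged ORG already.

```python
import re

# One-letter codes for the (hypothetical) tag names.
CODES = {"ORG": "O", "ORGKEY": "K", "COUNTRY": "C",
         "PERSON": "P", "LOCATION": "L", "D": "D"}

# One pattern per rule, in the order the rules are listed.
ORG_PATTERNS = [re.compile(p) for p in (
    "OK", "CK", "PK",                    # X + keyword
    "CD{1,2}K", "PD{1,2}K", "LD{1,2}K",  # X + one or two content words + keyword
    "CO", "LO",                          # country/location + organization name
)]


def is_organization_name(tags):
    """True if the full tag sequence matches one of the organization-name rules."""
    encoded = "".join(CODES.get(tag, "?") for tag in tags)
    return any(p.fullmatch(encoded) for p in ORG_PATTERNS)


print(is_organization_name(["COUNTRY", "ORGKEY"]))      # e.g. 美国 + 大使馆
print(is_organization_name(["PERSON", "D", "ORGKEY"]))  # e.g. 罗伯逊 + 国际 + 基金会
print(is_organization_name(["D", "ORGKEY"]))
```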

Named Entity Extraction (8)
(5) Identify named locations
The structure of location names is similar to that of organization names. The rules are like:
- LocationName -> PersonName + LocationNameKeyword
- LocationName -> LocationName + LocationNameKeyword
The following are some examples of location keywords: 山 (mountain); 中心 (center); 公路 (highway); 以北 (north of …); 市 (city)
Other strategies for recognizing location names without keywords:
- Locative verbs: 来自 (come from …); 前往 (go to …)
- Cache
- N-gram model: employ multiple occurrences to find a pattern

Named Entity Extraction (9)
(6) Use n-gram model to identify named organizations/locations
Although the cache mechanism and the n-gram model use the same feature, i.e. multiple occurrences, their concepts are totally different. For organization names, it is unclear when a pattern should be put into the cache, because its left boundary is hard to determine.
In the model, the patterns are selected to meet the following criteria:
- It must consist of a name and an organization name keyword.
- Its length must be greater than two words.
- It must not cross a sentence boundary or any punctuation mark.
- It must occur at least twice.
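The selection criteria can be sketched as a counting pass over word n-grams. The sketch assumes that the keyword is the last word of a pattern, that sentences arrive as lists of segmented tokens, and that `max_len` (a hypothetical parameter) bounds the pattern length; it does not check that the leading words really form a name.

```python
from collections import Counter

PUNCTUATION = set("，。、；：！？,.;:!?")


def candidate_org_patterns(sentences, org_keywords, max_len=5):
    """Return word n-grams that end in an organization keyword, are longer than
    two words, contain no punctuation, stay inside one sentence and occur at
    least twice."""
    counts = Counter()
    for tokens in sentences:                  # one list per sentence
        for end, token in enumerate(tokens):
            if token not in org_keywords:
                continue
            for n in range(3, max_len + 1):   # length greater than two words
                start = end - n + 1
                if start < 0:
                    break
                gram = tuple(tokens[start:end + 1])
                if any(t in PUNCTUATION for t in gram):
                    break                     # longer n-grams would cross it too
                counts[gram] += 1
    return {gram for gram, c in counts.items() if c >= 2}


SENTENCES = [
    ["他", "在", "上海", "交通", "大学", "，", "工作"],
    ["上海", "交通", "大学", "成立"],
]
print(candidate_org_patterns(SENTENCES, {"大学"}))
```

Only the n-gram that appears in both sentences survives; the longer n-grams from the first sentence occur once and are dropped.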

Named Entity Extraction (10)
(7) Identify the rest of named expressions
The rule-based approach is used for the following named expressions:
Date expressions
- DATE -> NUMBER + YEAR
- DATE -> NUMBER + MTHUNIT
Time expressions
- TIME -> NUMBER + HUNIT
- TIME -> TIME + BSTATE
Monetary expressions
- DMONEY -> MOUNIT + NUMBER + MOUNIT
- DMONEY -> NUMBER + MOUNIT
Percentage expressions
- DPERCENT -> PERCENT + NUMBER
- DPERCENT -> NUMBER + PERCENT
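A few of these rules can be approximated with regular expressions over raw text. The unit words chosen below (年, 月, 时/点, 元/美元, %) are assumptions about what YEAR, MTHUNIT, HUNIT, MOUNIT and PERCENT stand for, and only the NUMBER + unit rules are covered; the sketch is not the original rule set.

```python
import re

# (label, pattern) pairs for the NUMBER + unit rules only.
RULES = [
    ("DATE", re.compile(r"\d+年")),             # DATE -> NUMBER + YEAR
    ("DATE", re.compile(r"\d+月")),             # DATE -> NUMBER + MTHUNIT
    ("TIME", re.compile(r"\d+[时点]")),         # TIME -> NUMBER + HUNIT
    ("DMONEY", re.compile(r"\d+(?:美元|元)")),  # DMONEY -> NUMBER + MOUNIT
    ("DPERCENT", re.compile(r"\d+%")),          # DPERCENT -> NUMBER + PERCENT
]


def tag_expressions(text):
    """Return (matched text, label) pairs in order of appearance."""
    found = []
    for label, pattern in RULES:
        for m in pattern.finditer(text):
            found.append((m.start(), m.group(), label))
    return [(s, label) for _, s, label in sorted(found)]


print(tag_expressions("1998年5月上涨了12%，达到300美元"))
```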

Named Entity Extraction (11)
(8) Transform the results in Big-5 codes into the results in GB codes
MET2 Testing Results

Named Entity        Recall (%)   Precision (%)
Person Name         91           74
Organization Name   78           85
Location Name       78           69
Date                94           88
Time                98           70
Money               98           98
Percent             83           98

F-MEASURES: P&R 79.61%, 2P&R 77.88%, P&2R 81.42%
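The three F-measures are consistent with the weighted harmonic mean F_β = (1 + β²)·P·R / (β²·P + R), reading P&R as β = 1, 2P&R as precision weighted twice as much as recall (β = 0.5) and P&2R as recall weighted twice as much (β = 2). That reading is an assumption based on the usual MUC scoring convention. The sketch applies the formula to the Person Name row purely as an illustration; the overall figures on the slide come from aggregate counts that are not given here.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F-beta)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Person Name row: recall 91%, precision 74%.
for name, beta in (("P&R", 1.0), ("2P&R", 0.5), ("P&2R", 2.0)):
    print(name, round(f_measure(74, 91, beta), 2))
```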

Entity Relation Extraction (1)
A Trainable Method for Extracting Chinese Entity Names and Their Relations (Yimin Zhang et al., Intel China Research Center, Beijing, China)
- The process can be divided into two stages. The first is the learning process, in which several classifiers are built from the training data. The second is the extracting process, in which Chinese entity names and their relations are extracted using the learned classifiers.
- The learning algorithm used in the learning process is memory-based learning (MBL), which is a classification-based supervised learning approach.

Entity Relation Extraction (2)
Memory-Based Learning Architecture
[Figure: memory-based learning architecture. Labels: Learning; EXAMPLES; Storage; Computation of Metrics; Performance; INPUT CASES; Similarity-Based Reasoning; OUTPUT]

Entity Relation Extraction (3)
The main steps of the learning process:
(1) Prepare training data in which all noun phrases, entity names and relations are manually annotated.
(2) Perform segmentation, tagging and partial parsing on the training data.
(3) Extract the training sets from the parsed training data. Four training sets are extracted for different tasks, related respectively to Chinese person names, entity names, noun phrases, and relations between entity names in the training data. The main features used in an example can be either local context features, e.g. dependency relations, or global context features, e.g. the features of a word in the whole document.
(4) Use the MBL algorithm to obtain an IG-Tree for each of the four training sets. An IG-Tree is a compressed representation of the training set that can be processed quickly in the classification process.

Entity Relation Extraction (4)
The main steps of the extracting process:
(1) Perform segmentation, tagging and partial parsing on the Chinese documents.
(2) Identify Chinese person names using the PersonName-IG-Tree.
(3) Identify Chinese organization names using the same method as the NTU system.
(4) Identify other entity names using the same method as the NTU system.
(5) Identify Chinese noun phrases (NP chunking) using the NP-IG-Tree.
(6) Use the extracted entity names and noun phrases to perform partial parsing again in order to fix parsing errors.
(7) Use the EntityName-IG-Tree to classify the extracted noun phrases. This step identifies entity names that were missed in the previous steps.
(8) Use the Relation-IG-Tree to identify relations between the extracted entity names.

Entity Relation Extraction (5)
The entity relations extracted: Employee-of, Location-of, Product-of and No-relation
The features for this task:
- The features used in the CRYSTAL system
- Some new features, such as the linear order of the entity names, the word(s) between the entity names, the relative position of the entity names (in the same sentence or in neighboring sentences), etc.

Entity Relation Extraction (6)
Example: The phrase “联想总裁 (Legend’s President)” (Note: Legend = Legend Holdings Limited or Legend Group, a famous computer company in China) in the subject position includes the features:
- SUBJ-Terms-联想
- SUBJ-Terms-总裁
- SUBJ-Mod-Terms-联想
- SUBJ-Head-Terms-总裁
- SUB-Classes-Employee
- SUB-Mod-Classes-Organization
- SUB-Head-Classes-Organization (should be Position)

Entity Relation Extraction (7)
Learning and extracting processes:
- For every two related entity names in the training data, a training example is identified and extracted. After all examples are extracted, they are fed to the MBL learner to build the Relation-IG-Tree.
- In the extracting process, all pairs of entity names are extracted in the same way as in the learning process. The relation between every pair of entity names is then derived with the Relation-IG-Tree.
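The store-then-compare idea behind memory-based learning can be shown with a plain nearest-neighbor classifier that uses an overlap metric. This is a deliberate simplification: it keeps all examples in a list instead of compressing them into an IG-Tree, and the feature tuple (type of first entity, type of second entity, word between, same sentence) and the two training examples are hypothetical.

```python
class MemoryBasedClassifier:
    """Stores labeled examples and classifies by the most similar stored example."""

    def __init__(self, weights=None):
        self.examples = []      # list of (feature tuple, label)
        self.weights = weights  # optional per-feature weights

    def learn(self, features, label):
        self.examples.append((tuple(features), label))

    def _similarity(self, x, y):
        """Overlap metric: sum the weights of the features that match."""
        weights = self.weights or [1.0] * len(x)
        return sum(w for w, a, b in zip(weights, x, y) if a == b)

    def classify(self, features, default="No-relation"):
        if not self.examples:
            return default
        best = max(self.examples,
                   key=lambda ex: self._similarity(features, ex[0]))
        return best[1]


clf = MemoryBasedClassifier()
clf.learn(("Company", "Product", "作为", True), "Product-of")
clf.learn(("Person", "Company", "是", False), "Employee-of")

print(clf.classify(("Person", "Company", "以", False)))
```

Per-feature weights could be set from information gain, which is the quantity an IG-Tree uses to order features; that step is left out of the sketch.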

Entity Relation Extraction (8)
Example 1: “浪潮集团作为国内著名的IT硬件设备制造商，…”
As a famous manufacturer of IT hardware devices in China, the Lang Chao Group …
- Company name: 浪潮集团
- Product name: IT硬件设备
- Training example: Company name (作为/是) … Product name 制造商
- Relation: product-of
Example 2: “吴士宏再度成为媒体关注的焦点。不过，这次她是以TCL集团副总裁兼信息产业公司总经理的身份来上海的。”
Wu Shihong became the focus of media attention once again. This time, however, she came to Shanghai as the vice president of TCL Group and general manager of its IT company.
- Person name: 吴士宏
- Company name: TCL集团
- Training example: If a person name and a company name appear in neighboring sentences, and no other person names or company names are found in between, they tend to have an employee-of relation.
- Relation: employee-of

Entity Relation Extraction (9)
System testing results:
To test this approach, a manually annotated corpus of about 200 business news articles is used. All the entity names (about 500 person names and 300 organization names), noun phrases, and relations in the corpus were manually annotated. Ten pairs of training and test sets were randomly selected from the corpus, with each set equal in size to half of the entire corpus. All data sets were tested; the results are as follows:

                    Recall (%)   Precision (%)
Person Name         86.3         83.2
Organization Name   73.4         89.3
Employee-of         75.6         92.3
Product-of          56.2         87.1
Location-of         67.2         75.6

Conclusion
- Chinese is typologically different from English or German. Chinese NLP has some special difficulties, such as word segmentation.
- There are mainly two types of ambiguous phrases in Chinese word segmentation: the overlap type and the combination type. Among overlap-type ambiguous phrases, chain lengths of 1 or 2 account for 95%. Among combination-type ambiguous phrases, 30% usually have only one possible segmentation. Ambiguity can be removed with methods that depend on the ambiguity type.
- Chinese named entities are major constituents of Chinese documents. Different methods can be combined to extract them, such as character conditions, statistical information, titles, punctuation marks, organization and location keywords, speech-act and locative verbs, cache and n-gram model.
- The determination of Chinese entity relations can be viewed as a classification process. In the learning process, several classifiers are built from the training data. In the extracting process, the relations are extracted using the learned classifiers. Machine learning techniques have been used effectively in Chinese entity relation extraction.
