
Exploiting Shared Chinese Characters in Chinese Word Segmentation Optimization for Chinese-Japanese Machine Translation

Chenhui Chu, Toshiaki Nakazawa, Daisuke Kawahara, Sadao Kurohashi
Graduate School of Informatics, Kyoto University
EAMT 2012 (2012/05/28)


Presentation Transcript


  1. Exploiting Shared Chinese Characters in Chinese Word Segmentation Optimization for Chinese-Japanese Machine Translation Chenhui Chu, Toshiaki Nakazawa, Daisuke Kawahara, Sadao Kurohashi Graduate School of Informatics, Kyoto University EAMT 2012 (2012/05/28)

  2. Outline • Word Segmentation Problems • Common Chinese Characters • Chinese Word Segmentation Optimization • Experiments • Discussion • Related Work • Conclusion and Future Work

  3. Outline • Word Segmentation Problems • Common Chinese Characters • Chinese Word Segmentation Optimization • Experiments • Discussion • Related Work • Conclusion and Future Work

  4. Word Segmentation for Chinese-Japanese MT
Zh: 小坂先生是日本临床麻醉学会的创始者。
Ja: 小坂先生は日本臨床麻酔学会の創始者である。
Zh: 小/坂/先生/是/日本/临床/麻醉/学会/的/创始者/。
Ja: 小坂/先生/は/日本/臨床/麻酔/学会/の/創始/者/である/。
Ref: Mr. Kosaka is the founder of The Japan Society for Clinical Anesthesiologists.

  5. Word Segmentation Problems in Chinese-Japanese MT • Unknown words • Affect segmentation accuracy and consistency • Word segmentation granularity • Affect word alignment
Zh: 小/坂/先生/是/日本/临床/麻醉/学会/的/创始者/。
Ja: 小坂/先生/は/日本/臨床/麻酔/学会/の/創始/者/である/。
Ref: Mr. Kosaka is the founder of The Japan Society for Clinical Anesthesiologists.

  6. Outline • Introduction • Common Chinese Characters • Chinese Word Segmentation Optimization • Experiments • Discussion • Related Work • Conclusion and Future Work

  7. Chinese Characters • Chinese characters are used both in Chinese (Hanzi) and Japanese (Kanji) • There are many common Chinese characters between Hanzi and Kanji • We made a common Chinese character mapping table for the 6,355 JIS Kanji (Chu et al., 2012)
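
For illustration, a minimal sketch of how such a mapping table could be loaded and applied, assuming a simple tab-separated file with one JIS Kanji and its simplified Hanzi equivalent per line; the file format and function names are assumptions for illustration, not the actual resource of Chu et al. (2012):

```python
# Minimal sketch (not the authors' code). Assumes a hypothetical TSV file
# with lines of the form "漢<TAB>汉": JIS Kanji followed by simplified Hanzi.

def load_kanji_to_hanzi(path):
    """Load the character mapping table into a dict."""
    table = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            kanji, hanzi = line.rstrip("\n").split("\t")
            table[kanji] = hanzi
    return table

def kanji_to_hanzi(token, table):
    """Convert a Japanese token character by character; characters without
    a mapping (e.g. Kana) are left unchanged."""
    return "".join(table.get(ch, ch) for ch in token)
```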

  8. Related Studies on Common Chinese Characters • Automatic sentence alignment (Tan et al., 1995) • Dictionary construction (Goh et al., 2005) • Investigation of word-level semantic relations (Huang et al., 2008) • Phrase alignment (Chu et al., 2011) • This study exploits common Chinese characters in Chinese word segmentation optimization for MT

  9. Outline • Word Segmentation Problems • Common Chinese Characters • Chinese Word Segmentation Optimization • Experiments • Discussion • Related Work • Conclusion and Future Work

  10. Reason for Chinese Word Segmentation Optimization • Segmentation for Japanese is easier than for Chinese, because Japanese uses Kana in addition to Chinese characters • The F-score for Japanese segmentation is nearly 99% (Kudo et al., 2004), while that for Chinese is still about 95% (Wang et al., 2011) • Therefore, we optimize word segmentation only for Chinese and keep the Japanese segmentation results unchanged

  11. [Pipeline diagram: common Chinese characters and the parallel training corpus feed ① Chinese Lexicons Extraction; the extracted Chinese lexicons are added to the system dictionary of the Chinese segmenter (② Chinese Lexicons Incorporation) and used for ③ Short Unit Transformation of the Chinese annotated corpus; the segmenter is then retrained on the short unit corpus with the enlarged dictionary, yielding the optimized Chinese segmenter.]

  12. [Pipeline diagram repeated; next step: ① Chinese Lexicons Extraction]

  13. ① Chinese Lexicons Extraction • Step 1: Segment Chinese and Japanese sentences in the parallel training corpus • Step 2: Convert Japanese Kanji tokens into Chinese using the mapping table we made (Chu et al., 2012) • Step 3: Extract the converted tokens as Chinese lexicons if they exist in the corresponding Chinese sentence
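
A minimal sketch of these three steps, assuming a hypothetical `segment_ja` callable for the Japanese segmenter and the `kanji_to_hanzi` conversion sketched earlier; it illustrates the procedure on the slide rather than reproducing the authors' implementation:

```python
# Minimal sketch of Chinese lexicon extraction (not the authors' code).
# `parallel_corpus` is assumed to yield (chinese_sentence, japanese_sentence)
# string pairs; `segment_ja` returns a list of Japanese tokens.

def extract_chinese_lexicons(parallel_corpus, segment_ja, kanji_to_hanzi, table):
    lexicons = set()
    for zh_sent, ja_sent in parallel_corpus:
        # Step 1: segment the Japanese sentence.
        for token in segment_ja(ja_sent):
            # Step 2: convert Kanji in the token into Hanzi via the mapping table.
            converted = kanji_to_hanzi(token, table)
            # Consider only tokens composed entirely of Chinese characters.
            if not converted or not all('\u4e00' <= c <= '\u9fff' for c in converted):
                continue
            # Step 3: keep the converted token as a Chinese lexicon if it also
            # appears in the corresponding Chinese sentence.
            if converted in zh_sent:
                lexicons.add(converted)
    return lexicons
```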

  14. Extraction Example
Ja: 小坂/先生/は/日本/臨床/麻酔/学会/の/創始/者/である/。
(Kanji token conversion)
Ja: 小坂/先生/は/日本/临床/麻醉/学会/の/创始/者/である/。
(Check against the corresponding Chinese sentence)
Zh: 小/坂/先生/是/日本/临床/麻醉/学会/的/创始者/。
(Extraction)
Chinese Lexicons: 小坂 先生 日本 临床 麻醉 学会 创始 者
Ref: Mr. Kosaka is the founder of The Japan Society for Clinical Anesthesiologists.
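
As a toy usage, the same check applied to the sentence pair on this slide, with the Japanese segmentation and a partial character mapping hard-coded purely for illustration:

```python
# Toy run of the extraction on this slide's example; the segmentation and the
# (partial) mapping table are hard-coded stand-ins for the real resources.
ja_tokens = ["小坂", "先生", "は", "日本", "臨床", "麻酔", "学会",
             "の", "創始", "者", "である", "。"]
zh_sent = "小坂先生是日本临床麻醉学会的创始者。"
kanji2hanzi = {"臨": "临", "酔": "醉", "創": "创"}   # only the differing characters

def convert(tok):
    return "".join(kanji2hanzi.get(c, c) for c in tok)

lexicons = [convert(t) for t in ja_tokens
            if all('\u4e00' <= c <= '\u9fff' for c in convert(t))
            and convert(t) in zh_sent]
print(lexicons)   # ['小坂', '先生', '日本', '临床', '麻醉', '学会', '创始', '者']
```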

  15. [Pipeline diagram repeated; next step: ② Chinese Lexicons Incorporation]

  16. ② Chinese Lexicons Incorporation • Using a system dictionary is helpful for Chinese word segmentation (Low et al., 2005; Wang et al., 2011) • We incorporate the extracted lexicons into the system dictionary of a Chinese segmenter • POS tags are assigned by converting the POS tags given by the Japanese segmenter, using a POS tag mapping table between Japanese and Chinese
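
A minimal sketch of how extracted lexicons with converted POS tags could be appended to the segmenter's system dictionary; the Japanese-to-Chinese POS mapping entries below are invented for illustration and are not the actual mapping table used in the paper, and `ja_pos_of` is a hypothetical lookup of the POS tag assigned by the Japanese segmenter:

```python
# Minimal sketch of lexicon incorporation (not the authors' code).

JA_TO_ZH_POS = {       # illustrative entries only, not the paper's table
    "名詞": "NN",       # Japanese common noun -> CTB-style noun tag
    "動詞": "VV",       # verb
    "形容詞": "VA",      # adjective
}

def build_dictionary_entries(lexicons, ja_pos_of):
    """Return (word, Chinese POS) pairs to add to the system dictionary."""
    entries = []
    for word in sorted(lexicons):
        ja_pos = ja_pos_of(word)                  # POS from the Japanese side
        zh_pos = JA_TO_ZH_POS.get(ja_pos, "NN")   # fall back to NN if unmapped
        entries.append((word, zh_pos))
    return entries
```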

  17. [Pipeline diagram repeated; next step: ③ Short Unit Transformation]

  18. ③ Short Unit Transformation • Adjusting Chinese word segmentation so that as many tokens as possible have 1-to-1 mappings between parallel sentences can improve alignment accuracy (Bai et al., 2008) • Wang et al. (2010) proposed a short unit standard for Chinese word segmentation, which can reduce the number of 1-to-n alignments and improve MT performance

  19. Our Method • We transform the annotated training data of the Chinese segmenter using the extracted lexicons
CTB: 从_P/有效性_NN/高_VA/的_DEC/格要素_NN/…
Lexicons: 有效 (effective), 要素 (element)
Short: 从_P/有效_NN/性_NN/高_VA/的_DEC/格_NN/要素_NN/…
Ref: From case element with high effectiveness …

  20. Constraints • We do not use extracted lexicons that are composed of only one Chinese character, e.g.:
long token: 歌颂 (praise)
extracted lexicons: 歌 (song) ・・・
short unit tokens: 歌 (song) / 颂 (praise)
(using the single-character lexicon 歌 would wrongly split 歌颂 into 歌/颂)
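
A minimal sketch combining the transformation from the previous slide with this constraint; splitting only on prefix or suffix matches of multi-character lexicons and copying the original POS tag to every piece is a simplification of the actual procedure:

```python
# Minimal sketch of short unit transformation (not the authors' exact rules).

def short_unit_split(token, pos, lexicons):
    """Split one annotated token (word, POS) into short unit tokens."""
    usable = {w for w in lexicons if len(w) > 1}   # drop single-character lexicons
    for lex in sorted(usable, key=len, reverse=True):
        if token != lex and token.startswith(lex):
            return [(lex, pos)] + short_unit_split(token[len(lex):], pos, lexicons)
        if token != lex and token.endswith(lex):
            return short_unit_split(token[:-len(lex)], pos, lexicons) + [(lex, pos)]
    return [(token, pos)]

# Examples from slide 19:
# short_unit_split("有效性", "NN", {"有效"}) -> [("有效", "NN"), ("性", "NN")]
# short_unit_split("格要素", "NN", {"要素"}) -> [("格", "NN"), ("要素", "NN")]
# With the constraint, 歌颂 is left intact because 歌 is a single character.
```

Copying the POS tag to every piece also makes the assignment problem on slide 36 visible: 被 would keep the NN tag of 被实验者 even though LB would be correct.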

  21. Outline • Word Segmentation Problems • Common Chinese Characters • Chinese Word Segmentation Optimization • Experiments • Discussion • Related Work • Conclusion and Future Work

  22. Two Kinds of Experiments • Experiments on Moses • Experiments on the Kyoto example-based machine translation (EBMT) system (Nakazawa and Kurohashi, 2011) • A dependency tree-based decoder

  23. Experimental Settings on MOSES (1/2)

  24. Experimental Settings on MOSES (2/2) • Baseline: Only using the lexicons extracted from the Chinese annotated corpus • Incorporation: Incorporate the extracted Chinese lexicons • Short unit: Incorporate the extracted Chinese lexicons and train the Chinese segmenter on the short unit training data

  25. Results of Chinese-to-Japanese Translation Experiments on MOSES • CTB 7 shows better performance because it is more than three times the size of the NICT Chinese Treebank • Lexicons extracted from the paper abstract domain also work well on other domains (i.e., CTB 7)

  26. Results of Japanese-to-Chinese Translation Experiments on MOSES • Improvements are not significant compared to Zh-to-Ja, because our proposed approach does not change the segmentation of the input Japanese sentences

  27. Experimental Settings on EBMT (1/2)

  28. Experimental Settings on EBMT (2/2) • Baseline: Only using the lexicons extracted from the Chinese annotated corpus • Short unit: Incorporate the extracted Chinese lexicons and train the Chinese segmenter on the short unit training data

  29. Results of Translation Experiments on EBMT • Translation performance is worse than with MOSES, because EBMT suffers from the low accuracy of the Chinese parser • Improvement by short unit is not significant because the Chinese parser is not trained on short unit segmented training data

  30. Outline • Word Segmentation Problems • Common Chinese Characters • Chinese Word Segmentation Optimization • Experiments • Discussion • Related Work • Conclusion and Future Work

  31. Short Unit Effectiveness on MOSES
Baseline (BLEU=49.38)
Input: 本/论文/中/,/提议/考虑/现存/实现/方式/的/功能/适应性/决定/对策/目标/的/保密/基本/设计法/。
Output: 本/論文/で/は/,/提案/する/適応/的/対策/を/決定/する/セキュリティ/基本/設計/法/を/考える/現存/の/実現/方式/の/機能/を/目標/と/して/いる/.
Short unit (BLEU=56.33)
Input: 本/论文/中/,/提议/考虑/现存/实现/方式/的/功能/适应/性/决定/对策/目标/的/保密/基本/设计/法/。
Output: 本/論文/で/は/,/提案/する/考え/現存/の/実現/方式/の/機能/的/適応/性/を/決定/する/対策/目標/の/セキュリティ/基本/設計/法/を/提案/する/.
Reference: 本/論文/で/は/,/対策/目標/を/現存/の/実現/方式/の/機能/的/適合/性/も/考慮/して/決定/する/セキュリティ/基本/設計/法/を/提案/する/.
(In this paper, we propose a basic security design method that also considers the functional suitability of the existing implementation method when determining countermeasure targets.)

  32. Number of Extracted Lexicons • The number of extracted lexicons decreased after short unit transformation because duplicated lexicons increased

  33. Short Unit Transformation Percentage • NICT Chinese Treebank: 6,623 out of 257,825 tokens were transformed into 13,469 short unit tokens (2.57%) • CTB 7: 19,983 out of 718,716 tokens were transformed into 41,336 short unit tokens (2.78%)

  34. Short Unit Transformation Problems (1/3) • Improper transformation problem
long token: 不好意思 (sorry)
extracted lexicons: 好意 (favor) ・・・
short unit tokens: 不 (not) / 好意 (favor) / 思 (think)

  35. Short Unit Transformation Problems (2/3) • Transformation ambiguity problem
long token: 充电器 (charger)
extracted lexicons: 充电 (charge), 电器 (electric equipment) ・・・
short unit tokens: 充电 (charge) / 器 (device) or 充 (charge) / 电器 (electric equipment)

  36. Short Unit Transformation Problems (3/3) • POS tag assignment problem
long token: 被实验者_NN (test subject)
extracted lexicons: 实验 (test) ・・・
short unit tokens: 被_NN (be) / 实验_NN (test) / 者_NN (person)
The correct POS tag for 被 (be) should be LB (被 in a long bei-construction)

  37. Outline • Word Segmentation Problems • Common Chinese Characters • Chinese Word Segmentation Optimization • Experiments • Discussion • Related Work • Conclusion and Future Work

  38. Bai et al., 2008 • Proposed a method of learning affix rules from an aligned Chinese-English bilingual terminology bank to adjust Chinese word segmentation in the parallel corpus directly

  39. Wang et al., 2010 • Proposed a method based on transfer rules and a transfer database. • The transfer rules are extracted from alignment results of annotated Chinese and segmented Japanese training data • The transfer database is constructed using external lexicons, and is manually modified

  40. Conclusion • We proposed an approach that exploits common Chinese characters in Chinese word segmentation optimization for Chinese-Japanese MT • Experimental results of Chinese-Japanese MT on a phrase-based SMT system indicated that our approach can improve MT performance significantly

  41. Future Work • Solve the short unit transformation problems • Evaluate the proposed approach on parallel corpora of other domains
