
Exploiting Shared Chinese Characters in Chinese Word Segmentation Optimization for Chinese-Japanese Machine Translation

Chenhui Chu, Toshiaki Nakazawa, Daisuke Kawahara, Sadao Kurohashi
Graduate School of Informatics, Kyoto University
EAMT 2012 (2012/05/28)


Presentation Transcript


  1. Exploiting Shared Chinese Characters in Chinese Word Segmentation Optimization for Chinese-Japanese Machine Translation Chenhui Chu, Toshiaki Nakazawa, Daisuke Kawahara, Sadao Kurohashi Graduate School of Informatics, Kyoto University EAMT 2012 (2012/05/28)

  2. Outline • Word Segmentation Problems • Common Chinese Characters • Chinese Word Segmentation Optimization • Experiments • Discussion • Related Work • Conclusion and Future Work

  3. Outline • Word Segmentation Problems • Common Chinese Characters • Chinese Word Segmentation Optimization • Experiments • Discussion • Related Work • Conclusion and Future Work

  4. Word Segmentation for Chinese-Japanese MT
Zh: 小坂先生是日本临床麻醉学会的创始者。
Ja: 小坂先生は日本臨床麻酔学会の創始者である。
Zh: 小/坂/先生/是/日本/临床/麻醉/学会/的/创始者/。
Ja: 小坂/先生/は/日本/臨床/麻酔/学会/の/創始/者/である/。
Ref: Mr. Kosaka is the founder of The Japan Society for Clinical Anesthesiologists.

  5. Word Segmentation Problems in Chinese-Japanese MT • Unknown words • Affect segmentation accuracy and consistency • Word segmentation granularity • Affect word alignment
Zh: 小/坂/先生/是/日本/临床/麻醉/学会/的/创始者/。
Ja: 小坂/先生/は/日本/臨床/麻酔/学会/の/創始/者/である/。
Ref: Mr. Kosaka is the founder of The Japan Society for Clinical Anesthesiologists.

  6. Outline • Introduction • Common Chinese Characters • Chinese Word Segmentation Optimization • Experiments • Discussion • Related Work • Conclusion and Future Work

  7. Chinese Characters • Chinese characters are used both in Chinese (Hanzi) and Japanese (Kanji) • There are many common Chinese characters between Hanzi and Kanji • We made a common Chinese character mapping table for the 6,355 JIS Kanji (Chu et al., 2012)
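
For illustration, a minimal sketch of how such a mapping table could be loaded and applied, assuming a simple tab-separated file with one JIS Kanji and its simplified Hanzi equivalent per line; the file format and function names are assumptions for illustration, not the actual resource of Chu et al. (2012):

```python
# Minimal sketch (not the authors' code). Assumes a hypothetical TSV file
# with lines of the form "漢<TAB>汉": JIS Kanji followed by simplified Hanzi.

def load_kanji_to_hanzi(path):
    """Load the character mapping table into a dict."""
    table = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            kanji, hanzi = line.rstrip("\n").split("\t")
            table[kanji] = hanzi
    return table

def kanji_to_hanzi(token, table):
    """Convert a Japanese token character by character; characters without
    a mapping (e.g. Kana) are left unchanged."""
    return "".join(table.get(ch, ch) for ch in token)
```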

  8. Related Studies on Common Chinese Characters • Automatic sentence alignment (Tan et al., 1995) • Dictionary construction (Goh et al., 2005) • Investigation of word-level semantic relations (Huang et al., 2008) • Phrase alignment (Chu et al., 2011) • This study exploits common Chinese characters in Chinese word segmentation optimization for MT

  9. Outline • Word Segmentation Problems • Common Chinese Characters • Chinese Word Segmentation Optimization • Experiments • Discussion • Related Work • Conclusion and Future Work

  10. Reason for Chinese Word Segmentation Optimization • Segmentation for Japanese is easier than for Chinese, because Japanese uses Kana in addition to Chinese characters • The F-score for Japanese segmentation is nearly 99% (Kudo et al., 2004), while that for Chinese is still about 95% (Wang et al., 2011) • Therefore, we optimize word segmentation only for Chinese and keep the Japanese segmentation results unchanged

  11. [Pipeline diagram: common Chinese characters and the parallel training corpus feed ① Chinese Lexicons Extraction; the extracted Chinese lexicons are added to the system dictionary of the Chinese segmenter (② Chinese Lexicons Incorporation) and used for ③ Short Unit Transformation of the Chinese annotated corpus; the segmenter is then retrained on the short unit corpus with the enlarged dictionary, yielding the optimized Chinese segmenter.]

  12. [Pipeline diagram repeated; next step: ① Chinese Lexicons Extraction]

  13. ① Chinese Lexicons Extraction • Step 1: Segment Chinese and Japanese sentences in the parallel training corpus • Step 2: Convert Japanese Kanji tokens into Chinese using the mapping table we made (Chu et al., 2012) • Step 3: Extract the converted tokens as Chinese lexicons if they exist in the corresponding Chinese sentence
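
A minimal sketch of these three steps, assuming a hypothetical `segment_ja` callable for the Japanese segmenter and the `kanji_to_hanzi` conversion sketched earlier; it illustrates the procedure on the slide rather than reproducing the authors' implementation:

```python
# Minimal sketch of Chinese lexicon extraction (not the authors' code).
# `parallel_corpus` is assumed to yield (chinese_sentence, japanese_sentence)
# string pairs; `segment_ja` returns a list of Japanese tokens.

def extract_chinese_lexicons(parallel_corpus, segment_ja, kanji_to_hanzi, table):
    lexicons = set()
    for zh_sent, ja_sent in parallel_corpus:
        # Step 1: segment the Japanese sentence.
        for token in segment_ja(ja_sent):
            # Step 2: convert Kanji in the token into Hanzi via the mapping table.
            converted = kanji_to_hanzi(token, table)
            # Consider only tokens composed entirely of Chinese characters.
            if not converted or not all('\u4e00' <= c <= '\u9fff' for c in converted):
                continue
            # Step 3: keep the converted token as a Chinese lexicon if it also
            # appears in the corresponding Chinese sentence.
            if converted in zh_sent:
                lexicons.add(converted)
    return lexicons
```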

  14. Extraction Example
Ja: 小坂/先生/は/日本/臨床/麻酔/学会/の/創始/者/である/。
(Kanji token conversion)
Ja: 小坂/先生/は/日本/临床/麻醉/学会/の/创始/者/である/。
(Check against the corresponding Chinese sentence)
Zh: 小/坂/先生/是/日本/临床/麻醉/学会/的/创始者/。
(Extraction)
Chinese Lexicons: 小坂 先生 日本 临床 麻醉 学会 创始 者
Ref: Mr. Kosaka is the founder of The Japan Society for Clinical Anesthesiologists.
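
As a toy usage, the same check applied to the sentence pair on this slide, with the Japanese segmentation and a partial character mapping hard-coded purely for illustration:

```python
# Toy run of the extraction on this slide's example; the segmentation and the
# (partial) mapping table are hard-coded stand-ins for the real resources.
ja_tokens = ["小坂", "先生", "は", "日本", "臨床", "麻酔", "学会",
             "の", "創始", "者", "である", "。"]
zh_sent = "小坂先生是日本临床麻醉学会的创始者。"
kanji2hanzi = {"臨": "临", "酔": "醉", "創": "创"}   # only the differing characters

def convert(tok):
    return "".join(kanji2hanzi.get(c, c) for c in tok)

lexicons = [convert(t) for t in ja_tokens
            if all('\u4e00' <= c <= '\u9fff' for c in convert(t))
            and convert(t) in zh_sent]
print(lexicons)   # ['小坂', '先生', '日本', '临床', '麻醉', '学会', '创始', '者']
```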

  15. [Pipeline diagram repeated; next step: ② Chinese Lexicons Incorporation]

  16. ② Chinese Lexicons Incorporation • Using a system dictionary is helpful for Chinese word segmentation (Low et al., 2005; Wang et al., 2011) • We incorporate the extracted lexicons into the system dictionary of a Chinese segmenter • POS tags are assigned by converting the POS tags given by the Japanese segmenter, using a POS tag mapping table between Japanese and Chinese
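
A minimal sketch of how extracted lexicons with converted POS tags could be appended to the segmenter's system dictionary; the Japanese-to-Chinese POS mapping entries below are invented for illustration and are not the actual mapping table used in the paper, and `ja_pos_of` is a hypothetical lookup of the POS tag assigned by the Japanese segmenter:

```python
# Minimal sketch of lexicon incorporation (not the authors' code).

JA_TO_ZH_POS = {       # illustrative entries only, not the paper's table
    "名詞": "NN",       # Japanese common noun -> CTB-style noun tag
    "動詞": "VV",       # verb
    "形容詞": "VA",      # adjective
}

def build_dictionary_entries(lexicons, ja_pos_of):
    """Return (word, Chinese POS) pairs to add to the system dictionary."""
    entries = []
    for word in sorted(lexicons):
        ja_pos = ja_pos_of(word)                  # POS from the Japanese side
        zh_pos = JA_TO_ZH_POS.get(ja_pos, "NN")   # fall back to NN if unmapped
        entries.append((word, zh_pos))
    return entries
```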

  17. [Pipeline diagram repeated; next step: ③ Short Unit Transformation]

  18. ③ Short Unit Transformation • Adjusting Chinese word segmentation so that as many tokens as possible have 1-to-1 mappings between parallel sentences can improve alignment accuracy (Bai et al., 2008) • Wang et al. (2010) proposed a short unit standard for Chinese word segmentation, which can reduce the number of 1-to-n alignments and improve MT performance

  19. Our Method • We transform the annotated training data of the Chinese segmenter using the extracted lexicons
CTB: 从_P/有效性_NN/高_VA/的_DEC/格要素_NN/…
Lexicons: 有效 (effective), 要素 (element)
Short: 从_P/有效_NN/性_NN/高_VA/的_DEC/格_NN/要素_NN/…
Ref: From case element with high effectiveness …

  20. Constraints • We do not use extracted lexicons that are composed of only one Chinese character, e.g.:
long token: 歌颂 (praise)
extracted lexicons: 歌 (song) ・・・
short unit tokens: 歌 (song) / 颂 (praise)
(using the single-character lexicon 歌 would wrongly split 歌颂 into 歌/颂)
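
A minimal sketch combining the transformation from the previous slide with this constraint; splitting only on prefix or suffix matches of multi-character lexicons and copying the original POS tag to every piece is a simplification of the actual procedure:

```python
# Minimal sketch of short unit transformation (not the authors' exact rules).

def short_unit_split(token, pos, lexicons):
    """Split one annotated token (word, POS) into short unit tokens."""
    usable = {w for w in lexicons if len(w) > 1}   # drop single-character lexicons
    for lex in sorted(usable, key=len, reverse=True):
        if token != lex and token.startswith(lex):
            return [(lex, pos)] + short_unit_split(token[len(lex):], pos, lexicons)
        if token != lex and token.endswith(lex):
            return short_unit_split(token[:-len(lex)], pos, lexicons) + [(lex, pos)]
    return [(token, pos)]

# Examples from slide 19:
# short_unit_split("有效性", "NN", {"有效"}) -> [("有效", "NN"), ("性", "NN")]
# short_unit_split("格要素", "NN", {"要素"}) -> [("格", "NN"), ("要素", "NN")]
# With the constraint, 歌颂 is left intact because 歌 is a single character.
```

Copying the POS tag to every piece also makes the assignment problem on slide 36 visible: 被 would keep the NN tag of 被实验者 even though LB would be correct.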

  21. Outline • Word Segmentation Problems • Common Chinese Characters • Chinese Word Segmentation Optimization • Experiments • Discussion • Related Work • Conclusion and Future Work

  22. Two Kinds of Experiments • Experiments on Moses • Experiments on the Kyoto example-based machine translation (EBMT) system (Nakazawa and Kurohashi, 2011) • A dependency tree-based decoder

  23. Experimental Settings on MOSES (1/2)

  24. Experimental Settings on MOSES (2/2) • Baseline: Only using the lexicons extracted from the Chinese annotated corpus • Incorporation: Incorporate the extracted Chinese lexicons • Short unit: Incorporate the extracted Chinese lexicons and train the Chinese segmenter on the short unit training data

  25. Results of Chinese-to-Japanese Translation Experiments on MOSES • CTB 7 shows better performance because it is more than three times the size of the NICT Chinese Treebank • Lexicons extracted from the paper abstract domain also work well on other domains (i.e., CTB 7)

  26. Results of Japanese-to-Chinese Translation Experiments on MOSES • Improvements are not significant compared to Zh-to-Ja, because our proposed approach does not change the segmentation of the input Japanese sentences

  27. Experimental Settings on EBMT (1/2)

  28. Experimental Settings on EBMT (2/2) • Baseline: Only using the lexicons extracted from the Chinese annotated corpus • Short unit: Incorporate the extracted Chinese lexicons and train the Chinese segmenter on the short unit training data

  29. Results of Translation Experiments on EBMT • Translation performance is worse than with MOSES, because EBMT suffers from the low accuracy of the Chinese parser • Improvement by short unit is not significant because the Chinese parser is not trained on short unit segmented training data

  30. Outline • Word Segmentation Problems • Common Chinese Characters • Chinese Word Segmentation Optimization • Experiments • Discussion • Related Work • Conclusion and Future Work

  31. Short Unit Effectiveness on MOSES
Baseline (BLEU=49.38)
Input: 本/论文/中/,/提议/考虑/现存/实现/方式/的/功能/适应性/决定/对策/目标/的/保密/基本/设计法/。
Output: 本/論文/で/は/,/提案/する/適応/的/対策/を/決定/する/セキュリティ/基本/設計/法/を/考える/現存/の/実現/方式/の/機能/を/目標/と/して/いる/.
Short unit (BLEU=56.33)
Input: 本/论文/中/,/提议/考虑/现存/实现/方式/的/功能/适应/性/决定/对策/目标/的/保密/基本/设计/法/。
Output: 本/論文/で/は/,/提案/する/考え/現存/の/実現/方式/の/機能/的/適応/性/を/決定/する/対策/目標/の/セキュリティ/基本/設計/法/を/提案/する/.
Reference: 本/論文/で/は/,/対策/目標/を/現存/の/実現/方式/の/機能/的/適合/性/も/考慮/して/決定/する/セキュリティ/基本/設計/法/を/提案/する/.
(In this paper, we propose a basic security design method that also considers the functional suitability of the existing implementation method when determining countermeasure targets.)

  32. Number of Extracted Lexicons • The number of extracted lexicons decreased after short unit transformation because duplicated lexicons increased

  33. Short Unit Transformation Percentage • NICT Chinese Treebank: 6,623 out of 257,825 tokens were transformed into 13,469 short unit tokens (2.57%) • CTB 7: 19,983 out of 718,716 tokens were transformed into 41,336 short unit tokens (2.78%)

  34. Short Unit Transformation Problems (1/3) • Improper transformation problem
long token: 不好意思 (sorry)
extracted lexicons: 好意 (favor) ・・・
short unit tokens: 不 (not) / 好意 (favor) / 思 (think)

  35. Short Unit Transformation Problems (2/3) • Transformation ambiguity problem
long token: 充电器 (charger)
extracted lexicons: 充电 (charge), 电器 (electric equipment) ・・・
short unit tokens: 充电 (charge) / 器 (device) or 充 (charge) / 电器 (electric equipment)

  36. Short Unit Transformation Problems (3/3) • POS tag assignment problem
long token: 被实验者_NN (test subject)
extracted lexicons: 实验 (test) ・・・
short unit tokens: 被_NN (be) / 实验_NN (test) / 者_NN (person)
The correct POS tag for 被 (be) should be LB (被 in a long bei-construction)

  37. Outline • Word Segmentation Problems • Common Chinese Characters • Chinese Word Segmentation Optimization • Experiments • Discussion • Related Work • Conclusion and Future Work

  38. Bai et al., 2008 • Proposed a method of learning affix rules from an aligned Chinese-English bilingual terminology bank to adjust Chinese word segmentation in the parallel corpus directly

  39. Wang et al., 2010 • Proposed a method based on transfer rules and a transfer database. • The transfer rules are extracted from alignment results of annotated Chinese and segmented Japanese training data • The transfer database is constructed using external lexicons, and is manually modified

  40. Conclusion • We proposed an approach that exploits common Chinese characters in Chinese word segmentation optimization for Chinese-Japanese MT • Experimental results of Chinese-Japanese MT on a phrase-based SMT system indicated that our approach can improve MT performance significantly

  41. Future Work • Solve the short unit transformation problems • Evaluate the proposed approach on parallel corpora of other domains
