Domain Mixing for Chinese-English Translation
Chris Leege
The Project
• Goal
  • Translate Chinese novels into English using Neural Machine Translation
• Challenges
  • Chinese to English translation requires a lot of data
  • There aren’t many Chinese and English parallel corpora
  • The majority of Chinese and English parallel corpora are in domains other than novels, mostly news, UN reports, or subtitles
Data
• Casia2015 Chinese-English parallel corpus
  • One million parallel sentences from around the web
• Chinese Novel Corpus
  • Manually aligned
  • 45,000 parallel sentences
  • 2,096,000 characters
Models
• Pure Casia2015 corpus
• Pure novel corpus
• Naïve mixed corpus
• Mixed corpus with target tokens
  • Effective Domain Mixing for Neural Machine Translation
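The two mixed models differ only in how the corpora are combined before training. The following is a minimal sketch of that difference, not the project's actual preprocessing. It assumes sentence pairs are stored as (source, target) tuples and that the domain tag is added as a single token at the start of each target sentence. The tag strings `<casia>` and `<novel>`, the function names, and the toy sentences are all hypothetical.

```python
import random

# Toy stand-ins for the two corpora (hypothetical examples, not real corpus data).
general = [("今天天气很好。", "The weather is nice today.")]
novel = [("他叹了一口气。", "He let out a sigh.")]

def naive_mix(general_pairs, novel_pairs, seed=0):
    """Concatenate both corpora and shuffle, with no domain information."""
    mixed = list(general_pairs) + list(novel_pairs)
    random.Random(seed).shuffle(mixed)
    return mixed

def tag_targets(pairs, tag):
    """Prepend a domain tag to the target side of every sentence pair."""
    return [(src, f"{tag} {tgt}") for src, tgt in pairs]

def target_token_mix(general_pairs, novel_pairs, seed=0):
    """Mix the corpora after marking each target sentence with its domain."""
    # Assumption: the tag goes at the start of the target sentence.
    mixed = tag_targets(general_pairs, "<casia>") + tag_targets(novel_pairs, "<novel>")
    random.Random(seed).shuffle(mixed)
    return mixed

def strip_domain_token(hypothesis, tags=("<casia>", "<novel>")):
    """Remove a leading domain tag from model output before scoring."""
    first, _, rest = hypothesis.partition(" ")
    return rest if first in tags else hypothesis
```

With this layout the source side is untouched, so the same Chinese input can be fed to all four models, and any tag the model emits is removed with `strip_domain_token` before evaluation.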
Results
Figure 2. BLEU scores for the four models. Models on the y-axis, test data on the x-axis.
Results
Figure 3. BLEU scores for the second four models. Models on the y-axis, test data on the x-axis.
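For readers unfamiliar with the metric in these figures, here is a rough sketch of how a corpus-level BLEU score can be computed. It assumes one reference per sentence, whitespace tokenization, uniform weights over 1- to 4-grams, and no smoothing. The slides do not say which tool or tokenization produced the reported scores, so this function (`corpus_bleu` is a name chosen here) will not necessarily reproduce them.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU on a 0-100 scale, one reference per hypothesis."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clipped counts: an n-gram is credited at most as often
            # as it occurs in the reference.
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())
    # Without smoothing, any n-gram order with zero matches gives a score of 0.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize output that is shorter than the reference.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)
```

Each cell in a models-by-test-data grid like the ones in the figures would correspond to one call of such a function, with one model's output on one test set.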
Conclusion
• Possible Issues
  • Casia2015 too heterogeneous
  • Not enough data
• Next Steps
  • Try again with a larger, more homogeneous corpus, such as the UN corpus