This document describes the CAS-IA machine translation system, covering hardware configuration, data pre-processing, and the system configuration used for evaluations. It highlights the role of pre-processing steps such as encoding conversion and word segmentation, and lists the training data used for the NIST and IWSLT evaluations. The system's results in several evaluations are discussed, with emphasis on system combination for improved performance. It concludes with observations on the importance of high-quality data and on practical approaches to preparing for evaluations.
CAS-IA System Description Jinhua Du CNGL July 23, 2008
Outline • Hardware in IA • Pre-process & Data • MT System Configuration for Evaluation • Achievements • Conclusions
Hardware • Machines • Parallel Computing • Condor • Grid Computing Module developed by ASR group
Pre-process & Data • Pre-processing • encoding conversion & filtering • punctuation and number conversion (full-width -> half-width, etc.) • case conversion (only the initial letter of the first word), abbreviation processing • Chinese word segmentation (ICT or IA tool), English tokenization • Data for NIST • Parallel: 3.4M (up to 10M if the UN corpus is added) • Monolingual: 3.4M + 9.6M (Gigaword 1 & 2) + 1.4M (Gigaword 3) = 14.4M • Data for IWSLT • Parallel: BTEC (20K or 40K); LDC • Monolingual: BTEC; Gigaword • Data filtering: keep only the highly relevant data, which is very important for the spoken-language evaluation (more and better data, better performance)
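The punctuation/number and case conversion steps listed above can be sketched in a few lines. This is a minimal illustration, not the CAS-IA implementation: the function names `to_half_width` and `lowercase_initial` are hypothetical, and treating "case conversion" as lowercasing the first letter of the sentence is an assumption. The width conversion relies on the Unicode layout, where full-width forms U+FF01 to U+FF5E sit at a fixed offset of 0xFEE0 from their ASCII counterparts and U+3000 is the ideographic space.

```python
# Sketch of two pre-processing steps (names and behaviour are assumptions,
# not taken from the original system).

FULL_TO_HALF_OFFSET = 0xFEE0  # distance between U+FF01..U+FF5E and ASCII 0x21..0x7E


def to_half_width(text: str) -> str:
    """Map full-width letters, digits and punctuation to half-width ASCII."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:  # ideographic (full-width) space
            out.append(" ")
        elif 0xFF01 <= code <= 0xFF5E:  # full-width ASCII variants
            out.append(chr(code - FULL_TO_HALF_OFFSET))
        else:  # everything else, including Chinese characters, is left alone
            out.append(ch)
    return "".join(out)


def lowercase_initial(sentence: str) -> str:
    """Lowercase only the first character of the sentence (assumed behaviour)."""
    return sentence[:1].lower() + sentence[1:]
```

For example, `to_half_width` leaves Chinese characters untouched while normalizing full-width digits and commas, so the segmenter and tokenizer that run afterwards see a single form of each symbol.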
System Configuration • Modules • Pre-processing • Alignment Post-processing & Model Generation • Decoding & MER Training • System Combination & Post-processing
Achievements (zh-en) • The 3rd MT Symposium in China (rank 3) • Limited (830K pairs) • Unlimited (3M pairs)
Achievements (zh-en) • NIST MT Eval. 2008
Achievements (zh-en) • IWSLT2008 • More systems to be combined • 2 PB systems developed by CASIA • Moses • SAMT (CMU) • Hierarchical PB • BTG-based system (Xiong) • Better performance
Conclusions • More and better data gives better performance • System combination is very helpful for improving performance • Evaluation is different from theoretical research: empirical methods and tricks are usually more effective • For a better ranking, prepare in advance and form a temporary team for the evaluation • Evaluation is a horrible thing for students: more time, more energy, and no paper (a joke, but true) • Develop systems with applications in mind