
Integrating Speech Recognition and Machine Translation


Presentation Transcript


  1. Integrating Speech Recognition and Machine Translation
     Spyros Matsoukas, Ivan Bulyko, Bing Xiang, Kham Nguyen, Richard Schwartz, John Makhoul

  2. Integration Issues
  • The Machine Translation (MT) system is trained on text data, so it expects:
    • segments that correspond to foreign sentences
    • properly placed punctuation marks
    • numbers, dates, monetary amounts, abbreviations, etc., as they appear in ordinary text
  • However, Speech-To-Text (STT) output:
    • is segmented automatically on long pauses
      • resulting segments may be too short, or may cross sentence boundaries
    • has no punctuation
      • punctuation needs to be added automatically prior to translation
    • has numbers, dates, etc., in spoken form
      • the output can be parsed to convert numbers to written form (a toy sketch follows this list)
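To make the spoken-form issue concrete, here is a minimal illustrative normalizer. It handles only simple English tens-plus-units numbers; the actual system parsed Arabic STT output, so the word tables and function name below are assumptions for illustration only.

```python
# Illustrative only: a toy English spoken-number normalizer. The original
# system handled Arabic; this sketch just shows the idea of rewriting
# spoken-form numbers into written form before translation.
UNITS = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4,
         "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9}
TENS = {"twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
        "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90}

def normalize_numbers(tokens):
    """Rewrite simple spoken numbers ('twenty five' -> '25') in a token list."""
    out, i = [], 0
    while i < len(tokens):
        t = tokens[i].lower()
        if t in TENS and i + 1 < len(tokens) and tokens[i + 1].lower() in UNITS:
            out.append(str(TENS[t] + UNITS[tokens[i + 1].lower()]))
            i += 2
        elif t in TENS or t in UNITS:
            out.append(str(TENS.get(t, UNITS.get(t))))
            i += 1
        else:
            out.append(tokens[i])
            i += 1
    return out

print(normalize_numbers("he paid twenty five dollars".split()))
# -> ['he', 'paid', '25', 'dollars']
```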

  3. STT/MT Pipeline
  • An initial set of experiments ran MT on the 1-best hypothesis from STT (a toy sketch of this loose coupling follows)
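A runnable toy of the loose 1-best coupling. Each stage below is a trivial stand-in with a hypothetical name, not the actual component API: the real recognizer, boundary detector, and translator are the systems described on the following slides.

```python
# Toy 1-best STT -> SBD -> MT pipeline. All three stages are stand-ins.
def recognize(segment):
    # Stand-in STT: pretend the segment is already transcribed text.
    return segment.lower().split()

def detect_boundaries(words, max_len=6):
    # Stand-in SBD: cut the word stream into fixed-length chunks; the real
    # system uses a hidden-event LM plus silence features (slide 9).
    return [words[i:i + max_len] for i in range(0, len(words), max_len)]

def translate(sentence):
    # Stand-in MT: identity "translation" for illustration.
    return " ".join(sentence)

def pipeline(segments):
    words = [w for seg in segments for w in recognize(seg)]
    return [translate(s) for s in detect_boundaries(words)]

print(pipeline(["this is the first automatic segment",
                "and this is the second"]))
```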

  4. STT Components
  • STT-A
    • EARS RT04 Arabic BN system
    • Word pronunciations based on graphemes
    • Acoustic models estimated using Maximum Mutual Information (MMI) and Speaker Adaptive Training (SAT) on 100 hours of BN audio data
    • 3-gram language model trained on 400 million words of news text
  • STT-B
    • Uses morphological analyzer and automatic methods to infer short vowels in word pronunciations
    • Trained on an additional 50 hours of acoustic training data
  • STT-C
    • Makes use of additional language model training data

  5. MT Components
  • MT-A
    • System developed during the period Sep 2004 – Apr 2005
    • Phrase-based translation model, trained on 100M words of Arabic/English UN and news bitext
    • 3-gram English LM, trained on 2 billion words of text (mostly newswire)
    • Translation based on the posterior probability P(English | Foreign)
  • MT-B
    • Uses a combination of generative and posterior translation probabilities
    • Includes a phrase segmentation score
    • Uses a method to compensate for over-estimated translation probabilities
    • Optimizes decoding weights by minimizing TER on N-best lists (a rescoring sketch follows this list)
  [Table: TER results on the 2002 and 2004 MT Eval sets]
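The MT-B features above are combined log-linearly. Below is a minimal sketch of rescoring an N-best list under a weight vector; the feature names and values are invented for illustration, and the real system tunes the weights to minimize TER rather than fixing them by hand.

```python
# Illustrative log-linear rescoring of an N-best list. Each hypothesis
# carries feature log-scores (e.g. posterior and generative translation
# models, phrase segmentation score, LM); the names are assumptions.
def rescore(nbest, weights):
    """Pick the hypothesis maximizing the weighted sum of feature scores."""
    def score(hyp):
        return sum(weights[f] * v for f, v in hyp["features"].items())
    return max(nbest, key=score)

nbest = [
    {"text": "the council met today",
     "features": {"tm_post": -2.1, "tm_gen": -3.0, "seg": -1.0, "lm": -4.2}},
    {"text": "council met the today",
     "features": {"tm_post": -1.9, "tm_gen": -2.8, "seg": -1.2, "lm": -7.5}},
]
weights = {"tm_post": 1.0, "tm_gen": 0.5, "seg": 0.3, "lm": 0.8}
print(rescore(nbest, weights)["text"])  # -> 'the council met today'
```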

  6. Test Data
  • Tested integration on bnat05
    • 6-hour set drawn from several sources, dating from Jan 2001 and Nov 2003
    • Test set contains both Modern Standard Arabic (MSA) and Arabic dialect segments
  • All system comparisons are based on Translation Error Rate (TER)
    • MT system output is automatically scored against a single reference transcription, with mixed case (a simplified TER sketch follows this list)
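For reference, TER counts the word edits (insertions, deletions, substitutions, plus block shifts) needed to turn a hypothesis into the reference, normalized by reference length. The sketch below omits the shift operation, so it is a simplified upper bound on true TER, not the official scorer.

```python
def simple_ter(hyp, ref):
    """Word-level edit distance / reference length. Real TER also allows
    block shifts; this simplified version omits them."""
    h, r = hyp.split(), ref.split()
    # Standard Levenshtein DP over words.
    d = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(len(h) + 1):
        d[i][0] = i
    for j in range(len(r) + 1):
        d[0][j] = j
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            cost = 0 if h[i - 1] == r[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(h)][len(r)] / len(r)

print(simple_ter("the council met today", "the council will meet today"))
# 2 edits / 5 reference words = 0.4
```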

  7. Integration Results
  • Effect of STT accuracy, segmentation and punctuation on MT accuracy
  • At current MT performance level:
    • large improvements in STT accuracy result in small TER gain
    • significant TER reduction (2.7% absolute) can be obtained by improving sentence boundary detection
    • full punctuation helps translation only marginally

  8. Optimizing STT Segmentation for MT
  • Tuned the audio segmentation procedure to output segments that match the reference in terms of average length (a sketch of one such heuristic follows this list)
  • 1.6% absolute TER gain from optimizing segmentation
  • Additional gains can be obtained by:
    • converting spoken numbers to written form prior to translation (0.4–0.5% TER reduction)
    • redefining the STT output segmentation using linguistic information
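The slides say only that the pause-based segmenter was tuned so that the average segment length matches the reference; the greedy merge heuristic below is one assumed realization of that idea, not the documented procedure.

```python
# Assumed sketch of length-targeted resegmentation: segments are
# (start_sec, end_sec) pairs from a pause-based segmenter.
def merge_to_target(segments, target_avg_sec):
    """Greedily merge adjacent segments across the shortest pauses until
    the average segment duration reaches the target."""
    segs = list(segments)
    def avg(s):
        return sum(e - b for b, e in s) / len(s)
    while len(segs) > 1 and avg(segs) < target_avg_sec:
        # Find the adjacent pair separated by the shortest pause; merge it.
        i = min(range(len(segs) - 1),
                key=lambda k: segs[k + 1][0] - segs[k][1])
        segs[i:i + 2] = [(segs[i][0], segs[i + 1][1])]
    return segs

segs = [(0.0, 3.1), (3.6, 5.0), (5.2, 9.8), (10.9, 12.0)]
print(merge_to_target(segs, target_avg_sec=4.0))
# -> [(0.0, 9.8), (10.9, 12.0)]
```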

  9. Sentence Boundary Detection (SBD)
  • Used a hidden-event language model (HELM) to detect sentence boundaries in the 1-best STT output (a toy illustration follows this list)
    • 4-gram HELM, trained on 850M words of Arabic news with Kneser-Ney smoothing
    • Silence duration can be integrated as an observation into the HMM search
  • Explored various configurations:
    • SBD-1: Use only the LM to insert periods within speaker turns
    • SBD-2: Use the LM and silence duration jointly
    • SBD-3: Bias the LM to insert boundaries at a higher rate (by 30–50%), then remove the boundaries with the lowest model posteriors while constraining the maximum sentence length
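The hidden-event idea treats the boundary token as a hidden word whose insertion the LM can prefer or reject, with silence duration as an extra observation and a bias term controlling the insertion rate (as in SBD-3). The toy below decides greedily per word gap with a made-up bigram table; the real system uses a 4-gram HELM inside an HMM search rather than this greedy rule.

```python
# Toy hidden-event boundary detection. `lm_logp` stands in for a real
# n-gram LM score log P(token | history); here a tiny bigram table.
TOY_BIGRAMS = {("today", "</s>"): -0.5, ("</s>", "the"): -1.0,
               ("today", "the"): -4.0}

def lm_logp(token, prev, default=-2.0):
    return TOY_BIGRAMS.get((prev, token), default)

def detect_boundaries(words, silences, bias=0.0, sil_weight=1.0):
    """Insert '</s>' after word i if the LM score with a boundary, plus a
    silence bonus and an insertion-rate bias, beats the score without one."""
    boundaries = []
    for i in range(len(words) - 1):
        with_b = lm_logp("</s>", words[i]) + lm_logp(words[i + 1], "</s>")
        without = lm_logp(words[i + 1], words[i])
        if with_b + sil_weight * silences[i] + bias > without:
            boundaries.append(i)
    return boundaries

words = ["talks", "ended", "today", "the", "minister", "said"]
silences = [0.0, 0.1, 0.9, 0.0, 0.0]  # seconds of pause after each word
print(detect_boundaries(words, silences))  # -> [2]: boundary after "today"
```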

  10. SBD Results
  • Effect of HELM-based SBD on MT accuracy, starting from one of two audio segmentations:
    • audio-seg-1: 9.47 sec / segment
    • audio-seg-2: 13.60 sec / segment
  • HELM has a larger effect on Modern Standard Arabic (MSA) regions, where STT accuracy is high
  • SBD can be applied safely on top of any audio segmentation

  11. Optimizing MT on Speech Data
  • MT accuracy can be enhanced by optimizing MT decoding weights on broadcast speech data
    • Optimization can compensate for differences in style between newswire text and STT transcripts (especially on broadcast conversations)
  • Optimization issue:
    • MT optimization requires a one-to-one mapping between translation hypotheses and references on the tuning set
    • It is non-trivial to tune on translations of automatically segmented STT output
  • Solutions:
    • Re-segment the STT output according to the reference segmentation prior to translation, then use the translation hypotheses for tuning (a sketch of such re-segmentation follows this list)
    • Tune based on translations of the STT reference transcriptions
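The slides do not say how hypotheses are re-segmented to match the reference; a common trick is to align the two word streams and project the reference sentence boundaries through the alignment. The sketch below assumes that approach, using Python's difflib as a stand-in aligner.

```python
import difflib

# Sketch of re-segmenting STT output to match the reference segmentation.
# Assumption: reference sentence ends are projected onto the hypothesis
# through a word-level alignment (difflib here; the slides do not name
# the alignment method actually used).
def resegment(hyp_words, ref_sentences):
    ref_words = [w for s in ref_sentences for w in s]
    # Reference word index (exclusive) of each sentence end.
    ref_ends, n = [], 0
    for s in ref_sentences:
        n += len(s)
        ref_ends.append(n)
    # Map reference indices to hypothesis indices via matching blocks.
    sm = difflib.SequenceMatcher(a=ref_words, b=hyp_words, autojunk=False)
    ref2hyp = {}
    for blk in sm.get_matching_blocks():
        for k in range(blk.size + 1):
            ref2hyp[blk.a + k] = blk.b + k
    # Cut the hypothesis at the projected sentence-end positions.
    cuts = sorted({ref2hyp.get(e, len(hyp_words)) for e in ref_ends})
    segments, prev = [], 0
    for c in cuts:
        if c > prev:
            segments.append(hyp_words[prev:c])
            prev = c
    if prev < len(hyp_words):
        segments.append(hyp_words[prev:])
    return segments

ref = [["talks", "ended", "today"], ["the", "minister", "spoke"]]
hyp = ["talks", "ended", "today", "the", "minister", "said"]
print(resegment(hyp, ref))
# -> [['talks', 'ended', 'today'], ['the', 'minister', 'said']]
```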

  12. MT Optimization Results
  • Updated development sets
  • Results
    • MT02: tuning on translations of the 2002 NIST MT evaluation set
    • BNC-STT: tuning on translations of manually segmented (according to reference) STT output
    • BNC-REF: tuning on translations of reference transcripts

  13. Conclusions and Future Research
  • Results on 1-best STT/MT integration show that sentence boundary detection has a large impact on MT performance
    • Segmentation should be based on both audio and STT transcript
  • Better performance is expected by coupling STT and MT more tightly
    • Have begun running MT on consensus networks from STT output
    • Will explore joint optimization of STT and MT system parameters
  • At current operating point, improvements in MT will have the largest effect
