Spoken Language Translation Intelligent Robot Lecture Note
Spoken Language Translation
• Spoken language translation (SLT) directly translates spoken utterances into another language.
• Major components
  • Automatic Speech Recognition (ASR)
  • Machine Translation (MT)
  • Text-to-Speech (TTS)
[Figure: Source Speech → ASR → Source Sentence → MT → Target Sentence → TTS → Target Speech. Example: "버스 정류장이 어디에 있나요?" → "Where is the bus stop?"]

Spoken Language Translation
• In comparison with written language,
  • Speech, and especially spontaneous speech, poses additional difficulties for automatic translation.
  • Typically, these difficulties are caused by errors in the speech recognition step, which is carried out before the translation process.
  • As a result, the sentence to be translated is not necessarily well-formed syntactically.
• Why a statistical approach for machine translation?
  • Even without recognition errors, the structures of spontaneous speech differ from those of written language.
  • The statistical approach
    • Avoids hard decisions at any level of the translation process
    • Guarantees that a translated sentence in the target language is generated for any source sentence

Coupling ASR to MT
• Motivation
  • ASR cannot be guaranteed to be error-free
    • The 1-best ASR output could be wrong
    • SLT must be designed to be robust to speech recognition errors
    • MT could benefit from the wide range of supplementary information provided by ASR
  • MT quality may depend on the WER of ASR
    • Strong correlation between recognition and translation quality
    • The WER is lower within a set of hypotheses than for the 1-best alone
    • Idea: exploit more transcriptions
• SLT systems vary in the degree to which SMT and ASR are integrated within the overall translation process.

Coupling ASR to MT
• Loose coupling
  • SMT uses the ASR output (1-best, N-best, lattice, or confusion network) as input, with one-way module communication
• Tight coupling
  • The whole search space of ASR and MT is integrated
[Figure: Loose coupling: Source Speech → ASR → (1-best, N-best, lattice, or CN) → SMT → Target Sentence → TTS → Target Speech. Tight coupling: Source Speech → ASR + SMT → Target Sentence → TTS → Target Speech.]

Coupling ASR to MT
• Statistical spoken language translation
  • Given a speech input x in the source language, find the best translation e
  • F(x) is the set of possible transcriptions of x
    • Loose coupling: 1-best, N-best, lattice, or confusion network
    • Tight coupling: full search space
  • Pr(f,e|x): speech translation model
    • Acoustic and translation features

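The decision rule behind this slide can be written out as below. This is a reconstruction that assumes the usual statistical SLT formulation; F(x) stands for the set of candidate transcriptions of the input x, and the hat notation is introduced here for illustration.

```latex
\hat{e} \;=\; \operatorname*{argmax}_{e} \; \max_{f \in \mathcal{F}(x)} \Pr(f, e \mid x)
```

Under this reading, loose coupling restricts F(x) to what the recognizer hands over (a single hypothesis, an N-best list, a lattice, or a confusion network), while tight coupling lets f range over the full recognition search space.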
Coupling ASR to MT
• Loose coupling vs. tight coupling

ASR Outputs
• Automatic speech recognition (ASR) is the process of converting an acoustic speech signal into a sequence of words.
• Architecture
[Figure: ASR architecture. Speech signals → feature extraction → decoding → ASR outputs (1-best, N-best, lattice, or CN) → SMT. The decoding network is constructed from the acoustic model (HMM estimation on a speech DB), the pronunciation model (G2P), and the language model (LM estimation on text corpora).]

ASR Outputs
• Network structure
[Figure: hierarchical decoding network. Sentence HMM over the words ONE, TWO, THREE; word HMM for ONE as the phone sequence W AH N; phone HMM for W with states 1, 2, 3.]
• Decoding of HMM-based ASR
  • Searching for the best path in a huge HMM-state lattice

ASR Outputs
• 1-best
  • The best path can be found by backtracking
  • Why a 1-best "word" sequence?
    • Storing the backtracking pointer table for the full state sequence takes a lot of memory
    • Usually a backtrack pointer stores only the word preceding the current word
• N-best
  • Trace back not only from the best path, but also from the 2nd best, the 3rd best, etc.
  • Methods
    • Directly from the search backtrack pointer table
      – Exact N-best algorithm, word-pair N-best algorithm, A* search using the Viterbi score as heuristic
    • Generate a lattice first, then generate the N-best list from the lattice

ASR Outputs
• Lattice
  • A word-based lattice
    • A compact representation of the state lattice
    • Only word nodes are involved
  • From the decoding backtracking pointer table
    • Record only the links between word nodes
  • From an N-best list
    • Becomes a compact representation of the N-best list

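As a rough illustration of generating an N-best list from a word lattice, the sketch below runs a best-first search over an acyclic lattice. The lattice encoding (a dict of arcs with negative-log costs), the function name `n_best_paths`, and the toy scores are assumptions made for this example, not the algorithms named in the slides.

```python
import heapq
from itertools import count


def n_best_paths(lattice, start, end, n):
    """Return up to n (cost, words) pairs, cheapest first.

    lattice maps a node to a list of (next_node, word, cost) arcs.
    Costs are assumed to be non-negative (e.g. negative log scores)
    and the lattice is assumed to be acyclic.
    """
    tie = count()  # tie-breaker so the heap never compares word lists
    heap = [(0.0, next(tie), start, [])]
    results = []
    while heap and len(results) < n:
        cost, _, node, words = heapq.heappop(heap)
        if node == end:
            # Complete paths are popped in order of increasing cost.
            results.append((cost, words))
            continue
        for nxt, word, arc_cost in lattice.get(node, []):
            heapq.heappush(heap, (cost + arc_cost, next(tie), nxt, words + [word]))
    return results


# Toy lattice with made-up costs.
lattice = {
    0: [(1, "where", 0.1), (1, "wear", 1.2)],
    1: [(2, "is", 0.2)],
    2: [(3, "the", 0.1), (3, "a", 0.9)],
    3: [(4, "bus", 0.3), (4, "boss", 1.0)],
}

for cost, words in n_best_paths(lattice, 0, 4, 3):
    print(round(cost, 2), " ".join(words))
```

Because partial paths are stored explicitly, this version can grow exponentially on dense lattices; it is meant only to show why a lattice is a compact encoding of many N-best entries.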
ASR Outputs
• Confusion network (L. Mangu et al., 2000)
  • Also called a "sausage network" or "consensus network"
  • A weighted directed graph with a start node, an end node, and word labels on its edges
  • Each path from the start node to the end node goes through all the other nodes
  • Built from a lattice
    • Multiple alignment

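A confusion network can be pictured as a list of columns, each holding competing words with posterior probabilities. The sketch below assumes that representation (the empty string stands for the empty word ε, and all numbers are invented) and shows the consensus hypothesis, the number of paths, and the posterior of a single path.

```python
from math import prod

# One entry per column; each column maps a word to its posterior.
# "" stands for the empty word (epsilon). All numbers are made up.
cn = [
    {"where": 0.8, "wear": 0.2},
    {"is": 0.9, "": 0.1},
    {"the": 0.6, "a": 0.4},
    {"bus": 0.7, "boss": 0.3},
]


def consensus(cn):
    """Pick the most probable word in each column and drop epsilons."""
    best = [max(col, key=col.get) for col in cn]
    return [w for w in best if w]


def num_paths(cn):
    """Every path takes exactly one arc per column."""
    return prod(len(col) for col in cn)


def path_posterior(cn, words):
    """Posterior of one full path, given one word per column."""
    return prod(col[w] for col, w in zip(cn, words))


print(consensus(cn))   # ['where', 'is', 'the', 'bus']
print(num_paths(cn))   # 16
```

Four small columns already encode 16 hypotheses, which is the property the later CN-based decoder slides rely on.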
Loose Coupling : 1-best
• The best hypothesis produced by the ASR system is passed as text to the MT system.
  • Baseline
  • Simple structure
  • Fast translation
• The speech recognition module and the translation module run largely independently
  • Lacks joint optimality
• No use of multiple transcriptions
  • Supplementary information readily available from the ASR system is not exploited in the translation process

Loose Coupling : 1-best
• Structure
[Figure: Source Speech → ASR → 1-best → SMT → Target Sentence → TTS → Target Speech]
• Recognition
• Translation

Loose Coupling : N-best
• N hypotheses are translated by a text MT decoder and re-ranked according to ASR and SMT scores (R. Zhang et al., 2004)
• Structure
[Figure: Source Speech → ASR → N-best → SMT → N×M translations → Rescore → Best translation]

Loose Coupling : N-best
• ASR module
  • Generates N-best speech recognition hypotheses (the n-th best hypothesis for n = 1, …, N)
• SMT module
  • Generates M-best translation hypotheses for each recognition hypothesis (the m-th best translation produced from the n-th recognition hypothesis)
• Rescore module
  • Rescores all N×M translations
  • Key component
  • Log-linear model
    • Features derived from ASR and SMT are combined in this module to rescore translation candidates.

Loose Coupling : N-best
• Rescore: log-linear models
  • Defined over all possible translation hypotheses
  • Each feature is the log value of a model score
    • ASR features: acoustic model, source language model
    • SMT features: target language model, phrase translation model, distortion model, length model, …
  • Each feature has its own weight

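The rescoring step can be sketched as a weighted sum of log-domain features followed by an argmax over the N×M candidates. The feature names (`am`, `src_lm`, `tgt_lm`, `tm`), the weights, and the scores below are hypothetical and chosen only to illustrate the mechanism.

```python
def loglinear_score(features, weights):
    """Weighted sum of log-domain feature values."""
    return sum(weights[name] * value for name, value in features.items())


def rescore(candidates, weights):
    """Return the candidate translation with the highest log-linear score."""
    return max(candidates, key=lambda c: loglinear_score(c["features"], weights))


# N x M candidates flattened into one list (all values are made up).
candidates = [
    {"text": "where is the bus stop",
     "features": {"am": -12.0, "src_lm": -8.0, "tgt_lm": -9.0, "tm": -5.0}},
    {"text": "where is the boss top",
     "features": {"am": -11.5, "src_lm": -11.0, "tgt_lm": -14.0, "tm": -6.0}},
    {"text": "wear is the bus stop",
     "features": {"am": -12.5, "src_lm": -12.0, "tgt_lm": -13.0, "tm": -5.5}},
]

weights = {"am": 1.0, "src_lm": 0.5, "tgt_lm": 1.0, "tm": 1.0}
print(rescore(candidates, weights)["text"])  # where is the bus stop
```

With only the acoustic feature switched on, the second candidate would win in this toy data, which is the situation the combined ASR and SMT features are meant to correct.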
Loose Coupling : N-best
• Parameter optimization (F.J. Och, 2003)
  • Objective function, defined in terms of
    • the translation output after log-linear model rescoring
    • the English reference sentences
    • an automatic translation quality metric
  • Metrics
    • BLEU: a weighted geometric mean of the n-gram matches between test and reference sentences, plus a short-sentence penalty
    • NIST: an arithmetic mean of the n-gram matches between test and reference sentences
    • mWER: multiple-reference word error rate
    • mPER: multiple-reference position-independent word error rate

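As an illustration of one of these metrics, the following sketch computes a sentence-level mWER as the smallest length-normalized word edit distance over the available references. Normalization conventions differ between evaluations, so the per-reference normalization used here is an assumption, and references are assumed to be non-empty.

```python
def edit_distance(hyp, ref):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            curr.append(min(prev[j] + 1,               # delete h
                            curr[j - 1] + 1,           # insert r
                            prev[j - 1] + (h != r)))   # substitute or match
        prev = curr
    return prev[-1]


def mwer(hyp, refs):
    """Smallest length-normalized edit distance over the references."""
    return min(edit_distance(hyp, ref) / len(ref) for ref in refs)


hyp = "where is bus stop".split()
refs = ["where is the bus stop".split(),
        "where can I find the bus stop".split()]
print(mwer(hyp, refs))
```

The hypothesis is one insertion away from the first reference, so that reference determines the score.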
Loose Coupling : N-best
• Parameter optimization: direction set methods
[Figure: flowchart of the optimization loop with the steps "Change initial lambda", "Change direction", and "Local optimization", producing a "Local lambda" and finally the "Best lambda"]

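The loop in the flowchart can be approximated by the simplified sketch below: random restarts (change the initial lambda), coordinate directions (change direction), and a grid line search along each direction (local optimization). It replaces the exact line search of minimum error rate training with a coarse grid, and the data layout, function names, and toy development set are all assumptions for illustration.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def corpus_error(lambdas, dev):
    """Total error of the candidates the model ranks first."""
    total = 0.0
    for cands in dev:
        _, err = max(cands, key=lambda c: dot(lambdas, c[0]))
        total += err
    return total


def line_search(lambdas, direction, dev, steps):
    """Try a grid of step sizes along one direction; keep the best point."""
    best, best_err = lambdas, corpus_error(lambdas, dev)
    for s in steps:
        trial = [l + s * d for l, d in zip(lambdas, direction)]
        err = corpus_error(trial, dev)
        if err < best_err:
            best, best_err = trial, err
    return best, best_err


def optimize(dev, dim, restarts=5, sweeps=10, seed=0):
    rng = random.Random(seed)
    steps = [x / 10 for x in range(-20, 21) if x != 0]
    best, best_err = None, float("inf")
    for _ in range(restarts):                       # change initial lambda
        lam = [rng.uniform(-1, 1) for _ in range(dim)]
        err = corpus_error(lam, dev)
        for _ in range(sweeps):
            before = err
            for i in range(dim):                    # change direction
                direction = [1.0 if j == i else 0.0 for j in range(dim)]
                lam, err = line_search(lam, direction, dev, steps)
            if err == before:                       # no further local gain
                break
        if err < best_err:                          # keep the best local lambda
            best, best_err = lam, err
    return best, best_err


# Each sentence: list of (feature vector, error of that candidate). Made-up data.
dev = [
    [([-1.0, -5.0], 0.0), ([-0.5, -9.0], 1.0)],
    [([-2.0, -3.0], 1.0), ([-2.5, -1.0], 0.0)],
]
lam, err = optimize(dev, dim=2)
print(lam, err)
```

Because the error surface is piecewise constant in the weights, gradient methods do not apply directly, which is why a derivative-free direction set search is used in the first place.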
Loose Coupling : Lattice
• Lattice-based MT
  • Input
    • Word lattices produced by the ASR system
  • Directly integrates all models in the decoding process
    • Phrase-based lexica, single-word-based lexica, recognition features
• Problem
  • How to translate the word lattices?
• Approach
  • Joint probability approach
    • WFST (E. Matusov et al., 2005)
  • Phrase-based approach
    • Log-linear model (E. Matusov et al., 2005)
    • WFST (L. Mathias et al., 2006)

Loose Coupling : Lattice
• Structure
[Figure: Source Speech → ASR → Word lattice → Rescore → Best translation]

Loose Coupling : Lattice
• From the derived decision rule
  • Pr(x|f): standard acoustic model
  • Pr(e): target language model
  • Pr(f|e): translation model
• Source language model?
  • To account for the well-formedness of the source sentence, the translation model has to include context dependency on the previous source words
  • This dependency over the whole sentence can be approximated by including a source language model

Loose Coupling : Lattice (Joint Probability Approach : WFST)
• Joint probability approach
  • The conditional probability terms Pr(e) and Pr(f|e) can be rewritten as the joint probability Pr(f,e) when using a joint probability translation model
  • This simplifies coupling the systems
    • The joint probability translation model can be used instead of the usual LM in ASR

Loose Coupling : Lattice (Joint Probability Approach : WFST)
• WFST-based joint probability system
  • The joint probability MT system is implemented with WFSTs
  • First, the training corpus is transformed based on a word alignment
  • Then, a statistical m-gram model is trained on the transformed bilingual corpus
  • This language model is represented as a finite-state transducer, which is the final translation model
  • Example of a transformed sentence pair: vorrei|I’d_like del|some gelato|ice_cream per|ε favore|please

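The corpus transformation can be illustrated with a small sketch that pairs each source word with the target words aligned to it. It assumes a monotone alignment in which every target word is aligned to some source word; real systems also have to handle reordering and unaligned target words. The function name and the alignment format are hypothetical.

```python
EPS = "ε"


def to_bilingual_tokens(source, target, alignment):
    """Build one 'source|target' token per source word.

    alignment maps a source position to the list of target positions
    aligned to it; unaligned source words are paired with epsilon.
    """
    tokens = []
    for j, src in enumerate(source):
        tgt_positions = sorted(alignment.get(j, []))
        tgt = "_".join(target[i] for i in tgt_positions) or EPS
        tokens.append(f"{src}|{tgt}")
    return tokens


source = "vorrei del gelato per favore".split()
target = "I'd like some ice cream please".split()
alignment = {0: [0, 1], 1: [2], 2: [3, 4], 4: [5]}

print(" ".join(to_bilingual_tokens(source, target, alignment)))
# vorrei|I'd_like del|some gelato|ice_cream per|ε favore|please
```

An ordinary m-gram model trained on such token sequences then scores source and target words jointly, which is what lets it double as the translation model.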
Loose Coupling : Lattice (Joint Probability Approach : WFST)
• WFST-based joint probability system
  • The best target sentence is searched for in the composition of the input, represented as a WFST, with the translation transducer.
  • Coupling the FSA system with ASR is simple
    • The ASR output, represented as a WFST, can be used directly as input to the MT search
  • Features
    • Only acoustic and translation probabilities
    • Source LM scores are not included
    • The joint m-gram translation probability serves as a source LM

Loose Coupling : Lattice (Phrase-based Approach : Log-linear Model)
• Probability distributions are represented as features in a log-linear model
  • The translation model probability is decomposed into several probabilities
  • Acoustic model and source language model probabilities are also included
• For a hypothesized recognized source sentence f_1^J and a hypothesized translation e_1^I, let k → (j_k, i_k), k = 1,…,K be a monotone segmentation of the sentence pair into K bilingual phrases

Loose Coupling : Lattice (Phrase-based Approach : Log-linear Model)
• Features
  • The m-gram target language model
  • The phrasal lexicon models
    • The phrase translation probabilities are computed as a log-linear interpolation of the relative frequencies
  • The single-word-based lexicon models

Loose Coupling : Lattice (Phrase-based Approach : Log-linear Model)
• Features (cont'd)
  • c1, c2: word and phrase penalty features
  • The recognition model
    • The acoustic model probability
    • The m-gram source language model probability
• Optimization
  • All features are scaled with a set of exponents λ = {λ1,…,λ7} and μ = {μ1,μ2}.
  • The scaling factors are optimized iteratively in a minimum error training framework by performing 100 to 200 translations of a development set
  • The criterion: WER, BLEU, mWER, mPER

Loose Coupling : Lattice (Phrase-based Approach : Log-linear Model)
• Practical aspects of lattice translation
  • Generation of word lattices
    • In a first step, all entities that are not spoken words are mapped onto the empty arc label ε
    • The time information is not used, so it is removed from the lattices
    • The structure is compressed by applying ε-removal, determinization, and minimization
    • This step significantly reduces runtime without changing the results
  • Phrase extraction
    • The number of different phrase pairs is very large
    • Candidate phrase pairs have to be kept in main memory
    • For ASR word lattice input, the lattice for each test utterance is traversed, and only phrases that match sequences of arcs in the lattice are extracted
    • Thus only phrases that can be used in translation are loaded

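The phrase extraction step can be sketched as follows: collect every word sequence that appears along consecutive arcs of the lattice (up to a maximum phrase length) and keep only phrase-table entries whose source side is among them. The lattice encoding, the phrase-table layout, and the function names are assumptions for this example, and the lattice is assumed to be ε-free already.

```python
def lattice_ngrams(lattice, max_len):
    """Collect every word sequence of up to max_len consecutive arcs."""
    found = set()

    def extend(node, prefix):
        for nxt, word in lattice.get(node, []):
            seq = prefix + (word,)
            found.add(seq)
            if len(seq) < max_len:
                extend(nxt, seq)

    nodes = set(lattice) | {nxt for arcs in lattice.values() for nxt, _ in arcs}
    for node in nodes:
        extend(node, ())
    return found


def filter_phrase_table(phrase_table, lattice, max_len=3):
    """Keep only phrase pairs whose source side occurs in the lattice."""
    usable = lattice_ngrams(lattice, max_len)
    return {src: tgts for src, tgts in phrase_table.items() if src in usable}


# Toy lattice: node -> list of (next_node, word).
lattice = {
    0: [(1, "vorrei")],
    1: [(2, "del"), (2, "un")],
    2: [(3, "gelato")],
}
# Toy phrase table: source phrase -> candidate translations (made up).
phrase_table = {
    ("vorrei",): ["I'd like"],
    ("del", "gelato"): ["some ice cream"],
    ("un", "caffè"): ["a coffee"],
    ("vorrei", "un", "gelato"): ["I'd like an ice cream"],
}

filtered = filter_phrase_table(phrase_table, lattice)
print(sorted(filtered))
```

Only the entry for "un caffè" is dropped here, since no path through the toy lattice contains that word sequence.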
Loose Coupling : Lattice (Phrase-based Approach : Log-linear Model)
• Practical aspects of lattice translation (cont'd)
  • Pruning
    • A word lattice of high density as input → an enormous search space → pruning is necessary
    • Coverage pruning and histogram pruning
      • Based on the total costs of a hypothesis
    • It may also be necessary to prune the input word lattices
• Advantages
  • The utilization of multiple features
  • The direct optimization for an objective error measure
• Disadvantages
  • A less efficient search
  • Heavy pruning is unavoidable

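Histogram pruning by total cost can be sketched in a few lines. A threshold (beam) variant is included for comparison; it is a generic beam criterion and is not claimed to be the coverage pruning used in the cited system. Hypotheses are assumed to be `(total_cost, payload)` tuples with lower cost being better.

```python
import heapq


def histogram_prune(hyps, max_size):
    """Keep the max_size hypotheses with the lowest total cost."""
    return heapq.nsmallest(max_size, hyps, key=lambda h: h[0])


def threshold_prune(hyps, beam):
    """Keep hypotheses whose cost is within `beam` of the best one."""
    if not hyps:
        return []
    best = min(cost for cost, _ in hyps)
    return [h for h in hyps if h[0] <= best + beam]


# (total cost, partial translation) pairs with made-up costs.
hyps = [(4.2, "a"), (3.1, "b"), (7.9, "c"), (3.5, "d")]
print(histogram_prune(hyps, 2))    # [(3.1, 'b'), (3.5, 'd')]
print(threshold_prune(hyps, 1.0))  # [(3.1, 'b'), (3.5, 'd')]
```

Histogram pruning bounds the number of active hypotheses regardless of their scores, whereas threshold pruning adapts to how close the competitors are.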
Loose Coupling : Lattice (Phrase-based Approach : WFST)
• Statistical modeling for text translation
  • Ω: all foreign phrase sequences that could have generated the foreign text
  • The translation system effectively translates phrase sequences rather than word sequences
    • This is done by first mapping the sentence into all its phrase sequences

Loose Coupling : Lattice (Phrase-based Approach : WFST)
• The phrase sequence lattice contains the phrase sequences that can be extracted from the text
  • All phrase sequences correspond to the same, unique foreign sentence
  • Here, a phrase is a sequence of words that can be translated
  • Different phrase sequences lead to different translations
  • The lattice is unweighted

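To make the idea of "all phrase sequences of one sentence" concrete, the sketch below enumerates every segmentation of a sentence into phrases from a given inventory. The recursive enumeration, the function name, and the toy inventory are illustrative assumptions; the slides describe a transducer-based implementation rather than explicit enumeration.

```python
def phrase_sequences(words, inventory, max_len=3):
    """Enumerate all segmentations of `words` into known phrases."""
    if not words:
        return [[]]
    results = []
    for k in range(1, min(max_len, len(words)) + 1):
        head = tuple(words[:k])
        if head in inventory:
            for rest in phrase_sequences(words[k:], inventory, max_len):
                results.append([head] + rest)
    return results


words = "vorrei del gelato".split()
inventory = {("vorrei",), ("del",), ("gelato",), ("del", "gelato")}

for seq in phrase_sequences(words, inventory):
    print(seq)
```

Both segmentations spell out the same sentence, and since none is preferred over the other at this stage, the corresponding lattice carries no weights.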
Loose Coupling : Lattice (Phrase-based Approach : WFST)
• Statistical modeling for speech translation
  • The target phrase mapping transducer is applied to the foreign language ASR word lattice
  • L·Ω: the likely foreign phrase sequences that could have generated the foreign speech
  • The translation system still effectively translates phrase sequences rather than word sequences
    • These are extracted from the ASR lattice, with ASR scores, rather than from a text sentence

Loose Coupling : Lattice (Phrase-based Approach : WFST)
• The phrase sequence lattice contains the phrase sequences that can be extracted from the text
  • Phrase sequences correspond to the translatable word sequences in the lattice
  • The lattice contains weights from the ASR system
• Translating this foreign phrase lattice is MAP translation of the foreign speech under the generative model

Loose Coupling : Lattice (Phrase-based Approach : WFST)
• Spoken language translation is recast as an ASR analysis problem in which the goal is to extract translatable foreign language phrases from ASR word lattices
  • Step 1. Perform foreign language ASR to generate a foreign language word lattice L
  • Step 2. Analyze the foreign language word lattice and extract the phrases to be translated
  • Step 3. Build the target language phrase mapping transducer Ω
  • Step 4. Compose L and Ω to create the foreign language ASR phrase lattice L·Ω
  • Step 5. Translate the foreign language phrase lattice
• ASR and MT must be very compatible for this approach

Loose Coupling : Confusion Network
• CN-based decoder (N. Bertoldi et al., 2005)
  • Input
    • Confusion network represented as a matrix
  • Text vs. CN
    • Text
    • CN
• Problem
  • How to translate confusion network input?

Loose Coupling : Confusion Network
• Solution
  • Simple!
  • A CN-based SLT decoder can be developed starting from a phrase-based SMT decoder
  • The CN-based SLT decoder is substantially the same as the phrase-based SMT decoder, apart from the way the input is managed
• Compared to N-best methods
  • N-best decoder
    • Does not take advantage of overlaps among the N-best hypotheses
  • CN decoder
    • Exploits overlaps among hypotheses

Loose Coupling : Confusion Network
• Phrase-based translation model
  • Phrase
    • A sequence of consecutive words
  • Alignment
    • A map between the CN and target phrases: one word per column is aligned with a target phrase
  • Search criterion
    • The model in the search criterion is a log-linear phrase-based model

Loose Coupling : Confusion Network
• Log-linear phrase-based translation model
  • The conditional distribution is determined through suitable real-valued feature functions and takes a log-linear parametric form
  • Feature functions
    • Language model
    • Fertility models
    • Distortion models
    • Lexicon model
    • Likelihood of the path within the CN
    • True length of the path

Loose Coupling : Confusion Network
• Step-wise translation process
  • Translation is performed with a step-wise process
  • Each step translates a sub-CN and produces a target phrase
  • The process starts with an empty translation
  • After each step, we get a partial translation
  • A partial translation is complete if the whole input CN is translated
• Complexity reduction
  • Recombining theories
  • Beam search
  • Reordering constraints
  • Lexicon pruning
  • Confusion network pruning

Loose Coupling : Confusion Network
• Algorithms

Loose Coupling : Confusion Network
• Step-wise translation process

Loose Coupling

Tight Coupling
• Theory (H. Ney, 1999)
  • Three factors
    • Pr(e): target language model
    • Pr(f|e): translation model
    • Pr(x|f): acoustic model
  • Derivation steps: Bayes' rule; introduce f as a hidden variable; Bayes' rule; assume x does not depend on the target language; replace the sum by the max

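The derivation steps named on this slide can be spelled out as follows. The equations are a reconstruction from the step labels and the three factors Pr(e), Pr(f|e), and Pr(x|f); the second step labelled "Bayes' rule" is read here as the product (chain) rule applied to Pr(f, x | e).

```latex
\begin{align*}
\hat{e} &= \operatorname*{argmax}_{e} \Pr(e \mid x) \\
        &= \operatorname*{argmax}_{e} \Pr(e)\,\Pr(x \mid e)
           && \text{Bayes' rule} \\
        &= \operatorname*{argmax}_{e} \Pr(e) \sum_{f} \Pr(f, x \mid e)
           && \text{introduce } f \text{ as a hidden variable} \\
        &= \operatorname*{argmax}_{e} \Pr(e) \sum_{f} \Pr(f \mid e)\,\Pr(x \mid f, e)
           && \text{Bayes' rule} \\
        &= \operatorname*{argmax}_{e} \Pr(e) \sum_{f} \Pr(f \mid e)\,\Pr(x \mid f)
           && x \text{ does not depend on } e \text{ given } f \\
        &\approx \operatorname*{argmax}_{e} \Pr(e) \max_{f} \Pr(f \mid e)\,\Pr(x \mid f)
           && \text{sum to max}
\end{align*}
```

The last line is the form in which the recognizer's acoustic score and the translation system's scores are searched jointly over both f and e.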
Tight Coupling
• ASR vs. tight coupling (SLT)
[Figure: model comparison. ASR: acoustic model + source LM. SLT: acoustic model + target LM + translation model.]
• Brute-force method
  • Instead of incorporating the LM into the standard Viterbi algorithm, incorporate P(e) and P(f|e)
  • Very complicated
  • Not feasible

Tight Coupling
• WFST-based joint probability system (full integration)
  • The ASR search network
    • A composition of WFSTs: H ∘ C ∘ L ∘ G
      • H: the HMM topology
      • C: the context dependency
      • L: the lexicon
      • G: the LM
  • Only the source LM needs to be replaced by the translation model
    • This yields the speech translation search network ST
• Result
  • Small improvement in translation quality
  • But very slow

Tight Coupling
• BLEU scores against lattice density (S. Saleem et al., 2004)
  • Improvements from tighter coupling may only be observed when ASR lattices are sparse, i.e. when there are only a few hypothesized words per spoken word in the lattice
  • This would mean that a fully integrated speech translation system would not work at all.

Tight Coupling
• Possible issues of tight coupling
  • In ASR, the source n-gram LM is very close to the best configuration
  • The complexity of the algorithm is too high; approximations are still necessary to make it work
  • Current approaches have not really implemented tight coupling
• Conclusion
  • The approach seems to be hampered by the very high complexity of constructing the search algorithm

Reading List
• L. Mangu, E. Brill, and A. Stolcke. 2000. Finding consensus in speech recognition: word error minimization and other applications of confusion networks. Computer Speech and Language 14(4), 373-400.
• V. H. Quan, M. Federico, and M. Cettolo. 2005. Integrated N-best Re-ranking for Spoken Language Translation. EuroSpeech.
• R. Zhang, G. Kikui, H. Yamamoto, T. Watanabe, F. Soong, and W. K. Lo. 2004. A unified approach in speech-to-speech translation: Integrating features of speech recognition and machine translation. In Proc. of Coling 2004.
• F.J. Och. 2003. Minimum Error Rate Training in Statistical Machine Translation. In Proc. of ACL.
• E. Matusov, S. Kanthak, and H. Ney. 2005. On the Integration of Speech Recognition and Statistical Machine Translation. In Proc. of Interspeech 2005.
• E. Matusov, H. Ney, and R. Schlüter. 2005. Phrase-based Translation of Speech Recognizer Word Lattices Using Loglinear Model Combination. ASRU 2005.