
Machine Translation & Automated Speech Recognition


Presentation Transcript


  1. Machine Translation & Automated Speech Recognition Jaime Carbonell With Richard Stern and Alex Rudnicky Language Technologies Institute Carnegie Mellon University Telephone: +1 412 268 7279 Fax: +1 412 268 6298 jgc@cs.cmu.edu For Samsung, February 20, 2008

  2. Outline of Presentation • Machine Translation (MT): • Three major paradigms for MT (transfer, example-based, statistical) • New method: Context-Based MT • Automated Speech Recognition (ASR): • The Sphinx family of recognizers • Environmental Robustness for ASR • Intelligent Tutoring for English • Native Accent: Pronunciation product • Grammar, Vocabulary & other tutoring • Components for Virtual English Village

  3. Language Technologies Bill of Rights • “…right information” → IR (search engines) • “…right people” → Routing, personalization • “…right time” → Anticipatory analysis • “…right medium” → Speech: ASR, TTS • “…right language” → Machine translation • “…right level of detail” → Summarization, drill-down

  4. Machine Translation: Context is Needed to Resolve Ambiguity Example: English → Japanese • Power line – densen (電線) • Subway line – chikatetsu (地下鉄) • (Be) on line – onrain (オンライン) • (Be) on the line – denwachuu (電話中) • Line up – narabu (並ぶ) • Line one’s pockets – kanemochi ni naru (金持ちになる) • Line one’s jacket – uwagi o nijuu ni suru (上着を二重にする) • Actor’s line – serifu (セリフ) • Get a line on – joho o eru (情報を得る) Sometimes local context suffices (as above) . . . but sometimes not
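To make the point concrete, here is a tiny sketch of sense selection from one word of local context; the collocation table and the `translate_line` helper are hypothetical illustrations, not part of any system described here:

```python
# Hypothetical collocation table: a neighbouring word selects the sense of "line".
LINE_SENSES = {
    "power": "densen",       # power line
    "subway": "chikatetsu",  # subway line
    "up": "narabu",          # line up
    "actor's": "serifu",     # actor's line
}

def translate_line(prev_word, next_word):
    """Resolve 'line' from one word of local context; report failure when that is not enough."""
    return LINE_SENSES.get(prev_word.lower()) or LINE_SENSES.get(next_word.lower()) or "UNKNOWN"

print(translate_line("power", "failed"))  # densen: local context suffices
print(translate_line("the", "for"))       # UNKNOWN: "The line for the new play..." needs wider context
```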

  5. CONTEXT: More is Better • Examples requiring longer-range context: • “The line for the new play extended for 3 blocks.” • “The line for the new play was changed by the scriptwriter.” • “The line for the new play got tangled with the other props.” • “The line for the new play better protected the quarterback.” • Better MT requires better context sensitivity • Syntax-rule based MT → little or no context • EBMT → potentially long context on exact substring matches • SMT → short context with proper probabilistic optimization • CBMT → long context with optimization (but still experimental)

  6. Types of Machine Translation [Diagram: the MT pyramid from source (Arabic) to target (English) — direct translation (SMT, EBMT) at the base; syntactic parsing and transfer rules feeding sentence planning one level up; semantic analysis into an interlingua and text generation at the apex.]

  7. Example-Based Machine Translation Example • English: I would like to meet her. Mapudungun: Ayükefun trawüael fey engu. • English: The tallest man is my father. Mapudungun: Chi doy fütra chi wentru fey ta inche ñi chaw. • English: I would like to meet the tallest man. Mapudungun (new): Ayükefun trawüael Chi doy fütra chi wentru. Mapudungun (correct): Ayüken ñi trawüael chi doy fütra wentru engu.

  8. Statistical Machine Translation: The Three Ingredients • English language model: p(E) • English→Korean translation model: p(K|E) • Decoder: given Korean input K, find E* = argmax_E p(E|K) = argmax_E p(E) p(K|E)
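A minimal sketch of that decision rule, assuming toy probability tables and an explicitly enumerated candidate set (the names `lm_prob`, `tm_prob`, and `decode` are illustrative; a real decoder searches a vastly larger hypothesis space):

```python
import math

# Toy English language model p(E) and English->Korean translation model p(K|E).
lm_prob = {"I ate rice": 0.02, "rice ate I": 0.0001}
tm_prob = {("I ate rice", "나는 밥을 먹었다"): 0.3,
           ("rice ate I", "나는 밥을 먹었다"): 0.3}

def decode(korean, candidates):
    """E* = argmax_E p(E) * p(K|E), computed in log space over an explicit candidate list."""
    def score(e):
        return math.log(lm_prob.get(e, 1e-12)) + math.log(tm_prob.get((e, korean), 1e-12))
    return max(candidates, key=score)

print(decode("나는 밥을 먹었다", ["I ate rice", "rice ate I"]))  # the LM breaks the tie -> "I ate rice"
```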

  9. Alignments • Translation models are built in terms of hidden alignments: p(F|E) = Σ_A p(F, A|E) • Each English word “generates” zero or more French words • Example alignment: “The proposal will not now be implemented” ↔ “Les propositions ne seront pas mises en application maintenant”

  10. Conditioning Joint Probabilities • The modeling problem is how to approximate these terms to achieve: • Statistically robust estimates • Computationally efficient training • An effective and accurate model

  11. The Gang of Five IBM++ Models A series of five models is sequentially trained to “bootstrap” a detailed translation model from parallel corpora: • Word-by-word translation. Parameters p(f|e) between pairs of words. • Local alignments. Adds alignment probabilities. • Fertilities and distortions. Allow a source word to generate zero or more target words, then place words in position with p(j|i, sentence lengths). • Tablets. Groups words into phrases. • Primary idea of Example-Based MT incorporated into SMT • Currently seeking meaningful (vs. arbitrary) phrases, e.g. NPs, PPs. • Non-deficient alignments. Don’t ask. . .
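The first stage (word-by-word p(f|e)) is typically estimated with EM over sentence pairs; below is a compressed sketch in the spirit of IBM Model 1, assuming uniform initialization and a toy corpus (not the production training pipeline):

```python
from collections import defaultdict

def train_model1(pairs, iterations=10):
    """EM re-estimation of word translation probabilities t(f|e) from (foreign, english) sentence pairs."""
    f_vocab = {f for fs, _ in pairs for f in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))            # uniform start
    for _ in range(iterations):
        count, total = defaultdict(float), defaultdict(float)
        for fs, es in pairs:
            for f in fs:
                norm = sum(t[(f, e)] for e in es)          # expected (soft) alignment of f across es
                for e in es:
                    c = t[(f, e)] / norm
                    count[(f, e)] += c
                    total[e] += c
        for f, e in count:
            t[(f, e)] = count[(f, e)] / total[e]           # M-step: renormalize per English word
    return t

pairs = [(["la", "maison"], ["the", "house"]),
         (["la", "fleur"], ["the", "flower"])]
t = train_model1(pairs)
print(round(t[("maison", "house")], 2), round(t[("la", "the")], 2))
```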

  12. Multi-Engine Machine Translation • MT systems have different strengths • Rapidly adaptable: Statistical, example-based • Good grammar: Rule-Based (linguistic) MT • High precision in narrow domains: KBMT • Minority Language MT: Learnable from informant • Combine results of parallel-invoked MT • Select best of multiple translations • Selection based on optimizing combination of: • Target language joint-exponential model • Confidence scores of individual MT engines
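A hedged sketch of the selection step: rank the engines' outputs by a weighted mix of a target-language model score and each engine's own confidence. The stub bigram LM and the weights below are illustrative stand-ins for the joint-exponential model mentioned above:

```python
# Stub LM: fraction of bigrams found in a small "known-good" set (illustrative only).
GOOD_BIGRAMS = {("the", "house"), ("house", "is"), ("is", "red")}

def toy_lm(sentence):
    words = sentence.split()
    hits = sum((a, b) in GOOD_BIGRAMS for a, b in zip(words, words[1:]))
    return hits / max(1, len(words) - 1)

def select_translation(hypotheses, lm_score, lm_weight=0.7):
    """hypotheses: (engine_name, translation, engine_confidence in [0,1]); pick the best combined score."""
    return max(hypotheses, key=lambda h: lm_weight * lm_score(h[1]) + (1 - lm_weight) * h[2])

best = select_translation(
    [("RBMT", "the house red is", 0.8),
     ("SMT", "the house is red", 0.6)],
    toy_lm)
print(best[0], "->", best[1])   # the LM score outweighs RBMT's higher self-confidence
```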

  13. Illustration of Multi-Engine MT

  14. Key Challenges for Corpus-Based MT • Long-Range Context: More is better. How to incorporate it? How to do so tractably? “I conjecture that MT is AI complete” – H. Simon. State of the art: trigrams → 4- or 5-grams in the LM only (e.g., Google); SMT → SMT+EBMT (phrase tables, etc.) • Parallel Corpora: More is better. Where to find enough of it? What about rare languages? “There’s no data like more data” – IBM SMT circa 1992. State of the art: avoids rare languages (GALE only does Arabic & Chinese); symbolic rule-learning from a minimalist parallel corpus (AVENUE project at CMU)

  15. Parallel Text: Requiring Less is Better (Requiring None is Best) • Challenge • There is just not enough to approach human-quality MT for most major language pairs (we need ~100X more) • Much parallel text is not on-point (not on domain) • Rare languages and some distant pairs have virtually no parallel text • Context-Based (CBMT) Approach • Invented and patented by Meaningful Machines (in New York) • Requires no parallel text, no transfer rules . . . • Instead, CBMT needs • A fully-inflected bilingual dictionary • A (very large) target-language-only corpus • A (modest) source-language-only corpus [optional, but preferred]

  16. CBMT System [Diagram: source-language input passes through an n-gram segmenter and parsers into the n-gram builders (translation model) — a Flooder (non-parallel text method) plus an edge locker — backed by indexed resources (bilingual dictionary, target corpora, optional source corpora, gazetteers); n-gram candidates, substitution requests, and stored/approved n-gram pairs flow through a cross-language n-gram cache database to the n-gram connector’s overlap-based decoder, which emits the target language.]

  17. Step 1: Source Sentence Chunking • Segment the source sentence into overlapping n-grams via a sliding window • Typical n-gram length: 4 to 9 terms • Each term is a word or a known phrase • Any sentence length (for the BLEU test: avg. 27, shortest 8, longest 66 words) • Example, for source words S1 … S9 with a 5-word window: S1 S2 S3 S4 S5 | S2 S3 S4 S5 S6 | S3 S4 S5 S6 S7 | S4 S5 S6 S7 S8 | S5 S6 S7 S8 S9
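A minimal sketch of the sliding-window segmentation, assuming whitespace-tokenized terms and a fixed window size (the real segmenter also treats known phrases as single terms):

```python
def chunk_source(terms, n=5):
    """Overlapping n-grams via a sliding window; a short sentence yields a single chunk."""
    if len(terms) <= n:
        return [terms]
    return [terms[i:i + n] for i in range(len(terms) - n + 1)]

sentence = "S1 S2 S3 S4 S5 S6 S7 S8 S9".split()
for gram in chunk_source(sentence, n=5):
    print(" ".join(gram))
# S1 S2 S3 S4 S5
# S2 S3 S4 S5 S6
# ...
# S5 S6 S7 S8 S9
```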

  18. Step 2: Dictionary Lookup • Using the inflected bilingual dictionary, list all possible target translations for each source word or phrase • Source word-string: S2 S3 S4 S5 S6 • Resulting flooding set (one target word list per source term): {T2-a T2-b T2-c T2-d}, {T3-a T3-b T3-c}, {T4-a T4-b T4-c T4-d T4-e}, {T5-a}, {T6-a T6-b T6-c}
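A sketch of building the flooding set, assuming the bilingual dictionary is a plain word-to-candidates mapping (morphology and phrase entries are omitted; the entries below are the slide's placeholder symbols):

```python
# Placeholder dictionary mirroring the slide's symbols; a real CBMT dictionary is fully inflected.
bilingual_dict = {
    "S2": ["T2-a", "T2-b", "T2-c", "T2-d"],
    "S3": ["T3-a", "T3-b", "T3-c"],
    "S4": ["T4-a", "T4-b", "T4-c", "T4-d", "T4-e"],
    "S5": ["T5-a"],
    "S6": ["T6-a", "T6-b", "T6-c"],
}

def flooding_set(source_ngram):
    """One target candidate group per source term; unknown terms yield empty groups for later handling."""
    return [bilingual_dict.get(term, []) for term in source_ngram]

print(flooding_set(["S2", "S3", "S4", "S5", "S6"]))
```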

  19. Step 3: Search Target Text • Using the flooding set ({T2-a T2-b T2-c T2-d}, {T3-a T3-b T3-c}, {T4-a T4-b T4-c T4-d T4-e}, {T5-a}, {T6-a T6-b T6-c}), search the target text for word-strings containing one word from each group • Find the maximum number of words from the flooding set in the minimum-length word-string • Words or phrases can be in any order • Ignore function words in the initial step (T5 is a function word in this example)
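A hedged sketch of the flooding search itself: scan the target corpus for the window that covers the most candidate groups in the fewest words, order-independently. Function-word filtering and indexing are omitted; the `flood` helper and its scoring are illustrative:

```python
def flood(corpus_tokens, groups, max_window=10):
    """Return (start, end, covered): the window covering the most groups, ties broken by shorter length."""
    group_sets = [set(g) for g in groups]
    best_key, best_span = (0, 0), None
    for i in range(len(corpus_tokens)):
        for j in range(i + 1, min(i + max_window, len(corpus_tokens)) + 1):
            window = set(corpus_tokens[i:j])
            covered = sum(1 for g in group_sets if window & g)
            key = (covered, -(j - i))          # more coverage first, then minimal span
            if key > best_key:
                best_key, best_span = key, (i, j, covered)
    return best_span

corpus = "T(x) T3-b T(x) T2-d T(x) T(x) T6-c T(x)".split()
groups = [["T2-a", "T2-d"], ["T3-a", "T3-b"], ["T4-a", "T4-e"], ["T6-a", "T6-c"]]
print(flood(corpus, groups))   # (1, 7, 3): the span T3-b ... T6-c covers 3 of the 4 groups
```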

  20. Step 3: Search Target Text (Example) [Illustration: scanning the target corpus with the flooding set; Target Candidate 1 is a corpus word-string containing T3-b, T2-d, and T6-c interleaved with other words T(x).]

  21. Step 3: Search Target Text (Example) [Illustration: Target Candidate 2 is a corpus word-string containing T4-a, T6-b, T2-c, and T3-a with a single intervening word T(x).]

  22. Step 3: Search Target Text (Example) [Illustration: Target Candidate 3 is a contiguous corpus word-string T3-c T2-b T4-e T5-a T6-a; function words (e.g., T5) are reintroduced after the initial match.]

  23. Step 4: Score Word-String Candidates • Scoring of candidates based on: • Proximity (minimize extraneous words in the target n-gram → precision) • Number of word matches (maximize coverage → recall) • Regular words given more weight than function words • Combine results (e.g., optimize F1 or p-norm or …) • Ranking of the three candidates: Candidate 1 (T3-b … T2-d … T6-c) — proximity 3rd, word matches 3rd, regular words 3rd, total 3rd; Candidate 2 (T4-a T6-b … T2-c T3-a) — proximity 1st, word matches 2nd, regular words 1st, total 2nd; Candidate 3 (T3-c T2-b T4-e T5-a T6-a) — proximity 1st, word matches 1st, regular words 1st, total 1st
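A sketch of that ranking with simple numeric proxies: weighted coverage for word matches (function words down-weighted) and an extraneous-word penalty for proximity. The weights and the `score_candidate` helper are illustrative, not the system's tuned combination:

```python
FUNCTION_WORDS = {"T5-a"}   # illustrative; real function-word lists are language-specific

def score_candidate(matched, span_length, function_weight=0.3, proximity_penalty=0.5):
    """matched: flooding-set words found in the window; span_length: total words in the window."""
    coverage = sum(function_weight if w in FUNCTION_WORDS else 1.0 for w in matched)
    extraneous = span_length - len(matched)       # unmatched words inside the span hurt proximity
    return coverage - proximity_penalty * extraneous

candidates = [
    (["T3-b", "T2-d", "T6-c"], 6),                    # candidate 1: scattered match
    (["T4-a", "T6-b", "T2-c", "T3-a"], 5),            # candidate 2: one extraneous word
    (["T3-c", "T2-b", "T4-e", "T5-a", "T6-a"], 5),    # candidate 3: contiguous match
]
ranked = sorted(candidates, key=lambda c: score_candidate(*c), reverse=True)
print([c[0][0] for c in ranked])   # candidate 3 first, then 2, then 1, as on the slide
```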

  24. Step 5: Select Candidates Using Overlap (propagate context over the entire sentence) [Illustration: each overlapping source word-string keeps several target candidates — e.g., Word-String 1: T(x1) T2-d T3-c T(x2) T4-b; T(x1) T3-c T2-b T4-e T(x2); Word-String 2: T4-a T6-b T(x3) T2-c T3-a; T3-c T2-b T4-e T5-a T6-a; T2-b T4-e T5-a T6-a T(x8); Word-String 3: T6-b T(x11) T2-c T3-a T(x9); T6-b T(x3) T2-c T3-a T(x8) — and candidates that overlap with a neighboring word-string’s candidates are retained.]

  25. Step 5: Select Candidates Using Overlap • Best translations are selected via maximal overlap • Alternative 1: T(x2) T4-a T6-b T(x3) T2-c + T4-a T6-b T(x3) T2-c T3-a + T6-b T(x3) T2-c T3-a T(x8) → T(x2) T4-a T6-b T(x3) T2-c T3-a T(x8) • Alternative 2: T(x1) T3-c T2-b T4-e + T3-c T2-b T4-e T5-a T6-a + T2-b T4-e T5-a T6-a T(x8) → T(x1) T3-c T2-b T4-e T5-a T6-a T(x8)

  26. A (Simple) Real Example of Overlap • Flooding → n-gram fidelity; Overlap → long-range fidelity • N-grams generated from flooding: “a United States soldier” / “United States soldier died” / “soldier died and two others” / “died and two others were injured” / “two others were injured Monday” • N-grams connected via overlap: “a United States soldier died and two others were injured Monday” • Systran: “A soldier of the wounded United States died and other two were east Monday”
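A sketch of how the n-grams above chain together: join each next n-gram onto the running translation at its longest suffix/prefix overlap. The `overlap`/`connect` helpers are illustrative; the real decoder scores competing candidates at every position:

```python
from functools import reduce

def overlap(left, right):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for k in range(min(len(left), len(right)), 0, -1):
        if left[-k:] == right[:k]:
            return k
    return 0

def connect(chain, nxt, min_overlap=2):
    """Append the non-overlapping tail of the next n-gram; a real decoder would reject weak joins."""
    k = overlap(chain, nxt)
    return chain + nxt[k:] if k >= min_overlap else chain + nxt

grams = [g.split() for g in [
    "a United States soldier",
    "United States soldier died",
    "soldier died and two others",
    "died and two others were injured",
    "two others were injured Monday",
]]
print(" ".join(reduce(connect, grams)))
# a United States soldier died and two others were injured Monday
```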

  27. Comparative System Scores [Bar chart of BLEU scores — systems shown: MM Spanish, MM Spanish (non-blind), Google Spanish (’08 top language), Systran Spanish, SDL Spanish, Google Arabic (’05 NIST), Google Chinese (’06 NIST); values shown include 0.7533, 0.7447, 0.7209, 0.5610, 0.5551, 0.5137, and 0.3859, against a human scoring range of roughly 0.7–0.85. All Spanish systems are scored on the same Spanish test set.]

  28. Historical CBMT Scoring [Line chart of CBMT BLEU scores from April 2004 through December 2007, with blind and non-blind curves: scores rise from about 0.37 in early 2004 through the 0.55–0.65 range in 2005–2006 to peaks of 0.7447 and 0.7533 by late 2007, approaching the human scoring range of roughly 0.7–0.85.]

  29. Illustrative Translation Examples • Un coche bomba estalla junto a una comisaría de policía en Bagdad — MM: A car bomb explodes next to a police station in Baghdad — Systran: A car pump explodes next to a police station of police in bagdad • Hamas anunció este jueves el fin de su cese del fuego con Israel — MM: Hamas announced Thursday the end of the cease fire with israel — Systran: Hamas announced east Thursday the aim of its cease-fire with Israel • Testigos dijeron haber visto seis jeeps y dos vehículos blindados — MM: Witnesses said they have seen six jeeps and two armored vehicles — Systran: Six witnesses said to have seen jeeps and two armored vehicles

  30. A More Complex Example • Un soldado de Estados Unidos murió y otros dos resultaron heridos este lunes por el estallido de un artefacto explosivo improvisado en el centro de Bagdad, dijeron funcionarios militares estadounidenses. • MM: A United States soldier died and two others were injured monday by the explosion of an improvised explosive device in the heart of Baghdad, American military officials said. • Systran: A soldier of the wounded United States died and other two were east Monday by the outbreak from an improvised explosive device in the center of Bagdad, said American military civil employees.

  31. Summing up CBMT • What if a source word or phrase is not in the bilingual dictionary? • Automatically find near synonyms in the source • Replace and retranslate • Current Status • Developing a Chinese-to-English CBMT system (no results yet) • Pre-commercial (field-testing) • CMU participating in improving CBMT technology • Advantages • Most accurate MT method in existence • Does not require parallel text • Disadvantages • Run-time speed: 20 s/w (7/07) → 1.5 s/w (2/08) → 0.1 s/w (12/08) • Requires heavy-duty servers (not yet miniaturizable)

  32. Flat Seed Generation, Compositionality, and Seeded Version Space Learning (Goal: Syntactic Transfer Rules) • Training pair: “The highly qualified applicant visits the company.” / “Der äußerst qualifizierte Bewerber besucht die Firma.” with word alignment ((1,1),(2,2),(3,3),(4,4),(5,5),(6,6)) • 1) Flat Seed Generation: produce rules from word-aligned sentence pairs, abstracted only to POS level, with no syntactic structure — e.g. S::S [det adv adj n v det n] → [det adv adj n v det n] ((x1::y1) (x2::y2) … ((x4 agr) = *3-sing) … ((y3 case) = *nom) …) • 2) Compositionality: add compositional structure to the seed rule by exploiting previously learned rules — an NP rule such as NP::NP [det adv adj n] → [det adv adj n] ((x1::y1) … ((y4 agr) = (x4 agr)) …) can translate part of the sentence, yielding S::S [NP v det n] → [NP n v det n] ((x1::y1) (x2::y2) … ((y1 case) = *nom) …); keep or add context constraints and eliminate unnecessary ones • 3) Seeded Version Space Learning: group seed rules into version spaces by constituent sequences and alignments; seed rules form the S-boundary of the VS; generalize with validation. Notes: (1) rules in a VS are partially ordered; (2) generalization is via merging two rules — deletion of a constraint, or raising two value constraints to one agreement constraint, e.g. ((x1 num) = *pl), ((x3 num) = *pl) → ((x1 num) = (x3 num)); (3) use the merged rule to translate

  33. Version Spaces (Mitchell, 1980) [Diagram: the hypothesis lattice spans from the most general concept (“anything”, the G boundary) down to specific instances (the S boundary), with the “target” concept lying between the two boundaries.]

  34. Original & Seeded Version Spaces • Version spaces (Mitchell, 1980) • Symbolic multivariate learning • S & G sets define lattice boundaries • Exponential worst case: O(b^N) • Seeded Version Spaces (Carbonell, 2002) • Generality level → hypothesis seed • S & G subsets → effective lattice • Polynomial worst case: O(b^(k/2)), k = 3, 4

  35. Xn Ym Det Adj N  Det N Adj (Y2 num) = (Y3 num) (Y2 gen) = (Y3 gen) (X3 num) = (Y2 num) N Seeded Version Spaces (Carbonell, 2002) G boundary “Target” concept S boundary “The big book”  “ el libro grande”

  36. Xn Ym G boundary N S boundary Seeded Version Spaces k Seed (best guess) “Target” concept “The big book”  “ el libro grande”

  37. Transfer Rule Learning • One SVS per transfer rule • Alignment + generalization level = seed • Apply the SVS(seed, k) learning method • High-confidence seed → small k [k too small increases prob(failure)] • Low-confidence seed → larger k [k too large increases computation & elicitation] • For translation use the SVS f-centroid • If success % > threshold, halt the SVS • If no rule found, k := k+1 & redo (up to max K)
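A very schematic sketch of this control loop. The `svs_learn` and `translation_success_rate` functions below are stand-in stubs (the actual seeded-version-space search and elicitation-based validation are far richer), so this only illustrates the k-widening policy:

```python
import random
from dataclasses import dataclass

@dataclass
class Seed:
    rule: str
    confidence: float

def svs_learn(seed, k):
    """Stub for searching the version-space lattice within generalization distance k of the seed."""
    return f"{seed.rule}  [generalized to level {k}]"

def translation_success_rate(rule):
    """Stub for scoring the candidate rule on held-out elicited sentences."""
    return random.random()

def learn_transfer_rule(seed, max_k=4, threshold=0.8):
    """Widen the seeded version space step by step until a rule translates well enough, or give up."""
    k = 1 if seed.confidence > 0.7 else 2            # high-confidence seed -> start with small k
    while k <= max_k:
        rule = svs_learn(seed, k)
        if rule and translation_success_rate(rule) > threshold:
            return rule                              # success rate above threshold: halt the SVS
        k += 1                                       # no acceptable rule: widen the space and redo
    return None                                      # fall back to further elicitation

print(learn_transfer_rule(Seed("NP::NP [det adj n] -> [det n adj]", confidence=0.9)))
```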

  38. Sample learned Rule

  39. Compositionality — a sample learned rule • Training example — L2: h &niin h rxb b h bxirwt; L1: the widespread interest in the election • Rule type: NP::NP • Component sequences: [Det N Det Adj PP] -> [Det Adj N PP] • Component alignments: ((X1::Y1) (X2::Y3) (X3::Y1) (X4::Y2) (X5::Y4)) • Agreement constraints: ((Y3 num) = (X2 num)) • Value constraints: ((X2 num) = sg) ((X2 gen) = m) ((Y1 def) = +)

  40. Automated Speech Recognition at CMU: Sphinx and related technologies • Sphinx is the name for a family of speech recognition engines, developed by Reddy, Lee & Rudnicky • First continuous-speech, speaker-independent, large-vocabulary system • Standard continuous-HMM technology • Multiple languages • Open-source code base • Miniaturized and integrated with MT, dialog, and TTS • Basis for Native Accent & Carnegie Speech products • The Janus system was developed by Waibel and Schultz • Started as a neural-net approach; now a standard continuous-HMM system • Multiple languages • NOT an open-source code base • Held a WER advantage over Sphinx until recently (now both are equal)

  41. Sphinx ASR at Carnegie Mellon • Sphinx systems have evolved along two parallel branches • Technology innovation occurs in the Research branch, then transitions to the Live branch • The Live branch focuses on the additional requirements of interactive real-time recognition and on integration with higher-level components such as spoken language understanding and dialog management • [Timeline, 1986–2006: Research branch — Sphinx (1986), Sphinx-2 (1992), Sphinx-3.x, Sphinx-3.7; Live branch — Sphinx-L, Sphinx-2L (1999), PocketSphinx (2006)]

  42. Some features of Sphinx

  43. Spoken language processing [Diagram: speech → recognition → text → understanding → meaning → dialogue/discourse, drawing on task & domain knowledge; generation → synthesis produces speech back to the user, and the cycle drives action.]

  44. A generic speech system [Diagram: human speech → signal processing → decoder → parser → post-parser → dialog manager; the dialog manager consults domain agents and drives a language generator, display, and effectors, with a speech synthesizer producing machine speech.]

  45. Decoding speech • Signal processing: reduce the dimensionality of the signal; noise conditioning • Decoder: transcribe speech to words using lexical models, acoustic models, and language models (corpus-based statistical models)
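A rough sketch of the front-end's dimensionality reduction, assuming 16 kHz audio and crude log band energies in place of a proper MFCC pipeline (numpy only; the framing sizes are conventional defaults, not Sphinx's exact configuration):

```python
import numpy as np

def frame_signal(signal, sample_rate=16000, frame_ms=25, hop_ms=10):
    """Slice a waveform into overlapping analysis frames."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    return np.stack([signal[i * hop: i * hop + frame_len] for i in range(n_frames)])

def log_band_energies(frames, n_bands=26):
    """Crude stand-in for MFCC-style features: log energies of coarse spectral bands."""
    spectra = np.abs(np.fft.rfft(frames * np.hamming(frames.shape[1]), axis=1)) ** 2
    bands = np.array_split(spectra, n_bands, axis=1)
    return np.log(np.stack([b.sum(axis=1) for b in bands], axis=1) + 1e-10)

audio = np.random.randn(16000)                 # one second of stand-in audio
features = log_band_energies(frame_signal(audio))
print(features.shape)                          # ~(98, 26): a compact observation vector per frame
```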

  46. A model for speech recognition • Speech recognition as an estimation problem • A = sequence of acoustic observations • W = string of words

  47. Recognition • A Bayesian formulation of the recognition problem — the “fundamental equation” of speech recognition: P(W|A) = P(A|W) P(W) / P(A), i.e. posterior = conditional × prior / evidence, where W is the event (word string) and A is the evidence (acoustics) • The recognizer chooses W* = argmax_W P(A|W) P(W); the acoustic model P(A|W) and language model P(W) together constitute “the model”
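A toy rendering of that decision rule, assuming a closed list of hypothesis word strings with stand-in acoustic and language model scores (a real recognizer searches an HMM lattice rather than enumerating sentences):

```python
import math

def recognize(acoustics, hypotheses, acoustic_model, language_model):
    """W* = argmax_W P(A|W) P(W); the evidence P(A) is constant over W and can be dropped."""
    def log_posterior(w):
        return math.log(acoustic_model(acoustics, w)) + math.log(language_model(w))
    return max(hypotheses, key=log_posterior)

# Illustrative stand-in models over a two-hypothesis "lattice".
am = lambda a, w: {"recognize speech": 0.40, "wreck a nice beach": 0.42}[w]
lm = lambda w: {"recognize speech": 0.010, "wreck a nice beach": 0.001}[w]

print(recognize("A", ["recognize speech", "wreck a nice beach"], am, lm))
# -> "recognize speech": the prior P(W) outweighs the slightly better acoustic match
```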

  48. Understanding speech • Parser: extract semantic content from the utterance (uses a grammar; built through ontology design and language acquisition) • Post-parser: introduce context and world knowledge into the interpretation (uses context and domain agents; built through grounding and knowledge engineering)

  49. Interacting with the user • Dialog manager: guide the interaction through the task; map user inputs and system state into actions (uses task schemas and context; built through task analysis) • Domain agents: interact with back-end(s) — databases and live data (e.g. the Web) — and interpret information using domain knowledge (built by domain experts through knowledge engineering)

  50. Communicating with the user • Language generator: decide what to say to the user (and how to phrase it), driven by rules and a database • Speech synthesizer: construct sounds and intonation • Display generator and action generator: render output and perform actions, likewise driven by rules and databases
