Robust Translation of Spontaneous Speech: A Multi-Engine Approach

Seventeenth International Joint Conference on Artificial Intelligence, IJCAI-01, Seattle, Wednesday, 8 August 2001
Wolfgang Wahlster, German Research Center for Artificial Intelligence, DFKI GmbH, www.dfki.de/~wahlster

Mobile Speech-to-Speech Translation of Spontaneous Dialogs
As the name Verbmobil suggests, the system supports verbal communication with foreign dialog partners in mobile situations:
1. face-to-face conversations
2. telecommunication

Mobile Speech-to-Speech Translation of Spontaneous Dialogs: Conference Call
The Verbmobil Speech Translation Server connects GSM cell phone users.

Robust Realtime Translation with Verbmobil
At a German airport: an American businessman calls the secretary of a German business partner.

Outline
- Verbmobil’s Multi-Blackboard and Multi-Engine Architecture
- Exploiting Underspecification in a Multi-Stratal Semantic Representation Language
- Combining Deep and Shallow Processing Strategies for Robust Dialog Translation
- Evaluation and Technology Transfer
- Lessons Learned and Conclusions

German German GermanEnglish English German English English Telephone-based Dialog Translation Verbmobil Server Cluster German Dialog Partner l ISDN Conference Call (3 Participants): -German Speaker -Verbmobil -American Speaker l Speech-based Set-up of the Conference Call Bianca/Brick XS BinTec ISDN-LAN Router American Dialog Partner LINUX Server Sun Server 450 Sun ULTRA 60/80

Verbmobil: The First Speech-Only Dialog Translation System
Devices: mobile GSM phone, mobile DECT phone
1. American Speaker: “Verbmobil” (voice dialing)
2. Connect to the Verbmobil Speech-to-Speech Translation Server: +49 631 3111911
3. Verbmobil: “Welcome to the Verbmobil Translation System. Please speak the telephone number of your partner.”
4. American Speaker: “0177555”
5. The foreign participant is placed into the conference call
6. Verbmobil, to the German participant: “Verbmobil hat eine neue Verbindung aufgebaut. Bitte sprechen Sie jetzt.” (Verbmobil has set up a new connection. Please speak now.)
7. Verbmobil, to the American participant: “Welcome to the Verbmobil server. Please start your input after the beep.”

Verbmobil is a Multilingual System
It supports bidirectional translation between:
- English (American) and German
- Japanese and German
- Chinese (Mandarin) and German

Verbmobil Partners, Phase 2
TU Braunschweig; DaimlerChrysler; Rheinische Friedrich-Wilhelms-Universität Bonn; Ludwig-Maximilians-Universität München; Universität Bielefeld; Universität des Saarlandes; Technische Universität München; Universität Hamburg; Friedrich-Alexander-Universität Erlangen-Nürnberg; Ruhr-Universität Bochum; Eberhard-Karls-Universität Tübingen; Universität Stuttgart; Universität Karlsruhe

Three Levels of Language Processing
Input: speech over the telephone. Each level reduces uncertainty.
- Speech recognition: What has the caller said? Knowledge sources: acoustic language models, word lists. Result: 100 alternatives.
- Speech analysis: What has the caller meant? Knowledge sources: grammar, lexical meaning. Result: 10 alternatives.
- Speech understanding: What does the caller want? Knowledge sources: discourse context, knowledge about the domain of discourse. Result: unambiguous understanding in the dialog context.

Challenges for Language Engineering
Increasing complexity along four dimensions; Verbmobil sits at the most complex level of each:
- Input conditions: close-speaking microphone/headset, push-to-talk; telephone, pause-based segmentation; open microphone, GSM quality
- Naturalness: isolated words; read continuous speech; spontaneous speech
- Adaptability: speaker dependent; speaker independent; speaker adaptive
- Dialog capabilities: monolog dictation; information-seeking dialog; multiparty negotiation

Verbmobil II: Three Domains of Discourse
- Scenario 1, Appointment Scheduling: When? Focus on temporal expressions. Vocabulary size: 6000
- Scenario 2, Travel Planning & Hotel Reservation: When? Where? How? Focus on temporal and spatial expressions. Vocabulary size: 10000
- Scenario 3, PC-Maintenance Hotline: What? When? Where? How? Integration of special sublanguage lexica. Vocabulary size: 30000

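Context-Sensitive Speech-to-Speech Translation (Verbmobil Server)
- “Wann fährt der nächste Zug nach Hamburg ab?” → “When does the next train to Hamburg depart?”
- “Wo befindet sich das nächste Hotel?” → “Where is the nearest hotel?”
The same German word, “nächste”, is rendered as “next” or “nearest” depending on the context.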

The Control Panel of Verbmobil

Verbmobil’s Massive Data Collection Effort
- 3,200 dialogs (182 hours) with 1,658 speakers
- 79,562 turns distributed on 56 CDs, 21.5 GB
- The so-called Partitur (German word for musical score) orchestrates fifteen strata of annotations: Transliteration Variant 1, Transliteration Variant 2, Lexical Orthography, Canonical Pronunciation, Manual Phonological Segmentation, Automatic Phonological Segmentation, Word Segmentation, Prosodic Segmentation, Dialog Acts, Noises, Superimposed Speech, Syntactic Category, Word Category, Syntactic Function, Prosodic Boundaries

Extracting Statistical Properties from Large Corpora
- Corpora: transcribed speech data; segmented speech with prosodic labels; annotated dialogs with dialog acts; treebanks and predicate-argument structures; aligned bilingual corpora
- Machine learning for the integration of statistical properties into symbolic models for speech recognition, parsing, dialog processing, and translation
- Resulting models: Hidden Markov Models; neural nets, multilayered perceptrons; probabilistic automata; probabilistic grammars; probabilistic transfer rules

Multilinguality
[Figure: word accuracy [%] (axis from 50 to 100) of the Japanese, German, and English recognizers over the evaluation points VM1, '97, '98, '99.1, '99.2, '99.3, 2000.]

Multilinguality: Language Identification (LID)
[Diagram: the speech signal goes to an independent LID module and to the German, English, and Japanese recognizers; the output is the word sequence w1 … wn.]

From a Multi-Agent Architecture to a Multi-Blackboard Architecture
Verbmobil I: Multi-Agent Architecture (modules M1 to M6 connected directly)
- Each module must know which module produces what data
- Direct communication between modules
- Heavy data traffic for moving copies around
Verbmobil II: Multi-Blackboard Architecture (modules M1 to M6 connected through blackboards BB 1 to BB 3)
- All modules can register for each blackboard dynamically
- No direct communication between modules
- No copies of representation structures (word lattice, VIT chart)
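The contrast between the two architectures can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Blackboard` class, its `register` and `post` methods, and the two toy parser functions are hypothetical and do not reproduce Verbmobil's actual interfaces. It only shows the three properties named on the slide: modules register dynamically, they never call each other directly, and the shared structure is passed by reference rather than copied.

```python
from typing import Any, Callable, List


class Blackboard:
    """A named shared store (hypothetical); modules subscribe to it
    instead of addressing each other."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.entries: List[Any] = []
        self._subscribers: List[Callable[[Any], None]] = []

    def register(self, callback: Callable[[Any], None]) -> None:
        # Modules can register at any time (dynamic registration).
        self._subscribers.append(callback)

    def post(self, entry: Any) -> None:
        # The entry is stored once; subscribers receive a reference, not a copy.
        self.entries.append(entry)
        for callback in list(self._subscribers):
            callback(entry)


# Two illustrative blackboards, named after entries on the slides.
word_lattice = Blackboard("word lattice")
syntax = Blackboard("syntactic representation")


def chunk_parser(lattice: Any) -> None:
    # Toy stand-in for a shallow engine: it only tags the shared lattice.
    syntax.post(("chunk", lattice))


def hpsg_parser(lattice: Any) -> None:
    # Toy stand-in for a deep engine working on the same lattice.
    syntax.post(("hpsg", lattice))


word_lattice.register(chunk_parser)
word_lattice.register(hpsg_parser)

results: List[Any] = []
syntax.register(results.append)  # a downstream consumer

lattice = ["when", "would", "you", "like", "to", "go"]
word_lattice.post(lattice)

print([engine for engine, _ in results])  # ['chunk', 'hpsg']
```

Neither parser knows that the other exists, and both results refer to the same `lattice` object, which is the point of replacing module-to-module messages with shared blackboards.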

Multi-Blackboard/Multi-Engine Architecture
- Blackboard 1: Preprocessed Speech Signal
- Blackboard 2: Word Lattice
- Blackboard 3: Syntactic Representation: Parsing Results
- Blackboard 4: Semantic Representation: Lambda DRS
- Blackboard 5: Dialog Acts
[Diagram: alternative engines for each processing step (Module 1.1, 1.2, … through Module 6.1, 6.2, …) are attached to these blackboards.]

A Multi-Blackboard Architecture for the Combination of Results from Deep and Shallow Processing Modules
- Audio data: Command Recognizer, Channel/Speaker Adaptation, Spontaneous Speech Recognizer, Prosodic Analysis
- Word hypotheses graph with prosodic labels: Statistical Parser, Chunk Parser, Dialog Act Recognition, HPSG Parser
- VITs (underspecified discourse representations): Semantic Construction, Semantic Transfer, Robust Dialog Semantics, Generation

VIT (Verbmobil Interface Terms) as a Multi-Stratal Representation Language
- used as a common representation scheme for information exchange between all components and processing threads
- design inspired by underspecified discourse representation structures (UDRS, Reyle/Kamp 1993)
- compact representation of lexical and structural ambiguities and scope underspecifications of quantifiers, negations and adverbs
- variable-free sets of non-recursive terms:
  [beginning(35, i37), arg3(35, i37, i38), come(27, i35), arg1(27, i35, i36), decl(37, h43), pron(26, i36), at(36, i35, i37), mofy(34, i38, aug), def(28, i37, h42, h41), udef(31, i38, h45, h44)]
- streams of literals as flat multi-stratal representations that are very efficient for incremental processing

VIT for ‘He is coming at the beginning of August’
Vit(vitID(sid(104, a, en, 10, 80, 1, en, y, semantics),   % Segment Identifier
          [word(he, 1, [26]), word(is, 2, []), word(coming, 3, [27]), word(at, 4, [36]),
           word(the, 5, [28]), word(beginning, 6, [35]), word(of, 7, [35]), word(``August'', 8, [34])]),   % WHG String
    index(38, 25, i35),   % Index
    [beginning(35, i37), arg3(35, i37, i38), come(27, i35), arg1(27, i35, i36), decl(37, h43),
     pron(26, i36), at(36, i35, i37), mofy(34, i38, aug), def(28, i37, h42, h41), udef(31, i38, h45, h44)],   % Conditions
    [in_g(26, 25), in_g(37, 38), in_g(27, 25), in_g(28, 30), in_g(31, 33), in_g(34, 32), in_g(35, 29), in_g(36, 25),
     leq(25, h41), leq(25, h43), leq(29, h42), leq(29, h44), leq(30, h43), leq(32, h45), leq(33, h43)],   % Scope and Grouping Constraints
    [s_sort(i35, situation), s_sort(i37, time), s_sort(i38, time)],   % Sortal Specifications for Instance Variables
    [dialog_act(25, inform), dir(36, no), prontype(i36, third, std)],   % Discourse and Pragmatics
    [cas(i36, nom), gend(i36, masc), num(i36, sg), num(i37, sg), num(i38, sg), pcase(l135, i38, of)],   % Syntax
    [ta_aspect(i35, progr), ta_mood(i35, ind), ta_perf(i35, nonperf), ta_tense(i35, pres)],   % Tense and Aspect
    [pros_accent(35)])   % Prosody

Information between Layers is Linked Together Using Constant Symbols
Instances are constants interpreted as skolemized variables.
[word(he, 1, [26]), word(is, 2, []), word(coming, 3, [27]), word(at, 4, [36]), word(the, 5, [28]), word(beginning, 6, [35]), word(of, 7, [35]), word(``August'', 8, [34])],   % WHG String
[beginning(35, i37), arg3(35, i37, i38), come(27, i35), arg1(27, i35, i36), decl(37, h43), pron(26, i36), at(36, i35, i37), mofy(34, i38, aug), def(28, i37, h42, h41), udef(31, i38, h45, h44)],   % Conditions
[s_sort(i35, situation), s_sort(i37, time), s_sort(i38, time)],   % Sorts
[cas(i36, nom), gend(i36, masc), num(i36, sg), num(i37, sg)],   % Syntax
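How the shared constants tie the layers together can be traced with a small sketch. The literals are copied from the VIT excerpt on this slide; the encoding as Python tuples and the helper `describe` are assumptions made for illustration and are not part of Verbmobil. Starting from a word, the sketch follows its label numbers into the conditions, collects the instance constants (`i35`, `i36`, …) found there, and uses them to look up entries in the sort and syntax layers.

```python
import re

# Literals from the slide, re-encoded as tuples (the encoding is an assumption).
words = [("he", 1, [26]), ("is", 2, []), ("coming", 3, [27]), ("at", 4, [36]),
         ("the", 5, [28]), ("beginning", 6, [35]), ("of", 7, [35]), ("August", 8, [34])]
conditions = [("beginning", 35, "i37"), ("arg3", 35, "i37", "i38"), ("come", 27, "i35"),
              ("arg1", 27, "i35", "i36"), ("decl", 37, "h43"), ("pron", 26, "i36"),
              ("at", 36, "i35", "i37"), ("mofy", 34, "i38", "aug"),
              ("def", 28, "i37", "h42", "h41"), ("udef", 31, "i38", "h45", "h44")]
sorts = {"i35": "situation", "i37": "time", "i38": "time"}
syntax = [("cas", "i36", "nom"), ("gend", "i36", "masc"),
          ("num", "i36", "sg"), ("num", "i37", "sg")]


def describe(word):
    """Collect everything the layers say about one word (hypothetical helper)."""
    # Step 1: word layer -> label numbers.
    labels = next(ls for w, _, ls in words if w == word)
    # Step 2: labels -> conditions carrying one of those labels.
    conds = [c for c in conditions if c[1] in labels]
    # Step 3: conditions -> instance constants such as i36.
    instances = sorted({a for c in conds for a in c[2:] if re.fullmatch(r"i\d+", a)})
    # Step 4: instance constants -> sort and syntax layers.
    return {
        "conditions": conds,
        "sorts": {i: sorts[i] for i in instances if i in sorts},
        "syntax": [s for s in syntax if s[1] in instances],
    }


print(describe("he")["syntax"])
print(describe("beginning")["sorts"])
```

Because every layer is a flat list of literals keyed by the same constants, no layer has to embed another; a join over the constants is enough to reassemble the information about a word.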

The Use of Underspecified Representations
- Two readings in the source language: “Wir telephonierten mit Freunden aus Schweden.”
- Underspecified semantic representation: a compact representation of scope ambiguities in a logical language without using disjunctions
- Ambiguity-preserving translation
- Two readings in the target language: “We called friends from Sweden.”
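The idea of ambiguity-preserving transfer can be sketched as follows. This is a deliberately simplified toy, not VIT's mechanism: the `Underspecified` class, the predicate names, the labels `l1` and `l2`, and the tiny `LEXICON` are all hypothetical. The ambiguity of the example sentence is modeled as a modifier whose attachment site is left open (the calling or the friends), and transfer maps predicates one by one without ever enumerating the readings.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Underspecified:
    """One representation for all readings (toy model, assumed for illustration)."""
    literals: Tuple[Tuple[str, str], ...]  # (predicate, label), shared by all readings
    modifier: str                          # modifier whose attachment is left open
    candidates: Tuple[str, ...]            # labels the modifier may attach to

    def readings(self) -> List[tuple]:
        # Readings are enumerated only on demand, never stored as a disjunction.
        return [self.literals + ((self.modifier, site),) for site in self.candidates]


# Source side: "Wir telephonierten mit Freunden aus Schweden."
src = Underspecified(
    literals=(("telefonieren", "l1"), ("freund", "l2")),
    modifier="aus_schweden",
    candidates=("l1", "l2"),  # we called from Sweden, or the friends are from Sweden
)

# Hypothetical transfer lexicon.
LEXICON = {"telefonieren": "call", "freund": "friend", "aus_schweden": "from_sweden"}


def transfer(u: Underspecified) -> Underspecified:
    """Translate predicate by predicate; the open attachment is carried over as is."""
    return Underspecified(
        literals=tuple((LEXICON[pred], label) for pred, label in u.literals),
        modifier=LEXICON[u.modifier],
        candidates=u.candidates,
    )


tgt = transfer(src)
print(len(tgt.readings()))  # 2
```

The target representation still stands for both readings, which matches the slide's example: “We called friends from Sweden.” is ambiguous in the same way as the German source, so the system does not need to resolve the ambiguity in order to translate.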

Verbmobil is the First Dialog Translation System that Uses Prosodic Information Systematically at All Processing Stages
- Input to the multilingual prosody module: speech signal, word hypotheses graph
- Prosodic features: duration, pitch, energy, pause
- Output of the prosody module: boundary information, sentence mood, accented words, prosodic feature vector
- Uses: dialog act segmentation and recognition, search space restriction, constraints for transfer, lexical choice, speaker adaptation
- Consuming components: dialog understanding, parsing, translation, generation, speech synthesis

Using Syntactic-Prosodic Boundaries to Speed Up the Parsing Process
Example: yes S1 no problem S4 Mister Mueller S4 when would you like to go to Hannover S4
- without boundaries: # chart edges: 1256, runtime: 1.31 secs
- with boundaries: # chart edges: 632, runtime: 0.62 secs
- speed-up: 53%
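The 53% figure can be checked against the numbers on the slide. The short calculation below assumes that "speed-up" means the relative reduction in runtime; the helper name `reduction` is introduced here only for illustration.

```python
def reduction(before: float, after: float) -> float:
    """Relative saving in percent when going from `before` to `after`."""
    return 100.0 * (before - after) / before


runtime_saving = reduction(1.31, 0.62)  # seconds, from the slide
edge_saving = reduction(1256, 632)      # chart edges, from the slide

print(f"runtime reduced by {runtime_saving:.1f}%")    # runtime reduced by 52.7%
print(f"chart edges reduced by {edge_saving:.1f}%")   # chart edges reduced by 49.7%
```

Under that reading the runtime saving rounds to the quoted 53%, and the number of chart edges is roughly halved.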
