From Speech Recognition Towards Speech Understanding

Heraeus-Seminar „Speech Recognition and Speech Understanding“
Physikzentrum Bad Honnef, April 5, 2000
From Speech Recognition Towards Speech Understanding
Wolfgang Wahlster
German Research Center for Artificial Intelligence, DFKI GmbH
Stuhlsatzenhausweg 3, 66123 Saarbruecken, Germany
phone: (+49 681) 302-5252/4162
fax: (+49 681) 302-5341
e-mail: wahlster@dfki.de
WWW: http://www.dfki.de/~wahlster

Outline
1. Speech-to-Speech Translation: Challenges for Language Technology
2. A Multi-Blackboard Architecture for the Integration of Deep and Shallow Processing
3. Integrating the Results of Multiple Deep and Shallow Parsers
4. Packed Chart Structures for Partial Semantic Representations
5. Robust Semantic Processing: Merging and Completing Discourse Representations
6. Combining the Results of Deep and Shallow Translation Threads
7. The Impact of Verbmobil on German Language Industry
8. SmartKom: Integrating Verbmobil Technology Into an Intelligent Interface Agent
9. Conclusion

Signal-Symbol-Signal Transformations in Spoken Dialog Systems
Input Speech Signal → Subsymbolic Processing (Speech Recognition) → Symbolic Processing (Speech Understanding & Generation) → Subsymbolic Processing (Speech Synthesis) → Output Speech Signal

Three Levels of Language Processing
Input: Speech (Telephone). Each level reduces uncertainty:
- Speech Recognition (Acoustic Language Models): What has the caller said? 100 Alternatives (Word Lists)
- Speech Analysis (Grammar, Lexical Meaning): What has the caller meant? 10 Alternatives
- Speech Understanding (Discourse Context, Knowledge about Domain of Discourse): What does the caller want? Unambiguous Understanding in the Dialog Context

Challenges for Language Engineering
Four dimensions of increasing complexity, with Verbmobil at the most demanding end of each:
- Input Conditions: Close-Speaking Microphone/Headset, Push-to-talk → Telephone, Pause-based Segmentation → Open Microphone, GSM Quality
- Naturalness: Isolated Words → Read Continuous Speech → Spontaneous Speech
- Adaptability: Speaker Dependent → Speaker Independent → Speaker adaptive
- Dialog Capabilities: Monolog Dictation → Information-seeking Dialog → Multiparty Negotiation

Telephone-based Dialog Translation
- ISDN Conference Call (3 Participants): German Speaker, Verbmobil, American Speaker
- Speech-based Set-up of the Conference Call
Set-up: German Dialog Partner (German) and American Dialog Partner (English), connected through the Verbmobil Server Cluster
Hardware: Bianca/Brick XS BinTec ISDN-LAN Router, Sun ULTRA 60/80, LINUX Server, Sun Server 450

Context-Sensitive Speech-to-Speech Translation
Wann fährt der nächste Zug nach Hamburg ab? When does the next train to Hamburg depart?
Wo befindet sich das nächste Hotel? Where is the nearest hotel?
Verbmobil Server
Final Verbmobil Demos:
- ECAI-2000 (Berlin)
- CeBIT-2000 (Hannover)
- COLING-2000 (Saarbrücken)

Dialog Translation 1 Wenn ich den Zug um 14 Uhr bekomme, bin ich um 4 in Frankfurt. If I get the train at 2 o'clock I am in Frankfurt at 4 o'clock. Am Flughafen könnten wir uns treffen. We could meet at the airport.

Dialog Translation 2 Abends könnten wir Essen gehen. We could go out for dinner in the evening. Wann denn am Abend? What time in the evening?

Dialog Translation 3 Ich könnte für 8 Uhr einen Tisch reservieren. I could reserve a table for 8 o'clock.

Verbmobil II: Three Domains of Discourse
- Scenario 1, Appointment Scheduling: When? Focus on temporal expressions. Vocabulary Size: 2500/6000
- Scenario 2, Travel Planning & Hotel Reservation: When? Where? How? Focus on temporal and spatial expressions. Vocabulary Size: 7000/10000
- Scenario 3, PC-Maintenance Hotline: What? When? Where? How? Integration of special sublanguage lexica. Vocabulary Size: 15000/30000

Data Collection with Multiple Input Devices
Room Microphone, GSM Mobile Phone, Close-Speaking Microphone, ISDN Phone
> 43 CDs of transliterated speech data, aligned translations
> 5,000 Dialogs, > 50,000 Turns, > 10,000 Lemmata

Extracting Statistical Properties from Large Corpora
Corpora: Transcribed Speech Data; Segmented Speech with Prosodic Labels; Treebanks & Predicate-Argument Structures; Annotated Dialogs with Dialog Acts; Aligned Bilingual Corpora
Machine Learning for the Integration of Statistical Properties into Symbolic Models for Speech Recognition, Parsing, Dialog Processing, Translation
Models: Hidden Markov Models; Neural Nets, Multilayered Perceptrons; Probabilistic Automata; Probabilistic Grammars; Probabilistic Transfer Rules

Verbmobil Partner (Phase 2)
TU-BRAUNSCHWEIG, DAIMLERCHRYSLER, RHEINISCHE FRIEDRICH WILHELMS-UNIVERSITÄT BONN, LUDWIG MAXIMILIANS UNIVERSITÄT MÜNCHEN, UNIVERSITÄT BIELEFELD, UNIVERSITÄT DES SAARLANDES, TECHNISCHE UNIVERSITÄT MÜNCHEN, UNIVERSITÄT HAMBURG, FRIEDRICH-ALEXANDER-UNIVERSITÄT ERLANGEN-NÜRNBERG, RUHR-UNIVERSITÄT BOCHUM, EBERHARDT-KARLS UNIVERSITÄT TÜBINGEN, UNIVERSITÄT STUTTGART, UNIVERSITÄT KARLSRUHE
© W. Wahlster, DFKI

The Control Panel of Verbmobil

From a Multi-Agent Architecture to a Multi-Blackboard Architecture
Verbmobil I: Multi-Agent Architecture (modules M1 to M6 communicate directly)
- Each module must know which module produces what data
- Direct communication between modules
- Each module has only one instance
- Heavy data traffic for moving copies around
- Multiparty and telecooperation applications are impossible
- Software: ICE and ICE Master
- Basic Platform: PVM
Verbmobil II: Multi-Blackboard Architecture (modules M1 to M6 communicate through blackboards BB 1 to BB 3)
- All modules can register for each blackboard dynamically
- No direct communication between modules
- Each module can have several instances
- No copies of representation structures (word lattice, VIT chart)
- Multiparty and telecooperation applications are possible
- Software: PCA and Module Manager
- Basic Platform: PVM
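The registration idea behind the multi-blackboard design can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Blackboard` class, the `register` and `post` methods, and the module names are hypothetical and do not reflect the actual PCA or Module Manager interfaces.

```python
class Blackboard:
    """A shared store; modules register callbacks instead of calling each other."""

    def __init__(self, name):
        self.name = name
        self.entries = []        # representation structures posted so far
        self.subscribers = []    # callbacks of the registered modules

    def register(self, callback):
        """A module registers dynamically for this blackboard."""
        self.subscribers.append(callback)

    def post(self, entry):
        """Store an entry once and notify every registered module."""
        self.entries.append(entry)
        for callback in list(self.subscribers):
            callback(entry)      # the same object is passed on, no copy is made


# Hypothetical blackboards, named after two of those shown on the slides.
word_lattice = Blackboard("word lattice")
parsing_results = Blackboard("parsing results")


def make_parser(label):
    """Build a toy parser module that reads lattices and posts its result."""
    def on_lattice(lattice):
        parsing_results.post((label, lattice))
    return on_lattice


# Two parser instances register for the same blackboard.
word_lattice.register(make_parser("chunk parser"))
word_lattice.register(make_parser("HPSG parser"))

lattice = ["wir", "treffen", "uns"]
word_lattice.post(lattice)
```

The recognizer that posts the lattice never needs to know which parsers consume it, which is the property the slide contrasts with the Verbmobil I design.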

Multi-Blackboard/Multi-Agent Architecture
Modules 1 to 6 read from and write to the following blackboards:
- Blackboard 1: Preprocessed Speech Signal
- Blackboard 2: Word Lattice
- Blackboard 3: Syntactic Representation: Parsing Results
- Blackboard 4: Semantic Representation: Lambda DRS
- Blackboard 5: Dialog Acts

A Multi-Blackboard Architecture for the Combination of Results from Deep and Shallow Processing Modules Command Recognizer Channel/Speaker Adaptation Audio Data Spontaneous Speech Recognizer Prosodic Analysis Statistical Parser Chunk Parser Word Hypothesis Graph with Prosodic Labels Dialog Act Recognition HPSG Parser Semantic Construction Semantic Transfer VITs Underspecified Discourse Representations Robust Dialog Semantics Generation

Integrating Shallow and Deep Analysis Components in a Multi-Blackboard Architecture Augmented Word Lattice Statistical Parser Chunk Parser HPSG Parser partial VITs Chart with a combination of partial VITs partial VITs partial VITs Robust Dialog Semantics Combination and knowledge- based reconstruction of complete VITs Complete and Spanning VITs

VHG: A Packed Chart Representation of Partial Semantic Representations
- Incremental chart construction and anytime processing
- Rule-based combination and transformation of partial UDRS coded as VITs
- Selection of a spanning analysis using a bigram model for VITs (trained on a treebank of 24k VITs)
Parsers feeding Semantic Construction:
- Chart Parser using cascaded finite-state transducers (Abney, Hinrichs)
- Statistical LR parser trained on treebank (Block, Ruland)
- Very fast HPSG parser (see two papers at ACL99, Kiefer, Krieger et al.)

Robust Dialog Semantics: Deep Processing of Shallow Structures
Goals of robust semantic processing (Pinkal, Worm, Rupp):
- Combination of unrelated analysis fragments
- Completion of incomplete analysis results
- Skipping of irrelevant fragments
Method: Transformation rules on the VIT Hypothesis Graph: Conditions on VIT structures → Operations on VIT structures
The rules are based on various knowledge sources:
- lattice of semantic types
- domain ontology
- sortal restrictions
- semantic constraints
Results: 20% of the analyses are improved, 0.6% get worse

Semantic Correction of Recognition Errors Wir treffen uns Kaiserslautern. (We are meeting Kaiserslautern.) We are meeting in Kaiserslautern. German English

Robust Dialog Semantics: Combining and Completing Partial Representations
Utterance: Let us meet (in) the late afternoon to catch the train to Frankfurt
Fragments: Let us / meet / the late afternoon / to catch / the train to Frankfurt
The preposition 'in' is missing in all paths through the word hypothesis graph.
A temporal NP is transformed into a temporal modifier using an underspecified temporal relation:
[temporal_np(V1)] → [typeraise_to_mod(V1, V2)] & V2
The modifier is applied to a proposition:
[type(V1, prop), type(V2, mod)] → [apply(V2, V1, V3)] & V3
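The two rules can be illustrated with a toy Python sketch. It assumes that fragments are plain dictionaries with a `type` field, which is a simplification and not the VIT data structure; the function names only mirror the operation names on the slide, and the relation label `temporal_unspecified` is made up for illustration.

```python
def typeraise_to_mod(fragment):
    """Rule 1: turn a temporal NP into a modifier with an underspecified relation."""
    if fragment["type"] != "temporal_np":
        return None  # condition of the rule is not met
    return {
        "type": "mod",
        "relation": "temporal_unspecified",  # hypothetical label, 'in' is never filled in
        "arg": fragment["content"],
    }


def apply_modifier(prop, mod):
    """Rule 2: apply a modifier to a proposition, giving a new proposition."""
    if prop["type"] != "prop" or mod["type"] != "mod":
        return None
    return {
        "type": "prop",
        "content": prop["content"],
        "modifiers": prop.get("modifiers", []) + [mod],
    }


# Two unrelated fragments, as they might come out of the partial analyses.
meet = {"type": "prop", "content": "let us meet"}
afternoon = {"type": "temporal_np", "content": "the late afternoon"}

modifier = typeraise_to_mod(afternoon)
combined = apply_modifier(meet, modifier)
```

Each rule checks a condition on its input and only then performs its operation, following the condition-operation shape of the rules on the slide.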

The Understanding of Spontaneous Speech Repairs
Original Utterance: I need a car next Tuesday oops Monday
- Reparandum: Tuesday
- Hesitation (Editing Phase): oops
- Reparans (Repair Phase): Monday
Recognition of Substitutions, Transformation of the Word Hypothesis Graph
Result: I need a car next Monday
Verbmobil Technology: Understands speech repairs and extracts the intended meaning.
Dictation systems like ViaVoice, VoiceXpress, FreeSpeech and Naturally Speaking cannot deal with spontaneous speech and transcribe the corrupted utterances.
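A much simplified sketch of substitution repair, assuming a flat token list instead of a word hypothesis graph and a one-word reparandum. The editing-term list, the weekday class and the function names are illustrative assumptions, not Verbmobil components.

```python
EDITING_TERMS = {"oops", "uh", "äh"}
WEEKDAYS = {"Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"}


def same_class(left, right):
    """Toy substitution test: both words must be weekday names."""
    return left in WEEKDAYS and right in WEEKDAYS


def repair_substitution(tokens):
    """Replace a one-word reparandum by the reparans that follows an editing term."""
    result = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        is_repair = (
            token in EDITING_TERMS
            and result                       # there is a candidate reparandum
            and i + 1 < len(tokens)          # there is a candidate reparans
            and same_class(result[-1], tokens[i + 1])
        )
        if is_repair:
            result[-1] = tokens[i + 1]       # substitute reparandum by reparans
            i += 2                           # skip editing term and reparans
        else:
            result.append(token)
            i += 1
    return result


repaired = repair_substitution("I need a car next Tuesday oops Monday".split())
```

On the example utterance from the slide this yields the tokens of "I need a car next Monday"; anything that does not look like a substitution is left untouched.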

Automatic Understanding and Correction of Speech Repairs in Spontaneous Telephone Dialogs Wir treffen uns in Mannheim, äh, in Saarbrücken. (We are meeting in Mannheim, oops, in Saarbruecken.) We are meeting in Saarbruecken. German English

Integrating a Deep HPSG-based Analysis with Probabilistic Dialog Act Recognition for Semantic Transfer HPSG Analysis Probabilistic Analysis of Dialog Acts (HMM) Robust Dialog Semantics Dialog Act Type VIT Dialog Act Type Recognition of Dialog Plans (Plan Operators) Semantic Transfer Dialog Phase

The Dialog Act Hierarchy used for Planning, Prediction, Translation and Generation GREETING_BEGIN GREETING_END GREETING INTRODUCE POLITENESS_FORMULA THANK DELIBERATE BACKCHANNEL CONTROL_DIALOG INIT DEFER CLOSE MANAGE_TASK Dialog Act REQUEST_SUGGEST REQUEST_CLARIFY REQUEST_COMMENT REQUEST_COMMIT REQUEST SUGGEST INFORM FEEDBACK COMMIT DEVIATE_SCENARIO REFER_TO_SETTING DIGRESS EXCLUDE CLARIFY GIVE_REASON CLARIFY_ANSWER PROMOTE_TASK REJECT FEEDBACK_NEGATIVE EXPLAINED_REJECT ACCEPT CONFIRM FEEDBACK_POSITIVE

Combining Statistical and Symbolic Processing for Dialog Processing Dialog-Act based Translation Dialog Module Context Evaluation Statistical Prediction Dialog Act Predictions Context Evaluation Main Propositional Content Focus Plan Recognition Dialog Phase Transfer by Rules Dialog Act Dialog-Act based Translation Dialog Memory Dialog Act Generation of Minutes

Statistical Dialog Act Recognition
- Statistical approach: find the most probable dialog act D for words W:
  $$D = \arg\max_{D'} P(D' \mid W)$$
- Bayes' formula:
  $$D = \arg\max_{D'} P(W \mid D')\, P(D')$$
- Use dialog context H:
  $$D = \arg\max_{D'} P(W \mid D')\, P(D' \mid H)$$
- Approximation of a-priori word probabilities P(W | D) and dialog act probabilities P(D | H) from the corpus
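The decision rule D = argmax P(W | D') P(D' | H) can be sketched on a toy corpus. The sketch makes several assumptions that are not in the slides: the context H is reduced to the previous dialog act, P(W | D) is a unigram model, both distributions use add-one smoothing, and the five training utterances are invented.

```python
import math
from collections import Counter, defaultdict

# Invented toy corpus: (previous dialog act, dialog act, words).
CORPUS = [
    ("START", "GREETING", "hello good morning"),
    ("GREETING", "SUGGEST", "how about monday"),
    ("SUGGEST", "REJECT", "monday does not suit me"),
    ("REJECT", "SUGGEST", "how about tuesday"),
    ("SUGGEST", "ACCEPT", "tuesday suits me fine"),
]

word_counts = defaultdict(Counter)      # counts for P(W | D)
act_given_prev = defaultdict(Counter)   # counts for P(D | H), H = previous act
for prev, act, text in CORPUS:
    word_counts[act].update(text.split())
    act_given_prev[prev][act] += 1

ACTS = sorted(word_counts)
VOCAB = {word for counts in word_counts.values() for word in counts}


def log_p_words(words, act):
    """log P(W | D) under a unigram model with add-one smoothing."""
    counts = word_counts[act]
    denominator = sum(counts.values()) + len(VOCAB) + 1  # +1 for unseen words
    return sum(math.log((counts[w] + 1) / denominator) for w in words)


def log_p_act(act, prev):
    """log P(D | H) with add-one smoothing over the dialog acts."""
    counts = act_given_prev[prev]
    return math.log((counts[act] + 1) / (sum(counts.values()) + len(ACTS)))


def recognize(words, prev):
    """Return the dialog act that maximizes P(W | D') * P(D' | H)."""
    return max(ACTS, key=lambda act: log_p_words(words, act) + log_p_act(act, prev))


guess = recognize("how about friday".split(), "REJECT")
```

Working in log space avoids underflow for longer utterances; the smoothing keeps unseen words such as "friday" from zeroing out a hypothesis.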

Learning of Probabilistic Plan Operators from Annotated Corpora
(OPERATOR-s-10523-6
  goal [IN-TURN confirm-s-10523 ?SLASH-3314 ?SLASH-3316]
  subgoals (sequence [IN-TURN confirm-s-10521 ?SLASH-3314 ?SLASH-3315]
                     [IN-TURN confirm-s-10522 ?SLASH-3315 ?SLASH-3316])
  PROB 0.72)
(OPERATOR-s-10521-8
  goal [IN-TURN confirm-s-10521 ?SLASH-3321 ?SLASH-3322]
  subgoals (sequence [DOMAIN-DEPENDENT accept ?SLASH-3321 ?SLASH-3322])
  PROB 0.95)
(OPERATOR-s-10522-10
  goal [IN-TURN confirm-s-10522 ?SLASH-3325 ?SLASH-3326]
  subgoals (sequence [DOMAIN-DEPENDENT confirm ?SLASH-3325 ?SLASH-3326])
  PROB 0.83)

Automatic Generation of Multilingual Protocols of Telephone Conversations Dialog Translation by Verbmobil Multilingual Generation of Protocols HTML-Document In English Transferred by Internet or Fax HTML-Document In English Transferred by Internet or Fax German Dialog Partner American Dialog Partner

Automatic Generation of Minutes A and B greet each other. A: (INIT_DATE, SUGGEST_SUPPORT_DATE, REQUEST_COMMENT_DATE) I would like to make a date. How about the seventeenth? Is that ok with you? B: (REJECT_DATE, ACCEPT_DATE) The seventeenth does not suit me. I’m free for one hour at three o’clock. A: (SUGGEST_SUPPORT_DATE) How about the sixteenth in the afternoon? B: (CLARIFY_QUERY, ACCEPT_DATE, CONFIRM) The sixteenth at two o’clock? That suits me. Ok. A and B say goodbye. Minutes generated automatically on 23 May 1999 08:35:18 h

Dialog Protocol Participants: Speaker B, Speaker A Date: 22.3.2000 Time: 8:57 AM to 10:03 AM Theme: Appointment schedule with trip and accommodation DIALOGUE RESULTS: Scheduling: Speaker B and speaker A will meet in the train station on the 1st of march 2000 at a quarter to 10 in the morning. Travelling: There the trip from Hamburg to Hanover by train will start on the 2nd of march at 10 o'clock in the morning. The way back by train will start on the 2nd of march at half past 6 in the evening. Accommodation: The hotel Luisenhof in Hanover was agreed on. Speaker A is taking care of the hotel reservation. Summary automatically generated at 22.3.2000 12:31:24 h

Spoken Clarification Dialogs between the User and the Verbmobil System
User 1 (German Input), Verbmobil System, User 2 (English Input, English Translation); Clarification Subdialog in German
Clarification caused by:
- speech recognition problems: confusion with similar words (Sonntag vs. Sonntags), unknown words (heuer → dieses Jahr)
- lack of context knowledge: lexical ambiguity (Noch einen Termin bitte!)
- inconsistency with regard to the system’s knowledge: inconsistent date (Freitag, 24. Oktober)

Competing Strategies for Robust Speech Translation
Concurrent processing modules of Verbmobil combine deep semantic translation with shallow surface-oriented translation methods. All of them start from the Word Lattice.
Expensive, but precise Translation:
- Principled and compositional syntactic and semantic analysis
- Semantic-based transfer of Verbmobil Interface Terms (VITs) as set of underspecified DRS
Cheap, but approximate Translation:
- Case-based Translation
- Dialog-act based translation
- Statistical translation
Both kinds of threads deliver Results with Confidence Values. time out? → Selection of best result → Acceptable Translation Rate

Architecture of the Semantic Transfer Module Bilingual Dictionary Refined VIT (L1) Refined VIT (L2) Lexical Transfer Monolingual Refinement Rules Monolingual Refinement Rules Refinement Refinement Phrasal Transfer VIT (L2) VIT (L1) Disambiguation Rules Disambiguation Rules Phrasal Dictionary Underspecified VIT (L1) Underspecified VIT (L2)

Extensions of Discourse Representation Theory
The Verbmobil version of λ-DRT (Pinkal et al.) includes various extensions of DRT:
- lambda: λ-abstraction over DRSs
- merge operator: combination of DRSs
- functional application: basic composition operation
- quants feature: allows scope-free semantic representation
- alfa expressions: representation of anaphoric elements with underspecified reference
- anchors list: representation of deictic information
- epsilon expressions: underspecification of elliptical expressions
- modal expressions: representation of propositional attitudes

Three English Translations of the German Word “Termin” Found in the Verbmobil Corpus
1. Verschieben wir den Termin. Let’s reschedule the appointment.
2. Schlagen Sie einen Termin vor. Suggest a date.
3. Da habe ich einen Termin frei. I have got a free slot there.
Subsumption Relations in the Domain Model: scheduled event default temporal_specification set_start_time time_interval appointment slot date

Entries in the Transfer Lexicon: German → English (Simplified)
tau_lex (termin, appointment, pred_sort (subsumption (scheduled_event))).
tau_lex (termin, date, pred_sort (subsumption (set_start_time))).
tau_lex (termin, slot, pred_sort (subsumption (time_interval))).
tau_lex (verschieben, reschedule, [tau (#S), tau (#0)], pred_args ([#S, #0 & pred_sort (scheduled_event)])).
tau_lex (ausmachen, make, [tau (#S), tau (#0)], pred_args ([#S, #0 & pred_sort (scheduled_event)])).
tau_lex (ausmachen, fix, [tau (#S), tau (#0)], pred_args ([#S, #0 & pred_sort (set_start_time)])).
tau_lex (freihaben, have_free, [tau (#S), tau (#0)], pred_args ([#S, #0 & pred_sort (time_interval)])).
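How the sort restrictions select a translation can be sketched in Python. The tables below restate the lexicon entries from the slide; the `transfer` function is a hypothetical helper, and it compares sorts by equality, which is a simplification of the subsumption test over the domain model.

```python
# Noun entries: (German lemma, English lemma, sort under which the entry applies).
NOUN_LEX = [
    ("termin", "appointment", "scheduled_event"),
    ("termin", "date", "set_start_time"),
    ("termin", "slot", "time_interval"),
]

# Verb entries: (German lemma, English lemma, sort required of the object).
VERB_LEX = [
    ("verschieben", "reschedule", "scheduled_event"),
    ("ausmachen", "make", "scheduled_event"),
    ("ausmachen", "fix", "set_start_time"),
    ("freihaben", "have_free", "time_interval"),
]


def transfer(verb, noun):
    """Return every (English verb, English noun) pair whose sorts agree.

    Simplification: sorts must be equal; the real lexicon tests subsumption.
    """
    return [
        (english_verb, english_noun)
        for german_verb, english_verb, object_sort in VERB_LEX
        if german_verb == verb
        for german_noun, english_noun, noun_sort in NOUN_LEX
        if german_noun == noun and noun_sort == object_sort
    ]


options = transfer("ausmachen", "termin")
```

The verb's restriction on its object picks the matching reading of "Termin", so "verschieben" leads to "appointment" and "freihaben" to "slot", while "ausmachen" keeps two options open.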

Context-Sensitive Translation Exploiting a Discourse Model
Example: Three different translations of the German word Platz: room / table / seat
1. Nehmen wir dieses Hotel, ja. Let us take this hotel. Ich reserviere einen Platz. I reserve a room.
2. Machen wir das Abendessen dort. Let us have dinner there. Ich reserviere einen Platz. I reserve a table.
3. Gehen wir ins Theater. Let us go to the theater. Ich möchte Plätze reservieren. I would like to reserve seats.
All other dialog translation systems translate sentence by sentence without taking the dialog context into account.

The Use of Underspecified Representations Two Readings in the Source Language Wir telephonierten mit Freunden aus Schweden. A compact representation of scope ambiguities in a logical language without using disjunctions Underspecified Semantic Representation Ambiguity Preserving Translations Two Readings in the Target Language We called friends from Sweden.

The Control Panel of Verbmobil

Integrating Deep and Shallow Processing: Combining Results from Concurrent Translation Threads Segment 1 Wenn wir den Termin vorziehen, Segment 1 If you prefer another hotel, Segment 2 das würde mir gut passen. Segment 2 please let me know. Statistical Translation Case-Based Translation Dialog-Act Based Translation Semantic Transfer Alternative Translations with Confidence Values Selection Module Segment 1 Translated by Semantic Transfer Segment 2 Translated by Case-Based Translation

A Context-Free Approach to the Selection of the Best Translation Result
Input:
SEQ := Set of all translation sequences for a turn
Seq ∈ SEQ := Sequence of translation segments s1, s2, ..., sn
Each translation thread provides for every segment an online confidence value confidence(thread.segment)
Task: Compute normalized confidence values for translated Seq
$$\text{CONF}(\text{Seq}) = \sum_{\text{segment} \in \text{Seq}} \text{Length}(\text{segment}) \cdot \bigl(\text{alpha}(\text{thread}) + \text{beta}(\text{thread}) \cdot \text{confidence}(\text{thread.segment})\bigr)$$
Output:
$$\text{Best}(\text{SEQ}) = \{\text{Seq} \in \text{SEQ} \mid \text{Seq is maximal element in } (\text{SEQ}, \text{CONF})\}$$
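The CONF score and the choice of the best sequence can be sketched as follows. The alpha and beta values, the segment lengths and the confidence values are invented for illustration; only the formula itself comes from the slide.

```python
from itertools import product

# Hypothetical normalizing factors per translation thread: (alpha, beta).
FACTORS = {
    "STAT": (0.1, 0.8),
    "CASE": (0.2, 1.0),
    "DIAL": (0.0, 0.6),
    "SEMT": (0.3, 1.0),
}


def conf(seq):
    """CONF(Seq): sum of Length * (alpha + beta * confidence) over the segments.

    Each element of seq is a tuple (thread, segment_length, confidence).
    """
    return sum(
        length * (FACTORS[thread][0] + FACTORS[thread][1] * confidence)
        for thread, length, confidence in seq
    )


def best(candidates_per_segment):
    """Enumerate SEQ (one candidate per segment) and return a maximal element."""
    return max(product(*candidates_per_segment), key=conf)


# Invented turn with two segments and two competing threads per segment.
turn = [
    [("SEMT", 5, 0.9), ("CASE", 5, 0.5)],   # segment 1, length 5
    [("SEMT", 4, 0.2), ("CASE", 4, 0.9)],   # segment 2, length 4
]
winner = best(turn)
```

Because CONF is a sum over segments, the maximum could also be found segment by segment; the explicit enumeration is kept here only because it mirrors the definition of Best(SEQ) on the slide.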

Learning the Normalizing Factors Alpha and Beta from an Annotated Corpus
Turn := segment1, segment2, ..., segmentn
For each turn in a training corpus, all segments translated by one of the four translation threads are manually annotated with a score for translation quality. For the sequence of n segments resulting in the best overall translation score, at most 4n linear inequations are generated, so that the selected sequence is better than all alternative translation sequences. From the set of inequations for spanning analyses ( 4n) the values of alpha and beta can be determined offline by solving the constraint system.

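Example of a Linear Inequation Used for Offline Learning
Turn := Segment_1 Segment_2 Segment_3
Statistical Translation = STAT, Case-based Translation = CASE, Dialog-Act Based Translation = DIAL, Semantic Transfer = SEMT
quality (CASE, Segment_1), quality (SEMT, Segment_2), quality (STAT, Segment_3) is optimal
$$\begin{aligned}
&\text{Length}(\text{Segment\_1}) \cdot (\text{alpha}(\text{CASE}) + \text{beta}(\text{CASE}) \cdot \text{confidence}(\text{CASE}, \text{Segment\_1})) \\
+\; &\text{Length}(\text{Segment\_2}) \cdot (\text{alpha}(\text{SEMT}) + \text{beta}(\text{SEMT}) \cdot \text{confidence}(\text{SEMT}, \text{Segment\_2})) \\
+\; &\text{Length}(\text{Segment\_3}) \cdot (\text{alpha}(\text{STAT}) + \text{beta}(\text{STAT}) \cdot \text{confidence}(\text{STAT}, \text{Segment\_3})) \\
>\; &\text{Length}(\text{Segment\_1}) \cdot (\text{alpha}(\text{DIAL}) + \text{beta}(\text{DIAL}) \cdot \text{confidence}(\text{DIAL}, \text{Segment\_1})) \\
+\; &\text{Length}(\text{Segment\_2}) \cdot (\text{alpha}(\text{DIAL}) + \text{beta}(\text{DIAL}) \cdot \text{confidence}(\text{DIAL}, \text{Segment\_2})) \\
+\; &\text{Length}(\text{Segment\_3}) \cdot (\text{alpha}(\text{DIAL}) + \text{beta}(\text{DIAL}) \cdot \text{confidence}(\text{DIAL}, \text{Segment\_3}))
\end{aligned}$$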
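Each such inequation is linear in the unknowns alpha(thread) and beta(thread), so a set of them can be handed to any solver for linear constraints. The sketch below does not use the Verbmobil solver, which the slides do not describe; it swaps in simple perceptron-style updates, and all lengths, confidence values and alternative sequences are invented.

```python
THREADS = ["STAT", "CASE", "DIAL", "SEMT"]


def features(seq):
    """Coefficients of alpha(thread) and beta(thread) in CONF(seq).

    Each element of seq is a tuple (thread, segment_length, confidence).
    """
    coeff = {("alpha", t): 0.0 for t in THREADS}
    coeff.update({("beta", t): 0.0 for t in THREADS})
    for thread, length, confidence in seq:
        coeff[("alpha", thread)] += length
        coeff[("beta", thread)] += length * confidence
    return coeff


def conf(seq, weights):
    """CONF(seq) for the given alpha and beta values."""
    coeff = features(seq)
    return sum(weights[key] * coeff[key] for key in coeff)


def learn(inequations, epochs=100, step=0.1, margin=1e-6):
    """Perceptron-style search for weights with CONF(best) > CONF(alternative)."""
    weights = {("alpha", t): 0.0 for t in THREADS}
    weights.update({("beta", t): 1.0 for t in THREADS})
    for _ in range(epochs):
        violated = False
        for best_seq, alt_seq in inequations:
            if conf(best_seq, weights) - conf(alt_seq, weights) <= margin:
                good, bad = features(best_seq), features(alt_seq)
                for key in weights:
                    weights[key] += step * (good[key] - bad[key])
                violated = True
        if not violated:
            break  # every inequation holds
    return weights


# Invented annotated turn: the manually chosen optimal sequence and two alternatives.
optimal = [("CASE", 3, 0.4), ("SEMT", 5, 0.5), ("STAT", 2, 0.6)]
all_dial = [("DIAL", 3, 0.9), ("DIAL", 5, 0.9), ("DIAL", 2, 0.9)]
all_semt = [("SEMT", 3, 0.3), ("SEMT", 5, 0.5), ("SEMT", 2, 0.2)]
weights = learn([(optimal, all_dial), (optimal, all_semt)])
```

With the learned weights the annotated optimal sequence scores higher than both alternatives, even though the dialog-act thread reported the highest raw confidence values.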

The Context-Sensitive Selection of the Best Translation
Using probabilities of dialog acts in the normalization process:
$$\text{CONF}(\text{Seq}) = \sum_{\text{segment} \in \text{Seq}} \text{Length}(\text{segment}) \cdot \bigl(\text{alpha}(\text{thread}) + \text{dialog-act}(\text{thread}, \text{segment}) + \text{beta}(\text{thread}) \cdot \text{confidence}(\text{thread}, \text{segment})\bigr)$$
e.g. Greet (Statistical_Translation, Segment) > Greet (Semantic_Transfer, Segment)
Suggest (Semantic_Transfer, Segment) > Suggest (Case_based_Translation, Segment)
Exploiting meta-knowledge:
If the semantic transfer generates  x disambiguation tasks then increase the alpha and beta values for semantic transfer.
e.g. einen Termin vorziehen → prefer / give priority to / bring forward <a date>
Observation: Even on the meta-control level (selection module) a hybrid approach is advantageous.

Verbmobil: Long-Term, Large-Scale Funding and Its Impact
- Funding by the German Ministry for Education and Research BMBF
  Phase I (1993-1996): $ 33 M
  Phase II (1997-2000): $ 28 M
- 60% Industrial funding according to shared cost model: $ 17 M
- Additional R&D investments of industrial partners: $ 11 M
Total: $ 89 M
- > 400 Publications (> 250 refereed)
- Many Patents
- > 10 Commercial Spin-off Products
- Many new Spin-off Companies
- > 100 New jobs in German Language Industry
- > 50 Academics transferred to Industry
Philips, DaimlerChrysler and Siemens are leaders in Spoken Dialog Applications

SmartKom: Intuitive Multimodal Interaction
Project Budget: $ 34 M
Project Duration: 4 years
The SmartKom Consortium: DFKI Saarbrücken (Main Contractor, Project Management, Testbed, Software Integration), DAIMLERCHRYSLER, Univ. of Munich, European Media Lab, MediaInterface, Univ. of Stuttgart, Univ. of Erlangen
Locations: Saarbrücken, Berkeley, Dresden, Heidelberg, Munich, Stuttgart, Ulm, Aachen

The Architecture of the SmartKom Agent (cf. Maybury/Wahlster 1998) Input Processing Media Interaction Management Media Analysis Analysis Language Media Fusion Graphics Discourse Modeling Gesture Biometrics Information Applications People Application Interface Media Design Intention Recognition Design Language User(s) User Modeling Graphics Gesture Animated Presentation Agent Presentation Design Output Rendering User Model Task Model Domain Model Media Models Discourse Model Representation and Inference
