This lecture note explores a system supporting human-computer interaction through various input/output modes like voice, pen, gesture, and facial expression. It discusses the advantages, applications, and architectures of multimodal interfaces in enhancing task performance and user preference.
Multimodal Dialog System
Intelligent Robot Lecture Note
Multimodal Dialog System
• A system that supports human-computer interaction over multiple different input and/or output modes.
  • Input: voice, pen, gesture, facial expression, etc.
  • Output: voice, graphical output, etc.
• Applications: GPS, information guide systems, smart home control, etc.
• Example multimodal input — speech: "여기에서 여기로 가는 제일 빠른 길 좀 알려 줘." ("Show me the fastest way from here to here."), accompanied by pen gestures marking the two places on the display.
Motivations: Is Speech the Ultimate Interface?
• + Natural interaction style (free speech), with a natural repair process for error recovery.
• + Richer channel: conveys the speaker's disposition and emotional state (if systems knew how to deal with that).
• − Inconsistent input (high error rates) that is hard to correct; e.g., we may get a different result each time we speak the same words.
• − Slow, sequential output style when using TTS (text-to-speech).
• How can these weak points be overcome? A multimodal interface!
Advantages of Multimodal Interfaces
• Task performance and user preference
• Migration of human-computer interaction away from the desktop
• Adaptation to the environment
• Error recovery and handling
• Special situations where mode choice helps
Task Performance and User Preference
• Task performance and user preference for multimodal over speech-only interfaces [Oviatt et al., 1997]:
  • 10% faster task completion
  • 23% fewer words (shorter and simpler linguistic constructions)
  • 36% fewer task errors
  • 35% fewer spoken disfluencies
  • 90-100% of users preferred to interact this way
• Speech-only dialog system — speech: "Bring the drink on the table to the side of the bed."
• Multimodal dialog system — speech: "Bring this to here" + pen gesture: easy, simplified user utterance!
Migration of Human-Computer Interaction away from the Desktop
• Small portable computing devices, such as PDAs, organizers, and smartphones:
  • Limited screen real estate for graphical output
  • Limited input: no keyboard/mouse (only arrow keys, thumbwheel)
  • Complex GUIs are not feasible
• Augment the limited GUI with natural modalities such as speech and pen:
  • Uses less screen space
  • Rapid navigation over the menu hierarchy
• Other devices (kiosks, car navigation systems, ...): no mouse or keyboard, so use speech + pen gesture.
Adaptation to the Environment
• Multimodal interfaces enable rapid adaptation to changes in the environment by allowing the user to switch modes, especially on mobile devices used in multiple environments.
• Environmental conditions can be either physical or social.
• Physical:
  • Noise: increases in ambient noise can degrade speech performance → switch to GUI / stylus pen input.
  • Brightness: bright light in outdoor environments can limit the usefulness of a graphical display.
• Social:
  • Speech may be easiest for a password, account number, etc., but in public places users may be uncomfortable being overheard → switch to GUI or keypad input.
Error Recovery and Handling
• Advantages for recovery and reduction of errors:
  • Users intuitively pick the mode that is less error-prone.
  • Language is often simplified.
  • Users intuitively switch modes after an error, so the same problem is not repeated.
  • Multimodal error correction.
• Cross-mode compensation (complementarity): a multimodal interface can potentially reduce the overall error rate by combining inputs from multiple modalities.
Special Situations Where Mode Choice Helps
• Users with disabilities
• People with a strong accent or a cold
• People with RSI (repetitive strain injury)
• Young children or non-literate users
• Other users who have problems handling the standard devices: mouse and keyboard
• A multimodal interface lets people choose their preferred interaction style depending on the actual task, the context, and their own preferences and abilities.
Multimodal Dialog System Architecture
• Architecture of QuickSet: a multi-agent architecture.
• [Architecture diagram: agents for Speech/TTS, Natural Language, Multimodal Integration, Sketch/Gesture, Map Interface, VR/AR interfaces (MAVEN, BARS), simulators, Java-enabled Web pages, COM objects, databases, and other user interfaces communicate through a Facilitator (routing, triggering, dispatching) using the Inter-agent Communication Language (ICL, Horn clauses), with bridges to other Facilitators, CORBA, and Web services (XML, SOAP, ...).]
Multimodal Language Processing
Multimodal Reference Resolution
• Need to resolve references (what the user is referring to) across modalities.
• A user may refer to an item on a display by using speech, by pointing, or both.
• Closely related to multimodal integration.
• Example — speech: "여기에서 여기로 가는 제일 빠른 길 좀 알려 줘." ("Show me the fastest way from here to here."), accompanied by pen gestures on the display.
Multimodal Reference Resolution
• Finds the most appropriate referents for referring expressions [Chai et al., 2004].
• Referring expression:
  • Refers to a specific entity or entities.
  • Given by the user's inputs (most likely in the speech input).
• Referent:
  • An entity to which the user refers.
  • A referent can be an object that is not specified by the current utterance.
• Example — speech: "여기에서 여기로 가는 가장 빠른 길 좀 알려줘" ("Show me the fastest way from here to here"); the two deictic expressions "여기" ("here") are aligned with pen gestures g1 and g2, whose referents are objects on the map such as Lotte Department Store (롯데 백화점) and Burger King (버거킹).
Multimodal Reference Resolution
• Hard case: multiple and complex gesture inputs, e.g., in an information guide system.
  • User: "이건 가격이 얼마지?" ("How much is this?") — selects one item.
  • System: "만 오천원 입니다." ("It is 15,000 won.")
  • User: "이거랑 이것들이랑 가격 좀 비교 해 줄래" ("Can you compare the prices of this and these?") — selects three items.
• [Timeline figure: the expressions "이거" ("this") and "이것들" ("these") must be aligned with gestures g1, g2, g3; the correct grouping is ambiguous.]
Multimodal Reference Resolution
• Uses linguistic theories to guide the reference resolution process [Chai et al., 2005]:
  • Conversational implicature
  • Givenness Hierarchy
• A greedy algorithm finds the best assignment for each referring expression given its cognitive status:
  • Calculates a match score between referring expressions and referent candidates (object selectivity, compatibility measurement, likelihood of status).
  • Finds the best assignments with a greedy algorithm (see the sketch below).
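A minimal sketch of the greedy matching idea (not the actual algorithm or scoring from Chai et al.): score each (referring expression, candidate) pair and greedily assign the best-scoring unused candidate. The field names, weights, and the temporal term are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class ReferringExpression:
    text: str        # e.g. "이거" ("this"), "이것들" ("these")
    time: float      # time stamp of the spoken word (seconds)
    sem_type: str    # expected semantic type, e.g. "item"

@dataclass
class Candidate:
    obj_id: str
    sem_type: str
    gesture_time: float  # time of the gesture that highlighted it

def match_score(expr: ReferringExpression, cand: Candidate) -> float:
    """Illustrative match score: semantic compatibility plus
    temporal proximity between the word and the gesture."""
    compatibility = 1.0 if expr.sem_type == cand.sem_type else 0.0
    temporal = 1.0 / (1.0 + abs(expr.time - cand.gesture_time))
    return 0.5 * compatibility + 0.5 * temporal

def resolve(expressions, candidates):
    """Greedy assignment: for each referring expression (in temporal
    order), pick the best still-unassigned candidate."""
    assignments, used = {}, set()
    for expr in sorted(expressions, key=lambda e: e.time):
        scored = [(match_score(expr, c), i) for i, c in enumerate(candidates)
                  if i not in used]
        if not scored:
            break
        _, best = max(scored)
        assignments[expr.text] = candidates[best].obj_id
        used.add(best)
    return assignments

# Example: "compare the price of this and these"
exprs = [ReferringExpression("이거", 1.2, "item"),
         ReferringExpression("이것들", 2.0, "item")]
cands = [Candidate("item_1", "item", 1.0),
         Candidate("item_group_1", "item", 1.9)]
print(resolve(exprs, cands))   # {'이거': 'item_1', '이것들': 'item_group_1'}
```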
Multimodal Integration / Fusion
• Combining information from multiple input modalities to understand the user's intention and attention (meaning + meaning → combined meaning).
• Multimodal reference resolution is a special case of multimodal integration:
  • Speech + pen gesture, where pen gestures can express only deictic or grouping meaning.
Multimodal Integration
• Issues:
  • Nature of the multimodal integration mechanism: algorithmic (procedural) vs. parser/grammar-based (declarative).
  • Does the approach treat one mode as primary? Is gesture a secondary, dependent mode? (multimodal reference resolution)
  • How temporal and spatial constraints are expressed.
  • A common meaning representation for speech and gesture.
• Two main approaches:
  • Unification-based multimodal parsing and understanding [Johnston, 1998]
  • Finite-state transducer for multimodal parsing and understanding [Johnston et al., 2000]
Unification-based multimodal parsing and understanding
• Parallel recognizers and "understanders".
• Time-stamped meaning fragments for each stream.
• Common framework for meaning representation: typed feature structures.
• Meaning fusion operation: unification.
  • Unification determines whether two pieces of partial information are consistent (e.g., whether a given gestural input is compatible with a given piece of spoken input), and if they are, combines them into a single result.
  • Semantic and spatiotemporal constraints; statistical ranking.
• Flexible asynchronous architecture; must handle both unimodal and multimodal input.
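A minimal sketch of unification over feature structures represented as nested Python dicts, using data shaped like the "draw a line" example discussed on a later slide. Typing, structure sharing, and the actual QuickSet representation are omitted.

```python
class UnificationFailure(Exception):
    pass

def unify(fs1, fs2):
    """Unify two feature structures represented as nested dicts.
    Atomic values must be equal; features missing from one structure
    are filled in from the other (partiality)."""
    if isinstance(fs1, dict) and isinstance(fs2, dict):
        result = dict(fs1)
        for feat, val in fs2.items():
            result[feat] = unify(result[feat], val) if feat in result else val
        return result
    if fs1 == fs2:
        return fs1
    raise UnificationFailure(f"{fs1!r} and {fs2!r} are incompatible")

# Speech meaning ("draw a line") unified with the meaning of a pen trace.
speech = {"type": "create_line",
          "object": {"type": "line", "color": "green", "label": "draw a line"}}
gesture = {"object": {"location": {"type": "line",
                                   "coordlist": [(12143, 12134), (12146, 12134)]}}}
print(unify(speech, gesture))   # combined command with the location filled in
```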
Unification-based multimodal parsing and understanding
• Temporal constraints [Oviatt et al., 1997]:
  • Speech and gesture overlap, or
  • Gesture precedes speech by at most 4 seconds;
  • Speech does not precede gesture.
• Given the sequence speech1; gesture; speech2, the possible grouping is speech1; (gesture; speech2).
• Finding [Oviatt et al., 2004, 2005]: users have a consistent temporal integration style, so the system can adapt to it.
• A minimal sketch of checking these constraints is shown below.
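A minimal sketch of how these temporal constraints might be checked when deciding whether a gesture and a speech fragment can be grouped; the function name and the example time stamps are illustrative.

```python
def may_integrate(speech_start: float, speech_end: float,
                  gesture_start: float, gesture_end: float,
                  max_gap: float = 4.0) -> bool:
    """Temporal constraints reported by Oviatt et al.: speech and gesture
    either overlap, or the gesture precedes the speech by at most
    `max_gap` seconds; speech never precedes the gesture."""
    overlap = speech_start <= gesture_end and gesture_start <= speech_end
    gesture_first = (gesture_end <= speech_start
                     and (speech_start - gesture_end) <= max_gap)
    return overlap or gesture_first

# A gesture ending 1.5 s before the speech fragment starts can be grouped with it.
print(may_integrate(speech_start=10.0, speech_end=11.2,
                    gesture_start=8.0, gesture_end=8.5))   # True
```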
Unification-based multimodal parsing and understanding
• Each unimodal input is represented as a feature structure [Holzapfel et al., 2004].
• A very common representation in computational linguistics (FUG, LFG, PATR), e.g., for lexical entries, grammar rules, etc.
• Example utterance: "please switch on the lamp".
• Predefined rules resolve the deictic references and integrate the multimodal inputs.
• [Figure: an example typed feature structure (type Type) with attributes Attr1: val1, Attr2: val2, and an embedded Type2 structure under Attr3 containing Attr4: val4.]
Unification-based multimodal parsing and understanding: example "Draw a line"
• From speech (one of many hypotheses): a Create_line command with Object [type: Line, Color: green, Label: "draw a line"] and an unfilled Location.
• From the pen gesture: a Location of type Line with Coordlist [(12143,12134), (12146,12134), ...], or alternatively a Point location (Xcoord: 15487, Ycoord: 19547).
• Unifying the two feature structures fills the command's Location with the line's coordinates; since Create_line requires a Line location (ISA hierarchy), the Point hypothesis is ruled out — cross-mode compensation.
Unification-based multimodal parsing and understanding
• Advantages of multimodal integration via typed feature structure unification:
  • Partiality
  • Structure sharing
  • Mutual compensation (cross-mode compensation)
  • Multimodal discourse
Unification-based multimodal parsing and understanding: Mutual Disambiguation (MD)
• Each input mode provides a set of scored recognition hypotheses.
• MD derives the best joint interpretation by unification of meaning-representation fragments.
• P_MM = α·P_S + β·P_G + C, where α, β, and C are learned over a multimodal corpus.
• MD stabilizes system performance in challenging environments.
• [Figure: n-best lists for gesture (g1–g4), object (o1–o3), and speech (s1–s3) hypotheses are combined into a ranked multimodal list (mm1–mm4).]
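A minimal sketch of mutual disambiguation over scored n-best lists using the combination score above. The toy unification test, the example meanings, and the weight values are illustrative assumptions (α, β, and C would be learned from a multimodal corpus).

```python
from itertools import product

def best_joint_interpretation(speech_nbest, gesture_nbest, unify,
                              alpha=0.6, beta=0.4, c=0.0):
    """Score every (speech, gesture) hypothesis pair that unifies and
    return the best joint interpretation: P_MM = alpha*P_S + beta*P_G + C."""
    best, best_score = None, float("-inf")
    for (p_s, s_mean), (p_g, g_mean) in product(speech_nbest, gesture_nbest):
        joint = unify(s_mean, g_mean)
        if joint is None:               # incompatible pair: skipped
            continue
        score = alpha * p_s + beta * p_g + c
        if score > best_score:
            best, best_score = joint, score
    return best, best_score

# Toy example: meanings are dicts; a pair "unifies" if shared keys agree.
def toy_unify(a, b):
    merged = {**a, **b}
    return merged if all(a[k] == b[k] for k in a.keys() & b.keys()) else None

speech_nbest = [(0.7, {"act": "create", "obj": "line"}),
                (0.3, {"act": "create", "obj": "point"})]
gesture_nbest = [(0.8, {"obj": "line", "coords": [(1, 2), (3, 4)]})]
print(best_joint_interpretation(speech_nbest, gesture_nbest, toy_unify))
```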
Finite-state Multimodal Understanding
• Modeled by a 3-tape finite-state device over:
  • the speech stream (words),
  • the gesture stream (gesture symbols), and
  • their combined meaning (meaning symbols).
• The device takes speech and gesture as inputs and produces the meaning output.
• Simulated by two transducers:
  • G:W, aligning gesture and speech;
  • G_W:M, which takes the composite alphabet of gesture and speech symbols as input and outputs meaning.
• The speech and gesture inputs are first composed with G:W; the resulting G_W is then composed with G_W:M. (A toy simulation of the 3-tape device follows.)
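A toy simulation of a 3-tape device as a transition table explored by depth-first search. The states, symbols, and meaning tags are invented for illustration; a real system would instead build the G:W and G_W:M transducers with an FST toolkit and compose them with the input lattices.

```python
EPS = "eps"

# Toy transitions: (state, word, gesture_symbol, meaning_symbol, next_state).
# "eps" means the transition consumes/emits nothing on that tape.
TRANSITIONS = [
    (0, "phone",       EPS,    "<cmd><type>phone</type>", 1),
    (1, "numbers",     EPS,    EPS,                       2),
    (2, "for",         EPS,    EPS,                       3),
    (3, "these",       EPS,    EPS,                       4),
    (4, "two",         EPS,    EPS,                       5),
    (5, "restaurants", "Gsel", "<obj><rest>",             6),
    (6, EPS,           "SEM",  "SEM",                     7),  # SEM carries the ids, e.g. r12,r15
    (7, EPS,           EPS,    "</rest></obj></cmd>",     8),
]
FINAL = {8}

def run(words, gestures):
    """Depth-first search over the 3-tape device: return the meaning
    symbols of the first accepting path, or None."""
    def dfs(state, i, j, meaning):
        if state in FINAL and i == len(words) and j == len(gestures):
            return meaning
        for (s, w, g, m, t) in TRANSITIONS:
            if s != state:
                continue
            if w != EPS and (i >= len(words) or words[i] != w):
                continue
            if g != EPS and (j >= len(gestures) or gestures[j] != g):
                continue
            out = dfs(t, i + (w != EPS), j + (g != EPS),
                      meaning + ([] if m == EPS else [m]))
            if out is not None:
                return out
        return None
    return dfs(0, 0, 0, [])

print(run("phone numbers for these two restaurants".split(), ["Gsel", "SEM"]))
```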
Finite-state Multimodal Understanding
• Representation of the speech input modality: a lattice of words (e.g., hypotheses such as "phone numbers for these two restaurants" and "show ten new american ...").
• Representation of the gesture input modality: the range of gesture recognitions is represented as a lattice of symbols (e.g., G area loc SEM(points...) and G sel 2 rest SEM(r12,r15)).
Finite-state Multimodal Understanding
• Representation of the combined meaning: also a lattice, whose paths are well-formed XML, e.g.:
  <cmd> <info> <type>phone</type> <obj><rest>r12,r15</rest></obj> </info> </cmd>
• [Figure: meaning lattice whose arcs emit the symbols <cmd>, <type>, phone, </type>, <obj>, <rest>, SEM(r12,r15), </rest>, </obj>, </cmd>.]
Finite-state Multimodal Understanding: Multimodal Grammar Formalism
• Multimodal context-free grammar (MCFG), e.g.:
  HEADPL → restaurants:rest:<rest> ε:SEM:SEM ε:ε:</rest>
• Terminals are multimodal tokens consisting of three components: speech stream : gesture stream : combined meaning (W:G:M).
• Example grammar for "put that there":
  S → ε:ε:<cmd> PUTV OBJNP LOCNP ε:ε:</cmd>
  PUTV → ε:ε:<act> put:ε:put ε:ε:</act>
  OBJNP → ε:ε:<obj> that:Gvehicle:ε ε:SEM:SEM ε:ε:</obj>
  LOCNP → ε:ε:<loc> there:Garea:ε ε:SEM:SEM ε:ε:</loc>
• [Figure: parse of speech "put that there" with gestures Gvehicle v1 and Garea a1, yielding the meaning <cmd><act>put</act><obj>v1</obj><loc>a1</loc></cmd>.]
Finite-state Multimodal Understanding: Multimodal Grammar Example
• Speech: "email this person and that organization"
• Gesture: Gp SEM Go SEM
• Meaning: email([ person(SEM) , org(SEM) ])
• Grammar fragment:
  S → V NP ε:ε:])
  NP → DET N | NP CONJ NP
  CONJ → and:ε:,
  V → email:ε:email([ | page:ε:page([
  DET → this:ε:ε | that:ε:ε
  N → person:Gp:person( ε:SEM:SEM ε:ε:)
  N → organization:Go:org( ε:SEM:SEM ε:ε:)
  N → department:Gd:dept( ε:SEM:SEM ε:ε:)
• [Figure: the corresponding finite-state transducer with numbered states.]
Finite-state Multimodal Understanding: Integration Processing
• [Figure: the speech lattice (e.g., "phone numbers for these two restaurants") and the gesture lattice (e.g., G sel 2 rest SEM(r12,r15)) are fed into the 3-tape multimodal finite-state device, which produces the meaning lattice (<cmd><type>phone ...).]
Finite-state Multimodal Understanding: An Example
• Speech lattice: "email this person and that organization" (with this/that alternatives).
• Gesture lattice: Gp SEM Go SEM.
• Meaning lattice: email([ person(SEM) , org(SEM) ]).
• [Figure: the multimodal grammar above applied to the speech and gesture lattices, producing the meaning lattice.]
Robustness in Multimodal Dialog
Robustness in Multimodal Dialog
• Gain robustness via:
  • Fusion of inputs from multiple modalities: using the strengths of one mode to compensate for the weaknesses of others, at design time and at run time.
  • Avoiding/correcting errors: statistical architecture, confirmation, dialogue context.
  • Simplification of language in a multimodal context.
  • Output affecting/channeling input.
• Example approaches:
  • Edit machines in FST-based multimodal integration and understanding
  • A salience-driven approach to robust input interpretation
  • An n-best re-ranking method for improving speech recognition performance
Edit Machine in FST-based MM Integration
• Problem of FST-based MM integration: a mismatch between the user's input and the language encoded in the grammar.
  • ASR: "show cheap restaurants thai places in in chelsea"
  • Grammar: "show cheap thai places in chelsea"
• How can the ASR output be parsed? Determine which in-grammar string it is most like.
  • Edits: "show cheap ε thai places in ε chelsea" ("restaurants" and one "in" are deleted)
• To find this, employ an edit machine!
Handcrafted Finite-state Edit Machines
• Edit-based multimodal understanding: basic edit.
  • Transform the ASR output so that it can be assigned a meaning by the FST-based multimodal understanding model.
  • Find the string with the least costly sequence of edits that can be assigned an interpretation by the grammar, i.e., search for the least-cost path through the composition of λs, an edit transducer, and λg:
    • λg: the language encoded in the multimodal grammar
    • λs: the string encoded in the lattice resulting from ASR
    • ◦: composition of transducers
  • A simple string-level illustration of this idea follows.
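A minimal string-level illustration of the basic-edit idea: pick the in-grammar string with the smallest word-level edit distance to the ASR hypothesis. The real system performs this search by transducer composition over lattices rather than by enumerating grammar strings; the grammar strings below are just examples.

```python
def word_edit_distance(hyp, ref):
    """Word-level Levenshtein distance (uniform insert/delete/substitute costs)."""
    m, n = len(hyp), len(ref)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    return d[m][n]

def closest_in_grammar(asr_hyp, grammar_strings):
    """Return the in-grammar string closest to the ASR hypothesis."""
    return min(grammar_strings,
               key=lambda g: word_edit_distance(asr_hyp, g.split()))

asr = "show cheap restaurants thai places in in chelsea".split()
grammar = ["show cheap thai places in chelsea",
           "show cheap italian places in chelsea"]
print(closest_in_grammar(asr, grammar))   # "show cheap thai places in chelsea"
```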
Handcrafted Finite-state Edit Machines
• Edit-based multimodal understanding: 4-edit.
  • The basic edit machine is quite large and adds an unacceptable amount of latency (5 s on average).
  • 4-edit limits the number of edit operations to at most 4.
Handcrafted Finite-state Edit Machines
• Edit-based multimodal understanding: smart edit = a 4-edit machine + heuristics + refinements.
  • Deletion of SLM-only words (words not found in the grammar): "thai restaurant listings in midtown" → "thai restaurant in midtown"
  • Deletion of doubled words: "subway to to the cloisters" → "subway to the cloisters"
  • Subdivided cost classes (3 classes for insertion cost icost and deletion cost dcost), illustrated in the sketch below:
    • High cost: slot fillers (e.g., chinese, cheap, downtown)
    • Low cost: dispensable words (e.g., please, would)
    • Medium cost: all other words
  • Auto-completion of place names: the algorithm enumerates all possible shortenings of place names (e.g., "Metropolitan Museum of Art" → "Metropolitan Museum").
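An illustrative sketch of the subdivided cost classes; the word lists and cost values are invented, not taken from the smart-edit machine.

```python
SLOT_FILLERS = {"chinese", "thai", "cheap", "downtown", "chelsea"}
DISPENSABLE = {"please", "would", "like", "listings"}

def deletion_cost(word: str) -> float:
    """Class-based deletion cost used when searching for the cheapest edit:
    slot fillers are expensive to delete, dispensable words are cheap,
    everything else is in between (values are illustrative)."""
    w = word.lower()
    if w in SLOT_FILLERS:
        return 3.0
    if w in DISPENSABLE:
        return 0.5
    return 1.0

print([(w, deletion_cost(w)) for w in "thai restaurant listings in midtown".split()])
```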
Learning Edit Patterns
• The user's input is considered a "noisy" version of the parsable ("clean") input.
  • Noisy (S): "show cheap restaurants thai places in in chelsea"
  • Clean (T): "show cheap ε thai places in ε chelsea"
• Translate the user's input into a string that can be assigned a meaning representation by the grammar.
Learning Edit Patterns
• Noisy-channel model for error correction.
• Translation probability:
  • Sg: a string that can be assigned a meaning representation by the grammar
  • Su: the user's input utterance
• A Markov (trigram) assumption is applied, where Su = Su1 Su2 ... Sun and Sg = Sg1 Sg2 ... Sgm.
• Word alignments (Sui, Sgi) are obtained with GIZA++.
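The equations referred to on this slide did not survive extraction. A hedged reconstruction of the standard noisy-channel formulation, consistent with the symbol definitions above (the exact factorization used in the original work may differ):

\[
\hat{S}_g \;=\; \operatorname*{arg\,max}_{S_g} P(S_g \mid S_u) \;=\; \operatorname*{arg\,max}_{S_g} P(S_u \mid S_g)\, P(S_g)
\]

With GIZA++ alignments pairing each \(S_{u_i}\) with an \(S_{g_i}\) (inserting \(\varepsilon\) tokens where needed), a trigram Markov assumption over the aligned pairs gives

\[
P(S_u, S_g) \;\approx\; \prod_{i} P\big(\langle S_{u_i}, S_{g_i}\rangle \,\big|\, \langle S_{u_{i-1}}, S_{g_{i-1}}\rangle, \langle S_{u_{i-2}}, S_{g_{i-2}}\rangle\big)
\]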
Learning Edit Patterns
• Deriving a translation corpus:
  • The finite-state transducer can generate the in-grammar strings for a given meaning.
  • For each (string, meaning) pair in the corpus, the multimodal grammar generates the strings for that meaning and the closest one is selected as the target string.
• The translation model is then trained on these (corpus string, target string) pairs.
Experiments and Results
• 16 first-time users (8 male, 8 female).
• 833 user interactions (218 multimodal / 491 speech-only / 124 pen-only).
• Tasks: finding restaurants of various types and getting their names, phone numbers, and addresses; getting subway directions between locations.
• Average ASR sentence accuracy: 49%; average ASR word accuracy: 73.4%.
Experiments and Results
• Improvements in concept accuracy.
• [Tables: results of 6-fold and 10-fold cross-validation.]
A Salience Driven Approach
• Modify the language model score and rescore the recognized hypotheses using information from the gesture input.
• Primed language model: W* = argmax_W P(O|W) P(W).
• A rescoring sketch follows.
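A minimal rescoring sketch: each ASR hypothesis's score is interpolated with a bonus for words predicted by the currently salient objects. The interpolation weight, the log-bonus scheme, and the salience values are illustrative assumptions, not the primed language model from the paper.

```python
import math

def rescore(nbest, salience, lam=0.3):
    """Pick the ASR hypothesis maximizing a mix of its original
    (acoustic + LM) log score and a salience-primed bonus."""
    def primed_bonus(words):
        return sum(math.log(1.0 + salience.get(w, 0.0)) for w in words)
    return max(nbest,
               key=lambda h: (1 - lam) * h["score"] + lam * primed_bonus(h["words"]))

# Salience distribution induced by a deictic gesture on a cup.
salience = {"cup": 0.8, "move": 0.4, "table": 0.1}
nbest = [{"words": ["move", "this", "cap", "to", "here"], "score": -12.0},
         {"words": ["move", "this", "cup", "to", "here"], "score": -12.1}]
print(rescore(nbest, salience)["words"])   # picks the "cup" hypothesis
```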
A Salience Driven Approach
• "People do not make any unnecessary deictic gestures."
  • From the cognitive theory of conversational implicature: speakers tend to make their contribution as informative as is required, and not more informative than is required.
• "Speech and gesture tend to complement each other."
  • When a speech utterance is accompanied by a deictic gesture, the speech input issues commands or inquiries about properties of objects, while the deictic gesture indicates the objects of interest.
• Gesture is an early indicator that anticipates the content of the subsequent spoken utterance: 85% of the time, gestures occurred before the corresponding speech unit.
A Salience Driven Approach
• A deictic gesture can activate several objects on the graphical display; it signals a distribution over the objects that are salient.
• [Figure: on a timeline, a gesture precedes the speech "Move this to here"; the gesture assigns salience weights to objects on the graphical display, with "a cup" as the most salient object.]
A Salience Driven Approach
• The salient object "a cup" is mapped to the physical world representation to indicate a salient part of that representation, such as relevant properties of, or tasks related to, the salient objects.
• This salient part of the physical world is likely to be the potential content of the speech (e.g., the gesture on the cup precedes "Move this to here").
A Salience Driven Approach: Physical World Representation
• Domain model: relevant knowledge about the domain.
  • Domain objects, properties of objects, relations between objects, and task models related to objects.
  • Frame-based representation: a frame is a domain object; frame elements are the attributes and tasks related to that object.
• Domain grammar: specifies the grammar and vocabulary used to process language inputs.
  • Semantics-based context-free grammar: non-terminals are semantic tags; terminals are words (values of the semantic tags).
  • Annotated user spoken utterances supply the relevant semantic information and n-grams.
• An illustrative frame sketch follows.
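An illustrative sketch of one frame-based domain-model entry together with a small semantics-based grammar fragment; all attribute, task, and vocabulary values are invented for illustration.

```python
# Frame for one domain object; frame elements are its attributes and related tasks.
CUP_FRAME = {
    "frame": "cup",
    "attributes": {
        "color": "white",
        "location": "table",
        "contains": "coffee",
    },
    "tasks": ["move", "fill", "bring"],      # tasks the object can take part in
    "grammar": {                             # semantics-based CFG fragment:
        "OBJECT": ["cup", "mug", "this"],    #   non-terminal = semantic tag,
        "ACTION": ["move", "bring"],         #   terminals = words for that tag
    },
}
```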
Salience Modeling
• Calculate a salience distribution over the entities in the physical world.
• The salience value of an entity at time tn is influenced by the joint effect of the sequence of gestures that happened before tn.
Salience Modeling
• Salience of entity ek at time tn:
  salience(ek, tn) = [ Σ over gestures g before tn of α_g · P(ek|g) ] / [ Σ over all entities ej of Σ over gestures g before tn of α_g · P(ej|g) ]
  • The numerator sums P(ek|g) over all gestures before time tn, each weighted by α_g.
  • The normalizing factor is the sum of the (unnormalized) salience values of all entities at time tn.
  • The weight α_g is larger for gestures closer to tn, so closer gestures have a higher impact on the salience distribution.
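A minimal sketch of computing the salience distribution described above; the exponential decay is an assumption standing in for the α weighting on the slide, and the entity probabilities are illustrative.

```python
def salience_distribution(gestures, t_now, decay=0.5):
    """Salience over entities at time t_now: each earlier gesture g
    contributes P(e|g) for the entities it may refer to, weighted so
    that more recent gestures count more; the result is normalized
    over all entities."""
    raw = {}
    for g in gestures:
        if g["time"] >= t_now:
            continue
        weight = decay ** (t_now - g["time"])   # closer gestures weigh more
        for entity, p in g["p_entity"].items():
            raw[entity] = raw.get(entity, 0.0) + weight * p
    z = sum(raw.values()) or 1.0                # normalizing factor
    return {e: v / z for e, v in raw.items()}

gestures = [{"time": 1.0, "p_entity": {"cup": 0.7, "plate": 0.3}},
            {"time": 2.5, "p_entity": {"cup": 0.9, "plate": 0.1}}]
print(salience_distribution(gestures, t_now=3.0))   # the cup dominates
```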
Salience Driven Spoken Language Understanding
• Maps the salience distribution onto the physical world representation.
• Uses the resulting salient world to influence spoken language understanding: it primes the language models to facilitate understanding.
• The speech recognizer's hypotheses are rescored using the primed language model score.