This work explores the design of multimodal interfaces that recognize and integrate multiple human input modes, such as speech, gestures, and pen input, to enhance user interaction across diverse contexts. It discusses the advantages of multimodal systems, including flexibility, improved error handling, and adaptability to users, tasks, and environments. Two studies provide insights into mutual disambiguation of input signals within a hybrid architecture, comparing performance for accented versus native speakers and for stationary versus mobile system use. Results demonstrate the increased robustness and effectiveness of multimodal systems over unimodal speech processing.
Designing Robust Multimodal Systems for Diverse Users and Mobile Environments
Sharon Oviatt (oviatt@cse.ogi.edu; http://www.cse.ogi.edu/CHCC/)
Introduction to Perceptive Multimodal Interfaces
• Multimodal interfaces recognize combined natural human input modes (e.g., speech & pen, speech & lip movements)
• Radical departure from GUIs in basic features, interface design & architectural underpinnings
• Rapid development of bimodal systems in the 1990s
• New fusion & language processing techniques
• Diversification of mode combinations & applications
• More general & robust hybrid architectures
Advantages of Multimodal Interfaces
• Flexibility & expressive power
• Support for users' preferred interaction style
• Accommodate more users, tasks & environments
• Improved error handling & robustness
• Support for new forms of computing, including mobile & pervasive interfaces
• Permit multifunctional & tailored mobile interfaces, adapted to user, task & environment
The Challenge of Robustness: Unimodal Speech Technology's Achilles' Heel
• Recognition errors currently limit commercialization of speech technology, especially for:
  • Spontaneous interactive speech
  • Diverse speakers & speaking styles (e.g., accented)
  • Speech in natural field environments (e.g., mobile)
• A 20-50% drop in accuracy is typical under real-world usage conditions
Improved Error Handling in Flexible Multimodal Interfaces
• Users can avoid errors through mode selection
• Users' multimodal language is simplified, which reduces NLP complexity & avoids errors
• Users switch modes after system errors, which undercuts error spirals & facilitates recovery
• Multimodal architectures potentially can support "mutual disambiguation" of input signals
Example of Mutual Disambiguation: QuickSet Interface during Multimodal “PAN” Command
Processing & Architecture
• Speech & gestures processed in parallel
• Statistically ranked unification of semantic interpretations (see the fusion sketch below)
• Multi-agent architecture coordinates signal recognition, language processing & multimodal integration
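To make "statistically ranked unification" concrete, here is a minimal illustrative sketch, not QuickSet's actual implementation: each recognizer produces an n-best list of scored hypotheses, and fusion selects the highest joint-score pair whose semantic types unify. This is also how mutual disambiguation arises, since a correct but lower-ranked speech hypothesis can be pulled up by the gesture signal. The data structures, scores, labels, and compatibility rule below are assumptions for illustration.

```python
# Illustrative sketch of statistically ranked unification of n-best speech and
# gesture interpretations. Hypothesis fields, scores, and the unification rule
# are assumptions, not QuickSet's actual code.

from dataclasses import dataclass
from itertools import product

@dataclass
class Hypothesis:
    label: str        # semantic interpretation, e.g. a command or referent
    frame_type: str   # semantic type used to test unification compatibility
    score: float      # recognizer posterior / confidence in [0, 1]

def unify(speech: Hypothesis, gesture: Hypothesis) -> bool:
    """Hypotheses unify when their semantic types are compatible (simplified rule)."""
    return speech.frame_type == gesture.frame_type

def fuse(speech_nbest, gesture_nbest):
    """Return the best-scoring speech/gesture pair whose interpretations unify."""
    candidates = [(s, g, s.score * g.score)
                  for s, g in product(speech_nbest, gesture_nbest)
                  if unify(s, g)]
    return max(candidates, key=lambda c: c[2], default=None)

# Hypothetical n-best lists: the top-ranked speech hypothesis is a misrecognition,
# but the gesture evidence "pulls up" the second-ranked one during fusion
# (mutual disambiguation).
speech_nbest = [
    Hypothesis("pan map",  "point_action", 0.55),   # misrecognition
    Hypothesis("pan lake", "area_action",  0.40),   # intended command
]
gesture_nbest = [
    Hypothesis("lasso around lake", "area_action",  0.70),
    Hypothesis("tap on road",       "point_action", 0.20),
]

best = fuse(speech_nbest, gesture_nbest)
if best:
    s, g, joint = best
    print(f"Fused command: '{s.label}' + '{g.label}' (joint score {joint:.2f})")
```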
General Research Questions
• To what extent can a multimodal system support mutual disambiguation of input signals?
• How much is robustness improved in a multimodal system, compared with a unimodal one?
• In what usage contexts and for what user groups is robustness most enhanced by a multimodal system?
• What are the asymmetries between modes in disambiguation likelihoods?
Study 1- Research Method
• QuickSet testing with map-based tasks (community fire & flood management)
• 16 users: 8 native speakers & 8 accented speakers (varied Asian, European & African accents)
• Research design: completely-crossed factorial with between-subjects factors: (1) speaker status (accented, native); (2) gender
• Corpus of 2,000 multimodal commands processed by QuickSet
Videotape: multimodal system processing for accented and mobile users
Study 1- Results
• 1 in 8 multimodal commands succeeded due to mutual disambiguation (MD) of input signals
• MD levels significantly higher for accented speakers than native ones: 15% vs 8.5% of utterances
• Ratio of speech pull-ups to total signal pull-ups differed between groups: .65 accented vs .35 native
• Results replicated across signal-level & parse-level MD
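For readers unfamiliar with these metrics, the following sketch shows one plausible way the MD rate and the speech pull-up ratio could be computed from a log of fused commands. The record fields and the exact scoring rule are assumptions for illustration; they are not taken from the study's actual analysis scripts.

```python
# Illustrative computation of the mutual-disambiguation (MD) rate and the
# speech pull-up ratio from a hypothetical command log.

from dataclasses import dataclass

@dataclass
class CommandRecord:
    final_correct: bool      # was the fused multimodal interpretation correct?
    speech_pulled_up: bool   # correct speech hypothesis not ranked first, but won after fusion
    gesture_pulled_up: bool  # same, for the gesture signal

def md_rate(log):
    """Fraction of commands that succeeded only because fusion pulled up a lower-ranked signal."""
    disambiguated = [r for r in log
                     if r.final_correct and (r.speech_pulled_up or r.gesture_pulled_up)]
    return len(disambiguated) / len(log)

def speech_pullup_ratio(log):
    """Share of all signal pull-ups that were speech pull-ups (vs gesture pull-ups)."""
    speech = sum(r.speech_pulled_up for r in log)
    gesture = sum(r.gesture_pulled_up for r in log)
    return speech / (speech + gesture)

# Tiny hypothetical log: 1 of 8 commands succeeds via mutual disambiguation.
log = [CommandRecord(True, False, False)] * 7 + [CommandRecord(True, True, False)]
print(f"MD rate: {md_rate(log):.1%}, speech pull-up ratio: {speech_pullup_ratio(log):.2f}")
```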
Table 1—Mutual Disambiguation Rates for Native versus Accented Speakers
Table 2- Recognition Rate Differentials between Native and Accented Speakers for Speech, Gesture and Multimodal Commands
Study 1- Results (cont.)
• Compared to traditional speech processing, spoken language processed within a multimodal architecture yielded a 41.3% reduction in total speech error rate
• No gender or practice effects were found in MD rates
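The 41.3% figure is a relative reduction in error rate. A small sketch of the arithmetic follows; the baseline and multimodal error rates used here are hypothetical, chosen only to show how such a percentage is computed, since the underlying rates are not given in this summary.

```python
# Illustrative arithmetic for a relative error-rate reduction.

def relative_error_reduction(unimodal_err, multimodal_err):
    """Relative reduction in speech error rate achieved by multimodal processing."""
    return (unimodal_err - multimodal_err) / unimodal_err

# Hypothetical error rates (25.2% unimodal vs 14.8% multimodal) give ~41.3%.
print(f"{relative_error_reduction(0.252, 0.148):.1%}")
```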
Study 2- Research Method
• QuickSet testing with the same 100 map-based tasks
• Main study:
  • 16 users with a high-end mic (close-talking, noise-canceling)
  • Research design: completely-crossed factorial: (1) usage context, stationary vs mobile (within subjects); (2) gender
• Replication:
  • 6 users with a low-end mic (built-in, no noise cancellation)
  • Compared stationary vs mobile use
Study 2- Research Analyses
• Corpus of 2,600 multimodal commands
• Signal amplitude, background noise & signal-to-noise ratio (SNR) estimated for each command (see the sketch below)
• Mutual disambiguation & multimodal system recognition rates analyzed in relation to dynamic signal data
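As a rough illustration of how per-command amplitude, background noise, and SNR might be estimated, here is a minimal sketch using RMS levels in dB over a noise-only segment versus a command segment. This measurement approach, the sample rate, and the synthetic audio are assumptions for illustration, not the instrumentation actually used in the study.

```python
# Illustrative estimation of signal amplitude, background noise, and SNR (in dB)
# from audio samples, under the assumptions stated in the text above.

import numpy as np

def rms_db(samples: np.ndarray, eps: float = 1e-12) -> float:
    """Root-mean-square level of an audio segment, in dB relative to full scale."""
    return 20.0 * np.log10(np.sqrt(np.mean(samples ** 2)) + eps)

def estimate_snr(command_audio: np.ndarray, noise_audio: np.ndarray) -> float:
    """SNR in dB: speech-segment level minus background-noise level."""
    return rms_db(command_audio) - rms_db(noise_audio)

# Hypothetical example: a louder speech segment against quieter background noise.
rng = np.random.default_rng(0)
noise = 0.01 * rng.standard_normal(16000)   # 1 s of background noise at 16 kHz
speech = 0.1 * rng.standard_normal(16000)   # 1 s of (stand-in) command audio
print(f"SNR ~ {estimate_snr(speech, noise):.1f} dB")
```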
Mobile user with hand-held system & close-talking headset in a moderately noisy environment (40-60 dB noise)
Mobile research infrastructure, with user instrumentation and researcher field station
Study 2- Results
• 1 in 7 multimodal commands succeeded due to mutual disambiguation of input signals
• MD levels significantly higher during mobile than stationary system use: 16% vs 9.5% of utterances
• Results replicated across signal-level and parse-level MD
Table 3- Mutual Disambiguation Rates during Stationary and Mobile System Use
Table 4- Recognition Rate Differentials during Stationary and Mobile System Use for Speech, Gesture and Multimodal Commands
Study 2- Results (cont.)
• Compared to traditional speech processing, spoken language processed within a multimodal architecture yielded a 19-35% reduction in total speech error rate (for the noise-canceling & built-in mics, respectively)
• No gender effects were found in MD
Conclusions
• Multimodal architectures can support mutual disambiguation & improved robustness over unimodal processing
• Error rate reduction can be substantial: 20-40%
• Multimodal systems can reduce or close the recognition rate gap for challenging users (accented speakers) & usage contexts (mobile)
• Error-prone recognition technologies can be stabilized within a multimodal architecture, enabling them to function more reliably in real-world contexts
Future Directions & Challenges
• Intelligently adaptive processing, tailored for mobile usage patterns & diverse users
• Improved language & dialogue processing techniques, and hybrid multimodal architectures
• Novel mobile & pervasive multimodal concepts
• Break the robustness barrier: reduce error rates further
For more information: http://www.cse.ogi.edu/CHCC/