Speech recognition, understanding and conversational interfaces Alexander Rudnicky School of Computer Science http://www.cs.cmu.edu/~air
Outline • Speech • Types of speech interfaces • Speech systems and their structure • Designing speech interfaces • Some applications • SpeechWear • Communicator
Speech as a signal • The difference between speech and sound • “CD” quality vs. intelligible quality • high quality: 44.1 / 48 kHz sampling • desirable speech bandwidth: 0-8 kHz, 16 bits/sample • sampled at 16 kHz with 16 bits/sample: 256 kbps (tethered mic) • telephone: 64 kbps (and lower) • Compression: • MPEG: 64 kbps/channel and up (but not speech-optimal) • CELP: 16 kbps … 2.4 kbps (optimized for speech)
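The bit rates on this slide follow from sample rate times bits per sample. A minimal sketch of that arithmetic, assuming uncompressed PCM, a 16 kHz sample rate for the 0-8 kHz band (twice the highest frequency), and 8 kHz, 8-bit samples for the telephone figure; the function name `bit_rate_kbps` is illustrative, not from the slides:

```python
def bit_rate_kbps(sample_rate_hz: int, bits_per_sample: int, channels: int = 1) -> float:
    """Uncompressed PCM bit rate in kilobits per second (1 kbps = 1000 bit/s)."""
    return sample_rate_hz * bits_per_sample * channels / 1000


# Assumption: a 0-8 kHz speech band is sampled at 16 kHz.
wideband = bit_rate_kbps(16_000, 16)       # the "tethered mic" figure
# Assumption: telephone speech is 8 kHz sampling with 8-bit samples.
telephone = bit_rate_kbps(8_000, 8)
# CD audio for comparison: 44.1 kHz, 16 bits, two channels.
cd_stereo = bit_rate_kbps(44_100, 16, 2)

# How much a speech coder must shrink the wideband signal to reach 16 kbps.
celp_ratio = wideband / 16

print(wideband, telephone, cd_stereo, celp_ratio)
```

Under these assumptions the wideband and telephone cases reproduce the 256 kbps and 64 kbps figures, and reaching the 16 kbps end of the CELP range means a 16:1 reduction from the wideband signal.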
Speech for communication • The difference between speech and language • Speech recognition and speech understanding
Computers and speech • Transcription • dictation, information retrieval • Command and control • data entry, device control, navigation • Information access • airline schedules, stock quotes • Problem solving • travel planning, logistics
Speech system architecture • SIGNAL PROCESSING • DECODING • UNDERSTANDING • DISCOURSE • ACTION
A generic speech system • [Diagram: speech input; Signal processing, Decoder, Parser, Post parser, Dialog manager, Domain agents (three shown), Language Generator, Speech synthesizer; outputs: speech, display, effector]
Decoding speech • Signal processing • Reduce dimensionality of signal • noise conditioning • Decoder • Transcribe speech to words • Acoustic models and Language models: corpus-based statistical models
Creating models for recognition • Speech data → Transcribe* → Train → Acoustic models • Text data → Train → Language models
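As one illustration of training a language model from text data, here is a minimal sketch of a bigram model estimated by relative frequency. This is an assumed, simplified stand-in: the slides do not say which model type or smoothing is used, and the three-sentence corpus and the name `train_bigram_model` are invented for the example.

```python
from collections import Counter, defaultdict


def train_bigram_model(sentences):
    """Estimate P(word | previous word) by relative frequency (no smoothing)."""
    context_counts = Counter()
    bigram_counts = defaultdict(Counter)
    for sentence in sentences:
        # <s> and </s> mark the sentence start and end.
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, word in zip(words, words[1:]):
            context_counts[prev] += 1
            bigram_counts[prev][word] += 1
    return {
        prev: {w: c / context_counts[prev] for w, c in following.items()}
        for prev, following in bigram_counts.items()
    }


# Toy text data (hypothetical), standing in for a real training corpus.
corpus = [
    "show flights to boston",
    "show flights to denver",
    "show fares to boston",
]
model = train_bigram_model(corpus)
print(model["to"])
```

A decoder can use such probabilities to prefer word sequences that resemble the training text. A practical model would also need smoothing so that unseen word pairs do not get zero probability.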
Understanding speech • Parser • Extract semantic content from utterance • Post parser • Introduce context and world knowledge into interpretation • [Diagram labels: Grammar; Ontology design, language acquisition; Context; Domain Agents; Grounding, knowledge engineering]
Interacting with the user • Dialog manager • Guide interaction through task • Map user inputs and system state into actions • Domain agent • Interact with back-end(s) • Interpret information using domain knowledge • [Diagram labels: Task schemas; Task analysis; Context; Domain agents (three shown); Database; Live data (e.g. Web); Domain expert; Knowledge engineering]
Communicating with the user • Language Generator • Decide what to say to user (and how to phrase it) • Speech synthesizer • Display Generator • Action Generator
Speech recognition and understanding • Sphinx system • speaker-independent • continuous speech • large vocabulary • ATIS system • air travel information retrieval • context management • film clip
Command and control systems • Small vocabularies, fixed syntax • OPEN WINDOW <window_id> • MOVE OBJECT <object_id> to <position> • Applications: • data entry (e.g., zip codes), process control (e.g., electron microscope, darkroom equipment) • Large vocabulary, fixed syntax • Web browsing (?)
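A small-vocabulary, fixed-syntax command language can be matched with one pattern per command. The sketch below covers the two commands named on the slide. It assumes the recognizer delivers a plain word string and that identifiers and positions are single words; `COMMAND_PATTERNS` and `parse_command` are illustrative names.

```python
import re

# Hypothetical patterns mirroring the two commands on the slide.
COMMAND_PATTERNS = {
    "open_window": re.compile(r"^open window (?P<window_id>\w+)$"),
    "move_object": re.compile(r"^move object (?P<object_id>\w+) to (?P<position>\w+)$"),
}


def parse_command(utterance):
    """Return (command, slots) for a word string, or None if it is out of grammar."""
    # Normalize case and collapse runs of whitespace before matching.
    text = " ".join(utterance.lower().split())
    for name, pattern in COMMAND_PATTERNS.items():
        match = pattern.match(text)
        if match:
            return name, match.groupdict()
    return None


print(parse_command("OPEN WINDOW editor"))
print(parse_command("move object box to left"))
print(parse_command("please open the window"))
```

Anything outside the fixed syntax is rejected rather than guessed at. This keeps recognition simple, but users have to learn the exact command forms.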
SpeechWear • Vehicle inspection task • USMC mechanics, fixed inspection form • Wearable computer (COTS components) • html-based task representation • film clip
Information access • Moderate to very large vocabulary • IVR and frame based systems • Commercial systems: • Nuance: http://www.nuance.com/demo/index.html • SpeechWorks: http://www.speechworks.com/demos/demos.htm • lots of others..
IVR and frame-based systems • Interactive voice response (IVR) • interactions specified by a graph (typically a tree) • Frame systems • ergodic graphs • states defined by multi-item forms
Graph-based systems • “Welcome to Bank ABC! Please say one of the following: Balance, Hours, Loan, ...” • “What type of loan are you interested in? Please say one of the following: Mortgage, Car, Personal, ...” • ...
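The bank menu can be written as a graph whose nodes carry prompts and whose edges are labeled with the words the caller may say. A minimal sketch, assuming a tree-shaped call flow that contains only the menu items the slide names (the elided items are left out) and re-prompts on any unexpected word; `CALL_FLOW`, `step`, and `run` are illustrative names:

```python
# Hypothetical call-flow graph: each node has a prompt and keyword-labeled edges.
CALL_FLOW = {
    "welcome": {
        "prompt": "Welcome to Bank ABC! Please say one of the following: Balance, Hours, Loan.",
        "next": {"balance": "balance", "hours": "hours", "loan": "loan"},
    },
    "loan": {
        "prompt": "What type of loan are you interested in? "
                  "Please say one of the following: Mortgage, Car, Personal.",
        "next": {"mortgage": "mortgage", "car": "car", "personal": "personal"},
    },
}


def step(state, recognized_word):
    """Follow the edge labeled by the recognized word; otherwise stay put (re-prompt)."""
    node = CALL_FLOW.get(state, {"next": {}})  # undefined nodes act as leaves
    return node["next"].get(recognized_word.lower(), state)


def run(words, state="welcome"):
    """Feed a sequence of recognized words through the graph and return the final state."""
    for word in words:
        state = step(state, word)
    return state


print(run(["loan", "car"]))
print(run(["pizza"]))
```

The designer fixes every path in advance, which is why this style suits well-structured, shallow tasks.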
Frame-based systems • Frame: Destination_City: Boston; Departure_Date: ______; Departure_Time: ______; Preferred_Airline: ______; ... • User: I would like to fly to Boston • User: I’d like to go to Boston on Friday, … • System: When would you like to fly?
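A frame system fills whichever slots the user mentions, in any order, and prompts for the first slot that is still empty. A minimal sketch using the slot names from the slide. The keyword extractors, the city list, and every prompt except “When would you like to fly?” are assumptions made for illustration; a real system would use a parser, not regular expressions.

```python
import re

SLOTS = ["Destination_City", "Departure_Date", "Departure_Time", "Preferred_Airline"]

# Hypothetical toy extractors; only two slots are covered here.
EXTRACTORS = {
    "Destination_City": re.compile(r"\bto (boston|denver|pittsburgh)\b"),
    "Departure_Date": re.compile(
        r"\bon (monday|tuesday|wednesday|thursday|friday|saturday|sunday)\b"
    ),
}

# Only the Departure_Date prompt appears on the slide; the rest are invented.
PROMPTS = {
    "Destination_City": "Where would you like to fly?",
    "Departure_Date": "When would you like to fly?",
    "Departure_Time": "What time would you like to leave?",
    "Preferred_Airline": "Which airline do you prefer?",
}


def update_frame(frame, utterance):
    """Fill every slot whose pattern matches the utterance, in any order."""
    text = utterance.lower()
    for slot, pattern in EXTRACTORS.items():
        match = pattern.search(text)
        if match:
            frame[slot] = match.group(1).title()
    return frame


def next_prompt(frame):
    """Ask for the first slot that is still empty; None means the form is complete."""
    for slot in SLOTS:
        if frame.get(slot) is None:
            return PROMPTS[slot]
    return None


frame = update_frame({}, "I would like to fly to Boston")
print(frame, next_prompt(frame))
```

Because one utterance can fill several slots at once (“to Boston on Friday”), the user is not forced through a fixed question order as in a graph-based system.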
Frame-based systems • [Diagram: five multi-item forms with placeholder slot names, linked by arcs, with the label “Transition on keyword or phrase”]
Some problems • IVR systems work great, but only for well-structured (& “shallow”) tasks • Frame systems are good for “tasks” that correspond to a single form leading to an action • Neither approach does well with more complex problem-solving activities
Dialog Systems • Problem solving activity; complex task • Order of progression through the task depends on user goals (which can change) and system state (a back-end retrieval), and is not predictable • Track progress and help the task along • mixed-initiative dialog • Discourse phenomena • Users expect to “converse” with the system
Carnegie Mellon Communicator • A dialog system that supports complex problem solving in a travel planning domain • create an itinerary using air schedule, hotel and car information • 186 U.S. airports (>140k enplanements/yr) • currently: >500 world airports • Web-based data resources • Live and cached flight information • Airport, airline, etc. information
Value schema/handlers • [Diagram labels: receptors; transform; value; Domain Agent]
Compound schema • [Diagram labels: Value_1, Value_2, Value_3; transform; value (e.g. SQL query); Domain Agent]
Schema ordering • [Diagram labels: Flight Leg (Destination airport, Date, Time); Database lookup; Available flights; Schema i / Value i; Schema j / Value j; Schema k / Value k; transform; Value]
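The flight-leg example suggests a compound schema that waits until its member values (destination airport, date, time) are present and then transforms them into a database lookup. A minimal sketch of that idea, not the Communicator implementation: the table layout, column names, flight numbers, and the functions `leg_query` and `demo` are all invented for illustration, with an in-memory SQLite database standing in for the back-end.

```python
import sqlite3

LEG_SLOTS = ("destination_airport", "date", "time")


def leg_query(values):
    """Transform a complete set of slot values into a parameterized SQL query."""
    missing = [slot for slot in LEG_SLOTS if values.get(slot) is None]
    if missing:
        return None  # the compound schema fires only once all member values exist
    sql = (
        "SELECT flight_no FROM flights "
        "WHERE destination = ? AND date = ? AND depart_time >= ? "
        "ORDER BY depart_time"
    )
    return sql, (values["destination_airport"], values["date"], values["time"])


def demo():
    """Run the lookup against a toy in-memory table (all rows are made up)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE flights (flight_no TEXT, destination TEXT, date TEXT, depart_time TEXT)"
    )
    db.executemany(
        "INSERT INTO flights VALUES (?, ?, ?, ?)",
        [
            ("XX100", "BOS", "2000-05-05", "08:00"),
            ("XX200", "BOS", "2000-05-05", "17:30"),
            ("XX300", "DEN", "2000-05-05", "09:15"),
        ],
    )
    sql, params = leg_query(
        {"destination_airport": "BOS", "date": "2000-05-05", "time": "12:00"}
    )
    rows = db.execute(sql, params).fetchall()
    db.close()
    return [row[0] for row in rows]


print(demo())
```

Returning `None` while values are missing leaves it to the dialog manager to decide which value to ask for next. The parameterized query keeps user-supplied values out of the SQL text.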
Carnegie Mellon Communicator • CMU Communicator • Call: 268-5144 • the information is accurate; you can use it for your own travel planning...
User-aware speech interfaces • Predictable behavior on the system’s part • Users communicate at different levels • http://www.speech.cs.cmu.edu/air/papers/InterfaceChars.html
User-aware speech interfaces • Content: task-centric utterances • Possibility: What can I do? • Orientation: Where are we? • Navigation: moving through the task space • Control: verbose/terse, listen! • Customization: define this word
Speech interface guidelines • Speech recognition is errorful • System state is often opaque to the user • http://www.speech.cs.cmu.edu/air/papers/SpInGuidelines/SpInGuidelines.html
Interface guidelines • State transparency • Input control • Error recovery • Error detection • Error correction • Log performance • Application integration
Summary • Speech and language communication • Dialog structure • Interface design