This course offers a comprehensive overview of voice technologies, including speech synthesis (TTS), automatic speech recognition (ASR), and dialog system architectures. We will delve into the historical evolution of these technologies, their relation to linguistic theory, and the importance of voice on the Web. Through practical lab assignments, students will learn to design and develop spoken dialog systems and evaluate call flows. This course will equip you with essential skills for a career in the voice technology industry and prepare you for further academic pursuits.
Outline • A short history of the field • Speech synthesis (TTS) • Automatic speech recognition (ASR) • Dialog system architectures • Voice on the Web (perhaps show the Siri video) • Voice on the Web and W3C Standards • Relation to linguistic theory • A brief look at the course plan • What this course is not about • What this course could mean to you • Introduction to lab assignments and platforms • Designing and developing spoken dialog systems • Present project • Give home assignment 1: Call flow design and evaluation • Present lab assignment 1
A short history of the field • 1966, Joseph Weizenbaum, ELIZA • Sundial, ATIS, Verbmobil • AIML NLP system • VoiceXML (VXML), a dialog markup language (primarily for telephony), initially developed by AT&T, then administered by an industry consortium, and finally a W3C specification • Voxeo, Tropo
Speech synthesis: text → speech
Speech recognition: speech → text (or some semantic representation)
Dialog management • Finite-state-based dialog management • Frame-based (form-based) dialog management • Information-state-based dialog management • Plan-based dialog management
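To make frame-based (form-based) dialog management concrete, here is a minimal Python sketch. It assumes a hypothetical travel-booking frame with the slots `origin`, `destination` and `date`, and it assumes some understanding component has already turned the user's words into a slot/value dictionary. The manager asks for the first empty slot but accepts any slots the user volunteers, which is what separates it from a strictly finite-state design in which each state expects exactly one answer.

```python
from typing import Dict, Optional

# Hypothetical travel-booking frame: slot name -> system prompt (illustration only).
PROMPTS: Dict[str, str] = {
    "origin": "Where are you leaving from?",
    "destination": "Where are you going?",
    "date": "On what date do you want to travel?",
}


class FrameDialogManager:
    """Minimal frame-based (form-filling) dialog manager."""

    def __init__(self, prompts: Dict[str, str]) -> None:
        self.prompts = prompts
        # Every slot starts empty; dicts keep insertion order, which fixes the prompt order.
        self.frame: Dict[str, Optional[str]] = {slot: None for slot in prompts}

    def update(self, parsed: Dict[str, str]) -> None:
        # Store every known slot the user supplied, even ones that were not asked for;
        # unknown keys are ignored.
        for slot, value in parsed.items():
            if slot in self.frame:
                self.frame[slot] = value

    def next_prompt(self) -> Optional[str]:
        # Ask for the first slot that is still empty; None means the frame is complete.
        for slot, value in self.frame.items():
            if value is None:
                return self.prompts[slot]
        return None


if __name__ == "__main__":
    dm = FrameDialogManager(PROMPTS)
    print(dm.next_prompt())  # asks for the origin
    dm.update({"origin": "Gothenburg", "date": "Friday"})  # the user over-answers
    print(dm.next_prompt())  # asks only for the destination
```

Because the user supplied both origin and date in one turn, the manager skips straight to the destination. This slot-driven behaviour is roughly the idea behind VoiceXML forms, though the sketch does not reproduce VoiceXML's actual form interpretation algorithm.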
Why voice? • Wireless devices have small screens and limited input capabilities. • A telephone keypad can give users only a limited number of choices. • Speech technology is improving. • The exchange of information between a person and a computer is becoming more like a real conversation. • Users want hands-free or eyes-free use. • From a business viewpoint, voice applications open up a host of new revenue opportunities. • There are many more telephones than computers, and all of them could potentially be used to access the Internet.
Applications • Information-providing systems: • weather reports • stock quotes • timetables • Transaction-based systems: • calendar functions • shopping • financial transactions • travel reservations
Components • Natural language understanding • Proper name identification • part-of-speech tagging • parser • dialog manager • output generator • natural language generator • gesture generator • layout engine • input recognizer/decoder • automatic speech recognizer • gesture recognizer • handwriting recognizer • output renderer • text-to-speech engine • talking head • robot or avatar • multi-modal fusion
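One way to see how the main components fit together is a single-turn pipeline: recognizer, understanding, dialog manager, output generator, renderer. The sketch below is an illustration only. Every stage (`fake_asr`, `keyword_nlu`, `rule_dm`, `template_nlg`, `fake_tts`) is a hypothetical toy stand-in, the replies are canned text rather than real data, and "audio" is simply UTF-8 bytes so that the example runs without any speech engine. Multi-modal fusion, gesture and handwriting input, and avatars are left out.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Signatures for the component types listed on the slide (toy versions).
Recognizer = Callable[[bytes], str]              # input recognizer/decoder, e.g. ASR
Understander = Callable[[str], Dict[str, str]]   # natural language understanding
Manager = Callable[[Dict[str, str]], str]        # dialog manager: meaning -> dialog act
Generator = Callable[[str], str]                 # output generator: dialog act -> text
Renderer = Callable[[str], bytes]                # output renderer, e.g. TTS


@dataclass
class Pipeline:
    recognize: Recognizer
    understand: Understander
    manage: Manager
    generate: Generator
    render: Renderer

    def turn(self, audio: bytes) -> bytes:
        # One user turn passes through the components in order.
        text = self.recognize(audio)
        meaning = self.understand(text)
        act = self.manage(meaning)
        reply = self.generate(act)
        return self.render(reply)


# Hypothetical toy stand-ins; "audio" is just UTF-8 encoded text here.
def fake_asr(audio: bytes) -> str:
    return audio.decode("utf-8")


def keyword_nlu(text: str) -> Dict[str, str]:
    return {"intent": "weather"} if "weather" in text.lower() else {"intent": "unknown"}


def rule_dm(meaning: Dict[str, str]) -> str:
    return "inform_weather" if meaning.get("intent") == "weather" else "ask_repeat"


# Canned replies, not real weather data.
TEMPLATES = {
    "inform_weather": "It will be sunny tomorrow.",
    "ask_repeat": "Sorry, could you repeat that?",
}


def template_nlg(act: str) -> str:
    return TEMPLATES[act]


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


if __name__ == "__main__":
    system = Pipeline(fake_asr, keyword_nlu, rule_dm, template_nlg, fake_tts)
    print(system.turn(b"What's the weather like?").decode("utf-8"))
```

Keeping each component behind a plain function signature means a real ASR or TTS engine could replace a stub without touching the dialog manager.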
Types of systems • by modality • text-based • spoken dialog system • graphical user interface • multi-modal • by device • telephone-based systems • PDA systems • in-car systems • robot systems • desktop/laptop systems • native • in-browser systems • in-virtual machine • in-virtual environment • robots • by style • command-based • menu-driven • natural language • speech graffiti • by initiative • system initiative • user initiative • mixed initiative • by application • information service • command-and-control • entertainment • education/tutorial • edutainment • reminder systems • companion systems • healthcare • eldercare • assistive/access systems
Mobile voice apps • Voice on the Web • http://www.youtube.com/watch?v=OURZpqh-35A&eurl=&feature=player_embedded
Relation to other fields • Phonetics • Phonology • Syntax • Semantics • Pragmatics • spoken language understanding • psycholinguistics • human communication • discourse analysis • human-computer interaction • computational linguistics • NL-parsing • NL-generation • language modeling • multi-modal fusion • multi-modal fission • psychology • cognitive science • affective dialog • user modeling • embodied communication
What this course is not about • Sophisticated dialog management • Multi-modal systems • Non-spoken dialog systems
What this course could mean to you • Will prepare you for writing a thesis in the area of dialog systems (if you so choose) • Will prepare you for work in the industry • A link to the LinkedIn page
Roles in the process • Dialog designer • VoiceXML programmer • Voice talent • Grammar writer • TTS specialist • Speech recognition specialist • Quality assurance specialist • Server specialist • Manager
Who are the big players in the area? • Google • http://googleblog.blogspot.com/2010/12/can-we-talk-better-speech-technology.html • Microsoft • http://gigaom.com/2010/12/06/microsoft-claims-its-place-in-a-voice-enabled-world/ • Apple • http://www.dailyfinance.com/story/company-news/apples-siri-purchase-heats-up-the-race-toward-a-voice-activated/19458344/ • IBM • http://www.ibm.com/news/in/en/2010/08/20/a896686u56875f96.html • Nuance • http://gigaom.com/2011/01/19/nuance-releases-mobile-sdk-to-speechify-apps/ • Voxeo • AT&T
The Emergence of Speech as a Mobile Platform • Market Trends • Speech-Enabled Mobile Apps Gaining Acceptance • Voice Control in a Mission-Critical Environment • Search Engine for Audio-Visual Content • Instantaneous Language Translation • IBM's Spoken Web • What's Driving Speech as a Mobile Platform? • Mobile Devices and Peripherals • Cloud Computing • Open Technologies • Mashups and the Programmable Web • Legislation • Closing the (Mobile) Digital Divide • An Overview of Emerging SaaP Applications • Current Speech-Equipped Devices Are Merely the Tip of the Iceberg • SaaP Enables New Application Interaction • Spoken Alerts • Mobile Reminders • Synthesized Speech • Email and Text Messages • Speech-to-Text for Voicemail • SaaP Enables Voice User Interfaces • Speech Recognition: The Foundation of Speech-Enabled Apps • Constrained vs. Natural Language Processing • Automated vs. Hybrid Speech Recognition • Applications for Speech Recognition • Speaker Authentication • Email and Text Message Composition • Launch and Control Mobile Apps • Special Case: Voice Activation
W3C Speech Standards Torbjörn Lager
The big picture: HTML, web browser, web servers
The place of speech technology • "… speech technology itself has a very long way to go. … the most important thing may turn out to be not the speech technology itself, but the way in which speech technology connects to all the other technologies." (Tim Berners-Lee)
The big picture: web servers deliver HTML to an HTML browser and VoiceXML to a VoiceXML browser (ASR, TTS)
The What and Why of Standards • Software standards include terminology, languages and protocols specified by committees of experts for widespread use in the software industry. Software standards have both advantages and disadvantages. • Advantages: • developers can create applications using the standard languages that are portable across a variety of platforms; • products from different vendors are able to interact with each other; • a community of experts evolves around the standard and is available to develop products and services based on the standard. • Disadvantages: • some developers feel that standards may inhibit creativity and stall the introduction of superior technology. • However, in the area of speech, vendors are enthusiastic about standards and frequently complain that standards are not developed fast enough. • Emerging speech-technology standards could give a boost to an industry hampered by proprietary software and hardware.
World Wide Web Consortium http://www.w3.org/
W3C Speech Standards • Speech Recognition Grammar Specification (SRGS): what the user can say • Semantic Interpretation for Speech Recognition (SISR): what the user means • Speech Synthesis Markup Language (SSML): what the user hears • Pronunciation Lexicon Specification (PLS): how words are pronounced
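As an illustration of SRGS ("what the user can say"), here is a small grammar in the XML form of the specification, read with Python's standard library. The grammar content and the helper names `root_alternatives` and `matches` are assumptions made for this example. The sketch only handles a root rule that is a flat `<one-of>` of literal `<item>`s and ignores rule references, repeats, weights and SISR `<tag>`s. A real recognizer uses the grammar to constrain what it listens for; the string matching here only shows which phrases the grammar admits.

```python
import xml.etree.ElementTree as ET
from typing import List

SRGS_NS = "{http://www.w3.org/2001/06/grammar}"

# Hypothetical SRGS grammar (XML form): the caller may say one of three cities.
CITY_GRAMMAR = """<grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0"
         xml:lang="en-US" root="city" mode="voice">
  <rule id="city" scope="public">
    <one-of>
      <item>Stockholm</item>
      <item>Gothenburg</item>
      <item>New York</item>
    </one-of>
  </rule>
</grammar>"""


def root_alternatives(grammar_xml: str) -> List[str]:
    """Return the literal <item> texts of the grammar's root rule."""
    grammar = ET.fromstring(grammar_xml)
    root_id = grammar.get("root")
    for rule in grammar.iter(SRGS_NS + "rule"):
        if rule.get("id") == root_id:
            # Collapse internal whitespace so "New  York" and "New York" compare equal.
            return [" ".join((item.text or "").split()) for item in rule.iter(SRGS_NS + "item")]
    raise ValueError(f"root rule {root_id!r} not found")


def matches(grammar_xml: str, utterance: str) -> bool:
    """Case- and whitespace-insensitive check that an utterance is one of the alternatives."""
    said = " ".join(utterance.lower().split())
    return said in {alt.lower() for alt in root_alternatives(grammar_xml)}


if __name__ == "__main__":
    print(matches(CITY_GRAMMAR, "new york"), matches(CITY_GRAMMAR, "Paris"))
```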
Intro to XML • Standard for storage and transportation of data • Maintained by W3C (w3.org/TR/REC-xml) • Elements and tags • Well-formedness • Validity • DTD • Editor (TextMate + XMLmate)
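Well-formedness (correct nesting, matching case-sensitive tags, a single root element) can be checked with Python's standard library alone, as in the sketch below; `is_well_formed` is a hypothetical helper name. Validity against a DTD is a separate check that the standard-library parser does not perform. It needs a validating parser, such as lxml.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if text parses as a single well-formed XML document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    samples = [
        "<prompt>Welcome <break/> to the demo</prompt>",  # well-formed
        "<prompt>Welcome</Prompt>",                         # tag names are case-sensitive
        "<a><b></a></b>",                                   # improper nesting
        "<a/><b/>",                                         # more than one root element
    ]
    for sample in samples:
        print(is_well_formed(sample), sample)
```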
Speech synthesis: text + language → speech, in a given voice (persona)
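The inputs on this slide map onto SSML: the language goes in `xml:lang` on `<speak>`, and a voice can be requested with `<voice>`. The `build_ssml` helper below is a hypothetical sketch that uses only the standard library. The `gender` attribute is just one of SSML's voice-selection hints, and which installed voice a given engine actually picks for it varies between synthesizers.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"  # serialized as xml:lang

# Serialize SSML elements in the default namespace instead of with an "ns0:" prefix.
ET.register_namespace("", SSML_NS)


def build_ssml(text: str, lang: str = "en-US", gender: str = "female") -> str:
    """Wrap text in an SSML <speak> document with a language and a voice hint."""
    speak = ET.Element(f"{{{SSML_NS}}}speak", {"version": "1.0", XML_LANG: lang})
    voice = ET.SubElement(speak, f"{{{SSML_NS}}}voice", {"gender": gender})
    voice.text = text  # ElementTree escapes &, < and > on output
    return ET.tostring(speak, encoding="unicode")


if __name__ == "__main__":
    print(build_ssml("Welcome to the course.", lang="en-GB"))
```

Building the markup with an XML library rather than string concatenation means characters such as `&` in the text cannot break well-formedness.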
A peek inside the black box • http://www.explainthatstuff.com/how-speech-synthesis-works.html