1 / 38

Speech Technologies and VoiceXML

Speech Technologies and VoiceXML. try Department of Computer Science National Cheng-Chi University. Reference. [1]Bob Edgar(2001), “ The VoiceXML Handbook ” ,NY:CMP Books. [2]Dave Raggett(2001), ” Getting started with VoiceXML 2.0 ” ,W3C.

Télécharger la présentation

Speech Technologies and VoiceXML

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Speech Technologies and VoiceXML try Department of Computer Science National Cheng-Chi University

  2. Reference • [1]Bob Edgar(2001),“The VoiceXML Handbook” ,NY:CMP Books. • [2]Dave Raggett(2001),”Getting started with VoiceXML 2.0”,W3C. • [3]Sun Microsystems(1998),”Java Speech Grammar Format Specification v1.0”,Sun Microsystems. • [4]Chetan Sharma and Jeff Kunins(2002),”VoiceXML:Strategies and Techniques for Effective Voice Application Development with VoiceXML 2.0”,Wiley. • [5]Brian Eberman,Jerry Carter,Darren Meyer,David Goddeau(2002),”Building VoiceXML Browsers with OpenVXI”, NY:ACM Press.

  3. Reference • [6]Microsoft (2002),“Speech Technology Overview ” , http://www.microsoft.com/speech/evaluation/techover/ • [7] VoiceGenie Technologies Inc.(2001),”White Paper:Speaking Freely About The VoiceGenie VoiceXML Gateway and the VoiceXML Interpreter”,VoiceGenie Technologies Inc. • [8]W3C(2002),”VoiceXML Specification v2.0”,W3C. • [9]Chun-Feng,Liao(2002),”Basics of Speech Recognition”,NCCU Computer Center.

  4. Presentation Agenda • Voice technologies Backgrounds • ASR/TTS • Voice browsing with VoiceXML • VoiceXML architecture • Implementations of VoiceXML Platform • VoiceXML document structure • Bringing Voice Technologies into Virtual Environment

  5. Voice Technologies • In the mid- to late 1990s, personal computers started to become powerful enough to support ASR • The two key underlying technologies behind these advances are speech recognition (SR) and text-to-speech synthesis (TTS).

  6. Classification of Voice Application • Basic interactive voice response (IVR) • Computer: “For stock quotes, press 1. For trading, press 2. …” • Human: (presses DTMF “1”) • Basic speech ASR • C: “Say the stock name for a price quote.” • H: “Lucent Technologies”

  7. Classification of Voice Application • Advanced speech ASR • C: “Stock Services, how may I help you?” • H: “Uh, what’s Lucent trading at?” • “Near-natural language” ASR • C: “How may I help you?” • H: “Um, yeah, I’d like to get the current price of Lucent Technologies” • C: “Lucent is up two at sixty eight and a half.” • H: “OK. I want to buy one hundred shares at market price.” • C: “…”

  8. Speech Recognition • Capturing speech (analog) signals • Digitizing the sound waves, converting them to basic language units or phonemes, • Constructing words from phonemes, and contextually analyzing the words to ensure correct spelling for words that sound alike (such as write and right).

  9. Speech Recognition Process Flow Source:Microsoft Speech.NET Home(http://www.microsoft.com/speech/ )

  10. Speech Recognition Process Flow • Step 1:User Input • The system catches user’s voice in the form of analog acoustic signal . • Step 2:Digitization • Digitize the analog acoustic signal. • Step 3:Phonetic Breakdown • Breaking signals into phonemes.

  11. Speech Recognition Process Flow • Step 4:Statistical Modeling • Mapping phonemes to their phonetic representation using statistics model (ex:HMM) • Step 5:Matching • According to grammar , phonetic representation and Dictionary , the system returns an n-best list (I.e.:a word plus a confidence score • Grammar-the union words or phrases to constraint the range of input or output in the voice application. • Dictionary-the mapping table of phonetic representation and word(EX:thu,theethe)

  12. Speech Synthesis • Speech Synthesis, or text-to-speech, is the process of converting text into spoken language. • Breaking down the words into phonemes; • Analyzing for special handling of text such as numbers, currency amounts. • Generating the digital audio for playback.

  13. Speech Synthesis Source:Microsoft Speech.NET Home(http://www.microsoft.com/speech/ )

  14. Pervasive Computing Model • E-business has changed from client-server model to web-centric model • Once connect to the Internet,one can get any information he want. But people wants more convenient way to connect to Internet. • Lou Gerstner,CEO of IBM:Pervasive Computing Model is billion people interacting with million e-business with trillion devices interconnected.

  15. Voice Browsing • VoiceXML instead of HTML • A voice browser instead of an ordinary web browser • Phone instead of PC.

  16. Show : An Scenario of Using VoiceXML

  17. VoiceXML Overview • A language for specifying voice dialogs. • Voice dialogs use audio prompts and text-to-speech (TTS) for output; touch-tone keys (DTMF) and automatic speech recognition (ASR) for input. • Main input/output device (initially) is the phone. • Leverages the Internet for application development and delivery. • Standard language enables portability.(unifies dialog control languages)

  18. History of VoiceXML Source:VoiceXML forum(http://www.voicexml.org)

  19. Making use of mature Internet Technologies • Leverage existing web application development tools. • Leverage existing web infrastructure for application delivery. • Clean separation of service logic from user interaction.

  20. VoiceXML Platform Architecture

  21. VoiceXML Platform Architecture-1 • Telephone and Telephone network-Connects caller’s telephone with Telephony Server • VoiceXML Gateway • Voice Browser • Audio input-Speech Recognition (ASR), Touchtone (DTMF), Audio recording. • Audio output-Audio playback, Speech Synthesis (TTS) • Interface, Call Controls

  22. VoiceXML Platform Architecture-2 • VoiceXML Documents • Dialog and flow control • Client-side scripting (ECMAScript) • Speech Recognition grammar • Speech Synthesis pronunciation control • Document servers(web server) • Feeding Static VoiceXML documents or audio files. • Application servers • Generate VoiceXML documents dynamically. • Server-side application logic • Connect to Database, or database interface

  23. Voice Gateway

  24. VoiceXML Gateway(detail)

  25. Implementations of VoiceXML Gateways • In Taiwan: • Yes Mobile • Chunghwa Telecom Laboratories • eWings Technologies, Inc • Free • IBM VoiceServerSDK • Open Source • CMU:OpenVXI

  26. [DEMO]How to Write and Run VoiceXML Applications?

  27. [DEMO]Generate VoiceXML Document Dynamically-using ASP.NET

  28. VoiceXML Document Structure.

  29. A Simple VoiceXML Document

  30. [DEMO]VoiceXML /HTML Comparison

  31. Bringing Voice Technologies to 3D Virtual Environment

  32. Related Research • Raymond L.Smith,III and Stephen D.Roberts: • Using voice input command to operate simulation-animation. • The efficiency issues of ASR/TTS are taken into account. • Satoru,Osamu,Katunobu,Takashi,Tomoyoshi,Hideki,Shotaro,Takio and Katsuhiko: • Create 3D virtual user who can speak with user via speaker and microphone. • Virtual User have the ability to learn words and recognize human face.

  33. We can do more.. • Speak to many users who are “moving” in virtual environment. • System are built in distributed environment.(I.e. web) • Make use of XML technology (VoiceXML/SALT).

  34. Problems to Solve • Voice /Animation synchronization. • Protocol integration. • ASR/TTS integration and its performance issues. • Virtual user autonomy. • The “Voice propagation range” issues.

  35. System Design Prototype

  36. Summary • Speech is the most natural way for human to communicate thus it will become an important way in HCI. • VoiceXML has revolutionized speech recognition & telephony application development & deployment. • Adding Speech facilities into 3D virtual environment will make UI more friendly and enable multi-modal input/output. • My research interest on this topic will focus on voice-animation synchronization and enable SR/TTS in distributed 3D virtual environment .

  37. Q & A

More Related