
Speech Recognition

Speech Recognition. Yonglei Tao. Voice-Activated GPS. Voice User Interface (VUI). A VUI allows human interaction with computers through a voice/speech platform Basic components System messages Grammars Dialog logic Benefits Loosen some physical constraints such as screen size


Presentation Transcript


  1. Speech Recognition Yonglei Tao

  2. Voice-Activated GPS

  3. Voice User Interface (VUI)
     • A VUI allows human interaction with computers through a voice/speech platform
     • Basic components
        • System messages
        • Grammars
        • Dialog logic
     • Benefits
        • Loosens some physical constraints, such as screen size
        • Provides tools for universal design
           • Disability and situational impairments
        • Intuitive and efficient

  4. System Architecture

  5. Components
     • Endpointing
        • Speech to endpointed utterance
     • Feature extraction
        • Endpointed utterance to feature vectors
     • Recognition
        • Feature vectors to word string(s)
     • Natural language understanding
        • Word string(s) to meaning(s)
     • Dialog management
        • Meaning to actions
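The first stage above, turning raw speech into an endpointed utterance, can be sketched with a simple short-time energy threshold. This is only an illustration under stated assumptions: the slides do not say how any particular engine performs endpointing, and the class name `EnergyEndpointer`, the frame size, and the threshold value are all hypothetical.

```java
import java.util.Arrays;

// Illustrative energy-threshold endpointer; not the method of any specific engine.
final class EnergyEndpointer {

    /**
     * Returns {firstFrame, lastFrameExclusive} covering every frame whose mean
     * squared amplitude exceeds the threshold, or null if no frame does.
     */
    static int[] endpoints(double[] samples, int frameSize, double threshold) {
        int frames = samples.length / frameSize; // a trailing partial frame is ignored
        int first = -1;
        int last = -1;
        for (int f = 0; f < frames; f++) {
            double energy = 0.0;
            for (int i = f * frameSize; i < (f + 1) * frameSize; i++) {
                energy += samples[i] * samples[i];
            }
            energy /= frameSize; // mean squared amplitude of this frame
            if (energy > threshold) {
                if (first < 0) {
                    first = f;
                }
                last = f;
            }
        }
        return first < 0 ? null : new int[] { first, last + 1 };
    }

    /** Cuts the leading and trailing low-energy frames off the signal. */
    static double[] trim(double[] samples, int frameSize, double threshold) {
        int[] e = endpoints(samples, frameSize, threshold);
        if (e == null) {
            return new double[0];
        }
        return Arrays.copyOfRange(samples, e[0] * frameSize, e[1] * frameSize);
    }

    public static void main(String[] args) {
        double[] signal = {
            0, 0, 0, 0,             // silence
            0.5, -0.5, 0.5, -0.5,   // speech
            0.5, 0.5, 0.5, 0.5,     // speech
            0, 0, 0, 0              // silence
        };
        System.out.println(Arrays.toString(endpoints(signal, 4, 0.01)));
    }
}
```

The trimmed samples would then be passed on to feature extraction, the next stage in the list.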

  6. Typical Recognition Components

  7. Examples
     • Book, boot
     • Write, right
     • Flew, flu, flue
     • Eight books
     • Ate books
     • I scream
     • Ice cream

  8. Components
     • Acoustic models
        • Internal representation of each basic sound
     • Dictionary
        • A list of words and pronunciations
     • Grammar
        • Defines all possible strings of words the recognizer can handle
        • Allows a meaning to be associated with those strings
        • Either rule-based or statistical (created by computing the probability of words occurring in a given context)

  9. Recognition
     • Recognition search
        • A recognizer searches the recognition model to find the best-matching word string
     • Confidence measures
        • A quantitative measure of how confident the recognizer is in the best-matching string
        • VUI developers can use those measures in several ways
     • N-best processing
        • A recognizer returns several results with a confidence measure for each
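One way a VUI developer might use confidence measures together with an N-best list is sketched below: pick the highest-confidence hypothesis, then accept it, ask the user to confirm it, or reject it. The class `NBestDemo`, the `Hypothesis` type, the three-way decision, and the threshold values are hypothetical; they are not taken from any particular recognizer API.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative use of an N-best list with confidence scores.
final class NBestDemo {

    /** One N-best entry: a candidate word string and its confidence in [0, 1]. */
    static final class Hypothesis {
        final String text;
        final double confidence;

        Hypothesis(String text, double confidence) {
            this.text = text;
            this.confidence = confidence;
        }
    }

    enum Decision { ACCEPT, CONFIRM, REJECT }

    /** Returns the hypothesis with the highest confidence, or null for an empty list. */
    static Hypothesis best(List<Hypothesis> nBest) {
        Hypothesis top = null;
        for (Hypothesis h : nBest) {
            if (top == null || h.confidence > top.confidence) {
                top = h;
            }
        }
        return top;
    }

    /** Accept at or above acceptAt, reject below rejectBelow, otherwise ask the user to confirm. */
    static Decision decide(Hypothesis h, double acceptAt, double rejectBelow) {
        if (h == null || h.confidence < rejectBelow) {
            return Decision.REJECT;
        }
        if (h.confidence >= acceptAt) {
            return Decision.ACCEPT;
        }
        return Decision.CONFIRM;
    }

    public static void main(String[] args) {
        List<Hypothesis> nBest = Arrays.asList(
                new Hypothesis("ate books", 0.41),
                new Hypothesis("eight books", 0.86));
        Hypothesis top = best(nBest);
        System.out.println(top.text + " -> " + decide(top, 0.80, 0.40));
    }
}
```

The middle band between the two thresholds is a design choice: it trades an extra confirmation prompt for fewer wrong actions.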

  10. Speech Recognition Engines
     • Microsoft Visual Studio & CMU Sphinx
        • Grammar
     • Android
        • Language model: free form for dictation, or web search for short phrases
     • Google Web Speech API for Web Applications

  11. BNF (Backus-Naur Form)
     • Notation for context-free grammars
        • Often used to describe the syntax of programming languages
        • Also used to specify the words and patterns of words that a speech recognizer listens for
     • EBNF (Extended Backus-Naur Form)
     • ABNF (Augmented Backus-Naur Form)
     • Basis for speech grammar specifications
        • ABNF for .NET
        • Regular grammar for Java

  12. Basics
     • ::= means "is defined as"
     • | means "or"
     • < > enclose a category name
     • Terminal: a basic component
     • <X> ::= a b c (a sequence)
     • <Y> ::= a | b | c (a choice among alternatives)
     • <Z> ::= a | a <Z> (one or more)

  13. Example • Grammar for a speech recognition calculator

  14. Visual Studio Speech Recognizer

  15. Speech Recognition with Visual Studio
     • Examples
        • http://www.phon.ucl.ac.uk/courses/spsci/compmeth/speech/recognition.html
        • http://blogs.msdn.com/b/devschool/archive/2012/02/06/speech-recognition-using-visual-studio-determining-the-bna.aspx
     • Grammar Class
        • http://msdn.microsoft.com/en-us/library/system.speech.recognition.grammar.aspx
     • GrammarBuilder Class
        • http://msdn.microsoft.com/en-us/library/system.speech.recognition.grammarbuilder.aspx

  16. Speech Recognition for Java
     • Sphinx 4
        • A speech recognition engine written entirely in Java
        • Created by CMU, Sun, Mitsubishi, HP, …
        • Open source
        • Compliant with JSpeech Grammar Format
        • Platform- and vendor-independent
     • Programmer’s guide: http://cmusphinx.sourceforge.net/sphinx4/
     • An example: https://www.assembla.com/code/sonido/subversion/nodes/4/sphinx4/src/apps/edu/cmu/sphinx/demo/helloworld

  17. A Sample Grammar

      #JSGF V1.0;
      public <workProgram> = <ask> <action> <program>;
      <ask> = please | could you;
      <action> = start | open | stop | close | kill | shut down;
      <program> = word | excel | out look | note pad;
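To make concrete which utterances this grammar covers, the sketch below enumerates every phrase that `<workProgram>` can produce and checks an utterance against that set. It is a hand-written stand-in for illustration only, not how Sphinx 4 or any JSGF engine evaluates a grammar; the class name `WorkProgramGrammar` and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Brute-force expansion of the sample <workProgram> rule; illustration only.
final class WorkProgramGrammar {
    static final List<String> ASK = Arrays.asList("please", "could you");
    static final List<String> ACTION =
            Arrays.asList("start", "open", "stop", "close", "kill", "shut down");
    static final List<String> PROGRAM =
            Arrays.asList("word", "excel", "out look", "note pad");

    /** Every phrase of the form <ask> <action> <program>. */
    static List<String> phrases() {
        List<String> all = new ArrayList<>();
        for (String ask : ASK) {
            for (String action : ACTION) {
                for (String program : PROGRAM) {
                    all.add(ask + " " + action + " " + program);
                }
            }
        }
        return all;
    }

    /** True if the utterance, ignoring case and extra whitespace, is in the grammar. */
    static boolean accepts(String utterance) {
        String normalized = utterance.trim().toLowerCase(Locale.ROOT).replaceAll("\\s+", " ");
        return phrases().contains(normalized);
    }

    public static void main(String[] args) {
        System.out.println(phrases().size() + " phrases in the grammar");
        System.out.println("please start word: " + accepts("please start word"));
        System.out.println("start word: " + accepts("start word"));
    }
}
```

Because `<ask>` is a required part of the rule as written, a bare command such as "start word" falls outside the grammar.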

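  18. Android Speech Recognition

      // Imports needed by the activity shown on slides 18 and 19
      import java.util.ArrayList;
      import java.util.Locale;
      import android.app.Activity;
      import android.content.Intent;
      import android.os.Bundle;
      import android.speech.RecognizerIntent;
      import android.view.Menu;
      import android.view.View;
      import android.widget.Button;
      import android.widget.TextView;
      import android.widget.Toast;

      public class MainActivity extends Activity {
          // Request code that identifies the speech recognition result
          private static final int VOICE_RECOGNITION = 1;
          Button speakButton;
          TextView spokenWords;

          @Override
          protected void onCreate(Bundle savedInstanceState) {
              super.onCreate(savedInstanceState);
              setContentView(R.layout.activity_main);
              speakButton = (Button) findViewById(R.id.button1);
              spokenWords = (TextView) findViewById(R.id.textView1);
          }

          @Override
          public boolean onCreateOptionsMenu(Menu menu) {
              // Inflate the menu; this adds items to the action bar if it is present.
              getMenuInflater().inflate(R.menu.main, menu);
              return true;
          }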

  19. Android Speech Recognition (Cont.)

          @Override
          protected void onActivityResult(int requestCode, int resultCode, Intent data) {
              if (requestCode == VOICE_RECOGNITION && resultCode == RESULT_OK) {
                  ArrayList<String> results;
                  results = data.getStringArrayListExtra(RecognizerIntent.EXTRA_RESULTS);
                  // TODO Do something with the recognized voice strings
                  Toast.makeText(this, results.get(0), Toast.LENGTH_SHORT).show();
                  spokenWords.setText(results.get(0));
              }
              super.onActivityResult(requestCode, resultCode, data);
          }

          // Click handler for the speak button; no listener is set in code,
          // so it is presumably wired through android:onClick in the layout.
          public void btnSpeak(View view) {
              Intent intent = new Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH);
              // Specify free form input
              intent.putExtra(RecognizerIntent.EXTRA_LANGUAGE_MODEL,
                      RecognizerIntent.LANGUAGE_MODEL_FREE_FORM);
              intent.putExtra(RecognizerIntent.EXTRA_PROMPT, "Please start speaking");
              intent.putExtra(RecognizerIntent.EXTRA_MAX_RESULTS, 1);
              // EXTRA_LANGUAGE expects a language tag string such as "en"
              intent.putExtra(RecognizerIntent.EXTRA_LANGUAGE, Locale.ENGLISH.toString());
              startActivityForResult(intent, VOICE_RECOGNITION);
          }
      }

  20. Android and Web Speech Recognition
     • Android Voice Recognition Tutorial
        • http://www.javacodegeeks.com/2012/08/android-voice-recognition-tutorial.html
     • Google Web Speech Recognition Examples
        • http://stiltsoft.com/blog/2013/05/google-chrome-how-to-use-the-web-speech-api/
        • http://stackoverflow.com/questions/17635354/developing-a-simple-voice-driven-web-app-using-web-speech-api
        • http://apprentice.craic.com/tutorials/37

  21. Challenges for VUI Design
     • People have very little patience for a “machine that does not understand”
     • VUIs need to respond to input reliably, or they will be rejected by their users
     • Designing a usable VUI requires interdisciplinary talents in computer science, linguistics, and human factors
     • The more closely the VUI matches the user's mental model of the task, the easier it is to use with little or no training, resulting in both higher efficiency and higher user satisfaction

  22. Natural Language Understanding
     • Ambiguity
        • Refers to phrases that look distinct in print but sound similar when spoken, for example:
           • “Wreck a nice beach”
           • “Recognize speech”
        • As the vocabulary and grammar get larger, the potential for ambiguity increases
        • Short words and phrases are harder to recognize than longer ones

  23. Language Understanding (Cont.)
     • Deviation
        • The user deviates from what the developer expects
        • For example, an issue with the question “Is that correct?”
           • The developer expects a simple response like “Yes”, “No”, or “Correct”
           • Southern speakers may respond with “Yes, ma’am” or “No, ma’am”

  24. Discussion
     • What would you expect if the user asks to start Microsoft Word?
        • Please start word
        • Could you start word
        • Start word
        • Please open word
        • Could you open word
        • Open word

  25. Discussion (Cont.) • If the grammar accepts only those, determine whether or not the action to open the application can be as follows:

  26. Language Understanding (Cont.)
     • Keyword Extraction
        • Important for applications built with a speech recognizer that returns a string containing the actual words spoken by the user
        • This leaves the application to interpret their semantic meaning
        • One might say “Computer, find me some information about the flooding in Detroit recently”
        • Keywords like “find”, “flooding”, and “Detroit” are crucial for an accurate response from the VUI
        • The others are filler words
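A minimal sketch of keyword extraction along these lines simply drops words found in a filler list. The filler list below is hypothetical and hand-picked to fit the Detroit example; a real application would need a much broader list or a different technique, and the class name `KeywordExtractor` is illustrative only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Filler-word filtering over the recognizer's returned string; illustration only.
final class KeywordExtractor {
    // Hypothetical filler-word list chosen for the example sentence.
    static final Set<String> FILLERS = new HashSet<>(Arrays.asList(
            "computer", "me", "some", "information", "about", "the", "in", "recently"));

    /** Lowercases the utterance, splits it into words, and drops the filler words. */
    static List<String> keywords(String utterance) {
        List<String> result = new ArrayList<>();
        for (String token : utterance.toLowerCase(Locale.ROOT).split("[^a-z']+")) {
            if (!token.isEmpty() && !FILLERS.contains(token)) {
                result.add(token);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(keywords(
                "Computer, find me some information about the flooding in Detroit recently"));
    }
}
```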

  27. Dialog Management
     • Multimodality
        • Interaction can occur through different media
        • Need to consider when, and in which parts of the application, to allow multimodal interaction
     • Grammar
        • There is a close relationship between what a prompt says and what the caller ends up saying to the system
           • Especially the words used
     • Configuration files
        • You may choose the confidence level below which the recognizer will reject the input rather than return an answer
        • You may also choose parameters for the endpointer, that is, how long it should listen before timing out

  28. Dialog Management (Cont.)
     • Error handling
        • Allow the user to recover after errors and get the dialog back on track
        • Recognition does not always succeed. When it fails, there are a number of messages the recognizer may return to the application.
     • Voice recognition accuracy
        • In-grammar data
        • Out-of-grammar data

  29. Error Handling
     • In-grammar data
        • Correct Accept: the recognizer returned the correct answer
        • False Accept: the recognizer returned the wrong answer
        • False Reject: the recognizer could not find a match and gave up
     • Out-of-grammar data
        • Correct Reject: the recognizer correctly rejected the input
        • False Accept: the recognizer returned a value that is wrong because the input is not in the grammar
     • How should each category be handled?
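When measuring recognition accuracy on labeled test utterances, the categories above can be assigned mechanically. The sketch below assumes the evaluator already knows what was actually said and whether the grammar covers it; the class name `RecognitionOutcome`, the method signature, and the use of `null` to represent a rejection are hypothetical conventions for illustration.

```java
// Labels one test utterance with the accept/reject categories; illustration only.
final class RecognitionOutcome {
    enum Outcome { CORRECT_ACCEPT, FALSE_ACCEPT, FALSE_REJECT, CORRECT_REJECT }

    /**
     * @param inGrammar whether the spoken utterance is covered by the grammar
     * @param spoken    what the user actually said (from the test labels)
     * @param returned  what the recognizer returned, or null if it rejected the input
     */
    static Outcome classify(boolean inGrammar, String spoken, String returned) {
        if (returned == null) {
            // The recognizer gave up: wrong for in-grammar input, right otherwise.
            return inGrammar ? Outcome.FALSE_REJECT : Outcome.CORRECT_REJECT;
        }
        if (inGrammar && returned.equals(spoken)) {
            return Outcome.CORRECT_ACCEPT;
        }
        // Either the wrong in-grammar answer, or any answer for out-of-grammar input.
        return Outcome.FALSE_ACCEPT;
    }

    public static void main(String[] args) {
        System.out.println(classify(true, "please open word", "please open word"));
        System.out.println(classify(true, "please open word", "please open excel"));
        System.out.println(classify(true, "please open word", null));
        System.out.println(classify(false, "launch word", null));
        System.out.println(classify(false, "launch word", "please start word"));
    }
}
```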

  30. Error Handling in Android
