
EEC-693/793 Applied Computer Vision with Depth Cameras

EEC-693/793 Applied Computer Vision with Depth Cameras, Lecture 12. Wenbing Zhao, wenbing@ieee.org. Outline: speech recognition; how speech recognition works; exploring the Microsoft Speech API (SAPI); creating your own grammar and choices for the speech recognition engine.


Presentation Transcript


  1. EEC-693/793 Applied Computer Vision with Depth Cameras. Lecture 12. Wenbing Zhao, wenbing@ieee.org

  2. Outline
     • Speech recognition
     • How speech recognition works
     • Exploring the Microsoft Speech API (SAPI)
     • Creating your own grammar and choices for the speech recognition engine
     • Draw What I Want: building a speech-enabled application

  3. How Speech Recognition Works
     • The Kinect microphone array captures the audio stream and converts the analog audio into digital sound signals
     • The digital sound signals are sent to the speech recognition engine
     • The acoustic model of the speech recognition engine analyzes the audio and converts the sound into a sequence of basic speech elements, called phonemes
     • The language model then analyzes the content of the speech and matches words by combining the phonemes with a built-in dictionary
     • The matching is context sensitive: surrounding words help decide which word was spoken

  4. How Speech Recognition Works

  5. Types of Speech Recognition
     • Command mode: you say one command at a time for the speech recognition engine to recognize
     • Sentence mode / dictation mode: you say a whole sentence to perform an operation, e.g., "mirror the shape"

  6. Microsoft Speech API • Kinect SDK comes with the Microsoft Kinect speech recognition language pack

  7. SpeechRecognitionEngine Class
     • The InstalledRecognizers method of the SpeechRecognitionEngine class returns the list of recognizers installed on the system, and we can filter them by recognizer ID
     • The SpeechRecognitionEngine class accepts an audio stream from the Kinect sensor and processes it
     • The SpeechRecognitionEngine class raises a sequence of events once audio is detected in the stream:
        • SpeechDetected is raised if the audio appears to be speech
        • SpeechHypothesized then fires multiple times as words are tentatively detected
        • Finally, SpeechRecognized is raised when the recognizer matches the speech against a loaded grammar

  8. Steps for building speech-enabled apps
     • Enable the Kinect audio source
     • Start capturing the audio data stream
     • Identify the speech recognizer
     • Define the grammar for the speech recognizer
     • Create the speech recognizer and load the grammar
     • Attach the speech audio source to the recognizer
     • Register the event handlers for speech recognition and start recognition
     • Handle the different events raised by the speech recognition engine

  9. Identify the speech recognizer

        private static RecognizerInfo GetKinectRecognizer()
        {
            foreach (RecognizerInfo recognizer in SpeechRecognitionEngine.InstalledRecognizers())
            {
                // Kinect-aware recognizers carry a "Kinect" = "True" entry in AdditionalInfo
                string value;
                recognizer.AdditionalInfo.TryGetValue("Kinect", out value);
                if ("True".Equals(value, StringComparison.OrdinalIgnoreCase) &&
                    "en-US".Equals(recognizer.Culture.Name, StringComparison.OrdinalIgnoreCase))
                {
                    return recognizer;
                }
            }
            // No Kinect recognizer for en-US is installed
            return null;
        }

        // Usage
        RecognizerInfo recognizerInfo = GetKinectRecognizer();

  10. Define grammar for the speech recognizer
     • Using Choices and GrammarBuilder
     • Multiple sets of choices can be appended to a GrammarBuilder
     • Creating a grammar from the GrammarBuilder
     • Loading the grammar into the speech recognizer

        // The builder must match the recognizer's culture
        var grammarBuilder = new GrammarBuilder { Culture = recognizerInfo.Culture };

        // First set of choices: the color
        var colorObjects = new Choices();
        colorObjects.Add("red");
        colorObjects.Add("green");
        colorObjects.Add("blue");
        colorObjects.Add("yellow");
        colorObjects.Add("gray");
        grammarBuilder.Append(colorObjects);

        // Second set of choices: the shape
        grammarBuilder.Append(new Choices("circle", "square", "triangle", "rectangle"));

        // Create the Grammar from the GrammarBuilder
        var grammar = new Grammar(grammarBuilder);

  11. Define grammar for the speech recognizer
     • A grammar can also be built from an XML (SRGS) file:

        // SrgsDocument is in the Microsoft.Speech.Recognition.SrgsGrammar namespace
        SrgsDocument grammarDoc = new SrgsDocument("mygrammar.xml");
        Grammar grammar = new Grammar(grammarDoc);

     • mygrammar.xml (the root attribute must name a rule defined in the file, so it is set to "color" here):

        <?xml version="1.0" encoding="UTF-8" ?>
        <grammar version="1.0" xml:lang="en-US"
                 xmlns="http://www.w3.org/2001/06/grammar"
                 tag-format="semantics/1.0" root="color">
          <rule id="color" scope="public">
            <one-of>
              <item>red</item>
              <item>green</item>
              <item>blue</item>
            </one-of>
          </rule>
        </grammar>

  12. Define grammar for the speech recognizer Multiple grammars can be loaded to the recognizer

  13. Define grammar for the speech recognizer

        private void BuildGrammarforRecognizer(RecognizerInfo recognizerInfo)
        {
            var grammarBuilder = new GrammarBuilder { Culture = recognizerInfo.Culture };

            // First word: "draw"
            grammarBuilder.Append(new Choices("draw"));

            // Second word: the color
            var colorObjects = new Choices();
            colorObjects.Add("red");
            colorObjects.Add("green");
            colorObjects.Add("blue");
            colorObjects.Add("yellow");
            colorObjects.Add("gray");
            grammarBuilder.Append(colorObjects);

            // Third word: the shape
            grammarBuilder.Append(new Choices("circle", "square", "triangle", "rectangle"));

            // Create the Grammar from the GrammarBuilder
            var grammar = new Grammar(grammarBuilder);

            // Create a second grammar for closing the application
            var newGrammarBuilder = new GrammarBuilder();
            newGrammarBuilder.Append("close the application");
            var grammarClose = new Grammar(newGrammarBuilder);

            // (method continues on the next slide)

  14. Define grammar for the speech recognizer (continued)

            // Create the speech recognizer and load both grammars
            speechEngine = new SpeechRecognitionEngine(recognizerInfo.Id);
            speechEngine.LoadGrammar(grammar);
            speechEngine.LoadGrammar(grammarClose);

            // Attach the speech audio source to the recognizer (16 kHz, 16-bit, mono PCM)
            int samplesPerSecond = 16000;
            int bitsPerSample = 16;
            int channels = 1;
            int averageBytesPerSecond = 32000;
            int blockAlign = 2;
            speechEngine.SetInputToAudioStream(
                audioStream,
                new SpeechAudioFormatInfo(EncodingFormat.Pcm, samplesPerSecond, bitsPerSample,
                    channels, averageBytesPerSecond, blockAlign, null));

            // Register the event handlers for speech recognition
            speechEngine.SpeechRecognized += speechRecognized;
            speechEngine.SpeechHypothesized += speechHypothesized;
            speechEngine.SpeechRecognitionRejected += speechRecognitionRejected;

            // Keep recognizing until recognition is stopped or cancelled
            speechEngine.RecognizeAsync(RecognizeMode.Multiple);
        }

     • RecognizeAsync() with no arguments performs a single, asynchronous recognition operation. RecognizeAsync(RecognizeMode.Multiple), used here, keeps recognizing until RecognizeAsyncStop() or RecognizeAsyncCancel() is called. In both cases the recognizer works against its loaded and enabled speech recognition grammars
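
The values 2 and 32000 passed to SpeechAudioFormatInfo follow from the other three format parameters. As a minimal sketch (the PcmFormat helper class and its method names are made up for illustration and are not part of the Kinect or Speech SDK), the standard PCM relationships can be checked like this:

```csharp
using System;

// Hypothetical helper, for illustration only: derives the two dependent
// PCM format values from sample rate, bit depth, and channel count.
static class PcmFormat
{
    // Bytes per sample frame (one sample for every channel).
    public static int BlockAlign(int channels, int bitsPerSample)
    {
        return channels * bitsPerSample / 8;
    }

    // Bytes of audio data per second of sound.
    public static int AverageBytesPerSecond(int samplesPerSecond, int channels, int bitsPerSample)
    {
        return samplesPerSecond * BlockAlign(channels, bitsPerSample);
    }
}

static class Program
{
    static void Main()
    {
        int samplesPerSecond = 16000, bitsPerSample = 16, channels = 1;

        // 1 channel * 16 bits / 8 = 2 bytes per frame
        Console.WriteLine(PcmFormat.BlockAlign(channels, bitsPerSample));

        // 16000 frames per second * 2 bytes = 32000 bytes per second
        Console.WriteLine(PcmFormat.AverageBytesPerSecond(samplesPerSecond, channels, bitsPerSample));
    }
}
```

Computing the two values instead of hard-coding them keeps the format description consistent if the sample rate or channel count ever changes.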

  15. Handle the different events invoked by the speech recognition engine

        private void speechRecognitionRejected(object sender, SpeechRecognitionRejectedEventArgs e)
        {
            // Intentionally empty
        }

        private void speechHypothesized(object sender, SpeechHypothesizedEventArgs e)
        {
            // Show the tentative result while the user is still speaking
            wordsTenative.Text = e.Result.Text;
        }

        private void speechRecognized(object sender, SpeechRecognizedEventArgs e)
        {
            wordsRecognized.Text = e.Result.Text;
            confidenceTxt.Text = e.Result.Confidence.ToString();

            // Only act on results the recognizer is reasonably sure about
            float confidenceThreshold = 0.6f;
            if (e.Result.Confidence > confidenceThreshold)
            {
                CommandsParser(e);
            }
        }

  16. CommandsParser

        private void CommandsParser(SpeechRecognizedEventArgs e)
        {
            var result = e.Result;
            Color objectColor;
            Shape drawObject;
            System.Collections.ObjectModel.ReadOnlyCollection<RecognizedWordUnit> words = e.Result.Words;

            if (words[0].Text == "draw")
            {
                // Second word selects the color
                string colorObject = words[1].Text;
                switch (colorObject)
                {
                    case "red": objectColor = Colors.Red; break;
                    case "green": objectColor = Colors.Green; break;
                    case "blue": objectColor = Colors.Blue; break;
                    case "yellow": objectColor = Colors.Yellow; break;
                    case "gray": objectColor = Colors.Gray; break;
                    default: return;
                }

                // (method continues on the next slide)

  17. CommandsParser (continued)

                // Third word selects the shape
                var shapeString = words[2].Text;
                switch (shapeString)
                {
                    case "circle":
                        drawObject = new Ellipse();
                        drawObject.Width = 100;
                        drawObject.Height = 100;
                        break;
                    case "square":
                        drawObject = new Rectangle();
                        drawObject.Width = 100;
                        drawObject.Height = 100;
                        break;
                    case "rectangle":
                        drawObject = new Rectangle();
                        drawObject.Width = 100;
                        drawObject.Height = 60;
                        break;
                    case "triangle":
                        var polygon = new Polygon();
                        polygon.Points.Add(new Point(0, 30));
                        polygon.Points.Add(new Point(-60, -30));
                        polygon.Points.Add(new Point(60, -30));
                        drawObject = polygon;
                        break;
                    default:
                        return;
                }

                // (method continues on the next slide)

  18. CommandsParser (continued)

                // Replace whatever was drawn before with the new shape
                canvas1.Children.Clear();
                drawObject.SetValue(Canvas.LeftProperty, 80.0);
                drawObject.SetValue(Canvas.TopProperty, 80.0);
                drawObject.Fill = new SolidColorBrush(objectColor);
                canvas1.Children.Add(drawObject);
            }

            // "close the application" shuts the window
            if (words[0].Text == "close" && words[1].Text == "the" && words[2].Text == "application")
            {
                this.Close();
            }
        }
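
CommandsParser indexes words[1] and words[2] directly, which works because both loaded grammars always produce exactly three words. If the grammar is later extended, a guarded parse may be safer. The following is only a sketch of that idea, separated from the WPF drawing code so it can run on its own; the DrawCommand class and its TryParse method are hypothetical names, not part of the lecture code:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, for illustration only: validates a recognized
// "draw <color> <shape>" phrase before any index is trusted.
static class DrawCommand
{
    static readonly HashSet<string> KnownColors =
        new HashSet<string> { "red", "green", "blue", "yellow", "gray" };
    static readonly HashSet<string> KnownShapes =
        new HashSet<string> { "circle", "square", "triangle", "rectangle" };

    public static bool TryParse(IReadOnlyList<string> words, out string color, out string shape)
    {
        color = null;
        shape = null;

        // Reject anything that is not exactly "draw <color> <shape>"
        if (words.Count != 3 || words[0] != "draw") return false;
        if (!KnownColors.Contains(words[1]) || !KnownShapes.Contains(words[2])) return false;

        color = words[1];
        shape = words[2];
        return true;
    }
}

static class Program
{
    static void Main()
    {
        string color, shape;
        bool ok = DrawCommand.TryParse("draw red circle".Split(' '), out color, out shape);
        Console.WriteLine(ok ? color + " " + shape : "not a draw command");
    }
}
```

In the real handler, the word list would come from e.Result.Words (taking the Text of each RecognizedWordUnit), and the color and shape strings would then be mapped to Colors and Shape objects as in the slides.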

  19. Build KinectAudio App
     • Create a new C# WPF project named DrawShapeFromSpeech
     • Add a reference to Microsoft.Kinect
     • Add a reference to Microsoft.Speech (not System.Speech)
     • Design the GUI
     • Add the code

  20. Add Microsoft.Speech assembly

  21. GUI Design Canvas

  22. Adding Code
     • Import namespaces:

        using Microsoft.Kinect;
        using Microsoft.Speech.Recognition;
        using Microsoft.Speech.AudioFormat;
        using System.IO;

     • Add member variables:

        KinectSensor sensor;
        Stream audioStream;
        SpeechRecognitionEngine speechEngine;

     • Register the WindowLoaded event handler programmatically:

        public MainWindow()
        {
            InitializeComponent();
            Loaded += new RoutedEventHandler(WindowLoaded);
        }

  23. Adding Code: WindowLoaded

        private void WindowLoaded(object sender, RoutedEventArgs e)
        {
            // Guard against running without a connected sensor
            if (KinectSensor.KinectSensors.Count == 0)
            {
                MessageBox.Show("No Kinect sensor found");
                return;
            }

            this.sensor = KinectSensor.KinectSensors[0];
            this.sensor.Start();

            // Start the audio source; the returned stream feeds the recognizer
            audioStream = this.sensor.AudioSource.Start();

            RecognizerInfo recognizerInfo = GetKinectRecognizer();
            if (recognizerInfo == null)
            {
                MessageBox.Show("Could not find Kinect speech recognizer");
                return;
            }

            BuildGrammarforRecognizer(recognizerInfo); // provided earlier
            statusBar.Text = "Speech Recognizer is ready";
        }

  24. Adding Code: event handlers (code provided earlier)
     • Add the event handler for speechHypothesized
     • Add the event handler for speechRecognized
        • CommandsParser() is invoked, which draws the spoken shape
        • You can close the app by saying "close the application"
     • Add the event handler for speechRecognitionRejected (left empty)

  25. Challenge Tasks
     • For advanced students, improve the app in the following ways:
        • Enable both the color image and skeleton data streams
        • Display the color image frames (but not the skeleton)
        • Modify the grammar so that you can add a particular shape at a particular joint location, e.g., "draw a red circle at the right hand"
        • Enable drawing with the right (or left) hand, using the color and shape you specified in the voice command
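
One possible starting point for the grammar change, offered only as a sketch: the same GrammarBuilder and Choices calls used earlier can be chained to accept a phrase such as "draw a red circle at the right hand". The ChallengeGrammar class, its Build method, and the list of joint names are assumptions made for illustration, and the code needs the Microsoft.Speech assembly to compile.

```csharp
using Microsoft.Speech.Recognition;

// Hypothetical helper, for illustration only.
static class ChallengeGrammar
{
    // Builds a grammar for phrases like "draw a red circle at the right hand".
    public static Grammar Build(RecognizerInfo recognizerInfo)
    {
        var builder = new GrammarBuilder { Culture = recognizerInfo.Culture };

        builder.Append("draw");
        builder.Append("a");
        builder.Append(new Choices("red", "green", "blue", "yellow", "gray"));
        builder.Append(new Choices("circle", "square", "triangle", "rectangle"));
        builder.Append("at the");

        // Joint names are an assumption; pick whichever joints the app should support
        builder.Append(new Choices("right hand", "left hand", "head"));

        return new Grammar(builder);
    }
}
```

With a grammar like this, the color and shape would no longer sit at words[1] and words[2] in e.Result.Words, so CommandsParser would need its indices adjusted, and the recognized joint name would have to be mapped to a skeleton joint position.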
