EEC-693/793 Applied Computer Vision with Depth Cameras

EEC-693/793 Applied Computer Vision with Depth Cameras Lecture 12 Wenbing Zhao wenbing@ieee.org

Outline • Speech recognition • How speech recognition works • Exploring the Microsoft Speech API (SAPI) • Creating your own grammar and choices for the speech recognition engine • Draw What I Want – building a speech-enabled application

How Speech Recognition Works • The Kinect microphone array captures the audio stream and converts the analog audio into digital sound signals • The digital signals are sent to the speech recognition engine for recognition • The acoustic model of the speech recognition engine analyzes the audio and converts the sound into a number of basic speech elements, called phonemes • The language model then analyzes the content of the speech and matches words by combining the phonemes with a built-in dictionary • Recognition is context sensitive: the surrounding words help decide between similar-sounding candidates

How Speech Recognition Works

Types of Speech Recognition • Command mode: you say one command at a time for the speech recognition engine to recognize • Sentence mode / dictation mode: you say a sentence to perform an operation, e.g., "mirror the shape"

Microsoft Speech API • Kinect SDK comes with the Microsoft Kinect speech recognition language pack

SpeechRecognitionEngine Class • The InstalledRecognizers method of the SpeechRecognitionEngine class returns the list of recognizers installed on the system, which we can filter (for example, by recognizer ID) • The SpeechRecognitionEngine class accepts an audio stream from the Kinect sensor and processes it • The SpeechRecognitionEngine class raises a sequence of events when audio is detected: • SpeechDetected is raised if the audio appears to be speech • SpeechHypothesized then fires multiple times as words are tentatively detected • Finally, SpeechRecognized is raised when the recognizer matches the speech

Steps for building speech-enabled apps • Enable the Kinect audio source • Start capturing the audio data stream • Identify the speech recognizer • Define the grammar for the speech recognizer • Start the speech recognizer • Attach the speech audio source to the recognizer • Register the event handler for speech recognition • Handle the different events invoked by the speech recognition engine

Identify the speech recognizer

private static RecognizerInfo GetKinectRecognizer()
{
    foreach (RecognizerInfo recognizer in SpeechRecognitionEngine.InstalledRecognizers())
    {
        // Keep only the Kinect-specific recognizer for US English
        string value;
        recognizer.AdditionalInfo.TryGetValue("Kinect", out value);
        if ("True".Equals(value, StringComparison.OrdinalIgnoreCase) &&
            "en-US".Equals(recognizer.Culture.Name, StringComparison.OrdinalIgnoreCase))
        {
            return recognizer;
        }
    }
    return null; // no matching recognizer installed
}

RecognizerInfo recognizerInfo = GetKinectRecognizer();

Define grammar for the speech recognizer • Using Choices and GrammarBuilder • Multiple sets of choices can be appended to a GrammarBuilder • Create a Grammar from the GrammarBuilder • Load the Grammar into the speech recognizer

var grammarBuilder = new GrammarBuilder();

// Choices for the color
var colorObjects = new Choices();
colorObjects.Add("red");
colorObjects.Add("green");
colorObjects.Add("blue");
colorObjects.Add("yellow");
colorObjects.Add("gray");
grammarBuilder.Append(colorObjects);

// Choices for the object, appended after the color
grammarBuilder.Append(new Choices("circle", "square", "triangle", "rectangle"));

// Create Grammar from GrammarBuilder
var grammar = new Grammar(grammarBuilder);
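Appending several sets of choices one after another means the grammar accepts one word from each set, in order. The sketch below is not part of the Speech API: it is plain C# with a hypothetical helper, PhraseEnumerator.Expand, that lists the phrases such a sequence would cover, assuming each appended set contributes exactly one word.

```csharp
using System;
using System.Collections.Generic;

public static class PhraseEnumerator
{
    // Hypothetical helper: builds every phrase formed by taking one word
    // from each choice set, in the order the sets are given.
    public static List<string> Expand(params string[][] choiceSets)
    {
        var phrases = new List<string> { "" };
        foreach (string[] choices in choiceSets)
        {
            var next = new List<string>();
            foreach (string prefix in phrases)
            {
                foreach (string choice in choices)
                {
                    next.Add(prefix.Length == 0 ? choice : prefix + " " + choice);
                }
            }
            phrases = next;
        }
        return phrases;
    }

    public static void Main()
    {
        var colors = new[] { "red", "green", "blue", "yellow", "gray" };
        var shapes = new[] { "circle", "square", "triangle", "rectangle" };

        List<string> phrases = Expand(colors, shapes);
        Console.WriteLine(phrases.Count); // 5 colors x 4 shapes = 20
        Console.WriteLine(phrases[0]);    // red circle
    }
}
```

With five colors and four shapes the grammar covers 20 phrases; each additional set multiplies the count by its size.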

Define grammar for the speech recognizer • A grammar can also be built from an XML (SRGS) file

// SrgsDocument lives in Microsoft.Speech.Recognition.SrgsGrammar
SrgsDocument grammarDoc = new SrgsDocument("mygrammar.xml");
Grammar grammar = new Grammar(grammarDoc);

mygrammar.xml (the root attribute must name a rule defined in the file):

<?xml version="1.0" encoding="UTF-8" ?>
<grammar version="1.0" xml:lang="en-US"
         xmlns="http://www.w3.org/2001/06/grammar"
         tag-format="semantics/1.0" root="color">
  <rule id="color" scope="public">
    <one-of>
      <item>red</item>
      <item>green</item>
      <item>blue</item>
    </one-of>
  </rule>
</grammar>

Define grammar for the speech recognizer Multiple grammars can be loaded to the recognizer

Define grammar for the speech recognizer

private void BuildGrammarforRecognizer(RecognizerInfo recognizerInfo)
{
    var grammarBuilder = new GrammarBuilder { Culture = recognizerInfo.Culture };

    // First say "draw"
    grammarBuilder.Append(new Choices("draw"));

    // Then a color
    var colorObjects = new Choices();
    colorObjects.Add("red");
    colorObjects.Add("green");
    colorObjects.Add("blue");
    colorObjects.Add("yellow");
    colorObjects.Add("gray");
    grammarBuilder.Append(colorObjects);

    // Then an object
    grammarBuilder.Append(new Choices("circle", "square", "triangle", "rectangle"));

    // Create Grammar from GrammarBuilder
    var grammar = new Grammar(grammarBuilder);

    // Create a second Grammar for closing the application
    var newGrammarBuilder = new GrammarBuilder();
    newGrammarBuilder.Append("close the application");
    var grammarClose = new Grammar(newGrammarBuilder);

    // Start the speech recognizer
    speechEngine = new SpeechRecognitionEngine(recognizerInfo.Id);
    speechEngine.LoadGrammar(grammar);      // load the drawing grammar into the recognizer
    speechEngine.LoadGrammar(grammarClose); // load the "close the application" grammar

    // Attach the speech audio source to the recognizer
    int samplesPerSecond = 16000;
    int bitsPerSample = 16;
    int channels = 1;
    int averageBytesPerSecond = 32000;
    int blockAlign = 2;
    speechEngine.SetInputToAudioStream(
        audioStream,
        new SpeechAudioFormatInfo(EncodingFormat.Pcm, samplesPerSecond, bitsPerSample,
                                  channels, averageBytesPerSecond, blockAlign, null));

    // Register the event handlers for speech recognition
    speechEngine.SpeechRecognized += speechRecognized;
    speechEngine.SpeechHypothesized += speechHypothesized;
    speechEngine.SpeechRecognitionRejected += speechRecognitionRejected;

    // Keep recognizing until recognition is stopped or cancelled
    speechEngine.RecognizeAsync(RecognizeMode.Multiple);
}

RecognizeAsync(): called with no argument, it performs a single asynchronous recognition operation. Called with RecognizeMode.Multiple, as above, the recognizer continues with further recognition operations until RecognizeAsyncCancel() or RecognizeAsyncStop() is called. In both cases the recognizer works against its loaded and enabled speech recognition grammars.
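The last two numbers passed to SpeechAudioFormatInfo are not independent of the others. As a minimal sketch, assuming uncompressed PCM audio, they can be derived from the sample rate, sample size, and channel count. The PcmFormat class and its method names below are illustrative, not part of the Kinect or Speech SDK.

```csharp
using System;

public static class PcmFormat
{
    // Bytes in one sample frame: one sample for every channel.
    public static int BlockAlign(int channels, int bitsPerSample)
    {
        return channels * (bitsPerSample / 8);
    }

    // Bytes of audio produced per second of uncompressed PCM.
    public static int AverageBytesPerSecond(int samplesPerSecond, int channels, int bitsPerSample)
    {
        return samplesPerSecond * BlockAlign(channels, bitsPerSample);
    }

    public static void Main()
    {
        // Values used for the Kinect audio stream in this lecture
        Console.WriteLine(BlockAlign(1, 16));                   // 2
        Console.WriteLine(AverageBytesPerSecond(16000, 1, 16)); // 32000
    }
}
```

For 16 kHz, 16-bit, mono audio this reproduces the constants 2 and 32000 used when attaching the audio stream.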

Handle the different events invoked by the speech recognition engine

private void speechRecognitionRejected(object sender, SpeechRecognitionRejectedEventArgs e)
{
    // Intentionally empty: rejected speech is ignored
}

private void speechHypothesized(object sender, SpeechHypothesizedEventArgs e)
{
    // Show the tentative result while the user is still speaking
    wordsTenative.Text = e.Result.Text;
}

private void speechRecognized(object sender, SpeechRecognizedEventArgs e)
{
    wordsRecognized.Text = e.Result.Text;
    confidenceTxt.Text = e.Result.Confidence.ToString();

    // Only act on results the recognizer is reasonably sure about
    float confidenceThreshold = 0.6f;
    if (e.Result.Confidence > confidenceThreshold)
    {
        CommandsParser(e);
    }
}

private void CommandsParser(SpeechRecognizedEventArgs e)
{
    var result = e.Result;
    Color objectColor;
    Shape drawObject;
    System.Collections.ObjectModel.ReadOnlyCollection<RecognizedWordUnit> words = e.Result.Words;

    if (words[0].Text == "draw")
    {
        // Second word: the color
        string colorObject = words[1].Text;
        switch (colorObject)
        {
            case "red": objectColor = Colors.Red; break;
            case "green": objectColor = Colors.Green; break;
            case "blue": objectColor = Colors.Blue; break;
            case "yellow": objectColor = Colors.Yellow; break;
            case "gray": objectColor = Colors.Gray; break;
            default: return;
        }

        // Third word: the shape
        var shapeString = words[2].Text;
        switch (shapeString)
        {
            case "circle":
                drawObject = new Ellipse();
                drawObject.Width = 100;
                drawObject.Height = 100;
                break;
            case "square":
                drawObject = new Rectangle();
                drawObject.Width = 100;
                drawObject.Height = 100;
                break;
            case "rectangle":
                drawObject = new Rectangle();
                drawObject.Width = 100;
                drawObject.Height = 60;
                break;
            case "triangle":
                var polygon = new Polygon();
                polygon.Points.Add(new Point(0, 30));
                polygon.Points.Add(new Point(-60, -30));
                polygon.Points.Add(new Point(60, -30));
                drawObject = polygon;
                break;
            default:
                return;
        }

        // Replace whatever is on the canvas with the new shape
        canvas1.Children.Clear();
        drawObject.SetValue(Canvas.LeftProperty, 80.0);
        drawObject.SetValue(Canvas.TopProperty, 80.0);
        drawObject.Fill = new SolidColorBrush(objectColor);
        canvas1.Children.Add(drawObject);
    }

    if (words[0].Text == "close" && words[1].Text == "the" && words[2].Text == "application")
    {
        this.Close();
    }
}
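CommandsParser() indexes words[1] and words[2] directly, which is safe only because both loaded grammars produce exactly three words. The sketch below shows one way to check the same "draw <color> <shape>" structure on a plain string, which also lets the parsing logic be tried without a Kinect attached. CommandParser and TryParse are hypothetical names introduced here for illustration; they are not part of the lecture's application.

```csharp
using System;

public static class CommandParser
{
    static readonly string[] KnownColors = { "red", "green", "blue", "yellow", "gray" };
    static readonly string[] KnownShapes = { "circle", "square", "triangle", "rectangle" };

    // Returns true and fills color/shape only for "draw <color> <shape>".
    public static bool TryParse(string phrase, out string color, out string shape)
    {
        color = null;
        shape = null;

        string[] words = (phrase ?? "").Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

        // Check the word count before indexing into the array
        if (words.Length != 3 || words[0] != "draw")
        {
            return false;
        }
        if (Array.IndexOf(KnownColors, words[1]) < 0 || Array.IndexOf(KnownShapes, words[2]) < 0)
        {
            return false;
        }

        color = words[1];
        shape = words[2];
        return true;
    }

    public static void Main()
    {
        string color, shape;
        Console.WriteLine(TryParse("draw red circle", out color, out shape)); // True
        Console.WriteLine(TryParse("draw purple circle", out color, out shape)); // False
        Console.WriteLine(TryParse("draw red", out color, out shape)); // False
    }
}
```

In the application, the recognized text (e.Result.Text) could be passed to such a helper before any drawing is attempted.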

Build KinectAudio App • Create a new C# WPF project named DrawShapeFromSpeech • Add a reference to Microsoft.Kinect • Add a reference to Microsoft.Speech (not System.Speech!) • Design the GUI • Add the code

Add Microsoft.Speech assembly

GUI Design Canvas

Adding Code

• Import namespaces:

using Microsoft.Kinect;
using Microsoft.Speech.Recognition;
using Microsoft.Speech.AudioFormat;
using System.IO;

• Add member variables:

KinectSensor sensor;
Stream audioStream;
SpeechRecognitionEngine speechEngine;

• Register the WindowLoaded event handler programmatically:

public MainWindow()
{
    InitializeComponent();
    Loaded += new RoutedEventHandler(WindowLoaded);
}

Adding Code: WindowLoaded

private void WindowLoaded(object sender, RoutedEventArgs e)
{
    // Indexing KinectSensors[0] throws if no sensor is connected
    if (KinectSensor.KinectSensors.Count == 0)
    {
        MessageBox.Show("Could not find a Kinect sensor");
        return;
    }

    this.sensor = KinectSensor.KinectSensors[0];
    this.sensor.Start();

    // Start capturing the audio data stream
    audioStream = this.sensor.AudioSource.Start();

    RecognizerInfo recognizerInfo = GetKinectRecognizer();
    if (recognizerInfo == null)
    {
        MessageBox.Show("Could not find Kinect speech recognizer");
        return;
    }

    BuildGrammarforRecognizer(recognizerInfo); // provided earlier
    statusBar.Text = "Speech Recognizer is ready";
}

Adding Code: code provided earlier • Add the event handler for speechHypothesized • Add the event handler for speechRecognized: CommandsParser() is invoked, which draws the spoken shape; you can close the app by saying "close the application" • Add the event handler for speechRecognitionRejected (empty)

Challenge Tasks • For advanced students, improve the app in the following ways: • Enable both color image and skeleton data streams • Display color image frames (but not the skeleton) • Modify the grammar such that you can add a particular shape to a particular joint location • E.g., draw a red circle at the right hand • Enable drawing by right (or left) hand, using the color and shape you specified in voice command
