Speech Recognition


What makes speech recognition hard?

Speech Recognition
• Task: Identify the sequence of words uttered by a speaker, given the acoustic waveform.
• Uncertainty is introduced by noise, speaker error, variation in pronunciation, homonyms, etc.
• Thus speech recognition is viewed as a problem of probabilistic inference.

Example (from Russell and Norvig, Artificial Intelligence): “I’m firsty, um, can I haf somefing to dwink?”

Speech Recognition System Architecture (from Buchsbaum & Giancarlo paper)
• Acoustic feature extraction
• Acoustic features –> phones model
• Phones –> word pronunciation model
• Language model
Here, “lattice” means “Hidden Markov Model”.

Acoustic feature extraction (from Russell and Norvig, Artificial Intelligence)

From Russell and Norvig, Artificial Intelligence

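Hidden Markov Models
• Markov model: Given state X_t, what is the probability of transitioning to the next state X_{t+1}?
• E.g., word bigram probabilities give P(word_{t+1} | word_t).
• Hidden Markov model: There are observable variables (e.g., the signal S) and “hidden” states (e.g., words). An HMM specifies the transition probabilities between hidden states and the probability of each observation given the hidden state; the probabilities of hidden states given the observations are then obtained by inference.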
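To make the hidden/observable distinction concrete, here is a minimal sketch of forward filtering in a toy HMM, which computes the probability of each hidden state given the observations so far. The two phone-like states, the quantized observation labels "lo" and "hi", and every probability are made up for illustration; they do not come from the slides or from a trained model.

```python
# Toy HMM; all states, labels, and numbers are hypothetical.
STATES = ["ah", "iy"]                      # hidden states
PRIOR = {"ah": 0.5, "iy": 0.5}             # P(X_1)
TRANSITION = {                             # P(X_{t+1} | X_t)
    "ah": {"ah": 0.7, "iy": 0.3},
    "iy": {"ah": 0.3, "iy": 0.7},
}
EMISSION = {                               # P(observation | X_t)
    "ah": {"lo": 0.9, "hi": 0.1},
    "iy": {"lo": 0.2, "hi": 0.8},
}


def forward_filter(observations):
    """Return P(X_t | o_1..o_t) for the last time step t."""
    belief = dict(PRIOR)
    for t, obs in enumerate(observations):
        if t > 0:
            # Prediction step: push the belief through the transition model.
            belief = {
                nxt: sum(belief[prev] * TRANSITION[prev][nxt] for prev in STATES)
                for nxt in STATES
            }
        # Update step: weight by the observation likelihood, then normalize.
        unnormalized = {s: EMISSION[s][obs] * belief[s] for s in STATES}
        total = sum(unnormalized.values())
        belief = {s: p / total for s, p in unnormalized.items()}
    return belief


if __name__ == "__main__":
    print(forward_filter(["lo", "lo", "hi"]))
```

The transition table plays the role of the Markov model in the first bullet; the emission table is what links the hidden states to the observable signal.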

Phone model
P(phone | frame features) = α P(frame features | phone) P(phone), where α is a normalizing constant.
P(frame features | phone) is often represented by a Gaussian mixture model.
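As a rough illustration of the phone model, the sketch below applies Bayes' rule with a Gaussian mixture likelihood for each phone and normalizes so the posterior sums to one over phones. It assumes a single scalar feature per frame (real systems use feature vectors), and the phone names, mixture parameters, and priors are invented for the example.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a one-dimensional Gaussian at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def mixture_likelihood(x, components):
    """P(x | phone) as a weighted sum of Gaussians.

    components is a list of (weight, mean, std) with weights summing to 1.
    """
    return sum(w * gaussian_pdf(x, m, s) for w, m, s in components)


# Hypothetical one-dimensional phone models and priors.
PHONE_MODELS = {
    "ah": [(0.6, 1.0, 0.5), (0.4, 2.0, 0.8)],
    "iy": [(1.0, 4.0, 0.7)],
}
PHONE_PRIOR = {"ah": 0.5, "iy": 0.5}


def phone_posterior(x):
    """Return P(phone | x) for every phone via Bayes' rule."""
    unnormalized = {
        phone: mixture_likelihood(x, PHONE_MODELS[phone]) * PHONE_PRIOR[phone]
        for phone in PHONE_MODELS
    }
    total = sum(unnormalized.values())  # this is 1/alpha, the normalizer
    return {phone: p / total for phone, p in unnormalized.items()}


if __name__ == "__main__":
    print(phone_posterior(1.2))
```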

Acoustic features –> phones model (from Russell and Norvig, Artificial Intelligence)

Word pronunciation model (phones –> word pronunciation model)
Now we want P(words | phones_{1:t}) = α P(phones_{1:t} | words) P(words), where α is a normalizing constant.
Represent P(phones_{1:t} | words) as an HMM.
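One way to picture a pronunciation model is as a small left-to-right chain of states, each emitting one phone, with branches for alternative pronunciations. The sketch below scores a phone sequence against such a chain for the word "tomato". The state names, the two branch points, and the 0.5 probabilities are assumptions chosen for illustration, not values taken from the slides.

```python
# Hypothetical pronunciation chain for "tomato".
# Each state emits exactly one phone; "start" and "end" emit nothing.
TOMATO_TRANSITIONS = {
    "start": {"t1": 1.0},
    "t1": {"ow1": 0.5, "ah1": 0.5},   # first vowel: two variants
    "ow1": {"m": 1.0},
    "ah1": {"m": 1.0},
    "m": {"ey": 0.5, "aa": 0.5},      # second vowel: two variants
    "ey": {"t2": 1.0},
    "aa": {"t2": 1.0},
    "t2": {"ow2": 1.0},
    "ow2": {"end": 1.0},
}
TOMATO_PHONE_OF = {
    "t1": "t", "ow1": "ow", "ah1": "ah", "m": "m",
    "ey": "ey", "aa": "aa", "t2": "t", "ow2": "ow",
}


def pronunciation_probability(phones, transitions, phone_of):
    """Return P(phones | word) by summing over matching state paths."""
    # reach[state] = probability of being in state after emitting phones so far
    reach = {"start": 1.0}
    for phone in phones:
        next_reach = {}
        for state, prob in reach.items():
            for nxt, p_trans in transitions.get(state, {}).items():
                if phone_of.get(nxt) == phone:
                    next_reach[nxt] = next_reach.get(nxt, 0.0) + prob * p_trans
        reach = next_reach
    # The word must also finish: multiply by the probability of moving to "end".
    return sum(
        prob * transitions.get(state, {}).get("end", 0.0)
        for state, prob in reach.items()
    )


if __name__ == "__main__":
    seq = ["t", "ow", "m", "ey", "t", "ow"]
    print(pronunciation_probability(seq, TOMATO_TRANSITIONS, TOMATO_PHONE_OF))
```

Multiplying this likelihood by a prior P(words) and normalizing gives the posterior over words described above.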

Example of phones –> word pronunciation model (from Russell and Norvig, Artificial Intelligence)

Language model (from Russell and Norvig, Artificial Intelligence)
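A word bigram language model of the kind mentioned under Hidden Markov Models, P(word_{t+1} | word_t), can be estimated by counting adjacent word pairs. The sketch below does this by plain relative frequency on a three-sentence toy corpus. The corpus, the "<s>" and "</s>" boundary markers, and the absence of smoothing are all simplifying assumptions; a practical model would need far more data and some form of smoothing for unseen pairs.

```python
from collections import Counter, defaultdict


def tokenize(sentence):
    """Lowercase, split on whitespace, and add sentence boundary markers."""
    return ["<s>"] + sentence.lower().split() + ["</s>"]


def train_bigram_model(sentences):
    """Estimate P(next word | previous word) by relative frequency."""
    pair_counts = defaultdict(Counter)
    for sentence in sentences:
        words = tokenize(sentence)
        for prev, nxt in zip(words, words[1:]):
            pair_counts[prev][nxt] += 1
    model = {}
    for prev, counter in pair_counts.items():
        total = sum(counter.values())
        model[prev] = {word: count / total for word, count in counter.items()}
    return model


def sentence_probability(model, sentence):
    """Product of bigram probabilities; unseen pairs get probability 0."""
    words = tokenize(sentence)
    prob = 1.0
    for prev, nxt in zip(words, words[1:]):
        prob *= model.get(prev, {}).get(nxt, 0.0)
    return prob


# Toy corpus, made up for illustration.
CORPUS = ["recognize speech", "recognize speech today", "wreck a nice beach"]
MODEL = train_bigram_model(CORPUS)

if __name__ == "__main__":
    print(sentence_probability(MODEL, "recognize speech"))
```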


To build a speech recognition system, need:
• Lots of data
• Acoustic signal processing tools
• Methods for learning various probability models
• Methods for “maximum likelihood” calculation (i.e., search or “decoding”): Suppose we have observations (features from the acoustic signal) O = (o_1 o_2 o_3 … o_n). We want to find W* = (w_1 w_2 w_3 … w_n) such that

W* = argmax_W P(W | O) = argmax_W P(O | W) P(W)

where:
• argmax_W: search or “decoding” method
• P(W): language model
• P(O | W): combine phone models, segmentation models, word pronunciation models
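The search or “decoding” step is commonly carried out with the Viterbi algorithm, which finds the single most probable hidden-state sequence for a sequence of observations. The sketch below is a minimal version over a toy HMM, assuming the usual formulation in which the best sequence maximizes the product of the observation likelihood and the prior over sequences. The states, observation labels, and probabilities are hypothetical, and all of them are kept strictly positive so that the logarithms are defined.

```python
import math


def viterbi(observations, states, prior, transition, emission):
    """Return (log probability, path) of the most probable state sequence.

    Works in log space to avoid underflow; assumes every probability used
    is strictly positive.
    """
    first = observations[0]
    # best[s] = (log prob of the best path ending in s, that path)
    best = {
        s: (math.log(prior[s]) + math.log(emission[s][first]), [s])
        for s in states
    }
    for obs in observations[1:]:
        new_best = {}
        for s in states:
            # Pick the predecessor that gives the highest score into s.
            score, path = max(
                (
                    (best[p][0] + math.log(transition[p][s]), best[p][1])
                    for p in states
                ),
                key=lambda item: item[0],
            )
            new_best[s] = (score + math.log(emission[s][obs]), path + [s])
        best = new_best
    return max(best.values(), key=lambda item: item[0])


# Toy model; every name and number here is made up for illustration.
STATES = ["ah", "iy"]
PRIOR = {"ah": 0.6, "iy": 0.4}
TRANSITION = {
    "ah": {"ah": 0.7, "iy": 0.3},
    "iy": {"ah": 0.4, "iy": 0.6},
}
EMISSION = {
    "ah": {"lo": 0.8, "hi": 0.2},
    "iy": {"lo": 0.1, "hi": 0.9},
}

if __name__ == "__main__":
    log_prob, path = viterbi(["lo", "lo", "hi"], STATES, PRIOR, TRANSITION, EMISSION)
    print(path, math.exp(log_prob))
```

In a full recognizer the hidden states would come from the combined phone, pronunciation, and language models rather than from a two-state table, but the dynamic-programming idea is the same.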

Emotion recognition in speech (by OES high-school students!)
http://www.youtube.com/watch?v=NnbsGyViN3Y
