Speech Recognition
What makes speech recognition hard?
Speech Recognition
• Task: Identify the sequence of words uttered by a speaker, given the acoustic waveform.
• Uncertainty is introduced by noise, speaker error, variation in pronunciation, homonyms, etc.
• Thus speech recognition is viewed as a problem of probabilistic inference.
Example: “I’m firsty, um, can I haf somefing to dwink?” (from Russell and Norvig, Artificial Intelligence)
Speech Recognition System Architecture (from Buchsbaum & Giancarlo paper)
• Acoustic feature extraction
• Acoustic features → phones model
• Phones → word pronunciation model
• Language model
Here, “lattice” means “Hidden Markov Model”.
Acoustic feature extraction (from Russell and Norvig, Artificial Intelligence)
Hidden Markov Models
• Markov model: Given state $X_t$, what is the probability of transitioning to the next state $X_{t+1}$?
• E.g., word bigram probabilities give $P(\text{word}_{t+1} \mid \text{word}_t)$.
• Hidden Markov model: There are observable variables (e.g., signal S) and “hidden” states (e.g., words). An HMM specifies how hidden states follow one another and how each hidden state generates observations, so the probabilities of hidden states can be inferred from what is observed.
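As a rough illustration of the word bigram idea, the sketch below estimates $P(\text{word}_{t+1} \mid \text{word}_t)$ by simple counting. The function name `bigram_probabilities` and the three-sentence toy corpus are assumptions made for this example; a real language model would need far more data and some form of smoothing.

```python
from collections import Counter, defaultdict


def bigram_probabilities(sentences):
    """Estimate P(next word | current word) by relative frequency."""
    pair_counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        # Count every adjacent (current, following) word pair.
        for current, following in zip(words, words[1:]):
            pair_counts[current][following] += 1
    # Normalize each row of counts so it sums to 1.
    return {
        current: {w: c / sum(followers.values()) for w, c in followers.items()}
        for current, followers in pair_counts.items()
    }


# Hypothetical toy corpus, only for illustration.
toy_corpus = [
    "i want a drink",
    "i want some water",
    "can i have a drink",
]
bigrams = bigram_probabilities(toy_corpus)
print(bigrams["i"])
```

Each row of `bigrams` is one Markov transition distribution: given the current word, it lists the estimated probability of each possible next word.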
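Phone model
$P(\text{phone} \mid \text{frame features}) = \alpha \, P(\text{frame features} \mid \text{phone}) \, P(\text{phone})$, where $\alpha$ is a normalizing constant.
$P(\text{frame features} \mid \text{phone})$ is often represented by a Gaussian mixture model.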
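A minimal sketch of this computation, assuming a single scalar feature per frame: each phone's likelihood is a one-dimensional Gaussian mixture, which is multiplied by the phone prior and then normalized over all phones. The phone labels, mixture parameters, and priors below are made-up values for illustration; real systems use multi-dimensional feature vectors and learned parameters.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian at x."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def gmm_likelihood(x, components):
    """P(feature | phone) as a mixture; components are (weight, mean, variance)."""
    return sum(w * gaussian_pdf(x, m, v) for w, m, v in components)


def phone_posterior(x, phone_models, phone_priors):
    """Bayes' rule: likelihood times prior, normalized over all phones."""
    unnormalized = {
        phone: gmm_likelihood(x, phone_models[phone]) * phone_priors[phone]
        for phone in phone_models
    }
    total = sum(unnormalized.values())
    return {phone: value / total for phone, value in unnormalized.items()}


# Hypothetical mixture parameters and priors, only for illustration.
phone_models = {
    "aa": [(0.6, 1.0, 0.5), (0.4, 2.0, 1.0)],
    "iy": [(1.0, -1.0, 0.5)],
}
phone_priors = {"aa": 0.5, "iy": 0.5}

posterior = phone_posterior(1.2, phone_models, phone_priors)
print(posterior)
```

Dividing by `total` plays the role of the normalizing constant: it makes the posterior values sum to 1 across the candidate phones.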
Acoustic features → phones model (from Russell and Norvig, Artificial Intelligence)
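Word Pronunciation Model (phones → word pronunciation model)
Now we want $P(\text{words} \mid \text{phones}_{1:t}) = \alpha \, P(\text{phones}_{1:t} \mid \text{words}) \, P(\text{words})$, where $\alpha$ is a normalizing constant.
Represent $P(\text{phones}_{1:t} \mid \text{words})$ as an HMM.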
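One way to picture a pronunciation model is as a small graph whose states are phones and whose arcs carry transition probabilities; the probability of a phone sequence given the word is then the total probability of the paths that spell it out. The sketch below assumes a simplified model for the word "tomato" in which each state emits exactly one phone (no phone durations). The state names and probabilities are illustrative assumptions, not values taken from the slides.

```python
# Hypothetical pronunciation graph for "tomato"; arcs carry transition probabilities.
TOMATO_TRANSITIONS = {
    "START": {"t1": 1.0},
    "t1": {"ow1": 0.2, "ah": 0.8},
    "ow1": {"m": 1.0},
    "ah": {"m": 1.0},
    "m": {"ey": 0.5, "aa": 0.5},
    "ey": {"t2": 1.0},
    "aa": {"t2": 1.0},
    "t2": {"ow2": 1.0},
    "ow2": {"END": 1.0},
}

# The phone emitted by each state (two states may share the same phone).
PHONE_OF = {
    "t1": "t", "ow1": "ow", "ah": "ah", "m": "m",
    "ey": "ey", "aa": "aa", "t2": "t", "ow2": "ow",
}


def pronunciation_probability(phones, transitions, phone_of):
    """P(phone sequence | word), summed over all matching state paths."""
    # forward[state] = probability of emitting the phones so far and ending in state
    forward = {"START": 1.0}
    for phone in phones:
        next_forward = {}
        for state, prob in forward.items():
            for nxt, p in transitions.get(state, {}).items():
                if phone_of.get(nxt) == phone:
                    next_forward[nxt] = next_forward.get(nxt, 0.0) + prob * p
        forward = next_forward
    # The sequence only counts if the word can end right after the last phone.
    return sum(
        prob * transitions.get(state, {}).get("END", 0.0)
        for state, prob in forward.items()
    )


p_ow_ey = pronunciation_probability(
    ["t", "ow", "m", "ey", "t", "ow"], TOMATO_TRANSITIONS, PHONE_OF
)
p_ah_aa = pronunciation_probability(
    ["t", "ah", "m", "aa", "t", "ow"], TOMATO_TRANSITIONS, PHONE_OF
)
print(p_ow_ey, p_ah_aa)
```

With these assumed numbers, the two variants differ only in the first vowel, so their probabilities differ by the ratio of the 0.2 and 0.8 arcs.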
Example of phones → word pronunciation model (from Russell and Norvig, Artificial Intelligence)
Language model (from Russell and Norvig, Artificial Intelligence)
To build a speech recognition system, we need:
• Lots of data
• Acoustic signal processing tools
• Methods for learning various probability models
• Methods for “maximum likelihood” calculation (i.e., search or “decoding”)

Suppose we have observations (features from acoustic signal) $O = (o_1, o_2, o_3, \ldots, o_n)$. We want to find $W^* = (w_1, w_2, w_3, \ldots, w_n)$ such that

$$W^* = \arg\max_W P(O \mid W) \, P(W)$$

where
• $\arg\max_W$: search or “decoding” method
• $P(W)$: language model
• $P(O \mid W)$: combine phone models, segmentation models, word pronunciation models
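For an HMM, the search for the most probable hidden sequence is commonly carried out with the Viterbi algorithm. The sketch below is a generic log-space version run on a tiny made-up model in which each hidden state is a word and each observation is a symbolic acoustic label. The labels `"ai"` and `"dwink"` and all probabilities are assumptions for illustration; a real recognizer works over many frames per phone and far larger state spaces.

```python
import math


def safe_log(p):
    """Logarithm that maps probability 0 to negative infinity."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden state path and its joint probability."""
    # best[t][s] = (log probability of the best path ending in s at time t, predecessor)
    best = [{
        s: (safe_log(start_p[s]) + safe_log(emit_p[s][observations[0]]), None)
        for s in states
    }]
    for obs in observations[1:]:
        column = {}
        for s in states:
            score, prev = max(
                (best[-1][r][0] + safe_log(trans_p[r][s]), r) for r in states
            )
            column[s] = (score + safe_log(emit_p[s][obs]), prev)
        best.append(column)
    # Trace back from the best final state through the stored predecessors.
    last = max(states, key=lambda s: best[-1][s][0])
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path, math.exp(best[-1][last][0])


# Hypothetical two-word model, only for illustration.
states = ["I", "drink"]
start_p = {"I": 0.8, "drink": 0.2}
trans_p = {
    "I": {"I": 0.1, "drink": 0.9},
    "drink": {"I": 0.6, "drink": 0.4},
}
emit_p = {
    "I": {"ai": 0.9, "dwink": 0.1},
    "drink": {"ai": 0.2, "dwink": 0.8},
}

path, prob = viterbi(["ai", "dwink"], states, start_p, trans_p, emit_p)
print(path, prob)
```

In this toy setup the transition table stands in for the language model and the emission table for the acoustic side; Viterbi combines the two and keeps only the best-scoring path into each state at every step, which avoids enumerating all word sequences.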
Emotion recognition in speech (by OES high-school students!) http://www.youtube.com/watch?v=NnbsGyViN3Y