
Speech Recognition and Hidden Markov Models


Presentation Transcript


  1. Speech Recognition and Hidden Markov Models CPSC4600@UTC/CSE

  2. Hidden Markov models • Probability fundamentals • Markov models • Hidden Markov models • Likelihood calculation

  3. Probability fundamentals • Normalization • discrete and continuous • Independent events • joint probability • Dependent events • conditional probability • Bayes’ theorem • posterior probability • Marginalization • discrete and continuous

  4. Normalization Discrete: the probability of all possibilities sums to one: Σi P(xi) = 1. (1) Continuous: the integral over the entire probability density function (pdf) comes to one: ∫ p(x) dx = 1. (2) Joint probability: the joint probability that two independent events occur is the product of their individual probabilities: P(A,B) = P(A) P(B). (3)

  5. Conditional probability If two events are dependent, we need to determine their conditional probabilities. The joint probability is now P(A,B) = P(A) P(B|A), (4) where P(B|A) is the probability of event B given that A occurred; conversely, taking the events the other way P(A,B) = P(A|B) P(B). (5)

  6. Bayes’ theorem Equating the RHS of eqs. 4 and 5 gives Bayes’ theorem: P(A|B) = P(B|A) P(A) / P(B). For example, in a word recognition application we have P(word | data) = P(data | word) P(word) / P(data), which can be interpreted as posterior = likelihood × prior / evidence. The posterior probability is used to make Bayesian inferences; the conditional likelihood describes how likely the data were for a given class; the prior allows us to incorporate other forms of knowledge into our decision (like a language model); the evidence acts as a normalization factor and is often discarded in practice (as it is the same for all classes).
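To make the arithmetic concrete, here is a minimal Python sketch of a Bayesian decision over a two-word vocabulary; the vocabulary, priors, and likelihoods are made-up illustration values, not anything from the slides.

```python
# Hypothetical two-word vocabulary with illustrative priors and likelihoods.
priors = {"yes": 0.7, "no": 0.3}            # P(word), e.g. from a language model
likelihoods = {"yes": 0.02, "no": 0.05}     # P(data | word), e.g. from an acoustic model

# Evidence P(data) = sum over words of P(data | word) P(word)
evidence = sum(likelihoods[w] * priors[w] for w in priors)

# Bayes' theorem: posterior P(word | data) = P(data | word) P(word) / P(data)
posteriors = {w: likelihoods[w] * priors[w] / evidence for w in priors}

print(posteriors)                           # approximately {'yes': 0.483, 'no': 0.517}
print(max(posteriors, key=posteriors.get))  # Bayesian decision: pick the most probable word
```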

  7. Marginalization Discrete: the probability of event B, which depends on A, is the sum over A of all joint probabilities: P(B) = ΣA P(A,B) = ΣA P(B|A) P(A). Continuous: similarly, the nuisance factor x can be eliminated from its joint pdf with y: p(y) = ∫ p(x,y) dx.
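As a small numeric illustration of discrete marginalization, the sketch below uses the low/high-pressure and rain/dry numbers that appear later in the HMM example (slides 12-13):

```python
# P(A): pressure prior, and P(B | A): weather given pressure (numbers from slides 12-13)
p_a = {"Low": 0.4, "High": 0.6}
p_b_given_a = {"Low": {"Rain": 0.6, "Dry": 0.4},
               "High": {"Rain": 0.4, "Dry": 0.6}}

# Marginalization: P(B) = sum over A of P(B | A) P(A)
p_b = {b: sum(p_b_given_a[a][b] * p_a[a] for a in p_a) for b in ("Rain", "Dry")}
print(p_b)   # {'Rain': 0.48, 'Dry': 0.52}
```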

  8. Introduction to Markov Models • Set of states: {s1, s2, …, sN} • The process moves from one state to another, generating a sequence of states: si1, si2, …, sik, … • Markov chain property: the probability of each subsequent state depends only on the previous state: P(sik | si1, si2, …, sik-1) = P(sik | sik-1) • To define a Markov model, the following probabilities have to be specified: transition probabilities aij = P(si | sj) and initial probabilities πi = P(si)

  9. Example of Markov Model [State diagram: two states, ‘Rain’ and ‘Dry’, with arrows labeled by the transition probabilities below.] • Two states: ‘Rain’ and ‘Dry’. • Transition probabilities: P(‘Rain’|‘Rain’)=0.3, P(‘Dry’|‘Rain’)=0.7, P(‘Rain’|‘Dry’)=0.2, P(‘Dry’|‘Dry’)=0.8 • Initial probabilities: say P(‘Rain’)=0.4, P(‘Dry’)=0.6.

  10. Calculation of sequence probability • By the Markov chain property, the probability of a state sequence can be found by the formula: P(si1, si2, …, sik) = P(sik | sik-1) P(sik-1 | sik-2) … P(si2 | si1) P(si1) • Suppose we want to calculate the probability of a sequence of states in our example, {‘Dry’,‘Dry’,‘Rain’,‘Rain’}. • P({‘Dry’,‘Dry’,‘Rain’,‘Rain’}) = P(‘Rain’|‘Rain’) P(‘Rain’|‘Dry’) P(‘Dry’|‘Dry’) P(‘Dry’) = 0.3*0.2*0.8*0.6 = 0.0288
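As a quick check of this calculation, here is a minimal Python sketch that scores a state sequence with the transition and initial probabilities of slide 9:

```python
# Markov model from slide 9
initial = {"Rain": 0.4, "Dry": 0.6}
transition = {("Rain", "Rain"): 0.3, ("Rain", "Dry"): 0.7,   # keys are (previous, next)
              ("Dry", "Rain"): 0.2, ("Dry", "Dry"): 0.8}

def sequence_probability(states):
    """P(s1, ..., sK) = P(s1) * product over k of P(s_k | s_{k-1})."""
    p = initial[states[0]]
    for prev, curr in zip(states, states[1:]):
        p *= transition[(prev, curr)]
    return p

print(sequence_probability(["Dry", "Dry", "Rain", "Rain"]))   # 0.6 * 0.8 * 0.2 * 0.3 = 0.0288
```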

  11. Hidden Markov models • Set of states: {s1, s2, …, sN} • The process moves from one state to another, generating a sequence of states: si1, si2, …, sik, … • Markov chain property: the probability of each subsequent state depends only on the previous state: P(sik | si1, si2, …, sik-1) = P(sik | sik-1) • States are not visible, but each state randomly generates one of M observations (or visible states) {v1, v2, …, vM} • To define a hidden Markov model, the following probabilities have to be specified: • matrix of transition probabilities A = (aij), aij = P(si | sj) • matrix of observation probabilities B = (bi(vm)), bi(vm) = P(vm | si) • initial probabilities π = (πi), πi = P(si). The model is represented by M = (A, B, π).

  12. Example of Hidden Markov Model [State diagram: hidden states ‘Low’ and ‘High’ connected by transitions, each emitting the observations ‘Dry’ and ‘Rain’ with the probabilities given on the next slide.] Two states: ‘Low’ and ‘High’ atmospheric pressure.

  13. Example of Hidden Markov Model • Two states: ‘Low’ and ‘High’ atmospheric pressure. • Two observations: ‘Rain’ and ‘Dry’. • Transition probabilities: P(‘Low’|‘Low’)=0.3, P(‘High’|‘Low’)=0.7, P(‘Low’|‘High’)=0.2, P(‘High’|‘High’)=0.8 • Observation probabilities: P(‘Rain’|‘Low’)=0.6, P(‘Dry’|‘Low’)=0.4, P(‘Rain’|‘High’)=0.4, P(‘Dry’|‘High’)=0.6. • Initial probabilities: say P(‘Low’)=0.4, P(‘High’)=0.6.

  14. Calculation of observation sequence probability • Suppose we want to calculate the probability of a sequence of observations in our example, {‘Dry’,‘Rain’}. • Consider all possible hidden state sequences: • P({‘Dry’,‘Rain’}) = P({‘Dry’,‘Rain’}, {‘Low’,‘Low’}) + P({‘Dry’,‘Rain’}, {‘Low’,‘High’}) + P({‘Dry’,‘Rain’}, {‘High’,‘Low’}) + P({‘Dry’,‘Rain’}, {‘High’,‘High’}) • where the first term is: • P({‘Dry’,‘Rain’}, {‘Low’,‘Low’}) = P({‘Dry’,‘Rain’} | {‘Low’,‘Low’}) P({‘Low’,‘Low’}) = P(‘Dry’|‘Low’) P(‘Rain’|‘Low’) P(‘Low’) P(‘Low’|‘Low’) = 0.4*0.6*0.4*0.3 = 0.0288
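The full sum over hidden state sequences can be brute-forced in a few lines. This is a minimal sketch using the weather HMM of slides 12-13 (with P(‘Dry’|‘High’) taken as 0.6 so the emission probabilities normalize):

```python
from itertools import product

# Weather HMM from slides 12-13
initial = {"Low": 0.4, "High": 0.6}
transition = {"Low": {"Low": 0.3, "High": 0.7},      # transition[previous][next]
              "High": {"Low": 0.2, "High": 0.8}}
emission = {"Low": {"Rain": 0.6, "Dry": 0.4},        # emission[state][observation]
            "High": {"Rain": 0.4, "Dry": 0.6}}

def observation_probability(observations):
    """P(O) = sum over all hidden state sequences Q of P(O | Q) P(Q)."""
    total = 0.0
    for path in product(initial, repeat=len(observations)):
        p = initial[path[0]] * emission[path[0]][observations[0]]
        for t in range(1, len(observations)):
            p *= transition[path[t - 1]][path[t]] * emission[path[t]][observations[t]]
        total += p
    return total

print(observation_probability(["Dry", "Rain"]))      # sums the four terms listed above
```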

  15. Summary of Markov models State topology: the set of N states and the allowed transitions between them. Initial-state probabilities: πi = P(x1 = i), and state-transition probabilities: aij = P(xt = j | xt-1 = i). Probability of a given state sequence X = x1 x2 … xT: P(X) = πx1 ax1x2 ax2x3 … axT-1xT

  16. Summary of Hidden Markov models The probability of state i generating a discrete observation ot, which has one of a finite set of values, is bi(ot) = P(ot | xt = i). The probability distribution of a continuous observation ot, which can have one of an infinite set of values, is the pdf bi(ot) = p(ot | xt = i). We begin by considering only discrete observations.

  17. Elements of a discrete HMM, 1. Number of different hidden states N, 2. Number of different observation symbols K, 3. Initial-state probabilities π = (πi), πi = P(x1 = i), 4. State-transition probabilities A = (aij), aij = P(xt = j | xt-1 = i), 5. Discrete emission/output probabilities B = (bi(k)), bi(k) = P(ot = vk | xt = i)
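As a concrete illustration of these five elements, here is a minimal sketch encoding the weather HMM of slides 12-13 as NumPy arrays (state order Low, High; symbol order Dry, Rain; P(‘Dry’|‘High’) taken as 0.6):

```python
import numpy as np

N, K = 2, 2                        # number of hidden states and observation symbols
states = ["Low", "High"]
symbols = ["Dry", "Rain"]

pi = np.array([0.4, 0.6])          # initial-state probabilities, pi[i] = P(x1 = i)
A = np.array([[0.3, 0.7],          # state-transition probabilities, A[i, j] = P(j | i)
              [0.2, 0.8]])
B = np.array([[0.4, 0.6],          # emission probabilities, B[i, k] = P(symbol k | state i)
              [0.6, 0.4]])

# Normalization check: pi and every row of A and B must sum to one
assert np.isclose(pi.sum(), 1)
assert np.allclose(A.sum(axis=1), 1) and np.allclose(B.sum(axis=1), 1)
```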

  18. Three main issues using HMMs • Evaluation problem: compute the likelihood of a set of observations given an HMM model. • Decoding problem: decode a state sequence by calculating the most likely path, given an observation sequence and an HMM model. • Learning problem: optimize the template patterns by training the parameters of the models.

  19. Word recognition example (1) [Figure: the typed word image ‘Amherst’; the character recognizer scores a character image against each letter a, b, c, …, z (e.g. 0.5, 0.03, 0.005, …, 0.31). Hidden state = character; observation = character image.] • Typed word recognition; assume all characters are separated. • The character recognizer outputs the probability of the image being a particular character, P(image|character).

  20. Word recognition example (2) • Hidden states of the HMM = characters. • Observations = typed images of characters segmented from the image. Note that there is an infinite number of possible observations. • Observation probabilities = character recognizer scores. • Transition probabilities will be defined differently in the two subsequent models.

  21. Word recognition example (3) [Figure: separate left-to-right HMMs for the lexicon words ‘Amherst’ and ‘Buffalo’, one state per character.] • If a lexicon is given, we can construct a separate HMM model for each lexicon word. • Here, recognition of the word image is equivalent to the problem of evaluating a few HMM models. • This is an application of the Evaluation problem.

  22. Word recognition example (4) [Figure: a single HMM whose hidden states are all the characters of the alphabet.] • We can construct a single HMM for all words. • Hidden states = all characters in the alphabet. • Transition probabilities and initial probabilities are calculated from a language model. • Observations and observation probabilities are as before. • Here we have to determine the best sequence of hidden states, the one that most likely produced the word image. • This is an application of the Decoding problem.

  23. Task 1: Likelihood of an Observation Sequence • What is P(O | M)? • The likelihood of an observation sequence is the sum of the probabilities of all possible state sequences in the HMM. • Naïve computation is very expensive. Given T observations and N states, there are N^T possible state sequences. • Even small HMMs, e.g. T=10 and N=10, contain 10 billion different paths. • The solution to this and to Task 2 is to use dynamic programming.

  24. Forward Probabilities • What is the probability that, given an HMM M, at time t the state is i and the partial observation o1 … ot has been generated? αt(i) = P(o1 o2 … ot, qt = si | M)

  25. Forward Algorithm • Initialization: α1(i) = πi bi(o1), 1 ≤ i ≤ N • Induction: αt+1(j) = [Σi αt(i) aij] bj(ot+1), 1 ≤ t ≤ T−1, 1 ≤ j ≤ N • Termination: P(O | M) = Σi αT(i)
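A minimal NumPy sketch of these three steps, reusing the weather HMM arrays from the earlier sketch (state order Low/High; observations encoded as Dry=0, Rain=1):

```python
import numpy as np

def forward(pi, A, B, observations):
    """Forward algorithm: returns P(O | model) and the alpha trellis.
    pi[i] = P(x1 = i), A[i, j] = P(j | i), B[i, k] = P(symbol k | state i)."""
    T, N = len(observations), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, observations[0]]                    # initialization
    for t in range(1, T):                                    # induction
        alpha[t] = (alpha[t - 1] @ A) * B[:, observations[t]]
    return alpha[-1].sum(), alpha                            # termination

pi = np.array([0.4, 0.6])
A = np.array([[0.3, 0.7], [0.2, 0.8]])
B = np.array([[0.4, 0.6], [0.6, 0.4]])

likelihood, _ = forward(pi, A, B, [0, 1])                    # observations: Dry, Rain
print(likelihood)   # 0.232, matching the brute-force sum over hidden state sequences
```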

  26. Forward Algorithm Complexity • The naïve approach to solving Problem 1 takes on the order of 2T·N^T computations. • The forward algorithm takes on the order of N²·T computations.

  27. Character recognition with HMM example [Figure: the character ‘A’ divided into vertical slices.] • The structure of hidden states is chosen. • Observations are feature vectors extracted from vertical slices. • Probabilistic mapping from hidden state to feature vectors: 1. use a mixture of Gaussian models, or 2. quantize the feature vector space.

  28. Exercise: character recognition with HMM (1) [Figure: the characters ‘A’ and ‘B’ with a left-to-right state topology s1 → s2 → s3.] • The structure of hidden states: s1 → s2 → s3. • Observation = number of islands in the vertical slice. • HMM for character ‘A’:
  Transition probabilities {aij} = (rows/columns: s1, s2, s3)
  .8 .2 0
  0 .8 .2
  0 0 1
  Observation probabilities {bjk} = (rows: states s1–s3; columns: 1, 2, 3 islands)
  .9 .1 0
  .1 .8 .1
  .9 .1 0
  • HMM for character ‘B’:
  Transition probabilities {aij} =
  .8 .2 0
  0 .8 .2
  0 0 1
  Observation probabilities {bjk} =
  .9 .1 0
  0 .2 .8
  .6 .4 0

  29. Exercise: character recognition with HMM (2) • Suppose that after character image segmentation the following sequence of island numbers in the 4 slices was observed: • { 1, 3, 2, 1} • Which HMM is more likely to have generated this observation sequence, the HMM for ‘A’ or the HMM for ‘B’?

  30. Exercise: character recognition with HMM (3) Consider the likelihood of generating the given observation sequence {1, 3, 2, 1} for each possible sequence of hidden states:
  • HMM for character ‘A’:
  s1 s1 s2 s3 : .8 × .2 × .2 (transitions) × .9 × 0 × .8 × .9 (observations) = 0
  s1 s2 s2 s3 : .2 × .8 × .2 × .9 × .1 × .8 × .9 = 0.0020736
  s1 s2 s3 s3 : .2 × .2 × 1 × .9 × .1 × .1 × .9 = 0.000324
  Total = 0.0023976
  • HMM for character ‘B’:
  s1 s1 s2 s3 : .8 × .2 × .2 (transitions) × .9 × 0 × .2 × .6 (observations) = 0
  s1 s2 s2 s3 : .2 × .8 × .2 × .9 × .8 × .2 × .6 = 0.0027648
  s1 s2 s3 s3 : .2 × .2 × 1 × .9 × .8 × .4 × .6 = 0.006912
  Total = 0.0096768
  The HMM for ‘B’ is therefore more likely to have generated the observation sequence.
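These totals can be checked with a short brute-force script. The sketch below assumes, as the enumeration above implies, that a path must start in s1 and end in s3 (the whole character is traversed):

```python
from itertools import product
import numpy as np

A_trans = np.array([[.8, .2, 0], [0, .8, .2], [0, 0, 1]])           # shared {aij}
B_obs = {"A": np.array([[.9, .1, 0], [.1, .8, .1], [.9, .1, 0]]),   # rows: s1..s3
         "B": np.array([[.9, .1, 0], [0, .2, .8], [.6, .4, 0]])}    # cols: 1, 2, 3 islands

observations = [1, 3, 2, 1]          # island counts in the four vertical slices

def likelihood(char):
    """Sum P(observations, path) over paths that start in s1 and end in s3."""
    total = 0.0
    for path in product(range(3), repeat=len(observations)):
        if path[0] != 0 or path[-1] != 2:
            continue
        p = B_obs[char][path[0], observations[0] - 1]
        for t in range(1, len(observations)):
            p *= A_trans[path[t - 1], path[t]] * B_obs[char][path[t], observations[t] - 1]
        total += p
    return total

print(likelihood("A"), likelihood("B"))   # 0.0023976 vs 0.0096768 -> 'B' is more likely
```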

  31. Task 2: Decoding • The solution to Task 1 (Evaluation) gives us the sum over all paths through an HMM efficiently. • For Task 2, we want to find the single path with the highest probability. • We want to find the state sequence Q = q1 … qT such that P(Q | O, M) is maximized, i.e. Q* = argmaxQ P(Q | O, M)

  32. Viterbi Algorithm • Similar to computing the forward probabilities, but instead of summing over transitions from incoming states, compute the maximum. • Forward: αt+1(j) = [Σi αt(i) aij] bj(ot+1) • Viterbi recursion: δt+1(j) = [maxi δt(i) aij] bj(ot+1)

  33. Viterbi Algorithm • Initialization: δ1(i) = πi bi(o1), ψ1(i) = 0 • Induction: δt+1(j) = [maxi δt(i) aij] bj(ot+1), ψt+1(j) = argmaxi [δt(i) aij] • Termination: P* = maxi δT(i), qT* = argmaxi δT(i) • Read out path (backtracking): qt* = ψt+1(qt+1*), for t = T−1, …, 1
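A minimal NumPy sketch of these four steps, again on the weather HMM (Dry=0, Rain=1 as symbol indices):

```python
import numpy as np

def viterbi(pi, A, B, observations):
    """Viterbi algorithm: most likely state sequence and its probability."""
    T, N = len(observations), len(pi)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, observations[0]]                    # initialization
    for t in range(1, T):                                    # induction
        scores = delta[t - 1][:, None] * A                   # scores[i, j] = delta_{t-1}(i) * a_ij
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, observations[t]]
    best_last = int(delta[-1].argmax())                      # termination
    path = [best_last]
    for t in range(T - 1, 0, -1):                            # read out path by backtracking
        path.append(int(psi[t][path[-1]]))
    return delta[-1].max(), path[::-1]

pi = np.array([0.4, 0.6])
A = np.array([[0.3, 0.7], [0.2, 0.8]])
B = np.array([[0.4, 0.6], [0.6, 0.4]])

prob, path = viterbi(pi, A, B, [0, 1])                       # observations: Dry, Rain
print(prob, [["Low", "High"][s] for s in path])              # best hidden path and its probability
```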

  34. Example of Viterbi Algorithm

  35. Voice Biometrics

  36. General Description • Each individual’s voice is made up of components called phonemes; each phoneme has a pitch, cadence, and inflection. • These three give each one of us a unique voice sound. • Similarity between voices comes from cultural and regional influences in the form of accents. • The physiological and behavioral aspects of the voice biometric are influenced by our body, environment, and age.

  37. Voice Capture • Voice can be captured in two ways: • a dedicated resource like a microphone • existing infrastructure like a telephone • Captured voice is influenced by two factors: • the quality of the recording device • the recording environment • In wireless communication, voice travels through open air and then through terrestrial lines; it therefore suffers from considerable interference.

  38. Application of Voice Technology • Voice technology is applicable in a variety of areas. Those used in biometric technology include: • Voice Verification • Internet/intranet security: • on-line banking • on-line security trading • access to corporate databases • on-line information services • PC access restriction software

  39. Voice Verification • Voice biometrics works by digitizing a profile of a person's speech to produce a stored model voice print, or template. • Biometric technology reduces each spoken word to segments composed of several dominant frequencies called formants. • Each segment has several tones that can be captured in a digital format. • The tones collectively identify the speaker's unique voice print. • Voice prints are stored in databases in a manner similar to the storing of fingerprints or other biometric data.

  40. Voice Verification • Voice verification verifies the vocal characteristics against those associated with the enrolled user. • The US PORTPASS Program, deployed at remote locations along the U.S.–Canadian border, recognizes voices of enrolled local residents speaking into a handset. This system enables enrollees to cross the border when the port is unstaffed.

  41. Automatic Speech Recognition • Automatic Speech Recognition systems are different from voice (speaker) recognition systems, although the two are often confused. • Automatic Speech Recognition is used to translate the spoken word into a specific response. • The goal of speech recognition is simply to understand the spoken word, not to establish the identity of the speaker.

  42. Automatic Speech Recognition • Typical application areas of Automatic Speech Recognition: • hands-free devices, for example in-car hands-free kits • electronic devices, for example a telephone, PC, or ATM cash dispenser • software applications, for example games, educational or office software • industrial areas, warehouses, etc. • spoken multiple choice in interactive voice response systems, for example in telephony • applications for people with disabilities

  43. Difficulties in Automatic Speech Recognition (ASR) • Context variability: “Mr. Wright should write to Ms. Wright right away about his Ford or four-door Honda.” • Style variability: • isolated speech recognition is easier than continuous speech recognition • recognition of read speech is easier than recognition of conversational speech • Speaker variability: speaker-independent vs. speaker-dependent • Environment variability: background noise

  44. Task of ASR The task of speech recognition is to take as input an acoustic waveform and produce as output a string of words.

  45. Acoustic Processing of Speech Two important characteristics of a wave: • Frequency and pitch • The frequency is the number of times per second that a wave repeats itself, or cycles. • Unit: cycles per second, usually called Hertz (Hz). • The pitch is the perceptual correlate of frequency. • Amplitude and loudness • The amplitude measures the amount of air-pressure variation. • Loudness is the perceptual correlate of power, which is related to the square of the amplitude.

  46. Acoustic Processing of Speech Feature extraction • Analog-to-digital conversion • Sampling: In order to accurately measure a wave, it is necessary to have at least two samples in each cycle • One measuring the positive part of the wave • The other one measuring the negative part • Thus the maximum frequency wave that can be measured is one whose frequency is half the sample rate. • This maximum frequency for a given sampling rate is called the Nyquist frequency. • Quantization: Representing a real-valued number as an integer.
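A small sketch of these two steps under assumed parameters (a 100 Hz sine sampled at 8 kHz and quantized to 16-bit integers; the signal and rates are illustrative only):

```python
import numpy as np

fs = 8000                          # sampling rate in Hz; Nyquist frequency = fs / 2 = 4000 Hz
t = np.arange(0, 0.01, 1 / fs)     # 10 ms time axis -> 80 samples

# Sampling: measure a 100 Hz sine wave at discrete times
x = 0.5 * np.sin(2 * np.pi * 100 * t)

# Quantization: represent each real-valued sample as a 16-bit integer
x_quantized = np.round(x * 32767).astype(np.int16)

print(len(x), x_quantized[:5])
```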

  47. Acoustic Processing of Speech Spectrum • Based on the insight of Fourier that every complex wave can be represented as a sum of many simple waves of different frequencies. • Spectrum is a representation of these different frequency components.
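A minimal sketch of that idea: build a wave as a sum of two sine components and recover their frequencies from its spectrum (the 100 Hz and 300 Hz components are arbitrary choices for illustration):

```python
import numpy as np

fs = 8000
t = np.arange(0, 0.1, 1 / fs)                        # 100 ms of signal
wave = np.sin(2 * np.pi * 100 * t) + 0.5 * np.sin(2 * np.pi * 300 * t)

# Spectrum: magnitude of the discrete Fourier transform
spectrum = np.abs(np.fft.rfft(wave))
freqs = np.fft.rfftfreq(len(wave), d=1 / fs)

# The two largest peaks sit at the component frequencies
peaks = freqs[np.argsort(spectrum)[-2:]]
print(sorted(peaks))                                 # ~[100.0, 300.0]
```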

  48. Acoustic Processing of Speech Smoothing • Goal: by finding where the spectral peaks (formants) are, we can capture the characteristics of different sounds → e.g., determining vowel identity. • Linear Predictive Coding (LPC) is one of the most common methods. • The LPC spectrum is represented by a vector of features. • It is possible to use LPC features directly as the observations of HMMs.
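For orientation only, here is a bare-bones sketch of the autocorrelation method behind LPC (the Levinson-Durbin recursion). Real ASR front ends add pre-emphasis, framing, and usually convert to cepstral features, so treat this as an illustrative sketch rather than a production feature extractor:

```python
import numpy as np

def lpc_coefficients(frame, order=10):
    """Estimate LPC coefficients for one speech frame via the
    autocorrelation method (Levinson-Durbin recursion)."""
    n = len(frame)
    r = np.array([np.dot(frame[:n - k], frame[k:]) for k in range(order + 1)])  # autocorrelation
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err    # reflection coefficient
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= 1.0 - k * k                                   # remaining prediction error
    return a, err

# Toy vowel-like frame: two sinusoids shaped by a Hamming window
fs = 8000
t = np.arange(0, 0.025, 1 / fs)                              # 25 ms frame
frame = (np.sin(2 * np.pi * 500 * t) + 0.6 * np.sin(2 * np.pi * 1500 * t)) * np.hamming(len(t))

coeffs, residual = lpc_coefficients(frame, order=10)
print(coeffs)   # a feature vector of this kind can serve as an HMM observation
```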

  49. Acoustic Processing of Speech • There are 6 states detected in the spoken digit ZERO, i.e., states 1, 2, 3, 4, 6, and 7.

  50. Acoustic Processing of Speech For a given acoustic observation O, the goal of speech recognition is to find the corresponding word sequence W that has the maximum posterior probability P(W|O): Ŵ = argmaxW P(W|O) = argmaxW P(O|W) P(W) / P(O) = argmaxW P(O|W) P(W), where P(O|W) is the Acoustic Model and P(W) is the Language Model.
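A schematic sketch of this decision rule with made-up scores, just to show how the two terms combine; every word sequence and number here is hypothetical, and real systems work with log-probabilities produced by HMM acoustic models and an n-gram or neural language model:

```python
import math

# Hypothetical candidate word sequences with made-up model scores
candidates = {
    "recognize speech":   {"acoustic": 1e-25, "lm": 1e-4},   # P(O|W), P(W)
    "wreck a nice beach": {"acoustic": 3e-25, "lm": 1e-7},
}

def score(c):
    # log P(O|W) + log P(W); the evidence P(O) is the same for all W and is dropped
    return math.log(c["acoustic"]) + math.log(c["lm"])

best = max(candidates, key=lambda w: score(candidates[w]))
print(best)   # the word sequence maximizing P(O|W) P(W)
```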
