Speech Recognition Theory

Speech recognition by statistical pattern matching
[Figure: block diagram of the recognizer. Labels: speech, analysis (filter bank, LPCC, MFCC), pattern training, test pattern, reference pattern (templates or models), pattern classifier (dynamic time warping, search algorithm), decision logic, recognized speech.]

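Speech recognition by statistical pattern matching
• Common problem: given a sequence of speech feature vectors X, find the corresponding word sequence W
• Solution: $\arg\max_W P(W \mid X) = \arg\max_W P(X \mid W)\,P(W)$
• Bayes decision theory
  • If P(X|W) and P(W) can be obtained, the maximum a posteriori (MAP) decoder guarantees the minimum recognition error
• Statistical pattern recognition process
  • Training: given training feature-vector sequences X and the form P of the probability distributions of feature-vector sequences and word sequences, estimate the parameters of the distribution P(X|W) for each word (class) W and the parameters of the distribution P(W) for each word sequence W
  • Recognition: MAP decoding over the recognition network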

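HMM-based speech recognition
• Basic characteristic
  • A statistical pattern recognition method built on probabilistic models
• Advantages
  • Widely used, because one consistent method covers everything from small-vocabulary isolated word recognition to speaker-independent, large-vocabulary continuous speech recognition
  • Good performance, flexible structure, excellent extensibility
• Disadvantages
  • Requires a large speech database for training
  • Difficult to implement on specialized hardware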

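Markov process
[Figure: three-state Markov chain (states 1, 2, 3) with transition probabilities a11, a12, a13, a21, a22, a23, a31, a32, a33.]
• Discrete-time, first-order Markov chain
• The probability of moving to the next state depends only on the current state: $P[q_t = j \mid q_{t-1} = i, q_{t-2} = k, \ldots] = P[q_t = j \mid q_{t-1} = i]$
• The transition probabilities do not depend on time: $a_{ij} = P[q_t = j \mid q_{t-1} = i]$
• Each state corresponds to an observable event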

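Markov process example
[Figure: three-state weather model with states 1 (rain or snow), 2 (cloudy), 3 (sunny); arc labels 0.4, 0.6, 0.3, 0.2, 0.1, 0.3, 0.1, 0.2, 0.8.]
• Weather estimation problem
  • State 1: rain or snow
  • State 2: cloudy
  • State 3: sunny
• Problem: what is the probability of "sun-sun-sun-rain-rain-sun-cloudy-sun"?
• Solution: $P(O \mid M) = P[3,3,3,1,1,3,2,3 \mid M] = P[3]\,P[3 \mid 3]^2\,P[1 \mid 3]\,P[1 \mid 1]\,P[3 \mid 1]\,P[2 \mid 3]\,P[3 \mid 2] = 1.536 \times 10^{-4}$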
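The product on this slide can be checked numerically. The sketch below assumes the transition matrix of the standard three-state weather example, which is consistent with the arc labels in the figure, and takes P[3] = 1 (the first day is known to be sunny). The name `sequence_probability` and the 0-based state indices are illustrative choices, not part of the slides.

```python
# Assumed transition matrix (row = current state, column = next state).
# 0-based states: 0 = rain or snow, 1 = cloudy, 2 = sunny.
A = [
    [0.4, 0.3, 0.3],
    [0.2, 0.6, 0.2],
    [0.1, 0.1, 0.8],
]


def sequence_probability(states, transition, initial_prob=1.0):
    """Probability of a state sequence under a first-order Markov chain."""
    prob = initial_prob
    # Multiply the transition probability of every consecutive pair of states.
    for prev, cur in zip(states, states[1:]):
        prob *= transition[prev][cur]
    return prob


# sun-sun-sun-rain-rain-sun-cloudy-sun, i.e. states 3,3,3,1,1,3,2,3 on the slide
weather = [2, 2, 2, 0, 0, 2, 1, 2]
p = sequence_probability(weather, A)
print(f"{p:.4e}")  # prints 1.5360e-04
```

Under these assumptions the result agrees with the value 1.536 × 10^-4 given on the slide.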

Hidden Markov model (HMM) theory
[Figure: three-state HMM with transitions a_ii, a_ij, a_jj and output distributions b1(x), b2(x), b3(x) over the symbols A, B, C, D. State sequence: s1 s3 s1 s2 s2 s3. Observed sequence (X): A C D B A C.]
• Doubly stochastic process
  • Hidden probabilistic state transition process
  • Observable probabilistic output process
• State transition probability: $a_{ij} = p(s_j \mid s_i)$
• Output probability distribution: $b_i(x) = p(x \mid s_i)$

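Hidden Markov modeling example: Urn-and-Ball Model
[Figure: Urn 1, Urn 2, and Urn 3 behind a veil; observed ball sequence.]
• Each urn contains balls of four colors.
• Pick an urn at random, draw a ball at random from it, and call out its color.
• Put the ball back into the urn it came from.
• Repeat the process, calling out the color each time.
• The problem: from the sequence of colors alone, find the order in which the urns were chosen.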

Parameters that define an HMM

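Generating an observation sequence from an HMM
1. The first state q_1 = i is chosen according to the initial state probability π.
2. The time is set to t = 1.
3. In state i, one output event o_t = v_k is emitted according to the output probability distribution b_i(k).
4. A transition to the next state q_{t+1} = j takes place according to the transition probability distribution a_ij of state i.
5. The time is increased by one (t = t + 1). If the final time has not been passed (t ≤ T), return to step 3; otherwise stop generating events.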
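The generation procedure can be sketched in a few lines. This is a minimal illustration for a discrete HMM, assuming toy parameters in the spirit of the urn-and-ball model (three urns, four colors); the numbers, the color names, and the function name `sample_hmm` are hypothetical.

```python
import random


def sample_hmm(pi, A, B, symbols, T, rng):
    """Generate T observations (and the hidden states) from a discrete HMM."""
    states = range(len(pi))
    q = rng.choices(states, weights=pi)[0]       # step 1: draw the first state from pi
    hidden, observed = [], []
    for _ in range(T):                           # steps 2 and 5: t runs from 1 to T
        k = rng.choices(range(len(symbols)), weights=B[q])[0]
        observed.append(symbols[k])              # step 3: emit o_t = v_k using b_q(k)
        hidden.append(q)
        q = rng.choices(states, weights=A[q])[0] # step 4: move to q_{t+1} using a_qj
    return hidden, observed


# Hypothetical toy model: 3 urns (states), 4 ball colors (symbols).
symbols = ["red", "green", "blue", "yellow"]
pi = [1 / 3, 1 / 3, 1 / 3]
A = [[0.6, 0.2, 0.2], [0.3, 0.4, 0.3], [0.1, 0.3, 0.6]]
B = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.7, 0.1, 0.1], [0.1, 0.1, 0.4, 0.4]]

hidden, observed = sample_hmm(pi, A, B, symbols, T=8, rng=random.Random(0))
print(observed)  # the listener hears only the colors, not the urn sequence
```

Passing the random generator in explicitly keeps the sketch reproducible for a fixed seed.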

Three basic problems of HMMs and their solutions
• To apply HMMs to real problems, solutions to the following three problems are needed.
• Evaluation problem: computing the probability that the observed event sequence X is generated
  • Forward algorithm or backward algorithm
• Decoding problem: estimating which state transitions produced the observed event sequence X
  • Viterbi algorithm
• Estimation problem: estimating the HMM parameters through training
  • Forward-backward algorithm (or Baum-Welch algorithm)

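Forward Algorithm
[Figure: transitions between states i and j.]
• Given an observation sequence X and a model parameter set λ, compute the probability P(X|λ) efficiently.
• Direct computation
  • Cost: on the order of $T N^T$
• Forward algorithm
  • Define
  • Initialization
  • Recursion
  • Termination
  • Cost: on the order of $N^2 T$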
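The four steps named on this slide (define, initialization, recursion, termination) can be sketched as follows. The code uses the textbook forward variable alpha_t(i) = P(x_1..x_t, q_t = i | λ) for a discrete HMM; the two-state toy model and the name `forward` are assumptions made for illustration.

```python
def forward(pi, A, B, obs):
    """Return P(X | lambda) for a discrete HMM; obs holds symbol indices."""
    N = len(pi)
    # Initialization: alpha_1(i) = pi_i * b_i(x_1)
    alpha = [pi[i] * B[i][obs[0]] for i in range(N)]
    # Recursion: alpha_{t+1}(j) = (sum_i alpha_t(i) * a_ij) * b_j(x_{t+1})
    for x in obs[1:]:
        alpha = [
            sum(alpha[i] * A[i][j] for i in range(N)) * B[j][x]
            for j in range(N)
        ]
    # Termination: P(X | lambda) = sum_i alpha_T(i)
    return sum(alpha)


# Hypothetical two-state, two-symbol model.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]

print(forward(pi, A, B, [0, 1]))
```

Each time step costs N² multiplications, which is where the N²T figure comes from; enumerating every state sequence instead grows as N^T.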

Backward Algorithm
[Figure: transitions between states i and j.]
• Computes the probability P(X|λ) efficiently in the backward direction.
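A matching sketch of the backward pass, assuming the textbook backward variable beta_t(i) = P(x_{t+1}..x_T | q_t = i, λ) and the same hypothetical two-state model; it should return the same P(X|λ) as a forward pass over the same data.

```python
def backward(pi, A, B, obs):
    """Return P(X | lambda) computed from the end of the sequence."""
    N = len(pi)
    # Initialization: beta_T(i) = 1
    beta = [1.0] * N
    # Recursion: beta_t(i) = sum_j a_ij * b_j(x_{t+1}) * beta_{t+1}(j)
    for x in reversed(obs[1:]):
        beta = [
            sum(A[i][j] * B[j][x] * beta[j] for j in range(N))
            for i in range(N)
        ]
    # Termination: P(X | lambda) = sum_i pi_i * b_i(x_1) * beta_1(i)
    return sum(pi[i] * B[i][obs[0]] * beta[i] for i in range(N))


# Hypothetical two-state, two-symbol model.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]

print(backward(pi, A, B, [0, 1]))
```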

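Viterbi Algorithm
[Figure: transitions between states i and j.]
• Given an observation sequence X and a model parameter set λ, find the optimal state sequence with which the model generates this observation sequence.
• Viterbi algorithm
  • Computation in the log domain
  • Cost: $N^2 T$ additions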
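A minimal log-domain sketch of the Viterbi search for a discrete HMM. Working with log probabilities turns the products into additions, as the slide notes. The toy model and the name `viterbi` are illustrative assumptions.

```python
import math


def viterbi(pi, A, B, obs):
    """Most likely state sequence and its log probability."""
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    N = len(pi)
    # Initialization: delta_1(i) = log pi_i + log b_i(x_1)
    delta = [log(pi[i]) + log(B[i][obs[0]]) for i in range(N)]
    backptr = []
    for x in obs[1:]:
        prev, step, delta = delta, [], []
        for j in range(N):
            # Best predecessor of state j at this frame
            best = max(range(N), key=lambda i: prev[i] + log(A[i][j]))
            step.append(best)
            delta.append(prev[best] + log(A[best][j]) + log(B[j][x]))
        backptr.append(step)
    # Termination and backtracking
    last = max(range(N), key=lambda i: delta[i])
    path = [last]
    for step in reversed(backptr):
        path.append(step[path[-1]])
    path.reverse()
    return path, delta[last]


# Hypothetical two-state, two-symbol model.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]

path, score = viterbi(pi, A, B, [0, 1])
print(path, score)
```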

Baum-Welch Algorithm (I)
• Given training observation sequences X and an initial model parameter set λ = (A, B, π), find a new model parameter set that maximizes the probability P(X|λ).
• Maximum likelihood (ML) estimation
(continued)

Baum-Welch Algorithm (II)

Types of HMM structure
• Ergodic model
  • Every a_ij is positive.
  • Every transition is possible.
• Left-to-right (Bakis) model
  • a_ij = 0 for j < i
  • Cannot go backward
  • Good for modeling the temporal structure of speech signals

Types of HMM output probability distribution
[Figure labels: b_j(k), b_j(x), b_j(x), codebook.]
• Discrete HMM
  • Discrete symbols
• Continuous HMM: one codebook per state
  • Mixture Gaussian pdf
• Semi-continuous HMM: globally shared codebook

Continuous-density HMM
• Re-estimation formula

HMM variants useful for speech recognition
• Null transition
  • Produces no output
  • Jumps from one state to another without producing an observation
  • Keeps the network simple
• Explicit state duration modeling
  • The duration of a state with self-transition probability a_ii has an exponential (geometric) distribution: $p_i(d) = a_{ii}^{d-1}(1 - a_{ii})$
  • Actual state duration has a Gamma distribution
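The implicit duration model can be evaluated directly from the formula p_i(d) = a_ii^(d-1) (1 - a_ii). The sketch below uses a hypothetical self-transition probability of 0.8 and checks two properties of this distribution: the probabilities sum to one and the mean duration is 1 / (1 - a_ii) frames.

```python
def duration_pmf(a_ii, d):
    """Probability of staying exactly d frames in a state with self-loop a_ii."""
    return a_ii ** (d - 1) * (1.0 - a_ii)


a_ii = 0.8  # hypothetical self-transition probability
durations = range(1, 200)  # long enough that the truncated tail is negligible
probs = [duration_pmf(a_ii, d) for d in durations]

total = sum(probs)
mean = sum(d * p for d, p in zip(durations, probs))
print(total, mean)  # close to 1 and to 1 / (1 - a_ii) = 5
```

The probability is largest at d = 1 and decays monotonically, which is why an explicit duration model is considered when real segment durations peak at some longer length.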

HMM parameter estimation methods
• Maximum likelihood estimation (Baum-Welch algorithm, based on the EM algorithm): $\lambda' = \arg\max_\lambda P(X \mid \lambda)$
• Maximum mutual information estimation: $\lambda' = \arg\max_\lambda I_\lambda(W; X)$, W: word
• Discriminative training for minimum classification error
  • Minimum recognition error training
  • Generalized probabilistic descent
  • Corrective training

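Part II
• Practical considerations when implementing an HMM-based speech recognizer
• Word model training with HMMs
• Isolated word recognition with word-level HMMs
• Overview of phone model training and isolated word recognition
• Phone model training procedure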

Practical considerations when implementing an HMM-based speech recognizer (I)
• Scaling
  • To prevent underflow
  • Use normalized α
  • Same recursion
• Multiple observation
  • Re-estimation with L observations
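One common way to apply scaling is to normalize the forward variable at every frame and accumulate the logarithms of the scale factors, which yields log P(X|λ) without underflow. The sketch below is one such variant under that assumption; the toy model and the name `forward_log_likelihood` are hypothetical.

```python
import math


def forward_log_likelihood(pi, A, B, obs):
    """log P(X | lambda) using a forward pass that is rescaled at each frame."""
    N = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(N)]
    log_prob = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Same recursion as the unscaled forward algorithm
            x = obs[t]
            alpha = [
                sum(alpha[i] * A[i][j] for i in range(N)) * B[j][x]
                for j in range(N)
            ]
        scale = sum(alpha)                   # scale factor for frame t
        alpha = [a / scale for a in alpha]   # normalized alpha sums to 1
        log_prob += math.log(scale)          # log P(X) = sum of log scale factors
    return log_prob


# Hypothetical two-state, two-symbol model.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]

long_obs = [0, 1] * 1000  # 2000 frames; the unscaled product would underflow
print(forward_log_likelihood(pi, A, B, long_obs))
```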

Practical considerations when implementing an HMM-based speech recognizer (II)
• Initial estimates of HMM parameters
  • HMM training converges to a local maximum, so good initial parameter estimates are necessary for rapid and proper convergence.
  • π, A: random or uniform
  • B: essential
    • Manual segmentation of the observation sequence into states, then averaging the observations within each state
    • Maximum likelihood segmentation of observations and averaging
    • Segmental k-means segmentation with clustering

Practical considerations when implementing an HMM-based speech recognizer (III)
• Size of training data
  • Too little data may cause an under-estimation problem
  • Solutions:
    • Increase the training data: the more, the better
    • Reduce the model size, or tie (share) parameters
    • Enhance the reliability of the parameter estimates: deleted interpolation
• Choice of model
  • Task dependency
  • Number of states
  • HMM type: ergodic or left-to-right
  • Observation densities: discrete or continuous, single or mixture Gaussian

Correspondence between HMMs and speech signals
[Figure: speech signal over time, divided into frames with a frame shift; features per frame; HMM states labeled ㅕ, ㄹ, ㅓ.]
• Assume speech signals are generated by an HMM
• Find the model parameters from training data
• Compute the probability of the test speech under each HMM and select the model with the maximum likelihood
• Left-to-right model: easy to model signals whose properties change over time in a successive manner, e.g., speech

Word model training with HMMs
[Figure: training flowchart. Labels: speech database (waveform), feature extraction (feature), Baum-Welch re-estimation, Converged? (yes: end; no: repeat), word HMMs λ1, λ2, ..., λ7; example words il, i, chil.]

Model parameters that must be defined before training
• Number of states (N)
  • Number of phones in a word
  • Digit recognition: English N=9, Korean N=6
  • Average number of observations in a spoken word
• Feature
  • LPC-cepstrum or MFCC
  • Derivatives
• Observation density
  • Discrete: M=512-1024 codewords
  • Continuous: M=1-64, diagonal covariance

Isolated word recognition with word-level HMMs
[Figure: recognition flow. Labels: speech, feature extraction, likelihood computation P(X|λ1) with the HMM for word 1 (λ1, e.g. Seoul), ..., likelihood computation P(X|λV) with the HMM for word V (λV), select maximum, recognized word.]
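The recognition flow in this figure amounts to scoring the input against every word model and picking the maximum. The sketch below illustrates that decision with two hypothetical left-to-right discrete HMMs labeled "il" and "i"; the model parameters, the symbol alphabet, and the function names are made up for the example, and a real system would score feature vectors rather than two-valued symbols.

```python
def forward(pi, A, B, obs):
    """P(X | lambda) for a discrete HMM (unscaled forward algorithm)."""
    N = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(N)]
    for x in obs[1:]:
        alpha = [
            sum(alpha[i] * A[i][j] for i in range(N)) * B[j][x]
            for j in range(N)
        ]
    return sum(alpha)


# Hypothetical word models: (pi, A, B), two states each, left-to-right.
LEFT_TO_RIGHT = [[0.5, 0.5], [0.0, 1.0]]
word_models = {
    "il": ([1.0, 0.0], LEFT_TO_RIGHT, [[0.9, 0.1], [0.1, 0.9]]),
    "i": ([1.0, 0.0], LEFT_TO_RIGHT, [[0.1, 0.9], [0.9, 0.1]]),
}


def recognize(obs, models):
    """Return the word whose HMM gives the observation the highest likelihood."""
    scores = {word: forward(*params, obs) for word, params in models.items()}
    return max(scores, key=scores.get), scores


word, scores = recognize([0, 0, 1, 1], word_models)
print(word, scores)
```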

Example performance evaluation of word-level HMM recognition
• Training data: 10 digits × 100 speakers
• Test data: 10 digits × 100 speakers
• Results (error rates):
  • Speaker-dependent test
    • LPC/DTW 1.6%, LPC/VQ/DTW (discrete) 3.5%
    • CD/HMM 1.6%, VQ/HMM (discrete) 3.7%
  • Speaker-independent test
    • LPC/DTW 1.6%
    • CD/HMM 1.6%
• Discussion:
  • Performance degrades when VQ is used.
  • DTW and continuous HMMs are comparable.
• Ref: Rabiner & Juang, Fundamentals of Speech Recognition

Overview of phone model training and isolated word recognition
[Figure: training and recognition flow. Labels: training words; subword model generation ("s", "i", "l", "a", …); subword models; lexicon (일: i l, 이: i, 삼: s a m); word model generation (HMM "일": i l, HMM "이": i, HMM "삼": s a m); test word; feature extraction; likelihood computation; decision rule; recognized word (일).]

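Phone model training procedure
[Figure: training flowchart. Labels: speech database (waveform), feature extraction (feature), word model generation (lexicon, subword model), Baum-Welch re-estimation, Converged? (yes: end; no: repeat); example words il, i, chil; phone models λ_i, λ_l, λ_ch.]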

Segmental K-means training method
[Figure: training flowchart. Labels: speech database (labeled utterances or unlabeled transcriptions), feature extraction (feature), build word model (lexicon, phone model), Viterbi segmentation, K-means clustering and accumulation of statistics, model update, termination condition (yes: end; no: repeat); example words il, i, chil segmented into the phones i, l, ch.]

Comparison of acoustic model units
• Word unit
  • Each word is modeled by one HMM
  • High accuracy for small vocabularies
  • A large number of parameters
  • Models must be retrained if the vocabulary changes
• Subword unit
  • A word HMM is a combination of units smaller than a word
  • A new word model can be built by combining subword models
  • Vocabulary-independent systems can be built
  • Needs less training data because units are shared among words
• Example
  • Word: 일 il, 이 i, 삼 sam, 칠 chil
  • Phoneme: 일 i l, 이 i, 삼 s a m, 칠 ch i l

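Part III
• The need for interdisciplinary research in continuous speech recognition
• HMM-based continuous speech recognition
• Parameter sharing for robust acoustic modeling
• Search algorithms for continuous speech recognition
• Performance evaluation methods for continuous speech recognition
• Connected digit recognition example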

The need for interdisciplinary research in continuous speech recognition
• Phonetics: relationship between the speech signal and the human vocal tract and hearing mechanisms
• Linguistics: phonology, syntax, semantics
• Communication theory and information theory: parameter estimation, coding and decoding algorithms (stack decoding, Viterbi decoding), information-theoretic distance measures
• Signal processing: spectral analysis, feature extraction, time-varying signal modeling, noise-robust features
• Pattern recognition: clustering and matching of speech patterns, Bayesian learning
• Computer science: efficient search algorithms, etc.

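HMM-based continuous speech recognition theory
• Continuous speech recognition can be formulated as: given a sequence of speech feature vectors X, find the word sequence W that maximizes the posterior probability below.
  $W' = \arg\max_W P(W \mid X) = \arg\max_W P(X \mid W)\,P(W)$
• P(X|W): acoustic model (ML probability)
• P(W): language model (a priori probability)
• In the end, the task is to use a search algorithm to find the word sequence with the highest probability over the whole space of recognizable word sequences.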

HMM-based continuous speech recognition process
[Figure: system diagram. Labels: speech signals, feature extraction, search, word sequence ("One two three."); network construction (example words one, two, nine, oh); acoustic model (HMM estimation from a speech database); vocabulary and dictionary; language model (LM estimation from text corpora).]

Typical feature extraction method
[Figure: MFCC pipeline with the blocks x(n), Hamming window, FFT, |.|, mel-scale filter bank, log, DCT; plot of the MFCC filter weights (0 to 1) against frequency (0 to 4 kHz).]

Acoustic modeling
• Subword units
  • Context-independent
    • Phoneme
    • Syllable
    • Small number of parameters
  • Context-dependent
    • Diphone
    • Triphone
    • Quinphone
    • High accuracy
    • Large number of units
    • Large training data
• Examples
  • Diphone: 일 sil-i i-l l-sil, 이 sil-i i-sil, 삼 sil-s s-a a-m m-sil
  • Triphone: 일 sil-i+l i-l+sil, 이 sil-i+sil, 삼 sil-s+a s-a+m a-m+sil
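The diphone and triphone names in the examples follow a simple pattern: pad the phone string with silence on both sides, then name each unit after its neighbors (left-center+right for triphones). A minimal sketch of that expansion, where the function names and the `boundary` parameter are illustrative:

```python
def to_diphones(phones, boundary="sil"):
    """Expand a phone sequence into 'left-right' diphone names."""
    padded = [boundary] + list(phones) + [boundary]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


def to_triphones(phones, boundary="sil"):
    """Expand a phone sequence into 'left-center+right' triphone names."""
    padded = [boundary] + list(phones) + [boundary]
    return [
        f"{padded[k - 1]}-{padded[k]}+{padded[k + 1]}"
        for k in range(1, len(padded) - 1)
    ]


lexicon = {"일": ["i", "l"], "이": ["i"], "삼": ["s", "a", "m"]}
for word, phones in lexicon.items():
    print(word, to_diphones(phones), to_triphones(phones))
```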

Network construction for acoustic model training
[Figure: three levels of expansion. Sentence HMM: word sequences such as ONE TWO ONE THREE. Word HMM: ONE = W AH N. Phone HMM: W = states 1, 2, 3.]
• Sentence model = (word1 word2 … wordN)
• Word model = (phone1 phone2 … phoneM)
• Phone model = (state1 state2 state3)

Parameter sharing for robust acoustic modeling (I)
[Figure: example triphones t-ih+n, t-ih+ng, f-ih+l, s-ih+l.]
• Share parameters (mean, variance, transition probabilities) if two states have similar distributions
• Data-driven clustering
(continued)

Parameter sharing for robust acoustic modeling (II)
• Decision tree-based clustering
  • For large-vocabulary systems
  • Can handle unseen triphones
  • How can we get a model for an unseen triphone such as t-aw+ih?
[Figure: decision tree that clusters the center states of phone /aw/ (e.g. s-aw+n, t-aw+n, s-aw+t) with yes/no questions: R=consonant?, R=nasal?, L=nasal?, L=stop?; leaf nodes 1 to 5. The states in each leaf node are tied.]

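Grammar network construction for recognition
[Figure: two grammar networks starting at START: a digit network (one, two, ..., nine, oh) and a "call" network (call followed by John or Tom).]
• Example sentences: One / One two one three / Seven oh two five / … / Call John / Call Tom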

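Detailed structure of the full recognition network
[Figure: network between start and end nodes. Word models: 이 (I) with P(이|x), 일 (I L) with P(일|x), 사 (S A) with P(사|x), 삼 (S A M) with P(삼|x). Labels: intra-word transition, word transition, between-word transition; the LM is applied at word transitions.]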

Connection rules between models
[Figure labels: acoustic log-likelihood, LM score, word insertion penalty.]
• Intra-word transitions
• Between-word transitions
  • Add the LM score and the word insertion penalty
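The connection rule can be read as a score update in the log domain: every transition adds the acoustic log-likelihood, and a between-word transition additionally adds the LM score and the word insertion penalty. The sketch below assumes the LM score is the log of a word probability and omits any LM scaling factor; the function name and all numbers are hypothetical.

```python
import math


def extend_path_score(score, acoustic_loglik, lm_prob=None, word_penalty=0.0):
    """Add one transition's contribution to a path's log score."""
    score += acoustic_loglik      # every transition adds the acoustic term
    if lm_prob is not None:       # between-word transition only
        score += math.log(lm_prob) + word_penalty
    return score


# Intra-word transition: acoustic term only.
intra = extend_path_score(-10.0, acoustic_loglik=-2.0)

# Between-word transition: acoustic term, LM score, and insertion penalty.
between = extend_path_score(-10.0, acoustic_loglik=-2.0,
                            lm_prob=0.5, word_penalty=-1.0)
print(intra, between)
```

A more negative penalty discourages the search from inserting many short words.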

Search algorithms for continuous speech recognition
• Search is the problem of finding the optimal path through the search network.
• One-Pass (One-Stage) dynamic programming algorithm
  • Extension of dynamic programming (DP) strategies for connected pattern matching
  • Fast, simple, efficient, small memory
  • Frame synchronous
  • Widely used for continuous speech recognition

Comparison of DTW and the Viterbi algorithm
[Figure: DTW grid with the reference index (0 to 5) against frame/time (1 to 6), and an HMM trellis with states s1, s2, s3 against frame/time (0 to 6).]
• DTW vs. Viterbi algorithm
  • Reference template ↔ State
  • Path constraint ↔ Transition prob
  • Local distance ↔ Observation prob
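For comparison with the Viterbi recursion, a minimal DTW sketch is shown below. It assumes one-dimensional features, an absolute-difference local distance, and the simplest path constraint (horizontal, vertical, or diagonal steps); a recognizer would use multi-dimensional feature vectors and often other constraints.

```python
def dtw_distance(ref, test):
    """Accumulated distance of the best warping path between two sequences."""
    INF = float("inf")
    n, m = len(ref), len(test)
    # D[i][j]: best accumulated distance aligning ref[:i] with test[:j]
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(ref[i - 1] - test[j - 1])   # local distance
            # Path constraint: arrive from the left, from below, or diagonally
            D[i][j] = local + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]


reference = [1.0, 2.0, 3.0]
stretched = [1.0, 2.0, 2.0, 3.0]  # same pattern, spoken more slowly
print(dtw_distance(reference, stretched))
```

The structure mirrors the table on the slide: the reference frames play the role of states, the allowed steps play the role of transition probabilities, and the local distance plays the role of the observation probability.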

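One-Pass DP algorithm
[Figure: search lattice with axes time t (1 to T), state index i within each word (1 to S(k)), and word index k (1 to V); grid point (t, i, k); intra-word and between-word transitions; example words 일, 이, 구, 공.]
• Representation in lattice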
