
Speech Recognition as a Pattern Matching Problem


Presentation Transcript


  1. Speech Recognition as a Pattern Matching Problem
  • Input waveform = X
  • Each allophone Y modeled by parameters Λ(Y)
  • Acoustic Model p(X|Y) modeled by a parameterized function f(X, Λ)
  • Language Model p(Y1, …, YN, W1, …, WM) = probability of word & allophone sequence, modeled using very big lookup tables
  • Recognized word string (a toy sketch of this search follows below):

    (W1, …, WM) = argmax_W Σ_Y p(X | Y1, …, YN) p(Y1, …, YN, W1, …, WM)
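For concreteness, here is a minimal Python sketch of that argmax over word strings, assuming a toy setup in which each candidate word string W already comes with its compatible allophone sequences Y and precomputed log-probabilities; every name below is illustrative, not from the slides.

```python
import math

def logsumexp(vals):
    """Numerically stable log of a sum of exponentials."""
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))

def recognize(candidates):
    """Return W* = argmax_W sum_Y p(X|Y) p(Y, W), computed in log space.

    candidates: dict mapping a word string W to a list of
    (log_p_X_given_Y, log_p_Y_W) pairs, one per allophone sequence Y
    consistent with W (a hypothetical precomputed lattice)."""
    best_w, best_score = None, -math.inf
    for w, terms in candidates.items():
        score = logsumexp([lx + ly for lx, ly in terms])
        if score > best_score:
            best_w, best_score = w, score
    return best_w
```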

  2. The Problems of Speech Recognition
  • What allophones should be distinguished?
    • Minimum: ~50 phonemes, including schwa
    • Left and right neighboring phonemes?
    • Unreleased vs. released stops?
    • Function word vs. content word?
    • Lexical stress, onset vs. coda, talker gender?
  • What acoustic features X?
    • Spectrum once per 10 ms; pitch discarded
  • What is the acoustic model f(X, Λ)?
    • A CDHMM (continuous-density hidden Markov model); a sketch of its emission density follows below
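Since the slide names a CDHMM as the acoustic model, here is a minimal sketch of the per-state emission density f(X, Λ), assuming the usual diagonal-covariance Gaussian mixture; the parameterization is generic, not taken from the slides.

```python
import numpy as np

def log_gmm_density(x, weights, means, variances):
    """log f(x, Lambda) for one CDHMM state: log sum_k w_k N(x; mu_k, diag(var_k)).

    x: one acoustic feature vector (e.g., one 10 ms spectral frame);
    weights: (K,); means, variances: (K, d) mixture parameters."""
    x = np.asarray(x, dtype=float)
    log_terms = []
    for w, mu, var in zip(weights, means, variances):
        # log of a diagonal-covariance Gaussian density
        ll = -0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mu) ** 2 / var)
        log_terms.append(np.log(w) + ll)
    m = max(log_terms)  # log-sum-exp over mixture components
    return m + np.log(sum(np.exp(t - m) for t in log_terms))
```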

  3. Prosody-Dependent Allophones
  • 100 monophones (incl. schwa, unreleased vs. released stops, function vs. content)
  • Split based on prosodic context: 200-600 prosody-dependent monophones
  • Split based on left, right phonemes: 300-6000 prosody-dependent triphones (the combinatorics are sketched below)
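The growth of the inventory is a cross-product that clustering then prunes; a minimal sketch, with hypothetical label conventions:

```python
from itertools import product

# Illustrative fragments of the inventories; the real system has ~100 monophones.
monophones = ["aa", "ae", "b", "d"]
accents = ["acc", "unacc"]
positions = ["initial", "medial", "final"]

# 100 monophones x up to 6 binary prosodic contexts -> 200-600 units,
# once contexts that do not matter for a given phone are merged.
prosody_monophones = [f"{ph}_{a}_{pos}"
                      for ph, a, pos in product(monophones, accents, positions)]

# Splitting by left and right phoneme would explode combinatorially;
# decision-tree clustering keeps only 300-6000 prosody-dependent triphones.
triphones = [f"{l}-{ph}+{r}"
             for l, ph, r in product(monophones, prosody_monophones, monophones)]
```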

  4. Prosodic Contexts that Might Matter
  • Accented vs. Unaccented
    • If word has a pitch accent, phones in the primary-stress syllable are "accented"
  • Phrase-Initial vs. Phrase-Medial
    • If word is phrase-initial, phones in onset and nucleus of 1st syllable are "phrase-initial"
  • Phrase-Final vs. Phrase-Medial
    • If word is phrase-final, phones in nucleus and coda of last syllable are "phrase-final"
  • How many levels of "phrase" should we model? How many levels of "accent"?
    • Boston Radio News database has only enough data for binary distinctions: IP/non-IP, accent/non-accent.
  (These tagging rules are sketched in code below.)
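The three rules translate directly into code; a minimal sketch, assuming a hypothetical word record with syllable structure and prosodic flags (every field name here is an assumption):

```python
def tag_phones(word):
    """word: {"syllables": [{"onset": [...], "nucleus": [...], "coda": [...]}, ...],
              "pitch_accent": bool, "stress": int (primary-stress syllable index),
              "phrase_initial": bool, "phrase_final": bool}
    Returns (phone, accented, phrase_initial, phrase_final) tuples."""
    tagged = []
    last = len(word["syllables"]) - 1
    for i, syl in enumerate(word["syllables"]):
        for part in ("onset", "nucleus", "coda"):
            for phone in syl.get(part, []):
                accented = word["pitch_accent"] and i == word["stress"]
                initial = word["phrase_initial"] and i == 0 and part in ("onset", "nucleus")
                final = word["phrase_final"] and i == last and part in ("nucleus", "coda")
                tagged.append((phone, accented, initial, final))
    return tagged
```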

  5. Which Prosodic Contexts Matter? Method
  • Train Λ(Y) to maximize log p(X(train, Y) | Y)
  • Measure log p(X(test, Y) | Y)
  • For an accent-dependent allophone Y, does phrase position matter? Compare (as sketched below):

    log p(X(test, Y) | Y)  <?  (1/3) [ log p(X(test, Y_initial) | Y_initial)
                                     + log p(X(test, Y_medial) | Y_medial)
                                     + log p(X(test, Y_final) | Y_final) ]
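A minimal sketch of that comparison, assuming the test log-likelihoods have already been computed from the trained models (the one-third averaging convention is taken directly from the slide):

```python
def phrase_position_matters(ll_pooled, ll_initial, ll_medial, ll_final):
    """True if the position-split models explain held-out data better:
    log p(X(test,Y)|Y) under the pooled model is smaller than the average
    of the three position-specific test log-likelihoods."""
    return ll_pooled < (ll_initial + ll_medial + ll_final) / 3.0
```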

  6. Which Prosodic Contexts Matter? Vowel Results: Everything Matters
  • Phrase-initial vowels that vary by accent: 7/12
    • aa, ae, ah, ao, ay, ih, iy
  • Phrase-medial vowels that vary by accent: 13/15
    • all but uh, ax
  • Phrase-final vowels that vary by accent: 6/8
    • all but uh, ao
  • Accented vowels that vary by position: 12/14
    • all but uh, oy
  • Unaccented vowels that vary by position: 10/14
    • all but uh, ey, ay, ao

  7. Which Prosodic Contexts Matter? Syllable-Initial Consonants
  • Phrase-initial onsets that vary by accent: 4/13
    • b, h, r, t
  • Phrase-medial onsets that vary by accent: 20/21
    • all but z
  • Accented onsets that vary by position: 3/14
    • s, r, f
  • Unaccented onsets that vary by position: 18/21
    • all but y, g, ch

  8. Which Prosodic Contexts Matter? Syllable-Final Consonants
  • Phrase-medial codas that vary by accent: 17/19
    • all but sh, v
  • Phrase-final codas that vary by accent: 5/15
    • d, f, r, v, z
  • Accented codas that vary by position: 14/16
    • all but ch, d, g
  • Unaccented codas that vary by position: 17/21
    • all but ch, g, p, v

  9. Which Prosodic Contexts Matter? A Model of the Results
  [Figure: summary model of the results, with separate panels for vowels and consonants]

  10. Acoustic Features for Prosody-Dependent Speech Recognition
  • Spectrum once per 10 ms (MFCC), dMFCC, ddMFCC
  • Energy, dEnergy, ddEnergy
  • Pitch (the normalization is sketched below):
    • Correct pitch halving and pitch doubling errors
    • Compute minF0 per utterance
    • f(t) = log( F0(t) / minF0 )
    • TDNN or TDRNN computes f*(t) = P( accent(t) | f(t-50ms), …, f(t+50ms) )
    • Use f*(t) as "observation" for an HMM
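A minimal sketch of the pitch normalization, assuming per-frame F0 values with a voicing mask; the octave-snapping rule for halving/doubling errors shown here is one common heuristic, not necessarily the authors' exact correction:

```python
import numpy as np

def pitch_feature(f0, voiced):
    """f0: per-frame F0 in Hz (0 where unvoiced); voiced: boolean mask.
    Returns f(t) = log(F0(t) / minF0) on voiced frames, 0 elsewhere."""
    f0 = np.asarray(f0, dtype=float)
    med = np.median(f0[voiced])
    # Snap octave outliers back: halving errors doubled, doubling errors halved.
    f0 = np.where(voiced & (f0 < 0.6 * med), 2.0 * f0, f0)
    f0 = np.where(voiced & (f0 > 1.8 * med), 0.5 * f0, f0)
    min_f0 = f0[voiced].min()          # minF0 computed per utterance
    f = np.zeros_like(f0)
    f[voiced] = np.log(f0[voiced] / min_f0)
    return f
```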

  11. TDRNN with One Output Unit
  [Figure: TDRNN architecture. The input layer reads F0 and P_V through tapped delay lines (D); an internal state layer feeds back on itself; a 1st and 2nd hidden layer lead to an output layer with one unit separating pitch accented from pitch unaccented. A numpy sketch of this forward pass follows below.]
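A minimal numpy sketch of the forward pass the figure suggests; the layer sizes, the tapped-delay window of 11 frames (about +/-50 ms at a 10 ms hop), and the exact recurrence are assumptions read loosely off the diagram, not the authors' implementation:

```python
import numpy as np

def tdrnn_forward(x, W_in, W_rec, W_h1, W_h2, w_out, taps=11):
    """x: (T, 2) array of per-frame inputs [F0 feature, P_V (voicing prob.)].
    W_in: (S, taps*2), W_rec: (S, S), W_h1: (H1, S), W_h2: (H2, H1), w_out: (H2,).
    Returns per-frame P(pitch accent) from the single sigmoid output unit."""
    T, d = x.shape
    pad = np.vstack([np.zeros((taps // 2, d)), x, np.zeros((taps // 2, d))])
    state = np.zeros(W_rec.shape[0])
    out = np.zeros(T)
    for t in range(T):
        window = pad[t:t + taps].ravel()                 # f(t-50ms), ..., f(t+50ms)
        state = np.tanh(W_in @ window + W_rec @ state)   # internal state layer
        h1 = np.tanh(W_h1 @ state)                       # 1st hidden layer
        h2 = np.tanh(W_h2 @ h1)                          # 2nd hidden layer
        out[t] = 1.0 / (1.0 + np.exp(-(w_out @ h2)))     # accented vs. unaccented
    return out
```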

  12. Training the TDRNN to Recognize Pitch Accents

  13. Acoustic Model f(X, Λ) for Prosody-Dependent Speech Recognition
  • Normalized phoneme duration is highly correlated with phrase position
  • Duration is not available before phoneme recognition!
  • Solution: semi-Markov model (a.k.a. HMM with explicit duration distributions); a forward-recursion sketch follows below:

    P(x1, …, xT | Y1, …, YN) = Σ_d p(d1 | Y1) … p(dN | YN) p(x(1) … x(d1) | Y1) p(x(d1+1) … x(d1+d2) | Y2) …
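A minimal sketch of evaluating that likelihood for a fixed allophone sequence Y1…YN by a forward recursion over segment end times; the duration table and segment scorer are assumed given (e.g., from per-allophone duration histograms and a frame-level acoustic model):

```python
import numpy as np

def semi_markov_loglik(T, N, log_dur, log_obs, max_dur):
    """T frames, N allophones.
    log_dur[n][d]: log p(d | Y_{n+1}) for durations d = 1..max_dur.
    log_obs(n, s, e): log p(x_{s+1}..x_e | Y_{n+1}) for the segment (s, e].
    Returns log P(x_1..x_T | Y_1..Y_N), summing over duration sequences."""
    NEG = -np.inf
    # alpha[n, t] = log p(x_1..x_t, first n allophones end exactly at frame t)
    alpha = np.full((N + 1, T + 1), NEG)
    alpha[0, 0] = 0.0
    for n in range(1, N + 1):
        for t in range(1, T + 1):
            terms = [alpha[n - 1, t - d] + log_dur[n - 1][d] + log_obs(n - 1, t - d, t)
                     for d in range(1, min(max_dur, t) + 1)
                     if alpha[n - 1, t - d] > NEG]
            if terms:
                m = max(terms)  # log-sum-exp over the last segment's duration
                alpha[n, t] = m + np.log(sum(np.exp(v - m) for v in terms))
    return alpha[N, T]
```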

  14. Example: Distributions of Duration, Phrase-Final vs. Phrase-Medial

  15. Some Recognition Results

  16. Work in Progress
  • Confirm these experiments with a state-of-the-art phoneme set and acoustic features
  • Improve pitch features; improve duration modeling
  • Spontaneous speech database (Switchboard):
    • Syntactic parse of the available word transcriptions
    • "Guess" prosody from syntax
    • Train recognition models
    • Iteratively improve the prosodic transcription?
  • Study the relationship between prosody and disfluency
