Syllables and Concepts in Large Vocabulary Continuous Speech Recognition

Paul De Palma, Ph.D. Candidate, Department of Linguistics, University of New Mexico
Slides available at: www.cs.gonzaga.edu/depalma

An Engineered Artifact • Syllables • Principled word segmentation scheme • No claim about human syllabification • Concepts • Words and phrases with similar meanings • No claim about cognition

Reducing the Search Space ASR answers the question: • What is the most likely sequence of words given an acoustic signal? • Considers many candidate word sequences To Reduce the Search Space • Reduce number of candidates • Using Syllables in the Language Model • Using Concepts in a Concept Component

Syllables in LM: Why?
[Figure: Cumulative Frequency as a Function of Frequency Rank, Switchboard (Greenberg, 1999, p. 167)]

Most Frequent Words are Monosyllabic (Greenberg, 1999, p. 167)
• Polysyllabic words are easier to recognize (Hamalainen, et al., 2007)
• And (of course) there are fewer syllables than words

Reduce the Search Space 2: Concept Component

The (Simplified) Architecture of an LVCSR System
• Feature Extractor
  • Transforms an acoustic signal into a sequence of 39-dimensional feature vectors
  • The province of digital signal processing
• Acoustic Model
  • Collection of probabilities of acoustic observations given word sequences
• Language Model
  • Collection of probabilities of word sequences
• Decoder
  • Guesses a probable sequence of words given an acoustic signal by searching the product of the probabilities found in the acoustic and language models

Simplified Schematic
[Diagram: signal → Feature Extractor → Decoder → Words. The Decoder draws on the Acoustic Model and the Language Model.]

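Enhanced Recognizer
[Diagram: signal → Feature Extractor → Decoder → Syllables → Concept Component → Syllables, Concepts. The Decoder draws on the Acoustic Model, P(O|S), and the Syllable Language Model, P(S). The Feature Extractor, Acoustic Model, and Decoder are labeled "assumed"; the Syllable Language Model and Concept Component are labeled "My Work".]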

How ASR Works
• Input is a sequence of acoustic observations: O = o_1, o_2, …, o_t
• Output is a string of words: W = w_1, w_2, …, w_n
• Then: $\hat{W} = \operatorname{argmax}_{W \in L} P(W \mid O)$ (1)
• “The hypothesized word sequence is that string W in the target language with the greatest probability given a sequence of acoustic observations.”

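Operationalizing Equation 1
• $\hat{W} = \operatorname{argmax}_{W \in L} P(W \mid O)$ (1)
• Bayes’ Rule: $P(x \mid y) = \dfrac{P(y \mid x)\,P(x)}{P(y)}$ (2)
• Using Bayes’ Rule: $\hat{W} = \operatorname{argmax}_{W \in L} \dfrac{P(O \mid W)\,P(W)}{P(O)}$ (3)
• Since the acoustic signal is the same for each candidate, (3) can be rewritten: $\hat{W} = \operatorname{argmax}_{W \in L} P(O \mid W)\,P(W)$ (4)
• In (4), the argmax is the Decoder, $P(O \mid W)$ is the Acoustic Model (likelihood of O given W), and $P(W)$ is the Language Model (prior probability)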
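A minimal sketch of equation (4) in Python, assuming the acoustic model and language model are available as simple lookup tables of log probabilities. The candidate strings and all scores are hypothetical, invented only to show the argmax; a real decoder searches a far larger space.

```python
import math


def decode(candidates, acoustic_logprob, lm_logprob):
    """Return the candidate W that maximizes log P(O|W) + log P(W)."""
    return max(candidates, key=lambda w: acoustic_logprob[w] + lm_logprob[w])


# Hypothetical scores for two candidate transcriptions of one signal O.
acoustic = {  # log P(O|W), the acoustic model
    "recognize speech": math.log(0.20),
    "wreck a nice beach": math.log(0.25),
}
prior = {  # log P(W), the language model
    "recognize speech": math.log(0.010),
    "wreck a nice beach": math.log(0.001),
}

best = decode(list(acoustic), acoustic, prior)
print(best)
```

Although the second candidate fits the signal slightly better in this toy table, the language model prior outweighs it, which is the point of multiplying the two terms.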

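LM: A Closer Look
• A collection of probabilities of word sequences
• $p(W) = p(w_1 \ldots w_n)$ (5)
• Can be written by the probability chain rule: $p(w_1 \ldots w_n) = p(w_1)\,p(w_2 \mid w_1)\,p(w_3 \mid w_1 w_2) \cdots p(w_n \mid w_1 \ldots w_{n-1}) = \prod_{k=1}^{n} p(w_k \mid w_1 \ldots w_{k-1})$ (6)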

Markov Assumption
• Approximate the full decomposition of (6) by looking only a specified number of words into the past
• Bigram: 1 word into the past
• Trigram: 2 words into the past
• …
• N-gram: N − 1 words into the past
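A small illustration of the bigram case, assuming unsmoothed maximum-likelihood estimates from a three-sentence toy corpus. The sentences and the `<s>` start marker are assumptions made for the example, not data from the experiments.

```python
from collections import Counter


def train_bigram(sentences):
    """Count contexts and bigrams, with <s> marking the sentence start."""
    contexts, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split()
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return contexts, bigrams


def bigram_prob(sentence, contexts, bigrams):
    """Approximate p(w_1...w_n) as the product of p(w_k | w_{k-1})."""
    tokens = ["<s>"] + sentence.split()
    prob = 1.0
    for prev, cur in zip(tokens[:-1], tokens[1:]):
        if contexts[prev] == 0:
            return 0.0  # unseen context; a real model would smooth
        prob *= bigrams[(prev, cur)] / contexts[prev]
    return prob


toy = ["i want a flight", "i need a ticket", "i want a ticket"]
contexts, bigrams = train_bigram(toy)
p_seen = bigram_prob("i want a ticket", contexts, bigrams)
p_unseen = bigram_prob("i need a flight", contexts, bigrams)
print(p_seen, p_unseen)
```

The second sentence never occurs in the toy corpus, yet it receives a nonzero probability because each of its bigrams does occur.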

Experiment 1: Perplexity
• Perplexity: PP(X)
• Functionally related to entropy: H(X)
• Entropy is a measure of information
• Hypothesis
  • PP(syllable LM) < PP(word LM)
  • Syllable LM contains more information

Definitions
Let X be a random variable and p(x) be its probability function.
Defs:
• $H(X) = -\sum_{x \in X} p(x) \lg p(x)$ (1)
• $PP(X) = 2^{H(X)}$ (2)
Given certain assumptions [1] and the definition of H(X), PP(X) can be transformed to: $p(w_1 \ldots w_n)^{-1/n}$
Perplexity is the inverse of the nth root of the probability of a word sequence.
[1] X is an ergodic and stationary process; n is arbitrarily large.

Entropy As Information
Suppose the letters of a Polynesian alphabet are distributed as follows [1]:
p: 1/8, t: 1/4, k: 1/8, a: 1/4, i: 1/8, u: 1/8
Calculate the per-letter entropy:
$H(P) = -\sum_{i \in \{p,t,k,a,i,u\}} p(i) \lg p(i) = -\left[4 \cdot \tfrac{1}{8} \lg \tfrac{1}{8} + 2 \cdot \tfrac{1}{4} \lg \tfrac{1}{4}\right] = 2\tfrac{1}{2}$ bits
2.5 bits on average are required to encode a letter (p: 100, t: 00, etc.)
[1] Manning, C., Schutze, H. (1999). Foundations of Statistical Natural Language Processing. Cambridge: MIT Press.
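The calculation can be checked with a few lines of Python. The letter probabilities below are assumed to be those of the simplified Polynesian example in Manning and Schutze (1999); they are consistent with the 2.5-bit result and the code lengths quoted on the slide, but should be verified against the book.

```python
import math


def entropy(dist):
    """H(X) = -sum p(x) * lg p(x), in bits."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


# Assumed letter distribution (Manning and Schutze's Polynesian example).
letters = {"p": 1 / 8, "t": 1 / 4, "k": 1 / 8, "a": 1 / 4, "i": 1 / 8, "u": 1 / 8}

h = entropy(letters)
pp = 2 ** h  # perplexity, PP(X) = 2^H(X)
print(h, pp)
```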

Reducing the Entropy
• Suppose
  • This language consists entirely of CV syllables
  • We know their distribution
• We can compute the conditional entropy of syllables in the language
• H(V|C), where V ∈ {a,i,u} and C ∈ {p,t,k}
• Entropy per syllable: H(C,V) = H(C) + H(V|C) ≈ 2.44 bits
• Entropy for two letters, letter model: 5 bits
• Conclusion: The syllable model contains more information than the letter model
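A sketch of the syllable-model calculation. The joint distribution of consonants and vowels is an assumption: it is taken to be the table given for this example in Manning and Schutze (1999) and should be checked against the book. Under that table, H(V|C) comes to 1.375 bits and the per-syllable entropy H(C) + H(V|C) to about 2.44 bits, the figure that is compared with 5 bits for two letters.

```python
import math

# Assumed joint distribution p(C, V) over CV syllables.
joint = {
    ("p", "a"): 1 / 16, ("t", "a"): 3 / 8, ("k", "a"): 1 / 16,
    ("p", "i"): 1 / 16, ("t", "i"): 3 / 16, ("k", "i"): 0.0,
    ("p", "u"): 0.0, ("t", "u"): 3 / 16, ("k", "u"): 1 / 16,
}

# Marginal distribution of the consonant, p(C).
p_c = {}
for (c, _v), p in joint.items():
    p_c[c] = p_c.get(c, 0.0) + p

# H(C) = -sum p(c) lg p(c)
h_c = -sum(p * math.log2(p) for p in p_c.values() if p > 0)

# H(V|C) = -sum p(c, v) lg p(v|c), with p(v|c) = p(c, v) / p(c)
h_v_given_c = -sum(
    p * math.log2(p / p_c[c]) for (c, _v), p in joint.items() if p > 0
)

h_syllable = h_c + h_v_given_c  # chain rule: H(C, V) = H(C) + H(V|C)
print(h_c, h_v_given_c, h_syllable)
```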

Perplexity As Weighted Average Branching Factor
• Suppose:
  • Letters in the alphabet occur with equal frequency
  • At every fork we have 26 choices
• Then H(X) = lg 26 and PP(X) = 2^{lg 26} = 26

Reducing the Branching Factor
• Suppose ‘E’ occurs 75 times more frequently than any other letter
• p(any other letter) = x
• 75 * x + 25 * x = 1, since there are 25 such letters
• x = .01
• Since any letter, w_i, is either E or one of the other 25 letters, p(w_i) = .75 + .01 = .76
• Still 26 choices at each fork
• ‘E’ is 75 times more likely than any other choice
• Perplexity is reduced
• Model contains more information
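The effect can be illustrated by computing perplexity directly from the definition PP(X) = 2^H(X) for the two letter distributions. This is a sketch of the definition only and is not necessarily the calculation shown on the original slide.

```python
import math


def perplexity(probs):
    """PP(X) = 2^H(X) for a discrete distribution given as a list."""
    h = -sum(p * math.log2(p) for p in probs if p > 0)
    return 2 ** h


uniform = [1 / 26] * 26          # every letter equally likely
skewed = [0.75] + [0.01] * 25    # 'E' is 75 times as likely as any other letter

pp_uniform = perplexity(uniform)
pp_skewed = perplexity(skewed)
print(pp_uniform, pp_skewed)
```

Both distributions offer 26 choices at each fork, but the skewed one has a much lower weighted average branching factor.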

Perplexity Experiment
• Reduced perplexity in a language model is used as an indicator that an experiment with real data might be fruitful
• Technique (for both syllable and word corpora)
  1. Randomly choose 10% of the utterances from a corpus as a test set
  2. Generate a language model from the remaining 90%
  3. Compute the perplexity of the test set given the language model
  4. Compute the mean perplexity over twenty runs of steps 1 to 3
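The procedure can be sketched as follows. The slides do not say which toolkit, n-gram order, or smoothing was used, so this sketch substitutes a unigram model with add-one smoothing; the toy corpus and the function names are hypothetical.

```python
import math
import random
from collections import Counter


def train_unigram(utterances):
    """Count tokens in the training utterances."""
    return Counter(tok for u in utterances for tok in u.split())


def perplexity(test, counts, vocab_size):
    """2^(-average log2 probability) under an add-one-smoothed unigram model."""
    total = sum(counts.values())
    log_sum, n = 0.0, 0
    for utterance in test:
        for tok in utterance.split():
            log_sum += math.log2((counts[tok] + 1) / (total + vocab_size))
            n += 1
    return 2 ** (-log_sum / n)


def mean_perplexity(corpus, runs=20, test_fraction=0.1, seed=0):
    """Repeat the random 10%/90% split `runs` times and average the results."""
    rng = random.Random(seed)
    vocab_size = len({tok for u in corpus for tok in u.split()})
    results = []
    for _ in range(runs):
        shuffled = corpus[:]
        rng.shuffle(shuffled)
        cut = max(1, int(len(shuffled) * test_fraction))
        test, train = shuffled[:cut], shuffled[cut:]
        results.append(perplexity(test, train_unigram(train), vocab_size))
    return sum(results) / len(results)


# Hypothetical toy corpus of 20 utterances.
toy_corpus = [f"i want a flight to city{i % 5}" for i in range(20)]
mean_pp = mean_perplexity(toy_corpus)
print(mean_pp)
```

The same function would be run once on the word corpus and once on the syllabified corpus, and the two means compared.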

The Corpora
• Air Travel Information System (Hemphill, et al., 2009)
  • Word types: 1,604
  • Word tokens: 219,009
  • Syllable types: 1,314
  • Syllable tokens: 317,578
• Transcript of simulated human-computer speech (NextIt, 2008)
  • Word types: 482
  • Word tokens: 5,782
  • Syllable types: 537 (this will have repercussions in Exp. 2)
  • Syllable tokens: 8,587

Results
• Notice the drop in perplexity from words to syllables.
• Perplexity of 14.74 for trigram syllable ATIS: at every turn, less than ½ as many choices as for trigram word ATIS

Experiment 2: Syllables in the Language Model
• Hypothesis:
  • A syllable language model will perform better than a word-based language model
• By what measure?

Symbol Error Rate
• SER = (100 * (I + S + D)) / T
Where:
• I is the number of insertions
• S is the number of substitutions
• D is the number of deletions
• T is the total number of symbols
Example [1]: SER = 100 * (2 + 1 + 1) / 5 = 80
[1] Alignment performed by a dynamic programming algorithm: Minimum Edit Distance
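A minimal sketch of the SER computation using a minimum edit distance alignment with unit costs. It assumes T is the number of symbols in the reference, which the slide does not state explicitly; the example strings are hypothetical, written in the style of the syllable transcripts.

```python
def align_counts(ref, hyp):
    """Return (insertions, substitutions, deletions) of a minimum-cost alignment."""
    rows, cols = len(ref) + 1, len(hyp) + 1
    # Each cell holds (cost, insertions, substitutions, deletions).
    dp = [[None] * cols for _ in range(rows)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, rows):
        dp[i][0] = (i, 0, 0, i)      # all reference symbols deleted
    for j in range(1, cols):
        dp[0][j] = (j, j, 0, 0)      # all hypothesis symbols inserted
    for i in range(1, rows):
        for j in range(1, cols):
            c, ins, sub, dele = dp[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diag = (c, ins, sub, dele)            # match, no cost
            else:
                diag = (c + 1, ins, sub + 1, dele)    # substitution
            c, ins, sub, dele = dp[i - 1][j]
            up = (c + 1, ins, sub, dele + 1)          # deletion
            c, ins, sub, dele = dp[i][j - 1]
            left = (c + 1, ins + 1, sub, dele)        # insertion
            dp[i][j] = min(diag, up, left, key=lambda cell: cell[0])
    return dp[-1][-1][1:]


def ser(ref, hyp):
    """Symbol error rate: 100 * (I + S + D) / T, with T = len(ref)."""
    ins, sub, dele = align_counts(ref, hyp)
    return 100 * (ins + sub + dele) / len(ref)


reference = "AY WAANTD AX FLAYTD".split()
hypothesis = "AY NIYDD AX".split()
rate = ser(reference, hypothesis)  # one substitution and one deletion
print(rate)
```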

Technique
• Phonetically transcribe corpus and reference files
• Syllabify corpus and reference files
• Build language models
• Run a recognizer on 18 short human-computer telephone monologues
• Compute mean, median, and standard deviation of SER for 1-gram, 2-gram, 3-gram, and 4-gram models over all monologues

Results
[Figures: Syllables Compared to Words; Syllables Normed by Words]

Experiment 3: A Concept Component • Hypothesis: • A recognizer equipped with a post-processor that transforms syllable output to syllable/concept output will perform better than one not equipped with such a processor

Technique
1. Develop equivalence classes from the training transcript: BE, WANT, GO, RETURN
2. Map the equivalence classes onto the reference files used to score the output of the recognizer
3. Map the equivalence classes onto the output of the recognizer
4. Determine the SER of the modified output in step 3 with respect to the reference files in step 2

Results
[Figures: Concepts Compared to Syllables; Concepts Normed by Syllables]
2% decline overall. Why?

Mapping Was Intended to Produce an Upper Bound On SER
• For each distinct syllable string that appears in the hypothesis or reference files, search each of the concepts for a match
• If there is a match, substitute the concept for the syllable string: ay w_uh_dd l_ay_kd → WANT
• Misrecognition of a single syllable → no insertion
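The mapping step can be sketched as a longest-match substitution over a syllable sequence. The class members below are inferred from the sample transcripts in the additional slides; the full equivalence classes are not given, so this table is an illustrative assumption.

```python
# Hypothetical excerpt of the equivalence classes, as syllable strings.
CONCEPTS = {
    "WANT": ["AY WAANTD", "AY NIYDD"],
    "GO": ["AX FLAYTD", "AX TIH KAXTD", "TUW GEHTD"],
    "RETURN": ["RAX TER NIXNG"],
}


def apply_concepts(syllables, concepts=CONCEPTS):
    """Replace each matching syllable string with its concept label."""
    patterns = sorted(
        ((tuple(phrase.split()), name)
         for name, phrases in concepts.items() for phrase in phrases),
        key=lambda item: -len(item[0]),  # try longer strings first
    )
    out, i = [], 0
    while i < len(syllables):
        for pattern, name in patterns:
            if tuple(syllables[i:i + len(pattern)]) == pattern:
                out.append(name)
                i += len(pattern)
                break
        else:
            out.append(syllables[i])  # no concept matched at this position
            i += 1
    return out


ref = "AY NIYDD AX TIH KAXTD FRAHM AEL BAX KAXR KIY TUW NUW YAORKD".split()
mapped = " ".join(apply_concepts(ref))
# A single misrecognized syllable (NIYD for NIYDD) blocks the WANT match.
hyp = "AY NIYD AX TIH KAXTD FRAHM AEL BAX KAXR KIY TUW NUW YAORKD".split()
mapped_hyp = " ".join(apply_concepts(hyp))
print(mapped)
print(mapped_hyp)
```

Because a match requires the whole syllable string, one misrecognized syllable leaves the string unmapped, which is why the mapping was expected to give an upper bound on SER.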

Misalignment Between Training and Reference Files Equivalence classes constructed using only the LM training model transcript • More frequent in reference files: • 1st person singular (I want) • Imperatives (List all flights) • Less frequent in reference files: • 1st person plural (My husband and me want) • Polite forms (I would like) • BE does not appear (There should be, There’s going to be, etc.)

Summary • 1. Perplexity: syllable language model contains more information than a word language model (and probably will perform better) • 2. Syllable language model results in a 14.7% mean improvement in SER • 3. The very slight increase in mean SER for a concept language model justifies further research

Further Research
• Test the given system over a large production corpus
• Develop a probabilistic concept language model
• Develop the necessary software to pass the output of the concept language model on to an expert system

The (Almost, Almost) Last Word
“But it must be recognized that the notion ‘probability of a sentence’ is an entirely useless one under any known interpretation of the term.”
Noam Chomsky, cited in Jurafsky and Martin (2009) from a 1969 essay on Quine.

The (Almost) Last Word He just never thought to count.

The Last Word Thanks To my generous committee: Bill Croft, Department of Linguistics George Luger, Department of Computer Science Caroline Smith, Department of Linguistics Chuck Wooters, U.S. Department of Defense

References
Cover, T., Thomas, J. (1991). Elements of Information Theory. Hoboken, NJ: John Wiley & Sons.
Greenberg, S. (1999). Speaking in shorthand: A syllable-centric perspective for understanding pronunciation variation. Speech Communication, 29, 159-176.
Hemphill, C., Godfrey, J., Doddington, G. (2009). The ATIS Spoken Language Systems Pilot Corpus. Retrieved 6/17/09 from: http://www.ldc.upenn.edu/Catalog/readme_files/atis/sspcrd/corpus.html
Hamalainen, A., Boves, L., de Veth, J., ten Bosch, L. (2007). On the utility of syllable-based acoustic models for pronunciation variation modeling. EURASIP Journal on Audio, Speech, and Music Processing, 46460, 1-11.
Jurafsky, D., Martin, J. (2000). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Upper Saddle River, NJ: Prentice Hall.
Jurafsky, D., Martin, J. (2009). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Upper Saddle River, NJ: Prentice Hall.
Manning, C., Schutze, H. (1999). Foundations of Statistical Natural Language Processing. Cambridge: MIT Press.
NextIt. (2008). Retrieved 4/5/08 from: http://www.nextit.com.
NIST. (2007). Syllabification software. National Institute of Standards: NIST Spoken Language Technology Evaluation and Utility. Retrieved 11/30/07 from: http://www.nist.gov/speech/tools/.

Additional Slides

Transcription of a recording
REF: (3.203,5.553) GIVE ME A FLIGHT BETWEEN SPOKANE AND SEATTLE
REF: (15.633,18.307) UM OCTOBER SEVENTEENTH
REF: (26.827,29.606) OH I NEED A PLANE FROM SPOKANE TO SEATTLE
REF: (43.337,46.682) I WANT A ROUNDTRIP FROM MINNEAPOLIS TO
REF: (58.050,61.762) I WANT TO BOOK A TRIP FROM MISSOULA TO PORTLAND
REF: (73.397,77.215) I NEED A TICKET FROM ALBUQUERQUE TO NEW YORK
REF: (87.370,94.098) YEAH RIGHT UM I NEED A TICKET FROM SPOKANE SEPTEMBER THIRTIETH TO SEATTLE RETURNING OCTOBER THIRD
REF: (107.381,113.593) I WANT TO GET FROM ALBUQUERQUE TO NEW ORLEANS ON OCTOBER THIRD TWO THOUSAND SEVEN

Transcribed and Segmented [1]
REF: (3.203,5.553) GIHV MIY AX FLAYTD BAX TWIYN SPOW KAEN AENDD SIY AE DXAXL
REF: (15.633,18.307) AHM AAKD TOW BAXR SEH VAXN TIYNTH
REF: (26.827,29.606) OW AY NIYDD AX PLEYN FRAHM SPOW KAEN TUW SIY AE DXAXL
REF: (43.337,46.682) AY WAANTD AX RAWNDD TRIHPD FRAHM MIH NIY AE PAX LAXS TUW
REF: (58.050,61.762) AY WAANTD TUW BUHKD AX TRIHPD FRAHM MIH ZUW LAX TUW PAORTD LAXNDD
REF: (73.397,77.215) AY NIYDD AX TIH KAXTD FRAHM AEL BAX KAXR KIY TUW NUW YAORKD
REF: (87.370,94.098) YAE RAYTD AHM AY NIYDD AX TIH KAXTD FRAHM SPOW KAEN SEHPD TEHM BAXR THER DXIY AXTH TUW SIY AE DXAXL RAX TER NIXNG AAKD TOW BAXR THERDD
REF: (107.381,113.593) AY WAANTD TUW GEHTD FRAHM AEL BAX KAXR KIY TUW NUW AOR LIY AXNZ AAN AAKD TOW BAXR THERDD TUW THAW ZAXNDD SEH VAXN
[1] Produced by University of Colorado transcription software (to a version of ARPAbet), the National Institute of Standards (NIST) syllabifier, and my own Python classes that coordinate the two.

With Inserted Equivalence Classes [1]
REF: (3.203,5.553) GIHV MIY GO BAX TWIYN SPOW KAEN AENDD SIY AE DXAXL
REF: (15.633,18.307) AHM AAKD TOW BAXR SEH VAXN TIYNTH
REF: (26.827,29.606) OW WANT AX PLEYN FRAHM SPOW KAEN TUW SIY AE DXAXL
REF: (43.337,46.682) WANT AX RAWNDD TRIHPD FRAHM MIH NIY AE PAX LAXS TUW
REF: (58.050,61.762) WANT TUW BUHKD AX TRIHPD FRAHM MIH ZUW LAX TUW PAORTD LAXNDD
REF: (73.397,77.215) WANT GO FRAHM AEL BAX KAXR KIY TUW NUW YAORKD
REF: (87.370,94.098) YAE RAYTD AHM AY WANT GO FRAHM SPOW KAEN SEHPD TEHM BAXR THER DXIY AXTH TUW SIY AE DXAXL RETURN AAKD TOW BAXR THERDD
REF: (107.381,113.593) WANT GO FRAHM AEL BAX KAXR KIY TUW NUW AOR LIY AXNZ AAN AAKD TOW BAXR THERDD TUW THAW ZAXNDD SEH VAXN
[1] A subset of a set in which all members share an equivalence relation. WANT is an equivalence class with members I need, I would like, and so on.

Including Flat Language Model: Word Perplexity
[Figure not reproduced]

Including Flat LM: Syllable Perplexity
[Figure not reproduced]

Words and Syllables Normed by Flat LM
[Figures: Word Data Normed by Flat LM; Syllable Data Normed by Flat LM]

Syllabifiers • Syllabifier from National Institute of Standards and Technology (NIST, 2007) • Based on Daniel Kahn’s 1976 dissertation from MIT (Kahn, 1976) • Generative in nature and English-biased

Syllables
• Estimates of the number of English syllables range from 1,000 to 30,000
• This suggests that there is some difficulty in pinning down what a syllable is
• Usual hierarchical approach: a syllable consists of an onset (C) and a rhyme; the rhyme consists of a nucleus (V) and a coda (C)

Sonority
• Sonority rises to the nucleus and falls to the coda
• Speech sounds appear to form a sonority hierarchy (from highest to lowest): vowels, glides, liquids, nasals, obstruents
• Useful but not absolute: e.g., both depth and spit seem to violate the sonority hierarchy
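A toy check of the sonority principle, assuming a strict reading in which sonority must rise at every step up to the nucleus and fall at every step after it, so that a plateau counts as a violation. The segment-to-class table is a small hypothetical sample and uses ordinary letters rather than a phonetic alphabet.

```python
# Sonority hierarchy from highest to lowest.
SONORITY = {"vowel": 4, "glide": 3, "liquid": 2, "nasal": 1, "obstruent": 0}

# Hypothetical sample of segments and their classes.
SEGMENT_CLASS = {
    "a": "vowel", "e": "vowel", "i": "vowel",
    "w": "glide", "y": "glide",
    "l": "liquid", "r": "liquid",
    "m": "nasal", "n": "nasal",
    "p": "obstruent", "t": "obstruent", "d": "obstruent",
    "s": "obstruent", "th": "obstruent",
}


def obeys_sonority(segments):
    """True if sonority strictly rises to a single peak and strictly falls after it."""
    levels = [SONORITY[SEGMENT_CLASS[s]] for s in segments]
    peak = levels.index(max(levels))
    rising = all(a < b for a, b in zip(levels[:peak], levels[1:peak + 1]))
    falling = all(a > b for a, b in zip(levels[peak:], levels[peak + 1:]))
    return rising and falling


plant_ok = obeys_sonority(["p", "l", "a", "n", "t"])
spit_ok = obeys_sonority(["s", "p", "i", "t"])     # s and p form a plateau
depth_ok = obeys_sonority(["d", "e", "p", "th"])   # p and th form a plateau
print(plant_ok, spit_ok, depth_ok)
```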
