Syllables and Concepts in Large Vocabulary Continuous Speech Recognition
Paul De Palma, Ph.D. Candidate, Department of Linguistics, University of New Mexico
Slides available at: www.cs.gonzaga.edu/depalma
An Engineered Artifact • Syllables • Principled word segmentation scheme • No claim about human syllabification • Concepts • Words and phrases with similar meanings • No claim about cognition
Reducing the Search Space ASR answers the question: • What is the most likely sequence of words given an acoustic signal? • Considers many candidate word sequences To Reduce the Search Space • Reduce number of candidates • Using Syllables in the Language Model • Using Concepts in a Concept Component
Syllables in LM: Why? [Figure: cumulative frequency as a function of frequency rank, Switchboard (Greenberg, 1999, p. 167)]
Most Frequent Words are Monosyllabic (Greenberg, 1999, p. 167) • Polysyllabic words are easier to recognize (Hamalainen et al., 2007) • And (of course) there are fewer syllables than words
The (Simplified) Architecture of an LVCSR System
• Feature Extractor
  • Transforms an acoustic signal into a sequence of 39-dimensional feature vectors
  • The province of digital signal processing
• Acoustic Model
  • Collection of probabilities of acoustic observations given word sequences
• Language Model
  • Collection of probabilities of word sequences
• Decoder
  • Guesses a probable sequence of words given an acoustic signal by searching the product of the probabilities found in the acoustic and language models
Simplified Schematic [Diagram: signal → Feature Extractor → Decoder → Words; the Decoder draws on the Acoustic Model and the Language Model]
Enhanced Recognizer [Diagram: signal → Feature Extractor → Decoder → Syllables → Concept Component → Syllables, Concepts; the Decoder draws on the Acoustic Model, P(O|S), and the Syllable Language Model, P(S). Box labels in the diagram: "assumed" (three times) and "My Work" (twice)]
How ASR Works
• Input is a sequence of acoustic observations: O = o1, o2, …, ot
• Output is a string of words: W = w1, w2, …, wn
Then
$\hat{W} = \arg\max_{W} P(W \mid O)$ (1)
“The hypothesized word sequence is that string W in the target language with the greatest probability given a sequence of acoustic observations.”
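Operationalizing Equation 1
$\hat{W} = \arg\max_{W} P(W \mid O)$ (1)
$P(W \mid O) = \dfrac{P(O \mid W)\,P(W)}{P(O)}$ (2)
Using Bayes’ Rule:
$\hat{W} = \arg\max_{W} \dfrac{P(O \mid W)\,P(W)}{P(O)}$ (3)
Since the acoustic signal is the same for each candidate, P(O) is constant and (3) can be rewritten:
$\hat{W} = \arg\max_{W} P(O \mid W)\,P(W)$ (4)
In (4), the arg max is the Decoder, P(O|W) is the Acoustic Model (likelihood of O given W), and P(W) is the Language Model (prior probability).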
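The decoder's job in equation (4) can be sketched in a few lines. This is only an illustration: the candidate strings and all probabilities below are made-up values, not figures from the slides, and a real decoder searches a far larger space than three candidates.

```python
import math

# Hypothetical scores for three candidate word sequences W (illustrative only).
# acoustic[W] stands in for P(O|W); prior[W] stands in for P(W).
acoustic = {
    "recognize speech": 0.0020,
    "wreck a nice beach": 0.0025,
    "recognize peach": 0.0004,
}
prior = {
    "recognize speech": 0.0100,
    "wreck a nice beach": 0.0010,
    "recognize peach": 0.0005,
}

def decode(candidates):
    """Return the candidate maximizing P(O|W) * P(W), computed in log space."""
    return max(candidates, key=lambda w: math.log(acoustic[w]) + math.log(prior[w]))

best = decode(acoustic.keys())
print(best)
```

Working in log space avoids underflow when many small probabilities are multiplied, which is why the sketch sums logs instead of taking the product directly.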
LM: A Closer Look
• A collection of probabilities of word sequences
• $p(W) = p(w_1 \ldots w_n)$ (5)
• Can be written by the probability chain rule:
$p(w_1 \ldots w_n) = \prod_{k=1}^{n} p(w_k \mid w_1 \ldots w_{k-1})$ (6)
Markov Assumption
• Approximate the full decomposition of (6) by looking only a specified number of words into the past
• Bigram: 1 word into the past
• Trigram: 2 words into the past
• …
• N-gram: N-1 words into the past
Bigram Language Model
Def. Bigram Probability: $p(w_n \mid w_{n-1}) = \mathrm{count}(w_{n-1} w_n) / \mathrm{count}(w_{n-1})$ (7)
Minicorpus
<s>paul wrote his thesis</s>
<s>james wrote a different thesis</s>
<s>paul wrote a thesis suggested by george</s>
<s>the thesis</s>
<s>jane wrote the poem</s>
E.g., p(paul|<s>) = count(<s>paul)/count(<s>) = 2/5
P(paul wrote a thesis) = p(paul|<s>) * p(wrote|paul) * p(a|wrote) * p(thesis|a) * p(</s>|thesis) = .075
P(paul wrote the thesis) = p(paul|<s>) * p(wrote|paul) * p(the|wrote) * p(thesis|the) * p(</s>|thesis) = .0375
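The bigram estimates for the minicorpus can be checked with a short script. This is a minimal sketch of the unsmoothed estimate in (7); the function names are illustrative, and an unseen history would cause a division by zero, which a real language model avoids with smoothing.

```python
from collections import Counter

corpus = [
    "paul wrote his thesis",
    "james wrote a different thesis",
    "paul wrote a thesis suggested by george",
    "the thesis",
    "jane wrote the poem",
]

unigrams, bigrams = Counter(), Counter()
for line in corpus:
    tokens = ["<s>"] + line.split() + ["</s>"]
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def bigram_prob(word, prev):
    """Unsmoothed estimate: count(prev word) / count(prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

def sentence_prob(sentence):
    """Product of bigram probabilities, including <s> and </s>."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= bigram_prob(word, prev)
    return prob

print(sentence_prob("paul wrote a thesis"))
print(sentence_prob("paul wrote the thesis"))
```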
Experiment 1: Perplexity • Perplexity: PP(X) • Functionally related to entropy: H(X) • Entropy is a measure of information • Hypothesis • PPX(of syllable LM) < PPX (of word LM) • Syllable LM contains more information
Definitions
Let X be a random variable and p(x) its probability function
Defs:
• $H(X) = -\sum_{x \in X} p(x) \lg p(x)$ (1)
• $PP(X) = 2^{H(X)}$ (2)
Given certain assumptions¹ and the def. of H(X), PP(X) can be transformed to: $p(w_1 \ldots w_n)^{-1/n}$
Perplexity is the inverse of the nth root of the probability of a word sequence
1. X is an ergodic and stationary process, n is arbitrarily large
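The two forms of perplexity can be compared numerically. The sketch below assumes a sequence probability of .075 over n = 5 symbols (four words plus the end marker), an illustrative choice; the function names are not from the slides.

```python
import math

def perplexity_from_prob(prob, n):
    """Inverse of the nth root of the sequence probability."""
    return prob ** (-1.0 / n)

def perplexity_from_entropy(prob, n):
    """2 raised to the per-symbol entropy estimate, -(1/n) * lg p."""
    per_symbol_entropy = -math.log2(prob) / n
    return 2 ** per_symbol_entropy

p_sentence = 0.075  # assumed sequence probability
n_symbols = 5       # assumed sequence length, counting the end marker

print(perplexity_from_prob(p_sentence, n_symbols))
print(perplexity_from_entropy(p_sentence, n_symbols))
```

Both functions return the same value, since 2 raised to -(1/n) lg p is just p raised to -1/n.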
Entropy As Information
Suppose the letters of a Polynesian alphabet are distributed as follows:¹
p: 1/8, t: 1/4, k: 1/8, a: 1/4, i: 1/8, u: 1/8
Calculate the per letter entropy:
$H(P) = -\sum_{i \in \{p,t,k,a,i,u\}} p(i) \lg p(i) = 4 \cdot \tfrac{1}{8} \cdot 3 + 2 \cdot \tfrac{1}{4} \cdot 2 = 2\tfrac{1}{2}$ bits
2.5 bits on average required to encode a letter (p: 100, t: 00, etc.)
1. Manning, C., Schutze, H. (1999). Foundations of Statistical Natural Language Processing. Cambridge: MIT Press.
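The per-letter entropy can be verified directly. The letter distribution and the full code table below are the ones given for Simplified Polynesian in Manning and Schutze (1999); the slide itself quotes only the codes for p and t, so the rest of the table should be read as an assumption drawn from that source.

```python
import math

# Letter distribution for Simplified Polynesian (Manning and Schutze, 1999).
polynesian = {"p": 1/8, "t": 1/4, "k": 1/8, "a": 1/4, "i": 1/8, "u": 1/8}

# A code with shorter strings for the more frequent letters.
codes = {"p": "100", "t": "00", "k": "101", "a": "01", "i": "110", "u": "111"}

def entropy(dist):
    """H(X) = -sum p(x) * lg p(x), skipping zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

h = entropy(polynesian)
avg_code_length = sum(polynesian[s] * len(codes[s]) for s in polynesian)

print(h, avg_code_length)
```

The average code length equals the entropy here because every probability is a power of two.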
Reducing the Entropy
• Suppose
  • This language consists entirely of CV syllables
  • We know their distribution
• We can compute the entropy of a syllable from the conditional entropy of its vowel given its consonant
  • H(C,V) = H(C) + H(V|C), where V ∈ {a,i,u} and C ∈ {p,t,k}
  • H(C,V) = 2.44 bits per syllable
• Entropy for two letters, letter model: 5 bits
• Conclusion: The syllable model contains more information than the letter model
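The 2.44-bit figure can be reproduced from a joint distribution over CV syllables. The table below is the one given in Manning and Schutze (1999) for this example; it does not appear on the slide, so treat it as an assumption. Under that table, the chain rule H(C) + H(V|C) gives roughly 2.44 bits per syllable, against 5 bits for two letters under the letter model.

```python
import math

# Joint probabilities p(C, V) for CV syllables (Manning and Schutze, 1999).
joint = {
    ("p", "a"): 1/16, ("t", "a"): 3/8,  ("k", "a"): 1/16,
    ("p", "i"): 1/16, ("t", "i"): 3/16, ("k", "i"): 0.0,
    ("p", "u"): 0.0,  ("t", "u"): 3/16, ("k", "u"): 1/16,
}

# Marginal distribution over consonants.
p_c = {}
for (c, v), p in joint.items():
    p_c[c] = p_c.get(c, 0.0) + p

# H(C) and H(V|C); zero-probability cells contribute nothing.
h_c = -sum(p * math.log2(p) for p in p_c.values() if p > 0)
h_v_given_c = -sum(p * math.log2(p / p_c[c]) for (c, v), p in joint.items() if p > 0)

h_syllable = h_c + h_v_given_c  # chain rule: H(C, V) = H(C) + H(V|C)
print(h_c, h_v_given_c, h_syllable)
```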
Perplexity As Weighted Average Branching Factor • Suppose: • letters in alphabet occur with equal frequency • At every fork we have 26 choices
Reducing the Branching Factor
• Suppose ‘E’ occurs 75 times more frequently than any other letter
• p(any other letter) = x
• 75 * x + 25 * x = 1, since there are 25 such letters
• x = .01
• Since any letter, wi, is either E or one of the other 25 letters, p(wi) = .75 if wi is E and .01 otherwise
• Still 26 choices at each fork
• ‘E’ is 75 times more likely than any other choice
• Perplexity is reduced
• Model contains more information
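The effect on the branching factor can be computed from the definition PP(X) = 2^H(X). The sketch below compares the uniform 26-letter alphabet with the skewed one, assuming p(E) = .75 and p = .01 for each of the other 25 letters; the function names are illustrative.

```python
import math
import string

def entropy(dist):
    """H(X) = -sum p(x) * lg p(x)."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def perplexity(dist):
    """PP(X) = 2 ** H(X), the weighted average branching factor."""
    return 2 ** entropy(dist)

uniform = {ch: 1 / 26 for ch in string.ascii_uppercase}

skewed = {ch: 0.01 for ch in string.ascii_uppercase}
skewed["E"] = 0.75  # 'E' is 75 times more likely than any other letter

print(perplexity(uniform))
print(perplexity(skewed))
```

With equal frequencies the perplexity equals the number of choices, 26. With the skewed distribution there are still 26 choices at each fork, but the weighted average falls to just under 4.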
Perplexity Experiment
• Reduced perplexity in a language model is used as an indicator that an experiment with real data might be fruitful
• Technique (for both syllable and word corpora)
  1. Randomly choose 10% of the utterances from a corpus as a test set
  2. Generate a language model from the remaining 90%
  3. Compute the perplexity of the test set given the language model
  4. Compute the mean over twenty runs of step 3
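The technique above can be sketched as follows. This is not the code used in the experiment: it assumes a bigram model with add-one smoothing (the slides do not say which model order or smoothing was used for this step), and the toy corpus is a stand-in for the real word and syllable corpora.

```python
import math
import random
from collections import Counter

def train_bigram(utterances):
    """Count unigrams and bigrams over utterances padded with <s> and </s>."""
    unigrams, bigrams = Counter(), Counter()
    for utt in utterances:
        tokens = ["<s>"] + utt + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def perplexity(test, unigrams, bigrams):
    """Perplexity of the test set under an add-one smoothed bigram model."""
    vocab_size = len(unigrams) + 1  # +1 reserves mass for unseen symbols
    log_prob, n = 0.0, 0
    for utt in test:
        tokens = ["<s>"] + utt + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            p = (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)
            log_prob += math.log2(p)
            n += 1
    return 2 ** (-log_prob / n)

def mean_perplexity(corpus, runs=20, test_fraction=0.1, seed=0):
    """Mean test-set perplexity over repeated random 90/10 splits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(runs):
        shuffled = corpus[:]
        rng.shuffle(shuffled)
        cut = max(1, int(len(shuffled) * test_fraction))
        test, train = shuffled[:cut], shuffled[cut:]
        scores.append(perplexity(test, *train_bigram(train)))
    return sum(scores) / len(scores)

# Toy corpus (illustrative only): each utterance is a list of symbols.
toy_corpus = [u.split() for u in [
    "give me a flight between spokane and seattle",
    "um october seventeenth",
    "oh i need a plane from spokane to seattle",
    "i want to book a trip from missoula to portland",
    "i need a ticket from albuquerque to new york",
    "i want to get from albuquerque to new orleans",
]]

mean_pp = mean_perplexity(toy_corpus)
print(mean_pp)
```

The same functions apply unchanged to a syllable corpus, since each utterance is just a list of symbols.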
The Corpora
• Air Travel Information System (Hemphill, et al., 2009)
  Word types: 1604
  Word tokens: 219,009
  Syllable types: 1314
  Syllable tokens: 317,578
• Transcript of simulated human-computer speech (NextIt, 2008)
  Word types: 482
  Word tokens: 5,782
  Syllable types: 537 (This will have repercussions in Exp. 2)
  Syllable tokens: 8,587
Results
• Notice the drop in perplexity from words to syllables.
• Perplexity of 14.74 for trigram syllable ATIS: at every turn, less than ½ as many choices as for trigram word ATIS
Experiment 2: Syllables in the Language Model • Hypothesis: • A syllable language model will perform better than a word-based language model • By what measure?
Symbol Error Rate
• SER = (100 * (I + S + D))/T
Where:
• I is the number of insertions
• S is the number of substitutions
• D is the number of deletions
• T is the total number of symbols
Example alignment¹ with I = 2, S = 1, D = 1, T = 5: SER = 100(2+1+1)/5 = 80
1. Alignment performed by a dynamic programming algorithm: Minimum Edit Distance
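A minimal sketch of the SER computation follows, assuming unit costs for insertions, substitutions, and deletions and taking T to be the number of symbols in the reference. The function names and the example strings are illustrative. When several alignments tie for the minimum, the split among I, S, and D depends on the backtrace order, but their sum does not.

```python
def align_counts(ref, hyp):
    """Return (insertions, substitutions, deletions) from a minimum edit distance alignment."""
    n, m = len(ref), len(hyp)
    # dist[i][j] = minimum number of edits turning ref[:i] into hyp[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + cost,  # match or substitution
                             dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1)         # insertion
    # Trace back from the corner, counting each kind of edit.
    ins = sub = dele = 0
    i, j = n, m
    while i > 0 or j > 0:
        cost = 1 if i > 0 and j > 0 and ref[i - 1] != hyp[j - 1] else 0
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + cost:
            sub += cost
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            dele += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return ins, sub, dele

def symbol_error_rate(ref, hyp):
    """SER = 100 * (I + S + D) / T, with T the number of reference symbols."""
    ins, sub, dele = align_counts(ref, hyp)
    return 100 * (ins + sub + dele) / len(ref)

ref = ["a", "b", "c", "d", "e"]
hyp = ["a", "x", "c", "e", "f", "g"]
print(symbol_error_rate(ref, hyp))
```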
Technique
1. Phonetically transcribe corpus and reference files
2. Syllabify corpus and reference files
3. Build language models
4. Run a recognizer on 18 short human-computer telephone monologues
5. Compute mean, median, std of SER for 1-gram, 2-gram, 3-gram, 4-gram over all monologues
Results [Figures: Syllables Compared to Words; Syllables Normed by Words]
Experiment 3: A Concept Component • Hypothesis: • A recognizer equipped with a post-processor that transforms syllable output to syllable/concept output will perform better than one not equipped with such a processor
Technique
1. Develop equivalence classes from the training transcript: BE, WANT, GO, RETURN
2. Map the equivalence classes onto the reference files used to score the output of the recognizer
3. Map the equivalence classes onto the output of the recognizer
4. Determine the SER of the modified output in step 3 with respect to the reference files in step 2
Results [Figures: Concepts Compared to Syllables; Concepts Normed by Syllables] 2% decline overall. Why?
Mapping Was Intended to Produce an Upper Bound On SER
• For each distinct syllable string that appears in the hypothesis or reference files, search each of the concepts for a match
• If match, substitute concept for the syllable string: ay w_uh_dd l_ay_kd → WANT
• Misrecognition of a single syllable → no concept is inserted
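The substitution step can be sketched as an exact, longest-match-first replacement over syllable strings. The member strings below are read off the transcription examples at the end of these slides and are only a small illustrative subset; the actual equivalence classes and the author's matching code are not shown in the deck.

```python
# Illustrative subset of equivalence classes: concept -> member syllable strings.
CONCEPTS = {
    "WANT": [["AY", "NIYDD"], ["AY", "WAANTD"]],
    "GO": [["AX", "TIH", "KAXTD"], ["TUW", "GEHTD"], ["AX", "FLAYTD"]],
    "RETURN": [["RAX", "TER", "NIXNG"]],
}

# Try longer member strings first so a short member never pre-empts a longer one.
PATTERNS = sorted(
    ((member, concept) for concept, members in CONCEPTS.items() for member in members),
    key=lambda pair: -len(pair[0]),
)

def map_concepts(syllables):
    """Replace each exact occurrence of a member string with its concept label."""
    out, i = [], 0
    while i < len(syllables):
        for member, concept in PATTERNS:
            if syllables[i:i + len(member)] == member:
                out.append(concept)
                i += len(member)
                break
        else:
            # No member matches at this position: keep the syllable unchanged.
            out.append(syllables[i])
            i += 1
    return out

utterance = "AY NIYDD AX TIH KAXTD FRAHM AEL BAX KAXR KIY TUW NUW YAORKD".split()
print(" ".join(map_concepts(utterance)))
```

Because matching is exact, a single misrecognized syllable inside a member string blocks the substitution, which is the behavior the last bullet describes.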
Misalignment Between Training and Reference Files Equivalence classes constructed using only the LM training model transcript • More frequent in reference files: • 1st person singular (I want) • Imperatives (List all flights) • Less frequent in reference files: • 1st person plural (My husband and me want) • Polite forms (I would like) • BE does not appear (There should be, There’s going to be, etc.)
Summary • 1. Perplexity: syllable language model contains more information than a word language model (and probably will perform better) • 2. Syllable language model results in a 14.7% mean improvement in SER • 3. The very slight increase in mean SER for a concept language model justifies further research
Further Research • Test the given system over a large production corpus • Develop a probabilistic concept language model • Develop necessary software to pass the output of the concept language model on to an expert system
The (Almost, Almost) Last Word “But it must be recognized that the notion ‘probability of a sentence’ is an entirely useless one under any known interpretation of the term.” Cited in Jurafsky and Martin (2009) from a 1969 essay on Quine.
The (Almost) Last Word He just never thought to count.
The Last Word Thanks To my generous committee: Bill Croft, Department of Linguistics George Luger, Department of Computer Science Caroline Smith, Department of Linguistics Chuck Wooters, U.S. Department of Defense
References
Cover, T., Thomas, J. (1991). Elements of Information Theory. Hoboken, NJ: John Wiley & Sons.
Greenberg, S. (1999). Speaking in shorthand - A syllable-centric perspective for understanding pronunciation variation. Speech Communication, 29, 159-176.
Hemphill, C., Godfrey, J., Doddington, G. (2009). The ATIS Spoken Language Systems Pilot Corpus. Retrieved 6/17/09 from: http://www.ldc.upenn.edu/Catalog/readme_files/atis/sspcrd/corpus.html
Hamalainen, A., Boves, L., de Veth, J., ten Bosch, L. (2007). On the utility of syllable-based acoustic models for pronunciation variation modeling. EURASIP Journal on Audio, Speech, and Music Processing, 46460, 1-11.
Jurafsky, D., Martin, J. (2000). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Upper Saddle River, NJ: Prentice Hall.
Jurafsky, D., Martin, J. (2009). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Upper Saddle River, NJ: Prentice Hall.
Manning, C., Schutze, H. (1999). Foundations of Statistical Natural Language Processing. Cambridge: MIT Press.
NextIt. (2008). Retrieved 4/5/08 from: http:/www.nextit.com.
NIST. (2007). Syllabification software. National Institute of Standards: NIST Spoken Language Technology Evaluation and Utility. Retrieved 11/30/07 from: http://www.nist.gov/speech/tools/.
Transcription of a recording
REF: (3.203,5.553) GIVE ME A FLIGHT BETWEEN SPOKANE AND SEATTLE
REF: (15.633,18.307) UM OCTOBER SEVENTEENTH
REF: (26.827,29.606) OH I NEED A PLANE FROM SPOKANE TO SEATTLE
REF: (43.337,46.682) I WANT A ROUNDTRIP FROM MINNEAPOLIS TO
REF: (58.050,61.762) I WANT TO BOOK A TRIP FROM MISSOULA TO PORTLAND
REF: (73.397,77.215) I NEED A TICKET FROM ALBUQUERQUE TO NEW YORK
REF: (87.370,94.098) YEAH RIGHT UM I NEED A TICKET FROM SPOKANE SEPTEMBER THIRTIETH TO SEATTLE RETURNING OCTOBER THIRD
REF: (107.381,113.593) I WANT TO GET FROM ALBUQUERQUE TO NEW ORLEANS ON OCTOBER THIRD TWO THOUSAND SEVEN
Transcribed and Segmented¹
REF: (3.203,5.553) GIHV MIY AX FLAYTD BAX TWIYN SPOW KAEN AENDD SIY AE DXAXL
REF: (15.633,18.307) AHM AAKD TOW BAXR SEH VAXN TIYNTH
REF: (26.827,29.606) OW AY NIYDD AX PLEYN FRAHM SPOW KAEN TUW SIY AE DXAXL
REF: (43.337,46.682) AY WAANTD AX RAWNDD TRIHPD FRAHM MIH NIY AE PAX LAXS TUW
REF: (58.050,61.762) AY WAANTD TUW BUHKD AX TRIHPD FRAHM MIH ZUW LAX TUW PAORTD LAXNDD
REF: (73.397,77.215) AY NIYDD AX TIH KAXTD FRAHM AEL BAX KAXR KIY TUW NUW YAORKD
REF: (87.370,94.098) YAE RAYTD AHM AY NIYDD AX TIH KAXTD FRAHM SPOW KAEN SEHPD TEHM BAXR THER DXIY AXTH TUW SIY AE DXAXL RAX TER NIXNG AAKD TOW BAXR THERDD
REF: (107.381,113.593) AY WAANTD TUW GEHTD FRAHM AEL BAX KAXR KIY TUW NUW AOR LIY AXNZ AAN AAKD TOW BAXR THERDD TUW THAW ZAXNDD SEH VAXN
1. Produced by University of Colorado transcription software (to a version of ARPAbet), National Institute of Standards (NIST) syllabifier, and my own Python classes that coordinate the two.
With Inserted Equivalence Classes1 • REF: (3.203,5.553) GIHV MIY GO BAX TWIYN SPOW KAEN AENDD SIY AE DXAXL • REF: (15.633,18.307) AHM AAKD TOW BAXR SEH VAXN TIYNTH • REF: (26.827,29.606) OW WANT AX PLEYN FRAHM SPOW KAEN TUW SIY AE DXAXL • REF: (43.337,46.682) WANT AX RAWNDD TRIHPD FRAHM MIH NIY AE PAX LAXS TUW • REF: (58.050,61.762) WANT TUW BUHKD AX TRIHPD FRAHM MIH ZUW LAX TUW PAORTD LAXNDD • REF: (73.397,77.215) WANT GO FRAHM AEL BAX KAXR KIY TUW NUW YAORKD • REF: (87.370,94.098) YAE RAYTD AHM AY WANT GO FRAHM SPOW KAEN SEHPD TEHM BAXR THER DXIY AXTH TUW SIY AE DXAXL RETURN AAKD TOW BAXR THERDD • REF: (107.381,113.593) WANT GO FRAHM AEL BAX KAXR KIY TUW NUW AOR LIY AXNZ AAN AAKD TOW BAXR THERDD TUW THAW ZAXNDD SEH VAXN 1. A subset of a set in which all members share an equivalence relation. WANT is an equivalence class with members, I need, I would like, and so on.
Words and Syllables Normed by Flat LM [Figures: Words Data Normed by Flat LM; Syllable Data Normed by Flat LM]
Syllabifiers • Syllabifier from National Institute of Standards and Technology (NIST, 2007) • Based on Daniel Kahn’s 1976 dissertation from MIT (Kahn, 1976) • Generative in nature and English-biased
Syllables
• Estimates of the number of English syllables range from 1000 to 30,000
• Suggests that there is some difficulty in pinning down what a syllable is
• Usual hierarchical approach: syllable → onset (C) + rhyme; rhyme → nucleus (V) + coda (C)
Sonority • Sonority rises to the nucleus and falls to the coda • Speech sounds appear to form a sonority hierarchy (from highest to lowest): vowels, glides, liquids, nasals, obstruents • Useful but not absolute: e.g., both depth and spit seem to violate the sonority hierarchy