
Automatic Continuous Speech Recognition




Presentation Transcript


  1. Automatic Continuous Speech Recognition • [Block diagram: text and speech databases feeding training and scoring.]

  2. Automatic Continuous Speech Recognition • Problems with isolated word recognition: • Every new task contains novel words without any available training data. • There are simply too many words, and these words may have different acoustic realizations, which increases variability. • Coarticulation across word boundaries. • Variable speaking rate. • The word boundaries themselves are unknown.

  3. In CSR, should we use words? Or, more generally, what is the basic unit that best represents the salient acoustic and phonetic information?

  4. Modeling Unit Issues • Accurate: represents the acoustic realizations that appear in different contexts. • Trainable: enough data is available to estimate the unit's parameters. • Generalizable: new words can be derived from the existing units.

  5. Comparison of Different Units • Words: • Small task: accurate, trainable, not generalizable. • Large vocabulary: accurate, not trainable, not generalizable. • Phonemes: • Large vocabulary: not accurate, trainable, over-generalizable.

  6. Syllables • English: about 30,000 syllables • Not very accurate, not trainable, generalizable. • Chinese: 1,200 tone-dependent syllables. • Japanese: about 50 syllables • Accurate, trainable, generalizable. • Allophones: realizations of phonemes in different contexts. • Accurate, not trainable, generalizable. • Triphones: an example of allophones.

  7. Training in Sphinx: the phoneme set is trained, triphones are created, triphones are trained, senones are created, senones are pruned, and senones are trained, growing each mixture from 1 Gaussian up to 8 or 16 Gaussians.
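
The last step, growing each senone's output distribution from 1 Gaussian up to 8 or 16, is typically done by repeatedly splitting the Gaussians and re-training. A minimal sketch of that mixture-splitting idea (not the actual Sphinx code; the perturbation constant eps and the re-training placeholder are assumptions):

```python
import numpy as np

def split_gaussians(means, variances, weights, eps=0.2):
    """Double a mixture by perturbing each Gaussian's mean.

    Each component (mean, var, weight) is replaced by two components
    whose means are shifted by +/- eps * std, each carrying half the
    weight. After splitting, the mixture is re-trained (e.g. with EM).
    """
    std = np.sqrt(variances)
    new_means = np.concatenate([means + eps * std, means - eps * std])
    new_vars = np.concatenate([variances, variances])
    new_weights = np.concatenate([weights / 2, weights / 2])
    return new_means, new_vars, new_weights

# Grow a senone's mixture from 1 Gaussian to 8 (1 -> 2 -> 4 -> 8),
# assuming 39-dimensional feature vectors.
means, variances, weights = np.zeros((1, 39)), np.ones((1, 39)), np.ones(1)
for _ in range(3):
    means, variances, weights = split_gaussians(means, variances, weights)
    # ... re-estimate the mixture parameters with Baum-Welch here ...
print(len(weights))  # 8
```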

  8. Context Independent: Phonemes • SPHINX: model_architecture/Telefonica.ci.mdef • Context Dependent: Triphones • SPHINX: model_architecture/Telefonica.untied.mdef

  9. Clustering Acoustic-Phonetic Units • Many phones have similar effects on their neighboring phones; hence, many triphones have very similar Markov states. • A senone is a cluster of similar Markov states. • Advantages: • More training data per parameter. • Less memory used.

  10. Senonic Decision Tree (SDT) • An SDT classifies the Markov states of the triphones in the training corpus by asking linguistic questions composed of conjunctions, disjunctions, and/or negations of a set of predetermined questions.

  11. Linguistic Questions

  12. [Tree diagram: decision tree for classifying the second state of the k-triphones. Internal nodes ask questions such as "Is the left phone (LP) a sonorant or nasal?", "Is LP /s, z, sh, zh/?", "Is the right phone (RP) a back-R?", "Is RP voiced?", and "Is LP a back-L, or (LC neither a nasal nor RF a LAX-vowel)?"; the leaves are senones 1-6.]

  13. [The same tree applied to the word welcome: the questions are answered for the /k/ triphone's left and right phones until a leaf senone is reached.]

  14. The tree can be constructed automatically by searching, at each node, for the question that yields the maximum decrease in entropy, as sketched below. • Sphinx: • Construction: $base_dir/c_scripts/03.buildtrees • Results: $base_dir/trees/Telefonica.unpruned/A-0.dtree • When the tree grows too large, it needs to be pruned. • Sphinx: • $base_dir/c_scripts/04.buildtrees • Results: • $base_dir/trees/Telefonica.500/A-0.dtree • $base_dir/Telefonica_arquitecture/Telefonica.500.mdef
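
A minimal sketch of one greedy split, under toy assumptions: each Markov state carries a label standing in for its acoustic statistics, and each predefined linguistic question is modeled as a set of left phones. The chosen question is the one whose yes/no partition most reduces the weighted entropy. The question set, data layout, and function names are illustrative, not the Sphinx implementation:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def best_question(states, questions):
    """Pick the question whose yes/no split maximizes the entropy decrease.

    states: list of (left_phone, label) pairs; the label stands in for
            the acoustic cluster a Markov state belongs to (a toy proxy).
    questions: dict mapping question name -> set of left phones answering yes.
    """
    labels = [lab for _, lab in states]
    base = entropy(labels)
    best = None
    for name, phone_set in questions.items():
        yes = [lab for p, lab in states if p in phone_set]
        no = [lab for p, lab in states if p not in phone_set]
        if not yes or not no:
            continue  # a useless question: everything falls on one side
        split = (len(yes) * entropy(yes) + len(no) * entropy(no)) / len(states)
        gain = base - split
        if best is None or gain > best[1]:
            best = (name, gain)
    return best

# Hypothetical second-state data for /k/ triphones: (left phone, cluster).
states = [("m", "A"), ("n", "A"), ("s", "B"), ("z", "B"), ("t", "C")]
questions = {"LP sonorant or nasal?": {"m", "n", "l", "r"},
             "LP /s,z,sh,zh/?": {"s", "z", "sh", "zh"}}
print(best_question(states, questions))  # question with the largest gain
```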

  15. Subword Unit Models Based on HMMs

  16. Words • Words can be modeled using composite HMMs; see the sketch below. • A null transition is used to go from one subword unit to the following one. • [Diagram: the word two modeled as the chain /sil/ → /t/ → /uw/ → /sil/.]
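
A minimal sketch of that composition, assuming simple left-to-right unit HMMs. For simplicity the null transition is folded into an ordinary exit probability from each unit's last state into the next unit's first state; the state counts and probabilities are illustrative:

```python
import numpy as np

def left_to_right_hmm(n_states, p_stay=0.6):
    """Transition matrix of a left-to-right HMM with self-loops."""
    A = np.zeros((n_states, n_states))
    for i in range(n_states - 1):
        A[i, i], A[i, i + 1] = p_stay, 1 - p_stay
    A[-1, -1] = 1.0
    return A

def concatenate(units):
    """Compose unit HMMs into one word HMM.

    Each unit keeps its own transition block; the last state of every
    unit (except the final one) gains a transition into the first state
    of the next unit, standing in for the null transition.
    """
    sizes = [a.shape[0] for a in units]
    n = sum(sizes)
    A = np.zeros((n, n))
    offset = 0
    for k, a in enumerate(units):
        s = sizes[k]
        A[offset:offset + s, offset:offset + s] = a
        if k < len(units) - 1:
            A[offset + s - 1, offset + s - 1] = 0.6  # keep a self-loop
            A[offset + s - 1, offset + s] = 0.4      # exit to next unit
        offset += s
    return A

# Word "two" as /sil/ + /t/ + /uw/ + /sil/, three emitting states per unit.
word_hmm = concatenate([left_to_right_hmm(3) for _ in range(4)])
print(word_hmm.shape)  # (12, 12)
```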

  17. Continuous Speech Training • [Block diagram: text and speech databases feeding training and scoring.]

  18. For each utterance to be trained, the subword units are concatenated to form the word models. • Sphinx: Dictionary • $base_dir/training_input/dict.txt • $base_dir/training_input/train.lbl

  19. Let's assume we are going to train the phonemes in the sentence: • Two four six. • The phonemes of this sentence are: • /t/ /w/ /o/ /f/ /o/ /r/ /s/ /i/ /x/ • Therefore the HMM will be: • [Diagram: the phoneme HMMs concatenated in sequence, with /sil/ models at the start and end.]
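
A minimal sketch of that expansion step, with a hypothetical in-memory dictionary standing in for $base_dir/training_input/dict.txt:

```python
# Hypothetical pronunciation dictionary (stand-in for dict.txt),
# using the slide's transcription of "two four six".
DICT = {
    "two": ["t", "w", "o"],
    "four": ["f", "o", "r"],
    "six": ["s", "i", "x"],
}

def utterance_to_units(sentence):
    """Expand an utterance into the subword-unit sequence to train,
    adding silence models at the start and end."""
    units = ["sil"]
    for word in sentence.lower().split():
        units.extend(DICT[word])
    units.append("sil")
    return units

print(utterance_to_units("Two four six"))
# ['sil', 't', 'w', 'o', 'f', 'o', 'r', 's', 'i', 'x', 'sil']
```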

  20. We can estimate the parameters of each HMM using the forward-backward re-estimation formulas already defined.

  21. The ability to automatically align each individual HMM to the corresponding unsegmented speech observation sequence is one of the most powerful features of the forward-backward algorithm.
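
A minimal sketch of the underlying machinery, assuming a precomputed matrix B of per-frame emission likelihoods: the forward and backward passes combine into the state-occupancy probabilities (gamma) that drive both the soft alignment and the re-estimation formulas. No scaling is applied, which is fine only for toy-length utterances:

```python
import numpy as np

def forward_backward(A, B, pi):
    """Forward-backward pass for one utterance.

    A:  (S, S) state transition matrix
    B:  (T, S) emission likelihoods b_j(o_t) per frame and state
    pi: (S,)   initial state distribution
    Returns gamma (T, S): P(state j at frame t | whole observation).
    """
    T, S = B.shape
    alpha = np.zeros((T, S))
    beta = np.zeros((T, S))
    alpha[0] = pi * B[0]
    for t in range(1, T):                 # forward pass
        alpha[t] = (alpha[t - 1] @ A) * B[t]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):        # backward pass
        beta[t] = A @ (B[t + 1] * beta[t + 1])
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)
    return gamma

# Toy example: 2 states, 4 frames of made-up emission likelihoods.
A = np.array([[0.7, 0.3], [0.0, 1.0]])
B = np.array([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]])
pi = np.array([1.0, 0.0])
print(forward_backward(A, B, pi))  # soft alignment of frames to states
```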

  22. Language Models for Large Vocabulary Speech Recognition • [Block diagram: text and speech databases feeding training and scoring.]

  23. Instead of using the acoustic (Viterbi) score alone, recognition can be improved by maximizing the posterior probability: choose the word sequence W that maximizes P(W|X) ∝ P(X|W) P(W), where P(X|W) comes from the Viterbi acoustic search and P(W) from the language model.

  24. Language Models for Large Vocabulary Speech Recognition • Goal: • Provide an estimate of the probability of a word sequence (w1 w2 w3 ... wQ) for the given recognition task. • By the chain rule, this decomposes as: • P(w1 w2 ... wQ) = P(w1) P(w2|w1) P(w3|w1 w2) ... P(wQ|w1 ... wQ-1)

  25. Since it is impossible to reliably estimate these conditional probabilities for long histories, in practice an N-gram language model is used: • P(wj | w1 ... wj-1) ≈ P(wj | wj-N+1 ... wj-1) • In practice, reliable estimates are obtained for N=1 (unigram), N=2 (bigram), or possibly N=3 (trigram).

  26. Examples: • Unigram: P(Maria loves Pedro) = P(Maria) P(loves) P(Pedro) • Bigram: P(Maria loves Pedro) = P(Maria|<sil>) P(loves|Maria) P(Pedro|loves) P(</sil>|Pedro)

  27. CMU-Cambridge Language Modeling Tools • $base_dir/c_scripts/languageModelling

  28. [Block diagram: text and speech databases feeding training and scoring.]

  29. P(Wi | Wi-2, Wi-1) = C(Wi-2 Wi-1 Wi) / C(Wi-2 Wi-1) • where • C(Wi-2 Wi-1) = total number of times the sequence Wi-2 Wi-1 was observed • C(Wi-2 Wi-1 Wi) = total number of times the sequence Wi-2 Wi-1 Wi was observed
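
A minimal sketch of this maximum-likelihood estimate over a toy corpus (the corpus, the sentence markers, and the name Juan are assumptions; real toolkits such as the CMU-Cambridge tools also add smoothing for unseen N-grams):

```python
from collections import Counter

# Toy training corpus with sentence-boundary markers.
corpus = [["<s>", "Maria", "loves", "Pedro", "</s>"],
          ["<s>", "Pedro", "loves", "Maria", "</s>"],
          ["<s>", "Maria", "loves", "Juan", "</s>"]]

trigrams, bigrams = Counter(), Counter()
for sent in corpus:
    for i in range(len(sent) - 2):
        trigrams[tuple(sent[i:i + 3])] += 1   # C(Wi-2 Wi-1 Wi)
        bigrams[tuple(sent[i:i + 2])] += 1    # C(Wi-2 Wi-1)

def p_trigram(w, w2, w1):
    """P(w | w2, w1) = C(w2 w1 w) / C(w2 w1), the ML estimate above."""
    if bigrams[(w2, w1)] == 0:
        return 0.0  # unseen history; smoothing would handle this
    return trigrams[(w2, w1, w)] / bigrams[(w2, w1)]

print(p_trigram("Pedro", "Maria", "loves"))  # 0.5 on this toy corpus
```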
