
CMU Robust Vocabulary-Independent Speech Recognition System


Presentation Transcript


  1. CMU Robust Vocabulary-Independent Speech Recognition System Hsiao-Wuen Hon and Kai-Fu Lee ICASSP 1991 Presenter: Fang-Hui CHU

  2. Reference • CMU Robust Vocabulary-Independent Speech Recognition System, Hsiao-Wuen Hon and Kai-Fu Lee, ICASSP 1991

  3. Outline • Introduction • Larger Training Database • Between-Word Triphone • Decision Tree Allophone Clustering • Summary of Experiments and Results • Conclusions

  4. Introduction • This paper reports efforts to improve the performance of CMU’s robust vocabulary-independent (VI) speech recognition system on the DARPA speaker-independent resource management task • The first improvement is the incorporation of more dynamic features in the acoustic front-end processing (second-order differenced cepstra and power) • The second improvement is the collection of more general English data, from which more phonetic variabilities, such as word-boundary context, can be modeled
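
A minimal sketch (not the paper's exact front end) of how first- and second-order differenced cepstra could be appended to static cepstral frames; the regression window length and feature dimensions are assumptions chosen only for illustration.

```python
import numpy as np

def add_dynamic_features(cepstra, window=2):
    """Append first- and second-order differenced cepstra to a
    (num_frames x num_coeffs) matrix of static cepstra.
    The regression window length is an assumption, not taken from the paper."""
    def delta(feats, N):
        # Regression-style difference over +/- N neighboring frames.
        padded = np.pad(feats, ((N, N), (0, 0)), mode="edge")
        denom = 2 * sum(n * n for n in range(1, N + 1))
        num = sum(n * (padded[N + n:N + n + len(feats)] -
                       padded[N - n:N - n + len(feats)])
                  for n in range(1, N + 1))
        return num / denom

    d1 = delta(cepstra, window)        # first-order dynamics
    d2 = delta(d1, window)             # second-order dynamics
    return np.hstack([cepstra, d1, d2])

# Example: 100 frames of 12 cepstral coefficients -> 36-dimensional vectors
# (power and differenced power would be handled analogously).
frames = np.random.randn(100, 12)
print(add_dynamic_features(frames).shape)   # (100, 36)
```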

  5. Introduction (cont.) • With more detailed models (such as between-word triphones), coverage of new tasks was reduced • A new decision-tree-based subword clustering algorithm is introduced to find more suitable models for the subword units not covered in the training set • The vocabulary-independent system suffered much more from the difference in recording environments at TI versus CMU than the vocabulary-dependent system did

  6. Larger Training Database • The vocabulary-independent results improved dramatically as the amount of vocabulary-independent training data increased • Adding 5000 more general English sentences to the vocabulary-independent training set yielded only a small improvement, reducing the error rate from 9.4% to 9.1% • The subword modeling technique may have reached an asymptote, so that additional sentences no longer give much improvement

  7. Between-Word Triphone • Because the subword models are phonetic models, one way to model more acoustic-phonetic detail is to incorporate more context information • Between-word triphones are modeled in the vocabulary-independent system by adding three more contexts • Word-beginning, word-ending, and single-phone word positions • In the past, it has been argued that between-word triphones might be learning grammatical constraints instead of modeling acoustic-phonetic variations • The results show the contrary, since in vocabulary-independent systems the grammars used in training and recognition are completely different
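
As a rough illustration (the label notation and silence handling are assumptions, not the paper's conventions), the sketch below expands a phone-transcribed word sequence into triphone labels whose left/right contexts cross word boundaries and which are tagged with the three extra positions: word beginning, word ending, and single-phone word.

```python
def between_word_triphones(words, silence="sil"):
    """words: list of phone lists, e.g. [["hh", "aw"], ["aa", "r"], ["y", "uw"]]."""
    # Flatten into one phone stream so left/right contexts cross word boundaries.
    stream = [silence] + [p for w in words for p in w] + [silence]
    # Position tag per phone: b = word beginning, e = word ending, s = single-phone word.
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("s")
        else:
            tags += ["b"] + [""] * (len(w) - 2) + ["e"]

    labels = []
    for i in range(1, len(stream) - 1):
        left, phone, right = stream[i - 1], stream[i], stream[i + 1]
        tag = tags[i - 1]
        labels.append(f"{left}-{phone}+{right}" + (f"@{tag}" if tag else ""))
    return labels

print(between_word_triphones([["hh", "aw"], ["aa", "r"], ["y", "uw"]]))
# ['sil-hh+aw@b', 'hh-aw+aa@e', 'aw-aa+r@b', 'aa-r+y@e', 'r-y+uw@b', 'y-uw+sil@e']
```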

  8. Decision Tree Allophone Clustering • At the root of the decision tree is the set of all triphones corresponding to a phone • Each node has a binary “question” about the triphones’ contexts, including the left, right, and word-boundary contexts • e.g. “Is the right phoneme a back vowel?” • These questions are created using human speech knowledge and are designed to capture classes of contextual effects • To find the generalized triphone for a triphone, the tree is traversed by answering the questions attached to each node until a leaf node is reached
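
A toy sketch of that lookup: an unseen triphone is mapped to its generalized triphone by walking down the tree and answering the question at each node. The node structure, question set, and leaf labels below are hypothetical.

```python
BACK_VOWELS = {"aa", "ao", "ow", "uh", "uw"}

class Node:
    def __init__(self, question=None, yes=None, no=None, leaf_id=None):
        self.question = question   # predicate(triphone) -> bool; None at a leaf
        self.yes, self.no = yes, no
        self.leaf_id = leaf_id     # generalized-triphone id at a leaf

def find_generalized_triphone(root, triphone):
    node = root
    while node.question is not None:            # descend until a leaf is reached
        node = node.yes if node.question(triphone) else node.no
    return node.leaf_id

# Toy tree for the phone "k": one question, two leaf clusters.
tree = Node(
    question=lambda t: t["right"] in BACK_VOWELS,   # "Is the right phoneme a back vowel?"
    yes=Node(leaf_id="k_before_back_vowel"),
    no=Node(leaf_id="k_other"),
)

print(find_generalized_triphone(tree, {"left": "s", "phone": "k", "right": "uw"}))
# -> k_before_back_vowel
```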

  9. Decision Tree Allophone Clustering (cont.)

  10. Decision Tree Allophone Clustering (cont.) • The metric for splitting is an information-theoretic distance measure based on the amount of entropy reduction achieved when splitting a node • The goal is to find the question that divides node m into nodes a and b such that P(m)H(m) - P(a)H(a) - P(b)H(b) is maximized, where P(·) is the probability of a node and H(·) its entropy
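
A small sketch of that criterion under the simplifying assumption that each node is summarized by discrete output-symbol counts (standing in for the HMM output distributions): the gain of a candidate split is P(m)H(m) - P(a)H(a) - P(b)H(b), with node probabilities approximated by counts.

```python
import numpy as np

def entropy(counts):
    """Entropy (bits) of a discrete distribution given raw counts."""
    p = counts / counts.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def split_gain(counts_a, counts_b):
    """P(m)H(m) - P(a)H(a) - P(b)H(b) for splitting node m into a and b.
    Node probabilities are approximated by raw counts, which only rescales the
    criterion by a constant and does not change the ranking of questions."""
    counts_m = counts_a + counts_b
    return (counts_m.sum() * entropy(counts_m)
            - counts_a.sum() * entropy(counts_a)
            - counts_b.sum() * entropy(counts_b))

# Toy example: symbol counts falling into each candidate child node.
good_split = split_gain(np.array([90., 5., 5.]), np.array([5., 50., 45.]))
poor_split = split_gain(np.array([50., 25., 25.]), np.array([45., 30., 25.]))
print(good_split > poor_split)   # True: the first question separates the data better
```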

  11. Decision Tree Allophone Clustering (cont.) • The algorithm to generate a decision tree for a phone is given below • 1. Generate an HMM for every triphone • 2. Create a tree with one (root) node, consisting of all triphones • 3. Find the best composite question for each node • (a) Generate a tree with simple questions at each node • (b) Cluster leaf nodes into two classes, representing the composite question • 4. Split the node with the overall best question • 5. Until some convergence criterion is met, go to step 3
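
The following is a compact, self-contained sketch of the greedy loop in steps 2–5 under simplifying assumptions: HMM training (step 1) and composite questions are omitted, each triphone is summarized by a vector of output-symbol counts, and the split metric is the entropy reduction from the previous slide. The triphone labels and questions are made up for illustration.

```python
import numpy as np

def entropy(c):
    p = c / c.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def gain(yes, no):
    """Entropy reduction of splitting a leaf into the 'yes'/'no' groups."""
    ca = sum(yes.values()); cb = sum(no.values()); cm = ca + cb
    return cm.sum() * entropy(cm) - ca.sum() * entropy(ca) - cb.sum() * entropy(cb)

def grow_tree(triphones, questions, max_leaves=4):
    """triphones: {label: count vector}; questions: {text: predicate(label) -> bool}."""
    leaves = [dict(triphones)]                       # step 2: one root node
    while len(leaves) < max_leaves:                  # step 5: convergence criterion
        best = None
        for i, leaf in enumerate(leaves):            # step 3: best question per node
            for text, q in questions.items():
                yes = {t: c for t, c in leaf.items() if q(t)}
                no = {t: c for t, c in leaf.items() if not q(t)}
                if yes and no:
                    g = gain(yes, no)
                    if best is None or g > best[0]:
                        best = (g, i, text, yes, no)
        if best is None:
            break
        g, i, text, yes, no = best                   # step 4: split the best node
        print(f"split leaf {i} on: {text}  (gain {g:.1f})")
        leaves[i:i + 1] = [yes, no]
    return leaves

BACK_VOWELS = {"aa", "ow", "uw"}
triphones = {
    "s-k+uw": np.array([80., 10., 10.]), "t-k+ow": np.array([70., 20., 10.]),
    "s-k+ih": np.array([10., 80., 10.]), "t-k+iy": np.array([10., 70., 20.]),
}
questions = {
    "Is the right phoneme a back vowel?": lambda t: t.split("+")[1] in BACK_VOWELS,
    "Is the left phoneme /s/?": lambda t: t.split("-")[0] == "s",
}
print(len(grow_tree(triphones, questions, max_leaves=3)), "leaves")
```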

  12. Decision Tree Allophone Clustering (cont.) • If only simple questions are allowed in the algorithm, the data may be over-fragmented, resulting in similar leaves in different locations of the tree • This problem is dealt with by using composite questions (questions that involve conjunctive and disjunctive combinations of all questions and their negations)
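
A hypothetical illustration of the idea: each leaf of a simple-question tree corresponds to a conjunction of the answers along its path, and grouping leaves into two classes yields a composite question that is a disjunction of such conjunctions. The specific questions and phone sets below are assumptions.

```python
# Two simple context questions (illustrative sets, not from the paper).
is_right_back_vowel = lambda t: t["right"] in {"aa", "ow", "uw"}
is_left_fricative   = lambda t: t["left"] in {"s", "sh", "f", "z"}

def composite(t):
    """Composite question built from two leaf paths assigned to the 'yes' class:
    (right is a back vowel AND left is a fricative)
    OR (right is NOT a back vowel AND left is NOT a fricative)."""
    a, b = is_right_back_vowel(t), is_left_fricative(t)
    return (a and b) or (not a and not b)

print(composite({"left": "s", "right": "uw"}))   # True
print(composite({"left": "s", "right": "ih"}))   # False
print(composite({"left": "t", "right": "ih"}))   # True
```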

  13. Decision Tree Allophone Clustering (cont.) • The significant improvement here is due to three reasons • 1. Improved tree-growing and pruning techniques • 2. The models in this study are more detailed and consistent, which makes it easier to find appropriate and meaningful questions • 3. Triphone coverage is lower in this study, so decision-tree-based clustering is able to find more suitable models

  14. Summary of Experiments and Results • All the experiments are evaluated on the speaker-independent DARPA resource management task • A 991-word continuous speech task and a standard word-pair grammar with perplexity 60 were used throughout • The test set consists of 320 sentences from 32 speakers • For the vocabulary-dependent (VD) system, they used the standard DARPA speaker-independent database, which consisted of 3990 sentences from 109 speakers, to train the system under different configurations

  15. Summary of Experiments and Results (cont.) • The baseline vocabulary-independent (VI) system was trained from a total of 15000 VI sentences; 5000 of these were TIMIT and Harvard sentences, and 10000 were general English sentences recorded at CMU

  16. Conclusions • This paper presented several techniques that substantially improve the performance of CMU’s vocabulary-independent speech recognition system • The subword units can be enhanced by modeling more acoustic-phonetic variations • Future work includes refining and constraining the types of questions that can be asked to split the decision tree • There is still a non-negligible degradation under cross-recording conditions
