
Large Vocabulary Continuous Speech Recognition (LVCSR)


Presentation Transcript


1. Large Vocabulary Continuous Speech Recognition (LVCSR). Automatic Speech Recognition, Spring 2016

2. Large Vocabulary Continuous Speech Recognition

3. Sub-word Speech Units


6. HMM-Based Sub-word Speech Units


8. Training of Sub-word Units

9. Training of Sub-word Units (continued)


11. Training Procedure


15. Errors and performance evaluation in PLU recognition
• Substitution errors (s)
• Deletion errors (d)
• Insertion errors (i)
• Performance evaluation: if the total number of PLUs (phone-like units) is N, we define:
  • Correctness rate: (N - s - d) / N
  • Accuracy rate: (N - s - d - i) / N
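A minimal sketch of these two rates in Python; the function name and the counts are illustrative, not from the slides:

```python
def plu_rates(N, s, d, i):
    """Correctness and accuracy rates for PLU recognition.

    N: total number of reference PLUs
    s, d, i: substitution, deletion and insertion error counts
    """
    correctness = (N - s - d) / N      # insertions are not penalized
    accuracy = (N - s - d - i) / N     # insertions are penalized
    return correctness, accuracy

# Hypothetical counts: 1000 reference units, 80 subs, 30 dels, 50 ins
print(plu_rates(1000, 80, 30, 50))     # (0.89, 0.84)
```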

16. Language Models for LVCSR
• Word pair model: specify which word pairs are valid

17. Statistical Language Modeling

18. Perplexity of the Language Model
Entropy of the source:
$H = -\lim_{Q \to \infty} \frac{1}{Q} \sum_{W} P(w_1, \ldots, w_Q) \log_2 P(w_1, \ldots, w_Q)$
Assuming independent generation of words, this reduces to what is called the first-order entropy of the source:
$H = -\sum_{w} P(w) \log_2 P(w)$
If the source is ergodic, meaning its statistical properties can be completely characterized by a sufficiently long sequence that the source puts out, H can be computed from a single sequence:
$H = -\lim_{Q \to \infty} \frac{1}{Q} \log_2 P(w_1, \ldots, w_Q)$

19. Perplexity of the Language Model
In practice we compute H over a finite but sufficiently large Q. H is the degree of difficulty that the recognizer encounters, on average, when it has to determine a word from the same source. If an N-gram language model $P_N(W)$ is used, an estimate of H is:
$\hat{H}_N = -\frac{1}{Q} \log_2 P_N(w_1, \ldots, w_Q)$

20. Perplexity of the Language Model
An estimate of H is $\hat{H}_N = -\frac{1}{Q} \log_2 P_N(w_1, \ldots, w_Q)$. In general $\hat{H}_N \ge H$, since the cross-entropy of a model against the source upper-bounds the source entropy; if the source is ergodic and Q is large enough, the estimate converges. Perplexity, the average word branching factor seen by the recognizer, is defined as:
$PP = 2^{H}$
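As a minimal sketch (the numbers are toy values, not from the slides), the perplexity of a test sequence follows directly from the per-word log probabilities the language model assigns:

```python
import math

def perplexity(log2_word_probs):
    """Perplexity 2^H, with H estimated as the average negative log2
    probability per word over a test sequence of length Q."""
    Q = len(log2_word_probs)
    H = -sum(log2_word_probs) / Q      # entropy estimate, bits per word
    return 2.0 ** H

# Hypothetical per-word probabilities from some N-gram model P_N:
probs = [0.25, 0.125, 0.5, 0.25]
print(perplexity([math.log2(p) for p in probs]))   # 4.0
```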

21. Example: (a) B = 8, (b) B = 4

22. Overall recognition system based on sub-word units

23. Naval Resource (Battleship) Management task: 991-word vocabulary
• NG (no grammar): perplexity = 991

24. Word pair grammar: we can partition the vocabulary into four non-overlapping sets of words; the overall FSN (finite state network) then allows recognition of sentences of the form shown on the slide.

25.
• WP (word pair) grammar: perplexity = 60
• FSN based on the partitioning scheme: 995 real arcs and 18 null arcs
• WB (word bigram) grammar: perplexity = 20

26. Control of word insertion/word deletion rate
• In the structure discussed so far, there is no control over sentence length
• We therefore introduce a word insertion penalty into the Viterbi decoding: a fixed negative quantity is added to the likelihood score at the end of each word arc
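A minimal sketch of how such a penalty enters the search; the constant and the function name are illustrative, not from the slides:

```python
# Log-domain word insertion penalty: a fixed negative constant added to the
# accumulated Viterbi score whenever a path exits a word-end arc.
WORD_INSERTION_PENALTY = -20.0   # hypothetical value, tuned on held-out data

def start_next_word_score(word_end_score):
    """Score with which the next word arc is entered; a more negative
    penalty discourages paths made of many short words."""
    return word_end_score + WORD_INSERTION_PENALTY
```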


31. State Tying

32. Problem definition
• In HMM-based speech recognition, system performance depends critically on how well the state output distributions are modeled, and on how well the model parameters are learned
• The tradeoff: a model whose parameters are easy to estimate, versus a complex density model that makes the HMM a good statistical model of the data
• Common simple densities for the output distribution:
  • A single Gaussian: a poor model for the distribution of cepstral features
  • A Gaussian mixture model (GMM): an effective and still simple model
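A minimal sketch of a diagonal-covariance GMM output density; the shapes and names are my assumptions, not from the slides:

```python
import numpy as np

def gmm_log_likelihood(x, weights, means, variances):
    """log p(x) for a K-component diagonal-covariance Gaussian mixture.

    x: (D,) feature vector; weights: (K,); means, variances: (K, D)
    """
    diff = x - means                                          # (K, D)
    log_norm = -0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)
    log_comp = log_norm - 0.5 * np.sum(diff ** 2 / variances, axis=1)
    return np.logaddexp.reduce(np.log(weights) + log_comp)    # log-sum-exp
```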

33. Problem definition
• Specifying a mixture of K Gaussians requires K mean vectors, K covariance matrices, and K mixture weights
• A recognizer with tens (or hundreds) of thousands of HMM states will therefore require hundreds of thousands (or millions) of parameters to specify all state output densities
• Most training corpora cannot provide sufficient training data to learn all these parameters effectively
• Parameters for the state output densities of sub-word units that are never seen in the training data can never be learned at all
• The key problem: maintaining the balance between model complexity and available training data

34. Training the HMM
• To train the HMM for a sub-word unit, data from all instances of the unit in the training corpus are used to estimate the parameters
• This process can be context-independent or context-dependent

35. Context-independent parameter training
• In a context-independent model, the samples of a unit are gathered from different locations in the corpus
• The effects of neighboring units are ignored
• Procedure: gather data from the separate instances, assign the data to states, aggregate the data for each state, and estimate the statistical parameters of each aggregate
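A minimal sketch of the aggregate-then-estimate step, assuming frames have already been assigned to states and each state is modeled by a single diagonal Gaussian:

```python
import numpy as np

def estimate_state_params(frames_by_state):
    """frames_by_state: dict mapping a state id to the list of (D,) frames
    pooled from every instance of the unit in the corpus."""
    params = {}
    for state, frames in frames_by_state.items():
        data = np.stack(frames)                  # (N_state, D)
        params[state] = (data.mean(axis=0), data.var(axis=0))
    return params
```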

36. Context-dependent parameter training
• Phoneme pronunciation depends on the environment (allophones, co-articulation)
• Context-based grouping of observations results in finer, context-dependent (CD) models
• Triphones: the simplest and most widely used model; the context is a window of length three


38. What to use?
• Word models are best when the vocabulary is small (e.g. digits)
• CI phoneme-based models are rarely used
• Where accuracy is of prime importance, triphone models are usually used
• Where a reduced memory footprint and speed are important, e.g. in embedded recognizers, diphone models are often used
• Higher-order N-phone models are rarely used

39. Triphones: to build the HMM for a word, we simply concatenate the HMMs for the individual triphones in it
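A minimal sketch of expanding a word's phone string into the triphone names whose HMMs are concatenated; the PH(left,right) notation follows slide 40, while using "-" for word boundaries is my assumption:

```python
def word_to_triphones(phones):
    """Expand a phone sequence into word-internal triphone names."""
    names = []
    for i, ph in enumerate(phones):
        left = phones[i - 1] if i > 0 else "-"                 # word start
        right = phones[i + 1] if i < len(phones) - 1 else "-"  # word end
        names.append(f"{ph}({left},{right})")
    return names

print(word_to_triphones(["G", "AX", "T"]))   # the word GUT
# ['G(-,AX)', 'AX(G,T)', 'T(AX,-)']
```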

40. Triphones
• Triphones at word boundaries are dependent on neighboring words
• Cross-word triphones: context spanning word boundaries, important for accurate modeling
• A triphone in the middle of a word sounds different from the same triphone at word boundaries
  • e.g. the word-internal triphone AX(G,T) from GUT: G AX T
  • vs. the cross-word triphone AX(G,T) in BIG ATTEMPT
• This significantly complicates the HMM for the language (through which we find the best path for recognition), resulting in larger HMMs and slower search

41. Problems with triphones
• Parameters: very large numbers for very large vocabulary recognition
  • Number of phones: about 50
  • Number of CD phones: potentially 50³ = 125,000, but not all of them occur (phonotactic constraints); in practice, about 60,000
  • Number of HMM parameters, with 16-component mixtures and 39-dimensional feature vectors: 60,000 × 3 × (39 × 16 × 2 + 16) ≈ 230M
• Data sparsity: some triphones, particularly cross-word triphones, do not appear in the training sample
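The slide's count, checked as a back-of-the-envelope calculation (diagonal covariances assumed, so each Gaussian contributes 39 means plus 39 variances):

```python
n_triphones = 60_000      # CD phones that actually occur
states_per_hmm = 3
dim, n_mix = 39, 16       # feature dimension, Gaussians per state

params_per_state = n_mix * dim * 2 + n_mix   # means + variances + weights
total = n_triphones * states_per_hmm * params_per_state
print(f"{total:,}")       # 227,520,000 -> roughly 230M parameters
```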

42. Solution
• Parameter sharing: cluster parameters with similar characteristics ("parameter tying"), e.g. by clustering HMM states
• Parameter sharing is a technique by which several similar HMM states share a common set of HMM parameters
• Since the shared HMM parameters are trained using the data from all of the similar states, more data are available to train each parameter

43. Parameter sharing types
• Continuous density HMMs: individual states may share the same mixture distributions

44. Parameter sharing types
• Semi-continuous density HMMs: all states may share the same Gaussians, but with different mixture weights

45. Parameter sharing types
• Semi-continuous density HMMs: all states share the same Gaussians with state-specific mixture weights, and similar states may then share the weights as well
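A minimal sketch of the semi-continuous case (names assumed): the shared Gaussian codebook is evaluated once per frame, and each state contributes only its weight vector:

```python
import numpy as np

def state_log_likelihood(codebook_loglikes, state_weights):
    """log p(x | state) = log sum_k w_{state,k} N(x; mu_k, Sigma_k).

    codebook_loglikes: (K,) log N(x; mu_k, Sigma_k) for the shared Gaussians,
                       computed once per frame and reused by every state
    state_weights:     (K,) state-specific (possibly tied) mixture weights
    """
    return np.logaddexp.reduce(np.log(state_weights) + codebook_loglikes)
```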

46. Two techniques
• Data-driven clustering
  • Group HMM states together based on the similarity of their distributions, until all groups have sufficient data
  • The densities used for grouping are poorly estimated in the first place
  • Has no estimates for unseen sub-word units
  • Places no restrictions on HMM topologies etc.
• Decision trees
  • Clustering based on expert-specified rules; the selection of rules is data-driven
  • Based on externally provided rules: very robust if the rules are good
  • Provides a mechanism for estimating unseen sub-word units
  • Restricts HMM topologies

47. Decision tree
• Basic principle: recursively partition a data set to maximize a pre-specified objective function
  • The actual objective function used depends on the specific decision tree algorithm
  • The objective is to separate the data into increasingly "pure" subsets, such that most of the data in any subset belongs to a single class
  • In our case, the "classes" are HMM states
• The most commonly used tools for induction of decision trees: CART (classification and regression trees) and C4.5

48. Decision tree in our problem
• Algorithm:
  • Initially, group together all triphones for the same phoneme
  • Split each group according to decision tree questions based on the left or right phonetic context
  • All triphones (HMM states) at the same leaf are clustered (tied)
• Advantage: even unseen triphones are assigned to a cluster, and thus to a model
• Open questions: which decision tree questions, and which splitting criterion?
• Examples of predefined binary questions:
  • Is the phoneme to the left an /l/?
  • Is the phoneme to the right a nasal?
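A minimal sketch of such binary context questions and how one of them partitions a pool of triphone contexts; the phone class and the contexts are illustrative:

```python
NASALS = {"M", "N", "NG"}   # illustrative phone class behind the question

def right_is_nasal(left, right):
    """'Is the phoneme to the right a nasal?'"""
    return right in NASALS

# Hypothetical (left, right) contexts of pooled AX triphone states:
contexts = [("G", "T"), ("L", "M"), ("B", "N")]
yes = [c for c in contexts if right_is_nasal(*c)]
no = [c for c in contexts if not right_is_nasal(*c)]
print(yes, no)   # [('L', 'M'), ('B', 'N')] [('G', 'T')]
```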

49. Clustering context-dependent phones

50. Splitting criterion
• Criterion: the best question is the one that maximizes the sample likelihood after splitting
• The parent set $O_1$ has a distribution $P_1(x)$; the total log likelihood of all observations in $O_1$ under the distribution of $O_1$ is
  $L(O_1) = \sum_{x \in O_1} \log P_1(x)$
• A question splits $O_1$ into child sets $O_2$ and $O_3$ with distributions $P_2(x)$ and $P_3(x)$; their set-conditioned log likelihoods are
  $L(O_2) = \sum_{x \in O_2} \log P_2(x)$ and $L(O_3) = \sum_{x \in O_3} \log P_3(x)$
• The total increase in set-conditioned log likelihood due to partitioning $O_1$ is
  $\Delta L = L(O_2) + L(O_3) - L(O_1)$
• Partition $O_1$ so that this increase in log likelihood is maximized
• Recursively perform this partitioning on each of the subsets to form a tree
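A minimal sketch of the criterion, assuming each set is modeled by a diagonal Gaussian fit to its own data; under the ML estimates, the set log likelihood depends only on the set's size and variances:

```python
import numpy as np

def set_log_likelihood(data):
    """Total log likelihood of data (N, D) under the diagonal Gaussian
    whose mean and variance are ML estimates from that same data."""
    n, d = data.shape
    var = data.var(axis=0) + 1e-8                 # floor for stability
    return -0.5 * n * (d * np.log(2.0 * np.pi) + np.log(var).sum() + d)

def split_gain(parent, child_a, child_b):
    """Increase in set-conditioned log likelihood: L(O2) + L(O3) - L(O1)."""
    return (set_log_likelihood(child_a) + set_log_likelihood(child_b)
            - set_log_likelihood(parent))
```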
