
Improved Speaker Adaptation Using Speaker Dependent Feature Projections


Presentation Transcript


  1. Improved Speaker Adaptation Using Speaker Dependent Feature Projections Spyros Matsoukas and Richard Schwartz Sep. 5, 2003 Martigny, Switzerland

  2. Overview • Baseline system • Technical background • Heteroscedastic Linear Discriminant Analysis (HLDA) • Constrained Maximum Likelihood Linear Regression (CMLLR) • Speaker Adaptive Training using CMLLR (CMLLR-SAT) • HLDA adaptation • SAT using HLDA adaptation (HLDA-SAT) • Results • Conclusions

  3. Baseline SI system description • PLP front-end, speaker-turn-based cepstral mean normalization • HLDA used to find an ‘optimal’ feature space • Original space consists of 14 cepstral coefficients and energy, plus their first, second and third derivatives (60 dimensions total) • Reduced space has 46 dimensions • Trained three gender-independent (GI) HMMs: • Phonetically tied mixture (PTM), within-word triphone model • State-clustered tied mixture (SCTM), within-word quinphone model • SCTM cross-word quinphone model • Estimated a separate HLDA transform for each model

  4. HLDA • HLDA has been adopted by many state-of-the-art systems • Like LDA, its goal is to find a feature subspace in which it is easier to discriminate among a given set of classes • Unlike LDA, it does not assume that the class Gaussian distributions share a common covariance matrix • Formulated within the ML framework • Many choices are available for the definition of the classes • Phonemes, tied states, mixture components • This work uses the SCTM codebook clusters (HMM tied states) as the classes (see the sketch below)
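
A minimal sketch of the standard HLDA maximum-likelihood objective under diagonal modeling in the projected space, assuming per-class full-covariance statistics have already been accumulated; the function name and arguments are illustrative, not taken from any toolkit.

```python
import numpy as np

def hlda_objective(A, p, class_covs, class_counts, total_cov):
    """ML objective for HLDA, up to an additive constant.

    A            : (n, n) full transform; rows 0..p-1 span the retained subspace
    p            : number of useful dimensions kept after projection
    class_covs   : list of (n, n) class covariance matrices
    class_counts : occupancy count N_c per class
    total_cov    : (n, n) global covariance of the training data
    """
    N = float(sum(class_counts))
    n = A.shape[0]
    obj = N * np.linalg.slogdet(A)[1]  # N * log|det A|
    # Nuisance dimensions p..n-1: modeled by a single global Gaussian
    for j in range(p, n):
        obj -= 0.5 * N * np.log(A[j] @ total_cov @ A[j])
    # Useful dimensions 0..p-1: per-class diagonal variances
    for S_c, N_c in zip(class_covs, class_counts):
        for j in range(p):
            obj -= 0.5 * N_c * np.log(A[j] @ S_c @ A[j])
    return obj
```

In this system's setup, n = 60, p = 46, and the classes are the SCTM tied states.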

  5. CMLLR adaptation • Widely used adaptation method • Estimates a constrained linear transformation that adapts both the means and the covariances of a set of Gaussians • Equivalent to transforming the input features using the inverse transformation matrix (see the sketch below) • A reliable row-iterative estimation method is available when the model to be adapted consists of diagonal-covariance Gaussians • The formulation can be extended to handle full-covariance Gaussians • Easy-to-compute objective function and first derivative • Standard gradient descent methods were used to estimate the ML transformation
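
A minimal sketch of the feature-space view of CMLLR, assuming the transform has already been estimated; names are illustrative. W = [A | b] is applied to each frame, and the log|det A| Jacobian term keeps likelihoods comparable across transforms.

```python
import numpy as np

def apply_cmllr(X, W):
    """Apply a CMLLR transform in feature space.

    X : (T, d) observation frames
    W : (d, d+1) transform [A | b]; adapting every Gaussian's mean and
        covariance in model space is equivalent to x_hat = A x + b in
        feature space, plus log|det A| added to each frame's log-likelihood.
    """
    A, b = W[:, :-1], W[:, -1]
    X_hat = X @ A.T + b
    log_jacobian = np.linalg.slogdet(A)[1]
    return X_hat, log_jacobian
```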

  6. Speaker Adaptive Training (SAT) • SAT brings speaker awareness to acoustic model reestimation • Extends the set of model parameters with speaker-dependent transformations • Reduces inter-speaker variability, resulting in more compact acoustic models • Improves performance on test data after speaker adaptation • Multiple flavors of SAT • MLLR-based, with transforms applied to model parameters • Complicated update equations, hard to integrate with MMI • CMLLR-based, with transforms applied to features • Integrates transparently with regular SI reestimation methods (ML, MMI, etc.)

  7. CMLLR-SAT
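
A minimal sketch of the CMLLR-SAT training loop outlined on the previous slide. All helper names (identity_transform, estimate_cmllr, accumulate_ml_stats, reestimate) and the speaker objects are hypothetical stand-ins, not calls from any particular toolkit; apply_cmllr is the function sketched above.

```python
def cmllr_sat_train(model, speakers, num_iters=4):
    """Hypothetical sketch of CMLLR-based SAT.

    Alternates between (a) estimating one CMLLR transform per speaker
    against the current canonical model and (b) reestimating the model
    on speaker-transformed features with the usual ML updates.
    """
    transforms = {s: identity_transform(model.dim) for s in speakers}
    for _ in range(num_iters):
        # (a) per-speaker transform estimation
        for s in speakers:
            transforms[s] = estimate_cmllr(model, s.frames)
        # (b) canonical model update on CMLLR-normalized features
        stats = accumulate_ml_stats(
            model,
            ((apply_cmllr(s.frames, transforms[s])[0], s) for s in speakers))
        model = reestimate(model, stats)
    return model, transforms
```

Because the transforms live on the feature side, step (b) is an unmodified SI reestimation pass, which is what makes this flavor easy to combine with ML or MMI training.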

  8. HLDA adaptation • A possible mismatch between training and testing acoustic conditions might reduce the effectiveness of HLDA • HLDA adaptation alleviates this problem by transforming the test features so that their statistics look more like the training statistics • Uses CMLLR in the full feature space, based on a single-Gaussian-per-tied-state HMM • The CMLLR transform is then combined with the global HLDA matrix to form speaker-dependent projections (see the sketch below) • Most effective when applied to both training and testing
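
A minimal sketch of the composition step, with illustrative names: the global HLDA projection H and a speaker's full-space CMLLR transform W_s collapse into one speaker-dependent projection applied to augmented frames.

```python
import numpy as np

def speaker_projection(H, W_s):
    """Compose global HLDA with a speaker's full-space CMLLR.

    H   : (p, n) global HLDA projection (p of n dimensions retained)
    W_s : (n, n+1) full-space CMLLR transform [A_s | b_s] for speaker s
    Returns a (p, n+1) matrix acting on augmented frames [x; 1].
    """
    return H @ W_s

# Usage: y = speaker_projection(H, W_s) @ np.append(x, 1.0)
# maps an n-dim frame x into the speaker-normalized p-dim space.
```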

  9. HLDA-SAT
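
A companion sketch of the HLDA-SAT loop, with the same hypothetical helper names as the CMLLR-SAT sketch plus reestimate_hlda and retrain_projected. It reflects the two extensions detailed on slide 13: transforms estimated in the full space against a 1gps model, and HLDA updated in the transformed space.

```python
def hlda_sat_train(model_1gps, model, H, speakers, num_iters=4):
    """Hypothetical sketch of HLDA-SAT."""
    for _ in range(num_iters):
        # Full-space CMLLR per speaker, against the 1gps model
        transforms = {s: estimate_cmllr(model_1gps, s.full_frames)
                      for s in speakers}
        # Reestimate the HLDA projection on speaker-normalized
        # full-space features
        H = reestimate_hlda(apply_cmllr(s.full_frames, transforms[s])[0]
                            for s in speakers)
        # Retrain the recognizer with the composed speaker-dependent
        # projections H @ W_s
        model = retrain_projected(model, H, transforms, speakers)
    return model, H, transforms
```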

  10. Experimental Setup • Trained gender-independent (GI), band-independent (BI) models on 145 hours of Broadcast News (BN) data, using ML • 6,300 tied states • 25.6 Gaussians per state on average • Trigram language model (LM), trained on 600M words • 13M bigrams, 43M trigrams • Tested on the h4e97 and h4d03 test sets • Automatic segmentation and speaker clustering • Two decoding passes • Unadapted pass, generating hypotheses for adaptation • Adapted pass, using SI or SAT-adapted models

  11. Results-I • Effect of HLDA adaptation using SI models • Significant gain from HLDA adaptation, even on top of CMLLR and MLLR

  12. Results-II • Effect of HLDA adaptation using SAT models • 0.6-0.8% absolute gain from HLDA-SAT compared to CMLLR-SAT

  13. Understanding the improvements • HLDA-SAT extends CMLLR-SAT in two ways • Uses a single-Gaussian-per-state (1gps) model to estimate transforms in the full space • Updates HLDA in the transformed space • Which of the two has the larger effect on recognition accuracy? • The 1gps model allows the estimation of CMLLR transforms that move the speakers closer to the canonical model • Reestimating HLDA in the transformed space results in a significantly higher objective function value • Tried two variations of HLDA-SAT in which the SI HLDA is kept fixed (compared in the sketch below) • HLDA-SAT1: using 1gps-based CMLLR in the reduced space • HLDA-SAT2: using 1gps-based CMLLR in the full space
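
An illustrative comparison of the two variations, assuming the reduced-space transform of HLDA-SAT1 acts after the fixed SI projection while HLDA-SAT2's full-space transform acts before it; names are hypothetical.

```python
import numpy as np

def hlda_sat1_projection(H, W_red):
    """HLDA-SAT1: 1gps CMLLR in the reduced space, SI HLDA fixed.

    H     : (p, n) SI HLDA projection
    W_red : (p, p+1) reduced-space CMLLR [A_s | b_s]
    """
    A, b = W_red[:, :-1], W_red[:, -1]
    # y = A (H x) + b, folded into one (p, n+1) matrix on [x; 1]
    return np.hstack([A @ H, b[:, None]])

def hlda_sat2_projection(H, W_full):
    """HLDA-SAT2: 1gps CMLLR in the full space, SI HLDA fixed.

    W_full : (n, n+1) full-space CMLLR [A_s | b_s]
    """
    return H @ W_full
```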

  14. Results-III • Effects of the HLDA update and full-space transforms • Most of the improvement from HLDA-SAT is due to using a 1gps model; the rest comes from updating the HLDA projection in the transformed space

  15. HLDA-SAT on CTS data • Applied HLDA-SAT to English and Mandarin CTS, with mixed results • 0.7% gain on Mandarin CTS • 0.1% gain on English CTS • Suspect a problem with the English CTS run; more debugging is needed to determine the cause of the poor performance

  16. Conclusions • Significant gain from HLDA adaptation • Additional improvement from HLDA-SAT • Future work: • Find out why there is no gain from HLDA-SAT on English CTS • Extend method to use non-linear transformations
