AN EXPECTATION MAXIMIZATION APPROACH FOR FORMANT TRACKING USING A PARAMETER-FREE NON-LINEAR PREDICTOR
Issam Bazzi, Alex Acero, and Li Deng, Microsoft Research, One Microsoft Way, Redmond, WA, USA, 2003

Presentation Transcript


  1. AN EXPECTATION MAXIMIZATION APPROACH FOR FORMANT TRACKING USING A PARAMETER-FREE NON-LINEAR PREDICTOR Issam Bazzi, Alex Acero, and Li Deng, Microsoft Research, One Microsoft Way, Redmond, WA, USA, 2003

  2. Outline • Introduction • The Model • EM Training • Formant Tracking • Experiment Results • Conclusion

  3. Introduction • Traditional methods rely on LPC analysis or on matching stored templates of spectral cross-sections • In either case, formant tracking is error-prone because too few candidates or templates are considered • This paper instead uses a predictor codebook that maps formant values to MFCC vectors • Because the codebook covers the complete quantized formant space, no candidate is eliminated prematurely, as happens in LPC peak-picking or template matching

  4. The Model • ot = F(xt) + rt • ot is the observed MFCC vector at frame t • xt is the vocal tract resonance (VTR) frequencies and corresponding bandwidths • F(xt) maps each quantized VTR value to an MFCC vector through a lookup table called the predictor codebook • rt is the residual signal
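Read generatively, each frame draws a hidden codebook entry and adds Gaussian residual noise. A minimal sketch of that view (the codebook FX here is a random stand-in, not the real predictor codebook):

```python
import numpy as np

# Generative view of the model: o_t = F(x_t) + r_t.
# FX below is a random stand-in for the real predictor codebook.
rng = np.random.default_rng(0)
C, D = 1000, 12                                # codebook size, MFCC dimension
FX = rng.normal(size=(C, D))                   # hypothetical codebook F(x)
c = rng.integers(C)                            # hidden VTR index, uniform prior
o_t = FX[c] + rng.normal(scale=0.1, size=D)    # observed MFCC frame
```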

  5. Constructing F(x) • All-pole model • Assume there are I formants • x = (F1, B1, F2, B2, ..., FI, BI) • Each pair (Fi, Bi) gives a pole zi = exp(-πBi/fs + j2πFi/fs), so the z-transform of the model is H(z) = Π_{i=1..I} 1 / [(1 - zi z^-1)(1 - zi* z^-1)] • Finally, each quantized VTR value x is transformed into an MFCC vector F(x)
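For a unit-gain all-pole model the cepstrum has a closed form in the pole locations, which gives a direct way to fill each codebook entry. A minimal sketch (plain cepstrum; the mel warping of the frequency axis that the paper applies is omitted, and the function name and sampling rate are illustrative):

```python
import numpy as np

def formant_to_cepstrum(F, B, fs=8000.0, n_coeffs=12):
    """Closed-form cepstrum of the all-pole model H(z) above.
    F, B: formant frequencies and bandwidths in Hz; gain = 1, so c0 = 0.
    c_n = (2/n) * sum_i r_i**n * cos(n * theta_i), where
    r_i = exp(-pi*B_i/fs) is the pole radius and theta_i = 2*pi*F_i/fs."""
    r = np.exp(-np.pi * np.asarray(B) / fs)       # pole radii
    theta = 2.0 * np.pi * np.asarray(F) / fs      # pole angles
    n = np.arange(1, n_coeffs + 1)[:, None]       # cepstral indices 1..n_coeffs
    return np.sum((2.0 / n) * r**n * np.cos(n * theta), axis=1)

# e.g. a neutral-vowel-like entry: F1 = 500, F2 = 1500, F3 = 2500 Hz
c = formant_to_cepstrum([500.0, 1500.0, 2500.0], [60.0, 80.0, 100.0])
```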

  6. EM Training (1/2) • Use a single Gaussian to model rt • For an utterance of T frames, θ denotes the parameters (mean and covariance) of this Gaussian • Assume the formant values x are uniformly distributed and can take any of C quantized values

  7. EM Training (2/2) • The updates follow directly from the single-Gaussian model above • E-step: for each frame t and codebook entry c, compute the posterior γt(c) = p(xt = x(c) | ot, θ) ∝ N(ot; F(x(c)) + μ, Σ), normalized over the C entries • M-step: re-estimate μ = (1/T) Σt Σc γt(c) (ot − F(x(c))), and Σ as the correspondingly weighted covariance of the residuals
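A compact sketch of one way to implement these updates, assuming a diagonal residual covariance and a codebook small enough that the full T x C x D residual tensor fits in memory (it would not at the paper's 767,500 entries):

```python
import numpy as np

def em_train(O, FX, n_iter=10):
    """EM for the single-Gaussian residual model (sketch).
    O: (T, D) observed MFCC frames; FX: (C, D) predictor codebook."""
    T, D = O.shape
    mu, var = np.zeros(D), np.ones(D)
    for _ in range(n_iter):
        # E-step: posterior over codebook entries (uniform prior on x);
        # the log-det term is constant across entries, so it is dropped.
        err = O[:, None, :] - FX[None, :, :]              # (T, C, D) residuals
        loglik = -0.5 * np.sum((err - mu) ** 2 / var, axis=2)
        loglik -= loglik.max(axis=1, keepdims=True)       # numerical stability
        gamma = np.exp(loglik)
        gamma /= gamma.sum(axis=1, keepdims=True)         # (T, C) posteriors
        # M-step: re-estimate the residual mean and (diagonal) variance
        mu = np.einsum('tc,tcd->d', gamma, err) / T
        var = np.einsum('tc,tcd->d', gamma, (err - mu) ** 2) / T
    return mu, var
```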

  8. Formant Tracking (1/2) • Frame-by-frame tracking: the formants in each frame are estimated independently • Maximum a posteriori (MAP): pick the single most likely codebook entry for each frame • Minimum mean squared error (MMSE): take the posterior-weighted average over all codebook entries
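A sketch of both frame-by-frame estimators on top of the trained model (X holds the quantized VTR values, one row per codebook entry; the names are illustrative):

```python
import numpy as np

def track_frames(O, FX, X, mu, var):
    """Independent per-frame MAP and MMSE estimates (sketch).
    O: (T, D) MFCCs; FX: (C, D) codebook; X: (C, 2I) VTR values."""
    err = O[:, None, :] - FX[None, :, :] - mu             # (T, C, D)
    loglik = -0.5 * np.sum(err ** 2 / var, axis=2)        # up to a constant
    x_map = X[np.argmax(loglik, axis=1)]                  # MAP: best entry
    post = np.exp(loglik - loglik.max(axis=1, keepdims=True))
    post /= post.sum(axis=1, keepdims=True)
    x_mmse = post @ X                                     # MMSE: posterior mean
    return x_map, x_mmse
```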

  9. Formant Tracking (2/2) • Tracking with continuity constraints • First-order state model: xt = xt-1 + wt • wt is modeled as a zero-mean Gaussian with diagonal covariance Σw • Under this model, the MAP track can be found with a Viterbi search over the codebook entries • The MMSE estimate is much more complex; the paper obtains it with an approximation that the slides do not detail
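A sketch of the Viterbi search for the MAP track under the continuity model, again assuming a codebook small enough that the C x C transition matrix is affordable:

```python
import numpy as np

def viterbi_track(loglik, X, sigma_w):
    """MAP track under x_t = x_{t-1} + w_t, w_t ~ N(0, diag(sigma_w^2)).
    loglik: (T, C) per-frame log p(o_t | x^(c)); X: (C, 2I) VTR values."""
    T, C = loglik.shape
    d = X[:, None, :] - X[None, :, :]                     # (C, C, 2I) steps
    logtrans = -0.5 * np.sum((d / sigma_w) ** 2, axis=2)  # Gaussian transitions
    delta = loglik[0].copy()
    back = np.zeros((T, C), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + logtrans                # (prev, cur)
        back[t] = np.argmax(scores, axis=0)
        delta = scores[back[t], np.arange(C)] + loglik[t]
    path = np.empty(T, dtype=int)                         # backtrace
    path[-1] = np.argmax(delta)
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return X[path]
```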

  10. Experiment Settings • Track three formants • Frequencies are first mapped to the mel scale, then uniformly quantized • Bandwidths are uniformly quantized directly • Enforcing F1 < F2 < F3 leaves 767,500 entries in the codebook • Gain = 1 • 12-dimensional MFCC, without C0 • 20 utterances from one male speaker are used for EM training

  11. Experiment Results, “they were what”

  12. Experiment Results, with bandwidths

  13. Experiment Results, residual

  14. Conclusion • The method is entirely unsupervised and requires no labeling • It works well even in unvoiced frames • It produces no gross errors • It may be applied to speech recognition systems Ts'ai, Chung-Ming, Speech Lab, NTUST, 2007
