AN EXPECTATION MAXIMIZATION APPROACH FOR FORMANT TRACKING USING A PARAMETER-FREE NON-LINEAR PREDICTOR
Issam Bazzi, Alex Acero, and Li Deng, Microsoft Research, One Microsoft Way, Redmond, WA, USA, 2003

Presentation Transcript


  1. AN EXPECTATION MAXIMIZATION APPROACH FOR FORMANT TRACKING USING A PARAMETER-FREE NON-LINEAR PREDICTOR Issam Bazzi, Alex Acero, and Li Deng, Microsoft Research, One Microsoft Way, Redmond, WA, USA, 2003

  2. Outline • Introduction • The Model • EM Training • Formant Tracking • Experiment Results • Conclusion

  3. Introduction • Traditional methods rely on LPC analysis or on matching stored templates of spectral cross-sections • In either case, formant tracking is error-prone because too few candidates or templates are considered • This paper instead uses a predictor codebook that maps formant values to MFCC vectors • Because the codebook covers the complete quantized formant space, no candidate is eliminated prematurely, as happens in LPC peak-picking or template matching

  4. The Model • ot = F(xt) + rt • ot is the observed MFCC vector at frame t • xt is the vocal tract resonance (VTR) frequencies and corresponding bandwidths • F(xt) maps each quantized VTR value to an MFCC vector through a lookup table called the predictor codebook • rt is the residual signal
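Read generatively, each frame draws a hidden codebook entry and adds Gaussian residual noise. A minimal sketch of that view (the codebook FX here is a random stand-in, not the real predictor codebook):

```python
import numpy as np

# Generative view of the model: o_t = F(x_t) + r_t.
# FX below is a random stand-in for the real predictor codebook.
rng = np.random.default_rng(0)
C, D = 1000, 12                                # codebook size, MFCC dimension
FX = rng.normal(size=(C, D))                   # hypothetical codebook F(x)
c = rng.integers(C)                            # hidden VTR index, uniform prior
o_t = FX[c] + rng.normal(scale=0.1, size=D)    # observed MFCC frame
```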

  5. Constructing F(x) • All-pole model • Assume there are I formants • x = (F1, B1, F2, B2, ..., FI, BI) • Each pair (Fi, Bi) gives a pole zi = exp(-πBi/fs + j2πFi/fs), so the z-transform of the model is H(z) = Π_{i=1..I} 1 / [(1 - zi z^-1)(1 - zi* z^-1)] • Finally, each quantized VTR value x is transformed into an MFCC vector F(x)
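For a unit-gain all-pole model the cepstrum has a closed form in the pole locations, which gives a direct way to fill each codebook entry. A minimal sketch (plain cepstrum; the mel warping of the frequency axis that the paper applies is omitted, and the function name and sampling rate are illustrative):

```python
import numpy as np

def formant_to_cepstrum(F, B, fs=8000.0, n_coeffs=12):
    """Closed-form cepstrum of the all-pole model H(z) above.
    F, B: formant frequencies and bandwidths in Hz; gain = 1, so c0 = 0.
    c_n = (2/n) * sum_i r_i**n * cos(n * theta_i), where
    r_i = exp(-pi*B_i/fs) is the pole radius and theta_i = 2*pi*F_i/fs."""
    r = np.exp(-np.pi * np.asarray(B) / fs)       # pole radii
    theta = 2.0 * np.pi * np.asarray(F) / fs      # pole angles
    n = np.arange(1, n_coeffs + 1)[:, None]       # cepstral indices 1..n_coeffs
    return np.sum((2.0 / n) * r**n * np.cos(n * theta), axis=1)

# e.g. a neutral-vowel-like entry: F1 = 500, F2 = 1500, F3 = 2500 Hz
c = formant_to_cepstrum([500.0, 1500.0, 2500.0], [60.0, 80.0, 100.0])
```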

  6. EM Training (1/2) • Use a single Gaussian to model rt • For an utterance of T frames, θ denotes the parameters (mean and covariance) of this Gaussian • Assume the formant values x are uniformly distributed and can take any of C quantized values

  7. EM Training (2/2) • The updates follow directly from the single-Gaussian model above • E-step: for each frame t and codebook entry c, compute the posterior γt(c) = p(xt = x(c) | ot, θ) ∝ N(ot; F(x(c)) + μ, Σ), normalized over the C entries • M-step: re-estimate μ = (1/T) Σt Σc γt(c) (ot − F(x(c))), and Σ as the correspondingly weighted covariance of the residuals
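A compact sketch of one way to implement these updates, assuming a diagonal residual covariance and a codebook small enough that the full T x C x D residual tensor fits in memory (it would not at the paper's 767,500 entries):

```python
import numpy as np

def em_train(O, FX, n_iter=10):
    """EM for the single-Gaussian residual model (sketch).
    O: (T, D) observed MFCC frames; FX: (C, D) predictor codebook."""
    T, D = O.shape
    mu, var = np.zeros(D), np.ones(D)
    for _ in range(n_iter):
        # E-step: posterior over codebook entries (uniform prior on x);
        # the log-det term is constant across entries, so it is dropped.
        err = O[:, None, :] - FX[None, :, :]              # (T, C, D) residuals
        loglik = -0.5 * np.sum((err - mu) ** 2 / var, axis=2)
        loglik -= loglik.max(axis=1, keepdims=True)       # numerical stability
        gamma = np.exp(loglik)
        gamma /= gamma.sum(axis=1, keepdims=True)         # (T, C) posteriors
        # M-step: re-estimate the residual mean and (diagonal) variance
        mu = np.einsum('tc,tcd->d', gamma, err) / T
        var = np.einsum('tc,tcd->d', gamma, (err - mu) ** 2) / T
    return mu, var
```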

  8. Formant Tracking (1/2) • Frame-by-frame tracking: the formants in each frame are estimated independently • Maximum a posteriori (MAP): pick the single most likely codebook entry for each frame • Minimum mean squared error (MMSE): take the posterior-weighted average over all codebook entries
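A sketch of both frame-by-frame estimators on top of the trained model (X holds the quantized VTR values, one row per codebook entry; the names are illustrative):

```python
import numpy as np

def track_frames(O, FX, X, mu, var):
    """Independent per-frame MAP and MMSE estimates (sketch).
    O: (T, D) MFCCs; FX: (C, D) codebook; X: (C, 2I) VTR values."""
    err = O[:, None, :] - FX[None, :, :] - mu             # (T, C, D)
    loglik = -0.5 * np.sum(err ** 2 / var, axis=2)        # up to a constant
    x_map = X[np.argmax(loglik, axis=1)]                  # MAP: best entry
    post = np.exp(loglik - loglik.max(axis=1, keepdims=True))
    post /= post.sum(axis=1, keepdims=True)
    x_mmse = post @ X                                     # MMSE: posterior mean
    return x_map, x_mmse
```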

  9. Formant Tracking (2/2) • Tracking with continuity constraints • First-order state model: xt = xt-1 + wt • wt is modeled as a zero-mean Gaussian with diagonal covariance Σw • Under this model, the MAP track can be found with a Viterbi search over the codebook entries • The MMSE estimate is much more complex; the paper obtains it with an approximation that the slides do not detail
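A sketch of the Viterbi search for the MAP track under the continuity model, again assuming a codebook small enough that the C x C transition matrix is affordable:

```python
import numpy as np

def viterbi_track(loglik, X, sigma_w):
    """MAP track under x_t = x_{t-1} + w_t, w_t ~ N(0, diag(sigma_w^2)).
    loglik: (T, C) per-frame log p(o_t | x^(c)); X: (C, 2I) VTR values."""
    T, C = loglik.shape
    d = X[:, None, :] - X[None, :, :]                     # (C, C, 2I) steps
    logtrans = -0.5 * np.sum((d / sigma_w) ** 2, axis=2)  # Gaussian transitions
    delta = loglik[0].copy()
    back = np.zeros((T, C), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + logtrans                # (prev, cur)
        back[t] = np.argmax(scores, axis=0)
        delta = scores[back[t], np.arange(C)] + loglik[t]
    path = np.empty(T, dtype=int)                         # backtrace
    path[-1] = np.argmax(delta)
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return X[path]
```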

  10. Experiment Settings • Track three formants • Frequencies are first mapped to the mel scale, then uniformly quantized • Bandwidths are uniformly quantized directly • Enforcing F1 < F2 < F3 leaves 767,500 entries in the codebook • Gain = 1 • 12-dimensional MFCC, without C0 • 20 utterances from one male speaker are used for EM training

  11. Experiment Results, “they were what”

  12. Experiment Results, with bandwidths

  13. Experiment Results, residual

  14. Conclusion • The method is entirely unsupervised and requires no labeling • It works well even in unvoiced frames • It produces no gross errors • It may be applied to speech recognition systems Ts'ai, Chung-Ming, Speech Lab, NTUST, 2007
