Speech Signal Representations I Seminar

Seminar Speech Recognition 2002, F.R. Verhage

The speech signal x[n] is decomposed into a source (excitation) e[n] passed through a linear time-varying filter h[n].
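A minimal sketch of the source-filter idea, assuming the filter is held fixed over a short stretch of signal so that x[n] is simply the convolution of e[n] with h[n]. The impulse-train period and the three-tap impulse response are made-up illustration values, not taken from the slides.

```python
def convolve(e, h):
    """Linear convolution x[n] = sum_k e[k] * h[n - k]."""
    x = [0.0] * (len(e) + len(h) - 1)
    for i, e_i in enumerate(e):
        for j, h_j in enumerate(h):
            x[i + j] += e_i * h_j
    return x

# Hypothetical voiced-like source: an impulse train with a period of 4 samples.
e = [1.0 if n % 4 == 0 else 0.0 for n in range(12)]
# Hypothetical (time-invariant) filter impulse response.
h = [1.0, 0.5, 0.25]
x = convolve(e, h)  # each impulse in e[n] launches one copy of h[n]
```

The output repeats the impulse response once per source impulse, which is why the filter can in principle be estimated separately from the source.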

Estimation of the filter, inspired by:
• Speech production models
  • Linear Predictive Coding (LPC)
  • Cepstral analysis
• Speech perception models (part II)
  • Mel-frequency cepstrum
  • Perceptual Linear Prediction (PLP)
Speech recognizers estimate the filter characteristics and ignore the source.

Short-Time Fourier Analysis
• Spectrogram
  • Representation of a signal, based on short-time Fourier analysis, that highlights several of its properties
  • Two-dimensional: time on the horizontal axis, frequency on the vertical axis
  • Third ‘dimension’: gray or color level indicating energy
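The spectrogram described above can be sketched as a table of log energies indexed by frame (time) and DFT bin (frequency). This is an illustrative sketch only: it uses a naive DFT from the standard library to stay self-contained, a Hamming window, and frame and hop sizes chosen for the example rather than taken from the slides.

```python
import cmath
import math

def hamming(length):
    """Symmetric Hamming window of the given length."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1)) for n in range(length)]

def dft_magnitude(frame):
    """Magnitude of the DFT for bins 0 .. N/2 (naive O(N^2) evaluation)."""
    n = len(frame)
    return [
        abs(sum(s * cmath.exp(-2j * math.pi * k * t / n) for t, s in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]

def spectrogram(signal, frame_len, hop):
    """Return rows[time][frequency] holding the log energy in dB."""
    window = hamming(frame_len)
    rows = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [s * w for s, w in zip(signal[start:start + frame_len], window)]
        # Small offset avoids log(0) for empty bins.
        rows.append([20 * math.log10(m + 1e-12) for m in dft_magnitude(frame)])
    return rows

# Usage: a 1 kHz tone sampled at 8 kHz, analysed with 8 ms frames and 50% overlap.
fs = 8000
tone = [math.sin(2 * math.pi * 1000 * n / fs) for n in range(256)]
spec = spectrogram(tone, frame_len=64, hop=32)
```

Bin k corresponds to k * fs / frame_len Hz, so the tone appears as a horizontal line in the row of bins around 1 kHz. Plotting `spec` as gray levels gives the picture the slide describes.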

Short-Time Fourier Analysis
• Spectrogram
  • Narrow band
    • Long windows (> 20 ms) →
    • Narrow bandwidth
    • Lower time resolution, better frequency resolution
  • Wide band
    • Short windows (< 10 ms) →
    • Wide bandwidth
    • Good time resolution, lower frequency resolution
  • Pitch synchronous
    • Requires knowledge of the local pitch period


Short-Time Fourier Analysis
• Window analysis
  • Series of short segments (analysis frames)
  • Short enough that the signal is approximately stationary within a frame
  • Frame length usually constant, 20-30 ms
  • Frames may overlap
  • Different types of window functions (w_m[n]):
    • Rectangular (equivalent to applying no window function)
    • Hamming
    • Hanning
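A minimal sketch of the framing step, assuming symmetric window definitions and a 25 ms frame with a 10 ms hop. The 25 ms value lies in the 20-30 ms range given above; the hop size is an assumption for illustration, since the slides only say that overlaps are possible.

```python
import math

def rectangular(length):
    """All-ones window: the frame is left unchanged."""
    return [1.0] * length

def hamming(length):
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1)) for n in range(length)]

def hann(length):
    """Hann window, often called 'Hanning'."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / (length - 1)) for n in range(length)]

def analysis_frames(signal, fs, frame_ms=25.0, hop_ms=10.0, window=hamming):
    """Split signal into overlapping frames and apply the window to each one."""
    frame_len = int(round(fs * frame_ms / 1000.0))
    hop = int(round(fs * hop_ms / 1000.0))
    w = window(frame_len)
    return [
        [s * w_n for s, w_n in zip(signal[start:start + frame_len], w)]
        for start in range(0, len(signal) - frame_len + 1, hop)
    ]

# Usage: 100 ms of a constant signal at 8 kHz.
frames = analysis_frames([1.0] * 800, fs=8000)
```

With a constant input each frame is just a copy of the window, which makes the tapering at the frame edges easy to see.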

Short-Time Fourier Analysis
• Window analysis
  • The window length N must be long enough relative to the pitch period M (both in samples)
    • Rectangular: N ≥ M
    • Hamming, Hanning: N ≥ 2M
  • Pitch period not known in advance →
  • Prepare for the lowest pitch frequency, i.e. the longest pitch period →
  • At least 20 ms for rectangular or 40 ms for Hamming/Hanning (50 Hz)
  • But longer windows give a more averaged spectrum instead of distinct spectra →
  • A rectangular window has better time resolution
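The arithmetic behind the 20 ms and 40 ms figures can be checked with a small helper. The function name and the dictionary are hypothetical; the rule itself (one pitch period for a rectangular window, two for Hamming or Hanning) is the one stated above.

```python
def min_window_ms(lowest_pitch_hz, window="rectangular"):
    """Shortest window (in ms) that covers the required number of pitch periods."""
    periods_needed = {"rectangular": 1, "hamming": 2, "hanning": 2}[window]
    longest_period_ms = 1000.0 / lowest_pitch_hz  # lowest pitch = longest period
    return periods_needed * longest_period_ms

rect_ms = min_window_ms(50.0, "rectangular")  # 50 Hz -> 20 ms period
hamm_ms = min_window_ms(50.0, "hamming")
```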


Short-Time Fourier Analysis
• Window analysis
  • Frequency response is not completely zero outside the main lobe → spectral leakage
  • Second lobe of a Hamming window is approx. 43 dB below the main lobe → less spectral leakage
  • Hamming, Hanning and triangular windows offer less spectral leakage →
  • Rectangular windows are rarely used despite their better time resolution
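The side-lobe levels can be estimated numerically. The sketch below evaluates the window's frequency response on a dense grid, walks down the main lobe to its first minimum, and reports the highest level beyond it. The window length of 64 samples and the grid size are arbitrary choices for the example.

```python
import cmath
import math

def hamming(length):
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1)) for n in range(length)]

def response_db(window, n_bins=1024):
    """|W(e^{jw})| in dB relative to its value at w = 0, sampled on [0, pi)."""
    mags = []
    for k in range(n_bins):
        w = math.pi * k / n_bins
        mags.append(abs(sum(w_n * cmath.exp(-1j * w * n) for n, w_n in enumerate(window))))
    return [20 * math.log10(max(m, 1e-12) / mags[0]) for m in mags]

def peak_sidelobe_db(window):
    """Highest level outside the main lobe, in dB below the main-lobe peak."""
    resp = response_db(window)
    k = 0
    while resp[k + 1] < resp[k]:  # descend the main lobe to its first minimum
        k += 1
    return max(resp[k:])

rect_level = peak_sidelobe_db([1.0] * 64)
hamming_level = peak_sidelobe_db(hamming(64))
```

Under these assumptions the Hamming window's highest side lobe comes out in the low -40s dB, in line with the approximate 43 dB figure above, while the rectangular window's side lobes are far higher.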


Short-Time Fourier Analysis
Short-time spectrum of male voice speech
• Time signal /ah/, local pitch 110 Hz
• 30 ms rectangular window
• 15 ms rectangular window
• 30 ms Hamming window
• 15 ms Hamming window

Short-Time Fourier Analysis
Short-time spectrum of female voice speech
• Time signal /aa/, local pitch 200 Hz
• 30 ms rectangular window
• 15 ms rectangular window
• 30 ms Hamming window
• 15 ms Hamming window

Short-Time Fourier Analysis
Short-time spectrum of unvoiced speech
• Time signal
• 30 ms rectangular window
• 15 ms rectangular window
• 30 ms Hamming window
• 15 ms Hamming window

Linear Predictive Coding
• LPC a.k.a. auto-regressive (AR) modeling
• An all-pole filter is a good approximation of speech, with p the order of the LPC analysis
• Predicts the current sample as a linear combination of the past p samples
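The formulas from this slide did not survive in the text. In the usual textbook notation, and assuming the gain is folded into e[n] and the sign convention shown here, the all-pole model and the predictor can be written as:

```latex
\begin{align*}
H(z) &= \frac{X(z)}{E(z)} = \frac{1}{1 - \sum_{k=1}^{p} a_k z^{-k}} = \frac{1}{A(z)} \\
x[n] &= \sum_{k=1}^{p} a_k \, x[n-k] + e[n] \\
\tilde{x}[n] &= \sum_{k=1}^{p} a_k \, x[n-k]
\end{align*}
```

Here \tilde{x}[n] is the predicted sample and e[n] = x[n] - \tilde{x}[n] is the prediction error.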

Linear Predictive Coding
• To estimate the predictor coefficients (a_k), use a short-term analysis technique
• Per segment, minimize the total squared prediction error
• Take the derivative with respect to each a_k and equate it to 0; this gives a set of p linear equations: the Yule-Walker equations
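A sketch of that derivation, assuming x_m[n] denotes the samples of analysis segment m and the prediction uses x̃[n] = Σ a_k x[n-k]. The symbol φ_m is introduced here for the correlation terms and is not taken from the slides.

```latex
\begin{align*}
E_m &= \sum_{n} e_m^2[n]
     = \sum_{n} \Bigl( x_m[n] - \sum_{j=1}^{p} a_j \, x_m[n-j] \Bigr)^{2} \\
\frac{\partial E_m}{\partial a_i} = 0
  \;&\Rightarrow\;
  \sum_{j=1}^{p} a_j \, \phi_m[i,j] = \phi_m[i,0], \qquad i = 1, \dots, p \\
\phi_m[i,j] &= \sum_{n} x_m[n-i] \, x_m[n-j]
\end{align*}
```

These are p linear equations in the p unknowns a_1, ..., a_p.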

Linear Predictive Coding
• Solution of the Yule-Walker equations:
  • Any standard matrix inversion package
  • Due to the special form of the matrix, efficient solutions exist:
    • Covariance method: uses the Cholesky decomposition
    • Autocorrelation method: uses windows; results in equations with Toeplitz matrices, solved by the Levinson-Durbin recursion
    • Lattice method: equivalent to the Levinson-Durbin recursion; often used in fixed-point implementations because lack of precision doesn't result in unstable filters
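A minimal sketch of the autocorrelation method, assuming the frame has already been windowed and the predictor is x̃[n] = Σ a_k x[n-k]. The function names are illustrative. The recursion also yields the reflection coefficients k_i and the final prediction error energy as by-products.

```python
def autocorrelation(frame, max_lag):
    """R[lag] = sum_n frame[n] * frame[n + lag] for lag = 0 .. max_lag."""
    n = len(frame)
    return [sum(frame[t] * frame[t + lag] for t in range(n - lag))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the Toeplitz normal equations sum_j a_j R[|i-j|] = R[i].

    Returns (a, k, err): predictor coefficients a_1..a_p, reflection
    coefficients k_1..k_p, and the final prediction error energy.
    Assumes r[0] > 0 (a non-silent frame).
    """
    a = [0.0] * (order + 1)  # a[0] is unused so indices match the math
    err = r[0]
    reflection = []
    for i in range(1, order + 1):
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        err *= 1.0 - k * k
        reflection.append(k)
    return a[1:], reflection, err

# Usage: autocorrelation of a first-order AR process with coefficient 0.5.
coeffs, refl, err = levinson_durbin([1.0, 0.5, 0.25], order=2)
```

Each pass costs O(i) operations, so the whole solve is O(p²) instead of the O(p³) of a general matrix inversion.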


Linear Predictive Coding
• Spectral analysis via LPC
  • All-pole (IIR) filter
  • Spectral peaks near the roots of the denominator (the poles)
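A sketch of how the LPC spectrum can be evaluated, assuming the all-pole model H(z) = G / (1 - Σ a_k z^{-k}) with an illustrative gain G = 1. The example coefficients describe a single made-up resonance (one pole pair at radius 0.95 and angle π/4), not real speech.

```python
import cmath
import math

def lpc_spectrum_db(a, gain=1.0, n_bins=256):
    """20*log10 |H(e^{jw})| sampled at w = pi*k/n_bins for k = 0 .. n_bins-1."""
    spectrum = []
    for k in range(n_bins):
        w = math.pi * k / n_bins
        denom = 1.0 - sum(a_k * cmath.exp(-1j * w * (i + 1)) for i, a_k in enumerate(a))
        spectrum.append(20 * math.log10(gain / abs(denom)))
    return spectrum

# One pole pair r*exp(+/- j*theta) gives A(z) = 1 - 2 r cos(theta) z^-1 + r^2 z^-2.
r, theta = 0.95, math.pi / 4
a = [2 * r * math.cos(theta), -r * r]
spectrum = lpc_spectrum_db(a)
peak_bin = spectrum.index(max(spectrum))  # lies close to theta, i.e. bin 64 of 256
```

The peak sits close to the pole angle, and it gets sharper as the pole radius approaches the unit circle.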

Linear Predictive Coding
• Prediction error
  • Should be (approximately) the excitation
  • Unvoiced speech: expect white noise; OK
  • Voiced speech: expect an impulse train; not OK
    • All-pole assumption not altogether valid
    • Real speech is not perfectly periodic
    • Pitch-synchronous analysis gives better results
• LPC order
  • Larger p gives lower prediction errors
  • Too large a p results in fitting the individual harmonics → poorer separation between filter and source

Linear Predictive Coding
• Prediction error
  • The inverse LPC filter gives the residual signal
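A minimal sketch of inverse filtering, assuming the predictor x̃[n] = Σ a_k x[n-k] and zero samples before the start of the signal. The residual is e[n] = x[n] - x̃[n], i.e. the signal passed through A(z) = 1 - Σ a_k z^{-k}.

```python
def lpc_residual(x, a):
    """Residual e[n] = x[n] - sum_k a[k-1] * x[n-k], with x[n] = 0 for n < 0."""
    residual = []
    for n in range(len(x)):
        prediction = sum(a_k * x[n - 1 - i] for i, a_k in enumerate(a) if n - 1 - i >= 0)
        residual.append(x[n] - prediction)
    return residual

# Usage: x[n] is the response of a first-order all-pole filter (a_1 = 0.5)
# to a single impulse, so inverse filtering recovers that impulse.
x = [1.0, 0.5, 0.25, 0.125]
e = lpc_residual(x, [0.5])
```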

Linear Predictive Coding
• Alternatives for the predictor coefficients
  • Line Spectral Frequencies
    • Local sensitivity
    • Efficiency
  • Reflection coefficients
    • Guaranteed stable → useful for coefficients interpolated over time
  • Log-area ratios
    • Flat spectral sensitivity
  • Roots of the polynomial
    • Represent resonance frequencies and bandwidths
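For the last representation, a common way to read a resonance off a root is sketched below. It assumes a complex pole z = r·e^{jθ} inside the unit circle and uses the standard approximations F = θ·fs / (2π) and B = -ln(r)·fs / π. The function name and the example pole are illustrative.

```python
import cmath
import math

def pole_to_resonance(pole, fs):
    """Return (centre frequency, bandwidth) in Hz for a pole inside the unit circle."""
    frequency = abs(cmath.phase(pole)) * fs / (2 * math.pi)
    bandwidth = -math.log(abs(pole)) * fs / math.pi
    return frequency, bandwidth

# Usage: a pole at radius 0.95 placed at 500 Hz for an 8 kHz sampling rate.
fs = 8000
pole = 0.95 * cmath.exp(1j * 2 * math.pi * 500 / fs)
freq_hz, bw_hz = pole_to_resonance(pole, fs)
```

Poles closer to the unit circle give narrower bandwidths, which matches the sharper peaks they produce in the LPC spectrum.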

Cepstral Processing
• A homomorphic transformation converts a convolution into a sum: if $x[n] = e[n] * h[n]$, then $\hat{x}[n] = \hat{e}[n] + \hat{h}[n]$
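The convolution-to-sum property can be checked numerically with the real cepstrum, c[n] = IDFT(log |DFT(x)|). This is a sketch under stated assumptions: it uses a naive DFT, circular convolution so that the DFT relation holds exactly, and two short made-up sequences whose spectra have no zeros. The real cepstrum is only one variant of the transformation; it discards phase.

```python
import cmath
import math

def dft(x, inverse=False):
    """Naive DFT (or inverse DFT) of a real or complex sequence."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out

def real_cepstrum(x):
    """Inverse DFT of the log magnitude spectrum."""
    log_magnitude = [math.log(abs(v)) for v in dft(x)]
    return [v.real for v in dft(log_magnitude, inverse=True)]

def circular_convolve(e, h):
    n = len(e)
    return [sum(e[k] * h[(t - k) % n] for k in range(n)) for t in range(n)]

# Hypothetical source and filter sequences, zero-padded to 16 samples.
e = [1.0, 0.5] + [0.0] * 14
h = [1.0, -0.3] + [0.0] * 14
x = circular_convolve(e, h)

c_x = real_cepstrum(x)
c_sum = [ce + ch for ce, ch in zip(real_cepstrum(e), real_cepstrum(h))]
```

Because |X| = |E|·|H|, the logarithm turns the product into a sum, so `c_x` and `c_sum` agree up to rounding error.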

