Estimating Speech Parameters October, 2006 Oregon Health & Science University

Estimating Speech Parameters October, 2006 Oregon Health & Science University OGI School of Science & Engineering John-Paul Hosom

Estimating and Detecting Speech Parameters
• Topics for this lecture:
  • Computing Energy
  • Linear Predictive Coding (LPC) (and Estimating Formants)
  • Estimating Pitch
  • Detecting Glottalization
  • Detecting Bursts
• Except for energy, these parameters cannot be computed with 100% accuracy. Therefore, they are not often used in automatic speech recognition systems.
• However, reliable estimation/detection of these parameters can be useful for voice transformation, speech synthesis, and speech analysis.
• Pitch estimation methods are quite numerous. Four basic methods and one new method will be presented here. Methods for detecting glottalization and bursts are less prevalent. A few methods for each will be presented here.

Energy
“Energy” or “Intensity”:
• intensity is sound energy transmitted per second (power) through a unit area in a sound field. [Moore p. 9]
• intensity is proportional to the square of the pressure variation [Moore p. 9]
$$\text{normalized energy} = \frac{1}{N}\sum_{n=1}^{N} x_n^2 = \text{intensity}$$
$x_n$ = signal amplitude x at time sample n
$N$ = number of time samples

Energy
“Energy” or “Intensity”: the human auditory system is better suited to relative scales:
$$\text{energy (bels)} = \log_{10}\frac{I_1}{I_0}$$
$$\text{energy (decibels, dB)} = 10\log_{10}\frac{I_1}{I_0}$$
$I_0$ is a reference intensity. If the signal becomes twice as powerful ($I_1/I_0 = 2$), then the energy level is 3 dB (3.0103 dB to be more precise).
Typical theoretical value for $I_0$ is 20 µPa. (20 µPa is close to the average human absolute threshold for a 1000-Hz sinusoid.)
Typical practical value for $I_0$ is 1.0.
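The energy computation can be sketched in a few lines. This is a minimal illustration, assuming normalized energy is the mean squared amplitude of a frame and that the reference defaults to the practical value 1.0; the function name `frame_energy_db` is hypothetical.

```python
import math

def frame_energy_db(samples, ref=1.0):
    """Normalized energy of one frame, in dB relative to `ref` (assumed 1.0)."""
    n = len(samples)
    # Mean squared amplitude over the N samples of the frame.
    energy = sum(x * x for x in samples) / n
    if energy <= 0.0:
        return float("-inf")  # silence has no finite dB value
    return 10.0 * math.log10(energy / ref)

# Doubling the power corresponds to about 3 dB.
ratio_db = 10.0 * math.log10(2.0)
```

A frame of constant amplitude 1.0 gives 0 dB with this reference, and doubling the amplitude (four times the power) raises the level by about 6 dB.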

Energy
What is a good value of N? Depends on information of interest.
[Figure panels: N = 1 msec, N = 5 msec, N = 20 msec, N = 80 msec]

LPC: Background: Autocorrelation
Autocorrelation: measure of periodicity in signal
[Figure: axes amplitude vs. time]

LPC: Model
• Linear Predictive Coding (LPC) provides
  • low-dimension representation of speech signal at one frame
  • representation of spectral envelope, not harmonics
  • “analytically tractable” method
  • some ability to identify formants
• LPC models speech as an approximate linear combination of the previous p samples:
$$s(n) \approx a_1 s(n-1) + a_2 s(n-2) + \dots + a_p s(n-p)$$
  where $a_1, a_2, \dots, a_p$ are constant for each frame of speech.
• We can make the approximation exact by including a “difference” or “residual” term, $e(n) = G\,u(n)$, which is the excitation of the signal if the LPC coefficients are a filter:
$$s(n) = \sum_{k=1}^{p} a_k s(n-k) + G\,u(n)$$
  where $u(n)$ is the excitation of the filter and $G$ is a gain term.

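LPC: Model
If we define the error over some range of values $M_1$ to $M_2$ as:
$$E_n = \sum_{m=M_1}^{M_2} e_n^2(m) = \sum_{m=M_1}^{M_2}\Big[s_n(m) - \sum_{k=1}^{p} a_k s_n(m-k)\Big]^2$$
then we can find $a_k$ by setting $\partial E_n/\partial a_k = 0$ for $k = 1, 2, \dots, p$, obtaining p equations and p unknowns.
After some derivation, the p equations can be related to the autocorrelation coefficients $R_n(i)$:
$$\sum_{k=1}^{p} a_k R_n(|i-k|) = R_n(i), \qquad 1 \le i \le p$$
and an exact solution to this system of linear equations can be obtained using an iterative process called “Durbin’s Solution”.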
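As an illustration of the iterative solution, here is a minimal sketch of the textbook Levinson-Durbin recursion, assuming the predictor convention s(n) ≈ Σ a_k s(n−k) and autocorrelation values R(0)…R(p) computed over one frame. The function names `autocorrelation` and `durbin` are hypothetical, and this is not necessarily the exact procedure used in the lecture.

```python
def autocorrelation(frame, max_lag):
    """R(k) = sum_m frame[m] * frame[m + k], for k = 0..max_lag."""
    n = len(frame)
    return [sum(frame[m] * frame[m + k] for m in range(n - k))
            for k in range(max_lag + 1)]

def durbin(r, p):
    """Solve sum_k a_k R(|i-k|) = R(i), i = 1..p, by Durbin's recursion.

    Returns ([a_1, ..., a_p], final prediction error).
    """
    a = [0.0] * (p + 1)      # a[0] is unused
    err = r[0]               # error of the order-0 predictor
    for i in range(1, p + 1):
        # Reflection coefficient for order i.
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)
    return a[1:], err
```

For example, `durbin([1.0, 0.5, 0.25], 2)` returns coefficients close to `[0.5, 0.0]`, as expected for a first-order process.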

LPC: Predictor Coefficients as Filter Coefficients
The error term e(n) can be written as
$$e(n) = s(n) - \sum_{k=1}^{p} a_k s(n-k)$$
Taking the z-transform of this equation (noting the time-shift property of the z-transform, $x(n-k) \leftrightarrow z^{-k}X(z)$):
$$E(z) = S(z) - \sum_{k=1}^{p} a_k z^{-k} S(z) = S(z)\Big[1 - \sum_{k=1}^{p} a_k z^{-k}\Big] = S(z)\,A(z)$$
where A(z) is a transfer function specified by the LPC coefficients.
We can write the original signal in terms of the error signal E(z) and the transfer function A(z), and approximate E(z) by a constant gain term (since the error should have a flat spectrum):
$$S(z) = \frac{E(z)}{A(z)} \approx \frac{G}{A(z)}$$
The LPC coefficients therefore define an all-pole (IIR) filter that models the spectral shape (spectral envelope) of the input speech (formants and spectral tilt due to glottal source).

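LPC: Spectral Representation
We can compute the spectral envelope magnitude from the LPC parameters by evaluating the transfer function S(z) for $z = e^{j\omega}$:
$$S(e^{j\omega}) = \frac{G}{1 - \sum_{k=1}^{p} a_k e^{-j\omega k}}$$
because $e^{-j\omega k} = \cos(\omega k) - j\sin(\omega k)$, the log power spectrum $P(\omega)$ is:
$$P(\omega) = 10\log_{10}\frac{G^2}{\Big(1 - \sum_{k=1}^{p} a_k\cos(\omega k)\Big)^2 + \Big(\sum_{k=1}^{p} a_k\sin(\omega k)\Big)^2}$$
Each formant (complex pole) in the spectrum requires two LPC coefficients; each spectral slope factor (frequency = 0 or Nyquist frequency) requires one LPC coefficient.
For 8 kHz speech, 4 formants → LPC order of 9 or 10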
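A short sketch of evaluating the LPC envelope on a frequency grid may help here. It assumes the all-pole model G / (1 − Σ a_k e^{−jωk}), with coefficients given as a list [a_1, …, a_p]; the function name `lpc_log_power_spectrum` and the grid size are illustrative choices.

```python
import math

def lpc_log_power_spectrum(a, gain, num_points=128):
    """10*log10 |G / A(e^{jw})|^2 at num_points frequencies in [0, pi)."""
    spectrum = []
    for i in range(num_points):
        w = math.pi * i / num_points
        # Real and imaginary parts of A(e^{jw}) = 1 - sum_k a_k e^{-jwk}.
        re = 1.0 - sum(ak * math.cos(w * (k + 1)) for k, ak in enumerate(a))
        im = sum(ak * math.sin(w * (k + 1)) for k, ak in enumerate(a))
        spectrum.append(10.0 * math.log10(gain * gain / (re * re + im * im)))
    return spectrum
```

With two coefficients that place a complex pole pair at radius 0.95 and angle π/4, the envelope peaks at roughly one quarter of the way to the Nyquist frequency, which is the single "formant" of that toy filter.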

LPC: Spectral Representation

LPC: Estimating Formants
The transfer function can be re-written as a product
$$A(z) = 1 - \sum_{k=1}^{p} a_k z^{-k} = \prod_{k=1}^{p}\big(1 - z_k z^{-1}\big)$$
where $z_k$ are the roots of the predictor polynomial. Roots for (resonant) poles that aren’t at 0 Hz or the Nyquist frequency will occur in pairs, symmetric around the real (x) axis. If we solve for the roots $z_k$, we can determine the frequencies and bandwidths of the poles.
[Figure: unit circle in the complex plane, axes real(z) and imaginary(z), showing a pole pair at radius r and angles θ and −θ]

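LPC: Estimating Formants
We can express these complex roots (or poles of the filter) in terms of angle and radius relative to the unit circle by converting from Cartesian coordinates in the complex plane to polar coordinates:
$$\theta_k = \tan^{-1}\frac{\mathrm{Im}(z_k)}{\mathrm{Re}(z_k)}, \qquad r_k = \sqrt{\mathrm{Re}(z_k)^2 + \mathrm{Im}(z_k)^2}$$
Angle corresponds to frequency, and radius corresponds to bandwidth. So we can determine the pole (or resonant) frequencies and bandwidths (converting to Hz, with sampling frequency $F_s$) as:
$$F_k = \frac{F_s}{2\pi}\,\theta_k, \qquad B_k = -\frac{F_s}{\pi}\,\ln r_k$$
Formants are typically the resonances with the smallest bandwidths.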
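The conversion from a pole to a resonance can be sketched as follows, assuming the commonly used relations F = (Fs/2π)·θ and B = −(Fs/π)·ln r. The function name `pole_to_formant` is hypothetical, and finding the roots of the predictor polynomial is left to any general-purpose polynomial root finder.

```python
import cmath
import math

def pole_to_formant(z, fs):
    """Convert one complex pole z into (frequency_hz, bandwidth_hz)."""
    theta = cmath.phase(z)          # angle of the pole, in radians
    r = abs(z)                      # radius of the pole
    freq = theta * fs / (2.0 * math.pi)
    bandwidth = -math.log(r) * fs / math.pi
    return freq, bandwidth

# Example: a pole at radius 0.95 and angle pi/4, for 8 kHz speech.
freq, bw = pole_to_formant(cmath.rect(0.95, math.pi / 4), 8000)
```

In this example the pole maps to a resonance near 1000 Hz with a bandwidth of about 131 Hz; poles closer to the unit circle give narrower bandwidths.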

Pitch Estimation: Autocorrelation Method Autocorrelation of speech signals: (from Rabiner & Schafer, p. 143)

Pitch Estimation: Autocorrelation Method
Autocorrelation (AC) can be used to determine F0, by finding the local maximum in the AC signal that is (a) within the range of expected F0 values and (b) above a threshold. With a sampling rate of 8000 Hz, for example:
100 samples = 8000/100 = 80 Hz = F0min
30 samples = 8000/30 ≈ 267 Hz = F0max
However, the local maximum does not always correspond with the correct T0 value (F0 = 1/T0).
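A minimal sketch of this search, assuming an 8000 Hz sampling rate, a lag range of 30 to 100 samples, and an arbitrary threshold of 0.3 on the autocorrelation normalized by its lag-0 value. The function name `autocorr_f0` and the threshold are illustrative, and the sketch simply takes the largest value in range rather than testing each local maximum.

```python
import math

def autocorr_f0(frame, fs=8000, f0_min=80.0, f0_max=266.0, threshold=0.3):
    """Return an F0 estimate in Hz, or None if no lag exceeds the threshold."""
    lag_min = int(fs / f0_max)      # 30 samples for the defaults
    lag_max = int(fs / f0_min)      # 100 samples for the defaults
    r0 = sum(x * x for x in frame)
    if r0 == 0.0:
        return None                 # silent frame
    best_lag, best_val = None, threshold
    for lag in range(lag_min, lag_max + 1):
        val = sum(frame[m] * frame[m + lag]
                  for m in range(len(frame) - lag)) / r0
        if val > best_val:
            best_lag, best_val = lag, val
    if best_lag is None:
        return None
    return fs / best_lag

# A 100 Hz sinusoid sampled at 8000 Hz has a period of 80 samples.
sine = [math.sin(2 * math.pi * 100 * n / 8000) for n in range(400)]
estimate = autocorr_f0(sine)
```

For real speech the frame would normally be windowed first, and the halving and doubling problems described on the next slide still apply to this simple search.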

Pitch Estimation: Autocorrelation Method
Problems:
1. A high F0 (e.g. 180 Hz) will have two peaks within range, e.g. 180 Hz and 90 Hz. This may cause a “pitch halving” error.
2. Formants will influence the strength of peaks. For example, if F0 is 120 Hz, but the strongest energy in the waveform is due to the first formant at 240 Hz (e.g. the vowel /i:/), the highest local maximum in the AC may be at 240 Hz (“pitch doubling” error).
[Figure: frequency axis marked at 120, 240, 360, 480, 600 Hz]
Want an F0 estimation method that is not sensitive to formants.

Pitch Estimation: SIFT Method
SIFT Method:
• LPC analysis (order 5) of low-pass-filtered waveform (800 Hz)
• inverse filter: obtain signal without formants
• autocorrelation: measure periodicity
• decision based on height of autocorrelation peak
[Figure panels with axes time and frequency; frequency ranges marked at 800 Hz and 4000 Hz]

Pitch Estimation: SIFT Method
Problems with SIFT:
1. For some sounds (e.g. nasals, vowel-to-silence transitions) the signal is dominated by a single harmonic (close to a sine wave). LPC analysis and inverse filtering can remove the formant information and the glottal-source information, leaving only white noise.
2. Still have the problem that a high F0 (e.g. 180 Hz) will have two AC peaks within range, e.g. 180 Hz and 90 Hz. This may cause a “pitch halving” error.
Rather than use autocorrelation, we can measure F0 from information in the spectrum, by identifying harmonics.

Pitch Estimation: Harmonic Sieve Method
F0 estimation in the spectral domain often uses the “harmonic sieve” method, which relies on the fact that F0 harmonics must occur at multiples of the fundamental frequency. If we sum the power-spectrum energy values at multiples of a given frequency, then the frequency value that yields the largest energy sum should be F0.
• test F0 = 100 Hz: normalized sum of energies at 100, 200, …, 4000 Hz is maximum
• test F0 = 110 Hz: normalized sum of energies at 110, 220, …, 3960 Hz is relatively small
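The sieve can be sketched as below, assuming the power spectrum is given as a list of values on a uniform frequency grid and that the energy sum is normalized by the number of harmonics tested. The function name `harmonic_sieve_f0`, the 10 Hz bin spacing, and the candidate list are all illustrative.

```python
def harmonic_sieve_f0(power, bin_hz, candidates, f_max=4000.0):
    """Return the candidate F0 with the largest normalized harmonic energy."""
    best_f0, best_score = None, float("-inf")
    for f0 in candidates:
        total, count = 0.0, 0
        harmonic = f0
        while harmonic <= f_max:
            idx = int(round(harmonic / bin_hz))   # nearest spectrum bin
            if idx < len(power):
                total += power[idx]
                count += 1
            harmonic += f0
        if count == 0:
            continue
        score = total / count                     # normalize by number of harmonics
        if score > best_score:
            best_f0, best_score = f0, score
    return best_f0

# Toy spectrum: 10 Hz bins up to 4000 Hz, energy only at multiples of 100 Hz.
toy_power = [1.0 if (i * 10) % 100 == 0 else 0.0 for i in range(401)]
best = harmonic_sieve_f0(toy_power, 10.0, range(50, 160, 10))
```

On this toy spectrum a candidate of 200 Hz would score exactly as high as 100 Hz after normalization, which illustrates why the method can confuse F0 with its double.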

Pitch Estimation: Harmonic Sieve Method
Problems with harmonic sieve method:
1. In order to resolve harmonics, need high frequency resolution in the power spectrum. This requires a large number of waveform samples at each frame (e.g. 256 samples (32 msec) or more). F0 may change quickly within 30 or 40 msec, and these quick changes in F0 cannot be reliably identified.
2. This method is susceptible to pitch-doubling errors, because (a) normalization by the number of harmonics reduces the energy sum in the low-F0 case to be approximately equal to the energy sum of the doubled F0, and (b) if normalization is not performed, there is a bias toward lower F0 values, which include more harmonics.

Pitch Estimation: Cepstral Method
Cepstral method: F0 information is encoded in higher cepstral coefficients
cepstrum: treat spectrum as signal subject to frequency analysis…
1. Compute log power spectrum
2. Compute FFT of log power spectrum
[Figure: waveform (amplitude vs. time), log power spectrum (energy in dB vs. frequency), and cepstrum (amplitude vs. quefrency); the cepstral peak indicates F0]

Pitch Estimation: Cepstral Method
• F0 estimation in the cepstral domain:
  • compute the cepstrum
  • determine if there are values within normal F0 ranges that are above some pre-defined threshold
  • if there are such values, find the maximum value
  • F0 is computed from the inverse of the index of this maximum (F0 = sampling frequency / index)
[Figure: cepstrum with the F0min and F0max search limits marked]
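These steps can be sketched with a plain (slow) DFT so that the example needs no external library. The threshold of 0.5, the small floor added before the logarithm, the 1/N scaling, and the names `dft` and `cepstral_f0` are assumptions made for illustration; a practical implementation would use an FFT and a windowed frame.

```python
import cmath
import math

def dft(values):
    """Direct O(N^2) discrete Fourier transform (fine for short frames)."""
    n = len(values)
    twiddle = [cmath.exp(-2j * math.pi * k / n) for k in range(n)]
    return [sum(values[m] * twiddle[(k * m) % n] for m in range(n))
            for k in range(n)]

def cepstral_f0(frame, fs=8000, f0_min=80.0, f0_max=266.0, threshold=0.5):
    """Return F0 in Hz from the cepstral peak, or None if below threshold."""
    n = len(frame)
    # Step 1: log power spectrum (small floor avoids log of zero).
    log_power = [math.log10(abs(c) ** 2 + 1e-10) for c in dft(frame)]
    # Step 2: transform of the log power spectrum, scaled by 1/N.
    cepstrum = [abs(c) / n for c in dft(log_power)]
    q_min = int(fs / f0_max)
    q_max = min(int(fs / f0_min), n // 2)
    best_q = max(range(q_min, q_max + 1), key=lambda q: cepstrum[q])
    if cepstrum[best_q] < threshold:
        return None
    return fs / best_q

# Toy input: an impulse train with a period of 80 samples (100 Hz at 8 kHz).
pulses = [1.0 if i % 80 == 0 else 0.0 for i in range(400)]
f0 = cepstral_f0(pulses)
```

The impulse train is an idealized case with many strong harmonics, which is exactly the situation in which the cepstral peak is tall and easy to locate.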

Pitch Estimation: Cepstral Method
Problems with the cepstral method:
1. Often there are only a few harmonics (e.g. nasals, /w/) that identify F0. In this case, the height of the cepstral peak will be very low, leading to peak-location (F0) estimation errors.
2. Humans can identify pitch from a small number of harmonics, and some of the harmonics may be “missing”. The cepstral method is not robust in these cases.
[Figure: three harmonic spectra on a frequency axis from 200 Hz to 2.6 kHz in 200-Hz steps: (1) perceived pitch = 200 Hz; (2) missing fundamental, perceived pitch = ??? Hz; (3) most harmonics missing, perceived pitch = ??? Hz]

Pitch Estimation: Dynamic Programming
Many errors in F0 estimation occur briefly (especially at the beginning and end of voiced sounds) and have a large difference from (a) the correct F0 value and (b) neighboring F0 values.
One method of obtaining a smooth F0 contour is to apply a Viterbi search to a number of F0 estimates at each frame. In this case, the transition probability is constrained so that large changes in F0 from one frame to the next are prohibited.
[Figure: AC (T0) values at frames t = 1, 2, 3, 4, 5, …; the transition from the previous frame is limited to a range of neighboring T0 values; the maximum value at t=13 is marked]
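A minimal sketch of such a constrained search, assuming each frame supplies one score per candidate T0 index (for example an autocorrelation value) and that a path may move by at most `max_jump` candidates between frames. The function name `viterbi_track` and the jump limit of 2 are illustrative assumptions, and the transition constraint is a hard limit rather than a probability.

```python
def viterbi_track(scores, max_jump=2):
    """Return the candidate index per frame that maximizes the summed score."""
    n_states = len(scores[0])
    total = list(scores[0])          # best path score ending in each state
    back = []                        # back-pointers, one list per later frame
    for frame in scores[1:]:
        new_total, pointers = [], []
        for s in range(n_states):
            lo = max(0, s - max_jump)
            hi = min(n_states, s + max_jump + 1)
            # Best allowed predecessor for state s.
            prev = max(range(lo, hi), key=lambda j: total[j])
            new_total.append(total[prev] + frame[s])
            pointers.append(prev)
        total = new_total
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda j: total[j])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path

# Five frames, ten candidates; frame 2 has a spurious strong peak at index 8.
demo = [[0.0] * 10 for _ in range(5)]
for row in demo:
    row[3] = 0.8
demo[2][3] = 0.7
demo[2][8] = 1.0
track = viterbi_track(demo)
```

Picking the per-frame maximum would jump to index 8 at frame 2, while the constrained path stays at index 3 throughout.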

Pitch Estimation: Band-Pass Method (new)
If the expected F0 range is known in advance, constraints (e.g. limiting F0max and F0min) can be applied to reduce these errors. For F0 estimation for both children and adults, the expected F0 range can be too large (e.g. 50 to 400 Hz) for such constraints to be effective.
The “band-pass” algorithm is based on an interpretation of Moore’s summarization of pitch identification in humans. In the band-pass method, information from 32 band-pass filters is combined at every frame, and then a Viterbi search provides an F0 contour estimate.
This method does not require constraints on the range of F0, as one set of parameters can be used for adult and children’s speech. It does not utilize autocorrelation, LPC, harmonic sieve, or cepstrum, and so it does not make assumptions about the correlation between pitch periods, the nature of the glottal source, or the existence of a large number of harmonics. It is robust to different formant frequencies.

Pitch Estimation: Band-Pass Method (new) STEP (1): FILTERING The speech signal is passed through 32 9-tap IIR filters, with (narrow) filter bandwidths determined from the Equivalent Rectangular Bandwidth (ERB) scale: ERB(f) = 0.108f + 24.7 where f is the center frequency, in Hz. As a result, in most cases, no more than one harmonic occupies one filter’s frequency range. The first filter is centered at 100 Hz, and each subsequent filter has a center frequency approximately one-half bandwidth higher than the previous filter’s center frequency. These filter outputs are approximately sine functions, with frequency equal to the harmonic within (or closest to) the filter’s bandwidth. In cases where multiple harmonics are within one band, the outputs are no longer simple sine functions, and are discarded.
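The band layout can be sketched from the ERB formula alone, assuming each band spans one ERB around its center and each new center is exactly one-half bandwidth above the previous one (the slide says "approximately", so later bands need not match published values exactly). The names `erb` and `filter_bank` are hypothetical, and the IIR filter design itself is not shown.

```python
def erb(f):
    """Equivalent Rectangular Bandwidth in Hz at center frequency f (Hz)."""
    return 0.108 * f + 24.7

def filter_bank(first_center=100.0, n_filters=32):
    """Return a list of (index, low_hz, high_hz, bandwidth_hz) tuples."""
    bands = []
    center = first_center
    for idx in range(1, n_filters + 1):
        bw = erb(center)
        bands.append((idx, center - bw / 2.0, center + bw / 2.0, bw))
        center += bw / 2.0          # next center: half a bandwidth higher
    return bands

bands = filter_bank()
```

The first band produced this way runs from about 82 Hz to about 118 Hz with a bandwidth of 35.5 Hz, and neighboring bands overlap by roughly half their width.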

Pitch Estimation: Band-Pass Method (new)
STEP (1): FILTERING
Filters (frequencies in Hz):
From    To      Bandwidth   Idx
82      118     36          1
99      137     38          2
117     156     39          3
135     177     42          4
155     199     44          5
177     223     46          6
197     245     48          7
221     272     51          8
242     296     54          9
…       …       …
902     1031    129         26
957     1092    135         27
1024    1167    143         28
1085    1235    150         29
1159    1318    159         30
1226    1393    167         31
1309    1484    175         32

Pitch Estimation: Band-Pass Method (new)
STEP (2): FIND PERIODICITY
The period-to-period maxima of the (sine-wave) filter outputs are located in order to identify the periodicity of the signal at each frame. If the identified periodicity is beyond the frequency limits of the filter (noise or multiple harmonics), the periodicity is set to zero.
[Figure: filter outputs for the bands 603-697 Hz, 643-742 Hz, 692-797 Hz, 737-847 Hz, 792-908 Hz, and 841-963 Hz. Find periodicity in each band at this frame by simple location of local maxima.]

Pitch Estimation: Band-Pass Method (new)
STEP (3): CREATE HISTOGRAM
For each frame (e.g. 1 msec), a histogram is computed:
(A) Initialize the histogram, with one bin for each periodicity value (154 values, representing 50 – 1334 Hz), with all bins set to 0.
(B) Determine the filter output with greatest energy at this frame, Me.
(C) For each filter with energy > Me − Δ (where Δ = 12 dB) and periodicity > 0, the histogram is increased by 1.0 near this periodicity value p, within the range p−5 to p+5. This increase is repeated for all integer multiples of p.
(D) The maximum histogram value, Mh, and the bin containing this maximum, b, are determined. All values are normalized by Mh. All values at bins greater than 2b are decreased slightly (by a factor of 0.95) to avoid F0-halving errors.
Frequency is determined from periodicity by Fs/p, where Fs is the sampling frequency, and p is a periodicity value.
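A rough sketch of steps (A) to (C) and the normalization in (D) is given below. Several details are assumptions, since the slide does not specify them: bins are indexed directly by period in samples (6 to 160), the increase "near" p is modeled as a triangular weight that peaks at 1.0 and tapers over ±5 bins, and the 0.95 down-weighting in (D) is left out. The function name `periodicity_histogram` is hypothetical.

```python
def periodicity_histogram(bands, min_period=6, max_period=160,
                          energy_margin_db=12.0, spread=5):
    """bands: list of (energy_db, period_in_samples), one entry per filter.

    Returns a list indexed by period (in samples), normalized to a peak of 1.0.
    """
    hist = [0.0] * (max_period + 1)              # (A) all bins start at 0
    strongest = max(e for e, _ in bands)         # (B) greatest filter energy
    for energy_db, period in bands:              # (C) vote for each usable band
        if period <= 0 or energy_db <= strongest - energy_margin_db:
            continue
        multiple = period
        while multiple <= max_period + spread:
            for d in range(-spread, spread + 1):
                b = multiple + d
                if min_period <= b <= max_period:
                    # Assumed triangular taper around each multiple of p.
                    hist[b] += 1.0 - abs(d) / (spread + 1)
            multiple += period
    peak = max(hist)                             # (D) normalize by the maximum
    if peak > 0.0:
        hist = [v / peak for v in hist]
    return hist

# Two bands at 9 samples, one at 18, one at 36, and one weak band that is ignored.
example = [(60.0, 9), (60.0, 9), (58.0, 18), (55.0, 36), (30.0, 20)]
hist = periodicity_histogram(example)
```

In this example the bins at 36, 72, 108, and 144 samples all reach the maximum, which shows why longer periods (lower F0 values) can tie with the correct one and need to be penalized afterwards.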

Pitch Estimation: Band-Pass Method (new)
[Figure: outputs of the 32 band-pass filters (82-118 Hz through 1309-1484 Hz) with the periodicity found in each band, and the resulting histogram count plotted against periodicity (frequency axis from 0 Hz to 1500 Hz). Example: 889 Hz = periodicity of 9 samples, found in three bands (3 counts); its multiples are 9x1 = 889 Hz, 9x2 = 444 Hz, 9x3 = 296 Hz, 9x4 = 222 Hz, 9x8 = 111 Hz. The histogram peak is labeled 216 Hz.]

Pitch Estimation: Band-Pass Method (new) STEP (4): VITERBI SEARCH A Viterbi search is performed on the sequence of histograms at each time frame. Transitions are constrained between frames t and t+1 to change by no more than 2 periodicity values (e.g. 2.5 Hz/msec when F0=100 Hz, 10 Hz/msec when F0=200 Hz). The result is the F0 contour with the largest global histogram value.

Pitch Estimation: Band-Pass Method (new) One issue is the possibility of pitch-halving errors, because lower F0 values can have an equally-large histogram count as the correct F0. This is avoided by (a) finding the first large peak, and (b) multiplying all T0 values above this peak by 0.95 Another issue is that this method will find an F0 value for all frames of speech, even frames that are unvoiced. (There is no threshold as in the cepstral method to determine whether a frame is voiced or unvoiced.)

Pitch Estimation: Band-Pass Method (new) Two corpora were used in evaluation, MWM and LSR corpora. Average F0 for the MWM corpus was 118 Hz (range 50–250 Hz); average F0 for the LSR corpus was 250 Hz (range 96–402 Hz). Evaluation was performed by (a) computing the average absolute difference between correct F0 and measured F0, over all frames at which an F0 value was obtained, (b) computing average percent error (absolute difference / correct F0), over all such frames. In addition, results on the LSR corpus were manually compared with Kay Elemetrics’ CSL F0 estimation when the difference between the results in either vowel of a word exceeded 30 Hz.

Pitch Estimation: Band-Pass Method (new) Results: For the comparison with CSL on the LSR corpus, out of 33 words with at least one vowel having an F0 difference greater than 30 Hz, CSL had 30 errors greater than 30 Hz, while the proposed method had 8 errors greater than 30 Hz.

Detecting Glottalization
What is glottalization? (Also called “creaky voice”.) Here, we define it as irregular or low-frequency (20-70 Hz) vibration of the vocal folds during voicing. (The term “glottalization” is also used when describing certain articulations of stop consonants.)
Glottalization can occur quite often in some speakers, as a speaking style. It may occur frequently at the end of a sentence, when signaling a phoneme boundary between two similar sounds (e.g. “E.E.”), or when signaling a word boundary between potentially ambiguous words (e.g. “heavy oak” vs. “heavy yoke”). It may also occur more frequently when a speaker has been talking a lot.
There is very little published work on detecting glottalization. However, glottalization makes F0 estimation and synthesis difficult, and it may be a relevant factor in diagnosing speech disorders.

Detecting Glottalization
Three techniques: PtP Amplitude, Feature Classification, Autocorrelation
Peak-to-Peak Amplitude (Cole, 1988)
Compute the peak-to-peak amplitude (difference between max and min amplitude) of the signal using a variable-length analysis window.
Assumption: glottalization has F0 significantly smaller than surrounding voiced sounds. Therefore, F0-related amplitude changes can be detected using an analysis window length determined from 1.3 times the median F0.
[Figure panels: wave, spectrogram, phonemes, long-term F0, PtP amplitude]

Detecting Glottalization
Three techniques: PtP Amplitude, Feature Classification, Autocorrelation
Feature Classification (Hosom, 2000)
Use a neural-network classifier with (a) standard MFCC or PLP features, or (b) standard features plus relative change in energy using an analysis window of 2 × (long-term F0).
Assumption: a standard classifier can identify changes in energy and source characteristics using standard features, or standard features augmented with a feature similar to the PtP feature.
Results (insertion and deletion errors, within 20 msec):

Detecting Glottalization
Three techniques: PtP Amplitude, Feature Classification, Autocorrelation
Autocorrelation (Ishi, 2004)
Estimate the glottal-source waveform by inverse-filtering with the LPC coefficients. Compute the autocorrelation of the glottal-source waveform.
Assumption: the long delay between impulses in glottalized speech yields non-zero correlations in between glottal pulses.
[Figure: glottalized speech vs. normal speech (from Ishi, 2004)]

Detecting Glottalization Three techniques: PtP Amplitude, Feature Classification, Autocorrelation A decision tree was applied to various parameters determined from the first two autocorrelation peaks (e.g. relative peak amplitude, relative peak position) Decision tree yielded error rate of 21.6%. However, definition of “creaky” included abnormal F0 patterns as well as low-F0 patterns.

Detecting Bursts
What are bursts? An increase in energy that is characteristic of stop consonants, after the closure, when the buildup of pressure has been released.
Not much prior work on detecting bursts. However, detecting bursts is critical for measuring voice-onset-time (VOT), which is the time from burst to onset of voicing. VOT is important for phoneme identity (e.g. distinguishing /p/ from /b/), and may be important in detection of Parkinson’s Disease (PD), where control over VOT may be reduced, leading to reduced intelligibility.

Detecting Bursts
Four basic methods: Change in Energy, HMM, SVM Classifier, and Candidate Selection.
1. Change in Energy (Liu, 95): bursts are characterized by closure (silence) then burst (high energy), so compute the change in energy over the entire utterance; if the change is above a threshold, mark as a burst.
2. HMM (Niyogi, 99): use a phoneme-level HMM to identify all phonemes in an utterance. The beginning of each plosive is identified as a burst.
3. Support-Vector Machine (SVM) Classifier (Niyogi, 99; Keshet, 01): two SVMs implemented, for linear and non-linear classification, using as features the log energy of the entire spectrum, the log energy of 3 to 8 kHz, and a spectral flatness measure.
4. Candidate Selection (Hosom, 00): select “candidate bursts” based on change in relative energy; classify candidates using an ANN with cepstral features.

Detecting Bursts
4. Candidate Selection Method (in detail):
Generate Candidates:
• Measure relative change in energy at eight Bark-scale frequency bands.
• Perform equal-loudness weighting of energy bands, so that perceptually-relevant bands have greater weight.
• Transform energy values to a scale of 0 to 1, representing “probability of burst in this band”.
• Combine probabilities using Bayes’ Rule.
Select Candidates:
• Using a fixed threshold determined from development data (0.075), select time points above threshold for classification.
Classification:
• For each candidate, compute cepstral features at that time point and surrounding time points.
• Use an Artificial Neural Network (ANN) to classify features as “burst” or “non-burst”.

Detecting Bursts 4. Candidate Selection Method (illustration):

Detecting Bursts
Results, relative to number of burst and non-burst phonemes (not frames). Threshold of 20 msec, evaluated on TIMIT corpus.
