Wen-Yi Chu Department of Computer Science & Information Engineering

Exploring the Use of Speech Features and Their Corresponding Distribution Characteristics for Robust Speech Recognition
Shih-Hsiang Lin, Berlin Chen, Yao-Ming Yeh
Wen-Yi Chu, Department of Computer Science & Information Engineering, National Taiwan Normal University

Outline • Introduction • Cluster-based polynomial-fit histogram equalization • Polynomial-fit histogram equalization • Experiments and results • Summary • Conclusions

Introduction(1/3) • The performance of current automatic speech recognition (ASR) systems often deteriorates radically when the input speech is corrupted by various kinds of noise sources. • Broadly speaking, the existing methods can be classified into two categories according to whether they function directly on the basis of the feature domain or consider some specific statistical feature characteristics. • Methods employing the feature domain can be further divided into three subcategories: feature compensation, feature transformation, and feature reconstruction. • Another school of thought attempts to seek remedies based on some of the noise-resistant statistical characteristics of speech features rather than the feature values themselves.

Introduction(2/3) • Histogram equalization (HEQ) methods attempt not only to match the means and variances of speech features but also to completely match the distributions of speech features between training and test speech data. • Noise not only modifies the distributions of the speech features but also injects uncertainty into them because of its random behavior. However, most HEQ approaches can only deal with the mismatch between the training and test conditions, while few can deal with such uncertainty. • Therefore, we expect that research conducted along the aforementioned two directions could be complementary, and that it might be possible to inherit their individual merits to overcome their inherent limitations.

Introduction(3/3) • In this paper, we propose a cluster-based polynomial-fit histogram equalization (CPHEQ) approach, which makes use of both the speech features and their corresponding distribution characteristics for speech feature compensation. • CPHEQ inherits the merits of the above two orientations and uses the data-fitting technique in a purely data-driven manner to approximate the actual distributions without the need for unrealistic assumptions about the speech feature distributions.

Cluster-based polynomial-fit histogram equalization(CPHEQ)(1/5) • The basic idea behind CPHEQ stems from two diverse approaches. The first is stereo-based piecewise linear compensation for environments (SPLICE), which attempts to use a Gaussian mixture model (GMM) to characterize the noisy feature space. • SPLICE might sometimes fail to handle the nonlinear relationship between the clean and noisy speech when the mixture number of the GMM is insufficient to characterize the noisy feature space. • In order to avoid this shortcoming, we add the idea of HEQ: HEQ uses nonlinear transformation functions to compensate for nonlinear distortions by utilizing the relationship between the cumulative distribution function (CDF) of the test speech and that of the corresponding training (or reference) speech.

Cluster-based polynomial-fit histogram equalization(CPHEQ)(2/5) • For CPHEQ, we first use the noisy speech data to train a GMM whose parameters are estimated by the k-means algorithm followed by the expectation-maximization (EM) algorithm. • Furthermore, we assume that the compensated feature vector can be derived as a posterior-weighted combination of the restored values obtained for the individual mixtures, where the posterior probability of each mixture given the noisy feature vector is computed from the GMM.
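The formulas on this slide can be sketched as follows. The notation is an assumption, since no symbols survive in the text: y_t is the noisy feature vector at frame t, x_t its clean counterpart, and c_k, \mu_k, \Sigma_k are the weight, mean, and covariance of the k-th of K mixtures. The label (1) is inferred from the later reference to Eqs. (1) and (2).

```latex
\begin{align}
p(y_t) &= \sum_{k=1}^{K} c_k\,\mathcal{N}(y_t;\mu_k,\Sigma_k) \notag\\
\hat{x}_t &= \sum_{k=1}^{K} p(k \mid y_t)\,E[x_t \mid y_t, k] \tag{1}\\
p(k \mid y_t) &= \frac{c_k\,\mathcal{N}(y_t;\mu_k,\Sigma_k)}
                      {\sum_{j=1}^{K} c_j\,\mathcal{N}(y_t;\mu_j,\Sigma_j)} \notag
\end{align}
```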

Cluster-based polynomial-fit histogram equalization(CPHEQ)(3/5) • The restored value of the noisy feature vector component given the k-th mixture is defined as the conditional expectation of the clean component under that mixture. • Unlike SPLICE, which uses an additive bias to approximate the conditional expectation for the k-th mixture, we introduce the idea originating from HEQ to approximate the conditional expectation. Therefore, the restored value for the k-th mixture is calculated by applying an inverse (or transformation) function to the CDF value of the noisy feature vector component; this function maps each CDF value onto its corresponding predefined feature value for the k-th mixture.
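The contrast between the two approximations can be written out as below. This is a sketch in assumed notation: y_t is one noisy feature vector component, x_t its clean counterpart, r_k the SPLICE bias of mixture k, G_k the transformation function of mixture k, and a_{km} its polynomial coefficients up to an assumed order M. The polynomial form follows from the later statement that the coefficients of a polynomial function are estimated for each mixture.

```latex
\begin{align}
\text{SPLICE:}\quad & E[x_t \mid y_t, k] \approx y_t + r_k \notag\\
\text{CPHEQ:}\quad  & E[x_t \mid y_t, k] \approx G_k\bigl(\mathrm{CDF}(y_t)\bigr)
                      = \sum_{m=0}^{M} a_{km}\,\bigl(\mathrm{CDF}(y_t)\bigr)^{m} \tag{2}
\end{align}
```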

Cluster-based polynomial-fit histogram equalization(CPHEQ)(4/5) • For the feature vector component sequence of a specific dimension of a speech utterance, the corresponding CDF value of each feature component can be computed approximately through the following two steps. Step 1: The sequence is first sorted in ascending order according to the values of the feature vector components. Step 2: The order-statistics-based approximation of the CDF value of a feature vector component is then computed from the position of that component in the sorted sequence. • In the training phase, the coefficients of the polynomial function for the k-th mixture can be estimated with a set of stereo data by minimizing the squared error between the clean feature values and the polynomial outputs.
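The two CDF steps and the per-mixture fit can be sketched in NumPy as follows. Several details are assumptions rather than facts from the slides: the rank formula (rank - 0.5) / T is one common order-statistics approximation, the squared error is weighted by the posterior of mixture k, and the function names and the default polynomial order are hypothetical.

```python
import numpy as np

def order_statistic_cdf(y):
    """Approximate the CDF value of each component of a 1-D sequence."""
    y = np.asarray(y, dtype=float)
    # Step 1: sort ascending; argsort of argsort yields each item's 0-based rank.
    ranks = np.argsort(np.argsort(y, kind="stable"), kind="stable") + 1
    # Step 2 (assumed formula): map the 1-based rank to (rank - 0.5) / T.
    return (ranks - 0.5) / y.size

def fit_cluster_polynomial(cdf_noisy, x_clean, posterior_k, order=3):
    """Weighted least squares fit for one mixture k.

    Minimises sum_t w_t * (x_t - sum_m a_m * cdf_t**m)**2, where w_t is the
    (assumed) posterior weight of mixture k for frame t.
    Returns coefficients a_0 .. a_order.
    """
    cdf_noisy = np.asarray(cdf_noisy, dtype=float)
    x_clean = np.asarray(x_clean, dtype=float)
    sqrt_w = np.sqrt(np.asarray(posterior_k, dtype=float))
    # Design matrix with columns cdf**0, cdf**1, ..., cdf**order.
    design = np.vander(cdf_noisy, order + 1, increasing=True)
    coeffs, *_ = np.linalg.lstsq(design * sqrt_w[:, None], x_clean * sqrt_w, rcond=None)
    return coeffs
```

The stereo data enter as the pair (`cdf_noisy`, `x_clean`): the CDF values come from the noisy channel and the regression targets from the clean one.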

Cluster-based polynomial-fit histogram equalization(CPHEQ)(5/5) • In the test phase, each feature vector component of the test speech is first used to estimate its corresponding CDF value, and then the restored value can be obtained from the polynomial functions of the mixtures. • In order to reduce the computation time, we use the maximum a posteriori probability (MAP) criterion and redefine Eqs. (1) and (2) so that only the mixture with the highest posterior probability is used.
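A minimal sketch of the test phase, covering both the soft decision (posterior-weighted sum over mixtures) and the hard decision under the MAP criterion. The function name, argument layout, and array shapes are hypothetical; the posteriors are assumed to come from the pretrained noisy GMM and the coefficients from the training phase.

```python
import numpy as np

def restore(cdf_vals, posteriors, coeffs, hard=False):
    """Restore one feature dimension of a test utterance.

    cdf_vals   : (T,)      CDF value of each noisy component
    posteriors : (T, K)    p(k | y_t) from the noisy GMM
    coeffs     : (K, M+1)  polynomial coefficients per mixture, lowest power first
    """
    cdf_vals = np.asarray(cdf_vals, dtype=float)
    posteriors = np.asarray(posteriors, dtype=float)
    coeffs = np.asarray(coeffs, dtype=float)
    # Evaluate every mixture's polynomial at every frame: shape (T, K).
    powers = np.vander(cdf_vals, coeffs.shape[1], increasing=True)
    per_mixture = powers @ coeffs.T
    if hard:
        # MAP criterion: keep only the most probable mixture per frame.
        k_star = posteriors.argmax(axis=1)
        return per_mixture[np.arange(cdf_vals.size), k_star]
    # Soft decision: posterior-weighted sum over all mixtures.
    return (posteriors * per_mixture).sum(axis=1)
```

The hard decision evaluates the same polynomials here for brevity; a real implementation would evaluate only the selected mixture, which is where the computational saving comes from.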

Polynomial-fit histogram equalization(PHEQ)(1/2) • In this paper, we present a variant of CPHEQ, named polynomial-fit histogram equalization (PHEQ). • In the implementation of PHEQ, only a single global transformation function is utilized to obtain the restored value of the noisy feature vector component, and therefore Eq. (2) can be rewritten with a single set of polynomial coefficients. These coefficients are estimated by merely using the clean training speech feature vector components and by minimizing the squared error between those components and the polynomial outputs.
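A sketch of PHEQ for a single feature dimension, assuming the same rank-based CDF approximation, (rank - 0.5) / T, as an illustrative choice. The function names and the default order are hypothetical. The fit uses `numpy.polynomial.polynomial.polyfit`, which returns coefficients from the lowest power upward.

```python
import numpy as np
from numpy.polynomial import polynomial as P

def rank_cdf(values):
    """Order-statistics CDF approximation (assumed form: (rank - 0.5) / T)."""
    values = np.asarray(values, dtype=float)
    ranks = np.argsort(np.argsort(values, kind="stable"), kind="stable") + 1
    return (ranks - 0.5) / values.size

def train_pheq(clean, order=3):
    """Fit one global polynomial mapping CDF values to clean feature values."""
    clean = np.asarray(clean, dtype=float)
    return P.polyfit(rank_cdf(clean), clean, order)

def apply_pheq(noisy, coeffs):
    """Restore a noisy sequence: compute its own CDF values, then map them."""
    return P.polyval(rank_cdf(noisy), coeffs)
```

Because only the polynomial coefficients have to be stored per feature dimension, this sketch is consistent with the storage advantage claimed for PHEQ.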

Polynomial-fit histogram equalization(PHEQ)(2/2) • A summary of the storage requirements and computational complexities of these three approaches is presented in Table I. In brief, PHEQ is advantageous in terms of storage and computational requirements as compared with the other two conventional HEQ approaches.

Experiments on CPHEQ(1/3) • It can be found that CPHEQ provides significant performance boosts over the MFCC-based baseline system, especially when the number of mixtures is large (e.g., 512 or 1024). • However, there is no significant difference between the soft-decision approach and the hard-decision approach. • Accordingly, this may suggest that using Eq. (3) to derive the polynomial functions for CPHEQ is sufficient and can simplify the computation of CPHEQ either in the training or recognition phases.

Experiments on CPHEQ(2/3) • In the next set of experiments, we assess the performance of CPHEQ with respect to different numbers of mixtures of the GMM model and different orders of the polynomial function.

Experiments on CPHEQ(3/3) • In the third set of experiments, we attempt to combine CPHEQ with two other kinds of feature representations to further verify the effectiveness of CPHEQ. • They are linear discriminant analysis (LDA) and heteroscedastic linear discriminant analysis (HLDA), both of which are derived directly from the outputs of the Mel-scaled log filter banks and postprocessed by the maximum likelihood linear transform (MLLT) for feature decorrelation. • The feature vectors from every nine successive frames are spliced together to form the supervectors for the construction of the transformation matrix. The dimension of the resultant vectors is set to 39.
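The nine-frame splicing step can be sketched as below. The function name is hypothetical, and the handling of utterance boundaries (repeating the first and last frames) is an assumption, since the slides do not say how edges are treated. The LDA/HLDA and MLLT projections that reduce the supervectors to 39 dimensions are not shown.

```python
import numpy as np

def splice_frames(feats, context=4):
    """Stack each frame with `context` neighbours on both sides.

    feats : (T, D) array of per-frame features (e.g. log filter-bank outputs).
    Returns a (T, (2 * context + 1) * D) array of supervectors; context=4
    gives the nine successive frames mentioned above.
    """
    feats = np.asarray(feats, dtype=float)
    num_frames = feats.shape[0]
    # Assumed edge handling: replicate the first and last frames.
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    window = 2 * context + 1
    return np.stack([padded[t:t + window].ravel() for t in range(num_frames)])
```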

Experiments on PHEQ(1/2) • Next we evaluate the performance of PHEQ with respect to the polynomial order; the results are presented in Table III. • To go a step further, we integrate PHEQ with the two discriminative feature representations described in the preceding section. The corresponding average WER results are shown in Table IV.

Experiments on PHEQ(2/2) • However, as mentioned in the above experiments, a smaller mixture number may be insufficient to delineate the noise characteristics. Hence, we try to combine CPHEQ with PHEQ through a simple linear interpolation of the restored values derived from each of these two methods, to overcome this shortcoming. • The results reveal that CPHEQ and PHEQ can, to some extent, complement each other well.
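The combination described above amounts to a convex blend of the two restored sequences. In this sketch the function name and the weight `alpha` are hypothetical; the slides do not state which interpolation weight was used.

```python
import numpy as np

def interpolate_restored(x_cpheq, x_pheq, alpha=0.5):
    """Linearly interpolate CPHEQ and PHEQ restored values.

    alpha is the (assumed) weight on CPHEQ; 1 - alpha goes to PHEQ.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    x_cpheq = np.asarray(x_cpheq, dtype=float)
    x_pheq = np.asarray(x_pheq, dtype=float)
    return alpha * x_cpheq + (1.0 - alpha) * x_pheq
```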

Comparison with Various Feature Normalization Methods(1/2) • Here we compare our two proposed feature normalization methods (i.e., CPHEQ and PHEQ) with several typical feature normalization methods under the clean-condition training scenario. • CPHEQ does not outperform the other approaches significantly on Test Set C. This is mainly because Test Set C additionally includes convolutional distortions, which might lead to a substantial discrepancy in calculating the posterior probability for the test speech. To avoid such a discrepancy, a straightforward remedy is to use CMS to remove the channel distortions.
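CMS here is taken to mean cepstral mean subtraction in its usual per-utterance form, which is an assumption about the variant used. A fixed convolutional channel appears as an approximately constant additive offset in the cepstral domain, so subtracting the per-dimension mean over time removes it. A minimal sketch with a hypothetical function name:

```python
import numpy as np

def cepstral_mean_subtraction(feats):
    """Subtract the utterance-level mean from each cepstral dimension.

    feats : (T, D) array of cepstral features for one utterance.
    """
    feats = np.asarray(feats, dtype=float)
    return feats - feats.mean(axis=0, keepdims=True)
```

Applying this before the GMM posterior computation would make the posteriors insensitive to a constant channel offset, which is the remedy suggested above for Test Set C.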

Comparison with Various Feature Normalization Methods(2/2) • In order to confirm that feature normalization based on both the speech features and the corresponding distribution characteristics is superior to that based on the speech features alone, we also investigated a cluster-based polynomial feature compensation (CPFC) approach that restored the speech features directly on the basis of their value domain rather than their distribution characteristics (i.e., the CDF values).

Further Comparison with Three Sophisticated Robustness Methods(1/2) • Finally, we further compare CPHEQ with three more sophisticated and effective robustness methods, namely, ETSI advanced frontend (denoted as AFE), Mel-LPC-based Mel-Wiener filter (denoted as MLMWF), and feature-based vector Taylor-series speech enhancement (denoted as F-VTS). • Given a test utterance, CPHEQ operates on the MFCC features directly without explicitly using any online noise estimation or reduction process; in other words, the corresponding noise characteristic of the test utterance is simply determined by the pretrained noisy GMM. Such a deficiency will no doubt limit the performance of CPHEQ.

Further Comparison with Three Sophisticated Robustness Methods(2/2) • Since CPHEQ uses both clean speech and its noisy counterpart to estimate the polynomial functions, we also compare its performance with AFE and MLMWF under the multi-condition training scenario.

Summary • The results shown in Fig. 1 reveal that there exists a strong correlation between the order of the polynomial function and the mixture number of the GMM. • Even though CPHEQ performs worse than the sophisticated robustness methods described in Section IV-F, its simplicity means that it can still be used to deal with noise distortions, alone or combined with other, more complicated robustness methods. Each of these methods has its own merits and defects. • The need for stereo data sometimes limits the applicability of CPHEQ, since stereo data are not always easy to collect. One possible solution to this difficulty is to borrow the idea of VTS enhancement.

Conclusions • Since it is sometimes difficult to collect stereo data, one future research direction would be the use of mono data (either clean or noisy speech data) to estimate the parameters of the transformation functions. • The data-fitting technique is prone to be affected by abnormal values; therefore, another future research direction would be outlier detection/elimination, or the so-called robust regression. • Speech signals are slowly time-varying, so the contextual information between consecutive speech feature vectors might be an important clue that can be employed by CPHEQ and PHEQ.
