Speech enhancement in nonstationary noise environments using noise properties. Kotta Manohar, Preeti Rao, Department of Electrical Engineering, Indian Institute of Technology, Powai, Bombay 400 076, India. Presenter: Shih-Hsiang (士翔). SPEECH COMMUNICATION 48 (2006).
Reference • K. Manohar and P. Rao, "Speech enhancement in nonstationary noise environments using noise properties," Speech Communication, vol. 48, 2006 • V. Stahl, A. Fischer, and R. Bippus, "Quantile based noise estimation for spectral subtraction and Wiener filtering," in Proc. ICASSP, 2000, vol. 3, pp. 1875–1878 • M. Berouti, R. Schwartz, and J. Makhoul, "Enhancement of speech corrupted by acoustic noise," in Proc. ICASSP, 1980, pp. 208–211
Introduction • Single-channel speech enhancement algorithms are generally based on short-time spectral attenuation (STSA) • A spectral gain is applied to each frequency bin in a short-time frame of the noisy speech signal; the gain is adjusted individually as a function of the relative local SNR at each frequency • Examples: spectral subtraction (SS), MMSE short-time spectral amplitude estimator • Low-SNR regions are attenuated relative to high-SNR regions • A good estimate of the instantaneous noise spectrum is crucial in the estimation of the local SNR • A common method of noise estimation involves the use of a voice activity detector (VAD) to detect the pauses in speech • The noise estimate is then obtained by a recursively smoothed adaptation of the noise during the detected pauses
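As a rough illustration of the VAD-gated recursive noise update described on this slide, here is a minimal sketch. The function name `update_noise_estimate`, the smoothing constant `alpha = 0.9`, and the toy two-bin data are assumptions made for illustration; they are not taken from the paper.

```python
def update_noise_estimate(noise_psd, frame_psd, speech_present, alpha=0.9):
    """Recursively smooth the noise power spectrum, adapting only in pauses.

    alpha is a hypothetical smoothing constant, not a value from the paper.
    """
    if speech_present:
        # Hold the previous estimate while the VAD reports active speech.
        return list(noise_psd)
    return [alpha * n + (1.0 - alpha) * y for n, y in zip(noise_psd, frame_psd)]


# Toy example: the noise power jumps from 1.0 to 4.0 while speech is active.
noise = [1.0, 1.0]
frames = [
    ([1.0, 1.0], False),  # pause
    ([4.0, 4.0], True),   # speech: estimate is frozen
    ([4.0, 4.0], True),   # speech: estimate is frozen
    ([4.0, 4.0], False),  # pause: estimate starts to adapt
]
history = []
for psd, vad_speech in frames:
    noise = update_noise_estimate(noise, psd, vad_speech)
    history.append(noise[0])
```

In this toy run the estimate stays at its old level throughout the speech frames and only begins to move toward the new noise level in the next pause, which is the limitation the following slides discuss.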
Introduction (cont.) • In stationary background noise, such an estimator is generally reliable • However, nonstationary noises cannot be tracked adequately by a recursive noise estimation method that adapts only during detected speech pauses • E.g. factory noise, battlefield noise • Even if the VAD is reliable, changes in the noise spectrum occurring during active speech cannot influence the noise estimate in a timely manner • STSA-based algorithms are effective only in suppressing the stationary noise component, generally leaving noise bursts unattenuated in the enhanced speech
Introduction (cont.) • This paper proposes a method which exploits known differences in the spectro-temporal properties of noise and speech to selectively attenuate noisy time-frequency regions remaining in STSA-enhanced signals
Suppressing nonstationary noise • The proposed solutions generally fall into two categories • Improvements to the noise estimator • Modification of the suppression rule • A number of methods for noise spectrum estimation without explicit speech pause detection have been proposed • Based on tracking some statistic (e.g. minimum, median) of past power spectral values for each frequency bin over several frames (e.g. QBNE) • However the buffer length necessary to bridge peaks of speech activity makes it difficult to follow any rapid variations in noise spectrum
Suppressing nonstationary noise (cont.) • A brief introduction to QBNE (Quantile Based Noise spectrum Estimation) • Even in speech sections of the input signal, not all frequency bands are permanently occupied by speech; for a significant share of the time, the energy in each frequency band is at the noise level • The noise estimate N(ω) is obtained by taking the q-th quantile over time in every frequency band • For every ω, the frames of the entire utterance X(ω,t), t = 0,…,T, are sorted such that X(ω,t0) ≤ X(ω,t1) ≤ … ≤ X(ω,tT). The q-quantile noise estimate is defined as N(ω) = X(ω, t⌊qT⌋)
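The sorting-and-quantile step can be sketched as follows. This is a minimal illustration that assumes the usual definition N(ω) = X(ω, t⌊qT⌋) over the sorted frames; the function name and the toy data are hypothetical.

```python
import math


def quantile_noise_estimate(power_frames, q=0.5):
    """Per-bin noise estimate as the q-th quantile over time.

    power_frames[t][w] is the power at frame t (t = 0..T) and frequency bin w.
    """
    last_index = len(power_frames) - 1          # T in the slide's notation
    pick = math.floor(q * last_index)           # index of the q-quantile
    num_bins = len(power_frames[0])
    estimate = []
    for w in range(num_bins):
        # Sort the values of this bin over all frames, then take the quantile.
        sorted_vals = sorted(frame[w] for frame in power_frames)
        estimate.append(sorted_vals[pick])
    return estimate


# Five frames, two bins; the large values mimic frames dominated by speech.
frames = [[1.0, 2.0], [9.0, 2.5], [1.2, 8.0], [0.9, 2.2], [1.1, 7.5]]
noise = quantile_noise_estimate(frames, q=0.5)
```

With q = 0.5 this is the median over time, so isolated high-energy frames in a bin do not pull the estimate up.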
Suppressing nonstationary noise (cont.) • The QBNE method uses a buffer of 0.64 s duration and a quantile value of 0.5 • Factory noise is nonstationary in nature, having a stationary noise background with occasional random bursts, which cause the sudden peaks in the instantaneous noise power spectra • The VAD estimator tracks the noise burst level only when speech is absent • The QBNE estimator responds to the noise burst only approximately and with a delay • These direct estimation methods for noise fail in conditions such as factory noise
Suppressing nonstationary noise (cont.) • A different approach, which carries out the adaptation of the noise estimate during both speech absence and presence, uses a speech absence probability based on an estimate of the SNR (Malah et al., 1999; Cohen, 2003) • Any sudden increase in the background noise level is not easily distinguished from speech and results in a high estimated SNR, making the method relatively less effective in highly nonstationary noise • No direct method can track highly nonstationary noises accurately, even if the noise estimate is updated in every frame
Suppressing nonstationary noise (cont.) • Cooke et al. (2001) propose missing data methods for robust ASR • A two-stage approach is used • Spectral subtraction is employed to suppress the stationary noise component • The recognition processor is conditioned on the estimated reliability of spectro-temporal regions of the signal as determined by various speech spectrum cues • Difficulty of detecting unreliable regions when the nonstationary noise component is intermittent and impulsive • A similar concept applicable to speech enhancement is the use of statistical models of clean speech or trained codebook where a priori information in the form of spectral envelope shapes is stored for both speech and noise • A joint or iterative optimization over assumed speech and noise models is carried out for each frame of noisy speech to determine the noise estimate • The performance would be expected to depend critically on a good match between training and actual usage conditions
Suppressing nonstationary noise (cont.) • This paper is targeted towards a robust algorithm for suppression of random noise bursts with minimal speech distortion • Using available knowledge to distinguish between speech and noise in order to identify, and further attenuate, unreliable spectro-temporal regions in signals enhanced by traditional STSA • To achieve improved speech quality using this approach requires solutions to two problems • determining reliable cues for identifying noisy spectro-temporal regions • finding a suitable suppression rule applicable to the detected noisy regions so as to achieve significant reduction of noise with minimal speech distortion.
Proposed post-processing algorithm • The proposed post-processing algorithm involves identifying regions in the spectrogram of the STSA-enhanced speech that are dominated by the residual noise • These regions are selectively attenuated further with the goal of improving the overall quality of the enhanced speech • The post-processing scheme thus comprises the following steps: • Step 1: Divide the spectrum of each frame of the STSA-enhanced speech into several, possibly overlapping, frequency bands, in view of the fact that the noise spectrum may be localized in frequency • Step 2: Carry out speech/noise classification to detect frequency bands that are dominated by residual noise • Step 3: Using a suitable suppression rule, attenuate the spectral values in the identified noisy bands
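The three steps can be sketched for a single frame as follows. This is only an outline under stated assumptions: the band width, band hop, decision threshold, the use of an energy-normalized variance as the classification feature, and the fixed gain standing in for the suppression rule are all hypothetical choices, not the settings of the paper.

```python
def split_bands(num_bins, band_width, hop):
    """Step 1: overlapping bands as inclusive (first_bin, last_bin) pairs."""
    return [(b, b + band_width - 1)
            for b in range(0, num_bins - band_width + 1, hop)]


def energy_normalized_variance(mags):
    """Assumed feature: variance of the magnitudes divided by their mean energy."""
    n = len(mags)
    mean = sum(mags) / n
    mean_energy = sum(m * m for m in mags) / n
    return (mean_energy - mean * mean) / mean_energy


def post_process(mags, band_width=8, hop=4, threshold=0.2, gain=0.1):
    """Flag flat (noise-like) bands and attenuate their bins once."""
    flagged = []
    for first, last in split_bands(len(mags), band_width, hop):
        # Step 2: a low variance means a flat band, taken here as noise-dominated.
        if energy_normalized_variance(mags[first:last + 1]) < threshold:
            flagged.append((first, last))
    noisy_bins = {k for first, last in flagged for k in range(first, last + 1)}
    # Step 3: a fixed gain is a placeholder for the SNR-dependent suppression rule.
    enhanced = [gain * m if k in noisy_bins else m for k, m in enumerate(mags)]
    return enhanced, flagged


# Toy frame: harmonic peaks in bins 0-15, flat residual noise in bins 16-31.
mags = [10.0, 0.1, 0.1, 0.1] * 4 + [1.0] * 16
enhanced, flagged = post_process(mags)
```

Bins that fall in several flagged bands are attenuated only once here; how overlapping bands should be combined is a design choice this sketch does not take from the source.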
Proposed post-processing algorithm (cont.) • The suppression rule should ideally depend on the bin SNR in such a manner as to apply more attenuation in low-SNR regions • This would help to minimize speech distortion while achieving an overall improvement in the SNR • If the identification of noisy frequency bands in Step 2 is reasonably reliable, a local SNR increase in an identified nonspeech bin would signal the onset of a noise burst • An appropriate definition of the estimated SNR is the "average a priori SNR", which combines a previous-SNR term and a current-SNR term [equation not preserved in this transcript] • The average noise power spectrum estimate used in it is obtained from the noise estimator of the STSA algorithm
Proposed post-processing algorithm (cont.) • The attenuation factor λ(k) is varied linearly with the estimated a priori SNR ζ(k) in dB, but restricted to the range 0.05 to 0.9: λ(k) = f0 + s · ζ(k) [dB], where f0 is the value at 0 dB SNR and s is the slope of the line • [Figure: attenuation factor versus SNR (dB), held at 0.05 below SNR_low and at 0.9 above SNR_high]
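A minimal sketch of this clipped linear rule follows. It assumes the line runs from 0.05 at SNR_low to 0.9 at SNR_high, which fixes the slope s and the 0 dB value f0; the default values of -5 dB and 20 dB are placeholders, not the paper's settings.

```python
def attenuation_factor(snr_db, snr_low=-5.0, snr_high=20.0,
                       lam_min=0.05, lam_max=0.9):
    """Attenuation factor: linear in SNR (dB), clipped to [lam_min, lam_max].

    snr_low and snr_high are hypothetical defaults chosen for illustration.
    """
    slope = (lam_max - lam_min) / (snr_high - snr_low)   # s in the slide
    f0 = lam_min - slope * snr_low                       # value at 0 dB SNR
    lam = f0 + slope * snr_db
    return min(lam_max, max(lam_min, lam))


# More attenuation (smaller factor) at low SNR, less at high SNR.
factors = [attenuation_factor(snr) for snr in (-30.0, 7.5, 40.0)]
```

Raising SNR_high or SNR_low shifts the line to the right, so a given SNR receives a smaller factor and the suppression becomes more aggressive.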
Proposed post-processing algorithm (cont.) • The suppression rate can be controlled by varying the parameters 'SNR_low' and 'SNR_high' • After obtaining the attenuation factors, recalculate the speech estimate in the i-th 'noisy band' as follows, limiting the value to a spectral floor set by the spectral floor gain parameter [equation not preserved in this transcript]
Spectral flatness based classifiers • Based on the assumption that the STSA-enhanced speech contains primarily harmonic speech and frequency-localized noise bursts • Let X[k] denote the magnitude spectrum values computed via a DFT. The i-th frequency band comprises L frequency bins with bin index k in the range [bi, ei] • For instance, with a 256-point DFT at a sampling frequency of 8 kHz, the 0–1 kHz band will be bounded by the bin indices bi = 0 and ei = 31 • The measures investigated are: • SFM (spectral flatness measure): defined as the ratio of the geometric mean to the arithmetic mean of the magnitude spectrum values, taking low values for harmonic regions representing speech and high values for noise-dominated regions, which have a relatively flat spectrum
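The band-index example and the SFM definition above can be checked with a short sketch. The helper names and the small `eps` guard against log(0) are assumptions added for illustration.

```python
import math


def band_bins(f_low_hz, f_high_hz, n_fft=256, fs_hz=8000):
    """Inclusive bin range [b_i, e_i] covering f_low_hz <= f < f_high_hz."""
    first = int(f_low_hz * n_fft / fs_hz)
    last = int(f_high_hz * n_fft / fs_hz) - 1
    return first, last


def spectral_flatness(mags, eps=1e-12):
    """Ratio of the geometric mean to the arithmetic mean of the magnitudes."""
    log_mean = sum(math.log(m + eps) for m in mags) / len(mags)
    geometric = math.exp(log_mean)
    arithmetic = sum(mags) / len(mags)
    return geometric / (arithmetic + eps)


first_bin, last_bin = band_bins(0, 1000)        # the 0-1 kHz band

flat_band = [1.0] * 32                          # noise-like, flat spectrum
harmonic_band = [10.0, 0.01, 0.01, 0.01] * 8    # peaks every fourth bin
sfm_flat = spectral_flatness(flat_band)
sfm_harmonic = spectral_flatness(harmonic_band)
```

The flat band gives an SFM close to 1, while the band with strong harmonic peaks gives a value close to 0.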
Spectral flatness based classifiers (cont.) • Energy-normalized variance: the harmonic structure, or deviation from flatness, of the spectrum in any chosen frequency band is reflected in the energy-normalized variance of the spectral values, which takes high values for harmonic regions representing speech and low values for noise-dominated regions • Entropy: a related measure is "entropy" as used in the VAD of Renevey and Drygajlo (2001), on the assumption that the signal spectrum is more organized during speech segments than during noise segments. H takes its maximum value of 1 when the signal is white noise and its minimum value of 0 when it is a pure tone (sinusoid). Hence, the entropy-based method is well suited for speech detection in white or quasi-white noise
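The two measures can be sketched as below. The exact formulas are not preserved in this transcript, so both definitions are assumptions: the entropy is computed on the magnitudes normalized to sum to one and divided by log L so that it lies in [0, 1], and the variance is divided by the mean energy of the band.

```python
import math


def normalized_entropy(mags):
    """Entropy of the normalized spectrum, scaled to [0, 1] by log(L)."""
    total = sum(mags)
    probs = [m / total for m in mags if m > 0.0]   # skip empty bins (0 log 0 = 0)
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(mags))


def energy_normalized_variance(mags):
    """Variance of the magnitudes divided by their mean energy (assumed form)."""
    n = len(mags)
    mean = sum(mags) / n
    mean_energy = sum(m * m for m in mags) / n
    return (mean_energy - mean * mean) / mean_energy


white = [1.0] * 32                 # flat spectrum, as for white noise
tone = [0.0] * 32
tone[5] = 1.0                      # single spectral line, as for a pure tone

h_white, h_tone = normalized_entropy(white), normalized_entropy(tone)
v_white, v_tone = energy_normalized_variance(white), energy_normalized_variance(tone)
```

Under these assumed definitions the two features move in opposite directions: the flat spectrum has maximum entropy and zero variance, while the single spectral line has zero entropy and a variance close to one.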
Experimental comparison of classifier • A comparative evaluation of the different classifiers can be achieved by experimental observations in a typical application situation • i.e. by comparing the receiver operating characteristics (ROC) or the hit rate versus false-alarm rate plots • A better classifier would be characterized by a lower false-alarm rate for a given hit rate • The steepness or slope of the ROC curves determines the suitability of the feature in terms of providing an adequate level of discrimination between speech and noise
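A minimal sketch of how the ROC points could be computed from labeled bands follows. It assumes a feature for which larger scores indicate noise (as with the SFM), so a band is declared noisy when its score reaches the threshold; the scores and labels are made-up toy data.

```python
def roc_points(scores, is_noise, thresholds):
    """Return (false_alarm_rate, hit_rate) for each decision threshold.

    A hit is a noise-dominated band flagged as noisy; a false alarm is a
    speech band flagged as noisy.
    """
    num_noise = sum(1 for label in is_noise if label)
    num_speech = len(is_noise) - num_noise
    points = []
    for thr in thresholds:
        hits = sum(1 for s, label in zip(scores, is_noise) if label and s >= thr)
        false_alarms = sum(1 for s, label in zip(scores, is_noise)
                           if not label and s >= thr)
        points.append((false_alarms / num_speech, hits / num_noise))
    return points


# Toy data: six bands with a flatness-like score and ground-truth labels.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
is_noise = [True, True, False, True, False, False]
curve = roc_points(scores, is_noise, thresholds=[0.75, 0.5, 0.25])
```

Lowering the threshold raises the hit rate and the false-alarm rate together; comparing features means comparing the false-alarm rate each needs to reach the same hit rate.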
Experimental comparison of classifier (cont.) ROC plots of the energy-normalized variance, SFM and entropy in the detection of noisy regions for factory noise-corrupted speech at 0 dB SNR
Experimental evaluation • The performance is evaluated for three real environmental noises, viz. factory noise, machine gun noise, and train interior noise • All three noises are highly fluctuating, characterized by random energetic bursts • Two standard STSA algorithms are chosen as the front-end STSA algorithms • Berouti spectral subtraction (BSS) • Multiplicatively modified log spectral amplitude estimator (MM-LSA) • In all experiments, a 32 ms Hamming window with 50% overlap is applied to speech sampled at 8 kHz. The spectrum is computed using a 256-point DFT
Experimental evaluation (cont.) • Noise properties and post-processing parameter settings • Factory noise: contains randomly occurring events such as hammer blows embedded in a more homogeneous background noise • Machine gun noise: a series of gunshots recorded in a quiet environment; to make it more realistic, a white background noise is added • Train noise: sound recorded in the interior of an Indian electric train with windows open (i.e. the noise arises from the moving mechanical parts of the train)
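The analysis settings on this slide translate into 256-sample frames with a 128-sample hop. The sketch below shows the framing and a direct (slow) DFT on a test tone; the helper names and the 1 kHz tone are illustrative assumptions, and a real implementation would use an FFT library.

```python
import cmath
import math

FS_HZ = 8000
FRAME_LEN = round(0.032 * FS_HZ)   # 32 ms at 8 kHz = 256 samples
HOP = FRAME_LEN // 2               # 50% overlap = 128 samples
N_FFT = 256


def hamming(n):
    """Symmetric Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def windowed_frames(signal, frame_len=FRAME_LEN, hop=HOP):
    """Split the signal into overlapping frames and apply the window."""
    window = hamming(frame_len)
    return [[s * w for s, w in zip(signal[start:start + frame_len], window)]
            for start in range(0, len(signal) - frame_len + 1, hop)]


def dft_magnitude(frame, n_fft=N_FFT):
    """Magnitude spectrum for bins 0..n_fft/2 via a direct DFT."""
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * n / n_fft)
                    for n, x in enumerate(frame)))
            for k in range(n_fft // 2 + 1)]


# One second of a 1 kHz tone; bin spacing is 8000 / 256 = 31.25 Hz.
tone = [math.sin(2.0 * math.pi * 1000.0 * n / FS_HZ) for n in range(FS_HZ)]
frames = windowed_frames(tone)
spectrum = dft_magnitude(frames[0])
peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
```

The peak lands in bin 32, that is 1000 Hz divided by the 31.25 Hz bin spacing.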
Experimental evaluation (cont.) Spectrograms of segments of (a) factory, (b) train and (c) machinegun noise
Experimental evaluation (cont.) • Noise properties and post-processing parameter settings • The frequency bandwidth for the variance-based noise detection is selected to provide a high frequency resolution for noisy region detection • The choice of decision threshold for the detection of noise-dominated bands should be based on the desired hit rate or tolerable false-alarm rate. A low false-alarm rate helps to minimize speech distortion • The parameters SNR_low and SNR_high determine the amount of attenuation as a function of the estimated a priori SNR
Experimental evaluation (cont.) • Measuring speech quality improvement • Naturalness and Intelligibility of speech output are important attributes of the performance of any speech enhancement system • Since achieving a high degree of noise suppression is often accompanied by speech signal distortion, it is important to evaluate both quality and intelligibility • Subjective listening tests are the best indicators of achieved overall quality • A–B comparison tests of sentences processed by competing processing methods can be used to obtain comparative quality rankings • The chief attributes tested here are the naturalness or overall quality of the processed speech • Speech intelligibility is tested by the SUS (semantically unpredictable sentences) test, originally proposed for evaluating synthetic speech (Benoit et al., 1996)
Semantically Unpredictable Sentences (SUS) • Comparative evaluation of sentence intelligibility, minimizing the effect of contextual cues. Short, semantically unpredictable sentences of five different, common syntactic structures with words randomly selected from lexicons with frequent "mini-syllabic" words (smallest words available in a given category): • Subject - Verb - Adverbial, e.g., The table walked through the blue truth • Subject - Verb - Direct object, e.g., The strong way drank the day • Adverbial - Transitive verb - Direct object (imperative), e.g., Never draw the house and the fact • Q-word - Transitive verb - Subject - Direct object, e.g., How does the day love the bright word? • Subject - Verb - Complex direct object, e.g., The place closed the fish that lived.
Experimental evaluation (cont.) • Overall quality ranking uses an A–B comparison involving four listeners and eight distinct sentences from the TIMIT database (Fisher et al., 1986), each from a different speaker (four male and four female) • Each sentence pair presented for listening comparison comprises the processed versions of a single sentence, before and after post-processing • To avoid bias, the order of A and B is interchanged and randomized across sentences and listeners • Speech intelligibility is tested by the SUS test • Thirty SU sentences, six of each of the five syntactic structures, were generated and played in random order to each of four listeners, who were asked to write down the sentences they heard • To avoid listener familiarity with a specific noise sample, segments of the noise file to be added to the sentences were chosen randomly from a larger noise sample and digitally added to the clean speech
Experimental evaluation (cont.) • There are a large number of objective measures that quantify the degradation in quality of processed speech with respect to a reference speech sample • However, not all objective measures may be appropriate for specific kinds of distortion • PESQ and WSS are used in the experiments to measure the quality gains, if any, achieved by post-processing
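The noise-mixing step in the last bullet can be sketched as follows. Scaling the noise segment to reach a target SNR is an assumption about how the mixtures were prepared (the slides only state that random segments were digitally added); the function names and the synthetic signals are hypothetical.

```python
import math
import random


def power(samples):
    """Mean power of a sequence of samples."""
    return sum(x * x for x in samples) / len(samples)


def mix_at_snr(speech, noise, snr_db, rng):
    """Add a randomly chosen noise segment to the speech at the given SNR."""
    start = rng.randrange(len(noise) - len(speech) + 1)
    segment = noise[start:start + len(speech)]
    # Scale the segment so that power(speech) / power(scaled) matches snr_db.
    gain = math.sqrt(power(speech) / (power(segment) * 10.0 ** (snr_db / 10.0)))
    scaled = [gain * n for n in segment]
    noisy = [s + n for s, n in zip(speech, scaled)]
    return noisy, scaled


rng = random.Random(0)                                  # fixed seed for repeatability
speech = [math.sin(0.1 * n) for n in range(400)]        # stand-in for clean speech
noise = [rng.gauss(0.0, 1.0) for _ in range(4000)]      # stand-in for a long noise file
noisy, scaled_noise = mix_at_snr(speech, noise, snr_db=3.0, rng=rng)
achieved_snr_db = 10.0 * math.log10(power(speech) / power(scaled_noise))
```

Drawing a fresh segment for every sentence keeps any particular noise burst from recurring across the test material.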
Weighted Spectral Slope Measure • The weighted spectral slope (WSS) measure is based on an auditory model in which 36 overlapping filters of progressively larger bandwidth are used to estimate the smoothed short-time speech spectrum • The measure finds a weighted difference between the spectral slopes in each band • The magnitude of each weight reflects whether the band is near a spectral peak or valley, and whether the peak is the largest in the spectrum • The distance also includes a term for the difference between the overall sound pressure levels of the original and processed utterances; Ks is a parameter which can be varied to increase the overall performance
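PESQ MOS • Mean Opinion Score (MOS) • The mean opinion score (MOS) is used to measure speech clarity • For the MOS, a group of listeners rates received speech samples on five levels according to the call quality they hear: 1 is the worst, 5 is the best, and 4 corresponds to the call quality of the ordinary public telephone network. However, it is difficult to establish an objective standard with the MOS • Perceptual Evaluation of Speech Quality (PESQ) • This technique combines the strengths of two methods, PSQM and PAMS: the perceptual model of PSQM and the time-alignment routine of PAMS, so the PESQ score correlates more closely with the MOS • The PSQM algorithm rates clarity with a number from 0 to 6.5; the lower the number, the better the call quality • PAMS produces two scores, the listening quality score (Ylq) and the listening effort score (Yle), both on a scale of 0 to 15, where a higher number means better quality. Like the PSQM clarity score, the listening quality score mainly rates the similarity between the speech signal received by the listener and the original signal. The listening effort score targets severely distorted signals that cannot be assessed in terms of sound quality; it evaluates how much effort the listener must spend to understand the message carried by the severely distorted speech signal
Four steps of clarity assessment • The first step is to time-align the reference (original) signal with the received signal • The second step is gain scaling of the reference and received signals so that the two signals have the same power • The third step converts the time-domain signals into the frequency domain and groups the resulting spectrum into bands (bins) set according to the nonlinear relationship between human hearing and frequency. Bands set on the Bark scale reflect the greater sensitivity of human hearing to low-frequency sound, so the bands are narrower at the low-frequency end and wider at the high-frequency end • The last step is the key analysis. A perceptual model is used to compare and process the content of the bands to determine its importance and differences for human hearing; the result provides a clarity score for comparing the differences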
Result and discussion • There is a clear listener preference for the post-processed speech over that before post-processing • The percentage word intelligibility scores averaged across the listeners are 60.7, 51.7 and 50.6 at 3 dB SNR for the three configurations of noisy, BSS and BSS + PP, respectively
Result and discussion (cont.) Narrowband spectrograms of (a) clean, (b) noisy, (c) BSS-enhanced speech and (d) after post-processing, for a speech segment in factory noise
Result and discussion (cont.) • The WSS distance indicates a consistent decrease (implying an improvement in quality) with post-processing from that obtained with STSA enhancement alone • The PESQ MOS, on the other hand, is consistent with the subjectively perceived trend of an improvement in speech quality with STSA enhancement over that of noisy speech • Both objective measures indicate that post-processing has a greater influence at the lower SNRs than at the higher SNRs
Result and discussion (cont.) • The performance gains due to post-processing do not change significantly with changes in the algorithm parameters
Conclusion • Traditional STSA speech enhancement algorithms perform inadequately in application to speech corrupted by highly nonstationary noise • With limited added complexity, the post-processing algorithm is effective in significantly reducing the perceived effects of the noise bursts at low SNRs without further speech distortion • While the onsets of noise bursts are greatly attenuated, bursts of long duration are not suppressed completely due to the difficulties in the reliable classification of bins as speech or noise dominated within an identified noise burst band