
Temporal masking of spectrally reduced speech: psychoacoustical experiments and links with ASR

Frédéric Berthommier and Angélique Grosgeorges, ICP, 46 av. Félix Viallet, Grenoble, France. Email: (bertho,ggeorges)@icp.inpg.fr


Presentation Transcript


  1. Temporal masking of spectrally reduced speech: psychoacoustical experiments and links with ASR. Frédéric Berthommier and Angélique Grosgeorges, ICP, 46 av. Félix Viallet, Grenoble, France. Email: (bertho,ggeorges)@icp.inpg.fr

  2. Introduction and motivations. We used the experimental paradigm proposed by [Shannon et al., 95], from which we developed a series of experiments. Following (Horii et al., 1971), Shannon et al. varied the spectro-temporal resolution of speech utterances. The stimuli were composed of white noise modulated by the filtered envelopes extracted in 4 subbands. The task was consonant identification in VCVCV utterances among 16 French consonants. We then evaluated the transmission of their phonetic features: voicing, manner and place of articulation. We extend this paradigm by masking this residual signal with stationary [Lorenzi et al., 99] or non-stationary [Grosgeorges et al., 00] noises. In this framework, we replace the pair (local SNR / acoustic representation), analysed in terms of identification rate, with the pair (global SNR / phonetic representation), analysed in terms of feature transmission. We then focus on the problem of acoustic-phonetic decoding in noise, and on the impact of the noise on the features that ground the classification process. In other words, we postulate the existence of an intermediate level preceding phonetic categorisation, and we study its properties.

  3. Introduction and motivations (2). We expect a set of complementary results from this approach: it is informative about the link between auditory and speech processes, useful for CASA, and informative for developing ASR for noisy and distorted speech. For RESPITE, the goal of this project is to set up a plausible multi-stream model in which the phonetic identification of consonants is grounded in the extraction of three phonetic characteristics (voicing, place and manner), each in a specialised module with its own spectro-temporal resolution. A pre-classification according to this phonetic representation could be more robust than direct classification, the streams easier to weight according to their information content, and the fusion process easier to control. Remark: vowel identification is considered to be well modelled in current implementations. Moreover, the visual modality can easily be integrated in this model for the same reason: the audio-visual complementarity is optimally represented.

  4. The Shannon et al. experiment. Spectral degradation: the signal was divided into one, two, three or four frequency bands. Temporal degradation: the amplitude envelope extracted from each band was low-pass filtered with a cutoff frequency Fc of 16, 50, 160 or 500 Hz. The identification of 3 features (voicing, manner and place) for 16 French consonants « a/C/a » was evaluated by the classical information transmission analysis (Miller and Nicely, 1955). The main conclusion is that, despite the great spectro-temporal reduction, voicing and manner are remarkably well transmitted by the residual envelope, i.e. by the temporal components of the speech. Some questions arise: how is this residue processed? How can it be used to increase robustness? One way to address them is to mask the residue and analyse what happens.
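
As an illustration of the information transmission analysis cited on this slide, here is a minimal sketch in Python of the relative transmitted information computed from a stimulus-by-response confusion matrix, i.e. T(x;y) = sum over cells of p_ij * log2(p_ij / (p_i * p_j)), divided by the stimulus entropy H(x). The function name `relative_transmission` and the toy matrices are assumptions made for the example; they are not taken from the slides.

```python
import numpy as np

def relative_transmission(confusion):
    """Relative information transmitted, T(x;y) / H(x).

    `confusion` is a count matrix: rows are stimuli, columns are responses.
    """
    counts = np.asarray(confusion, dtype=float)
    p_xy = counts / counts.sum()              # joint probabilities
    p_x = p_xy.sum(axis=1, keepdims=True)     # stimulus marginals (column vector)
    p_y = p_xy.sum(axis=0, keepdims=True)     # response marginals (row vector)
    expected = p_x * p_y                      # joint under independence
    nonzero = p_xy > 0                        # empty cells contribute nothing
    transmitted = np.sum(p_xy[nonzero] * np.log2(p_xy[nonzero] / expected[nonzero]))
    px = p_x.ravel()
    px = px[px > 0]
    stimulus_entropy = -np.sum(px * np.log2(px))
    return transmitted / stimulus_entropy if stimulus_entropy > 0 else 0.0

# Toy matrices (synthetic, for illustration only).
perfect = [[10, 0], [0, 10]]   # every stimulus identified correctly
chance = [[5, 5], [5, 5]]      # responses independent of the stimulus
print(relative_transmission(perfect), relative_transmission(chance))
```

A value of 1 means the feature is fully transmitted and 0 means the responses carry no information about the stimulus, which is how the percentages of "information received" on the later slides can be read.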

  5. Factorial design of the masking experiment. Factor n°1: the spectral resolution was kept constant at 4 frequency bands, and the envelope was low-pass filtered with a cutoff frequency Fc of 10 or 500 Hz. Factor n°2: we added different temporal maskers in order to selectively degrade the different components of the residual signal: (1) to mask the coarse component of temporal information, all maskers used white noise with a low-frequency AM (amplitude modulation < 8 Hz) applied in each subband; (2) to degrade the residual spectral information, we decorrelated the low-frequency AM across the 4 frequency bands; (3) to mask the fine temporal information, we re-modulated the low-frequency AM of the masker at 100 Hz.
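
The three masker manipulations can be sketched as envelope generators. This is a hedged illustration, not the authors' procedure: the slides do not say how the low-frequency AM was produced, so the FFT-based low-pass filtered noise, the normalisation to [0, 1], the full-depth sinusoidal 100 Hz re-modulation, and the names `lowpass_noise_envelope` and `masker_envelopes` are all assumptions. Level 1 applies (1), level 2 applies (1) + (2), level 3 applies (1) + (2) + (3).

```python
import numpy as np

FS = 11025  # sampling rate given in the slides (Hz)

def lowpass_noise_envelope(n, cutoff_hz, rng, fs=FS):
    """Slow non-negative envelope: white noise low-pass filtered by zeroing FFT bins."""
    spec = np.fft.rfft(rng.standard_normal(n))
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    spec[freqs >= cutoff_hz] = 0.0       # keep only components below the cutoff
    slow = np.fft.irfft(spec, n)
    slow -= slow.min()                   # shift so the envelope is non-negative
    return slow / slow.max()             # normalise to [0, 1] (assumption)

def masker_envelopes(level, n, n_bands=4, seed=0, fs=FS):
    """Return an (n_bands, n) array of masker envelopes for level 1, 2 or 3."""
    rng = np.random.default_rng(seed)
    if level == 1:
        # (1) same low-frequency AM (< 8 Hz) in every subband
        env = np.tile(lowpass_noise_envelope(n, 8.0, rng, fs), (n_bands, 1))
    else:
        # (2) independent (decorrelated) low-frequency AM in each subband
        env = np.stack([lowpass_noise_envelope(n, 8.0, rng, fs) for _ in range(n_bands)])
    if level == 3:
        # (3) re-modulation of the slow AM at 100 Hz (full-depth sinusoid, assumption)
        t = np.arange(n) / fs
        env = env * 0.5 * (1.0 + np.sin(2.0 * np.pi * 100.0 * t))
    return env

envelopes = {level: masker_envelopes(level, FS) for level in (1, 2, 3)}  # 1 s each
```

Each row would then multiply a white noise carrier restricted to the corresponding subband before the masker is added to the stimulus.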

  6. Factorial design (2). Task: consonant identification in a quiet room, with forced choice and no feedback. Subjects: 6 normal-hearing listeners, untrained; however, all subjects had experience in psychoacoustical experiments. Stimuli: 384 stimuli covering 6 different conditions, presented in random order.

  7. Speech and signal processing. Speech material: 16 nonsense utterances aCaCa, with C = {b, d, g, v, Z, z, m, n, r, l, p, t, k, f, s, S}. Consonant features: voicing: voiced = {b, d, g, v, Z, z, m, n, r, l} / voiceless = {p, t, k, f, s, S}; manner: fricative + liquid = {f, s, S, v, Z, z, r, l} / occlusive + nasal = {p, t, k, b, d, g, m, n}; place: labial = {p, b, f, v, m} / dental = {t, d, s, z, n, l} / palatal = {k, g, Z, S, r}. Blocks of the processing diagram: FFT (FS = 11025 Hz, analysis frame 92.8 ms); decomposition into 4 spectral bands (1 to 4) by bandpass filtering; low-pass filtering at 500 Hz or 10 Hz; signal rectification; iFFT; signal reconstruction; white noise. Temporal masker: (1), (1) + (2) or (1) + (2) + (3), added at SNR = +6 dB to form the stimulus.
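
The processing chain on this slide can be sketched as a four-band noise vocoder followed by mixing with a masker at a fixed SNR. This is only a sketch under stated assumptions: the band edges are hypothetical (the slides do not give them), the carrier is band-limited to each subband, the filtering is done with brick-wall FFT masks over the whole signal rather than frame by frame over 92.8 ms frames, and the names `fft_filter`, `noise_vocode` and `mix_at_snr` are invented for the example.

```python
import numpy as np

FS = 11025  # sampling rate given on the slide (Hz)

def fft_filter(x, lo_hz, hi_hz, fs=FS):
    """Brick-wall filter: keep only FFT bins with lo_hz <= f < hi_hz."""
    spec = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    spec[(freqs < lo_hz) | (freqs >= hi_hz)] = 0.0
    return np.fft.irfft(spec, len(x))

def noise_vocode(speech, band_edges_hz, env_cutoff_hz, seed=0, fs=FS):
    """Replace each subband by white noise modulated by that band's envelope."""
    rng = np.random.default_rng(seed)
    out = np.zeros(len(speech))
    for lo, hi in zip(band_edges_hz[:-1], band_edges_hz[1:]):
        band = fft_filter(speech, lo, hi, fs)                    # bandpass filtering
        env = fft_filter(np.abs(band), 0.0, env_cutoff_hz, fs)   # rectify, then low-pass
        env = np.maximum(env, 0.0)                               # clip filter ripple below zero
        carrier = fft_filter(rng.standard_normal(len(speech)), lo, hi, fs)
        out += env * carrier
    return out

def mix_at_snr(signal, masker, snr_db):
    """Scale the masker so that signal power / masker power equals snr_db, then add it."""
    gain = np.sqrt(np.mean(signal ** 2) / (np.mean(masker ** 2) * 10.0 ** (snr_db / 10.0)))
    return signal + gain * masker, gain

# Hypothetical band edges in Hz (assumption, not stated in the slides).
EDGES = [100, 800, 1500, 2500, 4000]
rng = np.random.default_rng(1)
speech = rng.standard_normal(FS)          # stand-in for a recorded aCaCa utterance
vocoded = noise_vocode(speech, EDGES, env_cutoff_hz=10.0)
stimulus, gain = mix_at_snr(vocoded, rng.standard_normal(FS), snr_db=6.0)
```

Setting `env_cutoff_hz` to 10.0 or 500.0 corresponds to the two levels of the first factor; the masker passed to `mix_at_snr` would be one of the three temporal maskers.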

  8. Example of stimulus

  9. Results of the experiment. For all conditions, the chance level for consonant recognition was 6.25% (1/16). The overall mean correct identification for the 6 subjects was 28%. A confusion matrix was generated for each listener, and the matrices were summed across listeners. The mean transmitted information (Miller and Nicely, J. Acoust. Soc. Am., 1955) was then evaluated for voicing, manner and place of articulation. The average information received for each consonant feature is plotted as a function of the masker level, and compared with the average information received when there was no temporal masker (dashed lines).
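
The bookkeeping described on this slide (summing per-listener confusion matrices, then regrouping the 16 consonants by phonetic feature before the transmission analysis) can be sketched as follows. The voicing classes are those listed on slide 7; the function names and the synthetic matrices are assumptions made for illustration and do not reproduce the experimental data.

```python
import numpy as np

CONSONANTS = ["b", "d", "g", "v", "Z", "z", "m", "n", "r", "l",
              "p", "t", "k", "f", "s", "S"]
VOICING = {"voiced": set("bdgvZzmnrl"), "voiceless": set("ptkfsS")}

def pooled_confusion(per_listener):
    """Sum the listeners' 16 x 16 count matrices into one matrix."""
    return np.sum(np.asarray(per_listener, dtype=float), axis=0)

def collapse_by_feature(confusion, labels, classes):
    """Regroup a consonant confusion matrix into a feature-class confusion matrix."""
    names = list(classes)
    index = {c: k for k, name in enumerate(names) for c in classes[name]}
    out = np.zeros((len(names), len(names)))
    for i, stim in enumerate(labels):
        for j, resp in enumerate(labels):
            out[index[stim], index[resp]] += confusion[i][j]
    return out

def percent_correct(confusion):
    """Proportion of responses on the diagonal."""
    c = np.asarray(confusion, dtype=float)
    return np.trace(c) / c.sum()

# Synthetic data: 6 listeners who respond uniformly at random on average.
uniform = [np.ones((16, 16)) for _ in range(6)]
pooled = pooled_confusion(uniform)
print(percent_correct(pooled))   # equals the 1/16 chance level for uniform responses
print(collapse_by_feature(pooled, CONSONANTS, VOICING))
```

The collapsed matrix is what the information transmission analysis is applied to, one feature at a time.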

  10. Results: transmission of voicing. Voicing is not transmitted by the fine temporal modulation (as in Shannon et al.), and its transmission decreases slightly when decorrelation degrades the residual spectral information. We conclude that voicing features are acoustically “distributed”, so that the degradation produced by the different masker characteristics (low-frequency AM, decorrelation and 100 Hz re-modulation) is cumulative.

  11. Results: transmission of manner. Transmission of the manner of articulation is completely suppressed by all temporal maskers, which share a low-frequency AM characteristic: the information received does not differ significantly from 0%. When spectral information is reduced, manner is conveyed by the coarse envelope component, which interferes strongly with a low-frequency AM masker: the distinction between fricatives and occlusives is encoded temporally and is effectively masked by noise with similar temporal characteristics.

  12. Nullification of manner transmission

  13. Results: place transmission. Place of articulation is significantly less well transmitted (P<0.05; t-test) at Levels 2 and 3 than at Level 1, for Fc = 10 Hz (*). Decorrelation degrades the residual spectral information (for Fc at 10 Hz).

  14. Conclusion of the masking experiment. We reproduce the main results of Shannon et al. Our experiment suggests that: - voicing is a redundant consonant feature which depends on both categories of information, the coarse temporal envelope and the spectral information; - manner, by contrast, is mainly carried by the coarse temporal envelope. This experiment supports the hypothesis that consonant identification is a complex process which can compensate for the reduction or the masking of either temporal or spectral information by using residual information for voicing and place, but not for manner.

  15. Perspective (1): variation of the spectro-temporal resolution. [Figure: clean signal, Fc = 10 Hz and Fc = 500 Hz.] Intelligibility is low for 1 and 2 subbands, with poor transmission of the place of articulation. The difference between Fc at 10 Hz and at 500 Hz is small.

  16. Perspective (2): interaction between spectral reduction and masking. [Figure: information received (%), on a 0 to 100 scale, for voicing, place of articulation and manner of articulation in four conditions: 4sb SNR=+6dB, 4sb clean, 16sb SNR=+6dB, 16sb clean.] This preliminary experiment (Fc = 10 Hz) shows that, for manner, spectral reduction and temporal masking have largely independent effects, the latter having the stronger impact. This confirms that manner is mainly encoded temporally. One proposal for multistream ASR is therefore to decode this feature temporally in a separate 4-subband stream.

  17. Perspective (3): audio-visual complementarity. As shown by Erber (1972), intelligibility is high even for 1 and 2 subbands: place of articulation is the feature best transmitted by the visual modality, whereas it is the worst transmitted by spectrally reduced audio speech, so global intelligibility is restored thanks to the direct complementarity of transmission in the two modalities.
