390 likes | 413 Vues
This publication delves into the auditory scene analysis theory, focusing on how the auditory system can accurately segregate complex sounds such as noise and speech. It covers the mechanisms of auditory perceptual organization, including primitive cues, Gestalt principles, and schema-based processes. The experiments explore how noise impacts speech perception, especially in synthetic syllables, investigating factors like frequency modulation and formant transitions. The study concludes that short chirps integrate into speech signals, affecting perception, while longer chirp durations show directional cues influencing speech categorization.
 
                
                E N D
Mr Background Noise and Miss Speech Perception in:by Elvira Perez and Georg Meyer The Mechanisms of Segregation
Structure • Introduction • Background • Experiments • Conclusions • Future research
1. Introduction: Ears received mixtures of sounds. We can tolerate surprisingly high levels of noise and still orientate our attention to whatever we want to attend. But... how the auditory system can do this so accurately?
1. Introduction: • Auditory scene analysis (Bregman, 1990) is a theoretical framework that aims to explain auditory perceptual organisation. • Basics: • Environment contains multiples objects • Decomposition into its constituent elements. • Grouping. • It proposes two grouping mechanisms: • 1. ‘Bottom-up’: Primitive cues (F0, intensity, location) Grouping mechanism based on Gestalt principles. • 2. ‘Top-down’: Schema-based (speech pattern matching)
1. Introduction: • Primitive process (Gelstalt. Koffka, 1935): • Similarity: Sounds will be grouped into a single perceptual stream if they are similar in pitch, timbre, loudness or location. • Good continuation: Smooth changes in frequency, intensity, location or spectrum will be perceived as changes in a single source, whereas abrupt changes indicate change in source. • Common fate: Two components in a sound sharing the same kinds of changes at the same time (e.g.: onset-offset) will perceived and grouped as part of the same single source. • Disjoint locations: A given element in a sound can only form part of one stream at a time ( duplex perception). • Closure: When parts of a sound are masked or occluded, that sound will be perceived as continuous, if there is no direct sensory evidence to indicate that it has been interrupted
1. Introduction • Criticisms: Too simplistic. Whatever cannot be explained through the primitive processes, it is explained by the schema-based processes. • Primitive processes only work in the lab. • Sine-wave replicas of utterances (Remez et al., 1992) • Phonetic principles of organization find a single speech stream, whereas auditory principles find several simultaneous whistles. • Grouping by phonetic rather than by simple auditory coherence.
2. Background: • Previous studies (Meyer and Barry, 1999: Harding and Meyer, 2003) have shown that perceptual streams may be formed also from complex features such as formants in speech, and not just from low-level sounds like tones.
vowel m vowel m Heard as /n/ (Grouped F2 matches that of an /n/) Perception of synthetic /m/ changes to /n/ when preceding by a vowel with a high second formant and no transition between the vowel and consonant. Hear as /m/ (Grouped F2 matches that of an /m/)
3. Experiments (baseline): • The purpose of these studies is to explore how noise (a chirp) affects speech perception. • The stimulus used is a vowel-nasal syllable which is perceived as /en/ if presented in isolation but as /em/ if it is presented with a frequency modulated sine wave in the position where the second formant transition would be expected. • In the three experiments participants categorised the synthetic syllable heard as /em/ or /en/. • Direction, duration, and position of the chirp were the values manipulated.
vowel nasal 2700 2000 Formant frequ. (Hz) 800 375 100200ms 3. Experiments The perception of a nasal /n/ change to /m/ when adding a chirp between the vowel and nasal F2
3. Experiments • Chirp down means that the frequency of its sine wave changes from 2kHz to 1kHz. • Chirp up means that its frequency changes from 1kHz to 2kHz
3. Experiments • Subjects: 7 male and 6 female. • Their task was always the same: After each signal presentation subjects were asked to judge whether the syllable sound more like and /em/ or like and /en/ by pressing the appropriate button on a computer graphics tablet. • There was no transition between the vowel and the nasal. A chirp between the vowel and the nasal F2 was added to the signal.
In 80% of the trials the participants heard the difference between up and down chirp. vowel nasal 100 200ms chirp down chirp up Experiment 1 Baseline/Direction
vowel nasal 100 200ms Experiment 2 Duration
vowel nasal Experiment 3 Position 100 200ms
5. Conclusions: Chirps of 4 ms and 10 ms duration, independent of their direction are apparently integrated into the speech signal and change the percept from /en/ to /em/. For longer duration chirps the direction of it matters but for chirp durations up to 30 ms the majority of stimuli are still heard as /em/. Subjects very clearly hear two objects, so that some scene analysis is taking place since the chirp is not integrated completely into the speech.
Duplex perception with one ear. It seems that listeners can also discriminate the direction motion of the chirp when they focus their attention in the chirp and a more high level of auditory processing takes places (80% accuracy).
The data suggests that a low spectro-temporal analysis of the transitions is carried out because: (1) the spectral structure of the chirp sound is very different from the harmonically structured speech sound, and (2) the direction of the chirp motion has little perceptual significance in the percept. These results complement the auditory scene analysis theoretical framework.
Mr. Background Noise • Do human listeners actively generate representation of background noise to improve speech recognition? • Hypothesis: Recognition performance should be highest if the spectral and temporal structure of interfering noise is regular so that a good noise model can be generated  random noise • Noise prediction.
Experiment 4 & 5 • Same stimuli than before (chirp down) • The amplitude of the chirp vary (5 conditions) • Background noise (down chirps): • Quantity: Lots vs Few • Quality: Random vs Regular • Categorization task 2FC. • Threshold shifts.
Regular condition en en en en en en Irregular condition en en en en en en
Each point in the scatter is the mean threshold over all subjects for a give session. The solid lines show the Boltzmann fit (Eq.(1) for each individual subject in the fifth different conditions. All the fits have the same upper and lower asymptotes.
Exp. 4 rand/reg lots vs. few (t = -3.34, df = 38, p = 0.001). control vs. lots (t = -3.34, df = 38, p = 0.001). No effect between random and regular.
Two aspects change from exp. 4 to 5: • Amplitude scale of the chirps. • The conditions lots now includes 100/20’’ and before 170/20’’.
Exp.5 rand/reg lots vs few (t = 2.27, df = 38, p = 0.05). control vs. lots (t = 3.12, df = 38, p < 0.05). No effect between random and regular.
Conclusions Only the amount of background noise seems to affect the performance of the recognition task. The sequential grouping of the background noise seems an irrelevant cue to improve auditory stream segregation and therefore, speech perception. Counterintuitive phenomenon. Attention must be focused on an object (background noise) for a change in that object to detected (Rensink, et al, 1997)
Lexical (meaning) SPEECH Inlexical (age, gender, emotions of the talker) Acoustical (loudness, pitch, timbre) Change deafness: May occur as a function of (not) attending to the relevant stimulus dimension (Vitevitch, 2003).
Irrelevant sound effect (ISE) (Colle & Welsh, 1976) disrupts in serial recall. • The level of meaning (reverse vs forward speech), predictability of the sequence (random vs regular), and similarity (semantic or phisical) of the IS to the target material, seems to have little impact in the focal task. (Jones et al., 1990). • Changing state: The degree of variability orphisical change within an auditory stream is the primary determinant of the degree of distrupion in the focal task.
One stream Two streams Frequency Frequency Time Time Changing single stream Two unchanging streams
Experiment 6 • Shannon et al. (1999) database of CVC syllables. • Constant white noise + high frequency burst + low frequency burst. • 3 Conditions: Control, background noise 1 stream, background noise 2 streams. • Does auditory stream segregation require attention?
Results More quantity of noise Is there a confound effect? Tukey test p < 0.001
Procedure White noise: 68 dB Burst high 71 dB Burst low 67 dB Speech signal 59-64 dB : 61.5dB
Conclusions • Changing-state stream is more disruptive than two steady-state streams. • Working memory or STM and the phonological loop (Badeley, 1999) • Auditory streaming can occur in the absence of focal attention.
6. Future research: • Streaming and focal attention • Streaming and location (diotic vs dichotic presentation)
References: Bregman, A.S. 1990. Auditory Scene Analysis: The perceptual organisation of sound, MIT Press, Cambridge MA. Harding and Meyer, G. (2003) Changes in the perception of synthetic nasal consonants as a result of vowel formant manipulations. Speech Communication 39, 173-189. Meyer, G. and Barry, W. (1999) Continuity based grouping affects the perception of vowel-nasal syllables. In: Proc. 14thInternational Congress of Phonetic Science. University of California.
F1 B A F2 B A F3 B A Vowel 375 75 60 2000 75 60 2700 75 50 /n/ 250 60 60 2000 60 60 2700 300 60 /m/ 250 60 60 1000 60 60 2700 300 60 Table 1: Klatt synthesizer parameters: Fx: formant frequency (Hz), B: bandwidth (Hz), A: Amplitude (dB). The nasals have nasal formants at 250Hz and nasal zeros at 750 Hz(/m/) and 1000 (/n/).
Spectrogram of the stimulus used in the three experiments, with vowel formant at 375, 2000, and 2700 Hz corresponding to an /e/, and with nasal prototype formants at 250, 2000 and 2700 Hz corresponding to an /n/. There was no transitions between the vowel and the nasal