What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

What are the Essential Cues for Understanding Spoken Language? Steven Greenberg International Computer Science Institute 1947 Center Street, Berkeley, CA 94704 http://www.icsi.berkeley.edu/~steveng steveng@icsi.berkeley.edu

No Scientist is an Island … IMPORTANT COLLEAGUES
ACOUSTIC BASIS OF SPEECH INTELLIGIBILITY: Takayuki Arai, Joy Hollenback, Rosaria Silipo
AUDITORY-VISUAL INTEGRATION FOR SPEECH PROCESSING: Ken Grant
AUTOMATIC SPEECH RECOGNITION AND FEATURE CLASSIFICATION: Shawn Chang, Lokendra Shastri, Mirjam Wester
STATISTICAL ANALYSIS OF PRONUNCIATION VARIATION: Eric Fosler, Leah Hitchcock, Joy Hollenback

Germane Publications
STATISTICAL PROPERTIES OF SPOKEN LANGUAGE AND PRONUNCIATION MODELING
Fosler-Lussier, E., Greenberg, S. and Morgan, N. (1999) Incorporating contextual phonetics into automatic speech recognition. Proceedings of the 14th International Congress of Phonetic Sciences, San Francisco.
Greenberg, S. (1997) On the origins of speech intelligibility in the real world. Proceedings of the ESCA Workshop on Robust Speech Recognition for Unknown Communication Channels, Pont-a-Mousson, France, pp. 23-32.
Greenberg, S. (1999) Speaking in shorthand - A syllable-centric perspective for understanding pronunciation variation. Speech Communication, 29, 159-176.
Greenberg, S. and Fosler-Lussier, E. (2000) The uninvited guest: Information's role in guiding the production of spontaneous speech. Proceedings of the Crest Workshop on Models of Speech Production: Motor Planning and Articulatory Modelling, Kloster Seeon, Germany.
Greenberg, S., Hollenback, J. and Ellis, D. (1996) Insights into spoken language gleaned from phonetic transcription of the Switchboard corpus. Proc. Intern. Conf. Spoken Lang. (ICSLP), Philadelphia, pp. S24-27.
AUTOMATIC PHONETIC TRANSCRIPTION AND ACOUSTIC FEATURE CLASSIFICATION
Chang, S., Greenberg, S. and Wester, M. (2001) An elitist approach to articulatory-acoustic feature classification. 7th European Conference on Speech Communication and Technology (Eurospeech-2001).
Chang, S., Shastri, L. and Greenberg, S. (2000) Automatic phonetic transcription of spontaneous speech (American English). Proceedings of the International Conference on Spoken Language Processing, Beijing.
Shastri, L., Chang, S. and Greenberg, S. (1999) Syllable segmentation using temporal flow model neural networks. Proceedings of the 14th International Congress of Phonetic Sciences, San Francisco.
Wester, M., Greenberg, S. and Chang, S. (2001) A Dutch treatment of an elitist approach to articulatory-acoustic feature classification. 7th European Conference on Speech Communication and Technology (Eurospeech-2001).

Germane Publications
PERCEPTUAL BASES OF SPEECH INTELLIGIBILITY
Arai, T. and Greenberg, S. (1998) Speech intelligibility in the presence of cross-channel spectral asynchrony. IEEE International Conference on Acoustics, Speech and Signal Processing, Seattle, pp. 933-936.
Greenberg, S. and Arai, T. (1998) Speech intelligibility is highly tolerant of cross-channel spectral asynchrony. Proceedings of the Joint Meeting of the Acoustical Society of America and the International Congress on Acoustics, Seattle, pp. 2677-2678.
Greenberg, S. and Arai, T. (2001) The relation between speech intelligibility and the complex modulation spectrum. Submitted to the 7th European Conference on Speech Communication and Technology (Eurospeech-2001).
Greenberg, S., Arai, T. and Silipo, R. (1998) Speech intelligibility derived from exceedingly sparse spectral information. Proceedings of the International Conference on Spoken Language Processing, Sydney, pp. 74-77.
Silipo, R., Greenberg, S. and Arai, T. (1999) Temporal constraints on speech intelligibility as deduced from exceedingly sparse spectral representations. Proceedings of Eurospeech, Budapest.
AUDITORY-VISUAL SPEECH PROCESSING
Grant, K. and Greenberg, S. (2001) Speech intelligibility derived from asynchronous processing of auditory-visual information. Submitted to the ISCA Workshop on Audio-Visual Speech Processing (AVSP-2001).
PROSODIC STRESS ACCENT - AUTOMATIC CLASSIFICATION AND CHARACTERIZATION
Hitchcock, L. and Greenberg, S. (2001) Vowel height is intimately associated with stress-accent in spontaneous American English discourse. Submitted to the 7th European Conference on Speech Communication and Technology (Eurospeech-2001).
Silipo, R. and Greenberg, S. (1999) Automatic transcription of prosodic stress for spontaneous English discourse. Proceedings of the 14th International Congress of Phonetic Sciences, San Francisco.
Silipo, R. and Greenberg, S. (2000) Prosodic stress revisited: Reassessing the role of fundamental frequency. Proceedings of the NIST Speech Transcription Workshop, College Park, MD.
Silipo, R. and Greenberg, S. (2000) Automatic detection of prosodic stress in American English discourse. Technical Report 2000-1, International Computer Science Institute, Berkeley, CA.

PROLOGUE The Central Challenge for Models of Speech Recognition

Language - The Traditional Perspective
• The “classical” view of spoken language posits a quasi-arbitrary relation between the lower and higher tiers of linguistic organization

The Serial Frame Perspective on Speech
• Traditional models of speech recognition assume that the identity of a phonetic segment depends on the detailed spectral profile of the acoustic signal for a given (usually 25-ms) frame of speech

Language - A Syllable-Centric Perspective
• A more empirical perspective on spoken language focuses on the syllable as the interface between “sound” and “meaning”
• Within this framework, the relationship between the syllable and the higher and lower tiers is non-arbitrary and statistically systematic

Lines of Evidence

Take Home Messages
• Segmentation is crucial for understanding spoken language at the level of the phrase, the word, the syllable, and the phonetic segment
• But this linguistic segmentation is inherently “fuzzy”, as is the spectral information associated with each linguistic tier
• The low-frequency (3-25 Hz) modulation spectrum is a crucial acoustic (and possibly visual) parameter associated with intelligibility
• It provides segmentation information that unites the phonetic segment with the syllable (and possibly the word and beyond)
• Many properties of spontaneous spoken language differ from those of laboratory and citation speech
• There are systematic patterns in “real” speech that potentially reveal underlying principles of linguistic organization

The Central Importance of the Modulation Spectrum and the Syllable for Understanding Spoken Language

Effects of Reverberation on the Speech Signal
• Reflections from walls and other surfaces routinely modify the spectro-temporal structure of the speech signal under everyday conditions

Effects of Reverberation on the Speech Signal
• Reflections from walls and other surfaces routinely modify the temporal and modulation spectral properties of the speech signal
• The modulation spectrum’s peak is attenuated and shifted down to ca. 2 Hz
[based on an illustration by Hynek Hermansky]

Modulation Spectrum Computation
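The figure for this slide is not preserved in the text, so the following is only a minimal sketch of one common way to estimate a modulation spectrum, not necessarily the procedure used in the talk. It rectifies the waveform, averages it in short windows to obtain a slow amplitude envelope, and then measures the envelope's Fourier magnitude at a set of low modulation frequencies. The function names, the 100 Hz envelope rate, and the synthetic 4 Hz test signal are all assumptions made for illustration.

```python
import math

def envelope(signal, sample_rate, env_rate=100):
    """Rectify the signal and average it in consecutive windows.

    This acts as a crude low-pass filter plus downsampling to env_rate (Hz).
    """
    hop = int(sample_rate // env_rate)
    return [
        sum(abs(x) for x in signal[i:i + hop]) / hop
        for i in range(0, len(signal) - hop + 1, hop)
    ]

def modulation_spectrum(env, env_rate, freqs):
    """Magnitude of the envelope's Fourier component at each frequency in freqs (Hz)."""
    mean = sum(env) / len(env)
    centered = [e - mean for e in env]  # remove DC so 0 Hz does not dominate
    spectrum = {}
    for f in freqs:
        re = sum(c * math.cos(2 * math.pi * f * n / env_rate) for n, c in enumerate(centered))
        im = sum(c * math.sin(2 * math.pi * f * n / env_rate) for n, c in enumerate(centered))
        spectrum[f] = 2 * math.hypot(re, im) / len(centered)
    return spectrum

# Synthetic stand-in for speech: a 1 kHz carrier whose amplitude fluctuates at 4 Hz.
sample_rate = 8000
duration_s = 2.0
signal = [
    (1 + 0.8 * math.sin(2 * math.pi * 4 * n / sample_rate))
    * math.sin(2 * math.pi * 1000 * n / sample_rate)
    for n in range(int(sample_rate * duration_s))
]

env = envelope(signal, sample_rate)
spectrum = modulation_spectrum(env, 100, range(1, 17))
peak = max(spectrum, key=spectrum.get)  # modulation frequency with the largest magnitude
print("peak modulation frequency (Hz):", peak)
```

For real speech the same two functions would be applied to each frequency band separately, after band-pass filtering, rather than to the full-band waveform as here.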

The Modulation Spectrum Reflects Syllables
• The peak in the distribution of syllable duration is close to the mean, 200 ms (a 200-ms syllable corresponds to a rate of 5 Hz)
• The syllable duration distribution is very close to that of the modulation spectrum, suggesting that the modulation spectrum reflects syllables

The Ability to Understand Speech Under Reverberant Conditions (Spectral Asynchrony)

Spectral Asynchrony - Method
• Output of quarter-octave frequency bands quasi-randomly time-shifted relative to a common reference
• Maximum shift interval ranged between 40 and 240 ms (in 20-ms steps)
• Mean shift interval is half of the maximum interval
• Adjacent channels separated by a minimum of one-quarter of the maximum shift range
• Stimuli: 40 TIMIT sentences, e.g. “She washed his dark suit in greasy dish water all year”
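The shifting scheme described on this slide can be sketched as follows. This is a hedged illustration, not the authors' stimulus-generation code: the rejection-sampling approach, the function names, and the choice of 19 channels are assumptions. Drawing shifts uniformly between zero and the maximum gives a mean of about half the maximum, and a candidate is rejected when it falls within a quarter of the maximum of the neighbouring channel's shift.

```python
import random

def draw_channel_shifts(n_channels, max_shift_ms, seed=None):
    """Draw one quasi-random time shift (in ms) per frequency channel.

    Each shift is uniform on [0, max_shift_ms]; a draw is rejected if it lies
    closer than max_shift_ms / 4 to the previous (adjacent) channel's shift.
    """
    rng = random.Random(seed)
    min_gap = max_shift_ms / 4
    shifts = [rng.uniform(0, max_shift_ms)]
    while len(shifts) < n_channels:
        candidate = rng.uniform(0, max_shift_ms)
        if abs(candidate - shifts[-1]) >= min_gap:
            shifts.append(candidate)
    return shifts

def delay_channel(samples, shift_ms, sample_rate):
    """Delay one channel's waveform by prepending zeros (the output is longer)."""
    return [0.0] * round(shift_ms * sample_rate / 1000) + list(samples)

# Maximum-shift conditions named on the slide: 40, 60, ..., 240 ms.
max_shift_conditions = list(range(40, 241, 20))

# n_channels=19 is an assumed value, used only to make the example concrete.
shifts = draw_channel_shifts(n_channels=19, max_shift_ms=240, seed=0)
print([round(s) for s in shifts])
```

Each band-passed channel would then be passed through `delay_channel` with its own shift before the channels are summed back into a single waveform.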

Spectral Asynchrony - Paradigm
• The magnitude of energy in the 3-6 Hz region of the modulation spectrum is computed for each sub-band (of 4 or 7 channels) as a function of spectral asynchrony
• The modulation spectrum magnitude is relatively unaffected by asynchronies of 80 ms or less (open symbols), but is appreciably diminished for asynchronies of 160 ms or more
• Is intelligibility correlated with the reduction in the 3-6 Hz modulation spectrum?
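A toy calculation can show why desynchronizing channels reduces low-frequency modulation energy in a sub-band. The sketch below assumes, purely for illustration, that every channel in a sub-band carries the same sinusoidal envelope at a single modulation frequency and that the channels differ only in delay; real speech envelopes are neither sinusoidal nor identical across channels, so only the direction of the effect is meaningful. The function names and the 4.5 Hz probe frequency are hypothetical.

```python
import cmath
import math

def summed_modulation_magnitude(delays_ms, mod_freq_hz):
    """Relative magnitude (0 to 1) of the summed envelope at mod_freq_hz.

    Each channel contributes a unit phasor rotated by its delay; a value of 1
    means the channels add fully in phase.
    """
    phasors = [cmath.exp(-2j * math.pi * mod_freq_hz * d / 1000) for d in delays_ms]
    return abs(sum(phasors)) / len(phasors)

def evenly_spread(max_shift_ms, n_channels=4):
    """Delays spaced evenly from 0 to max_shift_ms (a deterministic stand-in)."""
    step = max_shift_ms / (n_channels - 1)
    return [i * step for i in range(n_channels)]

for max_shift in (0, 80, 160):
    magnitude = summed_modulation_magnitude(evenly_spread(max_shift), 4.5)
    print(f"max shift {max_shift:3d} ms -> relative magnitude {magnitude:.2f}")
```

In this simplified model the in-phase sum shrinks as the delays approach a sizeable fraction of the modulation period (about 220 ms at 4.5 Hz), which is the same range over which the slide reports a loss of 3-6 Hz energy.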

Intelligibility and Spectral Asynchrony
• Speech intelligibility does appear to be roughly correlated with the energy in the modulation spectrum between 3 and 6 Hz
• The correlation varies depending on the sub-band and the degree of spectral asynchrony

Spectral Asynchrony - Summary
• Speech is capable of withstanding a high degree of temporal asynchrony across frequency channels
• This form of cross-spectral asynchrony is similar to the effects of many common forms of acoustic reverberation
• Speech intelligibility remains high (>75%) until the maximum asynchrony exceeds 140 ms
• The magnitude of the low-frequency (3-6 Hz) modulation spectrum is highly correlated with speech intelligibility

Understanding Spoken Language Under Very Sparse Spectral Conditions

A Flaw in the Spectral Asynchrony Study
• Of the 448 possible combinations of four slits across the spectrum (one slit in each of the 4 sub-bands), ca. 10% (i.e. 45) exhibit a coefficient of variation of less than 10%
• Thus, the seeming temporal tolerance of the auditory system may be illusory, if listeners can decode the speech signal using information from only a small number of channels distributed across the spectrum
[Figure panels: Intelligibility of spectrally desynchronized speech; Distribution of channel asynchrony]
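The counting argument on this slide can be sketched in code. The sketch reads "coefficient of variation" as the standard deviation of the four slit delays divided by their mean, which is an interpretation rather than something the slide spells out. The per-sub-band channel counts of 4, 4, 4 and 7 are likewise an assumption: they are one layout whose product is 448, not a layout given in the text. The delays themselves are random placeholders, so the printed count illustrates the procedure only and says nothing about the 45 combinations reported.

```python
import itertools
import random
import statistics

def coefficient_of_variation(values):
    """Population standard deviation divided by the mean."""
    return statistics.pstdev(values) / statistics.mean(values)

def count_near_synchronous(subband_delays, threshold=0.10):
    """Count one-slit-per-sub-band combinations whose delays have CV below threshold."""
    combos = list(itertools.product(*subband_delays))
    hits = sum(1 for combo in combos if coefficient_of_variation(combo) < threshold)
    return hits, len(combos)

# Placeholder delays (ms) for an assumed layout of 4, 4, 4 and 7 channels per sub-band.
rng = random.Random(0)
subband_delays = [[rng.uniform(1, 240) for _ in range(n)] for n in (4, 4, 4, 7)]

hits, total = count_near_synchronous(subband_delays)
print(f"{hits} of {total} combinations have a coefficient of variation below 10%")
```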

Spectral Slit Paradigm
• Can listeners decode spoken sentences using just four narrow (1/3 octave) channels (“slits”) distributed across the spectrum?
• The edge of each slit was separated from its nearest neighbor by an octave
• The modulation pattern for each slit differs from that of the others
• The four-slit compound waveform looks very similar to the full-band signal
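The slit geometry described here (1/3-octave slits whose edges are an octave apart) fixes the spacing of the slits once the lowest edge is chosen. The sketch below works that out; the function names are hypothetical and the 300 Hz starting edge is an arbitrary placeholder, not a value taken from the slides.

```python
def slit_edges(lowest_edge_hz, n_slits=4, width_octaves=1/3, gap_octaves=1.0):
    """Return (low, high) edge frequencies in Hz for each slit.

    Each slit spans width_octaves; the upper edge of one slit and the lower
    edge of the next are gap_octaves apart.
    """
    slits = []
    low = lowest_edge_hz
    for _ in range(n_slits):
        high = low * 2 ** width_octaves
        slits.append((low, high))
        low = high * 2 ** gap_octaves
    return slits

def center_frequency(low, high):
    """Geometric mean of the two edges, the usual center of a log-spaced band."""
    return (low * high) ** 0.5

# 300 Hz is a placeholder starting edge chosen only for illustration.
for low, high in slit_edges(300.0):
    print(f"{low:7.0f} - {high:7.0f} Hz (center {center_frequency(low, high):7.0f} Hz)")
```

With this geometry, successive slit centers are 4/3 of an octave apart, a frequency ratio of about 2.52.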

Word Intelligibility - Single Slits
• The intelligibility associated with any single slit is only 2 to 9%
• The mid-frequency slits exhibit somewhat higher intelligibility than the lateral slits

Word Intelligibility - Road Map
1. Intelligibility as a function of the number of slits (from one to four)

Word Intelligibility - 1 Slit

Word Intelligibility - 2 Slits

Word Intelligibility - Road Map
2. Intelligibility for different combinations of two-slit compounds
• The two center slits yield the highest intelligibility

Intelligibility - 2 Slits

Word Intelligibility - Road Map
3. Intelligibility for different combinations of three-slit compounds
• Combinations with one or two center slits yield the highest intelligibility

Word Intelligibility - Road Map
4. Four slits yield nearly (but not quite) perfect intelligibility of ca. 90%
• This maximum level of intelligibility makes it possible to deduce the specific contribution of each slit by itself and in combination with others
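The road map steps through slit compounds of increasing size. Assuming that every subset of the four slits is a candidate listening condition (the slides do not state whether all of them were tested), the conditions can be enumerated as below. The slit labels 1 to 4, ordered from lowest to highest frequency, and the function name are hypothetical.

```python
from itertools import combinations

SLITS = (1, 2, 3, 4)  # labels only, ordered from lowest to highest frequency

def slit_conditions(slits=SLITS):
    """Group every non-empty subset of slits by the number of slits it contains."""
    return {k: list(combinations(slits, k)) for k in range(1, len(slits) + 1)}

conditions = slit_conditions()
for n_slits, compounds in conditions.items():
    print(n_slits, "slit(s):", compounds)
```

Under this labeling, the compound `(2, 3)` is the pair of center slits singled out in step 2 of the road map.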

Spectral Slits - Summary
• A detailed spectro-temporal analysis of the speech signal is not required to understand spoken language
• An exceedingly sparse spectral representation can, under certain circumstances, yield nearly perfect intelligibility

Modulation Spectrum Across Frequency
• The modulation spectrum varies in magnitude across frequency
• The shape of the modulation spectrum is similar for the three lowest slits, but the highest-frequency slit differs from the rest in exhibiting a far greater amount of energy in the mid modulation frequencies

Word Intelligibility - Single Slits
• The intelligibility associated with any single slit ranges between 2 and 9%, suggesting that the shape and magnitude of the modulation spectrum per se are NOT the controlling variables for intelligibility

Spectral Slits - Summary
• A detailed spectro-temporal analysis of the speech signal is not required to understand spoken language
• An exceedingly sparse spectral representation can, under certain circumstances, yield nearly perfect intelligibility
• The magnitude component of the modulation spectrum does not appear to be the controlling variable for intelligibility

The Effect of Desynchronizing Sparse Spectral Information on Speech Intelligibility

Modulation Spectrum Across Frequency
• Desynchronizing the slits by more than 25 ms results in a significant decline in intelligibility

Spectral Slits - Summary
• Even small amounts of asynchrony (>25 ms) imposed on spectral slits can result in significant degradation of intelligibility
• Asynchrony greater than 50 ms has a profound impact on intelligibility
