

A brain’s-eye-view of speech perception
David Poeppel, Cognitive Neuroscience of Language Lab
Department of Linguistics and Department of Biology, Neuroscience and Cognitive Science Program
University of Maryland College Park
Colleagues: Allen Braun, NIH; Greg Hickok, UC Irvine; Jonathan Simon, Univ. Maryland





Presentation Transcript


  1. A brain’s-eye-view of speech perception. David Poeppel, Cognitive Neuroscience of Language Lab, Department of Linguistics and Department of Biology, Neuroscience and Cognitive Science Program, University of Maryland College Park. Colleagues: • Allen Braun, NIH • Greg Hickok, UC Irvine • Jonathan Simon, Univ. Maryland. Students: • Anthony Boemio • Maria Chait • Huan Luo • Virginie van Wassenhove

  2. “chair” “uncomfortable” “lunch” “soon”: encoding? representation? Is this a hard problem? Yes! If it could be solved straightforwardly (e.g. by machine), Mark Liberman would be in Tahiti having cold beers.

  3. Outline. (1) Fractionating the problem in space: towards a functional anatomy of speech perception. (2) Fractionating the problem in time: towards a functional physiology of speech perception - A hypothesis about the quantization of time - Psychophysical evidence for temporal integration - Imaging evidence

  4. interface with lexical items, word recognition

  5. interface with lexical items, word recognition hypothesis about storage: distinctive features [-voice] [+voice] [+voice] [+labial] [+high] [+labial] [-round] [+round] [-round] [….] [….] [….]

  6. production, articulation of speech interface with lexical items, word recognition hypothesis about storage: distinctive features [-voice] [+voice] [+voice] [+labial] [+high] [+labial] [-round] [+round] [-round] [….] [….] [….]

  7. production, articulation of speech hypothesis about production: distinctive features [-voice] [+voice] [+labial] [+high] [….] [….] interface with lexical items, word recognition hypothesis about storage: distinctive features [-voice] [+voice] [+voice] [+labial] [+high] [+labial] [-round] [+round] [-round] [….] [….] [….]

  8. production, articulation of speech FEATURES; analysis of auditory signal → spectro-temporal rep. → FEATURES; interface with lexical items, word recognition FEATURES

  9. Unifying concept: distinctive feature. auditory-motor interface: coordinate transform from acoustic to articulatory space; production, articulation of speech; analysis of auditory signal → spectro-temporal rep. → FEATURES; auditory-lexical interface: interface with lexical items, word recognition

  10. coordinate transform from acoustic to articulatory space; production, articulation of speech; analysis of auditory signal → spectro-temporal rep. → FEATURES; interface with lexical items, word recognition

  11. pIFG/dPM (left) articulatory-based speech codes Area Spt (left) auditory-motor interface STG (bilateral) acoustic-phonetic speech codes pMTG (left) sound-meaning interface Hickok & Poeppel (2000), Trends in Cognitive Sciences Hickok & Poeppel (in press), Cognition

  12. Indefrey & Levelt, in press, Cognition. Meta-analysis of neuroimaging data, perception/production overlap. Shared neural correlates of word production and perception processes: bilateral mid/post STG, L anterior STG, L mid/post MTG, L post IFG. • MTG and IFG overlap when controlling for the overt/covert distinction across tasks • Hypothesized functions: lexical selection (MTG), lexical phonological code retrieval (MTG), post-lexical syllabification (IFG)

  13. Scott & Johnsrude 2003

  14. Possible Subregions of Inferior Frontal Gyrus (Burton, 2001). Auditory studies: Burton et al. (2000), Demonet et al. (1992, 1994), Fiez et al. (1995), Zatorre et al. (1992, 1996). Visual studies: Sergent et al. (1992, 1993), Poldrack et al. (1999), Paulesu et al. (1993, 1996), Shaywitz et al. (1995)

  15. Auditory lexical decision versus (a) FM/sweeps, (b) CP/syllables, and (c) rest. D. Poeppel et al. (in press). Axial slices at z=+6, z=+9, z=+12.

  16. fMRI (yellow blobs) and MEG (red dots) recordings of speech perception show pronounced bilateral activation of left and right temporal cortices T. Roberts & D. Poeppel (in preparation)

  17. Binder et al. 2000

  18. pIFG/dPM (left) articulatory-based speech codes Area Spt (left) auditory-motor interface STG (bilateral) acoustic-phonetic speech codes pMTG (left) sound-meaning interface Hickok & Poeppel (2000), Trends in Cognitive Sciences Hickok & Poeppel (in press), Cognition

  19. Outline. (1) Fractionating the problem in space: towards a functional anatomy of speech perception. (2) Fractionating the problem in time: towards a functional physiology of speech perception - A hypothesis about the quantization of time - Psychophysical evidence for temporal integration - Imaging evidence

  20. The local/global distinction in vision is intuitively clear Chuck Close

  21. What information does the brain extract from speech signals?

  22. Acoustic and articulatory phonetic phenomena occur on different time scales. Phenomena at the scale of formant transitions, subsegmental cues (“short stuff”): order of magnitude 20-50 ms. Phenomena at the scale of syllables, tonality and prosody (“long stuff”): order of magnitude 150-250 ms. [Figure: fine structure versus envelope of the speech signal]

  23. Does different granularity in time matter? Segmental and subsegmental information: serial order in speech (fool/flu, carp/crap, bat/tab). Supra-segmental information: prosody (“Sleep during lecture!” versus “Sleep during lecture?”)

  24. The local/global distinction can be conceptualized as a multi-resolution analysis in time. [Diagram: segmental information (~20-50 ms; features, segments) and supra-segmental information (~200 ms; syllabicity, metrics, tone) feed a binding process and then further processing.]
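The two-timescale idea on this slide can be illustrated with a toy windowing function (a minimal sketch; the 16 kHz sampling rate and the exact window lengths are illustrative assumptions, not parameters from the talk):

```python
# Toy multi-resolution analysis in time: the same signal is chunked into
# short (~25 ms, segmental scale) and long (~200 ms, supra-segmental
# scale) non-overlapping integration windows.

FS = 16_000  # assumed sampling rate in Hz

def window_signal(samples, window_ms):
    """Split a sample sequence into non-overlapping windows of window_ms."""
    n = int(FS * window_ms / 1000)  # window length in samples
    return [samples[i:i + n] for i in range(0, len(samples), n)]

signal = [0.0] * FS  # one second of (silent) audio
short = window_signal(signal, 25)   # segmental-scale windows
long_ = window_signal(signal, 200)  # supra-segmental-scale windows
print(len(short), len(long_))  # 40 5
```

One second of signal yields 40 short windows but only 5 long ones: the two analyses see the same input at very different granularities, which is the sense in which the representation is multi-resolution.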

  25. Outline (1) Fractionating the problem in space: Towards a functional anatomy of speech perception Fractionating the problem in time: Towards a functional physiology of speech perception - A hypothesis about the quantization of time - Psychophysical evidence for temporal integration - Imaging evidence

  26. Temporal integration windows. Psychophysical and electrophysiologic evidence suggests that perceptual information is integrated and analysed in temporal integration windows (v. Bekesy 1933; Stevens and Hall 1966; Näätänen 1992; Theunissen and Miller 1995; etc.). The importance of the concept of a temporal integration window is that it suggests the discontinuous processing of information in the time domain. The CNS, on this view, treats time not as a continuous variable but as a series of temporal windows, and extracts data from a given window. [Figure: arrow of time in physics versus arrow of time in the Central Nervous System]

  27. Asymmetric sampling/quantization of the speech waveform: short temporal integration windows (25 ms) versus long temporal integration windows (200 ms). [Illustration: the phrase “This paper is hard to publish” chunked at the two window sizes.]

  28. Two spectrograms of the same word illustrate how different analysis windows highlight different aspects of the sounds. (a) High time, low frequency resolution: each glottal pulse visible as a vertical striation. (b) Low time, high frequency resolution: each harmonic visible as a horizontal stripe.
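The tradeoff the two spectrograms illustrate follows directly from the analysis window length: for an N-sample window at sampling rate fs, frequency bin width is fs/N while time resolution is N/fs, so one cannot be improved without degrading the other. A quick numeric check (the 16 kHz rate and window lengths are assumed values for illustration):

```python
# Time/frequency resolution tradeoff of a windowed (STFT-style) analysis:
# a short window resolves events in time (glottal pulses), a long window
# resolves frequency (individual harmonics).

FS = 16_000  # assumed sampling rate in Hz

def resolution(n_samples):
    """Return (time resolution in ms, frequency bin width in Hz)."""
    return 1000 * n_samples / FS, FS / n_samples

short_t, short_f = resolution(64)    # ~4 ms window
long_t, long_f = resolution(1024)    # ~64 ms window
print(f"short window: {short_t} ms, {short_f} Hz per bin")  # 4.0 ms, 250.0 Hz
print(f"long window:  {long_t} ms, {long_f} Hz per bin")    # 64.0 ms, 15.625 Hz
```

A 250 Hz bin cannot separate adjacent voice harmonics (spaced at the ~100-200 Hz fundamental), while a 15.6 Hz bin can, matching panels (a) and (b) above.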

  29. Hypothesis: Asymmetric Sampling in Time (AST). Left temporal cortical areas preferentially extract information over 25ms temporal integration windows. Right hemisphere areas preferentially integrate over long, 150-250ms integration windows. By assumption, the auditory input signal has a neural representation that is bilaterally symmetric (e.g. at the level of core); beyond the initial representation, the signal is elaborated asymmetrically in the time domain. Another way to conceptualize the AST proposal is to say that the sampling rates of non-primary auditory areas differ, with LH sampling at high frequencies (~40Hz) and RH sampling at low frequencies (4-10Hz).
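The correspondence the AST hypothesis draws between window length and sampling rate is just the reciprocal relation f = 1/T; the slide's numbers can be checked directly:

```python
# Integration window length <-> associated oscillatory frequency, f = 1/T.
# AST pairs ~25 ms windows with ~40 Hz (gamma-range) sampling and
# ~200-250 ms windows with ~4-5 Hz (theta-range) sampling.

def window_to_hz(window_ms):
    return 1000.0 / window_ms

print(window_to_hz(25))   # 40.0 Hz (left-hemisphere short windows)
print(window_to_hz(250))  # 4.0 Hz  (right-hemisphere long windows)
```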

  30. a. Physiological lateralization: symmetric representation of spectro-temporal receptive fields in primary auditory cortex; temporally asymmetric elaboration of perceptual representations in non-primary cortex. b. Functional lateralization: LH, analyses requiring high temporal resolution (e.g. formant transitions); RH, analyses requiring high spectral resolution (e.g. intonation contours). [Figure axes: proportion of neuronal ensembles versus size of temporal integration window (25-250 ms) and associated oscillatory frequency (40-4 Hz), plotted for LH and RH.]

  31. Asymmetric sampling in time (AST) characteristics • AST is an example of functional segregation, a standard concept. • AST is an example of multi-resolution analysis, a signal processing strategy common in other cortical domains (cf. visual areas MT and V4 which, among other differences, have phasic versus tonic firing properties, respectively). • AST speaks to the “granularity” of perceptual representations: the model suggests that there exist basic perceptual representations that correspond to the different temporal windows (e.g. featural information and the syllabic envelope are equally basic, on this view). • The AST model connects in plausible ways to the local versus global distinction: there are multiple representations of a given signal on different scales (cf. wavelets). Global ==> ‘large-chunk’ analysis, e.g., syllabic level. Local ==> ‘small-chunk’ analysis, e.g., subsegmental level.

  32. a. Physiological lateralization: symmetric representation of spectro-temporal receptive fields in primary auditory cortex; temporally asymmetric elaboration of perceptual representations in non-primary cortex. b. Functional lateralization: LH, analyses requiring high temporal resolution (e.g. formant transitions); RH, analyses requiring high spectral resolution (e.g. intonation contours). [Figure axes: proportion of neuronal ensembles versus size of temporal integration window (25-250 ms) and associated oscillatory frequency (40-4 Hz), plotted for LH and RH.]

  33. Outline. (1) Fractionating the problem in space: towards a functional anatomy of speech perception. (2) Fractionating the problem in time: towards a functional physiology of speech perception - A hypothesis about the quantization of time: the AST model - Psychophysical evidence for temporal integration - Imaging evidence

  34. Perception of FM sweeps. Huan Luo, Mike Gordon, Anthony Boemio, David Poeppel

  35. FM Sweep Example: waveform and spectrogram of an 80 msec linear FM sweep from 3 to 2 kHz.
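A linear FM sweep like this slide's example can be synthesized by integrating a linearly changing instantaneous frequency into the phase: for f(t) = f0 + (f1 - f0)t/T, the phase is φ(t) = 2π(f0·t + (f1 - f0)t²/(2T)). A minimal sketch (the 16 kHz sampling rate is an assumed value):

```python
import math

# Synthesize an 80 ms linear FM sweep from 3 kHz down to 2 kHz, as on
# the slide. Instantaneous frequency f(t) = F0 + (F1 - F0) * t / T, so
# phase(t) = 2*pi * (F0*t + (F1 - F0) * t**2 / (2*T)).

FS = 16_000                     # assumed sampling rate (Hz)
F0, F1, T = 3000.0, 2000.0, 0.080

def fm_sweep():
    n = int(FS * T)             # 1280 samples for 80 ms
    out = []
    for i in range(n):
        t = i / FS
        phase = 2 * math.pi * (F0 * t + (F1 - F0) * t * t / (2 * T))
        out.append(math.sin(phase))
    return out

samples = fm_sweep()
print(len(samples))  # 1280
```

Differentiating the phase confirms the endpoints: f(0) = 3000 Hz and f(T) = 2000 Hz, i.e. a downward sweep.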

  36. The rationale • Formant transitions are important cues for speech perception (for example, F2 direction can distinguish /ba/ from /da/) • Importance of FM in tone languages • The vertebrate auditory system is well equipped to analyze FM signals.

  37. Tone languages • For example, Chinese, Thai… • The direction of FM (of the fundamental frequency) is important in the language to make lexical distinctions. • (Four tones in Chinese) /Ma 1/, /Ma 2/ , /Ma 3/, /Ma 4/

  38. Questions • How good are we at discriminating these signals? Determine the threshold stimulus duration (corresponding to rate) for detection of FM direction. Is there any performance difference between UP and DOWN detection? • Will language experience affect performance on such a basic perceptual ability?

  39. Stimuli • Linearly frequency modulated • Frequency range studied: 2-3 kHz (0.5 oct) • Two directions (up/down) • FM rate (frequency span/duration) varied by changing duration; for each frequency range, the frequency span is kept constant (slow/fast) • Stimulus duration: from 5 msec (100 oct/sec) to 640 msec (0.8 oct/sec). Tasks • Detection and discrimination of UP versus DOWN • 2AFC, 2IFC, 3IFC
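The slide's rate figures follow from rate = frequency span (octaves) / duration (seconds); checking the two endpoints of the duration range:

```python
# FM rate in oct/sec = frequency span in octaves / duration in seconds.
# (The slide treats 2-3 kHz as 0.5 oct; strictly log2(3/2) ~ 0.585 oct.)

def fm_rate(span_oct, duration_ms):
    return span_oct / (duration_ms / 1000.0)

print(fm_rate(0.5, 5))    # 100.0 oct/sec  (5 ms, fastest stimulus)
print(fm_rate(0.5, 640))  # 0.78125 oct/sec, ~0.8 as on the slide
```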

  40. English speakers • 3 frequency ranges relevant to speech (approximately the F1, F2, F3 ranges: 600-900 Hz, 1-1.5 kHz, 2-3 kHz) • single-interval 2-AFC • Two main findings: threshold for UP at 20 ms; UP better than DOWN. Gordon & Poeppel (2001), JASA-ARLO

  41. 2IFC • Eliminates the response-bias strategies available to subjects • Tests whether the asymmetric performance of the English subjects is due to an “UP preference bias”. The two sounds in a trial have the same duration, so the only difference is direction (Interval 1: UP; Interval 2: DOWN). Task: which interval (1 or 2) contains the specified direction?
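The bias-control logic of the 2IFC design can be sketched as a trial generator: the target direction falls in a random interval, so a fixed response strategy (always "interval 1", or always guessing UP) scores at chance. A minimal sketch; the names are illustrative, not from the study:

```python
import random

# Minimal 2IFC trial: one interval holds the UP sweep, the other the
# DOWN sweep (same duration), and the subject reports which interval
# held UP. A fixed "always interval 1" strategy performs at chance.

def make_trial(rng):
    """Return the interval (1 or 2) that carries the UP sweep."""
    return rng.choice([1, 2])

rng = random.Random(0)
trials = [make_trial(rng) for _ in range(10_000)]
always_one = sum(1 for t in trials if t == 1) / len(trials)
print(round(always_one, 2))  # close to 0.5: the bias strategy is at chance
```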

  42. Results for Chinese subjects: no significant difference between UP and DOWN; the threshold for both is about 20 msec.

  43. Results for English subjects: no difference now between UP and DOWN; threshold for both at 20 msec. No difference now between Chinese and English subjects.

  44. 3IFC. Standard: UP; Interval 1: UP; Interval 2: DOWN. Task: choose which interval contains the DIFFERENT sound among the three (a difference in quality rather than only in direction).

  45. 3IFC versus 2IFC: no difference between Chinese and English subjects; threshold confirmed at 20 ms.

  46. Conclusion • Importance of 20 msec as the threshold for discrimination of FM sweeps - corresponds to the temporal order threshold determined by Hirsh (1959) - consistent with Schouten (1985, 1989) testing FM sweeps - this basic threshold arguably reflects the shortest integration window that generates robust auditory percepts.

  47. Click trains Anthony Boemio & David Poeppel

  48. Click Stimuli

  49. Psychophysics
