
Monaural Speech Segregation: Representation, Pitch, and Amplitude Modulation




  1. Monaural Speech Segregation: Representation, Pitch, and Amplitude Modulation • DeLiang Wang, The Ohio State University

  2. Outline of Presentation • Introduction • Speech segregation problem • Auditory scene analysis (ASA) approach • A multistage model for computational ASA • On amplitude modulation and pitch tracking • Oscillatory correlation theory for ASA

  3. Speech Segregation Problem • In a natural environment, target speech is usually corrupted by acoustic interference. An effective system for speech segregation has many applications, such as automatic speech recognition, audio retrieval, and hearing aid design • Most speech separation techniques require multiple sensors • Speech enhancement methods developed for the monaural (single-microphone) situation can deal only with specific types of acoustic interference

  4. Auditory Scene Analysis (Bregman’90) • Listeners are able to parse the complex mixture of sounds arriving at the ears in order to retrieve a mental representation of each sound source • ASA takes place in two conceptual processes: • Segmentation. Decompose the acoustic mixture into sensory elements (segments) • Grouping. Combine segments into groups, so that segments in the same group are likely to have originated from the same environmental source

  5. Auditory Scene Analysis - continued • The grouping process involves two aspects: • Primitive grouping. Innate data-driven mechanisms, consistent with those described by Gestalt psychologists for visual perception (proximity, similarity, common fate, good continuation, etc.) • Schema-driven grouping. Application of learned knowledge about speech, music and other environmental sounds

  6. Computational Auditory Scene Analysis • Computational ASA (CASA) systems approach sound separation based on ASA principles • Weintraub’85, Cooke’93, Brown & Cooke’94, Ellis’96, Wang’96 • Previous CASA work suggests that: • Representation of the auditory scene is a key issue • Temporal continuity is important (although it is ignored in most frame-based sound processing algorithms) • Fundamental frequency (F0) is a strong cue for grouping

  7. A Multi-stage Model (Wang & Brown’99)

  8. Auditory Periphery Model • A bank of fourth-order gammatone filters (Patterson et al.’88) • The Meddis hair cell model converts the gammatone filter output to simulated auditory nerve firing
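
A minimal sketch of this peripheral analysis: a bank of fourth-order gammatone filters with center frequencies spaced uniformly on the ERB-rate scale. The sampling rate, channel count, and frequency range below are illustrative assumptions, and the Meddis hair cell stage is not reproduced here.

```python
import numpy as np

def erb(cf):
    """Equivalent rectangular bandwidth (Glasberg & Moore) at center frequency cf (Hz)."""
    return 24.7 * (4.37 * cf / 1000.0 + 1.0)

def gammatone_ir(cf, fs=16000, duration=0.05, order=4, b=1.019):
    """Impulse response of a fourth-order gammatone filter centered at cf."""
    t = np.arange(0, duration, 1.0 / fs)
    return t ** (order - 1) * np.exp(-2 * np.pi * b * erb(cf) * t) * np.cos(2 * np.pi * cf * t)

def gammatone_filterbank(x, fs=16000, n_channels=128, cf_lo=80.0, cf_hi=5000.0):
    """Filter signal x through n_channels gammatone filters whose center frequencies
    are equally spaced on the ERB-rate scale between cf_lo and cf_hi."""
    erb_rate = lambda f: 21.4 * np.log10(4.37e-3 * f + 1.0)          # ERB-rate scale
    inv_erb_rate = lambda e: (10 ** (e / 21.4) - 1.0) / 4.37e-3      # and its inverse
    cfs = inv_erb_rate(np.linspace(erb_rate(cf_lo), erb_rate(cf_hi), n_channels))
    responses = np.stack([np.convolve(x, gammatone_ir(cf, fs), mode='same') for cf in cfs])
    return cfs, responses
```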

  9. Auditory Periphery - Example • Hair cell response to the utterance “Why were you all weary?” mixed with a telephone ringing • 128 filter channels with center frequencies equally spaced on the ERB-rate scale

  10. Mid-level Auditory Representations • Mid-level representations form the basis for segment formation and subsequent grouping • Correlogram extracts periodicity information from simulated auditory nerve firing patterns • Summary correlogram is used to identify F0 • Cross-correlation between adjacent correlogram channels identifies regions that are excited by the same frequency component or formant
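
A sketch of these two mid-level representations, assuming the simulated auditory nerve output is an array of shape (n_channels, n_samples); the frame length and maximum lag are illustrative values.

```python
import numpy as np

def correlogram(responses, frame_start, frame_len=320, max_lag=200):
    """Running autocorrelation of each filter channel within one time frame.
    responses: (n_channels, n_samples) hair-cell (or envelope) output; the frame
    plus max_lag is assumed to fit inside the signal."""
    frame = responses[:, frame_start:frame_start + frame_len + max_lag]
    acg = np.zeros((responses.shape[0], max_lag))
    for lag in range(max_lag):
        acg[:, lag] = np.sum(frame[:, :frame_len] * frame[:, lag:lag + frame_len], axis=1)
    return acg

def summary_and_cross_channel(acg):
    """Summary correlogram (sum over channels; its dominant peak lag suggests an F0 period)
    and the correlation between normalized autocorrelations of adjacent channels, which is
    high where neighboring channels respond to the same component or formant."""
    summary = acg.sum(axis=0)
    norm = acg - acg.mean(axis=1, keepdims=True)
    norm /= np.linalg.norm(norm, axis=1, keepdims=True) + 1e-12
    cross = np.sum(norm[:-1] * norm[1:], axis=1)   # one value per adjacent channel pair
    return summary, cross
```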

  11. Mid-level Representations - Example • Correlogram and cross-channel correlation for the speech/telephone mixture

  12. Oscillator Network: Segmentation Layer • Horizontal weights are unity, reflecting temporal continuity, and vertical weights are unity if cross-channel correlation exceeds a threshold, otherwise 0 • A global inhibitor ensures that different segments have different phases • A segment thus formed corresponds to acoustic energy in a local time-frequency region that is treated as an atomic component of an auditory scene
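
In the model, segments emerge from oscillator synchronization, but the net effect of this connectivity can be sketched as connected-component labeling over T-F units: links along time are always present, and links across adjacent channels exist where cross-channel correlation exceeds a threshold. The threshold value and the use of SciPy's labeling routine are shortcuts of this sketch, not the network's dynamics.

```python
import numpy as np
from scipy.ndimage import label

def form_segments(cross_channel_corr, theta=0.985):
    """Connected-component shorthand for the segmentation layer.
    cross_channel_corr: (n_channels-1, n_frames) correlation between adjacent channels
    at each frame. T-F units that correlate highly with a neighboring channel are marked
    active; active units connected along time or across adjacent channels form one segment."""
    n_channels = cross_channel_corr.shape[0] + 1
    n_frames = cross_channel_corr.shape[1]
    active = np.zeros((n_channels, n_frames), dtype=bool)
    strong = cross_channel_corr > theta
    active[:-1] |= strong          # unit in the lower channel of a correlated pair
    active[1:] |= strong           # unit in the upper channel of a correlated pair
    # 8-connectivity merges units along time and across adjacent channels
    segments, n_segments = label(active, structure=np.ones((3, 3)))
    return segments, n_segments
```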

  13. Segmentation Layer - Example • Output of the segmentation layer in response to the speech/telephone mixture

  14. Oscillator Network: Grouping Layer • At each time frame, an F0 estimate from the summary correlogram is used to classify channels into two categories: those that are consistent with the F0 and those that are not • Connections are formed between pairs of channels: mutual excitation if the channels belong to the same F0 category, otherwise mutual inhibition • Strong excitation within each segment • The second layer embodies the grouping stage of ASA
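
An algorithmic shorthand for one frame of this classification and grouping; the consistency ratio and the majority-vote rule stand in for the excitatory/inhibitory oscillator connections and are assumptions of the sketch.

```python
import numpy as np

def classify_channels(acg, f0_lag, ratio=0.95):
    """Label each channel at one frame as consistent with the F0 estimate (True) or not.
    acg: (n_channels, max_lag) correlogram for the frame; f0_lag: F0 period in samples.
    A channel is consistent if its autocorrelation at the F0 lag is close to its peak."""
    peak = acg.max(axis=1) + 1e-12
    return acg[:, f0_lag] / peak > ratio

def group_segment(segment_channels, consistency):
    """Assign a whole segment to foreground or background by majority vote of its channels
    at this frame, a stand-in for mutual excitation within segments and inhibition between
    F0-consistent and F0-inconsistent channels."""
    votes = consistency[segment_channels]
    return 'foreground' if votes.mean() > 0.5 else 'background'
```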

  15. Grouping Layer - Example • Two streams emerge from the grouping layer at different times or with different phases • Left: Foreground (original mixture) • Right: Background

  16. Challenges Facing CASA • Previous systems, including the Wang-Brown model, have difficulty in • Dealing with broadband high-frequency mixtures • Performing reliable pitch tracking for noisy speech • Retaining high-frequency energy of the target speaker • Our next step considers perceptual resolvability of various harmonics

  17. Resolved and Unresolved Harmonics • For voiced speech, lower harmonics are resolved while higher harmonics are not • For unresolved harmonics, the envelopes of filter responses fluctuate at the fundamental frequency of speech • Hence we apply different grouping mechanisms for low-frequency and high-frequency signals: • Low-frequency signals are grouped based on periodicity and temporal continuity • High-frequency signals are grouped based on amplitude modulation (AM) and temporal continuity

  18. Proposed System (Hu & Wang'02)

  19. Envelope Representations - Example (a) Correlogram and cross-channel correlation of hair cell response to clean speech (b) Corresponding representations for response envelopes

  20. Initial Segregation • The Wang-Brown model is used in this stage to generate segments and select the target speech stream • Segments generated in this stage tend to reflect resolved harmonics, but not unresolved ones

  21. Pitch Tracking • Pitch periods of target speech are estimated from the segregated speech stream • Estimated pitch periods are checked and re-estimated using two psychoacoustically motivated constraints: • Target pitch should agree with the periodicity of the time-frequency (T-F) units in the initial speech stream • Pitch periods change smoothly, thus allowing for verification and interpolation
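
A sketch of the smoothness-based verification and interpolation step, assuming per-frame pitch period estimates and a boolean reliability mask from the periodicity check against the speech stream; the relative jump tolerance and linear interpolation are illustrative choices.

```python
import numpy as np

def smooth_pitch_track(periods, reliable, max_jump=0.2):
    """Enforce pitch continuity: a frame whose period jumps by more than max_jump
    (relative) from its predecessor is treated as unreliable, then unreliable frames
    are filled by linear interpolation from reliable neighbors.
    periods: per-frame period estimates (samples); reliable: boolean mask per frame."""
    periods = periods.astype(float)
    ok = reliable.copy()
    for t in range(1, len(periods)):
        if ok[t] and ok[t - 1] and abs(periods[t] - periods[t - 1]) > max_jump * periods[t - 1]:
            ok[t] = False                      # violates the smoothness constraint
    idx = np.arange(len(periods))
    if ok.any():
        periods[~ok] = np.interp(idx[~ok], idx[ok], periods[ok])
    return periods
```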

  22. Pitch Tracking - Example (a) Global pitch (Line: pitch track of clean speech) for a mixture of target speech and ‘cocktail-party’ intrusion (b) Estimated target pitch

  23. T-F Unit Labeling • In the low-frequency range: • A T-F unit is labeled by comparing the periodicity of its autocorrelation with the estimated target pitch • In the high-frequency range: • Due to their wide bandwidths, high-frequency filters generally respond to multiple harmonics. These responses are amplitude modulated due to beats and combination tones (Helmholtz, 1863) • A T-F unit in the high-frequency range is labeled by comparing its AM repetition rate with the estimated target pitch
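
Hedged sketches of the two labeling rules, assuming an autocorrelation function per low-frequency unit and an AM repetition rate per high-frequency unit; the threshold values are illustrative, not the ones used in the system.

```python
def label_low_freq_unit(acf, target_period, theta=0.85):
    """Low-frequency T-F unit: labeled as target if its autocorrelation at the estimated
    target pitch period is close to its maximum (ratio above theta).
    acf: autocorrelation over lags; target_period: pitch period in samples."""
    return acf[int(round(target_period))] / (acf.max() + 1e-12) > theta

def label_high_freq_unit(am_rate, target_period, fs=16000, tol=0.1):
    """High-frequency T-F unit: labeled as target if its AM repetition rate (Hz) matches
    the target F0 implied by the pitch period, within a relative tolerance."""
    target_f0 = fs / target_period
    return abs(am_rate - target_f0) / target_f0 < tol
```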

  24. AM - Example (a) The output of a gammatone filter (center frequency: 2.6 kHz) to clean speech (b) The corresponding autocorrelation function

  25. AM Repetition Rates • To obtain AM repetition rates, a filter response is half-wave rectified and bandpass filtered • The resulting signal within a T-F unit is modeled by a single sinusoid using the gradient descent method. The frequency of the sinusoid indicates the AM repetition rate of the corresponding response
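
A sketch of the AM repetition-rate estimate for one T-F unit's response, following the steps above (half-wave rectification, bandpass filtering, single-sinusoid fit). In this sketch the sinusoid's amplitude and phase are solved in closed form and only its frequency is refined by gradient descent with a numerical gradient; the filter design, initialization, learning rate, and iteration count are assumptions, not the system's exact settings.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def am_repetition_rate(response, fs=16000, f0_band=(50.0, 550.0), f_init=150.0,
                       lr=2.0, n_iter=100):
    """Estimate the AM repetition rate (Hz) of one T-F unit's filter response."""
    t = np.arange(len(response)) / fs
    env = np.maximum(response, 0.0)                        # half-wave rectification
    sos = butter(4, f0_band, btype='bandpass', fs=fs, output='sos')
    env = sosfiltfilt(sos, env)                            # keep modulation near plausible F0

    def sq_error(f):
        # Best-fitting sinusoid of frequency f: amplitude/phase via least squares
        basis = np.column_stack([np.sin(2 * np.pi * f * t), np.cos(2 * np.pi * f * t)])
        coef, *_ = np.linalg.lstsq(basis, env, rcond=None)
        return np.sum((env - basis @ coef) ** 2)

    f = f_init
    energy = np.sum(env ** 2) + 1e-12
    for _ in range(n_iter):                                # gradient descent over frequency
        grad = sq_error(f + 0.5) - sq_error(f - 0.5)       # finite-difference gradient (1 Hz span)
        f = float(np.clip(f - lr * grad / energy, *f0_band))
    return f
```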

  26. Final Segregation • New segments corresponding to unresolved harmonics are formed based on temporal continuity and cross-channel correlation of response envelopes (i.e. common AM). Then they are grouped into the foreground stream according to AM repetition rates • The foreground stream is adjusted to remove the segments that do not agree with the estimated target pitch • Other units are grouped according to temporal and spectral continuity

  27. Ideal Binary Mask for Performance Evaluation • Within a T-F unit, the ideal binary mask is 1 if target energy is stronger than interference energy, and 0 otherwise • Motivation: Auditory masking - a stronger signal masks a weaker one within a critical band • Further motivation: Ideal binary masks give excellent listening experience and automatic speech recognition performance • Thus, we suggest using ideal binary masks as ground truth for CASA performance evaluation
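
The ideal binary mask is straightforward to state in code; the local-energy helper below, with its frame length and hop size, is an illustrative assumption.

```python
import numpy as np

def tf_energy(responses, frame_len=320, hop=160):
    """Local energy of each T-F unit from filterbank responses (n_channels, n_samples)."""
    n_frames = 1 + (responses.shape[1] - frame_len) // hop
    return np.stack([np.sum(responses[:, i * hop:i * hop + frame_len] ** 2, axis=1)
                     for i in range(n_frames)], axis=1)

def ideal_binary_mask(target_energy, interference_energy):
    """Ideal (oracle) binary mask: 1 where target energy exceeds interference energy
    within a T-F unit, 0 otherwise; both inputs come from the premixed signals."""
    return (target_energy > interference_energy).astype(np.uint8)
```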

  28. Monaural Speech Segregation Example • Left: Segregated speech stream (original mixture) • Right: Ideal binary mask

  29. Systematic Evaluation • Evaluated on a corpus of 100 mixtures (Cooke’93): 10 voiced utterances x 10 noise intrusions • The noise intrusions cover a wide variety of interference types • A resynthesis stage allows estimation of the target speech waveform • Evaluation is based on ideal binary masks

  30. Signal-to-Noise Ratio (SNR) Results • Average SNR gain: 12.1 dB; average improvement over Wang-Brown: 5 dB • Major improvement occurs in target energy retention, particularly in the high-frequency range
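
For reference, the SNR of a resynthesized segregated waveform measured against the clean target, using the usual energy-ratio definition; this is a generic sketch, not the exact evaluation script.

```python
import numpy as np

def snr_db(clean_target, estimate):
    """SNR in dB of a resynthesized waveform against the clean target:
    10*log10(target energy / error energy). The SNR gain is this value
    minus the SNR of the unprocessed mixture computed the same way."""
    noise = clean_target - estimate
    return 10.0 * np.log10(np.sum(clean_target ** 2) / (np.sum(noise ** 2) + 1e-12))
```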

  31. Segregation Examples • Panels: Mixture, Ideal Binary Mask, Wang-Brown output, New System output

  32. How Does Auditory System Perform ASA? • Information about acoustic features (pitch, spectral shape, interaural differences, AM, FM) is extracted in distributed areas of the auditory system • Binding problem: How are these features combined to form a perceptual whole (stream)? • Hierarchies of feature-detecting cells exist, but do not seem to constitute a solution to the binding problem

  33. Oscillatory Correlation Theory (von der Malsburg & Schneider’86; Wang’96) • Neural oscillators are used to represent auditory features • Oscillators representing features of the same source are synchronized (phase-locked with zero phase lag), and are desynchronized from oscillators representing different sources • Supported by growing experimental evidence, e.g. oscillations in auditory cortex measured by EEG, MEG and local field potentials

  34. Oscillatory Correlation Representation • FD: Feature Detector

  35. Oscillatory Correlation for ASA • LEGION dynamics (Terman & Wang’95) provides a computational foundation for the oscillatory correlation theory • The utility of oscillatory correlation has been demonstrated for speech separation (Wang-Brown’99), modeling auditory attention (Wrigley-Brown’01), etc.
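
A minimal sketch of a single Terman-Wang relaxation oscillator, the building block of LEGION; the parameter values and Euler integration are illustrative, and the coupling term and global inhibitor that produce synchronization within streams and desynchronization between streams are omitted.

```python
import numpy as np

def terman_wang_oscillator(I, T=200.0, dt=0.01, eps=0.02, gamma=6.0, beta=0.1, rho=0.02):
    """Single Terman-Wang relaxation oscillator: x is the fast activity variable,
    y the slow recovery variable. With positive external input I the unit oscillates;
    with nonpositive input it stays silent. Coupling from other oscillators and the
    global inhibitor is omitted in this sketch."""
    n = int(T / dt)
    x, y = -2.0, 0.0
    xs = np.empty(n)
    for i in range(n):
        dx = 3.0 * x - x ** 3 + 2.0 - y + I + rho * np.random.randn()   # cubic nullcline + noise
        dy = eps * (gamma * (1.0 + np.tanh(x / beta)) - y)               # slow sigmoid recovery
        x += dt * dx
        y += dt * dy
        xs[i] = x
    return xs

# Example: xs = terman_wang_oscillator(I=0.2) yields a periodic train of activity pulses.
```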

  36. Issues • Grouping is entirely pitch-based, hence limited to segregating voiced speech • How to group unvoiced speech? • Target pitch tracking in the presence of multiple voiced sources • Role of segmentation • We found increased robustness with segments as an intermediate representation between streams and T-F units

  37. Summary • Multistage ASA approach to monaural speech segregation • Performs substantially better than previous CASA systems • Oscillatory correlation theory for ASA • Key issue is integration of various grouping cues

  38. Collaborators • Recent work with Guoning Hu - Ohio State University • Earlier work with Guy Brown - University of Sheffield
