
Prosodic Constraints for Robust Speech Recognition

Presentation Transcript


  1. Mark Hasegawa-Johnson, University of Illinois at Urbana-Champaign. Prosodic Constraints for Robust Speech Recognition.

  2. Goals • Disambiguate sentences with similar phonemic content. • Create speech recognition algorithms that will fail less often in noisy environments. Example: “The nurse brought a big Ernie doll.” vs. “The nurse brought a bigger needle.”

  3. What is Prosody? Why is Prosody Useful? Why is Prosody Ignored by ASR? What Can We Do About It? 1. The Normalization Problem. 2. The Search Problem.

  4. What is Prosody? • Lexical Stress (Phonological): lexical stress is marked in the dictionary. Perceptual correlate: the stressed syllable may receive prominence. • Phrasing and Prominence (Perceptual): phrasing and prominence are controlled by the speaker to suggest the correct syntactic and pragmatic parse of a sentence. Acoustic correlates: pitch, duration, glottalization, energy, and spectral envelope.

  5. What is Prosody? • Prosody is a system of constraints: syntax and semantics constrain p(w2 | w1); prosody constrains p(O | W). • Prosody is hierarchical and non-local: phrase-final lengthening and phrase-initial glottalization increase with boundary depth, and the location of prominences is constrained by phrase structure.
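In terms of the standard decoding rule, the two kinds of constraint enter as separate factors. The bigram decomposition below is written out only to make the slide's statement concrete; it is implied, not shown, in the transcript:

```latex
\hat{W} \;=\; \arg\max_{W}\;
  \underbrace{\prod_{i} p(w_i \mid w_{i-1})}_{\text{syntactic/semantic constraints}}
  \;\; \underbrace{p(O \mid W)}_{\text{shaped by prosodic constraints}}
```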

  6. Why is Prosody Useful? 1. Humans are extremely sensitive to prosody: infants use prosody to learn new vocabulary. 2. Prosody is audible in noise: its acoustic correlates (energy, F0) are concentrated at low frequencies. 3. Prosody disambiguates confusable words. Experiment: destroy all fine phonetic information, keeping only 6 manner classes; average cohort size = 5.0 (std = 19.6, max = 538). Keep the manner classes plus lexical stress; average cohort size = 3.4 (std = 11.6, max = 333).
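The cohort-size experiment can be reproduced on any pronunciation dictionary by collapsing phones into manner classes and counting how many words share each reduced transcription. A minimal sketch, assuming a CMUdict-style lexicon (word → ARPABET phone list) and a hypothetical 6-way manner mapping; the exact class inventory and lexicon used in the original experiment are not given in the transcript:

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical 6 manner classes, for illustration only.
MANNER = {
    "vowel":     "AA AE AH AO AW AY EH ER EY IH IY OW OY UH UW".split(),
    "stop":      "B D G K P T".split(),
    "fricative": "DH F HH S SH TH V Z ZH".split(),
    "affricate": "CH JH".split(),
    "nasal":     "M N NG".split(),
    "glide":     "L R W Y".split(),
}
PHONE_TO_MANNER = {p: m for m, phones in MANNER.items() for p in phones}

def cohort_sizes(lexicon, keep_stress=False):
    """Map each word to its manner-class string (optionally keeping the ARPABET
    stress digit on vowels) and report per-word cohort-size statistics."""
    cohorts = defaultdict(set)
    for word, phones in lexicon.items():
        key = []
        for ph in phones:
            stress = ph[-1] if ph[-1].isdigit() else ""
            base = ph.rstrip("012")
            key.append(PHONE_TO_MANNER[base] + (stress if keep_stress else ""))
        cohorts[tuple(key)].add(word)
    sizes = [len(words) for words in cohorts.values() for _ in words]
    return mean(sizes), pstdev(sizes), max(sizes)

# Usage (toy lexicon):
# lexicon = {"NURSE": ["N", "ER1", "S"], "NEEDLE": ["N", "IY1", "D", "AH0", "L"]}
# print(cohort_sizes(lexicon))                    # manner classes only
# print(cohort_sizes(lexicon, keep_stress=True))  # manner classes + lexical stress
```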

  7. Why is Prosody Ignored by ASR? 1. The normalization problem: acoustic features must be normalized, and the normalization algorithms are unknown. 2. The search problem: prosodic constraints are non-local, and are therefore difficult to use in an efficient search algorithm.

  8. 1. The Normalization Problem • For F0, duration, energy, glottalization, and spectral envelope: {influence of speaker and phoneme} >> {influence of prominence}. • Normalization: explicit or implicit? One-pass or multi-pass?

  9. Background: An Algorithm for Synthesis of F0 Contours
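The body of this slide (presumably a figure) did not survive in the transcript. For context, a common synthesis algorithm of this kind is a superpositional command-response model in the style of Fujisaki, in which log F0 is the sum of a base value, slowly decaying phrase components, and shorter accent components. A minimal sketch, assuming that style of model; the specific algorithm shown on the slide is not identified here:

```python
import numpy as np

def superpositional_f0(t, fb=120.0, phrases=(), accents=(), alpha=3.0, beta=20.0):
    """ln F0(t) = ln Fb + phrase-command responses + accent-command responses.
    phrases: iterable of (onset_time, amplitude)
    accents: iterable of (onset_time, offset_time, amplitude)
    alpha, beta: time constants of the phrase and accent filters."""
    log_f0 = np.full_like(t, np.log(fb), dtype=float)

    def phrase_response(x):            # impulse response of the phrase filter
        x = np.maximum(x, 0.0)         # causal: zero before the command
        return alpha**2 * x * np.exp(-alpha * x)

    def accent_response(x):            # step response of the accent filter
        x = np.maximum(x, 0.0)
        return 1.0 - (1.0 + beta * x) * np.exp(-beta * x)

    for t0, ap in phrases:
        log_f0 += ap * phrase_response(t - t0)
    for t1, t2, aa in accents:
        log_f0 += aa * (accent_response(t - t1) - accent_response(t - t2))
    return np.exp(log_f0)

# Example: one phrase command at t=0 s and one accent spanning 0.4-0.7 s.
t = np.linspace(0.0, 2.0, 200)
f0 = superpositional_f0(t, phrases=[(0.0, 0.5)], accents=[(0.4, 0.7, 0.4)])
```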

  10. Explicit Normalization based on Speech Synthesis Algorithm

  11. Explicit One-Pass Normalization: Synthetic Example

  12. Implicit, Multi-Pass Normalization • Parse-independent observation PDF • Parse-dependent observation PDF

  13. Summary: Normalization 1. Parse-dependent vs. parse-independent: the information sources are parse-dependent, but a parse-dependent model requires a multi-pass search. 2. Explicit vs. implicit normalization: explicit normalization designs normalization weights to filter “signal” from “noise”; implicit normalization lets the observation PDF include current and previous values of the cue, so that “normalization” is learned during training.
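One way to make the contrast concrete: explicit normalization removes speaker- and phone-dependent baselines from the raw cue before scoring, while implicit normalization simply stacks current and previous cue values into the observation vector and lets the trained PDF absorb the baseline. A minimal sketch, assuming per-speaker and per-phone mean tables; all names here are illustrative, not from the original slides:

```python
import numpy as np

def explicit_normalize(f0, dur, speaker_f0_mean, phone_dur_mean):
    """Explicit normalization: remove speaker and phoneme influence up front,
    leaving (ideally) only the prominence-related residual."""
    norm_f0 = np.log(f0) - np.log(speaker_f0_mean)    # speaker-relative log-F0
    norm_dur = np.log(dur) - np.log(phone_dur_mean)   # phone-relative log-duration
    return np.array([norm_f0, norm_dur])

def implicit_observation(curr_cues, prev_cues):
    """Implicit normalization: no hand-designed weights; the observation vector
    contains current and previous cue values, so a PDF trained on it can learn
    how to 'normalize' (e.g., by modelling their difference)."""
    return np.concatenate([curr_cues, prev_cues])
```

In the implicit case, a Gaussian or GMM trained on the stacked vectors can learn, for example, that prominent syllables tend to have higher F0 than the preceding syllable, without any hand-set normalization weights.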

  14. 2. The Search Problem: Viterbi Beam Search • N-gram grammar. Complexity: |V|^(N-1) word models. • N-gram grammar governed by a semantic hierarchy with M_S equivalence classes. Complexity: M_S |V|^(N-1). • Two hierarchies (prosodic + semantic). Complexity: M_S M_P |V|^(N-1). Training: M_P acoustic models of each word?!
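To make the growth concrete, here is a small worked example; the vocabulary size, class counts, and N below are hypothetical, not taken from the talk:

```python
# Hypothetical sizes, chosen only to illustrate how the search space grows.
V = 20_000      # vocabulary size
N = 3           # trigram grammar
M_S = 10        # semantic equivalence classes
M_P = 4         # prosodic equivalence classes

plain    = V ** (N - 1)               # |V|^(N-1)         = 4.0e8 grammar contexts
semantic = M_S * V ** (N - 1)         # M_S |V|^(N-1)     = 4.0e9
prosodic = M_S * M_P * V ** (N - 1)   # M_S M_P |V|^(N-1) = 1.6e10
print(plain, semantic, prosodic)
```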

  15. Search Solution #1: Lexically Stressed Vowel Models Acoustic Model • Phone-Based HMM. • Dictionary entry specifies “stressed” or “unstressed” vowel model. Objectives • Focus on just one level of the prosodic hierarchy. • Provide a testbed for studying the perception and acoustics of stress and rhythm.
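In practice this only requires splitting each vowel symbol in the pronunciation dictionary into a stressed and an unstressed variant, so the decoder's search network is unchanged. A minimal sketch of such a dictionary transformation, assuming CMUdict-style ARPABET entries with stress digits; the actual lexicon and phone set used are not stated in the transcript:

```python
def split_vowels_by_stress(lexicon):
    """Return a lexicon in which each vowel maps to a 'stressed' (primary stress)
    or 'unstressed' variant, doubling the number of vowel models but leaving
    consonant models and the search network untouched."""
    new_lex = {}
    for word, phones in lexicon.items():
        out = []
        for ph in phones:
            if ph[-1].isdigit():                  # ARPABET vowels carry 0/1/2
                base = ph.rstrip("012")
                out.append(base + ("_S" if ph.endswith("1") else "_U"))
            else:
                out.append(ph)                    # consonants unchanged
        new_lex[word] = out
    return new_lex

# Example: "bigger" vs. "Ernie" get different vowel models for ER.
lex = {"BIGGER": ["B", "IH1", "G", "ER0"], "ERNIE": ["ER1", "N", "IY0"]}
print(split_vowels_by_stress(lex))
# {'BIGGER': ['B', 'IH_S', 'G', 'ER_U'], 'ERNIE': ['ER_S', 'N', 'IY_U']}
```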

  16. Search Solution #1: Lexically Stressed Vowel Models Advantages • Search Complexity is Not Increased. • Training Complexity Minimally Increased (# of Vowel Models is Doubled). Expected Efficacy • van Kuijk & Boves, 1999: stressed/unstressed classification up to 70% correct. • More information increases word recognition scores (even if just a little).

  17. Search Solution #2: Perceptually Prominent Vowel Models Acoustic Model • 2 Models of every word: prominent, not prominent. • Train using e.g. Radio News corpus. Expected Efficacy • 2 Models of every word ==> Uncertainty increases by 1 bit. • Overlapping PDFs ==> Information increases by <1 bit. • Recognition performance declines.
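The information-theoretic argument can be checked numerically: adding a binary prominence label to every word raises the entropy of the hypothesis space by exactly log2(2) = 1 bit, but an acoustic cue whose prominent and non-prominent PDFs overlap recovers strictly less than 1 bit. A minimal sketch with two overlapping 1-D Gaussians; the means and variances below are made up for illustration:

```python
import numpy as np

def mutual_info_bits(mu0, mu1, sigma, n=20001, span=10.0):
    """I(label; cue) in bits for equal-prior classes N(mu0, sigma) and N(mu1, sigma),
    computed by numerical integration: I = h(mixture) - 0.5*h(p0) - 0.5*h(p1)."""
    x = np.linspace(min(mu0, mu1) - span, max(mu0, mu1) + span, n)
    dx = x[1] - x[0]
    p0 = np.exp(-0.5 * ((x - mu0) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    p1 = np.exp(-0.5 * ((x - mu1) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    mix = 0.5 * (p0 + p1)

    def h(p):                                  # differential entropy in bits
        p = np.clip(p, 1e-300, None)
        return -np.sum(p * np.log2(p)) * dx

    return h(mix) - 0.5 * h(p0) - 0.5 * h(p1)

print(mutual_info_bits(0.0, 1.0, 1.0))   # heavily overlapping: well under 1 bit
print(mutual_info_bits(0.0, 8.0, 1.0))   # nearly separated: approaches 1 bit
```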

  18. Search Solution #3: Start-Synchronous A* Search (Renals & Hochberg, 1999) 1. ACOUSTIC PRUNING: use beam search to find all words w_t^u starting at time t and ending at time u such that p(O_t^u | w_t^u) > threshold. 2. LINGUISTIC PRUNING: create word strings W^u = [W^(t-1), w_t^u], keeping those with p(W^u, O_1^u) = p(W^(t-1), O_1^(t-1)) p(O_t^u | w_t^u) p(w_t^u | W^(t-1)) > threshold.

  19. Start-Synchronous Search with Prosodic Model • The stack entry for the search algorithm is [W^u, F^u] = [W^(t-1), F^(t-1), w_t^u, f_t^u]. • f_t^u contains F0, duration, energy, etc. • p(W^u, F^u, O_1^u) = p(W^(t-1), F^(t-1), O_1^(t-1)) (history) × p(w_t^u | W^(t-1)) (word-order model) × p(f_t^u | W^(t-1), w_t^u, F^(t-1)) (prosodic model) × p(O_t^u | w_t^u) (local acoustic model).
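The scoring recursion on this slide maps directly onto a stack (priority-queue) decoder: each partial hypothesis carries its word string, its prosodic-feature history, and its log score, and extending it by one word adds the language-model, prosodic-model, and acoustic log probabilities. A minimal sketch in that spirit, with the three models left as stubs; their names, signatures, and the pruning details are illustrative, not from Renals & Hochberg or the original slides, and a full A* search would also add a heuristic estimate of the remaining-utterance score, omitted here for brevity:

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Hypothesis:
    neg_log_p: float                     # heapq is a min-heap, so store -log p
    end_time: int = field(compare=False)
    words: tuple = field(compare=False)  # W^(t-1): word string so far
    pros: tuple = field(compare=False)   # F^(t-1): prosodic features so far

def stack_decode(T, propose_words, log_p_lm, log_p_prosody, log_p_acoustic,
                 beam=1000):
    """Start-synchronous stack search with a prosodic term.
    propose_words(t) yields (word, u, f): a word spanning frames t..u (u > t)
    with prosodic features f, i.e. the survivors of acoustic pruning."""
    stack = [Hypothesis(0.0, 0, (), ())]
    complete = []
    while stack:
        hyp = heapq.heappop(stack)
        if hyp.end_time == T:            # hypothesis spans the whole utterance
            complete.append(hyp)
            continue
        t = hyp.end_time
        for word, u, f in propose_words(t):
            score = (-hyp.neg_log_p
                     + log_p_lm(word, hyp.words)              # p(w_t^u | W^(t-1))
                     + log_p_prosody(f, word, hyp.words, hyp.pros)
                                          # p(f_t^u | W^(t-1), w_t^u, F^(t-1))
                     + log_p_acoustic(word, t, u))             # p(O_t^u | w_t^u)
            heapq.heappush(stack, Hypothesis(-score, u,
                                             hyp.words + (word,),
                                             hyp.pros + (f,)))
        if len(stack) > beam:            # crude linguistic pruning
            stack = heapq.nsmallest(beam, stack)
            heapq.heapify(stack)
    return max(complete, key=lambda h: -h.neg_log_p) if complete else None
```

Because only the prosodic term consults the full word and feature history, the acoustic and language models themselves are unchanged, which is the point made on the next slide.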

  20. Start-Synchronous Search with Prosodic Model Advantages • p(O_t^u | w_t^u) and p(w_t^u | W^(t-1)) unchanged ==> training complexity unchanged, search complexity (nearly) unchanged. • p(f_t^u | W^u, F^(t-1)) gives a “fine-tuned” ranking of candidate word strings W^u at each time u. Research Issues • Does it work?

  21. Conclusions Why Use Prosody? • Humans use it. • Possible improved recognition in noise. How Can We Use Prosody? • Normalization: bottom-up (one-step) or parse-dependent (multi-step). • Lexically stressed vowel models p(O_t^u | w_t^u). • Explicit prosody model p(f_t^u | W^u, F^(t-1)) can be part of a start-synchronous A* search.
