
Prosodic Constraints for Robust Speech Recognition

Presentation Transcript


  1. Mark Hasegawa-Johnson, University of Illinois at Urbana-Champaign. Prosodic Constraints for Robust Speech Recognition.

  2. Goals • Disambiguate sentences with similar phonemic content. • Create speech recognition algorithms that will fail less often in noisy environments. Example: “The nurse brought a big Ernie doll.” vs. “The nurse brought a bigger needle.”

  3. What is Prosody? Why is Prosody Useful? Why is Prosody Ignored by ASR? What Can We Do About It? 1. The Normalization Problem. 2. The Search Problem.

  4. What is Prosody? • Lexical Stress (Phonological): lexical stress is marked in the dictionary. Perceptual correlate: the stressed syllable may receive prominence. • Phrasing and Prominence (Perceptual): phrasing and prominence are controlled by the speaker to suggest the correct syntactic and pragmatic parse of a sentence. Acoustic correlates: pitch, duration, glottalization, energy, and spectral envelope.

  5. What is Prosody? • Prosody is a system of constraints: syntax and semantics constrain p(w2 | w1); prosody constrains p(O | W). • Prosody is hierarchical and non-local: phrase-final lengthening and phrase-initial glottalization increase with boundary depth, and the location of prominences is constrained by phrase structure.
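In terms of the standard decoding rule, the two kinds of constraint enter as separate factors. The bigram decomposition below is written out only to make the slide's statement concrete; it is implied, not shown, in the transcript:

```latex
\hat{W} \;=\; \arg\max_{W}\;
  \underbrace{\prod_{i} p(w_i \mid w_{i-1})}_{\text{syntactic/semantic constraints}}
  \;\; \underbrace{p(O \mid W)}_{\text{shaped by prosodic constraints}}
```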

  6. Why is Prosody Useful? 1. Humans are extremely sensitive to prosody: infants use prosody to learn new vocabulary. 2. Prosody is audible in noise: its acoustic correlates (energy, F0) are concentrated at low frequencies. 3. Prosody disambiguates confusable words. Experiment: destroy all fine phonetic information, keeping only 6 manner classes; average cohort size = 5.0 (std = 19.6, max = 538). Keep the manner classes plus lexical stress; average cohort size = 3.4 (std = 11.6, max = 333).
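The cohort-size experiment can be reproduced on any pronunciation dictionary by collapsing phones into manner classes and counting how many words share each reduced transcription. A minimal sketch, assuming a CMUdict-style lexicon (word → ARPABET phone list) and a hypothetical 6-way manner mapping; the exact class inventory and lexicon used in the original experiment are not given in the transcript:

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical 6 manner classes, for illustration only.
MANNER = {
    "vowel":     "AA AE AH AO AW AY EH ER EY IH IY OW OY UH UW".split(),
    "stop":      "B D G K P T".split(),
    "fricative": "DH F HH S SH TH V Z ZH".split(),
    "affricate": "CH JH".split(),
    "nasal":     "M N NG".split(),
    "glide":     "L R W Y".split(),
}
PHONE_TO_MANNER = {p: m for m, phones in MANNER.items() for p in phones}

def cohort_sizes(lexicon, keep_stress=False):
    """Map each word to its manner-class string (optionally keeping the ARPABET
    stress digit on vowels) and report per-word cohort-size statistics."""
    cohorts = defaultdict(set)
    for word, phones in lexicon.items():
        key = []
        for ph in phones:
            stress = ph[-1] if ph[-1].isdigit() else ""
            base = ph.rstrip("012")
            key.append(PHONE_TO_MANNER[base] + (stress if keep_stress else ""))
        cohorts[tuple(key)].add(word)
    sizes = [len(words) for words in cohorts.values() for _ in words]
    return mean(sizes), pstdev(sizes), max(sizes)

# Usage (toy lexicon):
# lexicon = {"NURSE": ["N", "ER1", "S"], "NEEDLE": ["N", "IY1", "D", "AH0", "L"]}
# print(cohort_sizes(lexicon))                    # manner classes only
# print(cohort_sizes(lexicon, keep_stress=True))  # manner classes + lexical stress
```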

  7. Why is Prosody Ignored by ASR? 1. The normalization problem: acoustic features must be normalized, and the normalization algorithms are unknown. 2. The search problem: prosodic constraints are non-local, and are therefore difficult to use in an efficient search algorithm.

  8. 1. The Normalization Problem • For F0, duration, energy, glottalization, and spectral envelope: {influence of speaker and phoneme} >> {influence of prominence}. • Normalization: explicit or implicit? One-pass or multi-pass?

  9. Background: An Algorithm for Synthesis of F0 Contours
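The body of this slide (presumably a figure) did not survive in the transcript. For context, a common synthesis algorithm of this kind is a superpositional command-response model in the style of Fujisaki, in which log F0 is the sum of a base value, slowly decaying phrase components, and shorter accent components. A minimal sketch, assuming that style of model; the specific algorithm shown on the slide is not identified here:

```python
import numpy as np

def superpositional_f0(t, fb=120.0, phrases=(), accents=(), alpha=3.0, beta=20.0):
    """ln F0(t) = ln Fb + phrase-command responses + accent-command responses.
    phrases: iterable of (onset_time, amplitude)
    accents: iterable of (onset_time, offset_time, amplitude)
    alpha, beta: time constants of the phrase and accent filters."""
    log_f0 = np.full_like(t, np.log(fb), dtype=float)

    def phrase_response(x):            # impulse response of the phrase filter
        x = np.maximum(x, 0.0)         # causal: zero before the command
        return alpha**2 * x * np.exp(-alpha * x)

    def accent_response(x):            # step response of the accent filter
        x = np.maximum(x, 0.0)
        return 1.0 - (1.0 + beta * x) * np.exp(-beta * x)

    for t0, ap in phrases:
        log_f0 += ap * phrase_response(t - t0)
    for t1, t2, aa in accents:
        log_f0 += aa * (accent_response(t - t1) - accent_response(t - t2))
    return np.exp(log_f0)

# Example: one phrase command at t=0 s and one accent spanning 0.4-0.7 s.
t = np.linspace(0.0, 2.0, 200)
f0 = superpositional_f0(t, phrases=[(0.0, 0.5)], accents=[(0.4, 0.7, 0.4)])
```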

  10. Explicit Normalization based on Speech Synthesis Algorithm

  11. Explicit One-Pass Normalization: Synthetic Example

  12. Implicit, Multi-Pass Normalization • Parse-independent observation PDF • Parse-dependent observation PDF

  13. Summary: Normalization 1. Parse-dependent vs. parse-independent: the information sources are parse-dependent, but a parse-dependent model requires a multi-pass search. 2. Explicit vs. implicit normalization: explicit normalization designs normalization weights to filter “signal” from “noise”; implicit normalization lets the observation PDF include current and previous values of the cue, so that “normalization” is learned during training.
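One way to make the contrast concrete: explicit normalization removes speaker- and phone-dependent baselines from the raw cue before scoring, while implicit normalization simply stacks current and previous cue values into the observation vector and lets the trained PDF absorb the baseline. A minimal sketch, assuming per-speaker and per-phone mean tables; all names here are illustrative, not from the original slides:

```python
import numpy as np

def explicit_normalize(f0, dur, speaker_f0_mean, phone_dur_mean):
    """Explicit normalization: remove speaker and phoneme influence up front,
    leaving (ideally) only the prominence-related residual."""
    norm_f0 = np.log(f0) - np.log(speaker_f0_mean)    # speaker-relative log-F0
    norm_dur = np.log(dur) - np.log(phone_dur_mean)   # phone-relative log-duration
    return np.array([norm_f0, norm_dur])

def implicit_observation(curr_cues, prev_cues):
    """Implicit normalization: no hand-designed weights; the observation vector
    contains current and previous cue values, so a PDF trained on it can learn
    how to 'normalize' (e.g., by modelling their difference)."""
    return np.concatenate([curr_cues, prev_cues])
```

In the implicit case, a Gaussian or GMM trained on the stacked vectors can learn, for example, that prominent syllables tend to have higher F0 than the preceding syllable, without any hand-set normalization weights.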

  14. 2. The Search Problem: Viterbi Beam Search • N-gram grammar. Complexity: |V|^(N-1) word models. • N-gram grammar governed by a semantic hierarchy with M_S equivalence classes. Complexity: M_S |V|^(N-1). • Two hierarchies (prosodic + semantic). Complexity: M_S M_P |V|^(N-1). Training: M_P acoustic models of each word?!
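To make the growth concrete, here is a small worked example; the vocabulary size, class counts, and N below are hypothetical, not taken from the talk:

```python
# Hypothetical sizes, chosen only to illustrate how the search space grows.
V = 20_000      # vocabulary size
N = 3           # trigram grammar
M_S = 10        # semantic equivalence classes
M_P = 4         # prosodic equivalence classes

plain    = V ** (N - 1)               # |V|^(N-1)         = 4.0e8 grammar contexts
semantic = M_S * V ** (N - 1)         # M_S |V|^(N-1)     = 4.0e9
prosodic = M_S * M_P * V ** (N - 1)   # M_S M_P |V|^(N-1) = 1.6e10
print(plain, semantic, prosodic)
```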

  15. Search Solution #1: Lexically Stressed Vowel Models Acoustic Model • Phone-Based HMM. • Dictionary entry specifies “stressed” or “unstressed” vowel model. Objectives • Focus on just one level of the prosodic hierarchy. • Provide a testbed for studying the perception and acoustics of stress and rhythm.
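In practice this only requires splitting each vowel symbol in the pronunciation dictionary into a stressed and an unstressed variant, so the decoder's search network is unchanged. A minimal sketch of such a dictionary transformation, assuming CMUdict-style ARPABET entries with stress digits; the actual lexicon and phone set used are not stated in the transcript:

```python
def split_vowels_by_stress(lexicon):
    """Return a lexicon in which each vowel maps to a 'stressed' (primary stress)
    or 'unstressed' variant, doubling the number of vowel models but leaving
    consonant models and the search network untouched."""
    new_lex = {}
    for word, phones in lexicon.items():
        out = []
        for ph in phones:
            if ph[-1].isdigit():                  # ARPABET vowels carry 0/1/2
                base = ph.rstrip("012")
                out.append(base + ("_S" if ph.endswith("1") else "_U"))
            else:
                out.append(ph)                    # consonants unchanged
        new_lex[word] = out
    return new_lex

# Example: "bigger" vs. "Ernie" get different vowel models for ER.
lex = {"BIGGER": ["B", "IH1", "G", "ER0"], "ERNIE": ["ER1", "N", "IY0"]}
print(split_vowels_by_stress(lex))
# {'BIGGER': ['B', 'IH_S', 'G', 'ER_U'], 'ERNIE': ['ER_S', 'N', 'IY_U']}
```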

  16. Search Solution #1: Lexically Stressed Vowel Models Advantages • Search Complexity is Not Increased. • Training Complexity Minimally Increased (# of Vowel Models is Doubled). Expected Efficacy • van Kuijk & Boves, 1999: stressed/unstressed classification up to 70% correct. • More information increases word recognition scores (even if just a little).

  17. Search Solution #2: Perceptually Prominent Vowel Models Acoustic Model • 2 Models of every word: prominent, not prominent. • Train using e.g. Radio News corpus. Expected Efficacy • 2 Models of every word ==> Uncertainty increases by 1 bit. • Overlapping PDFs ==> Information increases by <1 bit. • Recognition performance declines.
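The information-theoretic argument can be checked numerically: adding a binary prominence label to every word raises the entropy of the hypothesis space by exactly log2(2) = 1 bit, but an acoustic cue whose prominent and non-prominent PDFs overlap recovers strictly less than 1 bit. A minimal sketch with two overlapping 1-D Gaussians; the means and variances below are made up for illustration:

```python
import numpy as np

def mutual_info_bits(mu0, mu1, sigma, n=20001, span=10.0):
    """I(label; cue) in bits for equal-prior classes N(mu0, sigma) and N(mu1, sigma),
    computed by numerical integration: I = h(mixture) - 0.5*h(p0) - 0.5*h(p1)."""
    x = np.linspace(min(mu0, mu1) - span, max(mu0, mu1) + span, n)
    dx = x[1] - x[0]
    p0 = np.exp(-0.5 * ((x - mu0) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    p1 = np.exp(-0.5 * ((x - mu1) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    mix = 0.5 * (p0 + p1)

    def h(p):                                  # differential entropy in bits
        p = np.clip(p, 1e-300, None)
        return -np.sum(p * np.log2(p)) * dx

    return h(mix) - 0.5 * h(p0) - 0.5 * h(p1)

print(mutual_info_bits(0.0, 1.0, 1.0))   # heavily overlapping: well under 1 bit
print(mutual_info_bits(0.0, 8.0, 1.0))   # nearly separated: approaches 1 bit
```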

  18. Search Solution #3: Start-Synchronous A* Search (Renals & Hochberg, 1999) 1. ACOUSTIC PRUNING: use beam search to find all words w_t^u starting at time t and ending at time u such that p(O_t^u | w_t^u) > threshold. 2. LINGUISTIC PRUNING: create word strings W^u = [W^(t-1), w_t^u], keeping those with p(W^u, O_1^u) = p(W^(t-1), O_1^(t-1)) p(O_t^u | w_t^u) p(w_t^u | W^(t-1)) > threshold.

  19. Start-Synchronous Search with Prosodic Model • The stack entry for the search algorithm is [W^u, F^u] = [W^(t-1), F^(t-1), w_t^u, f_t^u]. • f_t^u contains F0, duration, energy, etc. • p(W^u, F^u, O_1^u) = p(W^(t-1), F^(t-1), O_1^(t-1)) (history) × p(w_t^u | W^(t-1)) (word-order model) × p(f_t^u | W^(t-1), w_t^u, F^(t-1)) (prosodic model) × p(O_t^u | w_t^u) (local acoustic model).
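The scoring recursion on this slide maps directly onto a stack (priority-queue) decoder: each partial hypothesis carries its word string, its prosodic-feature history, and its log score, and extending it by one word adds the language-model, prosodic-model, and acoustic log probabilities. A minimal sketch in that spirit, with the three models left as stubs; their names, signatures, and the pruning details are illustrative, not from Renals & Hochberg or the original slides, and a full A* search would also add a heuristic estimate of the remaining-utterance score, omitted here for brevity:

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Hypothesis:
    neg_log_p: float                     # heapq is a min-heap, so store -log p
    end_time: int = field(compare=False)
    words: tuple = field(compare=False)  # W^(t-1): word string so far
    pros: tuple = field(compare=False)   # F^(t-1): prosodic features so far

def stack_decode(T, propose_words, log_p_lm, log_p_prosody, log_p_acoustic,
                 beam=1000):
    """Start-synchronous stack search with a prosodic term.
    propose_words(t) yields (word, u, f): a word spanning frames t..u (u > t)
    with prosodic features f, i.e. the survivors of acoustic pruning."""
    stack = [Hypothesis(0.0, 0, (), ())]
    complete = []
    while stack:
        hyp = heapq.heappop(stack)
        if hyp.end_time == T:            # hypothesis spans the whole utterance
            complete.append(hyp)
            continue
        t = hyp.end_time
        for word, u, f in propose_words(t):
            score = (-hyp.neg_log_p
                     + log_p_lm(word, hyp.words)              # p(w_t^u | W^(t-1))
                     + log_p_prosody(f, word, hyp.words, hyp.pros)
                                          # p(f_t^u | W^(t-1), w_t^u, F^(t-1))
                     + log_p_acoustic(word, t, u))             # p(O_t^u | w_t^u)
            heapq.heappush(stack, Hypothesis(-score, u,
                                             hyp.words + (word,),
                                             hyp.pros + (f,)))
        if len(stack) > beam:            # crude linguistic pruning
            stack = heapq.nsmallest(beam, stack)
            heapq.heapify(stack)
    return max(complete, key=lambda h: -h.neg_log_p) if complete else None
```

Because only the prosodic term consults the full word and feature history, the acoustic and language models themselves are unchanged, which is the point made on the next slide.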

  20. Start-Synchronous Search with Prosodic Model Advantages • p(O_t^u | w_t^u) and p(w_t^u | W^(t-1)) unchanged ==> training complexity unchanged, search complexity (nearly) unchanged. • p(f_t^u | W^u, F^(t-1)) gives a “fine-tuned” ranking of candidate word strings W^u at each time u. Research Issues • Does it work?

  21. Conclusions Why Use Prosody? • Humans use it. • Possible improved recognition in noise. How Can We Use Prosody? • Normalization: bottom-up (one-step) or parse-dependent (multi-step). • Lexically stressed vowel models p(O_t^u | w_t^u). • Explicit prosody model p(f_t^u | W^u, F^(t-1)) can be part of a start-synchronous A* search.
