1 / 14

Robust Methods for Automatic Transcription and Alignment of Speech Signals

Robust Methods for Automatic Transcription and Alignment of Speech Signals. Course presentation: Speech Recognition Leif Grönqvist (leifg@ling.gu.se) Växjö University (Mathematics and Systems Engineering) GSLT (Graduate School of Language Technology)

fern
Télécharger la présentation

Robust Methods for Automatic Transcription and Alignment of Speech Signals

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Robust Methods for Automatic Transcription and Alignment of Speech Signals Course presentation: Speech Recognition Leif Grönqvist (leifg@ling.gu.se) Växjö University (Mathematics and Systems Engineering) GSLT (Graduate School of Language Technology) Göteborg University (Department of Linguistics) Robust Methods for Automatic Transcription and Alignment of Speech Signals

  2. Introduction: GSLC • GSLC (Göteborg Spoken Language Corpus): A multimodal corpus: • Video and/or audio recording • GTS (Göteborg Transcription Standard) • Overlaps on word level, background information, and comments relevant for interaction • MSO (Modified Standard Orthography) • Closer to speech than written language • NOT phonetic • Keeps possibilities to compare to written language • Designed for studies of natural speech in various activities • 25 social activity types – 200 hours – 360 recordings – 1.3 million running words • Recording/transcription only aligned for a few recordings, not word by word Robust Methods for Automatic Transcription and Alignment of Speech Signals

  3. Transcription example $L: he:ej $G: heej $L: heej hur haru haft d{et} i veckan < / > @ < event : applause > $G: jättebra ja{g} tycker inte att du är ett svart hål $L: tycker [4 du inte d{et} va bra ]4 $V: [4 jo:e kolla lungan ]4 $G: ja{g} tycker att du e0 [5 blå å0 (...) ]5 jo d{et} kan du väl ändå [6 tycka tycker ja ]6 $L: [5 ja tycker inte att du e0 en röd stjärna ]5 $L: [6 nä ]6 $V: va{d} tycker [7 ni själva rå1 ]7 $L: [7 röd ja ]7 stjärna ja $G: kan du ö{h} sluta avbryta [8 oss vi håller ]8 $V: [8 va{d} e0 ni va{d} tycke{r} ni själva att ni ]8 e0 $L: (...) $G: va Robust Methods for Automatic Transcription and Alignment of Speech Signals

  4. MultiTool • Prerelease 0.7 • Browsing, searching, coding, counting • Easy navigation through recordings • Search in transcription, partiture, media file, or time scale • Only manual alignment • Partial alignment of specific events would help a lot! Robust Methods for Automatic Transcription and Alignment of Speech Signals

  5. Robust Methods for Automatic Transcription and Alignment of Speech Signals

  6. What can speech technology do for MultiTool? • A lot of research I didn’t know about… • Question: should we use the transcription or not? • Yes: Automatic forced alignment on word level • No: Speech recognition + alignment • Yes, find the time for: • Utterance start and end points • Non speech annotations (coughing, whispering, click, loud, high pitch, glottalization, etc) and silent sections • Easy-to-recognize speech sounds or words • Find out if two utterances are uttered by the same person Robust Methods for Automatic Transcription and Alignment of Speech Signals

  7. Challenging task… • Speech recognition/alignment work best with high quality sound signals • Recordings of spontaneous speech in natural situations have some unwanted properties: • Long distance between microphone and speaker • Many speakers in the same signal • Overlapped speech • Unlimited vocabulary • Whatever you call it: Disfluencies, repairs, repetitions, deletions, fragmental speech • Various background noise • Will any of the existing methods work here? Robust Methods for Automatic Transcription and Alignment of Speech Signals

  8. Existing research • The Production of Speech Corpora (Schiel et. al.) – fully automatic methods with usable results: • Segmentation into words, if known vocabulary and not very spontaneous speech • Markup of prosodic features • Time alignment of phonemes (+ probabilistic pronunciation rules give word alignment Robust Methods for Automatic Transcription and Alignment of Speech Signals

  9. Research, cont. • Sentence boundary tagging (Stolcke & Shriberg 1996) • Probabilities for boundaries between words • HMM + Viterbi • POS-tags improves • Good sound quality • Interesting, but sentences are not utterances • Inter-word event tagging (Stolcke et. al. 1998) • Events are disfluencies in general • Input is forced alignment + acoustic features • Not directly usable but, similar model and acoustic features may be useful for other events as well Robust Methods for Automatic Transcription and Alignment of Speech Signals

  10. HMM-based segmentation and alignment • Find the most probable alignment for a sequence of words • Sjölander (2003) describes an interesting system • Very interesting! • Reports correct alignment for 85.5% of boundaries within 20ms • Will it work on noisy signals? • A result of say 5% would be very useful • I have tried to get the system… Robust Methods for Automatic Transcription and Alignment of Speech Signals

  11. Related tasks • Intensity discrimination • Easy to measure • Useful as indicator for phoneme changes, etc. • Voicing Determination and Fundamental Frequency • Many methods: Cepstrum, probabilities based on weighted features • Voicing patterns could give good hints when specific words occur. • Glottalization and impulse detection • Intensity and sudden f0 decrease could be used • Glottalization is marked in the transcription! Robust Methods for Automatic Transcription and Alignment of Speech Signals

  12. Robust alignment • How could the algorithm used by Sjölander be revised for more robustness? • f0 (voicing) and glottalization detection + ordinary probabilities for phonemes could help • Problem: the speech models will not give probabilities for phonemes in simultaneous speech • Problem #2: GSLC does not contain phonetic transcription • Would training on letters work? • My guess: this will not work good enough • Better approach to identify things that could be recognized since word-by-word alignment is not necessary Robust Methods for Automatic Transcription and Alignment of Speech Signals

  13. Conclusion • First thing to try: Sjölander’s aligner • Second: Spoken event tagger • Identify events that could be recognized • Identify useful acoustic features • May for example a decision tree help to recognize the events? • Lots of test and experiments will be needed, if the forced alignment doesn’t give useable results Robust Methods for Automatic Transcription and Alignment of Speech Signals

  14. The End! Thank you for listening  ??? !!! Robust Methods for Automatic Transcription and Alignment of Speech Signals

More Related