This study investigates the automatic speech recognition (ASR) of football commentaries in Dutch and English using data from two matches: England vs. Germany and Yugoslavia vs. The Netherlands. It discusses various challenges faced, including overlapping segments and noise interference. The methodology includes transcription techniques, chunk alignment, and the application of noise reduction tools. Analysis shows high word error rates influenced by function words. Future work aims to enhance noise robustness and expand language models for better performance.
Speech recognition in MUMIS
Mirjam Wester, Judith Kessens & Helmer Strik
Intro
• Objective: automatic speech recognition of football commentaries
• SPEX transcribed two matches for two languages (Dutch and English):
  • England - Germany (Eng-Dld)
  • Yugoslavia - The Netherlands (Yug-Ned)
• Commentaries and stadium noise are mixed
Data Conversion
• SPEX transcription:
  • text grid:
    • orthographic transcription
    • chunk alignment; chunk = a segment of speech of about 2 to 3 seconds
  • CD with one large wav file
• Split according to chunk alignments
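The splitting step above can be sketched as follows. This is a minimal illustration, not the tooling used in the project: it assumes the chunk alignments have already been read from the text grid into a list of (start, end) times in seconds, and the function name `split_wav` and the output file naming are hypothetical.

```python
import os
import wave


def split_wav(src_path, chunks, out_dir):
    """Write one wav file per (start_sec, end_sec) chunk and return the new paths.

    `chunks` is assumed to come from the chunk alignment in the text grid.
    """
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        for i, (start, end) in enumerate(chunks):
            first = int(round(start * rate))          # first frame of the chunk
            count = int(round(end * rate)) - first    # number of frames to copy
            src.setpos(first)
            frames = src.readframes(count)
            path = os.path.join(out_dir, f"chunk_{i:04d}.wav")  # hypothetical naming
            with wave.open(path, "wb") as dst:
                dst.setparams(params)                 # same channels, width, rate
                dst.writeframes(frames)               # header length is fixed on close
            paths.append(path)
    return paths
```

Each output file keeps the sample rate and sample width of the large source file, so the chunks can be passed to feature extraction without resampling.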
Examples of data • Yug-Ned Dutch • Yug-Ned English • Eng-Dld Dutch • Eng-Dld English
Statistics
• English matches have two commentators, Dutch only one.
• Overlapping segments have been disregarded.
Training
Dutch:
• Yug-Ned ¾ of CD (19 min speech)
• France Telecom Noise Reduction (FTNR)
English:
• Yug-Ned ¾ of CD (28 min speech)
• FTNR
For more information on the France Telecom Noise Reduction tool, see: B. Noé, J. Sienel, D. Jouvet, L. Mauuary, L. Boves, J. de Veth & F. de Wet, "Noise Reduction for Noise Robust Feature Extraction for Distributed Speech Recognition", in Proc. of Eurospeech '01.
Test
Dutch:
• Yug-Ned ¼ of CD
• 626 chunks, 1577 words
• lexicon and language model based on complete Yug-Ned match
English:
• Yug-Ned ¼ of CD
• 636 chunks, 2641 words
• lexicon and language model based on complete Yug-Ned match
Dutch – Polyphone
• Data consists of phonetically rich sentences
• Phone models were trained on:
  • Polyphone all speakers
  • Polyphone male speakers
  • Polyphone male speakers + MUMIS noise
• Polyphone as bootstrap for segmentation of MUMIS material
Cross-tests (Dutch & English)
• train on ¾ Yug-Ned, test on ¼ Eng-Dld
• train on ¾ Eng-Dld, test on ¼ Yug-Ned
[Table: MUMIS models (Dutch); columns: Yug-Ned test, Eng-Dld test]
[Table: MUMIS models (English); columns: Yug-Ned test, Eng-Dld test]
[Table: MUMIS models (Dutch+English); columns: Yug-Ned test, Eng-Dld test]
[Table: Function words vs content words; columns: word type, Dutch data, English data]
Discussion
• Word error rates (WERs) are high
• Noise?
  • FTNR leads to higher SNR, but WERs do not improve substantially
• Not enough training data?
  • Polyphone for training/bootstrapping does not lead to lower WERs than training on MUMIS data
  • Noisifying Polyphone with MUMIS noise gives encouraging results
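To make the "noisifying" idea concrete, here is a minimal sketch of adding noise to clean speech at a chosen signal-to-noise ratio. It is only an illustration under stated assumptions: the slides do not say how the MUMIS noise was mixed into Polyphone, the function name `mix_at_snr` is hypothetical, and signals are taken to be plain lists of samples.

```python
import math


def mix_at_snr(speech, noise, snr_db):
    """Return speech plus noise, with the noise scaled to give the requested SNR in dB."""
    if not speech or not noise:
        raise ValueError("speech and noise must be non-empty")
    # Repeat the noise if it is shorter than the speech, then trim to length.
    noise = [noise[i % len(noise)] for i in range(len(speech))]
    p_speech = sum(s * s for s in speech) / len(speech)   # mean speech power
    p_noise = sum(n * n for n in noise) / len(noise)      # mean noise power
    if p_noise == 0:
        raise ValueError("noise is silent")
    # SNR_dB = 10 * log10(p_speech / (gain^2 * p_noise)), solved for gain.
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]
```

A lower `snr_db` gives a noisier mixture; training on several SNR levels is a common choice, but which levels suit stadium noise is an open question here.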
Discussion continued
• Function words comprise ± 50% of the data, and cause a great deal of the errors
• Names are recognized very well
• Function words not necessary for information extraction (?)
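The effect of function words on the error rate can be illustrated with a small WER sketch. This is not the scoring tool used for the results: it computes WER as word-level edit distance divided by the number of reference words, and the `FUNCTION_WORDS` set and the example sentences are hypothetical. Filtering function words before scoring is one simple approximation; a fuller analysis would align first and then count errors per word type.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance (substitutions, deletions, insertions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,             # deletion
                           cur[j - 1] + 1,          # insertion
                           prev[j - 1] + (r != h))) # match or substitution
        prev = cur
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate of a hypothesis string against a reference string."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)


# Hypothetical, deliberately tiny function-word list for illustration only.
FUNCTION_WORDS = {"the", "a", "an", "of", "to", "is", "and", "in", "for"}


def content_only(text):
    """Drop function words so that only content words are scored."""
    return " ".join(w for w in text.split() if w.lower() not in FUNCTION_WORDS)


reference = "the ball is passed to bergkamp"
hypothesis = "a ball passed to bergkamp"
full_wer = wer(reference, hypothesis)
content_wer = wer(content_only(reference), content_only(hypothesis))
```

In this toy example both errors fall on function words, so the content-word WER is zero while the full WER is 2 errors over 6 reference words.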
Future work
• Steps to noise-robust speech recognition:
  • model/speaker adaptation
  • combinations of noisified Polyphone models and FTNR
• Other issues:
  • transcription of more data
  • English, Dutch and German
  • preference for specific games? radio? TV?
  • generic football-specific language model
  • confidence measures?
Future work continued
Questions:
• What type of output from ASR is needed?
  • word-graph
  • n-best list
  • top of the list
  • word spotting? only content words?
• For research purposes: is it possible to obtain data that has not been mixed (noise + commentary)?