
Broadcast News Training Experiments


Presentation Transcript


  1. Broadcast News Training Experiments
     Anand Venkataraman, Dimitra Vergyri, Wen Wang, Ramana Rao Gadde, Martin Graciarena, Horacio Franco, Jing Zheng, and Andreas Stolcke
     Speech Technology & Research Laboratory, SRI International, Menlo Park, CA
     EARS STT Workshop

  2. Goals
     • Assess the effect of TDT-4 data (not previously used) on the SRI BN system
     • Explore alternatives for using closed-caption (CC) transcripts for acoustic and LM training
     • Specifically, investigate an algorithm for “repairing” inaccuracies in CC transcripts
     • Initial test of the voicing-feature front end (originally developed for CTS) on BN

  3. Talk Overview
     • BN training on TDT-4 CC data
     • Generation of raw transcripts
     • Waveform segmentation
     • Transcript repair with FlexAlign
     • FlexAlign output for LM training
     • Effect of the amount of training data
     • Comparison with CUED TDT-4 transcripts
     • Ongoing effort on voicing features for BN acoustic modeling

  4. TDT-4 Training: Generation of Waveform Segments and Reference Transcripts
     • References were assumed to be delimited by <TEXT> and </TEXT> in the LDC transcripts.
     • The speech signal was cut using the time marks extracted from the <DOC> tags surrounding the TEXT elements.
     • Long waveforms were identified and recut at progressively shorter pauses until all waveforms were 30 s or shorter.
     • Used PTM acoustic models for forced alignment, which did not require speaker-level normalizations.
     • Used “flexible” forced alignment (see next slide).
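The recutting step described above lends itself to a simple recursive sketch: split a long segment at its longest internal pause and recurse on the pieces until every segment is 30 s or shorter. The data layout and function below are illustrative assumptions, not the actual SRI tooling.

```python
MAX_LEN = 30.0  # seconds

def split_segment(start, end, pauses):
    """Recursively split (start, end) at its longest internal pause."""
    if end - start <= MAX_LEN:
        return [(start, end)]
    inside = [p for p in pauses if start < p["mid"] < end]
    if not inside:
        return [(start, end)]                       # no pause left to cut on; keep as-is
    cut = max(inside, key=lambda p: p["dur"])       # longest pause first
    return (split_segment(start, cut["mid"], pauses) +
            split_segment(cut["mid"], end, pauses))

# Example: a 70 s story with a 0.6 s pause at 20 s and a 1.2 s pause at 45 s
pauses = [{"mid": 20.0, "dur": 0.6}, {"mid": 45.0, "dur": 1.2}]
print(split_segment(0.0, 70.0, pauses))
# -> [(0.0, 20.0), (20.0, 45.0), (45.0, 70.0)]
```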

  5. FlexAlign
     • Special lattices were generated for each segment.
     • Each word was preceded by an optional pause and an optional nonlexical word model.
     • Goal was to simultaneously delete noisy or mistranscribed text and insert disfluencies.
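The slide leaves the lattice topology implicit. One plausible reading (an optional pause and an optional nonlexical word before each closed-caption word, plus a reject alternative so that mistranscribed words can be dropped, as suggested by the reject words mentioned under Future Work) can be sketched as a simple arc list; the labels and topology below are assumptions for illustration only.

```python
PAUSE, NONLEX, REJECT = "-pau-", "@nonlex", "@reject"

def build_flex_lattice(words):
    """Arc list (from_state, to_state, label) for one closed-caption segment."""
    arcs, state = [], 0
    for w in words:
        nxt = state + 1
        arcs.append((state, state, PAUSE))    # optional pause (self-loop)
        arcs.append((state, state, NONLEX))   # optional nonlexical word
        arcs.append((state, nxt, w))          # the closed-caption word itself
        arcs.append((state, nxt, REJECT))     # skip/replace a mistranscribed word
        state = nxt
    return arcs, state                        # arcs plus the final state

arcs, final_state = build_flex_lattice(["good", "evening", "from", "washington"])
for arc in arcs[:4]:
    print(arc)
```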

  6. Optional Nonlexical Word
     Transition probabilities were approximated by the unigram relative frequencies in the 96/97 BN acoustic training corpus.
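Those relative frequencies are just counts normalized by the total token count of the training transcripts. A minimal sketch (the nonlexical token inventory and example lines are invented):

```python
from collections import Counter

NONLEX = {"[uh]", "[um]", "[noise]", "[laughter]", "[breath]"}

def nonlex_unigram_probs(transcript_lines):
    """Relative frequency of each nonlexical token over all tokens."""
    counts = Counter(tok for line in transcript_lines for tok in line.split())
    total = sum(counts.values())
    return {tok: counts[tok] / total for tok in NONLEX if tok in counts}

print(nonlex_unigram_probs(["so [uh] we trained on [uh] TDT-4",
                            "[noise] good evening everyone"]))
```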

  7. Training Procedure
     • Final references were the output of the recognizer on the FlexAlign lattices.
     • WER w.r.t. the original CC transcripts: 5.0% (Sub 0.4, Ins 4.4, Del 0.3).
     • Standard acoustic models were built using Viterbi training on these transcripts.
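The 5.0% figure is a word error rate of the repaired output measured against the original closed captions. A minimal sketch of such a WER computation, with the usual substitution/insertion/deletion breakdown via standard Levenshtein alignment (example word strings invented):

```python
def wer_breakdown(ref, hyp):
    """Return (sub, ins, del, WER%) for hypothesis words vs. reference words."""
    n, m = len(ref), len(hyp)
    # cost[i][j] = (edits, subs, inss, dels) for ref[:i] vs hyp[:j]
    cost = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        e, s, x, d = cost[i - 1][0]
        cost[i][0] = (e + 1, s, x, d + 1)              # all deletions
    for j in range(1, m + 1):
        e, s, x, d = cost[0][j - 1]
        cost[0][j] = (e + 1, s, x + 1, d)              # all insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if ref[i - 1] == hyp[j - 1]:
                cands = [cost[i - 1][j - 1]]           # match, no cost
            else:
                e, s, x, d = cost[i - 1][j - 1]
                cands = [(e + 1, s + 1, x, d)]         # substitution
            e, s, x, d = cost[i - 1][j]
            cands.append((e + 1, s, x, d + 1))         # deletion
            e, s, x, d = cost[i][j - 1]
            cands.append((e + 1, s, x + 1, d))         # insertion
            cost[i][j] = min(cands)
    e, s, x, d = cost[n][m]
    return s, x, d, 100.0 * e / max(n, 1)

ref = "we will be back after the break".split()
hyp = "we will uh be back after break".split()
print(wer_breakdown(ref, hyp))   # -> (0, 1, 1, 28.57...)
```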

  8. Does FlexAlignment Help LM Training?
     • “Subset”: a random selection of original CC references matched to the token count of the FlexAlign transcripts.
     • Note: the only disfluency in the test data was “uh”.
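A sketch of how such a “Subset” baseline could be drawn: shuffle the original CC references and keep adding them until the FlexAlign token count is reached. The helper name and example references are illustrative only.

```python
import random

def match_token_count(cc_refs, target_tokens, seed=0):
    """Random subset of references whose total token count reaches the target."""
    order = list(range(len(cc_refs)))
    random.Random(seed).shuffle(order)
    chosen, total = [], 0
    for i in order:
        if total >= target_tokens:
            break
        chosen.append(cc_refs[i])
        total += len(cc_refs[i].split())
    return chosen

cc = ["good evening from washington", "in other news today", "we will be right back"]
print(match_token_count(cc, target_tokens=8))   # roughly 8 tokens' worth of references
```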

  9. An Accidental Experiment
     What happens if we train on only a subset of the data? Is the performance proportionately worse?

  10. Comparison with CUED TDT-4 Training Transcripts
     • CUED TDT-4 transcripts were generated by an STT system with a biased LM (trained on the TDT-4 CC).
     • CUED transcripts were generated from CU word-time information and SRI waveform segments.
     • CUED transcripts sometimes have “holes” in them where our waveform segments span more than one of the CUED waveforms (probably due to ad removal).
     • WER w.r.t. the CC transcriptions: originals 18.2% (Sub 7.7, Ins 3.2, Del 7.2); FlexAlign 19.5% (Sub 10.1, Ins 3.8, Del 5.6).
     • A fairer comparison ought to use CUED transcripts with CUED segments for training the acoustic models, so take these results with a grain of salt!
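The “holes” can be pictured as SRI segments that are not fully contained in any single CUED waveform interval. A small illustrative check (all boundary times below are invented):

```python
def spans_multiple(segment, cued_waves):
    """True if the segment is not contained within any single CUED waveform."""
    s, e = segment
    return not any(ws <= s and e <= we for ws, we in cued_waves)

cued_waves = [(0.0, 120.0), (150.0, 300.0)]        # gap at 120-150 s, e.g. an ad removed
print(spans_multiple((110.0, 160.0), cued_waves))  # True  -> transcript "hole"
print(spans_multiple((10.0, 40.0), cued_waves))    # False -> fully covered
```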

  11. Results of First Decoding Pass

  12. Multi-pass System Results
     • The multi-pass system used the new decoding strategy (described in a later talk).
     • But: MFC instead of PLP, and no SAT normalization in training (to save time).

  13. Voicing Features
     • Test the voicing features developed for the CTS system on BN STT (cf. Martigny talk); there they gave a 2% relative error reduction across stages.
     • Use the peak of the autocorrelation and the entropy of the higher-order cepstrum.
     • Use a window of 5 frames of the two voicing features.
     • Juxtapose MFCCs plus deltas and double deltas with the window of voicing features.
     • Apply dimensionality reduction with HLDA; the final feature vector has 39 dimensions.
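A minimal numpy sketch of the two voicing features named above, the peak of the frame autocorrelation (searched over plausible pitch lags) and the entropy of the higher-order cepstrum, together with the 5-frame stacking. Lag range, cepstral order, and padding are assumptions; the concatenation with MFCCs and the HLDA projection to 39 dimensions are not shown.

```python
import numpy as np

def voicing_features(frame, fs=16000, fmin=60, fmax=400, n_cep=40):
    """Two voicing features for one frame (e.g. 400 samples = 25 ms at 16 kHz)."""
    frame = frame - frame.mean()
    # peak of the normalized autocorrelation over plausible pitch lags
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    ac = ac / (ac[0] + 1e-10)
    lo, hi = fs // fmax, fs // fmin
    ac_peak = ac[lo:hi].max()
    # entropy of the higher-order real cepstrum, treated as a distribution
    spec = np.abs(np.fft.rfft(frame)) + 1e-10
    cep = np.abs(np.fft.irfft(np.log(spec)))[1:n_cep + 1]
    p = cep / cep.sum()
    cep_entropy = -(p * np.log(p + 1e-10)).sum()
    return np.array([ac_peak, cep_entropy])

def stack_window(feats, width=5):
    """Stack each frame's 2 voicing features with its +/-2 neighbours (5-frame window)."""
    half = width // 2
    padded = np.pad(feats, ((half, half), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + len(feats)] for i in range(width)])

# Example: 100 random frames -> (100, 2) voicing features -> (100, 10) windowed features
frames = np.random.randn(100, 400)
feats = np.vstack([voicing_features(f) for f in frames])
print(stack_window(feats).shape)
```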

  14. Voicing Features Results
     • TDT-4 devtest set (results on the first pass).
     • Used parameters equivalent to those optimized for the CTS system.
     • Need to investigate (reoptimize) front-end parameters for the higher bandwidth.
     • It is not clear what the effect of background music might be on voicing features in BN.
     • Possible software issues.
     • With higher-bandwidth features, the voicing features may be more redundant.

  15. Summary
     • Developed a CC transcript “repair” algorithm based on flexible alignment.
     • Training on “repaired” TDT-4 transcripts gives an 8.8% (1st pass) to 6.2% (multi-pass) relative improvement over Hub-4 training.
     • Accidental result: leaving out 1/3 of the new data reduces the improvement only marginally.
     • Transcript “repair” is not yet suitable for LM training.
     • No improvement from voicing features yet; need to investigate parameters.

  16. Future Work
     • Redo the comparison with alternative transcripts more carefully.
     • Investigate data filtering (e.g., based on reject-word occurrences in the FlexAlign output).
     • Add the rest of the data!
     • Further investigate the use of voicing features.
