Broadcast News Segmentation using Metadata and Speech-To-Text Information to Improve Speech Recognition Sebastien Coquoz, Swiss Federal Institute of Technology (EPFL) International Computer Science Institute (ICSI) March 16, 2004
Outline General Idea ASR System used Exploratory work Strategies Results Conclusion
General idea Use Metadata (SUs) and Speech-To-Text (STT) information to improve later STT passes (feedback loop)
Why segment the audio stream? • Important to give « linguistically coherent » pieces to the language model • Remove « non-speech » (i.e. long silences, laughter, music, other noises, …) Why use MDE? • MDE gives information about sentence and speaker breaks • Speaker labels improve the efficiency of the acoustic model, and sentence boundaries improve the efficiency of the language model • BBN’s error analysis of Broadcast News recognition revealed a higher error rate at segment boundaries, which may be caused by missing the true sentence boundaries
Metadata and STT information MDE object used: • Sentence-like units (SUs): express a thought or idea, generally corresponding to a sentence. Each SU has a confidence measure, timing information (start time and duration) and a cluster label. STT object used: • Lexemes: the words assumed to have been uttered. Each word has timing information (start time and duration).
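The two object types can be pictured as simple records. A minimal sketch in Python (the field names are illustrative, not the actual MDE/STT file schema):

```python
from dataclasses import dataclass

@dataclass
class SentenceUnit:
    """Sentence-like unit (SU) from the MDE output."""
    start: float       # start time in seconds
    duration: float    # duration in seconds
    confidence: float  # MDE confidence measure for this SU
    cluster: str       # speaker-cluster label

@dataclass
class Lexeme:
    """A recognized word from the STT output."""
    word: str
    start: float       # start time in seconds
    duration: float    # duration in seconds
```

The processing strategies below only need these timing and cluster fields.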
ASR system used The system used is a simplified SRI BN evaluation system. Recognition steps: • 1. Segment the waveforms • 2. Cluster the segments into « pseudo-speakers » • 3. Compute and normalize features (Mel cepstrum) • 4. Do first pass recognition with non-crossword acoustic models and bigram language model • 5. Generate lattices • 6. Expand lattices using 5-gram language model • 7. Adapt acoustic models for each « pseudo-speaker » • 8. Generate new lattices using the adapted acoustic models • 9. Expand new lattices using 5-gram language model • 10. Score the resulting hypotheses
Types of segmentation Baseline vs. MDE-based segmentation Baseline • Classifies frames into « speech » and « non-speech » using a 2-state HMM • Uses inter-word silences and speaker turns to segment the BN shows MDE-based • Uses sentence and speaker breaks to define an initial segmentation • Further processes the segments using different strategies presented later
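The baseline's 2-state speech/non-speech classifier can be sketched as a Viterbi decode over per-frame log-likelihoods. This is a toy illustration, not the system's actual models (the real front end supplies the likelihoods; the transition and initial probabilities here are made up):

```python
import numpy as np

def speech_nonspeech_viterbi(loglik, log_trans, log_init):
    """Viterbi decode over a 2-state HMM.
    loglik:    (T, 2) per-frame log-likelihoods,
               state 0 = non-speech, state 1 = speech
    log_trans: (2, 2) log transition matrix, [from, to]
    log_init:  (2,)   log initial state probabilities
    Returns the most likely state sequence as an int array of length T."""
    T = loglik.shape[0]
    delta = np.zeros((T, 2))           # best path score ending in each state
    psi = np.zeros((T, 2), dtype=int)  # backpointers
    delta[0] = log_init + loglik[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans  # scores[from, to]
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + loglik[t]
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):     # backtrace
        path[t] = psi[t + 1, path[t + 1]]
    return path
```

Sticky self-transitions (high probability of staying in the same state) are what keep the decoded segmentation from flipping on every frame.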
Baseline experiments Comments: • The baseline segmentation is the one presented above • The results obtained (shown later) are: • the current best results • the baselines that ultimately have to be improved • No additional processing step is applied to modify the segments
« Cheating » experiments (1) Why? • See if there is room for improvement when using MDE-based segmentation How? • Use transcripts written by humans to segment the Broadcast News audio stream and apply processing strategies to improve recognition (i.e. use true information)
« Cheating » experiments (2) Results: Baseline vs. « Cheating » experiments

Segmentation                WER (wtd avg on 6 shows)
Baseline seg                14.0
Cheating seg (using SU)     14.2
Cheating seg (SU + proc)    13.0

There is room for improvement!
Overview of the processing steps Broadcast News shows → 0. Segmentation using SUs → 1. First strategy: splitting of long segments → 2. Second strategy: concatenation of short segments → 3. Third strategy: addition of time pads → Final segmentation
First strategy: splitting of long segments Why? • Segments that are too long may cover more than one sentence, which confuses the language model How? • Use automatically generated transcripts and MDE • Splitting must not produce segments that are too short, as these hurt the efficiency of the language model • Take two features into account for the decision tree: • The duration of segments • The pause between words
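One way to realize this strategy is sketched below: a segment over a duration threshold is split recursively at its longest inter-word pause. This is a simplified stand-in for the decision tree, and the thresholds are illustrative, not the tuned values:

```python
def split_long_segment(words, max_dur=20.0, min_pause=0.2):
    """Split a segment, given as a time-ordered list of
    (word, start, dur) tuples, at its longest inter-word pause
    until every piece is short enough. Thresholds (in seconds)
    are illustrative."""
    seg_dur = words[-1][1] + words[-1][2] - words[0][1]
    if seg_dur <= max_dur or len(words) < 2:
        return [words]
    # find the longest pause between consecutive words, and its position
    best_pause, i = max(
        (words[k + 1][1] - (words[k][1] + words[k][2]), k)
        for k in range(len(words) - 1))
    if best_pause < min_pause:  # no plausible break point found
        return [words]
    return (split_long_segment(words[:i + 1], max_dur, min_pause)
            + split_long_segment(words[i + 1:], max_dur, min_pause))
```

Cutting only at the longest pause, and refusing to cut below a minimum pause length, is what keeps the split from producing the overly short segments the slide warns against.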
Second strategy: concatenation of short segments Why? • Short segments are not optimal for the language model • Short segments increase the WER because all their words are close to the boundaries (cf. BBN’s error analysis) How? • Take three features into account for the decision tree: • The pause between segments • The sum of the durations of two neighbouring segments • The cluster label
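A toy sketch of this strategy using the three features named above (pause, combined duration, cluster label). Again the rule replaces the actual decision tree, and the thresholds are illustrative:

```python
def merge_short_segments(segs, min_dur=1.0, max_pause=0.5, max_total=20.0):
    """segs: time-ordered list of (start, dur, cluster) tuples.
    Merge a segment into its predecessor when one of the two is short,
    the pause between them is small, the combined duration stays
    reasonable, and the cluster (pseudo-speaker) labels agree.
    Thresholds (in seconds) are illustrative."""
    out = [segs[0]]
    for start, dur, cluster in segs[1:]:
        p_start, p_dur, p_cluster = out[-1]
        pause = start - (p_start + p_dur)
        if (min(p_dur, dur) < min_dur       # one of the two is short
                and pause <= max_pause      # small gap between them
                and p_dur + dur <= max_total  # merged result not too long
                and cluster == p_cluster):  # same pseudo-speaker
            out[-1] = (p_start, start + dur - p_start, cluster)
        else:
            out.append((start, dur, cluster))
    return out
```

Requiring matching cluster labels keeps the merge from joining two different pseudo-speakers, which would hurt speaker-adapted acoustic modeling later in the pipeline.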
Third strategy: addition of time pads Why? • Prevent words from only being partially included • Because the windowing in the front end has a scope of up to 8 frames (4 on each side), it is better to have enough padding How? • Take one feature into account for the decision tree: • The pause between segments
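This padding step could look like the following sketch. The 40 ms pad (4 frames at an assumed 10 ms frame step) is inferred from the front-end description above; each boundary is only extended halfway into the pause so neighbouring segments never overlap:

```python
def add_time_pads(segs, pad=0.04):
    """segs: time-ordered list of (start, end) pairs in seconds.
    Extend each boundary by up to `pad` seconds, but never past the
    midpoint of the pause to the neighbouring segment. The pad value
    is illustrative."""
    padded = []
    for i, (start, end) in enumerate(segs):
        left = segs[i - 1][1] if i > 0 else 0.0
        right = segs[i + 1][0] if i + 1 < len(segs) else float("inf")
        new_start = start - min(pad, (start - left) / 2)
        new_end = end + min(pad, (right - end) / 2)
        padded.append((new_start, new_end))
    return padded
```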
Examples of improvements (1) 1) Real sentence: … and strictly limits state authority over how and when water is used … Recognized with baseline segmentation (cuts in the middle of the sentence): … and stricter limits dataarty over how and when wateryhues … Recognized with MDE-based segmentation: … and strict_ limits state authority over how and when water issues … (The original slide marked segmentation points and recognition errors on a timeline.)
Examples of improvements (2) 2) Real sentence: … I didn’t know if we would pull off the games. I didn’t know if this community would ever rally around the Olympics again. … Recognized with baseline segmentation (doesn’t cut at the end of the sentence): … pull off the games that had not this community would ever rally around … Recognized with MDE-based segmentation: … pull off the game_ I didn’t know _ this community would ever rally around … (The original slide marked segmentation points and recognition errors on a timeline.)
Results for the development set

Segmentation                WER (wtd avg on 6 shows)
Baseline seg                14.0
Step 0: SU seg              14.4
SU seg + step 1             14.2
SU seg + steps 1 & 2        14.0
SU seg + steps 1 & 2 & 3    13.3

The improvement is 0.7% absolute and 5% relative!
Results for the evaluation set

Segmentation                WER (wtd avg on 6 shows)
Baseline seg                18.7
Step 0: SU seg              19.8
SU seg + step 1             19.7
SU seg + steps 1 & 2        19.6
SU seg + steps 1 & 2 & 3    18.4

The improvement is 0.3% absolute and 1.6% relative!
Dev results vs. Eval results Observations: • No « cheating » information is available for the eval set, so it is not clear how well the SU detection is working • Improvements from step 0 (SU segmentation) to the final segmentation are similar for the dev set and the eval set: 1.1% absolute (7.6% relative) for the dev set and 1.3% absolute (6.6% relative) for the eval set, even though the SU information was not optimized for the eval set • The respective improvements are quite uneven across shows, which suggests that the strategies are show dependent, not channel dependent
Future work Further optimize the thresholds for the three strategies Find a representation to choose a specific value of the thresholds for each show individually (i.e. fully adapt the decision trees to each show) Use Metadata objects such as the confidence measure of each SU and diarization to further improve the strategies
Conclusion Development of a new segmentation method based on Metadata and Speech-To-Text information Features given by MDE and STT information are used in decision trees for each processing step Results indicate the promise of this approach There still seems to be room for further improvement
Acknowledgments I would like to thank: Prof. Bourlard & Prof. Morgan Barbara & Andreas Yang IM2 for supporting my work