An Iterative Technique for Segmenting Speech and Text Alignment

Presentation Transcript

  1. An Iterative Technique for Segmenting Speech and Text Alignment Arthur R. Toth Speech Seminar - 4/18/2003

  2. Basic Problem • Have Large Audio File, Associated Text • Want to Align Text With Audio • Useful for Synthesis • Useful for Acoustic Modeling • Doing this manually is tedious • What if it could be done automatically? • or even if part could be done automatically?

  3. Related Problem • Splitting audio file can help • Phrases can be good candidates • Can only be so long (have to breathe) • Short enough that forced alignment is feasible • Existing work on predicting break locations • But then you need to split associated text

  4. Constraints • Different Data is available • Acoustic data, i.e. waveform • Supra-segmental information • For our first attempts, we are trying to see how far we can get using only waveform • Differs from strategies which use word info • cf. Wang & Hirschberg, Wightman et al.

  5. Data Set • Boston University Radio Corpus • Single speaker monologue • No dialogue turn information • Female newscaster • Some idiosyncrasies • Loud breathing • Broad f0 range, sometimes large dips

  6. Segmenting Strategy • Want to focus on Phrase Break Levels>2 • Tool for first approximation: vad • end-pointer available from MS State University • public domain • uses power and zero-crossings • lists beginnings and ends of found segments • http://www.isip.msstate.edu/projects/speech/software/legacy/signal_detector/index.html
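
As a rough illustration of an end-pointer that "uses power and zero-crossings", here is a minimal Python sketch. It is not the MS State tool: the function names, the frame length, the thresholds, and the rule for combining the two features are all assumptions chosen for illustration.

```python
def frame_features(samples, frame_len):
    """Return (mean power, zero-crossing rate) for each non-overlapping frame."""
    feats = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        power = sum(x * x for x in frame) / frame_len
        crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
        feats.append((power, crossings / (frame_len - 1)))
    return feats


def find_segments(samples, rate, frame_len=160, power_thresh=1e-3,
                  zcr_thresh=0.25, min_pause_frames=3):
    """List (start_sec, end_sec) for each region judged to be speech.

    All thresholds are hypothetical and would need tuning on real data.
    """
    active = []
    for power, zcr in frame_features(samples, frame_len):
        loud = power > power_thresh
        # Assumed rule: a quieter frame still counts if it has many zero-crossings.
        quiet_but_busy = power > power_thresh / 10 and zcr > zcr_thresh
        active.append(loud or quiet_but_busy)

    segments = []
    start = None      # index of the first frame of the open segment, if any
    silent_run = 0    # consecutive inactive frames seen inside the open segment
    for i, is_active in enumerate(active):
        if is_active:
            if start is None:
                start = i
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_pause_frames:
                end = i - silent_run + 1  # index of the first silent frame
                segments.append((start * frame_len / rate, end * frame_len / rate))
                start = None
                silent_run = 0
    if start is not None:
        end = len(active) - silent_run
        segments.append((start * frame_len / rate, end * frame_len / rate))
    return segments
```

Pauses shorter than `min_pause_frames` are absorbed into the surrounding segment, which is one simple way to bias the output toward larger breaks.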

  7. Splitting Text - First Pass • Use Festival to predict lengths of words • Linearly scale total predicted length to actual length • Look at positions of segment endpoints from vad and use scaled length predictions to predict word
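
The first pass can be sketched in Python as follows. The predicted durations are assumed to come from a synthesizer such as Festival and are passed in as plain numbers; the helper names are hypothetical, and choosing the word whose scaled end time is nearest to each segment endpoint is an assumption, since the slide does not spell out the rule.

```python
from bisect import bisect_left
from itertools import accumulate


def scale_word_ends(predicted_durations, actual_total):
    """Linearly scale cumulative predicted word end times to the real audio length."""
    scale = actual_total / sum(predicted_durations)
    return [scale * t for t in accumulate(predicted_durations)]


def nearest_word_end(word_ends, t):
    """Index of the word whose scaled end time is closest to time t (assumed rule)."""
    i = bisect_left(word_ends, t)
    if i == 0:
        return 0
    if i == len(word_ends):
        return len(word_ends) - 1
    # Pick whichever neighbour is closer; ties go to the earlier word.
    return i if word_ends[i] - t < t - word_ends[i - 1] else i - 1


def first_pass(predicted_durations, actual_total, segment_end_times):
    """Guess the final word of each detected segment."""
    word_ends = scale_word_ends(predicted_durations, actual_total)
    return [nearest_word_end(word_ends, t) for t in segment_end_times]
```

Because the scaling is uniform, pauses in the recording are spread evenly over all words, so these guesses are only a starting point for later refinement.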

  8. Iterations • Refine estimates iteratively as follows: • In each iteration, work left-to-right • Use sphinx-align to score forced alignments • for words up through the initially predicted final word • also try final words up to 2 before and 2 after • take best scoring list of words as new estimate • Note: forced alignment can fail
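
The loop described on this slide might look like the Python sketch below. The `score` callable is a hypothetical stand-in for running sphinx-align on one audio segment with a candidate word list; it is assumed to return a number where higher is better, or `None` when forced alignment fails. The real tool's interface and score convention are not shown in the slides.

```python
def refine_breaks(breaks, n_words, score, window=2, iterations=5):
    """Refine the final-word index of each segment, working left to right.

    breaks[i] is the current guess for the index of the last word in segment i.
    score(seg, first, last) is a hypothetical wrapper around a forced aligner.
    """
    breaks = list(breaks)
    for _ in range(iterations):
        first = 0  # index of the first word of the current segment
        for seg, guess in enumerate(breaks):
            best_last, best_score = guess, None
            # Try the current guess plus up to `window` words before and after.
            for last in range(guess - window, guess + window + 1):
                if not first <= last < n_words:
                    continue
                s = score(seg, first, last)
                if s is None:
                    continue  # forced alignment failed for this candidate
                if best_score is None or s > best_score:
                    best_last, best_score = last, s
            # If every candidate failed, the previous guess is kept.
            breaks[seg] = best_last
            first = best_last + 1
    return breaks
```

With a window of 2, a guess that starts several words away from the right answer can move by at most 2 words per iteration.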

  9. Experiment and Results • 5 iterations were run • Estimated word locations were compared with actual ones • Had to convert from times to words • Criterion: break associated with the most recent preceding word ending time • Most substantial improvement appeared to be in first iteration
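
One way to read the time-to-word criterion is sketched below in Python: each break time is mapped to the last word whose labeled end time does not come after it, and estimates are then compared with the reference by word offset. The function names and the treatment of exact ties are assumptions.

```python
from bisect import bisect_right
from collections import Counter


def break_time_to_word(word_end_times, break_time):
    """Index of the last word ending at or before break_time, or -1 if none."""
    return bisect_right(word_end_times, break_time) - 1


def offset_histogram(estimated, actual):
    """Count how far (in words) each estimated break is from the actual one."""
    return Counter(e - a for e, a in zip(estimated, actual))
```

A histogram that peaks at a nonzero offset such as -1 or +1 would point to a systematic bias rather than random error.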

  10. Discussion • Points close to correct improved quickly • Points further away didn’t improve as much • Window size probably too small • Need to expand window sizes, but keep other constraints in mind • Heuristic like Itakura rule might be handy • Many misses only 1 off, and biased • May result from measurement or labeling

  11. Further Work • More sophisticated phrase break detection • Using a general purpose tool • Want the option of using supra-segmental data, if available • Would a Switching State-Space Model help? (Ghahramani & Hinton) • Is left-to-right iteration approach best? • Non-iterative model for splitting text?