
LYU0103 Speech Recognition Techniques for Digital Video Library

Supervisor: Prof. Michael R. Lyu. Students: Gao Zheng Hong, Lei Mo. Outline of presentation: project objectives, ViaVoice recognition experiments, speech recognition editing tool, audio scene change detection.


Presentation Transcript


  1. LYU0103 Speech Recognition Techniques for Digital Video Library. Supervisor: Prof. Michael R. Lyu. Students: Gao Zheng Hong, Lei Mo

  2. Outline of Presentation • Project objectives • ViaVoice recognition experiments • Speech recognition editing tool • Audio scene change detection • Speech classification • Summary

  3. Our Project Objectives • Audio information retrieval • Speech recognition

  4. Last Term’s Work • Extracted the audio channel (stereo, 44.1 kHz) from MPEG video files into wave files (mono, 22 kHz) • Segmented the wave files into sentences by detecting their frame energy • Developed a visual training tool
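The energy-based sentence segmentation mentioned on this slide could look roughly like the sketch below. It is only an illustration: the frame length, the energy threshold, and the minimum silence length are hypothetical parameters, and the slides do not say which values or which exact rule the project used.

```python
def frame_energies(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    return [
        sum(s * s for s in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def segment_by_energy(samples, frame_len=4, threshold=0.01, min_silence_frames=2):
    """Return (start, end) sample ranges of segments separated by silence.

    A segment ends once at least min_silence_frames consecutive frames
    fall below the energy threshold (all three parameters are assumptions).
    """
    energies = frame_energies(samples, frame_len)
    segments, start, silent_run = [], None, 0
    for idx, energy in enumerate(energies):
        if energy >= threshold:
            if start is None:
                start = idx          # first loud frame opens a segment
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_silence_frames:
                end = idx - silent_run + 1   # index of the first silent frame
                segments.append((start * frame_len, end * frame_len))
                start, silent_run = None, 0
    if start is not None:            # close a segment still open at the end
        end = len(energies) - silent_run
        segments.append((start * frame_len, end * frame_len))
    return segments


# Two loud bursts separated by silence yield two segments.
signal = [0.5] * 8 + [0.0] * 8 + [0.5] * 8
print(segment_by_energy(signal))
```

In practice the frame length would be tens of milliseconds of audio rather than four samples; the tiny values here only keep the example readable.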

  5. Visual Training Tool • Video window • Dictation window • Text editor

  6. IBM ViaVoice Experiments • Employed 7 student helpers • Produced transcripts of 77 news video clips • Ran four experiments: baseline measurement, trained model measurement, slow down measurement, and indoor news measurement

  7. Baseline Measurement • Goal: measure the ViaVoice recognition accuracy on TVB news video • Testing set: 10 video clips • The segmented WAV files are dictated • The Hidden Markov Model Toolkit (HTK) is used to compute the accuracy
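As a rough illustration of the accuracy figure that HTK reports, the sketch below aligns a reference transcript with a recognized transcript and computes percent correct, (N - D - S) / N, and percent accuracy, (N - D - S - I) / N, where N is the number of reference words and D, S, I are deletions, substitutions, and insertions. The alignment here uses plain unit-cost edit distance over whitespace-separated tokens, which is an assumption of this sketch and may differ from the alignment weights HTK applies.

```python
def word_accuracy(reference, hypothesis):
    """Return (percent_correct, percent_accuracy) from a minimum-edit alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    n, m = len(ref), len(hyp)
    # cost[i][j] = (edits, substitutions, deletions, insertions)
    # for aligning ref[:i] with hyp[:j].
    cost = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        cost[i][0] = (i, 0, i, 0)        # delete every reference word
    for j in range(1, m + 1):
        cost[0][j] = (j, 0, 0, j)        # insert every hypothesis word
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            e, s, d, ins = cost[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                best = (e, s, d, ins)            # match, no extra cost
            else:
                best = (e + 1, s + 1, d, ins)    # substitution
            e, s, d, ins = cost[i - 1][j]
            best = min(best, (e + 1, s, d + 1, ins))    # deletion
            e, s, d, ins = cost[i][j - 1]
            best = min(best, (e + 1, s, d, ins + 1))    # insertion
            cost[i][j] = best
    _, subs, dels, inss = cost[n][m]
    percent_correct = 100.0 * (n - dels - subs) / n
    percent_accuracy = 100.0 * (n - dels - subs - inss) / n
    return percent_correct, percent_accuracy


print(word_accuracy("a b c d", "a x c"))
```

The function assumes a non-empty reference transcript; the example strings are placeholders, not data from the experiments.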

  8. Trained Model Measurement • Goal: measure the accuracy of ViaVoice after training it on its correctly recognized words • 10 video clips are segmented and dictated • The correctly dictated words of the training set are used to train ViaVoice through the SMAPI function SmWordCorrection • The “baseline measurement” procedure is repeated after training to get the recognition performance • The whole procedure is repeated using 20 video clips

  9. Slow Down Measurement • Investigate the effect of slowing down the audio channel • Resample the segmented wave files in the testing set by ratios of 1.05, 1.1, 1.15, 1.2, 1.3, 1.4, and 1.6 • Repeat the “baseline measurement” procedure
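One simple way to slow audio down by a given ratio is to stretch the sample sequence with linear interpolation and play the result at the original sample rate. The sketch below shows that idea only; the slides do not state which resampling method was used, and this approach also lowers the pitch, which a production tool might avoid.

```python
def slow_down(samples, ratio):
    """Stretch samples to about `ratio` times their length by linear interpolation."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    if len(samples) < 2:
        return [float(s) for s in samples]
    out_len = int(round(len(samples) * ratio))
    out = []
    for k in range(out_len):
        pos = k / ratio              # fractional position in the input
        i = int(pos)
        if i >= len(samples) - 1:
            out.append(float(samples[-1]))   # hold the last sample
        else:
            frac = pos - i
            out.append(samples[i] * (1 - frac) + samples[i + 1] * frac)
    return out


# The ratios tested in the experiment, applied to a toy 100-sample ramp.
ramp = [float(n) for n in range(100)]
for ratio in (1.05, 1.1, 1.15, 1.2, 1.3, 1.4, 1.6):
    print(ratio, len(slow_down(ramp, ratio)))
```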

  10. Indoor News Measurement • Goal: eliminate the effect of noise • Select the sentences spoken by the indoor news reporter • Dictate the test set using the untrained model • Repeat the procedure using the trained model

  11. Experimental Results • Overall recognition results (ViaVoice, TVB News)

  12. Experimental Results (cont.) • Results of the trained model with different numbers of training videos • Results of using different slow-down ratios

  13. Analysis of Experimental Results • Trained model: about 1% accuracy improvement • Slowed-down speech: about 1% accuracy improvement • Indoor speech is recognized much better • Mandarin: estimated baseline accuracy is about 70% (much higher than Cantonese)

  14. Speech Processor • Training does not increase accuracy significantly • The recognition result needs manual editing • Word timing information is also important

  15. Editing Functionality • The recognition result is organized in basic units called “firm words” • Timing information is retrieved from the speech engine • The timing information of every firm word is recorded in an index • The corresponding firm word is highlighted during video playback

  16. Dynamic Time Index Alignment • Editing the recognition result may change the firm word structure • The time index needs to be updated to match the new firm words • In the speech processor, the time index is realigned with the firm words whenever the user edits the text
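The slides do not describe how the realignment is implemented. The sketch below is one hypothetical way to do it: each firm word is stored as a (text, start, end) tuple, unchanged words keep their timing, and replaced or inserted words share the time span of the words they replace (or get a zero-length span at the insertion point). The tuple layout and the even-split rule are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def realign(old_words, new_texts):
    """Rebuild the time index after an edit.

    old_words: list of (text, start, end) before editing.
    new_texts: list of word strings after editing.
    """
    old_texts = [w[0] for w in old_words]
    matcher = SequenceMatcher(a=old_texts, b=new_texts, autojunk=False)
    result = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            result.extend(old_words[i1:i2])      # timing carried over unchanged
        elif tag in ("replace", "insert"):
            if i1 < i2:
                # New words inherit the span of the words they replace.
                span_start, span_end = old_words[i1][1], old_words[i2 - 1][2]
            else:
                # Pure insertion: zero-length span at the insertion boundary.
                if i1 > 0:
                    boundary = old_words[i1 - 1][2]
                else:
                    boundary = old_words[0][1] if old_words else 0
                span_start = span_end = boundary
            count = j2 - j1
            step = (span_end - span_start) / count
            for k in range(count):
                result.append((new_texts[j1 + k],
                               span_start + k * step,
                               span_start + (k + 1) * step))
        # "delete": the removed words simply drop out of the index.
    return result


index = [("hong", 0, 400), ("kong", 400, 800), ("news", 800, 1200)]
print(realign(index, ["hong", "kong", "daily", "report"]))
```

Times are in milliseconds here, which is also an assumption; any monotonic time unit works the same way.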

  17. Time Index Alignment Example [Figure panels: before editing; editing; after editing]

  18. Motivation for Speech Segmentation and Classification • Gender classification can help us build gender-dependent models • Scene change detection from video content alone is not accurate enough, so audio scene change detection is needed as a supporting tool

  19. Flow Diagram of Audio Information Retrieval System [Figure: flow diagram. Stage labels: audio signal (from the news audio channel); feature extraction (MFCC); segmentation; audio scene change; speech; non-speech; music; speaker identification/classification; male?; female?. Method annotations: by MFCC variance; by clustering; by 256 GMM; detect continuous vowel > 30%; pattern matching]

  20. Feature Extraction by MFCC • Feature extraction is the first step applied to the raw audio input data • MFCC stands for “mel-frequency cepstral coefficient” • Human perception of the frequency of sound does not follow a linear scale
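To make the non-linear frequency scale concrete, the sketch below converts between hertz and mels with one widely used formula, mel = 2595 log10(1 + f / 700), and spaces filter centers evenly on the mel axis. Which mel formula and how many filters the project used is not stated in the slides, so both are assumptions here.

```python
import math


def hz_to_mel(f_hz):
    """Convert frequency in Hz to mels (a common formula, assumed here)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filter_centers(n_filters, f_low, f_high):
    """Center frequencies in Hz, equally spaced on the mel scale."""
    lo, hi = hz_to_mel(f_low), hz_to_mel(f_high)
    step = (hi - lo) / (n_filters + 1)
    return [mel_to_hz(lo + step * (k + 1)) for k in range(n_filters)]


# Centers crowd together at low frequencies and spread out at high ones.
for center in mel_filter_centers(5, 0.0, 11025.0):
    print(round(center, 1))
```

The upper limit of 11025 Hz is simply half of the 22 kHz mono sampling rate mentioned earlier in the deck.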

  21. Detection of Audio Scene Change by Bayesian Information Criterion (BIC) • BIC is a likelihood criterion • We maximize the likelihood function separately for each model M and obtain L(X, M) • The main principle is to penalize the likelihood by the model complexity

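  22. Detection of a Single Change Point Using BIC • $H_0$: $x_1, x_2, \ldots, x_N \sim N(\mu, \Sigma)$ is the hypothesis that the whole sequence contains no change • $H_1$: $x_1, \ldots, x_i \sim N(\mu_1, \Sigma_1)$ and $x_{i+1}, \ldots, x_N \sim N(\mu_2, \Sigma_2)$ is the hypothesis that a change occurs at time $i$ • The maximum likelihood ratio is defined as $R(i) = N \log|\Sigma| - N_1 \log|\Sigma_1| - N_2 \log|\Sigma_2|$, where $N_1 = i$ and $N_2 = N - i$ are the lengths of the two segments

  23. Detection of a Single Change Point Using BIC (cont.) • The difference between the BIC values of the two models can be expressed as $\mathrm{BIC}(i) = R(i) - \lambda P$, with penalty $P = \frac{1}{2}\left(d + \frac{1}{2} d(d+1)\right) \log N$, where $d$ is the dimension of the feature vectors and $\lambda$ is the penalty weight • If $\mathrm{BIC}(i) > 0$, a scene change is detected at time $i$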
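The criterion on these two slides can be sketched for the simplest case of one-dimensional features (d = 1), where the covariance determinant is just the variance. This is a minimal sketch, not the project's code: the variance floor and the `margin` that keeps a few samples on each side of a candidate split are assumptions added to keep the estimates well defined.

```python
import math


def variance(xs):
    """Maximum-likelihood variance, floored to avoid log(0)."""
    mean = sum(xs) / len(xs)
    return max(sum((x - mean) ** 2 for x in xs) / len(xs), 1e-12)


def delta_bic(xs, i, lam=1.0):
    """BIC(i) = R(i) - lam * P for splitting xs into xs[:i] and xs[i:], d = 1."""
    n, d = len(xs), 1
    r = (n * math.log(variance(xs))
         - i * math.log(variance(xs[:i]))
         - (n - i) * math.log(variance(xs[i:])))
    penalty = 0.5 * (d + 0.5 * d * (d + 1)) * math.log(n)
    return r - lam * penalty


def detect_change(xs, lam=1.0, margin=2):
    """Return the split index with the largest positive BIC(i), or None."""
    best_i, best_val = None, 0.0
    for i in range(margin, len(xs) - margin + 1):
        val = delta_bic(xs, i, lam)
        if val > best_val:
            best_i, best_val = i, val
    return best_i


# Synthetic feature track: 20 values near 0 followed by 20 values near 5.
noise = [0.0, 0.1, -0.1, 0.05, -0.05]
track = noise * 4 + [5.0 + v for v in noise] * 4
print(detect_change(track))
```

With real MFCC vectors the variances become covariance matrices and the logarithms become log-determinants, which would need a linear algebra library.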

  24. Detection of Multiple Change Points by BIC • (a) Initialize the interval [a, b] with a = 1, b = 2 • (b) Use BIC to test whether there is a change point in the interval [a, b] • (c) If there is no change in [a, b], let b = b + 1; otherwise let t be the detected change point and set a = t + 1, b = a + 1 • (d) Go to step (b) if necessary
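The growing-window procedure above can be sketched as follows for one-dimensional features. Several details are assumptions of this sketch rather than part of the slide: indices are 0-based, the window starts at `2 * margin` samples instead of 2 so that both sides of a split have enough data for a variance estimate, and a change is accepted only once the best split is no longer pinned to the right edge of the search range, which avoids committing to a boundary before enough samples of the new segment have arrived.

```python
import math


def _log_var(xs):
    """Log of the maximum-likelihood variance, floored to avoid log(0)."""
    mean = sum(xs) / len(xs)
    return math.log(max(sum((x - mean) ** 2 for x in xs) / len(xs), 1e-12))


def _best_split(window, lam, margin):
    """Split index with the largest positive BIC(i) in the window, or None."""
    n = len(window)
    penalty = lam * math.log(n)      # (1/2)(d + d(d+1)/2) log N with d = 1
    whole = n * _log_var(window)
    best_i, best_val = None, 0.0
    for i in range(margin, n - margin + 1):
        val = (whole - i * _log_var(window[:i])
               - (n - i) * _log_var(window[i:]) - penalty)
        if val > best_val:
            best_i, best_val = i, val
    return best_i


def detect_changes(xs, lam=1.0, margin=5):
    """Return the indices where a new segment starts."""
    changes = []
    a, b = 0, 2 * margin                      # step (a): initial window xs[a:b]
    while b <= len(xs):
        window = xs[a:b]
        t = _best_split(window, lam, margin)  # step (b): test the window
        if t is None or t == len(window) - margin:
            b += 1                            # step (c): no reliable change, grow
        else:
            changes.append(a + t)             # step (c): change found, restart
            a += t
            b = a + 2 * margin
    return changes                            # step (d) is the while loop


# Synthetic track with level shifts at samples 20 and 40.
noise = [0.0, 0.1, -0.1, 0.05, -0.05]
track = noise * 4 + [5.0 + v for v in noise] * 4 + noise * 4
print(detect_changes(track))
```

One consequence of the edge rule is that a change very close to the end of the data is never reported, because the window cannot grow past it.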

  25. Advantages of the BIC Approach • Robustness • Threshold-free • Optimality

  26. Comparison of different algorithms

  27. Audio scene change detection

  28. Gender Classification • The means and covariances of male and female feature vectors are quite different • So each gender can be modeled by a Gaussian mixture model (GMM)
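A minimal sketch of classification with per-gender GMMs: score the feature frames of a clip under each model and pick the model with the higher total log-likelihood. The diagonal covariances, the two-component toy models, and every numeric parameter below are hypothetical; the slides do not give the trained model parameters.

```python
import math


def gmm_log_likelihood(x, components):
    """log p(x) under a diagonal-covariance GMM.

    components: list of (weight, means, variances) tuples.
    """
    log_terms = []
    for weight, means, variances in components:
        log_p = math.log(weight)
        for xi, mu, var in zip(x, means, variances):
            log_p += -0.5 * (math.log(2 * math.pi * var) + (xi - mu) ** 2 / var)
        log_terms.append(log_p)
    peak = max(log_terms)            # log-sum-exp for numerical stability
    return peak + math.log(sum(math.exp(t - peak) for t in log_terms))


def classify(frames, models):
    """Return the label whose GMM gives the highest total log-likelihood."""
    scores = {
        label: sum(gmm_log_likelihood(f, comps) for f in frames)
        for label, comps in models.items()
    }
    return max(scores, key=scores.get)


# Toy two-dimensional models with made-up parameters.
models = {
    "male": [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [1.0, 1.0], [1.0, 1.0])],
    "female": [(0.5, [4.0, 4.0], [1.0, 1.0]), (0.5, [5.0, 5.0], [1.0, 1.0])],
}
print(classify([[0.2, 0.1], [0.9, 1.2]], models))
```

In a real system the frames would be MFCC vectors and the model parameters would come from training, for example with the EM algorithm.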

  29. Male/Female Classification [Figure: frequency count vs. values; panels: Male, Female]

  30. Gender Classification

  31. Music/Speech Classification by Pitch Tracking • Speech has a more continuous pitch contour than music • A speech clip always has 30%-55% continuous contour, whereas silence or music has 1%-15% • Thus, we classify a clip as speech when its continuous contour exceeds 20%
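The decision rule on this slide can be sketched as below, given a per-frame pitch track in which 0 marks a frame with no detected pitch. The slides do not define "continuous contour" precisely, so this sketch assumes it means frames that belong to a run of at least `min_run` consecutive pitched frames whose pitch moves by at most `max_jump` Hz from one frame to the next; both parameters are hypothetical.

```python
def continuous_contour_ratio(pitch, max_jump=10.0, min_run=3):
    """Fraction of frames that lie inside a smooth run of pitched frames."""
    if not pitch:
        return 0.0
    covered, run, prev = 0, 0, None
    for p in pitch:
        if p > 0 and prev is not None and prev > 0 and abs(p - prev) <= max_jump:
            run += 1                 # the current run continues smoothly
        else:
            if run >= min_run:
                covered += run       # close the previous run if long enough
            run = 1 if p > 0 else 0  # a pitched frame starts a new run
        prev = p
    if run >= min_run:
        covered += run
    return covered / len(pitch)


def is_speech(pitch, threshold=0.20):
    """Apply the >20% rule from the slide to a pitch track."""
    return continuous_contour_ratio(pitch) > threshold


smooth = [0, 0, 100, 102, 104, 106, 0, 0, 200, 0]
jumpy = [0, 100, 0, 300, 0, 150, 0, 0, 0, 0]
print(continuous_contour_ratio(smooth), is_speech(smooth))
print(continuous_contour_ratio(jumpy), is_speech(jumpy))
```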

  32. Frequency vs. Number of Frames [Figure panels: Speech, Music]

  33. Summary • ViaVoice training experiments • Speech recognition editing tool • Dynamic time index alignment • Audio scene change detection • Speech classification • Integrated the above functions into a speech processor

  34. Future Work • Classify indoor and outdoor news to support further processing of the video clips • Train gender-dependent models for the ViaVoice engine, which may increase the recognition accuracy
