
Robust Endpoint Detection and Energy Normalization for Real-Time Speech and Speaker Recognition

Qi Li, Senior Member, IEEE, Jinsong Zheng, Augustine Tsai, and Qiru Zhou, Member, IEEE. Presented by Chen Hung_Bin.


Presentation Transcript


  1. Robust Endpoint Detection and Energy Normalization for Real-Time Speech and Speaker Recognition • Qi Li, Senior Member, IEEE, Jinsong Zheng, Augustine Tsai, and Qiru Zhou, Member, IEEE • Presented by Chen Hung_Bin

  2. Outline • Introduction to endpoint detection • Endpoint detection methods • Endpoint detection (filter) • State transition • Experiment

  3. Introduction • The detection of the presence of speech embedded in various types of nonspeech events and background noise is called endpoint detection, speech detection, or speech activity detection. • This paper addresses endpoint detection with sequential and batch-mode processes to support real-time recognition. • Sequential: used for automatic speech recognition (ASR). • Batch-mode: used when utterances are as short as a few seconds and the delay in response is small.

  4. Introduction • Endpoint detection methods include: • energy threshold • pitch detection • spectrum analysis • cepstral analysis • zero-crossing rate • periodicity measure • chi-square test • entropy • hybrid detection

  5. Introduction • energy
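
As a rough illustration of the energy feature named on this slide, the sketch below computes a per-frame log-energy contour in dB. The frame length, hop size, energy floor, and the function name `frame_log_energy` are assumptions chosen for illustration; they are not taken from the paper.

```python
import math

def frame_log_energy(samples, frame_len=240, hop=80, floor=1e-10):
    """Return the short-time log energy (dB) of each full frame.

    frame_len, hop, and floor are illustrative defaults, not the
    paper's settings.
    """
    energies = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        # Mean power of the frame; the floor avoids log10(0) on digital silence.
        power = sum(x * x for x in frame) / frame_len
        energies.append(10.0 * math.log10(max(power, floor)))
    return energies

# Usage: half a second of silence followed by a 200 Hz tone at 8 kHz.
silence = [0.0] * 480
tone = [math.sin(2 * math.pi * 200 * n / 8000) for n in range(480)]
contour = frame_log_energy(silence + tone)
```

An energy-threshold detector would then compare this contour against a fixed or adaptive threshold.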

  6. Introduction • A Mandarin digit “eight.” • spectrum

  7. Introduction • zero-crossing rate
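
The zero-crossing rate can be sketched in a few lines. The definition below (fraction of adjacent sample pairs whose signs differ) is one common convention and is an assumption here, as is the function name `zero_crossing_rate`; the slide does not give a formula.

```python
def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ.

    Samples equal to zero are treated as non-negative (an assumption).
    """
    if len(frame) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)

# Usage: an alternating signal crosses zero at every step, a constant one never.
alternating = zero_crossing_rate([1.0, -1.0, 1.0, -1.0])
constant = zero_crossing_rate([1.0, 1.0, 1.0, 1.0])
```

Noise-like and unvoiced segments tend to give higher rates than voiced speech, which is why this measure is often combined with energy.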

  8. Introduction • The chi-square test is given by • The hypothesis test can thus be written as

  9. Introduction • entropy

  10. Introduction • Endpoint detection is crucial to both accuracy and speed for several reasons: • It is hard to model noise and silence accurately in changing environments. • If silence frames can be removed prior to recognition, the accumulated utterance likelihood scores will focus more on the speech. • Cepstral mean subtraction (CMS), a popular algorithm for robust speech recognition, needs accurate endpoints to compute the mean of the speech frames precisely and thereby improve recognition accuracy.
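
To make the CMS point concrete, here is a minimal sketch that computes the cepstral mean only over the frames between two detected endpoints and subtracts it from every frame. The function name, the list-of-lists data layout, and the half-open `begin`/`end` convention are assumptions for illustration.

```python
def cepstral_mean_subtraction(cepstra, begin, end):
    """Subtract the mean of frames begin..end-1 from every frame.

    cepstra is a list of equal-length cepstral vectors, one per frame;
    begin and end are detected endpoint frame indices (end exclusive).
    """
    speech = cepstra[begin:end]
    if not speech:
        raise ValueError("no speech frames between the endpoints")
    dim = len(speech[0])
    # Mean of each cepstral coefficient over the speech frames only.
    mean = [sum(frame[k] for frame in speech) / len(speech) for k in range(dim)]
    return [[frame[k] - mean[k] for k in range(dim)] for frame in cepstra]

# Usage: frames 1 and 2 are speech; frames 0 and 3 are silence.
frames = [[0.0, 0.0], [2.0, 4.0], [4.0, 8.0], [0.0, 0.0]]
with_endpoints = cepstral_mean_subtraction(frames, 1, 3)
without_endpoints = cepstral_mean_subtraction(frames, 0, len(frames))
```

With the endpoints the mean is taken over speech only; without them the silence frames pull the mean toward zero, which is the inaccuracy the slide warns about.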

  11. Introduction • This study points out: • The more accurately we can detect endpoints, the better we can do on real-time energy normalization. • Requirements: • Accurate location of detected endpoints; • Robust detection at various noise levels; • Low computational complexity; • Fast response time; • Simple implementation.

  12. Endpoint Detection (Filter) • First, we need a detector (filter) that meets the following general requirements: • 1) invariant outputs at various background energy levels; • 2) capability of detecting both beginning and ending points; • 3) short time delay or look-ahead; • 4) limited response level; • 5) maximum output signal-to-noise ratio (SNR) at endpoints; • 6) accurate location of detected endpoints; • 7) maximum suppression of false detection.

  13. Endpoint Detection (Filter)

  14. Filter for Both Beginning- and Ending-Edge Detection • Less than 25 points • Chosen filter size: • W = 13 • s = 0.5385 • A = 0.2208 • Let H(i) = h(i-13); then the filter has 25 points in total with a 24-frame look-ahead, since both H(1) and H(25) are zeros. Count 30
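
The sketch below shows how a 25-point odd-symmetric filter can be slid over an energy contour. The taps are a stand-in sine shape chosen only because it is odd-symmetric with zero end taps; they are NOT the paper's optimal filter, whose closed form is not given in this transcript. The names `HALF_WIDTH`, `TAPS`, and `edge_filter_output` are likewise hypothetical.

```python
import math

HALF_WIDTH = 12  # 2 * 12 + 1 = 25 taps, the point count quoted on the slide

# Stand-in odd-symmetric taps (assumption): negative before the centre,
# positive after it, and approximately zero at both ends.
TAPS = [math.sin(math.pi * i / HALF_WIDTH)
        for i in range(-HALF_WIDTH, HALF_WIDTH + 1)]

def edge_filter_output(energy):
    """Correlate the energy contour with TAPS.

    Output index t covers input frames t .. t + 24, so producing it
    needs frames that lie ahead of the window start (the look-ahead).
    """
    n = len(TAPS)
    return [
        sum(TAPS[k] * energy[t + k] for k in range(n))
        for t in range(len(energy) - n + 1)
    ]

# Usage: a 30 dB burst between frames 40 and 79 on a flat background.
contour = [0.0] * 40 + [30.0] * 40 + [0.0] * 40
out = edge_filter_output(contour)
# The same burst on a background that is 50 dB higher.
out_shifted = edge_filter_output([g + 50.0 for g in contour])
```

Because the taps sum to (numerically almost) zero, a constant shift in background energy leaves the output essentially unchanged, and a rising edge gives a positive peak while a falling edge gives a negative one. This matches the first two requirements listed for the detector, although the exact shape and look-ahead of the paper's filter will differ.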

  15. Filter for Both Beginning- and Ending-Edge Detection • Filter sizes chosen in this paper: • Shape of the optimal filter for beginning-edge detection, plotted as h(t), with W = 7 and s = 1. • Shape of the optimal filter for ending-edge detection, plotted as h(t), with W = 35 and s = 0.2.

  16. Batch-mode Endpoint Detection • Lines E, F, G, and H indicate the locations of two pairs of beginning and ending points. • Output of the beginning-edge filter (solid line) and ending-edge filter (dashed line).

  17. Batch-mode Endpoint Detection

  18. State Transition Diagram • A three-state transition diagram is used to make final decisions. • States: silence, in-speech, and leaving-speech. • 8 kHz sampling rate. • Figure: state transition diagram for endpoint decision. (a) Energy contour of digit “4”; (b) filter outputs and state transitions.
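
A three-state decision process of this kind can be sketched as follows. The state names come from the slide; the transition rules, the two thresholds `upper` and `lower`, the `gap` counter, and the function name `detect_endpoints` are assumptions made for illustration, since the transcript does not preserve the diagram itself.

```python
SILENCE, IN_SPEECH, LEAVING_SPEECH = "silence", "in-speech", "leaving-speech"

def detect_endpoints(filter_out, upper=100.0, lower=-100.0, gap=5):
    """Return (begin, end) frame pairs from an edge-filter output sequence.

    upper, lower, and gap are illustrative values, not the paper's settings.
    """
    state, begin, end, count = SILENCE, None, None, 0
    segments = []
    for t, value in enumerate(filter_out):
        if state == SILENCE:
            if value >= upper:            # strong rising edge: speech begins
                state, begin = IN_SPEECH, t
        elif state == IN_SPEECH:
            if value <= lower:            # strong falling edge: candidate ending
                state, end, count = LEAVING_SPEECH, t, 0
        else:                             # LEAVING_SPEECH
            if value >= upper:            # energy rises again: same utterance
                state = IN_SPEECH
            elif value <= lower:          # still falling: move the candidate
                end, count = t, 0
            else:
                count += 1
                if count > gap:           # quiet long enough: confirm the ending
                    segments.append((begin, end))
                    state = SILENCE
    if state == IN_SPEECH:                # input ended inside speech
        segments.append((begin, len(filter_out) - 1))
    elif state == LEAVING_SPEECH:         # input ended before confirmation
        segments.append((begin, end))
    return segments

# Usage: one rising edge at frame 5 and one falling edge at frame 16.
single = detect_endpoints([0.0] * 5 + [150.0] + [0.0] * 10 + [-150.0] + [0.0] * 10)
# A short pause inside the utterance should not split it in two.
paused = detect_endpoints(
    [0.0, 150.0, 0.0, -150.0, 0.0, 0.0, 150.0, 0.0, -150.0] + [0.0] * 7
)
```

The leaving-speech state is what lets short intra-utterance pauses be bridged instead of being reported as separate utterances.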

  19. Real-Time Energy Normalization • The purpose of energy normalization is to normalize the utterance energy g(t) such that the largest energy value is close to zero.
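
The goal stated above can be illustrated with two small functions: a batch version that knows the whole contour, and a frame-by-frame version that only has a running estimate of the peak. The running-maximum update and the `initial_peak` argument are assumptions for illustration and should not be read as the paper's actual update rule.

```python
def normalize_batch(energy):
    """Shift a complete log-energy contour so that its maximum is 0."""
    peak = max(energy)
    return [g - peak for g in energy]

def normalize_sequential(energy, initial_peak):
    """Normalize frame by frame with a running estimate of the peak.

    initial_peak is an assumed prior guess of the utterance maximum.
    """
    peak = initial_peak
    normalized = []
    for g in energy:
        peak = max(peak, g)          # raise the estimate when a louder frame arrives
        normalized.append(g - peak)
    return normalized

# Usage: log energies (dB) of four frames.
g = [40.0, 55.0, 70.0, 60.0]
batch = normalize_batch(g)
online = normalize_sequential(g, initial_peak=65.0)
```

In this example the two versions disagree only on the frames seen before the true peak, so the closer the early peak estimate is to the real maximum, the closer the real-time result is to the batch one.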

  20. Real-Time Energy Normalization

  21. Real-Time Energy Normalization • Example: (a) Energy contours of “4-327-631-Z214” from the original utterance (bottom, 20 dB SNR) and after adding car noise (top, 5 dB SNR). (b) Filter outputs for the 5 dB (dashed line) and 20 dB (solid line) SNR cases. (c) Detected endpoints and normalized energy for the 20 dB SNR case and (d) for the 5 dB SNR case.

  22. Database Evaluation • The proposed algorithm was compared with a baseline endpoint detection algorithm on one noisy database and several telephone databases. • Baseline endpoint detection: • A six-state transition diagram is used, with initializing, silence, rising, energy, fell-rising, and fell states. • In total, eight counters and 24 hard-limit thresholds are used for the state-transition decisions.

  23. Database Evaluation • Noisy database evaluation: • A database was first recorded from a desktop computer at a 16 kHz sampling rate, then down-sampled to 8 kHz. • Car and other background noises were artificially added to the original database at SNR levels of 5, 10, 15, and 20 dB. • The original database has 39 utterances and 1738 digits in total. • LPC features and short-term energy were used, and a hidden Markov model (HMM) performed the recognition.
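
Adding noise at a target SNR is usually done by scaling the noise before mixing. The sketch below assumes SNR is defined as the ratio of average speech power to average noise power over the whole utterance; the slide does not say which definition the authors used, and the function name `mix_at_snr` is hypothetical.

```python
import math

def mix_at_snr(speech, noise, snr_db):
    """Scale noise so that speech power / noise power matches snr_db, then add."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    p_speech = sum(x * x for x in speech) / len(speech)
    p_noise = sum(x * x for x in noise) / len(noise)
    if p_speech == 0 or p_noise == 0:
        raise ValueError("speech and noise must both have nonzero power")
    # Gain g such that p_speech / (g^2 * p_noise) == 10^(snr_db / 10).
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]

# Usage: mix a toy "speech" signal with toy "noise" at each level from the slide.
speech = [1.0, -1.0] * 50
noise = [0.5, 0.5, -0.5, -0.5] * 25
mixed = {snr: mix_at_snr(speech, noise, snr) for snr in (5, 10, 15, 20)}
```

Lower SNR values produce a larger noise gain, so the 5 dB mixture is the hardest case for the endpoint detector.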

  24. Database Evaluation • Comparisons on real-time connected digit recognition: (a) utterance in DB5: “1 Z 4 O 5 8 2”; (b) baseline, recognized as “1 Z 4 O 5 8”; (c) proposed, recognized as “1 Z 4 O 5 8 2”; (d) filter output.

  25. Database Evaluation • Telephone Database Evaluation: • The proposed algorithm was further evaluated in 11 databases collected from the telephone networks with 8 kHz sampling rates in various acoustic environments. • DB1 to DB5 contain digits, alphabet and word strings. • DB6 to DB11 contain pure digit strings. • In the proposed system, we set the parameters as

  26. Database Evaluation • Results for digits, alphabet and word strings • Results for pure digit strings

  27. CONCLUSIONS • Since the entire algorithm only uses a 1-D energy feature, it has low complexity and is very fast in computation.
