Robust Endpoint Detection and Energy Normalization for Real-Time Speech and Speaker Recognition

Robust Endpoint Detection and Energy Normalization for Real-Time Speech and Speaker Recognition • Qi Li, Senior Member, IEEE, Jinsong Zheng, Augustine Tsai, and Qiru Zhou, Member, IEEE • Presented by Chen Hung_Bin

Outline • Introduction to endpoint detection • Endpoint detection methods • Endpoint detection (Filter) • State Transition • Experiment

Introduction • The detection of the presence of speech embedded in various types of nonspeech events and background noise is called endpoint detection, speech detection, or speech activity detection. • This paper addresses endpoint detection with sequential and batch-mode processes to support real-time recognition. • Sequential: automatic speech recognition (ASR). • Batch-mode: utterances are usually as short as a few seconds and the delay in response is usually small.

Introduction • Endpoint detection methods include • energy threshold • pitch detection • spectrum analysis • cepstral analysis • zero-crossing rate • periodicity measure • chi-square test • entropy • hybrid detection

Introduction • energy
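
The energy feature on this slide can be illustrated with a minimal sketch of a frame-level log-energy contour. The function name `log_energy`, the 80-sample frame (10 ms at 8 kHz), and the power floor are illustrative assumptions, not values taken from the slides.

```python
import math

def log_energy(samples, frame_len=80, floor=1e-10):
    """Return one log-energy value (in dB) per non-overlapping frame.

    frame_len and floor are assumed values chosen for illustration.
    """
    contour = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        power = sum(x * x for x in frame) / frame_len
        # The floor avoids log(0) on digitally silent frames.
        contour.append(10.0 * math.log10(max(power, floor)))
    return contour

# Two quiet frames followed by two frames of a 440 Hz tone at 8 kHz.
silence = [0.001] * 160
tone = [math.sin(2 * math.pi * 440 * n / 8000) for n in range(160)]
contour = log_energy(silence + tone)
```

The resulting contour is a one-value-per-frame sequence, which is the kind of 1-D input an energy-based endpoint detector operates on.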

Introduction • A Mandarin digit “eight.” • spectrum

Introduction • zero-crossing rate
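
As a rough illustration of the zero-crossing rate, the sketch below counts sign changes between adjacent samples in a frame. The function name and the convention of treating zero as non-negative are assumptions made for this example.

```python
def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ.

    Zero-valued samples are treated as non-negative (an assumed convention).
    """
    if len(frame) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)

alternating = zero_crossing_rate([1.0, -1.0, 1.0, -1.0])
monotone = zero_crossing_rate([1.0, 2.0, 3.0])
```

A signal that flips sign at every sample yields the highest rate, while a signal that stays on one side of zero yields none.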

Introduction • The chi-square test given by • The hypothesis test can thus be written as

Introduction • entropy
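
The slide does not give the entropy formula, so the following is only a sketch of one common formulation, spectral entropy: the power spectrum of a frame is normalized into a probability distribution and its Shannon entropy is computed. The function name and the use of the natural logarithm are assumptions.

```python
import math

def spectral_entropy(power_spectrum):
    """Shannon entropy (in nats) of a power spectrum normalized to sum to 1."""
    total = sum(power_spectrum)
    if total <= 0:
        return 0.0
    # Zero-power bins contribute nothing, so they are skipped.
    probs = [p / total for p in power_spectrum if p > 0]
    return -sum(p * math.log(p) for p in probs)

flat = spectral_entropy([1.0, 1.0, 1.0, 1.0])    # noise-like, flat spectrum
peaked = spectral_entropy([0.0, 5.0, 0.0, 0.0])  # single spectral peak
```

Under this formulation a flat, noise-like spectrum has higher entropy than a spectrum concentrated in a few bins.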

Introduction • Endpoint detection is crucial to both accuracy and speed for several reasons. • It is hard to model noise and silence accurately in changing environments. • If silence frames can be removed prior to recognition, the accumulated utterance likelihood scores will focus more on the speech. • Cepstral mean subtraction (CMS), a popular algorithm for robust speech recognition, needs accurate endpoints to compute the mean of speech frames precisely in order to improve recognition accuracy.
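
The dependence of CMS on endpoints can be sketched as follows: the cepstral mean is computed only over frames marked as speech and then subtracted from every frame. The function name, the list-of-lists feature layout, and the `is_speech` mask are hypothetical and used only for illustration.

```python
def cepstral_mean_subtraction(frames, is_speech):
    """Subtract the mean cepstral vector of the speech frames from all frames.

    frames: list of equal-length cepstral vectors (assumed layout).
    is_speech: one boolean per frame, e.g. derived from detected endpoints.
    """
    speech = [f for f, keep in zip(frames, is_speech) if keep]
    if not speech:
        return [list(f) for f in frames]
    dim = len(speech[0])
    mean = [sum(f[d] for f in speech) / len(speech) for d in range(dim)]
    return [[f[d] - mean[d] for d in range(dim)] for f in frames]

frames = [[0.0, 0.0], [2.0, 4.0], [4.0, 8.0]]
mask = [False, True, True]  # first frame is silence
normalized = cepstral_mean_subtraction(frames, mask)
```

If the silence frame were wrongly included in the mask, the mean would shift toward the silence vector, which is why inaccurate endpoints degrade the normalization.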

Introduction • This study points out: • The more accurately we can detect endpoints, the better we can do on real-time energy normalization. • Requirements: • accurate location of detected endpoints; • robust detection at various noise levels; • low computational complexity; • fast response time; • and simple implementation.

Endpoint Detection (Filter) • First, we need a detector (filter) that meets the following general requirements: • 1) invariant outputs at various background energy levels; • 2) capability of detecting both beginning and ending points; • 3) short time delay or look-ahead; • 4) limited response level; • 5) maximum output signal-to-noise ratio (SNR) at endpoints; • 6) accurate location of detected endpoints; • 7) maximum suppression of false detection.

Endpoint Detection (Filter)

Less than 25 points Filter for Both Beginning- and Ending-Edge Detection • Chosen filter size: • W = 13 • s = 0.5385 • A = 0.2208 • Let H(i) = h(i-13); then the filter has 25 points in total with a 24-frame look-ahead, since both H(1) and H(25) are zeros. Count 30
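
The slides do not reproduce the closed form of the optimal filter h, so the sketch below uses a stand-in: an odd-symmetric 25-point sine kernel whose two end taps are zero, matching only the size described above. It is not the paper's filter. The names `edge_filter` and `kernel` and the synthetic energy contour are assumptions; the point is to show how an odd-symmetric kernel applied to an energy contour responds positively at a beginning edge, negatively at an ending edge, and is insensitive to the background level.

```python
import math

HALF = 12  # kernel indices run from -12 to 12, i.e. 25 points

# Stand-in odd-symmetric kernel (NOT the paper's optimal filter).
kernel = [math.sin(math.pi * i / HALF) for i in range(-HALF, HALF + 1)]

def edge_filter(energy, kernel):
    """Correlate the energy contour with the kernel, clamping at the ends."""
    half = len(kernel) // 2
    out = []
    for t in range(len(energy)):
        acc = 0.0
        for k, h in enumerate(kernel):
            idx = min(max(t + k - half, 0), len(energy) - 1)
            acc += h * energy[idx]
        out.append(acc)
    return out

# Synthetic contour: 40 frames of background, 40 of speech, 40 of background.
energy = [20.0] * 40 + [60.0] * 40 + [20.0] * 40
response = edge_filter(energy, kernel)
begin = response.index(max(response))  # strongest rising edge
end = response.index(min(response))    # strongest falling edge
```

Because the stand-in kernel sums to approximately zero, adding a constant offset to the whole contour leaves the response essentially unchanged, which mirrors the requirement of invariant outputs at various background energy levels.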

Filter for Both Beginning- and Ending-Edge Detection • Filter sizes chosen in this paper: • Figure: shape of the optimal filter for beginning-edge detection, plotted as h(t), with W = 7 and s = 1. • Figure: shape of the optimal filter for ending-edge detection, plotted as h(t), with W = 35 and s = 0.2.

Batch-mode Endpoint Detection • Lines E, F, G, and H indicate the locations of two pairs of beginning and ending points. • Figure: output of the beginning-edge filter (solid line) and ending-edge filter (dashed line).

Batch-mode Endpoint Detection

State Transition Diagram • A three-state transition diagram is used to make final decisions. • States: silence, in-speech, and leaving-speech. • 8 kHz sampling rate. • Figure: state transition diagram for endpoint decision. (a) Energy contour of digit “4”; (b) filter outputs and state transitions.
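
A minimal sketch of a three-state decision process over the filter output is given below. Only the three state names come from the slide; the transition rules, the `upper` and `lower` thresholds, and the `gap` frame count are assumptions made to keep the example concrete.

```python
def detect_endpoints(filter_out, upper=100.0, lower=-100.0, gap=5):
    """Return (begin, end) frame pairs from a filter-output sequence.

    Assumed rules: silence -> in-speech when output >= upper;
    in-speech -> leaving-speech when output <= lower;
    leaving-speech -> in-speech if output >= upper again,
    otherwise -> silence after `gap` frames.
    """
    state, count = "silence", 0
    begin = leave = None
    segments = []
    for t, f in enumerate(filter_out):
        if state == "silence":
            if f >= upper:
                state, begin = "in-speech", t
        elif state == "in-speech":
            if f <= lower:
                state, count, leave = "leaving-speech", 0, t
        else:  # leaving-speech
            if f >= upper:
                state = "in-speech"  # speech resumed after a short pause
            else:
                count += 1
                if count >= gap:
                    segments.append((begin, leave))
                    state = "silence"
    if state == "in-speech":
        segments.append((begin, len(filter_out) - 1))
    elif state == "leaving-speech":
        segments.append((begin, leave))
    return segments

single = detect_endpoints([0.0] * 5 + [150.0] + [0.0] * 10 + [-150.0] + [0.0] * 10)
paused = detect_endpoints([0.0, 150.0, 0.0, -150.0, 0.0, 150.0, 0.0, -150.0] + [0.0] * 6)
```

The leaving-speech state lets a short pause inside an utterance, such as the gap between two digits, be absorbed into a single segment instead of ending it early.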

Real-Time Energy Normalization • The purpose of energy normalization is to normalize the utterance energy g(t) such that the largest value of energy is close to zero.
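
The normalization goal can be sketched in two ways. In batch mode the maximum of g(t) is known and can simply be subtracted. In real time the maximum is not yet known, so the sketch below uses a running maximum seeded with an initial guess; this is a simplification for illustration and not necessarily the update rule used in the paper. Both function names and the `initial_peak` parameter are assumptions.

```python
def normalize_batch(energy):
    """Subtract the utterance maximum so the largest value becomes zero."""
    peak = max(energy)
    return [g - peak for g in energy]

def normalize_running(energy, initial_peak):
    """Frame-by-frame variant using a running estimate of the maximum.

    initial_peak is an assumed starting estimate, e.g. from earlier utterances.
    """
    peak = initial_peak
    out = []
    for g in energy:
        peak = max(peak, g)  # raise the estimate when a larger value arrives
        out.append(g - peak)
    return out

g = [20.0, 60.0, 40.0]
batch = normalize_batch(g)
running = normalize_running(g, initial_peak=50.0)
```

The two results differ only on frames seen before the true maximum arrives, which shows why a good early estimate of the peak matters in the real-time case.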

Real-Time Energy Normalization

Real-Time Energy Normalization • example (a) Energy contours of “4-327-631-Z214” from original utterance (bottom, 20 dB SNR) and after adding car noise (top, 5 dB SNR). (b) Filter outputs for 5 dB (dashed line) and 20 dB (solid line) SNR cases. (c) Detected endpoints and normalized energy for the 20 dB SNR case and (d) for the 5 dB SNR case.

Database Evaluation • The proposed algorithm was compared with a baseline endpoint detection algorithm on one noisy database and several telephone databases. • Baseline Endpoint Detection: • A six-state transition diagram is used, with the states initializing, silence, rising, energy, fell-rising, and fell. • In total, eight counters and 24 hard-limit thresholds are used for the decisions of state transition.

Database Evaluation • Noisy Database Evaluation: • In this experiment, a database was first recorded from a desktop computer at a 16 kHz sampling rate, then down-sampled to 8 kHz. • Car and other background noises were artificially added to the original database at SNR levels of 5, 10, 15, and 20 dB. • The original database has 39 utterances and 1738 digits in total. • LPC features and the short-term energy were used as features, and hidden Markov models (HMMs) were used for recognition.
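
Adding noise at a target SNR, as described for the noisy database, can be sketched by scaling the noise so that the ratio of speech power to noise power matches the requested level. The function names and the synthetic signals are illustrative assumptions; the slides do not describe the exact mixing procedure that was used.

```python
import math

def mean_power(signal):
    """Average power of a sample sequence."""
    return sum(x * x for x in signal) / len(signal)

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech-to-noise power ratio equals snr_db, then add.

    Assumes both sequences have the same length and nonzero power.
    """
    target_noise_power = mean_power(speech) / (10.0 ** (snr_db / 10.0))
    scale = math.sqrt(target_noise_power / mean_power(noise))
    return [s + scale * n for s, n in zip(speech, noise)]

# Synthetic stand-ins for a speech signal and a noise recording.
speech = [math.sin(2 * math.pi * 200 * n / 8000) for n in range(400)]
noise = [0.3 if n % 2 == 0 else -0.3 for n in range(400)]
noisy = mix_at_snr(speech, noise, snr_db=5.0)
```

Calling `mix_at_snr` once per target level (5, 10, 15, and 20 dB) would produce one noisy copy of each utterance per condition.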

Database Evaluation (a) utterance in DB5: “1 Z 4 O 5 8 2.” (b) baseline, recognized as “1 Z 4 O 5 8.” (c) proposed, recognized as “1 Z 4 O 5 8 2.” (d) filter output Comparisons on real-time connected digit recognition

Database Evaluation • Telephone Database Evaluation: • The proposed algorithm was further evaluated in 11 databases collected from the telephone networks with 8 kHz sampling rates in various acoustic environments. • DB1 to DB5 contain digits, alphabet and word strings. • DB6 to DB11 contain pure digit strings. • In the proposed system, we set the parameters as

Database Evaluation • Figure panels: digits, alphabet and word strings; pure digit strings

CONCLUSIONS • Since the entire algorithm only uses a 1-D energy feature, it has low complexity and is very fast in computation.
