Using Speech Recognition to Predict VoIP Quality

Wenyu Jiang, IRT Lab, April 3, 2002

Introduction to Voice Quality • Quality factors in Voice over IP (VoIP) • Packet loss, delay, and jitter • Choice of voice codec • Quality metric: Mean Opinion Score (MOS) • Widely used • Human-based • Time-consuming • Labor-intensive • Results not available in real time

Motivation • Features of a speech recognizer: • Automatic speech recognition (ASR), no human listeners needed • Accuracy of recognition is apparently coupled with the quality of input speech • Recognition can be done in real-time, allowing online quality monitoring. • Recognition performance may be related to speech intelligibility as well as quality.

Related Work • ITU-T E-model [G.107/G.108] • An analytical model for estimating perceived quality • Provides loss-to-MOS mapping for some common codecs (G.729, G.711, G.723.1) • Chernick et al. studied speech recognition performance with the DoD-CELP codec • Effect of bit error rate instead of packet loss • Phoneme (instead of word) recognition ratio • Some MOS results, but not accurate enough

Experiment Setup • Speech recognition engine • IBM ViaVoice on Linux • Wrote software for both voice model training and performance testing • Training and testing • 2 scripts: #1 for training, #2 for testing • 2 speakers, A and B; each reads both scripts • Script #2 is split into 25 audio clips, with 5 clips per loss condition (0%, 2%, 5%, 10%, 15%) • Codec: G.729 • Training uses G.729-processed audio
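
The clip layout above can be sketched in a few lines. This is only an illustration: the slides do not say which clips went to which loss condition or what loss model was used, so the consecutive grouping in `assign_conditions` and the independent random loss in `simulate_loss` are both assumptions, and both function names are hypothetical.

```python
import random

# Loss conditions listed on the slide (0%, 2%, 5%, 10%, 15%).
LOSS_CONDITIONS = [0.0, 0.02, 0.05, 0.10, 0.15]


def assign_conditions(clips, conditions=LOSS_CONDITIONS):
    """Map each loss condition to an equal-sized group of consecutive clips.

    Assumption: clips are grouped in order; the source does not say how
    clips were assigned to conditions.
    """
    per_condition, remainder = divmod(len(clips), len(conditions))
    if remainder:
        raise ValueError("clips must divide evenly among loss conditions")
    return {
        p: clips[i * per_condition:(i + 1) * per_condition]
        for i, p in enumerate(conditions)
    }


def simulate_loss(num_packets, loss_prob, seed=None):
    """Return one flag per packet: True if received, False if lost.

    Assumption: each packet is lost independently with probability
    loss_prob; the source does not state the loss model.
    """
    rng = random.Random(seed)
    return [rng.random() >= loss_prob for _ in range(num_packets)]
```

With 25 clips and the five conditions above, `assign_conditions` yields 5 clips per condition, matching the setup on the slide.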

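Experiment Setup, contd. • Performance metric • Absolute word recognition ratio: Rabs(p) = (number of words recognized correctly) / (total number of words) • Relative word recognition ratio: Rrel(p) = Rabs(p) / Rabs(0%) • p is packet loss probability • MOS listening tests: 22 listeners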
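
A minimal sketch of the two metrics, assuming Rabs is the fraction of reference words recognized correctly and Rrel is Rabs at loss p divided by Rabs at 0% loss. The slides do not say how recognized words were aligned with the script, so the in-order matching via Python's `difflib.SequenceMatcher` is an assumption, as are the function names.

```python
from difflib import SequenceMatcher


def absolute_ratio(reference: str, recognized: str) -> float:
    """Rabs: share of reference words that the recognizer got right."""
    ref = reference.lower().split()
    hyp = recognized.lower().split()
    if not ref:
        raise ValueError("reference text is empty")
    # Assumption: a word counts as correct when it falls inside an
    # in-order matching block between the two word sequences.
    matcher = SequenceMatcher(None, ref, hyp, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(ref)


def relative_ratio(r_abs_at_p: float, r_abs_at_zero: float) -> float:
    """Rrel: Rabs under loss, normalized by the same speaker's loss-free Rabs."""
    if r_abs_at_zero <= 0:
        raise ValueError("loss-free recognition ratio must be positive")
    return r_abs_at_p / r_abs_at_zero
```

For example, `absolute_ratio("the quick brown fox", "the quick red fox")` counts three of four reference words as correct and returns 0.75.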

Recognition Ratio vs. MOS • Both MOS and Rabs decrease as packet loss increases • Eliminating the intermediate variable p then relates Rabs directly to MOS
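
Eliminating p amounts to joining the two measured curves on their shared loss values. The sketch below assumes each curve is stored as a dictionary keyed by loss probability; the function name and data layout are hypothetical, and no measured values from the study are reproduced here.

```python
def eliminate_loss(rabs_by_loss, mos_by_loss):
    """Pair Rabs and MOS measured at the same loss probability p.

    Both arguments map loss probability -> measured value. The result is a
    list of (Rabs, MOS) points ordered by increasing loss, with p dropped.
    """
    shared = sorted(set(rabs_by_loss) & set(mos_by_loss))
    return [(rabs_by_loss[p], mos_by_loss[p]) for p in shared]
```

Loss values present in only one of the two curves are skipped, since no pair can be formed for them.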

Properties of ASR Performance • When loss probability is low • Recognition ratio changes slowly • Possibly due to robustness in ViaVoice • MOS prediction is less accurate in such cases • Importance of voice training method • Training audio should use the same codec as the test audio

Speaker Dependence in ASR • ViaVoice SDK cites 90% accuracy for • An average speaker without a heavy accent • Sampling at 22 kHz, PCM linear-16 • For speaker A, we achieved • About 42% accuracy with no packet loss • Reasons: • 8 kHz sampling + G.729 compression • Accent + speaking rate • Low absolute accuracy does not interfere with MOS prediction, but speaker dependence needs to be checked

Speaker Dependence Check • Absolute recognition ratio depends on the speaker • 70% for speaker B, but 42% for speaker A • But the relative recognition ratio Rrel is universal and speaker-independent

Rrel as Universal MOS Predictor • Mapping from relative recognition ratio Rrel to MOS
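
One way to use such a mapping is to store the measured (Rrel, MOS) points and interpolate between them. The slide does not say how the mapping is represented, so the piecewise-linear interpolation below is an assumption, and `predict_mos` is a hypothetical name. The curve points must come from actual listening tests; none are supplied here.

```python
from bisect import bisect_left


def predict_mos(r_rel, curve):
    """Estimate MOS for a given Rrel from measured (Rrel, MOS) points.

    Assumption: linear interpolation between neighboring points, with
    values outside the measured range clamped to the nearest endpoint.
    """
    points = sorted(curve)
    if len(points) < 2:
        raise ValueError("need at least two (Rrel, MOS) points")
    xs = [x for x, _ in points]
    if r_rel <= xs[0]:
        return points[0][1]
    if r_rel >= xs[-1]:
        return points[-1][1]
    i = bisect_left(xs, r_rel)  # guarantees xs[i - 1] < r_rel <= xs[i]
    (x0, y0), (x1, y1) = points[i - 1], points[i]
    return y0 + (y1 - y0) * (r_rel - x0) / (x1 - x0)
```

Clamping rather than extrapolating keeps the estimate inside the range of scores that listeners actually gave.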

Human Recognition Results • Listeners are asked to transcribe what they hear in addition to MOS grading. • Human recognition result curves are less “smooth” than MOS curves.

Human Results, contd. • Two flat regions in loss-human curve • 2-5% loss (some loss but not very high) • 10-15% loss (loss is already too high) • Mapping between machine and human recognition performance

Application Scenarios • Sender transmits a pre-recorded audio clip of a speaker known to the receiver. • Receiver does the following: • Looks up Rabs(0%) for this speaker • Performs speech recognition • Compares the result to the original text and computes Rrel • No need to store the original audio clip • Just the text is sufficient, so less storage is needed • Need not know packet loss probability • Suitable for e2e black-box measurements
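
The receiver-side steps above can be sketched as a single function. This is an illustration under stated assumptions: `recognize` stands in for whatever speech engine the receiver runs and is passed in as a plain callable, `baselines` is a hypothetical per-speaker table of Rabs(0%) values, and the word matching via `difflib.SequenceMatcher` is an assumed scoring rule.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict


def measure_r_rel(
    speaker: str,
    received_audio: bytes,
    reference_text: str,
    baselines: Dict[str, float],
    recognize: Callable[[bytes], str],
) -> float:
    """Compute Rrel for one received clip, without knowing the loss rate."""
    # Step 1: look up the loss-free recognition ratio for this speaker.
    r_abs_zero = baselines[speaker]
    if r_abs_zero <= 0:
        raise ValueError("baseline recognition ratio must be positive")
    # Step 2: run speech recognition on the audio as it arrived.
    hypothesis = recognize(received_audio)
    # Step 3: compare against the stored text only; no original audio needed.
    ref = reference_text.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference text is empty")
    matcher = SequenceMatcher(None, ref, hyp, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return (matched / len(ref)) / r_abs_zero
```

Only the reference text and one baseline number per speaker are stored at the receiver, which is what makes the approach usable as an end-to-end black-box probe.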

Conclusions • Evaluation of speech recognition performance as a MOS predictor • Used ViaVoice speech engine • Performance metric: word recognition ratio • The relative word recognition ratio is a universal, speaker-independent metric • Also analyzed human recognition performance • Future work: evaluate other codecs, e.g., G.726, GSM.
