This study explores using speech recognition to predict VoIP quality in real time, relating recognition performance to MOS and packet loss. It evaluates the ViaVoice engine, the effect of the training method, and speaker dependence, compares human and machine recognition results, and closes with application scenarios and conclusions.
Using Speech Recognition to Predict VoIP Quality Wenyu Jiang IRT Lab April 3, 2002
Introduction to Voice Quality • Quality factors in Voice over IP (VoIP) • Packet loss, delay, and jitter • Choice of voice codec • Quality metric: Mean Opinion Score (MOS) • Widely used • Human based • Time consuming • Labor intensive • Results not available in real time
Motivation • Features of a speech recognizer: • Automatic speech recognition (ASR), no human listeners needed • Accuracy of recognition is apparently coupled with the quality of input speech • Recognition can be done in real-time, allowing online quality monitoring. • Recognition performance may be related to speech intelligibility as well as quality.
Related Work • ITU-T E-model [G.107/G.108] • An analytical model for estimating perceived quality • Provides loss-to-MOS mapping for some common codecs (G.729, G.711, G.723.1). • Chernick et al. studied speech recognition performance with the DoD-CELP codec • Effect of bit error rate instead of packet loss • Phoneme (instead of word) recognition ratio • Some MOS results, but not accurate enough
Experiment Setup • Speech recognition engine • IBM ViaVoice on Linux • Wrote software for both voice model training and performance testing • Training and Testing • 2 scripts: #1 for training, #2 for testing. • 2 speakers, A and B; each reads both scripts. • Script #2 is split into 25 audio clips, with 5 clips per loss condition (0%, 2%, 5%, 10%, 15%) • Codec: G.729 • Training uses G.729-processed audio
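The clip layout above can be sketched in a few lines. This is only an illustration: the slides do not say which clips went to which loss condition or which loss model was used, so the consecutive assignment in `assign_clips` and the independent (Bernoulli) drops in `drop_packets` are assumptions, and both function names are hypothetical.

```python
import random

# Loss conditions and clip counts as listed on the slide.
LOSS_RATES = [0.00, 0.02, 0.05, 0.10, 0.15]
CLIPS_PER_CONDITION = 5


def assign_clips(num_clips=25):
    """Map clip index -> loss rate, 5 consecutive clips per condition (assumed order)."""
    return {i: LOSS_RATES[i // CLIPS_PER_CONDITION] for i in range(num_clips)}


def drop_packets(packets, loss_rate, seed=0):
    """Replace each packet with None independently with probability loss_rate.

    Independent drops are an assumption; the slides do not name the loss model.
    """
    rng = random.Random(seed)
    return [None if rng.random() < loss_rate else pkt for pkt in packets]
```

A lost packet is kept as `None` so that the position of each gap is preserved for whatever concealment the decoder applies.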
Experiment Setup, contd. • Performance metric • Absolute word recognition ratio • Relative word recognition ratio • p is packet loss probability • MOS listening tests: 22 listeners
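The formulas for the two metrics did not survive in this transcript. The sketch below assumes that the absolute ratio is the fraction of reference words recognized correctly and that the relative ratio is Rabs(p) divided by Rabs(0%), which matches how Rabs(0%) is used in the application scenario later. The word matching via `difflib.SequenceMatcher` is a stand-in, not necessarily the scoring used in the study.

```python
from difflib import SequenceMatcher


def absolute_ratio(reference: str, recognized: str) -> float:
    """Fraction of reference words matched, in order, in the recognized text."""
    ref = reference.lower().split()
    hyp = recognized.lower().split()
    if not ref:
        raise ValueError("reference text is empty")
    matcher = SequenceMatcher(a=ref, b=hyp, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(ref)


def relative_ratio(r_abs_at_p: float, r_abs_at_zero: float) -> float:
    """Assumed definition: Rrel(p) = Rabs(p) / Rabs(0%)."""
    if r_abs_at_zero <= 0:
        raise ValueError("baseline ratio must be positive")
    return r_abs_at_p / r_abs_at_zero
```

Normalizing by the no-loss baseline is what lets speakers with very different absolute accuracy be compared on one scale.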
Recognition Ratio vs. MOS • Both MOS and Rabs decrease as packet loss increases • Then, eliminate the intermediate variable p to map Rabs directly to MOS
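Eliminating p amounts to joining the two measured curves on their shared loss values. A minimal sketch, assuming each curve is stored as a dictionary keyed by loss probability; the function name is hypothetical and no measured values from the study are reproduced here.

```python
def eliminate_p(r_abs_by_loss, mos_by_loss):
    """Pair Rabs(p) with MOS(p) for every loss value p present in both curves.

    Returns (Rabs, MOS) pairs ordered by increasing p, so p itself drops out.
    """
    shared = sorted(set(r_abs_by_loss) & set(mos_by_loss))
    return [(r_abs_by_loss[p], mos_by_loss[p]) for p in shared]
```

Loss values measured for only one of the two curves are skipped rather than interpolated.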
Properties of ASR Performance • When loss probability is low • Recognition ratio changes slowly • Possibly due to robustness in ViaVoice • MOS prediction is less accurate in such cases • Importance of voice training method • Training audio should use the same codec as the test audio
Speaker Dependence in ASR • ViaVoice SDK cites a 90% accuracy for • Average speaker without a heavy accent • Sampling at 22 kHz, PCM linear-16 • For speaker A, we achieved • About 42% accuracy with no packet loss • Reasons: • 8 kHz sampling + G.729 compression • Accent + speaking rate • Low absolute accuracy does not interfere with MOS prediction, but speaker dependence needs to be checked
Speaker Dependence Check • Absolute recognition ratio is dependent on the speaker • 70% for speaker B, but 42% for speaker A • But the relative recognition ratio Rrel is universal and speaker-independent
Rrel as Universal MOS Predictor • Mapping from relative recognition ratio Rrel to MOS
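The mapping curve on this slide is not reproduced in the transcript. One simple way to turn measured (Rrel, MOS) points into a predictor is piecewise-linear interpolation, sketched below. The interpolation scheme and the clamping at both ends are assumptions made for illustration, and `predict_mos` is a hypothetical name.

```python
from bisect import bisect_left


def predict_mos(r_rel, curve):
    """Interpolate MOS at r_rel from a list of measured (Rrel, MOS) points."""
    if not curve:
        raise ValueError("curve has no points")
    pts = sorted(curve)
    xs = [x for x, _ in pts]
    # Clamp outside the measured range instead of extrapolating.
    if r_rel <= xs[0]:
        return pts[0][1]
    if r_rel >= xs[-1]:
        return pts[-1][1]
    i = bisect_left(xs, r_rel)
    (x0, y0), (x1, y1) = pts[i - 1], pts[i]
    return y0 + (y1 - y0) * (r_rel - x0) / (x1 - x0)
```

Clamping keeps the prediction inside the range of MOS values that were actually measured.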
Human Recognition Results • Listeners are asked to transcribe what they hear in addition to MOS grading. • Human recognition result curves are less “smooth” than MOS curves.
Human Results, contd. • Two flat regions in the curve of human recognition vs. loss • 2-5% loss (some loss but not very high) • 10-15% loss (loss is already too high) • Mapping between machine and human recognition performance
Application Scenarios • Sender transmits a pre-recorded audio clip of a speaker known to the receiver. • Receiver does the following: • Looks up Rabs(0%) for this speaker • Performs speech recognition • Compares the result to the original text and computes Rrel • No need to store the original audio clip • Just the text is sufficient, so less storage is needed • Need not know packet loss probability • Suitable for e2e black-box measurements
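The receiver-side steps can be sketched as follows. This is a sketch under stated assumptions: `recognize` stands for whatever speech engine the receiver runs and is passed in as a plain callable (no ViaVoice API is implied), `baseline_r_abs` is a hypothetical table of per-speaker Rabs(0%) values, and the word matching and the definition Rrel = Rabs / Rabs(0%) are inferred rather than taken from the slides.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict


def word_recognition_ratio(reference: str, recognized: str) -> float:
    """Fraction of reference words matched, in order, in the recognized text."""
    ref = reference.lower().split()
    hyp = recognized.lower().split()
    if not ref:
        raise ValueError("reference text is empty")
    matcher = SequenceMatcher(a=ref, b=hyp, autojunk=False)
    return sum(m.size for m in matcher.get_matching_blocks()) / len(ref)


def measure_r_rel(
    received_audio: bytes,
    speaker_id: str,
    reference_text: str,
    baseline_r_abs: Dict[str, float],
    recognize: Callable[[bytes], str],
) -> float:
    """Compute Rrel for one received clip without knowing the loss rate."""
    r_abs_zero = baseline_r_abs[speaker_id]   # step 1: look up Rabs(0%)
    recognized = recognize(received_audio)    # step 2: run speech recognition
    r_abs = word_recognition_ratio(reference_text, recognized)  # step 3: compare to text
    return r_abs / r_abs_zero                 # assumed: Rrel = Rabs / Rabs(0%)
```

Only the reference text and one baseline number per speaker are stored at the receiver, which is the storage saving the slide points out.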
Conclusions • Evaluation of speech recognition performance as a MOS predictor • Used ViaVoice speech engine • Performance metric: word recognition ratio • The relative word recognition ratio is a universal, speaker-independent metric • Also analyzed human recognition performance • Future work: evaluate other codecs, e.g., G.726, GSM.