
Experiments on the "stir-sir" paradigm using large-vocabulary ASR

This study tests large-vocabulary automatic speech recognition (ASR) in the "stir-sir" paradigm using a newly trained English-English ASR system. The experiments adapt "dry" acoustic models to near-near and far-far conditions and evaluate three things: free recognition, a forced choice between the "sir" and "stir" sentences, and the temporally averaged log-probability of "t". The results are so far unconvincing: free recognition is poor, especially in the far-far condition, and sensitivity similar to that of human listeners may be hard to obtain.




Presentation Transcript


  1. Experiments on the "stir-sir" paradigm using large-vocabulary ASR
     Kalle Palomäki
     Adaptive Informatics Research Centre, Helsinki University of Technology

  2. Introduction
     • Aim: test large-vocabulary ASR in the "stir-sir" paradigm
     • Motivation: large-vocabulary ASR has learned phoneme models close to those of humans
     • ASR: a newly trained English-English large-vocabulary recogniser
       • Trained on read Wall Street Journal articles
       • Sampling rate: 16 kHz

  3. ASR details
     • Standard features: mel-frequency cepstral coefficients (MFCCs) + power + deltas + accelerations
     • Triphone HMMs, with the acoustic likelihood modelled by Gaussian mixture models
     • Supervised adaptation using constrained maximum likelihood linear regression (CMLLR)
       • Can be formulated as a linear feature transformation
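As a rough illustration of the two feature-level steps named on this slide, the sketch below appends delta and acceleration coefficients to static features and applies an affine map of the form x' = A x + b, which is how a CMLLR transform acts on the features. It is a minimal sketch under stated assumptions, not the recogniser's actual front end: the function names, the regression window of 2 frames, and the use of plain Python lists are all illustrative choices, and estimating A and b from adaptation data is not shown.

```python
def regression_deltas(frames, window=2):
    """Regression-based time derivative over +/- `window` frames.

    frames: list of feature vectors (lists of floats), one per time frame.
    Frames beyond either end are replaced by the nearest edge frame.
    The window size of 2 is an assumption, not taken from the slides.
    """
    num_frames = len(frames)
    denom = 2.0 * sum(k * k for k in range(1, window + 1))
    deltas = []
    for t in range(num_frames):
        vec = [0.0] * len(frames[t])
        for k in range(1, window + 1):
            ahead = frames[min(t + k, num_frames - 1)]
            behind = frames[max(t - k, 0)]
            for d in range(len(vec)):
                vec[d] += k * (ahead[d] - behind[d])
        deltas.append([v / denom for v in vec])
    return deltas


def add_deltas(static, window=2):
    """Append deltas and accelerations to each static feature vector."""
    deltas = regression_deltas(static, window)
    accels = regression_deltas(deltas, window)
    return [s + d + a for s, d, a in zip(static, deltas, accels)]


def apply_cmllr(frames, A, b):
    """Apply the affine feature transform x' = A x + b to every frame.

    A (list of rows) and b are assumed to have been estimated elsewhere.
    """
    return [
        [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) + b_i
         for row, b_i in zip(A, b)]
        for x in frames
    ]
```

Because the transform acts on the features rather than on the model parameters, the same "dry" acoustic models can be reused unchanged for each adapted condition.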

  4. Experiments
     • Three things tested:
       • Free recognition result
       • Recogniser chooses between "next_you'll_get_sir_to_click_on" and "next_you'll_get_stir_to_click_on"
       • Temporally averaged log-probability of "t"
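The second and third measures can be sketched in a few lines. The code below assumes the recogniser has already produced a total log-likelihood for each candidate sentence and a per-frame log-probability track for the "t" model. Both inputs, the function names, and the numbers in the usage example are hypothetical and only illustrate the bookkeeping.

```python
def forced_choice(hypothesis_scores):
    """Return the candidate sentence with the highest total log-likelihood.

    hypothesis_scores: dict mapping a sentence label to its log-likelihood.
    """
    return max(hypothesis_scores, key=hypothesis_scores.get)


def mean_log_prob(frame_log_probs, start, end):
    """Average per-frame log-probability over frames [start, end)."""
    segment = frame_log_probs[start:end]
    if not segment:
        raise ValueError("segment contains no frames")
    return sum(segment) / len(segment)


# Hypothetical scores for one stimulus.
scores = {
    "next_you'll_get_sir_to_click_on": -1520.0,
    "next_you'll_get_stir_to_click_on": -1498.5,
}
chosen = forced_choice(scores)

# Hypothetical per-frame log-probabilities of "t"; frames 1 and 2 are
# taken to be the aligned "t" segment.
t_score = mean_log_prob([-4.0, -2.0, -6.0, -8.0], 1, 3)
```

Averaging over time makes the "t" score comparable across stimuli whose aligned segments differ in length.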

  5. Experiments
     • Experiment 1: "dry" models with no adaptation
     • Experiment 2: "dry" models adapted to the right (matching) condition
       • Near-near adapted with near-near
       • Far-far adapted with far-far
       • Supervised adaptation with utterances at the ends of the continuum
     • Experiment 3: "dry" models adapted to both near-near and far-far
       • Supervised adaptation with utterances at the ends of the continuum

  6. Exp. 1: "dry" models, no adaptation
     • Free recognition:
       • Near-near: "nantz two-a-days so far", "nursing care so far"
       • Far-far: "nantz th", "NMS death", ""
     • Choice between "next_you'll_get_sir_to_click_on", "next_you'll_get_stir_to_click_on" and the silence model:
       • Near-near: the choice changes between conditions 08 and 09
       • Far-far: everything is recognised as silence
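A switch such as the one reported here, where the near-near choice changes between conditions 08 and 09, can be located automatically once the forced-choice result is known for every step of the continuum. The sketch below is one hedged way to do that. The function name, the condition labels, and the direction of the switch in the example data are hypothetical and are not taken from the experiment.

```python
def switch_points(results):
    """Return pairs of adjacent conditions between which the choice changes.

    results: list of (condition_label, chosen_word) tuples, ordered along
    the continuum.
    """
    return [
        (results[i][0], results[i + 1][0])
        for i in range(len(results) - 1)
        if results[i][1] != results[i + 1][1]
    ]


# Hypothetical forced-choice outcomes along part of the continuum.
near_near = [("07", "sir"), ("08", "sir"), ("09", "stir"), ("10", "stir")]
boundaries = switch_points(near_near)
```

A single pair indicates one category boundary. An empty list means the recogniser never changed its answer, and several pairs mean the choice flips back and forth along the continuum.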

  7. Exp. 1: "dry" models, no adaptation

  8. Exp. 2: "dry" models, adapted to the right cond.
     • Free recognition:
       • Near-near: "next month though the khon"
       • Far-far: "next he'll throw the khon"
     • Choice between "next_you'll_get_sir_to_click_on", "next_you'll_get_stir_to_click_on" and the silence model:
       • Near-near: the choice changes between conditions 03 and 04
       • Far-far: "sir" all the time

  9. Exp. 2: "dry" models, adapted to the right cond.

  10. Exp. 3: "dry" models, adapted to both
      • Free recognition:
        • Near-near: "next month though the khon"
        • Far-far: "next month khon" or "nantz khon"
      • Choice between "next_you'll_get_sir_to_click_on", "next_you'll_get_stir_to_click_on" and the silence model:
        • Switches between the sentences oddly

  11. Exp. 3: "dry" models, adapted to both

  12. Discussion & future directions
      • Currently "unconvincing"
        • Poor free recognition performance
        • Especially poor far-far performance
      • It may be hard to obtain sensitivity similar to that of human listeners
      • Tricks to get around the poor performance:
        • Cooke (2006) uses a priori masks to find glimpses of speech
        • Choose between two sentences rather than free recognition
        • Measure log-probability instead of recognition performance
      • How to model compensation, which is the main issue
