
Presentation Transcript


  1. University of Sheffield. M4 speech recognition. Vincent Wan, Martin Karafiát.

  2. The Recogniser
  [Block diagram] Front end feeding two decoding passes, each with MLLR adaptation (HTK) and its own recognition output:
  • First pass: word internal triphone models, trigram language model (SRILM), time synchronous decoding (HTK), n-best lattice generation
  • Second pass: cross word triphone models, lattice rescoring, best first decoding (Ducoder)
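The block diagram condenses the whole pipeline, so a small illustration of the second pass may help: re-ranking an n-best list with a stronger acoustic model. Every name, score, and weight in the sketch below is invented for illustration; this is not HTK or Ducoder code.

```python
# Minimal sketch of n-best rescoring, the idea behind the second pass.
# All names, scores, and weights are invented; not HTK or Ducoder code.
from dataclasses import dataclass

@dataclass
class Hypothesis:
    words: tuple       # recognised word sequence
    am_score: float    # first-pass acoustic log-likelihood
    lm_score: float    # trigram LM log-probability

def rescore_nbest(nbest, crossword_am, lm_weight=12.0, ins_penalty=-0.5):
    """Re-rank hypotheses using a second-pass acoustic scorer.
    crossword_am maps a word sequence to a new acoustic score."""
    def total(hyp):
        return (crossword_am(hyp.words)            # replace first-pass AM score
                + lm_weight * hyp.lm_score         # scaled LM score
                + ins_penalty * len(hyp.words))    # word insertion penalty
    return max(nbest, key=total)

# Toy 2-best list and a stand-in scorer; the first hypothesis wins on
# its better language model score.
nbest = [Hypothesis(("we", "will", "meet"), -120.0, -3.1),
         Hypothesis(("we", "will", "eat"), -118.0, -4.0)]
print(rescore_nbest(nbest, crossword_am=lambda w: -110.0 - len(w)).words)
```

In the real system the second-pass scores come from the cross word triphone models, and the language model weight and insertion penalty are among the hyper-parameters that the next slide says must be tuned manually.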

  3. System limitations
  • N-best list rescoring is not optimal: the correct hypothesis may already have been pruned from the first pass's n-best list
  • Adaptation must be performed on two sets of acoustic models
  • Many more hyper-parameters to tune manually
  • SRILM is not efficient on very large language models (greater than 10^9 words)
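On the last limitation, a sense of why model size tracks training data: a backoff trigram model of the kind SRILM writes (ARPA format) keeps one entry per n-gram seen in training, so a corpus beyond 10^9 words means holding roughly hundreds of millions of entries in memory. The lookup below is a toy sketch with invented probabilities, not SRILM code.

```python
# Toy backoff trigram lookup in the style of an ARPA-format model (the
# format SRILM writes). All probabilities here are invented.
logp3 = {("we", "will", "meet"): -0.4}                 # log10 P(w3 | w1 w2)
logp2 = {("will", "meet"): -0.9}                       # log10 P(w3 | w2)
bow2  = {("we", "will"): -0.3}                         # bigram backoff weights
logp1 = {"meet": -2.1, "we": -1.8, "will": -1.5}       # log10 P(w3)
bow1  = {"will": -0.2, "we": -0.4}                     # unigram backoff weights

def trigram_logprob(w1, w2, w3):
    """Katz-style backoff: use the trigram if it was seen in training,
    else back off to the bigram, else to the unigram."""
    if (w1, w2, w3) in logp3:
        return logp3[(w1, w2, w3)]
    if (w2, w3) in logp2:
        return bow2.get((w1, w2), 0.0) + logp2[(w2, w3)]
    return bow1.get(w2, 0.0) + logp1.get(w3, -99.0)    # -99 ≈ unseen word

print(trigram_logprob("we", "will", "meet"))    # seen trigram: -0.4
print(trigram_logprob("you", "will", "meet"))   # backs off to bigram: -0.9
```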

  4. Advances since last meeting
  • Models trained on two databases
    • SWITCHBOARD recogniser: acoustic & language models trained on 200 hours of speech
    • ICSI meetings recogniser: acoustic models trained on 40 hours of speech; language model is a combination of SWB and ICSI
  • Improvements mainly affect the Switchboard models
  • 16kHz sampling rate used throughout
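The slide says the meeting language model "is a combination of SWB and ICSI" without saying how. Linear interpolation is the usual way to combine two n-gram models, so the sketch below assumes it; the mixing weight is hypothetical and would normally be tuned on held-out meeting data.

```python
import math

# Hypothetical sketch of combining two n-gram models by linear
# interpolation: P = lam * P_swb + (1 - lam) * P_icsi, computed in log
# space. The weight lam is a placeholder; it would normally be chosen
# to minimise perplexity on held-out meeting transcripts.
def interpolated_logprob(logp_swb, logp_icsi, lam=0.5):
    return math.log(lam * math.exp(logp_swb)
                    + (1.0 - lam) * math.exp(logp_icsi))

# Example: the same word is 10x likelier under the ICSI meetings model.
print(interpolated_logprob(math.log(0.001), math.log(0.01)))
```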

  5. Advances since last meeting
  • Adaptation of word internal context dependent models
  • Unified the phone sets and pronunciation dictionaries
  • Improved the pronunciation dictionary for Switchboard: now using the ICSI dictionary with missing pronunciations imported from the ISIP dictionary
  • Better handling of multiple pronunciations during acoustic model training
  • General bug fixes
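As a concrete picture of the dictionary improvement, the sketch below merges two pronunciation dictionaries the way the slide describes: ICSI entries are kept, and ISIP entries fill in only the missing words. The word-then-phones line format and the toy entries are assumptions, and as the slide notes the phone sets must be unified before a merge like this makes sense.

```python
import io

def load_dict(f):
    """Parse a pronunciation dictionary, one pronunciation per line:
    the word followed by its phones (an assumed, HTK-like layout)."""
    prons = {}
    for line in f:
        word, *phones = line.split()
        prons.setdefault(word, []).append(phones)
    return prons

def merge(icsi, isip):
    """Keep every ICSI entry; import ISIP entries only for missing words."""
    merged = dict(icsi)
    for word, prons in isip.items():
        if word not in merged:
            merged[word] = prons
    return merged

# Toy data standing in for the real dictionaries.
icsi = load_dict(io.StringIO("meeting m iy t ih ng\n"))
isip = load_dict(io.StringIO("meeting m iy dx ih ng\nagenda ax jh eh n d ax\n"))
print(sorted(merge(icsi, isip)))   # ['agenda', 'meeting']; ICSI wins on overlap
```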

  6. Results overview
  [Table of % word error rates, not preserved in this transcript. * Results from lapel mics. † Results from beam former.]

  7. Results: adaptation vs. direct training on ICSI
  [Table of % word error rates, not preserved in this transcript. * Results from Ducoder using all pruning.]

  8. Acoustic model adaptation issue
  • Acoustic models presently do not adapt well
  • Better MLLR code required (next slide)
  • More training data required
  • Need to make better use of the combined ICSI/SWB training data for M4
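For context on the MLLR code mentioned here: MLLR adapts a speaker-independent model by estimating an affine transform of the Gaussian means from a small amount of adaptation data. The sketch below is a textbook simplification (one global mean transform, identity covariances, made-up statistics), not HTK's estimator, which additionally weights each Gaussian's statistics by its covariance and supports regression-class trees.

```python
import numpy as np

# Simplified global MLLR mean transform under an identity-covariance
# assumption. Each Gaussian m contributes its occupation count gamma[m],
# its old mean mu[m], and the mean xbar[m] of the adaptation frames
# aligned to it. We fit mu_new = W @ [1; mu] by weighted least squares.
def estimate_mllr_mean_transform(gamma, mu, xbar):
    n, d = mu.shape
    xi = np.hstack([np.ones((n, 1)), mu])    # extended means [1, mu]
    G = (xi * gamma[:, None]).T @ xi         # (d+1, d+1) normal matrix
    K = (xi * gamma[:, None]).T @ xbar       # (d+1, d) cross term
    return np.linalg.solve(G, K).T           # (d, d+1) transform W

def adapt_means(W, mu):
    """Apply the estimated transform to every Gaussian mean."""
    xi = np.hstack([np.ones((mu.shape[0], 1)), mu])
    return xi @ W.T

# Toy check with random statistics: the estimator recovers a known W.
rng = np.random.default_rng(0)
mu = rng.normal(size=(100, 3))
true_W = np.hstack([rng.normal(size=(3, 1)), np.eye(3) * 1.1])
xbar = np.hstack([np.ones((100, 1)), mu]) @ true_W.T
gamma = rng.uniform(1, 10, size=100)
W = estimate_mllr_mean_transform(gamma, mu, xbar)
print(np.allclose(W, true_W))                # True
```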

  9. Other news
  • The next version of HTK's adaptation code will be made available to M4 before the official public release
  • Sheffield to acquire the HTK LVCSR decoder
    • Licensing issues to be resolved
    • May be able to make binaries available to M4 partners
