Javier Ferreiros, Javier Macías-Guarasa, José M. Pardo (GTH UPM), Luis Villarrubia (Telefónica I+D)

ESCA Tutorial & Research Workshop: Modelling pronunciation variation for ASR. INTRODUCING MULTIPLE PRONUNCIATIONS IN SPANISH SPEECH RECOGNITION SYSTEMS.


Presentation Transcript


  1. ESCA Tutorial & Research Workshop: Modelling pronunciation variation for ASR. INTRODUCING MULTIPLE PRONUNCIATIONS IN SPANISH SPEECH RECOGNITION SYSTEMS. Javier Ferreiros, Javier Macías-Guarasa, José M. Pardo (GTH UPM), Luis Villarrubia (Telefónica I+D)

  2. Presentation Contents
     • Introduction
     • The strategy applied
     • CSR
        • Task
        • System Architecture
        • Results
     • ISR
        • Task
        • System Architecture
        • Results
     • Conclusions and Future Work

  3. Introduction (I)
     • Pronunciation variation is a common source of recognition errors
     • Rule-based strategy to incorporate pronunciation alternatives for Spanish
     • Phonetic rules covering actual speaking habits and context dependencies (not dialectal variation) have been explored
     • Alternative pronunciations can be found even within the same speaker

  4. Introduction (II)
     • The lexicon should consider these different possibilities even within the same dialect
     • It is important to study the impact of the rules on the lexicon
     • Nearly 20% error rate reduction for the continuous speech task
     • No significant change for the isolated word hypothesis generator

  5. The strategy applied (I)
     • Grapheme-to-allophone transcriber for continuous speech and multiple pronunciations
     • It deals with coarticulation and assimilation effects at word boundaries in continuous speech
     • Rules are accurate enough for Spanish because the transformation from grapheme to allophone is straightforward
     • Rules are selected according to expert linguistic knowledge of the Castilian Spanish speaking style

  6. The strategy applied (II)
     • Examples of variations considered:
     • DIFFERENT HABITS: examen: /e k s a m e n/
        • [e k s á m e~ n]
        • [e ɣ s á m e~ n]
        • [e s á m e~ n]
     • CONTEXT DEPENDENT: bote: /b o t e/
        • un bote: [ú m b ó t e]
        • el bote: [e l β ó t e]
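The two kinds of rule in the examples above can be illustrated with a minimal Python sketch. It is not the authors' transcriber: the rule table, the function names `variants` and `join_words`, and the ASCII symbols `G` and `B` (stand-ins for the weakened velar and bilabial allophones) are assumptions made for illustration, and stress and nasalisation marks are omitted.

```python
from itertools import product

# Hypothetical optional rules: a phoneme sequence mapped to the allophone
# sequences a speaker may produce for it. "G" is an ASCII stand-in, not the
# slides' notation.
OPTIONAL_RULES = {
    ("k", "s"): [["k", "s"], ["G", "s"], ["s"]],
}


def variants(phonemes):
    """Return every allophone sequence obtainable from the optional rules."""
    slots, i = [], 0
    while i < len(phonemes):
        pair = tuple(phonemes[i:i + 2])
        if pair in OPTIONAL_RULES:
            slots.append(OPTIONAL_RULES[pair])  # several alternatives
            i += 2
        else:
            slots.append([[phonemes[i]]])       # a single alternative
            i += 1
    # Cartesian product of the alternatives, flattened into one sequence each.
    return [[a for part in combo for a in part] for combo in product(*slots)]


def join_words(left, right):
    """Apply two toy word-boundary rules to a pair of adjacent words."""
    left, right = list(left), list(right)
    if left and right and right[0] == "b":
        if left[-1] == "n":
            left[-1] = "m"   # nasal assimilates to the following bilabial
        else:
            right[0] = "B"   # /b/ weakens when it does not follow a nasal
    return left + right


print(variants("e k s a m e n".split()))
print(join_words(["u", "n"], ["b", "o", "t", "e"]))
print(join_words(["e", "l"], ["b", "o", "t", "e"]))
```

Optional rules multiply the number of lexicon entries for a word, whereas boundary rules choose one realisation from the neighbouring word, which is why the latter only matter for continuous speech.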

  7. The strategy applied (III)
     • We searched empirically for the minimum number of rules that produces significant improvements, in order to limit the increase in lexicon size (i.e. perplexity)
     • For the isolated word hypothesis generator, the number of rules had to be reduced further so as not to worsen the recognition rates

  8. CSR Task
     • Domain: Navy Resources Management in Spanish
     • Speaker-dependent task
     • Training: 600 sentences, 4 speakers
     • Test: 100 sentences, the same 4 speakers
     • Base dictionary size: 979 words
     • Extended dictionary size: 1211 words (+23.7%)

  9. CSR System Architecture
     • One-pass algorithm without any grammar
     • In the lexicon, some words have several entries, each with an alternative allophone sequence
     • (10 MFCC + Energy), delta and delta2 parameter sets in 3 different codebooks with 256 centroids each
     • Discrete and semicontinuous HMMs for basic allophones (47) and triphones (350)
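One way to picture a lexicon in which "some words have several entries" is sketched below. The entry-id scheme (`word#k`) and the helpers `build_lexicon` and `collapse` are hypothetical, and the allophone strings are placeholders; the slides do not describe how the actual system stores its entries.

```python
def build_lexicon(pronunciations):
    """Flatten {word: [allophone sequences]} into separate decoder entries.

    Each entry is (entry_id, word, allophones). Variants of one word get
    distinct ids but keep the same word label.
    """
    entries = []
    for word, variant_list in pronunciations.items():
        for k, allophones in enumerate(variant_list, start=1):
            entries.append((f"{word}#{k}", word, tuple(allophones)))
    return entries


def collapse(entry_ids, entries):
    """Map a decoded sequence of entry ids back to plain words for scoring."""
    word_of = {entry_id: word for entry_id, word, _ in entries}
    return [word_of[entry_id] for entry_id in entry_ids]


# Placeholder pronunciations in an ASCII notation chosen for this sketch.
lexicon = build_lexicon({
    "examen": [["e", "k", "s", "a", "m", "e", "n"], ["e", "s", "a", "m", "e", "n"]],
    "bote": [["b", "o", "t", "e"]],
})
print(len(lexicon))                             # number of decoder entries
print(collapse(["examen#2", "bote#1"], lexicon))
```

With this layout a decoder can treat every variant as an ordinary entry, and the word error rate is computed after the variants are collapsed back to words.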

  10. CSR Results

  11. ISR Task
     • Domain: proper names, telephone environment
     • Hypothesis / Verification scheme
     • Tested on the Hypothesis Generator so far
     • Training: 5800 words, 3000 speakers
     • Test: 2500 words, 2250 speakers
     • Base dictionary size: 1175 words
     • Extended dictionary size: 1266 words (+7.7%) with the same rules as in the CSR task, and 1193 words (+1.5%) when some rules are excluded
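The lexicon growth figures quoted for both tasks follow from the dictionary sizes on the task slides. A short check, with `growth_percent` as an illustrative helper name:

```python
def growth_percent(base, extended):
    """Relative increase in lexicon entries, as a percentage of the base size."""
    return 100.0 * (extended - base) / base


cases = [
    ("CSR, all rules", 979, 1211),
    ("ISR, same rules as CSR", 1175, 1266),
    ("ISR, reduced rule set", 1175, 1193),
]
for name, base, extended in cases:
    print(f"{name}: +{growth_percent(base, extended):.1f}%")
```

The same rule set expands the proper-name dictionary far less than the CSR dictionary, and the reduced rule set adds only 18 entries.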

  12. ISR Hypothesis Generator (I)
     • 8 MFCC + Energy and 8 delta MFCC + delta Energy, in 2 codebooks of 256 centroids each
     • The Phonetic String Build-Up (PSBU) module very quickly generates a string of alphabet units (53 allophone-like units)
     • Lexical Access: DP algorithm that matches the phonetic string against the dictionary, which may include multiple pronunciations
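The lexical access step can be sketched as an edit-distance alignment between the phonetic string and every pronunciation in the dictionary, keeping the best score per word. This is a simplified stand-in: the uniform substitution, insertion and deletion costs, the function names, and the toy dictionary are assumptions, whereas the real system uses its own alignment costs.

```python
def align_cost(observed, reference, sub=1.0, ins=1.0, dele=1.0):
    """Edit-distance alignment cost between two unit sequences (row-by-row DP)."""
    # prev[j]: cost of aligning the observed prefix seen so far with reference[:j]
    prev = [j * dele for j in range(len(reference) + 1)]
    for i, o in enumerate(observed, start=1):
        cur = [i * ins]
        for j, r in enumerate(reference, start=1):
            cur.append(min(
                prev[j - 1] + (0.0 if o == r else sub),  # match or substitution
                prev[j] + ins,       # extra unit in the observed string
                cur[j - 1] + dele,   # reference unit missing from the observed string
            ))
        prev = cur
    return prev[-1]


def lexical_access(observed, dictionary, n_best=12):
    """Rank words by the cost of their best-matching pronunciation."""
    scored = []
    for word, pronunciations in dictionary.items():
        best = min(align_cost(observed, p) for p in pronunciations)
        scored.append((best, word))
    scored.sort()
    return [(word, cost) for cost, word in scored[:n_best]]


# Toy dictionary in an ASCII notation chosen for this sketch.
toy_dictionary = {
    "bote": [["b", "o", "t", "e"], ["B", "o", "t", "e"]],
    "bota": [["b", "o", "t", "a"], ["B", "o", "t", "a"]],
    "mesa": [["m", "e", "s", "a"]],
}
print(lexical_access(["B", "o", "t", "e"], toy_dictionary))
```

Taking the minimum over a word's pronunciations means an added variant can only lower that word's cost. It can also lower the cost of competing words, which is one reason extra variants do not automatically improve the candidate list.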

  13. ISR Hypothesis Generator (II) [Block diagram of the Hypothesis Generator. Labels: Speech; Preprocessing & VQ processes; Phonetic String Build-Up; Phonetic string; Lexical Access; List of Candidate Words; VQ books; HMMs; Durations; Dictionary; Indexes; Alignment costs]

  14. ISR Results for the 12 best hypotheses

  15. Conclusions and Future Work (I)
     • For CSR, selecting the appropriate model for each context is important when two words are concatenated: the rules generate different entries depending on context. For ISR these rules are not useful.
     • The acoustic model may not have enough resolution to take advantage of the alternatives proposed by the rules; these rules should work better in the verifier for ISR.

  16. Conclusions and Future Work (II)
     • It is important to study the real impact of the rules on the lexicon. For example, dialectal rules should reduce recognition error rates in a similar way for both CSR and ISR.
     • We want to test this kind of rule, plus dialectal variability rules, on the verifier stage of the ISR system.
