Javier Ferreiros, Javier Macías-Guarasa, José M. Pardo (GTH UPM), Luis Villarrubia (Telefónica I+D)

ESCA Tutorial & Research Workshop: Modelling Pronunciation Variation for ASR. INTRODUCING MULTIPLE PRONUNCIATIONS IN SPANISH SPEECH RECOGNITION SYSTEMS

Presentation Contents • Introduction • The strategy applied • CSR: Task, System Architecture, Results • ISR: Task, System Architecture, Results • Conclusions and Future Work

Introduction (I) • Pronunciation variation is a common source of recognition errors • Rule-based strategy to incorporate pronunciation alternatives for Spanish • Phonetic rules for actual speaking habits and context dependencies (not dialectal ones) have been explored • Alternate pronunciations can be found even within the same speaker

Introduction (II) • The lexicon should consider these different possibilities even within the same dialect • It is important to study the impact of the rules on the lexicon • Nearly 20% error rate reduction for the continuous speech task • No significant change for the isolated word hypothesis generator

The strategy applied (I) • Grapheme-to-allophone transcriber for continuous speech and multiple pronunciations • It deals with coarticulation and assimilation effects at word boundaries in continuous speech • Rules are accurate enough for Spanish because the transformation from grapheme to allophone is straightforward • Rules are selected according to expert linguistic knowledge of the Castilian Spanish speaking style

The strategy applied (II) • Examples of variations considered: • DIFFERENT HABITS: examen /e k s a m e n/ → [e k s á m e~ n], [e ɣ s á m e~ n], [e s á m e~ n] • CONTEXT DEPENDENT: bote /b o t e/ → un bote: [ú m b ó t e]; el bote: [e l β ó t e]
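The two kinds of rules in the examples above can be illustrated with a minimal sketch. It is not the authors' transcriber: the rule table, the function names `expand_pronunciations` and `join_words`, and the plain ASCII allophone symbols are assumptions made for illustration, and stress and nasalization marks are ignored. Only two variations taken from the slide are modelled: the optional reduction of /k s/ to [s] and the assimilation of a word-final /n/ to [m] before a word-initial /b/.

```python
from itertools import product

# Hypothetical optional "speaking habit" rules: each canonical allophone
# pattern maps to the alternatives allowed in its place (itself included).
HABIT_RULES = {
    ("k", "s"): [("k", "s"), ("s",)],  # /k s/ may be reduced to [s]
}


def expand_pronunciations(phonemes, rules=HABIT_RULES):
    """Return every variant obtained by applying or skipping each rule."""
    slots = []
    i = 0
    while i < len(phonemes):
        for pattern, alternatives in rules.items():
            if tuple(phonemes[i:i + len(pattern)]) == pattern:
                slots.append(alternatives)
                i += len(pattern)
                break
        else:
            # No rule matches here: the allophone is kept unchanged.
            slots.append([(phonemes[i],)])
            i += 1
    # One choice per slot, concatenated into a full pronunciation.
    return [sum(choice, ()) for choice in product(*slots)]


def join_words(left, right):
    """Word-boundary rule: final /n/ becomes [m] before an initial /b/."""
    left = list(left)
    if left and right and left[-1] == "n" and right[0] == "b":
        left[-1] = "m"
    return left + list(right)


variants = expand_pronunciations("e k s a m e n".split())
un_bote = join_words("u n".split(), "b o t e".split())
```

Here `variants` holds two lexicon entries for the same word, and `un_bote` shows a boundary effect that only appears once two words are concatenated, which is why such rules matter for continuous speech rather than for isolated words.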

The strategy applied (III) • To limit the increase in lexicon size (i.e. perplexity), we have empirically searched for the minimum number of rules that produces significant improvements • For the isolated word hypothesis generator, the number of rules had to be reduced further so as not to worsen the recognition rates

CSR Task • Domain: Navy Resources Management in Spanish • Speaker Dependent Task • Training: 600 sentences, 4 speakers • Test: 100 sentences, the same 4 speakers • Base dictionary size: 979 words • Extended dictionary size: 1211 words (+23.7%)

CSR System Architecture • One-pass algorithm without any grammar • In the lexicon some words have several entries, each with an alternative allophone sequence • (10 MFCC + Energy), delta and delta2 parameter sets in 3 different codebooks with 256 centroids each • Discrete and semicontinuous HMMs for basic allophones (47) and triphones (350)
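A lexicon in which some words have several entries can be pictured as a mapping from each word to a list of allophone sequences. The sketch below is an assumption about the layout, not the system's actual data format; the toy entries and the names `flatten` and `lexicon_growth` are hypothetical. It also reproduces the relative growth reported for the CSR dictionary (979 to 1211 entries).

```python
# Toy lexicon: word -> list of alternative allophone sequences.
# The entries are illustrative only.
lexicon = {
    "examen": [
        ["e", "k", "s", "a", "m", "e", "n"],
        ["e", "s", "a", "m", "e", "n"],
    ],
    "bote": [["b", "o", "t", "e"]],
}


def flatten(lexicon):
    """List one (word, pronunciation) pair per lexicon entry."""
    return [(word, tuple(p)) for word, prons in lexicon.items() for p in prons]


def lexicon_growth(base_entries, extended_entries):
    """Relative increase in entries, in percent, rounded to one decimal."""
    return round(100.0 * (extended_entries - base_entries) / base_entries, 1)


entries = flatten(lexicon)              # 3 entries for 2 words
csr_growth = lexicon_growth(979, 1211)  # 23.7, matching the CSR task figures
```

The recogniser treats every flattened entry as a separate search path, so each added pronunciation raises the effective vocabulary size even though the number of distinct words is unchanged.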

CSR Results

ISR Task • Domain: Proper Names, telephone environment • Hypothesis / Verification scheme • Tested on the Hypothesis Generator so far • Training: 5800 words, 3000 speakers • Test: 2500 words, 2250 speakers • Base dictionary size: 1175 words • Extended dictionary size: 1266 words (+7.7%) with the same rules as in the CSR task, and 1193 words (+1.5%) when some rules are excluded

ISR Hypothesis Generator (I) • 8 MFCC + Energy, 8 delta MFCC + delta Energy in 2 codebooks of 256 centroids each • The Phonetic String Build-Up (PSBU) module generates a string of alphabet units (53 allophone-like units) very fast • Lexical Access: DP algorithm to match the phonetic string against the dictionary, where multiple pronunciations may be included
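The lexical access step can be sketched as a dynamic-programming alignment between the phonetic string and every dictionary pronunciation, keeping the best-scoring pronunciation of each word. This is a simplified stand-in, not the system's algorithm: it assumes unit substitution, insertion and deletion costs where the real system uses trained alignment costs, and the names `alignment_cost`, `lexical_access` and the toy proper-name lexicon are hypothetical.

```python
def alignment_cost(observed, reference, sub=1.0, ins=1.0, dele=1.0):
    """Edit-distance style DP cost of aligning observed to reference units."""
    # prev[j]: cost of aligning an empty observed prefix to reference[:j].
    prev = [j * dele for j in range(len(reference) + 1)]
    for i, o in enumerate(observed, start=1):
        curr = [i * ins]
        for j, r in enumerate(reference, start=1):
            match = prev[j - 1] + (0.0 if o == r else sub)
            curr.append(min(match, prev[j] + ins, curr[j - 1] + dele))
        prev = curr
    return prev[-1]


def lexical_access(observed, lexicon, n_best=12):
    """Rank words by the cost of their best-matching pronunciation."""
    best = {
        word: min(alignment_cost(observed, p) for p in prons)
        for word, prons in lexicon.items()
    }
    # Sort by cost, breaking ties alphabetically for a stable ranking.
    return sorted(best.items(), key=lambda kv: (kv[1], kv[0]))[:n_best]


# Toy dictionary; "felix" has two hypothetical alternative pronunciations.
names = {
    "felix": [["f", "e", "l", "i", "k", "s"], ["f", "e", "l", "i", "s"]],
    "luis": [["l", "u", "i", "s"]],
    "maria": [["m", "a", "r", "i", "a"]],
}
candidates = lexical_access("f e l i s".split(), names, n_best=2)
```

With the alternative entry present, the reduced string matches "felix" at zero cost; with only the canonical entry, the same input would pay one deletion, which is how extra pronunciations can help or, by making words more confusable, hurt the candidate list.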

ISR Hypothesis Generator (II) • [Block diagram of the Hypothesis Generator. Stages: Speech, Preprocessing & VQ processes, Phonetic String Build-Up, Phonetic string, Lexical Access, List of Candidate Words. Resources: VQ books, HMMs, Durations, Dictionary, Indexes, Alignment costs]

ISR Results for the 12 best hypotheses

Conclusions and Future Work (I) • For CSR, selecting the appropriate model for each context is important when two words are concatenated: the rules produce different entries depending on context. For ISR these rules are not useful. • The acoustic model may not have enough resolution to take advantage of the alternatives proposed by the rules: these rules should work better in the verifier for ISR.

Conclusions and Future Work (II) • It is important to study the real impact of the rules on the lexicon. For example, dialectal rules should reduce recognition error rates in a similar way for both CSR and ISR. • We want to test this kind of rule, plus dialectal variability rules, on the verifier stage of the ISR system.
