Articulation Rate: Measures, Realised Lexical Form and Phone Classification in Spontaneous and Read German Speech

Jürgen Trouvain, Jacques Koreman, Attilio Erriquez and Bettina Braun
Universität des Saarlandes, Saarbrücken, Germany
{trouvain,koreman,erriquez,bebr}@coli.uni-sb.de

Introduction

[Figure: word error rate and number of utterances as a function of articulation rate (in standard deviations), from slow to fast]

• Speech rate affects the word error rate of automatic speech recognition systems.
• Error rates are higher for fast speech, but also for slow, hyperarticulated speech (Siegler and Stern, 1995; Mirghafori et al., 1995; Martínez et al., 1997; Pfau and Ruske, 1998; Alleva et al., 1998).

Aims

We address three questions:
• What linguistic unit should we use to quantify speech rate, and what domain is appropriate?
• What are the most important effects on the realisation of the lexical forms?
• How well are the acoustic models suited to different speech rates?

Database

The German Kiel Corpus of Read and Spontaneous Speech:
• manually labelled realised phones along with intended (canonical) transcriptions
• large parts also prosodically annotated
Only the segmentally and prosodically labelled parts were selected for this study.

Read:
• single sentences of variable length and two short stories
• 4 hours
• 53 speakers (27 male, 26 female)

Spontaneous:
• appointment-making dialogues
• 4 hours
• 42 speakers (24 male, 18 female)

Articulation rate measures

We investigated several linguistic units for measuring articulation rate as well as two different domains.

Linguistic unit

It is important to distinguish between intended and realised units. Intended forms can be easily derived from the canonical transcription of the uttered words, but their actual realisation can vary strongly:

Am blauen Himmel ziehen die Wolken (Engl. "In the blue sky wander the clouds")
• canonical form //: 10 syllables, 26 phones
• realised form []: 8 syllables, 20 phones

The following units were measured:
• intended word
• intended syllable
• realised syllable
• intended phone
• realised phone

The definition of what is and is not a unit is also problematic:
• Glottal stops are considered to be a phone (in contrast to laryngealisation).
• Due to the labelling conventions of the database, affricates are counted as two phones and diphthongs as one.
• Vowel-/r/ combinations are counted as two phones in the intended form, but as one in the realised form (except for schwa-/r/ combinations, which were labelled //).
• Realised syllables can be problematic: e.g. the //-syllable in „ziehen" in the example above can be realised as a syllabic or non-syllabic /n/, leading to different syllable counts (one and zero, respectively).

Domain

Articulation rate changes continuously while speaking and is not always constant within an utterance. We therefore use two prosodic domains (although it is clear that more local variation can and will occur even within these domains):
• inter-pause stretch (ips)
- The pauses which delimit an ips (pause, breathing, filled pause, lip smacks, coughing and other non-verbal articulations) are easy to determine in the label file and are often used to delimit the domain over which articulation rate is calculated (see the sketch below).
- ASR is primarily interested in decoding speech (not silence) from the information contained in the phone segments.
• intonational phrase (IP)
- The IP is considered an important planning unit, reflected by the intonation contour.
- Utterances must be labelled intonationally to obtain IPs, and the criteria for IPs can differ considerably between studies.
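The following is a minimal sketch, not the authors' tooling, of the ips-based measure: realised phones per second within each inter-pause stretch. The segment format and the set of pause-like labels are assumptions for illustration; the actual Kiel Corpus label conventions differ.

```python
# Sketch: realised phones/second per inter-pause stretch (ips).
# Assumed input: time-ordered (label, start_s, end_s) triples, where pause-like
# labels delimit the inter-pause stretches. Label names are hypothetical.

from typing import List, Tuple

PAUSE_LABELS = {"<p>", "<breath>", "<uhm>", "<smack>", "<cough>"}  # hypothetical set

def articulation_rates(segments: List[Tuple[str, float, float]]) -> List[float]:
    """Return realised phones/second for each inter-pause stretch."""
    rates = []
    current = []                     # realised phone segments of the current ips
    for label, start, end in segments:
        if label in PAUSE_LABELS:
            if current:              # a pause closes the current ips
                duration = current[-1][2] - current[0][1]
                rates.append(len(current) / duration)
                current = []
        else:
            current.append((label, start, end))
    if current:                      # final ips if the material ends without a pause label
        duration = current[-1][2] - current[0][1]
        rates.append(len(current) / duration)
    return rates

# Toy example with invented segment times
segs = [("a", 0.00, 0.08), ("m", 0.08, 0.15), ("<p>", 0.15, 0.40),
        ("b", 0.40, 0.47), ("l", 0.47, 0.53), ("au", 0.53, 0.68)]
print(articulation_rates(segs))      # phones/second per ips
```

The same bookkeeping applies to the other units (intended words, syllables, etc.); only the counting inside each stretch changes.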
Results and discussion

Despite many sources of variation in both the units and the domains, we found high correlations between the number of units and domain duration for all units, both for read and spontaneous speech.
• Correlations are higher for ips than for IP.
• Words/second show the lowest correlations with duration.
• Realised phones/second result in the highest correlation with duration.
• Realised phones/second in ips is therefore the measure used in this study.
• For other applications, comparable results can be obtained when using the graphical word or the intended syllable, which can be measured/derived more easily.
• Note: although phone and syllable deletions lower the measured articulation rate, it is not clear what their effect on the perceived articulation rate is.

Articulation rate and realised lexical form

The Kiel Corpus was subdivided into three parts on the basis of the articulation rate measured in realised phones/second (for read and spontaneous speech separately); a small sketch of this split follows the list:
• slow: more than 1 sd below the mean
• medium: between -1 and +1 sd from the mean
• fast: more than 1 sd above the mean
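As a hedged illustration of the split just described, the sketch below assigns articulation rates to slow, medium and fast categories according to their distance from the mean in standard deviations. The rates used here are toy values, not corpus statistics; in the study the split is computed separately for read and spontaneous speech.

```python
# Sketch: split articulation rates into slow / medium / fast relative to mean +/- 1 sd.
# The input rates are invented example values (phones/second), not corpus data.

import statistics

def split_by_rate(rates):
    mean = statistics.mean(rates)
    sd = statistics.stdev(rates)
    bins = {"slow": [], "medium": [], "fast": []}
    for r in rates:
        if r < mean - sd:        # more than 1 sd below the mean
            bins["slow"].append(r)
        elif r > mean + sd:      # more than 1 sd above the mean
            bins["fast"].append(r)
        else:                    # between -1 and +1 sd of the mean
            bins["medium"].append(r)
    return bins

read_rates = [9.1, 10.2, 11.5, 12.0, 12.4, 13.1, 13.8, 15.9]
print({category: len(values) for category, values in split_by_rate(read_rates).items()})
```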
Database analysis: spontaneous versus read speech

As is well known, spontaneous speech differs from read speech with respect to pauses (more unfilled pauses, filled pauses and ungrammatical pauses). We also find differences in temporal structure (shorter phrases, greater variance in phrase duration, greater variance in articulation rate for spontaneous speech). But we also observe changes on the phonemic level.

Database analysis: realised lexical form

[Figures: percentage of deletions (%DELETIONS) and replacements (%REPLACEMENTS) per phone]

• There are generally more deletions than replacements, especially for /, , t/.
• Consonants are generally affected more strongly by deletions and replacements than vowels.
• Exception: schwa.
• For /n/ there are more replacements than deletions (place assimilation).
• //, which is /r/ in the canonical form, is seldom deleted or reduced.

Implications for ASR

• Deletion of //, //, /t/ (closure and especially release + aspiration) and also /n/ should be represented in the lexicon by means of pronunciation variants.
• Pronunciation variants due to assimilation of /n/ and replacement of /t/ closures and // should also be added to the lexicon.
• If there is any vowel reduction, it must therefore take place on the acoustic rather than the lexical level (except for schwa).

Articulation rate and phone classification

Phone classification

By performing phone classification for individual phones, the effects of silences in the utterance are excluded as a source of recognition error. The emphasis is entirely on the recognition of the phones at different articulation rates (calculated for each ips as phones/second).
• Phone classification for individual phones in hidden Markov modelling experiments using HTK, for read and spontaneous speech separately
• Hidden Markov models: 3 states (5 for diphthongs), left-to-right (no states skipped) and 8 mixtures per state
• Jackknife experiments with 20% of the database as test data; results computed as weighted averages (see the sketch at the end of this section)
• Results evaluated for slow, medium and fast speech

Rate-independent phone class effects

• Phone classification rates for consonants, particularly voiceless obstruents, are higher than for vowels.
• Schwa is recognised particularly poorly, possibly because of its liability to transconsonantal coarticulation.
• Diphthongs and // are also recognised poorly.

Articulation rate effects

[Figure: average phone classification rates for slow, normal and fast speech (read and spontaneous)]

• We found a deterioration of phone classification with increasing articulation rate (unlike e.g. Siegler and Stern, 1995). Our findings are comparable to those of Wrede et al. (2001) and are probably caused by the greater spectral variation at faster articulation rates.
• Articulation rate affects both vowels and consonants (lower phone classification results for faster speaking rates).
• t-tests on average vowel classification rates for matched pairs showed that the classification rates for normal and fast vowels do not differ significantly in spontaneous speech, probably due to the large amount of variation in the average vowel classification rates.
• Among the consonants, particularly fricatives but also plosives were affected by articulation rate.
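The sketch below shows only the evaluation bookkeeping described above: five jackknife partitions, each holding out 20% of the data as test material, with classification results combined as weighted averages per rate category. It is not the authors' HTK pipeline, and the fold results are invented placeholder numbers, not values from the study.

```python
# Sketch: jackknife partitioning (5 folds of ~20%) and weighted-average scoring.
# The actual phone classification was done with HTK; here it is only stubbed out
# as per-fold (correct, total) counts.

import random

def jackknife_folds(items, n_folds=5, seed=0):
    """Split items into n_folds disjoint test sets (each ~20% of the data)."""
    items = list(items)
    random.Random(seed).shuffle(items)
    return [items[i::n_folds] for i in range(n_folds)]

def weighted_average(results):
    """Combine per-fold (correct, total) counts into one weighted classification rate."""
    correct = sum(c for c, t in results)
    total = sum(t for c, t in results)
    return correct / total if total else 0.0

# 100 dummy utterance IDs split into 5 test sets of 20 each
print([len(fold) for fold in jackknife_folds(range(100))])

# Invented (correct, total) phone counts per fold for each rate category
fold_results = {
    "slow":   [(812, 1000), (798, 990), (805, 1010), (790, 980), (820, 1005)],
    "medium": [(4320, 5000), (4288, 4970), (4301, 4985), (4350, 5020), (4295, 4990)],
    "fast":   [(760, 1000), (741, 985), (755, 995), (748, 990), (762, 1002)],
}
for category, res in fold_results.items():
    print(category, round(100 * weighted_average(res), 1), "% phones correct")
```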

References

• Alleva, F., Huang, X., Hwang, M.-Y. & Jiang, L. "Can continuous speech recognisers handle isolated speech?" Speech Communication 26 (3), 183-190, 1998.
• Martínez, F., Tapias, D., Álvarez, J. & León, P. "Characteristics of slow, average and fast speech and their effects in large vocabulary continuous speech recognition." Proc. Eurospeech, Rhodes, 469-472, 1997.
• Mirghafori, N., Fosler, E. & Morgan, N. "Fast speakers in large vocabulary continuous speech recognition." Proc. Eurospeech, Madrid, 1995.
• Pfau, T. & Ruske, G. "Creating Hidden Markov Models for fast speech." Proc. ICSLP, Sydney, 205-208, 1998.
• Siegler, M. A. & Stern, R. M. "On the effects of speech rate in large vocabulary speech recognition systems." Proc. ICASSP, Detroit (1), 612-615, 1995.
• Wrede, B., Fink, G. & Sagerer, G. "An investigation of modelling aspects for rate-dependent speech recognition." Proc. Eurospeech, Aalborg, 2001.