
Linguistic Dissection of Switchboard-Corpus Automatic Speech Recognition Systems





Presentation Transcript


  1. Linguistic Dissection of Switchboard-Corpus Automatic Speech Recognition Systems Steven Greenberg and Shawn Chang International Computer Science Institute 1947 Center Street, Berkeley, CA 94704 ISCA Workshop on Automatic Speech Recognition: Challenges for the New Millennium, Paris, September 18-20, 2000

  2. Acknowledgements and Thanks
• EVALUATION DESIGN SUPPORT - George Doddington and Jack Godfrey
• ANALYSIS SUPPORT - Leah Hitchcock, Joy Hollenback and Rosaria Silipo
• SC-LITE SUPPORT - Jon Fiscus
• FUNDING SUPPORT - U.S. Department of Defense
• PROSODIC LABELING - Jeff Good and Leah Hitchcock
• PHONETIC LABELING AND SEGMENTATION - Candace Cardinal, Rachel Coulston and Colleen Richey
• DATA SUBMISSION - AT&T, BBN, Dragon Systems, Cambridge University, Johns Hopkins University, Mississippi State University, SRI International, University of Washington

  3. Take Home Messages
• SWITCHBOARD RECOGNITION SYSTEMS FROM EIGHT SEPARATE SITES WERE EVALUATED WITH RESPECT TO PHONE- AND WORD-LEVEL CLASSIFICATION ON NON-COMPETITIVE DIAGNOSTIC MATERIAL
• PHONETIC CLASSIFICATION APPEARS TO BE A PRIMARY FACTOR UNDERLYING THE ABILITY TO CORRECTLY RECOGNIZE WORDS - Decision-tree analyses support this hypothesis; additional analyses are also consistent with this conclusion
• SYLLABLE STRUCTURE AND PROSODIC STRESS ARE ALSO IMPORTANT FACTORS FOR ACCURATE RECOGNITION - The pattern of errors differs across the syllable (onset, nucleus, coda); stress affects primarily the number of word-deletion errors
• SPEAKING RATE CAN BE USED TO PREDICT RECOGNITION ERROR - Syllables per second is a far more accurate metric than MRATE (an acoustic measure based on the modulation spectrum)
• ASR SYSTEMS CAN POTENTIALLY BE IMPROVED BY FOCUSING MORE ATTENTION ON PHONETIC CLASSIFICATION, SYLLABLE STRUCTURE AND PROSODIC STRESS

  4. Overview - 1
• THE EIGHT ASR SYSTEMS WERE EVALUATED USING A 1-HOUR SUBSET OF THE SWITCHBOARD TRANSCRIPTION CORPUS - This corpus had been hand-labeled at the phone, syllable, word and prosodic-stress levels and hand-segmented at the syllabic and word levels; 25% of the material was hand-segmented at the phone level and the remainder quasi-automatically segmented into phonetic segments (and verified); the phonetic segments of each site were mapped to a common reference set, enabling a detailed analysis of the phone and word errors for each site that would otherwise be difficult to perform
• THIS EVALUATION REQUIRED THE CONVERSION OF THE ORIGINAL SUBMISSION MATERIAL TO A COMMON REFERENCE FORMAT - The common format was required for scoring with SC-Lite and for performing certain types of statistical analyses; key to the conversion was the use of TIME-MEDIATED parsing, which provides the capability of assigning different outputs to the same reference unit (be it word, phone or other)
• THE RECOGNITION MATERIAL WAS TAGGED AT THE PHONE AND WORD LEVELS WITH ca. 40 SEPARATE LINGUISTIC PARAMETERS - This information pertains to the acoustic, phonetic, lexical, utterance and speaker characteristics of the material and is formatted into “BIG LISTS”
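The time-mediated parsing idea above can be sketched in a few lines: each hypothesized unit is assigned to the reference unit it overlaps most in time, so several outputs may map onto one reference unit. This is a minimal illustration, not the actual evaluation scripts; the interval format and phone symbols are assumptions.

```python
# Sketch of TIME-MEDIATED parsing: assign each hypothesis unit to the
# reference unit with maximal temporal overlap. Units are
# (start_sec, dur_sec, label) tuples; data below are illustrative.

def overlap(a, b):
    """Temporal overlap (in seconds) between two (start, dur, label) units."""
    start = max(a[0], b[0])
    end = min(a[0] + a[1], b[0] + b[1])
    return max(0.0, end - start)

def time_mediated_assign(hyp_units, ref_units):
    """Map each hypothesis unit to its best-overlapping reference unit.

    Note that several hypothesis outputs may be assigned to the same
    reference unit - the property the slide highlights.
    """
    assignment = []
    for h in hyp_units:
        best = max(ref_units, key=lambda r: overlap(h, r))
        assignment.append((h[2], best[2] if overlap(h, best) > 0 else None))
    return assignment

ref = [(0.00, 0.30, "th"), (0.30, 0.20, "ax")]
hyp = [(0.00, 0.15, "dh"), (0.15, 0.20, "ax"), (0.35, 0.10, "ax")]
print(time_mediated_assign(hyp, ref))
```

Here two hypothesized phones land on the single reference [th], showing why time mediation (rather than one-to-one string alignment) was needed for scoring.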

  5. Overview - 2
• MOST OF THE RECOGNITION MATERIAL AND CONVERTED FILES, AS WELL AS THE SUMMARY (“BIG”) LISTS, ARE AVAILABLE ON THE WORLD WIDE WEB: http://www.icsi.berkeley.edu/real/phoneval
• THE ANALYSES SUGGEST THE FOLLOWING:
• PHONETIC CLASSIFICATION APPEARS TO BE AN IMPORTANT FACTOR UNDERLYING THE ABILITY TO CORRECTLY RECOGNIZE WORDS - Decision-tree analyses of the big lists support this hypothesis; additional statistical analyses are also consistent with this conclusion
• SYLLABLE STRUCTURE AND PROSODIC STRESS ARE ALSO IMPORTANT FACTORS FOR ACCURATE RECOGNITION - The pattern of errors differs across the syllable (onset, nucleus, coda); stress affects primarily the rate of word-deletion errors
• FAST/SLOW SPEAKING RATE IS CORRELATED WITH WORD ERROR - Syllables per second is a far more accurate metric than MRATE (an acoustic measure based on the modulation spectrum)

  6. Evaluation Materials
• SWITCHBOARD PHONETIC TRANSCRIPTION CORPUS - Switchboard contains informal telephone dialogues; nearly one hour’s material had previously been phonetically transcribed (by highly trained phonetics students from UC-Berkeley); all of this material was hand-segmented at either the phonetic-segment or syllabic level by the transcribers; the syllabic-segmented material was subsequently segmented at the phonetic-segment level by a special-purpose neural network trained on 72 minutes of hand-segmented Switchboard material, and this automatic segmentation was manually verified
• THE PHONETIC SYMBOL SET AND STP TRANSCRIPTIONS USED IN THE CURRENT PROJECT ARE AVAILABLE ON THE PHONEVAL WEB SITE: http://www.icsi.berkeley.edu/real/phoneval
• THE ORIGINAL FOUR HOURS OF TRANSCRIPTION MATERIAL ARE AVAILABLE AT: http://www.icsi.berkeley.edu/real/stp

  7. Prosodic Material
• ALL 674 FILES IN THE DIAGNOSTIC EVALUATION MATERIAL WERE PROSODICALLY LABELED
• THE LABELERS WERE TWO UC-BERKELEY LINGUISTICS STUDENTS
• ALL SYLLABLES WERE MARKED WITH RESPECT TO: Primary Stress, Intermediate Stress, or Complete Lack of Stress (no explicit label)
• INTERLABELER AGREEMENT WAS HIGH - 95% agreement with respect to stress (78% for primary stress); 85% agreement for unstressed syllables
• THE PROSODIC TRANSCRIPTION MATERIAL IS AVAILABLE AT: http://www.icsi.berkeley.edu/~steveng/prosody
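The interlabeler-agreement figures above are plain percent agreement. A minimal sketch of that computation follows; the label names and data are illustrative, not the actual transcripts.

```python
# Percent agreement between two prosodic labelers, as in the slide above.
# Stress labels and the example data are hypothetical.

def percent_agreement(labels_a, labels_b, category=None):
    """Fraction of syllables on which the two labelers agree.

    If `category` is given, restrict to syllables that either labeler
    assigned to that category (e.g. agreement on unstressed syllables).
    """
    pairs = list(zip(labels_a, labels_b))
    if category is not None:
        pairs = [(a, b) for a, b in pairs if category in (a, b)]
    if not pairs:
        return 0.0
    return sum(a == b for a, b in pairs) / len(pairs)

labeler1 = ["full", "none", "full", "inter", "none"]
labeler2 = ["full", "none", "inter", "inter", "none"]
print(percent_agreement(labeler1, labeler2))          # overall agreement
print(percent_agreement(labeler1, labeler2, "none"))  # unstressed only
```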

  8. Evaluation Material Characteristics
• AN EQUAL BALANCE OF MALE AND FEMALE SPEAKERS
• BROAD DISTRIBUTION OF UTTERANCE DURATIONS - 2-4 sec: 40%, 4-8 sec: 50%, 8-17 sec: 10%
• COVERAGE OF ALL (7) U.S. DIALECT REGIONS IN SWITCHBOARD
• A WIDE RANGE OF DISCUSSION TOPICS
• VARIABILITY IN DIFFICULTY (VERY EASY TO VERY HARD)
[Figures: number of utterances by dialect region and by subjective difficulty]

  9. Evaluation Sites
• EIGHT SITES PARTICIPATED IN THE EVALUATION - All eight provided material for the unconstrained-recognition phase; six sites also provided sufficient forced-alignment-recognition material (i.e., phone/word labels and segmentation given the word transcript for each utterance) for a detailed analysis
• AT&T (forced-alignment recognition incomplete, not analyzed)
• Bolt, Beranek and Newman
• Cambridge University
• Dragon (forced-alignment recognition incomplete, not analyzed)
• Johns Hopkins University
• Mississippi State University
• SRI International
• University of Washington

  10. Initial Recognition File - Example
• Parameter Key: START - begin time (in seconds) of phone; DUR - duration (in sec) of phone; PHN - hypothesized phone ID; WORD - hypothesized word ID
• Format is the same for all 674 files in the evaluation set (example courtesy of MSU)

  11. Phone Mapping Procedure
• EACH SUBMISSION SITE USED A (QUASI-)CUSTOM PHONE SET - Most of the phone sets are available on the PHONEVAL web site
• THE SITES’ PHONE SETS WERE MAPPED TO A COMMON “REFERENCE” PHONE SET - The reference phone set is based on the ICSI Switchboard Transcription material (STP), but is adapted to match the less granular symbol sets used by the submission sites; the mapping conventions relating the STP (and reference) sets are also available on the PHONEVAL web site
• THE REFERENCE PHONE SET WAS ALSO MAPPED TO THE SUBMISSION-SITE PHONE SETS - This reverse mapping was done in order to ensure that variants of a phone were given due “credit” in the scoring procedure; for example, [em] (syllabic nasal) is mapped to [ix] + [m], and the vowel [ix] maps in certain instances to both [ih] and [ax], depending on the specifics of the phone set
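The one-to-many mapping and the "due credit" scoring described here can be sketched as table lookups. Only the [em]/[ix] example comes from the slide; the rest of the tables and function names are hypothetical (the full mapping conventions live on the PHONEVAL site).

```python
# Illustrative phone-mapping sketch. Only the [em] -> [ix]+[m] and
# [ix] ~ [ih]/[ax] cases are from the presentation; other entries are
# placeholders, not the project's actual tables.

SITE_TO_REF = {
    "em": ["ix", "m"],   # syllabic nasal expands to vowel + nasal
    "ih": ["ih"],
    "ax": ["ax"],
}

# Reverse "credit" table: a reference [ix] may be matched by [ih] or [ax]
REF_VARIANTS = {"ix": {"ih", "ax"}}

def map_to_reference(site_phones):
    """Expand a site's phone string into the common reference set."""
    out = []
    for p in site_phones:
        out.extend(SITE_TO_REF.get(p, [p]))  # unknown symbols pass through
    return out

def scores_as_correct(hyp_phone, ref_phone):
    """Give due 'credit' when a hypothesis phone is a variant of the reference."""
    return hyp_phone == ref_phone or hyp_phone in REF_VARIANTS.get(ref_phone, set())

print(map_to_reference(["em", "ih"]))   # ['ix', 'm', 'ih']
print(scores_as_correct("ih", "ix"))    # True
```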

  12. Generation of Evaluation Data - 1

  13. EACH SITE’S MATERIAL WAS PROCESSED THROUGH SC-LITE TO OBTAIN A WORD-ERROR SCORE AND ERROR ANALYSIS (IN TERMS OF ERROR TYPE)
• CTM File Format for Word Scoring
• ERROR KEY: C = CORRECT, I = INSERTION, N = NULL ERROR, S = SUBSTITUTION
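As a rough sketch of what a CTM file for sc-lite scoring looks like: each line carries a file id, channel, start time, duration, and word. The helper and data below are illustrative; the exact field conventions should be checked against the sclite documentation.

```python
# Minimal sketch of emitting CTM lines for sc-lite word scoring.
# Field order (file id, channel, begin time, duration, word) follows the
# standard CTM convention; the utterance data here are made up.

def to_ctm(file_id, channel, words):
    """words: list of (start_sec, dur_sec, word) tuples -> CTM text."""
    lines = []
    for start, dur, word in words:
        lines.append(f"{file_id} {channel} {start:.2f} {dur:.2f} {word}")
    return "\n".join(lines)

print(to_ctm("sw2001", "A", [(0.10, 0.25, "i"), (0.35, 0.40, "think")]))
```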

  14. Generation of Evaluation Data - 2

  15. Summary of Corpus Statistical Analyses
• LEXICAL PROPERTIES - Lexical Identity, Unigram Frequency, Number of Syllables in Word, Number of Phones in Word, Word Duration, Speaking Rate, Prosodic Prominence, Energy Level, Lexical Compounds, Non-Words, Word Position in Utterance
• SYLLABLE PROPERTIES - Syllable Structure, Syllable Duration, Syllable Energy, Prosodic Prominence, Prosodic Context
• PHONE PROPERTIES - Phonetic Identity, Phone Frequency, Position within the Word, Position within the Syllable, Phone Duration, Speaking Rate, Phonetic Context, Contiguous Phones Correct, Contiguous Phones Wrong, Phone Segmentation, Articulatory Features, Articulatory Feature Distance, Phone Confusion Matrices
• OTHER PROPERTIES - Speaker (Dialect, Gender), Utterance Difficulty, Utterance Energy, Utterance Duration

  16. Word- and Phone-Centric “Big Lists”
• THE “BIG LISTS” CONTAIN SUMMARY INFORMATION ON 55-65 SEPARATE PARAMETERS ASSOCIATED WITH PHONES, SYLLABLES, WORDS, UTTERANCES AND SPEAKERS, SYNCHRONIZED TO EITHER THE WORD (THIS SLIDE) OR THE PHONE

  17. Generation of Evaluation Data - 3

  18. The Switchboard Evaluation Web Site - http://www.icsi.berkeley.edu/real/phoneval
RECOGNITION FILES
• Converted Submissions - ATT, BBN, JHU, MSU, SRI, WASH
• Word-Level Recognition Errors - ATT, CU, BBN, JHU, MSU, SRI, WASH
• Phone Error (Free Recognition) - ATT, BBN, JHU, MSU, WASH
• Word Recognition Phone Mapping - ATT, BBN, JHU, MSU, WASH
BIG LISTS (RECOGNITION)
• Word-Centric - ATT, CU, BBN, JHU, MSU, SRI, WASH
• Phone-Centric - ATT, BBN, JHU, MSU, WASH
• Phonetic Confusion Matrices - ATT, BBN, JHU, MSU, WASH
FORCED ALIGNMENT FILES
• Forced Alignment Files - BBN, JHU, MSU, WASH
• Word-Level Alignment Errors - BBN, CU, JHU, MSU, SRI, WASH
• Phone Error (Forced Alignment) - CU, BBN, JHU, MSU, SRI, WASH
• Alignment Word-Phone Mapping - BBN, JHU, MSU, WASH
BIG LISTS (FORCED ALIGNMENT)
• Word-Centric - BBN, CU, JHU, MSU, SRI, WASH
• Phone-Centric - BBN, JHU, MSU, WASH
• Phonetic Confusion Matrices - BBN, JHU, MSU, WASH
• Description of the STP Phone Set
• STP Transcription Material - Phone-Word Reference, Syllable-Word Reference
• Phone Mapping for Each Site - ATT, BBN, JHU, MSU, WASH
• STP-to-Reference Map
• STP Phone-to-Articulatory-Feature Map

  19. Phone Error - Unconstrained Recognition
• PHONE ERROR RATES VARY BETWEEN 39% AND 55%
• Substitutions are the major source of phone errors
[Figure: phone error rate by site, broken down by error type]

  20. Phone Error - Forced Alignment
• PHONE ERROR RATES VARY BETWEEN 35% AND 49%
• Insertions as well as substitutions are a major source of errors
• AT&T, Dragon did not provide a complete set of forced alignments
[Figure: phone error rate by site, broken down by error type]

  21. Word Recognition Error
• WORD ERROR RATES VARY BETWEEN 27% AND 43%
• Substitutions are the major source of word errors
[Figure: word error rate by site, broken down by error type]
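The word error rates cited above are the standard edit-distance measure: substitutions, deletions and insertions against the reference transcript, divided by the number of reference words. A minimal sketch (not sc-lite itself, which also handles time mediation and NULL errors):

```python
# Textbook word-error-rate sketch via Levenshtein distance over words.
# This is a simplification of sc-lite scoring, shown only to make the
# S/D/I error categories above concrete.

def wer(ref, hyp):
    """Word error rate: (substitutions + deletions + insertions) / len(ref)."""
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub,           # substitution or match
                          d[i - 1][j] + 1,   # deletion
                          d[i][j - 1] + 1)   # insertion
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution plus one insertion over a 3-word reference:
print(wer("the cat sat".split(), "the bat sat down".split()))
```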

  22. Are Word and Phone Errors Related?
• COMPARISON OF THE WORD AND PHONE ERROR RATES ACROSS SITES SUGGESTS THAT WORD ERROR IS HIGHLY DEPENDENT ON THE PHONE ERROR RATE
• The correlation between the two parameters is 0.78
• Pronunciation models? The differential error rate is probably related to the use of either pronunciation or language models (or both)
[Figure: word and phone error rates by submission site, r = 0.78]
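The r = 0.78 figure is a Pearson correlation between per-site phone and word error rates. A small sketch of the computation, using made-up rates (not the actual per-site numbers):

```python
# Pearson correlation sketch for per-site error rates.
# The eight rate pairs below are illustrative placeholders only.
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

phone_err = [39, 42, 44, 46, 49, 51, 53, 55]   # hypothetical per-site %
word_err = [27, 30, 29, 34, 36, 35, 41, 43]    # hypothetical per-site %
print(pearson_r(phone_err, word_err))
```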

  23. Decision Tree Analysis Example
• DELETIONS VS. EVERYTHING ELSE - WORD LEVEL (MSU), N = 5243
[Figure: six-level decision tree, each node annotated with the proportions and number of instances in each class. The root split is AVGAFDIST < 4.835 (4639+476 instances) vs. >= 4.835 (13+115); lower splits use PHNCOR, PHNINS, PHNSUB, POSTWORDERR, WORDENGY, CANO-PHNCNT and REFDUR]

  24. Decision Tree Analysis Example
• DELETIONS VS. EVERYTHING ELSE - WORD LEVEL (MSU), N = 5243
[Figure: the same decision tree, annotated with the number of instances and the number of nodes at each of its six levels]

  25. Decision Tree Analysis of Errors - 1
• PHONE-BASED PARAMETERS DOMINATE THE TREES …
WORD SUBSTITUTIONS VERSUS EVERYTHING ELSE
ATT - phnsub, wordfreq, avgAFdist, beginoff, endoff, postworderr
BBN - postworderr, preworderr, avgAFdist, phnsub, wordfreq, hypdur
CU - preworderr, phnsub, wordfreq
Dragon - phnsub, preworderr, postworderr
MSU - phnsub, avgAFdist, hypdur, postworderr, beginoff
JHU - phnsub, wordfreq, cano-sylcnt
SRI - postworderr, phnsub, avgAFdist, wordfreq, hypdur
WASH - phnsub, wordfreq, avgAFdist, postworderr, avgphnfreq
WORD DELETIONS VERSUS EVERYTHING ELSE
ATT - phncor, avgAFdist, postworderr
BBN - avgAFdist, refdur, wordengy, preworderr
CU - avgAFdist, phnins, phncor
Dragon - phncor, preworderr, phnsub
MSU - avgAFdist, phncor, phnins, phnsub
JHU - avgAFdist, preworderr, refdur
SRI - phncor, phnsub, phnins, wordfreq, hypdur
WASH - avgAFdist, refdur, preworderr
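For readers unfamiliar with how splits like "AVGAFDIST < 4.835" arise, here is a minimal sketch of threshold selection by impurity reduction. This is a generic CART-style illustration with Gini impurity and toy data; the actual tree-building software and criterion used in the study are not specified in the slides.

```python
# Sketch of choosing one decision-tree split threshold by Gini impurity.
# Generic CART-style illustration; not the study's actual tool or data.

def gini(labels):
    """Gini impurity of a binary label list (1 = word deleted, 0 = not)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2 * p * (1 - p)

def best_split(values, labels):
    """Find the threshold on one parameter (e.g. a hypothetical avgAFdist)
    that minimizes the weighted Gini impurity of the two child nodes."""
    best_t, best_score = None, gini(labels)
    for t in sorted(set(values)):
        left = [l for v, l in zip(values, labels) if v < t]
        right = [l for v, l in zip(values, labels) if v >= t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

# Toy data: low parameter values never deleted, high values always deleted.
print(best_split([1.0, 2.0, 5.0, 6.0], [0, 0, 1, 1]))
```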

  26. Decision Tree Analysis of Errors - 2
• DURATION IS IMPORTANT IN DISTINGUISHING AMONG ERROR TYPES IN THE TREES …
WORD SUBSTITUTIONS VERSUS DELETIONS
ATT - refdur, phnsub, wordengy, postworderr, phncor
BBN - phnsub, phncor, phnins
CU - hypdur, phnsub, avgAFdist, phncor
Dragon - refdur, phnsub, avgAFdist, phnins
MSU - refdur, phnsub, avgAFdist, phncor, phnins
JHU - refdur, phnsub, phncor, phnins, postworderr
SRI - refdur, wordengy, phnsub, wordfreq, phnins, phncor
WASH - refdur, phnsub, phnins, phncor
WORD SUBSTITUTIONS VERSUS INSERTIONS
ATT - hypdur, avgAFdist, preworderr
BBN - hypdur, phnsub, avgphnfreq, refdur, preworderr
CU - hypdur, avgphnfreq, postworderr
Dragon - hypdur
JHU - hypdur, phnsub
MSU - avgphnfreq, hypdur, preworderr, phnsub
SRI - hypdur, phnsub
WASH - hypdur, phnsub, avgphnfreq, phndel, preworderr, phncor

  27. Phone Error and Word Length • For CORRECT words, only one phone (on average) is misclassified • For INCORRECT words, phone errors increase linearly with word length Data are averaged across all eight sites

  28. Articulatory Features & Word Error
• Incorrect words exhibit nearly three times as many AF errors as correct words
• AFs include MANNER (e.g., stop, fricative, nasal, vowel, etc.), PLACE (e.g., labial, alveolar, velar), VOICING, LIP-ROUNDING
• Data are averaged across all eight sites
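The AF-error notion can be made concrete by counting the features on which a hypothesized phone differs from the reference phone. The feature table below is a hypothetical fragment for illustration; the project's own phone-to-articulatory-feature map (on the PHONEVAL site) is the authoritative inventory.

```python
# Hypothetical articulatory-feature fragment and feature distance.
# Feature values follow the manner/place/voicing dimensions named in the
# slide; the table entries themselves are illustrative.

AF = {
    "t": {"manner": "stop",      "place": "alveolar", "voicing": "unvoiced"},
    "d": {"manner": "stop",      "place": "alveolar", "voicing": "voiced"},
    "s": {"manner": "fricative", "place": "alveolar", "voicing": "unvoiced"},
    "k": {"manner": "stop",      "place": "velar",    "voicing": "unvoiced"},
}

def af_distance(p1, p2):
    """Number of articulatory features on which two phones differ."""
    f1, f2 = AF[p1], AF[p2]
    return sum(1 for feat in f1 if f1[feat] != f2[feat])

print(af_distance("t", "d"))  # voicing only
print(af_distance("d", "s"))  # manner and voicing
```

Averaging such per-phone distances over a word gives a quantity in the spirit of the avgAFdist parameter that dominates the deletion trees.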

  29. Consonantal Onsets and AF Errors • Syllable onsets are intolerant of AF errors in CORRECT words • Place and manner AF errors are particularly high in INCORRECT onsets Data are averaged across all eight sites

  30. Consonantal Codas and AF Errors • Syllable codas exhibit a slightly higher tolerance for error than onsets Data are averaged across all eight sites

  31. Vocalic Nuclei and AF Errors • Nuclei exhibit a much higher tolerance for error than onsets & codas • There are many more errors than among syllabic onsets & codas Data are averaged across all eight sites

  32. Syllable Structure & Word Error Rate • Vowel-initial forms show the greatest error • Polysyllabic forms exhibit the lowest error Data are averaged across all eight sites

  33. Syllable Structure & Word Error Rate • VOWEL-INITIAL forms exhibit the HIGHEST error • POLYSYLLABLES have the LOWEST error rate

  34. Prosodic Stress & Word Error Rate
• The effect of stress is most concentrated among word-deletion errors
• Data represent averages across all eight ASR systems
[Figure legend: Unstressed, Intermediate Stress, Fully Stressed]

  35. Prosodic Stress and Deletion Rate • All 8 ASR systems show the effect of prosodic stress on word deletion rate 0 = unstressed, 0.5 = intermediate stress, 1 = fully stressed

  36. Prosodic Stress and Word Error Rate • The effect of stress on overall word error is less pronounced than on deletions 0 = unstressed, 0.5 = intermediate stress, 1 = fully stressed

  37. Different Measures of Speaking Rate
• MRATE IS AN ACOUSTIC MEASURE BASED ON THE MODULATION PROPERTIES OF THE SIGNAL’S ENERGY ACROSS THE SPECTRUM
• SYLLABLES/SEC IS A LINGUISTIC MEASURE OF SPEAKING RATE
• THE CORRELATION BETWEEN THE TWO METRICS (r) = 0.56
• MRATE GENERALLY UNDERESTIMATES THE SYLLABLE RATE - Non-speech, filled pauses, etc. are contained in MRATE but not in syllable rate
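The linguistic measure is straightforward to compute from a syllable-level segmentation. The sketch below also bins rates using the 3 and 6 syllables/sec reference points from the rate-distribution slide; the band names and thresholds are illustrative, not part of the study's analysis.

```python
# Syllables-per-second speaking-rate sketch. Band labels and the use of
# 3/6 syl/sec as cut points are illustrative assumptions.

def syllable_rate(n_syllables, duration_sec):
    """Linguistic speaking-rate measure: syllables per second."""
    return n_syllables / duration_sec

def rate_band(rate):
    """Bin a rate around the 3 and 6 syl/sec points; per the distribution
    slide, only ~10% of utterances fall outside this range."""
    if rate < 3.0:
        return "slow"
    if rate > 6.0:
        return "fast"
    return "typical"

print(rate_band(syllable_rate(18, 4.0)))  # 4.5 syl/sec
```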

  38. MRATE Distribution

  39. Word Error and MRATE • MRATE (acoustic metric) is not predictive of word-error rate Slowest and fastest speaking rates should exhibit the highest word error, but don’t (in terms of MRATE)

  40. Syllable Rate Distribution
• ONLY A SMALL PROPORTION (10%) OF UTTERANCES ARE FASTER THAN 6 SYLLABLES/SEC OR SLOWER THAN 3 SYLLABLES/SEC

  41. Word Error and Syllable Rate • Syllables per second is a useful metric for predicting word-error rate Slow and fast speaking rates exhibit the highest word error (in terms of syllables/sec)

  42. Caveats
• THE DIAGNOSTIC MATERIAL MAY NOT BE TRULY REPRESENTATIVE OF THE SWITCHBOARD RECOGNITION TASK - The competitive evaluation is based on entire conversations, whereas the current diagnostic material contains only relatively small amounts of material from any single speaker; this strategy was intended to provide broad coverage of different speaker qualities (gender, dialect, age, voice quality, topic, etc.), but was also designed to foil recognition based largely on speaker-adaptation algorithms
• THE TIME-MEDIATED SCORING TECHNIQUE IS NOT “PERFECT” AND MAY HAVE INTRODUCED CERTAIN ERRORS NOT PRESENT IN THE COMPETITIVE EVALUATION
• THE STP TRANSCRIPTION (REFERENCE) MATERIAL IS ALSO NOT “PERFECT”, AND THEREFORE THE ANALYSES COULD UNDERESTIMATE A SITE’S PERFORMANCE ON BOTH FREE AND FORCED-ALIGNMENT-BASED RECOGNITION

  43. Summary and Conclusions
• SWITCHBOARD RECOGNITION SYSTEMS FROM EIGHT SEPARATE SITES WERE EVALUATED WITH RESPECT TO PHONE- AND WORD-LEVEL CLASSIFICATION ON NON-COMPETITIVE DIAGNOSTIC MATERIAL
• PHONETIC CLASSIFICATION APPEARS TO BE A PRIMARY FACTOR UNDERLYING THE ABILITY TO CORRECTLY RECOGNIZE WORDS - Decision-tree analyses support this hypothesis; additional analyses are also consistent with this conclusion
• SYLLABLE STRUCTURE AND PROSODIC STRESS ARE ALSO IMPORTANT FACTORS FOR ACCURATE RECOGNITION - The pattern of errors differs across the syllable (onset, nucleus, coda); stress affects primarily the number of word-deletion errors
• SPEAKING RATE CAN BE USED TO PREDICT RECOGNITION ERROR - Syllables per second is a far more accurate metric than MRATE (an acoustic measure based on the modulation spectrum)
• ASR SYSTEMS CAN POTENTIALLY BE IMPROVED BY FOCUSING MORE ATTENTION ON PHONETIC CLASSIFICATION, SYLLABLE STRUCTURE AND PROSODIC STRESS

  44. Into the Future …
• STRUCTURED QUERY LANGUAGE (SQL) DATABASE VERSION (11/2000) - Will provide quick and ready access to the entire set of recognition and forced-alignment material over the web; will enable accurate selection of specific subsets of the material for detailed, intensive analysis and graphing without much scripting; will accelerate analysis of the evaluation material, which is posted on the PHONEVAL web site for wide dissemination
• DEVELOPMENT OF A HIGH-FIDELITY AUTOMATIC PHONETIC TRANSCRIPTION SYSTEM TO LABEL AND SEGMENT (IN PROGRESS) - This automatic system will enable accurate labeling and segmentation of the remainder of the Switchboard corpus, thus enabling …
• PHONETIC AND LEXICAL DISSECTION OF THE COMPETITIVE EVALUATION SUBMISSIONS IN THE SPRING OF 2001 - Hopefully providing further insight into ways in which ASR systems can be improved

  45. That’s All, Folks Many Thanks for Your Time and Attention

  46. Additional Slides for Discussion

  47. EACH SITE’S SUBMISSION WAS PROCESSED THROUGH SC-LITE TO OBTAIN A WORD-ERROR SCORE AND ERROR ANALYSIS (IN TERMS OF ERROR TYPE)
• CTM File Format for Word Scoring
• ERROR KEY: C = CORRECT, I = INSERTION, N = NULL ERROR, S = SUBSTITUTION

  48. Precision of Forced Alignment Segmentation
• HOW ACCURATE IS THE PHONETIC SEGMENTATION PROVIDED BY FORCED-ALIGNMENT-BASED RECOGNITION? - The average disparity between the phone duration of the reference (STP) corpus and the duration of the forced-alignment phones is substantial (ca. 40% of the mean duration of a phone in the corpus)
• AUTOMATIC ALIGNERS ARE NOT RELIABLE PHONE SEGMENTERS
• There is virtually no skew in disparity between the beginning and ending portions of the phones (i.e., no bias in segmentation)
• Mean phone duration in corpus = 79.3 ms
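The "ca. 40% of the mean phone duration" disparity measure can be sketched as a mean absolute duration difference, normalized by the mean reference duration. The pairing of phones by position and the example durations below are simplifying assumptions (the actual comparison used the time-mediated mapping).

```python
# Sketch of the duration-disparity statistic for forced alignments.
# Phones are (start_sec, dur_sec) tuples; pairing reference and aligned
# phones by list position is a simplification, and the data are made up.

def duration_disparity(ref_phones, ali_phones):
    """Mean absolute duration difference between paired reference and
    forced-alignment phones, as a fraction of mean reference duration."""
    diffs = [abs(r[1] - a[1]) for r, a in zip(ref_phones, ali_phones)]
    mean_ref = sum(r[1] for r in ref_phones) / len(ref_phones)
    return (sum(diffs) / len(diffs)) / mean_ref

ref = [(0.00, 0.08), (0.08, 0.06)]   # hand-segmented (STP-style) durations
ali = [(0.00, 0.10), (0.10, 0.04)]   # forced-alignment durations
print(duration_disparity(ref, ali))  # fraction of the mean phone duration
```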

  49. Analyses Performed By Others
• RELATION OF THE NUMBER OF PHONES IN A WORD TO WORD ERROR - Done by George Doddington of NIST using both the free and forced-alignment recognition results (from the “Big Lists”); reveals an interesting relationship between the number of phones correctly (or incorrectly) classified and the probability of a word being correctly (or incorrectly) labeled; also shows the extent to which decoders are tolerant of phone-classification errors; George’s analysis is consistent with the D-Tree analyses suggesting that phone classification is the controlling variable for word error; George will discuss this material directly following this presentation
• ANALYSIS OF PHONETIC CONFUSIONS IN THE CORPUS MATERIAL - Performed by Joe Kupin and Hollis Fitch of the Institute for Defense Analyses; the output of their scripts is available on the PHONEVAL web site; Hollis will discuss some of their results directly after George’s presentation

  50. Speech Parameters Analyzed - 1
• UTTERANCE LEVEL - Utterance ID, Number of Words in Utterance, Utterance Duration, Utterance Energy (Abnormally Low or High Amplitude), Utterance Difficulty (Very Easy, Easy, Medium, Hard, Very Hard), Speaking Rate - Syllables per Second, Speaking Rate - Acoustic Measure (MRATE)
