ECE 259—Lecture 19 Techniques for Speech and Natural Language Recognition
The Speech Dialog Circle
• ASR (Automatic Speech Recognition): speech → words spoken, e.g., “I dialed a wrong number”
• SLU (Spoken Language Understanding): words → meaning, e.g., “Billing credit”
• DM (Dialog Management): meaning → action (what's next?), using data and rules, e.g., “Determine correct number”
• SLG (Spoken Language Generation): action → words, e.g., “What number did you want to call?”
• TTS (Text-to-Speech Synthesis): words → speech (voice reply to customer)
Automatic Speech Recognition
• Goal: Accurately and efficiently convert a speech signal into a text message, independent of the device, speaker or environment.
• Applications: Automation of complex operator-based tasks, e.g., customer care, dictation, form-filling applications, provisioning of new services, customer help lines, e-commerce, etc.
Milestones in Speech and Multimodal Technology Research (timeline figure, 1962 to 2002, stages in chronological order)
• Small Vocabulary, Acoustic Phonetics-based: Isolated Words. Filter-bank analysis; Time-normalization; Dynamic programming
• Medium Vocabulary, Template-based: Isolated Words; Connected Digits; Continuous Speech. Pattern recognition; LPC analysis; Clustering algorithms; Level building
• Large Vocabulary, Statistical-based: Connected Words; Continuous Speech. Hidden Markov models; Stochastic language modeling
• Large Vocabulary; Syntax, Semantics: Continuous Speech; Speech Understanding. Stochastic language understanding; Finite-state machines; Statistical learning
• Very Large Vocabulary; Semantics, Multimodal Dialog, TTS: Spoken dialog; Multiple modalities. Concatenative synthesis; Machine learning; Mixed-initiative dialog
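Basic ASR Formulation
The basic equation of Bayes rule-based speech recognition is
$\hat{W} = \arg\max_W P(W \mid X) = \arg\max_W \frac{P(X \mid W)\,P(W)}{P(X)} = \arg\max_W P(X \mid W)\,P(W)$
where $X = X_1, X_2, \ldots, X_N$ is the acoustic observation (feature vector) sequence, $\hat{W}$ is the corresponding word sequence, $P(X \mid W)$ is the acoustic model and $P(W)$ is the language model.
Processing chain: speech $s(n)$ with word sequence $W$ → Speech Analysis → $X_n$ → Decoder → $\hat{W}$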
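The Bayes decision rule, which combines an acoustic score P(X|W) with a language-model prior P(W), can be sketched in a few lines. This is only an illustration: the candidate transcriptions and all scores below are made-up values, and a real decoder searches a far larger hypothesis space than a two-entry dictionary.

```python
import math

def decode(acoustic_loglik, lm_prob):
    """Return the candidate W maximizing log P(X|W) + log P(W)."""
    best_w, best_score = None, -math.inf
    for w, log_p_x_given_w in acoustic_loglik.items():
        score = log_p_x_given_w + math.log(lm_prob[w])
        if score > best_score:
            best_w, best_score = w, score
    return best_w, best_score

# Hypothetical scores for two candidate transcriptions of one utterance.
acoustic_loglik = {"recognize speech": -12.0, "wreck a nice beach": -11.5}
lm_prob = {"recognize speech": 0.01, "wreck a nice beach": 0.0001}

best, score = decode(acoustic_loglik, lm_prob)
print(best)
```

Here the second candidate has the slightly better acoustic score, but the language model prior tips the decision toward the first one.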
Speech Recognition Process (block diagram)
Input speech s(n), W → Feature Analysis (Spectral Analysis) → Xn → Pattern Classification (Decoding, Search), drawing on the Acoustic Model (HMM), Word Lexicon and Language Model (N-gram) → Ŵ → Confidence Scoring (Utterance Verification) → “Hello World” with word confidences (0.9) (0.8)
Speech Recognition Processes • Choose task => sounds, word vocabulary, task syntax (grammar), task semantics • Text training data set => word lexicon, word grammar (language model), task grammar • Speech training data set => acoustic models • Evaluate performance • Speech testing data set • Training algorithm => build models from training set of text and speech • Testing algorithm => evaluate performance from testing set of speech
Feature Extraction
Goal: Extract robust features (information) from the speech that are relevant for ASR.
Method: Spectral analysis through either a bank of filters or LPC, followed by a non-linearity and normalization (cepstrum).
Result: Signal compression; for each window of speech samples, 30 or so cepstral features are extracted (64,000 b/s -> 5,200 b/s).
Challenges: Robustness to environment (office, airport, car), devices (speakerphones, cellphones), speakers (accents, dialect, style, speaking defects), noise and echo. Feature set for recognition: cepstral features or those from a high-dimensionality space.
What Features to Use?
Short-time Spectral Analysis:
• Acoustic features: cepstrum (LPC, filterbank, wavelets); formant frequencies, pitch, prosody; zero-crossing rate, energy
• Acoustic-phonetic features: manner of articulation (e.g., stop, nasal, voiced); place of articulation (labial, dental, velar)
• Articulatory features: tongue position, jaw, lips, velum
• Auditory features: ensemble interval histogram (EIH), synchrony
Temporal Analysis: approximation of the velocity and acceleration, typically through first- and second-order central differences.
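The temporal analysis step (velocity and acceleration via central differences) can be illustrated with a small sketch. The edge handling (repeating the first and last frame) and the toy two-dimensional "cepstra" are assumptions made for the example; practical front ends often use a regression over several frames instead.

```python
def central_difference(frames):
    """First-order central difference along time for a list of feature vectors.

    Edge frames are handled by repeating the first and last frame
    (an assumption made for this sketch).
    """
    padded = [frames[0]] + list(frames) + [frames[-1]]
    return [
        [(nxt - prv) / 2.0 for prv, nxt in zip(padded[t - 1], padded[t + 1])]
        for t in range(1, len(padded) - 1)
    ]

# Toy sequence of four 2-dimensional feature vectors (made-up values).
cepstra = [[1.0, 0.0], [2.0, 1.0], [4.0, 1.0], [7.0, 0.0]]
delta = central_difference(cepstra)    # velocity (delta cepstrum)
delta2 = central_difference(delta)     # acceleration (delta-delta cepstrum)
print(delta)
```

Applying the same operator twice gives the second-order features, so the static, delta and delta-delta vectors can be concatenated frame by frame.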
Feature Extraction Process (block diagram; blocks include: Sampling and Quantization; Preemphasis; Segmentation (blocking) with parameters M, N; Windowing; Noise Removal, Normalization; Spectral Analysis; Cepstral Analysis; Filtering; Equalization (bias removal or normalization); Temporal Derivative, giving delta cepstrum and delta^2 cepstrum; side outputs: Energy, Zero-Crossing, Pitch, Formants)
Robustness
Problem: A mismatch in the speech signal between the training phase and testing phase can result in performance degradation.
Methods: Traditional techniques for improving system robustness are based on signal enhancement, feature normalization and/or model adaptation.
Perception Approach: Extract fundamental acoustic information in narrow bands of speech. Robust integration of features across time and frequency.
Methods for Robust Speech Recognition
A mismatch in the speech signal between the training phase and testing phase results in performance degradation.
(Diagram: Training: Signal → Features → Model; Testing: Signal → Features → Model. The mismatch is addressed by Enhancement at the signal level, Normalization at the feature level and Adaptation at the model level.)
Noise and Channel Distortion
y(t) = [s(t) + n(t)] * h(t)
where s(t) is the speech, n(t) is the additive noise, h(t) is the channel, * denotes convolution and y(t) is the distorted speech.
(Figure: Fourier transform spectra versus frequency, panels labeled 5 dB and 50 dB.)
Speaker Variations • One source of acoustic variation among speakers is their different vocal tract lengths (which vary from 15-20 cm) • The frequency content of the resulting acoustic signal scales inversely with vocal tract length: a longer vocal tract shifts the spectrum toward lower frequencies • By warping the frequency content of a signal, one can effectively normalize for variations in vocal tract length • Maximum Likelihood Speaker Normalization
Acoustic Model
Goal: Map acoustic features into distinct phonetic labels (e.g., /s/, /aa/).
Hidden Markov Model (HMM): Statistical method for characterizing the spectral properties of speech by a parametric random process. A collection of HMMs is associated with a phone. HMMs are also assigned for modeling extraneous events.
Advantages: Powerful statistical method for dealing with a wide range of data and reliably recognizing speech.
Challenges: Understanding the role of classification models (ML training) versus discriminative models (MMI training). What comes after the HMM: are there data-driven models that work better for some or all vocabularies?
Word Lexicon
Goal: Map legal phone sequences into words according to phonotactic rules. For example, David: /d/ /ey/ /v/ /ih/ /d/
Multiple Pronunciations: Several words may have multiple pronunciations. For example, Data: /d/ /ae/ /t/ /ax/ or /d/ /ey/ /t/ /ax/
Challenges: How do you generate a word lexicon automatically? How do you add new variant dialects and word pronunciations?
The Lexicon • one entry per word, with at least 100K words needed for dictation (find: /f/ /ay/ /n/ /d/) • either done by hand or with letter-to-sound rules (LTS). Rules can be automatically trained with decision trees (CART); less than 8% errors, but proper nouns are hard! • multiple pronunciations (ta-mey-to vs to-mah-to)
Language Model
Goal: Model “acceptable” spoken phrases, constrained by task syntax.
Rule-based: Deterministic grammars that are knowledge driven. For example, flying from $city to $city on $date
Statistical: Compute estimates of word probabilities (N-gram, class-based, CFG). For example, flying from Newark to Boston tomorrow (figure: word transitions weighted with probabilities such as 0.4 and 0.6)
Challenges: How do you build a language model rapidly for a new task?
N-Grams • Trigrams Estimation
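For reference, the usual n-gram factorization and the maximum-likelihood trigram estimate can be written as follows. This is the standard textbook form, given here as an assumption about what is meant by trigram estimation; the count notation C(·) and the sequence length M are introduced only for illustration.

```latex
P(W) = \prod_{i=1}^{M} P(w_i \mid w_1, \ldots, w_{i-1})
     \approx \prod_{i=1}^{M} P(w_i \mid w_{i-1}, w_{i-2})

P(w_i \mid w_{i-1}, w_{i-2}) = \frac{C(w_{i-2}\, w_{i-1}\, w_i)}{C(w_{i-2}\, w_{i-1})}
```

Here C(·) counts how often a word sequence occurs in the training text, so the estimate is simply a relative frequency.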
Example • training data: “John read her book”, “I read a different book”, “John read a book by Mulan” • but we have a problem here
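The problem with these three training sentences shows up as soon as relative-frequency estimates are computed. The sketch below uses a bigram model rather than a trigram model for brevity (an assumption), with `<s>` and `</s>` as hypothetical sentence-boundary markers.

```python
from collections import Counter

corpus = [
    "John read her book",
    "I read a different book",
    "John read a book by Mulan",
]

unigram, bigram = Counter(), Counter()
for sentence in corpus:
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    unigram.update(tokens[:-1])                  # history counts
    bigram.update(zip(tokens[:-1], tokens[1:]))  # (previous word, word) counts

def p_bigram(word, prev):
    """Maximum-likelihood estimate of P(word | prev)."""
    return bigram[(prev, word)] / unigram[prev] if unigram[prev] else 0.0

def p_sentence(sentence):
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens[:-1], tokens[1:]):
        prob *= p_bigram(word, prev)
    return prob

print(p_sentence("John read a book"))   # seen bigrams only
print(p_sentence("Mulan read a book"))  # contains unseen bigrams
```

A perfectly reasonable sentence such as "Mulan read a book" receives probability zero because the bigrams "<s> Mulan" and "Mulan read" never occur in the training data, which is what motivates smoothing.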
N-Gram Smoothing • Data sparseness: in millions of words, more than 50% of trigrams occur only once. • We can't assign $p(w_i \mid w_{i-1}, w_{i-2}) = 0$ to unseen trigrams • Solution: assign a non-zero probability to each n-gram by lowering the probability mass of seen n-grams.
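As a minimal illustration of moving probability mass from seen to unseen n-grams, the sketch below applies add-one (Laplace) smoothing to bigrams. Add-one is chosen only because it is the simplest scheme; it is not the Katz or Kneser-Ney smoothing used in practice, and the toy corpus and boundary markers are assumptions.

```python
from collections import Counter

corpus = [
    "John read her book",
    "I read a different book",
    "John read a book by Mulan",
]

vocab = set()
unigram, bigram = Counter(), Counter()
for sentence in corpus:
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    vocab.update(tokens)
    unigram.update(tokens[:-1])
    bigram.update(zip(tokens[:-1], tokens[1:]))

V = len(vocab)

def p_add_one(word, prev):
    """Add-one smoothed estimate of P(word | prev)."""
    return (bigram[(prev, word)] + 1) / (unigram[prev] + V)

print(p_add_one("read", "Mulan"))  # unseen bigram, now non-zero
print(p_add_one("read", "John"))   # seen bigram, lowered from its ML estimate of 1.0
```

The seen bigram "John read" drops below its maximum-likelihood value, and that freed mass is what makes the unseen bigram "Mulan read" non-zero while each conditional distribution still sums to one.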
Perplexity • The cross-entropy of a language model P on a word sequence W of $N_W$ words is $H(W) = -\frac{1}{N_W} \log_2 P(W)$ • and its perplexity is $PP(W) = 2^{H(W)}$ • Perplexity measures the complexity of the language model (geometric mean of the branching factor).
Perplexity • The digit recognition task (TIDIGITS) has 10 words, PP=10 and a 0.2% error rate • The Airline Travel Information System (ATIS) task has 2000 words and PP=20 • The Wall Street Journal task has 5000 words and PP=130 with a bigram, and a 5% error rate • In general, lower perplexity => lower error rate, but perplexity does not take acoustic confusability into account: the E-set (B, C, D, E, G, P, T) has PP=7 and a 5% error rate.
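The link between perplexity and branching factor can be checked numerically. The sketch below computes perplexity as 2 raised to the per-word cross-entropy; the per-word probabilities are hypothetical, chosen so that every one of 10 digits is equally likely, which should reproduce PP=10.

```python
import math

def perplexity(word_probs):
    """Perplexity from the model probability of each word given its history."""
    n = len(word_probs)
    cross_entropy = -sum(math.log2(p) for p in word_probs) / n
    return 2.0 ** cross_entropy

# A 7-digit string under a uniform model over 10 digits (hypothetical input).
uniform_digits = [1 / 10] * 7
pp = perplexity(uniform_digits)
print(round(pp, 6))
```

With a uniform model, the perplexity equals the vocabulary size regardless of the length of the test string; any model that predicts the test words better than uniform yields a lower value.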
Bigram Perplexity Trained on 500 million words and tested on the Encarta Encyclopedia
Out-Of-Vocabulary (OOV) Rate OOV rate measured on Encarta Encyclopedia. Trained on 500 million words
WSJ Results • Perplexity and word error rate on the 60,000-word Wall Street Journal continuous speech recognition task. • Unigrams, bigrams and trigrams were trained from 260 million words • Smoothing mechanisms: Katz and Kneser-Ney
Pattern Classification
Goal: Combine information (probabilities) from the acoustic model, language model and word lexicon to generate an “optimal” word sequence (highest probability).
Method: The decoder searches through all possible recognition choices using a Viterbi decoding algorithm.
Challenges:
• How do we build efficient structures (FSMs) for decoding and searching large-vocabulary tasks with complex language models?
• features x HMM units x phones x words x sentences can lead to search networks with 10^22 states
• FSM methods can compile the network to 10^8 states, 14 orders of magnitude more efficient
• What is the theoretical limit of efficiency that can be achieved?
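A minimal sketch of Viterbi decoding over a toy two-state HMM follows. The states, the quantized observation symbols ("hiss", "tone") and all probabilities are made-up values for illustration; a real recognizer runs the same recursion over a compiled network of HMM states with continuous-density emissions and pruning.

```python
import math

def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return the most likely state path and its log-probability."""
    # best[t][s]: log-probability of the best path ending in state s at time t
    best = [{s: log_start[s] + log_emit[s][observations[0]] for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((r, best[t - 1][r] + log_trans[r][s]) for r in states),
                key=lambda item: item[1],
            )
            best[t][s] = score + log_emit[s][observations[t]]
            back[t][s] = prev
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]

def logs(table):
    return {k: math.log(v) for k, v in table.items()}

states = ["s", "aa"]
log_start = logs({"s": 0.5, "aa": 0.5})
log_trans = {"s": logs({"s": 0.7, "aa": 0.3}), "aa": logs({"s": 0.3, "aa": 0.7})}
log_emit = {"s": logs({"hiss": 0.9, "tone": 0.1}), "aa": logs({"hiss": 0.2, "tone": 0.8})}

path, score = viterbi(["hiss", "hiss", "tone", "tone"], states, log_start, log_trans, log_emit)
print(path)
```

Working in log-probabilities avoids numerical underflow on long utterances, and the cost is linear in the number of frames and quadratic in the number of states, which is why the size of the search network matters so much.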
Unlimited Vocabulary ASR
• The basic problem in ASR is to find the sequence of words that explains the input signal. This implies the following mapping: Features → HMM states → HMM units → Phones → Words → Sentences
• For the WSJ 20,000-word vocabulary, this results in a network of 10^22 bytes!
• State-of-the-art methods, including fast match, multi-pass decoding, A* stack decoding and finite-state transducers, all provide tremendous speed-up in searching through the network and finding the best path that maximizes the likelihood function.
Weighted Finite State Transducers (WFST)
• Unified mathematical framework for ASR
• Efficiency in time and space
• Component transducers: State:HMM, HMM:Phone, Phone:Word, Word:Phrase; their combination and optimization yields the search network
• WFST methods can compile the network to 10^8 states, 14 orders of magnitude more efficient
Weighted Finite State Transducer: Word Pronunciation Transducer for “data”
Arcs (input:output/weight, with ε the empty output): d:ε/1, ey:ε/.4, ae:ε/.6, t:ε/.2, dx:ε/.8, ax:“data”/1
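To make the pronunciation transducer concrete, the sketch below enumerates its paths and multiplies the arc weights. The arc labels and weights come from the slide; the state numbering and the linear topology (d, then ey or ae, then t or dx, then ax) are assumptions, since the diagram itself is not preserved. `None` stands for the empty (ε) output.

```python
# Arcs: (source_state, target_state, input_phone, output_word_or_None, weight)
ARCS = [
    (0, 1, "d", None, 1.0),
    (1, 2, "ey", None, 0.4),
    (1, 2, "ae", None, 0.6),
    (2, 3, "t", None, 0.2),
    (2, 3, "dx", None, 0.8),
    (3, 4, "ax", "data", 1.0),
]
FINAL_STATE = 4

def enumerate_paths(state=0, phones=(), words=(), weight=1.0):
    """Yield (phone sequence, output words, path weight) for every accepting path."""
    if state == FINAL_STATE:
        yield phones, words, weight
        return
    for src, dst, phone, word, w in ARCS:
        if src == state:
            yield from enumerate_paths(
                dst,
                phones + (phone,),
                words + ((word,) if word else ()),
                weight * w,
            )

paths = list(enumerate_paths())
for phones, words, weight in paths:
    print(" ".join(phones), "->", " ".join(words), round(weight, 2))

best_phones, best_words, best_weight = max(paths, key=lambda p: p[2])
```

Under these assumptions the transducer encodes four pronunciations of "data" whose weights sum to one, with the flapped variant /d/ /ae/ /dx/ /ax/ as the most likely.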
Algorithmic Speed-up for Speech Recognition AT&T Community North American Business vocabulary: 40,000 words branching factor: 85
Confidence Scoring
Goal: Identify possible recognition errors and out-of-vocabulary events. Potentially improves the performance of ASR, SLU and DM.
Method: A confidence score based on a hypothesis test is associated with each recognized word. For example:
Label: credit please
Recognized: credit fees
Confidence: (0.9) (0.3)
Challenges: Rejection of extraneous acoustic events (noise, background speech, door slams) without rejection of valid user input speech.
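One simple way to act on per-word confidence scores is to compare each against a threshold, as in the sketch below. The threshold of 0.5 is an arbitrary assumption, and the scores are taken as given; how they are derived from the underlying hypothesis test is outside the scope of this example.

```python
def flag_low_confidence(words, scores, threshold=0.5):
    """Pair each recognized word with its score and an accept (True) / reject (False) decision."""
    return [(w, s, s >= threshold) for w, s in zip(words, scores)]

result = flag_low_confidence(["credit", "fees"], [0.9, 0.3])
for word, score, accepted in result:
    print(word, score, "accept" if accepted else "reject")
```

In the "credit fees" example, the misrecognized word falls below the threshold, so a dialog manager could ask the user to confirm or repeat it. Raising the threshold trades more false rejections for fewer false acceptances.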
Rejection
Problem: Extraneous acoustic events, noise, background speech and out-of-domain speech deteriorate system performance.
Measure of Confidence: Associating word strings with a verification cost that provides an effective measure of confidence (Utterance Verification).
Effect: Improvement in the performance of the recognizer, understanding system and dialogue manager.
State-of-the-Art Performance (block diagram: Input Speech → Feature Extraction → Pattern Classification (Decoding, Search), drawing on the Acoustic Model, Word Lexicon and Language Model → Confidence Scoring → Recognized Sentence)
How to Evaluate Performance? • Dictation applications: Insertions, substitutions and deletions • Command-and-control: false rejection and false acceptance => ROC curves.
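For dictation-style evaluation, insertions, substitutions and deletions are usually counted with a minimum edit distance between the reference and recognized word strings, normalized by the reference length. The sketch below is one straightforward implementation of that idea; the function names and example strings are chosen for illustration.

```python
def word_errors(reference, hypothesis):
    """Minimum number of word substitutions, insertions and deletions."""
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j]: minimum edits turning ref[:i] into hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i                      # delete all i reference words
    for j in range(len(hyp) + 1):
        dist[0][j] = j                      # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = dist[i - 1][j] + 1
            insertion = dist[i][j - 1] + 1
            dist[i][j] = min(substitution, deletion, insertion)
    return dist[-1][-1]

def wer(reference, hypothesis):
    """Word error rate: edit errors divided by the number of reference words."""
    return word_errors(reference, hypothesis) / len(reference.split())

print(wer("credit please", "credit fees"))
print(word_errors("I dialed a wrong number", "I dialed wrong number please"))
```

Because insertions count as errors, the word error rate can exceed 100% when the recognizer outputs many more words than were spoken.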
Word Error Rates factor of 17 increase in digit error rate
NIST Benchmark Performance (figure: word error rate versus year for the Resource Management, ATIS and NAB tasks)
North American Business vocabulary: 40,000 words branching factor: 85
Algorithmic Accuracy for Speech Recognition Switchboard/Call Home Vocabulary: 40,000 words Perplexity: 85
Human Speech Recognition vs ASR (figure: machine-to-human error ratio on a scale marked x1, x10, x100, with a region labeled “Machines Outperform Humans”)
Challenges in ASR
System Performance: accuracy; efficiency (speed, memory); robustness
Operational Performance: end-point detection; user barge-in; utterance rejection; confidence scoring
Machines are 10-100 times less accurate than humans
Large-Vocabulary ASR Demo Courtesy of Murat Saraclar
Voice-Enabled System Technology Components
• ASR (Automatic Speech Recognition): speech → words
• SLU (Spoken Language Understanding): words → meaning
• DM (Dialog Management): meaning → action, using data and rules
• SLG (Spoken Language Generation): action → words
• TTS (Text-to-Speech Synthesis): words → speech
Spoken Language Understanding (SLU)
• Goal: Interpret the meaning of key words and phrases in the recognized speech string, and map them to actions that the speech understanding system should take
• Accurate understanding can often be achieved without correctly recognizing every word
• SLU makes it possible to offer services where the customer can speak naturally without learning a specific set of terms
• Methodology: Exploit task grammar (syntax) and task semantics to restrict the range of meaning associated with the recognized word string; exploit ‘salient’ words and phrases to map high-information word sequences to the appropriate meaning
• Performance Evaluation: Accuracy of the speech understanding system on various tasks and in various operating environments
• Applications: Automation of complex operator-based tasks, e.g., customer care, catalog ordering, form-filling systems, provisioning of new services, customer help lines, etc.
• Challenges: What goes beyond simple classification systems but stops short of full natural language voice dialogue systems?
SLU Formulation
• Let W be a sequence of words and C be its underlying meaning (conceptual structure); then, using Bayes rule, we get
$P(C \mid W) = \frac{P(W \mid C)\,P(C)}{P(W)}$, so that $\hat{C} = \arg\max_C P(W \mid C)\,P(C)$
• Finding the best conceptual structure can be done by parsing and ranking using a combination of acoustic, linguistic and semantic scores.
Knowledge Sources for Speech Understanding (spanning ASR, SLU and DM)
• Acoustic/Phonetic: relationship of speech sounds and English phonemes (Acoustic Model)
• Phonotactic: rules for phoneme sequences and pronunciation (Word Lexicon)
• Syntactic: structure of words and phrases in a sentence (Language Model)
• Semantic: relationship and meanings among words (Understanding Model)
• Pragmatic: discourse, interaction history, world knowledge (Dialog Manager)
DARPA Communicator
• DARPA-sponsored research and development of mixed-initiative dialogue systems
• Travel task involving airline, hotel and car information and reservation
“Yeah I uh I would like to go from New York to Boston tomorrow night with United”
• SLU output (concept decoding)
XML Schema:
<itinerary>
  <origin> <city></city> <state></state> </origin>
  <destination> <city></city> <state></state> </destination>
  <date></date>
  <time></time>
  <airline></airline>
</itinerary>
Decoded concepts:
• Topic: Itinerary
• Origin: New York
• Destination: Boston
• Day of the week: Sunday
• Date: May 25th, 2002
• Time: >6pm
• Airline: United
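Once the concepts have been decoded, filling the itinerary schema is a mechanical step. The sketch below does this with Python's standard `xml.etree.ElementTree`; the `slots` dictionary layout and the `build_itinerary` helper are assumptions made for the example, not the Communicator system's actual interface, and slots the utterance did not mention are left empty.

```python
import xml.etree.ElementTree as ET

def build_itinerary(slots):
    """Fill the <itinerary> schema from a dictionary of decoded concepts."""
    root = ET.Element("itinerary")
    for place in ("origin", "destination"):
        node = ET.SubElement(root, place)
        for field in ("city", "state"):
            ET.SubElement(node, field).text = slots.get(place, {}).get(field, "")
    for field in ("date", "time", "airline"):
        ET.SubElement(root, field).text = slots.get(field, "")
    return root

# Decoded concepts for the example utterance (hypothetical dictionary layout).
slots = {
    "origin": {"city": "New York"},
    "destination": {"city": "Boston"},
    "date": "May 25th, 2002",
    "time": ">6pm",
    "airline": "United",
}

itinerary = build_itinerary(slots)
xml_text = ET.tostring(itinerary, encoding="unicode")
print(xml_text)
```

Building the tree through the library rather than by string concatenation means special characters such as the ">" in the time slot are escaped automatically.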