
Components of Spoken Dialogue Systems




Presentation Transcript


  1. Components of Spoken Dialogue Systems Julia Hirschberg LSA 353

  2. Recreating the Speech Chain • DIALOG • SEMANTICS • SPEECH RECOGNITION • SPOKEN LANGUAGE UNDERSTANDING • SYNTAX • LEXICON • MORPHOLOGY • SPEECH SYNTHESIS • PHONETICS • DIALOG MANAGEMENT • INNER EAR / ACOUSTIC NERVE • VOCAL-TRACT ARTICULATORS

  3. The Illusion of Segmentation... or... Why Speech Recognition is so Difficult • [Figure: spectrogram with a parse (NP, NP, VP) over the words SEVEN, THREE, ZERO, IS, MY, FOUR, NUMBER, TWO, NINE, SEVEN, with scattered phone labels showing that word boundaries are not visible in the signal] • Target semantic frame: (user:Roberto (attribute:telephone-num value:7360474))

  4. The Illusion of Segmentation... or... Why Speech Recognition is so Difficult • [Same figure as slide 3, annotated with the main difficulties:] • Ellipses and anaphora • Limited vocabulary • Multiple interpretations • Speaker dependency • Word variations • Word confusability • Context-dependency • Coarticulation • Noise/reverberation • Intra-speaker variability

  5. 1980s -- The Statistical Approach • Based on work on Hidden Markov Models done by Leonard Baum at IDA, Princeton in the late 1960s • Purely statistical approach pursued by Fred Jelinek and Jim Baker, IBM T.J. Watson Research • Foundations of modern speech recognition engines • [Figure: 3-state HMM over states S1, S2, S3 with transition probabilities a11, a12, a22, a23, a33; acoustic HMMs combined with word trigrams] • Jelinek: "No data like more data"; "Whenever I fire a linguist, our system performance improves" (1988); "Some of my best friends are linguists" (2004)

  6. 1980-1990 – Statistical approach becomes ubiquitous • Lawrence Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proceedings of the IEEE, Vol. 77, No. 2, February 1989.

  7. 1980s-Today – The Power of Evaluation • [Figure: timeline 1995-2004 of the spoken dialog industry, layered into technology vendors (e.g. SpeechWorks, Nuance), platform integrators, application developers, hosting, tools, and standards, with research sites such as MIT and SRI] • Pros and cons of DARPA programs: + continuous incremental improvement; - loss of "bio-diversity"

  8. Today’s State of the Art • Low noise conditions • Large vocabulary: ~20,000-64,000 words (or more…) • Speaker-independent (vs. speaker-dependent) • Continuous speech (vs. isolated-word) • World’s best research systems: • Human-human speech: ~13-20% Word Error Rate (WER) • Human-machine or monologue speech: ~3-5% WER

  9. Components of an ASR System • Corpora for training and testing of components • A representation for the input and a method of extracting features from it • Pronunciation Model • Acoustic Model • Language Model • Feature extraction component • Algorithms to search the hypothesis space efficiently

  10. Training and Test Corpora • Collect corpora appropriate for the recognition task at hand • Small speech corpus + phonetic transcription, to associate sounds with symbols (initial Acoustic Model) • Large (>= 60 hrs) speech corpus + orthographic transcription, to associate words with sounds (Acoustic Model) • Very large text corpus, to identify unigram and bigram probabilities (Language Model)

  11. Building the Acoustic Model • Model likelihood of phones or subphones given spectral features, pronunciation models, and prior context • Usually represented as an HMM • Set of states representing phones or other subword units • Transition probabilities on states: how likely is it to see one phone after seeing another? • Observation/output likelihoods: how likely is a given spectral feature vector to be observed in phone state i?
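The HMM machinery the slide describes can be sketched in a few lines. Everything below is an illustrative toy: the states, transition table, emission table, and discretized observation symbols are invented, and real acoustic models emit continuous spectral feature vectors (typically via Gaussian mixtures), not symbols from a lookup table.

```python
# Toy phone HMM: three subphone states, a transition table a_ij, and
# discrete observation likelihoods b_i(o). All numbers are illustrative.

states = ["s1", "s2", "s3"]            # e.g. beginning/middle/end of a phone
trans = {                              # transition probabilities a_ij
    "s1": {"s1": 0.6, "s2": 0.4},
    "s2": {"s2": 0.5, "s3": 0.5},
    "s3": {"s3": 1.0},
}
emit = {                               # observation likelihoods b_i(o)
    "s1": {"A": 0.7, "B": 0.3},
    "s2": {"A": 0.4, "B": 0.6},
    "s3": {"A": 0.1, "B": 0.9},
}

def forward(obs, start="s1"):
    """Total likelihood of an observation sequence (forward algorithm)."""
    alpha = {s: 0.0 for s in states}
    alpha[start] = emit[start][obs[0]]
    for o in obs[1:]:
        new = {s: 0.0 for s in states}
        for i in states:
            for j, a in trans[i].items():
                new[j] += alpha[i] * a * emit[j][o]
        alpha = new
    return sum(alpha.values())

print(forward(["A", "B", "B"]))
```

The forward pass sums over all state paths, which is what lets the model score an observation sequence without committing to a single segmentation.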

  12. Initial estimates for: • Transition probabilities between phone states • Observation probabilities associating phone states with acoustic examples • Re-estimate both probabilities by feeding the HMM the transcribed speech training corpus (forced alignment) • I.e., we tell the HMM the ‘right’ answers -- which phones to associate with which sequences of sounds • Iteratively retrain transition and observation probabilities by running training data through the model and scoring the output (we know the right answers) until there is no further improvement

  13. HMMs for speech

  14. Building the Pronunciation Model • Models likelihood of word given network of candidate phone hypotheses • Multiple pronunciations for each word • May be weighted automaton or simple dictionary • Words come from all corpora; pronunciations from pronouncing dictionary or TTS system
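A weighted pronunciation lexicon of the kind the slide describes can be sketched as a dictionary of variants; the words, phone strings, and variant probabilities below are invented for illustration (real systems compile such entries into weighted finite-state automata):

```python
# Toy weighted pronunciation model: each word maps to phone-string variants
# with probabilities. Entries and weights are illustrative, not from a
# real pronouncing dictionary.

pron_model = {
    "tomato": [(("t", "ah", "m", "ey", "t", "ow"), 0.8),
               (("t", "ah", "m", "aa", "t", "ow"), 0.2)],
    "the":    [(("dh", "ah"), 0.7), (("dh", "iy"), 0.3)],
}

def word_likelihood(word, phone_scores):
    """Score a word as its best variant: P(variant) times the product of
    per-phone acoustic scores (phone_scores maps phone -> AM likelihood)."""
    best = 0.0
    for phones, p in pron_model.get(word, []):
        score = p
        for ph in phones:
            score *= phone_scores.get(ph, 1e-6)  # tiny floor for unseen phones
        best = max(best, score)
    return best

print(word_likelihood("the", {"dh": 0.9, "ah": 0.8, "iy": 0.1}))
```

Keeping multiple weighted variants per word is what lets the recognizer absorb pronunciation variation without changing the acoustic model.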

  15. ASR Lexicon: Markov Models for Pronunciation

  16. Language Model • Models likelihood of a word given the previous word(s) • N-gram models: • Build the LM by calculating bigram or trigram probabilities from a (very large) text training corpus • Smoothing issues • Grammars • Finite state grammar or Context Free Grammar (CFG) or semantic grammar • Out of Vocabulary (OOV) problem
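A minimal bigram model with add-one smoothing shows both the counting and the smoothing issue the slide mentions; the toy corpus and the Laplace smoothing choice are illustrative (production systems use far larger corpora and better smoothing such as Katz back-off or Kneser-Ney):

```python
from collections import Counter

# Toy bigram language model with add-one (Laplace) smoothing.
corpus = "the cat sat on the mat . the dog sat on the log .".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
V = len(unigrams)                      # vocabulary size

def p_bigram(w_prev, w):
    """Smoothed P(w | w_prev); add-one keeps unseen pairs off zero."""
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + V)

print(p_bigram("the", "cat"))   # seen pair
print(p_bigram("cat", "dog"))   # unseen pair, still nonzero
```

Without smoothing, any unseen bigram would zero out an entire hypothesis during decoding, which is why smoothing is listed as a core issue.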

  17. Search/Decoding • Find the best hypothesis given • Lattice of phone units (AM) • Lattice of words (segmentation of phone lattice into all possible words via Pronunciation Model) • Probabilities of word sequences (LM) • How to reduce this huge search space? • Lattice minimization and determinization • Pruning: beam search • Calculating most likely paths
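Beam-pruned search over a word lattice can be sketched as follows; the lattice, its arc scores (standing in for combined acoustic + pronunciation scores), and the stand-in bigram LM are all invented for illustration, and real decoders work over far larger, non-time-aligned lattices:

```python
# Toy beam search over a time-slotted word lattice. lattice[t] holds
# (word, log_score) arcs; the LM scores word-to-word transitions.
# All words and scores are illustrative.

lattice = [
    [("this", -1.0), ("the", -1.2)],
    [("is", -0.5), ("his", -1.5)],
    [("a", -0.3)],
    [("test", -0.4), ("text", -1.1)],
]

def lm_logp(prev, word):
    """Stand-in bigram LM: rewards one well-formed word sequence."""
    good = {("<s>", "this"), ("this", "is"), ("is", "a"), ("a", "test")}
    return -0.1 if (prev, word) in good else -2.0

def beam_decode(lattice, beam_width=2):
    hyps = [(0.0, ["<s>"])]                     # (total log prob, word seq)
    for slot in lattice:
        expanded = [(lp + arc_lp + lm_logp(seq[-1], w), seq + [w])
                    for lp, seq in hyps
                    for w, arc_lp in slot]
        # Prune: keep only the beam_width best partial hypotheses.
        hyps = sorted(expanded, reverse=True)[:beam_width]
    best_lp, best_seq = hyps[0]
    return best_seq[1:], best_lp

print(beam_decode(lattice))   # best path: this is a test
```

The pruning step is the point: instead of scoring every path through the lattice, only a fixed number of partial hypotheses survive each time slot.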

  18. Evaluating Success • Transcription: low WER, where WER = (S+I+D)/N × 100 • E.g., reference “This is a test.” vs. hypothesis “Thesis test.” → 75% WER; vs. “That was the dentist calling.” → 125% WER • Understanding: high concept accuracy • How many domain concepts were correctly recognized? E.g., “I want to go from Boston to Baltimore on September 29”
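The WER formula is word-level edit distance normalized by reference length; a short implementation reproduces both of the slide's examples (note WER can exceed 100% when the hypothesis is longer than the reference):

```python
# Word error rate: (substitutions + insertions + deletions) / reference length,
# computed via word-level Levenshtein distance.

def wer(ref, hyp):
    ref, hyp = ref.split(), hyp.split()
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + sub)  # substitution/match
    return dist[len(ref)][len(hyp)] / len(ref)

print(wer("this is a test", "thesis test"))                   # 0.75
print(wer("this is a test", "that was the dentist calling"))  # 1.25
```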

  19. Domain concepts and values: • source city: Boston • target city: Baltimore • travel date: September 29 • Score the recognized string “Go from Boston to Washington on December 29” vs. “Go to Boston from Baltimore on September 29” • Each gets only one of the three concepts right (1/3 = 33% CA)

  20. Summary • ASR today • Combines many probabilistic phenomena: varying acoustic features of phones, likely pronunciations of words, likely sequences of words • Relies upon many approximate techniques to ‘translate’ a signal • ASR future • Can we include more language phenomena in the model?

  21. Synthesizers Then and Now • The task: produce human-sounding speech from some orthographic or semantic representation of an input • An online text • A semantic representation of a response to a query • What are the obstacles? • What is the state of the art?

  22. Possibly the First ‘Speaking Machine’ • Wolfgang von Kempelen, Mechanismus der menschlichen Sprache nebst Beschreibung einer sprechenden Maschine, 1791 (in Deutsches Museum still and playable) • First to produce whole words, phrases – in many languages

  23. Joseph Faber’s Euphonia, 1846

  24. Constructed 1835, with pedal and keyboard control • Whispered and ordinary speech • Model of the tongue and pharyngeal cavity with changeable shape • Sang, too: “God Save the Queen” • Modern articulatory synthesis: Dennis Klatt (1987)

  25. The Voder (Bell Labs), World’s Fair in NY, 1939 • Requires much training to ‘play’ • Purpose: reduce the bandwidth needed to transmit speech, so that many phone calls can be sent over a single line

  26. Answers: • “These days a chicken leg is a rare dish.” • “It’s easy to tell the depth of a well.” • “Four hours of steady work faced us.” • ‘Automatic’ synthesis from spectrograms – but can also use hand-painted spectrograms as input • Purpose: understand the perceptual effect of spectral details

  27. Formant/Resonance/Acoustic Synthesis • Parametric or resonance synthesis • Specify minimal parameters, e.g. f0 and the first 3 formants • Pass an electronic source signal through a filter • A harmonic tone for voiced sounds • Aperiodic noise for unvoiced sounds • The filter simulates the different resonances of the vocal tract • E.g. • Walter Lawrence’s Parametric Artificial Talker (1953) for vowels and consonants • Gunnar Fant’s Orator Verbis Electris (1953) for vowels • Formant synthesis download (demo)

  28. Concatenative Synthesis • Most common type today • First practical application in 1936: the British phone company’s Talking Clock • Optical storage for words, part-words, and phrases • Concatenated to tell the time • E.g. • And a ‘similar’ example • Bell Labs TTS (1977) (1985)

  29. Variants of Concatenative Synthesis • Inventory units • Diphone synthesis (e.g. Festival) • Microsegment synthesis • “Unit Selection” – large, variable units • Issues • How well do units fit together? • What is the perceived acoustic quality of the concatenated units? • Is post-processing on the output possible, to improve quality?
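The unit-selection variant above reduces to a shortest-path problem: pick one candidate unit per target position so that the sum of target costs (how well a unit matches the target specification) and join costs (how well adjacent units fit together) is minimal. The candidates, unit ids, and costs below are invented for illustration:

```python
# Toy unit selection via dynamic programming. candidates[t] lists
# (unit_id, target_cost) pairs for target position t; join_cost stands in
# for spectral mismatch at the concatenation point. All values illustrative.

candidates = [
    [("u1", 0.2), ("u2", 0.5)],
    [("u3", 0.4), ("u4", 0.1)],
    [("u5", 0.3)],
]

def join_cost(a, b):
    """Stand-in join cost: a few unit pairs concatenate smoothly."""
    smooth = {("u1", "u3"), ("u4", "u5")}
    return 0.0 if (a, b) in smooth else 1.0

def select_units(candidates):
    # best[u] = (cheapest total cost of a path ending in unit u, that path)
    best = {u: (tc, [u]) for u, tc in candidates[0]}
    for slot in candidates[1:]:
        new = {}
        for u, tc in slot:
            cost, path = min((c + join_cost(p, u) + tc, pth + [u])
                             for p, (c, pth) in best.items())
            new[u] = (cost, path)
        best = new
    return min(best.values())

print(select_units(candidates))   # chooses units u1, u4, u5
```

Note how the winning path accepts a worse target cost at one position to buy a cheap join, which is exactly the "how well do units fit together?" trade-off the slide raises.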

  30. TTS Production Levels: Back End and Front End • Orthographic input: “The children read to Dr. Smith” • Levels of representation: world knowledge, semantics, syntax, lexicon, phonology, acoustics • Corresponding processing steps: text normalization, word pronunciation, intonation assignment, intonation realization (F0, amplitude, duration), synthesis

  31. Text Normalization • Reading is what W. hates most. • Reading is what Wilde hated most. • Have the students read the questions. • In 1996 she sold 1995 shares and deposited $42 in her 401(k). • The duck dove supply.
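The ambiguity in these examples (year vs. share count vs. dollar amount vs. account name) is what a text-normalization front end must resolve before anything can be pronounced. A toy token classifier, with heuristics and categories invented purely for illustration:

```python
import re

# Toy non-standard-word classifier for text normalization: decide how a
# numeric token should be read out. All heuristics here are illustrative.

def classify_token(token, prev_word=""):
    if re.fullmatch(r"\$\d+", token):
        return "currency"              # "$42" -> "forty-two dollars"
    if re.fullmatch(r"\d{3}\(k\)", token):
        return "alphanumeric"          # "401(k)" -> "four oh one k"
    if re.fullmatch(r"(1[6-9]|20)\d\d", token):
        # Four digits in the usual year range: prefer a year reading after
        # a preposition like "in"; otherwise treat as a plain cardinal.
        return "year" if prev_word.lower() == "in" else "cardinal"
    if token.isdigit():
        return "cardinal"
    return "word"

print(classify_token("1996", prev_word="In"))    # year
print(classify_token("1995", prev_word="sold"))  # cardinal
print(classify_token("$42"))                     # currency
```

Real systems learn such classifiers from labeled data; the point of the sketch is that identical digit strings get different readings depending on context.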

  32. Pronunciation in Context • Homograph disambiguation • E.g. bass/bass, desert/desert • Dictionaries vs letter-to-sound rules • Frequent or exceptional words: dictionary • Bellcore business name pronunciation • ‘New’ words: rules • Inferring language origin, e.g. Infiniti, Gomez, Paris • Pronunciation by analogy, e.g. universary • Learning rules automatically from dictionaries or small seed-sets + active learning

  33. Intonation Assignment: Phrasing • Traditional: hand-built rules • Punctuation (e.g., 234-5682) • Context/function word: no breaks after a function word (“He went to dinner”) • Parsing? (“She favors the nuts and bolts approach”) • Current: statistical analysis of a large labeled corpus • Features: punctuation, part-of-speech window, utterance length, … • ~94% ‘accuracy’
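The two hand-built rules on the slide (punctuation licenses a break, a function word blocks one) can be sketched directly; the function-word list and the punctuation-attached tokenization are illustrative simplifications:

```python
# Toy hand-built phrasing rules: punctuation licenses a prosodic break,
# a function word blocks one. Word list and tokenization are illustrative.

FUNCTION_WORDS = {"the", "a", "an", "to", "of", "in", "and", "or", "he", "she"}

def allow_break_after(word):
    stripped = word.rstrip(",.;:?!")
    if stripped.lower() in FUNCTION_WORDS:
        return False           # e.g. no break in "He went to | dinner"
    return stripped != word    # break only where punctuation appears

words = "He went to dinner, and then home.".split()
breaks = [i for i, w in enumerate(words[:-1]) if allow_break_after(w)]
print(breaks)   # [3] -> break after "dinner,"
```

The statistical approach the slide mentions replaces these two rules with a classifier over the same kinds of features (punctuation, part-of-speech window, utterance length).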

  34. Intonation Assignment: Accent • Hand-built rules • Function/content distinction: accent content words; deaccent function words • But many exceptions • ‘Given’ items, focus, contrast: “There are presidents and there are good presidents” • Complex nominals: “Main Street” vs. “Park Avenue”; “city hall parking lot” • Today: statistical procedures trained on large labeled corpora

  35. Intonation Assignment: Contours • Simple rules • ‘.’ = declarative contour • ‘?’ = yes-no-question contour unless wh-word present at/near front of sentence • Well, how did he do it? And what do you know? • Open problem • But realization even of simple variation is quite difficult in current TTS systems

  36. Phonological and Acoustic Realization • Task: • Produce a phonological representation from phonemes and intonational assignment • Pitch contour aligned with text • Durations, intensity • Select best concatenative units from inventory • Post-process if needed/possible to smooth joins, modify pitch, duration, intensity, rate from original units • Produce acoustic waveform as output

  37. TTS: Where are we now? • Natural sounding speech for some utterances • Where good match between input and database • Still…hard to vary prosodic features and retain naturalness • Yes-no questions: Do you want to fly first class? • Context-dependent variation still hard to infer from text and hard to realize naturally:

  38. Appropriate contours from text • Emphasis, de-emphasis to convey focus, given/new distinction: I own a cat. Or, rather, my cat owns me. • Variation in pitch range, rate, pausal duration to convey topic structure • Characteristics of ‘emotional speech’ little understood, so hard to convey: …a voice that sounds friendly, sympathetic, authoritative…. • How to mimic real voices? • ScanSoft/Nuance demo

  39. Next Class • J&M 22.4,22.8 • Walker et al ‘97
