Speech Generation: From Concept and from Text Julia Hirschberg CS 6998
Today • TTS • CTS
Traditional TTS Systems • Monologue • News articles, email, books, phone directories • Input: plain text • How to infer intention behind text?
Human Speech Production Levels • World Knowledge • Semantics • Syntax • Word • Phonology • Motor Commands, articulator movements, F0, amplitude, duration • Acoustics
TTS Production Levels: Back End and Front End • Orthographic input: The children read to Dr. Smith • World Knowledge text normalization • Semantics • Syntax word pronunciation • Word • Phonology intonation assignment • F0, amplitude, duration • Acoustics synthesis
Text Normalization • Context-independent: • Mr., 22, $N, NAACP, MAACO, VISA • Context-dependent: • Dr., St., 1997, 3/16 • Abbreviation ambiguities: How to resolve? • Application restrictions – all names? • Rule- or corpus-based decision procedure (Sproat et al. '01)
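Context-dependent abbreviations can be handled with simple local rules before falling back to a corpus-based classifier. A minimal sketch in Python (the rule, the function names, and the expansion choices are illustrative, not the method of Sproat et al. '01): expand "Dr." as a title when a capitalized word follows, and as a street suffix otherwise.

```python
def expand_dr(token, next_token):
    """Toy context-dependent rule for the ambiguous abbreviation 'Dr.':
    before a capitalized word it is a title, otherwise a street suffix."""
    if token != "Dr.":
        return token
    if next_token and next_token[0].isupper():
        return "Doctor"
    return "Drive"

def normalize(text):
    """Apply the expansion rule to every token, using the next token
    as local context."""
    tokens = text.split()
    out = []
    for i, tok in enumerate(tokens):
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        out.append(expand_dr(tok, nxt))
    return " ".join(out)
```

Real systems replace hand rules like this with classifiers trained on labeled corpora, but the feature (local lexical context) is the same.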
Part-of-speech ambiguity: • The convict went to jail / They will convict him • Said said hello • They read books / They will read books • Use: local lexical context, POS tagger, parser? • Sense ambiguity: I fish for bass / I play the bass • Use: decision lists (Yarowsky '94)
Word Pronunciation • Letter-to-Sound rules vs. large dictionary • O: _{C}e$ /o/ hope • O /a/ hop • Morphological analysis • Popemobile • Hoped • Ethnic classification • Fujisaki, Infiniti
Rhyming by analogy • Meronymy/metonymy • Exception Dictionary • Beethoven • Goal: phonemes+syllabification+lexical stress • Context-dependent too: • Give the book to John. • To John I said nothing.
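The two letter-to-sound rules for "o" above can be written as ordered regex rewrites where the first matching context wins, which is how classic LTS rule sets are organized. A toy fragment (the ARPAbet-like phoneme symbols and rule set are assumptions for illustration):

```python
import re

# Ordered letter-to-sound rules for the letter 'o'; first match wins.
RULES = [
    (re.compile(r"o(?=[^aeiou]e$)"), "ow"),  # o before C + silent e: hope -> /ow/
    (re.compile(r"o"), "aa"),                # default: hop -> /aa/
]

def o_phoneme(word):
    """Return the phoneme predicted for the first 'o' in `word`
    by the ordered rule list above."""
    for pattern, phoneme in RULES:
        if pattern.search(word):
            return phoneme
    return None
```

A full LTS component has hundreds of such rules per letter, with a large exception dictionary consulted first.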
Intonation Assignment: Phrasing • Traditional: hand-built rules • Punctuation 234-5682 • Context/function word: no breaks after function word He went to dinner • Parse? She favors the nuts and bolts approach • Current: statistical analysis of large labeled corpus • Punctuation, POS window, utterance length, …
Intonation Assignment: Accent • Hand-built rules • Function/content distinction He went out the back door/He threw out the trash • Complex nominals: • Main Street/Park Avenue • city hall parking lot • Statistical procedures trained on large corpora • Contrastive stress, given/new distinction?
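The function/content baseline above is easy to state as code: accent every content word, deaccent every function word. A minimal sketch (the function-word list is a small illustrative sample, and this deliberately ignores the complex-nominal and given/new cases the slide raises):

```python
# A small illustrative sample of English function words.
FUNCTION_WORDS = {"the", "a", "an", "he", "she", "it", "to", "of",
                  "in", "out", "and", "will", "did", "do", "for", "on"}

def assign_accents(words):
    """Baseline accent rule: pitch-accent content words,
    deaccent function words. Returns (word, accented) pairs."""
    return [(w, w.lower() not in FUNCTION_WORDS) for w in words]
```

The slide's counterexamples show where this breaks: "out" in "He threw out the trash" is a particle and should be accented, which is why statistical procedures trained on labeled corpora replaced the bare rule.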
Intonation Assignment: Contours • Simple rules • ‘.’ = declarative contour • ‘?’ = yes-no-question contour unless wh-word present at/near front of sentence • Well, how did he do it? And what do you know?
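The contour rules on this slide fit in a few lines: a final '?' gets a yes-no-question rise unless a wh-word appears at or near the front of the sentence, and everything else gets a declarative fall. A sketch (the "near the front" window of three words and the contour labels are assumptions):

```python
WH_WORDS = {"who", "what", "when", "where", "why", "how", "which"}

def assign_contour(sentence):
    """Simple punctuation-based contour assignment: '?' means a
    yes-no rise unless a wh-word occurs in the first few words."""
    words = sentence.rstrip(".?!").split()
    front = [w.strip(",").lower() for w in words[:3]]
    if sentence.endswith("?") and not (set(front) & WH_WORDS):
        return "yes-no-rise"
    return "declarative-fall"
```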
The TTS Front End Today • Corpus-based statistical methods instead of hand-built rule-sets • Dictionaries instead of rules (but fall-back to rules) • Modest attempts to infer contrast, given/new • Text analysis tools: POS tagger, morphological analyzer, little parsing
TTS Back End: Phonology to Acoustics • Goal: • Produce a phonological representation from segmentals (phonemes) and suprasegmentals (accent and phrasing assignment) • Convert to an acoustic signal (spectrum, pitch, duration, amplitude) • From phonetics to signal processing
Phonological Modeling: Duration • How long should each phoneme be? • Identity of context phonemes • Position within syllable and # syllables • Phrasing • Stress • Speaking rate
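Classic duration models (in the style of Klatt's rules) start from an intrinsic duration per phoneme and apply multiplicative context factors for the features listed above. A sketch where the base durations and factor values are illustrative, not from any published rule set:

```python
# Illustrative intrinsic durations in milliseconds.
BASE_DUR_MS = {"aa": 160, "t": 70, "n": 60, "ow": 180}

def phoneme_duration(phoneme, stressed=False, phrase_final=False, rate=1.0):
    """Multiplicative duration model: intrinsic duration scaled by
    stress, phrase-final lengthening, and speaking rate."""
    dur = BASE_DUR_MS[phoneme]
    if stressed:
        dur *= 1.2       # stressed syllables lengthen
    if phrase_final:
        dur *= 1.4       # phrase-final lengthening
    return dur / rate    # faster rate shortens everything
```

Statistical systems learn the same mapping (context features to duration) from labeled speech instead of hand-setting the factors.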
Phonological Modeling: Pitch • How to create F0 contour from accent/phrasing/contour assignment plus duration assignment and phonemes? • Contour or target models for accents, phrase boundaries • Rules to align phoneme string and smooth • How does F0 align with different phonemes?
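In a target model, accents and phrase boundaries place (time, F0) targets and the contour is interpolated between them. A minimal sketch of that interpolation step (target placement, smoothing, and alignment with segments are the hard parts the slide asks about and are omitted here):

```python
def f0_contour(targets, n_frames):
    """Build a frame-level F0 contour by linear interpolation between
    (time, Hz) targets, where time is a fraction of the utterance
    in [0, 1] and targets are sorted by time."""
    contour = []
    for i in range(n_frames):
        t = i / max(n_frames - 1, 1)
        # find the pair of targets bracketing this frame
        for (t0, f0), (t1, f1) in zip(targets, targets[1:]):
            if t0 <= t <= t1:
                w = (t - t0) / (t1 - t0) if t1 > t0 else 0.0
                contour.append(f0 + w * (f1 - f0))
                break
    return contour
```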
Phonetic Component: Segmentals • Phonemes have different acoustic realizations depending on nearby phonemes, stress • To/to, butter/tail • Approaches: • Articulatory synthesis • Formant synthesis • Concatenative synthesis • Diphone or unit selection
Articulatory Synthesis-by-Rule • Model articulators: tongue body, tip, jaw, lips, velum, vocal folds • Rules control timing of movements of each articulator • Easy to model coarticulation since articulators modeled separately • But: sounds very unnatural • Transform from vocal tract to acoustics not well understood • Knowledge of articulator control rules incomplete
Formant (Acoustic) Synthesis by Rule • Model of acoustic parameters: • Formant frequencies, bandwidths, amplitude of voicing, aspiration… • Phonemes have target values for parameters • Given a phonemic transcription of the input: • Rules select sequence of targets • Other rules determine duration of target values and transitions between
• Speech quality not natural • Acoustic model incomplete • Human knowledge of linguistic and acoustic control rules incomplete
Concatenative Synthesis • Pre-recorded human speech • Cut up into units, code, store (indexed) • Diphones typical • Given a phonemic transcription • Rules select unit sequence • Rules concatenate units based on some selection criteria • Rules modify duration, amplitude, pitch, source – and smooth spectrum across junctures
Issues • Speech quality varies based on • Size and number of units (coverage) • Rules • Speech coding method used to decompose acoustic signal into spectral, F0, amplitude parameters • How much the signal must be modified to produce the output
Coding Methods • LPC: Linear Predictive Coding • Decompose waveform into vocal tract/formant frequencies, F0, amplitude: simple model of glottal excitation • Robotic • More elaborate variants (MPLPC, RELP) less robotic but distortions when change in F0, duration • PSOLA (pitch synchronous overlap/add): • No waveform decomposition
• Delete/repeat pitch periods to change duration • Overlap pitch periods to change F0 • Distortion if large F0, durational change • Sensitive to definition of pitch periods • No coding (use natural speech) • Avoid distortions of coding methods • But how to change duration, F0, amplitude?
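The delete/repeat idea behind PSOLA's duration change can be sketched directly: treat the signal as a sequence of pitch periods and resample that sequence, repeating periods to lengthen and dropping them to shorten. This toy version skips the windowed overlap-add step that smooths the junctures:

```python
def psola_stretch(pitch_periods, factor):
    """PSOLA-style duration change: repeat or delete whole pitch
    periods (each a list of samples) rather than recoding the
    waveform. factor > 1 lengthens, factor < 1 shortens."""
    n_out = round(len(pitch_periods) * factor)
    out = []
    for i in range(n_out):
        # map each output slot back to the nearest input period
        src = min(int(i / factor), len(pitch_periods) - 1)
        out.append(pitch_periods[src])
    return out
```

This makes the slide's caveats concrete: large factors repeat or drop many periods in a row (audible distortion), and everything depends on finding the pitch period boundaries correctly.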
Corpus-based Unit Selection • Units determined case-by-case from large hand- or automatically-labeled corpus • Amount of concatenation depends on input and corpus • Algorithms for determining best units to use • Longest match to phonemes in input • Spectral distance measures • Matching prosodic, amplitude, durational features?
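The standard formulation of "determining best units" is a dynamic-programming (Viterbi) search minimizing a target cost (how well a unit matches the specification) plus a join cost (e.g. spectral distance at the concatenation point). A sketch under that formulation, with the cost functions left to the caller since their design is exactly the open question on the slide:

```python
def select_units(targets, inventory, join_cost, target_cost):
    """Viterbi unit selection: `targets` is the phoneme spec sequence,
    `inventory` maps each spec to its candidate units; minimize
    summed target_cost(spec, unit) + join_cost(prev_unit, unit)."""
    # best[u] = (total cost, path) for unit sequences ending in u
    best = {u: (target_cost(targets[0], u), [u]) for u in inventory[targets[0]]}
    for spec in targets[1:]:
        new_best = {}
        for u in inventory[spec]:
            cost, path = min(
                (pc + join_cost(p, u), pp) for p, (pc, pp) in best.items()
            )
            new_best[u] = (cost + target_cost(spec, u), path + [u])
        best = new_best
    return min(best.values())[1]
```

With a join cost that rewards units adjacent in the source corpus, the search naturally prefers the "longest match" stretches the slide mentions.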
TTS Back End: Summary • Speech is most natural when there is the least signal processing: corpus-based unit selection with no coding… but…
TTS: Where are we now? • Natural sounding speech for some utterances • Where good match between input and database • Still…hard to vary prosodic features and retain naturalness • Yes-no questions: Do you want to fly first class? • Context-dependent variation still hard to infer from text and hard to realize naturally:
Appropriate contours from text • Emphasis, de-emphasis to convey focus, given/new distinction: I own a cat. Or, rather, my cat owns me. • Variation in pitch range, rate, pausal duration to convey topic structure • Characteristics of ‘emotional speech’ little understood, so hard to convey: …a voice that sounds friendly, sympathetic, authoritative…. • How to mimic real voices?
TTS vs. CTS • Decisions in Text-to-Speech (TTS) depend on syntax, information status, topic structure,… information explicitly available to NLG • Concept-to-Speech (CTS) systems should be able to specify “better” prosody: the system knows what it wants to say and can specify how • But….generating prosody for CTS isn’t so easy
Next Week • Read • Discussion questions • Write an outline of your class project and what you’ve done so far