
Speech Generation: From Concept and from Text



Presentation Transcript


  1. Speech Generation: From Concept and from Text Julia Hirschberg CS 6998

  2. Today • TTS • CTS

  3. Traditional TTS Systems • Monologue • News articles, email, books, phone directories • Input: plain text • How to infer intention behind text?

  4. Human Speech Production Levels • World Knowledge • Semantics • Syntax • Word • Phonology • Motor Commands, articulator movements, F0, amplitude, duration • Acoustics

  5. TTS Production Levels: Back End and Front End • Orthographic input: The children read to Dr. Smith • World Knowledge → text normalization • Semantics • Syntax → word pronunciation • Word • Phonology → intonation assignment • F0, amplitude, duration • Acoustics → synthesis

  6. Text Normalization • Context-independent: • Mr., 22, $N, NAACP, MAACO, VISA • Context-dependent: • Dr., St., 1997, 3/16 • Abbreviation ambiguities: how to resolve? • Application restrictions – all names? • Rule- or corpus-based decision procedure (Sproat et al. ’01)
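
A minimal Python sketch of context-dependent abbreviation expansion (not from the slides): the word lists and the next-token capitalization heuristic are illustrative hand-written rules, not the corpus-based procedure of Sproat et al. ’01.

    # Illustrative rules for expanding abbreviations before synthesis.
    UNAMBIGUOUS = {"Mr.": "Mister", "NAACP": "N A A C P"}

    def expand(tokens):
        """Expand abbreviations, using the following token to resolve ambiguity."""
        out = []
        for i, tok in enumerate(tokens):
            nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
            if tok in UNAMBIGUOUS:
                out.append(UNAMBIGUOUS[tok])
            elif tok == "Dr.":
                # "Dr." before a capitalized name -> Doctor, otherwise Drive
                out.append("Doctor" if nxt[:1].isupper() else "Drive")
            elif tok == "St.":
                out.append("Saint" if nxt[:1].isupper() else "Street")
            else:
                out.append(tok)
        return out

    print(expand("The children read to Dr. Smith on Main St. downtown".split()))
    # -> [..., 'Doctor', 'Smith', ..., 'Main', 'Street', 'downtown']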

  7. Part-of-speech ambiguity: • The convict went to jail / They will convict him • Said said hello • They read books / They will read books • Use: local lexical context, POS tagger, parser? • Sense ambiguity: I fish for bass / I play the bass • Use: decision lists (Yarowsky ’94)
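
A toy decision list for the bass homograph, in the spirit of Yarowsky ’94: collocation tests are checked in order and the first match picks the pronunciation. The trigger words and ARPAbet strings are illustrative, not learned from a corpus.

    # First matching collocation rule decides the pronunciation.
    DECISION_LIST = [
        (("fish", "river", "lake"), "B AE S"),    # the fish, /baes/
        (("play", "guitar", "drum"), "B EY S"),   # the instrument, /beys/
    ]

    def pronounce_bass(sentence):
        context = set(sentence.lower().split())
        for triggers, pron in DECISION_LIST:
            if context & set(triggers):
                return pron
        return "B EY S"                            # default sense

    print(pronounce_bass("I fish for bass"))       # B AE S
    print(pronounce_bass("I play the bass"))       # B EY S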

  8. Word Pronunciation • Letter-to-Sound rules vs. large dictionary • O: _{C}e$ → /o/ (hope) • O → /a/ (hop) • Morphological analysis • Popemobile • Hoped • Ethnic classification • Fujisaki, Infiniti
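
A minimal Python sketch of dictionary-plus-rules pronunciation: try an exception dictionary first, fall back to letter-to-sound rules. The single rule below mirrors the hope/hop example on the slide; every other letter is just echoed, so the output is only a toy transcription.

    import re

    EXCEPTIONS = {"beethoven": "B EY T OW V AH N"}   # illustrative entry

    def pronounce(word):
        word = word.lower()
        if word in EXCEPTIONS:
            return EXCEPTIONS[word]
        phones = []
        for i, ch in enumerate(word):
            if ch == "o":
                # o before one consonant + final e -> /o/ (hope), else /a/ (hop)
                phones.append("OW" if re.match(r"o[^aeiou]e$", word[i:]) else "AA")
            else:
                phones.append(ch.upper())            # placeholder "phone"
        return " ".join(phones)

    print(pronounce("hope"))   # H OW P E   (toy: the silent e is not deleted)
    print(pronounce("hop"))    # H AA P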

  9. Rhyming by analogy • Meronymy/metonymy • Exception Dictionary • Beethoven • Goal: phonemes+syllabification+lexical stress • Context-dependent too: • Give the book to John. • To John I said nothing.

  10. Intonation Assignment: Phrasing • Traditional: hand-built rules • Punctuation: 234-5682 • Context/function word: no breaks after a function word (He went to dinner) • Parse? She favors the nuts and bolts approach • Current: statistical analysis of large labeled corpus • Punctuation, POS window, utterance length, …
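
A minimal rule-based phrasing sketch in Python: propose a break at punctuation and at the end of the utterance, but never right after a function word. The function-word list is illustrative; a current system would instead train a classifier on punctuation, a POS window, utterance length, and similar features.

    FUNCTION_WORDS = {"the", "a", "an", "to", "of", "and", "he", "she", "in"}

    def assign_breaks(text):
        tokens = text.split()
        out = []
        for i, tok in enumerate(tokens):
            out.append(tok)
            word = tok.strip(",.").lower()
            at_punct = tok.endswith((",", "."))
            last = i == len(tokens) - 1
            if (at_punct or last) and word not in FUNCTION_WORDS:
                out.append("||")                  # predicted phrase break
        return " ".join(out)

    print(assign_breaks("He went to dinner, then he went home."))
    # -> He went to dinner, || then he went home. ||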

  11. Intonation Assignment: Accent • Hand-built rules • Function/content distinction: He went out the back door / He threw out the trash • Complex nominals: • Main Street / Park Avenue • city hall parking lot • Statistical procedures trained on large corpora • Contrastive stress, given/new distinction?
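
A minimal content/function accent sketch in Python: content words get a pitch accent (marked *), function words do not. The function-word list is illustrative, and the particle/preposition and complex-nominal cases on the slide are exactly what such a simple rule gets wrong.

    FUNCTION_WORDS = {"he", "she", "the", "a", "an", "to", "of", "and", "out", "in"}

    def assign_accents(text):
        return " ".join(
            tok + "*" if tok.strip(",.").lower() not in FUNCTION_WORDS else tok
            for tok in text.split()
        )

    print(assign_accents("He threw out the trash"))
    # -> He threw* out the trash*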

  12. Intonation Assignment: Contours • Simple rules • ‘.’ = declarative contour • ‘?’ = yes-no-question contour unless wh-word present at/near front of sentence • Well, how did he do it? And what do you know?
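
A direct Python encoding of the two rules on the slide: ‘.’ selects a declarative (falling) contour; ‘?’ selects a yes-no (rising) contour unless a wh-word appears at or near the front of the sentence. The wh-word list and the two-word "near the front" window are the only assumptions added here.

    WH_WORDS = {"who", "what", "when", "where", "why", "how", "which"}

    def choose_contour(sentence):
        if sentence.rstrip().endswith("?"):
            first_words = [w.strip(",").lower() for w in sentence.split()[:2]]
            if any(w in WH_WORDS for w in first_words):
                return "declarative (falling)"        # wh-question
            return "yes-no question (rising)"
        return "declarative (falling)"

    print(choose_contour("Do you want to fly first class?"))  # rising
    print(choose_contour("Well, how did he do it?"))          # falling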

  13. The TTS Front End Today • Corpus-based statistical methods instead of hand-built rule-sets • Dictionaries instead of rules (but fall-back to rules) • Modest attempts to infer contrast, given/new • Text analysis tools: pos tagger, morphological analyzer, little parsing

  14. TTS Back End: Phonology to Acoustics • Goal: • Produce a phonological representation from segmentals (phonemes) and suprasegmentals (accent and phrasing assignment) • Convert to an acoustic signal (spectrum, pitch, duration, amplitude) • From phonetics to signal processing

  15. Phonological Modeling: Duration • How long should each phoneme be? • Identity of context phonemes • Position within syllable and # syllables • Phrasing • Stress • Speaking rate
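
A minimal rule-based duration sketch in the spirit of Klatt-style models: start from a base duration per phoneme and apply multiplicative factors for stress, phrase-final lengthening, and speaking rate. All numbers are illustrative, not measured values.

    BASE_MS = {"AA": 90, "T": 60, "S": 100}          # hypothetical base durations

    def phone_duration_ms(phone, stressed=False, phrase_final=False, rate=1.0):
        dur = BASE_MS.get(phone, 80)
        if stressed:
            dur *= 1.3                               # stressed syllables lengthen
        if phrase_final:
            dur *= 1.5                               # phrase-final lengthening
        return dur / rate                            # faster rate -> shorter phones

    print(phone_duration_ms("AA", stressed=True, phrase_final=True))   # 175.5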

  16. Phonological Modeling: Pitch • How to create F0 contour from accent/phrasing/contour assignment plus duration assignment and phonemes? • Contour or target models for accents, phrase boundaries • Rules to align phoneme string and smooth • How does F0 align with different phonemes?
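
A minimal target-and-interpolate F0 sketch in Python: place a (time, Hz) target at each accent and phrase boundary, then linearly interpolate to frame level. The target values, times, and the 10 ms frame step are illustrative.

    def f0_contour(targets, frame_ms=10):
        """targets: list of (time_ms, f0_hz) pairs, sorted by time."""
        contour = []
        for (t0, f0), (t1, f1) in zip(targets, targets[1:]):
            frames = max(1, int((t1 - t0) / frame_ms))
            for k in range(frames):
                contour.append(f0 + (f1 - f0) * k / frames)   # linear interpolation
        contour.append(targets[-1][1])
        return contour

    # A high accent target at 200 Hz falling to a low phrase-final target.
    contour = f0_contour([(0, 120), (300, 200), (800, 80)])
    print(len(contour), contour[0], contour[-1])              # 81 120.0 80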

  17. Phonetic Component: Segmentals • Phonemes have different acoustic realizations depending on nearby phonemes, stress • To/to, butter/tail • Approaches: • Articulatory synthesis • Formant synthesis • Concatenative synthesis • Diphone or unit selection

  18. Articulatory Synthesis-by-Rule • Model articulators: tongue body, tip, jaw, lips, velum, vocal folds • Rules control timing of movements of each articulator • Easy to model coarticulation since articulators modeled separately • But: sounds very unnatural • Transform from vocal tract to acoustics not well understood • Knowledge of articulator control rules incomplete

  19. Formant (Acoustic) Synthesis by Rule • Model of acoustic parameters: • Formant frequencies, bandwidths, amplitude of voicing, aspiration… • Phonemes have target values for parameters • Given a phonemic transcription of the input: • Rules select sequence of targets • Other rules determine duration of target values and transitions between

  20. Speech quality not natural • Acoustic model incomplete • Human knowledge of linguistic and acoustic control rules incomplete

  21. Concatenative Synthesis • Pre-recorded human speech • Cut up into units, code, store (indexed) • Diphones typical • Given a phonemic transcription • Rules select unit sequence • Rules concatenate units based on some selection criteria • Rules modify duration, amplitude, pitch, source – and smooth spectrum across junctures
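
A minimal diphone-selection sketch in Python: turn the phoneme string into diphone names and look each one up in a (hypothetical) recorded inventory; a real back end would then adjust duration, amplitude, and F0 and smooth the spectrum across each join.

    INVENTORY = {"sil-h", "h-ah", "ah-l", "l-ow", "ow-sil"}   # illustrative units

    def select_diphones(phones):
        padded = ["sil"] + phones + ["sil"]
        units = [f"{a}-{b}" for a, b in zip(padded, padded[1:])]
        missing = [u for u in units if u not in INVENTORY]
        if missing:
            raise ValueError(f"no recording for: {missing}")
        return units

    print(select_diphones(["h", "ah", "l", "ow"]))
    # -> ['sil-h', 'h-ah', 'ah-l', 'l-ow', 'ow-sil']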

  22. Issues • Speech quality varies based on • Size and number of units (coverage) • Rules • Speech coding method used to decompose acoustic signal into spectral, F0, amplitude parameters • How much the signal must be modified to produce the output

  23. Coding Methods • LPC: Linear Predictive Coding • Decompose waveform into vocal tract/formant frequencies, F0, amplitude: simple model of glottal excitation • Robotic • More elaborate variants (MPLPC, RELP) less robotic but distortions when change in F0, duration • PSOLA (pitch synchronous overlap/add): • No waveform decomposition

  24. Delete/repeat pitch periods to change duration • Overlap pitch periods to change F0 • Distortion if large F0, durational change • Sensitive to definition of pitch periods • No coding (use natural speech) • Avoid distortions of coding methods • But how to change duration, F0, amplitude?
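
A sketch of the duration side of PSOLA in Python: lengthen or shorten speech by repeating or deleting whole pitch periods. Fixed-length chunks stand in for real pitch-marked periods, and the overlap-add windowing that PSOLA uses to change F0 and hide joins is left out.

    def stretch(samples, period_len, factor):
        """Scale duration by `factor` by emitting each period 0..n times."""
        periods = [samples[i:i + period_len]
                   for i in range(0, len(samples), period_len)]
        out, credit = [], 0.0
        for period in periods:
            credit += factor
            while credit >= 1.0:                  # repeat (or skip) this period
                out.extend(period)
                credit -= 1.0
        return out

    signal = list(range(40))                      # stand-in for a waveform
    print(len(stretch(signal, period_len=10, factor=1.5)))   # 60 samples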

  25. Corpus-based Unit Selection • Units determined case-by-case from large hand- or automatically labeled corpus • Amount of concatenation depends on input and corpus • Algorithms for determining best units to use • Longest match to phonemes in input • Spectral distance measures • Matching prosodic, amplitude, durational features?
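
A minimal unit-selection sketch in Python: for each target phone, choose the candidate that minimizes a weighted sum of a target cost (here, F0 mismatch) and a join cost (here, a spectral distance to the previous unit). The candidates, features, and weights are illustrative, the search is greedy, and a real system would use much richer costs and a Viterbi search over the whole utterance.

    def target_cost(unit, target_f0):
        return abs(unit["f0"] - target_f0)

    def join_cost(prev, unit):
        return 0.0 if prev is None else abs(prev["spec"] - unit["spec"])

    def select_units(candidates_per_phone, target_f0s, w_join=2.0):
        chosen, prev = [], None
        for cands, f0 in zip(candidates_per_phone, target_f0s):
            best = min(cands,
                       key=lambda u: target_cost(u, f0) + w_join * join_cost(prev, u))
            chosen.append(best["id"])
            prev = best
        return chosen

    candidates = [
        [{"id": "ah_1", "f0": 110, "spec": 0.2}, {"id": "ah_2", "f0": 180, "spec": 0.9}],
        [{"id": "l_1", "f0": 120, "spec": 0.3}, {"id": "l_2", "f0": 190, "spec": 1.0}],
    ]
    print(select_units(candidates, target_f0s=[115, 120]))    # ['ah_1', 'l_1']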

  26. TTS Back End: Summary • Speech most natural when least signal processing: corpus-based unit selection and no coding….but….

  27. TTS: Where are we now? • Natural sounding speech for some utterances • Where good match between input and database • Still…hard to vary prosodic features and retain naturalness • Yes-no questions: Do you want to fly first class? • Context-dependent variation still hard to infer from text and hard to realize naturally:

  28. Appropriate contours from text • Emphasis, de-emphasis to convey focus, given/new distinction: I own a cat. Or, rather, my cat owns me. • Variation in pitch range, rate, pausal duration to convey topic structure • Characteristics of ‘emotional speech’ little understood, so hard to convey: …a voice that sounds friendly, sympathetic, authoritative…. • How to mimic real voices?

  29. TTS vs. CTS • Decisions in Text-to-Speech (TTS) depend on syntax, information status, topic structure,… information explicitly available to NLG • Concept-to-Speech (CTS) systems should be able to specify “better” prosody: the system knows what it wants to say and can specify how • But….generating prosody for CTS isn’t so easy

  30. Next Week • Read • Discussion questions • Write an outline of your class project and what you’ve done so far
