DYNAMIC ADAPTATION FOR LANGUAGE AND DIALECT IN A SPEECH SYNTHESIS SYSTEM
Craig Olinsky
Media Lab Europe / University College Dublin

OVERVIEW: ADAPTATION IN SPEECH SYNTHESIS
• Many of the areas that could benefit most from community-focused IT resource development have very high illiteracy rates. For such users, speech-based systems provide the most obvious and natural way to interact with computers.
• Without the widespread availability of high-quality speech databases, computer-readable lexicons, and other pre-processed linguistic information of the kind available for, say, standard dialects of French or German, such systems are expensive and difficult to build. (“learning from sample” case in other presentation)

OVERVIEW: ADAPTATION IN SPEECH SYNTHESIS
• Even within a particular language (including the major ones), personalizing a speech synthesis system for a particular use, market, and especially accent can greatly benefit a deployed system. Recent articles have suggested that humans connect better as listeners with a speaker and voice that sound like them: they not only find it easier to listen to and understand what is said, but also find it more natural to assign emotional state and to judge such factors as authority and honesty, and even intelligibility.

OVERVIEW: ADAPTATION IN SPEECH SYNTHESIS
Perhaps the system can LISTEN to the user, and then CHANGE ITS OUTPUT to sound more like what it hears?
• Instead of creating a dedicated system for every purpose, set up a number of “baseline” systems (along different languages, language families, etc.) and set them learning.
• We benefit from the work put into developing the baseline system, while requiring only a (minimal?) amount of additional focused training data.
• Assumption: learning “Accent”, “Dialect”, and “Language” are not distinct processes, but all a matter of degree?

OVERVIEW: ADAPTATION IN SPEECH SYNTHESIS
• HUMAN ANALOGUE: People who live for a period of time in an area where a different accent or dialect of their language is spoken often (involuntarily) start to pick up the local manners of speech.
• SPEECH RECOGNITION ANALOGUE: “Speaker Adaptation”: a procedure in which the acoustic model of the recognition system (or, in limited cases, the language model as well), after being fully trained, is provided with additional speech data. Based upon this data, the values, parameters, nodes, weights, or other coefficients representing the acoustic model are shifted “towards” the new information, so that the system should exhibit improved performance on data resembling the new training data, even though such data may not have been included in its initial training procedure.
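As an illustration of shifting model parameters “towards” new data, the following is a minimal sketch of a MAP-style update of a single Gaussian mean. The slides do not say which adaptation technique is meant, so the choice of MAP interpolation, the function name `map_adapt_mean`, the prior weight `tau`, and the feature values are all assumptions made for the example.

```python
def map_adapt_mean(prior_mean, samples, tau=10.0):
    """Shift a trained mean vector towards new adaptation samples.

    prior_mean: mean vector from the fully trained model.
    samples:    list of feature vectors from the new speaker.
    tau:        assumed prior weight; larger values trust the old model more.
    """
    n = len(samples)
    if n == 0:
        # No adaptation data: keep the trained mean unchanged.
        return list(prior_mean)
    dim = len(prior_mean)
    sums = [sum(sample[d] for sample in samples) for d in range(dim)]
    # Weighted combination of the old mean and the new observations.
    return [(tau * prior_mean[d] + sums[d]) / (tau + n) for d in range(dim)]


# Hypothetical two-dimensional feature values, for illustration only.
prior = [500.0, 1500.0]
new_data = [[600.0, 1400.0]] * 10
adapted = map_adapt_mean(prior, new_data, tau=10.0)
print(adapted)  # [550.0, 1450.0]
```

With ten new samples and `tau` set to 10, the adapted mean lands halfway between the trained mean and the new data; with more adaptation data it would move further towards the new speaker.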

BACKGROUND: SPEAKER ADAPTATION FOR SPEECH RECOGNITION SYSTEMS
QUICK PROCEDURE OVERVIEW: Given a set of recorded target utterances and associated transcripts:
• Generate a synthesized utterance from the transcript using the current synthesizer (letter-to-sound rules, phones, speech database, etc.).
• Compare the target recording to the generated source form to determine how the two pronunciations differ.
• Re-organize the phone units and the speech unit selection process to incorporate the differences and information from the target recording units.
• Modify the lexical entries and letter-to-sound rules of the existing synthesizer to produce output that more closely resembles the target utterance.
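The comparison step in the procedure above can be sketched as a sequence alignment between two phone strings. This is only an illustration under stated assumptions: it presumes both the synthesized form and the target recording have already been reduced to phone labels, and the function name, the ARPAbet-like symbols, and the example word are hypothetical.

```python
from difflib import SequenceMatcher


def pronunciation_differences(source_phones, target_phones):
    """List the spans where two phone sequences disagree.

    Returns (operation, source_span, target_span) tuples, where operation
    is 'replace', 'delete', or 'insert' relative to the source sequence.
    """
    matcher = SequenceMatcher(None, source_phones, target_phones, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            diffs.append((tag, source_phones[i1:i2], target_phones[j1:j2]))
    return diffs


# Hypothetical phone strings for one word in two accents.
source = ["t", "ah", "m", "ey", "t", "ow"]
target = ["t", "ah", "m", "aa", "t", "ow"]
diffs = pronunciation_differences(source, target)
print(diffs)  # [('replace', ['ey'], ['aa'])]
```

Differences collected this way over many utterances could then inform the later steps, for example by suggesting which lexical entries or letter-to-sound rules to revise.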

VARIATION AND ADAPTATION
Ignoring for a moment issues such as vocabulary choice and other semantic issues of usage, it is possible to consider variation from accent, “dialect”, and even across languages as a difference in degree of variation in a few key areas:
• the phonetic inventory which comprises the basic building blocks in which things are pronounced;
• a set of pronunciation rules or examples which dictate how the phonetic units are put together to assign a pronunciation to an orthographic form, and subsequently speak the desired text; and
• a collection of conventionalized stress and intonational patterns which help provide structure and syntactic/semantic context to the overall produced utterances.

VARIATION AND ADAPTATION
• Cross-Speaker Adaptation. In this mode, a generalized speech synthesizer is adapted towards the voice of a single user of the system. Assuming that the original “voice” of the synthesizer is that of a professional speaker, this can be done in one of two ways. Either qualities of the user’s voice are applied to the default voice, while the database of sound samples of the original speaker is retained for use as the concatenative synthetic voice; or the database is expanded (or replaced) with samples of the user’s voice, while some abstract “quality” of the original professional voice is nonetheless retained, ideally preserving some of the clarity and understandability for which the original speaker was chosen. The ability to create natural-sounding speech by concatenating samples drawn from a speech database composed of recordings from multiple users, and/or of varying quality, would also help encourage an open-source “bazaar” of decentralized users attempting to amass the large number of recorded forms necessary for a multi-purpose unit-selection synthesizer.

VARIATION AND ADAPTATION
• Cross-Dialect Adaptation. This is almost exactly the case described above, except that the “default” voice form and the specific user’s voice differ in dialect, or to some greater degree than the average set of native speakers from a given area. That is, we would expect not only variation in voice quality, but also limited differences in vocabulary, phonetic inventory, distinguishable minimal pairs, accent, and the like. The result is that not only the unit-selection database may need alteration, but also those components which assign phonetic realizations to the given text: the letter-to-sound rules and the pronunciation dictionary or lexicon.
• Cross-Language Adaptation. In this case, we retain some degree of phonetic inventory similarity between the source and destination language, but our letter-to-sound rules and lexicon need gross modification, or may even be unusable (even some language pairs which are very similar in pronunciation, such as Japanese and Korean, could nonetheless use unrelated orthographic forms, or vice versa).

VARIATION AND ADAPTATION
• Cross-Language Adaptation, Single Speaker Variant. In this case, we have recordings from a single speaker (i.e., the user), whose voice we want to be able to speak naturally in languages of which the user is not a native speaker. We thus want to use information about these other languages to adapt the synthesizer of the user’s voice to speak multilingually. (This is especially significant in our global community, where many proper nouns, such as personal names and locations, cannot be properly pronounced simply by following the phonological rules of a single language.)
• Language “Acquisition”. In the extreme case, we wish to bootstrap an “empty” synthesizer (with no lexicon or knowledge of pronunciation rules whatsoever) to speak like us simply by speaking to it, without hard-coding direct linguistic or phonetic knowledge. This is a task that a non-technical, non-expert native speaker should be able to perform.

OVERVIEW: ADAPTATION IN SPEECH SYNTHESIS
• Synthesis adds a problem that recognition adaptation does not have: the database of recorded segments is itself used for concatenation. This means that we cannot just merge the entire set of recorded data together; there would be noticeable discrepancies between concatenative units taken from each individual speaker. On the other hand, if we just use the new set of segments, we aren’t adapting; we’re just building a new synthesizer. For this study, we take the new target data to be a small data set: not enough to be a good set of units for synthesis on its own.

OVERVIEW: ADAPTATION IN SPEECH SYNTHESIS
• We are thus required to use existing (source) units for synthesis. However, these source recordings and their associated existing synthetic voice have a specific accent/dialect, with a pre-defined phone set. Even with a proper dictionary and proper letter-to-sound rules providing us with a “proper” pronunciation that takes into account pronunciation variation for our target accent, stringing the “best match” units together likely won’t sound like a native speaker of that accent. The vowel quality might be vastly different, or phones might be missing in the source language (e.g., a French /r/). We want to adapt for this. Overall, we want to sound native in the target accent/dialect/language, using units recorded from a speaker of a different one.

PHONE UNIT ADAPTATION
• If the variation between source and target speech is large enough, it is likely that we would describe the target speech with a different phone set than that of the source speech.
• We may still find that the pronunciation of a particular phone in the target corresponds more closely with that of a different one than our source pronunciation lexicon would suggest (for instance, schwa reduction).
• Or we might have an existing target pronunciation lexicon or pronunciation rules with a predefined phone set we wish to use.
• To utilize data from our source synthesizer in such a case, we need to assign appropriate mappings between source and target phones.
• This can be seen as a matter of degree: how much effort or knowledge is incorporated into creating the mapping, how closely such a mapping corresponds to the observed data, and thus our (assumed) rating of the quality of such a mapping.

PHONE UNIT ADAPTATION
Figure 1: Degrees of Phoneme Mapping. Scale from (alleged) WORST to (alleged) BEST, left to right: Source Phoneme Set; Naïve Mapping; Linguistically-Motivated Phoneme Mapping; Data-Driven Mapping; Target Phoneme Set.

PHONE UNIT ADAPTATION
• na(t)ive approach: this approach follows the principle a non-native would follow when speaking a second language: he basically has the phonetic inventory of the first language and partially uses that inventory when speaking the second language….
• phonetic approach: this strategy follows principles in the production of sounds in the human vocal tract … that sound that agrees in the most phonetic features with the untrained one is taken instead of the unknown one of the goal language….
• data-driven approach: this approach determines the similarity among phones with the data given by the trained recognizer… according to a distance measure the most similar units may be joined.
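The phonetic and data-driven approaches above can be contrasted in a short sketch. Everything concrete here is an assumption for illustration: the feature labels, the three-phone source inventory, the two-dimensional acoustic means, and the use of Euclidean distance as the distance measure are placeholders, not values or choices taken from the study.

```python
import math

# Hypothetical articulatory feature sets for a tiny source inventory.
SOURCE_FEATURES = {
    "t": {"consonant", "voiceless", "alveolar", "stop"},
    "d": {"consonant", "voiced", "alveolar", "stop"},
    "s": {"consonant", "voiceless", "alveolar", "fricative"},
}

# Hypothetical mean acoustic vectors for the same source phones.
SOURCE_MEANS = {
    "t": [1.0, 0.0],
    "d": [0.0, 1.0],
    "s": [3.0, 3.0],
}


def map_by_features(target_features, source_features):
    """Phonetic approach: pick the source phone sharing the most features."""
    # Iterating in sorted order makes tie-breaking deterministic.
    return max(sorted(source_features),
               key=lambda phone: len(source_features[phone] & target_features))


def map_by_distance(target_mean, source_means):
    """Data-driven approach: pick the acoustically closest source phone."""
    return min(sorted(source_means),
               key=lambda phone: math.dist(target_mean, source_means[phone]))


# A target phone missing from the source inventory (a retroflex stop).
retroflex_features = {"consonant", "voiceless", "retroflex", "stop"}
retroflex_mean = [0.9, 0.2]

print(map_by_features(retroflex_features, SOURCE_FEATURES))  # t
print(map_by_distance(retroflex_mean, SOURCE_MEANS))         # t
```

In this toy case both approaches agree; with real data they can disagree, and feature counting often produces ties that need an explicit tie-breaking rule.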

PRONUNCIATION ADAPTATION
• Typically taken for granted in multilingual speech adaptation studies is the presence of a pronunciation dictionary and/or rules for the target language.
• At the far extremes, we assume the existence of well-targeted pronunciation rules: in the worst case, a set designed for the source speech, and in the best case, one specifically designed for the target. In between, we use a number of methods to derive or create a pronunciation module, based either upon the existing source-language methods, the target speech data itself, or some combination.

PRONUNCIATION ADAPTATION
Figure 2: Letter-to-sound rules / lexicon. Scale from (alleged) WORST to (alleged) BEST, left to right: Principled Source-Only; “Foreign Approximation”; Language-Neutral; Trained from Target Data; Principled Target-Only.

PRONUNCIATION ADAPTATION
• Principled Source-Only: this approach merely uses pronunciation methods specifically designed for the source speech to generate a pronunciation form for the target. This approach can result in extremely inaccurate pronunciation approximations, such as one might expect from a native English speaker’s attempt at a native pronunciation of an unusual foreign word.
• “Foreign Approximation”: this approach can be seen as akin to the na(t)ive approach to phone mapping discussed above. In this case, the speaker recognizes that the word being pronounced is not a native one, and relaxes some of the language-specific rules or attempts to move the pronunciation closer to that of the “assumed” language of the word in question. The result is closer, but still inaccurate and strongly accented.

PRONUNCIATION ADAPTATION
• Language-Neutral: this approach ignores all language-specific information, assuming a set of very generic or regular pronunciation rules that propose a (relatively) direct relation between orthographic form and pronunciation. Such rules would closely resemble those used for a language with artificially few pronunciation exceptions, such as Esperanto, rather than those of English.
• Trained from Target Data: in this method, an aligned text and speech signal are provided to a recognizer, along with (possibly) a limited set of pronunciation transcriptions as training data. In some automatic way, the system learns a set of pronunciation rules and/or a lexicon of pronunciations which closely matches the training data.
• Principled Target-Only: this approach assumes a provided pronunciation module specifically designed to generate correct pronunciations for our target language/dialect/accent.
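To make the Language-Neutral option concrete, here is a minimal sketch of a table-driven letter-to-sound converter with a near one-to-one relation between spelling and pronunciation. The table contents, the phone symbols, and the greedy longest-match strategy are assumptions chosen for the example, not rules taken from the study.

```python
# Hypothetical grapheme-to-phone table with almost no exceptions.
NEUTRAL_TABLE = {
    "sh": "S",
    "a": "a", "i": "i", "o": "o",
    "h": "h", "n": "n", "s": "s", "t": "t",
}


def language_neutral_pronunciation(word, table):
    """Convert spelling to phones by greedy longest-match table lookup."""
    word = word.lower()
    max_len = max(len(grapheme) for grapheme in table)
    phones = []
    i = 0
    while i < len(word):
        # Try the longest grapheme first, then progressively shorter ones.
        for length in range(min(max_len, len(word) - i), 0, -1):
            chunk = word[i:i + length]
            if chunk in table:
                phones.append(table[chunk])
                i += length
                break
        else:
            # No rule covers this character: skip it.
            i += 1
    return phones


print(language_neutral_pronunciation("shanti", NEUTRAL_TABLE))
# ['S', 'a', 'n', 't', 'i']
```

A table like this has no notion of context or exceptions, which is exactly why it suits regular orthographies and performs poorly on a language such as English.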

UNIT DATABASE COMPOSITION
Figure 3: Methods of Comprising Unit Database. Scale from (alleged) WORST to (alleged) BEST, left to right: Source Speaker Only; Union of all Recordings (unprincipled); Source Speaker + uncovered phones from target only; Set of Digitally Altered Segments; Target Speaker Only.
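The middle option in Figure 3, source speaker units plus target units only for phones the source does not cover, can be sketched as a simple merge. The dictionary layout (phone label mapped to a list of unit identifiers), the function name, and the unit names are hypothetical; a real unit-selection database would carry far more information per unit.

```python
def compose_unit_database(source_units, target_units):
    """Keep all source units; add target units only for uncovered phones."""
    database = {phone: list(units) for phone, units in source_units.items()}
    for phone, units in target_units.items():
        if phone not in database:
            # The source speaker has no recording of this phone.
            database[phone] = list(units)
    return database


# Hypothetical unit inventories keyed by phone label.
source_units = {"a": ["src_a_001", "src_a_002"], "t": ["src_t_001"]}
target_units = {"a": ["tgt_a_001"], "r_uvular": ["tgt_r_001"]}

database = compose_unit_database(source_units, target_units)
print(sorted(database))  # ['a', 'r_uvular', 't']
print(database["a"])     # ['src_a_001', 'src_a_002']
```

This keeps the high-quality source recordings wherever they exist, at the cost of mixing speakers at the few phones that only the target supplies.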

ADAPTATION FROM MIMICRY
• We know from the beginning that our source unit database is of the best quality (in terms of recording, segmentation, labelling, etc.).
• But we can’t directly synthesize from the source database, because we will get accented, non-native sounding speech. Is there a way to generate in a non-accented or differently-accented way from a single speech database?
• Try to find a “neutrally” accented speaker? (What does this mean: someone heavily polylingual? Someone geographically in between the two languages or accents?)
• Look at mimicry studies: how someone (intentionally) modifies their voice to sound like a different speaker.

ADAPTATION FROM MIMICRY
• Anders Eriksson and Pär Wretling: “How Flexible is the Human Voice? – A Case Study of Mimicry”
  • Close mimicry of global speech rate
  • No change for timing at segmental level
  • Mean fundamental frequency and variation matched timing closely
  • Formant frequencies attained with varying success: vowel imitation intermediate between own voice and target
  • “Fundamental frequency changes were more successful than changes in timing”

STAGES OF THE EXPERIMENT
• Our development efforts and systems will follow the four modes listed in the research overview in order of ascribed complexity. For the Cross-Speaker Adaptation case, we will utilize a base voice and training speaker of native American English.
• For the Cross-Dialect Adaptation study, we will retain the use of English for the basic case, adapting over a selection of American, British, and Irish English dialects.
• We will then finish with two data sets for Cross-Language Adaptation, proceeding in order of linguistic variation: variation over the set of Celtic languages still in current use (Irish, Scottish Gaelic, and, slightly more distantly, Welsh) and a selection of Asian Indian languages, including (at least) Bengali and Hindi.
