
Considerations on using PLS for Slovenian Pronunciation Lexicon Construction

Internationalizing W3C's Speech Synthesis Markup Language, Workshop II, Heraklion, Crete, May 2006. Jerneja Žganec Gros, Alpineon d.o.o., Ljubljana, Slovenia, jerneja.gros@alpineon.com

Presentation Transcript


  1. Internationalizing W3C's Speech Synthesis Markup Language, Workshop II, Heraklion, Crete, May 2006 • Considerations on using PLS for Slovenian Pronunciation Lexicon Construction • Jerneja Žganec Gros • Alpineon d.o.o., Ljubljana, Slovenia • jerneja.gros@alpineon.com

  2. ALPINEon • SI-PRON lexicon: • word list • lexicon format • phonetic transcription • morpho-syntactic descriptions • Proposed extensions to PLS, SSML • Conclusions

  3. Language specifics • Slovenian language: • Slavic language, 2 million speakers, over 70 dialects • complex inflectional paradigm (common to Slavic languages) • including the dual number, as in ancient Greek • lexical stress position: not fixed and mobile, as in Russian (unlike some other Slavic languages; Croatian, for example, never stresses the last syllable) • many homographs; POS information usually helps with disambiguation: • example: On je. (He is / He eats.): auxiliary_verb/indicative

  4. Pron lex • Speech technology applications: • automatic speech recognition (ASR) • text-to-speech synthesis (TTS) • require a consistent specification of pronunciation • Slovenian: lexical stress position is not fixed -> a pronunciation lexicon is crucial • Pronunciation lexicons: • general: not supposed to be covered by PLS • application-specific • word/phrase pronunciations • application-specific proper nouns: personal and location names

  5. Word-list • SI-PRON wordlist: (a) 93,154 lemmas from SSKJ (b) over 1,000,000 word forms derived from (a) by morphological derivation (c) additional word list: • corpus-based search • 20,000 most frequent inflected word forms not covered by SSKJ lemmas (d) collocations, multi-word expressions • SSKJ: Slovar slovenskega knjižnega jezika (Dictionary of the Slovenian Standard Language)

  6. Phonetic transcriptions • SSKJ lemmas: • automatic derivation, based on dynamic/tonemic accent information • manual corrections for about 2,500 lemmas (words of foreign origin) • Word forms derived from SSKJ: • automatic: SSKJ lemma pronunciation look-up, inflectional paradigms • Additional corpus-based word list: • automatic lexical stress assignment • AlpSynth grapheme-to-phoneme rule set

  7. GTP rules • 193 context-dependent grapheme-to-phoneme rules, for example:
     Left context | Grapheme string | Right context | Phonetic transcr. | Example | Rule explanation
     $ | er | _ | [@r] | Gaber | @ occurs before each -r not followed by a vowel (Toporišič 91, p. 49)
     = | m | f | [F] | Simfonija | <m> in front of <f> and <v> is pronounced as a labiodental (Pravopis 90, p. 145)
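
The rule format on this slide (left context, grapheme string, right context, phonetic transcription) can be sketched as a small rule matcher. This is only an illustration, not the AlpSynth rule engine: the slide does not define the context symbols, so the code assumes `$` means "any consonant", `_` means "word boundary" and `=` means "any context". The names `Rule`, `context_matches` and `transcribe` are hypothetical, and graphemes not covered by the two example rules are simply copied through.

```python
from dataclasses import dataclass

VOWELS = set("aeiou")


@dataclass(frozen=True)
class Rule:
    left: str       # context symbol or literal grapheme to the left
    graphemes: str  # grapheme string rewritten by the rule
    right: str      # context symbol or literal grapheme to the right
    phonemes: str   # SAMPA-like output string


def context_matches(symbol: str, neighbour: str) -> bool:
    """`neighbour` is the adjacent character, or "" at a word boundary."""
    if symbol == "=":   # ASSUMPTION: "=" matches any context
        return True
    if symbol == "_":   # ASSUMPTION: "_" matches a word boundary
        return neighbour == ""
    if symbol == "$":   # ASSUMPTION: "$" matches any consonant
        return neighbour != "" and neighbour not in VOWELS
    return neighbour == symbol  # otherwise: literal grapheme


# The two example rules shown on the slide.
RULES = [
    Rule("$", "er", "_", "@r"),
    Rule("=", "m", "f", "F"),
]


def transcribe(word: str, rules=RULES) -> str:
    word = word.lower()
    out = []
    i = 0
    while i < len(word):
        for rule in rules:
            end = i + len(rule.graphemes)
            if word[i:end] != rule.graphemes:
                continue
            left = word[i - 1] if i > 0 else ""
            right = word[end] if end < len(word) else ""
            if context_matches(rule.left, left) and context_matches(rule.right, right):
                out.append(rule.phonemes)
                i = end
                break
        else:
            # No rule fired: copy the grapheme unchanged (placeholder only).
            out.append(word[i])
            i += 1
    return "".join(out)


print(transcribe("Gaber"))      # the "er" rule fires at the end of the word
print(transcribe("Simfonija"))  # the "m before f" rule fires
```

A full rule set would also need a rule ordering policy (for example, most specific context first); the sketch simply takes the first matching rule in list order.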

  8. Transcription accuracy experiment • reference: hand-crafted pron lex, 30K lexemes, no loanwords(!) • automatic lexical stress assignment: 15% error rate • lexical stress & o/e pronunciation known in advance: • transcription success rate 99.1%, i.e. 0.9% mismatches, of which 0.6% were errors in the hand-crafted reference • conclusion: • for semi-automatic derivation of phonetic transcriptions with a 0.3% error rate, only lexical stress positions & e/o need to be manually validated

  9. SI-PRON format • LC-STAR lexicon specs – STTS (Shamas & v Heuvel, 2004) • Pronunciation Lexicon Specification (PLS) • Version 1.0, W3C Last Call Working Draft 31 January 2006 • http://www.w3.org/TR/pronunciation-lexicon/ • PLS: • Ver 1.0 not designed for TTS internal lexicons • on the other hand, we want to have a stronger link between SSML and the lexicon • we are even thinking of introducing POS attribute into token-like elements! • leave these issues for PLS Ver 2.x or address them now?

  10. Pronunciation variations • multiple pronunciations: • several <phoneme> elements • preferred pronunciation: • indicated by the prefer attribute • usually the 1st pronunciation from the SSKJ • for some words, 2 prons are equally preferred, e.g.: - masculine Slovenian nouns ending in "ilec", like /borilec/, /darovalec/ • "iUts"/"ilts", "ilts"/"iUts", "ilts", or "iUts" • these typically correspond to a more fluent "iUts" or an overarticulated "ilts" pronunciation
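
As a rough illustration of how such an entry could look in PLS, the sketch below builds a lexeme with two `<phoneme>` elements and marks one of them with `prefer="true"`. The element names and the namespace come from PLS 1.0; everything else is an assumption made for the example: the helper `build_lexicon` is hypothetical, `x-sampa` is used as a vendor-style alphabet label, and the strings `boriUts` / `borilts` are assembled from the endings on the slide without stress marks, so they are not verified transcriptions.

```python
import xml.etree.ElementTree as ET

PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_lexicon(entries, alphabet="x-sampa", lang="sl"):
    """entries: {grapheme: [(pronunciation, preferred_flag), ...]}"""
    ET.register_namespace("", PLS_NS)  # serialize PLS as the default namespace
    lexicon = ET.Element(
        f"{{{PLS_NS}}}lexicon",
        {"version": "1.0", "alphabet": alphabet, XML_LANG: lang},
    )
    for grapheme, prons in entries.items():
        lexeme = ET.SubElement(lexicon, f"{{{PLS_NS}}}lexeme")
        ET.SubElement(lexeme, f"{{{PLS_NS}}}grapheme").text = grapheme
        for text, preferred in prons:
            attrs = {"prefer": "true"} if preferred else {}
            ET.SubElement(lexeme, f"{{{PLS_NS}}}phoneme", attrs).text = text
    return lexicon


# ILLUSTRATIVE transcriptions only (see the note above the code).
lexicon = build_lexicon({"borilec": [("boriUts", True), ("borilts", False)]})
print(ET.tostring(lexicon, encoding="unicode"))
```

Note that a single boolean flag cannot express "both variants are equally preferred", which is the gap the slide points at.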

  11. Extensions… • proposed extension for PLS/SSML: • a new optional attribute for the <phoneme> element: • pron-style attribute • values: "fluent", "overarticulated" • pron-style also for other elements (linkage SSML-lex!): • <voice>, <speak>, <p>, <s> • another optional attribute for the above elements: emotion, for expressive TTS? • could this be covered by the new role attribute? • similar to <speaking_style>, proposed yesterday
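
To make the proposal concrete, here is a minimal sketch of how a synthesizer might pick a pronunciation if `<phoneme>` carried a `pron-style` attribute. The attribute is the extension proposed on this slide and is not part of PLS 1.0; the function `select_pronunciation`, its fallback order (requested style, then `prefer="true"`, then the first entry) and the sample transcriptions are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

NS = {"pls": "http://www.w3.org/2005/01/pronunciation-lexicon"}

# HYPOTHETICAL lexeme using the proposed (non-standard) pron-style attribute.
LEXEME = """
<lexeme xmlns="http://www.w3.org/2005/01/pronunciation-lexicon">
  <grapheme>borilec</grapheme>
  <phoneme pron-style="fluent">boriUts</phoneme>
  <phoneme pron-style="overarticulated">borilts</phoneme>
</lexeme>
"""


def select_pronunciation(lexeme, style=None):
    """Return the pronunciation for `style`, else the preferred one, else the first."""
    phonemes = lexeme.findall("pls:phoneme", NS)
    if not phonemes:
        return None
    if style is not None:
        for phoneme in phonemes:
            if phoneme.get("pron-style") == style:
                return phoneme.text
    for phoneme in phonemes:
        if phoneme.get("prefer") == "true":
            return phoneme.text
    return phonemes[0].text


lexeme = ET.fromstring(LEXEME)
print(select_pronunciation(lexeme, "overarticulated"))
```

In an SSML context the requested style would come from an enclosing element such as `<voice>` or `<s>`, which is the SSML-lexicon linkage the slide asks for.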

  12. Extensions… • PLS source/creator: • only the <metadata> element • source of multiple pronunciations: • useful info when merging multiple PLS documents • some sources/creators may be more reliable than others… - additional optional attribute pron-source for the <phoneme> element
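
The merging scenario can be sketched with plain dictionaries. The code below keeps, for each pronunciation, the most reliable source that lists it, and orders the variants of a word by source reliability. The source names (`"sskj"`, `"corpus"`), the numeric ranks and the function `merge_lexicons` are hypothetical; a real implementation would read the proposed `pron-source` attribute from the PLS documents being merged.

```python
def merge_lexicons(lexicons, reliability):
    """Merge per-source lexicons into one, remembering each variant's source.

    lexicons:    {source: {grapheme: [pronunciation, ...]}}
    reliability: {source: rank}, where a lower rank means a more trusted source
    returns:     {grapheme: [(pronunciation, source), ...]}, most trusted first
    """
    merged = {}
    for source, entries in lexicons.items():
        for grapheme, prons in entries.items():
            bucket = merged.setdefault(grapheme, {})
            for pron in prons:
                best = bucket.get(pron)
                # Keep the most reliable source for a duplicated pronunciation.
                if best is None or reliability[source] < reliability[best]:
                    bucket[pron] = source
    return {
        grapheme: sorted(bucket.items(), key=lambda item: reliability[item[1]])
        for grapheme, bucket in merged.items()
    }


# ILLUSTRATIVE data: source names and transcriptions are made up for the example.
lexicons = {
    "sskj": {"borilec": ["boriUts"]},
    "corpus": {"borilec": ["borilts", "boriUts"]},
}
reliability = {"sskj": 0, "corpus": 1}
print(merge_lexicons(lexicons, reliability))
```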

  13. Extensions… • part-of-speech tags: • Slovenian – complex inflectional paradigm • morphological, syntactic and semantic(?) descriptors welcome in future revisions of the PLS specification • SSML: POS tags could be defined as an optional attribute of the <token> element • lemma, MSD attributes used in SI-PRON • MULTEXT-East MSDs (Erjavec, 2004) – Telri, Concede • Multext-East LRs, http://nl.ijs.si/ME/V3 • EAGLES, TEI P4 compliant
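
MULTEXT-East MSDs are positional tags: the first character gives the category and each following character encodes one attribute. The sketch below decodes a noun MSD using a small hand-written subset of the value tables (type, gender, number, case). The table and the function `decode_noun_msd` are written for this example only; the authoritative positions and codes are those in the MULTEXT-East specification, and any further positions are ignored here.

```python
# Hand-written SUBSET of noun attributes, in positional order (assumption:
# positions 1-4 after the category letter are type, gender, number, case).
NOUN_POSITIONS = [
    ("type", {"c": "common", "p": "proper"}),
    ("gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("number", {"s": "singular", "d": "dual", "p": "plural"}),
    ("case", {
        "n": "nominative", "g": "genitive", "d": "dative",
        "a": "accusative", "l": "locative", "i": "instrumental",
    }),
]


def decode_noun_msd(msd: str) -> dict:
    """Decode the leading positions of a noun MSD such as 'Ncmdn'."""
    if not msd or msd[0] != "N":
        raise ValueError(f"not a noun MSD: {msd!r}")
    features = {"category": "noun"}
    for code, (name, values) in zip(msd[1:], NOUN_POSITIONS):
        if code == "-":  # attribute not applicable or unspecified
            continue
        features[name] = values[code]
    return features


print(decode_noun_msd("Ncmdn"))  # a common masculine noun in the dual, nominative
```

Carrying such a tag next to each pronunciation is what lets a TTS front end choose between homographs once the tagger has assigned an MSD to the token.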

  14. MSDs

  15. MSDs

  16. MSDs

  17. MSDs • TTS-internal lexicon (for highly inflected languages) • full-blown form (PLS or other) • compact lexicons: • exception lexicon • derivational scheme/paradigm providing prefix/suffix morphological rules and indications of lexical stress position shifts (hardly an issue for PLS)

  18. Conclusion • possible extensions to PLS, SSML: • pron-style attribute • emotion attribute needed? • source/creator attribute welcome • morpho-syntactic, semantic descriptors

  19. Project Partners • L6-5405 project • Research of Slovenian Language in Lexicography and Lexicology based on Digital Language Resources • Spoken representation of Slovenian words: • http://bos.zrc-sazu.si/sskj.html • Alpineon • ZRC-SAZU • Fran Ramovš Institute of the Slovenian Language

  20. THANK YOU FOR YOUR ATTENTION!
