
Language-Independent Phone Recognition




Presentation Transcript


  1. Language-Independent Phone Recognition Jui-Ting Huang, Mark Hasegawa-Johnson jhuang29@illinois.edu University of Illinois at Urbana-Champaign

  2. Motivation • A*STAR challenge • Audio task: audio retrieval given IPA queries or waveforms • Need to transcribe the database and the queries • Multilingual database and queries • We may encounter unseen (untrained) languages (Tamil, Malay, …) • Therefore: phone-based recognition instead of word-based recognition

  3. Training data • 10 languages, 11 corpora • Arabic, Croatian, English, Japanese, Mandarin, Portuguese, Russian, Spanish, Turkish, Urdu • 95 hours of speech • Sampled from a larger set of corpora • Mixed styles of speech: broadcast, read, and spontaneous

  4. Summary of corpora

  5. Phone set • Phonetic symbols: Worldbet • an ASCII encoding of the IPA, plus additional symbols for multilingual coverage • convenient for use with HTK • 205 phones in total • 196 distinct phones from the 10 languages • Non-speech “phones”: • vocalic pause, nasalized pause, short pause, silence, noise, comma, period, question mark
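The Worldbet-to-IPA relationship above can be sketched as a simple lookup. The entries below are a handful of illustrative symbols, not the full 196-phone table used in the system; the point is that ASCII keys stay safe for HTK tooling while remaining convertible to IPA for the retrieval queries.

```python
# Minimal sketch of a Worldbet-to-IPA lookup (illustrative entries only;
# the real system covers all 196 distinct phones from the 10 languages).
WORLDBET_TO_IPA = {
    "tS": "tʃ",   # voiceless postalveolar affricate
    "S":  "ʃ",    # voiceless postalveolar fricative
    "N":  "ŋ",    # velar nasal
    "i:": "iː",   # long close front vowel
}

def to_ipa(worldbet_phones):
    """Map a list of Worldbet symbols to IPA, leaving unknown symbols as-is."""
    return [WORLDBET_TO_IPA.get(p, p) for p in worldbet_phones]

print(to_ipa(["tS", "i:", "N"]))  # ['tʃ', 'iː', 'ŋ']
```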

  6. IPA chart (consonants)

  7. IPA chart (vowels)

  8. Worldbet chart (consonants)

  9. Worldbet chart (vowels)

  10. Acoustic model ^’-A+b%ted ^’-A+b’%ted >A+cm%cmn … • Context-dependent triphone modeling • cross-word triphones • Punctuation marks and lexical stress are also treated as context • Language diacritics are created for each triphone • 141,530 distinct triphones in total • Spectral features: 39-dim PLP with cepstral mean/variance normalization per speaker • Modeling: HMMs with {11, 13, 15, 17}-mixture Gaussians
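The label construction described above can be sketched as follows. The exact label syntax here is an assumption inferred from the slide's examples: HTK-style left-center+right triphones, with a language tag appended as an extra "diacritic" (stress and punctuation context would be folded in the same way).

```python
# Sketch of building cross-word triphone labels of the form left-center+right,
# extended with a language tag (label syntax is an assumption; the slide's
# examples suggest stress/punctuation/language marks are folded into labels).
def triphone_labels(phones, language):
    labels = []
    for i, center in enumerate(phones):
        left = phones[i - 1] if i > 0 else "sil"
        right = phones[i + 1] if i < len(phones) - 1 else "sil"
        labels.append(f"{left}-{center}+{right}%{language}")
    return labels

print(triphone_labels(["h", "E", "l", "oU"], "eng"))
# ['sil-h+E%eng', 'h-E+l%eng', 'E-l+oU%eng', 'l-oU+sil%eng']
```

Enumerating every observed (left, center, right, diacritic) combination over the 10 corpora is what yields the 141,530 distinct triphones reported above.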

  11. Acoustic model (triphone clustering I) • State tying for triphone models • ensures that all state distributions can be robustly estimated • acoustically similar states of different triphones are clustered and tied • Number of states: 424,573 → 19,485 (4.6% of the total) • Decision-tree-based clustering • asks questions about the left and right contexts of each triphone • each question splits the pooled triphones into two acoustically different subsets
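A single tree split of the kind described above can be sketched as a yes/no question over one context position. This is only the partition step; a real tree-building pass would evaluate many candidate questions and keep the one that maximizes the likelihood gain of the split. The phone class below is illustrative.

```python
# Minimal sketch of one decision-tree split for state tying: a yes/no
# question about a context phone partitions pooled triphones into two
# subsets. (A real builder picks the question with the best likelihood
# gain; the feature class here is illustrative.)
NASALS = {"m", "n", "N"}

def split_by_question(triphones, context="left", phone_class=NASALS):
    """triphones are (left, center, right) tuples."""
    idx = 0 if context == "left" else 2
    yes = [t for t in triphones if t[idx] in phone_class]
    no = [t for t in triphones if t[idx] not in phone_class]
    return yes, no

pool = [("m", "A", "t"), ("s", "A", "t"), ("n", "A", "k")]
yes, no = split_by_question(pool)
print(yes)  # [('m', 'A', 't'), ('n', 'A', 'k')]
```

Recursing on each subset until the gain falls below a threshold produces the tied-state inventory (19,485 states here).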

  12. Acoustic model (triphone clustering II) • Categories of decision-tree questions • right or left context • distinctive phone features (manner/place of articulation) • language identity • lexical stress • punctuation mark ^’-A+b%ted ^’-A+b’%ted >A+cm%cmn …

  13. Language model • Triphone bigram language model • equivalent to a monophone quad-gram • Language-independent model • pool the phone-level transcriptions from all corpora together • Vocabulary: the 60K most frequent triphones (the full 141K set is too large) • The remaining infrequent triphones are mapped back to their center monophones
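The vocabulary cut and backoff described above can be sketched as follows: keep the top-K triphones by count and collapse anything rarer to its center monophone (K = 60,000 in the slides; tiny numbers below for illustration). The `l-c+r` label shape is assumed from the acoustic-model slides.

```python
# Sketch of the triphone-bigram vocabulary cut: keep the top-K most
# frequent triphones; back infrequent ones off to their center monophone.
from collections import Counter

def build_vocab(triphone_stream, k):
    counts = Counter(triphone_stream)
    return {t for t, _ in counts.most_common(k)}

def map_token(triphone, vocab):
    """Labels are assumed shaped 'l-c+r'; out-of-vocabulary ones collapse to c."""
    if triphone in vocab:
        return triphone
    center = triphone.split("-", 1)[-1].split("+", 1)[0]
    return center

stream = ["a-b+c", "a-b+c", "b-c+d"]
vocab = build_vocab(stream, k=1)
print([map_token(t, vocab) for t in stream])  # ['a-b+c', 'a-b+c', 'c']
```

A bigram over the mapped token stream is then equivalent to a monophone quad-gram wherever both triphones survive the cut, since each triphone token already spans three phones of context.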

  14. Recognition results • Test set: 50 sentences per corpus

  15. Future work • Preparation of training data • unify non-speech tags across corpora • add more training data • For the language-independent task • language model: interpolation between language-specific LMs • For the language-dependent task • multilingual AM + language-specific LM (word-level recognition)
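The proposed LM interpolation above is standard linear interpolation, P(w | h) = Σᵢ λᵢ Pᵢ(w | h) with the λᵢ summing to 1. A minimal sketch, with illustrative weights and toy per-language probability functions:

```python
# Sketch of linear interpolation between language-specific LMs:
# P(w|h) = sum_i lambda_i * P_i(w|h), weights summing to 1.
# Weights and the toy probability functions below are illustrative.
def interpolate(prob_fns, weights, word, history):
    assert abs(sum(weights) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(w * p(word, history) for w, p in zip(weights, prob_fns))

def p_eng(word, history):  # toy English bigram LM
    return 0.2

def p_cmn(word, history):  # toy Mandarin bigram LM
    return 0.6

print(interpolate([p_eng, p_cmn], [0.5, 0.5], "A", ("sil",)))
```

In practice the weights would be tuned on held-out data (e.g. by EM), which is presumably what the interpolation step in the future-work list refers to.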
