
Features for Factored Language Models for Code-Switching Speech






Presentation Transcript


  1. Features for Factored Language Models for Code-Switching Speech Heike Adel, Katrin Kirchhoff, Dominic Telaar, Ngoc Thang Vu, Tim Schlippe, Tanja Schultz

  2. Outline • Introduction • Motivation • SEAME Corpus • Main Contributions • Factored Language Models • Features for Code-Switching Speech • Experiments • Conclusion • Summary

  3. Motivation • Code-Switching (CS) = speech with more than one language • Exists in multilingual communities or among immigrants • Challenges: multilingual models and CS training data are necessary

  4. SEAME corpus • SEAME = South East Asia Mandarin-English • Conversational speech, recorded from Singaporean and Malaysian speakers by [1] • Challenges: • frequent CS per utterance (on average 2.6 switches) • short monolingual segments (mostly less than 1 sec, 2-4 words) • little training data for LM (575k words) • Originally used in the research project 'Code-Switch' (NTU and KIT) • [1] Lyu, D.C. et al., 2010

  5. Main contributions • Investigation of different features for Code-Switching speech • Integration of factored language models into a dynamic one-pass decoder

  6. Factored Language Models (FLMs) [2] • Idea: word = feature bundle • Good e.g. in the case of • rich morphology • little training data => applicable to the CS task • Generalized backoff: instead of a single backoff order, parent factors can be dropped in any order, e.g. F | F1 F2 F3 -> F | F1 F2, F | F2 F3, F | F1 F3 -> F | F1, F | F2, F | F3 -> F • [2] Bilmes, J. and Kirchhoff, K., 2003
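The factor tables and generalized backoff can be sketched in a few lines of Python. This is a toy illustration, not the paper's implementation: it uses made-up factor bundles, raw relative frequencies instead of discounted estimates, and follows a single backoff path rather than combining several.

```python
from collections import Counter
from itertools import combinations

def train_flm(events, n_parents):
    """Build count tables for every subset of parent factors.
    events: list of (parents, target), where parents is a tuple of
    factor values (e.g. previous word, previous POS tag)."""
    tables = {}
    for r in range(n_parents, -1, -1):
        for subset in combinations(range(n_parents), r):
            joint, ctx_total = Counter(), Counter()
            for parents, target in events:
                ctx = tuple(parents[i] for i in subset)
                joint[(ctx, target)] += 1
                ctx_total[ctx] += 1
            tables[subset] = (joint, ctx_total)
    return tables

def backoff_prob(target, parents, tables, path):
    """Walk one path of the backoff graph, dropping parent factors
    until the context has been seen (raw relative frequencies)."""
    for subset in path:
        joint, ctx_total = tables[subset]
        ctx = tuple(parents[i] for i in subset)
        if ctx_total[ctx] > 0:
            return joint[(ctx, target)] / ctx_total[ctx]
    return 0.0

# toy factor bundles: (previous word, previous POS tag) -> word
events = [(("can", "MD"), "go"),
          (("may", "MD"), "go"),
          (("can", "NN"), "opener")]
tables = train_flm(events, n_parents=2)
path = [(0, 1), (1,), ()]   # word+POS -> POS only -> unigram
print(backoff_prob("go", ("can", "MD"), tables, path))  # 1.0 (full context seen)
print(backoff_prob("go", ("x", "VB"), tables, path))    # backs off to the unigram: 2/3
```

The point of the factored view is visible in the second call: the full word+POS context is unseen, but instead of jumping straight to the unigram, a real FLM could first fall back to the POS-only context, which is often still observed in sparse CS data.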

  7. Features: Words, POS, LID • Problems: • POS tagging of CS speech is challenging • accuracy of the POS tagger is unknown • => a different clustering method may be more robust

  8. Features: Brown Word Clusters • Clusters based on word distributions in text [3] • minimize the loss in average mutual information • Best number of classes in terms of PPL: 70 • So far: clusters based on syntax or word distributions => next step: semantic features • [3] Brown, P.F. et al., 1992
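The mutual-information criterion behind Brown clustering can be made concrete with a small sketch. The tokens and the two hand-made clusterings below are toy data; Brown clustering itself greedily merges the pair of classes whose merge loses the least of this quantity.

```python
import math
from collections import Counter

def class_mutual_information(tokens, cls):
    """Average mutual information between adjacent class bigrams,
    the quantity Brown clustering tries to preserve when merging."""
    bi, left, right = Counter(), Counter(), Counter()
    for a, b in zip(tokens, tokens[1:]):
        bi[(cls[a], cls[b])] += 1
        left[cls[a]] += 1
        right[cls[b]] += 1
    n = sum(bi.values())
    return sum((c / n) * math.log((c / n) / ((left[a] / n) * (right[b] / n)))
               for (a, b), c in bi.items())

tokens = "a x b y a x b y".split()
good = {"a": 0, "b": 0, "x": 1, "y": 1}   # groups words with similar contexts
bad = {"a": 0, "x": 0, "b": 1, "y": 1}    # groups words arbitrarily
print(class_mutual_information(tokens, good) > class_mutual_information(tokens, bad))  # True
```

A clustering that puts distributionally similar words together keeps the class bigrams predictive of each other, which is exactly why such classes make useful LM factors on sparse data.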

  9. Features: Open Class Words • Definition: content words, e.g. nouns, verbs, adverbs • "open" because the class can be extended with new words, e.g. "Bollywood" • => open class words indicate the semantics of a sentence

  10. Features: Open Class Word Clusters • Idea: semantic clusters in comparison to distribution-based clusters (Brown clusters of OC words) • [Diagram: OC words 1-9 grouped into topics a, b and c]

  11. Features: Semantic OC Word Clusters • Clustering of open class word vectors • RNNLMs learn syntactic and semantic similarities [4] • RNNLMs represent words as vectors => apply clustering to these word vectors • k-means clustering • spectral clustering • [4] Mikolov, T. et al., 2013
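A minimal k-means sketch over toy "word vectors": in the paper the vectors come from an RNNLM, whereas here they are made up, and the deterministic farthest-point initialization is one common choice, not necessarily the one used in the experiments.

```python
import numpy as np

def kmeans(vectors, k, iters=20):
    """Plain k-means with farthest-point initialization."""
    centers = [vectors[0]]
    while len(centers) < k:
        # next center: the point farthest from all chosen centers
        d = np.min([np.linalg.norm(vectors - c, axis=1) for c in centers], axis=0)
        centers.append(vectors[d.argmax()])
    centers = np.array(centers)
    for _ in range(iters):
        # assign each vector to its nearest center, then recompute the means
        dists = np.linalg.norm(vectors[:, None] - centers[None], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = vectors[labels == j].mean(axis=0)
    return labels

# toy vectors for six open class words forming two semantic groups
words = ["movie", "film", "actor", "rice", "noodle", "soup"]
vecs = np.array([[1.0, 0.1], [0.9, 0.0], [1.1, 0.2],
                 [0.0, 1.0], [0.1, 0.9], [0.2, 1.1]])
labels = kmeans(vecs, k=2)  # the "movie" words and the "food" words separate
```

The resulting cluster id per word is what gets used as the open-class-word-cluster factor in the FLM.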

  12. Features: Semantic OC Word Clusters • Experiments with different • clustering methods: Brown, k-means, spectral clustering • monolingual and bilingual clusters • monolingual clusters: based on English and Mandarin Gigaword data (2005) • bilingual clusters: based on CS text or mixed lines of Gigaword data • different numbers of clusters • Lowest perplexity (247.24, but unclustered OC words: 247.18): spectral clustering, bilingual clusters, 800 OC word clusters

  13. FLMs: Decoding Experiments • Interpolation weight of FLM and n-gram • [Chart: perplexity as a function of the FLM weight]
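How such an interpolation enters perplexity can be sketched as follows. The two models are toy stand-ins (callables rather than real LMs), and `lam` plays the role of the FLM weight that the chart sweeps over.

```python
import math

def interpolated_ppl(sentences, p_flm, p_ngram, lam):
    """Perplexity under P(w|h) = lam*P_flm(w|h) + (1-lam)*P_ngram(w|h).
    p_flm / p_ngram are callables returning conditional probabilities
    (hypothetical stand-ins for the real models)."""
    log_sum, n = 0.0, 0
    for sent in sentences:
        for i, w in enumerate(sent):
            h = tuple(sent[:i])                       # history of w
            p = lam * p_flm(w, h) + (1 - lam) * p_ngram(w, h)
            log_sum += math.log(p)
            n += 1
    return math.exp(-log_sum / n)

# toy models over a 10-word vocabulary
uniform = lambda w, h: 0.1
sharp = lambda w, h: 0.5 if w == "the" else 0.05
sents = [["the", "cat"], ["the", "dog"]]
print(round(interpolated_ppl(sents, sharp, uniform, lam=0.5), 3))  # 6.667
```

Sweeping `lam` over a development set and keeping the weight with the lowest perplexity mirrors how the interpolation weight in the chart would be tuned.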

  14. FLMs: Decoding Experiments (2) • [Table: decoding results]

  15. Conclusion • Summary • Best features in terms of FLM perplexity: words + POS + Brown clusters + OC words • relative PPL reduction of up to 10.8% (eval) • Best features in terms of MER: words + POS + Brown clusters (+ OC clusters) • relative MER reduction of up to 3.4% (eval)

  16. Thank you for your attention!

  17. Features: Semantic OC Word Clusters (2) • Monolingual clusters: based on English and Mandarin Gigaword data (2005) • Factors for FLMs: words, part-of-speech tags, Brown word clusters and open class word clusters • [Chart: perplexity per clustering method, including unclustered OC words; baseline 3-gram: 268.39]

  18. Features: Semantic OC Word Clusters (3) • Bilingual clusters: based on CS text or mixed lines of Gigaword data • Factors for FLMs: words, part-of-speech tags, Brown word clusters and open class word clusters • [Chart: perplexity per number of clusters (250 to 6000); baseline 3-gram: 268.39]

  19. Factored Language Models • Features investigated in this study: • Words • Part-of-speech tags • Language information • Brown word clusters • Open class words • Open class word clusters

  20. POS tagging of CS speech • "Matrix language" = Mandarin, "embedded language" = English • Analysis of the CS text detects English segments • Language islands (> 2 embedded words) are tagged with a POS tagger for English • The remaining text is tagged with a POS tagger for Mandarin • Post-processing combines both outputs • Schultz, T. et al.: Detecting code-switch events based on textual features, 2010
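The first step of this pipeline, finding the English segments inside the Mandarin matrix text, can be approximated with a script-based language ID sketch. This is a crude stand-in for the analysis step, assuming Mandarin tokens contain CJK characters; the island threshold (> 2 embedded words) would then be applied to the EN segments.

```python
def segment_by_language(tokens):
    """Split a code-switching utterance into contiguous monolingual
    segments: tokens with CJK characters -> Mandarin (MAN),
    everything else -> English (EN)."""
    def lang(tok):
        return "MAN" if any('\u4e00' <= ch <= '\u9fff' for ch in tok) else "EN"
    segments = []
    for tok in tokens:
        l = lang(tok)
        if segments and segments[-1][0] == l:
            segments[-1][1].append(tok)   # extend the current segment
        else:
            segments.append((l, [tok]))   # start a new segment
    return segments

utt = ["我", "们", "go", "to", "那个", "park"]
for language, seg in segment_by_language(utt):
    # EN islands would go to the English POS tagger,
    # the remaining text to the Mandarin tagger
    print(language, seg)
```

Real LID on conversational speech is harder than script detection (Singlish particles, romanized Mandarin, named entities), which is exactly why the slide treats the analysis step as its own component.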

  21. Features: Brown Word Clusters • Clusters based on word distributions in text • Find classes which minimize the loss in average mutual information • Tool: SRILM toolkit • Number of classes determined based on PPL results on the SEAME development set • [Chart: perplexity vs. number of classes]

  22. Features: Brown Word Clusters (2) • Brown word clusters: 70 classes • So far: clusters based on syntax or word distributions => next step: semantic features

  23. Features: Open Class Words • Definition: content words, e.g. nouns, verbs, adverbs • "open" because the class can be extended with new words, e.g. "Bollywood" • => open class words indicate the semantics of a sentence

  24. Features: Open Class Word Clusters • Idea: • Comparing bilingual and monolingual clusters • data: English and Chinese Gigaword data [2], CS corpus • Brown clustering of open class words • Clustering of open class word vectors • [2] fifth edition; English: LDC2011T07, Chinese: LDC2011T13 • [Diagram: OC words 1-9 grouped into topics a, b and c]

  25. Open Class Word Clusters (2) • Spectral clustering with Graclus [4] • Pipeline: build a similarity graph • coarsen it to about 5k nodes by merging nodes • apply bisection clustering • uncoarsen the graph with weighted kernel k-means clustering (bisection clustering results as initialization) => k classes • [4] Dhillon, I.S. et al.: Weighted graph cuts without eigenvectors: A multilevel approach
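For intuition, the cut Graclus computes can be illustrated with the classic eigenvector-based spectral partition that Graclus is designed to avoid (its multilevel kernel k-means reaches a similar objective without eigendecomposition). The similarity matrix below is a toy example, not data from the paper.

```python
import numpy as np

def fiedler_bipartition(W):
    """Two-way spectral partition of a similarity graph W: split on
    the sign of the second-smallest eigenvector of the Laplacian."""
    L = np.diag(W.sum(axis=1)) - W       # unnormalized graph Laplacian
    _, vecs = np.linalg.eigh(L)          # eigenvectors, ascending eigenvalues
    return (vecs[:, 1] > 0).astype(int)  # sign of the Fiedler vector

# two dense word groups connected by one weak edge
W = np.array([[0, 1, 1, 0.0, 0, 0],
              [1, 0, 1, 0.0, 0, 0],
              [1, 1, 0, 0.1, 0, 0],
              [0, 0, 0.1, 0, 1, 1],
              [0, 0, 0.0, 1, 0, 1],
              [0, 0, 0.0, 1, 1, 0]])
labels = fiedler_bipartition(W)  # nodes 0-2 land in one class, 3-5 in the other
```

Eigendecomposition scales poorly with vocabulary size, which motivates the multilevel coarsen/bisect/refine scheme on the slide for clustering thousands of open class words.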

  26. CS Trigger Ability of Different Factors • [Chart: CS-rate per factor value]
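One plausible reading of a factor's CS-rate (the slide's chart itself is not reproduced here) is the fraction of tokens carrying a given factor value whose successor is in the other language. A sketch with made-up tokens, under that assumed definition:

```python
from collections import Counter

def cs_rate(tokens, value_of, lang_of):
    """CS-rate per factor value: fraction of tokens with that value
    that are immediately followed by a language switch (one plausible
    definition, assumed here rather than taken from the slide)."""
    switches, totals = Counter(), Counter()
    for cur, nxt in zip(tokens, tokens[1:]):
        v = value_of(cur)
        totals[v] += 1
        if lang_of(cur) != lang_of(nxt):
            switches[v] += 1
    return {v: switches[v] / totals[v] for v in totals}

# toy (word, language, POS) tokens
toks = [("我", "MAN", "PN"), ("要", "MAN", "VV"),
        ("go", "EN", "VB"), ("那个", "MAN", "DT"), ("place", "EN", "NN")]
rates = cs_rate(toks, value_of=lambda t: t[2], lang_of=lambda t: t[1])
print(rates)  # {'PN': 0.0, 'VV': 1.0, 'VB': 1.0, 'DT': 1.0}
```

Factor values with high CS-rates act as switch triggers, which is what makes them informative predictors in a CS language model.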

  27. FLMs: Significance tests • Perplexity results on the eval set

  28. FLMs: Significance tests (2) • Mixed error rate results on the eval set
