
Automatic speech recognition of Cantonese-English code-mixing utterances

Joyce Y. C. Chan, P. C. Ching, Tan Lee and Houwei Cao, Department of Electronic Engineering, The Chinese University of Hong Kong, Hong Kong SAR, China. Presenter: Hsu Ting-Wei.

Presentation Transcript


  1. Automatic speech recognition of Cantonese-English code-mixing utterances
     Joyce Y. C. Chan, P. C. Ching, Tan Lee and Houwei Cao
     Department of Electronic Engineering, The Chinese University of Hong Kong, Hong Kong SAR, China
     Presenter: Hsu Ting-Wei

  2. Reference
     • [11] Joyce Y. C. Chan, P. C. Ching and Tan Lee, “Development of a Cantonese-English Code-mixing Speech Corpus”, in Proc. of Eurospeech 2005, pp. 1533-1536, Lisbon, 2005
     • [13] Joyce Y. C. Chan, P. C. Ching, Tan Lee and Helen M. Meng, “Detection of Language Boundary in Code-switching Utterances by Bi-phone Probabilities”, in Proc. of ISCSLP 2004, pp. 293-296, Hong Kong, 2004
     • [6] Mirjam Wester, “Syllable Classification using Articulatory-Acoustic Features”, in Proc. of Eurospeech 2003, pp. 233-236, Geneva, Switzerland, 2003
     • [10] W. K. Lo, Tan Lee and P. C. Ching, “Development of Cantonese spoken language corpora for speech applications”, in Proc. of ISCSLP 1998, pp. 102-107, Singapore, 1998
     NTNU Speech Lab

  3. Outline
     • 1. Definition
     • 2. Introduction
     • 3. Acoustic modeling
     • 4. Language modeling
     • 5. Language boundary detection (LBD)
     • 6. Experiment
     • 7. Conclusion

  4. 1. Definition
     • Code-switching
       • John Gumperz, 1982: the juxtaposition within the same speech exchange of passages of speech belonging to two different grammatical systems or subsystems.
     • Code-mixing
       • In Hong Kong, code-switching tends to be intra-sentential, and switching that involves linguistic units above the clause level is rare; hence many studies prefer the term "code-mixing".
     • Ex: (Cantonese)

  5. 2. Introduction
     • Hong Kong is a truly international city and most people are Cantonese-English bilinguals.
     • Cantonese is usually the matrix language, while English is the embedded language, often used to better describe meanings, feelings and phenomena in Hong Kong.
     • However, the English words uttered by many local people carry a Cantonese accent, which makes automatic speech recognition difficult.

  6. 2. Introduction (cont.)
     • 2.0 Phonological structure of Cantonese and English
       • Cantonese
         • One of the major Chinese dialects; it belongs to the Sino-Tibetan language family.
         • It is monosyllabic in nature and has the general syllable structure C1VC2.
         • All Cantonese syllables take one of the canonical forms V, CV, CVC or VC.
       • English
         • English is an Indo-European language.
         • Its phonological structure is much more complicated than that of Cantonese.
         • In English discourse, over 80% of the syllables take one of the Cantonese canonical forms; the remaining forms are C, CC, CCV, VCC, CCCV and CCCVCC.
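As a rough illustration of the canonical forms listed above, the sketch below maps the phone sequence of a single syllable to its consonant/vowel skeleton and checks it against V, CV, CVC and VC. The vowel inventory is an assumed ARPAbet-style set chosen for this example, and `cv_skeleton` and `fits_cantonese_form` are hypothetical helper names, not part of the presented system.

```python
# Assumed ARPAbet-style vowel inventory (an illustration, not the authors' phone set).
VOWELS = {"aa", "ae", "ah", "ao", "aw", "ay", "eh", "er",
          "ey", "ih", "iy", "ow", "oy", "uh", "uw"}

# Canonical Cantonese syllable forms named on the slide.
CANTONESE_FORMS = {"V", "CV", "CVC", "VC"}


def cv_skeleton(phones):
    """Return the consonant/vowel pattern of one syllable, e.g. 'CCVC'."""
    return "".join("V" if p in VOWELS else "C" for p in phones)


def fits_cantonese_form(phones):
    """True if the syllable has one of the Cantonese canonical forms."""
    return cv_skeleton(phones) in CANTONESE_FORMS


if __name__ == "__main__":
    for word, phones in [("plan", ["p", "l", "ae", "n"]),
                         ("check", ["ch", "eh", "k"])]:
        print(word, cv_skeleton(phones), fits_cantonese_form(phones))
```

Under these assumptions, "plan" has the skeleton CCVC and falls outside the Cantonese forms, while "check" (CVC) fits.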

  7. 2. Introduction (cont.)
     • 2.1 Cantonese accent in the embedded English words
       • This phenomenon is called borrowing. (1990)
       • Cantonese speakers pronounce borrowed words with the following characteristics:
         • Softening or dropping the second consonant in a CC sequence, e.g. plan /p l ae n/ is pronounced as /p ae n/
         • Softening or dropping the final stop consonant, e.g. check /ch eh k/ is pronounced as /ch eh/
         • Adapting a monosyllabic word with a fricative ending into a disyllabic word, e.g. notes /n ow t s/ is pronounced as /n ow t s iy/
         • A retroflex such as /r/ is read as /l/ or /w/, e.g. pressure /p r eh sh er/ is pronounced as /p l eh sh er/, and repeat /r iy p iy t/ is pronounced as /w iy p iy t/
         • A phone that exists in English but not in Cantonese is pronounced as a similar Cantonese phone, so that /th/ becomes /f/ and /eh/ becomes /ae/
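The accent characteristics above can be read as rewrite rules over a phone sequence. The following is a minimal sketch, assuming each rule is applied on its own to a base pronunciation to produce one candidate variant; the phone classes, the rule functions and `accent_variants` are hypothetical names introduced for illustration and are not the authors' dictionary-building procedure.

```python
# Assumed ARPAbet-style phone classes (illustrative only).
VOWELS = {"aa", "ae", "ah", "ao", "aw", "ay", "eh", "er",
          "ey", "ih", "iy", "ow", "oy", "uh", "uw"}
STOPS = {"p", "b", "t", "d", "k", "g"}
FRICATIVES = {"f", "v", "s", "z", "sh", "zh", "th", "dh"}
SUBSTITUTIONS = {"th": "f", "eh": "ae"}  # the two examples given on the slide


def drop_second_consonant(phones):
    """Drop the second consonant of the first CC sequence, if any."""
    for i in range(1, len(phones)):
        if phones[i] not in VOWELS and phones[i - 1] not in VOWELS:
            return phones[:i] + phones[i + 1:]
    return phones


def drop_final_stop(phones):
    """Drop a word-final stop consonant."""
    return phones[:-1] if phones and phones[-1] in STOPS else phones


def add_vowel_after_fricative(phones):
    """Turn a monosyllable ending in a fricative into a disyllable."""
    n_vowels = sum(p in VOWELS for p in phones)
    if n_vowels == 1 and phones[-1] in FRICATIVES:
        return phones + ("iy",)
    return phones


def replace_r(phones, target):
    """Read /r/ as /l/ or /w/."""
    return tuple(target if p == "r" else p for p in phones)


def substitute_missing_phones(phones):
    """Map English-only phones to similar Cantonese phones."""
    return tuple(SUBSTITUTIONS.get(p, p) for p in phones)


def accent_variants(phones):
    """Return the set of variants produced by applying each rule once."""
    base = tuple(phones)
    candidates = [
        drop_second_consonant(base),
        drop_final_stop(base),
        add_vowel_after_fricative(base),
        replace_r(base, "l"),
        replace_r(base, "w"),
        substitute_missing_phones(base),
    ]
    return {c for c in candidates if c != base}


if __name__ == "__main__":
    print(sorted(accent_variants(["p", "l", "ae", "n"])))
```

Applying the rules independently keeps the number of variants per word small; combining rules would generate more variants and is left out of this sketch.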

  8. 2. Introduction (cont.)
     • 2.2 Phone change and syllable fusion in Cantonese
       • Hong Kong people do not use romanization systems when they learn Chinese or Cantonese, so they may not know the correct pronunciation of a word and may confuse one phoneme with another.
       • In addition, syllable fusion may occur in fast speech: the second syllable of a disyllabic word may be dropped or changed. For example, the word “知道” /zi1 dou3/ may be pronounced as /zi1 ou3/, and “今日” /gam1 jat6/ becomes /gam1 mat6/. (Cantonese)
       • These effects lead to phone insertion or phone deletion.

  9. 2. Introduction (cont.)
     • Scenario
       • 1. Preparing the monolingual and cross-lingual acoustic models
       • 2. Preparing the modified pronunciation dictionary
         • To handle accents in the code-switch words, the phone sequences of the English entries in the pronunciation dictionary are modified.
       • 3. Preparing the language models
         • Four different statistical language models are proposed to address the lack of code-mixing training text.

  10. 2. Introduction (cont.)
     • Scenario
       • 4. Code-mixing speech recognizer
         • Bilingual speech recognizer: syllable-based for Cantonese and word-based for English.
         • Two-pass system
           • First pass
             • No language models are applied.
             • The bilingual speech recognizer generates a lattice, and language boundary (LB) information is integrated into the lattice by re-scoring the acoustic scores of the hypothesized words.
           • Second pass
             • Language model scores are then integrated into the lattice, and the Generalized Word Posterior Probability (GWPP) is derived.
             • A character-based hypothesis is obtained by a best-path search on the GWPP scores.

  11. 3. Acoustic modeling
     • Three speech corpora are involved in this research:
       • TIMIT: monolingual English corpus (native speakers)
       • CUSENT: monolingual Cantonese corpus (newspaper content)
       • CUMIX: Cantonese-English code-mixing corpus (C+E, C, Modified lexicon)
       No accents / Cross-lingual
     • All the acoustic models are triphone models.
     • The language-dependent models are monolingual and comprise 39 English phones and 56 Cantonese phones.

  12. 3. Acoustic modeling (cont.)
     In model set C, similar phones of the two languages are clustered, which reduces the total number of phones to 70.
     The dictionary contains an average of 2.267 different pronunciations for each English word.

  13. 4. Language modeling
     Mixing between standard Chinese and spoken Cantonese is another problem, since the two involve different lexicons and grammar.
     Instead of searching for code-mixing text data, we searched for spoken Cantonese text. Articles that contain selected spoken Cantonese characters (those that do not appear in standard Chinese) are selected.
     Of the collected data, 10% is code-mixing.

  14. 4. Language modeling (cont.)
     • All the language models are tri-gram models and are character-based for Cantonese.
     • Monolingual language model (CAN_LM): treats all English words as out-of-vocabulary (OOV).
     • Code-mixing language model (CS_LM): all English words share the same probability.
     • Class-based language model (CLASS_LM): classifies the English words into 13 classes according to their part-of-speech (POS) and meaning. The classes are: adjective, companies, date and time, event and activities, fashion, food, brand name, objects and tools, human name, place, sentence and phrase, shops and restaurants, software, verb and the remaining nouns. Most of the classes are nouns, since nouns make up the majority of code-switch words.
     • Translation-based language model (TRANS_LM): translates the English words into their Cantonese equivalents if available; otherwise, uses the classes in CLASS_LM. The language model is still character-based, even if the corresponding Cantonese word contains multiple characters.
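One way to picture how the four language models differ is as four ways of rewriting the English tokens in the training text before tri-gram counts are collected. The sketch below is only an illustration under that assumption: the placeholder tokens (`<OOV>`, `<ENG>`, `<noun_other>`), the ASCII test for English words, the toy class and translation tables, and the function `map_tokens` are all hypothetical and do not come from the original system.

```python
def is_english(token):
    """Treat purely ASCII alphabetic tokens as English words (an assumption)."""
    return token.isascii() and token.isalpha()


def class_token(word, classes):
    """Return a class placeholder; unknown words fall into a default noun class."""
    return "<" + classes.get(word.lower(), "noun_other") + ">"


def map_tokens(tokens, mode, classes=None, translations=None):
    """Rewrite English tokens according to the chosen language-model variant."""
    classes = classes or {}
    translations = translations or {}
    out = []
    for tok in tokens:
        if not is_english(tok):
            out.append(tok)                      # Cantonese characters are kept
        elif mode == "CAN_LM":
            out.append("<OOV>")                  # English treated as OOV
        elif mode == "CS_LM":
            out.append("<ENG>")                  # one shared English token
        elif mode == "CLASS_LM":
            out.append(class_token(tok, classes))
        elif mode == "TRANS_LM":
            if tok.lower() in translations:
                # Stay character-based: one token per Cantonese character.
                out.extend(translations[tok.lower()])
            else:
                out.append(class_token(tok, classes))
        else:
            raise ValueError("unknown mode: " + mode)
    return out


if __name__ == "__main__":
    sentence = ["今", "年", "有", "bonus", "嘅"]   # toy code-mixing sentence
    toy_classes = {"bonus": "objects_and_tools"}  # hypothetical class entry
    toy_translations = {"bonus": "花紅"}           # hypothetical translation entry
    for mode in ("CAN_LM", "CS_LM", "CLASS_LM", "TRANS_LM"):
        print(mode, map_tokens(sentence, mode, toy_classes, toy_translations))
```

Keeping the translated word as separate characters mirrors the statement that the translation-based model remains character-based.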

  15. 5. Language boundary detection (LBD) (cont.)
     • General equation for intra-syllable bi-phone probability is given by:

  16. 5. Language boundary detection (LBD) (cont.)
     • The same character may have different phone sequences when it has different meanings.
     • For example, the character 行 can be pronounced as /haang/, /hong/ or /hang/ in different phrases. The following example calculates the probability that 行 is pronounced as /haang/.

  17. 5. Language boundary detection (LBD) (cont.)
     • Ex:
       Phone based    => g-am n-in j-au B OW N AH S g-e
       Intra bi-phone => g_am n_in j_au B_OW OW_N N_AH AH_S g_e
       Probability    => CAN CAN CAN ENG ENG CAN ENG CAN
                      => CAN(3) ENG(2) CAN(1) ENG(1) CAN(1)
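The last step of this example, grouping per-bi-phone language decisions into runs such as CAN(3) ENG(2) CAN(1) ENG(1) CAN(1), can be sketched as follows. The sketch assumes that each bi-phone is labeled with the language whose model gives it the higher probability; `label_biphones`, `collapse_runs` and the probability tables they take are hypothetical, and the label sequence in the demo is simply one that is consistent with the run lengths on the slide.

```python
from itertools import groupby


def label_biphones(biphones, p_can, p_eng):
    """Label each bi-phone CAN or ENG by comparing two probability tables.

    p_can and p_eng map a bi-phone string to its probability under the
    Cantonese and English models; missing entries count as 0.0 and ties
    go to CAN (both choices are assumptions of this sketch).
    """
    return ["CAN" if p_can.get(b, 0.0) >= p_eng.get(b, 0.0) else "ENG"
            for b in biphones]


def collapse_runs(labels):
    """Group consecutive identical labels into (language, run length) pairs."""
    return [(lang, len(list(run))) for lang, run in groupby(labels)]


if __name__ == "__main__":
    labels = ["CAN", "CAN", "CAN", "ENG", "ENG", "CAN", "ENG", "CAN"]
    print(collapse_runs(labels))
    # [('CAN', 3), ('ENG', 2), ('CAN', 1), ('ENG', 1), ('CAN', 1)]
```

Short isolated runs, such as the single CAN label inside the English word here, are the kind of local error that motivates looking at units larger than a bi-phone.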

  18. 6. Experiment

  19. 6. Experiment (cont.)
     • However, when there are accents, the syllable structure of the code-switch words changes, so the English words sound like Cantonese words.
     • To tackle problems due to accents, larger units should be considered.
     • Hence, we propose to use a syllable-based LBD, or to apply LBD algorithms to the lattice generated by a bilingual speech recognizer.
     • The lattice-based LBD approach searches the word lattice for the English word (WE) with the longest duration.
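The lattice-based search described above can be sketched as a selection over lattice arcs. This is a minimal sketch, assuming the lattice is available as a flat list of word hypotheses with a language tag and start/end times; the `Arc` type, its fields and `longest_english_word` are hypothetical names, and the times in the demo are toy values.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Arc:
    """One word hypothesis in the lattice (hypothetical structure)."""
    word: str
    lang: str      # "ENG" or "CAN"
    start: float   # start time in seconds
    end: float     # end time in seconds

    @property
    def duration(self) -> float:
        return self.end - self.start


def longest_english_word(arcs: Iterable[Arc]) -> Optional[Arc]:
    """Return the English hypothesis with the longest duration, if any."""
    english = [a for a in arcs if a.lang == "ENG"]
    return max(english, key=lambda a: a.duration, default=None)


if __name__ == "__main__":
    lattice = [
        Arc("有", "CAN", 0.50, 0.75),
        Arc("bow", "ENG", 0.75, 1.00),
        Arc("bonus", "ENG", 0.75, 1.50),
        Arc("嘅", "CAN", 1.50, 1.75),
    ]
    best = longest_english_word(lattice)
    if best is not None:
        # The start and end of the selected word give the two language boundaries.
        print(best.word, best.start, best.end)
```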

  20. 6. Experiment (cont.)

  21. 7. Conclusion
     • English words last longer than Cantonese characters, since Cantonese is monosyllabic. Hence, the lattice-based LBD algorithm obtains a higher LBD accuracy.
     • When the correct language boundary is obtained, the recognition accuracy of the code-switch words can be increased.
     • Therefore, studies on language boundary detection are necessary for further research.
