
Speech Assessment: Methods and Applications for Spoken Language Learning 語音評分的方法、應用與分享

Speech Assessment: Methods and Applications for Spoken Language Learning 語音評分的方法、應用與分享. J.-S. Roger Jang ( 張智星 ) jang@cs.nthu.edu.tw http://www.cs.nthu.edu.tw/~jang Multimedia Information Retrieval Lab CS Dept, Tsing Hua Univ, Taiwan. Outline. Introduction to speech assessment Methods



  1. Speech Assessment: Methods and Applications for Spoken Language Learning 語音評分的方法、應用與分享 J.-S. Roger Jang (張智星) jang@cs.nthu.edu.tw http://www.cs.nthu.edu.tw/~jang Multimedia Information Retrieval Lab CS Dept, Tsing Hua Univ, Taiwan

  2. Outline • Introduction to speech assessment • Methods • Using learning to rank for speech assessment • Demos • Conclusions

  3. Intro. to Speech Assessment • Goal • Evaluate a person’s utterance based on some acoustic features, for language learning • Also known as • Pronunciation scoring • CAPT (computer-assisted pronunciation training)

  4. Four Aspects of Language Learning • [Figure: language learning skills by media, marking where SA applies and which skills are easier or harder for CALL]

  5. Speech Assessment • Characteristics of ideal SA • Assessment levels: as detailed as possible • Syllables, words, sentences, paragraphs • Assessment criteria: as many as possible • timbre, tone, energy, rhythm, co-articulation, … • Feedback: as specific as possible • High-level corrections and suggestions

  6. Basic Assessment Criteria • Timbre (咬字/音色) • Based on acoustic models • Tone (音調/音高) • Based on tone recognition (for tonal languages) • Based on pitch similarity with the target utterance • Rhythm (韻律/音長) • Based on duration comparison with the target utterance • Energy (強度/音量) • Based on energy comparison with the target utterance

  7. Additional Assessment Criteria • English • Stress (重音) • Levels (word or sentence) • Intonation (整句音調) • Declarative sentence • Interrogative sentence • Co-articulation (連音) • A red apple. • Did you call me? • Won’t you go? • Raise your hand. • Mandarin • Tone (聲調) • Retroflex (捲舌音) • Co-articulation (連音) • Erhua, i.e., r-coloring (兒化音) • Others • Pause

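  8. Types of SA • Types of SA (ordered by difficulty) • Type 1: with target text, with target utterance • Type 2: with target text, without target utterance • Type 3: without target text, with target utterance • Type 4: without target text, without target utterance • We are focusing on types 1 and 2.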

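  9. Type 1: With Target Text, With Target Utterance • Method: • Using a speech recognition engine, perform forced alignment (FA) between the speech and the text, then score each phone according to its similarity • Scoring • Timbre: compared against the acoustic models of the speech recognition engine • Tone, rhythm, energy: compared against the target utterance • Characteristics: • Because FA is highly accurate, it is relatively easy to obtain consistent scores • Examples: • myET (艾爾實驗室): www.myet.com • Saybot (說寶堂): www.saybot.com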

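  10. Type 2: With Target Text, Without Target Utterance • Method: • Using a speech recognition engine, perform forced alignment (FA) between the speech and the text, then score each phone according to its similarity • Scoring • Timbre: compared against the acoustic models of the speech recognition engine • Tone: for Mandarin, the canonical tones can be derived from the text, and the speech then goes through tone recognition and scoring. No comparable method exists for English. • Rhythm, energy: cannot be compared • Characteristics: • Because FA is highly accurate, it is relatively easy to obtain consistent scores • Teaching materials are easier to prepare • But rhythm and energy (volume) cannot be scored • Examples: • speak & score by 階梯英文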

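  11. Type 3: Without Target Text, With Target Utterance • Method: • Using a speech recognition engine, perform free syllable decoding (FSD) on the speech, then score according to the similarity of the resulting syllable strings. • Scoring • Timbre: compared against the syllable string of the target utterance • Tone, rhythm, energy: compared using the syllables produced by FSD • Characteristics: • Because the recognition rate of FSD is only about 60 to 70%, consistent scores are harder to obtain. • DTW can be used for direct comparison instead, but differences in individual timbre lower the consistency of the scores.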

  12. Our Approach • Basic approach to timbre assessment • Lexicon net construction (Usually a sausage net) • Forced alignment to identify phone boundaries • Phone scoring based on several criteria, such as ranking, histograms, posterior prob., etc. • Weighted average to get syllable/sentence scores
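The final step on slide 12 (a weighted average of phone scores) can be sketched as below. The slide does not say which weights are used, so this sketch assumes each phone is weighted by its duration in frames; `weighted_score` is a hypothetical helper name, not part of the authors' toolbox.

```python
def weighted_score(phone_scores, weights):
    """Weighted average of phone scores.

    Assumption: `weights` are phone durations in frames; the slides do not
    specify the actual weighting scheme.
    """
    if not phone_scores or len(phone_scores) != len(weights):
        raise ValueError("need one weight per phone score")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(s * w for s, w in zip(phone_scores, weights)) / total


# A syllable made of two phones scored 80 and 60, lasting 5 and 15 frames.
syllable_score = weighted_score([80.0, 60.0], [5, 15])
print(syllable_score)  # 65.0
```

The same function could be applied again to syllable scores to obtain a sentence score, under the same weighting assumption.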

  13. Lexicon Net Construction • Lexicon net for “what are you allergic to?” • Sausage net with all possible (and correct) multiple pronunciations • Optional sil between words

  14. Lexicon Net with Confusing Phones • Common errors for Japanese learners of Chinese • ㄖ → ㄌ • e.g., 天氣熱 → 天氣樂 • ㄑ → ㄐ • e.g., 打哈欠 → 打哈見 • ㄘ → ㄗ • e.g., 一次旅行 → 一字旅行 • ㄢ → ㄤ • e.g., 晚安 → 晚ㄤ • Rule-based approach to creating confusing syllables (phonological rules!) • Rules: • Rule 1: re → le • Rule 2: qi → ji • Rule 3: ci → zi • Rule 4: an → ang • Example • 欠 (qian) → 見 (jian), 嗆 (qiang), 降 (jiang)
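A minimal sketch of how the four rules on slide 14 could be applied to a pinyin syllable. The encoding of the rules as an initial-consonant swap and a final swap, the small `VALID` syllable set, and the function name `confusing_variants` are all assumptions made for illustration; the slides do not show the actual implementation.

```python
# Hypothetical encoding of the slide's rules as string rewrites.
INITIAL_RULES = {"r": "l", "q": "j", "c": "z"}  # re->le, qi->ji, ci->zi
FINAL_RULES = {"an": "ang"}                     # an->ang

# Tiny illustrative subset of valid syllables (the full inventory has 411).
VALID = {
    "tian", "qi", "ji", "re", "le", "da", "ha",
    "qian", "jian", "qiang", "jiang", "ci", "zi", "an", "ang",
}


def confusing_variants(syllable, valid=VALID):
    """Return valid syllables reachable from `syllable` by the rules above."""
    forms = [syllable]
    for src, dst in INITIAL_RULES.items():
        # Skip digraph initials such as "ch", which are different phones.
        if syllable.startswith(src) and not syllable.startswith(src + "h"):
            forms.append(dst + syllable[len(src):])
    variants = []
    for form in forms:
        variants.append(form)
        for src, dst in FINAL_RULES.items():
            if form.endswith(src):
                variants.append(form[: -len(src)] + dst)
    # Drop the canonical form and anything that is not a real syllable.
    return [v for v in dict.fromkeys(variants) if v != syllable and v in valid]


print(confusing_variants("qian"))  # ['qiang', 'jian', 'jiang']
print(confusing_variants("tian"))  # [] because "tiang" is not a valid syllable
```

Filtering against a list of valid syllables is what keeps "tian" from producing a variant here, which is consistent with the slide's example where only 欠 (qian) expands to three alternatives.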

  15. Example of Japanese Learners Speaking Chinese • 去年夏天熱死了 • Example 1 • Example 2 • 晚安 • Example 1 • Example 2 • 坐下來、慢慢吃 • Example 1 • 他不住的打哈欠 • Example 1 • 一次旅行 • Example 1 • 起風 • Example 1 • 休息 • Example 1

  16. Lexicon Net with Confusing Phones • Lexicon net for “天氣熱、打哈欠” • Canonical form: tian qi re da ha qian • 16 variant paths in the net (2 × 2 × 4), from the alternatives: 氣 (qi) / 記 (ji) • 熱 (re) / 樂 (le) • 欠 (qian) / 見 (jian) / 嗆 (qiang) / 降 (jiang)
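The count of 16 paths on slide 16 can be checked by enumerating every combination of alternatives per syllable slot. The `SLOTS` layout below is an illustrative data structure, not the format of the actual lexicon net.

```python
from itertools import product

# Alternatives for each syllable slot, canonical form first (from the slide).
SLOTS = [
    ["tian"],
    ["qi", "ji"],
    ["re", "le"],
    ["da"],
    ["ha"],
    ["qian", "jian", "qiang", "jiang"],
]

# Each path through the sausage net picks one alternative per slot.
paths = [" ".join(choice) for choice in product(*SLOTS)]

print(len(paths))  # 16
print(paths[0])    # tian qi re da ha qian
```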

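  17. Automatic Confusing Syllable Id. • Input: corpus of Japanese learners of Chinese • Step 1: Run forced alignment to obtain an initial segmentation • Step 2: Compare each segment against the 411 Mandarin syllables to find the confusing syllables for each sound • Step 3: Add the confusing syllables to the recognition network • Step 4: Run forced alignment and segmentation again • Step 5: Has the segmentation stopped changing? • Yes: output the confusing syllables and the recognition network • No: repeat from the comparison step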

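  18. Error Pattern Identification (EPI) • Common insertions/deletions from users • Taking 「朝辭白帝彩雲間」 as the target sentence • Ending anywhere, e.g., 「朝辭白帝」 • Starting anywhere, e.g., 「彩雲間」 • Starting and ending anywhere, e.g., 「白帝彩雲」 • Starting and ending anywhere with skipped characters, e.g., 「白彩雲」 • Repeated character, e.g., 「朝…朝辭白帝彩雲間」 • Repeated word, e.g., 「朝辭…朝辭白帝彩雲間」 • Repeated character with a changed pronunciation, e.g., 「朝(cao)…朝(zhao)辭白帝彩雲間」 • Two characters swapped, e.g., 「朝辭彩帝白雲間」 • Wrong characters, e.g., 「朝辭白帝黑山間」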

  19. Lexicon Net for EPI (I) • Detects utterances that start at the beginning and end anywhere

  20. Lexicon Net for EPI (II) • Detects utterances that start anywhere and end at the end

  21. Lexicon Net for EPI (III) • Detects utterances that start anywhere and end anywhere (without skipping characters)

  22. Lexicon Net for EPI (IV) • Detects utterances that start anywhere and end anywhere, with skipped characters allowed
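To make the difference between nets (III) and (IV) concrete, the sketch below enumerates the character sequences each net would accept for the target sentence from slide 18. It only lists accepted strings; it does not build a recognition network, and the function names are hypothetical.

```python
from itertools import combinations


def contiguous_spans(chars):
    """Net (III): start anywhere, end anywhere, no skipped characters."""
    n = len(chars)
    return {chars[i:j] for i in range(n) for j in range(i + 1, n + 1)}


def skipping_spans(chars):
    """Net (IV): start anywhere, end anywhere, characters may be skipped."""
    n = len(chars)
    return {
        "".join(chars[i] for i in idx)
        for k in range(1, n + 1)
        for idx in combinations(range(n), k)
    }


target = "朝辭白帝彩雲間"
net3 = contiguous_spans(target)
net4 = skipping_spans(target)

print(len(net3), len(net4))           # 28 127
print("白帝彩雲" in net3)              # True
print("白彩雲" in net3, "白彩雲" in net4)  # False True
```

The jump from 28 to 127 accepted sequences for a seven-character sentence illustrates why a more permissive net catches more error patterns but gives the recognizer more ways to go wrong.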

  23. Design Philosophy of Lexicon Nets • We need to strike a balance between recognition and lexicon • In the extreme, we can have a net for free syllable decoding to catch all error patterns. • The feasibility of free syllable decoding is offset by its not-so-high recognition rate.

  24. Scoring Methods for Speech Assessment • Five phone-based scoring methods • Duration-distribution scores (durDis) • Log-likelihood scores (hmmLike) • Log-posterior scores (hmmPost) • Log-likelihood-distribution scores (likeDis) • Rank ratio scores (rkRatio) • All based on forced alignment to segment phones

  25. Method 1: Duration-distribution Scores • PDF of phone duration • Obtained from forced alignment • Normalized by speech rate • Fitted by log-normal PDF • Max PDF → score 100
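A minimal sketch of the duration-distribution score, assuming the score is the log-normal PDF value scaled so that the PDF maximum maps to 100. The parameters `mu` and `sigma` (mean and standard deviation of log-duration) and the example values are hypothetical, and the duration is assumed to be already normalized by speech rate.

```python
import math


def lognormal_pdf(d, mu, sigma):
    """Log-normal density of duration d; mu and sigma describe log(d)."""
    if d <= 0:
        return 0.0
    z = (math.log(d) - mu) / sigma
    return math.exp(-0.5 * z * z) / (d * sigma * math.sqrt(2.0 * math.pi))


def duration_score(d, mu, sigma):
    """Scale the PDF so that its maximum (at the mode) gives a score of 100."""
    mode = math.exp(mu - sigma ** 2)  # where the log-normal PDF peaks
    return 100.0 * lognormal_pdf(d, mu, sigma) / lognormal_pdf(mode, mu, sigma)


# Hypothetical phone model: typical duration near 0.08 s.
mu, sigma = math.log(0.08), 0.4
best = math.exp(mu - sigma ** 2)
print(round(duration_score(best, mu, sigma), 6))  # 100.0
print(duration_score(3 * best, mu, sigma) < 100.0)  # True
```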

  26. Method 2: Log-likelihood Scores • Log-likelihood of phone $q$ with a duration of $d$ frames: [equation not preserved], where $p(x_t \mid q)$ is the likelihood of frame $t$ with the observation vector $x_t$

  27. Method 3: Log-posterior Scores • Log-posterior of phone $q$ with duration $d$: [equation and its definitions not preserved]
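The equations for Methods 2 and 3 did not survive in this transcript, so the sketch below uses a common formulation rather than the slides' exact one: frame log-likelihoods averaged over the phone's duration, and frame posteriors obtained by Bayes' rule over a set of competing phones. Equal phone priors, the dictionary layout of `frames`, and the function names are all assumptions.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def hmm_like(frames, phone):
    """Average log-likelihood of `phone` over its aligned frames (assumed form)."""
    return sum(frame[phone] for frame in frames) / len(frames)


def hmm_post(frames, phone):
    """Average log-posterior of `phone`, assuming equal priors for all phones."""
    total = 0.0
    for frame in frames:
        total += frame[phone] - logsumexp(list(frame.values()))
    return total / len(frames)


# Each frame maps phone name -> log p(x_t | phone); values are made up.
frames = [
    {"ae": -2.0, "eh": -3.0, "ih": -5.0},
    {"ae": -1.0, "eh": -4.0, "ih": -6.0},
]
print(hmm_like(frames, "ae"))        # -1.5
print(hmm_post(frames, "ae") < 0.0)  # True
```

Unlike the raw log-likelihood, the log-posterior is measured relative to the competing phones, which should make it less sensitive to speaker and channel effects that shift all likelihoods together.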

  28. Method 4: Log-likelihood-distribution Scores • Use CDF of Gaussian for log-likelihood • CDF = 1 → score = 100
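A minimal sketch of Method 4, assuming the score is simply 100 times the Gaussian CDF evaluated at the phone's log-likelihood. The `mean` and `std` parameters stand for a Gaussian fitted to log-likelihoods of that phone in some reference data; where those statistics come from is an assumption, since the slide does not say.

```python
import math


def gaussian_cdf(x, mean, std):
    """CDF of a normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def like_dis_score(loglik, mean, std):
    """Map a log-likelihood to 0..100 so that CDF = 1 corresponds to 100."""
    return 100.0 * gaussian_cdf(loglik, mean, std)


# Hypothetical per-phone statistics of log-likelihood.
mean, std = -60.0, 8.0
print(like_dis_score(-60.0, mean, std))  # 50.0
print(like_dis_score(-40.0, mean, std) > like_dis_score(-70.0, mean, std))  # True
```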

  29. Method 5: Rank Ratio Scores • Rank ratio: [equation not preserved] • RR-to-score conversion: [equation not preserved], where parameters a, b are phone specific • Possible sets of competing phones for x+y • *+y • *+*
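Because the rank-ratio equations are missing from this transcript, the following is only an illustrative guess at the idea, not the authors' formula. It assumes the rank ratio is the fraction of competing models that score better than the target model, and it uses a hypothetical decreasing mapping `100 / (1 + (rr / a) ** b)` with phone-specific `a` and `b` for the conversion.

```python
def rank_ratio(target, logliks):
    """Fraction of competitors that beat the target (0 = target ranked first).

    Assumed definition: (rank - 1) / (N - 1) over N competing models.
    """
    if target not in logliks or len(logliks) < 2:
        raise ValueError("need the target and at least one competitor")
    better = sum(
        1 for name, value in logliks.items()
        if name != target and value > logliks[target]
    )
    return better / (len(logliks) - 1)


def rr_to_score(rr, a, b):
    """Hypothetical conversion: rr = 0 gives 100, rr = a gives 50."""
    return 100.0 / (1.0 + (rr / a) ** b)


# Competing set "*+t" for the biphone "ae+t"; log-likelihoods are made up.
logliks = {"ae+t": -10.0, "eh+t": -12.0, "ih+t": -9.0}
rr = rank_ratio("ae+t", logliks)
print(rr)                               # 0.5
print(rr_to_score(rr, a=0.5, b=2.0))    # 50.0
print(rr_to_score(0.0, a=0.5, b=2.0))   # 100.0
```

Choosing the competing set as `*+y` (same right context) keeps the comparison among phones that are plausible in that position, while `*+*` compares against every biphone.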

  30. Examples of Rank Ratio Scores

  31. Demo of Our Prototype • ASR toolbox • http://mirlab.org/jang/matlab/toolbox/asr • Command: goDemoSa.m

  32. Intro. to Learning to Rank • Learning to rank • A supervised learning algorithm which generates a ranking model based on a training set of partially ordered items. (A task somewhat between classification and regression.) • [Figure: a rank function takes a set of items and outputs them ordered by preference]

  33. Learning to Rank: Methods and App. • Methods • Pointwise (e.g., Pranking) • Pairwise (e.g., RankSVM, RankBoost, RankNet) • Listwise (e.g., ListNet) • Applications • Webpage ranking • Machine translation • Protein structure prediction

  34. Application of LTR to SA • Why use LTR for SA? • Human scoring is rank-based • Tsing Hua’s grading system is moving from scores (0~100) to ranks (A, B, C, D…). • Combination of features (scores) • Features are complementary. • Effective determination of ranking • LTR only generates numerical output with a ranking order as close as possible to the correct order. An optimal DP-based approach is proposed to map this output to ranks.

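  35. LTR Score Segmentation • Given: • LTR scores (sorted): $s_1 \le s_2 \le \dots \le s_n$ • Desired ranks: $r_1, r_2, \dots, r_n$ • We want to find the separating scores that divide the score axis into Rank 1 through Rank 5, with score-to-rank function $f(s)$ • such that the total absolute difference $\sum_{i=1}^{n} |r_i - f(s_i)|$ is minimized.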

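  36. LTR Score Segmentation by DP (I) • Formulate the problem in DP framework • Optimum-value function D(i,j): the minimum cost of mapping the first i sorted scores to ranks 1 through j • Recurrent equation: [equation not preserved] • Boundary condition: [equation not preserved] • Optimum cost: [equation not preserved]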
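The recurrence itself is not preserved in this transcript, so the sketch below shows one plausible DP for the segmentation problem rather than the slides' exact formulation. It assumes the cost is the sum of absolute differences between desired and computed ranks, and that computed ranks only need to be non-decreasing along the sorted scores (ranks may be left empty). Here `D[i][j]` is the minimum cost when item `i` itself receives rank `j + 1`, which differs slightly from the slide's definition.

```python
def segment_ranks(desired, num_ranks):
    """Assign non-decreasing ranks to items sorted by ascending LTR score.

    desired: human ranks (1..num_ranks) in score order.
    Returns (minimum total |desired - computed|, list of computed ranks).
    """
    n = len(desired)
    inf = float("inf")
    D = [[inf] * num_ranks for _ in range(n)]
    back = [[0] * num_ranks for _ in range(n)]
    for j in range(num_ranks):
        D[0][j] = abs(desired[0] - (j + 1))
    for i in range(1, n):
        best_k = 0  # argmin of D[i-1][k] over k <= j, updated as j grows
        for j in range(num_ranks):
            if D[i - 1][j] < D[i - 1][best_k]:
                best_k = j
            D[i][j] = abs(desired[i] - (j + 1)) + D[i - 1][best_k]
            back[i][j] = best_k
    # Backtrack from the cheapest final rank.
    j = min(range(num_ranks), key=lambda k: D[n - 1][k])
    cost = D[n - 1][j]
    ranks = [0] * n
    for i in range(n - 1, -1, -1):
        ranks[i] = j + 1
        j = back[i][j]
    return cost, ranks


# Human ranks of seven utterances, listed in ascending LTR-score order.
desired = [1, 2, 1, 2, 3, 2, 3]
cost, computed = segment_ranks(desired, num_ranks=3)
print(cost)  # 2
```

Separating scores can then be placed anywhere between two adjacent items whose computed ranks differ, for example at the midpoint of their LTR scores.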

  37. LTR Score Segmentation by DP (II) • [Figure: DP grid of computed rank (1 to 5) versus desired rank, illustrating the local constraint and the recurrent formula]

  38. LTR Score Segmentation with DP (III) • [Figures: data distribution and the resulting DP path]

  39. Flow Charts of Our Experiment

  40. Corpora for Experiments • WSJ0 • 8000 training utterances, 84 speakers. Used for training the biphone acoustic models for forced alignment • MIR-SD • Recordings of about 4000 multi-syllable English words by 22 students (12 females and 10 males) with an intermediate competence level. • Originally designed for stress detection • Available at http://mirlab.org/dataSet/public

  41. Human Scoring of MIR-SD • Human scoring • Only 50 utterances from each speaker of MIR-SD are scored by 2 humans, making a total of 1100 utterances • The human scores are consistent: [table not preserved]

  42. Examples of MIR-SD • Level 5 • apparent, paragraphic, constellation • Level 3 • additive, timorous, availably • Level 1 • ambiguity, auxiliary, anachronism

  43. Performance Indices • Performance indices used in the literature • hr = [1 3 5 4 2 2], cr = [2 3 5 2 1 4] • Recognition rate rRate = 33.33% • Recognition rate with tolerance 1 = 66.67% • Average absolute difference = 1 • Correlation coef = 0.54
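The four indices on slide 43 can be reproduced from the two example vectors. The sketch below assumes `hr` holds the human ranks and `cr` the computed ranks, and the function names are chosen for illustration.

```python
import math


def recognition_rate(hr, cr, tolerance=0):
    """Fraction of items whose ranks differ by at most `tolerance`."""
    return sum(abs(h - c) <= tolerance for h, c in zip(hr, cr)) / len(hr)


def mean_abs_diff(hr, cr):
    """Average absolute difference between the two rank vectors."""
    return sum(abs(h - c) for h, c in zip(hr, cr)) / len(hr)


def pearson(x, y):
    """Pearson correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


hr = [1, 3, 5, 4, 2, 2]
cr = [2, 3, 5, 2, 1, 4]
print(round(100 * recognition_rate(hr, cr), 2))               # 33.33
print(round(100 * recognition_rate(hr, cr, tolerance=1), 2))  # 66.67
print(mean_abs_diff(hr, cr))                                  # 1.0
print(round(pearson(hr, cr), 2))                              # 0.54
```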

  44. Performance Evaluation of Different Scoring Methods

  45. LTR Combination of Scores • Features for LTR • durDis and rkRatio: raw scores • hmmLike, hmmPost, likeDis: DP segmentation • LTR • RankSVM • Linear kernel • Baseline • hmmPost with DP-based segmentation
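To illustrate how a linear pairwise ranker combines several scores into one ranking value, here is a simplified stand-in for RankSVM: a linear model trained by subgradient descent on the pairwise hinge loss. It is not the RankSVM solver used in the experiment, and the two-feature toy data, learning rate, and regularization constant are all hypothetical.

```python
def train_pairwise_ranker(features, ranks, epochs=200, lr=0.01, reg=0.001):
    """Learn weights w so that w.x_i > w.x_j whenever ranks[i] > ranks[j]."""
    dim = len(features[0])
    w = [0.0] * dim
    pairs = [
        (i, j)
        for i in range(len(ranks))
        for j in range(len(ranks))
        if ranks[i] > ranks[j]
    ]
    for _ in range(epochs):
        for i, j in pairs:
            diff = [a - b for a, b in zip(features[i], features[j])]
            margin = sum(wk * dk for wk, dk in zip(w, diff))
            for k in range(dim):
                # Hinge-loss subgradient plus L2 regularization.
                grad = reg * w[k] - (diff[k] if margin < 1.0 else 0.0)
                w[k] -= lr * grad
    return w


def ltr_score(w, x):
    """Numerical LTR output for one utterance."""
    return sum(wk * xk for wk, xk in zip(w, x))


# Toy data: two normalized scores per utterance and a human rank (higher is better).
features = [[0.9, 0.8], [0.7, 0.6], [0.4, 0.5], [0.2, 0.1]]
ranks = [4, 3, 2, 1]
w = train_pairwise_ranker(features, ranks)
scores = [ltr_score(w, x) for x in features]
print(scores == sorted(scores, reverse=True))  # True
```

The output is only a real-valued score whose order matters, which is why a separate segmentation step is still needed to turn it into discrete ranks.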

  46. Overall Performance Comparison • Legends • Score segmentation • Circles: DP • Triangles: k-means • Inside/outside tests • Solid lines: Inside • Dashed lines: Outside • Black lines: Baselines

  47. Summary of the Experiment • Segmentation • DP (supervised learning) is better than k-means (unsupervised learning) • Performance indices • Correlation coefficient is not intuitive (consider [4 5 4] and [1 2 1]) • Recog. rate and sum of abs. diff. can be optimized by LTR and DP segmentation
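The remark about the correlation coefficient can be checked directly: the two vectors on the slide are perfectly correlated even though every rank is off by 3. The helper below is a plain Pearson implementation written for this illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    norm = math.sqrt(
        sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)
    )
    return cov / norm


human = [4, 5, 4]
computed = [1, 2, 1]
correlation = pearson(human, computed)
abs_diff = sum(abs(h - c) for h, c in zip(human, computed)) / len(human)

print(round(correlation, 6))  # 1.0
print(abs_diff)               # 3.0
```

A constant offset leaves the correlation untouched, so a system could look perfect by this index while never assigning the right rank.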

  48. Demo: Practice of Mandarin Idioms of Length 4 (一語中的) • Level (difficulty) of an idiom is based on its frequency via Google search: • 孤掌難鳴 ===> 260,000 • 鶼鰈情深 ===> 43,300 • 亡鈇意鄰 ===> 22,700 • 舉案齊眉 ===> 235,000 • Can be adapted for English learning • Next step: multi-threading, fast decoding via FSM

  49. Demo: Recitation Machine (唸唸不忘) • Supports Mandarin & English • Supports user-defined recitation scripts • Next step: multithreading for recording & recognition

  50. Licensing for PC Applications • For Mandarin, English, Japanese
