
AUTOMATIC PHONETIC ANNOTATION OF AN ORTHOGRAPHICALLY TRANSCRIBED SPEECH CORPUS

Rui Amaral, Pedro Carvalho, Diamantino Caseiro, Isabel Trancoso, Luís Oliveira. IST, Instituto Superior Técnico; INESC, Instituto de Engenharia de Sistemas e Computadores.


Presentation Transcript


  1. AUTOMATIC PHONETIC ANNOTATION OF AN ORTHOGRAPHICALLY TRANSCRIBED SPEECH CORPUS Rui Amaral, Pedro Carvalho, Diamantino Caseiro, Isabel Trancoso, Luís Oliveira IST, Instituto Superior Técnico INESC, Instituto de Engenharia de Sistemas e Computadores

  2. Summary • Motivation • System Architecture • Module 1: Grapheme-to-phone converter (G2P) • Module 2: Alternative transcriptions generator (ATG) • Module 3: Acoustic signal processor • Module 4: Phonetic decoder and aligner • Training and Test Corpora • Results • Transcription and alignment (Development phase) • Test corpus annotation (Evaluation phase) • Conclusions and Future Work

  3. Motivation • Time consuming, repetitive task (over 60 x real time) • Large corpora processing • No expert intervention • Non-existence of widely adopted standard procedures • Error prone • Inconsistencies among human annotators

  4. System Architecture [Block diagram. Labels: Orthographically transcribed speech corpus; Grapheme-to-Phone Converter; Alternative Transcriptions Generator; Acoustic signal processor; Phonetic Decoder/Aligner; Phonetically annotated speech corpus; resources: Lexicon, Rules.]

  5. - Module 1 - Grapheme-to-Phone Converter Modules of the Portuguese TTS system (DIXI) • Text normalisation • Special symbols, numerals, abbreviations and acronyms • Broad Phonetic Transcription • Careful pronunciation of the word • Set of 200 rules • Small exceptions dictionary (364 entries) • SAMPA phonetic alphabet

  6. - Module 2 - Alternative Transcriptions Generator Transformation of phone sequences into lattices • Based on optional rules: • Which account for: • Sandhi • Vowel reduction • Specified using finite-state grammars and simple transduction operators: A (B → C) D

  7. Examples:
     Type                               Text          Broad P.T.        Alternative P.T.
     Sandhi with vowel quality change   de uma        [d@ um6]          [djum6]
                                        mesmo assim   [m"eZmu 6s"i~]    [m"eZmw6s"i~]
     Sandhi with vowel reduction        de uma        [d@ um6]          [dum6]
                                        mesmo assim   [m"eZmu 6s"i~]    [m"eZm6s"i~]
     Vowel reduction                    semana        [s@m"6n6]         [sm"6n6]
                                        oito          ["ojtu]           ["ojt]
     Alternative pronunciations         restaurante   [R@Stawr"6~t]     [R@StOr"6~t]
                                        viagens       [vj"aZ6~j~S]      [vj"aZe~S]

  8. Example (rules application):
     Phrase: “vou para a praia.”
     Canonical P.T.: [v"o p6r6 6 pr"aj6]
     Narrow P.T. (most freq.): [v"o pr"a pr"ai6] = sandhi + vowel reduction
     Rules:
       DEF_RULE 6a, ( (6 → NULL) (sil → NULL) (6 → a) )
       DEF_RULE pra, ( p ("6 → NULL) r 6 )
     Lattice: [Figure: phone lattice with arcs labelled p, "6, r, 6, sil, 6, sil, p, r, ..., a]
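
The rule application on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' finite-state implementation: the names `apply_optional`, `expand`, `RULE_PRA` and `RULE_6A` are hypothetical, phones are whitespace-separated SAMPA tokens with the stress mark attached to the vowel, an optional `sil` is placed between words, and the alternatives are enumerated as flat sequences instead of being stored in a lattice. Stress placement in the narrow form is not modelled.

```python
from itertools import product

# A rule is a list of (source, target) pairs. A target of None deletes the
# phone; a pair with source == target is context that must match and is kept.
RULE_PRA = [("p", "p"), ('"6', None), ("r", "r"), ("6", "6")]  # p ("6 -> NULL) r 6
RULE_6A = [("6", None), ("sil", None), ("6", "a")]             # 6 sil 6 -> a


def apply_optional(phones, rule):
    """Return all sequences obtained by firing `rule` at any subset of its
    non-overlapping match sites (the rule is optional at every site)."""
    sources = [src for src, _ in rule]
    targets = [tgt for _, tgt in rule if tgt is not None]
    n = len(rule)
    choices = []  # per segment: the list of possible realisations
    i = 0
    while i < len(phones):
        if phones[i:i + n] == sources:
            choices.append([phones[i:i + n], targets])  # keep or rewrite
            i += n
        else:
            choices.append([[phones[i]]])
            i += 1
    results = []
    for combo in product(*choices):
        seq = [p for part in combo for p in part]
        if seq not in results:
            results.append(seq)
    return results


def expand(phones, rules):
    """Apply each optional rule in turn and collect every alternative."""
    alternatives = [list(phones)]
    for rule in rules:
        expanded = []
        for seq in alternatives:
            for alt in apply_optional(seq, rule):
                if alt not in expanded:
                    expanded.append(alt)
        alternatives = expanded
    return alternatives


canonical = 'v "o sil p "6 r 6 sil 6 sil p r "a j 6'.split()
for alt in expand(canonical, [RULE_PRA, RULE_6A]):
    print(" ".join(alt))
```

With the tokenisation assumed here, the two rules yield four alternatives, ranging from the canonical sequence to the fully reduced `v "o sil p r a sil p r "a j 6`.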

  9. - Module 3 - Acoustic Signal Processor Extraction of acoustic signal characteristics • Sampling: 16 kHz, 16 bits • Parameterisation: MFCC (Mel-Frequency Cepstral Coefficients) • Decoding: 14 coefficients, energy, 1st and 2nd order differences, 25 ms Hamming windows, updated every 10 ms. • Alignment: 14 coefficients, energy, 1st and 2nd order differences, 16 ms Hamming windows, updated every 5 ms.
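
To make the two analysis configurations concrete, the sketch below converts the window and shift durations into sample counts at 16 kHz, counts the frames in an utterance, and computes the feature vector size. The function names are hypothetical. Two points are assumptions, since the slide does not state them: the differences are taken over all static features including energy, and only complete windows are counted (no padding at the end of the signal).

```python
import math


def hamming(n_samples):
    """Standard Hamming window coefficients for a window of n_samples."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (n_samples - 1))
            for k in range(n_samples)]


def frame_layout(duration_s, sample_rate_hz, window_ms, shift_ms):
    """Return (window length, shift, number of complete frames) in samples."""
    win = sample_rate_hz * window_ms // 1000
    hop = sample_rate_hz * shift_ms // 1000
    total = int(duration_s * sample_rate_hz)
    n_frames = 0 if total < win else 1 + (total - win) // hop
    return win, hop, n_frames


def feature_dim(n_cepstra=14, with_energy=True, n_delta_orders=2):
    """Static features plus one copy per difference order (assumption)."""
    static = n_cepstra + (1 if with_energy else 0)
    return static * (1 + n_delta_orders)


print("decoding :", frame_layout(1.0, 16000, 25, 10))
print("alignment:", frame_layout(1.0, 16000, 16, 5))
print("vector size:", feature_dim())
```

For one second of speech this gives 400-sample windows and 98 frames for decoding, against 256-sample windows and 197 frames for alignment. The shorter shift roughly doubles the time resolution available when placing segment boundaries.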

  10. - Module 4 - Phonetic Decoder and Aligner Selection of the phonetic transcription which is closest to the utterance • Viterbi algorithm • 2 x 60 HMM models • Architecture • left-to-right • 3-state • 3-mixture NOTE: modules 3 and 4 use the Hidden Markov Model Toolkit, HTK (Entropic Research Labs)
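
The selection step can be illustrated with a toy Viterbi alignment. This is a simplified sketch, not the HTK-based system described on the slide: `viterbi_align`, `pick_transcription` and `toy_log_emit` are hypothetical names, frames are single numbers instead of MFCC vectors, each phone has one unit-variance Gaussian shared by its three states instead of 3-mixture state densities, transition probabilities are fixed at 0.5, and each alternative is scored separately instead of decoding the whole lattice in one pass.

```python
import math


def viterbi_align(frames, phones, log_emit, states_per_phone=3,
                  log_stay=math.log(0.5), log_next=math.log(0.5)):
    """Align frames to a left-to-right chain of phone states.
    Returns (best log score, phone index assigned to each frame)."""
    n_states = len(phones) * states_per_phone
    n_frames = len(frames)
    if n_states == 0 or n_frames < n_states:
        return -math.inf, []  # too few frames to visit every state

    def emit(t, s):
        return log_emit(phones[s // states_per_phone], frames[t])

    neg = -math.inf
    score = [[neg] * n_states for _ in range(n_frames)]
    back = [[0] * n_states for _ in range(n_frames)]
    score[0][0] = emit(0, 0)  # the path must start in the first state
    for t in range(1, n_frames):
        for s in range(n_states):
            stay = score[t - 1][s] + log_stay
            move = score[t - 1][s - 1] + log_next if s > 0 else neg
            if move > stay:
                best, back[t][s] = move, s - 1
            else:
                best, back[t][s] = stay, s
            if best > neg:
                score[t][s] = best + emit(t, s)
    s = n_states - 1  # the path must end in the last state
    total = score[-1][s]
    path = [s]
    for t in range(n_frames - 1, 0, -1):
        s = back[t][s]
        path.append(s)
    path.reverse()
    return total, [state // states_per_phone for state in path]


def pick_transcription(frames, alternatives, log_emit):
    """Score every alternative and keep the most likely one."""
    scored = [(viterbi_align(frames, alt, log_emit), alt)
              for alt in alternatives]
    (score, alignment), best = max(scored, key=lambda item: item[0][0])
    return best, score, alignment


def toy_log_emit(means):
    """Unit-variance Gaussian log score (up to a constant) per phone."""
    def log_emit(phone, frame):
        return -0.5 * (frame - means[phone]) ** 2
    return log_emit


frames = [0.0, 0.0, 0.0, 5.0, 5.0, 5.0]
scorer = toy_log_emit({"a": 0.0, "b": 5.0})
print(pick_transcription(frames, [["a"], ["a", "b"]], scorer))
```

The same routine returns both the winning transcription and the frame-level segmentation, which mirrors how the decoder and the aligner share one algorithm while possibly using different models.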

  11. Training and Test Corpora • Subset of the EUROM 1 multilingual corpus • European Portuguese • Collected in an anechoic room, 16 kHz, 16 bits. • 5 male + 5 female speakers (few talkers) • Prompt texts • Passages: • Paragraphs of 5 related sentences • Free translations of the English version of EUROM 1 • Adapted from books and newspaper text • Filler sentences: • 50 sentences grouped in blocks of 5 sentences each • Built to increase the number of different diphones in the corpus • Manually annotated.

  12. Training and Test Corpora (cont.)
      Speaker   Passages                      Phrases (filler sentences)
      1         O0 - O4   O5 - O9   P0 - P4   F5 - F9
      2         O0 - O4   O5 - O9   P0 - P4   F0 - F4
      3         P5 - P9   Q0 - Q4   Q5 - Q9   F5 - F9
      4         P0 - P4   P5 - P9   Q0 - Q4   F5 - F9
      5         O5 - O9   P0 - P4   P5 - P9   F0 - F4
      6         P5 - P9   Q0 - Q4   Q5 - Q9   F5 - F9
      7         O0 - O4   O5 - O9   P0 - P4   F0 - F4
      8         Q0 - Q4   Q5 - Q9   R0 - R4   F0 - F4
      9         R5 - R9   O0 - O4   O5 - O9   F5 - F9
      10        Q5 - Q9   R0 - R4   R5 - R9   F5 - F9
      Legend: Training Corpus, Test Corpus 1, Test Corpus 2 (the marking that assigns cells to each corpus was lost in extraction).
      Passages: O0-O9, P0-P9: translations from English; Q0-Q9, R0-R9: books and newspaper text. Filler sentences: F0-F9

  13. Transcription and alignment results
      Models                Transcription   Alignment
                            Precision       < 10 ms    Percentile 90 %
      HMM (transcription)   52.8 %          66.9 %     20 ms
      HMM (alignment)       43 %            78.9 %     18 ms
      • Transcription: • Precision = ((correct - inserted) / Total) x 100 % • Alignment: • < 10 ms: % of cases in which the absolute error is < 10 ms • Percentile 90 %: absolute error not exceeded in 90 % of cases
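
The three figures reported on this slide can be computed as in the sketch below. The function names are hypothetical, and the 90 % figure is read here as the 90th percentile of the absolute boundary error (nearest-rank definition), which is an assumption consistent with the "Percentile 90%" column heading. Boundary times are assumed to be paired one-to-one between the automatic and manual annotations.

```python
def precision(correct, inserted, total):
    """Transcription precision in percent: ((correct - inserted) / total) x 100."""
    return 100.0 * (correct - inserted) / total


def alignment_stats(auto_ms, manual_ms, tolerance_ms=10, percentile=90):
    """Return (% of boundaries with |error| < tolerance,
    nearest-rank percentile of |error| in ms)."""
    errors = sorted(abs(a - m) for a, m in zip(auto_ms, manual_ms))
    n = len(errors)
    within = 100.0 * sum(1 for e in errors if e < tolerance_ms) / n
    rank = max(1, (percentile * n + 99) // 100)  # integer ceiling of p*n/100
    return within, errors[rank - 1]


# Made-up boundary times in milliseconds, for illustration only.
manual = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
auto = [102, 196, 306, 408, 491, 611, 715, 780, 930, 1050]
print(precision(correct=90, inserted=5, total=100))
print(alignment_stats(auto, manual))
```

Subtracting insertions in the numerator means that a decoder which over-generates phones is penalised even when every reference phone is found.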

  14. Annotation strategies and Results
      Strategy     Transcription     Alignment
      Strategy 1   HMM alignment     HMM alignment
      Strategy 2   HMM recognition   HMM recognition
      Strategy 3   HMM recognition   HMM alignment

      Models       Transcription   Alignment
                   Precision       < 10 ms    Percentile 90 %
      Strategy 1   85.3 %          77.4 %     20 ms
      Strategy 2   85.8 %          44 %       29 ms
      Strategy 3   85.8 %          78 %       19 ms
      NOTE: Alignment evaluated only in places where the decoded sequence matched the manual sequence

  15. Annotation results - Transcription -
      Rules                                           Precision
                                                      Test 1    Test 2
      Canonical                                       74 %      76.9 %
      Sandhi                                          77.1 %    79.4 %
      Vowel reduction and alternative pronunciation   85.1 %    84.5 %
      • Comments • Better precision achieved for canonical transcriptions of Test 2 • Highest global precision achieved in Test 1 • Successive application of the rules leads to a better precision

  16. Annotation results - Alignment -
      Rules                                            Test 1              Test 2
                                                       < 10 ms    90 %     < 10 ms    90 %
      Canonical                                        74.68 %    24 ms    75.18 %    25 ms
      Sandhi                                           75.04 %    23 ms    75.41 %    24 ms
      Vowel reduction and alternative pronunciations   78.76 %    19 ms    77.27 %    22 ms
      • Comments • Better alignment obtained with the best decoder • Some problematic transitions: vowels, nasal vowels and liquids.

  17. Conclusions • Better annotation results with: • Alternative transcriptions (compared with canonical ones). • Use of different models for alignment and recognition • About 84 % precision in transcription and 22 ms of maximum alignment error for 90 % of the cases

  18. Future Work • Automatic rule inference • 1st Phase: comparison and selection of rules • 2nd Phase: validation or phonetic-linguistic interpretation • Annotation of other speech corpora to build better acoustic models • Assignment of probabilistic information to the alternative pronunciations generated by rule

  19. TOPIC ANNOTATION IN BROADCAST NEWS Rui Amaral, Isabel Trancoso IST, Instituto Superior Técnico INESC, Instituto de Engenharia de Sistemas e Computadores

  20. Preliminary work • System Architecture • Two-stage unsupervised clustering algorithm • nearest-neighbour search method • Kullback-Leibler distance measure • Topic language models • smoothed unigram statistics • Topic Decoder • based on Hidden Markov Models (HMM) NOTE: topic models created with the CMU-Cambridge Statistical Language Modelling Toolkit
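
A nearest-neighbour search over smoothed unigram models with a Kullback-Leibler measure might look like the sketch below. It is only an illustration: the slide does not say which smoothing or which form of the distance was used, so add-one smoothing and the symmetric sum of the two KL divergences are assumptions, and all function names and the toy word lists are hypothetical. It does not use the CMU-Cambridge toolkit.

```python
import math
from collections import Counter


def smoothed_unigram(tokens, vocabulary, alpha=1.0):
    """Additively smoothed unigram distribution over a fixed vocabulary."""
    counts = Counter(t for t in tokens if t in vocabulary)
    total = sum(counts.values()) + alpha * len(vocabulary)
    return {w: (counts[w] + alpha) / total for w in vocabulary}


def kl_divergence(p, q):
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)


def symmetric_kl(p, q):
    """Symmetrised KL distance (assumed form of the distance measure)."""
    return kl_divergence(p, q) + kl_divergence(q, p)


def nearest_topic(story_tokens, topic_models, vocabulary):
    """Return the topic whose unigram model is closest to the story."""
    story = smoothed_unigram(story_tokens, vocabulary)
    return min(topic_models,
               key=lambda name: symmetric_kl(story, topic_models[name]))


sports = "goal match team goal player".split()
politics = "vote party minister vote law".split()
vocab = set(sports) | set(politics)
models = {"sports": smoothed_unigram(sports, vocab),
          "politics": smoothed_unigram(politics, vocab)}
print(nearest_topic("team goal match".split(), models, vocab))
```

Smoothing matters here because an unsmoothed model would assign zero probability to unseen words and make the divergence undefined.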

  21. System Architecture

  22. Training and Test Corpora • Subset of the BD_PUBLICO newspaper text corpus • 20000 stories • 6-month period (September 95 - February 96) • topic annotated • size between 100 and 2000 words • normalised text
