Accent & Dialect Identification

Chuck Curtis
LING575 – Discourse & Dialogue
6/1/2011

HTK
• Hidden Markov Model Toolkit
• Library of modules and tools written in C
• First release was in 1989
• Eventually wound up in Microsoft’s hands, but it is publicly available
• http://htk.eng.cam.ac.uk/
• Ended up being too difficult to use for this project
• Too much unfamiliar background and theory
• Many confusing incremental steps

TIMIT
• LDC Acoustic-Phonetic Continuous Speech Corpus (1993)
• 8 American English dialect groups
• 630 speakers total, 10 read sentences per speaker
• 70% male, 30% female
• ARPABET phonetic transcriptions
• On Patas at /corpora/LDC/LDC93S1/

Example Sentence
0 63488 She had your dark suit in greasy wash water all year.
7470 11362 she
11362 16000 had
15420 17503 your
17503 23360 dark
23360 28360 suit
28360 30960 in
30960 36971 greasy
36971 42290 wash
43120 47480 water
49021 52184 all
52184 58840 year

Example Phone Sequence
0 7470 h#
7470 9840 sh
9840 11362 iy
11362 12908 hv
12908 14760 ae
14760 15420 dcl
15420 16000 jh
16000 17503 axr
17503 18540 dcl
18540 18950 d
18950 21053 aa
21053 22200 r
22200 22740 kcl
…

What’s our vector, Victor?
• As a starting point, I’m looking at the phone sequence of each word as a separate feature
• Each vector: instance name, dialect label, then feature/count pairs
/corpora/LDC/LDC93S1/TIMIT/TRAIN/DR4/MBMA0_2 South_Midland ask_ae_s 1 an_ix_n 1 rag_r_ae_gcl_g 1 like_l_ay_kcl 1 that_dh_ae_tcl 1 oily_oy_l_iy 1 me_m_iy 1 carry_kcl_k_eh_r_iy 1 don't_d_ow_nx 1 to_dx_ix 1
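A minimal sketch of how word-level phone-sequence features like the ones above could be built from the word and phone alignments. This is not necessarily the script used for the project: the function names are hypothetical, and the rule for assigning phones to words (a phone belongs to a word when its span lies entirely inside the word's span) is an assumption. Because the TIMIT word spans for "had" and "your" overlap, the shared phone jh is counted in both words under this rule.

```python
from collections import Counter

# Alignment excerpts copied from the example slides (word and phone spans).
WRD = """7470 11362 she
11362 16000 had
15420 17503 your"""

PHN = """0 7470 h#
7470 9840 sh
9840 11362 iy
11362 12908 hv
12908 14760 ae
14760 15420 dcl
15420 16000 jh
16000 17503 axr"""


def parse_alignment(text):
    """Parse 'start end label' lines into (int, int, str) tuples."""
    rows = []
    for line in text.strip().splitlines():
        start, end, label = line.split(maxsplit=2)
        rows.append((int(start), int(end), label))
    return rows


def word_phone_features(words, phones):
    """Count one feature per word token: the word joined with its phones.

    Assumption: a phone belongs to a word if its span lies fully
    inside the word's span.
    """
    feats = Counter()
    for w_start, w_end, word in words:
        inside = [p for p_start, p_end, p in phones
                  if p_start >= w_start and p_end <= w_end]
        feats["_".join([word] + inside)] += 1
    return feats


def format_instance(name, label, feats):
    """Render 'name label feat count feat count ...' on one line."""
    pairs = " ".join(f"{f} {c}" for f, c in sorted(feats.items()))
    return f"{name} {label} {pairs}"


if __name__ == "__main__":
    feats = word_phone_features(parse_alignment(WRD), parse_alignment(PHN))
    print(format_instance("example_instance", "South_Midland", feats))
```

The instance name and dialect label in the usage line are placeholders; in practice they would come from the speaker's directory path and dialect region.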

Using MALLET and TBL
• MaxEnt classifier
• TBL (transformation-based learning) algorithm that we implemented for 572

FEATURES = WORDS, VECTORS = SPEAKERS

FEATURES = WORDS, VECTORS = SENTENCES

FEATURES = MONOPHONES, VECTORS = SENTENCES

FEATURES = DIPHONES, VECTORS = SENTENCES

FEATURES = TRIPHONES, VECTORS = SENTENCES
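A small sketch of how monophone, diphone, and triphone features could be counted for one sentence, treating them as phone n-grams with n = 1, 2, 3. The function name is hypothetical, and keeping the h# silence marker as an ordinary phone is an assumption; the original experiments may have handled it differently.

```python
from collections import Counter

# First phones of the example phone sequence slide.
PHONES = ["h#", "sh", "iy", "hv", "ae", "dcl", "jh", "axr"]


def phone_ngrams(phones, n):
    """Count every run of n consecutive phones, joined with underscores."""
    counts = Counter()
    for i in range(len(phones) - n + 1):
        counts["_".join(phones[i:i + n])] += 1
    return counts


if __name__ == "__main__":
    for n, name in [(1, "monophones"), (2, "diphones"), (3, "triphones")]:
        print(name, dict(phone_ngrams(PHONES, n)))
```

A sentence shorter than n phones simply yields no features of that order.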

TODO
• Trigrams w/ word boundaries
• Try DecisionTree classifier (which uses InfoGain)
• Possibly add gender to feature vectors
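For reference, a rough sketch of the information gain a decision tree would use to rank a binary feature: the entropy of the dialect labels minus the weighted entropy after splitting on whether the feature is present. This is a generic textbook computation, not MALLET's own code; the function names and the toy instances are made up for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def info_gain(instances, feature):
    """Information gain of splitting on presence/absence of one feature.

    instances: list of (label, set_of_features) pairs.
    """
    all_labels = [label for label, _ in instances]
    present = [label for label, feats in instances if feature in feats]
    absent = [label for label, feats in instances if feature not in feats]
    remainder = 0.0
    for subset in (present, absent):
        if subset:  # skip an empty side of the split
            remainder += len(subset) / len(all_labels) * entropy(subset)
    return entropy(all_labels) - remainder


# Toy data (hypothetical, not taken from TIMIT results).
TOY = [
    ("South_Midland", {"ask_ae_s", "me_m_iy"}),
    ("South_Midland", {"ask_ae_s", "me_m_iy"}),
    ("Northern", {"ask_aa_s", "me_m_iy"}),
    ("Northern", {"ask_aa_s", "me_m_iy"}),
]

if __name__ == "__main__":
    print(info_gain(TOY, "ask_ae_s"))  # feature that separates the two labels
    print(info_gain(TOY, "me_m_iy"))   # feature present in every instance
```

A feature that appears in every instance, like me_m_iy in the toy data, tells the classifier nothing, while one that tracks the label exactly recovers the full label entropy.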

Questions / Comments?
