On Use of Temporal Dynamics of Speech for Language Identification
Andre Adami, Pavel Matejka, Petr Schwarz, Hynek Hermansky
Anthropic Signal Processing Group
http://www.asp.ogi.edu
OGI-4 – ASP System
[Diagram: speech signal → segmentation → units → target language model (+) and background model (−), combined into a score]
• Goal
  • Convert the speech signal into a sequence of discrete sub-word units that can characterize the language
• Approach
  • Use temporal trajectories of speech parameters to obtain the sequence of units
  • Model the sequence of discrete sub-word units using an N-gram language model
• Sub-word units
  • TRAP-derived American English phonemes
  • Symbols derived from prosodic cue dynamics
  • Phonemes from OGI-LID
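The diagram on this slide scores a unit sequence against a target language model (+) and a background model (−). A minimal sketch of that idea follows, assuming a trigram model with add-one smoothing and a per-symbol log-likelihood ratio as the score. The smoothing scheme, the normalization by sequence length, and all names (`TrigramModel`, `lid_score`) are illustrative assumptions, not details given in the slides.

```python
import math
from collections import Counter


class TrigramModel:
    """Add-one smoothed trigram model over discrete unit symbols.

    The smoothing scheme is an assumption made for this sketch.
    """

    def __init__(self, sequences, vocab):
        self.vocab_size = len(vocab)
        self.tri = Counter()  # counts of (u1, u2, u3)
        self.bi = Counter()   # counts of the context (u1, u2)
        for seq in sequences:
            padded = ["<s>", "<s>"] + list(seq)
            for i in range(2, len(padded)):
                self.tri[tuple(padded[i - 2:i + 1])] += 1
                self.bi[tuple(padded[i - 2:i])] += 1

    def log_prob(self, seq):
        """Log-probability of a whole unit sequence."""
        padded = ["<s>", "<s>"] + list(seq)
        total = 0.0
        for i in range(2, len(padded)):
            num = self.tri[tuple(padded[i - 2:i + 1])] + 1
            den = self.bi[tuple(padded[i - 2:i])] + self.vocab_size
            total += math.log(num / den)
        return total


def lid_score(seq, target, background):
    """Per-symbol log-likelihood ratio: target (+) minus background (-)."""
    return (target.log_prob(seq) - background.log_prob(seq)) / max(len(seq), 1)


# Toy data: the "target language" favours the pattern a-b-c.
vocab = ["a", "b", "c"]
target = TrigramModel([list("abcabcabc")], vocab)
background = TrigramModel([list("abcabcabc"), list("cbacbacba")], vocab)

score_match = lid_score(list("abcabc"), target, background)
score_mismatch = lid_score(list("cbacba"), target, background)
```

In this toy setup the sequence that follows the target pattern receives a positive score and the reversed pattern a negative one, which is the behaviour the (+)/(−) arrangement is meant to produce.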
American English Phoneme Recognition
[Diagram: temporal patterns paradigm vs. short-term analysis on the time-frequency plane; classifier output: phone]
• Phoneme set
  • 39 American English phonemes (CMU-like)
• Phoneme recognizer
  • Trained on NTIMIT
  • TRAP (Temporal Patterns) based
  • Speech segments for training obtained from energy-based speech/nonspeech segmentation
• Modeling
  • 3-gram language model
English Phoneme System
[Diagram: time-frequency plane → band classifiers 1 to N → merger → Viterbi search]
• Merger
  • MLP (897x300x39)
• Viterbi search
  • Penalty factor tuning: deletions = insertions
• Training
  • NTIMIT
• Temporal trajectories
  • 23 mel-scale frequency bands
  • 1 s segments of log energy trajectory
• Band classifiers
  • MLP (101x300x39)
  • Hidden unit nonlinearities: sigmoids
  • Output nonlinearities: softmax
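The layer sizes on this slide fit together as follows: a 101-point input is consistent with a 1 s trajectory sampled at a 10 ms frame step (the frame step is an assumption), and the merger input of 897 equals 23 bands times 39 phoneme posteriors. The sketch below wires up that structure with random, untrained weights and no bias terms, purely to show the data flow; it is not the authors' implementation.

```python
import math
import random

N_BANDS, TRAJ_LEN, HIDDEN, N_PHONEMES = 23, 101, 300, 39


def make_mlp(n_in, n_hidden, n_out, rng):
    """Random, untrained weights; only the layer sizes come from the slide."""
    def layer(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return layer(n_hidden, n_in), layer(n_out, n_hidden)


def forward(mlp, x):
    """Sigmoid hidden layer followed by a softmax output layer."""
    w1, w2 = mlp
    hidden = [1.0 / (1.0 + math.exp(-sum(w * v for w, v in zip(row, x)))) for row in w1]
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in w2]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
band_mlps = [make_mlp(TRAJ_LEN, HIDDEN, N_PHONEMES, rng) for _ in range(N_BANDS)]
merger = make_mlp(N_BANDS * N_PHONEMES, HIDDEN, N_PHONEMES, rng)

# One log-energy trajectory per band (random stand-in data).
trajectories = [[rng.gauss(0.0, 1.0) for _ in range(TRAJ_LEN)] for _ in range(N_BANDS)]

# Each band classifier yields 39 posteriors; concatenated they form the merger input.
merger_input = [p for mlp, traj in zip(band_mlps, trajectories) for p in forward(mlp, traj)]
posteriors = forward(merger, merger_input)
```

The merger output is one posterior vector per frame; in the system described here, a Viterbi search then turns the frame-level posteriors into a phoneme string.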
Prosodic Cues Dynamics
• Technique
  • Use prosodic cues (intensity and pitch trajectories) to derive the sub-word units
• Approach
  • Segment the speech signal at the inflection points of the trajectories (zero-crossings of the derivative) and at the onsets and offsets of voicing
  • Label each segment by the direction of change of the parameter within the segment
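The segmentation and labeling steps above can be sketched as a run-length grouping of frame-to-frame slope signs. The sketch assumes that unvoiced frames are marked by `f0 == 0`, that a zero slope counts as rising, and that the trajectories have already been smoothed; the label strings (`f0+e-`, `uv`) and function names are hypothetical.

```python
from itertools import groupby


def frame_labels(f0, energy):
    """Label each frame transition by voicing and by the slope sign of f0 and energy."""
    labels = []
    for i in range(len(f0) - 1):
        if f0[i] <= 0 or f0[i + 1] <= 0:
            labels.append("uv")  # transition touches an unvoiced frame
        else:
            d_f0 = "+" if f0[i + 1] - f0[i] >= 0 else "-"
            d_en = "+" if energy[i + 1] - energy[i] >= 0 else "-"
            labels.append("f0" + d_f0 + "e" + d_en)
    return labels


def segment(f0, energy):
    """Return (label, n_frames) runs.

    A new run starts wherever a slope changes sign (a zero-crossing of the
    derivative) or voicing switches on or off.
    """
    return [(label, len(list(run))) for label, run in groupby(frame_labels(f0, energy))]


f0 = [0, 0, 100, 110, 120, 115, 110, 0, 0]
energy = [1, 2, 5, 6, 7, 8, 6, 2, 1]
segments = segment(f0, energy)
# segments == [("uv", 2), ("f0+e+", 2), ("f0-e+", 1), ("f0-e-", 1), ("uv", 2)]
```

Grouping slope signs is equivalent to cutting at derivative zero-crossings, and it yields the direction-of-change label for each segment in the same pass.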
Prosodic Cues Dynamics
• Duration
  • The duration of each segment is characterized as “short” (less than 8 frames) or “long”
  • 10 symbols
• Broad-phonetic-category (BFC)
  • Finer labeling achieved by estimating the broad-phonetic category (vowel+diphthong+glide, schwa, stop, fricative, flap, nasal, and silence) coinciding with each prosodic segment
  • BFC TRAPs trained on NTIMIT are used for deriving the broad-phonetic categories
  • 61 symbols
• 3-gram language model
BFC TRAPS Setup
• Input temporal vectors
  • 15 bark-scale frequency band energies
  • 1 s segments of log energy trajectory
  • Mean and variance normalized
  • Dimension reduction: DCT
• Band classifiers
  • MLP (15x100x7)
  • Hidden units: sigmoid
  • Output units: softmax
• Merger
  • MLP (105x100x7)
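The short/long duration rule on this slide can be sketched as below. The slide gives the threshold (fewer than 8 frames is "short") and the total of 10 symbols. The five-class inventory used here (four combinations of rising/falling pitch and energy, plus one unvoiced class) is an assumption chosen because it is consistent with that count; the slides do not list the classes.

```python
SHORT_MAX_FRAMES = 8  # "short" means fewer than 8 frames, per the slide

# Assumed class inventory: four voiced slope combinations plus one unvoiced class.
CLASSES = ["f0+e+", "f0+e-", "f0-e+", "f0-e-", "uv"]


def to_symbol(label, n_frames):
    """Attach the quantized duration (S = short, L = long) to a segment label."""
    duration = "S" if n_frames < SHORT_MAX_FRAMES else "L"
    return label + "/" + duration


# Enumerate every class with one short and one long duration.
inventory = [to_symbol(c, n) for c in CLASSES for n in (1, SHORT_MAX_FRAMES)]
```

Under these assumptions the inventory has exactly 10 distinct symbols, matching the count on the slide.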
OGI-4 – ASP System
[Chart: EER, 30 s condition] EER30s = 17.8%, 41.4%, 19.3%, 32.1%
OGI-4 – ASP System
EER30s = 17.8%
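The EER (equal error rate) figures reported here are the operating point at which the miss rate equals the false-alarm rate. A minimal sketch of how such a figure can be approximated from detection scores follows; the threshold sweep over observed scores and the function name are illustrative choices, and the scores are made-up toy values rather than results from the evaluation.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping a threshold over all observed scores."""
    best_gap, eer = None, None
    for threshold in sorted(set(target_scores) | set(nontarget_scores)):
        # Miss: a target trial scored below the threshold.
        miss = sum(s < threshold for s in target_scores) / len(target_scores)
        # False alarm: a non-target trial scored at or above the threshold.
        false_alarm = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(miss - false_alarm)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (miss + false_alarm) / 2
    return eer


targets = [2.0, 1.5, 1.0, 0.2]
nontargets = [0.5, 0.0, -0.5, -1.0]
eer = equal_error_rate(targets, nontargets)  # 0.25 for this toy data
```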
Post-Evaluation – Phoneme System
• Speech-nonspeech segmentation using silence classes from TRAP-based classification
• TRAPs classifier
  • Temporal trajectory duration: 400 ms
  • 3 bands as the input trajectory for each band classifier, to explore the correlation between adjacent bands
  • The trajectories of the 3 bands are projected onto a DCT basis (20 coefficients)
• Viterbi search tuned for language identification
• Training data
  • CallFriend training and development sets
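The DCT projection mentioned above can be illustrated on a single trajectory. The sketch assumes a 10 ms frame step (so 400 ms gives 41 frames) and an unnormalized DCT-II truncated to 20 coefficients. The slide does not say whether the 20 coefficients are per band or shared across the 3 bands, so only one trajectory is transformed here.

```python
import math


def dct_ii(trajectory, n_coeffs):
    """Unnormalized DCT-II, truncated to the first n_coeffs coefficients."""
    n = len(trajectory)
    return [
        sum(x * math.cos(math.pi * k * (2 * t + 1) / (2 * n))
            for t, x in enumerate(trajectory))
        for k in range(n_coeffs)
    ]


N_FRAMES = 41  # 400 ms at an assumed 10 ms frame step
N_COEFFS = 20  # from the slide

flat = [1.0] * N_FRAMES                               # constant trajectory
ramp = [t / (N_FRAMES - 1) for t in range(N_FRAMES)]  # steadily rising trajectory

flat_coeffs = dct_ii(flat, N_COEFFS)
ramp_coeffs = dct_ii(ramp, N_COEFFS)
```

A constant trajectory puts all of its energy in coefficient 0, while slower or faster modulations spread over the higher coefficients. Keeping only the first 20 therefore retains the coarse shape of the trajectory in fewer dimensions.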
Post-Evaluation – Phoneme System
EER30s = 12.7% (34% relative improvement)
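A relative improvement is the EER reduction divided by the baseline EER. The slide does not name the baseline, so the sketch below assumes it is the 19.3% figure from the evaluation results, which reproduces the stated 34%.

```python
def relative_improvement(baseline_eer, new_eer):
    """Fractional reduction in EER relative to the baseline."""
    return (baseline_eer - new_eer) / baseline_eer


# Assumed baseline: 19.3% (one of the evaluation EERs); new EER: 12.7% (this slide).
phoneme_gain = relative_improvement(19.3, 12.7)  # about 0.34
```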
Post-Evaluation – Prosodic Cues System
• No energy-based segmentation
  • Unvoiced segments longer than 2 seconds are considered non-speech
• No broad-phonetic category labeling applied
  • Rate of change plus the quantized duration (10 tokens)
• Training data
  • CallFriend training and development sets
Post-Evaluation – Prosodic Cues System
EER30s = 22.2% (30% relative improvement)
Fusion - 30 sec condition
• Fusing the scores from the prosodic cues system
  • with TRAP-derived phonemes: EER30s = 10.5% (17% relative improvement)
  • with OGI-LID derived phonemes: EER30s = 6.6% (14% relative improvement)
• TRAP-derived phoneme system fused with OGI-LID
  • EER30s = 6.2% (19% relative improvement)
• EER30s = 5.7% (26% relative improvement)
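The slides do not say how the scores were fused. One common option, shown here only as an assumed illustration, is to z-normalize each system's scores and take a weighted sum. The weights, the normalization, and the toy scores are all hypothetical.

```python
from statistics import mean, pstdev


def z_norm(scores):
    """Shift and scale scores to zero mean and unit variance."""
    mu, sigma = mean(scores), pstdev(scores)
    return [(s - mu) / sigma if sigma > 0 else 0.0 for s in scores]


def fuse(system_scores, weights):
    """Weighted sum of per-system z-normalized scores.

    system_scores holds one list per system, aligned by trial.
    """
    normed = [z_norm(scores) for scores in system_scores]
    n_trials = len(normed[0])
    return [sum(w * col[i] for w, col in zip(weights, normed)) for i in range(n_trials)]


phoneme = [1.2, -0.3, 0.8, -1.1]   # toy scores from a phoneme-based system
prosodic = [0.4, 0.1, 0.6, -0.5]   # toy scores from a prosodic-cue system
fused = fuse([phoneme, prosodic], weights=[0.5, 0.5])
```

Normalizing first keeps a system with a wider score range from dominating the sum; in practice the weights would be tuned on development data.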
Conclusions
• Sequences of discrete symbols derived from speech dynamics provide useful information for characterizing the language
• Two techniques for deriving the sequences of symbols were investigated
  • Segmentation and labeling based on prosodic cues
  • Segmentation and labeling based on TRAP-derived phonetic labels
• The introduced techniques combine well with each other as well as with the more conventional language ID techniques