On Use of Temporal Dynamics of Speech for Language Identification

Andre Adami, Pavel Matejka, Petr Schwarz, Hynek Hermansky
Anthropic Signal Processing Group
http://www.asp.ogi.edu

OGI-4 – ASP System
[Diagram: speech signal → segmentation → units → target language model (+) and background model (−) → score]
• Goal
  • Convert the speech signal into a sequence of discrete sub-word units that can characterize the language
• Approach
  • Use temporal trajectories of speech parameters to obtain the sequence of units
  • Model the sequence of discrete sub-word units using an N-gram language model
• Sub-word units
  • TRAP-derived American English phonemes
  • Symbols derived from the dynamics of prosodic cues
  • Phonemes from OGI-LID
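The scoring scheme in the diagram (target language model minus background model) can be sketched as a log-likelihood ratio between two N-gram models over unit sequences. This is a minimal illustration, not the authors' implementation: the add-one smoothing, the per-unit normalization, the function names, and the toy "a"/"b" units are all assumptions that do not come from the slides.

```python
import math
from collections import Counter

def train_trigram(sequences, vocab_size):
    """Count trigrams and their two-unit histories over padded sequences."""
    tri, hist = Counter(), Counter()
    for seq in sequences:
        padded = ["<s>", "<s>"] + list(seq) + ["</s>"]
        for i in range(2, len(padded)):
            tri[tuple(padded[i - 2:i + 1])] += 1
            hist[tuple(padded[i - 2:i])] += 1
    return tri, hist, vocab_size

def log_prob(model, seq):
    """Log-probability of a unit sequence with add-one smoothing (an assumption)."""
    tri, hist, vocab_size = model
    padded = ["<s>", "<s>"] + list(seq) + ["</s>"]
    total = 0.0
    for i in range(2, len(padded)):
        gram = tuple(padded[i - 2:i + 1])
        total += math.log((tri[gram] + 1) / (hist[gram[:2]] + vocab_size))
    return total

def lid_score(target, background, seq):
    """Target minus background log-likelihood, averaged over predicted units."""
    n_predictions = len(seq) + 1  # every unit plus the end-of-sequence marker
    return (log_prob(target, seq) - log_prob(background, seq)) / n_predictions

# Toy data: the target "language" alternates units, the background pools everything.
VOCAB = 3  # "a", "b", and the end marker
target = train_trigram([list("ababab")], VOCAB)
background = train_trigram([list("ababab"), list("aabbaabb")], VOCAB)

matching = lid_score(target, background, list("abab"))
mismatching = lid_score(target, background, list("aabb"))
```

A positive score means the sequence is better explained by the target model than by the background model; a detection threshold on this score would then be chosen on development data.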

American English Phoneme Recognition
[Figure: temporal patterns paradigm; labels: short-term analysis, classifier, phone; axes: frequency, time]
• Phoneme set
  • 39 American English phonemes (CMU-like)
• Phoneme recognizer
  • Trained on NTIMIT
  • TRAP (Temporal Patterns) based
  • Speech segments for training obtained from energy-based speech/nonspeech segmentation
• Modeling
  • 3-gram language model

English Phoneme System
[Diagram: band classifiers 1 to N (one per frequency band, each operating over time) → merger → Viterbi search]
• Merger
  • MLP (897x300x39)
• Viterbi search
  • Penalty factor tuned so that deletions = insertions
• Training
  • NTIMIT
• Temporal trajectories
  • 23 mel-scale frequency bands
  • 1 s segments of log energy trajectory
• Band classifiers
  • MLP (101x300x39)
  • Hidden unit nonlinearities: sigmoids
  • Output nonlinearities: softmax
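The layer sizes above fit together as follows: each of the 23 band classifiers maps a 101-point trajectory to 39 phoneme outputs, and the merger input is their concatenation, 23 × 39 = 897. The sketch below traces that data flow for a single frame with untrained random weights. It is only a shape check under assumptions: biases are omitted, the inputs are random placeholders, and the slides do not say whether the merger receives posteriors or pre-softmax values (posteriors are used here).

```python
import math
import random

N_BANDS, TRAJ_LEN, HIDDEN, N_PHONEMES = 23, 101, 300, 39

def random_layer(n_in, n_out, rng):
    # Untrained placeholder weights; a real system would learn these on NTIMIT.
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]

def affine(weights, x):
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def sigmoid(z):
    return [1.0 / (1.0 + math.exp(-v)) for v in z]

def softmax(z):
    peak = max(z)
    exps = [math.exp(v - peak) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def mlp(layers, x):
    """Three-layer MLP: sigmoid hidden units, softmax outputs (biases omitted)."""
    hidden = sigmoid(affine(layers[0], x))
    return softmax(affine(layers[1], hidden))

rng = random.Random(0)
band_mlps = [
    (random_layer(TRAJ_LEN, HIDDEN, rng), random_layer(HIDDEN, N_PHONEMES, rng))
    for _ in range(N_BANDS)
]
merger = (
    random_layer(N_BANDS * N_PHONEMES, HIDDEN, rng),
    random_layer(HIDDEN, N_PHONEMES, rng),
)

# One placeholder log-energy trajectory per band, centered on the current frame.
trajectories = [[rng.gauss(0.0, 1.0) for _ in range(TRAJ_LEN)] for _ in range(N_BANDS)]

band_posteriors = [mlp(net, traj) for net, traj in zip(band_mlps, trajectories)]
merger_input = [p for posterior in band_posteriors for p in posterior]
phoneme_posteriors = mlp(merger, merger_input)
```

Running this per frame yields a sequence of 39-dimensional posterior vectors, which is what a Viterbi search would decode into a phoneme string.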

Prosodic Cues Dynamics
• Technique
  • Use prosodic cues (intensity and pitch trajectories) to derive the sub-word units
• Approach
  • Segment the speech signal at the inflection points of the trajectories (zero-crossings of the derivative) and at the onsets and offsets of voicing
  • Label each segment by the direction of change of the parameter within the segment
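The segmentation rule can be sketched by labeling every frame with the sign of the local slope of pitch and energy, then merging runs of identical labels; a boundary then falls wherever a derivative changes sign or voicing starts or stops. This is an illustrative sketch, not the authors' code: the label strings ("++", "--", "uv"), the convention that unvoiced frames carry f0 = 0, and the simple first-difference slope are assumptions.

```python
def slope_sign(x, t, usable):
    """Sign of the local slope of x at frame t, using only usable neighbours."""
    if t + 1 < len(x) and usable(t + 1):
        delta = x[t + 1] - x[t]      # forward difference
    elif t > 0 and usable(t - 1):
        delta = x[t] - x[t - 1]      # backward difference at a segment edge
    else:
        delta = 0.0
    return "+" if delta >= 0 else "-"

def prosodic_segments(f0, energy):
    """Return (label, n_frames) pairs; labels are pitch sign + energy sign, or 'uv'."""
    def voiced(t):
        return f0[t] > 0             # assumption: unvoiced frames have f0 == 0

    frame_labels = []
    for t in range(len(f0)):
        if not voiced(t):
            frame_labels.append("uv")
        else:
            pitch_dir = slope_sign(f0, t, voiced)
            energy_dir = slope_sign(energy, t, lambda _: True)
            frame_labels.append(pitch_dir + energy_dir)

    # Merge consecutive frames that share a label into one segment.
    segments = []
    for label in frame_labels:
        if segments and segments[-1][0] == label:
            segments[-1][1] += 1
        else:
            segments.append([label, 1])
    return [(label, n) for label, n in segments]

# Toy contour: silence, a rise-fall in pitch and energy, silence.
f0 = [0, 0, 100, 110, 120, 110, 100, 0, 0]
energy = [1, 1, 2, 3, 4, 3, 2, 1, 1]
segments = prosodic_segments(f0, energy)
```

Real pitch and intensity tracks would need smoothing before differencing, otherwise frame-level jitter produces many spurious one-frame segments.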

Prosodic Cues Dynamics
• Duration
  • The duration of the segment is characterized as “short” (less than 8 frames) or “long”
  • 10 symbols
• Broad-phonetic-category (BFC)
  • Finer labeling is achieved by estimating the broad-phonetic category (vowel+diphthong+glide, schwa, stop, fricative, flap, nasal, and silence) coinciding with each prosodic segment
  • BFC TRAPs trained on NTIMIT are used for deriving the broad-phonetic categories
  • 61 symbols
• 3-gram language model
• BFC TRAPs setup
  • Input temporal vectors
    • 15 bark-scale frequency band energies
    • 1 s segments of log energy trajectory
    • Mean and variance normalized
    • Dimension reduction: DCT
  • Band classifiers
    • MLP (15x100x7)
    • Hidden units: sigmoid
    • Output units: softmax
  • Merger
    • MLP (105x100x7)
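The duration rule (fewer than 8 frames is "short", otherwise "long") can be sketched as a suffix added to each segment label. The slides give the symbol count (10) but not the inventory; the sketch assumes five direction labels (four pitch/energy rise-fall combinations plus unvoiced), which happens to give 5 × 2 = 10. That inventory and the label strings are assumptions.

```python
SHORT_MAX_FRAMES = 8  # "short" means strictly fewer than 8 frames

# Assumed direction labels: pitch sign + energy sign, plus unvoiced.
DIRECTION_LABELS = ["++", "+-", "-+", "--", "uv"]

def duration_class(n_frames):
    return "short" if n_frames < SHORT_MAX_FRAMES else "long"

def add_duration(segments):
    """Turn (label, n_frames) pairs into duration-tagged symbols."""
    return [f"{label}/{duration_class(n)}" for label, n in segments]

# Full symbol inventory under the assumption above.
inventory = [f"{label}/{d}" for label in DIRECTION_LABELS for d in ("short", "long")]

symbols = add_duration([("++", 3), ("--", 8), ("uv", 12)])
```

The resulting symbol strings are what the 3-gram language model would be trained on.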

OGI-4 – ASP System
[Results figure, 30 s condition: EER30s = 17.8%, 41.4%, 19.3%, 32.1%]

OGI-4 – ASP System EER30s=17.8%

Post-Evaluation – Phoneme System
• Speech-nonspeech segmentation using silence classes from TRAP-based classification
• TRAPs classifier
  • Temporal trajectory duration: 400 ms
  • 3 bands as the input trajectory for each band classifier, to exploit the correlation between adjacent bands
  • The trajectories of the 3 bands are projected onto a DCT basis (20 coefficients)
• Viterbi search tuned for language identification
• Training data
  • CallFriend training and development sets
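Projecting a trajectory onto a DCT basis and keeping the first 20 coefficients can be sketched as below. Several details are assumptions because the slides leave them open: the 41-frame length (400 ms at a 10 ms frame step, plus the center frame), the unnormalized DCT-II, and the choice to keep 20 coefficients per band rather than 20 shared across the three bands.

```python
import math

def dct2(x, n_coeffs):
    """First n_coeffs coefficients of the unnormalized DCT-II of x."""
    n = len(x)
    return [
        sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
        for k in range(n_coeffs)
    ]

N_FRAMES = 41  # assumed: 400 ms at a 10 ms frame step, plus the center frame
N_COEFFS = 20

def three_band_features(bands):
    """Concatenate the truncated DCT of three adjacent band trajectories."""
    assert len(bands) == 3
    return [c for band in bands for c in dct2(band, N_COEFFS)]

# A flat trajectory has all of its energy in coefficient 0.
flat = [1.0] * N_FRAMES
coeffs = dct2(flat, N_COEFFS)
features = three_band_features([flat, flat, flat])
```

Truncating the DCT keeps the slowly varying shape of the trajectory and discards frame-to-frame detail, which also shrinks the classifier input.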

Post-Evaluation – Phoneme System
EER30s = 12.7% (34% relative improvement)

Post-Evaluation – Prosodic Cues System
• No energy-based segmentation
  • Unvoiced segments longer than 2 seconds are considered non-speech
• No broad-phonetic category labeling applied
  • Rate of change plus the quantized duration (10 tokens)
• Training data
  • CallFriend training and development sets
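The 2-second rule that replaces energy-based segmentation can be sketched as a pass over labeled segments. The (label, n_frames) segment format, the "uv" and "nonspeech" label strings, and the 10 ms frame step are assumptions made for the illustration.

```python
def mark_nonspeech(segments, frame_step_s=0.01, max_unvoiced_s=2.0):
    """Relabel unvoiced segments longer than max_unvoiced_s as non-speech.

    segments: list of (label, n_frames) pairs, with "uv" marking unvoiced runs.
    frame_step_s: assumed frame step; the slides do not state it.
    """
    marked = []
    for label, n_frames in segments:
        if label == "uv" and n_frames * frame_step_s > max_unvoiced_s:
            marked.append(("nonspeech", n_frames))
        else:
            marked.append((label, n_frames))
    return marked

# 1.5 s of unvoiced speech is kept; 2.5 s is treated as non-speech.
marked = mark_nonspeech([("uv", 150), ("++", 12), ("uv", 250)])
```

Segments marked as non-speech would be dropped before the symbol sequence reaches the language model.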

Post-Evaluation – Prosodic Cues System
EER30s = 22.2% (30% relative improvement)

Fusion - 30 sec condition
• Fusing the scores from the prosodic cues system
  • with TRAP-derived phonemes: EER30s = 10.5% (17% relative improvement)
  • with OGI-LID derived phonemes: EER30s = 6.6% (14% relative improvement)
• TRAP-derived phoneme system fused with OGI-LID
  • EER30s = 6.2% (19% relative improvement)
EER30s = 5.7% (26% relative improvement)
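The relative improvements quoted on these slides appear to follow the usual definition, (baseline EER − new EER) / baseline EER. The sketch below checks two of them. The baseline pairings are inferred rather than stated on the slides: 19.3% is taken as the evaluation-time phoneme system, and 12.7% (the post-evaluation phoneme system) as the baseline for fusion with the prosodic cues system.

```python
def relative_improvement(baseline_eer, new_eer):
    """Relative EER reduction in percent."""
    return 100.0 * (baseline_eer - new_eer) / baseline_eer

# Post-evaluation phoneme system against the assumed 19.3% baseline.
phoneme_gain = relative_improvement(19.3, 12.7)

# Prosodic + TRAP-derived phoneme fusion against the 12.7% phoneme system.
fusion_gain = relative_improvement(12.7, 10.5)
```

Both values round to the percentages shown on the slides (34% and 17%), which supports the inferred pairings but does not confirm them.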

Conclusions
• Sequences of discrete symbols derived from speech dynamics provide useful information for characterizing the language
• Two techniques for deriving the sequences of symbols were investigated
  • Segmentation and labeling based on prosodic cues
  • Segmentation and labeling based on TRAP-derived phonetic labels
• The introduced techniques combine well with each other as well as with more conventional language ID techniques
