Visual Articulatory Feature Classification, and a Comparison of MLPs and SVMs


Presentation Transcript


  1. Visual Articulatory Feature Classification, and a Comparison of MLPs and SVMs Ozgur Cetin, International Computer Science Institute

  2. Goals
  • Assess the performance of visual features for articulatory feature classification
  • Compare the performance of multi-layer perceptrons (MLPs) with support vector machines (SVMs) for articulatory feature classification on audio

  3. Data
  • MIT webcam digits corpus
  • Around 53 minutes of data; 45 minutes are used for training and the remaining 8 minutes for testing
  • 21 unique phones
  • Audio observation vectors
    • MFCCs and their 1st- and 2nd-order derivatives (a front-end sketch follows this slide)
    • 41 dimensions, extracted every 10 msec
  • Visual observation vectors
    • DCT coefficients computed on a histogram-equalized, medium-sized ROI around the lips; upsampled to match the audio frame rate
    • Speech recognition experiments used the first 35 dimensions; here the first 41 dimensions are used for comparison with the audio features
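A hedged sketch of the audio front end (MFCCs plus 1st/2nd-order deltas at a 10 ms frame rate). librosa and the 13-coefficient order are assumptions for illustration only; the slide reports 41 dimensions, so the original cepstral order differs slightly.

```python
# Illustrative MFCC + delta + double-delta front end (not the original tool).
import numpy as np
import librosa

sr = 16000
wav = np.random.randn(3 * sr).astype(np.float32)   # stand-in for corpus audio
mfcc = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=13, hop_length=160)  # 10 ms hop
d1 = librosa.feature.delta(mfcc)
d2 = librosa.feature.delta(mfcc, order=2)
feats = np.vstack([mfcc, d1, d2]).T                # (n_frames, 39) feature matrix
```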

  4. Articulatory Feature Set 1
  • The feature set we decided to use for acoustic modeling (restricted to 21 phones); an illustrative phone-to-feature fragment follows this slide
  PL1: sil alv den lab lab-den none rho vel
  DG1: sil app clo fric vow
  DG2: sil vow
  NAS: sil - +
  ROU: sil - +
  GLO: sil vl voi
  VOW: sil ah ao ax ay1 ay2 eh ey1 ey2 ih iy uw nil
  HT: sil high low mid mid-h mid-l v-hi nil
  FRT: sil bk frt mid mid-f nil
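A hypothetical fragment of the phone-to-articulatory mapping that, together with phonetic forced alignments, yields the frame labels used later (slide 6). The actual table is not shown on the slides; the values below are only phonetically plausible guesses, and DG2 is omitted because its value inventory is unclear from the slide.

```python
# Illustrative phone-to-feature mapping for Feature Set 1 (values are guesses).
PHONE_TO_SET1 = {
    "iy": {"PL1": "none", "DG1": "vow", "NAS": "-", "ROU": "-",
           "GLO": "voi", "VOW": "iy", "HT": "v-hi", "FRT": "frt"},
    "n":  {"PL1": "alv", "DG1": "clo", "NAS": "+", "ROU": "-",
           "GLO": "voi", "VOW": "nil", "HT": "nil", "FRT": "nil"},
}
```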

  5. Articulatory Feature Set 2
  • Based on the vocal tract variables of articulatory phonology
  • These could be more meaningful to infer from video than Set 1
  LIP-LOC: pro, lab, den
  LIP-OPEN: clo, crit, nar, wide
  TT-LOC: den, alv, p-a, ret
  TT-OPEN: clo, crit, nar, m-n, mid, wide
  TB-LOC: pal, vel, uvu, phar
  TB-OPEN: clo, crit, nar, m-n, mid, wide
  VEL: clo, open
  GLOT: clo, crit, wide

  6. MLP-based Classification
  • Train a separate neural network for each articulatory variable (a minimal sketch follows this slide):
    • Input layer: a 9-frame window centered on the current frame (41 x 9 = 369 inputs)
    • Hidden layer: 100 hidden units (chosen so that the number of parameters is around 5% of the amount of data)
    • Output layer: 1-of-N mapping for each articulatory variable (targets obtained from phonetic forced alignments and phone-to-articulatory mappings)
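A minimal PyTorch sketch of the per-variable MLP described above. The original work used Quicknet, so the layer choices here (sigmoid hidden units, softmax folded into the loss) are assumptions, not the exact setup.

```python
import torch
import torch.nn as nn

N_DIMS, CONTEXT = 41, 9            # 41-dim frames, 9-frame window
N_INPUT = N_DIMS * CONTEXT         # 369 inputs, as on the slide

def make_mlp(n_classes: int) -> nn.Sequential:
    """One MLP per articulatory variable with a 1-of-N output."""
    return nn.Sequential(
        nn.Linear(N_INPUT, 100),   # 100 hidden units
        nn.Sigmoid(),
        nn.Linear(100, n_classes), # softmax is applied inside CrossEntropyLoss
    )

# e.g. GLO has three classes: sil, vl, voi
glo_mlp = make_mlp(3)
x = torch.randn(8, N_INPUT)        # a batch of 8 stacked context windows
loss = nn.CrossEntropyLoss()(glo_mlp(x), torch.randint(0, 3, (8,)))
loss.backward()                    # one gradient-descent step (cf. slide 7)
```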

  7. MLP-based Classification (2)
  • Training uses gradient descent
    • Performance monitored on a cross-validation set
    • Very fast (less than one minute for this data)
  • Quicknet and Pfile utilities
    • Mean/variance normalization (pfile_normutts)
    • Training (qnstrn)
    • Testing (qnsfwd)
    • Easy to use and available from www.icsi.berkeley.edu/~dpwe/projects/sprach/sprachcore.html

  8. SVM-based Classification
  • The same processing stages as the MLP-based classifier, except:
    • A 5-frame window at offsets (-4, -2, 0, 2, 4) around the current frame, to reduce dimensionality (see the stacking sketch after this slide)
    • Training is slow, so only 2500-5000 data points are used (less than 2% of all the available data)
    • Points are randomly selected so as to balance the different classes
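A small sketch of the strided context stacking described above; the function name and the edge-padding policy (repeating the first/last frame) are assumptions, not stated on the slide.

```python
import numpy as np

def stack_context(feats: np.ndarray, offsets=(-4, -2, 0, 2, 4)) -> np.ndarray:
    """Stack a strided 5-frame context window around each frame.

    feats: (n_frames, n_dims) matrix; edges are padded by clamping indices.
    """
    n, d = feats.shape
    out = np.empty((n, d * len(offsets)))
    for i, off in enumerate(offsets):
        idx = np.clip(np.arange(n) + off, 0, n - 1)
        out[:, i * d:(i + 1) * d] = feats[idx]
    return out

# 41-dim audio frames -> 5 x 41 = 205-dim SVM inputs
windows = stack_context(np.random.randn(1000, 41))
```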

  9. SVM-based Classification (2)
  • Training and testing use Joachims’ SVMmulticlass toolkit (an illustrative modern equivalent is sketched after this slide)
    • svm_multiclass_learn and svm_multiclass_classify
    • Based on the SVMlight quadratic optimizer
    • Very easy to use, with a small memory footprint
  • But it does not seem to scale well with the amount of training data, and it is slow (1-2 days for this data set)
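Not the SVMmulticlass toolkit itself: a scikit-learn sketch of the same configuration (multi-class SVM with an RBF kernel) on synthetic stand-in data, for readers who want to reproduce the setup without the original binaries.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.standard_normal((2500, 205))       # 2500 subsampled 5x41-dim windows
y = rng.integers(0, 4, size=2500)          # e.g. the four LIP-OPEN classes

clf = SVC(kernel="rbf", gamma=1.0, C=1.0)  # gamma and C swept as on slides 12-13
clf.fit(X, y)
print(clf.predict(X[:5]))
```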

  10. Results: Visual Articulatory Classification using Feature Set 1
  • Both the audio and visual classifiers are MLPs
  • The performance of the audio MLPs is very good
  • The visual MLPs perform better than the a-priori classifier for PL1, NAS, ROU, and GLO
  • Still, their performance lags far behind the audio MLPs

  11. Results: Visual Articulatory Classification using Feature Set 2
  • All classifiers are again MLPs
  • Apart from LIP-OPEN, the visual MLPs do not fare better than the a-priori classifier
  • The a-priori classifier is hard to beat because the class priors are dominated by silence

  12. Results: SVM- vs. MLP-based Classification using Feature Set 2 (γ = 1)
  • 1 minute of training data is too little to learn much from
  • The SVMs use an RBF kernel (γ = 1)
  • Several values of C and γ were tried
  • The SVMs do not seem to be competitive, based on these very limited experiments

  13. Results: SVM- vs. MLP-based Classification using Feature Set 2 (γ = 0.1)
  • With 1 minute of data (too little to learn much from), the SVMs seem to fare better than the MLPs
  • The SVMs are not competitive with the MLPs when all the data is used
  • γ = 0.1 is much better than γ = 1, i.e. more training examples contribute during testing (see the numerical sketch after this slide)
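A numerical illustration of why the smaller γ helps. With the RBF kernel k(x, z) = exp(-γ‖x - z‖²), shrinking γ widens the kernel, so more training examples carry non-negligible weight at test time. The squared distance below is a made-up but plausible magnitude for these high-dimensional inputs.

```python
import numpy as np

dist_sq = 25.0   # hypothetical squared distance between test and train points
for gamma in (1.0, 0.1):
    print(f"gamma={gamma}: k = {np.exp(-gamma * dist_sq):.2e}")
# gamma=1.0: k = 1.39e-11  (the training point is effectively invisible)
# gamma=0.1: k = 8.21e-02  (the training point still contributes)
```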

  14. Conclusions: Visual Articulatory Classification
  • Visual articulatory classification performance is not very good
    • Mostly close to or the same as the a-priori classifier
  • Exploit the dependency between articulatory variables
  • Try different visual features, or a different data set?
  • The ASR performance of the current visual features is not very good either (around 80% WER)

  15. Conclusions: SVM-based Classification
  • SVMs seem to be competitive for small amounts of data, but not compared with MLPs trained on all the data
  • The main problems I found were the training time and the amount of data the toolkit can handle
  • Alternatives:
    • Instead of frame-based classification, try sequence-based classification, e.g. classify each phone segment using a Fisher kernel (see the sketch after this slide)
    • This would significantly reduce the amount of data, and it is actually what we do in practice
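The slides do not give the exact Fisher-kernel formulation, so the following is the standard construction (Jaakkola and Haussler): a generative model with parameters θ (e.g. a per-phone HMM) maps a variable-length segment X to a fixed-length Fisher score, and segments are then compared with the kernel

```latex
U_X = \nabla_\theta \log P(X \mid \theta), \qquad
K(X, Y) = U_X^\top \, \mathcal{I}^{-1} \, U_Y
```

where 𝓘 is the Fisher information matrix (often approximated by the identity). Each phone segment then contributes a single SVM training example rather than one example per frame, which is why the amount of data drops significantly.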
