Visual Articulatory Feature Classification, and a Comparison of MLPs and SVMs


Presentation Transcript


  1. Visual Articulatory Feature Classification, and a Comparison of MLPs and SVMs Ozgur Cetin, International Computer Science Institute

  2. Goals
  • Assess the performance of visual features for articulatory feature classification
  • Compare the performance of multi-layer perceptrons (MLPs) with support vector machines (SVMs) for articulatory feature classification on audio

  3. Data
  • MIT webcam digits corpus
  • Around 53 minutes of data; 45 minutes are used for training and the remaining 8 minutes for testing
  • 21 unique phones
  • Audio observation vectors
    • MFCCs and their 1st- and 2nd-order derivatives (a front-end sketch follows this slide)
    • 41 dimensions, extracted every 10 msec
  • Visual observation vectors
    • DCT coefficients computed on a histogram-equalized, medium-sized ROI around the lips; upsampled to match the audio frame rate
    • Speech recognition experiments used the first 35 dimensions; here the first 41 dimensions are used for comparison with the audio features
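A hedged sketch of the audio front end (MFCCs plus 1st/2nd-order deltas at a 10 ms frame rate). librosa and the 13-coefficient order are assumptions for illustration only; the slide reports 41 dimensions, so the original cepstral order differs slightly.

```python
# Illustrative MFCC + delta + double-delta front end (not the original tool).
import numpy as np
import librosa

sr = 16000
wav = np.random.randn(3 * sr).astype(np.float32)   # stand-in for corpus audio
mfcc = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=13, hop_length=160)  # 10 ms hop
d1 = librosa.feature.delta(mfcc)
d2 = librosa.feature.delta(mfcc, order=2)
feats = np.vstack([mfcc, d1, d2]).T                # (n_frames, 39) feature matrix
```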

  4. Articulatory Feature Set 1
  • The feature set we decided to use for acoustic modeling (restricted to 21 phones); an illustrative phone-to-feature fragment follows this slide
  PL1: sil alv den lab lab-den none rho vel
  DG1: sil app clo fric vow
  DG2: sil vow
  NAS: sil - +
  ROU: sil - +
  GLO: sil vl voi
  VOW: sil ah ao ax ay1 ay2 eh ey1 ey2 ih iy uw nil
  HT: sil high low mid mid-h mid-l v-hi nil
  FRT: sil bk frt mid mid-f nil
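A hypothetical fragment of the phone-to-articulatory mapping that, together with phonetic forced alignments, yields the frame labels used later (slide 6). The actual table is not shown on the slides; the values below are only phonetically plausible guesses, and DG2 is omitted because its value inventory is unclear from the slide.

```python
# Illustrative phone-to-feature mapping for Feature Set 1 (values are guesses).
PHONE_TO_SET1 = {
    "iy": {"PL1": "none", "DG1": "vow", "NAS": "-", "ROU": "-",
           "GLO": "voi", "VOW": "iy", "HT": "v-hi", "FRT": "frt"},
    "n":  {"PL1": "alv", "DG1": "clo", "NAS": "+", "ROU": "-",
           "GLO": "voi", "VOW": "nil", "HT": "nil", "FRT": "nil"},
}
```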

  5. Articulatory Feature Set 2
  • Based on the vocal tract variables of articulatory phonology
  • These could be more meaningful to infer from video than Set 1
  LIP-LOC: pro, lab, den
  LIP-OPEN: clo, crit, nar, wide
  TT-LOC: den, alv, p-a, ret
  TT-OPEN: clo, crit, nar, m-n, mid, wide
  TB-LOC: pal, vel, uvu, phar
  TB-OPEN: clo, crit, nar, m-n, mid, wide
  VEL: clo, open
  GLOT: clo, crit, wide

  6. MLP-based Classification
  • Train a separate neural network for each articulatory variable (a minimal sketch follows this slide):
    • Input layer: a 9-frame window centered on the current frame (41 x 9 = 369 inputs)
    • Hidden layer: 100 hidden units (chosen so that the number of parameters is around 5% of the amount of data)
    • Output layer: 1-of-N mapping for each articulatory variable (targets obtained from phonetic forced alignments and phone-to-articulatory mappings)
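A minimal PyTorch sketch of the per-variable MLP described above. The original work used Quicknet, so the layer choices here (sigmoid hidden units, softmax folded into the loss) are assumptions, not the exact setup.

```python
import torch
import torch.nn as nn

N_DIMS, CONTEXT = 41, 9            # 41-dim frames, 9-frame window
N_INPUT = N_DIMS * CONTEXT         # 369 inputs, as on the slide

def make_mlp(n_classes: int) -> nn.Sequential:
    """One MLP per articulatory variable with a 1-of-N output."""
    return nn.Sequential(
        nn.Linear(N_INPUT, 100),   # 100 hidden units
        nn.Sigmoid(),
        nn.Linear(100, n_classes), # softmax is applied inside CrossEntropyLoss
    )

# e.g. GLO has three classes: sil, vl, voi
glo_mlp = make_mlp(3)
x = torch.randn(8, N_INPUT)        # a batch of 8 stacked context windows
loss = nn.CrossEntropyLoss()(glo_mlp(x), torch.randint(0, 3, (8,)))
loss.backward()                    # one gradient-descent step (cf. slide 7)
```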

  7. MLP-based Classification (2)
  • Training uses gradient descent
    • Performance monitored on a cross-validation set
    • Very fast (less than one minute for this data)
  • Quicknet and Pfile utilities
    • Mean/variance normalization (pfile_normutts)
    • Training (qnstrn)
    • Testing (qnsfwd)
    • Easy to use and available from www.icsi.berkeley.edu/~dpwe/projects/sprach/sprachcore.html

  8. SVM-based Classification
  • The same processing stages as the MLP-based classifier, except:
    • A 5-frame window at offsets (-4, -2, 0, 2, 4) around the current frame, to reduce dimensionality (see the stacking sketch after this slide)
    • Training is slow, so only 2500-5000 data points are used (less than 2% of all the available data)
    • Points are randomly selected so as to balance the different classes
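A small sketch of the strided context stacking described above; the function name and the edge-padding policy (repeating the first/last frame) are assumptions, not stated on the slide.

```python
import numpy as np

def stack_context(feats: np.ndarray, offsets=(-4, -2, 0, 2, 4)) -> np.ndarray:
    """Stack a strided 5-frame context window around each frame.

    feats: (n_frames, n_dims) matrix; edges are padded by clamping indices.
    """
    n, d = feats.shape
    out = np.empty((n, d * len(offsets)))
    for i, off in enumerate(offsets):
        idx = np.clip(np.arange(n) + off, 0, n - 1)
        out[:, i * d:(i + 1) * d] = feats[idx]
    return out

# 41-dim audio frames -> 5 x 41 = 205-dim SVM inputs
windows = stack_context(np.random.randn(1000, 41))
```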

  9. SVM-based Classification (2)
  • Training and testing use Joachims’ SVMmulticlass toolkit (an illustrative modern equivalent is sketched after this slide)
    • svm_multiclass_learn and svm_multiclass_classify
    • Based on the SVMlight quadratic optimizer
    • Very easy to use, with a small memory footprint
  • But it does not seem to scale well with the amount of training data, and it is slow (1-2 days for this data set)
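Not the SVMmulticlass toolkit itself: a scikit-learn sketch of the same configuration (multi-class SVM with an RBF kernel) on synthetic stand-in data, for readers who want to reproduce the setup without the original binaries.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.standard_normal((2500, 205))       # 2500 subsampled 5x41-dim windows
y = rng.integers(0, 4, size=2500)          # e.g. the four LIP-OPEN classes

clf = SVC(kernel="rbf", gamma=1.0, C=1.0)  # gamma and C swept as on slides 12-13
clf.fit(X, y)
print(clf.predict(X[:5]))
```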

  10. Results: Visual Articulatory Classification using Feature Set 1
  • Both the audio and visual classifiers are MLPs
  • The performance of the audio MLPs is very good
  • The visual MLPs perform better than the a-priori classifier for PL1, NAS, ROU, and GLO
  • Still, their performance lags far behind the audio MLPs

  11. Results: Visual Articulatory Classification using Feature Set 2
  • All classifiers are again MLPs
  • Apart from LIP-OPEN, the visual MLPs do not fare better than the a-priori classifier
  • The a-priori classifier is hard to beat because the class priors are dominated by silence

  12. Results: SVM- vs. MLP-based Classification using Feature Set 2 (γ = 1)
  • 1 minute of training data is too little to learn much from
  • The SVMs use an RBF kernel (γ = 1)
  • Several values of C and γ were tried
  • The SVMs do not seem to be competitive, based on these very limited experiments

  13. Results: SVM- vs. MLP-based Classification using Feature Set 2 (γ = 0.1)
  • With 1 minute of data (too little to learn much from), the SVMs seem to fare better than the MLPs
  • The SVMs are not competitive with the MLPs when all the data is used
  • γ = 0.1 is much better than γ = 1, i.e. more training examples contribute during testing (see the numerical sketch after this slide)
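A numerical illustration of why the smaller γ helps. With the RBF kernel k(x, z) = exp(-γ‖x - z‖²), shrinking γ widens the kernel, so more training examples carry non-negligible weight at test time. The squared distance below is a made-up but plausible magnitude for these high-dimensional inputs.

```python
import numpy as np

dist_sq = 25.0   # hypothetical squared distance between test and train points
for gamma in (1.0, 0.1):
    print(f"gamma={gamma}: k = {np.exp(-gamma * dist_sq):.2e}")
# gamma=1.0: k = 1.39e-11  (the training point is effectively invisible)
# gamma=0.1: k = 8.21e-02  (the training point still contributes)
```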

  14. Conclusions: Visual Articulatory Classification
  • Visual articulatory classification performance is not very good
    • Mostly close to or the same as the a-priori classifier
  • Exploit the dependency between articulatory variables
  • Try different visual features, or a different data set?
  • The ASR performance of the current visual features is not very good either (around 80% WER)

  15. Conclusions: SVM-based Classification
  • SVMs seem to be competitive for small amounts of data, but not compared with MLPs trained on all the data
  • The main problems I found were the training time and the amount of data the toolkit can handle
  • Alternatives:
    • Instead of frame-based classification, try sequence-based classification, e.g. classify each phone segment using a Fisher kernel (see the sketch after this slide)
    • This would significantly reduce the amount of data, and it is actually what we do in practice
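The slides do not give the exact Fisher-kernel formulation, so the following is the standard construction (Jaakkola and Haussler): a generative model with parameters θ (e.g. a per-phone HMM) maps a variable-length segment X to a fixed-length Fisher score, and segments are then compared with the kernel

```latex
U_X = \nabla_\theta \log P(X \mid \theta), \qquad
K(X, Y) = U_X^\top \, \mathcal{I}^{-1} \, U_Y
```

where 𝓘 is the Fisher information matrix (often approximated by the identity). Each phone segment then contributes a single SVM training example rather than one example per frame, which is why the amount of data drops significantly.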
