
Presentation Transcript


  1. Landmark-Based Speech Recognition: Spectrogram Reading, Support Vector Machines, Dynamic Bayesian Networks, and Phonology. Mark Hasegawa-Johnson, jhasegaw@uiuc.edu, University of Illinois at Urbana-Champaign, USA

  2. Lecture 5: Generalization Error; Support Vector Machines
   • Observation Vector → Summary Statistic; Principal Components Analysis (PCA)
   • Risk Minimization
   • If Posterior Probability is known: MAP is optimal
   • Example: Linear Discriminant Analysis (LDA)
   • When true Posterior is unknown: Generalization Error
   • VC Dimension, and bounds on Generalization Error
   • Lagrangian Optimization
   • Linear Support Vector Machines
   • The SVM Optimality Metric
   • Lagrangian Optimization of SVM Metric
   • Hyper-parameters & Over-training
   • Kernel-Based Support Vector Machines
   • Kernel-based classification & optimization formulas
   • Hyperparameters & Over-training
   • The Entire Regularization Path of the SVM
   • High-Dimensional Linear SVM
   • Text classification using indicator functions
   • Speech acoustic classification using redundant features

  3. What is an Observation? An observation can be:
   • A vector created by "vectorizing" many consecutive MFCC or mel-spectrum frames
   • A vector including MFCCs, formants, pitch, PLP, auditory model features, …
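A minimal Python sketch of the "vectorizing" idea: consecutive feature frames around a candidate landmark are stacked into one long observation vector. The frame matrix, the 39 coefficients per frame, and the context size are illustrative assumptions, not values taken from the lecture.

    # Stack consecutive feature frames into a single observation vector.
    import numpy as np

    def stack_frames(features, center, context=8):
        """Concatenate frames [center-context, ..., center+context] into one vector."""
        lo, hi = center - context, center + context + 1
        return features[lo:hi].reshape(-1)        # (2*context + 1) * n_coeffs dimensions

    mfcc = np.random.randn(200, 39)               # toy data: 200 frames of 39 MFCCs
    x = stack_frames(mfcc, center=100, context=8) # 17 frames -> 663-dimensional observation
    print(x.shape)                                # (663,)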

  4. Normalized Observations

  5. Plotting the Observations, Part I: Scatter Plots and Histograms

  6. Problem: Where is the Information in a 1000-Dimensional Vector?

  7. Statistics that Summarize a Training Corpus

  8. Summary Statistics: Matrix Notation (figure: training examples of y = -1 and of y = +1)

  9. Eigenvectors and Eigenvalues of R

  10. Plotting the Observations, Part 2: Principal Components Analysis

  11. What Does PCA Extract from the Spectrogram? Plot: "PCAGram"
   • 1024-dimensional principal component → 32×32 spectrogram, plotted as an image
   • 1st principal component (not shown) measures total energy of the spectrogram
   • 2nd principal component: E(after landmark) - E(before landmark)
   • 3rd principal component: E(at the landmark) - E(surrounding syllables)
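The PCA chain described on slides 7-11 can be illustrated in a few lines of numpy; the random 1024-dimensional "spectrograms" below are placeholders for real training tokens, and the 32×32 reshape mirrors the PCAGram plot described above.

    # Compute the correlation matrix R of the training observations, take its
    # eigenvectors, project the data, and reshape one component into a 32x32 image.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((500, 1024))         # 500 toy tokens, 32x32 "spectrogram" each
    X = X - X.mean(axis=0)                       # subtract the mean observation

    R = (X.T @ X) / len(X)                       # 1024 x 1024 correlation matrix
    eigvals, eigvecs = np.linalg.eigh(R)         # eigh because R is symmetric
    order = np.argsort(eigvals)[::-1]            # sort by decreasing eigenvalue
    eigvecs = eigvecs[:, order]

    pcagram_2nd = eigvecs[:, 1].reshape(32, 32)  # 2nd principal component as an image
    scores = X @ eigvecs[:, :3]                  # project tokens onto the top 3 components
    print(pcagram_2nd.shape, scores.shape)       # (32, 32) (500, 3)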

  12. Minimum-Risk Classifier Design

  13. True Risk, Empirical Risk, and Generalization

  14. When PDF is Known: Maximum A Posteriori (MAP) is Optimal

  15. Another Way to Write the MAP Classifier: Test the Sign of the Log Likelihood Ratio

  16. MAP Example: Gaussians with Equal Covariance

  17. Linear Discriminant Projection of the Data

  18. Other Linear Classifiers: Empirical Risk Minimization (Choose v, b to Minimize Remp(v,b))

  19. A Serious Problem: Over-Training (figures: the minimum-error projection of the training data, and the same projection applied to new test data)

  20. When the True PDF is Unknown: Upper Bounds on True Risk

  21. The VC Dimension of a Hyperplane Classifier

  22. Schematic Depiction: |w| Controls the Expressiveness of the Classifier (and a less expressive classifier is less prone to over-train)

  23. The SVM = An Optimality Criterion

  24. Lagrangian Optimization: Inequality Constraint. Consider minimizing f(v), subject to the constraint g(v) ≥ 0. Two solution types exist:
   • g(v*) = 0: the curve g(v) = 0 is tangent to the level curve f(v) = fmin at v = v*
   • g(v*) > 0: v* minimizes f(v) as though unconstrained
   (Diagram from Osborne, 2004: one panel in which the unconstrained minimum lies in the infeasible region g(v) < 0, and one in which it lies in the feasible region g(v) > 0.)
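The two solution types can be checked numerically. The sketch below uses scipy with a toy objective and the toy constraint g(v) = v1 + v2 - 1 ≥ 0, both chosen here for illustration rather than taken from the slides.

    # Case 1: unconstrained minimum is infeasible, so g(v*) = 0 (constraint active).
    # Case 2: unconstrained minimum is feasible, so g(v*) > 0 (constraint inactive).
    import numpy as np
    from scipy.optimize import minimize

    g = {'type': 'ineq', 'fun': lambda v: v[0] + v[1] - 1.0}   # feasible set: v1 + v2 >= 1

    f1 = lambda v: v @ v                                       # unconstrained minimum at the origin
    case1 = minimize(f1, x0=np.array([2.0, 2.0]), constraints=[g])
    print(case1.x)      # ~[0.5, 0.5]: solution lies on the boundary, g(v*) = 0

    f2 = lambda v: (v[0] - 2.0)**2 + (v[1] - 2.0)**2           # unconstrained minimum at (2, 2)
    case2 = minimize(f2, x0=np.array([0.0, 0.0]), constraints=[g])
    print(case2.x)      # ~[2.0, 2.0]: g(v*) = 3 > 0, multiplier is zero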

  25. Case 1: gm(v*)=0

  26. Case 2: gm(v*)>0

  27. Training an SVM

  28. Differentiate the Lagrangian

  29. … now Simplify the Lagrangian…

  30. … and impose Kuhn-Tucker…

  31. Three Types of Vectors (figure from Hastie et al., NIPS 2004): interior vectors (α = 0), margin support vectors (0 < α < C), and errors / partial errors (α = C)

  32. … and finally, Solve the SVM

  33. Quadratic Programming (figure: the (αi1, αi2) plane, with each coordinate bounded by C, and the unconstrained optimum αi*). αi2 is off the margin; truncate it to αi2 = 0. αi1 is still a margin candidate; solve for it again in iteration i+1.

  34. Linear SVM Example

  35. Linear SVM Example

  36. Choosing the Hyper-Parameter to Avoid Over-Training (Wang, presentation at CLSP workshop WS04). Figure: SVM test-corpus error vs. λ = 1/C, classification of nasal vs. non-nasal vowels.

  37. Choosing the Hyper-Parameter to Avoid Over-Training
   • Recall that v = Σm αm ym xm
   • Therefore, |v| < (C Σm |xm|)^(1/2) < (C M max|xm|)^(1/2)
   • Therefore, the width of the margin is constrained to 1/|v| > (C M max|xm|)^(-1/2), and therefore the SVM is not allowed to make the margin very small in its quest to fix individual errors
   • Recommended solution (see the sketch below):
     • Normalize xm so that max|xm| ≈ 1 (e.g., using libsvm)
     • Set C ≈ 1/M
     • If desired, adjust C up or down by a factor of 2, to see whether the error rate on independent development test data decreases
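A hedged sketch of this recipe, using scikit-learn (whose LinearSVC wraps liblinear); the random data, feature dimension, and grid of C values are placeholders.

    # Scale features so max|xm| is roughly 1, start from C = 1/M, and adjust C by
    # factors of 2 while watching error on a held-out development set.
    import numpy as np
    from sklearn.preprocessing import MaxAbsScaler
    from sklearn.svm import LinearSVC

    rng = np.random.default_rng(0)
    X_train, y_train = rng.standard_normal((400, 50)), rng.integers(0, 2, 400)
    X_dev, y_dev = rng.standard_normal((100, 50)), rng.integers(0, 2, 100)

    scaler = MaxAbsScaler().fit(X_train)                   # max|xm| ~ 1 after scaling
    X_train_s, X_dev_s = scaler.transform(X_train), scaler.transform(X_dev)

    M = len(X_train)
    best = None
    for C in [1/(4*M), 1/(2*M), 1/M, 2/M, 4/M]:            # C ~ 1/M, times powers of 2
        clf = LinearSVC(C=C).fit(X_train_s, y_train)
        dev_error = 1.0 - clf.score(X_dev_s, y_dev)
        if best is None or dev_error < best[0]:
            best = (dev_error, C)
    print("best C:", best[1], "dev error:", best[0])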

  38. From Linear to Nonlinear SVM

  39. Example: RBF Classifier

  40. An RBF Classification Boundary

  41. Two Hyperparameters → Choosing Hyperparameters is Much Harder (Hastie, Rosset, Tibshirani, and Zhu, NIPS 2004)

  42. Optimum Value of C Depends on γ (Hastie, Rosset, Tibshirani, and Zhu, NIPS 2004)

  43. SVM is a "Regularized Learner" (λ = 1/C)

  44. SVM Coefficients are a Piece-Wise Linear Function of λ = 1/C (Hastie, Rosset, Tibshirani, and Zhu, NIPS 2004)

  45. The Entire Regularization Path of the SVM: Algorithm (Hastie, Rosset, Tibshirani, and Zhu, NIPS 2004)
   • Start with λ large enough (C small enough) that all training tokens are partial errors (αm = C). Compute the solution of the quadratic programming problem in this case, including inversion of X^T X or X X^T.
   • Reduce λ (increase C) until the initial event occurs: two partial-error points enter the margin, i.e., in the QP problem, αm = C becomes the unconstrained solution rather than just the constrained solution. This is the first breakpoint. The slopes dαm/dλ change, but only for the two training vectors on the margin; all other training vectors continue to have αm = C. Calculate the new values of dαm/dλ for these two training vectors.
   • Iteratively find the next breakpoint (a rough numerical illustration follows below). The next breakpoint occurs when one of the following happens:
     • A value of αm that was on the margin leaves the margin, i.e., the piece-wise-linear function αm(λ) hits αm = 0 or αm = C.
     • One or more interior points enter the margin, i.e., in the QP problem, αm = 0 becomes the unconstrained solution rather than just the constrained solution.
     • One or more partial-error points enter the margin, i.e., in the QP problem, αm = C becomes the unconstrained solution rather than just the constrained solution.
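The exact path algorithm above is what Hastie's SVMPath package (R) implements; the Python sketch below only approximates it by re-solving the SVM on a grid of λ = 1/C values and recording each αm, which traces how the coefficients move between 0 and C as λ shrinks. Toy two-class data and scikit-learn are assumed.

    # Approximate the regularization path by sweeping lambda = 1/C on a log grid.
    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = np.vstack([rng.standard_normal((30, 2)) - 1, rng.standard_normal((30, 2)) + 1])
    y = np.array([-1] * 30 + [+1] * 30)

    lambdas = np.logspace(1, -3, 40)              # from strong to weak regularization
    alphas = np.zeros((len(lambdas), len(X)))
    for i, lam in enumerate(lambdas):
        clf = SVC(kernel="linear", C=1.0 / lam).fit(X, y)
        # dual_coef_ stores ym * alpham for the support vectors only
        alphas[i, clf.support_] = np.abs(clf.dual_coef_[0])

    # Column m of `alphas`, plotted against lambdas, shows alpham moving between
    # 0 (interior points) and C (partial errors), with breakpoints in between.
    print(alphas.shape)                           # (40, 60)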

  46. One Method for Using SVMPath (WS04, Johns Hopkins, 2004)
   • Download the SVMPath code from Trevor Hastie's web page.
   • Test several values of γ, including values within a few orders of magnitude of γ = 1/K.
   • For each candidate value of γ, use SVMPath to find the C-breakpoints. Choose a few dozen C-breakpoints for further testing, and write out the corresponding values of αm.
   • Test the SVMs on a separate development test database: for each combination (C, γ), find the development test error. Choose the combination that gives the least development test error. (A rough illustration of this search appears below.)
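SVMPath itself is R code; as a stand-in for the same development-set selection idea in Python, this hedged sketch scans a few γ values around 1/K together with a log grid of C values (substituting for the C-breakpoints) and keeps the pair with the lowest development-set error. The data and grids are toy assumptions.

    # Joint (C, gamma) selection on a development set with an RBF SVM.
    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(1)
    K = 40                                                  # feature dimension
    X_tr, y_tr = rng.standard_normal((300, K)), rng.integers(0, 2, 300)
    X_dev, y_dev = rng.standard_normal((100, K)), rng.integers(0, 2, 100)

    results = []
    for gamma in [0.1 / K, 1.0 / K, 10.0 / K]:              # a few decades around 1/K
        for C in np.logspace(-2, 2, 9):                     # stand-in for the C-breakpoints
            clf = SVC(kernel="rbf", C=C, gamma=gamma).fit(X_tr, y_tr)
            results.append((1.0 - clf.score(X_dev, y_dev), C, gamma))

    dev_error, C_best, gamma_best = min(results)
    print(f"best dev error {dev_error:.3f} at C={C_best:.3g}, gamma={gamma_best:.3g}")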

  47. Results, RBF SVM (Wang, WS04 student presentation, 2004). Figure: SVM test-corpus error vs. λ = 1/C, classification of nasal vs. non-nasal vowels.

  48. High-Dimensional Linear SVMs

  49. Motivation: "Project it Yourself"
   • The purpose of a nonlinear SVM:
     • f(x) contains higher-order polynomial terms in the elements of x.
     • By combining these higher-order polynomial terms, Σm ym αm K(x, xm) can create a more flexible boundary than Σm ym αm x^T xm can.
     • The flexibility of the boundary need not lead to generalization error: the regularization term λ|v|^2 keeps generalization error under control.
   • A different approach (sketched below):
     • Augment x with higher-order terms, up to a very large dimension. These terms can include:
       • Polynomial terms, e.g., xixj
       • N-gram terms, e.g., (xi at time t AND xj at time t)
       • Other features suggested by knowledge-based analysis of the problem
     • Then apply a linear SVM to the higher-dimensional problem
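A minimal sketch of the "project it yourself" approach: explicitly append second-order polynomial terms xixj to each observation, then train an ordinary linear SVM in the expanded space. The toy data, the degree-2 expansion, and the choice of C are illustrative assumptions.

    # Expand the features explicitly, then fit a linear SVM in the big space.
    import numpy as np
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.svm import LinearSVC

    rng = np.random.default_rng(0)
    X, y = rng.standard_normal((200, 10)), rng.integers(0, 2, 200)

    expand = PolynomialFeatures(degree=2, include_bias=False)  # adds xi and xi*xj terms
    X_big = expand.fit_transform(X)                            # 10 -> 65 dimensions
    print(X_big.shape)                                         # (200, 65)

    clf = LinearSVC(C=1.0 / len(X_big)).fit(X_big, y)          # linear SVM, C ~ 1/M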

  50. Example #1: Acoustic Classification of Stop Place of Articulation
   • Feature dimension: K = 483 per 10 ms
     • MFCCs + Δ + ΔΔ, 25 ms window: K = 39 per 10 ms
     • Spectral shape (energy, spectral tilt, and spectral compactness), once per millisecond: K = 40 per 10 ms
     • Noise-robust MUSIC-based formant frequencies, amplitudes, and bandwidths: K = 10 per 10 ms
     • Acoustic-phonetic parameters (formant-based relative spectral measures and time-domain measures): K = 42 per 10 ms
     • Rate-place model of neural response fields in the cat auditory cortex: K = 352 per 10 ms
   • Observation = concatenation of up to 17 frames, for a total of K = 17 × 483 = 8211 dimensions
   • Results: accuracy improves as you add more features, up to 7 frames (one per 10 ms; 3381-dimensional x). Adding more frames didn't help.
   • The RBF SVM still outperforms the linear SVM, but only by 1%
