
Presentation Transcript


  1. Landmark-Based Speech Recognition: Spectrogram Reading, Support Vector Machines, Dynamic Bayesian Networks, and Phonology. Mark Hasegawa-Johnson, jhasegaw@uiuc.edu, University of Illinois at Urbana-Champaign, USA

  2. Lecture 5: Generalization Error; Support Vector Machines
   • Observation Vector → Summary Statistic; Principal Components Analysis (PCA)
   • Risk Minimization
   • If Posterior Probability is known: MAP is optimal
   • Example: Linear Discriminant Analysis (LDA)
   • When true Posterior is unknown: Generalization Error
   • VC Dimension, and bounds on Generalization Error
   • Lagrangian Optimization
   • Linear Support Vector Machines
   • The SVM Optimality Metric
   • Lagrangian Optimization of SVM Metric
   • Hyper-parameters & Over-training
   • Kernel-Based Support Vector Machines
   • Kernel-based classification & optimization formulas
   • Hyperparameters & Over-training
   • The Entire Regularization Path of the SVM
   • High-Dimensional Linear SVM
   • Text classification using indicator functions
   • Speech acoustic classification using redundant features

  3. What is an Observation? An observation can be:
   • A vector created by "vectorizing" many consecutive MFCC or mel-spectrum frames
   • A vector including MFCCs, formants, pitch, PLP, auditory model features, …
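A minimal Python sketch of the "vectorizing" idea: consecutive feature frames around a candidate landmark are stacked into one long observation vector. The frame matrix, the 39 coefficients per frame, and the context size are illustrative assumptions, not values taken from the lecture.

    # Stack consecutive feature frames into a single observation vector.
    import numpy as np

    def stack_frames(features, center, context=8):
        """Concatenate frames [center-context, ..., center+context] into one vector."""
        lo, hi = center - context, center + context + 1
        return features[lo:hi].reshape(-1)        # (2*context + 1) * n_coeffs dimensions

    mfcc = np.random.randn(200, 39)               # toy data: 200 frames of 39 MFCCs
    x = stack_frames(mfcc, center=100, context=8) # 17 frames -> 663-dimensional observation
    print(x.shape)                                # (663,)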

  4. Normalized Observations

  5. Plotting the Observations, Part I: Scatter Plots and Histograms

  6. Problem: Where is the Information in a 1000-Dimensional Vector?

  7. Statistics that Summarize a Training Corpus

  8. Summary Statistics: Matrix Notation (figure: training examples of y = -1 and of y = +1)

  9. Eigenvectors and Eigenvalues of R

  10. Plotting the Observations, Part 2: Principal Components Analysis

  11. What Does PCA Extract from the Spectrogram? Plot: "PCAGram"
   • 1024-dimensional principal component → 32×32 spectrogram, plotted as an image
   • 1st principal component (not shown) measures total energy of the spectrogram
   • 2nd principal component: E(after landmark) - E(before landmark)
   • 3rd principal component: E(at the landmark) - E(surrounding syllables)
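The PCA chain described on slides 7-11 can be illustrated in a few lines of numpy; the random 1024-dimensional "spectrograms" below are placeholders for real training tokens, and the 32×32 reshape mirrors the PCAGram plot described above.

    # Compute the correlation matrix R of the training observations, take its
    # eigenvectors, project the data, and reshape one component into a 32x32 image.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((500, 1024))         # 500 toy tokens, 32x32 "spectrogram" each
    X = X - X.mean(axis=0)                       # subtract the mean observation

    R = (X.T @ X) / len(X)                       # 1024 x 1024 correlation matrix
    eigvals, eigvecs = np.linalg.eigh(R)         # eigh because R is symmetric
    order = np.argsort(eigvals)[::-1]            # sort by decreasing eigenvalue
    eigvecs = eigvecs[:, order]

    pcagram_2nd = eigvecs[:, 1].reshape(32, 32)  # 2nd principal component as an image
    scores = X @ eigvecs[:, :3]                  # project tokens onto the top 3 components
    print(pcagram_2nd.shape, scores.shape)       # (32, 32) (500, 3)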

  12. Minimum-Risk Classifier Design

  13. True Risk, Empirical Risk, and Generalization

  14. When PDF is Known: Maximum A Posteriori (MAP) is Optimal

  15. Another Way to Write the MAP Classifier: Test the Sign of the Log Likelihood Ratio

  16. MAP Example: Gaussians with Equal Covariance

  17. Linear Discriminant Projection of the Data

  18. Other Linear Classifiers: Empirical Risk Minimization (Choose v, b to Minimize Remp(v,b))

  19. A Serious Problem: Over-Training (figures: the minimum-error projection of the training data, and the same projection applied to new test data)

  20. When the True PDF is Unknown: Upper Bounds on True Risk

  21. The VC Dimension of a Hyperplane Classifier

  22. Schematic Depiction: |w| Controls the Expressiveness of the Classifier (and a less expressive classifier is less prone to over-train)

  23. The SVM = An Optimality Criterion

  24. Lagrangian Optimization: Inequality Constraint. Consider minimizing f(v), subject to the constraint g(v) ≥ 0. Two solution types exist:
   • g(v*) = 0: the curve g(v) = 0 is tangent to the level curve f(v) = fmin at v = v*
   • g(v*) > 0: v* minimizes f(v) as though unconstrained
   (Diagram from Osborne, 2004: one panel in which the unconstrained minimum lies in the infeasible region g(v) < 0, and one in which it lies in the feasible region g(v) > 0.)
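The two solution types can be checked numerically. The sketch below uses scipy with a toy objective and the toy constraint g(v) = v1 + v2 - 1 ≥ 0, both chosen here for illustration rather than taken from the slides.

    # Case 1: unconstrained minimum is infeasible, so g(v*) = 0 (constraint active).
    # Case 2: unconstrained minimum is feasible, so g(v*) > 0 (constraint inactive).
    import numpy as np
    from scipy.optimize import minimize

    g = {'type': 'ineq', 'fun': lambda v: v[0] + v[1] - 1.0}   # feasible set: v1 + v2 >= 1

    f1 = lambda v: v @ v                                       # unconstrained minimum at the origin
    case1 = minimize(f1, x0=np.array([2.0, 2.0]), constraints=[g])
    print(case1.x)      # ~[0.5, 0.5]: solution lies on the boundary, g(v*) = 0

    f2 = lambda v: (v[0] - 2.0)**2 + (v[1] - 2.0)**2           # unconstrained minimum at (2, 2)
    case2 = minimize(f2, x0=np.array([0.0, 0.0]), constraints=[g])
    print(case2.x)      # ~[2.0, 2.0]: g(v*) = 3 > 0, multiplier is zero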

  25. Case 1: gm(v*)=0

  26. Case 2: gm(v*)>0

  27. Training an SVM

  28. Differentiate the Lagrangian

  29. … now Simplify the Lagrangian…

  30. … and impose Kuhn-Tucker…

  31. Three Types of Vectors (figure from Hastie et al., NIPS 2004): interior vectors (α = 0), margin support vectors (0 < α < C), and errors / partial errors (α = C)

  32. … and finally, Solve the SVM

  33. Quadratic Programming (figure: the (αi1, αi2) plane, with each coordinate bounded by C, and the unconstrained optimum αi*). αi2 is off the margin; truncate it to αi2 = 0. αi1 is still a margin candidate; solve for it again in iteration i+1.

  34. Linear SVM Example

  35. Linear SVM Example

  36. Choosing the Hyper-Parameter to Avoid Over-Training (Wang, presentation at CLSP workshop WS04). Figure: SVM test-corpus error vs. λ = 1/C, classification of nasal vs. non-nasal vowels.

  37. Choosing the Hyper-Parameter to Avoid Over-Training
   • Recall that v = Σm αm ym xm
   • Therefore, |v| < (C Σm |xm|)^(1/2) < (C M max|xm|)^(1/2)
   • Therefore, the width of the margin is constrained to 1/|v| > (C M max|xm|)^(-1/2), and therefore the SVM is not allowed to make the margin very small in its quest to fix individual errors
   • Recommended solution (see the sketch below):
     • Normalize xm so that max|xm| ≈ 1 (e.g., using libsvm)
     • Set C ≈ 1/M
     • If desired, adjust C up or down by a factor of 2, to see whether the error rate on independent development test data decreases
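A hedged sketch of this recipe, using scikit-learn (whose LinearSVC wraps liblinear); the random data, feature dimension, and grid of C values are placeholders.

    # Scale features so max|xm| is roughly 1, start from C = 1/M, and adjust C by
    # factors of 2 while watching error on a held-out development set.
    import numpy as np
    from sklearn.preprocessing import MaxAbsScaler
    from sklearn.svm import LinearSVC

    rng = np.random.default_rng(0)
    X_train, y_train = rng.standard_normal((400, 50)), rng.integers(0, 2, 400)
    X_dev, y_dev = rng.standard_normal((100, 50)), rng.integers(0, 2, 100)

    scaler = MaxAbsScaler().fit(X_train)                   # max|xm| ~ 1 after scaling
    X_train_s, X_dev_s = scaler.transform(X_train), scaler.transform(X_dev)

    M = len(X_train)
    best = None
    for C in [1/(4*M), 1/(2*M), 1/M, 2/M, 4/M]:            # C ~ 1/M, times powers of 2
        clf = LinearSVC(C=C).fit(X_train_s, y_train)
        dev_error = 1.0 - clf.score(X_dev_s, y_dev)
        if best is None or dev_error < best[0]:
            best = (dev_error, C)
    print("best C:", best[1], "dev error:", best[0])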

  38. From Linear to Nonlinear SVM

  39. Example: RBF Classifier

  40. An RBF Classification Boundary

  41. Two Hyperparameters → Choosing Hyperparameters is Much Harder (Hastie, Rosset, Tibshirani, and Zhu, NIPS 2004)

  42. Optimum Value of C Depends on γ (Hastie, Rosset, Tibshirani, and Zhu, NIPS 2004)

  43. SVM is a "Regularized Learner" (λ = 1/C)

  44. SVM Coefficients are a Piece-Wise Linear Function of λ = 1/C (Hastie, Rosset, Tibshirani, and Zhu, NIPS 2004)

  45. The Entire Regularization Path of the SVM: Algorithm (Hastie, Rosset, Tibshirani, and Zhu, NIPS 2004)
   • Start with λ large enough (C small enough) that all training tokens are partial errors (αm = C). Compute the solution of the quadratic programming problem in this case, including inversion of X^T X or X X^T.
   • Reduce λ (increase C) until the initial event occurs: two partial-error points enter the margin, i.e., in the QP problem, αm = C becomes the unconstrained solution rather than just the constrained solution. This is the first breakpoint. The slopes dαm/dλ change, but only for the two training vectors on the margin; all other training vectors continue to have αm = C. Calculate the new values of dαm/dλ for these two training vectors.
   • Iteratively find the next breakpoint (a rough numerical illustration follows below). The next breakpoint occurs when one of the following happens:
     • A value of αm that was on the margin leaves the margin, i.e., the piece-wise-linear function αm(λ) hits αm = 0 or αm = C.
     • One or more interior points enter the margin, i.e., in the QP problem, αm = 0 becomes the unconstrained solution rather than just the constrained solution.
     • One or more partial-error points enter the margin, i.e., in the QP problem, αm = C becomes the unconstrained solution rather than just the constrained solution.
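The exact path algorithm above is what Hastie's SVMPath package (R) implements; the Python sketch below only approximates it by re-solving the SVM on a grid of λ = 1/C values and recording each αm, which traces how the coefficients move between 0 and C as λ shrinks. Toy two-class data and scikit-learn are assumed.

    # Approximate the regularization path by sweeping lambda = 1/C on a log grid.
    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = np.vstack([rng.standard_normal((30, 2)) - 1, rng.standard_normal((30, 2)) + 1])
    y = np.array([-1] * 30 + [+1] * 30)

    lambdas = np.logspace(1, -3, 40)              # from strong to weak regularization
    alphas = np.zeros((len(lambdas), len(X)))
    for i, lam in enumerate(lambdas):
        clf = SVC(kernel="linear", C=1.0 / lam).fit(X, y)
        # dual_coef_ stores ym * alpham for the support vectors only
        alphas[i, clf.support_] = np.abs(clf.dual_coef_[0])

    # Column m of `alphas`, plotted against lambdas, shows alpham moving between
    # 0 (interior points) and C (partial errors), with breakpoints in between.
    print(alphas.shape)                           # (40, 60)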

  46. One Method for Using SVMPath (WS04, Johns Hopkins, 2004)
   • Download the SVMPath code from Trevor Hastie's web page.
   • Test several values of γ, including values within a few orders of magnitude of γ = 1/K.
   • For each candidate value of γ, use SVMPath to find the C-breakpoints. Choose a few dozen C-breakpoints for further testing, and write out the corresponding values of αm.
   • Test the SVMs on a separate development test database: for each combination (C, γ), find the development test error. Choose the combination that gives the least development test error. (A rough illustration of this search appears below.)
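SVMPath itself is R code; as a stand-in for the same development-set selection idea in Python, this hedged sketch scans a few γ values around 1/K together with a log grid of C values (substituting for the C-breakpoints) and keeps the pair with the lowest development-set error. The data and grids are toy assumptions.

    # Joint (C, gamma) selection on a development set with an RBF SVM.
    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(1)
    K = 40                                                  # feature dimension
    X_tr, y_tr = rng.standard_normal((300, K)), rng.integers(0, 2, 300)
    X_dev, y_dev = rng.standard_normal((100, K)), rng.integers(0, 2, 100)

    results = []
    for gamma in [0.1 / K, 1.0 / K, 10.0 / K]:              # a few decades around 1/K
        for C in np.logspace(-2, 2, 9):                     # stand-in for the C-breakpoints
            clf = SVC(kernel="rbf", C=C, gamma=gamma).fit(X_tr, y_tr)
            results.append((1.0 - clf.score(X_dev, y_dev), C, gamma))

    dev_error, C_best, gamma_best = min(results)
    print(f"best dev error {dev_error:.3f} at C={C_best:.3g}, gamma={gamma_best:.3g}")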

  47. Results, RBF SVM (Wang, WS04 student presentation, 2004). Figure: SVM test-corpus error vs. λ = 1/C, classification of nasal vs. non-nasal vowels.

  48. High-Dimensional Linear SVMs

  49. Motivation: "Project it Yourself"
   • The purpose of a nonlinear SVM:
     • f(x) contains higher-order polynomial terms in the elements of x.
     • By combining these higher-order polynomial terms, Σm ym αm K(x, xm) can create a more flexible boundary than Σm ym αm x^T xm can.
     • The flexibility of the boundary need not lead to generalization error: the regularization term λ|v|^2 keeps generalization error under control.
   • A different approach (sketched below):
     • Augment x with higher-order terms, up to a very large dimension. These terms can include:
       • Polynomial terms, e.g., xixj
       • N-gram terms, e.g., (xi at time t AND xj at time t)
       • Other features suggested by knowledge-based analysis of the problem
     • Then apply a linear SVM to the higher-dimensional problem
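A minimal sketch of the "project it yourself" approach: explicitly append second-order polynomial terms xixj to each observation, then train an ordinary linear SVM in the expanded space. The toy data, the degree-2 expansion, and the choice of C are illustrative assumptions.

    # Expand the features explicitly, then fit a linear SVM in the big space.
    import numpy as np
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.svm import LinearSVC

    rng = np.random.default_rng(0)
    X, y = rng.standard_normal((200, 10)), rng.integers(0, 2, 200)

    expand = PolynomialFeatures(degree=2, include_bias=False)  # adds xi and xi*xj terms
    X_big = expand.fit_transform(X)                            # 10 -> 65 dimensions
    print(X_big.shape)                                         # (200, 65)

    clf = LinearSVC(C=1.0 / len(X_big)).fit(X_big, y)          # linear SVM, C ~ 1/M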

  50. Example #1: Acoustic Classification of Stop Place of Articulation
   • Feature dimension: K = 483 per 10 ms
     • MFCCs + Δ + ΔΔ, 25 ms window: K = 39 per 10 ms
     • Spectral shape (energy, spectral tilt, and spectral compactness), once per millisecond: K = 40 per 10 ms
     • Noise-robust MUSIC-based formant frequencies, amplitudes, and bandwidths: K = 10 per 10 ms
     • Acoustic-phonetic parameters (formant-based relative spectral measures and time-domain measures): K = 42 per 10 ms
     • Rate-place model of neural response fields in the cat auditory cortex: K = 352 per 10 ms
   • Observation = concatenation of up to 17 frames, for a total of K = 17 × 483 = 8211 dimensions
   • Results: accuracy improves as you add more features, up to 7 frames (one per 10 ms; 3381-dimensional x). Adding more frames didn't help.
   • The RBF SVM still outperforms the linear SVM, but only by 1%
