Maximum Entropy Discrimination for Model Averaging
Maximum Entropy Discrimination • Tommi Jaakkola (MIT), Marina Meila (CMU), Tony Jebara (MIT)
Classification • inputs x, class y ∈ {+1, −1} • data D = { (x1,y1), …, (xT,yT) } • learn fopt(x), a discriminant function from a family F = {f} of discriminants • classify y = sign fopt(x)
Model averaging • many f have near-optimal performance • instead of choosing fopt, average over all f in F • Q(f) = weight of f • y(x) = sign ∫F Q(f) f(x) df = sign ⟨f(x)⟩Q • to specify: F = {f}, the family of discriminant functions • to learn: Q(f), a distribution over F
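As a minimal sketch of the averaging rule above, sign ⟨f(x)⟩Q, here with a finite toy family of linear discriminants and hand-picked weights (the family and weights are illustrative, not produced by MED):

```python
import numpy as np

def classify_by_averaging(x, discriminants, weights):
    """y(x) = sign <f(x)>_Q for a finite family F = {f} with weights Q(f)."""
    avg = sum(w * f(x) for f, w in zip(discriminants, weights))
    return np.sign(avg)

# Toy family: three linear discriminants on scalar inputs (illustrative only).
F = [lambda x: x - 1.0, lambda x: x - 0.5, lambda x: x + 0.2]
Q = [0.2, 0.5, 0.3]  # weights sum to 1

print(classify_by_averaging(2.0, F, Q))   # -> 1.0
print(classify_by_averaging(-2.0, F, Q))  # -> -1.0
```

Note that the averaged rule can disagree with every individual sign fopt(x) near the decision boundary, which is exactly where averaging helps.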
Goal of this work • Define a discriminative criterion for averaging over models • Advantages: can incorporate priors • can use generative models • computationally feasible • generalizes to other discrimination tasks
Maximum Entropy Discrimination • given data set D = { (x1,y1), …, (xT,yT) } • find QME = argmaxQ H(Q) s.t. yt ⟨f(xt)⟩Q ≥ γ for all t = 1,…,T (C), and some γ > 0 • solution QME correctly classifies D • among all admissible Q, QME has maximum entropy • maximum entropy ⇒ least specific about f
Solution: QME as a projection • [figure: projection from the uniform distribution Q0 (λ = 0) onto the set of admissible Q, reaching QME at λME] • convex problem: QME is unique • solution QME(f) ∝ exp{ Σt=1..T λt yt f(xt) } • λt ≥ 0 are Lagrange multipliers • finding QME: start with λ = 0 and follow the gradient of the unsatisfied constraints
Finding the solution • needed: λt, t = 1,…,T • obtained by solving the dual problem: max J(λ) = max [ −log Z+ − log Z− − γ Σt λt ] s.t. λt ≥ 0 for t = 1,…,T • Algorithm: start with λt = 0 (uniform distribution); iterative ascent on J(λ) until convergence • derivative: ∂J/∂λt = yt ⟨f(xt)⟩Q − γ, with f(x) = log P+(x)/P−(x) + b
QME as a sparse solution • classification rule y(x) = sign ⟨f(x)⟩QME • γ is the classification margin • λt > 0 only for yt ⟨f(xt)⟩Q = γ, i.e. xt on the margin (a support vector!)
QME as regularization • [figure: distributions Q(f) over the family — the uniform Q0, the smoother QME, and the point mass at fopt] • uniform distribution Q0 ↔ λ = 0 • "smoothness" of Q = H(Q) • QME is the smoothest admissible distribution
Goal of this work • Define a discriminative criterion for averaging over models ✓ • Extensions: incorporate prior • relationship to support vectors • use generative models • generalize to other discrimination tasks
Priors • [figure: projection from the prior Q0, minimizing KL(Q‖Q0), onto the set of admissible Q, reaching QMRE] • prior Q0(f) • Minimum Relative Entropy Discrimination: QMRE = argminQ KL(Q‖Q0) s.t. yt ⟨f(xt)⟩Q ≥ γ for all t = 1,…,T (C) • prior on γ ⇒ learn QMRE(f, γ) ⇒ soft margin
Soft margins • average also over the margin γ • define Q0(f, γ) = Q0(f) Q0(γ) • constraints ⟨yt f(xt) − γ⟩Q(f,γ) ≥ 0 • learn QMRE(f, γ) = QMRE(f) QMRE(γ) • Q0(γ) = c exp[c(γ − 1)] • [figure: potential as a function of λ]
Examples: support vector machines • Theorem: For f(x) = θ·x + b, Q0(θ) = Normal(0, I), Q0(b) = non-informative prior, the Lagrange multipliers λ are obtained by maximizing J(λ) subject to 0 ≤ λt ≤ c and Σt λt yt = 0, where J(λ) = Σt [ λt + log(1 − λt/c) ] − ½ Σt,s λt λs yt ys xt·xs • separable D ⇒ SVM recovered exactly • inseparable D ⇒ SVM recovered with a different misclassification penalty • adaptive kernel SVM…
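The dual in the theorem can be maximized by simple projected gradient ascent. The sketch below is illustrative only: it drops the bias constraint Σt λt yt = 0 (i.e. assumes a zero-bias classifier) and uses a fixed step size, so it is not the tuned algorithm from the talk:

```python
import numpy as np

def med_dual_ascent(X, y, c=10.0, lr=0.01, iters=5000):
    """Projected gradient ascent on the SVM-like MED dual
    J(l) = sum_t [l_t + log(1 - l_t/c)] - 1/2 sum_{t,s} l_t l_s y_t y_s x_t.x_s.
    Sketch only: the constraint sum_t l_t y_t = 0 is dropped (zero bias)."""
    T = len(y)
    K = (y[:, None] * y[None, :]) * (X @ X.T)   # K_ts = y_t y_s x_t.x_s
    lam = np.zeros(T)                           # start at l = 0
    for _ in range(iters):
        # dJ/dl_t = 1 - (1/c)/(1 - l_t/c) - (K l)_t
        grad = 1.0 - (1.0 / c) / (1.0 - lam / c) - K @ lam
        lam = np.clip(lam + lr * grad, 0.0, 0.999 * c)  # keep l in [0, c)
    return lam

# Toy separable data on the line; classify by sign(theta.x), theta = sum l_t y_t x_t.
X = np.array([[2.0], [1.5], [-1.0], [-2.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
lam = med_dual_ascent(X, y)
theta = (lam * y) @ X
print(np.sign(X @ theta))  # -> [ 1.  1. -1. -1.]
```

The log(1 − λt/c) barrier is what distinguishes this dual from the standard SVM box constraint: it penalizes λt smoothly as it approaches c, which is the "different misclassification penalty" noted above.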
SVM extensions • Example: Leptograpsus Crabs (5 inputs, Ttrain = 80, Ttest = 120) • f(x) = log P+(x)/P−(x) + b with P±(x) = Normal(x; m±, V±) ⇒ quadratic classifier • Q(V+, V−) = distribution of kernel width • [figure: results for Linear SVM, Max Likelihood Gaussian, and MRE Gaussian]
Using generative models • generative models P+(x), P−(x) for y = +1, −1 • f(x) = log P+(x)/P−(x) + b • learn QMRE(P+, P−, b, γ) • if Q0(P+, P−, b, γ) = Q0(P+) Q0(P−) Q0(b) Q0(γ), then QMRE(P+, P−, b, γ) = QME(P+) QME(P−) QMRE(b) QMRE(γ) (factored prior ⇒ factored posterior)
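The discriminant f(x) = log P+(x)/P−(x) + b is easy to sketch for one-dimensional Gaussian class models; the means and variances below are illustrative stand-ins, not parameters fitted by MED:

```python
import numpy as np

def log_likelihood_ratio(x, mean_pos, var_pos, mean_neg, var_neg, b=0.0):
    """Discriminant f(x) = log P+(x) - log P-(x) + b for 1-D Gaussian
    class models (parameters here are toy values, not learned)."""
    def log_gauss(x, m, v):
        return -0.5 * np.log(2 * np.pi * v) - 0.5 * (x - m) ** 2 / v
    return log_gauss(x, mean_pos, var_pos) - log_gauss(x, mean_neg, var_neg) + b

# A point near the positive class mean gets a positive score.
print(np.sign(log_likelihood_ratio(1.8, 2.0, 1.0, -2.0, 1.0)))  # -> 1.0
```

With equal variances this f(x) is linear in x; letting the variances differ gives the quadratic classifier used in the Crabs example above.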
Examples: other distributions • multinomial (1 discrete variable) ✓ • graphical model (fixed structure, no hidden variables) ✓ • tree graphical model (Q over structures and parameters) ✓
Tree graphical models • P(x | E, θ) = P0(x) Π(u,v)∈E Puv(xu, xv | θuv) • prior Q0(P) = Q0(E) Q0(θ|E), a conjugate prior over E and θ • Q0(E) ∝ Πuv auv • Q0(θ|E) = conjugate prior • QMRE(P) = W0 Πuv Wuv, which can be integrated analytically • [figure: a tree (E, θ) and the distribution Q0(P) over trees]
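One common way to realize the edge-factored tree likelihood above is with node marginals times pairwise ratios; the sketch below uses that form with toy probability tables (illustrative, not learned, and the factorization details may differ from the talk's):

```python
import numpy as np

def tree_log_likelihood(x, node_logp, edge_logratio, edges):
    """log P(x | E, theta) for a fixed tree E, factored as
    P(x) = prod_v P_v(x_v) * prod_{(u,v) in E} P_uv(x_u,x_v)/(P_u(x_u) P_v(x_v)).
    All tables are illustrative toy numbers."""
    ll = sum(node_logp[v][x[v]] for v in node_logp)
    ll += sum(edge_logratio[(u, v)][x[u], x[v]] for (u, v) in edges)
    return ll

# Two binary variables joined by one edge.
p0 = np.array([0.5, 0.5])
p1 = np.array([0.5, 0.5])
joint = np.array([[0.4, 0.1], [0.1, 0.4]])   # P_01(x0, x1)
node_logp = {0: np.log(p0), 1: np.log(p1)}
edge_logratio = {(0, 1): np.log(joint) - np.log(np.outer(p0, p1))}

print(np.exp(tree_log_likelihood((0, 0), node_logp, edge_logratio, [(0, 1)])))  # -> 0.4
```

Because the likelihood is a product over edges, a prior that factors over edges (the auv weights above) keeps the posterior W0 Πuv Wuv tractable, which is what makes the analytic integration possible.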
Trees: experiments • splice junction classification task • 25 inputs, 400 training examples • compared with Max Likelihood trees • [figure: test error — ML 14%, MaxEnt 12.3%]
Trees: experiments (contd) • [figure: tree edges' weights]
Discrimination tasks • classification • classification with partially labeled data • anomaly detection • [figure: scatter plots of labeled (+/−) and unlabeled (x) points for each task]
Partially labeled data • Problem: given F, a family of discriminants, and data set D = { (x1,y1), …, (xT,yT), xT+1, …, xN } • find Q(f, γ, y) = argminQ KL(Q‖Q0) s.t. ⟨yt f(xt) − γ⟩Q ≥ 0 for all t = 1,…,T (C)
Partially labeled data: experiment • splice junction classification • 25 inputs • Ttotal = 1000 • [figure: results for complete data, 10% labeled + 90% unlabeled, and 10% labeled only]
Anomaly detection • Problem: given P = {P}, a family of generative models, and data set D = { x1, …, xT } • find Q(P, γ) = argminQ KL(Q‖Q0) s.t. ⟨log P(xt) − γ⟩Q ≥ 0 for all t = 1,…,T (C)
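At test time, the constraint structure above suggests flagging a point as anomalous when its averaged log-likelihood falls below the margin γ. A minimal sketch, using a single Gaussian as a stand-in for the averaged ⟨log P(x)⟩Q and a hand-picked γ:

```python
import numpy as np

def is_anomaly(x, mean, var, gamma):
    """Flag x as anomalous when its log-likelihood under a 1-D Gaussian
    (stand-in for the averaged <log P(x)>_Q) falls below the margin gamma.
    mean, var, and gamma here are illustrative, not learned by MED."""
    logp = -0.5 * np.log(2 * np.pi * var) - 0.5 * (x - mean) ** 2 / var
    return logp < gamma

print(is_anomaly(0.2, 0.0, 1.0, -4.0))  # near the mean -> False
print(is_anomaly(3.5, 0.0, 1.0, -4.0))  # in the tail   -> True
```

The MED twist relative to plain thresholded maximum likelihood is that both the model P and the threshold γ are averaged under Q, pulling the decision boundary toward the training data.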
Anomaly detection: experiments • [figures: Max Likelihood vs. MaxEnt models on the same data]
Conclusions • New framework for classification • Based on regularization in the space of distributions • Enables use of generative models • Enables use of priors • Generalizes to other discrimination tasks