Maximum Entropy Discrimination for Model Averaging
Maximum Entropy Discrimination • Tommi Jaakkola (MIT), Marina Meila (CMU), Tony Jebara (MIT)
Classification • inputs x, class y ∈ {+1, −1} • data D = { (x1,y1), …, (xT,yT) } • learn fopt(x), a discriminant function from a family F = {f} of discriminants • classify y = sign fopt(x)
Model averaging • many f have near-optimal performance • instead of choosing fopt, average over all f in F • Q(f) = weight of f • y(x) = sign ∫F Q(f) f(x) df = sign ⟨f(x)⟩Q • to specify: F = {f}, the family of discriminant functions • to learn: Q(f), a distribution over F
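As a minimal sketch of the averaging rule above, sign ⟨f(x)⟩Q, here with a finite toy family of linear discriminants and hand-picked weights (the family and weights are illustrative, not produced by MED):

```python
import numpy as np

def classify_by_averaging(x, discriminants, weights):
    """y(x) = sign <f(x)>_Q for a finite family F = {f} with weights Q(f)."""
    avg = sum(w * f(x) for f, w in zip(discriminants, weights))
    return np.sign(avg)

# Toy family: three linear discriminants on scalar inputs (illustrative only).
F = [lambda x: x - 1.0, lambda x: x - 0.5, lambda x: x + 0.2]
Q = [0.2, 0.5, 0.3]  # weights sum to 1

print(classify_by_averaging(2.0, F, Q))   # -> 1.0
print(classify_by_averaging(-2.0, F, Q))  # -> -1.0
```

Note that the averaged rule can disagree with every individual sign fopt(x) near the decision boundary, which is exactly where averaging helps.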
Goal of this work • Define a discriminative criterion for averaging over models • Advantages: can incorporate priors • can use generative models • computationally feasible • generalizes to other discrimination tasks
Maximum Entropy Discrimination • given data set D = { (x1,y1), …, (xT,yT) } • find QME = argmaxQ H(Q) s.t. yt ⟨f(xt)⟩Q ≥ γ for all t = 1,…,T (C), and some γ > 0 • solution QME correctly classifies D • among all admissible Q, QME has maximum entropy • maximum entropy ⇒ least specific about f
Solution: QME as a projection • [figure: projection from the uniform distribution Q0 (λ = 0) onto the set of admissible Q, reaching QME at λME] • convex problem: QME is unique • solution QME(f) ∝ exp{ Σt=1..T λt yt f(xt) } • λt ≥ 0 are Lagrange multipliers • finding QME: start with λ = 0 and follow the gradient of the unsatisfied constraints
Finding the solution • needed: λt, t = 1,…,T • obtained by solving the dual problem: max J(λ) = max [ −log Z+ − log Z− − γ Σt λt ] s.t. λt ≥ 0 for t = 1,…,T • Algorithm: start with λt = 0 (uniform distribution); iterative ascent on J(λ) until convergence • derivative: ∂J/∂λt = yt ⟨f(xt)⟩Q − γ, with f(x) = log P+(x)/P−(x) + b
QME as a sparse solution • classification rule y(x) = sign ⟨f(x)⟩QME • γ is the classification margin • λt > 0 only for yt ⟨f(xt)⟩Q = γ, i.e. xt on the margin (a support vector!)
QME as regularization • [figure: distributions Q(f) over the family — the uniform Q0, the smoother QME, and the point mass at fopt] • uniform distribution Q0 ↔ λ = 0 • "smoothness" of Q = H(Q) • QME is the smoothest admissible distribution
Goal of this work • Define a discriminative criterion for averaging over models ✓ • Extensions: incorporate prior • relationship to support vectors • use generative models • generalize to other discrimination tasks
Priors • [figure: projection from the prior Q0, minimizing KL(Q‖Q0), onto the set of admissible Q, reaching QMRE] • prior Q0(f) • Minimum Relative Entropy Discrimination: QMRE = argminQ KL(Q‖Q0) s.t. yt ⟨f(xt)⟩Q ≥ γ for all t = 1,…,T (C) • prior on γ ⇒ learn QMRE(f, γ) ⇒ soft margin
Soft margins • average also over the margin γ • define Q0(f, γ) = Q0(f) Q0(γ) • constraints ⟨yt f(xt) − γ⟩Q(f,γ) ≥ 0 • learn QMRE(f, γ) = QMRE(f) QMRE(γ) • Q0(γ) = c exp[c(γ − 1)] • [figure: potential as a function of λ]
Examples: support vector machines • Theorem: For f(x) = θ·x + b, Q0(θ) = Normal(0, I), Q0(b) = non-informative prior, the Lagrange multipliers λ are obtained by maximizing J(λ) subject to 0 ≤ λt ≤ c and Σt λt yt = 0, where J(λ) = Σt [ λt + log(1 − λt/c) ] − ½ Σt,s λt λs yt ys xt·xs • separable D ⇒ SVM recovered exactly • inseparable D ⇒ SVM recovered with a different misclassification penalty • adaptive kernel SVM…
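The dual in the theorem can be maximized by simple projected gradient ascent. The sketch below is illustrative only: it drops the bias constraint Σt λt yt = 0 (i.e. assumes a zero-bias classifier) and uses a fixed step size, so it is not the tuned algorithm from the talk:

```python
import numpy as np

def med_dual_ascent(X, y, c=10.0, lr=0.01, iters=5000):
    """Projected gradient ascent on the SVM-like MED dual
    J(l) = sum_t [l_t + log(1 - l_t/c)] - 1/2 sum_{t,s} l_t l_s y_t y_s x_t.x_s.
    Sketch only: the constraint sum_t l_t y_t = 0 is dropped (zero bias)."""
    T = len(y)
    K = (y[:, None] * y[None, :]) * (X @ X.T)   # K_ts = y_t y_s x_t.x_s
    lam = np.zeros(T)                           # start at l = 0
    for _ in range(iters):
        # dJ/dl_t = 1 - (1/c)/(1 - l_t/c) - (K l)_t
        grad = 1.0 - (1.0 / c) / (1.0 - lam / c) - K @ lam
        lam = np.clip(lam + lr * grad, 0.0, 0.999 * c)  # keep l in [0, c)
    return lam

# Toy separable data on the line; classify by sign(theta.x), theta = sum l_t y_t x_t.
X = np.array([[2.0], [1.5], [-1.0], [-2.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
lam = med_dual_ascent(X, y)
theta = (lam * y) @ X
print(np.sign(X @ theta))  # -> [ 1.  1. -1. -1.]
```

The log(1 − λt/c) barrier is what distinguishes this dual from the standard SVM box constraint: it penalizes λt smoothly as it approaches c, which is the "different misclassification penalty" noted above.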
SVM extensions • Example: Leptograpsus Crabs (5 inputs, Ttrain = 80, Ttest = 120) • f(x) = log P+(x)/P−(x) + b with P±(x) = Normal(x; m±, V±) ⇒ quadratic classifier • Q(V+, V−) = distribution of kernel width • [figure: results for Linear SVM, Max Likelihood Gaussian, and MRE Gaussian]
Using generative models • generative models P+(x), P−(x) for y = +1, −1 • f(x) = log P+(x)/P−(x) + b • learn QMRE(P+, P−, b, γ) • if Q0(P+, P−, b, γ) = Q0(P+) Q0(P−) Q0(b) Q0(γ), then QMRE(P+, P−, b, γ) = QME(P+) QME(P−) QMRE(b) QMRE(γ) (factored prior ⇒ factored posterior)
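The discriminant f(x) = log P+(x)/P−(x) + b is easy to sketch for one-dimensional Gaussian class models; the means and variances below are illustrative stand-ins, not parameters fitted by MED:

```python
import numpy as np

def log_likelihood_ratio(x, mean_pos, var_pos, mean_neg, var_neg, b=0.0):
    """Discriminant f(x) = log P+(x) - log P-(x) + b for 1-D Gaussian
    class models (parameters here are toy values, not learned)."""
    def log_gauss(x, m, v):
        return -0.5 * np.log(2 * np.pi * v) - 0.5 * (x - m) ** 2 / v
    return log_gauss(x, mean_pos, var_pos) - log_gauss(x, mean_neg, var_neg) + b

# A point near the positive class mean gets a positive score.
print(np.sign(log_likelihood_ratio(1.8, 2.0, 1.0, -2.0, 1.0)))  # -> 1.0
```

With equal variances this f(x) is linear in x; letting the variances differ gives the quadratic classifier used in the Crabs example above.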
Examples: other distributions • multinomial (1 discrete variable) ✓ • graphical model (fixed structure, no hidden variables) ✓ • tree graphical model (Q over structures and parameters) ✓
Tree graphical models • P(x | E, θ) = P0(x) Π(u,v)∈E Puv(xu, xv | θuv) • prior Q0(P) = Q0(E) Q0(θ|E), a conjugate prior over E and θ • Q0(E) ∝ Πuv auv • Q0(θ|E) = conjugate prior • QMRE(P) = W0 Πuv Wuv, which can be integrated analytically • [figure: a tree (E, θ) and the distribution Q0(P) over trees]
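One common way to realize the edge-factored tree likelihood above is with node marginals times pairwise ratios; the sketch below uses that form with toy probability tables (illustrative, not learned, and the factorization details may differ from the talk's):

```python
import numpy as np

def tree_log_likelihood(x, node_logp, edge_logratio, edges):
    """log P(x | E, theta) for a fixed tree E, factored as
    P(x) = prod_v P_v(x_v) * prod_{(u,v) in E} P_uv(x_u,x_v)/(P_u(x_u) P_v(x_v)).
    All tables are illustrative toy numbers."""
    ll = sum(node_logp[v][x[v]] for v in node_logp)
    ll += sum(edge_logratio[(u, v)][x[u], x[v]] for (u, v) in edges)
    return ll

# Two binary variables joined by one edge.
p0 = np.array([0.5, 0.5])
p1 = np.array([0.5, 0.5])
joint = np.array([[0.4, 0.1], [0.1, 0.4]])   # P_01(x0, x1)
node_logp = {0: np.log(p0), 1: np.log(p1)}
edge_logratio = {(0, 1): np.log(joint) - np.log(np.outer(p0, p1))}

print(np.exp(tree_log_likelihood((0, 0), node_logp, edge_logratio, [(0, 1)])))  # -> 0.4
```

Because the likelihood is a product over edges, a prior that factors over edges (the auv weights above) keeps the posterior W0 Πuv Wuv tractable, which is what makes the analytic integration possible.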
Trees: experiments • splice junction classification task • 25 inputs, 400 training examples • compared with Max Likelihood trees • [figure: test error — ML 14%, MaxEnt 12.3%]
Trees: experiments (contd) • [figure: tree edges' weights]
Discrimination tasks • classification • classification with partially labeled data • anomaly detection • [figure: scatter plots of labeled (+/−) and unlabeled (x) points for each task]
Partially labeled data • Problem: given F, a family of discriminants, and data set D = { (x1,y1), …, (xT,yT), xT+1, …, xN } • find Q(f, γ, y) = argminQ KL(Q‖Q0) s.t. ⟨yt f(xt) − γ⟩Q ≥ 0 for all t = 1,…,T (C)
Partially labeled data: experiment • splice junction classification • 25 inputs • Ttotal = 1000 • [figure: results for complete data, 10% labeled + 90% unlabeled, and 10% labeled only]
Anomaly detection • Problem: given P = {P}, a family of generative models, and data set D = { x1, …, xT } • find Q(P, γ) = argminQ KL(Q‖Q0) s.t. ⟨log P(xt) − γ⟩Q ≥ 0 for all t = 1,…,T (C)
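At test time, the constraint structure above suggests flagging a point as anomalous when its averaged log-likelihood falls below the margin γ. A minimal sketch, using a single Gaussian as a stand-in for the averaged ⟨log P(x)⟩Q and a hand-picked γ:

```python
import numpy as np

def is_anomaly(x, mean, var, gamma):
    """Flag x as anomalous when its log-likelihood under a 1-D Gaussian
    (stand-in for the averaged <log P(x)>_Q) falls below the margin gamma.
    mean, var, and gamma here are illustrative, not learned by MED."""
    logp = -0.5 * np.log(2 * np.pi * var) - 0.5 * (x - mean) ** 2 / var
    return logp < gamma

print(is_anomaly(0.2, 0.0, 1.0, -4.0))  # near the mean -> False
print(is_anomaly(3.5, 0.0, 1.0, -4.0))  # in the tail   -> True
```

The MED twist relative to plain thresholded maximum likelihood is that both the model P and the threshold γ are averaged under Q, pulling the decision boundary toward the training data.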
Anomaly detection: experiments • [figures: Max Likelihood vs. MaxEnt models on the same data]
Conclusions • New framework for classification • Based on regularization in the space of distributions • Enables use of generative models • Enables use of priors • Generalizes to other discrimination tasks