This presentation analyzes the dynamics and performance of Learning Vector Quantization (LVQ) algorithms, with emphasis on the generalization ability achieved after learning from examples. The study explores the dynamics of the learning process and the convergence properties of LVQ algorithms in a model situation.
Dynamical analysis of LVQ type learning rules
Barbara Hammer (Clausthal University of Technology, Institute of Computing Science)
Michael Biehl, Anarta Ghosh (Rijksuniversiteit Groningen, Mathematics and Computing Science)
http://www.cs.rug.nl/~biehl | m.biehl@rug.nl
Learning Vector Quantization (LVQ)
- identification of prototype vectors from labelled example data
- parameterization of distance-based classification schemes
classification: assignment of a vector to the class of the closest prototype w
aim: generalization ability, i.e. classification of novel data after learning from examples
often: heuristically motivated variations of competitive learning
example: basic LVQ scheme [Kohonen], “LVQ 1”:
• initialize prototype vectors for the different classes
• present a single example
• identify the closest prototype, i.e. the so-called winner
• move the winner closer towards the data (same class) or away from the data (different class)
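A minimal sketch of this winner-takes-all step, assuming a NumPy setting with one prototype matrix and one label per prototype (function and variable names are illustrative, not from the original):

```python
import numpy as np

def lvq1_update(prototypes, proto_labels, xi, xi_label, eta):
    """One LVQ1 step: move the winner towards xi if the labels agree,
    away from xi otherwise."""
    dists = np.linalg.norm(prototypes - xi, axis=1)   # distances to all prototypes
    winner = np.argmin(dists)                         # closest prototype = winner
    sign = 1.0 if proto_labels[winner] == xi_label else -1.0
    prototypes[winner] += eta * sign * (xi - prototypes[winner])
    return prototypes
```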
LVQ algorithms ...
• frequently applied in a variety of practical problems
• plausible, intuitive, flexible
• fast, easy to implement
but: limited theoretical understanding of
- dynamics and convergence properties
- achievable generalization ability
often based on heuristic arguments or cost functions with unclear relation to generalization
here: analysis of LVQ algorithms w.r.t.
- dynamics of the learning process
- performance, i.e. generalization ability
- typical properties in a model situation
Model situation: two clusters of N-dimensional data
random vectors ξ ∈ ℝN drawn according to a mixture of two Gaussians:
P(ξ) = Σσ=±1 pσ P(ξ|σ), with independent components of variance vσ in each cluster
orthonormal center vectors: B+, B- ∈ ℝN, (B±)² = 1, B+·B- = 0
cluster centers ℓ B+, ℓ B-, i.e. separation ∝ ℓ
prior weights of the classes: p+, p- with p+ + p- = 1
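A sampler for this model, as a sketch: cluster centers at ℓ B± with B± taken as the first two basis vectors (any orthonormal pair is equivalent; names and the RNG seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_cluster_data(n, N, ell, p_plus, v_plus, v_minus):
    """Draw n vectors xi in R^N from the two-cluster Gaussian mixture."""
    B = np.zeros((2, N))
    B[0, 0] = 1.0                                    # B+ (first basis vector)
    B[1, 1] = 1.0                                    # B- (second basis vector)
    sigma = (rng.random(n) >= p_plus).astype(int)    # 0 -> class +, 1 -> class -
    var = np.where(sigma == 0, v_plus, v_minus)      # cluster-dependent variance
    xi = ell * B[sigma] + rng.standard_normal((n, N)) * np.sqrt(var)[:, None]
    labels = np.where(sigma == 0, 1, -1)
    return xi, labels
```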
Dynamics of on-line training
sequence of new, independent random examples ξμ drawn according to P(ξ)
update of the two prototype vectors w+, w-:
wsμ = wsμ-1 + (η/N) fs(...) (ξμ - wsμ-1)
η: learning rate, step size
fs: modulation function, encodes competition, direction of update etc.
(ξμ - wsμ-1): change of the prototype towards or away from the current data
example: LVQ1, original formulation [Kohonen], a Winner-Takes-All (WTA) algorithm
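One on-line pass with this update for LVQ1, as a sketch; note the 1/N scaling of the step size, as in the update rule above (names are illustrative):

```python
import numpy as np

def online_lvq1(examples, example_labels, w, w_labels, eta):
    """Sequential LVQ1 training: each fresh example is used once, in order."""
    N = w.shape[1]
    for xi, lab in zip(examples, example_labels):
        winner = np.argmin(np.linalg.norm(w - xi, axis=1))
        f = 1.0 if w_labels[winner] == lab else -1.0  # modulation: +/-1 for the winner
        w[winner] += (eta / N) * f * (xi - w[winner])
    return w
```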
Mathematical analysis of the learning dynamics
1. description in terms of a few characteristic quantities (here: ℝ2N → ℝ7)
RSσ = wS · Bσ : projections into the (B+, B-)-plane
QST = wS · wT : lengths and relative position of the prototypes
in the recursions, the random vector ξμ enters only through its length and the projections wS · ξμ, Bσ · ξμ
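Computing these seven numbers for two prototypes (R has four entries, Q three distinct ones, since Q+- = Q-+); a small helper with illustrative names:

```python
import numpy as np

def order_parameters(w_plus, w_minus, B_plus, B_minus):
    """Return the projections R[s, sigma] = w_s . B_sigma and the
    overlaps Q[s, t] = w_s . w_t for the two prototypes."""
    R = np.array([[w_plus @ B_plus,  w_plus @ B_minus],
                  [w_minus @ B_plus, w_minus @ B_minus]])
    Q = np.array([[w_plus @ w_plus,  w_plus @ w_minus],
                  [w_minus @ w_plus, w_minus @ w_minus]])
    return R, Q
```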
2. average over the current example
for a random vector ξμ according to P(ξ), the avg. length ⟨ξμ·ξμ⟩ and the projections wS·ξμ, Bσ·ξμ are correlated Gaussian random quantities in the thermodynamic limit N → ∞, completely specified in terms of first and second moments
the averaged recursions close in the characteristic quantities {RSσ, QST}
3. self-averaging property
the characteristic quantities
- depend on the random sequence of example data
- their variance vanishes with N (here: ∝ N-1)
the learning dynamics is completely described in terms of averages
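The moments follow directly from the model definition (mean ℓ Bσ, independent components of variance vσ); a sketch in LaTeX, conditional on the cluster label σ, writing h_s = w_s · ξ and b_τ = B_τ · ξ:

```latex
\langle h_s \rangle_\sigma = \ell\, R_{s\sigma}, \qquad
\langle b_\tau \rangle_\sigma = \ell\, \delta_{\tau\sigma},
\qquad
\langle h_s h_t \rangle_\sigma - \langle h_s \rangle_\sigma \langle h_t \rangle_\sigma
  = v_\sigma\, Q_{st}, \qquad
\langle h_s b_\tau \rangle_\sigma - \langle h_s \rangle_\sigma \langle b_\tau \rangle_\sigma
  = v_\sigma\, R_{s\tau}
```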
4. continuous learning time
α = μ/N : # of examples (learning steps) per degree of freedom
stochastic recursions → deterministic ODE for N → ∞
integration yields the evolution of the projections RSσ(α), QST(α)
5. learning curve
generalization error εg(α) after training with μ = αN examples:
probability for misclassification of a novel random example
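The generalization error can also be estimated empirically; a Monte Carlo sketch, where sample_fn stands for any sampler of fresh examples, e.g. the one sketched above (names are illustrative):

```python
import numpy as np

def generalization_error(w, w_labels, sample_fn, n_test=20000):
    """Estimate eps_g: the probability that a fresh example from the
    mixture is misclassified by the nearest-prototype rule."""
    xi, y = sample_fn(n_test)                             # fresh test examples
    d = np.linalg.norm(xi[:, None, :] - w[None, :, :], axis=2)
    pred = np.asarray(w_labels)[np.argmin(d, axis=1)]     # label of the winner
    return float(np.mean(pred != y))
```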
LVQ1: The winner takes it all
only the winner ws is updated according to the class label (1 winner)
[Figure: projections RSσ and overlaps Q++, Q+-, Q-- vs. α; trajectories of w+, w- in the (B+, B-)-plane at α = 20, 40, ..., 140; ....... optimal decision boundary, ____ asymptotic position]
theory and simulation (N = 100): p+ = 0.8, v+ = 4, v- = 9, ℓ = 2.0, η = 1.0, averaged over 100 independent runs; initialization ws(0) ≈ 0
Learning curve (p+ = 0.2, ℓ = 1.0, v+ = v- = 1.0; η = 2.0, 1.0, 0.2)
• suboptimal, non-monotonic behavior of εg(α) for small η
• stationary state: εg(α → ∞) grows linearly with η
• well-defined asymptotics for η → 0, α → ∞ with (ηα) → ∞
achievable generalization error εg vs. p+ (equal variances v+ = v- = 1.0 and unequal variances v+ = 0.25, v- = 0.81):
.... best linear boundary ― LVQ1
“LVQ 2.1” [Kohonen]
here: update both the correct and the wrong winner
[Figure: projected trajectories RS+, RS-; theory and simulation (N = 100): p+ = 0.8, ℓ = 1, v+ = v- = 1, η = 0.5, averages over 100 independent runs]
problem: instability of the algorithm due to the repulsion of wrongly labelled prototypes
trivial classification for α → ∞: εg = min {p+, p-}
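A sketch of this step: in the two-prototype model both prototypes are updated in every step; in general one takes the closest correct and the closest wrong prototype (names are illustrative):

```python
import numpy as np

def lvq21_update(w, w_labels, xi, lab, eta):
    """LVQ2.1-type step: attract the closest correct prototype,
    repel the closest wrong one; the repulsion causes the instability."""
    N = w.shape[1]
    d = np.linalg.norm(w - xi, axis=1)
    correct = np.argmin(np.where(w_labels == lab, d, np.inf))
    wrong = np.argmin(np.where(w_labels != lab, d, np.inf))
    w[correct] += (eta / N) * (xi - w[correct])   # attraction
    w[wrong] -= (eta / N) * (xi - w[wrong])       # repulsion
    return w
```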
suggested strategy: selection of data in a window close to the current decision boundary; this slows down the repulsion, but the system remains unstable
Early stopping: end the training process at minimal εg (idealized)
• pronounced minimum in εg(α)
• position and depth depend on initialization and cluster geometry
• lowest minimum assumed for η → 0
[Figure: εg(α) for η = 2.0, 1.0, 0.5; achievable εg vs. p+ for v+ = 0.25, v- = 0.81: ― LVQ1, __ early stopping]
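Idealized early stopping as a sketch: keep the configuration with the lowest generalization error seen so far (update_fn and eval_fn are placeholders for one learning step and for an estimate of εg, e.g. the Monte Carlo estimator above):

```python
import numpy as np

def train_with_early_stopping(n_steps, update_fn, eval_fn, w):
    """Run n_steps updates and return the best configuration found."""
    best_err, best_w = np.inf, w.copy()
    for t in range(n_steps):
        w = update_fn(w, t)          # one on-line learning step
        err = eval_fn(w)             # current generalization error
        if err < best_err:
            best_err, best_w = err, w.copy()
    return best_w, best_err
```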
Learning curves: εg(α) for p+ = 0.8, ℓ = 3.0, v+ = 4.0, v- = 9.0 and η = 2.0, 1.0, 0.5
“Learning From Mistakes (LFM)”: the LVQ2.1 update is performed only if the current classification is wrong; the crisp limit of Soft Robust LVQ [Seo and Obermayer, 2003]
• η-independent asymptotic εg
[Figure: projected trajectory in the (ℓ B+, ℓ B-)-plane, RS+, RS-; p+ = 0.8, ℓ = 1.2, v+ = v- = 1.0]
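LFM as a sketch, gating the LVQ2.1-type step on a misclassification by the current winner-takes-all rule (names are illustrative):

```python
import numpy as np

def lfm_update(w, w_labels, xi, lab, eta):
    """Learning From Mistakes: update only if the winner misclassifies xi."""
    N = w.shape[1]
    d = np.linalg.norm(w - xi, axis=1)
    if w_labels[np.argmin(d)] == lab:
        return w                                   # correct classification: no update
    correct = np.argmin(np.where(w_labels == lab, d, np.inf))
    wrong = np.argmin(np.where(w_labels != lab, d, np.inf))
    w[correct] += (eta / N) * (xi - w[correct])
    w[wrong] -= (eta / N) * (xi - w[wrong])
    return w
```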
Comparison: achievable generalization ability
[Figure: εg vs. p+ for equal cluster variances (v+ = v- = 1.0) and unequal variances (v+ = 0.25, v- = 0.81)]
..... best linear boundary ― LVQ1 --- LVQ2.1 (early stopping) ·-· LFM
Summary
• prototype-based learning: Vector Quantization and Learning Vector Quantization
• a model scenario: two clusters, two prototypes
• dynamics of on-line training
• comparison of algorithms:
- LVQ 1: close to optimal asymptotic generalization
- LVQ 2.1: instability, trivial (stationary) classification
- LVQ 2.1 + stopping: potentially very good performance
- LFM: far from optimal generalization behavior
• work in progress, outlook:
- multi-class, multi-prototype problems
- optimized procedures: learning rate schedules
- variational approach / Bayes optimal on-line
Perspectives
• Self-Organizing Maps (SOM): (many) N-dim. prototypes form a (low) d-dimensional grid; representation of data in a topology preserving map; neighborhood preserving training; applications
• Neural Gas (distance based)
• Generalized Relevance LVQ [e.g. Hammer & Villmann]: adaptive metrics, e.g. an adaptive distance measure used in training
investigation and comparison of given algorithms:
- repulsive/attractive fixed points of the dynamics
- asymptotic behavior for α → ∞
- dependence on learning rate, separation, initialization
- ...
optimization and development of new prescriptions:
- time-dependent learning rate η(α)
- variational optimization w.r.t. the modulation function fs[...], e.g. maximize the decrease −dεg/dα
- ...
LVQ1: The winner takes it all
only the winner ws is updated according to the class label (1 winner)
self-averaging property (mean and variances):
[Figure: averages of RSσ and Q++, Q+-, Q-- vs. α; variance of R++ at α = 10 vanishing ∝ 1/N]
theory and simulation (N = 100): p+ = 0.8, v+ = 4, v- = 9, ℓ = 2.0, η = 1.0, averaged over 100 indep. runs; initialization ws(0) = 0
High-dimensional data (formally: N → ∞)
example: ξμ ∈ ℝN with N = 200, ℓ = 1, p+ = 0.4, v+ = 0.44, v- = 0.44 (● 240 and ○ 160 examples of the two classes)
projections onto two independent random directions w1, w2: x1 = w1 · ξμ, x2 = w2 · ξμ
projections into the plane of the center vectors B+, B-: y+ = B+ · ξμ, y- = B- · ξμ
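A sketch of both projections for a given data set (illustrative names; only the plane of B+, B- reveals the cluster structure):

```python
import numpy as np

rng = np.random.default_rng(1)

def project(xi, B_plus, B_minus):
    """Project data onto two random unit directions and onto (B+, B-)."""
    N = xi.shape[1]
    w1, w2 = rng.standard_normal((2, N))
    w1 /= np.linalg.norm(w1)
    w2 /= np.linalg.norm(w2)
    random_proj = np.stack([xi @ w1, xi @ w2], axis=1)           # (x1, x2)
    center_proj = np.stack([xi @ B_plus, xi @ B_minus], axis=1)  # (y+, y-)
    return random_proj, center_proj
```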