This presentation analyzes the dynamics and performance of Learning Vector Quantization (LVQ) algorithms, with emphasis on the generalization ability achieved after learning from examples. The study explores the dynamics of the learning process and the convergence properties of LVQ algorithms in a model situation.
Dynamical analysis of LVQ type learning rules
Barbara Hammer (Clausthal University of Technology, Institute of Computing Science)
Michael Biehl, Anarta Ghosh (Rijksuniversiteit Groningen, Mathematics and Computing Science)
http://www.cs.rug.nl/~biehl | m.biehl@rug.nl
Learning Vector Quantization (LVQ)
- identification of prototype vectors from labelled example data
- parameterization of distance-based classification schemes
classification: assignment of a vector to the class of the closest prototype w
aim: generalization ability, i.e. classification of novel data after learning from examples
often: heuristically motivated variations of competitive learning
example: basic LVQ scheme [Kohonen], “LVQ 1”:
• initialize prototype vectors for the different classes
• present a single example
• identify the closest prototype, i.e. the so-called winner
• move the winner closer towards the data (same class) or away from the data (different class)
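A minimal sketch of this winner-takes-all step, assuming a NumPy setting with one prototype matrix and one label per prototype (function and variable names are illustrative, not from the original):

```python
import numpy as np

def lvq1_update(prototypes, proto_labels, xi, xi_label, eta):
    """One LVQ1 step: move the winner towards xi if the labels agree,
    away from xi otherwise."""
    dists = np.linalg.norm(prototypes - xi, axis=1)   # distances to all prototypes
    winner = np.argmin(dists)                         # closest prototype = winner
    sign = 1.0 if proto_labels[winner] == xi_label else -1.0
    prototypes[winner] += eta * sign * (xi - prototypes[winner])
    return prototypes
```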
LVQ algorithms ...
• frequently applied in a variety of practical problems
• plausible, intuitive, flexible
• fast, easy to implement
but: limited theoretical understanding of
- dynamics and convergence properties
- achievable generalization ability
often based on heuristic arguments or cost functions with unclear relation to generalization
here: analysis of LVQ algorithms w.r.t.
- dynamics of the learning process
- performance, i.e. generalization ability
- typical properties in a model situation
Model situation: two clusters of N-dimensional data
random vectors ξ ∈ ℝN drawn according to a mixture of two Gaussians:
P(ξ) = Σσ=±1 pσ P(ξ|σ), with independent components of variance vσ in each cluster
orthonormal center vectors: B+, B- ∈ ℝN, (B±)² = 1, B+·B- = 0
cluster centers ℓ B+, ℓ B-, i.e. separation ∝ ℓ
prior weights of the classes: p+, p- with p+ + p- = 1
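A sampler for this model, as a sketch: cluster centers at ℓ B± with B± taken as the first two basis vectors (any orthonormal pair is equivalent; names and the RNG seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_cluster_data(n, N, ell, p_plus, v_plus, v_minus):
    """Draw n vectors xi in R^N from the two-cluster Gaussian mixture."""
    B = np.zeros((2, N))
    B[0, 0] = 1.0                                    # B+ (first basis vector)
    B[1, 1] = 1.0                                    # B- (second basis vector)
    sigma = (rng.random(n) >= p_plus).astype(int)    # 0 -> class +, 1 -> class -
    var = np.where(sigma == 0, v_plus, v_minus)      # cluster-dependent variance
    xi = ell * B[sigma] + rng.standard_normal((n, N)) * np.sqrt(var)[:, None]
    labels = np.where(sigma == 0, 1, -1)
    return xi, labels
```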
Dynamics of on-line training
sequence of new, independent random examples ξμ drawn according to P(ξ)
update of the two prototype vectors w+, w-:
wsμ = wsμ-1 + (η/N) fs(...) (ξμ - wsμ-1)
η: learning rate, step size
fs: modulation function, encodes competition, direction of update etc.
(ξμ - wsμ-1): change of the prototype towards or away from the current data
example: LVQ1, original formulation [Kohonen], a Winner-Takes-All (WTA) algorithm
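One on-line pass with this update for LVQ1, as a sketch; note the 1/N scaling of the step size, as in the update rule above (names are illustrative):

```python
import numpy as np

def online_lvq1(examples, example_labels, w, w_labels, eta):
    """Sequential LVQ1 training: each fresh example is used once, in order."""
    N = w.shape[1]
    for xi, lab in zip(examples, example_labels):
        winner = np.argmin(np.linalg.norm(w - xi, axis=1))
        f = 1.0 if w_labels[winner] == lab else -1.0  # modulation: +/-1 for the winner
        w[winner] += (eta / N) * f * (xi - w[winner])
    return w
```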
Mathematical analysis of the learning dynamics
1. description in terms of a few characteristic quantities (here: ℝ2N → ℝ7)
RSσ = wS · Bσ : projections into the (B+, B-)-plane
QST = wS · wT : lengths and relative position of the prototypes
in the recursions, the random vector ξμ enters only through its length and the projections wS · ξμ, Bσ · ξμ
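Computing these seven numbers for two prototypes (R has four entries, Q three distinct ones, since Q+- = Q-+); a small helper with illustrative names:

```python
import numpy as np

def order_parameters(w_plus, w_minus, B_plus, B_minus):
    """Return the projections R[s, sigma] = w_s . B_sigma and the
    overlaps Q[s, t] = w_s . w_t for the two prototypes."""
    R = np.array([[w_plus @ B_plus,  w_plus @ B_minus],
                  [w_minus @ B_plus, w_minus @ B_minus]])
    Q = np.array([[w_plus @ w_plus,  w_plus @ w_minus],
                  [w_minus @ w_plus, w_minus @ w_minus]])
    return R, Q
```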
2. average over the current example
for a random vector ξμ according to P(ξ), the avg. length ⟨ξμ·ξμ⟩ and the projections wS·ξμ, Bσ·ξμ are correlated Gaussian random quantities in the thermodynamic limit N → ∞, completely specified in terms of first and second moments
the averaged recursions close in the characteristic quantities {RSσ, QST}
3. self-averaging property
the characteristic quantities
- depend on the random sequence of example data
- their variance vanishes with N (here: ∝ N-1)
the learning dynamics is completely described in terms of averages
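The moments follow directly from the model definition (mean ℓ Bσ, independent components of variance vσ); a sketch in LaTeX, conditional on the cluster label σ, writing h_s = w_s · ξ and b_τ = B_τ · ξ:

```latex
\langle h_s \rangle_\sigma = \ell\, R_{s\sigma}, \qquad
\langle b_\tau \rangle_\sigma = \ell\, \delta_{\tau\sigma},
\qquad
\langle h_s h_t \rangle_\sigma - \langle h_s \rangle_\sigma \langle h_t \rangle_\sigma
  = v_\sigma\, Q_{st}, \qquad
\langle h_s b_\tau \rangle_\sigma - \langle h_s \rangle_\sigma \langle b_\tau \rangle_\sigma
  = v_\sigma\, R_{s\tau}
```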
4. continuous learning time
α = μ/N : # of examples (learning steps) per degree of freedom
stochastic recursions → deterministic ODE for N → ∞
integration yields the evolution of the projections RSσ(α), QST(α)
5. learning curve
generalization error εg(α) after training with μ = αN examples:
probability for misclassification of a novel random example
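The generalization error can also be estimated empirically; a Monte Carlo sketch, where sample_fn stands for any sampler of fresh examples, e.g. the one sketched above (names are illustrative):

```python
import numpy as np

def generalization_error(w, w_labels, sample_fn, n_test=20000):
    """Estimate eps_g: the probability that a fresh example from the
    mixture is misclassified by the nearest-prototype rule."""
    xi, y = sample_fn(n_test)                             # fresh test examples
    d = np.linalg.norm(xi[:, None, :] - w[None, :, :], axis=2)
    pred = np.asarray(w_labels)[np.argmin(d, axis=1)]     # label of the winner
    return float(np.mean(pred != y))
```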
LVQ1: The winner takes it all
only the winner ws is updated according to the class label (1 winner)
[Figure: projections RSσ and overlaps Q++, Q+-, Q-- vs. α; trajectories of w+, w- in the (B+, B-)-plane at α = 20, 40, ..., 140; ....... optimal decision boundary, ____ asymptotic position]
theory and simulation (N = 100): p+ = 0.8, v+ = 4, v- = 9, ℓ = 2.0, η = 1.0, averaged over 100 independent runs; initialization ws(0) ≈ 0
Learning curve (p+ = 0.2, ℓ = 1.0, v+ = v- = 1.0; η = 2.0, 1.0, 0.2)
• suboptimal, non-monotonic behavior of εg(α) for small η
• stationary state: εg(α → ∞) grows linearly with η
• well-defined asymptotics for η → 0, α → ∞ with (ηα) → ∞
achievable generalization error εg vs. p+ (equal variances v+ = v- = 1.0 and unequal variances v+ = 0.25, v- = 0.81):
.... best linear boundary ― LVQ1
“LVQ 2.1” [Kohonen]
here: update both the correct and the wrong winner
[Figure: projected trajectories RS+, RS-; theory and simulation (N = 100): p+ = 0.8, ℓ = 1, v+ = v- = 1, η = 0.5, averages over 100 independent runs]
problem: instability of the algorithm due to the repulsion of wrongly labelled prototypes
trivial classification for α → ∞: εg = min {p+, p-}
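A sketch of this step: in the two-prototype model both prototypes are updated in every step; in general one takes the closest correct and the closest wrong prototype (names are illustrative):

```python
import numpy as np

def lvq21_update(w, w_labels, xi, lab, eta):
    """LVQ2.1-type step: attract the closest correct prototype,
    repel the closest wrong one; the repulsion causes the instability."""
    N = w.shape[1]
    d = np.linalg.norm(w - xi, axis=1)
    correct = np.argmin(np.where(w_labels == lab, d, np.inf))
    wrong = np.argmin(np.where(w_labels != lab, d, np.inf))
    w[correct] += (eta / N) * (xi - w[correct])   # attraction
    w[wrong] -= (eta / N) * (xi - w[wrong])       # repulsion
    return w
```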
suggested strategy: selection of data in a window close to the current decision boundary; this slows down the repulsion, but the system remains unstable
Early stopping: end the training process at minimal εg (idealized)
• pronounced minimum in εg(α)
• position and depth depend on initialization and cluster geometry
• lowest minimum assumed for η → 0
[Figure: εg(α) for η = 2.0, 1.0, 0.5; achievable εg vs. p+ for v+ = 0.25, v- = 0.81: ― LVQ1, __ early stopping]
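Idealized early stopping as a sketch: keep the configuration with the lowest generalization error seen so far (update_fn and eval_fn are placeholders for one learning step and for an estimate of εg, e.g. the Monte Carlo estimator above):

```python
import numpy as np

def train_with_early_stopping(n_steps, update_fn, eval_fn, w):
    """Run n_steps updates and return the best configuration found."""
    best_err, best_w = np.inf, w.copy()
    for t in range(n_steps):
        w = update_fn(w, t)          # one on-line learning step
        err = eval_fn(w)             # current generalization error
        if err < best_err:
            best_err, best_w = err, w.copy()
    return best_w, best_err
```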
Learning curves: εg(α) for p+ = 0.8, ℓ = 3.0, v+ = 4.0, v- = 9.0 and η = 2.0, 1.0, 0.5
“Learning From Mistakes (LFM)”: the LVQ2.1 update is performed only if the current classification is wrong; the crisp limit of Soft Robust LVQ [Seo and Obermayer, 2003]
• η-independent asymptotic εg
[Figure: projected trajectory in the (ℓ B+, ℓ B-)-plane, RS+, RS-; p+ = 0.8, ℓ = 1.2, v+ = v- = 1.0]
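LFM as a sketch, gating the LVQ2.1-type step on a misclassification by the current winner-takes-all rule (names are illustrative):

```python
import numpy as np

def lfm_update(w, w_labels, xi, lab, eta):
    """Learning From Mistakes: update only if the winner misclassifies xi."""
    N = w.shape[1]
    d = np.linalg.norm(w - xi, axis=1)
    if w_labels[np.argmin(d)] == lab:
        return w                                   # correct classification: no update
    correct = np.argmin(np.where(w_labels == lab, d, np.inf))
    wrong = np.argmin(np.where(w_labels != lab, d, np.inf))
    w[correct] += (eta / N) * (xi - w[correct])
    w[wrong] -= (eta / N) * (xi - w[wrong])
    return w
```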
Comparison: achievable generalization ability
[Figure: εg vs. p+ for equal cluster variances (v+ = v- = 1.0) and unequal variances (v+ = 0.25, v- = 0.81)]
..... best linear boundary ― LVQ1 --- LVQ2.1 (early stopping) ·-· LFM
Summary
• prototype-based learning: Vector Quantization and Learning Vector Quantization
• a model scenario: two clusters, two prototypes
• dynamics of on-line training
• comparison of algorithms:
- LVQ 1: close to optimal asymptotic generalization
- LVQ 2.1: instability, trivial (stationary) classification
- LVQ 2.1 + stopping: potentially very good performance
- LFM: far from optimal generalization behavior
• work in progress, outlook:
- multi-class, multi-prototype problems
- optimized procedures: learning rate schedules
- variational approach / Bayes optimal on-line
Perspectives
• Self-Organizing Maps (SOM): (many) N-dim. prototypes form a (low) d-dimensional grid; representation of data in a topology preserving map; neighborhood preserving training; applications
• Neural Gas (distance based)
• Generalized Relevance LVQ [e.g. Hammer & Villmann]: adaptive metrics, e.g. an adaptive distance measure used in training
investigation and comparison of given algorithms:
- repulsive/attractive fixed points of the dynamics
- asymptotic behavior for α → ∞
- dependence on learning rate, separation, initialization
- ...
optimization and development of new prescriptions:
- time-dependent learning rate η(α)
- variational optimization w.r.t. the modulation function fs[...], e.g. maximize the decrease −dεg/dα
- ...
LVQ1: The winner takes it all
only the winner ws is updated according to the class label (1 winner)
self-averaging property (mean and variances):
[Figure: averages of RSσ and Q++, Q+-, Q-- vs. α; variance of R++ at α = 10 vanishing ∝ 1/N]
theory and simulation (N = 100): p+ = 0.8, v+ = 4, v- = 9, ℓ = 2.0, η = 1.0, averaged over 100 indep. runs; initialization ws(0) = 0
High-dimensional data (formally: N → ∞)
example: ξμ ∈ ℝN with N = 200, ℓ = 1, p+ = 0.4, v+ = 0.44, v- = 0.44 (● 240 and ○ 160 examples of the two classes)
projections onto two independent random directions w1, w2: x1 = w1 · ξμ, x2 = w2 · ξμ
projections into the plane of the center vectors B+, B-: y+ = B+ · ξμ, y- = B- · ξμ
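A sketch of both projections for a given data set (illustrative names; only the plane of B+, B- reveals the cluster structure):

```python
import numpy as np

rng = np.random.default_rng(1)

def project(xi, B_plus, B_minus):
    """Project data onto two random unit directions and onto (B+, B-)."""
    N = xi.shape[1]
    w1, w2 = rng.standard_normal((2, N))
    w1 /= np.linalg.norm(w1)
    w2 /= np.linalg.norm(w2)
    random_proj = np.stack([xi @ w1, xi @ w2], axis=1)           # (x1, x2)
    center_proj = np.stack([xi @ B_plus, xi @ B_minus], axis=1)  # (y+, y-)
    return random_proj, center_proj
```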