Online Learning Algorithms

Presentation Transcript


  1. Online Learning Algorithms

  2. Outline • Online learning framework • Design principles of online learning algorithms (additive updates) • Perceptron, Passive-Aggressive and Confidence-Weighted classification • Classification – binary, multi-class and structured prediction • Hypothesis averaging and regularization • Multiplicative updates • Weighted Majority, Winnow, and connections to Gradient Descent (GD) and Exponentiated Gradient Descent (EGD)

  3. Formal setting – Classification • Instances x (e.g., images, sentences) • Labels y (e.g., parse trees, names) • Prediction rule: a function mapping instances to labels • Linear prediction rule: predict sign(w · x) • Loss, e.g., the number of mistakes

  4. Predictions • Continuous predictions carry both a label and a confidence • Linear classifiers • Prediction: sign(w · x) • Confidence: |w · x|

  5. Loss Functions • Natural loss • Zero-one loss: 1 if the predicted label differs from y, 0 otherwise • Losses for real-valued predictions • Hinge loss: max(0, 1 − y (w · x)) • Exponential loss (Boosting): exp(−y (w · x))
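
Both real-valued losses above depend on the data only through the margin y (w · x). A minimal sketch in Python (illustrative, not from the slides; labels assumed to be in {-1, +1}):

```python
import numpy as np

def zero_one_loss(w, x, y):
    """Zero-one loss: 1 on a mistake (non-positive margin), 0 otherwise."""
    return 0.0 if y * np.dot(w, x) > 0 else 1.0

def hinge_loss(w, x, y):
    """Hinge loss: 0 once the margin is at least 1, grows linearly below that;
    it upper-bounds the zero-one loss."""
    return max(0.0, 1.0 - y * np.dot(w, x))
```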

  6. Loss Functions [figure: hinge loss and zero-one loss plotted as functions of the margin]

  7. Online Framework • Initialize the classifier • The algorithm works in rounds • On round t the online algorithm: • Receives an input instance • Outputs a prediction • Receives a feedback label • Computes the loss • Updates the prediction rule • Goal: suffer small cumulative loss
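
A minimal sketch of this protocol (illustrative, not from the slides); learner stands for any object exposing hypothetical predict(x) and update(x, y) methods, and the loss counted here is the zero-one loss:

```python
def run_online(learner, stream):
    """Generic online protocol: predict, receive the label, suffer loss, update."""
    mistakes = 0
    for x, y in stream:                 # round t: receive an input instance x_t
        y_hat = learner.predict(x)      # output a prediction
        if y_hat != y:                  # receive the feedback label y_t
            mistakes += 1               # compute the (zero-one) loss
        learner.update(x, y)            # update the prediction rule
    return mistakes                     # goal: small cumulative loss
```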

  8. Margin • Margin of an example (x, y) with respect to the classifier w: y (w · x) • Note: the set of examples is separable iff there exists u such that y (u · x) > 0 for every example

  9. Geometrical Interpretation [figure: examples plotted relative to the separating hyperplane, labeled Margin >> 0, Margin > 0, Margin < 0, Margin << 0]

  10. Hinge Loss

  11. Why Online Learning? • Fast • Memory efficient - process one example at a time • Simple to implement • Formal guarantees – Mistake bounds • Online to Batch conversions • No statistical assumptions • Adaptive

  12. Update Rules • Online algorithms are based on an update rule which defines the next classifier from the current one (and possibly other information) • Linear classifiers: find the new weight vector from the old one based on the input • Some update rules: • Perceptron (Rosenblatt) • ALMA (Gentile) • ROMMA (Li & Long) • NORMA (Kivinen et al.) • MIRA (Crammer & Singer) • EG (Littlestone and Warmuth) • Bregman Based (Warmuth) • CW (Dredze et al.)

  13. Design Principles of Algorithms • If the learner suffers non-zero loss on a round, we want to balance two goals: • Corrective: change the weights enough so that we don't make this error again (1) • Conservative: don't change the weights too much (2) • How do we define "too much"?

  14. Design Principles of Algorithms • If we use Euclidean distance to measure the change between the old and new weights • Enforcing (1) and minimizing (2) gives, e.g., the Perceptron; for squared loss, Widrow-Hoff (Least Mean Squares) • Passive-Aggressive algorithms do exactly the same • except that (1) is much stronger – we want a correct classification with a margin of at least 1 • Confidence-Weighted classifiers • maintain a distribution over weight vectors • (1) is the same as Passive-Aggressive, with a probabilistic notion of margin • the change is measured by the KL divergence between the two distributions

  15. Design Principles of Algorithms • If we assume all weights are positive • we can use (unnormalized) KL divergence to measure the change • Multiplicative update or EG algorithm (Kivinen and Warmuth)
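
As a rough illustration only, one common instance of the multiplicative update is the normalized EG step for a linear predictor with squared loss; the learning rate eta below is an arbitrary choice:

```python
import numpy as np

def eg_step(w, x, y, eta=0.1):
    """Exponentiated-gradient step: each (positive) weight is scaled
    multiplicatively by exp(-eta * gradient), then renormalized to sum to one."""
    grad = 2.0 * (np.dot(w, x) - y) * x   # gradient of (w.x - y)^2 w.r.t. w
    w = w * np.exp(-eta * grad)           # multiplicative update keeps weights positive
    return w / w.sum()                    # renormalize (normalized EG)
```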

  16. The Perceptron Algorithm • If no mistake: do nothing • If a mistake: update w ← w + y x • Margin after the update: y (w · x) + ||x||², i.e., the margin increases by ||x||²
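
A sketch of the classic mistake-driven Perceptron step (binary labels in {-1, +1}):

```python
import numpy as np

def perceptron_step(w, x, y):
    """Perceptron: do nothing on a correct prediction; on a mistake add y * x.
    After the update the margin grows by ||x||^2: y*((w + y*x).x) = y*(w.x) + x.x"""
    if y * np.dot(w, x) <= 0:           # mistake (non-positive margin)
        w = w + y * x
    return w
```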

  17. Passive-Aggressive Algorithms

  18. Passive-Aggressive: Motivation • Perceptron: no guarantee on the margin after the update • PA: enforce a minimal non-zero margin after the update • In particular: • If the margin is large enough (at least 1), then do nothing • If the margin is less than 1, update so that the margin after the update is forced to be 1

  19. Aggressive Update Step • Set the new weight vector to be the solution of the optimization problem: minimize ||w − w_t||² / 2 subject to y_t (w · x_t) ≥ 1 (1) • Closed-form update: w_{t+1} = w_t + τ_t y_t x_t, where τ_t = loss_t / ||x_t||² (2)
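
A sketch of the corresponding PA step; when the hinge loss is positive, the closed-form step size makes the post-update margin exactly 1:

```python
import numpy as np

def pa_step(w, x, y):
    """Passive-Aggressive step: stay passive when the margin is already >= 1;
    otherwise take the smallest step that restores a margin of exactly 1."""
    loss = max(0.0, 1.0 - y * np.dot(w, x))   # hinge loss on this round
    if loss > 0.0:
        tau = loss / np.dot(x, x)             # closed-form step size
        w = w + tau * y * x
    return w
```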

  20. Passive-Aggressive Update

  21. Unrealizable Case

  22. Confidence Weighted Classification

  23. Confidence-Weighted Classification: Motivation • Many positive reviews with the word "best" increase w_best • Later, a negative review: "boring book – best if you want to sleep in seconds" • A linear update will reduce both w_best and w_boring by the same amount • But "best" has appeared far more often than "boring" • How can we adjust different weights at different rates?

  24. Update Rules • The weight vector is a linear combination of the examples • Two rate schedules (among others): • Perceptron algorithm (conservative): on a mistake, add y x with rate 1 • Passive-Aggressive: add y x with rate τ = loss / ||x||²

  25. Distributions in Version Space [figure: a distribution over weight vectors in version space, with the mean weight vector and an example labeled]

  26. Margin as a Random Variable • With w drawn from N(μ, Σ), the signed margin y (w · x) is a Gaussian random variable • Thus its mean is y (μ · x) and its variance is xᵀ Σ x

  27. PA-like Update • PA: minimize the change to w subject to a margin of at least 1 • New update: minimize the KL divergence to the previous Gaussian subject to the example being classified correctly with probability at least η
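
The exact CW closed-form update is more involved; the sketch below follows a simpler AROW-style rule with a diagonal covariance (the constant r is an assumed regularizer), which captures the same idea: low-variance (high-confidence) weights move less, and the variance along the observed features shrinks after the update.

```python
import numpy as np

def cw_style_step(mu, sigma, x, y, r=1.0):
    """Confidence-weighted-style update (AROW variant, diagonal covariance).
    mu is the mean weight vector, sigma holds the per-weight variances."""
    v = np.dot(sigma * x, x)                  # variance of the margin: x' Sigma x
    m = y * np.dot(mu, x)                     # mean of the signed margin
    if m < 1.0:                               # margin constraint violated
        beta = 1.0 / (v + r)
        alpha = (1.0 - m) * beta              # step size grows with the loss
        mu = mu + alpha * y * (sigma * x)     # high-variance weights move more
        sigma = sigma - beta * (sigma * x) ** 2   # shrink variance along x
    return mu, sigma
```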

  28. Weight Vector (Version) Space [figure: the goal is to place most of the probability mass in the region that classifies the example correctly]

  29. Passive Step • Nothing to do: most weight vectors already classify the example correctly

  30. Aggressive Step • The mean is moved past the mistake line (large margin) • Project the current Gaussian distribution onto the half-space • The covariance is shrunk in the direction of the new example

  31. Extensions: Multi-class and Structured Prediction

  32. Multiclass Representation I • k prototypes, one weight vector per class • New instance x • Compute the score of each class on x • Prediction: the class achieving the highest score

  33. Multiclass Representation II • Map all inputs and labels into a joint vector space • Score labels by projecting onto the corresponding feature vector • Example: the sentence "Estimated volume was a light 2.4 million ounces ." with labeling B I O B I I I I O is mapped to a joint feature vector F(x, y) = (0 1 1 0 …)

  34. Multiclass Representation II • Predict the label with the highest score (inference) • Naïve search is expensive if the set of possible labels is large • For the example sentence "Estimated volume was a light 2.4 million ounces .", the number of B/I/O labelings is 3^(no. of words) • Efficient Viterbi decoding for sequences!
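
A sketch of Viterbi decoding over per-position and transition scores; in this setting the scores would come from decomposing w · F(x, y) over positions, and the array names below are illustrative assumptions:

```python
import numpy as np

def viterbi(emission, transition):
    """Viterbi decoding: highest-scoring label sequence.
    emission[t, k]   : score of label k at position t
    transition[j, k] : score of label j followed by label k"""
    T, K = emission.shape
    score = np.zeros((T, K))
    back = np.zeros((T, K), dtype=int)
    score[0] = emission[0]
    for t in range(1, T):
        for k in range(K):
            prev = score[t - 1] + transition[:, k]
            back[t, k] = int(np.argmax(prev))
            score[t, k] = prev[back[t, k]] + emission[t, k]
    path = [int(np.argmax(score[-1]))]        # best final label
    for t in range(T - 1, 0, -1):             # follow the back-pointers
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```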

  35. Two Representations • Weight vector per class (Representation I) • Intuitive • Improved algorithms • Single weight vector (Representation II) • Generalizes Representation I, e.g., F(x, 4) = (0 0 0 x 0) places x in the block of class 4 • Allows complex interactions between input and output

  36. Margin for Multiclass • Binary: y (w · x) • Multiclass: the score of the correct label minus the highest score of any other label, w · F(x, y) − max over y' ≠ y of w · F(x, y')

  37. Margin for Multiclass • But different mistakes cost differently (the loss function) – so use it! • Margin scaled by the loss function: require the margin over a competing label to be at least the loss of that label

  38. Perceptron multiclass online algorithm • Initialize the weight vector • For each round: • Receive an input instance • Output a prediction • Receive the feedback label • Compute the loss • Update the prediction rule
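
A sketch of the multiclass Perceptron step with one weight vector per class (Representation I): on a mistake, the true class is promoted and the predicted class is demoted.

```python
import numpy as np

def multiclass_perceptron_step(W, x, y):
    """Multiclass Perceptron with one weight vector (prototype) per class."""
    y_hat = int(np.argmax(W @ x))     # predict the class with the highest score
    if y_hat != y:                    # mistake
        W[y] += x                     # promote the correct class
        W[y_hat] -= x                 # demote the predicted class
    return W
```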

  39. PA multiclass online algorithm • Initialize the weight vector • For each round: • Receive an input instance • Output a prediction • Receive the feedback label • Compute the loss • Update the prediction rule

  40. Regularization • Key idea: if an online algorithm works well on a sequence of i.i.d. examples, then an ensemble of the online hypotheses should generalize well • Popular choices: • the averaged hypothesis • the majority vote • use a validation set to make the choice
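
A sketch of the averaged-hypothesis conversion applied to the binary Perceptron (illustrative; the stream and dimensionality arguments are assumptions):

```python
import numpy as np

def averaged_perceptron(stream, dim):
    """Online-to-batch conversion: train a Perceptron on the sequence and
    return the average of the hypotheses held after every round."""
    w = np.zeros(dim)
    w_sum = np.zeros(dim)
    rounds = 0
    for x, y in stream:
        if y * np.dot(w, x) <= 0:     # mistake-driven Perceptron update
            w = w + y * x
        w_sum += w                    # accumulate this round's hypothesis
        rounds += 1
    return w_sum / max(rounds, 1)     # the averaged hypothesis
```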
