
Online Learning for Real-World Problems

Presentation Transcript


  1. Online Learning for Real-World Problems Koby Crammer University of Pennsylvania

  2. Ofer Dekel Joseph Keshet Shai Shalev-Shwartz Yoram Singer Axel Bernal Steve Carroll Mark Dredze Kuzman Ganchev Ryan McDonald Artemis Hatzigeorgiou Fernando Pereira Fei Sha Partha Pratim Talukdar Thanks

  3. Tutorial Context SVMs Real-World Data Online Learning Tutorial Multiclass, Structured Optimization Theory

  4. Online Learning Tyrannosaurus rex

  5. Online Learning Triceratops

  6. Online Learning Velociraptor Tyrannosaurus rex

  7. Formal Setting – Binary Classification • Instances • Images, Sentences • Labels • Parse tree, Names • Prediction rule • Linear prediction rules • Loss • No. of mistakes
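
The formulas on this slide were images and did not survive transcription. A minimal LaTeX sketch of the setting as described, where the symbols x_t, y_t and w are my own choice of notation:

    x_t \in \mathbb{R}^n, \qquad y_t \in \{-1, +1\}
    \hat{y}_t = \mathrm{sign}(w \cdot x_t) \qquad \text{(linear prediction rule)}
    \mathrm{Loss} = \sum_{t} \mathbf{1}[\hat{y}_t \neq y_t] \qquad \text{(number of mistakes)}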

  8. Predictions • Discrete Predictions: • Hard to optimize • Continuous predictions: • Label • Confidence

  9. Loss Functions • Natural Loss: • Zero-One loss: • Real-valued-prediction losses: • Hinge loss: • Exponential loss (Boosting) • Log loss (Max Entropy, Boosting)
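
The loss formulas themselves were images in the original slides. The standard definitions, written with the confidence value y (w · x) of the previous slide (the notation is mine):

    \ell_{0\text{-}1}(w; (x, y)) = \mathbf{1}[\, y\,(w \cdot x) \le 0 \,]
    \ell_{\mathrm{hinge}}(w; (x, y)) = \max\{0,\ 1 - y\,(w \cdot x)\}
    \ell_{\exp}(w; (x, y)) = \exp\bigl(-y\,(w \cdot x)\bigr)
    \ell_{\log}(w; (x, y)) = \log\bigl(1 + \exp(-y\,(w \cdot x))\bigr)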

  10. Loss Functions [figure: hinge loss and zero-one loss plotted against the margin]

  11. Online Framework • Initialize Classifier • Algorithm works in rounds • On each round the online algorithm: • Receives an input instance • Outputs a prediction • Receives a feedback label • Computes loss • Updates the prediction rule • Goal: • Suffer small cumulative loss
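
The protocol above maps directly onto code. A minimal Python sketch; the data stream, dimension and update-rule arguments are placeholders introduced here for illustration, not part of the slides:

    import numpy as np

    def online_learning(stream, dim, update):
        """Generic online protocol: predict, receive feedback, suffer loss, update."""
        w = np.zeros(dim)                        # initialize classifier
        mistakes = 0                             # cumulative zero-one loss
        for x, y in stream:                      # each round: receive an input instance x
            y_hat = 1.0 if w @ x >= 0 else -1.0  # output a prediction
            mistakes += int(y_hat != y)          # receive the feedback label y, compute loss
            w = update(w, x, y)                  # update the prediction rule
        return w, mistakes

The goal of the analysis in the following slides is to show that the mistake count stays small relative to the loss of the best fixed classifier in hindsight.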

  12. Notation Abuse Linear Classifiers • Any Features • W.l.o.g. • Binary Classifiers of the form

  13. Linear Classifiers (cont.) • Prediction: • Confidence in prediction:

  14. Margin • Margin of an example with respect to the classifier: • Note: • The set is separable iff there exists a classifier attaining a positive margin on every example
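
A LaTeX sketch of the definitions this slide refers to, using the unnormalized margin convention y (w · x) (the slides may instead normalize by the norm of w):

    \gamma\bigl((x, y); w\bigr) = y\,(w \cdot x)
    \{(x_t, y_t)\}_{t=1}^{T} \ \text{is separable} \iff \exists\, w \ \text{such that}\ y_t\,(w \cdot x_t) > 0 \ \text{for all } t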

  15. Geometrical Interpretation

  16. Geometrical Interpretation

  17. Geometrical Interpretation

  18. Geometrical Interpretation [figure: examples labeled by margin, from margin << 0 and margin < 0 to margin > 0 and margin >> 0]

  19. Hinge Loss

  20. Separable Set

  21. Inseparable Sets

  22. Degree of Freedom - I The same geometrical hyperplane can be represented by many parameter vectors

  23. Degree of Freedom - II Problem difficulty does not change if we shrink or expand the input space
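
Both degrees of freedom can be written in one line each; a sketch under the linear-classifier notation assumed above:

    \mathrm{sign}\bigl((c\,w) \cdot x\bigr) = \mathrm{sign}(w \cdot x) \quad \text{for every } c > 0 \qquad \text{(same hyperplane, many parameter vectors)}
    x_t \mapsto c\,x_t \ \Rightarrow\ R \mapsto c\,R, \ \gamma \mapsto c\,\gamma \qquad \text{(the ratio } R/\gamma \text{, and hence problem difficulty, is unchanged)}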

  24. Why Online Learning? • Fast • Memory efficient - process one example at a time • Simple to implement • Formal guarantees – Mistake bounds • Online to Batch conversions • No statistical assumptions • Adaptive • Not as good as a well-designed batch algorithm

  25. Update Rules • Online algorithms are based on an update rule which defines the next classifier from the current one (and possibly other information) • Linear Classifiers: find the next weight vector from the current one based on the input example • Some Update Rules: • Perceptron (Rosenblatt) • ALMA (Gentile) • ROMMA (Li & Long) • NORMA (Kivinen et al.) • MIRA (Crammer & Singer) • EG (Littlestone and Warmuth) • Bregman Based (Warmuth)

  26. Three Update Rules • The Perceptron Algorithm: • Agmon 1954; Rosenblatt 1958-1962, Block 1962, Novikoff 1962, Minsky & Papert 1969, Freund & Schapire 1999, Blum & Dunagan 2002 • Hildreth's Algorithm: • Hildreth 1957 • Censor & Zenios 1997 • Herbster 2002 • Loss Scaled: • Crammer & Singer 2001 • Crammer & Singer 2002

  27. The Perceptron Algorithm • If No-Mistake • Do nothing • If Mistake • Update • Margin after update:
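
A Perceptron update in the same sketch style as the loop after slide 11; the function name and the mistake test y (w · x) ≤ 0 are my reading of the slide:

    def perceptron_update(w, x, y):
        """Perceptron rule: do nothing unless the example is misclassified."""
        if y * (w @ x) <= 0:   # mistake (non-positive margin)
            return w + y * x   # update: add the signed instance
        return w               # no mistake: do nothing

After such an update the margin on the triggering example increases by ||x||^2, since y ((w + y x) · x) = y (w · x) + ||x||^2; this is presumably the "margin after update" computation on the slide. Passing this function as the update argument of the earlier online_learning sketch runs the full algorithm.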

  28. Geometrical Interpretation

  29. Relative Loss Bound • For any competitor prediction function • We bound the cumulative loss suffered by the algorithm in terms of the loss suffered by the competitor [annotated terms: cumulative loss suffered by the algorithm; sequence of prediction functions; cumulative loss of competitor]

  30. Relative Loss Bound • For any competitor prediction function • We bound the cumulative loss suffered by the algorithm in terms of the loss suffered by the competitor [annotated terms: inequality; possibly large gap; regret / extra loss; competitiveness ratio]

  31. Relative Loss Bound • For any competitor prediction function • We bound the cumulative loss suffered by the algorithm in terms of the loss suffered by the competitor [annotated terms: grows with T; grows with T; constant]

  32. Relative Loss Bound • For any competitor prediction function • We bound the cumulative loss suffered by the algorithm in terms of the loss suffered by the competitor [annotation: best prediction function in hindsight for the data sequence]
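
The bound on slides 29-32 was an image; its generic shape, reconstructed from the annotations above (the symbols w_t, u and C are mine):

    \sum_{t=1}^{T} \ell\bigl(w_t; (x_t, y_t)\bigr) \;\le\; C \cdot \sum_{t=1}^{T} \ell\bigl(u; (x_t, y_t)\bigr) \;+\; \mathrm{Regret}(T)

Here w_1, ..., w_T is the algorithm's sequence of prediction functions, u is any competitor (in particular the best prediction function in hindsight for the data sequence), C is the competitiveness ratio, and the extra regret term should grow more slowly than T (on slide 31 it is annotated as a constant).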

  33. Remarks • If the input is inseparable, then the problem of finding a separating hyperplane which attains fewer than M errors is NP-hard (Open Hemisphere) • Obtaining a zero-one loss bound with a unit competitiveness ratio is as hard as finding a constant-factor approximation for the Open Hemisphere problem • Instead, we bound the number of mistakes the perceptron makes by the hinge loss of any competitor

  34. Definitions • Any Competitor • The parameter vector can be chosen using the input data • The parameterized hinge loss of the competitor on an example • True hinge loss • 1-norm and 2-norm of the hinge losses
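
A plausible reconstruction of the missing formulas, with u the competitor and γ a margin parameter (the notation is mine):

    \ell_{\gamma}\bigl(u; (x_t, y_t)\bigr) = \max\{0,\ \gamma - y_t\,(u \cdot x_t)\} \qquad \text{(parameterized hinge loss; } \gamma = 1 \text{ gives the true hinge loss)}
    L_1(u) = \sum_{t} \ell_{\gamma}\bigl(u; (x_t, y_t)\bigr), \qquad L_2(u) = \Bigl(\sum_{t} \ell_{\gamma}\bigl(u; (x_t, y_t)\bigr)^2\Bigr)^{1/2}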

  35. Geometrical Assumption • All examples are bounded in a ball of radius R

  36. Perceptron’s Mistake Bound • Bounds: • If the sample is separable then:
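
The bound itself was an image. In the separable case the statement is the classical Novikoff bound, written here with the radius R of slide 35 and the margin γ attained by a unit-norm separating competitor (a reconstruction, not a quote of the slide):

    \#\{\text{mistakes}\} \;\le\; \left(\frac{R}{\gamma}\right)^{2}

The inseparable-case bounds replace exact separability with the hinge-loss quantities L_1 and L_2 of slide 34; for instance, Freund & Schapire (1999) prove, for unit-norm u, a bound of the form ((R + L_2(u))/γ)^2, stated here from memory, so the exact form should be checked against the paper.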

  37. [FS99, SS05] Proof - Intuition • Two views: • The angle between the competitor and the current classifier decreases with each mistake • The following sum is fixed; as we make more mistakes, our solution is better
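
A sketch of the standard two-quantity argument behind this intuition, assuming a competitor u that separates the data with margin γ, ||x_t|| ≤ R, w_1 = 0, and M mistakes in total (the potential-based [C04] proof on the next slides proceeds somewhat differently):

    u \cdot w_{t+1} = u \cdot w_t + y_t\,(u \cdot x_t) \;\ge\; u \cdot w_t + \gamma \qquad \text{(every mistake turns } w_t \text{ toward } u\text{)}
    \|w_{t+1}\|^2 = \|w_t\|^2 + 2\,y_t\,(w_t \cdot x_t) + \|x_t\|^2 \;\le\; \|w_t\|^2 + R^2 \qquad \text{(the cross term is non-positive on a mistake)}
    \Rightarrow\quad M\gamma \;\le\; u \cdot w_{T+1} \;\le\; \|u\|\,\|w_{T+1}\| \;\le\; \|u\|\,\sqrt{M}\,R \quad\Rightarrow\quad M \;\le\; \bigl(R\,\|u\|/\gamma\bigr)^{2}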

  38. [C04] Proof • Define the potential: • Bound its cumulative sum from above and below

  39. Proof • Bound from above: [equation annotations: telescopic sum; zero vector; non-negative]

  40. Proof • Bound from below: • No error on the t-th round • Error on the t-th round

  41. Proof • We bound each term:

  42. Proof • Bound from below: • No error on the t-th round • Error on the t-th round • Cumulative bound:

  43. Proof • Putting both bounds together: • We use the first degree of freedom (and scale): • Bound:

  44. Proof • General Bound: • Choose: • Simple Bound: [annotation: objective of SVM]

  45. Proof • Better bound: optimize the value of the free parameter

  46. Remarks • The bound does not depend on the dimension of the feature vector • The bound holds for all sequences. It is not tight for most real-world data • But there exists a setting for which it is tight

  47. Three Bounds

  48. Separable Case • Assume there exists a competitor that attains a positive margin on all examples. Then all bounds are equivalent • The Perceptron makes a finite number of mistakes until convergence (not necessarily to that competitor)

  49. Separable Case – Other Quantities • Use the 1st (parameterization) degree of freedom • Scale the competitor appropriately • Define the margin of the data • The bound becomes a function of the radius and the margin (see the sketch below)
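
One concrete instantiation of this slide, under the common convention (my assumption) of scaling the competitor u to unit norm and letting γ be the smallest margin it attains on the data:

    \|u\| = 1, \qquad \gamma = \min_{t}\ y_t\,(u \cdot x_t) > 0 \quad\Longrightarrow\quad \#\{\text{mistakes}\} \;\le\; \left(\frac{R}{\gamma}\right)^{2}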

  50. Separable Case - Illustration
