This tutorial provides an in-depth understanding of online learning methods for binary classification in real-world applications. Covering essential topics like loss functions (hinge, zero-one, and log loss), prediction rules, and update algorithms (Perceptron, MIRA), it highlights the geometric interpretations of these concepts. The course emphasizes the advantages of online learning, including memory efficiency, adaptability, and the absence of statistical assumptions. Practical insights, alongside theoretical foundations, equip learners with the skills to tackle real-world data challenges effectively.
Online Learning for Real-World Problems
Koby Crammer, University of Pennsylvania
Thanks: Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, Yoram Singer, Axel Bernal, Steve Carroll, Mark Dredze, Kuzman Ganchev, Ryan McDonald, Artemis Hatzigeorgiou, Fernando Pereira, Fei Sha, Partha Pratim Talukdar
Tutorial Context: the Online Learning Tutorial draws on SVMs, Real-World Data, Multiclass and Structured prediction, Optimization, and Theory
Online Learning Tyrannosaurus rex
Online Learning Triceratops
Online Learning Velociraptor Tyrannosaurus rex
Formal Setting – Binary Classification
• Instances x ∈ X (e.g. images, sentences)
• Labels y (e.g. parse trees, names); for binary classification, y ∈ {-1, +1}
• Prediction rule h : X → Y; here, linear prediction rules
• Loss: number of mistakes
Predictions
• Discrete predictions: ŷ = sign(w · x) ∈ {-1, +1} – hard to optimize directly
• Continuous predictions: ŷ = w · x ∈ R
  • Label: sign(ŷ)
  • Confidence: |ŷ|
Loss Functions
• Natural loss – zero-one loss: ℓ(ŷ, y) = 1 if sign(ŷ) ≠ y, and 0 otherwise
• Real-valued-prediction losses:
  • Hinge loss: max(0, 1 − y ŷ)
  • Exponential loss: exp(−y ŷ) (Boosting)
  • Log loss: log(1 + exp(−y ŷ)) (Max Entropy, Boosting)
[Figure: hinge loss and zero-one loss plotted against the margin y ŷ; the hinge loss upper-bounds the zero-one loss, equals 1 at margin 0, and vanishes for margins ≥ 1]
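To make these losses concrete, here is a minimal Python sketch (the function names are illustrative, not from the tutorial); each loss takes the real-valued prediction ŷ = w · x and the label y ∈ {-1, +1}:

```python
import numpy as np

def zero_one_loss(y_hat, y):
    """1 if the sign of the prediction disagrees with the label, else 0."""
    return float(np.sign(y_hat) != y)

def hinge_loss(y_hat, y):
    """max(0, 1 - y * y_hat); a convex upper bound on the zero-one loss."""
    return max(0.0, 1.0 - y * y_hat)

def exponential_loss(y_hat, y):
    """exp(-y * y_hat), the loss minimized by boosting."""
    return float(np.exp(-y * y_hat))

def log_loss(y_hat, y):
    """log(1 + exp(-y * y_hat)), the loss behind max-entropy / logistic models."""
    return float(np.log1p(np.exp(-y * y_hat)))
```

On an example classified correctly with margin at least 1 the hinge and zero-one losses are both 0; on a mistake the hinge loss is at least 1, which is what makes it a useful surrogate for counting mistakes.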
Online Framework
• Initialize classifier w_1 = 0
• Algorithm works in rounds; on round t the online algorithm:
  • Receives an input instance x_t
  • Outputs a prediction ŷ_t = sign(w_t · x_t)
  • Receives a feedback label y_t
  • Computes loss ℓ(w_t; (x_t, y_t))
  • Updates the prediction rule w_t → w_{t+1}
• Goal: suffer small cumulative loss Σ_t ℓ(w_t; (x_t, y_t))
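A minimal sketch of this protocol in Python, assuming linear classifiers and the zero-one loss (`online_learn` and `update` are illustrative names, not from the tutorial):

```python
import numpy as np

def online_learn(examples, update, dim):
    """Generic online learning loop for linear binary classifiers.

    examples: iterable of (x, y) pairs, x a length-`dim` vector, y in {-1, +1}
    update:   rule mapping (w, x, y) to the next weight vector
    """
    w = np.zeros(dim)                 # initialize the classifier w_1 = 0
    mistakes = 0                      # cumulative zero-one loss
    for x, y in examples:             # rounds t = 1, 2, ...
        y_hat = np.sign(w @ x)        # output a prediction
        if y_hat != y:                # receive feedback, compute the loss
            mistakes += 1
        w = update(w, x, y)           # update the prediction rule
    return w, mistakes
```

Any of the update rules discussed later (Perceptron, MIRA, and so on) can be plugged in as `update` without changing the loop.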
Linear Classifiers (Notation Abuse)
• Any features: each instance is represented by a feature vector x ∈ R^n
• W.l.o.g.
• Binary classifiers of the form h(x) = sign(w · x)
Linear Classifiers (cont'd)
• Prediction: ŷ = sign(w · x)
• Confidence in prediction: |w · x|
Margin
• Margin of an example (x, y) with respect to the classifier w: y (w · x)
• Note: the margin is positive iff the example is classified correctly
• The set {(x_t, y_t)} is separable iff there exists w such that y_t (w · x_t) > 0 for all t
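A toy numerical example of the margin (the vectors are chosen purely for illustration):

```python
import numpy as np

def margin(w, x, y):
    """Margin of example (x, y) w.r.t. classifier w; positive iff classified correctly."""
    return y * (w @ x)

w = np.array([1.0, 1.0])
x = np.array([2.0, 0.0])
print(margin(w, x, +1))   #  2.0 -> correct, with confidence 2
print(margin(w, x, -1))   # -2.0 -> mistake
```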
[Figure: geometric interpretation of the margin – points on the correct side of the separating hyperplane have margin > 0 (margin >> 0 far from it), points on the wrong side have margin < 0 (margin << 0 far from it)]
Degree of Freedom - I The same geometrical hyperplane can be represented by many parameter vectors
Degree of Freedom - II Problem difficulty does not change if we shrink or expand the input space
Why Online Learning?
• Fast
• Memory efficient – processes one example at a time
• Simple to implement
• Formal guarantees – mistake bounds
• Online-to-batch conversions
• No statistical assumptions
• Adaptive
• But not always as good as a well-designed batch algorithm
Update Rules
• Online algorithms are based on an update rule which defines h_{t+1} from h_t (and possibly other information)
• Linear classifiers: find w_{t+1} from w_t based on the input (x_t, y_t)
• Some update rules:
  • Perceptron (Rosenblatt)
  • ALMA (Gentile)
  • ROMMA (Li & Long)
  • NORMA (Kivinen et al.)
  • MIRA (Crammer & Singer)
  • EG (Littlestone & Warmuth)
  • Bregman-based (Warmuth)
Three Update Rules
• The Perceptron algorithm: Agmon 1954; Rosenblatt 1958–1962; Block 1962; Novikoff 1962; Minsky & Papert 1969; Freund & Schapire 1999; Blum & Dunagan 2002
• Hildreth's algorithm: Hildreth 1957; Censor & Zenios 1997; Herbster 2002
• Loss-scaled updates: Crammer & Singer 2001, 2002
The Perceptron Algorithm
• If no mistake (y_t (w_t · x_t) > 0): do nothing
• If mistake: update w_{t+1} = w_t + y_t x_t
• Margin after update: y_t (w_{t+1} · x_t) = y_t (w_t · x_t) + ||x_t||², so the margin on the mistaken example increases
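A minimal Python sketch of this update rule; it plugs directly into the `online_learn` loop sketched earlier (the function name is illustrative):

```python
import numpy as np

def perceptron_update(w, x, y):
    """Perceptron rule: do nothing on a correct prediction,
    otherwise add y * x to the weight vector."""
    if y * (w @ x) <= 0:       # mistake (non-positive margin)
        return w + y * x
    return w
```

Treating a zero margin as a mistake means the very first example (predicted with w = 0) also triggers an update, which is the usual convention.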
Relative Loss Bound
• For any competitor prediction function u, we bound the cumulative loss suffered by the algorithm (over its sequence of prediction functions w_1, w_2, …) by the cumulative loss of the competitor:
  Σ_t ℓ(w_t; (x_t, y_t)) ≤ C · Σ_t ℓ(u; (x_t, y_t)) + R
• C is the competitiveness ratio and R is the regret (the extra loss); the inequality may hide a possibly large gap
• Both cumulative losses grow with T, while the extra-loss term is constant
• The competitor u can be the best prediction function in hindsight for the data sequence
Remarks
• If the input is inseparable, then the problem of finding a separating hyperplane that attains fewer than M errors is NP-hard (Open Hemisphere problem)
• Obtaining a zero-one loss bound with a unit competitiveness ratio is as hard as finding a constant-factor approximation for the Open Hemisphere problem
• Instead, we bound the number of mistakes the Perceptron makes by the hinge loss of any competitor
Definitions
• Any competitor u: the parameter vector can be chosen using the input data
• The parameterized hinge loss of u on an example (x_t, y_t)
• The true hinge loss
• The 1-norm and 2-norm of the hinge loss over the sequence (these quantities are spelled out in the sketch below)
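A sketch of how these quantities are usually spelled out, assuming the margin parameter is denoted γ and the sequence has T rounds (the exact symbols on the original slides are not preserved in this transcript):

```latex
\ell_\gamma\bigl(u;(x_t,y_t)\bigr) = \max\bigl\{0,\ \gamma - y_t (u \cdot x_t)\bigr\}
\qquad\text{(parameterized hinge loss; the true hinge loss is } \gamma = 1\text{)}

L_{\gamma,1}(u) = \sum_{t=1}^{T} \ell_\gamma\bigl(u;(x_t,y_t)\bigr)
\qquad
L_{\gamma,2}(u) = \Bigl(\sum_{t=1}^{T} \ell_\gamma\bigl(u;(x_t,y_t)\bigr)^2\Bigr)^{1/2}
```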
Geometrical Assumption
• All examples are bounded in a ball of radius R: ||x_t|| ≤ R for all t
Perceptron's Mistake Bound
• Bounds on the number of mistakes M in terms of the competitor's hinge loss
• If the sample is separable with margin γ (i.e. y_t (u · x_t) ≥ γ for all t), then M ≤ R²||u||²/γ²
[FS99, SS05] Proof – Intuition
• Two views:
  • The angle between w_t and the competitor u decreases with every mistake
  • The following sum is fixed; as we make more mistakes, our solution gets better
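One standard way to make the first view precise in the separable case is the classic Novikoff-style argument; this is a sketch under the assumptions y_t (u · x_t) ≥ γ and ||x_t|| ≤ R for all t, with M the number of mistakes, and is not necessarily the exact derivation of the cited slides:

```latex
w_{T+1}\cdot u \;\ge\; M\gamma
\quad\text{(each update adds } y_t\,x_t\cdot u \ge \gamma\text{)}

\|w_{T+1}\|^2 \;\le\; M R^2
\quad\text{(on a mistake, } \|w_{t+1}\|^2 = \|w_t\|^2 + 2 y_t\,w_t\cdot x_t + \|x_t\|^2 \le \|w_t\|^2 + R^2\text{)}

M\gamma \;\le\; w_{T+1}\cdot u \;\le\; \|w_{T+1}\|\,\|u\| \;\le\; \sqrt{M}\,R\,\|u\|
\;\;\Longrightarrow\;\;
M \;\le\; \frac{R^2\|u\|^2}{\gamma^2}
```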
[C04] Proof
• Define the potential
• Bound its cumulative sum from above and below
Proof
• Bound from above: the cumulative sum of the potential telescopes; the first term involves the zero vector w_1 = 0, and the final subtracted term is non-negative
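A sketch of how this upper bound typically goes, assuming the per-round potential is Δ_t = ||w_t − λu||² − ||w_{t+1} − λu||² for some scaling λ > 0 of the competitor; this choice of potential is an assumption made to match the slide's annotations, not a quotation of the original formula:

```latex
\sum_{t=1}^{T} \Delta_t
  = \|w_1 - \lambda u\|^2 - \|w_{T+1} - \lambda u\|^2   % telescopic sum
  \;\le\; \|w_1 - \lambda u\|^2                          % subtracted term is non-negative
  \;=\; \lambda^2 \|u\|^2                                % w_1 is the zero vector
```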
Proof
• Bound from below, round by round:
  • No error on the t-th round: the weights do not change, so the term vanishes
  • Error on the t-th round: expand the term using the update w_{t+1} = w_t + y_t x_t
Proof
• We bound each error-round term using y_t (w_t · x_t) ≤ 0, ||x_t|| ≤ R, and the hinge loss of the competitor
Proof
• Bound from below (cont'd): summing the per-round terms over the M error rounds gives the cumulative lower bound
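Continuing the same sketch with the same assumed potential: on an error round, expanding Δ_t with the update w_{t+1} = w_t + y_t x_t and then using y_t (w_t · x_t) ≤ 0, ||x_t|| ≤ R, and y_t (u · x_t) ≥ γ − ℓ_γ(u;(x_t,y_t)) gives a per-round lower bound, which sums over the M error rounds:

```latex
\Delta_t
 = -2\,y_t (w_t\cdot x_t) + 2\lambda\,y_t (u\cdot x_t) - \|x_t\|^2
 \;\ge\; 2\lambda\gamma - 2\lambda\,\ell_\gamma\bigl(u;(x_t,y_t)\bigr) - R^2

\sum_{t=1}^{T} \Delta_t \;\ge\; M\bigl(2\lambda\gamma - R^2\bigr) - 2\lambda\,L_{\gamma,1}(u)
```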
Proof
• Putting both bounds together: the cumulative lower bound cannot exceed the upper bound
• We use the first degree of freedom and scale the competitor
• This yields a bound on the number of mistakes M
Proof
• General bound: holds for any value of the scaling parameter
• Choosing a particular value gives a simple bound whose right-hand side resembles the objective of SVM (a squared-norm term plus a hinge-loss term)
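Combining the two bounds of this sketch and choosing λ = R²/γ (one convenient choice, consistent with "Choose" above but not necessarily the slide's exact choice) gives a simple bound whose right-hand side is, up to constants, the SVM objective of a squared norm plus a hinge-loss sum:

```latex
M\bigl(2\lambda\gamma - R^2\bigr) - 2\lambda\,L_{\gamma,1}(u) \;\le\; \lambda^2\|u\|^2
\;\;\xrightarrow{\;\lambda = R^2/\gamma\;}\;\;
M \;\le\; \frac{R^2\|u\|^2}{\gamma^2} + \frac{2}{\gamma}\,L_{\gamma,1}(u)
```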
Proof
• Better bound: optimize over the value of the scaling parameter
Remarks
• The bound does not depend on the dimension of the feature vectors
• The bound holds for all sequences; it is not tight for most real-world data
• But there exists a setting for which it is tight
Separable Case
• Assume there exists u such that y_t (u · x_t) ≥ γ > 0 for all examples; then all the bounds are equivalent
• The Perceptron makes a finite number of mistakes until convergence (not necessarily to u)
Separable Case – Other Quantities
• Use the 1st (parameterization) degree of freedom
• Scale u so that y_t (u · x_t) ≥ 1 for all examples
• Define the margin of the scaled separator
• The bound becomes a function of R and this margin (see the sketch below)
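For reference, a sketch of the usual form this takes under the standard normalization (the original slide's exact symbols are not preserved here): scale u so that every example has margin at least 1, define γ = 1/||u||, and note that with zero hinge loss the general bound collapses to the classic separable-case bound:

```latex
\min_t\, y_t (u\cdot x_t) \ge 1,\qquad \gamma = \frac{1}{\|u\|}
\;\;\Longrightarrow\;\;
M \;\le\; R^2\|u\|^2 \;=\; \frac{R^2}{\gamma^2}
```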