Linear Functions

CS546: Machine Learning and Natural LanguageLecture 7: Introduction to Classification:Linear Learning Algorithms2009

{ 1 if w1 x1 + w2 x2 +. . . wn xn >=  0 Otherwise f (x) = y = x1 x3 x5 • Disjunctions: y = ( 1•x1 + 1•x3 + 1•x5 >= 1) y = at least 2 of {x1 ,x3 , x5} • At least m of n: y = ( 1•x1 + 1•x3 + 1•x5 >=2) Linear Functions y = (x1 x2) v (x1 x2) • Exclusive-OR: y = (x1 x2) v (x3 x4) • Non-trivial DNF:

Linear Functions w ¢ x =  - - - - - - - - - - - - - - - w ¢ x = 0

1 2 T 7 3 4 5 6 Perceptron learning rule • On-line, mistake driven algorithm. • Rosenblatt (1959) suggested that when a target output value is provided for a single neuron with fixed input, it can incrementally change weights and learn to produce the output using the Perceptron learning rule Perceptron == Linear Threshold Unit

Given Labeled examples: • Initialize w=0 • 2. Cycle through all examples • a. Predict the label of instance x to bey’ = sgn{wx) • b. If y’y, update the weight vector: • w = w + r y x(r - a constant, learning rate) • Otherwise, if y’=y, leave weights unchanged. Perceptron learning rule • We learn f:X{-1,+1} represented as f = sgn{wx) Where X= or X= w

Footnote About the Threshold • On previous slide, Perceptron has no threshold • But we don’t lose generality:

Geometric View

Initialize w=0 • 2. Cycle through all examples • a. Predict the label of instance x to bey’ = sgn{wx) • b. If y’y, update the weight vector to • w = w + r y x (r - a constant, learning rate) • Otherwise, if y’=y, leave weights unchanged. Perceptron learning rule • If x is Boolean, only weights of active features are updated.

Perceptron Learnability • Obviously can’t learn what it can’t represent • Only linearly separable functions • Minsky and Papert (1969)wrote an influential book demonstrating Perceptron’s representational limitations • Parity functions can’t be learned (XOR) • In vision, if patterns are represented with local features, can’t represent symmetry, connectivity • Research on Neural Networks stopped for years • Rosenblatt himself (1959) asked, • “What pattern recognition problems can be transformed so as to become linearly separable?”

(x1 x2) v (x3 x4) y1 y2

Perceptron Convergence • Perceptron Convergence Theorem: • If there exist a set of weights that are consistent with the • (I.e., the data is linearly separable) the perceptron learning • algorithm will converge • -- How long would it take to converge ? • Perceptron Cycling Theorem: If the training data is not linearly • the perceptron learning algorithm will eventually repeat the • same set of weights and therefore enter an infinite loop. • -- How to provide robustness, more expressivity ?

Perceptron: Mistake Bound Theorem • Maintains a weight vector wRN, w0=(0,…,0). • Upon receiving an example x  RN • Predicts according to the linear threshold function w•x 0. Theorem [Novikoff,1963] Let (x1; y1),…,: (xt; yt), be a sequence of labeled examples with xiRN, xiRand yi{-1,1} for all i. Let uRN, > 0 be such that, ||u|| = 1 and yi u • xifor all i. Then Perceptron makes at most R2 /  2mistakes on this example sequence. (see additional notes) Margin Complexity Parameter

Perceptron-Mistake Bound Proof: Let vkbe the hypothesis before the k-th mistake. Assume that the k-th mistake occurs on the input example (xi, yi). Assumptions v1 = 0 ||u|| ≤ 1 yi u • xi Multiply by u By definition of u By induction Projection K < R2 /  2

Perceptron for Boolean Functions • How many mistakes will the Perceptron algorithms make • when learning a k-disjunction? • It can make O(n) mistakes on k-disjunction on n attributes. • Our bound:R2 /  2 • w : 1 / k 1/2 – for k components, 0 for others, •  : difference only in one variable : 1 / k ½ • R: n 1/2 • Thus, we get : n k • Is it possible to do better? • This is important if n—the number of features is very large

Winnow Algorithm • The Winnow Algorithm learns Linear Threshold Functions. • For the class of disjunction, • instead of demotion we can use elimination.

Winnow - Example

Winnow - Example • Notice that the same algorithm will learn a conjunction over • these variables (w=(256,256,0,…32,…256,256) )

Winnow - Mistake Bound Claim: Winnow makes O(k log n) mistakes on k-disjunctions u - # of mistakes on positive examples (promotions) v - # of mistakes on negative examples (demotions)

Winnow - Mistake Bound Claim: Winnow makes O(k log n) mistakes on k-disjunctions u - # of mistakes on positive examples (promotions) v - # of mistakes on negative examples (demotions) 1. u < k log(2n)

Winnow - Mistake Bound Claim: Winnow makes O(k log n) mistakes on k-disjunctions u - # of mistakes on positive examples (promotions) v - # of mistakes on negative examples (demotions) 1. u < k log(2n) A weight that corresponds to a good variable is only promoted. When these weights get to n there will no more mistakes on positives

Winnow - Mistake Bound u - # of mistakes on positive examples (promotions) v - # of mistakes on negative examples (demotions) 2. v < 2(u + 1)

Winnow - Mistake Bound u - # of mistakes on positive examples (promotions) v - # of mistakes on negative examples (demotions) 2. v < 2(u + 1) Total weight: TW=n initially

Winnow - Mistake Bound u - # of mistakes on positive examples (promotions) v - # of mistakes on negative examples (demotions) 2. v < 2(u + 1) Total weight: TW=n initially Mistake on positive: TW(t+1) < TW(t) + n

Winnow - Mistake Bound u - # of mistakes on positive examples (promotions) v - # of mistakes on negative examples (demotions) 2. v < 2(u + 1) Total weight TW=n initially Mistake on positive: TW(t+1) < TW(t) + n Mistake on negative: TW(t+1) < TW(t) - n/2

Winnow - Mistake Bound u - # of mistakes on positive examples (promotions) v - # of mistakes on negative examples (demotions) 2. v < 2(u + 1) Total weight TW=n initially Mistake on positive: TW(t+1) < TW(t) + n Mistake on negative: TW(t+1) < TW(t) - n/2 0 < TW < n + u n - v n/2 v < 2(u+1)

Winnow - Mistake Bound u - # of mistakes on positive examples (promotions) v - # of mistakes on negative examples (demotions) # of mistakes: u+ v < 3u + 2 = O(k log n)

Winnow - Extensions • This algorithm learns monotone functions • in Boolean algebra sense • For the general case: • - Duplicate variables • For the negation of variable x, introduce a new variable y. • Learn monotone functions over 2n variables • - Balanced version: • Keep two weights for each variable; effective weight is the difference

Winnow - A Robust Variation • Winnow is robust in the presence of various kinds of noise. • (classification noise, attribute noise) • Importance: sometimes we learn under some distribution • but test under a slightly different one. • (e.g., natural language applications)

Winnow - A Robust Variation • Modeling: • Adversary’s turn: may change the target concept by adding or • removing some variable from the target disjunction. • Cost of each addition move is 1. • Learner’s turn: makes prediction on the examples given, and is • then told the correct answer (according to current target function) • Winnow-R: Same as Winnow, only doesn’t let weights go below 1/2 • Claim: Winnow-R makes O(c log n) mistakes, (c - cost of adversary) • (generalization of previous claim)

Additive update algorithms: Perceptron SVM (not on-line, but a close relative of Perceptron) • Multiplicative update algorithms: Winnow Close relatives: Boosting; Max Entropy Algorithmic Approaches • Focus: Two families of algorithms (one of the on-line representative) Which Algorithm to choose?

Additive weight update algorithm • (Perceptron, Rosenblatt, 1958. Variations exist) Algorithm Descriptions • Multiplicative weight update algorithm (Winnow, Littlestone, 1988. Variations exist)

How to Compare? • Generalization (since the representation is the same) How many examples are needed to get to a given level of accuracy? • Efficiency How long does it take to learn a hypothesis and evaluate it (per-example)? • Robustness; Adaptation to a new domain, ….

Sentence Representation S= I don’t know whether to laugh or cry - Define a set of features: features are relations that hold in the sentence - Map a sentence to its feature-based representation The feature-based representation will give some of the information in the sentence - Use this as an example to your algorithm

Sentence Representation S= I don’t know whether to laugh or cry - Define a set of features: features are relations that hold in the sentence - Conceptually, there are two steps in coming up with a feature-based representation 1. What are the information sources available? Sensors: words, order of words, properties (?) of words 2. What features to construct based on these? Why needed?

Whether Weather Embedding New discriminator in functionally simpler

Domain Characteristics • The number of potential features is very large • The instance space is sparse • Decisions depend on a small set of features (sparse) • Want to learn from a number of examples that is small relative to the dimensionality

Which Algorithm to Choose? • Generalization • Multiplicative algorithms: • Bounds depend on||u||, the separating hyperplane • M =2ln n ||u||12 maxi||x(i)||12/mini(u ¢ x(i))2 • Advantage with few relevant features in concept • Additive algorithms: • Bounds depend on||x|| (Kivinen / Warmuth, ‘95) • M = ||u||2 maxi||x(i)||2/mini(u ¢ x(i))2 • Advantage with few active features per example The l1 norm: ||x||1 = i|xi| The l2 norm: ||x||2 =(1n|xi|2)1/2 The lp norm: ||x||p = (1n|xi|p )1/p The l1 norm: ||x||1 = maxi|xi|

Generalization • Dominated by the sparseness of the function space • Most features are irrelevant • # of examples required by multiplicative algorithms • depends mostly on # of relevant features • (Generalization bounds depend on ||w||;) • Lesser issue: Sparseness of features space: • advantage to additive. Generalization depend on ||x|| • (Kivinen/Warmuth 95); see additional notes.

Mistakes bounds for 10 of 100 of n Function: At least 10 out of fixed 100 variables are active Dimensionality is n Perceptron,SVMs # of mistakes to convergence Winnow n: Total # of Variables (Dimensionality)

Dual Perceptron • We can replace xi ¢ xj with K(xi ,xj) which can be regarded a dot product in some large (or infinite) space • K(x,y) - often can be computed efficiently without computing mapping to this space

Efficiency • Dominatedby the size of the feature space • Most features are functions (e.g., conjunctions) of raw attributes • Additive algorithms allow the use of Kernels • No need to explicitly generate the complex features • Could be more efficient since work is done in the original feature space. • In practice: explicit Kernels (feature space blow-up) is often more efficient.

Practical Issues and Extensions • There are many extensions that can be made to these basic algorithms. • Some are necessary for them to perform well. • Infinite attribute domain • Regularization

Extensions: Regularization • In general – regularization is used to bias the learner in the direction of a low-expressivity (low VC dimension) separator • Thick Separator (Perceptron or Winnow) • Promote if: • w x > + • Demote if: • w x < - w ¢ x =  - - - - - - - - - - - - - - - w ¢ x = 0

SNoW • A learning architecture that supports several linear update rules (Winnow, Perceptron, naïve Bayes) • Allows regularization; voted Winnow/Perceptron; pruning; many options • True multi-class classification • Variable size examples; very good support for large scale domains in terms of number of examples and number of features. • “Explicit” kernels (blowing up feature space). • Very efficient (1-2 order of magnitude faster than SVMs) • Stand alone, implemented in LBJ [Dowload from: http://L2R.cs.uiuc.edu/~cogcomp ]

COLT approach to explaining Learning • No Distributional Assumption • Training Distribution is the same as the Test Distribution • Generalization bounds depend on this view and affects model selection. ErrD(h) < ErrTR(h) + P(VC(H), log(1/±),1/m) • This is also called the “Structural Risk Minimization” principle.

COLT approach to explaining Learning • No Distributional Assumption • Training Distribution is the same as the Test Distribution • Generalization bounds depend on this view and affect model selection. ErrD(h) < ErrTR(h) + P(VC(H), log(1/±),1/m) • As presented, the VC dimension is a combinatorial parameter that is associated with a class of functions. • We know that the class of linear functions has a lower VC dimension than the class of quadratic functions. • But, this notion can be refined to depend on a given data set, and this way directly affect the hypothesis chosen for this data set.

Data Dependent VC dimension • Consider the class of linear functions, parameterized by their margin. • Although both classifiers separate the data, the distance with which the separation is achieved is different: • Intuitively, we can agree that: Large Margin  Small VC dimension

Linear Functions