Homework 2 is assigned and due on 02/05/2004, following the completion of Homework 1. This session focuses on Logistic Regression and its application in text classification, with an emphasis on relating input/output within a log-linear framework. Key concepts include estimating weights using MLE and understanding the significance of weights in classifying documents. Additionally, comparisons with Naïve Bayes are discussed, with insights on overfitting and regularization in logistic regression. Prepare to explore the extensions of logistic regression to multiple classes and maximum entropy models.
Linear Model (III) Rong Jin
Announcement • Homework 2 is out and is due 02/05/2004 (next Tuesday) • Homework 1 is handed out
Recap: Logistic Regression Model • Assume the inputs and outputs are related through a log-linear function • Estimate weights: MLE approach
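The model and the MLE objective referred to above can be written as follows (a standard formulation; the exact notation on the original slides is assumed):

p(y \mid \mathbf{x}) = \frac{1}{1 + \exp\!\big(-y\,(\mathbf{w}\cdot\mathbf{x} + c)\big)}, \qquad y \in \{+1, -1\}

\{\mathbf{w}^{*}, c^{*}\} = \arg\max_{\mathbf{w},\, c}\; l(D_{\mathrm{train}}) = \arg\max_{\mathbf{w},\, c} \sum_{i=1}^{n} \log p(y_i \mid \mathbf{x}_i)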
Example: Text Classification • Input x: a binary vector • Each word is a different dimension • xi = 0 if the ith word does not appear in the document • xi = 1 if it appears in the document • Output y: interesting document or not • +1: interesting • -1: uninteresting
Example: Text Classification Doc 1 The purpose of the Lady Bird Johnson Wildflower Center is to educate people around the world, … Doc 2 Rain Bird is one of the leading irrigation manufacturers in the world, providing complete irrigation solutions for people…
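As a small sketch of the binary encoding above (the toy vocabulary and the way the two documents are tokenized here are illustrative assumptions, not from the slides):

# Binary bag-of-words encoding: x_i = 1 if the i-th vocabulary word appears in the document, else 0
vocabulary = ["bird", "wildflower", "irrigation", "world", "people"]  # assumed toy vocabulary

def to_binary_vector(document, vocabulary):
    words = set(document.lower().split())
    return [1 if w in words else 0 for w in vocabulary]

doc1 = "the purpose of the lady bird johnson wildflower center is to educate people around the world"
doc2 = "rain bird is one of the leading irrigation manufacturers in the world providing complete irrigation solutions for people"

print(to_binary_vector(doc1, vocabulary))  # [1, 1, 0, 1, 1]
print(to_binary_vector(doc2, vocabulary))  # [1, 0, 1, 1, 1]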
Example 2: Text Classification • Logistic regression model • Every term ti is assigned a weight wi • Learning parameters: MLE approach • Need numerical solutions
Example 2: Text Classification • Weight wi • wi > 0: term ti is positive evidence • wi < 0: term ti is negative evidence • wi = 0: term ti is irrelevant to whether or not the document is interesting • The larger the |wi|, the more important the term ti is in determining whether the document is interesting • Threshold c
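A minimal sketch of training such a classifier and inspecting its weights, using scikit-learn's LogisticRegression (which fits the weights numerically, with regularization by default); the dataset and labels below are made up for illustration, not from the lecture:

import numpy as np
from sklearn.linear_model import LogisticRegression

# X: documents as binary term vectors; y: +1 (interesting) / -1 (uninteresting)
X = np.array([[1, 1, 0, 1, 1],
              [1, 0, 1, 1, 1],
              [0, 1, 0, 0, 1],
              [0, 0, 1, 1, 0]])
y = np.array([+1, -1, +1, -1])

clf = LogisticRegression()   # numerical maximum-likelihood fit (regularized by default)
clf.fit(X, y)

# w_i > 0: positive evidence, w_i < 0: negative evidence, large |w_i|: important term
for term, w in zip(["bird", "wildflower", "irrigation", "world", "people"], clf.coef_[0]):
    print(f"{term}: {w:+.3f}")
print("threshold c:", clf.intercept_[0])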
Example 2: Text Classification • Dataset: Reuters-21578 • Classification accuracy • Naïve Bayes: 77% • Logistic regression: 88%
Why Does Logistic Regression Work Better for Text Classification? • Common words • Small weights in logistic regression • Large weights in naïve Bayes • Weight ~ p(w|+) – p(w|-) • Independence assumption • Naïve Bayes assumes that each word is generated independently • Logistic regression is able to take the correlation between words into account
Comparison • Discriminative Model • Model P(y|x) directly • Model the decision boundary • Usually good performance • But • Slow convergence • Expensive computation • Sensitive to noisy data • Generative Model • Model P(x|y) • Model the input patterns • Usually fast convergence • Cheap computation • Robust to noisy data • But • Usually performs worse
Problems with Logistic Regression? What about words that appear in only one class?
Overfitting Problem with Logistic Regression • Consider a word t that appears in only one document d, and d is a positive document. Let w be its associated weight • Consider the derivative of l(Dtrain) with respect to w: it is always positive, so maximizing the likelihood drives w to infinity!
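A sketch of the argument, using the log-likelihood written after the recap slide above (x^{(t)} denotes the component of x corresponding to word t):

\frac{\partial\, l(D_{\mathrm{train}})}{\partial w}
  = \sum_{i=1}^{n} y_i\, x_i^{(t)} \bigl[\, 1 - p(y_i \mid \mathbf{x}_i) \,\bigr]
  = 1 - p(+1 \mid \mathbf{x}_d) > 0 \quad \text{for every finite } w,

since x^{(t)} = 1 only for the single positive document d. The log-likelihood therefore keeps increasing as w grows, and the MLE pushes w toward infinity.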
Solution: Regularization • Regularized log-likelihood • Large weights → small weights • Prevents weights from being too large • Small weights → zero • Sparse weights
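One common form of the regularized objective (whether the original slides use the squared L2 penalty shown here or an L1 penalty, which drives small weights exactly to zero, is an assumption):

l_{\mathrm{reg}}(D_{\mathrm{train}}) = \sum_{i=1}^{n} \log p(y_i \mid \mathbf{x}_i) \;-\; \frac{s}{2}\,\|\mathbf{w}\|_2^2, \qquad s > 0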
Why Do We Need a Sparse Solution? • Two types of solutions • Many non-zero weights, but many of them are small • Only a small number of non-zero weights, and many of them are large • Occam's Razor: the simpler, the better • A simpler model that fits the data is unlikely to be a coincidence • A complicated model that fits the data might be a coincidence • Smaller number of non-zero weights → less evidence to consider → simpler model → case 2 is preferred
Finding Optimal Solutions • Concave objective function • No local maximum • Many standard optimization algorithms work
The two terms of the regularized objective correspond to the prediction errors and to preventing the weights from being too large. Gradient Ascent • Maximize the log-likelihood by iteratively adjusting the parameters in small increments • In each iteration, we adjust w in the direction that increases the log-likelihood (toward the gradient)
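A minimal numpy sketch of this procedure for the regularized objective above (the step size, regularization strength, and stopping tolerance are illustrative choices):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_ascent(X, y, s=0.1, lr=0.01, max_iter=1000, tol=1e-5):
    """Maximize sum_i log p(y_i|x_i) - (s/2)||w||^2 by gradient ascent; labels y are in {-1, +1}."""
    n, d = X.shape
    w, c = np.zeros(d), 0.0
    for _ in range(max_iter):
        residual = y * sigmoid(-y * (X @ w + c))   # y_i * (1 - p(y_i | x_i))
        grad_w = X.T @ residual - s * w            # gradient with respect to w
        grad_c = residual.sum()                    # gradient with respect to the threshold c
        if np.hypot(np.linalg.norm(grad_w), grad_c) < tol:
            break                                  # gradient is essentially zero: stop (see the stopping slide below)
        w += lr * grad_w                           # small step in the gradient direction
        c += lr * grad_c
    return w, c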
Graphical Illustration • No regularization case
When Should We Stop? • The gradient ascent learning method converges when there is no incentive to move the parameters in any particular direction (the gradient is essentially zero) • In many cases, this can be very tricky: a small first-order derivative does not necessarily mean we are close to the maximum point
Extend Logistic Regression Model to Multiple Classes • y ∈ {1, 2, …, C} • How to extend the above definition to the case when the number of classes is more than 2?
Conditional Exponential Model • It is simple! • Ensure that the probabilities sum to 1
Conditional Exponential Model • Prediction probability • Model parameters: • For each class y, we have a weight vector wy and a threshold cy • Maximum likelihood estimation • Any problem with the above optimization problem?
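Written out (a standard form of the conditional exponential model; the notation is assumed to match the slides):

p(y \mid \mathbf{x}) = \frac{\exp(\mathbf{w}_y \cdot \mathbf{x} + c_y)}{\sum_{y'=1}^{C} \exp(\mathbf{w}_{y'} \cdot \mathbf{x} + c_{y'})},
\qquad
\{\mathbf{w}_y^{*}, c_y^{*}\} = \arg\max_{\{\mathbf{w}_y,\, c_y\}} \sum_{i=1}^{n} \log p(y_i \mid \mathbf{x}_i)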
Conditional Exponential Model • Adding the same constant vector to every weight vector leaves the log-likelihood unchanged (the numerator and denominator of p(y|x) are multiplied by the same factor), so the parameters are not uniquely determined • Usually set w1 to be a zero vector and c1 to be zero
Maximum Entropy Model: Motivation • Consider a translation example • English ‘in’ → French {dans, en, à, au cours de, pendant} • Goal: estimate p(dans), p(en), p(à), p(au-cours-de), p(pendant) • Case 1: no prior knowledge of the translation • What is your guess of the probabilities? • p(dans) = p(en) = p(à) = p(au-cours-de) = p(pendant) = 1/5 • Case 2: 30% of the time, either dans or en is used • What is your guess of the probabilities? • p(dans) = p(en) = 3/20, p(à) = p(au-cours-de) = p(pendant) = 7/30 • The most uniform distribution is favored
Maximum Entropy Model: Motivation • Case 3: 30% use dans or en, and 50% use dans or à • What is your guess of the probabilities? • How to measure the uniformity of any distribution?
Maximum Entropy Principle (MaxEnt) • The uniformity of a distribution is measured by its entropy • Solution: p(dans) = 0.2, p(à) = 0.3, p(en) = 0.1, p(au-cours-de) = 0.2, p(pendant) = 0.2
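A small sketch comparing the entropy of distributions that all satisfy the Case 3 constraints (the alternative distributions below are made up for illustration; higher entropy means more uniform):

import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p + 1e-12))   # entropy in nats; the small constant avoids log(0)

# order: dans, en, à, au-cours-de, pendant
stated   = [0.2, 0.1, 0.3, 0.2, 0.2]    # solution given on the slide
skewed_1 = [0.3, 0.0, 0.2, 0.25, 0.25]  # also satisfies dans+en = 0.3 and dans+à = 0.5
skewed_2 = [0.0, 0.3, 0.5, 0.1, 0.1]    # another constraint-satisfying distribution

for name, p in [("stated", stated), ("skewed_1", skewed_1), ("skewed_2", skewed_2)]:
    print(name, round(entropy(p), 3))    # the more uniform "stated" distribution has the highest entropy of the three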
MaxEnt for Classification Problems • Require the first-order moments to be consistent between the empirical data and the model prediction • No assumption about the parametric form of the likelihood • Usually only assume it is Cⁿ continuous • What is the solution for p(y|x)?
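Formally, this can be stated as the following constrained optimization (a standard MaxEnt formulation; here \tilde{p} denotes empirical frequencies and the f_i are feature functions, notation assumed rather than taken from the slides):

\max_{p}\; H(p) = -\sum_{\mathbf{x}} \tilde{p}(\mathbf{x}) \sum_{y} p(y \mid \mathbf{x}) \log p(y \mid \mathbf{x})

subject to

\sum_{\mathbf{x}} \tilde{p}(\mathbf{x}) \sum_{y} p(y \mid \mathbf{x})\, f_i(\mathbf{x}, y)
  = \sum_{\mathbf{x}, y} \tilde{p}(\mathbf{x}, y)\, f_i(\mathbf{x}, y) \;\; \text{for every feature } f_i,
\qquad \sum_{y} p(y \mid \mathbf{x}) = 1 .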
Solution to MaxEnt • Surprisingly, the solution is just the conditional exponential model without thresholds • Why?
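A sketch of why, via Lagrange multipliers (the standard argument, written in the notation assumed above): introduce a multiplier \lambda_i for each moment constraint and one multiplier per normalization constraint, and set the derivative of the Lagrangian with respect to p(y \mid \mathbf{x}) to zero. This gives

p(y \mid \mathbf{x}) = \frac{\exp\bigl(\sum_i \lambda_i f_i(\mathbf{x}, y)\bigr)}{\sum_{y'} \exp\bigl(\sum_i \lambda_i f_i(\mathbf{x}, y')\bigr)},

which is exactly the conditional exponential model, with the multipliers \lambda_i playing the role of the weights.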