
Linear Model (III)



Presentation Transcript


  1. Linear Model (III) Rong Jin

  2. Announcement • Homework 2 is out and is due 02/05/2004 (next Tuesday) • Homework 1 is handed out

  3. Recap: Logistic Regression Model • Assume the inputs and outputs are related through a log-linear function • Estimate weights: MLE approach
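
  The equations on this slide were images and did not survive the transcript. The following is a sketch of the standard binary logistic regression model and its maximum-likelihood objective, using w for the weight vector, c for the threshold, and Dtrain for the training set (notation assumed from the later slides):

      p(y \mid \mathbf{x}) = \frac{1}{1 + \exp\left(-y\,(\mathbf{w}\cdot\mathbf{x} + c)\right)}, \qquad y \in \{+1, -1\}

      (\mathbf{w}^{*}, c^{*}) = \arg\max_{\mathbf{w}, c} \; l(D_{\mathrm{train}}) = \arg\max_{\mathbf{w}, c} \sum_{i} \log p(y_i \mid \mathbf{x}_i)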

  4. Example: Text Classification • Input x: a binary vector • Each word is a different dimension • xi = 0 if the i-th word does not appear in the document; xi = 1 if it does • Output y: interesting document or not • +1: interesting • -1: uninteresting
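
  As a concrete illustration of the binary vector described above, here is a minimal Python sketch; the vocabulary and the whitespace tokenizer are assumptions for the example, not the preprocessing used in the lecture:

      # Binary bag-of-words representation: x[i] = 1 iff the i-th vocabulary
      # word appears in the document, 0 otherwise (tokenization is illustrative).
      def binary_vector(document, vocabulary):
          words = set(document.lower().split())
          return [1 if term in words else 0 for term in vocabulary]

      vocab = ["bird", "wildflower", "irrigation", "world"]
      doc = "Rain Bird is one of the leading irrigation manufacturers in the world"
      print(binary_vector(doc, vocab))  # [1, 0, 1, 1]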

  5. Example: Text Classification • Doc 1: The purpose of the Lady Bird Johnson Wildflower Center is to educate people around the world, … • Doc 2: Rain Bird is one of the leading irrigation manufacturers in the world, providing complete irrigation solutions for people…

  6. Example 2: Text Classification • Logistic regression model • Every term ti is assigned a weight wi • Learning parameters: MLE approach • Requires numerical solution

  7. Example 2: Text Classification • Weight wi • wi > 0: term ti is positive evidence • wi < 0: term ti is negative evidence • wi = 0: term ti is irrelevant to whether or not the document is interesting • The larger |wi| is, the more important term ti is in determining whether the document is interesting • Threshold c

  8. Example 2: Text Classification • Dataset: Reuters-21578 • Classification accuracy • Naïve Bayes: 77% • Logistic regression: 88%

  9. Why Does Logistic Regression Work Better for Text Classification? • Common words • Small weights in logistic regression • Large weights in naïve Bayes • Weight ~ p(w|+) – p(w|-) • Independence assumption • Naïve Bayes assumes that each word is generated independently • Logistic regression is able to take into account the correlation between words

  10. Comparison • Discriminative Model • Model P(y|x) directly • Model the decision boundary • Usually good performance • But • Slow convergence • Expensive computation • Sensitive to noisy data • Generative Model • Model P(x|y) • Model the input patterns • Usually converges fast • Cheap computation • Robust to noisy data • But • Usually performs worse

  11. Problems with Logistic Regression? How about words that appear in only one class?

  12. Overfitting Problem with Logistic Regression • Consider a word t that appears in only one document d, and d is a positive document. Let w be its associated weight • Consider the derivative of l(Dtrain) with respect to w • w will go to infinity!
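
  The derivative on this slide was an image; a reconstruction under the logistic model sketched after slide 3 goes as follows. If word t has weight w and appears only in the positive document d, then

      \frac{\partial\, l(D_{\mathrm{train}})}{\partial w}
          = \sum_{i} y_i\, x_{i,t} \bigl(1 - p(y_i \mid \mathbf{x}_i)\bigr)
          = 1 - p(+1 \mid \mathbf{x}_d) \;>\; 0

  so the derivative stays strictly positive for every finite w: the likelihood keeps increasing as w grows, and the maximum-likelihood estimate pushes w to infinity.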

  13. Solution: Regularization • Regularized log-likelihood • Large weights → small weights • Prevent weights from being too large • Small weights → zero • Sparse weights
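
  The regularized objective on this slide was also an image. A common l2-regularized form consistent with the bullets (the regularization constant s is an assumption, not necessarily the lecture's notation) is

      l_{\mathrm{reg}}(D_{\mathrm{train}}) = \sum_{i} \log p(y_i \mid \mathbf{x}_i) \;-\; \frac{s}{2}\, \|\mathbf{w}\|^{2}

  where the penalty term keeps the weights from growing without bound.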

  14. Why Do We Need a Sparse Solution? • Two types of solutions • Many non-zero weights, most of them small • Only a small number of non-zero weights, many of them large • Occam's Razor: the simpler the better • A simpler model that fits the data is unlikely to do so by coincidence • A complicated model that fits the data might do so by coincidence • Fewer non-zero weights → less evidence to consider → simpler model → the second type of solution is preferred

  15. Occam's Razor

  16. Occam's Razor: Power = 1

  17. Occam's Razor: Power = 3

  18. Occam’s Razor: Power = 10
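
  The three "Power" slides were plots. As a rough stand-in (the toy data and random seed are assumptions, not the lecture's figures), this sketch shows how raising the polynomial degree drives the training error down while the fitted curve grows more complicated, which is the point of the Occam's Razor discussion:

      # Fit polynomials of power 1, 3, and 10 to noisy toy data and report
      # the training error; higher powers fit the points ever more closely.
      import numpy as np

      rng = np.random.default_rng(0)
      x = np.linspace(0.0, 1.0, 12)
      y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)  # assumed toy data

      for power in (1, 3, 10):
          coeffs = np.polyfit(x, y, deg=power)                 # least-squares polynomial fit
          training_error = np.sum((np.polyval(coeffs, x) - y) ** 2)
          print(f"power = {power:2d}, training error = {training_error:.4f}")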

  19. Finding Optimal Solutions • Concave objective function • Any local maximum is also a global maximum • Many standard optimization algorithms work

  20. Gradient Ascent • The regularized objective trades off prediction errors against preventing weights from being too large • Maximize the log-likelihood by iteratively adjusting the parameters in small increments • In each iteration, adjust w in the direction that increases the log-likelihood (along the gradient)
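
  A minimal Python sketch of the gradient-ascent procedure described above, assuming the l2-regularized objective sketched after slide 13; the learning rate, regularization constant, and stopping tolerance are illustrative choices, not values from the lecture:

      # Gradient ascent for l2-regularized logistic regression.
      # X: (n, d) matrix of binary term vectors; y: labels in {+1, -1}.
      import numpy as np

      def sigmoid(z):
          return 1.0 / (1.0 + np.exp(-z))

      def fit_logistic(X, y, s=0.1, lr=0.1, tol=1e-6, max_iter=10_000):
          n, d = X.shape
          w = np.zeros(d)
          c = 0.0
          for _ in range(max_iter):
              residual = y * (1.0 - sigmoid(y * (X @ w + c)))   # y_i * (1 - p(y_i | x_i))
              grad_w = X.T @ residual - s * w                   # gradient of the regularized objective
              grad_c = residual.sum()
              w += lr * grad_w / n                              # small step along the gradient
              c += lr * grad_c / n
              if np.abs(grad_w).max() < tol and abs(grad_c) < tol:   # slide 23: stop when the gradient ~ 0
                  break
          return w, c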

  21. Graphical Illustration: no regularization case

  22. Graphical Illustration: with regularization vs. without regularization, over iterations

  23. When Should We Stop? • The gradient ascent learning method converges when there is no incentive to move the parameters in any particular direction • In many cases, this can be very tricky • Small first-order derivative → close to the maximum point

  24. Extend Logistic Regression Model to Multiple Classes • y ∈ {1, 2, …, C} • How to extend the above definition to the case when the number of classes is more than 2?

  25. Conditional Exponential Model • It is simple! • Ensure the probabilities sum to 1

  26. Conditional Exponential Model • Prediction probability • Model parameters: for each class y, we have a weight vector wy and a threshold cy • Maximum likelihood estimation • Any problem with the above optimization problem?
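
  The equations on slides 25 and 26 were images. The usual form of the conditional exponential model matching these bullets, with a weight vector wy and threshold cy per class, is

      p(y \mid \mathbf{x}) = \frac{\exp(\mathbf{w}_y \cdot \mathbf{x} + c_y)}{\sum_{y'=1}^{C} \exp(\mathbf{w}_{y'} \cdot \mathbf{x} + c_{y'})}

  and maximum likelihood estimation again maximizes \sum_i \log p(y_i \mid \mathbf{x}_i) over all the wy and cy.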

  27. Conditional Exponential Model • Adding a constant vector to every weight vector leaves the log-likelihood function unchanged • Usually set w1 to be the zero vector and c1 to be zero
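
  To see the invariance mentioned above (u and b are illustrative symbols for the added constants), note that the common factor cancels between numerator and denominator:

      \frac{\exp\bigl((\mathbf{w}_y + \mathbf{u}) \cdot \mathbf{x} + c_y + b\bigr)}
           {\sum_{y'} \exp\bigl((\mathbf{w}_{y'} + \mathbf{u}) \cdot \mathbf{x} + c_{y'} + b\bigr)}
      = \frac{\exp(\mathbf{w}_y \cdot \mathbf{x} + c_y)}{\sum_{y'} \exp(\mathbf{w}_{y'} \cdot \mathbf{x} + c_{y'})}
      = p(y \mid \mathbf{x})

  which is why one class's parameters can be pinned to zero without loss of generality.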

  28. Maximum Entropy Model: Motivation • Consider a translation example • English 'in' → French {dans, en, à, au cours de, pendant} • Goal: estimate p(dans), p(en), p(à), p(au-cours-de), p(pendant) • Case 1: no prior knowledge about the translation • What is your guess of the probabilities? • p(dans) = p(en) = p(à) = p(au-cours-de) = p(pendant) = 1/5 • Case 2: 30% of the time, either dans or en is used • What is your guess of the probabilities? • p(dans) = p(en) = 3/20, p(à) = p(au-cours-de) = p(pendant) = 7/30 • The uniform distribution is favored

  29. Maximum Entropy Model: Motivation • Case 3: 30% of the time dans or en is used, and 50% of the time dans or à is used • What is your guess of the probabilities? • How do we measure the uniformity of a distribution?

  30. Maximum Entropy Principle (MaxEnt) • The uniformity of a distribution is measured by its entropy • Solution: p(dans) = 0.2, p(à) = 0.3, p(en) = 0.1, p(au-cours-de) = 0.2, p(pendant) = 0.2
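
  The entropy formula on this slide was an image; the standard definition, and the constrained problem behind the translation example, can be written as

      H(p) = -\sum_{x} p(x) \log p(x)

      \max_{p} \; H(p) \quad \text{s.t.} \quad p(\text{dans}) + p(\text{en}) = 0.3, \quad p(\text{dans}) + p(\text{à}) = 0.5, \quad \sum_{x} p(x) = 1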

  31. MaxEnt for Classification Problems • Require the first-order moments to be consistent between the empirical data and the model prediction • No assumption about the parametric form of the likelihood • Usually assume it is C^n continuous • What is the solution for p(y|x)?
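
  A common way to write the MaxEnt program implied by these bullets (the empirical distributions are denoted \tilde{p}, an assumed notation) is

      \max_{p(y \mid \mathbf{x})} \;\; -\sum_{\mathbf{x}} \tilde{p}(\mathbf{x}) \sum_{y} p(y \mid \mathbf{x}) \log p(y \mid \mathbf{x})

      \text{s.t.} \quad \sum_{\mathbf{x}} \tilde{p}(\mathbf{x})\, p(y \mid \mathbf{x})\, \mathbf{x} = \sum_{\mathbf{x}} \tilde{p}(\mathbf{x}, y)\, \mathbf{x} \;\; \text{for each class } y, \qquad \sum_{y} p(y \mid \mathbf{x}) = 1

  i.e., the first-order moment of the input features under the model matches the empirical first-order moment for every class.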

  32. Solution to MaxEnt • Surprisingly, the solution is just the conditional exponential model without thresholds • Why?
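
  Writing the Lagrangian of the program above and setting its derivative with respect to p(y | x) to zero gives a solution of the form

      p(y \mid \mathbf{x}) = \frac{\exp(\mathbf{w}_y \cdot \mathbf{x})}{\sum_{y'} \exp(\mathbf{w}_{y'} \cdot \mathbf{x})}

  where each wy collects the Lagrange multipliers of class y's moment constraints; this is exactly the conditional exponential model of slide 25 with the thresholds dropped. (This is a sketch of the standard derivation, not a reproduction of the slide's own algebra.)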

  33. Solution to MaxEnt
