Presentation Transcript


  1. CS460/626: Natural Language Processing/Speech, NLP and the Web (Lecture 16 – Linear and Logistic Regression) Pushpak Bhattacharyya, CSE Dept., IIT Bombay, 14th Feb, 2011

  2. Least Square Method: fitting a line (following Manning and Schütze, Foundations of Statistical NLP, 1999) • Given a set of N points (x1,y1), (x2,y2), …, (xN,yN) • Find a line f(x) = mx + b that best fits the data • m and b are the parameters to be found • W: <m, b> is the weight vector • The line that best fits the data is the one that minimizes the sum of the squares of the distances from the points to the line
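The quantity being minimized, referred to on the next slide as SS(m, b), appears as an image on the original slide; the standard sum-of-squared-residuals form it corresponds to is:

SS(m, b) = \sum_{i=1}^{N} \bigl( y_i - (m x_i + b) \bigr)^2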

  3. Values of m and b • Partial differentiation of SS(m, b) with respect to b and m yields, respectively, the two estimates given below
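The resulting estimates appear as images on the original slide; setting the two partial derivatives to zero gives the standard least-squares solution:

b = \bar{y} - m\,\bar{x}, \qquad m = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{N} (x_i - \bar{x})^2}

where \bar{x} and \bar{y} denote the means of the x and y values.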

  4. Example (Manning and Schütze, FSNLP, 1999)
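The worked example from the book is not reproduced in the transcript; the following is a minimal sketch, with made-up data, of computing m, b, and SS(m, b) using the formulas above:

```python
# Minimal least-squares line-fitting sketch (illustrative data, not the
# textbook's example, which is not included in the transcript).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

# Closed-form estimates from the previous slide
m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - m * x.mean()

ss = np.sum((y - (m * x + b)) ** 2)   # sum of squared residuals SS(m, b)
print(f"m = {m:.3f}, b = {b:.3f}, SS = {ss:.3f}")
```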

  5. Implication of the “line” fitting (figure: points 1, 2, 3, 4 with projections A, B, C, D on the fitted line, which passes through a reference point O) • 1, 2, 3, 4 are the points • A, B, C, D are their projections on the fitted line • Suppose 1, 2 form one class and 3, 4 another class • Of course, it is easy to set up a hyperplane that separates 1 and 2 from 3 and 4 • That would be classification in 2 dimensions • But suppose we form another attribute of these points, viz., the distances of their projections on the line from “O” • Then the points can be classified by a threshold on these distances • This effectively is classification in the reduced dimension (1 dimension); see the sketch following this slide
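A minimal sketch of this reduced-dimension classification, with hypothetical point coordinates and line parameters (none of these numbers come from the slide's figure): each point is projected onto the fitted line, its signed distance from a reference point O along the line becomes the single new attribute, and a threshold on that attribute separates the classes.

```python
# Sketch: classify 2-D points by the position of their projection on a line
# (coordinates, line parameters, and threshold are all hypothetical).
import numpy as np

points = np.array([[1.0, 1.1], [1.5, 1.4], [4.0, 4.2], [4.5, 4.6]])  # points 1-4
labels = np.array([0, 0, 1, 1])                                      # two classes

m, b = 1.0, 0.1                                    # fitted line y = m*x + b
direction = np.array([1.0, m]) / np.hypot(1.0, m)  # unit vector along the line
origin = np.array([0.0, b])                        # "O": reference point on the line

# Reduced 1-D attribute: distance of each projection from O along the line
proj_dist = (points - origin) @ direction

threshold = proj_dist.mean()                       # any value between the clusters works
predicted = (proj_dist > threshold).astype(int)
print(proj_dist.round(2), predicted, labels)
```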

  6. When the dimensionality is more than 2 • Let X be the matrix of input vectors: M × N (M input vectors with N features) • yj = w0 + w1·xj1 + w2·xj2 + w3·xj3 + … + wn·xjn • Find the weight vector W: <w0, w1, w2, w3, …, wn> • It can be shown that the weight vector has a closed-form solution (see below)
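The closed-form solution the slide alludes to appears as an image in the original; for a design matrix X (with a leading column of 1s for w0) and target vector Y, the standard normal-equation result is:

W = (X^{\top} X)^{-1} X^{\top} Y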

  7. The multivariate data

      f1    f2    f3    f4    f5   …   fn       y
      x11   x12   x13   x14   x15  …   x1n      y1
      x21   x22   x23   x24   x25  …   x2n      y2
      x31   x32   x33   x34   x35  …   x3n      y3
      x41   x42   x43   x44   x45  …   x4n      y4
      …
      xm1   xm2   xm3   xm4   xm5  …   xmn      ym

  8. Logistic Regression • Linear regression: predicting a real-valued outcome • Classification: output takes a value from a small set of discrete values • Simplest classification: two classes (0/1 or true/false) • Predict the class and also give the probability of belonging to the class

  9. Linear to logistic regression • P(y=true|x) = Σi=0..n wi × fi = w·f • But this is not a legal probability value! It ranges from –∞ to +∞ • Instead, predict the ratio of the probability of being in the class to the probability of not being in the class • Odds ratio: if an event has probability 0.75 of occurring and probability 0.25 of not occurring, we say the odds of it occurring are 0.75/0.25 = 3

  10. Odds Ratio (following Jurafsky and Martin, Speech and Language Processing, 2009) • The ratio of probabilities can lie between 0 and ∞, but the RHS (w·f) lies between –∞ and +∞ • Introduce the log • Then derive the expression for p(y=true|x), as sketched below
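The derivation steps appear as images on the slide; the standard chain they correspond to is:

\ln \frac{p(y=\text{true} \mid x)}{1 - p(y=\text{true} \mid x)} = w \cdot f
\;\Rightarrow\;
\frac{p}{1 - p} = e^{w \cdot f}
\;\Rightarrow\;
p(y=\text{true} \mid x) = \frac{e^{w \cdot f}}{1 + e^{w \cdot f}} = \frac{1}{1 + e^{-w \cdot f}}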

  11. Logistic function for p(y=true|x) • The form of p() is called the logistic function • It maps values from –∞ to +∞ into the interval (0, 1)
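A small sketch of the logistic function turning the linear score w·f into a probability; the weight and feature values are made up for illustration and are not from the lecture:

```python
# Sketch: logistic function mapping a linear score to a probability in (0, 1).
import numpy as np

def logistic(z):
    """Map any real-valued score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.5, -1.2, 2.0])   # hypothetical weight vector (w0, w1, w2)
f = np.array([1.0, 0.3, 0.8])    # hypothetical features (f0 = 1 for the bias)

score = w @ f                    # w . f, ranges over (-inf, +inf)
p_true = logistic(score)         # p(y = true | x), lies in (0, 1)
print(f"score = {score:.3f}, p(y=true|x) = {p_true:.3f}")
```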

  12. Classification using logistic regression • For an example to belong to the true class we require p(y=true|x) > p(y=false|x) • This gives a simple condition on w·f • In other words, the decision reduces to a linear threshold • Equivalent to placing a hyperplane to separate the two classes (see the derivation sketched below)
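The intermediate formulas are images on the slide; the standard derivation of the decision rule is:

p(y=\text{true} \mid x) > p(y=\text{false} \mid x)
\;\Longleftrightarrow\;
\frac{p}{1 - p} > 1
\;\Longleftrightarrow\;
e^{w \cdot f} > 1
\;\Longleftrightarrow\;
w \cdot f > 0

so the decision boundary w · f = 0 is a hyperplane separating the two classes.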

  13. Learning in logistic regression • In linear regression we minimized the sum of squared errors (SSE) • In logistic regression, we use maximum likelihood estimation • Choose the weights such that the conditional probability p(y|x) of the training data is maximized

  14. Steps of learning w • For a particular <x, y> pair, maximize p(y|x) • For all <x, y> pairs, maximize the product of the conditional probabilities over the training set • Working with the log converts the product into a sum • This can be converted to a per-example log-likelihood objective • Substituting the values of the Ps (the logistic expressions) gives the objective in terms of w; see the sketch below
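The equations on this slide are images in the original; a sketch of the standard maximum-likelihood derivation they outline, with y ∈ {0, 1} and p = p(y=1 | x), is:

\hat{w} = \arg\max_{w} \prod_{(x, y)} p(y \mid x)
        = \arg\max_{w} \sum_{(x, y)} \log p(y \mid x)
        = \arg\max_{w} \sum_{(x, y)} \bigl[\, y \log p + (1 - y) \log (1 - p) \,\bigr],
\qquad p = \frac{1}{1 + e^{-w \cdot f}}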
