Text classification Day 35

Presentation Transcript

  1. Text classification, Day 35. LING 681.02 Computational Linguistics, Harry Howard, Tulane University

  2. Course organization • http://www.tulane.edu/~ling/NLP/ LING 681.02, Prof. Howard, Tulane University

  3. Learning to classify text (NLPP §6)

  4. Classification
     • What is it?
     • Supervision
       • A classifier is supervised if it is built on training corpora that contain the correct label for each input.
         • This usually means that the program can calculate an error when the predicted label does not match the correct label.
       • A classifier is unsupervised if it is built on training corpora that do not contain the correct label for each input.
         • There is no way to calculate an error.
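
The error available to a supervised classifier can be made concrete. The following is a minimal sketch, assuming the predicted and correct labels are given as parallel lists; the function name `error_rate` and the toy labels are invented for illustration and do not come from the slides.

```python
def error_rate(predicted, gold):
    """Fraction of inputs whose predicted label differs from the correct one."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("need at least one labeled input")
    mistakes = sum(1 for p, g in zip(predicted, gold) if p != g)
    return mistakes / len(gold)


# Toy labels, invented for illustration.
gold = ["female", "male", "male", "female"]
predicted = ["female", "male", "female", "female"]
print(error_rate(predicted, gold))
```

Without the `gold` list, which is the unsupervised case, this function has nothing to compare against.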

  5. Diagram of supervised classification

  6. Philosophical question
     • Does supervised classification work for most of what you learned spontaneously as a child?
     • No, life does not come neatly labelled.

  7. Algorithm
     • Divide the corpus into three sets:
       • training set
       • test set
       • development (dev-test) set
     • Choose an initial set of features that will be used to classify the corpus.
       • The part of the program that looks for the features in the corpus is called a feature extractor.
     • Train the classifier on the training set.
     • Run it on the development set.
     • Refine the feature extractor based on the errors produced on the development set.
     • Run the improved classifier on the test set.
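
The steps above can be sketched in plain Python. This is only an illustration under stated assumptions: the twelve labeled names, the split sizes, and the toy "most common label per feature value" classifier are all invented here, and a real run would use a much larger corpus and a proper learner (for example `nltk.NaiveBayesClassifier.train`).

```python
import random
from collections import Counter, defaultdict

# Toy labeled corpus, invented for illustration.
LABELED_NAMES = [
    ("Anna", "female"), ("Maria", "female"), ("Sophie", "female"),
    ("Julie", "female"), ("Naomi", "female"), ("Lena", "female"),
    ("Mark", "male"), ("Hugo", "male"), ("Peter", "male"),
    ("Thomas", "male"), ("Robert", "male"), ("Frank", "male"),
]


def extract_features(name):
    """Feature extractor: maps one input to a dict of feature values."""
    return {"last_letter": name[-1].lower()}


def featurize(labeled_names):
    """Pair each input's features with its correct label."""
    return [(extract_features(name), label) for name, label in labeled_names]


def train(train_set):
    """Toy classifier: remember the most common label per feature value."""
    counts = defaultdict(Counter)
    overall = Counter()
    for features, label in train_set:
        counts[features["last_letter"]][label] += 1
        overall[label] += 1
    default = overall.most_common(1)[0][0]  # fallback for unseen values
    table = {value: c.most_common(1)[0][0] for value, c in counts.items()}
    return lambda features: table.get(features["last_letter"], default)


def accuracy(classify, labeled_set):
    correct = sum(1 for feats, label in labeled_set if classify(feats) == label)
    return correct / len(labeled_set)


# Divide the corpus into training, dev-test, and test sets.
names = LABELED_NAMES[:]
random.Random(0).shuffle(names)  # fixed seed so the split is repeatable
train_names, devtest_names, test_names = names[:6], names[6:9], names[9:]

# Train the classifier on the training set.
classify = train(featurize(train_names))

# Run it on the dev-test set and collect the errors that guide refinement.
errors = []
for name, gold in devtest_names:
    guess = classify(extract_features(name))
    if guess != gold:
        errors.append((name, gold, guess))
print("dev-test errors:", errors)

# Only after refining the feature extractor: score once on the test set.
test_accuracy = accuracy(classify, featurize(test_names))
print("test accuracy:", test_accuracy)
```

Keeping the test set out of the refinement loop is the point of the three-way split: the dev-test errors may be inspected freely, while the test set gives an estimate on data that never influenced the feature extractor.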

  8. Choosing the right features
     • Use too few, and the classifier will underfit the data.
       • The classifier is too vague and makes too many mistakes.
     • Use too many, and the classifier will overfit the training data.
       • The classifier is too specific and will not generalize to new examples.
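
The contrast can be illustrated with a deliberately tiny experiment. This is a sketch under assumptions, not a benchmark: the names, the lookup-table classifier, and the three feature extractors are invented for illustration, and the whole-name feature merely stands in for an overly specific feature set.

```python
from collections import Counter, defaultdict

# Toy data, invented for illustration.
TRAIN = [("Anna", "female"), ("Maria", "female"), ("Sophie", "female"),
         ("Julie", "female"), ("Mark", "male"), ("Hugo", "male"),
         ("Peter", "male")]
HELD_OUT = [("Lena", "female"), ("Naomi", "female"), ("Frank", "male"),
            ("Robert", "male"), ("Victor", "male")]


def too_few(name):
    """No features at all: every name looks the same (underfits)."""
    return {}


def last_letter(name):
    """One general feature shared by many names."""
    return {"last_letter": name[-1].lower()}


def too_specific(name):
    """The whole name: memorizes the training set (overfits)."""
    return {"whole_name": name.lower()}


def train(extract, labeled_names):
    """Lookup table from feature combination to its most common label."""
    by_key = defaultdict(Counter)
    overall = Counter()
    for name, label in labeled_names:
        by_key[frozenset(extract(name).items())][label] += 1
        overall[label] += 1
    default = overall.most_common(1)[0][0]  # used for unseen combinations
    table = {key: c.most_common(1)[0][0] for key, c in by_key.items()}
    return lambda name: table.get(frozenset(extract(name).items()), default)


def accuracy(classify, labeled_names):
    return sum(classify(n) == g for n, g in labeled_names) / len(labeled_names)


results = {}
for extract in (too_few, last_letter, too_specific):
    classify = train(extract, TRAIN)
    results[extract.__name__] = (accuracy(classify, TRAIN),
                                 accuracy(classify, HELD_OUT))
    print(extract.__name__, results[extract.__name__])
```

On this toy data the empty extractor does poorly on both sets, the whole-name extractor is perfect on the training names but no better than the empty one on the held-out names, and the last-letter extractor generalizes best.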

  9. Example: gender id
     • What would the features be?
       • A name ending in a, e, or i is likely to be female.
       • A name ending in k, o, r, s, or t is likely to be male.
     • Explain how classification would work.
     • NLTK code pp. 223-4.
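
As a starting point for the discussion, the two rules can be written as a feature extractor plus a hand-coded decision. This sketch is not the listing referenced on the slide; the function names and the "unknown" fallback are assumptions made here, and a trained classifier would learn the letter-to-gender association from labeled names instead of hard-coding it.

```python
FEMALE_ENDINGS = set("aei")
MALE_ENDINGS = set("korst")


def gender_features(name):
    """Feature extractor: the only feature is the name's final letter."""
    return {"last_letter": name[-1].lower()}


def guess_gender(name):
    """Apply the slide's two rules; other endings are left undecided."""
    letter = gender_features(name)["last_letter"]
    if letter in FEMALE_ENDINGS:
        return "female"
    if letter in MALE_ENDINGS:
        return "male"
    return "unknown"


for name in ["Maria", "Hugo", "Trinity"]:
    print(name, gender_features(name), guess_gender(name))
```

The same `gender_features` dictionaries, paired with known labels, are the kind of input a supervised learner such as `nltk.NaiveBayesClassifier.train` expects.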

  10. More examples
      • Classify movie reviews as positive or negative.
        • How?
      • Classify words by part of speech (POS).
        • How?
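
One possible answer to both "How?" questions is to change only the feature extractor. The sketch below assumes a bag-of-words representation for reviews and word-ending features for POS; the vocabulary, the suffix list, and the function names are hypothetical choices for illustration.

```python
def document_features(document_words, vocabulary):
    """Review features: which vocabulary words occur in the document."""
    present = {word.lower() for word in document_words}
    return {f"contains({word})": word in present for word in vocabulary}


def suffix_features(word, suffixes=("ing", "ed", "ly", "s")):
    """POS features: which common endings the word carries."""
    lowered = word.lower()
    return {f"endswith({s})": lowered.endswith(s) for s in suffixes}


# Hypothetical vocabulary; in practice it might be the most frequent corpus words.
vocabulary = ["great", "boring"]
print(document_features("A truly great film".split(), vocabulary))
print(suffix_features("running"))
```

Either dictionary can then be paired with a label (positive/negative, or a POS tag) and handed to the same training procedure as in the gender example.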

  11. Beyond the word
      • Look at a word's context.
        • As we have seen, this is crucial to POS tagging.
      • Classify IMs according to the dialogue acts that they instantiate.
        • What could be some such acts?
          • statement, emotion, yes-no question
        • How?
      • Recognizing textual entailment
        • … is the task of determining whether a given piece of text T entails another text called the "hypothesis" (H).
        • How?
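
Looking at context means that the feature extractor receives the whole sentence and a position, not a bare word. The following is a minimal sketch, assuming tokenized input; the feature names, the `<START>` marker, and the bag-of-words treatment of instant messages are illustrative choices.

```python
def pos_features(sentence, i):
    """Features for the word at position i, including the previous word."""
    word = sentence[i]
    return {
        "suffix(2)": word[-2:].lower(),
        # Context feature: the preceding word, or a marker at sentence start.
        "prev-word": sentence[i - 1].lower() if i > 0 else "<START>",
    }


def dialogue_act_features(post_words):
    """Features for one instant message: the tokens it contains."""
    return {f"contains({word.lower()})": True for word in post_words}


sentence = ["They", "can", "fish"]
print(pos_features(sentence, 0))
print(pos_features(sentence, 2))
print(dialogue_act_features(["Are", "you", "there", "?"]))
```

The `prev-word` feature lets a classifier treat "fish" after "can" differently from "fish" after "the", which a word-internal feature alone cannot do.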

  12. RTE example
      • T: Parviz Davudi was representing Iran at a meeting of the Shanghai Co-operation Organisation (SCO), the fledgling association that binds Russia, China and four former Soviet republics of central Asia together to fight terrorism.
      • H: China is a member of SCO.
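
One simple way to turn this pair into features is to measure how much of the hypothesis is already present in the text. The sketch below is an assumption-laden illustration: the lowercase letter-run tokenizer and the two feature names are chosen for this example, and word overlap is only a rough proxy for entailment.

```python
import re

TEXT = ("Parviz Davudi was representing Iran at a meeting of the Shanghai "
        "Co-operation Organisation (SCO), the fledgling association that "
        "binds Russia, China and four former Soviet republics of central "
        "Asia together to fight terrorism.")
HYPOTHESIS = "China is a member of SCO."


def words(text):
    """Crude tokenizer: the set of lowercase runs of letters."""
    return set(re.findall(r"[a-z]+", text.lower()))


def rte_features(text, hypothesis):
    t, h = words(text), words(hypothesis)
    return {
        "word_overlap": len(h & t),    # hypothesis words found in the text
        "word_hyp_extra": len(h - t),  # hypothesis words missing from the text
    }


print(sorted(words(HYPOTHESIS) & words(TEXT)))
print(sorted(words(HYPOTHESIS) - words(TEXT)))
print(rte_features(TEXT, HYPOTHESIS))
```

Here "china", "sco", "a", and "of" are found in T, while "is" and "member" are not. The intuition is that a hypothesis with few missing words is more likely to be entailed, though membership in this example must still be inferred from "binds".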

  13. Next time: finish NLPP §6; go on to NLPP §7, Extracting info from text.