
Text Categorization

Text Categorization. Rong Jin. Pre-given categories and labeled document examples (categories may form a hierarchy). Classify new documents. A standard supervised learning problem.


Presentation Transcript


  1. Text Categorization Rong Jin

  2. Text Categorization • Pre-given categories and labeled document examples (categories may form a hierarchy) • Classify new documents • A standard supervised learning problem [Diagram: documents pass through a categorization system and are assigned to categories such as Sports, Business, Education, Science]

  3. Yahoo Shopping Categories

  4. Spam Filtering • Two categories: spam or ham • Automatically decide the category for each incoming email

  5. Text Categorization in IR • Many search engine functions are based on TC • Language identification (English vs. French etc.) • Detecting spam pages (spam vs. nonspam) • Detecting sexually explicit content (sexually explicit vs. not) • Sentiment detection: positive or negative review • Vertical search – restrict search to a “vertical” like “related to health” (relevant to vertical vs. not)

  6. Text Categorization (TC) • Given: • A fixed set of categories C = {c1, c2, …, cJ} • The categories are human-defined for the needs of an application (e.g., spam vs. non-spam). • A set of labeled documents (i.e., training data)

  7. Text Categorization (TC) • Given: • A fixed set of categories C = {c1, c2, …, cJ} • The categories are human-defined for the needs of an application (e.g., spam vs. non-spam). • A set of labeled documents (i.e., training data) • Predict the categories for new documents (i.e., test documents)
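The setup on this slide can be sketched as a small interface: training takes labeled documents and returns a function from a document to a category. The sketch below is only an illustration. The type names, the toy documents, and the majority-class baseline are assumptions that do not come from the slides; the baseline is used only because it is the simplest classifier that fits the interface.

```python
from collections import Counter
from typing import Callable, List, Tuple

# Hypothetical type names, introduced only for this illustration.
Document = str
Category = str
LabeledDoc = Tuple[Document, Category]


def train_majority_baseline(training_data: List[LabeledDoc]) -> Callable[[Document], Category]:
    """Return a classifier that always predicts the most frequent training category."""
    counts = Counter(label for _, label in training_data)
    majority = counts.most_common(1)[0][0]
    # The returned classifier ignores the document content entirely.
    return lambda doc: majority


# Toy training data (assumed, not from the slides).
training_data = [
    ("win cash now", "spam"),
    ("cheap pills online", "spam"),
    ("meeting at noon", "non-spam"),
]

classify = train_majority_baseline(training_data)
prediction = classify("lunch tomorrow?")  # category predicted for a new (test) document
```

Any real classifier, such as the K-nearest neighbor classifier covered later in the deck, fits the same shape: labeled documents in, a document-to-category function out.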

  8. Text Categorization (TC) [Figure with two parts labeled "Given" and "Prediction"]

  9. K-Nearest Neighbor Classifier [Figure panels: k=4 and k=1]

  10. K-Nearest Neighbor Classifier • Keep all training examples • Find k examples that are most similar to the new document (“nearest neighbor” documents) • Assign the category that is most common in these nearest neighbor documents (neighbors vote for the category) [Figure panels: k=4 and k=1]
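The three steps on this slide can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the slides do not say how documents are represented or how similarity is measured, so the bag-of-words counts, the cosine similarity, the function names, and the toy documents are all choices made for illustration.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercased, whitespace-tokenized term counts (an assumed representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors (an assumed similarity)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_classify(doc, training_docs, k):
    """Keep all training docs, find the k most similar, and let them vote."""
    query = bag_of_words(doc)
    ranked = sorted(
        training_docs,
        key=lambda pair: cosine(query, bag_of_words(pair[0])),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy training documents (assumed, not from the slides).
training_docs = [
    ("the team won the football match", "Sports"),
    ("the striker scored a late goal", "Sports"),
    ("the company reported strong quarterly profit", "Business"),
    ("shares fell after the merger news", "Business"),
]

label = knn_classify("the football team scored a goal", training_docs, k=3)
```

With k=3, the two Sports documents and one Business document are the nearest neighbors here, so the vote yields "Sports".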

  11. K-Nearest Neighbor Classifier • Implementation issue • Searching for the nearest neighbors can be time consuming when the number of training documents is large • Efficiency can be improved by using a text search engine to retrieve the nearest neighbors [Diagram: the training docs and their class labels are indexed into a database; the test doc is submitted to the search engine, which returns D1 (C1), D113 (C2), D1001 (C2); the predicted class is C2]
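One way to picture the search-engine idea is a tiny inverted index: only training documents that share at least one term with the test document are ever scored. The sketch below is an assumption-laden illustration, not the system in the diagram. The function names, the term-overlap ranking, and the toy documents are hypothetical; a production search engine would use a proper ranking function.

```python
from collections import Counter, defaultdict


def build_index(training_docs):
    """Map each term to the ids of the training documents that contain it."""
    index = defaultdict(set)
    for doc_id, (text, _label) in enumerate(training_docs):
        for term in set(text.lower().split()):
            index[term].add(doc_id)
    return index


def search(index, query, top_n):
    """Rank candidate documents by how many distinct query terms they share."""
    overlap = Counter()
    for term in set(query.lower().split()):
        for doc_id in index.get(term, ()):
            overlap[doc_id] += 1
    return [doc_id for doc_id, _ in overlap.most_common(top_n)]


def classify_with_index(index, training_docs, query, k):
    """Vote among the class labels of the top-k retrieved documents."""
    hits = search(index, query, k)
    votes = Counter(training_docs[doc_id][1] for doc_id in hits)
    return votes.most_common(1)[0][0] if votes else None


# Toy training docs with class labels (assumed, not from the slides).
training_docs = [
    ("stock market report", "C1"),
    ("football match report", "C2"),
    ("football goal highlights", "C2"),
    ("bond yields rise", "C1"),
]

index = build_index(training_docs)
predicted = classify_with_index(index, training_docs, "football match report goal", k=3)
```

In this toy run the retrieved documents carry the labels C2, C2, and C1, so the vote gives C2, mirroring the shape of the example in the diagram. The fourth training document shares no term with the query and is never touched, which is where the savings come from.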

  12. K-Nearest Neighbor Classifier • Large K • Small variance: prediction is less sensitive to the given set of training documents • Large bias: prediction is less sensitive to the document content [Figure panels: k=4 and k=1]

  13. K-Nearest Neighbor Classifier • Small K • Large variance: prediction is sensitive to the given set of training documents • Small bias: prediction is sensitive to the document content [Figure panels: k=4 and k=1]
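The trade-off between small and large K can be seen on a toy example. The sketch below assumes a deliberately simplistic similarity (the number of shared words) and hypothetical training documents; neither comes from the slides. It compares K=1 with K equal to the size of the training set, and then removes one training document to show how the K=1 prediction depends on the particular training set.

```python
from collections import Counter


def overlap(a, b):
    """Number of distinct words two texts share (an assumed, simplistic similarity)."""
    return len(set(a.lower().split()) & set(b.lower().split()))


def knn_classify(doc, training_docs, k):
    """Vote among the k training documents that share the most words with doc."""
    ranked = sorted(training_docs, key=lambda pair: overlap(doc, pair[0]), reverse=True)
    return Counter(label for _, label in ranked[:k]).most_common(1)[0][0]


# Toy training documents (assumed): three Sports, one Business.
training_docs = [
    ("football match tonight", "Sports"),
    ("tennis final result", "Sports"),
    ("basketball season opener", "Sports"),
    ("stock market rally", "Business"),
]

query = "stock market update"

# Small K: the prediction follows the single closest document (small bias).
small_k = knn_classify(query, training_docs, k=1)

# Large K: every document votes, so the majority class wins whatever the query says (large bias).
large_k = knn_classify(query, training_docs, k=4)

# Dropping one training document flips the K=1 prediction (large variance).
small_k_reduced = knn_classify(query, training_docs[:3], k=1)
```

Here `small_k` is "Business" and `large_k` is "Sports", while removing the single Business document changes the K=1 answer to "Sports".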

  14. K-Nearest Neighbor Classifier • Cross validation to determine K • Split the labeled documents into a training set (80%) and a validation set (20%) • For each K in a given range • Predict the categories for docs in the validation set using the documents in the training set • Compute the classification error (i.e., the percentage of documents in the validation set that are misclassified) • Choose the K with the smallest classification error
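The procedure on this slide can be sketched as follows. This is a minimal sketch with several assumptions that are not in the slides: the word-overlap similarity, the function names, the fixed shuffle seed, and the toy documents are all illustrative. The error is reported as a fraction rather than a percentage.

```python
import random
from collections import Counter


def overlap(a, b):
    """Number of distinct words two texts share (an assumed, simplistic similarity)."""
    return len(set(a.lower().split()) & set(b.lower().split()))


def knn_classify(doc, training_docs, k):
    """Vote among the k training documents that share the most words with doc."""
    ranked = sorted(training_docs, key=lambda pair: overlap(doc, pair[0]), reverse=True)
    return Counter(label for _, label in ranked[:k]).most_common(1)[0][0]


def choose_k(labeled_docs, k_values, train_fraction=0.8, seed=0):
    """Split the labeled docs, score each K on the validation set, return the best K."""
    docs = list(labeled_docs)
    random.Random(seed).shuffle(docs)  # seeded shuffle so the split is repeatable
    cut = int(len(docs) * train_fraction)
    train, validation = docs[:cut], docs[cut:]
    errors = {}
    for k in k_values:
        wrong = sum(knn_classify(text, train, k) != label for text, label in validation)
        errors[k] = wrong / len(validation)  # fraction of misclassified validation docs
    best_k = min(errors, key=errors.get)  # on ties, the first K tried wins
    return best_k, errors


# Toy labeled documents (assumed, not from the slides).
labeled_docs = [
    ("football match goal", "Sports"),
    ("tennis match final", "Sports"),
    ("football league table", "Sports"),
    ("goal scored in final", "Sports"),
    ("tennis league season", "Sports"),
    ("stock market rally", "Business"),
    ("market shares fall", "Business"),
    ("company profit report", "Business"),
    ("stock profit forecast", "Business"),
    ("company shares report", "Business"),
]

best_k, errors = choose_k(labeled_docs, k_values=range(1, 6))
```

With only ten documents the validation set holds two examples, so the error estimates are very coarse; the sketch is meant to show the loop structure, not to produce a meaningful K.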

  15. Cross Validation for K • K=1, error = 10 • K=2, error = 5 • K=3, error = 2 • K=4, error = 4 • K=5, error = 7 • Choose K = 3 (smallest validation error) [Diagram: the training set (80%) is used to predict the validation set (20%)]
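The selection step for the numbers on this slide amounts to taking the K with the minimum validation error. The snippet below copies the error values from the slide as given, since the slide does not state their unit; the variable names are illustrative.

```python
# Validation errors per K, copied from the slide (unit not stated there).
errors = {1: 10, 2: 5, 3: 2, 4: 4, 5: 7}

# Pick the K whose validation error is smallest.
best_k = min(errors, key=errors.get)
```

This selects K = 3, matching the slide.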
