
Text Categorization Moshe Koppel Lecture 1: Introduction


Presentation Transcript


  1. Text Categorization. Moshe Koppel. Lecture 1: Introduction. Slides based on Manning, Raghavan and Schutze, and odds and ends from here and there.

  2. Text Classification • Text classification (text categorization): assign documents to one or more predefined categories. (Diagram: a collection of documents is mapped to classes class1, class2, …, classn.)

  3. Illustration of Text Classification (diagram with example classes: Science, Sport, Art)

  4. EXAMPLES OF TEXT CATEGORIZATION • LABELS=TOPICS • “finance” / “sports” / “asia” • LABELS=AUTHOR • “Shakespeare” / “Marlowe” / “Ben Jonson” • The Federalist papers • LABELS=OPINION • “like” / “hate” / “neutral” • LABELS=SPAM? • “spam” / “not spam”

  5. Text Classification Framework • Pipeline: Documents → Preprocessing → Features/Indexing → Feature filtering → Applying classification algorithms → Performance measure

  6. Preprocessing • Preprocessing: transform documents into a representation suitable for the classification task • Remove HTML or other tags • Remove stopwords • Perform word stemming (remove suffixes)
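
To make the preprocessing step concrete, here is a minimal Python sketch of the three operations above. The tiny STOPWORDS set and the naive_stem function are illustrative placeholders, not the resources used in any of the experiments discussed later; a real system would use a full stopword list and a proper stemmer such as Porter's.

```python
import re

# Small illustrative stopword list (a placeholder, not a real system's list)
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "it", "for", "was", "were"}

def naive_stem(word):
    # Crude suffix stripping for illustration only; use a real stemmer in practice
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(raw_html):
    text = re.sub(r"<[^>]+>", " ", raw_html)             # remove HTML/SGML tags
    tokens = re.findall(r"[a-z]+", text.lower())          # lowercase word tokens
    tokens = [t for t in tokens if t not in STOPWORDS]    # remove stopwords
    return [naive_stem(t) for t in tokens]                # stem (remove suffixes)

print(preprocess("<p>The delegates were debating farm policies.</p>"))
# ['delegat', 'debat', 'farm', 'polici']
```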

  7. Features + Indexing • Feature types (task dependent) • Weighting measure

  8. Feature types Most crucial decision you’ll make! • Topic: words, phrases, …? • Author: stylistic features • Sentiment: adjectives, …? • Spam: specialized vocabulary

  9. Indexing • Indexing by different weighting schemes: • Boolean weighting • word frequency weighting • tf*idf weighting • entropy weighting
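
A small sketch of the first three weighting schemes on a toy tokenized corpus. The documents are invented, and the tf*idf variant shown uses the common log(N/df) idf; other variants exist.

```python
import math
from collections import Counter

# Toy tokenized corpus (invented documents)
docs = [["pork", "congress", "farm", "farm"],
        ["stock", "market", "farm"],
        ["stock", "market", "trade"]]

vocab = sorted({t for d in docs for t in d})
N = len(docs)
df = {t: sum(t in d for d in docs) for t in vocab}   # document frequency of each term

def boolean_weights(doc):
    # 1 if the term occurs in the document, else 0
    return {t: int(t in doc) for t in vocab}

def tf_weights(doc):
    # raw term frequency
    counts = Counter(doc)
    return {t: counts[t] for t in vocab}

def tfidf_weights(doc):
    # tf * log(N/df): terms frequent in the doc but rare in the corpus score highest
    counts = Counter(doc)
    return {t: counts[t] * math.log(N / df[t]) for t in vocab}

print(tfidf_weights(docs[0]))
```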

  10. Feature Selection • Feature selection: remove non-informative terms from documents => improve classification effectiveness => reduce computational complexity
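
Later slides select the top terms by information gain, so here is a hedged sketch of that kind of scoring for a binary (one-class-vs-rest) labelling. The docs/labels inputs and the cutoff k are placeholders; this is not the exact criterion or implementation used in the experiments below.

```python
import math

def entropy(pos, neg):
    # Entropy of a binary label distribution given the two counts
    total = pos + neg
    h = 0.0
    for c in (pos, neg):
        if c:
            p = c / total
            h -= p * math.log2(p)
    return h

def information_gain(docs, labels, term):
    # docs: list of token sets; labels: parallel list of 0/1 class labels
    with_term = [l for d, l in zip(docs, labels) if term in d]
    without = [l for d, l in zip(docs, labels) if term not in d]
    n = len(labels)
    base = entropy(sum(labels), n - sum(labels))
    cond = (len(with_term) / n) * entropy(sum(with_term), len(with_term) - sum(with_term)) \
         + (len(without) / n) * entropy(sum(without), len(without) - sum(without))
    return base - cond

def select_top_k(docs, labels, k):
    # Keep the k highest-scoring terms (k is a placeholder; the slides use 30 to 10000)
    vocab = set().union(*docs)
    scored = sorted(vocab, key=lambda t: information_gain(docs, labels, t), reverse=True)
    return scored[:k]
```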

  11. Evaluation measures • Contingency table for class ci: TPi (classified ci and truly ci), FPi (classified ci but not ci), FNi (not classified ci but truly ci), TNi (neither) • Precision wrt ci = TPi / (TPi + FPi) • Recall wrt ci = TPi / (TPi + FNi)

  12. Combined effectiveness measures • a classifier should be evaluated by means of a measure which combines recall and precision (why?) • some combined measures: • F1 measure • the breakeven point

  13. F1 measure • F1 is defined as F1 = 2 · precision · recall / (precision + recall) • for the trivial acceptor (every document assigned to the class), precision ≈ 0 and recall = 1, so F1 ≈ 0
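
A short sketch computing per-class precision, recall, and F1 from the contingency counts of slide 11; the counts below are made-up numbers for illustration.

```python
def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0

def f1(p, r):
    return 2 * p * r / (p + r) if p + r else 0.0

# Example contingency counts for one class ci (invented numbers)
tp, fp, fn, tn = 80, 20, 40, 860
p, r = precision(tp, fp), recall(tp, fn)
print(p, r, f1(p, r))   # 0.8, ~0.667, ~0.727
```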

  14. Breakeven point • The breakeven point is the value at which precision equals recall. (Plot: precision against recall; the breakeven point is where the curve meets the precision = recall line.)
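
One common way to approximate the breakeven point is to rank test documents by classifier score, sweep the cutoff, and take the point where precision and recall are (nearly) equal. The scores and labels below are invented, and the simple interpolation is a sketch rather than the exact procedure used in the papers cited later.

```python
def breakeven(scores, labels):
    # Rank documents by score; at each cutoff k, classify the top k as positive
    ranked = sorted(zip(scores, labels), reverse=True)
    total_pos = sum(labels)
    best = None
    tp = 0
    for k, (_, label) in enumerate(ranked, start=1):
        tp += label
        p, r = tp / k, tp / total_pos
        if best is None or abs(p - r) < abs(best[0] - best[1]):
            best = (p, r)
    return (best[0] + best[1]) / 2   # value where precision and recall (nearly) meet

print(breakeven([0.9, 0.8, 0.6, 0.4, 0.2], [1, 0, 1, 1, 0]))   # ~0.667
```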

  15. Multiclass Problem: Micro- vs. Macro-Averaging • If we have more than one class, how do we combine multiple performance measures into one quantity? • Macroaveraging: Compute performance for each class, then average. • Microaveraging: Collect decisions for all classes, compute contingency table, evaluate.
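
A sketch of the difference, using invented per-class contingency counts: macro-averaging gives every class equal weight, so rare classes pull the average down, while micro-averaging pools the counts and is dominated by the large classes.

```python
def f1_from_counts(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

# Per-class (TP, FP, FN) counts -- invented numbers for illustration
counts = {"earn": (900, 50, 100), "grain": (40, 10, 30), "corn": (5, 5, 15)}

# Macro-averaging: compute F1 per class, then average
macro_f1 = sum(f1_from_counts(*c) for c in counts.values()) / len(counts)

# Micro-averaging: pool the counts into one contingency table, then compute F1
tp = sum(c[0] for c in counts.values())
fp = sum(c[1] for c in counts.values())
fn = sum(c[2] for c in counts.values())
micro_f1 = f1_from_counts(tp, fp, fn)

print(round(macro_f1, 3), round(micro_f1, 3))   # macro is pulled down by the rare classes
```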

  16. Experiments • Topic-based categorization • Burst of experiments around 1998 • Content features ~ words • Experiments focused on algorithms • Some focused on feature filtering (next lecture) • Standard corpus: Reuters

  17. Reuters-21578: Typical document
<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="12981" NEWID="798">
<DATE> 2-MAR-1987 16:51:43.42</DATE>
<TOPICS><D>livestock</D><D>hog</D></TOPICS>
<TEXT>
<TITLE>AMERICAN PORK CONGRESS KICKS OFF TOMORROW</TITLE>
<DATELINE> CHICAGO, March 2 - </DATELINE><BODY>The American Pork Congress kicks off tomorrow, March 3, in Indianapolis with 160 of the nations pork producers from 44 member states determining industry positions on a number of issues, according to the National Pork Producers Council, NPPC. Delegates to the three day Congress will be considering 26 resolutions concerning various issues, including the future direction of farm policy and the tax law as it applies to the agriculture sector. The delegates will also debate whether to endorse concepts of a national PRV (pseudorabies virus) control and eradication program, the NPPC said. A large trade show, in conjunction with the congress, will feature the latest in technology in all areas of the industry, the NPPC added. Reuter &#3;</BODY></TEXT></REUTERS>
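
A hedged sketch of pulling the TOPICS, TITLE, and BODY fields out of records like the one above with regular expressions. The file name reut2-000.sgm assumes the standard Reuters-21578 distribution, and real loaders (e.g., the one shipped with NLTK) handle the format more robustly.

```python
import re

def parse_reuters_doc(sgml):
    # Extract topic labels, title, and body text from one <REUTERS> record
    topics_block = re.search(r"<TOPICS>(.*?)</TOPICS>", sgml, re.S)
    topics = re.findall(r"<D>(.*?)</D>", topics_block.group(1)) if topics_block else []
    title = re.search(r"<TITLE>(.*?)</TITLE>", sgml, re.S)
    body = re.search(r"<BODY>(.*?)</BODY>", sgml, re.S)
    return {
        "topics": topics,
        "title": title.group(1).strip() if title else "",
        "body": body.group(1).strip() if body else "",
    }

# Usage: read one of the 22 distribution files and split it into <REUTERS> records
with open("reut2-000.sgm", encoding="latin-1") as f:
    raw = f.read()
docs = [parse_reuters_doc(rec) for rec in re.findall(r"<REUTERS.*?</REUTERS>", raw, re.S)]
print(docs[0]["topics"], docs[0]["title"])
```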

  18. Reuters 21578 • Most (over)used data set (c. 1998) • 21578 documents • Average document length: 200 words • 9603 training, 3299 test articles (ModApte split) • 118 categories • an article can be in > 1 category (average: 1.24) • only about 10 of the 118 categories are large • Common categories (#train, #test): Earn (2877, 1087), Acquisitions (1650, 179), Money-fx (538, 179), Grain (433, 149), Crude (389, 189), Trade (369, 119), Interest (347, 131), Ship (197, 89), Wheat (212, 71), Corn (182, 56)

  19. First Experiment: Yang and Liu • Features: stemmed words (stop words removed) • Indexing: frequency (?) • Feature filtering: top infogain words (1000 to 10000) • Evaluation: macro- and micro-averaged F1

  20. Results: Yang & Liu

  21. Second Experiment: Dumais et al • Features: non-rare words • Indexing: binary • Feature filtering: top infogain words (30 per category) • Evaluation: macro-averaged break-even

  22. Results: Dumais et al. (breakeven values)

  23. Observations: Dumais et al • Features: words + bigrams (no improvement!) • Indexing: frequency instead of binary (no improvement!)

  24. Third Experiment: Joachims • Features: stemmed unigrams (stop words removed) • Indexing: tf*idf • Feature filtering: 1000 top infogain words • Evaluation: micro-averaged break-even
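
This is not Joachims' code or exact setup, but a scikit-learn sketch of the same kind of pipeline: tf*idf indexing, one linear SVM per category, and a micro-averaged score (F1 here rather than breakeven). The run_pipeline function and its inputs (parallel lists of document strings and topic-label lists, e.g., from the Reuters parser sketched earlier) are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

def run_pipeline(train_texts, train_topics, test_texts, test_topics):
    # Binarize the multi-label topic assignments (an article can have >1 topic)
    mlb = MultiLabelBinarizer()
    y_train = mlb.fit_transform(train_topics)
    y_test = mlb.transform(test_topics)

    # tf*idf indexing with stopword removal
    vec = TfidfVectorizer(stop_words="english")
    X_train = vec.fit_transform(train_texts)
    X_test = vec.transform(test_texts)

    # One binary linear SVM per category
    clf = OneVsRestClassifier(LinearSVC())
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)

    # Micro-averaged F1 over all categories
    return f1_score(y_test, pred, average="micro")
```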

  25. Results: Joachims
