
Text Categorization and Images

Thesis Defense for Carl Sable. Committee: Kathleen McKeown, Vasileios Hatzivassiloglou, Shree Nayar, Kenneth W. Church, Shih-Fu Chang.


Presentation Transcript


  1. Thesis Defense for Carl Sable. Committee: Kathleen McKeown, Vasileios Hatzivassiloglou, Shree Nayar, Kenneth W. Church, Shih-Fu Chang. Text Categorization and Images

  2. Text Categorization • Text categorization (TC) refers to the automatic labeling of documents into one or more pre-defined categories, using the natural language text contained in or associated with each document. • Idea: TC techniques can be applied to image captions or articles to label the corresponding images.

  3. Clues for Indoor versus Outdoor: Text (as opposed to visual image features) • Denver Summit of Eight leaders begin their first official meeting in the Denver Public Library, June 21. • The two engines of an Amtrak passenger train lie in the mud at the edge of a marsh after the train, bound for Boston from Washington, derailed on the bank of the Hackensack River, just after crossing a bridge.

  4. Two Paradigms of Research • Machine learning (ML) techniques • Common in the literature • Usually involve the exploration of new algorithms applied to bag of words representations of documents • Novel representation • Rare in the literature • Usually more specific, but often interesting and can lead to substantial improvement • Important for certain tasks involving images!

  5. Contributions • General: • An in-depth exploration of the categorization of images based on associated text • Incorporating research into Newsblaster • Novel machine learning (ML) techniques: • The creation of two novel TC approaches • The combination of high-precision/low-recall rules with other systems • Novel representation: • The integration of NLP and IR • The use of low-level image features

  6. Framework • Collection of Experiments • Various tasks • Multiple techniques • No clear winner for all tasks • Characteristics of tasks often dictate which techniques work best • “No Free Lunch”

  7. Overview • The Main Idea • Description of Corpus • Novel ML Systems • NLP Based System • High-Precision/Low-Recall Rules • Image Features • Newsblaster • Conclusions and Future Work

  8. Corpus • Raw data: • Postings from news related Usenet newsgroups • Over 2000 include embedded captioned images • Data sets: • Multiple sets of categories representing various levels of abstraction • Mutually exclusive and exhaustive categories

  9. Indoor versus Outdoor

  10. Events Categories: Politics, Struggle, Disaster, Crime, Other

  11. Subcategories for Disaster Images: Affected People, Workers Responding, Wreckage, Other (shown within the Events categories: Politics, Struggle, Disaster, Crime, Other)

  12. Disaster Image Categories: Affected People, Workers Responding, Wreckage, Other

  13. Subcategories for Politics Images: Meeting, Announcement, Politician Photographed, Civilians, Military, Other (shown within the Events categories: Politics, Struggle, Disaster, Crime, Other)

  14. Politics Image Categories: Meeting, Announcement, Civilians, Politician Photographed, Military, Other

  15. Collect Labels to Train Systems

  16. Overview • The Main Idea • Description of Corpus • Novel ML Systems • NLP Based System • High-Precision/Low-Recall Rules • Image Features • Newsblaster • Conclusions and Future Work

  17. Two Novel ML Approaches • Density estimation • Applied to the results of some other system • Often improves performance • Always provides probabilistic confidence measures for predictions • BINS • Uses binning to estimate accurate term weights for words with scarce evidence • Extremely competitive for two data sets in my corpus

  18. Density Estimation • First apply a standard system: • For each document, compute a similarity or score for every category. • Apply to training documents as well as test documents. • For each test document: • Find all documents from training set with similar category scores. • Use categories of close training documents to predict categories of test documents.
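The two steps above amount to a nearest-neighbor search in category-score space. A minimal sketch (the Euclidean distance metric, the value of k, and the vote-based confidence are illustrative assumptions, not the thesis's exact procedure):

```python
import math
from collections import Counter

def density_estimation_predict(test_scores, train_docs, k=3):
    """Predict a category for a test document from its category-score
    vector, using the k training documents whose score vectors are
    closest (Euclidean distance is an assumption).

    train_docs: list of (category_score_vector, actual_category) pairs,
    scored by the same base system that scored the test document.
    """
    def dist(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

    # Neighbors are found in category-score space, not word space.
    nearest = sorted(train_docs, key=lambda d: dist(test_scores, d[0]))[:k]
    votes = Counter(cat for _, cat in nearest)
    category, count = votes.most_common(1)[0]
    return category, count / k   # prediction plus a confidence measure
```

With the score vectors from the density-estimation example slide, the three closest training documents are two Crime documents and one Struggle document, so the prediction becomes Crime even though the base Rocchio/TF*IDF system predicted Struggle.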

  19. Density Estimation Example • Category score vector for test document: 100, 40, 30, 90, 10 • Category score vectors for training documents, with actual categories and distances: 85, 35, 25, 95, 20 (Crime) 20.0; 80, 45, 20, 75, 10 (Struggle) 27.4; 90, 25, 50, 110, 25 (Crime) 36.7; 60, 95, 20, 30, 5 (Politics) 91.4; 100, 75, 20, 30, 5 (Struggle) 92.5; 40, 30, 80, 25, 40 (Disaster) 106.4 • Predictions: Rocchio/TF*IDF: Struggle; DE: Crime (probability .679)

  20. Density Estimation Significantly Improves Performance for the Indoor versus Outdoor Data Set

  21. Density Estimation Slightly Degrades Performance for the Events Data Set

  22. Density Estimation Sometimes Improves Performance, Always Provides Confidence Measures • Indoor versus Outdoor • Events: Politics, Struggle, Disaster, Crime, Other

  23. Results of Density Estimation Experiments for the Indoor versus Outdoor Data Set: Results of Density Estimation Experiments for the Events Data Set:

  24. BINS System:Naïve Bayes + Smoothing • Binning: based on smoothing in the speech recognition literature • Not enough training data to estimate term weights for words with scarce evidence • Words with similar statistical features are grouped into a common “bin” • Estimate a single weight for each bin • This weight is assigned to all words in the bin • Credible estimates even for small (or zero) counts

  25. Binning Uses Statistical Features of Words

  26. “plane” • Sparse data • “plane” does not occur in any Indoor training documents • Infinitely more likely to be Outdoor ??? • Assign “plane” to bins of words with similar features (e.g. IDF, category counts) • In first half of training set, “plane” appears in: • 9 Outdoor documents • 0 Indoor documents
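The zero count is the whole problem: a per-word estimate is undefined (or infinite), while pooling the counts of every word in the bin still yields a finite, credible ratio. A toy illustration (only the 9-versus-0 counts for "plane" come from the slide; the other counts and the add-one smoothing are assumptions):

```python
import math

def word_log_ratio(outdoor_count, indoor_count):
    """Per-word log likelihood ratio from raw counts; a zero count
    makes the estimate useless."""
    if indoor_count == 0:
        return math.inf   # "infinitely more likely to be Outdoor"
    return math.log(outdoor_count / indoor_count)

def bin_log_ratio(words_in_bin):
    """Pool the (outdoor, indoor) counts of every word in the bin and
    estimate a single ratio shared by all of them (add-one smoothing
    is an assumption)."""
    out_total = sum(o for o, _ in words_in_bin)
    in_total = sum(i for _, i in words_in_bin)
    return math.log((out_total + 1) / (in_total + 1))
```

The raw estimate for "plane" (9 Outdoor, 0 Indoor) blows up, while the bin estimate stays finite because words with similar statistical features share their evidence.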

  27. Lambdas: Weights • First half of training set: Assign words to bins • Second half of training set: Estimate term weights

  28. Lambdas for “plane”: 4.03 times more likely in an Outdoor document

  29. Binning → Credible Log Likelihood Ratios

  30. Lambdas Decrease with IDF

  31. Methodology of BINS • Divide training set into two halves: • First half used to determine bins for words • Second half used to determine lambdas for bins • For each test document: • Map every word to a bin for each category • Add lambdas, obtaining a score for each category • Switch halves of training and repeat • Combine results and assign each document to category with highest score
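The final classification step above can be sketched in a few lines (a minimal illustration; the bin ids and lambda values below are hypothetical, and the two-half cross-estimation is assumed to have already produced the lookup tables):

```python
def bins_classify(doc_words, word_bin, bin_lambdas, categories):
    """Score a document the BINS way: map every word to its bin for
    each category and add up the bins' lambdas (log likelihood ratios).

    word_bin    : word_bin[cat][word] -> bin id, determined on the
                  first half of the training set
    bin_lambdas : bin_lambdas[cat][bin id] -> lambda, estimated on the
                  second half of the training set
    """
    scores = {cat: sum(bin_lambdas[cat].get(word_bin[cat].get(w), 0.0)
                       for w in doc_words)
              for cat in categories}
    # Assign the document to the category with the highest score.
    return max(scores, key=scores.get), scores
```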

  32. Binning Improves Performance for the Indoor versus Outdoor Data Set

  33. Binning Improves Performance for the Events Data Set

  34. BINS: Robust Version of Naïve Bayes (chart compares BINS with human performance and the baseline) • Indoor versus Outdoor • Events: Politics, Struggle, Disaster, Crime, Other

  35. Combining Bin Weights and Naïve Bayes Weights • Idea: • It might be better to use the Naïve Bayes weight when there is enough evidence for a word • Back off to the bin weight otherwise • BINS allows combinations of weights to be used based on the level of evidence • How can we automatically determine when to use which weights??? • Entropy • Minimum Squared Error (MSE)
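The back-off idea can be written as one small interpolation function (a sketch; the fraction list is illustrative, mirroring the "use bin weight at evidence 0, average at 1, NB weight at 2 or more" scheme rather than the thesis's exact entropy- or MSE-derived values):

```python
def combined_weight(nb_weight, bin_weight, evidence, nb_fraction):
    """Interpolate between the bin weight and the Naive Bayes weight.

    nb_fraction[e] is the fraction of the NB weight to use at evidence
    level e (capped at the last entry). For example, [0, 0.5, 1] means:
    evidence 0 -> bin weight only, 1 -> average, >= 2 -> NB weight only.
    """
    f = nb_fraction[min(evidence, len(nb_fraction) - 1)]
    return f * nb_weight + (1 - f) * bin_weight
```

With enough evidence the word's own Naïve Bayes estimate dominates; with scarce evidence the system backs off to the shared bin estimate.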

  36. Can Provide File to BINS that Specifies How to Combine Weights • Based on Entropy: 0 0.25 0.5 0.75 1 • Based on MSE: 0 0.5 1 • Reading the MSE scheme: use only the bin weight for evidence of 0, average the bin weight and NB weight for evidence of 1, use only the NB weight for evidence of 2 or more

  37. Appropriately Combining the Bin Weight and the Naïve Bayes Weight Leads to the Best Performance Yet • Indoor versus Outdoor • Events: Politics, Struggle, Disaster, Crime, Other

  38. BINS Performs the Best of All Systems Tested (chart compares BINS with SVMs) • Indoor versus Outdoor • Events: Politics, Struggle, Disaster, Crime, Other

  39. How Can We Improve Results? • One idea: Label more documents! • Usually works • Boring • Another idea: Use unlabeled documents! • Easily obtainable • But can this really work??? • Maybe it can…

  40. Binning Using Unlabeled Documents • Apply system to unlabeled documents • Choose documents with “confident” predictions • Each word has new feature: # of unlabeled documents containing the word that are confidently predicted to belong to each category (unlabeled category counts) • Probably less important than regular category counts • Binning provides a natural mechanism for weighting the new feature appropriately
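Extracting the new feature can be sketched as follows (hypothetical names; `predict` stands in for any classifier that returns a category and a confidence):

```python
from collections import defaultdict

def unlabeled_category_counts(unlabeled_docs, predict, cutoff):
    """Build the new binning feature: for every word, count the
    confidently-predicted unlabeled documents per category.

    unlabeled_docs : list of word sets
    predict        : classifier returning (category, confidence)
    cutoff         : minimum confidence for a prediction to be kept
    """
    counts = defaultdict(lambda: defaultdict(int))
    for words in unlabeled_docs:
        category, confidence = predict(words)
        if confidence >= cutoff:      # keep only confident predictions
            for word in words:
                counts[word][category] += 1
    return counts
```

These unlabeled category counts then join IDF and the regular category counts as features for assigning words to bins, and binning itself decides how much weight the new, noisier feature deserves.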

  41. Determining Confident Predictions • BINS computes a score for each category • BINS predicts the category with the highest score • Confidence for the predicted category is its score minus the score of the second-place category • Confidence for a non-predicted category is its score minus the score of the chosen category • Cross-validation experiments can be used to determine a confidence cutoff for each category • Maximize F for the category • A beta of 1 gives precision and recall equal weight; a lower beta weights precision more heavily
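Choosing a per-category cutoff by maximizing F over cross-validation predictions might look like this (a sketch under the assumption that every observed confidence is tried as a candidate cutoff; tie-breaking and the beta choice are left to the caller):

```python
def f_beta(precision, recall, beta=1.0):
    """F measure; beta of 1 weights precision and recall equally,
    a lower beta weights precision more heavily."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def best_cutoff(scored, beta=1.0):
    """Choose the confidence cutoff that maximizes F for one category.

    scored: (confidence, is_correct) pairs from cross-validation;
    predictions at or above the cutoff count as made for the category.
    """
    total_relevant = sum(1 for _, ok in scored if ok)
    best_f, best_c = 0.0, None
    for cutoff in sorted({c for c, _ in scored}):
        kept = [ok for c, ok in scored if c >= cutoff]
        tp = sum(kept)                      # correct predictions kept
        precision = tp / len(kept)
        recall = tp / total_relevant if total_relevant else 0.0
        f = f_beta(precision, recall, beta)
        if f > best_f:
            best_f, best_c = f, cutoff
    return best_c
```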

  42. Use F to Optimize Confidence Cutoffs (example for a single category)

  43. Use F to Optimize Confidence Cutoffs (important region of graph highlighted)

  44. Should the New Feature Matter?

  45. Does the New Feature Help? • No • Why??? • New features add info but make bins smaller • Perhaps more data isn’t needed in the first place • Should more data matter? • Hard to accumulate more labeled data • Easy to try out less labeled data!

  46. Does Size Matter?

  47. Overview • The Main Idea • Description of Corpus • Novel ML Systems • NLP Based System • High-Precision/Low-Recall Rules • Image Features • Newsblaster • Conclusions and Future Work

  48. Disaster Image Categories Affected People Workers Responding Wreckage Other

  49. Performance of Standard Systems Not Very Satisfying

  50. Ambiguity for Disaster Images: Workers Responding vs. Affected People • Actual caption (Workers Responding): Philippine rescuers carry a fire victim March 19 who perished in a blaze at a Manila disco. • Hypothetical alternative caption (Affected People): A fire victim who perished in a blaze at a Manila disco is carried by Philippine rescuers March 19.
