Thesis Defense for Carl Sable. Committee: Kathleen McKeown, Vasileios Hatzivassiloglou, Shree Nayar, Kenneth W. Church, Shih-Fu Chang. Text Categorization and Images
Text Categorization • Text categorization (TC) is the automatic labeling of documents with one or more pre-defined categories, based on the natural language text contained in or associated with each document. • Idea: TC techniques can be applied to image captions or articles to label the corresponding images.
Clues for Indoor versus Outdoor: Text (as opposed to visual image features) • Denver Summit of Eight leaders begin their first official meeting in the Denver Public Library, June 21. • The two engines of an Amtrak passenger train lie in the mud at the edge of a marsh after the train, bound for Boston from Washington, derailed on the bank of the Hackensack River, just after crossing a bridge.
Two Paradigms of Research • Machine learning (ML) techniques • Common in the literature • Usually involve the exploration of new algorithms applied to bag of words representations of documents • Novel representation • Rare in the literature • Usually more specific, but often interesting and can lead to substantial improvement • Important for certain tasks involving images!
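The bag-of-words representation mentioned above can be sketched as follows (a minimal Python illustration; the function name and tokenization rules are assumptions, and real systems typically add stemming, stop-word removal, and TF*IDF weighting):

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, split on whitespace, strip surrounding
    punctuation, and count the remaining terms. A document is then
    represented only by these term counts, ignoring word order."""
    tokens = (w.strip(".,;:!?\"'").lower() for w in text.split())
    return Counter(t for t in tokens if t)

# Example: a news caption reduced to term counts
caption = "Denver Summit of Eight leaders begin their first official meeting"
counts = bag_of_words(caption)  # counts["denver"] == 1
```

Most ML approaches in the literature operate on exactly this kind of representation, which is why they are easy to swap across tasks.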
Contributions • General: • An in-depth exploration of the categorization of images based on associated text • Incorporating research into Newsblaster • Novel machine learning (ML) techniques: • The creation of two novel TC approaches • The combination of high-precision/low-recall rules with other systems • Novel representation: • The integration of NLP and IR • The use of low-level image features
Framework • Collection of Experiments • Various tasks • Multiple techniques • No clear winner for all tasks • Characteristics of tasks often dictate which techniques work best • “No Free Lunch”
Overview • The Main Idea • Description of Corpus • Novel ML Systems • NLP Based System • High-Precision/Low-Recall Rules • Image Features • Newsblaster • Conclusions and Future Work
Corpus • Raw data: • Postings from news-related Usenet newsgroups • Over 2000 include embedded captioned images • Data sets: • Multiple sets of categories representing various levels of abstraction • Mutually exclusive and exhaustive categories
Categories: Indoor, Outdoor
Events Categories: Politics, Struggle, Disaster, Crime, Other
Subcategories for Disaster Images (within the Events category Disaster): Affected People, Workers Responding, Wreckage, Other
Subcategories for Politics Images (within the Events category Politics): Meeting, Announcement, Politician Photographed, Civilians, Military, Other
Overview • The Main Idea • Description of Corpus • Novel ML Systems • NLP Based System • High-Precision/Low-Recall Rules • Image Features • Newsblaster • Conclusions and Future Work
Two Novel ML Approaches • Density estimation • Applied to the results of some other system • Often improves performance • Always provides probabilistic confidence measures for predictions • BINS • Uses binning to estimate accurate term weights for words with scarce evidence • Extremely competitive for two data sets in my corpus
Density Estimation • First apply a standard system: • For each document, compute a similarity or score for every category. • Apply to training documents as well as test documents. • For each test document: • Find all documents from training set with similar category scores. • Use categories of close training documents to predict categories of test documents.
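The two steps above can be sketched as a nearest-neighbor vote in category-score space (a hedged illustration; the thesis's actual density estimator and its distance and weighting scheme may differ, and the function name is hypothetical):

```python
import math
from collections import Counter

def density_estimate(test_scores, train, k=3):
    """train: list of (category_score_vector, actual_category) pairs
    for training documents, scored by some standard system.
    Rank training documents by Euclidean distance from the test
    document in category-score space, vote among the k closest, and
    return the winner with its vote fraction as a crude
    probabilistic confidence."""
    dists = sorted((math.dist(test_scores, vec), cat) for vec, cat in train)
    votes = Counter(cat for _, cat in dists[:k])
    category, count = votes.most_common(1)[0]
    return category, count / k

# Example with the score vectors from the slide that follows:
train = [
    ((85, 35, 25, 95, 20), "Crime"),
    ((100, 75, 20, 30, 5), "Struggle"),
    ((40, 30, 80, 25, 40), "Disaster"),
    ((80, 45, 20, 75, 10), "Struggle"),
    ((60, 95, 20, 30, 5), "Politics"),
    ((90, 25, 50, 110, 25), "Crime"),
]
category, confidence = density_estimate((100, 40, 30, 90, 10), train, k=3)
# category == "Crime": two of the three closest training documents are Crime,
# even though a Rocchio/TF*IDF system might rank Struggle highest directly
```

The key point is that the prediction comes with a probability, which the original scoring system alone does not provide.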
Density Estimation Example • Category score vector for the test document: 100, 40, 30, 90, 10 • Category score vectors for the training documents, with actual categories and distances from the test document: • 85, 35, 25, 95, 20 (Crime); distance 20.0 • 80, 45, 20, 75, 10 (Struggle); distance 27.4 • 90, 25, 50, 110, 25 (Crime); distance 36.7 • 60, 95, 20, 30, 5 (Politics); distance 91.4 • 100, 75, 20, 30, 5 (Struggle); distance 92.5 • 40, 30, 80, 25, 40 (Disaster); distance 106.4 • Predictions: Rocchio/TF*IDF: Struggle; DE: Crime (probability .679)
Density Estimation Significantly Improves Performance for the Indoor versus Outdoor Data Set
Density Estimation Slightly Degrades Performance for the Events Data Set
Density Estimation Sometimes Improves Performance, Always Provides Confidence Measures (Indoor versus Outdoor; Events: Politics, Struggle, Disaster, Crime, Other)
Results of Density Estimation Experiments for the Indoor versus Outdoor Data Set
Results of Density Estimation Experiments for the Events Data Set
BINS System:Naïve Bayes + Smoothing • Binning: based on smoothing in the speech recognition literature • Not enough training data to estimate term weights for words with scarce evidence • Words with similar statistical features are grouped into a common “bin” • Estimate a single weight for each bin • This weight is assigned to all words in the bin • Credible estimates even for small (or zero) counts
“plane” • Sparse data • “plane” does not occur in any Indoor training documents • Infinitely more likely to be Outdoor ??? • Assign “plane” to bins of words with similar features (e.g. IDF, category counts) • In first half of training set, “plane” appears in: • 9 Outdoor documents • 0 Indoor documents
Lambdas: Weights • First half of training set: Assign words to bins • Second half of training set: Estimate term weights
Lambdas for "plane": 4.03 times more likely in an Outdoor document
Methodology of BINS • Divide training set into two halves: • First half used to determine bins for words • Second half used to determine lambdas for bins • For each test document: • Map every word to a bin for each category • Add lambdas, obtaining a score for each category • Switch halves of training and repeat • Combine results and assign each document to category with highest score
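The methodology above can be sketched as follows (a simplified Python illustration under loud assumptions: function names are hypothetical, the smoothing constants are illustrative, and real BINS bins words on several features, such as IDF and category counts, rather than on a single count):

```python
import math

def bin_lambda(bin_doc_count, category_doc_count):
    """Smoothed log-probability used as a bin's weight (lambda):
    roughly, the fraction of the category's second-half training
    documents containing a word from this bin. Pooling counts over
    a whole bin gives credible estimates even for rare words."""
    return math.log((bin_doc_count + 0.5) / (category_doc_count + 1.0))

def score_document(words, word_bin, lambdas, categories):
    """Map each word of a test document to a bin for each category,
    sum the bins' lambdas per category, and predict the category
    with the highest total. word_bin[c][w] is w's bin under
    category c; unseen words fall into bin 0."""
    totals = {c: 0.0 for c in categories}
    for w in words:
        for c in categories:
            totals[c] += lambdas[c][word_bin[c].get(w, 0)]
    return max(totals, key=totals.get)
```

Because a sparse word like "plane" shares its bin with other words of similar statistics, its weight is estimated from pooled evidence instead of from its own near-zero counts.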
Binning Improves Performance for the Indoor versus Outdoor Data Set
BINS: Robust Version of Naïve Bayes (compared against human and baseline performance on Indoor versus Outdoor and on Events: Politics, Struggle, Disaster, Crime, Other)
Combining Bin Weights and Naïve Bayes Weights • Idea: • It might be better to use the Naïve Bayes weight when there is enough evidence for a word • Back off to the bin weight otherwise • BINS allows combinations of weights to be used based on the level of evidence • How can we automatically determine when to use which weights??? • Entropy • Minimum Squared Error (MSE)
Can Provide File to BINS that Specifies How to Combine Weights • Based on Entropy: 0 0.25 0.5 0.75 1 • Based on MSE: 0 0.5 1 (use only the bin weight for evidence of 0, average the bin weight and the NB weight for evidence of 1, use only the NB weight for evidence of 2 or more)
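The interpolation such a file describes can be sketched as follows (a minimal illustration; the function name is hypothetical, and the MSE-style fractions below are taken from the example above):

```python
def combine_weights(nb_weight, bin_weight, nb_fraction, evidence):
    """Interpolate between a bin weight and a Naive Bayes weight.
    nb_fraction[e] is the share of the NB weight to use at evidence
    level e; evidence levels past the end of the list reuse the
    last entry, so well-evidenced words rely fully on NB."""
    frac = nb_fraction[min(evidence, len(nb_fraction) - 1)]
    return frac * nb_weight + (1.0 - frac) * bin_weight

# MSE-style combination file: bin weight only at evidence 0,
# an even average at evidence 1, NB weight only at evidence 2+
mse = [0.0, 0.5, 1.0]
w = combine_weights(-1.0, -2.0, mse, 1)  # -> -1.5
```

With this scheme the system backs off to the pooled bin estimate exactly when a word's own counts are too scarce to trust.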
Appropriately Combining the Bin Weight and the Naïve Bayes Weight Leads to the Best Performance Yet (Indoor versus Outdoor; Events: Politics, Struggle, Disaster, Crime, Other)
BINS Performs the Best of All Systems Tested (BINS versus SVMs on Indoor versus Outdoor and on Events: Politics, Struggle, Disaster, Crime, Other)
How Can We Improve Results? • One idea: Label more documents! • Usually works • Boring • Another idea: Use unlabeled documents! • Easily obtainable • But can this really work??? • Maybe it can…
Binning Using Unlabeled Documents • Apply system to unlabeled documents • Choose documents with “confident” predictions • Each word has new feature: # of unlabeled documents containing the word that are confidently predicted to belong to each category (unlabeled category counts) • Probably less important than regular category counts • Binning provides a natural mechanism for weighting the new feature appropriately
Determining Confident Predictions • BINS computes a score for each category • BINS predicts category with highest score • Confidence for predicted category is score of that category minus score of second place category • Confidence for non-predicted category is score of that category minus score of chosen category • Cross validation experiments can be used to determine a confidence cutoff for each category • Maximize F for category • Beta of 1 gives precision and recall equal weight, lower beta weights precision higher
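The cutoff-selection step above can be sketched as follows (a hedged Python illustration; the function names and the exhaustive search over candidate cutoffs are assumptions, not the thesis's implementation):

```python
def f_beta(precision, recall, beta=1.0):
    """F-measure. Beta of 1 weights precision and recall equally;
    beta below 1 weights precision more heavily."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def best_cutoff(scored, beta=1.0):
    """scored: (confidence, is_correct) pairs for one category from
    cross-validation. Try each observed confidence as the cutoff,
    compute precision/recall over the predictions at or above it,
    and return the cutoff that maximizes F."""
    total_pos = sum(1 for _, ok in scored if ok)
    best = (-1.0, None)
    for cut in sorted({c for c, _ in scored}):
        kept = [ok for c, ok in scored if c >= cut]
        if not kept:
            continue
        p = sum(kept) / len(kept)
        r = sum(kept) / total_pos if total_pos else 0.0
        best = max(best, (f_beta(p, r, beta), cut))
    return best[1]
```

Raising the cutoff trades recall for precision, so the chosen beta directly controls how conservative the "confident" predictions are.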
Use F to Optimize Confidence Cutoffs (example for a single category)
Use F to Optimize Confidence Cutoffs (important region of graph highlighted)
Does the New Feature Help? • No • Why??? • New features add info but make bins smaller • Perhaps more data isn’t needed in the first place • Should more data matter? • Hard to accumulate more labeled data • Easy to try out less labeled data!
Overview • The Main Idea • Description of Corpus • Novel ML Systems • NLP Based System • High-Precision/Low-Recall Rules • Image Features • Newsblaster • Conclusions and Future Work
Disaster Image Categories: Affected People, Workers Responding, Wreckage, Other
Ambiguity for Disaster Images: Workers Responding vs. Affected People • Workers Responding: Philippine rescuers carry a fire victim March 19 who perished in a blaze at a Manila disco. • Affected People (hypothetical alternative caption): A fire victim who perished in a blaze at a Manila disco is carried by Philippine rescuers March 19.