Text Classification and Images by Carl Sable
Overview • Text Classification. • Involves assigning text documents to one or more groups (classes). • Techniques can be applied to image captions to classify corresponding images. • Various methods, evaluation techniques, and related issues will be discussed. • Some discussion of other research involving image captions.
Text Classification Tasks • Text Categorization (TC) - Assign text documents to existing, well-defined categories. • Information Retrieval (IR) - Retrieve text documents which match user query. • Clustering - Group text documents into clusters of similar documents. • Text Filtering - Retrieve documents which match a user profile.
Text Categorization • Classify each test document by assigning category labels. • M-ary categorization assumes M labels per document. • Binary categorization requires yes/no decision for every document/category pair. • Most techniques require training. • Parametric vs non-parametric. • Batch vs on-line.
Early Work • The Federalist papers. • Published anonymously in 1787-1788. • Authorship of 12 papers in dispute (either Hamilton or Madison). • Mosteller and Wallace, 1963. • Compared rate per thousand words of high frequency words. • Collected very strong evidence in favor of Madison.
Rocchio • All documents and categories represented by word vectors. • TF*IDF weights for words. • Term frequency is number of times word appears in document or category. • Inverse document frequency relates to scarcity of word over entire training collection. • Similarity computed for all document/category pairs.
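The Rocchio bullets above can be sketched in a few lines of Python. This is a minimal illustration, not the exact formulation used in any particular study: the log-based IDF, the pooling of each category's training documents into one vector, the cosine similarity, and all function names (`idf_table`, `tfidf`, `train_rocchio`, `classify`) are assumptions made for the example.

```python
import math
from collections import Counter, defaultdict

def idf_table(docs):
    """docs: list of token lists. Returns word -> log(N / document frequency)."""
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))
    return {w: math.log(n / df[w]) for w in df}

def tfidf(tokens, idf):
    """TF*IDF vector (as a dict) for a bag of tokens; unseen words are dropped."""
    tf = Counter(tokens)
    return {w: tf[w] * idf[w] for w in tf if w in idf}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def train_rocchio(docs, labels):
    """Build one TF*IDF vector per category by pooling its training documents."""
    idf = idf_table(docs)
    pooled = defaultdict(list)
    for tokens, label in zip(docs, labels):
        pooled[label].extend(tokens)
    return idf, {label: tfidf(toks, idf) for label, toks in pooled.items()}

def classify(tokens, idf, category_vectors):
    """Assign the category whose vector is most similar to the document."""
    vec = tfidf(tokens, idf)
    return max(category_vectors, key=lambda c: cosine(vec, category_vectors[c]))
```

Note that with this IDF choice a word appearing in every training document gets weight zero, which is one simple way the "scarcity" idea shows up in practice.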
Naïve Bayes • Estimates probabilities of categories given a document. • Uses joint probabilities of words and categories (Bayes’ rule). • Assumes words are independent of each other. • Can incorporate a priori probabilities of categories.
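As a rough sketch of the Naïve Bayes idea above: the score for a category is its prior probability times the product of per-word probabilities, computed in log space. The multinomial word model and add-one (Laplace) smoothing are assumptions of this example, not something the slide specifies, and the names `train_nb`, `log_scores`, and `classify_nb` are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Collect category priors, per-category word counts, and the vocabulary."""
    prior = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return prior, word_counts, vocab

def log_scores(tokens, model):
    """Log of P(category) * product of P(word | category), up to a constant."""
    prior, word_counts, vocab = model
    n_docs = sum(prior.values())
    scores = {}
    for label in prior:
        total = sum(word_counts[label].values())
        score = math.log(prior[label] / n_docs)  # a priori category probability
        for w in tokens:
            if w in vocab:  # words never seen in training are ignored
                # Add-one smoothing (an assumption) avoids log(0).
                score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        scores[label] = score
    return scores

def classify_nb(tokens, model):
    """Pick the category with the highest log score."""
    scores = log_scores(tokens, model)
    return max(scores, key=scores.get)
```

Summing log probabilities word by word is exactly where the independence assumption enters: each word contributes to the score without regard to the others.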
Other Common Methods • K-Nearest Neighbor (kNN) - Use k closest training documents to predict category. • Decision Trees (DTree)- Construct classification trees based on training data. • Neural Networks (NNet) - Learn non-linear mapping from input words to categories. • Expert Systems - Use manually constructed, domain-specific, application-specific rules.
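Of the methods listed above, kNN is the easiest to illustrate. The sketch below assumes raw word counts as features, cosine similarity as the closeness measure, and a simple majority vote among the k nearest training documents; real systems often use TF*IDF weights and similarity-weighted votes instead. The function name `knn_classify` is hypothetical.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two word-count vectors."""
    dot = sum(x * v.get(w, 0) for w, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_classify(tokens, train_docs, train_labels, k=3):
    """Predict the majority label among the k most similar training documents."""
    query = Counter(tokens)
    ranked = sorted(
        zip(train_docs, train_labels),
        key=lambda pair: cosine(query, Counter(pair[0])),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

There is no training step here beyond storing the labeled documents, which is why kNN is usually described as a non-parametric, instance-based method.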
Advanced Techniques • Support Vector Machines (SVMs). • Use Structural Risk Minimization principle. • Find hypothesis which minimizes “true error”. • Widrow-Hoff and EG - Update weight vector based on each training example. • Maximum Entropy - Derive constraints expressing characteristics of training data. • Boosting - Combine weak hypotheses to produce highly accurate classification rule.
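The "update weight vector based on each training example" bullet can be made concrete with the two update rules as they are commonly stated: Widrow-Hoff takes an additive gradient step on the squared error, while EG (exponentiated gradient) applies a multiplicative update and renormalizes the weights to sum to one. The sketch assumes dense feature lists, a target value `y`, and a learning rate `eta`; these names and the helper `dot` are illustrative.

```python
import math

def dot(w, x):
    """Inner product of two equal-length lists."""
    return sum(wj * xj for wj, xj in zip(w, x))

def widrow_hoff_step(w, x, y, eta):
    """Additive update: w <- w - 2*eta*(w.x - y)*x."""
    err = dot(w, x) - y
    return [wj - 2.0 * eta * err * xj for wj, xj in zip(w, x)]

def eg_step(w, x, y, eta):
    """Multiplicative update followed by normalization (weights stay positive)."""
    err = dot(w, x) - y
    unnorm = [wj * math.exp(-2.0 * eta * err * xj) for wj, xj in zip(w, x)]
    z = sum(unnorm)
    return [u / z for u in unnorm]
```

Both are on-line learners in the sense of the earlier "batch vs on-line" distinction: each call consumes one training example and returns the updated weights.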
Common Test Corpora • Reuters - Collection of newswire stories from 1987 to 1991, labeled with categories. • TREC-AP - Newswire stories from 1988 to 1990, labeled with categories. • OHSUMED - Medline articles from 1987 to 1991, MeSH categories assigned. • UseNet newsgroups. • WebKB - Web pages gathered from university CS departments.
Other Issues to Consider • Which words to use (feature selection). • Normalization. • Use of lexical databases. • Longman Dictionary of Contemporary English (LDOCE), WordNet, English Verb Classes and Alternations (EVCA). • May cause problems due to lexical ambiguity. • High cost of manual labels.
Categorizing Images • Some previous research on content-based image categorization, very little on text-based image categorization! • WebSEEk. • Categorizes images and videos based on key-terms extracted from URL, alt text, hyperlinks, and directory names. • Semi-automated key-term dictionary maps key-terms to subject(s) from a taxonomy.
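Evaluation Metrics • Per-category measures: precision, recall, and fallout. • Simple accuracy or error measures can be misleading. • F-measure, average precision, and break-even point (BEP) combine precision and recall. • Macro-averaging vs micro-averaging. • Should choose metric ahead of time (maybe)! • Contingency table for one category: a = documents correctly assigned, b = incorrectly assigned, c = incorrectly rejected, d = correctly rejected, n = a + b + c + d. • Precision: p = a / (a + b). • Recall: r = a / (a + c). • Fallout: f = b / (b + d). • Accuracy: Acc = (a + d) / n. • Error: Err = (b + c) / n.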
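The contingency-table formulas translate directly into code. In the sketch below, a, b, c, and d are read as true positives, false positives, false negatives, and true negatives, which is the reading implied by the formulas for p, r, and f. The balanced F1 form of the F-measure, the zero-denominator convention, and the function names are assumptions of this example.

```python
def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero (a convention)."""
    return num / den if den else 0.0

def metrics(a, b, c, d):
    """Per-category measures from one contingency table."""
    n = a + b + c + d
    p = safe_div(a, a + b)
    r = safe_div(a, a + c)
    return {
        "precision": p,
        "recall": r,
        "fallout": safe_div(b, b + d),
        "accuracy": safe_div(a + d, n),
        "error": safe_div(b + c, n),
        "f1": safe_div(2 * p * r, p + r),  # balanced F-measure
    }

def macro_precision(tables):
    """Macro-average: compute precision per category, then average."""
    return sum(metrics(*t)["precision"] for t in tables) / len(tables)

def micro_precision(tables):
    """Micro-average: sum the table cells over categories, then compute once."""
    a = sum(t[0] for t in tables)
    b = sum(t[1] for t in tables)
    return safe_div(a, a + b)
```

For example, `metrics(1, 3, 1, 95)` gives an accuracy of 0.96 but a precision of only 0.25, which illustrates why simple accuracy can mislead when a category is rare.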
Some Results and Analysis • Comparisons. • SVM, kNN, AdaBoost, WH, and EG all showed very impressive performance. • Naïve Bayes and Rocchio tended to show relatively poor performance. • Rocchio possibly could have done better. • Should be using probabilistic Rocchio. • Works best if categories are mutually exclusive. • May perform at its best when only 2 categories.
Information Retrieval • User inputs query, system should retrieve all relevant documents. • Simple technique: keyword search. • Other techniques rely on word vectors. • TF*IDF commonly used for weights. • Can compute similarity between query vector and document vectors. • Evaluation - Similar to text categorization, treat relevant documents as single category.
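Vector-based retrieval as described above can be sketched as follows: weight the query and every document with TF*IDF, then rank documents by similarity to the query. Cosine similarity, the log-based IDF, and the function name `rank` are assumptions of this example; documents with zero similarity are left out of the result.

```python
import math
from collections import Counter

def rank(query_tokens, docs):
    """Return indices of docs ordered by decreasing cosine similarity to the query."""
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))
    idf = {w: math.log(n / df[w]) for w in df}

    def vec(tokens):
        # TF*IDF weights; words absent from the collection are dropped.
        tf = Counter(tokens)
        return {w: tf[w] * idf[w] for w in tf if w in idf}

    def cosine(u, v):
        dot = sum(x * v.get(w, 0.0) for w, x in u.items())
        nu = math.sqrt(sum(x * x for x in u.values()))
        nv = math.sqrt(sum(x * x for x in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    q = vec(query_tokens)
    scored = [(cosine(q, vec(d)), i) for i, d in enumerate(docs)]
    return [i for s, i in sorted(scored, key=lambda t: -t[0]) if s > 0]
```

Unlike plain keyword search, this ranks a document matching the rarer query word above one matching only the common word.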
Relevance Feedback • After initial retrieval, user makes relevance judgements for retrieved documents. • New round of retrieval based on feedback. • Similar to text categorization with two categories: relevant vs non-relevant. • Rocchio algorithm originally created for this task. • Naïve Bayes very successful.
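The feedback round can be illustrated with the Rocchio query update in its commonly cited form: move the query vector toward the mean of the documents judged relevant and away from the mean of those judged non-relevant. The coefficient defaults (`alpha`, `beta`, `gamma`) and the choice to drop non-positive weights are assumptions of this sketch, not values taken from the slides.

```python
def rocchio_feedback(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return an updated query vector; all vectors are dicts of word -> weight."""
    new_q = {w: alpha * x for w, x in query.items()}
    for group, coeff in ((relevant, beta), (nonrelevant, -gamma)):
        for doc in group:
            for w, x in doc.items():
                # Add (or subtract) this document's share of the group mean.
                new_q[w] = new_q.get(w, 0.0) + coeff * x / len(group)
    # Assumption: terms pushed to zero or below are removed from the query.
    return {w: x for w, x in new_q.items() if x > 0}
```

For instance, a query on "jaguar" with one relevant document about cats and one non-relevant document about cars gains the term "cat" and does not gain "car".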
Possible Improvements • Lexical databases sometimes used for query expansion. • Word sense disambiguation. • Expand query with correct senses. • Used on documents to prevent retrieval based on false matches. • Notion of semantic similarity.
Retrieval of Captioned Images • Typical properties of image captions: • Shorter than documents in typical IR tasks. • Subject noun phrase usually denotes most significant object in picture. • In news domain, first sentence generally describes image, rest is background. • Different types of queries. • Many techniques from general IR not applicable.
Related Research • Smeaton. • Automatically derived Hierarchical Concept Graphs (HCGs) based on WordNet IS-A links. • Computed semantic similarity between nouns. • Some success improving image retrieval. • Guglielmo and Rowe. • Used logical form records to capture meaning of queries and captions for comparison. • System significantly beat keyword search.
Other Text Classification Tasks • Clustering documents. • Create groups with similar attributes. • Various methods and algorithms exist. • Hierarchical vs non-hierarchical. • Each group has centroid. • Can aid in Information Retrieval. • Text Filtering. • Filter articles of potential interest for a user. • Uses many of the same methods as TC and IR.
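As one example of a non-hierarchical, centroid-based clustering method, here is a minimal k-means sketch. The slides do not name a specific algorithm, so k-means, squared Euclidean distance, dense list vectors, caller-supplied initial centroids, and the names `centroid`, `sq_dist`, and `kmeans` are all assumptions for illustration.

```python
def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def sq_dist(u, v):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def kmeans(vectors, centroids, iterations=10):
    """Alternate between assigning vectors to the nearest centroid and
    recomputing each centroid; empty clusters keep their old centroid."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in vectors:
            nearest = min(range(len(centroids)), key=lambda i: sq_dist(v, centroids[i]))
            clusters[nearest].append(v)
        centroids = [centroid(c) if c else old for c, old in zip(clusters, centroids)]
    return centroids, clusters
```

In a retrieval setting, the resulting centroids could be compared against a query first, so that only documents in the closest clusters need to be scored individually.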
Processing Image Captions • The Correspondence Problem - How to correlate visual information with words. • Visual semantics. • Symbolic representation of visual data. • Srihari. • Piction - System that automatically identifies human faces in captioned newspaper photos. • Integrates NLP module which parses captions with IU module that detects objects.
Final Observations • Previous Work. • General text categorization studied extensively. • Some research on text-based image retrieval. • Very little research involving text-based image categorization. • Image captions contain information unlikely to be extracted from just images. • High potential exists for significant research involving text-based image categorization.