Text Categorization • Assigning documents to a fixed set of categories • Applications: • Web pages • Recommending pages • Yahoo-like classification hierarchies • Categorizing bookmarks • Newsgroup Messages / News Feeds / Micro-blog Posts • Recommending messages, posts, tweets, etc. • Message filtering • News articles • Personalized news • Email messages • Routing • Folderizing • Spam filtering
Learning for Text Categorization • Text Categorization is an application of classification • Typical Learning Algorithms: • Bayesian (naïve) • Neural network • Relevance Feedback (Rocchio) • Nearest Neighbor • Support Vector Machines (SVM)
Nearest-Neighbor Learning Algorithm • Learning is just storing the representations of the training examples in data set D • Testing instance x: • Compute similarity between x and all examples in D • Assign x the category of the most similar examples in D • Does not explicitly compute a generalization or category prototypes (i.e., no “modeling”) • Also called: • Case-based • Memory-based • Lazy learning
K Nearest-Neighbor • Using only the closest example to determine categorization is subject to errors due to • A single atypical example. • Noise (i.e., error) in the category label of a single training example. • A more robust alternative is to find the k most-similar examples and return the majority category of these k examples. • The value of k is typically odd to avoid ties; 3 and 5 are most common.
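A minimal sketch of the k-nearest-neighbor decision rule, assuming training examples are stored as (vector, label) pairs and a similarity function is supplied by the caller; the function and variable names here are illustrative, not from the slides:

```python
from collections import Counter

def knn_classify(query, examples, similarity, k=3):
    """Return the majority label among the k training examples most similar to query.

    examples: list of (vector, label) pairs; similarity: function(vec_a, vec_b) -> float.
    """
    # Score every stored example against the query (lazy learning: no model is built).
    ranked = sorted(examples, key=lambda ex: similarity(query, ex[0]), reverse=True)
    top_k_labels = [label for _, label in ranked[:k]]
    # Majority vote over the k nearest neighbors; an odd k avoids ties in the binary case.
    return Counter(top_k_labels).most_common(1)[0][0]
```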
Similarity Metrics • Nearest-neighbor methods depend on a similarity (or distance) metric • Simplest for continuous m-dimensional instance space is Euclidean distance • Simplest for m-dimensional binary instance space is Hamming distance (number of feature values that differ) • For text, cosine similarity of TF-IDF weighted vectors is typically most effective
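Minimal sketches of the three metrics above, assuming instances are plain Python lists of equal length; the function names are illustrative:

```python
import math

def euclidean_distance(a, b):
    # Straight-line distance in continuous m-dimensional space (smaller = more similar).
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hamming_distance(a, b):
    # Number of positions at which two binary feature vectors differ.
    return sum(1 for x, y in zip(a, b) if x != y)

def cosine_similarity(a, b):
    # Cosine of the angle between two (e.g., TF-IDF) vectors; 1.0 = identical direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```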
Basic Automatic Text Processing • Parse documents to recognize structure and meta-data • e.g. title, date, other fields, html tags, etc. • Scan for word tokens • lexical analysis to recognize keywords, numbers, special characters, etc. • Stopword removal • common words such as “the”, “and”, “or” which are not semantically meaningful in a document • Stem words • morphological processing to group word variants (e.g., “compute”, “computer”, “computing”, “computes”, … can be represented by a single stem “comput” in the index) • Assign weight to words • using frequency in documents and across documents • Store Index • Stored in a Term-Document Matrix (“inverted index”) which stores each document as a vector of keyword weights
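A toy sketch of this processing pipeline, assuming a tiny hand-picked stopword list and a crude suffix-stripping stemmer as stand-ins for real resources (a full stopword list, a Porter-style stemmer); all names are illustrative:

```python
import re
from collections import Counter

STOPWORDS = {"the", "and", "or", "a", "an", "of", "to", "in", "is"}  # toy stopword list

def crude_stem(token):
    # Very rough stand-in for a real stemmer: strip a few common suffixes.
    for suffix in ("ing", "ers", "er", "es", "s", "ed"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    # Lexical analysis: lowercase the text and pull out word tokens.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Stopword removal, then stemming to group word variants under one index term.
    stems = [crude_stem(t) for t in tokens if t not in STOPWORDS]
    # Raw term frequencies for this document; TF-IDF weighting happens at the collection level.
    return Counter(stems)
```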
tf x idf Weights • tf x idf measure: • term frequency (tf) • inverse document frequency (idf) -- a way to deal with the problems of the Zipf distribution • Recall the Zipf distribution • Want to weight terms highly if they are • frequent in relevant documents … BUT • infrequent in the collection as a whole • Goal: assign a tf x idf weight to each term in each document
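One common formulation (there are several variants) assigns term $i$ in document $j$ the weight

$$
w_{ij} = tf_{ij} \times \log\frac{N}{df_i}
$$

where $tf_{ij}$ is the frequency of term $i$ in document $j$, $N$ is the total number of documents in the collection, and $df_i$ is the number of documents containing term $i$; the logarithmic factor is the idf.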
Inverse Document Frequency • IDF provides high values for rare words and low values for common words
tf x idf Example • The initial Term x Doc matrix (inverted index), with documents represented as vectors of words, and the resulting tf x idf weighted Term x Doc matrix (both shown as tables on the original slide)
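A toy version of that transformation: building a raw Term x Doc matrix and then re-weighting it with tf x idf. The three documents below are made up for illustration and are not the slide's example:

```python
import math
from collections import Counter

# Toy collection (illustrative; not the documents from the original slide).
docs = {
    "D1": "nova galaxy heat nova",
    "D2": "galaxy film hollywood",
    "D3": "heat actor film film",
}

# Initial Term x Doc matrix: raw term frequencies per document.
tf = {doc_id: Counter(text.split()) for doc_id, text in docs.items()}

# Document frequency of each term, then idf = log(N / df).
n_docs = len(docs)
df = Counter(term for counts in tf.values() for term in counts)
idf = {term: math.log(n_docs / df_t) for term, df_t in df.items()}

# tf x idf weighted Term x Doc matrix.
tfidf = {
    doc_id: {term: freq * idf[term] for term, freq in counts.items()}
    for doc_id, counts in tf.items()
}

for doc_id, weights in tfidf.items():
    print(doc_id, {t: round(w, 2) for t, w in sorted(weights.items())})
```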
K Nearest Neighbor for Text

Training:
  For each training example <x, c(x)> ∈ D
    Compute the corresponding TF-IDF vector, dx, for document x

Test instance y:
  Compute TF-IDF vector d for document y
  For each <x, c(x)> ∈ D
    Let sx = cosSim(d, dx)
  Sort examples, x, in D by decreasing value of sx
  Let N be the first k examples in D (get most similar neighbors)
  Return the majority class of examples in N
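A self-contained Python sketch of this procedure, using sparse dict-based TF-IDF vectors and cosine similarity; the corpus, categories, and helper names are all illustrative:

```python
import math
from collections import Counter

def tfidf_vector(tokens, idf):
    # Sparse TF-IDF vector as a dict; terms unseen at training time get idf 0 and drop out.
    tf = Counter(tokens)
    return {t: f * idf.get(t, 0.0) for t, f in tf.items()}

def cos_sim(a, b):
    # Cosine similarity between two sparse vectors represented as dicts.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def train(corpus):
    # corpus: list of (tokens, category). Build idf, then store one TF-IDF vector per example.
    n = len(corpus)
    df = Counter(t for tokens, _ in corpus for t in set(tokens))
    idf = {t: math.log(n / d) for t, d in df.items()}
    examples = [(tfidf_vector(tokens, cat_idf), cat) for (tokens, cat), cat_idf in ((ex, idf) for ex in corpus)]
    return idf, examples

def classify(tokens, idf, examples, k=3):
    # Compute the test vector, rank training examples by cosine similarity, majority-vote top k.
    d = tfidf_vector(tokens, idf)
    ranked = sorted(examples, key=lambda ex: cos_sim(d, ex[0]), reverse=True)
    neighbors = [cat for _, cat in ranked[:k]]
    return Counter(neighbors).most_common(1)[0][0]

# Illustrative usage with a made-up corpus of pre-tokenized documents.
corpus = [
    (["stock", "market", "trading"], "finance"),
    (["bank", "interest", "rates"], "finance"),
    (["match", "goal", "league"], "sports"),
    (["coach", "season", "playoff"], "sports"),
]
idf, examples = train(corpus)
print(classify(["market", "rates", "bank"], idf, examples, k=3))  # -> "finance"
```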