This introduction to categorization highlights its importance in document organization and filtering. It outlines the decision-matrix formulation of the problem, knowledge-based and machine learning approaches, and feature selection for building classifiers. Several categorization methods, including Naïve Bayes, k-NN, and Support Vector Machines, are explained to show how documents can be classified effectively. The text also addresses challenges such as scalability and the need for domain expertise.
An Introduction To Categorization • Soam Acharya, PhD (soamdev@yahoo.com) • 1/15/2003
What is Categorization? • {c1 … cm} set of predefined categories • {d1 … dn} set of candidate documents • Fill decision matrix with values {0,1} • Categories are symbolic labels
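To make the decision-matrix formulation concrete, here is a minimal Python sketch; the category and document names are invented for illustration:

```python
# A sketch of the categorization decision matrix: rows are candidate
# documents, columns are predefined categories, and each cell holds
# 1 (document belongs to the category) or 0 (it does not).
categories = ["Home Video", "Finance", "Travel"]   # hypothetical labels
documents = ["d1", "d2", "d3", "d4"]

# decision[doc][cat] in {0, 1}
decision = {d: {c: 0 for c in categories} for d in documents}
decision["d1"]["Home Video"] = 1
decision["d3"]["Travel"] = 1

for d in documents:
    assigned = [c for c in categories if decision[d][c] == 1]
    print(d, "->", assigned or ["(uncategorized)"])
```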
Uses • Document organization • Document filtering • Word sense disambiguation • Web • Internet directories • Organization of search results • Clustering
Categorization Techniques • Knowledge systems • Machine Learning
Knowledge Systems • Manually build an expert system that makes categorization judgments • A sequence of rules per category: if <boolean condition> then category • Example: if document contains “buena vista home entertainment” then document category is “Home Video”
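A minimal Python sketch of such a rule sequence, assuming plain-text documents; the second rule and the "Finance" category are invented for illustration:

```python
# An ordered list of (boolean condition, category) rules, applied in
# sequence, in the spirit of a hand-built knowledge system.
RULES = [
    (lambda text: "buena vista home entertainment" in text, "Home Video"),
    (lambda text: "mutual fund" in text and "prospectus" in text, "Finance"),
]

def categorize(document):
    text = document.lower()
    for condition, category in RULES:
        if condition(text):
            return category
    return None  # no rule fired; leave the document uncategorized

print(categorize("New releases from Buena Vista Home Entertainment ..."))
```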
Knowledge System Issues • Scalability • Build • Tune • Requires Domain Experts • Transferability
Machine Learning Approach • Build a classifier for a category • Training set • Hierarchy of categories • Submit candidate documents for automatic classification • Expend effort in building a classifier, not in knowing the knowledge domain
Machine Learning Process • [Diagram: documents from a document DB pass through pre-processing to form the training set; the training set, organized by a taxonomy, is used to build the classifier, which then categorizes new documents]
Training Set • Initial corpus can be divided into: • Training set • Test set • Role of workflow tools
Document Preprocessing • Document conversion: converts file formats (.doc, .ppt, .xls, .pdf, etc.) to text • Tokenizing/parsing: • Stemming • Document vectorization • Dimension reduction
Document Vectorization • Convert document text into a “bag of words” • Each document is a vector of n weighted terms • Example term/weight pairs for one document: federal express (3), severe (3), flight (2), Y2000-Q3 (1), mountain (2), exactly (1), simple (5)
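A minimal bag-of-words sketch in Python; real preprocessing would also strip punctuation, remove stop words, and stem, as the preprocessing slide notes:

```python
from collections import Counter

def vectorize(text):
    # Lowercase, split on whitespace, and count term occurrences.
    tokens = text.lower().split()
    return Counter(tokens)

doc = "severe severe severe flight flight mountain mountain simple"
print(vectorize(doc))  # Counter({'severe': 3, 'flight': 2, ...})
```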
Document Vectorization • Use the tfidf function for term weighting: tfidf(tk, dj) = #(tk, dj) · log( |Tr| / #(tk) ), where #(tk, dj) is the number of times tk occurs in dj, #(tk) is the number of training documents in which tk occurs at least once, and |Tr| is the cardinality of the training set • tfidf values may be normalized to [0,1] so that all document vectors have equal length
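A sketch of this weighting in Python, assuming the training set Tr is a list of bag-of-words Counters:

```python
import math
from collections import Counter

def tfidf(term, doc_counts, training_set):
    tf = doc_counts[term]                            # #(tk, dj)
    df = sum(1 for d in training_set if term in d)   # #(tk)
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(len(training_set) / df)     # |Tr| / #(tk)

def normalize(weights):
    # Cosine-normalize a {term: weight} vector to unit length,
    # which keeps all weights in [0, 1] as the slide notes.
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else weights

training = [Counter("severe flight flight".split()),
            Counter("mountain simple".split())]
print(tfidf("flight", training[0], training))  # 2 * log(2/1)
```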
Dimension Reduction • Reduce dimensionality of vector space • Why? • Reduce computational complexity • Address “overfitting” problem • Overtuning classifier • How? • Feature selection • Feature extraction
Feature Selection • Also known as “term space reduction” • Remove “stop” words • Identify “best” words to be used in categorizing per topic • Document frequency of terms • Keep terms that occur in highest number of documents • Other measures • Chi square • Information gain
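A sketch of term-space reduction by document frequency; the stop-word list and cutoff k are illustrative, and chi square or information gain would simply replace the scoring step:

```python
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "to", "in"}

def select_by_doc_frequency(training_set, k):
    # Count, for each non-stop term, how many documents contain it,
    # then keep the k terms with the highest document frequency.
    df = Counter()
    for doc in training_set:              # doc: a list/set of tokens
        df.update(set(doc) - STOP_WORDS)
    return {term for term, _ in df.most_common(k)}

docs = [["the", "flight", "was", "severe"], ["flight", "to", "mountain"]]
print(select_by_doc_frequency(docs, k=2))
```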
Feature Extraction • Synthesize new features from existing features • Term clustering • Use clusters/centroids instead of terms • Co-occurrence and co-absence • Latent Semantic Indexing • Compresses vectors into a lower dimensional space
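A sketch of Latent Semantic Indexing via a truncated SVD of the term-document matrix; the matrix shapes and toy data here are made up for illustration:

```python
import numpy as np

def lsi(term_doc_matrix, k):
    # Truncated SVD: keep the k largest singular values/vectors,
    # compressing each document into a k-dimensional "concept" space.
    U, s, Vt = np.linalg.svd(term_doc_matrix, full_matrices=False)
    return s[:k] * Vt[:k].T   # each row: one document in k dimensions

A = np.random.rand(500, 40)   # 500 terms x 40 documents (toy data)
docs_k = lsi(A, k=10)
print(docs_k.shape)           # (40, 10)
```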
Creating a Classifier • Define a Categorization Status Value function CSVi: D -> [0,1] that, for a document d, gives the confidence that d belongs in ci • The value may be a boolean, a probability, or a vector distance
Creating a Classifier • Define a threshold, thresh, such that if CSVi(d) > thresh(i) then categorize d under ci otherwise, don’t • CSV thresholding • Fixed value across all categories • Vary per category • Optimize via testing
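A sketch of CSV thresholding; the threshold values are invented, and in practice they are tuned per category on held-out test data:

```python
thresholds = {"Home Video": 0.6, "Finance": 0.4}  # hypothetical, per-category

def assign(csv_scores, thresholds):
    # csv_scores maps category -> CSVi(d) in [0, 1]; keep every
    # category whose score clears its own threshold.
    return [c for c, score in csv_scores.items() if score > thresholds[c]]

print(assign({"Home Video": 0.72, "Finance": 0.31}, thresholds))
```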
Naïve Bayes Classifier • P(ci | dj) = P(ci) · P(dj | ci) / P(dj): the probability that document dj belongs in category ci • The term weights of dj, as estimated from the training set, are used to compute P(dj | ci)
Naïve Bayes Classifier • If wkj is binary (0,1) and pki is short for P(wkx = 1 | ci), then P(dj | ci) = Πk pki^wkj · (1 − pki)^(1 − wkj) • After further derivation, the original equation becomes: log P(ci | dj) ∝ Σk wkj · log[ pki / (1 − pki) ] + Σk log(1 − pki) + log P(ci) • The last two terms are constants for all docs, so the expression can be used as a CSV
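A sketch of the binary Naïve Bayes score above in Python; the vocabulary, probabilities, and prior are toy values, and in practice pki would be estimated from training counts with smoothing:

```python
import math

def nb_log_score(doc_terms, vocabulary, p_ci, prior_ci):
    # log P(ci) + sum over vocabulary terms: log pki if the term is
    # present in the document, log (1 - pki) if it is absent.
    score = math.log(prior_ci)
    for term in vocabulary:
        p = p_ci[term]
        score += math.log(p) if term in doc_terms else math.log(1.0 - p)
    return score  # monotone in P(ci | dj), so usable as a CSV

vocab = {"flight", "severe", "mountain"}
p = {"flight": 0.8, "severe": 0.6, "mountain": 0.1}  # P(wk = 1 | ci)
print(nb_log_score({"flight", "severe"}, vocab, p, prior_ci=0.3))
```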
Naïve Bayes Classifier • Independence assumption • Feature selection can be counterproductive
k-NN Classifier • Compute closeness between candidate documents and category documents: CSVi(dj) = Σ over dz in Trk(dj) of RSV(dj, dz) · Ciz, where RSV(dj, dz) is the similarity between dj and training document dz, Ciz ∈ {0,1} records whether dz belongs to ci, and Trk(dj) is the set of the k training documents nearest to dj • The sum is a confidence score indicating whether dj belongs to category ci
k-NN Classifier • k nearest neighbors • Find the k nearest neighbors among all training documents and use their categories • k can also indicate the number of top-ranked training documents per category to compare against • Similarity computation can be: • Inner product • Cosine coefficient
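A sketch of the k-NN CSV using the cosine coefficient over sparse term-weight vectors; the data layout (a list of vector/category pairs) is an assumption:

```python
import math
from collections import defaultdict

def cosine(u, v):
    # Cosine coefficient between two sparse {term: weight} dicts.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_csv(doc, training, k):
    # training: list of (vector, category) pairs. Sum the similarities
    # of the k nearest training documents, grouped by category.
    sims = sorted(((cosine(doc, vec), cat) for vec, cat in training),
                  reverse=True)
    scores = defaultdict(float)
    for sim, cat in sims[:k]:
        scores[cat] += sim
    return dict(scores)  # CSVi per category
```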
Support Vector Machines • Find the “decision surface” that best separates the data points of two classes • Support vectors are the training docs that best define the hyperplane • [Figure: the optimal hyperplane maximizes the margin between the two classes]
Support Vector Machines • Training process involves finding the support vectors • Only care about support vectors in the training set, not other documents
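A sketch of a linear SVM on toy 2-D data using scikit-learn (a library the slides do not mention); after training, only the support vectors define the separating hyperplane:

```python
import numpy as np
from sklearn.svm import SVC  # assumes scikit-learn is installed

# Two well-separated toy classes in 2-D.
X = np.array([[0, 0], [1, 1], [1, 0], [3, 3], [4, 3], [3, 4]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear")
clf.fit(X, y)
print(clf.support_vectors_)     # the training points defining the hyperplane
print(clf.predict([[2.0, 2.5]]))
```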
Neural Networks • Train a net to learn a mapping from input terms to categories • One neural net per category is too expensive; use one network overall instead • Perceptron approach (no hidden layer) or a three-layered network
Classifier Committees • Combine multiple classifiers • Majority voting • Category specialization • Mixed results
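A sketch of majority voting over a committee; the member classifiers here are trivial stand-ins, and ties are broken arbitrarily (real systems often weight votes by confidence):

```python
from collections import Counter

def committee_vote(classifiers, document):
    # Each member returns one category label; the most common vote wins.
    votes = [clf(document) for clf in classifiers]
    return Counter(votes).most_common(1)[0][0]

always_a = lambda d: "A"
by_length = lambda d: "A" if len(d) < 10 else "B"
always_b = lambda d: "B"
print(committee_vote([always_a, by_length, always_b], "short doc"))  # "A"
```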
Classification Performance • Category ranking evaluation • Recall = (categories found and correct) / (total categories correct) • Precision = (categories found and correct) / (total categories found) • Micro- and macro-averaging over categories
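A sketch of micro- versus macro-averaged precision; the per-category counts are toy numbers, and recall averages the same way with "total correct" in the denominator:

```python
def micro_macro_precision(per_category):
    # per_category: list of (tp, found) pairs, one per category, where
    # tp = categories found and correct, found = total categories found.
    micro = sum(tp for tp, _ in per_category) / sum(f for _, f in per_category)
    # Macro: average the per-category precisions (0 when nothing found).
    macro = sum(tp / f if f else 0.0 for tp, f in per_category) / len(per_category)
    return micro, macro

# Micro-averaging favors common categories; macro weights all equally.
print(micro_macro_precision([(8, 10), (1, 5), (90, 100)]))
```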
Classification Performance • Hard • Two studies • Yiming Yang, 1997 • Yiming Yang and Xin Liu, 1999 • SVM, kNN >> Neural Net > Naïve Bayes • Performance converges for common categories (with many training docs)
Computational Bottlenecks • Quiver • # of topics • # of training documents • # of candidate documents
Categorization and the Internet • Classification as a service • Standardizing vocabulary • Confidentiality • Performance • Use of hypertext in categorization • Augment existing classifiers to take advantage
Hypertext and Categorization • An already categorized document links to documents within the same category • Neighboring documents are often in a similar category • Hierarchical nature of categories • Metatags
Augmenting Classifiers • Inject anchor text for a document into that document • Treat anchor text as separate terms • Depends on dataset • Mixed experimental results • Links may be noisy • Ads • Navigation
Topics and the Web • Topic distillation • Analysis of hyperlink graph structure • Authorities: popular pages • Hubs: pages that link to authorities • [Figure: a bipartite pattern of hubs pointing to authorities]
Topic Distillation • Kleinberg’s HITS algorithm • An initial set of pages: root set • Use this to create an expanded set • Weight propagation phase • Each node: authority score and hub score • Alternate • Authority = sum of current hub weights of all nodes pointing to it • Hub = sum of all authority score of all pages it points to • Normalize node scores and iterate until convergence • Output is a set of hubs and authorities
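A sketch of the weight-propagation phase of HITS on an adjacency matrix; root-set expansion is assumed to have already produced the graph, and the toy matrix is invented:

```python
import numpy as np

def hits(A, iterations=50):
    # A[i, j] = 1 if page i links to page j. Alternate between the two
    # updates from the slide, normalizing after each round.
    n = A.shape[0]
    hub = np.ones(n)
    auth = np.ones(n)
    for _ in range(iterations):
        auth = A.T @ hub      # authority = sum of hub scores pointing to it
        hub = A @ auth        # hub = sum of authority scores it points to
        auth /= np.linalg.norm(auth)
        hub /= np.linalg.norm(hub)
    return auth, hub

A = np.array([[0, 1, 1],
              [0, 0, 1],
              [1, 0, 0]], dtype=float)
print(hits(A))
```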
Conclusion • Why Classify? • The Classification Process • Various Classifiers • Which ones are better? • Other applications