This introduction to categorization highlights its importance in document organization and filtering. It outlines the decision-matrix formulation of the problem, knowledge-based and machine learning approaches, and feature selection for building classifiers. Several categorization methods, including Naïve Bayes, k-NN, and Support Vector Machines, are explained to show how documents can be classified effectively. The text also addresses challenges such as scalability and the need for domain expertise.
An Introduction To Categorization • Soam Acharya, PhD (soamdev@yahoo.com) • 1/15/2003
What is Categorization? • {c1 … cm} set of predefined categories • {d1 … dn} set of candidate documents • Fill decision matrix with values {0,1} • Categories are symbolic labels
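To make the decision-matrix formulation concrete, here is a minimal Python sketch; the category and document names are invented for illustration:

```python
# A sketch of the categorization decision matrix: rows are candidate
# documents, columns are predefined categories, and each cell holds
# 1 (document belongs to the category) or 0 (it does not).
categories = ["Home Video", "Finance", "Travel"]   # hypothetical labels
documents = ["d1", "d2", "d3", "d4"]

# decision[doc][cat] in {0, 1}
decision = {d: {c: 0 for c in categories} for d in documents}
decision["d1"]["Home Video"] = 1
decision["d3"]["Travel"] = 1

for d in documents:
    assigned = [c for c in categories if decision[d][c] == 1]
    print(d, "->", assigned or ["(uncategorized)"])
```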
Uses • Document organization • Document filtering • Word sense disambiguation • Web • Internet directories • Organization of search results • Clustering
Categorization Techniques • Knowledge systems • Machine Learning
Knowledge Systems • Manually build an expert system that makes categorization judgments • A sequence of rules per category: if <boolean condition> then category • Example: if document contains “buena vista home entertainment” then document category is “Home Video”
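A minimal Python sketch of such a rule sequence, assuming plain-text documents; the second rule and the "Finance" category are invented for illustration:

```python
# An ordered list of (boolean condition, category) rules, applied in
# sequence, in the spirit of a hand-built knowledge system.
RULES = [
    (lambda text: "buena vista home entertainment" in text, "Home Video"),
    (lambda text: "mutual fund" in text and "prospectus" in text, "Finance"),
]

def categorize(document):
    text = document.lower()
    for condition, category in RULES:
        if condition(text):
            return category
    return None  # no rule fired; leave the document uncategorized

print(categorize("New releases from Buena Vista Home Entertainment ..."))
```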
Knowledge System Issues • Scalability • Build • Tune • Requires Domain Experts • Transferability
Machine Learning Approach • Build a classifier for a category • Training set • Hierarchy of categories • Submit candidate documents for automatic classification • Expend effort in building a classifier, not in knowing the knowledge domain
Machine Learning Process • [Diagram: documents from a document DB pass through pre-processing to form the training set; the training set, organized by a taxonomy, is used to build the classifier, which then categorizes new documents]
Training Set • Initial corpus can be divided into: • Training set • Test set • Role of workflow tools
Document Preprocessing • Document conversion: converts file formats (.doc, .ppt, .xls, .pdf, etc.) to text • Tokenizing/parsing: • Stemming • Document vectorization • Dimension reduction
Document Vectorization • Convert document text into a “bag of words” • Each document is a vector of n weighted terms • Example term/weight pairs for one document: federal express (3), severe (3), flight (2), Y2000-Q3 (1), mountain (2), exactly (1), simple (5)
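A minimal bag-of-words sketch in Python; real preprocessing would also strip punctuation, remove stop words, and stem, as the preprocessing slide notes:

```python
from collections import Counter

def vectorize(text):
    # Lowercase, split on whitespace, and count term occurrences.
    tokens = text.lower().split()
    return Counter(tokens)

doc = "severe severe severe flight flight mountain mountain simple"
print(vectorize(doc))  # Counter({'severe': 3, 'flight': 2, ...})
```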
Document Vectorization • Use the tfidf function for term weighting: tfidf(tk, dj) = #(tk, dj) · log( |Tr| / #(tk) ), where #(tk, dj) is the number of times tk occurs in dj, #(tk) is the number of training documents in which tk occurs at least once, and |Tr| is the cardinality of the training set • tfidf values may be normalized to [0,1] so that all document vectors have equal length
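A sketch of this weighting in Python, assuming the training set Tr is a list of bag-of-words Counters:

```python
import math
from collections import Counter

def tfidf(term, doc_counts, training_set):
    tf = doc_counts[term]                            # #(tk, dj)
    df = sum(1 for d in training_set if term in d)   # #(tk)
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(len(training_set) / df)     # |Tr| / #(tk)

def normalize(weights):
    # Cosine-normalize a {term: weight} vector to unit length,
    # which keeps all weights in [0, 1] as the slide notes.
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else weights

training = [Counter("severe flight flight".split()),
            Counter("mountain simple".split())]
print(tfidf("flight", training[0], training))  # 2 * log(2/1)
```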
Dimension Reduction • Reduce dimensionality of vector space • Why? • Reduce computational complexity • Address “overfitting” problem • Overtuning classifier • How? • Feature selection • Feature extraction
Feature Selection • Also known as “term space reduction” • Remove “stop” words • Identify “best” words to be used in categorizing per topic • Document frequency of terms • Keep terms that occur in highest number of documents • Other measures • Chi square • Information gain
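A sketch of term-space reduction by document frequency; the stop-word list and cutoff k are illustrative, and chi square or information gain would simply replace the scoring step:

```python
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "to", "in"}

def select_by_doc_frequency(training_set, k):
    # Count, for each non-stop term, how many documents contain it,
    # then keep the k terms with the highest document frequency.
    df = Counter()
    for doc in training_set:              # doc: a list/set of tokens
        df.update(set(doc) - STOP_WORDS)
    return {term for term, _ in df.most_common(k)}

docs = [["the", "flight", "was", "severe"], ["flight", "to", "mountain"]]
print(select_by_doc_frequency(docs, k=2))
```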
Feature Extraction • Synthesize new features from existing features • Term clustering • Use clusters/centroids instead of terms • Co-occurrence and co-absence • Latent Semantic Indexing • Compresses vectors into a lower dimensional space
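A sketch of Latent Semantic Indexing via a truncated SVD of the term-document matrix; the matrix shapes and toy data here are made up for illustration:

```python
import numpy as np

def lsi(term_doc_matrix, k):
    # Truncated SVD: keep the k largest singular values/vectors,
    # compressing each document into a k-dimensional "concept" space.
    U, s, Vt = np.linalg.svd(term_doc_matrix, full_matrices=False)
    return s[:k] * Vt[:k].T   # each row: one document in k dimensions

A = np.random.rand(500, 40)   # 500 terms x 40 documents (toy data)
docs_k = lsi(A, k=10)
print(docs_k.shape)           # (40, 10)
```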
Creating a Classifier • Define a Categorization Status Value function CSVi: D -> [0,1] that, for a document d, gives the confidence that d belongs in ci • The value may be a boolean, a probability, or a vector distance
Creating a Classifier • Define a threshold, thresh, such that if CSVi(d) > thresh(i) then categorize d under ci otherwise, don’t • CSV thresholding • Fixed value across all categories • Vary per category • Optimize via testing
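A sketch of CSV thresholding; the threshold values are invented, and in practice they are tuned per category on held-out test data:

```python
thresholds = {"Home Video": 0.6, "Finance": 0.4}  # hypothetical, per-category

def assign(csv_scores, thresholds):
    # csv_scores maps category -> CSVi(d) in [0, 1]; keep every
    # category whose score clears its own threshold.
    return [c for c, score in csv_scores.items() if score > thresholds[c]]

print(assign({"Home Video": 0.72, "Finance": 0.31}, thresholds))
```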
Naïve Bayes Classifier • P(ci | dj) = P(ci) · P(dj | ci) / P(dj): the probability that document dj belongs in category ci • The term weights of dj, as estimated from the training set, are used to compute P(dj | ci)
Naïve Bayes Classifier • If wkj is binary (0,1) and pki is short for P(wkx = 1 | ci), then P(dj | ci) = Πk pki^wkj · (1 − pki)^(1 − wkj) • After further derivation, the original equation becomes: log P(ci | dj) ∝ Σk wkj · log[ pki / (1 − pki) ] + Σk log(1 − pki) + log P(ci) • The last two terms are constants for all docs, so the expression can be used as a CSV
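A sketch of the binary Naïve Bayes score above in Python; the vocabulary, probabilities, and prior are toy values, and in practice pki would be estimated from training counts with smoothing:

```python
import math

def nb_log_score(doc_terms, vocabulary, p_ci, prior_ci):
    # log P(ci) + sum over vocabulary terms: log pki if the term is
    # present in the document, log (1 - pki) if it is absent.
    score = math.log(prior_ci)
    for term in vocabulary:
        p = p_ci[term]
        score += math.log(p) if term in doc_terms else math.log(1.0 - p)
    return score  # monotone in P(ci | dj), so usable as a CSV

vocab = {"flight", "severe", "mountain"}
p = {"flight": 0.8, "severe": 0.6, "mountain": 0.1}  # P(wk = 1 | ci)
print(nb_log_score({"flight", "severe"}, vocab, p, prior_ci=0.3))
```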
Naïve Bayes Classifier • Independence assumption • Feature selection can be counterproductive
k-NN Classifier • Compute closeness between candidate documents and category documents: CSVi(dj) = Σ over dz in Trk(dj) of RSV(dj, dz) · Ciz, where RSV(dj, dz) is the similarity between dj and training document dz, Ciz ∈ {0,1} records whether dz belongs to ci, and Trk(dj) is the set of the k training documents nearest to dj • The sum is a confidence score indicating whether dj belongs to category ci
k-NN Classifier • k nearest neighbors • Find the k nearest neighbors among all training documents and use their categories • k can also indicate the number of top-ranked training documents per category to compare against • Similarity computation can be: • Inner product • Cosine coefficient
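A sketch of the k-NN CSV using the cosine coefficient over sparse term-weight vectors; the data layout (a list of vector/category pairs) is an assumption:

```python
import math
from collections import defaultdict

def cosine(u, v):
    # Cosine coefficient between two sparse {term: weight} dicts.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_csv(doc, training, k):
    # training: list of (vector, category) pairs. Sum the similarities
    # of the k nearest training documents, grouped by category.
    sims = sorted(((cosine(doc, vec), cat) for vec, cat in training),
                  reverse=True)
    scores = defaultdict(float)
    for sim, cat in sims[:k]:
        scores[cat] += sim
    return dict(scores)  # CSVi per category
```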
Support Vector Machines • Find the “decision surface” that best separates the data points of two classes • Support vectors are the training docs that best define the hyperplane • [Figure: the optimal hyperplane maximizes the margin between the two classes]
Support Vector Machines • Training process involves finding the support vectors • Only care about support vectors in the training set, not other documents
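A sketch of a linear SVM on toy 2-D data using scikit-learn (a library the slides do not mention); after training, only the support vectors define the separating hyperplane:

```python
import numpy as np
from sklearn.svm import SVC  # assumes scikit-learn is installed

# Two well-separated toy classes in 2-D.
X = np.array([[0, 0], [1, 1], [1, 0], [3, 3], [4, 3], [3, 4]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear")
clf.fit(X, y)
print(clf.support_vectors_)     # the training points defining the hyperplane
print(clf.predict([[2.0, 2.5]]))
```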
Neural Networks • Train a net to learn a mapping from input terms to categories • One neural net per category is too expensive; use one network overall instead • Perceptron approach (no hidden layer) or a three-layered network
Classifier Committees • Combine multiple classifiers • Majority voting • Category specialization • Mixed results
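A sketch of majority voting over a committee; the member classifiers here are trivial stand-ins, and ties are broken arbitrarily (real systems often weight votes by confidence):

```python
from collections import Counter

def committee_vote(classifiers, document):
    # Each member returns one category label; the most common vote wins.
    votes = [clf(document) for clf in classifiers]
    return Counter(votes).most_common(1)[0][0]

always_a = lambda d: "A"
by_length = lambda d: "A" if len(d) < 10 else "B"
always_b = lambda d: "B"
print(committee_vote([always_a, by_length, always_b], "short doc"))  # "A"
```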
Classification Performance • Category ranking evaluation • Recall = (categories found and correct) / (total categories correct) • Precision = (categories found and correct) / (total categories found) • Micro- and macro-averaging over categories
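A sketch of micro- versus macro-averaged precision; the per-category counts are toy numbers, and recall averages the same way with "total correct" in the denominator:

```python
def micro_macro_precision(per_category):
    # per_category: list of (tp, found) pairs, one per category, where
    # tp = categories found and correct, found = total categories found.
    micro = sum(tp for tp, _ in per_category) / sum(f for _, f in per_category)
    # Macro: average the per-category precisions (0 when nothing found).
    macro = sum(tp / f if f else 0.0 for tp, f in per_category) / len(per_category)
    return micro, macro

# Micro-averaging favors common categories; macro weights all equally.
print(micro_macro_precision([(8, 10), (1, 5), (90, 100)]))
```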
Classification Performance • Hard • Two studies • Yiming Yang, 1997 • Yiming Yang and Xin Liu, 1999 • SVM, kNN >> Neural Net > Naïve Bayes • Performance converges for common categories (with many training docs)
Computational Bottlenecks • Quiver • # of topics • # of training documents • # of candidate documents
Categorization and the Internet • Classification as a service • Standardizing vocabulary • Confidentiality • Performance • Use of hypertext in categorization • Augment existing classifiers to take advantage
Hypertext and Categorization • An already categorized document links to documents within the same category • Neighboring documents are often in a similar category • Hierarchical nature of categories • Metatags
Augmenting Classifiers • Inject anchor text for a document into that document • Treat anchor text as separate terms • Depends on dataset • Mixed experimental results • Links may be noisy • Ads • Navigation
Topics and the Web • Topic distillation • Analysis of hyperlink graph structure • Authorities: popular pages • Hubs: pages that link to authorities • [Figure: a bipartite pattern of hubs pointing to authorities]
Topic Distillation • Kleinberg’s HITS algorithm • An initial set of pages: root set • Use this to create an expanded set • Weight propagation phase • Each node: authority score and hub score • Alternate • Authority = sum of current hub weights of all nodes pointing to it • Hub = sum of all authority score of all pages it points to • Normalize node scores and iterate until convergence • Output is a set of hubs and authorities
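A sketch of the weight-propagation phase of HITS on an adjacency matrix; root-set expansion is assumed to have already produced the graph, and the toy matrix is invented:

```python
import numpy as np

def hits(A, iterations=50):
    # A[i, j] = 1 if page i links to page j. Alternate between the two
    # updates from the slide, normalizing after each round.
    n = A.shape[0]
    hub = np.ones(n)
    auth = np.ones(n)
    for _ in range(iterations):
        auth = A.T @ hub      # authority = sum of hub scores pointing to it
        hub = A @ auth        # hub = sum of authority scores it points to
        auth /= np.linalg.norm(auth)
        hub /= np.linalg.norm(hub)
    return auth, hub

A = np.array([[0, 1, 1],
              [0, 0, 1],
              [1, 0, 0]], dtype=float)
print(hits(A))
```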
Conclusion • Why Classify? • The Classification Process • Various Classifiers • Which ones are better? • Other applications