
An Introduction To Categorization


Presentation Transcript


  1. An Introduction To Categorization Soam Acharya, PhD soamdev@yahoo.com 1/15/2003

  2. What is Categorization? • {c1 … cm} set of predefined categories • {d1 … dn} set of candidate documents • Fill decision matrix with values {0,1} • Categories are symbolic labels
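
As a minimal sketch of the idea (the category and document names below are made up for illustration), the decision matrix can be represented as a table of 0/1 assignments:

```python
# Minimal sketch of a categorization decision matrix (illustrative data only).
categories = ["Home Video", "Sports", "Finance"]      # {c1 ... cm}
documents = ["doc1", "doc2", "doc3", "doc4"]          # {d1 ... dn}

# decision[d][c] is 1 if document d is assigned to category c, else 0.
decision = {d: {c: 0 for c in categories} for d in documents}
decision["doc1"]["Home Video"] = 1
decision["doc3"]["Finance"] = 1

for d in documents:
    print(d, [decision[d][c] for c in categories])
```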

  3. Uses • Document organization • Document filtering • Word sense disambiguation • Web • Internet directories • Organization of search results • Clustering

  4. Categorization Techniques • Knowledge systems • Machine Learning

  5. Knowledge Systems • Manually build an expert system • Makes categorization judgments • Sequence of rules per category • If <boolean condition> then category • If document contains “buena vista home entertainment” then document category is “Home Video”
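
A knowledge-system classifier of this kind boils down to hand-written if-then rules. The sketch below illustrates the idea; the rule conditions and category names are illustrative, not taken from any actual product:

```python
# Illustrative hand-built rule base: each rule is (boolean condition, category).
rules = [
    (lambda text: "buena vista home entertainment" in text.lower(), "Home Video"),
    (lambda text: "quarterly earnings" in text.lower(), "Finance"),
]

def classify(text):
    """Return the categories whose rule conditions the document satisfies."""
    return [category for condition, category in rules if condition(text)]

print(classify("Buena Vista Home Entertainment announced a new DVD line."))
# ['Home Video']
```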

  6. UltraSeek Content Classification Engine

  7. UltraSeek CCE

  8. Knowledge System Issues • Scalability • Build • Tune • Requires Domain Experts • Transferability

  9. Machine Learning Approach • Build a classifier for a category • Training set • Hierarchy of categories • Submit candidate documents for automatic classification • Expend effort in building a classifier, not in knowing the knowledge domain

  10. Machine Learning Process • [Diagram: documents from the documents DB form the training set, pass through document pre-processing, and are used together with the taxonomy to train the classifier]

  11. Training Set • Initial corpus can be divided into: • Training set • Test set • Role of workflow tools

  12. Document Preprocessing • Document Conversion: • Converts file formats (.doc, .ppt, .xls, .pdf, etc.) to text • Tokenizing/Parsing: • Stemming • Document vectorization • Dimension reduction

  13. Document Vectorization • Convert document text into a “bag of words” • Each document is a vector of n weighted terms • Example vector for one document: federal express: 3, severe: 3, flight: 2, Y2000-Q3: 1, mountain: 2, exactly: 1, simple: 5
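
A minimal sketch of turning a document into a bag-of-words vector; here the weights are raw term counts, and the example text is made up:

```python
from collections import Counter
import re

def bag_of_words(text):
    """Tokenize a document and return a term -> count mapping (raw term frequencies)."""
    tokens = re.findall(r"[a-z0-9\-]+", text.lower())
    return Counter(tokens)

doc = "severe flight delays; severe weather near the mountain, mountain airport closed"
print(bag_of_words(doc).most_common(3))
# e.g. [('severe', 2), ('mountain', 2), ('flight', 1)]
```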

  14. Document Vectorization • Use the tfidf function for term weighting: tfidf(tk, dj) = #(tk, dj) · log(|Tr| / #(tk)), where #(tk, dj) is the number of times tk occurs in dj, #(tk) is the number of training documents in which tk occurs at least once, and |Tr| is the cardinality of the training set • The tfidf value may be normalized so that all vectors are of equal length and weights fall in [0,1]
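
A sketch of the tfidf weighting above, with cosine normalization so every vector has unit length; the toy corpus is illustrative only:

```python
import math
from collections import Counter

def tfidf_vectors(training_docs):
    """Compute tfidf(tk, dj) = #(tk, dj) * log(|Tr| / #(tk)) for each training document."""
    tokenized = [doc.lower().split() for doc in training_docs]
    n_docs = len(tokenized)                                             # |Tr|
    doc_freq = Counter(t for tokens in tokenized for t in set(tokens))  # #(tk)

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)                                        # #(tk, dj)
        vec = {t: tf * math.log(n_docs / doc_freq[t]) for t, tf in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})           # unit length
    return vectors

docs = ["severe flight delays", "flight booking simple", "severe weather mountain"]
print(tfidf_vectors(docs)[0])
```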

  15. Dimension Reduction • Reduce dimensionality of vector space • Why? • Reduce computational complexity • Address “overfitting” problem • Overtuning classifier • How? • Feature selection • Feature extraction

  16. Feature Selection • Also known as “term space reduction” • Remove “stop” words • Identify “best” words to be used in categorizing per topic • Document frequency of terms • Keep terms that occur in highest number of documents • Other measures • Chi square • Information gain
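
A sketch of term space reduction by document frequency: drop stop words, then keep only the terms that occur in the highest number of documents. The stop list and toy documents are illustrative:

```python
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "to", "in"}  # illustrative stop list

def select_by_document_frequency(tokenized_docs, keep=100):
    """Keep the non-stop-word terms that occur in the highest number of documents."""
    doc_freq = Counter(
        term
        for tokens in tokenized_docs
        for term in set(tokens)
        if term not in STOP_WORDS
    )
    return [term for term, _ in doc_freq.most_common(keep)]

docs = [
    "the flight to the mountain".split(),
    "severe delays in the flight".split(),
    "mountain weather severe".split(),
]
print(select_by_document_frequency(docs, keep=3))
# e.g. ['flight', 'mountain', 'severe']
```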

  17. Feature Extraction • Synthesize new features from existing features • Term clustering • Use clusters/centroids instead of terms • Co-occurrence and co-absence • Latent Semantic Indexing • Compresses vectors into a lower dimensional space
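
Latent Semantic Indexing amounts to a truncated SVD of the term-document matrix; a sketch with NumPy (assumed available), using a toy 4-term by 3-document matrix and 2 latent dimensions:

```python
import numpy as np

def lsi_project(term_doc_matrix, k=2):
    """Project documents into a k-dimensional latent space via truncated SVD."""
    # term_doc_matrix: terms x documents, e.g. tfidf weights.
    U, s, Vt = np.linalg.svd(term_doc_matrix, full_matrices=False)
    # Rows of Vt.T scaled by the top-k singular values are the compressed document vectors.
    return Vt[:k].T * s[:k]

A = np.array([[3., 0., 1.],
              [0., 2., 0.],
              [2., 0., 2.],
              [0., 1., 0.]])          # 4 terms x 3 documents (toy weights)
print(lsi_project(A, k=2))            # each row is a document in the 2-D latent space
```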

  18. Creating a Classifier • Define a function, Categorization Status Value, CSV, that for a document d: • CSVi: D -> [0,1] • Confidence that d belongs in ci • Boolean • Probability • Vector distance

  19. Creating a Classifier • Define a threshold, thresh, such that if CSVi(d) > thresh(i) then categorize d under ci otherwise, don’t • CSV thresholding • Fixed value across all categories • Vary per category • Optimize via testing
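
A sketch of per-category CSV thresholding; in practice each threshold would be tuned on a test set, and the scores and thresholds below are made up:

```python
def assign_categories(csv_scores, thresholds):
    """Categorize a document under ci whenever CSVi(d) > thresh(i)."""
    return [c for c, score in csv_scores.items() if score > thresholds[c]]

# Illustrative CSV scores for one document and per-category thresholds.
csv_scores = {"Home Video": 0.82, "Sports": 0.40, "Finance": 0.65}
thresholds = {"Home Video": 0.70, "Sports": 0.50, "Finance": 0.60}
print(assign_categories(csv_scores, thresholds))
# ['Home Video', 'Finance']
```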

  20. Naïve Bayes Classifier • Estimate P(ci | dj), the probability that document dj belongs in category ci • By Bayes’ rule, P(ci | dj) = P(ci) · P(dj | ci) / P(dj) • The training set terms/weights present in dj are used to calculate the probability of dj belonging to ci

  21. Naïve Bayes Classifier • If wkj is binary (0, 1) and pki is short for P(wkj = 1 | ci), then after further derivation the original equation becomes: log P(ci | dj) ∝ Σk wkj · log(pki / (1 − pki)) + Σk log(1 − pki) + log P(ci) • The document-dependent first sum can be used as the CSV; the remaining terms are constants for all docs

  22. Naïve Bayes Classifier • Independence assumption • Feature selection can be counterproductive
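
A minimal binary-feature Naïve Bayes sketch following the derivation above. Laplace smoothing is added so pki never reaches 0 or 1; the training documents and category names are illustrative:

```python
import math
from collections import defaultdict

def train_naive_bayes(labeled_docs):
    """Estimate pki = P(term k present | category i) from binary term occurrences."""
    docs_per_cat = defaultdict(int)
    term_docs_per_cat = defaultdict(lambda: defaultdict(int))
    vocab = set()
    for tokens, cat in labeled_docs:
        docs_per_cat[cat] += 1
        for term in set(tokens):
            term_docs_per_cat[cat][term] += 1
            vocab.add(term)
    model = {}
    total = sum(docs_per_cat.values())
    for cat, n in docs_per_cat.items():
        prior = math.log(n / total)
        # Laplace smoothing keeps every pki strictly between 0 and 1.
        p = {t: (term_docs_per_cat[cat][t] + 1) / (n + 2) for t in vocab}
        model[cat] = (prior, p)
    return model, vocab

def csv_scores(model, vocab, tokens):
    """Log-probability score per category: prior plus the binary term likelihoods."""
    present = set(tokens) & vocab
    scores = {}
    for cat, (prior, p) in model.items():
        score = prior
        for t in vocab:
            score += math.log(p[t]) if t in present else math.log(1 - p[t])
        scores[cat] = score
    return scores

train = [("buena vista home video release".split(), "Home Video"),
         ("quarterly earnings report finance".split(), "Finance")]
model, vocab = train_naive_bayes(train)
print(csv_scores(model, vocab, "new home video release".split()))
```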

  23. k-NN Classifier • Compute closeness between candidate documents and category documents: CSVi(dj) = Σ over the k nearest training documents dz of similarity(dj, dz) · CSVi(dz), where similarity(dj, dz) is the similarity between dj and training set document dz, and CSVi(dz) is the confidence score indicating whether dz belongs to category ci

  24. k-NN Classifier • k nearest neighbors • Find the k nearest neighbors from all training documents and use their categories • k can also indicate the number of top-ranked training documents per category to compare against • Similarity computation can be: • Inner product • Cosine coefficient
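
A k-NN sketch using the cosine coefficient over sparse term-weight vectors; the toy vectors and category labels are made up, and in practice they would come from the vectorization step above:

```python
import math

def cosine(u, v):
    """Cosine coefficient between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_csv(candidate, training_set, k=3):
    """Sum the similarities of the k nearest training docs, grouped by their categories."""
    neighbors = sorted(training_set, key=lambda item: cosine(candidate, item[0]), reverse=True)[:k]
    scores = {}
    for vec, category in neighbors:
        scores[category] = scores.get(category, 0.0) + cosine(candidate, vec)
    return scores

training_set = [({"flight": 1.0, "delay": 0.5}, "Travel"),
                ({"earnings": 1.0, "report": 0.7}, "Finance"),
                ({"mountain": 0.9, "flight": 0.4}, "Travel")]
print(knn_csv({"flight": 1.0, "mountain": 0.2}, training_set, k=2))
```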

  25. Support Vector Machines • Find the “decision surface” (optimal hyperplane) that best separates data points in two classes with maximum margin • Support vectors are the training docs that best define the hyperplane

  26. Support Vector Machines • Training process involves finding the support vectors • Only care about support vectors in the training set, not other documents
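
A sketch of training a linear SVM on toy term-weight vectors. It assumes scikit-learn is available, which the slides do not prescribe; the data and labels are illustrative:

```python
# Sketch only: scikit-learn is an assumption, not something the slides call for.
from sklearn.svm import LinearSVC

# Toy term-weight vectors (rows = documents) and their category labels.
X = [[3.0, 0.0, 0.0],   # "home video"-heavy documents
     [2.5, 0.5, 0.0],
     [0.0, 3.0, 1.0],   # "finance"-heavy documents
     [0.0, 2.0, 2.0]]
y = ["Home Video", "Home Video", "Finance", "Finance"]

clf = LinearSVC()                        # fits a maximum-margin separating hyperplane
clf.fit(X, y)
print(clf.predict([[2.0, 0.2, 0.0]]))    # -> ['Home Video']
```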

  27. Neural Networks • Train a net to learn a mapping from input words to a category • One neural net per category • Too expensive • One network overall • Perceptron approach without a hidden layer • Three-layered

  28. Classifier Committees • Combine multiple classifiers • Majority voting • Category specialization • Mixed results

  29. Classification Performance • Category ranking evaluation • Recall = (categories found and correct) / (total categories correct) • Precision = (categories found and correct) / (total categories found) • Micro- and macro-averaging over categories
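
A sketch of per-category precision and recall with micro- and macro-averaging; the per-category counts below are made up:

```python
def precision_recall(counts):
    """counts: category -> (found_and_correct, total_found, total_correct)."""
    per_cat, totals = {}, [0, 0, 0]
    for cat, (tp, found, correct) in counts.items():
        per_cat[cat] = (tp / found if found else 0.0,      # precision
                        tp / correct if correct else 0.0)  # recall
        totals = [totals[0] + tp, totals[1] + found, totals[2] + correct]
    # Macro-average: mean of the per-category scores; micro-average: pooled counts.
    macro_p = sum(p for p, _ in per_cat.values()) / len(per_cat)
    macro_r = sum(r for _, r in per_cat.values()) / len(per_cat)
    micro_p = totals[0] / totals[1] if totals[1] else 0.0
    micro_r = totals[0] / totals[2] if totals[2] else 0.0
    return per_cat, (micro_p, micro_r), (macro_p, macro_r)

counts = {"Travel": (8, 10, 12), "Finance": (3, 4, 9)}   # illustrative counts
per_cat, micro, macro = precision_recall(counts)
print(per_cat, micro, macro)
```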

  30. Classification Performance • Hard • Two studies • Yiming Yang, 1997 • Yiming Yang and Xin Liu, 1999 • SVM, kNN >> Neural Net > Naïve Bayes • Performance converges for common categories (with many training docs)

  31. Computational Bottlenecks • Quiver • # of topics • # of training documents • # of candidate documents

  32. Categorization and the Internet • Classification as a service • Standardizing vocabulary • Confidentiality • Performance • Use of hypertext in categorization • Augment existing classifiers to take advantage

  33. Hypertext and Categorization • An already categorized document links to documents within same category • Neighboring documents in a similar category • Hierarchical nature of categories • Metatags

  34. Augmenting Classifiers • Inject anchor text for a document into that document • Treat anchor text as separate terms • Depends on dataset • Mixed experimental results • Links may be noisy • Ads • Navigation

  35. Topics and the Web • Topic distillation • Analysis of hyperlink graph structure • Authorities • Popular pages • Hubs • Links to authorities • [Figure: hubs pointing to authorities]

  36. Topic Distillation • Kleinberg’s HITS algorithm • Start with an initial set of pages: the root set • Use this to create an expanded set • Weight propagation phase • Each node: authority score and hub score • Alternate updates: • Authority = sum of the current hub weights of all nodes pointing to it • Hub = sum of the authority scores of all pages it points to • Normalize node scores and iterate until convergence • Output is a set of hubs and authorities
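
A compact sketch of the HITS weight-propagation phase over an expanded set represented as an adjacency list; the tiny graph below is made up for illustration:

```python
import math

def hits(graph, iterations=50):
    """graph: node -> list of nodes it links to. Returns (authority, hub) scores."""
    nodes = set(graph) | {n for targets in graph.values() for n in targets}
    auth = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score = sum of current hub weights of all nodes pointing to it.
        auth = {n: sum(hub[src] for src, targets in graph.items() if n in targets) for n in nodes}
        # Hub score = sum of the authority scores of all pages it points to.
        hub = {n: sum(auth[t] for t in graph.get(n, [])) for n in nodes}
        # Normalize so the scores converge instead of growing without bound.
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {n: v / a_norm for n, v in auth.items()}
        hub = {n: v / h_norm for n, v in hub.items()}
    return auth, hub

graph = {"p1": ["p3", "p4"], "p2": ["p3", "p4"], "p3": ["p4"], "p4": []}
authority, hubs = hits(graph)
print(sorted(authority, key=authority.get, reverse=True))  # heavily linked pages rank first
```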

  37. Conclusion • Why Classify? • The Classification Process • Various Classifiers • Which ones are better? • Other applications
