Supervised learning for text
Organizing knowledge • Systematic knowledge structures • Ontologies • Dewey decimal system, the Library of Congress catalog, the AMS Mathematics Subject Classification, and the US Patent subject classification • Web catalogs • Yahoo & Dmoz • Problem: Manual maintenance Chakrabarti & Ramakrishnan
Topic Tagging • Finding similar documents • Guiding queries • Naïve approach: syntactic similarity between documents • Better approach: topic tagging
Topic Tagging • Advantages • Increase vocabulary of classes • Hierarchical visualization and browsing aids • Applications • Email/bookmark organization • News tracking • Tracking authors of anonymous texts (e.g., using the Flesch-Kincaid index) • Classifying the purpose of hyperlinks
Supervised learning • Learning to assign objects to classes given examples • Learner (classifier) • Figure: A typical supervised text learning scenario.
Difference with texts • Machine learning classification techniques were developed for structured data • Text: lots of features and lots of noise • No fixed number of columns • No categorical attribute values • Data scarcity • Larger number of class labels • Hierarchical relationships between classes are less systematic than in structured data
Techniques • Nearest neighbor classifier • Lazy learner: remembers all training instances • Decision on a test document: based on the distribution of labels on the training documents most similar to it • Assigns large weights to rare terms • Feature selection • Removes terms in the training documents that are statistically uncorrelated with the class labels • Bayesian classifier • Fits a generative term distribution Pr(d|c) to each class c of documents {d} • Testing: the distribution most likely to have generated a test document is used to label it
Other Classifiers • Maximum entropy classifier: • Estimates a direct distribution Pr(c|d) from term space to the probability of various classes • Support vector machines: • Represent classes by numbers • Construct a direct function from term space to the class variable • Rule induction: • Induces rules for classification over diverse features • E.g.: information from ordinary terms, the structure of the HTML tag tree in which terms are embedded, link neighbors, citations
Other Issues • Tokenization • E.g.: replacing monetary amounts by a special token • Evaluating a text classifier • Accuracy • Training speed and scalability • Simplicity, speed, and scalability for document modifications • Ease of diagnosis, interpretation of results, and adding human judgment and feedback (subjective)
Benchmarks for accuracy • Reuters • 10700 labeled documents • 10% of documents with multiple class labels • OHSUMED • 348566 abstracts from medical journals • 20NG • 18800 labeled USENET postings • 20 leaf classes, 5 root level classes • WebKB • 8300 documents in 7 academic categories • Industry • 10000 home pages of companies from 105 industry sectors • Shallow hierarchies of sector names
Measures of accuracy • Assumptions • Each document is associated with exactly one class, OR • Each document is associated with a subset of classes • Confusion matrix (M) • For more than 2 classes • M[i, j]: number of test documents belonging to class i that were assigned to class j • Perfect classifier: only the diagonal elements M[i, i] would be nonzero
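The confusion matrix described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the single-label scenario and integer class ids; the function names and the toy labels are hypothetical and not taken from the slides.

```python
def confusion_matrix(true_labels, predicted_labels, num_classes):
    """M[i][j] = number of test documents of class i assigned to class j."""
    M = [[0] * num_classes for _ in range(num_classes)]
    for i, j in zip(true_labels, predicted_labels):
        M[i][j] += 1
    return M


def accuracy(M):
    """Fraction of test documents on the diagonal of M."""
    total = sum(sum(row) for row in M)
    correct = sum(M[i][i] for i in range(len(M)))
    return correct / total if total else 0.0


# Hypothetical toy data: 3 classes, 6 test documents.
truth = [0, 0, 1, 1, 2, 2]
guess = [0, 1, 1, 1, 2, 0]
M = confusion_matrix(truth, guess, 3)
print(M, accuracy(M))
```

Off-diagonal entries show which pairs of classes the classifier confuses, which a single accuracy number hides.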
Evaluating classifier accuracy • Two-way ensemble • To avoid searching over the power-set of class labels in the subset scenario • For each class c, create a positive and a negative class (e.g., "Sports" and "Not sports", the latter holding all remaining documents) • Recall and precision • Contingency matrix per (d, c) pair
Evaluating classifier accuracy (contd.) • Micro-averaged contingency matrix • Micro-averaged precision and recall • Equal importance for each document • Macro-averaged precision and recall • Equal importance for each class
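As a rough sketch of the two averaging schemes, the following Python code sums the per-class contingency counts before dividing (micro-averaging) or averages the per-class ratios (macro-averaging). It assumes each document carries a set of labels; the helper names and the toy data are illustrative assumptions.

```python
def per_class_counts(truth, guess, classes):
    """truth, guess: one set of labels per document.
    Returns {class: (true positives, false positives, false negatives)}."""
    counts = {}
    for c in classes:
        tp = fp = fn = 0
        for t, g in zip(truth, guess):
            if c in g and c in t:
                tp += 1
            elif c in g:
                fp += 1
            elif c in t:
                fn += 1
        counts[c] = (tp, fp, fn)
    return counts


def _ratio(a, b):
    return a / b if b else 0.0


def micro_average(counts):
    """Sum the contingency counts over classes, then compute the ratios."""
    tp = sum(v[0] for v in counts.values())
    fp = sum(v[1] for v in counts.values())
    fn = sum(v[2] for v in counts.values())
    return _ratio(tp, tp + fp), _ratio(tp, tp + fn)


def macro_average(counts):
    """Compute precision and recall per class, then average over classes."""
    precisions = [_ratio(tp, tp + fp) for tp, fp, fn in counts.values()]
    recalls = [_ratio(tp, tp + fn) for tp, fp, fn in counts.values()]
    return sum(precisions) / len(counts), sum(recalls) / len(counts)


# Hypothetical toy data: four documents, two classes.
truth = [{"sports"}, {"sports", "politics"}, {"politics"}, {"sports"}]
guess = [{"sports"}, {"sports"}, {"sports", "politics"}, {"politics"}]
counts = per_class_counts(truth, guess, ["sports", "politics"])
print(micro_average(counts), macro_average(counts))
```

With many small classes the two averages can differ widely, because macro-averaging lets a rare class count as much as a frequent one.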
Evaluating classifier accuracy (contd.) • Precision-recall tradeoff • Plot of precision vs. recall: a better classifier has higher curvature • Harmonic mean of precision and recall (F1): penalizes classifiers that sacrifice one for the other
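A small numeric illustration of why the harmonic mean is used: it stays close to the smaller of the two values, so a lopsided classifier scores poorly. The function name and the sample values are hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Both classifiers have the same arithmetic mean (0.5) of precision and recall.
balanced = f1_score(0.5, 0.5)
lopsided = f1_score(0.9, 0.1)
print(balanced, lopsided)
```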
Nearest Neighbor classifiers • Intuition • Similar documents are expected to be assigned the same class label • Vector space model + cosine similarity • Training: • Index each document and remember its class label • Testing: • Fetch the k most similar documents to the given document • Majority class wins • Alternative: weighted counts, i.e. counts of classes weighted by the corresponding similarity measure • Alternative: a per-class offset b_c, tuned by testing the classifier on a portion of the training data held out for this purpose
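The training and testing steps above can be sketched as follows. This is a simplified illustration, assuming documents are sparse dictionaries of raw term counts (no TF-IDF weighting, no inverted index, no per-class offsets); all names and the toy corpus are hypothetical.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_classify(query, training, k=3, weighted=False):
    """training: list of (vector, label) pairs remembered verbatim.
    Votes are plain counts, or similarity-weighted if weighted=True."""
    scored = sorted(
        ((cosine(query, vec), label) for vec, label in training),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    votes = defaultdict(float)
    for sim, label in scored:
        votes[label] += sim if weighted else 1.0
    return max(votes, key=votes.get)


# Hypothetical toy corpus of term-count vectors.
training = [
    ({"ball": 2, "goal": 1}, "sports"),
    ({"goal": 1, "team": 2}, "sports"),
    ({"vote": 2, "party": 1}, "politics"),
    ({"party": 2, "election": 1}, "politics"),
]
query = {"goal": 1, "team": 1, "vote": 1}
print(knn_classify(query, training, k=3))
```

A linear scan over all training documents is used here only for brevity; the slides assume an inverted index so that only documents sharing a term with the query are scored.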
Nearest neighbor classification
Pros • Easy availability and reuse of an inverted index • Collection updates are trivial • Accuracy comparable to the best known classifiers
Cons • Iceberg-style queries: classifying a test document d_q involves • as many inverted index lookups as there are distinct terms in d_q, • scoring the (possibly large number of) candidate documents that overlap with d_q in at least one word, • sorting by overall similarity, • picking the best k documents • Space overhead and redundancy • Data stored at the level of individual documents • No distillation
Workarounds • To reduce space requirements and speed up classification • Find clusters in the data • Store only a few statistical parameters per cluster • Compare with documents in only the most promising clusters • But again: • Ad hoc choices for the number and size of clusters and parameters • k is corpus sensitive
TF-IDF • TF-IDF is computed for the whole corpus • Interclass correlations and per-class term frequencies are unaccounted for • Terms that occur relatively frequently in some classes compared to others should have higher importance • Overall rarity in the corpus is not as important
Feature selection • Data sparsity: • Term distributions could be estimated reliably only if the training set were very large relative to the vocabulary • This is not the case, however • Vocabulary size ≫ number of documents • For Reuters, only about 10300 documents are available • Over-fitting problem • The joint distribution may fit the training instances • But may not fit unforeseen test data that well
Marginals rather than joint • Marginal distribution of each term in each class • Empirical distributions may still not reflect the actual distributions if data is sparse • Therefore feature selection • Purposes: • Improve accuracy by avoiding overfitting • Maintain accuracy while discarding as many features as possible, to save a great deal of space for storing statistics • Heuristic, guided by linguistic and domain knowledge, or statistical
Feature selection • Perfect feature selection • Goal-directed • Pick all possible subsets of features • For each subset, train and test a classifier • Retain the subset that resulted in the highest accuracy • COMPUTATIONALLY INFEASIBLE • Simple heuristics • Stop words like "a", "an", "the" etc. • Empirically chosen thresholds (task and corpus sensitive) for discarding "too frequent" or "too rare" terms • Larger and more complex data sets • Confusion with stop words • Especially for topic hierarchies • Greedy inclusion (bottom up) vs. truncation (top-down)
Greedy inclusion algorithm • Most commonly used in text • Algorithm: • Compute, for each term, a measure of discrimination amongst classes • Arrange the terms in decreasing order of this measure • Retain a number of the best terms or features for use by the classifier • Greedy because • the measure of discrimination of a term is computed independently of other terms • Over-inclusion: mild effects on accuracy
Measure of discrimination • Dependent on • the model of documents • the desired speed of training • the ease of updates to documents and class assignments • Observations • The term sets selected by different measures for acceptable accuracy tend to have large overlap
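The χ² test • Similar to the likelihood ratio test • Build a 2 × 2 contingency matrix per class-term pair: k_{ℓ,m} is the number of documents with class indicator C = ℓ and term indicator I_t = m, for ℓ, m ∈ {0, 1}, and n is the total number of documents • Under the independence hypothesis, the expected count in cell (ℓ, m) is n Pr(C = ℓ) Pr(I_t = m) • χ² aggregates the deviations of observed values from expected values: $\chi^2 = \sum_{\ell,m} \frac{\left(k_{\ell,m} - n \Pr(C=\ell)\Pr(I_t=m)\right)^2}{n \Pr(C=\ell)\Pr(I_t=m)}$ • The larger the value of χ², the lower our belief that the independence assumption is upheld by the observed data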
The χ² test • Feature selection process • Sort terms in decreasing order of their χ² values • Train several classifiers with a varying number of features • Stop at the point of maximum accuracy
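A minimal sketch of χ²-based term ranking, assuming the binary document model (a document is a set of terms) and a single target class against the rest. It uses the standard closed form of the χ² statistic for a 2 × 2 table; the function names, the alphabetical tie-break, and the toy corpus are illustrative assumptions.

```python
def chi_square(k00, k01, k10, k11):
    """chi^2 for a 2x2 table where k_lm counts documents with
    class indicator l and term indicator m."""
    n = k00 + k01 + k10 + k11
    denom = (k11 + k10) * (k01 + k00) * (k11 + k01) * (k10 + k00)
    if denom == 0:
        return 0.0  # a constant row or column carries no evidence of dependence
    return n * (k11 * k00 - k10 * k01) ** 2 / denom


def rank_terms(docs, labels, target):
    """docs: list of term sets. Returns terms sorted by decreasing chi^2
    (ties broken alphabetically) together with the score table."""
    vocab = set().union(*docs)
    scores = {}
    for t in vocab:
        k = [[0, 0], [0, 0]]
        for d, y in zip(docs, labels):
            k[int(y == target)][int(t in d)] += 1
        scores[t] = chi_square(k[0][0], k[0][1], k[1][0], k[1][1])
    ranked = sorted(vocab, key=lambda t: (-scores[t], t))
    return ranked, scores


# Hypothetical toy corpus.
docs = [{"ball", "the"}, {"ball", "goal", "the"}, {"vote", "the"}, {"vote", "party", "the"}]
labels = ["sports", "sports", "politics", "politics"]
ranked, scores = rank_terms(docs, labels, "sports")
print(ranked)
```

The term "the" occurs in every document, so its score is zero and it lands at the bottom of the ranking, which is the behavior one would hope for from a stop word.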
Mutual information • Useful when the multinomial document model is used • X and Y are discrete random variables taking values x, y • The mutual information (MI) between them is defined as $MI(X, Y) = \sum_x \sum_y \Pr(x, y) \log \frac{\Pr(x, y)}{\Pr(x)\Pr(y)}$ • Measure of the extent of dependence between random variables • Extent to which the joint deviates from the product of the marginals • Weighted with the distribution mass at (x, y)
Mutual Information • Advantages • To the extent MI(X, Y) is large, X and Y are dependent • Deviations from independence at rare values of (x, y) are played down • Interpretations • Reduction in the entropy of Y given X • MI(X, Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) • KL distance between the no-independence hypothesis (the joint distribution) and the independence hypothesis (the product of the marginals) • KL distance gives the average number of bits wasted by encoding events from the "correct" distribution using a code based on a not-quite-right distribution
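The "bits wasted" reading of the KL distance can be checked numerically. The sketch below assumes two discrete distributions given as aligned lists of probabilities and uses base-2 logarithms so that the result is in bits; the function name is hypothetical.

```python
import math


def kl_distance(p, q):
    """KL(p || q) in bits: extra code length paid, on average, when events
    drawn from p are encoded with a code that is optimal for q."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


same = kl_distance([0.5, 0.5], [0.5, 0.5])
certain_vs_coin = kl_distance([1.0, 0.0], [0.5, 0.5])
print(same, certain_vs_coin)
```

An always-heads source encoded with a fair-coin code wastes one bit per event, while matching distributions waste nothing. Note that the measure is not symmetric in p and q.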
Feature selection with MI • Fix a term t and let I_t be an event associated with that term • E.g.: for the binary model, i_t = 0/1 • Pr(i_t) = the empirical fraction of documents in the training set in which event i_t occurred • Pr(i_t, c) = the empirical fraction of training documents that are in class c and for which I_t = i_t • Pr(c) = the fraction of training documents belonging to class c • Formula: $MI(I_t, C) = \sum_{i_t, c} \Pr(i_t, c) \log \frac{\Pr(i_t, c)}{\Pr(i_t)\Pr(c)}$ • Problem: document lengths are not normalized
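For the binary model with one class against the rest, the MI score of a term can be computed directly from a 2 × 2 table of document counts. This is a sketch under those assumptions; the function name and the table layout k[l][m] (class indicator l, term indicator m) are illustrative choices.

```python
import math


def mutual_information(k):
    """MI in bits between the class indicator and the term indicator,
    where k[l][m] is the number of training documents with C = l, I_t = m."""
    n = sum(sum(row) for row in k)
    mi = 0.0
    for l in (0, 1):
        for m in (0, 1):
            if k[l][m] == 0:
                continue  # the limit of p log p as p -> 0 is 0
            p_joint = k[l][m] / n
            p_class = (k[l][0] + k[l][1]) / n
            p_term = (k[0][m] + k[1][m]) / n
            mi += p_joint * math.log2(p_joint / (p_class * p_term))
    return mi


perfect = mutual_information([[2, 0], [0, 2]])      # term occurs exactly in the class
independent = mutual_information([[1, 1], [1, 1]])  # term tells nothing about the class
print(perfect, independent)
```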
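Fisher's discrimination index • Useful when documents are scaled to constant length • Term occurrences are regarded as fractional real numbers • E.g.: two-class case • Let X and Y be the sets of length-normalized document vectors corresponding to the two classes • Let $\mu_X = \frac{1}{|X|}\sum_{x \in X} x$ and $\mu_Y = \frac{1}{|Y|}\sum_{y \in Y} y$ be the centroids for each class • Let the covariance matrices be $S_X = \frac{1}{|X|}\sum_{x \in X}(x - \mu_X)(x - \mu_X)^T$ and $S_Y = \frac{1}{|Y|}\sum_{y \in Y}(y - \mu_Y)(y - \mu_Y)^T$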
Fisher's discrimination index (contd.) • Goal: find a projection of the data sets X and Y onto a line such that • the two projected centroids are far apart compared to the spread of the point sets projected onto the same line • Find a column vector α such that • the ratio of • the square of the difference in mean vectors projected onto it • and the average projected variance (up to a constant factor), $J(\alpha) = \frac{\left(\alpha^T(\mu_X - \mu_Y)\right)^2}{\alpha^T (S_X + S_Y)\,\alpha}$, • is maximized • This gives $\alpha = S^{-1}(\mu_X - \mu_Y)$, where $S = (S_X + S_Y)/2$
Fisher's discrimination index • Formula: $\alpha = S^{-1}(\mu_X - \mu_Y)$, with $S = (S_X + S_Y)/2$ • Let X and Y, for both the training and test data, be generated from multivariate Gaussian distributions • Let $S_X = S_Y$ • Then this value of α induces the optimal (minimum error) classifier by suitable thresholding on $\alpha^T q$ for a test point q • Problems • Inverting S would be unacceptably slow for tens of thousands of dimensions • Linear transformations would destroy already existing sparsity
Solution • Recall: • The goal was to eliminate terms from consideration • Not to arrive at linear projections involving multiple terms • Regard each term t as providing a candidate direction α_t that is parallel to the corresponding axis in the vector space model • Compute the Fisher's index of α_t
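FI: Solution (contd.) • Formula • For the two-class case: $FI(t) = \frac{(\mu_{X,t} - \mu_{Y,t})^2}{\frac{1}{|X|}\sum_{x \in X}(x_t - \mu_{X,t})^2 + \frac{1}{|Y|}\sum_{y \in Y}(y_t - \mu_{Y,t})^2}$ • Can be generalized to a set {c} of more than two classes: $FI(t) = \frac{\sum_{c_1, c_2}(\mu_{c_1,t} - \mu_{c_2,t})^2}{\sum_c \frac{1}{|D_c|}\sum_{d \in D_c}(x_{d,t} - \mu_{c,t})^2}$, where D_c is the set of training documents in class c • Feature selection • Terms are sorted in decreasing order of FI(t) • The best ones are chosen as features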
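A per-term Fisher index for the two-class case needs only per-class means and variances of one coordinate, so no matrix inversion is involved. The sketch below assumes dense, length-normalized vectors stored as Python lists; the function name, the handling of a zero denominator, and the toy vectors are assumptions made for illustration.

```python
def fisher_index(X, Y, t):
    """Two-class Fisher index of term t.
    X, Y: lists of length-normalized document vectors (lists of floats)."""
    mu_x = sum(x[t] for x in X) / len(X)
    mu_y = sum(y[t] for y in Y) / len(Y)
    var_x = sum((x[t] - mu_x) ** 2 for x in X) / len(X)
    var_y = sum((y[t] - mu_y) ** 2 for y in Y) / len(Y)
    denom = var_x + var_y
    if denom == 0:
        # No spread at all: perfectly separating if the means differ.
        return float("inf") if mu_x != mu_y else 0.0
    return (mu_x - mu_y) ** 2 / denom


# Hypothetical toy data: three terms, each row sums to 1.
X = [[0.7, 0.1, 0.2], [0.5, 0.1, 0.4]]
Y = [[0.1, 0.7, 0.2], [0.1, 0.5, 0.4]]
ranking = sorted(range(3), key=lambda t: fisher_index(X, Y, t), reverse=True)
print([fisher_index(X, Y, t) for t in range(3)], ranking)
```

Term 2 has the same mean in both classes, so its index is zero and it would be the first to be dropped.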
Validation • How to decide a cut-off rank? • Validation approach • A portion of the training documents is held out • The rest is used to do term ranking • The held-out set is used as a test set • Various cut-off ranks can be tested using the same held-out set • Alternatives: leave-one-out cross-validation, or partitioning the data into two • An aggregate accuracy is computed over all trials • Wrapper: search for the number of features, taken in decreasing order of discriminative power, that yields the highest accuracy
Validation (contd.) • Simple search heuristic • Keep adding one feature at every step until the classifier's accuracy ceases to improve • Figure: A general illustration of wrapping for feature selection.
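The simple search heuristic can be sketched as a wrapper around any train-and-evaluate routine. Here `evaluate` is a hypothetical callback that trains a classifier on the given features and returns held-out accuracy; the `patience` parameter and the fake accuracy table are assumptions added for illustration.

```python
def choose_cutoff(ranked_terms, evaluate, patience=1):
    """Add ranked features one at a time and stop once held-out accuracy
    has failed to improve for `patience` consecutive steps."""
    best_k, best_acc, stale = 0, float("-inf"), 0
    for k in range(1, len(ranked_terms) + 1):
        acc = evaluate(ranked_terms[:k])
        if acc > best_acc:
            best_k, best_acc, stale = k, acc, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return ranked_terms[:best_k], best_acc


# Stand-in for "train on these features, test on the held-out set".
fake_accuracy = {1: 0.6, 2: 0.7, 3: 0.7, 4: 0.9}


def evaluate(features):
    return fake_accuracy[len(features)]


terms = ["ball", "vote", "goal", "party"]
greedy_choice = choose_cutoff(terms, evaluate, patience=1)
patient_choice = choose_cutoff(terms, evaluate, patience=2)
print(greedy_choice, patient_choice)
```

With `patience=1` the search stops at the plateau and never sees the better cut-off at four features, which shows why the heuristic is cheap but not guaranteed to find the best rank.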
Validation (contd.) • For naive Bayes-like classifiers • Evaluation on many choices of feature sets can be done at once • For maximum entropy classifiers and support vector machines • Essentially involves training a classifier from scratch for each choice of the cut-off rank • Therefore inefficient
Validation: observations • Bayesian classifiers cannot overfit much • Figure: Effect of feature selection on Bayesian classifiers
Truncation algorithms • Start from the complete set of terms T • Keep selecting terms to drop until you end up with a feature subset • Question: when should you stop truncation? • Two objectives • Minimize the size of the selected feature set F • Keep the distorted distribution Pr(C|F) as similar as possible to the original Pr(C|T)
Truncation Algorithms: Example • Kullback-Leibler (KL) distance • Measures similarity or distance between two distributions • Markov blanket • Let X be a feature in T, and let M ⊆ T \ {X} • The presence of M renders the presence of X unnecessary as a feature => M is a Markov blanket for X • Technically • M is called a Markov blanket for X ∈ T if X is conditionally independent of (T ∪ C) \ (M ∪ {X}) given M • Eliminating a variable because it has a Markov blanket contained in other existing features does not increase the KL distance between Pr(C|T) and Pr(C|F)
Finding Markov Blankets • Exact Markov blankets are usually absent in practice • Finding approximate Markov blankets • Purpose: to cut down computational complexity • Restrict the search for Markov blankets M to those with at most k features • Given a feature X, restrict the search for the members of M to those features that are most strongly correlated with X (using tests similar to the χ² or MI tests) • Example: for the Reuters dataset, over two-thirds of T could be discarded while increasing classification accuracy
Feature Truncation algorithm • while truncated Pr(C|F) is reasonably close to original Pr(C|T) do • for each remaining feature X do • Identify a candidate Markov blanket M: • For some tuned constant k, find the set M of k variables in F \ X that are most strongly correlated with X • Estimate how good a blanket M is • end for • Eliminate the feature having the best surviving Markov blanket • end while
General observations on feature selection • The issue of document length should be addressed properly • The choice of association measure does not make a dramatic difference • Greedy inclusion algorithms scale nearly linearly with the number of features • The Markov blanket technique takes time proportional to at least |T|^k • Advantage of the Markov blanket algorithm over greedy inclusion • The greedy algorithm may include features with high individual correlations even though one subsumes the other • Features that are individually uncorrelated could be jointly more correlated with the class • This rarely happens • The binary feature selection view may not be the only view to subscribe to • Suggestion: combine features into fewer, simpler ones • E.g.: project the document vectors to a lower dimensional space
Bayesian Learner • Very practical text classifier • Assumptions • A document can belong to exactly one of a set of classes or topics • Each class c has an associated prior probability Pr(c) • There is a class-conditional document distribution Pr(d|c) for each class • Posterior probability • Obtained using Bayes' rule: $\Pr(c|d) = \frac{\Pr(c)\Pr(d|c)}{\sum_{\gamma}\Pr(\gamma)\Pr(d|\gamma)}$ • The parameter set Θ consists of all Pr(d|c)
Parameter Estimation for Bayesian Learner • The estimate of Θ is based on two sources of information: • Prior knowledge on the parameter set before seeing any training documents • Terms in the training documents D • Bayes optimal classifier • Takes the expectation over Pr(Θ|D): $\Pr(c|d) = \sum_{\Theta}\Pr(c|d,\Theta)\Pr(\Theta|D)$ • Computationally infeasible • Maximum likelihood estimate • Replace the sum above with the value of the summand Pr(c|d, Θ) for $\arg\max_{\Theta}\Pr(D|\Theta)$ • Works poorly
Naïve Bayes Classifier • Naïve • Assumption of independence between terms • The joint term distribution is the product of the marginals • Widely used owing to • simplicity and speed of training, applying, and updating • Two kinds of widely used marginals for text • Binary model • Multinomial model
Naïve Bayes Models • Binary model • Each parameter φ_{c,t} indicates the probability that a document in class c will mention term t at least once • $\Pr(d|c) = \prod_{t \in d}\phi_{c,t}\prod_{t \in W,\, t \notin d}(1 - \phi_{c,t})$ • Multinomial model • Each class has an associated die with |W| faces • Each parameter θ_{c,t} denotes the probability of face t turning up on tossing the die • Term t occurs n(d, t) times in document d • Document length is a random variable denoted L • $\Pr(d|c) = \Pr(L = \ell_d \mid c)\binom{\ell_d}{\{n(d,t)\}}\prod_{t \in d}\theta_{c,t}^{\,n(d,t)}$
Analysis of Naïve Bayes Models • Multiplies together a large number of small probabilities • Result: extremely tiny probabilities as answers • Solution: store all numbers as logarithms • The class that comes out at the top wins by a huge margin • Sanitizing scores using the likelihood ratio • Also called the logit function
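A compact sketch of a multinomial naive Bayes classifier that keeps every quantity as a logarithm. Several choices here are assumptions for illustration rather than anything stated on the slides: add-one smoothing of the term probabilities, document length treated as independent of the class (so the length and multinomial-coefficient factors cancel), out-of-vocabulary terms ignored, and all function names and toy data.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(docs, labels):
    """docs: list of token lists. Returns log priors and log term probabilities."""
    vocab = {t for d in docs for t in d}
    class_sizes = Counter(labels)
    term_counts = defaultdict(Counter)
    for d, c in zip(docs, labels):
        term_counts[c].update(d)
    log_prior = {c: math.log(n / len(docs)) for c, n in class_sizes.items()}
    log_theta = {}
    for c in class_sizes:
        total = sum(term_counts[c].values())
        # Assumed add-one smoothing so that unseen terms never give log(0).
        log_theta[c] = {
            t: math.log((term_counts[c][t] + 1) / (total + len(vocab))) for t in vocab
        }
    return log_prior, log_theta


def log_scores(doc, log_prior, log_theta):
    """Unnormalized log Pr(c) + sum over tokens of log theta_{c,t}."""
    return {
        c: log_prior[c] + sum(log_theta[c][t] for t in doc if t in log_theta[c])
        for c in log_prior
    }


def posteriors(scores):
    """Normalize log scores without underflow by subtracting the maximum first."""
    top = max(scores.values())
    expd = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(expd.values())
    return {c: v / z for c, v in expd.items()}


# Hypothetical toy corpus.
docs = [["ball", "goal", "ball"], ["goal", "team"], ["vote", "party"], ["party", "election", "vote"]]
labels = ["sports", "sports", "politics", "politics"]
log_prior, log_theta = train_multinomial_nb(docs, labels)
post = posteriors(log_scores(["goal", "ball", "vote"], log_prior, log_theta))
print(post)
```

On realistic document lengths the posteriors produced this way are usually very close to 0 or 1, which is the "wins by a huge margin" effect noted above.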
Parameter smoothing • What if a test document contains a term t that never occurred in any training document in class c? • Answer: the maximum likelihood estimate of the probability of t in class c, and hence Pr(c|d), will be zero • Even if many other terms clearly hint at a high likelihood of class c generating the document • Bayesian estimation • Estimating probability from insufficient data • If you toss a coin n times and it always comes up heads, what is the probability that the (n + 1)th toss will also come up heads? • Posit a prior distribution on θ, called π(θ) • E.g.: the uniform distribution • Resultant posterior distribution after observing k heads in n tosses: $\pi(\theta \mid \langle k, n\rangle) = \frac{\theta^k(1-\theta)^{n-k}}{\int_0^1 p^k(1-p)^{n-k}\,dp}$
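For the coin-toss question, a uniform prior leads to the well-known estimate (k + 1)/(n + 2), the mean of the posterior over θ. The sketch below uses exact fractions; the second function applies the same add-one idea to term counts in a class, which is an assumed generalization to |W| outcomes rather than something spelled out on the slide. Function names are hypothetical.

```python
from fractions import Fraction


def laplace_estimate(k, n):
    """Posterior mean of theta under a uniform prior after k heads in n tosses."""
    return Fraction(k + 1, n + 2)


def smoothed_term_probability(count_t, total_count, vocab_size):
    """Add-one estimate of the probability of term t in a class (assumed
    generalization of the coin case to vocab_size outcomes)."""
    return Fraction(count_t + 1, total_count + vocab_size)


always_heads = laplace_estimate(5, 5)              # 5 tosses, all heads
no_data = laplace_estimate(0, 0)                   # nothing observed yet
unseen_term = smoothed_term_probability(0, 5, 6)   # term never seen in the class
print(always_heads, no_data, unseen_term)
```

The estimate never reaches 0 or 1 from finite data, so a single unseen term can no longer veto a class.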